Encyclopedia of Physical Science and Technology - Mathematics

Abstract algebra studies algebraic structures like groups, rings, and fields by focusing on their operational properties rather than specific sets of numbers. It examines classes of systems that satisfy given properties like associativity. Two systems are considered isomorphic if there is a one-to-one correspondence between their elements that preserves relations and operations. Binary relations represent relationships between elements of sets through ordered pairs. Composition of relations combines relations step-by-step. Identity relations link each element to itself.


Encyclopedia of Physical Science and Technology EN001C.19 May 26, 2001 14:19

Algebra, Abstract
Ki Hang Kim
Fred W. Roush
Alabama State University

I. Sets and Relations

II. Semigroups
III. Groups
IV. Vector Spaces
V. Rings
VI. Fields
VII. Other Algebraic Structures

GLOSSARY

Binary relation: Set of ordered pairs (a, b), where a, b belong to given sets A, B; as the set of ordered pairs of real numbers (x, y) such that x > y.

Equivalence relation: Binary relation on a set S, often written x ∼ y if (x, y) is in the relation, such that (1) x ∼ x; (2) if x ∼ y then y ∼ x; and (3) if x ∼ y and y ∼ z then x ∼ z, for all x, y, z in S.

Field: Structure having operations of commutative, associative, distributive addition and multiplication with additive and multiplicative identities and inverses of nonzero elements, as the real or rational numbers.

Function: Binary relation from a set A to a set B that to every a in A associates a unique b in B such that (a, b) is in the (binary) relation, as a polynomial associates its value to a given x.

Group: Structure having one associative operation with identity and inverses.

Homomorphism: Function f from one structure S to another structure T such that, for every operation ∗ and all x, y in S, f(x ∗ y) = f(x) ∗ f(y).

Ideal: Subset of a ring containing the sum and difference of any two elements of the subset, and the product of any element of the subset with any element of the ring.

Identity: Element e such that, for all x for a given operation ∗, e ∗ x = x ∗ e = x.

Incline: Structure having two associative and commutative operations + and ×, satisfying the distributive law and (1) x + x = x and (2) x + (x × y) = x.

Isomorphism: Homomorphism f from a set S to a set T such that the inverse mapping g from T to S, where g(y) = x if and only if f(x) = y, is also a homomorphism.

Partial order: Binary relation R on a set S such that for all x, y, z in S (1) (x, x) is in R; (2) if (x, y) and (y, x) are both in R, then x = y; and (3) if (x, y) and (y, z) are in R, so is (x, z).

Quadratic form: A function defined on a ring by Σ a_ij x_i x_j.


Quotient group: For a normal subgroup N of a group G, the group formed by equivalence classes x̄ under the relation that x ∼ y if x y^−1 is in N, with multiplication given by taking x̄ ȳ to be the class of xy.

Ring: Structure having one associative and commutative operation with identity and inverses, and a second operation associative and distributive with the first.

Semigroup: Structure having an associative operation.

Simple: Structure S such that, for every homomorphism f to another structure, either f(x) = f(y) for all x, y in S or f(x) ≠ f(y) for all x ≠ y in S.

Subgroup: Subset S of a group G containing the product and inverse of any of its elements. It is called normal if, for every g in G, s in S, the element g s g^−1 is in S.

ABSTRACT ALGEBRA is the study of laws of operations and relations such as those valid for the real numbers and similar laws for new systems. It studies the class of all possible systems having given laws, for example, associativity. There is no standard object that it studies such as the real numbers in analysis. Two systems are isomorphic in abstract algebra if there is a one-to-one (1–1 for short) correspondence between their elements such that relations and operations agree in the two systems.

I. SETS AND RELATIONS

A. Sets

A set is any precisely defined collection of objects. The objects in a set are called elements (or members) of the set. The set A = {1, 2, x, y} has elements 1, 2, x, y and we write 1 ∈ A, 2 ∈ A, x ∈ A, and y ∈ A. There is no set of all sets, but given any set we can obtain other sets by operations listed below, and for any set A and property P(x) we have a set {x ∈ A : P(x)} of all elements of A having the property. Things such as “all sets” are called classes rather than sets. There is a set of all real numbers and one of all isosceles triangles in three-dimensional space.

We say that A is a subset of B if all elements of A are in B. This is written A ⊂ B, or B ⊃ A. The set is called a proper subset if A ≠ B; otherwise it is called an improper subset. For finite sets, a subset is proper if and only if it has strictly fewer elements. The empty set ∅ is the set with no elements.

If A ⊂ B and B ⊂ A, then A = B. The union ∪ A of a family ℱ of sets is the set whose elements are all things that belong to at least one set A ∈ ℱ. The intersection ∩ A is the set of all elements lying in every set A in ℱ. The power set P(A) is the set of all subsets of A. The relative complement B\A (or B − A) is the set of all elements in B but not in A. For fixed set B ⊃ A, this is called the complement of A in B.

The Cartesian product A_1 × A_2 × ··· × A_n of n sets A_1, A_2, ..., A_n is the set of all n-tuples (a_1, a_2, ..., a_n) such that a_i ∈ A_i for all i = 1, 2, ..., n. Infinite Cartesian products can be similarly defined with i ranging over any index set I.

B. Binary Relations

The truth or falsity of any mathematical statement about a relationship x R y, for instance, x^2 + y^2 = 1, can be determined from the set of ordered pairs (x, y) such that x R y. In fact, R is equivalent to the relationship that (x, y) ∈ {(x, y) ∈ A × B : x R y}, where A and B are sets containing x and y. The set of ordered pairs then represents the relationship and is called a binary relation. In general, a binary relation from A to B is a subset of A × B, that is, a set of ordered pairs.

The union, intersection, and complement of binary relations from A to B are defined on them as subsets of A × B. Another operation is composition. If R is a binary relation from A to B and S is a binary relation from B to C, then R ∘ S is {(x, z) ∈ A × C : (x, y) ∈ R and (y, z) ∈ S for some y ∈ B}. Composition is distributive over union. For R, R_1, S, S_1, T, from A to B, A to B, B to C, B to C, C to D, we have (R ∪ R_1) ∘ S = (R ∘ S) ∪ (R_1 ∘ S) and R ∘ (S ∪ S_1) = (R ∘ S) ∪ (R ∘ S_1). It is associative; we have (x, w) ∈ (R ∘ S) ∘ T if and only if for some y ∈ B and z ∈ C, (x, y) ∈ R, (y, z) ∈ S, and (z, w) ∈ T. The same condition is obtained for R ∘ (S ∘ T).

The identity relation 1_A is {(a, a) : a ∈ A}. This relation acts as an identity on either side for R ⊂ A × B, S ⊂ B × C; that is, R ∘ 1_B = R, and 1_B ∘ S = S.

The transpose R^T of a binary relation R from A to B is {(b, a) ∈ B × A : (a, b) ∈ R}. It is also called converse and inverse.

C. Functions

A partial function from a set A to a set B is a relation f from A to B such that if (x, y) ∈ f, (x, z) ∈ f then y = z. If (x, y) ∈ f, we write y = f(x) since this condition means y is unique. A partial function is a function defined on a subset of a set considered. Its domain is {a ∈ A : (a, b) ∈ f for some b ∈ B}.

A function is a partial function such that for all x ∈ A there exists y ∈ B with (x, y) ∈ f.

A function is 1–1 if and only if whenever (x, z) ∈ f, (y, z) ∈ f we have x = y. This is the transpose of the definition of a partial function. Functions and partial functions are thought of in several ways. A function may be

considered the output of a process in which x is an input, as x^2 + x + 1 adds a number to its square and 1. We may consider it as assigning an element of B to any element of A. Or we may consider it to be a map of A into the set B in which the point x ∈ A is represented by f(x) ∈ B. Then f(x) is called the image of x. Also for a subset C ⊂ A the image f(C) is {f(x) : x ∈ C}, the set of images of points of C. We may also consider it to be a transformation. Strictly, however, the term transformation is sometimes reserved for functions from a set to itself.

A 1–1 function sends distinct values of A to distinct points. For a 1–1 function on a finite set C, f(C) will have exactly as many elements as C.

The composition of functions f : A → B and g : B → C is given by g(f(x)). It is written g ∘ f. Composition is associative since it is a special case of composition of binary relations except that the order is reversed. The identity relation on a set is also called the identity function. A composition of partial (1–1) functions is respectively a partial (1–1) function.

A function f : A → B is onto if f(A) = B, that is, if every y ∈ B is the image of some x. A composition of onto functions is onto. A 1–1 onto function is called a 1–1 correspondence. For a 1–1 onto function f, the inverse (converse) of f is defined by f^−1 = {(y, x) : (x, y) ∈ f} or x = f^−1(y) if and only if y = f(x). It is characterized by f ∘ f^−1 = 1_B, f^−1 ∘ f = 1_A. Moreover, for any binary relations R, S if R ∘ S = 1_B, S ∘ R = 1_A both R and S must be 1–1 correspondences and S must be R^T. A 1–1 correspondence from a finite set to itself is called a permutation.

D. Order Relations

That a binary relation R from a set A to itself is (OR-1) reflexive means (x, x) ∈ R for all x ∈ A; (OR-2) symmetric means if (x, y) ∈ R, then (y, x) ∈ R; and (OR-3) transitive means if (x, y) ∈ R and (y, z) ∈ R, then (x, z) ∈ R. Frequently x R y is written for (x, y) ∈ R, so that transitivity means if x R y and y R z, then x R z. For instance, if x ≥ y and y ≥ z, then x ≥ z. So x ≥ y is transitive, but x ≠ y is not transitive.

The following relations are reflexive, symmetric, and transitive on the set of geometric figures: x = y (same point set), x ≅ y (congruence), x ∼ y (similarity), x has the same area as y. A reflexive, symmetric, transitive relation is called an equivalence relation. For any function f from A to B the relation {(x, y) ∈ A × A : f(x) = f(y)} is an equivalence relation.

The set x̄ = {y ∈ A : x R y} is called the equivalence class of x (the set of all elements equivalent to the same element x). The set of equivalence classes x̄ is called the quotient set A/R. The function f(x) = x̄ is called the quotient map A → A/R. The relation {(x, y) ∈ A × A : f(x) = f(y)} is precisely R.

Any two equivalence classes are disjoint or equal: If x R z and y R z, by symmetry z R y and by transitivity x R y. Any element belongs in the equivalence class of itself. A family ℱ of nonempty subsets of a set S is called a partition if (1) whenever C ≠ D in ℱ, we have C ∩ D = ∅; (2) ∪ C = S. Therefore, the set of equivalence classes always forms a partition. Conversely, every partition arises from the equivalence relation {(x, y) ∈ S × S : for some C ∈ ℱ, x ∈ C and y ∈ C}.

An important equivalence is congruence modulo m of integers. We say x ≡ y (mod m) for integers x, y, m if there exists an integer h such that x − y = hm, that is, if m divides x − y. If x − y = hm and y − z = km, then x − z = x − y + y − z = hm + km. So if x ≡ y (mod m) and y ≡ z (mod m), then x ≡ z (mod m). This proves transitivity.

A relation that is reflexive and transitive is called a quasiorder. If it also satisfies (OR-4) antisymmetry, if (x, y) ∈ R and (y, x) ∈ R then x = y, it is called a partial order. If a partial order satisfies (OR-5) completeness, for all x, y ∈ A either (x, y) ∈ R or (y, x) ∈ R, it is called a total order, or a linear order.

For every total order on a finite set A, the elements of A can be labeled a_1, a_2, ..., a_n in such a way that a_i R a_j if i < j. An isomorphism between a binary relation R_1 on a set A_1 and R_2 on A_2 is a 1–1 correspondence f : A_1 → A_2 such that (x, y) ∈ R_1 if and only if (f(x), f(y)) ∈ R_2. Therefore, every total order on a finite set is isomorphic to the standard order on {1, 2, ..., n}.

There are many nonisomorphic infinite total orders on the same set and many nonisomorphic partial orders on finite sets of n elements. The structure of quasiorders can be reduced to that of partial orders. For every quasiorder R on any set A, the relation {(x, y) : (x, y) ∈ R and (y, x) ∈ R} is an equivalence relation. The quasiorder gives a partial order R_1 on the set of equivalence classes of this relation, and (x, y) ∈ R if and only if (x̄, ȳ) ∈ R_1.

The structure of partial orders on small sets can be described by diagrams known as Hasse diagrams. An element x of a partial order is called minimal (maximal) if for no y ≠ x does (y, x) ∈ R ((x, y) ∈ R), where (x, y) ∈ R is taken as x ≤ y in the order. Every partial order on a finite set has at least one minimal and at least one maximal element. Represent all minimal elements of the partial order as points at the same horizontal level at the bottom of the diagram. From then on, the ith level consists of elements z not in previous levels but such that for at least one y on

a previous level y < z and for no x is y < x < z. That is, z is the very next element greater than y. For all such z, y draw a line segment from z to y. Figure 1 gives the Hasse diagrams of three partially ordered sets (posets for short).

FIGURE 1 Hasse diagrams of partially ordered sets.

A poset is called a lattice if every pair of elements in it has a least upper bound and a greatest lower bound. An upper (lower) bound on a set S is an element x such that x ≥ y (x ≤ y) for all y ∈ S. Every linearly ordered set is a lattice, and so is any family of subsets of a set containing the union and intersection of any two of its members. These lattices are distributive: If ∧, ∨ denote greatest lower bound, least upper bound, respectively, then x ∧ (y ∨ z) = (x ∧ y) ∨ (x ∧ z) and x ∨ (y ∧ z) = (x ∨ y) ∧ (x ∨ z). A lattice is called modular if these laws hold whenever two of x, y, z are comparable [two elements a, b are comparable for R if (a, b) ∈ R or (b, a) ∈ R]. There exist nonmodular lattices, as the last diagram of Fig. 1.

A binary relation R is called a strict partial order if it is transitive and (OR-6) irreflexive, for no x does (x, x) ∈ R. There is a 1–1 correspondence between partial orders R (as x ≤ y) on A and strict partial orders (x < y) obtained as R\{(a, a) : a ∈ A}. An irreflexive binary relation R on A is called a semiorder if for all x, y, z, w ∈ A, (1) if (x, y) ∈ R and (z, w) ∈ R, then either (x, w) ∈ R or (z, y) ∈ R; (2) if (x, y) ∈ R and (y, z) ∈ R, then either (w, z) ∈ R or (x, w) ∈ R. Semiorders can always be represented as {(x, y) : f(x) > f(y) + δ} for some real valued function f and number δ, and are strict partial orders. Figure 2 shows the relationship of many of the relations taken up here.

FIGURE 2 Classification of binary operations.

E. Boolean Matrices and Graphs

The two-element Boolean algebra has elements {0, 1} and operations +, ·, c given by 0 + 0 = 0, 0 + 1 = 1 + 0 = 1 + 1 = 1, 0 · 0 = 1 · 0 = 0 · 1 = 0, 1 · 1 = 1, 0^c = 1, 1^c = 0. There are many interpretations and uses of this Boolean algebra: (1) propositional logic where 0 is false, 1 is true, + is “or,” · is “and,” c is “not”; (2) switching circuits where 0 means no current flows, 1 means current flows; (3) the n-fold Cartesian product of the algebra with itself can be taken as the set of subsets of an n-element set {y_i}, where (x_1, x_2, ..., x_n) corresponds to the subset {y_i : x_i = 1}; and (4) 0 means zero in R, 1 means some positive number in R, where R denotes the set of all real numbers. The algebra satisfies the same laws as the algebra of sets under ∪, ∩, ∼, where ∼ denotes complementation, since the latter is a Boolean algebra.

An n × m Boolean matrix A = (a_ij) is an n × m rectangle of elements of the two-element Boolean algebra. The entry in row (horizontal) i and column (vertical) j is denoted a_ij. To every binary relation R from a set X = {x_1, x_2, ..., x_n} to a set Y = {y_1, y_2, ..., y_m} we can assign a Boolean matrix M = (m_ij), where m_ij is 1 or 0 according to whether (x_i, y_j) does or does not lie in R. This gives a 1–1 onto correspondence from binary relations from X to Y to n × m Boolean matrices. Many questions can be dealt with simply in the Boolean matrix form. Boolean matrices are multiplied and added by the same formulas as for ordinary matrices, except that operations are Boolean. Composition (union) of binary relations corresponds to product (sum) of Boolean matrices.

A directed graph (digraph for short) consists of a set V of elements called vertices (represented by points) and a set E of ordered pairs of V known as edges. Each ordered pair (x, y) is represented by a segment with an arrow going from point x to point y. Thus digraphs are essentially the same as binary relations from a set to itself. Any n × n (n-square) Boolean matrix or binary relation on a labeled set can be represented as a digraph, and conversely. Figure 3 shows the Boolean matrix and graph of a binary relation on {1, 2, 3, 4}.

F. Blockmodels

Often binary relations are empirically obtained. Such binary relations can frequently be simplified by blocking the Boolean matrices: dividing the set of indices into disjoint subsets, relabeling to get members of the same subset adjacent, and dividing the matrix into blocks. Each nonzero block is replaced by a single 1 entry and each zero block by a single 0 entry. Many techniques called clustering exist for dividing the total set into subsets. One (CONCOR) is to take iterated correlation matrices of the rows.
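The correspondence between binary relations and Boolean matrices, and the blocking step described above, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the relations R and S, and the chosen blocks are hypothetical examples, not taken from the article, and the index subsets are supplied by hand rather than found by a clustering method such as CONCOR.

```python
def relation_to_matrix(rel, xs, ys):
    """Boolean matrix M with M[i][j] = 1 exactly when (xs[i], ys[j]) is in rel."""
    return [[1 if (x, y) in rel else 0 for y in ys] for x in xs]


def bool_product(a, b):
    """Boolean matrix product: the usual formula with + read as 'or', . as 'and'."""
    return [[int(any(a[i][k] and b[k][j] for k in range(len(b))))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def compose(r, s):
    """R o S = {(x, z) : (x, y) in R and (y, z) in S for some y}."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def block_image(m, blocks):
    """Replace each block of m by 1 if it contains a nonzero entry, else by 0.

    blocks is a list of disjoint lists of indices covering range(len(m)).
    """
    return [[int(any(m[i][j] for i in bi for j in bj)) for bj in blocks]
            for bi in blocks]


# Hypothetical example data on the labeled set {1, 2, 3, 4}.
V = [1, 2, 3, 4]
R = {(1, 2), (2, 1), (1, 3), (2, 4)}
S = {(2, 3), (3, 4), (4, 3)}

MR = relation_to_matrix(R, V, V)
MS = relation_to_matrix(S, V, V)

# Composition of relations corresponds to the Boolean matrix product.
composition_matches = relation_to_matrix(compose(R, S), V, V) == bool_product(MR, MS)

# Block the indices as {1, 2} and {3, 4} (positions 0-1 and 2-3).
blocked = block_image(MR, [[0, 1], [2, 3]])
```

Here `blocked` is the 2 × 2 image matrix of `MR`; representing a relation as a set of pairs keeps `compose` a direct transcription of the definition of R ∘ S.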

FIGURE 3 Matrix and graph of a relation.

Provided that every nonzero block has at least one 1 in each row, the replacement of blocks by single entries will preserve all Boolean sums and products.

G. General Relational Structures

A general finite relational structure on a set S is an indexed family R_α of subsets of S ∪ (S × S) ∪ (S × S × S) ∪ ···. Such structures include order structures and operational structures such as multiplicative ones as subsets {(x, y, z) ∈ S × S × S : x ∗ y = z}. A homomorphism of relational structures (indexed the same way) R_α on S_1 to T_α on S_2 consists of a function f : S_1 → S_2 such that if g is the mapping S_1 ∪ (S_1 × S_1) ∪ (S_1 × S_1 × S_1) ∪ ··· to S_2 ∪ (S_2 × S_2) ∪ (S_2 × S_2 × S_2) ∪ ··· which is f on each coordinate, then g(R_α) ⊂ T_α for each α. An isomorphism of relational structures is a 1–1 onto homomorphism such that g(R_α) = T_α for each α. The quotient structure of R_α on S associated with an equivalence relation E on S is the structure T_α = g(R_α), for f the mapping S → S/E assigning to each element its equivalence class.

H. Arithmetic of Residue Classes

Let Z denote the set of all integers. Let E_m be the equivalence relation {(x, y) ∈ Z × Z : x − y = km for some k ∈ Z}. This is denoted x ≡ y (mod m). We have previously noted that it is an equivalence relation.

It divides the integers into exactly m equivalence classes, for m ≠ 0. For m = 3, the classes are 0̄ = {..., −9, −6, −3, 0, 3, 6, 9, ...}, 1̄ = {..., −8, −5, −2, 1, 4, 7, 10, ...}, 2̄ = {..., −7, −4, −1, 2, 5, 8, 11, ...}. Any two members of the same class are equivalent (3 divides their difference).

This relation has the property that if x ≡ y (mod m) then, for any z ∈ Z, x + z ≡ y + z, since m divides x + z − (y + z) = x − y, and xz ≡ yz. Such a relation in general is called a congruence. For any congruence, we can define operations on the classes by taking x̄ + ȳ to be the class of x + y, and x̄ ȳ to be the class of xy.

Let Z_m be the set of equivalence classes of Z under congruence modulo m and under +, × quotient operators. Operations in Z_5 are given in Fig. 4.

FIGURE 4 Addition and multiplication of modulo 4 residue classes.

II. SEMIGROUPS

A. Generators and Relations

A binary operation is a function that given two entries from a set S produces some element of a set T. Therefore, it is a function from the set S × S of ordered pairs (a, b) to T. The value is frequently denoted multiplicatively as a ∗ b, a ∘ b, or ab. Addition, subtraction, multiplication, and division are binary operations.

The set S is said to be closed under the operation if the product always lies in S itself. The positive integers are not closed under subtraction or division.

The operation is called associative if we always have (a ∘ b) ∘ c = a ∘ (b ∘ c). We have noted that this always holds for composition of functions or binary relations. Conversely, if closure and associativity hold, the set can always be represented by a set of functions under composition.

A set with a binary operation satisfying associativity and closure is called a semigroup. The positive integers form a semigroup under multiplication or addition. An element e is called a left (right) identity in a semigroup S if for all x ∈ S, ex = x (xe = x). A semigroup with two-sided identity is called a monoid. To represent a semigroup as a set of functions under composition, first add a two-sided identity element to obtain a monoid M. Then for each x in M define a function f_x : M → M by f_x(y) = x ∘ y.

The set generated by a subset G ⊂ S is the set of all finite products {x_1 x_2 ··· x_n : n ∈ Z+, x_i ∈ G}, where Z+ denotes the set of all positive integers. The set satisfies

closure since (x_1 x_2 ··· x_n)(y_1 y_2 ··· y_n) has again the form required. A subset of a semigroup satisfying closure is itself a semigroup and is called a subsemigroup. The set G is called a set of generators for S if G generates the entire semigroup.

For a given set G of generators of S, a relation is an equation x_1 x_2 ··· x_m = y_1 y_2 ··· y_n, where x_i, y_i ∈ G.

A homomorphism of semigroups is a mapping f : S_1 → S_2 such that f(x ∘ y) = f(x) ∗ f(y) for all x, y ∈ S_1, where ∘, ∗ denote the operations of S_1, S_2. For example, e^z is a homomorphism from the additive semigroup of real numbers (R, +) to the multiplicative semigroup of complex numbers (C, ×), since e^(x+y) = (e^x)(e^y). Here C denotes the set of all complex numbers. The trace and determinant of n-square matrices are homomorphisms, respectively, from the additive and multiplicative semigroups of n-square matrices to the additive and multiplicative semigroups of real numbers.

An isomorphism of semigroups is a 1–1 onto homomorphism. The inverse function will then also be an isomorphism. Isomorphic semigroups are structurally identical. Two semigroups having the same generating set G and the same relations are isomorphic, since the mapping f(x_1 ∘ x_2 ∘ ··· ∘ x_n) = x_1 ∗ x_2 ∗ ··· ∗ x_n will be a homomorphism, where ∘, ∗ denote the two operations. The identity of the sets of relations guarantees that this is well defined, that is, if x_1 ∘ x_2 ∘ ··· ∘ x_m = y_1 ∘ y_2 ∘ ··· ∘ y_n, then f(x_1 ∘ x_2 ∘ ··· ∘ x_m) = f(y_1 ∘ y_2 ∘ ··· ∘ y_n).

For any set of generators G and set R of relations, a semigroup S can be produced having these generators and satisfying the relations such that if f : G → T is any function such that for every relation x_1 ∘ x_2 ∘ ··· ∘ x_m = y_1 ∘ y_2 ∘ ··· ∘ y_n the relation f(x_1) ∗ f(x_2) ∗ ··· ∗ f(x_m) = f(y_1) ∗ f(y_2) ∗ ··· ∗ f(y_n) holds, then f extends to a homomorphism S → T.

To produce S we take the set of all “words” x_1 x_2 ··· x_m in G, that is, sequences of elements in G. Two “words” are multiplied by writing one after the other: x_1 x_2 ··· x_m y_1 y_2 ··· y_n. We define an equivalence relation on words by w_1 ∼ w_2 if w_2 can be obtained from w_1 by a series of replacements of a x_1 x_2 ··· x_n b by a y_1 y_2 ··· y_m b, where a, b are words (or are empty) and x_1 x_2 ··· x_n = y_1 y_2 ··· y_m is a relation in R. Then S is the set of equivalence classes of words.

The semigroup S is called the semigroup with generators G and defining relations R. The fact that multiplication is well defined in S follows from the fact that a ∼ b is a congruence. This means that if a ∼ b then for all words x, ax ∼ bx and xa ∼ xb, and that ∼ is an equivalence relation.

For all semigroup homomorphisms f : S → T the relation f(x) = f(y) is a congruence. Conversely any congruence gives rise to a homomorphism from S to the set of equivalence classes.

B. Green’s Relations

For analyzing the structure of a semigroup, in many cases it is best to decompose it into a family of equivalence classes.

Let S be a semigroup and let M be S with an identity element added if S lacks one. Let ℛ be the relation for x, y ∈ S that for some a, b ∈ M, xa = y, yb = x. Let ℒ be the relation that for some a, b ∈ M, ax = y, by = x. Let 𝒥 be the relation that for some a, b, c, d ∈ M, axc = y, byd = x. What these equations express is a kind of divisibility, that either of x, y can be a factor of the other. For example, in Z+ under multiplication (2, 5) ∉ ℛ because 2 is not a factor of 5.

There are two other important relations: ℋ = ℛ ∩ ℒ = {(x, y) : x ℛ y and x ℒ y} and 𝒟 = ℛ ∘ ℒ = ℒ ∘ ℛ. These are both equivalence relations also, and the five relations are known as Green’s relations.

For the semigroup of transformations on a finite set T, with products composed as binary relations (so that fg applies f first), f ℛ g if and only if the partitions given by the equivalence relations {(x, y) ∈ T × T : f(x) = f(y)}, {(x, y) ∈ T × T : g(x) = g(y)} coincide, f 𝒟 g if and only if f 𝒥 g if and only if f, g have the same number of image elements, and f ℒ g if and only if their images are the same. In a finite semigroup, 𝒟 = 𝒥. The entire semigroup is always broken into 𝒥-classes, which are partially ordered by divisibility. The 𝒟-classes, which are in these, are broken into ℋ-classes H_ij = R_i ∩ L_j, where R_i, L_j are the ℛ-, ℒ-classes in a given 𝒟-class.

The ℋ-classes can be laid out in a matrix (H_ij) called the eggbox picture of a semigroup. There exists a 1–1 correspondence between any two ℋ-classes in the same 𝒟-class.

A left ideal in a semigroup S is a set ℐ ⊂ S such that for all x ∈ S, y ∈ ℐ the element xy ∈ ℐ. Right and two-sided ideals are similarly defined by yx ∈ ℐ and xyz ∈ ℐ for all x, z ∈ S, y ∈ ℐ, respectively. An element x generates the principal left, right, and two-sided ideals Sx = {yx : y ∈ S}, xS = {xy : y ∈ S}, SxS = {yxz : y, z ∈ S}. Two elements are ℒ-, ℛ-, 𝒥-equivalent (assuming S has an identity, otherwise add one) if and only if they generate the same left, right, or two-sided ideals, respectively.

C. Binary Relations and Boolean Matrices

The set of binary relations on an n-element set under composition forms a semigroup that is isomorphic to B_n, the semigroup of n-square Boolean matrices under Boolean matrix multiplication. A 1 × n (n × 1) Boolean matrix is called a row (column) vector. Two vectors are added or multiplied by a constant (0 or 1) as matrices: (1, 0, 1) + (0, 1, 0) = (1, 1, 1), 0v = 0, and 1v = v. A set of Boolean vectors is called a subspace if it is closed under sums and contains 0. The span of a set W of Boolean vectors is the set of all finite sums of elements of W, including 0.

The row (column) space of a Boolean matrix is the space spanned by its row (column) vectors. Two Boolean matrices are 𝒟- (also 𝒥-) equivalent if and only if their row spaces are isomorphic. The basis for a space of Boolean vectors spanned by S is {x ∈ S : x ≠ 0 and x is not in the span of S\{x}}. Two subspaces are identical if and only if their bases are the same. Bases for the row, column spaces of a Boolean matrix are called row, column bases. Two Boolean matrices are ℒ- (ℛ-) equivalent if and only if their row (column) bases (and so spaces) coincide.

A semigroup of the form {x^n} is called cyclic. Every cyclic semigroup is determined by the index k = inf{k ∈ Z+ : x^(k+m) = x^k for some m ∈ Z+}, and the period d = inf{d ∈ Z+ : x^(k+d) = x^k}. The powers x^k, x^(k+1), ... repeat with period d. If an n-square Boolean matrix A is reflexive (A ≥ I), then A = AI ≤ A^2 ≤ A^3 ≤ ··· ≤ A^(n−1) = A^n, an idempotent. Here I denotes the identity matrix. Also if A is fully indecomposable, meaning that there exist no nonempty K, L ⊂ {1, 2, ..., n} with |K| + |L| = n, where |S| denotes the cardinality of a set S, and a_ij = 0 for i ∈ K, j ∈ L, then A^(n−1) = J and so A has period 1 and index at most n − 1. Here J is the Boolean matrix all entries of which are 1.

In general, an n-square Boolean matrix has period equal to the period of some permutation on {1, 2, ..., n} and index at most (n − 1)^2 + 1.

D. Regularity and Inverses

An element x of a semigroup S is said to be a group inverse of y if S has a two-sided identity e and xy = yx = e. A semigroup in which every element has a group inverse is called a group. In most semigroups, few elements are invertible in this sense. A function is invertible if and only if it is 1–1 onto. A Boolean matrix is invertible if and only if it is the matrix of a permutation (permutation matrix).

There are many weaker ideas of inverse. The two most important are regularity, yxy = y, and Thierrin–Vagner inverse, yxy = y and xyx = x. An element y has a Thierrin–Vagner inverse if and only if it is regular: If yxy = y then xyx is a Thierrin–Vagner inverse. A semigroup in which all elements are regular is called a regular semigroup. The semigroups of partial transformations, transformations, and n-square matrices over a field are regular. In the semigroup of n-square Boolean matrices, an element is regular if and only if its row space forms a distributive lattice as a poset.

An idempotent is an element x such that xx = x. An element x is regular if and only if its ℛ-equivalence class contains an idempotent: if xyx = x then xy is idempotent. The same holds for ℒ-equivalence classes. Therefore, if two elements are ℒ- or ℛ-equivalent (therefore 𝒟-equivalent), one is regular if and only if the other is.

If an ℋ-class contains an idempotent, then it will be closed under multiplication, and multiplication by any of its elements gives a 1–1 onto mapping from it to itself. Therefore, it forms a group. Conversely any group in a semigroup lies in a single ℋ-class since under multiplication any two elements are ℒ-equivalent and ℛ-equivalent.

E. Finite State Machines

The theory of automata deals with different classes of theoretical machines representing robots, calculators, and similar devices. The simplest are the finite state machines. There are two essentially equivalent varieties: Mealy machines and Moore machines.

A Mealy machine is a 5-tuple (S, X, Z, ν, µ), where S, X, Z are sets, ν a function S × X to S, and µ a function S × X to Z. The same definition holds for Moore machines except that µ is a function S to Z.

Here S is the set of internal states of the machine. For a computer this could include all possibilities as to which circuit elements are on or off. The set X is the set of inputs, which could include a program and data. The set Z is the set of outputs, that is, the desired response. The function µ gives the particular output from a given internal state and input. The function ν gives the next internal state resulting from a given state and input. For a computer this is determined by the circuitry. For example, a flip-flop will change its internal state if it receives an input of 1; otherwise the internal state will remain unchanged.

A Mealy machine to add two n-digit binary numbers a, b can be constructed as follows. Let the ith digits a_i, b_i of a, b be inputs, so X = {0, 1} × {0, 1}. Let the carry from the (i − 1)th digit be c_i. It is an internal state, so S = {0, 1}. The output is the ith digit of the answer, so Z = {0, 1}. The function ν gives the next carry. It is 1 if and only if a_i + b_i + c_i > 1. The function µ gives the output. It is 1 if and only if a_i + b_i + c_i is odd. Figure 5 gives the values of ν and µ.

FIGURE 5 Machine to add two binary numbers.

With a finite state machine is associated a semigroup of transformations, the transformations f_x(s) = ν(s, x) of
the state space, and all compositions of them $f_{x_1 x_2 \cdots x_n}(s) = f_{x_1}(f_{x_2} \cdots f_{x_n}(s))$. This is called the semigroup of the machine.

Two machines are said to have the same behavior if there exists a binary relation $R$ from the initial states of one machine to the initial states of the other such that if $(s, t) \in R$ then for any sequence $x_1, x_2, \ldots, x_k$ of inputs machine 1 in state $s$ gives the same sequence of outputs as machine 2 in state $t$. A machine $M_1$ equivalent to a given machine $M$ having a minimal number of states can be constructed as follows. Call two states of $M$ equivalent if, for any sequence of inputs, the same sequence of outputs is obtained. This gives an equivalence relation on states. Then let the states of $M_1$ be the equivalence classes of states of $M$.

No finite state machine $M$ can multiply, as the above machine adds, binary numbers of arbitrary length. Suppose such a machine has $n$ states. Then suppose we multiply the number $2^k$ by itself in binary notation (adding zeros in front until the correct number of digits in the answer is achieved). Let $f_x$ be the transformation of the state space given by inputs of 0, 0 and let $a$ be the state after the inputs 1, 1. The inputs 0, 0 applied to state $f_x^i(a)$ give output 0 for $i = 0, 1, 2, \ldots, k - 2$ but output 1 for $i = k - 1$. Yet the transformation $f_x$ on a set of $n$ elements will have index at most $n$. Therefore, if $k > n$ then $f_x^{k-1}(a)$ will coincide with $f_x^j(a)$ for some $j < k - 1$. This is a contradiction since one yields 0 output, the other 1 output. It follows that no such machine can exist.

F. Mathematical Linguistics

We start with a set $W$ of basic units considered words. Mathematical linguistics is concerned with the formal theory of sentences, that is, sequences of words that are grammatically allowed, and the grammatical structure of sentences (or longer units). This is syntax. Meaning (semantics) is usually not dealt with.

For a set $X$, let $X^*$ be the set of finite sequences from $X$ including the empty sequence $e$. For instance, if $X$ is $\{0, 1\}$, then $X^*$ is $\{e, 0, 1, 00, 01, 10, 11, 000, \ldots\}$. For a more important example, we can consider the family of all sequences of logical variables $p$, $q$, $r$, and $\vee$ (or), $\wedge$ (and), $($, $)$ (parentheses), $\to$ (if then), $\sim$ (not). The set of logical formulas will be a subset of this.

A phrase structure grammar is a quadruple $(N, W, \rho, \psi)$, where $W$ (set of words) is nonempty and finite, $N$ (nonterminals) is a finite set disjoint from $W$, $\psi \in N$, and $\rho$ is a finite subset of $((N \cup W)^* \setminus W^*) \times (N \cup W)^*$. The set $N$ is a set of valid grammatical forms involving abstract concepts such as $\psi$ (sentence) or subject, predicate, object. The set $\rho$ (productions) is a set of ways we can substitute into a valid grammatical form to obtain another, more specific one. The element $\psi$ is called the starting symbol. If $(x, y) \in \rho$, we are allowed to replace any occurrence of $x$ with $y$. Members of $W$ are called terminals.

Consider logical formulas involving operations $\vee$, $\sim$ and variables $p$, $q$, $r$. Let $W = \{p, q, r, \vee, (, ), \sim\}$, $N = \{\psi\}$. We derive formulas by successive substitution, as $\psi$, $(\psi \vee \psi)$, $((\psi \vee \psi) \vee \psi)$, $((\psi \vee \psi) \vee \sim\psi)$, $((p \vee \psi) \vee \sim\psi)$, $((p \vee q) \vee \sim\psi)$, $((p \vee q) \vee \sim r)$.

An element $y \in (N \cup W)^*$ is said to be directly derived from $x \in (N \cup W)^*$ if $x = azb$, $y = awb$ for some $(z, w) \in \rho$, $a, b \in (N \cup W)^*$. An indirect derivation is a sequence of direct derivations. Here $\rho = \{(\psi, p), (\psi, q), (\psi, r), (\psi, \sim\psi), (\psi, (\psi \vee \psi))\}$.

The language determined by a phrase structure grammar is the set of all $a \in W^*$ that can be derived from $\psi$.

A grammar is called context free if and only if for all $(a, b) \in \rho$, $a \in N$, $b \neq e_0$ ($e_0$ being the empty sequence). This means that what items can be substituted for a given grammatical element does not depend on other grammatical elements. The grammar above is context free.

A grammar is called regular if for all $(a, b) \in \rho$ we have $a \in N$, $b = tn$, where $t \in W$, $n \in N$, or $n = e_0$. This means at each derivation we go from $t_1 t_2 \cdots t_r n$ to $t_1 t_2 \cdots t_r t_{r+1} m$, where the $t_i$ are terminals, $n$, $m$ nonterminals, $(n, t_{r+1} m) \in \rho$. So we fill in one terminal at each step, going from left to right. The grammar mentioned above is not regular.

To recognize a grammar is to be able to tell whether or not a sequence from $W^*$ is in the language. A grammar is regular if and only if some finite state machine recognizes it. The elements of $W$ are input one at a time and outputs are "yes, no," meaning that the sequence of symbols up to the present is or is not in the language. Let the internal states of the machine be in 1–1 correspondence with all subsets of $N$, and let the initial state be $\{\psi\}$. For a set $S_1$ of nonterminals and input $x$ let the next state be the set $S_2$ of all nonterminals $z$ such that for some $u \in S_1$, $(u, xz)$ is a production. Then at any time the state consists of all nonterminals that could occur after the given sequence of inputs. Let the output be "yes" if and only if for some $u \in S_1$, $(u, x) \in \rho$. This is if and only if the previous inputs together with the current input form a word in the language.

For the converse, if a finite state machine can recognize a language, let $W$ be as in the language, $N$ be the set of internal states, $\psi$ the initial state, the productions the set of pairs $(n_1, x n_2)$ such that if the machine is in state $n_1$ and $x$ is input, state $n_2$ is the next state, and the set of pairs $(n_1, x)$ such that in state $n_1$ after input $x$ the machine answers "yes."

A further characterization of regular languages is the Myhill–Nerode theorem. Let $W^*$ be considered a semigroup of words. Then a language $L \subset W^*$ is regular if and only if the congruence $\{(x, y) \in W^* \times W^* : axb \in L \text{ if and only if } ayb \in L \text{ for all } a, b \in W^*\}$ has finitely many equivalence classes. This is if and only if there exists a finite semigroup $H$, a homomorphism $h : W^* \to H$ that is onto, and a subset $S \subset H$ such that $L = h^{-1}(S)$.
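The recognizer construction described above for regular grammars can be sketched in Python. This is a minimal illustration rather than anything given in the article: the function name `recognize_prefixes`, the dictionary encoding of the productions, and the sample grammar (words over {a, b} that end in "ab") are all assumptions made for the example.

```python
def recognize_prefixes(productions, start, word):
    """Run the subset-state recognizer for a regular grammar.

    productions maps each nonterminal u to a set of pairs (t, z):
    (t, z) encodes the production u -> t z, and (t, None) encodes u -> t.
    Returns one boolean per input symbol: True ("yes") when the inputs
    read so far form a word of the language.
    """
    state = {start}  # current set of nonterminals that could occur
    answers = []
    for x in word:
        # "yes" iff some nonterminal u in the current state has u -> x
        answers.append(any((x, None) in productions.get(u, ()) for u in state))
        # next state: every z with u -> x z for some u in the current state
        state = {z for u in state
                 for (t, z) in productions.get(u, ())
                 if t == x and z is not None}
    return answers


# Hypothetical sample grammar: S -> aS | bS | aA,  A -> b
SAMPLE = {"S": {("a", "S"), ("b", "S"), ("a", "A")}, "A": {("b", None)}}
```

For the sample grammar, `recognize_prefixes(SAMPLE, "S", "aabab")` returns `[False, False, True, False, True]`: the prefixes "aab" and "aabab" are in the language, the others are not.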

III. GROUPS

A. Examples of Groups
A group is a set $G$ together with a binary operation here denoted $\circ$ such that (1) $\circ$ is a function $G \times G \to G$ (closure); (2) $\circ$ is associative, that is, $(x \circ y) \circ z = x \circ (y \circ z)$; (3) there exists a two-sided identity $e$ such that $e \circ x = x \circ e = x$ for all $x \in G$; (4) for all $x \in G$ there exists $y \in G$ with $x \circ y = y \circ x = e$. Here $y$ is known as an inverse of $x$. From now on we suppress $\circ$.

For any semigroup $S$ with a two-sided identity $e$, let $S^* = \{x \in S : xy = yx = e \text{ for some } y \in S\}$, the set of invertible elements. The element $y = x^{-1}$ is unique since if $y_1 x = e$, $x y_2 = e$ then $y_1 = y_1 x y_2 = y_2$. Then $S^*$ satisfies closure since $(ab)^{-1} = b^{-1} a^{-1}$ and is a group where $a^{-1}$ is an inverse of $a$.

A group with a finite number of elements is called a finite group; otherwise, it is called an infinite group. If a group $G$ is finite, and $G$ contains $n$ elements, we say that the order of $G$ is $n$ and we write $|G| = n$. If $G$ is infinite, we write $|G| = \infty$. In general, $|G|$ denotes the cardinality of a set $G$.

The group of all permutations of $\{1, 2, \ldots, n\}$ is called $S_n$, the symmetric group of degree $n$. The degree is the number of elements in the domain (and range) of a permutation. Therefore, $S_n$ has degree $n$, order $n!$. A subgroup of $S_n$ is formed by the set of all transformations of the form $1 \to k + 1$, $2 \to k + 2$, $\ldots$, $n - k \to n$, $n - k + 1 \to 1$, $\ldots$, $n \to k$.

For any set $P$ of subsets of $n$-dimensional Euclidean space $E^n$, let $T$ be the union of the subsets of $P$. A symmetry of $P$ is a mapping $f : T \to T$ such that (S–1) for $x, y \in T$, $d(x, y) = d(f(x), f(y))$, where $d(x, y)$ denotes the distance from $x$ to $y$; and (S–2) for $A \subset T$, $A \in P$ if and only if $f(A) \in P$. That is, $f$ preserves distances and sends the subsets of $P$, for example, points and lines, to other subsets of $P$. The inverse of $f$ is its inverse function, which also satisfies (S–1) and (S–2).

The sets $\mathbb{R}$, $\mathbb{Z}$, $\mathbb{Z}_m$ under addition are all groups. The group $\mathbb{Z}_m$, the group of permutations $\{1 \to k + 1, 2 \to k + 2, \ldots, m \to k\}$, and the group of rotational symmetries of an $m$-sided regular polygon are isomorphic. That is, there exists a 1–1 correspondence $f$ between any two such that $f(xy) = f(x) f(y)$ for all $x, y$ in the domain. A group isomorphic to any of these is called a cyclic group of order $m$.

The group of all symmetries of a regular $n$-sided polygon has order $2n$ and is called the dihedral group of degree $n$. It is isomorphic to the group of functions $\mathbb{Z}_n \to \mathbb{Z}_n$ of the form $f(x) = \pm x + k$. Its multiplication table is given in Fig. 6, where $x_i$ is $x + i - 1$ if $i < n + 1$, and $-x + i - n - 1$ if $i > n$.

FIGURE 6 Multiplication table of the dihedral group.

B. Fundamental Homomorphism Theorems

A group homomorphism is a function $f : G \to H$ satisfying $f(xy) = f(x) f(y)$ for all $x, y$ in $G$. If $f$ is 1–1 and onto it is called an isomorphism. For any group $G$ of order $n$ there is a 1–1 homomorphism $G \to S_n$; label the elements of $G$ as $x_1, x_2, \ldots, x_n$. For any $x$ in $G$, the mapping $a \to xa$ gives a 1–1 onto function $G \to G$, so there exists a permutation $\pi$ with $x x_i = x_{\pi(i)}$. This is a homomorphism since if $x$, $y$ are sent to $\pi$, $\varphi$, we have $y x x_i = y x_{\pi(i)} = x_{\varphi(\pi(i))}$. There is an isomorphism from $S_n$ to the set of $n$-square permutation matrices (either over $\mathbb{R}$ or a Boolean algebra), and it follows that any finite group can be represented as a group of permutations or of invertible matrices.

Groups can be simplified by homomorphisms in many cases. The determinant represents the group of nonsingular $n$-square matrices in the multiplicative group of nonzero real numbers (it is not 1–1). For a bounded function $g$, $f \mapsto \int g f \, dx$ is a homomorphism from the additive group of integrable functions $f$ to the additive group $\mathbb{R}$.

The kernel of a homomorphism $f : G \to H$ is $\{x : f(x) = e\}$. Every homomorphism sends the identity to the identity and inverses to inverses. Existence of inverses means we can cancel on either side in groups: if $ax = ay$ then $a^{-1} a x = a^{-1} a y$, $ex = ey$, $x = y$. Therefore, the identity $e$ is the unique element satisfying $ee = e$. Under a homomorphism $f(e) = f(ee) = f(e) f(e)$, so $f(e)$ is an identity. So $e$ is in the kernel. A homomorphism is 1–1 if and only if $e$ is the only element of its kernel: if $f(x) = f(y)$, then $f(xy^{-1}) = f(x) f(y^{-1}) = f(x) f(y)^{-1} = f(x) f(x)^{-1} = e$.

If $x$, $y$ are in the kernel, so is $xy$ since $f(xy) = f(x) f(y) = ee = e$. If $x$ is in the kernel, so is $x^{-1}$. Any
subset of a group that is closed under products and inverses is a subgroup. Both the kernel and image of a homomorphism are subgroups.

The mapping $c_g : G \to G$ defined by $c_g(x) = g x g^{-1}$ is called a conjugation. An isomorphism from a group (or other structure) to itself is called an automorphism. Since $c_g(xy) = g x y g^{-1} = g x g^{-1} g y g^{-1}$, $c_g$ is a homomorphism. If $h = g^{-1}$, then $c_g$ is the inverse of $c_h$. Therefore, the mappings $c_g$ are automorphisms of a group. They are called the inner automorphisms. This gives a homomorphism of any group to its automorphisms, which also form a group, the automorphism group. A subgroup $N$ is normal if $c_g(x) \in N$ for all $x \in N$ and $g \in G$. The kernel is always a normal subgroup.

If $f : G \to H$ has kernel $K$ and is onto, the group $G$ is said to be an extension of $K$ by $H$. Whenever there exists a homomorphism $f$ from $K$ to $\mathrm{Aut}(H)$ for any groups $K$, $H$, a particular extension exists called the semidirect product. Here $\mathrm{Aut}(H)$ denotes the automorphism group of $H$. This has as its set $K \times H$ and products are defined by $(k_1, h_1)(k_2, h_2) = [k_1 k_2, f(k_2)(h_1)\, h_2]$, $k_1, k_2 \in K$ and $h_1, h_2 \in H$. The dihedral group is a semidirect product of $\mathbb{Z}_2$ and $\mathbb{Z}_m$.

The groups $K$, $H$ will generally be simpler than $G$. A theory exists classifying extensions. If a group has no normal subgroups except itself and $\{e\}$, it is called simple. The alternating group, the group of permutations in $S_n$ whose matrices have positive determinant, is simple for $n > 4$. Its order is $n!/2$. For $n$ odd, the group of $n$-square real-valued invertible matrices of determinant 1 is simple. For $n$ even a homomorphic image of this group with kernel $\{I, -I\}$ is simple, where $I$ is an identity matrix. All finite simple groups and all finite-dimensional differentiable, connected simple groups are known, and most are constructed from groups of matrices.

For any subgroup $H$ of a group $G$ there exist equivalence relations defined by $x \sim y$ if $x y^{-1} \in H$ ($y^{-1} x \in H$). These equivalence classes are called left (right) cosets. The left (right) coset of $a$ is $Ha = \{ha : h \in H\}$ ($aH = \{ah : h \in H\}$). There exists a 1–1 correspondence $H \to Ha$ given by $x \to xa$ ($x \to ax$ for right cosets). Therefore, all cosets have the same cardinality as $H$. If $[G : H]$ denotes the number of right (or left) cosets, then $|G| = |H|[G : H]$, which is known as Lagrange's theorem. So the order of a subgroup of a finite group must divide the order of the group. If $G$ is a group of prime order $p$, let $x \in G$, $x \neq e$. Then $\{x^n\}$ forms a subgroup whose order divides $p$ and must therefore equal $p$. So $G = \{x^n\}$.

If $N$ is a normal subgroup and $x \sim y$, then for all $g \in G$, $gx \sim gy$ and $xg \sim yg$, since $g x y^{-1} g^{-1}$ and $x y^{-1}$ are in $N$ if $x y^{-1}$ is in $N$. It follows that if $a \sim b$ and $c \sim d$, then $ac \sim bd$. Therefore, the equivalence classes form a group $G/N$ called the quotient group. Its order is $|G|/|N|$.

For any homomorphism $f : G \to H$ with kernel $K$, image $M$, every coset of $K$ is sent to a single element. That is, $f$ gives a mapping $G/K \to M$. This mapping is an isomorphism.

If $A$, $B$ are normal in a group $G$, and $B$ is a subgroup of $A$, then

$$\frac{G/B}{A/B} \cong G/A$$

If $A$ is a normal subgroup and $B$ is any subgroup of $G$, and $AB = \{ab : a \in A, b \in B\}$, then

$$\frac{AB}{A} \cong \frac{B}{A \cap B}$$

C. Cyclic Groups

Let $G$ be a group. If there exists $a \in G$ such that $G = \{a^k : k \in \mathbb{Z}\}$, then $G$ is called a cyclic group generated by $a$, and $a$ is called a generator of $G$. For addition, the definition becomes: there exists $a \in G$ such that for each $g \in G$ there exists $k \in \mathbb{Z}$ such that $g = ka$.

A group $G$ is said to be commutative when $xy = yx$ for all $x$, $y$ in $G$. Commutative groups are also called Abelian. In a commutative group, $g x g^{-1} = x$ for all $g \in G$. Therefore, every subgroup of a commutative group is normal. Let $\mathbb{R}^n$ denote the set of all $n$-tuples of real numbers. Then $(\mathbb{R}^n, +)$ is a commutative group. For any groups $G_1, G_2, \ldots, G_n$, the set $G_1 \times G_2 \times \cdots \times G_n$ under componentwise multiplication $(x_1, \ldots, x_n) \times (y_1, \ldots, y_n) = (x_1 y_1, \ldots, x_n y_n)$ is a group called the Cartesian or direct product of $G_1, G_2, \ldots, G_n$. If all $G_i$ are commutative, so is the direct product. For any indexed family $G_\alpha$ of groups, coordinatewise multiplication makes the Cartesian product a group $\times_\alpha G_\alpha$. The subset of elements $(x_\alpha)$ such that $\{\alpha : x_\alpha \neq e\}$ is finite is also a group, sometimes called the direct sum.

A set $G_0 \subset G$ is said to generate the set $\{x_1 x_2 \cdots x_n : x_i \in G_0 \text{ or } x_i^{-1} \in G_0 \text{ for all } i \text{ and } n \in \mathbb{Z}^+\}$. Every finitely generated Abelian group is isomorphic to a direct product of cyclic (1 generator) groups.

A group of words on any set $X$ is defined as the set of finite sequences (together with the identity $e$) $x_1^{a(1)} x_2^{a(2)} \cdots x_n^{a(n)}$, $x_i \in X$, $a(i) \in \mathbb{Z}$. We can reduce such words until $a(i) \neq 0$, $x_i \neq x_{i+1}$ by adding exponents. Two words are equivalent if and only if they have the same reduced form. Equivalence classes of words form a group $F$.

Relations among generators in a group can be written in the form $x_1^{a(1)} x_2^{a(2)} \cdots x_n^{a(n)} = e$. For any set $G_0$ and set of relations $R$ in the elements of $G_0$, there exists a group $G$ defined by these relations such that any mapping $f : G_0 \to H$ extends to a unique homomorphism $g : G \to H$ if the relations hold in $H$. The group $G$ is
defined to be $F/N$, where $N$ is the subgroup generated by $\{y\, x_1^{a(1)} x_2^{a(2)} \cdots x_n^{a(n)}\, y^{-1} : x_1^{a(1)} x_2^{a(2)} \cdots x_n^{a(n)} = e \text{ is a relation of } R,\ y \in F\}$.

The dihedral group has generators $x$, $y$ and defining relations $x^m = e$, $y^2 = e$, $yx = x^{-1} y$.

The word problem for a group $G$ with given sets of generators and defining relations is to find an algorithm which for any product $x_1^{a(1)} x_2^{a(2)} \cdots x_n^{a(n)}$ will determine (in a finite number of steps) whether this product is $e$. For general groups with finite $G_0$, $R$, the word problem is unsolvable; that is, no algorithm exists.

If a cyclic group $\{x^m\}$ has two powers equal, then some power is $e$. Let $d$ be the least positive power equal to $e$. Then $x^0, x^1, x^2, \ldots, x^{d-1}$ are distinct and for any $m$, if $m = qd + r$ then $x^m = (x^d)^q x^r = e^q x^r = x^r$. It follows that the group is isomorphic to $\mathbb{Z}_d$ under addition. If no two powers are equal, then $f(x^m) = m$ defines an isomorphism to $\mathbb{Z}$ under addition.

The multiplicative groups $\mathbb{Z}_m^* = \mathbb{Z}_m \setminus \{\bar{0}\}$ must be distinguished from $\mathbb{Z}_m$. For $m$ prime we will later prove $\mathbb{Z}_m^*$ is cyclic (of order $m - 1$). For $m = 8$, $S = \{\bar{1}, \bar{3}, \bar{5}, \bar{7}\} \subset \mathbb{Z}_8^*$ with multiplication $\bar{1} = e$, $\bar{3}^2 = \bar{5}^2 = \bar{7}^2 = \bar{1}$, $\bar{3}\,\bar{5} = \bar{7}$. It is isomorphic to $\mathbb{Z}_2 \times \mathbb{Z}_2$, is not cyclic, and is the noncyclic group of smallest order.

D. Permutation Groups

In the symmetric group $S_n$ of all permutations of $\{1, 2, \ldots, n\}$ elements are written in two standard ways. The notation

$$\begin{pmatrix} 1 & 2 & 3 & \cdots & n \\ a_1 & a_2 & a_3 & \cdots & a_n \end{pmatrix}$$

denotes the function $f(1) = a_1$, $f(2) = a_2$, $f(3) = a_3$, $\ldots$, $f(n) = a_n$. It is convenient to compose permutations written in this notation. To find

$$\begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 3 & 5 & 4 & 2 & 1 \end{pmatrix} \circ \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 5 & 4 & 3 & 2 & 1 \end{pmatrix}$$

as functions, on the right go through 1 to 5 on top. For each number find its image below; then locate that image on the upper left. Below it is the final image.

$$\begin{array}{l} 1 \to 5 \to 1 \\ 2 \to 4 \to 2 \\ 3 \to 3 \to 4 \\ 4 \to 2 \to 5 \\ 5 \to 1 \to 3 \end{array} \quad\longrightarrow\quad \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 1 & 2 & 4 & 5 & 3 \end{pmatrix}$$

The inverse of an element is obtained by looking up $i$ on the bottom and taking the element above it, $i = 1$ to 5.

$$\begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 4 & 5 & 1 & 2 & 3 \end{pmatrix}^{-1} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 1 & 2 \end{pmatrix}$$

In computer work, permutations can be stored as arrays $F(N)$, $G(N)$ with composition $F(G(N))$ [or $u = G(x)$ : $y = F(u)$].

The second standard notation is cyclic notation. A $k$-cycle $(x_1 x_2 \cdots x_k)$ is the permutation $f$ such that $f(x_1) = x_2$, $f(x_2) = x_3$, $\ldots$, $f(x_k) = x_1$, and $f(y) = y$ for $y \neq x_i$. If for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, k$, $x_i \neq y_j$, then $(x_1 x_2 \cdots x_m)$ and $(y_1 y_2 \cdots y_k)$ commute.

Any permutation $\rho$ can be written uniquely as a product of disjoint cycles of length greater than 1. In such a representation $x$, $y$ will lie in the same cycle if and only if $\rho(x) \neq x$, $\rho(y) \neq y$ and $\rho^k(x) = y$ for some $k \in \mathbb{Z}$. This defines an equivalence relation, so take the underlying sets of the cycles to be the equivalence classes. In each equivalence class, choose an element $x$, and the cycle, if it has size $d$, must be $(x\, \rho(x)\, \rho^2(x) \cdots \rho^{d-1}(x))$.

Two permutations are conjugate if and only if they have the same number of cycles of each length. The order of a group element $x$ is the order of the cyclic subgroup it generates, that is, the least positive $m$ such that $x^m = e$. If $x = z_1 z_2 \cdots z_k$ in cycle form, then since $z_i z_j = z_j z_i$, $x^m = z_1^m z_2^m \cdots z_k^m$ and $x^m = e$ if and only if all $z_i^m = e$. Therefore, the order of $x$ is the least common multiple of the lengths of its cycles.

A permutation $\rho$ acts on the expression $\prod_{i<j} (x_i - x_j)$ by taking it to $\prod_{i<j} (x_{\rho(i)} - x_{\rho(j)})$. Since each pair $\{i, j\}$ is sent to $\{\rho(i), \rho(j)\}$ in a 1–1 fashion, each factor goes to plus or minus another factor in the same expression. The sign of a permutation is defined to be $+1$ or $-1$ according as the whole expression goes to $+$ or $-$ itself. This is a group homomorphism. Permutations of sign $+1$ ($-1$) are called even (odd). Those of sign $+1$ form the kernel of the homomorphism, the alternating group $A_n$. Any 2-cycle (transposition) is odd. A $k$-cycle $(x_1 x_2 \cdots x_k)$ is a product of $(k - 1)$ transpositions $(x_k x_{k-1})(x_{k-1} x_{k-2}) \cdots (x_2 x_1)$, so it has sign $(-1)^{k-1}$. The symmetric group is generated by any transposition $t = (ab)$ and any $n$-cycle of the form $z = (ab \cdots)$. By conjugation $z^i t z^{-i}$ we obtain a family of transpositions $t_i$. Products of these give a new family of elements $w_i$. Conjugation of $t_i$ by $w_j$ gives all transpositions. Products of these give all cycles. Products of cycles give all permutations.

E. Orbits

A group $G$ of permutations of a set $S$ is also said to be a group acting on a set. The function evaluation $g(s)$ defines a function $G \times S \to S$ such that $g(h(s)) = (g \circ h)(s)$. The relation on $S$, $\{(s, t) : g(s) = t \text{ for some } g \in G\}$, is an equivalence relation. If $g(s) = t$ and $h(t) = u$, then $h(g(s)) = u$,
so it is transitive. Its equivalence classes are called orbits. For cyclic groups, the orbits are the underlying sets of the cycles.

A permutation group is said to be transitive if $S$ has a single orbit. This will be true if and only if $G$ does not lie in a conjugate of some $S(n_1) \times S(n_2) \times \cdots \times S(n_k) \subset S(n)$, where $n_i > 0$, $k > 1$, $\sum n_i = n$, and $S(n_i)$ acts on $n_1 + n_2 + \cdots + n_{i-1} + 1$, $n_1 + n_2 + \cdots + n_{i-1} + 2$, $\ldots$, $n_1 + n_2 + \cdots + n_{i-1} + n_i$. Otherwise, it is called intransitive. For any element $x$, the isotropy subgroup of $x$ is $H = \{g \in G : gx = x\}$. Then for any $a$, the coset $aH$ maps $x$ to a single element $ax$. Different cosets give different elements. Therefore, there is a 1–1 correspondence between cosets and elements of the orbit of $x$. So $[G : H]$ is the cardinality of the orbit of $x$.

Any group acts on itself by conjugation. The orbits are the equivalence classes under the relation: for some $g$ in $G$, $x = g y g^{-1}$. They are called conjugacy classes. Their size, since they are orbits, divides $|G|$.

Let $G$ be a group of order $p^n$, $p$ prime, $n \in \mathbb{Z}^+$. Suppose the identity was the only conjugacy class containing only one element. Then all other conjugacy classes have order $p^i$, $i > 0$. So $G$ has $kp + 1$ elements for some $k$. This is false. If every element of $G$ has order a power of $p$, then $G$ is called a $p$-group. Therefore, for any $p$-group $G$ the set $C = \{c : cg = gc \text{ for all } g \in G\}$ has size larger than 1. For any group, $C$ so defined is a normal subgroup and is commutative, called the center.

F. Symmetry

An algebraic structure on a set $S$ is a family of subsets $\mathcal{T}_\alpha$ indexed on $\alpha \in I$ of $S_1 = S \cup (S \times S) \cup (S \times S \times S) \cup \cdots$. An automorphism of this structure is a 1–1 onto mapping $f : S \to S$ such that for all $\alpha \in I$, $T \in \mathcal{T}_\alpha$ if and only if $f(T) \in \mathcal{T}_\alpha$. For operational structures, this is any 1–1 onto mapping that is a homomorphism: $f(xy) = f(x) f(y)$. The complex number system has the automorphism $a + bi \to a - bi$. The additive real numbers have the automorphism $x \to -x$.

Especially in a geometric setting, automorphisms are called symmetries. A metric symmetry of a geometric figure is an isometry (distance-preserving map) on its points, which also gives a 1–1 correspondence on its lines and other distinguished subsets. Finite groups of isometries in three-dimensional space include the isometry groups of a regular $n$-gon (dihedral), cube and octahedron (extension of $\mathbb{Z}_2$ by $S_4$), icosahedron and dodecahedron (extension of $\mathbb{Z}_2$ by $A_5$).

The possible isomorphism types of $n$-dimensional symmetry groups, finite or continuous, are limited. One of the principal applications of group theory is to make use of exact or approximate symmetries of a system in studying it.

G. Enumeration

The equation $|O_X| = |G|/|H|$, where $O_X$ is the orbit of $X$, $G$ a group, and $H$ the isotropy subgroup of $X$, simplifies a number of enumeration problems.

Another problem is to count the number of orbits. A lemma of Burnside states that for a group $G$ acting on a set $X$ the number of orbits is $|G|^{-1} \sum_{g \in G} f_g$, where $f_g$ is the number of elements fixed by $g$, that is, $|\{x : gx = x\}|$. This is proved by counting the set $S = \{(x, g) : g(x) = x\}$ first over each $g$ (yielding $f_g$) and then over each $x$ (yielding $|G|/|O_x|$).

Pólya's theory of counting was motivated partly by a desire to count the number of distinct structures of a chemical compound with known numbers of elements and known graph but unknown locations of elements within the diagram. Suppose we have a hexagonal compound containing four carbon and two silicon atoms (Fig. 7; the reality of such a compound is not considered here). In how many distinct ways can the atoms be arranged? The symmetry group of a hexagon is the dihedral group of order 12.

FIGURE 7 Hexagonal molecule.

In the Redfield–Pólya theory, we first obtain a polynomial

$$P_G(x_1, x_2, \ldots, x_k) = \frac{1}{|G|} \sum_{g \in G} x_1^{c_1(g)} x_2^{c_2(g)} \cdots x_k^{c_k(g)}$$

where $c_i(g)$ is the number of cycles of $g$ of length $i$ and $k$ is the maximum length of any cycle. The dihedral group breaks down as shown in Table I. Then

$$P_G(x_1, x_2, \ldots, x_6) = \frac{1}{12}\left(x_1^6 + 2x_6^1 + 2x_3^2 + x_2^3 + 3x_2^2 x_1^2 + 3x_2^3\right)$$

For instance, having six 1-cycles yields term $x_1^6$. Redfield and Pólya computed the number of labelings of $X$ by labels $\theta_1, \theta_2, \ldots, \theta_m$ such that $\theta_i$ occurs exactly $n_i$ times. They proved that this is the coefficient of $\theta_1^{n_1} \theta_2^{n_2} \cdots \theta_m^{n_m}$ in

$$P_G\left(\theta_1 + \theta_2 + \cdots + \theta_m,\ \theta_1^2 + \theta_2^2 + \cdots + \theta_m^2,\ \ldots,\ \theta_1^k + \theta_2^k + \cdots + \theta_m^k\right)$$
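Burnside's lemma as stated above can be checked by brute force on the hexagon problem. The following Python sketch is only an illustration under stated assumptions: the vertices are numbered 0 to 5, an arrangement is recorded as the set of the two vertices carrying silicon, and the names `dihedral_group`, `count_orbits`, and `move` are hypothetical helpers, not notation from the article.

```python
from itertools import combinations


def dihedral_group(n):
    """Symmetries of a regular n-gon as tuples p, with p[i] the image of vertex i."""
    rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
    reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
    return rotations + reflections


def count_orbits(group, points, act):
    """Burnside count: the average over g of the number of points fixed by g."""
    fixed = sum(1 for g in group for x in points if act(g, x) == x)
    return fixed // len(group)


def move(g, positions):
    """Apply the vertex permutation g to a set of vertex positions."""
    return frozenset(g[i] for i in positions)


# An arrangement is determined by which 2 of the 6 vertices carry silicon.
arrangements = [frozenset(c) for c in combinations(range(6), 2)]
num_arrangements = count_orbits(dihedral_group(6), arrangements, move)
```

Here the 12 group elements fix 36 arrangements in total, so `num_arrangements` is 36/12 = 3: the two silicon atoms can be adjacent, separated by one carbon, or opposite each other.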
TABLE I Cycle Structure of Hexagonal Symmetries

| Rotations | Cycle structure | Cycle counts | Reflections | Cycle structure | Cycle counts |
|---|---|---|---|---|---|
| $e$ | 6 1-cycles | $c_1 = 6$ | | | |
| (123456), (654321) | 1 6-cycle | $c_6 = 1$ | About 2 vertices: (26)(35), (13)(46), (15)(24) | 2 2-cycles | $c_1 = c_2 = 2$ |
| (135)(246), (153)(264) | 2 3-cycles | $c_3 = 2$ | | | |
| (14)(25)(36) | 3 2-cycles | $c_2 = 3$ | About an edge center: (12)(36)(45), (23)(14)(65), (16)(25)(34) | 3 2-cycles | $c_2 = 3$ |
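The cycle counts in Table I, and the Redfield–Pólya coefficient they lead to, can be recomputed mechanically. The sketch below is an illustration under stated assumptions, not the authors' procedure: vertices are numbered 0 to 5, the helper names are hypothetical, and the two labels are handled by substituting $\theta_1 = t$ (silicon) and $\theta_2 = 1$ (carbon), so that each cycle of length $i$ contributes a factor $1 + t^i$ and the wanted number is the coefficient of $t^2$.

```python
from collections import Counter


def dihedral_group(n):
    """Symmetries of a regular n-gon as tuples p, with p[i] the image of vertex i."""
    rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
    reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
    return rotations + reflections


def cycle_lengths(p):
    """Sorted tuple of the cycle lengths of p, 1-cycles included."""
    seen, lengths = set(), []
    for start in range(len(p)):
        if start in seen:
            continue
        length, j = 0, start
        while j not in seen:
            seen.add(j)
            j = p[j]
            length += 1
        lengths.append(length)
    return tuple(sorted(lengths))


def poly_mul(a, b):
    """Multiply two polynomials in t given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def count_labelings(group, num_marked):
    """Coefficient of t**num_marked in the cycle index evaluated at x_i = 1 + t**i."""
    total = 0
    for g in group:
        poly = [1]
        for length in cycle_lengths(g):
            poly = poly_mul(poly, [1] + [0] * (length - 1) + [1])  # 1 + t**length
        if num_marked < len(poly):
            total += poly[num_marked]
    return total // len(group)


G = dihedral_group(6)
tally = Counter(cycle_lengths(g) for g in G)
```

The tally has one element of type (1,1,1,1,1,1), two of type (6), two of type (3,3), four of type (2,2,2), and three of type (1,1,2,2), matching the table, and `count_labelings(G, 2)` evaluates to 3.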

In the problem above this yields the coefficient of $\theta_1^2 \theta_2^4$ in

$$\frac{1}{12}\left[(\theta_1 + \theta_2)^6 + 2\left(\theta_1^6 + \theta_2^6\right) + 2\left(\theta_1^3 + \theta_2^3\right)^2 + 4\left(\theta_1^2 + \theta_2^2\right)^3 + 3\left(\theta_1^2 + \theta_2^2\right)^2 (\theta_1 + \theta_2)^2\right]$$

That coefficient is

$$\frac{1}{12}\left[\binom{6}{2} + 4\binom{3}{1} + 3(3)\right] = \frac{1}{12}(15 + 12 + 9) = 3$$

IV. VECTOR SPACES

A. Vector Space Axioms

A vector space is a set having a commutative group addition, and a multiplication by another set of quantities (magnitudes) called a field. A field is a set $F$ such as $\mathbb{R}$ or $\mathbb{C}$ having addition and multiplication $F \times F \to F$ such that the axioms in Table II hold for all $x$, $y$, $z$ and some 0, 1 in $F$.

TABLE II Field Axioms

| Property | Addition | Multiplication |
|---|---|---|
| Commutative | $x + y = y + x$ | $xy = yx$ |
| Associative | $(x + y) + z = x + (y + z)$ | $(xy)z = x(yz)$ |
| Identity | For all $x \in F$, $x + 0 = x$ | For all $x \in F$, $x1 = x$ |
| Inverse | For all $x \in F$, there exists $-x$ such that $x + (-x) = 0$ | For all $x \in F$ ($x \neq 0$), there exists $x^{-1}$ such that $x x^{-1} = 1$ |
| Distributive | $x(y + z) = xy + xz$ | |

The standard example of a vector space is the set $\mathcal{V}_n$ of $n$-tuples of members of $F$. Let $(x_1, x_2, \ldots, x_n), (y_1, y_2, \ldots, y_n) \in \mathcal{V}_n$. Then addition is given by $(x_1, x_2, \ldots, x_n) + (y_1, y_2, \ldots, y_n) = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n)$ and multiplication by $c$ is given by $c(x_1, x_2, \ldots, x_n) = (cx_1, cx_2, \ldots, cx_n)$, where $c \in F$. For $F = \mathbb{R}$, $\mathcal{V}_3$ is the set of vectors considered in physics. These can be regarded as positions (displacement), velocities, or forces: each adds in the prescribed way.

A general vector space is a set $\mathcal{V}$ having an addition $\mathcal{V} \times \mathcal{V} \to \mathcal{V}$ and a scalar multiplication $F \times \mathcal{V} \to \mathcal{V}$ such that $(\mathcal{V}, +)$ is a commutative group and for all $a, b \in F$, $v, w \in \mathcal{V}$: $a(v + w) = av + aw$, $(a + b)v = av + bv$, $(ab)v = a(bv)$, and $cv = v$, where $c = 1 \in F$. Incidentally, $F$ is called the field of scalars, and multiplication by its elements scalar multiplication.

B. Basis and Dimension

The span $\langle S \rangle$ of a subset $S \subset \mathcal{V}$ is $\{a_1 s_1 + a_2 s_2 + \cdots + a_n s_n : n \in \mathbb{Z}^+, s_i \in S, a_i \in F\} \cup \{0\}$. The elements $a_1 s_1 + a_2 s_2 + \cdots + a_n s_n$ are called linear combinations of $s_1, s_2, \ldots, s_n$. A sum of two linear combinations is a linear combination and a multiple by $c \in F$ of a linear combination is a linear combination. A subset of a vector space closed under addition and scalar multiplication is called a subspace. Therefore, the span of any set is a subspace. A subspace is a vector space in itself. Its span is itself. It follows that $\langle S \rangle$ lies in any subspace containing $S$.

An indexed family $t_\alpha$, $\alpha \in I$, of vectors is said to be linearly independent if no $t_\gamma$ is a linear combination of the $t_\alpha$, $\alpha \neq \gamma$, and no $t_\gamma = 0$. Otherwise, it is called linearly dependent. A set of vectors $W$ is likewise said to be linearly independent if for all $x \in W$, $x \notin \langle W \setminus \{x\} \rangle$.

A basis for a vector space is a linearly independent spanning set for the vector space. A chain in a poset is a subset that is linearly ordered by the partial order. Zorn's lemma states that in a poset $X$ in which every chain has an upper bound there is a maximal element of $X$. Take $X$ to be the family of linearly independent sets in $\mathcal{V}$. Every chain has its union as an upper bound. A maximal linearly independent set therefore exists. It is a basis. Therefore, every vector space has a basis.
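For a finite list of vectors, a maximal linearly independent subset can be found without Zorn's lemma by keeping each vector that is not in the span of those kept before it. The Python sketch below illustrates this under explicit assumptions: the field is taken to be the rationals (via `fractions.Fraction`), vectors are tuples of integers or fractions of equal length, and `extract_basis` is a hypothetical name.

```python
from fractions import Fraction


def extract_basis(vectors):
    """Greedily keep each vector that is not in the span of those kept so far.

    The kept vectors are linearly independent and span the same subspace
    as the whole input list.
    """
    echelon = []  # pairs (pivot index, row scaled so that row[pivot] == 1)
    basis = []
    for v in vectors:
        row = [Fraction(c) for c in v]
        # Subtract multiples of the stored rows to clear every pivot position.
        for pivot, r in echelon:
            if row[pivot] != 0:
                factor = row[pivot]
                row = [a - factor * b for a, b in zip(row, r)]
        # A nonzero remainder means v is independent of the kept vectors.
        pivot = next((i for i, c in enumerate(row) if c != 0), None)
        if pivot is not None:
            lead = row[pivot]
            echelon.append((pivot, [c / lead for c in row]))
            basis.append(v)
    return basis


sample = [(1, 2, 3), (2, 4, 6), (0, 1, 1), (1, 3, 4), (0, 0, 1)]
```

With this input, `extract_basis(sample)` returns `[(1, 2, 3), (0, 1, 1), (0, 0, 1)]`: the second vector is twice the first, and the fourth is the sum of the first and third.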
A linearly ordered set is well ordered if every nonempty subset has a least element. Using Zorn's lemma it can be shown that every set can be well ordered. From every set $\{x_\alpha\}$ of vectors where the index set is well ordered we can pick a subset $\{x_\gamma\}$ that is linearly independent and spans the same set. Let $B = \{x_\gamma : x_\gamma \notin \langle \{x_\alpha : \alpha < \gamma\} \rangle\}$. Then no element of $B$ is linearly dependent on previous vectors. But if some element of $B$ were linearly dependent, let $y = \sum a_i x_i$, $a_i \neq 0$. Let $x_n$ be the last of the $x_i$ in the ordering. Then $x_n = (1/a_n)(y - \sum_{i \neq n} a_i x_i)$ is linearly dependent on previous vectors. This is a contradiction.

Two bases for a vector space have the same cardinal number. For finite bases $u_1, u_2, \ldots, u_n$, $v_1, v_2, \ldots, v_m$ let $m < n$. Expressions $u_i = \sum_j a_{ij} v_j$ exist since $\langle \{v_i\} \rangle = \langle \{u_i\} \rangle$. Since $m < n$, we can find a nonzero solution of the $m$ linear equations $\sum_i w_i a_{ij} = 0$, $j = 1$ to $m$, in $n$ variables $w_i$. Then $\sum w_i u_i = 0$, not all $w_i = 0$. This contradicts linear independence of the $u_i$.

The cardinality of a basis for a vector space is called the dimension of the vector space.

For vector spaces $\mathcal{V}$, $\mathcal{W}$ a homomorphism (isomorphism) is a (1–1 and onto) function $f : \mathcal{V} \to \mathcal{W}$ such that $f(x + y) = f(x) + f(y)$ and $f(cx) = c f(x)$ for all $x, y \in \mathcal{V}$, $c \in F$. If two vector spaces over $F$ have the same dimension, choose bases $u_\alpha$, $v_\alpha$ for them. Every element of $\mathcal{V}$ has a unique expression $\sum a_i u_i$, $u_i \in \{u_\alpha\}$, and every element of $\mathcal{W}$ has a unique expression $\sum a_i v_i$, $v_i \in \{v_\alpha\}$. Therefore, $f(\sum a_i u_i) = \sum a_i v_i$ defines an isomorphism $\mathcal{V}$ to $\mathcal{W}$. Conversely, if two vector spaces are isomorphic, their dimensions must be equal since if $\{u_\alpha\}$ is a linearly independent (spanning) set for $\mathcal{V}$, $\{f(u_\alpha)\}$ will be a linearly independent (spanning) set for $\mathcal{W}$.

C. Linear Transformations

A linear transformation is another phrase for homomorphism of vector spaces. To every subspace $\mathcal{W}$ of a vector space $\mathcal{V}$ is associated the quotient space $\mathcal{V}/\mathcal{W}$. This is the set of equivalence classes of the equivalence relation $x - y \in \mathcal{W}$. Transitivity follows from $(x - y) + (y - z) = x - z \in \mathcal{W}$ if $x - y$, $y - z$ do. If $x - y \in \mathcal{W}$, $u - v \in \mathcal{W}$, then

alent to $\sum b_i t_i$. No two distinct sums $\sum b_i t_i$ can have difference a nonzero sum $\sum a_i s_i$, so the classes of the $t_i$ are linearly independent in $\mathcal{V}/\mathcal{W}$. Therefore the $\bar{t}_\tau$ give a basis for $\mathcal{V}/\mathcal{W}$.

The external direct sum of two vector spaces $\mathcal{V}$, $\mathcal{W}$ is $\mathcal{V} \times \mathcal{W}$ with operations $(v_1, w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2)$, and $c(v, w) = (cv, cw)$. It has as basis the disjoint union of any bases for $\mathcal{V}$, $\mathcal{W}$. It is denoted $\mathcal{V} \oplus \mathcal{W}$. Therefore, if $\mathcal{W} \subset \mathcal{V}$, $\mathcal{V}$ is isomorphic to $\mathcal{W} \oplus \mathcal{V}/\mathcal{W}$. Moreover, $\dim(\mathcal{W}) + \dim(\mathcal{V}/\mathcal{W}) = \dim(\mathcal{V})$, where $\dim(\mathcal{V})$ denotes the dimension of $\mathcal{V}$.

A complement to a subspace $\mathcal{W} \subset \mathcal{V}$ is a subspace $\mathcal{U} \subset \mathcal{V}$ such that $\mathcal{U} \cap \mathcal{W} = \{0\}$, $\mathcal{U} + \mathcal{W} = \mathcal{V}$, where $\mathcal{U} + \mathcal{W} = \{u + w : u \in \mathcal{U}, w \in \mathcal{W}\}$. Every subspace has a complement by the construction above with $\mathcal{U} = \langle \{t_\alpha\} \rangle$. The mapping $\sum a_i t_i \to \sum a_i \bar{t}_i$ gives an isomorphism $\mathcal{U} \to \mathcal{V}/\mathcal{W}$.

For a linear transformation $f : \mathcal{V} \to \mathcal{W}$, the image set $\mathrm{Im}(f)$ and the null space (kernel) $\mathrm{Nu}(f) = \{v \in \mathcal{V} : f(v) = 0\}$ are both subspaces. Results on group homomorphisms imply that $f$ gives an isomorphism $\bar{f} : \mathcal{V}/\mathrm{Nu}(f) \to \mathrm{Im}(f)$. A linear transformation is 1–1 (onto) if and only if $\mathrm{Nu}(f) = 0$ ($\mathrm{Im}(f) = \mathcal{W}$). The dimension of $\mathrm{Im}(f)$ is called the rank of $f$.

Choose bases $x_i$, $y_i$, $z_i$ for vector spaces $\mathcal{U}$, $\mathcal{V}$, $\mathcal{W}$. Let $f$ be a homomorphism from $\mathcal{U}$ to $\mathcal{V}$ and $g$ a homomorphism from $\mathcal{V}$ to $\mathcal{W}$. There exist unique coefficients $a_{ij}$, $b_{ij}$ such that $f(x_i) = \sum_j a_{ij} y_j$ and $g(y_i) = \sum_j b_{ij} z_j$. Then $g(f(x_i)) = g(\sum_j a_{ij} y_j) = \sum_j a_{ij} g(y_j) = \sum_{j,k} a_{ij} b_{jk} z_k$. Therefore, the composition $g \circ f$ sends $x_i$ to $\sum_{j,k} a_{ij} b_{jk} z_k$. This rule defines a product on $nm$-tuples $a_{ij}$, which must be associative since composition of functions is. It is distributive since $f(g(x) + h(x)) = f(g(x)) + f(h(x))$ and $(g + h)(f(x)) = g(f(x)) + h(f(x))$.

D. Matrices and Determinants

An $n \times m$ matrix $A$ is an $nm$-tuple $(a_{ij})$ indexed on $i = 1, 2, \ldots, n$, and $j = 1, 2, \ldots, m$. Matrices are represented as rectangles of elements where $a_{ij}$ is placed in horizontal line $i$ (row $i$) and vertical line $j$ (column $j$).

Addition is entrywise. The entries of both matrices in location $(i, j)$ are added to form the $(i, j)$-entry of the sum.
(x + u) − (y + v) ∈ w. So addition of equivalence classes The (i, j)-entry of the product is formed by multiplying
is well defined; so is scalar multiplication since a(x − y) row i of the first factor by column j of the second factor
∈ w if x − y does. entry by entry and adding all these products. For example,
The quotient space, additively, is the same as the quo-
tient group. Let {sσ } be a basis for w. We can find a basis a11 a12 b11 b12
for v of the form {sσ } ∪ {tτ }, where {sσ } ∩ {tτ } = ∅. Let qγ a21 a22 b21 b22
be a well-ordered basis for v. Delete each qα that is linearly
dependent on prior qγ together with the sσ . The remaining a11 b11 + a12 b21 a11 b12 + a12 b22
set with sσ is a spanning set and is linearly independent. =
a21 b11 + a22 b21 a21 b12 + a22 b22
This is a basis for v, which contains a basis for w.
Let {sσ } ∪ {tτ } denote the resulting basis. Since every This is the same as the operation defined at the end of the
element of w has the form ai si + bi ti it will be equiv- last section. To multiply two n-square matrices by the
formula requires $n^3$ multiplications and $n^2(n - 1)$ additions. There does exist a slightly improved computer method known as fast matrix multiplication.

Matrices obey the following laws: (1) $A + B = B + A$, (2) $(A + B) + C = A + (B + C)$, (3) $(AB)C = A(BC)$, (4) $A(B + C) = AB + AC$, and (5) $0 + A = A$. Here 0 denotes the zero matrix, all the entries of which are 0. It is an additive identity.

There is a 1–1 correspondence from the set of linear transformations from an $n$-dimensional vector space to an $m$-dimensional vector space to the set of $n \times m$ matrices, given bases in these. As in the last subsection, $(a_{ij})$ corresponds to $f$ such that $f(x_i) = \sum_j a_{ij} y_j$.

Consider linear transformations from a vector space with basis $\{x_i\}$ to itself. Let $f$ be represented as $(a_{ij})$. Let $\{w_j\}$ be another basis, where $x_i = \sum_j b_{ij} w_j$ and $w_j = \sum_i c_{ji} x_i$. Then $f(w_j) = \sum c_{ji} f(x_i) = \sum c_{ji} a_{ik} x_k = \sum c_{ji} a_{ik} b_{km} w_m$. From $w_j = \sum c_{ji} x_i = \sum c_{ji} b_{ik} w_k$ it follows that $CB = I$, where $I$ is the matrix

$$I = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}$$

known as a (multiplicative) identity. It acts as a two-sided identity: $IA = AI = A$. We also have $x_i = \sum b_{ij} w_j = \sum b_{ij} c_{jk} x_k$, so $BC = I$. Therefore, $B = C^{-1}$. Here $C^{-1}$ denotes the (multiplicative) inverse of $C$. Then, expressed in terms of the $w_j$, $f$ is $CAB = CAC^{-1}$. Two matrices $X$ and $YXY^{-1}$ are said to be similar.

A linear transformation represented as $(a_{ij})$ sends $\sum_i c_i x_i$ to $\sum_{i,j} c_i a_{ij} y_j$. The matrix product $(c_i)(a_{ij})$ is $(\sum_i c_i a_{ij})$. A row (column) vector is a $1 \times n$ ($n \times 1$) matrix. So the linear transformation is equivalent to that given by matrix multiplication on row vectors. Matrices also act as linear transformations on column vectors.

The rank of a matrix is its rank as a linear transformation on row vectors. Taking column vectors gives the same number.

The image space is spanned by the rows of the matrix, since $(\sum_i c_i a_{ij})$ is the sum of $c_i$ times the $i$th row of $A$. Therefore, the row rank is the maximum size of a set of linearly independent rows. The $i$th row (column) of $A$ is denoted $A_{i*}$ ($A_{*i}$).

In general, an n-square matrix $X$ is invertible if there exists $Y$ with $XY = I$ or $YX = I$. Either equation implies the other and is equivalent to the rank of the matrix being $n$.

The determinant of a matrix $A$ is defined to be

$$\det(A) = \sum_{\pi \in S_n} \mathrm{sgn}(\pi)\, a_{1\pi(1)} a_{2\pi(2)} \cdots a_{n\pi(n)}$$

where the summation is over the set $S_n$ of all permutations $\pi$ of $\{1, 2, \ldots, n\}$, and $\mathrm{sgn}(\pi)$ denotes the sign of $\pi$, so $\mathrm{sgn}(\pi)$ is $+1$ or $-1$ according to whether $\pi$ is even or odd. As before, a permutation is even or odd according to whether it is the product of an even or odd number of transpositions. The determinant has the following properties, which can be proved in turn. Let $(a_{ij})^T$ denote $(a_{ji})$, called the transpose of $A$. This operation changes rows into columns and columns into rows.

(D-1) Since $\mathrm{sgn}(\sigma) = \mathrm{sgn}(\sigma^{-1})$ for $\sigma \in S_n$, $\det(A) = \det(A^T)$.
(D-2) If $A_{i*} = B_{i*} = C_{i*}$ for $i \neq k$ and $C_{k*} = A_{k*} + B_{k*}$, then $\det(C) = \det(A) + \det(B)$.
(D-3) If the rows of $A$ are permuted by a permutation $\pi$, the determinant of $A$ is multiplied by $\mathrm{sgn}(\pi)$.
(D-4) If two rows (columns) of $A$ are equal, then $\det(A) = 0$.
(D-5) If any row is multiplied by $k$, then the determinant is multiplied by $k$.
(D-6) If $A_{i*}$ is replaced by $A_{i*} - k A_{j*}$, $i \neq j$, then the determinant is unchanged.
(D-7) If $a_{ij} = 0$ for $i > j$, then $\det(A) = a_{11} a_{22} \cdots a_{nn}$.
(D-8) $\det(AB) = \det(A) \det(B)$.
(D-9) $\det(A) \neq 0$ if and only if $A$ has an inverse.
(D-10) Let $A[i|j]$ be the submatrix of $A$ obtained by deleting row $i$ and column $j$. The $(i, j)$th cofactor of $a_{ij}$ is $(-1)^{i+j} \det A[i|j]$ and it is denoted as $C[i|j]$:

$$\det(A) = \sum_{j=1}^{n} a_{rj} C[r|j] = \sum_{i=1}^{n} a_{is} C[i|s]$$

Property (D-1) ensures that for properties of determinants stated in terms of their rows, equivalent properties can be stated in terms of columns. These sometimes will not be explicitly mentioned.

It follows from (D-4), (D-9), and (D-10) that if $A$ has an inverse

$$A^{-1} = \frac{1}{\det(A)} (C[i|j])^T$$

From (D-7) and (D-8), $\det(I) = 1$ and $\det(A^{-1}) = 1/\det(A)$.

E. Boolean Vectors and Matrices

Most of the theory of the last section holds in the case of Boolean matrices. Boolean row (column) vectors are $1 \times n$ ($n \times 1$) Boolean matrices. The set $V_n$ is the set of all Boolean $n$-tuples, and additive homomorphisms preserving 0 from $V_n$ to $V_m$ are in 1–1 correspondence with Boolean matrices.

Matrices over a Boolean algebra or field can be multiplied or added in blocks using the same formulas as
for individual entries: $(A_{ij}) + (B_{ij}) = (A_{ij} + B_{ij})$ and $(A_{ij})(B_{ij}) = (\sum_k A_{ik} B_{kj})$. Here the set of all indices has been partitioned into $r$ subsets $S_1, S_2, \ldots, S_r$ and $A_{ij}$ is the submatrix consisting of all $a_{km}$ such that $(k, m) \in S_i \times S_j$. Usually the numbers in each $S_t$ are assumed adjacent. For example,

$$\begin{pmatrix} 1 & 2 & 5 & 6 \\ 3 & 4 & 7 & 8 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 & 5 & 6 \\ 3 & 4 & 7 & 8 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} & \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} + \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \\ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} & \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \end{pmatrix}$$

A block lower triangular form is one such that $A_{ij} = 0$ for $i < j$. Every Boolean matrix can be put uniquely into a block triangular form such that main diagonal blocks $A_{ii}$ are indecomposable or zero. This means $\sum_{n=0}^{\infty} A_{ii}^n = J$, the matrix consisting entirely of 1, for each $i$.

If the Boolean matrix is idempotent, every block in this block triangular form consists identically of 0 or of 1.

A Boolean matrix is invertible if and only if it is the matrix of a permutation. The row (column) rank of a Boolean matrix is the number of vectors in a row (column) basis. An n-square Boolean matrix is nonsingular if it has row and column rank $n$ and it is regular. Incidentally, the row and column rank of a Boolean matrix are not necessarily equal when $n \geq 4$. For a nonsingular Boolean matrix, there exists an analog of the formula for inverses. The permanent of a Boolean matrix $A$ is

$$\mathrm{per}(A) = \sum_{\pi \in S_n} a_{1\pi(1)} a_{2\pi(2)} \cdots a_{n\pi(n)}$$

A nonsingular Boolean matrix $A$ has a unique Thierrin–Vagner inverse given by $(\mathrm{per}(A[i|j]))^T$. The theories of Boolean matrices and semirings found more and more applications related to computers in the 1990s, such as communication complexity and Schein rank, image processing, and the study of parallel computation.

F. Characteristic Polynomials

Two linear transformations on a vector space are isomorphic (as binary relations) by an isomorphism of the vector space if and only if their matrices are similar. Similarity is an equivalence relation. To describe it completely we need a set of similarity invariants that distinguish any two nonsimilar matrices.

The most important similarity invariant is the characteristic polynomial defined by $p(\lambda) = \det(\lambda I - A)$. If $B = XAX^{-1}$, then

$$\det(\lambda I - XAX^{-1}) = \det(X(\lambda I - A)X^{-1}) = \det(X)\, p(\lambda)\, (\det(X))^{-1} = p(\lambda)$$

Therefore, it is a similarity invariant. Its roots are called eigenvalues (characteristic values, latent values).

The matrix $A$ itself satisfies $p(A) = 0$. The coefficient of $\lambda^{n-1}$ is $-\sum a_{ii}$. The quantity $\sum a_{ii}$ is called the trace of $A$ and it is denoted by $\mathrm{Tr}(A)$. We have $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ and the trace is an additive homomorphism. The last coefficient is $(-1)^n \det(A)$.

An eigenvector (characteristic vector, latent vector) $v$ of an n-square matrix is a nonzero vector such that $vA = \lambda v$ for some $\lambda$. There exists an eigenvector for $\lambda$ if and only if $v(A - \lambda I) = 0$ for some nonzero $v$, if and only if $A - \lambda I$ has rank less than $n$, if and only if $p(\lambda) = 0$. That is, $\lambda$ must be an eigenvalue.

V. RINGS

A. Ring of Integers

The ring of integers satisfies the following algebraic axioms. Let $x, y, z \in \mathbf{Z}$. (R-1) $x + y = y + x$; (R-2) $(x + y) + z = x + (y + z)$; (R-3) $x(y + z) = xy + xz$; (R-4) $(y + z)x = yx + zx$; (R-5) $(xy)z = x(yz)$; (R-6) there exists 0 such that $0 + x = x$ for all $x$; (R-7) for all $x$ there exists $-x$ such that $x + (-x) = 0$; (R-8) $xy = yx$; (R-9) if $xz = yz$, then $x = y$ or $z = 0$; (R-10) there exists 1 such that $1 \cdot x = x$ for all $x$.

Any structure of a set with two binary operations satisfying (R-1) to (R-7) is called a ring. If (R-8) holds, it is called a commutative ring. The sets of n-square matrices over $\mathbf{Z}$, $\mathbf{Q}$, $\mathbf{R}$, $\mathbf{C}$ satisfy (R-1) to (R-7) and (R-10) but, for $n \geq 2$, not (R-8) and (R-9). Here $\mathbf{Q}$ denotes the set of all rational numbers. A ring satisfying (R-10) is called a ring with unit. A ring satisfying (R-1) to (R-10) is called an integral domain.

In addition the integers satisfy the properties that there exists a set $\mathbf{P} \subset \mathbf{Z}$ called the positive integers such that (R-11) $\mathbf{P}$ is closed under addition, (R-12) $\mathbf{P}$ is closed under multiplication, and (R-13) for all integers $x$ exactly one of these holds: $x = 0$, $x \in \mathbf{P}$, $-x \in \mathbf{P}$. (Obviously, $\mathbf{P} = \mathbf{Z}^+$.) A ring satisfying these is called an ordered ring. $\mathbf{R}$ but not $\mathbf{C}$ is an ordered ring.
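The failure of (R-8) and (R-9) for matrix rings can be seen on small cases. The following is a minimal sketch for 2-square integer matrices; the helper name `mat_mul` and the particular matrices `X`, `Y`, `Z` are illustrative choices, not part of the text above.

```python
# 2x2 integer matrices stored as nested tuples.
# mat_mul, X, Y, Z are hypothetical names chosen for this illustration.

def mat_mul(a, b):
    """Row-by-column product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

X = ((1, 1), (0, 1))
Y = ((1, 0), (1, 1))
Z = ((0, 1), (0, 0))      # a nonzero matrix whose square is zero
ZERO = ((0, 0), (0, 0))

# (R-8) fails: the two products differ.
xy = mat_mul(X, Y)
yx = mat_mul(Y, X)

# (R-9) fails: Z*Z equals ZERO*Z although Z != ZERO and Z != 0,
# so the factor Z cannot be cancelled.
zz = mat_mul(Z, Z)
zero_z = mat_mul(ZERO, Z)

print(xy, yx, zz == zero_z)
```

Here `Z` plays the role of both $x$ and $z$ in (R-9), with $y = 0$.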
The final axiom is the induction property (R-14): If a nonempty set $S$ is contained in $\mathbf{P}$ and $1 \in S$ and $1 + x \in S$ whenever $x \in S$, then $S = \mathbf{P}$.

The ordinary rules of algebra follow from these axioms, as $(1 + (-1) + 1) \cdot (-1) = (0 + 1) \cdot (-1) = 1 \cdot (-1) = -1$. Therefore, $-1 + (-1)(-1) + (-1) = -1$, $1 + (-1) + (-1)(-1) + (-1) + 1 = 1 + (-1) + 1$, and $(-1)(-1) = 1$.

B. Euclidean Domains

Many properties of $\mathbf{Z}$ are shared by the ring of polynomials over any field, and the rings $\{a + bi : a, b \in \mathbf{Z}\}$ and $\{a + b[(1 + \sqrt{3}\, i)/2] : a, b \in \mathbf{Z}\}$. To deal with these we define Euclidean domains. A Euclidean domain is a ring $\mathcal{R}$ satisfying (R-1) to (R-10) provided with a function $\omega(x)$ defined from nonzero elements of $\mathcal{R}$ into $\mathbf{Z}$ such that (R-15) $\omega(xy) \geq \omega(x)$; (R-16) for all $x, y \in \mathcal{R}$ with $y \neq 0$ there exist $q, r \in \mathcal{R}$ such that $x = qy + r$, and either $\omega(r) < \omega(y)$ or $r = 0$; and (R-17) $\omega(x) \geq 0$. For polynomials, let $\omega$ be the degree. For other cases, let it be $|x|$ (for the two rings of complex numbers, $|x|^2$ gives integer values).

In any ring with unit $\mathcal{R}$, the relation $x|y$ can be defined by: there exists $z \in \mathcal{R}$ with $zx = y$. This relation is transitive. If $x|y$ and $x|z$, then $x|y + z$ since $y = ax$, $z = bx$, $y + z = (a + b)x$ for some $a, b$. If $x|y$ then $xa|ya$ for any $a$. Since $1x = x$, it is a quasiorder. It is called (right) divisibility.

In an integral domain, suppose $y|x$ and $x|y$. Then $x = ry$, $y = sx$ for some $r, s$. If either of $x, y$ is zero then both are. Otherwise $rsx = ry = x = 1x$ and $rs = 1$ by (R-9). So $r$, $s$ are invertible.

A greatest common divisor (g.c.d.) of a set $S \subset \mathcal{R}$ is an element $g \in \mathcal{R}$ such that $g|x$ for all $x \in S$ and if $y|x$ for all $x \in S$ then $y|g$. Any two g.c.d.'s of the same set must be multiples of one another, and so multiples by an invertible element. Let $g$ be a g.c.d. of $S$, $h$ be a g.c.d. of a set $T$, and $m$ be a g.c.d. of $\{g, h\}$. Then $m$ is a g.c.d. of $S \cup T$. The g.c.d. is a greatest lower bound in the partial order of the set of equivalence classes of elements of $\mathcal{R}$ under the relation $x \sim ay$ if $a$ is invertible.

In a Euclidean domain, the Euclidean algorithm is an efficient way to calculate the g.c.d. of two elements $x, y$. Assume $\omega(x) \geq \omega(y)$. Let $d_0 = x$, $d_1 = y$. Choose a sequence $d_i$, $q_i$ for $i > 0$ by $d_i = q_{i+1} d_{i+1} + d_{i+2}$, where $\omega(d_{i+2}) < \omega(d_{i+1})$. The process terminates since $\omega$ is always a nonnegative integer, and in the last stage we have a remainder $d_{k+2} = 0$. Then $d_{k+1} | d_k$. By induction using the equations $d_i = q_{i+1} d_{i+1} + d_{i+2}$, it will divide the right side and therefore the left side. So it divides all $d_i$. Also $d_{i+2} = d_i - q_{i+1} d_{i+1}$ is a linear combination of $d_i$, $d_{i+1}$, so by induction $d_{k+1}$ is a linear combination of $x, y$. Therefore, if $z$ divides $x$ and $y$, it divides $d_{k+1}$. Therefore, the element $d_{k+1} = rx + sy$ is a g.c.d. of $x, y$.

When an argument is made by induction, it is understood that a statement $P(n)$ holds for $n = 1$ and that there is an argument showing that if $P(1), P(2), \ldots, P(n - 1)$ hold so does $P(n)$. Let $S = \{n \in \mathbf{P} : P(1), \ldots, P(n) \text{ are true}\}$. Then by the induction axiom (R-14), $S$ is all of $\mathbf{P}$. Arguments by induction are especially common in the theory of $\mathbf{Z}$.

The following is an example of the Euclidean algorithm for polynomials:

$$x^3 + 1 = x(x^2 + 1) + (-x + 1)$$
$$x^2 + 1 = (-x + 1)(-x - 1) + 2$$
$$-x + 1 = 2 \left( \frac{-x + 1}{2} \right)$$

Therefore, 2 (equivalently 1) is a g.c.d. of $x^2 + 1$, $x^3 + 1$. We can now express it in terms of them:

$$2 = (x^2 + 1) - (-x + 1)(-x - 1)$$
$$= (x^2 + 1) + (-x + 1)(x + 1)$$
$$2 = (x^2 + 1) + ((x^3 + 1) - x(x^2 + 1))(x + 1)$$
$$2 = (x^2 + 1) + (x + 1)(x^3 + 1) - x(x + 1)(x^2 + 1)$$
$$2 = (1 - x - x^2)(x^2 + 1) + (x + 1)(x^3 + 1)$$

In the second step, we substituted $x^3 + 1 - x(x^2 + 1)$ for $-x + 1$.

By induction there is a g.c.d. of $x_1, x_2, \ldots, x_k$ that is a linear combination $s_1 x_1 + s_2 x_2 + \cdots + s_k x_k$ of them. If we multiply by invertible elements, we obtain such an expression for any g.c.d.

If $a$ is invertible, then $\omega(ab) \geq \omega(b)$ and $\omega(b) = \omega(a^{-1} ab) \geq \omega(ab)$, so the two are equal. Conversely, if $\omega(ab) = \omega(b)$, $b \neq 0$, $b \in \mathcal{R}$, then $a$ is a unit [divide $b$ by $ab$; if $\omega(r) < \omega(ab) = \omega(b)$ we have a contradiction].

From this we can establish the basic facts of prime factorizations. An element $p \in \mathcal{R}$ is called prime if it is not invertible but whenever $p = xy$ one of $x, y$ is invertible. It follows that for any $a$, if $p$ does not divide $a$ then the g.c.d. of $p, a$ is invertible.

Suppose 1 is a g.c.d. of $c, a$. Let $1 = ar + cs$. If $c|ab$, then $c | abr + cbs = b$. Therefore, if $p$, a prime, divides $xy$, it divides $x$ or $y$. To factor an element $y$ of $\mathcal{R}$ into primes, choose a divisor $x$ that has minimal $\omega(x)$ among noninvertible elements. If $x = ut$ and $u$ is not invertible then $\omega(u) = \omega(x)$ and so $t$ is invertible and so $x$ is prime. Since $x$ is not invertible, $\omega(y/x) < \omega(y)$. So after a finite number of prime factors the process ends.

By induction we can show that if $p$ is a prime and divides $y_1 y_2 \cdots y_k$, then $p$ divides some $y_i$. From this it can be proved that if $a = p_1 p_2 \cdots p_m = q_1 q_2 \cdots q_n$, where $p_i, q_i$ are primes, then $n = m$ and the $q_i$ can be renumbered so that $p_i = u_i q_i$, where $u_i$ is invertible. Since $p_1 | q_1 q_2 \cdots q_n$, $p_1$ divides some $q_i$. Label it $q_1$. Since both are primes,
$p_1 = u_1 q_1$ for an invertible $u_1$. Cancelling reduces $n$, $m$ by 1 and the process can be repeated.

C. Ideals and Congruences

A ring homomorphism is a function $f : \mathcal{R} \to \mathcal{S}$ between rings satisfying $f(x + y) = f(x) + f(y)$ and $f(xy) = f(x) f(y)$. That is, it is a semigroup homomorphism for multiplication and a group homomorphism for addition.

The following are examples of ring homomorphisms. (1) The mapping from n-square matrices to m-square matrices for $m > n$, which adds to a matrix $m - n$ rows and columns of zero. (2) The mapping $f : \mathbf{Z} \to \mathbf{Z}_m$, $f(x) = \bar{x}$. For any two rings $\mathcal{R}$, $\mathcal{S}$ a product ring $\mathcal{R} \times \mathcal{S}$ is defined by operations $(a, b) + (c, d) = (a + c, b + d)$, $(a, b)(c, d) = (ac, bd)$. (3) The mapping $\mathcal{R} \times \mathcal{S} \to \mathcal{R}$ (or $\mathcal{S}$) sending $(a, b)$ to $a$ (or $b$) is a ring homomorphism. (4) For any complex number $c$, the evaluation $f(c)$ is a homomorphism from the ring of all functions $f : \mathbf{C} \to \mathbf{C}$ into $\mathbf{C}$ itself.

A subring of a ring is a subset closed under addition, subtraction, and multiplication. An ideal is a subset $S$ of a ring $\mathcal{R}$ closed under addition and subtraction (if $x, y \in S$, then $x - y \in S$) and such that if $z \in \mathcal{R}$, $x \in S$, then $xz, zx \in S$. The set of even integers is an ideal in the set of integers.

A congruence on a ring $\mathcal{R}$ is an equivalence relation $x \sim y$ such that for all $a \in \mathcal{R}$, if $x \sim y$ then $x + a \sim y + a$, $xa \sim ya$, and $ax \sim ay$. For any congruence, if $x \sim y$ and $z \sim w$ then $x + z \sim y + z \sim y + w$ and $xz \sim yw$. Then addition and multiplication of equivalence classes, $\bar{x} + \bar{y} = \overline{x + y}$ and $\bar{x}\bar{y} = \overline{xy}$, are well defined and give a ring called the quotient ring. The function $x \to \bar{x}$ is a homomorphism.

There is a 1–1 correspondence between ideals and congruences. To a congruence associate the set $\{x : x \sim 0\}$, which is an ideal $\mathcal{M}$. Then the congruence is determined by $x \sim y$ if and only if $x + (-y) \sim y + (-y) = 0$. Therefore, $x \sim y$ if and only if $x - y \in \mathcal{M}$. Every ideal in this way determines a congruence.

The equivalence classes have the form $x + \mathcal{M} = \{x + m : m \in \mathcal{M}\}$. The quotient ring is denoted $\mathcal{R}/\mathcal{M}$.

All ideals in the integers (in fact, all subgroups) have the form $m\mathbf{Z} = \{mx : x \in \mathbf{Z}\}$. The congruence associated with these is $a \equiv b \pmod{m}$ if and only if $a - b = km$ for some $k$. By the general theory, such congruences can be added, subtracted, multiplied, or raised to a power. Congruences can sometimes decide whether or not equations are solvable in whole numbers. For example, $x^2 + y^2 + z^2 = w$ is not possible if $w \equiv 7 \pmod{8}$. To prove this it can first be noted that $x^2 \equiv k \pmod{8}$ where $k = 0, 1, 4$, and no three elements of $\{0, 1, 4\}$ add to $7 \pmod{8}$.

The ring of all polynomials in a variable $x$ having coefficients in a field $\mathbf{K}$ is denoted $\mathbf{K}[x]$. As a set it is a subset of a Cartesian product of countably many copies of $\mathbf{K}$, namely the sequences $(a_0, a_1, \ldots)$ such that only finitely many $a_i$ are nonzero. The $a_i$ are the coefficients of the polynomial and multiplication can be defined by $(a_n)(b_n) = (\sum_r a_r b_{n-r})$. In it or any Euclidean domain, all ideals have the form $\{f(x)y : y \in \mathbf{K}[x]\}$ for some polynomial $f(x)$. Given a nonzero ideal $\mathcal{M}$, take a nonzero element $f(x)$ in it such that $\omega(f(x))$, the degree, is minimal.

If $g(x)$ is another element of the ideal, choose $q(x)$, $r(x)$ such that $r(x) = 0$ or $\omega(r(x)) < \omega(f(x))$ for $g(x) = f(x)q(x) + r(x)$. Then $r(x) = g(x) - f(x)q(x) \in \mathcal{M}$ since $g(x)$, $f(x)$ are. So $\omega(r(x)) < \omega(f(x))$ contradicts minimality of $\omega(f(x))$. So $r(x) = 0$. So every element in $\mathcal{M}$ is a multiple of $f(x)$. Conversely, multiples of $f(x)$ belong to $\mathcal{M}$ since it is an ideal.

Ideals of the form $a\mathcal{R} = \{ax : x \in \mathcal{R}\}$ are called principal, and $a$ is said to be a generator. Therefore, Euclidean domains are principal ideal domains, that is, integral domains in which all ideals are principal.

In all principal ideal domains, all elements can be uniquely factored in terms of primes and invertible elements. Let $\mathcal{R}_m$, where $m \in \mathbf{Z}$ is not divisible by the square of a prime, denote $\{a + b(1 + \sqrt{m})/2 : a, b \in \mathbf{Z}\}$ or $\{a + b\sqrt{m} : a, b \in \mathbf{Z}\}$ according to whether $m \equiv 1 \pmod{4}$ or not. If $0 < m < 25$, then $\mathcal{R}_m$ is a principal ideal domain if and only if $m \neq 10, 15$. If $-130 < m < 0$, then $\mathcal{R}_m$ is a principal ideal domain if and only if $m = -1, -2, -3, -7, -11, -19, -43, -67$. (This is a result proved by techniques of algebraic number theory.) Unique factorization does not hold in $\mathcal{R}_m$ if it is not a principal ideal domain.

For noncommutative rings, there are separate concepts of right and left ideals. A subset $S$ of a ring $\mathcal{R}$ closed under addition and subtraction is a left (right) ideal if and only if $ax \in S$ ($xa \in S$) for all $x \in S$, $a \in \mathcal{R}$. The ring of n-square matrices over a field has no (two-sided) ideals except itself and zero, but for any subspace $\mathcal{W}$ of row vectors, $\{M \in M_n(\mathbf{F}) : vM \in \mathcal{W} \text{ for all row vectors } v\}$ is a one-sided ideal (a left ideal in the sense just defined, since $v(AM) = (vA)M$). A proper ideal is an ideal that is a proper subset. The trivial ideal is $\{0\}$.

Suppose a ring $\mathcal{R}$ has not all products zero and has no proper nontrivial right ideals. If $ab \neq 0$ for some $b$, then $\{x : ax = 0\}$ is a proper right ideal and is therefore zero. The set $\{ax\}$ is a nonzero right ideal and is therefore $\mathcal{R}$. So multiplication on the left by $a$ is 1–1 and onto.

Let $\mathcal{N} = \{a : ab = 0 \text{ for all } b \in \mathcal{R}\}$. Then $\mathcal{N}$ is a proper right ideal and is therefore 0. So for all $a \neq 0$ left multiplication by $a$ is 1–1 and onto. So if $ab = 0$, then $a = 0$ or $b = 0$.

For any $a \neq 0$, since $\{ax\} = \mathcal{R}$, there exist $e$, $a^{-1}$ such that $ae = a$ and $aa^{-1} = e$. Then for any $x \in \mathcal{R}$, $aex = ax$. So $ex = x$ since left multiplication is 1–1. From $ab \neq 0$ for $a \neq 0$, $b \neq 0$, it follows that right multiplication is also 1–1. Since $xex = xx$, it follows that $xe = x$, so $e$ is an identity. From $aa^{-1}a = ea = a = ae$, it follows that $a^{-1}$ is an inverse of $a$. Therefore, in $\mathcal{R}$, all nonzero elements have inverses.
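The congruence argument for $x^2 + y^2 + z^2 = w$ given earlier in this subsection can be checked by enumeration. The following is a small sketch; the variable names `square_residues` and `three_square_sums` are illustrative only.

```python
from itertools import combinations_with_replacement

# Residues of x*x modulo 8; it suffices to try x = 0, ..., 7.
square_residues = sorted({(x * x) % 8 for x in range(8)})

# Every residue modulo 8 that a sum of three squares can take.
three_square_sums = sorted(
    {
        (a + b + c) % 8
        for a, b, c in combinations_with_replacement(square_residues, 3)
    }
)

print(square_residues)      # the possible values of k
print(three_square_sums)    # 7 does not appear
```

Since 7 is absent from the second list, no integer $w \equiv 7 \pmod{8}$ is a sum of three squares.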
A ring with unit in which all nonzero elements have inverses is called a division ring. The best known division ring that is not a field is the ring of quaternions. As a set, it is $\mathbf{R} \times \mathbf{R} \times \mathbf{R} \times \mathbf{R}$. Elements are written as $a + b\mathbf{i} + c\mathbf{j} + d\mathbf{k}$. Expressions are multiplied by expanding and using the rules $\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = -1$, $\mathbf{ij} = \mathbf{k}$, $\mathbf{jk} = \mathbf{i}$, $\mathbf{ki} = \mathbf{j}$, $\mathbf{ji} = -\mathbf{k}$, $\mathbf{kj} = -\mathbf{i}$, $\mathbf{ik} = -\mathbf{j}$. Any multiplication defined from a basis by expanding terms is distributive. The multiplication on basis elements and their negatives $\pm 1, \pm\mathbf{i}, \pm\mathbf{j}, \pm\mathbf{k}$ is a finite group of order 8. This implies associativity.

D. Structure of Zn

The ring $\mathbf{Z}_n$ is the ring of congruence classes modulo $n$. It is the quotient ring $\mathbf{Z}/\mathcal{M}_n$, where $\mathcal{M}_n = \{nx : x \in \mathbf{Z}\}$. For any quotient ring $\mathcal{R}/\mathcal{M}$, ideals of the quotient ring are in 1–1 correspondence with ideals of $\mathcal{R}$ containing $\mathcal{M}$. An ideal $\mathcal{M}_m \supset \mathcal{M}_n$ if and only if $m|n$. So the ideals of $\mathbf{Z}_n$ are in 1–1 correspondence with positive integers $m$ dividing $n$. The ring $\mathbf{Z}_n$ has exactly $n$ elements, since each integer $x$ can be expressed as $nk + r$ for a unique $r \in \{0, 1, \ldots, n - 1\}$, and then $x \equiv r \pmod{n}$. It has a unit and is commutative.

Two elements of an integral domain are called relatively prime if 1 is a g.c.d. of them. If $x, n$ are relatively prime, then $rx + sn = 1$ for some $r$ and $s$. Thus, $\bar{r}\bar{x} = 1$ so $\bar{x}$ is invertible. Conversely, if $\bar{r}\bar{x} = 1$, then $rx - 1 = sn$ for some $s$, and 1 is a g.c.d. of $x$ and $n$. Thus, the number of invertible elements equals $\varphi(n)$, the number of positive integers $m < n$ that are relatively prime to $n$.

The Chinese remainder theorem asserts that if all pairs of $n_1, n_2, \ldots, n_m$ are relatively prime and $a_i \in \mathbf{Z}$, $i = 1$ to $m$, then there exists $x \in \mathbf{Z}$ with $x \equiv a_i \pmod{n_i}$ for $i = 1$ to $m$. Any two such $x$ are congruent modulo $n_1 n_2 \cdots n_m$.

The set of invertible elements is closed under products since $(xy)^{-1} = y^{-1}x^{-1}$ and contains inverses and identity, so it forms a group for any ring. Here it is a finite group. For any invertible element $x$, the elements $1, x, \ldots, x^{k-1}$ form a subgroup if $k$ is the order of $x$. By Lagrange's theorem, the order of any subgroup of a group divides the order of the group. Therefore, $k | \varphi(n)$. From $x^k = 1$ follows $x^{\varphi(n)} = 1$. This proves a theorem of Euler and Fermat: for any $x \in \mathbf{Z}_n$, if $x$ is invertible, then $x^{\varphi(n)} = 1$.

If $p$ is prime, then $1, 2, \ldots, p - 1$ are all relatively prime to $p$, so $\varphi(p) = p - 1$, and $x^{p-1} \equiv 1 \pmod{p}$ if $x \not\equiv 0$. Then $x^p \equiv x \pmod{p}$ for all $x \in \mathbf{Z}_p$. Assume $p$ is prime for the rest of this section.

The multiplicative group $\mathbf{Z}_p^* = \{\bar{x} \neq \bar{0} : \bar{x} \in \mathbf{Z}_p\}$ is a commutative group of order $p - 1$. The ring $\mathbf{Z}_p$ is a field since $\mathbf{Z}_p^*$ is a group. Polynomials over $\mathbf{Z}_p$ can be uniquely factored into primes.

Over any field $\mathbf{K}$, if a polynomial $p(x)$ satisfies $p(k) = 0$, where $k \in \mathbf{K}$, then let $p(x) = (x - k)q(x) + r$. By substitution $x = k$ we find $r = 0$. So if $p(k) = 0$, then $(x - k) | p(x)$. Thus, a polynomial of degree $n$ cannot have more than $n$ distinct roots, since each root gives a degree 1 factor $(x - k)$.

The polynomial $x^{p-1} - 1$ has exactly $p - 1$ roots in $\mathbf{Z}_p$, that is, $1, 2, \ldots, p - 1$. Thus, it factors as $c(x - 1) \cdots (x - (p - 1))$. Since the coefficient of $x^{p-1}$ is 1, we have $c = 1$. The polynomial $x^r - 1$ divides $x^{p-1} - 1$ for $r | p - 1$, so by unique factorization it factors into linear factors and has exactly $r$ roots. For any prime power $r^t$ dividing $p - 1$, take a root $y_r$ of $x^{r^t} = 1$ which is not a root of $x^{r^{t-1}} = 1$. Then $y_r$ has order precisely $r^t$. Take $t$ maximal. The product $u$ of all $y_r$ will have order the least common multiple of the $r^t$, which is $p - 1$. Therefore, $u, u^2, \ldots, u^{p-1}$ are distinct and are all of $\mathbf{Z}_p^*$ since it has exactly $p - 1$ elements. This proves $\mathbf{Z}_p^*$ is cyclic. Therefore, it is isomorphic to $(\mathbf{Z}_{p-1}, +)$.

In $\mathbf{Z}_n$ for $m|n$, an element $y$ is a multiple of $m$ if and only if $(n/m)y \equiv 0$. Therefore, in $\mathbf{Z}_p^*$, an element $y$ is an $m$th power for $m | p - 1$ if and only if $y^{(p-1)/m} \equiv 1 \pmod{p}$. The $m$th powers form a multiplicative subgroup of order $(p - 1)/m$.

E. Simple and Semisimple Rings

A ring is called simple if it has no proper nontrivial (two-sided) ideals. A commutative simple ring is a field. It is simplest to treat the case of finite dimensional algebras. An algebra over a field $\mathbf{F}$ is a ring $\mathcal{A}$ provided with a multiplication $\mathbf{F} \times \mathcal{A} \to \mathcal{A}$ such that (1) $(ax)y = a(xy) = x(ay)$ for all $a \in \mathbf{F}$, $x, y \in \mathcal{A}$; and (2) $\mathcal{A}$ is a vector space over $\mathbf{F}$. The quaternions, the complex numbers, and the ring of all functions $\mathbf{R}$ to $\mathbf{R}$ are all algebras over $\mathbf{R}$.

A division algebra is an algebra that is a division ring. Division algebras can be classified in terms of fields. A field $\mathbf{F}$ is called algebraically closed if every polynomial $p(x) = a_0 x^n + a_1 x^{n-1} + \cdots + a_n x^0$, $a_i \in \mathbf{F}$, $a_0 \neq 0$, $n \neq 0$, has a root $r \in \mathbf{F}$. Suppose we have a division algebra $\mathcal{D}$ over an algebraically closed field $\mathbf{F}$ of finite dimension $n$. Let $a \in \mathcal{D}$. Then $1, a, \ldots, a^n$ are $n + 1$ elements in an $n$-dimensional vector space, so they cannot be linearly independent (otherwise they could be extended to a basis). So some combination $p(a) = c_0 + c_1 a + \cdots + c_n a^n = 0$. Choose the degree of $p$ minimal. Let $\deg(p)$ denote the degree of $p$. If $\deg(p) > 1$, then $p$ can be factored over $\mathbf{F}$. Each factor is nonzero evaluated at $a$, but their product is zero. This is impossible in a division algebra, where nonzero elements are invertible. Therefore, $\deg(p) = 1$, so $c_0 + c_1 a = 0$ and $a \in \mathbf{F}$. So $\mathcal{D} = \mathbf{F}$. Any such division algebra coincides with $\mathbf{F}$. This applies to $\mathbf{F} = \mathbf{C}$.

A finite dimensional $\mathbf{F}$-algebra is simple if and only if it is isomorphic to the ring $M_n(\mathcal{D})$ of n-square matrices over a division algebra $\mathcal{D}$. More generally, every ring having no infinite strictly
decreasing sequence of distinct left ideals (Artinian ring) have b(x, x) ≥ 0 and b(x, x) = 0 only if x = 0 since each
that is simple is an Mn (). term will be a vector vv ∗T = |vi2 |. We can inductively
A finite dimensional F-algebra A is called semisimple choose a basis u i for row vectors such that b(u i , u i ) = 1,
if as a vector space it is an internal direct sum of minimal b(u i , u j ) = 0 if i = j. In this basis, let u i f (g) = u k aki .
right ideals i . This means every x ∈ A can be uniquely Then b(u i , u j ) = b(u k aki , u k ak j ) = ak∗j aki b(u k , u k ).
expressed as xi , xi ∈ i . Left ideals can equivalently be So ak∗j aki = 1, 0 according to whether i = j or i = j.
used. This proves the matrix is now unitary.
This is equivalent to any of the following conditions. (1) For F = R, essentially the same can be done. A real
The multiplicative semigroup of is regular; (2) every left unitary matrix is called orthogonal. Orthogonal matrices
ideal of A is generated by an idempotent; and (3) A has no are geometrically products of rotations and reflections.
two-sided ideals such that m = {x1 x2 · · · xm : xi ∈ }
is zero for some m. Ideals such that m = 0 are called
G. Modules
nilpotent.
A finite dimensional semisipmle F-algebra is isomor- A module differs from a vector space only in that the ring
phic to a direct sum of rings $M_n(D)$.

If $N_1$ and $N_2$ are nilpotent two-sided ideals, so is $N_1 + N_2$, since $(N_1 + N_2)^{2n} \subset N_1^n + N_2^n$. This implies that every finite dimensional algebra has a maximal two-sided nilpotent ideal, the Jacobson radical, and its quotient ring by this ideal is semisimple.

All finite division rings are fields.

F. Group Rings and Group Representations

A group ring of a group $G$ over a ring $R$ is formally the set of all functions $f : G \to R$ such that $|\{g : f(g) \neq 0\}|$ is finite. Multiplication is given by $(fh)(x) = \sum_{yz=x} f(y)h(z)$ and addition is the usual sum of functions. Distributivity follows from the linear form of this expression. Associativity follows from $\{(u, v, w) : uv = y \text{ and } yw = x \text{ for some } y\} = \{(u, v, w) : vw = z,\ uz = x \text{ for some } z\}$. In fact, semigroups also have semigroup rings.

The group ring can be thought of as the set of all sums $r_1 g_1 + r_2 g_2 + \cdots + r_n g_n$ of group elements with coefficients in $R$. For coefficients in $\mathbf{Z}$, we have, for instance, $(2g + 1)(3 - 2g) = 3 - 2g + 6g - 4g^2 = 3 + 4g - 4g^2$.

A representation of a group $G$ is a homomorphism $h$ from $G$ into the ring $M_n(F)$ of $n$-square matrices over $F$. Two representations $f$, $h$ are called equivalent if there exists an invertible matrix $M \in M_n(F)$ with $f(g) = M h(g) M^{-1}$ for all $g \in G$. This is an equivalence relation.

Every representation $f$ of a group $G$ defines a ring homomorphism $h : F(G) \to M_n(F)$ by $h(\sum r_i g_i) = \sum r_i f(g_i)$ such that $h(1) = I$, where $I$ is an $n$-square identity matrix. This is a 1–1 correspondence since, if $h$ is such a ring homomorphism, $g \to h(g)$ gives a group representation.

For $F = \mathbf{C}$, every group representation is equivalent to a unitary representation, that is, one in which every matrix $f(g)$ satisfies $f(g) f(g)^{*T} = I$. Here $*$ is complex conjugation. Define a modified inner product on row vectors by $b(x, y) = \sum_g x f(g) (y f(g))^{*T}$. Then $b(x, y) = b(x f(h), y f(h))$ for $h \in G$ since multiplication by $h$ permutes the group elements $g$ and so permutes the terms in the sum. We

involved may not be a field. A left (right) $R$-module is a set $M$ provided with a binary operation denoted as addition and a multiplication $R \times M \to M$ ($M \times R \to M$), where $R$ is a ring such that the axioms in Table III hold for all $x, y, z \in M$, $r, s \in R$, and some $0 \in M$.

For group representations, we consider only unitary modules, those in which $1x = x$ for all $x \in M$. Henceforth, we consider left modules.

A homomorphism of modules is a mapping $f : M \to N$ such that $f(x + y) = f(x) + f(y)$ and $f(rx) = r f(x)$ for all $x, y \in M$ and $r \in R$. A submodule of $M$ is a subset closed under addition, subtraction, and multiplication by elements of $R$. For any submodule $N \subset M$, the equivalence relation $x \sim y$ if and only if $x - y \in N$ is a module congruence, that is, $x + z \sim y + z$, $rx \sim ry$ for all $z \in M$, $r \in R$. Then the equivalence classes form a new module called the quotient module.

Modules are a type of algebraic action, that is, a mapping $G \times S \to S$ for a structure $G$ and set $S$. Figure 8 classifies some of these.

Direct sums of modules are defined as for vector spaces. Unlike the vector space case, not every finite dimensional module over a general algebra $A$ is isomorphic to a direct sum $A \oplus A \oplus \cdots \oplus A$ of copies of $A$ (with operations from $A$). If a ring is an $F$-algebra, all modules over it are vector spaces and this determines their dimension.

TABLE III Module Axioms

Left module                          Right module
(x + y) + z = x + (y + z)            (x + y) + z = x + (y + z)
x + 0 = x                            0 + x = x
0x = 0                               x0 = 0
There exists −x such that            There exists −x such that
  x + (−x) = 0 for all x               x + (−x) = 0 for all x
r(x + y) = rx + ry                   (x + y)r = xr + yr
(r + s)x = rx + sx                   x(r + s) = xr + xs
r(sx) = (rs)x                        (xs)r = x(sr)
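The left-module axioms of Table III can be spot-checked numerically. The following is only a minimal sketch under an assumed example: the ring is taken to be the 2 × 2 integer matrices and the module the integer column vectors of length 2. The helper names (`act`, `check_left_module_axioms`, and so on) are illustrative and do not come from the text.

```python
import random

def mat_add(a, b):
    return tuple(tuple(a[i][j] + b[i][j] for j in range(2)) for i in range(2))

def mat_mul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def act(r, x):
    """Left action r x of a 2x2 matrix r on a column vector x."""
    return tuple(sum(r[i][k] * x[k] for k in range(2)) for i in range(2))

def vec_add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def check_left_module_axioms(trials=200, seed=0):
    """Test the left-module column of Table III on random samples."""
    rng = random.Random(seed)
    rand_vec = lambda: tuple(rng.randint(-5, 5) for _ in range(2))
    rand_mat = lambda: tuple(rand_vec() for _ in range(2))
    zero_vec, zero_mat = (0, 0), ((0, 0), (0, 0))
    identity = ((1, 0), (0, 1))
    for _ in range(trials):
        x, y, z = rand_vec(), rand_vec(), rand_vec()
        r, s = rand_mat(), rand_mat()
        neg_x = tuple(-a for a in x)
        ok = (
            vec_add(vec_add(x, y), z) == vec_add(x, vec_add(y, z))
            and vec_add(x, zero_vec) == x
            and act(zero_mat, x) == zero_vec
            and vec_add(x, neg_x) == zero_vec
            and act(r, vec_add(x, y)) == vec_add(act(r, x), act(r, y))
            and act(mat_add(r, s), x) == vec_add(act(r, x), act(s, x))
            and act(r, act(s, x)) == act(mat_mul(r, s), x)
            and act(identity, x) == x   # the unitary condition 1x = x
        )
        if not ok:
            return False
    return True
```

A random test of this kind cannot prove the axioms, but a single failing sample would show that a proposed action is not a module.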
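The convolution product of Section F can be illustrated by a short computation. The sketch below assumes, purely for illustration, the cyclic group of order 5 with $g^a$ stored as the integer $a$, and represents a group-ring element as a dictionary from group elements to integer coefficients. The function names are hypothetical.

```python
from collections import defaultdict

def group_ring_mul(f, h, op):
    """Product (fh)(x) = sum of f(y) h(z) over all pairs with yz = x."""
    out = defaultdict(int)
    for y, fy in f.items():
        for z, hz in h.items():
            out[op(y, z)] += fy * hz
    return {x: c for x, c in out.items() if c != 0}

def group_ring_add(f, h):
    """Pointwise sum of two finitely supported functions."""
    out = defaultdict(int)
    for x, c in list(f.items()) + list(h.items()):
        out[x] += c
    return {x: c for x, c in out.items() if c != 0}

N = 5
cyclic_op = lambda a, b: (a + b) % N   # g^a g^b = g^(a+b)

u = {1: 2, 0: 1}     # 2g + 1
v = {0: 3, 1: -2}    # 3 - 2g
product = group_ring_mul(u, v, cyclic_op)   # 3 + 4g - 4g^2
```

Here `product` holds the coefficients 3, 4, and −4 on $1$, $g$, and $g^2$, in agreement with the integer example in the text.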

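The invariance of the averaged inner product $b(x, y)$ from Section F can be checked on a small case. The sketch below assumes an example not taken from the text: the cyclic group of order 4 generated by a 90° rotation, conjugated by a shear so that the matrices are no longer unitary. Row vectors act on the left of matrices, as in the text.

```python
def mat_mul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def row_times(x, m):
    """Row vector x times matrix m."""
    return [sum(x[k] * m[k][j] for k in range(len(x))) for j in range(len(m[0]))]

# Assumed example: a rotation of order 4 conjugated by a shear.
rot = [[0, -1], [1, 0]]
shear, shear_inv = [[1, 1], [0, 1]], [[1, -1], [0, 1]]
gen = mat_mul(mat_mul(shear, rot), shear_inv)

identity = [[1, 0], [0, 1]]
group = [identity]
for _ in range(3):
    group.append(mat_mul(group[-1], gen))   # I, gen, gen^2, gen^3

def b(x, y):
    """Modified inner product: sum over g of (x f(g)) (y f(g))^{*T}."""
    total = 0
    for m in group:
        xm, ym = row_times(x, m), row_times(y, m)
        total += sum(p * complex(q).conjugate() for p, q in zip(xm, ym))
    return total
```

With respect to `b`, every matrix in `group` preserves inner products, although the generator `gen` does not preserve the ordinary length of the row vector (1, 0).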

FIGURE 8 Algebraic actions.

There is a 1–1 correspondence between equivalence classes of representations in $M_n(F)$ and isomorphism classes of modules over $F(G)$ of dimension $n$, where $F(G)$ is the group ring of $G$ over $F$. For any group representation $h$, the set of column vectors is a left module over $F(G)$ with action given by $(\sum c_i g_i)v = \sum c_i h(g_i)v$. For any module of dimension $n$, the group action gives a set of linear transformations of a vector space. If we choose a basis, the matrices of these linear transformations give a group representation.

H. Irreducible Representations

A module is called irreducible if it has no proper nonzero submodules. Suppose $N \subset M$, where $N$, $M$ are finite dimensional $F(G)$-modules. Then there exists a vector space complement $V$ of $N$ in $M$. Then there is a natural isomorphism $f : V \to M/N$ of vector spaces. Consider $f^{-1} : M/N \to V \subset M$. Assume $|G| \neq 0$ in $F$ (if $F$ is $\mathbf{R}$ or $\mathbf{C}$, for instance). Let

$$h(x) = \frac{1}{|G|} \sum_{g} g^{-1} f^{-1}(gx).$$

Then $h(gx) = gh(x)$, so $h$ is a module homomorphism. If we let $h_1$ be $h : M/N \to M$ composed with the quotient mapping $M \to M/N$, $h_1$ is the identity since each summand $g^{-1} f^{-1}(gx)$ maps to $g^{-1}(gx) = x$ in $M/N$, the quotient mapping being a module homomorphism. So $h$ is 1–1. Its image has dimension $\mathrm{Dim}(V)$ and intersection $\{0\}$ with $N$ since $h_1$ is 1–1. Thus the image of $h$ gives a module isomorphic to $M/N$ whose intersection with $N$ is $\{0\}$. Then $M = N \oplus \mathrm{Im}(h)$, where $\mathrm{Im}(h)$ is the image of $h$.

Therefore, all representations are direct sums of irreducible modules since we can repeatedly express a module as a direct sum.

If we regard left ideals as modules, the group ring is then a direct sum of irreducible left ideals. These ideals are minimal. Their number is at most the dimension of $F(G)$. Therefore, $F(G)$ is semisimple and is isomorphic to a direct sum of matrix rings $M_n(D)$. If $F = \mathbf{C}$, then $D = \mathbf{C}$ so it is isomorphic to a direct sum of rings $M_n(\mathbf{C})$.

For any irreducible module $M$, let $0 \neq x \in M$. The mapping $\sum c_i g_i \to \sum c_i g_i x$ gives a mapping $F(G)$ to $M$. This must be nonzero on some left ideal $N_i$ in a given direct sum decomposition $F(G) = \oplus N_i$. The mapping $N_i \to M$ has kernel a proper submodule and image a nontrivial submodule. So by irreducibility it has kernel zero and image $M$. Therefore, every irreducible module is isomorphic to a member of the finite set $\{N_i\}$.

Further use of these facts gives the basic facts about group representations such as orthogonality for group characters. The character of a representation is the function assigning to each group element $g$ the trace of the matrix $h(g)$.

VI. FIELDS

A. Integral Domains

An integral domain is a commutative ring with unit having cancellation. Any subring of a field must be an integral domain.

Conversely, let $R$ be an integral domain. We construct a field, the field of fractions, containing $R$. Let $S = R \times (R \setminus \{0\})$. We consider $(a, b) \in S$ as $a/b$. Therefore, since $a/b = c/d$ if $ad = bc$, we define an equivalence relation by $(a, b) \sim (c, d)$ if and only if $ad = bc$. Transitivity follows for $(a, b) \sim (c, d) \sim (e, f)$ since we have $ad = bc$, $cf = de$; therefore, $adf = bcf = bde$. By cancellation of $d \neq 0$, $af = be$. The relation is a multiplicative congruence if we define products by $(a, b)(c, d) = (ac, bd)$. This shows that equivalence classes can be multiplied under this definition.

Addition is defined by $(a, b) + (c, d) = (ad + bc, bd)$ by analogy with ordinary fractions. A computation shows that the equivalence relation is also a congruence with respect to addition and that associativity and distributivity hold. The elements $(0, 1)$ and $(1, 1)$ are additive and multiplicative identities. An element $(a, b)$ has the additive inverse $(-a, b)$ and, if it is nonzero, the multiplicative inverse $(b, a)$. Therefore, the set of equivalence classes is a field.

If $R$ is the ring of polynomials over a field $K$ in one variable, then the field is the field of rational functions.

B. Fields and Extensions

Let $(F_1, *, \circ)$ be a field. A subset $F_2$ of $F_1$ is called a subfield of $F_1$ if the elements of $F_2$ form a field under $*$ and $\circ$.

The rings $(\mathbf{Z}_p, +, \cdot)$, where $p$ is a prime number, and $(\mathbf{Q}, +, \cdot)$ are fields. Let $F$ be any field. If no sum $n1 = (1 + 1 + \cdots + 1)$ is zero, then $\mathbf{Q}$ is isomorphic to the subfield $\{n1/m1 : m \neq 0\}$ of $F$. If $n1 = 0$ where $n$ is minimal, $n$ must be a prime $p$; else a factorization $n = rs$ would give zero divisors $(r1)(s1) = 0$. Then $\mathbf{Z}_p = \{n1\}$ is a subfield of $F$.

Let $E$, $F$ be fields. If $F \subset E$, then $E$ is called an extension of $F$. All fields are extensions of $\mathbf{Q}$ or $\mathbf{Z}_p$. The field $E$ is a
vector space over $F$. The dimension of $E$ over $F$ is denoted $[E : F]$ and is called the degree of the extension.

Suppose $F \subset E \subset D$. Let $x_i$ be a basis for $E$ over $F$ and $y_i$ a basis for $D$ over $E$. Then every element of $D$ can be written uniquely as $\sum c_i y_i$, $c_i \in E$, and in turn this can be written uniquely as $\sum_i (\sum_j f_{ij} x_j) y_i$, $f_{ij} \in F$. So $x_j y_i$ are a basis and $[E : F][D : E] = [D : F]$.

Let $E$ be an extension of $F$ and let $t \in E$. Then $F(t)$ denotes the subfield of $E$ generated by $F$ and $t$, that is, all elements of $E$ that can be obtained from elements of $F$, $t$ by adding, multiplying, subtracting, and dividing.

Suppose $E$ is a finite dimensional extension of degree $n$. Let $x$ denote an indeterminate. Then the elements $1, t, t^2, \ldots, t^n$ are $n + 1$ elements and so are linearly dependent. Then some sum $\sum_{i=0}^{n} a_i t^i = 0$. We can choose such a relation with $n$ minimal (from here on $n$ denotes this minimal degree) and assume by dividing by $a_n$ that $a_n = 1$. If $a_n = 1$, we say that a polynomial is monic. We then get a polynomial $p(x)$ with $p(t) = 0$ of minimal degree called the minimal polynomial of $t$. If $f(t) = 0$ for any other polynomial $f(x)$, divide $p(x)$ into $f(x)$, writing $f(x) = q(x)p(x) + r(x)$. Then $r(t) = f(t) - q(t)p(t) = 0$. If $r(x) \neq 0$ it would have lower degree than $p(x)$, contradicting minimality. Therefore, $p(x)$ divides every other polynomial $f(x)$ such that $f(t) = 0$. Also by minimality $1, t, t^2, \ldots, t^{n-1}$ are linearly independent. The polynomial $p(x)$ must also be prime; else if $p(x) = r(x)s(x)$ then one of $r(t)$, $s(t)$ is 0. Prime polynomials are called irreducible.

We next show that $1, t, t^2, \ldots, t^{n-1}$ are a basis for $F(t)$. They are linearly independent. Since $t^n = -\sum_{i=0}^{n-1} a_i t^i$ and hence $t^{n+j} = -\sum_{i=0}^{n-1} a_i t^{i+j}$, every power of $t$ can be expressed as a linear combination of $1, t, t^2, \ldots, t^{n-1}$. Therefore, their span is a subring. Suppose $f(t) \neq 0$. Then $p(x)$ does not divide $f(x)$. Since $p(x)$ is prime, 1 is a g.c.d. of $p(x)$, $f(x)$. For some $r(x)$, $s(x)$, $r(x)p(x) + s(x)f(x) = 1$. So $s(t)f(t) = 1$ and $s(t)$ is an inverse of $f(t)$. Therefore, the span of $1, t, t^2, \ldots, t^{n-1}$ is closed under division and is a field. So it must be the field $F(t)$ generated by $t$.

There exists a homomorphism $h$ from the ring of polynomials $F[x]$ to $F(t)$ defined by $h(f(x)) = f(t)$. From the usual laws of algebra it is a ring homomorphism that is also onto.

The kernel is the set of polynomials $f(x)$ such that $f(t) = 0$. We have already shown that this is the set of polynomials divisible by $p(x)$, or the ideal $p(x)F[x]$ generated by $p(x)$.

Therefore, in summary of these results, let $p(x)$ be the minimum polynomial, of degree $n$, of $t$. Then $[F(t) : F] = n$ with basis $1, t, t^2, \ldots, t^{n-1}$. The field $F(t)$ is isomorphic to the quotient ring $F[x]/p(x)F[x]$, where $F[x]$ is the ring of all polynomials in an indeterminate $x$.

Conversely, let $p(x)$ be a monic irreducible polynomial. Then we will show that $F[x]/p(x)F[x]$ is an extension field of $F$ in which $t = \bar{x}$ has minimum polynomial $p(x)$. For any $f(x)$ such that $f(\bar{x}) \neq 0$, there will exist $r(x)$, $s(x)$ such that $r(x)f(x) + s(x)p(x) = 1$ since $f$, $p$ have g.c.d. 1. So $f(t)$ has an inverse $r(t)$ and the quotient ring is a field.

C. Applications to Solvability and Constructibility

The problem of solving quadratic equations goes at least back to the Babylonians. In the ninth century, the Muslim mathematician Al-Khwarismi gave a version of the modern quadratic formula. In the mid-sixteenth century, Italian mathematicians reduced the solution of a cubic equation to the form

$$x^3 + mx = n$$

by dividing it by the coefficient of $x^3$ and making a substitution replacing $x$ by some $x - c$. They then solved it as

$$x = a - b,$$
$$a = \sqrt[3]{\frac{n}{2} + \sqrt{\left(\frac{n}{2}\right)^2 + \left(\frac{m}{3}\right)^3}},$$
$$b = \sqrt[3]{-\frac{n}{2} + \sqrt{\left(\frac{n}{2}\right)^2 + \left(\frac{m}{3}\right)^3}}.$$

They proceeded to solve quartic equations by reducing them to cubics. But no one was able to solve the general quintic using $n$th roots and in 1824 N. H. Abel proved this is impossible. E. Galois in his brief life proved this also, as part of a general theory that applies to all polynomials. This is based on field theory, and we describe it next.

In the extensions $F(t)$, one root of a polynomial $p(x)$ has been added, or adjoined, to $F$. Extensions obtained by adding all roots of a polynomial are called normal extensions. The roots can be added one at a time in any order.

Finite dimensional normal extensions can be studied by finite groups called Galois groups. The Galois group of a normal extension $F \subset E$ is the group of all field automorphisms of $E$ that are the identity on $F$. It will, in effect, permute the roots of a polynomial whose roots generate the extension. For example, let $F = \mathbf{Q}(\xi)$ and let $E = F(\sqrt[3]{2})$, where $\xi = (-1 + i\sqrt{3})/2$. Then an automorphism of $E$ exists taking $\sqrt[3]{2} \to \xi\sqrt[3]{2}$, $\xi\sqrt[3]{2} \to \xi^2\sqrt[3]{2}$, $\xi^2\sqrt[3]{2} \to \sqrt[3]{2}$. The Galois group is cyclic of order 3, generated by this automorphism. Since the ratio $\xi$ of two roots goes to itself, it is the identity on $\mathbf{Q}(\xi)$.

The order of the Galois group equals the degree of a normal extension. Moreover, there is a 1–1 correspondence between subfields $F \subset K \subset E$ and subgroups $H \subset G$, the Galois group of $E$ over $F$. To a subgroup $H$ is associated the field $K = \{x \in E : f(x) = x \text{ for all } f \in H\}$.

A splitting field of a polynomial $p$ over a field $F$ is a minimal extension of $F$ over which $p$ factors into factors
of degree 1. It is a normal extension and any two splitting fields are isomorphic.

Suppose a polynomial $p$ is solvable by radicals over $\mathbf{Q}$. Let $E$ be the splitting field of $p$ over $\mathbf{Q}$. Each time we extract a radical the roots of the radical generate a normal extension $F_2$ of a previous field $F_1$. Let $E_i = F_i \cap E$. Then $F_2$ over $F_1$ has cyclic Galois group, so $E_2$ over $E_1$ does also.

It follows that there exist a series of extensions $\mathbf{Q} = D_0 \subset D_1 \subset \cdots \subset D_n = E$, each normal over the preceding, such that the Galois group of each over the preceding is cyclic. It follows that the Galois group $G$ has a series of subgroups $G_n = \{e\} \subset G_{n-1} \subset \cdots \subset G_0 = G$ such that $G_i$ is a normal subgroup of $G_{i-1}$ with cyclic quotient group. Such a group is called solvable.

The symmetric group of degree 5 has as its only nontrivial proper normal subgroup the alternating group, which is simple. Therefore, it is not solvable. If $f(x)$ is a degree 5 irreducible polynomial over $\mathbf{Q}$ with exactly two nonreal roots, there exists an order 5 element in its Galois group just because 5 divides the degree of the splitting field. Complex conjugation gives a transposition. Therefore, the Galois group is the symmetric group $S_5$, since an element of order 5 and a transposition generate it. So polynomials of degree 5 cannot in general be solved by radicals.

Conversely, it is true that every normal extension $F \subset E$ with cyclic Galois group can be generated by radicals. It can be shown that there is a single element $\theta$ such that $E = F(\theta)$ (consider all linear combinations $\theta$ of a basis for $E$ over $F$, there being a finite number of intermediate fields).

Let the extension be cyclic of order $n$ and let $\tau$ be such that $\tau^n = 1$ but no lower power is 1. Let the automorphism $g$ generate the Galois group. Let $t = \theta + \tau g(\theta) + \cdots + \tau^{n-1} g^{n-1}(\theta)$. Then $t$ has $n$ distinct conjugates (assuming $\tau \in F$) $g^i(\theta) + \tau g^{i+1}(\theta) + \cdots + \tau^{n-1} g^{n-1+i}(\theta)$ and so its minimum polynomial has degree $n$. Since $g(t) = \tau^{-1} t$, the element $t^n = a$ is invariant under the Galois group and lies in $F$. So $\theta, g(\theta), \ldots, g^{n-1}(\theta)$ lie in the splitting field of $x^n = a$, which must be $E$.

Geometric constructions provide an application of field theory. Suppose we are given a unit line segment. What figures can be constructed from it by ruler and compass? Let the segment be taken as a unit length on the $x$ axis. Whenever we construct a new point from existing ones by ruler and compass it is an intersection of a line or circle with a line or circle. Such intersections lead to quadratic equations. Therefore, if a point $P$ is constructible, each coordinate must be obtained from rational numbers by adding, subtracting, multiplying, dividing, or taking square roots. Such quantities lie in an extension field $E \supset \mathbf{Q}$ such that there exist fields $E_0 = \mathbf{Q} \subset E_1 \subset \cdots \subset E_k = E$ and $E_n = E_{n-1}(\sqrt{a})$ for $a \in E_{n-1}$. The degree $[E : \mathbf{Q}] = [E_k : E_{k-1}] \cdots [E_1 : E_0]$ is a power of 2.

Therefore, if $x$ is a coordinate of a constructible point, $x$ lies in an extension of degree $2^n$, in fact a normal extension of degree $2^n$. But if $[\mathbf{Q}(x) : \mathbf{Q}]$ is not a power of 2, this is impossible since $[E : \mathbf{Q}] = [E : \mathbf{Q}(x)][\mathbf{Q}(x) : \mathbf{Q}]$. In particular, duplicating a cube (providing a cube of volume precisely 2) and trisecting an angle of $60^\circ$ lead to roots of irreducible cubics $x^3 - 2 = 0$ and $4\cos^3\theta - 3\cos\theta - \cos 60^\circ = 0$ and cannot be performed. Since $\pi$ does not satisfy any monic polynomial with coefficients in $\mathbf{Q}$, the circle cannot be squared.

D. Finite Fields

The fields $\mathbf{Z}_p$, where $p$ is a prime number, are fields with a finite number of elements. All finite fields have $n1 = 0$ for some $n$ and are therefore extension fields of some $\mathbf{Z}_p$. If $\mathbf{Z}_p \subset E$, then $E$ is a vector space over $\mathbf{Z}_p$. Let $x_1, x_2, \ldots, x_n$ be a basis. Then all elements of $E$ can be uniquely expressed as linear combinations $\sum c_i x_i$, where $c_i \in \mathbf{Z}_p$. This set has cardinality $|\{(c_1, c_2, \ldots, c_n)\}| = |\mathbf{Z}_p^n| = p^n$. So $|E| = p^n$ if $[E : \mathbf{Z}_p] = n$.

Let $E$ have order $p^n$. Then the multiplicative group $E^*$ has order $p^n - 1$. So every element has order dividing $p^n - 1$. So if $r \neq 0$, $r^{p^n - 1} = 1$. So for all $r \in E$, $r^{p^n} = r$. Then for all $r \in E$, $x - r$ divides the polynomial $x^{p^n} - x$. Therefore $x^{p^n} - x$ factors as precisely the product of $x - r$ for all $r \in E$. Therefore, if $m$ is a divisor of $p^n - 1$, the polynomial $x^m - 1$ divides $x^{p^n - 1} - 1$ and splits into linear factors. As with $\mathbf{Z}_p$ this means we can get an element $u_i$ of order a maximum power of each prime dividing $p^n - 1$ and their product will have order $p^n - 1$ and generate the group. So $E^*$ is cyclic.

Since $p \mid p!$ but not $r!$ for $r < p$, $p$ prime, we have $p \mid \binom{p}{r}$ for $0 < r < p$. In $E$ since $p1 = 0$, every element satisfies $x + x + \cdots + x = px = 0$. So $(x + y)^n = \sum_r \binom{n}{r} x^r y^{n-r}$ by the binomial theorem, which holds in all commutative rings. In $E$, $(x + y)^p = x^p + y^p$ since other terms are divisible by $p$. This implies by induction, $(x + y)^{p^r} = x^{p^r} + y^{p^r}$. Therefore, $x \to x^{p^r}$ is an automorphism of $E$.

This gives a cyclic group of order $n$ of automorphisms of $E$ since if $y$ generates the cyclic group $E^*$ then $y^{p^i} \neq y$ for $0 < i < n$. This is the complete automorphism group of $E$.

If an element $z$ lies in a proper subfield $F$ of $E$, then $F$ has order $p^k$ and $k \mid n$ and $z^{p^k} = z$. Conversely, the set $\{z : z^{p^k} = z\}$ is closed under sums and products and multiplicative inverses so it is a proper subfield. So if $z^{p^k} = z$ then $z$ lies in a proper subfield.

For any irreducible polynomial $p(x)$ of degree $k$ over $\mathbf{Z}_p$, there exists a field $H = \mathbf{Z}_p[x]/p(x)\mathbf{Z}_p[x]$. It has degree $k$ and so order $p^k$. If $t = \bar{x}$, then it is $\mathbf{Z}_p(t)$ where $p(x)$ is the minimum polynomial of $t$. Since $t^{p^k} - t = 0$, we have $p(x) \mid x^{p^k} - x$, and if $k \mid n$, then $p(x) \mid x^{p^k} - x \mid x^{p^n} - x$.
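The quotient construction $\mathbf{Z}_p[x]/p(x)\mathbf{Z}_p[x]$ can be carried out directly for a small case. The sketch below assumes, as an illustrative choice, $p = 3$ and the polynomial $x^2 + 1$, which is irreducible over $\mathbf{Z}_3$ because $-1$ is not a square modulo 3. Elements are stored as coefficient tuples with the constant term first, and all names are hypothetical.

```python
from itertools import product

P = 3                  # the characteristic p
MODULUS = (1, 0, 1)    # 1 + 0*x + x^2, low degree first, monic
K = len(MODULUS) - 1   # the degree k of the extension

def add(a, b):
    return tuple((u + v) % P for u, v in zip(a, b))

def mul(a, b):
    """Multiply as polynomials, then reduce modulo MODULUS and modulo P."""
    prod = [0] * (2 * K - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            prod[i + j] = (prod[i + j] + u * v) % P
    # Replace x^d (d >= K) using x^K = -(lower-order terms of MODULUS).
    for d in range(2 * K - 2, K - 1, -1):
        c = prod[d]
        if c:
            prod[d] = 0
            for i in range(K):
                prod[d - K + i] = (prod[d - K + i] - c * MODULUS[i]) % P
    return tuple(prod[:K])

ZERO = (0,) * K
ONE = (1,) + (0,) * (K - 1)
ELEMENTS = list(product(range(P), repeat=K))   # all p^k residue classes

def power(a, n):
    result = ONE
    for _ in range(n):
        result = mul(result, a)
    return result

def order(a):
    """Multiplicative order of a nonzero element."""
    n, acc = 1, a
    while acc != ONE:
        acc = mul(acc, a)
        n += 1
    return n
```

In this nine-element field one can confirm that every element satisfies $r^{9} = r$, that some element has order 8 so that the multiplicative group is cyclic, and that $r \to r^3$ is additive.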
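The divisibility $p \mid \binom{p}{r}$ for $0 < r < p$, which underlies the identity $(x + y)^p = x^p + y^p$, is easy to check by machine. The function name below is illustrative only; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def middle_binomials_divisible(p):
    """True if p divides C(p, r) for every 0 < r < p."""
    return all(comb(p, r) % p == 0 for r in range(1, p))

def frobenius_is_additive_mod(p):
    """Check (x + y)^p = x^p + y^p for all residues x, y modulo p."""
    return all(
        pow(x + y, p, p) == (pow(x, p, p) + pow(y, p, p)) % p
        for x in range(p)
        for y in range(p)
    )
```

The first check fails for composite moduli such as 4, where $\binom{4}{2} = 6$, which shows why the argument needs $p$ to be prime.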

Suppose $p(x) \mid x^{p^n} - x$ and is irreducible of degree $k$, $k \nmid n$. Then in $\mathbf{Z}_p(t)$ of degree $k$ we would have $t^{p^n} = t$ and so for all $r$ in the field $r^{p^n} = r$. This is false. So an irreducible polynomial divides $x^{p^n} - x$ if and only if its degree divides $n$.

The derivative can be defined for polynomials in $\mathbf{Z}_p[x]$ and satisfies the usual sum, product, power rules. If $x^{p^n} - x = q(x)(p(x))^2$, then $(d/dx)(x^{p^n} - x) = -1 = q'(x)(p(x))^2 + 2q(x)p(x)p'(x)$. So $p(x)$ divides $-1$. For $p$ of degree greater than zero, this is false. Therefore, $x^{p^n} - x$ has each irreducible polynomial of degree $k$ dividing $n$ exactly once as a factor.

Suppose $x^{p^n} - x$ has no irreducible factors of degree $n$. Then all its factors divide $x^{p^k} - x$, $k < n$. So $x^{p^n} - x \mid (x^{p^{n-1}} - x) \cdots (x^p - x)$. Since $x^{p^n} - x$ has higher degree this is false. So $x^{p^n} - x$ has an irreducible factor $g(x)$ of degree $n$. Therefore, a field $H$ exists of order exactly $p^n$.

Any field $F$ of order $p^n$ has an element $r$ such that $x - r \mid g(x) \mid x^{p^n} - x$ since $x^{p^n} - x$ factors into linear factors. Then $r$ has minimum polynomial $g(x)$. It follows that $\mathbf{Z}_p(r)$ has degree $n$ and equals $F$. And $F$ is isomorphic to $H$. So any two fields of order $p^n$ are isomorphic.

E. Codes

Coding theory is concerned with the following problem. Consider information in the form of sequences $a_1 a_2 \cdots a_m$ over a $q$-element set assumed to be a finite field $F$ of order $q$. We wish to find a function $f$ encoding $a_1 a_2 \cdots a_m$ as another sequence $b_1 b_2 \cdots b_n$ such that, if an error of specified type occurs in the sequence $(b_i)$, the sequence $(a_i)$ can still be recovered. There should also be a readily computable function $g$ giving $a_1 a_2 \cdots a_m$ from $b_1 b_2 \cdots b_n$ with possible errors. We assume any $t$ or fewer errors could occur in $(b_i)$.

We consider $a_1 a_2 \cdots a_m$ as an $m$-dimensional vector $(a_1, a_2, \ldots, a_m)$ over $F$, that is, $a \in V_m$, and $(b_i)$ as a vector $b \in V_n$. Therefore, a coding function is a function $f : V_m \to V_n$. The resulting code is its image $C$, where $C \subset V_n$.

Suppose two vectors $v$, $w$ in a code differ in at most $2t$ places. Then let $z$ agree with $v$ in half these, $w$ the other half, and agree with $v$ and $w$ where they agree. Then $z$ could have been either $v$ or $w$. Error correction is impossible.

Conversely, if two vectors $v$, $w$ could give rise to the same vector $z$ by at most $t$ errors each, then they differ in at most $2t$ places.

Therefore, a code can correct $t$ errors if and only if every pair of vectors differ in at least $2t + 1$ places.

For such a code,

$$q^n \geq q^m \left( \binom{n}{0} + (q - 1)\binom{n}{1} + \cdots + (q - 1)^t \binom{n}{t} \right)$$

since the term $(q - 1)^r \binom{n}{r}$ on the right is the number of vectors that differ from a given code vector in exactly $r$ places, and for $r \leq t$ all these are distinct.

A perfect code is one such that equality holds. A linear code is one that is a subspace of $V_n$. A cyclic code is a linear code in which each cyclic rearrangement $b_1 b_2 \cdots b_n \to b_n b_1 \cdots b_{n-1}$ of a code vector is a code vector.

Let $V_n$ be considered as the set of polynomials of degree less than $n$ in $x$. Then cyclic rearrangement is multiplication by $x$ if we set $x^n = 1$. Therefore, cyclic codes are subspaces closed by multiplication by $x$ in

$$\frac{F[x]}{(x^n - 1)F[x]}$$

and are therefore ideals.

Such an ideal is generated by a unique monic polynomial $g(x)$ dividing $x^n - 1$. Let $\gamma$ be an element of an extension field of $F$ such that $\gamma^n = 1$ but no lower power is 1. Let $g(x)$ be the least common multiple of the minimum polynomials of $\gamma^1, \gamma^2, \ldots, \gamma^{d-1}$, where $d \leq n$, $d$ is relatively prime to $n$. Then $g(x) \mid x^n - 1$ since all powers of $\gamma$ are roots of $x^n - 1$. By a computation involving determinants no nonzero element of the ideal of $g(x)$ can have fewer than $d$ nonzero coefficients. So $t = (d - 1)/2$ errors can be corrected. The number $m$ is $n$ minus the degree of $g(x)$. These are called the BCH (Bose–Chaudhuri–Hocquenghem) codes and, if $n = q - 1$, Reed–Solomon codes.

F. Applications to Latin Squares

An $n \times n$ Latin square is a matrix whose entries are elements of an $n$-element set such that each element occurs exactly once in every row and column. The following cyclic Latin square occurs for all $n$:

$$\begin{pmatrix} 1 & 2 & \cdots & n \\ 2 & 3 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ n & 1 & \cdots & n - 1 \end{pmatrix}$$

Two Latin squares $(a_{ij})$, $(b_{ij})$ are orthogonal if for all $i, j$, the ordered pairs $(a_{ij}, b_{ij})$ are distinct.

Orthogonal Latin squares are used in experiments in which several factors are tested simultaneously. In particular, suppose we want to test 5 fertilizers, 5 soil types, 5 amounts of heat, and 5 plant varieties. Choose $5 \times 5$ orthogonal Latin squares. Take an experiment with variety $i$, heat amount $j$, soil type $a_{ij}$, fertilizer $b_{ij}$ for all $i, j$, giving $n^2$ experiments in all. Then for any two factors, each variation on one factor occurs in combination with each for the other factor exactly once. If we have $k$ mutually orthogonal Latin squares we can test $k + 2$ factors.

Suppose $n = p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}$, where the $p_i$ are distinct prime numbers. Let $R$ be the direct product of fields of order $p_i^{n_i}$. Let $k = \inf\{p_i^{n_i} - 1\}$. Choose $k$ distinct invertible elements from each field, $x_{ij}$: $i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, r$. Then the
elements $z_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ are invertible and the difference of any two is invertible.

Define $k$ $n \times n$ Latin squares $M_s$, $s = 1, 2, \ldots, k$, by $m^s_{ij} = z_s y_i + y_j$, where $y_i$, $y_j$ run through all elements of $R$. Then this function is 1–1 in $y_j$. Since $z_s$ is invertible, it is 1–1 in $y_i$. From the fact that $z_s - z_t$ is invertible, it follows that $(m^s_{ij}, m^t_{ij}) \neq (m^s_{hl}, m^t_{hl})$ unless $i = h$ and $j = l$. So all the squares are orthogonal. If $n$ is a prime power, then $n - 1$ is the maximum number of orthogonal $n \times n$ Latin squares. However, there exists a pair of orthogonal $n \times n$ Latin squares for all $n \neq 2, 6$ constructed by R. C. Bose, S. S. Shrikhande, and E. T. Parker in 1958. The existence of orthogonal $10 \times 10$ Latin squares disproved a longstanding conjecture of Euler.

A $k \times m$ orthogonal array is a $k \times m$ matrix $(a_{ij})$ such that for any $i \neq j$ the ordered pairs $(a_{is}, a_{js})$ are distinct for $s = 1$ to $m$.

An $(m + 2) \times n^2$ orthogonal array with entries from an $n$-element set yields $m$ mutually orthogonal Latin squares. The first two rows run through all pairs $(i, j)$, so let $m^r_{ij}$ be the entry in row $r + 2$ in the column where rows 1, 2 have entries $(i, j)$.

For $n$ a prime power congruent to 3 modulo 4, an orthogonal $4 \times n^2$ array can be constructed. Let $F$ be a field of order $n$ and let $g$ generate $F^*$, its multiplicative group. Because of the congruence condition $-1$ will not be a square in $F$. Let $y_1, y_2, \ldots, y_k$, $k = \frac{1}{2}(n - 1)$, be distinct symbols not in $F$. Take as columns of the array all columns of the types shown in Table IV, where $x$ ranges through $F$, and $i$ independently varies from $1, 2, \ldots, \frac{1}{2}(n - 1)$, together with the $n$ columns

$$(x, x, x, x)^T, \quad x \in F,$$

and $(n - 1)^2/4$ columns corresponding to a pair of orthogonal Latin squares of size $(n - 1)/2 \times (n - 1)/2$ with entries $y_1, y_2, \ldots, y_k$. This gives an orthogonal array yielding, for example, two $10 \times 10$ orthogonal Latin squares.

TABLE IV Columns Used to Construct Two Orthogonal Latin Squares

y_i               g^{2i}(g+1) + x   g^{2i} + x        x
x                 y_i               g^{2i}(g+1) + x   g^{2i} + x
g^{2i} + x        x                 y_i               g^{2i}(g+1) + x
g^{2i}(g+1) + x   g^{2i} + x        x                 y_i

G. Applications to Projective Planes

A projective plane is a mathematical system with a binary relation called incidence (lying on) from a set $P$ called the set of points to a set $L$ called the set of lines, satisfying three axioms: (1) Exactly one line is determined by two distinct points; (2) two lines intersect in exactly one point; and (3) there exist four points no three of which are collinear. The last axiom is only to guarantee that the system does not reduce to various special cases.

Let $D$ be any division ring. Left modules over $D$ have many of the properties of vector spaces. Let $M = D \times D \times D$ as a left module. Let $P$ be the set of submodules of the form $Dx$, $0 \neq x \in M$. Let $L$ be the set of submodules of the form $Dx + Dy$, $x, y \in M$, where $y \neq 0$ and $x \neq dy$ for any $d \in D$. Then $P$ and $L$ are essentially one- and two-dimensional subspaces of $M$. Let incidence be the relation $U \subset V$. Then we have a projective plane. A choice of $D = \mathbf{R}$ gives the standard geometric projective plane. This type of projective plane is characterized by the validity of the theorem of Desargues.

Many more general systems also give rise to projective planes. A ternary ring (or groupoid) is a set $R$ with an operation $R \times R \times R$ to $R$ denoted $x \circ m * b$. Every projective plane can be realized with the set $R \times R \cup R \cup \{\infty\}$, where $R \times R$ is considered as a plane for the ternary ring, $R \cup \{\infty\}$ the points on a line at infinity. Incidence is interpreted set theoretically by membership, and lines consist of sets $y = x \circ m * b$ together with a line at infinity.

Necessary and sufficient conditions that the ternary ring give a projective plane are that for some $0, 1 \in R$, for all $x, m, b, y, k, a \in R$: (TR-1) $0 \circ m * b = x \circ 0 * b = b$; (TR-2) $1 \circ m * 0 = m \circ 1 * 0 = m$; (TR-3) there exists a unique $z$ such that $a \circ m * z = b$; (TR-4) there exists a unique $z$ such that $z \circ m * b = z \circ k * a$ if $k \neq m$; and (TR-5) there exist unique $z, w$ such that $a \circ z * w = x$ and $b \circ z * w = y$ if $a \neq b$.

Finite fields give ternary rings $xm + b$ of all prime power orders. It is unknown whether ternary rings (TR-1) to (TR-5) exist not of prime power order. If $|R| = m$, then $|P| = |L| = m^2 + m + 1$.

Projective planes are essentially a special case of block designs. A balanced incomplete block design of type $(b, \nu, r, k, \lambda)$ consists of a family $B_i$, $i = 1$ to $b$, of subsets of a set $V$ having $\nu$ elements such that (1) $|B_i| = k > \lambda$ for all $i$; (2) $|\{i : x \in B_i\}| = r$ for all $x \in V$; and (3) $|\{i : x \in B_i \text{ and } y \in B_i\}| = \lambda$ for all $x, y \in V$, $x \neq y$. These are also used in design of experiments where $B_i$ is the $i$th set of experiments and $V$ the set of varieties tested.

Let $A = (a_{ij})$ be the matrix such that $a_{ij} = 1$ if the $i$th element of $V$ occurs in $B_j$ and $a_{ij} = 0$ otherwise. Then $A A^T = \lambda J + (r - \lambda)I$ and all column sums of $A$ are $k$, where $J$ is a matrix all of whose entries are 1. Moreover, these properties characterize balanced incomplete block designs. For $k = 3$, $\lambda = 1$, designs are called Steiner triple systems.

A permutation group $G$ acting on a set $S$ is called doubly transitive if for all $x \neq y$, $u \neq v$ in $S$ there exists $g$ in $G$ with
$gx = u$ and $gy = v$. If $T \subset S$, then the family of subsets $\{B_i\} = \{g(T)\}$ forms a balanced incomplete block design.

VII. OTHER ALGEBRAIC STRUCTURES

A. Groupoids

As mentioned before, a groupoid is a set having one binary operation satisfying only closure. For instance, in a finite group, the operation $aba^{-1}b^{-1}$ called the commutator gives a groupoid. Congruences and homomorphisms on groupoids are defined in the same way as for semigroups, and for every congruence there is a quotient groupoid. Sometimes, as in topology, groupoids are used to construct quotient groupoids which are groups.

Groupoids are also used in combinatorics. For example, any $n \times n$ Latin square, its entries labeled $1, 2, \ldots, n$, is equivalent to a groupoid satisfying (q): any two of $a$, $b$, $c$ determine the third uniquely in $ab = c$. A groupoid satisfying (q) is called a quasigroup. A loop is a quasigroup with two-sided identity. Figure 9 classifies some one-operation structures.

FIGURE 9 Classification of one-operation structures.

B. Semirings

A semiring is a system in which addition and multiplication are semigroups with distributivity on both sides. Any Boolean algebra and any subset of a ring closed under addition and multiplication, such as the nonnegative elements in $\mathbf{Z}$ or $\mathbf{R}$, are semirings. The set of matrices over a semiring with 0, 1 comprises a semiring with additive and multiplicative identity.

A homomorphism of semirings is a function $f$ from $S$ to $T$ such that $f(x + y) = f(x) + f(y)$ and $f(xy) = f(x)f(y)$ for all $x, y \in S$. The equivalence relation $\{(x, y) : f(x) = f(y)\}$ is a congruence. This means that, if $x \sim y$, then $x + z \sim y + z$, $xz \sim yz$, and $zx \sim zy$ for all $z \in S$. Conversely, for every congruence there is a quotient semiring and a homomorphism to the quotient semiring $x \to \bar{x}$.

If the semiring is partially ordered, so are the semirings of $n$-square matrices over it.

Just as Boolean matrices represent binary relations, so matrices over semirings on $[0, 1]$ can represent relations in which there is a concept of degree of relationship. One widely used semiring is the fuzzy algebra in which the operations are $\sup\{x, y\}$, $\inf\{x, y\}$, giving a distributive lattice. Fuzzy matrices have applications in clustering theory, where objects are partitioned into a hierarchy of subsets on the basis of a matrix $C = (c_{ij})$ giving the similarity between objects $i$ and $j$. Applications of fuzzy matrices are also found in cybernetics.

Inclines are a more general class of semirings. An incline is a semiring satisfying $x + x = x$, $xy \leq x$, and $xy \leq y$ for all $x, y$. The set of two-sided ideals in any ring or semigroup forms an incline. The additive operation in an incline makes it a semilattice and, under weak restrictions such as compactness of intervals, a lattice. In a finitely generated incline for every sequence $(a_i)$, $i \in \mathbf{Z}^+$, there exist $i < j$ with $a_i \geq a_j$. The set of $n$-square matrices over an incline is not an incline under matrix multiplication but is one under the elementwise product given by $(a_{ij}) \odot (b_{ij}) = (a_{ij} b_{ij})$. Inclines can be used to study optimization problems related to dynamic programming.

C. Nonassociative Algebras and Higher Order Algebras

A nonassociative ring is a system having two binary operations satisfying all the axioms for a ring except associativity of multiplication. A nonassociative algebra $A$ over a field $F$ in addition is a vector space over the field satisfying the algebra property $a(bc) = b(ac) = (ab)c$ for all $a \in F$, $b, c \in A$. There exists a nonassociative eight-dimensional division algebra over the real numbers called the Cayley numbers. It is an alternative ring, that is, $(yy)x = y(yx)$ and $(yx)x = y(xx)$ for all $x, y$. It has applications in topology and to projective planes.

A Lie algebra is an algebra in which for all $a, b, c$ the product denoted $[a, b]$ satisfies (L-1) $[a, a] = 0$; (L-2) $[a, b] + [b, a] = 0$; and (L-3) $[[a, b], c] + [[b, c], a] + [[c, a], b] = 0$. In any associative algebra, the commutators $[a, b] = ab - ba$ define a Lie algebra. Conversely, for any Lie algebra an associative algebra called the universal enveloping algebra can be defined such that the Lie algebra is a subalgebra of its algebra of commutators. In many topological groups, for every element $a$ there exists
a parametrized family b(t), t ∈ R, of elements such that b(1) = a and b(t + s) = b(t)b(s) = b(s)b(t). The operation b(t)c(t) is commutative up to first order in t as t → 0 and defines a real vector space. The operation $b(t)c(t)b^{-1}(t)c^{-1}(t)$ taken to second order in t then defines a Lie algebra. All finite dimensional connected, locally connected continuous topological groups can be classified using Lie algebras.
For any group G, let $\Gamma_i(G)$ denote the subgroup generated by all products of i elements under the operation of group commutator $xyx^{-1}y^{-1}$. Then $\Gamma_i(G)$ is a normal subgroup and $\Gamma_i(G)/\Gamma_{i+1}(G)$ is an Abelian group. Commutators give a product from $\Gamma_i(G)/\Gamma_{i+1}(G) \times \Gamma_j(G)/\Gamma_{j+1}(G)$ to $\Gamma_{i+j}(G)/\Gamma_{i+j+1}(G)$ satisfying (L-1) to (L-3). Such a structure is called a Lie ring. The sequence of normal subgroups $\Gamma_i(G)$ is called the lower central series.
A (noncommutative) Jordan algebra is a nonassociative algebra such that $(x^2 y)x = x^2(yx)$ and (xy)x = x(yx). Commutative Jordan algebras are used in constructing exceptional Lie algebras. For any associative algebra, the product xy + yx defines a commutative Jordan algebra. They also are used in physics.
A median algebra is a set S with ternary product xyz such that (1) xyz is unchanged under any permutation of x, y, z; (2) (x(xyz)w) = (xy(xzw)); and (3) xyy = y for all x, y, z, w. The product in any two factors is then a semilattice. The set of n-dimensional Boolean vectors is a median algebra under the componentwise product in which xyz equals whichever value is in the majority among x, y, z. The set of linear orders on n elements is a median algebra under the same operation.

D. Varieties

As mentioned earlier, an algebraic operation on a set S is a function from $S^n = S \times S \times \cdots \times S$ to S for some positive integer n. An operational structure is any labeled collection of operations on a set S. A substructure is a subset closed under all operations. A congruence is an equivalence relation such that for each operation f, if $x_i = y_i$ for $i \ne j$ and $x_j \sim y_j$, then $f(x_1, x_2, \ldots, x_n) \sim f(y_1, y_2, \ldots, y_n)$. Each operation is then uniquely defined on equivalence classes, which form a quotient structure. A homomorphism is a function h such that for each operation f and all $x_1, x_2, \ldots, x_n$ in the domain, $h[f(x_1, x_2, \ldots, x_n)] = f[h(x_1), h(x_2), \ldots, h(x_n)]$. Its image will be a substructure of the structure into which h maps. An isomorphism is a 1–1 onto homomorphism. The direct product of algebraic structures with operations of the same type is the product set with componentwise operations.
A law for an algebraic structure S is a relation which holds identically for all elements of the structure. That is, it asserts that a certain composition of operations applied to $x_1, x_2, \ldots, x_n$ equals a certain other composition for all $x_1, x_2, \ldots, x_n$ in the set. The commutative, associative, and distributive laws for commutative rings form an example. A variety is the class of all structures satisfying given laws. Any law valid for a class of structures will also hold for their substructures, homomorphic images, and direct products. Conversely, suppose a class of structures is closed under taking of substructures, product structures, and homomorphic images. For any set of generators g we can construct an element of the class whose only relations are those implied by the general laws of the class. To do this we take one copy of each isomorphism class $S_\alpha$ of structures in the class that are either finite or have cardinality at most |g|. Replicate these copies until we have one copy $S_{\alpha\beta}$ for every map $f_{\alpha\beta} : g \to S_\alpha$. Take the direct product $\times S_{\alpha\beta}$ and the substructure F(g) generated by the image of g under each. Then by construction any relation among these generators must hold identically in all the $S_\alpha$ and thus be a law. It follows that the relations among the generators g coincide with the laws of the class.
If S is any structure in which the laws hold with |S| = |g|, then the relations of F(g), being laws of the class, hold in S. In particular, there will then be an onto homomorphism from F(g) to S. This proves that a class is a variety if and only if it is closed under taking substructures, product structures, and homomorphic images. Much work has been done on classifying varieties, especially of groups.
Quasivarieties of relational structures on a set S also exist but must be defined a little differently, as classes closed under isomorphism, structures induced on subsets, and product structures. Laws are of three types. For a collection of variables $x_1, x_2, \ldots, x_n$ take a collection $T_\alpha$ of m-tuples from $x_1, x_2, \ldots, x_n$ for each m-ary relation R. For all replacements of $x_i$ by elements of the set T either (1) if for all α, all members of $T_\alpha$ belong in R, then $(x_{i(1)}, x_{i(2)}, \ldots, x_{i(k)}) \in R$, where $i(1), i(2), \ldots, i(k) \in Z^+$; (2) if for all α, all members of $T_\alpha$ belong in R, then $(x_{i(1)}, x_{i(2)}, \ldots, x_{i(k)}) \notin R$; or (3) if for all α, all members of $T_\alpha$ belong in R, then $x_{i(1)} = x_{i(2)}$ for some $i(1), i(2) \in Z^+$. The classes of reflexive, irreflexive, symmetric, and transitive binary relations are all quasivarieties.

E. Categories and Topoi

If we take algebraic structures of a given type and ignore the internal structure but consider only the homomorphisms that exist between them, then we have essentially a category. To be precise, a category consists of (1) a class, called the class of objects, together with (2) a set Hom(x, y) for all objects x, y, called the set of morphisms from x to y; (3) a special morphism $1_x$ for each object x, called the identity morphism; and (4) an operation from Hom(x, y) × Hom(y, z) to Hom(x, z) called composition
of morphisms. It is required that composition be associative when defined and that $1_x$ act as the identity element whenever composition with it is defined. The most fundamental category consists of all sets and all functions between sets. Sets and binary relations form a category. There are categories in which Hom(x, y) does not consist of functions; for example, let C be a poset and let Hom(x, y) = {(x, y)} if x ≤ y and Hom(x, y) = ∅ otherwise. The category of all modules over a ring is an additive category: morphisms can be added in a natural way.
The direct product of objects $K_1 \times K_2$ can be described in terms of categories by its universal property. This is that there exist mappings $\pi_i : K_1 \times K_2 \to K_i$ such that for any object A and any mappings $f_i : A \to K_i$ there exists a unique mapping $h : A \to K_1 \times K_2$ such that $\pi_i(h(x)) = f_i(x)$. There exists a dual concept of direct sum, an object $K_1 \oplus K_2$ with mappings $g_j : K_j \to K_1 \oplus K_2$ such that for any object A and any mappings $f_j : K_j \to A$ there exists a unique mapping $h : K_1 \oplus K_2 \to A$ such that $h(g_j(x)) = f_j(x)$.
More generally, in category theory, one works with finite diagrams, or graphs, in which each vertex is an object and each arrow a morphism. The limit of a diagram consists of an object A and a mapping $g_j$ from each object $a_j$ of the diagram to A such that if there is an arrow $f : a_k \to a_j$, then $g_j(f(x)) = g_k(x)$, and for any other object H and mappings $m_j$ with the same properties there exists a unique mapping $h : A \to H$ such that $h(g_j(x)) = m_j(x)$ for all j. A colimit is the dual concept. Any variety of relational structures, such as the variety of commutative rings, has products, coproducts, limits, and colimits. Moreover, there is the trivial structure {0} where any set of operations on zero gives zero. Any object has a unique mapping to zero.
A topos is a category with limits, colimits, products, coproducts, and a few additional properties: There exists an exponential object $X^Y$ for any objects X, Y such that for any object Z there is a natural isomorphism from Hom(Z, $X^Y$) to Hom(Z × Y, X). In addition, there exists a subobject classifier, that is, an object Ω and a mapping called true from a point p to Ω such that there is a 1–1 correspondence between substructures of any object X and mappings X → Ω. The category of sets acted on the left by a fixed monoid M is a topos. In the category of sets itself Ω = {0, 1}, where subsets T of a set S are in 1–1 correspondence with mappings f : S → {0, 1} such that f(x) = 1 if x is in T and f(x) = 0 if x ∉ T. Topoi have important applications to models in mathematical logic, such as the Boolean-valued models used to show the independence of the continuum hypothesis in Zermelo–Fraenkel set theory.
A functor from one category $C_1$ to another category $C_2$ consists of an assignment of a unique object of $C_2$ to each object of $C_1$ and a unique morphism of $C_2$ to each morphism of $C_1$ such that the objects involved correspond, the mappings $1_x$ go to identities, and composition is preserved. Group rings over a fixed ring give a functor from the category of groups to the category of rings.

F. Algebraic Topology

Algebraic topology is the study of functors from subcategories (subsets of the sets and morphisms of a category forming a category under the same operations) of the category of topological spaces and continuous mappings to categories of algebraic structures. These functors often allow one to deduce the nonexistence or existence of possible topological spaces or continuous mappings. If for some functor F and two topological spaces X, Y, F(X) is not isomorphic to F(Y), then X, Y cannot be topologically equivalent. In topology, any mapping between spaces is meant to be a continuous mapping.
Let A (B) be a subset of the topological space X (Y). A mapping from X, A to Y, B is a mapping f : X → Y such that f(A) ⊂ B. Two mappings f and g : X, A → Y, B are called homotopic if there exists a mapping h : I × X, I × A → Y, B, where I = [0, 1], such that h(0, x) = f(x) and h(1, x) = g(x). This is an equivalence relation. Equivalence classes of mappings under homotopy form a new category called the homotopy category.
The first important functor from a category associated to the category of topological spaces to the category of groups is the fundamental group. Let $x_0$ denote any point of X. Homotopy classes of mappings f, g from I, ∂I (where ∂I = {0, 1}) to X, $x_0$ can be multiplied by defining f ∗ g = h such that h(t) = f(2t) for 0 ≤ t ≤ 0.5 and h(t) = g(2t − 1) for 0.5 ≤ t ≤ 1. The product depends only on homotopy classes, and on homotopy classes is a group, where f(1 − t) is the inverse of f(t).
Every group is the fundamental group of some topological space, and every group homomorphism can be realized by a continuous mapping of some topological spaces. Higher homotopy groups are defined in a similar way using an n-cube and its boundary in place of I, ∂I.
Suppose a space Y is a topological groupoid; that is, there exists a continuous mapping Y × Y → Y. Then the set [X ; Y] of homotopy classes of mappings from X to Y is a groupoid for any topological space X. If we let Y be a topological space whose only nonzero homotopy group is an Abelian group G in dimension n, called an Eilenberg–MacLane space, then [X ; Y] is a group called the nth cohomology group of X. All the cohomology groups together form a ring. From the cohomology groups of Eilenberg–MacLane spaces themselves can be obtained cohomology operations, that is, mappings from one cohomology group to another preserved by continuous mappings.
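
The product of loops used earlier in this section to define the fundamental group can be sketched in a few lines. This is a minimal illustration under the standard convention that f is traversed on [0, 0.5] and g on [0.5, 1]; the two sample loops in the plane and all function names are hypothetical.

```python
def concat(f, g):
    """Product f * g of two paths on [0, 1]: run f at double speed, then g."""
    def h(t):
        return f(2 * t) if t <= 0.5 else g(2 * t - 1)
    return h

def inverse(f):
    """Reverse path t -> f(1 - t), representing the inverse homotopy class."""
    return lambda t: f(1 - t)

# Two hypothetical loops in the plane, both based at x0 = (0, 0).
def f(t):
    return (t * (1 - t), 0)

def g(t):
    return (0, t * (1 - t))

h = concat(f, g)
start, end = h(0), h(1)        # both equal the base point
first_half = h(0.25)           # equals f(0.5)
second_half = h(0.75)          # equals g(0.5)
back = concat(f, inverse(f))   # a loop homotopic to the constant loop
```

The definition only makes sense when f(1) = g(0), which holds automatically for loops at a common base point. The code does not check continuity or compute homotopy classes; it only evaluates the formulas.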
A vector bundle over a topological space X is a vector space associated to each point of X such that the union of the vector spaces forms a topological space. An example is the tangent vectors to a surface. The K-theory of a topological space is essentially a quotient of the semigroup of vector bundles over X under the operation of direct sum of vector spaces. The equivalence class of a vector bundle V is the set of vector bundles W such that U ⊕ V is equivalent to U ⊕ W for some vector bundle U. K-theory of X also equals [X ; Y] for a space Y which can be constructed from quotient groups of groups of n-square matrices.
By analogy with topological constructions, researchers have defined homology and K-theory for rings and modules.

G. Inclines and Antiinclines

We have defined incline above as a semiring satisfying the incline inequality. The set of two-sided ideals in a ring (semigroup), for example, forms an incline under I + J (I ∩ J) and IJ. Many results on Boolean and fuzzy matrices generalize to inclines. Green’s relation $\mathcal{L}$ ($\mathcal{R}$) classes in the semiring of matrices over an incline are characterized by equality of row (column) spaces. In many cases the row (column) spaces will have a unique basis. For the fuzzy algebra [0, 1] under sup{x, y}, inf{x, y}, it is necessary to add the condition that if $c_i = \sum_j a_{ij} c_j$, with $\{c_i\}$ the basis, then $a_{ii} c_i = c_i$. Matrices will have computable maximal subinverses X, with AXA ≤ A, giving a way to test regularity.
In any finitely generated incline, no chain has an infinite nondecreasing subchain. Eigenvectors and eigenvalues can be described in terms of a power of a matrix. A particularly interesting incline is [0, 1] under sup{x, y} and xy (ordinary multiplication).
The asymptotic forms of finite matrices can be described by the existence of positive integers k, d, and a matrix C such that

$$A^{k+nd} = A \odot C \odot \cdots \odot C$$

where ⊙ denotes entrywise product.
Antiinclines are defined by the reverse inequalities xy ≥ x and yx ≥ x. They have the dual nonascending chain property.

H. Quadratic Forms

A quadratic form over a ring is a function $f(x) = \sum a_{ij} x_i x_j$ in n variables with coefficients $a_{ij}$ in the ring. Quadratic forms occur often in statistics and optimization and as the first nonlinear term in a power series, as well as in pure mathematics.
Two quadratic forms are isomorphic if they become identical after a linear invertible change of variables. A form is isotropic if some nonzero x gives f(x) = 0. It is defined to be totally isotropic if f(x) = 0 identically, hyperbolic if it is isomorphic to a direct sum of forms $f(x) = x_1 x_2$ in two variables, and anisotropic if it is not isotropic.
Over a field F every quadratic form is uniquely expressible as the direct sum of an anisotropic form, a hyperbolic form, and a totally isotropic form up to equivalence. Cancellation holds for direct sum: if h(x) ⊕ f(x) is isomorphic to h(x) ⊕ g(x) then f(x) is isomorphic to g(x). It is then convenient to work with a ring of isomorphism classes of forms and their additive inverses called the Witt–Grothendieck ring $\hat{W}$. The ring $\hat{W}$ is generated by $\langle a \rangle$, the class of the form $ax^2$, has operations induced by direct sum and tensor product, and is defined by the relations $\langle 1 \rangle = 1$, $\langle a \rangle \langle b \rangle = \langle ab \rangle$, $\langle a \rangle + \langle b \rangle = \langle a + b \rangle (1 + \langle ab \rangle)$. The Witt ring W is defined as $\hat{W}/H$ where H is the ideal of hyperbolic forms. It is studied in terms of the filtration by powers of the ideal I generated by $\langle a \rangle - 1$.
Over the real numbers, rank [of the matrix $(a_{ij})$] and signature (the number of its positive eigenvalues, given $a_{ij} = a_{ji}$) are complete isomorphism invariants. Over the rational numbers the strong Hasse principle asserts that two forms are isomorphic if and only if they are isomorphic over each completion (real and p-adic) of the rationals. The weak Hasse principle asserts that a form is isotropic if and only if it is isotropic over each completion. A complete set of invariants of nonsingular forms is given by the determinant $\det[(a_{ij})]$ and Hasse invariants at each prime. The Hasse invariant of a quadratic form $\sum a_{ii} x_i^2$ can be defined as

$$\prod_{i \le j} (a_{ii}, a_{jj})$$

where $(a_{ii}, a_{jj}) = \pm 1$ according as $a_{ii} x^2 + a_{jj} y^2 = 1$ has a solution or not over the p-adic numbers. The determinant, taken modulo squares, is called the discriminant.
Milnor’s theorem gives a description of the Witt ring of F(x), x transcendental, in terms of the Witt ring of F. The Tsen–Lang theorem asserts that if F has transcendence degree n over C then every quadratic form of dimension greater than $2^n$ is isotropic. The theory of quadratic forms is also related to that of division algebras.

I. Current and Recent Research

The greatest achievement in algebra itself since 1950 has been the classification of finite simple groups. Proofs are very lengthy and due to many researchers. Much progress has been made on decidability of algebraic problems, for example, the result that the word problem is unsolvable in general groups having a finite number of generators and
the solution of Hilbert’s tenth problem showing that polynomial equations in n variables over Z (Diophantine equations) are in general undecidable by Turing machines. The proof of Mordell’s conjecture gives a positive step for equations of degree n in two variables over Q. Another result in group theory was the construction of an infinite but finitely generated group G such that $x^m = e$ for all x in G and for a fixed m in $Z^+$.
In algebraic geometry, a remarkable theory has been created leading to the proof of the Weil conjectures. This theory made it possible to prove that, for any Diophantine equation, it is decidable whether for every prime number p it has a solution modulo p. Much has been done with algebraic groups.
In coding theory, all perfect codes have been found. A pair of n × n orthogonal Latin squares has been constructed for all n except 2 and 6.
The entire subjects of homological algebra and algebraic K-theory have been developed. For modules over a ring, the following sequence

$$M_1 \xrightarrow{f_1} M_2 \xrightarrow{f_2} \cdots \xrightarrow{f_{n-1}} M_n$$

is called exact if $\operatorname{Im}(f_i) = \operatorname{Ker}(f_{i+1})$, where Ker(f) denotes the kernel of f. A free resolution of a module A is an exact sequence $\cdots \to F_n \to F_{n-1} \to \cdots \to F_0 \to A$, where each $F_i$ is a free module. Another module B gives a sequence

$$\cdots \leftarrow \operatorname{Hom}(F_n, B) \xleftarrow{g_n} \operatorname{Hom}(F_{n-1}, B) \xleftarrow{g_{n-1}} \cdots \xleftarrow{g_1} \operatorname{Hom}(F_0, B)$$

The quotients $\operatorname{Ker}(g_{n+1})/\operatorname{Im}(g_n)$ are independent of the particular free resolution and are called $\operatorname{Ext}^n(A, B)$.
Whitehead’s problem asks whether $\operatorname{Ext}^1(A, Z) = 0$ implies that A is a direct sum of copies of Z. S. Shelah proved this is independent of the axioms of Zermelo–Fraenkel set theory.
Considerable research has been done on order structures and on ordered algebraic structures in lattice theory and general algebra. Finite posets can be studied in ways similar to topological spaces, because they are equivalent to finite topological spaces. The theory of Boolean and fuzzy matrices has been developed with the advent of Green’s relations classes. Inclines, semirings with operations ◦, ∗ satisfying x ◦ x = x, x ◦ (x ∗ y) ◦ (y ∗ x) = x, are a further generalization. The algebraic structure of semigroups, especially regular semigroups, has become well understood. In matrix theory and algebraic number theory, there have been numerous important developments such as the proof of the van der Waerden conjecture.
Mathematical linguistics and automata theory have become well-developed subjects.
Research is actively continuing in most of these areas as well as in category theory and the theory of varieties and combinatorial aspects of algebra.
In the 1990s quantum algebra became a very active field. This deals with structures related to traditional algebraic structures in the way quantum physics is related to classical physics. In particular, a quantum group is a kind of Hopf algebra.
A somewhat related and active area is noncommutative algebraic geometry, in which a prominent place is occupied by the K-theoretic ideas of A. Connes.

SEE ALSO THE FOLLOWING ARTICLES

• ALGEBRAIC GEOMETRY • BOOLEAN ALGEBRA • GROUP THEORY • MATHEMATICAL LOGIC • SET THEORY • TOPOLOGY, GENERAL

BIBLIOGRAPHY

Cao, Z. Q., Kim, K. H., and Roush, F. W. (1984). “Incline Algebra and Application,” Ellis Horwood, Chichester, England/Wiley, New York.
Childs, L. (1979). “A Concrete Introduction to Higher Algebra,” Springer-Verlag, Berlin and New York.
Connes, A. (1994). “Noncommutative Geometry,” Academic Press, New York.
Fraleigh, J. B. (1982). “A First Course in Abstract Algebra,” Addison-Wesley, Reading, Massachusetts.
Gilbert, J., and Gilbert, L. (1984). “Applied Modern Algebra,” Prindle, Weber & Schmidt, Boston, Massachusetts.
Kassel, C. (1995). “Quantum Groups,” Springer-Verlag, Berlin and New York.
Kim, K. H. (1982). “Boolean Matrix Theory and Applications,” Marcel Dekker, New York.
Kim, K. H., and Roush, F. W. (1984). “Incline Algebra and Applications,” Ellis Horwood, Chichester, England/Wiley, New York.
Lam, T. Y. (1973). “Algebraic Theory of Quadratic Forms,” Benjamin, Reading, Massachusetts.
Laufer, H. B. (1984). “Applied Modern Algebra,” Prindle, Weber & Schmidt, Boston, Massachusetts.
Lidl, R., and Pilz, G. (1984). “Applied Abstract Algebra,” Springer-Verlag, Berlin and New York.
Pinter, C. C. (1982). “A Book of Abstract Algebra,” McGraw-Hill, New York.
P1: ZCK Revised Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN001H-20 May 26, 2001 14:20

Algebraic Geometry
Rick Miranda
Colorado State University

I. Basic Definitions of Affine Geometry

II. Projective Geometry
III. Curves and Surfaces
IV. Applications

GLOSSARY

Affine space The space of n-tuples with coordinates in a field K.
Algebraic set The set of solutions to a collection of polynomial equations in affine or projective space.
Coordinate ring Ring of polynomial functions on an algebraic set.
Curve An algebraic variety of dimension one.
Genus The fundamental invariant of curves, equal to the dimension of the space of regular 1-forms.
Moduli space A topological space (usually an algebraic variety itself) whose points represent algebraic varieties of a specific type.
Projective space The space of (n + 1)-tuples with coordinates in a field K, not all zero, up to scalar factor. The natural compactification of affine space, which includes points at infinity.
Quadric An algebraic variety defined by a single quadratic equation.
Rational function A ratio of polynomial functions defined on an algebraic set in affine or projective space.
Regular function A rational function whose denominator does not vanish at a subset of an algebraic set.

ALGEBRAIC GEOMETRY is, broadly speaking, the study of the solutions of polynomial equations. Since polynomials are ubiquitous in mathematics, algebraic geometry has played a central role in the development and study of many mathematical ideas. Its origins can be traced to ancient Greek mathematics, when the study of lines and conics in the plane began. From these relatively simple beginnings, modern algebraic geometry has grown into a subject which draws on and informs nearly every other discipline in mathematics, and modern practitioners develop and use some of the most sophisticated “mathematical machinery” (including sheaves, cohomology, etc.) available. However, the subject is still rooted in fundamental questions about systems of polynomial equations; the complexity and apparent abstraction of some of the modern tools of practicing algebraic geometers should not be confused with a lack of concern for basic and accessible problems.

I. BASIC DEFINITIONS OF AFFINE GEOMETRY

A. Affine Space

To be more precise about solutions to polynomial equations, one first specifies a field k of numbers where the

coefficients of the polynomials lie, and where one looks for solutions. A field is a set with two binary operations (addition and multiplication) in which all of the usual rules of arithmetic hold, including subtraction and division (by nonzero numbers). The main examples of interest are the field Q of rational numbers, the field R of real numbers, and the field C of complex numbers. Another is the finite field Z/p of the integers {0, 1, . . . , p − 1} under addition and multiplication modulo a prime p.
Solutions to polynomials in n variables would naturally be n-tuples of elements of the field. If we denote the field in question by K, the natural place to find solutions is affine n-space over K, denoted by $K^n$, $A^n_K$, or simply $A^n$:

$$A^n = \{z = (z_1, z_2, \ldots, z_n) \mid z_i \in K\}.$$

When n = 1 we have the affine line, if n = 2 we have the affine plane, and so forth.

B. Affine Algebraic Sets

Let $f(x) = f(x_1, \ldots, x_n)$ be a polynomial in n variables with coefficients in K. The zeros of f are denoted by Z(f):

$$Z(f) = \{z \in A^n \mid f(z) = 0\}.$$

For example, $Z(y - x^2)$ is a parabola in the plane, and $Z(x^2 - y^2 - z^2 - 4)$ is a hyperboloid in 3-space.
More complicated geometric objects in affine space must be defined by more than one polynomial. Let S be a set of polynomials (not necessarily finite) all having coefficients in the field K. The set of common zeros of all the polynomials in S is denoted by Z(S):

$$Z(S) = \{z \in A^n \mid f(z) = 0 \text{ for all } f \in S\}.$$

An algebraic set in $A^n$, or simply an affine algebraic set, is a subset of $A^n$ of the form Z(S) for some set of polynomials S in n variables. That is, an affine algebraic set is a set which is exactly the set of common zeros of a collection of polynomials. Affine algebraic sets are the fundamental objects of study in affine algebraic geometry.
The empty set is an affine algebraic set [it is Z(1)] and so is all of affine space [it is Z(0)]. The intersection of an arbitrary collection of algebraic sets is an algebraic set. The union of a finite number of algebraic sets is also an algebraic set. Therefore the algebraic sets in affine space form the closed sets in a topology; this topology is called the Zariski topology.

C. Ideals Defining Algebraic Sets

The “algebraic” part of algebraic geometry involves the use of the tools of modern algebra to study algebraic sets. Algebra is the study of sets with operations on them, and the most common algebraic structure is the ring, which is a set with addition and multiplication, but not necessarily division. The ring of importance in affine algebraic geometry is the ring $K[x] = K[x_1, x_2, \ldots, x_n]$ of polynomials in the n variables $x = (x_1, \ldots, x_n)$. Sometimes this is called the affine coordinate ring, since it is generated by the coordinate functions $x_i$. It is from this ring that we draw the subset S ⊂ K[x] of polynomials whose zeros we want to study.
An ideal J in a ring R (like K[x]) is a subset of the ring R with the special properties that it is closed under addition and “outside multiplication”: if f and g are in J then f + g is in J, and if g is in J then fg is in J for any f in the ring. An example of an ideal is the collection of all polynomials that vanish at a particular point $p \in A^n$. If S is a subset of the ring, the ideal generated by S is the set of all finite “linear combinations” $\sum_i h_i f_i$ where the $h_i$’s are in the ring and the $f_i$’s are in the given subset S. The reader may check that this set, denoted by ⟨S⟩, is closed under addition and outside multiplication, and so is always an ideal.
Returning to the setting of algebraic sets, one can easily see that if S is any collection of polynomials in K[x], and J = ⟨S⟩ is the ideal in K[x] generated by S, then S ⊂ J and Z(S) = Z(J): the set S and the ideal J have exactly the same set of common zeros. This allows algebraic geometers to focus only on the zeros of ideals of polynomials: every algebraic set is of the form Z(J) for an ideal J ⊂ K[x]. It is not the case that the ideal defining the algebraic set is unique, however: it is possible that two different ideals $J_1$ and $J_2$ have the same set of common zeros, so that $Z(J_1) = Z(J_2)$.

D. Algebraic Sets Defining Ideals

We saw in the last paragraph that ideals may be used to define all algebraic sets, but that the defining ideal is not unique. In the construction of algebraic sets, we look at a collection of polynomials and take their common zeros, which is a set of points in $A^n$. Now we turn this around, and look instead at a subset $X \subset A^n_K$ of points in affine space, and consider all the polynomials that vanish on X. We denote this by I(X):

$$I(X) = \{f \in K[x] \mid f(z) = 0 \text{ for all } z \in X\}.$$

No matter what kind of subset X we start with, this is always an ideal of polynomials. Again there is the same possibility of non-uniqueness: it may be that two different subsets X and Y of $A^n$ have the same ideal, so that I(X) = I(Y). This would happen if every polynomial that vanished on X also vanished on Y and vice versa.
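
Over a small finite field Z/p the zero set Z(S) can be computed by brute force, which makes the definitions above easy to experiment with. The following is a minimal sketch, assuming polynomials are represented as ordinary Python functions with integer coefficients; the names `zero_set` and `vanishes_on` are hypothetical.

```python
from itertools import product

def zero_set(polys, p, n):
    """Z(S): points of affine n-space over Z/p where every polynomial in polys vanishes."""
    return {z for z in product(range(p), repeat=n)
            if all(f(*z) % p == 0 for f in polys)}

def vanishes_on(f, points, p):
    """Membership test for I(X): does f vanish at every point of X?"""
    return all(f(*z) % p == 0 for z in points)

p = 5
f = lambda x, y: y - x * x            # the parabola y = x^2
g = lambda x, y: x                    # the line x = 0
f_squared = lambda x, y: f(x, y) ** 2

parabola = zero_set([f], p, 2)
same_zeros = zero_set([f_squared], p, 2)   # a different polynomial, same zero set
smaller = zero_set([f, g], p, 2)           # more equations, fewer points
empty = zero_set([lambda x, y: 1], p, 2)   # Z(1)
everything = zero_set([lambda x, y: 0], p, 2)  # Z(0)
```

Since f and its square have the same zeros, the example shows concretely why the polynomials defining an algebraic set are not unique. Over a finite field a nonzero polynomial such as $x^p - x$ vanishes at every point, so this brute-force picture is only a rough guide to the behaviour over fields like C.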
The ideal I(X) for a subset $X \subset A^n$ has a special property not shared by all ideals: it is closed under roots. That is, if f is a polynomial, and $f^m$ vanishes on all the points of X, then it must be the case that f itself vanishes. This means that if $f^m \in I(X)$, then $f \in I(X)$. An ideal with this special property is called a radical ideal.

E. The Z–I Correspondence

The two operations of taking an ideal of polynomials and using Z to get a subset of affine space, and taking a subset of affine space and using I to get an ideal of polynomials, form the theoretical foundation of affine algebraic geometry. We have the following basic facts:

• If $J_1 \subset J_2 \subset K[x]$ are ideals, then $Z(J_1) \supset Z(J_2)$.
• If $X_1 \subset X_2 \subset A^n$, then $I(X_1) \supset I(X_2)$.
• If J ⊂ K[x] is an ideal, then I(Z(J)) ⊃ J.
• If $X \subset A^n$, then Z(I(X)) ⊃ X.

This last statement can be sharpened to give a criterion for when a subset of $A^n$ is algebraic, as follows. If X is algebraic, equal to Z(J) for some ideal J, then I(X) = I(Z(J)) ⊃ J, so that Z(I(X)) = Z(I(Z(J))) ⊂ Z(J) = X, which forces Z(I(X)) = X. Conversely if Z(I(X)) = X, then X is obviously algebraic. Hence,

X is an algebraic subset of $A^n$ if and only if Z(I(X)) = X.

This statement addresses the question of which points are additional zeros of all the polynomials which vanish on X, stating that if X is algebraic there are no additional such points.

F. The Nullstellensatz and the Basis Theorem

The Nullstellensatz turns the previous question around and asks which additional polynomials always vanish at the points which are common zeros of a given ideal of polynomials. This is a much more subtle question, and the answer is the fundamental theorem in elementary affine geometry. It uses the concept of the radical of an ideal J, rad(J), which is the set of all polynomials for which some power is in J:

$$\operatorname{rad}(J) = \{f \in K[x] \mid f^m \in J \text{ for some } m \ge 1\}.$$

It is always the case that rad(J) ⊃ J, and that rad(J) is a radical ideal. Clearly I(Z(J)) ⊃ rad(J): if $f^m \in J$, then f will certainly vanish wherever everything in J (in particular wherever $f^m$) vanishes.
An algebraically closed field is a field that contains all roots of all polynomials in one variable. The field C of complex numbers is algebraically closed; the field Q of rational numbers, the field R of real numbers, and all finite fields are not algebraically closed.

Theorem 1 (Hilbert’s Nullstellensatz). If K is an algebraically closed field, then for any ideal J ⊂ K[x], I(Z(J)) = rad(J).

The power of the Nullstellensatz occurs because in many applications one has access to a polynomial f which is known to be zero when other polynomials $g_1, \ldots, g_k$ are. The conclusion is then that f is in the radical of the ideal generated by the $g_i$’s. Therefore there is a power of f which is a linear combination of the $g_i$’s, and so there is an explicit equation of the form

$$f^m = \sum_i h_i g_i.$$

The Nullstellensatz permits a more detailed correspondence of properties between the algebraic set X and its ideal I(X). Typical statements when K is algebraically closed are as follows:

• There is a one-to-one correspondence between algebraic subsets of $A^n$ and radical ideals in K[x].
• X is empty if and only if I(X) = K[x].
• $X = A^n$ if and only if I(X) = {0}.
• X consists of a single point if and only if I(X) is a maximal ideal.
• X is an irreducible algebraic set (that is, it is not a union of two proper algebraic subsets) if and only if I(X) is a prime ideal [that is, I(X) has the property that if fg ∈ I(X) then either f ∈ I(X) or g ∈ I(X)].

All these statements relate geometric properties of the algebraic set X to algebraic properties of the ideal I(X). This is the basic theme of algebraic geometry in general.
The other main theorem concerning these general ideas is the basis theorem, which addresses the question of the finiteness of the set of equations which may define an algebraic set:

Theorem 2 (Hilbert’s basis theorem). Every ideal J in the ring of polynomials K[x] is finitely generated: there is a finite set S of polynomials such that J = ⟨S⟩.

As a consequence, every algebraic set $X \subset A^n$ may be defined as the set of common zeros of a finite set of polynomials.

G. Regular and Rational Functions

Any polynomial f ∈ K[x] can be considered as a K-valued function on affine space $A^n$. If $X \subset A^n$ is an algebraic set, such a polynomial may be restricted to X
P1: ZCK Revised Pages
Encyclopedia of Physical Science and Technology EN001H-20 April 20, 2001 17:14

468 Algebraic Geometry

to obtain a function on X. Functions on X which are restrictions of polynomial functions are called polynomial functions. The polynomial functions on X form a ring, denoted by K[X] and called the affine coordinate ring of X. For example, the affine coordinate ring of affine space is the entire polynomial ring: K[A^n] = K[x].

There is an onto ring homomorphism (restriction) from K[x] to K[X], whose kernel is the ideal I(X). Therefore K[X] is isomorphic as a ring to the quotient ring K[x]/I(X).

Irreducible algebraic sets are often referred to as algebraic varieties. If X is an algebraic variety, then I(X) is a prime ideal, so that the coordinate ring K[X] is an integral domain (if fg = 0 then either f = 0 or g = 0). The field of fractions of K[X], denoted by K(X), represents rational functions on the variety X. Rational functions are more useful than polynomial functions, but they have the drawback that any given rational function may not be defined on all of X (where the denominator vanishes). If a rational function is defined at a point p, one says that the function is regular at p.

The rational function field K(X) of an algebraic variety X is an extension field of the base field K, and as such has a transcendence degree over K: this is the largest number of algebraically independent rational functions. This transcendence degree is the dimension of the variety X. Points have dimension zero, curves have dimension one, surfaces have dimension two, and so forth.

Polynomial and rational functions are used to define maps between algebraic sets. In particular, maps between two affine spaces are simply given by a vector of functions. Maps between affine algebraic sets are given by restriction of such a vector of functions. Depending on the type of functions used, such maps are called polynomial or rational maps. Again, rational maps have the property that they may not be defined everywhere. A map is called regular if it is given by rational functions but is defined everywhere.

H. Examples

The most common example of an affine algebraic variety is an affine subspace: this is an algebraic set given by linear equations. Such a set can always be defined by an m × n matrix A, and an m-vector b, as the vanishing of the set of m equations given in matrix form by Ax = b. This gives the geometric viewpoint on linear algebra. One consequence of the geometric point of view is a deeper understanding of parallelism for linear spaces: geometrically, one begins to believe that parallel lines might be made to intersect “at infinity.” This is the germ of the development of projective geometry.

No doubt the first nonlinear algebraic sets to be studied were the conics. These are the algebraic sets given by a single quadratic equation in two variables, and they define a curve in the affine plane. The classification and study of conics date to antiquity. Most familiar is the classification of nonempty irreducible conics over the real numbers R: we have either an ellipse, a parabola, or a hyperbola. Again an appreciation of the points “at infinity” sheds light on this classification: an ellipse has no real points at infinity, a parabola has one real point at infinity, and a hyperbola has two real points at infinity.

Over any field, if an irreducible conic C [given by f(x, y) = 0] is nonempty, then it may be parametrized using a rational function of one variable. This is done by choosing a point p = (x_0, y_0) on C, writing the line L_t through p with slope t [given by y − y_0 = t(x − x_0)], and intersecting L_t with C. This intersection will consist of the point p and one other point p_t which depends on t. One may solve easily for the coordinates of p_t as rational functions of t, giving a rational parametrization for the conic C. From the point of view of maps, this gives a rational map φ : A^1 → C which has an inverse: for any point (x, y) on C, the t-value is t = (y − y_0)/(x − x_0). Invertible rational maps are called birational, and most of the classification efforts of algebraic geometry are classifications up to birational maps.

Applying this procedure to the unit circle C (defined by x^2 + y^2 = 1) and choosing the point p to be the point (−1, 0), one finds the parametrization

    p_t = ( (1 − t^2)/(1 + t^2), 2t/(1 + t^2) ).

We see here the origins of algebraic number theory, in particular the formulas for the Pythagorean triples. If we look at the case when t = p/q is a rational number, we find (clearing the denominator q) that ((q^2 − p^2)/(q^2 + p^2), 2pq/(q^2 + p^2)) is a point on the unit circle. Again clearing the denominator q^2 + p^2, we see that this means

    (q^2 − p^2)^2 + (2pq)^2 = (q^2 + p^2)^2,

which gives the standard parametrization of Pythagorean triples. This style of argument in elementary number theory dates to the ancient Greeks, in particular to Apollonius and Diophantus.

In higher dimensions, an algebraic variety given by a single quadratic equation is called a quadric. Although the classification of affine quadrics is not difficult, it becomes clearer that the use of projective techniques simplifies the matter considerably.

Algebraic varieties defined by either more equations or equations of degree larger than two present a much greater challenge. Even the study of cubic curves in the plane [given by a single equation f(x, y) = 0 where f
has degree three] is still a subject of modern research, especially over number fields.

I. Affine Schemes

The theory exposed above was developed in the hundred years ending in the middle of the twentieth century. In the second half of the twentieth century a different foundation to algebraic geometry was developed, which more closely follows the algebra of the rings and ideals in question.

Recall that there is a correspondence between algebraic subsets of affine space and radical ideals in the polynomial ring K[x]. If the ground field K is algebraically closed, points correspond to maximal ideals, and irreducible algebraic sets to prime ideals. The ring of polynomial functions on X is naturally the ring K[x]/I(X). This ring is a finitely generated K-algebra, with no nilpotent elements (elements f ≠ 0 such that f^k = 0 for some k). From this ring one can recover X, as the set of maximal ideals.

The idea of affine schemes is to start with an arbitrary ring R and to construct a geometric object X having R as its natural ring of functions. Grothendieck’s theory, developed in the 1950s and 1960s, uses Spec(R), the set of prime ideals of R, as the set X. This gives the notion of an affine scheme.

II. PROJECTIVE GEOMETRY

A. Infinity

Consider the affine line A^1, which is simply the ground field K as a set. What should it mean to approach infinity in A^1? Consider a ratio x/y ∈ K; if y ≠ 0 this is an element of K, but as y approaches zero for fixed x, this element will “approach infinity.” However, we cannot let y equal zero in this ratio, since one cannot divide by zero in a field. It makes sense then to separate the numerator and denominator in this ratio and consider the ordered pair: let us write [y : x] for this. Since we are thinking of this ordered pair as representing a ratio, we should maintain that [y : x] = [ay : ax] for any nonzero a ∈ K. The ordered pair [1 : x] will represent the number x ∈ K; the ordered pair [0 : 1] will represent a new point, at infinity. We remove the ordered pair [0 : 0] from the discussion, since this would represent an undefined ratio.

B. Projective Space

The construction above generalizes to n-dimensional affine space A^n, as follows. Let us define projective space P^n to be the set of all ordered (n + 1)-tuples [x_0 : x_1 : ... : x_n], with each x_i ∈ K, not all equal to zero, subject to the equivalence relation that

    [x_0 : x_1 : ... : x_n] = [λx_0 : λx_1 : ... : λx_n]

for any nonzero λ ∈ K. The x_i’s are called the homogeneous coordinates of the point [x]; note that they are not individually well-defined, because of the scaling condition. However, it does make sense to say whether x_i = 0 or not.

If x_0 ≠ 0, we can scale by λ = 1/x_0 and assume x_0 = 1; then all the other n coordinates are well-defined and correspond to a unique point in affine n-space A^n:

    p = (x_1, ..., x_n) ∈ A^n corresponds to [1 : x_1 : ... : x_n] ∈ P^n.

However, we have a host of new points where x_0 = 0 in P^n; these are to be thought of as points “at infinity” of A^n. In this way the points at infinity are brought into view and in fact become no different than any other point, in projective space.

C. Projective Algebraic Sets

Let K[x] = K[x_0, x_1, ..., x_n] be the polynomial ring generated by the homogeneous coordinates. We can no longer view these polynomials as functions since (because of the scaling issue) even the coordinates themselves do not have well-defined values. However, suppose that a polynomial F(x) is homogeneous; that is, every term of F has the same degree d. Then F(λx) = λ^d F(x) for any nonzero λ, and hence whether F = 0 or not at a point [x] ∈ P^n is well-defined.

We therefore define a projective algebraic set to be the set of common zeros of a collection S of homogeneous polynomials:

    Z(S) = { [x] ∈ P^n | F(x) = 0 for all F ∈ S }.

Correspondingly, given a subset X ⊂ P^n, we define the homogeneous ideal of X, I(X), to be the ideal in K[x] generated by all homogeneous polynomials F which vanish at all points of X:

    I(X) = ⟨ { homogeneous F ∈ K[x] | F|_X = 0 } ⟩.

The reader will see the possibility of developing exactly the same type of theory, complete with a Z-I correspondence, a basis theorem, and a projective version of the Nullstellensatz, all properly interpreted with the extra attention paid to the homogeneity conditions at every turn. This is in fact what happens, and projective geometry attains an algebraic foundation as solid as that of affine geometry.
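
The scaling rule for homogeneous coordinates and the identity F(λx) = λ^d F(x) for homogeneous polynomials can be checked numerically. The following is a minimal sketch over the rational numbers, not a definitive implementation: the function names (normalize, same_point, evaluate, degree_if_homogeneous), the dictionary representation of polynomials, and the sample quadric F are all illustrative assumptions that do not come from the article.

```python
from fractions import Fraction

def normalize(point):
    """Scale homogeneous coordinates so the first nonzero entry equals 1."""
    coords = [Fraction(c) for c in point]
    pivot = next((c for c in coords if c != 0), None)
    if pivot is None:
        raise ValueError("[0 : ... : 0] is not a point of projective space")
    return tuple(c / pivot for c in coords)

def same_point(p, q):
    """Two coordinate tuples name the same point iff they differ by a nonzero scalar."""
    return normalize(p) == normalize(q)

def evaluate(poly, x):
    """Evaluate a polynomial stored as {exponent tuple: coefficient} at x."""
    total = Fraction(0)
    for exponents, coeff in poly.items():
        term = Fraction(coeff)
        for xi, e in zip(x, exponents):
            term *= Fraction(xi) ** e
        total += term
    return total

def degree_if_homogeneous(poly):
    """Return d if every term has total degree d, otherwise None."""
    degrees = {sum(exponents) for exponents in poly}
    return degrees.pop() if len(degrees) == 1 else None

# Hypothetical example: F(x0, x1, x2) = x1^2 + x2^2 - x0^2, homogeneous of degree 2.
F = {(0, 2, 0): 1, (0, 0, 2): 1, (2, 0, 0): -1}

d = degree_if_homogeneous(F)
x, lam = (1, 2, 3), 7
scaled = tuple(lam * xi for xi in x)

# F(lambda * x) equals lambda^d * F(x), so vanishing does not depend on the representative.
scaling_holds = evaluate(F, scaled) == lam ** d * evaluate(F, x)
# [5 : 3 : 4] and [1 : 3/5 : 4/5] are the same point, and F vanishes at both representatives.
on_zero_set = evaluate(F, (5, 3, 4)) == 0 and evaluate(F, normalize((5, 3, 4))) == 0
```

Normalizing on the first nonzero coordinate is only one possible choice of representative; any fixed convention would do, since all that matters is that equivalent tuples normalize to the same value.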

D. Regular and Rational Functions

In the context of projective geometry, the polynomial ring K[x] is called the homogeneous coordinate ring. It is a graded ring, graded by degree: K[x] = ⊕_{d≥0} V_d where V_d is the vector space of homogeneous polynomials in x of degree exactly d. If X ⊂ P^n is a projective algebraic set, the homogeneous ideal I(X) is also graded: I(X) = ⊕_d I_d, and the graded quotient ring K[X] = K[x]/I(X) is called the homogeneous coordinate ring of X.

As we have noted above, a polynomial (even a homogeneous one) F does not have well-defined values at points of projective space. However, a ratio of polynomials r = F/G will have a well-defined value at a point p if both F and G are homogeneous of the same degree, and G(p) ≠ 0. Such a ratio is called a rational function of degree zero, and these functions form the foundation of the function theory on projective algebraic sets. A rational function whose denominator does not vanish at p is said to be regular at p.

E. Homogenization

Let us suppose now that we have an affine algebraic set, given by the vanishing of a polynomial f(x_1, ..., x_n). We want to investigate how this algebraic set looks at infinity, in projective space. Let d be the largest degree of terms of f. To each term of f, we multiply in the appropriate power of the new variable x_0 to make the term have degree exactly d. This produces a homogeneous polynomial F(x_0, x_1, ..., x_n), which for x_0 = 1 has the same zeros as the original polynomial f. However, now for x_0 = 0 we have new zeros, at infinity, in projective space.

For a general affine algebraic set X defined by more than one equation, we do this process for every polynomial in I(X), producing a collection of homogeneous polynomials S. Then Z(S) ⊂ P^n is the projective closure X̄ of X and exactly adds the minimal set of points at infinity to X to produce a projective algebraic set.

For example, let us consider the hyperbola in the affine plane defined by f(x, y) = x^2 − y^2 − 1 = 0. We homogenize this to F(z, x, y) = x^2 − y^2 − z^2, a homogeneous polynomial of degree two. For z = 0, we recover the two points [0 : 1 : 1] and [0 : 1 : −1] at infinity (where x^2 − y^2 = 0), up to scalar factors.

F. Bezout’s Theorem

One of the consequences of adding in the points at infinity to form projective space is that algebraic sets which did not intersect in affine space now tend to intersect, at infinity, in projective space. The classic example is that of two parallel lines in the affine plane: these intersect at one point at infinity in the projective plane.

In general, one expects that each time one adds a new equation that must vanish, the dimension of the set of zeros goes down by one. Therefore, in projective space P^n, which has dimension n, one expects that the projective algebraic set defined by exactly n homogeneous polynomials F_1 = F_2 = ··· = F_n = 0 will have dimension zero; this means that it should be a finite set of points.

Bezout’s theorem deals with the number of such intersections. It may happen that a point of intersection is counted with multiplicity; this is the same phenomenon as when a polynomial in one variable has a double root: it counts for two when the number of roots is being enumerated. In more than one variable, there is a corresponding notion of multiplicity; this is always an integer at least one (for a common isolated root).

Theorem 3 (Bezout’s theorem). Suppose the ground field K is algebraically closed. Let F_i, i = 1, ..., n, be homogeneous polynomials in the n + 1 homogeneous variables of projective n-space P^n. Suppose that F_i has degree d_i. Then,

(a) The common zero locus X = Z({F_1, ..., F_n}) is nonempty.
(b) If X is finite, the cardinality of X is at most the product of the degrees d_1 d_2 ··· d_n.
(c) If each point of X is counted according to its multiplicity, the sum of the multiplicities is exactly the product of the degrees d_1 d_2 ··· d_n.

For example, three quadrics (each of degree two) in P^3 always intersect. If the intersection is finite, it is at most eight points; if the points are counted according to their multiplicity, the sum is exactly eight.

As another example, a line and a cubic in the plane will intersect three times, counted with multiplicity. This is the basis for the famous “group law” on a plane projective cubic curve X: given two points p and q on X, the line joining p and q will intersect the curve X in a third point, and this operation forms the basis for a group structure on X.

III. CURVES AND SURFACES

A. Singularities

Let X be an irreducible algebraic variety, of codimension r, in A^n. If x_i are local affine variables at a point p ∈ X, and the ideal of X is generated near p by polynomials f_1, ..., f_k, then the Jacobian matrix J = (∂f_i/∂x_j) will have rank at most r at every point of X and will have maximal rank r at all points of X away from a subalgebraic set Sing(X). At points of X − Sing(X), the variety
is said to be nonsingular, or smooth; at points of Sing(X), the variety is said to be singular, and Sing(X) is called the singular locus of X. Over the complex numbers, the nonsingular points are those points of X where X is a complex manifold, locally analytically isomorphic to an open set in C^d, where d is the dimension of X. A common situation is when Sing(X) is empty; then the variety is said to be smooth. At smooth points of algebraic varieties, there are local “analytic” coordinates y_1, ..., y_d equal in number to the dimension of the variety.

B. 1-Forms

A 1-form on a smooth algebraic variety is a collection of local expressions of the form Σ_i f_i(y) dy_i, where {y_i} are local coordinates on the variety and the f_i’s are rational functions; this collection of expressions must be all equal under changes of coordinates, and at every point of the variety at least one of the expressions must be valid. For a curve, where there is only one local coordinate y, a local 1-form expression has the simple form f(y) dy. The set of 1-forms on a variety forms a vector space.

C. The Genus of a Curve

Let X be a smooth projective curve. Let ω be a 1-form on X. If at every point of X there is a local expression for ω of the form f(y) dy where y is a local coordinate and f is a regular function, then we say that ω is a regular 1-form. The set of regular 1-forms on a smooth projective curve forms a finite-dimensional vector space over K, and the number of linearly independent regular 1-forms, which is the dimension of this vector space, is the most important invariant of the curve, called the genus.

If K is the complex numbers, the genus has a topological interpretation also. A smooth projective curve is a compact complex manifold of dimension one, which is therefore a compact orientable real manifold of dimension two. As such, it is topologically homeomorphic to a sphere with g handles attached; this g is the topological genus, which is equal to the genus. If g = 0, the curve is topologically a sphere; if g = 1, the curve is topologically a torus; if g ≥ 2, the curve is topologically a g-holed torus.

The simplest smooth projective curve is the projective line itself, P^1. It has genus zero.

D. Plane Curves and Plücker’s Formula

The most common projective curves studied over the centuries are the plane curves, which are defined by a single irreducible polynomial f(x, y) = 0 in affine 2-space, and then closed up with points at infinity to a projective curve defined by the homogenization F(x, y, z) = 0 in the projective plane. Plücker’s formula addresses the question of the genus of this curve in relation to the degree of the polynomial F:

Theorem 4 (Plücker’s formula). Suppose X is a smooth projective plane curve defined by an irreducible polynomial F(x, y, z) of degree d. Then the genus of X is equal to (d − 1)(d − 2)/2.

Plücker’s formula has been generalized to curves with singularities, in particular with simple singularities like nodes (locally like xy = 0 or y^2 = x^2) and cusps (locally like y^2 = x^3). Then the formula gives the genus of the desingularization of the curve: if a plane projective curve of degree d has a finite number of singularities which are all either nodes or cusps, and there are ν nodes and κ cusps, then the genus of the desingularization is

    g = (d − 1)(d − 2)/2 − ν − κ.

E. Elliptic and Hyperelliptic Curves

Using Plücker’s formula, we see that smooth plane projective curves of degree one or two have genus zero. We have in fact seen above that conics may be parametrized by lines, and this is possible precisely because they have the same genus. Smooth projective plane cubic curves have genus one, and they cannot be parametrized by lines. Every smooth projective plane cubic curve over C has exactly nine inflection points, and if an inflection point is put at the projective point [0 : 1 : 0], and if the line of inflection is the line at infinity, the affine equation of the cubic curve can be brought into Weierstrass form

    y^2 = x^3 + Ax + B

for complex numbers A and B, such that 4A^3 + 27B^2 ≠ 0 (this is the smoothness condition). The form of this equation shows that a cubic curve is also representable as a double cover of the x-line: for every x-value there are two y-values each giving points of the curve.

In general, there are curves of every genus representable as double covers, using similar affine equations of the form

    y^2 = f_d(x),

where f_d is a polynomial with distinct roots of degree d. Such a curve is called hyperelliptic and has genus g = [(d − 1)/2]. In particular these constructions show the existence of curves of any genus g ≥ 0.

Higher-degree coverings of the projective line, such as a curve given by an equation of the form y^3 = f(x), are also important in the classification of curves. Coverings of degree three are called trigonal curves, tetragonal curves are covers of degree four, and so forth.
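
The genus formulas above are simple enough to tabulate directly. The sketch below is illustrative only: the function names are hypothetical, the bracket in g = [(d − 1)/2] is read as the integer part (floor), and no attempt is made to decide whether a curve with the given numbers of nodes and cusps actually exists beyond rejecting a negative genus.

```python
def plane_curve_genus(d, nodes=0, cusps=0):
    """Genus of the desingularization of a degree-d plane curve whose only
    singularities are `nodes` nodes and `cusps` cusps (Plücker's formula)."""
    if d < 1 or nodes < 0 or cusps < 0:
        raise ValueError("degree must be positive and counts non-negative")
    # (d - 1)(d - 2) is a product of consecutive integers, so the division is exact.
    g = (d - 1) * (d - 2) // 2 - nodes - cusps
    if g < 0:
        raise ValueError("an irreducible curve cannot have negative genus")
    return g

def hyperelliptic_genus(d):
    """Genus of y^2 = f_d(x) with f_d of degree d and distinct roots,
    reading [(d - 1)/2] as the integer part."""
    if d < 1:
        raise ValueError("degree must be positive")
    return (d - 1) // 2

def weierstrass_is_smooth(A, B):
    """Smoothness condition 4A^3 + 27B^2 != 0 for y^2 = x^3 + Ax + B."""
    return 4 * A**3 + 27 * B**2 != 0

# Smooth plane curves of degree 1..5 have genus 0, 0, 1, 3, 6.
smooth_genera = [plane_curve_genus(d) for d in range(1, 6)]
# A cubic with one node or one cusp has a desingularization of genus zero.
nodal_cubic = plane_curve_genus(3, nodes=1)
cuspidal_cubic = plane_curve_genus(3, cusps=1)
```

For instance, hyperelliptic_genus(6) returns 2, and weierstrass_is_smooth(-3, 2) returns False because 4·(−27) + 27·4 = 0.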

F. Rational Functions, Forms, and the Riemann-Roch Theorem

The most celebrated theorem in the theory of curves is the theorem of Riemann-Roch, which gives precise information about the rational functions on a smooth projective curve X. The space of rational functions K(X) on X forms a field of transcendence degree one over K, and as such is an infinite-dimensional vector space over K. In order to organize these functions, we concentrate our attention on the zeros (where the numerator vanishes) and the poles (where the denominator vanishes). Specifically, given a point p ∈ X and a positive integer m, we may look at all the rational functions on X with a zero at p, no zero or pole at p, or a pole at p of order at most m. We denote this space by L(mp). If m = −n is a negative integer, we define L(mp) = L(−np) to be the space of rational functions with a zero at p of order at least n.

A divisor on X is a function from the points of X to the group of integers Z, such that all but finitely many values are zero. For any divisor D, we define the vector space L(D) to be the space L(D) = ∩_{x∈X} L(D(x) · x). In plain English this is the space of rational functions with restricted poles (to the set of points with positive D-values) and prescribed zeros (as the set of points with negative D-values). Part of the Riemann-Roch theorem is that these spaces are finite-dimensional.

We can make the same construction with rational 1-forms also. Let us recall that a rational 1-form has local expression f(y) dy and use the function part f to define the zeros and poles. Given a divisor E, we may then consider the space Ω^1(E) of rational 1-forms with restricted poles and prescribed zeros as E indicates. Again, this is a finite-dimensional space of forms.

Theorem 5 (Riemann-Roch). Let X be a smooth projective curve of genus g. Let D be a divisor on X, and denote by d the sum of the values of D: d = Σ_x D(x). Then,

    dim(L(D)) = d + 1 − g + dim(Ω^1(−D)).

The inequality dim(L(D)) ≥ d + 1 − g was proved by Riemann; Roch supplied the “correction term” related to 1-forms. One of the main uses of the Riemann-Roch theorem is to guarantee the existence of rational functions with prescribed zeros and poles, in order to intelligently embed the curve in projective space: given rational functions f_0, ..., f_n on a curve X, the mapping sending x ∈ X to the point [f_0(x), f_1(x), ..., f_n(x)] will always be well-defined on all of X and under certain circumstances will embed X as a subvariety of P^n.

Using this technique, we can show for example that (a) all curves of genus zero are isomorphic to the projective line P^1; (b) all curves of genus one are isomorphic to smooth plane cubic curves; (c) all curves of genus two are hyperelliptic, given by an equation of the form y^2 = f_6(x), where f_6 is a sextic polynomial; (d) all curves of genus three are either hyperelliptic or smooth plane quartic curves; (e) all curves of genus four are either hyperelliptic or the intersection of a quadric surface and a cubic surface in P^3; and (f) all curves of genus five are hyperelliptic, trigonal, or the intersection of three quadric threefolds in P^4. Of special interest is the so-called canonical embedding of a smooth curve, which shows that a curve of genus g either is hyperelliptic or can be embedded as a curve of degree 2g − 2 in P^{g−1}.

G. The Moduli Space M_g

Much of the research in algebraic geometry since 1960 has focused on the study of the moduli spaces for algebraic varieties. In general, a moduli space is a topological space M whose points represent particular geometric objects; the topology on M is such that points which are close in M represent geometric objects which are also “close” to each other. Of particular interest has been the moduli space M_g for smooth projective curves of genus g. For example, since every curve of genus zero is isomorphic to the projective line, M_0 consists of a single point. The moduli space M_1 classifying curves of genus one is isomorphic to the affine line A^1: the famous j-invariant of elliptic curves classifies curves of genus one by a single number. [For a plane cubic curve of genus one in Weierstrass form y^2 = x^3 + Ax + B, j = 4A^3/(4A^3 + 27B^2).] For g ≥ 2, M_g is itself an algebraic variety of dimension 3g − 3.

Of particular interest has been the construction and study of meaningful compactifications of various moduli spaces. For M_g, the most natural compactification M̄_g was constructed by Deligne and Mumford and the additional points represent stable curves of genus g, which are curves without continuous families of automorphisms, and having only nodes as singularities. Even today the construction and elementary properties of moduli spaces for higher-dimensional algebraic varieties (e.g., surfaces) are a challenge. More recently attention has turned to moduli spaces for maps between algebraic varieties, and it is an area of very active research today to compactify and understand such spaces of maps.

H. Surfaces and Higher Dimensions

The construction and understanding of the moduli spaces M_g for smooth curves is tantamount to the successful classification of curves and their properties. The classification of higher-dimensional varieties is not anywhere near as
complete. Even surfaces, for which a fairly satisfactory classification exists due to Enriques, Kodaira, and others, present many open problems.

The Enriques classification of smooth surfaces essentially breaks up all surfaces into four categories. The first category consists of those surfaces with a family of genus zero curves on them. Since genus zero curves are all isomorphic to lines, such surfaces are known as ruled surfaces, and a detailed understanding of them is possible. The prototype is a product surface X × P^1 for a curve X. The second category consists of surfaces with a nowhere-vanishing regular 2-form, or finite quotients of such surfaces. These are the so-called abelian surfaces, K3 surfaces, Enriques surfaces, and hyperelliptic surfaces. The third category consists of surfaces with a family of genus one curves on them. Some techniques similar to those used in the study of ruled surfaces are possible, and since genus one curves are very well understood, again a rather detailed description of these surfaces, called elliptic surfaces, is available. The last category is the surfaces of general type, and most surfaces are in this category. Moduli spaces have been constructed, and many elementary invariants are known, but there is still a lot of work to do to understand general-type surfaces. An exciting current application area has to do with the connection of algebraic surfaces (which over the complex numbers are real four-dimensional objects) and the study and classification of 4-manifolds.

For varieties of dimension three or more, there are essentially no areas of complete classification. Basic techniques available for curves and surfaces begin to break down for threefolds, in particular because several fundamental constructions lead immediately to singular varieties, which are much more difficult to handle. However, since the 1980s, starting with the groundbreaking work of Mori, steady progress has been made on fundamental classification constructions.

IV. APPLICATIONS

The origins of algebraic geometry, and the development of projective geometry in the Renaissance, were driven in large part by applications (of the theory of multivariable polynomials) to a variety of problems in geography, art, number theory, and so forth. Later, in the 1800s, newer problems in differential equations, fluid flows, and the study of integrals were driving development in algebraic geometry. In the 1900s the invention of algebra as we know it today caused a rethinking in the foundations of algebraic geometry, and the energies of working researchers were somewhat siphoned into the development of new structures and techniques: these have culminated in the widespread use of general schemes, sheaves, and homological algebra, as well as slightly more general notions of algebraic spaces and stacks.

After this period of partial introspection, a vigorous return to application areas in algebraic geometry began in the 1980s. We will close this article with brief discussions of a sampling of these.

A. Enumeration

One of the fundamental questions of geometry, and of many other subjects in mathematics, is the “how many” question: in algebraic geometry, this expresses itself in counting the number of geometric objects with a given property. In contrast to pure combinatorics, often the geometric approach involves counting objects with multiplicity (e.g., roots of a polynomial).

Typical enumerative questions (and answers) ask for the number of flexes on a smooth cubic curve (9); the number of lines on a smooth cubic surface (27); the number of conics tangent to five given conics (3264); and the number of lines on a smooth quintic threefold (2875).

Recent breakthrough developments in intersection theory have enabled algebraic geometers to compute such numbers of enumerative interest which have stumped prior generations. New excitement has been brought by totally unexpected relationships with string theory in theoretical physics, where the work of Witten and others has found surprising connections between computations relating elementary particles and generating functions for enumerative problems of the above sort in algebraic geometry.

B. Computation

Computation with polynomials is a fundamentally algorithmic process which permits in many cases the design of explicit algorithms for calculating a multitude of quantities of interest to algebraic geometers. Usually these quantities either are related to enumerative questions such as those mentioned above or are the dimensions of vector spaces (typically of functions or forms) arising naturally either in elementary settings or as cohomology spaces.

With the advent of sufficient computing power as represented by modern computers and computer algebra software packages, it has become more possible to actually perform these types of computations by means of computer. This has led to an explosion of activity in designing efficient and effective algorithms for the computations of interest. These algorithms usually rely in some way on the theory of Gröbner bases, which build on a multivariable version of the division algorithm for polynomials.

Software is now widely available to execute many calculations, and it is typically user-customizable, which has enabled researchers around the world to make
fundamental contributions. Software packages which are widely used include Macaulay, CoCoA, and Schubert.

C. Mechanics

Many problems in mechanical engineering and construction involve the precise understanding of the position of various machine parts when the machine is in motion. Robotic analysis is especially concerned with the position and velocity of robot extremities given various parameters for joint motion and extension. There are a few basic motions for machine joints, and these are all describable by simple polynomials [e.g., circular motion of radius $r$ occurs on the circle ($x^2 + y^2 = r^2$)]. It is not difficult to see therefore that in suitable coordinate systems, virtually all such problems can be formulated by polynomial systems.

However, mechanical devices with many joints can correspond to algebraic sets in affine spaces with many variables and many equations, which makes the geometric analysis rather complicated. It is therefore useful to be able to apply more sophisticated techniques of the theory to reduce the complexity of the problem, and this is where the power of algebraic geometry can come into play.

A specific example of the type of problem is the following so-called n-bar configuration. Let us consider a cyclical arrangement of $n$ stiff rods, joined at the ends by joints that can actuate only in a planar way locally. For what initial positions of this arrangement does the configuration actually move and flex? In how many different ways can it be made to flex? When $n = 3$, the configuration must lie in a plane, and no flexing or motion is possible; for $n = 4$, the flexing configurations, although known, have not been completely analyzed.

D. Coding Theory

A code is a collection of vectors in $K^n$ for some $n$, where $K$ is a field; each vector in the collection is a code word. The Hamming distance between two code words is the number of positions where the code words differ. If each code word is intended to represent some piece of irreducible information, then desirable properties of a code are as follows:

(a) Size: There should be many code words so that a large amount of information can be represented.
(b) Distinctness: The Hamming distance between any two code words should be large, so that if a code word is corrupted in some way, the original code word can be recovered by changing the corrupted code word in a small number of positions.
(c) Efficiency: The ambient dimension $n$, which is the number of positions, should be as small as possible.

These three properties tend to act against one another, and the theory of error-correcting codes is directed toward the classification and analysis of possible codes and coding schemes.

Algebraic geometry has found an application in this area, by taking the field $K$ to be a finite field, and taking the code to be certain natural spaces of functions or forms on an algebraic variety. Most successful have been attempts to use algebraic curves; this was initiated by Goppa and has been successful in producing codes with desirable properties and in aiding in the classification and uniform treatment of several families of previously known codes.

E. Automatic Theorem Proving

It is often the case, especially in elementary geometry, that geometric statements can be expressed by having some polynomial vanish. For example, three points $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$ in the plane are collinear if and only if the determinant of the matrix

$$\begin{pmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \end{pmatrix},$$

which is the polynomial $x_2 y_3 - x_3 y_2 - x_1 y_3 + x_3 y_1 + x_1 y_2 - x_2 y_1$, is zero.

A typical theorem therefore might be viewed as a collection of hypothesis polynomials $h_1, \ldots, h_k$, and a conclusion polynomial $g$. The truth of the theorem would be equivalent to saying that wherever the hypothesis polynomials all vanish, the conclusion polynomial is also zero. This exactly says that $g \in I(Z(\{h_1, \ldots, h_k\}))$, using the language of the $Z$-$I$ correspondence of affine algebraic geometry.

If the field is algebraically closed, the Nullstellensatz says that the conclusion polynomial $g$ is in this ideal if and only if some power of $g$ is a linear combination of the $h_i$'s; that is, there is some equation of the form

$$g^r = \sum_i f_i h_i,$$

where $f_i$ are other polynomials. A “proof” of the theorem would be given by an explicit collection of polynomials $f_i$, which exhibited the equation above for some $r$, and this can be easily checked by a computer. Indeed, algorithms exist which check for the existence of such an expression for $g$ and are therefore effective in determining the truth of a given proposed theorem.

F. Interpolation

A general problem in approximation theory is to construct easy-to-evaluate functions with specified behavior.
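As one concrete instance of such a construction, the following minimal Python sketch evaluates the one-variable polynomial of lowest degree that takes prescribed values at prescribed points, using exact rational arithmetic. The function name `lagrange_interpolate` and the sample data are illustrative assumptions, not part of the original article.

```python
from fractions import Fraction


def lagrange_interpolate(points, values, x):
    """Evaluate at x the polynomial of degree < len(points) that takes
    values[k] at points[k]; the points are assumed to be distinct."""
    total = Fraction(0)
    for k, (pk, ck) in enumerate(zip(points, values)):
        # Build c_k times the k-th basis polynomial, which equals 1 at p_k
        # and 0 at every other interpolation point.
        term = Fraction(ck)
        for j, pj in enumerate(points):
            if j != k:
                term *= Fraction(x - pj, pk - pj)
        total += term
    return total


# Hypothetical sample data: values 1, 3, 7 prescribed at the points 0, 1, 2.
value_at_3 = lagrange_interpolate([0, 1, 2], [1, 3, 7], 3)
```

For this sample data the interpolating polynomial is $x^2 + x + 1$, so the value at 3 is 13.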
Typically this behavior involves the values of the function at prescribed loci, the values of the derivatives of the functions, or both. Polynomial interpolation is the method whereby polynomial functions are used in this manner, and algebraic geometry has found applications and new problems of interest in this field in recent decades.

Lagrange interpolation involves finding polynomial functions with specified values $c_k$ at a specified set of points $p_k \in A^n$. In one variable, there is a relatively easy formula for writing down the desired polynomial, and this is taught in most first-year calculus courses. In higher dimensions, no such formula exists. However, the problem is a linear problem, and so it is a straightforward application of linear algebra techniques.

Hermite interpolation involves finding polynomial functions with specified derivative values. This is a significantly more complicated generalization, and open questions exist even for polynomials in two variables.

Spline interpolation involves finding piecewise polynomial functions, which stitch together to have a certain degree of global smoothness, but which also have specified behavior at given points. Cubic splines are the most popular in one variable, and with the advent of computer graphics, two- and three-variable splines are becoming more widely known and used in elementary applications.

In all these settings, techniques of algebraic geometry are brought to bear in order to determine the dimension of the space of suitable interpolating functions. Usually it is a simple matter to find a lower bound for such dimensions, and the more difficult problem is to find sharp upper bounds.

G. Graphics

The origins of projective geometry lie in the early Renaissance, when artists and architects began to be interested in accurately portraying objects in perspective. The correct treatment of horizon lines and vanishing points in artwork of the period led directly to a new appreciation of points at infinity and eventually to a mathematically sound theory of projective geometry.

With the recent explosion of capabilities in computer speed and resolution of displays, and of the amount of and interest in visualizing data, an appreciation of projective geometry, homogeneous coordinates, and projective transformations has lately become standard fare for computer graphics specialists and novices alike.

Although initially the application of the ideas of projective geometry was rather elementary, more sophisticated techniques are now being brought to bear, especially those involving problems in computer vision and pattern recognition. In particular it now seems feasible to use subtle projective invariants to discriminate between scene objects of either different types at the same perspective or the same type at different perspectives.

SEE ALSO THE FOLLOWING ARTICLES

ABSTRACT ALGEBRA • BOOLEAN ALGEBRA • CALCULUS

BIBLIOGRAPHY

Beauville, A. (1996). “Complex Algebraic Surfaces,” 2nd ed., Cambridge Univ. Press, Cambridge, UK.
Cox, D., Little, J., and O’Shea, D. (1997a). “Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra,” 2nd ed., Springer-Verlag, New York.
Cox, D., Little, J., and O’Shea, D. (1997b). “Using Algebraic Geometry,” Graduate Texts in Mathematics, Vol. 185, Springer-Verlag, New York.
Eisenbud, D. (1996). “Commutative Algebra with a View toward Algebraic Geometry,” Graduate Texts in Mathematics, Vol. 150, Springer-Verlag, New York.
Fulton, W. (1997). “Intersection Theory,” 2nd ed., Springer-Verlag, New York.
Griffiths, P., and Harris, J. (1978). “Principles of Algebraic Geometry,” Wiley (Interscience), New York.
Harris, J. (1992). “Algebraic Geometry: A First Course,” Springer-Verlag, New York.
Harris, J., and Morrison, I. (1998). “Moduli of Curves,” Springer-Verlag, New York.
Hartshorne, R. (1979). “Algebraic Geometry,” Springer-Verlag, New York.
Kirwan, F. (1992). “Complex Algebraic Curves,” Cambridge Univ. Press, Cambridge, UK.
Miranda, R. (1995). “Algebraic Curves and Riemann Surfaces,” Am. Math. Soc., Providence.
Reid, M. (1988). “Undergraduate Algebraic Geometry,” Cambridge Univ. Press, Cambridge, UK.
Shafarevich, I. (1994). “Basic Algebraic Geometry,” 2 vols., Springer-Verlag, New York.
Sturmfels, B. (1996). “Gröbner Bases and Convex Polytopes,” Am. Math. Soc., Providence.
Ueno, K. (1999). “Algebraic Geometry I: From Algebraic Varieties to Schemes,” Am. Math. Soc., Providence.
P1: GAE Revised Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN001C-26 May 26, 2001 14:31

Approximations and Expansions

Charles K. Chui
University of Missouri–St. Louis

I. Background
II. Best Approximation
III. Bivariate Spline Functions
IV. Compact Operators and M Ideals
V. Constrained Approximations
VI. Factorization of Biinfinite Matrices
VII. Interpolation
VIII. Multivariate Polyhedral Splines
IX. Quadratures
X. Smoothing by Spline Functions
XI. Wavelets

GLOSSARY

Center Let $A$ be a bounded set in a normed linear space $X$. An element $x_0$ in $X$ is called a center (or Chebyshev center) of $A$ if $\sup\{\|x_0 - y\| : y \in A\} = \inf_{x \in X} \sup\{\|x - y\| : y \in A\}$.

L Summand A closed subspace $S$ of a Banach space $X$ is said to be an L summand (or L ideal) if there exists a closed subspace $T$ of $X$ such that $X = S \oplus T$ and $\|x + y\| = \|x\| + \|y\|$ for all $x \in S$ and $y \in T$.

M Ideal A closed subspace $S$ of a Banach space $X$ is called an M ideal in $X$ if its annihilator $S^{\perp}$ is an L summand of the dual space $X^*$.

Modulus of smoothness Let $f$ be in $L_p[0, 1]$, $1 \le p \le \infty$, and let $\Delta_t^r f$ denote the $r$th forward difference of $f$ with step size $t$, where $rt \le 1$. The $L_p$ $r$th modulus of smoothness of $f$ is defined by $\omega_r(f, h)_p = \sup\{\|\Delta_t^r f\|_p : 0 < t < h\}$, where the $L_p$ norm is taken over all of $[0, 1]$ for the periodic case and over $[0, 1 - rt]$ for the nonperiodic setting. Also, denote $\omega(f, h)_p = \omega_1(f, h)_p$.

Natural spline Let $a = t_0 < \cdots < t_{n+1} = b$. A function $s$ in $C^{2k-2}[a, b]$ is called a natural (polynomial) spline of order $2k$ and with knots at $t_1, \ldots, t_n$, if the restriction of $s$ to each $[t_i, t_{i+1}]$ is a polynomial of degree $2k - 1$ for $i = 1, \ldots, n - 1$, and a polynomial of degree $k - 1$ on each of the intervals $[a, t_1]$ and $[t_n, b]$.

Normalized B spline Let $t_i \le \cdots \le t_{i+k}$ with $t_i < t_{i+k}$. The $i$th normalized B spline of order $k$ and with knots at $t_i, \ldots, t_{i+k}$ is defined by $N_{i,k}(x) = (t_{i+k} - t_i)[t_i, \ldots, t_{i+k}](\cdot - x)_+^{k-1}$, where the $k$th order divided difference of the truncated $(k - 1)$st power is taken at $t_i, \ldots, t_{i+k}$.

n Widths Let $A$ be a subset of a normed linear space $X$. The Kolmogorov $n$ width of $A$ in $X$ is the quantity $d_n(A; X) = \inf_{X_n} \sup_{x \in A} \inf_{y \in X_n} \|x - y\|$, where $X_n$ is any subspace of $X$ with dimension at most $n$. The linear
$n$ width of $A$ in $X$ is the quantity $\delta_n(A; X) = \inf_{P_n} \sup_{x \in A} \|x - P_n x\|$, where $P_n$ is any set of continuous linear operators of $X$ into itself of rank at most $n$. The Gel’fand $n$ width of $A$ in $X$ is the quantity $d^n(A; X) = \inf_{L_n} \sup_{x \in A \cap L_n} \|x\|$, where $L_n$ is any closed subspace of $X$ with codimension at most $n$. The Bernstein $n$ width of $A$ in $X$ is the quantity $b_n(A; X) = \sup_{X_{n+1}} \sup\{t : tB(X_{n+1}) \subseteq A\}$, where $X_{n+1}$ is any subspace of $X$ with dimension at least $n + 1$ and $B(X_{n+1})$ is the unit ball $\{x \in X_{n+1} : \|x\| \le 1\}$ in $X_{n+1}$.

Padé approximants The $[m, n]$ Padé approximant of a formal power series $c_0 + c_1 z + c_2 z^2 + \cdots$ is the (unique) rational function $p_m/q_n$, where $p_m(z) = a_0 + \cdots + a_m z^m$ and $q_n(z) = b_0 + \cdots + b_n z^n$ are determined by the algebraic quantity $(c_0 + c_1 z + c_2 z^2 + \cdots)(b_0 + \cdots + b_n z^n) - (a_0 + \cdots + a_m z^m) = d_1 z^{m+n+1} + d_2 z^{m+n+2} + \cdots$ for some $d_1, d_2, \ldots$.

Total positivity An $n \times n$ square matrix $A = [a_{ij}]$ is said to be totally positive (TP) if the determinant of $[a_{i_m j_n}]$ is nonnegative for all choices of integers $1 \le i_1 < \cdots < i_k \le n$ and $1 \le j_1 < \cdots < j_k \le n$ and any integer $k = 1, \ldots, n$. It is said to be strictly totally positive (STP) if each of the above determinants is positive. A kernel $K(x, y)$, where $a \le x \le b$ and $c \le y \le d$, is TP or STP if the matrices $[K(x_i, y_j)]$ are TP or STP, respectively, for all choices of $x_i$ and $y_j$ with $a \le x_1 < \cdots < x_m \le b$ and $c \le y_1 < \cdots < y_m \le d$, and all $m = 1, 2, \ldots$.

Wavelets A function, with sufficiently fast decay at infinity, can be used as a wavelet to define the wavelet transform, if its Fourier transform vanishes at the origin.

APPROXIMATION of discrete data, complicated functions, solutions of certain equations, and so on by functions from a simple and usually finite dimensional space is often done in every field of science and technology. Approximation theory, also known as approximations and expansions, serves as an important bridge between pure and applied mathematics and is intimately related to numerical analysis. This field has no well-defined boundary and overlaps with various branches of mathematics. Among the most important areas of study are possibility of approximation, quality of approximation, optimal approximation, families of approximants, approximation schemes, and computational algorithms. Approximation of bounded operators by compact operators on a Banach space and factorization of biinfinite matrices are also important areas in the field of approximation theory. Recent development of approximation theory includes computer-aided geometric design (CAGD) and wavelets.

I. BACKGROUND

Approximation theory, which is often classified as approximations and expansions, is a subject that serves as an important bridge between pure and applied mathematics. Although its origin is traced back to the fundamental work of Bernstein, Chebyshev, Haar, Hermite, Kolmogorov, Lagrange, Markov, and others, this branch of mathematics was not fully established until the founding of the Journal of Approximation Theory in 1968. Now with the rapid advance of digital computers, approximation theory has become a very important branch of mathematics. Two other journals on this subject, Approximation Theory and Its Applications and Constructive Approximation, were founded in late 1984.

There is no well-defined boundary of the field of approximation theory. In fact, it overlaps with both classical and modern analysis, as well as numerical analysis, linear algebra, and even various branches of applied mathematics. We may nevertheless divide it into five areas: (1) possibility of approximation, (2) quality of approximation, (3) optimal approximation, (4) families of approximants, and (5) approximation schemes and computational algorithms. The first area is mainly concerned with density and completeness problems. The typical classical results in this area are the theorems of Stone-Weierstrass, Müntz-Szasz, Mergelian, Korovkin, and others. The subject of qualitative approximation includes mainly the study of degrees of approximations and inverse theorems. Jackson’s theorems and the so-called saturation theorems are typical results in this area. Optimal approximation is probably the backbone of the field of approximation theory. The typical problems posed in this area are existence, uniqueness, characterization, and computation of best approximants, and the most basic result is probably the so-called alternation theorem. This area of mathematics is extremely important and includes the subjects of least-squares methods, minimal projections, orthogonal polynomials, optimal estimation, and Kalman filtering. Depending on the mathematical models or the nature of the approximation problems, certain families of approximants are to be used. The most familiar families of approximants are algebraic polynomials, trigonometric polynomials, rational functions, and spline functions. In this direction, there is a vast amount of literature, including, in particular, the study of total positivity. Unfortunately, most of the beautiful properties of these classical families of approximants do not carry over to the two- and higher-dimensional settings. The important subject of multivariate spline functions, for instance, has been a very active field of research in recent years. Again depending on the approximation problems, appropriate schemes of approximation are used. Computational algorithms are very important for the purpose of
yielding good approximations efficiently. This area of approximation theory includes the subjects of interpolation, least-squares methods, and approximate quadratures.

It must be noted that the separation of the field of approximation theory into five areas should not be taken very strictly since these areas obviously have to overlap. For instance, the elementary and yet important method of least-squares is an approximation scheme with efficient computational algorithms and yields an optimal approximation. In addition, many mathematicians and practitioners may give a broader or even different definition of approximation theory. Hence, only selected topics of this field are treated in this article and the selection of topics and statements of results has to be subjective. In addition, many interesting areas, such as approximation theory in the complex domain, trigonometric approximation and interpolation, numerical approximation and algorithms, inequalities, orthogonal polynomials, and asymptotic approximation, are not included.

II. BEST APPROXIMATION

Approximation theory is built on the theory of best approximation. Suppose that we are given a sequence of subsets $P_1 \subset P_2 \subset \cdots$ in a normed linear space $X$ of functions with norm $\|\cdot\|$. For each function $f$ in $X$, consider the distance

$$E_n(f) = \inf\{\|f - g\| : g \in P_n\}$$

of $f$ from $P_n$. If there exists a function $p_n$ in $P_n$ such that $\|f - p_n\| = E_n(f)$, we say that $p_n$ is a best approximant of $f$ in $X$ from $P_n$. Existence of best approximants is the first basic question in the theory of best approximation. Compactness of $P_n$, for instance, guarantees their existence. The second basic question is uniqueness, and another basic problem is to characterize them. Characterization is fundamental to developing algorithms for the construction of best approximants.

Suppose that the function $f$ that we wish to approximate lies in some subset $G$ of $X$. Let $B$ be the unit ball of all $f$ in $G$ (with $\|f\| \le 1$). Then the quantity

$$E_n(B) = \sup\{E_n(f) : f \in B\}$$

is called the order of approximation of functions in $G$ from $P_n$. Knowing if, and if so how fast, $E_n(B)$ tends to zero is also one of the most basic problems in best approximation.

A. Polynomial and Rational Approximation

Consider the Banach space $C[0, 1]$ with the uniform (or supremum) norm. Let $\pi_n$ be the space of all polynomials of degree at most $n$. Although $\pi_n$ is not compact, a compactness argument still applies to yield the existence of a best approximant to each $f$ in $C[0, 1]$ from $\pi_n$. It is also well known that best approximation from $\pi_n$ has a unique solution. This can be proved by using the following characterization theorem, which is better known as the alternation theorem.

Theorem 1. $p_n$ is a best approximant of $f$ in $C[0, 1]$ from $\pi_n$ if and only if there exist points

$$0 \le x_1 < \cdots < x_{n+2} \le 1$$

such that $(f - p_n)(x_i) = a(-1)^i \|f - p_n\|$ for $i = 1, \ldots, n + 2$, where $a = 1$ or $-1$.

More can be said about the uniqueness of best approximants as stated in the following so-called strong unicity theorem.

Theorem 2. Let $f$ be in $C[0, 1]$ and $p_n$ its best approximant from $\pi_n$. Then there exists a constant $C_f$ such that

$$\|f - g\| \ge E_n(f) + C_f \|p_n - g\|$$

for all $g$ in $\pi_n$.

There are results concerning the behavior of the constant $C_f$ in the literature. All the above results on approximation by algebraic polynomials are also valid for any Chebyshev system.

Results on the orders (or degrees) of approximation by algebraic or trigonometric polynomials are called Jackson’s theorems. Favard also gave sharp constants for best trigonometric polynomial approximation.

Although rational functions do not form a linear space, analogous results on existence, uniqueness, and characterization still hold. Let $R_{m,n}$ be the collection of all functions $p/q$, where $p \in \pi_m$ and $q \in \pi_n$, such that $p$ and $q$ are relatively prime and $q(x) > 0$ for all $x$ on $[0, 1]$. The last condition is not restrictive since, for approximation purposes, $p/q$ must be finite on $[0, 1]$. A more careful compactness argument shows that every function in $C[0, 1]$ has at least one best rational approximant from $R_{m,n}$. The following alternation theorem characterizes best rational approximants. We will use the notation $d(p)$ to denote the degree of a polynomial $p$.

Theorem 3. $p_m/q_n$ is a best approximant of $f$ from $R_{m,n}$ if and only if there exist points

$$0 \le x_1 < \cdots < x_r \le 1$$

where $r = 2 + \max\{m + d(q_n), n + d(p_m)\}$ such that $(f - p_m/q_n)(x_i) = a(-1)^i \|f - p_m/q_n\|$ for $i = 1, \ldots, r$, where $a = 1$ or $-1$.
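
The alternation criterion of Theorem 1 can be checked on a small example. The sketch below, which is an illustration and not part of the original article, takes $f(x) = x^2$ on $[0, 1]$ with $n = 1$ and the candidate $p_1(x) = x - 1/8$, estimates the uniform norm of the error on a finite rational grid (an assumption made for simplicity), and tests the sign-alternating condition at $n + 2 = 3$ points. The helper names `error` and `alternates` are hypothetical.

```python
from fractions import Fraction


def error(x):
    """Error f(x) - p1(x) for f(x) = x^2 and the candidate p1(x) = x - 1/8."""
    return x * x - (x - Fraction(1, 8))


def alternates(points, err, norm):
    """Check the condition of Theorem 1 at the given points:
    err(x_i) == a * (-1)**i * norm for i = 1, 2, ..., with a = 1 or a = -1."""
    values = [err(x) for x in points]
    for a in (1, -1):
        if all(v == a * (-1) ** i * norm
               for i, v in enumerate(values, start=1)):
            return True
    return False


# Estimate the uniform norm of the error on a rational grid of [0, 1].
# This is only a grid maximum; the grid contains the three points used below.
grid = [Fraction(k, 1000) for k in range(1001)]
norm = max(abs(error(x)) for x in grid)

# n + 2 = 3 candidate alternation points for n = 1.
points = [Fraction(0), Fraction(1, 2), Fraction(1)]
is_best = alternates(points, error, norm)
```

Here the error takes the values $1/8$, $-1/8$, $1/8$ at the points $0$, $1/2$, $1$, so the condition of Theorem 1 holds with $a = -1$ and the candidate is the best uniform approximant of $x^2$ from $\pi_1$.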

Again uniqueness of best rational approximants can be shown using the alternation theorem, and in fact an analogous strong unicity result can be obtained. Unfortunately, although rational functions form a larger class than polynomials, they do not improve the orders of approximation in general. For instance, the approximation order of the class of all lip $\alpha$ functions by $R_{nn}$ is $O(1/n^{\alpha})$, which is the same as the order of approximation from $\pi_n$. However, while best approximation from $\pi_n$ is saturated [e.g., $E_n(f) = O(1/n^{\alpha})$ for $0 < \alpha < 1$ if and only if $f \in$ lip $\alpha$, and $E_n(f) = O(1/n)$ if and only if $f$ is in the Zygmund class], this is certainly not the case in rational approximation. A typical example is that the order of best approximation of the important function $|x - \tfrac{1}{2}|$ from $R_{nn}$ is $O(e^{-\sqrt{n}})$, which is much better than $O(1/n)$.

B. Best Local Approximation

Because of the alternation properties of best polynomial and rational approximants, these approximants are believed to behave like Taylor polynomials and Padé approximants, respectively, when the interval of best approximation is small. This leads to the study of best local approximation.

Suppose we are given a set of discrete data $\{a_{jk}\}$, which are the values of some (unknown) function $f$ and its derivatives at the sample points $x_1, \ldots, x_n$, say

$$a_{jk} = f^{(j)}(x_k), \quad k = 1, \ldots, n$$

The problem is to give a “best” approximation of $f$ using the available discrete data from some class $P$ of functions ($\pi_n$, $R_{mn}$, etc.).

Let us assume, for the time being, that $f$ is known in a neighborhood of the sample points. It will be clear that the best local approximant of $f$ (or more precisely of the data $\{a_{jk}\}$) does not depend on $f$, except on its interpolatory conditions mentioned above. For each $k = 1, \ldots, n$, let $V_k = x_k + \varepsilon_k A_k$, where $\varepsilon_k = \varepsilon_k(\delta) > 0$ and tends to zero as $\delta \to 0$ and $A_k = A_k(\delta)$, be uniformly bounded measurable sets with unit measure, and let $V(\delta)$ be the union of $V_1, \ldots, V_n$. Now, for each $\delta$, let $g_{\delta} = g_{\delta, p}$ be a best approximant of $f$ from $P$ in $L_p[V(\delta)]$, where $1 \le p \le \infty$ (and $L_{\infty}$ will be replaced by $C$, as usual). Then if $g_{\delta} \to g_0$ as $\delta \to 0$, we say that $g_0$ is a best $L_p$ local approximant of $f$ or of the data $\{a_{jk}\}$.

We will discuss only the one-variable setting, although some of these results have multivariate extensions. We will also say that an $n$ tuple $(i_1, \ldots, i_n)$ of integers is $p$-balanced (or simply balanced), if for each $j$ with $i_j > 0$,

$$\sum_{k=1}^{n} \varepsilon_k^{i_k + 1/p} = O\bigl(\varepsilon_j^{i_j - 1 + 1/p}\bigr)$$

as $\delta \to 0$. If this $n$ tuple is balanced, then their sum is called a balanced integer. There is an algorithm to generate all balanced integers.

Theorem 4. Let $(i_1, \ldots, i_n)$ be balanced, $N$ its sum, and $P_N$ a fully interpolating set of functions in $L_p[V(\delta)]$, in the sense that, for any discrete data $\{a_{jk}\}$, there exists a unique $g$ in $P_N$ such that

$$g^{(j)}(x_k) = a_{jk}$$

for $j = 0, \ldots, i_k - 1$ and $k = 1, \ldots, n$. Then for any $f$ that is $i_k$-times differentiable around $x_k$, $k = 1, \ldots, n$, the best local $L_p$ approximant of $f$ from $P_N$ exists and is the unique function $g_0$ that satisfies the Hermite interpolation condition

$$(f - g_0)^{(j)}(x_k) = 0$$

for $j = 0, \ldots, i_k - 1$ and $k = 1, \ldots, n$. Conversely, to any $n$ tuple of positive integers $(c_1, \ldots, c_n)$ with sum $M$, there is some scaling $\varepsilon_k(\delta) > 0$ such that this $n$ tuple is balanced with respect to $\{\varepsilon_k(\delta)\}$, $k = 1, \ldots, n$; consequently, any Hermite interpolation is a best $L_p$ local approximation scheme.

The general case where $(i_1, \ldots, i_n)$ is not balanced is much more delicate, and the best result in this direction in the literature is stated as follows:

Theorem 5. Suppose $(i_1, \ldots, i_n)$ is not balanced and $N$ is its sum. Assume that each $A_k$ is either an interval for each $\delta$ or is independent of $\delta$. Also assume that

$$\varepsilon_k(\delta)^{i_k + 1/p} \Bigl[\sum_{j=1}^{n} \varepsilon_j(\delta)^{i_j + 1/p}\Bigr]^{-1} = e_k + o(1)$$

as $\delta \to 0$, where $i_1 + \cdots + i_n$ is the largest balanced integer that does not exceed $N$. Let $J_A(i, p)$ denote the minimum $L_p$ norm on a measurable set $A$ of the polynomial in $\pi_i$ with unit leading coefficient. Suppose that $f$ is $i_k$-times differentiable around $x_k$ for $k = 1, \ldots, n$ and $1 < p \le \infty$. Then the best $L_p$ local approximant of $f$ exists and is the unique solution of the constrained $l_p$ minimization problem:

$$\min_{g \in P_N} \Bigl\| \Bigl\{ e_k J_{A_k}(i_k, p) \, \frac{(f - g)^{(i_k)}(x_k)}{i_k!} \Bigr\} \Bigr\|_{l_p}$$

subject to

$$(f - g)^{(j)}(x_k) = 0, \quad j = 0, \ldots, i_k - 1; \; k = 1, \ldots, n$$

We remark that if $A_k$ is an interval, then the subscript $A_k$ of $J$ can be replaced by $[0, 1]$ and that for $p = 1$, the $l_1$ minimization may not have a unique solution, but if it does, it is the best $L_1$ local approximant of $f$.

In the special case when there is only one sample point, say the origin, then the best $L_{\infty}$ local approximant to a
sufficiently smooth function $f$ from $R_{mn}$ is the $[m, n]$ Padé approximant of $f$. We remark, however, that the convergence of the net of best uniform approximants on small intervals to the Padé approximant is in general not uniform, but only in measure.

C. Padé Approximants

As just mentioned, Padé approximants are best local approximants and include Taylor polynomials, which are just $[m, 0]$ approximants. What is interesting is that Padé approximants to a formal power series are defined algebraically, so that they are certainly well defined even if the series has zero radius of convergence. In this case, taking suitable limits of the approximants on the Padé table can be considered a summability scheme of the divergent series. The most interesting example is, perhaps, the Stieltjes series,

$$a_0 + a_1 z + a_2 z^2 + \cdots$$

where

$$a_n = \int_0^{\infty} t^n \, d\mu$$

with $\mu$ being a positive measure on $[0, \infty)$. It has been shown that the diagonal Padé approximants $[n + j, n]$, where $j$ is fixed, converge uniformly, in each compact set in the complex plane that does not intersect the interval $[0, \infty)$, to the Stieltjes integral

$$\int_0^{\infty} \frac{d\mu}{1 - zt}$$

provided that the coefficients $a_n$ do not tend to infinity too fast, say

$$a_n = O[(2n + 1)! \, R^{2n}]$$

for some $R > 0$. A simple example is $a_n = n!$, where $d\mu = e^{-t} \, dt$.

It has been proved, however, that the Padé approximants to a Stieltjes series diverge along any ray of the Padé table that is not parallel to the diagonal.

It is also interesting that, while Padé approximants along the diagonals behave so nicely for Stieltjes series, they do not necessarily converge even for some entire functions. One reason is that the distribution of the poles of the approximants cannot be controlled. Suppose that

$$f(z) = a_0 + a_1 z + \cdots$$

is actually a meromorphic function. Then along each row of the Padé table, where the denominators of the approximants have a fixed degree $n$, the poles of $f$ usually attract as many poles of the approximants as possible. Results of this type are related to a classical theorem of de Montessus de Ballore. The following inverse result has been obtained. Let $q_{mn}$ denote the denominator of the $[m, n]$ Padé approximant of $f$. Since $n$ is fixed, we use any norm that is equivalent to a norm of the $(n + 1)$-dimensional space $\pi_n$.

Theorem 6. Suppose that there exist nonzero complex numbers $z_i$ and positive integers $p_i$ with $p_1 + \cdots + p_k = n$, such that

$$\limsup_{m \to \infty} \Bigl\| q_{mn}(z) - \prod_{i=1}^{k} (z - z_i)^{p_i} \Bigr\|^{1/m} = r < 1$$

Then $f$ is analytic in

$$|z| < r^{-1} \max\{|z_j| : j = 1, \ldots, k\}$$

with an exception of the poles at $z_1, \ldots, z_k$ with multiplicities $p_1, \ldots, p_k$, respectively.

Without the geometric convergence assumed in Theorem 6, the following inverse result still holds.

Theorem 7. Suppose that

$$\lim_{m \to \infty} q_{mn}(z) = \prod_{i=1}^{n} (z - z_i)$$

where $0 < |z_1| \le \cdots \le |z_{n-1}| < |z_n|$. Then $f$ is analytic in $|z| < |z_n|$ with an exception of the poles at $z_1, \ldots, z_{n-1}$, counting multiplicities, and has a singular point at $z_n$.

Let us return to the diagonal approximants. In particular, consider the main diagonal of the Padé table: $r_n = p_n/q_n$. Since there exist entire functions whose diagonal Padé approximants $r_n$ diverge everywhere in the complex plane except at the origin, it is interesting to investigate the possibility of convergence when the poles of $r_n$ can be controlled. In this direction, the following result has been obtained.

Theorem 8. Let $f(z) - r_n(z) = bz^{2n+1} + \cdots$ for each $n = 1, 2, \ldots$. Suppose that $r_n$ is analytic in $|z| < R$ for all sufficiently large $n$. Then $\{r_n\}$ converges uniformly on every compact subset of $|z| < R$ to $f$, so that $f$ is analytic in $|z| < R$.

It should be noted that no a priori analytic assumption was made on $f$. This shows that, once the location of the poles of the Padé approximants can be estimated, we can make a conclusion about the analyticity of the formal power series. This result actually holds for a much more general domain in the complex plane.

D. Incomplete Polynomials

Loosely speaking, if certain terms in an algebraic polynomial do not appear, it is an incomplete polynomial. We
shall discuss best approximations by incomplete polynomials in $C[0, 1]$. Let $H$ be a finite set of nonnegative integers, and consider the best approximation problem
\[ E(f, H) = \inf_{c_i} \Bigl\| f - \sum_{i \in H} c_i x^i \Bigr\| \]
Note that $E(f, H) = E_n(f)$ if $H = \{0, \dots, n\}$. We also denote
\[ E_n^k(f) = E(f, \{0, \dots, k - 1, k + 1, \dots, n\}) \]
for any $0 \le k \le n$. The following result has been obtained:

Theorem 9. Let $0 < c < 1$. There exists a positive constant $M_k$ independent of $n$ such that, if $f$ is in $C^k[0, 1]$ and $f^{(k)}(c) \ne 0$, then
\[ E_n^k(f) \ge M_k n^{-k} [\,|f^{(k)}(c)| + o(1)\,] \]
as $n \to \infty$.

Hence, since $E_n(f)$ is of order
\[ O\bigl(n^{-k} E_n(f^{(k)})\bigr) = o(n^{-k}) \]
we have $E_n^k(f)/E_n(f) \to \infty$ for each $f$ in $C^k[0, 1]$ whose $k$th derivative does not vanish identically in $[0, 1]$. So even one term makes some difference in studying the order of approximation.

If only finitely many terms are allowed, the natural question is which are the optimal ones. Let $0 < n < N$ and $H_n = \{t_1, \dots, t_n\}$ be a set of integers with $0 \le t_1 < \cdots < t_n < N$. The problem is to find a set $H_n$, which we call an optimal set, denoted by $H_n^*$, such that
\[ E(f, H_n^*) = \inf\{E(f, H_n) : H_n\} \]
An analogous problem for trigonometric approximation has interesting applications to antenna design where certain frequencies are to be received by using a specified number of antennas. Unfortunately, the general problem is unsolved. We have the following result:

Theorem 10. For $f(x) = x^M$, where $M \ge N$, $H_n^* = \{N - n, \dots, N - 1\}$ and is unique.

This result can be generalized to a Descartes system and even holds for any Banach function space with monotone norm. In a different direction, the analogous problem of approximating a given incomplete polynomial by a monomial has also been studied.

Next, consider $H_n = \{0, t_1, \dots, t_n\}$, where $0 < t_1 < t_2 < \cdots$ with
\[ \sum_{i=1}^{\infty} \frac{1}{t_i} = \infty \]
and set $E_n(f) = E(f, H_n)$. The so-called Müntz–Szász theorem guarantees that $E_n(f) \to 0$ for every $f$ in $C[0, 1]$. Hence, the order of approximation is of interest. Results in this area are called Müntz–Jackson theorems.

We end our discussion of incomplete polynomials by considering the density of the polynomials
\[ p_{n,a}(x) = \sum_{k=m(a)}^{n} c_k x^k, \qquad m(a) \ge na \]
where $0 < a < 1$.

Theorem 11. If $p_{n,a}$ is uniformly bounded, then
\[ \lim_{n \to \infty} p_{n,a}(x) = 0, \qquad 0 \le x < a^2 \]
and the convergence is uniform on every interval $[0, r]$, where $r < a^2$.

Thus, the best we can do is to consider approximation on $[a^2, 1]$:

Theorem 12. Let $f \in C(a^2, 1)$. Then there exists a sequence of polynomials $p_{n,a}$ such that
\[ \lim_{n \to \infty} p_{n,a}(x) = f(x) \]
uniformly on $[r, 1]$ for any $r > a^2$.

E. Chebyshev Center

Suppose that, instead of approximating a single function $f$ in a normed linear space $X$ from a subset $P$ of $X$, we are interested in approximating a bounded set $A$ of functions in $X$ simultaneously from $P$. Then the order of approximation is the quantity
\[ r_P(A) = \inf\Bigl\{ \sup_{f \in A} \|f - g\| : g \in P \Bigr\} \]
Of course, if $P = P_n$ and $A$ is a singleton $\{f\}$, then $r_P(A)$ reduces to $E_n(f)$, introduced at the beginning of this section. In general, $r_P(A)$ is called the Chebyshev radius of $A$ with respect to $P$. A best simultaneous approximant $x_0$ of $A$ from $P$, defined by
\[ \sup_{f \in A} \|f - x_0\| = r_P(A) \]
is called a Chebyshev center (or simply center) of $A$ with respect to $P$. We denote the (possibly empty) set of such $x_0$ by $c_P(A)$. In particular, we usually drop the indices if $P = X$; that is,
\[ r(A) = r_X(A) \quad \text{and} \quad c(A) = c_X(A) \]
which are simply called the Chebyshev radius and the set of centers of $A$, respectively.

Most of the literature in this area is concerned with the existence problem; namely, whether $c_P(A)$ is nonempty. If $c_P(A) \ne \emptyset$ for all bounded nonempty sets $A$ in $X$, we
say that $P$ admits centers. A classical result is that if $X$ is the range of a norm-one projection defined on its second dual, then $X$ admits centers. In addition, if $X$ admits centers and $P$ is a norm-one complemented subspace, then $P$ also admits centers. An interesting example is the Banach space of real-valued continuous functions on a paracompact space. Not only does it admit centers; the set of centers of any bounded nonempty set $A$ of functions in this space is also relatively easy to describe in terms of $A$. If $X$ is a uniformly rotund Banach space, then we even have unique best approximants in the sense that $c(A)$ is a singleton for any bounded nonempty set $A$ in $X$. The following result has been obtained:

Theorem 13. Let $X$ be a Banach lattice such that the norm is additive on the positive cone, and $Q$ be a positive linear nonexpansive mapping of $X$ into itself. Then the set of fixed points of $Q$ admits centers.

The proof of this result depends on the fact that the Banach lattice $X$ described here admits centers itself.

Now suppose that $X$ is a Hilbert space and $B$ its unit ball. Then a hypercircle in $X$ is the (nonempty) intersection of some translate of a closed linear subspace of $X$ with $rB$ for some $r > 0$.

Theorem 14. The Chebyshev center of any hypercircle in a Hilbert space is the unique element of minimum norm.

This result is basic in the study of optimal estimation.

F. Optimal Estimation and Recovery

Suppose that a physical object $u$ is assumed to be adequately modeled by an unknown element $x_u$ of some normed linear space $X$, and observations or measurements are taken of $u$ that give a limited amount of information about $x_u$. The data accumulated, however, are inadequate to determine $x_u$ completely. The optimal estimation problem is to approximate $x_u$ from the given data in some optimal sense. Let us assume that the data describe some subset $A$ of $X$. Then the maximum error in estimating $x_u$ from $A$ is
\[ \sup_{f \in A} \|f - x_u\| \]
In order for this quantity to be finite it is necessary and sufficient that $A$ is a bounded subset of $X$, and we shall make that assumption. To optimize our estimate with only the additional knowledge that $x_u$ must lie in some subset $P$ of $X$ we minimize the error, and this leads to the Chebyshev radius
\[ r_P(A) = \inf\Bigl\{ \sup_{f \in A} \|f - x\| : x \in P \Bigr\} \]
of $A$ with respect to the "model" set $P$. A (Chebyshev) center, if it exists, is an optimal estimation of $x_u$.

A related but somewhat different problem in best approximation is optimal recovery. Let $x$ be in a normed linear space $X$, and $U$ an operator from $X$ into another normed linear space $Z$. The problem is to recover $Ux$ from a limited amount of information about $x$. Let $I$ be the information operator mapping $X$ into another normed linear space $Y$. Suppose $Ix$ is known. To estimate $Ux$, we need an operator $A$ from $Y$ to $Z$, so that $AIx$ approximates $Ux$. $A$ is called an algorithm.

Usually, we assume that $x$ is restricted to a balanced convex subset $K$ of $X$. Here, $K$ is said to be balanced if $v \in K$ implies that $-v \in K$. If the information $Ix$ of $x$ may not be precise, say with some tolerance $\varepsilon \ge 0$, then an algorithm $A$ that is used to recover $Ux$ has the maximum error
\[ E(A) = \sup\{ \|Ux - Ay\| : x \in K,\ \|Ix - y\| \le \varepsilon \} \]
Hence, to recover $Ux$ optimally we have to determine an algorithm $A^*$, called an optimal algorithm, such that
\[ E(A^*) = \inf_A E(A) \]
It should be remarked that an optimal algorithm may be nonlinear, although linear optimal algorithms exist in a fairly general situation. A lower bound of $E(A^*)$ usually helps, and one is given by
\[ E(A^*) \ge \sup\{ \|Ux\| : x \in K,\ \|Ix\| \le \varepsilon \} \]
There has been some progress in optimal recovery of nonlinear operators, and results on the Hardy spaces have also been obtained. It is not surprising that (finite) Blaschke products play an essential role here in the $H^p$ problems.

G. n Widths

Let $A$ be a set in a normed linear space $X$ that we wish to approximate from an $n$-dimensional subspace $X_n$ of $X$. Then the maximum error (or order of approximation if $A$ is the unit ball) is the quantity
\[ E(A, X_n) = \sup_{x \in A} \inf\{ \|x - y\| : y \in X_n \} \]
It was Kolmogorov who proposed finding an optimal $n$-dimensional subspace $\bar{X}_n$ of $X$. By this, we mean
\[ E(A, \bar{X}_n) = \inf_{X_n} E(A, X_n) \]
where the infimum is taken over all subspaces $X_n$ of $X$ with dimension at most $n$. This optimal quantity is called the Kolmogorov $n$ width of $A$ and is denoted by
$d_n(A) = d_n(A; X)$. Knowing the value of $d_n(A)$ is important in identifying an optimal (or extremal) approximating subspace $\bar{X}_n$ of $X$.

Replacing the distance of $x \in A$ from $X_n$ in Kolmogorov's definition by the distance of $x$ from its image under a continuous linear operator $P_n$ of rank at most $n$ of $X$ into itself yields the notion of the linear $n$ width of $A$ in $X$, defined by
\[ \delta_n(A) = \delta_n(A; X) = \inf_{P_n} \sup_{x \in A} \|x - P_n x\| \]
It is clear from the two definitions that $\delta_n(A) \ge d_n(A)$.

A dual to the Kolmogorov $n$ width is the Gel'fand $n$ width defined by
\[ d^n(A) = d^n(A; X) = \inf_{L_n} \sup_{x \in A \cap L_n} \|x\| \]
where $L_n$ is a closed subspace of $X$ with codimension at most $n$. It has also been shown that $d^n(A)$ is bounded above by the linear $n$ width $\delta_n(A)$. To provide a lower bound for both $d_n(A)$ and $d^n(A)$, the Bernstein $n$ width defined by
\[ b_n(A) = b_n(A; X) = \sup_{X_{n+1}} \sup\{ t : tB(X_{n+1}) \subseteq A \} \]
could be used. Here, $X_{n+1}$ is a subspace of $X$ with dimension at least $n + 1$, and $B(X_{n+1})$ is the unit ball in $X_{n+1}$.

An important setting is that $A$ is the image of the unit ball in a normed linear space $Y$ under some $T \in L(Y, X)$, the class of all continuous linear operators from $Y$ to $X$. Although the set $A$ is not $TY$, the commonly used notations in this case are $d_n(TY; X)$, $\delta_n(TY; X)$, $d^n(TY; X)$, and $b_n(TY; X)$. Let $C(Y, X)$ be the class of all compact operators in $L(Y, X)$. Then the duality between $d_n$ and $d^n$ can be seen in the following theorem:

Theorem 15. Let $T \in L(Y, X)$ and $T^* \in L(X^*, Y^*)$ be its adjoint. Suppose that $T \in C(Y, X)$ or that $X$ is a reflexive Banach space. Then
\[ d_n(TY; X) = d^n(T^* X^*; Y^*) \]
and
\[ \delta_n(TY; X) = \delta_n(T^* X^*; Y^*) \]

Let us now consider some important examples. First, let $A$ be the unit ball $B_p^k$ of functions in the Sobolev space $H_p^k = H_p^k[0, 1]$ with $\|f^{(k)}\|_p \le 1$, where $1 \le p \le \infty$. Also, let $q$ be the conjugate of $p$. The following result describes the asymptotic estimates of $d_n$, $d^n$ and $\delta_n$. We shall use the notation $d_n \approx n^s$ to mean $C_1 n^s \le d_n \le C_2 n^s$ for some positive constants $C_1$ and $C_2$ independent of $n$.

Theorem 16. Let $k \ge 2$. Then
\[ d_n(B_p^k; L_r) \approx \begin{cases} n^{-k}, & 1 \le r \le p \le \infty \ \text{or}\ 2 \le p \le r \le \infty \\ n^{-k+1/p-1/2}, & 1 \le p \le 2 \le r \le \infty \\ n^{-k+1/p-1/r}, & 1 \le p \le r \le 2 \end{cases} \]
\[ d^n(B_p^k; L_r) \approx \begin{cases} n^{-k}, & 1 \le r \le p \le \infty \ \text{or}\ 1 \le p \le r \le 2 \\ n^{-k+1/2-1/r}, & 1 \le p \le 2 \le r \le \infty \\ n^{-k+1/p-1/r}, & 2 \le p \le r \le \infty \end{cases} \]
and
\[ \delta_n(B_p^k; L_r) \approx \begin{cases} n^{-k}, & 1 \le r \le p \le \infty \\ n^{-k+1/p-1/r}, & 1 \le p \le r \le 2 \ \text{or}\ 2 \le p \le r \le \infty \\ n^{-k+1/p-1/2}, & 1 \le p \le 2 \le r \le \infty \ \text{and}\ q \ge r \\ n^{-k+1/2-1/r}, & 1 \le p \le 2 \le r \le \infty \ \text{and}\ q \le r \end{cases} \]

For the discrete case $R^m$, let $B_p$ denote the unit ball using the $l_p(R^m)$ norm, and set $X = l_\infty(R^m)$. The following estimates have been obtained:

Theorem 17. There exist constants $C$ and $C_a$ independent of $m$ and $n$ such that
\[ d_n(B_1; X) \le C m^{1/2} / n \qquad (1) \]
\[ d_n(B_1; X) \le 2[(\ln m)/n]^{1/2} \qquad (2) \]
\[ d_n(B_1; X) \le C_a n^{-1/2} \quad \text{if } n < m < n^a \qquad (3) \]
1/2
n
dn (B1 ; X ) ≤ C 1 + ln n (4)
m
\[ d_n(B_1; X) \le 8\, \frac{\ln m}{\ln n}\, n^{-1/2} \qquad (5) \]
\[ d_n(B_1; X) \ge \Bigl[ 1 + \frac{(m - 1)n}{m - n} \Bigr]^{-1/2} \qquad (6) \]
\[ d_n(B_2; X) \le C \Bigl( 1 + \ln \frac{m}{n} \Bigr)^{3/2} n^{-1/2} \qquad (7) \]

III. BIVARIATE SPLINE FUNCTIONS

Spline functions in one variable have proved to be very rich in theory and extremely useful in applications. However, not much was done in the multivariable setting until the 1980s, and at this writing, many basic questions concerning dimensions, bases, minimum supported elements, interpolation, approximation order, shape-prescribed or
shape-preserving approximation schemes, computational algorithms, and so on are still unanswered. In fact, many of the very fundamental problems do not seem to have satisfactory solutions. This is an area in approximation theory that requires much research effort. In this section we discuss only some selected basic results in the two-variable setting, although most of the results here could be obtained in higher dimensions. A different aspect of the multivariable theory is discussed in Section VIII.

Since spline functions in one variable (or univariate splines) are piecewise polynomials separated by points, spline functions in two variables (or bivariate splines) are piecewise polynomials separated by curves. If the bivariate splines are to be continuous, these curves, which we call grids, are necessarily algebraic. In fact, if the restrictions of $s$ to $D_1$ and $D_2$ are polynomials $p_1$ and $p_2$, respectively, and $D_1$, $D_2$ are separated by an (algebraic) curve $C$ represented by an irreducible polynomial equation $l(x, y) = 0$, then a necessary and sufficient condition that $s$ is in $C^k$ on $D_1 \cup C \cup D_2$ is the existence of a polynomial $q_{12}$ with $p_1 - p_2 = q_{12}\, l^{k+1}$. Since the factor $l^{k+1}$ determines the smoothing condition of the bivariate spline $s$, it is called the smoothing factor of $s$ across $C$, and the polynomial $q_{12}$ is called a smoothing cofactor of $s$. Note that $q_{21} = -q_{12}$ and $q_{12}$ uniquely determines $s$ on $D_2$ if $s$ is already known on $D_1$. If $C_1, \dots, C_n$ are algebraic curves with a common point of intersection $A$, called a vertex, and separate the domains $D_1, \dots, D_n$, and if the restriction of a $C^k$ bivariate spline $s$ on $D_j$ is a polynomial $p_j$, then the smoothing cofactors of $s$ across $C_1, \dots, C_n$ must satisfy the following conformality condition:
\[ q_{12} (l_1)^{k+1} + \cdots + q_{n-1,n} (l_{n-1})^{k+1} + q_{n,1} (l_n)^{k+1} = 0 \]
where $l_j(x, y) = 0$ is an irreducible polynomial equation of $C_j$. Hence, a bivariate spline $s$ on a region $D$ in $R^2$ that is partitioned by algebraic curves is uniquely determined by all smoothing cofactors that satisfy the conformality conditions around each interior vertex as long as one polynomial piece is prescribed. This simple observation allows us to study the dimensions of bivariate spline spaces and to construct basis elements, particularly locally supported bivariate splines. However, even the dimension on a very simple grid partition is "unstable" in the sense that a perturbation would change the dimension. Among the "stable" bivariate spline spaces are those with ray (or quasi-cross-cut) partitions. Let $D$ be a simply connected domain in $R^2$. A ray partition of $D$ is one in which each (polynomial) curve $C$ has at least one end point that lies on the boundary of $D$. The other end point must either be an interior vertex or also lie on the boundary of $D$. In the latter case, $C$ is called a cross-cut of $D$. Suppose that there are $m$ cross-cuts with (irreducible) degrees $d_1, \dots, d_m$ and $n$ interior vertices $A_1, \dots, A_n$, such that the rays through $A_i$ are represented by irreducible polynomials $l_{i,1}, \dots, l_{i,n_i}$ with degrees $f_{i,1}, \dots, f_{i,n_i}$, respectively. Then we have the following result on the dimension of the space $S_d^k$ of all $C^k$ bivariate splines on $D$ with degree $d$ and the given grid partition.

Theorem 18.
\[ \dim S_d^k = \binom{d + 2}{2} + \sum_{j=1}^{m} \binom{d - (k + 1)d_j + 2}{2}_{+} + \sum_{j=1}^{n} b_j \]
where $b_j = \dim V_j$ and
\[ V_j = \Bigl\{ (q_1, \dots, q_{n_j}) : q_i \in \pi_{d-(k+1)f_{j,i}},\ \sum_{i=1}^{n_j} q_i (l_{j,i})^{k+1} = 0 \Bigr\} \]

However, the dimension $b_j$ is difficult to determine in general. In the particular case where all $f_{j,i}$ are equal to 1, we have, using the notation $N = n_j$,
\[ b_j = \dim V_j = \frac{1}{2} \Bigl( d - k - \Bigl[ \frac{k + 1}{N - 1} \Bigr] \Bigr)_{+} \times \Bigl( (N - 1)d - (N + 1)k + (N - 3) + (N - 1) \Bigl[ \frac{k + 1}{N - 1} \Bigr] \Bigr) \]
with $[x]$ denoting the integral part of $x$.

A basis of the above bivariate spline space can be constructed, but there may not be any locally supported functions in general. Much work has been done in the three- and four-directional meshes. Let us assume that two of the directions in each case are parallel to the $x$ and $y$ axes, and to be more specific let the horizontal and vertical grids be $x = i$ and $y = i$ with $i \in Z$. Then a three-directional mesh is a unidiagonal triangulation and a four-directional mesh is a crisscross triangulation of a uniform rectangular partition. They are also called type I and type II triangulations, respectively. Even in these special cases many interesting and unexpected results have been obtained. For instance, minimum-supported splines may be linearly dependent, some functions in $S_d^k$ cannot be locally reproduced, and even the optimal approximation orders are somewhat unexpected.

We discuss only the three-directional mesh here. The approximation order of $S_d^k$ is an integer $m$ such that $\mathrm{dist}(f, S_h) = O(h^m)$ for all sufficiently smooth functions $f$ and $\mathrm{dist}(g, S_h) \ne o(h^m)$ for some $C^\infty$ function $g$. Here, $S_h$ is simply the space of all $s_h(x, y) = s(x/h, y/h)$, where
$s \in S_d^k$. Let $m(d) = \min\{2(d - k), d + 1\}$. It is known that, if $k \le (2d - 2)/3$, then $m(d) - 2 \le m \le m(d)$. For instance, for $S_3^1$, $m(d) = 4$ and $m$ is known to be 3. It is also known that if $k > (2d - 2)/3$, then $m = 0$. These results were proved using box splines, which are discussed in Section VIII.

While the exact value of $m$ is still unknown even for the three-directional mesh that we discuss here, the controlled approximation order with respect to box splines has been determined. Let $\{B^1, \dots, B^N\}$ be the collection of all box splines in $S_d^k$. Then the controlled approximation order of $S_h$ with respect to box splines is the largest integer $n$ such that
\[ \Bigl\| f - \sum_{i=1}^{N} \sum_{j \in Z^2} w_i^h(j)\, B^i\Bigl( \frac{\cdot}{h} - j \Bigr) \Bigr\|_\infty \le C h^n \|D^n f\|_\infty \]
for some sequence $\{w_i^h(j)\}$ satisfying
\[ \|w_i^h(\cdot)\|_{l_\infty} \le C \|f\|_\infty, \qquad i = 1, \dots, N \]
where $C$ is an absolute constant, and $f$ is any $C^\infty$ function. In Section VIII, it will be seen that $\infty$ can be replaced by any $p$, $1 \le p \le \infty$, even when the controlled condition on the weights is replaced by a "local" one.

Theorem 19. Let $S_h$ be the scaled three-directional mesh of $S_d^k$. Then the controlled approximation order of $S_h$ with respect to all box splines of $S_d^k$ is

$2d - 2k$ for $2d - 3k = 2$ (1)
$2d - 2k - 1$ for $2d - 3k = 3$ or $4$ (2)
$d + 1$ for $k = 0$ (3)
$\min\{2d - 2k - 2, d\}$ for $k \ge 1$ and $2d - 3k \ge 5$ (4)

However, the controlled approximation order of $S_h$ with respect to all minimum-supported splines is still unknown.

When the rectangular partitions are nonuniform, the unidiagonal and crisscross triangulations are no longer three- and four-directional meshes. The dimensions of these spaces are still unknown except for the cases when $d - k$ is sufficiently large or when $k = 1, 2$. It is also known that, while the support of the $S_2^1$ minimum-supported bivariate splines on the crisscross triangulation is independent of the uniformity of the rectangular partition, the corresponding statement for $S_3^1$ on the unidiagonal triangulation is false. In fact, even the support of the box splines, which are larger in this case, is not preserved under nonuniform perturbation of the rectangular partition, and if one minimum-supported spline in the perturbed case here is used, the totality of all its "translates" does not necessarily produce constants. There is still no general result in the nonuniform setting.

When a triangular grid partition is considered, it is usually more convenient to use Bézier (or Bernstein) representations of the polynomial pieces. Smoothing conditions on the adjacent polynomial pieces are expressed in terms of certain relations on the Bézier coefficients. Many interesting formulas have been recently obtained. These formulas have applications to constructing Hermite interpolants and quasi interpolants to scattered data and also have nice applications to computer-aided geometric designs.

IV. COMPACT OPERATORS AND M IDEALS

Since the mid-1970s, a branch of approximation theory called operator approximation has come into vogue. This area is concerned with the approximation of an operator from a family of operators with some nice structures, namely, positive operators, self-adjoint operators, normal operators, compact operators, or operators with more than one of these properties. Since the majority of the research papers in this area have been concerned with compact operator approximation, we limit ourselves to the discussion of this topic. Let $B(X)$ denote the class of bounded linear operators on a Banach space $X$, and $C(X)$ the subcollection of compact operators on $X$. The general question of interest is the existence of a best approximant to a given operator $T$ in $B(X)$ from $C(X)$. If every $T$ in $B(X)$ has at least one best approximant from $C(X)$, we say that $C(X)$ is proximinal in $B(X)$. It is well known that, if $X$ is a Hilbert space, then $C(X)$ is proximinal in $B(X)$. The following result on Banach spaces is worth stating:

Theorem 20. Let $1 < p < \infty$. Then $C(l_p)$ is proximinal in $B(l_p)$. However, $C(X)$ is not proximinal in $B(X)$ for $X = C[0, 1]$ or $X = L_p[0, 1]$, $1 < p < \infty$, $p \ne 2$.

Nevertheless, certain bounded linear operators in $B(L_p)$ may still have best compact approximants. Let $B_1(L_p)$ be the collection of $T$ in $B(L_p)$ such that $\|T x_n\|_p \to 0$ for every uniformly bounded weakly null sequence $\{x_n\}$ in $L_p$. Also, let $B_2(L_p)$ be the integral operators in $B_1(L_p)$. The following result holds:

Theorem 21. $C(L_p)$ is proximinal in $B_1(L_p)$ for $2 < p < \infty$ and is proximinal in $B_2(L_p)$ for $1 < p < \infty$.

If $H$ is a Hilbert space, every Hankel operator in $B(H)$ has a best compact Hankel approximant. Using some function theoretic arguments, the following result can then be obtained. Here, $H^\infty$ will denote the Hardy space of bounded analytic functions and $C$ the space of continuous functions.
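
Before that result is stated, the Hilbert-space case of proximinality mentioned above can be made concrete with a diagonal operator $T = \mathrm{diag}(d_1, d_2, \dots)$ on $l_2$ with real entries. It is a standard fact, not stated in this article, that the distance from such a $T$ to the compact operators equals $r = \limsup |d_j|$, and that clipping every entry to $[-r, r]$ leaves a remainder whose entries tend to zero. The following Python sketch checks this on a finite truncation. The sample sequence $d_j = 1 + 1/j$, the truncation length, and all function names are assumptions chosen only for illustration.

```python
def clip(value, radius):
    """Clamp a real number to the interval [-radius, radius]."""
    return max(-radius, min(radius, value))


def split_diagonal(diagonal, radius):
    """Split each diagonal entry d_j into k_j + e_j with |e_j| <= radius.

    k_j is the entry of the candidate compact approximant K and
    e_j is the entry of the error T - K.
    """
    errors = [clip(d, radius) for d in diagonal]
    compact = [d - e for d, e in zip(diagonal, errors)]
    return compact, errors


def tail_sup(sequence, start):
    """Largest absolute value among the entries from index `start` onward."""
    return max(abs(v) for v in sequence[start:])


# Illustrative data (an assumption): d_j = 1 + 1/j, so limsup |d_j| = 1.
LENGTH = 2000
diagonal = [1.0 + 1.0 / j for j in range(1, LENGTH + 1)]
radius = 1.0

compact, errors = split_diagonal(diagonal, radius)

error_norm = max(abs(e) for e in errors)   # norm of T - K on the truncation
compact_tail = tail_sup(compact, 1000)     # entries of K decay like 1/j
operator_tail = tail_sup(diagonal, 1000)   # tails of T never drop below r
```

In this model the error norm stays at $r$ while the entries of $K$ tend to zero, so $K$ is compact and attains the distance. The best approximant is far from unique here, since any diagonal sequence tending to zero that keeps every error entry within $[-r, r]$ would do as well.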

Theorem 22. $H^\infty + C$ is a proximinal subspace in $L^\infty$.

It should be remarked that, although this theorem does not seem to be a result in operator approximation, it is indeed related to best approximation by compact Hankel operators. For the $l_p$ spaces, if we let $P_n$ be the norm-one projection of $l_p$ onto the first $n$ coordinate vectors, the following distance formula is obtained:

Theorem 23. Let $T$ be in $B(l_p)$, where $1 < p < \infty$. Then $\|P_n^\perp T P_n^\perp\| \to \mathrm{dist}[T, C(l_p)]$ as $n$ tends to infinity.

For $p = 2$, $P_n$ could be chosen as a norm-one projection onto the first $n$ vectors of any orthonormal basis. It is important to remark, however, that nothing has been mentioned about uniqueness. In fact, if $T$ is a noncompact bounded linear operator in $l_p$, where $1 < p < \infty$, $T$ has "many" compact best approximants. This observation follows from the fact that $C(l_p)$ is an M ideal in $B(l_p)$.

The notion of an M ideal in a Banach space was introduced in the early 1970s. A closed subspace $Y$ of a Banach space $X$ is an M ideal in $X$ if its annihilator $Y^\perp$ is an L summand of the dual space $X^*$. This in turn means that $Y^\perp$ is the range of an L projection defined on $X^*$, that is, a projection $Q : X^* \to Y^\perp$ with the property that $\|f\| = \|Qf\| + \|f - Qf\|$ for every $f$ in $X^*$. The importance of M ideals in approximation theory is that M ideals are proximinal subspaces with certain special approximation properties. For instance, (1) the metric projection $P_Y$ onto $Y$ satisfies the Lipschitz condition $d_H[P_Y(x), P_Y(y)] \le 2\|x - y\|$ for all $x, y$ in $X$, where $d_H$ denotes the Hausdorff distance, and (2) there exists a continuous homogeneous selection for the metric projection $P_Y$. Perhaps the most remarkable approximation characteristic of M ideals is the following result:

Theorem 24. Let $Y$ be an M ideal in a Banach space $X$ and $x$ be in $X \setminus Y$. Then $P_Y(x)$ algebraically spans $Y$.

We have already seen in Theorem 20 that $C(l_p)$ is proximinal in $B(l_p)$, where $1 < p < \infty$. It is also known that $C(l_p)$ is an M ideal in $B(l_p)$, $1 < p < \infty$. In fact, $C(l_p)$ is the only M ideal in $B(l_p)$.

Theorem 25. Let $Z$ be a subspace of $l_p$, where $1 < p < \infty$. Then the compact operators on $Z$ form an M ideal in the space of bounded operators on $Z$ if and only if $Z$ has the compact approximation property.

In the same direction as Theorem 22, the following result can be proved:

Theorem 26. The compact Hankel operators form an M ideal in the space of Hankel operators on a Hilbert space.

M ideals have other important structures. For instance, an M ideal of a Banach algebra must be an algebra, and if $\Omega$ is a compact Hausdorff space and $A$ a function algebra contained in $C(\Omega)$, then the M ideals of $A$ are precisely the closed ideals with a bounded approximate identity. If, in particular, $A$ is the disc algebra, then the M ideals of $A$ are exactly the two-sided ideals with norm-one approximate identity. In addition, the following interpolation result has been obtained:

Theorem 27. Let $f$ be in the disc algebra $A$ and for each $n = 1, 2, \dots$, let $E_n$ be a closed subset of the unit circle $T$ with linear measure zero such that $\gamma_n = \log(\|f\|_T / \|f\|_{E_n}) \to 0$. Then there exist minimal norm interpolants $s_n$ of $f$ satisfying $(f - s_n) \in Y_n = \{g \in A : g(E_n) = 0\}$ such that $\|f - s_n\| = O(\gamma_n)$. Furthermore, the rate of convergence $O(\gamma_n)$ cannot be replaced by $o(\gamma_n)$ in general.

V. CONSTRAINED APPROXIMATIONS

In many approximation and interpolation problems, the approximants, which may also be interpolants, are required to satisfy certain constraints. The constraints may be explicit conditions imposed on the approximation problems, or they may be certain specific properties of the mathematical model or data the approximants are supposed to preserve. In general, constrained approximation problems are nonlinear, and some important problems do not even have analytic solutions. We mention a few such problems. The examples we have chosen should not be construed as the most important work in this area but should be thought of as illustrating the results of the subject.

We first limit ourselves to approximation by polynomials and splines in one variable. For polynomial approximation in the uniform (or supremum) norm $\|\cdot\|_\infty$, the following result is known:

Theorem 28. Let $k$ be any nonnegative integer. Then there exists a constant $C$ such that for any function $f$ in $C^k[0, 1]$ satisfying $f' \ge 0$ and any integer $n \ge k + 1$
\[ \inf\{ \|f - p\|_\infty : p' \ge 0,\ p \text{ a polynomial of degree} \le n \} \le C n^{-k}\, w\bigl(f^{(k)}, n^{-1}\bigr)_\infty \]
where $w(\cdot, n^{-1})_\infty$ denotes, as usual, the modulus of continuity in the uniform norm.

This result essentially says that monotone approximation by polynomials retains the same order of approximation as unconstrained approximation. For spline approximation, an analogous result has been obtained, and this
result holds even in the $L_p$ setting. Let $S^*_{k,N}$ denote the space of all $k$th-order splines $s$ with knots at $i/N$, $i = 0, \dots, N$, such that $s' \ge 0$. We have the following result:

Theorem 29. Let $1 \le p \le \infty$ and $k$ be a nonnegative integer. Then there exists a positive constant $C$ such that for any monotonically nondecreasing function $f$ with $f^{(j)} \in L_p[0, 1]$, if $1 \le p < \infty$, or $f^{(j)} \in C[0, 1]$, if $p = \infty$, where $0 \le j \le k - 1$, and for any $N = 1, 2, \dots$
\[ \inf\{ \|f - s\|_p : s \in S^*_{k,N} \} \le C N^{-j}\, w\bigl(f^{(j)}, N^{-1}\bigr)_p \]

We now turn to the interpolation problem but concentrate only on spline interpolation. Let $0 = t_0 \le t_1 \le \cdots \le t_n \le t_{n+1} = 1$. Defining $t_j$ for $j < 0$ and $j > n + 1$ arbitrarily as long as $t_j < t_{j+k}$ for all $j$, we can use $\{t_j\}$ as the knot sequence of the normalized B splines $N_{j,k}$ of order $k$. Let $H_p^k = H_p^k[0, 1]$ denote, as usual, the Sobolev space of functions $f$ in $C^{k-1}[0, 1]$ with $f^{(k-1)}$ absolutely continuous and $f^{(k)}$ in $L_p[0, 1]$. The "optimal" interpolation problem can be posed as follows. Let $\{g_i\}$, $i = 1, \dots, n$, be given data, and let $s \in H_p^k$ satisfy $s(t_i) = g_i$, $i = 1, \dots, n$, and
\[ \|s^{(k)}\|_p = \inf\{ \|f^{(k)}\|_p : f \in H_p^k,\ f(t_i) = g_i,\ i = 1, \dots, n \} \]
If $1 < p < \infty$, $s$ always exists, and if $1 < p < \infty$ and $n \ge k$, $s$ is also unique. It is also known that, for $1 < p < \infty$ and $n \ge k$,
\[ s^{(k)} = |h|^{q-1}\, \mathrm{sgn}\, h \]
where $h$ is some linear combination of the normalized B splines $N_{i,k}$ and $q$ the conjugate of the index $p$. Suppose that the data $\{g_i\}$ are taken from some function $g$ in $C^k[0, 1]$. Then the constrained version of the above problem is determining an $s$ in $H_p^k$ such that $s(t_i) = g(t_i)$ for $i = 1, \dots, n$, $s^{(k)} \ge 0$, and
\[ \|s^{(k)}\|_p = \inf\{ \|f^{(k)}\|_p : f \in H_p^k,\ f(t_i) = g(t_i),\ f^{(k)} \ge 0 \} \]
Again for $1 < p < \infty$ and $n \ge k$, $s$ exists and is unique. In fact, the following result is known. To formulate the result, we set
\[ d_i = [t_i, \dots, t_{i+k}]\, g \]
where the divided difference notation has been used. Also, let $\chi_A$ denote the characteristic function of a set $A$.

Theorem 30. The unique function $s$ described above is characterized by
\[ \int_0^1 s^{(k)} N_{i,k} = d_i \]
for $i = 1, \dots, n - k$, and
\[ s^{(k)} = \Bigl( \sum_{j=1}^{n-k} \alpha_j N_{j,k} \Bigr)_{+}^{q-1} \chi_A \]
where $q$ is the conjugate of $p$ and
\[ A = [0, 1] \setminus \bigcup_{j=1}^{n-k} \{ (t_j, t_{j+k}) : d_j = 0 \} \]

We observe that if $p = 2$, so that $q - 1 = 1$, then $s$ is in $C^{k-1}$ and is a piecewise polynomial of order $2k$. Recently, an analog to the perfect spline solution of the unconstrained problem has been obtained for the case $p = \infty$. However, with the exception of some simple cases, no numerical algorithm to determine $s$ is known.

To describe computational methods for shape-prescribed splines, it is easier to discuss quadratic spline interpolation. It should be clear from the above result that extra knots are necessary. Let these knots be $x_1, \dots, x_{n-1}$ with $t_i < x_i < t_{i+1}$ and let $\{y_i, m_i\}$, $i = 1, \dots, n$, be a given Hermite data set. Then a quadratic spline $s$ with knots at $\{t_1, \dots, t_n, x_1, \dots, x_{n-1}\}$ is uniquely determined by the interpolation conditions $s(t_i) = y_i$ and $s'(t_i) = m_i$ for $i = 1, \dots, n$. The slopes of $s$ at the knots $x_i$ can be shown to be
\[ s'(x_i) = \frac{2(y_{i+1} - y_i) - (x_i - t_i) m_i - (t_{i+1} - x_i) m_{i+1}}{t_{i+1} - t_i} \]
for $i = 1, \dots, n - 1$. Hence, the knot $x_i$ is not active if and only if
\[ \frac{m_{i+1} + m_i}{2} = \frac{y_{i+1} - y_i}{t_{i+1} - t_i} \]

In general, the slopes $m_i$ are not given and can be considered parameters to ensure certain shape. For instance, if $m_i m_{i+1} \ge 0$, a necessary and sufficient condition that $m_i s'(t) \ge 0$ for $t_i \le t \le t_{i+1}$ is $|(x_i - t_i) m_i + (t_{i+1} - x_i) m_{i+1}| \le 2 |y_{i+1} - y_i|$.

Similar conditions can be obtained for comonotonicity and convexity preservation. For practical purposes, the slopes $m_i$ could be selected as a weighted central difference formula for $2 \le i \le n - 1$ and a weighted noncentral formula for $i = 1$ and $n$, so that the artificial knots $x_i$ can be determined using the corresponding necessary and sufficient condition such as the inequality mentioned above. Such an algorithm has the advantage of being one-pass.

Cubic splines could be easily used for monotone approximation. In fact, if $n \ge 2$, there is a unique function $s$ such that $s$ satisfies $s' \ge 0$, $s(t_i) = g_i$, and
\[ \|s''\|_2 = \inf\{ \|f''\|_2 : f \in H_2^2,\ f(t_i) = g_i,\ f' \ge 0 \} \]
The unique solution $s$ is a natural cubic spline with at most $2[n/2] + 2$ extra knots in addition to the knots at $t_1, \dots, t_n$.

Much more can be said when the Hilbert space $H_2^k$ is considered. For instance, let $C$ be a closed convex subset of $H_2^k$. This may be a set of functions that are positive, monotone, convex, and so on. The following result has been obtained:
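
Before that result is stated, the quadratic spline formulas given above can be made concrete. The following Python sketch treats a single interval $[t_i, t_{i+1}]$ with one extra knot $x_i$: it computes the slope at the extra knot, tests whether the knot is inactive, evaluates the stated shape inequality, and evaluates the spline by integrating its piecewise-linear derivative. The function names and the sample data are hypothetical, and the shape test simply evaluates the inequality as stated, which presumes $m_i m_{i+1} \ge 0$ and a data increment whose sign is compatible with the slopes.

```python
def knot_slope(t0, t1, x, y0, y1, m0, m1):
    """Slope s'(x) at the extra knot x in (t0, t1), from the formula in the text."""
    return (2.0 * (y1 - y0) - (x - t0) * m0 - (t1 - x) * m1) / (t1 - t0)


def knot_is_inactive(t0, t1, y0, y1, m0, m1, tol=1e-12):
    """True when the mean of the end slopes equals the chord slope."""
    return abs(0.5 * (m0 + m1) - (y1 - y0) / (t1 - t0)) <= tol


def satisfies_shape_condition(t0, t1, x, y0, y1, m0, m1):
    """Evaluate |(x - t0) m0 + (t1 - x) m1| <= 2 |y1 - y0| (assumes m0 * m1 >= 0)."""
    return abs((x - t0) * m0 + (t1 - x) * m1) <= 2.0 * abs(y1 - y0)


def evaluate(t0, t1, x, y0, m0, m1, sigma, t):
    """Value of the C^1 quadratic spline at t, with slope sigma at the knot x.

    The derivative is linear on [t0, x] and on [x, t1], so each piece is
    obtained by integrating a linear function from the left end of the piece.
    """
    if t <= x:
        h = t - t0
        return y0 + m0 * h + 0.5 * (sigma - m0) / (x - t0) * h * h
    y_at_x = y0 + 0.5 * (m0 + sigma) * (x - t0)
    h = t - x
    return y_at_x + sigma * h + 0.5 * (m1 - sigma) / (t1 - x) * h * h


# Hypothetical data on [0, 1]: zero end slopes, values rising from 0 to 1,
# extra knot at the midpoint.
sigma = knot_slope(0.0, 1.0, 0.5, 0.0, 1.0, 0.0, 0.0)
end_value = evaluate(0.0, 1.0, 0.5, 0.0, 0.0, 0.0, sigma, 1.0)
```

With these data the slope at the extra knot is 2 and the spline reaches the prescribed value 1 at the right end, which confirms that the slope formula is exactly the condition for matching $y_{i+1}$. Choosing both end slopes equal to 3 instead violates the shape inequality, and the slope at the knot then becomes negative, so the interpolant is no longer monotone.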

Theorem 31. Let g ∈ C ⊂ H2k and n ≥ k + 1. Then ai j = N j (xi ). Hence, the biinfinite (coefficient) matrix A
there is a unique function sn in C such that sn (ti ) = g(ti ) can be considered an operator from l∞ into itself. If A is
for i = 1, . . . , n, and surjective, then a bounded spline interpolant s exists, and
(k) if A is injective, then s is uniquely determined by the inter-
s = inf f (k) : f ∈ H k , f (ti ) = g(ti ), f ∈ C
2 2 2 polation condition s(xi ) = y + i, where i = . . . 0, 1, . . . . It
Furthermore, sn is a piecewise polynomial of order 2k, and is clear that A cannot be invertible unless it is banded, and
if sn∗ denotes the (unconstrained) natural spline of order using the properties of B splines, it can be shown that A
2k interpolating the same data, then is totally positive.
(k) Generally, certain biinfinite matrices A = [ai j ] can be
(sn − g)(k) ≤
sn∗ − g
considered as operators on l p as follows. Let A{xi } =
2 "
2
{Axi }, with Axi = j ai j x j , where {x j } is a finitely sup-
Consequently, the rate of convergence of the con- ported sequence. Then the definiton can be extended to all
strained interpolants sn to g is established. Another con- of l p using the density of finitely supported sequences. If
sequence is that the convergence rate for splines interpo- the extension is unique, we denote the operator also by A
lating derivatives at the end points remains the same as the and use the usual operator norm A p . Let {e j } be the nat-
unconstrained ones. ural basis sequence with e j (i) = δi j , and denote by Pn the
We now give a brief account of the multivariable set- projection defined by Pn e j = e j for | j| ≤ n and zero oth-
ting. Minimizing the L 2 norm over the whole space R s erwise. Therefore, the biinfinite matrix representation of
of certain partial derivatives gives the so-called thin-plate Pn is [ pi j ], where pi j = δi j for | j| ≤ n and zero otherwise.
splines. So far, the only attempt to preserve shape has For any biinfinite matrix A, we can consider its truncation
been preserving the positivity of the data. Since the nota- An = Pn A Pn . Also, let S be the shift operator defined by
tion has to be quite involved, we do not go into details but Se j = e j+1 . Then (S r A)n (i, j) = 0 if |i| > n or | j| > n and
remark that the convergence rate does not differ from the = ai−r j otherwise. That is, S r shifts the diagonals of A
unconstrained thin-plate spline interpolation. It must be down by r units.
mentioned, however, that thin-plate splines are not piece- We use the notation A ∈ B(l p ) if A, considered an op-
wise polynomials. Piecewise polynomials satisfying cer- erator on l p , is a bounded operator. We also say that A
tain smoothness joining conditions are discussed in the is boundedly invertible if, as an operator on l p , A is both
sections of bivariate splines for two variables and multi- injective and surjective. In order to extend the important re-
variate polyhedral splines for the general n-dimensional sults in finite matrices, it is important to be able to identifya
setting. Results on monotone and convex approximation “main diagonal” of A. Perhaps the following is a good def-
by piecewise polynomials, however, are not quite com- inition of such diagonal. The r th diagonal of A is main if
plete and certainly not yet published.
lim sup(S r A)−1 n
<∞
p

Clearly, the zeroth (or central) diagonal of the identity

VI. FACTORIZATION OF operator I on l p is main, and in fact if both I + K and
BIINFINITE MATRICES its inverse are in B(l p ), the zeroth diagonal is also main.
Another example is that, if K p < 1, then the zeroth di-
There has been much interest in biinfinite matrices. We agonal of I + K is a main diagonal. For the Hilbert space
will discuss this topic only from the view of an approxi- l2 , in particular, it can be verified that, if A is a positive
mation theorist, although biinfinite matrices have also im- definite symmetric matrix such that both A and A−1 are
portant applications to signal processing, time series, and in B(l2 ), then the zeroth diagonal of A is main. For l1 , if
so on. The importance of biinfinite matrices in approx- A is column diagonally dominant such that A and A−1
imation theory can best be explained via the following are in B(l1 ), then again the zeroth diagonal of A is a main
(polynomial) spline interpolation problem. diagonal.
Let t: · · · < ti < ti+1 < · · · be a knot sequence and The following result is important in spline interpolation:
x: · · · < xi < xi+1 < · · · a sequence of sample points. If a
certain sequence of data {(xi , yi )} is given where {yi } ∈ l∞ ,
Theorem 32. Any totally positive biinfinite matrix
we would like to know the existence and uniqueness of a
A such that both A and A−1 are in B(l∞ ) has a unique
kth-order spline s with knots at t such that s(xi ) = yi for
main diagonal.
all i and s ∈ L ∞ . If we write s as a linear combination
of the normalized B splines N j = N j,k,t , the interpolation This theorem allows us to conclude that A−1 is checker-
problem can be expressed as a biinfinite linear system board with the main diagonal containing only positive en-
Aα = y where y = [· · · yi , yi+1 , . . .]T and A = [ai j ], with tries, so that a bound on the local mesh ratio in spline

interpolation can be obtained in terms of the norm of $A^{-1}$. We remark that there are many other possible and plausible definitions for a main diagonal of a biinfinite matrix.

In solving a biinfinite linear system as discussed earlier, a Gauss elimination procedure quite often factors the coefficient matrix $A$ as $A = LU$, where $L$ is a unit lower triangular matrix with the zeroth diagonal as its rightmost (nontrivial) band, and $U$ is an upper triangular matrix. This is called an LU factorization of $A$. We say that an LU factorization is invertible if each factor is bounded and boundedly invertible, with $L^{-1}$ and $U^{-1}$ being again lower and upper triangular, respectively.

Theorem 33. Let $A$ be totally positive with both $A$ and $A^{-1}$ in $B(l_\infty)$. Then there exists a unique $r$ such that $S^r A = LU$, where $L$, $L^{-1}$, $U$, $U^{-1}$ are in $B(l_\infty)$, $L$, $L^{-1}$ being unit lower triangular and $U$, $U^{-1}$ being upper triangular biinfinite matrices. Furthermore, if we write

$$(S^r A)_n = L_n U_n$$

for each $n$, then $L_n \to L$ and $U_n \to U$ entrywise.

It is clear that the LU factorization above is unique. However, it would be even better if $U$ were the transpose of $L$. Such a factorization is called a Cholesky decomposition. We have the following result:

Theorem 34. Let $A$ be a positive definite symmetric matrix with $A$ and $A^{-1}$ in $B(l_2)$. Then $A$ has a unique Cholesky decomposition; that is, $A = L L^T$, where $L$ and $L^{-1}$ are lower triangular biinfinite matrices in $B(l_2)$ with $l_{ii} > 0$. Furthermore, $L$ is unique, and writing $A_n = L_n L_n^T$, we have $L_n \to L$ as $n \to \infty$.

The following result for $l_1$ is also of some interest.

Theorem 35. Let $A$ be a column diagonally dominant biinfinite matrix with $A$ and $A^{-1}$ in $B(l_1)$. Then there is a unique factorization $A = LU$, where $L$, $L^{-1}$ are unit lower triangular matrices in $B(l_1)$, and $U$, $U^{-1}$ are upper triangular matrices in $B(l_1)$. Furthermore, writing $A_n = L_n U_n$, we have $L_n x \to L x$ and $U_n x \to U x$ for all $x \in l_1$.

Certain biinfinite matrices have more refined factorizations. We will call a one-banded biinfinite matrix $R = [r_{ij}]$ elementary if $r_{ii} = 1$ and $r_{ij} = 0$ for all $i$ and $j$ with $j \le i - 2$ or $j > i$.

Theorem 36. Any strictly $m$-banded totally positive unit lower triangular biinfinite matrix $A$ has a factorization $A = R_1 \cdots R_m$ where each $R_i$, $i = 1, \ldots, m$, is elementary.

Hence, if $A$ is not necessarily unit but is $m$-banded totally positive and lower triangular, then $A = R_1 \cdots R_m B$, where $B$ is a diagonal matrix with positive diagonal entries.

We conclude this section by stating a factorization result of block Toeplitz matrices.

Theorem 37. Let $A$ be totally positive biinfinite block Toeplitz, with block size at least 2, and unit lower triangular. Write $A = [A_{ij}]$, where $A_{ij} = A_k$ for all $i$ and $j$ with $i = j + k$, and denote

$$A(z) = \sum_{k=0}^{\infty} A_k z^k$$

Then

$$A(z) = \prod_{k=1}^{\infty} [I + a_k(z)]\; c(z)\; \Bigl(\prod_{k=1}^{\infty} [I - b_k(z)]\Bigr)^{-1}$$

where $I + a_k(z)$ and $I - b_k(z)$ are the symbols of one-banded block Toeplitz matrices with $a_k(1), b_k(1) \ge 0$ and $\sum [a_k(1) + b_k(1)] < \infty$. Furthermore, $c(z)$ is the symbol of a totally positive block Toeplitz matrix with $c(z)$ and $c^{-1}(z)$ entire, and $\det c(z) = 1$.

VII. INTERPOLATION

The theory and methods of interpolation play a central role in approximation theory and numerical analysis. This branch of approximation theory was initiated by Newton and Lagrange. Lagrange interpolation polynomials, formulated as early as 1775, are still used in many applications. To be more explicit, let

$$\Delta_n : -1 \le t_{n1} < \cdots < t_{nn} \le 1$$

be a set of interpolation nodes (or sample points) and

$$l_{nk}(t) = \prod_{i \ne k} (t - t_{ni})/(t_{nk} - t_{ni})$$

Then for any given set of data $Y = \{y_1, \ldots, y_n\}$, the polynomial

$$p_n(t) = p_n(t; Y, \Delta_n) = \sum_{k=1}^{n} y_k\, l_{nk}(t)$$

interpolates the data $Y$ on $\Delta_n$. It is interesting that, even for uniformly spaced $\Delta_n$, there exists a continuous function $g$ on $[-1, 1]$ such that the (Lagrange) interpolating polynomials $p_n(t)$ to $y_i = g(t_{ni})$, $i = 1, \ldots, n$, do not converge uniformly to $g(t)$ on $[-1, 1]$. This leads to the study of the Lebesgue function and constant defined, respectively, by

$$l_n(t) = l_n(t; \Delta_n) = \sum_{k=1}^{n} |l_{nk}(t)|$$

and

$$l_n = l_n(\Delta_n) = \max\{l_n(t) : -1 \le t \le 1\}$$

The following result indicates the divergence property of $l_n(t)$:

Theorem 38. There exists a positive constant $C$ such that, for any sequence of positive numbers $e_n$ and any $\Delta_n$, there is a set $H_n(e_n, \Delta_n)$ of measure no greater than $e_n$ with

$$l_n(t) > C e_n \ln n$$

for $t \in [-1, 1] \setminus H_n(e_n, \Delta_n)$ and all $n = 1, 2, \ldots$.

Hence, $l_n = O(\ln n)$ is the best we can expect for various choices of $\Delta_n$. A natural (and almost best) choice of $\Delta_n$ is the set

$$T_n = \Bigl\{\cos\frac{(2k-1)\pi}{2n} : k = 1, \ldots, n\Bigr\}$$

of roots of the $n$th-degree Chebyshev polynomial of the first kind. In this case, the asymptotic expression of the Lebesgue constant is given in the following expression:

Theorem 39.

$$l_n(T_n) - \frac{2}{\pi}\Bigl(\ln n + \gamma + \ln\frac{8}{\pi}\Bigr) \sim \frac{8}{\pi} \sum_{s=1}^{\infty} (-1)^{s-1} \frac{(2^{2s-1} - 1)^2\, \pi^{2s}\, B_{2s}^2}{(2s)!\,(2s)\,(2n)^{2s}}$$

where $\gamma$ is the Euler constant and $B_{2s}$ are the Bernoulli numbers.

The best choice of $\Delta_n$ could be defined by $\Delta_n^*$, where

$$l_n(\Delta_n^*) = \inf\{l_n(\Delta_n) : \Delta_n\}$$

and we set $l_n^* = l_n(\Delta_n^*)$. It is known that $\Delta_n^*$ exists and is unique. In fact, it is characterized by the localization condition

$$\max\{l_n(t; \Delta_n) : t_{n,i} \le t \le t_{n,i+1}\} = l_n(\Delta_n)$$

for $i = 1, \ldots, n - 1$. Let us denote the above localized maximum by $l_{ni}(\Delta_n)$. Then the (modified) roots of the $n$th-degree Chebyshev polynomials are known to satisfy

$$\max_i l_{ni}(\hat{T}_n) - \min_i l_{ni}(\hat{T}_n) < 0.0196$$

for all $n \ge 70$, where the modification is just a linear scaling, mapping the largest and smallest roots to the end points of $[-1, 1]$; namely,

$$T_n = \Bigl(\cos\frac{\pi}{2n}\Bigr) \hat{T}_n$$

This is the reason that modified roots of the Chebyshev polynomials are believed to be an "almost best" choice of $\Delta_n$.

For each choice of $\Delta_n$, there is some continuous function $f$ such that the corresponding Lagrange polynomial interpolation to $f$ does not even converge in $L_p$. Some extra interpolating conditions are required to increase the chance of convergence. We first discuss interpolation at the roots $x_{nk} = x_{nk}(w)$ of the (generalized) Jacobi orthogonal polynomials corresponding to a "smooth" weight $w$. Let $x_{n0} = -1$, $x_{n,n+1} = 1$, and

$$L_n = L_n^{(r,s)}(\cdot, f) = L_n^{(r,s)}(w, f, \{x_{nk}\})$$

denote the interpolating polynomial of $f$ satisfying the interpolating conditions

$$L_n(x_{nk}) = f(x_{nk}), \quad k = 0, \ldots, n + 1$$

$$D^l L_n(1) = 0, \quad l = 1, \ldots, r - 1$$

and

$$D^l L_n(-1) = 0, \quad l = 1, \ldots, s - 1$$

Theorem 40. Let $u$ be a nonnegative function not identically equal to zero on $[-1, 1]$ and be in $(L \log^+ L)^p$, where $0 < p < \infty$. Then

$$\lim_{n \to \infty} \bigl\|[L_n^{(r,s)}(\cdot, f) - f(\cdot)]\, u(\cdot)\bigr\|_p = 0$$

for all $f$ in $C[-1, 1]$ if and only if both

$$(1 - x)^{-r+1/4} (1 + x)^{-s+1/4} [w(x)]^{1/2}$$

is in $L_1$ and

$$u(x) (1 - x)^{r-1/4} (1 + x)^{s-1/4} [w(x)]^{-1/2}$$

is in $L_p$. Furthermore, the condition that $u$ is in $(L \log^+ L)^p$ cannot be replaced by $u \in L_p$.

It should be mentioned, however, that to every continuous function $f$ there exists a sequence of interpolation nodes $\Delta_n$ such that the Lagrange interpolation polynomials of $f$ at $\Delta_n$ converge uniformly to $f$ on $[-1, 1]$. Fejér arranged the interpolation nodes for Hermite interpolation and gave an easy proof of the Weierstrass theorem. His polynomials, which are usually called Hermite–Fejér interpolation polynomials, are defined by the interpolation condition $H_n(x_{nk}, f) = f(x_{nk})$ and $D H_n(x_{nk}, f) = 0$ for $k = 1, \ldots, n$, where $\{x_{nk}\} = T_n$ are the roots of the $n$th-degree Chebyshev polynomial. Later, other choices of $\Delta_n$ were also studied, and results on approximation orders, asymptotic expansions, and so on were obtained. For instance, the following result is of some interest:

Theorem 41. Let $C_n$ denote the $n$th-degree Chebyshev polynomial of the first kind and $H_n(\cdot, f)$

denote the Hermite–Fejér interpolation polynomial of a continuous function $f$ at the roots $T_n$ of $C_n$. Then

$$|H_n(x, f) - f(x)| = O(1) \Bigl\{ \sum_{i=1}^{n} \frac{1}{i^2}\, w\Bigl(f, \frac{i}{n} (1 - x^2)^{1/2} |C_n(x)|\Bigr) + \sum_{i=1}^{n} \frac{1}{i^2}\, w\Bigl(f, \Bigl(\frac{i}{n}\Bigr)^2 |x C_n(x)|\Bigr) \Bigr\}$$

and

$$|H_n(x, f) - f(x)| = O(1)\, \frac{1}{n} \sum_{i=1}^{n} w\Bigl(f, \frac{1}{i} (1 - x^2)^{1/2} |C_n(x)| + i^{-2} |x C_n(x)|\Bigr)$$

There has been much interest in the so-called Birkhoff polynomial interpolation problem. The interpolation conditions here are governed by an incidence matrix $E = [e_{ij}]$, where $e_{ij} = 1$ or $0$, namely,

$$D^j p(x_i) = y_{ij} \quad \text{if } e_{ij} = 1$$

We now study interpolation by spline functions. Spline interpolation is much more useful than polynomial interpolation. In the first place there is no need to worry about convergence. Another great advantage is that a spline interpolation curve does not unnecessarily oscillate as a polynomial interpolation curve does when a large number of nodes are considered. The most commonly used is a cubic spline. If interpolation data are given at the nodes $\Delta_m : 0 = x_0 < \cdots < x_m = 1$, the interior nodes can be used as knots of the cubic spline. When the first derivatives are prescribed at the end points, the corresponding interpolation cubic spline, called a complete cubic spline, is uniquely determined. If no derivative values at the end points are given, a natural cubic spline with zero second derivatives at the end points could be used. Spline functions can also adjust to certain shapes. This topic is discussed in Section V on constrained approximation. There is a vast amount of literature on various results on spline interpolations: existence, uniqueness, error estimates, asymptotics, variable knots, and so on. We shall discuss only the following general interpolation problem:

Let $E = [e_{ij}]$ and $F = [f_{ij}]$ be $m \times n$ incidence matrices and $\Delta_m$ be a partition of $[0, 1]$ as defined above. We denote by $S_n(F, \Delta_m)$ the space of all piecewise polynomial functions $f$ of degree $n$ with the partition points $x_i$ of $\Delta_m$ as break points such that $D^{n-j} f(x_i^-) = D^{n-j} f(x_i^+)$ for all pairs $(i, j)$ with $f_{ij} = 0$, where $0 < i < m$. The following problem is called an interpolation problem, I.P.$(E, F, \Delta_m)$:

Given a set of data $\{y_{ij} : e_{ij} = 1\}$, find an $f$ in $S_n(F, \Delta_m)$ satisfying $D^j f(x_i) = y_{ij}$, where $e_{ij} = 1$ and $D^j f(x_i) = \frac{1}{2}[D^j f(x_i^-) + D^j f(x_i^+)]$.

The problem I.P.$(E, F, \Delta_m)$ posed here is said to be poised if it has a unique solution for any given data set. It is known that I.P.$(E, F, \Delta_m)$ is poised if and only if I.P.$(E, F, \Delta_m)$ is poised. By using the Pólya conditions on $(E, F)$ and the Budan–Fourier theorem, the following result can be shown. We say that $E$ is quasi Hermite if for each $i = 1, \ldots, m - 1$, there exists an $M_i$ such that $e_{ij} = 1$ if and only if $j < M_i$.

Theorem 42. Suppose that $(E, F)$ satisfies the Pólya conditions and that $e_{ij} = 1$ implies $f_{i,n-j} = 0$ for all $(i, j)$. Suppose further that one of the matrices $E$ and $F$ is quasi Hermite and the other has no supported odd blocks. Then the problem I.P.$(E, F, \Delta_m)$ is poised for any $\Delta_m$.

For cardinal interpolation, we impose the extra conditions $e_{0j} = e_{mj}$ and $f_{0j} = f_{mj}$ for $j = 0, \ldots, m$ on the incidence matrices. Of course, $S_n(F, \Delta_m)$ has to be extended to be piecewise polynomial functions $f$ of degree $n$ with break points at $q + x_i$, where $i = 1, \ldots, m$ and $q = \ldots, 0, 1, \ldots$, such that

$$D^{n-j} f(q + x_i^-) = D^{n-j} f(q + x_i^+)$$

for all $q$ and $(i, j)$ with $f_{ij} = 0$. The following problem will be called a cardinal interpolation problem, C.I.P.$(E, F, \Delta_m)$:

Given a set of data $\{y_{ijq}\}$, for any $0 \le i < m$ and $e_{ij} = 1$, find an $f$ in $S_n(F, \Delta_m)$ satisfying $D^j f(q + x_i) = y_{ijq}$, where $D^j f(q + x_i)$ is the average of $D^j f(q + x_i^-)$ and $D^j f(q + x_i^+)$.

The problem C.I.P.$(E, F, \Delta_m)$ is said to be poised if whenever $y_{ijq}$ is arbitrarily given and satisfies $y_{ijq} = O(|q|^p)$ for some $p$, as $q \to \pm\infty$, for all $(i, j)$, the problem C.I.P.$(E, F, \Delta_m)$ has a unique solution $f$ satisfying $f(x) = O(|x|^p)$ as $x \to \pm\infty$. Let $C_0$ denote the space of all solutions of C.I.P.$(E, F, \Delta_m)$ with $y_{ijq} = 0$ and denote its dimension by $d$. A function $f$ in $C_0$ is called an eigenspline with eigenvalue $\lambda$ if $f(x + 1) = \lambda f(x)$ for all $x$. The following results have been obtained:

Theorem 43. The C.I.P.$(E, F, \Delta_m)$ has eigenvalue $\lambda$ if and only if the C.I.P.$(E, F, \Delta_m)$ has eigenvalue $\lambda^{-1}$.

Theorem 44. Let the C.I.P.$(E, F, \Delta_m)$ have $d$ distinct eigenvalues. Then the problem is poised if and only if none of the eigenvalues lies on the unit circle.

Error estimates on interpolation by these splines have also been obtained.

We next discuss multivariate interpolation. First it should be mentioned that not every choice of nodes in $R^s$,

$s \ge 2$, admits a unique Lagrange interpolation. Hence, it is important to classify such nodes. We will use the notation

$$N_n(s) = \binom{n+s}{s}$$

The following result gives such a criterion, which can be used inductively to give an admissible choice of nodes in any dimension:

Theorem 45. Let $\Delta_n = \{x^i\}_{1, \ldots, N_n(s)}$ be a set of nodes in $R^s$. If there exist $n + 1$ distinct hyperplanes $S_0, \ldots, S_n$ in $R^s$ and $n + 1$ pairwise disjoint subsets $A_0, \ldots, A_n$ of the set of nodes $\Delta_n$ such that for each $j = 0, \ldots, n$, $A_j$ is a subset of $S_j \setminus \{S_{j+1} \cup \cdots \cup S_n\}$, has cardinality $N_j(s - 1)$, and admits a unique Lagrange polynomial interpolation of degree $j$ in $(s - 1)$ variables, then $\Delta_n$ admits a unique Lagrange polynomial interpolation of degree $n$ in $R^s$.

On the other hand, we remark that an arbitrary choice of $n$ nodes in $R^s$ admits an interpolation from some $n$-dimensional incomplete polynomial space if we neglect the degree of the polynomials.

We next discuss a Birkhoff-type interpolation problem by polynomials in two variables. It will be clear that an analogous problem can be posed in any dimension. Let $S$ be a lower set of a finite number of pairs of integers; that is, whenever $i' \le i$, $j' \le j$ and $(i, j) \in S$, then $(i', j') \in S$. Denote by $P_S$ the space of all polynomials $\sum a_{ij} x^i y^j$, where $a_{ij} = 0$ if $(i, j) \notin S$. Note that the dimension of $P_S$ is simply $|S|$, the cardinality of $S$. Let $E = [e_{qik}]$ be a set of 0 and 1 defined for $q = 1, \ldots, m$ and $(i, k) \in S$. We shall call $E$ an interpolation matrix. The interpolation problem we discuss here is to find a polynomial $p$ in $P_S$ such that

$$\frac{\partial^{i+k}}{\partial x^i\, \partial y^k}\, p(z_q) = c_{qik}$$

for all $q, i, k$ with $e_{qik} = 1$, where $Z = \{z_1, \ldots, z_m\}$ is a given set of nodes in $R^2$ and $C = \{c_{qik}\}$ is a given set of data values.

An interpolation matrix $E$ is said to be regular if for any sets $Z$ and $C$ the above interpolation problem is solvable. It is said to be an Abel matrix if, for each pair $(i, k)$ in $S$, there is one and only one $q$ for which $e_{qik} = 1$. The following result has been obtained:

Theorem 46. Let $E = [e_{qik}]$ be a normal interpolation matrix with respect to a lower index set $S$; that is,

$$\sum_{q=1}^{m} \sum_{(i,k) \in S} e_{qik} = |S|$$

Then $E$ is regular if and only if it is an Abel matrix.

We now consider the problem of Lebesgue constants in the multivariate setting. Let $V_i$, $i = 1, \ldots, s + 1$, be the vertices of a simplex $T^s$ in $R^s$, and for each $x$ in $T^s$ let $(u_1, \ldots, u_{s+1})$ be its barycentric coordinate; that is,

$$x = \sum_{i=1}^{s+1} u_i V_i; \quad \sum_{i=1}^{s+1} u_i = 1, \quad u_i \ge 0$$

Let $a = (a_1, \ldots, a_{s+1})$, where $a_1, \ldots, a_{s+1}$ are nonnegative integers such that $|a| = a_1 + \cdots + a_{s+1} \le n$. It is known that the set of equally spaced nodes $\{x_a\}$, $|a| \le n$, where $x_a = (a_1 n^{-1}, \ldots, a_{s+1} n^{-1})$, admits a unique Lagrange interpolation. For this set of equally spaced knots $\{x_a\}$, we define

$$l_a(x) = n^n \prod_{i=1}^{s+1} \prod_{j=0}^{a_i - 1} \Bigl(u_i - \frac{j}{n}\Bigr) \Big/ \prod_{k=1}^{s+1} a_k!$$

$$L_n(x) = \sum_{|a| \le n} |l_a(x)|$$

and

$$L_n = \max\{L_n(x) : x \in T^s\}$$

$L_n(x)$ and $L_n$ are called the Lebesgue function and Lebesgue constant, respectively, corresponding to the knots $\{x_a\}$. The following estimate of the Lebesgue constant is known:

Theorem 47. For the above equally spaced nodes $\{x_a\}$, the Lebesgue constant satisfies

$$L_n \le \binom{2n-1}{n}$$

for each $n = 1, 2, \ldots$. Furthermore,

$$L_n \to \binom{2n-1}{n}$$

as the dimension $s$ tends to infinity.

Another problem in Lebesgue constants is concerned with optimality of the nodes. However, not much is known about this subject. We simply quote a result on its growth order, which again turns out to be logarithmic as expected. Let

$$n = \prod_{i=1}^{s} (n_i + 1)$$

and $E_n = \{x^1, \ldots, x^n\}$ be an admissible set of nodes in the unit cube $I^s$ in the sense that for each function $f$ in $C(I^s)$ there exists a unique polynomial

$$p_n(x, f) = \sum_{\substack{0 \le k_i \le n_i \\ i = 1, \ldots, s}} b_k x^k$$

where $k = (k_1, \ldots, k_s)$, such that $p_n(x^i, f) = f(x^i)$, $i = 1, \ldots, n$. We define the "optimal" Lebesgue constant $\Lambda_n$ by

$$\Lambda_n = \inf_{E_n \subset I^s}\ \sup_{f \in C(I^s)} \frac{\|p(\cdot, f)\|}{\|f\|}$$

where the supremum norm is being used. The following result has been obtained:

Theorem 48. There exist constants $A_s$ and $B_s$ depending only on the dimension $s$, such that

$$A_s \prod_{i=1}^{s} \ln(n_i + 2) \le \Lambda_n \le B_s \prod_{i=1}^{s} \ln(n_i + 2)$$

VIII. MULTIVARIATE POLYHEDRAL SPLINES

Spline functions in one variable can be considered linear combinations of B splines. Let $X = \{t_0, \ldots, t_n\}$, where $t_0 \le \cdots \le t_n$ with $t_0 < t_n$. Then the univariate B spline, with $X$ as its set of knots and order $n$, is defined by

$$M(x, X) = n[t_0, \ldots, t_n](\cdot - x)_+^{n-1}$$

An application of the Hermite–Genocchi formula for divided differences gives the interesting relation

$$\int_R M(x, X)\, g(x)\, dx = n! \int_{S^n} g(u_0 t_0 + \cdots + u_n t_n)\, du_1 \cdots du_n$$

for any continuous function $g$, where $S^n$ is the $n$ simplex

$$\sum_{i=0}^{n} u_i = 1, \quad u_i \ge 0$$

The development of multivariate polyhedral splines is based on a certain geometric interpretation of this relation. First, by simply replacing the knots $\{t_0, \ldots, t_n\}$ by a set $X = \{t^0, \ldots, t^n\}$ of $n + 1$ points (which are not necessarily distinct) in $R^s$ such that the algebraic span of $X$, denoted by $\langle X \rangle$, is all of $R^s$, we arrive at the formula

$$\int_{R^s} S(x, X)\, g(x)\, dx = n! \int_{S^n} g(u_0 t^0 + \cdots + u_n t^n)\, du_1 \cdots du_n$$

for any test function $g$ in $C_0$, the class of all continuous functions in $R^s$ with compact supports. The function $S(\cdot, X)$, called the simplicial spline with $X$ as its set of knots, has a very nice geometric interpretation. Let $y^0, \ldots, y^n$ be in $R^n$ such that the set

$$\sigma = \Bigl\{\sum_{i=0}^{n} u_i y^i : u_0 + \cdots + u_n = 1;\ u_0, \ldots, u_n \ge 0\Bigr\}$$

has positive volume and $P y^i = t^i$, $i = 0, \ldots, n$, where $P y$ is the projection of any $y$ in $R^n$ onto an $x$ in $R^s$ consisting of its first $s$ coordinates. Then for each $x$ in $R^s$, we have

$$S(x, X) = \mathrm{vol}_{n-s}\{y \in \sigma : P y = x\} / \mathrm{vol}_n\, \sigma$$

Next, let $X_n = \{x^1, \ldots, x^n\}$, where each $x^i$ is a nonzero vector in $R^s$ such that $\langle X_n \rangle = R^s$. Suppose that $I = [0, 1]$ and $B(\cdot, X_n)$ is defined, again in the distribution sense, by

$$\int_{R^s} B(x, X_n)\, g(x)\, dx = \int_{I^n} g(u_1 x^1 + \cdots + u_n x^n)\, du_1 \cdots du_n$$

for all test functions $g$ in $C_0$. We call $B(\cdot, X_n)$ a box spline with directions $X_n = \{x^1, \ldots, x^n\}$. Box splines also have a nice geometric interpretation. Let

$$\eta = \Bigl\{\sum_{i=1}^{n} u_i y^i : 0 \le u_i \le 1,\ P y^i = x^i\Bigr\}$$

where, and hereafter, $P$ denotes an arbitrary orthogonal projection from $R^n$ onto $R^s$. Then we can write

$$B(x, X_n) = \mathrm{vol}_{n-s}\{y \in \eta : P y = x\} / \mathrm{vol}_n\, \eta$$

If we consider the formula

$$\int_{R^s} C(x, X_n)\, g(x)\, dx = \int_{R_+^n} g(u_1 x^1 + \cdots + u_n x^n)\, du_1 \cdots du_n$$

for all $g$ in $C_0$, we arrive at a truncated power (or cone) spline.

Box splines can be derived from truncated power splines, and conversely truncated power splines can be expressed in terms of box splines. Let

$$\nabla_y f(\cdot) = f(\cdot) - f(\cdot - y) \quad \text{for any } y \text{ in } R^s$$

and

$$\nabla_{X_n} = \prod_{y \in X_n} \nabla_y$$

In addition, let $[X_n]$ be the $s \times n$ matrix whose $i$th column is $x^i$, and let

$$L(X_n) = \{[X_n] b : b \in Z^n\}$$

and $t(a, X_n)$ denote the number of different solutions $b \in Z_+^n$ to $a = [X_n] b$. Here, $Z$ and $Z_+$ denote, as usual, the set of all integers and nonnegative integers, respectively. Then we have the following relationships:

Theorem 49.

$$B(\cdot, X_n) = \nabla_{X_n} C(\cdot, X_n)$$

and

$$C(\cdot, X_n) = \sum_{a \in L(X_n)} t(a, X_n)\, B(\cdot - a, X_n)$$

A more general notion is the so-called multivariate polyhedral spline (or $P$ shadow) $M_B(\cdot, P)$, where $B$ is an arbitrary convex polyhedral body in $R^n$, defined in the distribution sense by

$$\int_{R^s} M_B(x, P)\, g(x)\, dx = \int_B g(Pu)\, du$$

for all $g$ in $C_0$. Some important properties are included in the following two theorems. Let

$$D_y = \sum_{i=1}^{n} y_i \frac{\partial}{\partial x_i}$$

where $y = (y_1, \ldots, y_n)$, and if $B$ is a polyhedral body in $R^n$, then $B_i$ will denote an $(n - 1)$-dimensional flat of $B$ with unit outer normal $n_i$.

Theorem 50. For each $z$ in $R^n$,

$$D_{Pz} M_B(\cdot, P) = -\sum_i z \cdot n_i\, M_{B_i}(\cdot, P) \qquad (1)$$

$$M_B(Pz, P) = \frac{1}{n - s} \sum_i (b_i - z) \cdot n_i\, M_{B_i}(Pz, P) \qquad (2)$$

$$D_x M_B(x, P) = (n - m) M_B(x, P) - \sum_i b_i \cdot n_i\, M_{B_i}(x, P) \qquad (3)$$

where $b_i$ lies on $B_i$.

Theorem 51. Let $n \ge m \ge s$, and $P$ and $Q$ be any orthogonal projections of $R^n$ and $R^m$, respectively, onto $R^s$. If $B$ and $C$ are any two proper convex polyhedral bodies in $R^n$ and $R^m$ respectively, then

$$\int_{R^s} M_B(x - y, P)\, M_C(y, Q)\, dy = M_{B \times C}(x, P \otimes Q) \qquad (1)$$

$$\int_{R^s} M_B(x, P)\, M_C(x, Q)\, dx = \frac{1}{n + m - s} \Bigl\{ \sum_i (b_i - z) \cdot n_i \int_{R^s} M_{B_i}(x) M_C(x)\, dx + \sum_i (c_i - z') \cdot m_i \int_{R^s} M_B(x) M_{C_i}(x)\, dx \Bigr\} \qquad (2)$$

where $b_i$ and $c_i$ lie on $B_i$ and $C_i$, respectively, $z$ in $R^n$, with $z'$ in $R^m$ consisting of the first $m$ components of $z$.

It is obvious that $M_B(\cdot, P)$ is nonnegative and vanishes outside $PB$. From the recurrence relations, it is also clear that it is a piecewise polynomial of degree $n - s$. Let us now specialize on box splines $B(\cdot, X_n)$. In addition to being a piecewise polynomial of degree $n - s$, it is in $C^{d-1}(R^s)$, where

$$d = d(X_n) = \max\{m : |Y| = m,\ Y \subset X_n,\ \langle X_n \setminus Y \rangle = R^s\}$$

The Fourier transform of $B(\cdot, X_n)$ is particularly simple, being

$$\hat{B}(x, X_n) = \prod_{y \in X_n} \frac{1 - e^{-i x \cdot y}}{i x \cdot y}$$

This immediately yields the following result, which is a special case of the above theorem:

$$B(\cdot, X_n \cup Y_m) = B(\cdot, X_n) * B(\cdot, Y_m)$$

Let $S(B)$ denote the linear span

$$\sum_{a \in L(X_n)} c_a\, B(\cdot - a, X_n)$$

of the translates of $B(\cdot, X_n)$. We first state the following result:

Theorem 52. Let $Y$ be a subset of $X_n$ with $\langle Y \rangle = R^s$. Then

$$\sum_{b \in Z^s} B(\cdot - Yb, X_n) = \frac{1}{|\det Y|}$$

Hence, appropriate translates of box splines form a partition of unity, so that $S(B)$ can at least approximate continuous functions. More important is the following result on linear independence of translates of a box spline.

Theorem 53. Let $X_n \subset Z^s \setminus \{0\}$. Then

$$\{B(\cdot - a, X_n) : a \in L(X_n)\}$$

is linearly independent if and only if

$$|\det Y| = 1$$

for all $Y \subset X_n$ with $\langle Y \rangle = R^s$.

As usual, let $\pi_k$ denote the space of all polynomials in $R^s$ with total degree at most $k$. Then the following result is obtained:

Theorem 54. Let $X_n \subset Z^s \setminus \{0\}$ and $d = d(X_n)$. Then there exists a linear functional $\lambda$ on $\pi_d$ such that

$$p(x) = \sum_{a \in Z^s} \lambda[p(\cdot + a)]\, B(x - a, X_n)$$

for all $p \in \pi_d$.
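As a small numerical illustration of the partition-of-unity statement for translates of a box spline, the following Python sketch uses the simplest bivariate example, with directions $(1,0)$, $(0,1)$, $(1,1)$ (the three-direction mesh). For this choice the box spline is the familiar piecewise-linear hexagonal "hat" function, and taking $Y = \{(1,0), (0,1)\}$ gives $|\det Y| = 1$, so the integer translates should sum to 1. The closed-form expression and the function names `courant_box_spline` and `sum_of_translates` are introduced here for illustration only; they are assumptions of this sketch, not notation from the article.

```python
import math


def courant_box_spline(x, y):
    """Box spline with directions (1,0), (0,1), (1,1), in closed form.

    It is piecewise linear, supported on the hexagon with vertices
    (0,0), (1,0), (2,1), (2,2), (1,2), (0,1), and equals 1 at (1,1).
    """
    u, v = x - 1.0, y - 1.0
    return max(0.0, 1.0 - max(abs(u), abs(v), abs(u - v)))


def sum_of_translates(x, y):
    """Sum of B(. - a) over integer a = (a1, a2), evaluated at (x, y).

    Only shifts with (x - a1, y - a2) inside [0, 2]^2 can contribute,
    so a 3-by-3 block of integer shifts is enough.
    """
    total = 0.0
    for a1 in range(math.floor(x) - 2, math.floor(x) + 1):
        for a2 in range(math.floor(y) - 2, math.floor(y) + 1):
            total += courant_box_spline(x - a1, y - a2)
    return total


if __name__ == "__main__":
    for point in [(0.3, 0.7), (-1.25, 2.6), (5.0, 5.0)]:
        print(point, sum_of_translates(*point))
```

Each printed sum should agree with 1 up to rounding error. Box splines with more directions generally have no such short closed form, which is one reason the recurrence relations and subdivision algorithms are used in practice.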

In fact, if we write

$$[\hat{B}(x, X_n)]^{-1} = \sum_j a_j x^j$$

where $j = (j_1, \ldots, j_s)$, and $x^j = x_1^{j_1} \cdots x_s^{j_s}$, then, using the notation

$$D^j = \frac{\partial^{j_1}}{\partial x_1^{j_1}} \cdots \frac{\partial^{j_s}}{\partial x_s^{j_s}}$$

we have

$$\lambda(p) = \sum_j a_j (-1)^{|j|} D^j p(0)$$

Let $F$ be an extension of the linear functional $\lambda(p)$ on $C(R^s)$, such that

$$|F(f)| \le \|F\|\, \|f\|, \quad f \in C(R^s)$$

and $F(p) = \lambda(p)$ for any polynomial $p$. For any $h > 0$, define

$$(Q_h f)(x) = \sum_a F\{f[h(\cdot + a)]\}\, B(h^{-1} x - a, X_n)$$

Then the following estimate is established:

Theorem 55. There exist absolute constants $C$ and $r$ such that

$$|f(x) - (Q_h f)(x)| \le C h^{d+1-s/p} \sum_{|j| = d+1} \|D^j f\|_{L_p(A_{x,h})}$$

for all $x \in R^s$ and $f \in H_p^d(R^s)$, where $d = d(X_n)$, and

$$A_{x,h} = x + r h [-1, 1]^s$$

At least for $p = \infty$, the order $O(h^{d+1})$ cannot be improved. Let us now replace $B$ by a finite collection $\Phi$ of locally supported splines $\phi_i$ on $R^s$. Hence, $S(\Phi)$ will denote the linear span of the translates $\phi_i(\cdot - a)$, $a \in Z^s$, where $i = 1, \ldots, N$, say. Let $H_{p,c}^m$ be the subspace of the Sobolev space $H_p^m$ of compactly supported functions $f$ with norm

$$\|f\|_{p,m} = \Bigl\{\sum_{|j| \le m} \|D^j f\|_p^p\Bigr\}^{1/p}$$

We say that $S(\Phi)$ provides local $L_p$ approximation of order $k$, or $S(\Phi) \in A_{p,k}$, if for any $f \in H_{p,c}^m$, there exist weights $w_{i,a}^h$ such that

$$\Bigl\| f - \sum_{i=1}^{N} \sum_{a \in Z^s} w_{i,a}^h\, \phi_i\Bigl(\frac{\cdot}{h} - a\Bigr) \Bigr\|_p \le C h^k \|f\|_{p,m}$$

and $w_{i,a}^h = 0$ whenever $\mathrm{dist}(ah, \mathrm{supp}\, f) > r$ holds, where $C$ and $r$ are positive constants independent of $h$ and $f$.

Theorem 56. The following statements are equivalent:

1. There exists a sequence $\{\psi_a\}$, $|a| < k$, in $S(\Phi)$ that satisfies

$$\hat{\psi}_0(a) = \delta_{0,a}, \quad a \in 2\pi Z^s$$

and

$$\sum_{c \le b} \frac{(-i)^{|c|}}{c!} D^c \hat{\psi}_{b-c}(a) = 0$$

for $a \in 2\pi Z^s \setminus \{0\}$ and $1 \le |b| < k$.

2. There exists a sequence $\{\psi_a\}$, $|a| < k$, in $S(\Phi)$ such that

$$\frac{x^b}{b!} - \sum_{c \le b} \sum_{a \in Z^s} \frac{a^c}{c!}\, \psi_{b-c}(x - a)$$

is in $\pi_{|b|-1}$ for $|b| < k$.

3. There exist some finitely supported $w_{i,a}$ so that

$$\psi = \sum_{i=1}^{N} \sum_{a \in Z^s} w_{i,a}\, \phi_i(\cdot - a)$$

satisfies

$$\frac{x^b}{b!} - \sum_{a \in Z^s} \frac{a^b}{b!}\, \psi(x - a)$$

is in $\pi_{|b|-1}$ for $|b| < k$.

4. $S(\Phi) \in A_{p,k}$ for all $p$, $1 \le p \le \infty$.

5. $S(\Phi) \in A_{\infty,k}$.

In the two-dimensional setting, and particularly the three-directional mesh, the approximation order of $S(\Phi)$, where $\Phi$ is a collection of box splines, has been extensively studied. It is interesting that, although $C^1$ cubics locally reproduce all cubic polynomials, the optimal approximation rate is only $O(h^3)$. However, not much is known on the approximation order of $S(\Phi)$ for $R^s$ with $s > 2$.

Although multivariate polyhedral splines are defined in the distributional setting, they can be computed by using the recurrence relations. The computational schemes, however, are usually very complicated. For $R^2$, there is an efficient algorithm that gives the Bézier coefficients for each polynomial piece of a box spline. In general, one could apply the subdivision algorithms, which are based on discrete box splines. Under certain mild assumptions, the convergence rate could be shown to be quadratic.

IX. QUADRATURES

This section is devoted to approximation of integrals. Let $D$ be a region in $R^s$ and $C(D)$ the space of all continuous

functions on $D$. The approximation scheme is to find linear functionals $L^n_k$, $k = 1, \ldots, n$, on $C(D)$ such that the error functional $r_n$, defined by

$$\int_D f(x) w(x)\, dx = \sum_{k=1}^{n} L^n_k f + r_n(f) = Q_n(f) + r_n(f),$$

where $w(\cdot)$ is some positive weight function, satisfies $r_n(f) \to 0$ as $n \to \infty$ for all $f$ in $C(D)$. In one dimension, the above formula is called an integration quadrature formula. Multivariable quadrature formulas are sometimes called cubature formulas. There is a vast amount of literature on this subject. The most well-known quadrature formulas are, perhaps, the trapezoidal and Simpson rules. These rules are obtained using certain interpolatory linear functionals $L^n_k$. In addition to various other interpolatory-type formulas, there are automatic quadrature, Gauss-type quadrature, and integral formulas obtained by the adaptive Simpson (or Romberg) method. Of course, error analysis is very important. There is literature on algebraic precision, optimal quadrature, asymptotic expansion, and so on. In one variable, the most commonly used weights are perhaps $1$, $(1 - x)^{\alpha} x^{\beta}$, $\log(1/x)$ for the interval $[0, 1]$, $e^{-x}$ for $[0, \infty)$ and $e^{-x^2}$ for $R$. It is impossible to discuss many important results on this subject in a short section. We give only a brief description of three approaches in the one-variable setting and briefly discuss optimal and multivariable quadratures at the end of the section. For simplicity we will assume that $D = [-1, 1]$ in the following discussion.

A. Automatic Quadratures

An integration method that can be applied in a digital computer to evaluate definite integrals automatically within a tolerable error is called an automatic quadrature. The trapezoidal, Simpson, and adaptive Simpson rules are typical examples. We may say that an automatic quadrature is based on interpolatory-type formulas. An ideal interpolatory quadrature is one whose linear functionals $L^n_k$ are defined by $L^n_k f = A^n_k f(x_k)$, $k = 1, \ldots, n$, where $-1 \le x_1 < \cdots < x_n \le 1$, such that $A^n_k > 0$, the function values $f(x_k)$ can be used again in future computation (i.e., for $n + 1, n + 2, \ldots$), and the quadrature has algebraic precision of degree $n - 1$ [i.e., $r_n(f) = 0$ for $f(x) = 1, \ldots, x^{n-1}$]. Using Lagrange interpolation by polynomials of degree $n - 1$ guarantees the algebraic precision. Here, we have

$$A^n_k = \int_{-1}^{1} \prod_{i \ne k} \frac{x - x_i}{x_k - x_i}\, dx$$

and the quadrature is sometimes attributed to Newton. However, the sequence $\{A^n_k\}$, $k = 1, \ldots, n$, is usually quite oscillatory. Since it is known that $r_n(f) \to 0$ for all $f \in C[-1, 1]$ if and only if

$$\sum_{k=1}^{n} |A^n_k| \le M < \infty$$

for all $n$, the sample points $x_k$ should be chosen very carefully. For instance, equally spaced sample points yield

$$\sum_{k=1}^{n} |A^n_k| \to \infty .$$

The following scheme is quite efficient. Let $t_1 = \tfrac{1}{4}$, $t_{2i} = t_i/2$, $t_{2i+1} = t_{2i} + \tfrac{1}{2}$, $i = 1, 2, \ldots$, and $x_k = \cos 2\pi t_k$, $k = 1, 2, \ldots$. Fix an $N = 2^m$ for some $m > 0$, and let $T_k(x)$ and $U_k(x)$ denote the Chebyshev polynomials of the first and second kinds, respectively. Consider the polynomial

$$p_n(x) = \sum_{k=1}^{N-1} b_{0k} U_{k-1}(x) + \sum_{i=1}^{n} w_i[T_N(x)] \sideset{}{'}\sum_{k=0}^{N-1} b_{ik} T_k(x),$$

where $\sum'$ indicates that the first term under this summation is halved, and

$$w_m(x) = 2^m \prod_{k=1}^{m} (x - x_k).$$

For each $f$ in $C[-1, 1]$, the polynomial $p_n(x)$ is then uniquely determined by the interpolation conditions $p_n(x_i) = f(x_i)$ for $i = 1, \ldots, (n + 1)N - 1$. Write $b_{ik} = b_{ik}(f)$ for this interpolation polynomial. We then arrive at the quadrature

$$Q_{(n+1)N-1}(f) = \sum_{k=1}^{N-1} A_{0k} b_{0k}(f) + \sum_{i=1}^{n} \sum_{k=0}^{N-1} A_{ik} b_{ik}(f),$$

where

$$A_{0k} = \int_{-1}^{1} U_{k-1}(x)\, dx$$

and

$$A_{ik} = \int_{-1}^{1} w_i[T_N(x)] T_k(x)\, dx$$

for $i = 1, \ldots, n$. It is important to point out that $b_{ik}(f)$ and $A_{ik}$ are independent of $n$ and that to proceed from $Q_{(n+1)N-1}(f)$ to $Q_{(n+2)N-1}(f)$ we need only the extra terms $b_{n+1,k}(f)$ and $A_{n+1,k}$, for $k = 1, \ldots, N$, and their recurrence relations are available.

B. Gauss Quadrature

Let us return to the integration quadrature formula

$$\int_{-1}^{1} f(x) w(x)\, dx = Q_n(f) + r_n(f)$$

where

$$Q_n(f) = \sum_{k=1}^{n} A_{nk} f(x_{nk})$$

with $-1 \le x_{n1} < \cdots < x_{nn} \le 1$. Gauss approached this problem by requiring $r_n(f) = 0$ for as many polynomials $f$ with degrees as high as possible and proved that the highest degree is $2n - 1$. In fact, this polynomial precision condition uniquely determines $x_{nk}$ and the coefficients $A_{nk}$. We call this $Q_n(f)$ a Gauss quadrature formula. The points $x_{nk}$ turn out to be the roots of the orthonormal polynomials $p_n$ with respect to the weight function $w$, and the coefficients $A_{nk}$, known as Christoffel numbers, are

$$A_{nk} = \Big[ \sum_{j=0}^{n-1} p_j^2(x_{nk}) \Big]^{-1} .$$

If $w(x) = 1$, then the polynomials $p_n$ are the Legendre polynomials, and if $w(x) = (1 - x^2)^{-1/2}$, the $p_n$'s are (constant multiples of) Chebyshev polynomials of the first kind. The corresponding quadrature formulas are called Gauss–Legendre and Gauss–Chebyshev quadratures. The following estimate has been obtained:

Theorem 57. Let $1 \le s < 2n$ and $(1 - x^2)^s f^{(s)}(x)$ be integrable on $[-1, 1]$. Then the error $r_n(f)$ of the Gauss–Legendre quadrature of $f$ satisfies

$$|r_n(f)| \le C_s \int_{-1}^{1} \big| f^{(s)}(x) \big| \min\{(1 - x^2)^{s/2}/n^s,\ (1 - x^2)^s\}\, dx,$$

where the constant $C_s$ is independent of $n$ and $f$.

Several generalizations of the Gauss quadrature have been considered. For instance, requiring the weaker polynomial precision condition $r_n(f) = 0$ for all $f$ in $\pi_{2n-m-1}$ gives a so-called $[2n - m - 1, n, w(x)]$ quadrature. The following result gives a characterization of this quadrature. We denote by $P_n$ the polynomials in $\pi_n$ with leading coefficient equal to 1 that are orthogonal on the unit circle with respect to the weight function

$$w(\cos\theta)\,|\sin\theta| .$$

Theorem 58. Let $0 < m \le n$. Then a $[2n - m - 1, n, w(x)]$ quadrature with nodes $x_i$ in $[-1, 1]$ has $n - l$ positive coefficients and $l$ negative coefficients if and only if there exists a polynomial $q_m$ in $\pi_m$ with leading coefficient equal to 1 and real coefficients such that

$$2^{-n+1}\, \mathrm{Re}\{ z^{-n+1} q_m(z) P_{2n-m-1}(z) \} = \prod_{j=1}^{n} (x - x_j),$$

where $x = \cos\theta$ and $z = e^{i\theta}$, $\theta \in [0, \pi]$, and $q_m$ has $m - 2l$ zeros inside and $2l$ zeros outside the unit circle.

Another generalization is the quadrature

$$\int_{-1}^{1} f(x) w(x)\, dx = \sum_{j=1}^{n} \sum_{i=1}^{n_j} c_{ji} f^{(i-1)}(x_j) + \sum_{j=1}^{m} \sum_{i=1}^{m_j} d_{ji} f^{(i-1)}(y_j) + r_N(f),$$

where $N = m_1 + \cdots + m_m + n_1 + \cdots + n_n$ and $y_1, \ldots, y_m$ are prescribed. It is well known that the polynomial precision requirement $r_N(f) = 0$ for all $f$ in $\pi_{N+n}$ completely characterizes the nodes $x_1, \ldots, x_n$. In fact, they are determined by the $n$ equations

$$\int_{-1}^{1} w(x)\, x^k \prod_{j=1}^{m} (x - y_j)^{m_j} \prod_{j=1}^{n} (x - x_j)^{n_j}\, dx = 0, \qquad k = 0, \ldots, n - 1 .$$

There are methods for computing the nodes $x_j$ and coefficients $c_{ji}$ and $d_{ji}$. In fact, by establishing a relation between these coefficients and the eigenvectors of the Jacobi matrix corresponding to the weight function $w(\cdot)$, one can obtain a stable and efficient algorithm to evaluate these nodes and coefficients.

C. Chebyshev Quadratures

It is easy to see that all the Christoffel numbers of the Gauss–Chebyshev quadrature are the same. This leads to the problem of determining the weight functions for which the constants in the corresponding quadratures are independent of the summation index. More precisely, a quadrature formula

$$\int_{-1}^{1} f(x) w(x)\, dx = A_n \sum_{k=1}^{n} f(x_{nk}) + r_n(f),$$

$-1 \le x_{n1} < \cdots < x_{nn} \le 1$, that satisfies $r_n(f) = 0$ for all $f$ in $\pi_{n-1}$ is called a Chebyshev quadrature formula. Bernstein proved that $w(x) = 1$ does not give a Chebyshev quadrature for $n = 8$ or $n \ge 10$. It has been shown that the weights

$$w(x, p) = (2p + 1 + x)^{-1} (1 - x^2)^{-1/2},$$

where $p \ge 1$, give Chebyshev quadratures for all $n \ge 1$. This means, of course, that $A_n = A_n(p)$ and $x_{nk} = x_{nk}(p)$ exist such that the quadrature formulas are $(n - 1)$st degree polynomial precise.

Finally, we discuss multivariable quadrature (or cubature) formulas very briefly. With the exception of very special regions (balls, spheres, cones, etc.), nontensor-product

cubatures are very difficult to obtain. Some of the reasons are that multivariate interpolation may not be unisolvent and that orthogonal polynomials are difficult to find on an arbitrary region $D$ in $R^s$. There are, however, other methods, including Monte Carlo, quasi Monte Carlo, and number-theoretic methods. On the other hand, many of the one-variable results can be generalized to Cartesian product regions. For instance, let us study optimal quadratures.

Let $H^r_p(I)$ denote the Sobolev spaces on $I = [-1, 1]$, and $W_{r,p}$ the unit ball of $f$ in $H^r_p(I)$; that is

$$\| f^{(r)} \|_p \le 1 .$$

To extend to two variables, for instance, let $V_{r,s;p}(M)$ be the set of all functions $g(\cdot, \cdot)$ in the Sobolev space $H^{(r,s)}_p(I^2)$ such that

$$\| g^{(r,s)}(\cdot, \cdot) \|_p \le M,$$

where $M$ is a fixed positive constant. Now, in the one-variable setting, if

$$r_m(f) = \int_{-1}^{1} f(x)\, dx - \sum_{k=1}^{m} A^m_k f(x_{mk})$$

and

$$r_m(W_{r,p}) = \sup\{ |r_m(f)| : f \in W_{r,p} \},$$

it is known that $r_m(W_{r,p})$ attains its minimum at some set of nodes $\bar{x}_{m1}, \ldots, \bar{x}_{mm}$ in $I$ and some coefficients $\bar{A}^m_1, \ldots, \bar{A}^m_m$. The resulting quadrature formula,

$$\int_{-1}^{1} f(x)\, dx = \sum_{k=1}^{m} \bar{A}^m_k f(\bar{x}_{mk}) + \bar{r}_{m,r}(f),$$

is called an optimal quadrature formula for $W_{r,p}$. Let

$$\int_{-1}^{1} f(x)\, dx = \sum_{k=1}^{n} \bar{B}^n_k f(\bar{y}_{nk}) + \bar{r}_{n,s}(f)$$

be an optimal quadrature formula for $W_{s,p}$. The following result has been obtained:

Theorem 59. Let $1 < p \le \infty$. Then the formula

$$\int_{I^2} g(x, y)\, dx\, dy = \sum_{k=1}^{m} \bar{A}^m_k \int_I g(\bar{x}_{mk}, y)\, dy + \sum_{j=1}^{n} \bar{B}^n_j \int_I g(x, \bar{y}_{nj})\, dx - \sum_{k=1}^{m} \sum_{j=1}^{n} \bar{A}^m_k \bar{B}^n_j\, g(\bar{x}_{mk}, \bar{y}_{nj}) + r^*_{mn}(g)$$

is optimal in the sense that $r^*_{mn}[V_{r,s;p}(M)]$, defined to be

$$\sup\{ |r^*_{mn}(g)| : g \in V_{r,s;p}(M) \},$$

does not exceed $r_{m,n}[V_{r,s;p}(M)]$, defined analogously, where $r_{mn}(g)$ has the same form as $r^*_{mn}(g)$ with arbitrary nodes $x_k$ and $y_j$ in $I$ and arbitrary coefficients $A_j$, $B_k$, and $C_{kj}$. Furthermore

$$r^*_{mn}[V_{r,s;p}(M)] = M\, \bar{r}_{m,r}(W_{r,p})\, \bar{r}_{n,s}(W_{s,p}) .$$

X. SMOOTHING BY SPLINE FUNCTIONS

Not only are spline functions useful in interpolation and approximation; they are natural data-smoothing functions, especially when the data are contaminated with noise. We first discuss the one-variable setting.

Let $0 = t_0 < \cdots < t_{n+1} = 1$ and $H^k_2 = H^k_2[0, 1]$ be the Sobolev space of functions $f$ such that $f, \ldots, f^{(k)}$ are in $L_2 = L_2[0, 1]$, where $k \le n - 1$. For any given data $Z = [z_1, \ldots, z_n]^T$, it can be shown that there is a unique function $s_{k,p} = s_{k,p}(\cdot; Z)$ that minimizes the functional

$$J_{k,p}(u; Z) = p \int_0^1 [D^k u]^2 + \frac{1}{n} \sum_{i=1}^{n} [u(t_i) - z_i]^2$$

over all functions $u$ in $H^k_2$. In fact, the extremal function $s_{k,p}$ is a natural spline of order $2k$ having its knots at $t_1, \ldots, t_n$. The parameter $p > 0$, called the smoothing parameter, controls the "trade-off" between the smoothness of $s_{k,p}$ and its approximation to the data $Z$. Let $s_{k,p}(t_i) = y_i$ and $Y = [y_1, \ldots, y_n]^T$. Then $Y = A_k(p) Z$, where the $n \times n$ matrix $A_k(p)$ is called the influence matrix. Determining $s_{k,p}$ is equivalent to determining $Y$. Hence, the influence matrix is important. It can be shown that $A_k(p) = (I_n + np B_k)^{-1}$, where $B_k$ is an $n \times n$ nonnegative definite symmetric matrix with exactly $k$ zero eigenvalues. The other $n - k$ eigenvalues of $B_k$ are positive and simple, regardless of the location of the $t_i$. Let these eigenvalues be $d_i = d_{ni}$ arranged in nondecreasing order as follows: $0 = d_1 = \cdots = d_k < d_{k+1} < \cdots < d_n$. It is clear that if $Q$ is the unitary matrix that diagonalizes $B_k$ in the sense that $B_k = Q^* D_k Q$, where $D_k = \mathrm{diag}[d_1 \cdots d_n]$, then setting $\hat{Y} = Q^* Y$ and $\hat{Z} = Q^* Z$, we have

$$\hat{y}_i = \frac{1}{1 + np d_i}\, \hat{z}_i,$$

where $\hat{Y} = [\hat{y}_1 \cdots \hat{y}_n]^T$ and $\hat{Z} = [\hat{z}_1 \cdots \hat{z}_n]^T$ could be viewed as the Fourier transforms of $Y$ and $Z$, respectively.

Theorem 60. Let $\{t_i\}$ be quasi-uniformly distributed; that is

$$\max(t_{i+1} - t_i) \le c \min(t_{i+1} - t_i),$$

where $c$ is independent of $n$ as $n \to \infty$. Then there exist positive constants $c_1$ and $c_2$ such that

$$c_1 i^{2k} \le n d_{ni} \le c_2 i^{2k}$$

for $i = k + 1, \ldots, n$ and $n = 1, 2, \ldots$.

Hence, for quasi-uniformly distributed $\{t_i\}$, we have the following asymptotic behavior of $\hat{y}_i$, $i = k + 1, \ldots, n$:

$$\hat{y}_i \sim \frac{1}{1 + (p^{1/2k} r i)^{2k}}\, \hat{z}_i \sim b_k(p^{1/2k} r i)\, \hat{z}_i$$

as $n \to \infty$, where $r$ is a positive constant and $b_k$ a Butterworth-type low-pass filter with the inflection point as its cutoff frequency $i_c = (r p^{1/2k})^{-1}$. The size of the window of this filter decreases monotonically as the smoothing parameter $p$ increases. In the limiting case $p \to \infty$, we have

$$\lim_{p \to \infty} \hat{y}_i = \begin{cases} \hat{z}_i & \text{if } i = 1, \ldots, k \\ 0 & \text{if } i = k + 1, \ldots, n . \end{cases}$$

In this case, the smoothing spline $s_{k,\infty}(\cdot; Z)$ is simply the least-squares polynomial of degree $k - 1$. On the other hand, as $p \to 0$, we have

$$\lim_{p \to 0} \hat{y}_i = \hat{z}_i, \qquad i = 1, \ldots, n,$$

so that $s_{k,0}(\cdot; Z)$ is the natural spline of order $2k$ interpolating the data $Z$.

Suppose that the data are contaminated with noise, say $z_i = g(t_i) + e_i$, where $g \in H^k_2$ and $e_1, \ldots, e_n$ are independent random variables with zero mean and positive standard deviation. A popular procedure for controlling the smoothing parameter $p$ is the method of generalized cross-validation (GCV). We will not go into details but will remark that it is a predictor–corrector procedure.

The idea of a smoothing spline in one dimension has a natural generalization to the multivariate setting. Unfortunately not much is known for an arbitrary domain $D$ in $R^s$. In the following, the $L_2$ norm $\|\cdot\|_2$ will be taken over all of $R^s$, and this accounts for the terminology of "thin-plate splines." Again let $p > 0$ be the smoothing parameter and $k > s/2$. Then for every set of data $\{z_1, \ldots, z_n\}$ in $R$ and a scattered set of sample points $\{t_1, \ldots, t_n\}$ in $R^s$, the thin-plate smoothing spline of order $2k$ and with smoothing parameter $p$ is the (unique) solution of the minimization problem,

$$\inf_{u \in D^{-k} L_2(R^s)} \Big\{ p\, |u|_k^2 + \frac{1}{n} \sum_{i=1}^{n} [u(t_i) - z_i]^2 \Big\},$$

where

$$|u|_k^2 = \sum_{i_1, \ldots, i_k = 1}^{s} \Big\| \frac{\partial^k u}{\partial x_{i_1} \cdots \partial x_{i_k}} \Big\|_2^2$$

and

$$D^{-k} L_2(R^s) = \{ u \in D'(R^s) : D^a u \in L_2(R^s),\ |a| = k \},$$

with $D'(R^s)$ denoting the space of Schwartz distributions on $R^s$ and

$$D^a = \frac{\partial^{|a|}}{\partial x_1^{a_1} \cdots \partial x_s^{a_s}} .$$

Of course, all derivatives are taken in the distributional sense. Since $k > s/2$, the elements in $D^{-k} L_2(R^s)$ are continuous functions.

Theorem 61. The smoothing spline $s_{k,p} = s_{k,p}(\cdot; Z)$ that solves the above minimization problem exists, is unique, and satisfies the distributional equation

$$(-1)^k \Delta^k s_{k,p} = \frac{1}{np} \sum_{i=1}^{n} c_i \delta_{t_i},$$

where $\delta_t$ is the Dirac distribution and $c_i = z_i - s_{k,p}(t_i; Z)$, $i = 1, \ldots, n$. Furthermore, $\Delta^k s_{k,p}$ is orthogonal to the collection $\pi_{k-1}$ of polynomials of total degree $k - 1$ in the sense that

$$\langle \Delta^k s_{k,p}, P \rangle = \sum_{i=1}^{n} c_i P(t_i) = 0$$

for all $P$ in $\pi_{k-1}$.

Let $K_k$ be the elementary solution of the $k$-times-iterated Laplacean:

$$\Delta^k K_k = \delta_0 .$$

Then it is well known that

$$K_k(t) = \begin{cases} c_k |t|^{2k-s} & \text{if } s \text{ is odd} \\ c_k |t|^{2k-s} \log |t| & \text{if } s \text{ is even,} \end{cases}$$

where $c_k = (-1)^k 2\pi [2^{k-1}(k-1)!]^2$. Also, if $m$ is a measure with compact support orthogonal to $\pi_{k-1}$, then $m * K_k$ is an element of $D^{-k} L_2(R^s)$. Applying this result to $\Delta^k s_{k,p}$, we have

$$(-1)^k \Delta^k s_{k,p} * K_k = \frac{1}{np} \sum_{i=1}^{n} c_i K_k(t - t_i)$$

and

$$\Delta^k \big[ (-1)^k \Delta^k s_{k,p} * K_k \big] = (-1)^k \Delta^k s_{k,p} * \Delta^k K_k = (-1)^k \Delta^k s_{k,p} * \delta_0 = (-1)^k \Delta^k s_{k,p} .$$

This yields

$$s_{k,p} - (-1)^k \Delta^k s_{k,p} * K_k \in \pi_{k-1} .$$

Theorem 62. The smoothing spline $s_{k,p}$ satisfies

$$s_{k,p}(t) = (-1)^k \sum_{i=1}^{n} d_i K_k(t - t_i) + P(t)$$

where $P \in \pi_{k-1}$. Furthermore, the polynomial $P$ and the coefficients $d_1, \ldots, d_n$ are uniquely determined by the equations:

$$\frac{1}{np} (-1)^k d_i + s_{k,p}(t_i) - z_i = 0, \qquad i = 1, \ldots, n,$$

and

$$d_1 Q(t_1) + \cdots + d_n Q(t_n) = 0, \qquad Q \in \pi_{k-1} .$$

Hence, if the values $y_i = s_{k,p}(t_i; Z)$, $i = 1, \ldots, n$, are known, the smoothing spline $s_{k,p}$ is also determined. Writing $Y = A_k(p) Z$, where $Y = [y_1, \ldots, y_n]^T$, it is important to study the influence matrix $A_k(p)$ as in the one-variable setting. Again it is possible to write

$$A_k(p) = (I + np B_k)^{-1} .$$

Let $\{t_1, \ldots, t_n\}$ lie in a bounded domain $D$ in $R^s$. We say that $\{t_i\}$ is asymptotically quasi-uniformly distributed in $D$ if for all large $n$,

$$\sup_{t \in D} \min_{1 \le i \le n} |t - t_i| \le c \min_{1 \le i < j \le n} |t_i - t_j|,$$

where $c$ is some positive constant, depending only on $D$.

Theorem 63. Let $D$ satisfy a uniform cone condition and have Lipschitz boundary and $\{t_1, \ldots, t_n\}$ be asymptotically quasi-uniformly distributed in $D$. Then there exists a positive constant $C$ such that the eigenvalues $d_{n1}, \ldots, d_{nn}$ of $B_k$ satisfy $0 = d_{n1} = \cdots = d_{nk} < d_{n,k+1} < \cdots < d_{nn}$ and

$$i^{2k/s} \le n d_{ni} \le C i^{2k/s}, \qquad i = N + 1, \ldots, n,$$

for all $n$, where

$$N = \dim \pi_{k-1} = \binom{k + s - 1}{s} .$$

Thus, if we study the effect of thin-plate spline smoothing in the "frequency domain" as before by writing $\hat{Y} = Q^* Y$ and $\hat{Z} = Q^* Z$, where $Q$ is unitary with $Q^* B_k Q = \mathrm{diag}[d_{n1} \cdots d_{nn}]$, we obtain

$$\hat{y}_i = \frac{1}{1 + np d_{ni}}\, \hat{z}_i \sim b_k(p^{1/2k} r i)\, \hat{z}_i,$$

where, again, $b_k$ is a Butterworth-type low-pass filter with cutoff frequency at $(r p^{s/2k})^{-1}$. Similar conclusions can be drawn on the limiting cases as $p \to \infty$ and $p \to 0$, and again the method of GCV is commonly used to choose the smoothing parameter $p$. It is a predictor–corrector procedure and amounts to choosing $p$ as the minimizer of the GCV function:

$$\frac{1}{n} \sum_{i=1}^{n} [s_{k,p}(t_i; Z) - z_i]^2 \Big/ \Big[ 1 - \frac{1}{n} \mathrm{tr}\, A_k(p) \Big]^2 .$$

Of course, $s_{k,p}(t_i; Z) = y_i$ and determining $\{y_i\}$ is equivalent to determining $s_{k,p}$ itself.

In a more general setting, let $X$, $H$, $K$ be three Hilbert spaces and $T$, $A$ be linear operators mapping $X$ onto $H$ and $K$ and having null spaces $N(T)$, $N(A)$, respectively. Let $p > 0$ be the smoothing parameter in the functional

$$J_{k,p}(u, Z) = p \|Tu\|_H^2 + \|Au - Z\|_K^2,$$

where $Z$ is the given "data" vector in $K$ and $u \in X$.

Theorem 64. If $N(T) + N(A)$ is closed in $X$ with $N(T) \cap N(A) = \{0\}$, there exists a unique function $s_p$ in $X$ satisfying

$$J_{k,p}(s_p, Z) = \inf\{ J_{k,p}(u, Z) : u \in X \} .$$

Moreover, if $S$ denotes the space of all "spline" functions defined by

$$S = \{ s \in X : T^* T s \text{ is in the range of } A^* \},$$

where $T^*$, $A^*$ denote the adjoints of $T$ and $A$, respectively, then $s_p$ satisfies $s_p \in S$ and

$$T^* T s_p = p^{-1} A^* (Z - A s_p) .$$

As a corollary, it is clear that $s_p \in X$ is the smoothing spline function if and only if

$$\langle T s_p, T x \rangle_H = p^{-1} \langle Z - A s_p, A x \rangle_K$$

for all $x \in X$.

An interesting special case is $X = H^k_2(D)$, where $D$ is a bounded open set in $R^s$ with Lipschitz boundary and satisfying a uniform cone condition. If $k > s/2$, then the evaluation functional $\delta_t$ is continuous. Let $t_1, \ldots, t_n \in D$ and set $K = R^n$ and $H = [L_2(D)]^N$, where $N = s^k$. Also, define

$$Au = [\delta_{t_1} u \cdots \delta_{t_n} u]^T$$

and

$$T = \Big[ \frac{\partial^k}{\partial x_{i_1} \cdots \partial x_{i_k}} \Big],$$

where $i_1 = 1, \ldots, s$ and $i_k = 1, \ldots, s$, and equip $H$ with the norm $\|\cdot\|_H$, where

$$\|v\|_H^2 = \sum_{i_1, \ldots, i_k = 1}^{s} \| v_{i_1 \ldots i_k} \|_{L_2(D)}^2 .$$

Then using Theorem 64 and the fact that $\{t_1, \ldots, t_n\}$ is a $\pi_{k-1}$ unisolvent set, the following result can be proved:

Theorem 65. There exists a unique $s_p$ in $H^k_2(D)$ that minimizes $J_{k,p}(u, Z)$ for all $u$ in $H^k_2(D)$. Furthermore, $s_p$ satisfies the distributional partial differential equation

$$(-1)^k \Delta^k s_p = \frac{1}{p} \sum_{i=1}^{n} [z_i - s_p(t_i)]\, \delta_{t_i} .$$

In fact, the following estimate has also been obtained for noisy data:

Theorem 66. Let $z_i = g(t_i) + e_i$, $i = 1, \ldots, n$, where $g$ is in $H^k_2(D)$, and let $e_1, \ldots, e_n$ be independent identically distributed random variables with variance $V$. If the points $t_i$ satisfy a quasiuniform distribution condition in $D$, then there exist positive constants $c_1$ and $c_2$, independent of $n$, such that for $0 \le p \le 1$,

$$E\, |s_p(\cdot; Z) - g|_j^2 \le c_1 p^{(k-j)/k} + c_2 V \frac{1}{n}\, p^{-(2j+s)/2k}$$

for $n^{-2nk/s} \le p \le 1$, where $1 \le j \le k$.

However, no general result on asymptotic optimality of cross-validation is available at the time of this writing.

XI. WAVELETS

The Fourier transform $\hat{f}(\omega)$ of a function $f(t)$ represents the spectral behavior of $f(t)$ in the frequency domain. This representation, however, does not reflect time-evolution of frequencies. A classical method, known as short-time Fourier transform (STFT), is to window the Fourier integral, so that, by the Plancherel identity, the inverse Fourier integral is also windowed. This procedure is called time-frequency localization. This method is not very efficient, however, because a window of the same size must be used for both low and high frequencies. The integral wavelet transform (IWT), defined by

$$(W_\psi f)(b, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi\Big( \frac{t - b}{a} \Big)\, dt,$$

on the other hand, has the property that the time-window narrows at high frequency and widens at low frequency. This is seen by observing that, for real $f$ and $\psi$,

$$(W_\psi f)(b, a) = \frac{\sqrt{a}}{\pi}\, \mathrm{Re} \int_0^{\infty} e^{-ib\omega} \hat{f}(\omega)\, g\Big( a \Big( \omega - \frac{\omega_0}{a} \Big) \Big)\, d\omega,$$

where $g(\omega) := \hat{\psi}(\omega + \omega_0)$, and $\omega_0$ is the center of the frequency window function $\hat{\psi}(\omega)$. More precisely, the window in the time-frequency domain may be defined by

$$[b - a \Delta_\psi,\ b + a \Delta_\psi] \times \Big[ \frac{\omega_0}{a} - \frac{1}{a} \Delta_{\hat{\psi}},\ \frac{\omega_0}{a} + \frac{1}{a} \Delta_{\hat{\psi}} \Big],$$

where $\Delta_\psi$ and $\Delta_{\hat{\psi}}$ denote the standard deviations of $\psi$ and $\hat{\psi}$, respectively, and we have identified the frequency by a constant multiple of $a^{-1}$. Hence, the IWT has the zoom-in and zoom-out capability.

This localization of time and frequency of a signal function $f(t)$, say, not only makes filtering, detection, enhancement, etc., possible, but also greatly facilitates data reduction for the purpose of storage or transmission. The modified signal, however, must be reconstructed, and the best method for this reconstruction is by means of a wavelet series. To explain what a wavelet series is, we rely on the notion of multiresolution analysis as follows: Let $\{V_n\}$ be a nested sequence of closed subspaces of $L^2 = L^2(R)$, such that the intersection of all $V_n$ is the zero function and the union of all $V_n$ is dense in $L^2$. Then we say that $\{V_n\}$ constitutes a multiresolution analysis of $L^2$ if for each $n \in Z$, we have $f \in V_n \Leftrightarrow f(2\cdot) \in V_{n+1}$ and if there exists some $\phi \in V_0$ such that the integer translates of $\phi$ yield an unconditional basis of $V_0$. We will also say that $\phi$ generates a multiresolution analysis of $L^2$. Examples of $\phi$ are B-splines of arbitrary order and with $Z$ as the knot sequence. Next, for each $k \in Z$, let $W_k$ be the orthogonal complementary subspace of $V_{k+1}$ relative to $V_k$. Then it is clear that the sequence of subspaces $\{W_k\}$ is mutually orthogonal, and the orthogonal sum is all of $L^2$. Consequently, every $f \in L^2$ can be decomposed as an orthogonal sum of functions $g_k \in W_k$, $k \in Z$. It can be proved that there exists some function $\psi$ whose integer translates form an unconditional basis of $W_0$. If this $\psi$, called a wavelet, is used as the window function in the IWT, then the "modified" signal $f(t)$ can be reconstructed from the IWT at dyadic values by means of a wavelet series.

Let us first introduce the dual $\tilde{\psi} \in W_0$ of $\psi$ defined (uniquely) by

$$\langle \tilde{\psi}, \psi(\cdot - n) \rangle = \delta_{n,0}, \qquad n \in Z .$$

Then, by using the notation

$$\psi_{k,j}(t) = \psi(2^k t - j),$$

where $j, k \in Z$, and the same notation for $\tilde{\psi}_{k,j}$, we have the following.

Theorem 67. For every $f \in L^2$,

$$f(t) = \sum_{k,j \in Z} 2^{k/2} (W_\psi f)(j 2^{-k}, 2^{-k})\, \tilde{\psi}_{k,j}(t) = \sum_{k,j \in Z} 2^{k/2} (W_{\tilde{\psi}} f)(j 2^{-k}, 2^{-k})\, \psi_{k,j}(t) .$$

Algorithms are available to find the coefficients of the partial sums of this series and to sum the series using values of the coefficients. Since these coefficients are $2^{k/2}$ multiples of the IWT at $(j 2^{-k}, 2^{-k})$ in the time-scale plane,

these algorithms determine the IWT efficiently without integration and reconstruct $f(t)$ efficiently without summing at every $t$. Based on two pairs of sequences $(p_n, q_n)$ and $(a_n, b_n)$, which are known as reconstruction and decomposition sequences, respectively, a pyramid algorithm is realizable. Here, $\{p_n\}$ defines the $\phi$ that generates the multiresolution analysis, namely:

$$\phi(t) = \sum_{n \in Z} p_n \phi(2t - n),$$

and $\{q_n\}$ relates $\psi$ to $\phi$ by:

$$\psi(t) = \sum_{n \in Z} q_n \phi(2t - n) .$$

In addition, $\{a_n\}$ and $\{b_n\}$ determine the decomposition $V_1 = V_0 \oplus W_0$ by

$$\phi(2t - \ell) = \sum_{n \in Z} a_{\ell - 2n} \phi(t - n) + \sum_{n \in Z} b_{\ell - 2n} \psi(t - n)$$

for all $\ell \in Z$.

If $\phi(t) = N_m(t)$ is the $m$th order B-spline with integer knots and $\mathrm{supp}\, N_m = [0, m]$, then we have the following result.

Theorem 68. For each positive integer $m$, the minimally supported wavelet $\psi_m$ corresponding to $N_m(t)$ is given by

$$\psi_m(t) = \frac{1}{2^{m-1}} \sum_{j=0}^{2m-2} (-1)^j N_{2m}(j + 1) \sum_{\ell=0}^{m} (-1)^{\ell} \binom{m}{\ell} N_m(2t - j - \ell) .$$

Furthermore, the reconstruction sequence pair is given by:

$$p_n = 2^{-m+1} \binom{m}{n}$$

and

$$q_n = \frac{(-1)^n}{2^{m-1}} \sum_{j=0}^{m} \binom{m}{j} N_{2m}(n - j + 1),$$

with $\mathrm{supp}\{p_n\} = [0, m]$ and $\mathrm{supp}\{q_n\} = [0, 3m - 2]$.

To describe the pair of decomposition sequences, we need the "Euler–Frobenius" polynomial

$$\Pi_m(z) := \sum_{n=0}^{2m-2} N_{2m}(n + 1) z^n,$$

where we have omitted the multiple of $[(2m - 1)!]^{-1}$ for convenience. Then $\{a_n\}$ and $\{b_n\}$ are determined by the Laurent series:

$$G(z) = \frac{1}{2^m} (1 + z)^m \frac{\Pi_m(z)}{z^m \Pi_m(z^2)} = \sum_{n \in Z} a_n z^{-n},$$

$$H(z) = -\frac{1}{2^m} (1 - z)^m \frac{1}{z^m \Pi_m(z^2)} = \sum_{n \in Z} b_n z^{-n} .$$

The duals $\tilde{N}_m$ of $N_m$ and $\tilde{\psi}_m$ of $\psi_m$ can be computed by using the following formulas:

$$\tilde{N}_m(t) = \sum_{n \in Z} 2 a_n \tilde{N}_m(2t - n),$$

$$\tilde{\psi}_m(t) = \sum_{n \in Z} 2 b_n \tilde{N}_m(2t - n) .$$

SEE ALSO THE FOLLOWING ARTICLES

FRACTALS • KALMAN FILTERS AND NONLINEAR FILTERS • NUMERICAL ANALYSIS • PERCOLATION • WAVELETS, ADVANCED
P1: GKB/LPB P2: GAE Revised Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN001D-72 May 19, 2001 13:37

Boolean Algebra
Raymond Balbes
University of Missouri-St. Louis

I. Basic Definitions and Properties

II. Representation Theory
III. Free Boolean Algebras and Related Properties
IV. Completeness and Higher Distributivity
V. Applications

GLOSSARY

Atom Minimal nonzero element of a Boolean algebra.
Boolean homomorphism Function from one Boolean algebra to another that preserves the operations.
Boolean subalgebra Subset of a Boolean algebra that is itself a Boolean algebra under the original operations.
Complement In a lattice with least and greatest elements, two elements are complemented if their least upper bound is the greatest element and their greatest lower bound is the least element of the lattice.
Complete Pertaining to a Boolean algebra in which every subset has a greatest lower bound and a least upper bound.
Field of sets Family of sets that is closed under finite unions, intersections, and complements and also contains the empty set.
Greatest lower bound In a partially ordered set, the largest of all of the elements that are less than or equal to all of the elements in a given set.
Ideal In a lattice, a set that is closed under the formation of finite least upper bounds and contains all elements of the lattice that are less than elements in the ideal.
Lattice [distributive] Partially ordered set for which the least upper bound $x + y$ and the greatest lower bound $xy$ always exist [a lattice in which the identity $x(y + z) = xy + xz$ holds].
Least upper bound In a partially ordered set, the smallest of all of the elements that are greater than or equal to all of the elements of a given set.
Partially ordered set Set with a relation "≤" that satisfies certain conditions.

A BOOLEAN ALGEBRA is an algebraic system consisting of a set of elements and rules by which the elements of this set are related. Systematic study in this area was initiated by G. Boole in the mid-1800s as a method for describing "the laws of thought" or, in present-day terminology, mathematical logic. The subject's applications are much wider, however, and today Boolean algebras not only are used in mathematical logic, but since they form the basis for the logic circuits of computers, are of fundamental importance in computer science.


I. BASIC DEFINITIONS AND PROPERTIES

A. Definition of a Boolean Algebra

Boolean algebras can be considered from two seemingly unrelated points of view. We shall first give a definition in terms of partially ordered sets. In Section I.C, we shall give an algebraic definition and show that the two definitions are equivalent.

Definition. A partially ordered set (briefly, a poset) is a nonempty set $P$ together with a relation $\le$ that satisfies

1. $x \le x$ for all $x$.
2. If $x \le y$ and $y \le z$, then $x \le z$.
3. If $x \le y$ and $y \le x$, then $x = y$.

For example, the set of integers with the usual $\le$ relation is a poset. The set of subsets of any set together with the inclusion relation $\subseteq$ is a poset. The positive integers with the $|$ relation (recall that $x \mid y$ means that $y$ is divisible by $x$) are a poset. A subset $C$ of a poset $P$ is a chain if $x \le y$ or $y \le x$ for any $x, y \in C$. The integers themselves form a chain under $\le$ but not under $|$, since $3 \le 5$ but neither 3 nor 5 is divisible by the other.

In any poset, we write $x < y$ when $x \le y$ and $x \ne y$. We also use $y \ge x$ synonymously with $x \le y$.

A subset $S$ of a poset $P$ is said to have an upper bound if there exists an element $p \in P$ such that $s \le p$ for all $s \in S$. An upper bound $p$ for $S$ is called a least upper bound (lub) provided that if $q$ is any other upper bound for $S$ then $p \le q$. Similarly, $l$ is a greatest lower bound for $S$ if $l \le s$ for all $s \in S$, and if $m \le s$ for all $s \in S$ then $m \le l$.

Consider the poset $\mathcal{P}(X)$ of all subsets of a set $X$ under inclusion $\subseteq$. For $S, T \in \mathcal{P}(X)$, the set $S \cup T$ is an upper bound for $S$ and $T$ since $S \subseteq S \cup T$ and $T \subseteq S \cup T$; clearly, any other set that contains both $S$ and $T$ must also contain $S \cup T$, so $S \cup T$ is the least upper bound for $S$ and $T$.

Least upper bounds (when they exist) are unique. Indeed, if $z$ and $z'$ are both least upper bounds for $x$ and $y$, then since $z'$ is an upper bound and $z$ is least, we have $z \le z'$. Reversing the roles of $z$ and $z'$, we have $z' \le z$, so $z = z'$.

Denote the (unique) least upper bound, if it exists, by $x + y$. Thus, $x \le x + y$, $y \le x + y$, and if $x \le z$, $y \le z$ then $x + y \le z$. Similarly $x \cdot y$ (or simply $xy$) denotes the greatest lower bound of $x$ and $y$ if it exists.

A nonempty poset $P$ in which $x + y$ and $xy$ exist for every pair of elements $x, y \in P$ is called a lattice (the symbols $\vee$ and $\wedge$ are often used as notation instead of $+$ and $\cdot$). A lattice is called distributive if $x(y + z) = xy + xz$ holds for all $x, y, z$. A poset $P$ has a least element (denoted by 0) if $0 \le x$ for all $x \in P$ and a greatest element (denoted by 1) if $x \le 1$ for all $x \in P$. It is evident that, when they exist, 0 and 1 are unique. A lattice $L$ with 0 and 1 is complemented if for each $x \in L$ there exists an element $y \in L$ such that $x + y = 1$ and $xy = 0$. The elements $x$ and $y$ are called complements of one another.

Definition. A Boolean algebra is a poset $B$ that is a complemented distributive lattice. That is, $B$ is a poset that satisfies

1. For each $x, y \in B$, $x + y$ and $xy$ both exist.
2. 0 and 1 exist.
3. $x(y + z) = xy + xz$ is an identity in $B$.
4. For each $x \in B$ there exists $y \in B$ such that $x + y = 1$ and $xy = 0$.

We shall soon see that, in a Boolean algebra, complements are unique. Thus, for each $x \in B$, $\bar{x}$ will denote the complement of $x$.

B. Examples

Example. The two-element poset 2, which consists of $\{0, 1\}$ together with the relation $0 < 1$, is a Boolean algebra.

Example. Let $X$ be a nonempty set and $\mathcal{P}(X)$ the set of all subsets of $X$. Then $\mathcal{P}(X)$, with $\subseteq$ as the relation, is a Boolean algebra in which $S + T = S \cup T$, $ST = S \cap T$, $0 = \emptyset$, $1 = X$, and $\bar{S} = X - S$, where $X - S = \{x \in X \mid x \notin S\}$. Note that if $X$ has only one member then $\mathcal{P}(X)$ has two elements, $\emptyset$ and $X$, and is essentially the same as the first example.

Diagrams can sometimes be drawn to show the ordering of the elements in a Boolean algebra. Each node represents an element of the Boolean algebra. A node that is connected to one higher in the diagram is "less than" the higher one (Fig. 1).

Example. Let $X$ be a set that is possibly infinite. A subset $S$ of $X$ is cofinite if its complement $X - S$ is finite. The set of all subsets of $X$ that are either finite or cofinite forms a Boolean algebra under inclusion.

The preceding example is a special case of the following.

Example. A field of subsets is a nonempty collection $\mathcal{F}$ of subsets of a set $X \ne \emptyset$ that satisfies

1. If $S, T \in \mathcal{F}$, then $S \cap T$ and $S \cup T$ are also in $\mathcal{F}$.
2. If $S \in \mathcal{F}$, then so is $X - S$.

A field of subsets is a Boolean algebra under the inclusion relation, and it is easy to see that
P1: GKB/LPB P2: GAE Revised Pages
Encyclopedia of Physical Science and Technology EN001D-72 May 19, 2001 13:37

Boolean Algebra 291

S + T = S ∪ T
ST = S ∩ T
0 = ∅
1 = X
S̄ = X − S
Note that ∅ and X are in ℱ because ℱ ≠ ∅, so there exists S₀ ∈ ℱ. Hence S̄₀ ∈ ℱ and so ∅ = S₀ ∩ S̄₀ ∈ ℱ. Also X = X − ∅ ∈ ℱ.
This is an important example because the elements of the Boolean algebra are themselves sets rather than abstract symbols.

FIGURE 1 Ordering of elements in a Boolean algebra. (A) 2; (B) 𝒫({x, y}); (C) 𝒫({x, y, z}), a = {x}, b = {y}.

Example. Let X ≠ ∅ be a topological space. A set S is regular open if S = INT(CL(S)). The set of all regular open subsets of X forms a Boolean algebra under inclusion and
S + T = INT(CL(S ∪ T))
ST = S ∩ T
S̄ = X − CL(S)
0 = ∅
1 = X
Note that this collection is not a field of sets since S ∪ T may not be a regular open set.

C. Elementary Properties

The following theorem contains several elementary but useful formulas.

Theorem. In a Boolean algebra B,

1. The following are equivalent: (i) x ≤ y; (ii) x + y = y; (iii) xy = x.
2. (x + y) + z = x + (y + z); (xy)z = x(yz).
3. x + y = y + x; yx = xy.
4. x + x = x; xx = x.
5. x + xy = x; x(x + y) = x.
6. If x ≤ y, then x + z ≤ y + z and xz ≤ yz.
7. xy ≤ u + v if and only if xū ≤ ȳ + v; in particular, xy = 0 if and only if x ≤ ȳ, and x ≤ y if and only if ȳ ≤ x̄.
8. $\overline{x + y} = \bar{x}\,\bar{y}$; $\overline{xy} = \bar{x} + \bar{y}$ (de Morgan's laws).
9. x + yz = (x + y)(x + z).
10. Complements are unique.
11. $\bar{\bar{x}} = x$.

Proof. We select some of these properties to prove. First, in 1, we prove that (i) implies (ii). Given x ≤ y, we want to show that y is the least upper bound of x and y. But this is immediate from the definition of least upper bound. Indeed, x ≤ y and y ≤ y show that y is an upper bound for x, y. Also if x ≤ z and y ≤ z for some z, then y ≤ z, so y is the least upper bound.
For the first part of 6, simply note that y + z is an upper bound for x and z (since x ≤ y), so the least upper bound x + z of x and z must satisfy x + z ≤ y + z.
For 7, first observe from the definition of the greatest lower bound and the least upper bound that x + 1 = 1 and x1 = x for all x. Now if xy ≤ u + v, then
xū = (xū)1 = (xū)(y + ȳ)
= (xū)y + (xū)ȳ ≤ x(ūy) + ȳ = x(yū) + ȳ
= (xy)ū + ȳ ≤ (u + v)ū + ȳ
= (uū + vū) + ȳ ≤ (0 + v) + ȳ = v + ȳ
To prove the first part of 8, we show that $\overline{x + y}$ is the greatest lower bound for x̄ and ȳ. Now since x, y ≤ x + y, we have $\overline{x + y}$ ≤ x̄ and $\overline{x + y}$ ≤ ȳ, by 7, so $\overline{x + y}$ is a lower bound for x̄ and ȳ. But if z ≤ x̄ and z ≤ ȳ, then x ≤ z̄ and y ≤ z̄, so x + y ≤ z̄ and thus z ≤ $\overline{x + y}$.
Formula 9 is just the "distributive law" with + and · interchanged, and it indeed holds since (x + y)(x + z) = (x + y)x + (x + y)z = xx + yx + xz + yz = x + xy + xz + yz = x + yz.
Finally, statement 10, that complements are unique, means that if there exist z₁ and z₂ satisfying x + z₁ = 1,

xz₁ = 0 and x + z₂ = 1, xz₂ = 0, then z₁ = z₂. Now to prove this, z₁ = z₁1 = z₁(x + z₂) = z₁x + z₁z₂ = 0 + z₁z₂ ≤ z₂. Similarly, z₂ ≤ z₁, so z₁ = z₂.

In a Boolean algebra B, any finite set S = {x₁, ..., xₙ} has a least upper bound (...((x₁ + x₂) + x₃) + ···) + xₙ. This can be proved by mathematical induction, and in fact the order and parentheses do not matter. That is, x₁ + (x₂ + x₃), (x₂ + x₁) + x₃, etc., all represent the same element, namely, the least member of B that is greater than or equal to x₁, x₂, and x₃. Thus, we use the notation x₁ + ··· + xₙ or ΣS or
$\sum_{i \in \{1,\dots,n\}} x_i$
for the least upper bound of a nonempty set S = {x₁, ..., xₙ}. Similarly, x₁ ··· xₙ, ΠS, and
$\prod_{i \in \{1,\dots,n\}} x_i$
represent the greatest lower bound of S. All of the properties listed above generalize in the obvious way. For example,

1. ΠS ≤ s ≤ ΣS for all s ∈ S.
2. x + ΠS = Π{x + s | s ∈ S}.
3. $\overline{\sum S} = \prod \{\bar{s} \mid s \in S\}$, $\overline{\prod S} = \sum \{\bar{s} \mid s \in S\}$.

Note that, if S = ∅, then 0 satisfies the criteria for being a least upper bound for the set S. So we extend the definitions of Σ and Π to Σ∅ = 0, Π∅ = 1.
As mentioned above, Boolean algebras can be defined as algebraic systems.

Definition. A Boolean algebra is a set B with two binary operations +, ·, a unary operation ¯, and two distinguished, but not necessarily distinct, elements 0, 1 that satisfies the following identities:

1. (x + y) + z = x + (y + z); (xy)z = x(yz).
2. x + y = y + x; xy = yx.
3. x + x = x; xx = x.
4. x + xy = x; x(x + y) = x.
5. x(y + z) = xy + xz; x + yz = (x + y)(x + z).
6. 0 + x = x; 1x = x.
7. x + x̄ = 1; xx̄ = 0.

A more concise definition of a Boolean algebra is as an idempotent ring (R, +, ·, 0), in the classical algebraic sense, with unity. In this case, the least upper bound is x + y + xy; the greatest lower bound is xy; 0 and 1 play their usual role; and x̄ = 1 − x.
Now that we have two definitions of a Boolean algebra, we must show that they are equivalent. Let us start with the poset definition of B. From the theorem in the last section, it is easy to see that by considering + and · as operations, rather than as least upper and greatest lower bounds, all of the above identities are true. On the other hand, suppose we start with a Boolean algebra B defined algebraically. We can then define a relation ≤ on B by x ≤ y if and only if xy = x. This relation is in fact a partial order on B since (1) xx = x implies x ≤ x; (2) if x ≤ y and y ≤ z, then xy = x, yz = y, so xz = (xy)z = x(yz) = xy = x, so x ≤ z; and (3) if x ≤ y and y ≤ x, then xy = x, yx = y, so x = xy = yx = y. If we continue in this way, x + y turns out to be the least upper bound of x and y under the ≤ that we have defined, xy is the greatest lower bound, 0 is the least element, 1 is the greatest, and x̄ is the complement of x. Thus B, under ≤, is a Boolean algebra.
Since both of these points of view have advantages, we shall use whichever is most appropriate. As we shall see, no confusion will arise on this account.
From the algebraic point of view, the definition of 2 and 𝒫({x, y}), shown in Fig. 1, could be given by Table I.

D. Subalgebras

A (Boolean) subalgebra is a subset of a Boolean algebra that is itself a Boolean algebra under the original operations.
A subset B₀ of a Boolean algebra that satisfies the conditions 1–3 below also satisfies the four definitions of a Boolean algebra. This justifies the following:

Definition. A subalgebra B₀ of a Boolean algebra B is a subset of B that satisfies

1. If x, y ∈ B₀, then x + y and xy ∈ B₀.
2. If x ∈ B₀, then x̄ ∈ B₀.
3. 0, 1 ∈ B₀.

Every subalgebra of a Boolean algebra contains 0 and 1; {0, 1} is itself a subalgebra. The collection of finite and cofinite subsets of a set X forms a subalgebra of 𝒫(X).

TABLE I Operator Tables for 2 and 𝒫({x, y})

Boolean algebra 2

+ | 0 1      · | 0 1      x | 0 1
0 | 0 1      0 | 0 0      x̄ | 1 0
1 | 1 1      1 | 0 1

Boolean algebra 𝒫({x, y}), where a = {x}, b = {y}

+ | 0 a b 1      · | 0 a b 1      x | 0 a b 1
0 | 0 a b 1      0 | 0 0 0 0      x̄ | 1 b a 0
a | a a 1 1      a | 0 a 0 a
b | b 1 b 1      b | 0 0 b b
1 | 1 1 1 1      1 | 0 a b 1
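The operator tables for 𝒫({x, y}) and the seven pairs of identities in the algebraic definition can be checked by brute force. The following is a minimal sketch, assuming Python frozensets stand in for the subsets; the helper names (`power_set`, `satisfies_identities`, `is_subalgebra`) are illustrative and do not come from the text.

```python
from itertools import combinations, product

def power_set(universe):
    """All subsets of `universe`, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

X = frozenset({"x", "y"})
B = power_set(X)                 # the four elements 0, a, b, 1
ZERO, ONE = frozenset(), X

def join(s, t):
    return s | t                 # S + T = S ∪ T

def meet(s, t):
    return s & t                 # ST = S ∩ T

def comp(s):
    return X - s                 # complement of S in X

def satisfies_identities(elements):
    """Brute-force check of identities 1-7 of the algebraic definition."""
    for x, y, z in product(elements, repeat=3):
        ok = (
            join(join(x, y), z) == join(x, join(y, z))
            and meet(meet(x, y), z) == meet(x, meet(y, z))
            and join(x, y) == join(y, x) and meet(x, y) == meet(y, x)
            and join(x, x) == x and meet(x, x) == x
            and join(x, meet(x, y)) == x and meet(x, join(x, y)) == x
            and meet(x, join(y, z)) == join(meet(x, y), meet(x, z))
            and join(x, meet(y, z)) == meet(join(x, y), join(x, z))
            and join(ZERO, x) == x and meet(ONE, x) == x
            and join(x, comp(x)) == ONE and meet(x, comp(x)) == ZERO
        )
        if not ok:
            return False
    return True

def is_subalgebra(subset):
    """Conditions 1-3 of the subalgebra definition."""
    s = set(subset)
    return (ZERO in s and ONE in s
            and all(comp(x) in s for x in s)
            and all(join(x, y) in s and meet(x, y) in s
                    for x in s for y in s))

a, b = frozenset({"x"}), frozenset({"y"})
print(satisfies_identities(B), is_subalgebra([ZERO, ONE]), is_subalgebra([ZERO, a, ONE]))
```

In this sketch {0, 1} passes the subalgebra test, while {0, a, 1} fails because the complement b of a is missing.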

Now suppose that S is a subset of a Boolean algebra B. Let S⁻ = {s̄ | s ∈ S}. Then the set of all elements of the form $x = \sum_{i=1}^{n} \prod S_i$, where the Sᵢ are finite nonempty subsets of S ∪ S⁻, forms a subalgebra of B. In fact, this set is the smallest subalgebra of B that contains S. It is denoted by [S] and is called the subalgebra of B generated by S. For x ∈ B, [{x}] = {0, x, x̄, 1}; [{0, 1}] = {0, 1} and [B] = B.

E. Ideals and Congruence Relations

Definition. Let B be a Boolean algebra. An ideal I is a nonempty subset of B such that

1. If x ≤ y, y ∈ I, then x ∈ I.
2. If x, y ∈ I, then x + y ∈ I.

If I ≠ B, then I is called a proper ideal.

Example. The set I of finite subsets of X forms an ideal in 𝒫(X).

Example. {0} and B are both ideals in any Boolean algebra B.
Let B be a Boolean algebra. For b ∈ B, the set {x ∈ B | x ≤ b} is called a principal ideal and is denoted by (b]. Now let S be any subset of B. The set {x ∈ B | x ≤ s₁ + ··· + sₙ, sᵢ ∈ S} [denoted by (S]] is the smallest ideal in B that contains S. It is called the ideal generated by S. Of course, if S = {b}, then (b] = ({b}].
Ideals are closely related to the standard algebraic notion of a congruence relation. A congruence relation on a Boolean algebra B is a relation θ that satisfies

1. (x, x) ∈ θ for all x ∈ B.
2. (x, y) ∈ θ implies (y, x) ∈ θ.
3. (x, y), (y, z) ∈ θ implies (x, z) ∈ θ.
4. (x, y), (x₁, y₁) ∈ θ implies (x + x₁, y + y₁), (xx₁, yy₁) ∈ θ.
5. (x, y) ∈ θ implies (x̄, ȳ) ∈ θ.

A congruence relation θ on B determines a Boolean algebra B/θ = {[x] | x ∈ B}, where [x] is the congruence class determined by θ. There is a one-to-one correspondence between the set of all congruences on B and the set of all ideals in B. The correspondence is given by θ → I, where x ∈ I if (x, 0) ∈ θ. The inverse function I → θ is defined by (x, y) ∈ θ if x ⊕ y ∈ I, where x ⊕ y = xȳ + x̄y. Thus, B/θ is sometimes written as B/I, where I is the ideal corresponding to θ. In this notation, B/I is a Boolean algebra in which
x/I + y/I = (x + y)/I
x/I · y/I = (xy)/I
$\overline{x/I}$ = x̄/I
and
x/I = y/I
if there exists i₀ ∈ I such that
x + i₀ = y + i₀

F. Homomorphisms

Definition. Let B and B₁ be Boolean algebras. A (Boolean) homomorphism is a function f : B → B₁ that preserves the operations; that is,

1. f(x + y) = f(x) + f(y).
2. f(xy) = f(x) f(y).
3. f(x̄) = $\overline{f(x)}$.
4. f(0) = 0, f(1) = 1.

Homomorphisms preserve order; that is, if x ≤ y, then f(x) ≤ f(y). But a function that preserves order need not be a homomorphism. Condition 3 is actually a consequence of 1, 2, and 4 since f(x̄) is the complement of f(x) and complements are unique.

Example. Let ℱ be a field of subsets of a set X (see Section I.B) and select an element p ∈ X. Define a function f : ℱ → 2 by
f(S) = 0 if p ∉ S
f(S) = 1 if p ∈ S
To see that f is a homomorphism, we first show that f preserves order. If S₁ ⊆ S₂ and f(S₂) = 1, then f(S₁) ≤ 1 = f(S₂). But if f(S₂) = 0, then p ∉ S₂, so p ∉ S₁, since S₁ ⊆ S₂. Therefore, f(S₁) = 0 = f(S₂). Since f preserves order, we have f(S₁) ≤ f(S₁ ∪ S₂) and f(S₂) ≤ f(S₁ ∪ S₂), so f(S₁) + f(S₂) ≤ f(S₁ ∪ S₂). To prove f(S₁ ∪ S₂) ≤ f(S₁) + f(S₂), suppose that f(S₁) + f(S₂) = 0. Then since 2 consists only of 0 and 1, we have f(S₁) = f(S₂) = 0, so p ∉ S₁ and p ∉ S₂. Hence, p ∉ S₁ ∪ S₂, so f(S₁ ∪ S₂) = 0. Next, we show f(S₁) f(S₂) ≤ f(S₁ ∩ S₂); the reverse inequality follows since f preserves order. If f(S₁ ∩ S₂) = 0, then p ∉ S₁ ∩ S₂, so either p ∉ S₁ or p ∉ S₂. Hence, f(S₁) = 0 or f(S₂) = 0. In any case, f(S₁) f(S₂) = 0. Since 3 is a consequence of 1, 2, and 4, we turn to 4. Now p ∉ ∅ and p ∈ X, so f(∅) = 0 and f(X) = 1.

Example. Let B be a Boolean algebra and a ∈ B, a ≠ 0. The ideal (a] is a Boolean algebra (not a subalgebra!) in which, if we denote the least upper bound by ∗, the greatest lower bound by ×, and the complement by −ₐ, we have for x, y ∈ (a],

x ∗ y = x + y
x × y = xy
−ₐx = ax̄
Also, 0 is the least element of (a] and a the greatest element.
The function f : B → (a], f(x) = ax is a homomorphism.

Example. Let I be an ideal in a Boolean algebra B. Then v : B → B/I, v(x) = x/I is a homomorphism.

Definition. A homomorphism that is one to one and onto is called an isomorphism.
If there exists an isomorphism f from B to B₁, then it is easy to see that f⁻¹ is also an isomorphism. In this case, B and B₁ are said to be isomorphic. This is an important concept because isomorphic Boolean algebras are identical from the point of view of Boolean algebras. That is, there is a one-to-one correspondence between their elements and any property that can be expressed in terms of ≤, +, ·, 0, 1, ¯ that is true about one is also true about the other. For example, the Boolean algebra 2 (see Table I) and the field of all subsets of the set {p} are isomorphic under the correspondence 0 ↔ ∅, 1 ↔ {p}. In fact, we will see later that, for any positive integer n, there is only one (up to isomorphism) Boolean algebra with 2ⁿ elements. Put another way, every Boolean algebra with 2ⁿ elements is isomorphic with the set of all subsets of {1, ..., n}.
The well-known "homomorphism theorem" of classical ring theory applies to Boolean algebras and can be formulated as follows. If f : B → B₁ is an onto homomorphism, then B₁ is isomorphic with B/K, where K is the ideal K = {x ∈ B | f(x) = 0}. Also let L₁ be a subalgebra of L and I an ideal in L. Then L₁₁ = {a ⊕ x | a ∈ L₁, x ∈ I} is a subalgebra of L, and L₁/(L₁ ∩ I) is isomorphic with L₁₁/I. (Here, a ⊕ x = āx + ax̄.)

G. Duality

Let P be a mathematical statement concerning Boolean algebras. The dual of P is the statement obtained from P by interchanging the + and · operations and also interchanging the 0's and 1's, where they occur. There are two duality principles that play a role in the study of Boolean algebras. The first, weak duality, is a consequence of the fact that, in the algebraic definition of a Boolean algebra (Section I.C), all of the identities appear along with their dual statements. This implies that, if P is a statement that is true in all Boolean algebras, the dual of P is also true in all Boolean algebras. For example, the statement "If x + y = z̄, then xz = 0" is always true in any Boolean algebra. Hence, its dual "xy = z̄ implies x + z = 1" is also true in all Boolean algebras.
The strong-duality principle is that, if a statement is true in a particular Boolean algebra B, its dual is also true in B. The reason for this is that B is "anti-isomorphic" with itself; that is, x ≤ y if and only if ȳ ≤ x̄. For example, if there exists an element b ≠ 0 in a Boolean algebra B such that 0 ≤ x < b implies x = 0, then by strong duality there must also exist an element c ≠ 1 in B such that c < x ≤ 1 implies x = 1.

II. REPRESENTATION THEORY

A. The Finite Case

In this section we classify all of the finite Boolean algebras. A useful concept for this purpose is that of an atom.

Definition. Let B be a Boolean algebra. An element a ∈ B is called an atom if a ≠ 0 and if x ≤ a, then x = 0 or x = a.

Theorem. In a Boolean algebra B, the following are equivalent for an element a ≠ 0.

1. a is an atom.
2. If a = x + y, then a = x or a = y.
3. If a ≤ x + y, then a ≤ x or a ≤ y.

Proof. Suppose first that a is an atom and a = x + y. Then since x ≤ a and y ≤ a, we have {x, y} ⊆ {0, a}. But x and y cannot both be 0 since a ≠ 0, so x = a or y = a. Next suppose that 2 holds and a ≤ x + y. Then a = a(x + y) = ax + ay, so a = ax or a = ay. Hence, a ≤ x or a ≤ y. Finally, if 3 holds and x ≤ a, then a = a(x + x̄) = ax + ax̄; thus, a = ax or a = ax̄. So a ≤ x or a ≤ x̄. But x ≤ a, so if a ≤ x then x = a, and if a ≤ x̄ then x = xa = 0.

Theorem. Let B be a finite Boolean algebra. Then B is isomorphic with 𝒫(A), where A is the set of atoms of B.

Proof. For each b ∈ B, let ϕ(b) = {a ∈ A | a ≤ b}. Now b → ϕ(b) preserves order, for if b₁ ≤ b₂ and a ∈ ϕ(b₁), then a ≤ b₁ ≤ b₂ so a ∈ ϕ(b₂). Thus, ϕ(b₁) ∪ ϕ(b₂) ⊆ ϕ(b₁ + b₂). For the reverse inclusion, let a ∈ ϕ(b₁ + b₂). Then a ≤ b₁ + b₂, but by the previous theorem, a ≤ b₁ or a ≤ b₂, so a ∈ ϕ(b₁) or a ∈ ϕ(b₂). Hence, a ∈ ϕ(b₁) ∪ ϕ(b₂). It is also easily seen that ϕ(b₁) ∩ ϕ(b₂) = ϕ(b₁b₂), ϕ(0) = ∅, ϕ(1) = A, and A − ϕ(b) = ϕ(b̄). So ϕ is a homomorphism. Now for each b, let len(b) be the number of elements in the longest chain in B that has b as its largest element. This definition is possible since B is finite. We shall

prove, by induction on the length of b, that b = Σϕ(b) for all b ∈ B. If len(b) = 1, then b = 0, so ϕ(0) = ∅. Hence 0 = Σ∅ = Σϕ(0). Suppose that Σϕ(x) = x for all x such that len(x) < n. Now suppose len(b) = n. If b is an atom, then b ∈ ϕ(b), so b = Σ{a ∈ A | a ≤ b} = Σϕ(b), so we can assume that b is not an atom. Then there exist c ≠ b and d ≠ b such that b = c + d. But since c < b and d < b, we can conclude that len(c) < n and len(d) < n. Thus, c = Σϕ(c) and d = Σϕ(d). It follows that
b = c + d = Σϕ(c) + Σϕ(d)
= Σ[ϕ(c) + ϕ(d)] = Σϕ(c + d) = Σϕ(b)
To complete the proof first note that, by distributivity, if S ⊆ A, a ∈ A, and a ≤ ΣS, then a = aΣS = Σ{as | s ∈ S} and each as is either 0 or an atom. Now let S ∈ 𝒫(A). Then ΣS ∈ B and ϕ(ΣS) = {a ∈ A | a ≤ ΣS} = S, so ϕ is onto. Finally, to show that ϕ is one to one, let ϕ(b₁) = ϕ(b₂); then b₁ = Σϕ(b₁) = Σϕ(b₂) = b₂.

B. General Representation Theory

The representation of finite Boolean algebras as fields of sets can be generalized to arbitrary Boolean algebras. In this section we show how this is done and go farther by showing that, as a category, Boolean algebras and homomorphisms are equivalent to a certain category of topological spaces and continuous functions.

Definition. An ideal I in a Boolean algebra B is maximal provided that

1. I ≠ B.
2. If J is an ideal and I ⊆ J ⊆ B, then J = I or J = B.

It is easy to see that a principal ideal (b] is maximal if and only if b̄ is an atom. A more general theorem is the following:

Theorem. An ideal I is maximal if and only if I ≠ B and xy ∈ I implies x ∈ I or y ∈ I.

Proof. First suppose that I is maximal, xy ∈ I, but x ∉ I and y ∉ I. Then the ideal generated by I ∪ {x} properly contains I and, by hypothesis, must be B. So 1 ∈ B = (I ∪ {x}] and 1 = i + x for some i ∈ I. Thus, x̄ ≤ i so x̄ ∈ I. Similarly, ȳ ∈ I and hence $\overline{xy}$ = x̄ + ȳ ∈ I. But then 1 = (xy) + $\overline{xy}$ ∈ I, which implies that I = B, a contradiction. So x ∈ I or y ∈ I.
Now assume the hypothesis of the converse and that there is an ideal J such that I ⊂ J ⊆ B. Let x ∈ J − I. To prove B = J, let z ∈ B. Then x(x̄z) = 0 ∈ I and x ∉ I so x̄z ∈ I. Hence, x̄z ∈ J. So x + z = x + x̄z ∈ J and therefore z ∈ J. Thus, B = J.

It can also be shown that an ideal I is maximal if and only if B/I is isomorphic with 2 and that the set of homomorphisms of B onto 2 is in one-to-one correspondence with the set of maximal ideals. However, the crucial result about maximal ideals concerns their existence.

Theorem. In a Boolean algebra B every element b ≠ 1 is contained in a maximal ideal.

Proof. (Note: This proof requires a knowledge of Zorn's lemma.) Let 𝒥 be the poset of proper ideals that contain b; (b] is one such ideal. Now if 𝒞 is a chain in 𝒥, then ∪𝒞 is an upper bound for 𝒞 since it is an ideal and, being a union of ideals that do not contain 1, cannot itself contain 1. So by Zorn's lemma, 𝒥 has a maximal element, say I. Clearly, I is an ideal that contains b. Now if J is a proper ideal that contains I, then b ∈ J, so J ∈ 𝒥. But I is a maximal element of 𝒥, so J = I; hence, I is a maximal ideal.
Let B be a Boolean algebra and ℳ the set of maximal ideals in B. For each b ∈ B, let b̂ = {I ∈ ℳ | b ∉ I}. It is easy to verify that B̂ = {b̂ | b ∈ B} is a field of sets. Specifically,
b̂₁ ∪ b̂₂ = $\widehat{b_1 + b_2}$
b̂₁ ∩ b̂₂ = $\widehat{b_1 b_2}$
0̂ = ∅
1̂ = ℳ
$\hat{\bar{b}}$ = ℳ − b̂
Moreover, the assignment ϕ : b → b̂ is an isomorphism. To see that ϕ is one to one, suppose b₁ ≠ b₂ and without loss of generality say b₁ ≰ b₂. Then b̄₁ + b₂ ≠ 1. So there is a maximal ideal I containing b̄₁ + b₂; thus, b₂ ∈ I and b̄₁ ∈ I, so b₁ ∉ I. Thus, I ∈ b̂₁, but I ∉ b̂₂. Hence, b̂₁ ≠ b̂₂.
For the topological representation theory, we start with a Boolean algebra B and form a topological space 𝒮(B) as follows. The points of 𝒮(B) are the maximal ideals of B, and {x̂ | x ∈ B} is taken as a basis for the topology. This space is a compact Hausdorff space in which the sets that are both open and closed (these turn out to be exactly {x̂ | x ∈ B}) form a basis. Such a space X is called a Boolean (also Stone) space. Let ℱ(X) denote the field of open–closed sets of a Boolean space X.

Theorem. For a Boolean algebra B, ℱ(𝒮(B)) is isomorphic with B and for each Boolean space X, 𝒮(ℱ(X)) is homeomorphic with X.

This identification can be extended to a coequivalence (in the sense of category theory) that assigns to each Boolean algebra B its corresponding Boolean space 𝒮(B) and to each homomorphism f : B → B₁ the continuous function f̂ : 𝒮(B₁) → 𝒮(B) defined by f̂(I) = f⁻¹[I],

I ∈ 𝒮(B₁). Furthermore, f̂⁻¹[x̂] = $\widehat{f(x)}$ for each x ∈ B (Fig. 2).

FIGURE 2 Coequivalence between homomorphisms and continuous functions.

This coequivalence can be very useful since problems in Boolean algebra can be transformed into questions or problems about Boolean spaces that may be more amenable to solution.

III. FREE BOOLEAN ALGEBRAS AND RELATED PROPERTIES

The basic fact about extending functions to homomorphisms is given by the following:

Theorem. Let B, B₁ be Boolean algebras, ∅ ≠ S ⊆ B, and f : S → B₁ a function; f can be extended to a homomorphism g : [S] → B₁ if and only if: (1) If ΠT₁ ≤ ΣT₂ then Πf[T₁] ≤ Σf[T₂], where T₁ ∪ T₂ is any finite nonempty subset of S. (Note that this includes the cases ΠT = 0 ⇒ Πf[T] = 0 and ΣT = 1 ⇒ Σf[T] = 1, where T is a finite nonempty set.)

Sketch of Proof. To define g : [S] → B₁, we set $g\left(\sum_{i=1}^{n} \prod T_i\right) = \sum_{i=1}^{n} \prod f[T_i]$. The condition (1) is sufficient to prove that g is a well-defined homomorphism. Clearly, g | S = f.

A free Boolean algebra with free generators {xᵢ}, i ∈ I, is a Boolean algebra generated by {xᵢ | i ∈ I} and such that any function f : {xᵢ | i ∈ I} → B₁ can be extended to a homomorphism. Free Boolean algebras with free generating sets of the same cardinality are isomorphic. For the existence of a free Boolean algebra with a free generating set of cardinality α, we start with a set X of cardinality α. For each x ∈ X, let Sₓ = {S ⊆ X | x ∈ S}. Now {Sₓ | x ∈ X} freely generates a subalgebra C of 𝒫(𝒫(X)). The set {Sₓ | x ∈ X} is said to be independent because it satisfies: $\bigcap_{x \in A} S_x \subseteq \bigcup_{y \in B} S_y$ implies A ∩ B ≠ ∅ for any finite nonempty subset A ∪ B of X. But this condition clearly implies that (1) holds, so C is free.
Free Boolean algebras have many interesting properties:

1. Every chain in a free Boolean algebra is finite or countably infinite.
2. Every free Boolean algebra has the property that, if S is a set in which xy = 0 for all x ≠ y in S, then S is finite or countably infinite. (This is called the countable chain condition.)
3. Countable atomless Boolean algebras are free.
4. The finite free Boolean algebras are all of the form 𝒫(𝒫(S)) for finite S.

The Boolean space corresponding to the free Boolean algebra on countably many free generators is the Cantor discontinuum. So for example, the topological version of "Every countable Boolean algebra is projective" is "Every closed subset of the Cantor discontinuum is a retract of the Cantor discontinuum."

IV. COMPLETENESS AND HIGHER DISTRIBUTIVITY

The canonical work on this topic is "Boolean algebras" by R. Sikorski, and there are many more recent results as well. We present here only a brief introduction.

Definition. A Boolean algebra B is complete if ΣS and ΠS exist for all S ⊆ B.

For example, 𝒫(S) is complete, as is the Boolean algebra of regular open sets of a topological space. Here one must be careful, for if S is a family of regular open sets, ∩S may not be regular open. It can be shown that ΠS = INT(∩S) and ΣS = INT(CL(∪S)).
There are many useful ways to generalize the distributive law x(y + z) = xy + xz; we will restrict our attention to the following:

Definition. A Boolean algebra is completely distributive provided that, if {aᵢⱼ}, i ∈ I, j ∈ J, is a set of elements such that $\sum_{j \in J} a_{ij}$ exists for each i ∈ I, $\prod_{i \in I} \sum_{j \in J} a_{ij}$ exists, and $\prod_{i \in I} a_{i\varphi(i)}$ exists for each ϕ ∈ Jᴵ, then $\sum_{\varphi \in J^I} \prod_{i \in I} a_{i\varphi(i)}$ exists and
$\prod_{i \in I} \sum_{j \in J} a_{ij} = \sum_{\varphi \in J^I} \prod_{i \in I} a_{i\varphi(i)}.$

A typical result involving these notions is the following:

Theorem. Let B be a Boolean algebra. Then the following are equivalent:

1. B is complete and completely distributive.
2. B is complete and every element is a sum of atoms.
3. B is isomorphic with the field of all subsets of some set.
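The link between "every element is a sum of atoms" and "isomorphic with the field of all subsets of some set" can be seen in a small finite case. The sketch below uses an example that is not from the text and is chosen only for illustration: the divisors of 30 ordered by divisibility, with lcm as least upper bound, gcd as greatest lower bound, 1 as the least element, and 30 as the greatest. The names `phi` and `join_all` are likewise illustrative; `phi` plays the role of the map ϕ(b) = {a ∈ A | a ≤ b} from the finite representation theorem.

```python
from math import gcd

N = 30  # squarefree, so its divisors form a Boolean algebra under divisibility
B = [d for d in range(1, N + 1) if N % d == 0]
ZERO, ONE = 1, N

def join(x, y):
    return x * y // gcd(x, y)    # least upper bound = lcm

def meet(x, y):
    return gcd(x, y)             # greatest lower bound = gcd

def leq(x, y):
    return y % x == 0            # x <= y iff x divides y

# An atom is a nonzero element with nothing strictly between 0 and itself.
atoms = [a for a in B
         if a != ZERO and all(x in (ZERO, a) for x in B if leq(x, a))]

def phi(b):
    """The set of atoms below b."""
    return frozenset(a for a in atoms if leq(a, b))

def join_all(elements):
    """Least upper bound of a finite set; the empty join is 0."""
    total = ZERO
    for e in elements:
        total = join(total, e)
    return total

print(atoms)
print({b: sorted(phi(b)) for b in B})
```

Here the atoms are the prime divisors, every divisor is the join of the atoms below it, and `phi` is a bijection onto the subsets of the atom set.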

A striking theorem of Sikorski, from which it follows that the injectives in the category of Boolean algebras are those that are complete, can be formulated as follows:

Theorem. If B₁ is a subalgebra of a Boolean algebra B, A is a complete Boolean algebra, and f : B₁ → A is a homomorphism, then there exists a homomorphism g : B → A such that g | B₁ = f.

Proof. By Zorn's lemma, there exists a homomorphism f₁ : B₀ → A that is maximal with respect to being an extension of f. We will, of course, be done if B₀ = B, so suppose B₀ ⊂ B. Select a ∈ B − B₀ and set
b = Σ f₁[{x ∈ B | x ≤ a} ∩ B₀].
Now, it can be shown that a ≤ x, x ∈ B₀ ⇒ b ≤ f₁(x), and x ≤ a, x ∈ B₀ ⇒ f₁(x) ≤ b, so by our theorem on extending functions to homomorphisms, f₁ can be extended to [{a} ∪ B₀], contradicting the maximality of f₁.
Another interesting series of results concerns the embedding of Boolean algebras into ones that are complete. Specifically, a regular embedding f : B → B₁ is a one-to-one homomorphism with the property that, if ΣS exists in B, then Σf[S] exists in B₁ and is equal to f(ΣS), and similarly for products.
To outline one of the basic results, we shall call an ideal closed if it is an intersection of principal ideals. The set B̄ of all closed ideals forms a complete Boolean algebra under inclusion, and the map v : B → B̄, v(x) = (x] is a regular embedding. The pair (B̄, v) is called the MacNeille completion of B. B̄ is isomorphic with the Boolean algebra of regular open subsets of the Boolean space of B.

V. APPLICATIONS

A. Logic

One of the most important areas of application for Boolean algebras is logic. Indeed, many problems in logic can be reformulated in terms of Boolean algebras where a solution can be more easily obtained. For example, the fundamental compactness theorem for the (classical) propositional calculus is equivalent to the theorem that every element a ≠ 1 in a Boolean algebra is contained in a maximal ideal.
Thus, Boolean algebras can be used as models for logical systems. A striking example of this is the simplification of the independence proofs of P. J. Cohen by means of Boolean models. For example, by studying the appropriate Boolean algebra, we can prove that the axiom of choice is not a consequence of set theory, even with the continuum hypothesis.
To illustrate, we start with the set S of formulas α, β, γ, ... of the classical propositional calculus. An equivalence relation ≡ is defined by identifying α and β, provided that α → β and β → α are both derivable. A Boolean algebra is defined on the classes [α], [β], [γ], ... by the partial ordering [α] ≤ [β] if and only if α → β is derivable.
The resulting Boolean algebra is called the Lindenbaum–Tarski algebra and is in fact the free Boolean algebra with free generators [α], [β], [γ], ..., where α, β, γ are the propositional variables. The construction of the Lindenbaum–Tarski algebra for the restricted predicate calculus is similar, and it is in this context that the independence proofs can be given.

B. Switching Functions and Electronics

Switching functions are mathematical formulas that enable us to describe and design certain parts of electronic circuits. They are widely used in both communications and computer applications.
First, we shall define some notation. We use 2ⁿ to represent the set of all n-tuples of elements in 2. That is, 2² = {(0, 0), (0, 1), (1, 0), (1, 1)}, and so on. For σ ∈ 2ⁿ, we sometimes write σ = (σ(1), ..., σ(n)), so for σ = (0, 1) ∈ 2², we have σ(1) = 0 and σ(2) = 1.
A switching function is any function f : 2ⁿ → 2. For example, if n = 2, there are 16 switching functions, the values of which are given in Table II.
Explicitly, f₇ is defined by f₇(0, 0) = 0; f₇(0, 1) = 1; f₇(1, 0) = 1; f₇(1, 1) = 0.
For x ∈ 2 (that is, x = 0 or x = 1), we write x⁰ = x̄ and x¹ = x. Now for a fixed value of n, a switching function is called a complete product if it has the form f(x₁, ..., xₙ) = x₁^σ(1) ··· xₙ^σ(n), where σ ∈ 2ⁿ. For example, if n = 3, then f(x₁, x₂, x₃) = x₁x̄₂x₃ is a complete product, but f(x₁, x₂, x₃) = x̄₂x₃ is not.

TABLE II The 16 Switching Functions for n = 2

         f1  f2  f3  f4  f5  f6  f7  f8  f9  f10 f11 f12 f13 f14 f15 f16
(0, 0)   0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1
(0, 1)   0   0   1   1   0   0   1   1   0   0   1   1   0   0   1   1
(1, 0)   0   0   0   0   1   1   1   1   0   0   0   0   1   1   1   1
(1, 1)   0   0   0   0   0   0   0   0   1   1   1   1   1   1   1   1
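The columns of Table II can be generated mechanically. The following is a minimal sketch, assuming each switching function is stored as a dictionary from input tuples to 0 or 1; the function name `switching_functions` is illustrative. Function number k (counting from 1) takes its value at the j-th input from bit j of k − 1, which reproduces the column order of the table.

```python
from itertools import product

def switching_functions(n):
    """Yield every function 2^n -> 2 as a dict from input tuples to 0/1."""
    inputs = list(product((0, 1), repeat=n))   # (0,0), (0,1), (1,0), (1,1) for n = 2
    for k in range(2 ** len(inputs)):
        # Bit j of k is the value at the j-th input tuple.
        yield {sigma: (k >> j) & 1 for j, sigma in enumerate(inputs)}

tables = list(switching_functions(2))
f7 = tables[6]   # the seventh function, f_7 in Table II
print(len(tables), f7)
```

With this ordering, `f7` takes the values 0, 1, 1, 0 on (0, 0), (0, 1), (1, 0), (1, 1), as stated in the text.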

TABLE III Table for Constructing Disjunctive Normal Forms for f(0, 0) = 1, f(0, 1) = 1, f(1, 0) = 0, f(1, 1) = 1

σ        f(σ)   Complete product x₁^σ(1) x₂^σ(2)   f(σ) x₁^σ(1) x₂^σ(2)
(0, 0)   1      x̄₁x̄₂                               1 · x̄₁x̄₂
(0, 1)   1      x̄₁x₂                               1 · x̄₁x₂
(1, 0)   0      x₁x̄₂                               0 · x₁x̄₂
(1, 1)   1      x₁x₂                               1 · x₁x₂

It turns out that every switching function can be written as a sum of complete products. Indeed, if f : 2ⁿ → 2 is any switching function, it can be written in the form
$f(x_1, \dots, x_n) = \sum_{\sigma \in 2^n} f(\sigma)\, x_1^{\sigma(1)} \cdots x_n^{\sigma(n)}.$
To illustrate how to do this, suppose f(0, 0) = 1, f(0, 1) = 1, f(1, 0) = 0, and f(1, 1) = 1. Now, form the table represented by Table III. The products in the third column are formed by putting the bar over the ith factor when the ith component of σ is 0.
Next add the terms f(σ)x₁^σ(1) x₂^σ(2) to obtain f(x₁, x₂) = 1x̄₁x̄₂ + 1x̄₁x₂ + 0x₁x̄₂ + 1x₁x₂ = x̄₁x̄₂ + x̄₁x₂ + x₁x₂. This last form is called the disjunctive normal form, but f can be simplified further since x̄₁x̄₂ + x̄₁x₂ + x₁x₂ = x̄₁(x̄₂ + x₂) + x₁x₂ = x̄₁ + x₁x₂ = (x̄₁ + x₁)(x̄₁ + x₂) = x̄₁ + x₂. That is, f(x₁, x₂) = x̄₁ + x₂.
Thus, starting with the values of f, we have shown how to represent f as a switching function in disjunctive normal form. On the other hand, starting with an f defined as a function of x₁, ..., xₙ, it is easy to represent it in disjunctive normal form. Simply multiply each term by xᵢ + x̄ᵢ for each missing variable xᵢ. An example will clarify the procedure. To put x₁x̄₂ + x₃ in disjunctive normal form, write
x₁x̄₂ + x₃ = x₁x̄₂(x₃ + x̄₃) + x₃(x₁ + x̄₁)(x₂ + x̄₂)
= x₁x̄₂x₃ + x₁x̄₂x̄₃ + x₁x₂x₃ + x₁x̄₂x₃ + x̄₁x₂x₃ + x̄₁x̄₂x₃
= x₁x̄₂x₃ + x₁x̄₂x̄₃ + x₁x₂x₃ + x̄₁x₂x₃ + x̄₁x̄₂x₃
For a given n, each disjunctive normal form is a distinct function. So there are 2^(2ⁿ) distinct switching functions of n variables, all of which are uniquely given in disjunctive normal form as described above.
The application of switching functions to circuit design can be described as follows. Suppose that an electric current is flowing from A to B and x₁, x₂ represent on–off switches (Fig. 3A). If x₁ and x₂ are both off, no current will be detected at B, whereas if either x₁ or x₂ is on, current will be detected at B. The circuit is therefore represented by x₁ + x₂. Figure 3B represents x₁x₂. In Fig. 3C, the "black box" inverts the current. That is, if current is flowing at A, no current reaches B, but if current is not flowing from A, current will be detected at B.

FIGURE 3 Application of switching functions to circuit design. (A) x₁ + x₂; (B) x₁x₂; (C) x̄₁.

To see how switching circuits are used in this context we consider a "half-adder." This is an electronic circuit that performs the following binary arithmetic:

   0              0              1              1
  +0             +1             +0             +1
  --             --             --             --
   0 carry = 0    1 carry = 0    1 carry = 0    0 carry = 1

It is called a half-adder because it does not take into account any carries from previous additions.
Now, if the addends are named A and B, the sum S, and the carry C, Table IV describes the arithmetic. From our previous discussion we can write S = AB̄ + ĀB and C = AB. The design of the half-adder is shown in Fig. 4.
The expression AB̄ + ĀB is quite simple, but in designing more complicated circuits, economies can be made by algebraically simplifying the switching functions. There are several such methods. The Veitch–Karnaugh and Quine methods are examples.

C. Other Applications

Boolean algebras, especially with additional operations, have applications to nonclassical logic, measure theory,

TABLE IV Half-Adder Table

A   B   Sum   Carry
0   0   0     0
0   1   1     0
1   0   1     0
1   1   0     1
P1: GKB/LPB P2: GAE Revised Pages
Encyclopedia of Physical Science and Technology EN001D-72 May 19, 2001 13:37

Boolean Algebra 299

A AB Set Theory,” Clarendon, Oxford.

B Birkhoff, G. (1967). “Lattice Theory,” Am. Math. Soc., Providence.
AB AB Brown, F. (1990). “Boolean Reasoning,” Kluwer Academic, Boston,
B AB
A Dordrecht, London.
Delvos, F., and Schempp, W. (1989). “Boolean Methods and Interpola-
tion and Approximation,” Longman, Harlow, England.
AB
Dwinger, P. (1971). “Inroduction to Boolean Algebras,” Physica-Verlag,
FIGURE 4 Design of the half-adder. Wirzburg.
Hailperin, T. (1986). Boole’s Logic and Probability. In “Studies in
Logic and the Foundations of Mathematics” (J. Barwise, D. Kaplan,
functional analysis, and probability. A detailed summary H. J. Keisler, P., Suppes, and A. S. Troelstra, eds.), North-Holland,
of these can be found in R. Sikorski’s “Boolean Alge- Amsterdam.
bras.” A concise exposition of nonstandard analysis—via Halmos, P., and Givant, S. (1998). Logic as Algebra. In “The Dolciani
Mathematical Expositions,” Math. Assoc. of America, Washington,
Boolean algebra—can be found in A. Abian’s “Boolean
DC.
Rings.” Johnstone, P. (1983). “Stone Spaces,” Cambridge Univ. Press,
Cambridge, UK.
Monk, J. D. (1989). “Handbook of Boolean Algebras,” Vol. 1–3,
SEE ALSO THE FOLLOWING ARTICLES Elsevier, New York.
Rosser, J. (1969). “Simplified Independence Proofs,” Academic Press,
ALGEBRA, ABSTRACT • ALGEBRAIC GEOMETRY • CIR- New York.
Schneeweiss, W. (1989). “Boolean Functions with Engineering Appli-
CUIT THEORY • MATHEMATICAL LOGIC • SET THEORY
cations and Computer Programs,” Springer-Verlag, Tokyo.
Sikorski, R. (1969). “Boolean Algebras, 3rd Ed., ”Springer-Verlag,
Berlin/New York.
BIBLIOGRAPHY Stormer, H. (1990). Binary Functions and Their Applications. In “Lec-
ture notes in Economics and Mathematical Systems” (M. Beck-
Abian, A. (1976). “Boolean Rings,” Branden, Boston. mann and W. Krelle, eds.), Springer-Verlag, Berlin, Heidelberg, New
Bell, J. L. (1977). “Boolean Valued Models and Independence Proofs in York.
P1: GLQ Revised Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology En002c-76 May 17, 2001 20:27

Calculus
A. Wayne Roberts
Macalester College

I. Two Basic Problems in Calculus
II. Limits
III. Functions and Their Graphs
IV. Differentiation
V. Integration
VI. Fundamental Theorem of Calculus
VII. Infinite Series
VIII. Some Historical Notes

GLOSSARY

Antiderivative A function $F$ is called the antiderivative of the function $f$ if $F'(x) = f(x)$.
Chain rule Provides a rule for finding the derivative of $f(x)$ if the value of $f$ at $x$ is found by first finding $u = g(x)$, then $h(u)$; that is, $f(x) = h(g(x))$. In this case, if $g$ and $h$ are differentiable, then $f'(x) = h'(g(x))\,g'(x)$.
Derivative The derivative of a function $f$ is the function $f'$ defined at $x$ by
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
whenever this limit exists.
Differential When the function $f$ has a derivative $f'(x_0)$ at $x_0$, the differential is the quantity $f'(x_0)h$ where $h$ is a small number; if $y = f(x)$, it is common to represent $h$ by $dx$ and to define the differential to be $dy = f'(x_0)\,dx$.
Exponential function While $f(x) = b^x$ is, for any positive number $b$, an exponential function, the term is commonly applied to the situation in which $b = e$ where $e$ is the number 2.718 (rounded to three decimals).
Function Rule that assigns to one real number $x$ a unique real number $y$; one commonly lets a letter $f$ (or $g$, $h$, etc.) represent the rule and writes $y = f(x)$. The concept of function is used in more general settings, but always with the idea of assigning to members of one set (the domain of the function) a unique member of another set (the range of the function).
Fundamental theorem Speaking loosely, the two principal ideas of calculus both relate to a graph of a function $y = f(x)$. One is concerned with how to find the slope $f'(x)$ of a line tangent to the graph at any point $x$; the other asks how to find the area under the graph between $x = a$ and $x = b$. The fundamental theorem relates these seemingly unrelated ideas. A more technical statement of the theorem is to be found below.
Graph If a function $f$ is defined for each real number $x$ in a set, then we may set $y = f(x)$ and plot the points $(x, y)$ in the $xy$ plane to obtain a “picture” corresponding to the function. More formally, the graph of $f$ is the set $G = \{(x, f(x)) : x \text{ is in the set on which } f \text{ is defined}\}$.
Indefinite integral Another term for the antiderivative.
Infinite series A sequence $\{S_n\}$ in which each term is obtained from the preceding one by adding one term; thus $S_n = S_{n-1} + a_n$. Such a series is commonly represented by $a_1 + a_2 + \cdots$. Terms may be functions $a_n(x)$, so a series may represent a function as well as a constant.
Inflection point If a function $f$ is continuous at $x_0$, then $x_0$ is called an inflection point for $f$ if the graph changes from a curve that holds water (convex) to one that sheds water (concave) or conversely at $x_0$.
Integral Also called the definite integral of $f$ between $x = a$ and $x = b$. It may be described informally as the area under the graph of $y = f(x)$, but such a definition is subject to logical problems. A formal definition of the Riemann integral is given below.
Limit When two variables are related to each other by a clear rule; that is, when $y = f(x)$, we may ask how $y$ is affected by taking values of $x$ closer and closer to some fixed value $x_0$, or by taking values of $x$ that get larger and larger without bound. If $y$ also gets closer to some fixed $y_0$, then we say in the two cases described,
$$y_0 = \lim_{x \to x_0} f(x) \quad \text{or} \quad y_0 = \lim_{x \to \infty} f(x).$$
Natural logarithm The logarithm of a positive number $N$ to a base $b > 0$ is the exponent $r$ such that $b^r = N$. When the base $b$ is chosen to be the number $e \approx 2.7$, then the logarithm is said to be the natural logarithm of $N$.
Tangent While there is an intuitive idea of what is meant by a tangent to the graph of $y = f(x)$ at $(x_0, f(x_0))$ (it is the line that just touches the curve at the point), the logically satisfactory way to define it is to say that it is the line through the specified point that has a slope equal to $f'(x_0)$.

CALCULUS is the branch of mathematics that provides computational (or calculating) rules for dealing with problems involving infinite processes. Approximating the circumference of a circle by finding, for larger and larger values of $n$, the perimeter of a regular inscribed $n$-sided polygon illustrates an infinite process. Since we cannot hope to complete an infinite process, our only hope is to establish that the longer a certain process is continued, the closer it comes to some limiting position. Thus, calculus is sometimes described as the study of limits, or defined as the discipline that provides algorithms and standard procedures for calculating limits.

I. TWO BASIC PROBLEMS IN CALCULUS

Problem 1. If to each real number $x$ we assign a second number $y = x^2$, the resulting pairs $(x, y)$ may be used as coordinates to locate points relative to a set of perpendicular axes (Fig. 1). The resulting curve is a parabola, a curve studied by the Greeks and frequently encountered in applications of mathematics.

FIGURE 1 The curve is $y = x^2$.

The problem we wish to consider is that of finding the shaded area OPQ under the parabola, above the $x$ axis from 0 to 1. Archimedes (287–212 B.C.) could have solved this problem using what he called the method of exhaustion.

FIGURE 2 The curve is $y = x^2$.

The line of attack we shall use is suggested by Fig. 2, from which we see that the desired area is

(a) less than the area of rectangle $A_1$ given by
$$R_1 = 1(1)^2 = 1$$
(b) less than the sum of the areas of rectangles $B_1$ and $B_2$ given by
$$R_2 = \tfrac{1}{2}\left(\tfrac{1}{2}\right)^2 + \tfrac{1}{2}\left(\tfrac{2}{2}\right)^2$$
(c) less than the sum of the areas of rectangles $C_1$, $C_2$, and $C_3$ given by
$$R_3 = \tfrac{1}{3}\left(\tfrac{1}{3}\right)^2 + \tfrac{1}{3}\left(\tfrac{2}{3}\right)^2 + \tfrac{1}{3}\left(\tfrac{3}{3}\right)^2$$

Proceeding in this way, we see that after $n$ steps, the desired area will be less than
$$R_n = \frac{1}{n}\left(\frac{1}{n}\right)^2 + \frac{1}{n}\left(\frac{2}{n}\right)^2 + \cdots + \frac{1}{n}\left(\frac{n}{n}\right)^2 \tag{1}$$
but intuitively, at least, this process should bring us closer
and closer to the desired answer.
This is not exactly Archimedes’ method of exhaustion,
but it is clearly an exhausting process to carry out as n
increases. It is, in fact, an infinite process.
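
The upper sums $R_n$ of Eq. (1) are tedious by hand but easy to tabulate by machine. The following Python sketch is offered only as an illustration; the function name `upper_sum` is our own and does not come from the article. It computes $R_n$ exactly with rational arithmetic, so that the slow approach toward a limiting value can be observed without rounding error.

```python
from fractions import Fraction


def upper_sum(n):
    """Return R_n: the total area of n rectangles of width 1/n
    whose heights are (k/n)^2 for k = 1, ..., n."""
    width = Fraction(1, n)
    return sum(width * Fraction(k, n) ** 2 for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 100, 1000):
        r = upper_sum(n)
        print(f"n = {n:5d}   R_n = {float(r):.6f}")
```

Each $R_n$ overestimates the area under the parabola, and the tabulated values decrease as $n$ grows, which is the behavior the text anticipates.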

Problem 2. Our second problem again makes reference to the graph of $y = x^2$. This time we seek the slope of the line tangent to the graph at $P(1, 1)$. That is, with reference to Fig. 3, we wish to know the ratio of the “rise” BT to the “run” PB (or, stated another way, we wish to know the tangent of the angle that PT makes with the positive $x$ axis). Our method, certainly known to Fermat (1601–1665), is again suggested by a sequence of pictures.

FIGURE 3 The curve is $y = x^2$.

From the graphs in Fig. 4, we see in turn

(a) $\text{slope } PS_1 = \dfrac{B_1 S_1}{PB_1} = \dfrac{(1 + 1)^2 - 1^2}{1} = 3$

(b) $\text{slope } PS_2 = \dfrac{B_2 S_2}{PB_2} = \dfrac{(1 + 1/2)^2 - 1^2}{1/2} = \dfrac{5}{2}$

(c) $\text{slope } PS_3 = \dfrac{B_3 S_3}{PB_3} = \dfrac{(1 + 1/3)^2 - 1^2}{1/3} = \dfrac{7}{3}$

FIGURE 4 The curve is $y = x^2$.

And again there is a pattern that enables us to see where we will be after $n$ steps:
$$\text{slope } PS_n = \frac{B_n S_n}{PB_n} = \frac{(1 + 1/n)^2 - 1^2}{1/n} \tag{2}$$
The slope of the desired line is evidently the limiting value of this infinite process.

Our interest in both Eqs. (1) and (2) focuses on what happens as $n$ increases indefinitely. Using the symbol $\to \infty$ to represent the idea of getting large without bound, mathematicians summarize their interest in Eq. (1) by asking for
$$\lim_{n \to \infty} R_n = \lim_{n \to \infty} \left[ \frac{1}{n}\left(\frac{1}{n}\right)^2 + \frac{1}{n}\left(\frac{2}{n}\right)^2 + \cdots + \frac{1}{n}\left(\frac{n}{n}\right)^2 \right]$$
and in Eq. (2) by asking for
$$\lim_{n \to \infty} \frac{B_n S_n}{PB_n} = \lim_{n \to \infty} \frac{(1 + 1/n)^2 - 1^2}{1/n}$$

II. LIMITS

Calculus is sometimes said to be the study of limits. That is because the nature of an infinite process is that we cannot carry it to completion. We must instead make a careful analysis to see if there is some limiting position toward which things are moving.

A. Algebraic Expressions

Limits of certain algebraic expressions can be found by the use of simplifying formulas. Archimedes was greatly aided in his study of the area under a parabola because he had discovered
$$1^2 + 2^2 + \cdots + n^2 = \tfrac{1}{6}\, n(n + 1)(2n + 1) \tag{3}$$
Sometimes the limit of a process is evident from a very minor change of form. The limit of the quotient (2) above is easily seen if we write it in the form
$$\lim_{n \to \infty} \frac{B_n S_n}{PB_n} = \lim_{n \to \infty} \frac{(1 + 1/n)^2 - 1^2}{1/n} = \lim_{n \to \infty} \frac{1 + 2/n + 1/n^2 - 1}{1/n} = \lim_{n \to \infty} (2 + 1/n) = 2 \tag{4}$$

Not every question about limits is concerned with the consequence of a variable growing large without bound. The student of calculus will very soon encounter problems in which a variable approaches zero. A typical question asks what happens to the quotient
$$\frac{\sqrt{x + h} - \sqrt{x}}{h}$$
as $h$ gets closer and closer to zero. Since both the numerator and the denominator approach zero, the effect on the quotient is not obvious. In this case, a little algebraic trickery helps. One notes that
$$\frac{\sqrt{x + h} - \sqrt{x}}{h} \cdot \frac{\sqrt{x + h} + \sqrt{x}}{\sqrt{x + h} + \sqrt{x}} = \frac{(x + h) - x}{h\left(\sqrt{x + h} + \sqrt{x}\right)} = \frac{1}{\sqrt{x + h} + \sqrt{x}}$$
In this form, it is clear that as $h$ gets very small, the quotient approaches $1/(2\sqrt{x})$; we write
$$\lim_{h \to 0} \frac{\sqrt{x + h} - \sqrt{x}}{h} = \frac{1}{2\sqrt{x}} \tag{5}$$

B. The Natural Base

An investment yielding 8% per year will, for an investment of $A$ dollars, return $A + 0.08A = A(1 + 0.08)$. If the investment is one that adds interest semiannually, then the investment will be worth $A + \tfrac{1}{2}(0.08)A = A(1 + 0.08/2) = S$ after 6 months, and by the end of the year it will have grown to $S + \tfrac{1}{2}(0.08)S = S(1 + 0.08/2) = A(1 + 0.08/2)^2 = A(1.0816)$.

If interest is added every month (taken for convenience to be $\tfrac{1}{12}$ of a year), the investment at the end of a year will have grown to
$$A\left(1 + \frac{0.08}{12}\right)^{12} = A(1.0830)$$
There is a definite advantage to the investor in having the interest added as frequently as possible.

Jakob Bernoulli (1654–1705) once posed the following problem. An investment of $A$ dollars at $i$% that adds interest at the end of each of $n$ periods will yield
$$A(1 + i/n)^n$$
Will this sum grow infinitely large if the number $n$ of periods is increased to the point where interest is apparently being added instantaneously?

This apparently frivolous problem is one that involves an infinite process. It is also a problem that, as we shall see below, has profound implications. In order to free ourselves from considerations of a particular choice of $i$ and to focus on the principal question of what happens as $n$ increases, let us agree to set $n = mi$:
$$A(1 + i/n)^n = A\left[(1 + 1/m)^m\right]^i$$
Now study what happens as $m$ increases to the expression
$$e_m = \left(1 + \frac{1}{m}\right)^m \tag{6}$$
Beginners will be excused if they make the common guess that as $m$ gets increasingly large, $e_m$ gets increasingly close to 1. Even in the face of early evidence provided by the computations
$$e_1 = (1 + 1) = 2.00$$
$$e_2 = \left(1 + \tfrac{1}{2}\right)^2 = \tfrac{9}{4} = 2.25$$
$$e_3 = \left(1 + \tfrac{1}{3}\right)^3 = \tfrac{64}{27} = 2.37$$
beginners find it hard to believe that as $m$ gets larger and larger, $e_m$ will eventually get larger than 2.71, but that it will never get as large as 2.72. They are more incredulous when told that the number that $e_m$ approaches rivals $\pi$ as one of the most important constants in mathematics, that it has been given the name $e$ as its very own, and that accurate to 10 decimal places, $e = 2.7182818285$.

If in Eq. (6) we replace $1/m$ by $h$, we get an expression often taken as the definition of $e$:
$$\lim_{h \to 0} (1 + h)^{1/h} = e$$
From here it is a short step to a second limit to which we will refer below:
$$\lim_{h \to 0} \frac{e^h - 1}{h} = 1 \tag{7}$$

C. Trigonometric Expressions

A final tricky but important limit must be mentioned. Students in elementary school learn to measure angles in degrees. When the angle $x$ is so measured, then one finds that
$$\frac{\sin 5^\circ}{5} = 0.017431 \qquad \frac{\sin 3^\circ}{3} = 0.017445 \qquad \frac{\sin 1^\circ}{1} = 0.017452$$
and that in general
$$\lim_{x \to 0} \frac{\sin x}{x} = \frac{\pi}{180} = 0.017453$$
This awkward number would permeate computations made in calculus, were it not for a uniformly adopted convention. Radians, rather than degrees, are used to measure angles. The central angle of a circle that intercepts an arc of length equal to the radius (Fig. 5) is said to have a measure of one radian. The familiar formula for the circumference of a circle tells us that there will be $2\pi$ radians in a full circle of 360°, hence that $\pi$ radians = 180°. Using radians to measure the angle $x$, it turns out that
$$\lim_{x \to 0} \frac{\sin x}{x} = 1 \tag{8}$$
For this reason, radians are always used to measure angles in calculus and in engineering applications that depend on calculus.

FIGURE 5 Radius of the circle is arbitrary. The angle is 1 radian ≈ 57°.

III. FUNCTIONS AND THEIR GRAPHS

When the value of one variable $y$ is known to depend on a second variable $x$, it is common in ordinary discourse as well as in mathematics to say that $y$ is a function of $x$. An important difference must be noted, however. General usage is commonly imprecise about the exact nature of the relationship (corporate income is a function of interest rates), but mathematicians use the term to imply a precise relationship. For a given value of $x$, a unique value of $y$ is determined. Mathematicians abbreviate this relationship by writing $y = f(x)$.

Just as a corporation may draw a graph showing how its income has varied with changing interest rates, so a mathematician may draw a graph showing how $y$ varies with changes in $x$. There are shortcuts to drawing such graphs, many of them learned in the study of analytic geometry, which is taught prior to or along with calculus. But even without shortcuts, the method of calculating a table of values and then plotting points can be used to sketch a useful graph. This latter process is illustrated in Table I and Fig. 6 for the function $y = f(x) = 2\sqrt{x} - x^2$.

TABLE I Table of Values

| $x$ | $y = 2\sqrt{x} - x^2$ |
|-----|------------------------|
| 0 | 0 |
| $\tfrac{1}{4}$ | $2\left(\tfrac{1}{2}\right) - \tfrac{1}{16} = \tfrac{15}{16}$ |
| $\tfrac{1}{2}$ | $2\left(\tfrac{1}{\sqrt{2}}\right) - \tfrac{1}{4} \approx 1.2$ |
| 1 | $2\sqrt{1} - 1 = 1$ |

FIGURE 6 This is the graph of $y = 2\sqrt{x} - x^2$, drawn by plotting points from Table I.

IV. DIFFERENTIATION

A. The Derivative Defined

Suppose that the distance $y$ that an automobile has travelled in $x$ hours is given by $y = f(x)$. We pose the following question. What does the speedometer read when $x = 2$?

As a first step, we might find the average speed during the third hour by computing
$$\frac{(\text{distance at } x = 3) - (\text{distance at } x = 2)}{\text{time elapsed}} = \frac{f(3) - f(2)}{3 - 2}$$
It is clear, of course, that the average speed over the third hour hardly tells us the speedometer reading at $x = 2$. To get even a reasonable approximation to the speedometer reading at $x = 2$, we would want an average over a much smaller time interval near $x = 2$. Let $h$ represent some fraction of an hour after $x = 2$. The average speed over this time interval would be
$$\frac{(\text{distance at } x = 2 + h) - (\text{distance at } x = 2)}{(2 + h) - 2} = \frac{f(2 + h) - f(2)}{h}$$
No information would be gained by setting $h = 0$, but the smaller the positive number $h$ is taken, the better will be our estimate of the speed at $x = 2$. And more generally, the speedometer reading at time $x$ will be approximated by
$$\text{speed at time } x = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{9}$$

Consider now a situation in which water runs at 8 in.³/sec into a cone-shaped container that is 12 in. across at the top and 12 in. deep (Fig. 7). The height $y$ of the water rises quickly at first, less quickly as the level rises. In fact, using the formula for the volume of a cone, we can actually determine that the height $y$ at time $x$ will be $y = 2\sqrt[3]{12x/\pi}$, but for our purposes here it suffices to notice that
$$y = f(x)$$

FIGURE 7 Water at height $y$ in cone.

We now ask how fast, in inches per second, $y$ is increasing when $x = 2$. One way to get started is to find the average rate at which $y$ increased during the third second.
$$\frac{(\text{height at } x = 3) - (\text{height at } x = 2)}{\text{elapsed time}} = \frac{f(3) - f(2)}{3 - 2}$$
Since the rate is slowing down as time passes, however, it is clear that we would get a better approximation to the rate at $x = 2$ if we used a smaller time interval near $x = 2$. Again we let $h$ represent some short interval of time after $x = 2$. Again we see that the desired rate at $x = 2$ will be given by the limit of
$$\frac{(\text{height at } x = 2 + h) - (\text{height at } x = 2)}{h} = \frac{f(2 + h) - f(2)}{h}$$
And again for an arbitrary time $x$, we see that the rate at which height increases is
$$\lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{10}$$

Finally, look again at our introductory Problem 2. Using the notation $f(x) = x^2$ now available to us, we see that Eq. (2) may be written in the form
$$\text{slope } PS_n = \frac{f(1 + 1/n) - f(1)}{1/n}$$
We find the slope of the line tangent to the graph at $x = 1$ by taking the limit of this expression as $n \to \infty$. Similar reasoning at an arbitrary point $x$, making the substitution $h = 1/n$, would lead us to the conclusion that the slope of a line tangent to the graph at $x$ is given by
$$\text{slope} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{11}$$
whenever this limit exists.

The expression that we have encountered in Eqs. (9), (10), and (11) turns up time and again in the applications of mathematics. For a given function $f$, the derived function or derivative $f'$ is the function defined at $x$ by
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
It is important to recognize that the meaning of the derivative is not restricted to just one interpretation. The three important ones that we have discussed are summarized in Table II. Thus, the computational techniques developed by Fermat and other mathematical pioneers have turned out to be of central importance in modern mathematical analysis.

TABLE II Meaning of the Derivative

| Given meaning of $f(x)$ | Interpretation of $f'(x)$ |
|-------------------------|---------------------------|
| Distance: $f(x)$ is the distance a moving object travels from some fixed starting point in time $x$. | Velocity: $f'(x)$ gives the velocity at time $x$. |
| Amount: $f(x)$ gives the measure of some changeable quantity (volume of a melting ice cube, length of a shadow) at time $x$. | Rate of change: $f'(x)$ gives the measure in (units)/(time interval) of the rate at which the quantity is changing. |
| Graph: The coordinates $(x, y)$ of points on a graph in the plane are related by $y = f(x)$. | Slope of a tangent line: the line tangent to the graph at $(x, y)$ has slope $f'(x)$. |

B. Some Derivatives Calculated

We begin with a calculation that can be checked against the answer of 2 obtained in Eq. (4) as the slope of
the line tangent to the graph of $y = x^2$ at $x = 1$. For $f(x) = x^2$, $f(x + h) = (x + h)^2 = x^2 + 2xh + h^2$, so
$$\frac{f(x + h) - f(x)}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{h(2x + h)}{h}$$
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} (2x + h) = 2x \tag{12}$$
The value of $f'(1) = 2$, as predicted.

For the function $f(x) = x^3$, we need to use the fact that $f(x + h) = (x + h)^3 = x^3 + 3x^2 h + 3xh^2 + h^3$. Then
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} (3x^2 + 3xh + h^2) = 3x^2 \tag{13}$$
For the function $f(x) = \sqrt{x}$, we make use of the previously noted limit in Eq. (5) to write
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{\sqrt{x + h} - \sqrt{x}}{h} = \frac{1}{2\sqrt{x}} \tag{14}$$
Equations (12), (13), and (14) illustrate a general result that Newton discovered by using the binomial theorem (which he also discovered):
$$\text{If } f(x) = x^r \text{ where } r \text{ is any real number, then } f'(x) = r x^{r-1}. \tag{15}$$
Two special cases, both easy to prove directly, should be noted here because we need them later. When $r = 1$, the formula tells us that the derivative of $f(x) = x$ is $f'(x) = 1$; and any constant function $f(x) = c$ may be thought of as $f(x) = cx^0$, so that the formula gives the correct result, $f'(x) = 0$.

The exponential function $E(x) = e^x$ is important in mathematics and important to an application we shall mention below. Let us use the property of exponents that says $e^{x+h} = e^x e^h$ and Eq. (7) to find
$$E'(x) = \lim_{h \to 0} \frac{e^{x+h} - e^x}{h} = \lim_{h \to 0} e^x\, \frac{e^h - 1}{h} = e^x \tag{16}$$
Functions of the form $f(x) = ke^x$ are the only functions that satisfy $f'(x) = f(x)$, and this fact accounts for the great importance of the exponential function in mathematical applications.

Finally, we note that differentiation is a linear operation. This means that if $f$ and $g$ are differentiable functions and $r$ and $s$ are real numbers, then the function $h$ defined by $h(x) = r f(x) + s g(x)$ is differentiable, and $h'(x) = r f'(x) + s g'(x)$.

C. The Derivative and Graphing

The rule just stated, together with our ability to differentiate both $\sqrt{x}$ and $x^2$, enables us to find the derivative of $f(x) = 2\sqrt{x} - x^2$, which is graphed in Fig. 6:
$$f'(x) = \frac{2}{2\sqrt{x}} - 2x$$
This means that the slope of the tangent line to the graph at $x = \tfrac{1}{4}$ is $f'(\tfrac{1}{4}) = 2 - \tfrac{1}{2} = \tfrac{3}{2}$; at $x = 1$, it is $f'(1) = 1 - 2 = -1$. Of more interest is the fact that the continuous function $f'(x)$, in passing from $\tfrac{3}{2}$ at $x = \tfrac{1}{4}$ to $-1$ at $x = 1$, must be 0 somewhere between $x = \tfrac{1}{4}$ and $x = 1$. Where would $f'(x) = 0$? Evidently this occurs at the high point of the graph, and herein lies another great source of applications. The maximum or minimum of a function can often be found by finding the value of $x$ for which $f'(x) = 0$. For the example at hand, setting
$$\frac{1}{\sqrt{x}} - 2x = 0$$
gives
$$\frac{1}{\sqrt{x}} = 2x$$
Squaring both sides and multiplying through by $x$ gives $1 = 4x^3$, or $x = \sqrt[3]{1/4} \approx 0.63$. That is not a value that would be easily guessed by plotting points. The information now in hand enables us to draw a much better graph of $y = 2\sqrt{x} - x^2$ (see Fig. 8).

The connection between the values of $f'(x)$ and the slope of lines tangent to the graph of $y = f(x)$ enables us to prove the so-called mean value theorem. This theorem says that if two points on the graph of $y = f(x)$, say $(a, f(a))$ and $(b, f(b))$, are connected by a line segment, then at some point $c$ located between $x = a$ and $x = b$, there will be a tangent to the graph that is parallel to the line (Fig. 9). The formal statement of this theorem is as follows.

Mean value theorem. Let $f$ be differentiable on an interval that includes $a$ and $b$ in its interior. Then there is a point $c$ between $a$ and $b$ such that
$$\frac{f(b) - f(a)}{b - a} = f'(c)$$
Suppose $f'$ is positive on a certain interval. Then if $x_1 < x_2$, the mean value theorem tells us that
$$f(x_2) - f(x_1) = f'(c)(x_2 - x_1) > 0$$
We conclude (Fig. 10) that $f$ must be increasing on the interval. Similar reasoning leads to the conclusion that if $f'(x) < 0$ on an interval, $f$ is decreasing on the interval.

If $f$ is differentiable at a point $x = a$ where its graph achieves a relative high point, then since $f$ switches from an increasing to a decreasing function at that point, $f'(x)$ switches from positive to negative at $x = a$. We conclude that $f'(a) = 0$ (Fig. 10). It can similarly be shown that a differentiable function will have a derivative of 0 at a relative low point.

FIGURE 8 The maximum occurs at (0.63, 1.19). This is an improvement on Fig. 6 illustrating that calculus provides information enabling us to improve our graph.

FIGURE 9 The tangent line is parallel to the secant line.

FIGURE 10 The function increases when $f'(x) > 0$, decreases when $f'(x) < 0$.

An analysis of the graph of $y = S(x) = \sin x$ can help us anticipate what $S'(x)$ might be. In Fig. 11, note the tangent segments sketched at the high point H, low point L, and the points P, Q, and R where the graph of $y = \sin x$ crosses the $x$ axis. The slopes at H and L are clearly 0, meaning that $S'(\pi/2) = S'(3\pi/2) = 0$. We have plotted these points on the second set of axes in anticipation of graphing $y = S'(x)$.

FIGURE 11 The top curve is a sine wave.

We also note from the symmetry of the graph of $y = \sin x$ that if the slope at P is $k$, then the slope at Q will be $-k$ and the slope at R will be $k$ again. The appropriate points are again plotted on the second set of axes.

A person familiar with trigonometric functions who thinks about passing a curve through the plotted points might well guess that $S'(x) = \cos x$. Verification of this result depends on use of the addition formulas for the sine function and, as has been mentioned previously, on the use of radian measure, which allows the use of Eq. (8).
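
The guess that the derivative of the sine is the cosine can be checked numerically before it is proved. The Python sketch below is an illustration only: the helper names `difference_quotient` and `sin_degrees` and the step size `h = 1e-6` are our own assumptions, not part of the article. It evaluates the quotient from the definition of the derivative at the points H, L, and the axis crossings, and compares it with the cosine, with angles measured in radians.

```python
import math


def difference_quotient(f, x, h=1e-6):
    """Approximate f'(x) by (f(x + h) - f(x)) / h for a small step h."""
    return (f(x + h) - f(x)) / h


def sin_degrees(x):
    """Sine of an angle given in degrees, for comparison with radian measure."""
    return math.sin(math.radians(x))


if __name__ == "__main__":
    for x in (0.0, math.pi / 2, math.pi, 3 * math.pi / 2):
        approx = difference_quotient(math.sin, x)
        print(f"x = {x:.4f}   quotient = {approx: .6f}   cos x = {math.cos(x): .6f}")
    # Measured in degrees, the slope at 0 is pi/180 rather than 1.
    print(difference_quotient(sin_degrees, 0.0), math.pi / 180)
```

The last line illustrates why radian measure matters: with degrees, the slope at 0 is the awkward constant $\pi/180 \approx 0.017453$ rather than 1.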
D. Four Important Applications

1. Falling Bodies

We have seen that if $y = f(x)$ describes the distance $y$ that a body moves in time $x$, then $f'(x)$ gives the velocity at time $x$. Let us take this another step. Suppose the velocity is changing with time. That rate of change, according to the second principle in Table II, will be described by the derivative of $f'(x)$, designated by $f''(x)$. The change of velocity with respect to time is called acceleration; acceleration at time $x$ is $f''(x)$.

When an object is dropped, it picks up speed as it falls. This is called the acceleration due to gravity. There is speculation as to how Galileo (1564–1642), with the measuring instruments available to him, came to believe that the acceleration due to gravity is constant. But he was right; the acceleration is designated by $g$ and is equal to 32.2 ft/sec².

Now put the two ideas together. The acceleration is $f''(x)$; the acceleration is constant; $f''(x) = -g$. We use $-g$ because the direction is down. What function, if differentiated, gives the constant $-g$? We conclude that $f'(x) = -gx + c$. The $c$ is included because the derivative of a constant is 0, so we can only be sure of $f'(x)$ up to a constant. Since $f'(x)$ is the velocity at time $x$, and since $f'(0) = c$, the constant $c$ is usually designated by $v_0$, which stands for the initial velocity, allowing the possibility that the object was thrown instead of dropped.

The function $f'(x) = -gx + v_0$ is called the antiderivative of $f''(x) = -g$. Now to find the distance $y$ that the object has fallen, we seek the antiderivative of $f'(x)$; what function was differentiated to give $f'(x) = -gx + v_0$? Contemplation of the derivatives computed above leads to the conclusion that
$$y = f(x) = -\tfrac{1}{2} g x^2 + v_0 x + y_0$$
This time the constant $y_0$ represents the initial height from which the object was dropped or thrown.

2. Growth and Decay

Imagine an island nation in which the population $p$ is unaffected by immigration into or out of the country. Let the size of the population at time $x$ be $p = f(x)$. Then according to the second principle of Table II, $f'(x)$ is the rate of growth of the population.

Let us make the reasonable assumption that the rate at which the population grows at time $x$ is proportional to the size of the population at that time. Translated into the language of calculus, this says
$$f'(x) = k f(x) \tag{17}$$
This is a differential equation. We are asked to find a function, the derivative of which is a constant multiple of the function with which we started. If we recall from Eq. (16) that $E'(x) = E(x)$ for $E(x) = e^x$, we can quite easily convince ourselves that a solution to (17) is
$$p = f(x) = e^{kx} \tag{18}$$
This explains why it is said that population grows exponentially.

We need not be talking about people on an island. We might be talking about bacteria growing in a culture, or ice melting in a lake. Anytime the general principle is that growth (or decay) is proportional to the amount present, Eq. (17), perhaps with negative $k$, describes the action and Eq. (18) provides a solution.

3. Maxima and Minima

The usefulness in practical engineering and scientific problems of our ability to find a high or low point on a graph can be hinted at with the following simple problem. Suppose that a box is to be made from a 3-ft-square piece of sheet metal by cutting squares from each corner, then folding the edges up (see Fig. 12) and welding the seams along the corners.

FIGURE 12 Construction of a box.

If the edges of the removed squares are $x$ in length, the volume of the box is given by the function
$$y = f(x) = x(3 - 2x)^2 = 4x^3 - 12x^2 + 9x$$
which is graphed in Fig. 13. The derivative is
$$f'(x) = 12x^2 - 24x + 9 = 3(2x - 1)(2x - 3)$$
from which we see that with $x$ restricted by the nature of the problem to be between 0 and $\tfrac{3}{2}$, the maximum volume is obtained by choosing $x = \tfrac{1}{2}$.
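
The conclusion of the box problem can be confirmed by brute force. The following Python sketch is only an illustration; the function names `volume` and `volume_derivative` and the grid of 1500 steps are our own choices, not part of the article. It samples the admissible cut sizes between 0 and 3/2 ft, picks the one giving the largest volume, and checks that the derivative vanishes at both roots of $3(2x - 1)(2x - 3)$.

```python
def volume(x):
    """Volume of the box when squares of side x are cut from a 3-ft-square sheet."""
    return x * (3 - 2 * x) ** 2


def volume_derivative(x):
    """Derivative 12x^2 - 24x + 9 of the volume function."""
    return 12 * x ** 2 - 24 * x + 9


if __name__ == "__main__":
    # Sample the admissible cuts 0 <= x <= 3/2 on a fine grid
    # (the grid size is an arbitrary choice for this sketch).
    steps = 1500
    grid = [1.5 * k / steps for k in range(steps + 1)]
    best = max(grid, key=volume)
    print(f"best x on grid = {best:.3f}, volume = {volume(best):.3f}")
    print(f"f'(1/2) = {volume_derivative(0.5)}, f'(3/2) = {volume_derivative(1.5)}")
```

The derivative is zero at both $x = \tfrac{1}{2}$ and $x = \tfrac{3}{2}$, but the second value leaves no material for the base of the box, so only the first gives the maximum.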
P1: GLQ Revised Pages
Encyclopedia of Physical Science and Technology En002c-76 May 17, 2001 20:27

326 Calculus

FIGURE 13 Maximum occurs at $(\tfrac{1}{2}, 2)$.

4. Approximation

Suppose we seek a quick estimate for $\sqrt{9.54}$. We might proceed as follows. The function $f(x) = \sqrt{x}$ has, according to Eq. (14), a derivative of $f'(x) = 1/(2\sqrt{x})$. The line tangent to the graph of $y = \sqrt{x}$ at x = 9 has a slope of $f'(9) = \tfrac{1}{6}$ (Fig. 14). The equation of this line is

$$y - 3 = \tfrac{1}{6}(x - 9).$$

When x is a little larger than 9, it is clear that the corresponding value of y on the tangent line is going to be larger than the value of y given by $y = \sqrt{x}$. But if x is not too much larger than 9, the two values will be approximately equal. In fact, when x = 9.54, the two values are

$$y = 3 + \tfrac{1}{6}(9.54 - 9) = 3.09$$

$$y = \sqrt{9.54} \approx 3.08869$$

This gives us a method of approximating f(x) near a value $x_0$ where $f(x_0)$ is easily calculated. We use

$$y = f(x) \approx f(x_0) + f'(x_0)(x - x_0)$$

There is a school of thought that says that the derivative is best understood as a tool that enables us to approximate f(x) in a neighborhood of $x_0$ with a linear function.

V. INTEGRATION

A. The Integral Defined

We introduced our discussion of differentiation with a problem about an automobile. We shall do the same in introducing integration.

The acceleration of an automobile is often judged by how fast it can go from a standing position to a speed of 60 mph (88 ft/sec). Consider a car that can do this in 10 sec. Its velocity υ is a function of x, the number of seconds since it started: υ = g(x), where g(0) = 0 and g(10) = 88. The question we pose is this. How far will this car have traveled in the 10-sec test?

Distance equals rate times time, but our problem is that the rate keeps changing. We must therefore approximate our answer. One way to approach the problem is to subdivide the time span into n intervals, to pick a time $t_i$ between each $x_{i-1}$ and $x_i$, and use $g(t_i)$ as the approximate speed during that interval. Then the approximate distance traveled during the time from $x_{i-1}$ to $x_i$ would be $g(t_i)(x_i - x_{i-1})$, and the total distance traveled would be approximated by

$$\text{distance} = g(t_1)(x_1 - x_0) + g(t_2)(x_2 - x_1) + \cdots + g(t_n)(x_n - x_{n-1}) \qquad (19)$$

Intuition tells us that smaller subintervals will result in better approximations to the total distance traveled.

We turn our attention now to the subject of work. Physicists define work done in moving an object to be the product of the force exerted in the direction of motion and the distance moved. If a constant force of 15 lb is exerted in pushing an object 20 ft, then it is said that 20 × 15 = 300 ft · lb of work has been done.

Anyone who has pushed an object along a floor knows, however, that it takes more force to get the object moving than to keep it moving. In this and other realistic situations (stretching a spring), the force is likely to change as the distance from the starting point changes. We say that the force F is a function of the distance x: F = f(x).

How, when F = f(x), shall we find the work done in moving from x = a to x = b? A sensible approximation might be found in this way: Subdivide the segment from a to b into n subintervals. In each subinterval $x_{i-1}$ to $x_i$, choose a point $t_i$ and use $f(t_i)$ as an approximation of the force exerted over the subinterval (Fig. 15).
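
The subdivision idea behind Eq. (19) is easy to try numerically. The sketch below assumes a hypothetical velocity profile g(x) = 8.8x ft/sec (uniform acceleration, chosen only because it satisfies g(0) = 0 and g(10) = 88; the text does not specify g) and forms the sum over equal subintervals with left, right, or midpoint tags. Under that assumption the exact distance is 440 ft, and the left and right sums close in on it as n grows.

```python
def g(x):
    """Hypothetical velocity (ft/sec) at time x; an assumption, not from the text."""
    return 8.8 * x

def riemann_sum(func, a, b, n, tag="right"):
    """Sum func(t_i) * (x_i - x_{i-1}) over n equal subintervals of [a, b]."""
    width = (b - a) / n
    total = 0.0
    for i in range(1, n + 1):
        left, right = a + (i - 1) * width, a + i * width
        if tag == "left":
            t = left
        elif tag == "right":
            t = right
        else:  # any other value selects the midpoint
            t = (left + right) / 2
        total += func(t) * (right - left)
    return total

for n in (10, 100, 1000):
    low = riemann_sum(g, 0.0, 10.0, n, "left")
    high = riemann_sum(g, 0.0, 10.0, n, "right")
    print(n, low, high)
```

The same routine applies to a work computation once a force function is passed in place of `g`.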
FIGURE 14 The curve is $y = \sqrt{x}$. The line is tangent at (9, 3). Its slope is $\tfrac{1}{6}$.

The total work done would then be approximated by the sum

$$f(t_1)(x_1 - x_0) + f(t_2)(x_2 - x_1) + \cdots + f(t_n)(x_n - x_{n-1}) \qquad (20)$$

Intuition suggests that better approximations can be obtained by taking more intervals of shorter length.

FIGURE 15 Total work done approximation.

As a final motivating example, we direct attention again to problem 1, the problem of finding the area under the parabolic graph of $y = x^2$ from x = 0 to x = 1. If we set $f(x) = x^2$, then, corresponding to a partition of the interval into n segments, the area can be approximated (Fig. 16) by the sum

$$f(t_1)(x_1 - x_0) + f(t_2)(x_2 - x_1) + \cdots + f(t_n)(x_n - x_{n-1}) \qquad (21)$$

where $t_i$ is chosen to satisfy $x_{i-1} \le t_i \le x_i$. Figure 2 shows the areas obtained if we choose n = 1, 2, and 3, respectively, if we choose intervals of equal length and if in every case we choose $t_i = x_i$, the right-hand endpoint. If we choose n intervals of equal length so that $x_i - x_{i-1} = 1/n$ for every i, and if we choose $t_i = x_i = i/n$, then Eq. (21) becomes the sum $R_n$ given in Eq. (1).

When we write Eq. (1) in the more general form of Eq. (21) and then compare Eq. (21) with Eqs. (19) and (20), we see that the problem Archimedes considered of finding the area under a parabola leads to a computation that comes up in other problems as well. There are, in fact, problems in virtually every area of engineering and science that lead to sums like Eq. (21) above. They are called Riemann sums.

Let us state things formally. A partition of the interval from a to b into n segments is a set P of points:

$$P = \{a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b\}$$

In each subinterval we select a tag $t_i$: $x_{i-1} \le t_i \le x_i$. Nothing in the theory says that the intervals must be of equal length, or that the $t_i$ must all be chosen in a certain way, a fact we emphasized in Fig. 16. The length of the longest subinterval is called the gauge of P, and this number is designated by |P|. It is to be noted that when we require |P| to be less than some certain g, then every subinterval of P has length less than g. If f is defined for all x between a and b, then corresponding to a choice of a partition P and a selection of tags $\{t_i\}$, we can form the Riemann sum

$$R(f, P, \{t_i\}) = f(t_1)(x_1 - x_0) + f(t_2)(x_2 - x_1) + \cdots + f(t_n)(x_n - x_{n-1})$$

For a large class of functions, it happens that by specifying a small enough number g, we can guarantee that if |P| < g, then the sum $R(f, P, \{t_i\})$ will be very close to some fixed number, and this will be true no matter how the tags $\{t_i\}$ are chosen in the subintervals. Such a function f is said to be Riemann integrable on the interval from a to b. The value around which the sums congregate is designated by an elongated S, stretched between an a and a b, and set before the functional symbol f:

$$\int_a^b f$$

This number is called the integral of f.

Which functions are integrable? Numerous answers can be given. Monotone increasing functions, continuous functions, and bounded functions continuous at all but a finite number of points are all integrable. Suffice it to say that most functions that turn up in applied mathematics are integrable. This is also a good place to say that following the work of Lebesgue (1875–1941), the integral we have defined has been generalized in numerous ways so as to enlarge the class of integrable functions.
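
To illustrate the definition, the sketch below builds partitions of [0, 1] with unequal subintervals and arbitrarily placed tags, and evaluates the Riemann sum for $f(x) = x^2$. The random construction of partitions and tags, the fixed seed, and the helper names are assumptions made for illustration; the point is only that the sums settle near a single number (here 1/3) as the gauge |P| shrinks.

```python
import random

def random_partition(a, b, n, rng):
    """Return points a = x_0 <= x_1 <= ... <= x_n = b with unequal spacing."""
    interior = sorted(rng.uniform(a, b) for _ in range(n - 1))
    return [a] + interior + [b]

def gauge(points):
    """Length of the longest subinterval, written |P| in the text."""
    return max(right - left for left, right in zip(points, points[1:]))

def riemann_sum(f, points, rng):
    """R(f, P, {t_i}) with each tag t_i drawn at random from [x_{i-1}, x_i]."""
    total = 0.0
    for left, right in zip(points, points[1:]):
        tag = rng.uniform(left, right)
        total += f(tag) * (right - left)
    return total

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1000, 10000):
    P = random_partition(0.0, 1.0, n, rng)
    print(n, round(gauge(P), 5), riemann_sum(lambda x: x * x, P, rng))
```

For this particular f, each term differs from the exact area over its subinterval by at most 2|P| times the subinterval length, so the whole sum is within 2|P| of 1/3 however the tags fall.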

FIGURE 16 The curve is $y = x^2$. The intervals may be unequal, the $t_i$ located differently (not at the center, or always the same proportion of the way between $x_{i-1}$ and $x_i$) within the intervals.

B. Evaluation of Integrals

No matter what the source (distance traveled, work done, etc.) of a Riemann sum, the corresponding integral may be interpreted as an area. Thus, for f(x) = x, we easily see (Fig. 17) that

$$\int_0^1 f = \int_0^1 x = \frac{1}{2}$$

FIGURE 17 Interpreting an integral as an area.

When a function is known to be Riemann integrable, then the fact that all Riemann sums get close to the correct value for partitions with sufficiently small gauge means that we may use any convenient sum. We have seen that for $f(x) = x^2$, one such sum was given by $R_n$ as expressed in Eq. (1):

$$R_n = \frac{1}{n}\left(\frac{1}{n}\right)^2 + \frac{1}{n}\left(\frac{2}{n}\right)^2 + \cdots + \frac{1}{n}\left(\frac{n}{n}\right)^2 = \frac{1}{n^3}\left(1^2 + 2^2 + \cdots + n^2\right)$$

We have already said that Archimedes was helped in his work because he knew Eq. (3), enabling him to write

$$R_n = \frac{1}{n^3} \cdot \frac{1}{6}\, n(n + 1)(2n + 1) = \frac{1}{6}\left(\frac{n}{n}\right)\left(\frac{n + 1}{n}\right)\left(\frac{2n + 1}{n}\right)$$

$$R_n = \frac{1}{6}\left(1 + \frac{1}{n}\right)\left(2 + \frac{1}{n}\right)$$

Since the partition used here had n equal intervals, the requirement that the gauge get smaller is equivalent to requiring that n get larger. Thus,

$$\int_0^1 x^2 = \lim_{n \to \infty} R_n = \frac{1}{6}(1)(2) = \frac{1}{3}$$

With almost no extra effort, the length of the interval can be extended to an arbitrary b, and it can be determined that

$$\int_0^b x^2 = \frac{b^3}{3}$$

Proceeding in just this way, the mathematician Cavalieri (1598–1647) was able to find formulas similar to Eq. (3) for sums of the form $1^k + 2^k + \cdots + n^k$ for k = 1, 2, . . . , 9. He was thus able to prove that

$$\int_0^b x^k = \frac{b^{k+1}}{k + 1} \qquad (22)$$

for all positive integers up to 9, and he certainly correctly anticipated that Eq. (22) holds for all positive integers. But each value of k presented a new problem, and as we said, Cavalieri himself stalled on k = 10. Reasonable people will be discouraged by the prospect of finding $\int_a^b f$ for more complicated functions. Indeed, the difficulty of such calculations made most such problems intractable until the time of Newton (1642–1727) and Leibniz (1646–1716), when discovery of the fundamental theorem of calculus, discussed below, brought such problems into the realm of feasibility.

The computer age has changed all that. For a given function, it is now possible to evaluate Riemann sums very rapidly. Along with this possibility have come increased emphases on variations of Riemann sums that approximate $\int_a^b f$. These are properly studied in courses on numerical analysis.

Before leaving the topic of evaluation, we note that, like differentiation, integration is a linear operator. For two integrable functions f and g and two real numbers r and s,

$$\int_a^b (rf + sg) = r \int_a^b f + s \int_a^b g$$

Thus, using what we know from our calculations above,

$$\int_0^1 (4x - 2x^2) = 4 \int_0^1 x - 2 \int_0^1 x^2 = 4 \cdot \frac{1}{2} - 2 \cdot \frac{1}{3} = \frac{4}{3}$$

C. Applications

It seems more difficult to group the diverse applications of integration into major classifications than is the case for differentiation. We shall indicate the breadth of possibilities with several examples typically studied in calculus.

1. Volumes

Suppose the graph of $y = x^2$ is rotated about the x axis to form a so-called solid of revolution that is cut off by the

plane x = 1 (Fig. 18). A partition of the x axis from 0 to 1 now determines a series of planes that cut off disks, each disk having a volume approximated by $\pi y_i^2 (x_i - x_{i-1})$ where $y_i = t_i^2$ for some $t_i$ between $x_{i-1}$ and $x_i$. Summing these volumes gives

$$\pi t_1^4 (x_1 - x_0) + \pi t_2^4 (x_2 - x_1) + \cdots + \pi t_n^4 (x_n - x_{n-1})$$

which has the form of a Riemann sum for a function $f(x) = \pi x^4$. It converges to an integral that we can evaluate with the help of Eq. (22).

$$\text{Volume} = \int_0^1 \pi x^4 = \pi \int_0^1 x^4 = \frac{1}{5}\pi$$

FIGURE 18 The curve $y = x^2$ has been rotated about the x axis to generate a solid figure.

2. Length of Arc

Suppose the points A and B are connected by a curve that is the graph of y = g(x) for x between a and b (Fig. 19). The length of this curve is clearly approximated by the sum of the lengths of line segments joining points that have as their first coordinates the points of a partition of the x axis from a to b. A typical segment has length

$$l = \sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2} = \sqrt{(x_i - x_{i-1})^2 + [g(x_i) - g(x_{i-1})]^2}$$

If the function g is differentiable, then from the Mean Value Theorem, we see that

$$g(x_i) - g(x_{i-1}) = g'(t_i)(x_i - x_{i-1})$$

where $t_i$ is between $x_{i-1}$ and $x_i$. Thus

$$l = \sqrt{1 + [g'(t_i)]^2}\,(x_i - x_{i-1})$$

and the sum of these lengths is

$$\sqrt{1 + [g'(t_1)]^2}\,(x_1 - x_0) + \cdots + \sqrt{1 + [g'(t_n)]^2}\,(x_n - x_{n-1})$$

This has the form of a Riemann sum for the function $f(x) = \sqrt{1 + [g'(x)]^2}$. It converges to an integral we take to be the desired

$$\text{Length} = \int_a^b \sqrt{1 + [g'(x)]^2}$$

FIGURE 19 An arbitrarily drawn curve, together with a sequence of secant lines being used to (roughly) approximate the length of the curve.

3. Normal Distribution

If in a certain town we measure the heights of all the women, or the IQ scores of all the third graders, or the gallons of water consumed in each single family dwelling, we will find that the readings cluster around a number called the mean, $\bar{x}$. A common display of the readings uses a bar graph, a graph in which the percentage of the readings that fall between $x_{i-1}$ and $x_i$ is indicated by the height of a bar drawn over the appropriate interval (Fig. 20a). The sum of all the heights (all the percentages) should, of course, be 1.

As the size of the intervals is decreased and the number of data points is increased, it happens in a remarkable number of cases that the bars arrange themselves under the so-called normal distribution curve (Fig. 20b) that has an equation of the form

$$y = d e^{-m(x - \bar{x})^2}$$

The constants are related to the relative spread of the bell-shaped curve, and they are chosen so that the area under the curve is 1. The percentage of readings that fall between a and b is then given by

$$\int_a^b d e^{-m(x - \bar{x})^2}$$

FIGURE 20 A pair of histograms, the second one indicating how an increasing number of columns leads to the concept of a distribution curve.

VI. FUNDAMENTAL THEOREM OF CALCULUS

Up to this point, calculus seems to be neatly separated into two main topics, differential calculus and integral calculus. Historically, the two topics did develop separately, and considerable progress had been made in both areas. The genius of Newton and Leibniz was to see and exploit the connection between the integral and the derivative.

A. Integration by Antidifferentiation

Let y = f(x) be the distance that a moving object has traveled from some fixed starting point in time x. We have seen (Table II) that the velocity of the object at time x is then given by $\upsilon = f'(x)$. Since the value of $f'(x)$ is also the slope of the line tangent at (x, y) to the graph of y = f(x), a general sketch of $\upsilon = f'(x)$ may be drawn by looking at the graph of y = f(x); see Fig. 21.

FIGURE 21 A graph of y = f(x) on the top axes with two tangent segments having a slope of 1, an intermediate segment with slope >1, and a low point with slope 0. The lower curve, obtained by plotting the slopes of representative tangent lines, is the graph of $y = f'(x)$.

From the graph of y = f(x), we see that from time x = a to time x = b,

$$\text{Distance traveled} = f(b) - f(a) \qquad (23)$$

At the same time, we argued in setting up Eq. (19) that if the velocity υ of a moving object is given by υ = g(t), then the distance traveled would be approximated by what we came later to see was a Riemann sum that converged to

$$\int_a^b g$$

Thus, in the present situation with $g = f'$ as the velocity function, we see that from the time x = a to x = b,

$$\text{Distance traveled} = \int_a^b f' \qquad (24)$$

Comparison of Eqs. (23) and (24) suggests that

$$\int_a^b f' = f(b) - f(a)$$

The function being integrated is $f'$, the derivative of the function f on the right side of the equal sign. The function f is in turn called the antiderivative of $f'$. The relationship lies at the heart of the calculus.

Fundamental theorem of calculus. Let F be any antiderivative of f; that is, let F be a function such that $F'(x) = f(x)$. Then

$$\int_a^b f = F(b) - F(a)$$

B. Some Consequences

In Eq. (15), we saw that if $f(x) = x^r$, then $f'(x) = r x^{r-1}$. It follows that the antiderivative of $f(x) = x^r$ would, for $r \ne -1$, be

$$F(x) = \frac{1}{r + 1}\, x^{r+1}$$

Consequently, for any $r \ne -1$ the fundamental theorem gives us

$$\int_a^b x^r = F(b) - F(a) = \frac{1}{r + 1}\, b^{r+1} - \frac{1}{r + 1}\, a^{r+1}$$

For a = 0 and r chosen to be a positive integer, we have Cavalieri's result [Eq. (22)].

The case r = −1 is of special interest. Since f(x) = 1/x is continuous for x > 0,

$$\int_1^b \frac{1}{x}$$

exists for any b > 0. Yet the function f(x) = 1/x has no simple antiderivative. Suppose we set

$$L(t) = \int_1^t \frac{1}{x}$$

This function turns out to have the following interesting properties:

$$L'(t) = \frac{1}{t}, \qquad L(e^x) = x$$

The second property means that L is the inverse of the exponential function; it exhibits all the properties associated with a logarithm function. Since L(e) = 1, we say L is the logarithmic function having e as its base; it is called the natural logarithm.

VII. INFINITE SERIES

In discussing Approximation above, we noted that the value of

$$y = f(x) = \sqrt{x}$$

could be approximated for values of x near 9 by using

$$y = T_1(x) = f(9) + f'(9)(x - 9),$$

the function having as its graph the line tangent at x = 9 to the graph of $y = \sqrt{x}$. We might explain the usefulness of $T_1$ as a way to approximate f near x = 9 by observing that $T_1$ and its first derivative both agree with f at x = 9. That is,

$$T_1(9) = f(9) \quad \text{and} \quad T_1'(9) = f'(9)$$

This viewpoint has a wonderfully useful generalization. Consider the function

$$T_2(x) = f(9) + f'(9)(x - 9) + \frac{f''(9)}{2}(x - 9)^2$$

Its first and second derivatives are

$$T_2'(x) = f'(9) + f''(9)(x - 9)$$

$$T_2''(x) = f''(9)$$

This function and its first two derivatives agree with f at x = 9; that is, we see by substituting in the expressions just given that

$$T_2(9) = f(9), \quad T_2'(9) = f'(9), \quad T_2''(9) = f''(9).$$

It seems reasonable, therefore, to guess that $T_2(9.54)$ will give us a better approximation to $\sqrt{9.54}$ than we got with $T_1(9.54) = 3.09$. Indeed, since

$$f''(x) = -\frac{1}{2} \cdot \frac{1}{2}\, x^{-3/2}, \qquad f''(9) = -\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{3^3} = -\frac{1}{108}$$

and

$$T_2(9.54) = 3 + \frac{1}{6}(0.54) + \frac{1}{2}\left(\frac{-1}{108}\right)(0.54)^2 = 3.08865$$

The actual value correct to five places to the right of the decimal is 3.08869, so we have improved our estimate.

The full generalization is this. Suppose a function and its derivatives at a point $x_0$ are known or easily calculated. Then the nth degree Taylor polynomial is defined to be

$$T_n(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n$$

This polynomial can be used to estimate f for values of x near $x_0$, and the larger n is taken, the better the estimate will be. Taylor polynomials of degree n = 6 for some well known functions are seen in Table III.

We are now in a position to get some idea of how calculators determine the values of familiar functions. The following table gives the actual values for the sine and cosine of angles near $x_0 = 0$, and, for comparison, the values of the polynomials S(x) and C(x) defined above. To get the results listed, one must remember that in calculus, angles must be measured in radians.

TABLE III

Function     Taylor polynomial of degree 6
sin x        $S(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!}$
cos x        $C(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!}$
e^x          $E(x) = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^6}{6!}$
Arc tan x    $A(x) = x - \frac{x^3}{3} + \frac{x^5}{5}$
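
The estimates of $\sqrt{9.54}$ can be reproduced in a few lines. This sketch hard-codes the values f(9) = 3, f'(9) = 1/6, and f''(9) = −1/108 for $f(x) = \sqrt{x}$ and evaluates the first- and second-degree Taylor polynomials at 9.54; the function and constant names are illustrative assumptions.

```python
import math

# Values of f(x) = sqrt(x) and its first two derivatives at x0 = 9.
X0 = 9.0
F0 = 3.0            # f(9)
F1 = 1.0 / 6.0      # f'(9)  = 1 / (2 * sqrt(9))
F2 = -1.0 / 108.0   # f''(9) = -(1/4) * 9 ** (-3/2)

def t1(x):
    """First-degree Taylor polynomial (the tangent line) at x0 = 9."""
    return F0 + F1 * (x - X0)

def t2(x):
    """Second-degree Taylor polynomial at x0 = 9."""
    return t1(x) + (F2 / 2.0) * (x - X0) ** 2

x = 9.54
print(f"T1 = {t1(x):.5f}, T2 = {t2(x):.5f}, sqrt = {math.sqrt(x):.5f}")
```

Adding the quadratic term shrinks the error from roughly 1.3e-3 to roughly 4e-5 at this point, which is the improvement the text describes.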

TABLE IV

x (degrees)  x (radians)  sin x      S(x)       cos x      C(x)
 1           0.0174533    0.0174524  0.0174524  0.9998477  0.9998477
 2           0.0349066    0.0348995  0.0348995  0.9993908  0.9993908
 3           0.0523599    0.0523360  0.0523360  0.9986295  0.9986295
 4           0.0698132    0.0697565  0.0697565  0.9975641  0.9975641
 5           0.0872664    0.0871557  0.0871558  0.9961947  0.9961947
10           0.1745329    0.1736482  0.1736483  0.9848078  0.9848077
15           0.2617994    0.2588192  0.2588192  0.9659258  0.9659258
20           0.3490659    0.3420201  0.3420204  0.9396926  0.9396926
25           0.4363323    0.4226183  0.4226190  0.9063078  0.9063077
30           0.5235987    0.5000000  0.5000023  0.8660254  0.8660252
35           0.6108652    0.5735764  0.5735829  0.8191520  0.8191514
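
A table of this kind can be regenerated with a short script. The sketch below defines S(x) and C(x) as the sine and cosine polynomials of Table III and prints them next to the library sine and cosine; the selection of angles and the output formatting are assumptions for illustration, and the printed values may differ slightly from the table in the last digits.

```python
import math

def S(x):
    """Taylor polynomial x - x^3/3! + x^5/5! for sin x."""
    return x - x**3 / math.factorial(3) + x**5 / math.factorial(5)

def C(x):
    """Taylor polynomial 1 - x^2/2! + x^4/4! - x^6/6! for cos x."""
    return (1 - x**2 / math.factorial(2) + x**4 / math.factorial(4)
            - x**6 / math.factorial(6))

for degrees in (1, 5, 10, 20, 30, 35):
    x = math.radians(degrees)  # the polynomials expect radians
    print(f"{degrees:3d} {x:.7f} {math.sin(x):.7f} {S(x):.7f} "
          f"{math.cos(x):.7f} {C(x):.7f}")
```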

In a table of values accurate to seven places to the right of the decimal, one must get to 5 degrees before any difference shows up between sin x and its polynomial approximation S(x); and the polynomial approximations S(x) and C(x) give six place accuracy for sin x and cos x for values of x all the way through 20 degrees.

Though manufacturers of calculators use variations, the idea illustrated in Table IV gives correct insight into how calculators find values of trigonometric functions; they make use of polynomial approximations. Greater accuracy can be obtained by using Taylor polynomials of higher degree. It is sufficient, of course, to obtain accuracy up to the number of digits to the right of the decimal that are displayed by the calculator.

The last function listed in Table III, Arc tan x, is the inverse tangent function. If y = Arc tan x, then x = tan y. It is included so we can address another question about which people sometimes wonder. Since $\tan 30^\circ = 1/\sqrt{3}$ and $30^\circ = \pi/6$ radians, substitution of $x = 1/\sqrt{3}$ into the polynomial approximation for Arc tan x gives

$$\frac{\pi}{6} \approx \frac{1}{\sqrt{3}} - \frac{1}{3(\sqrt{3})^3} + \frac{1}{5(\sqrt{3})^5} - \frac{1}{7(\sqrt{3})^7} = 0.5230$$

Multiplication by 6 gives us the approximation of π ≈ 3.138. Rounding to two decimals to the right of the decimal, we get the familiar approximation of 3.14. More accuracy can be obtained by using a Taylor polynomial of higher degree, and there are tricks that yield much better approximations using just the 7th degree polynomial employed here, but again we stop, having illustrated our point. Familiar constants such as π (and e) can be approximated using Taylor polynomials.

While our discussion of infinite series has given some indication of the usefulness of the idea, it is important to indicate that a great deal more could be said. Thus, while Taylor series provide an important way to represent functions, they should be seen as just one such representation. Fourier series provide another useful representation, and there are others.

Also, while we have approached infinite series by way of representing functions, we might well have started with the representation of individual numbers. The familiar use of $0.333\cdots = \tfrac{1}{3}$ is really a statement saying that the infinite sum

$$\frac{3}{10} + \frac{3}{10^2} + \frac{3}{10^3} + \cdots$$

gets closer and closer to $\tfrac{1}{3}$; that is, the finite sums

$$S_n = \frac{3}{10} + \frac{3}{10^2} + \cdots + \frac{3}{10^n}$$

get closer to $\tfrac{1}{3}$ as n, the number of terms, increases.

The addition of an infinite number of terms is not a trivial subject. The history of mathematics includes learned discussions of what value to assign to

$$1 - 1 + 1 - 1 + 1 - \cdots$$

Some argued that an obvious grouping shows that

$$(1 - 1) + (1 - 1) + (1 - 1) + \cdots = 0$$

Others countered that

$$1 - (1 - 1) - (1 - 1) - \cdots = 1$$

and Leibniz, one of the developers of calculus, suggested that the proper value would therefore be $\tfrac{1}{2}$. There are other instances in which famous mathematicians have challenged one another to find the value of an infinite series of constants. James Bernoulli (1654–1705) made it clear that he would like to know the sum of

$$1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \cdots + \frac{1}{n^2} + \cdots$$

but it wasn't until 1736 that Euler discovered that this sum got closer and closer to $\pi^2/6$.
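
The numerical claims in this section are easy to check. The sketch below evaluates the 7th-degree arctangent polynomial $x - x^3/3 + x^5/5 - x^7/7$ at $x = 1/\sqrt{3}$ and prints finite sums of the two series of constants discussed; the helper names and the numbers of terms are arbitrary choices made for illustration.

```python
import math

def arctan_poly(x, degree=7):
    """Partial sum x - x^3/3 + x^5/5 - ... up to the given odd degree."""
    total, sign = 0.0, 1.0
    for k in range(1, degree + 1, 2):
        total += sign * x**k / k
        sign = -sign
    return total

def decimal_thirds(n):
    """Finite sum S_n = 3/10 + 3/10^2 + ... + 3/10^n."""
    return sum(3 / 10**k for k in range(1, n + 1))

def bernoulli_sum(n):
    """Finite sum 1 + 1/4 + 1/9 + ... + 1/n^2."""
    return sum(1 / k**2 for k in range(1, n + 1))

print(6 * arctan_poly(1 / math.sqrt(3)), math.pi)
print(decimal_thirds(5), 1 / 3)
print(bernoulli_sum(100000), math.pi**2 / 6)
```

The last line also shows how slowly Bernoulli's series converges: after n terms the finite sum still falls short of the limit by about 1/n.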

VIII. SOME HISTORICAL NOTES

Archimedes, with his principle of exhaustion, certainly had in hand the notion of approaching a limiting value with a sequence of steps that could be carried on ad infinitum, but he was limited by, among other things, the lack of a convenient notation. The algebraic notation came much later, and the idea of relating algebra to geometry was later still, often attributed to Descartes (1596–1650), but arguably owing more to Fermat (1601–1665). Fermat, Barrow (1630–1677), Newton's teacher, Huygens (1629–1695), Leibniz's teacher, and others set the stage in the first half of the 17th century, but the actual development of the calculus is attributed to Isaac Newton (1642–1727) and Gottfried Wilhelm Leibniz (1646–1716), two geniuses who worked independently and were ultimately drawn into arguments over who developed the subject first.

Evidence seems to substantiate Newton's claim to primacy, but his towering and deserved reputation as one of the greatest thinkers in the history of mankind is surely owing not so much to what he invented as to what he did with it. In Newton's hands, calculus was a tool for changing the way humans understood the universe. Using calculus to extrapolate from his law of universal gravitation and other laws of motion, Newton was able not only to analyze the motion of free falling bodies on earth, but to explain, even predict, the motions of the planets. He was widely regarded as having supernatural insight, a reputation the poet Alexander Pope caught with the lines,

    Nature and Nature's laws lay hid in night:
    God said, "Let Newton be!" and all was light.

Leibniz, on the other hand, developed the notation that made the calculus comprehensible to others, and he gathered around himself the disciples that took the lead away from Newton's followers in England, and made the continent the center of mathematical life for the next century. James (1654–1705) and John (1667–1748) Bernoulli, Leonhard Euler (1707–1783), and others applied calculus to a host of problems and puzzles as well as applied problems.

The first real textbook on calculus was Analyse des Infiniment Petits Pour L'intelligence des Lignes Courbes, written by the Marquis de l'Hospital in 1696. It wasn't until 1816, however, that a text by Lacroix, Traité du Calcul Différentiel et du Calcul Intégral, having been well received in France for many years, was translated into English and so brought the methods used on the continent to the English speaking world. Through all of these books there ran an undercurrent of confusion about the nature of the infinitesimal, and it was not until the work of Cauchy (1789–1857) and Weierstrass (1815–1897) that logical gaps were closed and calculus took the form that we recognize today.

The most recent changes in the teaching of calculus grew out of an effort during the decade from 1985–1995 to reform the way calculus was to be taught in the United States. A summary of that effort is presented in Calculus, The Dynamics of Change.

SEE ALSO THE FOLLOWING ARTICLES

ALGEBRAIC GEOMETRY • DIFFERENTIAL EQUATIONS, ORDINARY • DIFFERENTIAL EQUATIONS, PARTIAL • INTEGRAL EQUATIONS • NUMERICAL ANALYSIS • STOCHASTIC PROCESSES

BIBLIOGRAPHY

Apostol, T. M. (1961). "Calculus, Volumes 1 and 2," Blaisdell, Boston.
Boyer, C. B. (1959). "The History of the Calculus and Its Conceptual Development," Dover, New York.
Morgan, F. (1995). "Calculus Lite," A. K. Peters, Wellesley, Massachusetts.
Ostebee, A., and Zorn, P. (1997). "Calculus," Harcourt Brace & Co., San Diego.
Roberts, A. W., ed. "Calculus: The Dynamics of Change," Mathematical Association of America, Washington, DC.
Sawyer, W. W. (1961). "What is Calculus About?," The Mathematical Association of America New Mathematical Library, Washington, DC.
Swokowski, E. W. (1991). "Calculus," 5th ed., PWS-Kent, Boston.
Thomas, G. B., Jr., and Finney, R. L. (1984). "Calculus and Analytic Geometry," 6th ed., Addison-Wesley, Reading, Massachusetts.
Toeplitz, O. (1963). "The Calculus, A Genetic Approach," University of Chicago Press, Chicago.
P1: ZCK Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN003D-127 June 13, 2001 22:39

Complex Analysis
Joseph P. S. Kung
Department of Mathematics, University of North Texas
Chung-Chun Yang
Department of Mathematics, The Hong Kong University of Science and Technology

I. The Complex Plane

II. Analytic and Holomorphic Functions
III. Cauchy Integral Formula
IV. Meromorphic Functions
V. Some Advanced Topics

GLOSSARY

Analytic function A complex function with a power series expansion. An analytic function is holomorphic and conversely.
Argument In the polar form $z = re^{i\theta}$ of a complex number, the argument is the angle θ, which is determined up to an integer multiple of 2π.
Cauchy integral formula A basic formula expressing the value of an analytic function at a point as a line integral along a closed curve going around that point.
Cauchy–Riemann equations The system of first-order partial differential equations $u_x = v_y$, $u_y = -v_x$, which is equivalent to the complex function f(z) = u(x, y) + iv(x, y) being holomorphic (under the assumption that all four first-order partial derivatives are continuous).
Conformal map A map that preserves angles infinitesimally.
Disk A set of the form {z : |z − a| < r} is an open disk.
Domain A nonempty connected open set in the complex plane.
Half-plane The upper half-plane is the set of complex numbers with positive imaginary part.
Harmonic function A real function u(x, y) of two real variables satisfying Laplace's equation $u_{xx} + u_{yy} = 0$.
Holomorphic function A differentiable complex function. See also Analytic function.
Laurent series An expansion of a complex function as a doubly infinite sum: $f(z) = \sum_{m=-\infty}^{\infty} a_m (z - c)^m$.
Meromorphic function A complex function that is differentiable (or holomorphic) except at a discrete set of points, where it may have poles.
Neighborhood of a point An open set containing that point.
Pole A point a is a pole of a function f(z) if f(z) is analytic on a neighborhood of a but not at a, $\lim_{z \to a} f(z) = \infty$, and $\lim_{z \to a} (z - a)^k f(z)$ is finite for some positive integer k.
Power series or Taylor series An expansion of a complex function as an infinite sum: $f(z) = \sum_{m=0}^{\infty} a_m (z - c)^m$.
Region A nonempty open set in the complex plane.
Riemann surface A surface (in two real dimensions) obtained by cutting and gluing together a finite or infinite number of complex planes.


TO OVERSIMPLIFY, COMPLEX ANALYSIS is calculus using complex numbers rather than real numbers. Because a complex number a + ib is essentially a pair of real numbers, there is more "freedom of movement" over the complex numbers and many conditions become stronger. As a result, complex analysis has many features that distinguish it from real analysis. The most remarkable feature, perhaps, is that a differentiable complex function always has a power series expansion. This fact follows from the Cauchy integral formula, which expresses the value of a function at a point in terms of values of the function on a closed curve going around that point.

Many powerful mathematical techniques can be found in complex analysis. Many of these techniques were developed in the 19th century in conjunction with solving practical problems in astronomy, engineering, and physics. Complex analysis is now recognized as an indispensable component of any applied mathematician's toolkit. Complex analysis is also extensively used in number theory, particularly in the study of the distribution of prime numbers.

In addition, complex analysis was the source of many new subjects of mathematics. An example of this is Riemann's attempt to make multivalued complex functions such as the square-root function single-valued by enlarging the range. This led him to the idea of a Riemann surface. This, in turn, led him to the theory of differentiable manifolds, a mathematical subject that is the foundation of the theory of general relativity.

I. THE COMPLEX PLANE

A. Complex Numbers

A complex number is a number of the form a + ib, where a and b are real numbers and i is a square root of −1, that is, i satisfies the quadratic equation $i^2 + 1 = 0$. Historically, complex numbers arose out of attempts to solve polynomial equations. In particular, in the 16th century, Cardano, Tartaglia, and others were forced to use complex numbers in the process of solving cubic equations, even when all three solutions are real. Because of this, complex numbers acquired a mystical aura that was not dispelled until the early 19th century, when Gauss and Argand proposed a geometric representation for them as pairs of real numbers.

Gauss thought of a complex number z = a + ib geometrically as a point (a, b) in the real two-dimensional space. This represents the set C of complex numbers as a real two-dimensional plane, called the complex plane. The x-axis is called the real axis and the real number a is called the real part of z. The y-axis is called the imaginary axis and the real number b is called the imaginary part of z.

The complex conjugate $\bar{z}$ of the complex number z = a + ib is the complex number a − ib. The absolute value |z| is the real number $\sqrt{a^2 + b^2}$. The (multiplicative) inverse or reciprocal of z is given by the formula

$$\frac{1}{z} = \frac{\bar{z}}{|z|^2} = \frac{a - ib}{a^2 + b^2}.$$

The complex number z = a + ib can also be written in the polar form

$$z = re^{i\theta} = r(\cos\theta + i\sin\theta),$$

where

$$r = \sqrt{a^2 + b^2}, \qquad \tan\theta = b/a.$$

The angle θ is called the argument of z. The argument is determined up to an integer multiple of 2π. Usually, one takes the value θ so that −π < θ ≤ π; this is called the principal value of the argument.

B. Topology of the Complex Plane

The absolute value defines a metric or distance function d on the complex plane C by

$$d(z_1, z_2) = |z_1 - z_2|.$$

This metric satisfies the triangle inequality

$$d(z_1, z_2) + d(z_2, z_3) \ge d(z_1, z_3)$$

and determines a topology on the complex plane C. We shall need the notions of open sets, closed sets, closure, compactness, continuity, and homeomorphism from topology.

C. Curves

A curve γ is a continuous map from a real interval [a, b] to C. The curve γ is said to be closed if γ(a) = γ(b); it is said to be open otherwise. A simple closed curve is a closed curve with, for $t_1 < t_2$, $\gamma(t_1) = \gamma(t_2)$ if and only if $t_1 = a$ and $t_2 = b$.

An intuitively obvious theorem about curves that turned out to be very difficult to prove is the Jordan curve theorem. This theorem is usually not necessary in complex analysis, but is useful as background.

The Jordan Curve Theorem. The image of a simple closed curve (not assumed to be differentiable) separates the extended complex plane into two regions. One region is bounded (and "inside" the curve) and the other is unbounded.
P1: ZCK Final
Encyclopedia of Physical Science and Technology EN003D-127 June 13, 2001 22:39

Complex Analysis 445

D. The Stereographic Projection
It is often useful to extend the complex plane by adding a point $\infty$ at infinity. The extended plane is called the extended complex plane and is denoted $\bar{\mathbb{C}}$. Using a stereographic projection, we can represent the extended plane $\bar{\mathbb{C}}$ using a sphere.
Let $S$ denote the sphere with radius 1 in real three-dimensional space defined by the equation
$$x_1^2 + x_2^2 + x_3^2 = 1.$$
Let the $x_1$-axis coincide with the real axis of $\mathbb{C}$ and let the $x_2$-axis coincide with the imaginary axis of $\mathbb{C}$. The point $(0, 0, 1)$ on $S$ is called the north pole. Let $(x_1, x_2, x_3)$ be any point on $S$ not equal to the north pole. The point $z = x + iy$ of intersection of the straight line emanating from the north pole $N$ going through the point $(x_1, x_2, x_3)$ with the complex plane $\mathbb{C}$ is the stereographic projection of $(x_1, x_2, x_3)$. Going backwards, the point $(x_1, x_2, x_3)$ is the spherical image of $z$. The north pole is the spherical image of the point $\infty$ at infinity. The stereographic projection is a one-to-one mapping of the extended plane $\bar{\mathbb{C}}$ onto the sphere $S$. The sphere $S$ is called the Riemann sphere. The stereographic projection has the property that the angle between two (differentiable) curves in $\mathbb{C}$ and the angle between their images on $S$ are equal.
The mathematical formulas relating points and their spherical images are as follows:
$$x_1 = \frac{z + \bar{z}}{1 + |z|^2}, \qquad x_2 = \frac{z - \bar{z}}{i(1 + |z|^2)}, \qquad x_3 = \frac{|z|^2 - 1}{1 + |z|^2}$$
and
$$z = \frac{x_1 + i x_2}{1 - x_3}.$$
Let $z_1$ and $z_2$ be two points in the plane $\mathbb{C}$. The spherical or chordal distance $\sigma(z_1, z_2)$ between their spherical images on $S$ is
$$\sigma(z_1, z_2) = \frac{2|z_1 - z_2|}{\sqrt{1 + |z_1|^2}\,\sqrt{1 + |z_2|^2}}.$$
Let $d\sigma$ and $ds$ be the length of the infinitesimal arc on $S$ and $\mathbb{C}$, respectively. Then
$$d\sigma = 2(1 + |z|^2)^{-1}\, ds.$$

hence, closed) subsets $A$ and $B$. A set is simply connected if every closed curve in it can be contracted to a point, with the contraction occurring in the set. It can be proved that a set is simply connected in $\mathbb{C}$ if its complement in the extended complex plane is connected. If $A$ is not simply connected, then the connected components (that is, open and closed subsets) of its complement in $\bar{\mathbb{C}}$ not containing the point $\infty$ are the "holes" of $A$.
A region is a nonempty open set of $\mathbb{C}$. A domain is a nonempty connected open set of $\mathbb{C}$. When $r > 0$, the set $\{z : |z - c| < r\}$ of all complex numbers at distance strictly less than $r$ from the center $c$ is called the open disk with center $c$ and radius $r$. Its closure is the closed disk $\{z : |z - c| \le r\}$. Open disks are the most commonly used domains in complex analysis.

II. ANALYTIC AND HOLOMORPHIC FUNCTIONS

A. Holomorphic Functions
Let $f(z)$ denote a complex function defined on a set $\Omega$ in the complex plane $\mathbb{C}$. The function $f(z)$ is said to be differentiable at the point $a$ in $\Omega$ if the limit
$$\lim_{h \to 0} \frac{f(a + h) - f(a)}{h}$$
is a finite complex number. This limit is the derivative $f'(a)$ of $f(z)$ at $a$. Note that the limit is taken over all complex numbers $h$ such that the absolute value $|h|$ goes to zero, so that $h$ ranges over a set that has two real dimensions. Thus, differentiability of complex functions is a much stronger condition than differentiability of real functions. For example, the length of every (infinitesimally) small line segment starting from $a$ is changed under the function $f(z)$ by the same real scaling factor $|f'(a)|$, independently of the angle. The formal rules of differentiation in calculus hold also for complex differentiation.
It is possible to construct complex functions that are differentiable at only one point. To exclude these degenerate cases, we only consider complex functions that are differentiable at every point in a region $\Omega$. Such functions are said to be holomorphic on $\Omega$.
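The remark that the limit must exist over all complex directions of $h$ can be illustrated numerically. The following is a minimal sketch, not part of the original article: the helper names `difference_quotient` and `quotients_by_direction`, the sample point, and the step size are all illustrative assumptions. It compares $f(z) = z^2$, whose difference quotient is nearly the same in every direction, with $f(z) = \bar{z}$, whose quotient equals $\bar{h}/h$ and therefore depends on the direction of approach.

```python
import cmath

def difference_quotient(f, a, h):
    """Return (f(a + h) - f(a)) / h for a complex increment h."""
    return (f(a + h) - f(a)) / h

def quotients_by_direction(f, a, step=1e-6, directions=8):
    """Evaluate the difference quotient for increments of equal modulus
    pointing in several equally spaced directions."""
    results = []
    for k in range(directions):
        h = step * cmath.exp(2j * cmath.pi * k / directions)
        results.append(difference_quotient(f, a, h))
    return results

a = 1.0 + 2.0j  # arbitrary sample point (an assumption of this sketch)
square = quotients_by_direction(lambda z: z * z, a)
conj = quotients_by_direction(lambda z: z.conjugate(), a)

# For z^2 every direction gives approximately the derivative 2a.
spread_square = max(abs(q - 2 * a) for q in square)
# For conj(z) the quotient is conj(h)/h, which changes with the direction of h.
spread_conj = max(abs(q - conj[0]) for q in conj)

print("z^2:     max deviation from 2a =", spread_square)
print("conj(z): max spread between directions =", spread_conj)
```

The first spread is of the order of the step size, while the second stays close to 2 however small the step is, so the conjugation map has no complex derivative at the sample point.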

E. Connectivity
We need two notions of connectedness of sets in the complex plane. Roughly speaking, a set is connected if it consists of a single piece. Formally, a set $S$ is connected if $S$ is not the union $A \cup B$ of two disjoint nonempty open (and

B. The Cauchy–Riemann Equations and Harmonic Functions
A complex function $f(z)$ can be expressed in the following way:
$$f(z) = f(x + iy) = u(x, y) + iv(x, y)$$

where $u(x, y)$ is the real part of $f(z)$ and $v(x, y)$ is the imaginary part of $f(z)$. The functions $u(x, y)$ and $v(x, y)$ are real functions of two real variables.
Differentiability of the complex function $f(z)$ can be rewritten as a condition on the real functions $u(x, y)$ and $v(x, y)$. Let $f(z) = u(x, y) + iv(x, y)$ be a complex function such that all four first-order partial derivatives of $u$ and $v$ are continuous in the open set $\Omega$. Then a necessary and sufficient condition for $f(z)$ to be holomorphic on $\Omega$ is
$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}.$$
These equations are the Cauchy–Riemann equations. In polar coordinates $z = re^{i\theta}$, the Cauchy–Riemann equations are
$$r\,\frac{\partial u}{\partial r} = \frac{\partial v}{\partial \theta}, \qquad \frac{\partial u}{\partial \theta} = -r\,\frac{\partial v}{\partial r}.$$
The square of the absolute value of the derivative is the Jacobian of $u(x, y)$ and $v(x, y)$, that is
$$|f'(z)|^2 = \frac{\partial u}{\partial x}\frac{\partial v}{\partial y} - \frac{\partial u}{\partial y}\frac{\partial v}{\partial x}.$$
A harmonic or potential function $h(x, y)$ is a real function having continuous second-order partial derivatives in a nonempty open set $\Omega$ in $\mathbb{R}^2$ satisfying the two-dimensional Laplace equation
$$\frac{\partial^2 h}{\partial x^2} + \frac{\partial^2 h}{\partial y^2} = 0$$
for all $z = x + iy$ in $\Omega$. We shall see that $f(z)$ being holomorphic in $\Omega$ implies that every derivative $f^{(n)}(z)$ is holomorphic on $\Omega$, and, in particular, the real functions $u(x, y)$ and $v(x, y)$ have partial derivatives of any order. Hence, by the Cauchy–Riemann equations, $u(x, y)$ and $v(x, y)$ satisfy the Laplace equation and are harmonic functions. The harmonic functions $u(x, y)$ and $v(x, y)$ are called the conjugate pair of the holomorphic function $f(z)$.
The two-dimensional Laplace equation governs (incompressible and irrotational) fluid flow and electrostatics in the plane. These give intuitive physical models for holomorphic functions.

power series work in the same way as power series over the reals.
In particular, a power series has a radius of convergence, that is, an extended real number $\rho$, $0 \le \rho \le \infty$, such that the series converges absolutely whenever $|z - c| < \rho$. The radius of convergence is given explicitly by Hadamard's formula:
$$\rho = \frac{1}{\limsup_{n \to \infty} \sqrt[n]{|a_n|}}.$$
A function $f(z)$ is analytic in the region $\Omega$ if for every point $c$ in $\Omega$, there exists an open disk $\{z : |z - c| < r\}$ contained in $\Omega$ such that $f(z)$ has a (convergent) power series or Taylor expansion
$$f(z) = \sum_{m=0}^{\infty} a_m (z - c)^m.$$
When this holds, $f^{(m)}(c)/m! = a_m$.
Polynomials are analytic functions on the complex plane. Other examples of analytic functions on the complex plane are the exponential function,
$$e^z = \sum_{n=0}^{\infty} \frac{z^n}{n!},$$
and the two trigonometric functions,
$$\cos z = \frac{e^{iz} + e^{-iz}}{2}, \qquad \sin z = \frac{e^{iz} - e^{-iz}}{2i}.$$
The inverse under functional composition of $e^z$ is the (natural) logarithm $\log z$. The easiest way to define it is to put $z$ into polar form. Then
$$\log z = \log re^{i\theta} = \log r + i\theta.$$
Since $\theta$ is determined up to an integer multiple of $2\pi$, the logarithmic function is multivalued and one needs to extend the range to a Riemann surface (see Section V.D) to make it a function. For most purposes, one takes the value of $\theta$ so that $-\pi < \theta \le \pi$. This yields the principal value of the logarithm.
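As a numerical sanity check on Sections II.B and II.C, the sketch below takes the exponential function $e^z$, which is analytic by its power series, and verifies by finite differences that its real and imaginary parts satisfy the Cauchy–Riemann equations, the Laplace equation, and the Jacobian identity for $|f'(z)|^2$. The helper names, the sample point, and the step sizes are assumptions made for illustration; this is a rough check, not a proof.

```python
import cmath

def u(x, y):
    """Real part of exp(x + iy)."""
    return cmath.exp(complex(x, y)).real

def v(x, y):
    """Imaginary part of exp(x + iy)."""
    return cmath.exp(complex(x, y)).imag

def partial_x(g, x, y, h=1e-5):
    """Central-difference approximation of dg/dx."""
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

def partial_y(g, x, y, h=1e-5):
    """Central-difference approximation of dg/dy."""
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

def laplacian(g, x, y, h=1e-3):
    """Five-point approximation of d2g/dx2 + d2g/dy2."""
    return (g(x + h, y) + g(x - h, y) + g(x, y + h) + g(x, y - h)
            - 4 * g(x, y)) / h ** 2

x0, y0 = 0.3, -0.7  # arbitrary sample point

# Cauchy-Riemann residuals: both should be close to zero.
cr1 = partial_x(u, x0, y0) - partial_y(v, x0, y0)
cr2 = partial_y(u, x0, y0) + partial_x(v, x0, y0)

# Jacobian of (u, v) compared with |f'(z)|^2, where f'(z) = exp(z).
jacobian = (partial_x(u, x0, y0) * partial_y(v, x0, y0)
            - partial_y(u, x0, y0) * partial_x(v, x0, y0))
deriv_sq = abs(cmath.exp(complex(x0, y0))) ** 2

lap_u = laplacian(u, x0, y0)
lap_v = laplacian(v, x0, y0)

print("Cauchy-Riemann residuals:", cr1, cr2)
print("Jacobian vs |f'|^2:", jacobian, deriv_sq)
print("Laplacians of u and v:", lap_u, lap_v)
```

The same helpers can be pointed at any other pair $u$, $v$; for a function such as $\bar{z}$ the first residual would be 2 rather than 0.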
C. Power Series and Analytic Functions
Just as in calculus, we can define complex functions using power series. A power series is an infinite sum of the form
$$\sum_{m=0}^{\infty} a_m (z - c)^m$$
where the coefficients $a_m$ and the center $c$ are complex numbers. This series determines a complex number (depending on $z$) whenever the series converges. Complex

III. CAUCHY INTEGRAL FORMULA

A. Line Integrals and Winding Numbers
Line integrals are integrals taken over a curve rather than an interval on a real line. Let $\gamma : [a, b] \to \mathbb{C}$ be a piecewise continuously differentiable curve and let $f(z)$ be a continuous complex function defined on the image of $\gamma$. Then the line integral of $f(z)$ along the path $\gamma$ is the Riemann integral
$$\int_a^b f(\gamma(t))\,\gamma'(t)\,dt.$$
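The definition of the line integral as a Riemann integral of $f(\gamma(t))\gamma'(t)$ can be evaluated directly by numerical quadrature. The following minimal sketch uses a midpoint rule; the function name `line_integral`, the chosen curves, and the number of subintervals are illustrative assumptions rather than anything prescribed by the article.

```python
import cmath
import math

def line_integral(f, gamma, dgamma, a, b, n=2000):
    """Midpoint-rule approximation of the integral over [a, b]
    of f(gamma(t)) * gamma'(t) dt."""
    dt = (b - a) / n
    total = 0j
    for k in range(n):
        t = a + (k + 0.5) * dt
        total += f(gamma(t)) * dgamma(t) * dt
    return total

def circle(t):
    """Unit circle traversed once counterclockwise for t in [0, 2*pi]."""
    return cmath.exp(1j * t)

def circle_prime(t):
    """Derivative of the unit-circle parametrization."""
    return 1j * cmath.exp(1j * t)

# Integral of 1/z around the unit circle: expected value 2*pi*i.
int_inv = line_integral(lambda z: 1 / z, circle, circle_prime, 0.0, 2 * math.pi)

# Integral of z^2 around the unit circle: expected value 0.
int_sq = line_integral(lambda z: z * z, circle, circle_prime, 0.0, 2 * math.pi)

# Integral of z along the straight segment from 0 to 1 + i: expected (1 + i)^2 / 2 = i.
int_seg = line_integral(lambda z: z,
                        lambda t: t * (1 + 1j),
                        lambda t: 1 + 1j,
                        0.0, 1.0)

print(int_inv, int_sq, int_seg)
```

Dividing the first result by $2\pi i$ gives 1, in line with the count of how many times the circle goes around the origin.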

The line integral is denoted by
$$\int_\gamma f(z)\,dz.$$
Line integrals are also called path integrals or contour integrals. Line integrals behave in similar ways to integrals over the real line, except that instead of moving along the real line, we move on a curve.
The winding number or index $n(\gamma, a)$ of a curve $\gamma$ relative to a point $a$ is the number of times the curve winds or goes around the point $a$. Formally, we define $n(\gamma, a)$ using a line integral:
$$n(\gamma, a) = \frac{1}{2\pi i} \int_\gamma \frac{d\zeta}{\zeta - a}.$$
This definition is consistent with the intuitive definition. One can show, for example, that (a) $n(\gamma, a)$ is always an integer, (b) the winding number of a circle around its center is 1 if the circle goes around in a counterclockwise direction, and $-1$ if the circle goes around in a clockwise direction, (c) the winding number of a circle relative to a point in its exterior is 0, and (d) if $a$ is a point in the interior of two curves and those two curves can be continuously deformed into one another without going through $a$, then they have the same winding number relative to $a$.

B. Cauchy's Integral Formula
In general, line integrals depend on the curve. But if the integrand $f(z)$ is holomorphic, Cauchy's integral theorem implies that the line integral on a simply connected region only depends on the endpoints.

Cauchy's integral theorem. Let $f(z)$ be holomorphic on a simply connected region $\Omega$ in $\mathbb{C}$. Then for any closed piecewise continuously differentiable curve $\gamma$ in $\Omega$,
$$\int_\gamma f(z)\,dz = 0.$$

$$f^{(m)}(z) = \frac{m!}{2\pi i} \int_\gamma \frac{f(\zeta)\,d\zeta}{(\zeta - z)^{m+1}}.$$
In particular,
$$f(z) = \frac{1}{2\pi i} \int_\gamma \frac{f(\zeta)\,d\zeta}{\zeta - z}.$$
Cauchy's formula for $f(z)$ follows from Cauchy's theorem applied to the function $(f(\zeta) - f(z))/(\zeta - z)$, and the general case follows similarly.
A somewhat more general formulation of Cauchy's formula is in terms of the winding number. If $f(z)$ is analytic on a simply connected nonempty open set $\Omega$ and $\gamma$ is a closed piecewise continuously differentiable curve, then, for every point $z$ in $\Omega$ not on $\gamma$,
$$n(\gamma, z)\, f^{(m)}(z) = \frac{m!}{2\pi i} \int_\gamma \frac{f(\zeta)\,d\zeta}{(\zeta - z)^{m+1}}.$$
Cauchy's integral formula expresses the function value $f(z)$ in terms of the function values around $z$. Take the curve $\gamma$ to be a circle $|\zeta - a| = r$ of radius $r$ with center $a$, where $r$ is sufficiently small so that the circle is in $\Omega$. Using the geometric series expansion
$$\frac{1}{\zeta - z} = \frac{1}{\zeta - a} \sum_{m=0}^{\infty} \left(\frac{z - a}{\zeta - a}\right)^m$$
and interchanging summation and integration (which is valid as all the series converge uniformly), we obtain
$$f(z) = \sum_{m=0}^{\infty} \frac{(z - a)^m}{2\pi i} \int_{|\zeta - a| = r} \frac{f(\zeta)\,d\zeta}{(\zeta - a)^{m+1}}.$$
This gives an explicit formula for the power series expansion of $f(z)$ and shows that a holomorphic function is analytic. Since an analytic function is clearly differentiable, being analytic and being holomorphic are the same property.
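Cauchy's integral formula, including its version for derivatives, lends itself to a direct numerical check. The sketch below is a hedged illustration: the function name `cauchy_derivative`, the choice $f(z) = e^z$, the unit-circle contour, and the sample points are assumptions of this example. It approximates the contour integral by an equally spaced sum on the circle and compares the result with the known derivatives of $e^z$. A point outside the circle is included as well; there the winding number is 0 and the integral vanishes.

```python
import cmath
import math

def cauchy_derivative(f, z, m=0, center=0j, radius=1.0, n=400):
    """Approximate m!/(2*pi*i) times the integral of f(zeta)/(zeta - z)^(m+1)
    over the counterclockwise circle |zeta - center| = radius,
    using n equally spaced sample points on the circle."""
    dt = 2 * math.pi / n
    total = 0j
    for k in range(n):
        t = k * dt
        zeta = center + radius * cmath.exp(1j * t)
        dzeta = 1j * radius * cmath.exp(1j * t) * dt
        total += f(zeta) / (zeta - z) ** (m + 1) * dzeta
    return math.factorial(m) / (2j * math.pi) * total

z_inside = 0.3 + 0.2j   # inside the unit circle
z_outside = 2.0 + 0.0j  # outside the unit circle (winding number 0)

value = cauchy_derivative(cmath.exp, z_inside, m=0)
second = cauchy_derivative(cmath.exp, z_inside, m=2)
outside = cauchy_derivative(cmath.exp, z_outside, m=0)

# Every derivative of exp is exp itself.
print(abs(value - cmath.exp(z_inside)))
print(abs(second - cmath.exp(z_inside)))
print(abs(outside))
```

For periodic analytic integrands the equally spaced sum converges very quickly, which is why a few hundred points already reproduce the values to near machine precision.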
One way to prove Cauchy's theorem (due to Goursat) is to observe that if the curve is "very small," then the line integral should also be "very small" because holomorphic functions cannot change drastically in a small neighborhood. Hence, we can prove the theorem by carefully decomposing the curve into a union of smaller curves.
Another way to think of Cauchy's theorem is that a line integral over a curve $\gamma$ of a holomorphic function on a region $\Omega$ is zero whenever $\gamma$ can be continuously shrunk to a point in $\Omega$.

Cauchy's integral formula. Let $f(z)$ be holomorphic on a simply connected region $\Omega$ in $\mathbb{C}$ and let $\gamma$ be a simple piecewise continuously differentiable closed path going counterclockwise in $\Omega$. Then for every point $z$ in $\Omega$ inside $\gamma$,

C. Geometric Properties of Analytic Functions
Analytic functions satisfy many nice geometric and topological properties. A basic property is the following.

The open mapping theorem. The image of an open set under a non-constant analytic function is open.

Analytic functions also have the nice geometric property that they preserve angles. Let $\gamma_1$ and $\gamma_2$ be two differentiable curves intersecting at a point $z$ in $\Omega$. The angle from $\gamma_1$ to $\gamma_2$ is the signed angle of their tangent lines at $z$. A function $f(z)$ is said to be conformal at the point $z$ if it preserves the angles of pairs of curves intersecting at $z$. An analytic function $f(z)$ preserves angles at points $z$ where the derivative $f'(z) \ne 0$. If an analytic function $f(z)$ is conformal at every point in $\Omega$ (equivalently, if $f'(z) \ne 0$

for every point $z$ in $\Omega$), then it is said to be a conformal mapping on $\Omega$.
Examples of conformal mappings (on suitable regions) are Möbius transformations. Let $a$, $b$, $c$, and $d$ be complex numbers such that $ad - bc \ne 0$. Then the bilinear transformation
$$T(z) = \frac{az + b}{cz + d}$$
is a Möbius transformation. Any Möbius transformation can be decomposed into a product of four elementary conformal mappings: translations, rotations, homotheties or dilations, and inversions. In addition to preserving angles, Möbius transformations map circles into circles, provided that a straight line is viewed as a "circle" passing through the point $\infty$ at infinity.

D. Some Theorems about Analytic Functions
If $f(z)$ is a function on the disk with center 0 and radius $\rho$ and $r < \rho$, let $M_f(r)$ be the maximum value of $|f(z)|$ on the circle $\{z : |z| = r\}$.

Cauchy's inequality. Let $f(z)$ be analytic in the disk with center 0 and radius $\rho$ and let $f(z) = \sum_{n=0}^{\infty} a_n z^n$. If $r < \rho$, then
$$|a_n|\, r^n \le M_f(r).$$
A function is entire if it is analytic on the entire complex plane. The following are two theorems about entire functions. The first follows easily from the case $n = 1$ of Cauchy's inequality.

Liouville's theorem. If $f(z)$ is an entire function and $f(z)$ is bounded on $\mathbb{C}$, then $f$ must be a constant function.

The second is much harder. It implies the fact that if two or more complex numbers are absent from the image of an entire function, then that entire function must be a constant.

Picard's little theorem. An entire function that is not a polynomial takes every value, with one possible exception, infinitely many times.

Applying Liouville's theorem to the reciprocal of a nonconstant polynomial $p(z)$ and using the fact that $p(z) \to \infty$ as $z \to \infty$, one obtains the following important theorem.

The fundamental theorem of algebra. Every polynomial with complex coefficients of degree at least one has a root in $\mathbb{C}$.

It follows that a polynomial of degree $n$ must have $n$ roots in $\mathbb{C}$, counting multiplicities.
The next theorem is a fundamental property of analytic functions.

Maximum principle. Let $\Omega$ be a bounded domain in $\mathbb{C}$. Suppose that $f(z)$ is analytic in $\Omega$ and continuous in the closure of $\Omega$. Then, $|f(z)|$ attains its maximum value $|f(a)|$ at a boundary point $a$ of $\Omega$. If $f(z)$ is not constant, then
$$|f(z)| < |f(a)|$$
for every point $z$ in the interior of $\Omega$.

Using the maximum principle, one obtains Schwarz's lemma.

Schwarz's lemma. Let $f(z)$ be analytic on the disk $D = \{z : |z| < 1\}$ with center 0 and radius 1. Suppose that $f(0) = 0$ and $|f(z)| \le 1$ for all points $z$ in $D$. Then,
$$|f'(0)| \le 1$$
and for all points $z$ in $D$,
$$|f(z)| \le |z|.$$
If $|f'(0)| = 1$ or $|f(a)| = |a|$ for some nonzero point $a$, then $f(z) = \alpha z$ for some complex number $\alpha$ with $|\alpha| = 1$.

Another theorem, which has inspired many generalizations in the theory of partial differential equations, is the following.

Hadamard's three-circles theorem. Let $f(z)$ be an analytic function on the annulus $\{z : \rho_1 \le |z| \le \rho_3\}$ and let $\rho_1 < r_1 \le r_2 \le r_3 < \rho_3$. Then
$$\log M_f(r_2) \le \frac{\log r_3 - \log r_2}{\log r_3 - \log r_1} \log M_f(r_1) + \frac{\log r_2 - \log r_1}{\log r_3 - \log r_1} \log M_f(r_3).$$
It follows from the three-circles theorem that $\log M_f(r)$ is a convex function of $\log r$.
Cauchy's integral theorem has the following converse.

Morera's theorem. Let $f(z)$ be a continuous function on a simply connected region $\Omega$ in $\mathbb{C}$. Suppose that the line integral
$$\int_{\partial\Delta} f(z)\,dz$$
over the boundary $\partial\Delta$ of every triangle $\Delta$ in $\Omega$ is zero. Then $f(z)$ is analytic in $\Omega$.

E. Analytic Continuation
An analytic function $f(z)$ is usually defined initially with a certain formula in some region $D_1$ of the complex plane. Sometimes, one can extend the function $f(z)$ to a function $\hat{f}(z)$ that is analytic on a bigger region $D_2$ containing $D_1$ such that $\hat{f}(z) = f(z)$ for all points $z$ on $D_1$. Such an extension is called analytic continuation. Expanding the function as a Taylor (or Laurent) series is one possible

way to extend a function locally from a neighborhood of a point. Contour integration is another way.
Analytic continuation results in a unique extended function when it is possible. It also preserves identities between functions.

The uniqueness theorem for analytic continuation. Let $f(z)$ and $g(z)$ be analytic in a region $\Omega$. If the set of points $z$ in $\Omega$ where $f(z) = g(z)$ has a limit point in $\Omega$, then $f(z) = g(z)$ for all $z$ in $\Omega$. In particular, if the set of zeros of $f(z)$ in $\Omega$ has a limit point in $\Omega$, then $f(z)$ is identically zero in $\Omega$.

Permanence of functional relationships. If a finite number of analytic functions in a region $\Omega$ satisfy a certain functional equation in a part of $\Omega$ that has a limit point, then that functional equation holds everywhere in $\Omega$.

For example, the Pythagorean identity
$$\sin^2 x + \cos^2 x = 1$$
holds for all real numbers $x$. Thus, it holds for all complex numbers.
Deriving a functional equation is often the key step in analytic continuation. A famous example is the Riemann functional equation relating the gamma function and the Riemann zeta function.
A quick and direct way (when it works) is the following method.

Schwarz's reflection principle. Let $\Omega$ be a domain in the upper half-plane that contains a line segment $L$ on the real axis. Let $f(z)$ be a function analytic on $\Omega$ and continuous and real-valued on $L$. Then the function extended by defining $f(z) = \overline{f(\bar{z})}$ for $z$ in the "reflection" $\tilde{\Omega} = \{z \mid \bar{z} \in \Omega\}$ is an analytic continuation of $f(z)$ from $\Omega$ to the bigger domain $\Omega \cup \tilde{\Omega}$.

We note that not all analytic functions have proper analytic continuations. For example, when $0 < |a| < 1$, the Fredholm series
$$\sum_{n=1}^{\infty} a^n z^{n^2}$$
converges absolutely on the closed unit disk $\{z : |z| \le 1\}$ and defines an analytic function $f(z)$ on the open disk $\{z : |z| < 1\}$. However, it can be shown that $f(z)$ has no analytic continuation outside the unit disk. Roughly speaking, the Fredholm functions are not extendable because of "gaps" in the powers of $z$. The sharpest "gap" theorem is the following.

The Fabry gap theorem. Let
$$f(z) = \sum_{m=0}^{\infty} a_m z^{b_m}$$
where $b_m$ is a sequence of increasing nonnegative integers such that
$$\lim_{m \to \infty} b_m/m = \infty.$$
Suppose that the power series has radius of convergence $\rho$. Then $f(z)$ has no analytic continuation outside the disk $\{z : |z| < \rho\}$.

IV. MEROMORPHIC FUNCTIONS

A. Poles and Meromorphic Functions
A point $a$ is an isolated singularity of the analytic function $f(z)$ if $f(z)$ is analytic in a neighborhood of $a$, except possibly at the point itself. For example, the function $f(z) = 1/z$ is analytic on the entire complex plane, except at the isolated singularity $z = 0$. If the limit $\lim_{z \to a} f(z)$ is a finite complex number $c$, then we can simply define $f(a) = c$ and $f(z)$ will be analytic on the entire neighborhood. Such an isolated singularity is said to be removable. If the limit $\lim_{z \to a} f(z)$ is infinite, but for some positive real number $\alpha$, $\lim_{z \to a} |z - a|^{\alpha} |f(z)|$ is finite, then $a$ is a pole of $f(z)$. It can be proved that if the last condition holds, then the smallest such real number $\alpha$ must be a positive integer $k$. In this case, the pole $a$ is said to have order $k$. Equivalently, $k$ is the smallest positive integer such that $(z - a)^k f(z)$ is analytic on an entire neighborhood of $a$ (including $a$ itself).
If an isolated singularity is neither removable nor a pole, it is said to be essential. Weierstrass and Casorati proved that an analytic function comes arbitrarily close to every complex number in every neighborhood of an isolated essential singularity. A refinement of this is a deep theorem of Picard.

Picard's great theorem. An analytic function takes on every complex value with one possible exception in every neighborhood of an essential singularity.

These results say that near an isolated essential singularity, an analytic function behaves very wildly. Thus, the study of isolated singularities has concentrated on analytic functions with poles. A complex function $f(z)$ is meromorphic in a region $\Omega$ if it is analytic except at a discrete set of points, where it may have poles.
The residue $\mathrm{Res}(f, a)$ of a meromorphic function $f(z)$ at the isolated singularity $a$ is defined by
$$\mathrm{Res}(f, a) = \frac{1}{2\pi i} \int_\gamma f(\zeta)\,d\zeta,$$
where $\gamma$ is a sufficiently small circle around $a$, traversed counterclockwise.
If one allows negative powers, then analytic functions can be expanded as power series at isolated singularities. The idea is to write a meromorphic function $f(z)$ in a neighborhood of a pole $a$ as a sum of an analytic part and a singular part. Suppose the function $f(z)$ is analytic in a region containing the annulus $\{z : \rho_1 < |z - a| < \rho_2\}$.

Then we can define two functions $f_1(z)$ and $f_2(z)$ by:
$$f_1(z) = \frac{1}{2\pi i} \int_{\{\zeta : |\zeta - a| = r\}} \frac{f(\zeta)\,d\zeta}{\zeta - z}$$
where $r$ satisfies $|z - a| < r < \rho_2$, and
$$f_2(z) = -\frac{1}{2\pi i} \int_{\{\zeta : |\zeta - a| = r\}} \frac{f(\zeta)\,d\zeta}{\zeta - z}$$
where $r$ satisfies $\rho_1 < r < |z - a|$. The function $f_1(z)$ is analytic in the disk $\{z : |z - a| < \rho_2\}$ and the function $f_2(z)$ is analytic in the complement $\{z : |z - a| > \rho_1\}$. By the Cauchy integral formula, $f(z) = f_1(z) + f_2(z)$ and this representation is valid in the annulus $\{z : \rho_1 < |z - a| < \rho_2\}$.
The functions $f_1(z)$ and $f_2(z)$ can each be expanded as Taylor series. Using the transformation $z - a \to 1/z$ and some simple calculation, we obtain the Laurent series expansion
$$f(z) = \sum_{m=-\infty}^{\infty} a_m (z - a)^m$$
where
$$a_m = \frac{1}{2\pi i} \int_{\{\zeta : |\zeta - a| = r\}} \frac{f(\zeta)\,d\zeta}{(\zeta - a)^{m+1}},$$
valid in the annulus $\{z : \rho_1 < |z - a| < \rho_2\}$. Note that
$$\mathrm{Res}(f, a) = a_{-1},$$
and the point $a$ is a pole of order $k$ if and only if $a_{-k} \ne 0$ and every coefficient $a_{-m}$ with $m > k$ is zero. The polynomial
$$\sum_{m=1}^{k} \frac{a_{-m}}{(z - a)^m}$$
in the variable $1/(z - a)$ is called the singular or principal part of $f(z)$ at the pole $a$.

B. Elliptic Functions
Important examples of meromorphic functions are elliptic functions. Elliptic functions arose from attempts to evaluate certain integrals. For example, to evaluate the integral
$$\int_0^1 \sqrt{1 - x^2}\,dx$$
which gives the area $\pi/4$ of a quarter circle with radius 1, one can use the substitution $x = \sin\theta$. However, to evaluate integrals of the form
$$\int_0^1 \sqrt{(1 - x^2)(1 - k^2 x^2)}\,dx$$
we need elliptic functions. Elliptic functions are doubly periodic generalizations of trigonometric functions.
Let $\omega_1$ and $\omega_2$ be two complex numbers whose ratio $\omega_1/\omega_2$ is not real and let $L$ be the "integer lattice" $\{c\omega_1 + d\omega_2\}$, where $c$ and $d$ range over all integers. An elliptic function is a meromorphic function on the complex plane with two (independent) periods, $\omega_1$ and $\omega_2$, that is, $f(z + \omega_1) = f(z)$, $f(z + \omega_2) = f(z)$, and every complex number $\omega$ such that $f(z + \omega) = f(z)$ for all points $z$ in the complex plane is a number in the lattice $L$.
A specific example of an elliptic function is the Weierstrass $\wp$-function defined by the formula
$$\wp(z) = \frac{1}{z^2} + \sum_{\omega \in L \setminus \{0\}} \left( \frac{1}{(z - \omega)^2} - \frac{1}{\omega^2} \right).$$
This defines a meromorphic function on the complex plane that is doubly periodic with periods $\omega_1$ and $\omega_2$. The $\wp$-function has poles exactly at points in $L$. Weierstrass proved that every elliptic function with periods $\omega_1$ and $\omega_2$ can be written as a rational function of $\wp$ and its derivative $\wp'$.

C. The Cauchy Residue Theorem
A simple but useful generalization of Cauchy's integral formula is the Cauchy residue theorem.

The Cauchy residue theorem. Let $\Omega$ be a simply connected region in the complex plane, let $f(z)$ be a function analytic on $\Omega$ except at the isolated singularities $a_m$, and let $\gamma$ be a closed piecewise continuously differentiable curve in $\Omega$ that does not pass through any of the points $a_m$. Then
$$\frac{1}{2\pi i} \int_\gamma f(z)\,dz = \sum n(\gamma, a_m)\,\mathrm{Res}(f, a_m)$$
where the sum ranges over all the isolated singularities inside the curve $\gamma$.

Cauchy's residue theorem has the following useful corollary.

The argument principle. Let $f(z)$ be a meromorphic function in a simply connected region $\Omega$, let $a_1, a_2, \ldots,$ be the zeros of $f(z)$ in $\Omega$, and let $b_1, b_2, \ldots,$ be the poles of $f(z)$ in $\Omega$. Suppose the zero $a_m$ has multiplicity $s_m$ and the pole $b_j$ has order $t_j$. Let $\gamma$ be a closed piecewise continuously differentiable curve in $\Omega$ that does not pass through any poles or zeros of $f(z)$. Then
$$\frac{1}{2\pi i} \int_\gamma \frac{f'(\zeta)\,d\zeta}{f(\zeta)} = \sum s_m\, n(\gamma, a_m) - \sum t_j\, n(\gamma, b_j)$$
where the sum ranges over all the zeros and poles of $f(z)$ contained in the curve $\gamma$.

The name "argument principle" came from the following special case. When $\gamma$ is a circle, the argument principle says that the change in the argument of $f(z)$ as $z$ traces the circle in a counterclockwise direction, divided by $2\pi$, equals
$$Z(f) - P(f)$$

the difference between the number $Z(f)$ of zeros and the number $P(f)$ of poles of $f(z)$ inside $\gamma$, counting multiplicities and orders.

for every point $z$ on the curve bounding $\Omega$, then
$$Z(f) - P(f) = Z(g) - P(g)$$
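The argument principle gives a practical way to count zeros and poles inside a circle: integrate $f'/f$ around it and divide by $2\pi i$. The following is a minimal sketch under stated assumptions: the function name `zeros_minus_poles`, the test functions $z^3 - 1$ and $(z^3 - 1)/z^2$, and the sampling density are choices made for illustration, and the derivative is supplied by hand rather than computed.

```python
import cmath
import math

def zeros_minus_poles(f, df, center, radius, n=4000):
    """Approximate 1/(2*pi*i) times the integral of f'(zeta)/f(zeta) over the
    counterclockwise circle |zeta - center| = radius. By the argument principle
    this equals (number of zeros) - (number of poles) inside the circle,
    counted with multiplicity, provided none lie on the circle."""
    dt = 2 * math.pi / n
    total = 0j
    for k in range(n):
        t = k * dt
        zeta = center + radius * cmath.exp(1j * t)
        dzeta = 1j * radius * cmath.exp(1j * t) * dt
        total += df(zeta) / f(zeta) * dzeta
    return total / (2j * math.pi)

# p(z) = z^3 - 1 has three simple zeros, the cube roots of unity.
p = lambda z: z ** 3 - 1
dp = lambda z: 3 * z ** 2

count_big = zeros_minus_poles(p, dp, center=0j, radius=2.0)       # all three roots
count_small = zeros_minus_poles(p, dp, center=1 + 0j, radius=0.5)  # only the root z = 1

# g(z) = (z^3 - 1)/z^2 has the same zeros and a pole of order 2 at 0.
g = lambda z: (z ** 3 - 1) / z ** 2
dg = lambda z: 1 + 2 / z ** 3

count_mero = zeros_minus_poles(g, dg, center=0j, radius=2.0)       # 3 zeros - 2 poles

print(round(count_big.real), round(count_small.real), round(count_mero.real))
```

Rounding the real part to the nearest integer is a design choice of this sketch; it is safe here because the numerical error is far smaller than 1/2.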
Hurwitz's theorem. Let $(f_n(z) : n = 1, 2, \ldots)$ be a sequence of functions analytic in a region $\Omega$ bounded by a simple closed piecewise continuously differentiable curve such that $f_n(z)$ converges uniformly to a nonzero (analytic) function $f(z)$ on every closed subset of $\Omega$. Let $a$ be an interior point of $\Omega$. If $a$ is a limit point of the set of zeros of the functions $f_n(z)$, then $a$ is a zero of $f(z)$. If $a$ is a zero of $f(z)$ with multiplicity $m$, then every sufficiently small neighborhood $K$ of $a$ contains exactly $m$ zeros of the functions $f_n(z)$, for all $n$ greater than a number $N$ depending on $K$.

F. Infinite Products, Partial Fractions, and Approximations
A natural way to write a meromorphic function is in terms of its zeros and poles. For example, because $\sin \pi z$ has zeros at the integers, we expect to be able to "factor" it into a product. Indeed, Euler wrote down the following product expansion:
$$\sin \pi z = \pi z \prod_{n=1}^{\infty} \left(1 - \frac{z}{n}\right)\left(1 + \frac{z}{n}\right).$$
With complex analysis, one can justify such expansions rigorously.

D. Evaluation of Real Integrals
Cauchy's residue theorem can be used to evaluate real definite integrals that are otherwise difficult to evaluate. For example, to evaluate an integral of the form
$$\int_0^{2\pi} R(\cos\theta, \sin\theta)\,d\theta$$
where $R$ is a rational function of $\cos\theta$ and $\sin\theta$, let $z = e^{i\theta}$. If we make the substitutions
$$\cos\theta = \frac{z + z^{-1}}{2}, \qquad \sin\theta = \frac{z - z^{-1}}{2i}$$
the integral becomes a line integral over the unit circle of the form
$$\int_{|z| = 1} S(z)\,dz$$
where $S(z)$ is a rational function of $z$. By Cauchy's residue theorem, this integral equals $2\pi i$ times the sum of the residues of the poles of $S(z)$ inside the unit circle. Using this method, one can prove, for example, that if $a > b > 0$,
$$\int_0^{2\pi} \frac{d\theta}{(a + b\cos^2\theta)^2} = \frac{\pi(2a + b)}{a^{3/2}(a + b)^{3/2}}.$$
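The closed form for the integral of $1/(a + b\cos^2\theta)^2$ over $[0, 2\pi]$ can be sanity-checked by direct numerical quadrature. The sketch below is only a check of the stated formula, not a residue computation; the function names and the sample values of $a$ and $b$ are assumptions chosen for illustration.

```python
import math

def integral_numeric(a, b, n=2000):
    """Equally spaced (periodic trapezoid) approximation of the integral
    of 1 / (a + b*cos(theta)^2)^2 over [0, 2*pi]."""
    dt = 2 * math.pi / n
    return sum(dt / (a + b * math.cos(k * dt) ** 2) ** 2 for k in range(n))

def integral_closed_form(a, b):
    """Closed form pi*(2a + b) / (a^(3/2) * (a + b)^(3/2)) quoted in the text."""
    return math.pi * (2 * a + b) / (a ** 1.5 * (a + b) ** 1.5)

# Sample parameters with a > b > 0.
samples = [(2.0, 1.0), (3.0, 0.5), (1.5, 1.2)]
errors = [abs(integral_numeric(a, b) - integral_closed_form(a, b))
          for a, b in samples]

for (a, b), err in zip(samples, errors):
    print(a, b, integral_closed_form(a, b), err)
```

Because the integrand is smooth and periodic, the equally spaced sum converges very fast, so agreement to many digits is expected for these parameters.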
One can also evaluate improper integrals, obtaining formulas such as the following formula due to Euler: For $-1 < p < 1$ and $-\pi < \alpha < \pi$,
$$\int_0^{\infty} \frac{x^{-p}\,dx}{1 + 2x\cos\alpha + x^2} = \frac{\pi \sin p\alpha}{\sin p\pi\, \sin\alpha}.$$

E. Location of Zeros
It is often useful to locate zeros of polynomials in the complex plane. An elegant theorem, which can be proved by elementary arguments, is the following result.

Lucas' theorem. Let $p(z)$ be a polynomial of degree at least 1. All the zeros of the derivative $p'(z)$ lie in the convex closure of the set of zeros of $p(z)$.

Deeper results usually involve using some form of Rouché's theorem, which is proved using the argument principle.

Rouché's theorem. Let $\Omega$ be a region bounded by a simple closed piecewise continuously differentiable curve. Let $f(z)$ and $g(z)$ be two functions meromorphic in an open set containing the closure of $\Omega$. If $f(z)$ and $g(z)$ satisfy
$$|f(z) - g(z)| < |f(z)| + |g(z)|$$

The question of convergence of an infinite product is easily resolved. By taking logarithms, one can reduce it to a question of convergence of a sum. For example, the product
$$\prod_{m=1}^{\infty} (1 + a_m)$$
converges absolutely if and only if the sum $\sum_{m=1}^{\infty} |\log(1 + a_m)|$ converges. Since $|\log(1 + a_m)|$ is approximately $|a_m|$, the product converges absolutely if and only if the series $\sum_{m=1}^{\infty} |a_m|$ converges.
The following theorem allows us to construct an entire function with a prescribed set of zeros.

The Weierstrass product theorem. Let $(a_j : j = 1, 2, \ldots)$ be a sequence of nonzero complex numbers in which no complex number occurs infinitely many times. Suppose that the set $\{a_j\}$ has no (finite) limit point in the complex plane. Then there exists an entire function $f(z)$ with a zero of multiplicity $m$ at 0, zeros in the set $\{a_j\}$ with the correct multiplicity, and no other zeros. This function can be written in the form
$$f(z) = z^m e^{g(z)} \prod_{j=1}^{\infty} \left(1 - \frac{z}{a_j}\right) \exp\left( \frac{z}{a_j} + \frac{1}{2}\left(\frac{z}{a_j}\right)^2 + \cdots + \frac{1}{m_j}\left(\frac{z}{a_j}\right)^{m_j} \right)$$

where $m_j$ are positive integers depending on the set $\{a_j\}$, and g(z) is an entire function.

From this theorem, we can derive the following representation of a meromorphic function.

Theorem. A meromorphic function on the complex plane is the quotient of two entire functions. The two entire functions can be chosen so that they have no common zeros.

In particular, one can think of meromorphic functions as generalizations of rational functions.

The gamma function Γ(z) is a useful function which can be defined by a product formula. Indeed,

$$\Gamma(z) = \frac{e^{-\gamma z}}{z} \prod_{m=1}^{\infty} \left(1 + \frac{z}{m}\right)^{-1} e^{z/m}$$

where γ is the Euler–Mascheroni constant defined by

$$\gamma = \lim_{n \to \infty} \left( \sum_{m=1}^{n} \frac{1}{m} - \log n \right).$$

It equals 0.57722 . . . . The gamma function interpolates the integer factorials. Specifically, Γ(n) = (n − 1)! for a positive integer n, and Γ(z) satisfies the functional equation

$$\Gamma(z + 1) = z\,\Gamma(z).$$

Another useful functional equation is the Legendre formula

$$\sqrt{\pi}\,\Gamma(2z) = 2^{2z-1}\,\Gamma(z)\,\Gamma(z + 1/2).$$

Setting z = 1/2 gives the following useful value: Γ(1/2) = √π.

Rational functions can be represented as partial fractions; so can meromorphic functions.

Mittag-Leffler’s theorem. Let $\{b_j : j = 1, 2, \ldots\}$ be a set of complex numbers with no finite limit point in the complex plane, and let $p_j(z)$ be given polynomials with zero constant terms, one for each point $b_j$. Then there exist meromorphic functions in the complex plane with poles at $b_j$ with singular parts $p_j(1/(z - b_j))$. These functions have the form

$$g(z) + \sum_{j=1}^{\infty} \left[ p_j\!\left(\frac{1}{z - b_j}\right) - q_j(z) \right]$$

where $q_j(z)$ are suitably chosen polynomials depending on $p_j(z)$, and g(z) is an entire function.

Taking logarithmic derivatives and integrating, one can derive Weierstrass’s product theorem from Mittag-Leffler’s theorem.

Two examples of partial fraction expansions of meromorphic functions are

$$\frac{\pi^2}{\sin^2 \pi z} = \sum_{n=-\infty}^{\infty} \frac{1}{(z - n)^2}$$

and

$$\pi \cot \pi z = \frac{1}{z} + \sum_{n=1}^{\infty} \frac{2z}{z^2 - n^2}.$$

Runge’s approximation theorem says that a function analytic on a bounded region with holes can be uniformly approximated by a rational function all of whose poles lie in the holes. Runge’s theorem can be proved using a version of Cauchy’s integral formula for compact sets.

Runge’s approximation theorem. Let f(z) be an analytic function on a region Ω in the complex plane; let K be a compact subset of Ω. Let ε > 0 be a given (small) positive real number. Then there exists a rational function r(z) with all its poles outside K such that

$$|f(z) - r(z)| < \varepsilon$$

for all z in K.

V. SOME ADVANCED TOPICS

A. Riemann Mapping Theorem

On the complex plane, most simply connected regions can be mapped conformally in a one-to-one way onto the open unit disk. This is a useful result because it reduces many problems to problems about the unit disk.

Riemann mapping theorem. Let Ω be a simply connected region that is not the entire complex plane, and let a be a point in Ω. Then there exists a unique analytic function f(z) on Ω such that f(a) = 0, f′(a) > 0, and f(z) is a one-to-one mapping of Ω onto the open disk {z : |z| < 1} with radius 1.

For example, the upper half-plane {ζ : Im(ζ) > 0} and the unit disk {z : |z| < 1} are mapped conformally onto each other by the Möbius transformations

$$\zeta = \frac{i(1 + z)}{1 - z}, \qquad z = \frac{\zeta - i}{\zeta + i}.$$

The Schwarz–Christoffel formula gives an explicit formula for one-to-one onto conformal maps from the open unit disk or the upper half-plane to polygonal domains, which are sets in the complex plane bounded by a closed simple curve made up of a finite number of straight line segments. It is rather complicated to state, and we refer the reader to the books by Ahlfors and Nehari in the bibliography. As an example, one can map the upper half-plane into a triangle with interior angles $\alpha_1\pi$, $\alpha_2\pi$, and $\alpha_3\pi$ (where $\alpha_1 + \alpha_2 + \alpha_3 = 1$) by the inverse of the function

$$F(z) = \int_0^z t^{\alpha_1 - 1} (t - 1)^{\alpha_2 - 1}\, dt$$
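As a quick numerical sanity check of the Möbius transformations between the upper half-plane and the unit disk, the following Python sketch evaluates both maps on a few sample points. The function names `to_half_plane` and `to_disk` and the sample points are illustrative assumptions, not part of the original article.

```python
# Sketch: check that z = (zeta - i)/(zeta + i) and zeta = i(1 + z)/(1 - z)
# are mutually inverse and send Im(zeta) > 0 into |z| < 1.

def to_half_plane(z: complex) -> complex:
    """Map a point of the unit disk to the upper half-plane."""
    return 1j * (1 + z) / (1 - z)


def to_disk(zeta: complex) -> complex:
    """Map a point of the upper half-plane to the unit disk."""
    return (zeta - 1j) / (zeta + 1j)


# Arbitrary sample points with positive imaginary part.
samples = [0.5 + 2j, -3 + 0.1j, 1j, 10 + 10j]

for zeta in samples:
    z = to_disk(zeta)
    # The image lies strictly inside the unit disk.
    assert abs(z) < 1
    # Applying the other map recovers the starting point (up to rounding).
    assert abs(to_half_plane(z) - zeta) < 1e-9

print("all sample points pass")
```

The point ζ = i is sent to the center z = 0 of the disk, which is a convenient way to remember the orientation of the two formulas.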

The function F(z) is called the triangle function of Schwarz.

B. Univalent Functions and Bieberbach’s Conjecture

Univalent (or schlicht) functions are one-to-one analytic functions. They have been extensively studied in complex analysis. A famous result in this area was the Bieberbach conjecture, which was proved by de Branges in 1984.

Theorem. Let f(z) be a univalent analytic function on the unit disk {z : |z| < 1} with power series expansion

$$f(z) = z + a_2 z^2 + a_3 z^3 + \cdots$$

(that is, f(0) = 0 and f′(0) = 1). Then

$$|a_n| \le n.$$

When equality holds, $f(z) = e^{-i\theta} K(e^{i\theta} z)$, where

$$K(z) = \frac{z}{(1 - z)^2} = z + 2z^2 + 3z^3 + 4z^4 + \cdots$$

The function K(z) is called Koebe’s function.

Another famous result in this area is due to Koebe.

Koebe’s 1/4-theorem. Let f(z) be a univalent function on the unit disk, normalized as above so that f(0) = 0 and f′(0) = 1. Then the image of the unit disk {z : |z| < 1} under f contains the disk {z : |z| < 1/4} with radius 1/4.

The constant 1/4 (the “Koebe constant”) is the best possible.

C. Harmonic Functions

Harmonic functions, defined in Section II, are real functions u(x, y) satisfying Laplace’s equation. We shall use the notation: u(x, y) = u(x + iy) = u(z). Harmonic functions have several important properties.

The mean-value property. Let u(z) be a harmonic function on a region Ω. Then for any disk D with center a and radius r whose closure is contained in Ω,

$$u(a) = \frac{1}{2\pi} \int_0^{2\pi} u(a + r e^{i\theta})\, d\theta.$$

The maximum principle for harmonic functions. Let u(z) be a harmonic function on the domain Ω. If there is a point a in Ω such that u(a) equals the maximum max{u(z) : z ∈ Ω}, then u(z) is a constant function.

An important problem in the theory of harmonic functions is Dirichlet’s problem. Given a simply connected domain Ω and a piecewise continuous function g(z) on the boundary ∂Ω, find a function u(z) in the closure $\bar{\Omega}$ such that u(z) is harmonic in Ω, and the restriction of u(z) to the boundary equals g(z). (The boundary ∂Ω is the set $\bar{\Omega} \setminus \Omega$.)

For general simply connected domains, Dirichlet’s problem is difficult. It is equivalent to finding a Green’s function. For a disk, the following formula is known.

The Poisson formula. Let $g(z) = g(e^{i\phi})$, −π < φ ≤ π, be a piecewise continuous function on the boundary {z : |z| = 1} of the unit disk. Then the function

$$u(z) = u(re^{i\theta}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1 - r^2}{1 + r^2 - 2r\cos(\phi - \theta)}\, g(e^{i\phi})\, d\phi$$

is a solution to Dirichlet’s problem for the unit disk.

D. Riemann Surfaces

A Riemann surface S is a one-dimensional complex connected paracompact Hausdorff space equipped with a conformal atlas, that is, a set of maps or charts {h : D → N}, where D is the open disk {z : |z| < 1} and N is an open set of S, such that

1. The union of all the open sets N is S.
2. The chart h : D → N is a homeomorphism of the disk D to N.
3. Let $N_1$ and $N_2$ be neighborhoods with charts $h_1$ and $h_2$. If the intersection $N_1 \cap N_2$ is nonempty and connected, then the composite mapping $h_2^{-1} \circ h_1$, defined on the inverse image $h_1^{-1}(N_1 \cap N_2)$, is conformal.

Riemann surfaces originated in an attempt to make a “multivalued” analytic function single-valued by making its domain a Riemann surface. Examples of multivalued functions are algebraic functions. These are functions f(z) satisfying a polynomial equation P(z, f(z)) = 0. A specific example of this is the square-root function f(z) = √z, which takes on two values ±√z except when z = 0. This can be made into a single-valued function using the Riemann surface obtained by gluing together two sheets or copies of the complex plane cut from 0 to ∞ along the positive real axis. Another example is the logarithmic function log z, which requires a Riemann surface made from countably infinitely many sheets.

Intuitively, the genus of a Riemann surface S is the number of “holes” it has. The genus can be defined as the maximum number of disjoint simple closed curves that do not disconnect S. For example, the extended complex plane has genus 0 and a torus has genus 1. There are many results about Riemann surfaces. The following are two results that can be simply stated.

Picard’s theorem. Let P(u, v) be an irreducible polynomial with complex coefficients in two variables u and v. If there exist nonconstant entire functions f(z) and g(z) satisfying P(f(z), g(z)) = 0 for all complex numbers z,
then the Riemann surface associated with the algebraic equation P(u, v) = 0 has genus 0.

Koebe’s uniformization theorem. If S is a simply connected Riemann surface, then S is conformally equivalent to

1. (Elliptic type) The Riemann sphere. In this case, S is the sphere.
2. (Parabolic type) The complex plane C. In this case, S is biholomorphic to C, C\{0}, or a torus.
3. (Hyperbolic type) The unit disk {z : |z| < 1}.

Complex manifolds are higher-dimensional generalizations of Riemann surfaces. They have been extensively studied.

E. Other Topics

Complex analysis is a vast and ever-expanding area. “Nine lifetimes” do not suffice to cover every topic. Some interesting areas that we have not covered are complex differential equations, complex dynamics, Montel’s theorem and normal families, value distribution theory, and the theory of complex functions in several variables. Several books on these topics are listed in the Bibliography.

SEE ALSO THE FOLLOWING ARTICLES

CALCULUS • DIFFERENTIAL EQUATIONS • NUMBER THEORY • RELATIVITY, GENERAL • SET THEORY • TOPOLOGY, GENERAL

BIBLIOGRAPHY

Ahlfors, L. V. (1979). “Complex Analysis,” McGraw-Hill, New York.
Beardon, A. F. (1984). “A Primer on Riemann Surfaces,” Cambridge University Press, Cambridge, U.K.
Blair, D. E. (2000). “Inversion Theory and Conformal Mappings,” American Mathematical Society, Providence, RI.
Carleson, L., and Gamelin, T. W. (1993). “Complex Dynamics,” Springer-Verlag, Berlin.
Cartan, H. (1960). “Elementary Theory of Analytic Functions of One or Several Complex Variables,” Hermann and Addison-Wesley, Paris and Reading, MA.
Cherry, W., and Ye, Z. (2001). “Nevanlinna’s Theory of Value Distribution,” Springer-Verlag, Berlin.
Chuang, C.-T., and Yang, C.-C. (1990). “Fix-Points and Factorization of Meromorphic Functions,” World Scientific, Singapore.
Duren, P. L. (1983). “Univalent Functions,” Springer-Verlag, Berlin.
Farkas, H. M., and Kra, I. (1992). “Riemann Surfaces,” Springer-Verlag, Berlin.
Gong, S. (1999). “The Bieberbach Conjecture,” American Mathematical Society, Providence, RI.
Gunning, R. (1990). “Introduction to Holomorphic Functions of Several Variables,” Vols. I, II, and III. Wadsworth and Brooks/Cole, Pacific Grove, CA.
Hayman, W. K. (1964). “Meromorphic Functions,” Oxford University Press, Oxford.
Hille, E. (1962, 1966). “Analytic Function Theory,” Vols. I and II. Ginn-Blaisdell, Boston.
Hille, E. (1969). “Lectures on Ordinary Differential Equations,” Addison-Wesley, Reading, MA.
Hu, P.-C., and Yang, C.-C. (1999). “Differential and Complex Dynamics of One and Several Variables,” Kluwer, Boston.
Hua, X.-H., and Yang, C.-C. (1998). “Dynamics of Transcendental Functions,” Gordon and Breach, New York.
Kodaira, K. (1984). “Introduction to Complex Analysis,” Cambridge University Press, Cambridge, U.K.
Krantz, S. G. (1999). “Handbook of Complex Variables,” Birkhäuser, Boston.
Laine, I. (1992). “Nevanlinna Theory and Complex Differential Equations,” De Gruyter, Berlin.
Lang, S. (1987). “Elliptic Functions,” 2nd ed., Springer-Verlag, Berlin.
Lehto, O. (1987). “Univalent Functions and Teichmüller Spaces,” Springer-Verlag, Berlin.
Marden, M. (1949). “Geometry of Polynomials,” American Mathematical Society, Providence, RI.
McKean, H., and Moll, V. (1997). “Elliptic Curves, Function Theory, Geometry, Arithmetic,” Cambridge University Press, Cambridge, U.K.
Morrow, J., and Kodaira, K. (1971). “Complex Manifolds,” Holt, Rinehart and Winston, New York.
Nehari, Z. (1952). “Conformal Mapping,” McGraw-Hill, New York; reprinted, Dover, New York.
Needham, T. (1997). “Visual Complex Analysis,” Oxford University Press, Oxford.
Palka, B. P. (1991). “An Introduction to Complex Function Theory,” Springer-Verlag, Berlin.
Pólya, G., and Szegö, G. (1976). “Problems and Theorems in Analysis,” Vol. II. Springer-Verlag, Berlin.
Protter, M. H., and Weinberger, H. F. (1984). “Maximum Principles in Differential Equations,” Springer-Verlag, Berlin.
Remmert, R. (1993). “Classical Topics in Complex Function Theory,” Springer-Verlag, Berlin.
Rudin, W. (1980). “Function Theory in the Unit Ball of $C^n$,” Springer-Verlag, Berlin.
Schiff, J. L. (1993). “Normal Families,” Springer-Verlag, Berlin.
Schwerdtfeger, H. (1962). “Geometry of Complex Numbers,” University of Toronto Press, Toronto. Reprinted, Dover, New York.
Siegel, C. L. (1969, 1971, 1973). “Topics in Complex Function Theory,” Vols. I, II, and III, Wiley, New York.
Smithies, F. (1997). “Cauchy and the Creation of Complex Function Theory,” Cambridge University Press, Cambridge, U.K.
Steinmetz, N. (1993). “Rational Iteration, Complex Analytic Dynamical Systems,” De Gruyter, Berlin.
Titchmarsh, E. C. (1939). “The Theory of Functions,” 2nd ed., Oxford University Press, Oxford.
Vitushkin, A. G. (ed.). (1990). “Several Complex Variables I,” Springer-Verlag, Berlin.
Weyl, H. (1955). “The Concept of a Riemann Surface,” 3rd ed., Addison-Wesley, Reading, MA.
Whitney, H. (1972). “Complex Analytic Varieties,” Addison-Wesley, Reading, MA.
Whittaker, E. T., and Watson, G. N. (1969). “A Course of Modern Analysis,” Cambridge University Press, Cambridge, U.K.
Yang, L. (1993). “Value Distribution Theory,” Springer-Verlag, Berlin.
P1: GLQ/GJP P2: FQP Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN003B-132 June 13, 2001 22:45

Computer-Based Proofs of
Mathematical Theorems
C. W. H. Lam
Concordia University

I. Mathematical Theories
II. Computer Programming
III. Computer As an Aid to Mathematical Research
IV. Examples of Computer-Based Proofs
V. Proof by Exhaustive Computer Enumeration
VI. Recent Development: RSA Factoring Challenge
VII. Future Directions

GLOSSARY

Axiom A statement accepted as true without proof.
Backtrack search A method of organizing a search for solutions by a systematic extension of partial solutions.
Computer-based proof A proof with a heavy computer component; one which is impossible to do by hand.
Computer programming A process of creating a sequence of instructions to be used by a computer.
Enumerative proof A proof method by exhibiting and analyzing all possible cases.
Monte Carlo method A method of estimation by performing random choices.
Optimizing a computer program A process of fine-tuning a computer program so that it runs faster.
Predicate A statement whose truth value depends on the values of its arguments.
Proof A demonstration of the truth of a statement.
Proposition A statement which is either true or false, but not both.
Search tree A pictorial representation of the partial solutions encountered in a backtrack search.

EVER SINCE the arrival of computers, mathematicians have used them as a computational aid. Initially, they were used to perform tedious and repetitious calculations. The tremendous speed and accuracy of computers enable mathematicians to perform lengthy calculations without fear of making careless mistakes. Their main application, however, has been to obtain insight into various mathematical problems, which then led to conventional proofs independent of the computer. Recently, there was a departure from this traditional approach. By exploiting the speed of a computer, several famous and long-standing problems were settled by lengthy enumerative proofs, a technique
only suitable for computers. Two notable examples are the four-color theorem and the nonexistence of a finite projective plane of order 10. Both proofs required thousands of hours of computing and gave birth to the term “a computer-based proof.” The organization of such a proof requires careful estimation of the necessary computing time, meticulous optimization of the computer program, and prudent control of all possible computer errors. In spite of the difficulty in checking these proofs, mathematicians are starting to accept their validity. As computers are getting faster, many other famous open problems will be solved by this new approach.

I. MATHEMATICAL THEORIES

What is mathematics? It is not possible to answer this question precisely in this short article, but a generally acceptable definition is that mathematics is a study of quantities and relations using symbols and numbers. The starting point is often a few undefined objects, such as a set and its elements. A mathematical theory is then built by assuming some axioms which are statements accepted as true. From these basic components, further properties can then be derived.

For example, the study of geometry can start from the undefined objects called points. A line is then defined as a set of points. An axiom may state: “Two distinct lines contain at most one common point.” This is one of the axioms in Euclidean geometry, where it is possible to have parallel lines. The complete set of axioms of Euclidean geometry was classified by the great German mathematician David Hilbert in 1902. Using these axioms, further results can be derived. Here is an example:

Two triangles are congruent if the three sides of one are equal to the three sides of the other.

To show that these derived results follow logically from the axioms, a system of well-defined principles of mathematical reasoning is used.

A. Mathematical Statements

The ability to demonstrate the truth of a statement is central to any mathematical theory. This technique of reasoning is formalized in the study of propositional calculus.

A proposition is a statement which is either true or false, but not both. For example, the following are propositions:

• 1 + 1 = 2.
• 4 is a prime number.

The first proposition is true, and the second one is false.

The statement “x = 3” is not a proposition because its truth value depends on the value of x. It is a predicate, and its truth value depends on the value of the variable or argument x. The expression P(x) is often used to denote a predicate P with an argument x. A predicate may have more than one argument, for example, x + y = 1 has two arguments. To convert a predicate to a proposition, values have to be assigned to all the arguments. The process of associating a value to a variable is called a binding.

Another method of converting a predicate to a proposition is by the quantification of the variables. There are two common quantifiers: universal and existential. For example, the statement

For all x, x < x + 1

uses the universal quantifier “for all” to provide bindings for the variable x in the predicate x < x + 1. If the predicate is true for every possible value of x, then the proposition is true. The symbol ∀ is used to denote the phrase “for all.” Thus, the above proposition can also be written as

∀x, x < x + 1.

The set of possible values for x has to come from a certain universal set. The universal set in the above example may be the set of all integers, or it may be the set of all reals. Sometimes, the actual universal set can be deduced from context and, consequently, is not stated explicitly in the proposition. A careful mathematician may include all the details, such as

∀ real x > 1, x > √x.

The choice is often not a matter of sloppiness, but a conscious decision depending on whether all the exact details will obscure the main thrust of the result.

An existential quantifier asserts that a predicate P(x) is true for one or more x in the universal set. It is written as

For some x, P(x),

or, using the symbol ∃, as

∃x, P(x).

This assertion is false only if for all x, P(x) is false. In other words,

¬[∃x, P(x)] ≡ [∀x, ¬P(x)].

B. Enumerative Proof

A proof of a mathematical result is a demonstration of its truth. To qualify as a proof, the demonstration must be absolutely correct and there can be no uncertainty nor ambiguity in the proof.
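When the universal set is finite, the two quantifiers of Section A can be evaluated mechanically. The following Python sketch is only an illustration under that assumption: the helper names `for_all` and `exists`, the universal set of integers from −50 to 50, and the sample predicates are all chosen here and do not come from the article. It also checks the negation rule ¬[∃x, P(x)] ≡ [∀x, ¬P(x)] on those predicates.

```python
from typing import Callable, Iterable


def for_all(universe: Iterable[int], p: Callable[[int], bool]) -> bool:
    """Universal quantifier over a finite universal set."""
    return all(p(x) for x in universe)


def exists(universe: Iterable[int], p: Callable[[int], bool]) -> bool:
    """Existential quantifier over a finite universal set."""
    return any(p(x) for x in universe)


# Assumed finite universal set: the integers from -50 to 50.
universe = range(-50, 51)

# Sample predicates P(x); binding x to a value turns each into a proposition.
predicates = {
    "x < x + 1": lambda x: x < x + 1,
    "x = 3": lambda x: x == 3,
    "2x = 1": lambda x: 2 * x == 1,
}

for name, p in predicates.items():
    holds_for_all = for_all(universe, p)
    holds_for_some = exists(universe, p)
    # Negation rule: not (exists x, P(x)) is equivalent to (for all x, not P(x)).
    assert (not holds_for_some) == for_all(universe, lambda x: not p(x))
    print(f"{name}: for all = {holds_for_all}, for some = {holds_for_some}")
```

A finite check of this kind says nothing about an infinite universal set such as all integers; it only illustrates how the quantifiers turn a predicate into a proposition.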

Many interesting mathematical results involve quantifiers such as ∃ or ∀. There are many techniques to prove quantified statements, one of which is the enumerative proof. In this method, the validity of ∀x, P(x) is established by investigating P(x) for every value of x one after another. The proposition ∀x, P(x) is only true if P(x) has been verified to be true for all x. However, if P(x) is false for one of the values, then it is not necessary to consider the remaining values of x, because ∀x, P(x) is already false. Similarly, the validity of ∃x, P(x) can be found by evaluating P(x) for every value of x. The process of evaluation can stop once a value of x is found for which P(x) is true, because ∃x, P(x) is true irrespective of the remaining values of x. However, to establish that ∃x, P(x) is false, it has to be shown that P(x) is false for all values of x.

To ensure a proof that is finite in length, enumerative proofs are used where only a finite number of values of x have to be considered. It may be a case where x can take on only a finite number of values, or a situation where an infinite number of values can be handled by another proof technique, leaving only a finite number of values for which P(x) is unknown.

An enumerative proof is a seldom-used method because it tends to be too long and tedious for a human being. A proof involving a hundred different cases is probably the limit of the capacity of the human mind. Yet, it is precisely in this area that a computer can help, where it can evaluate millions or even billions of cases with ease. Its use can open up new frontiers in mathematics.

C. Kinds of Mathematical Results

Mathematicians love to call their results theorems. Along the way, they also prove lemmas, deduce corollaries, and propose conjectures. What are these different classifications? The following paragraphs will give a brief answer to this question.

Mathematical results are classified as lemmas, theorems, and corollaries, depending on their importance. The most important results are called theorems. A lemma is an auxiliary result which is useful in proving a theorem. A corollary is a subsidiary result that can be derived from a theorem. These classifications are quite loose, and the actual choice of terminology is based on the subjective evaluation of the discoverer of the results. There are cases where a lemma or a corollary has, over a period of time, attained an importance surpassing the main theorem.

A conjecture, on the other hand, is merely an educated guess. It is a statement which may or may not be true. Usually, the proposer of the conjecture suspects that it is highly probable to be true but cannot prove it. Many famous mathematical problems are stated as conjectures. If a proof is later found, then it becomes a theorem.

D. What Makes an Interesting Theorem

Of course, a theorem must be interesting to be worth proving. However, whether something is interesting is a subjective judgement. The following list contains some of the properties of an interesting result:

• Useful
• Important
• Elegant
• Provides insight

While the usefulness of some theorems is immediately obvious, the appreciation of others may come only years later. The importance of a theorem is often measured by the number of mathematicians who know about it or who have tried proving it. An interesting theorem should also give insights about a problem and point to new research directions. A theorem is often appreciated as if it is a piece of art. Descriptions such as “It is a beautiful theorem” or “It is an elegant proof” are often used to characterize an interesting mathematical result.

II. COMPUTER PROGRAMMING

Figure 1 shows a highly simplified view of a computer. There are two major components: the central processing unit (CPU) and memory. Data are stored in the memory. The CPU can perform operations which change the data. A program is a sequence of instructions which tell the CPU what operations to perform and it is also stored in the computer memory.

Computer programming is the process of creating the sequence of instructions to be used by the computer. Most programming today is done in high-level languages such as Fortran, Pascal, or C. Such languages are designed for the ease of creating and understanding a complex program by human beings. A program written in one of these languages has to be translated by a compiler before it can be used by the computer.

FIGURE 1 A simplified view of a computer.
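As a small illustration of what a program in a high-level language looks like, the following sketch carries out the case-by-case evaluation used in an enumerative proof, stopping as soon as the outcome is decided. Python is used here purely for illustration (the article names Fortran, Pascal, and C), and the function names `check_for_all`, `check_exists`, and `is_prime`, as well as the example predicates, are assumptions made for this sketch.

```python
import math
from typing import Callable, Iterable, Optional, Tuple

Result = Tuple[bool, Optional[int], int]  # (truth value, deciding x, cases examined)


def check_for_all(values: Iterable[int], p: Callable[[int], bool]) -> Result:
    """Examine P(x) case by case; stop at the first counterexample."""
    examined = 0
    for x in values:
        examined += 1
        if not p(x):
            return False, x, examined  # one failure settles "for all"
    return True, None, examined


def check_exists(values: Iterable[int], p: Callable[[int], bool]) -> Result:
    """Examine P(x) case by case; stop at the first witness."""
    examined = 0
    for x in values:
        examined += 1
        if p(x):
            return True, x, examined  # one success settles "for some"
    return False, None, examined


def is_prime(n: int) -> bool:
    """Trial division; adequate for the small numbers used below."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


# "n^2 + n + 41 is prime" holds for n = 0, ..., 39 but fails at n = 40,
# since 40^2 + 40 + 41 = 1681 = 41^2, so the enumeration stops early.
print(check_for_all(range(0, 10001), lambda n: is_prime(n * n + n + 41)))

# "x^2 = 49 for some x in 1..100" is settled once the witness x = 7 is reached.
print(check_exists(range(1, 101), lambda x: x * x == 49))
```

Returning the deciding value together with the number of cases examined makes the result easier to check independently, which matters when an enumeration is too long to repeat by hand.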

Due to the speed with which a computer can perform operations it is programmed to do and its capacity in dealing with huge amounts of information, it is well suited to do lengthy or repetitive tasks that human beings are not good at doing.

III. COMPUTER AS AN AID TO MATHEMATICAL RESEARCH

Ever since the arrival of computers, mathematicians have used them as a computational aid. Initially, computers were used to perform tedious and repetitious calculations. The tremendous speed and accuracy of computers enable mathematicians to perform lengthy calculations without fear of making careless mistakes. For example, π can be computed to several billion digits of accuracy, and prime numbers of ever-increasing size are being found continually by computers.

Computers can also help mathematicians manipulate complicated formulae and equations. MACSYMA is the first such symbol manipulation program developed for mathematicians. A typical symbol manipulation program can perform integration and differentiation, manipulate matrices, factorize polynomials, and solve systems of algebraic equations. Many programs have additional features such as two- and three-dimensional plotting or even the generation of high-level language output.

Even though these programs have been used to prove a number of results directly, their main application has been to obtain insight into the behavior of various mathematical objects, which then has led to conventional proofs. Mathematicians still prefer computer-free proofs, if possible. A proof that involves using the computer is difficult to check and so its correctness cannot be determined absolutely. The term “a computer-based proof” is used to refer to a proof where part of it involves extensive computation, for which an equivalent human argument may take millions of years to make.

IV. EXAMPLES OF COMPUTER-BASED PROOFS

In the last 20 years, two of the best known mathematical problems were solved by lengthy calculations: the four-color conjecture and the existence question of a finite projective plane of order 10.

A. Four-Color Conjecture

The four-color conjecture says that four colors are sufficient to color any map drawn in the plane or on a sphere so that no two regions with a common boundary line are colored with the same color.

Francis Guthrie was given credit as the originator of this problem while coloring a map of England. His brother communicated the conjecture to De Morgan in October 1852. Attempts to prove the four-color conjecture led to the development of a major branch of mathematics: graph theory.

A graph G(V, E) is a structure which consists of a set of vertices $V = \{v_1, v_2, \ldots\}$ and a set of edges $E = \{e_1, e_2, \ldots\}$; each edge e is incident on two distinct vertices u and v. For example, Fig. 2 is a graph with four vertices and six edges. The technical name for this graph is the complete graph $K_4$. It is complete because there is an edge incident on every possible pair of vertices. Figure 3 is another graph with six vertices and nine edges. It is the complete bipartite graph $K_{3,3}$. In a bipartite graph, the set of vertices is divided into two classes, and the only edges are those that connect a vertex from one class to one of the other class. The graph $K_{3,3}$ is complete because it contains all the possible nine edges of the bipartite graph. A graph is said to be planar if it can be drawn on a plane in such a way that no edges cross one another, except, of course, at common vertices. The graph $K_4$ in Fig. 2 is drawn with no crossing edges and it is obviously planar. The graph $K_{3,3}$ can be shown to be not planar, no matter how one tries to draw it.

FIGURE 2 The complete graph $K_4$.

FIGURE 3 The complete bipartite graph $K_{3,3}$.

One can imagine that a planar graph is an abstract representation of a map. The vertices represent the capitals of the countries in the map. Two vertices are joined by an edge if the two countries have a common boundary. Instead of coloring regions of a map, colors can be assigned to the vertices of a graph representing the map. A graph is k-colorable if there is a way to assign k colors to the vertices of the graph such that no edge is incident on two vertices with the same
color. So, the four-color conjecture can also be stated as

every planar graph is 4-colorable.

The graph $K_4$ is not 3-colorable. This can be proved by contradiction. Suppose it is 3-colorable. Since there are four vertices, two of the vertices must have the same color. Since the graph is complete, there is an edge connecting these two vertices of the same color, which is a contradiction.

Kempe in 1879 published an incorrect proof of the four-color conjecture. Heawood in 1890 pointed out Kempe’s error, but demonstrated that Kempe’s method did prove that every planar graph is 5-colorable. Since then, many famous mathematicians have worked on this problem, leading to many significant theoretical advances. In particular, it was shown that the validity of the four-color conjecture depended only on a finite number of graphs. These results laid the foundation for a computer-based proof by Appel and Haken in 1976. Their proof depended on a computer analysis of 1936 graphs which took 1200 h of computer time and involved about $10^{10}$ separate operations. Finally, the four-color conjecture became the four-color theorem.

B. Projective Plane of Order 10

The question of the possible existence of a projective plane of order 10 was also settled recently by using a computer. A finite projective plane of order n, with n > 0, is a collection of $n^2 + n + 1$ lines and $n^2 + n + 1$ points such that

1. Every line contains n + 1 points.
2. Every point is on n + 1 lines.
3. Any two distinct lines intersect at exactly one point.
4. Any two distinct points lie on exactly one line.

The smallest example of a finite projective plane is one of order 1, which is a triangle. The smallest nontrivial example is one of order 2, as shown in Fig. 4. There are seven points labeled from 1 to 7. There are also seven lines labeled L1 to L7. Six of them are straight lines, but L6 is represented by the circle through points 2, 6, and 7.

FIGURE 4 The finite projective plane of order 2.

The earliest reference to a finite projective plane was in an 1856 book by von Staudt. In 1904, Veblen used the projective plane of order 2 as an exotic example of a finite object satisfying all of Hilbert’s axioms for geometry. He also proved that this plane of order 2 cannot be drawn using only straight lines. In a series of papers from 1904 to 1907, Veblen, Bussey, and Wedderburn established the existence of most of the planes of small orders. Two of the smallest orders missing are 6 and 10. In 1949, the celebrated Bruck–Ryser theorem gave an ingenious theoretical proof of the nonexistence of the plane of order 6. The nonexistence of the plane of order 10 was established in 1988 by a computer-based proof.

In the computer, lines and points are represented by their incidence relationship. The incidence matrix $A = [a_{ij}]$ of a projective plane of order n is an $n^2 + n + 1$ by $n^2 + n + 1$ matrix where the columns represent the points and the rows represent the lines. The entry $a_{ij}$ is 1 if point j is on line i; otherwise, it is 0. For example, Fig. 5 gives the incidence matrix for the projective plane of order 2. In terms of an incidence matrix, the properties of being a projective plane are translated into

1. A has constant row sum n + 1.
2. A has constant column sum n + 1.
3. The inner product of any two distinct rows of A is 1.
4. The inner product of any two distinct columns of A is 1.

These conditions can be encapsulated in the following matrix equation:

$$A A^T = nI + J,$$

where $A^T$ denotes the transpose of the matrix A, I denotes the identity matrix, and J denotes the matrix of all 1’s.

FIGURE 5 An incidence matrix for the plane of order 2.

As a result of intensive investigation by a number of mathematicians, it was shown that the existence question of the projective plane of order 10 can be broken into four starting cases. Each case gives rise to a number of geometric configurations, each corresponding to a partially completed incidence matrix. Starting from these partial matrices, a computer program tried to complete them to a full
plane. After about 2000 computing hours on a CRAY-1A supercomputer in addition to several years of computing on a number of VAX-11 and micro-VAX computers, it was shown that none of the matrices could be completed, which implied the nonexistence of the projective plane of order 10. About $10^{12}$ different subcases were investigated.

V. PROOF BY EXHAUSTIVE COMPUTER ENUMERATION

The proofs of both the four-color theorem and the nonexistence of a projective plane of order 10 share one common feature: they are both enumerative proofs. This approach to a proof is often avoided by humans, because it is tedious and error prone. Yet, it is tailor-made for a computer.

A. Methodology

Exhaustive computer enumeration is often done by a programming technique called backtrack search. A version of the search problem can be defined as follows:

Search problem:
Given a collection of sets of candidates $C_1, C_2, C_3, \ldots, C_m$ and a boolean compatibility predicate $P(x, y)$ defined for all $x \in C_i$ and $y \in C_j$, find an $m$-tuple $(x_1, \ldots, x_m)$ with $x_i \in C_i$ such that $P(x_i, x_j)$ is true for all $i \ne j$.

An $m$-tuple satisfying the above condition is called a solution.

For example, if we take $m = n^2 + n + 1$ and let $C_i$ be the set of all candidates for row $i$ of the incidence matrix of a projective plane, then $P(x, y)$ can be defined as

$$P(x, y) = \begin{cases} \text{true} & \text{if } \langle x, y \rangle = 1, \\ \text{false} & \text{otherwise,} \end{cases}$$

where $\langle x, y \rangle$ denotes the inner product of rows $x$ and $y$. A solution is then a complete incidence matrix.

In a backtrack approach, we generate $k$-tuples with $k \le m$. A $k$-tuple $(x_1, \ldots, x_k)$ is a partial solution at level $k$ if $P(x_i, x_j)$ is true for all $i \ne j \le k$. The basic idea of the backtrack approach is to extend a partial solution at level $k$ to one at level $k + 1$, and if this extension is impossible, then to go back to the partial solution at level $k - 1$ and attempt to generate a different partial solution at level $k$.

For example, consider the search for a projective plane of order 1. In terms of the incidence matrix, the problem is to find a $3 \times 3$ (0,1)-matrix $A$ satisfying the matrix equation

$$A A^T = I + J.$$

Suppose the matrix $A$ is generated row by row. Since each row of $A$ must have two 1's, the candidate sets $C_i$ are all equal to {[110], [101], [011]}. A partial solution at level 1 can be formed by choosing any of these three rows as $x_1$. If $x_1$ is chosen to be [110], then there are only two choices for $x_2$ which would satisfy the predicate $P(x_1, x_2)$, namely, [101] and [011]. Each of these choices for $x_2$ has a unique extension to a solution.

A nice way to organize the information inherent in the partial solutions is the backtrack search tree. Here, the empty partial solution ( ) is taken as the root, and the partial solution $(x_1, \ldots, x_k)$ is represented as the child of the partial solution $(x_1, \ldots, x_{k-1})$. Following the computer science terminology, we call a partial solution a node. It is also customary to label a node $(x_1, \ldots, x_k)$ by only $x_k$ because this is the choice made at this level. The full partial solution can be read off the tree by following the branch from the root to the node in question. Figure 6 shows the search tree for the projective plane of order 1. The possible candidates are labeled as $r_1$, $r_2$, and $r_3$. The right-most branch, for example, represents choosing $r_3$ or [011] as the first row, $r_2$ as the second row, and $r_1$ as the third row.

FIGURE 6 Search tree for the projective plane of order 1.

It is often true that the computing cost of processing a node is independent of its level $k$. Under this assumption, the total computing cost of a search is equal to the number of nodes in the search tree times the cost of processing a node. Hence, the number of nodes in the search tree is an important parameter in a search. This number can be obtained by counting the nodes in every level, with $\alpha_i$ defined as the node count at level $i$ of the tree. For example, in the search tree for the projective plane of order 1 shown in Fig. 6, $\alpha_1 = 3$, $\alpha_2 = 6$, and $\alpha_3 = 6$.

B. Estimation

It is very difficult to predict a priori the running time of a backtrack program. Sometimes, one program runs to completion in less than 1 sec, while other programs seem to take forever. A minor change in the strategy used in the
backtrack routine may change the total running time by several orders of magnitude. Some “minor improvements” may speed up the program by a factor of 100, while some other “major improvements” actually slow down the program. It is useful to have a simple and reliable method to estimate the running time of such a program. It can be used to

1. Decide whether the problem can be solved with the available resources.
2. Compare the various methods of “improvement.”
3. Design an optimized program to reduce the running time.

A good approximation to the running time of a backtrack program is to multiply the number of nodes in the search tree by a constant representing the time required to process each node. Hence, a good estimate of the running time can be obtained by finding a good estimate to the number of nodes in the search tree. The size of the tree can be approximated by summing up the estimated $\alpha_i$'s, the node counts at each level.

An interesting solution to the estimation problem is a Monte Carlo approach developed by Knuth. The idea is to run a number of experiments, with each experiment consisting of performing the backtrack search with a randomly chosen candidate at each level of the search. Suppose we have a partial solution $(x_1, \ldots, x_k)$ for $0 \le k < n$, where $n$ is the depth of the search tree. We let

$$C'_{k+1} = \{x_{k+1} \in C_{k+1} \mid P(x_1, \ldots, x_{k+1}) \text{ is true}\}$$

be the set of acceptable candidates which extend $(x_1, \ldots, x_k)$. We choose $x_{k+1}$ at random from $C'_{k+1}$ such that each of the $|C'_{k+1}|$ possibilities is equally likely to be chosen. We let $d_k = |C'_k|$ be the number of elements in $C'_k$ for $k = 1, \ldots, n$. Then the node count at level $i$, $\alpha_i$, can be estimated by

$$\alpha_i \approx d_1 \cdots d_i.$$

Now, the total estimated size of the search tree is

$$\sum_{k=1}^{n} d_1 \cdots d_k.$$

The cost of processing a node can be estimated by running a few test cases, counting the nodes, and dividing the running time by the number of nodes.

These estimated values of the node counts can best be presented by plotting the logarithm of $\alpha_i$, or equivalently the number of digits in $\alpha_i$, as a function of $i$. A typical profile is shown in Fig. 7. The value of $i$ for which $\log \alpha_i$ is maximum is called the bulge of the search tree. A backtrack search spends most of its time processing nodes near the bulge.

FIGURE 7 Typical shape of a search tree.

C. Optimization

A computer-based proof using backtracking may take a large amount of computing time. Optimization is a process of fine-tuning the computer program so that it runs faster. The optimization methods are divided into two broad classes:

1. Those whose aim is to reduce the size of the search tree
2. Those whose aim is to reduce the cost of the search tree by processing its nodes more efficiently

As a general rule, methods that reduce the size of the search tree, a process also called pruning the search tree, can potentially reduce the search by many orders of magnitude, whereas improvements obtained by trying to process nodes more efficiently are often limited to 1 or 2 orders of magnitude. Thus, given a choice, one should first try to reduce the size of the search tree.

There are many methods to prune the search tree. One possibility is to use a more effective compatibility predicate, while preserving the set of solutions. In a search tree, there are many branches which do not contain any solution. If these branches can be identified early, then they can be eliminated, hence reducing the size of the tree.

Another method to prune the search tree is by symmetry pruning. Technically, a symmetry is a property-preserving operation. For example, two columns of a matrix $A$ can be interchanged without affecting the product $A A^T$. Consider again the search tree for a projective plane of order 1. The interchange of columns 1 and 2 will induce a relabeling of $r_2$ as $r_3$ and vice versa. Thus, after considering the partial solution $x_1 = [101]$, there is no need to consider the remaining partial solution $x_1 = [011]$, because its behavior will be a duplicate of the earlier case. In combinatorial problems, the number of symmetries tends to be large and symmetry pruning can be very effective.

Methods to reduce the cost of processing a node of the search tree are often adaptations of well-known methods in computer programming. For example, one can use a better algorithm such as replacing a linear search by a binary search. We can also replace the innermost loop by an assembly language subroutine, or a faster computer can be used.
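The backtrack search of Section A and the Monte Carlo estimate of Section B can be sketched in a few lines of Python for the projective plane of order 1. This is a minimal illustration under stated assumptions, not the program used in the actual searches: rows are represented as tuples, and the names `compatible`, `extensions`, `count_nodes`, and `knuth_estimate` are hypothetical.

```python
import random

# Candidate rows for the plane of order 1: every (0,1)-row with two 1's.
ROWS = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
CANDIDATE_SETS = [ROWS, ROWS, ROWS]  # C_1, C_2, C_3 are all equal


def compatible(x, y):
    """Predicate P(x, y): true when the inner product of the two rows is 1."""
    return sum(a * b for a, b in zip(x, y)) == 1


def extensions(partial, candidates):
    """Candidates compatible with every row already in the partial solution."""
    return [c for c in candidates if all(compatible(c, x) for x in partial)]


def count_nodes(candidate_sets):
    """Full backtrack search; returns node counts per level and all solutions."""
    counts = [0] * len(candidate_sets)
    solutions = []

    def extend(partial):
        level = len(partial)
        if level == len(candidate_sets):
            solutions.append(tuple(partial))
            return
        for c in extensions(partial, candidate_sets[level]):
            counts[level] += 1      # one node at level `level + 1`
            partial.append(c)
            extend(partial)
            partial.pop()           # backtrack

    extend([])
    return counts, solutions


def knuth_estimate(candidate_sets, rng):
    """One random probe: estimate alpha_i by the product d_1 ... d_i."""
    estimates, product, partial = [], 1, []
    for candidates in candidate_sets:
        acceptable = extensions(partial, candidates)
        product *= len(acceptable)
        estimates.append(product)
        if not acceptable:
            break                   # dead end: deeper levels estimate to 0
        partial.append(rng.choice(acceptable))
    estimates += [0] * (len(candidate_sets) - len(estimates))
    return estimates


counts, solutions = count_nodes(CANDIDATE_SETS)
probe = knuth_estimate(CANDIDATE_SETS, random.Random(0))
```

For this tiny tree every probe sees the same branching factors, so a single probe already reproduces the exact node counts. In a realistic search the estimates from many probes would be averaged.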
One common optimization technique is to move invariant operations from the inside of a loop to the outside. This idea can be applied to a backtrack search in the following manner. We try to do as little as possible for nodes near the bulge, at the expense of more processing away from the bulge. For example, suppose we have a tree of depth 3 and that $\alpha_1 = 1$, $\alpha_2 = 1000$, and $\alpha_3 = 1$. If the time required to process each node is 1 sec, then the processing time for the search tree is 1002 sec. Suppose we can reduce the cost of processing the nodes at level 2 by a factor of 10 at the expense of increasing the processing cost of nodes at levels 1 and 3 by a factor of 100. Then, the total time is reduced to 300 sec (100 sec for each of the three levels).

D. Practical Aspects

There are many considerations that go into developing a computer program which runs for months and years. Interruptions, ranging from power failures to hardware maintenance, are to be expected. A program should not have to restart from the very beginning for every interruption; otherwise, it may never finish. Fortunately, an enumerative computer proof can easily be divided into independent runs. If there is an interruption, just look up the last completed run and restart from that point. Thus, the disruptive effect of an untimely interrupt is now limited to the time wasted in the incomplete run. Typically, the problem is divided into hundreds or even millions of independent runs in order to minimize the time wasted by interruptions.

Another advantage of dividing a problem into many independent runs is that several computers can be used to run the program simultaneously. If a problem takes 100 years to run on one computer, then by running on 100 computers simultaneously the problem can be finished in 1 year.

E. Correctness Considerations

An often-asked question is, “How can one check a computer-based proof?” After all, a proof has to be absolutely correct. The computer program itself is part of the proof, and checking a computer program is no different from checking a traditional mathematical proof. Computer programs tend to be complicated, especially the ones that are highly optimized, but their complexity is comparable to some of the more difficult traditional proofs.

The actual execution of the program is also part of the proof, and the checking of this part is difficult or impossible. Even if the computer program is correct, there is still a very small chance that a computer makes an error in executing the program. In the search for a projective plane of order 10, this error probability is estimated to be 1 in 100,000.

So, it is impossible to have an absolutely error-free, computer-based proof! In this sense, a computer-based proof is an experimental result. As scientists in other disciplines have long discovered, the remedy is an independent verification. In a sense, the verification completes the proof.

VI. RECENT DEVELOPMENT: RSA FACTORING CHALLENGE

Recently, there has been a lot of interest in factorizing big numbers. It all started in 1977 when Rivest, Shamir, and Adleman proposed a public-key cryptosystem based on the difficulty of factorizing large numbers. Their method is now known as the RSA scheme. In order to encourage research and to gauge the strength of the RSA scheme, RSA Data Security, Inc. in 1991 started the RSA Factoring Challenge. It consists of a list of large composite numbers. A cash prize is given to the first person to factorize a number in the list. These challenge numbers are identified by the number of decimal digits contained in the numbers. In February 1999, the 140-digit RSA-140 was factorized. In August 1999, RSA-155 was factorized. The best known factorization method divides the task into two parts: a sieving part to discover relations and a matrix reduction part to discover dependencies. The sieving part has many similarities with an enumerative proof. One has to try many possibilities, and the trials can be divided into many independent runs. In fact, for the factorization of RSA-155, the sieving part took about 8000 MIPS years and was accomplished by using 292 individual computers located at 11 different sites in 3 continents. The resulting matrix had 6,699,191 rows and 6,711,336 columns. It took 224 CPU hours and 3.2 Gbytes of central memory on a Cray C916 to solve. Fortunately, there is never any question about the correctness of the final answer, because one can easily verify the result by multiplying the factors together.

VII. FUTURE DIRECTIONS

When the four-color conjecture was first settled by a computer, there was some hesitation in mathematics circles to accept it as a proof. There is first the question of how to check the result. There is also the aesthetic aspect: “Is a computer-based proof elegant?” The result itself is definitely interesting. The computer-based proof of the nonexistence of a projective plane of order 10 again demonstrated the importance of this approach. A lengthy enumerative proof is a remarkable departure
from traditional thinking in terms of simple and elegant proofs. As computers are getting faster, many other famous open problems may be solved by this approach. Mathematicians are slowly coming to accept a computer-based proof as another proof technique. Before long, it will be treated as just another tool in a mathematician’s tool box.

SEE ALSO THE FOLLOWING ARTICLES

COMPUTER ARCHITECTURE • DISCRETE MATHEMATICS AND COMBINATORICS • GRAPH THEORY • MATHEMATICAL LOGIC • PROBABILITY

BIBLIOGRAPHY

Cipra, B. A. (1989). “Do mathematicians still do math?” Res. News, Sci. 244, 769–770.
Kreher, D. L., and Stinson, D. R. (1999). “Combinatorial Algorithms, Generation, Enumeration, and Search,” CRC Press, Boca Raton, FL.
Lam, C. W. H. (1991). “The search for a finite projective plane of order 10,” Am. Math. Mon. 98, 305–318.
Lam, C. W. H. (1989). “How reliable is a computer-based proof?” Math. Intelligencer 12, 8–12.
Odlyzko, A. M. (1985). Applications of Symbolic Algebra to Mathematics, In “Applications of Computer Algebra” (R. Pavelle, ed.), pp. 95–111, Kluwer-Nijhoff, Boston.
Saaty, T. L., and Kainen, P. C. (1977). “The Four-Color Problem,” McGraw-Hill, New York.
P1: ZCK Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN003H-881 June 13, 2001 22:46

Computer-Generated Proofs
of Mathematical Theorems
David M. Bressoud
Macalester College

I. The Ideal versus Reality

II. Hypergeometric Series Identities
III. The WZ Method
IV. Extensions, Future Work, and Conclusions

GLOSSARY

Algorithm A recipe or set of instructions that, when followed precisely, will produce a desired result.
Binomial coefficient The binomial coefficient $\binom{n}{k}$ is the coefficient of $x^k$ in the expansion of $(1 + x)^n$. It counts the number of ways of choosing $k$ objects from a set of $n$ objects.
Computer algebra system A computer package that enables the computer to do symbolic computations such as algebraic simplification, formal differentiation, and indefinite integration.
Diophantine equation An equation in several variables for which only integer solutions are accepted.
Hypergeometric series A finite or infinite summation, $1 + a_1 + \cdots + a_k + \cdots$, in which the first term is 1 and the ratio of successive summands, $a_{k+1}/a_k$, is a quotient of polynomials in $k$.
Hypergeometric term A function of, say, $k$, that is the $k$th summand in a hypergeometric series.
Proof certificate A piece of information about a mathematical statement that makes it possible to prove the statement easily and quickly.
Proper hypergeometric term A function of two variables such as $n$ and $k$ that is of the following form: a polynomial in $n$ and $k$ times $x^k y^n$ for some fixed $x$ and $y$ times a product of quotients of factorials of the form $(an + bk + c)!$ where $a$, $b$, and $c$ are fixed integers.
Rising factorial A finite product of numbers in an arithmetic sequence with difference 1. It is written as $(a)_n = a(a + 1)(a + 2) \cdots (a + n - 1)$.

IN A COMPUTER-BASED PROOF, the computer is used as a tool to help guess what is happening, to check cases, to do the laborious computations that arise. The person who is creating the proof is still doing most of the work. In contrast, a computer-generated proof is totally automated. A person enters a carefully worded mathematical statement for which the truth is in doubt, hits the RETURN key, and within a reasonable amount of time the computer responds either that the statement is true or that it is false. A step beyond this is to have the computer do its own searching for reasonable statements that it can test. Such fully automated algorithms for determining the truth or falsehood of a mathematical statement do exist.

With Doron Zeilberger’s program EKHAD, one can enter the statement believed or suspected to be correct. If it is true, the computer will not only tell you so, it is capable of writing the paper ready for submission to a research journal. Even the search for likely theorems has been automated. A good deal of human input is needed to set parameters within which one is likely to find interesting results, but computer searches for mathematical theorems are now a reality.

The possible theorems to which this algorithm can be applied are strictly circumscribed, so narrowly defined that there is still a legitimate question about whether this constitutes true computer-generated proof or is merely a powerful mathematical tool. What is not in question is that such algorithms are changing the kinds of problems that mathematicians need to think about.

I. THE IDEAL VERSUS REALITY

A. What Cannot Be Done

Mathematics is frequently viewed as a formal language with clearly established underlying assumptions or axioms and unambiguous rules for determining the truth of every statement couched in this language. In the early decades of the twentieth century, works such as Russell and Whitehead’s Principia Mathematica attempted to describe all mathematics in terms of the formal language of logic. Part of the reason for this undertaking was the hope that it would lead to an algorithmic procedure for determining the truth of each mathematical statement. As the twentieth century progressed, this hope receded and finally vanished. In 1931, Kurt Gödel proved that no axiomatic system comparable to that of Russell and Whitehead could be used to determine the truth or falsehood of every mathematical statement. Every consistent system of axioms is necessarily incomplete.

One broad class of theorems deals with the existence of solutions of a particular form. Given the mathematical problem, the theorem either exhibits a solution of the desired type or states that no such solution exists. In 1900, as the tenth of his set of twenty-three problems, David Hilbert challenged the mathematical community: “Given a Diophantine equation with any number of unknown quantities and with rational integral numerical coefficients: To devise a process according to which it can be determined by a finite number of operations whether the equation is solvable in rational integers.” A well-known example of such a Diophantine equation is the Pythagorean equation, $x^2 + y^2 = z^2$, with the restriction that we only accept integer solutions such as $x = 3$, $y = 4$, and $z = 5$. Another problem of this type is Fermat’s Last Theorem. This theorem asserts that no such positive integer solutions exist for the equation $x^n + y^n = z^n$ when $n$ is an integer greater than or equal to 3. We know that the last assertion is correct, thanks to Andrew Wiles.

For a Diophantine equation, if a solution exists then it can be found in finite (though potentially very long) time just by trying all possible combinations of integers, but if no solution exists then we cannot discover this fact just by trying possibilities. A proof that there is no solution is usually very hard. In 1970, Yuri Matijasevic̆ proved that Hilbert’s algorithm could not exist. It is impossible to construct an algorithm that, for every Diophantine equation, is able to determine whether it does or does not have a solution.

There have been other negative results. Let $E$ be an expression that involves the rational numbers, $\pi$, $\ln 2$, the variable $x$, the functions sine, exponential, and absolute value, and the operations of addition, multiplication, and composition. Does there exist a value of $x$ where this expression is zero? As an example, is there a real $x$ for which

$$e^x - \sin(\pi \ln 2) = 0?$$

For this particular expression the answer is “yes” because $\sin(\pi \ln 2) > 0$, but in 1968, Daniel Richardson proved that it is impossible to construct an algorithm that would determine in finite time whether or not, for every such $E$, there exists a solution to the equality $E = 0$.

B. What Can Be Done

In general, the problem of determining whether or not a solution of a particular form exists is extremely difficult and cannot be automated. However, there are cases where it can be done. There is a simple algorithm that can be applied to each quadratic equation to determine whether or not it has real solutions, and if it does, to find them. That $x^2 - 4312x + 315 = 0$ has real solutions may not have been explicitly observed before now, but it hardly qualifies as a theorem. The theorem is the statement of the quadratic formula that sits behind our algorithm. The conclusion for this particular equation is simply an application of that theorem, a calculation whose relevance is based on the theory.

But as the theory advances and the algorithms become more complex, the line between a calculation and a theorem becomes less clear. The Risch algorithm is used by computer algebra systems to find indefinite integrals in Liouvillian extensions of differential fields. It can answer whether or not an indefinite integral can be written in a suitably defined closed form. If such a closed form exists, the algorithm will find it. Most people would still classify a specific application of this algorithm as a calculation, but it is no longer always so clear-cut. Even definite integral evaluations can be worthy of being called theorems.
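The quadratic case mentioned above is simple enough to write out in full. The following Python sketch is only an illustration of such a decision procedure, built on the sign of the discriminant; the function name `real_roots` is hypothetical.

```python
import math


def real_roots(a, b, c):
    """Return the real solutions of a*x**2 + b*x + c = 0, smallest first.

    The sign of the discriminant decides whether real solutions exist;
    the quadratic formula then produces them.
    """
    if a == 0:
        raise ValueError("not a quadratic equation")
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                       # no real solutions
    if disc == 0:
        return [-b / (2 * a)]           # one repeated solution
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# The equation from the text: x^2 - 4312x + 315 = 0.
roots = real_roots(1, -4312, 315)
```

Here the discriminant is $4312^2 - 4 \cdot 315 > 0$, so two real solutions are returned as floating-point approximations.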
Freeman Dyson conjectured the following integral evaluation for positive integer $z$ in 1962:

$$(2\pi)^{-n} \int_0^{2\pi} \cdots \int_0^{2\pi} \prod_{1 \le j < k \le n} \left| e^{i\theta_j} - e^{i\theta_k} \right|^{2z} d\theta_1 \cdots d\theta_n = \frac{(nz)!}{(z!)^n}.$$

Four proofs have since been published. Dyson’s conjecture cannot be proven by the Risch or any other general integral evaluation algorithm because the dimension of the space over which the integral is taken is a variable, but its proof is now close to the boundary of what can be totally automated.

Most of this article will focus on the WZ method developed by Wilf and Zeilberger in the early 1990s. Given a suitable hypergeometric series, the WZ method will determine whether or not it has a closed form. If it does, the algorithm will find it. It can even be used to find new hypergeometric series that can be expressed in closed form. Again, the important mathematics is the theory that is used to create and justify the algorithm, but specific applications now look very much like theorems.

One example of a result that can be proved by the WZ method is the following identity, discovered and proved by J. C. Adams in the nineteenth century. Let $P_n(x)$ be the Legendre polynomial defined by

$$P_n(x) := \frac{1}{2^n} \sum_{k=0}^{n} \binom{n}{k}^2 (x - 1)^k (x + 1)^{n-k},$$

and let $A_k = \binom{2k}{k}$, then

$$\int_{-1}^{1} P_m(x) P_n(x) P_{m+n-2k}(x)\, dx = \frac{1}{m + n + 1/2 - k} \cdot \frac{A_k A_{m-k} A_{n-k}}{A_{m+n-k}}. \tag{1}$$

Note that the term-by-term integration is not difficult for a computer algebra system. What distinguishes this particular identity is that the number of terms in each summation is left as a variable.

Given a hypergeometric series, the WZ method can be used to find the closed expression that it equals, provided such an expression exists. We take as an example

$$f(n) = \sum_{0 \le k \le n/3} 2^k \frac{n}{n - k} \binom{n - k}{2k}.$$

The algorithm produces a recursion satisfied by $f(n)$:

$$f(n + 3) - 2 f(n + 2) + f(n + 1) - 2 f(n) = 0.$$

This is a particularly nice example because the coefficients are constants and standard techniques can be applied to discover that

$$\sum_{0 \le k \le n/3} 2^k \frac{n}{n - k} \binom{n - k}{2k} = 2^{n-1} + \frac{1}{2}\left(i^n + (-i)^n\right), \quad n \ge 2.$$

In general, the coefficients in the recursion will be polynomials in $n$. In 1991 Marko Petkovšek created an algorithm that will find a closed form solution for such a recursion, or prove that no such formula exists. The combination of the WZ method with Petkovšek’s algorithm gives an automated proof that a particular type of solution cannot exist, or else it finds such a solution. As an example, there is a computer-generated proof of the fact that

$$\sum_{k=0}^{n} \binom{n}{k}^2 \binom{n + k}{k}^2$$

cannot be written as a linear combination of hypergeometric terms in $n$.

The WZ method combined with Petkovšek’s algorithm is producing fully automated proofs of results that, until recently, have required considerable human ingenuity. Significantly, it replies not just with a statement that a particular identity is true, but also with a proof certificate, a critical insight that enables anyone with pencil and paper and a little time to verify that this identity is correct. At the very least, these algorithms have moved the line of demarcation between what constitutes a proof and what is only a computation.

II. HYPERGEOMETRIC SERIES IDENTITIES

A. What Is a Hypergeometric Series?

A series, $1 + a_1 + a_2 + a_3 + \cdots$, is called hypergeometric if the ratio of consecutive terms, $a_{n+1}/a_n$, is a rational function of $n$, say $a_{n+1}/a_n = P(n)/Q(n)$, where $P$ and $Q$ are polynomials. Most of the commonly encountered power series are hypergeometric or can be expressed in terms of hypergeometric series (see Fig. 1). A hypergeometric term is a function of $n$ that is a summand of a hypergeometric series indexed by $n$. In particular, a hypergeometric term is of the form

$$a_k = \prod_{n=0}^{k-1} \frac{a_{n+1}}{a_n} = \prod_{n=0}^{k-1} \frac{P(n)}{Q(n)},$$

for some pair of polynomials $P$ and $Q$. If we factor $P$ and $Q$,

$$P(n) = c_1 (n + \alpha_1)(n + \alpha_2) \cdots (n + \alpha_m),$$

$$Q(n) = c_2 (n + \beta_1)(n + \beta_2) \cdots (n + \beta_{n+1}),$$
then the hypergeometric term can be written as

$$a_k = c^k \frac{(\alpha_1)_k (\alpha_2)_k \cdots (\alpha_m)_k}{(\beta_1)_k (\beta_2)_k \cdots (\beta_{n+1})_k},$$

where $c = c_1/c_2$ and $(\alpha)_k$ is the rising factorial:

$$(\alpha)_k = \alpha(\alpha + 1)(\alpha + 2) \cdots (\alpha + k - 1).$$

FIGURE 1 Examples of common functions expressed in terms of hypergeometric series.

B. The Chu–Vandermonde Identity

A large part of the impetus behind the development of the WZ method and the reason why it has become such an influential tool is that there is a rich and ever-expanding store of useful identities for hypergeometric series. These recur throughout mathematics, playing important roles in the solutions of both theoretical and applied problems.

The binomial theorem was the first and is the most fundamental of these identities. It is the foundation upon which all others are proved. Mathematicians have been building upon the binomial theorem for many years. In 1303, Chu Shih-Chieh wrote Precious Mirror of the Four Elements (Ssu Yü Chien), in which he may have been the first person to state the fundamental result:

$$\sum_{i=0}^{\infty} \binom{a}{i} \binom{b}{k - i} = \binom{a + b}{k}. \tag{2}$$

In Chu’s identity, $a$, $b$, and $k$ are positive integers. Note that all summands will be zero once $i$ is greater than $a$ or $k$.

Equation (2) is easily derived from the binomial theorem by comparing the coefficients of $x^k$ in

$$(1 + x)^a (1 + x)^b = \sum_{i=0}^{a} \binom{a}{i} x^i \sum_{j=0}^{b} \binom{b}{j} x^j,$$

and

$$(1 + x)^{a+b} = \sum_{k=0}^{a+b} \binom{a + b}{k} x^k.$$

Equation (2) was rediscovered by Alexandre Vandermonde in 1772 and is today known as the Chu–Vandermonde Identity.

The ratio of successive terms in the summation is

$$\frac{\binom{a}{n+1} \binom{b}{k-n-1}}{\binom{a}{n} \binom{b}{k-n}} = \frac{(n - a)(n - k)}{(n + 1)(n + b - k + 1)}.$$

If we divide both sides of Eq. (2) by the first summand, $\binom{b}{k}$, it can be rewritten in terms of rising factorials as

$$1 + \sum_{n=1}^{\infty} \frac{(-a)_n (-k)_n}{n!\,(b - k + 1)_n} = \frac{(a + b)!\,(b - k)!}{(a + b - k)!\,b!}. \tag{3}$$

In 1797, Johann Friedrich Pfaff showed that, subject only to convergence conditions, Eq. (3) holds for complex values, in which case it can be expressed as

$$1 + \sum_{n=1}^{\infty} \frac{(\alpha)_n (\beta)_n}{n!\,(\gamma)_n} = \frac{\Gamma(\gamma - \alpha - \beta)\,\Gamma(\gamma)}{\Gamma(\gamma - \alpha)\,\Gamma(\gamma - \beta)}. \tag{4}$$
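Identities of this kind are easy to spot-check in exact arithmetic before attempting a proof. The Python sketch below is an illustration, not part of the WZ method itself: it tests Eq. (2) and Eq. (3) for small positive integers, and also the recursion and closed form quoted in Section I.B for $f(n) = \sum_{0 \le k \le n/3} 2^k \frac{n}{n-k} \binom{n-k}{2k}$. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def rising(x, n):
    """Rising factorial (x)_n = x(x+1)...(x+n-1), with (x)_0 = 1."""
    result = Fraction(1)
    for i in range(n):
        result *= x + i
    return result


def chu_vandermonde_holds(a, b, k):
    """Check Eq. (2): sum_i C(a,i) C(b,k-i) == C(a+b,k)."""
    # math.comb(n, r) is 0 when r > n, so terms with i > a vanish on their own.
    lhs = sum(comb(a, i) * comb(b, k - i) for i in range(k + 1))
    return lhs == comb(a + b, k)


def eq3_holds(a, b, k):
    """Check Eq. (3), the rising-factorial form; assumes b >= k."""
    # (-a)_n (-k)_n is zero once n exceeds min(a, k), so the sum is finite.
    lhs = sum(
        rising(-a, n) * rising(-k, n) / (factorial(n) * rising(b - k + 1, n))
        for n in range(min(a, k) + 1)
    )
    rhs = Fraction(factorial(a + b) * factorial(b - k),
                   factorial(a + b - k) * factorial(b))
    return lhs == rhs


def f(n):
    """f(n) = sum over 0 <= k <= n/3 of 2^k * n/(n-k) * C(n-k, 2k)."""
    return sum(Fraction(2 ** k * n, n - k) * comb(n - k, 2 * k)
               for k in range(n // 3 + 1))


def closed_form(n):
    """2^(n-1) + (i^n + (-i)^n)/2; the last term cycles through 1, 0, -1, 0."""
    return 2 ** (n - 1) + (1, 0, -1, 0)[n % 4]


def recursion_holds(n):
    """Check f(n+3) - 2 f(n+2) + f(n+1) - 2 f(n) == 0."""
    return f(n + 3) - 2 * f(n + 2) + f(n + 1) - 2 * f(n) == 0
```

A finite check of this sort proves nothing for general parameters, but it catches transcription errors quickly and is independent of the form in which the identity is written.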
FIGURE 2 The representation of “Pascal’s” triangle in Chu’s Precious Mirror of the Four Elements of 1303. (Reprinted with the permission of Cambridge University Press.)

Pfaff’s student, Carl Friedrich Gauss, used hypergeometric series in his astronomical work and advanced their study. Among his contributions, he found sharp criteria for whether or not a hypergeometric series converges. Throughout the nineteenth and twentieth century, a great number of identities for hypergeometric series were discovered, many of which were collected in the Bateman Manuscript Project published as Higher Transcendental Functions in 1953–1955.

C. Standardized Notation

Most hypergeometric series can be written as sums of rational products of binomial coefficients, but this representation is problematic because it is not unique. As an example,

$$\sum_{n=0}^{m} 2^{m-k-2n} \binom{m}{n} \binom{m - n}{n + k} = \binom{2m}{m + k}$$

appears to be different from the Chu–Vandermonde identity [Eq. (2)]. But if we look at the ratio of consecutive summands, it is

$$\frac{1}{4} \cdot \frac{(m - 2n - k - 1)(m - 2n - k)}{(n + 1)(n + k + 1)} = \frac{(n + (k + 1 - m)/2)(n + (k - m)/2)}{(n + 1)(n + k + 1)}.$$

This is simply the Chu–Vandermonde identity with $\alpha = (k + 1 - m)/2$, $\beta = (k - m)/2$, and $\gamma = k + 1$.

There is clearly an advantage to using the rising factorial notation, in which case we write

$${}_mF_n\!\left(\begin{matrix} \alpha_1, \ldots, \alpha_m \\ \beta_1, \ldots, \beta_n \end{matrix}; x\right) := 1 + \sum_{k=1}^{\infty} \frac{(\alpha_1)_k \cdots (\alpha_m)_k}{k!\,(\beta_1)_k \cdots (\beta_n)_k} x^k.$$

Even with this standardized notation, there are equivalent identities that look different because there are nontrivial transformation formulas for hypergeometric series. As an example, provided the series in question converge, we have that

$${}_2F_1\!\left(\begin{matrix} a, b \\ 2a \end{matrix}; x\right) = \left(1 - \frac{x}{2}\right)^{-b} {}_2F_1\!\left(\begin{matrix} b/2, (b + 1)/2 \\ a + 1/2 \end{matrix}; \left(\frac{x}{2 - x}\right)^2\right).$$

This is why, even if all identities for hypergeometric series were already known, it would not be enough to have a list of them against which one could compare the candidate in question. Just establishing the equivalence of two identities can be a very difficult task. This makes the WZ method all the more remarkable because it is independent of the form in which the identity is given and can even be used to verify (or disprove) a conjectured transformation formula.

III. THE WZ METHOD

A. Sister Celine’s Technique

The WZ method for finding and proving identities for hypergeometric series builds on a succession of developments that began with the Ph.D. thesis of Sister Mary Celine Fasenmyer at the University of Michigan in 1945. We consider a sum of the form

$$f(n) = \sum_{k} F(n, k),$$

where $F(n, k)$ is a proper hypergeometric term. This means that it is a polynomial in $n$ and $k$ times $x^k y^n$, for fixed $x$ and $y$, times a product of quotients of factorials of the form $(an + bk + c)!$, where $a$, $b$, and $c$ are fixed integers. As an example,
P1: ZCK Final
Encyclopedia of Physical Science and Technology EN003H-881 June 13, 2001 22:46

558 Computer-Generated Proofs of Mathematical Theorems

\[
\sum_{0\le k\le n/3} 2^{k}\,\frac{n}{n-k}\binom{n-k}{2k} = \sum_{0\le k\le n/3} n\cdot 2^{k}\cdot\frac{(n-k-1)!}{(2k)!\,(n-3k)!}
\]
is such a series.

Every such sum of proper hypergeometric terms will satisfy a finite recursion of the form
\[
\sum_{j=0}^{J} a_j(n)\, f(n+j) = 0.
\]
Sister Celine showed how to reduce the problem of finding these coefficients to one of solving a system of linear equations. It was Doron Zeilberger who realized that this gives us an algorithm for proving hypergeometric series identities because we need only verify that each side satisfies the same recursion and the same initial conditions. The problem with using Sister Celine’s approach is that her particular algorithm for finding the coefficients is slow. Later developments would speed it up considerably, though in the process would lose the easy generalization of Sister Celine’s technique to summations over several indices.

B. Gosper’s algorithm

In 1977 and 1979, R. W. Gosper, Jr., took a different approach and became one of the first people to use computers to discover and check identities for hypergeometric series. Given a proper hypergeometric term F(n, k), Gosper showed how to automate a search for a proper hypergeometric term G(n, k) with the property that
\[
G(n,k+1) - G(n,k) = F(n,k).
\]
If such a G could be found, then
\[
f(n) = \sum_{k=0}^{n}\bigl(G(n,k+1) - G(n,k)\bigr) = G(n,n+1) - G(n,0).
\]
An example of the application of this algorithm is the computer-generated proof of an identity discovered and first proved by J. S. Lomont and John Brillhart: Let 1 ≤ m ≤ n, where n ≥ 2 and 1 ≤ s ≤ min(m, n − 1), then

s m m − j n n − j
(−1) j (m + n − 2 j)
m −s n−s
j=0
j j
m + n m + n − s − j − 1 s 2
× = 0.
j s− j j

Given this conjecture, the program EKHAD replies with the proof certificate:

−s, j ∗ (m + n − s − j)/(m + n − 2 ∗ j).

This means that if F(n, j) is the summand in the conjectured identity, then
\[
-s\,F(n,j) = G(n,j+1) - G(n,j),
\]
where G(n, j) = j(m + n − s − j)F(n, j)/(m + n − 2j). The sum over j of G(n, j + 1) − G(n, j) telescopes, and therefore the original summation equals [G(n, s + 1) − G(n, 0)]/(−s) = 0.

Gosper’s algorithm is a fertile approach that is often applicable, but it is limited by the fact that such a G does not always exist.

C. Wilf and Zeilberger

Major progress was made by Doron Zeilberger who, starting in 1982, began to combine the ideas of Sister Celine and William Gosper. In the early 1990s, Herbert Wilf joined Zeilberger in extending and refining these methods into a fully automated proof machine that is now known as the WZ method. If F(n, k) is a proper hypergeometric term, then there always is a proper hypergeometric term G(n, k) such that G(n, k + 1) − G(n, k) is equal to a linear combination of {F(n + j, k) | 0 ≤ j ≤ J} for some explicitly computable J,
\[
\sum_{j=0}^{J} a_j(n)\,F(n+j,k) = G(n,k+1) - G(n,k), \tag{5}
\]
where the a_j(n) are polynomials in n. If f(n) = \sum_{k=0}^{K} F(n, k), then we can sum both sides of Eq. (5) over 0 ≤ k ≤ K. The right side telescopes, and we are left with the recursive formula
\[
\sum_{j=0}^{J} a_j(n)\, f(n+j) = G(n,K+1) - G(n,0).
\]
Gosper’s technique, which is very fast, can be used to find the function G. The coefficients a_j(n) are then found by solving a system of linear equations.

Gosper’s algorithm is the special case of the WZ method in which J = 0. The other case of particular interest is when J = 1 and a_1 = −a_0 = 1. Consider the conjectured identity:
\[
\sum_{k}\binom{n}{2k}\binom{2k+1}{k}\frac{n+2}{2k+1}\,2^{n-2k-1} = \binom{2n+1}{n}. \tag{6}
\]
If we divide each side by \binom{2n+1}{n}, this can be rewritten as
\[
f(n) = \sum_{k}\binom{n}{2k}\binom{2k+1}{k}\frac{n+2}{2k+1}\,2^{n-2k-1}\Big/\binom{2n+1}{n} = 1.
\]

If this is true, then f(n) satisfies the recursion f(n + 1) − f(n) = 0. Let F(n, k) be the summand,
\[
F(n,k) = \binom{n}{2k}\binom{2k+1}{k}\frac{n+2}{2k+1}\,2^{n-2k-1}\Big/\binom{2n+1}{n}.
\]
If we could find a proper hypergeometric term G(n, k) for which
\[
F(n+1,k) - F(n,k) = G(n,k+1) - G(n,k), \tag{7}
\]
then it would follow that f(n + 1) − f(n) = 0, and so f(n) would be constant. It would be enough to check that f(0) = 1.

In fact, such a G does exist. The WZ method finds it. The proof certificate is the rational function,
\[
\frac{G(n,k)}{F(n,k)} = \frac{4k(k+1)}{(2n+3)(2k-n-1)}.
\]
To check Eq. (6), we only need to verify that F and G, which we now know, do indeed satisfy Eq. (7).

In general, the WZ method returns either the ratio G(n, k)/F(n, k) [if the recursion is of the form given in Eq. (7)], or it returns the actual recursive formula satisfied by f(n). The only drawback to the WZ method is that the number of terms in the recursive formula may be too large for practical use.

D. Petkovšek and Others

In his Ph.D. thesis of 1991, Marko Petkovšek showed how to find a closed form solution, or to show that such a solution does not exist, for any recursive formula of the form
\[
\sum_{j=0}^{J} a_j(n)\, f(n+j) = g(n),
\]
in which g(n) and the a_j(n) are polynomials in n. By closed form, we mean a linear combination of a fixed number of hypergeometric terms.

Combined with the WZ method, Petkovšek’s algorithm implies that in theory if not in practice, given any summation of proper hypergeometric terms, there is a completely automated computer procedure that will either find a closed form for the summation or prove that no such closed form exists. A full account of the WZ method and Petkovšek’s algorithm is given in the book A = B by Petkovšek, Wilf, and Zeilberger.

Others have worked on implementing and extending the ideas of the WZ method. One of the centers for this work has been a group headed by Peter Paule at the University of Linz in Austria. Ira Gessel has been at the forefront of those who have used this algorithm to implement computer searches that both discovered and proved a large number of new identities for hypergeometric series.

IV. EXTENSIONS, FUTURE WORK, AND CONCLUSIONS

A. Extensions and Future Work

All of the techniques described in this article have been extended to q-hypergeometric series such as
\[
1 + \sum_{k=0}^{n} q^{k^2}\,\frac{(1-q^{n})(1-q^{n-1})\cdots(1-q^{n-k+1})}{(1-q)(1-q^{2})\cdots(1-q^{k})},
\]
in which the ratio of consecutive summands is a rational function of q^k.

Many general determinant evaluations can be reduced to problems that can be solved using the WZ method. Work is progressing on fully automating proofs of such results.

There are other theorems that appear to be amenable to an automated computer attack. These include results on real closed fields using techniques of George Collins and geometrical theorems proved using algebraic techniques such as Gröbner bases.

B. Conclusions

The net effect of the algorithms that prove identities for hypergeometric series is that a piece of mathematics that once could only be done by those with cleverness and insight has been turned into a purely mechanical calculation. Rather than limiting the scope of mathematics, the WZ method has widened it. Problems that had once been intractable are now within reach. The situation is different in degree but not in kind from the invention of calculus. This was the discovery of mechanical procedures that enabled scientists to shift their attention away from laborious and ingenious techniques for finding areas and tangent lines and to begin addressing the really interesting questions.

Perhaps this will always be the fate of computer-generated proofs. That one class of problems has been moved into the category of those that can be solved by computers means that we are freed to direct our attention to those questions that are most important.

Web Sites for the WZ Method and Related Algorithms

Home page for the book A = B:
http://www.cis.upenn.edu/~wilf/AeqB.html
Wilf and Zeilberger’s programs:
http://www.cis.upenn.edu/~wilf/progs.html
Programs of the RISC group at the University of Linz:
http://www.risc.uni-linz.ac.at/research/combinat/risc/

718 Convex Sets

I. INTRODUCTION

Relatively few shapes in the natural world are convex. When they do occur (for example, soap bubbles, drops of dew, smoothly worn stones on the beach, single crystals of amethyst and salt) we find them to be esthetically pleasing. Among manufactured objects, rectangles, circles, hexagons, cubes, cylinders, and cones are quite ubiquitous. We first encounter them as children and enjoy the shapes of wooden building blocks and colored tiles.

Convexity is the study of these shapes. Two-dimensional convex shapes (circles, ellipses, triangles, polygons) and the regular Platonic solids have been objects of mathematical study for a very long time. The study of convexity as a specific mathematical topic dates back only to the end of the 19th century. The primary influence was the pioneering work of Minkowski for which one should consult his collected works (Reference 3). For a good historical summary, see the article by Peter Gruber in the Handbook of Convex Geometry (Reference 2). This chapter covers only some of the topics that fall under the heading of convexity. Other aspects can be found in articles in Reference 2. There is not space to mention the many interactions of convexity with other branches of mathematics. The book by Roger Webster (Reference 5) is a readable, elementary introduction to the subject and has some interesting applications.

To define convex sets precisely we need an ambient space in which they may exist. This space requires one type of mathematical structure and usually comes equipped with another.

The structure that is required is that of a vector or linear space; the most familiar vector spaces are those that we denote by R^2, R^3, and, in general, R^n. This space consists of n-tuples of numbers (ξ_1, ξ_2, . . . , ξ_n) whose entries are called the coordinates of the vector. It is customary to write single vectors as rows, but matrices (and other functions that operate on vectors) are usually written to the left of the vector. This means that the vectors should ‘really’ be viewed as columns.

If a vector y is expressed in the form:

y = α_1 x_1 + α_2 x_2 + · · · + α_k x_k

then it is said to be a linear combination of the vectors {x_i : i = 1, . . . , k}. The usual basis for R^n consists of the vectors:

e_1 = (1, 0, 0, . . . , 0), e_2 = (0, 1, 0, . . . , 0), . . . , e_n = (0, 0, 0, . . . , 1)

where the ith vector has a 1 as the ith coordinate and a 0 elsewhere. Any vector x = (ξ_1, ξ_2, . . . , ξ_n) can be expressed uniquely as:

x = ξ_1 e_1 + ξ_2 e_2 + · · · + ξ_n e_n.

A set of vectors B with the property that every vector has a unique representation as a finite linear combination of the vectors in B is called a basis for X. Each vector space has a basis, and all bases for the same space have the same number of elements. That number is called the dimension of the space. The dimension of R^n is n. The space is said to be finite dimensional if it has a basis with a finite number of elements and is infinite dimensional otherwise. We shall be concerned almost entirely with finite dimensional spaces.

A linear map is called an isomorphism if it is one to one and onto. If X is finite dimensional, then, corresponding to a basis (x_1, x_2, . . . , x_n), there is an isomorphism, T, of X onto R^n defined as follows: each x ∈ X has a unique representation as x = Σ α_i x_i; set T(x) := (α_1, α_2, . . . , α_n). In this sense, there is no real loss of generality if, when dealing with finite dimensional spaces, we restrict attention to R^n.

The linear mappings from X to R are given the special name of linear functionals. The set of all of them is called the dual space of X and denoted by X∗. If X is finite dimensional and if it is given a basis, then the linear functionals can be represented by 1 × n matrices (i.e., row vectors of the same size as the column vectors from X). Thus, in the finite dimensional case, the dimension of X∗ is the same as that of X.

In R^n, the length of a vector x = (ξ_1, ξ_2, . . . , ξ_n) is denoted by ‖x‖ and is defined to be the number:
\[
\|x\| := \left(\sum_{i=1}^{n} |\xi_i|^{2}\right)^{1/2}.
\]
The structure that is usually present in discussions of convex sets in addition to the vector space structure just outlined is that of a topological space. Most frequently the topology derives from a metric or distance. This is the case in R^n where the distance between two vectors x and y, d(x, y), is defined to be the length of x − y:

d(x, y) := ‖x − y‖.

The topological structure allows one to talk about convergence, continuity, and such concepts as open sets, closed sets, connected sets, and compact sets. The Heine-Borel Theorem asserts that C is a compact subset of R^n if and only if it is both closed and bounded. This is no longer true in infinite dimensional spaces.

II. DEFINITIONS

Definition. A set K in a vector space X is said to be convex if whenever x, y ∈ K then the line segment [x, y] is contained in K. Thus, in Fig. 1, the set (a) is convex while (b) is not.
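The definition of a convex set can be probed numerically by sampling points along a segment. The sketch below is only an illustration under stated assumptions: it parametrizes the segment [x, y] as (1 − t)x + t y for 0 ≤ t ≤ 1, and the helper names segment_points, segment_stays_inside, in_disc, and in_annulus are hypothetical names chosen for this example. A sampled test can refute convexity (by exhibiting a segment that leaves the set) but can never establish it.

```python
def segment_points(x, y, steps=50):
    """Sample the points (1 - t) x + t y for t = 0, 1/steps, ..., 1."""
    return [tuple((1 - i / steps) * a + (i / steps) * b for a, b in zip(x, y))
            for i in range(steps + 1)]


def segment_stays_inside(contains, x, y, steps=50):
    """True if every sampled point of the segment [x, y] lies in the set."""
    return all(contains(p) for p in segment_points(x, y, steps))


def in_disc(p, tol=1e-12):
    """Membership test for the closed unit disc in the plane."""
    return p[0] ** 2 + p[1] ** 2 <= 1.0 + tol


def in_annulus(p, tol=1e-12):
    """Membership test for the ring 1/2 <= |p| <= 1, which is not convex."""
    r2 = p[0] ** 2 + p[1] ** 2
    return 0.25 - tol <= r2 <= 1.0 + tol


if __name__ == "__main__":
    a, b = (1.0, 0.0), (-1.0, 0.0)
    print(segment_stays_inside(in_disc, a, b))     # segment stays in the disc
    print(segment_stays_inside(in_annulus, a, b))  # segment crosses the hole
```

Both endpoints lie in the disc and in the annulus, but the segment joining them passes through the origin, which belongs to the disc and not to the annulus.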
P1: FWQ Final Pages
Encyclopedia of Physical Science and Technology EN003A-146 June 14, 2001 3:14

Convex Sets 719

FIGURE 1

A point of the line segment [a, b], that is, a point of the form:

a + λ(b − a) = (1 − λ)a + λb, 0 ≤ λ ≤ 1,

is said to be a convex combination of a and b.

A related notion is that of being star shaped. A set S is star shaped about a point x_0 if for all x in S the line segment [x, x_0] is contained in S. In Fig. 2, the set (a) is star shaped about x but not about y, and (b) is not star shaped about any point. A set is convex if and only if it is star shaped about every one of its points. Each convex set (and each star-shaped set) is connected.

In the one-dimensional space R the collections of convex sets, star-shaped sets, and connected sets coincide. Each of these is the class of intervals (closed, open, half-open, bounded, and unbounded). Therefore, in order to have an interesting theory of convexity, the dimension of the underlying space should be at least 2.

There is also a definition of convexity as an adjective that applies to functions rather than sets:

Definition. A real valued function f defined on a convex set K is said to be convex if, for all x and y in K and all λ with 0 ≤ λ ≤ 1,

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).

Note that the convexity of K is needed to ensure that λx + (1 − λ)y is in the domain of f when x and y are.

The relation between the definitions of convexity for a function and for a set is twofold. The graph of f is a subset of K × R and is defined as:

graph(f) := {(x, η) : f(x) = η}.

Extending this idea, the epigraph of f is the set that lies above the graph of f:

epigraph(f) := {(x, η) : f(x) ≤ η}.

Then, f is convex (as a function) if and only if epigraph(f) is convex as a set. Secondly, if f is convex then, for each (extended) real number α, the sets {x : f(x) ≤ α} and {x : f(x) < α} are convex. This illustrates the connection between convexity and certain types of inequality.

A function is sublinear if it is both subadditive, f(x + y) ≤ f(x) + f(y), and non-negatively homogeneous, f(αx) = α f(x) for all α ≥ 0. From now on, “homogeneity” will mean “non-negative homogeneity.” All linear functionals are sublinear and all sublinear functions are convex. Hence, if f is sublinear then sets of the form {x : f(x) ≤ α} are convex.
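The convexity inequality for functions can be spot-checked on a finite set of sample points. The following sketch is a minimal illustration, assuming the Euclidean length on R^2 as the test function; the names norm, combo, violates_convexity, and convex_on_samples, the sample points, and the tolerance are all choices made for this example. Passing the check on samples does not prove convexity, while a single violation does disprove it.

```python
import itertools
import math


def norm(x):
    """Euclidean length of a vector given as a tuple of coordinates."""
    return math.sqrt(sum(c * c for c in x))


def combo(lam, x, y):
    """The convex combination lam * x + (1 - lam) * y."""
    return tuple(lam * a + (1 - lam) * b for a, b in zip(x, y))


def violates_convexity(f, x, y, lam, tol=1e-12):
    """True if f(lam x + (1-lam) y) exceeds lam f(x) + (1-lam) f(y)."""
    return f(combo(lam, x, y)) > lam * f(x) + (1 - lam) * f(y) + tol


POINTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (-1.5, 0.5), (2.0, -3.0)]
LAMBDAS = [0.0, 0.25, 0.5, 0.75, 1.0]


def convex_on_samples(f):
    """Check the convexity inequality on all sample pairs and weights."""
    return not any(violates_convexity(f, x, y, lam)
                   for x, y in itertools.product(POINTS, repeat=2)
                   for lam in LAMBDAS)


if __name__ == "__main__":
    print(convex_on_samples(norm))                 # the norm is sublinear
    print(convex_on_samples(lambda x: -norm(x)))   # its negative is not convex
```

The length function passes because it is subadditive and non-negatively homogeneous, hence sublinear, hence convex. Its negative fails, for instance at x = (1, 0), y = (0, 2), λ = 1/2.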

FIGURE 2

III. EXAMPLES

If K is a convex set and if A is an affine mapping, then A(K) := {Ax : x ∈ K} is also convex. Affine mappings include translations, rotations, dilations, and reflections. Therefore, it is often convenient to think of examples as being located at some particular point in space (centered at the origin, for example) or with some particular orientation or with some particular scaling. These are not relevant to the property of being convex and may not be very relevant to other properties of the convex sets.

1. A single point is a convex set.
2. Lines, line segments (with or without the end points), and rays (sets of the form {x : x = a + λb, with λ ≥ 0}) are convex sets. As indicated in Section II, in R the only convex sets are intervals.
3. The ball of radius 1 centered at the origin (briefly, the unit ball),

B := B[0, 1] := {x : ‖x‖ ≤ 1}

is convex. While this is geometrically clear in two and three dimensions, the proof (the same in all dimensions) is not so immediate. We need to show that the norm is a sublinear functional. That it is homogeneous is easy; the fact that it is subadditive,

‖x + y‖ ≤ ‖x‖ + ‖y‖,

is usually referred to as the triangle inequality and is a consequence of the Cauchy-Schwarz inequality.
Any closed ball B[x_0, r] := {x : ‖x − x_0‖ ≤ r} of center x_0 and radius r is convex. An open ball, B(x_0, r) := {x : ‖x − x_0‖ < r} is also convex (in general, the interior of a convex set is also convex). Note that we may consider, for example, a two-dimensional ball (a disc) as a subset of R^3 or a higher dimensional space. It is still convex, but whereas as a subset of R^2 it has interior points, in R^3 it does not. We shall say more about the idea of relative interior in Section IV. Closed and bounded convex sets are called convex bodies. Some authors use this term to also imply that the set has a non-empty interior.
4. The image of the unit ball under any invertible affine map is convex. This gives the important class of convex bodies known as ellipsoids.
5. The unit cube is defined to be the set {x = (ξ_1, ξ_2, . . . , ξ_n) : 0 ≤ ξ_i ≤ 1}. If a cube centered at the origin is required, we often consider one that is dilated by a factor of 2 and call it the standard cube, C_n:

C_n := {x = (ξ_1, ξ_2, . . . , ξ_n) : −1 ≤ ξ_i ≤ 1}.

This set is a convex body in R^n. The image of a standard cube under an invertible affine map is called a parallelotope.
6. The standard simplex, S_n, is defined by the following equation:
\[
S_n := \Bigl\{x = (\xi_1, \xi_2, \ldots, \xi_n) : \xi_i \ge 0 \text{ and } \sum \xi_i \le 1\Bigr\}.
\]
A general n-simplex is the image of S_n under an invertible affine map.

7. The standard cross-polytope in R^n, O_n, is defined by:
\[
O_n := \Bigl\{x = (\xi_1, \xi_2, \ldots, \xi_n) : \sum |\xi_i| \le 1\Bigr\}.
\]
The letter O is used because in R^3 this set is a regular octahedron. Note that O_2 and C_2 are both squares but they are oriented differently.
8. A hyperplane H_f^α := {x : f(x) = α} (where f is a linear functional) is convex. A (closed) half-space is a set of the form H_f^{α−} := {x : f(x) ≤ α} and is described as one side of the hyperplane H_f^α. This set is convex as is its interior (the open half-space for whose definition ≤ is replaced by <). One also has the half-space H_f^{α+} := {x : f(x) ≥ α} = H_{−f}^{(−α)−}.

FIGURE 3
9. In addition to the standard Euclidean ball B in R^n we may consider the ℓ_p ball B(p) which is defined by:
\[
B(p) := \Bigl\{x : \sum |\xi_i|^{p} \le 1\Bigr\}.
\]
If 1 ≤ p, then this set is convex.

There will be more examples of convex sets in Sections IV and V, where we discuss general methods of constructing them and of getting new sets from old ones.

IV. DESCRIPTIONS OF CONVEX SETS

How should a convex set be specified? How does one decide whether a given point is inside a particular convex set or not? There are more sophisticated computational versions of these questions: What are the most efficient algorithms for describing a convex set or for deciding whether a point is inside or not? These are difficult questions which will not be tackled here. Instead, we give a variety of answers to the more general questions, any one of which may be the best for a particular situation.

The intersection of an arbitrary collection of convex sets is convex. Since any set is contained in at least one convex set (the whole vector space in which it sits), it follows that any set, A, is contained in a smallest convex set, namely the intersection of all the convex sets that contain A. It is called the convex hull of A and is written coA. Thus,
\[
\mathrm{co}A := \bigcap K
\]
where the intersection is taken over all convex sets K with A ⊆ K. One visualizes the convex hull as the set obtained by stretching a rubber sheet around the set A or, in two dimensions, an elastic band (see Fig. 3).

Therefore, one may describe a convex set either as the intersection of certain simpler convex sets or as the convex hull of some simpler set.

The convex hull of a finite set is called a convex polytope. Sometimes more general types of polytopes may be considered but here the word will always denote a closed, convex set and extra adjectives will be dropped.

Alternatively, a polytope may be described as the intersection of finitely many closed half-spaces. There is a difference between the two notions. The convex hull of finitely many points is always bounded; the intersection of half-spaces may not be. A bounded polytope that has an interior may be described either by the points of which it is the convex hull or by the bounding hyperplanes. This is the first example of the duality relationship discussed in Section V.

Examples. The standard simplex is the convex hull of the finite set {0, e_1, e_2, . . . , e_n}. The standard octahedron is the convex hull of the finite set {±e_1, ±e_2, . . . , ±e_n}. The standard cube is the intersection of the following half-spaces: {x : f_i(x) ≤ 1} and {x : −f_i(x) ≤ 1}, where f_i(x) = ξ_i and i = 1, 2, . . . , n.

Here we digress to discuss the dimension of a convex set. An affine combination of vectors {x_1, x_2, . . . , x_k} is a linear combination α_1 x_1 + α_2 x_2 + · · · + α_k x_k in which Σ α_i = 1. If K is a convex set, then the set of all affine linear combinations of elements of K is called the affine hull of K. If 0 ∈ K, then the affine hull of K is the same as the set of all linear combinations (because one can add a suitable multiple of 0 to make the coefficients sum to 1). Hence, in this case, the affine hull of K is a subspace (containing K). In the general case, the affine hull of K is a flat (the older term is affine variety), i.e., the translate of a subspace. Subspaces and hence flats have a well-defined dimension. The dimension of a convex set is the dimension of its affine hull.

Related to this notion are the following concepts. First, a finite set of points with n elements is said to be in general position if its affine hull (or, equivalently, its convex hull) has dimension (n − 1) which is the maximum possible. Two distinct points are always in general position, three points if they are not collinear, four points if they are not

coplanar, and so on. The convex hull of (n + 1) points in general position (necessarily in R^m with m ≥ n) is an n-simplex. To see that there is an affine map of this set onto S_n observe that there is a translation that takes one point to 0 and then a linear map that takes the remaining n to the usual basis vectors. Second, the relative interior of K is the interior when K is regarded as a subset of its affine hull. The relative interior of a non-empty convex set is always non-empty.

Definition. A face F of a convex set K is a convex subset of K with the property that if y is in F and if y can be represented in the form y = αx_1 + (1 − α)x_2 with x_1 and x_2 in K and 0 < α < 1, then, in fact, x_1 and x_2 are in F. A more geometrical description is that if an open line segment in K contains points of F then the whole line segment lies in F. If P is an n-dimensional polytope in R^n, then its 0-dimensional faces are called vertices, its one-dimensional faces are called edges, and the (n − 1) dimensional faces will be called facets (this latter term is not universally used).

Examples. The faces of the standard cube C_3 in R^3 are the cube itself, the six facets of the cube (the sets where one fixed coordinate has a prescribed value from {−1, 1}), the 12 edges of the cube (the sets where two fixed coordinates have prescribed values from {−1, 1}), and the eight vertices (the sets where all three coordinates have prescribed values from {−1, 1}).

The faces of the standard simplex S_3 in R^3 are S_3; the four facets of the simplex (the intersections of S_3 with the planes {x : ξ_i = 0} and {x : Σ ξ_i = 1}); the six edges (the line segments [0, e_i], [e_1, e_2], [e_2, e_3], [e_3, e_1]); and the four vertices {0}, {e_i}.

Faces of a convex set K that are single points {z} are called extreme points of K.

Theorem (Minkowski). A convex body in R^n is the convex hull of its extreme points.

There is an infinite dimensional extension of this theorem due to Krein and Milman which states that each compact convex set in a normed linear space is the closed convex hull of its extreme points.

A face of a convex set that is a ray is called an extreme ray. Another generalization of Minkowski’s theorem is that any closed convex set that contains no lines is the convex hull of its extreme points and extreme rays.

The set of extreme points of a compact set need not be closed. This is shown by the following example.

Example. In R^3 let K be the convex hull of the circle {x : ξ_1^2 + ξ_2^2 = 1, ξ_3 = 0} and the points (1, 0, 1) and (1, 0, −1). K looks like a double slanted cone. The extreme points are (1, 0, 1), (1, 0, −1), and all the points of the circle except (1, 0, 0).

The next way to describe a convex set is to first suppose that the origin is an interior point of the set. This effectively means that the set has interior points because one can always either translate the set or choose the origin appropriately. Then one describes the set by saying how far in any direction the boundary is from the origin.

Definition. If K is a convex set with 0 as an interior point, then the radial function of K, r_K(x), is defined by:

r_K(x) := sup{λ : λx ∈ K}.

A slight variant of this is to say how much K has to be dilated to contain a given vector.

Definition. If K is a convex set with 0 as an interior point, then the gauge function (or Minkowski functional) of K, g_K(x), is defined by:

g_K(x) := inf{λ ≥ 0 : x ∈ λK}.

It is evident that 1/r_K(x) = g_K(x) (with 1/∞ = 0 if K is unbounded). The function g_K has the advantage of being both homogeneous and subadditive (i.e., sublinear). If K is bounded, then the radial function is finite for all x ≠ 0 and the gauge function is non-zero for x ≠ 0. If, in addition, K is symmetric about the origin so that g_K(−x) = g_K(x), then the gauge function has the properties of a norm. This leads to the use of convexity in the study of normed spaces. If K is closed, it is described as K = {x : g_K(x) ≤ 1}. The boundary of K (denoted by ∂K) is the set of points for which g_K(x) = r_K(x) = 1.

Instead of thinking of the boundary of K as a set of points, we may also regard it as an envelope of half-spaces (in the same way that a polytope was described as a finite intersection of half-spaces).

Definition. A hyperplane H_f^α is said to be a support hyperplane for a convex body K if (1) K ∩ H_f^α ≠ ∅, and (2) K is contained in one of the half-spaces H_f^{α+}, H_f^{α−}.

If K is a convex body, if f is a linear functional, and if α := max{f(x) : x ∈ K}, then H_f^α is a support hyperplane of K. This leads to the idea of the support function of K.

Definition. If K is a closed convex set then the support function of K, h_K(f), is defined by:

h_K(f) := sup{f(x) : x ∈ K}.

The support function is a sublinear function. If K is bounded, then h_K is finite for all f and we may replace sup by max. If 0 ∈ K, then h_K is non-negative, and if 0 is an interior point, then h_K is strictly positive. If K is symmetric, then h_K is a norm on the dual space. There are generalizations of this idea to infinite dimensional spaces, but there

the dual space has a slightly different connotation: one must restrict attention to continuous linear functionals.

Most books on convexity that restrict attention to finite dimensional spaces identify the dual space (R^n)∗ with R^n by means of the inner product. By that we mean that for each linear functional f there is a vector y_f such that f(x) = ⟨y_f, x⟩ where the symbol ⟨· , ·⟩ denotes the inner product. In this case, the definition of support function is given in terms of the inner product.

To interpret the support function geometrically, we restrict our attention to linear functionals f with length 1. In this special case, the α that appears in the equation of the hyperplane:

f(x) = α

represents the perpendicular distance from the origin to the hyperplane. Therefore, the support function (restricted to linear functions of length 1) represents the perpendicular distance from the origin to the supporting hyperplane of K that is in the direction f (see Fig. 4).

FIGURE 4

The following theorems are important.

Theorem. If f is a sublinear functional on R^n, then there is a convex body K such that f = h_K.

Theorem. If K is a non-empty convex body, then K = {x : f(x) ≤ h_K(f) for all linear functionals f}.

The proof of the second theorem requires the separation theorem from Section VII. There are a variety of proofs of the first theorem (see Schneider, Reference 4).

V. NEW CONVEX SETS FROM OLD

A. The Convex Hull

We begin this section with a second look at the operation of the convex hull, the intersection of all convex sets containing A. Rather than cut down to the convex hull from outside the set, we may also build up the convex hull from inside.

If x_1, x_2, . . . , x_k is a finite set of vectors, then y is said to be a convex combination of these vectors if:

y = λ_1 x_1 + λ_2 x_2 + · · · + λ_k x_k with λ_i ≥ 0 and Σ λ_i = 1.

The definition of convexity uses only two points but, by induction, one shows that if K is convex then every convex combination of points in K is also in K. If A is not convex, every convex combination of points from A is in coA. Finally, one shows that the set of all convex combinations of points from A is a convex set. Therefore, coA is precisely the set of all convex combinations of points from A.

Note that k, the length of the convex combination, is arbitrary but Carathéodory’s Theorem in Section VII shows that in finite dimensional spaces this is really not so.

Theorem. The convex hull of a compact set is compact and of an open set is open.

Theorem. If A is a non-empty bounded set in R^n, then the diameter of coA is the same as the diameter of A. (Here, “diameter” means the supremum of the distances between points of A.)

A nice application of the ideas here is the Gauss-Lucas Theorem. A short and elegant proof can be found in Webster (Reference 5).

Theorem. If p(z) is a non-constant polynomial, then the roots of p′(z) (the derivative of p) are contained in the convex hull of the roots of p(z).

B. The Polar of a Convex Set

By definition, the convex hull operator assigns to each set a convex set. The next very important operation does the same thing.

Definition. If A is a non-empty subset of X, then the polar of A, denoted by A◦ (or A∗), is defined by:

A◦ := {f ∈ X∗ : f(x) ≤ 1 for all x ∈ A}.

If A ⊆ X, then A◦ ⊆ X∗. Often, the distinction between X and X∗ is obscured and the inner product is used in the definition of A◦. Similarly, for a non-empty set B in X∗ we have:

B◦ := {x ∈ X : f(x) ≤ 1 for all f ∈ B}.

Repeated applications of this operation are indicated without parentheses, thus A◦◦, B◦◦, and so on. For all sets A (in either X or X∗) we have A ⊆ A◦◦. The operation

reverses inclusions: If A_1 ⊆ A_2, then A_2◦ ⊆ A_1◦. It follows that A◦ = A◦◦◦ always.
The definition of B◦ reveals it to be an intersection of closed half-spaces so it is always a closed, convex set in X that contains 0. The same holds for A◦ in X∗. If A = {0}, then A◦ = X∗ and X◦ = {0}. If A is a single point other than 0, then A◦ is a half-space. Thus, the duality between points and half-spaces (encountered in Section IV) is implemented by this operation. However, if A is a half-space, then A◦ is not a singleton but a line segment joining the expected point to 0. Because A◦ is always a closed, convex set containing the origin, in order to get an exact duality we must restrict attention to this class of sets.

Theorem. If K is a closed, convex set with 0 ∈ K, then K = K◦◦.

This is a most important theorem whose proof relies on the separation theorem in Section VII. For an arbitrary set A, A◦◦ is the closed convex hull of A and 0.
On the collection of closed convex sets that contain 0, the polar mapping is one to one and maps this collection onto the corresponding collection in X∗. If 0 is an interior point of K, then K◦ is compact. If K is compact, then K◦ has 0 as an interior point. Thus, the polar operation also maps the class of compact convex sets with 0 as an interior point onto the same class in X∗. If K is also symmetric about 0, then so is K◦. In this last case, K plays the role of the unit ball in a normed space and K◦ is the dual ball in the dual space X∗.

Examples. If B is the unit ball in X, then B◦ is the unit ball in X∗.
If C_n is the standard cube in R^n, then C_n◦ is the standard cross-polytope and conversely.
If S_n is the standard simplex, then S_n◦ is the following unbounded set. If f = (φ_1, φ_2, . . . , φ_n), then f ∈ S_n◦ if and only if φ_i ≤ 1 for all i.
Finally, this duality also connects the gauge (and radial function) of K with the support function of K◦.

Theorem. If K is a convex body with 0 as an interior point, then the gauge function of K◦ is the support function of K, and the gauge function of K is the support function of K◦.

Since the radial function is the reciprocal of the gauge function, this is readily rewritten using the radial functions.

C. The Collection of Convex Sets as a Lattice

The intersection of two convex sets is again a (possibly empty) convex set. It is the largest convex set contained in both of them. The union of two convex sets need not be convex, but the convex hull of the union is the smallest convex set that contains both of them. Therefore, we define the following binary operations:

    K_1 ∧ K_2 := K_1 ∩ K_2
    K_1 ∨ K_2 := co(K_1 ∪ K_2).

With these operations, the collection of convex sets is a lattice. The underlying order relation is that of inclusion. There are two important sublattices: the collection of closed convex sets that contain 0 and the collection of compact convex sets with 0 as an interior point. On each of these sublattices, the polar map is a bijection that reverses inclusion and hence reverses the lattice operations:

    (K_1 ∧ K_2)◦ = (K_1 ∩ K_2)◦ = K_1◦ ∨ K_2◦ = co(K_1◦ ∪ K_2◦)
    (K_1 ∨ K_2)◦ = (co(K_1 ∪ K_2))◦ = K_1◦ ∧ K_2◦ = K_1◦ ∩ K_2◦.

D. Algebraic Operations on Convex Sets

In addition to the operations just discussed, vector operations can be performed on the collection of convex sets. These are done “elementwise.”

Definition. If K_1 and K_2 are convex sets in a vector space X and λ is a non-negative scalar, then:

    K_1 + K_2 := {x : x = x_1 + x_2 with x_i ∈ K_i}

and

    λK_1 := {λx : x ∈ K_1}.

The reason for the restriction to λ ≥ 0 is twofold. First, although we may define (−1)K, it may be (if K is symmetric) that (−1)K = K. Furthermore, the distributive law:

    (λ + µ)K = λK + µK

only holds generally if λ, µ ≥ 0. Hence, there is no such thing as K + (−1)K = {0}. With the restriction to non-negative scalars, all the usual algebraic identities are valid. Despite this, we shall occasionally have need to talk about −K := (−1)K below. Any non-negative linear combination of convex sets is again convex. This is the basis of what is called the Brunn-Minkowski Theory of convex sets (see Schneider⁴).

Examples. The sum of a convex set K with a single point {x_0} is the same as the translation of K by x_0. In general, K + L is the union of translates of K by points in L (or vice versa). Hence, one way to think of this operation is to visualize one of the convex sets K with its boundary ∂K and the other one L with the origin as a reference point somewhere in L. First translate L so that the reference point lies on ∂K and then slide L round K (just

by translation) so that the reference point stays on ∂K and eventually has traversed all of it. The convex body swept out in this process is K + L.
In R^2, the sum of the line segments [−e_1, e_1] and [−e_2, e_2] is the square C_2. If, in R^3, we now add the line segment [−e_3, e_3], we get the standard cube in R^3. In general, the cube C_n is the sum of the line segments [−e_i, e_i] for i = 1, 2, . . . , n.
In R^2, if we add the line segment [(−1, −1), (1, 1)] to the square C_2 we get a hexagon with vertices at (2, 0), (2, 2), (0, 2) and their negatives.
The sum of a finite number of line segments is a special type of polytope called a zonotope. These have a number of significant properties (see Section X). The sum of the cube C_3 and a multiple λB of the unit ball is shown in Fig. 5.

FIGURE 5

One of the reasons that the support function is so important is that it behaves well for these operations:

    h_{K_1+K_2} = h_{K_1} + h_{K_2} and h_{λK} = λ h_K.

For the gauge and radial functions, we have:

    g_{λK} = λ^{−1} g_K and r_{λK} = λ r_K.

For the polar operation, likewise:

    (λK)◦ = λ^{−1}(K◦).

Finally, these operations can be defined for arbitrary sets, and then the convex hull operation is also well behaved:

    co(A + B) = co(A) + co(B) and co(λA) = λ co(A).

E. Operations that Increase Dimension

So far we have been concerned with operations that take place within one fixed vector space X or from that space to its dual X∗. We now consider two vector spaces X and Y and their Cartesian product X × Y, which has dimension equal to the sum of the dimensions of X and Y (assuming all are finite dimensional). There is a natural embedding of X into X × Y that sends a vector x to (x, 0) and similarly for Y. Then X × Y is the direct sum of these embedded spaces.
If K is a convex set in X and L is one in Y, then K × L (the Cartesian product) is a convex set in X × Y. However, if we apply the embedding maps to K and L so that both can be thought of as lying in X × Y, then K × L = K + L. In this sense, the cube C_n can be thought of either as a sum of line segments (as above) or as the Cartesian product of the interval [−1, 1] with itself n times.

Examples. The product of the unit ball in R^2 and the interval [−1, 1] in R is the standard circular cylinder in R^3. Likewise, the product of any convex set K with a line segment is a cylinder. In the case when K is a polytope, the word “prism” is often used instead of cylinder.

Another operation that can be performed on K and L as above is the suspension operation. As just indicated, we may think of both K and L as embedded in X × Y and then K ∗ L, the suspension of K and L, is defined to be co(K ∪ L) = K ∨ L.

Example. The standard cross-polytope is built up from the interval [−1, 1] by repeated suspension operations.

If we are careful to interpret the polar operations in the appropriate spaces, we then get:

    (K + L)◦ = K◦ ∗ L◦ and (K ∗ L)◦ = K◦ + L◦.

F. Symmetrizing Operations

The question of symmetry is an important one. A convex set is said to be symmetric about the origin if K = −K. It is often useful to operate on a convex set in such a way as to make it more symmetrical. For example, to prove certain inequalities where some quantity is maximized by a ball, one may be able to show that the quantity increases under a symmetrizing operation.
The first of these is quite simple. We define the difference set of K to be the convex set D(K) := K − K := K + (−1)K. The set D(K) is always symmetric about 0. One readily checks that the support function of −K is given by h_{−K}(f) = h_K(−f), hence h_{D(K)}(f) = h_K(f) + h_K(−f). For linear functionals f of norm 1, this last quantity is defined to be the width of K in the direction f and is denoted by w_K(f). It represents the distance between parallel supporting hyperplanes with normal direction f (see Fig. 6).
A set is said to be of constant width if w_K(f) is constant for all f with ‖f‖ = 1; that is, if h_{D(K)} (restricted to those f with ‖f‖ = 1) is constant, which is so if and only if D(K) is a multiple of the unit ball B. We shall say more about these sets in Section X. For now, note that the only convex set of constant width that is symmetric about 0 is the ball.
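The support-function identities of Section V.D and the relation h_{D(K)}(f) = h_K(f) + h_K(−f) = w_K(f) can be checked numerically for planar polytopes. The following Python sketch is only an illustration under stated assumptions: functionals are identified with vectors through the Euclidean inner product, polytopes are given by generating points, and the identity co(A + B) = co(A) + co(B) is used so that pairwise sums of generators suffice. All function names are hypothetical.

```python
from math import cos, sin, pi, isclose

def support(points, u):
    """Support function h_K(u) of K = co(points): the largest inner product <x, u>."""
    return max(x * u[0] + y * u[1] for x, y in points)

def minkowski_sum(P, Q):
    """Generating points of co(P) + co(Q): all pairwise sums of generators."""
    return [(px + qx, py + qy) for px, py in P for qx, qy in Q]

def scale(P, lam):
    """Generating points of the multiple lam * K."""
    return [(lam * x, lam * y) for x, y in P]

def width(P, u):
    """Width of K in the unit direction u: h_K(u) + h_K(-u)."""
    return support(P, u) + support(P, (-u[0], -u[1]))

square = [(-1, -1), (1, -1), (1, 1), (-1, 1)]            # the square C_2
segment = [(-1, -1), (1, 1)]                             # the segment [(-1,-1), (1,1)]
hexagon = minkowski_sum(square, segment)                 # generators of C_2 + segment
difference = minkowski_sum(square, scale(square, -1))    # generators of D(C_2)

# Sample unit directions and test the three identities in each of them.
units = [(cos(2 * pi * k / 360), sin(2 * pi * k / 360)) for k in range(360)]
for u in units:
    # h_{K1+K2} = h_{K1} + h_{K2}
    assert isclose(support(hexagon, u),
                   support(square, u) + support(segment, u), abs_tol=1e-12)
    # h_{lam K} = lam * h_K for lam >= 0
    assert isclose(support(scale(square, 3), u),
                   3 * support(square, u), abs_tol=1e-12)
    # h_{D(K)}(u) = h_K(u) + h_K(-u) = w_K(u)
    assert isclose(support(difference, u), width(square, u), abs_tol=1e-12)

print(width(square, (1.0, 0.0)))   # width of the square in the direction e_1
```

The generators of the hexagon include (2, 0), (2, 2), (0, 2) and their negatives, in agreement with the example in Section V.D; the remaining pairwise sums lie inside the hexagon and do not affect the support function.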

FIGURE 6

Examples. From some points of view, the most asymmetric convex set is the standard simplex. The difference body is rather regular. The difference body of the standard two-simplex S_2 is an affine regular hexagon and that of S_3 is an affine regular cuboctahedron (see Fig. 7).
The most well-known and simplest example of a non-symmetric set of constant width is the Reuleaux triangle (see Section X for a definition of this set; see Fig. 8 for a representation of it).

The second symmetrizing operation is more complicated and requires an inner product and hence a notion of orthogonality. It is due to Jakob Steiner, is called Steiner symmetrization, and deals with symmetry with respect to a hyperplane.
Let K be a compact convex set in X. We wish to symmetrize K with respect to a hyperplane H in X. For each point x ∈ H let ℓ(x) denote the line through x orthogonal to H. The intersection ℓ(x) ∩ K is a line segment, a point, or is empty. If it is a line segment, let S(x) be the translation of ℓ(x) ∩ K whose midpoint is at x; if the intersection is a point, then S(x) = x; otherwise, S(x) is empty. The Steiner symmetral of K with respect to H, S_H(K), is defined by the equation:

    S_H(K) := ∪_{x∈H} S(x).

It turns out that S_H(K) is convex. If K has interior, then so does S_H(K). If K and L are compact convex sets, then S_H(K) + S_H(L) ⊆ S_H(K + L). The most important property of this symmetrization is that, given any convex body K, one may, by symmetrizing successively with a suitable sequence of hyperplanes, obtain a convex body that approximates a ball to within any degree of accuracy. Although we have not defined these measures, we also point out here that Steiner symmetrization leaves volume unchanged and does not increase either the surface area or the diameter of the set.
The third such operation is similar but reverses the roles of line and hyperplane. If ℓ is a line, for each point x ∈ ℓ let H(x) be the hyperplane through x orthogonal to ℓ. If H(x) ∩ K ≠ ∅, let B(x) be the (n − 1)-dimensional ball centered at x and lying in H(x) whose (n − 1)-dimensional volume (see Section VIII) is the same as that of H(x) ∩ K. Let S_ℓ(K) := ∪_{x∈ℓ} B(x). This object is also convex. It is called the Schwarz symmetral of K.

FIGURE 7
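The Steiner symmetral is easy to compute in the plane, where the hyperplane H is a line. The following Python sketch is an illustration only, under the assumptions that K is a convex polygon given by its vertices in order and that H is the x-axis; the function names are hypothetical. It uses the fact that the chord length of a convex polygon is linear between consecutive vertex abscissae, so the symmetral is again a polygon with vertices above and below those abscissae.

```python
from itertools import combinations
from math import dist

def chord_length(poly, x):
    """Length of the intersection of the vertical line through (x, 0) with a convex polygon."""
    ys = []
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if min(x1, x2) <= x <= max(x1, x2):
            if x1 == x2:                       # vertical edge: both endpoints lie on the line
                ys += [y1, y2]
            else:                              # interpolate along the edge
                t = (x - x1) / (x2 - x1)
                ys.append(y1 + t * (y2 - y1))
    return max(ys) - min(ys) if ys else 0.0

def steiner_symmetral(poly):
    """Vertices of the Steiner symmetral of a convex polygon with respect to the x-axis."""
    xs = sorted({x for x, _ in poly})
    half = [(x, chord_length(poly, x) / 2) for x in xs]
    lower = [(x, -h) for x, h in half]                      # lower boundary, left to right
    upper = [(x, h) for x, h in reversed(half) if h > 0]    # upper boundary, right to left
    return lower + upper

def area(poly):
    """Shoelace formula."""
    s = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))
    return abs(s) / 2

def perimeter(poly):
    return sum(dist(p, q) for p, q in zip(poly, poly[1:] + poly[:1]))

def diameter(poly):
    return max(dist(p, q) for p, q in combinations(poly, 2))

triangle = [(0, 0), (4, 0), (1, 3)]
symmetral = steiner_symmetral(triangle)

print(area(triangle), area(symmetral))            # area is preserved
print(perimeter(triangle), perimeter(symmetral))  # perimeter does not increase
print(diameter(triangle), diameter(symmetral))    # diameter does not increase
```

For this triangle the symmetral is a kite with vertices (0, 0), (1, ±1.5), and (4, 0), which is symmetric about the x-axis as required.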

G. Other Operations

Two more ways to get new convex sets from old ones ought to be mentioned. The first is the projection body operator Π. There are several ways to define this operation; the simplest way is to use orthogonal projections. If K is a convex body in X, for each unit vector u in X consider the projection of K onto the hyperplane H_u orthogonal to u. This projection is the set p_u(K) consisting of those points x ∈ H_u such that the line x + λu has a non-empty intersection with K. Now let h(u) be the (n − 1)-dimensional volume (see Section VIII) of p_u(K). It turns out that this function is subadditive. If it is extended to all of X by (positive) homogeneity, then it is sublinear and hence is the support function of a convex body in X∗ called the projection body of K, Π(K).
This is a most interesting object. It is always symmetric with respect to the origin. If K is a polytope, then Π(K) is not just a polytope but is a zonotope (a sum of line segments) which has all of its faces centrally symmetric.

Examples. The construction (explained in Section X) of projection bodies of polytopes is relatively easy. The projection body of C_3 is (up to a scaling) C_3 itself. The projection body of O_3 is a rhombic dodecahedron. This object has eight vertices that coincide with those of C_3 and six that are at ±2e_1, ±2e_2, ±2e_3. It has 12 facets that are all alike and are rhombi, hence the name.

The second operation is (in some not entirely clear way) a dual construction to Π. We begin with a convex body K such that 0 ∈ K. Let f be a norm 1 linear functional in X∗. Let H_f be the hyperplane that is the kernel of f. Let i_f(K) := K ∩ H_f and let r(f) be the (n − 1)-dimensional volume (see Section VIII) of i_f(K). Now let I(K) be the set in X∗ consisting of all the line segments [0, r(f) f]. In other words, by again extending r, we make it the radial function of I(K). It is an important theorem (whose proof is difficult) of Busemann that if K is symmetric about 0 then I(K) is convex. It is called the intersection body of K. For more general convex sets, I(K) is star shaped. Lutwak has shown that the “proper” setting for this operation is the class of star-shaped sets.

VI. SPACES OF CONVEX SETS

In the last section, various operations on the collection of all convex sets were considered. With the algebraic operations of addition and multiplication by (non-negative) scalars, the collection of convex sets has many of the attributes of a vector space. In this section, we show that the collection of compact convex sets is also a metric space. In fact, there are two metrics that can be considered. One is defined on all compact convex sets in a given vector space regardless of their dimension. The second distinguishes between points, line-segments, two-dimensional convex sets, and so on.

A. The Hausdorff Metric

If K and L are two compact convex sets, then they are bounded. Hence, there exist non-negative scalars λ, µ such that:

    K ⊆ L + λB and L ⊆ K + µB.

Now let λ_0 be the infimum (in fact, the minimum) of all such λ’s and similarly for µ_0. The numbers λ_0 and µ_0 may also be defined directly in terms of the norm (or distance) on X. Since K ⊆ L + λB means that every point of K lies within distance λ of L:

    λ_0 = max_{x∈K} min_{y∈L} ‖x − y‖

and similarly for µ_0, with the roles of K and L exchanged.

Definition. The Hausdorff metric δ on the set of compact convex sets is defined by the equation:

    δ(K, L) := max{λ_0, µ_0}.

Theorem. The function δ is a metric on the set of all compact convex sets in X.

If the support functions of convex sets are restricted to the unit ball in X∗, then the support function of a multiple λB of the unit ball is just the constant λ; hence, K ⊆ (L + λB) if and only if h_K ≤ h_L + λ, which implies that the Hausdorff metric between the sets is the same as the uniform metric between the support functions. All the operations discussed in the previous section are continuous with respect to this metric as are several important functions that are defined in later sections.
The most important fact about this metric is a compactness result, known as the Blaschke selection theorem (the name refers to the selection of a convergent subsequence from a given bounded sequence).

Theorem (Blaschke). The set of compact convex sets contained in some ball B[x, r] and equipped with the Hausdorff metric is compact.

Various collections of convex sets, for example the collection of all ellipsoids, form closed subsets in this metric. Therefore, a bounded sequence of ellipsoids has a subsequence that converges to another ellipsoid (provided we allow degenerate cases such as points and line segments). Other important classes are dense in this metric. For example, the collection of all polytopes is dense. Therefore, any convex body can be approximated as closely as we please (with respect to the Hausdorff metric) by a polytope. This, together with the continuity of various functions, means that the proof of a theorem can often be

accomplished by first proving the result for polytopes and then extending it to all convex bodies “by continuity.”
A convex set is said to be strictly convex if its boundary does not contain any line segment (of positive length). A convex set is said to be smooth if there is a unique supporting hyperplane at each point of its boundary. The sets of smooth, of strictly convex, and of both smooth and strictly convex bodies are all dense in the set of all convex bodies.

B. The Banach-Mazur Metric

It is sometimes appropriate to consider that convex sets of different dimension are infinitely far apart. This section is concerned with metrics of this sort. We limit our attention to convex bodies with 0 as an interior point.
If K and L are two convex bodies with 0 in their interiors, then there are scalars λ and µ such that:

    K ⊆ λL and L ⊆ µK.

As before, we can now take λ_0 and µ_0 to be the minimal such λ and µ. Then set Δ(K, L) := λ_0 µ_0. The fundamental result is now John’s Theorem.

Theorem (John). If K is a centrally symmetric convex body in an n-dimensional space X and if 0 is an interior point of K, then there is an ellipsoid E such that:

    E ⊆ K ⊆ √n E.

The standard cube C_n and cross polytope O_n show that √n cannot be improved. If we remove the condition that K be centrally symmetric, then we must replace √n by n. This bound is attained by the simplex S_n.
The ellipsoid E that appears in this result is of considerable interest. It is the ellipsoid of maximal volume contained in K and is called the Löwner-John ellipsoid. It occurs in linear and nonlinear programming in Khachiyan’s polynomial time algorithm and in Shor’s algorithm.
The functional Δ is not a metric for two reasons. First, the construction is multiplicative rather than additive; e.g., Δ(K, K) = 1 rather than 0. If we want a genuine metric we must take the logarithm of Δ. Since not all authors do so, one should be careful when reading the literature to see the precise definition.
Second, log(Δ(K, αK)) = 0; therefore, one should consider equivalence classes of multiples of K. However, it is more appropriate to enlarge the equivalence classes and say that the sets K_1 and K_2 are equivalent if there is an invertible linear map T such that T(K_1) = K_2. Finally, the definition of the Banach-Mazur metric is:

    Δ_BM(K, L) := inf{log[Δ(K, T(L))] : T is an invertible linear map}.

An infimum is used here because it is also a useful definition in the case of infinite dimensional spaces. Restricted to the finite dimensional case, the infimum is attained.
The functional Δ_BM is a metric on the equivalence classes of convex sets under the above equivalence relation. If we consider the norms generated by K and L, then the equivalence relation is one of isometry between normed spaces and Δ_BM measures the distance between equivalence classes of normed spaces.
John’s Theorem now says that the distance between the equivalence class containing K and the set of ellipsoids (which is the equivalence class containing B) is no more than (log n)/2. Hence, the distance between any two equivalence classes is no more than log n. If we allow non-symmetric sets, then these numbers are doubled. It is surprising that the exact diameter of these metric spaces is only known in the case of two dimensions.

VII. BASIC THEOREMS

A. Separation and Support Theorems

The notion of separation involves placing a hyperplane between two convex sets. There are varying degrees of separation that can be considered.

Definition. A hyperplane H = H_f^α is said to separate the convex sets K and L if K lies in one of the closed half-spaces determined by H, and L lies in the other. The separation is proper if it is not the case that both sets lie in H. The separation is strict if one can replace “closed” by “open” in the first sentence. Finally, the separation is strong if there exist α and β with α < β and K ⊆ H_f^{α−} and L ⊆ H_f^{β+} (or vice versa).

Separation Theorem. Let K be a convex set in R^n and suppose x is not in K; then K and x can be separated. If K is closed, then K and x can be strongly separated.

It follows from the second statement that every closed convex set can be represented as an intersection of closed half-spaces. It follows from the first statement that if K has a non-empty interior then at each point x of the boundary of K there is a supporting hyperplane obtained by separating x from the interior of K.
There is a converse to this theorem. If A is a closed set with a non-empty interior and if, at each boundary point, there is a supporting hyperplane, then A is convex.
The separation theorem can be generalized to the following: If K and L are two convex sets whose relative interiors are disjoint, then K and L can be separated. If one set is closed and the other is compact, then the separation is strong.
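For a closed convex set in R^n with the Euclidean inner product, one standard route to the strong separation of a point x ∉ K is through the nearest point p of K to x: the functional f = x − p satisfies f(y) ≤ f(p) < f(x) for every y ∈ K. The following Python sketch illustrates this under the simplifying assumption that K is an axis-parallel box, for which the nearest point is obtained by clipping coordinates. The function names are hypothetical, and this is not claimed to be the argument used in the references.

```python
from itertools import product

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_box(x, lo, hi):
    """Nearest point of the box {y : lo_i <= y_i <= hi_i} to x (coordinatewise clipping)."""
    return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]

def separate_point_from_box(x, lo, hi):
    """Return (f, alpha, beta) with f(y) <= alpha for all y in the box and f(x) = beta > alpha."""
    p = project_onto_box(x, lo, hi)
    f = [xi - pi for xi, pi in zip(x, p)]
    if not any(f):
        raise ValueError("x lies in the box, so it cannot be separated from it")
    return f, dot(f, p), dot(f, x)

lo, hi = [-1.0, -1.0], [1.0, 1.0]     # the square C_2 as a box
x = [3.0, 2.0]                        # a point outside the square
f, alpha, beta = separate_point_from_box(x, lo, hi)

# A linear functional attains its maximum over a box at a vertex,
# so it is enough to check the vertices.
vertices = list(product(*zip(lo, hi)))
assert all(dot(f, v) <= alpha for v in vertices)
assert alpha < beta
print(f, alpha, beta)
```

Here the box lies in the half-space {y : f(y) ≤ α} and the point in {y : f(y) ≥ β} with α < β, which is the strong separation of the definition above.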

The proof of the seemingly more general statement follows from the earlier one because K and L can be (strongly) separated if and only if K − L and {0} can be (strongly) separated. Also, if one set is compact and the other closed, then K − L is closed. This is not true in general for two closed sets.

Example. In R^2 the sets K := {(x, y) : y = 0} and L := {(x, y) : x > 0 and y ≥ 1/x} are both closed and convex. They cannot be strongly (or even strictly) separated, and the set K − L is open.

These theorems have very important analogs in infinite dimensional spaces. In that setting, they are all consequences of the Hahn-Banach Theorem, which is one of the most important theorems of functional analysis.

B. Carathéodory’s Theorem and Its Relatives

The following theorems are all closely related, but the Carathéodory result appears the most fundamental.

Theorem (Carathéodory). If A is a subset of an n-dimensional space and if x ∈ coA, then x can be expressed as a convex combination of (n + 1) or fewer points of A.

Another way of phrasing the conclusion is to say that x is a convex combination of a set of points in general position. Another is to say that x lies in a simplex whose vertices are in A. Thus, when constructing the convex hull, the length of the convex combinations needed is bounded (in finite dimensional spaces).

Theorem (Radon). If a finite set F of points in an n-dimensional space is not in general position, then it may be decomposed into two disjoint subsets F_1 and F_2 such that co(F_1) ∩ co(F_2) ≠ ∅.

In particular, this is true of any set of at least (n + 2) points.

Theorem (Helly). Let K_1, K_2, . . . , K_m be a finite family of convex sets in an n-dimensional space (m ≥ n + 1). If every subfamily with exactly (n + 1) members has a non-empty intersection, then the whole family has a non-empty intersection.

Eggleston¹ shows how Helly’s Theorem can be derived from Carathéodory’s and conversely.
Since any family of compact sets has a non-empty intersection if every finite subfamily does, there is an easy extension to infinite families of compact convex sets. If an arbitrary family of compact convex sets in an n-dimensional space is such that every subfamily with (n + 1) members has a non-empty intersection, then so does the whole family. A transversal for a family of sets is a line that meets every member of the family.

Corollary. If F is a finite family of parallel line segments in R^2 such that every three of them have a transversal, then the whole family has a transversal.

Corollary. Let F be a finite family of convex sets in R^n and let K be a convex set. If for every finite subfamily with (n + 1) elements there is a translate of K that intersects each member of the subfamily, then there is a single translate of K which intersects every member of the whole family.

Theorem (Kirchberger). Let F_1 and F_2 be finite sets in an n-dimensional space such that F_1 ∪ F_2 has at least (n + 2) elements. Suppose that for every subset F of F_1 ∪ F_2 with exactly (n + 2) points the sets F ∩ F_1 and F ∩ F_2 can be strictly separated; then F_1 and F_2 can be strictly separated.

Webster⁵ gives an elegant proof of Jung’s theorem also based on Helly’s theorem.

Theorem (Jung). Every set A in R^n with diameter 1 is contained in a closed ball of radius no more than √(n/(2n + 2)).

Finally, we give Krasnosel’skiǐ’s Theorem (sometimes called the “Art Gallery” Theorem).

Theorem (Krasnosel’skiǐ). Let A be a compact subset of R^n. If, for every (n + 1) points a_1, a_2, . . . , a_{n+1} of A, there is a point x of A such that the line segments [x, a_i] all lie in A, then A is star shaped.

In other words, in an art gallery, if for every finite set of (n + 1) pictures there is a point in the gallery from which one can see all (n + 1), then there is a point from which one can see the whole art gallery.
All of these theorems have been much generalized. For one collection of such results, see the article by J. Eckhoff in the Handbook of Convex Geometry.²

VIII. VOLUMES AND MIXED VOLUMES

For many of the more interesting properties of convex sets it is necessary to measure them in some way. Such measurement requires more advanced ideas than were presented in Section I. The basic concept is that of volume in an n-dimensional space. There are several approaches which coincide for compact convex sets but may not for more general types of sets.
The most straightforward approach is Eggleston’s,¹ which says that the volume of a convex set in R^n is its n-dimensional Lebesgue measure. The volume of a set in R^n will be denoted by V_n(K). One-dimensional volume is usually called length, and V_2 is usually called area. In

n-dimensional space, V_{n−1} is also frequently referred to as area. No confusion should arise.
The volume functional takes values in the extended real numbers and has a number of very important properties:

1. It is non-negative: V_n(K) ≥ 0 for all convex sets K in R^n, and it is strictly positive if K has a non-empty interior (has dimension n).
2. It is countably additive, in the sense that, if {K_m : m = 1, 2, . . .} is a sequence of disjoint convex sets, then V_n(∪_m K_m) = Σ_m V_n(K_m).
3. It is finite for compact sets; V_n(K) < ∞ if K ⊆ R^n is compact.
4. It is continuous with respect to the Hausdorff metric.
5. It is monotonic; if K ⊆ L, then V_n(K) ≤ V_n(L) (this follows from property 2).
6. It is translation invariant; V_n(K + x) = V_n(K).
7. For linear transformations T, it has the property that V_n(T(K)) = |det T| V_n(K). In particular:
8. V_n(λK) = λ^n V_n(K) and also V_n is invariant under rigid motions.

The important fact from the theory of Haar measure is that, up to a scalar factor, there is only one functional that has properties 1, 2, 3, and 6.
However, if we relax property 2 a little, then a number of important functions are relevant to the theory of convex sets.
A real-valued function v defined on a collection of sets S is said to be a valuation if:

    v(K ∪ L) + v(K ∩ L) = v(K) + v(L)

whenever K, L, K ∪ L, and K ∩ L are all in S.
It follows that for finitely many disjoint sets, a valuation is additive; hence, if its values are non-negative, it is monotonic. Volume is a valuation.
The n-dimensional volume of a convex set may be calculated by integrating the (n − 1)-dimensional volumes of its cross sections in some particular direction. Thus, if f is a linear functional and if K is a convex set and if we let K_α := K ∩ H_f^α, then:

    V_n(K) = ∫_{−∞}^{∞} V_{n−1}(K_α) dα.

Examples. The above formula can be used inductively to prove that:

    V_n(B) = π^{n/2} / Γ(1 + n/2).

This number is abbreviated ω_n (many authors use κ_n).
There is a more general formula for the n-dimensional p-ball B(p):

    V_n(B(p)) = 2^n Γ(1 + 1/p)^n / Γ(1 + n/p).

The volume of the standard cube C_n is V_n(C_n) = 2^n.
The volume of the standard simplex S_n is V_n(S_n) = 1/n!.
The volume of the standard cross-polytope O_n is V_n(O_n) = 2^n/n!.
If K is a convex set such that K ⊆ H_f^α with α ≠ 0 and ‖f‖ = 1, then the pyramid with vertex at 0 and base K is the convex hull of K and 0. The volume of this pyramid is αV_{n−1}(K)/n.
The previous formula can be generalized. If P is a polytope with 0 as an interior point and if F_i, i = 1, 2, . . . , m, denote the facets of P and if α_i is the perpendicular distance from 0 to the hyperplane containing F_i, then:

    V_n(P) = (1/n) Σ_{i=1}^{m} α_i V_{n−1}(F_i).

With a notion of surface area, this can be further generalized to arbitrary convex sets:

    V_n(K) = (1/n) ∫_{S^{n−1}} h_K(f) dσ_K(f)

where σ_K is the “surface area measure” induced on the surface of the dual unit ball S^{n−1} by K (see Schneider⁴ for details).
Properties (7) and (8) show how V_n behaves for scalar multiples and under linear maps. The basic question, first considered by Brunn and Minkowski toward the end of the 19th century, is how does V_n behave with respect to the addition of sets; i.e., how is property 6 extended from singletons to general convex sets?
To see how this might work, consider the special case of K + λK; then:

    V_n(K + λK) = V_n((1 + λ)K) = (1 + λ)^n V_n(K) = Σ_{i=0}^{n} \binom{n}{i} λ^i V_n(K);

that is, we get a polynomial in λ whose coefficients are multiples of V_n(K). In fact, we always get a polynomial.

Theorem. If K and L are convex bodies, then V_n(K + λL) is a polynomial in λ of degree n whose coefficients are written in the following way:

    V_n(K + λL) = Σ_{i=0}^{n} \binom{n}{i} V(K, n − i; L, i) λ^i.

The numbers V(K, n − i; L, i) are called the mixed volumes of K and L. The numbers n − i and i are inserted in this notation because the mixed volumes are functions of n variables and here K and L occur n − i times and i times, respectively. Since this is true for a summand with two terms, it can be extended inductively to sums with an arbitrary number of terms. The essential feature

is that the volume of a linear combination $\sum \lambda_i K_i$ is a homogeneous polynomial in the $\lambda_i$'s of degree n whose coefficients are functions of precisely n of the $K_i$'s.

Example. Referring again to Figure 5, one sees that

$$V_3(C_3 + \lambda B) = V_3(C_3) + A(C_3)\lambda + (3\pi)L(C_3)\lambda^2 + V_3(B)\lambda^3$$

where $A(C_3)$ is the sum of the areas of the facets of the cube and $L(C_3)$ is the length of an edge of the cube and measures the width of the cube. Thus, in this instance, we have $V(C_3, C_3, C_3) = V_3(C_3)$, $V(C_3, C_3, B) = A(C_3)/3$, $V(C_3, B, B) = \pi L(C_3)$, and $V(B, B, B) = V_3(B)$.

Mixed volumes have the following properties:

1. They are non-negative.
2. They are monotonic in each variable.
3. They are homogeneous (for non-negative scalars) in each variable. Recall that with the notation $V(K, n-i; L, i)$ the variable K occurs n − i times.
4. They are additive in each variable.
5. They are translation invariant (in each variable).
6. They are continuous with respect to the Hausdorff metric.
7. $V(K, K, \ldots, K) = V_n(K)$.

The example shows that mixed volumes of the form $V(K, n-i; B, i)$ are closely related to the geometry of K. For this, the notation is modified.

Definition. The quermassintegrals or cross-sectional measures or Minkowski functionals of K are written $W_i(K)$ and are defined by $W_i(K) := V(K, n-i; B, i)$ (note the change of place of i).

Then, $W_0(K) = V_n(K)$, $W_1(K) = A(K)/n$, and $W_{n-1}(K) = \epsilon_n b(K)/2$, where A(K) is the surface area of K and b(K) is the mean width of K. These last two equations may either be taken as the definition of these quantities or, if they are defined in other ways, then as theorems. Perhaps surprisingly, even $W_n(K)$ has relevance to the set K: $W_n(K) = \epsilon_n \chi(K)$, where $\chi(K)$ is the Euler characteristic of K which (for convex sets) is always 1.

Since $W_1$ is the coefficient of $\lambda$ in a certain polynomial, it (and hence A(K)) can be obtained via differentiation in the usual way.

Theorem. The surface area A(K) of K is obtained by the formula:

$$A(K) = \lim_{\lambda \to 0} \frac{V_n(K + \lambda B) - V_n(K)}{\lambda}.$$

There are also integral formulas for both A(K) and b(K):

Theorem (Cauchy). If K is a convex body in $R^n$, then:

$$A(K) = \frac{1}{\epsilon_{n-1}} \int_{S^{n-1}} V_{n-1}(p_u(K))\,d\omega(u)$$

where $S^{n-1}$ is the surface of the unit ball, $d\omega$ is surface area measure (Lebesgue measure) on $S^{n-1}$, u is a unit vector, and $p_u(K)$ is the projection of K onto the subspace orthogonal to u.

Theorem. If K is a convex body in $R^n$, then:

$$b(K) = \frac{2}{n\epsilon_n} \int_{S^{n-1}} h_K(u)\,d\omega(u).$$

(This is often taken as the definition of b(K).)

There are various relationships between the functionals $W_i$ expressed as integral formulas.

All the functionals $W_i$ are valuations and, in a certain precise sense, these are all the valuations.

Theorem (Hadwiger). If v is a valuation on the collection of convex bodies that is invariant under rigid motions and is either continuous or monotonic, then:

$$v(K) = \sum_{i=0}^{n} \alpha_i W_i(K)$$

where the coefficients $\alpha_i$ are real if v is continuous and non-negative if v is monotonic. The first straightforward proof of this theorem was given in 1995 by Dan Klain.

IX. INEQUALITIES

There is a huge variety of inequalities that relate to convex sets in one way or another. We present a brief sample. The reader is referred to the work of Erwin Lutwak and, in particular, his article in the Handbook of Convex Geometry [2]. We do not always give the most general result (which may need more notation or definitions); many extensions can be found in Schneider [4]. The word "homothetic" often occurs in the conditions for equality. The sets A and B are homothetic if $A = \lambda B + a$ ($\lambda > 0$); i.e., A is the image of B under a translation and a (positive) dilation.

We begin with the fact that the nth root of the volume is a concave function.

Theorem (Brunn-Minkowski). If K and L are convex bodies with interior in $R^n$ and if $0 \le \alpha \le 1$, then:

$$V_n^{1/n}(\alpha K + (1 - \alpha)L) \ge \alpha V_n^{1/n}(K) + (1 - \alpha)V_n^{1/n}(L)$$

with equality if and only if K and L are homothetic.

Theorem (Minkowski inequality for mixed volumes). If K and L are convex bodies in $R^n$, then:

$$V(K, n-1; L, 1)^n \ge V_n(K)^{n-1} V_n(L)$$

with equality if and only if K and L are homothetic.

The power of this inequality is shown by the fact that, substituting the unit ball for L, one immediately gets the isoperimetric theorem for convex sets.

Theorem (Isoperimetric). If K is a convex body in $R^n$ with prescribed volume v, then $A(K) \ge A(B_v)$, where $B_v$ is the dilation of B with volume v. Moreover, equality holds if and only if K is a translate of $B_v$.

Corollary (Isoperimetric inequality). If K is a convex body in $R^n$, then:

$$\frac{A(K)^n}{V_n(K)^{n-1}} \ge \frac{A(B)^n}{V_n(B)^{n-1}} = n^n \epsilon_n.$$

Theorem (Urysohn). If K is a convex body in $R^n$, then:

$$\frac{b(K)^n}{V_n(K)} \ge \frac{b(B)^n}{V_n(B)} = \frac{2^n}{\epsilon_n}.$$

Since $b(K) \le D(K)$ (the diameter of K), we get the following corollary.

Corollary (Isodiametric inequality). If K is a convex body in $R^n$, then:

$$\frac{D(K)^n}{V_n(K)} \ge \frac{D(B)^n}{V_n(B)} = \frac{2^n}{\epsilon_n}.$$

In the last three inequalities, equality holds if and only if K is a ball.

The next set of inequalities relate the volume functional to the operations given in Section V. However, they are also affine inequalities because the quantities involved are unchanged by affine transformations.

Theorem (Blaschke-Santaló). If K is a convex body in $R^n$, then:

$$V_n(K) V_n(K^\circ) \le V_n(B) V_n(B^\circ) = \epsilon_n^2$$

with equality if and only if K is an ellipsoid.

The quantity $V_n(K) V_n(K^\circ)$ is called the volume product of K and is an affine invariant. There is a conjecture of Mahler on the lower bound for the volume product. It has been proved for zonoids (the closure, in the Hausdorff metric, of the set of zonotopes).

Theorem (Reisner). If K is a zonoid in $R^n$, then:

$$V_n(K) V_n(K^\circ) \ge V_n(C_n) V_n(C_n^\circ) = V_n(O_n) V_n(O_n^\circ) = 4^n/n!$$

with equality if and only if $K = C_n$ ($O_n$ is not a zonoid).

Theorem (Rogers-Shephard). If K is a convex body in $R^n$ then

$$2^n V_n(K) \le V_n(D(K)) \le \binom{2n}{n} V_n(K)$$

with equality on the left if and only if K is symmetric and on the right if and only if K is a simplex. (The left-hand inequality is trivial; it is the other one that is due to Rogers and Shephard.)

Theorem (Busemann intersection inequality). If K is a convex body in $R^n$ with 0 as an interior point, then:

$$\frac{V_n(I(K))}{V_n(K)^{n-1}} \le \frac{V_n(I(B))}{V_n(B)^{n-1}} = \frac{\epsilon_{n-1}^{\,n}}{\epsilon_n^{\,n-2}}$$

with equality if and only if K is an ellipsoid.

Theorem (Petty projection inequality). If K is a convex body in $R^n$, then:

$$V_n(K)^{n-1} V_n([\Pi(K)]^\circ) \le V_n(B)^{n-1} V_n([\Pi(B)]^\circ) = (\epsilon_n/\epsilon_{n-1})^n$$

with equality if and only if K is an ellipsoid.

Here, we insert results dealing with two problems that generated a great deal of research in the latter part of the 20th century: the Busemann-Petty problem and its "dual," the Shephard problem. Both deal with convex bodies symmetric about 0 because otherwise it is easy to give a negative answer to both questions.

Question (Busemann-Petty). If K and L are two centrally symmetric convex bodies in $R^n$ and if, for every linear functional f, we have:

$$V_{n-1}(K \cap H_f^0) \le V_{n-1}(L \cap H_f^0)$$

does it follow that $V_n(K) \le V_n(L)$?

The breakthrough on this problem came when Lutwak showed that the answer is "yes" if the body K is an intersection body (in a broader sense than our definition); otherwise, there will be a convex body L yielding a counterexample. The work of many authors and finally of Zhang, Gardner, and Koldobsky showed that the answer to the problem is "no" in all dimensions ≥ 5 but that for n = 2, 3, 4 the answer is "yes."

Question (Shephard). If K and L are two centrally symmetric convex bodies in $R^n$ and if, for every direction u, we have:

$$V_{n-1}(p_u(K)) \le V_{n-1}(p_u(L))$$

does it follow that $V_n(K) \le V_n(L)$?

Petty and Schneider showed that the answer is "no" in general but "yes" if the body L is a projection body. Since all symmetric convex bodies in $R^2$ are projection bodies, the answer is "yes" in $R^2$ but "no" in all higher dimensions.

Finally we mention two well-known inequalities that relate to the finite (and infinite) dimensional $\ell_p$-spaces.

Theorem (Hölder's inequality). If the numbers p ≥ 1 and q are related by the equation $p^{-1} + q^{-1} = 1$ and if $x = (\xi_i)$ and $y = (\eta_i)$ are two vectors in $R^n$, then:

$$\sum \xi_i \eta_i \le \left(\sum |\xi_i|^p\right)^{1/p} \left(\sum |\eta_i|^q\right)^{1/q}.$$

A consequence of this inequality is that of Minkowski.

Theorem (Minkowski's inequality). If x and y are vectors in $R^n$ and if $\|x\|_p := (\sum |\xi_i|^p)^{1/p}$, then

$$\|x + y\|_p \le \|x\|_p + \|y\|_p.$$

Therefore, the functional $\|x\|_p$ is a norm and the set B(p) defined in Example 9 in Section III is convex.

X. SPECIAL CLASSES OF CONVEX SETS

A. Polytopes

In this section we deal with bounded polytopes. Recall that such a polytope is a convex body that may be regarded either as the convex hull of finitely many points or, dually, as the intersection of finitely many half-spaces. Polygons and polyhedra have been studied since the beginnings of mathematics. The existence theorem of Minkowski is very important.

Theorem (Minkowski). If $\{u_1, u_2, \ldots, u_k\}$ is a set of dual unit vectors which do not all lie in a hyperplane and if $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ are positive real numbers such that:

$$\sum_{i=1}^{k} \alpha_i u_i = 0$$

then there is a polytope (unique up to translation) that has k facets whose areas are given by the $\alpha_i$'s and whose "normals" are given by the $u_i$'s.

There is a generalization of this theorem to general convex bodies determined by general "surface area measures" (see Schneider [4]).

By the facial structure of a polytope, we mean the collection of all of its faces classified by their dimension. If P is an n-dimensional polytope, then there is precisely one n-dimensional face, P itself. The empty set is usually included as the unique face of dimension −1. The 0-dimensional faces are the vertices, the one-dimensional faces are the edges, and the (n − 1)-dimensional faces are the facets. The f-vector of P is the vector $(f_0, f_1, f_2, \ldots, f_{n-1})$ where $f_i$ is the number of faces of dimension i. The faces of P form a lattice under the inclusion relation (this is why P and ∅ are included as faces). The lattice of faces of $P^\circ$ forms the dual lattice.

A key combinatorial problem is to characterize those vectors that are f-vectors of some polytope. Only in three dimensions has this problem been solved. The primary necessary condition for a vector to be an f-vector is that it satisfies the so-called Euler relation.

Theorem. If $f_i$ denotes the number of faces of dimension i of an n-dimensional polytope P, then

$$\sum_{i=0}^{n} (-1)^i f_i = 1.$$

The number 1 is the Euler characteristic of P.

Corollary. In three-dimensional space, the number of vertices $f_0$, of edges $f_1$, and of facets $f_2$ satisfy the equation $f_2 - f_1 + f_0 = 2$.

Theorem (Steinitz). In three-dimensional space, $(f_0, f_1, f_2)$ is the f-vector of a polyhedron if and only if it satisfies, in addition to the Euler relation, the following inequalities:

1. $4 \le f_0 \le 2f_2 - 4$
2. $4 \le f_2 \le 2f_0 - 4$

Such conditions are not completely known in higher dimensions but there are many partial results, especially for n = 4 (see the survey article by Bayer and Lee [2]). However, it is known that the Euler relation is the only affine relation satisfied by all f-vectors.

Of particular appeal are the regular figures. If n = 2, then there is an infinite family of regular polygons. Up to rigid motions and dilations, there is precisely one for each number k of vertices (edges). If n = 3, then there are precisely five regular polyhedra (the Platonic solids). There are many proofs that there can be no more. We outline one. Suppose that the facets are p-gons and that q meet at each vertex. Then the Euler relation implies that $p^{-1} + q^{-1} > 1/2$. Since p, q ≥ 3, only the values (p, q) = (3, 3), (3, 4), (4, 3), (3, 5) and (5, 3) are possible.

In all dimensions it is possible to construct a regular cube $C_n$, a regular cross-polytope $O_n$, and a regular simplex (a linear image of $S_n$). For n ≥ 5, these are the only regular polytopes. When n = 4, there are three more with, respectively, 24 octahedral facets, 120 dodecahedral facets, and 600 tetrahedral facets.

A great variety of polytopes have some degree of regularity; for example, the cuboctahedron has six square facets and eight that are equilateral triangles and whose vertices are all alike. Its dual is the rhombic dodecahedron, which has all its facets alike (they are all rhombi).

B. Zonotopes

This special class of polytopes was first brought to attention by Fedorov in connection with crystallography. They are centrally symmetric but, more than that, they are characterized by having all their two-dimensional faces centrally symmetric. A theorem of Alexandrov implies that all of the faces are centrally symmetric. Thus, when n = 3, the zonohedra are those polyhedra with centrally symmetric facets. For n > 3, there are polytopes whose facets are centrally symmetric but are not zonotopes.

From the point of view of the Brunn-Minkowski theory based on sums and scalar multiples of convex sets, zonotopes are the simplest objects because they are the ones that can be expressed as sums of line segments. To simplify matters, the line segments can be centered at 0 so that the zonotope is also symmetric about 0. The support function of a line segment [−x, x] is $h_{[-x,x]}(f) = |f(x)|$ and, since support functions respect addition, if $Z := \sum_i [-x_i, x_i]$, then $h_Z(f) = \sum_i |f(x_i)|$.

In addition to crystallography, these objects are important in several areas of mathematics. First, if we restrict the domain of the projection body operator $\Pi$ to the polytopes, its range is the set of zonotopes. Moreover, if the domain is restricted to the centrally symmetric polytopes, then $\Pi$ is one-to-one.

If P is a polytope, then the action of $\Pi$ on P is described in the following way. Let the facets of P be denoted by $F_1, F_2, \ldots, F_m$. Each $F_i$ is contained in a hyperplane $H_{f_i}^{\alpha_i}$ with $\|f_i\| = 1$. The functional $f_i$ is called the unit normal to the facet $F_i$; the sign is usually chosen so that it is directed outwards but that is immaterial here. The $f_i$'s are in the directions of the vertices of $P^\circ$. If $\mu_i$ denotes the area of the facet $F_i$, then:

$$\Pi(P) = \frac{1}{2} \sum_{i=1}^{m} \frac{\mu_i}{2}\,[-f_i, f_i].$$

One factor of 1/2 is to make the line segments be of length $\mu_i$, the other is because the projection of P in some direction is covered twice by the projections of the facets; if P is symmetric, the sum can be taken over half the facets, one from each pair. This construction reveals that $\Pi$ maps objects in X to objects in the dual space $X^*$.

Examples. For the standard cube $C_3$, the normals are the usual basis vectors (and their negatives). The areas are all 4, hence, $\Pi(C_3) = \sum 2[-e_i, e_i] = 2C_3$.

For the standard octahedron $O_3$, the normals are

$$3^{-1/2}(1, 1, 1),\quad 3^{-1/2}(1, -1, 1),\quad 3^{-1/2}(-1, -1, 1),\quad 3^{-1/2}(-1, 1, 1)$$

and the areas are all $\sqrt{3}/2$. Another way to find the sum of the corresponding line segments is as the convex hull of the vectors:

$$\tfrac{1}{4}\bigl(\pm(1, 1, 1) \pm (1, -1, 1) \pm (-1, -1, 1) \pm (-1, 1, 1)\bigr)$$

where all (16) possible choices of signs are taken. Of these, 14 are vertices and the other two are interior points (they are 0). These vertices are those of the rhombic dodecahedron, as stated in Section V (with an extra factor of 1/2).

For the standard simplex, $\Pi(S_3)$ is also a sum of four line segments and is combinatorially the same as $\Pi(O_3)$ but it has different proportions. However, the two are affinely equivalent, showing that $\Pi$ is not one-to-one on the set of all polytopes.

There are two other useful characterizations of zonotopes. If P is a centrally symmetric polytope with k pairs of opposite vertices, then P is a projection of $O_k$. Similarly, if a zonotope is a sum of k line segments, then it is a projection of $C_k$. The dual characterization is that the dual of a zonotope is a central cross section of $O_k$ for some suitable k.

A problem of great antiquity is that of characterizing and classifying the tilings of space. A more tractable problem is to ask what objects can be used as a single tile that will tile space by translation. In $R^3$ these are known. There are five such objects and all are zonohedra. In higher dimensions, it is known (Venkov) that such objects are centrally symmetric polytopes with centrally symmetric facets. However, for n > 3 it was noted above that these need not be zonotopes. For results on tilings, see the article of that title by Schulte [2].

C. Zonoids

Whereas the polytopes form a dense set among the convex bodies in $R^n$, this is far from true of the zonotopes. The class of zonoids is defined as the closure of the set of zonotopes in the Hausdorff metric (in some fixed vector space). Thus, every zonoid is the limit of a sequence of zonotopes. Since the map from convex body to support function is continuous, it follows that the support function of a zonoid is the limit of a sequence of functions of the form $\sum_i |f(x_i)|$. Such limits can be written as an integral. Thus, a convex body Z is a zonoid if and only if its support function has the form:

$$h_Z(f) = \int |f(x)|\,d\rho(x)$$

where the integral is over the surface of the unit ball and $\rho$ is an even measure on that surface.

The set of zonoids is also the range of the projection body operator when its domain is the set of all convex

bodies. This operator is one-to-one on the set of centrally symmetric bodies. An important context in which zonoids arise is that of vector measures. The range of any nonatomic vector measure is a zonoid.

In finite-dimensional normed spaces, there are several ways in which area can be defined. For that due to Holmes and Thompson the solution to the isoperimetric problem is always a zonoid. This fact makes that definition especially suitable for integral geometry. An interesting open question is whether there are non-Euclidean normed spaces for which the solution to the isoperimetric problem is the unit ball. For further results about zonoids, consult the article by Goodey and Weil [2].

D. Ellipsoids

An ellipsoid in $R^n$ was defined in Section III to be an affine image of the unit ball. Here, we restrict attention to ellipsoids with the origin as center and as an interior point; that is, the image of the unit ball under an invertible linear map T. Alternatively, if S is a positive definite symmetric matrix, the ellipsoid $E_S$ is defined (using the inner product) by:

$$E_S := \{x : \langle Sx, x \rangle \le 1\}.$$

Thus, the unit ball B is $E_I$ where I is the identity matrix. The two approaches are connected by the fact that if T is an invertible matrix then $S := (T^{-1})^{*} T^{-1}$ is positive definite symmetric and $T(B) = T(E_I) = E_S$. Conversely, a positive definite matrix S has a positive definite square root $S^{1/2}$ and

$$E_S = S^{-1/2}(E_I) = S^{-1/2}(B).$$

Many properties of ellipsoids may be regarded as facts about positive definite symmetric matrices and conversely. If an ellipsoid E is given in the form E = T(B), then $V_n(E) = |\det T|\,\epsilon_n$. Similarly, $V_n(E_S) = \epsilon_n/\sqrt{\det S}$. Useful here is the fact that the determinant of a matrix is the product of its eigenvalues.

The gauge function of a centrally symmetric convex body with 0 as an interior point is a norm. Ellipsoids are precisely those bodies whose corresponding norms come from an inner product: a Euclidean norm and (in infinite dimensions) a Hilbert space norm. Therefore, theorems that characterize ellipsoids among convex bodies also characterize inner product spaces among normed spaces (and conversely). Theorems of this type are extremely numerous in the literature. Here we give a small sample.

The most well-known characterization of inner product spaces is via the parallelogram law and is due to Jordan and von Neumann: a norm $\|\cdot\|$ comes from an inner product if and only if:

$$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2$$

for all vectors x and y. Thus, a convex body is an ellipsoid if and only if its gauge function satisfies the above equation. Another characterization of inner product spaces is that of Kakutani and is closely related to the Blaschke result below.

Theorem (Kakutani). If $(X, \|\cdot\|)$ is a normed space of dimension n ≥ 3 and if for every two-dimensional subspace H of X there is a projection of X onto H which does not increase the norm of any vector, then the norm comes from an inner product.

Ellipsoids are characterized by the equality case in a number of the inequalities in Section IX. A few more geometric results are the following.

Theorem (Bertrand, Brunn). Let K be a convex body. For every vector x, consider the chords of K that are parallel to x. The midpoints of these chords all lie on a hyperplane if and only if K is an ellipsoid with center on that hyperplane.

Definition. If K is a convex body, if $x \ne 0$ is a vector, and if $L_x$ is the line spanned by x, then the shadow boundary of K in the direction x is the set $S_x := \partial(K + L_x) \cap \partial K$.

The image here is of a beam of light shining on K in the direction x. Part of the body is illuminated and part is in shadow. The set $S_x$ is the "edge" of the shadow. If K has some flat parts to its boundary, the edge may be broad.

Theorem (Blaschke). If n ≥ 3 and if K is a convex body, then the shadow boundary $S_x$ lies in a hyperplane for all x if and only if K is an ellipsoid.

Theorem (Shaĭdenko, Goodey). If n ≥ 3 and if K and L are two convex bodies such that for all translates L + x of L (except if L + x = K) the set $\partial K \cap \partial(L + x)$ is contained in a hyperplane, then K and L are homothetic ellipsoids.

Lastly, we recall the Löwner-John ellipsoids (see Section VI). For every convex body K, there is a unique ellipsoid of maximal volume inscribed to K and a unique ellipsoid of minimal volume circumscribed to K.

E. Simplices

In many respects, among convex bodies simplices stand at the opposite extreme from ellipsoids. Whereas ellipsoids

are among the most symmetric of bodies, simplices are among the most asymmetric. In some of the inequalities in Section IX, ellipsoids are precisely at one extreme and simplices at the other.

The place where many students and users of mathematics meet the word "simplex" is in the terms simplex method or simplex algorithm; see Webster [5].

Choquet observed that if K is a convex body such that its intersection with any homothetic image is again a homothetic image then K is a simplex. He used this idea to define a simplex in infinite-dimensional spaces. Moreover, if K is of dimension (n − 1) and is contained in a hyperplane $H_f^1$ in $R^n$ (not a hyperplane through 0), then one can construct a cone $C_K := \{\lambda x : x \in K, \lambda \ge 0\}$ which induces an order relation on $R^n$ by:

$$x \le y \text{ if and only if } y - x \in C_K.$$

This order relation makes $R^n$ into a lattice if and only if K is a simplex. This is the starting point of a far-reaching theory of infinite-dimensional spaces initiated by Choquet.

Finally, we give a characterization due to Martini that nicely connects three of the operations of Section V.

Theorem (Martini). A polytope is a simplex if and only if $D(K) = \lambda\,\Pi(K)^\circ$ for some $\lambda > 0$.

Other characterizations of (and information about) simplices can be found in the article by Heil and Martini in the Handbook of Convex Geometry [2].

F. Sets of Constant Width

The width of a convex set in a direction f, $w_K(f)$, was defined in Section V and sets of constant width were also introduced there. In Section V, it was shown that K is of constant width if and only if D(K) is a ball. If $K_1$ and $K_2$ are each of constant width, then so is $K_1 + K_2$.

For each point x in $\partial K$, if we define the diameter of K at x to be $d_K(x) := \sup\{\|x - y\| : y \in K\}$, then K is of constant width if and only if $d_K(x)$ is constant.

If the convex body K has diameter d, we may define the spherical hull of K, sh(K), to be the intersection of all balls of radius d with center in K:

$$\mathrm{sh}(K) := \bigcap \{B[x, d] : x \in K\}.$$

Then $K \subseteq \mathrm{sh}(K)$ and equality occurs if and only if K is of constant width. Sets of constant width are precisely those sets of diameter d such that if $y \notin K$ then $\mathrm{diam}(K \cup \{y\}) > d$. Moreover, every set K of diameter d is contained in a set of constant width d.

The inradius of a convex body K is the maximal radius of a ball contained in K, and the circumradius is the minimal radius of a ball that contains K. If K has inradius r and circumradius R and is of constant width w, then there is a unique inscribed ball of radius r and a unique circumscribed ball of radius R, these balls are concentric, and r + R = w.

There is an inequality relating the volume and surface area of bodies of constant width w in $R^n$:

$$V_n(K)/V_{n-1}(\partial K) \le w/(2n).$$

Two-dimensional sets of constant width have been much studied. A Reuleaux triangle of width w is the intersection of three discs of radius w centered at the vertices of an equilateral triangle of side length w (see Fig. 8). Reuleaux polygons with any odd number of sides may be constructed similarly. Discs B[x, w/2] are the sets of constant width w whose area is maximal.

FIGURE 8

Theorem (Blaschke, Lebesgue). If K is a two-dimensional set of constant width w, then:

$$V_2(K) \ge \frac{(\pi - \sqrt{3})\,w^2}{2}$$

with equality if and only if K is a Reuleaux triangle.

The corresponding problem in higher dimensions is unsolved. Constructions analogous to the Reuleaux polygons can be made in higher dimensions but not so easily.

Cauchy's formula for surface area (Section IX) yields an elegant result for bodies of constant width in $R^2$. The length of the perimeter of such a body is $\pi w$ where w is the width.
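The two extreme planar bodies of constant width mentioned above can be compared directly. A Reuleaux triangle of width w is the union of three circular sectors of radius w and angle $\pi/3$ that overlap in the central equilateral triangle, so its area is three sectors minus two copies of that triangle, and its boundary consists of three arcs of length $\pi w/3$. The sketch below is an illustration with hypothetical function names, not code from the article. It computes these quantities and compares them with the disc of the same width.

```python
import math


def reuleaux_area(w: float) -> float:
    """Area of a Reuleaux triangle of width w: 3 sectors minus 2 triangles."""
    sector = math.pi * w ** 2 / 6          # radius w, central angle pi/3
    triangle = math.sqrt(3) / 4 * w ** 2   # equilateral triangle of side w
    return 3 * sector - 2 * triangle


def reuleaux_perimeter(w: float) -> float:
    """Three circular arcs, each of radius w and central angle pi/3."""
    return 3 * (w * math.pi / 3)


def disc_area(w: float) -> float:
    """Area of the disc of constant width w (radius w/2)."""
    return math.pi * (w / 2) ** 2


def disc_perimeter(w: float) -> float:
    """Circumference of the disc of constant width w."""
    return math.pi * w


if __name__ == "__main__":
    w = 1.0
    print("Reuleaux:", reuleaux_area(w), reuleaux_perimeter(w))
    print("disc:    ", disc_area(w), disc_perimeter(w))
    # Ratio area / perimeter, to be compared with the bound w / (2n) for n = 2.
    print(reuleaux_area(w) / reuleaux_perimeter(w), w / 4)
```

Both bodies have perimeter $\pi w$, as the perimeter statement above requires, while the areas are $(\pi - \sqrt{3})w^2/2$ for the Reuleaux triangle and $\pi w^2/4$ for the disc, the minimum and maximum among sets of constant width w.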

Data Mining, Statistics

David L. Banks
U.S. Department of Transportation

Stephen E. Fienberg
Carnegie Mellon University

I. Two Problems
II. Classification
III. Cluster Analysis
IV. Nonparametric Regression

GLOSSARY

Bagging A method that averages predictions from several models to achieve better predictive accuracy.
Boosting A method that uses outcome-based weighting to achieve improved predictive accuracy in a classification rule.
Classification The statistical problem of assigning new cases into categories based on observed values of explanatory variables, using previous information on a sample of cases for which the categories are known.
Clustering The operation of finding groups within multivariate data. One wants the cases within a group to be similar and cases in different groups to be clearly distinct.
Curse of dimensionality The unfortunate fact that statistical inference becomes increasingly difficult as the number of explanatory variables becomes large.
Nonparametric regression Methods that attempt to find a functional relationship between a response variable and one or more explanatory variables without making strong assumptions (such as linearity) about the form of the relationship.
Principle of parsimony This asserts that a simpler model usually has better predictive accuracy than a complex model, even if the complex model fits the original sample of data somewhat better.
Projection pursuit An exploratory technique in which one looks for low-dimensional projections of high-dimensional data that reveal interesting structure or define useful new explanatory variables.
Recursive partitioning An approach to regression or classification in which one fits different models in different regions of the space of explanatory variables and the data are used to identify the regions that require distinct models.
Smoothing Any operation that does local averaging or local fitting of the response values in order to estimate a functional relationship.

DATA MINING is an emerging analytical area of research activity that stands at the intellectual intersection of

P1: ZBU Final Pages
Encyclopedia of Physical Science and Technology En004F-164 June 8, 2001 16:12


statistics, computer science, and database management. It deals with very large datasets, tries to make fewer theoretical assumptions than has traditionally been done in statistics, and typically focuses on problems of classification, clustering, and regression. In such domains, data mining often uses decision trees or neural networks as models and frequently fits them using some combination of techniques such as bagging, boosting/arcing, and racing. These domains and techniques are the primary focus of the present article. Other activities in data mining focus on issues such as causation in large-scale systems (e.g., Spirtes et al., 2001; Pearl, 2000), and this effort often involves elaborate statistical models and, quite frequently, Bayesian methodology and related computational techniques (e.g., Cowell et al., 1999; Jordan, 1998). For an introductory discussion of this dimension of data mining see Glymour et al. (1997).

The subject area of data mining began to coalesce in 1990, as researchers from the three parent fields discovered common problems and complementary strengths. The first KDD workshop (on Knowledge Discovery and Data Mining) was held in 1989. Subsequent workshops were held in 1991, 1993, and 1994, and these were then reorganized as an annual international conference in 1995. In 1997 the Journal of Data Mining and Knowledge Discovery was established under the editorship of Usama Fayyad. It has published special issues on electronic commerce, scalable and parallel computing, inductive logic programming, and applications to atmospheric sciences. Interest in data mining continues to grow, especially among businesses and federal statistical agencies; both of these communities gather large amounts of complex data and want to learn as much from their collections as is possible.

The needs of the business and federal communities have helped to direct the growth of data mining research. For example, typical applications in data mining include the following:

• Use of historical financial records on bank customers to predict good and bad credit risks.
• Use of sales records and customer demographics to perform market segmentation.
• Use of previous income tax returns and current returns to predict the amount of tax a person owes.

These three examples imply large datasets without explicit model structure, and the analyses are driven more by management needs than by scientific theory. The first example is a classification problem, the second requires clustering, and the third would employ a kind of regression analysis.

In these examples the raw data are numerical or categorical values, but sometimes data mining treats more exotic situations. For example, the data might be satellite photographs, and the investigator would use data mining techniques to study climate change over time. Or they can be other forms of images such as those arising from functional magnetic resonance imaging in medicine, or robot sensors. Or the data might be continuous functions, such as spectra from astronomical studies of stars. This review focuses upon the most typical applications, but the reader can obtain a fuller sense of the scope by examining the repository of benchmark data maintained at the University of California at Irvine (Bay, 1999). This archive is a resource for the entire machine learning community, who use it to test and tune new algorithms. It contains, among other datasets, all those used in the annual KDD Cup contest, a competitive comparison of data mining algorithms run by the organizers of the KDD conference.

It is unclear whether data mining will continue to gain intellectual credibility among academic researchers as a separate discipline or whether it will be subsumed as a branch of statistics or computer science. Its commercial success has created a certain sense of distance from traditional research universities, in part because so much of the work is being done under corporate auspices. But there is broad agreement that the content of the field has true scientific importance, and it seems certain that under one label or another, the body of theory and heuristics that constitutes the core of data mining will persist and grow.

The remainder of this article describes the two practical problems whose tension defines the ambit of data mining, then reviews the three basic kinds of data mining applications: classification, clustering, and regression. In parallel, we describe three relatively recent ideas that are characteristic of the intellectual cross-fertilization that occurs in data mining, although they represent just a small part of the work going on in this domain. Boosting is discussed in the context of classification, racing is discussed in the context of clustering, and bagging is discussed in the context of regression. Racing and bagging have broad applicability, but boosting is (so far) limited in application to classification problems. Other techniques of widespread interest for these problems that are not discussed here include support vector machines and kernel smoothing methods (e.g., Vapnik, 2000).

I. TWO PROBLEMS

Data mining exists because modern methods of data collection and database management have created sample sizes large enough to overcome (or partially overcome) the limitations imposed by the curse of dimensionality. However, this freedom comes at a price: the size of the datasets severely restricts the complexity of the calculations that can be made. These two issues are described in the following subsections.
P1: ZBU Final Pages
Encyclopedia of Physical Science and Technology En004F-164 June 8, 2001 16:12

Data Mining, Statistics 249

A. The Curse of Dimensionality

The curse of dimensionality (COD) was first described by Richard Bellman, a mathematician, in the context of approximation theory. In data analysis, the term refers to the difficulty of finding hidden structure when the number of variables is large.

For all problems in data mining, one can show that as the number of explanatory variables increases, the problem of structure discovery becomes harder. This is closely related to the problem of variable selection in model fitting. Classical statistical methods, such as discriminant analysis and multiple linear regression, avoided this problem by making the strong model assumption that the mathematical relationship between the response and explanatory variables was linear. But data miners prefer alternative analytic approaches, since the linearity assumption is usually wrong for the kinds of applications they encounter.

For specificity, consider the case of regression analysis; here one looks for structure that predicts the value of the response variable Y from explanatory variables $X \in \mathbb{R}^p$. Thus one might want to predict the true amount of tax that an individual owes from information reported in the previous year, information declared on the current form, and ancillary demographic or official information. The COD arises when p is large.

There are three essentially equivalent descriptions of the COD, but each gives a usefully different perspective on the problem:

1. For large p, nearly all datasets are too sparse.
2. The number of regression functions to consider grows quickly (faster than exponentially) with p, the dimension of the space of explanatory variables.
3. For large p, nearly all datasets are multicollinear (or concurve, the nonparametric generalization of multicollinearity).

These problems are minimized if data can be collected using an appropriate statistical design, such as Latin hypercube sampling (Stein, 1987). However, aside from simulation experiments, this is difficult to achieve.

The first version of the curse of dimensionality is most easily understood. If one has five points at random in the unit interval, they tend to be close together, but five random points in the unit square and the unit cube tend to be increasingly dispersed. As the dimension increases, a sample of fixed size provides less information on local structure in the data.

To quantify this heuristic about increasing data sparsity, one can calculate the side length of a p-dimensional subcube that is expected to contain half of the (uniformly random) data in the p-dimensional unit cube. This side length is $(0.5)^{1/p}$, which increases to 1 as p gets large; for p = 10 it is already about 0.93. Therefore the expected number of observations in a fixed volume in $\mathbb{R}^p$ goes to zero, which implies that the data can provide little information on the local relationship between X and Y. Thus it requires very large sample sizes to find local structure in high dimensions.

The second version of the COD is related to complexity theory. When p is large there are many possible models that one might fit, making it difficult for a finite sample properly to choose among the alternative models. For example, in predicting the amount of tax owed, competing models would include a simple one based just on the previous year's payment, a more complicated model that modified the previous year's payment for wage inflation, and a truly complex model that took additional account of profession, location, age, and so forth.

To illustrate the explosion in the number of possible models, suppose one decides to use only a polynomial model of degree 2 or less. When p = 1 (i.e., a single explanatory variable), there are seven possible regression models:

\[
\begin{aligned}
Y &= \beta_0 + \epsilon, \\
Y &= \beta_0 + \beta_1 X_1 + \epsilon, \\
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \epsilon, \\
Y &= \beta_1 X_1 + \epsilon, \\
Y &= \beta_0 + \beta_2 X_1^2 + \epsilon, \\
Y &= \beta_1 X_1 + \beta_2 X_1^2 + \epsilon, \\
Y &= \beta_2 X_1^2 + \epsilon,
\end{aligned}
\]

where $\epsilon$ denotes noise in the observation of Y. For p = 2 there are 63 models to consider (including interaction terms of the form $X_1 X_2$), and simple combinatorics shows that, for general p, there are $2^{1 + p + p(p+1)/2} - 1$ models of degree 2 or less. Since real-world applications usually explore much more complicated functional relationships than low-degree polynomials, data miners need vast quantities of data to discriminate among the many possibilities.

Of course, it is tempting to say that one should include all the explanatory variables, and in the early days of data mining many naive computer scientists did so. But statisticians knew that this violated the principle of parsimony and led to inaccurate prediction. If one does not severely limit the number of variables (and the number of transformations of the variables, and the number of interaction terms between variables) in the final model, then one ends up overfitting the data. Overfit happens when the chosen model describes the accidental noise as well as the true signal. For example, in the income tax problem it
might by chance happen that people whose social security numbers end in 2338 tend to have higher incomes. If the model fitting process allowed such frivolous terms, then this chance relationship would be used in predicting tax obligation, and such predictions would be less accurate than those obtained from a more parsimonious model that excluded misinformation. It does no good to hope that one's dataset lacks spurious structure; when p is large, it becomes mathematically certain that chance patterns exist.

The third formulation of the COD is more subtle. In standard multiple regression, multicollinearity arises when two or more of the explanatory variables are highly correlated, so that the data lie mostly inside an affine subspace of $\mathbb{R}^p$ (e.g., close to a line or a plane within the p-dimensional volume). If this happens, there are uncountably many models that fit the data about equally well, but these models are dramatically different with respect to predictions for future responses whose explanatory values lie outside the subspace. As p gets large, the number of possible subspaces increases rapidly, and just by chance a finite dataset will tend to concentrate in one of them.

The problem of multicollinearity is aggravated in nonparametric regression, which allows nonlinear relationships in the model. This is the kind of regression most frequently used in data mining. Here the analogue of multicollinearity arises when predictors concentrate on a smooth manifold within $\mathbb{R}^p$, such as a curved line or sheet inside the p-volume. Since there are many more manifolds than affine subspaces, the problem of concurvity in nonparametric regression distorts prediction even more than does multicollinearity in linear regression.

When one is interested only in prediction, the COD is less of a problem for future data whose explanatory variables have values close to those observed in the past. But an unobvious consequence of large p is that nearly all new observation vectors tend to be far from those previously seen. Furthermore, if one needs to go beyond simple prediction and develop interpretable models, then the COD can be an insurmountable obstacle. Usually the most one can achieve is local interpretability, and that happens only where data are locally dense. For more detailed discussions of the COD, the reader should consult Hastie and Tibshirani (1990) and Scott and Wand (1991).

The classical statistical or psychometrical approach to dimensionality reduction typically involves some form of principal component analysis or multidimensional scaling. Roweis and Saul (2000) and Tenenbaum et al. (2000) suggest some novel ways to approach the problem of nonlinear dimensionality reduction in very high dimensional problems such as those involving images.

B. Massive Datasets

Massive datasets pose special problems in analysis. To provide some benchmarks of difficulty, consider the following taxonomy by dataset size proposed by Huber (1995):

Type       Size (bytes)
Tiny       $10^{2}$
Small      $10^{4}$
Medium     $10^{6}$
Large      $10^{8}$
Huge       $10^{10}$
Monster    $10^{12}$

Huber argues that the category steps, which are factors of 100, correspond to reasonable divisions at which the quantitative increase in sample size compels a qualitative change in the kind of analysis.

For tiny datasets, one can view all the data on a single page. Computational time and storage are not an issue. This is the realm in which classical statistics began, and most problems could be resolved by intelligent inspection of the data tables.

Small datasets could well defeat tabular examination by humans, but they invite graphical techniques. Scatterplots, histograms, and other visualization techniques are quite capable of structure discovery in this realm, and modern computers permit almost any level of analytic complexity that is wanted. It is usually possible for the analyst to proceed adaptively, looking at output from one trial in order to plan the next modeling effort.

Medium datasets begin to require serious thought. They will contain many outliers, and the analyst must develop automatic rules for detecting and handling them. Visualization is still a viable way to do structure discovery, but when the data are multivariate there will typically be more scatterplots and histograms than one has time to examine individually. In some sense, this is the first level at which the entire analytical strategy must be automated, which limits the flexibility of the study. This is the lowest order in the taxonomy in which data mining applications are common.

Large datasets are difficult to visualize; for example, it is possible that the point density is so great that a scatterplot is completely black. Many common statistical procedures, such as cluster analysis or regression and classification based on smoothing, become resource-intensive or impossible.

Huge and monster datasets put data processing issues at the fore, and analytical methods become secondary. Simple operations such as averaging require only that data be read once, and these are feasible at even the largest
size, but analyses that require more than O(n) or perhaps O(n log(n)) operations, for n the sample size, are impossible.

The last three taxa are a mixed blessing for data mining. The good news is that the large sample sizes mitigate the curse of dimensionality, and thus it is possible, in principle, to discover complex and interesting structure. The bad news is that many of the methods and much of the theory developed during a century of explosive growth in statistical research are impractical, given current data processing limitations.

From a computational standpoint, the three major issues in data mining are processor speed, memory size, and algorithmic complexity. Speed is proportional to the number of floating point operations (flops) that must be performed. Using PCs available in 2000, one can undertake analyses requiring about $10^{13}$ flops, and perhaps up to $10^{16}$ flops on a supercomputer. Regarding memory, a single-processor machine needs sufficient memory for about four copies of the largest array required in the analysis, and the backup storage (disk) should be able to hold an amount equal to about 10 copies of the raw data (otherwise one has trouble storing derived calculations and exploring alternative analyses). The algorithmic complexity of the analysis is harder to quantify; at a high level it depends on the number of logical branches that the analyst wants to explore when planning the study, and at a lower level it depends on the specific numerical calculations employed by particular model fitting procedures.

Datasets based on extracting information from the World Wide Web or involving all transactions from banks or grocery stores, or collections of fMRI images can rapidly fill up terabytes of disk storage, and all of these issues become relevant. The last issue, regarding algorithmic complexity, is especially important added on top of the sheer size of some datasets. It implies that one cannot search too widely in the space of possible models, and that one cannot redirect the analysis in midstream to respond to an insight found at an intermediate step.

For more information on the issues involved in preparing superlarge datasets for analysis, see Banks and Parmigiani (1992). For additional discussion of the special problems posed in the analysis of superlarge datasets, see the workshop proceedings on massive datasets produced by the National Research Council (1997).

II. CLASSIFICATION

Classification problems arise when one has a training sample of cases in known categories and their corresponding explanatory variables. One wants to build a decision rule that uses the explanatory variables to predict category membership for future observations. For example, a common data mining application is to use the historical records on loan applicants to classify new applicants as good or bad credit risks.

A. Methods

Classical classification began in 1935, when Sir Ronald Fisher set his children to gather iris specimens. These specimens were then classified into three species by a botanist, and numerical measurements were made on each flower. Fisher (1936) then derived mathematical formulas which used the numerical measurements to find hyperplanes that best separated the three different species.

In modern classification, the basic problem is the same as Fisher faced. One has a training sample, in which the correct classification is known (or assumed to be accurate with high probability). For each case in the training sample, one has additional information, either numerical or categorical, that may be used to predict the unknown classifications of future cases. The goal is to use the information in the learning sample to build a decision function that can reliably classify future cases.

Most applications in data mining use either logistic regression, neural nets, or recursive partitioning to build the decision functions. The next three subsections describe these approaches. For information on the more traditional discriminant analysis techniques, see Press (1982).

1. Logistic Regression

Logistic regression is useful when the response variable is binary but the explanatory variables are continuous. This would be the case if one were predicting whether or not a customer is a good credit risk, using information on their income, years of employment, age, education, and other continuous variables.

In such applications one uses the model

\[
P[Y = 1] = \frac{\exp(X^T \theta)}{1 + \exp(X^T \theta)}, \tag{1}
\]

where Y = 1 if the customer is a good risk, X is the vector of explanatory variables for that customer, and $\theta$ are the unknown parameters to be estimated from the data. This model is advantageous because, under the transformation

\[
p = \ln \frac{P[Y = 1]}{1 - P[Y = 1]},
\]

one obtains the linear model $p = X^T \theta$. Thus all the usual machinery of multiple linear regression will apply.

Logistic regression can be modified to handle categorical explanatory variables through definition of dummy
variables, but this becomes impractical if there are many categories. Similarly, one can extend the approach to cases in which the response variable is polytomous (i.e., takes more than two categorical values). Also, logistic regression can incorporate product interactions by defining new explanatory variables from the original set, but this, too, becomes impractical if there are many potential interactions. Logistic regression is relatively fast to implement, which is attractive in data mining applications that have large datasets. Perhaps the chief value of logistic regression is that it provides an important theoretical window on the behavior of more complex classification methodologies (Friedman et al., 2000).

2. Neural Nets

Neural nets are a classification strategy that employs an algorithm whose architecture is intended to mimic that of a brain, a strategy that was casually proposed by von Neumann. Usually, the calculations are distributed across multiple nodes whose behavior is analogous to neurons, their outputs are fed forward to other nodes, and these results are eventually accumulated into a final prediction of category membership.

To make this concrete, consider a simple perceptron model in which one is attempting to use a training set of historical data to teach the network to identify customers who are poor credit risks. Figure 1 shows a hypothetical situation; the inputs are the values of the explanatory variables and the output is a prediction of either 1 (for a good risk) or −1 (for a bad risk). The weights in the nodes are developed as the net is trained on the historical sample, and may be thought of as estimated values of model parameters. As diagrammed in Fig. 1, the simple perceptron fits the model

\[
y = \operatorname{signum}\left( \sum_{i=1}^{p} w_i x_i + \tau \right),
\]

where the weights $w_i$ and the threshold parameter $\tau$ are estimated from the training sample.

Unlike logistic regression, it is easy for the simple perceptron to include categorical explanatory variables such as profession or whether the applicant has previously declared bankruptcy, and there is much less technical difficulty in extending the perceptron to the prediction of polytomous outcomes. There is no natural way, however, to include product interaction terms automatically; these still require hand-tuning.

The simple perceptron has serious flaws, and these have been addressed in a number of ways. The result is that the field of neural networks has become complex and diverse. The method shown in Fig. 1 is too primitive to be commonly used in modern data mining, but it serves to illustrate the basic ideas.

As a strategy, neural networks go back to the pioneering work of McCulloch and Pitts (1943). The computational obstacles in training a net went beyond the technology then available. There was a resurgence of attention when Rosenblatt (1958) introduced the perceptron, an early neural net whose properties were widely studied in the 1960s; fundamental flaws with the perceptron design were pointed out by Minsky and Papert (1969). The area languished until the early 1980s, when the Hopfield net (Hopfield, 1982) and the discovery of the backpropagation algorithm [Rumelhart et al. (1985) were among several independent inventors] led to networks that could be used in practical applications. Additional information on the history and development of these ideas can be found in Ripley (1996).
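As a rough illustration of the simple perceptron described above, the following Python sketch evaluates y = signum(Σ w_i x_i + τ) and estimates the weights and threshold from a training sample. The article does not say how the weights are estimated; the classic error-correction update used here is an assumption, as are the function names, the convention that signum(0) = 1, and the toy credit data.

```python
def signum(z):
    """Return 1 for non-negative input and -1 otherwise (tie convention is an assumption)."""
    return 1 if z >= 0 else -1


def predict(weights, tau, x):
    """Simple perceptron output: signum(sum_i w_i * x_i + tau)."""
    return signum(sum(w * xi for w, xi in zip(weights, x)) + tau)


def train_perceptron(samples, labels, epochs=100, rate=1.0):
    """Estimate weights and threshold with the classic perceptron update.

    Assumption: the error-correction rule w <- w + rate * y * x is used;
    it terminates only when the training sample is linearly separable.
    """
    weights = [0.0] * len(samples[0])
    tau = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            if predict(weights, tau, x) != y:
                # Move the separating hyperplane toward the misclassified case.
                weights = [w + rate * y * xi for w, xi in zip(weights, x)]
                tau += rate * y
                errors += 1
        if errors == 0:  # every training case is classified correctly
            break
    return weights, tau


# Hypothetical toy sample: (income in $10k, years employed); 1 = good risk, -1 = bad risk.
samples = [(1.0, 0.5), (2.0, 1.0), (6.0, 8.0), (7.0, 5.0)]
labels = [-1, -1, 1, 1]

weights, tau = train_perceptron(samples, labels)
predictions = [predict(weights, tau, x) for x in samples]
print(weights, tau, predictions)
```

Because the toy sample is linearly separable, the loop stops once every training case is classified correctly. On data that are not separable it would simply run for the full number of epochs, which is one of the limitations of the simple perceptron.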
FIGURE 1 Simple perceptron. [Diagram: inputs X1, X2, X3, ..., Xp, with weights W1, W2, W3, ..., Wp, feed a single node labeled t that produces the output Y.]

The three major drawbacks in using neural nets for data mining are as follows:

1. Neural nets require such intensive computation that fitting even Huber's large datasets can be infeasible. In practice, analysts who are intent on applying neural nets to their problems typically train the net on only a small fraction of the learning sample and then use the remainder to estimate the accuracy of their classifications.
2. They are difficult to interpret. It is hard to examine a trained network and discover which variables are most influential and what the functional relationships between the variables are. This impedes the kind of scientific insight that is especially important to data miners.
3. Neural nets do not automatically provide statements of uncertainty. At the price of greatly increased computation one can use statistical techniques such as cross-validation, bootstrapping, or jackknifing to get approximate misclassification rates or standard errors or confidence intervals.

Nonetheless, neural nets are one of the most popular tools in the data mining community, in part because of their deep roots in computer science.

Subsection IV.A.4 revisits neural nets in the context of regression rather than classification. Instead of describing the neural net methodology in terms of the nodes and connections which mimic the brain, it develops an equivalent but alternative representation of neural nets as a procedure for fitting a mathematical model of a particular form. This latter perspective is the viewpoint embraced by most current researchers in the field.

3. Recursive Partitioning

Data miners use recursive partitioning to produce decision trees. This is one of the most popular and versatile of the modern classification methodologies. In such applications, the method employs the training sample to recursively partition the set of possible explanatory measurements. The resulting classification rule can be displayed as a decision tree, and this is generally viewed as an attractive and interpretable rule for inference.

Formally, recursive partitioning splits the training sample into increasingly homogeneous groups, thus inducing a partition on the space of explanatory variables. At each step, the algorithm considers three possible kinds of splits using the vector of explanatory values X:

1. Is $x_i \le t$ (univariate split)?
2. Is $\sum_{i=1}^{p} w_i x_i \le t$ (linear combination split)?
3. Does $x_i \in S$ (categorical split, used if $x_i$ is a categorical variable)?

The algorithm searches over all possible values of t, all coefficients $\{w_i\}$, and all possible subsets S of the category values to find the split that best separates the cases in the training sample into two groups with maximum increase in overall homogeneity.

Different partitioning algorithms use different methods for assessing improvement in homogeneity. Some seek to minimize Gini's index of diversity, others use a “twoing rule,” and hybrid methods can switch criteria as they move down the decision tree. Similarly, some methods seek to find the greatest improvement on both sides of the split, whereas other methods choose the split that achieves maximum homogeneity on one side or the other. Some methods grow elaborate trees and then prune back to improve predictive accuracy outside the training sample (this is a partial response to the kinds of overfit concerns that arise from the curse of dimensionality).

However it is done, the result is a decision tree. Figure 2 shows a hypothetical decision tree that might be built from credit applicant data. The first split is on income, a continuous variable. To the left-hand side, corresponding to applicants with incomes less than $25,000, the next split is categorical, and divides according to whether the applicant has previously declared bankruptcy. Going back to the top of the tree, the right-hand side splits on a linear combination; the applicant is considered a good risk if a linear combination of their age and income exceeds a threshold. This is a simplistic example, but it illustrates the interpretability of such trees, the fact that the same variable may be used more than once, and the different kinds of splits that can be made.

Recursive partitioning methods were first proposed by Morgan and Sonquist (1963). The method became widely popular with the advent of CART, a statistically sophisticated implementation and theoretical evaluation developed by Breiman et al. (1984). Computer scientists have also contributed to this area; prominent implementations of decision trees include ID3 and C4.5 (Quinlan, 1992). A treatment of the topic from a statistical perspective is given by Zhang and Singer (1999). The methodology extends to regression problems, and this is described in Section IV.A.5 from a model-fitting perspective.

B. Boosting

Boosting is a method invented by computer scientists to improve weak classification rules. The idea is that if one has a classification procedure that does slightly better than chance at predicting the true categories, then one can apply this procedure to the portions of the training sample that are misclassified to produce new rules and then weight all the rules together to achieve better predictive accuracy. Essentially, each rule has a weighted vote on the final classification of a case.

The procedure was proposed by Schapire (1990) and improved by Freund and Schapire (1996) under the name AdaBoost. There have been many refinements since, but the core algorithm for binary classification assumes one has a weak rule $g_1(X)$ that takes values in the set {1, −1} according to the category. Then AdaBoost starts by putting equal weight $w_i = n^{-1}$ on each of the n cases in the training sample. Next, the algorithm repeats the following steps K times:

1. Apply the procedure $g_k$ to the training sample with weights $w_1, \ldots, w_n$.

FIGURE 2 Decision tree.

2. Find the empirical probability $p_w$ of misclassification under these weightings.
3. Calculate $c_k = \ln[(1 - p_w)/p_w]$.
4. If case i is misclassified, replace $w_i$ by $w_i \exp(c_k)$. Then renormalize so that $\sum_i w_i = 1$ and go to step 1.

The final inference is the sign of $\sum_{k=1}^{K} c_k g_k(X)$, which is a weighted sum of the determinations made by each of the K rules formed from the original rule $g_1$.

This rule has several remarkable properties. Besides provably improving classification, it is also resistant to overfit, which might be expected to arise when K is large. The procedure allows quick computation, and thus can be made practical even for huge datasets, and it can be generalized to handle more than two categories. Boosting therefore provides an automatic and effective way to increase the capability of almost any classification technique. As a new method, it is the object of active research; Friedman et al. (2000) describe the current thinking in this area, linking it to the formal role of statistical models such as logistic regression and generalized additive models.

III. CLUSTER ANALYSIS

Cluster analysis is the term that describes a collection of data mining methods which take observations and form groups that are usefully similar. One common application is in market segmentation, where a merchant has data on the purchases made by a large number of customers, together with demographic information on the customers. The merchant would like to identify clusters of customers who make similar purchases, so as to better target advertising or forecast the effect of changes in product lines.

A. Clustering Strategies

The classical method for grouping observations is hierarchical agglomerative clustering. This produces a cluster tree; the top is a list of all the observations, and these are then joined to form subclusters as one moves down the tree until all cases are merged in a single large cluster. For most applications a single large cluster is not informative, however, and so data miners require a rule to stop the agglomeration algorithm before complete merging occurs. The algorithm also requires a rule to identify which subcluster should be merged next for each stage of the tree-building process.

Statisticians have not found a universally reliable rule to determine when to stop a clustering algorithm, but many have been suggested. Milligan and Cooper (1985) described a large simulation study that included a range of realistic situations. They found that no rule dominated all the others, but that the cubic clustering criterion was rarely
bad and often quite good. However, in practice, most analysts create the entire tree and then inspect it to find a point at which further linkages do not seem to further their purpose. For the market segmentation example, a merchant might be pleased to find clusters that can be interpreted as families with children, senior citizens, yuppies, and so forth, and would want to stop the clustering algorithm when further linkage would conflate these descriptive categories.

Similar diversity exists when choosing a subcluster merging rule. For example, if the point clouds associated with the final interpretable clusters appear ellipsoidal with similar shape and orientation, then one should probably have used a joining rule that connects the two subclusters whose centers have minimum Mahalanobis (1936) distance. Alternatively, if the final interpretable clusters are nonconvex point clouds, then one will have had to discover those by using some kind of nearest neighbor joining rule. Statisticians have devised many such rules and can show that no single approach can solve all clustering problems (Van Ness, 1999).

In data mining applications, most of the joining rules developed by statisticians require infeasible amounts of computation and make unreasonable assumptions about homogeneous patterns in the data. Specifically, large, complex data sets do not generally have cluster structure that is well described by sets of similar ellipsoids. Instead, data miners expect to find structures that look like sheets, ellipsoids, strings, and so forth (rather as astronomers see when looking at the large-scale structure of the universe).

Therefore, among agglomerative clustering schemes, data miners almost always use nearest neighbor clustering. This is one of the fastest clustering algorithms available, and is basically equivalent to finding a minimum spanning tree on the data. Using the Prim (1957) algorithm for spanning trees, the computation takes $O(n^2)$ comparisons (where n is the number of observations). This is feasible for medium datasets in Huber's taxonomy. Furthermore, nearest neighbor methods are fairly robust at finding the diverse kinds of structure that one anticipates.

As an alternative to hierarchical agglomerative clustering, some data miners use k-means cluster analysis, which depends upon a strategy pioneered by MacQueen (1967). Starting with the assumption that the data contain a prespecified number k of clusters, this method iteratively finds k cluster centers that maximize between-cluster distances and minimize within-cluster distances, where the distance metric is chosen by the user (e.g., Euclidean, Mahalanobis, sup norm, etc.). The method is useful when one has prior beliefs about the likely number of clusters in the data. It also can be a useful exploratory tool. In the computer science community, k-means clustering is

The primary drawback to using k-means clustering in data mining applications is that exact solution requires extensive computation. Approximate solutions can be obtained much more rapidly, and this is the direction in which many researchers have gone. However, the quality of the approximation can be problematic.

Both k-means and agglomerative cluster analysis are strongly susceptible to the curse of dimensionality. For the market segmentation example, it is easy to see that customers could form tight clusters based upon their first names, or where they went to elementary school, or other misleading features that would not provide commercial insight. Therefore it is useful to do variable selection, so that clustering is done only upon features that lead to useful divisions. However, this requires input from the user on which clusters are interpretable and which are not, and encoding such information is usually impractical. An alternative is to use robust clustering methods that attempt to find a small number of variables which produce well-separated clusters. Kaufman and Rousseeuw (1990) review many ideas in this area and pay useful attention to computational issues.

For medium datasets in Huber's taxonomy, it is possible to use visualization to obtain insight into the kinds of cluster structure that may exist. This can provide guidance on the clustering algorithms one should employ. Swayne et al. (1997) describe software that enables data miners to see and navigate across three-dimensional projections of high-dimensional datasets. Often one can see groups of points or outliers that are important in the application.

B. Racing

One of the advances that came from the data mining synergy between statisticians and computer scientists is a technique called 'racing' (Maron and Moore, 1993). This enables analysts to do much larger searches of the space of models than was previously possible.

In the context of cluster analysis, suppose one wanted to do variable selection to discover which set of demographic features led to, say, 10 consumer clusters that had small intracluster variation and large intercluster variation. One approach would be to consider each possible subset of the demographic features, run the clustering algorithm, and then decide which set of results had the best cluster separation. Obviously, this would entail much computation.

An alternative approach is to consider many feature subsets simultaneously and run the clustering algorithm for perhaps 1% of the data on each subset. Then one compares the results to see if some subsets lead to less well-defined clusters than others. If so, those subsets are eliminated and the remaining subsets are then tested against each other
known as the quantization problem and is closely related on a larger fraction of the data. In this way one can weed
to Voronoi tesselation. out poor feature subset choices with minimal computation
P1: ZBU Final Pages
Encyclopedia of Physical Science and Technology En004F-164 June 8, 2001 16:12

256 Data Mining, Statistics

and reserve resources for the evaluation of the very best candidates.

Racing can be done with respect to a fixed fraction of the data or a fixed amount of runtime. Obviously, this admits the possibility of errors. By chance a good method might appear poor when tested with only a fraction of the data or for only a fixed amount of computer time, but the probability of such errors can be controlled statistically, and the benefit far outweighs the risk. If one is using racing on data fractions, the problem of chance error makes it important that the data be presented to the algorithm in random order; otherwise the outcome of the race might depend upon subsamples that are not representative of the entire dataset.

Racing has much broader application than cluster analysis. For classification problems, competing classifiers can be tested against each other, and those with high misclassification rates are quickly eliminated. Similarly, for regression problems, competing regression models are raced to quickly eliminate those that show poor fit. In both applications one can easily obtain a 100-fold increase in the size of the model space that is searched, and this leads to the discovery of better classifiers and better regression functions.

IV. NONPARAMETRIC REGRESSION

Regression is a key problem area in data mining and has attracted a substantial amount of research attention. Among dozens of new techniques for nonparametric regression that have been invented over the last 15 years, we detail seven that are widely used. Section IV.A describes the additive model (AM), alternating conditional expectation (ACE), projection pursuit regression (PPR), neural nets (NN), recursive partitioning regression (RPR), multivariate adaptive regression splines (MARS), and locally weighted regression (LOESS). Section IV.B compares these techniques in terms of performance and computation, and Section IV.C describes how bagging can be used to improve predictive accuracy.

A. Seven Methods

The following seven methods may seem very different, but they employ at most two distinct strategies for addressing the curse of dimensionality. One strategy fits purely local models; this is done by RPR and LOESS. The other strategy uses low-dimensional smoothing to achieve flexibility in fitting specific model forms; this is done by AM, ACE, NN, and PPR. MARS combines both strategies.

1. The Additive Model (AM)

Many researchers have developed the AM; Buja et al. (1989) describe the early development in this area. The simplest AM says that the expected value of an observation \(Y_i\) can be written

\[ E[Y_i] = \theta_0 + \sum_{j=1}^{p} f_j(X_{ij}), \tag{2} \]

where the functions \(f_j\) are unspecified but have mean zero. Since the functions \(f_j\) are estimated from the data, the AM avoids the conventional statistical assumption of linearity in the explanatory variables; however, the effects of the explanatory variables are still additive. Thus the response is modeled as a sum of arbitrary smooth univariate functions of explanatory variables. One needs about 100 data points to estimate each \(f_j\), but under the model given in (2), the requirement for data grows only linearly in \(p\).

The backfitting algorithm is the essential tool used in estimating an additive model. This algorithm requires some smoothing operation (e.g., kernel smoothing or nearest neighbor averages; Hastie and Tibshirani, 1990) which we denote by \(\mathrm{Sm}(\cdot \mid \cdot)\). For a large class of smoothing operations, the backfitting algorithm converges uniquely.

The backfitting algorithm works as follows:

1. At initialization, define functions \(f_j^{(0)} \equiv 0\) and set \(\theta_0 = \bar{Y}\).
2. At the ith iteration, estimate \(f_j^{(i+1)}\) by
\[ f_j^{(i+1)} = \mathrm{Sm}\Bigl( Y - \theta_0 - \sum_{k \neq j} f_k^{(i)} \,\Big|\, X_{1j}, \ldots, X_{nj} \Bigr) \]
for \(j = 1, \ldots, p\).
3. Check whether \(|f_j^{(i+1)} - f_j^{(i)}| < \delta\) for all \(j = 1, \ldots, p\), for \(\delta\) the prespecified convergence tolerance. If not, go back to step 2; otherwise, take the current \(f_j^{(i)}\) as the additive function estimate of \(f_j\) in the model.

This algorithm is easy to code. Its speed is chiefly determined by the complexity of the smoothing function.

One can generalize the AM by permitting it to add a few multivariate functions that depend on prespecified explanatory variables. Fitting these would require bivariate smoothing operations. For example, if one felt that prediction of tax owed in the AM would be improved by including a function that depended upon both a person’s previous year’s declaration and their current marital status, then this bivariate smoother could be used in the second step of the backfitting algorithm.

2. Alternating Conditional Expectations (ACE)

A generalization of the AM allows a smoothing transformation of the response variable as well as the smoothing of the \(p\) explanatory variables. This uses the ACE algorithm, as developed by Breiman and Friedman (1985), and it fits the model
\[ E[g(Y_i)] = \theta_0 + \sum_{j=1}^{p} f_j(X_{ij}). \tag{3} \]

Here all conditions are as stated before for (2), except that \(g\) is an arbitrary smooth function scaled to ensure the technically necessary requirement that \(\mathrm{var}[g(Y)] = 1\) (if not for this constraint, one could get a perfect fit by setting all functions to be identically zero).

Given data \(Y_i\) and \(X_i\), one wants to find \(g\), \(\theta_0\), and \(f_1, \ldots, f_p\) such that \(E[g(Y_i) \mid X_i] - \theta_0 - \sum_{j=1}^{p} f_j(X_{ij})\) is well described as independent error. Thus one solves

\[ (\hat{g}, \hat{f}_1, \ldots, \hat{f}_p) = \operatorname*{argmin}_{(g, f_1, \ldots, f_p)} \sum_{i=1}^{n} \Bigl[ g(Y_i) - \sum_{j=1}^{p} f_j(X_{ij}) \Bigr]^2, \]

where \(\hat{g}\) is constrained to satisfy the unit-variance requirement. The algorithm for achieving this is described by Breiman and Friedman (1985); they modify the backfitting algorithm to provide a step that smooths the left-hand side while maintaining the variance constraint.

ACE analysis returns sets of functions that maximize the linear correlation between the sum of the smoothed explanatory variables and the smoothed response variable. Therefore ACE is more similar in spirit to the multiple correlation coefficient than to multiple regression. Because ACE does not directly attempt a regression analysis, it has certain undesirable features; for example, small changes in the data can lead to very different solutions (Buja and Kass, 1985), it need not reproduce model transformations, and, unlike regression, it treats the explanatory and response variables symmetrically.

To redress some of the drawbacks in ACE, Tibshirani (1988) devised a modification called AVAS, which uses a variance-stabilizing transformation in the backfitting loop when fitting the explanatory variables. This modification is somewhat technical, but in theory it leads to improved properties when treating regression applications.

3. Projection Pursuit Regression (PPR)

The AM uses sums of functions whose arguments are the natural coordinates for the space \(\mathbb{R}^p\) of explanatory variables. But when the true regression function is additive with respect to pseudovariables that are linear combinations of the explanatory variables, then the AM is inappropriate. PPR was developed by Friedman and Stuetzle (1981) to address such situations.

Heuristically, imagine there are two explanatory variables and suppose the regression surface is shaped like a sheet of corrugated aluminum. If that sheet is oriented to make the corrugations parallel to the axis of the first explanatory variable (and thus perpendicular to the second), then AM works well. When the aluminum sheet is rotated slightly so that the corrugations do not parallel a natural axis, however, AM fails because the true function is a nonadditive function of the explanatory variables. PPR would succeed, however, because the true function can be written as an additive model whose functions have arguments that are linear combinations of the explanatory variables.

PPR combines the backfitting algorithm with a numerical search routine, such as Gauss–Newton, to fit models of the form

\[ E[Y_i] = \sum_{k=1}^{r} f_k(\alpha_k^{T} X_i). \tag{4} \]

Here the \(\alpha_1, \ldots, \alpha_r\) are unit vectors that define a set of \(r\) linear combinations of explanatory variables. The linear combinations are similar to those used for principal components analysis (Flury, 1988). These vectors need not be orthogonal, and are chosen to maximize predictive accuracy in the model as estimated by cross-validation.

Operationally, PPR alternates calls to two routines. The first routine conditions on a set of pseudovariables given by linear combinations of original variables; these are fed into the backfitting algorithm to obtain an AM in the pseudovariables. The other routine conditions on the estimated AM functions from the previous step and then searches to find linear combinations of the original variables which maximize the fit of those functions. By alternating iterations of these routines, the result converges to a unique solution.

PPR is often hard to interpret for \(r > 1\) in (4). When \(r\) is allowed to increase without bound, PPR is consistent, meaning that as the sample size grows, the estimated regression function converges to the true function.

Another improvement that PPR offers over AM is that it is invariant to affine transformations of the data; this is often desirable when the explanatory variables are measured in the same units and have similar scientific justifications. For example, PPR might be sensibly used to predict tax that is owed when the explanatory variables are shares of stock owned in various companies. Here it makes sense that linear combinations of shares across commercial sectors would provide better prediction of portfolio appreciation than could be easily obtained from the raw explanatory variables.

4. Neural Nets (NN)

Many neural net techniques exist, but from a statistical regression standpoint (Barron and Barron, 1988), nearly all variants fit models that are weighted sums of sigmoidal
functions whose arguments involve linear combinations of the data. A typical feedforward network uses a model of the form

\[ E[Y] = \beta_0 + \sum_{i=1}^{m} \beta_i f(\alpha_i^{T} x + \gamma_{i0}), \]

where \(f(\cdot)\) is a logistic function and the \(\beta_0\), \(\gamma_{i0}\), and \(\alpha_i\) are estimated from the data. Formally, this approach is similar to that in PPR. The choice of \(m\) determines the number of hidden nodes in the network and affects the smoothness of the fit; in most cases the user determines this parameter, but it is also possible to use statistical techniques, such as cross-validation to assess model fit, that allow \(m\) to be estimated from the data.

Neural nets are widely used, although their performance properties, compared to alternative regression methods, have not been thoroughly studied. Ripley (1996) describes one assessment which finds that neural net methods are not generally competitive. Schwarzer et al. (1986) review the use of neural nets for prognostic and diagnostic classification in clinical medicine and reach similar conclusions. Another difficulty with neural nets is that the resulting model is hard to interpret. The Bayesian formulation of neural net methods by Neal (1996) provides some remedy for this difficulty.

PPR is very similar to neural net methods. The primary difference is that neural net techniques usually assume that the functions \(f_k\) are sigmoidal, whereas PPR allows more flexibility. Zhao and Atkeson (1992) show that PPR has similar asymptotic properties to standard neural net techniques.

5. Recursive Partitioning Regression (RPR)

RPR has become popular since the release of CART (Classification and Regression Tree) software developed by Breiman et al. (1984). This technique has already been described in the context of classification, so this subsection focuses upon its application to regression. The RPR algorithm fits the model

\[ E[Y_i] = \sum_{j=1}^{M} \theta_j I_{R_j}(X_i), \]

where the \(R_1, \ldots, R_M\) are rectangular regions which partition \(\mathbb{R}^p\), and \(I_{R_j}(X_i)\) denotes an indicator function that takes the value 1 if and only if \(X_i \in R_j\) and is otherwise zero. Here \(\theta_j\) is the estimated numerical value of all responses with explanatory variables in \(R_j\).

RPR was intended to be good at discovering local low-dimensional structure in functions with high-dimensional global dependence. RPR is consistent; also, it has an attractive graphic representation as a decision tree, as illustrated for classification in Fig. 2. Many common functions are difficult for RPR, however; for example, it approximates a straight line by a stairstep function. In high dimensions it can be difficult to discover when the RPR piecewise constant model closely approximates a simple smooth function.

To be concrete, suppose one used RPR to predict tax obligation. The algorithm would first search all possible splits in the training sample observations and perhaps divide on whether or not the declared income is greater than $25,000. For people with lower incomes, the next search might split the data on marital status. For those with higher incomes, a subsequent split might depend on whether the declared income exceeds $75,000. Further splits might depend on the number of children, the age and profession of the declarer, and so forth. The search process repeats in every subset of the training data defined by previous divisions, and eventually there is no potential split that sufficiently reduces variability to justify further partitions. At this point RPR fits the averages of all the training cases within the most refined subsets as the estimates \(\theta_j\) and shows the sequence of chosen divisions in a decision tree. (Note: RPR algorithms can be more complex; e.g., CART “prunes back” the final tree by removing splits to achieve better balance of observed fit in the training sample with future predictive error.)

6. Multivariate Adaptive Regression Splines (MARS)

Friedman (1991) proposed a data mining method that combines PPR with RPR through use of multivariate adaptive regression splines. It fits a model formed as a weighted sum of multivariate spline basis functions (tensor-spline basis functions) and can be written as

\[ E[Y_i] = \sum_{k=0}^{q} a_k B_k(X_i), \]

where the coefficients \(a_k\) are estimated by (generalized) cross-validation fitting. The constant term is obtained by setting \(B_0(X_1, \ldots, X_n) \equiv 1\), and the multivariate spline terms are products of univariate spline basis functions:

\[ B_k(x_1, \ldots, x_n) = \prod_{s=1}^{r_k} b(x_{i(s,k)} \mid t_{s,k}), \quad 1 \le k \le r. \]

The subscript \(i(s, k)\) identifies a particular explanatory variable, and the basis spline for that variable puts a knot at \(t_{s,k}\). The values of \(q\), the \(r_1, \ldots, r_q\), the knot locations, and the explanatory variables selected for inclusion are determined from the data adaptively.

MARS output can be represented and interpreted in a decomposition similar to that given by analysis of variance. It is constructed to work well if the true function has local dependence on only a few variables. MARS can automatically accommodate interactions between variables, and it handles variable selection in a natural way.

7. Locally Weighted Regression (LOESS)

Cleveland (1979) developed a technique based on locally weighted regression. Instead of simply taking a local average, LOESS fits the model \(E[Y] = \theta(x)^{T} x\), where

\[ \hat{\theta}(x) = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} w_i(x) (Y_i - \theta^{T} X_i)^2 \tag{5} \]

and \(w_i\) is a weight function that governs the influence of the ith datum according to the direction and distance of \(X_i\) from \(x\).

LOESS is a consistent estimator but may be inefficient at finding relatively simple structures in the data. Although not originally intended for high-dimensional regression, LOESS uses local information with advantageous flexibility. Cleveland and Devlin (1988) generalized LOESS to perform polynomial regression rather than the linear regression \(\theta^{T} X_i\) in (5), but from a data mining perspective, this increases the cost of computation with little improvement in overall predictive accuracy.

B. Comparisons

Nonparametric regression is an important area for testing the performance of data mining methods because statisticians have developed a rich theoretical understanding of the issues and obstacles. Some key sources of comparative information are as follows:

• Donoho and Johnstone (1989) make asymptotic comparisons among the more mathematically tractable techniques. They indicate that projection-based methods (PPR, MARS) perform better for radial functions but kernel-based methods (LOESS) are superior for harmonic functions. (Radial functions are constant on hyperspheres centered at 0, while harmonic functions vary sinusoidally on such hyperspheres.)

• Friedman (1991) describes simulation studies of MARS, and related work is cited by his discussants. Friedman focuses on several criteria; these include scaled versions of mean integrated squared error (MISE) and predictive-squared error (PSE). From the standpoint of data mining practice, the most useful conclusions are (1) for data that are pure noise in 5 and 10 dimensions, and sample sizes of 50, 100, and 200, AM and MARS are comparable and both are unlikely to find false structure, and (2) if test data are generated from the following additive function of five variables

\[ Y = 0.1 \exp(4X_1) + \frac{4}{1 + \exp(-20X_2 + 10)} + 3X_3 + 2X_4 + X_5, \]

with five additional noise variables and sample sizes of 50, 100, and 200, MARS had a slight tendency to overfit, particularly for the smallest sample sizes. All of these simulations, except for the pure noise condition, have large signal-to-noise ratios.

• Barron (1991, 1993) proved that in a narrow sense, the MISE for neural net estimates in a certain class of functions has order \(O(1/m) + O(mp/n) \ln n\), where \(m\) is the number of nodes, \(p\) is the dimension, and \(n\) is the sample size. Since this is linear in dimension, it evades the COD; similar results have been obtained by Zhao and Atkeson (1992) for PPR, and it is probable that a similar result holds for MARS. These findings have limited practical value; Barron’s class of functions consists of those whose Fourier transform \(\tilde{g}\) satisfies \(\int |\omega| |\tilde{g}(\omega)| \, d\omega < c\) for some fixed \(c\). This excludes such simple cases as hyperflats, and the class becomes smoother as dimension increases.

• De Veaux et al. (1993) tested MARS and a neural net on two functions; they found that MARS was faster and had better MISE.

• Zhang and Singer (1999) applied CART, MARS, multiple logistic regression, and other methods in several case studies and found that no method dominates the others. CART was often most interpretable, but logistic regression led more directly to estimation of uncertainty.

The general conclusions are (1) parsimonious modeling is increasingly important in high dimensions, (2) hierarchical models using sums of piecewise-linear functions are relatively good, and (3) for any method, there are datasets on which it succeeds and datasets on which it fails.

In practice, since it is hard to know which method works best in a given situation, data miners usually hold out a part of the data, apply various regression techniques to the remaining data, and then use the models that are built to estimate the hold-out sample. The method that achieves minimum prediction error is likely to be the best for that application.

Most of the software detailed in this section is available from the StatLib archive at http://lib.stat.cmu.edu; this includes AM, ACE, LOESS, and PPR. Both MARS and CART are commercially available from Salford Systems, Inc., at http://www.salford-systems.com. Splus includes versions of RPR, the (generalized) AM, ACE, PPR, and LOESS; this
is commercially available from Mathsoft, Inc., at http://www.splus.mathsoft.com.

C. Bagging

Bagging is a strategy for improving predictive accuracy by model averaging. It was proposed by Breiman (1996), but has a natural pedigree in Bayesian work on variable selection, in which one often puts weights on different possible models and then lets the data update those weights.

Concretely, suppose one has a training sample and a nonparametric regression technique that takes the explanatory variables and produces an estimated response value. Then the simplest form of bagging proceeds by drawing \(K\) random samples (with replacement) from the training sample and applying the regression technique to each random sample to produce regression rules \(T_1(X), \ldots, T_K(X)\). For a new observation, say \(X^*\), the bagging predictor of the response \(Y^*\) is \(K^{-1} \sum_{k=1}^{K} T_k(X^*)\).

The idea behind bagging is that model fitting strategies usually have high variance but low bias. This means that small changes in the data can produce very different models but that there is no systematic tendency to produce models which err in particular directions. Under these circumstances, averaging the results of many models can reduce the error in the prediction that is associated with model instability while preserving low bias.

Model averaging strategies are moving beyond simple bagging. Some employ many different kinds of regression techniques rather than just a single method. Others modify the bagging algorithm in fairly complex ways, such as arcing (Breiman, 1998). A nice comparison of some of the recent ideas in this area is given by Dietterich (1998), and Hoeting et al. (1999) give an excellent tutorial on more systematic Bayesian methods for model averaging. Model averaging removes the analyst’s ability to interpret parameters in the models used and can only be justified in terms of predictive properties.

SEE ALSO THE FOLLOWING ARTICLES

ARTIFICIAL NEURAL NETWORKS • DATABASES • DATA STRUCTURES • INFORMATION THEORY • STATISTICS, BAYESIAN • STATISTICS, MULTIVARIATE

BIBLIOGRAPHY

Banks, D. L., and Parmigiani, G. (1992). “Preanalysis of superlarge data sets,” J. Quality Technol. 24, 930–945.
Barron, A. R. (1991). “Complexity regularization with applications to artificial neural networks.” In “Nonparametric Functional Estimation” (G. Roussas, ed.), pp. 561–576, Kluwer, Dordrecht.
Barron, A. R. (1993). “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Trans. Information Theory 39, 930–945.
Barron, A. R., and Barron, R. L. (1988). “Statistical learning networks: A unifying view,” Comput. Sci. Stat. 20, 192–203.
Bay, S. D. (1999). “The UCI KDD Archive,” http://kdd.ics.uci.edu. Department of Information and Computer Science, University of California, Irvine, CA.
Breiman, L. (1996). “Bagging predictors,” Machine Learning 26, 123–140.
Breiman, L. (1998). “Arcing classifiers,” Ann. Stat. 26, 801–824.
Breiman, L., and Friedman, J. (1985). “Estimating optimal transformations for multiple regression and correlation,” J. Am. Stat. Assoc. 80, 580–619.
Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. (1984). “Classification and Regression Trees,” Wadsworth, Belmont, CA.
Buja, A., and Kass, R. (1985). “Discussion of ‘Estimating optimal transformations for multiple regression and correlation,’ by Breiman and Friedman,” J. Am. Stat. Assoc. 80, 602–607.
Buja, A., Hastie, T. J., and Tibshirani, R. (1989). “Linear smoothers and additive models,” Ann. Stat. 17, 453–555.
Cleveland, W. (1979). “Robust locally weighted regression and smoothing scatterplots,” J. Am. Stat. Assoc. 74, 829–836.
Cleveland, W., and Devlin, S. (1988). “Locally weighted regression: An approach to regression analysis by local fitting,” J. Am. Stat. Assoc. 83, 596–610.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. G. (1999). “Probabilistic Networks and Expert Systems,” Springer-Verlag, New York.
Dietterich, T. (1998). “An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization,” Machine Learning 28, 1–22.
De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). “A comparison of two nonparametric estimation schemes: MARS and neural networks,” Computers Chem. Eng. 17, 819–837.
Donoho, D. L., and Johnstone, I. (1989). “Projection based approximation and a duality with kernel methods,” Ann. Stat. 17, 58–106.
Fisher, R. A. (1936). “The use of multiple measurements in taxonomic problems,” Ann. Eugen. 7, 179–188.
Flury, B. (1988). “Common Principal Components and Related Multivariate Models,” Wiley, New York.
Freund, Y., and Schapire, R. E. (1996). “Experiments with a new boosting algorithm.” In “Machine Learning: Proceedings of the Thirteenth International Conference,” pp. 148–156, Morgan Kaufmann, San Mateo, CA.
Friedman, J. H. (1991). “Multivariate adaptive regression splines,” Ann. Stat. 19, 1–66.
Friedman, J. H., and Stuetzle, W. (1981). “Projection pursuit regression,” J. Am. Stat. Assoc. 76, 817–823.
Friedman, J. H., Hastie, T., and Tibshirani, R. (2000). “Additive logistic regression: A statistical view,” Ann. Stat. 28, 337–373.
Glymour, C., Madigan, D., Pregibon, D., and Smyth, P. (1997). “Statistical themes and lessons for data mining,” Data Mining Knowledge Discovery 1, 11–28.
Hastie, T. J., and Tibshirani, R. J. (1990). “Generalized Additive Models,” Chapman and Hall, New York.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). “Bayesian model averaging: A tutorial,” Stat. Sci. 14, 382–417.
Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities,” Proc. Natl. Acad. Sci. USA 79, 2554–2558.
Huber, P. J. (1994). “Huge data sets.” In “Proceedings of the 1994 COMPSTAT Meeting” (R. Dutter and W. Grossmann, eds.), pp. 221–239, Physica-Verlag, Heidelberg.
Jordan, M. I. (ed.). (1998). “Learning in Graphical Models,” MIT Press, Cambridge, MA.
Kaufman, L., and Rousseeuw, P. J. (1990). “Finding Groups in Data: An Introduction to Cluster Analysis,” Wiley, New York.
MacQueen, J. (1967). “Some methods for classification and analysis of multivariate observations.” In “Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,” pp. 281–297, University of California Press, Berkeley, CA.
Mahalanobis, P. C. (1936). “On the generalized distance in statistics,” Proc. Natl. Inst. Sci. India 12, 49–55.
Maron, O., and Moore, A. W. (1993). “Hoeffding races: Accelerating model selection search for classification and function approximation.” In “Advances in Neural Information Processing Systems 6,” pp. 38–53, Morgan Kaufmann, San Mateo, CA.
McCulloch, W. S., and Pitts, W. (1943). “A Logical calculus of the ideas immanent in nervous activity,” Bull. Math. Biophys. 5, 115–133.
Milligan, G. W., and Cooper, M. C. (1985). “An examination of procedures for determining the number of clusters in a dataset,” Psychometrika 50, 159–179.
Minsky, M., and Papert, S. A. (1969). “Perceptrons: An Introduction to Computational Geometry,” MIT Press, Cambridge, MA.
Morgan, J. N., and Sonquist, J. A. (1963). “Problems in the analysis of survey data and a proposal,” J. Am. Stat. Assoc. 58, 415–434.
National Research Council. (1997). “Massive Data Sets: Proceedings of a Workshop,” National Academy Press, Washington, DC.
Neal, R. (1996). “Bayesian Learning for neural networks,” Springer-Verlag, New York.
Pearl, J. (1982). “Causality,” Cambridge University Press, Cambridge.
Pearl, J. (2000). “Causality: Models, Reasoning and Inference,” Cambridge University Press, Cambridge.
Press, S. J. (1982). “Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference,” 2nd ed., Krieger, Huntington, NY.
Prim, R. C. (1957). “Shortest connection networks and some generalizations,” Bell Syst. Tech. J. 36, 1389–1401.
Quinlan, J. R. (1992). “C4.5: Programs for Machine Learning,” Morgan Kaufmann, San Mateo, CA.
Ripley, B. D. (1996). “Pattern Recognition and Neural Networks,” Cambridge University Press, Cambridge.
Rosenblatt, F. (1958). “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychol. Rev. 65, 386–408.
Roweis, S. T., and Saul, L. K. (2000). “Nonlinear dimensionality reduction by local linear embedding,” Science 290, 2323–2326.
Rumelhart, D., Hinton, G. E., and Williams, R. J. (1986). “Learning representations by back-propagating errors,” Nature 323, 533–536.
Schapire, R. E. (1990). “The strength of weak learnability,” Machine Learning 5, 197–227.
Schwarzer, G., Vach, W., and Schumacher, M. (1986). “On misuses of artificial neural networks for prognostic and diagnostic classification in oncology,” Stat. Med. 19, 541–561.
Scott, D. W., and Wand, M. P. (1991). “Feasibility of multivariate density estimates,” Biometrika 78, 197–206.
Spirtes, P., Glymour, C., and Scheines, R. (2001). “Causation, Prediction, and Search,” 2nd ed., MIT Press, Cambridge, MA.
Stein, M. L. (1987). “Large sample properties of simulations using latin hypercube sampling,” Technometrics 29, 143–151.
Swayne, D. F., Cook, D., and Buja, A. (1997). “XGobi: Interactive dynamic graphics in the X window system,” J. Comput. Graphical Stat. 7, 113–130.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). “A global geometric framework for nonlinear dimensionality reduction,” Science 290, 2319–2323.
Tibshirani, R. (1988). “Estimating optimal transformations for regression via additivity and variance stabilization,” J. Am. Stat. Assoc. 83, 394–405.
Van Ness, J. W. (1999). “Recent results in clustering admissibility.” In “Applied Stochastic Models and Data Analysis” (H. Bacelar-Nicolau, F. Costa Nicolau, and J. Janssen, eds.), pp. 19–29, Instituto Nacional de Estatistica, Lisbon, Portugal.
Vapnik, V. N. (2000). “The Nature of Statistical Learning,” 2nd ed., Springer-Verlag, New York.
Zhang, H., and Singer, B. (1999). “Recursive Partitioning in the Health Sciences,” Springer-Verlag, New York.
Zhao, Y., and Atkeson, C. G. (1992). “Some approximation properties of projection pursuit networks.” In “Advances in Neural Information Processing Systems 4” (J. Moody, S. J. Hanson, and R. P. Lippmann, eds.), pp. 936–943, Morgan Kaufmann, San Mateo, CA.
P1: GRA/GLT P2: FQP Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN004J-167 June 8, 2001 17:9

Designs and Error-Correcting Codes
K. T. Phelps
C. A. Rodger
Auburn University

I. Introduction
II. Perfect Codes
III. Constant Weight Codes
IV. Maximum Distance Separable Codes
V. Convolutional Codes

GLOSSARY

Binary sum, u + v  Componentwise addition of u and v with 1 + 1 = 0 (exclusive or).
Binary word  A string (or vector or sequence) of 0’s and 1’s.
Code  A set of (usually binary) words, often all of the same length.
Combinatorial design  A collection of subsets of a set satisfying additional regularity properties.
Decoding  Finding the most likely codeword (or message) transmitted.
Distance, d(u, v)  The number of coordinates in which u and v differ.
Encoding  The assignment of codewords to messages.
Information rate  The fraction of information per transmitted bit.
Weight, wt(u)  The number of nonzero bits in the word u.

I. INTRODUCTION

Both error-correcting codes and combinatorial designs are areas of discrete (not continuous) mathematics that began in response to applied problems, the first in making the electronic transmission of information reliable and the second in the design of experiments with results being statistically analyzed. It turns out that there is substantial overlap between these two areas, mainly because both are looking for uniformly distributed subsets within certain finite sets. In this article we provide a brief introduction to both areas and give some indication of their interaction.

A. Error-Correcting Codes

Error-correcting codes is a branch of discrete mathematics, electrical engineering, and computer science that has developed over the past 50 years, largely in response to


the dramatic growth of electronic transfer and storage of information. Coding Theory began in the late 1940s and early 1950s with the seminal work of Shannon (1948), Hamming (1950), and Golay (1949). Error-correcting codes’ first significant application was in NASA’s deep space satellite communications. Other important applications since then have been in storage devices (e.g., compact discs), wireless telephone channels, and geopositioning systems. They are now routinely used in all satellite communications and mobile wireless communications systems.

Since no communication system is ideal, information can be altered, corrupted, or even destroyed by noise. Any communication system needs to be able to recognize or detect such errors and have some scheme for recovering the information or correcting the error. In order to protect against the more likely errors and thus improve the reliability, redundancy must be incorporated into the message. As a crude example, one could simply transmit the message several times in the expectation that the majority will appear correctly at their destination, but this would greatly increase the cost in terms of time or the rate of transmission (or space in storage devices).

The most basic error control scheme involves simply detecting errors and requesting retransmission. For many communications systems, requests for retransmission are impractical or impose unacceptable costs on the communication system’s performance. In deep space communications, a request for retransmission would take too long. In speech communications, noticeable delays are unacceptable. In broadcast systems, it is impractical given the multitude of receivers. The problem of correcting errors and recovering information becomes of paramount importance in such constrained situations.

Messages can be thought of as words over some alphabet, but for all practical purposes, all messages are simply strings of 0’s and 1’s, or binary words. Information can be partitioned or blocked up into a sequence of binary words or messages of fixed length k. A (block) code, C, is a set of binary words of fixed length n, each element of which is called a codeword. Mathematically, codewords can be considered to be vectors of length n with elements being chosen from a finite field, normally of order 2, but in some cases from the field GF($2^r$). [So, for example, the binary codeword 01101 could also be represented as the vector (0, 1, 1, 0, 1).] There are also convolutional codes where the codewords do not have fixed length (or have infinite length), but these will be discussed later. Encoding refers to the method of assigning messages to codewords which are then transmitted. Clearly, the number of codewords has to be at least as large as the number of k-bit messages. The rate of the code is k/n, since k bits of information result in n bits being transmitted.

The key to being able to detect or correct errors that occur during transmission is to have a code, C, such that no two codewords are close. The distance between any two words u and v, denoted by d(u, v), is simply the number of coordinates in which the two words differ. The weight of u, wt(u), is just the number of nonzero coordinates in u. Using the binary sum (so 1 + 1 = 0), we have d(u, v) = wt(u + v).

Under the assumptions that bits are more likely to be transmitted correctly than incorrectly (a natural assumption) and that messages are equally likely to be transmitted (this condition can be substantially relaxed), it is easy to show that for any received word w, the most likely codeword originally transmitted is the codeword c ∈ C for which d(c, w) is least; that is, the most likely codeword sent is the one closest to the received word. This leads to the definition of the minimum distance $d(C) = \min_{c_1, c_2 \in C,\ c_1 \ne c_2} \{ d(c_1, c_2) \}$ of a code C, the distance between the two closest codewords in C. Clearly, if a word w is received such that $d(c, w) \le t = \lfloor (d(C) - 1)/2 \rfloor$ for some c ∈ C, then the unique closest codeword to w is c; therefore, c is the most likely codeword sent and we decode w to c. (Notice that if c was the codeword that was originally transmitted, then at most t bits were altered in c during transmission to result in w being received.) So decoding each received word to the closest codeword (known as maximum likelihood decoding, or MLD) will always result in correct decoding provided at most t errors occur during transmission. Furthermore, if c1 and c2 are the two closest codewords [so d(c1, c2) = d(C)], then it is clearly possible to change t + 1 bits in c1 so that the resulting word w satisfies d(c1, w) > d(c2, w). Therefore, if these t + 1 bits are altered during the transmission of c1, then, using MLD, we would incorrectly decode the received word w to c2. Since MLD results in correct decoding for C no matter which codeword is transmitted and no matter which set of up to t bits are altered during transmission, and since this is not true if we replace t with t + 1, C is known as a t-error-correcting code, or as an (n, |C|, d) code where d = d(C) ≥ 2t + 1.

The construction problem is to find an (n, |C|, d) code, C, such that the minimum distance d (and thus t) is large, which improves the error-correction ability and thus the reliability of transmission, and such that |C| is large, so the rate of transmission ($\log_2 |C| / n = k/n$) is closer to 1. Since messages are usually blocked up into k-bit words, one usually has $\log_2 |C| = k$. Clearly, these aims compete against each other. The more codewords one packs together in C, the harder it is to keep them far apart. In practice one needs to make decisions about the reliability of the channel and the need to get the message transmitted correctly and then weigh that against the cost of decreasing the rate of transmission (or increasing the amount of data to be stored). It

is possible to obtain bounds on one of the three parameters n, d, and |C| in terms of the other two, and in some cases families of codes have been constructed which meet these bounds. Two of these families are discussed in this article: the perfect codes are codes that meet the Hamming Bound and are described in Section II; the maximum distance separable (or MDS) codes are codes that meet the Singleton Bound and are described in Section IV.

The second problem associated with error-correcting codes is the encoding problem. Each message m is to be assigned a unique codeword c, but this must be done efficiently. One class of codes that have an efficient encoding algorithm are linear codes; that is, the binary sum of any pair of codewords in the code C is also a codeword in C (here, binary sum means componentwise binary addition of the binary digits or the exclusive or of the two codewords). This means that C is a vector space and thus has a basis which we can use to form the rows of a generating matrix G. Then each message m is encoded to the codeword c = mG. One might also require that C have the additional property of being cyclic; that is, the cyclic shift $c' = x_n x_1 x_2 \ldots x_{n-1}$ of any codeword $c = x_1 x_2 \ldots x_n$ (where $x_i \in \{0, 1\}$) is also a codeword for every codeword c in C. If C is cyclic and linear, then encoding can easily be completed using a shift register design. The representation of a code is critical in decoding. For example, Hamming codes (see Section II) possess a cyclic representation but also have equivalent representations that are not cyclic.

The final main problem associated with error-correcting codes is the decoding problem. It is all well and good to know from the design of the code that all sets of up to t errors occurring during transmission result in a received word, w, that is closer to the codeword c that was sent than it is to any other codeword; but given w, how do you efficiently find c and recover m? Obviously, one could test w against each possible codeword and eventually determine which is closest, but some codes are very big. Not only that, but it can also be imperative that decoding be done extremely quickly, as the following example demonstrates.

The introduction of the compact disc (CD) by Philips in 1979 revolutionized the recording industry. This may not have been possible without the heavy use of error-correcting codes in each CD. (Errors can occur, for example, from incorrect cutting of the CD.) Each codeword on each CD represents less than 0.0001 sec of music, is represented by a binary word of length 588 bits, and is initially selected from a Reed-Solomon code (see Section IV) that contains $2^{192}$ codewords. Clearly, this is an application where all decoding must take place with no delay, as nobody will buy a CD that stops the music while the closest codeword is being found! It turns out that not only are the Reed-Solomon codes excellent in that they meet the Singleton Bound (see Section IV), but they also have an extremely efficient decoding algorithm which finds the closest codeword without having to compare the received word to all $2^{192}$ codewords.

Again, the class of linear codes also has a relatively efficient decoding algorithm. Associated with each linear code C is the dual code $C^\perp$ consisting of all vectors (codewords) such that the dot product with any codeword in C is 0 (again using xor or binary arithmetic). This is useful because of the fact that if H is a generating matrix for $C^\perp$, then $Hw^T = 0$ if and only if w is a codeword. H is also known as the parity check matrix for C. The word $s = Hw^T$ is known as the syndrome of w. Syndromes are even more useful because it turns out that for each possible syndrome s, there exists a word $e_s$ with the property that a closest codeword to any received word w with syndrome s is $w + e_s$. This observation is taken even further to obtain a very efficient decoding algorithm for the Reed-Solomon codes that can deal with the $2^{192}$ codewords in real time; it incorporates the fact that these codes are not only linear but also cyclic.

Another family of codes that NASA uses is the convolutional codes. Theoretically, these codes are infinite in length, so a completely different decoding algorithm is required in this case (see Section V).

In the following sections, we focus primarily on the construction of some of the best codes, putting aside discussion of the more technical problem of describing decoding algorithms for all except the convolutional codes in Section V. This allows the interaction between codes and designs to be highlighted.

B. Combinatorial Designs

Although (combinatorial) designs were studied earlier by such people as Euler, Steiner, and Kirkman, it was Yates (1936) who gave the subject a shot in the arm in 1935 by pointing out their use in statistics in the design of experiments. In particular, he defined what has become known as an (n, k, λ) balanced incomplete block design (BIBD) to be a set V of n elements and a set B of subsets (called blocks) of V such that

1. Each block has size k < n.
2. Each pair of elements in V occurs as a subset of exactly λ blocks in B.

Fisher and Yates (1938) went on to find a table of small designs, and Bose (1939) soon after began a systematic study of the existence of such designs. Bose made use of finite geometries and finite fields in many of his constructions. A natural generalization of BIBD is to replace (2) with

2′. Each (t + 1)-element subset of V occurs as a subset of exactly λ blocks of B.
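
As a small illustration of the two defining conditions, they can be checked mechanically on a concrete set system. The following Python sketch is ours and rests on stated assumptions: the example is the Fano plane on the points 0, ..., 6 with blocks {i, i + 1, i + 3} (mod 7), and the function name is_t_design is hypothetical. It tests condition 1 and the generalized condition 2′ by brute force, which is practical only for small designs.

```python
from itertools import combinations

def is_t_design(points, blocks, k, subset_size, lam):
    """Check that every block has size k and that every subset of the given
    size (t + 1 in the text) lies in exactly lam blocks."""
    blocks = [frozenset(b) for b in blocks]
    if any(len(b) != k for b in blocks):
        return False
    return all(
        sum(1 for b in blocks if set(sub) <= b) == lam
        for sub in combinations(points, subset_size)
    )

# Assumed example: the Fano plane, blocks {i, i + 1, i + 3} taken mod 7.
FANO = [(0, 1, 3), (1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 0), (5, 6, 1), (6, 0, 2)]

# Every pair of points lies in exactly one block, so this is a (7, 3, 1) BIBD.
print(is_t_design(range(7), FANO, k=3, subset_size=2, lam=1))
# Not every triple of points is a block, so it is not a design on 3-subsets.
print(is_t_design(range(7), FANO, k=3, subset_size=3, lam=1))
```

Removing any one block from FANO makes the first check fail, since the three pairs inside that block are then covered by no block at all.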

Such designs are known as (t + 1) designs, which can be briefly denoted by $S_\lambda(t + 1, k, n)$; in the particular case when λ = 1, they are known as Steiner (t + 1) designs which are denoted by S(t + 1, k, n). By elementary counting techniques, one can show that if s < t, then an $S_\lambda(t + 1, k, n)$ design is also an $S_\mu(s, k, n)$ design where

$$\mu = \lambda \binom{n-s}{t+1-s} \Big/ \binom{k-s}{t+1-s}.$$

Since µ must be an integer, this provides several necessary conditions for the existence of a (t + 1) design.

For many values of n, k, and t, an S(t + 1, k, n) design cannot exist. A partial S(t + 1, k, n) design then is a set of k-subsets of an n-set where any (t + 1)-subset is contained in at most one block. This is equivalent to saying that any two k-subsets intersect in at most t elements. Partial designs are also referred to as packings, and much research has focused on finding maximum packings for various parameters.

There are very few results proving the existence of s designs once s ≥ 3. Hanani found exactly when there exists an $S_\lambda(3, 4, v)$ [also called Steiner Quadruple Systems; see Lindner and Rodger (1975) for a simple proof and Hartman and Phelps (1992) for a survey], and Teirlinck (1980) proved that there exists an $S_\lambda(s, s + 1, v)$ whenever $\lambda = ((s + 1)!)^{2s+1}$, v ≥ s + 1, and $v \equiv s \pmod{((s + 1)!)^{2s+1}}$. Otherwise, just a few s designs are known [see Colbourn and Dinitz (1996)]. Much is known about their existence when s = 2. In particular, over 1000 papers have been written [see Colbourn and Rosa (1999)] about $S_\lambda(2, 3, v)$ designs (known as triple systems, and as Steiner triple systems if λ = 1). We only need to consider designs with λ = 1 in this article.

Certainly, (t + 1) designs and maximum packings are of interest in their own right, but they also play a role in the construction of good codes. To see how, suppose we have a (t + 1) design (or packing) (V, B). For each block b in B, we form its characteristic vector $c_b$ of length n, indexed by the elements in V, by placing a 1 in position i if i ∈ b and placing a 0 in position i if i ∉ b. Let C be the code $\{c_b \mid b \in B\}$. Then C is a code of length n in which all codewords have exactly k 1’s (we say each codeword has weight k). The fact that (V, B) is a (t + 1) design (or packing) also says something about the minimum distance of C: since each pair of blocks intersect in at most t elements, each pair of codewords have at most t positions where both are 1, so each pair of codewords disagree in at least 2k − 2t positions, so d(C) ≥ 2k − 2t. This connection is considered in some detail in Section III. We also show in Section II that the codewords of weight d in perfect codes together form the characteristic vectors of a (t + 1) design.

There is much literature on the topic of combinatorial designs [see Colbourn and Dinitz (1996) for an encyclopedia of designs and Lindner and Rodger (1997) for an introductory text], and this topic is also considered elsewhere in this encyclopedia, so here we restrict our attention to designs that arise in connection with codes.

II. PERFECT CODES

Let C be a code of length n with minimum distance d. Let $t = \lfloor (d - 1)/2 \rfloor$. Then, as described in Section I, for each codeword c in C, and for each binary word w of length n with d(w, c) ≤ t, the unique closest codeword to w is c. Since we can choose any i of the n positions to change in c in order to form a word of length n distance exactly i from c, the number of words distance i from c is $\binom{n}{i} = n!/((n - i)!\, i!)$. So the total number of words of length n that are distance at most t from c is $\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{t}$, one of which is c; thus the number of words of length n distance at most t from some codeword in C is $|C| \sum_{i=0}^{t} \binom{n}{i}$ (by the definition of t, no word is within distance t of two codewords). Of course, the total number of binary words of length n is $2^n$. Therefore, it must be the case that

$$|C| \le \frac{2^n}{\sum_{i=0}^{t} \binom{n}{i}}.$$

This bound is known as the Hamming Bound or the sphere packing bound. Any code that satisfies equality in the Hamming Bound is known as a perfect code, in which case d = 2t + 1.

From the argument above, it is clear that for any perfect code, each word of length n must be within distance t of a unique codeword (if a code is not perfect, then there exist words for which the distance to any closest codeword is more than t). In particular, if C is a perfect code with d = 2t + 1, then the codewords of minimum weight d in C are the characteristic vectors of the blocks of an S(t + 1, 2t + 1, n) design. To see this, note that each word w of weight t + 1 is within distance t of a unique codeword c, where clearly c must have weight d = 2t + 1. Equivalently, each (t + 1)-subset is contained in a unique d-subset, the characteristic vector of which is a codeword. In fact, for any codeword c ∈ C, one can define the neighborhood packing,

NS(c) = {x + c | x ∈ C and d(x, c) = d}.

Then, the code C will be perfect if and only if every neighborhood packing is, in fact, the set of characteristic vectors of an S(t + 1, 2t + 1, n) design. To see the converse, suppose C is a code with every NS(c) an S(t + 1, 2t + 1, n) design. If w is any word, let c ∈ C be the closest codeword and assume d(c, w) ≥ t + 1. Choose any t + 1 coordinates where c and w disagree. Since NS(c) is an S(t + 1, 2t + 1, n) design, these coordinates uniquely determine a block of size

2t + 1 in the design and, hence, a codeword c′ such that d(c, c′) = 2t + 1. So c′ disagrees with c in the same t + 1 coordinates as does w, and thus, c′ agrees with w in these t + 1 coordinates. Thus,

d(c′, w) ≤ d(c, c′) − (t + 1) + d(c, w) − (t + 1) ≤ d(c, w) − 1,

which contradicts the assumption that c was the closest codeword and thus d(c, w) ≤ t.

It turns out that perfect binary codes are quite rare. The Hamming codes are an infinite family of perfect codes of length $2^r - 1$, for any r ≥ 2, in which the distance is 3 and the number of codewords is $2^{2^r - 1 - r}$. A linear Hamming code can always be formed by defining its parity check matrix, H, to be the r × n matrix in which the columns form all nonzero binary words of length r. Notice also that, in view of the comments earlier in this section, the codewords of weight 3 in any Hamming code form a Steiner triple system, S(2, 3, n).

The other perfect binary code is the Golay code, which has length 23, distance 7, and $2^{12}$ codewords. Tietäväinen (1973), based on work by van Lint, showed that these are the only perfect binary codes and, in fact, generalized this result to codes over finite fields [see also Zinov’ev and Leont’ev (1973)]. For a survey of results on perfect codes, see van Lint (1975).

III. CONSTANT WEIGHT CODES

A constant weight (CW) code with parameters n, d, w is a set C of binary words of length n all having weight w such that the distance between any two codewords is at least d. All nontrivial (n, d, w) CW codes have d ≤ 2w. Let A(n, d, w) be the largest number of codewords in any CW code with these parameters. The classic problem then is to determine this number or find the best upper and lower bounds on A(n, d, w).

Binary CW codes have found application in synchronization problems, in areas such as optical code-division multiple-access (CDMA) communications systems, frequency-hopping spread-spectrum communications, mobile radio, radar and sonar signal design, and the construction of protocol sequences for multiuser collision channels without feedback. Constant weight codes over other alphabets have received some attention, but so far there have been few applications. We will only discuss the binary CW codes.

Constant weight codes have been extensively studied, and a good reference is MacWilliams and Sloane (1977). Eric Rains and Neil Sloane maintain a table of the best known lower bounds on A(n, d, w) on the web site: http://www.research.att.com/~njas/codes/Andw/. We will present an overview of this topic with an emphasis on the connections with designs.

Since the sum of any two binary words of the same weight always has even weight, we have A(n, 2δ − 1, w) = A(n, 2δ, w). We will assume from now on that the distance d is even. We also have A(n, d, w) = A(n, d, n − w), since whenever two words are distance d apart, so are their complements. This means one only needs to consider the case w ≤ n/2.

The connection between CW codes and designs is immediate. In terms of sets, a CW code is just a collection of w-subsets of an n-set where the intersection of any two w-subsets contains at most t = w − d/2 elements. Equivalently, a CW code is a partial S(w − d/2 + 1, w, n) Steiner system. We then have

$$A(n, d, w) \le \frac{n(n - 1) \cdots (n - w + d/2)}{w(w - 1) \cdots (d/2)}$$

with equality if and only if a Steiner system S(w − d/2 + 1, w, n) exists.

The interest in CW codes also comes from the problem of finding linear (or nonlinear) codes (n, M, d) of maximum size M. Obviously, A(n, d, w) is an upper bound on the number of words of a given weight in such a maximum code. Conversely, such codes (or their cosets) can give lower bounds for A(n, d, w). In particular, the stronger version of the Hamming Bound (given in the section on perfect codes) was originally proved using A(n, 2t + 2, 2t + 1).

A(n, 2t + 2, 2t + 1) is just the number of blocks in a maximum partial S(t + 1, 2t + 1, n) design or packing. If C is a t-error-correcting code, then for any c ∈ C, the number of blocks in a neighborhood packing satisfies |NS(c)| ≤ A(n, 2t + 2, 2t + 1). The number of words that are distance t + 1 from c but not distance t from any other codeword is

$$\binom{n}{t+1} - \binom{2t+1}{t+1} |NS(c)| \ge \binom{n}{t+1} - \binom{2t+1}{t+1} A(n, 2t + 2, 2t + 1).$$

Each such word is distance t + 1 from at most $\lfloor n/(t + 1) \rfloor$ other codewords. Thus, summing over all c ∈ C, each such word is counted at most this many times. This gives a stronger version of the Hamming bound:

$$|C| \left( \sum_{i=0}^{t} \binom{n}{i} + \frac{\binom{n}{t+1} - \binom{2t+1}{t+1} A(n, 2t + 2, 2t + 1)}{\lfloor n/(t + 1) \rfloor} \right) \le 2^n.$$

Constant weight codes cannot be linear, since this would mean the zero vector was in the code, but one can have

a code with all nonzero words having the same weight. These codes are sometimes referred to as linear equidistant codes. The dual of the Hamming code (also called the simplex code) is an example of such a code. In fact, it has been proved that the only such codes are formed by taking several copies of a simplex code. The proofs that all such codes are generalized simplex codes come explicitly from coding theory (Bonisoli, 1983) and also implicitly from results on designs and set systems (Teirlinck, 1980). There is a close connection between linear equidistant codes and finite geometries. The words of a simplex code correspond to the hyperplanes of projective space [over GF(2)] just as the words of weight 3 in the Hamming code correspond to lines in this projective space. [For connections between codes and finite geometries, see Blake and Mullin (1976).]

Another variation on CW codes are optical orthogonal codes (OOC), which were motivated by an application to optical CDMA communication systems. Briefly, an $(n, w, t_a, t_c)$ OOC is a CW code, C, of length n and weight w such that for any $c = (c_0, c_1, \ldots, c_{n-1}) \in C$, each y ∈ C with y ≠ c, and each $i \not\equiv 0 \pmod{n}$,

$$\sum_{j=0}^{n-1} c_j c_{j+i} \le t_a, \qquad (1)$$

and

$$\sum_{j=0}^{n-1} c_j y_{j+i} \le t_c. \qquad (2)$$

Equation (1) is the autocorrelation property, and Eq. (2) is the cross-correlation property. Most research has focused on the case where $t_a = t_c = t$, in which case we refer to an (n, w, t) OOC. Again, one can reformulate these properties in terms of (partial) designs or packings. In this case, an OOC is a collection of w-subsets of the integers (mod n), such that for subsets c, b ∈ C with b ≠ c,

$$|(c + i) \cap (c + j)| \le t_a \quad (i \ne j), \qquad (3)$$

and

$$|(c + i) \cap (b + j)| \le t_c. \qquad (4)$$

Here, c + i = {x + i (mod n) | x ∈ c}.

An OOC is equivalent to a cyclic design or packing. A code or packing is said to be cyclic if every cyclic shift of a codeword (or block) is another codeword. The set of all cyclic shifts of a codeword is said to be an orbit. A representative from that orbit is often called a base block. An (n, w, t) OOC is a set of base blocks for a cyclic (partial) S(t + 1, w, n) design or packing (assuming t < w). Conversely, given such a cyclic partial S(t + 1, w, n) design or packing, one can form an (n, w, t) OOC by taking one representative block or codeword from each orbit.

IV. MAXIMUM DISTANCE SEPARABLE CODES

For any linear code C, recall that the minimum distance equals the minimum weight of any nonzero codeword. Also, if C has dimension k, then $C^\perp$ has dimension n − k and any parity check matrix H of C has rank n − k. If c ∈ C is a codeword of minimum weight, wt(c) = d, then $Hc^T = 0$ implies that some d columns of H are dependent, but no d − 1 columns are dependent. Since H has rank n − k, every n − k + 1 columns of H are dependent. Thus,

d ≤ n − k + 1.

This is known as the Singleton Bound, and any code meeting equality in this bound is known as a maximum distance separable code (or MDS code).

There are no interesting binary MDS codes, but there are such codes over other alphabets, for example, the Reed-Solomon codes over an alphabet of order 256 used in CD encoding (Reed-Solomon codes are described below). Even though such codes are treated mathematically as codes with 256 different “digits,” each still has an implementation as a binary code, since each of the digits in the finite field GF($2^8$) can be represented by a binary word of length 8; that is, by one byte. So the first step in encoding the binary string representing all the music onto a CD is to divide it into bytes and to regard each such byte as a field element in GF($2^8$).

We now consider a code $C \subseteq F^n$ as a set of codewords over the alphabet F, where F is typically the set of elements of a finite field. There are several equivalent definitions for a linear code C of length n and dimension k to be an MDS code:

1. C has minimum distance d = n − k + 1.
2. Every set of k columns of G, the generating matrix for C, is linearly independent.
3. Every set of n − k columns of H, the parity check matrix for C, is linearly independent.

Note that from item (3) above C is MDS if and only if $C^\perp$ is MDS.

If one arranges the codewords of C in a |C| × n array, then from item (2) this array will have the property that for any choice of k columns (or coordinates) and any word of length k, $w \in F^k$, there will be exactly one row of this array that has w in these coordinates. An orthogonal array is defined to be a $q^k \times n$ array with entries from a set F, |F| = q, with precisely this property: restricting the array to any k columns, every word $w \in F^k$ occurs exactly once in this $q^k \times k$ subarray. Two rows of an orthogonal array can agree in at most k − 1 coordinates, which means that they must disagree in at least n − (k − 1) coordinates. Thus, the distance between any two rows of an orthogonal

array is at least d = n − k + 1. Obviously, the row vectors of an orthogonal array are also codewords of an MDS code, except that orthogonal arrays (and MDS codes) do not need to be linear and exist over arbitrary alphabets.

FIGURE 1 Convolutional code encoding.

Orthogonal arrays were also introduced in the design of experiments in statistical studies and are closely related to designs and finite geometries. In fact, the construction for Reed-Solomon codes was first published by Bush as a construction of orthogonal arrays (see Bush, 1952; Colbourn and Dinitz, 1996).
There are various representations of the Reed-Solomon codes. We present the most perspicuous one. Again, let F denote a finite field and let $F_k[x]$ denote the space of all polynomials of degree less than k with coefficients from F. Choose n > k different (nonzero) field elements $\alpha_1, \alpha_2, \ldots, \alpha_n \in F$. For each polynomial $f(x) \in F_k[x]$, form the valuation vector $c_f = (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_n))$. Define

$$C = \{ c_f \mid f(x) \in F_k[x] \}.$$

First, we note that C is a linear code, since $c_f + c_g = c_{f+g}$. Second, for any two different polynomials $f(x), g(x) \in F_k[x]$, we have $c_f \ne c_g$; if $c_f$ and $c_g$ were equal, then the polynomial f(x) − g(x) would have n roots but degree < k (< n). This means that $|F_k[x]| = |F^k| = q^k = |C|$, and thus, C has dimension k and length n. Finally, since $f(\alpha_i) = 0$ if and only if $\alpha_i$ is a root of f(x), and moreover any nonzero polynomial of degree ≤ k − 1 has at most k − 1 roots, $c_f$ has at most k − 1 zeros and at least n − k + 1 nonzero entries. Therefore, the minimum distance for C is n − k + 1 and C is an MDS code.

Reed-Solomon codes also have a representation as a cyclic code and a relatively efficient decoding algorithm (see Hoffman et al., 1991, for example).

V. CONVOLUTIONAL CODES

Convolutional codes are practical codes, having been adopted for use by both NASA and the European Space Agency. In fact, they encode messages twice: first using a Reed-Solomon code, then the resulting codeword is encoded using a convolutional code.

Convolutional codes are infinite length codes that are both linear and cyclic. The messages to be considered are strung together into a stream of bits which form a sin-

Algebraically, one can represent the registers whose contents are added to form $c_i$ by the polynomial $g_i = g_{i,0} x^0 + g_{i,1} x + \cdots + g_{i,\ell} x^\ell$, where $g_{i,j} = 1$ if the contents in register $R_j$ are used when forming $c_i$, and is 0 otherwise. Then by writing the message $m = m_0 m_1 m_2 \ldots$ in polynomial form $m(x) = m_0 + m_1 x + m_2 x^2 + \cdots$, it turns out that $c_i(x) = m(x) g_i(x)$. This representation of convolutional code encoding makes it clear why the code is cyclic and linear.

Convolutional codes can also be represented graphically. Assuming one message bit is moved into the shift register at each tick, the vertices of the directed graph are the $2^\ell$ binary words $(r_0, r_1, \ldots, r_{\ell-1})$ that can make up the contents of the first $\ell$ registers $R_0, R_1, \ldots, R_{\ell-1}$, and there is a directed edge from $r = (r_0, r_1, \ldots, r_{\ell-1})$ to $s = (s_0, s_1, \ldots, s_{\ell-1})$ if and only if $s_i = r_{i-1}$ for $1 \le i \le \ell - 1$. So there is a directed edge from r to s if in one tick the contents of the first $\ell$ registers can change from r into s. Furthermore, the directed edge from $(m_{t+\ell}, m_{t+\ell-1}, \ldots, m_t)$ to $(m_{t+\ell+1}, m_{t+\ell}, \ldots, m_{t+1})$ is labeled with the t-th bits of $c_1, c_2, \ldots, c_\mu$; so this label is the contribution to c made at the t-th tick. This directed graph is essentially the transition diagram of the finite state machine formed by the shift register.

Representing a convolutional code with this directed graph D is helpful in understanding both encoding and decoding. Codewords simply correspond to walks in D with the codeword being the concatenation of the labels on the edges in the walk. If message bits are moved into the shift register one at a time, then there are exactly two arcs directed out of each vertex (corresponding to a message bit of a 0 or a 1 being moved into $R_0$), so the walk can head off in one of two possible directions. It is this observation that makes it clear what is involved in decoding. At each tick, decoding one message bit requires deciding which of the two directions to take from
gle message m that is encoded by feeding m into a shift the current vertex (state). This decision cannot be based
register (see Fig. 1). Initially, µ codewords are formed: on knowing the entire received word, since it has arbri-
for 1 ≤ i ≤ µ, and for each tick t ≥ 0, the contents of cer- trarily long length. Instead, one gathers the next τ ticks
tain registers are added together to form the t th bit in the worth of the received word, which together form, say,
output of the codeword ci = ci,0 , ci,1 , ci,2 , . . . . The sin- w , then finds which walk W emanating from the current
gle codeword c to which m is encoded is then formed by state most closely matches w , then finally makes a decod-
c = c1,0 c2,0 , . . . , cµ,0 c1,1 c2,1 , . . . , cµ,1 , . . . . ing decision to take one step along W . This process can

be efficiently implemented by using the Viterbi Decoder. Deciding on how large to make $\tau$ can affect the codeword to which $w$ is decoded (see Hoffman et al., 1991, for example).

In deciding which convolutional code to use, choices have to be made about $g_1(x), \ldots, g_\mu(x)$ and the number $k$ of message symbols to move into the shift register at each tick, usually chosen to be 1. The rate of the code is then $k/\mu$.

SEE ALSO THE FOLLOWING ARTICLES

COMMUNICATION SATELLITE SYSTEMS • DATABASES • DISCRETE MATHEMATICS AND COMBINATORICS • WIRELESS COMMUNICATIONS

BIBLIOGRAPHY

Beth, Th., Jungnickel, D., and Lenz, H. (1999, 2000). “Design Theory,” Vols. 1 and 2, Cambridge Univ. Press, Cambridge, UK.
Blake, I. F., and Mullin, R. C. (1976). “The Mathematical Theory of Coding Theory,” Academic Press, New York.
Bonisoli, A. (1983). “Every equidistant linear code is a sequence of dual Hamming codes,” Ars Combinatoria 18, 181–186.
Bose, R. C. (1939). “On the construction of balanced incomplete block designs,” Ann. Eugen. 9, 353–399.
Bush, K. A. (1952). “Orthogonal arrays of index unity,” Ann. Math. Stat. 23, 426–434.
Colbourn, C. J., and Dinitz, J. H., eds. (1996). “The CRC Handbook of Combinatorial Designs,” CRC Press, Boca Raton, FL.
Colbourn, C. J., and Rosa, A. (1999). “Triple Systems,” Oxford Univ. Press, Oxford.
Fisher, R. A., and Yates, F. (1938). “Statistical Tables for Biological, Agricultural and Medical Research,” Oliver & Boyd, Edinburgh.
Golay, M. J. E. (1949). “Notes on digital coding,” Proc. IEEE 37, 657.
Hamming, R. W. (1950). “Error-detecting and error-correcting codes,” Bell Syst. Tech. J. 29, 147–160.
Hanani, H. (1960). “On quadruple systems,” Canad. J. Math. 12, 145–157.
Hartman, A., and Phelps, K. T. (1992). “Steiner quadruple systems,” in “Contemporary Design Theory” (J. H. Dinitz and D. R. Stinson, eds.), Wiley, New York.
Hoffman, D. G., Leonard, D. A., Lindner, C. C., Phelps, K. T., Rodger, C. A., and Wall, J. R. (1991). “Coding Theory: The Essentials,” Dekker, New York.
Lindner, C. C., and Rodger, C. A. (1997). “Design Theory,” CRC Press, Boca Raton, FL.
van Lint, J. H. (1975). “A survey of perfect codes,” Rocky Mount. J. Math. 5, 199–224.
MacWilliams, F. J., and Sloane, N. J. A. (1977). “The Theory of Error-Correcting Codes,” North-Holland, Amsterdam.
Shannon, C. E. (1948). “A mathematical theory of communication,” Bell Syst. Tech. J. 27, 379–423 and 623–656.
Teirlinck, L. (1980). “On projective and affine hyperplanes,” J. Combinatorial Theory, Ser. A 28, 290–306.
Tietäväinen, A. (1973). “On the nonexistence of perfect codes over finite fields,” SIAM J. Appl. Math. 24, 88–96.
Yates, F. (1939). “Complex experiments,” J. R. Stat. Soc. 2, 181–247.
Yates, F. (1936). “Incomplete randomized blocks,” Ann. Eugen. 7, 121–140.
Zinov’ev, V. A., and Leont’ev, V. K. (1973). “The nonexistence of perfect codes over Galois fields,” Probl. Control Inf. Theory 2(2), 123–132.
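
As a closing illustration of the evaluation construction of Reed-Solomon codes described in this article, the following minimal Python sketch encodes polynomials by evaluating them at distinct nonzero field elements and checks the minimum distance n − k + 1 by brute force. The choice of the prime field with 7 elements, the evaluation points, and all function names are assumptions made for illustration only; they do not come from the article.

```python
from itertools import product

P = 7  # assumption: a prime, so the integers mod P form a finite field


def evaluate(coeffs, a, p=P):
    """Evaluate f(x) = coeffs[0] + coeffs[1]*x + ... at x = a (mod p), by Horner's rule."""
    acc = 0
    for coef in reversed(coeffs):
        acc = (acc * a + coef) % p
    return acc


def rs_encode(coeffs, alphas, p=P):
    """Return the valuation vector c_f = (f(alpha_1), ..., f(alpha_n))."""
    return tuple(evaluate(coeffs, a, p) for a in alphas)


def minimum_distance(k, alphas, p=P):
    """Brute-force minimum distance of the code {c_f : deg f < k}.

    The code is linear, so the minimum distance equals the smallest
    number of nonzero entries in a nonzero codeword.
    """
    best = len(alphas)
    for coeffs in product(range(p), repeat=k):
        if any(coeffs):  # skip the zero polynomial
            weight = sum(1 for symbol in rs_encode(coeffs, alphas, p) if symbol != 0)
            best = min(best, weight)
    return best


if __name__ == "__main__":
    alphas = [1, 2, 3, 4, 5, 6]  # n = 6 distinct nonzero elements of GF(7)
    k = 3
    print(rs_encode((1, 1, 0), alphas))   # codeword of f(x) = 1 + x
    print(minimum_distance(k, alphas))    # compare with n - k + 1
```

For these hypothetical parameters (n = 6, k = 3) the brute-force search agrees with the bound n − k + 1 derived in the text; the search is exponential in k and is meant only as a check on small cases.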
P1: GNH Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN004E-172 June 8, 2001 17:16

Differential Equations, Ordinary

Anthony N. Michel
University of Notre Dame

I. Introduction
II. Initial-Value Problems
III. Fundamental Theory
IV. Linear Systems
V. Stability

GLOSSARY

Equilibrium or rest position For the system of equations $x' = f(t, x)$, any point $x_e$ such that $f(t, x_e) = 0$ for all $t$ is an equilibrium point or a rest position.
Fundamental matrix If $\{\phi_1, \ldots, \phi_n\}$ denotes a set of linearly independent solutions for the equation $x' = A(t)x$, then the matrix $\Phi = [\phi_1, \ldots, \phi_n]$ is a fundamental matrix of $x' = A(t)x$.
Initial-value problem The system of ordinary differential equations $x' = f(t, x)$ along with the initial data $x(\tau) = \xi$ is an initial-value problem, where $\tau$ denotes initial time and $\xi$ denotes the initial condition or the initial state.
nth-Order ordinary differential equations Equation of the form $y^{(n)} = h(t, y, y^{(1)}, \ldots, y^{(n-1)})$ is an nth-order ordinary differential equation, where $y^{(i)}$ denotes the ith derivative of $y$ with respect to $t$.
Qualitative theory of ordinary differential equations Study of families of solutions of ordinary differential equations, such as, for example, the (stability) properties of solutions near an equilibrium point.
Solution of an initial-value problem n-Vector-valued function $\phi$ is a solution of an initial-value problem if $\phi$ satisfies the equation $x' = f(t, x)$ and if $\phi(\tau) = \xi$.
State transition matrix For the initial-value problem $x' = A(t)x$, $x(\tau) = \xi$, the matrix given by $\Phi(t, \tau) = \Phi(t)\Phi^{-1}(\tau)$ is the state transition matrix, where $\Phi$ denotes a fundamental matrix for $x' = A(t)x$ and $\Phi^{-1}$ denotes the inverse of $\Phi$; the unique solution of the initial-value problem is then $\phi(t, \tau, \xi) = \Phi(t, \tau)\xi$.
System of linear homogeneous differential equations System of equations given by $x' = A(t)x$ where $x$ is an n vector and $A(t)$ denotes an $n \times n$ (time-varying) matrix is a linear homogeneous system of ordinary differential equations.
System of ordinary differential equations The system of equations given by $x' = f(t, x)$ where $x$ is an n vector, $t$ is real (time), $f$ is a function, and $x'$ denotes differentiation of $x$ with respect to $t$ is a system of n ordinary differential equations of the first order.

EQUATIONS containing the derivatives or differentials of one or more dependent variables, with respect to one or more independent variables, are called differential equations. If such equations contain only ordinary derivatives of one or more dependent variables, with respect to a single independent variable, then one speaks of ordinary differential equations. Equations involving the partial


derivatives of one or more dependent variables of two or more independent variables are called partial differential equations.

I. INTRODUCTION

In what follows, we will concern ourselves only with ordinary differential equations. The study of such equations may be divided into qualitative theory and quantitative theory (e.g., the numerical solution of differential equations). We will concern ourselves almost exclusively with a qualitative theory for such equations.

Ordinary differential equations, which were first considered in the seventeenth century by Leibniz and Newton, arise in nearly all disciplines of science (physics, chemistry, biology, and the like) and engineering (aerospace, chemical, civil, electrical, mechanical, nuclear, and so forth) as well as in economics and societal systems. It is not an overstatement to say that a very great deal of applied mathematics involves in some way the study of differential equations.

The study of ordinary differential equations can be pursued at levels ranging from fairly elementary and down-to-earth to rather advanced and abstract. In the current treatment we will follow a path between these two extremes. The study of ordinary differential equations at a very basic level requires as a prerequisite some knowledge of elementary calculus. At an intermediate level, the study of such equations demands some background in real variables and linear algebra, while at the advanced level, the study of differential equations may involve facts from measure and integration theory as well as functional analysis.

II. INITIAL-VALUE PROBLEMS

A. First-Order Ordinary Differential Equations

We let $D$ denote a domain (i.e., an open, non-empty, and connected set) in the $R^{n+1}$ space. We call $R^{n+1}$ the $(t, x)$ space, and we denote elements of $R^{n+1}$ by $(t, x_1, \ldots, x_n) = (t, x)$ and elements of $R^n$ by $(x_1, \ldots, x_n) = x$. Next we consider $n$ functions $f_i$, $i = 1, \ldots, n$, which map $D$ into the real numbers $R$. To express this, we write $f_i: D \to R$. We assume that each $f_i$ is continuous at all points in $D$ and we express this by writing $f_i \in C(D)$. Finally, we let $x_i^{(n)}$ denote the nth derivative of $x_i$ with respect to $t$ (provided that it exists) (i.e., $d^n x_i/dt^n = x_i^{(n)}$). In particular, when $n = 1$, we frequently write

$$x_i^{(1)} = x_i' = dx_i/dt$$

We call the system of equations given by:

$$x_i' = f_i(t, x_1, \ldots, x_n), \quad i = 1, \ldots, n \tag{E$_i$}$$

a system of n ordinary differential equations of the first order. By a solution of the system of ordinary differential equations (E$_i$) we shall mean $n$ continuously differentiable functions $\phi_1, \ldots, \phi_n$ defined on an interval $J = (a, b)$ (recall that $(a, b)$ is the set of all $t$ in $R$ with the property $a < t < b$) such that $(t, \phi_1(t), \ldots, \phi_n(t)) \in D$ for all $t \in J$ and such that

$$\phi_i'(t) = f_i(t, \phi_1(t), \ldots, \phi_n(t)), \quad i = 1, \ldots, n$$

for all $t \in J$.

Next, we let $(\tau, \xi_1, \ldots, \xi_n) \in D$. Then the initial-value problem associated with (E$_i$) is given by:

$$x_i' = f_i(t, x_1, \ldots, x_n), \quad x_i(\tau) = \xi_i, \quad i = 1, \ldots, n \tag{I$_i$}$$

A set of functions $(\phi_1, \ldots, \phi_n)$ is a solution of (I$_i$) if $(\phi_1, \ldots, \phi_n)$ is a solution of (E$_i$) on some interval $J$ containing $\tau$ and if $(\phi_1(\tau), \ldots, \phi_n(\tau)) = (\xi_1, \ldots, \xi_n)$.

In Fig. 1, the solution of a hypothetical initial-value problem is given when $n = 1$. Note that at $(\tau, \xi)$, $\phi'(\tau) = f(\tau, \phi(\tau)) = m$ is the slope of line $L$ in the figure.

FIGURE 1 Solution of an initial-value problem; t interval $J = (a, b)$, $m$ (slope of line $L$) $= f(\tau, \phi(\tau))$.

In dealing with systems of equations, we find it convenient to use vector notation, such as:

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \xi = \begin{bmatrix} \xi_1 \\ \vdots \\ \xi_n \end{bmatrix}, \quad \phi = \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix}, \quad f(t, x) = \begin{bmatrix} f_1(t, x_1, \ldots, x_n) \\ \vdots \\ f_n(t, x_1, \ldots, x_n) \end{bmatrix} = \begin{bmatrix} f_1(t, x) \\ \vdots \\ f_n(t, x) \end{bmatrix}$$

 
$$x' = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix}, \quad \int_\tau^t f(s, \phi(s))\, ds = \begin{bmatrix} \int_\tau^t f_1(s, \phi(s))\, ds \\ \vdots \\ \int_\tau^t f_n(s, \phi(s))\, ds \end{bmatrix}$$

With this notation we can express the initial-value problem (I$_i$) by:

$$x' = f(t, x), \quad x(\tau) = \xi \tag{I}$$

It is an easy matter to verify that the initial-value problem (I) can equivalently be expressed by an integral equation of the form:

$$\phi(t) = \xi + \int_\tau^t f(s, \phi(s))\, ds \tag{V}$$

where $\phi$ denotes the solution of (I).

B. Classification of Systems of First-Order Differential Equations

We are now ready to classify systems of first-order differential equations in a variety of ways:

1. If in (I), $f(t, x) \equiv f(x)$ for all $(t, x)$ in $D$, then

$$x' = f(x) \tag{A}$$

and we call (A) an autonomous system of first-order ordinary differential equations.

2. If $(t + T, x) \in D$ when $(t, x) \in D$ and if $f(t, x) = f(t + T, x)$ for all $(t, x) \in D$, then (I) assumes the form

$$x' = f(t, x) = f(t + T, x) \tag{P}$$

Such a system is called a periodic system of first-order differential equations of period $T$. The smallest $T > 0$ for which (P) is true is the least period of this system of equations.

3. When in (I), $f(t, x) = A(t)x$, where $A(t) = [a_{ij}(t)]$ is a real $n \times n$ matrix with elements $a_{ij}(t)$ which are defined and at least piecewise continuous on a t interval $J$, then we have

$$x' = A(t)x \tag{LH}$$

and we speak of a linear homogeneous system of ordinary differential equations.

4. If for (LH) $A(t)$ is defined for all real $t$ and if there is a $T > 0$ such that $A(t) = A(t + T)$ for all $t$, then we have

$$x' = A(t)x = A(t + T)x \tag{LP}$$

This system is called a linear periodic system of ordinary differential equations.

5. If in (I), $f(t, x) = A(t)x + g(t)$, where $g(t)^T = [g_1(t), \ldots, g_n(t)]$ and where $g_i: J \to R$, then we have

$$x' = A(t)x + g(t) \tag{LN}$$

In this case we speak of a linear nonhomogeneous system of ordinary differential equations.

6. If in (I), $f(t, x) = Ax$, where $A = [a_{ij}]$ is a real $n \times n$ matrix with constant coefficients, then we have

$$x' = Ax \tag{L}$$

This type of system is called a linear, autonomous, homogeneous system of ordinary differential equations.

C. nth-Order Ordinary Differential Equations

Thus far we have concerned ourselves with systems of first-order ordinary differential equations. It is also possible to characterize initial-value problems by means of nth-order ordinary differential equations. To this end, we let $h$ be a real function which is defined and continuous on a domain $D$ of the real $(t, y_1, \ldots, y_n)$ space. Then

$$y^{(n)} = h(t, y, y^{(1)}, \ldots, y^{(n-1)}) \tag{E$_n$}$$

is an nth-order ordinary differential equation. A solution of (E$_n$) is a real function $\phi$ which is defined on a t interval $J = (a, b)$, which has $n$ continuous derivatives on $J$, and which satisfies $(t, \phi(t), \ldots, \phi^{(n-1)}(t)) \in D$ for all $t \in J$ and

$$\phi^{(n)}(t) = h(t, \phi(t), \ldots, \phi^{(n-1)}(t))$$

for all $t \in J$.

Now for a given $(\tau, \xi_1, \ldots, \xi_n) \in D$, the initial-value problem for (E$_n$) is

$$y^{(n)} = h(t, y, y^{(1)}, \ldots, y^{(n-1)}), \quad y(\tau) = \xi_1, \ldots, y^{(n-1)}(\tau) = \xi_n \tag{I$_n$}$$

A function $\phi$ is a solution of (I$_n$) if $\phi$ is a solution of Eq. (E$_n$) on some interval containing $\tau$ and if $\phi(\tau) = \xi_1, \ldots, \phi^{(n-1)}(\tau) = \xi_n$.

As in the case of systems of first-order equations, we single out several special cases.

1. Consider equations of the form

$$y^{(n)} + a_{n-1}(t) y^{(n-1)} + \cdots + a_1(t) y^{(1)} + a_0(t) y = g(t) \tag{1}$$

where $a_{n-1}(t), \ldots, a_0(t)$ are real continuous functions defined on the interval $J$. We refer to Eq. (1) as a linear nonhomogeneous ordinary differential equation of order n.

2. If in (1) we let $g(t) \equiv 0$, then

$$y^{(n)} + a_{n-1}(t) y^{(n-1)} + \cdots + a_1(t) y^{(1)} + a_0(t) y = 0 \tag{2}$$

We call Eq. (2) a linear homogeneous ordinary differential equation of order n.

3. If in (2) we have $a_i(t) \equiv a_i$, $i = 0, 1, \ldots, n - 1$, then

$$y^{(n)} + a_{n-1} y^{(n-1)} + \cdots + a_1 y^{(1)} + a_0 y = 0 \tag{3}$$

and we call Eq. (3) a linear, autonomous, homogeneous ordinary differential equation of order n.

We can, of course, also define periodic and linear periodic ordinary differential equations of order n in the obvious way.

It turns out that the theory of nth-order ordinary differential equations reduces to the theory of a system of n first-order ordinary differential equations. To this end, we let $y = x_1, y^{(1)} = x_2, \ldots, y^{(n-1)} = x_n$ in Eq. (I$_n$). Then we have the system of first-order ordinary differential equations:

$$\begin{aligned} x_1' &= x_2 \\ x_2' &= x_3 \\ &\;\;\vdots \\ x_n' &= h(t, x_1, \ldots, x_n) \end{aligned} \tag{4}$$

which is defined for all $(t, x_1, \ldots, x_n) \in D$. Assume that the vector $\phi = (\phi_1, \ldots, \phi_n)^T$ is a solution of (4) on an interval $J$. Since $\phi_2 = \phi_1', \phi_3 = \phi_2', \ldots, \phi_n = \phi_1^{(n-1)}$, and since

$$h(t, \phi_1(t), \ldots, \phi_n(t)) = h(t, \phi_1(t), \ldots, \phi_1^{(n-1)}(t)) = \phi_1^{(n)}(t)$$

it follows that the first component $\phi_1$ of the vector $\phi$ is a solution of Eq. (E$_n$) on the interval $J$. Conversely, if $\phi_1$ is a solution of (E$_n$) on $J$, then the vector $(\phi_1, \phi_1^{(1)}, \ldots, \phi_1^{(n-1)})^T$ is clearly a solution of Eq. (4). Moreover, if $\phi_1(\tau) = \xi_1, \ldots, \phi_1^{(n-1)}(\tau) = \xi_n$, then the vector $\phi$ satisfies $\phi(\tau) = \xi = (\xi_1, \ldots, \xi_n)^T$. The converse is also true.

D. Examples of Initial-Value Problems

We conclude the present section with several representative examples.

1. Consider the second-order ordinary differential equation given by:

$$m\, d^2x/dt^2 + g(x) = 0 \tag{5}$$

where $m > 0$ is a constant and $g: R \to R$ is continuous. This equation, along with $x(0) = \xi_1$ and $x'(0) = \xi_2$ specified, constitutes an initial-value problem. If we let $x_1 = x$ and $x_2 = x'$, then Eq. (5) can equivalently be represented by a system of two first-order ordinary differential equations given by:

$$x_1' = x_2, \quad x_2' = -(1/m)\, g(x_1) \tag{6}$$

with $x_1(0) = \xi_1$ and $x_2(0) = \xi_2$.

Equation (5) can be used to describe a variety of physical phenomena. Consider, for example, the mass–spring system depicted in Fig. 2. When the system is in the rest position, then $x = 0$; otherwise, $x$ is positive in the direction of the arrow and negative in the opposite direction. The function $g(x)$ denotes the restoring force of the spring while the mass is expressed by $m$ (in a consistent set of units).

FIGURE 2 Mass–spring system.

There are several well-known special cases for Eq. (5):

(a) If $g(x) = kx$, where $k > 0$ is known as Hooke's constant, then Eq. (5) is a linear ordinary differential equation called the harmonic oscillator.
(b) If $g(x) = k(1 + a^2 x^2)x$, where $k > 0$ and $a^2 > 0$ are parameters, then Eq. (5) is a nonlinear ordinary differential equation and one refers to the resulting system as a “mass and a hard spring.”
(c) If $g(x) = k(1 - a^2 x^2)x$, where $k > 0$ and $a^2 > 0$ are parameters, then Eq. (5) is a nonlinear ordinary differential equation and one refers to the resulting system as a “mass and a soft spring.”

Alternatively, Eq. (5) can be used to describe the behavior of the pendulum shown in Fig. 3 with $\theta = x$. In this case, the restoring force $g(x)$ is specified by:

$$g(x) = m(g/l) \sin x$$

2. Using Kirchhoff's voltage and current laws, the circuit of Fig. 4 can be modeled by the linear system of equations:

$$\begin{aligned} v_1' &= -\frac{1}{C_1}\left(\frac{1}{R_1} + \frac{1}{R_2}\right) v_1 + \frac{1}{R_2 C_2}\, v_2 + \frac{v}{R_1 C_1} \\ v_2' &= -\frac{1}{C_1}\left(\frac{1}{R_1} + \frac{1}{R_2}\right) v_1 - \left(\frac{R_2}{L} - \frac{1}{R_2 C_1}\right) v_2 + \frac{R_2}{L}\, v_3 + \frac{v}{R_1 C_1} \\ v_3' &= \frac{1}{R_2 C_2}\, v_1 - \frac{1}{R_2 C_2}\, v_2 \end{aligned} \tag{7}$$

In order to complete the description of this circuit, we need to specify the initial conditions $v_1(0)$, $v_2(0)$, and $v_3(0)$.

FIGURE 3 Simple pendulum.

FIGURE 4 An example of an electric circuit.

III. FUNDAMENTAL THEORY

In this section, we address the following questions:

1. Under what conditions has the initial-value problem (I) at least one solution for a given set of initial data $(\tau, \xi)$?
2. Under what conditions has (I) exactly one solution for given $(\tau, \xi)$?
3. What is the extent of the time interval over which one or more solutions exist for (I)?
4. How do solutions for (I) behave when the initial data $(\tau, \xi)$ (or some other parameters for the differential equation) are varied?

The significance of the preceding questions is brought further to light when the following examples are considered.

1. For the initial-value problem,

$$x' = -\operatorname{sgn} x, \quad x(0) = 0, \quad t \ge 0 \tag{8}$$

where

$$\operatorname{sgn} x = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$

no continuously differentiable function $\phi$ exists which satisfies Eq. (8). Hence, no solution (as defined in Section II) exists for the present initial-value problem.

2. The initial-value problem,

$$x' = 1/(2x), \quad x(0) = 0, \quad t \ge 0 \tag{9}$$

has two solutions given by $\phi(t) = \pm t^{1/2}$ which exist for all $t \ge 0$.

3. The initial-value problem,

$$x' = 1 + x^2, \quad x(0) = 0, \quad t \ge 0 \tag{10}$$

has the unique solution given by $\phi(t) = \tan t$. This solution exists only when $0 \le t < \pi/2$, since $\phi$ is not continuously differentiable at $t = \pi/2$. In this case, we say that this solution has finite escape time.

4. The initial-value problem given by:

$$x' = ax, \quad x(\tau) = \xi \tag{11}$$

where $a$ is a fixed parameter, has a unique solution given by $\phi(t) = \phi(t, \tau, \xi) = \xi e^{a(t - \tau)}$ which exists for all real $t$. Note that the solution $\phi$ is continuous with respect to the parameters $a$, $\tau$, and $\xi$.

A. Existence of Solutions

In order to simplify our presentation, we will consider in the next few results one-dimensional initial-value problems (i.e., we will assume that for (I), $n = 1$). Later in this section we will show how these results are modified for higher dimensional systems. Thus, we have a domain $D \subset R^2$, we are given $(\tau, \xi) \in D$ and $f \in C(D)$, and we seek a solution for the initial-value problem,

$$x' = f(t, x), \quad x(\tau) = \xi \tag{I$'$}$$

In doing so, it suffices to find a solution of the integral equation:

$$\phi(t) = \xi + \int_\tau^t f(s, \phi(s))\, ds \tag{V$'$}$$

One way of solving the above problem is by considering approximations to a solution first; an ε-approximate solution for (I$'$) on an interval $J$ containing $\tau$ is a real-valued function $\phi$ which is piecewise continuously differentiable on $J$ and satisfies $\phi(\tau) = \xi$, $(t, \phi(t)) \in D$ for all $t \in J$, and which satisfies:

$$|\phi'(t) - f(t, \phi(t))| < \varepsilon$$

at all points $t$ of $J$ where $\phi'(t)$ exists.

Let us now consider a subset $S$ of $D$ defined by:

$$S = \{(t, x) \in D: |t - \tau| \le a, |x - \xi| \le b\} \subset D$$

Since $f \in C(D)$, there is an $M \ge 0$ such that $|f(t, x)| \le M$ for all $(t, x) \in S$. Now define $c = \min\{a, b/M\}$ and, depending on the size of $M$ relative to $a$, define one of the triangular regions shown either in Fig. 5a or 5b.

FIGURE 5 (a) Case $c = b/M$. (b) Case $c = a$.

It is now not too difficult to prove the following result:

If $f \in C(D)$ and if $c$ is as defined above, then for any $\varepsilon > 0$ there is an ε-approximate solution of (I$'$) on the interval $|t - \tau| \le c$.

Indeed, for a given $\varepsilon > 0$, such a solution will be of the form:

$$\phi(t) = \phi(t_j) + f(t_j, \phi(t_j))(t - t_j), \quad t_j \le t \le t_{j+1}, \quad j = 0, 1, 2, \ldots, m - 1 \tag{12}$$

when $t \ge \tau = t_0$ (the case in which $t \le \tau$ is modified in the obvious way). In Eq. (12), the choice of $m$ and of $\max |t_{j+1} - t_j|$ will depend on $\varepsilon$ but not on $t$, but in any case, we have $\sum_{j=0}^{m-1} |t_{j+1} - t_j| = c$. In Fig. 6, a typical ε-approximate solution is shown.

FIGURE 6 Typical ε-approximate solution.

Next, let us consider a monotone decreasing sequence of real numbers with limit zero, and let us denote this sequence by $\{\varepsilon_m\}$. An example of such a sequence would be the case when $\varepsilon_m = 1/m$, where $m$ denotes all positive integers greater than or equal to one. Corresponding to each $\varepsilon_m$, let us consider now an $\varepsilon_m$-approximate solution which we denote by $\phi_m$. Next, let us consider the family of $\varepsilon_m$-approximate solutions $\{\phi_m\}$, $m = 1, 2, \ldots$. This family $\{\phi_m\}$ is an example of an equicontinuous family of functions. Now, according to Ascoli's lemma, an equicontinuous family $\{\phi_m\}$, as constructed above, will contain a subsequence of functions, which we denote by $\{\phi_{m_k}\}$, which converges uniformly on the interval $[\tau - c, \tau + c]$ to a continuous function $\phi$; that is,

$$\lim_{k \to \infty} \phi_{m_k}(t) = \phi(t), \quad \text{uniformly in } t \tag{13}$$

Now it turns out that $\phi$ is continuously differentiable on the interval $(\tau - c, \tau + c)$ and that it satisfies the integral equation (V$'$) and, hence, the initial-value problem (I$'$). In other words, $\phi$ is a solution of (I$'$).

The preceding discussion gives rise to the Cauchy–Peano existence theorem:

If $f \in C(D)$ and $(\tau, \xi) \in D$, then the initial-value problem (I$'$) has a solution defined on $|t - \tau| \le c$ where $c$ is as defined in Fig. 5.

We mention in passing that a special case of Eq. (12),

$$\phi(t_{j+1}) = \phi(t_j) + f(t_j, \phi(t_j))(t_{j+1} - t_j) \tag{14}$$

is known as Euler's method of solving ordinary differential equations.

It should be noted that the above result yields only a sufficient condition. In other words, when $f$ is not continuous on the domain $D$, then the initial-value problem (I$'$) may or may not possess a solution in the sense defined in Section II.
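
The Euler recursion of Eq. (14) can be sketched in a few lines of Python. The example below is a minimal illustration, not part of the original treatment: the function names, the uniform step size c/m, and the test problem x′ = x, x(0) = 1 (whose exact solution is e^t) are all assumptions chosen for the sketch.

```python
import math


def euler_polygon(f, tau, xi, c, m):
    """Return the nodes t_j and values phi(t_j) produced by Eq. (14).

    Assumption: the m steps on [tau, tau + c] are of equal length c / m.
    Between nodes, the epsilon-approximate solution of Eq. (12) is the
    straight line joining consecutive points (t_j, phi(t_j)).
    """
    h = c / m
    ts = [tau + j * h for j in range(m + 1)]
    phis = [xi]
    for j in range(m):
        # phi(t_{j+1}) = phi(t_j) + f(t_j, phi(t_j)) * (t_{j+1} - t_j)
        phis.append(phis[j] + f(ts[j], phis[j]) * h)
    return ts, phis


def endpoint_error(m):
    """Error at t = 1 for the illustrative problem x' = x, x(0) = 1."""
    ts, phis = euler_polygon(lambda t, x: x, 0.0, 1.0, 1.0, m)
    return abs(phis[-1] - math.exp(ts[-1]))


if __name__ == "__main__":
    for m in (10, 100, 1000):
        print(m, endpoint_error(m))
```

Refining the partition (larger m) shrinks the error at the endpoint for this problem, which is consistent with the role the polygons play as ε-approximate solutions with ε tending to zero.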

The above result asserts the existence of a solution of (I$'$) “locally,” that is, only on a sufficiently short time interval (determined by $|t - \tau| \le c$). In general, this assertion cannot be changed to existence of a solution for all $t \ge \tau$ (or for all $t \le \tau$) as the following example shows:

The initial-value problem,

$$x' = x^2, \quad x(\tau) = \xi$$

has a solution $\phi(t) = \xi[1 - \xi(t - \tau)]^{-1}$ which exists forward in time for $\xi > 0$ only until $t = \tau + \xi^{-1}$.

B. Continuation of Solutions

Our next task is to determine if it is possible to extend a solution $\phi$ to a larger interval than was indicated above ($|t - \tau| \le c$). The answer to this is affirmative. To see this, suppose that $f \in C(D)$ and suppose also that $f$ is bounded on $D$. Suppose also that by some procedure, as above, it was possible to show that $\phi$ is a solution of the scalar differential equation,

$$x' = f(t, x) \tag{E$'$}$$

on an interval $J = (a, b)$. Using expression (V$'$) for $\phi$ it is an easy matter to show that the limit of $\phi(t)$ as $t$ approaches $a$ from the right exists and that the limit of $\phi(t)$ as $t$ approaches $b$ from the left exists; that is,

$$\lim_{t \to a^+} \phi(t) = \phi(a^+)$$

and

$$\lim_{t \to b^-} \phi(t) = \phi(b^-)$$

Now clearly, if the point $(a, \phi(a^+)) \in D$ (resp., if $(b, \phi(b^-)) \in D$), then by repeating the procedure given in the above results (ε-approximate solution result and Cauchy–Peano theorem), the solution $\phi$ can be continued to the left past the point $t = a$ (resp., to the right past the point $t = b$). Indeed, it should be clear that repeated applications of these procedures will make it possible to continue the solution $\phi$ to the boundary of $D$. This is depicted in Fig. 7. It is worthwhile to note that the solution $\phi$ in this figure exists over the interval $J$ and not over the interval $\tilde{J}$.

FIGURE 7 Continuation of a solution $\phi$ to $\partial D$.

We summarize the preceding discussion in the following continuation result:

If $f \in C(D)$ and if $f$ is bounded on $D$, then all solutions of (E$'$) can be extended to the boundary of $D$. These solutions are then noncontinuable.

C. Uniqueness of Solutions

Next, we address the question of uniqueness of solutions. To accomplish this, we require the following concept: $f \in C(D)$ is said to satisfy a Lipschitz condition on $D$ (with respect to $x$) with Lipschitz constant $L$ if

$$|f(t, \bar{\bar{x}}) - f(t, \bar{x})| \le L|\bar{\bar{x}} - \bar{x}| \tag{15}$$

for all $(t, \bar{x})$, $(t, \bar{\bar{x}})$ in $D$. The function $f$ is said to be Lipschitz continuous in $x$ on $D$ in this case.

For example, it can be shown that if $\partial f(t, x)/\partial x$ exists and is continuous on $D$, then $f$ will be Lipschitz continuous on any compact and convex subset $D_0$ of $D$.

In order to establish a uniqueness result for solutions of the initial-value problem (I$'$), we will also require a result known as the Gronwall inequality: Let $r$ and $k$ be continuous nonnegative real functions defined on an interval $[a, b]$ and let $\delta \ge 0$ be a constant. If

$$r(t) \le \delta + \int_a^t k(s) r(s)\, ds \tag{16}$$

then

$$r(t) \le \delta \exp\left(\int_a^t k(s)\, ds\right) \tag{17}$$

Now suppose that for (I$'$) the Cauchy–Peano theorem holds and suppose that for one given $(\tau, \xi) \in D$, two solutions $\phi_1$ and $\phi_2$ exist over some interval $|t - \tau| \le d$, $d > 0$. On the interval $\tau \le t \le \tau + d$ we now have, using (V$'$) to express $\phi_1$ and $\phi_2$,

$$\phi_1(t) - \phi_2(t) = \int_\tau^t [f(s, \phi_1(s)) - f(s, \phi_2(s))]\, ds \tag{18}$$

Now if, in addition, $f$ is Lipschitz continuous in $x$, then Eq. (18) yields:

$$|\phi_1(t) - \phi_2(t)| \le \int_\tau^t L|\phi_1(s) - \phi_2(s)|\, ds$$

Hence, by the Gronwall inequality with $\delta = 0$, it must be true that $\phi_1(t) = \phi_2(t)$ on $\tau \le t \le \tau + d$. A similar argument will also work for the interval $\tau - d \le t \le \tau$.

Summarizing, we have the following uniqueness result:

If $f \in C(D)$ and if $f$ satisfies a Lipschitz condition on $D$ with Lipschitz constant $L$, then the initial-value problem (I$'$) has at most one solution on any interval $|t - \tau| \le d$, $d > 0$.

If the solution $\phi$ of (I$'$) is unique, then the ε-approximate solutions constructed before will tend to $\phi$ as $\varepsilon \to 0^+$, and this is the basis for justifying Euler's method, a numerical method of constructing approximations to $\phi$. Now, if we assume that $f$ satisfies a Lipschitz condition, an alternative classical method of approximation is the method of successive approximations. Specifically, let $f \in C(D)$ and let $S$ be the rectangle in $D$ centered at $(\tau, \xi)$ shown in Fig. 5 and let $c$ be defined as in Fig. 5. Successive approximations for (I$'$), or equivalently for (V$'$), are defined as:

$$\phi_0(t) = \xi, \quad \phi_{m+1}(t) = \xi + \int_\tau^t f(s, \phi_m(s))\, ds, \quad m = 0, 1, 2, \ldots \tag{19}$$

for $|t - \tau| \le c$.

The following result is the basis for justifying the method of successive approximations:

If $f \in C(D)$ and if $f$ is Lipschitz continuous on $S$ with constant $L$, then the successive approximations $\phi_m$, $m = 0, 1, 2, \ldots$, given in Eq. (19) exist on $|t - \tau| \le c$, are continuous there, and converge uniformly, as $m \to \infty$, to the unique solution of (I$'$).

D. Continuity of Solutions with Respect to Parameters

Our next objective is to study the dependence of solutions $\phi$ of (I$'$) on initial data $(\tau, \xi)$. In this connection, we find it advantageous to highlight this dependence by writing $\phi(t) = \phi(t, \tau, \xi)$.

Now suppose that $f \in C(D)$ and suppose that $f$ satisfies a Lipschitz condition on $D$ with Lipschitz constant $L$. Furthermore, suppose that $\phi$ and $\psi$ solve:

$$x' = f(t, x) \tag{E$'$}$$

on an interval $|t - \tau| \le d$ with $\psi(\tau) = \xi_0$ and $\phi(\tau) = \xi$. Then, by using (V$'$), we obtain for $\tau \le t \le \tau + d$:

$$|\phi(t) - \psi(t)| \le |\xi - \xi_0| + \int_\tau^t L|\phi(s) - \psi(s)|\, ds$$

and by using the Gronwall inequality with $\delta = |\xi - \xi_0|$, $k(s) \equiv L$, and $r(t) = |\phi(t) - \psi(t)|$, we obtain the estimate:

$$|\phi(t, \tau, \xi) - \psi(t, \tau, \xi_0)| \le |\xi - \xi_0| \exp(L|t - \tau|), \quad |t - \tau| \le d \tag{20}$$

If, in particular, we consider a sequence of initial conditions $\{\xi_m\}$ having the property that $\xi_m \to \xi_0$ as $m \to \infty$, then it follows from Eq. (20) that $\phi(t, \tau, \xi_m) \to \phi(t, \tau, \xi_0)$, uniformly in $t$ on $|t - \tau| \le d$.

Summarizing, we have the following continuous dependence result:

Let $f \in C(D)$ and assume that $f$ satisfies a Lipschitz condition on $D$. Then, the unique solution $\phi(t, \tau, \xi)$ of (I$'$), existing on some bounded interval containing $\tau$, depends continuously on $\xi$, uniformly in $t$.

This means that if $\xi_m \to \xi_0$ then $\phi(t, \tau, \xi_m) \to \phi(t, \tau, \xi_0)$, uniformly in $t$ on $|t - \tau| \le d$ for some $d > 0$.

In a similar manner we can show that $\phi(t, \tau, \xi)$ will depend continuously on the initial time $\tau$. Furthermore, if the differential equation (E$'$) depends on a parameter, say $\mu$, then the solutions of the corresponding initial-value problem may also depend in a continuous manner on $\mu$, provided that certain safeguards are present. We consider a specific case in the following.

Consider the initial-value problem,

$$x' = f(t, x, \mu), \quad x(\tau) = \xi_\mu = \mu + 1 \tag{I$_\mu$}$$

where $\mu$ is a scalar parameter. Let $f$ satisfy Lipschitz conditions with respect to $x$ and $\mu$ for $(t, x) \in D$ and for $|\mu - \mu_0| < c$. Using an argument similar to the one employed in connection with Eq. (20), we can show that the solution $\phi(t, \tau, \xi_\mu, \mu)$ of (I$_\mu$), where $\xi_\mu$ depends continuously on $\mu$, is a continuous function of $\mu$ (i.e., as $\mu \to \mu_0$, $\xi_\mu \to \xi_{\mu_0}$, and $\phi(t, \tau, \xi_\mu, \mu) \to \phi(t, \tau, \xi_{\mu_0}, \mu_0)$).

As an example, consider the initial-value problem

$$x' = x + \mu t, \quad x(\tau) = \xi_\mu = \mu + 1 \tag{21}$$

The right-hand side of Eq. (21) has a Lipschitz constant with respect to $x$ equal to one and with respect to $\mu$ equal to $\max\{|a|, |b|\}$, where $J = (a, b)$ is assumed to be a bounded t-interval. The solution of (I$_\mu$) is

$$\phi(t, \tau, \xi_\mu, \mu) = [\mu(\tau + 2) + 1]e^{(t - \tau)} - \mu(t + 1) \tag{22}$$

At $t = \tau$, we have $\phi(\tau, \tau, \xi_\mu, \mu) = \mu + 1 = x(\tau)$. Now what happens when $\mu \to 0$? In this case, Eq. (21) becomes:

$$x' = x, \quad x(\tau) = 1 \tag{23}$$
while the solution of Eq. (23) is

$$\phi(t) = e^{t-\tau} \tag{24}$$

If we let µ → 0 in Eq. (22), then we also obtain:

$$\phi(t, \tau, \xi_0, 0) = e^{t-\tau} \tag{25}$$

as expected.

E. Systems of Equations

In the interests of simplicity, we have considered thus far in this section the one-dimensional initial-value problem (I). It turns out that the preceding results can be restated and proved for initial-value problems (I) involving systems of equations (E). In doing so, one must replace absolute values of scalars by norms of vectors or matrices, convergence of scalars by convergence of vectors and matrices, and so forth.

Rather than go through the task of restating the preceding results for systems of equations, we present as an illustration an additional result. Specifically, consider the linear system of equations

$$x' = A(t)x + g(t) = f(t, x) \tag{LN}$$

where $A(t) = [a_{ij}(t)]$ is an n × n matrix and g(t) is an n-vector. We assume that $a_{ij}(t)$, i, j = 1, . . . , n, and $g_i(t)$, i = 1, . . . , n, are real and continuous functions defined on an interval J.

By making use of the taxicab norm given by

$$|x| = \sum_{i=1}^{n} |x_i|$$

it is an easy matter to show that for t on any compact subinterval $J_0$ of J there exists $L_0 \ge 0$ such that

$$|f(t, \bar{x}) - f(t, \bar{\bar{x}})| \le \left( \sum_{i=1}^{n} \max_{1 \le j \le n} |a_{ij}(t)| \right) |\bar{x} - \bar{\bar{x}}| \le L_0\, |\bar{x} - \bar{\bar{x}}| \tag{26}$$

If we now invoke the results of the present section (rephrased for systems of equations), we obtain the following:

Suppose that A(t) and g(t) in (LN) are defined and continuous on an interval J. Then for any τ in J and any ξ in $R^n$, Eq. (LN) has a unique solution satisfying x(τ) = ξ. This solution exists on the entire interval J and is continuous in (t, τ, ξ). If A and g depend continuously on parameters $\lambda \in R^l$, then the solution will also vary continuously with λ.

It is emphasized that the above result is a global result, since the solution φ exists over the entire interval J. On the other hand, as noted before, our earlier results in this section will in general be of a local nature.

F. Differentiability of Solutions with Respect to Parameters

In some of the preceding results we investigated the continuity of solutions with respect to parameters. Next, we address the question of differentiability of solutions with respect to parameters for the initial-value problem,

$$x' = f(t, x), \qquad x(\tau) = \xi \tag{I}$$

Again we assume that f ∈ C(D), $(\tau, \xi) \in D \subset R^{n+1}$, where D is a domain. In addition, we assume that f(t, x) is differentiable with respect to $x_1, \ldots, x_n$ and we form the Jacobian matrix $f_x(t, x)$ given by:

$$f_x(t, x) = \frac{\partial f}{\partial x}(t, x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1}(t, x) & \dfrac{\partial f_1}{\partial x_2}(t, x) & \cdots & \dfrac{\partial f_1}{\partial x_n}(t, x) \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial x_1}(t, x) & \dfrac{\partial f_n}{\partial x_2}(t, x) & \cdots & \dfrac{\partial f_n}{\partial x_n}(t, x) \end{bmatrix} \tag{27}$$

Under the above conditions, it can be shown that when $f_x$ exists and is continuous, then the solution φ of (I) depends smoothly on the parameters of the problem. More specifically,

Let f ∈ C(D), let $f_x$ exist, and let $f_x \in C(D)$. If φ(t, τ, ξ) is the solution of (E) such that φ(τ, τ, ξ) = ξ, then φ is continuously differentiable in (t, τ, ξ). Each vector-valued function $\partial\phi/\partial\xi_i$ or $\partial\phi/\partial\tau$ will solve:

$$y' = f_x(t, \phi(t, \tau, \xi))\, y$$

as a function of t while

$$\frac{\partial \phi}{\partial \tau}(\tau, \tau, \xi) = -f(\tau, \xi)$$

and

$$\frac{\partial \phi}{\partial \xi}(\tau, \tau, \xi) = E_n$$

where $E_n$ denotes the n × n identity matrix.

G. Comparison Theory

We conclude this section by touching on the comparison theory for differential equations. This theory is useful in continuation of solutions, in establishing estimates on bounds of solutions, and, as we will see later, in stability theory. Before we can present some of the main results of this theory we need to introduce a few concepts.
Once again we consider the scalar initial-value problem,

$$x' = f(t, x), \qquad x(\tau) = \xi \tag{I}$$

where f ∈ C(D) and (τ, ξ) ∈ D. For (I) we define the maximal solution $\phi_M$ as that noncontinuable solution of (I) having the property that if φ is any other solution of (I), then $\phi_M(t) \ge \phi(t)$ as long as both solutions are defined. The minimal solution $\phi_m$ of (I) is defined in a similar manner. It is not too difficult to prove that $\phi_M$ and $\phi_m$ actually exist.

In what follows, we also need the concept of upper right Dini derivative $D^+x$. Given x: (α, β) → R and x ∈ C(α, β) (i.e., x is a continuous real-valued function defined on the interval (α, β)), we define:

$$D^+x(t) = \limsup_{h \to 0^+} \frac{x(t + h) - x(t)}{h} = \overline{\lim_{h \to 0^+}}\ \frac{x(t + h) - x(t)}{h}$$

where lim sup denotes the limit supremum. The lower right Dini derivative $D^-x$ is defined similarly, replacing the lim sup by the lim inf.

We will also require the concept of differential inequalities, such as, for example, the inequality,

$$D^+x(t) \le f(t, x(t)) \quad \text{on } D \tag{28}$$

Any function φ satisfying Eq. (28) is called a solution for (28). Differential inequalities involving $D^-x$ are defined similarly.

Our first comparison result is now as follows:

Suppose that the maximal solution $\phi_M$ of (I) stays in D for all t ∈ [τ, T]. If a continuous function ψ(t) with ψ(τ) = ξ satisfies

$$\psi'(t) = D^+\psi(t) \le f(t, \psi(t)) \quad \text{on } D$$

then it is true that

$$\psi(t) \le \phi_M(t) \quad \text{for all } t \in [\tau, T]$$

A similar result involving minimal solutions can also be established.

The above result can now be applied to systems of equations to obtain estimates for the norms of solutions. We have the following:

Let f ∈ C(D), $D \subset R^{n+1}$, and let φ be a solution of

$$x' = f(t, x), \qquad x(\tau) = \xi \tag{I}$$

Let F(t, v) be a scalar-valued continuous function such that

$$|f(t, x)| \le F(t, |x|) \quad \text{for all } (t, x) \in D$$

where |f(t, x)| denotes any one of the equivalent norms of f(t, x) on $R^n$. If η ≥ |φ(τ)| and if $v_M$ denotes the maximal solution of the scalar comparison equation given by:

$$v' = F(t, v), \qquad v(\tau) = \eta \tag{29}$$

then

$$|\phi(t)| \le v_M(t)$$

for as long as both functions exist. (Here, |φ(t)| denotes the norm of φ(t).)

As an application of this result, suppose that f(t, x) in (E) is such that

$$|f(t, x)| \le A|x| + B$$

for all t ∈ J and for all $x \in R^n$, where $J = [t_0, T]$, and where A > 0, B > 0 are parameters. Then Eq. (29) assumes the form:

$$v' = Av + B, \qquad v(\tau) = \eta$$

Solving this linear scalar equation, we now have

$$v_M(t) = e^{A(t-\tau)}(\eta + B/A) - B/A$$

Since $v_M(t)$ exists for all t ∈ J, then so do the solutions of (E). Also, if φ is any solution of (E) with |φ(τ)| ≤ η, then the estimate,

$$|\phi(t)| \le e^{A(t-\tau)}(\eta + B/A) - B/A$$

is true.

IV. LINEAR SYSTEMS

Both in the theory of differential equations and in their applications, linear systems of ordinary differential equations are extremely important. In this section we first present the general properties of linear systems. We then turn our attention to the special cases of linear systems of ordinary differential equations with constant coefficients and linear systems of ordinary differential equations with periodic coefficients. We also address some of the properties of nth-order linear ordinary differential equations.

A. Linear Homogeneous and Nonhomogeneous Systems

We first consider linear homogeneous systems,

$$x' = A(t)x \tag{LH}$$

As noted in Section III, this system possesses unique solutions for every (τ, ξ) ∈ D where x(τ) = ξ,

$$D = \{(t, x):\ t \in J = (a, b),\ x \in R^n\ (\text{or } x \in C^n)\}$$

when each element $a_{ij}(t)$ of matrix A(t) is continuous over J. These solutions exist over the entire interval J = (a, b)
and they depend continuously on the initial conditions. In applications it is typical that J = (−∞, ∞). We note that φ(t) ≡ 0, for all t ∈ J, is a solution of (LH), with φ(τ) = 0. This is called the trivial solution of (LH).

In this section we consider matrices and vectors which will be either real or complex valued. In the former case, the field of scalars for the x space is the field of real numbers (F = R) and in the latter case, the field for the x space is the field of complex numbers (F = C).

Now let V denote the set of all solutions of (LH) on J; let $\alpha_1, \alpha_2$ be scalars (i.e., $\alpha_1, \alpha_2 \in F$); and let $\phi_1, \phi_2$ be solutions of (LH) (i.e., $\phi_1, \phi_2 \in V$). Then it is easily verified that $\alpha_1\phi_1 + \alpha_2\phi_2$ will also be a solution of (LH) (i.e., $\alpha_1\phi_1 + \alpha_2\phi_2 \in V$). We have thus shown that V is a vector space. Now if we choose n linearly independent vectors $\xi_1, \ldots, \xi_n$ in the n-dimensional x-space, then there exist n solutions $\phi_1, \ldots, \phi_n$ of (LH) such that $\phi_1(\tau) = \xi_1, \ldots, \phi_n(\tau) = \xi_n$. It is an easy matter to verify that this set of solutions $\{\phi_1, \ldots, \phi_n\}$ is linearly independent and that it spans V. Thus, $\{\phi_1, \ldots, \phi_n\}$ is a basis of V and any solution φ can be expressed as a linear combination of the vectors $\phi_1, \ldots, \phi_n$.

Summarizing, we have determined:

The set of solutions of (LH) on the interval J forms an n-dimensional vector space.

In view of the above result it now makes sense to define a fundamental set of solutions for (LH) as a set of n linearly independent solutions of (LH) on J. If $\{\phi_1, \ldots, \phi_n\}$ is such a set, then we can form the matrix:

$$\Phi = [\phi_1, \ldots, \phi_n] \tag{30}$$

which is called a fundamental matrix of (LH).

In the following, we enumerate some of the important properties of fundamental matrices. All of these properties are direct consequences of definition (30) and of the properties of solutions of (LH). We have

1. A fundamental matrix Φ of (LH) satisfies the matrix equation:

$$X' = A(t)X \tag{31}$$

where $X = [x_{ij}]$ denotes an n × n matrix. (Observe that Eq. (31) consists of a system of $n^2$ first-order ordinary differential equations.)

2. If Φ is a solution of the matrix equation (31) on an interval J and if τ is any point of J, then

$$\det \Phi(t) = \det \Phi(\tau)\, \exp\left[ \int_\tau^t \operatorname{tr} A(s)\, ds \right] \quad \text{for every } t \in J$$

(Here, det Φ denotes the determinant of Φ and tr A denotes the trace of the matrix A.) This result is known as Abel's formula.

3. A solution Φ of matrix equation (31) is a fundamental matrix of (LH) if and only if its determinant is nonzero for all t ∈ J. (This result is a direct consequence of Abel's formula.)

4. If Φ is a fundamental matrix of (LH) and if C is any nonsingular constant n × n matrix, then ΦC is also a fundamental matrix of (LH). Moreover, if Ψ is any other fundamental matrix of (LH), then there exists a constant n × n nonsingular matrix P such that Ψ = ΦP.

In the following, we let $\{e_1, e_2, \ldots, e_n\}$ denote the set of vectors $e_1^T = (1, 0, \ldots, 0)$, $e_2^T = (0, 1, 0, \ldots, 0)$, . . . , $e_n^T = (0, \ldots, 0, 1)$. We call a fundamental matrix Φ of (LH) whose columns are determined by the linearly independent solutions $\phi_1, \ldots, \phi_n$ with

$$\phi_1(\tau) = e_1, \ldots, \phi_n(\tau) = e_n, \qquad \tau \in J$$

the state transition matrix Φ for (LH). Equivalently, if Ψ is any fundamental matrix of (LH), then the matrix Φ determined by:

$$\Phi(t, \tau) = \Psi(t)\Psi^{-1}(\tau) \quad \text{for all } t, \tau \in J$$

is said to be the state transition matrix of (LH).

We now enumerate several properties of state transition matrices. All of these are direct consequences of the definition of state transition matrix and of the properties of fundamental matrices. In the following we let τ ∈ J, we let φ(τ) = ξ, and we let Φ(t, τ) denote the state transition matrix for (LH) for all t ∈ J. Then,

1. Φ(t, τ) is the unique solution of the matrix equation:

$$\frac{\partial}{\partial t}\Phi(t, \tau) = \Phi'(t, \tau) = A(t)\Phi(t, \tau)$$

with Φ(τ, τ) = E, the n × n identity matrix.

2. Φ(t, τ) is nonsingular for all t ∈ J.

3. For any t, σ, τ ∈ J, we have

$$\Phi(t, \tau) = \Phi(t, \sigma)\Phi(\sigma, \tau)$$

4. $[\Phi(t, \tau)]^{-1} = \Phi^{-1}(t, \tau) = \Phi(\tau, t)$ for all t, τ ∈ J.

5. The unique solution φ(t, τ, ξ) of (LH) with φ(τ, τ, ξ) = ξ specified, is given by:

$$\phi(t, \tau, \xi) = \Phi(t, \tau)\xi \quad \text{for all } t \in J \tag{32}$$

In engineering and physics applications, φ(t) is interpreted as representing the "state" of a (dynamical) system represented by (LH) at time t and φ(τ) = ξ is interpreted as representing the "state" at time τ. In Eq. (32), Φ(t, τ)
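relates the "states" of (LH) at t and τ. This motivated the name "state transition matrix."

Let us now consider a couple of specific examples.

1. For the system of equations

$$\begin{aligned} x_1' &= 5x_1 - 2x_2 \\ x_2' &= 4x_1 - x_2 \end{aligned} \tag{33}$$

we have

$$A(t) \equiv A = \begin{bmatrix} 5 & -2 \\ 4 & -1 \end{bmatrix} \quad \text{for all } t \in (-\infty, \infty)$$

Two linearly independent solutions for Eq. (33) are

$$\phi_1(t) = \begin{bmatrix} e^{3t} \\ e^{3t} \end{bmatrix}, \qquad \phi_2(t) = \begin{bmatrix} e^{t} \\ 2e^{t} \end{bmatrix}$$

The matrix

$$\Phi(t) = \begin{bmatrix} e^{3t} & e^{t} \\ e^{3t} & 2e^{t} \end{bmatrix}$$

satisfies the equation Φ′ = AΦ and

$$\det \Phi(t) = e^{4t} \ne 0 \quad \text{for all } t \in (-\infty, \infty)$$

Thus, Φ is a fundamental matrix for Eq. (33). Also, in view of Abel's formula, we obtain:

$$\det \Phi(t) = \det \Phi(\tau)\, \exp\left[ \int_\tau^t \operatorname{tr} A(s)\, ds \right] = e^{4\tau} \exp\left[ \int_\tau^t 4\, ds \right] = e^{4t} \quad \text{for all } t \in (-\infty, \infty)$$

as expected. Finally, since

$$\Phi^{-1}(t) = \begin{bmatrix} 2e^{-3t} & -e^{-3t} \\ -e^{-t} & e^{-t} \end{bmatrix}$$

we obtain for the transition matrix of Eq. (33),

$$\Phi(t)\Phi^{-1}(\tau) = \begin{bmatrix} 2e^{3(t-\tau)} - e^{t-\tau} & -e^{3(t-\tau)} + e^{t-\tau} \\ 2e^{3(t-\tau)} - 2e^{t-\tau} & -e^{3(t-\tau)} + 2e^{t-\tau} \end{bmatrix}$$

2. For the system,

$$x_1' = x_2, \qquad x_2' = t x_2 \tag{34}$$

we have

$$A(t) = \begin{bmatrix} 0 & 1 \\ 0 & t \end{bmatrix} \quad \text{for all } t \in (-\infty, \infty)$$

Two linearly independent solutions of Eq. (34) are

$$\phi_1(t) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \phi_2(t) = \begin{bmatrix} \int_\tau^t e^{\eta^2/2}\, d\eta \\ e^{t^2/2} \end{bmatrix}$$

The matrix

$$\Phi(t) = \begin{bmatrix} 1 & \int_\tau^t e^{\eta^2/2}\, d\eta \\ 0 & e^{t^2/2} \end{bmatrix}$$

satisfies the matrix equation Φ′ = A(t)Φ and

$$\det \Phi(t) = e^{t^2/2} \quad \text{for all } t \in (-\infty, \infty)$$

Therefore, Φ is a fundamental matrix for Eq. (34). Also, in view of Abel's formula, we have

$$\det \Phi(t) = \det \Phi(\tau)\, \exp\left[ \int_\tau^t \operatorname{tr} A(s)\, ds \right] = e^{\tau^2/2} \exp\left[ \int_\tau^t \eta\, d\eta \right] = e^{t^2/2} \quad \text{for all } t \in (-\infty, \infty)$$

as expected. Also, since

$$\Phi^{-1}(t) = \begin{bmatrix} 1 & -e^{-t^2/2} \int_\tau^t e^{\eta^2/2}\, d\eta \\ 0 & e^{-t^2/2} \end{bmatrix}$$

we obtain for the state transition matrix of Eq. (34),

$$\Phi(t)\Phi^{-1}(\tau) = \begin{bmatrix} 1 & e^{-\tau^2/2} \int_\tau^t e^{\eta^2/2}\, d\eta \\ 0 & e^{(t^2 - \tau^2)/2} \end{bmatrix}$$

Finally, suppose that $\phi(\tau) = \xi = [1, 1]^T$. Then

$$\phi(t, \tau, \xi) = \Phi(t, \tau)\xi = \begin{bmatrix} 1 + e^{-\tau^2/2} \int_\tau^t e^{\eta^2/2}\, d\eta \\ e^{(t^2 - \tau^2)/2} \end{bmatrix}$$

Next, we consider linear nonhomogeneous systems,

$$x' = A(t)x + g(t) \tag{LN}$$

We assume that A(t) and g(t) are defined and continuous over R = (−∞, ∞) (i.e., each component $a_{ij}(t)$ of A(t) and each component $g_k(t)$ of g(t) is defined and continuous on R). As noted in Section III, system (LN) has for any τ ∈ R and any $\xi \in R^n$ a unique solution satisfying x(τ) = ξ. This solution exists on the entire real line R and is continuous in (t, τ, ξ). Furthermore, if A and g depend continuously on parameters $\lambda \in R^l$, then the solution will also vary continuously with λ. Indeed, if we differentiate the function

$$\phi(t, \tau, \xi) = \Phi(t, \tau)\xi + \int_\tau^t \Phi(t, \eta) g(\eta)\, d\eta \tag{35}$$

with respect to t to obtain φ′(t, τ, ξ), and if we substitute φ and φ′ into (LN) (for x), then it is an easy matter to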
verify that Eq. (35) is in fact the unique solution of (LN) with φ(τ, τ, ξ) = ξ.

We note that when ξ = 0, then Eq. (35) reduces to

$$\phi_p(t) = \int_\tau^t \Phi(t, \eta) g(\eta)\, d\eta \tag{36}$$

and when ξ ≠ 0 but g(t) ≡ 0, then Eq. (35) reduces to

$$\phi_h(t) = \Phi(t, \tau)\xi \tag{37}$$

Thus, the solution of (LN) may be viewed as consisting of a component due to the "forcing term" g(t) and another component due to the initial data ξ. This type of separation is in general possible only in linear systems of differential equations. We call $\phi_p$ the particular solution and $\phi_h$ the homogeneous solution of (LN).

Before proceeding to linear systems with constant coefficients, we introduce adjoint equations. Let Φ be a fundamental matrix for the linear homogeneous system (LH). Then,

$$(\Phi^{-1})' = -\Phi^{-1}\Phi'\Phi^{-1} = -\Phi^{-1}A(t)$$

Taking the conjugate transpose of both sides, we obtain:

$$(\Phi^{*-1})' = -A^*(t)\Phi^{*-1}$$

This implies that $\Phi^{*-1}$ is a fundamental matrix for the system:

$$y' = -A^*(t)y, \qquad t \in J \tag{38}$$

We call Eq. (38) the adjoint to (LH), and we call the matrix equation,

$$Y' = -A^*(t)Y, \qquad t \in J$$

the adjoint to matrix equation (31).

One of the principal properties of adjoint systems is summarized in the following result:

If Φ is a fundamental matrix for (LH), then Ψ is a fundamental matrix for its adjoint (38) if and only if

$$\Psi^*\Phi = C$$

where C is some constant nonsingular matrix.

B. Linear Systems with Constant Coefficients

We now turn our attention to linear systems with constant coefficients. For purposes of motivation, we first consider the scalar initial-value problem:

$$x' = ax, \qquad x(\tau) = \xi \tag{39}$$

It is easily verified that Eq. (39) has the solution:

$$\phi(t) = e^{a(t-\tau)}\xi \tag{40}$$

It turns out that a similar result holds for the system of linear equations with constant coefficients,

$$x' = Ax \tag{L}$$

By making use of the Weierstrass M test, it is not difficult to verify the following result:

Let A be a constant n × n matrix which may be real or complex and let $S_N(t)$ denote the partial sum of matrices defined by the formula,

$$S_N(t) = E + \sum_{k=1}^{N} \frac{t^k}{k!} A^k \tag{41}$$

where E denotes the n × n identity matrix and k! stands for k factorial. Then each element of the matrix $S_N(t)$ converges absolutely and uniformly on any finite t interval (−a, a), a > 0, as N → ∞.

This result enables us to define the matrix,

$$e^{At} = E + \sum_{k=1}^{\infty} \frac{t^k}{k!} A^k \tag{42}$$

for any −∞ < t < ∞.

It should be clear that when A(t) ≡ A, system (LH) reduces to system (L). Consequently, the results we established above for (LH) are also applicable to (L). Now by making use of these results, the definition of $e^{At}$ in Eq. (42), and the convergence properties of $S_N(t)$ in Eq. (41), it is not difficult to establish several important properties of $e^{At}$ and of (L). To this end we let J = (−∞, ∞) and τ ∈ J, and we let A be a given constant n × n matrix for (L). Then the following is true:

1. $\Phi(t) = e^{At}$ is a fundamental matrix for (L) for t ∈ J.
2. The state transition matrix for (L) is given by $\Phi(t, \tau) = e^{A(t-\tau)} = \Phi(t - \tau)$, t ∈ J.
3. $e^{At_1} e^{At_2} = e^{A(t_1 + t_2)}$ for all $t_1, t_2 \in J$.
4. $Ae^{At} = e^{At}A$ for all t ∈ J.
5. $(e^{At})^{-1} = e^{-At}$ for all t ∈ J.
6. The unique solution φ of (L) with φ(τ) = ξ is given by:

$$\phi(t, \tau, \xi) = e^{A(t-\tau)}\xi \tag{43}$$

Notice that solution (43) of (L) such that φ(τ) = ξ depends on t and τ only via the difference t − τ. This is the typical situation for general autonomous systems that satisfy uniqueness conditions. Indeed, if φ(t) is a solution of

$$x' = F(x), \qquad x(0) = \xi$$

then clearly φ(t − τ) will be a solution of

$$x' = F(x), \qquad x(\tau) = \xi$$
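The partial sums of Eq. (41) and the listed properties of the matrix exponential can be checked numerically. The following is a minimal sketch in plain Python, not a production algorithm: the helper names (`mat_mul`, `identity`, `expm_partial_sum`, `max_abs_diff`) and the truncation order N = 40 are assumptions made here for illustration. It uses the matrix A of Eq. (33), for which the transition matrix computed earlier gives a closed form for the exponential (taking τ = 0).

```python
import math

def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def expm_partial_sum(A, t, N=40):
    """Partial sum S_N(t) = E + sum_{k=1}^{N} (t^k / k!) A^k of Eq. (41)."""
    n = len(A)
    total = identity(n)
    term = identity(n)  # holds (t^k / k!) A^k, starting from k = 0
    for k in range(1, N + 1):
        term = [[(t / k) * x for x in row] for row in mat_mul(term, A)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

def max_abs_diff(X, Y):
    return max(abs(X[i][j] - Y[i][j])
               for i in range(len(X)) for j in range(len(X[0])))

# Matrix of Eq. (33); closed form is the transition matrix found earlier with tau = 0.
A = [[5.0, -2.0], [4.0, -1.0]]

def closed_form(t):
    e3, e1 = math.exp(3.0 * t), math.exp(t)
    return [[2 * e3 - e1, -e3 + e1], [2 * e3 - 2 * e1, -e3 + 2 * e1]]

t1, t2 = 0.3, 0.5
err_closed = max_abs_diff(expm_partial_sum(A, t1), closed_form(t1))
# Property 3: e^{A t1} e^{A t2} = e^{A (t1 + t2)}
err_semigroup = max_abs_diff(
    mat_mul(expm_partial_sum(A, t1), expm_partial_sum(A, t2)),
    expm_partial_sum(A, t1 + t2))
# Property 5: (e^{A t})^{-1} = e^{-A t}
err_inverse = max_abs_diff(
    mat_mul(expm_partial_sum(A, t1), expm_partial_sum(A, -t1)), identity(2))
print(err_closed, err_semigroup, err_inverse)
```

Summing the series directly is adequate for small matrices and moderate values of t; for large norms of At the terms grow before they decay, and other evaluation methods, such as the two discussed next in the text, are usually preferable.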
Next, we consider the "forced" system of equations,

$$x' = Ax + g(t) \tag{44}$$

where $g: J \to R^n$ is continuous. Clearly, Eq. (44) is a special case of (LN). In view of Eq. (35) we thus have

$$\phi(t) = e^{A(t-\tau)}\xi + e^{At} \int_\tau^t e^{-A\eta} g(\eta)\, d\eta \tag{45}$$

for the solution of Eq. (44).

Next, we address the problem of evaluating the state transition matrix. While there is no general procedure for evaluating such a matrix for a time-varying matrix A(t), there are several such procedures for determining $e^{At}$ when A(t) ≡ A. In the following, we consider two such methods.

We begin by recalling the Laplace transform. To this end, we consider a vector $f(t) = [f_1(t), \ldots, f_n(t)]^T$, where $f_i: [0, \infty) \to R$, i = 1, . . . , n. Letting s denote a complex variable, we define the Laplace transform of $f_i$ as:

$$\hat{f}_i(s) = \mathcal{L}[f_i(t)] = \int_0^\infty f_i(t) e^{-st}\, dt \tag{46}$$

provided, of course, that the integral in Eq. (46) exists. (In this case $f_i$ is said to be Laplace transformable.) Also, we define the Laplace transform of the vector f(t) by:

$$\hat{f}(s) = [\hat{f}_1(s), \ldots, \hat{f}_n(s)]^T,$$

and we define the Laplace transform of a matrix $C(t) = [c_{ij}(t)]$ similarly. Thus, if $c_{ij}: [0, \infty) \to R$ and if each $c_{ij}$ is Laplace transformable, then the Laplace transform of C(t) is defined by:

$$\hat{C}(s) = \mathcal{L}[c_{ij}(t)] = [\mathcal{L}\, c_{ij}(t)] = [\hat{c}_{ij}(s)]$$

Now consider the initial value problem,

$$x' = Ax, \qquad x(0) = \xi \tag{47}$$

Taking the Laplace transform of both sides of Eq. (47), we obtain:

$$s\hat{x}(s) - \xi = A\hat{x}(s)$$

or

$$(sE - A)\hat{x}(s) = \xi$$

or

$$\hat{x}(s) = (sE - A)^{-1}\xi \tag{48}$$

where E denotes the n × n identity matrix. It can be shown by analytic continuation that $(sE - A)^{-1}$ exists for all s, except at the eigenvalues of A (i.e., except at those values of s where the equation det(sE − A) = 0 is satisfied). Taking the inverse Laplace transform of Eq. (48) (i.e., by reversing the procedure and obtaining, for example, in Eq. (46) $f_i(t)$ from $\hat{f}_i(s)$), we obtain for the solution of Eq. (47),

$$\phi(t) = \mathcal{L}^{-1}[(sE - A)^{-1}]\xi = \Phi(t, 0)\xi = e^{At}\xi \tag{49}$$

where $\mathcal{L}^{-1}[\hat{f}(s)] = f(t)$ denotes the inverse Laplace transform of $\hat{f}(s)$. It follows from Eqs. (49) and (48) that

$$\hat{\Phi}(s) = (sE - A)^{-1}$$

and

$$\Phi(t, 0) = \Phi(t) = \mathcal{L}^{-1}[(sE - A)^{-1}] = e^{At} \tag{50}$$

Finally, note that when the initial time τ ≠ 0, we can immediately compute $\Phi(t, \tau) = \Phi(t - \tau) = e^{A(t-\tau)}$.

Next, let us consider a "forced" system of the form:

$$x' = Ax + g(t), \qquad x(0) = \xi \tag{51}$$

and let us assume that the Laplace transform of g exists. Taking the Laplace transform of both sides of Eq. (51) yields:

$$s\hat{x}(s) - \xi = A\hat{x}(s) + \hat{g}(s)$$

or

$$(sE - A)\hat{x}(s) = \xi + \hat{g}(s)$$

or

$$\hat{x}(s) = (sE - A)^{-1}\xi + (sE - A)^{-1}\hat{g}(s) = \hat{\Phi}(s)\xi + \hat{\Phi}(s)\hat{g}(s) = \hat{\phi}_h(s) + \hat{\phi}_p(s) \tag{52}$$

Taking the inverse Laplace transform of both sides of Eq. (52) and using Eq. (45), we obtain:

$$\begin{aligned} \phi(t) &= \phi_h(t) + \phi_p(t) \\ &= \mathcal{L}^{-1}[(sE - A)^{-1}]\xi + \mathcal{L}^{-1}[(sE - A)^{-1}\hat{g}(s)] \\ &= \Phi(t, 0)\xi + \int_0^t \Phi(t - \eta) g(\eta)\, d\eta \end{aligned} \tag{53}$$

Therefore,

$$\phi_p(t) = \int_0^t \Phi(t - \eta) g(\eta)\, d\eta \tag{54}$$

as expected. We call the expression in Eq. (54) the convolution of Φ and g. Clearly, convolution of Φ and g in the time domain corresponds to multiplication of $\hat{\Phi}$ and $\hat{g}$ in the s domain.

Let us now consider the specific initial-value problem,

$$\begin{aligned} x_1' &= -x_1 + x_2, & x_1(0) &= -1 \\ x_2' &= -2x_2 + u(t), & x_2(0) &= 0 \end{aligned} \tag{55}$$

where

$$u(t) = \begin{cases} 1 & \text{for } t > 0 \\ 0 & \text{elsewhere} \end{cases}$$
We have, in this case,

$$(sE - A) = \begin{bmatrix} s + 1 & -1 \\ 0 & s + 2 \end{bmatrix}$$

$$\hat{\Phi}(s) = (sE - A)^{-1} = \begin{bmatrix} \dfrac{1}{s+1} & \dfrac{1}{s+1} - \dfrac{1}{s+2} \\ 0 & \dfrac{1}{s+2} \end{bmatrix}$$

and

$$\Phi(t) = e^{At} = \mathcal{L}^{-1}[\hat{\Phi}(s)] = \begin{bmatrix} e^{-t} & (e^{-t} - e^{-2t}) \\ 0 & e^{-2t} \end{bmatrix}$$

It now follows that

$$\phi_h(t) = \begin{bmatrix} e^{-t} & (e^{-t} - e^{-2t}) \\ 0 & e^{-2t} \end{bmatrix} \begin{bmatrix} -1 \\ 0 \end{bmatrix} = \begin{bmatrix} -e^{-t} \\ 0 \end{bmatrix}$$

Also, since $\hat{u}(s) = 1/s$, we have

$$\hat{\phi}_p(s) = \begin{bmatrix} \dfrac{1}{s+1} & \dfrac{1}{s+1} - \dfrac{1}{s+2} \\ 0 & \dfrac{1}{s+2} \end{bmatrix} \begin{bmatrix} 0 \\ \dfrac{1}{s} \end{bmatrix} = \begin{bmatrix} \dfrac{1}{2}\dfrac{1}{s} + \dfrac{1}{2}\dfrac{1}{s+2} - \dfrac{1}{s+1} \\ \dfrac{1}{2}\dfrac{1}{s} - \dfrac{1}{2}\dfrac{1}{s+2} \end{bmatrix}$$

and

$$\phi_p(t) = \begin{bmatrix} \frac{1}{2} + \frac{1}{2}e^{-2t} - e^{-t} \\ \frac{1}{2} - \frac{1}{2}e^{-2t} \end{bmatrix}$$

Therefore, the solution of the initial value problem (55) is

$$\phi(t) = \phi_p(t) + \phi_h(t) = \begin{bmatrix} \frac{1}{2} - 2e^{-t} + \frac{1}{2}e^{-2t} \\ \frac{1}{2} - \frac{1}{2}e^{-2t} \end{bmatrix}$$

A second method of evaluating $e^{At}$ and of solving initial value problems for (L) and Eq. (44) involves the transformation of A into a Jordan canonical form. Specifically, it is shown in linear algebra that for every complex n × n matrix A there exists a nonsingular n × n matrix P (i.e., det P ≠ 0) such that the matrix,

$$J = P^{-1}AP \tag{56}$$

is in the canonical form,

$$J = \begin{bmatrix} J_0 & & & 0 \\ & J_1 & & \\ & & \ddots & \\ 0 & & & J_s \end{bmatrix} \tag{57}$$

where $J_0$ is a diagonal matrix with diagonal elements $\lambda_1, \ldots, \lambda_k$ (not necessarily distinct); that is,

$$J_0 = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_k \end{bmatrix} \tag{58}$$

and each $J_p$ is an $n_p \times n_p$ matrix of the form (p = 1, . . . , s):

$$J_p = \begin{bmatrix} \lambda_{k+p} & 1 & 0 & \cdots & 0 \\ 0 & \lambda_{k+p} & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{k+p} & 1 \\ 0 & 0 & \cdots & 0 & \lambda_{k+p} \end{bmatrix} \tag{59}$$

where $\lambda_{k+p}$ need not be different from $\lambda_{k+q}$ if p ≠ q and $k + n_1 + \cdots + n_s = n$. The numbers $\lambda_i$, i = 1, . . . , k + s, are the eigenvalues of A (i.e., the roots of the equation det(λE − A) = 0). If $\lambda_i$ is a simple eigenvalue of A (i.e., it is not a repeated root of det(λE − A) = 0), then it appears in the block $J_0$. The blocks $J_0, J_1, \ldots, J_s$ are called Jordan blocks and J is called the Jordan canonical form of A.

Returning to the subject at hand, we consider once more the initial value problem (47) and let P be a real n × n nonsingular matrix which transforms A into a Jordan canonical form J. Consider the transformation x = Py or, equivalently, $y = P^{-1}x$. Differentiating both sides with respect to t, we obtain:

$$y' = P^{-1}x' = P^{-1}APy = Jy, \qquad y(\tau) = P^{-1}\xi \tag{60}$$

The solution of Eq. (60) is given as

$$\psi(t) = e^{J(t-\tau)}P^{-1}\xi \tag{61}$$

Using Eq. (61) and x = Py, we obtain for the solution of Eq. (47),

$$\phi(t) = Pe^{J(t-\tau)}P^{-1}\xi \tag{62}$$

In the case in which A has n distinct eigenvalues $\lambda_1, \ldots, \lambda_n$, we can choose $P = [p_1, p_2, \ldots, p_n]$ in such a way that $p_i$ is an eigenvector corresponding to the eigenvalue $\lambda_i$, i = 1, . . . , n (i.e., $p_i \ne 0$ satisfies the equation
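$\lambda_i p_i = Ap_i$). Then the Jordan matrix $J = P^{-1}AP$ assumes the form:

$$J = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$

Using the power series representation, Eq. (42), we immediately obtain the expression:

$$e^{Jt} = \begin{bmatrix} e^{\lambda_1 t} & & 0 \\ & \ddots & \\ 0 & & e^{\lambda_n t} \end{bmatrix}$$

In this case, we have the expression for the solution of Eq. (47):

$$\phi(t) = P \begin{bmatrix} e^{\lambda_1 (t-\tau)} & & 0 \\ & \ddots & \\ 0 & & e^{\lambda_n (t-\tau)} \end{bmatrix} P^{-1}\xi$$

In the general case when A has repeated eigenvalues, we can no longer diagonalize A and we have to be content with the Jordan form given by Eq. (57). In this case, $P = [v_1, \ldots, v_n]$, where the $v_i$ denote generalized eigenvectors. Using the power series representation, Eq. (42), and the very special nature of the Jordan blocks (58) and (59), it is not difficult to show that in the case of repeated eigenvalues we have

$$e^{Jt} = \begin{bmatrix} e^{J_0 t} & & & 0 \\ & e^{J_1 t} & & \\ & & \ddots & \\ 0 & & & e^{J_s t} \end{bmatrix}, \qquad -\infty < t < \infty$$

where

$$e^{J_0 t} = \begin{bmatrix} e^{\lambda_1 t} & & 0 \\ & \ddots & \\ 0 & & e^{\lambda_k t} \end{bmatrix}$$

and

$$e^{tJ_i} = e^{\lambda_{k+i} t} \begin{bmatrix} 1 & t & \cdots & t^{n_i - 1}/(n_i - 1)! \\ 0 & 1 & \cdots & t^{n_i - 2}/(n_i - 2)! \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}, \qquad i = 1, \ldots, s$$

From Eq. (62), it now follows that the solution of Eq. (47) is given by:

$$\phi(t) = P \begin{bmatrix} e^{J_0 (t-\tau)} & & & 0 \\ & e^{J_1 (t-\tau)} & & \\ & & \ddots & \\ 0 & & & e^{J_s (t-\tau)} \end{bmatrix} P^{-1}\xi$$

As a specific example of the above procedure of determining the state transition matrix, consider the initial-value problem:

$$\begin{aligned} x_1' &= -x_1 + x_2, & x_1(0) &= 1 \\ x_2' &= -2x_2, & x_2(0) &= 2 \end{aligned}$$

In this case, we have

$$A = \begin{bmatrix} -1 & 1 \\ 0 & -2 \end{bmatrix}$$

with eigenvalues $\lambda_1 = -1$ and $\lambda_2 = -2$ and with corresponding eigenvectors,

$$p_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad p_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

We thus have

$$P = [p_1, p_2] = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}, \qquad P^{-1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

and

$$J = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 0 & -2 \end{bmatrix}$$

Furthermore, we obtain:

$$\begin{bmatrix} \phi_1(t) \\ \phi_2(t) \end{bmatrix} = Pe^{Jt}P^{-1}\xi = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} e^{-t} & 0 \\ 0 & e^{-2t} \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 3e^{-t} - 2e^{-2t} \\ 2e^{-2t} \end{bmatrix}$$

C. Linear Systems With Periodic Coefficients

Next, we consider linear homogeneous periodic systems of the form:

$$x' = A(t)x, \qquad -\infty < t < \infty \tag{P}$$

where the elements of A are continuous functions on R and where

$$A(t) = A(t + T) \tag{63}$$

for some T > 0 which is a period of A.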
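Looking back at the worked constant-coefficient example that closes Section B (eigenvalues −1 and −2, initial data ξ = [1, 2]ᵀ at t = 0), the computation of P e^{Jt} P⁻¹ ξ can be cross-checked numerically. The sketch below is only an illustration in plain Python; the function names (`mat_vec`, `phi`, `phi_closed`, `rhs`) and the finite-difference step are assumptions introduced here, not part of the article.

```python
import math

def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Data of the worked example: A = [[-1, 1], [0, -2]], x(0) = [1, 2]^T.
P = [[1.0, -1.0], [0.0, 1.0]]        # columns are the eigenvectors p1, p2
P_inv = [[1.0, 1.0], [0.0, 1.0]]
eigenvalues = [-1.0, -2.0]
xi = [1.0, 2.0]

def phi(t):
    """Solution assembled as P e^{Jt} P^{-1} xi, following Eq. (62) with tau = 0."""
    y0 = mat_vec(P_inv, xi)
    y = [math.exp(lam * t) * c for lam, c in zip(eigenvalues, y0)]
    return mat_vec(P, y)

def phi_closed(t):
    """Closed form stated in the example."""
    return [3 * math.exp(-t) - 2 * math.exp(-2 * t), 2 * math.exp(-2 * t)]

def rhs(x):
    """Right-hand side Ax of the example system."""
    return [-x[0] + x[1], -2.0 * x[1]]

t, h = 0.7, 1e-5
# Central-difference estimate of phi'(t), compared with A phi(t).
deriv = [(a - b) / (2 * h) for a, b in zip(phi(t + h), phi(t - h))]
residual = max(abs(d - r) for d, r in zip(deriv, rhs(phi(t))))
mismatch = max(abs(a - b) for a, b in zip(phi(t), phi_closed(t)))
print(phi(0.0), mismatch, residual)
```

The same pattern (change to eigenvector coordinates, scale each coordinate by its own exponential, change back) applies whenever A has n distinct eigenvalues; with repeated eigenvalues the diagonal scaling would have to be replaced by the Jordan-block exponentials given above.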
The principal result for (P) which we shall present here (called Floquet theory) involves the logarithm of a matrix, the existence of which is not too difficult to establish. We have the following:

Let B be a nonsingular n × n matrix. Then there exists an n × n matrix F, called the logarithm of B, such that

$$e^F = B$$

Using properties of the fundamental matrix as well as the concept of the logarithm of a matrix, we are in a position to prove the following fundamental result for (P):

Let Eq. (63) be true and let A(t) be continuous on (−∞, ∞). If Φ(t) is a fundamental matrix for (P), then so is Φ(t + T), −∞ < t < ∞. Moreover, corresponding to every Φ, there exists a nonsingular matrix P which is also periodic with period T and a constant matrix R, such that

$$\Phi(t) = P(t)e^{tR} \tag{64}$$

It is not difficult to show, using Eq. (64), that if the fundamental matrix Φ for (P) is known over any interval of length T, then it is automatically known for all −∞ < t < ∞. For example, if Φ(t) is known for all t over the interval $[t_0, t_0 + T]$, then we can show that

$$\Phi(t) = P(t)e^{tT^{-1}\log C} \tag{65}$$

where $C = \Phi(t_0)^{-1}\Phi(t_0 + T)$. Thus, since P(t) is periodic, Φ(t) as given in Eq. (65) will be known for all t over (−∞, ∞).

It can also be shown that even though the fundamental matrix Φ does not determine R uniquely in Eq. (64), the set of all fundamental matrices of (P), and hence of A(t), determines uniquely all quantities associated with $e^{TR}$ which are invariant under a similarity transformation. Specifically, the set of all fundamental matrices of A(t) determine a unique set of eigenvalues of the matrix $e^{TR}$, $\lambda_1, \ldots, \lambda_n$, which are called the Floquet multipliers associated with A(t). None of these vanishes since $\prod_i \lambda_i = \det e^{TR} \ne 0$. Also, the eigenvalues of R, $\rho_1, \ldots, \rho_n$, are called the characteristic exponents of A(t).

Now let us suppose that all $\rho_i$ are such that Re $\rho_i$ < 0 (i.e., the real part of each $\rho_i$ is negative) and let α be any number with $0 < \alpha < \min_i |\mathrm{Re}\, \rho_i|$. Now, if we arrange things so that R in Eq. (64) is in Jordan canonical form, then it is a simple matter to show that there exists a k > 0 such that for each component $\phi_i(t)$ of the solution φ(t),

$$|\phi_i(t)| \le ke^{-\alpha t} \quad \text{for all } t \ge 0 \tag{66}$$

and $|\phi_i(t)| \to 0$ as t → ∞. In other words, if the eigenvalues $\rho_i$, i = 1, . . . , n, of R have negative real parts, then the norm of any solution of (P) tends to zero as t → ∞ at an exponential rate.

Finally, by using Eq. (64) we can write:

$$P(t) = \Phi(t)e^{-tR} \tag{67}$$

which in turn can be used to see that AP − P′ = PR. Thus, for the transformation,

$$x = P(t)y \tag{68}$$

we compute

$$x' = A(t)x = A(t)P(t)y = P'(t)y + P(t)y' = (P(t)y)'$$

or

$$y' = P^{-1}(t)[A(t)P(t) - P'(t)]y = P^{-1}(t)(P(t)R)y = Ry$$

This shows that transformation (68) reduces the linear, homogeneous, periodic system (P) to

$$y' = Ry \tag{69}$$

a linear homogeneous system with constant coefficients. Since P(t) is nonsingular, we are thus able to deduce the properties of the solutions of (P) from those of Eq. (69), provided of course that we can determine the matrix P(t).

D. Linear nth-Order Ordinary Differential Equations

We conclude this section by considering some of the more important aspects of linear nth-order ordinary differential equations. We shall consider equations of the form,

$$y^{(n)} + a_{n-1}(t)y^{(n-1)} + \cdots + a_1(t)y^{(1)} + a_0(t)y = b(t) \tag{70}$$

$$y^{(n)} + a_{n-1}(t)y^{(n-1)} + \cdots + a_1(t)y^{(1)} + a_0(t)y = 0 \tag{71}$$

and

$$y^{(n)} + a_{n-1}y^{(n-1)} + \cdots + a_1 y^{(1)} + a_0 y = 0 \tag{72}$$

In Eqs. (70) and (71) the functions $a_k(t)$ and b(t), k = 0, . . . , n − 1, are continuous on some appropriate time interval J. If we define the differential operator $L_n$ by:

$$L_n = \frac{d^n}{dt^n} + a_{n-1}(t)\frac{d^{n-1}}{dt^{n-1}} + \cdots + a_1(t)\frac{d}{dt} + a_0(t) \tag{73}$$

then we can rewrite Eqs. (70) and (71) more compactly as

$$L_n y = b(t) \tag{74}$$
and

$$L_n y = 0 \tag{75}$$

respectively. We can rewrite Eq. (72) similarly by defining a differential operator $L$ in the obvious way.

Following the procedure in Section II, we can reduce the study of Eq. (71) to the study of the system of $n$ first-order ordinary differential equations,

$$x' = A(t)x \tag{LH}$$

where $A(t)$ is the companion matrix given by:

$$A(t) = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -a_0(t) & -a_1(t) & -a_2(t) & \cdots & -a_{n-1}(t) \end{bmatrix} \tag{76}$$

Since $A(t)$ is continuous on $J$, we know from Section III that there exists a unique solution $\phi(t)$, for all $t \in J$, to the initial-value problem,

$$x' = A(t)x, \quad x(\tau) = \xi, \quad \tau \in J$$

where $\xi = (\xi_1, \ldots, \xi_n)^T \in R^n$. The first component of this solution is a solution of $L_n y = 0$ satisfying $y(\tau) = \xi_1$, $y'(\tau) = \xi_2, \ldots, y^{(n-1)}(\tau) = \xi_n$.

Now let $\phi_1, \ldots, \phi_n$ be $n$ solutions of Eq. (75). Then we can easily show that the matrix,

$$\Phi(t) = \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_n \\ \phi_1' & \phi_2' & \cdots & \phi_n' \\ \vdots & \vdots & & \vdots \\ \phi_1^{(n-1)} & \phi_2^{(n-1)} & \cdots & \phi_n^{(n-1)} \end{bmatrix}$$

is a solution of the matrix equation,

$$X' = A(t)X \tag{77}$$

where $A(t)$ is defined by Eq. (76). We call the determinant of $\Phi$ the Wronskian for Eq. (75) with respect to the solutions $\phi_1, \ldots, \phi_n$ and we denote it by:

$$W(\phi_1, \ldots, \phi_n) = \det \Phi(t)$$

Note that $W(\phi_1, \ldots, \phi_n)(t)$ depends on $t \in J$. Since $\Phi$ is a solution of matrix equation (77), by Abel's formula it follows that for any $\tau \in J$ and for any $t \in J$,

$$W(\phi_1, \ldots, \phi_n)(t) = \det \Phi(\tau) \exp\left[\int_\tau^t \operatorname{tr} A(s)\, ds\right] = W(\phi_1, \ldots, \phi_n)(\tau) \exp\left[-\int_\tau^t a_{n-1}(s)\, ds\right] \tag{78}$$

As an example, consider the second-order differential equation

$$t^2 y'' + t y' - y = 0, \quad 0 < t < \infty$$

which can be written equivalently as:

$$y'' + (1/t)y' - (1/t^2)y = 0, \quad 0 < t < \infty \tag{79}$$

The functions $\phi_1(t) = t$ and $\phi_2(t) = 1/t$ are clearly solutions of Eq. (79). We now form the matrix,

$$\Phi(t) = \begin{bmatrix} \phi_1 & \phi_2 \\ \phi_1' & \phi_2' \end{bmatrix} = \begin{bmatrix} t & 1/t \\ 1 & -1/t^2 \end{bmatrix}$$

which yields the Wronskian

$$W(\phi_1, \phi_2)(t) = \det \Phi(t) = -2/t, \quad t > 0$$

In the notation of Eq. (76) we have $a_1(t) = 1/t$, $a_0(t) = -1/t^2$, and thus $\operatorname{tr} A(s) = -a_1(s) = -1/s$. In view of Eq. (78), we have for any $\tau > 0$,

$$W(\phi_1, \phi_2)(t) = \det \Phi(t) = W(\phi_1, \phi_2)(\tau) \exp\left[-\int_\tau^t a_1(s)\, ds\right] = -(2/\tau)e^{\ln(\tau/t)} = -2/t, \quad t > 0$$

as expected.

Similarly as in the case of systems of equations, we can prove the following result for $n$th-order differential equations:

A set of $n$ solutions of Eq. (75), $\phi_1, \ldots, \phi_n$, is linearly independent on $J$ if and only if $W(\phi_1, \ldots, \phi_n)(t) \neq 0$ for all $t \in J$. Moreover, every solution of Eq. (75) is a linear combination of any set of $n$ linearly independent solutions.

The above result enables us to make the following definition:

A set of $n$ linearly independent solutions of Eq. (75) on $J$, $\phi_1, \ldots, \phi_n$, is called a fundamental set of solutions for Eq. (75).

Next, we turn our attention to nonhomogeneous linear $n$th-order ordinary differential equations of the form (70). As shown in Section II, the study of Eq. (70) reduces to the study of the system of $n$ first-order ordinary differential equations,

$$x' = A(t)x + g(t) \tag{80}$$

where $A(t)$ is given by Eq. (76) and $g(t) = [0, \ldots, 0, b(t)]^T$. Recall that for given $\tau \in J$ and given $x(\tau) = \xi \in R^n$, Eq. (80) has a unique solution given by $\phi = \phi_h + \phi_p$, where $\phi_h(t) = \Phi(t, \tau)\xi$ is a solution of (LH),

$\Phi(t, \tau)$ denotes the state transition matrix of $A(t)$, and $\phi_p$ is a particular solution of Eq. (80), given by:

$$\phi_p(t) = \int_\tau^t \Phi(t, s) g(s)\, ds = \Phi(t) \int_\tau^t \Phi^{-1}(s) g(s)\, ds$$

We now specialize this result from the $n$-dimensional system (80) to the corresponding $n$th-order equation (70) to obtain the following result:

If $\{\phi_1, \ldots, \phi_n\}$ is a fundamental set for the equation $L_n y = 0$, then the unique solution $\psi$ of the equation $L_n y = b(t)$ satisfying $\psi(\tau) = \xi_1, \ldots, \psi^{(n-1)}(\tau) = \xi_n$ is given by:

$$\psi(t) = \psi_h(t) + \psi_p(t) = \psi_h(t) + \sum_{k=1}^{n} \phi_k(t) \int_\tau^t \frac{W_k(\phi_1, \ldots, \phi_n)(s)}{W(\phi_1, \ldots, \phi_n)(s)}\, b(s)\, ds \tag{81}$$

Here, $\psi_h$ is the solution of $L_n y = 0$ such that $\psi_h(\tau) = \xi_1$, $\psi_h'(\tau) = \xi_2, \ldots, \psi_h^{(n-1)}(\tau) = \xi_n$, and $W_k(\phi_1, \ldots, \phi_n)(t)$ is obtained from $W(\phi_1, \ldots, \phi_n)(t)$ by replacing the $k$th column in $W(\phi_1, \ldots, \phi_n)(t)$ by $(0, \ldots, 0, 1)^T$.

We apply the above result to the second-order differential equation,

$$y'' + (1/t)y' - (y/t^2) = b(t), \quad 0 < t < \infty$$

where $b(t)$ is a real continuous function for all $t > 0$. From the example involving Eq. (79) we have $\phi_1(t) = t$, $\phi_2(t) = 1/t$, and $W(\phi_1, \phi_2)(t) = -2/t$, $t > 0$. Also,

$$W_1(\phi_1, \phi_2)(t) = \det \begin{bmatrix} 0 & 1/t \\ 1 & -1/t^2 \end{bmatrix} = -\frac{1}{t}, \qquad W_2(\phi_1, \phi_2)(t) = \det \begin{bmatrix} t & 0 \\ 1 & 1 \end{bmatrix} = t$$

From Eq. (81) we now have

$$\psi(t) = \psi_h(t) + \psi_p(t) = \psi_h(t) + \frac{t}{2} \int_\tau^t b(s)\, ds - \frac{1}{2t} \int_\tau^t s^2 b(s)\, ds$$

Next, we consider $n$th-order ordinary differential equations with constant coefficients given by Eq. (72), which can equivalently be written as $L_n y = 0$, where

$$L_n = \frac{d^n}{dt^n} + a_{n-1}\frac{d^{n-1}}{dt^{n-1}} + \cdots + a_1\frac{d}{dt} + a_0$$

We assume that $J = (-\infty, \infty)$. We call

$$p(\lambda) = \lambda^n + a_{n-1}\lambda^{n-1} + \cdots + a_1\lambda + a_0 \tag{82}$$

the characteristic polynomial of the differential equation (72), and we call

$$p(\lambda) = 0 \tag{83}$$

the characteristic equation of Eq. (72). The roots of $p(\lambda)$ are called the characteristic roots of Eq. (72).

We see that the study of Eq. (72) reduces to the study of the system of first-order ordinary differential equations with constant coefficients given by $x' = Ax$, where

$$A = \begin{bmatrix} 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 \\ -a_0 & -a_1 & -a_2 & -a_3 & \cdots & -a_{n-1} \end{bmatrix} \tag{84}$$

The following result, which is proved in a straightforward manner, connects Eq. (72) and $x' = Ax$ with $A$ given by Eq. (84):

The characteristic polynomial of $A$ in Eq. (84) is precisely the characteristic polynomial $p(\lambda)$ given by Eq. (82), that is,

$$p(\lambda) = \det(\lambda E_n - A)$$

The next result enumerates a fundamental set for Eq. (72):

Let $\lambda_1, \ldots, \lambda_s$ be the distinct roots of the characteristic equation (83) and suppose that $\lambda_i$ has multiplicity $m_i$, $i = 1, \ldots, s$, with $\sum_{i=1}^{s} m_i = n$. Then the following set of functions is a fundamental set for Eq. (72):

$$t^k e^{\lambda_i t}, \quad k = 0, 1, \ldots, m_i - 1, \quad i = 1, \ldots, s \tag{85}$$

As a specific example, consider:

$$p(\lambda) = (\lambda - 2)(\lambda + 3)^2(\lambda + i)(\lambda - i)(\lambda - 4)^4 \tag{86}$$

Then, $n = 9$, and $\{e^{2t}, e^{-3t}, te^{-3t}, e^{-it}, e^{+it}, e^{4t}, te^{4t}, t^2e^{4t}, t^3e^{4t}\}$ is a fundamental set for the differential equation corresponding to the characteristic polynomial (86).

We conclude this section by considering adjoint equations. Corresponding to the operator $L_n$ given in Eq. (73), we define a second linear operator $L_n^+$ of order $n$, which we call the adjoint of $L_n$, as follows. The domain of $L_n^+$ is the set of all continuous functions defined on $J$ such that $[\bar{a}_j(t)y(t)]$ has $j$ continuous derivatives on $J$. (Here, $\bar{a}_j(t)$ denotes the complex conjugate of $a_j(t)$.) For each function $y$, define:

$$L_n^+ y = (-1)^n y^{(n)} + (-1)^{n-1}(\bar{a}_{n-1} y)^{(n-1)} + \cdots + (-1)(\bar{a}_1 y)' + \bar{a}_0 y$$
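As a check on the constant-coefficient results above, the following sketch verifies in exact arithmetic that the companion matrix (84) built from the polynomial (86) has that same polynomial as its characteristic polynomial $\det(\lambda E_n - A)$. The helper names (`poly_mul`, `companion`, `mat_mul`, `char_poly`) are illustrative assumptions, and the Faddeev-LeVerrier recursion is simply one convenient way to compute a characteristic polynomial; it is not a method used in the text.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, highest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def companion(monic):
    """Companion matrix of Eq. (84) for monic = [1, a_{n-1}, ..., a_1, a_0]."""
    n = len(monic) - 1
    A = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i + 1] = Fraction(1)  # ones on the superdiagonal
    # Last row is (-a_0, -a_1, ..., -a_{n-1}).
    A[n - 1] = [Fraction(-a) for a in monic[:0:-1]]
    return A


def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly(A):
    """Coefficients of det(lambda*E_n - A), highest degree first.

    Uses the Faddeev-LeVerrier recursion with exact rational arithmetic.
    """
    n = len(A)
    M = [[Fraction(0)] * n for _ in range(n)]
    coeffs = [Fraction(1)]
    for k in range(1, n + 1):
        AM = mat_mul(A, M)
        M = [[AM[i][j] + (coeffs[-1] if i == j else 0) for j in range(n)]
             for i in range(n)]
        AMk = mat_mul(A, M)
        coeffs.append(-sum(AMk[i][i] for i in range(n)) / k)
    return coeffs


# p(lambda) of Eq. (86); note that (lambda + i)(lambda - i) = lambda^2 + 1.
p = [1]
for factor in [[1, -2], [1, 3], [1, 3], [1, 0, 1]] + [[1, -4]] * 4:
    p = poly_mul(p, factor)

A = companion(p)
print(len(A), char_poly(A) == p)
```

Working with `Fraction` entries avoids the loss of accuracy that floating-point eigenvalue routines typically show for repeated roots such as the fourfold root $\lambda = 4$.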

The equation,

$$L_n^+ y = 0, \quad t \in J$$

is called the adjoint equation to $L_n y = 0$.

When Eq. (75) is written in companion form (LH) with $A(t)$ given by Eq. (76), then the adjoint system is $z' = -A^*(t)z$, where

$$A^*(t) = \begin{bmatrix} 0 & 0 & \cdots & 0 & -\bar{a}_0(t) \\ 1 & 0 & \cdots & 0 & -\bar{a}_1(t) \\ 0 & 1 & \cdots & 0 & -\bar{a}_2(t) \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & -\bar{a}_{n-1}(t) \end{bmatrix}$$

This adjoint system can be written in component form as:

$$z_1' = \bar{a}_0(t) z_n, \qquad z_j' = -z_{j-1} + \bar{a}_{j-1}(t) z_n, \quad 2 \le j \le n \tag{87}$$

If $\psi = [\psi_1, \psi_2, \ldots, \psi_n]^T$ is a solution of Eq. (87) and if $\bar{a}_j \psi_n$ has $j$ derivatives, then

$$\psi_n' - (\bar{a}_{n-1}\psi_n) = -\psi_{n-1}$$

and

$$\psi_n'' - (\bar{a}_{n-1}\psi_n)' = -\psi_{n-1}' = \psi_{n-2} - (\bar{a}_{n-2}\psi_n)$$

or

$$\psi_n'' - (\bar{a}_{n-1}\psi_n)' + (\bar{a}_{n-2}\psi_n) = \psi_{n-2}$$

Continuing in this manner, we see that $\psi_n$ solves $L_n^+ \psi = 0$.

V. STABILITY

Since there are no general rules for determining explicit formulas for the solutions of systems of ordinary differential equations (E), the analysis of initial-value problems (I) is accomplished along two lines: (1) a quantitative approach is used which usually involves the numerical solution of such problems by means of simulations on a digital computer, and (2) a qualitative approach is used which is usually concerned with the behavior of families of solutions of a given differential equation and which usually does not seek specific explicit solutions. As mentioned in Section I, we will concern ourselves primarily with qualitative aspects of ordinary differential equations.

The principal results of the qualitative approach include stability properties of an equilibrium point (rest position) and the boundedness of solutions of ordinary differential equations. We shall consider these topics in the present section.

A. The Concept of an Equilibrium Point

We shall concern ourselves with systems of equations,

$$x' = f(t, x) \tag{E}$$

where $x \in R^n$. When discussing global results, such as global asymptotic stability, we shall always assume that $f: R^+ \times R^n \to R^n$. On the other hand, when considering local results, we shall usually assume that $f: R^+ \times B(h) \to R^n$ for some $h > 0$, where $R^+ = [0, \infty)$, $B(h) = \{x \in R^n : |x| < h\}$ and $|\cdot|$ is any one of the equivalent norms on $R^n$. On some occasions we assume that $t \in R = (-\infty, \infty)$ rather than $t \in R^+$. Unless otherwise stated, we assume that for every $(t_0, \xi)$, $t_0 \in R^+$, the initial-value problem,

$$x' = f(t, x), \quad x(t_0) = \xi \tag{I}$$

possesses a unique solution $\phi(t, t_0, \xi)$ which is defined for all $t \ge t_0$ and which depends continuously on the initial data $(t_0, \xi)$. Since it is natural in this section to think of $t$ as representing time, we shall use the symbol $t_0$ in (I) to represent the initial time (rather than $\tau$ as was done earlier). Furthermore, we shall frequently use the symbol $x_0$ in place of $\xi$ to represent the initial state.

A point $x_e \in R^n$ is called an equilibrium point of (E) (at time $t^* \in R^+$) if $f(t, x_e) = 0$ for all $t \ge t^*$. Other terms for equilibrium point include stationary point, singular point, critical point, and rest position.

We note that if $x_e$ is an equilibrium point of (E) at $t^*$, then it is an equilibrium point at all $\tau \ge t^*$. Note also that in the case of autonomous systems,

$$x' = f(x) \tag{A}$$

and in the case of $T$-periodic systems,

$$x' = f(t, x), \quad f(t, x) = f(t + T, x) \tag{P}$$

a point $x_e \in R^n$ is an equilibrium at some $t^*$ if and only if it is an equilibrium point at all times. Also note that if $x_e$ is an equilibrium (at $t^*$) of (E), then the transformation $s = t - t^*$ reduces (E) to $dx/ds = f(s + t^*, x)$ and $x_e$ is an equilibrium (at $s = 0$) of this system. For this reason, we shall henceforth assume that $t^* = 0$ in the above definition and we shall not mention $t^*$ further. Note also that if $x_e$ is an equilibrium point of (E), then for any $t_0 \ge 0$, $\phi(t, t_0, x_e) = x_e$ for all $t \ge t_0$ (i.e., $x_e$ is the unique solution of (E) with initial data given by $\phi(t_0, t_0, x_e) = x_e$).

As a specific example, consider the simple pendulum introduced in Section II, which is described by equations of the form,

$$x_1' = x_2, \qquad x_2' = k \sin x_1, \quad k > 0 \tag{88}$$

Physically, the pendulum has two equilibrium points. One of these is located as shown in Fig. 8a and the second point is located as shown in Fig. 8b. However, the model of this pendulum, described by Eq. (88), has countably infinitely many equilibrium points located in $R^2$ at the points $(\pi n, 0)$, $n = 0, \pm 1, \pm 2, \ldots$.

FIGURE 8 (a) Stable and (b) unstable equilibria of the simple pendulum.

An equilibrium point $x_e$ of (E) is called an isolated equilibrium point if there is an $r > 0$ such that $B(x_e, r) \subset R^n$ contains no equilibrium points of (E) other than $x_e$ itself. (Here, $B(x_e, r) = \{x \in R^n : |x - x_e| < r\}$.)

All equilibrium points of Eq. (88) are isolated equilibria in $R^2$. On the other hand, for the system,

$$x_1' = -a x_1 + b x_1 x_2, \qquad x_2' = -b x_1 x_2 \tag{89}$$

where $a > 0$, $b > 0$ are constants, every point on the positive $x_2$ axis is an equilibrium point for Eq. (89).

It should be noted that there are systems with no equilibrium points at all, as is the case, for example, in the system,

$$x_1' = 2 + \sin(x_1 + x_2) + x_1, \qquad x_2' = 2 + \sin(x_1 + x_2) - x_1 \tag{90}$$

Many important classes of systems possess only one equilibrium point. For example, the linear homogeneous system,

$$x' = A(t)x \tag{LH}$$

has a unique equilibrium at the origin if $A(t_0)$ is nonsingular for all $t_0 \ge 0$. Also, the system,

$$x' = f(x) \tag{A}$$

where $f$ is assumed to be continuously differentiable with respect to all of its arguments and where

$$J(x_e) = \left.\frac{\partial f}{\partial x}(x)\right|_{x = x_e}$$

denotes the $n \times n$ Jacobian matrix defined by $\partial f/\partial x = [\partial f_i/\partial x_j]$, has an isolated equilibrium at $x_e$ if $f(x_e) = 0$ and $J(x_e)$ is nonsingular.

Unless otherwise stated, we shall assume throughout this section that a given equilibrium point is an isolated equilibrium. Also, we shall assume, unless otherwise stated, that in a given discussion, the equilibrium of interest is located at the origin of $R^n$. This assumption can be made without any loss of generality. To see this, assume that $x_e \neq 0$ is an equilibrium point of (E) (i.e., $f(t, x_e) = 0$ for all $t \ge 0$). Let $w = x - x_e$. Then $w = 0$ is an equilibrium of the transformed system,

$$w' = F(t, w) \tag{91}$$

where

$$F(t, w) = f(t, w + x_e) \tag{92}$$

Since Eq. (92) establishes a one-to-one correspondence between the solutions of (E) and Eq. (91), we may assume henceforth that (E) possesses the equilibrium of interest located at the origin. The equilibrium $x = 0$ will sometimes be referred to as the trivial solution of (E).

B. Definitions of Stability and Boundedness

We now state several definitions of stability of an equilibrium point, in the sense of Lyapunov.

The equilibrium $x = 0$ of (E) is stable if for every $\varepsilon > 0$ and any $t_0 \in R^+$ there exists a $\delta(\varepsilon, t_0) > 0$ such that:

$$|\phi(t, t_0, \xi)| < \varepsilon \quad \text{for all } t \ge t_0 \tag{93}$$

whenever

$$|\xi| < \delta(\varepsilon, t_0) \tag{94}$$

It is an easy matter to show that if the equilibrium $x = 0$ satisfies Eq. (93) for a single $t_0$ when Eq. (94) is true, then it also satisfies this condition at every initial time $t_0' > t_0$. Hence, in the preceding definition it suffices to consider a single value of the initial time $t_0$ in Eqs. (93) and (94).

Suppose that the initial-value problem (I) has a unique solution $\phi$ defined for $t$ on an interval $J$ containing $t_0$. By the motion through $(t_0, \xi = x(t_0))$ we mean the set $\{(t, \phi(t)) : t \in J\}$. This is, of course, the graph of the function $\phi$. By the trajectory or orbit through $(t_0, \xi = x(t_0))$ we mean the set $C(x(t_0)) = \{\phi(t) : t \in J\}$. The positive semitrajectory (or positive semiorbit) is defined as $C^+(x(t_0)) = \{\phi(t) : t \in J \text{ and } t \ge t_0\}$. Also, the negative semitrajectory (or negative semiorbit) is defined as $C^-(x(t_0)) = \{\phi(t) : t \in J \text{ and } t \le t_0\}$.
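Returning to the isolated-equilibrium criterion stated earlier for autonomous systems ($f(x_e) = 0$ with $J(x_e)$ nonsingular), the following minimal sketch applies it to the pendulum model (88), with the right-hand side taken as printed, and to system (89). The constants `K`, `A_`, `B_` and all function names are hypothetical choices made only for this illustration. Since the criterion is sufficient rather than necessary, a failed test does not by itself prove that an equilibrium is non-isolated; for system (89) the text establishes that separately.

```python
import math

K = 1.0            # hypothetical pendulum constant k > 0
A_, B_ = 1.0, 1.0  # hypothetical constants a > 0, b > 0 for Eq. (89)


def pendulum(x):
    """Right-hand side of Eq. (88) as printed."""
    x1, x2 = x
    return (x2, K * math.sin(x1))


def pendulum_jac(x):
    """Jacobian of the pendulum right-hand side."""
    x1, _ = x
    return ((0.0, 1.0), (K * math.cos(x1), 0.0))


def system89(x):
    """Right-hand side of Eq. (89)."""
    x1, x2 = x
    return (-A_ * x1 + B_ * x1 * x2, -B_ * x1 * x2)


def system89_jac(x):
    """Jacobian of the right-hand side of Eq. (89)."""
    x1, x2 = x
    return ((-A_ + B_ * x2, B_ * x1), (-B_ * x2, -B_ * x1))


def det2(J):
    """Determinant of a 2 x 2 matrix given as a pair of rows."""
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]


def passes_jacobian_test(f, jac, x, tol=1e-9):
    """True if f(x) = 0 and det J(x) != 0, up to the tolerance tol."""
    is_equilibrium = all(abs(c) < tol for c in f(x))
    return is_equilibrium and abs(det2(jac(x))) > tol


# Pendulum equilibria (pi*n, 0): the test succeeds at each of them.
for n in range(-2, 3):
    print(n, passes_jacobian_test(pendulum, pendulum_jac, (n * math.pi, 0.0)))

# A point on the positive x2 axis for Eq. (89): the Jacobian is singular there.
print(passes_jacobian_test(system89, system89_jac, (0.0, 2.0)))
```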

FIGURE 9 Stability of an equilibrium point.

Now, in Fig. 9 we depict the behavior of the trajectories in the vicinity of a stable equilibrium for the case $x \in R^2$.

When $x_e = 0$ is stable, by choosing the initial points in a sufficiently small spherical neighborhood, we can force the graph of the solution for $t \ge t_0$ to lie entirely inside a given cylinder.

In the above definition of stability, $\delta$ depends on $\varepsilon$ and $t_0$ (i.e., $\delta = \delta(\varepsilon, t_0)$). If $\delta$ is independent of $t_0$ (i.e., $\delta = \delta(\varepsilon)$), then the equilibrium $x = 0$ of (E) is said to be uniformly stable.

The equilibrium $x = 0$ of (E) is said to be asymptotically stable if (1) it is stable, and (2) for every $t_0 \ge 0$ there exists an $\eta(t_0) > 0$ such that $\lim_{t \to \infty} \phi(t, t_0, \xi) = 0$ whenever $|\xi| < \eta$. Furthermore, the set of all $\xi \in R^n$ such that $\phi(t, t_0, \xi) \to 0$ as $t \to \infty$ for some $t_0 \ge 0$ is called the domain of attraction of the equilibrium $x = 0$ of (E). Also, if for (E) condition (2) is true, then the equilibrium $x = 0$ is said to be attractive.

The equilibrium $x = 0$ of (E) is said to be uniformly asymptotically stable if (1) it is uniformly stable, and (2) there is a $\delta_0 > 0$ such that for every $\varepsilon > 0$ and for any $t_0 \in R^+$ there exists a $T(\varepsilon) > 0$, independent of $t_0$, such that $|\phi(t, t_0, \xi)| < \varepsilon$ for all $t \ge t_0 + T(\varepsilon)$ whenever $|\xi| < \delta_0$.

In Fig. 10 we depict property (2), for uniform asymptotic stability, pictorially. By choosing the initial points in a sufficiently small spherical neighborhood at $t = t_0$, we can force the graph of the solution to lie inside a given cylinder for all $t > t_0 + T(\varepsilon)$. Condition (2) can be rephrased by saying that there exists a $\delta_0 > 0$ such that $\lim_{t \to \infty} \phi(t + t_0, t_0, \xi) = 0$ uniformly in $(t_0, \xi)$ for $t_0 \ge 0$ and for $|\xi| \le \delta_0$.

In applications, we are frequently interested in the following special case of uniform asymptotic stability: the equilibrium $x = 0$ of (E) is exponentially stable if there exists an $\alpha > 0$, and for every $\varepsilon > 0$ there exists a $\delta(\varepsilon) > 0$, such that $|\phi(t, t_0, \xi)| \le \varepsilon e^{-\alpha(t - t_0)}$ for all $t \ge t_0$ whenever $|\xi| < \delta(\varepsilon)$ and $t_0 \ge 0$.

In Fig. 11, the behavior of a solution in the vicinity of an exponentially stable equilibrium $x = 0$ is shown.

The equilibrium $x = 0$ of (E) is said to be unstable if it is not stable. In this case, there exists a $t_0 \ge 0$, $\varepsilon > 0$, a sequence $\xi_m \to 0$ of initial points, and a sequence $\{t_m\}$ such that $|\phi(t_0 + t_m, t_0, \xi_m)| \ge \varepsilon$ for all $m$, $t_m \ge 0$.

If $x = 0$ is an unstable equilibrium of (E), it still can happen that all the solutions tend to zero with increasing $t$. Thus, instability and attractivity are compatible concepts. Note that the equilibrium $x = 0$ is necessarily unstable if every neighborhood of the origin contains initial points corresponding to unbounded solutions (i.e., solutions whose norm $|\phi(t, t_0, \xi)|$ grows to infinity on a sequence $t_m \to \infty$). However, it can happen that a system (E) with unstable equilibrium $x = 0$ may have only bounded solutions.

FIGURE 10 Attractivity of an equilibrium point.

FIGURE 11 Exponential stability of an equilibrium point.
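The requirement that $T(\varepsilon)$ be independent of $t_0$ in the definition of uniform asymptotic stability can be made concrete with a small sketch. It uses two scalar test equations chosen here purely for illustration, $x' = -ax$ and $x' = -x/(1 + t)$, whose solutions are $\xi e^{-a(t - t_0)}$ and $\xi(1 + t_0)/(1 + t)$, respectively. The function names and the parameter values are assumptions of this sketch.

```python
import math


def settle_time_exponential(a, delta0, eps):
    """Smallest T >= 0 with delta0 * exp(-a*T) <= eps, for x' = -a*x.

    The solution is phi(t, t0, xi) = xi * exp(-a*(t - t0)), so T does not
    depend on the initial time t0.
    """
    return max(0.0, math.log(delta0 / eps) / a)


def settle_time_rational(t0, delta0, eps):
    """Smallest T >= 0 with delta0*(1 + t0)/(1 + t0 + T) <= eps, for x' = -x/(1 + t).

    The solution is phi(t, t0, xi) = xi*(1 + t0)/(1 + t), so T grows with t0.
    """
    return max(0.0, (1.0 + t0) * (delta0 / eps - 1.0))


delta0, eps = 1.0, 0.1
for t0 in (0.0, 10.0, 100.0):
    print(t0,
          settle_time_exponential(1.0, delta0, eps),
          settle_time_rational(t0, delta0, eps))
```

For the first equation a single $T(\varepsilon)$ works for every initial time, while for the second the required waiting time grows linearly in $1 + t_0$, so no $T(\varepsilon)$ independent of $t_0$ exists even though every solution tends to zero.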


The above concepts pertain to local properties of an equilibrium. In the following definitions, we consider some global characterizations of an equilibrium.

A solution $\phi(t, t_0, \xi)$ of (E) is bounded if there exists a $\beta > 0$ such that $|\phi(t, t_0, \xi)| < \beta$ for all $t \ge t_0$, where $\beta$ may depend on each solution. System (E) is said to possess Lagrange stability if for each $t_0 \ge 0$ and $\xi$ the solution $\phi(t, t_0, \xi)$ is bounded.

The solutions of (E) are uniformly bounded if for any $\alpha > 0$ and $t_0 \in R^+$ there exists a $\beta = \beta(\alpha) > 0$ (independent of $t_0$) such that if $|\xi| < \alpha$, then $|\phi(t, t_0, \xi)| < \beta$ for all $t \ge t_0$.

The solutions of (E) are uniformly ultimately bounded (with bound $B$) if there exists a $B > 0$ and if corresponding to any $\alpha > 0$ and $t_0 \in R^+$ there exists a $T = T(\alpha)$ (independent of $t_0$) such that $|\xi| < \alpha$ implies that $|\phi(t, t_0, \xi)| < B$ for all $t \ge t_0 + T$.

In contrast to the boundedness properties given in the preceding three paragraphs, the concepts introduced earlier as well as those stated in the following are usually referred to as stability (respectively, instability) in the sense of Lyapunov.

The equilibrium $x = 0$ of (E) is asymptotically stable in the large if it is stable and if every solution of (E) tends to zero as $t \to \infty$. In this case, the domain of attraction of the equilibrium $x = 0$ of (E) is all of $R^n$. Note that in this case, $x = 0$ is the only equilibrium of (E).

The equilibrium $x = 0$ of (E) is uniformly asymptotically stable in the large if (1) it is uniformly stable, and (2) for any $\alpha > 0$ and any $\varepsilon > 0$, and $t_0 \in R^+$, there exists $T(\varepsilon, \alpha) > 0$, independent of $t_0$, such that if $|\xi| < \alpha$, then $|\phi(t, t_0, \xi)| < \varepsilon$ for all $t \ge t_0 + T(\varepsilon, \alpha)$.

Finally, the equilibrium $x = 0$ of (E) is exponentially stable in the large if there exists $\alpha > 0$ and for any $\beta > 0$, there exists $k(\beta) > 0$ such that $|\phi(t, t_0, \xi)| \le k(\beta)|\xi| e^{-\alpha(t - t_0)}$ for all $t \ge t_0$ whenever $|\xi| < \beta$.

At this point it may be worthwhile to consider some specific examples:

1. The scalar equation,

$$x' = 0 \tag{95}$$

has for any initial condition $x(0) = c$ the solution $\phi(t, 0, c) = c$. All solutions are equilibria of Eq. (95). The trivial solution is stable; in fact, it is uniformly stable. However, it is not asymptotically stable.

2. The scalar equation,

$$x' = ax, \quad a > 0 \tag{96}$$

has for every $x(0) = c$ the solution $\phi(t, 0, c) = ce^{at}$ and $x = 0$ is the only equilibrium of Eq. (96). This equilibrium is unstable.

3. The scalar equation,

$$x' = -ax, \quad a > 0 \tag{97}$$

has for every $x(0) = c$ the solution $\phi(t, 0, c) = ce^{-at}$ and $x = 0$ is the only equilibrium of Eq. (97). This equilibrium is exponentially stable in the large.

4. The scalar equation,

$$x' = [-1/(t + 1)]x \tag{98}$$

has for every $x(t_0) = c$, $t_0 \ge 0$, a unique solution of the form $\phi(t, t_0, c) = (1 + t_0)c/(t + 1)$ and $x = 0$ is the only equilibrium of Eq. (98). This equilibrium is uniformly stable and asymptotically stable in the large, but it is not uniformly asymptotically stable.

C. Some Basic Properties of Autonomous and Periodic Systems

Making use of the properties of solutions and using the definitions of stability in the sense of Lyapunov, it is not difficult to establish the following general stability results for systems described by:

$$x' = f(x) \tag{A}$$

and

$$x' = f(t, x), \quad f(t, x) = f(t + T, x) \tag{P}$$

1. If the equilibrium $x = 0$ of (P) (or of (A)) is stable, then it is in fact uniformly stable.
2. If the equilibrium $x = 0$ of (P) (or of (A)) is asymptotically stable, then it is uniformly asymptotically stable.

D. Linear Systems

Next, by making use of the general properties of the solutions of linear autonomous homogeneous systems,

$$x' = Ax, \quad t \ge 0 \tag{L}$$

and of linear homogeneous systems (with $A(t)$ continuous),

$$x' = A(t)x, \quad t \ge t_0, \quad t_0 \ge 0 \tag{LH}$$

the following results are easily verified:

1. The equilibrium $x = 0$ of (LH) is stable if and only if the solutions of (LH) are bounded. Equivalently, the equilibrium $x = 0$ of (LH) is stable if and only if:

$$\sup_{t \ge t_0} |\Phi(t, t_0)| = c(t_0) < \infty$$

where $|\Phi(t, t_0)|$ denotes the matrix norm induced by the vector norm used on $R^n$ and sup denotes supremum.

2. The equilibrium $x = 0$ of (LH) is uniformly stable if and only if:

$$\sup_{t_0 \ge 0} c(t_0) = \sup_{t_0 \ge 0} \left( \sup_{t \ge t_0} |\Phi(t, t_0)| \right) = c_0 < \infty$$

3. The following statements are equivalent: (a) The equilibrium $x = 0$ of (LH) is asymptotically stable. (b) The equilibrium $x = 0$ of (LH) is asymptotically stable in the large. (c) $\lim_{t \to \infty} |\Phi(t, t_0)| = 0$.
4. The equilibrium $x = 0$ of (LH) is uniformly asymptotically stable if and only if it is exponentially stable.
5. (a) The equilibrium $x = 0$ of (L) is stable if all eigenvalues of $A$ have nonpositive real parts and every eigenvalue of $A$ that has a zero real part is a simple zero of the characteristic polynomial of $A$. (b) The equilibrium $x = 0$ of (L) is asymptotically stable if and only if all eigenvalues of $A$ have negative real parts. In this case, there exist constants $k > 0$, $\sigma > 0$ such that $|\Phi(t, t_0)| \le k \exp[-\sigma(t - t_0)]$, $t_0 \le t < \infty$, where $\Phi(t, t_0)$ denotes the state transition matrix of (L).

We shall find it convenient to use the following convention, which has become standard in the literature: A real $n \times n$ matrix $A$ is called stable or a Hurwitz matrix if all of its eigenvalues have negative real parts. If at least one of the eigenvalues has a positive real part, then $A$ is called unstable. A matrix $A$ which is neither stable nor unstable is called critical and the eigenvalues of $A$ with zero real parts are called critical eigenvalues.

Thus, the equilibrium $x = 0$ of (L) is asymptotically stable if and only if $A$ is stable. If $A$ is unstable, then $x = 0$ is unstable. If $A$ is critical, then the equilibrium is stable if the eigenvalues with zero real parts correspond to a simple zero of the characteristic polynomial of $A$; otherwise, the equilibrium may be unstable.

Next, we consider the stability properties of linear periodic systems,

$$x' = A(t)x, \quad A(t) = A(t + T) \tag{PL}$$

where $A(t)$ is a continuous matrix for all $t \in R$. For such systems, the following results follow directly from Floquet theory:

1. The equilibrium $x = 0$ of (PL) is uniformly stable if all eigenvalues of $R$ in Eq. (64) have nonpositive real parts and any eigenvalue of $R$ having zero real part is a simple zero of the characteristic polynomial of $R$.
2. The equilibrium $x = 0$ of (PL) is uniformly asymptotically stable if and only if all eigenvalues of $R$ have negative real parts.

E. Two-Dimensional Linear Systems

Before we present the principal results of the Lyapunov theory for general systems (E), we consider in detail the behavior of trajectories near an equilibrium point $x = 0$ of two-dimensional linear systems of the form

$$x_1' = a_{11}x_1 + a_{12}x_2, \qquad x_2' = a_{21}x_1 + a_{22}x_2 \tag{99}$$

We can rewrite Eq. (99) equivalently as

$$x' = Ax \tag{100}$$

where

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \tag{101}$$

When $\det A \neq 0$, system (99) will have one and only one equilibrium point, namely $x = 0$. We shall classify this equilibrium point (and, hence, system (99)) according to the following cases which the eigenvalues $\lambda_1, \lambda_2$ of $A$ can assume:

1. $\lambda_1, \lambda_2$ are real and $\lambda_1 < 0$, $\lambda_2 < 0$: $x = 0$ is an asymptotically stable equilibrium point called a stable node.
2. $\lambda_1, \lambda_2$ are real and $\lambda_1 > 0$, $\lambda_2 > 0$: $x = 0$ is an unstable equilibrium point called an unstable node.
3. $\lambda_1, \lambda_2$ are real and $\lambda_1 \lambda_2 < 0$: $x = 0$ is an unstable equilibrium point called a saddle.
4. $\lambda_1, \lambda_2$ are complex conjugates and $\operatorname{Re} \lambda_1 = \operatorname{Re} \lambda_2 < 0$: $x = 0$ is an asymptotically stable equilibrium point called a stable focus.
5. $\lambda_1, \lambda_2$ are complex conjugates and $\operatorname{Re} \lambda_1 = \operatorname{Re} \lambda_2 > 0$: $x = 0$ is an unstable equilibrium point called an unstable focus.
6. $\lambda_1, \lambda_2$ are complex conjugates and $\operatorname{Re} \lambda_1 = \operatorname{Re} \lambda_2 = 0$: $x = 0$ is a stable equilibrium called a center.

FIGURE 12 Trajectories near a stable node.

Using the results of Section IV, it is possible to solve Eq. (99) explicitly and verify that the qualitative behavior of the trajectories near the equilibrium $x = 0$ is as shown in Figs. 12–14 for the cases of a stable node, unstable node,

and saddle, respectively. (The arrows on the trajectories point in the direction of increasing time.) In these cases, the figures labeled (a) correspond to systems in which $A$ is in Jordan canonical form while the figures labeled (b) correspond to systems in which $A$ is in some arbitrary form. In a similar manner, it is possible to verify that the qualitative behavior of the trajectories near the equilibrium $x = 0$ is as shown in Figs. 15–18 for the cases of a stable node (repeated eigenvalue case), an unstable focus, a stable focus, and a center, respectively. (For purposes of convention, we assumed in these figures that when $\lambda_1, \lambda_2$ are real and not equal, then $\lambda_1 > \lambda_2$.)

FIGURE 13 Trajectories near an unstable node.

FIGURE 14 Trajectories near a saddle.

FIGURE 15 Trajectories near a stable node (repeated eigenvalue case).

FIGURE 16 Trajectory near an unstable focus.

F. Lyapunov Functions

Next, we present general stability results for the equilibrium $x = 0$ of a system described by (E). Such results involve the existence of real-valued functions $v: D \to R$. In the case of local results (e.g., stability, instability, asymptotic stability, and exponential stability results), we shall usually only require that $D = B(h) \subset R^n$ for some $h > 0$, or $D = R^+ \times B(h)$. (Recall that $R^+ = [0, \infty)$ and $B(h) = \{x \in R^n : |x| < h\}$ where $|x|$ denotes any one of the equivalent norms of $x$ on $R^n$.) On the other hand, in the case of global results (e.g., asymptotic stability in the large, exponential stability in the large, and uniform boundedness of solutions), we have to assume that $D = R^n$ or $D = R^+ \times R^n$. Unless stated otherwise, we shall always assume that $v(t, 0) = 0$ for all $t \in R^+$ (resp., $v(0) = 0$).

Now let $\phi$ be an arbitrary solution of (E) and consider the function $t \mapsto v(t, \phi(t))$. If $v$ is continuously differentiable with respect to all of its arguments, then we obtain (by the chain rule) the derivative of $v$ with respect to $t$ along the solutions of (E), $v'_{(E)}$, as:

$$v'_{(E)}(t, \phi(t)) = \frac{\partial v}{\partial t}(t, \phi(t)) + \nabla v(t, \phi(t))^T f(t, \phi(t))$$

Here, $\nabla v$ denotes the gradient vector of $v$ with respect to $x$. For a solution $\phi(t, t_0, \xi)$ of (E), we have

$$v(t, \phi(t)) = v(t_0, \xi) + \int_{t_0}^{t} v'_{(E)}(\tau, \phi(\tau, t_0, \xi))\, d\tau$$

The above observations motivate the following: let $v: R^+ \times R^n \to R$ (resp., $v: R^+ \times B(h) \to R$) be continuously differentiable with respect to all of its arguments and let $\nabla v$ denote the gradient of $v$ with respect to $x$. Then $v'_{(E)}: R^+ \times R^n \to R$ (resp., $v'_{(E)}: R^+ \times B(h) \to R$) is defined by:


$$v'_{(E)}(t, x) = \frac{\partial v}{\partial t}(t, x) + \sum_{i=1}^{n} \frac{\partial v}{\partial x_i}(t, x) f_i(t, x) = \frac{\partial v}{\partial t}(t, x) + \nabla v(t, x)^T f(t, x) \tag{102}$$

We call $v'_{(E)}$ the derivative of $v$ (with respect to $t$) along the solutions of (E).

It is important to note that in Eq. (102) the derivative of $v$ with respect to $t$, along the solutions of (E), is evaluated without having to solve (E). The significance of this will become clear later. We also note that when $v: R^n \to R$ (resp., $v: B(h) \to R$), then Eq. (102) reduces to $v'_{(E)}(t, x) = \nabla v(x)^T f(t, x)$. Also, in the case of autonomous systems (A), if $v: R^n \to R$ (resp., $v: B(h) \to R$), we have

$$v'_{(A)}(x) = \nabla v(x)^T f(x) \tag{103}$$

Occasionally, we shall require only that $v$ be continuous on its domain of definition and that it satisfy locally a Lipschitz condition with respect to $x$. In such cases we define the upper right-hand derivative of $v$ with respect to $t$ along the solutions of (E) by:

$$v'_{(E)}(t, x) = \limsup_{\theta \to 0^+} \frac{1}{\theta}\{v(t + \theta, \phi(t + \theta, t, x)) - v(t, x)\} = \limsup_{\theta \to 0^+} \frac{1}{\theta}\{v(t + \theta, x + \theta \cdot f(t, x)) - v(t, x)\} \tag{104}$$

When $v$ is continuously differentiable, then Eq. (104) reduces to Eq. (102).

FIGURE 17 Trajectory near a stable focus.

FIGURE 18 Trajectories near a center.

We now give several important properties that $v$ functions may possess. In doing so, we employ Kamke comparison functions defined as follows: a continuous function $\psi: [0, r_1] \to R^+$ (resp., $\psi: [0, \infty) \to R^+$) is said to belong to the class $K$ (i.e., $\psi \in K$), if $\psi(0) = 0$ and if $\psi$ is strictly increasing on $[0, r_1]$ (resp., on $[0, \infty)$). If $\psi: R^+ \to R^+$, if $\psi \in K$ and if $\lim_{r \to \infty} \psi(r) = \infty$, then $\psi$ is said to belong to class $KR$.

We are now in a position to characterize $v$-functions in several ways. In the following, we assume that $v: R^+ \times R^n \to R$ (resp., $v: R^+ \times B(h) \to R$), that $v(t, 0) = 0$ for all $t \in R^+$, and that $v$ is continuous.

1. $v$ is positive definite if, for some $r > 0$, there exists a $\psi \in K$ such that $v(t, x) \ge \psi(|x|)$ for all $t \ge 0$ and for all $x \in B(r)$.
2. $v$ is decrescent if there exists a $\psi \in K$ such that $|v(t, x)| \le \psi(|x|)$ for all $t \ge 0$ and for all $x \in B(r)$ for some $r > 0$.
3. $v: R^+ \times R^n \to R$ is radially unbounded if there exists a $\psi \in KR$ such that $v(t, x) \ge \psi(|x|)$ for all $t \ge 0$ and for all $x \in R^n$.
4. $v$ is negative definite if $-v$ is positive definite.
5. $v$ is positive semidefinite if $v(t, x) \ge 0$ for all $x \in B(r)$ for some $r > 0$ and for all $t \ge 0$.
6. $v$ is negative semidefinite if $-v$ is positive semidefinite.

The definitions involving the above concepts, when $v: R^n \to R$ or $v: B(h) \to R$ (where $B(h) \subset R^n$ for some $h > 0$), involve obvious modifications. We now consider several specific cases:

1. The function $v: R^3 \to R$ given by $v(x) = x^T x = x_1^2 + x_2^2 + x_3^2$ is positive definite and radially unbounded. (Here, $x^T$ denotes the transpose of $x$.)
2. The function $v: R^3 \to R$ given by $v(x) = x_1^2 + (x_2 + x_3)^2$ is positive semidefinite (but not positive definite).
3. The function $v: R^2 \to R$ given by $v(x) = x_1^2 + x_2^2 - (x_1^2 + x_2^2)^3$ is positive definite but not radially unbounded.
4. The function $v: R^3 \to R$ given by $v(x) = x_1^2 + x_2^2$ is positive semidefinite (but not positive definite).

5. The function $v: R^2 \to R$ given by $v(x) = x_1^4/(1 + x_1^4) + x_2^4$ is positive definite but not radially unbounded.
6. The function $v: R^+ \times R^2 \to R$ given by $v(t, x) = (1 + \cos^2 t)x_1^2 + 2x_2^2$ is positive definite, decrescent, and radially unbounded.
7. The function $v: R^+ \times R^2 \to R$ given by $v(t, x) = (x_1^2 + x_2^2)\cos^2 t$ is positive semidefinite and decrescent.
8. The function $v: R^+ \times R^2 \to R$ given by $v(t, x) = (1 + t)(x_1^2 + x_2^2)$ is positive definite and radially unbounded but not decrescent.
9. The function $v: R^+ \times R^2 \to R$ given by $v(t, x) = x_1^2/(1 + t) + x_2^2$ is decrescent and positive semidefinite but not positive definite.
10. The function $v: R^+ \times R^2 \to R$ given by $v(t, x) = (x_2 - x_1)^2(1 + t)$ is positive semidefinite but not positive definite or decrescent.

FIGURE 19 Surface described by a quadratic form.
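Some of the distinctions drawn in the cases above can be probed numerically. The following is a minimal sketch, assuming NumPy is available; the function names (`v3`, `v5`, `v8`) and the sample points are illustrative choices that do not appear in the text. Sampling of this kind can only suggest a property such as positive definiteness; it does not prove it.

```python
import numpy as np

def v3(x):
    """Case 3: v(x) = |x|^2 - |x|^6 on R^2."""
    r2 = x[0] ** 2 + x[1] ** 2
    return r2 - r2 ** 3

def v5(x):
    """Case 5: v(x) = x1^4 / (1 + x1^4) + x2^4 on R^2."""
    return x[0] ** 4 / (1.0 + x[0] ** 4) + x[1] ** 4

def v8(t, x):
    """Case 8: v(t, x) = (1 + t)(x1^2 + x2^2)."""
    return (1.0 + t) * (x[0] ** 2 + x[1] ** 2)

# Case 3 is positive on a small circle around the origin ...
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
min_small_v3 = min(v3(0.5 * np.array([np.cos(a), np.sin(a)])) for a in angles)

# ... but negative far from the origin, so it cannot be radially unbounded.
far_v3 = v3(np.array([2.0, 0.0]))  # 4 - 64 = -60

# Case 5 stays below 1 along the x1 axis, however large |x| becomes.
along_axis_v5 = [v5(np.array([s, 0.0])) for s in (1e1, 1e2, 1e3)]

# Case 8 grows without bound in t at a fixed x, so no bound psi(|x|)
# independent of t can dominate it (it is not decrescent).
fixed_x = np.array([0.1, 0.0])
growth_v8 = [v8(t, fixed_x) for t in (0.0, 1e2, 1e4)]

print(min_small_v3, far_v3, along_axis_v5, growth_v8)
```

The same pattern (sample small spheres for sign, large radii for growth, and large t for uniformity in time) applies to the other cases in the list.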

Of special interest are functions $v: R^n \to R$ that are quadratic forms given by:

$$v(x) = x^T B x = \sum_{i,k=1}^{n} b_{ik} x_i x_k \tag{105}$$

where $B = [b_{ij}]$ is a real symmetric $n \times n$ matrix (i.e., $B^T = B$). Since B is symmetric, it is diagonalizable and all of its eigenvalues are real. For Eq. (105) one can prove the following:

1. v is positive definite (and radially unbounded) if and only if all leading principal minors of B are positive; that is, if and only if

$$\det \begin{bmatrix} b_{11} & \cdots & b_{1k} \\ \vdots & & \vdots \\ b_{k1} & \cdots & b_{kk} \end{bmatrix} > 0, \qquad k = 1, \ldots, n$$

2. v is negative definite if and only if

$$(-1)^k \det \begin{bmatrix} b_{11} & \cdots & b_{1k} \\ \vdots & & \vdots \\ b_{k1} & \cdots & b_{kk} \end{bmatrix} > 0, \qquad k = 1, \ldots, n$$

3. v is definite (i.e., either positive definite or negative definite) if and only if all eigenvalues are nonzero and have the same sign.
4. v is semidefinite (i.e., either positive semidefinite or negative semidefinite) if and only if the nonzero eigenvalues of B have the same sign.
5. If $\lambda_m$ and $\lambda_M$ denote the smallest and largest eigenvalues of B and if $|x|$ denotes the Euclidean norm of x, then $\lambda_m |x|^2 \le v(x) \le \lambda_M |x|^2$ for all $x \in R^n$. (The Euclidean norm of x is defined as $(x^T x)^{1/2} = (\sum_{i=1}^{n} x_i^2)^{1/2}$.)
6. v is indefinite (i.e., in every neighborhood of the origin x = 0, v assumes positive and negative values) if and only if B possesses both positive and negative eigenvalues.

Quadratic forms (105) have some interesting geometric properties. To see this, let n = 2, and assume that both eigenvalues of B are positive so that v is positive definite and radially unbounded. In $R^3$, let us now consider the surface determined by:

$$z = v(x) = x^T B x \tag{106}$$

This equation describes a cup-shaped surface as depicted in Fig. 19. Note that corresponding to every point on this cup-shaped surface there exists one and only one point in the $x_1 x_2$ plane. Note also that the loci defined by $C_i = \{x \in R^2: v(x) = c_i \ge 0\}$, $c_i = \text{const}$, determine closed curves in the $x_1 x_2$ plane as shown in Fig. 20. We call these curves level curves. Note that $C_0 = \{0\}$ corresponds to the case in which $z = c_0 = 0$. Note also that this function v can be used to cover the entire $R^2$ plane with closed curves by selecting for z all values in $R^+$.

FIGURE 20 Level curves determined by a quadratic form.

In the case when $v = x^T B x$ is a positive definite quadratic form with $x \in R^n$, the preceding comments are still true; however, in this case, the closed curves $C_i$ must be replaced by closed hypersurfaces in $R^n$ and a simple geometric visualization as in Figs. 19 and 20 is no longer possible.
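The eigenvalue and minor tests for quadratic forms listed above lend themselves to a short numerical check. The sketch below assumes NumPy; the helper names `leading_principal_minors` and `classify_quadratic_form` and the tolerance `tol` are hypothetical choices made for illustration, and the floating-point tolerance means that borderline matrices may be misclassified.

```python
import numpy as np

def leading_principal_minors(B):
    """Determinants of the upper-left k x k blocks of B, k = 1, ..., n."""
    B = np.asarray(B, dtype=float)
    return [float(np.linalg.det(B[:k, :k])) for k in range(1, B.shape[0] + 1)]

def classify_quadratic_form(B, tol=1e-12):
    """Classify v(x) = x^T B x from the eigenvalues of the symmetric matrix B."""
    B = np.asarray(B, dtype=float)
    if not np.allclose(B, B.T):
        raise ValueError("B must be symmetric")
    eig = np.linalg.eigvalsh(B)  # real eigenvalues, ascending order
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig >= -tol):
        return "positive semidefinite"
    if np.all(eig <= tol):
        return "negative semidefinite"
    return "indefinite"

B_pd = np.array([[2.0, 1.0], [1.0, 2.0]])
# Matrix of v(x) = x1^2 + (x2 + x3)^2, which is only semidefinite.
B_psd = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]])
B_ind = np.array([[1.0, 0.0], [0.0, -1.0]])

kind_pd = classify_quadratic_form(B_pd)
kind_psd = classify_quadratic_form(B_psd)
kind_ind = classify_quadratic_form(B_ind)
minors_pd = leading_principal_minors(B_pd)

# Bounds lambda_m |x|^2 <= v(x) <= lambda_M |x|^2 at random sample points.
rng = np.random.default_rng(0)
lam = np.linalg.eigvalsh(B_pd)
bounds_hold = True
for _ in range(100):
    x = rng.normal(size=2)
    vx = x @ B_pd @ x
    n2 = x @ x
    bounds_hold = bounds_hold and (lam[0] * n2 - 1e-9 <= vx <= lam[-1] * n2 + 1e-9)

print(kind_pd, kind_psd, kind_ind, minors_pd, bounds_hold)
```

Using `eigvalsh` rather than a general eigenvalue routine exploits the symmetry of B and guarantees real output; the minor test is included only to mirror item 1, since in floating point the eigenvalue test is usually the more robust of the two.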

G. Lyapunov Stability and Instability Results: Motivation

Before we summarize the principal Lyapunov-type of stability and instability results, we give a geometric interpretation of some of these results in $R^2$. To this end, we consider the system of equations,

$$x_1' = f_1(x_1, x_2), \qquad x_2' = f_2(x_1, x_2) \tag{107}$$

and we assume that $f_1$ and $f_2$ are such that for every $(t_0, x_0)$, $t_0 \ge 0$, Eq. (107) has a unique solution $\phi(t, t_0, x_0)$ with $\phi(t_0, t_0, x_0) = x_0$. We also assume that $(x_1, x_2)^T = (0, 0)^T$ is the only equilibrium in B(h) for some $h > 0$.

Next, let v be a positive definite, continuously differentiable function with nonvanishing gradient $\nabla v$ on $0 < |x| \le h$. Then, $v(x) = c$, $c \ge 0$, defines for sufficiently small constants $c > 0$ a family of closed curves $C_i$ which cover the neighborhood B(h) as shown in Fig. 21. Note that the origin x = 0 is located in the interior of each such curve and in fact, $C_0 = \{0\}$.

FIGURE 21 Trajectory near an asymptotically stable equilibrium point.

Now suppose that all trajectories of Eq. (107) originating from points on the circular disk $|x| \le r_1 < h$ cross the curves $v(x) = c$ from the exterior toward the interior when we proceed along these trajectories in the direction of increasing values of t. Then, we can conclude that these trajectories approach the origin as t increases (i.e., the equilibrium x = 0 is in this case asymptotically stable).

In terms of the given v function, we have the following interpretation. For a given solution $\phi(t, t_0, x_0)$ to cross the curve $v(x) = r$, $r = v(x_0)$, the angle between the outward normal vector $\nabla v(x_0)$ and the derivative of $\phi(t, t_0, x_0)$ at $t = t_0$ must be greater than $\pi/2$; that is,

$$v'_{(107)}(x_0) = \nabla v(x_0)^T f(x_0) < 0$$

For this to happen at all points, we must have $v'_{(107)}(x) < 0$ for $0 < |x| \le r_1$. The same results can be arrived at from an analytic point of view. The function $V(t) = v(\phi(t, t_0, x_0))$ decreases monotonically as t increases. This implies that the derivative $v'(\phi(t, t_0, x_0))$ along the solution $\phi(t, t_0, x_0)$ must be negative definite in B(r) for $r > 0$ sufficiently small.

Next, let us assume that Eq. (107) has only one equilibrium (at x = 0) and that v is positive definite and radially unbounded. It turns out that in this case, the relation $v(x) = c$, $c \in R^+$, can be used to cover all of $R^2$ by closed curves of the type shown in Fig. 21. If for arbitrary $(t_0, x_0)$, the corresponding solution of Eq. (107), $\phi(t, t_0, x_0)$, behaves as already discussed, then it follows that the derivative of v along this solution, $v'(\phi(t, t_0, x_0))$, will be negative definite in $R^2$.

Since the foregoing discussion was given in terms of an arbitrary solution of Eq. (107), we may suspect that the following results are true:

1. If there exists a positive definite function v such that $v'_{(107)}$ is negative definite, then the equilibrium x = 0 of Eq. (107) is asymptotically stable.
2. If there exists a positive definite and radially unbounded function v such that $v'_{(107)}$ is negative definite for all $x \in R^2$, then the equilibrium x = 0 of Eq. (107) is asymptotically stable in the large.

FIGURE 22 Instability of an equilibrium point.

Continuing our discussion by making reference to Fig. 22, let us assume that we can find for Eq. (107) a continuously differentiable function $v: R^2 \to R$ which is indefinite and has the properties discussed below. Since v is indefinite, there exist in each neighborhood of the origin points for which $v > 0$, $v < 0$, and $v(0) = 0$. Confining our attention to B(k), where $k > 0$ is sufficiently small, we let $D = \{x \in B(k): v(x) < 0\}$. (D may consist of several subdomains.) The boundary of D, $\partial D$, as shown in Fig. 22, consists of points in $\partial B(k)$ and of points determined by $v(x) = 0$. Assume that in the interior of D, v is bounded. Suppose $v'_{(107)}(x)$ is negative definite in D

and that x(t) is a trajectory of Eq. (107) which originates somewhere on the boundary of D ($x(t_0) \in \partial D$) with $v(x(t_0)) = 0$. Then, this trajectory will penetrate the boundary of D at points where $v = 0$ as t increases and it can never again reach a point where $v = 0$. In fact, as t increases, this trajectory will penetrate the set of points determined by $|x| = k$ (since, by assumption, $v'_{(107)} < 0$ along this trajectory and $v < 0$ in D). But, this indicates that the equilibrium x = 0 of Eq. (107) is unstable. We are once more led to a conjecture:

3. Let a function $v: R^2 \to R$ be given which is continuously differentiable and which has the following properties: (a) There exist points x arbitrarily close to the origin such that $v(x) < 0$; they form the domain D bounded by the set of points determined by $v = 0$ and the disk $|x| = k$. (b) In the interior of D, v is bounded. (c) In the interior of D, $v'_{(107)}$ is negative. Then, the equilibrium x = 0 of Eq. (107) is unstable.

H. Principal Lyapunov Stability and Instability Theorems

It turns out that results of the type given above for Eq. (107) are true for general systems (E). These results, which are proved by standard δ–ε arguments, comprise the direct method of Lyapunov, which is also sometimes called the second method of Lyapunov. The reason for this nomenclature is clear: Results of the kind presented here allow us to make qualitative statements about whole families of solutions of (E), without actually solving this equation.

In the following, we enumerate some of the more important results of the direct method. We shall assume that $v: R^+ \times B(h) \to R$ (resp., $v: R^+ \times R^n \to R$).

1. If there exists a continuously differentiable positive definite function v with a negative semidefinite (or identically zero) derivative $v'_{(E)}$, then the equilibrium x = 0 of (E) is stable.

As an example, consider the system given by:

$$x_1' = x_2, \qquad x_2' = -x_2 - e^{-t} x_1 \tag{108}$$

which has an equilibrium at $(x_1, x_2)^T = (0, 0)^T$. For Eq. (108), choose the positive definite function $v(t, x_1, x_2) = x_1^2 + e^{t} x_2^2$. We obtain $v'_{(108)}(t, x_1, x_2) = -e^{t} x_2^2$ which is negative semidefinite. We conclude that the equilibrium x = 0 of Eq. (108) is stable.

2. If there exists a continuously differentiable, positive definite, decrescent function v with negative semidefinite derivative $v'_{(E)}$, then the equilibrium x = 0 of (E) is uniformly stable.

As an example, consider the simple pendulum,

$$x_1' = x_2, \qquad x_2' = -k \sin x_1 \tag{109}$$

where $k > 0$ is a constant. As noted earlier, Eq. (109) has an isolated equilibrium at x = 0. Choose $v(x_1, x_2) = \frac{1}{2} x_2^2 + k \int_0^{x_1} \sin \eta \, d\eta$, which is continuously differentiable and positive definite. Also, since v does not depend on t, it will automatically be decrescent. Furthermore, $v'_{(109)}(x_1, x_2) = (k \sin x_1) x_1' + x_2 x_2' = (k \sin x_1) x_2 + x_2(-k \sin x_1) = 0$. Therefore, the equilibrium x = 0 of Eq. (109) is uniformly stable.

3. If there exists a continuously differentiable, positive definite, decrescent function v with a negative definite derivative $v'_{(E)}$, then the equilibrium x = 0 of (E) is uniformly asymptotically stable.

For an example, the system,

$$\begin{aligned} x_1' &= (x_1 - c_2 x_2)(x_1^2 + x_2^2 - 1) \\ x_2' &= (c_1 x_1 + x_2)(x_1^2 + x_2^2 - 1) \end{aligned} \tag{110}$$

has an isolated equilibrium at the origin x = 0. Choosing $v(x) = c_1 x_1^2 + c_2 x_2^2$, we obtain $v'_{(110)}(x) = 2(c_1 x_1^2 + c_2 x_2^2)(x_1^2 + x_2^2 - 1)$. If $c_1 > 0$, $c_2 > 0$, then v is positive definite (and decrescent) and $v'_{(110)}$ is negative definite in the domain $x_1^2 + x_2^2 < 1$. Therefore, the equilibrium x = 0 of Eq. (110) is uniformly asymptotically stable.

4. If there exists a continuously differentiable, positive definite, decrescent, and radially unbounded function v such that $v'_{(E)}$ is negative definite for all $(t, x) \in R^+ \times R^n$, then the equilibrium x = 0 of (E) is uniformly asymptotically stable in the large.

As an example, consider the system,

$$\begin{aligned} x_1' &= x_2 + c x_1 (x_1^2 + x_2^2) \\ x_2' &= -x_1 + c x_2 (x_1^2 + x_2^2) \end{aligned} \tag{111}$$

where c is a real constant. Note that x = 0 is the only equilibrium. Choosing the positive definite, decrescent, and radially unbounded function $v(x) = x_1^2 + x_2^2$, we obtain $v'_{(111)}(x) = 2c(x_1^2 + x_2^2)^2$. We conclude that if $c = 0$, then x = 0 of Eq. (111) is uniformly stable and if $c < 0$, then x = 0 of Eq. (111) is uniformly asymptotically stable in the large.

5. If there exists a continuously differentiable function v and three positive constants $c_1$, $c_2$, and $c_3$ such that

$$c_1 |x|^2 \le v(t, x) \le c_2 |x|^2, \qquad v'_{(E)}(t, x) \le -c_3 |x|^2$$

for all $t \in R^+$ and for all $x \in B(r)$ for some $r > 0$, then the equilibrium x = 0 of (E) is exponentially stable.

6. If there exist a continuously differentiable function v and three positive constants $c_1$, $c_2$, and $c_3$ such that

$$c_1 |x|^2 \le v(t, x) \le c_2 |x|^2, \qquad v'_{(E)}(t, x) \le -c_3 |x|^2$$

for all $t \in R^+$ and for all $x \in R^n$, then the equilibrium x = 0 of (E) is exponentially stable in the large.

As an example, consider the system,

$$\begin{aligned} x_1' &= -a(t) x_1 - b x_2 \\ x_2' &= b x_1 - c(t) x_2 \end{aligned} \tag{112}$$

where b is a real constant and where a and c are real and continuous functions defined for $t \ge 0$ satisfying $a(t) \ge \delta > 0$ and $c(t) \ge \delta > 0$ for all $t \ge 0$. We assume that x = 0 is the only equilibrium for Eq. (112). If we choose $v(x) = \frac{1}{2}(x_1^2 + x_2^2)$, then $v'_{(112)}(t, x) = -a(t) x_1^2 - c(t) x_2^2 \le -\delta(x_1^2 + x_2^2)$. Hence, the equilibrium x = 0 of Eq. (112) is exponentially stable in the large.

7. If there exists a continuously differentiable function v defined on $|x| \ge R$ (where R may be large) and $0 \le t < \infty$, and if there exist $\psi_1, \psi_2 \in KR$ such that $\psi_1(|x|) \le v(t, x) \le \psi_2(|x|)$, $v'_{(E)}(t, x) \le 0$ for all $|x| \ge R$ and for all $0 \le t < \infty$, then the solutions of (E) are uniformly bounded.

8. If there exists a continuously differentiable function v defined on $|x| \ge R$ (where R may be large) and $0 \le t < \infty$, and if there exist $\psi_1, \psi_2 \in KR$ and $\psi_3 \in K$ such that $\psi_1(|x|) \le v(t, x) \le \psi_2(|x|)$, $v'_{(E)}(t, x) \le -\psi_3(|x|)$ for all $|x| \ge R$ and $0 \le t < \infty$, then the solutions of (E) are uniformly ultimately bounded.

As an example, consider the system,

$$x' = -x - \sigma, \qquad \sigma' = -\sigma - f(\sigma) + x \tag{113}$$

where $f(\sigma) = \sigma(\sigma^2 - 6)$. There are isolated equilibrium points at $x = \sigma = 0$, $x = -\sigma = 2$, and $x = -\sigma = -2$. Choosing the radially unbounded and decrescent function $v(x, \sigma) = \frac{1}{2}(x^2 + \sigma^2)$, we obtain $v'_{(113)}(x, \sigma) = -x^2 - \sigma^2(\sigma^2 - 5) \le -x^2 - (\sigma^2 - \frac{5}{2})^2 + \frac{25}{4}$. Also $v'_{(113)}$ is negative for all $(x, \sigma)$ such that $x^2 + \sigma^2 > R^2$, where, for example, R = 10 will do. Therefore, all solutions of Eq. (113) are uniformly bounded and, in fact, uniformly ultimately bounded.

9. The equilibrium x = 0 of (E) is unstable (at $t = t_0 \ge 0$) if there exists a continuously differentiable, decrescent function v such that $v'_{(E)}$ is positive definite (negative definite) and if in every neighborhood of the origin there are points x such that $v(t_0, x) > 0$ ($v(t_0, x) < 0$).

Reconsider system (111), this time assuming that $c > 0$. If we choose $v(x) = x_1^2 + x_2^2$, then $v'_{(111)}(x) = 2c(x_1^2 + x_2^2)^2$ and we can conclude from the above result that the equilibrium x = 0 of Eq. (111) is unstable.

10. Let there exist a bounded and continuously differentiable function $v: D \to R$, $D = \{(t, x): t \ge t_0, x \in B(h)\}$, with the following properties: (a) $v'_{(E)}(t, x) = \lambda v(t, x) + w(t, x)$, where $\lambda > 0$ is a constant and $w(t, x)$ is either identically zero or positive semidefinite; (b) in the set $D_1 = \{(t, x): t = t_1, x \in B(h_1)\}$ for fixed $t_1 \ge t_0$ and with arbitrarily small $h_1$, there exist values x such that $v(t_1, x) > 0$. Then the equilibrium x = 0 of (E) is unstable.

As a specific example, consider:

$$\begin{aligned} x_1' &= x_1 + x_2 + x_1 x_2^4 \\ x_2' &= x_1 + x_2 - x_1^2 x_2 \end{aligned} \tag{114}$$

which has an isolated equilibrium x = 0. Choosing $v(x) = (x_1^2 - x_2^2)/2$, we obtain $v'_{(114)}(x) = \lambda v(x) + w(x)$, where $w(x) = x_1^2 x_2^4 + x_1^2 x_2^2$ and $\lambda = 2$. It follows from the above result that the equilibrium x = 0 of Eq. (114) is unstable.

11. Let there exist a continuously differentiable function v having the following properties: (a) For every $\varepsilon > 0$ and for every $t \ge 0$, there exist points $\bar{x} \in B(\varepsilon)$ such that $v(t, \bar{x}) < 0$. We call the set of all points (t, x) such that $x \in B(h)$ and such that $v(t, x) < 0$ the “domain v < 0.” It is bounded by the hypersurfaces which are determined by $|x| = h$ and $v(t, x) = 0$ and it may consist of several component domains. (b) In at least one of the component domains D of the domain v < 0, v is bounded from below and $0 \in \partial D$ for all $t \ge 0$. (c) In the domain D, $v'_{(E)} \le -\psi(|v|)$, where $\psi \in K$. Then, the equilibrium x = 0 of (E) is unstable.

As an example, consider the system,

$$\begin{aligned} x_1' &= x_1 + x_2 \\ x_2' &= x_1 - x_2 + x_1 x_2 \end{aligned} \tag{115}$$

which has an isolated equilibrium at the origin x = 0. Choosing $v(x) = -x_1 x_2$, we obtain $v'_{(115)}(x) = -x_1^2 - x_2^2 - x_1^2 x_2$. Let $D = \{x \in R^2: x_1 > 0, x_2 > 0, \text{ and } x_1^2 + x_2^2 < 1\}$. Then, for all $x \in D$, $v < 0$ and $v'_{(115)} < 2v$. We see that the above result is applicable and conclude that the equilibrium x = 0 of Eq. (115) is unstable.

The results given in items 1–11 are also true when v is continuous (rather than continuously differentiable). In this case, $v'_{(E)}$ must be interpreted in the sense of Eq. (104).

For the case of systems (A),

$$x' = f(x) \tag{A}$$

it is sometimes possible to relax the conditions on $v'_{(A)}$ when investigating the asymptotic stability of the equilibrium x = 0, by insisting that $v'_{(A)}$ be only negative semidefinite. In doing so, we require the following concept: A set of points in $R^n$ is invariant (with respect to (A)) if

every solution of (A) starting in the set remains in the set for all time.

We are now in a position to state our next result which is part of invariance theory for ordinary differential equations.

12. Assume that there exists a continuously differentiable, positive definite, and radially unbounded function $v: R^n \to R$ such that (a) $v'_{(A)}(x) \le 0$ for all $x \in R^n$, and (b) the origin x = 0 is the only invariant subset of the set $E = \{x \in R^n: v'_{(A)}(x) = 0\}$. Then, the equilibrium x = 0 of (A) is asymptotically stable in the large.

As a specific example, consider the Liénard equation given by:

$$x_1' = x_2, \qquad x_2' = -f(x_1) x_2 - g(x_1) \tag{116}$$

where it is assumed that f and g are continuously differentiable for all $x_1 \in R$, $g(x_1) = 0$ if and only if $x_1 = 0$, $x_1 g(x_1) > 0$ for all $x_1 \ne 0$ and $x_1 \in R$,

$$\lim_{|x_1| \to \infty} \int_0^{x_1} g(\eta) \, d\eta = \infty$$

and $f(x_1) > 0$ for all $x_1 \in R$. Then $(x_1, x_2) = (0, 0)$ is the only equilibrium of Eq. (116). Let us choose the v function,

$$v(x_1, x_2) = \frac{1}{2} x_2^2 + \int_0^{x_1} g(\eta) \, d\eta$$

(for Eq. (116)) which is positive definite and radially unbounded. Along the solutions of Eq. (116) we have $v'_{(116)}(x_1, x_2) = -x_2^2 f(x_1) \le 0$ for all $(x_1, x_2) \in R^2$. It is easily verified that the set E in the above theorem is in our case the $x_1$ axis. Furthermore, a moment’s reflection shows that the largest invariant subset (with respect to Eq. (116)) of the $x_1$ axis is the set $\{(0, 0)^T\}$. Thus, by the above result, the origin x = 0 of Eq. (116) is asymptotically stable in the large.

I. Linear Systems Revisited

One of the great drawbacks of the Lyapunov theory, as developed above, is that there exist no general rules of choosing v functions, which are called Lyapunov functions. However, for the case of linear systems,

$$x' = Ax \tag{L}$$

it is possible to construct Lyapunov functions, as shown by the following result:

Assume that the matrix A has no eigenvalues on the imaginary axis. Then, there exists a Lyapunov function v of the form:

$$v(x) = x^T B x, \qquad B = B^T \tag{117}$$

whose derivative $v'_{(L)}$, given by:

$$v'_{(L)}(x) = -x^T C x$$

where

$$-C = A^T B + B A, \qquad C = C^T \tag{118}$$

is definite (i.e., negative definite or positive definite).

The above result shows that if A is a stable matrix, then for (L), our earlier Lyapunov result for asymptotic stability in the large constitutes also necessary conditions for asymptotic stability. Also, if A is an unstable matrix with no eigenvalues on the imaginary axis, then according to the above result, our earlier instability result given in item 9 above yields also necessary conditions for instability.

In view of the above result, the v function in Eq. (117) is easily constructed by assuming a definite matrix C (either positive definite or negative definite) and by solving the Lyapunov matrix equation (118) for the $n(n + 1)/2$ unknown elements of the symmetric matrix B.

J. Domain of Attraction

Next, we should address briefly the problem of estimating the domain of attraction of an equilibrium x = 0. Such questions are important when x = 0 is not the only equilibrium of a system or when x = 0 is not asymptotically stable in the large.

For purposes of discussion, we consider a system,

$$x' = f(x) \tag{A}$$

and we assume that for (A) there exists a Lyapunov function v which is positive definite and radially unbounded. Also, we assume that over some domain $D \subset R^n$ containing the origin, $v'_{(A)}(x)$ is negative, except at the origin, where $v'_{(A)} = 0$. Now let $C_i$ denote $C_i = \{x \in R^n: v(x) \le c_i\}$, $c_i > 0$. Using similar reasoning as was done in connection with Eq. (107), we can show that as long as $C_i \subset D$, $C_i$ will be a subset of the domain of attraction of x = 0. Thus, if $c_i > 0$ is the largest number for which this is true, then it follows that $C_i$ will be contained in the domain of attraction of x = 0. The set $C_i$ will be the best estimate that we can obtain for our particular choice of v function.

K. Converse Theorems

Above we showed that for system (L) there exist actually converse Lyapunov (stability and instability) theorems. It turns out that for virtually every result which we gave above (in items 1–11) there exists a converse. Unfortunately, these Lyapunov converse theorems are not much help in constructing v functions in specific cases. For purposes of illustration, we cite here an example of such a converse theorem:

If f and $f_x = \partial f/\partial x$ are continuous on the set $(R^+ \times B(r))$ for some $r > 0$, and if the equilibrium x = 0 of (E) is uniformly asymptotically stable, then there exists a Lyapunov function v which is continuously differentiable on $(R^+ \times B(r_1))$ for some $r_1 > 0$ such that v is positive definite and decrescent and such that $v'_{(E)}$ is negative definite.

L. Comparison Theorems

Next, we consider once more some comparison results for (E), as was done in Section III. We shall assume that $f: R^+ \times B(r) \to R^n$ for some $r > 0$ and that f is continuous there. We begin by considering a scalar ordinary differential equation of the form,

$$y' = G(t, y) \tag{$\tilde{C}$}$$

where $y \in R$, $t \in R^+$, and $G: R^+ \times [0, r) \to R$ for some $r > 0$. Assume that G is continuous on $R^+ \times [0, r)$ and that $G(t, 0) = 0$ for all t. Also, assume that y = 0 is an isolated equilibrium for $(\tilde{C})$.

The following results are the basis of the comparison principle in the stability analysis of the isolated equilibrium x = 0 of (E):

Let f and G be continuous on their respective domains of definition. Let $v: R^+ \times B(r) \to R$ be a continuously differentiable, positive-definite function such that

$$v'_{(E)}(t, x) \le G(t, v(t, x)) \tag{119}$$

Then, the following statements are true: (1) If the trivial solution of $(\tilde{C})$ is stable, then the trivial solution of system (E) is stable. (2) If v is decrescent and if the trivial solution of $(\tilde{C})$ is uniformly stable, then the trivial solution of (E) is uniformly stable. (3) If v is decrescent and if the trivial solution of $(\tilde{C})$ is uniformly asymptotically stable, then the trivial solution of (E) is uniformly asymptotically stable. (4) If there are constants $a > 0$ and $b > 0$ such that $a|x|^b \le v(t, x)$, if v is decrescent, and if the trivial solution of $(\tilde{C})$ is exponentially stable, then the trivial solution of (E) is exponentially stable. (5) If $f: R^+ \times R^n \to R^n$, $G: R^+ \times R \to R$, $v: R^+ \times R^n \to R$ is decrescent and radially unbounded, if Eq. (119) holds for all $t \in R^+$, $x \in R^n$, and if the solutions of $(\tilde{C})$ are uniformly bounded (uniformly ultimately bounded), then the solutions of (E) are also uniformly bounded (uniformly ultimately bounded).

The above results enable us to analyze the stability and boundedness properties of an n-dimensional system (E), which may be complex, in terms of the corresponding properties of a one-dimensional comparison system $(\tilde{C})$, which may be quite a bit simpler. The generality and effectiveness of the above results can be improved and extended by considering vector-valued comparison equations and vector Lyapunov functions.

M. Lyapunov’s First Method

We close this section by answering the following question: Under what conditions does it make sense to linearize a nonlinear system about an equilibrium x = 0 and then deduce the properties of x = 0 from the corresponding linear system? This is known as Lyapunov’s first method or Lyapunov’s indirect method.

We consider systems of n real nonlinear first-order ordinary differential equations of the form:

$$x' = Ax + F(t, x) \tag{PE}$$

where $F: R^+ \times B(h) \to R^n$ for some $h > 0$ and A is a real $n \times n$ matrix. Here, we assume that Ax constitutes the linear part of the right-hand side of (PE) and F(t, x) represents the remaining terms which are of order higher than one in the various components of x. Such systems may arise in the process of linearizing nonlinear equations of the form:

$$x' = g(t, x) \tag{G}$$

or they may arise in some other fashion during the modeling process of a physical system.

To be more specific, let $g: R \times D \to R^n$ where D is some domain in $R^n$. If g is continuously differentiable on $R \times D$ and if $\phi$ is a given solution of (G) defined for all $t \ge t_0 \ge 0$, then we can linearize (G) about $\phi$ in the following manner. Define $y = x - \phi(t)$ so that

$$\begin{aligned} y' &= g(t, x) - g(t, \phi(t)) \\ &= g(t, y + \phi(t)) - g(t, \phi(t)) \\ &= (\partial g/\partial x)(t, \phi(t))\, y + G(t, y) \end{aligned}$$

Here,

$$G(t, y) = [g(t, y + \phi(t)) - g(t, \phi(t))] - (\partial g/\partial x)(t, \phi(t))\, y$$

is $o(|y|)$ as $|y| \to 0$ uniformly in t on compact subsets of $[t_0, \infty)$.

Of special interest is the case when g is independent of t (i.e., when $g(t, x) \equiv g(x)$) and $\phi(t) = \xi_0$ is a constant (equilibrium point). Under these conditions we have

$$y' = Ay + G(y)$$

where $A = (\partial g/\partial x)(x)|_{x = \xi_0}$, where $(\partial g/\partial x)(x)$ denotes the Jacobian of g(x).

By making use of the result for the Lyapunov function (117), we can readily prove the following results:

1. Let A be a real, constant, and stable $n \times n$ matrix and let $F: R^+ \times B(h) \to R^n$ be continuous in (t, x) and satisfy $F(t, x) = o(|x|)$ as $|x| \to 0$, uniformly in $t \in R^+$.

Then, the trivial solution of (PE) is uniformly asymptotically stable.

As a specific example, consider the Liénard equation given by:

$$x'' + f(x) x' + x = 0 \tag{120}$$

where $f: R \to R$ is a continuous function with $f(0) > 0$. We can rewrite Eq. (120) as:

$$x_1' = x_2, \qquad x_2' = -x_1 - f(0) x_2 + [f(0) - f(x_1)] x_2$$

and we can apply the above result with $x = (x_1, x_2)^T$,

$$A = \begin{bmatrix} 0 & 1 \\ -1 & -f(0) \end{bmatrix}, \qquad F(t, x) = \begin{bmatrix} 0 \\ [f(0) - f(x_1)] x_2 \end{bmatrix}$$

Noting that A is a stable matrix and that $F(t, x) = o(|x|)$ as $|x| \to 0$, uniformly in $t \in R^+$, we conclude that the trivial solution $(x, x') = (0, 0)$ of Eq. (120) is uniformly asymptotically stable.

2. Assume that A is a real $n \times n$ matrix with no eigenvalues on the imaginary axis and that at least one eigenvalue of A has positive real part. If $F: R^+ \times B(h) \to R^n$ is continuous and satisfies $F(t, x) = o(|x|)$ as $|x| \to 0$, uniformly in $t \in R^+$, then the trivial solution of (PE) is unstable.

As a specific example, consider the simple pendulum,

$$x'' + k \sin x = 0, \qquad k > 0 \tag{121}$$

Note that $x_e = \pi$, $x_e' = 0$ is an equilibrium of Eq. (121). Let $y = x - x_e$ so that

$$y'' + k \sin(y + \pi) = y'' - ky + k(\sin(y + \pi) + y) = 0$$

This equation can be put into the form (PE) with

$$A = \begin{bmatrix} 0 & 1 \\ k & 0 \end{bmatrix}, \qquad F(t, x) = \begin{bmatrix} 0 \\ -k(\sin(y + \pi) + y) \end{bmatrix}$$

Applying the above result we conclude that the equilibrium point $(\pi, 0)$ is unstable.

SEE ALSO THE FOLLOWING ARTICLES

ARTIFICIAL NEURAL NETWORKS • CALCULUS • COMPLEX ANALYSIS • DIFFERENTIAL EQUATIONS, PARTIAL • FUNCTIONAL ANALYSIS • MEASURE AND INTEGRATION

BIBLIOGRAPHY

Antsaklis, P. J., and Michel, A. N. (1997). “Linear Systems,” McGraw-Hill, New York.
Boyce, W. E., and DiPrima, R. C. (1997). “Elementary Differential Equations and Boundary Value Problems,” John Wiley & Sons, New York.
Brauer, F., and Nohel, J. A. (1969). “Qualitative Theory of Ordinary Differential Equations,” Benjamin, New York.
Carpenter, G. A., Cohen, M., and Grossberg, S. (1987). “Computing with neural networks,” Science 235, 1226–1227.
Coddington, E. A., and Levinson, N. (1955). “Theory of Ordinary Differential Equations,” McGraw-Hill, New York.
Hale, J. K. (1969). “Ordinary Differential Equations,” Wiley, New York.
Halmos, P. R. (1958). “Finite Dimensional Vector Spaces,” Van Nostrand, Princeton, NJ.
Hille, E. (1969). “Lectures on Ordinary Differential Equations,” Addison-Wesley, Reading, MA.
Hoffman, K., and Kunze, R. (1971). “Linear Algebra,” Prentice-Hall, Englewood Cliffs, NJ.
Hopfield, J. J. (1984). “Neurons with graded response have collective computational properties like those of two-state neurons,” Proc. Nat. Acad. Sci. U.S.A. 81, 3088–3092.
Kantorovich, L. V., and Akilov, G. P. (1964). “Functional Analysis in Normed Spaces,” Macmillan, New York.
Michel, A. N. (1983). “On the status of stability of interconnected systems,” IEEE Trans. Automat. Control 28(6), 639–653.
Michel, A. N., and Herget, C. J. (1993). “Applied Algebra and Functional Analysis,” Dover, New York.
Michel, A. N., and Miller, R. K. (1977). “Qualitative Analysis of Large-Scale Dynamical Systems,” Academic Press, New York.
Michel, A. N., and Wang, K. (1995). “Qualitative Theory of Dynamical Systems,” Dekker, New York.
Michel, A. N., Farrell, J. A., and Porod, W. (1989). “Qualitative analysis of neural networks,” IEEE Trans. Circuits Syst. 36(2), 229–243.
Miller, R. K., and Michel, A. N. (1982). “Ordinary Differential Equations,” Academic Press, New York.
Naylor, A. W., and Sell, G. R. (1971). “Linear Operator Theory in Engineering and Science,” Holt, Rinehart & Winston, New York.
Royden, H. L. (1963). “Real Analysis,” Macmillan, New York.
Rudin, W. (1953). “Principles of Mathematical Analysis,” McGraw-Hill, New York.
Simmons, G. F. (1972). “Differential Equations,” McGraw-Hill, New York.
P1: GNB Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN004E-173 June 8, 2001 17:20

Differential Equations, Partial

Martin Schechter
University of California, Irvine

I. Importance
II. How They Arise
III. Some Well-Known Equations
IV. Types of Equations
V. Problems Associated with Partial Differential Equations
VI. Methods of Solution

GLOSSARY

Boundary Set of points in the closure of a region not contained in its interior.
Bounded region Region that is contained in a sphere of finite radius.
Eigenvalue Scalar λ for which the equation Au = λu has a nonzero solution u.
Euclidean n-dimensional space $R^n$ Set of vectors $x = (x_1, \ldots, x_n)$ where each component $x_j$ is a real number.
Partial derivative Derivative of a function of more than one variable with respect to one of the variables keeping the other variables fixed.

A PARTIAL DIFFERENTIAL EQUATION is an equation in which a partial derivative of an unknown function appears. The order of the equation is the highest order of the partial derivatives (of an unknown function) appearing in the equation. If there is only one unknown function $u(x_1, \ldots, x_n)$, then a partial differential equation for u is of the form:

$$F\left(x_1, \ldots, x_n, u, \frac{\partial u}{\partial x_1}, \ldots, \frac{\partial u}{\partial x_n}, \frac{\partial^2 u}{\partial x_1^2}, \ldots, \frac{\partial^k u}{\partial x_1^k}, \ldots\right) = 0$$

One can have more than one unknown function and more than one equation involving some or all of the unknown functions. One then has a system of j partial differential equations in k unknown functions. The number of equations may be more or less than the number of unknown functions. Usually it is the same.

I. IMPORTANCE

One finds partial differential equations in practically every branch of physics, chemistry, and engineering. They are also found in other branches of the physical sciences and in the social sciences, economics, business, etc. Many parts of theoretical physics are formulated in terms of partial differential equations. In some cases, the axioms require that the states of physical systems be given by solutions of partial differential equations. In other cases, partial differential equations arise when one applies the axioms to specific situations.


II. HOW THEY ARISE

Partial differential equations arise in several branches of mathematics. For instance, the Cauchy–Riemann equations,

$$\frac{\partial u(x, y)}{\partial x} = \frac{\partial v(x, y)}{\partial y}, \qquad \frac{\partial u(x, y)}{\partial y} = -\frac{\partial v(x, y)}{\partial x}$$

must be satisfied if

$$f(z) = u(x, y) + i v(x, y)$$

is to be an analytic function of the complex variable $z = x + iy$. Thus, the rich and beautiful branch of mathematics known as analytic function theory is merely the study of solutions of a particular system of partial differential equations.

As a simple example of a partial differential equation arising in the physical sciences, we consider the case of a vibrating string. We assume that the string is a long, very slender body of elastic material that is flexible because of its extreme thinness and is tightly stretched between the points $x = 0$ and $x = L$ on the $x$ axis of the $x,y$ plane. Let $x$ be any point on the string, and let $y(x, t)$ be the displacement of that point from the $x$ axis at time $t$. We assume that the displacements of the string occur in the $x,y$ plane. Consider the part of the string between two close points $x_1$ and $x_2$. The tension $T$ in the string acts in the direction of the tangent to the curve formed by the string. The net force on the segment $[x_1, x_2]$ in the $y$ direction is

$$T \sin\varphi_2 - T \sin\varphi_1$$

where $\varphi_i$ is the angle between the tangent to the curve and the $x$ axis at $x_i$. According to Newton’s second law, this force must equal mass times acceleration. This is

$$\int_{x_1}^{x_2} \rho \, \frac{\partial^2 y}{\partial t^2} \, dx$$

where $\rho$ is the density (mass per unit length) of the string. Thus, dividing by $x_2 - x_1$ and letting $x_2 \to x_1$, we obtain in the limit

$$\frac{\partial}{\partial x}\left(T \sin\varphi\right) = \rho \, \frac{\partial^2 y}{\partial t^2}$$

We note that $\tan\varphi = \partial y/\partial x$, so that $\sin\varphi = \cos\varphi \, \partial y/\partial x$. If we make the simplifying assumption (justified or otherwise) that

$$\cos\varphi \approx 1, \qquad \frac{\partial}{\partial x}\cos\varphi \approx 0$$

we finally obtain:

$$T \, \frac{\partial^2 y}{\partial x^2} = \rho \, \frac{\partial^2 y}{\partial t^2}$$

which is the well-known equation of the vibrating string.

The derivation of partial differential equations from physical laws usually brings about simplifying assumptions that are difficult to justify completely. Most of the time they are merely plausibility arguments. For this reason, some branches of science have accepted partial differential equations as axioms. The success of these axioms is judged by how well their conclusions describe past observations and predict new ones.

III. SOME WELL-KNOWN EQUATIONS

Now we list several equations that arise in various branches of science. Interestingly, the same equation can arise in diverse and unrelated areas.

A. Laplace’s Equation

In $n$ dimensions this equation is given by:

$$\Delta u = 0$$

where

$$\Delta = \frac{\partial^2}{\partial x_1^2} + \cdots + \frac{\partial^2}{\partial x_n^2}$$

It arises in the study of electromagnetic phenomena (e.g., electrostatics, dielectrics, steady currents, magnetostatics), hydrodynamics (e.g., irrotational flow of a perfect fluid, surface waves), heat flow, gravitation, and many other branches of science. Solutions of Laplace’s equation are called harmonic functions.

B. Poisson’s Equation

$$\Delta u = f(x), \qquad x = (x_1, \ldots, x_n)$$

Here, the function $f(x)$ is given. This equation is found in many of the situations in which Laplace’s equation appears, since the latter is a special case.

C. Helmholtz’s Equation

$$\Delta u \pm \alpha^2 u = 0$$

This equation appears in the study of elastic waves, vibrating strings, bars and membranes, sound and acoustics, electromagnetic waves, and the operation of nuclear reactors.

D. The Heat (Diffusion) Equation

This equation is of the form:

$$u_t = a^2 \Delta u$$

where $u(x_1, \ldots, x_n, t)$ depends on the variable $t$ (time) as well. It describes heat conduction or diffusion processes.

E. The Wave Equation

$$\Box u \equiv \frac{1}{c^2}\frac{\partial^2 u}{\partial t^2} - \Delta u = 0$$

This describes the propagation of a wave with velocity $c$. This equation governs most cases of wave propagation.

F. The Telegraph Equation

$$\Box u + \sigma u_t = 0$$

This applies to some types of wave propagation.

G. The Scalar Potential Equation

$$\Box u = f(x, t)$$

H. The Klein–Gordon Equation

$$\Box u + \mu^2 u = 0$$

I. Maxwell’s Equations

$$\nabla \times \mathbf{H} = \sigma \mathbf{E} + \varepsilon \, \partial \mathbf{E}/\partial t$$

$$\nabla \times \mathbf{E} = -\mu \, \partial \mathbf{H}/\partial t$$

Here $\mathbf{E}$ and $\mathbf{H}$ are three-dimensional vector functions of position and time representing the electric and magnetic fields, respectively. This system of equations is used in electrodynamics.

J. The Cauchy–Riemann Equations

$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}, \qquad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}$$

These equations describe the real and imaginary parts of an analytic function of a complex variable.

K. The Schrödinger Equation

$$-\frac{\hbar^2}{2m} \Delta \psi + V(x)\psi = i\hbar \, \frac{\partial \psi}{\partial t}$$

This equation describes the motion of a quantum mechanical particle as it moves through a potential field. The function $V(x)$ represents the potential energy, while the unknown function $\psi(x)$ is allowed to have complex values.

L. Minimal Surfaces

In three dimensions a surface $z = u(x, y)$ having the least area for a given contour satisfies the equation:

$$\left(1 + u_y^2\right) u_{xx} - 2 u_x u_y u_{xy} + \left(1 + u_x^2\right) u_{yy} = 0$$

where $u_x = \partial u/\partial x$, etc.

M. The Navier–Stokes Equations

$$\frac{\partial u_j}{\partial t} + \sum_k u_k \frac{\partial u_j}{\partial x_k} + \frac{1}{\rho}\frac{\partial p}{\partial x_j} = \gamma \, \Delta u_j$$

$$\sum_k \frac{\partial u_k}{\partial x_k} = 0$$

This system describes viscous flow of an incompressible liquid with velocity components $u_k$ and pressure $p$.

N. The Korteweg–de Vries Equation

$$u_t + c u u_x + u_{xxx} = 0$$

This equation is used in the study of water waves.

IV. TYPES OF EQUATIONS

In describing partial differential equations, the following notations are helpful. Let $\mathbb{R}^n$ denote Euclidean $n$ dimensional space, and let $x = (x_1, \ldots, x_n)$ denote a point in $\mathbb{R}^n$. One can consider partial differential equations on various types of manifolds, but we shall restrict ourselves to $\mathbb{R}^n$. For a real- or complex-valued function $u(x)$ we shall use the following notation:

$$D_k u = \frac{\partial u}{i \, \partial x_k}, \qquad 1 \le k \le n$$

If $\mu = (\mu_1, \ldots, \mu_n)$ is a multi-index of nonnegative integers, we write:

$$x^\mu = x_1^{\mu_1} \cdots x_n^{\mu_n}, \qquad |\mu| = \mu_1 + \cdots + \mu_n, \qquad D^\mu = D_1^{\mu_1} \cdots D_n^{\mu_n}$$

Thus, $D^\mu$ is a partial derivative of order $|\mu|$.

A. Linear Equations

The most general linear partial differential equation of order $m$ is

$$Au \equiv \sum_{|\mu| \le m} a_\mu(x) D^\mu u = f(x) \tag{1}$$

It is called linear because the operator $A$ is linear, that is, satisfies:

$$A(\alpha u + \beta v) = \alpha Au + \beta Av$$

for all functions $u$, $v$ and all constant scalars $\alpha$, $\beta$. If the equation cannot be put in this form, it is nonlinear. In Section III, the examples in Sections A–K are linear.
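The linearity condition $A(\alpha u + \beta v) = \alpha Au + \beta Av$ of Section IV.A can be checked informally on a grid. The following is a minimal sketch, not taken from the article: it assumes finite-difference stand-ins for $\partial^2/\partial x^2$ (linear) and for the term $u u_x$ of the Korteweg–de Vries equation (nonlinear), and the function names `second_difference`, `burgers_term`, and `linearity_defect` are hypothetical.

```python
import math


def second_difference(u, h):
    """Central-difference stand-in for u -> d^2u/dx^2 at interior grid points."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) for i in range(1, len(u) - 1)]


def burgers_term(u, h):
    """Central-difference stand-in for the nonlinear term u * du/dx."""
    return [u[i] * (u[i + 1] - u[i - 1]) / (2.0 * h) for i in range(1, len(u) - 1)]


def combine(alpha, u, beta, v):
    """Pointwise linear combination alpha*u + beta*v of two grid functions."""
    return [alpha * a + beta * b for a, b in zip(u, v)]


def linearity_defect(op, u, v, alpha, beta, h):
    """Return max |op(alpha*u + beta*v) - (alpha*op(u) + beta*op(v))| over the grid."""
    lhs = op(combine(alpha, u, beta, v), h)
    rhs = combine(alpha, op(u, h), beta, op(v, h))
    return max(abs(a - b) for a, b in zip(lhs, rhs))


# Sample grid functions on [0, pi]; any smooth choices would do.
n = 50
h = math.pi / n
xs = [i * h for i in range(n + 1)]
u = [math.sin(x) for x in xs]
v = [math.cos(2.0 * x) for x in xs]

linear_defect = linearity_defect(second_difference, u, v, 2.0, -3.0, h)
nonlinear_defect = linearity_defect(burgers_term, u, v, 2.0, -3.0, h)
print("defect of the second difference:", linear_defect)
print("defect of the u*u_x term:      ", nonlinear_defect)
```

For the second difference the defect is at the level of rounding error, whereas for the $u u_x$ term it is of order one, which is the discrete counterpart of the statement that such an equation cannot be put in the form (1).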

B. Nonlinear Equations

In general, nonlinear partial differential equations are more difficult to solve than linear equations. There may be no solutions possible, as is the case for the equation:

$$|\partial u/\partial x| + 1 = 0$$

There is no general method of attack, and only special types of equations have been solved.

1. Quasilinear Equation

A partial differential equation is called quasilinear if it is linear with respect to its derivatives of highest order. This means that if one replaces the unknown function $u(x)$ and all its derivatives of order lower than the highest by known functions, the equation becomes linear. Thus, a quasilinear equation of order $m$ is of the form:

$$\sum_{|\mu| = m} a_\mu(x, u, Du, \ldots, D^{m-1}u) \, D^\mu u = f(x, u, Du, \ldots, D^{m-1}u) \tag{2}$$

where the coefficients $a_\mu$ depend only on $x$, $u$, and derivatives of $u$ up to order $m - 1$. Quasilinear equations are important in applications. In Section III, the examples in Sections L–N are quasilinear.

2. Semilinear Equation

A quasilinear equation is called semilinear if the coefficients of the highest-order derivatives depend only on $x$. Thus, a semilinear equation of order $m$ is of the form:

$$Au \equiv \sum_{|\mu| = m} a_\mu(x) D^\mu u = f(x, u, Du, \ldots, D^{m-1}u) \tag{3}$$

where $A$ is linear. Semilinear equations arise frequently in practice.

C. Elliptic Equations

The quasilinear equation (2) is called elliptic in a region $\Omega \subset \mathbb{R}^n$ if for every function $v(x)$ the only real vector $\xi = (\xi_1, \ldots, \xi_n) \in \mathbb{R}^n$ that satisfies:

$$\sum_{|\mu| = m} a_\mu(x, v, Dv, \ldots, D^{m-1}v) \, \xi^\mu = 0$$

is $\xi = 0$. It is called uniformly elliptic in $\Omega$ if there is a constant $c_0 > 0$ independent of $x$ and $v$ such that:

$$c_0 |\xi|^m \le \sum_{|\mu| = m} a_\mu(x, v, Dv, \ldots, D^{m-1}v) \, \xi^\mu$$

where $|\xi|^2 = \xi_1^2 + \cdots + \xi_n^2$. The equations in Sections III.A–C and III.J–L are elliptic equations or systems.

D. Parabolic Equations

When $m = 2$, the quasilinear equation (2) becomes:

$$\sum_{j,k} a_{jk}(x, u, Du) \frac{\partial^2 u}{\partial x_j \, \partial x_k} = f(x, u, Du) \tag{4}$$

We shall say that Eq. (4) is parabolic in a region $\Omega \subset \mathbb{R}^n$ if for every choice of the function $v(x)$, the matrix $(a_{jk}(x, v, Dv))$ has a vanishing determinant for each $x \in \Omega$. The equation in Section III.D is a parabolic equation.

E. Hyperbolic Equations

Equation (4) will be called ultrahyperbolic in $\Omega$ if for each $v(x)$ the matrix $(a_{jk}(x, v, Dv))$ has some positive, some negative, and no vanishing eigenvalues for each $x \in \Omega$. It will be called hyperbolic if all but one of the eigenvalues of this matrix have the same sign and none of them vanish in $\Omega$. The equations in Sections III.E–H are hyperbolic equations.

The only time Eq. (4) is neither parabolic nor ultrahyperbolic is when all the eigenvalues of the matrix have the same sign with none vanishing. In this case, Eq. (4) is elliptic as described earlier.

F. Equations of Mixed Type

If the coefficients of Eq. (1) are variable, it is possible that the equation will be of one type in one region and of another type in a different region. A simple example is

$$\partial^2 u/\partial x^2 - x \, \partial^2 u/\partial y^2 = 0$$

in two dimensions. In the region $x > 0$ it is hyperbolic, while in the region $x < 0$ it is elliptic. (It becomes parabolic on the line $x = 0$.)

G. Other Types

The type of an equation is very important in determining what problems can be solved for it and what kind of solutions it will have. As we saw, some equations can be of different types in different regions. One can define all three types for higher order equations, but most higher order equations will not fall into any of the three categories.

V. PROBLEMS ASSOCIATED WITH PARTIAL DIFFERENTIAL EQUATIONS

In practice one is rarely able to determine the most general solution of a partial differential equation. Usually one looks for a solution satisfying additional conditions. One may wish to prescribe the unknown function and/or some of its derivatives on part or all of the boundary of the region

in question. We call this a boundary-value problem. If the equation involves the variable $t$ (time) and the additional conditions are prescribed at some time $t = t_0$, we usually refer to it as an initial-value problem.

A problem associated with a partial differential equation is called well posed if:

1. A solution exists for all possible values of the given data.
2. The solution is unique.
3. The solution depends in a continuous way on the given data.

The reason for the last requirement is that in most cases the given data come from various measurements. It is important that a small error in measurement of the given data should not produce a large error in the solution. The method of measuring the size of an error in the given data and in the solution is a basic question for each problem. There is no standard method; it varies from problem to problem.

The kinds of problems that are well posed for an equation depend on the type of the equation. Problems that are suitable for elliptic equations are not suitable for hyperbolic or parabolic equations. The same holds true for each of the types. We illustrate this with several examples.

A. Dirichlet’s Problem

For a region $\Omega \subset \mathbb{R}^n$, Dirichlet’s problem for Eq. (2) is to prescribe $u$ and all its derivatives up to order $\tfrac{1}{2}m - 1$ on the boundary $\partial\Omega$ of $\Omega$. This problem is well posed only for elliptic equations. If $m = 2$, only the function $u$ is prescribed on the boundary. If $\Omega$ is unbounded, one may have to add a condition at infinity.

It is possible that the Dirichlet problem is not well posed for a linear elliptic equation of the form (1) because 0 is an eigenvalue of the operator $A$. This means that there is a function $w(x) \not\equiv 0$ that vanishes together with all derivatives up to order $\tfrac{1}{2}m - 1$ on $\partial\Omega$ and satisfies $Aw = 0$ in $\Omega$. Thus, any solution of the Dirichlet problem for Eq. (1) is not unique, for we can always add a multiple of $w$ to it to obtain another solution. Moreover, it is easily checked that one can solve the Dirichlet problem for Eq. (1) only if:

$$\int_\Omega f(x) w(x) \, dx$$

is a constant depending on $w$ and the given data. Thus, we cannot solve the Dirichlet problem for all values of the given data. When $\Omega$ is bounded, one can usually remedy the situation by considering the equation,

$$Au + \varepsilon u = f \tag{5}$$

in place of Eq. (1) for $\varepsilon$ sufficiently small.

B. The Neumann Problem

As in the case of the Dirichlet problem, the Neumann problem is well posed only for elliptic operators. The Neumann problem for Eq. (2) in a region $\Omega$ is to prescribe on $\partial\Omega$ the normal derivatives of $u$ from order $\tfrac{1}{2}m$ to $m - 1$. (In both the Dirichlet and Neumann problems, exactly $\tfrac{1}{2}m$ normal derivatives are prescribed on $\partial\Omega$. In the Dirichlet problem, the first $\tfrac{1}{2}m$ are prescribed [starting from the zeroth-order derivative, $u$ itself]. In the Neumann problem the next $\tfrac{1}{2}m$ normal derivatives are prescribed.)

As in the case of the Dirichlet problem, the Neumann problem for a linear equation of the form (1) can fail to be well posed because 0 is an eigenvalue of $A$. Again, this can usually be corrected by considering Eq. (5) in place of Eq. (1) for $\varepsilon$ sufficiently small.

When $m = 2$, the Neumann problem consists of prescribing the normal derivative $\partial u/\partial n$ of $u$ on $\partial\Omega$. In the case of Laplace’s or Poisson’s equation, it is easily seen that 0 is indeed an eigenvalue if $\Omega$ is bounded, for then any constant function is a solution of:

$$\Delta w = 0 \ \text{in } \Omega, \qquad \partial w/\partial n = 0 \ \text{on } \partial\Omega \tag{6}$$

Thus, adding any constant to a solution gives another solution. Moreover, we can solve the Neumann problem:

$$\Delta u = f \ \text{in } \Omega, \qquad \frac{\partial u}{\partial n} = g \ \text{on } \partial\Omega \tag{7}$$

only if

$$\int_\Omega f(x) \, dx = \int_{\partial\Omega} g \, ds$$

Thus, we cannot solve Eq. (7) for all $f$ and $g$. If $\Omega$ is unbounded, one usually requires that the solution of the Neumann problem vanish at infinity. This removes 0 as an eigenvalue, and the problem is well posed.

C. The Robin Problem

When $m = 2$, the Robin problem for Eq. (2) consists of prescribing:

$$Bu = \alpha \, \partial u/\partial n + \beta u \tag{8}$$

on $\partial\Omega$, where $\alpha(x)$, $\beta(x)$ are functions and $\alpha(x) \ne 0$ on $\partial\Omega$. If $\beta(x) \equiv 0$, this reduces to the Neumann problem. Again this problem is well posed only for elliptic equations.

D. Mixed Problems

Let $B$ be defined by Eq. (8), and assume that $\alpha(x)^2 + \beta(x)^2 \ne 0$ on $\partial\Omega$. Consider the boundary value problem consisting of finding a solution of Eq. (4) in $\Omega$ and prescribing $Bu$ on $\partial\Omega$. On those parts of $\partial\Omega$ where $\alpha(x) = 0$,

we are prescribing Dirichlet data. On those parts of $\partial\Omega$ where $\beta(x) = 0$, we are prescribing Neumann data. On the remaining sections of $\partial\Omega$ we are prescribing Robin data. This is an example of a mixed boundary-value problem in which one prescribes different types of data on different parts of the boundary. Other examples are provided by parabolic and hyperbolic equations, to be discussed later.

E. General Boundary-Value Problems

For an elliptic equation of the form (2) one can consider general boundary conditions of the form:

$$B_j u \equiv \sum_{|\mu| \le m_j} b_{j\mu} D^\mu u = g_j \ \text{on } \partial\Omega, \qquad 1 \le j \le m/2 \tag{9}$$

Such boundary-value problems can be well posed provided the operators $B_j$ are independent in a suitable sense and do not “contradict” each other or Eq. (2).

F. The Cauchy Problem

For Eq. (2), the Cauchy problem consists of prescribing all derivatives of $u$ up to order $m - 1$ on a smooth surface $S$ and solving Eq. (2) for $u$ in a neighborhood of $S$. An important requirement for the Cauchy problem to have a solution is that the boundary conditions not “contradict” the equation on $S$. This means that the coefficient of $\partial^m u/\partial n^m$ in Eq. (2) should not vanish on $S$. Otherwise, Eq. (2) and the Cauchy boundary conditions,

$$\partial^k u/\partial n^k = g_k \ \text{on } S, \qquad k = 0, \ldots, m - 1 \tag{10}$$

involve only the functions $g_k$ on $S$. This is sure to cause a contradiction unless $f$ is severely restricted. When this happens we say that the surface $S$ is characteristic for Eq. (2). Thus, for the Cauchy problem to have a solution without restricting $f$, it is necessary that the surface $S$ be noncharacteristic. In the quasilinear case, the coefficient of $\partial^m u/\partial n^m$ in Eq. (2) depends on the $g_k$. Thus, the Cauchy data (10) will play a role in determining whether or not the surface $S$ is characteristic for Eq. (2). The Cauchy–Kowalewski theorem states that for a noncharacteristic analytic surface $S$, real analytic Cauchy data $g_k$, and real analytic coefficients $a_\mu$, $f$ in Eq. (2), the Cauchy problem (2) and (10) has a unique real analytic solution in the neighborhood of $S$. This is true irrespective of the type of the equation. However, this does not mean that the Cauchy problem is well posed for all types of equations. In fact, the hypotheses of the Cauchy–Kowalewski theorem are satisfied for the Cauchy problem,

$$u_{xx} + u_{yy} = 0, \qquad y > 0$$

$$u(x, 0) = 0, \qquad u_y(x, 0) = n^{-1} \sin nx$$

and, indeed, it has a unique analytic solution:

$$u(x, y) = n^{-2} \sinh ny \, \sin nx$$

The function $n^{-1} \sin nx$ tends uniformly to 0 as $n \to \infty$, but the solution does not become small as $n \to \infty$ for $y \ne 0$. It can be shown that the Cauchy problem is well posed only for hyperbolic equations.

VI. METHODS OF SOLUTION

There is no general approach for finding solutions of partial differential equations. Indeed, there exist linear partial differential equations with smooth coefficients having no solutions in the neighborhood of a point. For instance, the equation:

$$\frac{\partial u}{\partial x_1} + i \frac{\partial u}{\partial x_2} + 2i(x_1 + i x_2) \frac{\partial u}{\partial x_3} = f(x_3) \tag{11}$$

has no solution in the neighborhood of the origin unless $f(x_3)$ is a real analytic function of $x_3$. Thus, if $f$ is infinitely differentiable but not analytic, Eq. (11) has no solution in the neighborhood of the origin. Even when partial differential equations have solutions, we cannot find the “general” solution. We must content ourselves with solving a particular problem for the equation in question. Even then we are rarely able to write down the solution in closed form. We are lucky if we can derive a formula that will enable us to calculate the solution in some way, such as a convergent series or iteration scheme. In many cases, even this is unattainable. Then, one must be satisfied with an abstract theorem stating that the problem is well posed. Sometimes the existence theorem does provide a method of calculating the solution; more often it does not.

Now we describe some of the methods that can be used to obtain solutions in specific situations.

A. Separation of Variables

Consider a vibrating string stretched along the $x$ axis from $x = 0$ to $x = \pi$ and fixed at its end points. We can assign the initial displacement and velocity. Thus, we are interested in solving the mixed initial and boundary value problem for the displacement $u(x, t)$:

$$\Box u = 0, \qquad 0 < x < \pi, \quad t > 0 \tag{12}$$

$$u(x, 0) = f(x), \qquad u_t(x, 0) = g(x), \qquad 0 \le x \le \pi \tag{13}$$

$$u(0, t) = u(\pi, t) = 0, \qquad t \ge 0 \tag{14}$$

We begin by looking for a solution of Eq. (12) of the form:

$$u(x, t) = X(x) T(t)$$

Such a function will be a solution of Eq. (12) only if:

$$\frac{T''}{T} = c^2 \frac{X''}{X} \tag{15}$$

Since the left-hand side of Eq. (15) is a function of $t$ only and the right-hand side is a function of $x$ only, both sides are constant. Thus, there is a constant $K$ such that $X''(x) = K X(x)$. If $K = \lambda^2 > 0$, $X$ must be of the form:

$$X = A e^{-\lambda x} + B e^{\lambda x}$$

For Eq. (14) to be satisfied, we must have

$$X(0) = X(\pi) = 0 \tag{15'}$$

This can happen only if $A = B = 0$. If $K = 0$, then $X$ must be of the form:

$$X = A + Bx$$

Again, this can happen only if $A = B = 0$. If $K = -\lambda^2 < 0$, then $X$ is of the form:

$$X = A \cos \lambda x + B \sin \lambda x$$

This can satisfy Eq. (15') only if $A = 0$ and $B \sin \lambda\pi = 0$. Thus, $X$ fails to vanish identically only if $\lambda$ is an integer $n$. Moreover, $T$ satisfies:

$$T'' + n^2 c^2 T = 0$$

The general solution for this is

$$T = A \cos nct + B \sin nct$$

Thus,

$$u(x, t) = \sin nx \, (A_n \cos nct + B_n \sin nct)$$

is a solution of Eqs. (12) and (14) for each integer $n$. However, it will not satisfy Eq. (13) unless $f(x)$, $g(x)$ are of a special form. The linearity of the operator $\Box$ allows one to add solutions of Eqs. (12) and (14). Thus,

$$u(x, t) = \sum_{n=1}^{\infty} \sin nx \, (A_n \cos nct + B_n \sin nct) \tag{16}$$

will be a solution provided the series converges. Moreover, it will satisfy Eq. (13) if:

$$f(x) = \sum_{n=1}^{\infty} A_n \sin nx, \qquad g(x) = \sum_{n=1}^{\infty} n c B_n \sin nx$$

This will be true if $f(x)$, $g(x)$ are expandable in a Fourier sine series. If they are, then the coefficients $A_n$, $B_n$ are given by:

$$A_n = \frac{2}{\pi} \int_0^\pi f(x) \sin nx \, dx, \qquad B_n = \frac{2}{nc\pi} \int_0^\pi g(x) \sin nx \, dx$$

With these values, the series (16) converges and gives a solution of Eqs. (12)–(14).

B. Fourier Transforms

If we desire to determine the temperature $u(x, t)$ of a system in $\mathbb{R}^n$ with no heat added or removed and initial temperature given, we must solve:

$$u_t = a^2 \Delta u, \qquad x \in \mathbb{R}^n, \quad t > 0 \tag{17}$$

$$u(x, 0) = \varphi(x), \qquad x \in \mathbb{R}^n \tag{18}$$

If we apply the Fourier transform,

$$\hat{f}(\xi) = (2\pi)^{-n/2} \int e^{-i\xi x} f(x) \, dx \tag{19}$$

where $\xi x = \xi_1 x_1 + \cdots + \xi_n x_n$, we obtain:

$$\hat{u}_t(\xi, t) + a^2 |\xi|^2 \hat{u}(\xi, t) = 0$$

The solution satisfying Eq. (18) is

$$\hat{u}(\xi, t) = e^{-a^2 |\xi|^2 t} \hat{\varphi}(\xi)$$

If we now make use of the inverse Fourier transform,

$$f(x) = (2\pi)^{-n/2} \int e^{i\xi x} \hat{f}(\xi) \, d\xi$$

we have

$$u(x, t) = \int K(x - y, t) \varphi(y) \, dy$$

where

$$K(x, t) = (2\pi)^{-n} \int e^{i x \xi - a^2 |\xi|^2 t} \, d\xi$$

If we introduce the new variable,

$$\eta = a t^{1/2} \xi - \tfrac{1}{2} i a^{-1} t^{-1/2} x$$

this becomes

$$(2\pi)^{-n} a^{-n} t^{-n/2} e^{-|x|^2/4a^2 t} \int e^{-|\eta|^2} \, d\eta = (4\pi a^2 t)^{-n/2} e^{-|x|^2/4a^2 t}$$

This suggests that a solution of Eqs. (17) and (18) is given by:

$$u(x, t) = (4\pi a^2 t)^{-n/2} \int e^{-|x - y|^2/4a^2 t} \varphi(y) \, dy \tag{20}$$
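Formula (20) can be evaluated numerically. The following is a minimal sketch for the case $n = 1$, not taken from the article: it assumes a plain trapezoidal rule on a window of ten kernel standard deviations around $x$, and the names `heat_solution` and `gaussian_exact` are hypothetical. For the initial temperature $\varphi(y) = e^{-y^2}$ the integral in (20) is a convolution of two Gaussians and equals $(1 + 4a^2 t)^{-1/2} e^{-x^2/(1 + 4a^2 t)}$, which gives a closed form to compare against.

```python
import math


def heat_solution(phi, x, t, a=1.0, n_sigma=10.0, steps=4000):
    """Approximate formula (20) for n = 1 with the trapezoidal rule.

    The kernel exp(-(x - y)^2 / (4 a^2 t)) has standard deviation
    sqrt(2 a^2 t), so the integral is truncated to n_sigma deviations
    on either side of x (an assumption suited to bounded phi).
    """
    if t <= 0.0:
        raise ValueError("t must be positive")
    sigma = math.sqrt(2.0 * a * a * t)
    lo = x - n_sigma * sigma
    h = 2.0 * n_sigma * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(-((x - y) ** 2) / (4.0 * a * a * t)) * phi(y)
    return total * h / math.sqrt(4.0 * math.pi * a * a * t)


def gaussian_exact(x, t, a=1.0):
    """Closed form of (20) for phi(y) = exp(-y^2) in one dimension."""
    s = 1.0 + 4.0 * a * a * t
    return math.exp(-x * x / s) / math.sqrt(s)


approx = heat_solution(lambda y: math.exp(-y * y), 0.7, 0.5)
exact = gaussian_exact(0.7, 0.5)
constant = heat_solution(lambda y: 1.0, 0.3, 2.0, a=0.5)
print("quadrature:", approx, " closed form:", exact)
print("constant initial data stays constant:", constant)
```

The second call uses the constant initial temperature $\varphi \equiv 1$, for which (20) returns 1 at every point and time because the kernel integrates to one.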


It is easily checked that this is indeed the case if $\varphi$ is continuous and bounded. However, the solution is not unique unless one places more restriction on the solution.

C. Fundamental Solutions, Green’s Function

Let

$$K(x, y) = \frac{|x - y|^{2-n}}{(2 - n)\omega_n} + h(x), \qquad n > 2,$$

$$K(x, y) = \frac{\log |x - y|}{2\pi} + h(x), \qquad n = 2$$

where $\omega_n = 2\pi^{n/2}/\Gamma(\tfrac{1}{2}n)$ is the surface area of the unit sphere in $\mathbb{R}^n$ and $h(x)$ is a harmonic function in a bounded domain $\Omega \subset \mathbb{R}^n$ (i.e., $h(x)$ is a solution of $\Delta h = 0$ in $\Omega$). If the boundary $\partial\Omega$ of $\Omega$ is sufficiently regular and $h \in C^2(\bar\Omega)$, then Green’s theorem implies for $y \in \Omega$:

$$u(y) = \int_\Omega K(x, y) \Delta u(x) \, dx + \int_{\partial\Omega} \left( u(x) \frac{\partial K(x, y)}{\partial n} - K(x, y) \frac{\partial u}{\partial n} \right) dS_x \tag{21}$$

for all $u \in C^2(\bar\Omega)$. The function $K(x, y)$ is called a fundamental solution of the operator $\Delta$. If, in addition, $K(x, y)$ vanishes for $x \in \partial\Omega$, it is called a Green’s function, and we denote it by $G(x, y)$. In this case,

$$u(y) = \int_{\partial\Omega} u(x) \frac{\partial G(x, y)}{\partial n} \, dS_x, \qquad y \in \Omega \tag{22}$$

for all $u \in C^2(\bar\Omega)$ that are harmonic in $\Omega$. Conversely, this formula can be used to solve the Dirichlet problem for Laplace’s equation if we know the Green’s function for $\Omega$, since the right-hand side of Eq. (22) is harmonic in $\Omega$ and involves only the values of $u(x)$ on $\partial\Omega$. It can be shown that if the prescribed boundary values are continuous, then indeed Eq. (22) does give a solution to the Dirichlet problem for Laplace’s equation.

It is usually very difficult to find the Green’s function for an arbitrary domain. It can be computed for geometrically symmetric regions. In the case of a ball of radius $R$ and center 0, it is given by:

$$G(x, y) = K(x, y) - (|y|/R)^{2-n} K(x, R^2 |y|^{-2} y)$$

D. Hilbert Space Methods

Let

$$P(D) = \sum_{|\mu| = m} a_\mu D^\mu$$

be a positive, real, constant coefficient, homogeneous, elliptic partial differential operator of order $m = 2r$. This means that $P(D)$ has only terms of order $m$, and

$$c_0 |\xi|^m \le P(\xi) \le C_0 |\xi|^m, \qquad \xi \in \mathbb{R}^n \tag{23}$$

holds for positive constants $c_0$, $C_0$. We introduce the norm,

$$|v|_r = \left( \int |\xi|^m |\hat{v}(\xi)|^2 \, d\xi \right)^{1/2}$$

for functions $v$ in $C_0^\infty(\Omega)$, the set of infinitely differentiable functions that vanish outside $\Omega$. Here, $\Omega$ is a bounded domain in $\mathbb{R}^n$ with smooth boundary, and $\hat{v}(\xi)$ denotes the Fourier transform given by Eq. (19). By Eq. (23) we see that $(P(D)v, v)$ is equivalent to $|v|_r^2$ on $C_0^\infty(\Omega)$, where

$$(u, v) = \int_\Omega u(x) v(x) \, dx$$

Let

$$a(u, v) = (u, P(D)v), \qquad u, v \in C_0^\infty(\Omega) \tag{24}$$

If $u \in C^m(\bar\Omega)$ is a solution of the Dirichlet problem,

$$P(D)u = f \ \text{in } \Omega \tag{25}$$

$$D^\mu u = 0 \ \text{on } \partial\Omega, \qquad |\mu| < r \tag{26}$$

then it satisfies

$$a(u, v) = (f, v), \qquad v \in C_0^\infty(\Omega) \tag{27}$$

Conversely, if $\partial\Omega$ is sufficiently smooth and $u \in C^m(\bar\Omega)$ satisfies Eq. (27), then it is a solution of the Dirichlet problem, Eqs. (25) and (26). This is readily shown by integration by parts. Thus, one can solve Eqs. (25) and (26) by finding a function $u \in C^m(\bar\Omega)$ satisfying Eq. (27). Since $a(u, v)$ is a scalar product, it would be helpful if we had a theorem stating that the expression $(f, v)$ can be represented by the expression $a(u, v)$ for some $u$. Such a theorem exists (the Riesz representation theorem), provided $a(u, v)$ is the scalar product of a Hilbert space and

$$|(f, v)| \le C a(v, v)^{1/2} \tag{28}$$

We can fit our situation to the theorem by completing $C_0^\infty(\Omega)$ with respect to the $|v|_r$ norm and making use of the fact that $a(v, v)^{1/2}$ and $|v|_r$ are equivalent on $C_0^\infty(\Omega)$ and consequently on the completion $H_0^r(\Omega)$. Moreover, inequality (28) follows from the Poincaré inequality,

$$\|v\| \le M^r |v|_r, \qquad v \in C_0^\infty(\Omega) \tag{29}$$

which holds if $\Omega$ is contained in a cube of side length $M$. Thus, by Schwarz’s inequality,

$$|(f, v)| \le \|f\| \, \|v\| \le \|f\| M^r |v|_r \le c \, a(v, v)^{1/2}$$

The Riesz representation theorem now tells us that there is a $u \in H_0^r(\Omega)$ such that Eq. (27) holds. If we can show that $u$ is in $C^m(\bar\Omega)$, it will follow that $u$ is indeed a solution of the Dirichlet problem, Eqs. (25) and (26). As it stands now, $u$ is only a weak solution of Eqs. (25) and (26). However,

it can be shown that if $\partial\Omega$ and $f$ are sufficiently smooth, then $u$ will be in $C^m(\bar\Omega)$ and will be a solution of Eqs. (25) and (26).

The proof of the Poincaré inequality (29) can be given as follows. It suffices to prove it for $r = 1$ and $\Omega$ contained in the slab $0 < x_1 < M$. Since $v \in C_0^\infty(\Omega)$,

$$v(x_1, \ldots, x_n)^2 = \left( \int_0^{x_1} v_{x_1}(t, x_2, \ldots, x_n) \, dt \right)^2 \le x_1 \int_0^{x_1} v_{x_1}(t, x_2, \ldots, x_n)^2 \, dt \le M \int_0^M v_{x_1}(t, x_2, \ldots, x_n)^2 \, dt$$

Thus,

$$\int_0^M v(x_1, \ldots, x_n)^2 \, dx_1 \le M^2 \int_0^M v_{x_1}(t, x_2, \ldots, x_n)^2 \, dt$$

If we now integrate over $x_2, \ldots, x_n$, we obtain:

$$\|v\| \le M \|v_{x_1}\|$$

But, by Parseval’s identity,

$$\int |v_{x_1}|^2 \, dx = \int |\hat{v}_{x_1}|^2 \, d\xi = \int \xi_1^2 |\hat{v}|^2 \, d\xi \le \int |\xi|^2 |\hat{v}|^2 \, d\xi = |v|_1^2$$

E. Iterations

An important method of solving both linear and nonlinear problems is that of successive approximations. We illustrate this method for the following Dirichlet problem:

$$\Delta u = f(x, u), \qquad x \in \Omega \tag{30}$$

$$u = 0 \ \text{on } \partial\Omega \tag{31}$$

We assume that the boundary of $\Omega$ is smooth and that $f(x, t) = f(x_1, \ldots, x_n, t)$ is differentiable with respect to all arguments. Also we assume that:

$$|f(x, t)| \le N, \qquad x \in \Omega, \quad -\infty < t < \infty \tag{32}$$

$$|\partial f(x, t)/\partial t| \le \psi(t), \qquad x \in \Omega \tag{33}$$

where $\psi(t)$ is a continuous function.

First, we note that for every compact subset $G$ of $\Omega$ there is a constant $C$ such that:

$$\max_G |\nabla v| \le C \left( \sup_\Omega |\Delta v| + \sup_\Omega |v| \right) \tag{34}$$

for all $v \in C^2(\Omega)$. Assume this for the moment, and let $w(x)$ be the solution of the Dirichlet problem,

$$\Delta w = -N \ \text{in } \Omega, \qquad w = 0 \ \text{on } \partial\Omega \tag{35}$$

It is clear that $w(x) \ge 0$ in $\Omega$. For otherwise it would have a negative interior minimum in $\Omega$. At such a point, one has $\partial^2 w/\partial x_k^2 \ge 0$ for each $k$, and consequently $\Delta w \ge 0$, contradicting Eq. (35). Since $w \in C(\bar\Omega)$, there is a constant $C_1$ such that:

$$0 \le w(x) \le C_1, \qquad x \in \Omega$$

Let

$$K = \max_{|t| \le C_1} \psi(t)$$

Then, by Eq. (33):

$$|\partial f(x, t)/\partial t| \le K, \qquad |t| \le C_1 \tag{36}$$

Consequently,

$$f(x, t) - f(x, s) \le K(t - s) \tag{37}$$

when $-C_1 \le s \le t \le C_1$. We define a sequence $\{u_k\}$ of functions as follows. We take $u_0 = w$ and once $u_{k-1}$ has been defined, we let $u_k$ be the solution of the Dirichlet problem:

$$L u_k \equiv \Delta u_k - K u_k = f(x, u_{k-1}) - K u_{k-1} \tag{38}$$

in $\Omega$ with $u_k = 0$ on $\partial\Omega$. The solution exists by the theory of linear elliptic equations. We show by induction that:

$$-w \le u_k \le u_{k-1} \le w \tag{39}$$

To see this for $k = 1$, note that:

$$L(u_1 - w) = f(x, w) - Kw - \Delta w + Kw \ge 0$$

From this we see that $u_1 \le w$ in $\Omega$. Indeed, if $u_1 - w$ had an interior positive maximum in $\Omega$, we would have

$$L(u_1 - w) = \Delta(u_1 - w) - K(u_1 - w) < 0$$

at such a point. Thus, $u_1 \le w$ in $\Omega$. Also we note:

$$\Delta(u_1 + w) = f(x, w) + K(u_1 - w) + \Delta w \le K(u_1 - w)$$

This shows that $u_1 + w$ cannot have a negative minimum inside $\Omega$. Hence, $u_1 + w \ge 0$ in $\Omega$, and Eq. (39) is verified for $k = 1$. Once we know it is verified for $k$, we note that:

$$L[u_{k+1} - u_k] = f(x, u_k) - f(x, u_{k-1}) - K(u_k - u_{k-1}) \ge 0$$

by Eq. (37). Thus, $u_{k+1} \le u_k$ in $\Omega$. Hence,

$$\Delta(u_{k+1} + w) = f(x, u_k) - K u_k + K u_{k+1} + \Delta w \le K(u_{k+1} - u_k)$$

Again, we deduce from this that $u_{k+1} + w \ge 0$ in $\Omega$. Hence, Eq. (39) holds for $k + 1$ and consequently for all $k$. In particular, we see that the $u_k$ are uniformly bounded in $\Omega$, and by Eq. (38) the same is true of the functions $\Delta u_k$.
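The monotone behavior asserted in Eq. (39) can be observed numerically. The following is a minimal sketch of a one-dimensional analogue, not taken from the article: it assumes the problem $u'' = f(x, u)$ on $(0, 1)$ with $u(0) = u(1) = 0$, central differences on a uniform grid, the sample nonlinearity $f(x, t) = \cos t$ (so $N = 1$), and the global bound $K = 1$ on $|\partial f/\partial t|$ in place of the maximum over $|t| \le C_1$. The names `solve_tridiagonal`, `monotone_iteration`, and `residual` are hypothetical.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def monotone_iteration(f, bound_n, const_k, m=99, sweeps=30):
    """Iterate the 1-D analogue of Eq. (38) on (0, 1) with zero boundary values."""
    h = 1.0 / (m + 1)
    xs = [(i + 1) * h for i in range(m)]
    # w solves w'' = -N with w(0) = w(1) = 0; the starting iterate is u_0 = w.
    w = [0.5 * bound_n * x * (1.0 - x) for x in xs]
    off = [1.0 / (h * h)] * m
    diag = [-2.0 / (h * h) - const_k] * m
    iterates = [w]
    for _ in range(sweeps):
        prev = iterates[-1]
        rhs = [f(x, p) - const_k * p for x, p in zip(xs, prev)]
        iterates.append(solve_tridiagonal(off, diag, off, rhs))
    return xs, w, iterates


def residual(f, xs, u):
    """max |u'' - f(x, u)| using central differences and zero boundary values."""
    h = xs[1] - xs[0]
    padded = [0.0] + list(u) + [0.0]
    return max(
        abs((padded[i - 1] - 2.0 * padded[i] + padded[i + 1]) / (h * h)
            - f(xs[i - 1], padded[i]))
        for i in range(1, len(padded) - 1)
    )


def f(x, t):
    """Sample right-hand side with |f| <= 1 and |df/dt| <= 1."""
    return math.cos(t)


xs, w, iterates = monotone_iteration(f, bound_n=1.0, const_k=1.0)
u = iterates[-1]
print("residual of the last iterate:", residual(f, xs, u))
```

In this discrete setting the iterates decrease pointwise from $w$ and stay above $-w$, mirroring Eq. (39), and the last iterate satisfies the difference equation up to rounding error.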

Hence, by Eq. (34), the first derivatives of the $u_k$ are uniformly bounded on compact subsets of $\Omega$. If we differentiate Eq. (38), we see that the sequence $\Delta(\partial u_k/\partial x_j)$ is uniformly bounded on compact subsets of $\Omega$ (here we make use of the continuous differentiability of $f$). If we now make use of Eq. (34) again, we see that the second derivatives of the $u_k$ are uniformly bounded on compact subsets of $\Omega$. Hence, by the Ascoli–Arzela theorem, there is a subsequence that converges together with its first derivatives uniformly on compact subsets of $\Omega$. Since the sequence $u_k$ is monotone, the whole sequence must converge to a continuous function $u$ that satisfies $|u(x)| \le w(x)$. Hence, $u$ vanishes on $\partial\Omega$. By Eq. (38), the functions $\Delta u_k$ must converge uniformly on compact subsets, and by Eq. (34), the same must be true of the first derivatives of the $u_k$. From the differentiated Eq. (38) we see that the $\Delta(\partial u_k/\partial x_j)$ converge uniformly on bounded subsets and consequently the same is true of the second derivatives of the $u_k$ by Eq. (34). Since the $u_k$ converge uniformly to $u$ in $\Omega$ and their second derivatives converge uniformly on bounded subsets, we see that $u \in C^2(\Omega)$ and $\Delta u_k \to \Delta u$. Hence,

$$\Delta u = \lim \Delta u_k = \lim [f(x, u_{k-1}) + K(u_k - u_{k-1})] = f(x, u)$$

and $u$ is the desired solution.

It remains to prove Eq. (34). For this purpose we let $\varphi(x)$ be a function in $C_0^\infty(\Omega)$ which equals one on $G$. Then we have, by Eq. (21),

$$\varphi(y) v(y) = \int K(x, y) \Delta(\varphi(x) v(x)) \, dx$$

(the boundary integrals vanish because $\varphi$ is 0 near $\partial\Omega$). Thus, if $y \in G$,

$$v(y) = \int K \{\varphi \Delta v + 2 \nabla\varphi \cdot \nabla v + v \Delta\varphi\} \, dx = \int \{\varphi K \Delta v - 2v \nabla K \cdot \nabla\varphi - v K \Delta\varphi\} \, dx$$

by integration by parts. We note that $\nabla\varphi$ vanishes near the singularity of $K$. Thus, we may differentiate under the integral sign to obtain:

$$\frac{\partial v(y)}{\partial y_j} = \int \left\{ \varphi \frac{\partial K}{\partial y_j} \Delta v - 2v \frac{\partial \nabla K}{\partial y_j} \cdot \nabla\varphi - v \frac{\partial K}{\partial y_j} \Delta\varphi \right\} dx$$

Consequently,

$$\left| \frac{\partial v}{\partial y_j} \right| \le \sup |\Delta v| \int \left| \frac{\partial K}{\partial y_j} \right| dx + \sup |v| \int \left( 2 \left| \frac{\partial \nabla K}{\partial y_j} \right| |\nabla\varphi| + |\Delta\varphi| \left| \frac{\partial K}{\partial y_j} \right| \right) dx$$

and all of the integrals are finite. This gives Eq. (34), and the proof is complete.

F. Variational Methods

In many situations, methods of the calculus of variations are useful in solving problems for partial differential equations, both linear and nonlinear. We illustrate this with a simple example. Suppose we wish to solve the problem,

$$-\sum_k \frac{\partial}{\partial x_k} \left( p_k(x) \frac{\partial u(x)}{\partial x_k} \right) + q(x) u(x) = 0 \ \text{in } \Omega \tag{40}$$

$$u(x) = g(x) \ \text{on } \partial\Omega \tag{41}$$

Assume that $p_k(x) \ge c_0$, $q(x) \ge c_0$, $c_0 > 0$ for $x \in \Omega$, $\Omega$ bounded, $\partial\Omega$ smooth, and that $g$ is in $C^1(\partial\Omega)$. We consider the expression,

$$a(u, v) = \frac{1}{2} \int_\Omega \left[ \sum_k p_k(x) \frac{\partial u(x)}{\partial x_k} \frac{\partial v(x)}{\partial x_k} + q(x) u(x) v(x) \right] dx$$

and put $a(u) = a(u, u)$. If $u \in C^2(\Omega) \cap C^0(\bar\Omega)$ satisfies Eq. (41), and

$$a(u) \le a(w) \tag{42}$$

for all $w$ satisfying Eq. (41), then it is readily seen that $u$ is a solution of Eq. (40). Let $v$ be a smooth function which vanishes on $\partial\Omega$. Then, for any scalar $\beta$,

$$a(u) \le a(u + \beta v) = a(u) + 2\beta a(u, v) + \beta^2 a(v)$$

and, consequently,

$$2\beta a(u, v) + \beta^2 a(v) \ge 0$$

for all $\beta$. This implies $a(u, v) = 0$. Integration by parts now yields Eq. (40). The problem now is to find a $u$ satisfying Eq. (42). We do this as follows. Let $H$ denote the Hilbert space obtained by completing $C^2(\bar\Omega)$ with respect to the norm $a(w)^{1/2}$, and let $H_0$ be the subspace of those functions in $H$ that vanish on $\partial\Omega$. Under the hypotheses given it can be shown that $H_0$ is a closed subspace of $H$. Let $w$ be an element of $H$ satisfying Eq. (41), and take a sequence $\{v_k\}$ of functions in $H_0$ such that:

$$a(w - v_k) \to d = \inf_{v \in H_0} a(w - v)$$

The parallelogram law for Hilbert space tells us that:

$$a(2w - v_j - v_k) + a(v_j - v_k) = 2a(w - v_j) + 2a(w - v_k)$$

But,

$$a(2w - v_j - v_k) = 4a\left(w - \tfrac{1}{2}(v_j + v_k)\right) \ge 4d$$

Hence,

$$4d + a(v_j - v_k) \le 2a(w - v_j) + 2a(w - v_k) \to 4d$$
P1: GNB Final Pages
Encyclopedia of Physical Science and Technology EN004E-173 June 8, 2001 17:20

Differential Equations, Partial 417

Thus,

    $a(v_j - v_k) \to 0$

and $\{v_k\}$ is a Cauchy sequence in $H_0$. Since $H_0$ is complete, there is a $v \in H_0$ such that $a(v_k - v) \to 0$. Put $u = w - v$. Then clearly $u$ satisfies Eqs. (41) and (42), hence $a(u, v) = 0$ for all $v \in H_0$. If $u$ is in $C^2(\Omega)$, then it will satisfy Eq. (40) as well. The rub is that functions in $H$ need not be in $C^2(\Omega)$. However, the day is saved if the $p_k(x)$ and $q(x)$ are sufficiently smooth. For then one can show that functions $u \in H$ which satisfy $a(u, v) = 0$ for all $v \in H_0$ are indeed in $C^2(\Omega)$.

G. Critical Point Theory

Many partial differential equations and systems are the Euler–Lagrange equations corresponding to a real valued functional $G$ defined on some Hilbert space $E$. Such a functional usually represents the energy or related quantity of the physical system governed by the equations. In such a case, solutions of the partial differential equations correspond to critical points of the functional. For instance, if the equation to be solved is

    $-\Delta u(x) = f(x, u), \quad x \in \Omega \subset \mathbb{R}^n$    (43)

then solutions of Eq. (43) are the critical points of the functional,

    $G(u) = \|\nabla u\|^2 - 2 \int_\Omega F(x, u)\, dx$

where the norm is that of $L^2(\Omega)$ and

    $F(x, t) = \int_0^t f(x, s)\, ds$

The history of this approach can be traced back to the calculus of variations in which equations of the form:

    $G'(u) = 0$    (44)

are the Euler–Lagrange equations of the functional $G$. The original method was to find maxima or minima of $G$ by solving Eq. (44) and then show that some of the solutions are extrema. This approach worked well for one-dimensional problems. In this case, it is easier to solve Eq. (44) than it is to find a maximum or minimum of $G$. However, in higher dimensions it was realized quite early that it is easier to find maxima and minima of $G$ than it is to solve Eq. (44). Consequently, the tables were turned, and critical point theory was devoted to finding extrema of $G$. This approach is called the direct method in the calculus of variations. If an extremum point of $G$ can be identified, it will automatically be a solution of Eq. (44).

The simplest extrema to find are global maxima and minima. For such points to exist one needs $G$ to be semibounded. However, this in itself does not guarantee that an extremum exists. All that can be derived from the semiboundedness is the existence of a sequence $\{u_k\} \subset E$ such that:

    $G(u_k) \to c, \quad G'(u_k) \to 0$    (45)

where $c$ is either the supremum or infimum of $G$. The existence of such a sequence does not guarantee that an extremum exists. However, it does so if one can show that the sequence has a convergent subsequence. This condition is known as the Palais–Smale condition. A sequence satisfying Eq. (45) is called a Palais–Smale sequence.

Recently, researchers have discovered several sets of sufficient conditions on $G$ which will guarantee the existence of a Palais–Smale sequence. If, in addition, the Palais–Smale condition holds, then one obtains a critical point. These conditions can apply to functionals which are not semibounded. Points found by this method are known as mountain pass points due to the geometrical interpretation that can be given. The situation can be described as follows. Suppose $Q$ is an open set in $E$ and there are two points $e_0$, $e_1$ such that $e_0 \in Q$, $e_1 \notin \bar{Q}$, and

    $G(e_0),\ G(e_1) \le b_0 = \inf_{\partial Q} G$    (46)

Then there exists a Palais–Smale sequence, Eq. (45), with $c$ satisfying:

    $b_0 \le c < \infty$    (47)

When $G$ satisfies Eq. (46) we say that it exhibits mountain pass geometry.

It was then discovered that other geometries (i.e., configurations) produce Palais–Smale sequences, as well. Consider the following situation. Assume that

    $E = M \oplus N$

is a decomposition of $E$ into the direct sum of closed subspaces with

    $\dim N < \infty$

Suppose there is an $R > 0$ such that

    $\sup_{N \cap \partial B_R} G \le b_0 = \inf_M G$

Then, again, there is a sequence satisfying Eqs. (45) and (47). Here, $B_R$ denotes the ball of radius $R$ in $E$ and $\partial B_R$ denotes its boundary.

In applying this method to solve Eq. (43), one discovers that the asymptotic behavior of $f(x, t)$ at $\infty$ plays an important role. One can consider several possibilities. When

    $\limsup_{|t| \to \infty} |f(x, t)/t| = \infty$    (48)

we say that problem (43) is superlinear. Otherwise, we call it sublinear. If

    $f(x, t)/t \to b_\pm(x)$ as $t \to \pm\infty$    (49)

and the $b_\pm$ are different, we say that it has a nonlinearity at $\infty$. An interesting special case of Eq. (49) is when the $b_\pm(x)$ are constants. If $b_-(x) \equiv a$ and $b_+(x) \equiv b$, we say that $(a, b)$ is in the Fučík spectrum $\Sigma$ if:

    $-\Delta u = b u^+ - a u^-$

has a nontrivial solution, where $u^\pm = \max\{\pm u, 0\}$. Because of its importance in solving Eq. (43), we describe this spectrum. It has been shown that emanating from each eigenvalue $\lambda_\ell$ of $-\Delta$, there are curves $\mu_\ell(a)$, $\nu_{\ell-1}(a)$ (which may coincide) which are strictly decreasing at least in the square $S = [\lambda_{\ell-1}, \lambda_{\ell+1}]^2$ and such that $(a, \mu_\ell(a))$ and $(a, \nu_{\ell-1}(a))$ are in $\Sigma$ in the square $S$. Moreover, the regions $b > \mu_\ell(a)$ and $b < \nu_{\ell-1}(a)$ are free of $\Sigma$ in $S$. These curves are known exactly only in the one-dimensional case (ordinary differential equations). In higher dimensions it is not known in general how many curves of $\Sigma$ emanate from each eigenvalue. It is known that there is at least one (when $\mu_\ell(a)$ and $\nu_{\ell-1}(a)$ coincide). If there are two or more curves emanating from an eigenvalue, the status of the region between them is unknown in general. If the eigenvalue is simple, there are at most two curves of $\Sigma$ emanating from it, and any region between them is not in $\Sigma$. On the other hand, examples are known in which many curves of $\Sigma$ emanate from a multiple eigenvalue. In the higher dimensional case, the curves have not been traced asymptotically.

If

    $f(x, t)/t \to \lambda_\ell$ as $|t| \to \infty$    (50)

where $\lambda_\ell$ is one of the eigenvalues of $-\Delta$, we say that Eq. (43) has asymptotic resonance. One can distinguish several types. One can have the situation:

    $f(x, t) = \lambda_\ell t + p(x, t)$    (51)

where $p(x, t) = o(|t|^\beta)$ as $|t| \to \infty$ for some $\beta < 1$. Another possibility is when $p(x, t)$ satisfies:

    $|p(x, t)| \le V(x) \in L^2(\Omega)$    (52)

and

    $p(x, t) \to p_\pm(x)$ a.e. as $t \to \pm\infty$

A stronger form occurs when

    $p(x, t) \to 0$ as $|t| \to \infty$    (53)

and

    $|P(x, t)| \le W(x) \in L^1(\Omega)$    (54)

where

    $P(x, t) := \int_0^t p(x, s)\, ds$

This type of problem is more difficult to solve; it is called strong resonance. Possible situations include:

    $P(x, t) \to P_\pm(x)$ as $t \to \pm\infty$    (55)

and

    $P(x, t) \to P_0(x)$ as $|t| \to \infty$    (56)

What is rather surprising is that the stronger the resonance, the more difficult it is to solve Eq. (43). There is an interesting connection between Eq. (43) and nonlinear eigenvalue problems of the form:

    $G'(u) = \beta u$    (57)

for functionals. This translates into the problem

    $-\Delta u = \lambda f(x, u)$    (58)

for partial differential equations. It can be shown that there is an intimate relationship between Eqs. (44) and (57) (other than the fact that the former is a special case of the latter). In fact, the absence of a certain number of solutions of Eq. (44) implies the existence of a rich family of solutions of Eq. (57) on all spheres of sufficiently large radius, and vice versa. The same holds true for Eqs. (43) and (58).

H. Periodic Solutions

Many problems in partial differential equations are more easily understood if one considers periodic functions. Let

    $Q = \{x \in \mathbb{R}^n : 0 \le x_j \le 2\pi\}$

be a cube in $\mathbb{R}^n$. By this we mean that $Q$ consists of those points $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ such that each component $x_j$ satisfies $0 \le x_j \le 2\pi$. Consider n-tuples of integers $\mu = (\mu_1, \ldots, \mu_n) \in \mathbb{Z}^n$, where each $\mu_j \ge 0$, and write $\mu x = \mu_1 x_1 + \cdots + \mu_n x_n$. For $t \in \mathbb{R}$ we let $H_t$ be the set of all series of the form:

    $u = \sum \alpha_\mu e^{i\mu x}$    (59)

where the $\alpha_\mu$ are complex numbers satisfying:

    $\alpha_{-\mu} = \bar{\alpha}_\mu$

(this is done to keep the functions $u$ real valued), and

    $\|u\|_t^2 = (2\pi)^n \sum (1 + \mu^2)^t |\alpha_\mu|^2 < \infty$    (60)

where $\mu^2 = \mu_1^2 + \cdots + \mu_n^2$. It is not required that the series (59) converge in any way, but only that Eq. (60) hold. If

    $u = \sum \alpha_\mu e^{i\mu x}, \quad v = \sum \beta_\mu e^{i\mu x}$

are members of $H_t$, we can introduce the scalar product:

    $(u, v)_t = (2\pi)^n \sum (1 + \mu^2)^t \alpha_\mu \beta_{-\mu}$    (61)

With this scalar product, $H_t$ becomes a Hilbert space. If

    $f(x) = \sum \gamma_\mu e^{i\mu x}$    (62)

we wish to solve

    $-\Delta u = f$    (63)

In other words, we wish to solve:

    $\sum \mu^2 \alpha_\mu e^{i\mu x} = \sum \gamma_\mu e^{i\mu x}$

This requires:

    $\mu^2 \alpha_\mu = \gamma_\mu \quad \forall \mu$

In order to solve for $\alpha_\mu$, we must have

    $\gamma_0 = 0$    (64)

Hence, we cannot solve for all $f$. However, if Eq. (64) holds, we can solve Eq. (63) by taking:

    $\alpha_\mu = \gamma_\mu / \mu^2$ when $\mu \ne 0$    (65)

On the other hand, we can take $\alpha_0$ to be any number we like, and $u$ will be a solution of Eq. (63) as long as it satisfies Eq. (65). Thus, we have Theorem 0.1:

Theorem 0.1. If $f$, given by Eq. (62), is in $H_t$ and satisfies Eq. (64), then Eq. (63) has a solution $u \in H_{t+2}$. An arbitrary constant can be added to the solution.

If we wish to solve:

    $-(\Delta + \lambda) u = f$    (66)

where $\lambda \in \mathbb{R}$ is any constant, we want $(\mu^2 - \lambda)\alpha_\mu = \gamma_\mu$, or

    $\alpha_\mu = \gamma_\mu / (\mu^2 - \lambda)$ when $\mu^2 \ne \lambda$    (67)

Therefore, we must have

    $\gamma_\mu = 0$ when $\mu^2 = \lambda$    (68)

Not every $\lambda$ can equal some $\mu^2$. This is certainly true if $\lambda < 0$ or if $\lambda \notin \mathbb{Z}$. But even if $\lambda$ is a positive integer, there may be no $\mu$ such that:

    $\lambda = \mu^2$    (69)

For instance, if $n = 3$ and $\lambda = 7$ or $15$, there is no $\mu$ satisfying Eq. (69). Thus, there is a subset:

    $0 = \lambda_0 < \lambda_1 < \cdots < \lambda_k < \cdots$

of the positive integers for which there are n-tuples $\mu$ satisfying Eq. (69). For $\lambda = \lambda_k$, we can solve Eq. (66) only if Eq. (68) holds. In that case, we can solve by taking $\alpha_\mu$ to be given by Eq. (67) when Eq. (69) does not hold, and to be given arbitrarily when Eq. (69) does hold. On the other hand, if $\lambda$ is not equal to any $\lambda_k$, then Eq. (69) never holds, and we can solve Eq. (66) by taking the $\alpha_\mu$ to satisfy Eq. (67). Thus, we have Theorem 0.2:

Theorem 0.2. There is a sequence $\{\lambda_k\}$ of nonnegative integers tending to $+\infty$ with the following properties. If $f \in H_t$ and $\lambda \ne \lambda_k$ for every $k$, then there is a unique solution $u \in H_{t+2}$ of Eq. (66). If $\lambda = \lambda_k$ for some $k$, then one can solve Eq. (66) only if $f$ satisfies Eq. (68). The solution is not unique; there is a finite number of linearly independent periodic solutions of:

    $(\Delta + \lambda_k) u = 0$    (70)

which can be added to the solution.

The values $\lambda_k$ for which Eq. (70) has a nontrivial solution (i.e., a solution which is not $\equiv 0$) are called eigenvalues, and the corresponding nontrivial solutions are called eigenfunctions.

To analyze the situation a bit further, suppose $\lambda = \lambda_k$ for some $k$, and $f \in H_t$ is given by Eq. (62) and satisfies Eq. (68). If $v \in H_t$ is given by:

    $v = \sum \beta_\mu e^{i\mu x}$    (71)

then

    $(f, v)_t = (2\pi)^n \sum (1 + \mu^2)^t \gamma_\mu \beta_{-\mu}$

by Eq. (61). Hence, we have

    $(f, v)_t = 0$

for all $v \in H_t$ satisfying Eq. (71) and

    $\beta_\mu = 0$ when $\mu^2 \ne \lambda_k$    (72)

On the other hand,

    $(\Delta + \lambda_k) v = \sum (\lambda_k - \mu^2) \beta_\mu e^{i\mu x}$    (73)

Thus, $v$ is a solution of Eq. (70) if and only if it satisfies Eq. (72). Conversely, if $(f, v)_t = 0$ for all $v$ satisfying Eqs. (71) and (72), then $f$ satisfies Eq. (68). Combining these, we have Theorem 0.3:

Theorem 0.3. If $\lambda = \lambda_k$ for some $k$, then there is a solution $u \in H_{t+2}$ of Eq. (66) if and only if:

    $(f, v)_t = 0$    (74)

for all $v \in H_t$ satisfying:

    $(\Delta + \lambda_k) v = 0$    (75)

Moreover, any solution of Eq. (75) can be added to the solution of Eq. (66).

What are the solutions of Eq. (75)? By Eq. (73), they are of the form:

    $v = \sum_{\mu^2 = \lambda_k} \beta_\mu e^{i\mu x} = \sum_{\mu^2 = \lambda_k} [a_\mu \cos \mu x + b_\mu \sin \mu x]$

where we took $\beta_\mu = a_\mu - i b_\mu$. Thus, Eq. (74) becomes:

    $(f, \cos \mu x)_t = 0, \quad (f, \sin \mu x)_t = 0$ when $\mu^2 = \lambda_k$

The results obtained for periodic solutions are indicative of almost all regular boundary-value problems.

I. Sobolev’s Inequalities

Certain inequalities are very useful in solving problems in partial differential equations. The following are known as the Sobolev inequalities.

Theorem 0.4. For each $p \ge 1$, $q \ge 1$ satisfying:

    $\frac{1}{p} \le \frac{1}{q} + \frac{1}{n}$    (76)

there is a constant $C_{pq}$ such that:

    $|u|_q \le C_{pq} (|\nabla u|_p + |u|_p), \quad u \in C^1(Q)$    (77)

where

    $|u|_q = \left( \int_Q |u|^q \, dx \right)^{1/q}$    (78)

and

    $|\nabla u| = \left( \sum_{k=1}^{n} \left| \frac{\partial u}{\partial x_k} \right|^2 \right)^{1/2}$    (79)

As a corollary, we have Corollary 0.1:

Corollary 0.1. $|u|_q \le C_q \|u\|_1$, $u \in H_1$, where $1 \le q \le 2^* := 2n/(n - 2)$.

SEE ALSO THE FOLLOWING ARTICLES

CALCULUS • DIFFERENTIAL EQUATIONS, ORDINARY • GREEN’S FUNCTIONS

BIBLIOGRAPHY

Bers, L., John, F., and Schechter, M. (1979). “Partial Differential Equations,” American Mathematical Society, Providence, RI.
Courant, R., and Hilbert, D. (1953, 1962). “Methods of Mathematical Physics: I, II,” Wiley–Interscience, New York.
Gilbarg, D., and Trudinger, N. S. (1983). “Elliptic Partial Differential Equations of Second Order,” Springer-Verlag, New York.
Gustafson, K. E. (1987). “Partial Differential Equations and Hilbert Space Methods,” Wiley, New York.
John, F. (1978). “Partial Differential Equations,” Springer-Verlag, New York.
Lions, J. L., and Magenes, E. (1972). “Non-Homogeneous Boundary Value Problems and Applications,” Springer-Verlag, New York.
Powers, D. L. (1987). “Boundary Value Problems,” Harcourt Brace Jovanovich, San Diego, CA.
Schechter, M. (1977). “Modern Methods in Partial Differential Equations: An Introduction,” McGraw–Hill, New York.
Schechter, M. (1986). “Spectra of Partial Differential Operators,” North-Holland, Amsterdam.
Treves, F. (1975). “Basic Linear Partial Differential Equations,” Academic Press, New York.
Zauderer, E. (1983). “Partial Differential Equations of Applied Mathematics,” Wiley, New York.
P1: LLL Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN004E-180 June 8, 2001 18:11

Discrete Mathematics and Combinatorics
Douglas R. Shier
Clemson University

I. Nature of Combinatorics
II. Basic Counting Techniques
III. Recurrence Relations and Generating Functions
IV. Inclusion–Exclusion Principle
V. Existence Problems

GLOSSARY

Algorithm Systematic procedure or prescribed series of steps followed in order to solve a problem.
Binary Pertaining to the digits 0 and 1.
Event Set of occurrences defined with respect to some probabilistic process.
Identity Mathematical equation that always holds.
Integers The numbers 0, 1, 2, . . . and their negatives.
List Ordered sequence of elements.
Mutually exclusive Events that cannot occur simultaneously.
Prime Integer greater than 1 that cannot be evenly divided by any integer other than itself and 1.
Set Unordered collection of elements.
String Ordered sequence of letters taken from some alphabet.
Universal set Set that contains all elements relevant to the current discussion.

COMBINATORICS is a branch of discrete mathematics that involves the study of arrangements of various objects. Typically, the focus of combinatorics is on determining whether arrangements can be found that satisfy certain properties or on counting all possible arrangements of such objects. While the roots of combinatorics extend back several thousands of years, its relevance to modern science and engineering is increasingly evident.

I. NATURE OF COMBINATORICS

Combinatorics constitutes a rapidly growing area of contemporary mathematics and is one with an enviable repertoire of applications to areas as diverse as biology, chemistry, physics, engineering, communications, cryptography, and computing. Of particular significance is its symbiotic relationship to the concerns and constructs of computer science. On the one hand, the advent of high-speed computers has facilitated the detailed study of existing combinatorial patterns as well as the discovery of new arrangements. On the other hand, the design and analysis of computer algorithms frequently require the insights and tools of combinatorics. It is not at all surprising, then, that computer science, which is ultimately concerned with


the manipulation of finite sets of symbols (e.g., strings of binary digits), and combinatorial mathematics, which provides tools for analyzing such patterns of symbols, have rapidly achieved prominence together. Moreover, since the symbols themselves can be abstract objects (rather than simply numerical quantities), combinatorics supports the more abstract manipulations of symbolic mathematics and symbolic computer languages.

Combinatorics is at heart a problem-solving discipline that blends mathematical techniques and concepts with a necessary touch of ingenuity. In order to emphasize this dual nature of combinatorics, the sections that follow will first present certain fundamental combinatorial principles and then illustrate their application through a number of diverse examples. Specifically, Sections II–IV provide an introduction to some powerful techniques for counting various combinatorial arrangements, and Section V examines when certain patterns can be guaranteed to exist.

II. BASIC COUNTING TECHNIQUES

A. Fundamental Rules of Sum and Product

Two deceptively simple, but fundamentally important, rules allow the counting of complex patterns by decomposition into simpler patterns. The first such principle states, in essence, that if we slice a pie into two nonoverlapping portions, then indeed the whole (pie) is equal to the sum of its two parts.

Rule of Sum. Suppose that event E can occur in m different ways, that event F can occur in n different ways, and that the two events are mutually exclusive. Then, the compound event where at least one of the two events happens can occur in m + n ways.

The second principle indicates the number of ways that a menu of choices (one item chosen from E, another item chosen from F) can be selected.

Rule of Product. Suppose that event E can occur in m different ways and that subsequently event F can occur in n different ways. Then, a choice from E followed by a choice from F can be made in m × n ways.

EXAMPLE 1. A certain state anticipates a total of 2,500,000 registered vehicles within the next ten years. Can the current system of license plates (consisting of six digits) accommodate the expected number of vehicles? Should there instead be a change to a proposed new system consisting of two letters followed by four digits?

Solution. To analyze the current situation, there are ten possibilities (0–9) for each of the six digits, so application of the product rule yields 10 × 10 × 10 × 10 × 10 × 10 = 1,000,000 possibilities, not enough to accommodate the expected number of vehicles. By contrast, the proposed new system allows (again by the product rule) 26 × 26 × 10 × 10 × 10 × 10 = 6,760,000 possibilities, more than enough to satisfy the anticipated demand.

EXAMPLE 2. DNA (deoxyribonucleic acid) consists of a chain of nucleotide bases (adenine, cytosine, guanine, thymine). How many different three-base sequences are possible?

Solution. For each of the three positions in the sequence, there are four possibilities for the base, so (by the product rule) there are 4 × 4 × 4 = 64 such sequences.

EXAMPLE 3. In a certain computer programming language, each identifier (variable name) consists of either one or two alphanumeric characters (A–Z, 0–9), but the first character must be alphabetic (A–Z). How many different identifier names are possible in this language?

Solution. In this case, analysis of the compound event can be broken into counting the possibilities for event E, a single-character identifier, and for event F, a two-character identifier. The number of possibilities for E is 26, whereas (by the product rule) the number of possibilities for F is 26 × (26 + 10) = 936. Since the two events E and F are mutually exclusive, the total number of distinct identifiers is 26 + 936 = 962.

B. Permutations and Combinations

In the analysis of combinatorial problems, it is essential to recognize when order is important in the arrangement and when it is not. To emphasize this distinction, the set X = {$x_1, x_2, \ldots, x_n$} consists of n elements $x_i$, assembled without regard to order, whereas the list X = [$x_1, x_2, \ldots, x_n$] contains elements arranged in a prescribed order.

In the previous examples, the order of arrangement was clearly important so lists were implicitly being counted. More generally, arrangements of objects into a list are referred to as permutations. For example, the objects a, b, c can be arranged into the following permutations: [a, b, c], [a, c, b], [b, a, c], [b, c, a], [c, a, b], [c, b, a]. By the product rule, n distinct objects can be arranged into:

    n! = n × (n − 1) × (n − 2) × · · · × 2 × 1

different permutations. (The symbol n!, or n factorial, denotes the product of the first n positive integers.) A permutation of size k is a list with k elements chosen from the n given objects, and there are exactly

    P(n, k) = n × (n − 1) × (n − 2) × · · · × (n − k + 1)

such permutations.
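The counts obtained so far from the rules of sum and product can be checked numerically. The following is a minimal sketch in Python, not part of the original article; the function name `count_permutations` and the variable names are hypothetical and chosen only for illustration. It recomputes the results of Examples 1–3 and evaluates P(n, k) as the running product given above, cross-checking small cases by brute-force enumeration.

```python
from itertools import permutations, product

# Example 1 (product rule): six-digit plates versus two letters plus four digits.
digits_only = 10 ** 6
letters_then_digits = 26 ** 2 * 10 ** 4

# Example 2 (product rule): enumerate every three-base DNA sequence explicitly.
dna_sequences = list(product("ACGT", repeat=3))

# Example 3 (sum rule over two mutually exclusive cases):
# one-character identifiers plus two-character identifiers.
identifiers = 26 + 26 * (26 + 10)


def count_permutations(n, k):
    """Return P(n, k) = n * (n - 1) * ... * (n - k + 1) as a running product."""
    result = 1
    for factor in range(n, n - k, -1):
        result *= factor
    return result


# Brute-force cross-check: list all orderings of the objects a, b, c.
abc_orderings = list(permutations("abc"))

print(digits_only, letters_then_digits)   # 1000000 6760000
print(len(dna_sequences), identifiers)    # 64 962
print(len(abc_orderings), count_permutations(3, 3))  # 6 6
```

Enumeration with `itertools` is only practical for small cases; the closed-form product is what scales to larger n and k.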

EXAMPLE 4. In a manufacturing plant, a particular product is fabricated by processing in turn on four different machines. If any processing sequence using all four machines is permitted, how many different processing orders are possible? How many processing orders are there if only two machines from the four need to be used?

Solution. Each processing order corresponds to a permutation of the four machines, so there are P(4, 4) = 4! = 24 different orders. If processing on any two machines is allowable then there are P(4, 2) = 4 × 3 = 12 different orders.

When the order of elements occurring in the arrangement is not pertinent, then a way of arranging k objects, chosen from n distinct objects, is called a combination of size k. For example, the objects a, b, c can be arranged into the following combinations, or sets, of size 2: {a, b}, {a, c}, {b, c}. The number of combinations of size k from n objects is given by the formula:

    C(n, k) = P(n, k)/k!

EXAMPLE 5. A group of ten different blood samples is to be split into two batches, each consisting of five “pooled” samples. Further chemical analysis will then be performed on the two batches. In how many ways can the samples be split in this fashion?

Solution. Any division of the samples $S_1, S_2, \ldots, S_{10}$ into the two batches can be uniquely identified by those samples belonging to the first batch. For example, {$S_1, S_2, S_5, S_6, S_8$} defines one such division. Since the order of samples within each batch is not important, there are C(10, 5) = 252 ways to divide the original samples.

EXAMPLE 6. Suppose that 12 straight lines are drawn on a piece of paper, with no two lines being parallel and no three meeting at a single point. How many different triangles are formed by these lines?

Solution. Any three lines form a triangle since no lines are parallel. As a result, there are as many triangles as choices of three lines selected from the 12, giving C(12, 3) = 220 such triangles.

EXAMPLE 7. How many different solutions are there in nonnegative integers $x_i$ to the equation $x_1 + x_2 + x_3 + x_4 = 8$?

Solution. We can view this problem as an equivalent one in which eight balls are placed into four numbered boxes. For example, the solution $x_1 = 2$, $x_2 = 3$, $x_3 = 2$, $x_4 = 1$ corresponds to placing 2, 3, 2, 1 balls into boxes 1, 2, 3, 4. This solution can also be represented by the string * * | * * * | * * | * which shows the number of balls residing in the four boxes. The number of solutions is then the number of ways of constructing a string of 11 symbols (eight stars and three bars); namely, we can select the three bars in C(11, 3) = 165 ways.

C. Binomial Coefficients

Ways of arranging objects can also be viewed from an algebraic perspective. To understand this correspondence, consider the product of n identical factors (1 + x), namely:

    $(1 + x)^n = (1 + x)(1 + x) \cdots (1 + x)$

The coefficient of $x^k$ in the expansion of this product is just the number of ways to select the symbol x from exactly k of the factors. However, the number of ways to select these k factors from the n available is C(n, k), meaning that:

    $(1 + x)^n = C(n, 0) + C(n, 1)x + C(n, 2)x^2 + \cdots + C(n, n)x^n$    (1)

Because the coefficients C(n, k) arise in this way from the expansion of a two-term expression, they are also referred to as binomial coefficients. These coefficients can be conveniently placed in a triangular array, called Pascal’s triangle, as shown in Fig. 1. Row n of Pascal’s triangle contains the values C(n, 0), C(n, 1), . . . , C(n, n). Several patterns are apparent from this figure. First, the binomial coefficients are symmetrically placed within each row: namely, C(n, k) = C(n, n − k). Second, the coefficient appearing in any row equals the sum of the two coefficients appearing in the previous row just to the left and to the right. For example, in row 5 the third entry, 10, is the sum of the second and third entries, 4 and 6, from the previous row. In general, the binomial coefficients satisfy the identity:

    C(n, k) = C(n − 1, k − 1) + C(n − 1, k)

FIGURE 1 Arrangement of binomial coefficients C(n, k) in Pascal’s triangle.

The binomial coefficients satisfy a number of other interesting and useful identities. To illustrate one way of

discovering such identities, formally substitute the value x = 1 into both sides of Eq. (1), yielding the identity:

    $2^n = C(n, 0) + C(n, 1) + C(n, 2) + \cdots + C(n, n)$

In other words, the binomial coefficients for n must sum to the value $2^n$. If instead, the value x = −1 is formally substituted into Eq. (1), the following identity results:

    $0 = C(n, 0) - C(n, 1) + C(n, 2) - \cdots + (-1)^n C(n, n)$

This simply states that the alternating sum of the binomial coefficients in any row of Fig. 1 (other than row 0) must be zero.

A final identity involving the numbers in Fig. 1 concerns a string of coefficients progressing from the left-hand border along a downward sloping diagonal to any other entry in the figure. Then, the sum of these coefficients will be found as the value just below and to the left of the last such entry. For instance, the sum C(2, 0) + C(3, 1) + C(4, 2) + C(5, 3) = 1 + 3 + 6 + 10 = 20 is indeed the same as the binomial coefficient C(6, 3). In general, this observation can be expressed as the identity:

    C(n + k + 1, k) = C(n, 0) + C(n + 1, 1) + C(n + 2, 2) + · · · + C(n + k, k)

This identity can be given a pleasant combinatorial interpretation, namely, consider selecting k items (without regard for order) from a total of n + k + 1 items, which can be done in C(n + k + 1, k) ways. In any such selection, there will be some item number r so that items 1, 2, . . . , r are selected but r + 1 is not. This then leaves k − r items to be selected from the remaining n + k + 1 − (r + 1) = n + k − r items, which can be done in C(n + k − r, k − r) ways. Since the cases r = 0, 1, . . . , k are mutually exclusive, the sum rule shows the total number of selections is also equal to C(n + k, k) + C(n + k − 1, k − 1) + · · · + C(n + 1, 1) + C(n, 0). Thus, by counting the same group of objects in two different ways, one can verify the above identity. This technique of “double counting” provides a powerful tool applicable to a number of other combinatorial problems.

D. Discrete Probability

Probability theory is an important area of mathematics in which combinatorics plays an essential role. For example, if there are only a finite number of outcomes $S_1, S_2, \ldots, S_m$ to some process, the ability to count the number of occurrences of $S_i$ provides valuable information on the likelihood that outcome $S_i$ will in fact be observed. Indeed, many phenomena in the physical sciences are governed by probabilistic rather than deterministic laws; therefore, one must generally be content with assessing the probability that certain desirable (or undesirable) outcomes will occur.

EXAMPLE 8. What is the probability that a hand of five cards, dealt from a shuffled deck of cards, contains at least three aces?

Solution. The population of 52 cards can be conveniently partitioned into set A of the 4 aces and set N of the 48 non-aces. In order to obtain exactly three aces, the hand must contain three cards from set A and two cards from set N, which can be achieved in C(4, 3) C(48, 2) = 4512 ways. To obtain exactly four aces, the hand must contain four cards from A and one card from N, which can be achieved in C(4, 4) C(48, 1) = 48 ways. Since the total number of possible hands of five cards chosen from the 52 is C(52, 5) = 2,598,960, the probability of the required hand is (4512 + 48)/2,598,960 = 0.00175, indicating a rate of occurrence of less than twice in a thousand.

EXAMPLE 9. In a certain state lottery, six winning numbers are selected from the numbers 1, 2, . . . , 40. What are the odds of matching all six winning numbers? What are the odds of matching exactly five? Exactly four?

Solution. The number of possible choices is the number of ways of selecting six numbers from the 40, or C(40, 6) = 3,838,380. Since only one of these is the winning selection, the odds of matching all six numbers are 1/3,838,380. To match exactly five of the winning numbers, there are C(6, 5) = 6 ways of selecting the five matching numbers and C(34, 1) = 34 ways of selecting a nonmatching number, giving (by the product rule) 6 × 34 = 204 ways, so the odds are 204/3,838,380 = 17/319,865 for matching five numbers. To match exactly four winning numbers, there are C(6, 4) = 15 ways of selecting the four matching numbers and C(34, 2) = 561 ways of selecting the nonmatching numbers, giving 15 × 561 = 8415 ways, so the odds are 8415/3,838,380 = 561/255,892 (or approximately 0.0022) of matching four numbers.

EXAMPLE 10. An alarm system is constructed from five identical components, each of which can fail (independently of the others) with probability q. The system is designed with a certain amount of redundancy so that it functions whenever at least three of the components are working. How likely is it that the entire system functions?

Solution. There are two states for each individual component, either good or failed. The state of the system can be represented by a binary string $x_1 x_2 x_3 x_4 x_5$, where $x_i$ is 1 if component i is good and is 0 if it fails. A functioning state for the system thus corresponds to a binary string having at most two zeros. The number of states with exactly two zeros is C(5, 2) = 10, so the probability

of two failed and three good components is 10q^2(1 − q)^3. Similarly,
the probability of exactly one failed component is
C(5, 1)q^1(1 − q)^4 = 5q(1 − q)^4, and the probability of no failed
components is C(5, 0)(1 − q)^5 = (1 − q)^5. Altogether, the probability
that the system functions is given by
10q^2(1 − q)^3 + 5q(1 − q)^4 + (1 − q)^5. For example, when q = 0.01,
the system will operate with probability 0.99999015 and thus fail with
probability 0.00000985; this shows how adding redundancy to a system
composed of unreliable components (1% failure rate) produces a highly
reliable system (0.001% failure rate).

Using the initial conditions f_1 = 2 and f_2 = 3, the values
f_3, f_4, ..., f_8 can be calculated in turn by substitution into Eq. (2):

f_3 = 5, f_4 = 8, f_5 = 13, f_6 = 21, f_7 = 34, f_8 = 55

Therefore, 55 binary strings of length 8 have the desired property.
In this problem it was clearly expedient to solve the general problem
by use of a recurrence relation that stressed the interdependence of
solutions to related problems. The particular sequence obtained for
this problem, [1, 2, 3, 5, 8, 13, 21, 34, ...], with f_0 = 1 added for
convenience, is called the Fibonacci sequence, and it arises in
numerous problems of mathematics as well as biology, physics, and
computer science.

III. RECURRENCE RELATIONS
AND GENERATING FUNCTIONS
A. Recurrence Relations
and Counting Problems EXAMPLE 12. Suppose that ten straight lines are drawn
in a plane so that no two lines are parallel and no three
Not all counting problems can be solved as readily and as intersect at a single point. Into how many different regions
directly as in Section II. In fact, the best way to solve will the plane be divided by these lines?
specific counting problems is often to solve instead a
more general, and presumably more difficult, problem. Solution. As seen in Fig. 2, the number of regions created
One technique for doing this involves the use of recurrence by one straight line is f 1 = 2, the number created by two
relations. lines is f 2 = 4, and the number created by three lines is
Recall that the binomial coefficients satisfy the relation: f 3 = 7. The picture becomes excessively complicated with
more added lines, so it is prudent to seek a general solution
C(n, k) = C(n − 1, k − 1) + C(n − 1, k) for f n , the number of regions created by n lines in the
plane. Suppose that n − 1 lines have already been drawn
Such an expression shows how the value C(n, k) can be
and that line n is now added. Because the lines are all
calculated from certain “prior” values C(n − 1, k − 1) and
mutually nonparallel, line n must intersect each existing
C(n − 1, k). This type of relation is termed a recurrence
line exactly once. These n − 1 intersection points divide
relation, since it enables any specific value in the sequence
the new line into n segments and each segment serves to
to be obtained from certain previously calculated values.
subdivide an existing region into two regions. Thus, the
EXAMPLE 11. How many strings of eight binary digits n segments increase the number of regions by exactly n,
contain no consecutive pair of zeros? producing the recurrence relation:
Solution. It is easy to find the number f 1 of such strings of f n = f n−1 + n
length 1, since the strings “0” and “1” are both acceptable,
yielding f 1 = 2. Also, the only forbidden string of length Given the initial condition f 1 = 2, application of this re-
2 is “00” so f 2 = 3. There are three forbidden strings of currence relation yields the values f 2 = f 1 + 2 = 4 and
length 3 (“001,” “100,” and “000”), whereupon f 3 = 5. f 3 = f 2 + 3 = 7, as previously verified. In fact, such a re-
At this point, it becomes tedious to calculate subsequent currence relation can be explicitly solved, giving:
values directly, but they can be easily found by noticing
that a certain recurrence relation governs the sequence f n .
In an acceptable string of length n, either the first digit is
a 1 or it is a 0. In the former case, the remaining digits
can be any acceptable string of length n − 1 (and there
are f n−1 of these). In the latter case, the second digit must
be a 1 and then the remaining digits must form an accept-
able string of length n − 2 (there are f n−2 of these). These
observations provide the recurrence relation:

f_n = f_{n−1} + f_{n−2}    (2)

FIGURE 2 Number of regions created by the placement of n lines.
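The two recurrences of Examples 11 and 12 are easy to check mechanically. The following Python sketch is only an illustration (the function names are made up here, not taken from the article): it iterates f_n = f_{n−1} + f_{n−2} from f_1 = 2, f_2 = 3, compares the result with a brute-force count of binary strings that avoid "00", and also iterates the line-region recurrence f_n = f_{n−1} + n from f_1 = 2.

```python
from itertools import product


def strings_without_00(n):
    """Count binary strings of length n with no two consecutive zeros,
    using f_n = f_{n-1} + f_{n-2} with f_1 = 2 and f_2 = 3."""
    if n == 1:
        return 2
    prev, curr = 2, 3  # f_1, f_2
    for _ in range(n - 2):
        prev, curr = curr, prev + curr
    return curr


def brute_force_without_00(n):
    """Enumerate all 2**n binary strings and count those that avoid '00'."""
    return sum("00" not in "".join(bits) for bits in product("01", repeat=n))


def regions_from_lines(n):
    """Regions formed by n lines in general position:
    f_n = f_{n-1} + n with f_1 = 2."""
    regions = 2
    for k in range(2, n + 1):
        regions += k
    return regions


print(strings_without_00(8), brute_force_without_00(8))
print([regions_from_lines(n) for n in (1, 2, 3, 10)])
```

Both counts for length 8 come out as 55, and the region counts for 1, 2, 3, and 10 lines are 2, 4, 7, and 56, in agreement with the values quoted in the text.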
f_n = (n^2 + n + 2)/2

indicating that there are f_10 = 112/2 = 56 regions bounded by ten lines.
In representing mathematical expressions, the placement of parentheses
can be crucial. For example, ((x − y) − z) does not in general give the
same result as (x − (y − z)). Moreover, the placement of parentheses
must be syntactically valid as in ((x − y) − z) or (x − (y − z)),
whereas (x) − y) − (z is not syntactically valid.

The variable x simply serves as a formal symbol and its exponents
represent placeholders for carrying the coefficient information. More
generally, a generating function is a formal power series in the
variable x:

f(x) = a_0 + a_1 x + a_2 x^2 + ··· + a_n x^n + ···

and it serves as a template for studying the sequence of coefficients
[a_0, a_1, a_2, ..., a_n, ...].
Recall that the binomial coefficients C(n, k) count the number of
combinations of size k derived from a set
{1, 2, ..., n} of n elements. In this context, the generating function
f(x) = (1 + x)^n for the binomial coefficients can be developed by the
following reasoning. At each step k = 1, 2, ..., n a decision is made as
to whether or not to include element k in the current combination. If
x^0 is used to express exclusion and x^1 inclusion, then the factor
(x^0 + x^1) = (1 + x) at step k compactly encodes these two choices.
Since each element k presents the same choices (exclude/include), the
product of the n factors (1 + x) produces the desired enumerator
(1 + x)^n. This reasoning applies more generally to cases where the
individual choices at each step are not identical, so the factors need
not all be the same (as in the case of the binomial coefficients). The
following examples give some idea of the types of problems that can be
addressed through the use of generating functions.

EXAMPLE 14. In how many different ways can change for a dollar be
given, using only nickels, dimes, and quarters?

Solution. The choices for the number of nickels to use can be
represented by the series:

(1 + x^5 + x^10 + ···) = (1 − x^5)^{−1}

where x^i signifies that exactly i cents worth of nickels are used.
Similarly, the choices for dimes are embodied in:

(1 + x^10 + x^20 + ···) = (1 − x^10)^{−1}

EXAMPLE 13. How many valid ways are there of parenthesizing an
expression involving n variables?

Solution. Notice that there is only one valid form for a single
variable x, namely (x), giving f_1 = 1; similarly, it can be verified
that f_2 = 1 and f_3 = 2. More generally, suppose that there are n
variables. Then any valid form must be expressible as (AB), where A and
B are themselves valid forms. Notice that if A contains k variables,
then B must contain n − k variables, so in this case the f_k valid forms
for A can be combined with the f_{n−k} valid forms for B to yield
f_k f_{n−k} valid forms for the whole expression. Since the different
possibilities for k (k = 1, 2, ..., n − 1) are mutually exclusive, we
obtain the recurrence relation:

f_n = f_1 f_{n−1} + f_2 f_{n−2} + ··· + f_{n−2} f_2 + f_{n−1} f_1

This equation can be explicitly solved for f_n in terms of binomial
coefficients:

f_n = C(2n − 2, n − 1)/n

The numbers in this sequence, [1, 1, 2, 5, 14, 42, 132, ...], are called
Catalan numbers (Eugène Catalan, 1814–1894), and they occur with some
regularity as solutions to a variety of combinatorial problems (such as
triangulating convex polygons, counting voting sequences, and
constructing rooted binary trees).
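The convolution recurrence for the number of valid parenthesizations (Example 13) and its closed form can be compared directly. The sketch below is a minimal illustration in Python; the function names are hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from math import comb


def parenthesizations(n_max):
    """Return [f_1, ..., f_n_max] computed from the recurrence
    f_n = f_1 f_{n-1} + f_2 f_{n-2} + ... + f_{n-1} f_1, with f_1 = 1."""
    f = [0, 1]  # index 0 is unused padding; f[1] = 1
    for n in range(2, n_max + 1):
        f.append(sum(f[k] * f[n - k] for k in range(1, n)))
    return f[1:]


def closed_form(n):
    """Closed form f_n = C(2n - 2, n - 1) / n; the division is always exact."""
    return comb(2 * n - 2, n - 1) // n


values = parenthesizations(7)
print(values)
print([closed_form(n) for n in range(1, 8)])
```

Both lines print the Catalan numbers 1, 1, 2, 5, 14, 42, 132 listed in the text, which is a useful sanity check that the recurrence and the binomial formula describe the same sequence.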
B. Generating Functions
and Counting Problems

The examples in Part A of this section serve to illustrate the theme
that solving a specific problem can frequently be aided by relating the
given problem to other, often simpler, problems of the same type. For
example, a problem of size n might be related to a problem of size
n − 1 and to another problem of size n − 3. Another way of pursuing such
interrelationships among problems of different sizes is through use of
a generating function. As a matter of fact, this concept has already
been previewed in studying the binomial coefficients. Specifically,
Eq. (1) shows that f(x) = (1 + x)^n can be viewed as a generating
function for the binomial coefficients C(n, k):

f(x) = C(n, 0) + C(n, 1)x + C(n, 2)x^2 + ··· + C(n, n)x^n

and the choices for quarters in:

(1 + x^25 + x^50 + ···) = (1 − x^25)^{−1}

Multiplying together these three series produces the required
generating function:

f(x) = (1 − x^5)^{−1} (1 − x^10)^{−1} (1 − x^25)^{−1}

The coefficient of x^n in the expanded form of this generating function
indicates the number of ways of making change for n cents. In
particular, there are 29 different ways of making change for a dollar
(n = 100). This can be verified by using a symbolic algebra package such
as Mathematica or Maple.
A partition of the positive integer n is a set of positive integers (or
"parts") that together sum to n. For example,
the number 4 has the five partitions: {1, 1, 1, 1}, {1, 1, 2}, {1, 3},
{2, 2}, and {4}.

EXAMPLE 15. How many partitions are there for the integer n?

Solution. The choices for the number of ones to include as parts are
represented by the series:

(1 + x + x^2 + ···) = (1 − x)^{−1}

where the x^i term means that 1 is to appear i times in the partition.
Similarly, the choices for the number of twos to include are given by:

(1 + x^2 + x^4 + ···) = (1 − x^2)^{−1}

the choices for the number of threes are given by:

(1 + x^3 + x^6 + ···) = (1 − x^3)^{−1}

and so forth. Therefore, the number of partitions of n can be found as
the coefficient of x^n in the generating function:

f(x) = (1 − x)^{−1} (1 − x^2)^{−1} (1 − x^3)^{−1} ···

IV. INCLUSION–EXCLUSION PRINCIPLE

Another important counting technique is based on the idea of
successively adjusting an initial count through systematic additions
and subtractions that are guaranteed to produce a correct final answer.
This technique, called the inclusion–exclusion principle, is applicable
to many instances where direct counting would be impractical.
As a simple example, suppose we wish to count the number of elements
that are not in some subset A of a given universal set U. Then, the
required number of elements equals the total number of elements in U,
denoted by N = N(U), minus the number of elements in A, denoted by
N(A). Expressed in this notation,

N(A′) = N − N(A)

where A′ = U − A designates the set of elements in U that do not appear
in A. Figure 3a depicts this relation using a Venn diagram (John Venn,
1834–1923), in which the enclosing rectangle represents the set U, the
inner circle represents A, and the shaded portion represents A′. The
quantity N(A′) is thus obtained by excluding N(A) elements from N.

EXAMPLE 17. The letters a, b, c, d, e are used to form five-letter
words, using each letter exactly once. How many words do not contain
the sequence bad?

Solution. The universe here consists of all words, or permutations,
formed from the five letters, so there are N = 5! = 120 words in total.
Set A consists of all such words containing bad. By treating these
three letters as a new "megaletter" x, the set A equivalently contains
all words formed from x, c, e so N(A) = 3! = 6. The number of words not
containing bad is then N − N(A) = 120 − 6 = 114.

EXAMPLE 16. Find the number of partitions of the integer n into
distinct parts.

Solution. Since the parts must be distinct, the choices for any integer
i are whether to include it (x^i) or not (x^0) in the given partition.
As a result, the generating function for this problem is

f(x) = (1 + x)(1 + x^2)(1 + x^3) ···

For example, the coefficient of x^8 in the expansion of f(x) is found
to be 6, meaning that there are six partitions
of 8 into distinct parts: namely, {8}, {1, 7}, {2, 6}, {3, 5}, Figure 3b shows the situation for two sets, A and B,
{1, 2, 5}, {1, 3, 4}. contained in the universal set U . As this figure suggests,
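The coefficient extractions in Examples 14 to 16 do not require a symbolic algebra package: it is enough to multiply the series truncated at the degree of interest. The Python sketch below illustrates this under that assumption; the helper names `series_product`, `geometric`, and `one_plus` are invented for the illustration.

```python
def series_product(factors, max_degree):
    """Multiply truncated power series (lists of coefficients, lowest
    degree first) and keep only the terms up to x**max_degree."""
    result = [1] + [0] * max_degree
    for factor in factors:
        new = [0] * (max_degree + 1)
        for i, a in enumerate(result):
            if a == 0:
                continue
            for j, b in enumerate(factor):
                if i + j > max_degree:
                    break
                new[i + j] += a * b
        result = new
    return result


def geometric(step, max_degree):
    """Coefficients of 1 + x**step + x**(2*step) + ... = (1 - x**step)**(-1)."""
    return [1 if k % step == 0 else 0 for k in range(max_degree + 1)]


def one_plus(power, max_degree):
    """Coefficients of the two-term factor 1 + x**power."""
    return [1 if k in (0, power) else 0 for k in range(max_degree + 1)]


# Example 14: change for 100 cents using nickels, dimes, and quarters.
change = series_product([geometric(c, 100) for c in (5, 10, 25)], 100)
# Example 15: partitions of 4 (parts larger than 4 cannot contribute).
partitions = series_product([geometric(k, 4) for k in range(1, 5)], 4)
# Example 16: partitions of 8 into distinct parts.
distinct = series_product([one_plus(k, 8) for k in range(1, 9)], 8)

print(change[100], partitions[4], distinct[8])
```

The three printed coefficients are 29, 5, and 6, matching the counts stated for change for a dollar, the partitions of 4, and the partitions of 8 into distinct parts. Truncation is safe because a factor term of degree above the target can never contribute to the target coefficient.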

FIGURE 3 Venn diagrams relative to a universal set U. (a) Sets A and A′; (b) Sets A, B, and A′B′.
the number of elements in either A or B (or both) is then the number of
elements in A plus the number of elements in B, minus the number of
elements in both:

N(A ∪ B) = N(A) + N(B) − N(AB)    (3)

Here A ∪ B denotes the elements either in A or in B, or in both,
whereas AB denotes the elements in both A and B. Since the sum
N(A) + N(B) counts the elements of AB twice rather than just once,
N(AB) is subtracted to remedy the situation. The number of elements
N(A′B′) in neither A nor B is N − N(A ∪ B), thus an alternative form of
Eq. (3) is

N(A′B′) = N − [N(A) + N(B)] + N(AB)    (4)

This form shows how terms are alternately included and excluded to
produce the desired result.

Pr(A ∪ B ∪ C) = Pr(A) + Pr(B) + Pr(C) − [Pr(AB) + Pr(AC) + Pr(BC)] + Pr(ABC)

Here, Pr(A) is the probability of event A occurring, Pr(AB) is the
probability of event AB occurring, and so forth. Notice that
Pr(A) = Pr(B) = Pr(C) = (1 − q)^2 and
Pr(AB) = Pr(AC) = Pr(BC) = Pr(ABC) = (1 − q)^3, so the assembly
functions with probability:

Pr(A ∪ B ∪ C) = 3(1 − q)^2 − 3(1 − q)^3 + (1 − q)^3
             = 3(1 − q)^2 − 2(1 − q)^3

Two positive integers are called relatively prime if the only positive
integer evenly dividing both is the number 1. For example, 7 and 15 are
relatively prime, whereas 6 and 15 are not (they share the common
divisor 3).
EXAMPLE 18. In blood samples obtained from 50 patients, laboratory
tests show that 20 patients have antibodies to type A bacteria, 29
patients have antibodies to type B bacteria, and 8 patients have
antibodies to both types. How many patients have antibodies to neither
of the two types of bacteria?

Solution. The given data state that N = 50, N(A) = 20, N(B) = 29, and
N(AB) = 8. Therefore, by Eq. (4):

N(A′B′) = 50 − [20 + 29] + 8 = 9

meaning that nine patients are immune to neither type of bacteria.
The foregoing equations generalize in a natural way to three sets A, B,
and C:

N(A ∪ B ∪ C) = N(A) + N(B) + N(C) − [N(AB) + N(AC) + N(BC)] + N(ABC)    (5)

N(A′B′C′) = N − [N(A) + N(B) + N(C)] + [N(AB) + N(AC) + N(BC)] − N(ABC)    (6)

EXAMPLE 20. How many positive integers not exceeding 60 are relatively
prime to 60?

Solution. The appropriate universe here is U = {1, 2, ..., 60} and the
(prime) divisors of N = 60 are 2, 3, 5. Relative to U, let A be the set
of integers divisible by 2, let B be the set of integers divisible by
3, and let C be the set of integers divisible by 5. The problem here is
to calculate N(A′B′C′), the number of integers that share no divisor
greater than 1 with 60. Because every other integer is divisible by 2,
we have N(A) = 60/2 = 30. Similarly, N(B) = 60/3 = 20 and
N(C) = 60/5 = 12. Because any integer divisible by both 2 and 3 must
also be divisible by 6, we have N(AB) = 60/6 = 10. Likewise,
N(AC) = 60/10 = 6, N(BC) = 60/15 = 4, and N(ABC) = 60/30 = 2.
Substituting these values into Eq. (6) gives:

N(A′B′C′) = 60 − [30 + 20 + 12] + [10 + 6 + 4] − 2 = 16

so there are 16 positive numbers not exceeding 60 that are relatively
prime to 60.
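The alternating sum used in Example 20 can be written once for any number of sets. The following Python sketch is an illustration only (`count_in_none` is a name chosen here): it evaluates the inclusion–exclusion sum over all intersections of the given subsets and cross-checks the answer for N = 60 against a direct gcd test.

```python
from itertools import combinations
from math import gcd


def count_in_none(universe, subsets):
    """Number of elements of `universe` lying in none of `subsets`,
    obtained by alternately subtracting and adding the sizes of all
    r-fold intersections (r = 1, 2, ...)."""
    total = len(universe)
    for r in range(1, len(subsets) + 1):
        sign = -1 if r % 2 else 1
        for group in combinations(subsets, r):
            total += sign * len(set.intersection(*group))
    return total


U = set(range(1, 61))
A = {n for n in U if n % 2 == 0}  # divisible by 2
B = {n for n in U if n % 3 == 0}  # divisible by 3
C = {n for n in U if n % 5 == 0}  # divisible by 5

by_formula = count_in_none(U, [A, B, C])
by_gcd = sum(1 for n in U if gcd(n, 60) == 1)
print(by_formula, by_gcd)
```

Both numbers are 16. With three subsets the loop reproduces exactly the terms of the three-set formula: three single sets subtracted, three pairwise intersections added, and one triple intersection subtracted.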
In each of these forms, the final result is obtained by successive
inclusions and exclusions, thus justifying these as manifestations of
the inclusion–exclusion principle.

EXAMPLE 19. An electronic assembly is comprised of components 1, 2, and
3 and functions only when at least two components are working. If all
components fail independently of one another with probability q, what
is the probability that the entire assembly functions?

Solution. Let A denote the event in which components 1 and 2 work, let
B denote the event in which components 1 and 3 work, and let C denote
the event in which components 2 and 3 work. Of interest is the
probability that at least one of the events A, B, or C occurs. The
analogous form of Eq. (5) in the probabilistic case is

EXAMPLE 21. Each package of a certain product contains one of three
possible prizes. How likely is it that a purchaser of five packages of
the product will get at least one of each prize?

Solution. The number of possible ways in which this event can occur
will first be calculated, after which a probabilistic statement can be
deduced. Let the prizes be denoted a, b, c, and let the contents of the
five packages be represented by a string of five letters (where
repetition is allowed). For example, the string bacca would represent
one possible occurrence. Define A to be the set of all such strings
that do not include a. In similar fashion, define B (respectively, C)
to be the set of strings that do not include b (respectively, c). It is
required then to calculate
N(A′B′C′), the number of instances in which a, b, and c all occur. The
approach will be to use the inclusion–exclusion relation, Eq. (6), to
calculate this quantity. Here, N is the number of strings of five
letters over the alphabet {a, b, c}, so (by the product rule)
N = 3^5 = 243. Also, N(A) is the number of strings from the alphabet
{b, c}, whereupon N(A) = 2^5 = 32. By similar reasoning,
N(B) = N(C) = 32, N(AB) = N(AC) = N(BC) = 1^5 = 1, and N(ABC) = 0. As a
result,

N(A′B′C′) = 243 − 3(32) + 3(1) − 0 = 150

In summary, the total number of possible strings is 243 and all three
prizes are obtained in 150 of these cases. If all strings are equally
likely (i.e., any package is just as likely to contain each prize),
then the required probability is 150/243 = 0.617, indicating a better
than 60% chance of obtaining three different prizes in just five
packages. A similar analysis shows that for six packages, the
probability of obtaining all three prizes increases to 0.741; for seven
packages, there is a 0.826 probability.

V. EXISTENCE PROBLEMS

A. Pigeonhole Principle

In certain combinatorial problems, it may be exceedingly difficult to
count the number of arrangements of a prescribed type. In fact, it
might not even be clear that any such arrangement actually exists. What
is needed then is some guiding mathematical assurance that
configurations of the desired type do indeed exist. One such principle
is the so-called pigeonhole principle. While its statement is
overwhelmingly self-evident, its applications range from the simplest
to the most challenging problems in combinatorics.
The pigeonhole principle states that if there are more than k objects
(or pigeons) to be placed in k locations (or pigeonholes), then some
location must house two (or possibly more) such objects. A simple
illustration of this principle assures that at least two residents of a
town with 400 inhabitants have the same birthday. Here, the objects are
the residents and the locations are the 366 possible birthdays. Since
there are more objects than locations, some location must contain at
least two objects, meaning that some two residents (or more) must share
the same birthday. Notice that this principle only guarantees the
existence of two such residents; it does not give any information about
finding them.

EXAMPLE 22. In any selection of five different elements from
{1, 2, ..., 8}, there must be some pair of selected elements that sum
to 9.

Solution. Here the objects are the numbers 1, 2, ..., 8 and the
locations are the four sets A_1 = {1, 8}, A_2 = {2, 7}, A_3 = {3, 6},
and A_4 = {4, 5}. Notice that for each of these sets its two elements
sum to 9. According to the pigeonhole principle, placing five numbers
into these four sets results in some set A_i having both its elements
selected, so the sum of these two elements (by the construction of set
A_i) must equal 9.

EXAMPLE 23. In a room with n ≥ 2 persons, there must be two persons
having exactly the same number of friends in the room.

Solution. The number of possible friendships for any given person
ranges from 0 to n − 1. However, if n − 1 occurs then that person is a
friend of everyone else, and (assuming that friendship is a mutual
relation) no other person can be without friends. Thus, both 0 and
n − 1 cannot simultaneously occur in a group of n persons. If
1, 2, ..., n − 1 are the possible numbers of friendships, then using
these n − 1 numbers as locations for the n persons (objects), the
pigeonhole principle assures that some number in {1, 2, ..., n − 1}
must appear twice. A similar result can be established for the case
{0, 1, ..., n − 2}, demonstrating there must always be at least two
persons having the same number of friends in the room.

Two strings x = x_1 x_2 ··· x_n and y = y_1 y_2 ··· y_m over the
alphabet {a, b, ..., z} are said to be disjoint if they share no common
letter and are said to overlap otherwise.

EXAMPLE 24. In any collection of at least six strings, there must
either be three strings that are mutually disjoint or three strings
that mutually overlap.

Solution. For a given string x, let the k ≥ 5 other strings in the
collection be divided into two groups, D and O; D consists of those
strings that are disjoint from x, and O consists of those that overlap
with x. By a generalization of the pigeonhole principle, one of these
two sets must contain at least three elements. Suppose that it is set
D. Then, either D contains three mutually overlapping strings or it
contains two disjoint strings y and z. In the first case, these three
strings satisfy the stated requirements. In the second case, the
elements x, y, z are all mutually disjoint, so again the requirements
are met. A similar argument can be made if O is the set containing at
least three elements. In any event, there will either be three mutually
disjoint strings or three mutually overlapping strings.
This last example is a special case of Ramsey's theorem (Frank Ramsey,
1903–1930), which guarantees that if there are enough objects then
configurations of certain types must exist. Not only does this theorem
(which generalizes the pigeonhole principle) produce some very deep
combinatorial results, but it has also
been applied to problems arising in geometry, the design of
communication networks, and information retrieval.

B. Combinatorial Designs

Combinatorial designs involve ways of arranging objects into various
groups in order to meet specified requirements. Such designs find
application in the planning of statistical experiments as well as in
other areas of mathematics (number theory, coding theory, geometry, and
algebra).
As one illustration, suppose that an experiment is to be designed to
test the effects of five different drugs using five different subjects.
One clear requirement is that each subject should receive all five
drugs, since otherwise the results could be biased by variation among
the subjects. Each drug is to be administered for one week, so that at
the end of five weeks the experiment will be completed. However, the
order in which drugs are administered could also have an effect on
their observed potency, so it is also desirable for all drugs to be
represented on any given week of the experiment. One way of designing
such an experiment is depicted in Fig. 4, which shows one source of
variation, the subjects (S_1, S_2, ..., S_5), appearing along the rows
and the other source of variation, the weeks (W_1, W_2, ..., W_5),
appearing along the columns. The entries within each row show the order
in which the drugs (A, B, ..., E) are administered to each subject on a
weekly basis. Such an arrangement is termed a Latin square, since the
five treatments (drugs) appear exactly once in each row and exactly
once in each column.

FIGURE 4 Latin square design for drug treatments (A, ..., E) applied to subjects (S_1, ..., S_5) by week (W_1, ..., W_5).

Figure 4 clearly demonstrates the existence of a 5 × 5 Latin square;
more generally, Latin squares of size n × n exist for each value of
n ≥ 1. There are also occasions when it is desirable to superimpose
certain pairs of n × n Latin squares. An example of this arises in
testing the effects of n types of fertilizer and n types of insecticide
on the yield of a particular crop. Suppose that a field on which the
crop is grown is divided into an n × n grid of plots. In order to
minimize vertical and horizontal variations in the composition and
drainage properties of the soil, each fertilizer should appear on
exactly one plot in each "row" and exactly one plot in each "column" of
the grid. Likewise, each insecticide should appear once in each row and
once in each column of the grid. In other words, a Latin square design
should be used for each of the two treatments. Figure 5a shows a Latin
square design for four fertilizer types (A, B, C, D), and Fig. 5b shows
another Latin square design for four insecticide types (a, b, c, d).
In addition, the fertilizer and insecticide treatments can themselves
interact, thus an ideal design would ensure that each of the n^2
possible combinations of the n fertilizers and n insecticides appears
exactly once. Figure 5c shows that the two Latin squares in Figs. 5a
and b (when superimposed) have this property; namely, each
fertilizer–insecticide pair occurs exactly once on a plot. Such a pair
of Latin squares is called orthogonal.

FIGURE 5 Orthogonal Latin squares. (a) Latin square design for fertilizers (A, ..., D); (b) Latin square design for insecticides (a, ..., d); (c) superimposed Latin squares.

A pair of orthogonal n × n Latin squares need not exist for all values
of n ≥ 2. However, it has been proved that the only exceptions occur
when n = 2 and n = 6. In all other cases, an orthogonal pair can be
constructed.
Latin squares are special instances of complete designs, since every
treatment appears in each row and in each
column. Another useful class of combinatorial designs is one in which
not all treatments appear within each test group. Such incomplete
designs are especially relevant when the number of treatments is large
relative to the number of tests that can be performed on an
experimental unit.
As an example of an incomplete design, consider an experiment in which
subjects are to compare v = 7 brands of soft drink (A, B, ..., G). For
practical reasons, every subject is limited to receiving k = 3 types of
soft drink. Moreover, to ensure fairness in the representation of the
various beverages, each soft drink should be tasted by the same number
r = 3 of subjects, and each pair of soft drinks should appear together
the same number λ = 1 of times. It turns out that such a design can be
constructed using b = 7 subjects, with the soft drinks compared by each
subject i given by the set B_i below:

B_1 = {A, B, D};  B_2 = {A, C, F};
B_3 = {A, E, G};  B_4 = {B, C, G};
B_5 = {B, E, F};  B_6 = {C, D, E};
B_7 = {D, F, G}

The sets B_i are referred to as blocks, and such a design is termed a
(b, v, r, k, λ) balanced incomplete block design. In any such design,
the parameters b, v, r, k, λ must satisfy the following conditions:

bk = vr,    λ(v − 1) = r(k − 1)

In the previous example, these relations hold since 7 × 3 = 7 × 3 and
1 × 6 = 3 × 2. While the above conditions must hold for any balanced
incomplete block design, there need not exist a design corresponding to
every set of parameters satisfying these conditions.

SEE ALSO THE FOLLOWING ARTICLES

COMPUTER ALGORITHMS • PROBABILITY

BIBLIOGRAPHY

Bogart, K. P. (2000). "Introductory Combinatorics," 3rd ed. Academic Press, San Diego, CA.
Cohen, D. I. A. (1978). "Basic Techniques of Combinatorial Theory," Wiley, New York.
Grimaldi, R. P. (1999). "Discrete and Combinatorial Mathematics," 4th ed. Addison-Wesley, Reading, MA.
Liu, C. L. (1985). "Elements of Discrete Mathematics," 2nd ed. McGraw–Hill, New York.
McEliece, R. J., Ash, R. B., and Ash, C. (1989). "Introduction to Discrete Mathematics," Random House, New York.
Roberts, F. S. (1984). "Applied Combinatorics," Prentice Hall, Englewood Cliffs, NJ.
Rosen, K. H. (1999). "Discrete Mathematics and Its Applications," 4th ed. McGraw–Hill, New York.
Rosen, K. H., ed. (2000). "Handbook of Discrete and Combinatorial Mathematics," CRC Press, Boca Raton, FL.
Tucker, A. (1995). "Applied Combinatorics," 3rd ed. John Wiley & Sons, New York.
P1: ZCK Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN004E-183 June 8, 2001 18:23

Distributed Parameter Systems

N. U. Ahmed
University of Ottawa

I. System Models
II. Linear Evolution Equations
III. Nonlinear Evolution Equations and Differential
Inclusions
IV. Recent Advances in Infinite-Dimensional
Systems and Control

GLOSSARY

Banach space Normed vector space in which every Cauchy sequence has a
limit; normed space complete with respect to the norm topology.
Cauchy sequence Sequence {x_n} ∈ X is called a Cauchy sequence if
lim_{n,m→∞} ‖x_n − x_m‖ = 0; an element x ∈ X is its limit if
lim_{n→∞} ‖x_n − x‖ = 0.
Normed vector space X ≡ (X, ‖·‖) Linear vector space X furnished with a
measure of distance between its elements d(x, y) = ‖x − y‖,
‖x‖ = d(0, x) satisfying: (a) ‖x‖ ≥ 0, ‖x‖ = 0 iff x = 0,
(b) ‖x + y‖ ≤ ‖x‖ + ‖y‖, (c) ‖αx‖ = |α| ‖x‖ for all x, y ∈ X, and α
scalar.
Reflexive Banach space Banach space X is reflexive if it is equivalent
to (isomorphic to, indistinguishable from) its bidual (X*)* ≡ X**.
Strictly convex normed space Normed space X is said to be strictly
convex if, for every pair x, y ∈ X with x ≠ y, ‖(1/2)(x + y)‖ < 1
whenever ‖x‖ ≤ 1, ‖y‖ ≤ 1.
Uniformly convex normed space Normed space X is said to be uniformly
convex if, for every ε > 0, there exists a δ > 0, such that
‖x − y‖ > ε implies that ‖(1/2)(x + y)‖ < 1 − δ; for ‖x‖ ≤ 1, ‖y‖ ≤ 1;
a uniformly convex Banach space is reflexive.
Weak convergence and weak topology Sequence {x_n} ∈ X is said to be
weakly convergent to x if, for every x* ∈ X*, x*(x_n) → x*(x) as
n → ∞; the notion of weak convergence induces a topology different from
the norm topology and is called the weak topology; the spaces L^p,
1 ≤ p < ∞, are weakly (sequentially) complete.
Weak* convergence and w*-topology Sequence {x_n*} ∈ X* is said to be
w*-convergent to x* if, for every x ∈ X, x_n*(x) → x*(x) as n → ∞; the
corresponding topology is called w*-topology.
Weak and weak* compactness Set K in a Banach space X is said to be
weakly (sequentially) compact if every sequence {x_n} ∈ K has a
subsequence {x_{n_k}} and an x_0 ∈ K such that, for each x* ∈ X*,
x*(x_{n_k}) → x*(x_0); weak* compactness is defined similarly by
interchanging the roles of X* and X.

DISTRIBUTED PARAMETER SYSTEMS are those whose state evolves with time
distributed in space. More precisely, they are systems whose temporal
evolution of state can be completely described only by elements of
∞-dimensional spaces. Such systems can be described by


partial differential equations, integro-partial differential equations, functional equations, or abstract differential or integro-differential equations on topological spaces.

I. SYSTEM MODELS

Mysteries of the universe have baffled scientists, poets, philosophers, and even the prophets. Scientists have tried to learn the secrets of nature by building and studying mathematical models for various systems. It is difficult to give a precise definition of what is meant by a system. But, roughly speaking, you may think of any physical or abstract entity that obeys certain systematic laws, deterministic or probabilistic, as being a system.
Systems can be classified into two major categories: (a) lumped parameter systems, and (b) distributed parameter systems. Lumped parameter systems are those whose temporal evolution can be described by a finite number of variables. That is, they are systems having finite degrees of freedom and they are governed by ordinary differential or difference equations in a finite-dimensional space. On the other hand, distributed parameter systems are those whose temporal evolution can be described only by elements of an ∞-dimensional space called the state space. These systems are governed by partial differential equations or functional differential equations or a combination of partial and ordinary differential equations.

A. Examples of Physical Systems

In this section we present a few typical examples of distributed parameter systems. The Laplace equation in $\mathbb{R}^n$ is given by:

$$\Delta\phi \equiv \sum_{i=1}^{n} \frac{\partial^2 \phi}{\partial x_i^2} = 0 \tag{1}$$

For n = 3, this equation is satisfied by the velocity potential of an irrotational incompressible flow, by the gravitational field outside the attracting masses, by electrostatic and magnetostatic fields, and also by the temperature of a body in thermal equilibrium. In addition to its importance on its own merit, the Laplacian is also used as a basic operator in diffusion and wave propagation problems. For example, the temperature of a body is governed by the so-called heat equation:

$$\frac{\partial T}{\partial t} = k\,\Delta T + f(t, x), \qquad (t, x) \in I \times \Omega,\quad \Omega \subset \mathbb{R}^3 \tag{2}$$

where k is the thermal conductivity of the material and f is an internal source of heat. The classical wave equation in $\mathbb{R}^n$ is given by:

$$\frac{\partial^2 \psi}{\partial t^2} = c^2\,\Delta\psi \tag{3}$$

where c denotes the speed of propagation. It may represent the displacement of a vibrating string, or propagation of acoustic or light waves and surface waves on shallow water.
Maxwell's equations, for a nonconducting medium with permeability µ and permittivity ε, describing the temporal evolution of the magnetic and electric field vectors H and E, respectively, in $\mathbb{R}^3$, are given by:

$$\begin{gathered} \mu\,\frac{\partial H}{\partial t} + \nabla \times E = 0, \qquad \varepsilon\,\frac{\partial E}{\partial t} - \nabla \times H = 0 \\ \operatorname{div} E \equiv \nabla \cdot E = 0, \qquad \operatorname{div} H \equiv \nabla \cdot H = 0 \end{gathered} \tag{4}$$

Under the assumptions of small displacement, the classical elastic waves in $\mathbb{R}^3$ are governed by the system of equations,

$$\begin{gathered} \rho\,\frac{\partial^2 y_i}{\partial t^2} = \mu\,\Delta y_i + (\lambda + \mu)\,\frac{\partial}{\partial x_i}(\operatorname{div} y), \qquad i = 1, 2, 3 \\ y = (y_1, y_2, y_3), \qquad y_i = y_i(t, x_1, x_2, x_3) \end{gathered} \tag{5}$$

where y represents the displacement of the elastic body from its unstrained configuration, ρ is the mass density, and λ, µ are Lamé constants. In $\mathbb{R}^3$, the state of a single particle of mass m subject to a field of potential v = v(t, x1, x2, x3) is given by the Schrödinger equation:

$$i\hbar\,\frac{\partial \psi}{\partial t} = -\frac{\hbar^2}{2m}\,\Delta\psi + v\psi \tag{6}$$

where $h = 2\pi\hbar$ is Planck's constant.
In recent years the nonlinear Schrödinger equation has been used to take into account nonlinear interaction of particles in a beam by replacing vψ by a suitable nonlinear function g(t, x, ψ).
An elastic beam undergoing moderately large vibration is governed by a nonlinear equation of the form:

$$\begin{gathered} \rho A\,\frac{\partial^2 y}{\partial t^2} + \beta\,\frac{\partial y}{\partial t} + \frac{\partial^2}{\partial x^2}\left(EI\,\frac{\partial^2 y}{\partial x^2}\right) + N\,\frac{\partial^2 y}{\partial x^2} = f(t, x) \\ N = \frac{EA}{2l}\int_0^l \left(\frac{\partial y}{\partial x}\right)^2 dx \\ x \in (0, l), \qquad t \ge 0 \end{gathered} \tag{7}$$

where E denotes Young's modulus, I the moment of area of the cross section A of the beam, l the length, ρ the mass density, N the membrane force, and f the applied force. For small displacements, neglecting N and β, one obtains the Euler equation for thin beams. The dynamics of a Newtonian fluid are governed by the Navier–Stokes equations:

$$\begin{gathered} \rho\left(\frac{\partial v}{\partial t} + v \cdot \nabla v\right) - \nu\,\Delta v - (3\lambda + \nu)\,\operatorname{grad}\operatorname{div} v + \operatorname{grad} p = f \\ \frac{\partial \rho}{\partial t} + \operatorname{div}(\rho v) = 0, \qquad t \ge 0,\quad x \in \Omega \subset \mathbb{R}^n \end{gathered} \tag{8}$$

obtained from momentum and mass conservation laws, where ρ is the mass density, v the velocity vector, p the pressure, f the force density, and ν, λ are constant parameters.
Magnetohydrodynamic equations for a homogeneous adiabatic fluid in $\mathbb{R}^3$ are given by:

$$\begin{gathered} \frac{\partial v}{\partial t} + v \cdot \nabla v + \rho^{-1}\nabla p + (\mu\rho)^{-1}(B \times \operatorname{rot} B) = f \\ \frac{\partial \rho}{\partial t} + \operatorname{div}(\rho v) = 0, \qquad \frac{\partial s}{\partial t} + v \cdot \nabla s = 0 \\ \frac{\partial B}{\partial t} - \operatorname{rot}(v \times B) = 0, \qquad \operatorname{div} B = 0 \\ p = g(\rho, s), \qquad \operatorname{rot} B \equiv \nabla \times B \end{gathered} \tag{9}$$

where v, ρ, p, and f are as in Eq. (8); s is the entropy; B, the magnetic induction vector; and µ, the permeability.
In recent years semilinear parabolic equations of the form,

$$\frac{\partial \phi}{\partial t} = D\,\Delta\phi + f(t, x; \phi), \qquad t \ge 0,\quad x \in \Omega \subset \mathbb{R}^n \tag{10}$$

have been extensively used for modeling biological, chemical, and ecological systems, where D represents the migration or diffusion coefficient.
The dynamics of a spacecraft with flexible appendages are governed by a coupled system of ordinary and partial differential equations.
In the following two sections we present abstract models that cover a wide variety of physical systems including those already mentioned.

B. Linear Systems

A general spatial differential operator used to construct system models is given by:

$$A(x, D)\phi \equiv \sum_{|\alpha| \le 2m} a_\alpha(x)\,D^\alpha \phi, \qquad x \in \Omega \equiv \text{open subset of } \mathbb{R}^n \tag{11}$$

Under suitable smoothness conditions on the coefficients $a_\alpha$ and the boundary ∂Ω of Ω one can express Eq. (11) in the so-called divergence form:

$$A(x, D)\phi \equiv \sum_{|\alpha|, |\beta| \le m} (-1)^{|\alpha|} D^\alpha\left(a_{\alpha\beta}\,D^\beta \phi\right) \tag{12}$$

The operator A is said to be strongly uniformly elliptic on Ω if there exists a γ > 0 such that

$$(-1)^m\,\operatorname{Re} \sum_{|\alpha| = 2m} a_\alpha(x)\,\xi^\alpha \ge \gamma\,|\xi|^{2m} \qquad \text{for all } x \in \Omega \tag{13}$$

or

$$\operatorname{Re} \sum_{|\alpha| = |\beta| = m} a_{\alpha\beta}(x)\,\xi^\alpha \xi^\beta \ge \gamma\,|\xi|^{2m} \tag{14}$$

Many physical processes in steady state (for example, thermal equilibrium or elastic equilibrium) can be described by elliptic boundary value problems given by:

$$\begin{gathered} A(x, D)\phi = f \quad \text{on } \Omega \\ B(x, D)\phi = g \quad \text{on } \partial\Omega \end{gathered} \tag{15}$$

where $B = \{B_j, 0 \le j \le m - 1\}$ is a system of suitable boundary operators. For example, the boundary operator may be given by the Dirichlet operator,

$$B \equiv \left\{\frac{\partial^j}{\partial \nu^j},\ 0 \le j \le m - 1\right\}$$

where ∂/∂ν denotes spatial derivatives along the outward normal to the boundary ∂Ω. The boundary operators $\{B_j\}$ cannot be chosen arbitrarily; they must satisfy certain compatibility conditions with respect to the operator A. Only then is the boundary value problem (15) well posed; that is, one can prove the existence of a solution and its continuous dependence on the data f and $g = \{g_j, 0 \le j \le m - 1\}$. The order of each of the operators $B_j$ is denoted by $m_j$.
Evolution equations of parabolic type arise in the problems of heat transfer, chemical diffusions, and also in the study of Markov processes. The most general model describing such phenomena is given by:

$$\begin{gathered} \frac{\partial \phi}{\partial t} + A\phi = f, \qquad t, x \in I \times \Omega = Q \\ B\phi = g, \qquad t, x \in I \times \partial\Omega \\ \phi(0) = \phi_0, \qquad x \in \Omega \end{gathered} \tag{16}$$

where f, g, and φ0 are the given data.
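As a rough numerical illustration of the parabolic model (16), the sketch below integrates the one-dimensional heat equation (2) with f = 0 and zero Dirichlet boundary data by an explicit finite-difference scheme. The grid size, time step, initial profile sin(πx), and the names heat_step and solve_heat are illustrative assumptions and do not come from the article.

```python
import math

def heat_step(T, k, dx, dt):
    """One explicit Euler step of dT/dt = k * d2T/dx2 with T = 0 on the boundary."""
    new = [0.0] * len(T)                      # boundary values stay at zero
    for i in range(1, len(T) - 1):
        laplacian = (T[i - 1] - 2.0 * T[i] + T[i + 1]) / dx**2
        new[i] = T[i] + dt * k * laplacian
    return new

def solve_heat(n=50, k=1.0, t_end=0.1):
    """Integrate on (0, 1) from the assumed initial profile sin(pi * x)."""
    dx = 1.0 / n
    dt = 0.4 * dx**2 / k                      # explicit scheme needs dt <= dx^2 / (2k)
    steps = int(round(t_end / dt))
    T = [math.sin(math.pi * i * dx) for i in range(n + 1)]
    T[0] = T[-1] = 0.0
    for _ in range(steps):
        T = heat_step(T, k, dx, dt)
    return T, steps * dt

if __name__ == "__main__":
    T, t = solve_heat()
    print("midpoint temperature:", T[len(T) // 2])
    print("exact value         :", math.exp(-math.pi**2 * t))
```

For this initial profile the exact solution is exp(−kπ²t) sin(πx), so the computed midpoint value can be compared directly with exp(−kπ²t).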

Second-order evolution equations of hyperbolic type describing many vibration and wave propagation problems have the general form:

$$\begin{gathered} \frac{\partial^2 \phi}{\partial t^2} + A\phi = f, \qquad t, x \in I \times \Omega \\ B\phi = g, \qquad t, x \in I \times \partial\Omega \\ \phi(0) = \phi_0, \quad \dot{\phi}(0) = \phi_1, \qquad x \in \Omega \end{gathered} \tag{17}$$

Schrödinger-type evolution equations are obtained if A is replaced by iA in Eq. (16).
In the study of existence of solutions of these problems, Garding's inequality is used for a priori estimates. If A is strongly uniformly elliptic, the principal coefficients satisfy Hölder conditions on Ω uniformly with respect to t ∈ I, and the other coefficients are bounded measurable, then one can prove the existence of a λ ∈ $\mathbb{R}$ and α > 0 such that

$$a(t, \phi, \phi) + \lambda\,|\phi|_H^2 \ge \alpha\,\|\phi\|_V^2, \qquad t \in I \tag{18}$$

where

$$a(t, \phi, \psi) \equiv \sum_{|\alpha|, |\beta| \le m} \left\langle a_{\alpha\beta}(t, \cdot)\,D^\beta \phi,\ D^\alpha \psi \right\rangle \tag{19}$$

and V is any reflexive Banach space continuously and densely embedded in $H = L^2(\Omega)$, where $L^p(\Omega)$ is the space of equivalence classes of pth-power Lebesgue integrable functions on Ω ⊂ $\mathbb{R}^n$, with the norm given by:

$$\|f\|_{L^p(\Omega)} \equiv \begin{cases} \left(\displaystyle\int_\Omega |f(x)|^p\,dx\right)^{1/p} & \text{for } 1 \le p < \infty \\ \operatorname{ess\,sup}\{|f(x)|,\ x \in \Omega\} & \text{for } p = \infty \end{cases}$$

For example, V could be $W_0^{m,p}$, $W^{m,p}$, or $W_0^{m,p} \subset V \subset W^{m,p}$ for p ≥ 2, where $W_0^{m,p}(\Omega)$ is the closure of $C^\infty$ functions with compact support on Ω in the topology of $W^{m,p}$, and

$$W^{m,p}(\Omega) \equiv \{f \in L^p(\Omega) : D^\alpha f \in L^p(\Omega),\ |\alpha| \le m\}$$

furnished with the norm topology:

$$\|f\|_{W^{m,p}} \equiv \sum_{|\alpha| \le m} \|D^\alpha f\|_{L^p(\Omega)}, \qquad p \ge 1$$

$$D^\alpha \equiv D_1^{\alpha_1} D_2^{\alpha_2} \cdots D_n^{\alpha_n}, \qquad D_i^{\alpha_i} \equiv \frac{\partial^{\alpha_i}}{\partial x_i^{\alpha_i}}$$

$$|\alpha| = \sum_{i=1}^{n} \alpha_i, \qquad \alpha_i \equiv \text{nonnegative integers}$$

C. Nonlinear Evolution Equations and Differential Inclusions

Nonlinear systems are more frequently encountered in practical problems than the linear ones. There is no clear-cut classification for these systems. We present here a few basic structures that seem to cover a broad area in the field.

1. Elliptic Systems

The class of elliptic problems which have received considerable attention in the literature is given by:

$$\begin{gathered} A\psi = 0 \quad \text{on } \Omega \\ D^\alpha \psi = 0 \quad \text{on } \partial\Omega, \qquad |\alpha| \le m - 1 \end{gathered} \tag{20}$$

where

$$A\psi \equiv \sum_{|\alpha| \le m} (-1)^{|\alpha|} D^\alpha\left(|D^\alpha \psi|^{p-2} D^\alpha \psi\right) + \sum_{|\beta| \le m-1} (-1)^{|\beta|} D^\beta b_\beta(x, \psi, D^1\psi, \ldots, D^m\psi) \tag{21}$$

2. Semilinear Systems

The class of nonlinear evolution equations that can be described in terms of a linear operator and a nonlinear operator containing lower order derivatives has been classified as semilinear. These systems have the form:

$$\frac{d\psi}{dt} + A(t)\psi + f(t, \psi) = 0, \qquad \psi(0) = \psi_0 \tag{22}$$

and

$$\frac{d^2\psi}{dt^2} + A(t)\psi + f(t, \psi) = 0, \qquad \psi(0) = \psi_0, \quad \dot{\psi}(0) = \psi_1 \tag{23}$$

where A(t) may be a differential operator of the form (11) and the nonlinear operator f may be given by:

$$f(t, \psi) \equiv \sum_{|\alpha| \le 2m-1} (-1)^{|\alpha|} D^\alpha b_\alpha(t, \cdot\,; \psi, D^1\psi, \ldots, D^{2m-1}\psi) \tag{24}$$

For example, a second-order semilinear parabolic equation is given by:

$$\frac{\partial \phi}{\partial t} + \sum_{i,j} a_{ij}(t, x)\,\phi_{x_i x_j} + a(t, x, \phi, \phi_x) = 0 \tag{25}$$

which certainly covers the ecological model, Eq. (10). In short, Eq. (22) is an abstract model for a wide variety of

nonlinear diffusions. It covers the Navier–Stokes equation (8) and many others including the first-order semilinear hyperbolic system:

$$\frac{\partial y}{\partial t} + \sum_{i=1}^{n} A_i(t, x)\,\frac{\partial y}{\partial x_i} + B(t, x)\,y + f(t, x, y) = 0 \tag{26}$$

The second-order abstract semilinear equation (23) covers a wide variety of nonlinear vibration and wave propagation problems.

3. Quasilinear Systems

The most general form of systems governed by quasilinear evolution equations that have been treated in the literature is given by:

$$\frac{d\psi}{dt} + A(t, \psi)\psi + f(t, \psi) = 0, \qquad \psi(0) = \psi_0 \tag{27}$$

It covers the quasilinear second-order parabolic equation or systems of the form:

$$\frac{\partial \psi}{\partial t} + \sum_{i,j}^{n} a_{ij}(t, x; \psi, \psi_x)\,\psi_{x_i x_j} + a(t, x, \psi, \psi_x) = 0 \tag{28}$$

where, for parabolicity, one requires that

$$\sum_{i,j} a_{ij}(t, x, p, q)\,\xi_i \xi_j \ge \gamma\,|\xi|^2, \qquad \gamma > 0,\quad t, x \in Q,\quad p \in \mathbb{R},\quad q \in \mathbb{R}^n$$

It covers the quasilinear hyperbolic systems of the form:

$$A_0(t, x, y)\,\frac{\partial y}{\partial t} + \sum_{i}^{n} A_i(t, x, y)\,\frac{\partial y}{\partial x_i} + B(t, x, y) = 0 \tag{29}$$

including the magnetohydrodynamic equation (9).

4. Differential Inclusions

In recent years abstract differential inclusions have been used as models for controlled systems with discontinuities. For example, consider the abstract semilinear equation (22). In case the operator f, as a function of the state ψ, is discontinuous but otherwise well defined as a set-valued function in a suitable topological space, one may consider Eq. (22) as a differential inclusion:

$$-\dot{\psi}(t) \in A(t)\psi(t) + f(t, \psi(t)) \quad \text{a.e.}, \qquad \psi(0) = \psi_0 \tag{30}$$

Such equations also arise from variational inequalities.

D. Stochastic Evolution Equations

In certain situations, because of lack of precise knowledge of the system parameters arising from inexact observation or due to gaps in our fundamental understanding of the physical world, one may consider stochastic models to obtain most probable answers to many scientific questions. Such models may be described by stochastic evolution equations of the form:

$$d\psi = (A(t)\psi + f(t, \psi))\,dt + \sigma(t, \psi)\,dW, \qquad \psi(0) = \psi_0 \tag{31}$$

where A is a differential operator possibly of the form (11), f is a nonlinear operator of the form (24), and σ is a suitable operator-valued function. The variable W = {W(t), t ≥ 0} represents a Wiener process with values in a suitable topological space and defined on a probability space. The uncertainty may arise from randomness of ψ0, the process W, and even from the operators A, f, and σ. This is further discussed in Sections II and III.

II. LINEAR EVOLUTION EQUATIONS

In this section, we present some basic results from control theory for systems governed by linear evolution equations.

A. Existence of Solutions for Linear Evolution Equations and Semigroups

A substantial part of control theory for distributed parameter systems has been developed for first- and second-order evolution equations of parabolic and hyperbolic type of the form:

$$\frac{d\phi}{dt} + A(t)\phi = f, \qquad \phi(0) = \phi_0 \tag{32}$$

$$\frac{d^2\phi}{dt^2} + A(t)\phi = f, \qquad \phi(0) = \phi_0, \quad \dot{\phi}(0) = \phi_1 \tag{33}$$

A fundamental question that must be settled before a control problem can be considered is the question of existence of solutions of such equations.
We present existence theorems for problems (32) and (33) and conclude the section with a general result when the operator A is only a generator of a c0-semigroup (in case of constant A) or merely the generator of an evolution operator in a Banach space (in case A is variable). The concepts of semigroups and evolution operators are introduced at a convenient place while dealing with the questions of existence.
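To give a feel for the stochastic model (31), the following sketch applies the Euler–Maruyama scheme to a scalar stand-in, dψ = aψ dt + σ dW, with constant a and σ. This is only a finite-dimensional caricature of Eq. (31): the scalar setting, the parameter values, the seed, and the names euler_maruyama and sample_mean are all assumptions made for illustration.

```python
import math
import random

def euler_maruyama(a, sigma, psi0, t_end, n_steps, rng):
    """Simulate one path of d(psi) = a*psi dt + sigma dW on [0, t_end]."""
    dt = t_end / n_steps
    psi = psi0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))    # Wiener increment, distributed N(0, dt)
        psi += a * psi * dt + sigma * dW
    return psi

def sample_mean(n_paths=4000, seed=0):
    """Average psi(1) over many independent paths (a = -1, sigma = 0.3, psi0 = 1)."""
    rng = random.Random(seed)
    total = sum(euler_maruyama(-1.0, 0.3, 1.0, 1.0, 100, rng) for _ in range(n_paths))
    return total / n_paths

if __name__ == "__main__":
    print("sample mean of psi(1):", sample_mean())
    print("exp(a * t)           :", math.exp(-1.0))
```

Because the noise here is additive with zero mean, the expected value of ψ(t) follows the deterministic solution ψ0 exp(at), which gives a simple check on the simulation.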

For simplicity we consider time-invariant systems, although the result given below holds for the general case.

Theorem 1. Consider system (33) with the operator A time invariant and self-adjoint and suppose it satisfies the conditions:

(a1) $|\langle A\phi, \psi\rangle| \le c\,\|\phi\|_V \|\psi\|_V$, c ≥ 0, φ, ψ ∈ V
(a2) $\langle A\phi, \phi\rangle + \lambda\,|\phi|_H^2 \ge \alpha\,\|\phi\|_V^2$, λ ∈ $\mathbb{R}$, α > 0, φ ∈ V

Then, for every φ0 ∈ V, φ1 ∈ H, and $f \in L^2(I, H)$, system (33) has a unique solution φ satisfying:

(c1) $\phi \in L^\infty(I, V) \cap C(\bar{I}, V)$
(c2) $\dot{\phi} \in L^\infty(I, H) \cap C(\bar{I}, H)$

and

(c3) $(\phi_0, \phi_1, f) \to (\phi, \dot{\phi})$ is a continuous mapping from $V \times H \times L^2(I, H)$ to $C(\bar{I}, V) \times C(\bar{I}, H)$.

Proof (Outline). The proof is based on Galerkin's approach, which converts the infinite-dimensional system (33) into its finite-dimensional approximation; then, by use of a priori estimates and compactness arguments, one shows that the approximating sequence has a subsequence that converges to the solution of problem (33).

Note that the second-order evolution equation (33) can be written as a first-order equation $d\psi/dt + \tilde{A}\psi = \tilde{f}$ where

$$\psi = \begin{pmatrix} \phi \\ \dot{\phi} \end{pmatrix}, \qquad \tilde{A} = \begin{pmatrix} 0 & -I \\ A & 0 \end{pmatrix}, \qquad \tilde{f} = \begin{pmatrix} 0 \\ f \end{pmatrix} \tag{34}$$

Defining X = V × H, with the product topology, as the state space, it follows from Theorem 1(c3) that there exists an operator-valued function S(t), t ≥ 0, with values $S(t) \in \mathcal{L}(X)$ so that

$$\psi(t) = S(t)\psi_0 + \int_0^t S(t - \theta)\,\tilde{f}(\theta)\,d\theta \tag{35}$$

The family of operators {S(t), t ≥ 0} forms a c0-semigroup in X and $\psi \in C(\bar{I}, X)$, where C(I, X) is the space of continuous functions on I with values in the Banach space X with the norm (topology):

$$\|f\| = \sup\{|f(t)|_X,\ t \in I\}$$

For system (32) one can prove the following result using the same procedure.

Theorem 2. Consider system (32) and suppose that the operator A satisfies assumptions (a1) and (a2) (see Theorem 1). Then, for each φ0 ∈ H and $f \in L^2(I, V^*)$, system (32) has a unique solution:

$$\phi \in L^2(I, V) \cap L^\infty(I, H) \cap C(\bar{I}, H)$$

and further $(\phi_0, f) \to \phi$ is a continuous map from $H \times L^2(I, V^*)$ to $C(\bar{I}, H)$.

As a consequence of this result there exists an evolution operator U(t, s), 0 ≤ s ≤ t < ∞, with values $U(t, s) \in \mathcal{L}(H)$ such that

$$\phi(t) = U(t, 0)\phi_0 + \int_0^t U(t, \theta) f(\theta)\,d\theta \tag{36}$$

is a (mild) weak solution of problem (32).
We conclude this section with some results on the question of existence of solutions for a class of general time-invariant linear systems on Banach space.
Let X be a Banach space, and S(t), t ≥ 0, a family of bounded linear operators from X to X satisfying the properties:

$$\begin{aligned} &\text{(a)}\quad S(0) = I \quad \text{(identity operator)} \\ &\text{(b)}\quad S(t + \tau) = S(t)S(\tau), \qquad t, \tau \ge 0 \\ &\text{(c)}\quad \operatorname*{strong\,limit}_{t \downarrow 0^+} S(t)\xi = \xi, \qquad \xi \in X \end{aligned} \tag{37}$$

The operator S(t), t ≥ 0, satisfying the above properties is called a strongly continuous semigroup or, in short, a c0-semigroup. Let A be a closed, densely defined linear operator with domain D(A) ⊂ X and range R(A) ⊂ X. Suppose there exist numbers M > 0 and ω ∈ $\mathbb{R}$ such that

$$\|(\lambda I - A)^{-1}\|_{\mathcal{L}(X)} \le M/(\lambda - \omega)$$

for all real λ > ω. Then, by a fundamental theorem from the semigroup theory known as the Hille–Yosida theorem, there exists a unique c0-semigroup S(t), t ≥ 0, with A as its infinitesimal generator. The semigroup S(t), t ≥ 0, satisfies the properties:

$$\begin{aligned} &\text{(a)}\quad \|S(t)\|_{\mathcal{L}(X)} \le M e^{\omega t}, \qquad t \ge 0 \\ &\text{(b)}\quad \text{for } \xi \in D(A),\ S(t)\xi \in D(A) \text{ for all } t \ge 0 \\ &\text{(c)}\quad \text{for } \xi \in D(A),\ \frac{d}{dt} S(t)\xi = A S(t)\xi = S(t) A \xi, \qquad t \ge 0 \end{aligned} \tag{38}$$
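In finite dimensions every matrix A generates the semigroup S(t) = exp(tA), so properties (37) and (38) can be checked numerically. The sketch below does this for an assumed 2 × 2 matrix using a truncated Taylor series; the matrix, the number of series terms, and the helper names mat_mul, semigroup, and max_diff are illustrative choices, not taken from the article.

```python
def mat_mul(P, Q):
    """Product of two square matrices stored as nested lists."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def semigroup(A, t, terms=40):
    """Approximate S(t) = exp(tA) by a truncated Taylor series."""
    n = len(A)
    S = [[float(i == j) for j in range(n)] for i in range(n)]   # k = 0 term: identity
    term = [row[:] for row in S]
    for k in range(1, terms):
        term = mat_mul(term, A)
        term = [[t * x / k for x in row] for row in term]       # now (tA)^k / k!
        S = [[S[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return S

def max_diff(P, Q):
    """Largest entrywise difference between two matrices."""
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

A = [[0.0, 1.0], [-2.0, -3.0]]    # assumed example; eigenvalues -1 and -2

if __name__ == "__main__":
    lhs = semigroup(A, 0.8)
    rhs = mat_mul(semigroup(A, 0.3), semigroup(A, 0.5))
    print("S(0.8) vs S(0.3) S(0.5), max difference:", max_diff(lhs, rhs))
```

Since both eigenvalues of this matrix are negative, the example corresponds to the case ω < 0 in property (38a).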

If ω = 0, we have a bounded semigroup; for ω = 0 and M = 1 we have a contraction semigroup, and for ω < 0, we have a dissipative semigroup. In general, the abstract Cauchy problem:

$$\frac{dy}{dt} = Ay, \qquad y(0) = \xi \tag{39}$$

has a unique solution y(t) = S(t)ξ, t > 0, with y(t) ∈ D(A), provided ξ ∈ D(A). If ξ ∈ D(A) and f is any strongly continuously differentiable function with values f(t) ∈ X, then the inhomogeneous problem,

$$\frac{dy}{dt} = Ay + f, \qquad y(0) = \xi \tag{40}$$

has a unique continuously differentiable solution y given by:

$$y(t) = S(t)\xi + \int_0^t S(t - \theta) f(\theta)\,d\theta, \qquad \text{with } y(t) \in D(A) \text{ for all } t \ge 0 \tag{41}$$

A solution satisfying these conditions is called a classical solution. For control problems these conditions are rather too strong since, in general, we do not expect the controls, for example f(t), to be even continuous. Thus, there is a need for a broader definition and this is provided by the so-called mild solution.
Any function y: I → X having the integral representation (41) is called a mild solution of problem (40). In this regard we have the following general result.

Theorem 3. Suppose A is the generator of a c0-semigroup S(t), t ≥ 0, in X and let y0 ∈ X and $f \in L^p(I, X)$, 1 ≤ p ≤ ∞, where $L^p(I, X)$ is the space of strongly measurable functions on I taking values in a Banach space X with norm:

$$\|f\|_{L^p(I, X)} = \begin{cases} \left(\displaystyle\int_I |f(t)|_X^p\,dt\right)^{1/p}, & 1 \le p < \infty \\ \operatorname{ess\,sup}\{|f(t)|_X,\ t \in I\}, & p = \infty \end{cases}$$

Then, evolution Eq. (40) has a unique mild solution y given by:

$$y(t) = S(t)y_0 + \int_0^t S(t - \theta) f(\theta)\,d\theta \tag{42}$$

In this case y(t) does not necessarily belong to D(A). Another special but important class of strongly continuous semigroups S(t), t ≥ 0, that satisfies the property:

$$S(t)X \subset D(A) \qquad \text{for all } t > 0$$

is called the analytic (holomorphic, parabolic) semigroup. An analytic semigroup has the following properties:

(a) $S(t)X \subset D(A^n)$ for all integers n ≥ 0 and t > 0,
(b) S(t), $d^n S(t)/dt^n = A^n S(t)$ are all bounded operators for t > 0.

One reason for calling the analytic semigroups parabolic semigroups is that S(t), t ≥ 0, turns out to be the fundamental solution of certain parabolic evolution equations. Consider the differential operator,

$$L(x, D) = \sum_{|\alpha| \le 2m} a_\alpha(x)\,D^\alpha$$

and suppose it is strongly elliptic of order 2m and

$$B_j(x, D) = \sum_{|\alpha| \le m_j} b_{j,\alpha}\,D^\alpha, \qquad 0 \le j \le m - 1$$

is a set of normal boundary operators as defined earlier. Define

$$\begin{gathered} D(A) = W^{2m,p}(\Omega) \cap W_0^{m,p}(\Omega) \\ (A\phi)(x) = -L(x, D)\phi(x), \qquad \phi \in D(A) \end{gathered} \tag{43}$$

Then, under certain technical assumptions, A generates an analytic semigroup S(t), t ≥ 0, in the Banach space $X = L^p(\Omega)$. The initial boundary value problem,

$$\begin{gathered} \frac{\partial \phi}{\partial t}(t, x) + L(x, D)\phi(t, x) = f(t, x), \qquad t \in (0, T),\quad x \in \Omega \\ (D^\alpha \phi)(t, x) = 0, \qquad |\alpha| \le m - 1,\quad t \in (0, T),\quad x \in \partial\Omega \\ \phi(0, x) = \phi_0(x), \qquad x \in \Omega \end{gathered} \tag{44}$$

can be considered to be an abstract evolution equation in $X = L^p(\Omega)$ and be written as:

$$\frac{d\phi}{dt} = A\phi + f, \qquad \phi(0) = \phi_0$$

with mild solution given by:

$$\phi(t) = S(t)\phi_0 + \int_0^t S(t - \theta) f(\theta)\,d\theta \tag{45}$$

where $f \in L^p(I, X)$.

B. Stability

The question of stability is of significant interest to system scientists, since every physical system must be stable to function properly.
In this section, we present a Lyapunov-like result for the abstract evolution equation,

$$\frac{dy}{dt} = Ay \tag{46}$$

where A is the generator of a strongly continuous semigroup S(t), t ≥ 0, on H. Let $\mathcal{L}^+(H)$ denote the class of symmetric, positive, self-adjoint operators in H; that is,

$T \in \mathcal{L}(H)$, $T = T^*$, and (Tξ, ξ) > 0 for ξ (≠ 0) ∈ H. We can prove the following result.

Theorem 4. A necessary and sufficient condition for the system dy/dt = Ay to be exponentially stable in the Lyapunov sense (i.e., there exist M ≥ 1, β > 0 such that $\|y(t)\| \le M e^{-\beta t}$) is that the operator equation,

$$A^* Y + YA = -\Gamma \tag{47}$$

has a solution $Y \in \mathcal{L}^+(H)$ for each $\Gamma \in \mathcal{L}^+(H)$ with $(\Gamma\xi, \xi) \ge \gamma\,|\xi|_H^2$ for some γ > 0.

REMARK. Equation (47) is understood in the sense that the equality,

$$0 = (\Gamma\xi, \eta) + (A\xi, Y\eta) + (Y\xi, A\eta)$$

holds for all ξ, η ∈ D(A).

Corollary 5. If the autonomous system (46) is asymptotically stable, then the system,

$$\frac{dy}{dt} = Ay + f, \qquad t \ge 0 \tag{48}$$

is stable in the $L^p$ sense; that is, for every y0 ∈ H and input $f \in L^p(0, \infty; H)$, the output $y \in L^p(0, \infty; H)$ for all 1 ≤ p ≤ ∞ and in particular, for 1 ≤ p < ∞, y(t) → 0 as t → ∞.

We conclude this section with a remark on the solvability of Eq. (47). Let $\{\lambda_i\}$ be the eigenvalues of the operator A, each repeated as many times as its multiplicity requires, and let $\{\xi_i\}$ denote the corresponding eigenvectors, complete in H. Consider Eq. (47) and form

$$(A^* Y \xi_i, \xi_j) + (Y A \xi_i, \xi_j) = -(\Gamma\xi_i, \xi_j) \tag{49}$$

for all integers i, j ≥ 1. Clearly, from this equation, it follows that

$$(Y\xi_i, \xi_j) = -(\Gamma\xi_i, \xi_j)/(\lambda_i + \bar{\lambda}_j) \tag{50}$$

Hence, if $\lambda_i + \bar{\lambda}_j \ne 0$ for all i, j, Γ determines Y uniquely, and if $\lambda_i + \bar{\lambda}_i = 2\operatorname{Re}\lambda_i < 0$ for all i, then $Y \in \mathcal{L}^+(H)$. In other words, if system (46) is asymptotically stable then the operator equation (47) always has a positive solution.

C. System Identification

A system analyst may know the structure of the system (for example, the order of the differential operator and its type: parabolic, hyperbolic, etc.), but the parameters are not all known. In that case, the analyst must identify the unknown parameters from available information. Consider the natural system to be given by $\dot{y}^* = A(q^*)y^*$. Assume that $q^*$ is unknown to the observer but the observer can observe certain data $z^*$ from a Hilbert space K, the output space, which corresponds to the natural history $y^*$. The observer chooses a feasible set Q, where $q^*$ may possibly lie, and constructs the model system,

$$\begin{gathered} \dot{y}(q) = A(q)\,y(q), \qquad y(q)(0) = y_0 \\ z(q) = C y(q) \end{gathered} \tag{51}$$

where C is the observation or output operator, an element of $\mathcal{L}(H, K)$. The analyst may choose to identify the parameter approximately by minimizing the functional,

$$J(q) = \frac{1}{2}\int_0^T |z(q) - z^*|_K^2\,dt \qquad \text{over } Q \tag{52}$$

Similarly, one may consider identification of an operator appearing in system equations. For example, one may consider the system,

$$\begin{gathered} \frac{d^2 y}{dt^2} + Ay + B^* y = f \\ y(0) = y_0, \qquad \dot{y}(0) = y_1 \\ z = Cy \end{gathered} \tag{53}$$

where the operator A is known but the operator $B^*$ is unknown. One seeks an element B from a feasible set $P^0 \subset \mathcal{L}(V, V^*)$ so that

$$J(B) = \int_0^T g(t, y(B), \dot{y}(B))\,dt \tag{54}$$

is minimum, where g is a suitable measure of discrepancy between the model output, z(B) = Cy(B), and the observed data $z^*$ corresponding to the natural history $y^*$.
In general, one may consider the problem of identification of all the operators A, B including the data y0, y1, and f. For simplicity, we shall consider only the first two problems and present a couple of sample results. First, consider problem (51) and (52).

Theorem 6. Let the feasible set of parameters Q be a compact subset of a metric space and suppose for each q ∈ Q that A(q) is the generator of a strongly continuous contraction semigroup in H. Let

$$(\gamma I - A(q_n))^{-1} \to (\gamma I - A(q_0))^{-1} \tag{55}$$

in the strong operator topology for each γ > 0 whenever $q_n \to q_0$ in Q. Then there exists $q^0 \in Q$ at which J(q) attains its minimum.

Proof. The proof follows from the fact that under assumption (55) the semigroup $S_n(t)$, t ≥ 0, corresponding to $q_n$ strongly converges on compact intervals to the semigroup $S_0(t)$, t ≥ 0, corresponding to $q_0$. Therefore, J is continuous on Q and, Q being compact, it attains its minimum on Q.
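The identification problem (51) and (52) can be illustrated with a scalar toy system in which A(q) = −q, so the model semigroup is exp(−qt), and C is the identity. The sketch below approximates J(q) by the trapezoidal rule and minimizes it by grid search over a compact interval Q. The "true" parameter q* = 2, the interval Q = [0.5, 4], the horizon T = 3, and the names cost and identify are assumptions made purely for illustration.

```python
import math

def cost(q, q_true=2.0, T=3.0, n=600):
    """Trapezoidal approximation of J(q) = 1/2 * int_0^T |z(q) - z*|^2 dt."""
    dt = T / n
    total = 0.0
    for i in range(n + 1):
        t = i * dt
        err = math.exp(-q * t) - math.exp(-q_true * t)   # z(q)(t) - z*(t), with y0 = 1
        weight = 0.5 if i in (0, n) else 1.0             # trapezoid end-point weights
        total += weight * err * err * dt
    return 0.5 * total

def identify(q_min=0.5, q_max=4.0, n_grid=351):
    """Grid search for the minimizer of J over the compact set Q = [q_min, q_max]."""
    grid = [q_min + i * (q_max - q_min) / (n_grid - 1) for i in range(n_grid)]
    return min(grid, key=cost)

if __name__ == "__main__":
    q_hat = identify()
    print("estimated parameter:", q_hat, " cost:", cost(q_hat))
```

Grid search is used here only because Q is a one-dimensional interval; it mirrors the existence argument (a continuous J on a compact Q attains its minimum) rather than any particular algorithm from the article.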

The significance of the above result is that the identification problem is well posed.
For the second-order evolution equation (53) we have the following result.

Theorem 7. Consider system (53). Let $P^0$ be a compact (in the sense of strong operator topology) subset of the ball,

$$P_b \equiv \{B \in \mathcal{L}(V, V^*) : \|B\|_{\mathcal{L}(V, V^*)} \le b\}$$

Then, for each g defined on I × V × H which is measurable in the first variable and lower semicontinuous in the rest, the functional J(B) of Eq. (54) attains its minimum on $P^0$.

The best operator $B^0$ minimizing the functional J(B) can be determined by use of the following necessary conditions of optimality.

Theorem 8. Consider system (53) along with the functional,

$$J(B) \equiv \frac{1}{2}\int_0^T |Cy(B) - z^*(t)|_K^2\,dt$$

with the observed data $z^* \in L^2(I, K)$, the observer $C \in \mathcal{L}(H, K)$, $f \in L^2(I, H)$, y0 ∈ V, y1 ∈ H, and $P^0$ as in Theorem 7. Then, for $B^0$ to be optimal, it is necessary that there exists a pair {y, x} satisfying the equations,

$$\begin{gathered} \ddot{y} + Ay + B^0 y = f \\ \ddot{x} + A^* x + (B^0)^* x = C^* \Lambda_K (Cy(B^0) - z^*) \\ y(0) = y_0, \qquad x(T) = 0 \\ \dot{y}(0) = y_1, \qquad \dot{x}(T) = 0 \end{gathered} \tag{56}$$

and the inequality

$$\int_0^T \langle B^0 y(B^0), x \rangle_{V^*, V}\,dt \ge \int_0^T \langle B y(B^0), x \rangle_{V^*, V}\,dt \tag{57}$$

for all $B \in P^0$, where $\Lambda_K$ is the canonical isomorphism of K onto $K^*$ such that $\|\Lambda_K e\|_{K^*} = \|e\|_K$.

By solving Eqs. (56) and (57) simultaneously one can determine $B^0$. In fact, a gradient-type algorithm can be developed on the basis of Eqs. (56) and (57).

D. Controllability

Consider the controlled system,

$$\dot{\phi} = A\phi + Bu, \qquad t \ge 0 \tag{58}$$

with φ denoting the state and u the control. The operator A is the generator of a strongly continuous semigroup S(t), t ≥ 0, in a Banach space X and B is the control operator with values in $\mathcal{L}(E, X)$, where E is another Banach space. Let $\mathcal{U}$ denote the class of admissible controls, possibly a proper subset of $L^p_{loc}(E)$ ≡ locally pth power summable E-valued functions on $\mathbb{R}_0 = [0, \infty)$. For a given initial state φ0 ∈ X,

$$\phi(t) = S(t)\phi_0 + \int_0^t S(t - \theta) B u(\theta)\,d\theta, \qquad t \ge 0$$

denotes the mild (weak) solution of problem (58).
Given φ0 ∈ X and a desired target φ1 ∈ X, is it possible to find a control from $\mathcal{U}$ that transfers the system from state φ0 to the desired state φ1 in finite time? This is the basic question of controllability. In other words, for a given φ0 ∈ X, one defines the attainable set:

$$\mathcal{A}(t) \equiv \left\{x \in X : x = S(t)\phi_0 + \int_0^t S(t - \theta) B u(\theta)\,d\theta,\ u \in \mathcal{U}\right\}$$

and inquires if there exists a finite time τ ≥ 0, such that $\phi_1 \in \mathcal{A}(\tau)$ or equivalently,

$$\phi_1 - S(\tau)\phi_0 \in R(\tau) \equiv \mathcal{A}(\tau) - S(\tau)\phi_0$$

The set R(τ), given by:

$$R(\tau) \equiv \left\{\xi \in X : \xi = L_\tau u \equiv \int_0^\tau S(\tau - \theta) B u(\theta)\,d\theta,\ u \in \mathcal{U}\right\}$$

is called the reachable set. If S(t)B is a compact operator for each t ≥ 0, then R(τ) is compact, hence the given target may not be attainable. A similar situation arises if BE ⊂ D(A) and φ0 ∈ D(A) and φ1 ∉ D(A). As a result, an appropriate definition of controllability for ∞-dimensional systems may be formulated as follows.

Definition. System (58) is said to be controllable (exactly controllable) in the time interval [0, τ] if R(τ) is dense in X [R(τ) = X] and it is said to be controllable (exactly controllable) in finite time if $\bigcup_{\tau > 0} R(\tau)$ is dense in X [$\bigcup_{\tau > 0} R(\tau) = X$]. Note that for finite-dimensional systems, $X = \mathbb{R}^n$, $E = \mathbb{R}^m$, controllability and exact controllability are one and the same. It is only for ∞-dimensional systems that these concepts are different.

We present here a classical result assuming that both X and E are self-adjoint Hilbert spaces with $\mathcal{U} = L^2_{loc}(E)$.

Theorem 9. For system (58) the following statements are equivalent:

(a) System (58) is controllable in time τ
(b) $L_\tau L_\tau^* \in \mathcal{L}^+(X)$
(c) $\operatorname{Ker} L_\tau^* \equiv \{\xi \in X : L_\tau^* \xi = 0\} = \{0\}$

where $L_\tau^*$ is the adjoint of the operator $L_\tau$ and $L_\tau u \equiv \int_0^\tau S(\tau - \theta) B u(\theta)\,d\theta$.

Note that by our definition, here controllability means approximate controllability; that is, one can reach an arbitrary neighborhood of the target but not necessarily the target itself. Another interesting difference between finite- and infinite-dimensional systems is that in case $X = \mathbb{R}^n$, $E = \mathbb{R}^m$, condition (b) implies that $(L_\tau L_\tau^*)^{-1}$ exists and the control achieving the desired transfer from φ0 to φ1 is given by:

$$u = L_\tau^* \left(L_\tau L_\tau^*\right)^{-1} (\phi_1 - S(\tau)\phi_0)$$

For ∞-dimensional systems, the operator $L_\tau L_\tau^*$ does not in general have a bounded inverse even though the operator is positive.
Another distinguishing feature is that in the finite-dimensional case the system is controllable if and only if the rank condition,

$$\operatorname{rank}(B, AB, \ldots, A^{n-1}B) = n$$

holds. In the ∞-dimensional case there is no such condition; however, if $BE \subset \bigcap_{n=1}^{\infty} D(A^n)$ then the system is controllable if

$$\operatorname{closure}\left\{\bigcup_{n=0}^{\infty} \operatorname{range}(A^n B)\right\} = X \tag{59}$$

This condition is also necessary and sufficient if S(t), t ≥ 0, is an analytic semigroup and $BE \subset \bigcup_{t > 0} S(t)X$.
In recent years, much more general results that admit very general time-varying operators {A(t), B(t), t ≥ 0}, including hard constraints on controls, have been proved. We conclude this section with one such result. The system,

$$\dot{y} = A(t)y + B(t)u, \quad t \ge 0, \qquad y(0) = y_0 \tag{60}$$

is said to be globally null controllable if it can be steered to the origin from any initial state y0 ∈ X.

Theorem 10. Let X be a reflexive Banach space and Y a Banach space densely embedded in X with the injection Y ⊂ X continuous. For each t ≥ 0, A(t) is the generator of a c0-semigroup satisfying the stability condition and $A \in L^1_{loc}(0, \infty; \mathcal{L}(Y, X))$, where $\mathcal{L}(X, Z)$ is the space of bounded linear operators from a Banach space X to a Banach space Z; $\mathcal{L}(X) \equiv \mathcal{L}(X, X)$, $B \in L^q_{loc}(0, \infty; \mathcal{L}(E, X))$, and $\mathcal{U} = \{u \in L^p_{loc}(E);\ u(t) \in \Gamma \text{ a.e.}\}$, where Γ is a closed bounded convex subset of E with 0 ∈ Γ and $p^{-1} + q^{-1} = 1$. Then a necessary and sufficient condition for global null controllability of system (60) is that

$$\int_0^\infty H(B^*(t)\psi(t))\,dt = +\infty \tag{61}$$

for all nontrivial weak solutions of the adjoint system

$$\dot{\psi} + A^*(t)\psi = 0, \qquad t \ge 0$$

where $H(\xi) = \sup\{(\xi, e)_{E^*, E},\ e \in \Gamma\}$.

E. Existence of Optimal Controls

The question of existence of optimal controls is considered to be a fundamental problem in control theory. In this section, we present a simple existence result for the hyperbolic system,

$$\begin{gathered} \ddot{\phi} + A\phi = f + Bu, \qquad t \in I \equiv (0, T) \\ \phi(0) = \phi_0, \qquad \dot{\phi}(0) = \phi_1 \end{gathered} \tag{62}$$

Similar results hold for parabolic systems. Suppose the operator A and the data φ0, φ1, f satisfy the assumptions of Theorem 1. Let E be a real Hilbert space, $\mathcal{U}_0 \subset L^2(I, E)$ the class of admissible controls, and $B \in \mathcal{L}(E, H)$. Let $S_0 \subset V$ and $S_1 \subset H$ denote the sets of admissible initial states. By Theorem 1, for each choice of $\phi_0 \in S_0$, $\phi_1 \in S_1$, and $u \in \mathcal{U}_0$ there corresponds a unique solution φ called the response. The quality of the response is measured through a functional called the cost functional and may be given by an expression of the form,

$$J(\phi_0, \phi_1, u) \equiv \alpha \int_0^T g_1(t, \phi(t), \dot{\phi}(t))\,dt + \beta\,g_2(\phi(T), \dot{\phi}(T)) + \lambda\,g_3(u) \tag{63}$$

α + β > 0; α, β, λ ≥ 0, where $g_1$, $g_2$, and $g_3$ are suitable functions to be defined shortly. One may interpret $g_1$ to be a measure of discrepancy between a desired response and the one arising from the given policy {φ0, φ1, u}. The function $g_2$ is a measure of distance between a desired target and the one actually realized. The function $g_3$ is a measure of the cost of control applied to system (62). A more concrete expression for J will be given in the following section. Let $P \equiv S_0 \times S_1 \times \mathcal{U}_0$ denote the set of admissible policies or controls. The question is, does there exist a policy $p^0 \in P$ such that $J(p^0) \le J(p)$ for all p ∈ P? An element $p^0 \in P$ satisfying this property is called an optimal policy. A set of sufficient conditions for the existence of an optimal policy is given in the following result.

Theorem 11. Consider system (62) with the cost functional (63) and let $S_0$, $S_1$, and $\mathcal{U}_0$ be closed bounded convex subsets of V, H, and $L^2(I, E)$, respectively. Suppose for each (ξ, η) ∈ V × H, $t \to g_1(t, \xi, \eta)$ is measurable on I and, for each t ∈ I, the functions $(\xi, \eta) \to g_1(t, \xi, \eta)$ and $(\xi, \eta) \to g_2(\xi, \eta)$ are weakly lower semicontinuous on V × H and the function $g_3$ is weakly lower

semicontinuous on $L_2(I, E)$. Then, there exists an optimal policy,
$$p^0 = (\phi_0^0, \phi_1^0, u^0) \in P$$
Another problem of considerable interest is the question of existence of time-optimal controls. Consider system (33) in the form (34) with
$$f = Bu, \qquad \tilde{f} = \begin{pmatrix} 0 \\ Bu \end{pmatrix}$$
and solution given by:
$$\psi(t) = S(t)\psi_0 + \int_0^t S(t - \theta)\tilde{f}(\theta)\,d\theta \tag{35$'$}$$
where $S(t)$, $t \ge 0$, is the $c_0$-semigroup in $X \equiv V \times H$ with the generator $-\tilde{A}$ as given in Eq. (34). Here, one is given the initial and the desired final states $\psi_0, \psi_1 \in X$ and the set of admissible controls $\mathcal{U}_0$. Given that the system is controllable from state $\psi_0$ to $\psi_1$ in finite time, the question is, does there exist a control that does the transfer in minimum time? A control satisfying this property is called a time-optimal control. We now present a result of this kind.
Theorem 12. If $\mathcal{U}_0$ is a closed bounded convex subset of $L_2(I, E)$ and if systems (34) and (35$'$) are exactly controllable from the state $\psi_0$ to $\psi_1 \in X$, then there exists a time-optimal control.
Proof. Let $\psi(u)$ denote the response of the system corresponding to control $u \in \mathcal{U}_0$; that is,
$$\psi(u)(t) = S(t)\psi_0 + \int_0^t S(t - \theta)\begin{pmatrix} 0 \\ Bu(\theta) \end{pmatrix} d\theta$$
Let $\mathcal{U}_{00} \subset \mathcal{U}_0$ denote the set of all controls that transfer the system from state $\psi_0$ to state $\psi_1$ in finite time. Define $\Sigma = \{t \ge 0 : \psi(u)(0) = \psi_0,\ \psi(u)(t) = \psi_1,\ u \in \mathcal{U}_{00}\}$ and $\tau^* = \inf\{t \ge 0 : t \in \Sigma\}$. We show that there exists a control $u^* \in \mathcal{U}_{00}$ having the transition time $\tau^*$. Let $\{\tau_n\} \in \Sigma$ such that $\tau_n$ is nonincreasing and $\tau_n \to \tau^*$. Since $\tau_n \in \Sigma$ there exists a sequence $u_n \in \mathcal{U}_{00} \subset \mathcal{U}_0$ such that $\psi(u_n)(0) = \psi_0$ and $\psi(u_n)(\tau_n) = \psi_1$. Denote by $f_n$ the element $\begin{pmatrix} 0 \\ Bu_n \end{pmatrix}$. By virtue of our assumption, $\mathcal{U}_0$ is weakly compact, $B$ is bounded, and there exists a subsequence of the sequence $\{f_n\}$ relabeled as $\{f_n\}$ and
$$f^* = \begin{pmatrix} 0 \\ Bu^* \end{pmatrix} \in L_2(I, X)$$
with $u^* \in \mathcal{U}_0$ such that $f_n \to f^*$ weakly in $L_2(I, X)$. We must show that $u^* \in \mathcal{U}_{00}$. Clearly, by definition of $\{\tau_n\}$ we have
$$\psi_1 = \psi(u_n)(\tau_n) \equiv S(\tau_n)\psi_0 + \int_0^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta \quad \text{for all } n$$
Let $x^* \in X^* \equiv V^* \times H$, where $X^*$ is the dual of the Banach space $X$, which is the space of continuous linear functionals on $X$; for example, $X = L_p$, $1 \le p < \infty$, $X^* = L_q$ with $p^{-1} + q^{-1} = 1$. Then,
$$x^*(\psi_1) = x^*(S(\tau_n)\psi_0) + x^*\!\left(\int_0^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta\right) \tag{64}$$
By virtue of the $c_0$-property of the semigroup $S(t)$, $t \ge 0$,
$$\lim_{n \to \infty} x^*(S(\tau_n)\psi_0) = x^*(S(\tau^*)\psi_0) \tag{65}$$
Splitting the integral in Eq. (64) into two parts, we have
$$x^*\!\left(\int_0^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta\right) = x^*\!\left(S(\tau_n - \tau^*)\int_0^{\tau^*} S(\tau^* - \theta) f_n(\theta)\,d\theta\right) + x^*\!\left(\int_{\tau^*}^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta\right)$$
$$= \left\langle \int_0^{\tau^*} S(\tau^* - \theta) f_n(\theta)\,d\theta,\ S^*(\tau_n - \tau^*)x^* \right\rangle_{X, X^*} + x^*\!\left(\int_{\tau^*}^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta\right)$$
where $S^*$ is the dual of the operator $S$. $\mathcal{L}_u(X, Y)$ is the space of unbounded linear operators from $X$ into $Y$. Let $\{x^*, y^*\} \in X^* \times Y^*$ be the set of all pairs for which
$$\langle y^*, Sx \rangle_{Y^*, Y} = \langle x^*, x \rangle_{X^*, X}$$
for all $x \in D(S) \subset X$, where
$$x^*(x) = \langle x^*, x \rangle_{X^*, X} = \langle x, x^* \rangle_{X, X^*}$$
is the duality pairing between the elements $x \in X$ and $x^* \in X^*$, or the value of $x^*$ at $x$. If $D(S)$ is dense in $X$ (i.e., closure of $D(S) = X$), then the above relation determines uniquely the dual $S^*$ of $S$ and its domain $D(S^*) \subset Y^*$. If $D(S) = X$, then $S \in \mathcal{L}(X, Y)$ and $S^* \in \mathcal{L}(Y^*, X^*)$. Clearly,
$$\int_0^{\tau^*} S(\tau^* - \theta) f_n(\theta)\,d\theta \xrightarrow{w} \int_0^{\tau^*} S(\tau^* - \theta) f^*(\theta)\,d\theta \quad \text{in } X \tag{66}$$
and since $V$ and hence $V^*$ are all reflexive Banach spaces, $S^*$ is a $c_0$-semigroup in $X^*$ and consequently
$$S^*(\tau_n - \tau^*)x^* \xrightarrow{s} x^* \quad \text{in } X^* \tag{67}$$

Further, by the $c_0$-property of $S(t)$, $t \ge 0$, there exists a finite $M > 0$ such that
$$\left| x^*\!\left(\int_{\tau^*}^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta\right) \right| \le M \|x^*\|_{X^*} \left(\int_{\tau^*}^{\tau_n} \|f_n(\theta)\|_X^2\,d\theta\right)^{1/2} (\tau_n - \tau^*)^{1/2} \tag{68}$$
Since $\mathcal{U}_0$ is bounded, it follows from this that
$$\lim_{n \to \infty} x^*\!\left(\int_{\tau^*}^{\tau_n} S(\tau_n - \theta) f_n(\theta)\,d\theta\right) = 0$$
Using Eqs. (65) to (67) in (64) we obtain:
$$x^*(\psi_1) = x^*(S(\tau^*)\psi_0) + x^*\!\left(\int_0^{\tau^*} S(\tau^* - \theta) f^*(\theta)\,d\theta\right)$$
for all $x^* \in X^*$. Hence,
$$\psi_1 = S(\tau^*)\psi_0 + \int_0^{\tau^*} S(\tau^* - \theta) f^*(\theta)\,d\theta$$
and $u^* \in \mathcal{U}_{00}$. This completes the proof.
The method of proof of the existence of time-optimal controls presented above applies to much more general systems.

F. Necessary Conditions of Optimality
After the questions of controllability and existence of optimal controls are settled affirmatively, one is faced with the problem of determining the optimal controls. For this purpose, one develops certain necessary conditions of optimality and constructs a suitable algorithm for computing the optimal (extremal) controls. We present here necessary conditions of optimality for system (62) with a quadratic cost functional of the form,
$$J(u) \equiv \alpha \int_0^T (C\phi(t) - z_1(t), C\phi(t) - z_1(t))_{H_1}\,dt + \beta \int_0^T (D\dot\phi(t) - z_2(t), D\dot\phi(t) - z_2(t))_{H_2}\,dt + \gamma \int_0^T (N(t)u, u)_E\,dt \tag{69}$$
where $\alpha, \beta, \gamma > 0$. The output spaces, where observations are made, are given by two suitable Hilbert spaces $H_1$ and $H_2$ with output operators $C \in \mathcal{L}(H, H_1)$ and $D \in \mathcal{L}(H, H_2)$. The desired trajectories are given by $z_1 \in L_2(I, H_1)$ and $z_2 \in L_2(I, H_2)$. The last integral in Eq. (69) gives a measure of the cost of control with $N(t) \ge \delta I$ for all $t \in I$, with $\delta > 0$. We assume that $N(t) = N^*(t)$, $t \ge 0$. Our problem is to find the necessary and sufficient conditions an optimal control must satisfy. By Theorem 11, we know that, for the cost functional (69) subject to the dynamic constraint (62), an optimal control exists. Since in this case $J$ is strictly convex, there is, in fact, a unique optimal control. For characterization of optimal controls, the concept of Gateaux differentials plays a central role. A real-valued functional $f$ defined on a Banach space $X$ is said to be Gateaux differentiable at the point $x \in X$ in the direction $h \in X$ if
$$\lim_{\varepsilon \to 0} \frac{f(x + \varepsilon h) - f(x)}{\varepsilon} = f'(x, h) \tag{70}$$
exists. In general, $h \to f'(x, h)$ is a homogeneous functional, and, in case it is linear in $h$, we write:
$$f'(x, h) = (f'(x), h)_{X^*, X} \tag{71}$$
with the Gateaux derivative $f'(x) \in X^*$. Since the functional $J$, defined on the Hilbert space $L_2(I, E)$, is Gateaux differentiable and strictly convex, and the set of admissible controls $\mathcal{U}_0$ is a closed convex subset of $L_2(I, E)$, a control $u^0 \in \mathcal{U}_0$ is optimal if and only if
$$(J'(u^0), u - u^0) \ge 0 \quad \text{for all } u \in \mathcal{U}_0 \tag{72}$$
Using this inequality, we can develop the necessary conditions of optimality.
Theorem 13. Consider system (62) with the cost functional (69) and $\mathcal{U}_0$, a closed bounded convex subset of $L_2(I, E)$. For $u^0 \in \mathcal{U}_0$ to be optimal, it is necessary that there exists a pair
$$\{\phi^0, \psi^0\} \in C(\bar{I}, V) \times C(\bar{I}, V)$$
with
$$\{\dot\phi^0, \dot\psi^0\} \in C(\bar{I}, H) \times C(\bar{I}, H)$$
satisfying the equations:
$$\ddot\phi^0 + A\phi^0 = f + Bu^0, \qquad \phi^0(0) = \phi_0, \qquad \dot\phi^0(0) = \phi_1 \tag{73a}$$
$$\ddot\psi^0 + A^*\psi^0 + \int_t^T g_1\,d\theta + g_2 = 0, \qquad \psi^0(T) = 0, \qquad \dot\psi^0(T) = 0 \tag{73b}$$
$$g_1 = 2\alpha C^*\Lambda_1(C\phi^0 - z_1), \qquad g_2 = 2\beta D^*\Lambda_2(D\dot\phi^0 - z_2)$$
and the inequality

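$$\int_0^T \left(u - u^0,\ 2\gamma N u^0 + \Lambda_E^{-1} B^* \dot\psi^0\right)_E dt \ge 0 \tag{74}$$
for all $u \in \mathcal{U}_0$.
Proof. By taking the Gateaux differential of $J$ at $u^0$ in the direction $w$ we have
$$(J'(u^0), w) \equiv \int_0^T dt\,\left\{ 2\alpha\left(C\hat\phi(u^0, w),\ C\phi(u^0) - z_1\right)_{H_1} + 2\beta\left(D\dot{\hat\phi}(u^0, w),\ D\dot\phi(u^0) - z_2\right)_{H_2} + 2\gamma (w, N u^0)_E \right\} \tag{75}$$
where $\hat\phi(u^0, w)$ denotes the Gateaux differential of $\phi$ at $u^0$ in the direction $w$, which is given by the solution of
$$\ddot{\hat\phi}(u^0, w) + A\hat\phi(u^0, w) = Bw, \qquad \hat\phi(u^0, w)(0) = 0, \qquad \dot{\hat\phi}(u^0, w)(0) = 0 \tag{76}$$
Introducing the duality maps,
$$\Lambda_1 : H_1 \to H_1^*, \qquad \Lambda_2 : H_2 \to H_2^*$$
in expression (75) and defining
$$g_1(t, \phi(u^0)) \equiv 2\alpha C^*\Lambda_1\left(C\phi(u^0) - z_1(t)\right)$$
$$g_2(t, \phi(u^0)) \equiv 2\beta D^*\Lambda_2\left(D\dot\phi(u^0) - z_2(t)\right)$$
we obtain:
$$(J'(u^0), w) = \int_0^T dt\,\left\{ \left(\hat\phi(u^0, w), g_1\right)_H + \left(\dot{\hat\phi}(u^0, w), g_2\right)_H + (w, 2\gamma N u^0)_E \right\} \tag{77}$$
for all $w \in \mathcal{U}_0$.
Defining $\hat\phi_1 = \hat\phi$, $\hat\phi_2 = \dot{\hat\phi}$, one can rewrite Eq. (76) as a first-order evolution equation:
$$\frac{d}{dt}\begin{pmatrix} \hat\phi_1 \\ \hat\phi_2 \end{pmatrix} = \begin{pmatrix} 0 & I \\ -A & 0 \end{pmatrix}\begin{pmatrix} \hat\phi_1 \\ \hat\phi_2 \end{pmatrix} + \begin{pmatrix} 0 \\ Bw \end{pmatrix}, \qquad \begin{pmatrix} \hat\phi_1(0) \\ \hat\phi_2(0) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{78}$$
Then, by introducing the adjoint evolution equation,
$$\frac{d}{dt}\begin{pmatrix} p_1 \\ p_2 \end{pmatrix} = -\begin{pmatrix} 0 & -A^* \\ I & 0 \end{pmatrix}\begin{pmatrix} p_1 \\ p_2 \end{pmatrix} + \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}, \qquad \begin{pmatrix} p_1(T) \\ p_2(T) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{79}$$
one can easily verify from Eqs. (78) and (79) that
$$\int_0^T \left\{ (\hat\phi_1, g_1)_H + (\hat\phi_2, g_2)_H \right\} dt = -\int_0^T (Bw, p_2)_H\,dt \tag{80}$$
Using Eq. (80) in Eq. (77) and the duality map $\Lambda_E : E \to E^*$, we obtain, for $w = u - u^0$,
$$(J'(u^0), u - u^0) = \int_0^T \left(2\gamma N u^0 - \Lambda_E^{-1} B^* p_2,\ u - u^0\right)_E dt, \qquad u \in \mathcal{U}_0 \tag{81}$$
Defining $\psi^0(t) = \int_t^T p_2(\theta)\,d\theta$, one obtains the adjoint equation (73b), and the necessary inequality (74) follows from Eqs. (72) and (81).
REMARK 1. In case $\beta = 0$, the adjoint equation is given by a differential equation rather than the integro-differential equation (73b). That is, $p_2$ satisfies the equation:
$$\ddot p_2 + A^* p_2 + g_1 = 0, \qquad p_2(T) = 0, \qquad \dot p_2(T) = 0$$
and in Eq. (74) one may replace $\dot\psi^0$ by $-p_2$.
REMARK 2. In case of terminal observation, the cost functional (69) may be given by:
$$J(u) \equiv \alpha\|C\phi(T) - z_1\|_{H_1}^2 + \beta\|D\dot\phi(T) - z_2\|_{H_2}^2 + \gamma \int_0^T (N(t)u, u)_E\,dt \tag{82}$$
where $z_1 \in H_1$ and $z_2 \in H_2$.
In this case, the necessary conditions of optimality are given by:
$$\int_0^T \left(u - u^0,\ 2\gamma N u^0 - \Lambda_E^{-1} B^* p_2\right) dt \ge 0, \qquad u \in \mathcal{U}_0 \tag{83}$$
where $p_2$ satisfies the differential equation,
$$\frac{d^2 p_2}{dt^2} + A^* p_2 = 0, \qquad p_2(T) = g_2 \equiv 2\beta D^*\Lambda_2\left(D\dot\phi^0(T) - z_2\right), \qquad \dot p_2(T) = -g_1 \equiv -2\alpha C^*\Lambda_1\left(C\phi^0(T) - z_1\right) \tag{84}$$
In recent years several interesting necessary conditions of optimality for time-optimal control problems have been reported. We present here one such result. Suppose the system is governed by the evolution equation,
$$\frac{dy}{dt} = Ay + u$$
in a Banach space $X$, and let
$$\mathcal{U} \equiv \{u \in L^p_{loc}(R_0, X) : u(t) \in B_1\}$$
with $p > 1$, $B_1$ = unit ball in $X$, denote the class of admissible controls.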
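As a numerical aside on the proof of Theorem 13, the identity (80) can be checked on a finite-dimensional analog. The Python sketch below is only an illustration under stated assumptions: $H = R^n$ with the Euclidean inner product, $A$ a symmetric positive definite matrix, and the matrices `A`, `B` and the time functions `w`, `g1`, `g2` are arbitrary illustrative choices that do not come from the text. The helper names (`rk4`, `trapezoid`, `check_identity_80`) are hypothetical. The code integrates (78) forward and (79) backward with a classical Runge-Kutta scheme and evaluates both sides of (80) by the trapezoidal rule.

```python
import numpy as np

def rk4(rhs, y0, ts):
    """Classical Runge-Kutta on the grid ts; a decreasing grid integrates backward."""
    ys = np.empty((len(ts), len(y0)))
    ys[0] = y0
    for k in range(len(ts) - 1):
        t, h, y = ts[k], ts[k + 1] - ts[k], ys[k]
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2, y + h / 2 * k1)
        k3 = rhs(t + h / 2, y + h / 2 * k2)
        k4 = rhs(t + h, y + h * k3)
        ys[k + 1] = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return ys

def trapezoid(values, ts):
    """Composite trapezoidal rule for samples `values` on the increasing grid `ts`."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(ts)) / 2)

def check_identity_80(n=3, T=1.0, steps=4000, seed=0):
    """Return both sides of identity (80) for an illustrative matrix system."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)        # assumed symmetric positive definite A
    B = rng.standard_normal((n, 2))    # assumed control operator, E = R^2
    freqs = np.arange(1, n + 1)
    w = lambda t: np.array([np.sin(3 * t), np.cos(2 * t)])  # direction w(t)
    g1 = lambda t: np.cos(freqs * t)                         # illustrative g1(t)
    g2 = lambda t: 1.0 + np.sin(freqs * t)                   # illustrative g2(t)
    ts = np.linspace(0.0, T, steps + 1)

    # Eq. (78): d/dt (phi1, phi2) = (phi2, -A phi1 + B w), zero initial data.
    def forward(t, x):
        return np.concatenate([x[n:], -A @ x[:n] + B @ w(t)])

    # Eq. (79): d/dt (p1, p2) = (A^T p2 + g1, -p1 + g2), zero terminal data.
    def adjoint(t, p):
        return np.concatenate([A.T @ p[n:] + g1(t), -p[:n] + g2(t)])

    x = rk4(forward, np.zeros(2 * n), ts)
    p = rk4(adjoint, np.zeros(2 * n), ts[::-1])[::-1]  # reorder to increasing time

    G1 = np.array([g1(t) for t in ts])
    G2 = np.array([g2(t) for t in ts])
    BW = np.array([B @ w(t) for t in ts])
    lhs = trapezoid(np.sum(x[:, :n] * G1 + x[:, n:] * G2, axis=1), ts)
    rhs = -trapezoid(np.sum(BW * p[:, n:], axis=1), ts)
    return lhs, rhs

lhs, rhs = check_identity_80()
print(lhs, rhs)  # the two values should agree up to discretization error
```

The agreement rests on the same integration by parts as in the proof: the boundary terms vanish because the linearized state starts at zero and the adjoint state ends at zero.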

Theorem 14 (Maximum Principle). Suppose $A$ is the generator of a $c_0$-semigroup $S(t)$, $t \ge 0$, in $X$ and there exists a $t > 0$ such that $S(t)X = X$. Let $y_0, y_1 \in X$, and suppose $u^0$ is the time-optimal control, with transition time $\tau$, that steers the system from the initial state $y_0$ to the final state $y_1$. Then there exists an $x^* \in X^*$ (= dual of $X$) such that
$$\langle S^*(\tau - t)x^*, u^0(t) \rangle_{X^*, X} = \sup\{\langle S^*(\tau - t)x^*, e \rangle,\ e \in B_1\} = |S^*(\tau - t)x^*|_{X^*} \tag{85}$$
Suppose $X$ is a reflexive Banach space and there exists a continuous map $v : X^* \setminus \{0\} \to X$ such that, for $\xi^* \in X^*$,
$$|v(\xi^*)|_X = 1 \quad \text{and} \quad \langle \xi^*, v(\xi^*) \rangle_{X^*, X} = |\xi^*|_{X^*};$$
then the optimal control is bang-bang and is given by $u^0(t) = v(S^*(\tau - t)x^*)$, and it is unique if $X$ is strictly convex.
Maximum principles for more general control problems are also available.

G. Computational Methods
In order to compute the optimal controls one is required to solve simultaneously the state and adjoint equations (73a) and (73b), along with inequality (74). In case the admissible control set $\mathcal{U}_0 \equiv L_2(I, E)$, the optimal control has the form:
$$u^0 = -(1/2\gamma) N^{-1} \Lambda_E^{-1} B^* \dot\psi^0$$
Substituting this expression in Eqs. (73a) and (73b), one obtains a system of coupled evolution equations, one with initial conditions and the other with final conditions specified. This is a two-point boundary-value problem in an $\infty$-dimensional space, which is a difficult numerical problem. However, in general one can develop a gradient-type algorithm to compute an approximating sequence of controls converging to the optimal. The required gradient is obtained from the necessary condition (74). The solutions for the state and adjoint equations, (73a) and (73b), can be obtained by use of any of the standard techniques for solving partial differential equations, for example, the finite difference, finite element, or Galerkin method.
In order to use the result of Theorem 13 to compute the optimal controls, one chooses an arbitrary control $u^1 \in \mathcal{U}_0$ and solves Eq. (73a) to obtain $\phi^1$, which is then used in Eq. (73b) to obtain $\psi^1$. Then, on the basis of Eq. (74) one takes
$$J'(u^1) = 2\gamma N u^1 + \Lambda_E^{-1} B^* \dot\psi^1$$
as the gradient at $u^1$ and constructs a new control $u^2 = u^1 - \varepsilon_1 J'(u^1)$, with $\varepsilon_1 > 0$ sufficiently small so that $J(u^2) \le J(u^1)$. This way one obtains a sequence of approximating controls,
$$u^{n+1} = u^n - \varepsilon_n J'(u^n)$$
with $J(u^{n+1}) \le J(u^n)$, $n = 1, 2, \ldots$. In practical applications, a finite number of iterations produces a fairly good approximation to the optimal control.

H. Stochastic Evolution Equations
In this section we present a very brief account of stochastic linear systems. Let $(\Omega, \mathcal{F}, P)$ denote a complete probability space and $\mathcal{F}_t$, $t \ge 0$, a nondecreasing family of right-continuous, completed sub-$\sigma$-algebras of the $\sigma$-algebra $\mathcal{F}$; that is, $\mathcal{F}_s \subset \mathcal{F}_t$ for $0 \le s \le t$. Let $H$ be a real separable Hilbert space and $\{W(t), t \ge 0\}$ an $H$-valued Wiener process characterized by the properties:
(a) $P\{W(0) = 0\} = 1$,
(b) $W(t)$, $t \ge 0$, has independent increments over disjoint intervals, and
(c) $E\{e^{i(W(t), h)}\} = \exp[-(t/2)(Qh, h)]$,
where $E\{\cdot\} \equiv \int_\Omega \{\cdot\}\,dP$ and $Q \in \mathcal{L}^+(H)$, the space of positive self-adjoint bounded operators in $H$.
Symbolically, a stochastic linear differential equation is written in the form
$$dy = Ay\,dt + \sigma(t)\,dW(t), \qquad t \ge 0, \qquad y(0) = y_0$$
where $A$ is the generator of a $c_0$-semigroup $S(t)$, $t \ge 0$, in $H$ and $\sigma \in \mathcal{L}(H)$ and $y_0$ is an $H$-valued random variable independent of the Wiener process. The solution $y$ is given by:
$$y(t) = S(t)y_0 + \int_0^t S(t - \theta)\sigma(\theta)\,dW(\theta), \qquad t \ge 0$$
Under the given assumptions one can easily show that $E|y(t)|_H^2 < \infty$ for finite $t$, whenever $E|y_0|_H^2 < \infty$, and further $y \in C(I, H)$ a.s. (a.s. $\equiv$ with probability one). In fact, $\{y(t), t \ge 0\}$ is an $\mathcal{F}_t$-Markov random process and one can easily verify that
$$E\{y(t) \mid \mathcal{F}_\tau\} = S(t - \tau)y(\tau)$$
for $0 \le \tau \le t$. The covariance operator $C(t)$, $t \ge 0$, for the process $y(t)$, $t \ge 0$, defined by:
$$(C(t)h, g) = E\{(y(t) - Ey(t), h)(y(t) - Ey(t), g)\}$$
is then given by:

$$(C(t)h, g) = \int_0^t \left(S(t - \theta)\sigma(\theta) Q \sigma^*(\theta) S^*(t - \theta)h,\ g\right) d\theta + (S(t)C_0 S^*(t)h, g)$$
Denoting the positive square root of the operator $Q$ by $\sqrt{Q}$ we have
$$(C(t)h, h) = \int_0^t \left|\sqrt{Q}\,\sigma^*(\theta) S^*(t - \theta)h\right|_H^2 d\theta + \left(C_0 S^*(t)h, S^*(t)h\right)$$
This shows that $C(t) \in \mathcal{L}^+(H)$ if and only if the condition,
$$S^*(t)h = 0, \quad \text{or} \quad \sigma^*(\theta)S^*(t - \theta)h \equiv 0, \quad 0 \le \theta \le t$$
implies $h = 0$. This is precisely the condition for (approximate) controllability as seen in Theorem 9. Hence, the process $y(t)$, $t \ge 0$, is a nonsingular $H$-valued Gaussian process if and only if the system is controllable. Similar results hold for more general linear evolution equations on Banach spaces with operators $A$ and $\sigma$ both time varying.
Linear stochastic evolution equations of the given form arise naturally in the study of nonlinear filtering of ordinary Ito stochastic differential equations in $R^n$. Such equations are usually written in the weak form,
$$d\pi_t(f) = \pi_t(Af)\,dt + \pi_t(\sigma f)\,dW_t, \qquad t \ge 0, \qquad f \in D(A)$$
where $A$ is a second-order partial differential operator and $\{W(t), t \ge 0\}$ is an $R^d$-valued Wiener process ($d \le n$). One looks for solutions $\pi_t$, $t \ge 0$, that belong to the Banach space of bounded Borel measures satisfying $\pi_t(f) \ge 0$ for $f \ge 0$. Questions of existence of solutions for stochastic systems of the form,
$$dy = (A(t)y + f(t))\,dt + \sigma(t)\,dW, \qquad y(0) = y_0, \qquad t \in (0, T)$$
$$dz = -((A^*(t) - B(t))z + g)\,dt + \sigma(t)\,dW, \qquad z(T) = 0, \qquad t \in (0, T)$$
have been studied in the context of control theory. A fundamental problem arises in the study of the second equation and it has been resolved by an approach similar to the Lax–Milgram theorem in Hilbert space. Questions of existence and stability of solutions of nonhomogeneous boundary value problems of the form,
$$\frac{\partial y}{\partial t} + A(t)y = f(t) \ \text{ on } I \times \Omega, \qquad By = g(t) \ \text{ on } I \times \partial\Omega, \qquad y(0) = y_0 \ \text{ on } \Omega$$
have been studied where $f$ and $g$ have been considered as generalized random processes and $y_0$ as a generalized random variable. Stability of similar systems with $g$ replaced by a generalized white noise process has been considered recently. Among other things, it has been shown that $y(t)$, $t \ge 0$, is a Feller process on $H$, a Hilbert space, and there exists a Feller semigroup $T_t$, $t \ge 0$, on $C_b(H)$, a space of bounded continuous functions on $H$, whose dual $U_t$ determines the flow $\mu_t = U_t\mu_0$, $t \ge 0$, of the measure induced by $y(t)$ on $H$. The flow $\mu_t$, $t \ge 0$, satisfies the differential equation:
$$\frac{d}{dt}\mu_t(f) = \mu_t(Gf), \qquad t \ge 0$$
for all $f \in D(G) \subset C_b(H)$, where $G$ is the infinitesimal generator of the semigroup $T_t$, $t \ge 0$. Optimal control problems for a class of very general linear stochastic systems of the form,
$$dy = (A(t)y + B(t)u)\,dt + \sigma(t)\,dW(t)$$
$$J(u) = E \int_0^T \left\{ |Cy - z_d|^2 + (Nu, u) \right\} dt \equiv \min$$
have been considered in the literature, giving results on the existence of optimal controls and necessary conditions of optimality, including feedback controls. In this work, $A$, $B$, $\sigma$, $C$, and $N$ were considered as operator-valued random processes.

III. NONLINEAR EVOLUTION EQUATIONS AND DIFFERENTIAL INCLUSIONS

The two major classes of nonlinear systems that occupy most of the literature are the semilinear and quasilinear systems. However, control theory for such systems has not been fully developed; in fact, the field is wide open. In this section, we shall sample a few results.

A. Existence of Solutions, Nonlinear Semigroup
We consider the questions of existence of solutions for the two major classes of systems, semilinear and quasilinear. In their abstract form we can write them as
$$\frac{d\phi}{dt} + A(t)\phi = f(t, \phi) \quad \text{(semilinear)} \tag{86}$$
$$\frac{d\phi}{dt} + A(t, \phi)\phi = f(t, \phi) \quad \text{(quasilinear)} \tag{87}$$
and consider them as evolution equations on a suitable state space, which is generally a topological space having the structure of a Banach space, or a suitable manifold therein.
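As a numerical aside on Section H, the covariance formula and its link with controllability can be illustrated in finite dimensions. The Python sketch below is a minimal illustration under stated assumptions, not a general implementation: $H = R^2$, $A$ is a symmetric matrix (so that $S(t) = e^{tA}$ can be formed from an eigendecomposition), $\sigma$ is constant, $Q = I$, and $C_0 = 0$. The function names `semigroup`, `covariance`, and `kalman_rank`, and all numerical values, are hypothetical choices. In this setting $C(t)$ is the controllability Gramian of the pair $(A, \sigma\sqrt{Q})$, so it is nonsingular exactly when the Kalman rank condition holds.

```python
import numpy as np

def semigroup(A, t):
    """S(t) = exp(tA) for a symmetric matrix A, via its eigendecomposition."""
    lam, V = np.linalg.eigh(A)
    return (V * np.exp(lam * t)) @ V.T

def covariance(A, sigma, Q, t, C0=None, n=2000):
    """Trapezoidal approximation of
    C(t) = int_0^t S(t-s) sigma Q sigma^T S(t-s)^T ds + S(t) C0 S(t)^T."""
    thetas = np.linspace(0.0, t, n + 1)
    noise = sigma @ Q @ sigma.T
    terms = np.array([semigroup(A, t - s) @ noise @ semigroup(A, t - s).T
                      for s in thetas])
    h = t / n
    C = h * (terms[0] / 2 + terms[1:-1].sum(axis=0) + terms[-1] / 2)
    if C0 is not None:
        St = semigroup(A, t)
        C = C + St @ C0 @ St.T
    return C

def kalman_rank(A, G):
    """Rank of [G, AG, ..., A^(n-1) G]; full rank means (A, G) is controllable."""
    n = A.shape[0]
    blocks = [np.linalg.matrix_power(A, k) @ G for k in range(n)]
    return np.linalg.matrix_rank(np.hstack(blocks))

A = np.diag([-1.0, -2.0])                        # assumed symmetric generator
Q = np.eye(2)                                    # assumed Wiener covariance
sigma_ok = np.array([[1.0, 0.0], [1.0, 0.0]])    # noise reaches both modes
sigma_bad = np.array([[1.0, 0.0], [0.0, 0.0]])   # noise never reaches mode 2

C_ok = covariance(A, sigma_ok, Q, t=1.0)
C_bad = covariance(A, sigma_bad, Q, t=1.0)
# With Q = I, the pair to test is (A, sigma).
print(kalman_rank(A, sigma_ok), np.linalg.eigvalsh(C_ok).min())
print(kalman_rank(A, sigma_bad), np.linalg.eigvalsh(C_bad).min())
```

In the first case the rank is full and the smallest eigenvalue of $C(1)$ is strictly positive; in the second the rank drops and $C(1)$ is singular, which mirrors the statement that the process is nonsingular if and only if the system is controllable.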

For example, let us consider the semilinear parabolic equation with mixed initial and boundary conditions:
$$\frac{\partial\phi}{\partial t} + L\phi = g(t, x; \phi, D\phi, \ldots, D^{2m-1}\phi), \qquad (t, x) \in I \times \Omega \equiv Q$$
$$\phi(0, x) = \phi_0(x), \qquad x \in \Omega \tag{88}$$
$$D_\nu^k \phi = 0, \qquad 0 \le k \le m - 1, \qquad (t, x) \in I \times \partial\Omega$$
where
$$(L\phi)(t, x) \equiv \sum_{|\alpha| \le 2m} a_\alpha(t, x) D^\alpha \phi \tag{89}$$
and $D_\nu^k \phi = \partial^k\phi / \partial\nu^k$ denotes the $k$th derivative in the direction of the normal $\nu$ to $\partial\Omega$. We assume that $L$ is strongly elliptic with principal coefficients $a_\alpha$, $|\alpha| = 2m$, in $C(\bar{Q})$ and the lower order coefficients $a_\alpha$, $|\alpha| \le 2m - 1$, in $L_\infty(Q)$, and further that they are all Hölder continuous in $t$ uniformly on $\bar\Omega$. Let $1 < p < \infty$ and define the operator-valued function $A(t)$, $t \in I$, by:
$$D(A(t)) \equiv \{\psi \in X = L_p(\Omega) : (L\psi)(t, \cdot) \in X \text{ and } D_\nu^k \psi \equiv 0 \text{ on } \partial\Omega,\ 0 \le k \le m - 1\} \tag{90}$$
The domain of $A(t)$ is constant and is given by:
$$D \equiv W^{2m, p} \cap W_0^{m, p} \tag{91}$$
Then, one can show that for each $t \in I$, $-A(t)$ is the generator of an analytic semigroup and there exists an evolution operator $U(t, \tau) \in \mathcal{L}(X)$, $0 \le \tau \le t \le T$, that solves the abstract Cauchy problem:
$$\frac{dy}{dt} + A(t)y = 0, \qquad y(0) = y_0 \tag{92}$$
for each $y_0 \in X$; that is, $y(t) = U(t, 0)y_0$, $t \in I$, with $y \in C(\bar{I}, X)$ and $\dot{y} \in C((0, T], X)$. In general, if $f \in L_p(I, X)$, then $y$, given by:
$$y(t) = U(t, 0)y_0 + \int_0^t U(t, \tau) f(\tau)\,d\tau \tag{93}$$
is a mild solution of the Cauchy problem,
$$\frac{dy}{dt} + A(t)y = f, \qquad y(0) = y_0 \tag{94}$$
This would be the generalized (weak) solution of the parabolic initial boundary value problem (88) if $g$ were replaced by $f \in L_p(I, X)$. In order to solve problem (88), one must introduce an operator $f$ such that
$$f(t, u) \equiv g(t, \cdot; u, D^1 u, \ldots, D^{2m-1} u) \in X \quad \text{a.e.}$$
for $u$ in a suitable subspace $Y$ with $D(A) \subset Y \subset X$. Problem (88) can then be considered as an abstract Cauchy problem,
$$\frac{d\phi}{dt} + A(t)\phi = f(t, \phi), \qquad \phi(0) = \phi_0 \tag{95}$$
In view of Eq. (93), a mild solution of Eq. (95) is given by a solution of the integral equation,
$$\phi(t) = U(t, 0)\phi_0 + \int_0^t U(t, \theta) f(\theta, \phi(\theta))\,d\theta \tag{96}$$
if one exists. Defining the operator $G$ by:
$$(G\phi)(t) \equiv U(t, 0)\phi_0 + \int_0^t U(t, \theta) f(\theta, \phi(\theta))\,d\theta$$
one then looks for a fixed point for $G$, that is, an element $\phi$ such that $\phi = G\phi$. Using a priori estimates, the most difficult part of the program, one can establish the existence of a solution by use of a suitable fixed-point theorem, for example, the Banach, Schauder, or Leray–Schauder fixed-point theorems. We state the following result without proof.
Theorem 15. Consider the semilinear parabolic problem (88) in the abstract form (95) and suppose $A$ generates the evolution operator $U(t, \tau)$, $0 \le \tau \le t \le T$, and $f$ satisfies the properties:
(F1)
$$\|f(t, u)\|_X \le c\{1 + \|A^\beta(t)u\|_X\}, \qquad t \in I \tag{97}$$
for constants $c > 0$, $0 \le \beta < 1$ and $u \in D(A^\beta)$,
(F2)
$$\|f(t, u) - f(t, w)\|_X \le C\|A^\beta(t)u - A^\beta(t)w\|_X^\rho, \qquad t \in I \tag{98}$$
for some $0 < \rho \le 1$, $u, w \in D(A^\beta)$.
Then, Eq. (95) has a mild solution $\phi \in C(I, X)$, hence the semilinear parabolic equation (88) has a generalized solution. The solution is unique if $\rho = 1$.
REMARK 1. Condition (F1) is satisfied if the function $g$ satisfies the growth condition:
$$|g(t, x, u, Du, \ldots, D^{2m-1}u)| \le k\Big\{1 + \sum_{j=0}^{2m-1} |D^j u|^{r_j}\Big\}$$
for $0 \le r_j \le (2m + n/q)/(j + n/q)$, $1 < q < \infty$, and a number $\beta$ satisfying $(2m - 1)/2m < \beta < 1$. In the case

$\rho = 1$, condition (F2) is satisfied if the function $g$ is Lipschitz in the last $2m$ variables uniformly with respect to $(t, x) \in I \times \Omega$.
If the coefficients $\{a_\alpha, |\alpha| \le 2m\}$ in Eq. (89) are also dependent on $\phi$ so that $a_\alpha = a_\alpha(t, x; \phi, D^1\phi, \ldots, D^{2m-1}\phi)$, then system (88) becomes a quasilinear system, and it is parabolic if
$$(-1)^m \operatorname{Re}\Big\{\sum_{|\alpha| = 2m} a_\alpha(t, x; \eta)\xi^\alpha\Big\} \ge c|\xi|^{2m}, \qquad c > 0 \tag{99}$$
for $(t, x) \in \bar{Q}$ and $\eta \in R^N$, where $N = \sum_{j=0}^{2m-1} N_j$ with $N_j$ denoting the number of terms representing derivatives of order exactly $j$ appearing in the arguments of $\{a_\alpha\}$. System (88) then takes the form:
$$\frac{d\phi}{dt} + A(t, \phi)\phi = f(t, \phi), \qquad \phi(0) = \phi_0 \tag{100}$$
This problem is again solved by use of a priori estimates and a fixed-point theorem under the following assumptions:
(A1) The operator $A_0 = A(0, \phi_0)$ is a closed operator with domain $D$ dense in $X$ and
$$\|(\lambda I - A_0)^{-1}\| \le k/(1 + |\lambda|)$$
for all $\lambda$, $\operatorname{Re}\lambda \le 0$.
(A2) $A_0^{-1}$ is a completely continuous operator; that is, it is continuous in $X$ and maps bounded sets into compact subsets of $X$.
(A3) There exist numbers $\varepsilon$, $\rho$ satisfying $0 < \varepsilon \le 1$, $0 < \rho \le 1$, such that for all $t, \tau \in I$,
$$\|(A(t, u) - A(\tau, w))A^{-1}(\tau, w)\| \le k_R\{|t - \tau|^\varepsilon + \|A_0^\beta u - A_0^\beta w\|^\rho\}$$
for all $u$, $w$ such that $\|A_0^\beta u\|, \|A_0^\beta w\| < R$ with $k_R$ possibly depending on $R$.
(F1) For all $t, \tau \in I$,
$$\|f(t, v) - f(t, w)\| \le k_R\|A_0^\beta v - A_0^\beta w\|^\rho$$
for all $v, w \in X$ such that $\|A_0^\beta v\|, \|A_0^\beta w\| \le R$.
Theorem 16. Under assumptions (A1) to (A3) and (F1) there exists a $t^* \in (0, T)$ such that Eq. (100) has at least one mild solution $\phi \in C([0, t^*], X)$ for each $\phi_0 \in D(A_0^\beta)$ with $\|A_0^\beta \phi_0\| \le R$. Further, if $f$ also satisfies the Hölder condition in $t$, the solution is $C^1$ in $t \in (0, t^*]$. If $\rho = 1$, the solution is unique.
Proof. We discuss the outline of a proof. The differential equation (100) is converted into an integral equation and then one shows that the integral equation has a solution. Let $v \in C([0, t^*], X)$ and define:
$$A^v(t) \equiv A(t, A_0^{-\beta}v(t)), \qquad f^v(t) \equiv f(t, A_0^{-\beta}v(t))$$
and consider the linear system,
$$\frac{dy}{dt} + A^v(t)y = f^v(t), \quad t \in [0, t^*], \qquad y(0) = \phi_0 \tag{101}$$
By virtue of assumptions (A1) to (A3), $-A^v(t)$, $t \in [0, t^*]$, is the generator of an evolution operator $U^v(t, \tau)$, $0 \le \tau \le t \le t^*$. Hence, the system has a mild solution given by:
$$y^v(t) = U^v(t, 0)\phi_0 + \int_0^t U^v(t, \theta) f^v(\theta)\,d\theta, \qquad 0 \le t \le t^* \tag{102}$$
Defining an operator $G$ by setting
$$(Gv)(t) \equiv A_0^\beta U^v(t, 0)\phi_0 + \int_0^t A_0^\beta U^v(t, \theta) f^v(\theta)\,d\theta \tag{103}$$
one then looks for a fixed point of the operator $G$, that is, an element $v^* \in C([0, t^*], X)$ such that $v^* = Gv^*$. In fact, one shows, under the given assumptions, that for sufficiently small $t^* \in I$, there exists a closed convex set $K \subset C([0, t^*], X)$ such that $GK \subset K$ and $GK$ is relatively compact in $C([0, t^*], X)$ and hence, by the Schauder fixed-point theorem (which is precisely as stated), the equation $v = Gv$ has a solution $v^* \in K$. The solution (mild) of the original problem (100) is then given by $\phi^* = A_0^{-\beta}v^*$. This is a genuine (strong) solution if $f$ is also Hölder continuous in $t$. If $\rho = 1$, $G$ has the contraction property and the solution is unique.
According to our assumptions, for each $t \in [0, t^*]$ and $y \in D(A_0^\beta)$, the operator $A(t, y)$ is the generator of an analytic semigroup. This means that Theorem 16 can handle only parabolic problems and excludes many physical problems arising from hydrodynamics and wave propagation phenomena, including the semilinear and quasilinear symmetric hyperbolic systems discussed in Section I. This limitation is overcome by allowing $A(t, y)$, for each $t$, $y$ in a suitable domain, to be the generator of a $c_0$-semigroup rather than an analytic semigroup. The fundamental assumptions required are:
(H1) $X$ is a reflexive Banach space with $Y$ being another Banach space which is continuously and densely

embedded in $X$ and there is an isometric isomorphism $S$ of $Y$ onto $X$.
(H2) For $t \in [0, T]$, $y \in W \equiv$ an open ball in $Y$, $A(t, y)$ is the generator of a $c_0$-semigroup in $X$.
(H3) For $t, y \in [0, T] \times W$,
$$(SA(t, y) - A(t, y)S)S^{-1} = B(t, y) \in \mathcal{L}(Y, X)$$
and $\|B(t, y)\|_{\mathcal{L}(Y, X)}$ is uniformly bounded on $I \times Y$.
Theorem 17. Under assumptions (H1) to (H3) and certain Lipschitz and boundedness conditions for $A$ and $f$ on $[0, T] \times W$, the quasilinear system (100) has, for each $\phi_0 \in W$, a unique solution,
$$\phi \in C([0, t^*), W) \cap C^1([0, t^*), X) \quad \text{for some } 0 < t^* \le T$$
The proof of this result is also given by use of a fixed-point theorem but without invoking the operator $A_0$. Here, for any $v \in C([0, t^*), W)$, one defines $A^v(t) = A(t, v(t))$, $f^v(t) = f(t, v(t))$ and $U^v(t, \tau)$, $0 \le \tau \le t \le T$, the evolution operator corresponding to $-A^v$, and constructs the operator $G$ by setting
$$(Gv)(t) = U^v(t, 0)\phi_0 + \int_0^t U^v(t, \tau) f^v(\tau)\,d\tau$$
where the expression on the right-hand side is the mild solution of the linear equation (101). Any $v^*$ satisfying $v^* = Gv^*$ is a mild solution of the original problem (100).
From the preceding results it is clear that the solutions are defined only over a subinterval $(0, t^*) \subset (0, T)$ and they may actually blow up at time $t^*$. Mathematically this is explained through the existence of singularities, which physically correspond to the occurrence of, for example, turbulence or shocks in hydrodynamic problems. However, global solutions are defined for systems governed by differential equations with monotone operators. We present a few general results of this nature.
Let $X$ be a Banach space with dual $X^*$ and suppose $A$ is an operator from $D(A) \subset X$ to $X^*$. The operator $A$ is said to be monotone if
$$(Ax - Ay, x - y)_{X^*, X} \ge 0 \quad \text{for all } x, y \in D(A) \tag{104}$$
It is said to be demicontinuous on $X$ if
$$Ax_n \xrightarrow{w} Ax_0 \text{ in } X^* \quad \text{whenever} \quad x_n \xrightarrow{s} x_0 \text{ in } X \tag{105}$$
And, it is said to be hemicontinuous on $X$ if
$$A(x + \theta y) \xrightarrow{w} Ax \text{ in } X^* \quad \text{whenever} \quad \theta \to 0 \tag{106}$$
Let $H$ be a real Hilbert space and $V$ a subset of $H$ having the structure of a reflexive Banach space with $V$ dense in $H$. Let $V^*$ denote the (topological) dual of $V$ and suppose $H$ is identified with its dual $H^*$. Then we have $V \subset H \subset V^*$. Using the theory of monotone operators we can prove the existence of solutions for nonlinear evolution equations of the form,
$$\frac{d\phi}{dt} + B(t)\phi = f, \quad t \in I = (0, T), \qquad \phi(0) = \phi_0 \tag{107}$$
where $B(t)$, $t \in I$, is a family of nonlinear monotone operators from $V$ to $V^*$.
Theorem 18. Consider system (107) and suppose the operator $B$ satisfies the conditions:
(B1) $B : L_p(I, V) \to L_q(I, V^*)$ is hemicontinuous:
$$\|B\phi\|_{L_q(I, V^*)} \le K_1\left(1 + \|\phi\|_{L_p(I, V)}^{p-1}\right) \tag{108}$$
where $K_1 > 0$, and $1 < p, q < \infty$, $1/p + 1/q = 1$.
(B2) For all $\phi, \psi \in L_p(I, V)$,
$$(B\phi - B\psi, \phi - \psi)_{L_q(I, V^*), L_p(I, V)} \ge 0 \tag{109}$$
That is, $B$ is a monotone operator from $L_p(I, V)$ to $L_q(I, V^*)$.
(B3) There exists a nonnegative function $C : \mathbb{R} \to \bar{R}$ with $C(\xi) \to +\infty$ as $\xi \to \infty$ such that for $\psi \in L_p(I, V)$,
$$(B\psi, \psi)_{L_q(I, V^*), L_p(I, V)} \ge C(\|\psi\|)\|\psi\| \tag{110}$$
Then for each $\phi_0 \in H$ and $f \in L_q(I, V^*)$ system (107) has a unique solution $\phi \in L_p(I, V) \cap C(\bar{I}, H)$ and $\dot\phi \in L_q(I, V^*)$. Further, $\phi$ is an absolutely continuous $V^*$-valued function on $\bar{I}$.
It follows from the above result that, for $\phi_0 \in H$, $\phi \in C(\bar{I}, H)$ and hence $\phi(t) \in H$ for $t \ge 0$. For $f \equiv 0$, the mapping $\phi_0 \to \phi(t)$ defines a nonlinear evolution operator $U(t, \tau)$, $0 \le \tau \le t \le T$, in $H$. In case $B$ is time invariant, we have a nonlinear semigroup $S(t)$, $t \ge 0$, satisfying the properties:
(a) $S(0) = I$ and, as $t \to 0$,
(b) $S(t)\xi \xrightarrow{s} \xi$ in $H$ and, due to uniqueness,
(c) $S(t + \tau)\xi = S(t)S(\tau)\xi$, $\xi \in H$.
Further, it follows from the equation $\dot\phi + B\phi = 0$ that
$$(\dot\phi(t), \phi(t)) + (B\phi(t), \phi(t)) = 0;$$
hence,
$$|\phi(t)|_H^2 = |\phi_0|_H^2 - 2\int_0^t (B\phi(\theta), \phi(\theta))\,d\theta, \qquad t \ge 0$$
Thus, by virtue of (B3), $|\phi(t)|_H \le |\phi_0|_H$; that is, $|S(t)\phi_0|_H \le |\phi_0|_H$. Hence the semigroup $\{S(t), t \ge 0\}$ is

a family of nonlinear contractions in $H$, and its generator is $-B$.
A classical example is given by the nonlinear initial boundary-value problem,
$$\frac{\partial\phi}{\partial t} - \sum_i \frac{\partial}{\partial x_i}\left(\left|\frac{\partial\phi}{\partial x_i}\right|^{p-2}\frac{\partial\phi}{\partial x_i}\right) = f \quad \text{in } I \times \Omega = Q$$
$$\phi(0, x) = \phi_0(x) \tag{111}$$
$$\phi(t, x) = 0 \quad \text{on } I \times \partial\Omega$$
For this example, $V = W_0^{1, p}(\Omega)$ with $V^* = W^{-1, q}$ and $H = L_2(\Omega)$ and
$$Bv = -\sum_i \frac{\partial}{\partial x_i}\left(\left|\frac{\partial v}{\partial x_i}\right|^{p-2}\frac{\partial v}{\partial x_i}\right), \qquad p \ge 2$$
$$f \in L_q(I, W^{-1, q}), \qquad \phi_0 \in L_2(\Omega)$$
We can rewrite problem (111) in its abstract form,
$$\frac{d\phi}{dt} + B\phi = f, \quad t \in I, \qquad \phi(0) = \phi_0$$
noting that $V$ absorbs the boundary condition.
We conclude this section with one of the most general results involving monotone operators. Basically we include a linear term in model (107) which is more singular than the operator $B$. For convenience of presentation in the sequel we shall write this model as:
$$\frac{d\phi}{dt} = A(t)\phi + f(t, \phi), \qquad \phi(0) = \phi_0 \tag{112}$$
where $f$ represents $-B + f$ of the previous model (107) and $A(t)$, $t \in I$, is a family of linear operators more singular than $f$ in the sense that it may contain partial differentials of higher order than that in $f$ and hence may be an unbounded operator.
For existence of solutions we use the following assumptions for $A$ and $f$:
(A1) $\{A(t), t \in I\}$ is a family of densely defined linear operators in $H$ with domains $D(A(t)) \subset V$ and range $R(A(t)) \subset V^*$ for $t \in I$.
(A2) $\langle A(t)e, e \rangle_{V^*, V} \le 0$ for all $e \in D(A(t))$.
(F1) The function $t \to \langle f(t, e), g \rangle$ is measurable on $I$ for $e, g \in V$, and $f : I \times V \to V^*$ is demicontinuous in the sense that for each $e \in V$
$$\langle f(t_n, \xi_n), e \rangle_{V^*, V} \to \langle f(t, \xi), e \rangle$$
whenever $t_n \to t$ and $\xi_n \to \xi$ in $V$.
(F2) $\langle f(t, \xi) - f(t, \eta), \xi - \eta \rangle_{V^*, V} \le 0$ for all $\xi, \eta \in V$.
(F3) There exists an $h \in L_q(I, R_+)$, $R_+ = [0, \infty)$, and $\alpha \ge 0$ such that
$$|f(t, \xi)|_{V^*} \le h(t) + \alpha(\|\xi\|_V)^{p/q} \quad \text{a.e.}$$
for each $\xi \in V$.
(F4) There exists an $h_1 \in L_1(I, R)$ and $\beta > 0$ such that
$$\langle f(t, \xi), \xi \rangle_{V^*, V} \le h_1(t) - \beta(\|\xi\|_V)^p \quad \text{a.e.}$$
for each $\xi \in V$.
Theorem 19. Consider system (112) and suppose assumptions (A1) to (A2) and (F1) to (F4) hold. Then, for each $\phi_0 \in H$, Eq. (112) has a unique (weak) solution $\phi \in L_p(I, V) \cap C(\bar{I}, H)$ and further the solution $\phi$ is an absolutely continuous $V^*$-valued function.
The general result given above also applies to partial differential equations of the form:
$$\frac{\partial\phi}{\partial t} + \sum_{|\alpha| \le m+1} (-1)^{|\alpha|} D^\alpha\left(a_{\alpha\beta}(t, x)D^\beta\phi\right) + \sum_{|\alpha| \le m} (-1)^{|\alpha|} D^\alpha F_\alpha(t, x; \phi, D^1\phi, \ldots, D^m\phi) = 0 \quad \text{on } I \times \Omega$$
$$\phi(0, x) = \phi_0(x) \quad \text{on } \Omega \tag{113}$$
$$(D^\alpha\phi)(t, x) = 0, \quad 0 \le |\alpha| \le m, \quad \text{on } I \times \partial\Omega$$
where the operator $A(t)$ in Eq. (112) comes from the linear part and the boundary conditions, and the nonlinear operator $f$ comes from the nonlinear part in Eq. (113). The space $V$ can be chosen as $W_0^{m, p}$, $p \ge 2$, with $V^* \equiv W^{-m, q}$ where $1/p + 1/q = 1$.
REMARK 2. Again, if both $A$ and $f$ are time invariant, it follows from Theorem 19 that there exists a nonlinear semigroup $S(t)$, $t \ge 0$, such that $\phi(t) = S(t)\phi_0$, $\phi_0 \in H$.
In certain situations the function $\{F_\alpha, |\alpha| \le m\}$ may be discontinuous in the variables $\{\phi, D^1\phi, \ldots, D^m\phi\}$ and as a result the operator $f$, arising from the corresponding Dirichlet form, may be considered to be a multivalued function. In other words $f(t, \xi)$, for $t, \xi \in I \times V$, is a nonempty subset of $V^*$. In that case, the differential equation becomes a differential inclusion $\dot\phi \in A(t)\phi + f(t, \phi)$. Differential inclusions may also arise from variational evolution inequalities. Define the operator $S$ by:
$$S_t g = U(t, 0)\phi_0 + \int_0^t U(t, \theta) g(\theta)\,d\theta, \qquad t \in I, \quad g \in L_1(I, V^*) \tag{114}$$

P1: ZCK Final Pages
Encyclopedia of Physical Science and Technology EN004E-183 June 8, 2001 18:23


where $U$ is the transition operator corresponding to the generator $A$ and $\varphi_0$ is the initial state. Then, one questions the existence of a $g \in L_1(I, V^*)$ such that $g(t) \in f(t, S_t g)$ a.e. If such a $g$ exists, then one has proved the existence of a solution of the initial value problem:

$$\dot\varphi(t) \in A(t)\varphi(t) + f(t, \varphi(t)), \qquad \varphi(0) = \varphi_0 \tag{115}$$

These questions have been considered in control problems.

B. Stability, Identification, and Controllability

We present here some simple results on stability and some comments on the remaining topics.
We consider the semilinear system,

$$\frac{d\varphi}{dt} = A\varphi + f(\varphi) \tag{116}$$

in a Hilbert space $(H, \|\cdot\|)$ and assume that $f$ is weakly nonlinear in the sense that (a) $f(0) = 0$, and (b) $\|f(\xi)\| = o(\|\xi\|)$, where

$$\lim_{\|\xi\| \to 0} \frac{o(\|\xi\|)}{\|\xi\|} = 0 \tag{117}$$

Theorem 20. If the linear system $\dot\varphi = A\varphi$ is asymptotically stable in the Lyapunov sense and $f$ is a weakly nonlinear continuous map from $H$ to $H$, then the nonlinear system (116) is locally asymptotically stable near the zero state.

The proof is based on Theorem 4.
For finite-dimensional systems, Lyapunov stability theory is most popular in that stability or instability of a system is characterized by a scalar-valued function known as the Lyapunov function. For ∞-dimensional systems a straightforward extension is possible only if strong solutions exist.
Consider the evolution equation (116) in a Hilbert space $H$, and suppose that $A$ is the generator of a strongly continuous semigroup in $H$ and $f$ is a continuous map in $H$ bounded on bounded sets. We assume that Eq. (116) has strong solutions in the sense that $\dot\varphi(t) = A\varphi(t) + f(\varphi(t))$ holds a.e., and $\varphi(t) \in D(A)$ whenever $\varphi_0 \in D(A)$. Let $T_t$, $t \ge 0$, denote the corresponding nonlinear semigroup in $H$ so that $\varphi(t) = T_t(\varphi_0)$, $t \ge 0$. Without loss of generality we may consider $f(0) = 0$ (if necessary after proper translation in $H$) and study the question of stability of the zero state. Let $\Omega$ be a nonempty open connected set in $H$ containing the origin and define $D \equiv \Omega \cap D(A)$, and $B_a(D) \equiv \{\xi \in H : |\xi|_H < a\} \cap D(A)$ for each $a > 0$. The system is said to be stable in the region $D$ if, for each ball $B_R(D) \subset D$, there exists a ball $B_r(D) \subset B_R(D)$ such that $T_t(\varphi_0) \in B_R(D)$ for all $t \ge 0$ whenever $\varphi_0 \in B_r(D)$. The zero state is said to be asymptotically stable if $\lim_{t\to\infty} T_t(\varphi_0) = 0$ whenever $\varphi_0 \in D$.
A function $V : D \to [0, \infty]$ is said to be positive definite if it satisfies the properties:

(a) $V(x) > 0$ for $x \in D \setminus \{0\}$, $V(0) = 0$.
(b) $V$ is continuous on $D$ and bounded on bounded sets.
(c) $V$ is Gateaux differentiable on $D$ in the direction of $H$, in the sense that, for each $x \in D$ and $h \in H$,

$$\lim_{\varepsilon \to 0} \frac{V(x + \varepsilon h) - V(x)}{\varepsilon} \equiv V'(x, h)$$

exists, and for each $h \in H$, $x \to V'(x, h)$ is continuous.

The following result is the ∞-dimensional analog of the classical Lyapunov stability theory.

Theorem 21. Suppose the system $\dot\varphi = A\varphi + f(\varphi)$ has strong solutions for each $\varphi_0 \in D(A)$ and there exists a positive definite function $V$ on $D$ such that along any trajectory $\varphi(t)$, $t \ge 0$, starting from $\varphi_0 \in D$,

$$V'(\varphi(t), A\varphi(t) + f(\varphi(t))) \le 0 \ (< 0) \tag{118}$$

for all $t \ge 0$. Then the system is stable (asymptotically stable) in the region $D$.

If the system admits only mild solutions, Theorem 21 must be modified by using positive definite functions which have Gateaux derivatives in the directions $\{h\}$ in spaces larger than $H$.
We conclude this section with a result for systems governed by monotone nonlinear operators as in Eqs. (107) and (112).

Theorem 22. Consider system (112) with the operators $A$ and $f$ satisfying the assumptions of Theorem 19 for all $t \ge 0$, and suppose $h_1 \in L_1(0, \infty; R)$ and the injection $V \subset H$ is continuous. Then, the system is globally asymptotically stable with respect to the origin in $H$.

The questions of identification of parameters appearing in any of the system equations treated above can be dealt with in the same way as in the linear case. In fact, an identification problem may be considered as a special case of a control problem with controls appearing in the system coefficients. Such classes of problems have been covered well in the literature. However, controllability questions for the general systems are more difficult and almost nothing is known.

C. Existence of Optimal Controls

Existence of optimal controls for strongly nonlinear parabolic systems and more general nonlinear evolution

equations of the form (86) have been treated in the literature. The technical details are rather involved and long. We shall limit ourselves to a brief summary of some results.
Consider system (107) with controls denoted by $u$:

$$\frac{d\varphi}{dt} + B(t)\varphi = f(t, u) \tag{119}$$

where the operator $B$ is nonlinear and may be given by the expression:

$$B(t)\varphi = \sum_{|\alpha| \le m} (-1)^{|\alpha|} D^\alpha F_\alpha(t, x; \varphi, D^1\varphi, \ldots, D^m\varphi), \qquad (t, x) \in I \times \Omega \tag{120}$$

Under quite general assumptions on the functions $F_\alpha$, the operator $B$ has properties (B1) to (B3) of Theorem 18. For the space $V$ one may choose $W_0^{m,p}$, $p \ge 2$, or any closed subspace of $W^{m,p}$ so that $W_0^{m,p} \subset V \subset W^{m,p}$. Here $V$ is a reflexive Banach space. For admissible controls we choose any reflexive Banach space $E$ of functions defined on $\Omega$ and $U$, a closed bounded convex subset of $E$, and consider $\mathcal{U}$ to be the class of admissible controls which are strongly measurable functions defined on $I = (0, T)$, with values in $U$. Let $f : I \times U \to V^*$, so that for each $t$, $f(t, \cdot)$ is weakly continuous (or more generally demicontinuous) on $U$; for each $v \in U$, $f(\cdot, v)$ is measurable on $I$ (or continuous), and for each $u \in \mathcal{U}$, $f(u) \in L_q(I, V^*)$ where $f(u)(t) \equiv f(t, u(t))$.
The system,

$$\frac{d\varphi}{dt} + B(t)\varphi = f(t, u(t)), \quad t \in I, \qquad \varphi(0) = \varphi_0, \quad u \in \mathcal{U} \tag{121}$$

is written in its weak form:

$$(L\varphi, \psi) + b(\varphi, \psi) = (f(u), \psi) \quad \text{for all } \psi \in L_p(I, V) \cap C(\bar I, H), \qquad \varphi(0) = \varphi_0, \quad u \in \mathcal{U} \tag{122}$$

where $L$ denotes the extension of $d/dt$ as an operator from the space $L_p(I, V)$ to the space $L_q(I, V^*)$ and $b$ is the Dirichlet form given by:

$$b(\varphi, \psi) \equiv \int_{I \times \Omega} \sum_{|\alpha| \le m} F_\alpha(t, x, \varphi, D^1\varphi, \ldots, D^m\varphi) \cdot D^\alpha\psi \, dx\, dt \tag{123}$$

We consider the following control problems:

(P1) Terminal control. Let $J(u) = Z(\varphi(T))$, where $Z$ is a real-valued function on $H$ and the pair $\{u, \varphi\}$ is subject to the dynamic constraint (122). The problem is to find a control $u \in \mathcal{U}$ that minimizes the functional $J$.
(P2) Time-optimal control. Let $M$, a subset of $H$, be the target set. The requirement is to find a control $u \in \mathcal{U}$ that transfers the system from the state $\varphi_0 \in H$ to the target set $M$ in minimum time.

The existence of optimal controls depends on the properties of admissible trajectories, attainable sets, and the cost functionals. Let $\mathcal{X}$ denote the set of admissible trajectories, that is, the set of all $\varphi \in L_p(I, V) \cap C(\bar I, H)$ such that $\varphi$ is a solution of Eq. (122) corresponding to some control $u \in \mathcal{U}$. Similarly, the attainable set may be defined as:

$$\mathcal{A}(t) \equiv \{\xi \in H : \xi = \varphi(t) \text{ for some } \varphi \in \mathcal{X}\}$$

Under a number of technical assumptions on the operators $B$ and $f$ and the control set $U$ one can prove the following result.

Theorem 23. (a) The set of admissible trajectories $\mathcal{X}$ is a weakly closed and weakly sequentially compact subset of $L_p(I, V)$. (b) For each $t \in [0, T]$, the attainable set $\mathcal{A}(t)$ is a weakly compact subset of $H$.

Using the preceding result, one can prove the existence of optimal controls for problems (P1) and (P2).

Theorem 24. Let $Z$ be a weakly lower semicontinuous functional defined on $H$ and bounded from below. Then there exists an optimal control solving problem (P1).

Theorem 25 (P2). Suppose the given target set $M$ is a weakly closed subset of $H$ and the system is controllable in the sense that there exists an admissible $u$, and $\tau \in \bar I$, such that $\varphi(u)(\tau) \in M$. Then there exists an optimal control that steers the system from state $\varphi_0$ to the target set $M$ in minimum time.

Optimal control problems for the more general system (112) recently have been studied in several papers giving existence results for measurable controls and measure-valued controls. Systems governed by differential inclusions of the form (115) and their associated control problems also have been studied recently. The technical details are too long for presentation here. Interested readers may consult the Bibliography.

D. Necessary Conditions of Optimality

For completeness we shall present a result on the necessary conditions of optimality. Consider system (122) along with the cost functional given by:

$$J(u) = Z(\varphi(T)) + \int_0^T f^0(t, \varphi(t), u(t))\, dt \tag{124}$$

The problem is to find a control $u^0 \in \mathcal{U}$ that minimizes the functional $J$ subject to constraint (122). Let $\{F_\alpha(t, x, \xi)\}$,

$(t, x) \in I \times \Omega$, $\xi = \{\xi_\alpha, |\alpha| \le m\} \in R^N$, denote the functions defining the operator $B$ (see Eqs. (119) and (120)) and $\{F_{\alpha\beta}, |\beta| \le m\}$ their directional derivatives with respect to $\xi \in R^N$. We assume that for fixed $(t, x) \in I \times \Omega$, the functions $\xi \to F_{\alpha\beta}(t, x, \xi)$ are continuous on $R^N$ for all $\alpha, \beta$ satisfying $|\alpha|, |\beta| \le m$, and, for fixed $\xi \in R^N$, $(t, x) \to F_{\alpha\beta}(t, x, \xi)$ are measurable on $I \times \Omega$. For any fixed $\varphi \in L_p(I, V)$, the bilinear form,

$$b_\varphi(\psi, \nu) \equiv \sum_{|\alpha|, |\beta| \le m} \int_I \langle D^\alpha\psi, F_{\beta\alpha}(t, x; \varphi, D^1\varphi, \ldots, D^m\varphi) D^\beta\nu \rangle\, dt \tag{125}$$

is well defined on $L_p(I, V) \times L_p(I, V)$. Let

$$f_1^0 : I \times V \times E \to V^*, \qquad f_2^0 : I \times V \times E \to E^* \tag{126}$$

denote the linear Gateaux differentials of $f^0$ with respect to the state and control variables, respectively, and let

$$F : I \times E \to \mathcal{L}(E, V^*) \tag{127}$$

denote the linear Gateaux differential of $f$ with respect to the control variable.
Under a number of technical assumptions on the functions $\{F_\alpha, |\alpha| \le m\}$, $f$, $f^0$, and $Z$ and the control constraint set $U \subset E$, one can prove the following necessary conditions of optimality.

Theorem 26. Consider system (122) with the cost functional (124). For the pair $\{u^0, \varphi^0\} \in \mathcal{U} \times \mathcal{X}$ to be optimal it is necessary that there exists a $\psi^0 \in L_p(I, V) \cap C(\bar I, H)$ so that the triple $\{u^0, \psi^0, \varphi^0\}$ satisfies the conditions:

(a) $(L\varphi^0, \nu) + b(\varphi^0, \nu) = (f(u^0), \nu)$, $\varphi^0(0) = \varphi_0$, for all $\nu \in L_p(I, V) \cap C(\bar I, H)$ with $\nu(T) = 0$.

(b) $-(L\psi^0, \nu) + b_{\varphi^0}(\psi^0, \nu) + \langle f_1^0, \nu \rangle = 0$, $\psi^0(T) = -Z'(\varphi^0(T))$, for all $\nu \in L_p(I, V) \cap C(\bar I, H)$ with $\nu(0) = 0$.

(c) For all $w \in \mathcal{U}$,

$$\int_I \langle f_2^0(t, \varphi^0(t), u^0(t)) - F^*(t, u^0(t))\psi^0(t),\ w(t) - u^0(t) \rangle_{E^*,E}\, dt \ge 0 \tag{128}$$

where $F^*$ denotes the dual of the operator $F(t, u^0(t)) \in \mathcal{L}(E, V^*)$ and $Z'$ the Gateaux derivative of $Z$ on $H$.

For time-optimal controls, similar necessary conditions exist. In this case, the optimal control is also characterized by inequality (c), with the exceptions that $f_2^0 \equiv 0$ and the upper limit of the integral is the optimal time $t^0$ instead of $T$.
A number of interesting observations can be made from the above result. For example, if $f(t, u) = T(t)u$ and $f^0(t, \varphi, u) = \tilde f^0(t, \varphi) + \langle Nu, u \rangle_{E^*,E}$ and the control set $U = E$, then it follows from inequality (c) that

$$(N + N^*)u^0(t) = T^*(t)\psi^0(t) \tag{129}$$

Hence, if $E$ is reflexive and $N = N^*$ and $N$ is invertible, then $u^0(t) = \tfrac{1}{2} N^{-1} T^*(t)\psi^0(t)$. This is precisely the form of optimal controls for linear systems with quadratic cost functionals.

E. Nonlinear Stochastic Evolution Equations

We present here a brief account of nonlinear stochastic systems. The simplest nonlinear stochastic evolution equation may be given by:

$$dy = Ay\, dt + f(y)\, dW(t), \quad t \ge 0, \qquad y(0) = y_0 \tag{130}$$

where $W$ is the Wiener process with covariance operator $Q \in \mathcal{L}^+(H)$ as in the linear case, $A$ is the generator of a $c_0$-semigroup $S(t)$, $t \ge 0$, on $H$, and $f : H \to \mathcal{L}(H)$ satisfies:

$$\text{(a)}\ \|f(x)\|^2_{\mathcal{L}(H)} \le K^2 (1 + |x|^2_H), \qquad \text{(b)}\ \|f(x) - f(y)\|^2_{\mathcal{L}(H)} \le K^2 |x - y|^2_H \tag{131}$$

Define the nonlinear operator $G$ by:

$$(Gx)(t) \equiv S(t)y_0 + \int_0^t S(t - \theta) f(x(\theta))\, dW(\theta) \tag{132}$$

for $x \in X \equiv C(I, L_2(\Omega, H))$, where $X$ is a Banach space with respect to the topology given by:

$$\|x\|_X \equiv \sup\left\{ \sqrt{E|x(t)|^2_H},\ t \in I \right\}$$

Under the above assumptions one can prove the existence of an integer $n$ such that the $n$th iteration of $G$, denoted by $G^n$, is a contraction in $X$. Hence, by the Banach fixed-point theorem, there exists $y \in X$ such that $y = Gy$. In other words, the integral equation in $X$,

$$y(t) = S(t)y_0 + \int_0^t S(t - \theta) f(y(\theta))\, dW(\theta), \qquad t \in [0, T]$$

has a unique solution $y \in X$ whenever $y_0 \in L_2(\Omega, H)$; hence $y$ is a mild solution of system (130).
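The fixed-point construction of the mild solution can be illustrated numerically in the simplest possible setting. The following is a minimal sketch, not the construction used in the literature: it assumes a scalar state ($H = R$), so that $A$ is a real number `a`, $S(t) = e^{at}$, and $W$ is a standard one-dimensional Wiener process. The function name `picard_iterates`, the time grid, and the particular Lipschitz map `f` are hypothetical choices made only for illustration. The code applies the operator $G$ of Eq. (132) repeatedly to a starting path, using one fixed sampled noise path, and records the sup-distance between successive iterates.

```python
import math
import random


def picard_iterates(a, f, y0, T=1.0, n_steps=50, n_iter=60, seed=0):
    """Iterate x <- G x on a time grid for one sampled Wiener path.

    Scalar stand-in for Eq. (132): S(t) = exp(a t) and H = R (assumptions).
    Returns the last iterate, the sup-distances between successive
    iterates, the Wiener increments, and the step size.
    """
    rng = random.Random(seed)
    dt = T / n_steps
    t = [j * dt for j in range(n_steps + 1)]
    # Wiener increments dW_i ~ N(0, dt), drawn once so that every
    # iterate is driven by the same noise path.
    dW = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]

    x = [y0] * (n_steps + 1)  # starting guess: the constant path
    gaps = []
    for _ in range(n_iter):
        new_x = []
        for j in range(n_steps + 1):
            # Ito-type sum: the integrand is evaluated at the left
            # endpoint t_i of each subinterval, i < j.
            stoch = sum(
                math.exp(a * (t[j] - t[i])) * f(x[i]) * dW[i]
                for i in range(j)
            )
            new_x.append(math.exp(a * t[j]) * y0 + stoch)
        gaps.append(max(abs(u - v) for u, v in zip(new_x, x)))
        x = new_x
    return x, gaps, dW, dt


# Hypothetical data: a = -1 and a globally Lipschitz f of linear growth,
# in the spirit of conditions (131).
path, gaps, dW, dt = picard_iterates(a=-1.0, f=lambda y: 0.5 * math.sin(y), y0=1.0)
print("first gap:", gaps[0], "last gap:", gaps[-1])
```

On a grid the value at time $t_j$ depends only on earlier grid values, so the iteration settles after at most `n_steps + 1` sweeps; this is only a discrete shadow of the statement that some power $G^n$ is a contraction in $X$, and it says nothing about the expectation norm of $X$, which would require averaging over many noise paths.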

Existence theory for semilinear stochastic evolution equations of the form,

$$dy = (A(t)y + f(t, y))\, dt + \sigma(t)\, dW \tag{133}$$

has been developed under much weaker hypotheses on the operators $A$ and $f$ using the Leray–Schauder degree theory. There are also results in which $A$ has been considered to be a strongly measurable function from $(\Omega, \mathcal{F}, P)$ to $\mathcal{L}_c(H)$, the space of closed densely defined linear (not necessarily bounded) operators in $H$. In this case, $A$ is assumed to generate a strongly measurable (random) $c_0$-semigroup in $H$. Existence theory for nonlinear stochastic boundary value problems of the form,

$$\frac{\partial\varphi}{\partial t} + A(t)\varphi = f + F(\varphi) \ \text{ on } I \times \Omega, \qquad B\varphi = g + G(\varphi) \ \text{ on } I \times \partial\Omega, \qquad \varphi(0) = \varphi_0 \ \text{ on } \Omega \tag{134}$$

has been considered with $f$, $g$ being generalized random processes, $\varphi_0$ a generalized random variable, and $F$, $G$ nonlinear accretive operators. Stability problems for systems of the form (134) with $f = 0$; $g = N$, a generalized white noise; $G = 0$; and $F$ being a monotone operator have been studied. It has been shown that the system is asymptotically stable with respect to a ball around the origin with radius determined by the trace of the covariance operator of the associated Wiener process.
It appears from the literature that for nonlinear systems the theory of optimal control, filtering, identification, and controllability is far from satisfactory. This is a difficult but fascinating field and certainly a challenging subject of the future.

IV. RECENT ADVANCES IN INFINITE-DIMENSIONAL SYSTEMS AND CONTROL

In this section we discuss some recent advances in the theory and applications of distributed parameter systems since the time of first publication of this encyclopedia. Details of these new developments can be found in the references. There has been substantial theoretical development of distributed parameter systems, as indicated in References 11 to 36. These include general boundary control problems, control of deterministic and stochastic evolution inequalities and differential inclusions, uncertain systems, and the so-called B-evolutions. Due to space limitations, we cannot include the details. Here we shall give only a brief outline of the major new concepts introduced in recent years. On the theoretical front, there are three new major developments:

1. The first one is in the area of control theory for systems governed by m-times integrated semigroups or distribution semigroups.
2. The second one is in the area of fundamental concepts, in particular the notion of solution: a new notion of solution, called measure solutions, has been introduced very recently and has been used in the theory of control of distributed parameter systems.
3. The third front extends the concept of impulsive systems to infinite dimensional Banach spaces; we shall discuss briefly these new developments and their implications in mathematical sciences.
4. The fourth front represents recent applications of the theory of distributed parameter systems to the physical sciences.

A. m-Times Integrated Semigroups

The classical semigroup theory, as seen in Section II, is based on the assumptions that $A$ is closed and $D(A)$ is dense in $X$ and that the Hille–Yosida inequality,

$$\|R(\lambda, A)\| \equiv \|(\lambda I - A)^{-1}\| \le M/(\lambda - \omega), \qquad \lambda \in \rho(A) \supset (\omega, \infty) \tag{135}$$

holds for some $M \ge 0$ and $\omega \in R$. These are the necessary and sufficient conditions for the existence of a $C_0$-semigroup $S(t)$, $t \ge 0$, and hence the existence of a solution of the Cauchy problem,

$$\dot x = Ax, \qquad x(0) = \xi \tag{136}$$

in $X$. The solution is given by $x(t) = S(t)\xi$, $t \ge 0$. No doubt this covers a very large class of partial differential operators with given boundary conditions and hence a large class of distributed parameter systems. However, there are classes of operators $A$ which do not satisfy the Hille–Yosida theorem, yet such systems have solutions in some generalized sense. According to the Hille–Yosida theorem, $R(\lambda, A)$ is the Laplace transform of some operator-valued function $S(t)$, $t \ge 0$. It is now known that the Cauchy problem stated above has a solution in some generalized sense even if only $R_m(\lambda, A) \equiv R(\lambda, A)/\lambda^m$, $\lambda \in \rho(A)$, is the Laplace transform of an operator-valued function $T(t)$, $t \ge 0$. In this case, $T(t)$, $t \ge 0$, is said to be the m-times integrated semigroup and $A$ is said to be its infinitesimal generator. The classical solution for the Cauchy problem as stated above is now given by:

$$x(t) = (d^m/dt^m)\, T(t)\xi, \quad t \ge 0, \quad \text{for } \xi \in D(A^{m+1})$$

In general, if $\xi \in X$ there is no classical solution but we may admit generalized derivatives and hence generalized solutions. For example, if $D(A^{m+1})$ is dense in $X$, one can choose a sequence $\{\xi_k\}$ converging strongly to $\xi$ and

consider the entity $x$ as the generalized solution if it satisfies the identity:

$$\int_0^\tau \langle x(t), \varphi(t) \rangle_{X,X^*}\, dt = (-1)^m \lim_{k \to \infty} \int_0^\tau \langle T(t)\xi_k, D^m\varphi(t) \rangle_{X,X^*}\, dt \tag{137}$$

for all $\varphi$ in a class of test functions. A suitable class of test functions for this problem is the Sobolev class $W_0^{m,1}(I, X^*)$ which consists of $X^*$-valued functions whose derivatives up to order $m - 1$ vanish on the boundary of the set $I \equiv (0, \tau)$ and belong to $L_1(I, X^*)$. By duality, the solution $x \in W^{-m,\infty}(I, X)$. For further details on m-times integrated semigroups, generalized solutions, stochastic systems, and optimal controls of systems involving operators that generate such semigroups, the reader may consult References 13, 21, and 24.

B. Measure Solution

Consider the system,

$$\dot x = f(x), \qquad x(0) = \xi \in E \tag{138}$$

It is well known that if $E = R^n$ is a finite-dimensional space and $f$ is merely continuous, the system has at least one local solution in the sense that there exists a maximal interval of time $(0, \tau_m)$ and an absolutely continuous function $x^*(t)$, $t \in (0, \tau_m)$, that satisfies the differential equation along with the initial condition $x^*(0) = \xi$, with the possibility of blow-up at time $\tau_m$. That is,

$$\lim_{t \to \tau_m} \|x^*(t)\| = \infty \tag{139}$$

An elementary example is $\dot y = y^2$, $y(0) = \gamma$. If $\gamma > 0$, the blow-up time is $\tau_m = 1/\gamma$.
In contrast, if $E$ is an infinite-dimensional Banach space, mere continuity of $f : E \to E$ is no longer sufficient to guarantee even a local solution. Even continuity and boundedness of this map do not guarantee existence. Compactness is necessary. However, under mere continuity and local boundedness assumptions, one can prove the existence of generalized (to be defined shortly) solutions. This is what prompted the concept and development of measure-valued solutions. Consider the semilinear system,

$$\dot x = Ax + f(x), \quad t \ge 0, \qquad x(0) = x_0 \tag{140}$$

and suppose $A$ is the infinitesimal generator of a $C_0$-semigroup in $E$. Let $BC(E)$ denote the Banach space of bounded continuous functions on $E$ with the standard topology induced by the sup norm $\|\varphi\| \equiv \sup\{|\varphi(\xi)|, \xi \in E\}$. The dual of this space is the space of regular, bounded, finitely additive measures denoted by $\Sigma_{rba}(E)$ and defined on the algebra of sets generated by closed subsets of $E$. This is a Banach space with respect to the topology induced by the total variation norm. Let $\Pi_{rba}(E) \subset \Sigma_{rba}(E)$ denote the space of regular, finitely additive probability measures. For a $\nu \in \Sigma_{rba}(E)$ and $\varphi \in BC(E)$, the pairing

$$(\nu, \varphi) \equiv \nu(\varphi) \equiv \int_E \varphi(\xi)\, \nu(d\xi)$$

is well defined. Letting $D\varphi$ denote the Frechet derivative of $\varphi \in BC(E)$, we introduce the class,

$$F \equiv \{\varphi \in BC(E) : D\varphi \text{ continuous with } \varphi, D\varphi \text{ having bounded supports}\}$$

for test functions. Define the operator $\mathcal{A}$ with domain:

$$D(\mathcal{A}) \equiv \{\varphi \in F : \mathcal{A}\varphi \in BC(E^+)\}$$

where for $\varphi \in D(\mathcal{A})$,

$$\mathcal{A}\varphi \equiv \langle A^* D\varphi(\xi), \xi \rangle_{E^*,E} + \langle D\varphi(\xi), f(\xi) \rangle_{E^*,E} \tag{141}$$

and $E^+$ is a suitable compactification of $E$ that makes $E^+$ a compact Hausdorff space containing $E$ as a dense subspace. The new notion of a solution for the Cauchy problem (140) can be stated as follows.

Definition: A measure function $\mu_t$, $t \ge 0$, with values in $\Pi_{rba}(E)$, is said to be a generalized solution of the semilinear evolution equation if, for each $\varphi \in D(\mathcal{A})$, the following identity holds:

$$\mu_t(\varphi) = \varphi(x_0) + \int_0^t \mu_s(\mathcal{A}\varphi)\, ds, \qquad t \ge 0 \tag{142}$$

The concepts of measure solutions and stochastic evolution equations have been extended. For example, consider the infinite-dimensional stochastic systems on a Hilbert space $H$, governed by Eq. (130) or, more generally, the equation:

$$dx = Ax\, dt + f(x)\, dt + \sigma(x)\, dW, \qquad x(0) = x_0 \tag{143}$$

Again, if $f$ and $\sigma$ are merely continuous and bounded on bounded sets, all familiar notions of solutions (strong, mild, weak, martingale) fail. However, measure solutions are well defined. In this case, expression (142) must be modified as follows:

$$\mu_t(\varphi) = \varphi(x_0) + \int_0^t \mu_s(\mathcal{A}\varphi)\, ds + \int_0^t \langle \mu_s(\mathcal{B}\varphi), dW(s) \rangle, \qquad t \ge 0 \tag{144}$$

where now the operator $\mathcal{A}$ is given by the second-order partial differential operator,

$$\mathcal{A}\varphi \equiv (1/2)\,\mathrm{Tr}(D^2\varphi\, \sigma\sigma^*) + \langle A^* D\varphi(\xi), \xi \rangle_H + \langle D\varphi(\xi), f(\xi) \rangle_H \tag{145}$$

and the operator $\mathcal{B}$ is given by:

$$\mathcal{B}\varphi \equiv \sigma^* D\varphi$$

The last term is a stochastic integral with respect to a cylindrical Brownian motion (for details, see Reference 31). The operators $\mathcal{A}$ and $\mathcal{B}$ are well defined on a class of test functions given by:

$$F \equiv \{\varphi \in BC(E) : \varphi, D\varphi, D^2\varphi \text{ continuous having bounded supports and } \mathrm{Tr}(D^2\varphi\, \sigma\sigma^*(\xi)) < \infty,\ \xi \in H\} \tag{146}$$

A detailed account of measure solutions and their optimal control for semilinear and quasilinear evolution equations can be found in References 20, 22, 28, and 30–32.

C. Impulsive Systems

In many physical problems, a system may be subjected to a combination of continuous as well as impulsive forces at discrete points of time. For example, in building construction, piling is done by dropping massive weights on the top of vertically placed steel bars. This is an impulsive input. In the management of stores, inventory is controlled by an agreement from the supplier to supply depleted goods. An example from physics is a system of particles in motion which experience collisions from time to time, thereby causing instantaneous changes in directions of motion. In the treatment of patients, medications are administered at discrete points of time, which could be considered as impulsive inputs to a distributed physiological system.
The theory of finite dimensional impulsive systems is well known (Reference 12). Only in recent years has the study of infinite dimensional impulsive systems been initiated (References 27 and 33–36). Here, we present a brief outline of these recent advances. Let $I \equiv [0, T]$ be a closed bounded interval of the real line and define the set $D \equiv \{t_1, t_2, \ldots, t_n\} \subset (0, T)$. A semilinear impulsive system can be described by the following system of equations:

$$\dot x(t) = Ax(t) + f(x(t)), \quad t \in I \setminus D, \quad x(0) = x_0,$$
$$\Delta x(t_i) = F_i(x(t_i)), \qquad 0 = t_0 < t_1 < t_2 < \cdots < t_n < t_{n+1} = T \tag{147}$$

where, generally, $A$ is the infinitesimal generator of a $C_0$-semigroup in a Banach space $E$, the function $f$ is a continuous nonlinear map from $E$ to $E$, and $F_i : E \to E$, $i = 1, 2, \ldots, n$, are continuous maps. The difference operator $\Delta x(t_i) \equiv x(t_i + 0) - x(t_i - 0) \equiv x(t_i + 0) - x(t_i)$ denotes the jump operator. This represents the jump in the state $x$ at time $t_i$ with $F_i$ determining the jump size at time $t_i$. Similarly, a controlled impulsive system is governed by the following system of equations:

$$dx(t) = [Ax(t) + f(x(t))]\, dt + g(x(t))\, dv(t), \quad t \in I \setminus D, \quad x(0) = x_0,$$
$$\Delta x(t_i) = F_i(x(t_i)), \qquad 0 = t_0 < t_1 < t_2 < \cdots < t_n < t_{n+1} = T \tag{148}$$

where $g : E \to L(F, E)$ and the control $v \in BV(I, F)$. For maximum generality, one can choose the Banach space $BV(I, F)$ of functions of bounded variation on $I$ with values in another Banach space $F$ as the space of admissible controls. This class allows continuous as well as jump controls. For each $r > 0$, let $B_r \equiv \{\xi \in E : \|\xi\| \le r\}$ denote the ball of radius $r$ around the origin. Let $PWC_l(I, E)$ denote the Banach space of piecewise continuous functions on $I$ taking values from $E$, with each member being left continuous, having right-hand limits. For solutions and their regularity properties we have the following.

Theorem 27. Suppose the following assumptions hold:

(A1) $A$ is the infinitesimal generator of a $C_0$-semigroup in $E$.
(A2) The maps $g$, $F_i$, $i = 1, 2, \ldots, n$, are continuous and bounded on bounded subsets of $E$ with values in $L(F, E)$ and $E$, respectively.
(A3) The map $f$ is locally Lipschitz having at most linear growth; that is, there exist constants $K > 0$, $K_r > 0$, such that:

$$\|f(x) - f(y)\|_E \le K_r \|x - y\|_E, \quad x, y \in B_r, \qquad \text{and} \qquad \|f(x)\| \le K(1 + \|x\|)$$

Then, for every $x_0 \in E$ and $v \in BV(I, F)$, system (148) has a unique mild solution, $x \in PWC_l(I, E)$.

Using this result one can construct necessary conditions of optimality. We present here one such result of the author (Reference 35). Consider $U \subset BV(I, F)$ to be the class of controls comprised of pure jumps at a set of arbitrary but prespecified points of time $J \equiv \{0 = t_0 = s_0 < s_1 < s_2 < \cdots < s_{m-1} < s_m,\ m \ge n\} \subset [0, T)$. Clearly $U$ is isometrically isomorphic to the product space $\mathbf{F} \equiv \prod_{k=1}^{m+1} F$, furnished with the product topology. We choose a closed convex subset $U_{ad} \subset U$ to be the class of admissible controls. The

basic control problem is to find a control policy that imparts a minimum to the following cost functional,

$$J(v) = J(q) \equiv \int_0^T \ell(x(t), v(t))\, dt + \phi(v) + \Phi(x(T)) \equiv \int_0^T \ell(x(t), q)\, dt + \phi(q) + \Phi(x(T))$$

Theorem 28. Suppose assumptions (A1) to (A3) hold, with $\{f, F_i, g\}$ all having Frechet derivatives continuous and bounded on bounded sets, and the functions $\ell$, $\phi$, $\Phi$ are once continuously Gateaux differentiable on $E \times F$, $F$, $E$, respectively. Then, if the pair $\{v^o \text{ (or } q^o), x^o\}$ is optimal, there exists a $\psi \in PWC_r(I, E^*)$ so that the triple $\{v^o, x^o, \psi\}$ satisfies the following inequality and evolution equation:

(a) For all $q = \{q_0, q_1, \ldots, q_m\} \in U_{ad}$,

$$\sum_{i=0}^{m} \Big\langle g^*(x^o(s_i))\psi(s_i) + \phi_{q_i}(q^o) + \int_{s_i}^{T} \ell_u(x^o(t), v^o(t))\, dt,\ q_i - q_i^o \Big\rangle_{F^*,F} \ge 0$$

(b) The adjoint state satisfies

$$d\psi = -\big[A^*\psi + f_x^*(x^o(t))\psi - \ell_x(x^o(t), v^o(t))\big]\, dt - g_x^*(x^o(t), \psi(t))\, dv^o, \quad t \in I \setminus D,$$
$$\psi(T) = \Phi_x(x^o(T)),$$
$$\Delta_r \psi(t_i) = -F_{i,x}^*(x^o(t_i))\psi(t_i), \quad i = 1, 2, \ldots, n$$

where the operator $\Delta_r$ is defined by $\Delta_r f(t_i) \equiv f(t_i - 0) - f(t_i)$.

(c) The process $x^o$ satisfies the system equation (148) (in the mild sense) corresponding to the control $v^o$.

In case a hard constraint is imposed on the final state requiring $x^o(T) \in K$ where $K$ is a closed convex subset of $E$ with nonempty interior, the adjoint equation given by (b) requires modification. The terminal equality condition is replaced by the inclusion $\psi(T) \in \partial I_K(x^o(T))$, where $I_K$ is the indicator function of the set $K$.
Recently a very general model for evolution inclusions has been introduced (Reference 36):

$$\dot x(t) \in Ax(t) + F(x(t)), \quad t \in I \setminus D, \quad x(0) = x_0,$$
$$\Delta x(t_i) \in G_i(x(t_i)), \qquad 0 = t_0 < t_1 < t_2 < \cdots < t_n < t_{n+1} \equiv T \tag{2.3}$$

Here, both $F$ and $\{G_i, i = 1, 2, \ldots, n\}$ are multivalued maps. This model may arise under many different situations. For example, in case of a control problem where one wishes to control the jump sizes in order to achieve certain objectives, one has the model,

$$\dot x(t) \in Ax(t) + F(x(t)), \quad t \in I \setminus D, \quad x(0) = x_0,$$
$$\Delta x(t_i) = g_i(x(t_i), u_i), \qquad 0 = t_0 < t_1 < t_2 < \cdots < t_n < t_{n+1} \equiv T$$

where the controls $u_i$ may take values from a compact metric space $U$. In this case, the multivalued maps are given by $G_i(\zeta) = g_i(\zeta, U)$. For more details on impulsive evolution equations and, in general, inclusions, the reader may consult References 35 and 36.

D. Applications

The slow growth of application of distributed systems theory is partly due to its mathematical and computational complexities. In spite of this, in recent years there have been substantial applications of distributed control theory in aerospace engineering, including vibration suppression of aircraft wings, space shuttle orbiters, space stations, flexible artificial satellites, and suspension bridges. Several papers on control of fluid dynamical systems governed by Navier–Stokes equations have appeared over the past decade with particular reference to artificial heart design. During the same period, several papers on the control of quantum mechanical and molecular systems have appeared. Distributed systems theory has also found applications in stochastic control and filtering. With the advancement of computing power, we expect far more applications in the very near future.

SEE ALSO THE FOLLOWING ARTICLES

CONTROLS, LARGE-SCALE SYSTEMS • DIFFERENTIAL EQUATIONS, ORDINARY • DIFFERENTIAL EQUATIONS, PARTIAL • TOPOLOGY, GENERAL • WAVE PHENOMENA

BIBLIOGRAPHY

Ahmed, N. U., and Teo, K. L. (1981). “Optimal Control of Distributed Parameter Systems,” North-Holland, Amsterdam.
Ahmed, N. U. (1983). “Properties of relaxed trajectories for a class of nonlinear evolution equations on a Banach space,” SIAM J. Control Optimization 2(6), 953–967.
Ahmed, N. U. (1981). “Stochastic control on Hilbert space for linear evolution equations with random operator-valued coefficients,” SIAM J. Control Optimization 19(3), 401–403.
Ahmed, N. U. (1985). “Abstract stochastic evolution equations on Banach spaces,” J. Stochastic Anal. Appl. 3(4), 397–432.

Ahmed, N. U. (1986). “Existence of optimal controls for a class of systems governed by differential inclusions on a Banach space,” J. Optimization Theory Appl. 50(2), 213–237.
Ahmed, N. U. (1988). “Optimization and Identification of Systems Governed by Evolution Equations on Banach Space,” Pitman Research Notes in Mathematics Series, Vol. 184, Longman Scientific/Wiley, New York.
Balakrishnan, A. V. (1976). “Applied Functional Analysis,” Springer–Verlag, Berlin.
Butkovskiy, A. G. (1969). “Distributed Control Systems,” Elsevier, New York.
Curtain, A. F., and Pritchard, A. J. (1978). “Infinite Dimensional Linear Systems Theory,” Lecture Notes, Vol. 8, Springer–Verlag, Berlin.
Lions, J. L. (1971). “Optimal Control of Systems Governed by Partial Differential Equations,” Springer–Verlag, Berlin.
Barbu, V. (1984). “Optimal Control of Variational Inequalities,” Pitman Research Notes in Mathematics Series, Vol. 246, Longman Scientific/Wiley, New York.
Lakshmikantham, V., Bainov, D. D., and Simeonov, P. S. (1989). “Theory of Impulsive Differential Equations,” World Scientific, Singapore.
Ahmed, N. U. (1991). “Semigroup Theory with Applications to Systems and Control,” Pitman Research Notes in Mathematics Series, Vol. 246, Longman Scientific/Wiley, New York.
Ahmed, N. U. (1992). “Optimal Relaxed Controls for Nonlinear Stochastic Differential Inclusions on Banach Space,” Proc. First World Congress of Nonlinear Analysis, Tampa, FL, de Gruyter, Berlin, pp. 1699–1712.
Ahmed, N. U. (1994). “Optimal relaxed controls for nonlinear infinite dimensional stochastic differential inclusions,” Lect. Notes Pure Appl. Math., Optimal Control of Differential Equations 160, 1–19.
Ahmed, N. U. (1995). “Optimal control of infinite dimensional systems governed by functional differential inclusions,” Discussiones Mathematicae (Differential Inclusions) 15, 75–94.
Ahmed, N. U., and Xiang, X. (1996). “Nonlinear boundary control of semilinear parabolic systems,” SIAM J. Control Optimization 34(2), 473–490.
Ahmed, N. U. (1996). “Optimal relaxed controls for infinite-dimensional stochastic systems of Zakai type,” SIAM J. Control Optimization 34(5), 1592–1615.
solutions for Zakai equations,” Publicationes Mathematicae 49(3–4), 251–264.
Ahmed, N. U. (1996). “Generalized solutions for linear systems governed by operators beyond Hille–Yosida type,” Publicationes Mathematicae 48(1–2), 45–64.
Ahmed, N. U. (1997). “Measure solutions for semilinear evolution equations with polynomial growth and their optimal control,” Discussiones Mathematicae (Differential Inclusions) 17, 5–27.
Ahmed, N. U. (1997). “Stochastic B-evolutions on Hilbert spaces,” Nonlinear Anal. 30(1), 199–209.
Ahmed, N. U. (1997). “Optimal control for linear systems described by m-times integrated semigroups,” Publicationes Mathematicae 50(1–2), 1–13.
Xiang, X., and Ahmed, N. U. (1997). “Necessary conditions of optimality for differential inclusions on Banach space,” Nonlinear Anal. 30(8), 5437–5445.
Ahmed, N. U., and Kerbal, S. (1997). “Stochastic systems governed by B-evolutions on Hilbert spaces,” Proc. Roy. Soc. Edinburgh 127A, 903–920.
Rogovchenko, Y. V. (1997). “Impulsive evolution systems: main results and new trends,” Dynamics of Continuous, Discrete, and Impulsive Systems 3(1), 77–78.
Ahmed, N. U. (1998). “Optimal control of turbulent flow as measure solutions,” IJCFD 11, 169–180.
Fattorini, H. O. (1998). “Infinite dimensional optimization and control theory,” In “Encyclopedia of Mathematics and Its Applications,” Cambridge Univ. Press, Cambridge, U.K.
Ahmed, N. U. (1999). “Measure solutions for semilinear systems with unbounded nonlinearities,” Nonlinear Anal. 35, 478–503.
Ahmed, N. U. (1999). “Relaxed solutions for stochastic evolution equations on Hilbert space with polynomial nonlinearities,” Publicationes Mathematicae 54(1–2), 75–101.
Ahmed, N. U. (1999). “A general result on measure solutions for semilinear evolution equations,” Nonlinear Anal. 35.
Liu, J. H. (1999). “Nonlinear impulsive evolution equations: dynamics of continuous, discrete, and impulsive systems,” 6, 77–85.
Ahmed, N. U. (1999). “Measure solutions for impulsive systems in Banach space and their control,” J. Dynamics of Continuous, Discrete, and Impulsive Systems 6, 519–535.
Ahmed, N. U., and Xiang, X. (1996). “Nonlinear uncertain systems and Ahmed, N. U. (2000). “Optimal impulse control for impulsive systems in
necessary conditions of optimality,” SIAM J. Control Optimization Banach spaces,” Int. J. of Differential Equations and Applications 1(1).
35(5), 1755–1772. Ahmed, N. U. (2000). “Systems governed by impulsive differential in-
Ahmed, N. U. (1996). “Existence and uniqueness of measure-valued clusions on Hilbert space,” J. Nonlinear Analysis.
P1: FYD/FQW P2: FYD/LRV QC: FYD Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN004A-187 June 8, 2001 18:59

Dynamic Programming
Martin L. Puterman
The University of British Columbia

I. Sequential Decision Problems

II. Finite-Horizon Dynamic Programming
III. Infinite-Horizon Dynamic Programming
IV. Further Topics

GLOSSARY

Action One of several alternatives available to the decision maker when the system is observed in a particular state.
Decision rule Function that determines for the decision maker which action to select in each possible state of the system.
Discount factor Present value of a unit of currency received one period in the future.
Functional equation Basic entity in the dynamic programming approach to solving sequential decision problems. It relates the optimal value for a (t + 1)-stage decision problem to the optimal value for a t-stage problem. Its solution determines an optimal decision rule at a particular stage.
Horizon Number of stages.
Markov chain Sequence of random variables in which the conditional probability of the future is independent of the past when the present state is known.
Markov decision problem Stochastic sequential decision problem in which the set of actions, the rewards, and the transition probabilities depend only on the current state of the system and the current action selected; the history of the problem has no effect on current decisions.
Myopic policy Policy in which each decision rule ignores the future consequence of the decision and uses the action at each stage that maximizes the immediate reward.
Policy Sequence of decision rules.
Stage Point in time at which a decision is made.
State Description of the system that provides the decision maker with all the information necessary to make future decisions.
Stationary Referring to a problem in which the set of actions, the set of states, the reward function, and the transition function are the same at each decision point or to a policy in which the same decision rule is used at every decision point.

IN ALL AREAS of endeavor, decisions are made either explicitly or implicitly. Rarely are decisions made in isolation. Today’s decision has consequences for the future because it could affect the availability of resources or limit the options for subsequent decisions. A sequential decision problem is a mathematical model for the problem faced by a decision maker who is confronted with a sequence of interrelated decisions and wishes to make them in an optimal fashion. Dynamic programming is a collection of mathematical and computational tools for analyzing sequential decision problems. Its main areas of


application are operations research, engineering, statistics, and resource management. Improved computing capabilities will lead to the wide application of this technique in the future.

I. SEQUENTIAL DECISION PROBLEMS

A. Introduction

A system under the control of a decision maker is evolving through time. At each point of time at which a decision can be made, the decision maker, who will be referred to as “he” with no sexist connotations intended, observes the state of the system. On the basis of this information, he chooses an action from a set of alternatives. The consequences of this action are twofold: the decision maker receives an immediate reward or incurs an immediate cost, and the state that the system will occupy at subsequent decision epochs is influenced either deterministically or probabilistically. The problem faced by the decision maker is to choose a sequence of actions that will optimize the performance of the system over the decision-making horizon. Since the action selected at present affects the future evolution of the system, the decision maker cannot choose his action without taking into account future consequences.

Dynamic programming is a procedure for finding optimal policies for sequential decision problems. It differs from linear, nonlinear, and integer programming in that there is no standard dynamic programming problem formulation. Instead, it is a collection of techniques based on developing mathematical recursions to decompose a multistage problem into a series of single-stage problems that are analytically or computationally more tractable. Its implementation often requires ingenuity on the part of the analyst, and the formulation of dynamic programming problems is considered by some practitioners to be an art. This subject is best understood through examples. This section proceeds with a formal introduction of the basic sequential decision problem and follows with several examples. The reader is encouraged to skip back and forth between these sections to understand the basic ingredients of such a problem. Dynamic programming methodology is discussed in Sections II and III.

B. Problem Formulation

Some formal notation follows. Let $T$ denote the set of time points at which decisions can be made. The set $T$ can be classified in two ways; it is either finite or infinite and either a discrete set or a continuum. The primary focus of this article is the case in which $T$ is discrete. Discrete-time problems are classified as either finite horizon or infinite horizon according to whether the set $T$ is finite or infinite. The problem formulation in these two cases is almost identical; however, the dynamic programming methods of solution differ considerably. For discrete-time problems, $T$ is the set $\{1, 2, \ldots, N\}$ in the finite case and $\{1, 2, \ldots\}$ in the infinite case. The present decision point is denoted by $t$ and the subsequent point by $t + 1$. The points of time at which decisions can be made are often called stages. Almost all the results in this article concern discrete-time models; the continuous-time model is briefly mentioned in Section IV.

The set of possible states of the system at time $t$ is denoted by $S_t$. In finite-horizon problems, this is defined for $t = 1, 2, \ldots, N + 1$, although decisions are made only at times $t = 1, 2, \ldots, N$. This is because the decision at time $N$ often has future consequences that can be summarized by evaluating the state of the system at time $N + 1$. This is analogous to providing boundary values for differential equations. If at time $t$ the decision maker observes the system in state $s \in S_t$, he chooses an action $a$ from the set of allowable actions at time $t$, $A_{s,t}$. As above, $S_t$ and $A_{s,t}$ can be either finite or infinite and discrete or continuous. This distinction has little consequence for the problem formulation.

As a result of choosing action $a$ when the system is in state $s$ at time $t$, the decision maker receives an immediate reward $r_t(s, a)$. This reward can be positive or negative. In the latter case it can be thought of as a cost. Furthermore, the choice of action affects the system evolution either deterministically or probabilistically. In the deterministic case, the choice of action determines the state of the system at time $t + 1$ with certainty. Denote by $w_t(s, a) \in S_{t+1}$ the state the system will occupy if action $a$ is chosen in state $s$ at time $t$; $w_t(s, a)$ is called the transfer function. When the system evolves probabilistically, the subsequent state is random and the choice of action specifies its probability distribution. Let $p_t(j|s, a)$ denote the probability that the system is in state $j \in S_{t+1}$ if action $a$ is chosen in state $s$ at time $t$; $p_t(j|s, a)$ is called the transition probability function. When $S_t$ is a continuum, $p_t(j|s, a)$ is a probability density. Such models are discussed briefly in Section IV. A sequential decision problem in which the transitions from state to state are governed by a transition probability function and the set of actions and rewards depend only on the current state and stage is called a Markov decision problem.

The deterministic model is a special case of the probabilistic model obtained by choosing $p_t(j|s, a) = 1$ if $j = w_t(s, a)$ and $p_t(j|s, a) = 0$ if $j \ne w_t(s, a)$. Even though there is this equivalence, the transfer function representation is more convenient for deterministic problems.

A decision rule is a function $d_t : S_t \to A_{s,t}$ that specifies the action the decision maker chooses when the system is

in state $s$ at time $t$; that is, $d_t(s)$ specifies an action in $A_{s,t}$ for each $s$ in $S_t$. A decision rule of this type is called Markovian because it depends only on the current state of the system. The set of allowable decision rules at time $t$ is denoted by $D_t$ and is called the decision set. Usually it is the set of all functions mapping $S_t$ to $A_{s,t}$, but in some applications, it might be a proper subset.

Many generalizations of deterministic Markovian decision rules are possible. Decision rules can depend on the entire history of the system, which is summarized in the sequence of states and actions observed up to the present, or they can depend only on the initial and current state of the system. Furthermore, the decision rule might be randomized; that is, in each state it specifies a probability distribution on the set of allowable actions so that by using such a rule the decision maker chooses his action at each decision epoch by a probabilistic mechanism. For the problems considered in this article, using deterministic Markovian decision rules at each stage is optimal, so the generalizations referred to above will not be discussed further.

A policy specifies the sequence of decision rules to be used by the decision maker over the course of the planning horizon. A policy $\pi$ is a finite or an infinite sequence of decision rules; that is, $\pi = \{d_1, d_2, \ldots, d_N\}$, where $d_t \in D_t$ for $t = 1, 2, \ldots, N$ if the horizon is finite, or $\pi = \{d_1, d_2, \ldots\}$, where $d_t \in D_t$ for $t = 1, 2, \ldots$ if the horizon is infinite. Let $\Pi$ denote the set of all possible policies; $\Pi = D_1 \times D_2 \times \cdots \times D_N$ in the finite case and $\Pi = D_1 \times D_2 \times \cdots$ in the infinite case.

In deterministic problems, by specifying a policy at the start of the problem, the decision maker completely determines the future evolution of the system. For each policy the sequence of states the system will occupy is known with certainty, and hence the sequence of rewards the decision maker will receive over the planning horizon is known. Let $X_t^{\pi}$ denote the state the system occupies at time $t$ if the decision maker uses policy $\pi$ over the planning horizon. In the first period the decision maker receives a reward of $r_1(s, d_1(s))$, in the second period a reward of $r_2(X_2^{\pi}, d_2(X_2^{\pi}))$, and in the $t$th period $r_t(X_t^{\pi}, d_t(X_t^{\pi}))$. Figure 1 depicts the evolution of the process under a policy $\pi = \{d_1, d_2, \ldots, d_N\}$ in both the deterministic and stochastic cases. The quantity in each box indicates the interaction of the incoming state with the prespecified decision rule to produce the indicated action $d_t(X_t^{\pi})$. The arrow to the right of a box indicates the resulting state, and the arrow downward the resulting reward to the decision maker. The system is assumed to be in state $s$ before the first decision.

FIGURE 1 Evolution of the sequential decision model under the policy $\pi = (d_1, d_2, \ldots, d_N)$. The state at stage 1 is $s$.

The decision maker evaluates policies by comparing the value of a function of the policy’s income stream. Many such evaluation functions are available, but it is most convenient to assume a linear, additive, and risk-neutral utility function over time, which leads to using the total reward over the planning horizon for evaluation. Let $v_N^{\pi}(s)$ be the total reward over the planning horizon. It is given by the expression

$$v_N^{\pi}(s) = \sum_{t=1}^{N+1} r_t\bigl(X_t^{\pi}, d_t(X_t^{\pi})\bigr), \qquad (1)$$

in which it is implicit that $X_1^{\pi} = s$. For deterministic problems, evaluation formulas such as Eq. (1) always depend on the initial state of the process, although this is not explicitly stated below.

In probabilistic problems, by specifying a policy at the start of the problem, the decision maker determines the transition probability functions of a nonstationary Markov chain. The sequence of states the system will occupy is not known with certainty, and consequently the sequence of rewards the decision maker will receive over the planning

horizon is not known. Instead, what is known is the joint probability distribution of system states and rewards. In this case, expectations with respect to the joint probability distributions of the Markov chain conditional on the state at the first decision epoch are often used to evaluate policy performance. As in the deterministic case, let $X_t^{\pi}$ denote the state the system occupies at time $t$ if the decision maker uses policy $\pi$ over the planning horizon. For finite-horizon problems, let $v_N^{\pi}(s)$ equal the total expected reward over the planning horizon. It is given by the expression

$$v_N^{\pi}(s) = E_{\pi,s}\left\{ \sum_{t=1}^{N+1} r_t\bigl(X_t^{\pi}, d_t(X_t^{\pi})\bigr) \right\}, \qquad (2)$$

where $E_{\pi,s}$ denotes the expectation with respect to the probability distribution determined by $\pi$ conditional on the initial state being $s$.

In both the deterministic and stochastic problems the decision maker’s problem is to choose at time 1 a policy $\pi$ in $\Pi$ to make $v_N^{\pi}(s)$ as large as possible and to find the maximal reward,

$$v_N^{*}(s) = \sup_{\pi \in \Pi} v_N^{\pi}(s) \quad \text{for all } s \in S_1. \qquad (3)$$

Frequently the problem is such that the supremum in Eq. (3) is attained, for example, when both $A_{s,t}$ and $S_t$ are finite for all $t \in T$. In such cases the decision maker’s objective is to maximize $v_N^{\pi}(s)$ and find its maximal value.

For infinite-horizon problems, the total reward or the expected total reward the decision maker receives will not necessarily be finite; that is, the summations in Eqs. (1) and (2) usually will not converge. To evaluate policies in infinite-horizon problems, decision makers often use discounting or averaging. Let $\lambda$ represent the discount factor, usually $0 \le \lambda < 1$. It measures the value at present of one unit of currency received one period from now. Let $v_\lambda^{\pi}(s)$ equal the total discounted reward in deterministic problems or the expected total discounted reward for probabilistic problems if the system is in state $s$ before choosing the first action. For deterministic problems it is given by

$$v_\lambda^{\pi}(s) = \sum_{t=1}^{\infty} \lambda^{t-1} r_t\bigl(X_t^{\pi}, d_t(X_t^{\pi})\bigr), \qquad (4)$$

and for stochastic problems it is given by

$$v_\lambda^{\pi}(s) = E_{\pi,s}\left\{ \sum_{t=1}^{\infty} \lambda^{t-1} r_t\bigl(X_t^{\pi}, d_t(X_t^{\pi})\bigr) \right\}. \qquad (5)$$

In this setting the decision maker’s problem is to choose at time 1 a policy $\pi$ in $\Pi$ to make $v_\lambda^{\pi}(s)$ as large as possible and to find the supremal reward,

$$v_\lambda^{*}(s) = \sup_{\pi \in \Pi} v_\lambda^{\pi}(s) \quad \text{for all } s \in S_1. \qquad (6)$$

Alternatively in the infinite-horizon setting, the decision maker might not be willing to assume that a reward received in the future is any less valuable than a reward received at present. For example, if decision epochs are very close together in real time, then all rewards the decision maker receives would have equal value. In this case the decision maker’s objective might be to choose a policy that maximizes the average or expected average reward per period. This quantity is frequently called the gain of a policy. For a specified policy it is denoted by $g^{\pi}(s)$ and in both problems is given by

$$g^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N} v_N^{\pi}(s), \qquad (7)$$

where $v_N^{\pi}(s)$ is defined in Eqs. (1) and (2).

In this setting the decision maker’s problem is to choose at time 1 a policy $\pi$ in $\Pi$ to make $g^{\pi}(s)$ as large as possible and to find the supremal average reward,

$$g^{*}(s) = \sup_{\pi \in \Pi} g^{\pi}(s) \quad \text{for all } s \in S_1. \qquad (8)$$

Dynamic programming methods with the average reward criterion are quite complex and are discussed only briefly in Section III. The reader is referred to the works cited in the Bibliography for more details.

Frequently in infinite-horizon problems, the data are stationary. This means that the set of states, the set of allowable actions in each state, the one-period rewards, the transition or transfer functions, and the decision sets are the same at every stage. When this is the case, the time subscript $t$ is deleted and the notation $S$, $A_s$, $r(s, a)$, $p(j|s, a)$ or $w(s, a)$, and $D$ is used. Often stationary policies are optimal in this setting. By a stationary policy is meant a policy that uses the identical decision rule in each period; that is, $\pi = (d, d, \ldots)$. Often it is denoted by $d$ when there is no possible source of confusion.

C. Examples

The following examples clarify the notation and formulation described in the preceding sections. The first example illustrates a deterministic, finite-state, finite-action, finite-horizon problem; the second a deterministic, infinite-state, infinite-action, finite-horizon problem; and the third a stochastic, finite-state, finite-action problem with both finite- and infinite-horizon versions. In Sections II and III, these examples will be solved by using dynamic programming methodology.
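Before turning to the examples, the evaluation criteria of Section I.B can be made concrete. The following Python sketch computes the total reward of Eqs. (1) and (2) and a finite-stage version of the discounted reward of Eqs. (4) and (5) for a Markovian policy by carrying the probability distribution of the state forward one stage at a time. The function names, the dictionary representation, and the two-state toy problem are illustrative assumptions and do not come from the article. The optional `terminal` argument is one possible reading of the stage N + 1 term in Eq. (1). A deterministic problem is handled by wrapping its transfer function as a degenerate transition probability function.

```python
from collections import defaultdict


def policy_value(rules, p, r, s0, lam=1.0, terminal=None):
    """Expected (optionally discounted) reward of a Markovian policy.

    rules    -- list of decision rules; rules[k][s] is the action d_{k+1}(s)
    p        -- p(t, s, a) returns {next_state: probability} for stage t
    r        -- r(t, s, a) returns the immediate reward
    s0       -- state at the first decision point
    lam      -- discount factor; 1.0 gives the undiscounted total reward
    terminal -- optional reward for the state reached after the last decision
    """
    dist = {s0: 1.0}                      # distribution of the state at stage 1
    total = 0.0
    for k, rule in enumerate(rules):
        t = k + 1
        nxt = defaultdict(float)
        for s, prob in dist.items():
            a = rule[s]
            total += lam ** (t - 1) * prob * r(t, s, a)
            for j, pj in p(t, s, a).items():
                nxt[j] += prob * pj
        dist = dict(nxt)
    if terminal is not None:
        total += lam ** len(rules) * sum(q * terminal(s) for s, q in dist.items())
    return total


def degenerate(w):
    """Wrap a transfer function w(t, s, a) as a transition probability function."""
    return lambda t, s, a: {w(t, s, a): 1.0}


# Hypothetical two-state problem: occupying "a" pays 1, occupying "b" pays 3.
OTHER = {"a": "b", "b": "a"}


def reward(t, s, a):
    return {"a": 1.0, "b": 3.0}[s]


def move(t, s, a):
    """Deterministic transfer function: "move" switches state."""
    return OTHER[s] if a == "move" else s


def noisy(t, s, a):
    """Stochastic version: "move" succeeds with probability 0.5."""
    return {OTHER[s]: 0.5, s: 0.5} if a == "move" else {s: 1.0}


always_move = [{"a": "move", "b": "move"}] * 2     # d_1 = d_2, so N = 2

print(policy_value(always_move, degenerate(move), reward, "a"))           # 4.0
print(policy_value(always_move, degenerate(move), reward, "a", lam=0.5))  # 2.5
print(policy_value(always_move, noisy, reward, "a"))                      # 3.0
```

Propagating the state distribution avoids enumerating sample paths, which is why the same routine serves both the deterministic and the stochastic case.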

1. A Longest-Route Problem

A finite directed graph is depicted in Fig. 2. The circles are called nodes, and the lines connecting them are called arcs. On each arc, an arrow indicates the direction in which movement is possible. The numerical value on the arc is the reward the decision maker receives if he chooses to traverse that arc on his journey from node 1 to node 7. His objective is to find the path from node 1 to node 7 that maximizes the total reward he receives on his journey. Such a problem is called a longest-route problem. A practical application is determining the length of time needed to complete a project. In such problems, the arc length represents the time to complete a task. The entire project is not finished until all tasks are performed. Finding the longest path through the network gives the minimum amount of time for the entire project to be completed since it corresponds to that sequence of tasks that requires the most time. The longest path is called a critical path because, if any task in this sequence is delayed, the entire project will be delayed.

FIGURE 2 Network for the longest-route problem.

In other applications the values on the arcs represent lengths of road segments, and the decision maker’s objective is to find the shortest route from the first node to the terminal node. Such a problem is called a shortest-route problem. All deterministic, finite-state, finite-action, finite-horizon dynamic programming problems are equivalent to shortest- or longest-route problems. Another example of this will appear in Section II.

An important assumption is that the network contains no directed cycle; that is, there is no route starting at a node that returns to that node. If there were such a cycle, the longest route would be infinite and the problem would not be of interest.

The longest-route problem is now formulated as a sequential decision problem. This requires defining the set of states, actions, decision sets, transfer functions, and rewards. They are as follows:

Decision points:
$T = \{1, 2, 3\}$

States (numbers correspond to nodes):
$S_1 = \{1\}$; $S_2 = \{2, 3\}$; $S_3 = \{4, 5, 6\}$; $S_4 = \{7\}$

Actions (action $j$ selected in node $i$ corresponds to choosing to traverse the arc between nodes $i$ and $j$; the first subscript on $A$ is the state and the second the stage):
$A_{1,1} = \{2, 3\}$
$A_{2,2} = \{4, 5\}$; $A_{3,2} = \{4, 5, 6\}$
$A_{4,3} = \{7\}$; $A_{5,3} = \{7\}$; $A_{6,3} = \{7\}$

Rewards:
$r_1(1, 2) = 2$; $r_1(1, 3) = 4$
$r_2(2, 4) = 5$; $r_2(2, 5) = 6$
$r_2(3, 4) = 5$; $r_2(3, 5) = 3$; $r_2(3, 6) = 1$
$r_3(4, 7) = 3$; $r_3(5, 7) = 6$; $r_3(6, 7) = 2$

Transfer function:
$w_t(s, a) = a$

The remaining ingredients in the sequential decision problem formulation are the decision set, the set of policies, and an evaluation formula. The decision set at stage $t$ is the set of all arcs emanating from nodes at stage $t$. A policy is a list of arcs in which there is one arc starting at each node (except 7) in the network. The policy set contains all such lists. Each policy contains a route from node 1 to node 7 and some superfluous action selections. The value of the policy is the total of the rewards along this route, and the decision maker’s problem is to choose a policy that maximizes this total reward.

The structure of a policy is described in more detail through an example. Consider the policy $\pi = \{(1, 2), (2, 4), (3, 4), (4, 7), (5, 7), (6, 7)\}$. Implicit in this definition is a sequence of decision rules $d_t(s)$ for each state and stage. They are $d_1(1) = 2$, $d_2(2) = 4$, $d_2(3) = 4$, $d_3(4) = 7$, $d_3(5) = 7$, and $d_3(6) = 7$. This policy can be formally denoted by $\pi = \{d_1, d_2, d_3\}$. The policy contains one unique routing through the graph, namely, 1 → 2 → 4 → 7, and several unnecessary decisions. We use the formal notation $X_1^{\pi} = 1$, $X_2^{\pi} = 2$, $X_3^{\pi} = 4$, and $X_4^{\pi} = 7$ so that

$$\begin{aligned} v_3^{\pi}(1) &= r_1(1, d_1(1)) + r_2(2, d_2(2)) + r_3(4, d_3(4)) \\ &= r_1(1, 2) + r_2(2, 4) + r_3(4, 7) \\ &= 2 + 5 + 3 = 10. \end{aligned}$$
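The policy evaluation above is easy to reproduce mechanically. The following Python sketch encodes the rewards and action sets listed in the formulation, follows a policy from node 1 to node 7 while adding up rewards, and, as a brute-force check that is practical only for a network this small, enumerates every route. The names `REWARD`, `ACTIONS`, `evaluate`, and `routes` are illustrative choices, not notation from the article.

```python
# Arc rewards r_t(i, j) from the formulation above; the stage index is dropped
# because each node belongs to exactly one stage.
REWARD = {
    (1, 2): 2, (1, 3): 4,
    (2, 4): 5, (2, 5): 6,
    (3, 4): 5, (3, 5): 3, (3, 6): 1,
    (4, 7): 3, (5, 7): 6, (6, 7): 2,
}

# Action sets: ACTIONS[i] lists the nodes reachable from node i.
ACTIONS = {}
for (i, j) in REWARD:
    ACTIONS.setdefault(i, []).append(j)


def evaluate(policy, start=1, end=7):
    """Follow the decision rules from `start` and total the rewards along the route."""
    state, total, route = start, 0, [start]
    while state != end:
        action = policy[state]              # d_t(s)
        total += REWARD[(state, action)]
        state = action                      # transfer function w_t(s, a) = a
        route.append(state)
    return total, route


def routes(state=1, end=7):
    """Yield (total reward, route) for every route from `state` to `end`."""
    if state == end:
        yield 0, [end]
        return
    for nxt in ACTIONS[state]:
        for value, rest in routes(nxt, end):
            yield REWARD[(state, nxt)] + value, [state] + rest


# The policy pi discussed above, written as node -> chosen successor.
pi = {1: 2, 2: 4, 3: 4, 4: 7, 5: 7, 6: 7}

print(evaluate(pi))       # (10, [1, 2, 4, 7])
print(max(routes()))      # (14, [1, 2, 5, 7])
```

Note that `evaluate` never consults the entries of `pi` for nodes 3, 5, and 6, which are the superfluous decisions mentioned in the text.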

In such a small problem, one can easily evaluate all policies by enumeration and determine that the longest route through the network is 1 → 2 → 5 → 7 with a return of 14. For larger problems this is not efficient; dynamic programming methods will be seen to offer an efficient means of determining the longest route.

The reader might note that the formal sequential decision process notation is quite redundant here. The subscript for stage does not convey any useful information and the specification of a policy requires making decisions in nodes that will never be reached. Solution by dynamic programming methods will require this superfluous information. In other settings this information will be useful.

2. A Resource Allocation Problem

A decision maker has a finite amount $K$ of a resource to allocate among $N$ possible activities. Using activity $i$ at level $x_i$ consumes $c_i(x_i)$ units of the resource and yields a reward or utility of $f_i(x_i)$ to the decision maker. The maximum level of intensity for activity $i$ is $M_i$. His objective is to determine the intensity for each of the activities that maximizes his total reward. When any level of the activity is possible, this is a nonlinear programming problem. When the activity can operate only at a finite set of levels, this is an integer programming problem. In the special case that the activity can be either utilized or not ($M_i = 1$ and $x_i$ is an integer) this is often called a knapsack problem. This is because it can be used to model the problem of a camper who has to decide which of $N$ potential items to carry in his knapsack. The value of item $i$ is $f_i(1)$ and it weighs $c_i(1)$. The camper wishes to select the most valuable set of items that do not weigh more than the capacity of the knapsack.

The mathematical formulation of the resource allocation problem is as follows:

Maximize $f_1(x_1) + f_2(x_2) + \cdots + f_N(x_N)$
subject to

$$c_1(x_1) + c_2(x_2) + \cdots + c_N(x_N) \le K \qquad (9)$$

and

$$0 \le x_t \le M_t, \quad t = 1, 2, \ldots, N. \qquad (10)$$

The following change of variables facilitates the sequential decision problem formulation. Define the new variable $s_i = c_i(x_i)$ and assume that $c_i$ is a monotone increasing function on $[0, M_i]$. This assumption says that the more intense the activity level, the more resource utilized. Define $g_i(s_i) = f_i(c_i^{-1}(s_i))$ and $m_i = c_i(M_i)$. This change of variables corresponds to formulating the problem in terms of the quantity of resource being used. In this notation the formulation above becomes

Maximize $g_1(s_1) + g_2(s_2) + \cdots + g_N(s_N)$
subject to

$$s_1 + s_2 + \cdots + s_N \le K \qquad (11)$$

and

$$0 \le s_t \le m_t, \quad t = 1, 2, \ldots, N. \qquad (12)$$

It is not immediately obvious that this problem is a sequential decision problem. The formulation is based on treating the problem as if the decisions to allocate resources to the activities were made sequentially through time with allocation to activity 1 first and activity $N$ last. Decisions are coupled, since successive allocations must take the quantity of the resource allocated previously into account. That is, if $K - s$ units of resource have been allocated to the first $t$ activities, then $s$ units are available for activities $t + 1, t + 2, \ldots, N$.

The following sequential decision problem formulation is based on the second formulation above:

Decision points (correspond to activity number):
$T = \{1, 2, \ldots, N\}$

States (amount of resource available for allocation in remaining stages): For $0 \le t \le N$,

$$S_t = \begin{cases} \{s : 0 \le s \le m_t\} & \text{if resource levels are continuous} \\ \{0, 1, 2, \ldots, m_t\} & \text{if resource levels are discrete} \end{cases}$$

For $t = N + 1$,

$$S_t = \begin{cases} \{s : 0 \le s \le K\} & \text{if resource levels are continuous} \\ \{0, 1, 2, \ldots, K\} & \text{if resource levels are discrete} \end{cases}$$

Actions ($s$ is the amount of resource available for stages $t, t + 1, \ldots, N$):

$$A_{s,t} = \begin{cases} \{u : 0 \le u \le \min(s, m_t)\} & \text{if resource levels are continuous} \\ \{0, 1, 2, \ldots, \min(s, m_t)\} & \text{if resource levels are discrete} \end{cases}$$

Rewards:
$r_t(s, a) = g_t(a)$

Transfer function:
$w_t(s, a) = s - a$

The decision set at stage $t$ is the set of all functions from $S_t$ to $A_{s,t}$, and a policy is a sequence of such functions, one for each $t \in T$. A decision rule specifies the amount of resource to allocate to activity $t$ if $s$ units are available for allocation to activities $t, t + 1, \ldots, N$, and a policy

specifies which decision rule to use at each stage of the and the decision maker decides to allocate 2 units, he will
sequential allocation. As in the longest-route problem, a receive a reward of 23 = 8 units and move to node 1.
policy specifies decisions in many eventualities that will When the resource levels form a continuum, the net-
not occur using that policy. This may seem wasteful at first work representation is no longer valid. The problem is
but is fundamental to the dynamic programming method- reduced to a sequence of constrained one-dimensional
ology. The quantity X tπ is the amount of remaining re- nonlinear optimization problems through dynamic pro-
source available for allocation to activities t, t + 1, . . . , N gramming. Such an example will be solved in Section II
using policy π . Clearly, X 1π = K , X 2π = K − d1 (K ), and by using dynamic programming methods.
so forth. The decision maker compares policies through
the quantity v πN (K ), which is given by 3. A Stochastic Inventory Control Problem

v πN (K ) = g1 (d1 (K )) + g2 d2 X 2π Each month, the manager of a warehouse must determine

+ · · · + g N d N X πN . how much of a product to keep in stock to satisfy cus-
tomer demand for the product. The objective is to max-
When the set of activities is discrete, the resource allo- imize expected total profit (sales revenue less inventory
cation problem can be formulated as a longest-route prob- holding and ordering costs), which may or may not be
lem, as can any discrete state and action sequential deci- discounted. The demand throughout the month is random,
sion problem. This is depicted in Fig. 3 for the following with a known probability distribution. Several simplifying
specific example: assumptions make a concise formulation possible:
Maximize 3s12 + s23 + 4s3
subject to a. The decision to order additional stock is made at
s1 + s2 + s3 ≤ 4, the beginning of each month and delivery occurs instan-
taneously.
s1 , s2 , and s3 are integers, and b. Demand for the product arrives throughout the month
0 ≤ s1 ≤ 2; 0 ≤ s2 ≤ 2; 0 ≤ s3 ≤ 2. but is filled on the last day of the month.
c. If demand exceeds the stock on hand, the customer
In the longest-route formulation, the node labels are the goes elsewhere to purchase the product.
amount of resource available for allocation at subsequent d. The revenues and costs and the demand distribution
stages. A fifth stage is added so that there is a unique des- are identical each month.
tination and all decisions at stage 4 correspond to moving e. The product can be sold only in whole units.
from a node at stage 4 to node 0 with no reward. This is f. The warehouse capacity is M units.
because an unallocated resource has no value to the deci-
sion maker in this formulation. The number on each arc Let st denote the inventory on hand at the beginning
is the reward, and the amount of resource allocated is the of month t, at the additional product ordered in month t,
difference between the node label at stage t and that at and Dt the random demand in month t. The demand has a
stage t + 1. For instance, if at stage 2, there are 3 units known probability distribution given by pd = P{Dt = d},
of resource available for allocation over successive stages d = 0, 1, 2, . . . . The cost of ordering u units in any month
is O(u) and the cost of storing u units for 1 month is h(u). The ordering cost is given by

$$O(u) = \begin{cases} K + c(u), & \text{if } u > 0 \\ 0, & \text{if } u = 0, \end{cases} \tag{13}$$

where c(u) and h(u) are increasing functions of u. For finite-horizon problems, if u units of inventory are on hand at the end of the planning horizon, its value is g(u). Finally, if u units of product are demanded in a month and the inventory is sufficient to satisfy demand, the manager receives f(u). Define F(u) to be the expected revenue in a month if the inventory before receipt of customer orders is u units. It is given in period t by

$$F(u) = \sum_{j=0}^{u-1} f(j)\,p_j + f(u)\,P\{D_t \ge u\}. \tag{14}$$

FIGURE 3 Network for the resource allocation problem.
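As a small illustration, the ordering cost of Eq. (13) and the expected revenue of Eq. (14) can be evaluated directly. The sketch below is not part of the original formulation: the function names are hypothetical, and the data (K = 4, c(u) = 2u, f(u) = 8u, and demand probabilities 1/4, 1/2, 1/4 for d = 0, 1, 2) are assumed from the numerical example presented later in this article.

```python
from fractions import Fraction

def ordering_cost(u, K, c):
    """O(u) of Eq. (13): fixed cost K plus c(u) when an order is placed."""
    return K + c(u) if u > 0 else 0

def expected_revenue(u, f, p):
    """F(u) of Eq. (14) for a demand distribution p given as {d: p_d}."""
    # P{D >= u} is one minus the probability that demand is below u.
    tail = 1 - sum(p.get(d, 0) for d in range(u))
    return sum(f(j) * p.get(j, 0) for j in range(u)) + f(u) * tail

# Assumed data, taken from the article's numerical example.
demand = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
revenue = [expected_revenue(u, lambda x: 8 * x, demand) for u in range(4)]
order_costs = [ordering_cost(u, 4, lambda x: 2 * x) for u in range(4)]
print(revenue)      # F(0), ..., F(3)
print(order_costs)  # O(0), ..., O(3)
```

Exact fractions are used so that the results can be compared with hand calculations without rounding.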
P1: FYD/FQW P2: FYD/LRV QC: FYD Final Pages
Encyclopedia of Physical Science and Technology EN004A-187 June 8, 2001 18:59

680 Dynamic Programming

Equation (14) can be interpreted as follows. If the inventory on hand exceeds the quantity demanded, j, the revenue is f(j); $p_j$ is the probability that the demand in a period is j units. If the inventory on hand is u units and the quantity demanded is at least u units, then the revenue is f(u) and $P\{D_t \ge u\}$ is the probability of such a demand. The combined quantity is the probability-weighted, or expected, revenue.

This is a stochastic sequential decision problem (Markov decision problem), and its formulation will include a transition probability function instead of a transfer function. The formulation follows:

Decision points:

$$T = \{1, 2, \ldots, N\}; \quad N \text{ may be finite or infinite}$$

States (units of inventory on hand at the start of a month):

$$S_t = \{0, 1, 2, \ldots, M\}, \quad t = 1, 2, \ldots, N + 1$$

Actions (the amount of additional stock to order in month t):

$$A_{s,t} = \{0, 1, 2, \ldots, M - s\}$$

Expected rewards (expected revenue less ordering and holding costs):

$$r_t(s, a) = F(s + a) - O(a) - h(s + a), \quad t = 1, 2, \ldots, N$$

Value of terminal inventory (no actions are possible):

$$r_{N+1}(s, a) = g(s)$$

Transition probabilities (see explanation below):

$$p_t(j|s, a) = \begin{cases} 0, & \text{if } j > s + a \\ p_j, & \text{if } j = s + a - D_t,\ s + a \le M, \text{ and } s + a > D_t \\ q_{s+a}, & \text{if } j = 0,\ s + a \le M, \text{ and } s + a \le D_t \end{cases}$$

where

$$q_{s+a} = P\{D_t \ge s + a\} = \sum_{d=s+a}^{\infty} p_d.$$

A brief explanation of the transition probabilities might be helpful. If the inventory on hand at the beginning of period t is s units and an order is placed for a units, the inventory before external demand is s + a units. If the demand of j units is less than s + a units, then the inventory at the beginning of period t + 1 is s + a − j units. This occurs with probability $p_j$. If the demand exceeds s + a units, then the inventory at the start of period t + 1 is 0 units. This occurs with probability $q_{s+a}$. Finally, the probability that the inventory level ever exceeds s + a units is zero, since demand is nonnegative.

The decision sets consist of all rules that assign the quantity of inventory to be ordered each month to each possible starting inventory position in a month. A policy is a sequence of such ordering rules. Unlike deterministic problems, in which a decision rule is specified for many states that will never be reached, in stochastic problems such as this, it is necessary for the decision maker to determine the decision rule for all states. This is because the evolution of the inventory level over time is random, which makes any inventory level possible at any decision point. Consequently the decision maker must plan for each of these eventualities.

An example of a decision rule is as follows: Order only if the inventory level is below 3 units at the start of the month and order the quantity that raises the stock level to 10 units on receipt of the order. In month t this is given by

$$d_t(s) = \begin{cases} 10 - s, & s < 3 \\ 0, & s \ge 3. \end{cases}$$

The evaluation method for a policy depends on the time horizon under consideration. For finite-horizon problems, the total expected reward conditional on the initial stock level is a convenient summary. Assuming the stock level at time 1 is s, the expected total reward for policy π is

$$v_N^\pi(s) = E_{\pi,s}\left\{ \sum_{t=1}^{N} \left[ F\big(X_t^\pi + d_t^\pi(X_t^\pi)\big) - O\big(d_t^\pi(X_t^\pi)\big) - h\big(X_t^\pi + d_t^\pi(X_t^\pi)\big) \right] + g\big(X_{N+1}^\pi\big) \right\}.$$

If, instead, the decision maker wishes to discount future profit at a monthly discount rate of λ, 0 ≤ λ < 1, the term $\lambda^{t-1}$ is inserted before each term in the above summation and $\lambda^N$ before the terminal reward g. For an infinite-horizon problem, the expected total discounted profit is given by

$$v_\lambda^\pi(s) = E_{\pi,s}\left\{ \sum_{t=1}^{\infty} \lambda^{t-1} \left[ F\big(X_t^\pi + d_t^\pi(X_t^\pi)\big) - O\big(d_t^\pi(X_t^\pi)\big) - h\big(X_t^\pi + d_t^\pi(X_t^\pi)\big) \right] \right\}.$$

The decision maker's problem is to choose a sequence of decision rules to maximize expected total or total discounted profits.

Many modifications of this inventory problem are possible; for example, excess demand in any period could be backlogged and a penalty for carrying unsatisfied demand could be charged, or there could be a time lag between placing the order and its receipt. The formulation herein
can easily be modified to include such changes; the interested reader is encouraged to consult the Bibliography for more details.

A numerical example is now provided in complete detail. It will be solved in subsequent sections by using dynamic programming methods. The data for the problem are as follows: K = 4, c(u) = 2u, g(u) = 0, h(u) = u, M = 3, N = 3, f(u) = 8u, and

$$p_d = \begin{cases} \tfrac{1}{4}, & \text{if } d = 0 \\ \tfrac{1}{2}, & \text{if } d = 1 \\ \tfrac{1}{4}, & \text{if } d = 2. \end{cases}$$

The inventory is constrained to be 3 or fewer units, and the decision maker wishes to consider the effects over three periods. All the costs and revenues are linear. This means that for each unit ordered the per unit cost is 2, for each unit held in inventory for 1 month the per unit cost is 1, and for each unit sold the per unit revenue is 8. The expected revenue when u units of stock are on hand before receipt of customer orders is given by

u    F(u)
0    0
1    0 × 1/4 + 8 × 3/4 = 6
2    0 × 1/4 + 8 × 1/2 + 16 × 1/4 = 8
3    0 × 1/4 + 8 × 1/2 + 16 × 1/4 = 8

Combining the expected revenue with the ordering and holding costs gives the expected profit in period t if the inventory level is s at the start of the period and an order for a units is placed. If a = 0, the ordering and holding cost equals s, and if a is positive, it equals 4 + s + 3a. It is summarized in the tabulations below, where an X corresponds to an action that is infeasible. Transition probabilities depend only on the total inventory on hand before the receipt of customer orders. They are the same for any s and a that have the same total s + a. So that redundant information is reduced, transition probabilities are presented as functions of s + a only. The information in the following tabulations defines this problem completely:

r_t(s, a)
s    a=0    a=1    a=2    a=3
0     0     −1     −2     −5
1     5      0     −3      X
2     6     −1      X      X
3     5      X      X      X

p_t(j|s, a)
s+a    j=0    j=1    j=2    j=3
0       1      0      0      0
1      3/4    1/4     0      0
2      1/4    1/2    1/4     0
3       0     1/4    1/2    1/4

II. FINITE-HORIZON DYNAMIC PROGRAMMING

A. Introduction

Dynamic programming is a collection of methods for solving sequential decision problems. The methods are based on decomposing a multistage problem into a sequence of interrelated one-stage problems. Fundamental to this decomposition is the principle of optimality, which was developed by Richard Bellman in the 1950s. Its importance is that an optimal solution for a multistage problem can be found by solving a functional equation relating the optimal value for a (t + 1)-stage problem to the optimal value for a t-stage problem.

Solution methods for problems depend on the time horizon and whether the problem is deterministic or stochastic. Deterministic finite-horizon problems are usually solved by backward induction, although several other methods, including forward induction and reaching, are available. For finite-horizon stochastic problems, backward induction is the only method of solution. In the infinite-horizon case, different approaches are used. These will be discussed in Section III. The backward induction procedure is described in the next two sections. This material might seem difficult at first; the reader is encouraged to refer to the examples at the end of this section for clarification.

B. Functional Equation of Dynamic Programming

Let $v^t(s)$ be the maximal total reward received by the decision maker during stages t, t + 1, . . . , N + 1, if the system is in state s immediately before the decision at stage t. When system transitions are stochastic, $v^t(s)$ is the maximal expected return. Recall that decisions are made at stages 1, 2, . . . , N and not at time N + 1; however, a reward might be received at stage N + 1 as a consequence of the decision in stage N. In most deterministic problems, $S_1$ consists of one element, whereas in stochastic problems, solutions are usually required for all possible initial states.

The basic entity of dynamic programming is the functional equation, or Bellman equation, which relates $v^t(s)$ to $v^{t+1}(s)$. For deterministic problems it is given by

$$v^t(s) = \max_{a \in A_{s,t}} \left\{ r_t(s, a) + v^{t+1}\big(w_t(s, a)\big) \right\}, \quad t = 1, \ldots, N \tag{15}$$

and

$$v^{N+1}(s) = 0, \tag{16}$$

where Eq. (15) is valid for all $s \in S_t$ and Eq. (16) is valid for all $s \in S_{N+1}$. Equation (15) is the basis for the backward
induction algorithm for solving sequential decision problems. This equation corresponds to a one-stage problem in which the decision maker observes the system in state s and must select an action from the set $A_{s,t}$. The consequence of this action is that the decision maker receives an immediate reward of $r_t(s, a)$ and the system moves to state $w_t(s, a)$, at which he receives a reward of $v^{t+1}(w_t(s, a))$. Equation (15) says that he chooses the action that maximizes the total of these two rewards. This is exactly the problem faced by the decision maker in a one-stage sequential decision problem when the terminal reward function is $v^{t+1}$. Equation (16) provides a boundary condition. When the application dictates, this value 0 can be replaced by an arbitrary function that assigns a value to the terminal state of the system. Such might be the case in the inventory control example.

Equation (15) emphasizes the dynamic aspects of the sequential decision problem. The decision maker chooses that action which maximizes his immediate reward plus his reward over the remaining decision epochs. This is in contrast to the situation in which the decision maker behaves myopically and chooses the decision rule that maximizes the reward only in the current period and ignores future consequences. Some researchers have given conditions in which such a myopic policy is optimal; however, in almost all problems dynamic aspects must be taken into account.

The expression "max" requires explanation because it is fundamental to the dynamic programming methodology. If f(x, y) is any function of two variables with $x \in X$ and $y \in Y$, then

$$g(x) = \max_{y \in Y} \{ f(x, y) \}$$

if for each $x \in X$, $g(x) \ge f(x, y)$ for all $y \in Y$ and there exists a $y^* \in Y$ with the properties that $f(x, y^*) \ge f(x, y)$ for all $y \in Y$ and $g(x) = f(x, y^*)$. Thus, Eq. (15) states that the decision maker chooses $a \in A_{s,t}$ to make the expression in braces as large as possible. The quantity $v^t(s)$ is set equal to this maximal value.

In stochastic problems, the functional equation (15) is modified to account for the probabilistic transition structure. It is given by

$$v^t(s) = \max_{a \in A_{s,t}} \left\{ r_t(s, a) + \sum_{j \in S_{t+1}} p_t(j|s, a)\, v^{t+1}(j) \right\}, \quad t = 1, \ldots, N. \tag{17}$$

The stochastic nature of the problem is accounted for in Eq. (17) by replacing the fixed transition function $w_t(s, a)$ by the random state j, which is determined by the probability transition function corresponding to selecting action a. The second expression in this equation equals the expected reward received over the remaining periods as a consequence of choosing action a in period t.

C. Backward Induction and the Principle of Optimality

Backward induction is a procedure that uses the functional equation in an iterative fashion to find the optimal total value function and an optimal policy for a finite-horizon sequential decision problem. That this method achieves these objectives is demonstrated by the principle of optimality. The principle of optimality is not a universal truth that applies to all sequential decision problems but a mathematical result that requires formal proof in each application. For problems in which the (expected) total reward criterion is used, as considered in this article, it is valid. A brief argument of why it holds in such problems is given below.

To motivate backward induction, the following iterative procedure for finding the total reward of some specified policy $\pi = (d_1, d_2, \ldots, d_N)$ is given. It is called the policy evaluation algorithm. To simplify notation, assume that $p_t(j|s, a) = p(j|s, a)$ and $r_t(s, a) = r(s, a)$ for all s, a, and j.

a. Set t = N + 1 and $v^{N+1}(s) = 0$ for all $s \in S_{N+1}$.
b. Substitute t − 1 for t (t − 1 → t) and compute $v^t(s)$ for each $s \in S_t$ in the deterministic case by

$$v^t(s) = r(s, d_t(s)) + v^{t+1}\big(w_t(s, d_t(s))\big), \tag{18}$$

or in the stochastic case by

$$v^t(s) = r(s, d_t(s)) + \sum_{j \in S_{t+1}} p(j|s, d_t(s))\, v^{t+1}(j). \tag{19}$$

c. If t = 1, stop; otherwise, return to step b.

This procedure inductively evaluates the policy by first fixing its value at the last stage and then computing its value at the previous stage by adding its immediate reward to the previously computed total value. This process is repeated until the first stage is reached. This computation process yields the quantities $v^1(s), v^2(s), \ldots, v^{N+1}(s)$. The quantity $v^1(s)$ equals the expected total value of policy π, which in earlier notation is given by $v_N^\pi(s)$. The quantities $v^t(s)$ correspond to the value of this policy from stage t onward. This procedure is extended to optimization by iteratively choosing and evaluating a policy consisting of the actions that give the maximal return from each stage to the end of the planning horizon instead of just evaluating a fixed prespecified policy.
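The policy evaluation algorithm in its stochastic form, Eq. (19), can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and data layout are hypothetical, the reward and transition tables are those of the inventory example tabulated in Section I, and the policy being evaluated is an arbitrary illustrative choice (order 3 units in month 1 and 2 units in month 2 when the stock is 0, and otherwise do not order).

```python
from fractions import Fraction as Fr

# r[s][a]: expected one-period reward; None marks an infeasible action.
r = [[0, -1, -2, -5],
     [5, 0, -3, None],
     [6, -1, None, None],
     [5, None, None, None]]
# p[k][j]: probability that next month's stock is j when s + a = k.
p = [[1, 0, 0, 0],
     [Fr(3, 4), Fr(1, 4), 0, 0],
     [Fr(1, 4), Fr(1, 2), Fr(1, 4), 0],
     [0, Fr(1, 4), Fr(1, 2), Fr(1, 4)]]

def evaluate_policy(policy, r, p):
    """Policy evaluation via Eq. (19); policy[t - 1][s] is d_t(s)."""
    n_states = len(r)
    N = len(policy)
    v = {N + 1: [0] * n_states}          # step a: v^{N+1}(s) = 0
    for t in range(N, 0, -1):            # step b, repeated until t = 1
        v[t] = []
        for s in range(n_states):
            a = policy[t - 1][s]
            expected = sum(p[s + a][j] * v[t + 1][j] for j in range(n_states))
            v[t].append(r[s][a] + expected)
    return v

# Illustrative policy (an assumption of this sketch, not prescribed by the text).
policy = [[3, 0, 0, 0],   # d_1
          [2, 0, 0, 0],   # d_2
          [0, 0, 0, 0]]   # d_3
values = evaluate_policy(policy, r, p)
print(values[1])          # v^1(s) for s = 0, 1, 2, 3
```

The dictionary `values` holds every intermediate function $v^t$, so the value of the policy from any stage onward can be read off as well.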
The backward induction algorithm proceeds as follows:

a. Set t = N + 1 and $v^{N+1}(s) = 0$ for all $s \in S_{N+1}$.
b. Substitute t − 1 for t (t − 1 → t) and compute $v^t(s)$ for each $s \in S_t$ using Eq. (15) or (17) depending on which is appropriate. Denote by $A^*_{s,t}$ the set of actions $a^*$ for which in the deterministic case,

$$v^t(s) = r(s, a^*) + v^{t+1}\big(w_t(s, a^*)\big), \tag{20}$$

or in the stochastic case,

$$v^t(s) = r(s, a^*) + \sum_{j \in S_{t+1}} p(j|s, a^*)\, v^{t+1}(j). \tag{21}$$

c. If t = 1, stop; otherwise, return to step b.

By comparing this procedure with the policy evaluation procedure above, we can easily see that the backward induction algorithm accomplishes three objectives:

a. It finds sets of actions $A^*_{s,t}$ that contain all actions in $A_{s,t}$ that obtain the maximum in Eq. (15) or (17).
b. It evaluates any policy made up of actions selected from the sets $A^*_{s,t}$.
c. It gives the total return or expected total return $v^t(s)$ that would be obtained if a policy corresponding to selecting actions in $A^*_{s,t}$ were used from stage t onward.

Thus, if the decision maker had specified a policy that selected actions in the sets $A^*_{s,t}$ before applying the policy evaluation algorithm, these two procedures would be identical. It will be argued below that any policy obtained by selecting an action from $A^*_{s,t}$ in each state at every stage is optimal and consequently $v^1(s)$ is the optimal value function for the problem; that is, $v^1(s) = v_N^*(s)$.

In deterministic problems, specifying a policy often provides much superfluous information, since if the state of the system is known before the first decision, a policy determines the system evolution with certainty and only one state is reached at each stage. Since all deterministic problems are equivalent to longest-route problems, the objective in such problems is only to find a longest route. The following route selection algorithm does this. The system state is known before decision 1, so $S_1$ contains a single state.

a. Set t = 1 and for $s \in S_t$ define $d_t(s) = a^*$ for some $a^* \in A^*_{s,t}$. Set $u = w_t(s, a^*)$.
b. For $u \in S_{t+1}$, define $d_{t+1}(u) = a^*$ for some $a^* \in A^*_{u,t+1}$. Replace u by $u = w_{t+1}(u, a^*)$.
c. If t + 1 = N, stop; otherwise, t + 1 → t and return to step b.

In this algorithm, the choice of an action at each node determines which arc will be traversed, and decisions are necessary only at nodes that can be reached from the initial state. The algorithm traces forward through the network along the path determined by decisions that obtain the maximum in Eq. (15). It produces one route through the network with longest length. If at any stage the set $A^*_{s,t}$ contains more than one action, then several optimal routings exist, and if all are desired, the procedure must be carried out to trace each path.

A problem closely related to the longest-route problem is that of finding the longest route from each node to the final node. When there is only one action in $A^*_{s,t}$ for each s and t, then specifying the decision rule that in each state is equal to the unique maximizing action produces these routings. This is closer in spirit to the concept of a policy than a longest route.

That the above procedure results in an optimal policy and optimal value function is due to the additivity of rewards in successive periods. A formal proof of these results is based on induction, but the following argument gives the main idea. The backward induction algorithm chooses maximizing actions in reverse order. It does not matter what happened before the current stage. The only important information for future decisions is the current state of the system. First, for stage N the best action in each state is selected. Clearly, $v^N(s)$ is the optimal value function for a one-stage problem beginning in state s at stage N. Next, in each state at stage N − 1, an action is found to maximize the immediate reward plus the reward that will be obtained if, after reaching a state at stage N, the decision maker chooses the optimal action at that stage. Clearly, $v^{N-1}(s)$ is the optimal value for a one-stage problem with terminal reward $v^N(s)$. Since $v^N(s)$ is the optimal value for the one-stage problem starting at stage N, no greater total reward can be obtained over these two stages. Hence, $v^{N-1}(s)$ is the optimal reward from stage N − 1 onward starting in state s. Now, since the sets $A^*_{s,N}$ and $A^*_{s,N-1}$ have been determined by the backward induction algorithm, choosing any policy that selects actions from these sets at each stage and evaluating it with the policy evaluation algorithm above will also yield $v^{N-1}(s)$. Thus, this policy is optimal over these two stages since its value equals the optimal value. This argument is repeated at stages N − 2, N − 3, . . . , 1 to conclude that a policy that selects an action from $A^*_{s,t}$ at each stage is optimal.

The above argument contains the essence of the principle of optimality, which appeared in its original form on p. 83 of Bellman's classic book, "Dynamic Programming," as follows:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
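For the deterministic case, backward induction (Eqs. (15), (16), and (20)) followed by route selection can be sketched as below. The code is an illustration rather than a definitive implementation: the function names and data layout are hypothetical, the arc lengths are assumed from the longest-route example worked in Section II.D, and an action at a node is identified with the node it leads to, so that $w_t(s, a) = a$.

```python
# arcs[s][a]: reward r_t(s, a) for moving from node s to node a.
arcs = {
    1: {2: 2, 3: 4},
    2: {4: 5, 5: 6},
    3: {4: 5, 5: 3, 6: 1},
    4: {7: 3},
    5: {7: 6},
    6: {7: 2},
}
# State sets S_1, ..., S_{N+1}; here N = 3 and node 7 is the destination.
stages = [[1], [2, 3], [4, 5, 6], [7]]

def backward_induction(arcs, stages):
    """Return the value function v and the maximizing action sets A*."""
    v = {s: 0 for s in stages[-1]}             # boundary condition, Eq. (16)
    best = {}
    for states in reversed(stages[:-1]):       # t = N, N - 1, ..., 1
        for s in states:
            totals = {a: reward + v[a] for a, reward in arcs[s].items()}
            v[s] = max(totals.values())        # Eq. (15)
            best[s] = {a for a, total in totals.items() if total == v[s]}
    return v, best

def select_route(start, best, terminal):
    """Trace one longest route forward, picking one action from each A*."""
    route = [start]
    while route[-1] not in terminal:
        route.append(min(best[route[-1]]))     # any element of A* would do
    return route

v, best = backward_induction(arcs, stages)
route = select_route(1, best, set(stages[-1]))
print(v[1], route)
```

Ties in a maximizing set are broken by taking the smallest node label; choosing a different element would trace a different route of the same length.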
The functional equations (15) and (17) are mathematical statements of this principle.

It might not be obvious to the reader why the backward induction algorithm is more attractive than an enumeration procedure for solving sequential decision problems. To see this, suppose that there are N stages, M states at each stage, and K actions that can be chosen in each state. Then there are $(K^M)^N$ policies. Solving a deterministic problem by enumeration would require $N(K^M)^N$ additions and $(K^M)^N$ comparisons. By backward induction, solution would require NMK additions and NK comparisons, a potentially astronomical savings in work. Solving stochastic problems requires additional M multiplications at each state at each stage to evaluate expectations. Enumeration requires $MN(K^M)^N$ multiplications, whereas backward induction would require NKM multiplications. Clearly, backward induction is a superior method for solving any problem of practical significance.

D. Examples

In this section, the use of the backward induction algorithm is illustrated in terms of the examples that were presented in Section I. First, the longest-route problem in Fig. 2 is considered.

1. A Longest-Route Problem

Note that N = 3 in this example.

a. Set t = 4 and $v^4(7) = 0$.
b. Since t ≠ 1, continue. Set t = 3 and

$$v^3(4) = r_3(4, 7) + v^4(7) = 3 + 0 = 3, \qquad A^*_{4,3} = \{7\},$$
$$v^3(5) = r_3(5, 7) + v^4(7) = 6 + 0 = 6, \qquad A^*_{5,3} = \{7\},$$
$$v^3(6) = r_3(6, 7) + v^4(7) = 2 + 0 = 2, \qquad A^*_{6,3} = \{7\}.$$

c. Since t ≠ 1, continue. Set t = 2 and

$$v^2(2) = \max\{ r_2(2, 4) + v^3(4),\ r_2(2, 5) + v^3(5) \} = \max\{5 + 3,\ 6 + 6\} = 12,$$
$$v^2(3) = \max\{ r_2(3, 4) + v^3(4),\ r_2(3, 5) + v^3(5),\ r_2(3, 6) + v^3(6) \} = \max\{5 + 3,\ 3 + 6,\ 1 + 2\} = 9,$$
$$A^*_{2,2} = \{5\}; \quad A^*_{3,2} = \{5\}.$$

d. Since t ≠ 1, continue. Set t = 1 and

$$v^1(1) = \max\{ r_1(1, 2) + v^2(2),\ r_1(1, 3) + v^2(3) \} = \max\{2 + 12,\ 4 + 9\} = 14,$$
$$A^*_{1,1} = \{2\}.$$

e. Since t = 1, stop.

This algorithm yields the information that the longest route from node 1 to node 7 has length 14, the longest route from node 2 to node 7 has length 12, and so on. To find the choice of arcs that corresponds to the longest route, we must apply the route selection algorithm:

a. Set t = 1, $d_1(1) = 2$, and u = 2.
b. Set $d_2(2) = 5$ and u = 5.
c. Since t + 1 ≠ 3, continue. Set t = 2, $d_3(5) = 7$, and u = 7.
d. Since t + 1 = 3, stop.

This procedure gives the longest path through the network, namely, 1 → 2 → 5 → 7, which can easily be seen to have length 14. Note that choosing the myopic policy at each node would not have been optimal. At node 1, the myopic policy would have selected action 3; at node 3, action 4; and at node 4, action 7. The path 1 → 3 → 4 → 7 has length 12. By not taking future consequences into account at the first stage, the decision maker would have found himself in a poor position for subsequent decisions.

The backward induction algorithm has also obtained an optimal policy. It is given by

$$\pi^* = (d_1^*, d_2^*, d_3^*),$$

where

$$d_1^*(1) = 2, \quad d_2^*(2) = 5, \quad d_2^*(3) = 5, \quad d_3^*(4) = 7, \quad d_3^*(5) = 7, \quad \text{and} \quad d_3^*(6) = 7.$$

This policy provides the longest route from each node to node 7, as promised by the principle of optimality. In the language of graph theory, this corresponds to a maximal spanning tree. The longest route from each node to node 7 is depicted in Fig. 4.

2. A Resource Allocation Problem

The backward induction algorithm is now applied to the continuous version of the resource allocation problem of Section I. Computation in the discrete problem is almost identical to that in the longest-route problem and will be left to the reader.

The bounds on the $s_i$'s are changed to simplify exposition. The problem that will be solved is given by
Maximize $3s_1^2 + s_2^3 + 4s_3$
subject to
$s_1 + s_2 + s_3 \le 4$
and
$0 \le s_1 \le 3$; $0 \le s_2$; $0 \le s_3$.

FIGURE 4 Solution to the longest-route problem.

The backward induction is applied as follows:

a. Set t = 4 and $v^4(s) = 0$, 0 ≤ s ≤ 4.
b. Since t ≠ 1, continue. Set t = 3 and

$$v^3(s) = \max_{0 \le a \le s} \{ r_3(s, a) + v^4(s - a) \} = \max_{0 \le a \le s} \{4a\} = 4s$$

and $A^*_{s,3} = \{s\}$.

c. Since t ≠ 1, continue. Set t = 2 and

$$v^2(s) = \max_{0 \le a \le s} \{ r_2(s, a) + v^3(s - a) \} = \max_{0 \le a \le s} \{ a^3 + 4(s - a) \},$$

$$v^2(s) = \begin{cases} 4s, & 0 \le s \le 2 \\ s^3, & 2 \le s \le 4, \end{cases}$$

and

$$A^*_{s,2} = \begin{cases} \{0\}, & 0 \le s \le 2 \\ \{s\}, & 2 \le s \le 4. \end{cases}$$

d. Since t ≠ 1, continue. Set t = 1. A solution is obtained for $v^1(4)$ only. Obviously it is optimal to allocate all resources in this problem. In most problems one would obtain a $v^1(s)$ for all s. That is quite tedious in this example and unnecessary for solution of the original problem.

$$v^1(4) = \max_{0 \le a \le 3} \{ r_1(4, a) + v^2(4 - a) \}$$
$$= \max\Big\{ \max_{\substack{0 \le a \le 3 \\ 0 \le 4 - a \le 2}} \{ 3a^2 + 4(4 - a) \},\ \max_{\substack{0 \le a \le 3 \\ 2 \le 4 - a}} \{ 3a^2 + (4 - a)^3 \} \Big\}$$
$$= \max\Big\{ \max_{2 \le a \le 3} \{ 3a^2 + 4(4 - a) \},\ \max_{0 \le a \le 2} \{ 3a^2 + (4 - a)^3 \} \Big\}$$
$$= \max\{31, 64\} = 64$$

and $A^*_{4,1} = \{0\}$. The first inner maximum is attained at a = 3 and the second at a = 0, since both expressions are convex in a and so attain their maxima at an endpoint of the interval.

e. Since t = 1, stop.

The objective function value for this constrained resource allocation problem is 64. To find the optimal resource allocation, we must use the second algorithm above:

a. Set t = 1, $d_1(4) = 0$, and u = 4.
b. Set $d_2(4) = 4$ and u = 0.
c. Since t + 1 ≠ 3, continue. Set t = 2, $d_3(0) = 0$, and u = 0.
d. Since t + 1 = 3, stop.

This procedure gives the optimal allocation, namely, $s_1 = 0$, $s_2 = 4$, and $s_3 = 0$, which corresponds to the optimal value of 64.

3. The Inventory Example

The backward induction algorithm is now applied to the numerical stochastic inventory example of Section I. Since the data are stationary, the time index will be deleted. Also, the following additional notation will be useful. Define $v^t(s, a)$ by

$$v^t(s, a) = r(s, a) + \sum_{j \in S} p(j|s, a)\, v^{t+1}(j). \tag{22}$$

a. Set t = 4 and $v^4(s) = 0$, s = 0, 1, 2, 3.
b. Since t ≠ 1, continue. Set t = 3, and for s = 0, 1, 2, 3,

$$v^3(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j|s, a)\, v^4(j) \Big\} = \max_{a \in A_s} \{ r(s, a) \}. \tag{23}$$

It is obvious from inspecting the values of r(s, a) that in each state the maximizing action is 0; that is, do not order. Thus,
s    v^3(s)    A*_{s,3}
0      0         0
1      5         0
2      6         0
3      5         0

c. Since t ≠ 1, continue. Set t = 2 and

$$v^2(s) = \max_{a \in A_s} \{ v^2(s, a) \},$$

where, for instance,

$$v^2(0, 2) = r(0, 2) + p(0|0, 2)v^3(0) + p(1|0, 2)v^3(1) + p(2|0, 2)v^3(2) + p(3|0, 2)v^3(3)$$
$$= -2 + \tfrac{1}{4} \times 0 + \tfrac{1}{2} \times 5 + \tfrac{1}{4} \times 6 + 0 \times 5 = 2.$$

The quantities $v^2(s, a)$, $v^2(s)$, and $A^*_{s,2}$ are summarized in the following tabulation, where Xs denote infeasible actions:

v^2(s, a)
s    a=0       a=1      a=2      a=3     v^2(s)    A*_{s,2}
0    0         1/4      2        1/2     2         2
1    6 1/4     4        2 1/2    X       6 1/4     0
2    10        4 1/2    X        X       10        0
3    10 1/2    X        X        X       10 1/2    0

d. Since t ≠ 1, continue. Set t = 1 and

$$v^1(s) = \max_{a \in A_s} \{ v^1(s, a) \}.$$

The quantities $v^1(s, a)$, $v^1(s)$, and $A^*_{s,1}$ are summarized in the following tabulation, where Xs denote infeasible actions:

v^1(s, a)
s    a=0       a=1       a=2      a=3      v^1(s)    A*_{s,1}
0    2         33/16     66/16    67/16    67/16     3
1    129/16    98/16     99/16    X        129/16    0
2    194/16    131/16    X        X        194/16    0
3    227/16    X         X        X        227/16    0

e. Since t = 1, stop.

This procedure has produced the optimal reward function $v_3^*(s)$ and optimal policy $\pi^* = (d_1^*(s), d_2^*(s), d_3^*(s))$, which are as follows:

s    d_1*(s)    d_2*(s)    d_3*(s)    v_3*(s)
0    3          2          0          67/16
1    0          0          0          129/16
2    0          0          0          194/16
3    0          0          0          227/16

This policy has a particularly simple form: If at decision point 1 the inventory in stock is 0 units, order 3 units; otherwise, do not order. If at decision point 2 the inventory in stock is 0 units, order 2 units; otherwise, do not order. And at decision point 3 do not order. The quantity $v_3^*(s)$ gives the expected total reward obtained by using this policy when the inventory before the first decision epoch is s units.

A policy of this type is called an (s, S) policy. An (s, S) policy is implemented as follows. If in period t the inventory level is $s^t$ units or below, order the number of units required to bring the inventory level up to $S^t$ units. Under certain convexity and linearity assumptions on the cost functions, Scarf showed in an elegant and important 1960 paper that (s, S) policies are optimal for the stochastic inventory problem. His proof of this result is based on using backward induction to show analytically that, for each t, $v^t(s)$ is K-convex, which ensures that there exists a maximizing policy of (s, S) type. This important result plays a fundamental role in stochastic operations research and has been extended in several ways.

III. INFINITE-HORIZON DYNAMIC PROGRAMMING

A. Introduction

Solution methods for infinite-horizon sequential decision problems are based on solving a stationary version of the functional equation. In this section, the state and action sets, the rewards, and the transition probabilities are assumed to be stationary and only the stochastic version of the problem is considered. Reward streams are summarized using the expected total discounted reward criterion, and the objective is to find a policy with maximal expected total discounted reward as well as this value. The two main solution techniques are value iteration and policy iteration. The former is the extension of backward induction to infinite-horizon problems and is best analyzed from the perspective of obtaining a fixed point for the Bellman equation. Policy iteration corresponds to using a generalization of Newton's method for finding a zero of the functional equation. It is not appropriate for finite-horizon problems. Other solution methods include modified policy iteration, linear programming, Gauss-Seidel
iteration, successive overrelaxation, and extrapolation. Of these, only modified policy iteration and linear programming will be discussed.

B. Basic Results

The basic data of the stationary, infinite-horizon, stochastic sequential decision model are the state space S; the set of allowable actions in state s, $A_s$; the one-period expected reward if action a is selected in state s, r(s, a); the probability the system is in state j at the next stage if it is in state s and action a is selected, p(j|s, a); and the discount factor λ, 0 ≤ λ < 1. Both S and $A_s$ are assumed to be finite. Let M denote the number of elements in S.

For a policy $\pi = (d_1, d_2, \ldots)$, the infinite-horizon expected total discounted reward is denoted by $v_\lambda^\pi(s)$. Let $p_\pi^m(j|s) = P\{X_{m+1}^\pi = j \mid X_1^\pi = s\}$. For m = 2 it can be computed by

$$p_\pi^2(j|s) = \sum_{k \in S} p(j|k, d_2(k))\, p(k|s, d_1(s)).$$

This is the matrix product of the transition probability matrices corresponding to the decision rules $d_1(s)$ and $d_2(s)$. In general $p_\pi^m(j|s)$ is given by the matrix product of the matrices corresponding to $d_1, d_2, \ldots, d_m$. Using this notation, we can compute $v_\lambda^\pi(s)$ by

$$v_\lambda^\pi(s) = r(s, d_1(s)) + \sum_{j \in S} \lambda\, p(j|s, d_1(s))\, r(j, d_2(j)) + \sum_{j \in S} \lambda^2 p_\pi^2(j|s)\, r(j, d_3(j)) + \cdots. \tag{24}$$

Equation (24) cannot be implemented since an arbitrary nonstationary infinite-horizon policy cannot be completely specified. However, if the rewards are bounded, the above infinite series is convergent, because λ < 1. When the policy is stationary, Eq. (24) simplifies so that a policy can be evaluated either inductively or by solution of a system of linear equations. Let d denote the stationary policy that uses decision rule d(s) at each stage and $p_d^m(j|s)$ its corresponding m-step transition probabilities. Then,

$$v_\lambda^d(s) = r(s, d(s)) + \sum_{j \in S} \lambda\, p(j|s, d(s))\, r(j, d(j)) + \sum_{j \in S} \lambda^2 p_d^2(j|s)\, r(j, d(j)) + \sum_{j \in S} \lambda^3 p_d^3(j|s)\, r(j, d(j)) + \cdots \tag{25}$$

$$= r(s, d(s)) + \sum_{j \in S} \lambda\, p(j|s, d(s))\, v_\lambda^d(j). \tag{26}$$

Equation (26) is derived from Eq. (25) by explicitly writing out the matrix powers, factoring out the term p(j|s, d(s)) from all summations, and recognizing that the remaining terms are exactly $v_\lambda^d(j)$. Equation (26) can be rewritten as

$$\sum_{j \in S} \left[ \delta(j, s) - \lambda\, p(j|s, d(s)) \right] v_\lambda^d(j) = r(s, d(s)),$$

where δ(j, s) = 1 if j = s and 0 if j ≠ s. This equation can be re-expressed in matrix terms as

$$(I - \lambda P_d)\, v_\lambda^d = r_d, \tag{27}$$

where $P_d$ is the matrix with entries p(j|s, d(s)), I is the identity matrix, and $r_d$ is the vector with elements r(s, d(s)). Equation (27) has a unique solution, which can be obtained by Gaussian elimination, successive approximations, or any other numerical method. A consequence of this equation is the following convenient representation for $v_\lambda^d(s)$:

$$v_\lambda^d = (I - \lambda P_d)^{-1} r_d. \tag{28}$$

The inverse in Eq. (28) exists because $P_d$ is a probability matrix, so that its spectral radius is less than or equal to 1 and 0 ≤ λ < 1.

The functional equation of infinite-horizon discounted dynamic programming is the stationary equivalent of Eq. (17). It is given by

$$v(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} \lambda\, p(j|s, a)\, v(j) \Big\} \equiv T v(s). \tag{29}$$

Equation (29) defines the nonlinear operator T on the space of bounded M vectors or real-valued functions on S. Since T is a contraction mapping (see Section III.C.1), this equation has a unique solution. This solution is the expected discounted reward of an optimal policy. To see this, let $v^*(s)$ denote the solution of this equation. Then for any decision rule $d_t(s)$,

$$v^*(s) \ge r(s, d_t(s)) + \sum_{j \in S} \lambda\, p(j|s, d_t(s))\, v^*(j). \tag{30}$$

By repeatedly substituting the above inequality into the right-hand side of Eq. (30) and noting that $\lambda^n \to 0$ as n → ∞, we can see that what follows from Eq. (24) is that $v^*(s) \ge v_\lambda^\pi(s)$ for any policy π. Also, if $d^*(s)$ is the decision rule that satisfies

$$r(s, d^*(s)) + \sum_{j \in S} \lambda\, p(j|s, d^*(s))\, v^*(j) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} \lambda\, p(j|s, a)\, v^*(j) \Big\}, \tag{31}$$

then the stationary policy that uses $d^*(s)$ each period is optimal. This follows because if $d^*$ satisfies Eq. (31), then

$$v^*(s) = r(s, d^*(s)) + \sum_{j \in S} \lambda\, p(j|s, d^*(s))\, v^*(j),$$
but since this equation has a unique solution,

\[ v^*(s) = v_\lambda^{d^*}(s) = v_\lambda^*(s). \]

These results are summarized as follows. There exists an optimal policy to the infinite-horizon discounted stochastic sequential decision problem that is stationary and can be found by using Eq. (31). Its value function is the unique solution of the functional equation of discounted dynamic programming.

This result plays the same role as the principle of optimality in finite-horizon dynamic programming. It says that for us to solve the infinite-horizon dynamic programming problem, it is sufficient to obtain a solution to the functional equation. In the next section, methods for solving the functional equation will be demonstrated.

C. Computational Methods

The four methods to be discussed here are value iteration, policy iteration, modified policy iteration, and linear programming. Only the first three are applied to the example in Section III.D. Value iteration and modified policy iteration are iterative approximation methods for solving the dynamic programming functional equation, whereas policy iteration is exact. To study the convergence of an approximation method, we must have a notion of distance. If $v$ is a real-valued function on $S$ (an $M$ vector), the norm of $v$, denoted by $\|v\|$, is defined as

\[ \|v\| = \max_{s \in S} |v(s)|. \]

The distance between two vectors $v$ and $u$ is given by $\|v - u\|$. This means that two vectors are $\varepsilon$ units apart if the maximum absolute difference between corresponding components is $\varepsilon$ units. This is often called the $L^\infty$ norm.

A policy $\pi$ is said to be $\varepsilon$-optimal if $\|v_\lambda^\pi - v_\lambda^*\| < \varepsilon$. If $\varepsilon$ is specified sufficiently small, the two iterative algorithms can be used to find policies whose expected total discounted reward is arbitrarily close to optimum. Of course, the more accurate the approximation, the more iterations of the algorithm are required.

1. Value Iteration

Value iteration, or successive approximation, is the direct extension of backward induction to infinite-horizon problems. It obtains an $\varepsilon$-optimal policy $d^\varepsilon$ as follows:

a. Select $v^0$, specify $\varepsilon > 0$, and set $n = 0$.
b. Compute $v^{n+1}$ by

\[ v^{n+1}(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v^n(j) \Big\}. \tag{32} \]

c. If $\|v^{n+1} - v^n\| < \varepsilon(1 - \lambda)/2\lambda$, go to step d. Otherwise, increment $n$ by 1 and return to step b.
d. For each $s \in S$, set $d^\varepsilon(s)$ equal to an $a \in A_s$ that obtains the maximum on the right-hand side of Eq. (32) at the last iteration and stop.

This algorithm can best be understood in vector space notation. In Eq. (29), the operator $T$ is defined on the set of bounded real-valued $M$ vectors. Solving the functional equation corresponds to finding a fixed point of $T$, that is, a $v$ such that $Tv = v$. The value iteration algorithm starts with an arbitrary $v^0$ (0 is usually a good choice) and iterates according to $v^{n+1} = Tv^n$. Since $T$ is a contraction mapping, that is,

\[ \|Tv - Tu\| \le \lambda \|v - u\| \]

for any $v$ and $u$, the iterative method is convergent for any $v^0$. This is because

\[ \|v^{n+1} - v^n\| \le \lambda^n \|v^1 - v^0\|, \]

and the space of bounded real-valued $M$ vectors is a Banach space (a complete normed linear space) with respect to the norm used here. Since a contraction mapping has a unique fixed point, $v_\lambda^*$, $v^n$ converges to it. The rate of convergence is geometric with parameter $\lambda$, that is,

\[ \|v^n - v^*\| \le \lambda^n \|v^0 - v^*\|. \]

The algorithm terminates with a value function $v^{n+1}$ and a decision rule $d^\varepsilon$ with the following property:

\[ v^{n+1}(s) = r(s, d^\varepsilon(s)) + \sum_{j \in S} \lambda p(j|s, d^\varepsilon(s))\, v^n(j). \tag{33} \]

The stopping rule in step c ensures that the stationary policy that uses $d^\varepsilon$ every period is $\varepsilon$-optimal.

The iterates $v^n$ have interesting interpretations. Each iterate corresponds to the optimal expected total discounted return in an $n$-period problem in which the terminal reward equals $v^0$. Alternatively, they correspond to the expected total discounted returns for the policy in an $n$-period problem that is obtained by choosing a maximizing action in each state at each iteration.

2. Policy Iteration

Policy iteration, or approximation in the policy space, is an algorithm that uses the special structure of infinite-horizon stationary dynamic programming problems to find all optimal policies. The algorithm is as follows:

a. Select a decision rule $d^0(s)$ for all $s \in S$ and set $n = 0$.
b. Solve the system of equations

\[ \sum_{j \in S} [\delta(j, s) - \lambda p(j|s, d^n(s))]\, v^n(j) = r(s, d^n(s)) \tag{34} \]

for $v^n(s)$.
c. For each $s$, and each $a \in A_s$, compute

\[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v^n(j). \tag{35} \]

For each $s$, put $a$ in $A^*_{n,s}$ if $a$ obtains the maximum value in Eq. (35).
d. If, for all $s$, $d^n(s)$ is contained in $A^*_{n,s}$, stop. Otherwise, proceed.
e. Set $d^{n+1}(s)$ equal to any $a$ in $A^*_{n,s}$ for each $s$ in $S$, increment $n$ by 1, and return to step b.

The algorithm consists of two main parts: step b, which is called policy evaluation, and step c, which is called policy improvement. The algorithm terminates when the set of maximizing actions found in the improvement stage repeats, that is, if the same decision obtains the maximum in step c on two successive passes through the iteration loop.

This algorithm terminates in a finite number of iterations with an optimal stationary policy and its expected total discounted reward. This is because the improvement procedure guarantees that $v^{n+1}(s)$ is strictly greater than $v^n(s)$, for some $s \in S$, until the termination criterion is satisfied, at which point $v^n(s)$ is the solution of the dynamic programming functional equation. Since each $v^n$ is the expected total discounted reward of the stationary policy $d^n$, and there are only finitely many stationary policies, the procedure must terminate in a finite number of iterations.

If only an $\varepsilon$-optimal policy is desired, a stopping rule similar to that in step c of the value iteration procedure can be used.

3. Modified Policy Iteration

The evaluation step of the policy iteration algorithm is usually implemented by solving the linear system

\[ (I - \lambda P_{d^n})\, v^{d^n} = r_{d^n} \tag{36} \]

by using Gaussian elimination, which requires $\tfrac{1}{3}M^3$ multiplications and divisions. When the number of states is large, exact solution of Eq. (34) can be computationally prohibitive. An alternative is to use successive approximations to obtain an approximate solution. This is the basis of the modified policy iteration, or value-oriented successive approximation, method. The modified policy iteration algorithm of order $m$ is as follows:

a. Select $v^0$, specify $\varepsilon > 0$, and set $n = 0$.
b. For each $s$ and each $a \in A_s$, compute

\[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v^n(j). \tag{37} \]

For each $s$, put $a$ in $A^*_{n,s}$ if $a$ obtains the maximum value in Eq. (37).
c. For each $s$ in $S$, set $d^n(s)$ equal to any $a$ in $A^*_{n,s}$.
   (i) Set $k = 0$ and define $u^0(s)$ by

\[ u^0(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v^n(j) \Big\}. \tag{38} \]

   (ii) If $\|u^0 - v^n\| < \varepsilon(1 - \lambda)/2\lambda$, go to step d. Otherwise, go to step (iii).
   (iii) Compute $u^{k+1}$ by

\[ u^{k+1}(s) = r(s, d^n(s)) + \sum_{j \in S} \lambda p(j|s, d^n(s))\, u^k(j). \tag{39} \]

   (iv) If $k = m$, go to step (v). Otherwise, increment $k$ by 1 and return to step (iii).
   (v) Set $v^{n+1} = u^{m+1}$, increment $n$ by 1, and go to step b.
d. For each $s \in S$, set $d^\varepsilon(s) = d^n(s)$ and stop.

This algorithm combines features of both policy iteration and value iteration. Like value iteration, it is an iterative algorithm that terminates with an $\varepsilon$-optimal policy; however, value iteration avoids step c above. The stopping criterion used in step (ii) is identical to that of value iteration, and the computation of $u^0$ in step (i) requires no extra work because it has already been determined in step b. Like policy iteration, the algorithm contains an improvement step, step b, and an evaluation step, step c. However, the evaluation is not done exactly. Instead, it is carried out iteratively in step c, which is repeated $m$ times. Note that $m$ can be selected in advance or adaptively during the algorithm. For instance, $m$ can be chosen so that $\|u^{m+1} - u^m\|$ is less than some prespecified tolerance that can vary with $n$. Recent studies have shown that low orders of $m$ work well, while adaptive choice is better.

The modified policy iteration algorithm serves as a bridge between value iteration and policy iteration. When $m = 0$, it is equivalent to value iteration, and when $m$ is infinite, it is equivalent to policy iteration. It will converge in fewer iterations than value iteration and more iterations than policy iteration; however, the computational effort per iteration exceeds that for value iteration and is less than that for policy iteration. When the number of states, $M$, is large, it has been shown to be the most computationally efficient method for solving Markov decision problems.
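The modified policy iteration scheme can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the article's implementation: the function names, the array layout (`P[a, s, j]` for $p(j|s,a)$ and `r[s, a]` for $r(s,a)$), the requirement that every action be available in every state, and the two-state data `P_demo`, `r_demo` are all hypothetical. The sketch applies exactly $m$ evaluation sweeps after $u^0$, which is the convention of the worked example in Section III.D (order 5 gives $v^1 = u^5$) and which reduces to value iteration when $m = 0$.

```python
import numpy as np

def policy_value(P, r, lam, d):
    """Exact evaluation of a stationary policy d via Eq. (28): (I - lam P_d)^{-1} r_d."""
    states = np.arange(r.shape[0])
    P_d = P[d, states, :]            # row s is p(.|s, d(s))
    r_d = r[states, d]
    return np.linalg.solve(np.eye(len(states)) - lam * P_d, r_d)

def modified_policy_iteration(P, r, lam, m=5, eps=0.1, max_outer=100_000):
    """Return (decision rule, last u^0) using the stopping rule of step (ii)."""
    P = np.asarray(P, dtype=float)   # assumed shape (A, M, M)
    r = np.asarray(r, dtype=float)   # assumed shape (M, A)
    states = np.arange(r.shape[0])
    v = np.zeros(r.shape[0])         # v^0 = 0
    tol = eps * (1 - lam) / (2 * lam)
    for _ in range(max_outer):
        # Improvement step: q[s, a] = r(s, a) + lam * sum_j p(j|s, a) v(j)
        q = r + lam * np.einsum("asj,j->sa", P, v)
        d = q.argmax(axis=1)
        u = q.max(axis=1)            # u^0 = T v^n
        if np.max(np.abs(u - v)) < tol:
            return d, u
        # Partial evaluation: m sweeps of Eq. (39) under the fixed rule d
        P_d, r_d = P[d, states, :], r[states, d]
        for _ in range(m):
            u = r_d + lam * P_d @ u
        v = u
    raise RuntimeError("no convergence within max_outer iterations")

# Hypothetical two-state, two-action data (not the inventory example).
P_demo = np.array([[[0.5, 0.5], [0.2, 0.8]],
                   [[0.9, 0.1], [0.6, 0.4]]])
r_demo = np.array([[1.0, 0.5],
                   [0.0, 2.0]])
d_eps, u_last = modified_policy_iteration(P_demo, r_demo, lam=0.9, m=5, eps=0.1)
```

Calling the same function with `m=0` performs plain value iteration, which makes it easy to compare the number of maximization steps the two methods need on the same data.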

4. Linear Programming

The stationary infinite-horizon discounted stochastic sequential decision problem can be formulated and solved by linear programming. The primal problem is given by

Minimize

\[ \sum_{j \in S} \alpha_j v(j) \]

subject to, for $a \in A_s$ and $s \in S$,

\[ v(s) \ge r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v(j), \]

and $v(s)$ is unconstrained.

The constants $\alpha_j$ are positive and arbitrary. The dual problem is given by

Maximize

\[ \sum_{s \in S} \sum_{a \in A_s} x(s, a)\, r(s, a) \]

subject to, for $j \in S$,

\[ \sum_{a \in A_j} x(j, a) - \sum_{s \in S} \sum_{a \in A_s} \lambda p(j|s, a)\, x(s, a) = \alpha_j, \]

and $x(j, a) \ge 0$ for $a \in A_j$ and $j \in S$.

Using a general-purpose linear programming code for solving dynamic programming problems is not computationally attractive. The dynamic programming methods are more efficient. The interest in the linear programming formulation is primarily theoretical but allows inclusion of side constraints. Some interesting observations are as follows:

a. The dual problem is always feasible and bounded. Any optimal basis has the property that for each $s \in S$, $x(s, a) > 0$ for only one $a \in A_s$. An optimal stationary policy is given by $d^*(s) = a$ if $x(s, a) > 0$.
b. The same basis is optimal for all $\alpha_j$'s.
c. When the dual problem is solved by the simplex algorithm with block pivoting, it is equivalent to policy iteration.
d. When policy iteration is implemented by only changing the action that gives the maximum improvement over all states, it is equivalent to solving the dual problem by the simplex method.

D. Numerical Examples

In this section, an infinite-horizon version of the stochastic inventory example presented earlier is solved by using value iteration, policy iteration, and modified policy iteration. The data are as analyzed in the finite-horizon case; however, the discount factor $\lambda$ is chosen to be .9. The objective is to determine the stationary policy that maximizes the expected total infinite-horizon discounted reward.

1. Value Iteration

To initiate the algorithm, we will take $v^0$ to be the zero vector; $\varepsilon$ is chosen to be .1. The algorithm will terminate with a stationary policy that is guaranteed to have an expected total discounted reward within .1 of optimal. Calculations proceed as in the finite-horizon backward induction algorithm until the stopping criterion of

\[ \|v^{n+1} - v^n\| \le \frac{\varepsilon(1 - \lambda)}{2\lambda} = \frac{.1 \times .1}{2 \times .9} = .0056 \]

is satisfied. The value functions $v^n$ and the maximizing actions obtained in step b at each iteration are provided in the tabulation below.

             v^n(s)                                  d^n(s)
  n    s=0       s=1       s=2       s=3        s=0  s=1  s=2  s=3

  0    0         0         0         0          0    0    0    0
  1    0         5.0       6.0       5.0        2    0    0    0
  2    1.6       6.125     9.6       9.95       2    0    0    0
  3    3.2762    7.4581    11.2762   12.9368    3    0    0    0
  4    4.6632    8.8895    12.6305   14.6632    3    0    0    0
  5    5.9831    10.1478   13.8914   15.9831    3    0    0    0
 10    10.7071   14.8966   18.6194   20.7071    3    0    0    0
 15    13.5019   17.6913   21.4142   23.0542    3    0    0    0
 30    16.6099   20.7994   24.5222   26.6099    3    0    0    0
 50    17.4197   21.6092   25.3321   27.4197    3    0    0    0
 56    17.4722   21.6617   25.3845   27.4722    3    0    0    0
 57    17.4782   21.6676   25.3905   27.4782    3    0    0    0
 58    17.4836   21.6730   25.3959   27.4836    3    0    0    0

The above algorithm terminates after 58 iterations, at which point $\|v^{58} - v^{57}\| = .0054$. The .1-optimal stationary policy is $d^\varepsilon = (3, 0, 0, 0)$, which means that if the stock level is 0, order 3 units; otherwise, do not order. Observe that the optimal policy was first identified at iteration 3, but the algorithm did not terminate until iteration 58. In larger problems such additional computational effort is extremely wasteful. Improved stopping rules and more efficient algorithms are described in Section III.E.

2. Policy Iteration

To initiate policy iteration, choose the myopic policy, namely, that which maximizes the immediate one-period reward $r(s, a)$. The algorithm proceeds as follows:

a. Set $d^0 = (0, 0, 0, 0)$ and $n = 0$.
b. Solve the evaluation equations:

\[ (1 - .9 \times 1)\, v^0(0) = 0, \]
\[ (-.9 \times .75)\, v^0(0) + (1 - .9 \times .25)\, v^0(1) = 5, \]
\[ (-.9 \times .25)\, v^0(0) + (-.9 \times .5)\, v^0(1) + (1 - .9 \times .25)\, v^0(2) = 6, \]

and

\[ (-.9 \times .25)\, v^0(1) + (-.9 \times .50)\, v^0(2) + (1 - .9 \times .25)\, v^0(3) = 5. \]

These equations are obtained by substituting the transition probabilities and rewards corresponding to policy $d^0$ into Eq. (34). The solution of these equations is $v^0 = (0, 6.4516, 11.4880, 14.9951)$.
c. For each $s$, the quantities

\[ r(s, a) + \sum_{j=0}^{3} \lambda p(j|s, a)\, v^0(j) \]

are computed for $a = 0, \ldots, 3 - s$, and the actions that achieve the maximum are placed into $A^*_{0,s}$. In this example

there is a unique maximizing action in each state, and it is given by

\[ A^*_{0,0} = \{3\}; \quad A^*_{0,1} = \{2\}; \quad A^*_{0,2} = \{0\}; \quad A^*_{0,3} = \{0\}. \]

d. Since $d^0(0) = 0$, it is not contained in $A^*_{0,0}$, so continue.
e. Set $d^1 = (3, 2, 0, 0)$ and $n = 1$, and return to the evaluation step.

The detailed step-by-step calculations for the remainder of the algorithm are omitted. The value functions and corresponding maximizing actions are presented below. Since there is a unique maximizing action in the improvement step at each iteration, $A^*_{n,s}$ is equivalent to $d^n(s)$ and only the latter is displayed.

             v^n(s)                                  d^n(s)
  n    s=0       s=1       s=2       s=3        s=0  s=1  s=2  s=3

  0    0         6.4516    11.4880   14.9951    0    0    0    0
  1    10.7955   12.7955   18.3056   20.7955    3    2    0    0
  2    17.5312   21.7215   25.4442   27.5318    3    0    0    0
  3    X         X         X         X          3    0    0    0

The algorithm terminates in three iterations with the optimal policy $d^* = (3, 0, 0, 0)$. Observe that an evaluation was unnecessary at iteration 3 since $d^2 = d^3$ terminated the algorithm before the evaluation step. Unlike value iteration, the algorithm has produced an optimal policy as well as its expected total discounted reward $v^3$, which is the optimal expected total discounted reward. This computation shows that the .1-optimal policy found by using value iteration is in fact optimal. This information could not be obtained by using value iteration unless the action elimination method described in Section III.E were used.

3. Modified Policy Iteration

The following illustrates the application of modified policy iteration of order 5. The first pass through the algorithm is described in detail; calculations for the remainder are presented in tabular form below.

a. Set $v^0 = (0, 0, 0, 0)$, $n = 0$, and $\varepsilon = .1$.
b. Observe that

\[ r(s, a) + \sum_{j=0}^{3} \lambda p(j|s, a)\, v^0(j) = r(s, a), \]

so that for each $s$ the maximum value occurs for $a = 0$. Thus, $A^*_{n,s} = \{0\}$ for $s = 0, 1, 2, 3$ and $d^n = (0, 0, 0, 0)$.
   (i) Set $k = 0$ and $u^0 = (0, 5, 6, 5)$.
   (ii) Since $\|u^0 - v^0\| = 6 > .0056$, continue.
   (iii) Compute $u^1$ by

\[ u^1(s) = r(s, d^0(s)) + \sum_{j=0}^{3} \lambda p(j|s, d^0(s))\, u^0(j) = r(s, 0) + \sum_{j=0}^{3} \lambda p(j|s, 0)\, u^0(j) \tag{40} \]

\[ = 0 + .9 \times 1 \times 0 = 0, \quad \text{for } s = 0, \]
\[ = 5 + .9 \times \tfrac{3}{4} \times 0 + .9 \times \tfrac{1}{4} \times 5 = 6.125, \quad \text{for } s = 1, \]
\[ = 6 + .9 \times \tfrac{1}{4} \times 0 + .9 \times \tfrac{1}{2} \times 5 + .9 \times \tfrac{1}{4} \times 6 = 9.60, \quad \text{for } s = 2, \]
\[ = 5 + .9 \times \tfrac{1}{4} \times 5 + .9 \times \tfrac{1}{2} \times 6 + .9 \times \tfrac{1}{4} \times 5 = 9.95, \quad \text{for } s = 3, \]

so that $u^1 = (0, 6.125, 9.60, 9.95)$.
   (iv) Since $k = 1 < 5$, continue.

The loop is repeated four more times to evaluate $u^2$, $u^3$, $u^4$, and $u^5$. Then $v^1$ is set equal to $u^5$ and the maximization in step b is carried out. The resulting iterates are shown below.

             v^n(s)                                  d^n(s)
  n    s=0       s=1       s=2       s=3        s=0  s=1  s=2  s=3

  0    0         0         0         0          0    0    0    0
  1    0         6.4507    11.4765   14.9200    3    2    0    0
  2    7.1215    9.1215    14.6323   17.1215    3    0    0    0
  3    11.5709   15.7593   19.4844   21.5709    3    0    0    0
  4    14.3639   18.5534   22.2763   24.3639    3    0    0    0
  5    15.8483   20.0377   23.7606   25.8483    3    0    0    0
 10    17.4604   21.6499   25.3727   27.4604    3    0    0    0
 11    17.4938   21.6833   25.4062   27.4938    3    0    0    0

In step (ii), following iteration 11, the computed value of $u^0$ is (17.4976, 21.6871, 25.4100, 27.4976), so that $\|u^0 - v^{11}\| = .0038$, which guarantees that the policy (3, 0, 0, 0) is $\varepsilon$-optimal with $\varepsilon = .1$.

4. Comparison of the Algorithms

Value iteration required 58 iterations to obtain a .1-optimal solution. At each iteration a maximization was required, so that each action had to be evaluated in Eq. (32) at each iteration. Modified policy iteration of order 5 required 11 iterations to determine a .1-optimal policy, so that the maximization step was carried out only 11 times. However, at each iteration the inner loop of the algorithm in step c was invoked five times. Thus, modified policy iteration required far fewer maximizations than did value iteration. This would lead to considerable computational savings when the action sets are large or the problem has many states.

Policy iteration found an optimal policy in three iterations; however, each iteration required both the evaluation of all actions in each state and the solution of a linear system of equations. In this small example, it is not time consuming to solve the linear system by using Gaussian elimination, but when the number of states is large, this can be prohibitive. In such cases, modified policy iteration is preferred over both policy iteration and value iteration; however, which order is the best to use is an open question.

E. Bounds on the Optimal Total Expected Discounted Reward

At each stage of the value iteration, policy iteration, and modified policy iteration, the computed value of $v^n$ can be used to obtain upper and lower bounds on the optimal expected discounted reward. These bounds can be used to terminate any of the iterative algorithms, eliminate suboptimal actions at each iteration of an algorithm, and develop improved algorithms.

Bounds are given for value iteration; however, they have also been obtained for policy iteration and modified policy iteration. First, define the following two quantities:

\[ L^n = \min_{s \in S} \{ v^{n+1}(s) - v^n(s) \} \]

and

\[ U^n = \max_{s \in S} \{ v^{n+1}(s) - v^n(s) \}. \]

Then, for each $n$ and $s \in S$,

\[ v^n(s) + \frac{1}{1 - \lambda} L^n \le v_\lambda^{d^n}(s) \le v_\lambda^*(s) \le v^n(s) + \frac{1}{1 - \lambda} U^n. \tag{41} \]

We can easily see, using the two extreme bounds, that if $U^n - L^n < \varepsilon(1 - \lambda)$, then

\[ 0 \le v_\lambda^*(s) - \Big[ v^n(s) + \frac{1}{1 - \lambda} L^n \Big] \le \varepsilon. \tag{42} \]

This provides an alternative stopping criterion for value iteration and can be modified for any of the above algorithms. In particular in value iteration, if the algorithm is terminated when $U^n - L^n < \varepsilon(1 - \lambda)$, then the quantity $v^n(s) + (1 - \lambda)^{-1} L^n$ is within $\varepsilon$ of the optimal value function. If this had been implemented in the value iteration algorithm in Section III.C, it would have terminated after 9, as opposed to 58, iterations, when $\varepsilon(1 - \lambda) = .1 \times (1 - .9) = .01$.

These bounds can be used at each iteration to eliminate actions that cannot be part of an optimal policy. This is important computationally because the maximization in the improvement step can be made more efficient if all of $A_s$ need not be evaluated at each iteration. Elimination is based on the result that action $a$ is suboptimal in state $s$, if

\[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v_\lambda^*(j) < v_\lambda^*(s). \tag{43} \]

Of course, $v_\lambda^*(s)$ is not known in Eq. (43), but by substituting an upper bound for it on the left-hand side and a lower bound on the right-hand side, one can use the result to eliminate suboptimal actions in state $s$. The bounds in Eq. (41) can be used.
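The bounds of Eq. (41) and the stopping rule $U^n - L^n < \varepsilon(1 - \lambda)$ can be sketched in Python as follows. This is a minimal sketch under assumptions, not the article's code: the function name, the array layout (`P[a, s, j]` for $p(j|s,a)$, `r[s, a]` for $r(s,a)$), the assumption that all actions are available in every state, and the two-state data `P_demo`, `r_demo` are hypothetical and unrelated to the inventory example.

```python
import numpy as np

def value_iteration_with_bounds(P, r, lam, eps=0.1, max_iter=100_000):
    """Value iteration stopped when U^n - L^n < eps * (1 - lam).

    Returns the maximizing decision rule, the lower and upper bounds of
    Eq. (41) on the optimal value function, and the number of sweeps used.
    """
    P = np.asarray(P, dtype=float)   # assumed shape (A, M, M)
    r = np.asarray(r, dtype=float)   # assumed shape (M, A)
    v = np.zeros(r.shape[0])         # v^0 = 0
    for n in range(max_iter):
        # q[s, a] = r(s, a) + lam * sum_j p(j|s, a) v^n(j); v^{n+1} = max_a q
        q = r + lam * np.einsum("asj,j->sa", P, v)
        v_next = q.max(axis=1)
        diff = v_next - v
        L, U = diff.min(), diff.max()
        lower = v + L / (1 - lam)    # v^n + L^n / (1 - lam)
        upper = v + U / (1 - lam)    # v^n + U^n / (1 - lam)
        if U - L < eps * (1 - lam):
            return q.argmax(axis=1), lower, upper, n + 1
        v = v_next
    raise RuntimeError("no convergence within max_iter sweeps")

# Hypothetical two-state, two-action data.
P_demo = np.array([[[0.5, 0.5], [0.2, 0.8]],
                   [[0.9, 0.1], [0.6, 0.4]]])
r_demo = np.array([[1.0, 0.5],
                   [0.0, 2.0]])
d_eps, lower, upper, sweeps = value_iteration_with_bounds(P_demo, r_demo, lam=0.9)
```

At termination the gap `upper - lower` equals $(U^n - L^n)/(1 - \lambda)$ and is therefore below `eps`, so `lower` can be reported as an approximation to the optimal value function in the sense of Eq. (42). The same two arrays could also drive an action elimination test based on Eq. (43).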

Action elimination procedures are especially important in approximation algorithms such as value iteration and modified policy iteration that produce only $\varepsilon$-optimal policies. If a unique optimal policy exists, by eliminating suboptimal actions at each iteration, one obtains an optimal policy when only one action remains in each state. This will occur in finitely many iterations.

F. Turnpike and Planning-Horizon Results

Infinite-horizon sequential decision models are usually approximations to long finite-horizon problems with many decision points. A question of practical concern is: when is the optimal policy for the infinite-horizon model optimal for the finite-horizon problem? An answer to this question is provided through planning-horizon, or turnpike, theory. The basic result is the following. There exists an $N^*$, called the planning horizon, such that, for all $n \ge N^*$, the optimal decision when there are $n$ periods remaining is one of the decisions that is optimal when the horizon is infinite. This result means that if there is a unique optimal stationary policy $d^*$ for the infinite-horizon problem, then in an $n$-period problem with $n \ge N^*$, it is optimal to use the stationary strategy for the initial $n - N^*$ decisions and to find the optimal policy for the remaining $N^*$ periods by using the backward induction methods of Section II. The optimal infinite-horizon strategy is called the turnpike, and it is reached after traveling $N^*$ periods on the nonstationary "side roads." The term turnpike originates in mathematical economics, where it refers to the policy path that produces optimal economic growth.

Another interpretation of the above result is that it is optimal to use $d^*$ for the first decision in any finite-horizon problem in which it is known that the horizon exceeds $N^*$. Thus, it is not necessary to specify the horizon, only to know that there are at least $N^*$ decisions to be made. Bounds on $N^*$ are available, and this concept has been extended to nonstationary infinite-horizon problems and the expected average reward criteria. This is referred to as a rolling horizon approach.

The computational results for the value iteration algorithm in Section III.D give further insight into the planning-horizon result. There, $N^* = 3$, so that in any problem with horizon greater than 3, it is optimal to use the decision rule (3, 0, 0, 0) until there are three decisions left to be made, at which point it is optimal to use the decision rule (2, 0, 0, 0) for two periods and (0, 0, 0, 0) in the last period.

G. The Average Expected Reward Criteria

In many applications in which infinite-horizon formulations are natural, the total discounted reward criterion is not relevant. For instance, in a large telecommunications network, millions of packet and call routing decisions are made every second, so that discounting the consequences of later decisions makes little sense. An alternative is the expected average reward criterion defined in Eq. (7). Using this criterion means that rewards received at each time period receive equal weight.

Computational methods for the average reward criterion are more complex than those for the discounted reward problem. This is because the form of the average reward function, $g^\pi(s)$, depends on the structure of the Markov chain corresponding to the stationary policy $\pi$. If the policy is unichain, that is, the Markov chain induced by the policy has exactly one recurrent class and possibly several transient classes, then $g^\pi$ is a constant vector. In this case, the functional equation is

\[ v(s) = \max_{a \in A_s} \Big\{ r(s, a) - g + \sum_{j \in S} p(j|s, a)\, v(j) \Big\}. \tag{44} \]

Solving this equation uniquely determines $g$ and determines $v(s)$ up to an additive constant. The quantity $g$ is the optimal expected average reward. For us to specify $v$ uniquely, it is sufficient to set $v(s_0) = 0$ for some $s_0$ in the recurrent class of a policy corresponding to choosing a maximizing action in each state. If this is done, $v(s)$ is called a relative value function and $v(j) - v(k)$ is the difference in expected total reward obtained by using an optimal policy and starting the system in state $j$ as opposed to state $k$.

As in the discounted case, an optimal policy is found by solving the functional equation. This is best done by policy iteration. The theory of value iteration is quite complex in this setting and is not discussed here. The policy iteration algorithm is given below. It is assumed that all policies are unichain.

a. Select a decision rule $d^0$ and set $n = 0$.
b. Solve the system of equations

\[ \sum_{j \in S} [\delta(j, s) - p(j|s, d^n(s))]\, v^n(j) + g^n = r(s, d^n(s)), \tag{45} \]

where $v^n(s_0) = 0$ for some $s_0$ in the recurrent class of $d^n$.
c. For each $s$, and each $a \in A_s$, compute

\[ r(s, a) + \sum_{j \in S} p(j|s, a)\, v^n(j). \tag{46} \]

For each $s$, put $a$ in $A^*_{n,s}$ if $a$ obtains the maximum value in Eq. (46).
d. If for each $s$, $d^n(s)$ is contained in $A^*_{n,s}$, stop. Otherwise, proceed.
e. Set $d^{n+1}(s)$ equal to any $a$ in $A^*_{n,s}$ for each $s$ in $S$, increment $n$ by 1, and return to step b.
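The average reward policy iteration steps a through e can be sketched in Python as follows. This is a hedged illustration rather than the article's implementation: the function names, the array layout (`P[a, s, j]` for $p(j|s,a)$, `r[s, a]` for $r(s,a)$), the choice $s_0 = 0$, and the two-state data are hypothetical. The demonstration data have strictly positive transition probabilities, so every policy is unichain and state 0 is recurrent under each of them. The evaluation step solves for $v^n$ and $g^n$ jointly from the relation $v(s) = r(s, d(s)) - g + \sum_j p(j|s, d(s))\, v(j)$ implied by Eq. (44), together with $v(s_0) = 0$.

```python
import numpy as np

def evaluate_average_reward(P_d, r_d, s0=0):
    """Solve (I - P_d) v + g * 1 = r_d together with v(s0) = 0 for (v, g)."""
    M = len(r_d)
    A = np.zeros((M + 1, M + 1))
    A[:M, :M] = np.eye(M) - P_d      # coefficients of v
    A[:M, M] = 1.0                   # coefficient of the scalar gain g
    A[M, s0] = 1.0                   # normalization row: v(s0) = 0
    x = np.linalg.solve(A, np.append(r_d, 0.0))
    return x[:M], x[M]

def average_reward_policy_iteration(P, r, s0=0, tol=1e-10, max_iter=1000):
    """Policy iteration for the average reward criterion (unichain assumed)."""
    P = np.asarray(P, dtype=float)   # assumed shape (A, M, M)
    r = np.asarray(r, dtype=float)   # assumed shape (M, A)
    states = np.arange(r.shape[0])
    d = np.zeros(r.shape[0], dtype=int)          # step a: arbitrary d^0
    for _ in range(max_iter):
        # step b: policy evaluation
        v, g = evaluate_average_reward(P[d, states, :], r[states, d], s0)
        # step c: q[s, a] = r(s, a) + sum_j p(j|s, a) v(j)
        q = r + np.einsum("asj,j->sa", P, v)
        keeps_max = q[states, d] >= q.max(axis=1) - tol
        if np.all(keeps_max):                    # step d: d^n already maximizing
            return d, v, g
        # step e: change the action only where the current one is not maximizing
        d = np.where(keeps_max, d, q.argmax(axis=1))
    raise RuntimeError("no convergence within max_iter iterations")

# Hypothetical two-state, two-action data with all transitions positive.
P_demo = np.array([[[0.5, 0.5], [0.2, 0.8]],
                   [[0.9, 0.1], [0.6, 0.4]]])
r_demo = np.array([[1.0, 0.5],
                   [0.0, 2.0]])
d_star, v_rel, g_star = average_reward_policy_iteration(P_demo, r_demo)
```

Keeping the current action whenever it already attains the maximum mirrors the test in step d and avoids cycling between equally good actions.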

Note that this algorithm is almost identical to that for the discounted case. The only difference is the linear system of equations solved in step b. If the assumption that all policies are unichain is dropped, solution of the functional equation [Eq. (44)] is no longer sufficient to determine an optimal policy. Instead, a nested pair of optimality equations is required.

IV. FURTHER TOPICS

A. Historical Perspective

The development of dynamic programming is usually credited to Richard Bellman. His numerous papers in the 1950s presented a formal development of this subject and numerous interesting examples. Most of this pioneering work is summarized in his book, "Dynamic Programming." However, many of the themes of dynamic programming and sequential decision processes are scattered throughout earlier works. These include studies that appeared between 1946 and 1953 on water resource management by Masse; sequential analysis in statistics by Wald; games of pursuit by Wald; inventory theory by Arrow, Blackwell, and Girshick; Arrow, Harris, and Marschak; and Dvoretzky, Kiefer, and Wolfowitz; and stochastic games by Shapley.

Although Bellman coined the phrase "Markov decision processes," this aspect of dynamic programming got off the ground with Howard's monograph, "Dynamic Programming and Markov Processes" in 1960. The first formal theoretical treatment of this subject was by Blackwell in 1962. In 1960, de Ghellinck demonstrated the equivalence between Markov decision processes and linear programming. Other major contributions are those of Denardo in 1968, in which he showed that value iteration can be analyzed by the theory of contraction mappings; Veinott in 1969, in which he introduced a new family of optimality criteria for dynamic programming problems; and Federgruen and Schweitzer between 1978 and 1980, in which they investigated the properties of the sequences of policies obtained from the value iteration algorithm. Modified policy iteration is usually attributed to Puterman and Shin in 1978; however, similar ideas appeared earlier in works of Kushner and Kleinman and of van Nunen. In 1978, Puterman and Brumelle demonstrated the equivalence of policy iteration to Newton's method. Puterman's book "Markov Decision Processes" provides a comprehensive overview of theory, application, and calculations.

B. Applications

Dynamic programming methods have been applied in many areas. These methods have been used numerically to compute optimal policies, as well as analytically to determine the form of an optimal policy under various assumptions on the rewards and transition probabilities. A brief and by no means complete summary of applications appears in Table I. Only stochastic dynamic programming is considered; however, in many cases, the problems have also been analyzed in the deterministic setting. In these applications, probability distributions of the random quantities are assumed to be known before solution of the problem; adaptive estimation of parameters is not necessary.

A major limitation in the practical application of dynamic programming has been computational. When the set of states at each stage is large (for example, if the state description is vector-valued), solving a sequential decision problem by dynamic programming requires considerable storage as well as computational time. Bellman recognized the difficulty early on and referred to it as the "curse of dimensionality." Research in the 1990s addressed this issue by developing approximation methods for large-scale applications. These methods combined concepts from stochastic approximation, simulation, and artificial intelligence and are sometimes referred to as reinforcement learning. A comprehensive treatment of this line of research appears in "Neuro-Dynamic Programming" by Bertsekas and Tsitsiklis.

C. Extensions

In the models considered in this article, it has been assumed that the decision maker knows the state of the system before making a decision, that decisions are made at discrete time points, that the set of states is finite and discrete (with the exception of the example in Section II.D.2), and the model rewards and transition probabilities are known. These models can be modified in several ways: the state of the system may be only partially observed by the decision maker, the sets of decision points and states may be continuous, or the transition probabilities or rewards may not be known. These modifications are discussed briefly below.

1. Partially Observable Models

This model differs from the fully observable model in that the state of the system is not known to the decision maker at the time of decision. Instead, the decision maker receives a signal from the system and on the basis of this signal updates his estimate of the probability distribution of the system state. Updating is done using Bayes' theorem. Decisions are based on this probability distribution, which is a sufficient statistic for the history of the process. When the set of states is discrete, these models are referred to as partially observable Markov decision processes. Computational methods in this case are quite

TABLE I Stochastic Dynamic Programming Applications

Area | States | Actions | Reward | Stochastic aspect
Capacity expansion | Size of plant | Maintain or add capacity | Costs of expansion and production at current capacity | Demand for product
Cash management | Cash available | Borrow or invest | Transaction costs less interest | External demand for cash
Catalog mailing | Customer purchase record | Type of catalog to send to customer, if any | Purchases in current period less mailing costs | Customer purchase amount
Clinical trials | Number of successes with each treatment | Stop or continue the trial, and if stopped, choose best treatment if any | Costs of treatment and incorrect decisions | Response of a subject to a treatment
Economic growth | State of the economy | Investment or consumption | Utility of consumption | Effect of investment
Fisheries management | Fish stock in each age class | Number of fish to harvest | Value of the catch | Population size
Football | Position of ball | Play to choose | Expected points scored | Outcome of play
Forest management | Size and condition of stand | Harvesting and reforestation activities | Revenues less harvesting costs | Stand growth and price fluctuations
Gambling | Current wealth | Stop or continue playing the game | Cost of playing | Outcome of the game
Hotel and airline reservations | Number of confirmed reservations | Accept, wait-list, or reject new reservations | Profit from satisfied reservations less overbooking penalties | Demand for reservations and number of arrivals
Inventory control | Stock on hand | Order additional stock | Revenue per item sold less ordering, holding, and penalty costs | Demand for items
Project selection | Status of each project | Project to invest in at present | Return from investing in project | Change of project status
Queuing control | Number in the queue | Accept or reject arriving customers or control service rate | Revenue from serving customer less delay costs | Interarrival times and service times
Reliability | Age or status of equipment | Inspect and repair or replace if necessary | Inspection and repair costs plus failure cost | Failure and deterioration
Scheduling | Activities completed | Next activity to schedule | Cost of activity | Length of time to complete activity
Selling an asset | Current offer | Accept or reject the offer | The offer less the cost of holding the asset for one period | Size of the offer
Water resource management | Level of water in each reservoir in river system | Quantity of water to release | Value of power generated | Rainfall and runoff

complex, and only very small problems have been solved numerically. When the states form a continuum, this problem falls within the realm of control theory. An extremely important result in this area is the Kalman filter, which provides an updating formula for the expected value and covariance matrix of the system state. Another important result is the separation theorem, which gives conditions that allow decomposition of this problem into separate problems of estimation and control.

2. Continuous-State, Discrete-Time Models

The resource allocation problem in Section I is an example of a continuous-state, discrete-time, deterministic model. Its solution using dynamic programming methodology is given in Section II. When transitions are stochastic, only minor modifications to the general sequential decision problem are necessary. Instead of a transition probability function, a transition probability density is used, and summations are replaced by integrations throughout. This modification causes considerable theoretical complexity; the main issues concern measurability and integrability of value functions.
Problems of this type fall into the realm of stochastic control theory. Although dynamic programming is used to solve such problems, the formulation is quite different. Instead of explicitly giving a transition probability function for the state, the theory requires use of a dynamic equation
to relate the state at time $t + 1$ to the state at time $t$. A major result in this area is that when the state dynamics are linear in the state, action, and random component and the cost is quadratic in the state and the action, then a closed-form solution is available for the optimal decision rule and it is linear in the system state. These problems have been studied extensively in the engineering literature.

3. Continuous-Time Models

Stochastic continuous-time models are categorized according to whether the state space is continuous or discrete. The discrete-state model has been widely studied in the operations research literature. The stochastic nature of the problem is modeled as either a Markov process, a semi-Markov process, or a general jump process. The decision maker can control the transition rates, transition probabilities, or both. The infinite-horizon versions of the Markov and semi-Markov decision models are analyzed in a similar fashion to the discrete-time Markov decision process; however, general jump processes are considerably more complex. These models have been widely applied to problems in queuing and inventory control.
When the state space is continuous and Markovian assumptions are made, diffusion processes are used to model the transitions. The decision maker can control the drift of the system or can cause instantaneous state transitions or jumps. The discrete-time optimality equation is replaced by a nonlinear second-order partial differential equation and is usually solved numerically. These models are studied in the stochastic control theory literature and have been applied to inventory control, finance, and statistical modeling.

4. Adaptive Control

When transition probabilities and/or rewards are unknown, the decision maker must adaptively estimate them to control the system optimally. The usual approach to analysis of such systems is to assume that the rewards and transition probabilities depend on an unknown parameter, such as the arrival rate to a queuing system, and then use the observed sequence of system states to adaptively estimate this parameter. In a Bayesian analysis of such models, uncertainty about the parameter value is described through a probability distribution which is periodically updated as information becomes available. The classical approach to such models uses maximum likelihood theory to estimate the parameter and derive its statistical properties.

ACKNOWLEDGMENT

Preparation of this article was supported by Natural Sciences and Engineering Research Council Grant A-5527.

SEE ALSO THE FOLLOWING ARTICLES

LINEAR OPTIMIZATION • NONLINEAR PROGRAMMING • OPERATIONS RESEARCH

BIBLIOGRAPHY

Bellman, R. E. (1957). “Dynamic Programming,” Princeton University Press, Princeton, N.J.
Bertsekas, D. P. (1995). “Dynamic Programming and Optimal Control,” Vols. 1 and 2, Athena Scientific, Belmont, Mass.
Bertsekas, D. P., and Tsitsiklis, J. M. (1995). “Neuro-Dynamic Programming,” Athena Scientific, Belmont, Mass.
Blackwell, D. (1962). Ann. Math. Stat. 35, 719–726.
Denardo, E. V. (1967). SIAM Rev. 9, 169–177.
Denardo, E. V. (1982). “Dynamic Programming, Models and Applications,” Prentice-Hall, Englewood Cliffs, N.J.
Fleming, W. H., and Rishel, R. W. (1975). “Deterministic and Stochastic Optimal Control,” Springer-Verlag, New York.
Howard, R. A. (1960). “Dynamic Programming and Markov Processes,” MIT Press, Cambridge, Mass.
Puterman, M. L. (1994). “Markov Decision Processes,” Wiley, New York.
Ross, S. M. (1983). “Introduction to Stochastic Dynamic Programming,” Academic Press, New York.
Scarf, H. E. (1960). In “Studies in the Mathematical Theory of Inventory and Production,” (K. J. Arrow, S. Karlin, and P. Suppes, eds.), Stanford University Press, Stanford, Calif.
Veinott, A. F., Jr. (1969). Ann. Math. Stat. 40, 1635–1660.
Wald, A. (1947). “Sequential Analysis,” Wiley, New York.
White, D. J. (1985). Interfaces 15, 73–83.
White, D. J. (1988). Interfaces 18, 55–61.
P1: GHA/LOW P2: FQP Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN006C-258 June 28, 2001 19:56

Fourier Series
James S. Walker
University of Wisconsin–Eau Claire

I. Historical Background
II. Definition of Fourier Series
III. Convergence of Fourier Series
IV. Convergence in Norm
V. Summability of Fourier Series
VI. Generalized Fourier Series
VII. Discrete Fourier Series
VIII. Conclusion

GLOSSARY

Bounded variation A function $f$ has bounded variation on a closed interval $[a, b]$ if there exists a positive constant $B$ such that, for all finite sets of points $a = x_0 < x_1 < \cdots < x_N = b$, the inequality $\sum_{i=1}^{N} |f(x_i) - f(x_{i-1})| \le B$ is satisfied. Jordan proved that a function has bounded variation if and only if it can be expressed as the difference of two nondecreasing functions.

Countably infinite set A set is countably infinite if it can be put into one-to-one correspondence with the set of natural numbers $(1, 2, \ldots, n, \ldots)$. Examples: The integers and the rational numbers are countably infinite sets.

Continuous function If $\lim_{x \to c} f(x) = f(c)$, then the function $f$ is continuous at the point $c$. Such a point is called a continuity point for $f$. A function which is continuous at all points is simply referred to as continuous.

Lebesgue measure zero A set $S$ of real numbers is said to have Lebesgue measure zero if, for each $\epsilon > 0$, there exists a collection $\{(a_i, b_i)\}_{i=1}^{\infty}$ of open intervals such that $S \subset \bigcup_{i=1}^{\infty} (a_i, b_i)$ and $\sum_{i=1}^{\infty} (b_i - a_i) \le \epsilon$. Examples: All finite sets, and all countably infinite sets, have Lebesgue measure zero.

Odd and even functions A function $f$ is odd if $f(-x) = -f(x)$ for all $x$ in its domain. A function $f$ is even if $f(-x) = f(x)$ for all $x$ in its domain.

One-sided limits $f(x-)$ and $f(x+)$ denote limits of $f(t)$ as $t$ tends to $x$ from the left and right, respectively.

Periodic function A function $f$ is periodic, with period $P > 0$, if the identity $f(x + P) = f(x)$ holds for all $x$. Example: $f(x) = |\sin x|$ is periodic with period $\pi$.

FOURIER SERIES has long provided one of the principal methods of analysis for mathematical physics, engineering, and signal processing. It has spurred generalizations and applications that continue to develop right up to the present. While the original theory of Fourier series applies to periodic functions occurring in wave motion, such as with light and sound, its generalizations often

relate to wider settings, such as the time-frequency analysis underlying the recent theories of wavelet analysis and local trigonometric analysis.

I. HISTORICAL BACKGROUND

There are antecedents to the notion of Fourier series in the work of Euler and D. Bernoulli on vibrating strings, but the theory of Fourier series truly began with the profound work of Fourier on heat conduction at the beginning of the 19th century. Fourier deals with the problem of describing the evolution of the temperature $T(x, t)$ of a thin wire of length $\pi$, stretched between $x = 0$ and $x = \pi$, with a constant zero temperature at the ends: $T(0, t) = 0$ and $T(\pi, t) = 0$. He proposed that the initial temperature $T(x, 0) = f(x)$ could be expanded in a series of sine functions:

$$f(x) = \sum_{n=1}^{\infty} b_n \sin nx \tag{1}$$

with

$$b_n = \frac{2}{\pi} \int_{0}^{\pi} f(x) \sin nx \, dx. \tag{2}$$

A. Fourier Series

Although Fourier did not give a convincing proof of convergence of the infinite series in Eq. (1), he did offer the conjecture that convergence holds for an “arbitrary” function $f$. Subsequent work by Dirichlet, Riemann, Lebesgue, and others, throughout the next two hundred years, was needed to delineate precisely which functions were expandable in such trigonometric series. Part of this work entailed giving a precise definition of function (Dirichlet), and showing that the integrals in Eq. (2) are properly defined (Riemann and Lebesgue). Throughout this article we shall state results that are always true when Riemann integrals are used (except for Section IV where we need to use results from the theory of Lebesgue integrals).
In addition to positing Eqs. (1) and (2), Fourier argued that the temperature $T(x, t)$ is a solution to the following heat equation with boundary conditions:

$$\frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2}, \quad 0 < x < \pi, \ t > 0$$
$$T(0, t) = T(\pi, t) = 0, \quad t \ge 0$$
$$T(x, 0) = f(x), \quad 0 \le x \le \pi.$$

Making use of Eq. (1), Fourier showed that the solution $T(x, t)$ satisfies

$$T(x, t) = \sum_{n=1}^{\infty} b_n e^{-n^2 t} \sin nx. \tag{3}$$

This was the first example of the use of Fourier series to solve boundary value problems in partial differential equations. To obtain Eq. (3), Fourier made use of D. Bernoulli’s method of separation of variables, which is now a standard technique for solving boundary value problems.
A good, short introduction to the history of Fourier series can be found in The Mathematical Experience. Besides his many mathematical contributions, Fourier has left us with one of the truly great philosophical principles: “The deep study of nature is the most fruitful source of knowledge.”

II. DEFINITION OF FOURIER SERIES

The Fourier sine series, defined in Eqs. (1) and (2), is a special case of a more general concept: the Fourier series for a periodic function. Periodic functions arise in the study of wave motion, when a basic waveform repeats itself periodically. Such periodic waveforms occur in musical tones, in the plane waves of electromagnetic vibrations, and in the vibration of strings. These are just a few examples. Periodic effects also arise in the motion of the planets, in AC electricity, and (to a degree) in animal heartbeats.
A function $f$ is said to have period $P$ if $f(x + P) = f(x)$ for all $x$. For notational simplicity, we shall restrict our discussion to functions of period $2\pi$. There is no loss of generality in doing so, since we can always use a simple change of scale $x = (P/2\pi)t$ to convert a function of period $P$ into one of period $2\pi$.
If the function $f$ has period $2\pi$, then its Fourier series is

$$c_0 + \sum_{n=1}^{\infty} \{a_n \cos nx + b_n \sin nx\} \tag{4}$$

with Fourier coefficients $c_0$, $a_n$, and $b_n$ defined by the integrals

$$c_0 = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \, dx \tag{5}$$

$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos nx \, dx, \tag{6}$$

$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin nx \, dx. \tag{7}$$

[Note: The sine series defined by Eqs. (1) and (2) is a special instance of Fourier series. If $f$ is initially defined over the interval $[0, \pi]$, then it can be extended to $[-\pi, \pi]$ (as an odd function) by letting $f(-x) = -f(x)$, and then extended periodically with period $P = 2\pi$. The Fourier series for this odd, periodic function reduces to the sine series in Eqs. (1) and (2), because $c_0 = 0$, each $a_n = 0$, and each $b_n$ in Eq. (7) is equal to the $b_n$ in Eq. (2).]
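As a numerical illustration of the Note on the sine series, the following Python sketch evaluates Eqs. (5)–(7) for the odd extension of a sample function and compares the result with the sine coefficient of Eq. (2). The sample function f(x) = x(π − x) on [0, π], the midpoint-rule integrator, and all function names are assumptions made for this example; they do not come from the article.

```python
import math


def integrate(g, a, b, steps=20000):
    """Composite midpoint rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))


def f_odd(x):
    """Odd extension to [-pi, pi] of the sample function x(pi - x) on [0, pi]."""
    return x * (math.pi - x) if x >= 0 else -f_odd(-x)


def real_coefficients(f, n):
    """Return (a_n, b_n) as defined in Eqs. (6) and (7)."""
    a_n = integrate(lambda x: f(x) * math.cos(n * x), -math.pi, math.pi) / math.pi
    b_n = integrate(lambda x: f(x) * math.sin(n * x), -math.pi, math.pi) / math.pi
    return a_n, b_n


def sine_coefficient(f, n):
    """Return b_n as defined in Eq. (2), using only the values of f on [0, pi]."""
    return 2.0 / math.pi * integrate(lambda x: f(x) * math.sin(n * x), 0.0, math.pi)


# Eq. (5): the mean value of an odd function over a symmetric interval.
c0 = integrate(f_odd, -math.pi, math.pi) / (2 * math.pi)
print("c0 =", c0)
for n in range(1, 5):
    a_n, b_n = real_coefficients(f_odd, n)
    print(n, a_n, b_n, sine_coefficient(f_odd, n))
```

For this sample function, c0 and each an should vanish to within rounding error, and the bn computed from Eq. (7) should agree with the bn computed from Eq. (2), as the Note asserts.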
It is more common nowadays to express Fourier series in an algebraically simpler form involving complex exponentials. Following Euler, we use the fact that the complex exponential $e^{i\theta}$ satisfies $e^{i\theta} = \cos\theta + i\sin\theta$. Hence

$$\cos\theta = \frac{1}{2}\left(e^{i\theta} + e^{-i\theta}\right),$$

$$\sin\theta = \frac{1}{2i}\left(e^{i\theta} - e^{-i\theta}\right).$$

From these equations, it follows by elementary algebra that Formulas (5)–(7) can be rewritten (by rewriting each term separately) as

$$c_0 + \sum_{n=1}^{\infty} \left\{ c_n e^{inx} + c_{-n} e^{-inx} \right\} \tag{8}$$

with $c_n$ defined for all integers $n$ by

$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) e^{-inx} \, dx. \tag{9}$$

The series in Eq. (8) is usually written in the form

$$\sum_{n=-\infty}^{\infty} c_n e^{inx}. \tag{10}$$

We now consider a couple of examples. First, let $f_1$ be defined over $[-\pi, \pi]$ by

$$f_1(x) = \begin{cases} 1 & \text{if } |x| < \pi/2 \\ 0 & \text{if } \pi/2 \le |x| \le \pi \end{cases}$$

and have period $2\pi$. The graph of $f_1$ is shown in Fig. 1; it is called a square wave in electric circuit theory. The constant $c_0$ is

$$c_0 = \frac{1}{2\pi} \int_{-\pi}^{\pi} f_1(x) \, dx = \frac{1}{2\pi} \int_{-\pi/2}^{\pi/2} 1 \, dx = \frac{1}{2}.$$

While, for $n \ne 0$,

$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f_1(x) e^{-inx} \, dx = \frac{1}{2\pi} \int_{-\pi/2}^{\pi/2} e^{-inx} \, dx = \frac{1}{2\pi} \, \frac{e^{-in\pi/2} - e^{in\pi/2}}{-in} = \frac{\sin(n\pi/2)}{n\pi}.$$

FIGURE 1 Square wave.

Thus, the Fourier series for this square wave is

$$\frac{1}{2} + \sum_{n=1}^{\infty} \frac{\sin(n\pi/2)}{n\pi} \left(e^{inx} + e^{-inx}\right) = \frac{1}{2} + \sum_{n=1}^{\infty} \frac{2\sin(n\pi/2)}{n\pi} \cos nx. \tag{11}$$

Second, let $f_2(x) = x^2$ over $[-\pi, \pi]$ and have period $2\pi$, see Fig. 2. We shall refer to this wave as a parabolic wave. This parabolic wave has $c_0 = \pi^2/3$ and $c_n$, for $n \ne 0$, is

$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} x^2 e^{-inx} \, dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} x^2 \cos nx \, dx - \frac{i}{2\pi} \int_{-\pi}^{\pi} x^2 \sin nx \, dx = \frac{2(-1)^n}{n^2}$$

after an integration by parts. The Fourier series for this function is then

$$\frac{\pi^2}{3} + \sum_{n=1}^{\infty} \frac{2(-1)^n}{n^2} \left(e^{inx} + e^{-inx}\right) = \frac{\pi^2}{3} + \sum_{n=1}^{\infty} \frac{4(-1)^n}{n^2} \cos nx. \tag{12}$$

We will discuss the convergence of these Fourier series, to $f_1$ and $f_2$, respectively, in Section III.
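The closed forms for the coefficients of the square wave and the parabolic wave can be checked against a direct numerical evaluation of Eq. (9). The Python sketch below is one way to do this; the midpoint-rule quadrature, the step count, and the function names are assumptions chosen for illustration only.

```python
import cmath
import math


def fourier_coefficient(f, n, steps=20000):
    """Approximate c_n of Eq. (9) with the composite midpoint rule on [-pi, pi]."""
    h = 2 * math.pi / steps
    total = 0j
    for k in range(steps):
        x = -math.pi + (k + 0.5) * h
        total += f(x) * cmath.exp(-1j * n * x)
    return total * h / (2 * math.pi)


def square_wave(x):
    """The square wave f1 restricted to one period [-pi, pi]."""
    return 1.0 if abs(x) < math.pi / 2 else 0.0


def parabolic_wave(x):
    """The parabolic wave f2 restricted to one period [-pi, pi]."""
    return x * x


for n in range(1, 6):
    numeric_square = fourier_coefficient(square_wave, n)
    exact_square = math.sin(n * math.pi / 2) / (n * math.pi)
    numeric_parabolic = fourier_coefficient(parabolic_wave, n)
    exact_parabolic = 2 * (-1) ** n / n ** 2
    print(n, numeric_square.real, exact_square, numeric_parabolic.real, exact_parabolic)
```

Because both waves are even functions, the imaginary parts of the computed coefficients should be negligible, which matches the cosine-only series in Eqs. (11) and (12).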
FIGURE 2 Parabolic wave.

Returning to the general Fourier series in Eq. (10), we shall now discuss some ways of interpreting this series. A complex exponential $e^{inx} = \cos nx + i\sin nx$ has a smallest period of $2\pi/n$. Consequently it is said to have a frequency of $n/2\pi$, because the form of its graph over the interval $[0, 2\pi/n]$ is repeated $n/2\pi$ times within each unit-length. Therefore, the integral in Eq. (9) that defines the Fourier coefficient $c_n$ can be interpreted as a correlation between $f$ and a complex exponential with a precisely located frequency of $n/2\pi$. Thus the whole collection of these integrals, for all integers $n$, specifies the frequency content of $f$ over the set of frequencies $\{n/2\pi\}_{n=-\infty}^{\infty}$. If the series in Eq. (10) converges to $f$, i.e., if we can write

$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}, \tag{13}$$

then $f$ is being expressed as a superposition of elementary functions $c_n e^{inx}$ having frequency $n/2\pi$ and amplitude $c_n$. (The validity of Eq. (13) will be discussed in the next section.) Furthermore, the correlations in Eq. (9) are independent of each other in the sense that correlations between distinct exponentials are zero:

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} e^{inx} e^{-imx} \, dx = \begin{cases} 0 & \text{if } m \ne n \\ 1 & \text{if } m = n. \end{cases} \tag{14}$$

This equation is called the orthogonality property of complex exponentials.
The orthogonality property of complex exponentials can be used to give a derivation of Eq. (9). Multiplying Eq. (13) by $e^{-imx}$ and integrating term-by-term from $-\pi$ to $\pi$, we obtain

$$\int_{-\pi}^{\pi} f(x) e^{-imx} \, dx = \sum_{n=-\infty}^{\infty} c_n \int_{-\pi}^{\pi} e^{inx} e^{-imx} \, dx.$$

By the orthogonality property, this leads to

$$\int_{-\pi}^{\pi} f(x) e^{-imx} \, dx = 2\pi c_m,$$

which justifies (in a formal, nonrigorous way) the definition of $c_n$ in Eq. (9).
We close this section by discussing two important properties of Fourier coefficients, Bessel’s inequality and the Riemann-Lebesgue lemma.

Theorem 1 (Bessel’s Inequality): If $\int_{-\pi}^{\pi} |f(x)|^2 \, dx$ is finite, then

$$\sum_{n=-\infty}^{\infty} |c_n|^2 \le \frac{1}{2\pi} \int_{-\pi}^{\pi} |f(x)|^2 \, dx. \tag{15}$$

Bessel’s inequality can be proved easily. In fact, we have

$$0 \le \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| f(x) - \sum_{n=-N}^{N} c_n e^{inx} \right|^2 dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( f(x) - \sum_{m=-N}^{N} c_m e^{imx} \right) \left( \overline{f(x)} - \sum_{n=-N}^{N} \overline{c_n} e^{-inx} \right) dx.$$

Multiplying out the last integrand above, and making use of Eqs. (9) and (14), we obtain

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} \left| f(x) - \sum_{n=-N}^{N} c_n e^{inx} \right|^2 dx = \frac{1}{2\pi} \int_{-\pi}^{\pi} |f(x)|^2 \, dx - \sum_{n=-N}^{N} |c_n|^2. \tag{16}$$

Thus, for all $N$,

$$\sum_{n=-N}^{N} |c_n|^2 \le \frac{1}{2\pi} \int_{-\pi}^{\pi} |f(x)|^2 \, dx \tag{17}$$

and Bessel’s inequality (15) follows by letting $N \to \infty$.
Bessel’s inequality has a physical interpretation. If $f$ has finite energy, in the sense that the right side of Eq. (15) is finite, then the sum of the moduli-squared of the Fourier coefficients is also finite. In Section IV, we shall see that the inequality in Eq. (15) is actually an equality, which says that the sum of the moduli-squared of the Fourier coefficients is precisely the same as the energy of $f$.
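The orthogonality property in Eq. (14) and the identity in Eq. (16) behind Bessel’s inequality can both be checked numerically. The Python sketch below does so for the square wave, using its closed-form coefficients; the quadrature rule, the step count, the choice N = 5, and all names are assumptions made for this illustration.

```python
import cmath
import math

STEPS = 4000  # a multiple of 4, so the jumps at +-pi/2 fall on cell boundaries


def mean_over_period(g):
    """(1/2pi) times the integral of g over [-pi, pi], by the midpoint rule."""
    h = 2 * math.pi / STEPS
    total = sum(g(-math.pi + (k + 0.5) * h) for k in range(STEPS))
    return total * h / (2 * math.pi)


def square_wave(x):
    return 1.0 if abs(x) < math.pi / 2 else 0.0


def c(n):
    """Closed-form Fourier coefficients of the square wave."""
    return 0.5 if n == 0 else math.sin(n * math.pi / 2) / (n * math.pi)


def partial_sum(x, N):
    """S_N(x): the sum of c_n e^{inx} for n from -N to N."""
    return sum(c(n) * cmath.exp(1j * n * x) for n in range(-N, N + 1))


# Eq. (14): correlations between complex exponentials.
same = mean_over_period(lambda x: cmath.exp(3j * x) * cmath.exp(-3j * x))
different = mean_over_period(lambda x: cmath.exp(3j * x) * cmath.exp(-5j * x))

# Eq. (16): mean-square error of S_N equals the energy minus the sum of |c_n|^2.
N = 5
lhs = mean_over_period(lambda x: abs(square_wave(x) - partial_sum(x, N)) ** 2)
energy = mean_over_period(lambda x: square_wave(x) ** 2)
rhs = energy - sum(abs(c(n)) ** 2 for n in range(-N, N + 1))
print(abs(same), abs(different), lhs, rhs)
```

Since the left side of Eq. (16) is a mean-square error and cannot be negative, the agreement of `lhs` and `rhs` makes the inequality in Eq. (17) visible for this particular N.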
Because of Bessel’s inequality, it follows that

$$\lim_{|n| \to \infty} c_n = 0 \tag{18}$$

holds whenever $\int_{-\pi}^{\pi} |f(x)|^2 \, dx$ is finite. The Riemann-Lebesgue lemma says that Eq. (18) holds in the following more general case:

Theorem 2 (Riemann-Lebesgue Lemma): If $\int_{-\pi}^{\pi} |f(x)| \, dx$ is finite, then Eq. (18) holds.

One of the most important uses of the Riemann-Lebesgue lemma is in proofs of some basic pointwise convergence theorems, such as the ones described in the next section.
See Krantz and Walker (1998) for further discussions of the definition of Fourier series, Bessel’s inequality, and the Riemann-Lebesgue lemma.

III. CONVERGENCE OF FOURIER SERIES

There are many ways to interpret the meaning of Eq. (13). Investigations into the types of functions allowed on the left side of Eq. (13), and the kinds of convergence considered for its right side, have fueled mathematical investigations by such luminaries as Dirichlet, Riemann, Weierstrass, Lipschitz, Lebesgue, Fejér, Gelfand, and Schwartz. In short, convergence questions for Fourier series have helped lay the foundations and much of the superstructure of mathematical analysis.
The three types of convergence that we shall describe here are pointwise, uniform, and norm convergence. We shall discuss the first two types in this section and take up the third type in the next section.
All convergence theorems are concerned with how the partial sums

$$S_N(x) := \sum_{n=-N}^{N} c_n e^{inx}$$

converge to $f(x)$. That is, does $\lim_{N \to \infty} S_N = f$ hold in some sense?
The question of pointwise convergence, for example, concerns whether $\lim_{N \to \infty} S_N(x_0) = f(x_0)$ for each fixed $x$-value $x_0$. If $\lim_{N \to \infty} S_N(x_0)$ does equal $f(x_0)$, then we say that the Fourier series for $f$ converges to $f(x_0)$ at $x_0$.
We shall now state the simplest pointwise convergence theorem for which an elementary proof can be given. This theorem assumes that a function is Lipschitz at each point where convergence occurs. A function is said to be Lipschitz at a point $x_0$ if, for some positive constant $A$,

$$|f(x) - f(x_0)| \le A |x - x_0| \tag{19}$$

holds for all $x$ near $x_0$ (i.e., $|x - x_0| < \delta$ for some $\delta > 0$). It is easy to see, for instance, that the square wave function $f_1$ is Lipschitz at all of its continuity points.
The inequality in Eq. (19) has a simple geometric interpretation. Since both sides are 0 when $x = x_0$, this inequality is equivalent to

$$\left| \frac{f(x) - f(x_0)}{x - x_0} \right| \le A \tag{20}$$

for all $x$ near $x_0$ (and $x \ne x_0$). Inequality (20) simply says that the difference quotients of $f$ (i.e., the slopes of its secants) near $x_0$ are bounded. With this interpretation, it is easy to see that the parabolic wave $f_2$ is Lipschitz at all points. More generally, if $f$ has a derivative at $x_0$ (or even just left- and right-hand derivatives), then $f$ is Lipschitz at $x_0$.
We can now state and prove a simple convergence theorem.

Theorem 3: Suppose $f$ has period $2\pi$, that $\int_{-\pi}^{\pi} |f(x)| \, dx$ is finite, and that $f$ is Lipschitz at $x_0$. Then the Fourier series for $f$ converges to $f(x_0)$ at $x_0$.

To prove this theorem, we assume that $f(x_0) = 0$. There is no loss of generality in doing so, since we can always subtract the constant $f(x_0)$ from $f(x)$. Define the function $g$ by $g(x) = f(x)/(e^{ix} - e^{ix_0})$. This function $g$ has period $2\pi$. Furthermore, $\int_{-\pi}^{\pi} |g(x)| \, dx$ is finite, because the quotient $f(x)/(e^{ix} - e^{ix_0})$ is bounded in magnitude for $x$ near $x_0$. In fact, for such $x$,

$$\left| \frac{f(x)}{e^{ix} - e^{ix_0}} \right| = \left| \frac{f(x) - f(x_0)}{e^{ix} - e^{ix_0}} \right| \le A \left| \frac{x - x_0}{e^{ix} - e^{ix_0}} \right|$$

and $(x - x_0)/(e^{ix} - e^{ix_0})$ is bounded in magnitude, because it tends to the reciprocal of the derivative of $e^{ix}$ at $x_0$.
If we let $d_n$ denote the $n$th Fourier coefficient for $g(x)$, then we have $c_n = d_{n-1} - d_n e^{ix_0}$ because $f(x) = g(x)(e^{ix} - e^{ix_0})$. The partial sum $S_N(x_0)$ then telescopes:

$$S_N(x_0) = \sum_{n=-N}^{N} c_n e^{inx_0} = d_{-N-1} e^{-iNx_0} - d_N e^{i(N+1)x_0}.$$

Since $d_n \to 0$ as $|n| \to \infty$, by the Riemann-Lebesgue lemma, we conclude that $S_N(x_0) \to 0$. This completes the proof.
It should be noted that for the square wave $f_1$ and the parabolic wave $f_2$, it is not necessary to use the general Riemann-Lebesgue lemma stated above. That is because for those functions it is easy to see that $\int_{-\pi}^{\pi} |g(x)|^2 \, dx$ is
finite for the function $g$ defined in the proof of Theorem 3. Consequently, $d_n \to 0$ as $|n| \to \infty$ follows from Bessel’s inequality for $g$.
In any case, Theorem 3 implies that the Fourier series for the square wave $f_1$ converges to $f_1$ at all of its points of continuity. It also implies that the Fourier series for the parabolic wave $f_2$ converges to $f_2$ at all points. While this may settle matters (more or less) in a pure mathematical sense for these two waves, it is still important to examine specific partial sums in order to learn more about the nature of their convergence to these waves.
For example, in Fig. 3 we show a graph of the partial sum $S_{100}$ superimposed on the square wave. Although Theorem 3 guarantees that $S_N \to f_1$ as $N \to \infty$ at each continuity point, Fig. 3 indicates that this convergence is at a rather slow rate. The partial sum $S_{100}$ differs significantly from $f_1$. Near the square wave’s jump discontinuities, for example, there is a severe spiking behavior called Gibbs’ phenomenon (see Fig. 4). This spiking behavior does not go away as $N \to \infty$, although the width of the spike does tend to zero. In fact, the peaks of the spikes overshoot the square wave’s value of 1, tending to a limit of about 1.09. The partial sum also oscillates quite noticeably about the constant value of the square wave at points away from the discontinuities. This is known as ringing.

FIGURE 3 Fourier series partial sum $S_{100}$ superimposed on square wave.

FIGURE 4 Gibbs’ phenomenon and ringing for square wave.

These defects do have practical implications. For instance, oscilloscopes, which generate wave forms as combinations of sinusoidal waves over a limited range of frequencies, cannot use $S_{100}$, or any partial sum $S_N$, to produce a square wave. We shall see, however, in Section V that a clever modification of a partial sum does produce an acceptable version of a square wave.
The cause of ringing and Gibbs’ phenomenon for the square wave is a rather slow convergence to zero of its Fourier coefficients (at a rate comparable to $|n|^{-1}$). In the next section, we shall interpret this in terms of energy and show that a partial sum like $S_{100}$ does not capture a high enough percentage of the energy of the square wave $f_1$.
In contrast, the Fourier coefficients of the parabolic wave $f_2$ tend to zero more rapidly (at a rate comparable to $n^{-2}$). Because of this, the partial sum $S_{100}$ for $f_2$ is a much better approximation to the parabolic wave (see Fig. 5). In fact, its partial sums $S_N$ exhibit the phenomenon of uniform convergence.
We say that the Fourier series for a function $f$ converges uniformly to $f$ if

$$\lim_{N \to \infty} \left\{ \max_{x \in [-\pi, \pi]} |f(x) - S_N(x)| \right\} = 0. \tag{21}$$

This equation says that, for large enough $N$, we can have the maximum distance between the graphs of $f$ and $S_N$ as small as we wish. Figure 5 is a good illustration of this for the parabolic wave.
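The contrast between the two waves can be reproduced with a short computation. The Python sketch below evaluates the partial sums from Eqs. (11) and (12) on a grid, locates the highest peak of S100 for the square wave, and measures the maximum error of SN for the parabolic wave. The grid resolution and the function names are assumptions made for this illustration, and a grid maximum only approximates the true maximum in Eq. (21).

```python
import math


def square_partial_sum(x, N):
    """S_N(x) for the square wave, using the cosine form of Eq. (11)."""
    return 0.5 + sum(
        2 * math.sin(n * math.pi / 2) / (n * math.pi) * math.cos(n * x)
        for n in range(1, N + 1)
    )


def parabolic_partial_sum(x, N):
    """S_N(x) for the parabolic wave, using the cosine form of Eq. (12)."""
    return math.pi ** 2 / 3 + sum(
        4 * (-1) ** n / n ** 2 * math.cos(n * x) for n in range(1, N + 1)
    )


# Sample points covering one period, including both endpoints.
grid = [-math.pi + 2 * math.pi * k / 4000 for k in range(4001)]

# Height of the largest spike of S_100 for the square wave (Gibbs' phenomenon).
gibbs_peak = max(square_partial_sum(x, 100) for x in grid)

# Grid estimate of max |f2 - S_N| for two values of N.
max_error = {
    N: max(abs(x * x - parabolic_partial_sum(x, N)) for x in grid)
    for N in (10, 100)
}
print(gibbs_peak, max_error)
```

On this grid the peak of S100 should sit near the value of about 1.09 quoted in the text, while the maximum error for the parabolic wave should shrink as N grows.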
FIGURE 5 Fourier series partial sum $S_{100}$ for parabolic wave.

We can verify Eq. (21) for the parabolic wave as follows. By Eq. (12) we have

$$|f_2(x) - S_N(x)| = \left| \sum_{n=N+1}^{\infty} \frac{4(-1)^n}{n^2} \cos nx \right| \le \sum_{n=N+1}^{\infty} \left| \frac{4(-1)^n}{n^2} \cos nx \right| \le \sum_{n=N+1}^{\infty} \frac{4}{n^2}.$$

Consequently

$$\max_{x \in [-\pi, \pi]} |f_2(x) - S_N(x)| \le \sum_{n=N+1}^{\infty} \frac{4}{n^2} \to 0 \quad \text{as } N \to \infty$$

and thus Eq. (21) holds for the parabolic wave $f_2$.
Uniform convergence for the parabolic wave is a special case of a more general theorem. We shall say that $f$ is uniformly Lipschitz if Eq. (19) holds for all points using the same constant $A$. For instance, it is not hard to show that a continuously differentiable, periodic function is uniformly Lipschitz.

Theorem 4: Suppose that $f$ has period $2\pi$ and is uniformly Lipschitz at all points. Then the Fourier series for $f$ converges uniformly to $f$.

A remarkably simple proof of this theorem is described in Jackson (1941). More general uniform convergence theorems are discussed in Walter (1994).
Theorem 4 applies to the parabolic wave $f_2$, but it does not apply to the square wave $f_1$. In fact, the Fourier series for $f_1$ cannot converge uniformly to $f_1$. That is because a famous theorem of Weierstrass says that a uniform limit of continuous functions (like the partial sums $S_N$) must be a continuous function (which $f_1$ is certainly not). The Gibbs’ phenomenon for the square wave is a conspicuous failure of uniform convergence for its Fourier series.
Gibbs’ phenomenon and ringing, as well as many other aspects of Fourier series, can be understood via an integral form for partial sums discovered by Dirichlet. This integral form is

$$S_N(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x - t) D_N(t) \, dt \tag{22}$$

with kernel $D_N$ defined by

$$D_N(t) = \frac{\sin\left((N + 1/2)t\right)}{\sin(t/2)}. \tag{23}$$

This formula is proved in almost all books on Fourier series (see, for instance, Krantz (1999), Walker (1988), or Zygmund (1968)). The kernel $D_N$ is called Dirichlet’s kernel. In Fig. 6 we have graphed $D_{20}$.

FIGURE 6 Dirichlet’s kernel $D_{20}$.

The most important property of Dirichlet’s kernel is that, for all $N$,

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} D_N(t) \, dt = 1.$$

From Eq. (23) we can see that the value of 1 follows from cancellation of signed areas, and also that the contribution
of the main lobe centered at 0 (see Fig. 6) is significantly for L 2 -convergence. There is also an interpretation of L 2 -
greater than 1 (about 1.09 in value). norm in terms of a generalized Euclidean distance and
From the facts just cited, we can explain the origin of this gives a satisfying geometric flavor to L 2 -convergence
ringing and Gibbs’ phenomenon for the square wave. For of Fourier series. By interpreting the square of L 2 -norm
the square wave function f1 , Eq. (22) becomes as a type of energy, there is an equally satisfying physi-
x +π/2 cal interpretation of L 2 -convergence. The theory of L 2 -
1
S N (x) = D N (t) dt . (24) convergence has led to fruitful generalizations such as
2π x −π/2 Hilbert space theory and norm convergence in a wide va-
As x ranges from −π to π, this formula shows that S N (x) riety of function spaces.
is proportional to the signed area of $D_N$ over an interval of length $\pi$ centered at $x$. By examining Fig. 6, which is a typical graph for $D_N$, it is then easy to see why there is ringing in the partial sums $S_N$ for the square wave. Gibbs’ phenomenon is a bit more subtle, but also results from Eq. (24). When $x$ nears a jump discontinuity, the central lobe of $D_N$ is the dominant contributor to the integral in Eq. (24), resulting in a spike which overshoots the value of 1 for $f_1$ by about 9%.

Our final pointwise convergence theorem was, in essence, the first to be proved. It was established by Dirichlet using the integral form for partial sums in Eq. (22). We shall state this theorem in a stronger form first proved by Jordan.

Theorem 5: If $f$ has period $2\pi$ and has bounded variation on $[0, 2\pi]$, then the Fourier series for $f$ converges at all points. In fact, for all $x$-values,

$$\lim_{N\to\infty} S_N(x) = \tfrac{1}{2}\,[f(x+) + f(x-)].$$

This theorem is too difficult to prove in the limited space we have here (see Zygmund, 1968). A simple consequence of Theorem 5 is that the Fourier series for the square wave $f_1$ converges at its discontinuity points to 1/2 (although this can also be shown directly by substitution of $x = \pm\pi/2$ into the series in Eq. (11)).

We close by mentioning that the conditions for convergence, such as Lipschitz or bounded variation, cited in the theorems above cannot be dispensed with entirely. For instance, Kolmogorov gave an example of a period $2\pi$ function (for which $\int_{-\pi}^{\pi} |f(x)|\,dx$ is finite) that has a Fourier series which fails to converge at every point.

More discussion of pointwise convergence can be found in Walker (1998), Walter (1994), or Zygmund (1968).

IV. CONVERGENCE IN NORM

Perhaps the most satisfactory notion of convergence for Fourier series is convergence in $L^2$-norm (also called $L^2$-convergence), which we shall define in this section. One of the great triumphs of the Lebesgue theory of integration is that it yields necessary and sufficient conditions

To introduce the idea of $L^2$-convergence, we first examine a special case. By Theorem 4, the partial sums of a uniformly Lipschitz function $f$ converge uniformly to $f$. Since that means that the maximum distance between the graphs of $S_N$ and $f$ tends to 0 as $N \to \infty$, it follows that

$$\lim_{N\to\infty} \frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x) - S_N(x)|^2\,dx = 0. \tag{25}$$

This result motivates the definition of $L^2$-convergence. If $g$ is a function for which $|g|^2$ has a finite Lebesgue integral over $[-\pi, \pi]$, then we say that $g$ is an $L^2$-function, and we define its $L^2$-norm $\|g\|_2$ by

$$\|g\|_2 = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi} |g(x)|^2\,dx}.$$

We can then rephrase Eq. (25) as saying that $\|f - S_N\|_2 \to 0$ as $N \to \infty$. In other words, the Fourier series for $f$ converges to $f$ in $L^2$-norm. The following theorem generalizes this result to all $L^2$-functions (see Rudin (1986) for a proof).

Theorem 6: If $f$ is an $L^2$-function, then its Fourier series converges to $f$ in $L^2$-norm.

Theorem 6 says that Eq. (25) holds for every $L^2$-function $f$. Combining this with Eq. (16), we obtain Parseval’s equality:

$$\sum_{n=-\infty}^{\infty} |c_n|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\,dx. \tag{26}$$

Parseval’s equation has a useful interpretation in terms of energy. It says that the energy of the set of Fourier coefficients, defined to be equal to the left side of Eq. (26), is equal to the energy of the function $f$, defined by the right side of Eq. (26).

The $L^2$-norm can be interpreted as a generalized Euclidean distance. To see this, take square roots of both sides of Eq. (26): $\sqrt{\sum |c_n|^2} = \|f\|_2$. The left side of this equation is interpreted as a Euclidean distance in an (infinite-dimensional) coordinate space, hence the $L^2$-norm $\|f\|_2$ is equivalent to such a distance.

As examples of these ideas, let’s return to the square wave and parabolic wave. For the square wave $f_1$, we find that

$$\|f_1 - S_{100}\|_2^2 = \sum_{|n|>100} \frac{\sin^2(n\pi/2)}{(n\pi)^2} = 1.0 \times 10^{-3}.$$

Likewise, for the parabolic wave $f_2$, we have $\|f_2 - S_{100}\|_2^2 = 2.6 \times 10^{-6}$. These facts show that the energy of the parabolic wave is almost entirely contained in the partial sum $S_{100}$; their energy difference is almost three orders of magnitude smaller than in the square wave case. In terms of generalized Euclidean distance, we have $\|f_2 - S_{100}\|_2 = 1.6 \times 10^{-3}$ and $\|f_1 - S_{100}\|_2 = 3.2 \times 10^{-2}$, showing that the partial sum is an order of magnitude closer for the parabolic wave.

Theorem 6 has a converse, known as the Riesz-Fischer theorem.

Theorem 7 (Riesz-Fischer): If $\sum |c_n|^2$ converges, then there exists an $L^2$-function $f$ having $\{c_n\}$ as its Fourier coefficients.

This theorem is proved in Rudin (1986). Theorem 6 and the Riesz-Fischer theorem combine to give necessary and sufficient conditions for $L^2$-convergence of Fourier series, conditions which are remarkably easy to apply. This has made $L^2$-convergence into the most commonly used notion of convergence for Fourier series.

These ideas for $L^2$-norms partially generalize to the case of $L^p$-norms. Let $p$ be a real number satisfying $p \ge 1$. If $g$ is a function for which $|g|^p$ has a finite Lebesgue integral over $[-\pi, \pi]$, then we say that $g$ is an $L^p$-function, and we define its $L^p$-norm $\|g\|_p$ by

$$\|g\|_p = \left(\frac{1}{2\pi}\int_{-\pi}^{\pi} |g(x)|^p\,dx\right)^{1/p}.$$

If $\|f - S_N\|_p \to 0$, then we say that the Fourier series for $f$ converges to $f$ in $L^p$-norm. The following theorem generalizes Theorem 6 (see Krantz (1999) for a proof).

Theorem 8: If $f$ is an $L^p$-function for $p > 1$, then its Fourier series converges to $f$ in $L^p$-norm.

Notice that the case of $p = 1$ is not included in Theorem 8. The example of Kolmogorov cited at the end of Section III shows that there exist $L^1$-functions whose Fourier series do not converge in $L^1$-norm. For $p \ne 2$, there are no simple analogs of either Parseval’s equality or the Riesz-Fischer theorem (which say that we can characterize $L^2$-functions by the magnitude of their Fourier coefficients). Some partial analogs of these latter results for $L^p$-functions, when $p \ne 2$, are discussed in Zygmund (1968) (in the context of Littlewood-Paley theory).

We close this section by returning full circle to the notion of pointwise convergence. The following theorem was proved by Carleson for $L^2$-functions and by Hunt for $L^p$-functions ($p \ne 2$).

Theorem 9: If $f$ is an $L^p$-function for $p > 1$, then its Fourier series converges to it at almost all points.

By almost all points, we mean that the set of points where divergence occurs has Lebesgue measure zero. References for the proof of Theorem 9 can be found in Krantz (1999) and Zygmund (1968). Its proof is undoubtedly the most difficult one in the theory of Fourier series.

V. SUMMABILITY OF FOURIER SERIES

In the previous sections, we noted some problems with convergence of Fourier series partial sums. Some of these problems include Kolmogorov’s example of a Fourier series for an $L^1$-function that diverges everywhere, and Gibbs’ phenomenon and ringing in the Fourier series partial sums for discontinuous functions. Another problem is Du Bois Reymond’s example of a continuous function whose Fourier series diverges on a countably infinite set of points; see Walker (1968). It turns out that all of these difficulties simply disappear when new summation methods, based on appropriate modifications of the partial sums, are used.

The simplest modification of partial sums, and one of the first historically to be used, is to take arithmetic means. Define the $N$th arithmetic mean $\sigma_N$ by $\sigma_N = (S_0 + S_1 + \cdots + S_{N-1})/N$. It follows that

$$\sigma_N(x) = \sum_{n=-N}^{N} \left(1 - \frac{|n|}{N}\right) c_n e^{inx}. \tag{27}$$

The factors $(1 - |n|/N)$ are called convergence factors. They modify the Fourier coefficients $c_n$ so that the amplitudes of the higher frequency terms (for $|n|$ near $N$) are damped down toward zero. This produces a great improvement in convergence properties, as shown by the following theorem.

Theorem 10: Let $f$ be a periodic function. If $f$ is an $L^p$-function for $p \ge 1$, then $\sigma_N \to f$ in $L^p$-norm as $N \to \infty$. If $f$ is a continuous function, then $\sigma_N \to f$ uniformly as $N \to \infty$.

Notice that $L^1$-convergence is included in Theorem 10. Even for Kolmogorov’s function, it is the case that $\|f - \sigma_N\|_1 \to 0$ as $N \to \infty$. It also should be noted that no assumption, other than continuity of the periodic function, is needed in order to ensure uniform convergence of its arithmetic means.
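The step from the definition of $\sigma_N$ to Eq. (27) is a counting argument: the term $c_n e^{inx}$ appears in $S_k$ exactly when $k \ge |n|$, that is, in $N - |n|$ of the sums $S_0, \ldots, S_{N-1}$. The Python sketch below checks this numerically. It assumes, purely for illustration, the square wave coefficients $c_0 = 1/2$ and $c_n = \sin(n\pi/2)/(n\pi)$; the function names are hypothetical.

```python
import numpy as np

def c(n):
    """Assumed square wave coefficients: c_0 = 1/2, c_n = sin(n pi/2)/(n pi)."""
    return 0.5 * np.sinc(np.asarray(n, dtype=float) / 2.0)

def partial_sum(N, x):
    """S_N(x) = sum over |n| <= N of c_n e^{inx}."""
    n = np.arange(-N, N + 1)
    return np.real(np.exp(1j * np.outer(x, n)) @ c(n))

def mean_from_definition(N, x):
    """sigma_N = (S_0 + S_1 + ... + S_{N-1}) / N."""
    return sum(partial_sum(k, x) for k in range(N)) / N

def mean_from_factors(N, x):
    """sigma_N computed with the convergence factors (1 - |n|/N) of Eq. (27)."""
    n = np.arange(-N, N + 1)
    factors = 1.0 - np.abs(n) / N
    return np.real(np.exp(1j * np.outer(x, n)) @ (factors * c(n)))

x = np.linspace(-np.pi, np.pi, 401)
N = 20
difference = np.max(np.abs(mean_from_definition(N, x) - mean_from_factors(N, x)))
print(difference)  # agreement up to floating-point rounding
```

The second form is the cheaper one in practice, since it needs a single sum rather than $N$ of them.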

For a proof of Theorem 10, see Krantz (1999). The key to the proof is Fejér’s integral form for $\sigma_N$:

$$\sigma_N(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x - t)\,F_N(t)\,dt \tag{28}$$

where Fejér’s kernel $F_N$ is defined by

$$F_N(t) = \frac{1}{N}\left(\frac{\sin(Nt/2)}{\sin(t/2)}\right)^2. \tag{29}$$

In Fig. 7 we show the graph of $F_{20}$. Compare this graph with the one of Dirichlet’s kernel $D_{20}$ in Fig. 6. Unlike Dirichlet’s kernel, Fejér’s kernel is positive [$F_N(t) \ge 0$], and is close to 0 away from the origin. These two facts are the main reasons that Theorem 10 holds. The fact that Fejér’s kernel satisfies

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} F_N(t)\,dt = 1$$

is also used in the proof.

FIGURE 7 Fejér’s kernel $F_{20}$.

An attractive feature of arithmetic means is that Gibbs’ phenomenon and ringing do not occur. For example, in Fig. 8 we show $\sigma_{100}$ for the square wave and it is plain that these two defects are absent. For the square wave function $f_1$, Eq. (28) reduces to

$$\sigma_N(x) = \frac{1}{2\pi}\int_{x-\pi/2}^{x+\pi/2} F_N(t)\,dt.$$

As $x$ ranges from $-\pi$ to $\pi$, this formula shows that $\sigma_N(x)$ is proportional to the area of the positive function $F_N$ over an interval of length $\pi$ centered at $x$. By examining Fig. 7, which is a typical graph for $F_N$, it is easy to see why ringing and Gibbs’ phenomenon do not occur for the arithmetic means of the square wave.

FIGURE 8 Arithmetic mean $\sigma_{100}$ for square wave.

The method of arithmetic means is just one example from a wide range of summation methods for Fourier series. These summation methods are one of the major elements in the area of finite impulse response filtering in the fields of electrical engineering and signal processing.

A summation kernel $K_N$ is defined by

$$K_N(x) = \sum_{n=-N}^{N} m_n e^{inx}. \tag{30}$$

The real numbers $\{m_n\}$ are the convergence factors for the kernel. We have already seen two examples: Dirichlet’s kernel (where $m_n = 1$) and Fejér’s kernel (where $m_n = 1 - |n|/N$).

When $K_N$ is a summation kernel, we define the modified partial sum of $f$ to be $\sum_{n=-N}^{N} m_n c_n e^{inx}$. It then follows from Eqs. (14) and (30) that

$$\sum_{n=-N}^{N} m_n c_n e^{inx} = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x - t)\,K_N(t)\,dt. \tag{31}$$

The function defined by both sides of Eq. (31) is denoted by $K_N * f$. It is usually more convenient to use the left side of Eq. (31) to compute $K_N * f$, while for theoretical purposes (such as proving Theorem 11 below), it is more convenient to use the right side of Eq. (31).
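To make the computational remark concrete, here is a minimal Python sketch that evaluates $K_N * f$ from the left side of Eq. (31) for the two kernels named so far. The square wave coefficients ($c_0 = 1/2$, $c_n = \sin(n\pi/2)/(n\pi)$, for a wave equal to 1 on $|x| < \pi/2$ and 0 elsewhere) are an assumption made for illustration, and all function names are hypothetical.

```python
import numpy as np

def square_wave_coeff(n):
    """Assumed square wave coefficients: c_0 = 1/2, c_n = sin(n pi/2)/(n pi)."""
    return 0.5 * np.sinc(np.asarray(n, dtype=float) / 2.0)

def modified_partial_sum(factors, N, x):
    """Left side of Eq. (31): sum over |n| <= N of m_n c_n e^{inx}."""
    n = np.arange(-N, N + 1)
    m = factors(n, N)
    return np.real(np.exp(1j * np.outer(x, n)) @ (m * square_wave_coeff(n)))

def dirichlet_factors(n, N):
    """m_n = 1: the ordinary partial sum S_N."""
    return np.ones_like(n, dtype=float)

def fejer_factors(n, N):
    """m_n = 1 - |n|/N: the arithmetic mean sigma_N."""
    return 1.0 - np.abs(n) / N

x = np.linspace(-np.pi, np.pi, 4001)
N = 100
s_dirichlet = modified_partial_sum(dirichlet_factors, N, x)
s_fejer = modified_partial_sum(fejer_factors, N, x)

print(s_dirichlet.max())  # exceeds 1: the Gibbs overshoot near the jumps
print(s_fejer.max())      # does not exceed 1: no overshoot for the arithmetic mean
```

Any other set of convergence factors can be tried by passing a different `factors` function with the same two arguments.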

We say that a summation kernel $K_N$ is regular if it satisfies the following three conditions.

1. For each $N$,

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} K_N(x)\,dx = 1.$$

2. There is a positive constant $C$ such that

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |K_N(x)|\,dx \le C.$$

3. For each $0 < \delta < \pi$,

$$\lim_{N\to\infty}\ \max_{\delta \le |x| \le \pi} |K_N(x)| = 0.$$

There are many examples of regular summation kernels. Fejér’s kernel, which has $m_n = 1 - |n|/N$, is regular. Another regular summation kernel is Hann’s kernel, which has $m_n = 0.5 + 0.5\cos(n\pi/N)$. A third regular summation kernel is de la Vallée Poussin’s kernel, for which $m_n = 1$ when $|n| \le N/2$, and $m_n = 2(1 - |n/N|)$ when $N/2 < |n| \le N$. The proofs that these summation kernels are regular are given in Walker (1996). It should be noted that Dirichlet’s kernel is not regular, because properties 2 and 3 do not hold.

As with Fejér’s kernel, all regular summation kernels significantly improve the convergence of Fourier series. In fact, the following theorem generalizes Theorem 10.

Theorem 11: Let $f$ be a periodic function, and let $K_N$ be a regular summation kernel. If $f$ is an $L^p$-function for $p \ge 1$, then $K_N * f \to f$ in $L^p$-norm as $N \to \infty$. If $f$ is a continuous function, then $K_N * f \to f$ uniformly as $N \to \infty$.

For an elegant proof of this theorem, see Krantz (1999).

From Theorem 11 we might be tempted to conclude that the convergence properties of regular summation kernels are all the same. They do differ, however, in the rates at which they converge. For example, in Fig. 9 we show $K_{100} * f_1$ where the kernel is Hann’s kernel and $f_1$ is the square wave. Notice that this graph is a much better approximation of a square wave than the arithmetic mean graph in Fig. 8. An oscilloscope, for example, can easily generate the graph in Fig. 9, thereby producing an acceptable version of a square wave.

FIGURE 9 Approximate square wave using Hann’s kernel.

Summation of Fourier series is discussed further in Krantz (1999), Walker (1996), Walter (1994), and Zygmund (1968).

VI. GENERALIZED FOURIER SERIES

The classical theory of Fourier series has undergone extensive generalizations during the last two hundred years. For example, Fourier series can be viewed as one aspect of a general theory of orthogonal series expansions. In this section, we shall discuss a few of the more celebrated orthogonal series, such as Legendre series, Haar series, and wavelet series.

We begin with Legendre series. The first two Legendre polynomials are $P_0(x) = 1$ and $P_1(x) = x$. For $n = 2, 3, 4, \ldots$, the $n$th Legendre polynomial $P_n$ is defined by the recursion relation

$$n P_n(x) = (2n - 1)\,x\,P_{n-1}(x) - (n - 1)\,P_{n-2}(x).$$

These polynomials satisfy the following orthogonality relation

$$\int_{-1}^{1} P_n(x)\,P_m(x)\,dx = \begin{cases} 0 & \text{if } m \ne n \\ 2/(2n + 1) & \text{if } m = n. \end{cases} \tag{32}$$

This equation is quite similar to Eq. (14). Because of Eq. (32) (recall how we used Eq. (14) to derive Eq. (9)), the Legendre series for a function $f$ over the interval $[-1, 1]$ is defined to be

$$\sum_{n=0}^{\infty} c_n P_n(x) \tag{33}$$

with

$$c_n = \frac{2n + 1}{2}\int_{-1}^{1} f(x)\,P_n(x)\,dx. \tag{34}$$
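The following Python sketch illustrates how Legendre coefficients can be computed. It uses the standard three-term recurrence $n P_n = (2n-1)\,x\,P_{n-1} - (n-1)\,P_{n-2}$ and the standard normalization $\int_{-1}^{1} P_n^2\,dx = 2/(2n+1)$, which gives $c_n = \frac{2n+1}{2}\int_{-1}^{1} f P_n\,dx$. The test function (a step equal to 1 on $[0, 1]$ and 0 on $[-1, 0)$), the quadrature order, and the function names are choices made here for illustration only.

```python
import numpy as np

def legendre(n, x):
    """Evaluate P_n(x) with the three-term recurrence."""
    x = np.asarray(x, dtype=float)
    p_prev, p = np.ones_like(x), x.copy()
    if n == 0:
        return p_prev
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    return p

# Gauss-Legendre nodes and weights on [-1, 1], mapped to [0, 1], where the step equals 1.
nodes, weights = np.polynomial.legendre.leggauss(32)
t, w = 0.5 * (nodes + 1.0), 0.5 * weights

def step_coeff(n):
    """c_n = (2n + 1)/2 * integral over [0, 1] of P_n(x) dx for the step function."""
    return 0.5 * (2 * n + 1) * float(np.sum(w * legendre(n, t)))

coeffs = [step_coeff(n) for n in range(6)]
print(coeffs)  # c_0 = 1/2, c_1 = 3/4, and the even-index coefficients beyond c_0 vanish
```

Because the integrands are polynomials, the 32-point Gauss-Legendre rule integrates them exactly up to rounding, so no convergence study is needed for these low-order coefficients.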

The partial sum $S_N$ of the series in Eq. (33) is defined to be

$$S_N(x) = \sum_{n=0}^{N} c_n P_n(x).$$

As an example, let $f(x) = 1$ for $0 \le x \le 1$ and $f(x) = 0$ for $-1 \le x < 0$. The Legendre series for this step function is [see Walker (1988)]:

$$\frac{1}{2} + \sum_{k=0}^{\infty} \frac{(-1)^k (4k + 3)(2k)!}{4^{k+1}(k + 1)!\,k!}\,P_{2k+1}(x).$$

In Fig. 10 we show the partial sum $S_{11}$ for this series. The graph of $S_{11}$ is reminiscent of a Fourier series partial sum for a step function. In fact, the following theorem is true.

FIGURE 10 Step function and its Legendre series partial sum $S_{11}$.

Theorem 12: If $\int_{-1}^{1} |f(x)|^2\,dx$ is finite, then the partial sums $S_N$ for the Legendre series for $f$ satisfy

$$\lim_{N\to\infty} \int_{-1}^{1} |f(x) - S_N(x)|^2\,dx = 0.$$

Moreover, if $f$ is Lipschitz at a point $x_0$, then $S_N(x_0) \to f(x_0)$ as $N \to \infty$.

This theorem is proved in Walter (1994) and Jackson (1941). Further details and other examples of orthogonal polynomial series can be found in either Davis (1975), Jackson (1941), or Walter (1994). There are many important orthogonal series, such as Hermite, Laguerre, and Tchebysheff, which we cannot examine here because of space limitations.

We now turn to another type of orthogonal series, the Haar series. The defects, such as Gibbs’ phenomenon and ringing, that occur with Fourier series expansions can be traced to the unlocalized nature of the functions used for expansions. The complex exponentials used in classical Fourier series, and the polynomials used in Legendre series, are all non-zero (except possibly for a finite number of points) over their domains. In contrast, Haar series make use of localized functions, which are non-zero only over tiny regions within their domains.

In order to define Haar series, we first define the fundamental Haar wavelet $H(x)$ by

$$H(x) = \begin{cases} 1 & \text{if } 0 \le x < 1/2 \\ -1 & \text{if } 1/2 \le x \le 1. \end{cases}$$

The Haar wavelets $\{H_{j,k}(x)\}$ are then defined by

$$H_{j,k}(x) = 2^{j/2}\,H(2^j x - k)$$

for $j = 0, 1, 2, \ldots$; $k = 0, 1, \ldots, 2^j - 1$. Notice that $H_{j,k}(x)$ is non-zero only on the interval $[k2^{-j}, (k + 1)2^{-j}]$, which for large $j$ is a tiny subinterval of $[0, 1]$. As $k$ ranges between 0 and $2^j - 1$, these subintervals partition the interval $[0, 1]$, and the partition becomes finer (shorter subintervals) with increasing $j$.

The Haar series for a function $f$ is defined by

$$b + \sum_{j=0}^{\infty}\ \sum_{k=0}^{2^j - 1} c_{j,k}\,H_{j,k}(x) \tag{35}$$

with $b = \int_0^1 f(x)\,dx$ and

$$c_{j,k} = \int_0^1 f(x)\,H_{j,k}(x)\,dx.$$

The definitions of $b$ and $c_{j,k}$ are justified by orthogonality relations between the Haar functions (similar to the orthogonality relations that we used above to justify Fourier series and Legendre series).

A partial sum $S_N$ for the Haar series in Eq. (35) is defined by

$$S_N(x) = b + \sum_{\{j,k \,\mid\, 2^j + k \le N\}} c_{j,k}\,H_{j,k}(x).$$

For example, let $f$ be the function on $[0, 1]$ defined as follows

$$f(x) = \begin{cases} x - 1/2 & \text{if } 1/4 < x < 3/4 \\ 0 & \text{if } x \le 1/4 \text{ or } 3/4 \le x. \end{cases}$$

In Fig. 11 we show the Haar series partial sum $S_{256}$ for this function. Notice that there is no Gibbs’ phenomenon with this partial sum. This contrasts sharply with the Fourier
series partial sum, also using 257 terms, which we show in Fig. 12.

FIGURE 11 Haar series partial sum $S_{256}$, which has 257 terms.

FIGURE 12 Fourier series partial sum $S_{128}$, which has 257 terms.

The Haar series partial sums satisfy the following theorem [proved in Daubechies (1992) and in Meyer (1992)].

Theorem 13: Suppose that $\int_0^1 |f(x)|^p\,dx$ is finite, for $p \ge 1$. Then the Haar series partial sums for $f$ satisfy

$$\lim_{N\to\infty} \left(\int_0^1 |f(x) - S_N(x)|^p\,dx\right)^{1/p} = 0.$$

If $f$ is continuous on $[0, 1]$, then $S_N$ converges uniformly to $f$ on $[0, 1]$.

This theorem is reminiscent of Theorems 10 and 11 for the modified Fourier series partial sums obtained by arithmetic means or by a regular summation kernel. The difference here, however, is that for the Haar series no modifications of the partial sums are needed.

One glaring defect of Haar series is that the partial sums are discontinuous functions. This defect is remedied by the wavelet series discovered by Meyer, Daubechies, and others. The fundamental Haar wavelet is replaced by some new fundamental wavelet $\psi$ and the set of wavelets $\{\psi_{j,k}\}$ is then defined by $\psi_{j,k}(x) = 2^{-j/2}\,\psi[2^j x - k]$. (The bracket symbolism $\psi[2^j x - k]$ means that the value, $2^j x - k \bmod 1$, is evaluated by $\psi$. This technicality is needed in order to ensure periodicity of $\psi_{j,k}$.) For example, in Fig. 13, we show graphs of $\psi_{4,1}$ and $\psi_{6,46}$ for one of the Daubechies wavelets (a Coif18 wavelet), which is continuously differentiable. For a complete discussion of the definition of these wavelet functions, see Daubechies (1992) or Mallat (1998).

The wavelet series, generated by the fundamental wavelet $\psi$, is defined by

$$b + \sum_{j=0}^{\infty}\ \sum_{k=0}^{2^j - 1} c_{j,k}\,\psi_{j,k}(x) \tag{36}$$

with $b = \int_0^1 f(x)\,dx$ and

$$c_{j,k} = \int_0^1 f(x)\,\psi_{j,k}(x)\,dx. \tag{37}$$

This wavelet series has partial sums $S_N$ defined by

$$S_N(x) = b + \sum_{\{j,k \,\mid\, 2^j + k \le N\}} c_{j,k}\,\psi_{j,k}(x).$$

Notice that when $\psi$ is continuously differentiable, then so is each partial sum $S_N$. These wavelet series partial sums satisfy the following theorem, which generalizes Theorem 13 for Haar series; for a proof, see Daubechies (1992) or Meyer (1992).

Theorem 14: Suppose that $\int_0^1 |f(x)|^p\,dx$ is finite, for $p \ge 1$. Then the Daubechies wavelet series partial sums for $f$ satisfy

$$\lim_{N\to\infty} \left(\int_0^1 |f(x) - S_N(x)|^p\,dx\right)^{1/p} = 0.$$

If $f$ is continuous on $[0, 1]$, then $S_N$ converges uniformly to $f$ on $[0, 1]$.

FIGURE 13 Two Daubechies wavelets.

Theorem 14 does not reveal the full power of wavelet series. In almost all cases, it is possible to rearrange the terms in the wavelet series in any manner whatsoever and convergence will still hold. One reason for doing a rearrangement is in order to add the terms in the series with coefficients of largest magnitude (thus largest energy) first so as to speed up convergence to the function. Here is a convergence theorem for such permuted series.

Theorem 15: Suppose that $\int_0^1 |f(x)|^p\,dx$ is finite, for $p > 1$. If the terms of a Daubechies wavelet series are permuted (in any manner whatsoever), then the partial sums $S_N$ of the permuted series satisfy

$$\lim_{N\to\infty} \left(\int_0^1 |f(x) - S_N(x)|^p\,dx\right)^{1/p} = 0.$$

If $f$ is uniformly Lipschitz, then the partial sums $S_N$ of the permuted series converge uniformly to $f$.

This theorem is proved in Daubechies (1992) and Meyer (1992). This type of convergence of wavelet series is called unconditional convergence. It is known [see Mallat (1998)] that unconditional convergence of wavelet series ensures an optimality of compression of signals. For details about compression of signals and other applications of wavelet series, see Walker (1999) for a simple introduction and Mallat (1998) for a thorough treatment.

VII. DISCRETE FOURIER SERIES

The digital computer has revolutionized the practice of science in the latter half of the twentieth century. The methods of computerized Fourier series, based upon the fast Fourier transform algorithms for digital approximation of Fourier series, have completely transformed the application of Fourier series to scientific problems. In this section, we shall briefly outline the main facts in the theory of discrete Fourier series.

The Fourier series coefficients $\{c_n\}$ can be discretely approximated via Riemann sums for the integrals in Eq. (9). For a (large) positive integer $M$, let $x_k = -\pi + 2\pi k/M$ for $k = 0, 1, 2, \ldots, M - 1$ and let $\Delta x = 2\pi/M$. Then the $n$th Fourier coefficient $c_n$ for a function $f$ is approximated as follows:

$$c_n \approx \frac{1}{2\pi}\sum_{k=0}^{M-1} f(x_k)\,e^{-inx_k}\,\Delta x = \frac{e^{-in\pi}}{M}\sum_{k=0}^{M-1} f(x_k)\,e^{-i2\pi kn/M}.$$

The last sum above is called the Discrete Fourier Transform (DFT) of the finite sequence of numbers $\{f(x_k)\}$. That is, we define the DFT of a sequence $\{g_k\}_{k=0}^{M-1}$ of numbers by

$$G_n = \sum_{k=0}^{M-1} g_k\,e^{-i2\pi kn/M}. \tag{38}$$

The DFT is the set of numbers $\{G_n\}$, and we see from the discussion above that the Fourier coefficients of a function $f$ can be approximated by a DFT (multiplied by the factors $e^{-in\pi}/M$). For example, in Fig. 14 we show a graph of approximations of the Fourier coefficients $\{c_n\}_{n=-50}^{50}$ of the square wave $f_1$ obtained via a DFT (using $M = 1024$). For all values, these approximate Fourier coefficients differ from the exact coefficients by no more than $10^{-3}$. By taking $M$ even larger, the error can be reduced still further.

The two principal properties of DFTs are that they can be inverted and they preserve energy (up to a scale factor). The inversion formula for the DFT is

$$g_k = \frac{1}{M}\sum_{n=0}^{M-1} G_n\,e^{i2\pi kn/M}. \tag{39}$$

And the conservation of energy property is

$$\sum_{k=0}^{M-1} |g_k|^2 = \frac{1}{M}\sum_{n=0}^{M-1} |G_n|^2. \tag{40}$$

Interpreting a sum of squares as energy, Eq. (40) says that, up to multiplication by the factor $1/M$, the energies of the discrete signal $\{g_k\}$ and its DFT $\{G_n\}$ are the same.
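A short Python sketch of these computations follows. NumPy’s `np.fft.fft` uses the same sign and (absent) normalization conventions as Eq. (38), and `np.fft.ifft` includes the factor $1/M$ in the inverse. The sampled function is an assumption made for illustration: a square wave equal to 1 for $|x| < \pi/2$ and 0 elsewhere, whose exact coefficients are $c_0 = 1/2$ and $c_n = \sin(n\pi/2)/(n\pi)$.

```python
import numpy as np

M = 1024
k = np.arange(M)
x = -np.pi + 2 * np.pi * k / M
f = (np.abs(x) < np.pi / 2).astype(float)   # samples f(x_k) of the assumed square wave

G = np.fft.fft(f)                           # G_n = sum_k f(x_k) e^{-i 2 pi k n / M}

# Approximate c_n for n = -50..50; the factor e^{-i n pi} equals (-1)^n,
# and negative n are read from the DFT by periodicity (index n mod M).
n = np.arange(-50, 51)
approx = ((-1.0) ** n) * G[n % M] / M
exact = 0.5 * np.sinc(n / 2.0)
max_err = np.max(np.abs(approx - exact))
print(max_err)                              # on the order of 1/M

# Inversion and conservation of energy (with the 1/M normalization).
g_back = np.fft.ifft(G)
energy_signal = np.sum(np.abs(f) ** 2)
energy_dft = np.sum(np.abs(G) ** 2) / M
print(np.allclose(g_back.real, f), energy_signal, energy_dft)
```

The error of the Riemann sum is dominated by the two jump discontinuities, which is why it decays only like $1/M$ for this function.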

These facts are proved in Briggs and Henson (1995) and Walker (1996).

FIGURE 14 Fourier coefficients for square wave, $n = -50$ to 50. Successive values are connected with line segments.

An application of inversion of DFTs is to the calculation of Fourier series partial sums. If we substitute $x_k = -\pi + 2\pi k/M$ into the Fourier series partial sum $S_N(x)$ we obtain (assuming that $N < M/2$ and after making a change of indices $m = n + N$):

$$S_N(x_k) = \sum_{n=-N}^{N} c_n\,e^{in(-\pi + 2\pi k/M)} = \sum_{n=-N}^{N} c_n\,(-1)^n\,e^{i2\pi nk/M} = \sum_{m=0}^{2N} c_{m-N}\,(-1)^{m-N}\,e^{-i2\pi kN/M}\,e^{i2\pi km/M}.$$

Thus, if we let $g_m = c_{m-N}$ for $m = 0, 1, \ldots, 2N$ and $g_m = 0$ for $m = 2N + 1, \ldots, M - 1$, we have

$$S_N(x_k) = e^{-i2\pi kN/M}\sum_{m=0}^{M-1} g_m\,(-1)^{m-N}\,e^{i2\pi km/M}.$$

This equation shows that $S_N(x_k)$ can be computed using a DFT inversion (along with multiplications by exponential factors). By combining DFT approximations of Fourier coefficients with this last equation, it is also possible to approximate Fourier series partial sums, or arithmetic means, or other modified partial sums. See Briggs and Henson (1995) or Walker (1996) for further details.

These calculations with DFTs are facilitated on a computer using various algorithms which are all referred to as fast Fourier transforms (FFTs). Using FFTs, the process of computing DFTs, and hence Fourier coefficients and Fourier series, is now practically instantaneous. This allows for rapid, so-called real-time, calculation of the frequency content of signals. One of the most widely used applications is in calculating spectrograms. A spectrogram is calculated by dividing a signal (typically a recorded, digitally sampled, audio signal) into a successive series of short duration subsignals, and performing an FFT on each subsignal. This gives a portrait of the main frequencies present in the signal as time proceeds. For example, in Fig. 15a we analyze discrete samples of the function

$$\sin(2\nu_1\pi x)\,e^{-100\pi(x - 0.2)^2} + [\sin(2\nu_1\pi x) + \cos(2\nu_2\pi x)]\,e^{-50\pi(x - 0.5)^2} + \sin(2\nu_2\pi x)\,e^{-100\pi(x - 0.8)^2} \tag{41}$$

where the frequencies $\nu_1$ and $\nu_2$ of the sinusoidal factors are 128 and 256, respectively. The signal is graphed at the bottom of Fig. 15a and the magnitudes of the values of its spectrogram are graphed at the top. The more intense spectrogram magnitudes are shaded more darkly, while white regions indicate magnitudes that are essentially zero. The dark blobs in the graph of the spectrogram magnitudes clearly correspond to the regions of highest energy in the signal and are centered on the frequencies 128 and 256, the two frequencies used in Eq. (41).

As a second example, we show in Fig. 15b the spectrogram magnitudes for the signal

$$e^{-5\pi[(x - 0.5)/0.4]^{10}}\,[\sin(400\pi x^2) + \sin(200\pi x^2)]. \tag{42}$$

This signal is a combination of two tones with sharply increasing frequency of oscillations. When run through a sound generator, it produces a sharply rising pitch. Signals like this bear some similarity to certain bird calls, and are also used in radar. We can see two, somewhat blurred, line segments in the spectrogram, corresponding to the factors $400\pi x$ and $200\pi x$ multiplying $x$ in the two sine factors in Eq. (42).

One important area of application of spectrograms is in speech coding. As an example, in Fig. 16 we show spectrogram magnitudes for two audio recordings. The spectrogram magnitudes in Fig. 16a come from a recording of a four-year-old girl singing the phrase “twinkle, twinkle, little star,” and the spectrogram magnitudes in Fig. 16b come from a recording of the author of this article singing the same phrase. The main frequencies are seen to be in harmonic progression (integer multiples of a lowest, fundamental frequency) in both cases, but the young girl’s main frequencies are higher (higher in pitch) than the adult male’s. The slightly curved ribbons of frequency content are known as formants in linguistics.
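The spectrogram procedure described above (cut the signal into short subsignals, then take an FFT of each) can be sketched in a few lines of Python. The test signal is the three-burst function of Eq. (41) with frequencies 128 and 256; everything else is an assumption made for this illustration: the sampling rate, the subsignal length and overlap, the Hann taper applied to each subsignal, and the function names.

```python
import numpy as np

fs = 2048                        # assumed sampling rate (samples per second)
x = np.arange(fs) / fs           # one second of sample times on [0, 1)
nu1, nu2 = 128, 256
signal = (np.sin(2 * nu1 * np.pi * x) * np.exp(-100 * np.pi * (x - 0.2) ** 2)
          + (np.sin(2 * nu1 * np.pi * x) + np.cos(2 * nu2 * np.pi * x))
          * np.exp(-50 * np.pi * (x - 0.5) ** 2)
          + np.sin(2 * nu2 * np.pi * x) * np.exp(-100 * np.pi * (x - 0.8) ** 2))

WIDTH, HOP = 256, 64             # assumed subsignal length and step, in samples

def spectrogram(samples, width=WIDTH, hop=HOP):
    """FFT magnitudes of successive, overlapping, Hann-tapered subsignals."""
    window = np.hanning(width)
    starts = np.arange(0, len(samples) - width + 1, hop)
    frames = np.array([samples[s:s + width] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)), starts

mags, starts = spectrogram(signal)
freqs = np.fft.rfftfreq(WIDTH, d=1.0 / fs)    # frequency (Hz) of each FFT bin
centers = (starts + WIDTH // 2) / fs          # time (sec) at the middle of each subsignal

def dominant_frequency(t):
    """Frequency of the largest spectrogram magnitude in the subsignal nearest time t."""
    frame = np.argmin(np.abs(centers - t))
    return freqs[np.argmax(mags[frame])]

print(dominant_frequency(0.2), dominant_frequency(0.8))
```

Each row of `mags` is one column of the spectrogram picture: plotting the rows side by side against `centers` and `freqs` gives a time-frequency portrait of the kind described for Fig. 15a.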

FIGURE 15 Spectrograms of test signals. (a) Bottom graph is the signal in Eq. (41). Top graph is the spectrogram
magnitudes for this signal. (b) Signal and spectrogram magnitudes for the signal in (42). Horizontal axes are time
values (in sec); vertical axes are frequency values (in Hz). Darker pixels denote larger magnitudes, white pixels are
near zero in magnitude.

FIGURE 16 Spectrograms of audio signals. (a) Bottom graph displays data from a recording of a young girl singing
“twinkle, twinkle, little star.” Top graph displays the spectrogram magnitudes for this recording. (b) Similar graphs for
the author’s rendition of “twinkle, twinkle, little star.”

For more details on the use of spectrograms in signal analysis, see Mallat (1998).

It is possible to invert spectrograms. In other words, we can recover the original signal by inverting the succession of DFTs that make up its spectrogram. One application of this inverse procedure is to the compression of audio signals. After discarding (setting to zero) all the values in the spectrogram with magnitudes below a threshold value, the inverse procedure creates an approximation to the signal which uses significantly less data than the original signal. For example, by discarding all of the spectrogram values having magnitudes less than 1/320 times the largest magnitude spectrogram value, the young girl’s version of “twinkle, twinkle, little star” can be approximated, without noticeable degradation of quality, using about one-eighth the amount of data as the original recording. Some of the best results in audio compression are based on sophisticated generalizations of this spectrogram technique, referred to either as lapped transforms or as local cosine expansions; see Malvar (1992) and Mallat (1998).

VIII. CONCLUSION

In this article, we have outlined the main features of the theory and application of one-variable Fourier series. Much additional information, however, can be found in the references. In particular, we did not have sufficient space to discuss the intricacies of multivariable Fourier series which, for example, have important applications in crystallography and molecular structure determination. For a mathematical introduction to multivariable Fourier series, see Krantz (1999), and for an introduction to their applications, see Walker (1988).

SEE ALSO THE FOLLOWING ARTICLES

FUNCTIONAL ANALYSIS • GENERALIZED FUNCTIONS • MEASURE AND INTEGRATION • NUMERICAL ANALYSIS • SIGNAL PROCESSING • WAVELETS

BIBLIOGRAPHY

Briggs, W. L., and Henson, V. E. (1995). “The DFT. An Owner’s Manual,” SIAM, Philadelphia.
Daubechies, I. (1992). “Ten Lectures on Wavelets,” SIAM, Philadelphia.
Davis, P. J. (1975). “Interpolation and Approximation,” Dover, New York.
Davis, P. J., and Hersh, R. (1982). “The Mathematical Experience,” Houghton Mifflin, Boston.
Fourier, J. (1955). “The Analytical Theory of Heat,” Dover, New York.
Jackson, D. (1941). “Fourier Series and Orthogonal Polynomials,” Math. Assoc. of America, Washington, DC.
Krantz, S. G. (1999). “A Panorama of Harmonic Analysis,” Math. Assoc. of America, Washington, DC.
Mallat, S. (1998). “A Wavelet Tour of Signal Processing,” Academic Press, New York.
Malvar, H. S. (1992). “Signal Processing with Lapped Transforms,” Artech House, Norwood.
Meyer, Y. (1992). “Wavelets and Operators,” Cambridge Univ. Press, Cambridge.
Rudin, W. (1986). “Real and Complex Analysis,” 3rd edition, McGraw-Hill, New York.
Walker, J. S. (1988). “Fourier Analysis,” Oxford Univ. Press, Oxford.
Walker, J. S. (1996). “Fast Fourier Transforms,” 2nd edition, CRC Press, Boca Raton.
Walker, J. S. (1999). “A Primer on Wavelets and their Scientific Applications,” CRC Press, Boca Raton.
Walter, G. G. (1994). “Wavelets and Other Orthogonal Systems with Applications,” CRC Press, Boca Raton.
Zygmund, A. (1968). “Trigonometric Series,” Cambridge Univ. Press, Cambridge.
P1: GLQ Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN006H-259 June 28, 2001 20:1

Fractals
Benoit B. Mandelbrot
Michael Frame
Yale University

I. Scale Invariance
II. The Generic Notion of Fractal Dimension and a Few Specific Implementations
III. Algebra of Dimensions and Latent Dimensions
IV. Methods of Computing Dimension in Mathematical Fractals
V. Methods of Measuring Dimension in Physical Systems
VI. Lacunarity
VII. Fractal Graphs and Self-Affinity
VIII. Fractal Attractors and Repellers of Dynamical Systems
IX. Fractals and Differential or Partial Differential Equations
X. Fractals in the Arts and in Teaching

GLOSSARY

Dimension An exponent characterizing how some aspect (mass, number of boxes in a covering, etc.) of an object scales with the size of the object.
Lacunarity A measure of the distribution of hole sizes of a fractal. The prefactor in the mass–radius scaling is one such measure.
Self-affine fractal A shape consisting of smaller copies of itself, all scaled by affinities, linear transformations with different contraction ratios in different directions.
Self-similar fractal A shape consisting of smaller copies of itself, all scaled by similitudes, linear transformations with the same contraction ratios in every direction.

FRACTALS have a long history: after they became the object of intensive study in 1975, it became clear that they had been used worldwide for millennia as decorative patterns. About a century ago, their appearance in pure mathematics had two effects. It led to the development of tools like fractal dimensions, but marked a turn toward abstraction that contributed to a deep and long divide between mathematics and physics. Quite independently from fundamental mathematical physics as presently defined, fractal geometry arose in equal parts from an awareness of past mathematics and a concern for practical, mundane questions long left aside for lack of proper tools.

The mathematical input ran along the lines described by John von Neumann: “A large part of mathematics which became useful developed with absolutely no desire to be useful . . . This is true for all science. Successes were largely due to . . . relying on . . . intellectual elegance. It was by following this rule that one actually got ahead in the long run, much better than any strictly utilitarian course would have permitted . . . The principle of laissez-faire has led to strange and wonderful results.”

The influence of mundane questions grew to take on far more importance than was originally expected, and recently revealed itself as illustrating a theme that is common
in science. Every science started as a way to organize a large collection of messages our brain receives from our senses. The difficulty is that most of these messages are very complex, and a science can take off only after it succeeds in identifying special cases that allow a workable first step. For example, acoustics did not take its first step with chirps or drums but with idealized vibrating strings. These led to sinusoids and constants or other functions invariant under translation in time. For the notion of roughness, no proper measure was available only 20 years ago. The claim put forward forcibly in Mandelbrot (1982) is that a workable entry is provided by rough shapes that are dilation invariant. These are fractals.

Fractal roughness proves to be ubiquitous in the works of nature and man. Those works of man range from mathematics and the arts to the Internet and the financial markets. Those works of nature range from the cosmos to carbon deposits in diesel engines. A sketchy list would be useless and a complete list, overwhelming. The reader is referred to Frame and Mandelbrot (2001) and to a Panorama mentioned therein, available on the web. This essay is organized around the mathematics of fractals, and concrete examples as illustrations of it.

To avoid the need to discuss the same topic twice, mathematical complexity is allowed to fluctuate up and down. The reader who encounters paragraphs of oppressive difficulty is urged to skip ahead until the difficulty becomes manageable.

I. SCALE INVARIANCE

A. On Choosing a “Symmetry” Appropriate to the Study of Roughness

The organization of experimental data into simple theoretical models is one of the central works of every science; invariances and the associated symmetries are powerful tools for uncovering these models. The most common invariances are those under Euclidean motions: translations, rotations, reflections. The corresponding ideal physics is that of uniform or uniformly accelerated motion, uniform or smoothly varying pressure and density, smooth submanifolds of Euclidean physical or phase space. The geometric alphabet is Euclidean, the analytical tool is calculus, the statistics is stationary and Gaussian.

Few aspects of nature or man match these idealizations: turbulent flows are grossly nonuniform; solid rocks are conspicuously cracked and porous; in nature and the stock market, curves are nowhere smooth. One approach to this discrepancy, successful for many problems, is to treat observed objects and processes as “roughened” versions of an underlying smooth ideal. The underlying geometry is Euclidean or locally Euclidean, and observed nature is written in the language of noisy Euclidean geometry.

Fractal geometry was invented to approach roughness in a very different way. Under magnification, smooth shapes are more and more closely approximated by their tangent spaces. The more they are magnified, the simpler (“better”) they look. Over some range of magnifications, looking more closely at a rock or a coastline does not reveal a simpler picture, but rather more of the same kind of detail. Fractal geometry is based on this ubiquitous scale invariance. “A fractal is an object that doesn’t look any better when you blow it up.” Scale invariance is also called “symmetry under magnification.”

A manifestation is that fractals are sets (or measures) that can be broken up into pieces, each of which closely resembles the whole, except it is smaller. If the pieces scale isotropically, the shape is called self-similar; if different scalings are used in different directions, the shape is called self-affine.

There are deep relations between the geometry of fractal sets and the renormalization approach to critical phenomena in statistical physics.

B. Examples of Self-Similar Fractals

1. Exact Linear Self-Similarity

A shape $S$ is called exactly (linearly) self-similar if the whole $S$ splits into the union of parts $S_i$: $S = S_1 \cup S_2 \cup \ldots \cup S_n$. The parts satisfy two restrictions: (a) each part $S_i$ is a copy of the whole $S$ scaled by a linear contraction factor $r_i$, and (b) the intersections between parts are empty or “small” in the sense of dimension. Anticipating Section II, if $i \ne j$, the fractal dimension of the intersection $S_i \cap S_j$ must be lower than that of $S$. The roughness of these sets is characterized by the similarity dimension $d$. In the special equiscaling case $r_1 = \cdots = r_n = r$, $d = \log(n)/\log(1/r)$. In general, $d$ is the solution of the Moran equation

$$\sum_{i=1}^{n} r_i^{d} = 1.$$

More details are given in Section II.

Exactly self-similar fractals can be constructed by several elegant mathematical approaches.

a. Initiator and generator. An initiator is a starting shape; a generator is a juxtaposition of scaled copies of the initiator. Replacing the smaller copies of the initiator in the generator with scaled copies of the generator sets in motion a process whose limit is an exactly self-similar fractal. Stages before reaching the limit are called protofractals. Each copy is anchored by a fixed point, and one may have to specify the orientation of each replacement. The

Sierpinski gasket (Fig. 1) is an example. The eye spontaneously splits the whole $S$ into parts. The simplest split yields N = 3 parts $S_i$, each a copy of the whole reduced by a similitude of ratio 1/2 and with fixed point at a vertex of the initiator. In finer subdivisions, $S_i \cap S_j$ is either empty or a single point, for which d = 0. In this example, but not always, it can be made empty by simply erasing the topmost point of every triangle in the construction.

FIGURE 1 Construction of the Sierpinski gasket. The initiator is a filled-in equilateral triangle, and the generator (on the left) is made of N = 3 triangles, each obtained from the initiator by a contraction map $T_i$ of reduction ratio r = 1/2. The contractions’ fixed points are the vertices of the initiator. The middle shows the second stage, replacing each copy of the initiator with a scaled copy of the generator. On the right is the seventh stage of the construction.

Some of the most familiar fractals were originally constructed to provide instances of curves that exemplify properties deemed counterintuitive: classical curves may have one multiple point (like Fig. 8) or a few. To the contrary, the Sierpinski gasket (Fig. 2, far left) is a curve with dense multiple points. The Sierpinski carpet (Fig. 2, mid left) is a universal curve in the sense that one can embed in the carpet every plane curve, irrespective of the collection of its multiple points. The Peano curve [initiator the diagonal segment from (0, 0) to (1, 1), generator in Fig. 2 mid right] is actually not a curve but a motion. It is plane-filling: a continuous onto map [0, 1] → [0, 1] × [0, 1].

b. Iterated function systems. Iterated function systems (IFS) are a formalism for generating exactly self-similar fractals based on work of Hutchinson (1981) and Mandelbrot (1982), and popularized by Barnsley (1988). IFS are the foundation of a substantial industry of image compression. The basis is a (usually) finite collection $\{T_1, \ldots, T_n\}$ of contraction maps $T_i : \mathbb{R}^n \to \mathbb{R}^n$ with contraction ratios $r_i < 1$. Each $T_i$ is assigned a probability $p_i$ that serves, at each (discrete) instant of time, to select the next map to be used. An IFS attractor also can be viewed as the limit set of the orbit $O^+(x_0)$ of any point $x_0$ under the action of the semigroup generated by $\{T_1, \ldots, T_n\}$.

The formal definition of IFS, which is delicate and technical, proceeds as follows. Denoting by $\mathcal{K}$ the set of nonempty compact subsets of $\mathbb{R}^n$ and by $h$ the Hausdorff metric on $\mathcal{K}$ [$h(A, B) = \inf\{\delta : A \subset B_\delta \text{ and } B \subset A_\delta\}$, where $A_\delta = \{x \in \mathbb{R}^n : d(x, y) \le \delta \text{ for some } y \in A\}$ is the δ-thickening of $A$, and $d$ is the Euclidean metric], the $T_i$ together define a transformation $T : \mathcal{K} \to \mathcal{K}$ by

$$T(A) = \bigcup_{i=1}^{n} \{T_i(x) : x \in A\},$$

a contraction in the Hausdorff metric with contraction ratio $r = \max\{r_1, \ldots, r_n\}$. Because $(\mathcal{K}, h)$ is complete, the contraction mapping principle guarantees there is a unique fixed point $C$ of $T$. This fixed point is the attractor of the IFS $\{T_1, \ldots, T_n\}$. Moreover, for any $K \in \mathcal{K}$, the sequence $K, T(K), T^2(K), \ldots$ converges to $C$ in the sense that $\lim_{n \to \infty} h(T^n(K), C) = 0$.

The IFS inverse problem is, for a given compact set $A$ and given tolerance δ > 0, to find a set of transformations $\{T_1, \ldots, T_n\}$ with attractor $C$ satisfying $h(A, C) < \delta$. The search for efficient algorithms to solve the inverse problem is the heart of fractal image compression. Detailed discussions can be found in Barnsley and Hurd (1993) and Fischer (1995).

2. Exact Nonlinear Self-Similarity

A broader class of fractals is produced if the decomposition of $S$ into the union $S = S_1 \cup S_2 \cup \ldots \cup S_n$ allows the $S_i$ to be the images of $S$ under nonlinear transformations.

a. Quadratic Julia sets. For fixed complex number $c$, the “quadratic orbit” of the starting complex number $z$ is a sequence of numbers that begins with $f_c(z) = z^2 + c$, then $f_c^2(z) = (f_c(z))^2 + c$ and continues by following the rule $f_c^n(z) = f_c(f_c^{n-1}(z))$. The filled-in (quadratic) Julia set consists of the starting points that do not iterate to infinity, formally, the points $\{z : f_c^n(z) \text{ remains bounded as } n \to \infty\}$. The (quadratic) Julia set $J_c$ is the boundary of the filled-in Julia set. Figure 3 shows the Julia set $J_c$ for c = 0.4 + 0 · i and the filled-in Julia set for c = −0.544 + 0.576 · i. The latter has an attracting 5-cycle, the black region is the basin of attraction of the

FIGURE 2 The Sierpinski gasket, Sierpinski carpet, the Peano curve generator, and the fourth stage of the Peano curve.
FIGURE 3 The Julia set of $z^2 + 0.4$ (left) and the filled-in Julia set for $z^2 − 0.544 + 0.576 · i$ (right).
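The gasket of Figure 2 and the filled-in Julia set of Figure 3 can both be explored numerically. The Python sketch below is illustrative only. The function names, the equal probabilities $p_i = 1/3$, the iteration cap, and the escape radius max(2, |c|) are assumptions introduced here, not part of the article. The first function runs the random-selection form of the gasket IFS (three maps of ratio 1/2 fixing the vertices of the initiator); the second is a finite-iteration test for whether a quadratic orbit remains bounded, which can only give evidence of boundedness, never proof.

```python
import random

# Vertices of the initiator triangle. Each map T_i contracts toward one
# vertex with ratio 1/2, so T_i(p) = (p + v_i) / 2.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]


def chaos_game(n_points, seed=0):
    """Random orbit of the gasket IFS.

    At each step one map T_i is chosen with probability 1/3 (an assumed
    choice of the p_i) and applied to the current point.
    """
    rng = random.Random(seed)
    x, y = VERTICES[0]  # a fixed point of T_1, hence already on the attractor
    points = []
    for _ in range(n_points):
        vx, vy = rng.choice(VERTICES)
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


def stays_bounded(z, c, max_iter=200):
    """Heuristic membership test for the filled-in Julia set of z^2 + c.

    Once |z| exceeds max(2, |c|) the orbit must tend to infinity.
    Surviving max_iter steps is treated as "bounded" (an approximation).
    """
    radius = max(2.0, abs(c))
    for _ in range(max_iter):
        if abs(z) > radius:
            return False
        z = z * z + c
    return True
```

Plotting the output of `chaos_game` gives an approximation of the gasket; evaluating `stays_bounded` on a grid of starting points `z` for a fixed `c` gives an approximation of a filled-in Julia set.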

5-cycle, and the Julia set is the boundary of the black region. Certainly, $J_c$ is invariant under $f_c$ and under the inverses of $f_c$, $f_{c+}^{-1}(z) = \sqrt{z - c}$ and $f_{c-}^{-1}(z) = -\sqrt{z - c}$. Polynomial functions allow several equivalent characterizations: $J_c$ is the closure of the set of repelling periodic points of $f_c(z)$ and $J_c$ is the attractor of the nonlinear IFS $\{f_{c+}^{-1}, f_{c-}^{-1}\}$.

Much is known about Julia sets of quadratic functions. For example, McMullen proved that at a point whose rotation number has periodic continued-fraction expansion, the J set is asymptotically self-similar about the critical point.

The J sets are defined for functions more general than polynomials. Visually striking and technically interesting examples correspond to the Newton function $N_f(z) = z - f(z)/f'(z)$ for polynomial families $f(z)$, or entire functions like λ sin z, λ cos z, or λ exp z (see Section VIII.B). Discussions can be found in Blanchard (1994), Curry et al. (1983), Devaney (1994), Keen (1994), and Peitgen (1989).

b. The Mandelbrot set. The quadratic orbit $f_c^n(z)$ always converges to infinity for large enough values of $z$. Mandelbrot attempted a computer study of the set $M^0$ of those values of $c$ for which the orbit does not converge to infinity, but to a stable cycle. This approach having proved unrewarding, he moved on to a set that promised an easier calculation and proved spectacular. Julia and Fatou, building on fundamental work of Montel, had shown that the Julia set $J_c$ of $f_c(z) = z^2 + c$ must be either connected or totally disconnected. Moreover, $J_c$ is connected if, and only if, the orbit $O^+(0)$ of the critical point z = 0 remains bounded. The set $M$ defined by $\{c : f_c^n(0) \text{ remains bounded}\}$ is now called the Mandelbrot set (see the left side of Fig. 4). Mandelbrot (1980) performed a computer investigation of its structure and reported several observations. As is now well known, small copies of the M set are infinitely numerous and dense in its boundary. The right side of Fig. 4 shows one such small copy, a nonlinearly distorted copy of the whole set. Although the small copy on the right side of Fig. 4 appears to be an isolated “island,” Mandelbrot conjectured and Douady and Hubbard (1984) proved that the M set is connected. Sharpening an observation by Mandelbrot, Tan Lei (1984) proved the convergence of appropriate magnifications of Julia sets and the M set at certain points named after Misiurewicz. Shishikura (1994) proved Mandelbrot’s (1985) and Milnor’s (1989) conjecture that the boundary of the M set has Hausdorff dimension 2. Lyubich proved that the boundary of the M set is asymptotically self-similar about the Feigenbaum point.

FIGURE 4 Left: The Mandelbrot set. Right: A detail of the Mandelbrot set showing a small copy of the whole. Note the nonlinear relation between the whole and the copy.

Mandelbrot’s first conjecture, that the interior of the M set consists entirely of components (called hyperbolic) for which there is a stable cycle, remains unproved in general, though McMullen (1994) proved it for all such components that intersect the real axis. Mandelbrot’s notion that $M$ may be the closure of $M^0$ is equivalent to the assertion that the M set is locally connected. Despite intense efforts, that assertion remains a conjecture, though Yoccoz and others have made progress.

Other developments include the theory of quadratic-like maps (Douady and Hubbard, 1985), implying the universality and ubiquity of the M set. This result was presaged by the discovery (Curry et al., 1983) of a Mandelbrot set in the parameter space of Newton’s method for a family of cubic polynomials.

The recent book by Tan Lei (2000) surveys current results and attests to the vitality of this field.

c. Circle inversion limit sets. Inversion $I_C$ in a circle $C$ with center $O$ and radius $r$ transforms a point $P$ into the point $P'$ lying on the ray $OP$ and with $d(O, P) \cdot d(O, P') = r^2$. This is the orientation-reversing involution defined on $\mathbb{R}^2 \cup \{\infty\}$ by $P \to I_C(P) = P'$. Inversion in $C$ leaves $C$ fixed, and interchanges the interior and exterior of $C$. It contracts the “outer” component not containing $O$, but the contraction ratio is not bounded by any r < 1.

Poincaré generalized from inversion in one circle to a collection of more than one inversion. As an example, consider a collection of circles $C_1, \ldots, C_N$ each of which is external to all the others. That is, for all $j \ne i$, the disks bounded by $C_i$ and $C_j$ have disjoint interiors. The limit set $\Lambda(C_1, \ldots, C_N)$ of inversion in these circles is the set of limit points of the orbit $O^+(P)$ of any point $P$, external to $C_1, \ldots, C_N$, under the group generated by $I_{C_1}, \ldots, I_{C_N}$. Equivalently, it is the set left invariant by every one of the inversions $I_{C_1}, \ldots, I_{C_N}$.

The limit set is nearly always fractal but the nonlinearity of inversion guarantees that $\Lambda$ is nonlinearly self-similar. An example is shown in Fig. 5: the part of the limit set inside $C_1$ is easily seen to be the transform by $I_1$ of the part of the limit set inside $C_2$, $C_3$, $C_4$, and $C_5$.

How can one draw the limit set when the arrangement of the circles $C_1, \ldots, C_N$ is more involved? Poincaré’s original algorithm converges extraordinarily slowly. The first alternative algorithm was advanced in Mandelbrot

(1982, Chapter 18); it is intuitive and the large-scale features of $\Lambda$ appear very rapidly, followed by increasingly fine features down to any level a computer’s memory can support.

FIGURE 5 Left: The limit set generated by inversion in the five circles $C_1, \ldots, C_5$. Right: A magnification of the limit set.

d. Kleinian group limit sets. A Kleinian group (Beardon, 1983; Maskit, 1988) is a discrete group of Möbius transformations

$$z \to \frac{az + b}{cz + d}$$

acting on the Riemann sphere $\hat{\mathbb{C}}$, the sphere at infinity of hyperbolic 3-space $\mathbb{H}^3$. The isometries of $\mathbb{H}^3$ can be represented by complex matrices

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$

[More precisely, by their equivalence classes in $PSL_2(\mathbb{C})$.] Sullivan’s side-by-side dictionary (Sullivan, 1985) between Kleinian groups and iterates of rational maps is another deep mathematical realm informed, at least in part, by fractal geometry. Thurston’s “geometrization program” for 3-manifolds (Thurston, 1997) involves giving many 3-manifolds hyperbolic structures by viewing them as quotients of $\mathbb{H}^3$ by the action of a Kleinian group $G$ (Epstein, 1986). The corresponding action of $G$ on $\hat{\mathbb{C}}$ determines the limit set $\Lambda(G)$, defined as the intersection of all nonempty G-invariant subsets of $\hat{\mathbb{C}}$. For many $G$, the limit set is a fractal. An example gives the flavor of typical results: the limit set of a finitely generated Kleinian group is either totally disconnected, a circle, or has Hausdorff dimension greater than 1 (Bishop and Jones, 1997). The Hausdorff dimension of the limit set has been studied by Beardon, Bishop, Bowen, Canary, Jones, Keen, Mantica, Maskit, McMullen, Mumford, Parker, Patterson, Sullivan, Tricot, Tukia, and many others. Poincaré exponents, eigenvalues of the Laplacian, and entropy of geodesic flows are among the tools used.

Figure 5 brings forth a relation between some limit sets of inversions or Kleinian groups and Apollonian packings (Keen et al., 1993). In fact, the limit sets of the Kleinian groups that are in the Maskit embedding (Keen and Series, 1993) of the Teichmüller space of any finite-type Riemann surface are Apollonian packings. These correspond to hyperbolic 3-manifolds having totally geodesic boundaries. McShane et al. (1994) used automatic group theory to produce efficient pictures of these limit sets, and Parker (1995) showed that in many cases the Hausdorff dimension of the limit set equals the circle packing exponent, easily estimated as the slope of the log–log plot of the number of circles of radius ≥ r (y axis) versus r (x axis).

Limit sets of Kleinian group actions are an excellent example of a deep, subtle, and very active area of pure mathematics in which fractals play a central role.

3. Statistical Self-Similarity

A tree’s branches are not exact shrunken copies of that tree, inlets in a bay are not exact shrunken copies of that bay, nor is each cloud made up of exact smaller copies of that cloud. To justify the role of fractal geometry as a geometry of nature, one must take a step beyond exact self-similarity (linear or otherwise). Some element of randomness appears to be present in many natural objects and processes. To accommodate this, the notions of self-similarity and self-affinity are made statistical.

a. Wiener Brownian motion: its graphs and trails. The first example is classical. It is one-dimensional Brownian motion, the random process $X(t)$ defined by these properties: (1) with probability 1, $X(0) = 0$ and $X(t)$ is continuous, and (2) the increments $X(t + \Delta t) - X(t)$ of $X(t)$ are Gaussian with mean 0 and variance $\Delta t$. That is,

$$\Pr\{X(t + \Delta t) - X(t) \le x\} = \frac{1}{\sqrt{2\pi \Delta t}} \int_{-\infty}^{x} \exp\left(\frac{-u^2}{2\Delta t}\right) du.$$

An immediate consequence is independence of increments over disjoint intervals. A fundamental property of Brownian motion is statistical self-affinity: for all s > 0,

$$\Pr\{X(s(t + \Delta t)) - X(st) \le \sqrt{s}\,x\} = \Pr\{X(t + \Delta t) - X(t) \le x\}.$$

That is, rescaling $t$ by a factor of $s$, and $x$ by a factor of $\sqrt{s}$, leaves the distribution unchanged. This correct rescaling is shown on the left panel of Fig. 6: $t$ (on the horizontal axis) is scaled by 4, $x$ (on the vertical axis) is scaled by $2 = 4^{1/2}$. Note that this magnification has about the same degree of roughness as the full picture. In the center panel, $t$ is scaled by 4, $x$ by 4/3; the magnification is flatter than the original. In the right panel, both $t$ and

FIGURE 6 Left panel: Correct rescaling illustrating the self-affinity of Brownian motion. Center and right panels: Two incorrect rescalings.
FIGURE 8 A Brownian trail. Right: The first quarter of the left trail, magnified and with additional turns interpolated so the left and right pictures have about the same number of turns.
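The correct rescaling of Figure 6 (time scaled by 4, displacement scaled by 2) can be checked on a simulated path. The Python sketch below is a rough numerical illustration, not a proof: it approximates Brownian motion by a Gaussian random walk on a grid, and the function names, grid spacing, and seed are assumptions made here. If increments over a lag of $s$ grid steps have variance $s$ times that of single steps, then dividing displacements by $\sqrt{s}$ restores the original distribution.

```python
import random
import statistics


def brownian_path(n_steps, dt, seed=0):
    """Discrete approximation of X(t): sums of independent Gaussian
    increments with mean 0 and variance dt."""
    rng = random.Random(seed)
    sd = dt ** 0.5
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, sd))
    return path


def increment_variance(path, lag):
    """Variance of the non-overlapping increments X(t + lag*dt) - X(t)."""
    incs = [path[i + lag] - path[i] for i in range(0, len(path) - lag, lag)]
    return statistics.pvariance(incs)


path = brownian_path(200_000, dt=1.0, seed=1)
v1 = increment_variance(path, 1)   # close to dt = 1
v4 = increment_variance(path, 4)   # close to 4 * dt, so x scales by 2
```

Scaling $x$ by 4/3 or by 4 instead of 2, as in the center and right panels, would leave rescaled increments whose variance differs from that of the original path.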
$x$ are scaled by 4; the magnification is steeper than the original.

A sequence of increments of Brownian motion is called Gaussian white noise. Even casual inspection of the graph reveals some fundamental features. The width of an old pen-plotter line being equal to the spacing between successive difference values, the bulk of the difference plot merges into a “band” with the following properties (see Fig. 7):

• The band’s width is approximately constant.
• The values beyond that band stay close to it (this is due to the fact that the Gaussian has “short tails”).
• The values beyond that band do not cluster.

FIGURE 7 Plot of 4000 successive Brownian increments.

Positioning $E$ independent one-dimensional Brownian motions along $E$ coordinate axes gives a higher dimensional Brownian motion: $B(t) = \{X_1(t), \ldots, X_E(t)\}$. Plotted as a curve in E-dimensional space, the collection of points that $B(t)$ visits between t = 0 and t = 1 defines a Brownian trail.

When E > 2, this is an example of a statistically self-similar fractal. To split the Brownian trail into $N$ reduced-scale parts, pick $N - 1$ arbitrary instants $t_n$ with $0 = t_0 < t_1 < \cdots < t_{N-1} < t_N = 1$. The Brownian trail for 0 ≤ t ≤ 1 splits into $N$ subtrails $B_n$ for the interval $t_{n-1} < t < t_n$. The parts $B_n$ follow the same statistical distribution as the whole, after the size is expanded by $(t_n - t_{n-1})^{-1/2}$ in every direction.

Due to the definition of self-similarity, this example reveals a pesky complication: for $i \ne j$, $B_i \cap B_j$ must be of dimension less than $B$. This is indeed the case if E > 2, but not in the plane E = 2. However, the overall idea can be illustrated for E = 2. The right side of Fig. 8 shows $B_1(t)$ for 0 ≤ t ≤ 1/4, expanded by a factor of $2 = (1/4 - 0)^{-1/2}$ and with additional points interpolated so the part and the whole exhibit about the same number of turns. The details for E = 2 are unexpectedly complex, as shown in Mandelbrot (2001b, Chapter 3).

The dimensions (see Sections II.B and II.E) of some Brownian constructions are well-known, at least in most cases. For example, with probability 1:

• For E ≥ 2 a Brownian trail $B: [0, 1] \to \mathbb{R}^E$ has Hausdorff and box dimensions $d_H = d_{box} = 2$, respectively.
• The graph of one-dimensional Brownian motion $B: [0, 1] \to \mathbb{R}$ has $d_H = d_{box} = 3/2$.

Some related constructions have been more resistant to theoretical analysis. Mandelbrot’s planar Brownian cluster is the graph of the complex $B(t)$ constrained to satisfy $B(0) = B(1)$. It can be constructed by linearly detrending the x- and y-coordinate functions: $(X(t) - tX(1), Y(t) - tY(1))$. See Fig. 9. The cluster is known to have dimension 2. Visual inspection supported by computer experiments led to the 4/3 conjecture, which asserts that the boundary of the cluster has dimension 4/3 (Mandelbrot, 1982, p. 243). This has been proved by Lawler et al. (2000).

FIGURE 9 A Brownian cluster.

Brownian motion is the unique stationary random process with increments independent over disjoint intervals and with finite variance. For many applications, these conditions are too restrictive, drawing attention to other random processes that retain scaling but abandon either independent increments or finite variance.

b. Fractional Brownian motion. For fixed 0 < H < 1, fractional Brownian motion (FBM) of exponent $H$ is a random process $X(t)$ with increments $X(t + \Delta t) - X(t)$ following the Gaussian distribution with mean 0 and standard deviation $(\Delta t)^H$. Statistical self-affinity is straightforward: for all s > 0

$$\Pr\{X(s(t + \Delta t)) - X(st) \le s^H x\} = \Pr\{X(t + \Delta t) - X(t) \le x\}.$$

The correlation is the expected value of the product of successive increments. It equals


$$E\big((X(t) - X(0)) \cdot (X(t + h) - X(t))\big) = \tfrac{1}{2}\big((t + h)^{2H} - t^{2H} - h^{2H}\big).$$

If H = 1/2, this correlation vanishes and the increments are independent. In fact, FBM reduces to Brownian motion. If H > 1/2, the correlation is positive, so the increments tend to have the same sign. This is persistent FBM. If H < 1/2, the correlation is negative, so the increments tend to have opposite signs. This is antipersistent FBM. See Fig. 10. The exponent determines the dimension of the graph of FBM: with probability 1, $d_H = d_{box} = 2 - H$. Notice that for H > 1/2, the central band of the difference plot moves up and down, a sign of long-range correlation, but the outliers still are small. Figure 11 shows the trails of these three flavors of FBM. FBM is the main topic of Mandelbrot (2001c).

FIGURE 10 Top: Fractional Brownian motion simulations with H = 0.25, H = 0.5, and H = 0.75. Bottom: Difference plots X(t + 1) − X(t) of the graphs above.

c. Lévy stable processes. While FBM introduces correlations, its increments remain Gaussian and so have small outliers. The Gaussian distribution is characterized by its first two moments (mean and variance), but some natural phenomena appear to have distributions for which these are not useful indicators. For example, at the critical point of percolation there are clusters of all sizes and the expected cluster size diverges.

Paul Lévy studied random walks for which the jump distributions follow the power law $\Pr\{X > x\} \approx x^{-\alpha}$. There is a geometrical approach for generating examples of Lévy processes.

The unit step function $\xi(t)$ is defined by

$$\xi(t) = \begin{cases} 0 & \text{for } t < 0 \\ 1 & \text{for } t \ge 0 \end{cases}$$

and a (one-dimensional) Lévy stable process is defined as a sum

$$f(t) = \sum_{k=1}^{\infty} \lambda_k\, \xi(t - t_k),$$

where the pulse times $t_n$ and amplitudes $\lambda_n$ are chosen according to the following Lévy measure: given $t$ and $\lambda$, the probability of choosing $(t_i, \lambda_i)$ in the rectangle $t < t_i < t + dt$, $\lambda < \lambda_i < \lambda + d\lambda$ is $C \lambda^{-\alpha - 1}\, d\lambda\, dt$. Figure 12 shows the graph of a Lévy process or flight, and a graph of its increments.

FIGURE 12 Lévy flight on the line. Left: the graph as a function of time. Right: the increments.

Comparing Figs. 7, 10, and 12 illustrates the power of the increment plot for revealing both global correlations (FBM) and long tails (Lévy processes).

The effect of large excursions in Lévy processes is more visible in the plane. See Fig. 13. These Lévy flights were used in Mandelbrot (1982, Chapter 32) to mimic the statistical properties of galaxy distributions.

Using fractional Brownian motion and Lévy processes, Mandelbrot (in 1965 and 1963) improved upon Bachelier’s Brownian model of the stock market. The former corrects the independence of Brownian motion, the latter corrects its short tails. The original and corrected processes in the preceding sentence are statistically self-affine random fractal processes. This demonstrates the power of invariances in financial modeling; see Mandelbrot (1997a,b).

d. Self-affine cartoons with mild to wild randomness. Many natural processes exhibit long tails or global dependence or both, so it was a pleasant surprise that both can be incorporated in an elegant family of simple cartoons (Mandelbrot, 1997a, Chapter 6; 1999, Chapter N1; 2001a). As for self-similar curves (Section I.B.1.a), the basic construction of the cartoon involves an initiator and a

FIGURE 11 Top: Fractional Brownian motion simulations with H = 0.25, H = 0.5, and H = 0.75. Bottom: Difference plots X(t + 1) − X(t) of the graphs above.
FIGURE 13 Left: Trail of the Lévy flight in the plane. Right: The Lévy dust formed by the turning points.
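A planar trail of the kind shown in Figure 13 can be imitated with jump lengths drawn from a power law. The Python sketch below is a minimal illustration under stated assumptions: it uses an exact Pareto law $\Pr\{X > x\} = x^{-\alpha}$ for x ≥ 1 (the article specifies only the asymptotic tail), isotropic jump directions, and an arbitrary exponent α = 1.5. The function names are hypothetical.

```python
import math
import random


def pareto_jump(alpha, rng):
    """Jump length with Pr{X > x} = x**(-alpha) for x >= 1,
    sampled by inverting the tail probability."""
    u = 1.0 - rng.random()          # uniform on (0, 1], never zero
    return u ** (-1.0 / alpha)


def levy_flight(n_steps, alpha=1.5, seed=0):
    """Planar trail: each step has a power-law length and a direction
    chosen uniformly on the circle (both are modelling assumptions)."""
    rng = random.Random(seed)
    x = y = 0.0
    trail = [(x, y)]
    for _ in range(n_steps):
        r = pareto_jump(alpha, rng)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += r * math.cos(theta)
        y += r * math.sin(theta)
        trail.append((x, y))
    return trail
```

The turning points returned by `levy_flight` play the role of the Lévy dust; occasional very long jumps separate the dust into clusters, which is the long-tail effect that the increment plots make visible.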

generator. The process used to generate the graph consists in replacing each copy of the initiator with an appropriately rescaled copy of the generator. For a Brownian cartoon, the initiator can be the diagonal of the unit square, and the generator, the broken line with vertices (0, 0), (4/9, 2/3), (5/9, 1/3), and (1, 1). Pictured in Fig. 14 are the initiator (left), generator (middle), and first iteration of the process (right).

FIGURE 14 The initiator (left), generator (middle), and first generation (right) of the Brownian cartoon.

To get an appreciation for how quickly the local roughness of these pictures increases, the left side of Fig. 15 shows the sixth iterate of the process.

Self-affinity is built in because each piece is an appropriately scaled version of the whole. In Fig. 14, the scaling ratios have been selected to match the “square root” property of Brownian motion: for each segment of the generator we have $|\Delta x_i| = (\Delta t_i)^{1/2}$.

More generally, a cartoon is called unifractal if there is a constant $H$ with $|\Delta x_i| = (\Delta t_i)^{H}$ for each generator segment, where 0 < H < 1. If different $H$ are needed for different segments, the cartoon is multifractal.

The left side of Figure 15 is too symmetric to mimic any real data, but this problem is palliated by shuffling the order in which the three pieces of the generator are put into each scaled copy. The right side of Fig. 15 shows a Brownian cartoon randomized in this way.

FIGURE 15 Left: The sixth iterate of the process of Fig. 14. Right: A sixth iterate of a randomized Brownian cartoon.

Figure 16 illustrates how the statistical properties of the increments can be modified by adjusting the generator in a symmetrical fashion. Keeping fixed the endpoints (0, 0) and (1, 1), the middle turning points are changed into (a, 2/3) and (1 − a, 1/3) for 0 < a ≤ 1/2.

e. Percolation clusters (Stauffer and Aharony, 1992). Given a square lattice of side length $L$ and a number p ∈ [0, 1], assign a random number x ∈ [0, 1] to each lattice cell and fill the cell if x ≤ p. A cluster is a maximal collection of filled cells, connected by sharing common edges. Three examples are shown in Fig. 17. A spanning cluster connects opposite sides of the lattice. For large $L$ there is a critical probability or percolation threshold $p_c$; spanning clusters do not arise for $p < p_c$. Numerical experiments suggest $p_c \approx 0.59275$. In Fig. 17, p = 0.4, 0.6, and 0.8. Every lattice has its own $p_c$.

At $p = p_c$ the masses of the spanning clusters scale with the lattice size $L$ as $L^d$, independently of the lattices. Experiment yields d = 1.89 ± 0.03, and theory yields d = 91/48. This $d$ is the mass dimension of Section II.C. In addition, spanning clusters have holes of all sizes; they are statistically self-similar fractals.

Many fractals are defined as part of a percolation cluster. The backbone is the subset of the spanning cluster that remains after removing all parts that can be separated from both spanned sides by removing a single filled cell from the spanning cluster. Numerical estimates suggest the backbone has dimension 1.61. The backbone is the path followed by a fluid diffusing through the lattice.

The hull of a spanning cluster is its boundary. It was observed by R. F. Voss in 1984 and proven by B. Duplantier that the hull’s dimension is 7/4.

A more demanding definition of the boundary yields the perimeter. It was observed by T. Grossman and proven by B. Duplantier that the perimeter’s dimension is 4/3.

Sapoval et al. (1985) examined discrete diffusion and showed that it involves a fractal diffusion front that can be modeled by the hull and the perimeter of a percolation cluster.

f. Diffusion-limited aggregation (DLA; Vicsek, 1992). DLA was proposed by Witten and Sander (1981, 1983) to simulate the aggregates that carbon particles form in a diesel engine. On a grid of square cells, a cartoon of DLA begins by occupying the center of the grid with a “seed particle.” Next, place a particle in a square selected at random on the edge of a large circle centered on the seed square and let it perform a simple random walk. With each tick of the clock, with equal probabilities it will move to an adjacent square, left, right, above, or below. If the moving particle wanders too far from the seed, it falls off the edge of the grid and another wandering particle is started at a randomly chosen edge point. When a wandering particle reaches one of the four squares adjacent to the seed, it sticks to form a cluster of two particles, and another moving particle is released. When a moving particle reaches a square adjacent to the cluster, it sticks there. Continuing in this way builds an arbitrarily large object called a diffusion-limited aggregate (DLA) because the growth of the cluster is governed by the particles’ diffusing across the grid. Figure 18 shows a moderate-size DLA cluster.

Early computer experiments on clusters of up to $10^4$ particles showed that the mass $M(r)$ of the part of the cluster within a distance $r$ of the seed point scales as $M(r) \approx k \cdot r^d$,

with $d \approx 1.71$ for clusters in the plane and $d \approx 2.5$ for clusters in space. This exponent $d$ is the mass dimension of the cluster. (See Sections II.C and V.) These values match measured scalings of physical objects moderately, but not terribly well. A careful examination of much larger clusters revealed discrepancies that added in due time to a very complex picture of DLA. Mandelbrot et al. (1995) investigated clusters in the $10^7$ range; careful measurement reveals an additional dimension of $1.65 \pm 0.01$. This suggests the clusters become more compact as they grow. Also, as the cluster grows, more arms develop and the largest gaps decrease in size; i.e., the lacunarity decreases. (See Section VI.)

FIGURE 16 Generators, cartoons and difference graphs for symmetric cartoons with turning points $(a, 2/3)$ and $(1 - a, 1/3)$, for $a = 0.333$, $0.389$, $0.444$, $0.456$, and $0.467$. The same random number seed is used in all graphs.

II. THE GENERIC NOTION OF FRACTAL DIMENSION AND A FEW SPECIFIC IMPLEMENTATIONS

The first, but certainly not the last, step in quantifying fractals is the computation of a dimension. The notion of Euclidean dimension has many aspects and therefore extends in several fashions. The extensions are distinct in the most general cases but coincide for exactly self-similar fractals. Many other dimensions cannot be mentioned here.

A more general approach to quantifying degrees of roughness is found in the article on Multifractals.

A. Similarity Dimension

The definition of similarity dimension is rooted in the fact that the unit cube in $D$-dimensional Euclidean space is self-similar: for any positive integer $b$ the cube can be decomposed into $N = b^D$ cubes, each scaled by the similarity ratio $r = 1/b$, and overlapping at most along $(D - 1)$-dimensional cubes.

The equiscaling or isoscaling case. Provided the pieces do not overlap significantly, the power-law relation $N = (1/r)^D$ between the number $N$ and scaling factor $r$ of the pieces generalizes to all exactly self-similar sets with all pieces scaled by the factor $r$. The similarity dimension $d_{\mathrm{sim}}$ is

$$d_{\mathrm{sim}} = \frac{\log(N)}{\log(1/r)}.$$
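As a quick numerical illustration of the equiscaling formula $d_{\mathrm{sim}} = \log(N)/\log(1/r)$, the following minimal sketch evaluates it for the unit cube and for a few classical self-similar sets. The function name `similarity_dimension` is a hypothetical helper, and the values of $N$ and $r$ used for the Cantor set, Koch curve, and Sierpinski gasket are the standard ones, assumed here rather than read off this article's figures.

```python
import math

def similarity_dimension(n_pieces, ratio):
    """Return log(N) / log(1/r) for N pieces, each scaled by the same ratio r."""
    if n_pieces < 1 or not 0 < ratio < 1:
        raise ValueError("need N >= 1 and 0 < r < 1")
    return math.log(n_pieces) / math.log(1 / ratio)

# Unit cube in D = 3 cut into b**3 subcubes with b = 4: the formula returns D.
cube = similarity_dimension(4 ** 3, 1 / 4)

# Classical examples with their standard (assumed) generators.
cantor = similarity_dimension(2, 1 / 3)   # log 2 / log 3
koch = similarity_dimension(4, 1 / 3)     # log 4 / log 3
gasket = similarity_dimension(3, 1 / 2)   # log 3 / log 2

for name, d in [("cube", cube), ("Cantor set", cantor),
                ("Koch curve", koch), ("Sierpinski gasket", gasket)]:
    print(f"{name}: {d:.4f}")
```

For the cube the result agrees with the Euclidean dimension, which is the consistency check that motivates the definition.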

FIGURE 17 Percolation lattices well below, near, and well above the percolation threshold.

FIGURE 18 A moderate-size DLA cluster.

The pluriscaling case. More generally, for self-similar sets where each piece is scaled by a possibly different factor $r_i$, the similarity dimension is the unique positive root $d$ of the Moran equation

$$\sum_{i=1}^{N} r_i^{d} = 1.$$

The relation $0 \le d_{\mathrm{sim}} \le E$. If the fractal is a subset of $E$-dimensional Euclidean space, $E$ is called the embedding dimension.

So long as the overlap of the parts is not too great (technically, under the open set condition), we have $d_{\mathrm{sim}} \le E$. If at least two of the $r_i$ are positive, we have $d_{\mathrm{sim}} > 0$. However, Section III.F shows that some circumstances introduce a latent dimension $d$, related indirectly to the similarity dimension, and that can satisfy $d < 0$ or $d > E$.

B. Box Dimension

The similarity dimension is meaningful only for exactly self-similar sets. For more general sets, including experimental data, it is often replaced by the box dimension. For any bounded (nonempty) set $A$ in $E$-dimensional Euclidean space, and for any $\delta > 0$, a $\delta$-cover of $A$ is a collection of sets of diameter $\delta$ whose union contains $A$. Denote by $N_\delta(A)$ the smallest number of sets in a $\delta$-cover of $A$. Then the box dimension $d_{\mathrm{box}}$ of $A$ is

$$d_{\mathrm{box}} = \lim_{\delta \to 0} \frac{\log(N_\delta(A))}{\log(1/\delta)}$$

when the limit exists. When the limit does not exist, the replacement of lim with lim sup and lim inf defines the upper and lower box dimensions:

$$\overline{d}_{\mathrm{box}} = \limsup_{\delta \to 0} \frac{\log(N_\delta(A))}{\log(1/\delta)}, \qquad \underline{d}_{\mathrm{box}} = \liminf_{\delta \to 0} \frac{\log(N_\delta(A))}{\log(1/\delta)}.$$

The box dimension can be thought of as measuring how well a set can be covered with small boxes of equal size, because the limit (or lim sup and lim inf) remains unchanged if $N_\delta(A)$ is replaced by the smallest number of $E$-dimensional cubes of side $\delta$ needed to cover $A$, or even the number of cubes of a $\delta$ lattice that intersect $A$.

Section V describes methods of measuring the box dimension for physical datasets.

C. Mass Dimension

The mass $M(r)$ of a $d$-dimensional Euclidean ball of constant density $\rho$ and radius $r$ is given by

$$M(r) = \rho \cdot V(d) \cdot r^d \quad \text{with} \quad V(d) = \frac{\Gamma(1/2)^d}{\Gamma(d/2 + 1)},$$

where $V(d)$ is the volume of the $d$-dimensional unit sphere. That is, for constant-density Euclidean objects, the ordinary dimension, among many other roles, is the exponent relating mass to size. This role motivated the definition of mass dimension for a fractal. The definition of mass is delicate. For example, the mass of a Sierpinski gasket cannot be defined by starting with a triangle of uniform density and removing middle triangles; this process would converge to a mass reduced to 0. One must, instead, proceed as on the left side of Fig. 1: take as initiator a triangle of mass 1, and as generator three triangles each scaled by 1/2 and of mass 1/3. Moreover, two very new facts come up.

First, the $=$ sign in the formula for $M(r)$ must be replaced by $\approx$. That is, $M(r)$ fluctuates around a multiple of $r^d$. For example, as mentioned in Sections I.B.3.e and I.B.3.f, the masses of spanning percolation clusters and diffusion-limited aggregates scale as a power-law function of size. Consequently, the exponent in the relation $M(r) \approx k \cdot r^{d_{\mathrm{mass}}}$ is called the mass dimension.

Second, in the Euclidean case the center is arbitrary but in the fractal case it must belong to the set under consideration. As an example, Fig. 19 illustrates attempts to measure the mass dimension of the Sierpinski gasket. Suppose we take circles centered at the lower left vertex of the initiator, and having radii $1/2, 1/4, 1/8, \ldots$. We obtain $M(1/2^i) = 1/3^i = (1/2^i)^{d_{\mathrm{mass}}}$, where $d_{\mathrm{mass}} = \log 3/\log 2$. See the left side of Fig. 19. That is, the mass dimension agrees with the similarity dimension.

FIGURE 19 Attempts at measuring the mass dimension of a Sierpinski gasket using three families of circles.

On the other hand, if the circle’s center is randomly selected in the interior of the initiator, the passage to the limit $r \to 0$ almost surely eventually stops with circles bounding no part of the gasket. See the middle of Fig. 19. Taking a family of circles with center $c$ a point of the gasket, the mass–radius relation becomes $M(r) = k(r, c) \cdot r^d$, where the prefactor $k(r, c)$ fluctuates and depends on both $r$ and $c$. See the right side of Fig. 19. Even in this case, the exponent is the mass dimension. The prefactor is no longer a constant density, but a random variable, depending on the choice of the origin.
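The mass–radius relation for the gasket can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure used for Fig. 19: it samples the gasket with the chaos game (a technique swapped in here, which spreads mass equally over the three sub-triangles at every level), takes circles of radii $1/2^i$ centered at the lower left vertex, and fits the slope of $\log M(r)$ against $\log r$. The function names, the sample size, and the range of radii are all hypothetical choices.

```python
import math
import random

def gasket_points(n, seed=0):
    """Sample n points of the Sierpinski gasket with the chaos game."""
    rng = random.Random(seed)
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.0, 0.0  # start at a vertex, which belongs to the gasket
    pts = []
    for _ in range(n):
        vx, vy = rng.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2  # move halfway to a random vertex
        pts.append((x, y))
    return pts

def mass_dimension(pts, center, radii):
    """Least-squares slope of log M(r) against log r, M(r) = fraction of points within r."""
    cx, cy = center
    dists = [math.hypot(px - cx, py - cy) for px, py in pts]
    log_r = [math.log(r) for r in radii]
    log_m = [math.log(sum(1 for d in dists if d <= r) / len(pts)) for r in radii]
    mean_r = sum(log_r) / len(radii)
    mean_m = sum(log_m) / len(radii)
    num = sum((a - mean_r) * (b - mean_m) for a, b in zip(log_r, log_m))
    den = sum((a - mean_r) ** 2 for a in log_r)
    return num / den

pts = gasket_points(200_000)
radii = [2.0 ** -i for i in range(1, 7)]          # 1/2, 1/4, ..., 1/64
d_est = mass_dimension(pts, (0.0, 0.0), radii)    # center at the lower left vertex
print(f"estimated d_mass = {d_est:.3f}, log 3 / log 2 = {math.log(3) / math.log(2):.3f}")
```

Because the center is a point of the gasket, the fitted slope should lie close to $\log 3/\log 2$; moving the center to a generic interior point of the initiator would eventually give circles containing no sample points, as the text describes.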

Section V describes methods of measuring the mass dimension for physical datasets.

D. Minkowski–Bouligand Dimension

Given a set $A \subset \mathbb{R}^E$ and $\delta > 0$, the Minkowski sausage of $A$, also called the $\delta$-thickening or $\delta$-neighborhood of $A$, is defined as $A_\delta = \{x \in \mathbb{R}^E : d(x, y) \le \delta \text{ for some } y \in A\}$. (See Section I.B.b.) In the Euclidean case when $A$ is a smooth $m$-dimensional manifold imbedded in $\mathbb{R}^E$, one has $\mathrm{vol}(A_\delta) \sim \Lambda \cdot \delta^{E-m}$. That is, the $E$-dimensional volume of $A_\delta$ scales as $\delta$ to the codimension of $A$. This concept extends to fractal sets $A$: if the limit exists,

$$E - \lim_{\delta \to 0} \frac{\log(\mathrm{vol}(A_\delta))}{\log(\delta)}$$

defines the Minkowski–Bouligand dimension, $d_{\mathrm{MB}}(A)$ (see Mandelbrot, 1982, p. 358). In fact, it is not difficult to see that $d_{\mathrm{MB}}(A) = d_{\mathrm{box}}(A)$. If the limit does not exist, lim sup gives $\overline{d}_{\mathrm{box}}(A)$ and lim inf gives $\underline{d}_{\mathrm{box}}(A)$.

In the privileged case when the limit

$$\lim_{\delta \to 0} \frac{\mathrm{vol}(A_\delta)}{\delta^{E-m}}$$

exists, it generalizes the notion of Minkowski content for smooth manifolds $A$. Section VI will use this prefactor to measure lacunarity.

E. Hausdorff–Besicovitch Dimension

For a set $A$ in Euclidean space, given $s \ge 0$ and $\delta > 0$, consider the quantity

$$\mathcal{H}^s_\delta(A) = \inf\Big\{ \sum_i |U_i|^s : \{U_i\} \text{ is a } \delta\text{-cover of } A \Big\}.$$

A decrease of $\delta$ reduces the collection of $\delta$-covers of $A$, therefore $\mathcal{H}^s_\delta(A)$ increases as $\delta \to 0$ and $\mathcal{H}^s(A) = \lim_{\delta \to 0} \mathcal{H}^s_\delta(A)$ exists. This limit defines the $s$-dimensional Hausdorff measure of $A$. For $t > s$, $\mathcal{H}^t_\delta(A) \le \delta^{t-s} \mathcal{H}^s_\delta(A)$. It follows that a unique number $d_H$ has the property that

$$s < d_H \text{ implies } \mathcal{H}^s(A) = \infty$$

and

$$s > d_H \text{ implies } \mathcal{H}^s(A) = 0.$$

That is,

$$d_H(A) = \inf\{s : \mathcal{H}^s(A) = 0\} = \sup\{s : \mathcal{H}^s(A) = \infty\}.$$

This quantity $d_H$ is the Hausdorff–Besicovitch dimension of $A$. It is of substantial theoretical significance, but in most cases is quite challenging to compute, even though it suffices to use coverings by disks. An upper bound often is relatively easy to obtain, but the lower bound can be much more difficult because the inf is taken over the collection of all $\delta$-covers. Because of the inf that enters in its definition, the Hausdorff–Besicovitch dimension cannot be measured for any physical object.

Note: If $A$ can be covered by $N_\delta(A)$ sets of diameter at most $\delta$, then $\mathcal{H}^s_\delta(A) \le N_\delta(A) \cdot \delta^s$. From this it follows $d_H(A) \le \underline{d}_{\mathrm{box}}(A)$, so $d_H(A) \le d_{\mathrm{box}}(A)$ if $d_{\mathrm{box}}(A)$ exists. This inequality can be strict. For example, if $A$ is any countable set, $d_H(A) = 0$ and yet $d_{\mathrm{box}}(\text{rationals in } [0, 1]) = 1$.

F. Packing Dimension

Hausdorff dimension measures the efficiency of covering a set by disks of varying radius. Tricot (1982) introduced packing dimension to measure the efficiency of packing a set with disjoint disks of varying radius. Specifically, for $\delta > 0$ a $\delta$-packing of $A$ is a countable collection of disjoint disks $\{B_i\}$ with radii $r_i < \delta$ and with centers in $A$. In analogy with Hausdorff measure, define

$$\mathcal{P}^s_\delta(A) = \sup\Big\{ \sum_i |B_i|^s : \{B_i\} \text{ is a } \delta\text{-packing of } A \Big\}.$$

As $\delta$ decreases, so does the collection of $\delta$-packings of $A$. Thus $\mathcal{P}^s_\delta(A)$ decreases as $\delta$ decreases and the limit

$$\mathcal{P}^s_0(A) = \lim_{\delta \to 0} \mathcal{P}^s_\delta(A)$$

exists. A technical complication requires an additional step. The $s$-dimensional packing measure of $A$ is defined as

$$\mathcal{P}^s(A) = \inf\Big\{ \sum_i \mathcal{P}^s_0(A_i) : A \subset \bigcup_{i=1}^{\infty} A_i \Big\}.$$

Then the packing dimension $d_{\mathrm{pack}}(A)$ is

$$d_{\mathrm{pack}}(A) = \inf\{s : \mathcal{P}^s(A) = 0\} = \sup\{s : \mathcal{P}^s(A) = \infty\}.$$

Packing, Hausdorff, and box dimensions are related:

$$d_H(A) \le d_{\mathrm{pack}}(A) \le \overline{d}_{\mathrm{box}}(A).$$

For appropriate $A$, each inequality is strict.

III. ALGEBRA OF DIMENSIONS AND LATENT DIMENSIONS

The dimensions of ordinary Euclidean sets obey several rules of thumb that are widely used, though rarely stated explicitly. For example, the union of two sets of dimension $d$ and $d'$ usually has dimension $\max\{d, d'\}$. The projection of a set of dimension $d$ to a set of dimension $d'$ usually gives a set of dimension $\min\{d, d'\}$. Also, for Cartesian products, the dimensions usually add:

$\dim(A \times B) = \dim(A) + \dim(B)$. For the intersection of subsets $A$ and $B$ of $\mathbb{R}^E$, it is the codimensions that usually add: $E - \dim(A \cap B) = (E - \dim(A)) + (E - \dim(B))$, but only so long as the resulting dimension is nonnegative. If it is negative, the intersection is empty. Mandelbrot (1984, Part II) generalized those rules to fractals and (see Section III.G) interpreted negative dimensions as measures of “degree of emptiness.”

For simplicity, we restrict our attention to generalizing these properties to the Hausdorff and box dimensions of fractals.

A. Dimension of Unions and Subsets

Simple applications of the definition of Hausdorff dimension give

$$A \subseteq B \text{ implies } d_H(A) \le d_H(B)$$

and

$$d_H(A \cup B) = \max\{d_H(A), d_H(B)\}.$$

Replacing max with sup, this property holds for countable collections of sets. The subset and finite union properties hold for box dimension, but the countable union property fails.

B. Product and Sums of Dimensions

For all subsets $A$ and $B$ of Euclidean space, $d_H(A \times B) \ge d_H(A) + d_H(B)$. Equality holds if one of the sets is sufficiently regular. For example, if $d_H(A) = \overline{d}_{\mathrm{box}}(A)$, then $d_H(A \times B) = d_H(A) + d_H(B)$. Equality does not always hold: Besicovitch and Moran (1945) give an example of subsets $A$ and $B$ of $\mathbb{R}$ with $d_H(A) = d_H(B) = 0$, yet $d_H(A \times B) = 1$.

For upper box dimensions, the inequality is reversed: $\overline{d}_{\mathrm{box}}(A \times B) \le \overline{d}_{\mathrm{box}}(A) + \overline{d}_{\mathrm{box}}(B)$.

C. Projection

Denote by $\mathrm{proj}_P(A)$ the projection of a set $A \subset \mathbb{R}^3$ to a plane $P \subset \mathbb{R}^3$ through the origin. If $A$ is a one-dimensional Euclidean object, then for almost all choices of the plane $P$, $\mathrm{proj}_P(A)$ is one-dimensional. If $A$ is a two- or three-dimensional Euclidean object, then for almost all choices of the plane $P$, $\mathrm{proj}_P(A)$ is two-dimensional of positive area. That is, $\dim(\mathrm{proj}_P(A)) = \min\{\dim(A), \dim(P)\}$.

The analogous properties hold for fractal sets $A$. If $d_H(A) < 2$, then for almost all choices of the plane $P$, $d_H(\mathrm{proj}_P(A)) = d_H(A)$. If $d_H(A) \ge 2$, then for almost all choices of the plane $P$, $d_H(\mathrm{proj}_P(A)) = 2$ and $\mathrm{proj}_P(A)$ has positive area. So again, $d_H(\mathrm{proj}_P(A)) = \min\{d_H(A), d_H(P)\}$. The obvious generalization holds for fractals $A \subset \mathbb{R}^E$ and projections to $k$-dimensional hyperplanes through the origin.

Projections of fractals can be very complicated. There are fractal sets $A \subset \mathbb{R}^3$ with the surprising property that for almost every plane $P$ through the origin, the projection $\mathrm{proj}_P(A)$ is any prescribed shape, to within a set of area 0. Consequently, as Falconer (1987) points out, in principle we could build a fractal digital sundial.

D. Subordination and Products of Dimension

We have already seen operations realizing the sum, max, and min of dimensions, and in the next subsection we shall examine the sum of codimensions. For certain types of fractals, multiplication of dimensions is achieved through “subordination,” a process introduced in Bochner (1955) and elaborated in Mandelbrot (1982). Examples are constructed easily from the Koch curve generator (Fig. 20a). The initiator (the unit interval) is unchanged, but the new generator is a subset of the original generator. Figure 20 shows three examples.

FIGURE 20 The Koch curve (A) and its generator (a); (b), (c), and (d) are subordinators, and the corresponding subordinates of the subordinand (A) are (B), (C), and (D).

In Fig. 20, generator (b) gives a fractal dust (B) of dimension $\log 3/\log 3 = 1$. Generator (c) gives the standard Cantor dust (C) of dimension $\log 2/\log 3$. Generator (d) gives a fractal dust (D) also of dimension $\log 2/\log 3$. Thinking of the Koch curve $K$ as the graph of a function $f : [0, 1] \to K \subset \mathbb{R}^2$, the fractal (B) can be obtained by restricting $f$ to the Cantor set with initiator $[0, 1]$ and generator the intervals $[0, 1/4]$, $[1/4, 1/2]$, and $[1/2, 3/4]$. In this case, the subordinand is a Koch curve, the subordinator is a Cantor set, and the subordinate is the fractal (B). The identity

$$\frac{\log 3}{\log 3} = \frac{\log 4}{\log 3} \cdot \frac{\log 3}{\log 4}$$

expresses that the dimensions multiply,

$$\dim(\text{subordinate}) = \dim(\text{subordinand}) \cdot \dim(\text{subordinator}).$$

Figure 20, (C) and (D) give other illustrations of this multiplicative relation. The seeded universe model of the distribution of galaxies (Section IX.D.1) uses subordination to obtain fractal dusts; see Mandelbrot (1982, plate 298).
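The multiplicative relation for subordination can be verified with a few lines of arithmetic. In this minimal sketch the subordinand is the Koch curve ($N = 4$, $r = 1/3$) and each subordinator is a Cantor set that keeps some of the four quarter-intervals of $[0, 1]$. Keeping three intervals is the case (B) described in the text; keeping two intervals for (C) and (D) is an assumption inferred from their stated dimension $\log 2/\log 3$. The helper names are hypothetical.

```python
import math

def similarity_dimension(n_pieces, ratio):
    """log(N) / log(1/r) for N pieces, each scaled by the ratio r."""
    return math.log(n_pieces) / -math.log(ratio)

KOCH = similarity_dimension(4, 1 / 3)  # subordinand: the Koch curve

def subordinate_dimensions(kept):
    """Return (direct, product) dimensions when `kept` of the 4 generator pieces survive.

    direct  : similarity dimension of the subordinate (kept pieces, each scaled by 1/3)
    product : dim(subordinand) * dim(subordinator), the subordinator being a
              Cantor set made of `kept` intervals of length 1/4
    """
    direct = similarity_dimension(kept, 1 / 3)
    subordinator = similarity_dimension(kept, 1 / 4)
    return direct, KOCH * subordinator

for label, kept in [("B", 3), ("C", 2), ("D", 2)]:
    direct, product = subordinate_dimensions(kept)
    print(f"({label}) direct = {direct:.6f}, product = {product:.6f}")
```

The two columns agree because $\log k/\log 3 = (\log 4/\log 3)(\log k/\log 4)$ for any number $k$ of kept pieces, which is the identity displayed above with $k = 3$.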

E. Intersection and Sums of Codimension

The dimension of the intersection of two sets obviously depends on their relative placement. When $A \cap B = \emptyset$, the dimension vanishes. The following is a typical result. For Borel subsets $A$ and $B$ of $\mathbb{R}^E$, and for almost all $x \in \mathbb{R}^E$,

$$d_H(A \cap (B + x)) \le \max\{0, d_H(A \times B) - E\}.$$

If $d_H(A \times B) = d_H(A) + d_H(B)$, this reduces to

$$d_H(A \cap (B + x)) \le \max\{0, d_H(A) + d_H(B) - E\}.$$

This is reminiscent of the transversality relation for intersections of smooth manifolds.

Corresponding lower bounds are known in more restricted circumstances. For example, there is a positive measure set $M$ of similarity transformations of $\mathbb{R}^E$ with

$$d_H(A \cap T(B)) \ge d_H(A) + d_H(B) - E$$

for all $T \in M$. Note $d_H(A \cap T(B)) = d_H(A) + d_H(B) - E$ is equivalent to the addition of codimensions: $E - d_H(A \cap T(B)) = (E - d_H(A)) + (E - d_H(B))$.

F. Latent Dimensions below 0 or above E

A blind application of the rule that codimensions are additive easily yields results that seem nonsensical, yet become useful if they are properly interpreted and the Hausdorff dimension is replaced by a suitable new alternative.

1. Negative Latent Dimensions as Measures of the “Degree of Emptiness”

Section E noted that if the codimension addition rule gives a negative dimension, the actual dimension is 0. This exception is an irritating complication and hides a feature worth underlining.

As background relative to the plane, consider the following intersections of two Euclidean objects: two points, a point and a line, and two lines. Naive intuition tells us that the intersection of two points is emptier than the intersection of a point and a line, and that the latter in turn is emptier than the intersection of two lines (which is almost surely a point). This informal intuition fails to be expressed by either a Euclidean or a Hausdorff dimension. On the other hand, the formal addition of codimensions suggests that the three intersections in question have the respective dimensions $-2$, $-1$, and $0$. The inequalities between those values conform with the above-mentioned naive intuition. Therefore, they ushered in the search for a new mathematical definition of dimension that can be measured and for which negative values are legitimate and intuitive. This search produced several publications leading to Mandelbrot (1995). Two notions should be mentioned.

Embedding. A problem that concerns $\mathbb{R}^2$ can often be reinterpreted as a problem that really concerns $\mathbb{R}^E$, with $E > 2$, but must be approached within planar intuitions by $\mathbb{R}^2$. Conversely, if a given problem can be embedded into a problem concerning $\mathbb{R}^E$, the question arises, “which is the ‘critical’ value of $E - 2$, defined as the smallest value for which the intersection ceases to be empty, and precisely reduces to a point?” In the example of a line and a point, the critical $E - 2$ is precisely 1: once embedded in $\mathbb{R}^3$, the problem transforms into the intersection of a plane and a line, which is a point.

Approximation and pre-asymptotics in mathematics and the sciences. Consider a set defined as the limit of a sequence of decreasing approximations. When the limit is not empty, all the usual dimensions are defined as being properties of the limit, but when the limit is empty and all the dimensions vanish, it is possible to consider instead the limits of the properties of the approximations. The Minkowski–Bouligand formal definition of dimension generalizes to fit the naive intuitive values that may be either positive or negative.

2. Latent Dimensions That Exceed That of the Embedding Space

For a strictly self-similar set in $\mathbb{R}^E$, the Moran equation defines a similarity dimension that obeys $d_{\mathrm{sim}} \le E$. On the other hand, a generator that is a self-avoiding broken line can easily yield $\log(N)/\log(1/r) = d_{\mathrm{sim}} > E$. Recursive application of this generator defines a parametrized motion, but the union of the positions of the motion is neither a self-similar curve nor any other self-similar set. It is, instead, a set whose points are covered infinitely often. Its box dimension is $\le E$, which a fortiori is $< d_{\mathrm{sim}}$. However, one can load a mass on this set by following the route that applies in the absence of multiple points. Mass is distributed on the generator’s intervals in proportion to the values of $r_i^{d_{\mathrm{sim}}}$. By infinite recursion, the difference between the times $t$ and $t'$ when points $P$ and $P'$ are visited is defined as the mass supported by the portion of the curve that links these points.

If so and $d_{\mathrm{sim}} > E$, the similarity dimension acquires a useful role as a latent dimension. For example, consider the multiplication of dimensions in Section III.D. Suppose that our recursively constructed set is not lighted for all instants of time, but only intermittently when time falls within a fractal dust of dimension $d'$. Then, the rule of thumb is that the latent dimension of the lighted points is $d_{\mathrm{sim}} d'$. When $d_{\mathrm{sim}} d' < E$, the rule of thumb is that the true dimension is also $d_{\mathrm{sim}} d'$.

Figure 21 shows an example. The generator has $N = 6$ segments, each with scaling ratio $r = 1/2$, hence latent dimension $d_{\mathrm{sim}} = \log 6/\log 2 > 2$. Taking as subordinator

a Cantor set with generator having $N = 3$ segments, each with scaling ratio $r = 1/2$, yields a self-similar fractal with dimension $\log 3/\log 2$.

FIGURE 21 Left: Generator and limiting shape with latent dimension exceeding 2. Right: generator and limiting shape of a subordinate with dimension $< 2$. For comparison, this limiting shape is enclosed in the outline of the left limiting shape.

G. Mapping

Recall $f$ satisfies the Hölder condition with exponent $H$ if there is a positive constant $c$ for which $|f(x) - f(y)| \le c|x - y|^H$. For such functions, $d_H(f(A)) \le (1/H)\, d_H(A)$. If $H = 1$, $f$ is called a Lipschitz function; $f$ is bi-Lipschitz if there are constants $c_1$ and $c_2$ with $c_1 |x - y| \le |f(x) - f(y)| \le c_2 |x - y|$. Hausdorff dimension is invariant under bi-Lipschitz maps. The analogous properties hold for box-counting dimension.

IV. METHODS OF COMPUTING DIMENSION IN MATHEMATICAL FRACTALS

Upper bounds for the Hausdorff dimension can be relatively straightforward: it suffices to consider a specific family of coverings of the set. Lower bounds are more delicate. We list and describe briefly some methods for computing dimension.

A. Mass Distribution Methods

A mass distribution on a set $A$ is a measure $\mu$ with $\mathrm{supp}(\mu) \subset A$ and $0 < \mu(A) < \infty$. The mass distribution principle (Falconer, 1990, p. 55) establishes a lower bound for the Hausdorff dimension: Let $\mu$ be a mass distribution on $A$ and suppose for some $s$ there are constants $c > 0$ and $\delta > 0$ with $\mu(U) \le c \cdot |U|^s$ for all sets $U$ with $|U| \le \delta$. Then $s \le d_H(A)$.

Suitable choice of mass distribution can show that no individual set of a cover can cover too much of $A$. This can eliminate the problems caused by covers by sets of a wide range of diameters.

B. Potential Theory Methods

Given a mass distribution $\mu$, the $s$-potential is defined by Frostman (1935) as

$$\phi_s(x) = \int \frac{d\mu(y)}{|x - y|^s}.$$

If there is a mass distribution $\mu$ on a set $A$ with $\int \phi_s(x)\, d\mu(x) < \infty$, then $d_H(A) \ge s$. Potential theory has been useful for computing dimension of many sets, for example, Brownian paths.

C. Implicit Methods

McLaughlin (1987) introduced a geometrical method, based on local approximate self-similarities, which succeeds in proving that $d_H(A) = d_{\mathrm{box}}(A)$, without first determining $d_H(A)$. If small parts of $A$ can be mapped to large parts of $A$ without too much distortion, or if $A$ can be mapped to small parts of $A$ without too much distortion, then $d_H(A) = d_{\mathrm{box}}(A) = s$ and $\mathcal{H}^s(A) > 0$ (in the former case) or $\mathcal{H}^s(A) < \infty$ (in the latter case). Details and examples can be found in Falconer (1997, Section 3.1).

D. Thermodynamic Formalism

Sinai (1972), Bowen (1975), and Ruelle (1978) adapted methods of statistical mechanics to determine the dimensions of fractals arising from some nonlinear processes. Roughly, for a fractal defined as the attractor $A$ of a family of nonlinear contractions $F_i$ with an inverse function $f$ defined on $A$, the topological pressure $P(\phi)$ of a Lipschitz function $\phi : A \to \mathbb{R}$ is

$$P(\phi) = \lim_{k \to \infty} \frac{1}{k} \log \sum_{x \in \mathrm{Fix}(f^k)} \exp\left[\phi(x) + \phi(f(x)) + \cdots + \phi(f^{k-1}(x))\right],$$

where $\mathrm{Fix}(f^k)$ denotes the set of fixed points of $f^k$. The sum plays the role of the partition function in statistical mechanics, part of the motivation for the name “thermodynamic formalism.” There is a unique $s$ for which

$P(-s \log |f'|) = 0$, and $s = d_H(A)$. Under these conditions, $0 < \mathcal{H}^s(A) < \infty$, $\mathcal{H}^s$ is a Gibbs measure on $A$, and many other results can be deduced. Among other places, this method has been applied effectively to the study of Julia sets.

V. METHODS OF MEASURING DIMENSION IN PHYSICAL SYSTEMS

For shapes represented in the plane, such as coastlines, rivers, mountain profiles, earthquake faultlines, fracture and cracking patterns, viscous fingering, dielectric breakdown, and growth of bacteria in stressed environments, box dimension is often relatively easy to compute. Select a sequence $\varepsilon_1 > \varepsilon_2 > \cdots > \varepsilon_n$ of sizes of boxes to be used to cover the shape, and denote by $N(\varepsilon_i)$ the number of boxes of size $\varepsilon_i$ needed to cover the shape. A plot of $\log(N(\varepsilon_i))$ against $\log(1/\varepsilon_i)$ often reveals a scaling range over which the points fall close to a straight line. In the presence of other evidence (hierarchical visual complexity, for example), this indicates a fractal structure with box dimension given by the slope of the line. Interpreting the box dimension in terms of underlying physical, chemical, and biological processes has yielded productive insights.

For physical objects in three-dimensional space, such as aggregates, dustballs, physiological branchings (respiratory, circulatory, and neural), soot particles, protein clusters, and terrain maps, it is often easier to compute mass dimension. Select a sequence of radii $r_1 > r_2 > \cdots > r_n$ and cover the object with concentric spheres of those radii. Denoting by $M(r_i)$ the mass of the part of the object contained inside the sphere of radius $r_i$, a plot of $\log(M(r_i))$ against $\log(r_i)$ often reveals a scaling range over which the points fall close to a straight line. In the presence of other evidence (hierarchical arrangements of hole sizes, for example), this indicates a fractal structure with mass dimension given by the slope of the line. Mass dimension is relevant for calculating how density scales with size, and this in turn has implications for how the object is coupled to its environment.

VI. LACUNARITY

Examples abound of fractals sharing the same dimension but looking quite different. For instance, both Sierpinski carpets in Fig. 22 have dimension $\log 40/\log 7$. The holes’ distribution is more uniform on the left than on the right. The quantification of this difference was undertaken in Mandelbrot (1982, Chapter 34). It introduced lacunarity as one expression of this difference, and took another step in characterizing fractals through associated numbers. How can the distribution of a fractal’s holes or gaps (“lacunae”) be quantified?

FIGURE 22 Two Sierpinski carpet fractals with the same dimension.

A. The Prefactor

Suppose $A$ is either carpet in Fig. 22, and let $A_\delta$ denote the $\delta$-thickening of $A$. As mentioned in Section II.D, $\mathrm{area}(A_\delta) \sim \Lambda \cdot \delta^{2 - \log 40/\log 7}$. One measure of lacunarity is $1/\Lambda$, if the appropriate limit exists.

It is well known that for the box dimension, the limit as $\varepsilon \to 0$ can be replaced by the sequential limit $\varepsilon_n \to 0$, for $\varepsilon_n$ satisfying mild conditions. For these carpets, natural choices are those $\varepsilon_n$ just filling successive generations of holes. Applied to Fig. 22, these $\varepsilon_n$ give $1/\Lambda \approx 0.707589$ and $0.793487$, agreeing with the notion that higher lacunarity corresponds to a more uneven distribution of holes. Unfortunately, the prefactor is much more sensitive than the exponent: different sequences of $\varepsilon_n$ give different limits. Logarithmic averages can be used, but this is work in progress.

B. The Crosscuts Structure

An object is often best studied through its crosscuts by straight lines, concentric circles, or spheres. For a fractal of dimension $d$ in the plane, the rule of thumb is that the crosscuts are Cantor-like objects of dimension $d - 1$. The case when the gaps between points in the crosscut are statistically independent was singled out by Mandelbrot as defining “neutral lacunarity.” If the crosscut is also self-similar, it is a Lévy dust.

Hovi et al. (1996) studied the intersection of lines (linear crosscuts) with two- and three-dimensional critical percolation clusters, and found the gaps are close to being statistically independent, thus a Lévy dust.

In studying very large DLA clusters, Mandelbrot et al. (1995) obtained a crosscut dimension of $d_c = 0.65 \pm 0.01$, different from the value 0.71 anticipated if DLA clusters were statistically self-similar objects with mass dimension $d_{\mathrm{mass}} = 1.71$. The difference can be explained by asserting the number of particles $N_c(r/l)$ on a crosscut

of radius r scales as Nc (r/l) = (r )(r/l)dc . Here l is

the scaling length, and the lacunarity prefactor varies
with r . Assuming slow variation of (r ) with r , the
observed linear log–log fit requires (r ) ∼ r δd , where
δd = dmass − 1 − dc = 0.06 ± 0.01. Transverse crosscut
analysis reveals lacunarity decreases with r for large DLA
clusters.
FIGURE 23 The effect of H on Weierstrass graph roughness. In
all pictures, b = 1.5 and H has the indicated value.

C. Antipodal Correlations
Select an occupied point p well inside a random fractal known was constructed by Weierstrass in 1872. The Weier-
cluster, so that the R × R square centered at p lies within strass sine function is
the cluster. Now select two vectors V and W based at p
∞

and separated by the angle θ . Finally, denote by x and y the W (t) = b−H n sin(2π bn t),
n=0
number of occupied sites within the wedges with apexes at p, apex angles φ much less than θ, and centered about the vectors V and W. The angular correlation function is

\[ C(\theta) = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\langle x^{2} \rangle - \langle x \rangle \langle x \rangle}, \]

where ⟨···⟩ denotes an average over many realizations of the random fractal. Antipodal correlations concern θ = π. Negative and positive antipodal correlations are interpreted as indicating high and low lacunarity; vanishing correlation is a weakened form of neutral lacunarity.

Mandelbrot and Stauffer (1994) used antipodal correlations to study the lacunarity of critical percolation clusters. On smaller central subclusters, they found the antipodes are uncorrelated.

Trema random fractals. These are formed by removing randomly centered discs, tremas, with radii obeying a power-law scaling. For them, C(π) → 0 with φ because a circular hole that overlaps a sector cannot overlap the opposite sector. But nonconvex tremas introduce positive antipodal correlations. For θ close to π, needle-shaped tremas, though still convex, yield C(θ) much higher than for circular trema sets. From this more refined viewpoint, needle tremas’ lacunarity is much lower.

VII. FRACTAL GRAPHS AND SELF-AFFINITY

A. Weierstrass Functions

Smooth functions’ graphs, as seen under sufficient magnification, are approximated by their tangents. Unless the function itself is linear, the existence of a tangent contradicts the scale invariance that characterizes fractals. The early example of a continuous, nowhere-differentiable function devised in 1834 by Bolzano remained unpublished until the 1920s. The first example to become widely [...] and the complex Weierstrass function is

\[ W_0(t) = \sum_{n=0}^{\infty} b^{-Hn} \exp(2\pi i b^{n} t). \]

Hardy (1916) showed W(t) is continuous and nowhere-differentiable if and only if b > 1 and 0 < H < 1.

As shown in Fig. 23, the parameter H determines the roughness of the graph. In this case, H is not a perspicuous “roughness exponent.” Indeed, as b increases, the amplitudes of the higher frequency terms decrease and the graph is more clearly dominated by the lowest frequency terms. This effect of b is a little-explored aspect of lacunarity.

B. Weierstrass–Mandelbrot Functions

The Weierstrass function revolutionized mathematics but did not enter physics until it was modified in a series of steps described in Mandelbrot (1982, pp. 388–390; 2001d, Chapter H4). The step from $W_0(t)$ to $W_1(t)$ added low frequencies in order to ensure self-affinity. The step from $W_1(t)$ to $W_2(t)$ added to each addend a random phase $\varphi_n$ uniformly distributed on [0, 1]. The step from $W_1(t)$ to $W_3(t)$ added a random amplitude $A_n = \sqrt{-2 \log V}$, where V is uniform on [0, 1]. A function $W_4(t)$ that need not be written down combines a phase and an amplitude. The latest step leads to another function that need not be written down: it is $W_5(t) = W_4(t) + W_4(-t)$, where the two addends are statistically independent. Contrary to all earlier extensions, $W_5(t)$ is not chiral. We have

\[ W_1(t) = \sum_{n=-\infty}^{\infty} b^{-Hn} \left( \exp(2\pi i b^{n} t) - 1 \right), \]

\[ W_2(t) = \sum_{n=-\infty}^{\infty} b^{-Hn} \left( \exp(2\pi i b^{n} t) - 1 \right) \exp(i\varphi_n), \]

\[ W_3(t) = \sum_{n=-\infty}^{\infty} A_n b^{-Hn} \left( \exp(2\pi i b^{n} t) - 1 \right). \]
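The sums defining W0(t) and W1(t) can be explored numerically by truncating them. The following is a minimal sketch, not taken from the article: the function names `w0` and `w1`, the default values b = 2 and H = 0.5, and the truncation limits are illustrative assumptions. It also checks the self-affinity that the added low frequencies are meant to provide, namely that W1(bt) is close to b^H W1(t), up to small boundary terms of the truncated sum.

```python
import cmath
import math


def w0(t, b=2.0, H=0.5, n_max=40):
    """Partial sum of W0(t) = sum over n >= 0 of b^(-H n) exp(2 pi i b^n t)."""
    return sum(b ** (-H * n) * cmath.exp(2j * math.pi * b ** n * t)
               for n in range(n_max + 1))


def w1(t, b=2.0, H=0.5, n_min=-40, n_max=40):
    """Partial sum of W1(t) = sum over all integers n of
    b^(-H n) (exp(2 pi i b^n t) - 1), truncated to n_min <= n <= n_max."""
    return sum(b ** (-H * n) * (cmath.exp(2j * math.pi * b ** n * t) - 1)
               for n in range(n_min, n_max + 1))


if __name__ == "__main__":
    b, H, t = 2.0, 0.5, 0.3
    lhs = w1(b * t, b, H)
    rhs = b ** H * w1(t, b, H)
    # The two values differ only through the terms lost at the ends
    # of the truncated sum.
    print(abs(lhs - rhs))
```

Choosing a larger b with the same H makes the higher-frequency terms decay faster, which is the effect of b on the appearance of the graph noted in Section VII.A.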
P1: GLQ Final
Encyclopedia of Physical Science and Technology EN006H-259 June 28, 2001 20:1

Fractals 201

C. The Hölder Exponent

A function f: [a, b] → R has Hölder exponent H if there is a constant c > 0 for which

\[ |f(x) - f(y)| \le c \cdot |x - y|^{H} \]

for all x and y in [a, b] (recall Section III.G). If f is continuous and has Hölder exponent H satisfying 0 < H ≤ 1, then the graph of f has box dimension $d_{box} \le 2 - H$.

The Weierstrass function W(t) has Hölder exponent H, hence its graph has $d_{box} \le 2 - H$. For large enough b, $d_{box} = 2 - H$, so one can think of the Hölder exponent as a measure of roughness of the graph.

VIII. FRACTAL ATTRACTORS AND REPELLERS OF DYNAMICAL SYSTEMS

The modern renaissance in dynamical systems is associated most often with chaos theory. Consequently, the relations between fractal geometry and chaotic dynamics, mediated by symbolic dynamics, are relevant to our discussion. In addition, we consider fractal basin boundaries, which generalize Julia sets to much wider contexts including mechanical systems.

A. The Smale Horseshoe

If they exist, intersections of the stable and unstable manifolds of a fixed point are called homoclinic points. Poincaré (1890) recognized that homoclinic points cause great complications in dynamics. Yet much can be understood by labeling an appropriate coarse-graining of a neighborhood of a homoclinic point and translating the corresponding dynamics into a string of symbols (the coarse-grain bin labels). The notion of symbolic dynamics first appears in Hadamard (1898), and Birkhoff (1927) proved every neighborhood of a homoclinic point contains infinitely many periodic points.

Motivated by work of Cartwright and Littlewood (1945) and Levinson (1949) on the forced van der Pol oscillator, Smale (1963) constructed the horseshoe map. This is a map from the unit square into the plane whose completely invariant set is a Cantor set, roughly the Cartesian product of two Cantor middle-thirds sets. Restricted to this invariant set, with the obvious symbolic dynamics encoding, the horseshoe map is conjugate to the shift map on two symbols, the archetype of a chaotic map.

This construction is universal in the sense that it occurs in every transverse homoclinic point to a hyperbolic saddle point. The Conley–Moser theorem (see Wiggins, 1990) establishes the existence of chaos by conjugating the dynamics to a shift map on a Cantor set under general conditions. In this sense, chaos is often equivalent to simple dynamics on an underlying fractal.

B. Fractal Basin Boundaries

For any point c belonging to a hyperbolic component of the Mandelbrot set, the Julia set is the boundary of the basins of attraction of the attracting cycle and the attracting fixed point at infinity. See the right side of Fig. 3.

Another example favored by Julia is found in Newton’s method for finding the roots of a polynomial f(z) of degree at least 3. It leads to the dynamical system $z_{n+1} = N_f(z_n) = z_n - f(z_n)/f'(z_n)$. The roots of f(z) are attracting fixed points of $N_f(z)$, and the boundary of the basins of attraction of these fixed points is a fractal; an example is shown on the left side of Fig. 24. If contaminated by even small uncertainties, the fate of initial points near the basin boundary cannot be predicted. Sensitive dependence on initial conditions is a signature of chaos, but here we deal with something different. The eventual behavior is completely predictable, except for initial points taken exactly on the basin boundary, usually of two-dimensional Lebesgue measure 0.

The same complication enters mechanical engineering problems for systems with multiple attractors. Moon (1984) exhibited an early example. Extensive theoretical and computer studies by Yorke and coworkers are described in Alligood and Yorke (1992). The driven harmonic oscillator with two-well potential

\[ \frac{d^{2}x}{dt^{2}} + f \frac{dx}{dt} - \frac{1}{2} x (1 - x^{2}) = A \cos(\omega t) \]

is a simple example. The undriven system has two stable equilibria, x = −1 and x = +1. Initial values (x, x′) are painted white if the trajectory from that point eventually stays in the left basin, black if it eventually stays in the right basin. The right side of Fig. 24 shows the initial condition portrait for the system with f = 0.15, ω = 0.8, and A = 0.094.

IX. FRACTALS AND DIFFERENTIAL OR PARTIAL DIFFERENTIAL EQUATIONS

The daunting task to which a large portion of Mandelbrot (1982) is devoted was to establish that many works of nature and man [as shown in Mandelbrot (1997), the latter includes the stock market!] are fractal. New and often important examples keep being discovered, but the hardest present challenge is to discover the causes of fractality. Some cases remain obscure, but others are reasonably clear.
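As a concrete instance of a differential equation with a fractal basin boundary, the damped, driven two-well oscillator of Section VIII.B can be integrated numerically. The sketch below is an illustration under stated assumptions rather than the procedure used for Fig. 24: the classical fourth-order Runge–Kutta scheme, the step size, the integration time, and the names `simulate` and `basin_label` are all choices made here, and labeling a trajectory by the sign of x at the final time is only a crude proxy for "eventually stays in" a well.

```python
import math


def simulate(x0, v0, f=0.15, omega=0.8, A=0.094, t_end=200.0, dt=0.01):
    """Integrate x'' + f x' - 0.5 x (1 - x^2) = A cos(omega t) with
    classical RK4 from (x0, v0) at t = 0; return (x, x') at t_end."""
    def accel(t, x, v):
        return -f * v + 0.5 * x * (1.0 - x * x) + A * math.cos(omega * t)

    x, v, t = x0, v0, 0.0
    for _ in range(int(round(t_end / dt))):
        k1x, k1v = v, accel(t, x, v)
        k2x = v + 0.5 * dt * k1v
        k2v = accel(t + 0.5 * dt, x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x = v + 0.5 * dt * k2v
        k3v = accel(t + 0.5 * dt, x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x = v + dt * k3v
        k4v = accel(t + dt, x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
        t += dt
    return x, v


def basin_label(x0, v0, **params):
    """Label an initial condition by the well occupied at the final time."""
    x, _ = simulate(x0, v0, **params)
    return "left" if x < 0 else "right"


if __name__ == "__main__":
    # Coarse scan of initial conditions with the parameters quoted in the text.
    for v0 in (0.5, 0.0, -0.5):
        print(" ".join(basin_label(x0 / 4.0, v0)[0] for x0 in range(-8, 9)))
```

Refining the grid of initial values (x, x′) and painting each point by its label gives an initial condition portrait of the kind described above; with A = 0 the sketch reduces to the undriven system, whose trajectories settle at x = −1 or x = +1.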
FIGURE 24 Left: The basins of attraction of Newton’s method for finding the roots of z 3 − 1. Right: The basins of
attraction for a damped, driven two-well harmonic oscillator.
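The left panel of Fig. 24 can be approximated with a few lines of code. The following is a minimal sketch, assuming the cubic f(z) = z^3 − 1 named in the caption; the function names, iteration limit, tolerance, and character-based rendering are illustrative choices, not part of the article. Each starting point z is iterated under N_f(z) = z − f(z)/f′(z) and labeled by the root it reaches.

```python
import cmath
import math

# The three cube roots of unity, i.e., the roots of z^3 - 1.
ROOTS = [cmath.exp(2j * math.pi * k / 3) for k in range(3)]


def newton_basin(z, max_iter=60, tol=1e-9):
    """Return the index of the root of z^3 - 1 reached by Newton's method
    from z, or -1 if no root is reached within max_iter steps."""
    for _ in range(max_iter):
        if z == 0:
            return -1  # derivative vanishes; the Newton step is undefined
        z = z - (z ** 3 - 1) / (3 * z ** 2)
        for k, root in enumerate(ROOTS):
            if abs(z - root) < tol:
                return k
    return -1


def ascii_portrait(n=41, extent=1.5):
    """Render the basins on an n-by-n grid covering [-extent, extent]^2."""
    symbols = {0: ".", 1: "o", 2: "x", -1: "?"}
    rows = []
    for i in range(n):
        y = extent - 2 * extent * i / (n - 1)
        row = ""
        for j in range(n):
            x = -extent + 2 * extent * j / (n - 1)
            row += symbols[newton_basin(complex(x, y))]
        rows.append(row)
    return rows


if __name__ == "__main__":
    print("\n".join(ascii_portrait()))
```

Even at this coarse resolution the three labels interleave near the rays separating the basins, which is where small uncertainties in the initial point change the root that is eventually reached.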

Thus, the fractality of the physical percolation clusters (Section I.B.3.e) is the geometric counterpart of scaling and renormalization: the analytic properties of those objects follow a wealth of power-law relations. Many mathematical issues, some of them already mentioned, remain open, but the overall renormalization framework is firmly rooted. Renormalization and the resulting fractality also occur in the structure of attractors and repellers of dynamical systems. Best understood is renormalization for quadratic maps. Feigenbaum and others considered the real case. For the complex case, renormalization establishes that the Mandelbrot set contains infinitely many small copies of itself.

Unfortunately, additional examples of fractality proved to be beyond the scope of the usual renormalization. A notorious case concerns DLA (Section I.B.3.f).

A. Fractal Attractors of Ordinary Differential Equations

The Lorenz equations for fluid convection in a two-dimensional layer heated from below are

\[ \frac{dx}{dt} = \sigma (y - x), \qquad \frac{dy}{dt} = -xz + rx - y, \qquad \frac{dz}{dt} = xy - bz. \]

Here x denotes the rate of convective overturning, y the horizontal temperature difference, and z the departure from a linear vertical temperature gradient. For the parameters σ = 10, b = 8/3, and r = 28, Lorenz (1963) suggested that trajectories in a bounded region converge to an attractor that is a fractal, with dimension about 2.06, as estimated by Liapunov exponents. The Lorenz equations are very suggestive but do not represent weather systems very well. However, Haken established a connection with lasers. The sensitivity to initial conditions common to chaotic dynamics is mediated by the intricate fractal interleaving of the multiple layers of the attractor. In addition, Birman and Williams (1983) showed an abundance of knotted periodic orbits embedded in the Lorenz attractor, though Williams (1983) showed all such knots are prime. Grist (1997) constructed a universal template, a branched 2-manifold in which all knots are embedded. Note the interesting parallel with the universal aspects of the Sierpinski carpet (Section I.B.1.a). It is not yet known if the attractor of any differential equation contains a universal template. The Poincaré–Bendixson theorem prohibits fractal attractors for differential equations in the plane, but many other classical ordinary differential equations in at least three dimensions exhibit similar fractal attractors in certain parameter ranges.

B. Partial Differential Equations on Domains with Fractal Boundaries (“Can One Hear the Shape of a Fractal Drum?”)

Suppose $D \subset R^{n}$ is an open region with boundary ∂D. Further, suppose the eigenvalue problem $\nabla^{2} u = -\lambda u$ with boundary conditions u(x) = 0 for all x ∈ ∂D has real eigenvalues $0 < \lambda_1 < \lambda_2 < \cdots$. For D with sufficiently smooth boundary, a theorem of Weyl (1912) shows $N(\lambda) \sim \lambda^{n/2}$, where the eigenvalue counting function N(λ) is the number of $\lambda_i$ for which $\lambda_i \le \lambda$. If the boundary ∂D is a fractal, Berry (1979, pp. 51–53) postulated that some form of the dimension of ∂D appears in the second term in the expansion of N(λ), and therefore can be recovered from the eigenvalues. This could not be the Hausdorff dimension, but Lapidus (1995) showed that it
is the Minkowski–Bouligand dimension. The analysis is subtle, involving some deep number theory.

For regions with fractal boundaries, the heat equation $\nabla^{2} u = (\partial/\partial t) u$ shows heat flow across a fractal boundary is related to the dimension of the boundary. Sapoval (1989) and Sapoval et al. (1991) conducted elegant experiments to study the modes of fractal drums. The perimeter of Fig. 25 has dimension log 8/log 4 = 3/2. A membrane stretched across this fractal curve was excited acoustically and the resulting modes observed by sprinkling powder on the membrane and shining laser light transverse to the surface. Sapoval observed modes localized to bounded regions A, B, C, and D shown in Fig. 25. By carefully displacing the acoustic source, he was able to excite each separately.

FIGURE 25 Perimeter of an extensively studied fractal drum.

Theoretical and computer-graphic analyses of the wave equation on domains with fractal boundaries have been carried out by Lapidus et al. (1996), among others.

C. Partial Differential Equations on Fractals

The problem is complicated by the fact that a fractal is not a smooth manifold. How is the Laplacian to be defined on such a space? One promising approach was put forward by physicists in the 1980s and made rigorous in Kigami (1989): approximate the fractal domain by a sequence of graphs representing successive protofractals, and define the fractal Laplacian as the limit of a suitably renormalized sequence of Laplacians on the graphs. Figure 26 shows the first four graphs for the equilateral Sierpinski gasket. The values at the boundary vertices are specified by the boundary conditions. At any nonboundary vertex $x_0$, the mth approximate Laplacian of a function f(x) is the product of a renormalization factor by $\sum_{y} (f(y) - f(x_0))$, where the sum is taken over all vertices y in the mth protofractal graph corresponding to the mth-stage reduction of the whole graph.

FIGURE 26 Graphs corresponding to protofractal approximations of the equilateral Sierpinski gasket.

With this Laplacian, the heat and wave equations can be defined on fractals. Among other things, the wave equation on domains with fractal boundaries admits localized solutions, as we saw for the wave equation on fractal drums. A major challenge is to extend these ideas to fractals more complicated than the Sierpinski gasket and its relatives.

D. How Partial Differential Equations Generate Fractals

A quandary: It is universally granted that physics is ruled by diverse partial differential equations, such as those of Laplace, Poisson, and Navier–Stokes. A differential equation necessarily implies a great degree of local smoothness, even though close examination shows isolated singularities or “catastrophes.” To the contrary, fractality implies everywhere-dense roughness or fragmentation. This is one of the several reasons that fractal models in diverse fields were initially perceived as being “anomalies” contradicting one of the firmest foundations of science.

A conjecture–challenge responding to the preceding quandary. There is no contradiction at all: fractals arise unavoidably in the long-time behavior of the solution of very familiar and innocuous-looking equations. In particular, many concrete situations where fractals are observed involve equations that allow free and moving boundaries, interfaces, or singularities. As a suggestive “principle,” Mandelbrot (1982, Chapter 11) described the following possibility: under broad conditions that largely remain to be specified, these free boundaries, interfaces, and singularities converge to suitable fractals. Many equations have been examined from this viewpoint, but we limit ourselves to two examples of central importance.

1. The Large-Scale Distribution of Galaxies

Chapters 9 and 33–35 of Mandelbrot (1982) conjecture that the distribution of galaxies is fractal. This conjecture results from a search for invariants that was central to every aspect of the construction of fractal geometry. Granted that the distribution of galaxies certainly deviates from homogeneity, one broad approach consists in correcting for local inhomogeneity by using local “patches.” The next simplest global assumption is that the distribution is nonhomogeneous but scale invariant, therefore fractal.

Excluding the strict hierarchies, two concrete constructions of random fractal sets were subjected to detailed mathematical and visual investigation. These constructions being random, self-similarity can only be statistical. But a strong counteracting asset is that the self-similarity ratio can be chosen freely and is not restricted to powers of a prescribed $r_0$. A surprising and noteworthy finding came forth. These constructions exhibited a strong hierarchical
structure that is not a deliberate and largely arbitrary input. Details are given in Mandelbrot (1982).

The first construction is the seeded universe based on a Lévy flight. Its Hausdorff-dimensional properties were well known. Its correlation properties (Mandelbrot, 1975) proved to be nearly identical to those of actual galaxy maps. The second construction is the parted universe obtained by subtracting from space a random collection of overlapping tremas. Either construction yields sets that are highly irregular and involve no special center, yet, with no deliberate design, exhibit a clear-cut clustering, “filaments” and “walls.” These structures were little known when these constructions were designed.

Conjecture: Could it be that the observed “clusters,” “filaments,” and “walls” need not be explained separately? They may not result from unidentified features of specific models, but represent unavoidable consequences of a variety of unconstrained forms of random fractality, as interpreted by a human brain.

A problem arose when careful examination of the simulations revealed a clearly incorrect prediction. The simulations in the seeded universe proved to be visually far more “lacunar” than the real world. That is, the simulations’ holes are larger than in reality. The parted universe model fared better, since its lacunarity can be adjusted at will and fit to the actual distribution. A lowered lacunarity is expressed by a positive correlation between masses in antipodal directions. Testing this specific conjecture is a challenge for those who analyze the data.

Does dynamics make us expect the distribution of galaxies to be fractal? Position a large array of point masses in a cubic box in which opposite sides are identified to form a three-dimensional torus. The evolution of this array obeys the Laplace equation, with the novelty that the singularities of the solution are the positions of the points, therefore movable. All simulations we know (starting with those performed at IBM around 1960) suggest that, even when the pattern of the singularities begins by being uniform or Poisson, it gradually creates clusters and a semblance of hierarchy, and appears to tend toward fractality. It is against the preceding background that the limit distribution of galaxies is conjectured to be fractal, and fractality is viewed as compatible with Newton’s equations.

2. The Navier–Stokes Equation

The first concrete use of a Cantor dust in real spaces is found in Berger and Mandelbrot (1963), a paper on noise records. This was nearly simultaneous with Kolmogorov’s work on the intermittence of turbulence. After numerous experimental tests designed to create an intuitive feeling for this phenomenon (e.g., listening to turbulent velocity records that were made audible), the fractal viewpoint was extended to turbulence, and circa 1964 led to the following conjecture.

Conjecture. The property of being “turbulently dissipative” should not be viewed as attached to domains in a fluid with significant interior points, but as attached to fractal sets. In a first approximation, those sets’ intersection with a straight line is a Cantor-like fractal dust having a dimension in the range from 0.5 to 0.6. The corresponding full sets in space should therefore be expected to be fractals with Hausdorff dimension in the range from 2.5 to 2.6.

Actually, Cantor dust and Hausdorff dimension are not the proper notions in the context of viscous fluids because viscosity necessarily erases the fine detail essential to fractals. Hence the following conjecture (Mandelbrot, 1982, Chapter 11; 1976). The dissipation in a viscous fluid occurs in the neighborhood of a singularity of a nonviscous approximation following Euler’s equations, and the motion of a nonviscous fluid acquires singularities that are sets of dimension about 2.5–2.6. Several numerical tests agree with this conjecture (e.g., Chorin, 1981).

A related conjecture, that the Navier–Stokes equations have fractal singularities of much smaller dimension, has led to extensive work by V. Scheffer, R. Temam, C. Foias, and many others. But this topic is not exhausted.

Finally, we mention that fractals in phase space entered the transition from laminar to turbulent flow through the work of Ruelle and Takens (1971) and their followers. The task of unifying the real- and phase-space roles of fractals is challenging and far from being completed.

X. FRACTALS IN THE ARTS AND IN TEACHING

The Greeks asserted art reflects nature, so it is little surprise that the many fractal aspects of nature should find their way into the arts, beyond the fact that a representational painting of a tree exhibits the same fractal branching as a physical tree. Voss and Clarke (1975) found fractal power-law scaling in music, and self-similarity is designed in the music of the composers György Ligeti and Charles Wuorinen. Pollard-Gott (1986) established the presence of fractal repetition patterns in the poetry of Wallace Stevens. Computer artists use fractals to create both abstract aesthetic images and realistic landscapes. Larry Poons’ paintings since the 1980s have had rich fractal textures. The “decalcomania” of the 1830s and the 1930s and 1940s used viscous fingering to provide a level of visual complexity. Before that, Giacometti’s Alpine wildflower paintings are unquestionably fractal. Earlier still, relatives of the Sierpinski gasket occur as decorative motifs in Islamic and Renaissance art. Fractals abound in architecture, for example, in the cascades of spires in Indian temples, Bramante’s
plan for St. Peter’s, Malevich’s Architektonics, and some of Frank Lloyd Wright’s designs. Fractals occur in the writing of Clarke, Crichton, Hoag, Powers, Updike, and Wilhelm, among others, and in at least one play, Stoppard’s Arcadia. Postmodern literary theory has used some concepts informed by fractal geometry, though this application has been criticized for its overly free interpretations of precise scientific language. Some have seen evidence of power-law scaling in historical records, the distribution of the magnitudes of wars and of natural disasters, for example. In popular culture, fractals have appeared on t-shirts, tote bags, book covers, and MTV logos, have been mentioned on public radio’s A Prairie Home Companion, and have been seen on television programs from Nova and Murphy Brown, through several incarnations of Star Trek, to The X-Files and The Simpsons. While Barnsley’s (1988) slogan, “fractals everywhere,” is too strong, the degree to which fractals surround us outside of science and engineering is striking.

A corollary of this last point is a good conclusion to this high-speed survey. In our increasingly technological world, science education is very important. Yet all too often humanities students are presented with limited choices: the first course in a standard introductory sequence, or a survey course diluted to the level of journalism. The former builds toward major points not revealed until later courses; the latter discusses results from science without showing how science is done. In addition, many efforts to incorporate computer-aided instruction attempt to replace parts of standard lectures rather than engage students in exploration and discovery.

Basic fractal geometry courses for non-science students provide a radical departure from this mode. The subject of fractal geometry operates at human scale. Though new to most, the notion of self-similarity is easy to grasp, and (once understood) handles familiar objects from a genuinely novel perspective. Students can explore fractals with the aid of readily available software. These instances of computer-aided instruction are perfectly natural because computers are so central to the entire field of fractal geometry. The contemporary nature of the field is revealed by a supply of mathematical problems that are simple to state but remain unsolved. Altogether, many fields of interest to non-science students have surprising examples of fractal structures. Fractal geometry is a powerful tool for imparting to non-science students some of the excitement for science often invisible to them. Several views of this are presented in Frame and Mandelbrot (2001).

The importance of fractals in the practice of science and engineering is undeniable. But fractals are also a proven force in science education. Certainly, the boundaries of fractal geometry have not yet been reached.

SEE ALSO THE FOLLOWING ARTICLES

CHAOS • PERCOLATION • TECTONOPHYSICS

BIBLIOGRAPHY

Alligood, K., and Yorke, J. (1992). Ergodic Theory Dynam. Syst. 12, 377–400.
Alligood, K., Sauer, T., and Yorke, J. (1997). “Chaos. An Introduction to Dynamical Systems,” Springer-Verlag, New York.
Barnsley, M. (1988). “Fractals Everywhere,” 2nd ed., Academic Press, Orlando, FL.
Barnsley, M., and Demko, S. (1986). “Chaotic Dynamics and Fractals,” Academic Press, Orlando, FL.
Barnsley, M., and Hurd, L. (1993). “Fractal Image Compression,” Peters, Wellesley, MA.
Batty, M., and Longley, P. (1994). “Fractal Cities,” Academic Press, London.
Beardon, A. (1983). “The Geometry of Discrete Groups,” Springer-Verlag, New York.
Beck, C., and Schlögl, F. (1993). “Thermodynamics of Chaotic Systems: An Introduction,” Cambridge University Press, Cambridge.
Berger, J., and Mandelbrot, B. (1963). IBM J. Res. Dev. 7, 224–236.
Berry, M. (1979). “Structural Stability in Physics,” Springer-Verlag, New York.
Bertoin, J. (1996). “Lévy Processes,” Cambridge University Press, Cambridge.
Besicovitch, A., and Moran, P. (1945). J. Lond. Math. Soc. 20, 110–120.
Birkhoff, G. (1927). “Dynamical Systems,” American Mathematical Society, Providence, RI.
Birman, J., and Williams, R. (1983). Topology 22, 47–82.
Bishop, C., and Jones, P. (1997). Acta Math. 179, 1–39.
Blanchard, P. (1994). In “Complex Dynamical Systems. The Mathematics behind the Mandelbrot and Julia Sets” (Devaney, R., ed.), pp. 139–154, American Mathematical Society, Providence, RI.
Bochner, S. (1955). “Harmonic Analysis and the Theory of Probability,” University of California Press, Berkeley, CA.
Bowen, R. (1975). “Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms,” Springer-Verlag, New York.
Bunde, A., and Havlin, S. (1991). “Fractals and Disordered Systems,” Springer-Verlag, New York.
Cartwright, M., and Littlewood, L. (1945). J. Lond. Math. Soc. 20, 180–189.
Cherbit, G. (1987). “Fractals. Non-integral Dimensions and Applications,” Wiley, Chichester, UK.
Chorin, J. (1981). Commun. Pure Appl. Math. 34, 853–866.
Crilly, A., Earnshaw, R., and Jones, H. (1991). “Fractals and Chaos,” Springer-Verlag, New York.
Crilly, A., Earnshaw, R., and Jones, H. (1993). “Applications of Fractals and Chaos,” Springer-Verlag, New York.
Curry, J., Garnett, L., and Sullivan, D. (1983). Commun. Math. Phys. 91, 267–277.
Dekking, F. M. (1982). Adv. Math. 44, 78–104.
Devaney, R. (1989). “An Introduction to Chaotic Dynamical Systems,” 2nd ed., Addison-Wesley, Reading, MA.
Devaney, R. (1990). “Chaos, Fractals, and Dynamics. Computer Experiments in Mathematics,” Addison-Wesley, Reading, MA.
Devaney, R. (1992). “A First Course in Chaotic Dynamical Systems. Theory and Experiment,” Addison-Wesley, Reading, MA.
Devaney, R. (ed.). (1994). “Complex Dynamical Systems. The Mathematics Behind the Mandelbrot and Julia Sets,” American Mathematical Society, Providence, RI.
Devaney, R., and Keen, L. (1989). “Chaos and Fractals. The Mathematics Behind the Computer Graphics,” American Mathematical Society, Providence, RI.
Douady, A., and Hubbard, J. (1984). “Étude dynamique des polynômes complexes. I, II,” Publications Mathematiques d’Orsay, Orsay, France.
Douady, A., and Hubbard, J. (1985). Ann. Sci. Ecole Norm. Sup. 18, 287–343.
Edgar, G. (1990). “Measure, Topology, and Fractal Geometry,” Springer-Verlag, New York.
Edgar, G. (1993). “Classics on Fractals,” Addison-Wesley, Reading, MA.
Edgar, G. (1998). “Integral, Probability, and Fractal Measures,” Springer-Verlag, New York.
Eglash, R. (1999). “African Fractals. Modern Computing and Indigenous Design,” Rutgers University Press, New Brunswick, NJ.
Encarnação, J., Peitgen, H.-O., Sakas, G., and Englert, G. (1992). “Fractal Geometry and Computer Graphics,” Springer-Verlag, New York.
Epstein, D. (1986). “Low-Dimensional Topology and Kleinian Groups,” Cambridge University Press, Cambridge.
Evertsz, C., Peitgen, H.-O., and Voss, R. (eds.). (1996). “Fractal Geometry and Analysis. The Mandelbrot Festschrift, Curaçao 1995,” World Scientific, Singapore.
Falconer, K. (1985). “The Geometry of Fractal Sets,” Cambridge University Press, Cambridge.
Falconer, K. (1987). Math. Intelligencer 9, 24–27.
Falconer, K. (1990). “Fractal Geometry. Mathematical Foundations and Applications,” Wiley, Chichester, UK.
Falconer, K. (1997). “Techniques in Fractal Geometry,” Wiley, Chichester, UK.
Family, F., and Vicsek, T. (1991). “Dynamics of Fractal Surfaces,” World Scientific, Singapore.
Feder, J. (1988). “Fractals,” Plenum Press, New York.
Feder, J., and Aharony, A. (1990). “Fractals in Physics. Essays in Honor of B. B. Mandelbrot,” North-Holland, Amsterdam.
Fisher, Y. (1995). “Fractal Image Compression. Theory and Application,” Springer-Verlag, New York.
Flake, G. (1998). “The Computational Beauty of Nature. Computer Explorations of Fractals, Chaos, Complex Systems, and Adaptation,” MIT Press, Cambridge, MA.
Fleischmann, M., Tildesley, D., and Ball, R. (1990). “Fractals in the Natural Sciences,” Princeton University Press, Princeton, NJ.
Frame, M., and Mandelbrot, B. (2001). “Fractals, Graphics, and Mathematics Education,” Mathematical Association of America, Washington, DC.
Frostman, O. (1935). Meddel. Lunds. Univ. Math. Sem. 3, 1–118.
Gazalé, M. (1990). “Gnomon. From Pharaohs to Fractals,” Princeton University Press, Princeton, NJ.
Grist, R. (1997). Topology 36, 423–448.
Gulick, D. (1992). “Encounters with Chaos,” McGraw-Hill, New York.
Hadamard, J. (1898). J. Mathematiques 5, 27–73.
Hardy, G. (1916). Trans. Am. Math. Soc. 17, 322–323.
Hastings, H., and Sugihara, G. (1993). “Fractals. A User’s Guide for the Natural Sciences,” Oxford University Press, Oxford.
Hovi, J.-P., Aharony, A., Stauffer, D., and Mandelbrot, B. B. (1996). Phys. Rev. Lett. 77, 877–880.
Hutchinson, J. E. (1981). Ind. Univ. J. Math. 30, 713–747.
Keen, L. (1994). In “Complex Dynamical Systems. The Mathematics behind the Mandelbrot and Julia Sets” (Devaney, R., ed.), pp. 139–154, American Mathematical Society, Providence, RI.
Keen, L., and Series, C. (1993). Topology 32, 719–749.
Keen, L., Maskit, B., and Series, C. (1993). J. Reine Angew. Math. 436, 209–219.
Kigami, J. (1989). Jpn. J. Appl. Math. 8, 259–290.
Lapidus, M. (1995). Fractals 3, 725–736.
Lapidus, M., Neuberger, J., Renka, R., and Griffith, C. (1996). Int. J. Bifurcation Chaos 6, 1185–1210.
Lasota, A., and Mackey, M. (1994). “Chaos, Fractals, and Noise. Stochastic Aspects of Dynamics,” 2nd ed., Springer-Verlag, New York.
Lawler, G., Schramm, O., and Werner, W. (2000). Acta Math., to appear [xxx.lanl.gov/abs/math.PR/0010165].
Lei, T. (2000). “The Mandelbrot Set, Theme and Variations,” Cambridge University Press, Cambridge.
Le Méhauté, A. (1990). “Fractal Geometries. Theory and Applications,” CRC Press, Boca Raton, FL.
Levinson, N. (1949). Ann. Math. 50, 127–153.
Lorenz, E. (1963). J. Atmos. Sci. 20, 130–141.
Lu, N. (1997). “Fractal Imaging,” Academic Press, San Diego.
Lyubich, M. (2001). Ann. Math., to appear.
Mandelbrot, B. (1975). C. R. Acad. Sci. Paris 280A, 1075–1078.
Mandelbrot, B. (1975, 1984, 1989, 1995). “Les objets fractals,” Flammarion, Paris.
Mandelbrot, B. (1976). C. R. Acad. Sci. Paris 282A, 119–120.
Mandelbrot, B. (1980). Ann. N. Y. Acad. Sci. 357, 249–259.
Mandelbrot, B. (1982). “The Fractal Geometry of Nature,” Freeman, New York.
Mandelbrot, B. (1984). J. Stat. Phys. 34, 895–930.
Mandelbrot, B. (1985). In “Chaos, Fractals, and Dynamics” (Fischer, P., and Smith, W., eds.), pp. 235–238, Marcel Dekker, New York.
Mandelbrot, B. (1995). J. Fourier Anal. Appl. 1995, 409–432.
Mandelbrot, B. (1997a). “Fractals and Scaling in Finance. Discontinuity, Concentration, Risk,” Springer-Verlag, New York.
Mandelbrot, B. (1997b). “Fractales, Hasard et Finance,” Flammarion, Paris.
Mandelbrot, B. (1999). “Multifractals and 1/f Noise. Wild Self-Affinity in Physics,” Springer-Verlag, New York.
Mandelbrot, B. (2001a). Quant. Finance 1, 113–123, 124–130.
Mandelbrot, B. (2001b). “Gaussian Self-Affinity and Fractals: Globality, the Earth, 1/f Noise, & R/S,” Springer-Verlag, New York.
Mandelbrot, B. (2001c). “Fractals and Chaos and Statistical Physics,” Springer-Verlag, New York.
Mandelbrot, B. (2001d). “Fractals Tools,” Springer-Verlag, New York.
Mandelbrot, B. B., and Stauffer, D. (1994). J. Phys. A 27, L237–L242.
Mandelbrot, B. B., Vespignani, A., and Kaufman, H. (1995). Europhys. Lett. 32, 199–204.
Maskit, B. (1988). “Kleinian Groups,” Springer-Verlag, New York.
Massopust, P. (1994). “Fractal Functions, Fractal Surfaces, and Wavelets,” Academic Press, San Diego, CA.
Mattila, P. (1995). “Geometry of Sets and Measures in Euclidean Space. Fractals and Rectifiability,” Cambridge University Press, Cambridge.
McCauley, J. (1993). “Chaos, Dynamics and Fractals. An Algorithmic Approach to Deterministic Chaos,” Cambridge University Press, Cambridge.
McLaughlin, J. (1987). Proc. Am. Math. Soc. 100, 183–186.
McMullen, C. (1994). “Complex Dynamics and Renormalization,” Princeton University Press, Princeton, NJ.
McShane, G., Parker, J., and Redfern, I. (1994). Exp. Math. 3, 153–170.
Meakin, P. (1996). “Fractals, Scaling and Growth Far from Equilibrium,” Cambridge University Press, Cambridge.
Milnor, J. (1989). In “Computers in Geometry and Topology” (Tangora, M., ed.), pp. 211–257, Marcel Dekker, New York.
Moon, F. (1984). Phys. Rev. Lett. 53, 962–964.
Moon, F. (1992). “Chaotic and Fractal Dynamics. An Introduction for Applied Scientists and Engineers,” Wiley-Interscience, New York.
Parker, J. (1995). Topology 34, 489–496.
Peak, D., and Frame, M. (1994). “Chaos Under Control. The Art and Science of Complexity,” Freeman, New York.
Peitgen, H.-O. (1989). “Newton’s Method and Dynamical Systems,” Kluwer, Dordrecht.
Peitgen, H.-O., and Richter, P. H. (1986). “The Beauty of Fractals,” Shlesinger, M., Zaslavsky, G., and Frisch, U. (1995). “Lévy Flights and
Springer-Verlag, New York. Related Topics in Physics,” Springer-Verlag, New York.
Peitgen, H.-O., and Saupe, D. (1988). “The Science of Fractal Images,” Sinai, Y. (1972). Russ. Math. Surv. 27, 21–70.
Plenum Press, New York. Smale, S. (1963). In “Differential and Combinatorial Topology”
Peitgen, H.-O., Jürgens, H., and Saupe, D. (1992). “Chaos and Fractals: (Cairns, S., ed.), pp. 63–80, Princeton University Press, Princeton,
New Frontiers of Science,” Springer-Verlag, New York. NJ.
Peitgen, H.-O., Rodenhausen, A., and Skordev, G. (1998). Fractals 6, Stauffer, D., and Aharony, A. (1992). “Introduction to Percolation The-
371–394. ory,” 2nd ed., Taylor and Francis, London.
Pietronero, L. (1989). “Fractals Physical Origins and Properties,” North- Strogatz, S. (1994). “Nonlinear Dynamics and Chaos, with Applica-
Holland, Amsterdam. tions to Chemistry, Physics, Biology, Chemistry, and Engineering,”
Pietronero, L., and Tosatti, E. (1986). “Fractals in Physics,” North- Addison-Wesley, Reading, MA.
Holland, Amsterdam. Sullivan, D. (1985). Ann. Math. 122, 410–418.
Poincaré, H. (1890). Acta Math. 13, 1–271. Tan, Lei (1984). In “Étude dynamique des polynômes complexes”
Pollard-Gott, L. (1986). Language Style 18, 233–249. (Douardy, A., and Hubbard, J., eds.), Vol. II, pp. 139–152, Publi-
Rogers, C. (1970). “Hausdorff Measures,” Cambridge University Press, cations Mathematiques d’Orsay, Orsay, France.
Cambridge. Tan, Lei. (2000). “The Mandelbrot Set, Theme and Variations,”
Ruelle, D. (1978). “Thermodynamic Formalism: The Mathematical Cambridge University Press, Cambridge.
Structures of Classical Equilibrium Statistical Mechanics,” Addison- Thurston, W. (1997). “Three-Dimensional Geometry and Topology,”
Wesley, Reading, MA. Princeton University Press, Princeton, NJ.
Ruelle, D., and Takens, F. (1971). Commun. Math. Phys. 20, 167–192. Tricot, C. (1982). Math. Proc. Camb. Philos. Soc. 91, 54–74.
Samorodnitsky, G., and Taqqu, M. (1994). “Stable Non-Gaussian Ran- Vicsek, T. (1992). “Fractal Growth Phenomena,” 2nd ed., World Scien-
dom Processes. Stochastic Models with Infinite Variance,” Chapman tific, Singapore.
and Hall, New York. Voss, R., and Clarke, J. (1975). Nature 258, 317–318.
Sapoval, B. (1989). Physica D 38, 296–298. West, B. (1990). “Fractal Physiology and Chaos in Medicine,” World
Sapoval, B., Rosso, M., and Gouyet, J. (1985). J. Phys. Lett. 46, L149– Scientific, Singapore.
L156. Weyl, H. (1912). Math. Ann. 71, 441–479.
Sapoval, B., Gobron, T., and Margolina, A. (1991). Phys. Rev. Lett. 67, Williams, R. (1983). Ergodic Theory Dynam. Syst. 4, 147–163.
2974–2977. Wiggins, S. (1990). “Introduction to Applied Nonlinear Dynamical Sys-
Scholz, C., and Mandelbrot, B. (1989). “Fractals in Geophysics,” tems and Chaos,” Springer-Verlag, New York.
Birkhäuser, Basel. Witten, T., and Sander, L. (1981). Phys. Rev. Lett. 47, 1400–1403.
Shishikura, M. M. (1994). Astérisque 222, 389–406. Witten, T., and Sander, L. (1983). Phys. Rev. B 27, 5686–5697.
P1: GRB/GWT P2: GQT Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN06M-269 June 27, 2001 12:29

Functional Analysis
C. W. Groetsch
University of Cincinnati

I. Linear Spaces
II. Linear Operators
III. Contractions
IV. Some Principles and Techniques
V. A Few Applications

GLOSSARY

Banach space A complete normed linear space.
Bounded linear operator A continuous linear operator acting between normed linear spaces.
Compact linear operator A linear operator that maps weakly convergent sequences into convergent sequences.
Decomposition theorem A Hilbert space is the sum of any closed subspace and its orthogonal complement.
Fredholm alternative A theorem that characterizes the solvability of a compact linear operator equation of the second kind.
Hilbert space A complete inner product space.
Inner product A positive definite (conjugate) symmetric bilinear form defined on a linear space.
Linear operator A mapping acting between linear spaces that preserves linear combinations.
Linear functional A linear operator whose range is the field of scalars.
Norm A real-valued function on a linear space having the properties of length.
Orthogonal complement All vectors which are orthogonal to a given set.
Orthogonal vectors Vectors whose inner product is zero; a generalization of perpendicularity.
Riesz representation theorem A theorem that identifies bounded linear functionals on a Hilbert space with vectors in the space.
Spectral theorem Characterizes the range of a compact self-adjoint linear operator as the orthogonal sum of eigenspaces.

FUNCTIONAL ANALYSIS strives to bring the techniques and insights of geometry to bear on the study of functions by treating classes of functions as “spaces.” The interplay between such spaces and transformations between them is the central theme of this geometrization of mathematical analysis. The spaces in question are linear spaces, that is, collections of functions (or more general “vectors”) endowed with operations of addition and scalar multiplication that satisfy the well-known axioms of a vector space. The subject is rooted in the vector analysis of J. Willard Gibbs and the investigations of Vito Volterra and David Hilbert in the theory of linear integral equations; it enjoyed youthful vigor, spurred by applications in potential theory and quantum mechanics, in

the works of Stefan Banach, John von Neumann, and Marshall Stone on the theory of linear operators, and settled into a dignified maturity with the contributions of M. A. Naimark, S. L. Sobolev, and L. V. Kantorovich to normed rings, partial differential equations, and numerical analysis, respectively. Functional analysis provides a general context, rich in geometric and algebraic nuances, for organizing and illuminating the structure of many areas of mathematical analysis and mathematical physics. Further, the subject enables rigorous justification of mathematical principles by use of general techniques of great power and wide applicability. In recent decades functional analysis has become the lingua franca of applied mathematics, mathematical physics, and the computational sciences, and it has made inroads into statistics and the engineering sciences. In this survey we will concentrate on functional analysis in normed linear spaces, with special emphasis on Hilbert space.

I. LINEAR SPACES

Functional analysis is a geometrization of analysis accomplished via the notion of a linear space. Linear spaces provide a natural algebraic and geometric framework for the description and study of many problems in analysis.

Definition. A linear space (vector space) is a collection $V$ of elements, called vectors, along with an associated field of scalars, $F$, and two operations, vector addition (denoted “+”) and multiplication of a vector by a scalar (the scalar product of $\alpha \in F$ with $x \in V$ is denoted $\alpha x$). The set $V$ is assumed to form a commutative group under vector addition (the additive identity is called the zero vector, denoted $\theta$, and the additive inverse of a vector $x$ is denoted as $-x$). In addition, the scalar product respects the usual algebraic customs, namely:

$$1x = x,$$
$$\alpha(\beta x) = (\alpha\beta)x,$$
$$(\alpha + \beta)x = \alpha x + \beta x,$$
$$\alpha(x + y) = \alpha x + \alpha y,$$

for any $\alpha, \beta \in F$ and any $x, y \in V$, where 1 denotes the multiplicative identity in $F$.

The vector $x + (-y)$ is normally written $x - y$. We always take the field $F$ to be the field of real numbers, $\mathbb{R}$, or (less frequently) the field of complex numbers, $\mathbb{C}$. The prototype of all linear spaces is Euclidean $n$-space, that is, the vector space $\mathbb{R}^n$ consisting of all ordered $n$-tuples of real numbers with addition and scalar multiplication defined componentwise. That is, for

$$x = [x_1, x_2, \ldots, x_n]$$
$$y = [y_1, y_2, \ldots, y_n]$$

in $\mathbb{R}^n$, and $\alpha \in \mathbb{R}$, the vector operations are defined by

$$x + y = [x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n]$$
$$\alpha x = [\alpha x_1, \alpha x_2, \ldots, \alpha x_n]$$

The most useful linear spaces are function spaces. For example, the space, $F(\Omega)$, of all real-valued functions defined on a set $\Omega$ is a linear space when fitted out with the operations

$$(f + g)(s) = f(s) + g(s)$$
$$(\alpha f)(s) = \alpha f(s).$$

Taking $\Omega = \{1, 2, \ldots, n\}$ we see that, with obvious notational conventions, Euclidean $n$-space is a special case of the function space $F(\Omega)$. A more useful space is the space $C[a, b]$ consisting of all real-valued functions defined on the interval $[a, b]$ that are continuous at each point of $[a, b]$.

Definition. A subset $S$ of a linear space $V$ is called a subspace of $V$ if $\alpha x + \beta y \in S$ for all $x, y \in S$ and all $\alpha, \beta \in F$.

A subspace is therefore a subset of a linear space which is a linear space in its own right. Applying the definition iteratively, one sees that a subspace $S$ contains all linear combinations of the form

$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n$$

where $n$ is a positive integer, $\alpha_1, \alpha_2, \ldots, \alpha_n$ are arbitrary scalars and $x_1, x_2, \ldots, x_n$ are arbitrary vectors in $S$. A set of vectors $W$ is called linearly independent if no vector in $W$ can be written as a linear combination of other vectors in the set $W$. One might then say that a linearly independent set contains no “linear redundancies.” A basis for a subspace $S$ is a set of linearly independent vectors $B \subset S$ with the property that every vector in $S$ can be written as a linear combination of vectors in $B$. If a subspace has a basis consisting of finitely many vectors, then the subspace is called finite-dimensional. For example, the set of all real polynomials of degree not more than 7 is an eight-dimensional subspace of $C[a, b]$; the set of polynomials $\{1, t, t^2, \ldots, t^7\}$ is a basis for this subspace. On the other hand, the set of all real polynomials is an infinite dimensional subspace of $C[a, b]$.

Subspaces allow a layering of sets of increasingly regular vectors while preserving the linear structure of the space. For example, the set $C^1[a, b]$ of all real-valued functions on $[a, b]$ which also have a continuous derivative is a subspace of $C[a, b]$, while $C_0^2[a, b]$, the space of all twice continuously differentiable real-valued functions
on $[a, b]$ that vanish at the endpoints of the interval, is a subspace of $C^1[a, b]$.

The space $AC[a, b]$ of absolutely continuous functions is another important subspace of $C[a, b]$. A function $f$ is called absolutely continuous if the sum $\sum |f(b_i) - f(a_i)|$ can be made arbitrarily small for all finite collections of subintervals $\{[a_i, b_i]\}$ whose total length $\sum (b_i - a_i)$ is sufficiently small. The importance of absolutely continuous functions resides in the fact that they are precisely those functions which have integrable (but not necessarily continuous) derivatives on $[a, b]$. Functions in $C^1[a, b]$ are absolutely continuous because such functions have uniformly bounded derivatives. We therefore have the following relationships for these subspaces of $C[a, b]$:

$$C^1[a, b] \subset AC[a, b] \subset C[a, b].$$

A somewhat less specialized notion than a subspace, that of a convex set, is crucial in various applications of functional analysis.

Definition. A subset $K$ of a linear space is called convex if $(1 - t)x + ty \in K$ for all $x, y \in K$ and all real numbers $t$ with $0 \le t \le 1$.

Convex sets have a natural geometric interpretation. Given two vectors $x$ and $y$, the set of vectors of the form $\{(1 - t)x + ty : t \in [0, 1]\}$ is called, in analogy with the corresponding situation in Euclidean space, the segment between $x$ and $y$. A convex set is then a subset of a linear space which contains the segment between any two of its vectors. Of course, every subspace is a convex set.

A. Normed Linear Spaces

Vector spaces become more serviceable when a metric fabric is stretched over the linear framework. Such a metric structure is provided by a function defined on the space which generalizes the concept of length in Euclidean spaces.

Definition. A nonnegative real-valued function $\|\cdot\|$ defined on a linear space $V$ is called a norm if:

(i) $\|x\| = 0$ if and only if $x = \theta$ (the zero vector in $V$),
(ii) $\|x + y\| \le \|x\| + \|y\|$, for all $x, y \in V$,
(iii) $\|tx\| = |t|\,\|x\|$, for all scalars $t$ and all $x \in V$.

A linear space that comes bundled with a norm is called a normed linear space. For specificity, we will sometimes indicate a linear space $V$ equipped with a norm $\|\cdot\|$ by the pair $(V, \|\cdot\|)$. For example, $(C[a, b], \|\cdot\|_\infty)$ is a normed linear space under the so-called uniform norm

$$\|f\|_\infty = \max\{|f(t)| : t \in [a, b]\}.$$

The space $C[a, b]$ is also a normed linear space when endowed with the norm

$$\|f\|_1 = \int_a^b |f(t)|\,dt$$

and hence $(C[a, b], \|\cdot\|_1)$ is another distinct normed space consisting of the same set of vectors as the normed linear space $(C[a, b], \|\cdot\|_\infty)$. As another example, $C_0^1[a, b]$ is a normed linear space when equipped with the norm $\|f\| = \|f'\|_\infty$, where $f'$ is the derivative of $f$.

A norm allows the possibility of measuring the distance between two vectors and it also gives meaning to “equality in the limit,” that is, to convergence.

Definition. A sequence $\{x_n\}$ of vectors is said to converge to a vector $x$, with respect to the norm $\|\cdot\|$, if $\|x_n - x\| \to 0$ as $n \to \infty$.

The notion of convergence is norm-dependent. For example, let $l_n$ denote the “spike” function in $C[0, 1]$ whose graph is obtained by connecting the points $(0, 0)$, $(\frac{1}{2n}, 1)$, $(\frac{1}{n}, 0)$, $(1, 0)$ with straight line segments. The sequence $\{l_n\}$ converges with respect to the $\|\cdot\|_1$ norm to the zero function since $\|l_n\|_1 = \frac{1}{2n}$. However, this sequence does not converge to the zero function in the uniform norm since $\|l_n\|_\infty = 1$. But note that every sequence that converges with respect to the uniform norm also converges with respect to the norm $\|\cdot\|_1$ since $\|f\|_1 \le \|f\|_\infty$ for all $f \in C[0, 1]$. As another example, consider the space $C_0^1[a, b]$ with the norm $\|f\| = \|f'\|_2$. Wirtinger’s inequality asserts that if $f \in C_0^1[a, b]$, then

$$\pi^2 \int_a^b |f(t)|^2\,dt \le (b - a)^2 \int_a^b |f'(t)|^2\,dt.$$

Hence, if $\|\cdot\|_2$ is the norm on $C_0^1[a, b]$ defined by

$$\|f\|_2 = \sqrt{\int_a^b |f(t)|^2\,dt}$$

then $m\|f\|_2 \le \|f\|$, where $m = \pi/(b - a)$. Therefore, convergence in the norm $\|\cdot\|$ implies convergence in the norm $\|\cdot\|_2$. Wirtinger’s inequality can be generalized to functions defined on domains in $\mathbb{R}^n$; the resulting inequality is known as Poincaré’s inequality.

Two norms $\|\cdot\|_a$ and $\|\cdot\|_b$ on a linear space $V$ are called equivalent if there are positive constants $c_1$ and $c_2$ such that

$$c_1\|x\|_a \le \|x\|_b \le c_2\|x\|_a$$

for all $x \in V$. Equivalence therefore means equivalent in the sense of convergence: a sequence converges with respect to the norm $\|\cdot\|_a$ if and only if it converges with respect to the norm $\|\cdot\|_b$. It can be shown that any two norms on a finite-dimensional linear space are equivalent and hence when speaking of convergence in a finite-dimensional space one can dispense with mentioning the norm. However, as we have seen above, convergence is norm-dependent in infinite dimensional spaces.
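The norm dependence of convergence described above for the “spike” functions $l_n$ can be checked numerically. The following is a minimal sketch in plain Python; the helper names `spike`, `l1_norm`, and `sup_norm` are hypothetical, and the norms are only approximated on a uniform grid (midpoint rule for the integral, grid maximum for the uniform norm), so it illustrates the computation rather than proving anything.

```python
def spike(n):
    """Return l_n, the piecewise-linear function on [0, 1] through
    (0, 0), (1/(2n), 1), (1/n, 0), (1, 0)."""
    def l(t):
        if t <= 0.5 / n:
            return 2.0 * n * t          # rising edge
        if t <= 1.0 / n:
            return 2.0 - 2.0 * n * t    # falling edge
        return 0.0                      # zero on the rest of [0, 1]
    return l


def l1_norm(f, a=0.0, b=1.0, steps=20_000):
    """Approximate the integral of |f| over [a, b] by the midpoint rule."""
    h = (b - a) / steps
    return h * sum(abs(f(a + (i + 0.5) * h)) for i in range(steps))


def sup_norm(f, a=0.0, b=1.0, steps=20_000):
    """Approximate max |f| over [a, b] by sampling a uniform grid."""
    h = (b - a) / steps
    return max(abs(f(a + i * h)) for i in range(steps + 1))


for n in (1, 10, 100):
    l_n = spike(n)
    print(n, l1_norm(l_n), sup_norm(l_n))
```

The second printed column shrinks like $\frac{1}{2n}$ while the third stays at 1, matching the statement that $\{l_n\}$ converges to the zero function in $\|\cdot\|_1$ but not in the uniform norm.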

Definition. The closure, $\overline{W}$, of a subset $W$ of a normed linear space $(V, \|\cdot\|)$ is the set consisting of elements of $W$ along with all limits of convergent sequences in $W$. A set which is its own closure is called closed.

For example, the closure of the set of all polynomials in the space $(C[a, b], \|\cdot\|_\infty)$ is, by virtue of the Weierstrass approximation theorem, the space $C[a, b]$ itself. Also, for example, the set of nonnegative functions in $C[a, b]$ is a closed subset of $C[a, b]$.

B. Banach Spaces

Certain sequences of vectors in a normed linear space tend to “bunch up” in a way that imitates convergent sequences. Such sequences are named after the nineteenth century mathematician Augustin Cauchy.

Definition. A sequence of vectors $\{x_n\}$ in a normed linear space $(V, \|\cdot\|)$ is called a Cauchy sequence if $\lim_{n,m\to\infty} \|x_n - x_m\| = 0$.

It is easy to see that every convergent sequence is Cauchy; however, it is not necessarily the case that a Cauchy sequence is convergent. Consider, for example, the “ramp” function $h_n$ in $C[-1, 1]$ whose graph consists of the straight line segments connecting the points $(-1, 0)$, $(-\frac{1}{n}, 0)$, $(0, 1)$, $(1, 1)$. The sequence of ramp functions is Cauchy with respect to the norm $\|\cdot\|_1$ since

$$\|h_n - h_m\|_1 = \frac{1}{2}\left|\frac{1}{n} - \frac{1}{m}\right| \to 0, \quad \text{as } n, m \to \infty.$$

However, $\{h_n\}$ does not converge, with respect to the norm $\|\cdot\|_1$, to a function in $C[-1, 1]$ (the $\|\cdot\|_1$ limit of this sequence of ramp functions is the discontinuous Heaviside function). Spaces, such as $(C[-1, 1], \|\cdot\|_1)$, which do not accommodate limits of Cauchy sequences are in some sense “incomplete.” A normed linear space is called complete if every Cauchy sequence in the space converges to some vector in the space.

Definition. A Banach space is a complete normed linear space.

We have just seen that $C[a, b]$ with the norm $\|\cdot\|_1$ is not a Banach space. However, $C[a, b]$ with the uniform norm is a Banach space. The Lebesgue spaces are particularly important Banach spaces. For $1 \le p < \infty$ the space $L^p[a, b]$ consists of measurable real-valued functions $f$ defined on $[a, b]$ such that $|f|^p$ has a finite Lebesgue integral. The norm $\|\cdot\|_p$ on $L^p[a, b]$ is defined by

$$\|f\|_p = \left(\int_a^b |f(t)|^p\,dt\right)^{1/p}.$$

Every normed linear space can be imbedded as a dense subspace of a Banach space by an abstract process of “completion.” Essentially, the completion process involves adjoining to the original space, for each Cauchy sequence in the original space, a vector in an extended space (the completion) which is the limit in the extended space of the Cauchy sequence. For example, it can be shown that the completion of the space $(C[a, b], \|\cdot\|_1)$ is the space $(L^1[a, b], \|\cdot\|_1)$.

C. Hilbert Spaces

Perpendicularity, the key ingredient in the Pythagorean theorem, is one of the fundamental concepts in geometry. A successful geometrization of analysis requires the incorporation of this concept. The essential aspects of perpendicularity are captured in special normed linear spaces called inner product spaces.

Definition. An inner product on a real linear space $V$ is a function $\langle\cdot,\cdot\rangle : V \times V \to \mathbb{R}$ which satisfies:

(i) $\langle x, x\rangle \ge 0$ for all $x \in V$,
(ii) $\langle x, x\rangle = 0$ if and only if $x = \theta$ (the zero vector),
(iii) $\langle x, y\rangle = \langle y, x\rangle$ for all $x, y \in V$,
(iv) $\langle tx, y\rangle = t\langle x, y\rangle$ for all $x, y \in V$ and all $t \in \mathbb{R}$,
(v) $\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle$, for all $x, y, z \in V$.

Properties (iii)–(v) are summarized by saying that $\langle\cdot,\cdot\rangle$ is a symmetric bilinear form; properties (i) and (ii) say that $\langle\cdot,\cdot\rangle$ is nonnegative and definite. A linear space endowed with an inner product is called an inner product space.

The most familiar inner product space is the Euclidean space $\mathbb{R}^n$ with the inner product

$$\langle x, y\rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n.$$

Of course other inner products may be used on the same underlying linear space. For example, in statistics, the Euclidean space is often used with a weighted inner product

$$\langle x, y\rangle = w_1 x_1 y_1 + w_2 x_2 y_2 + \cdots + w_n x_n y_n$$

where $w_1, w_2, \ldots, w_n$ are fixed positive weights.

Function spaces serve up a particularly rich stew of inner product spaces. For example, the theory of Fourier series can be developed in the Lebesgue space $L^2[a, b]$ with the inner product

$$\langle f, g\rangle = \int_a^b f(t)g(t)\,dt$$
and many variations are possible. For instance, the space $C^1[0, 1]$ with

$$\langle f, g\rangle = \int_0^1 f'(t)g'(t)\,dt + f(0)g(0)$$

is an inner product space.

In the case of a linear space over the field $\mathbb{C}$ of complex scalars the definition of an inner product requires some modification. Here the inner product is a complex valued function that satisfies the properties above, save that (iii) is replaced by

$$\langle x, y\rangle = \overline{\langle y, x\rangle},$$

where the bar indicates complex conjugation, and (iv) is required to hold for all $t \in \mathbb{C}$. Note that this implies that $\langle x, ty\rangle = \bar{t}\langle x, y\rangle$.

Every inner product on a linear space generates a corresponding norm by way of the definition $\|x\| = \sqrt{\langle x, x\rangle}$. The proof that this defines a norm relies on the Cauchy-Schwarz inequality:

$$|\langle x, y\rangle|^2 \le \langle x, x\rangle\langle y, y\rangle.$$

Any norm induced by an inner product must satisfy, by virtue of the bilinearity of the inner product, the Parallelogram Law:

$$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2.$$

Therefore, any norm which does not satisfy the parallelogram law is not induced by an inner product. For example, the space $(C[-1, 1], \|\cdot\|_\infty)$ is not an inner product space since the parallelogram law for $\|\cdot\|_\infty$ fails for the continuous ramp functions $h_1$ and $h_2$ introduced above.

Definition. Two vectors $x$ and $y$ in an inner product space are called orthogonal, denoted $x \perp y$, if $\langle x, y\rangle = 0$. A subset $S$ of an inner product space is called an orthogonal set if $x \perp y$ for each distinct pair of vectors $x, y \in S$.

From the properties of the inner product it follows immediately that

$$x \perp y \quad \text{if and only if} \quad \|x + y\|^2 = \|x\|^2 + \|y\|^2.$$

This extension of the classical Pythagorean theorem to inner product spaces is what justifies our association of orthogonality, a purely algebraic concept, with the geometrical notion of perpendicularity.

Given a subset $S$ of an inner product space $V$, the orthogonal complement of $S$ is the closed subspace

$$S^\perp = \{y \in V : \langle x, y\rangle = 0 \text{ for all } x \in S\}.$$

For example, consider the space $C^1[0, 1]$ of continuously differentiable real functions on $[0, 1]$ equipped with the inner product

$$\langle f, g\rangle = \int_0^1 f'(t)g'(t)\,dt + f(0)g(0).$$

Let $S$ be the set of linear functions. Then $g \in S^\perp$ if and only if

$$0 = \langle 1, g\rangle = g(0) \quad \text{and} \quad 0 = \langle t, g\rangle = \int_0^1 g'(t)\,dt = g(1) - g(0).$$

Therefore, $S^\perp = C_0^1[0, 1]$, the space of continuously differentiable functions which vanish at the end points of $[0, 1]$. If $W$ is a subspace, then its second orthogonal complement is its closure: $W^{\perp\perp} = \overline{W}$.

Definition. A Hilbert space is a complete inner product space.

The best known Hilbert space is the space $L^2[a, b]$. The fact that this space is complete is known as the Riesz-Fischer Theorem. The Sobolev spaces are important Hilbert spaces used in the theory of differential equations. Let $\Omega$ be a bounded domain in $\mathbb{R}^n$. $C^m(\Omega)$ denotes the space of all continuous real-valued functions on $\Omega$ having bounded continuous partial derivatives through order $m$. An inner product $\langle\cdot,\cdot\rangle_m$ is defined on $C^m(\Omega)$ by

$$\langle f, g\rangle_m = \sum_{|\alpha| \le m} \langle D^\alpha f, D^\alpha g\rangle$$

where $\langle\cdot,\cdot\rangle$ denotes the usual $L^2$ inner product, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ is a multi-index of nonnegative integers, $|\alpha| = \alpha_1 + \cdots + \alpha_n$, and $D^\alpha$ is the differential operator

$$D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_n^{\alpha_n} \cdots \partial x_1^{\alpha_1}}.$$

The Sobolev space $H^m(\Omega)$ is the completion of $C^m(\Omega)$ with respect to the norm $\|\cdot\|_m$ generated by the inner product $\langle\cdot,\cdot\rangle_m$. A closely associated Sobolev space is the space $H_0^m(\Omega)$ which is formed in the same way, but using as base space the space $C_0^m(\Omega)$ of all functions in $C^m(\Omega)$ which vanish off of some closed bounded subset of $\Omega$ (that is, the space of all compactly supported functions in $C^m(\Omega)$). The advantage of using such spaces in the study of differential equations lies in the fact that in these spaces ordinary functions are very smooth, convergence is very strict (convergence in $H^m(\Omega)$ implies uniform convergence on compact subsets of all derivatives through order $m - 1$), and Sobolev spaces are complete, allowing the deployment of all the weapons of Hilbert space theory. In particular, Hilbert space theory can be used to develop a rich theory of weak solutions of partial differential equations and a corresponding theory of finite element approximations to weak solutions.
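The claim made earlier in this section, that the parallelogram law fails for the uniform norm on $C[-1, 1]$ when applied to the ramp functions $h_1$ and $h_2$, can be checked with a short computation. This is a sketch only: the names `ramp` and `sup_norm` are hypothetical helpers, and the uniform norm is approximated by sampling a uniform grid that contains the breakpoints of both functions.

```python
def ramp(n):
    """Return h_n, the piecewise-linear function on [-1, 1] through
    (-1, 0), (-1/n, 0), (0, 1), (1, 1)."""
    def h(t):
        if t <= -1.0 / n:
            return 0.0
        if t <= 0.0:
            return 1.0 + n * t
        return 1.0
    return h


def sup_norm(f, a=-1.0, b=1.0, steps=4_000):
    """Approximate max |f| over [a, b] by sampling a uniform grid."""
    step = (b - a) / steps
    return max(abs(f(a + i * step)) for i in range(steps + 1))


h1, h2 = ramp(1), ramp(2)

# Left and right sides of the parallelogram law in the uniform norm.
lhs = sup_norm(lambda t: h1(t) + h2(t)) ** 2 + sup_norm(lambda t: h1(t) - h2(t)) ** 2
rhs = 2 * sup_norm(h1) ** 2 + 2 * sup_norm(h2) ** 2

print(lhs, rhs)
```

Here $\|h_1 + h_2\|_\infty = 2$, $\|h_1 - h_2\|_\infty = \frac{1}{2}$, and $\|h_1\|_\infty = \|h_2\|_\infty = 1$, so the left side is $4.25$ while the right side is $4$; the two sides disagree, as the text asserts.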

An orthogonal set of unit vectors (i.e., vectors of norm one) is called an orthonormal set. If $S$ is an orthonormal set in a Hilbert space $H$, then for any $y \in H$ at most countably many of the numbers $\langle y, x\rangle$, where $x \in S$, are nonzero and

$$\sum_{x \in S} |\langle y, x\rangle|^2 \le \|y\|^2$$

(this is called Bessel’s inequality). The numbers $\{\langle y, x\rangle : x \in S\}$ are called the Fourier coefficients of $y$ with respect to the orthonormal set $S$. If the orthonormal set $S$ has the property that the Fourier coefficients of a vector completely determine the vector in the sense that, for any vector $y \in H$,

$$\langle y, x\rangle = 0 \text{ for all } x \in S \quad \text{implies} \quad y = \theta,$$

then $S$ is called a complete orthonormal set.

Definition. A Hilbert space is called separable if it contains a sequence which is a complete orthonormal set. Any such complete orthonormal sequence is called a basis for the Hilbert space.

It is not hard to see that an orthonormal sequence $\{\varphi_n\}$ in a Hilbert space $H$ is complete if and only if

$$\sum_n |\langle y, \varphi_n\rangle|^2 = \|y\|^2$$

for each $y \in H$ (Parseval’s identity). This is equivalent to the assertion

$$\sum_n \langle y, \varphi_n\rangle \varphi_n = y$$

where the convergence of the series is understood in the sense of the norm generated by the inner product $\langle\cdot,\cdot\rangle$. For example, the complex space $L^2[-\pi, \pi]$ with inner product

$$\langle f, g\rangle = \int_{-\pi}^{\pi} f(t)\overline{g(t)}\,dt$$

is a separable Hilbert space with complete orthonormal sequence

$$\varphi_n(t) = \frac{1}{\sqrt{2\pi}} e^{int}, \quad n = 0, \pm 1, \pm 2, \ldots.$$

Note that these orthonormal vectors are integer dilates of a single complex wave, that is,

$$\varphi_n(t) = \varphi(nt),$$

where

$$\varphi(t) = \frac{1}{\sqrt{2\pi}} e^{it} = \frac{1}{\sqrt{2\pi}}(\cos t + i \sin t).$$

None of the functions $\varphi_n$ just defined lies in the space $L^2(-\infty, \infty)$ because the waves do not “die out” as $|t| \to \infty$. Special bases for $L^2(-\infty, \infty)$ that have particularly attractive decay properties, called wavelet bases, have been studied and applied extensively in recent years. We now show how to construct the simplest wavelet basis, the Haar basis. As in the previous example, we seek a single function to generate all the basis vectors, but instead of a wave, we choose a function that decays quickly, in fact a function that vanishes off an interval. Our choice is the Haar “wavelet”

$$\psi(t) = \begin{cases} 1 & \text{for } 0 \le t < \frac{1}{2} \\ -1 & \text{for } \frac{1}{2} \le t < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Taking dilates of this wavelet will serve the same purpose as the higher frequency waves in the previous example, that is, the dilates of the fundamental wave $\varphi$. However, this by itself will not serve to represent functions defined over the entire line. To do that we must also shift the dilates about. If this is done properly, then all the needed time and frequency attributes are captured and the resulting wavelets are orthonormal. Specifically, it can be shown that the functions $\{\psi_{j,k} : j, k = 0, \pm 1, \pm 2, \ldots\}$ defined by

$$\psi_{j,k}(t) = 2^{j/2}\,\psi(2^j t - k)$$

form an orthonormal basis for $L^2(-\infty, \infty)$ called the Haar basis.

II. LINEAR OPERATORS

Mappings between linear spaces that preserve the linear structure are called linear operators.

Definition. A mapping $T$ defined on a linear space $V$ and taking values in a linear space $W$ is called linear if

$$T(x + y) = Tx + Ty, \quad \text{for all } x, y \in V$$

and

$$T(tx) = tTx, \quad \text{for all } x \in V \text{ and all scalars } t.$$

The space $V$ on which a linear operator $T$ is defined is called the domain of $T$ and denoted $D(T)$. In finite-dimensional spaces linear operators have matrix representations relative to specific ordered bases. For example, the differentiation operator acting on the space of polynomials of degree not greater than $n$ has, relative to the basis $\{1, t, \ldots, t^n\}$, the $(n + 1) \times (n + 1)$ matrix representation

$$\begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & n \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}$$
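The matrix representation of the differentiation operator relative to the basis $\{1, t, \ldots, t^n\}$ can be built and applied directly. The sketch below is illustrative only: `differentiation_matrix` and `apply_matrix` are hypothetical helper names, and a polynomial is represented, by assumption, as its list of coefficients in order of increasing degree.

```python
def differentiation_matrix(n):
    """Return the (n+1) x (n+1) matrix of d/dt on polynomials of degree
    at most n, relative to the ordered basis 1, t, ..., t^n."""
    D = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        # d/dt t^j = j * t^(j-1): column j has the entry j in row j-1.
        D[j - 1][j] = j
    return D


def apply_matrix(D, coeffs):
    """Multiply the matrix D by a coefficient (column) vector."""
    return [sum(row[j] * coeffs[j] for j in range(len(coeffs))) for row in D]


# p(t) = 3 + 2t + 5t^3, stored as coefficients of 1, t, t^2, t^3.
D = differentiation_matrix(3)
p = [3, 2, 0, 5]
dp = apply_matrix(D, p)   # coefficients of p'(t) = 2 + 15t^2
print(dp)
```

The last row of the matrix is zero because differentiation lowers the degree, so no polynomial of degree at most $n$ has a derivative with a $t^n$ term.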

In functional analysis the primary interest is the study of linear operators on infinite dimensional function spaces that are defined intrinsically, that is, without regard to a specific basis. For example, given a real-valued function $k(\cdot,\cdot)$ which is continuous on the square $[0, 1] \times [0, 1]$ one can define the linear integral operator $T$ on $C[0, 1]$, taking values in $C[0, 1]$, by $Tf = g$ where

$$g(s) = \int_0^1 k(s, t) f(t)\,dt.$$

Such integral operators may be viewed as natural generalizations of matrices to the case of continuous, rather than discrete, variables.

Two important subspaces, the nullspace and range, are associated with each linear operator. The nullspace, $N(T)$, of a linear operator $T$ consists of all vectors that $T$ maps to the zero vector: $N(T) = \{x \in D(T) : Tx = \theta\}$. For example, the differentiation operator acting on the space $C^1[a, b]$ has nullspace consisting of all constant functions. The range, $R(T)$, of a linear operator $T$ is the set of all images of vectors under $T$, that is, $R(T) = \{Tx : x \in D(T)\}$. For example, the range of the integral operator on $C[0, 1]$ generated by the kernel $k(s, t) = \exp(s + t)$ consists of all scalar multiples of the exponential function $f(s) = \exp(s)$.

A. Bounded Operators

Linear operators acting between normed linear spaces that are continuous with respect to the norms are called bounded.

Definition. A linear operator $T$ defined on a normed linear space $(V, \|\cdot\|_V)$ and taking values in a normed linear space $(W, \|\cdot\|_W)$ is called bounded if there is a constant $L$ such that $\|Tx\|_W \le L\|x\|_V$ for all $x \in V$.

The notational specification of norms on various linear spaces is tiresome and annoying; we will therefore often dispense with it and rely on the reader to understand appropriate norms from the context. Hence we will say that a linear operator $T$ is bounded if $\|Tx\| \le L\|x\|$ for all $x \in D(T)$ and some constant $L$. The smallest constant $L$ for which this inequality holds is called the norm of the operator $T$ and is denoted $\|T\|$. Equivalently,

$$\|T\| = \sup_{x \ne \theta} \frac{\|Tx\|}{\|x\|}$$

where “sup” stands for supremum, or least upper bound. Using linearity of $T$ one sees that

$$\|Tx - Ty\| \le \|T\|\,\|x - y\|$$

and hence $\|T\|$ gives the smallest universal bound for the relative “spread” that a bounded linear operator $T$ can accomplish. From this it also follows that every bounded linear operator is continuous (relative to the norms in question) and, in fact, every continuous linear operator is bounded. Indeed, if $T$ is linear and continuous, then there is a $\delta > 0$ such that $\|z\| \le \delta$ implies $\|Tz\| \le 1$. For any $x \ne \theta$ one then has $\|T(\delta x/\|x\|)\| \le 1$, that is, $\|Tx\| \le \frac{1}{\delta}\|x\|$ for all $x$. Therefore, $T$ is bounded and $\|T\| \le 1/\delta$.

The set $L(V, W)$ of all bounded linear operators from a normed linear space $V$ to a normed linear space $W$ is itself a normed linear space under the natural notions of operator sum and scalar multiplication:

$$(T + S)(x) = Tx + Sx$$
$$(tT)(x) = t(Tx)$$

and with the norm on bounded linear operators as defined above. Further, if $W$ is a Banach space, then so is $L(V, W)$. If the image space $W$ is the scalar field ($\mathbb{R}$, or $\mathbb{C}$), then $L(V, W)$ is called the dual space, $V^*$, of $V$. An operator in $V^*$ is called a bounded linear functional on $V$. For example, given a fixed $t_0 \in [a, b]$, the point evaluation operator $E_{t_0}$, defined by

$$E_{t_0}(f) = f(t_0)$$

is a bounded linear functional on the space $(C[a, b], \|\cdot\|_\infty)$. The dual space of $C[a, b]$ may be identified with the space of all functions of bounded variation on $[a, b]$ in the sense that any bounded linear functional $\ell$ on $C[a, b]$ has a unique representation of the form

$$\ell(f) = \int_a^b f(t)\,dg(t)$$

for some function $g$ of bounded variation (the integral is a Riemann-Stieltjes integral). In the case of point evaluation the representative function $g$ is the Heaviside function at $t_0$:

$$g(t) = \begin{cases} 0, & t < t_0 \\ 1, & t_0 \le t. \end{cases}$$

It can be shown that for $1 < p < \infty$,

$$(L^p[a, b])^* = L^q[a, b]$$

where $\frac{1}{p} + \frac{1}{q} = 1$, in the sense that every bounded linear functional $\ell$ on $L^p[a, b]$ has the form

$$\ell(f) = \int_a^b f(t)g(t)\,dt$$

for some $g \in L^q[a, b]$ that is uniquely determined by $\ell$. With this understanding the space $L^2[a, b]$ is self-dual. In fact, by the Riesz Representation Theorem, all Hilbert spaces are self-dual for the following reason: a bounded linear functional $\ell$ on a Hilbert space $H$ has the form $\ell(x) = \langle x, z\rangle$, for some vector $z \in H$ which is uniquely

determined by $\ell$. Further, $\|\ell\| = \|z\|$, where the first norm is an operator norm and
the other is the underlying Hilbert space norm. The association of the bounded linear functional
$\ell$ with its Riesz representor $z$ provides the identification of $H^*$ with $H$.

Bounded linear functionals may be thought of as linear measurements on a Hilbert space in that a
bounded linear functional gives a numerical measure on vectors in the space which is continuous
with respect to the norm. Such measures distinguish vectors in the space because $x \ne y$ if and
only if there is a vector $z$ with $\langle x, z \rangle \ne \langle y, z \rangle$. However, it
may happen that no such linear measure can ultimately distinguish the vectors in a sequence from
a certain fixed vector. This gives rise to the notion of weak convergence.

Definition. A sequence $\{x_n\}$ in a Hilbert space $H$ is said to converge weakly to a vector
$x$, denoted $x_n \rightharpoonup x$, if $\langle x_n, z \rangle \to \langle x, z \rangle$ for
all $z \in H$.

It is a consequence of the Cauchy-Schwarz inequality that every convergent sequence is weakly
convergent to the same limit. Also, Bessel’s inequality shows that every orthonormal sequence of
vectors converges weakly to the zero vector.

B. Adjoint Operators

Adjoint operators mimic the behavior of the transpose matrix on real Euclidean space. Recall that
the transpose $A^T$ of a real $m \times n$ matrix $A$ satisfies

\[ \langle Ax, y \rangle = \langle x, A^T y \rangle \]

for all $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, where $\langle \cdot, \cdot \rangle$ is
the Euclidean inner product. If $T$ is a bounded linear operator from a Hilbert space $H_1$ into
a Hilbert space $H_2$, i.e., $T : H_1 \to H_2$, then for fixed $y \in H_2$ the linear functional
$\ell$ defined on $H_1$ by

\[ \ell(x) = \langle Tx, y \rangle \]

is bounded and hence by the Riesz Representation Theorem

\[ \langle Tx, y \rangle = \langle x, z \rangle \]

for some $z \in H_1$. This $z$ is uniquely determined by $y$, via $T$, and we denote it by
$T^* y$, that is,

\[ \langle Tx, y \rangle = \langle x, T^* y \rangle. \]

The operator $T^* : H_2 \to H_1$ is a bounded linear operator called the adjoint of $T$. If $T$
is a bounded linear operator, then $\|T\| = \|T^*\|$ and $T^{**} = T$.

Suppose, for example, the linear operator $T : L^2[a, b] \to L^2[c, d]$ is generated by the
kernel $k(\cdot, \cdot) \in C([c, d] \times [a, b])$, that is,

\[ (Tf)(s) = \int_a^b k(s, t) f(t) \, dt, \quad s \in [c, d], \]

then

\[ \langle Tf, g \rangle = \int_c^d \int_a^b k(s, t) f(t) \, dt \, g(s) \, ds
   = \int_a^b \int_c^d k(s, t) g(s) \, ds \, f(t) \, dt = \langle f, T^* g \rangle \]

and hence $T^*$ is the integral operator generated by the kernel $k^*(\cdot, \cdot)$ defined by
$k^*(t, s) = k(s, t)$. In particular, if the kernel $k$ is symmetric and $[a, b] = [c, d]$, then
the operator $T$ is self-adjoint.

C. Compact Operators

A linear operator $T : H_1 \to H_2$ is called an operator of finite rank if its range is spanned
by finitely many vectors in $H_2$. In other words, $T$ has finite rank if there are linearly
independent vectors $\{v_1, \ldots, v_m\}$ in $H_2$ such that

\[ Tx = \sum_{k=1}^{m} a_k(x) v_k \]

where the coefficients $a_k$ are bounded linear functionals on $H_1$. Therefore, by the Riesz
Representation Theorem,

\[ Tx = \sum_{k=1}^{m} \langle x, u_k \rangle v_k \]

for certain vectors $\{u_1, \ldots, u_m\} \subset H_1$. Note that every finite rank operator is
continuous, but more is true. If $\{x_n\}$ is a weakly convergent sequence in $H_1$, say
$x_n \rightharpoonup w$, then

\[ Tx_n = \sum_{k=1}^{m} \langle x_n, u_k \rangle v_k \to \sum_{k=1}^{m} \langle w, u_k \rangle v_k = Tw, \]

and hence a finite rank operator $T$ maps weakly convergent sequences into strongly convergent
sequences. Finite rank operators are a special case of an important class of linear operators
that enjoy this weak-to-strong continuity property.

Definition. A linear operator $T : H_1 \to H_2$ is called compact (also called completely
continuous) if $x_n \rightharpoonup w$ implies $Tx_n \to Tw$.

Note that every finite rank operator is completely continuous and every completely continuous
operator is a fortiori continuous. Also, limits of finite rank operators are compact. More
precisely, if $\{T_n\}$ is a sequence of finite rank operators converging in operator norm to an
operator $T$, then $T$ is compact.

Completely continuous operators acting on a Hilbert space have a particularly simple structure
expressed in terms of certain characteristic subspaces known as

eigenspaces. The eigenspace of a linear operator $T : H \to H$ associated with a scalar
$\lambda$ is the subspace

\[ N(T - \lambda I) = \{x \in H : Tx = \lambda x\}. \]

In general, an eigenspace may be trivial, that is, it may consist only of the zero vector. If
$N(T - \lambda I) \ne \{\theta\}$, we say that $\lambda$ is an eigenvalue of $T$. If $T$ is
self-adjoint, then the eigenvalues of $T$ are real numbers and vectors in distinct eigenspaces
are orthogonal to each other. If $T$ is self-adjoint, compact, and of infinite rank, then the
eigenvalues of $T$ form a sequence of real numbers $\{\lambda_n\}$. This sequence converges to
zero, for taking an orthonormal sequence $\{x_n\}$ with $x_n \in N(T - \lambda_n I)$ we have
$\lambda_n x_n = Tx_n \to \theta$ since $x_n \rightharpoonup \theta$ (a consequence of Bessel’s
inequality). Since $\|x_n\| = 1$, it follows that $\lambda_n \to 0$. The fact that
$Tx_n \to \theta$ for a sequence of unit vectors $\{x_n\}$ is an abstract version of the
Riemann-Lebesgue Theorem.

The prime exemplar of a compact self-adjoint operator is the integral operator on the real space
$L^2[a, b]$ generated by a symmetric kernel $k(\cdot, \cdot) \in L^2([a, b] \times [a, b])$:

\[ (Tf)(s) = \int_a^b k(s, t) f(t) \, dt. \]

This operator is compact because it is the limit in operator norm of the finite rank operators

\[ T_N f = \sum_{n, m \le N} c_{n,m} \left( \int_a^b f(t) \phi_m(t) \, dt \right) \phi_n \]

where $\{\phi_n\}_{n=1}^{\infty}$ is a complete orthonormal sequence in $L^2[a, b]$ and

\[ k(s, t) = \sum_{n, m = 1}^{\infty} c_{n,m} \phi_n(s) \phi_m(t) \]

is the Fourier expansion of $k(\cdot, \cdot)$ relative to the orthonormal basis
$\{\phi_n(s) \phi_m(t)\}$ for $L^2([a, b] \times [a, b])$.

D. Unbounded Operators

It is not the case that every interesting linear operator is bounded. For example, the
differentiation operator acting on the space $C^1[0, \pi]$ and taking values in the space
$C[0, \pi]$, both with the uniform norm, is unbounded. Indeed, for the functions
$\phi_n(t) = \sin(nt)$ we find that the quotient

\[ \frac{\|\phi_n'\|_\infty}{\|\phi_n\|_\infty} = n \]

is unbounded.

Multiplication by the variable in the space $L^2(-\infty, \infty)$ is another famous example of
an unbounded operator. Let

\[ D(M) = \left\{ f \in L^2(-\infty, \infty) : \int_{-\infty}^{\infty} t^2 |f(t)|^2 \, dt < \infty \right\}. \]

Then $D(M)$ is a dense subspace of $L^2(-\infty, \infty)$ and one can define the linear operator
$M : D(M) \to L^2(-\infty, \infty)$ by

\[ Mf = g \quad \text{where } g(t) = t f(t). \]

Let $\phi_n(t) = 1$ for $t \in [n, n + 1]$ and $\phi_n(t) = 0$ otherwise. Then
$\|\phi_n\|_2 = 1$ and $\|M\phi_n\|_2 > n$, and hence $M$ is unbounded. The multiplication
operator has an important interpretation in quantum mechanics.

The adjoint operator may also be defined for unbounded linear operators with dense domains. Given
a linear operator $T : D(T) \subset H_1 \to H_2$ with dense domain $D(T)$, let $D(T^*)$ be the
subspace of all vectors $y \in H_2$ satisfying

\[ \langle Tx, y \rangle = \langle x, y^* \rangle \]

for some vector $y^* \in H_1$ and all $x \in D(T)$. The vector $y^*$ is then uniquely defined and
we set $T^* y = y^*$. Then the operator $T^* : D(T^*) \subset H_2 \to H_1$ is densely defined and
linear.

As an example, let $D(T)$ be the space of absolutely continuous complex-valued functions $f$
defined on $[0, 1]$ with $f' \in L^2[0, 1]$ satisfying the periodic boundary condition
$f(0) = f(1)$. Then $D(T)$ is dense in $L^2[0, 1]$. Define $T : D(T) \to L^2[0, 1]$ by
$Tf = i f'$. For $g \in D(T)$ we have

\[ \langle Tf, g \rangle = \int_0^1 i f'(t) \overline{g(t)} \, dt
   = i f(t) \overline{g(t)} \Big|_0^1 - i \int_0^1 f(t) \overline{g'(t)} \, dt
   = \int_0^1 f(t) \overline{i g'(t)} \, dt = \langle f, i g' \rangle \]

for all $f \in D(T)$. Therefore, $D(T) \subset D(T^*)$ and, in fact, it can be shown that
$D(T^*) = D(T)$. This calculation shows that $T^* g = Tg$, that is, $T$ is self-adjoint.

A linear operator $T : D(T) \subseteq H_1 \to H_2$ is called closed if its graph
$G(T) = \{(x, Tx) : x \in D(T)\}$ is a closed subspace of the product Hilbert space
$H_1 \times H_2$. This means that if $\{x_n\} \subset D(T)$, $x_n \to x \in H_1$, and
$Tx_n \to y \in H_2$, then $(x, y) \in G(T)$, that is, $x \in D(T)$ and $Tx = y$. For example,
the differentiation operator defined in the previous paragraph is closed. In fact, the adjoint of
any densely defined linear operator is closed.

A densely defined linear operator $T : D(T) \subseteq H \to H$ is called symmetric if

\[ \langle Tx, y \rangle = \langle x, Ty \rangle \quad \text{for all } x, y \in D(T). \]

Every self-adjoint transformation is, of course, symmetric; however, a symmetric transformation
is not necessarily self-adjoint. Consider, for instance, a slight modification of the previous
example. Let $D(T)$ be the space of absolutely continuous complex-valued functions on $[0, 1]$
which vanish at the end points, and let $Tf = i f'$. For

$f, g \in D(T)$, integration by parts gives $\langle Tf, g \rangle = \langle f, Tg \rangle$, and
hence $T$ is symmetric, and the adjoint of $T$ satisfies $T^* g = i g'$. However, $D(T^*)$ is a
proper extension of $D(T)$, in that no boundary conditions are imposed on functions in $D(T^*)$,
and hence $T$ is not self-adjoint. The examples just given show that a symmetric linear operator
is not necessarily bounded. The Hellinger-Toeplitz Theorem gives sufficient conditions for a
symmetric operator to be bounded: a symmetric linear operator whose domain is the entire space
is bounded.

If a linear operator $T : D(T) \subseteq H_1 \to H_2$ is closed, then $D(T)$ is a Hilbert space
when endowed with the graph inner product:

\[ \langle (x, Tx), (y, Ty) \rangle = \langle x, y \rangle + \langle Tx, Ty \rangle. \]

If $T$ is closed and everywhere defined, i.e., $D(T) = H_1$, then since the graph norm dominates
the norm on $H_1$, we find, by the corollary to Banach’s theorem (see the inversion section),
that the norm in $H_1$ is equivalent to the graph norm. In particular, the operator $T$ is then
bounded. This is the closed graph theorem: a closed everywhere defined linear operator is
bounded.

III. CONTRACTIONS

Suppose $X$ is a Banach space and $D \subseteq X$. A vector $x \in D$ is called a fixed point of
the mapping $T : D \to X$ if $Tx = x$. Every linear operator has a fixed point, namely the zero
vector. But a nonlinear mapping may be free of fixed points. However, a mapping that draws points
together in a uniform relative sense (a condition called contractivity) is guaranteed to have a
fixed point.

Definition. A mapping $T : D \subseteq X \to X$ is called a contraction (relative to the norm
$\|\cdot\|$ on $X$) if $\|Tx - Ty\| \le \alpha \|x - y\|$ for all $x, y \in D$ and some positive
constant $\alpha < 1$.

For example, if $L : X \to X$ is a bounded linear operator with $\|L\| < 1$, and $g \in X$, then
the (affine) mapping $T : X \to X$ defined by $Tx = Lx + g$ is a contraction with contraction
constant $\alpha = \|L\|$. As another example, consider the nonlinear integral operator
$T : C[0, 1] \to C[0, 1]$ defined by

\[ (Tu)(s) = \int_0^1 k(s, u(t)) \, dt \]

where $k(\cdot, \cdot) \in C^1([0, 1] \times \mathbb{R})$ is a given kernel. If the kernel
satisfies

\[ \left| \frac{\partial}{\partial t} k(s, t) \right| \le \alpha, \quad \text{for all } s, t \]

then

\[ |(Tw)(s) - (Tu)(s)| = \left| \int_0^1 k(s, w(t)) - k(s, u(t)) \, dt \right|
   \le \alpha \int_0^1 |w(t) - u(t)| \, dt \le \alpha \|w - u\|_\infty. \]

Therefore, $\|Tw - Tu\|_\infty \le \alpha \|w - u\|_\infty$ and hence $T$ is a contraction for
the uniform norm if $\alpha < 1$.

The Contraction Mapping Theorem (elucidated by Banach in 1922) is a constructive existence and
uniqueness theorem for fixed points. If $D$ is a closed subset of a Banach space and
$T : D \to D$ is a contraction, then the theorem guarantees the existence of a unique fixed point
$x \in D$. This fixed point is the limit of any sequence constructed iteratively by
$x_{n+1} = Tx_n$, where $x_0$ is an arbitrary vector in $D$. If $\alpha$ is a contraction
constant for $T$, then there is an a priori error bound

\[ \|x_n - x\| \le \frac{\alpha^n}{1 - \alpha} \|x_1 - x_0\| \]

and an a posteriori error bound

\[ \|x_n - x\| \le \frac{1}{1 - \alpha} \|x_{n+1} - x_n\| \]

for $x_n$ as an approximation to the unique fixed point $x$.

The contraction mapping theorem is often used to establish the existence and uniqueness of
solutions of problems in function spaces. One such application is a simple implicit function
theorem. Suppose $f \in C([a, b] \times \mathbb{R})$ satisfies

\[ 0 < m \le \frac{\partial f}{\partial s}(t, s) \le M \]

for some constants $m$ and $M$. Then one can show that the equation $f(t, x) = 0$ implicitly
defines a continuous function $x$ on $[a, b]$. Indeed, the nonlinear operator
$T : C[a, b] \to C[a, b]$ defined by

\[ (Tx)(t) = x(t) - \frac{2}{m + M} f(t, x(t)) \]

is, under the stated conditions, a contraction mapping on $C[a, b]$ with contraction constant
$\alpha = (M - m)/(M + m)$. Therefore, $T$ has a unique fixed point $x \in C[a, b]$. That is,
there is a unique function $x \in C[a, b]$ satisfying

\[ x(t) = x(t) - \frac{2}{m + M} f(t, x(t)) \]

or, equivalently, $f(t, x(t)) = 0$.

IV. SOME PRINCIPLES AND TECHNIQUES

A. Projection and Decomposition

The projection property is a key feature of the geometry of Hilbert space: a closed convex subset
of a Hilbert space

contains a unique vector of smallest norm. By shifting the origin to a vector $x$, one sees that
this is equivalent to saying that given a closed convex subset $S$ of a Hilbert space $H$ and a
vector $x \in H$, there is a unique vector $Px \in S$ satisfying

\[ \|x - Px\| = \min_{y \in S} \|x - y\|. \]

This purely geometric projection property has important applications in optimization theory. For
example, consider the following simple example of optimal control of a one-dimensional dynamical
system. Suppose a unit point mass is steered from the origin with initial velocity 1 by a control
(external force) $u$. We are interested in a control that will return the particle to a “soft
landing” at the origin in unit time while expending minimal effort, where the measure of effort
is

\[ \int_0^1 |u(t)|^2 \, dt. \]

We may formulate this problem in the Hilbert space $L^2[0, 1]$. The dynamics of the system are
governed by the equations

\[ \ddot{x} = u, \quad x(0) = 0, \quad \dot{x}(0) = 1, \quad x(1) = 0, \quad \dot{x}(1) = 0. \]

Suppose $C$ is the set of all vectors $u$ in $L^2[0, 1]$ for which the equations above are
satisfied for some vector $x \in H^2[0, 1]$. It may be routinely verified that $C$ is a closed
convex subset of $L^2[0, 1]$ and hence $C$ contains a unique vector of smallest $L^2$-norm, i.e.,
there is a unique minimal effort control that steers the system in the specified manner.

The (generally nonlinear) operator $P$ defined above is called the (metric) projection of $H$
onto $S$. If $S$ is a closed subspace of $H$, then $P$ is a bounded self-adjoint linear operator
and $I - P$, where $I$ is the identity operator, is the projection of $H$ onto $S^\perp$. Since
$x = Px + (I - P)x$, this provides a Cartesian decomposition of $H$, written
$H = S \oplus S^\perp$, meaning that each vector in $H$ can be written uniquely as a sum of a
vector in $S$ and a vector in $S^\perp$. For example, suppose $H$ is the completion of the space
$C^1[0, 1]$ with respect to the inner product

\[ \langle f, g \rangle = \int_0^1 f'(t) g'(t) \, dt + f(0) g(0), \]

and let $S$ be the subspace of linear functions. Then $H = S \oplus S^\perp$ and we have seen
that $S^\perp$ consists of those functions in $H$ which vanish at 0 and 1. So in this instance
the decomposition theorem expresses the fact that each function in $H$ can be uniquely decomposed
into the sum of a linear function and a function in $H$ that vanishes at both end points of
$[0, 1]$.

If $S$ is a separable closed subspace of a Hilbert space $H$, then a representation of the
projection operator $P$ of $H$ onto the subspace $S$ can be given in terms of a complete
orthonormal sequence $\{\phi_n\}$ for $S$. Indeed, if $x \in H$, then

\[ \sum_n \langle x, \phi_n \rangle \phi_n - x \in S^\perp \]

since this vector is orthogonal to each member of the basis $\{\phi_n\}$ for $S$. Therefore,

\[ Px = Px + P\left( \sum_n \langle x, \phi_n \rangle \phi_n - x \right) = \sum_n \langle x, \phi_n \rangle \phi_n. \]

There is an important relationship involving the nullspace, range, and adjoint of a bounded
linear operator acting between Hilbert spaces. If $T : H_1 \to H_2$ is a bounded linear operator,
then $N(T^*) = R(T)^\perp$. To see this, note that $w \in R(T)^\perp$ if and only if

\[ 0 = \langle Tx, w \rangle = \langle x, T^* w \rangle \]

for all $x \in H_1$, that is, if and only if $w \in N(T^*)$. By a previously discussed result on
the second orthogonal complement we get the related result that
$\overline{R(T)} = N(T^*)^\perp$. In particular, if $T$ is a bounded linear operator with closed
range, then the equation $Tf = g$ has a solution if and only if $g$ is orthogonal to all
solutions $x$ of the homogeneous adjoint equation $T^* x = \theta$.

Replacing $T$ with $T^*$ and noting that $T^{**} = T$, we obtain two additional relationships
between the nullspace, range, and adjoint. Taken together these relationships, namely

\[ N(T^*) = R(T)^\perp, \quad N(T^*)^\perp = \overline{R(T)}, \]
\[ N(T) = R(T^*)^\perp, \quad N(T)^\perp = \overline{R(T^*)} \]

are sometimes collectively called the theorem on the four fundamental subspaces.

The Riesz Representation Theorem is a simple consequence of the decomposition theorem. If $\ell$
is a nonzero bounded linear functional on a Hilbert space $H$, then
$N(\ell)^\perp = R(\ell^*)$ is one-dimensional and $H = N(\ell) \oplus N(\ell)^\perp$. Let
$y \in N(\ell)^\perp$ be a unit vector. Then $x = Px + \langle x, y \rangle y$, where $P$ is the
projection operator of $H$ onto $N(\ell)$. Therefore,

\[ \ell(x) = \ell(\langle x, y \rangle y) = \langle x, y \rangle \ell(y) = \langle x, z \rangle \]

where $z = \overline{\ell(y)} \, y$ (the bar denotes complex conjugation and may be dropped in a
real space).

B. The Spectral Theorem

We limit our discussion of the spectral theorem to the case of a compact self-adjoint operator.
The spectral theorem gives a particularly simple characterization of the range

of such an operator. We have seen that the nonzero eigenvalues of a compact self-adjoint operator
$T$ of infinite rank form a sequence of real numbers $\{\lambda_n\}$ with $\lambda_n \to 0$ as
$n \to \infty$. The corresponding eigenspaces $N(T - \lambda_n I)$ are all finite-dimensional and
there is a sequence $\{v_n\}$ of orthonormal eigenvectors, that is, vectors satisfying

\[ \|v_j\| = 1, \quad \langle v_i, v_j \rangle = 0 \ \text{ for } i \ne j, \quad \text{and} \quad T v_j = \lambda_j v_j. \]

The closure of the range of $T$ is the closure of the span of this sequence of eigenvectors.
Since $T$ is self-adjoint, $N(T)^\perp = \overline{R(T)}$; the decomposition theorem then gives
the representation

\[ w = Pw + \sum_{j=1}^{\infty} \langle w, v_j \rangle v_j \]

for all $w \in H$, where $P$ is the projection operator from $H$ onto $N(T)$. The range of $T$
then has the form

\[ Tw = \sum_{j=1}^{\infty} \lambda_j \langle w, v_j \rangle v_j. \]

This result is known as the spectral theorem. It can be extended (in terms of a Stieltjes
integral with respect to a projection valued measure on the real line) to bounded self-adjoint
operators (and beyond).

C. The Singular Value Decomposition

The decomposition of a Hilbert space into the nullspace and eigenspaces of a compact self-adjoint
operator can be simply extended to obtain a similar decomposition, called the singular value
decomposition (SVD), for compact operators which are not necessarily self-adjoint. If
$T : H_1 \to H_2$ is a compact linear operator from a Hilbert space $H_1$ into a Hilbert space
$H_2$, then the operators $T^* T$ and $T T^*$ are both self-adjoint compact linear operators with
nonnegative eigenvalues. By the spectral theorem, $T^* T$ has an orthonormal sequence of
eigenvectors, $\{u_j\}$, associated with its positive eigenvalues $\{\lambda_j\}$, that is
complete in the subspace

\[ \overline{R(T^* T)} = N(T^* T)^\perp = N(T)^\perp. \]

The numbers $\mu_j = \sqrt{\lambda_j}$ are called the singular values of $T$. If the vectors
$\{v_j\}$ are defined by $v_j = \mu_j^{-1} T u_j$, then $\{v_j\}$ is a complete orthonormal set
for $N(T^*)^\perp$ and the following relations hold:

\[ T u_j = \mu_j v_j \quad \text{and} \quad T^* v_j = \mu_j u_j. \]

The system $\{u_j, v_j; \mu_j\}$ is called a singular system for the operator $T$, and any
$f \in H_1$ has, by the decomposition theorem, a representation in the form

\[ f = Pf + \sum_{j=1}^{\infty} \langle f, u_j \rangle u_j \]

where $P$ is the projection of $H_1$ onto $N(T)$ (the sum is finite if $T$ has finite rank). It
then follows that

\[ Tf = \sum_{j=1}^{\infty} \mu_j \langle f, u_j \rangle v_j. \]

This is called the SVD of the compact linear operator $T$. The SVD may be viewed as a
nonsymmetric spectral representation.

D. Operator Equations

Suppose $T : H_1 \to H_2$ is a compact linear operator and $\lambda \in \mathbb{C}$. If
$\lambda \ne 0$, then it can be shown that $R(T - \lambda I)$ is closed. The Fredholm Alternative
Theorem completely characterizes the solubility of equations of the type

\[ Tf - \lambda f = g \]

where $\lambda \ne 0$ and $g \in H_2$. Such equations are called linear operator equations of the
second kind. The “alternative” is this: either the operator $(T - \lambda I)$ has a bounded
inverse, or $\lambda$ is an eigenvalue of $T$. If the first alternative holds, then the equation
has the unique solution $f = (T - \lambda I)^{-1} g$, and this solution depends continuously on
$g$. In this case we say that the equation of the second kind is well-posed. On the other hand,
if $\lambda$ is an eigenvalue of $T$, then $\bar{\lambda}$ is an eigenvalue of $T^*$ and the
equation has a solution only if

\[ g \in R(T - \lambda I) = N(T^* - \bar{\lambda} I)^\perp. \]

That is, the equation has a solution only if $g$ is orthogonal to the finite-dimensional
eigenspace $N(T^* - \bar{\lambda} I)$.

The Fredholm Alternative is particularly simple if $T : H \to H$ is self-adjoint. In this case
the eigenvalues are real and $N(T^* - \bar{\lambda} I) = N(T - \lambda I)$. Therefore, if
$\lambda$ is not an eigenvalue of $T$, that is, if the equation $Tf - \lambda f = g$ has no more
than one solution, then

\[ H = \{\theta\}^\perp = N(T - \lambda I)^\perp = R(T - \lambda I) \]

and hence the equation has a solution for any $g \in H$. Simply put, the Fredholm Alternative for
a self-adjoint compact operator says that uniqueness of solutions
($N(T - \lambda I) = \{\theta\}$) implies existence of solutions for any right hand side
($R(T - \lambda I) = H$).

The contraction mapping theorem can be applied to establish the existence and uniqueness of the
solutions of certain nonlinear operator equations of the second kind, that is, equations of the
form $Tf - \lambda f = g$. For example,

suppose $T : X \to Y$ is a mapping from a Banach space $X$ to a Banach space $Y$, satisfying
$\|Tu - Tv\| \le \mu \|u - v\|$. For nonzero $\lambda \in \mathbb{C}$ the equation
$Tf - \lambda f = g$ then has a unique solution $f \in X$ for each $g \in Y$, if
$\mu < |\lambda|$. Indeed, a solution of this equation is the unique fixed point of the
contractive mapping $Af = \frac{1}{\lambda}(Tf - g)$. Further, the inverse mapping $g \to f$ is
continuous, that is, the original nonlinear equation of the second kind is well-posed.

Monotone operators form another general class of nonlinear operators for which unique solutions
of certain operator equations of the second kind can be assured. Suppose $H$ is a real Hilbert
space. An operator $T : H \to H$ is called monotone if
$\langle Tx - Ty, x - y \rangle \ge 0$ for all $x, y \in H$. If $T$ is a continuous monotone
operator and $\lambda < 0$, then a theorem of Minty ensures the existence of a unique solution
$f$ of the operator equation of the second kind $Tf - \lambda f = g$, for each $g \in H$.
Further, the inverse operator $J = (T - \lambda I)^{-1}$ is not just continuous, but
nonexpansive, i.e., $\|Jh - Jg\| \le \|h - g\|$.

When $\lambda = 0$, the linear operator equation treated above becomes an operator equation of
the first kind:

\[ Tf = g. \]

In this case, the role of the Fredholm Alternative Theorem is to some extent played by Picard’s
Theorem. Let $\{u_j, v_j; \mu_j\}$ be a singular system for the compact linear operator $T$. Then
$\{v_j\}$ is a complete orthonormal system for $\overline{R(T)}$. Picard’s Theorem gives a
necessary and sufficient condition for a vector $g \in \overline{R(T)} = N(T^*)^\perp$ to lie in
$R(T)$, that is, for the equation of the first kind to have a solution. The condition, Picard’s
criterion, is that the singular coefficients of $g$ decay sufficiently quickly, specifically,
that

\[ \sum_{j=1}^{\infty} \mu_j^{-2} |\langle g, v_j \rangle|^2 < \infty. \]

If Picard’s criterion is satisfied, then the series

\[ \sum_{j=1}^{\infty} \frac{\langle g, v_j \rangle}{\mu_j} u_j \]

converges to some vector $f \in H_1$ and

\[ Tf = \sum_{j=1}^{\infty} \langle g, v_j \rangle v_j = g, \]

and hence the equation has a solution, namely $f$. The solution is unique only if
$N(T) = \{\theta\}$.

E. Inversion

An invertible bounded linear operator need not have a bounded inverse. For example, the operator
$T : L^2[0, 1] \to C[0, 1]$ defined by $(Tf)(s) = \int_0^s f(t) \, dt$ has an inverse defined on
the subspace of absolutely continuous functions which vanish at 0. But the inverse operator
$T^{-1} g = g'$ is an unbounded operator.

If $T \in L(X, Y)$, where $X$ and $Y$ are Banach spaces, then the existence of a bounded inverse
for $T$ is equivalent to the condition $\|Tx\| \ge m \|x\|$ for some $m > 0$. Indeed, if this
condition is satisfied, then $N(T) = \{\theta\}$, $R(T)$ is closed in $Y$, and
$T^{-1} : R(T) \to X$ satisfies $\|T^{-1}\| \le m^{-1}$. On the other hand, if $T^{-1}$ is
bounded, then the condition holds with $m = \|T^{-1}\|^{-1}$. If $T$ is bijective, i.e., if
$N(T) = \{\theta\}$ and $R(T) = Y$, then Banach’s Theorem ensures that the inverse of $T$ is also
bounded, i.e., $T^{-1} \in L(Y, X)$. One consequence of this fact is that if a normed linear
space $X$ is a Banach space under two norms $\|\cdot\|_1$ and $\|\cdot\|_2$, and if these norms
satisfy $\|x\|_2 \le M \|x\|_1$ for some $M > 0$, then the norms are equivalent (apply Banach’s
Theorem to the identity operator $I : (X, \|\cdot\|_1) \to (X, \|\cdot\|_2)$).

The familiar geometric series has an operator analog. If $X$ is a Banach space and
$T : X \to X$ is a bounded linear operator with $\|T\| < 1$, then
$(I - T)^{-1} \in L(X, X)$ and, in fact,

\[ (I - T)^{-1} = I + T + T^2 + \cdots \]

where the series, called the Neumann series, converges in the operator norm. This result can be
used immediately to guarantee the existence of a unique solution of the linear operator equation
of the second kind $Tf - \lambda f = g$, if $|\lambda|$ is sufficiently large
($|\lambda| > \|T\|$ does the trick).

The existence of an inverse of $T \in L(X, Y)$ is equivalent to the unique solvability of the
equation $Tf = g$ for each $g \in Y$. This, in turn, is equivalent to $R(T) = Y$ and
$N(T) = \{\theta\}$. If either of these conditions is violated, then there is no inverse, but all
is not lost. In certain circumstances a generalized inverse can be defined; we limit our
discussion of the generalized inverse to the Hilbert space context. Suppose $T : H_1 \to H_2$ is
a bounded linear operator from a Hilbert space $H_1$ to a Hilbert space $H_2$. Suppose we wish to
solve the equation $Tf = g$. If $g \notin R(T)$, then the equation has no solution and we might
settle for finding an $f \in H_1$ whose image under $T$ is as close to $g$ as possible, that is,
we seek $f \in H_1$ satisfying

\[ \|Tf - g\| = \inf\{ \|Tx - g\| : x \in H_1 \}. \]

Such an $f$ is called a least-squares solution of $Tf = g$. A least-squares solution exists if
and only if the projection of $g$ onto $\overline{R(T)}$ actually lies in $R(T)$. This is
equivalent to requiring $g$ to lie in the dense subspace $R(T) + R(T)^\perp$ of $H_2$. The
condition for a least-squares solution may also be phrased as requiring that
$Tf - g \in R(T)^\perp = N(T^*)$ (see the theorem on the four fundamental subspaces). This gives
another characterization of least-squares solutions: $u$ is a least-squares solution of $Tf = g$
if and only if $u$

satisfies the normal equation $T^* T u = T^* g$. Since $T$ is bounded, the set of solutions of
the normal equation is closed and convex, and hence, by the projection theorem, if there is a
least-squares solution, then there is a least-squares solution of smallest norm. These ideas
enable us to define a generalized inverse (the Moore-Penrose generalized inverse) for $T$. Let
$D(T^\dagger) = R(T) + R(T)^\perp$ and for $g \in D(T^\dagger)$, define $T^\dagger g$ to be the
least-squares solution having smallest norm. The Moore-Penrose generalized inverse,
$T^\dagger : D(T^\dagger) \subseteq H_2 \to H_1$, is a closed densely defined linear operator.
However, $T^\dagger$ is bounded if and only if $R(T)$ is closed. If $T$ is compact, then $R(T)$
is closed if and only if $T$ has finite rank. In particular, a linear integral operator generated
by a square integrable kernel has a bounded Moore-Penrose generalized inverse if and only if the
kernel is degenerate. If $T$ is compact with singular system $\{u_j, v_j; \mu_j\}$, then the
Moore-Penrose generalized inverse has the explicit representation

\[ T^\dagger g = \sum_j \frac{\langle g, v_j \rangle}{\mu_j} u_j. \]

F. Variational Inequalities

Suppose $H$ is a real Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and
corresponding norm $\|\cdot\|$. A bilinear form $a(\cdot, \cdot) : H \times H \to \mathbb{R}$ is
called bounded if there is a constant $C$ such that

\[ |a(u, v)| \le C \|u\| \, \|v\| \quad \text{for all } u, v \in H \]

and coercive if there is a constant $m > 0$ such that

\[ m \|u\|^2 \le a(u, u) \]

for all $u \in H$. A fundamental result of Stampacchia asserts that if $a(\cdot, \cdot)$ is a
bounded, coercive bilinear form (which need not be symmetric), and if $f \in H$ and $K$ is a
closed convex subset of $H$, then there is a unique $u \in K$ satisfying

\[ a(u, v - u) \ge \langle f, v - u \rangle \quad \text{for all } v \in K. \]

This is called a variational inequality for the form $a(\cdot, \cdot)$, the closed convex set
$K$, and the vector $f \in H$.

The fundamental nature of this result becomes apparent when one notices that the projection
property for Hilbert space is a special case of this result on variational inequalities. Indeed,
if $a(x, y) = \langle x, y \rangle$, then the theorem ensures the existence of a unique
$u \in K$ satisfying

\[ \langle u, v - u \rangle \ge \langle f, v - u \rangle \]

or, equivalently,

\[ \langle f - u, v - u \rangle \le 0 \quad \text{for all } v \in K. \]

Geometrically, this says that the angle between the vectors $f - u$ and $v - u$ is obtuse for all
vectors $v \in K$. From this it follows that $u \in K$ is the vector in $K$ that is nearest to
$f$, that is, $u = Pf$, the metric projection of $f$ onto $K$.

A nonsymmetric version of the Riesz Representation Theorem also follows from the theorem on
variational inequalities. Suppose $\ell$ is a bounded linear functional on $H$. Then, by the
Riesz Representation Theorem, there is an $f \in H$ such that
$\ell(w) = \langle f, w \rangle$ for all $w \in H$. Let the closed convex set $K$ be the entire
Hilbert space $H$; then there is a unique $u \in H$ satisfying

\[ a(u, v - u) \ge \langle f, v - u \rangle \quad \text{for all } v \in H \]

and hence $a(u, w) \ge \langle f, w \rangle$ for all $w \in H$. Replacing $w$ by $-w$, we also
get $a(u, -w) \ge \langle f, -w \rangle$ for all $w \in H$. Therefore,
$a(u, w) = \langle f, w \rangle$ for all $w \in H$. That is, the functional $\ell$ has the
representation $\ell(w) = a(u, w)$ for a unique $u \in H$. This representation of bounded linear
functionals in terms of a possibly nonsymmetric bilinear form is known as the Lax-Milgram lemma.
The Lax-Milgram lemma can be used to establish the existence of a unique weak solution for
certain nonsymmetric elliptic boundary value problems in the same way that the Riesz
Representation Theorem is used to prove the existence of a unique weak solution of the Poisson
problem.

As a simple application of the Lax-Milgram lemma, consider the two-point boundary value problem

\[ -u'' + u' + u = f, \quad u'(0) = u'(1) = 0. \]

Integration by parts yields $a(u, v) = \langle f, v \rangle$, where
$\langle \cdot, \cdot \rangle$ is the $L^2[0, 1]$ inner product and $a(u, v)$ is the
nonsymmetric, bounded, coercive, bilinear form defined on $H^1[0, 1]$ by

\[ a(u, v) = \int_0^1 (u' v' + u' v + u v)(s) \, ds. \]

The Lax-Milgram lemma then ensures the existence of a unique weak solution $u \in H^1[0, 1]$ of
the boundary value problem, that is, a unique vector $u \in H^1[0, 1]$ satisfying

\[ a(u, v) = \langle f, v \rangle \quad \text{for all } v \in H^1[0, 1]. \]

V. A FEW APPLICATIONS

A. Weak Solutions of Poisson’s Equation

Suppose $\Omega$ is a bounded domain in $\mathbb{R}^2$ with smooth boundary $\partial\Omega$.
Given $f \in C(\Omega)$, a classical solution of Poisson’s equation is a function
$u \in C^2(\Omega)$ satisfying

\[ -\Delta u = f \ \text{ in } \Omega, \qquad u = 0 \ \text{ on } \partial\Omega \]

where $\Delta$ is the Laplacian operator. If the Poisson equation is multiplied by
$v \in C_0^2(\Omega)$ and integrated over $\Omega$, then Green’s identity gives

\[ \langle \nabla u, \nabla v \rangle = \langle f, v \rangle \]
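
The identity $\langle \nabla u, \nabla v \rangle = \langle f, v \rangle$ obtained from Green’s identity can be checked numerically on a simple case. The following is a minimal sketch, not part of the original article: the domain (the unit square), the functions u, f, and v, and the helper name midpoint_integral_2d are all assumptions chosen for illustration, and the gradients are entered analytically so that only quadrature error remains.

```python
import math

def midpoint_integral_2d(g, n=200):
    """Approximate the integral of g(x, y) over the unit square
    with the composite midpoint rule on an n-by-n grid."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            total += g(x, y)
    return total * h * h

# Assumed sample solution: u(x, y) = sin(pi x) sin(pi y), which satisfies
# -Laplacian(u) = f with f = 2 pi^2 u, and u = 0 on the boundary of the square.
def grad_u(x, y):
    return (math.pi * math.cos(math.pi * x) * math.sin(math.pi * y),
            math.pi * math.sin(math.pi * x) * math.cos(math.pi * y))

def f(x, y):
    return 2.0 * math.pi ** 2 * math.sin(math.pi * x) * math.sin(math.pi * y)

# Assumed test function: v(x, y) = x(1 - x) y(1 - y), which vanishes on the boundary.
def v(x, y):
    return x * (1.0 - x) * y * (1.0 - y)

def grad_v(x, y):
    return ((1.0 - 2.0 * x) * y * (1.0 - y),
            x * (1.0 - x) * (1.0 - 2.0 * y))

def energy_integrand(x, y):
    ux, uy = grad_u(x, y)
    vx, vy = grad_v(x, y)
    return ux * vx + uy * vy

lhs = midpoint_integral_2d(energy_integrand)                # <grad u, grad v>
rhs = midpoint_integral_2d(lambda x, y: f(x, y) * v(x, y))  # <f, v>
print(lhs, rhs)
```

For this particular choice both inner products equal $32/\pi^4$ exactly, so the two printed values should agree with each other and with that number up to the error of the midpoint rule.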

where $\nabla$ is the gradient operator and $\langle\cdot,\cdot\rangle$ is the $L^2(\Omega)$ inner product. There are two things to notice about this equation: $\langle f, v\rangle$ is defined for $f \in L^2(\Omega)$, allowing consideration of "rougher" data $f$, and on the left-hand side only first derivatives are required rather than second derivatives, allowing less smooth "solutions" $u$. These observations permit us to propose a weaker formulation of the Poisson problem.

The bilinear form $a(\cdot,\cdot)$ defined by $a(u, v) = \langle\nabla u, \nabla v\rangle$ is an inner product (sometimes called the energy inner product) on the space $C_0^1(\Omega)$ (this is a consequence of Poincaré's inequality: $a(u, u) \ge C\|u\|_2^2$, where $\|\cdot\|_2$ is the $L^2$ norm). The norm generated by this inner product is equivalent to the Sobolev $H_0^1(\Omega)$ norm. Therefore, the completion of $C_0^1(\Omega)$ relative to this inner product is the Sobolev space $H_0^1(\Omega)$, and we define a weak solution of the Poisson problem to be a vector $u \in H_0^1(\Omega)$ satisfying

\[ a(u, v) = \langle f, v\rangle \]

for all $v \in H_0^1(\Omega)$. The linear functional $\ell : H_0^1(\Omega) \to \mathbb{R}$ defined by $\ell(v) = \langle f, v\rangle$ is (again, as a consequence of Poincaré's inequality) a bounded linear functional on $H_0^1(\Omega)$, and hence, by the Riesz Representation Theorem, there is a unique $u \in H_0^1(\Omega)$ satisfying

\[ a(u, v) = \langle f, v\rangle \quad \text{for all } v \in H_0^1(\Omega). \]

In other words, Poisson's problem has a unique weak solution for each $f \in L^2(\Omega)$.

B. A Finite Element Method

A finite element method is a constructive procedure for approximating a weak solution by a linear combination of "basis" functions. As a simple illustration we treat a piecewise linear finite element method for the Poisson problem in the plane. A finite dimensional subspace $U_N$ of $H_0^1(\Omega)$ is chosen and a basis (whose members are called finite elements) is selected for $U_N$. The finite element approximation to the weak solution will be a certain linear combination of these basis functions, i.e., a member of $U_N$. First the region $\Omega$ is triangulated; the vertices of the resulting triangles are called nodes of the triangulation. The functions in $U_N$ will be continuous on $\Omega$, linear on each triangle, and zero on the boundary of $\Omega$. With each interior node of the triangulation we associate a basis function which is 1 at the node and zero at all other nodes of the triangulation (these basis functions are called the linear Lagrange elements). The dimension of the finite element subspace is therefore equal to the number of interior nodes in the triangulation and each basis function has a pyramidal shape with a peak at the associated nodal point of the triangulation.

A finite element solution of the Poisson problem is defined by restricting the conditions for a weak solution to the subspace of finite elements, that is, $u_N \in U_N$ is a finite element solution of the Poisson problem if

\[ a(u_N, v) = \langle f, v\rangle \quad \text{for all } v \in U_N. \]

When this condition is expressed in terms of the finite element basis, the resulting coefficient matrix is positive definite and hence there is a unique finite element solution. If $u$ is the weak solution of the Poisson problem, then

\[ a(u - u_N, v) = \langle f, v\rangle - \langle f, v\rangle = 0 \]

for all $v \in U_N$. Geometrically, this says that the finite element solution is the projection (relative to the energy inner product) of the weak solution onto the finite element subspace and hence,

\[ \|u - u_N\| \le \|u - v\| \]

for all $v \in U_N$, where $\|\cdot\|$ is the energy norm. That is, the finite element solution is the best approximation to the weak solution, with respect to the energy norm, in the finite element subspace.

C. Two-Point Boundary Value Problems

We briefly treat a simple class of Sturm-Liouville problems. The goal is to find $x \in C^2[a, b]$ satisfying the differential equation

\[ \frac{d}{ds}\left[p(s)\frac{dx}{ds}\right] - [\mu + q(s)]\,x(s) = f(s) \]

where $q, f \in C[a, b]$ and $p \in C^1[a, b]$ are given functions and $\mu \ne 0$ is a given scalar. Define $T : C_0^2[a, b] \to C[a, b]$ by

\[ Tx = [p x']' - q x. \]

We suppose that this differential operator is nonsingular, that is, $N(T) = \{\theta\}$ (such is the case, for example, if $p(s) < 0$ and $q(s) \le 0$). Then there is a symmetric Green's function $k(\cdot,\cdot) \in C[a, b] \times C[a, b]$ for $T$. That is, $T^{-1}$ is the integral operator generated by the kernel $k(\cdot,\cdot)$. In other words, $Tx = h$ if and only if

\[ x(s) = \int_a^b k(s, t)\,h(t)\,dt. \]

The original problem may then be expressed as

\[ Tx = \mu x + f \]

or, in terms of the Green's function $k(\cdot,\cdot)$:

\[ x(s) = \int_a^b k(s, t)\,[\mu x(t) + f(t)]\,dt. \]
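The finite element construction of Section B can be sketched numerically. The block below is only an illustration under simplifying assumptions that are not in the text: it treats the one-dimensional analogue $-u'' = f$ on $(0, 1)$ with $u(0) = u(1) = 0$ instead of a triangulated planar region, uses a uniform mesh with piecewise linear "hat" functions, and approximates the load integrals $\langle f, \varphi_i\rangle$ by $h f(x_i)$. The function name `fem_poisson_1d` and the test problem are hypothetical choices.

```python
import numpy as np

def fem_poisson_1d(f, n_interior):
    """Piecewise-linear finite elements for -u'' = f on (0, 1), u(0) = u(1) = 0.

    One hat function is attached to each interior node, so the dimension of
    the finite element subspace equals the number of interior nodes.
    """
    n = n_interior
    h = 1.0 / (n + 1)                       # uniform mesh width
    nodes = np.linspace(0.0, 1.0, n + 2)    # includes the two boundary nodes
    # Stiffness matrix: a(phi_i, phi_j) = integral of phi_i' * phi_j'
    A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
    # Load vector: <f, phi_i> approximated by h * f(x_i) (an assumption)
    b = h * f(nodes[1:-1])
    coeffs = np.linalg.solve(A, b)          # coefficients in the hat basis
    u = np.concatenate(([0.0], coeffs, [0.0]))
    return nodes, u, A

# Test problem with known solution u(x) = sin(pi x), so f(x) = pi^2 sin(pi x).
nodes, u_h, A = fem_poisson_1d(lambda x: np.pi**2 * np.sin(np.pi * x), 49)
max_error = np.max(np.abs(u_h - np.sin(np.pi * nodes)))
print("max nodal error:", max_error)
print("stiffness matrix positive definite:", bool(np.all(np.linalg.eigvalsh(A) > 0)))
```

The positive definiteness check mirrors the statement that the coefficient matrix is positive definite, which is what guarantees a unique finite element solution.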
Equivalently,

\[ Kx - \lambda x = g \]

where $\lambda = 1/\mu$ and $g = -\lambda K f$, where $K$ is the compact self-adjoint operator on $L^2[a, b]$ generated by the kernel $k(\cdot,\cdot)$. If $\lambda$ is not an eigenvalue of $K$, then this integral equation of the second kind has, by Fredholm's Alternative, a solution which is expressible, via the spectral theorem, as a series of eigenfunctions of $K$. Specifically, $R(K)$ has an orthonormal basis of eigenfunctions $\{v_j\}$ of $K$, with $K v_j = \lambda_j v_j$, and therefore,

\[ x = (K - \lambda I)^{-1} g = -\lambda (K - \lambda I)^{-1} K f = \sum_j \frac{\lambda\lambda_j}{\lambda - \lambda_j}\,\langle f, v_j\rangle\, v_j. \]

D. Inverse Problems

Many inverse problems in mathematical physics can be modeled as operator equations of the first kind, $Kx = y$, where $K$ is a compact linear operator acting between Hilbert spaces. Consider, for example, the simple model problem in gravimetry in which the vertical component of gravity, $y(s)$, along a horizontal segment $0 \le s \le 1$ is engendered by a mass distribution $x(p)$, $0 \le p \le 1$, on a parallel segment one unit distant from the first. The relationship between $y$ and $x$ is given by

\[ y(s) = \gamma \int_0^1 \left((s - p)^2 + 1\right)^{-3/2} x(p)\,dp \]

where $\gamma$ is a constant. The inverse problem consists of determining the mass distribution from observations of the vertical force $y$. The model may be phrased abstractly as $y = Kx$, where $K : H \to H$ is a compact linear operator acting on the Hilbert space $H = L^2[0, 1]$.

Formally, the inverse problem may be solved by using the Moore-Penrose generalized inverse: $x = K^\dagger y$. However, the representation of $K^\dagger$ in terms of the SVD of $K$,

\[ x = K^\dagger y = \sum_{j=1}^{\infty} \frac{\langle y, v_j\rangle}{\mu_j}\, u_j, \]

points to a serious problem. The singular values $\{\mu_j\}$ converge to zero, leading to instability in the solution process. The vector $y$ is a measured entity and hence is subject to error. Suppose, for example, the measured data consists of a vector $y^\delta \in H$ satisfying $\|y - y^\delta\| \le \delta$, where $\delta$ is a known bound for the measurement error. While $y^\delta \to y$ as $\delta \to 0$, generally $\|K^\dagger y - K^\dagger y^\delta\| \to \infty$ as $\delta \to 0$. For example, if $\delta = \sqrt{\mu_n}$, where $n = 1, 2, \ldots$, and $y^\delta = y + \sqrt{\mu_n}\, v_n$, then $\|y - y^\delta\| = \sqrt{\mu_n} \to 0$, while $\|K^\dagger y - K^\dagger y^\delta\| = 1/\sqrt{\mu_n} \to \infty$ as $n \to \infty$. The lesson is that even very accurate measurements can lead to wildly unstable computations.

Stability can be restored (at the expense of accuracy) by the method of regularization. In its simplest form this method replaces the normal equation with an augmented (or regularized) normal equation

\[ K^* K x_\alpha + \alpha x_\alpha = K^* y, \]

where $\alpha > 0$ is a regularization parameter. That is, an ill-posed equation of the first kind is replaced by an approximating well-posed equation of the second kind. The solution $x_\alpha$ of the regularized equation is stable with respect to perturbations in the data $y$ since $-\alpha < 0$ is not an eigenvalue of $K^* K$ and hence, by the Fredholm Alternative, $(K^* K + \alpha I)^{-1}$ is bounded.

The regularized normal equation must be solved using the available data $y^\delta$ and hence the choice of the regularization parameter, in terms of the available data, is a matter of considerable importance. One general method for choosing the regularization parameter is known as Morozov's discrepancy principle. According to this principle, if the signal-to-noise ratio is greater than one, i.e., $\|y^\delta\| > \delta \ge \|y - y^\delta\|$, and $x_\alpha^\delta = (K^* K + \alpha I)^{-1} K^* y^\delta$, then the equation

\[ \|K x_\alpha^\delta - y^\delta\| = \delta \]

has a unique positive solution $\alpha = \alpha(\delta)$ and $x_{\alpha(\delta)}^\delta \to K^\dagger y$ as $\delta \to 0$.

E. Heisenberg's Principle

Heisenberg's uncertainty principle in quantum mechanics is a consequence of an inequality involving unbounded self-adjoint operators. We will consider only a very simple case. Suppose a particle moves in one dimension, its position at a given time $t$ denoted by $x \in (-\infty, \infty)$. In the quantum mechanical formalism the position of the particle is understood to be a random variable and it is the state, rather than the position, that is at issue. The state is a function $\psi$ which is a unit vector in the complex Hilbert space $H = L^2(-\infty, \infty)$; the interpretation being that

\[ \int_a^b \psi(s)\,\overline{\psi(s)}\,ds \]

represents the probability that at the time in question the particle is positioned between $a$ and $b$. In other words, the integrable real-valued function $\psi\bar\psi$ is a probability density for the position. In general, a unit vector in $H$ is called a state. A self-adjoint linear operator $T : D(T) \subseteq H \to H$ is called an observable. The expected value of an observable $T$ when the system is in state $\psi$ is

\[ E(T) = \langle T\psi, \psi\rangle = \int_{-\infty}^{\infty} T\psi(s)\,\overline{\psi(s)}\,ds. \]
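The instability described in Section D, and its repair by regularization, can be illustrated with a small numerical experiment. This is a rough sketch under assumptions that are not in the text: the gravimetry kernel is discretized by a midpoint rule on 40 points with $\gamma = 1$, the "true" mass distribution $\sin(\pi p)$ and the noise level are made up for the demonstration, and the regularization parameter $\alpha$ is fixed by hand rather than chosen by the discrepancy principle.

```python
import numpy as np

n = 40
h = 1.0 / n
pts = (np.arange(n) + 0.5) * h                       # midpoint quadrature nodes
# Discretized gravimetry operator, with gamma = 1 (an assumption)
K = h * ((pts[:, None] - pts[None, :]) ** 2 + 1.0) ** (-1.5)

x_true = np.sin(np.pi * pts)                         # hypothetical mass distribution
rng = np.random.default_rng(0)
y_noisy = K @ x_true + 1e-3 * rng.standard_normal(n) # data with small measurement error

# Unregularized least-squares solution: small singular values amplify the noise.
x_naive = np.linalg.lstsq(K, y_noisy, rcond=None)[0]

# Regularized normal equation: (K^T K + alpha I) x = K^T y
alpha = 1e-4                                         # hand-picked, for illustration only
x_reg = np.linalg.solve(K.T @ K + alpha * np.eye(n), K.T @ y_noisy)

err_naive = np.linalg.norm(x_naive - x_true)
err_reg = np.linalg.norm(x_reg - x_true)
print("error without regularization:", err_naive)
print("error with regularization:   ", err_reg)
```

In a fuller treatment $\alpha$ would be tied to the noise bound $\delta$, for instance by solving the discrepancy equation for $\alpha$ with a one-dimensional root finder.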
For example, the expected value of the multiplication by the independent variable operator, $(M\psi)(x) = x\psi(x)$,

\[ E(M) = \int_{-\infty}^{\infty} x\,\psi(x)\,\overline{\psi(x)}\,dx \]

is the mean position of the particle (slight modifications of the arguments given in the section on unbounded operators show that $M$ is unbounded and self-adjoint). For this reason the observable $M$ is called the position operator. The variance of an observable $T$ when the system is in state $\psi$ is defined by

\[ \mathrm{Var}(T) = \|(T - E(T)I)\psi\|^2. \]

That is, the variance gives a measure of the dispersion of an observable from its expected value.

Definition. The commutator of two observables $S$ and $T$ is the observable $[S, T] : D(ST) \cap D(TS) \to H$ defined by $[S, T] = ST - TS$.

$S$ and $T$ are said to commute if $ST = TS$, i.e., $D(ST) = D(TS)$ and $[S, T] = 0$. In abstract form the Heisenberg principle says that for any state $\psi \in D([S, T])$

\[ \tfrac{1}{4}\,|E([S, T])|^2 \le \mathrm{Var}(S) \times \mathrm{Var}(T). \]

Suppose $D(P)$ is the subspace of all absolutely continuous functions in $H$ whose first derivative is also in $H$. Define the operator $P : D(P) \to H$ by

\[ P\psi(x) = \frac{h}{2\pi i}\,\psi'(x) \]

where $h$ is Planck's constant. Then $P$ is self-adjoint, that is, an observable. A physical argument shows that $E(P)$ is the expected value of the momentum of the system and hence $P$ is called the momentum operator. The commutator of the momentum and position operators can be found from the relation

\[ (PM\psi)(x) = \frac{h}{2\pi i}\,\frac{d}{dx}[x\psi(x)] = \frac{h}{2\pi i}\,[\psi(x) + x\psi'(x)] = \frac{h}{2\pi i}\,\psi(x) + (MP\psi)(x), \]

i.e., for all $\psi \in D([P, M])$:

\[ [P, M]\psi = \frac{h}{2\pi i}\,\psi. \]

The general Heisenberg principle then gives

\[ \mathrm{Var}(P) \times \mathrm{Var}(M) \ge \left(\frac{h}{4\pi}\right)^2. \]

This is an expression of the physical uncertainty principle: no matter the state $\psi$, the position and momentum cannot both be determined with arbitrary certainty.

SEE ALSO THE FOLLOWING ARTICLES

CONVEX SETS • DATA MINING AND KNOWLEDGE DISCOVERY • DIFFERENTIAL EQUATIONS, ORDINARY • FOURIER SERIES • GENERALIZED FUNCTIONS • TOPOLOGY, GENERAL
P1: FLV/LPB P2: FJU Final Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN006A-278 June 29, 2001 21:22

Generalized Functions
Ram P. Kanwal
Pennsylvania State University

I. Introduction
II. Distributions
III. Algebraic Operations on Distributions
IV. Analytic Operations on Distributions
V. Pseudo-Function, Hadamard Finite Part and Regularization
VI. Distributional Derivatives of Discontinuous Functions
VII. Convergence of Distributions and Fourier Series
VIII. Direct Product and Convolution of Distributions
IX. Fourier Transform
X. Poisson Summation Formula
XI. Asymptotic Evaluation of Integrals: A Distributional Approach

GLOSSARY

Convolution The convolution $f * g$ of two functions $f(x)$ and $g(x)$ is defined as $\int_{-\infty}^{\infty} f(y)\,g(x - y)\,dy$. This concept carries over to the generalized functions also.
Dirac delta function $\delta(x)$ This function is intuitively defined to be zero when $x \ne 0$ and infinite at $x = 0$, in such a way that the area under it is unity.
Distribution A linear continuous functional.
Distributional (generalized) derivatives Nonclassical derivatives of the generalized functions which do not have classical derivatives at their singularities.
Functional A rule which assigns a number $\langle t, \phi\rangle$ to a test function $\phi(x)$ through a generalized function $t(x)$.
Hadamard finite part Finite difference of two infinite terms for defining divergent integrals.
Impulse response The particular solution of a differential equation with $\delta(x)$ (impulse) as its forcing term.
Pseudo-function Singular functions such as $1/x^k$ which are regularized at their singularity with the help of concepts such as the Hadamard finite part.
Regular distributions Distributions based on locally integrable functions.
Singular distributions (generalized functions) Distributions which are not regular.
Test function space A space of suitable smooth functions $\phi(x)$ which is instrumental in defining a generalized function such as $\delta(x)$ as a distribution.

GENERALIZED FUNCTIONS are objects, such as the Dirac delta function, with such inherent singularities that they cannot be integrated or differentiated in the classical sense. By defining them as functionals (distributions) which carry smooth functions to numbers, we can overcome the difficulty. Then they possess remarkable properties that extend the capabilities of the classical mathematics. Indeed, the techniques based on the generalized functions not only solve the classical problems in a simple fashion but also produce many new concepts. Accordingly, they have influenced many topics in mathematical and physical sciences.

I. INTRODUCTION

Functions such as Dirac's delta function and the Heaviside function have been used by scientists even though the former is not a function and the latter is not a differentiable function in the classical sense. The theory of distributions has provided us not only with mathematical foundations for these functions but also for various other nonclassical functions. The rudiments of the theory of distributions can be found in the concepts of the Hadamard finite part of the divergent integrals and sequences which in the limit behave like the delta function. However, it was Sobolev and Schwartz who established the modern theory of distributions.

The Dirac delta function $\delta(x - \xi)$, also called the impulse function, is usually defined as a function which is zero everywhere except at $x = \xi$, where it has a spike such that $\int_{-\infty}^{\infty} \delta(x - \xi)\,dx = 1$. More generally, it is defined by its sifting property,

\[ \int_{-\infty}^{\infty} f(x)\,\delta(x - \xi)\,dx = f(\xi), \tag{1} \]

for all continuous functions $f(x)$. However, for any acceptable definition of integration there can be no such function. This difficulty was subsequently overcome by two approaches. The first is to define the delta function as the limit of delta sequences, while the second is to define it as a distribution. The reasoning behind a delta sequence is that although the delta function cannot be justified mathematically, there are sequences $\{s_m(x)\}$ which in the limit $m \to \infty$ satisfy the relation (1). An interesting sequence is

\[ s_m(x) = \frac{1}{\pi}\cdot\frac{m}{1 + m^2 x^2}. \tag{2} \]

It is instructive to imagine (2) as a continuous charge distribution $s_m(x)$ on a line, so that the total charge to the left of $x$ is $r_m(x) = \int_{-\infty}^{x} s_m(u)\,du = \tfrac{1}{2} + (1/\pi)\tan^{-1} mx$. Because $\int_{-\infty}^{\infty} s_m(x)\,dx = 1$, the total charge on the line is equal to unity. Furthermore, it can be readily proved that

\[ \lim_{m\to\infty} \int_{-\infty}^{\infty} f(x)\,s_m(x)\,dx = f(0). \]

Accordingly, it satisfies the sifting property (1).

Defining $\delta(x)$ as a distribution is much more convenient and useful. Moreover, the theory of distributions helps us define many more functions which are singular in nature.

II. DISTRIBUTIONS

To make $\delta(x)$ meaningful, we appeal to various spaces of smooth functions. We start with the real-valued functions $\phi(x)$, where $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, the $n$-dimensional Euclidean space. To describe various properties of these functions we introduce the multi-index notation. Let $k$ be an $n$-tuple of non-negative integers: $k = (k_1, \ldots, k_n)$. Then we define

\[ |k| = k_1 + \cdots + k_n; \quad x^k = x_1^{k_1}\cdots x_n^{k_n}; \quad k! = k_1!\cdots k_n!; \quad \binom{k}{m} = \frac{k!}{m!\,(k - m)!}; \quad D_j = \frac{\partial}{\partial x_j}; \tag{3} \]

\[ D^k = \frac{\partial^{|k|}}{\partial x_1^{k_1}\cdots\partial x_n^{k_n}} = \frac{\partial^{k_1+\cdots+k_n}}{\partial x_1^{k_1}\cdots\partial x_n^{k_n}} = D_1^{k_1}\cdots D_n^{k_n}. \]

We are now ready to state the properties of the functions $\phi(x)$. They are (1) $D^k\phi(x)$ exists for all multi-indices $k$. (2) There exists a number $A$ such that $\phi(x)$ vanishes for $r > A$, where $r$ is the radial distance $r = (x_1^2 + \cdots + x_n^2)^{1/2}$. This means that $\phi(x)$ has a compact support. These two properties are written symbolically as $\phi(x) \in C_0^\infty$, where the superscript stands for infinite differentiability, the subscript for the compact support. This space of functions is denoted by $\mathcal{D}$, and $\phi(x)$ are called the test functions. The prototype example in $\mathbb{R}$ is

\[ \phi(x) = \begin{cases} \exp\left(-\dfrac{a^2}{a^2 - x^2}\right), & |x| < a, \\[1ex] 0, & |x| \ge a. \end{cases} \tag{4} \]

Note that the definition of $\mathcal{D}$ does not demand that all $\phi(x)$ have the same support. Furthermore, we observe that (a) $\mathcal{D}$ is a linear (vector) space, because if $\phi_1$ and $\phi_2$ are in $\mathcal{D}$, then so is $c_1\phi_1 + c_2\phi_2$ for arbitrary real numbers $c_1$ and $c_2$. (b) If $\phi \in \mathcal{D}$, then so is $D^k\phi$. (c) For a $C^\infty$ function $f(x)$ and a $\phi(x) \in \mathcal{D}$, $f\phi \in \mathcal{D}$. (d) If $\phi(x_1, \ldots, x_m)$ is an $m$-dimensional test function and $\psi(x_{m+1}, \ldots, x_n)$ is an $(n - m)$-dimensional test function, then $\phi\psi$ is an $n$-dimensional test function in the variables $x_1, \ldots, x_n$.
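The delta sequence (2) and the prototype test function (4) can be combined in a short numerical check of the sifting property. The sketch below is illustrative only: the choice $a = 1$, the grid size, the values of $m$, and the helper names `bump`, `delta_seq`, and `pairing` are assumptions made for the example, and the integral is approximated by a plain trapezoidal rule.

```python
import numpy as np

def bump(x, a=1.0):
    """Prototype test function (4): exp(-a^2 / (a^2 - x^2)) inside |x| < a, else 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < a
    out[inside] = np.exp(-a**2 / (a**2 - x[inside] ** 2))
    return out

def delta_seq(x, m):
    """Delta sequence (2): s_m(x) = (1/pi) * m / (1 + m^2 x^2)."""
    return (m / np.pi) / (1.0 + (m * x) ** 2)

def pairing(m, a=1.0, n=400001):
    """Trapezoidal approximation of the integral of s_m(x) * bump(x) over [-a, a]."""
    x = np.linspace(-a, a, n)
    y = delta_seq(x, m) * bump(x, a)
    dx = x[1] - x[0]
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

phi0 = bump(np.array([0.0]))[0]          # phi(0) = exp(-1) when a = 1
errors = [abs(pairing(m) - phi0) for m in (10, 100, 1000)]
print("phi(0) =", phi0)
print("errors for m = 10, 100, 1000:", errors)
```

The errors shrink roughly in proportion to $1/m$, which reflects the slowly decaying tails of $s_m$; the limit is the value $\phi(0)$ predicted by the sifting property.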
With the following four definitions we are able to grasp the concept of distributions:

1. A sequence $\{\phi_m(x)\}$, $m = 1, 2, \ldots$, for all $\phi_m \in \mathcal{D}$, converges to $\phi_0$ if the following two conditions are satisfied. (a) All $\phi_m$ as well as $\phi_0$ vanish outside a common region. (b) $D^k\phi_m \to D^k\phi_0$ uniformly over $\mathbb{R}^n$ as $m \to \infty$, for all multi-indices $k$. For the special case $\phi_0 = 0$, the sequence $\{\phi_m\}$ is called a null sequence.
2. A linear functional $t(x)$ on the space $\mathcal{D}$ is an operation by which we assign to every test function $\phi(x)$ a real number $\langle t(x), \phi(x)\rangle$ such that $\langle t, c_1\phi_1 + c_2\phi_2\rangle = c_1\langle t, \phi_1\rangle + c_2\langle t, \phi_2\rangle$ for arbitrary test functions $\phi_1(x)$ and $\phi_2(x)$ and real numbers $c_1$ and $c_2$. Then it follows that $\langle t, \sum_{j=1}^{m} c_j\phi_j\rangle = \sum_{j=1}^{m} c_j\langle t, \phi_j\rangle$, where $c_j$ are arbitrary real numbers.
3. A linear functional $t(x)$ on $\mathcal{D}$ is called continuous if and only if the sequence of numbers $\langle t, \phi_m\rangle \to \langle t, \phi\rangle$, as $m \to \infty$, when the sequence $\{\phi_m\}$ of test functions converges to $\phi \in \mathcal{D}$, that is,

\[ \lim_{m\to\infty}\langle t, \phi_m\rangle = \left\langle t, \lim_{m\to\infty}\phi_m\right\rangle. \]

4. A continuous linear functional on the space $\mathcal{D}$ is called a distribution.

The most useful distributions are the ones generated by locally integrable functions $f(x)$ (that is, $\int_\Omega |f(x)|\,dx$ exists for every bounded region $\Omega$ of $\mathbb{R}^n$). Indeed, every locally integrable function $f(x)$ generates a distribution through the relation $\langle f, \phi\rangle = \int f(x)\phi(x)\,dx$. The linearity and continuity of this functional can be readily proved. The distributions produced by locally integrable functions are called regular distributions or distributions of order zero. All the other ones are called singular distributions. We present two examples in the one-dimensional space $\mathbb{R}$.

1. Heaviside distribution:

\[ H(x) = \begin{cases} 0, & x < 0, \\ 1, & x > 0. \end{cases} \]

The Heaviside function defines the regular distribution $\langle H, \phi\rangle = \int_0^\infty \phi(x)\,dx$. It is clearly a linear functional. It is also continuous because $\lim_{m\to\infty}\langle H(x), \phi_m(x)\rangle = \lim_{m\to\infty}\int_0^\infty \phi_m(x)\,dx = \int_0^\infty \phi(x)\,dx$, where $\phi(x)$ is the limit of the sequence $\{\phi_m(x)\}$ as $m \to \infty$. Thus,

\[ \lim_{m\to\infty}\langle H(x), \phi_m(x)\rangle = \left\langle H(x), \lim_{m\to\infty}\phi_m(x)\right\rangle. \]

2. The Dirac delta function $\delta(x - \xi)$. By the sifting property (1) we have $\langle\delta(x - \xi), \phi(x)\rangle = \int_{-\infty}^{\infty}\delta(x - \xi)\phi(x)\,dx = \phi(\xi)$, a number. The linearity also follows from the sifting property. This functional is continuous because $\lim_{m\to\infty}\langle\delta(x - \xi), \phi_m(x)\rangle = \lim_{m\to\infty}\phi_m(\xi) = \langle\delta(x - \xi), \lim_{m\to\infty}\phi_m(x)\rangle$. Since $\delta(x - \xi)$ is not a locally integrable function, it produces a singular distribution. These results hold in $\mathbb{R}^n$ as well. The functions which generate singular distributions are called generalized functions. We shall use these words interchangeably. The definition of a distribution can be extended to include complex-valued functions.

The dual space $\mathcal{D}'$. The space of all distributions on $\mathcal{D}$ is called the dual space of $\mathcal{D}$ and is denoted as $\mathcal{D}'$. It is also a linear space.

There are many other interesting spaces. For example, the test function space $\mathcal{E}$ consists of all the $C^\infty$ functions $\phi(x)$, so there is no limit on their growth at infinity. Accordingly, $\mathcal{E} \supset \mathcal{D}$. The corresponding space $\mathcal{E}'$ consists of the distributions which have compact support, so that $\mathcal{E}' \subset \mathcal{D}'$. The distributions of slow growth are defined in Sect. IX.

III. ALGEBRAIC OPERATIONS ON DISTRIBUTIONS

Let $\langle t(x), \phi(x)\rangle$ be a regular distribution generated by a locally integrable function $t(x)$, $x \in \mathbb{R}$. Let $x = ay - b$, where $a$ and $b$ are constants. Then we have

\[ \langle t(ay - b), \phi(y)\rangle = \int_{-\infty}^{\infty} t(ay - b)\,\phi(y)\,dy = \frac{1}{|a|}\int_{-\infty}^{\infty} t(x)\,\phi\!\left(\frac{x + b}{a}\right)dx = \frac{1}{|a|}\left\langle t(x), \phi\!\left(\frac{x + b}{a}\right)\right\rangle. \tag{5} \]

For the special case $a = 1$, relation (5) becomes $\langle t(y - b), \phi(y)\rangle = \langle t(x), \phi(x + b)\rangle$. As another special case, (5) yields $\langle\delta(-y), \phi(y)\rangle = \langle\delta(x), \phi(-x)\rangle = \phi(0)$. Thus $\delta(x)$ is an even function.

A distribution is called homogeneous of degree $\lambda$ if $t(ax) = a^\lambda t(x)$, $a > 0$. In view of relation (5) with $b = 0$, we find that for a homogeneous distribution we have $\langle t(ay), \phi(y)\rangle = (1/a)\langle t(x), \phi(x/a)\rangle = a^\lambda\langle t(x), \phi(x)\rangle$, so that $\langle t(x), \phi(x/a)\rangle = a^{\lambda+1}\langle t(x), \phi(x)\rangle$.

All these relations hold also for distributions $t(x)$, $x \in \mathbb{R}^n$. In that case we set $x = Ay - B$, where $A$ is a nonsingular $n \times n$ matrix and $B$ is a constant vector. Then we have $\langle t(Ay - B), \phi(y)\rangle = (1/|\det A|)\langle t(x), \phi[A^{-1}(x + B)]\rangle$, where $A^{-1}$ is the inverse of the matrix $A$.

Product of a distribution and a function: In general it is difficult to define the product of two distributions,
even for two regular distributions. Indeed, if we take $f(x) = g(x) = 1/\sqrt{x}$, which is a regular distribution, then their product $1/x$ is not a locally integrable function in the neighborhood of $x = 0$ and as such does not define a regular distribution (we shall discuss the function $1/x$ in Sect. V). However, we can multiply a distribution $t(x)$ with a $C^\infty$ function $\psi(x)$ so that we have $\langle\psi t, \phi\rangle = \langle t, \psi\phi\rangle$. Because $\psi\phi \in \mathcal{D}$, it follows that $\psi t$ is a distribution.

IV. ANALYTIC OPERATIONS ON DISTRIBUTIONS

Let us start with a regular distribution generated by a $C^1$ function $t(x)$, $x \in \mathbb{R}$, so that $\langle t(x), \phi(x)\rangle = \int_{-\infty}^{\infty} t(x)\phi(x)\,dx$. When we integrate the quantity $\langle t'(x), \phi(x)\rangle = \int_{-\infty}^{\infty} t'(x)\phi(x)\,dx$ by parts, we get $\int_{-\infty}^{\infty} t'(x)\phi(x)\,dx = -\int_{-\infty}^{\infty} t(x)\phi'(x)\,dx$, where we have used the fact that $\phi(x)$ has a compact support. Thus, $\langle t'(x), \phi(x)\rangle = -\langle t(x), \phi'(x)\rangle$. Because $\phi'(x)$ is also in $\mathcal{D}$, this relation helps us in defining the distributional derivative $t'(x)$ of a distribution $t(x)$ (regular or singular) as above. Continuing this process, we have $\langle t^{(n)}(x), \phi(x)\rangle = (-1)^n\langle t(x), \phi^{(n)}(x)\rangle$, where the superscript $(n)$ stands for $n$th-order differentiation. The corresponding formula in $\mathbb{R}^n$ is

\[ \langle D^k t(x), \phi(x)\rangle = (-1)^{|k|}\langle t(x), D^k\phi(x)\rangle. \tag{6} \]

Thus a generalized function is infinitely differentiable. This result has tremendous ramifications, as we shall soon discover.

The primitive of a distribution $t(x)$ is a solution of $ds(x)/dx = t(x)$. This means that we seek $s(x) \in \mathcal{D}'$ such that $\langle s, \phi'\rangle = -\langle t, \phi\rangle$, $\phi \in \mathcal{D}$.

Let us illustrate the concept of distributional differentiation with the help of a few applications.

1. Recall that the Heaviside function is defined as a distribution by the relation $\langle H(x), \phi(x)\rangle = \int_0^\infty \phi(x)\,dx$. According to definition (6), we have $\langle H'(x), \phi(x)\rangle = -\langle H(x), \phi'(x)\rangle = -\int_0^\infty \phi'(x)\,dx = \phi(0) = \langle\delta(x), \phi(x)\rangle$, where we have used the fact that $\phi(x)$ vanishes at $\infty$. Thus,

\[ \frac{dH}{dx} = \delta(x). \tag{7} \]

Because $H(x)$ is a distribution of order zero, $\delta(x)$ is a distribution of order 1. We can continue this process and find that $\langle\delta^{(n)}(x), \phi(x)\rangle = (-1)^n\langle\delta(x), \phi^{(n)}(x)\rangle$. Just as $\delta(x)$ stands for an impulse or a pole at $x = 0$, $\delta'(x)$ stands for a dipole and $\delta^{(n)}(x)$ is a multipole of order $(n + 1)$ as well as a distribution of order $(n + 1)$.

2. Impulse response. With the help of the formula (7) we can find the derivative of the signum function $\operatorname{sgn} x$, which is 1 for $x > 0$ and $-1$ for $x < 0$. Thus, $\operatorname{sgn} x = 2H(x) - 1$. Then it follows from (7) that $(\operatorname{sgn} x)' = 2\delta(x)$. This result, in turn, helps us in finding the derivative of the function $|x|$. Indeed, $\langle |x|, \phi(x)\rangle = \int_{-\infty}^{\infty} |x|\phi(x)\,dx = \int_0^\infty x\phi(x)\,dx - \int_{-\infty}^{0} x\phi(x)\,dx$. Thus, $\langle |x|', \phi(x)\rangle = -\langle |x|, \phi'(x)\rangle = -\int_0^\infty x\phi'(x)\,dx + \int_{-\infty}^{0} x\phi'(x)\,dx$, which, when integrated by parts, yields the formula $\langle |x|', \phi(x)\rangle = \langle\operatorname{sgn} x, \phi(x)\rangle$. Thus, $|x|' = \operatorname{sgn} x$. The second differentiation yields $(d^2/dx^2)(|x|) = 2\delta(x)$.

3. A distribution $E(x)$ is said to be a fundamental solution or a free-space Green's function or an impulse response if it satisfies the relation $LE(x) = \delta(x)$, where $L$ is a differential operator. Then from the previous paragraph we find that $\tfrac{1}{2}|x|$ is the impulse response for the differential operator $L = (d^2/dx^2)$.

Relation (6) for differentiation helps us in deriving many interesting formulas such as

\[ t(x)\delta^{(n)}(x) = (-1)^n t^{(n)}(0)\delta(x) + (-1)^{n-1} n\,t^{(n-1)}(0)\delta'(x) + (-1)^{n-2}\frac{n(n - 1)}{2!}\,t^{(n-2)}(0)\delta''(x) + \cdots + t(0)\delta^{(n)}(x) \tag{8} \]

for a $C^\infty$ function $t(x)$. The proof follows by evaluating the quantity $\langle t(x)\delta^{(n)}(x), \phi(x)\rangle = \langle\delta^{(n)}(x), t(x)\phi(x)\rangle = (-1)^n\langle\delta(x), (t(x)\phi(x))^{(n)}\rangle$. For $n = 0$ and 1, formula (8) becomes

\[ t(x)\delta(x) = t(0)\delta(x); \qquad t(x)\delta'(x) = -t'(0)\delta(x) + t(0)\delta'(x). \tag{9} \]

One of the most important consequences of (8) is

\[ x^m\delta^{(n)}(x) = \begin{cases} (-1)^m\dfrac{n!}{(n - m)!}\,\delta^{(n-m)}(x), & n \ge m, \\[1ex] 0, & n < m. \end{cases} \tag{10} \]

4. Next, we attempt to evaluate $\delta[f(x)]$. Let us first assume that $f(x)$ has a simple zero at $x_1$ such that $f(x_1) = 0$ but $f'(x_1) > 0$ [the case $f'(x_1) < 0$ follows in a similar fashion]. Thus $f(x)$ increases monotonically in the neighborhood of $x_1$ so that $H[f(x)] = H(x - x_1)$, where $H(x)$ is the Heaviside function. Then we use (7), together with $(d/dx)H[f(x)] = f'(x)\delta[f(x)]$, and find that $(d/dx)H[f(x)] = \delta(x - x_1)$ or $\delta[f(x)] = |f'(x_1)|^{-1}\delta(x - x_1)$. If there are $n$ simple zeros of $f(x)$, then the above result yields

\[ \delta[f(x)] = \sum_{m=1}^{n}\frac{\delta(x - x_m)}{|f'(x_m)|}. \tag{11} \]

This result has many applications.
5. The formula $(\operatorname{sgn} x)' = 2\delta(x)$, derived in Example 2, is also instrumental in deriving an integral representation for the delta function. Indeed, if we appeal to the relations

\[ \int_{-\infty}^{\infty}\frac{\sin tx}{x}\,dx = \begin{cases} \pi, & t > 0, \\ -\pi, & t < 0; \end{cases} \qquad \int_{-\infty}^{\infty}\frac{\cos tx}{x}\,dx = 0, \]

from calculus, we find that $(1/2\pi)\int_{-\infty}^{\infty}(e^{itx}/ix)\,dx = \tfrac{1}{2}\operatorname{sgn} t$. When we differentiate this relation with respect to $t$ we get the important formula

\[ \delta(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{itx}\,dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,dx, \tag{12} \]

where we have used the fact that $\delta(t)$ is an even function. This formula can be generalized to $n$ dimensions if we observe that $\delta(x_1, \ldots, x_n) = \delta(x_1)\cdots\delta(x_n)$. Accordingly, in the four-dimensional space $(x_1, x_2, x_3, t)$, relation (12) yields the plane-wave expansion of the delta function,

\[ \delta(x, t) = \frac{1}{(2\pi)^4}\int_{-\infty}^{\infty} e^{-i(k\cdot x - \omega t)}\,d^3k\,d\omega, \tag{13} \]

where $k = (k_1, k_2, k_3)$ and we have a fourfold integral.

V. PSEUDO-FUNCTION, HADAMARD FINITE PART AND REGULARIZATION

The functions $1/x^m$ and $H(x)/x^m$, where $m$ is an integer, are not locally integrable at $x = 0$ and, as such, do not define distributions. However, we can appeal to the concepts of Cauchy principal value and the Hadamard finite part and define these functions as distributions. The simplest example is the function $1/x$. For a test function $\phi(x)$ the integral $\int[\phi(x)/x]\,dx$ is not absolutely convergent unless $\phi(0) = 0$. However, it has an interpretation as a principal value integral, namely,

\[ \left\langle \mathrm{Pv}\,\frac{1}{x}, \phi(x)\right\rangle = \lim_{\epsilon\to 0}\int_{|x|>\epsilon}\frac{\phi(x)}{x}\,dx. \tag{14} \]

Writing $\phi(x) = \phi(0) + [\phi(x) - \phi(0)]$, relation (14) takes the form

\[ \left\langle \mathrm{Pv}\,\frac{1}{x}, \phi(x)\right\rangle = \lim_{\epsilon\to 0}\left\{\int_{|x|>\epsilon}\frac{\phi(0)}{x}\,dx + \int_{|x|>\epsilon}\frac{[\phi(x) - \phi(0)]}{x}\,dx\right\}. \tag{15} \]

Since the function $1/x$ is odd, the first term on the right side of (15) vanishes. The integrand in the second integral on the right side of (15) approaches $\phi'(0)$ as $x \to 0$. Accordingly, we have

\[ \left\langle \mathrm{Pv}\,\frac{1}{x}, \phi(x)\right\rangle = \lim_{\epsilon\to 0}\int_{|x|>\epsilon}\frac{\phi(x) - \phi(0)}{x}\,dx. \]

Because $\epsilon > 0$ is arbitrary, the previous relation is also written as

\[ \left\langle \mathrm{Pv}\,\frac{1}{x}, \phi(x)\right\rangle = \lim_{\epsilon\to 0}\int_{|x|>1}\frac{\phi(x) - \phi(0)}{x}\,dx. \tag{16} \]

In this form the function $\mathrm{Pv}(1/x)$ is easily proved to be a linear continuous functional. As such, it is a distribution and is written as $\mathrm{Pf}(1/x)$, where Pf stands for pseudo-function.

We come across many divergent integrals whose principal values do not exist. Consider, for instance, the integral $\int_a^b dx/x^2$, $a < 0 < b$. Because

\[ \lim_{\epsilon\to 0}\int_{[a,b]\setminus[-\epsilon,\epsilon]}\frac{dx}{x^2} = \lim_{\epsilon\to 0}\left(-\frac{1}{b} + \frac{1}{a} + \frac{2}{\epsilon}\right), \tag{17} \]

we cannot get the principal value. In these situations the concept of the Hadamard finite part becomes helpful. Indeed, in relation (17), the value $(-1/b + 1/a)$ is the finite part and $(2/\epsilon)$ is the infinite part. Thus, we write

\[ \mathrm{Fp}\int_a^b\frac{dx}{x^2} = -\frac{1}{b} + \frac{1}{a}. \tag{18} \]

As another example, we consider the integral $\int_0^1 dx/x^\alpha$. It is divergent if $\alpha \ge 1$. Indeed, for $\alpha > 1$, we obtain

\[ \int_\epsilon^1\frac{dx}{x^\alpha} = \frac{1}{1 - \alpha} - \frac{\epsilon^{1-\alpha}}{1 - \alpha} \tag{19} \]

so that we have

\[ \mathrm{Fp}\int_0^1\frac{dx}{x^\alpha} = \frac{1}{1 - \alpha}, \quad \alpha > 1. \tag{20} \]

When $\alpha = 1$, $\int_\epsilon^1 dx/x = -\ln\epsilon$, which is infinite as $\epsilon \to 0$, so that

\[ \mathrm{Fp}\int_0^1\frac{dx}{x} = 0. \tag{21} \]

The process of finding the principal value and the finite part of divergent integrals is called the regularization of these integrals. When both of them exist, they are equal.

Let us use the definition of the Hadamard finite part to regularize the function $H(x)/x$. The action of $H(x)/x$ on a test function $\phi(x)$ is $\int_0^\infty(\phi(x)/x)\,dx$. Thus, we consider

\[ \int_\epsilon^\infty\frac{\phi(x)}{x}\,dx = \int_\epsilon^1\frac{\phi(x)}{x}\,dx + \int_1^\infty\frac{\phi(x)}{x}\,dx. \tag{22} \]
In the numerator of the first term on the right side of this equation we add and subtract $\phi(0)$ to get

\[ \int_\epsilon^\infty\frac{\phi(x)}{x}\,dx = \int_\epsilon^1\frac{\phi(0)}{x}\,dx + \int_\epsilon^1\frac{\phi(x) - \phi(0)}{x}\,dx + \int_1^\infty\frac{\phi(x)}{x}\,dx = -\phi(0)\ln\epsilon + \int_\epsilon^1\frac{\phi(x) - \phi(0)}{x}\,dx + \int_1^\infty\frac{\phi(x)}{x}\,dx. \]

The first term on the right side of this equation is infinite as $\epsilon \to 0$, while the other two terms are finite. Accordingly, we define

\[ \mathrm{Fp}\int_{-\infty}^{\infty}\frac{H(x)\phi(x)}{x}\,dx = \int_0^1\frac{\phi(x) - \phi(0)}{x}\,dx + \int_1^\infty\frac{\phi(x)}{x}\,dx. \tag{23} \]

The function $\mathrm{Pf}(1/|x|)$ can be regularized in the same way. Indeed,

\[ \left\langle \mathrm{Pf}\,\frac{1}{|x|}, \phi(x)\right\rangle = \int_{|x|\le 1}\frac{\phi(x) - \phi(0)}{|x|}\,dx + \int_{|x|>1}\frac{\phi(x)}{|x|}\,dx. \tag{24} \]

These concepts can be used to regularize the functions $1/x^m$ and $H(x)/x^m$ to yield the generalized functions $\mathrm{Pf}(1/x^m)$ and $\mathrm{Pf}(H(x)/x^m)$. The combination of $\mathrm{Pf}(1/x^m)$ and $\delta^{(m)}(x)$ yields the Heisenberg distributions

\[ \delta^{\pm(m)}(x) = \frac{1}{2}\,\delta^{(m)}(x) \mp \frac{1}{2\pi i}\,\mathrm{Pf}\,\frac{1}{x^m}. \]

They arise in quantum mechanics.

Let us end this section by solving the equation $xt(x) = g(x)$. The homogeneous part $xt(x) = 0$ has the solution $t(x) = c\,\delta(x)$, with $c$ an arbitrary constant, because $x\delta(x) = 0$. Thus, the complete solution is

\[ t(x) = c\,\delta(x) + g(x)\,\mathrm{Pf}\,\frac{1}{x}. \tag{25} \]

VI. DISTRIBUTIONAL DERIVATIVES OF DISCONTINUOUS FUNCTIONS

Let us start with $x \in \mathbb{R}$ and consider a function $F(x)$ that has a jump discontinuity at $x = \xi$ of magnitude $a$ but that has the derivative $F'(x)$ in the intervals $x < \xi$ and $x > \xi$. The derivative is undefined at $x = \xi$. To find the distributional derivative of $F(x)$ in the entire interval we define the function $f(x) = F(x) - aH(x - \xi)$, where $H$ is the Heaviside function. This function is continuous at $x = \xi$ and has a derivative which coincides with $F'(x)$ on both sides of $\xi$. Accordingly, we differentiate both sides of this equation and get $F'(x) = \bar{F}'(x) - a\delta(x - \xi)$, where the bar over $F$ stands for the generalized (distributional) derivative of $F$. Thus,

\[ \bar{F}'(x) = F'(x) + [F]\,\delta(x - \xi), \tag{26} \]

where $[F] = a$ is the value of the jump of $F$ at $x = \xi$. Before we present the corresponding $n$-dimensional theory, we give an interesting application of (26) to the Sturm-Liouville differential equation,

\[ \frac{d}{dx}\left[p(x)\frac{dE(x, \xi)}{dx}\right] = q(x)E(x, \xi) - \delta(x - \xi), \tag{27} \]

where $E(x, \xi)$ stands for the impulse response and $p(x)$ and $q(x)$ are continuous at $x = \xi$. When we compare (26) and (27) and use the continuity of $p(x)$ at $x = \xi$, we find that the jump of $dE/dx$ at $x = \xi$ is

\[ [dE/dx]_{x=\xi} = -1/p(\xi). \]

These concepts easily generalize to higher-order derivatives and to the surfaces of discontinuity and, therefore, have applications in the theory of wave fronts. Accordingly, we include time $t$ in our discussion and consider a function $F(x, t)$, $x \in \mathbb{R}^n$, which has a jump discontinuity across a moving surface $\Sigma(x, t)$. Such a surface can be represented locally either as an implicit equation of the form $u(x_1, \ldots, x_n, t) = 0$, or in terms of the curvilinear Gaussian coordinates $v_1, \ldots, v_{n-1}$ on the surface: $x_i = x_i(v_1, \ldots, v_{n-1}, t)$. The surface is regular, so the above-mentioned functions have derivatives of all orders with respect to each of their arguments, and for all values of $t$, the corresponding Jacobian matrices of transformation have appropriate ranks, that is, $\operatorname{grad} u \ne 0$, and the rank of the matrix $(\partial x_i/\partial v_j)$ is $n - 1$. Furthermore, $\Sigma$ divides the space into two parts, which we shall call positive and negative.

The basic distribution concentrated on a moving and deforming surface $\Sigma(x, t)$ is the delta function $\delta[\Sigma(x, t)]$ whose action on a test function $\phi(x, t) \in \mathcal{D}$ is

\[ \langle\delta(\Sigma), \phi\rangle = \int_{-\infty}^{\infty}\int_\Sigma\phi(x, t)\,dS(x)\,dt, \tag{28} \]

where $dS$ is the surface element on $\Sigma$. This is a simple layer. The second surface distribution is the normal derivative operator, given as
$$\langle d_n\delta(\Sigma), \phi \rangle = -\int_{-\infty}^{\infty}\int_{\Sigma} \frac{d\phi}{dn}\,dS(x)\,dt, \tag{29}$$

where $d\phi/dn$ stands for the differentiation along the normal to the surface. This is a dipole layer. Another surface distribution that we need in our discussion is $\delta'(\Sigma)$, defined as

$$\delta'(\Sigma) = n_i\,\frac{\bar\partial}{\partial x_i}[\delta(\Sigma)], \tag{30}$$

where $n_i$ are the components of the unit normal vector $n$ to the surface. These three distributions are connected by the relation

$$d_n\delta(\Sigma) = \delta'(\Sigma) - 2\Omega\,\delta(\Sigma), \tag{31}$$

where $\Omega$ is the mean curvature of $\Sigma$.

Let a function $F(x,t)$ have a jump discontinuity across $\Sigma(x,t)$. Then the formulas for the time derivative, gradient, divergence, and curl of a discontinuous function $F(x,t)$ can be derived in a manner similar to (26). Indeed,

$$\frac{\bar\partial F}{\partial t} = \frac{\partial F}{\partial t} - G[F]\delta(\Sigma), \tag{32}$$
$$\overline{\operatorname{grad}}\,F = \operatorname{grad} F + n[F]\delta(\Sigma), \tag{33}$$
$$\overline{\operatorname{div}}\,F = \operatorname{div} F + n\cdot[F]\delta(\Sigma), \tag{34}$$
$$\overline{\operatorname{curl}}\,F = \operatorname{curl} F + n\times[F]\delta(\Sigma), \tag{35}$$

where $G$ is the normal speed of the front $\Sigma$ and $[F] = F^+ - F^-$ is the jump of $F$ across $\Sigma$.

Before we extend these concepts, we apply relation (33) to a scalar function $F$ and derive the distributional derivative of $1/r$, where $r$ is the radial distance. In mathematical physics the function $1/r$ is important because it describes the gravitational and the Coulomb potentials. In the modern theories of small particles the singularity of $1/r$ at $r = 0$ makes a significant contribution. Accordingly, its distributional derivatives are needed to get the complete picture of the singularity at $r = 0$, and we present them as follows. For the sake of computational simplicity, we restrict ourselves to $R^3$, so that $r = (x_1^2 + x_2^2 + x_3^2)^{1/2}$. The function $1/r$ corresponds to the $\operatorname{Pf}[H(x)/x]$ that we considered in the previous section. Accordingly, we introduce the function $F(x) = H(r-\varepsilon)/r$, where $H(r-\varepsilon)$ is the Heaviside function, which is unity for $r > \varepsilon$ and is zero for $r < \varepsilon$. This, in turn, helps us in defining the distribution $1/r$ as

$$\left\langle \frac{1}{r}, \phi(x) \right\rangle = \lim_{\varepsilon\to 0}\int_{-\infty}^{\infty} \frac{1}{r}\,H(r-\varepsilon)\,\phi(x)\,dx. \tag{36}$$

Our aim is to differentiate $1/r$ by taking into account the singularity at $r = 0$. For this we appeal to formula (33) and observe that $F(x) = H(r-\varepsilon)/r$, so that the jump of this function at the spherical surface of discontinuity $\Sigma$: $r = \varepsilon$, is $1/\varepsilon$. Thus, formula (33) becomes

$$\frac{\bar\partial}{\partial x_j}\left[\frac{H(r-\varepsilon)}{r}\right] = -\frac{x_j}{r^3}\,H(r-\varepsilon) + \hat r_j\,\frac{1}{r}\,\delta(\Sigma), \tag{37}$$

where $n_j = \hat r_j = x_j/r$. To evaluate the limit of the second term on the right side of (37) as $\varepsilon \to 0$, we evaluate

$$\lim_{\varepsilon\to 0}\int_{\Sigma} \phi(x)\,\frac{1}{\varepsilon}\,\hat r\,dS = \lim_{\varepsilon\to 0}\frac{1}{\varepsilon}\int_{\Sigma(1)} \phi(x)\,\hat r\,\varepsilon^2\,d\omega = 0,$$

where $\Sigma(1)$ is the unit sphere and $\omega$ is the solid angle. Thus $(\bar\partial/\partial x_j)(1/r) = -x_j/r^3$, which is the same as the classical derivative. In order to compute the second-order distributional derivatives we apply formula (33) to the function $(\partial/\partial x_j)[H(r-\varepsilon)/r]$ and get

$$\frac{\bar\partial^2}{\partial x_i\,\partial x_j}\left[\frac{H(r-\varepsilon)}{r}\right] = \frac{\partial^2}{\partial x_i\,\partial x_j}\left(\frac{1}{r}\right)H(r-\varepsilon) + \frac{x_i}{r}\left(-\frac{x_j}{r^3}\right)\delta(\Sigma) = \frac{3x_ix_j - r^2\delta_{ij}}{r^5}\,H(r-\varepsilon) - \frac{x_ix_j}{r^4}\,\delta(\Sigma), \tag{38}$$

where $\delta_{ij}$ is the Kronecker delta, which is 1 when $i = j$ and 0 when $i \ne j$. Because $\lim_{\varepsilon\to 0}\int_{\Sigma}(x_ix_j/r^4)\,\phi(x)\,dS = (4\pi/3)\,\delta_{ij}\,\phi(0)$, relation (38) becomes

$$\frac{\bar\partial^2}{\partial x_i\,\partial x_j}\left(\frac{1}{r}\right) = \frac{3x_ix_j - r^2\delta_{ij}}{r^5} - \frac{4\pi}{3}\,\delta_{ij}\,\delta(x). \tag{39}$$

From this formula we can derive the impulse response for the Laplace operator. Indeed, if we set $i = j$ and sum on $j$, we get $\nabla^2(1/r) = -4\pi\delta(x)$. Thus, the impulse response of $-\nabla^2$ is $1/4\pi r$. We can continue this process and derive the $n$th-order distributional derivative of the function $1/r^k$.

With the help of the foregoing analysis we can obtain the results that correspond to (32)–(35) for singular surfaces which carry infinite singularities, such as charge sheets and vortex sheets. Suppose that the surface of discontinuity $\Sigma(x,t)$ carries a single layer of strength $f(x,t)\,\delta(\Sigma)$. Then

$$\frac{\bar\partial}{\partial t}[f\,\delta(\Sigma)] = \frac{\bar\delta f}{\delta t}\,\delta(\Sigma) - G f\,\delta'(\Sigma), \tag{40}$$

where $\bar\delta f/\delta t = \bar\partial f/\partial t + G\,df/dn$ is the distributional time derivative as apparent to an observer moving with the front $\Sigma$. Similarly, $\bar\delta f/\delta x_i = \bar\partial f/\partial x_i - n_i\,df/dn$ is the distributional surface derivative with respect to the Cartesian coordinates of the surrounding space. Thereby,

$$\frac{\bar\partial}{\partial x_i}[f\,\delta(\Sigma)] = \frac{\bar\delta f}{\delta x_i}\,\delta(\Sigma) + n_i f\,\delta'(\Sigma). \tag{41}$$

If we use relation (31) between $d_n\delta(\Sigma)$ and $\delta'(\Sigma)$, we can rewrite formulas (40) and (41) in a form which contains the mean curvature $\Omega$.
VII. CONVERGENCE OF DISTRIBUTIONS AND FOURIER SERIES

A sequence $\{t_m(x)\}$ of distributions with $t_m(x) \in D'$, $m = 1, 2, \ldots$, is said to converge to a distribution $t(x) \in D'$ if $\lim_{m\to\infty}\langle t_m, \phi\rangle = \langle t, \phi\rangle$ for all $\phi \in D$. This is called distributional (or weak) convergence. An important consequence is that if the sequence $\{t_m(x)\}$ converges to $t(x)$, then the sequence $\{D^k t_m\}$ converges to $D^k t$, because

$$\lim_{m\to\infty}\langle D^k t_m, \phi\rangle = \lim_{m\to\infty}(-1)^{|k|}\langle t_m, D^k\phi\rangle = (-1)^{|k|}\langle t, D^k\phi\rangle = \langle D^k t, \phi\rangle.$$

As an example, consider the sequence $\{\cos mx/m\}$, $x \in R$, which is a sequence of regular distributions and converges to zero pointwise. Then the sequence $\{-\sin mx\}$, which arises by differentiating $\{\cos mx/m\}$, also converges to zero in the distributional sense. This is remarkable because the sequence $\{\sin mx\}$ does not have a pointwise limit, as $m \to \infty$, in the classical sense.

A series of distributions $\sum_{p=1}^{\infty} s_p(x)$ converges distributionally to $t(x)$ if the sequence of the partial sums $\{t_m(x)\} = \{\sum_{p=1}^{m} s_p(x)\}$ converges to $t$ distributionally. Thus, we observe from the above analysis that term-by-term differentiation of a convergent series is always possible, provided the resultant series is interpreted in the sense of distributions.

It is known that the Fourier series $\sum_{m=-\infty}^{\infty} c_m e^{imx}$ converges uniformly if, for large $|m|$, $|c_m| \le M/|m|^k$, where $M$ is a constant and $k$ is an integer with $k \ge 2$. The series may diverge for other values of $k$. However, the above analysis assures us that the series converges distributionally whenever the coefficients grow no faster than a power of $m$: if $|c_m| \le M|m|^k$ for some integer $k \ge 0$, the series (apart from its $m = 0$ term) can be obtained from the uniformly convergent series $\sum_{m\ne 0}(im)^{-k-2}c_m e^{imx}$ by $(k+2)$ successive differentiations. For example, the series $\sum_{m=-\infty}^{\infty} e^{imx}$ has no meaning in the classical sense, but it can be written as $1 - (d^2/dx^2)\{\sum_{m\ne 0}(1/m^2)e^{imx}\}$. Accordingly, the series $\sum_{m=-\infty}^{\infty} e^{imx}$ is distributionally convergent.

VIII. DIRECT PRODUCT AND CONVOLUTION OF DISTRIBUTIONS

The direct product of distributions is defined by using the space of test functions in two variables $x$ and $y$. Let us denote the direct product of the distributions $s(x) \in D'(x)$ and $t(y) \in D'(y)$ as $s(x)\otimes t(y)$. Then

$$\langle s(x)\otimes t(y), \phi(x,y)\rangle = \langle s(x), \langle t(y), \phi(x,y)\rangle\rangle, \tag{42}$$

where $\phi(x,y)$ is a test function in $D(x,y)$. This makes sense because the function $\psi(x) = \langle t(y), \phi(x,y)\rangle$ is a test function in $D(x)$ and $D^k\psi(x) = \langle t(y), D_x^k\phi(x,y)\rangle$, where $D_x^k$ implies the $k$th-order derivative with respect to $x$. Thus, $s(x)\otimes t(y)$ is a functional in $D'(x,y)$. Indeed, it is a linear and continuous functional. Let us mention some of the properties of the direct product:

1. Linearity: $s\otimes(\alpha t + \beta u) = \alpha\,s\otimes t + \beta\,s\otimes u$.
2. Commutativity: $s\otimes t = t\otimes s$.
3. Continuity: if $s_m(x) \to s(x)$ in $D'(x)$ as $m\to\infty$, then $s_m(x)\otimes t(y) \to s(x)\otimes t(y)$ in $D'(x,y)$ as $m\to\infty$.
4. Associativity: when $s(x) \in D'(x)$, $t(y) \in D'(y)$, and $u(z) \in D'(z)$, then $s(x)\otimes[t(y)\otimes u(z)] = [s(x)\otimes t(y)]\otimes u(z)$.
5. Support: $\operatorname{supp}(s\otimes t) = \operatorname{supp} s \times \operatorname{supp} t$, where $\times$ stands for the Cartesian product.
6. Differentiation: $D_x^k[s(x)\otimes t(y)] = D^k[s(x)]\otimes t(y)$.
7. Translation: $(s\otimes t)(x+h, y) = s(x+h)\otimes t(y)$.

An interesting example of the direct product is $\delta(x)\otimes\delta(y) = \delta(x,y)$, where $\delta(x,y)$ is the two-dimensional delta function. This follows by observing that

$$\langle\delta(x)\otimes\delta(y), \phi(x,y)\rangle = \langle\delta(x), \langle\delta(y), \phi(x,y)\rangle\rangle = \langle\delta(x), \phi(x,0)\rangle = \phi(0,0) = \langle\delta(x,y), \phi(x,y)\rangle.$$

The convolution $s * t$ of two functions $s(x)$ and $t(x)$ is

$$(s*t)(x) = \int_{-\infty}^{\infty} s(y)\,t(x-y)\,dy = \int_{-\infty}^{\infty} t(y)\,s(x-y)\,dy = (t*s)(x).$$

When the functions $s(x)$ and $t(x)$ are locally integrable, then $s(x)*t(x)$ is also locally integrable and, as such, defines a regular distribution. Next, let us examine $\langle s*t, \phi\rangle$:

$$\langle s*t, \phi\rangle = \int_{-\infty}^{\infty}(s*t)(z)\,\phi(z)\,dz = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} s(z-y)\,t(y)\,dy\right]\phi(z)\,dz$$
$$= \int_{-\infty}^{\infty} t(y)\left[\int_{-\infty}^{\infty} s(z-y)\,\phi(z)\,dz\right]dy = \int_{-\infty}^{\infty} t(y)\left[\int_{-\infty}^{\infty} s(x)\,\phi(x+y)\,dx\right]dy$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} s(x)\,t(y)\,\phi(x+y)\,dx\,dy.$$

Thus,

$$\langle s*t, \phi\rangle = \langle s(x)\otimes t(y), \phi(x+y)\rangle. \tag{43}$$

There is, however, one difficulty with this definition. Even though the function $\phi(x+y)$ is infinitely differentiable, its support is not bounded in the $(x,y)$ plane because $x$ and $y$ may be unboundedly large while $x+y$ remains finite. This is remedied by ensuring that the intersection of the supports of $s(x)\otimes t(y)$ and $\phi(x+y)$ is a bounded set. This happens if (1) either $s$ or $t$ has a bounded support or (2) both $s$ and $t$ have support bounded on the same side.
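The weak convergence of $\{\sin mx\}$ to zero described in Section VII can be observed numerically. The sketch below is illustrative only: the shifted bump function used as the test function, its center 0.3, and the midpoint rule are assumptions made for the example, and the function names are hypothetical. It evaluates the pairing $\langle \sin mx, \phi\rangle$ for increasing $m$; the values should shrink toward zero even though $\sin mx$ itself has no pointwise limit.

```python
import math

def bump(x, center=0.3):
    """Smooth test function supported on |x - center| < 1 (standard bump)."""
    s = x - center
    if abs(s) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - s * s))

def pair_sin(m, n=20_000):
    """Midpoint-rule value of <sin(m x), bump> over the support [-0.7, 1.3]."""
    a, b = -0.7, 1.3
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        total += math.sin(m * x) * bump(x)
    return total * h

for m in (1, 2, 5, 10, 20, 40):
    print(m, pair_sin(m))  # magnitudes decay as m grows
```

The test function is deliberately not centered at the origin; for an even test function the pairing with the odd function $\sin mx$ would vanish identically and show nothing.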
As an example, we find that $(\delta * t)(x) = \int \delta(y)\,t(x-y)\,dy = t(x)$, that is, $\delta(x)$ is the unit element in the convolution algebra. The operation of the convolution has the following properties:

1. Linearity: $s*(\alpha t + \beta u) = \alpha\,s*t + \beta\,s*u$.
2. Commutativity: $s*t = t*s$.
3. Associativity: $(s*t)*u = s*(t*u)$.
4. Differentiation: $(D^k s)*t = D^k(s*t) = s*D^k t$. For instance, $[H(x)*t(x)]' = \delta(x)*t(x) = H(x)*t'(x)$.

IX. FOURIER TRANSFORM

Let us write formula (12) as

$$\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{iu(x-\xi)}\,du = \delta(x-\xi),$$

and multiply both sides of this formula by $f(x)$, integrate with respect to $x$ from $-\infty$ to $\infty$, use the sifting property, and interchange the order of integration. The result is the celebrated Fourier integral theorem,

$$f(\xi) = \frac{1}{2\pi}\int_{-\infty}^{\infty} du\int_{-\infty}^{\infty} f(x)\,e^{iu(x-\xi)}\,dx, \tag{44}$$

which splits into the pair

$$\hat f(u) = \int_{-\infty}^{\infty} f(x)\,e^{iux}\,dx, \tag{45}$$

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat f(u)\,e^{-iux}\,du, \tag{46}$$

where we have relabeled $\xi$ with $x$ in (46). The quantity $\hat f(u)$ is the Fourier transform of $f(x)$, while formula (46) gives the inverse Fourier transform $F^{-1}[\hat f(u)]$.

We shall now examine formula (45) in the context of test functions and distributions. If we attempt to define the Fourier transform of a distribution $t(x)$ as in (45), then we get $\hat t(u) = \langle t(x), e^{iux}\rangle$, but we are in trouble because $e^{iux}$ is not in $D$. We could try Parseval's theorem, $\int_{-\infty}^{\infty}\hat f(x)\,g(x)\,dx = \int_{-\infty}^{\infty} f(x)\,\hat g(x)\,dx$, which follows from (45) and (46) and connects the Fourier transforms of two functions. Then we have $\langle\hat t, \phi\rangle = \langle t, \hat\phi\rangle$, $\phi \in D$. We again run into trouble because $\hat\phi$ may not be in $D$ even if $\phi$ is in it. These difficulties are overcome by enlarging the class of test functions.

Test functions of rapid decay. The space $S$ of test functions of rapid decay contains the complex-valued functions $\phi(x)$ with the properties: (1) $\phi(x) \in C^\infty$; (2) $\phi(x)$ and its derivatives of all orders vanish at infinity faster than the reciprocal of any polynomial, i.e., $|x^p D^k\phi(x)| < C_{pk}$, where $p = (p_1,\ldots,p_n)$ and $k = (k_1,\ldots,k_n)$ are $n$-tuples and $C_{pk}$ is a constant depending on $p$, $k$, and $\phi(x)$. Another way of saying this is that after multiplication by any polynomial $P(x)$, the function $P(x)\phi(x)$ still tends to zero as $x \to \infty$. Clearly, $S \supset D$. Convergence in this space is defined as follows. The sequence of functions $\{\phi_m(x)\} \to \phi(x)$ if and only if $x^p(D^k\phi_m - D^k\phi) \to 0$ uniformly as $m \to \infty$ for all multi-indices $p$ and $k$.

Functions of slow growth. A function $f(x)$ in $R^n$ is of slow growth if $f(x)$, together with all its derivatives, grows at infinity more slowly than some polynomial, i.e., there exist constants $C$, $m$, and $A$ such that $|D^k f(x)| \le C|x|^m$ for $|x| \ge A$.

Tempered distributions. A linear continuous functional $t(x)$ over the space $S$ of test functions is called a distribution of slow growth or tempered distribution. To each $\phi \in S$ there is assigned a complex number $\langle t, \phi\rangle$ with the properties: (1) $\langle t, c_1\phi_1 + c_2\phi_2\rangle = c_1\langle t, \phi_1\rangle + c_2\langle t, \phi_2\rangle$; (2) $\lim_{m\to\infty}\langle t, \phi_m\rangle = 0$ for every null sequence $\{\phi_m(x)\} \in S$. This yields the dual space $S'$. It follows from the definitions of convergence in $D$ and $S$ that a sequence $\{\phi_m\}$ that converges in the sense of $D$ also converges in the sense of $S$, so that $S' \subset D'$. Fortunately, most of the distributions in $D'$ encountered previously are also in $S'$. However, some locally integrable functions in $D'$ are not in $S'$. The regular distributions in $S'$ are those generated by functions $f(x)$ of slow growth through the formula $\langle f, \phi\rangle = \int_{-\infty}^{\infty} f(x)\,\phi(x)\,dx$, $\phi \in S$. It is a linear continuous functional.

Fourier transform of the test functions. We shall first give the analysis in $R$ and then state the corresponding results in $R^n$. For $\phi(x) \in S$, we can use the definition (45), so that $\hat\phi(u) = \int_{-\infty}^{\infty}\phi(x)\,e^{iux}\,dx$. The inverse follows from (46). Since $\hat\phi(u)$ is also in $S$, as can be easily proved, the difficulty encountered earlier has disappeared. Moreover, it follows by the inversion formula (46) that

$$[\hat\phi]^\wedge(x) = 2\pi\,\phi(-x), \tag{47}$$

which shows that every function $\phi(x) \in S$ is a Fourier transform of some function in $S$. Indeed, the Fourier transform and its inverse are linear, continuous, and one-to-one mappings of $S$ onto itself.

In order to obtain the transform of tempered distributions, we need some specific formulas of the transform of the test functions $\phi(x)$. Let us list them and prove some of them. The transform of $d^k\phi/dx^k$ is equal to $\int_{-\infty}^{\infty}(d^k\phi/dx^k)\,e^{iux}\,dx = (-iu)^k\int_{-\infty}^{\infty}\phi(x)\,e^{iux}\,dx$, so that $[d^k\phi/dx^k]^\wedge(u) = (-iu)^k\hat\phi(u)$. Thus, if $P(\lambda)$ is an arbitrary polynomial with constant coefficients, we find that
[P(d/d
∞ x)φ]ˆ = P(−iu)φ̂(u). Similarly, ∞ fromiuxthe relation φ(u)] = d k t/d x k , φ̂(x) = (−1)k t(x), [(d k /d x k )φ(x)]ˆ
−∞ (i x) k
φ(x)e iux
d x = (d k
/du k
) −∞ φ(x)e d x, we = t(x), (−iu)k φ(u)]ˆ(x) = t̂(u), (−iu)k φ(u) = (−iu)k
derive the relation [x k
φ]ˆ(u) = P(−d/du) φ̂(u). Because t̂(u), φ(u), which shows that [d k t/d x k ]ˆ(u) =(−iu)k t̂(u),
∞
iua ∞
−∞ φ(x − a)e iux
d x = e −∞ φ(y)e iuy
dy, we have which agrees with (51). Instead of writing the formulas
[φ(x − a)]ˆ(u) = e φ̂(u). Similarly, [φ(x)]ˆ(u + a) =
iau (50)–(57) all over again for a distribution t(x), we shall
[eax (x)]ˆ(u) and [φ(ax]ˆ(u) = (1/|a|φ̂(u/a)). The merely refer to them with φ replaced by t. Let us now
corresponding n-dimensional formulas are listed below. give some important applications of these formulas.
∞
∞ 1. Delta function. [δ(x)]ˆ, φ(u) = δ(x), −∞ φ(u)eiux
∞
φ̂(u) = e iu·x
φ(x) d x, (48) du = −∞ φ(u)du = 1, φ(u). Thus δ̂(x) = 1 so that
−∞ 1̂ = 2π δ(x). In the n-dimensional case, the corresponding
∞ formulas are δ̂(x) = 1 and 1̂ = (2π )n δ(x). Incidentally, we
1 recover formula (12) from this relation.
φ(x) = eiu·x φ̂(u) du, (49)
(2π)n 2. Heaviside function. We use formula (51), so that
−∞
[t (x)]ˆ(u) = −iu t̂(u). Because t(x) = H (x), we have
[φ̂]ˆ(u) = (2π )n φ(−u), (50) [δ(x)]ˆ(u) = −iu[H (x)]ˆ(u) or u(H )(x)]ˆ(u) = i, whose
[D k φ]ˆ(u) = (−iu)k φ̂(u), (51) solution follows from (25) as [H (x)]ˆ(u) = cδ(u) +
i P f (1/u). Similarly, from (57) and the above result
∂ ∂ we have [H (−x)]ˆ(u) = cδ(u) − i P f (1/u). To find the
P ,..., φ(x) ˆ(u) = P(−iu 1 , ., −u n )φ̂(x), constant c we observe that H (x) + H (−x) = 1, whose
∂ x1 ∂ xn
Fourier transform is 2cδ(u) = 2π δ(u), so that c = π. Thus
(52) [H (±x)]ˆ(u) = π δ(u) ± i P f (1/u).
[x φ]ˆ(x) = (−i D) x φ̂(u),
k k
(53) 3. Signum function. Because sgn x = H (x) − H (−x),
we take Fourier transforms of both sides and use Example
∂ ∂ 2 above to get [sgn x]ˆ(u) = 2i P f (1/u).
[P(x1 , . . . , xn )φ]ˆ(u) = P − i , . . . , −i φ̂(u),
∂u 1 ∂u n 4. P f (1/x). We use formulas (47) for t(x) and
(54) the example above. The result is [(sgn x)ˆ(u)](x) =
[2i P f (−1/u)]ˆ(x), so that
[φ(x − a)]ˆ(u) = e ia·u
φ̂(u), (55)
1
[φ̂(x)](u + a) = [eia·x ]ˆφ(u), (56) Pf ˆ(u) = iπ sgn u. (58)
x
−1
[φ(Ax)]ˆ(u) = |det A| φ̂(A )u), T
(57)
5. The function |x|. We write it as |x| = x H (x) −
where u · x = u 1 x1 +· · ·+u n xn , A is a nonsingular matrix, x H (−x). Taking the Fourier transforms of both sides of
and AT is its transpose. We shall refer to the numbered this equation, we get
formulas above for n = 1 also. [|x|]ˆ(u) = [x H (x)]ˆ(u) − [x H (−x)]ˆ(u)
$$= -i\frac{d}{du}[H(x)]^\wedge(u) + i\frac{d}{du}[H(-x)]^\wedge(u) = -i\frac{d}{du}\left[\pi\delta(u) + i\operatorname{Pf}\frac{1}{u}\right] + i\frac{d}{du}\left[\pi\delta(u) - i\operatorname{Pf}\frac{1}{u}\right] = 2\frac{d}{du}\operatorname{Pf}\frac{1}{u} = -\frac{2}{u^2}.$$

Thus $[|x|]^\wedge(u) = -2/u^2$.

As a simple example we consider the function $\exp(-x^2/2)$, which is clearly a member of $S$. To find its transform we first observe that it satisfies the equation $\phi'(x) + x\phi(x) = 0$. Taking Fourier transforms of both sides of this equation and using (52) and (53), we find that $(d/du)[\hat\phi(u)\,e^{u^2/2}] = 0$. Thus, $\hat\phi(u) = Ce^{-u^2/2}$, where $C$ is a constant. To evaluate $C$, we observe that $\hat\phi(0) = \int_{-\infty}^{\infty}\exp(-x^2/2)\,dx = \sqrt{2\pi}$, so that $C = \sqrt{2\pi}$ and we have $\hat\phi(u) = \sqrt{2\pi}\,\phi(u)$. Thus we have found a function which is its own Fourier transform, up to the factor $\sqrt{2\pi}$. [The multiplicative factor $\sqrt{2\pi}$ disappears if we use the factors $1/\sqrt{2\pi}$ in the definition of the transform pairs (48) and (49).]
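The Gaussian example can be confirmed numerically. The following is a minimal sketch under stated assumptions: the transform uses the sign convention of (45), $\hat f(u) = \int f(x)e^{iux}\,dx$, the integral is truncated to $[-12, 12]$ and approximated by a midpoint rule, and the helper names are hypothetical. It compares the computed transform of $\exp(-x^2/2)$ with $\sqrt{2\pi}\exp(-u^2/2)$.

```python
import cmath
import math

def fourier(f, u, a=-12.0, b=12.0, n=48_000):
    """Midpoint-rule approximation of int f(x) exp(i u x) dx on [a, b]."""
    h = (b - a) / n
    total = 0j
    for k in range(n):
        x = a + (k + 0.5) * h
        total += f(x) * cmath.exp(1j * u * x)
    return total * h

def gauss(x):
    """The function exp(-x^2/2) considered in the text."""
    return math.exp(-x * x / 2.0)

for u in (0.0, 0.5, 1.0, 2.0):
    numeric = fourier(gauss, u)
    expected = math.sqrt(2.0 * math.pi) * gauss(u)
    print(u, numeric.real, expected)  # the two columns should agree
```

The truncation is harmless here because the integrand is already of order $e^{-72}$ at the ends of the interval.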
Fourier transform of tempered distributions. Having discovered that $\hat\phi \in S$ when $\phi$ is, we can apply the relation $\langle\hat t, \phi\rangle = \langle t, \hat\phi\rangle$ to define the Fourier transform of the tempered distributions $t(x)$. Then all the formulas given above for $\hat\phi$ carry over for $\hat t$. For instance, $\langle[(d^k/dx^k)\,t(x)]^\wedge(u),$

6. The function $1/x^2$. To find its transform we appeal to the previous example and relation (47), so that $[(|x|)^\wedge(u)]^\wedge(x) = (-2/u^2)^\wedge(x) = 2\pi|x|$ and we get $(1/x^2)^\wedge(u) = -\pi|u|$.
7. Polynomial $P(x) = a_0 + a_1x + \cdots + a_nx^n$. Formula (54) gives $[P(x)\,t(x)]^\wedge(u) = P(-i\,d/du)\,\hat t(u)$. When we
substitute $t(x) = 1$ in this formula and use Example 1, we get $[P(x)]^\wedge(u) = 2\pi P(-i\,d/du)\,\delta(u)$. Thus,

$$[a_0 + a_1x + \cdots + a_nx^n]^\wedge(u) = 2\pi\left[a_0\,\delta(u) - ia_1\,\delta'(u) + \cdots + (-i)^n a_n\,\delta^{(n)}(u)\right]. \tag{59}$$

In particular, $\hat x(u) = -2\pi i\,\delta'(u)$, $[x^2]^\wedge(u) = -2\pi\,\delta''(u)$, $\ldots$, $[x^n]^\wedge(u) = (-i)^n\,2\pi\,\delta^{(n)}(u)$.

8. $\operatorname{Pf}(1/|x|)$. With the help of the definition (24) of this function we have

$$\left\langle\left[\operatorname{Pf}\frac{1}{|x|}\right]^\wedge(u), \phi(u)\right\rangle = \left\langle\operatorname{Pf}\frac{1}{|x|}, \hat\phi(x)\right\rangle = \int_{-1}^{1}\frac{\hat\phi(x)-\hat\phi(0)}{|x|}\,dx + \int_{|x|>1}\frac{\hat\phi(x)}{|x|}\,dx,$$

which, after some algebraic manipulation, yields

$$\left[\operatorname{Pf}\frac{1}{|x|}\right]^\wedge(u) = -2(\gamma + \ln|u|), \tag{60}$$

where $\gamma$ is Euler's constant.

9. $\ln|x|$. For evaluation of the Fourier transform of this important function, we take the Fourier transform of both sides of (60), use formula (47), and obtain

$$2\pi\operatorname{Pf}\frac{1}{|x|} = -2\left[2\pi\gamma\,\delta(x) + [\ln|u|]^\wedge(x)\right],$$

and relabel. The result is

$$[\ln|x|]^\wedge(u) = -\left[\pi\operatorname{Pf}\frac{1}{|u|} + 2\pi\gamma\,\delta(u)\right]. \tag{61}$$

In the theories of wavelets, sampling, and interpolation, we need the Fourier transforms of the square and triangular functions. We derive them in the next two examples.

10. The square function. It is defined as $f(x) = H(x+\tfrac12) - H(x-\tfrac12)$. Thus

$$\hat f(u) = \int_{-1/2}^{1/2} e^{iux}\,dx = \frac{1}{iu}\,e^{iux}\Big|_{-1/2}^{1/2} = \frac{2}{u}\sin\frac{u}{2} = \operatorname{sinc}\frac{u}{2}, \tag{62}$$

where $\operatorname{sinc} t = \sin t/t$.

11. The triangular function. It is defined as

$$f(x) = \begin{cases} 1-|x|, & |x| < 1,\\ 0, & |x| > 1,\end{cases}$$

so that

$$\hat f(u) = \int_{-1}^{0}(1+x)\,e^{iux}\,dx + \int_{0}^{1}(1-x)\,e^{iux}\,dx.$$

By a routine computation it yields

$$\hat f(u) = \left[\frac{\sin(u/2)}{u/2}\right]^2 = \operatorname{sinc}^2\frac{u}{2}. \tag{63}$$

12. Fourier transform of an integral. We have found previously that taking the Fourier transform of the derivative of a distribution $t(x)$ has the effect of multiplying $\hat t(u)$ by $(-iu)$. In the case of taking the Fourier transform of the integral of $t(x)$, it amounts to dividing $\hat t(u)$ by $(-iu)$, so that

$$\left[\int_a^x t(s)\,ds\right]^\wedge(u) = \frac{i\,\hat t(u)}{u}. \tag{64}$$

Similarly,

$$\left[\int_{-\infty}^x t(s)\,ds\right]^\wedge(u) = \frac{i\,\hat t(u)}{u} + \pi\,\hat t(0)\,\delta(u). \tag{65}$$

13. Fourier transform of the convolution. The Fourier transform of the convolution $f*g$ of two locally integrable functions $f$ and $g$ is

$$[f*g]^\wedge(u) = \int_{-\infty}^{\infty}(f*g)(x)\,e^{iux}\,dx = \int_{-\infty}^{\infty} e^{iux}\,dx\int_{-\infty}^{\infty} f(x-y)\,g(y)\,dy = \int_{-\infty}^{\infty} g(y)\,e^{iuy}\,dy\int_{-\infty}^{\infty} f(x-y)\,e^{iu(x-y)}\,d(x-y) = \hat g(u)\hat f(u) = \hat f(u)\hat g(u).$$

Thus, the Fourier transform of the convolution of two regular distributions is the product of their transforms. This relation also holds for singular distributions with slight restrictions. For instance, if at least one of these distributions has compact support, then the relation holds.

X. POISSON SUMMATION FORMULA

To derive the Poisson summation formula we first find the Fourier series of the delta function in the period $[0, 2\pi]$: $\delta(x) = \sum_{m=0}^{\infty}(a_m\cos mx + b_m\sin mx)$, where the coefficients $a_m$, $b_m$ are given by $a_0 = (1/2\pi)\int_0^{2\pi}\delta(x)\,dx = 1/2\pi$, $a_m = (1/\pi)\int_0^{2\pi}\delta(x)\cos mx\,dx = 1/\pi$, and $b_m = (1/\pi)\int_0^{2\pi}\delta(x)\sin mx\,dx = 0$. Thus, we have

$$\delta(x) = \frac{1}{2\pi}\left[1 + 2\sum_{m=1}^{\infty}\cos mx\right] = \frac{1}{2\pi}\sum_{m=-\infty}^{\infty} e^{imx}. \tag{66}$$

Now we periodize $\delta(x)$ by putting the row of deltas at the points $2\pi m$, so that relation (66) can be written as

$$\sum_{m=-\infty}^{\infty}\delta(x - 2\pi m) = \frac{1}{2\pi}\left[1 + 2\sum_{m=1}^{\infty}\cos mx\right] = \frac{1}{2\pi}\sum_{m=-\infty}^{\infty} e^{imx}.$$
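The transform of the triangular function, formula (63), lends itself to a quick numerical check. The sketch below is only illustrative: it assumes the convention $\hat f(u) = \int f(x)e^{iux}\,dx$ of (45) and a midpoint rule on $[-1, 1]$, and the function names are hypothetical.

```python
import cmath
import math

def triangle(x):
    """Triangular function: 1 - |x| for |x| < 1, zero otherwise."""
    return 1.0 - abs(x) if abs(x) < 1.0 else 0.0

def sinc(t):
    """sinc t = sin t / t, with the limiting value 1 at t = 0."""
    return 1.0 if t == 0.0 else math.sin(t) / t

def fourier_triangle(u, n=20_000):
    """Midpoint-rule value of the integral of (1 - |x|) exp(i u x) over [-1, 1]."""
    h = 2.0 / n
    total = 0j
    for k in range(n):
        x = -1.0 + (k + 0.5) * h
        total += triangle(x) * cmath.exp(1j * u * x)
    return total * h

for u in (0.0, 1.3, 4.0):
    print(u, fourier_triangle(u).real, sinc(u / 2.0) ** 2)
```

Because the triangular function is even, the imaginary part of the computed transform vanishes up to rounding, and the real part should match $\operatorname{sinc}^2(u/2)$.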
When we set x = 2π y in this relation and use the formula arises from the points where the function h(x) has a min-
δ[2π(y − m)] = (1/2π)δ(y − m) and relabel, we obtain imum. If x0 is the only global minimum of h(x), then we

∞
∞
∞ have the Laplace formula
δ(x − m) = 1 + 2 cos 2πmx = ei2π mx . 1/2
2π
m =−∞ m =1 m =−∞ I (λ) ∼ g(x0 )e−λh(x0 ) . (73)
(67) λh (x0 )
The action of this formula on a function φ(x) ∈ S yields If we could somehow prove that e−λh(x) has an asymptotic

∞ series of delta functions such that
φ(m) = φ̂(2π m), (68) 1/2
−λh(x) 2π
m =−∞ e ∼ e−λh(x0 ) δ(x − x0 ), (74)
λh (x0 )
which relates the sum of functions φ and their Fourier
transforms φ̂ and is a very useful formula. then all that we have to do is to substitute (74) in (72) and
In the classical theory, it is necessary that both sides use the sifting property of the delta function and formula
of relation (68) converge. Moreover, they must converge (73) follows immediately. To achieve the expansion (74)
in the same interval. With the help of the theory of we first define the moments of a function f (x). They are
distributions we can obtain many variants of relation ∞
(68) which are applicable even when one or both of
µn = f (x), x =n
f (x)x n d x. (75)
the series (68) are divergent. As an example, we set
x = x/λ, where λ is a real number, and use the relation −∞

δ[(x/λ) − m] = |λ|δ(x − mλ). Then relation (67) becomes The Taylor expansion of a test function φ(x) at x = 0 is

∞
∞

∞
xn
(λ) δ(x − mλ) = e2iπ xm/λ . (69) φ(x) = φ (n) (0) . (76)
m =−∞ m =−∞ n=0
n!
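Then it follows from (75) that

$$\langle f(x), \phi(x)\rangle = \left\langle f(x), \sum_{n=0}^{\infty}\phi^{(n)}(0)\,\frac{x^n}{n!}\right\rangle = \sum_{n=0}^{\infty}\frac{\mu_n}{n!}\,\phi^{(n)}(0). \tag{77}$$

But $\phi^{(n)}(0) = (-1)^n\langle\delta^{(n)}(x), \phi(x)\rangle$, so that (77) becomes

$$\langle f(x), \phi(x)\rangle = \left\langle\sum_{n=0}^{\infty}\frac{(-1)^n\mu_n\,\delta^{(n)}(x)}{n!}, \phi(x)\right\rangle. \tag{78}$$

Thus

$$f(x) = \sum_{n=0}^{\infty}\frac{(-1)^n\mu_n\,\delta^{(n)}(x)}{n!}. \tag{79}$$

When we multiply both sides of this relation by a test function $\phi(x)$ and integrate with respect to $x$, we obtain a variant of (68) as

$$\sum_{m=-\infty}^{\infty}\phi(m\lambda) = \frac{1}{|\lambda|}\sum_{m=-\infty}^{\infty}\hat\phi\left(\frac{2\pi m}{\lambda}\right). \tag{70}$$

This is called the distributional Poisson summation formula. Among other things, the Poisson summation formula (70) transforms a slowly converging series to a rapidly converging series. For instance, if we take $\phi(x) = e^{-x^2}$, then $\hat\phi(u) = \sqrt{\pi}\,e^{-u^2/4}$, so that (70) becomes

$$\sum_{m=-\infty}^{\infty} e^{-m^2\lambda^2} = \frac{\sqrt{\pi}}{\lambda}\sum_{m=-\infty}^{\infty} e^{-m^2\pi^2/\lambda^2}. \tag{71}$$

The series on the left side of (71) converges rapidly for large $\lambda$, that on the right side for small $\lambda$.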
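The Gaussian instance of the Poisson summation formula can be checked directly. The sketch below is an illustration under stated assumptions: it takes $\phi(x) = e^{-x^2}$ with $\hat\phi(u) = \sqrt{\pi}\,e^{-u^2/4}$ in the convention of (45), so that (70) reads $\sum_m e^{-m^2\lambda^2} = (\sqrt{\pi}/\lambda)\sum_m e^{-m^2\pi^2/\lambda^2}$ for $\lambda > 0$; the truncation at 200 terms and the function names are arbitrary choices for the example.

```python
import math

def left_sum(lam, terms=200):
    """Truncated sum of exp(-(m*lam)^2) over m = -terms..terms."""
    return sum(math.exp(-(m * lam) ** 2) for m in range(-terms, terms + 1))

def right_sum(lam, terms=200):
    """Truncated (sqrt(pi)/lam) * sum of exp(-(m*pi/lam)^2) over m = -terms..terms."""
    s = sum(math.exp(-(m * math.pi / lam) ** 2) for m in range(-terms, terms + 1))
    return math.sqrt(math.pi) / lam * s

for lam in (0.5, 1.0, 2.0):
    print(lam, left_sum(lam), right_sum(lam))  # the two sums should coincide
```

For small $\lambda$ the left sum needs many terms while the right sum is dominated by its $m = 0$ term, and the roles are reversed for large $\lambda$, which is the acceleration effect described in the text.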

Finally, we use the formula $\delta^{(n)}(\lambda x) = (1/\lambda^{n+1})\,\delta^{(n)}(x)$ in (79) and obtain the complete asymptotic expansion

$$f(\lambda x) \sim \sum_{n=0}^{\infty}\frac{(-1)^n\mu_n\,\delta^{(n)}(x)}{n!\,\lambda^{n+1}}, \qquad \lambda\to\infty. \tag{80}$$

XI. ASYMPTOTIC EVALUATION OF INTEGRALS: A DISTRIBUTIONAL APPROACH
In the case of the Laplace integral (72) we have
Let g(x) and h(x) be sufficiently smooth functions on the
f (x) = e−λh(x) . At the minimum point x0 , h(x)∼h(x0 ) +
interval [a, b], then the main contribution to the Laplace
[h (x0 )(x − x0 )2 ]/2. Accordingly, we can find an increas-
integral,
ing smooth function ψ(x) with ψ(x0 ) = 0, ψ (x0 ) > 0,
b so that h(x) = h(x0 ) + [ψ(x)]2 in the support of h(x).
I (λ) = e−λh(x) g(x) d x, λ→∞ (72) Then h (x0 ) = 2ψ(x0 )ψ (x0 ) = 0, which yields h (x0 ) = 0
a as required for h(x) to have minimum at x0 . Also,
$h''(x) = 2[\psi'(x)]^2 + 2\psi(x)\psi''(x)$, which, for $x = x_0$, becomes $h''(x_0) = 2[\psi'(x_0)]^2$ or

$$\psi'(x_0) = \left[\tfrac{1}{2}h''(x_0)\right]^{1/2}. \tag{81}$$

Substituting this information about $h(x)$ in the Laplace integral (72), we obtain

$$I(\lambda) = e^{-\lambda h(x_0)}\int_{-\infty}^{\infty} e^{-\lambda[\psi(x)]^2}\,g(x)\,dx. \tag{82}$$

The next step is to set $u = \psi(x)$, which gives $dx = du/\psi'(x)$, so that (82) becomes

$$I(\lambda) = e^{-\lambda h(x_0)}\int_{-\infty}^{\infty} e^{-\lambda u^2}\,\frac{g(u)}{\psi'(x_0)}\,du, \tag{83}$$

so we need the asymptotic expansion of $e^{-\lambda u^2}$ from formula (80). This is obtained by finding the moments of $e^{-u^2}$, and they are

$$\mu_n = \int_{-\infty}^{\infty} e^{-u^2}u^n\,du = \begin{cases}\Gamma\left(\dfrac{n+1}{2}\right), & n \text{ even},\\[1ex] 0, & n \text{ odd}.\end{cases} \tag{84}$$

Then formula (80) yields

$$e^{-\lambda u^2} \sim \sum_{n=0}^{\infty}\frac{\Gamma[(2n+1)/2]\,\delta^{(2n)}(u)}{(2n)!\,\lambda^{(2n+1)/2}}, \tag{85}$$

so that formula (83) becomes

$$I(\lambda) \sim e^{-\lambda h(x_0)}\int_{-\infty}^{\infty}\sum_{n=0}^{\infty}\frac{\Gamma[(2n+1)/2]\,\delta^{(2n)}(u)}{(2n)!\,\lambda^{(2n+1)/2}}\,\frac{g(u)}{\psi'(x_0)}\,du. \tag{86}$$

The interesting feature of the distributional approach is that we have obtained the complete asymptotic expansion. Thereafter, we evaluate as many terms as are needed to get the best approximation.

The first term of formula (86) is

$$I(\lambda) \sim e^{-\lambda h(x_0)}\,\frac{\Gamma(1/2)}{\sqrt{\lambda}}\int_{-\infty}^{\infty}\delta(u)\,\frac{g(u)}{\psi'(x_0)}\,du.$$

When we substitute the value of $\psi'(x_0)$ from relation (81) in the above relation, use $\Gamma(1/2) = \sqrt{\pi}$, and use the sifting property of $\delta(u)$, we obtain

$$I(\lambda) \sim \left[\frac{2\pi}{\lambda h''(x_0)}\right]^{1/2} g(x_0)\,e^{-\lambda h(x_0)},$$

which agrees with (73).

The oscillatory integral

$$I(\lambda) = \int_{-\infty}^{\infty} e^{i\lambda h(x)}\,g(x)\,dx, \qquad \lambda\to\infty \tag{87}$$

can also be processed in the same manner as the Laplace integral. Indeed, the steps leading from (87) to the relation

$$I(\lambda) = e^{i\lambda h(x_0)}\int_{-\infty}^{\infty} e^{i\lambda u^2}\,\frac{g(u)}{\psi'(x_0)}\,du \tag{88}$$

are almost the same as those from (72) to (83). The difference arises in the value of the moments $\mu_n$, which now are

$$\mu_n = \int_{-\infty}^{\infty} e^{iu^2}u^n\,du = \begin{cases}\Gamma\left(\dfrac{n+1}{2}\right)e^{\pi i(n+1)/4}, & n \text{ even},\\[1ex] 0, & n \text{ odd}.\end{cases} \tag{89}$$

This yields the moment expansion for $e^{i\lambda u^2}$ as

$$e^{i\lambda u^2} \sim \sum_{n=0}^{\infty}\frac{\Gamma[(2n+1)/2]\,e^{\pi i(2n+1)/4}\,\delta^{(2n)}(u)}{(2n)!\,\lambda^{(2n+1)/2}}. \tag{90}$$

When we substitute this value for $e^{i\lambda u^2}$ in integral (88) we obtain

$$I(\lambda) \sim e^{i\lambda h(x_0)}\int_{-\infty}^{\infty}\sum_{n=0}^{\infty}\frac{\Gamma[(2n+1)/2]\,e^{\pi i(2n+1)/4}\,\delta^{(2n)}(u)}{(2n)!\,\lambda^{(2n+1)/2}}\,\frac{g(u)}{\psi'(x_0)}\,du. \tag{91}$$

The first term of this formula yields

$$I(\lambda) \sim e^{i(\lambda h(x_0)+\pi/4)}\left[\frac{2\pi}{h''(x_0)}\right]^{1/2}\frac{g(x_0)}{\sqrt{\lambda}}, \tag{92}$$

where we have used relation (81).

Let us observe an essential difference between the Laplace integral and the oscillatory integral. In the case of the Laplace integral we collect the contributions only from the points where $h(x)$ has a minimum, such that $h'(x) = 0$ and $h''(x) > 0$. For the oscillatory integral we collect the contributions from all the points where $h(x)$ is stationary, such that $h'(x) = 0$, $h''(x) \ne 0$. For this reason the classical method for the oscillatory integral is called the method of stationary phase.

SEE ALSO THE FOLLOWING ARTICLES

FOURIER SERIES • FUNCTIONAL ANALYSIS
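The leading-order Laplace formula (73) can be compared with a direct numerical evaluation. The sketch below is only illustrative and rests on assumptions chosen for the example: $h(x) = \cosh x - 1$ (so $x_0 = 0$, $h(x_0) = 0$, $h''(x_0) = 1$), $g(x) = 1$, the interval $[-3, 3]$, and a midpoint rule; the function names are hypothetical. The ratio of the integral to the leading term should approach 1 as $\lambda$ grows.

```python
import math

def laplace_integral(lam, a=-3.0, b=3.0, n=60_000):
    """Midpoint-rule value of the integral of exp(-lam*(cosh x - 1)) over [a, b]."""
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        total += math.exp(-lam * (math.cosh(x) - 1.0))
    return total * h

def leading_term(lam):
    """g(x0) exp(-lam h(x0)) sqrt(2 pi / (lam h''(x0))) with x0 = 0, h(0) = 0, h''(0) = 1, g = 1."""
    return math.sqrt(2.0 * math.pi / lam)

def ratio(lam):
    return laplace_integral(lam) / leading_term(lam)

for lam in (10.0, 50.0, 200.0):
    print(lam, ratio(lam))  # tends to 1 as lam increases
```

The remaining discrepancy at finite $\lambda$ reflects the higher-order terms of the expansion, which the first term alone does not capture.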
P1: GPA/GHE P2: FYK/LSK QC: FYK Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology en007b-296 June 30, 2001 16:40

Graph Theory
Ralph Faudree
University of Memphis

I. Introduction
II. Connectedness
III. Trees
IV. Eulerian and Hamiltonian Graphs
V. Colorings of Graphs
VI. Planar Graphs
VII. Factorization Theory
VIII. Graph Reconstruction
IX. Extremal Theory
X. Directed graphs
XI. Networks
XII. Random Graphs

GLOSSARY

Bipartite graph A graph whose vertices can be partitioned into two sets of independent vertices.
Connectivity (edge) The minimum number of vertices (edges) which when deleted leaves a disconnected graph.
Coloring (edge) An assignment of colors to the vertices (edges) of a graph so that adjacent vertices (edges) have different colors.
Component A maximal connected subgraph of a graph.
Digraph A pair consisting of a finite nonempty set called the vertex set and a collection of ordered pairs of the vertices called the arc set.
Factorization A collection of subgraphs of a graph which partition the edges of the graph.
Good algorithm An algorithm with a polynomial upper bound on the time required to terminate in the worst case.
Graph A pair consisting of a finite nonempty set called the vertex set and a collection of unordered pairs of the vertices called the edge set.
Independent set A set of vertices (edges) such that no pair is adjacent.
Isomorphism A one-to-one map from the vertices of one graph onto the vertices of another graph which preserves edges.
Matching A collection of independent edges in a graph.


Network A directed graph in which each arc is given a label.
Planar graph A graph which can be embedded in the plane or on a sphere.
Reconstructible graph (edge) A graph which is determined by its subgraphs obtained by deleting a single vertex (edge) of the graph.
Tournament A directed graph obtained by orienting the edges of a complete graph.
Tree A connected graph which has no cycles.

A GRAPH is a finite collection of elements, which are called vertices, and a finite collection of lines or curves, which are called edges, that join certain pairs of these vertices. This is an abstract mathematical structure which is of interest on its own, but it is also of interest as a structure which can be used to model a wide variety of real-life situations and problems. Street maps, communications systems, electrical networks, organizational structures, and chemical molecules all can be viewed as graphs. Some historical roots of the subject, important results about the structure and theory of graphs, outstanding open problems in graph theory, and graphs as mathematical models will be presented in this article. Procedures (called algorithms) for solving graphical problems will be described, and the application of the structure of graphs and graphical algorithms to problems modeled by graphs will also be discussed.

I. INTRODUCTION

A graph G is generally thought of as a collection of points together with curves or lines which join certain pairs of these points. Thus, this simple structure is a natural model for many real-life situations. For example, a road map uses a graph as a model. The cities are represented by points and the roads between cities are represented by curves or lines. Organizational charts can be considered as graphs, with the points representing the individuals in the organization and lines indicating that one person is an immediate supervisor of another. In this case it may be more appropriate to think of the lines as being directed from the supervisor to the person being supervised. Communication systems, electrical networks, family trees, and molecules in chemistry can all be represented as graphs. Our initial objectives will be to formalize this idea of graph, introduce some standard notation and terminology, develop the basic concepts, and illustrate these with examples.

A graph G is a nonempty finite set V(G) together with a finite set E(G) of distinct unordered pairs of elements of V(G). Each element in V(G) is called a vertex and each element in E(G) is called an edge. When it is clear just what graph is being considered, the vertex set V(G) and the edge set E(G) will be written as just V and E, respectively. The order of G, usually denoted by |V|, is the number of elements in V, while the size of G is the number of elements in E, and is denoted by |E|.

For vertices u and v of V the edge e = {u, v} will be expressed more compactly as just uv. We say that u is adjacent to v and e is incident to u and v. The edge e is said to join the vertices u and v. Also, two edges incident to a common vertex are said to be adjacent to each other. The neighborhood $N_G(v)$ [or just N(v)] of a vertex v is the set of vertices of G which are adjacent to v, and the number of elements in N(v) is called the degree of v and is denoted by $d_G(v)$ [or just d(v)]. Of course, the degree $d_G(v)$ is also the number of edges of G incident to v.

A graph has a useful geometric representation in the plane, with points representing vertices and curves or lines representing edges. The graph G pictured in Fig. 1 can be used to illustrate the notation that has been presented. The graph G has order 5 and size 4 with V = {u, v, w, x, y} and E = {ux, uy, vy, xy}. The degrees of the vertices are d(u) = d(x) = 2, d(v) = 1, d(y) = 3, and d(w) = 0. We say that w is an isolated vertex.

FIGURE 1 Graph G.

With this minimal background we can now make an elementary but useful observation. If we sum the degrees of vertices of a graph, then each edge of the graph will be counted exactly twice, once at each end. This gives the following result.

Theorem 1.1: If G is a graph of order p and size q with $V = \{v_1, v_2, \ldots, v_p\}$, then $\sum_{i=1}^{p} d(v_i) = 2q$.

The concept of substructure is of basic importance in graph theory. A graph H is a subgraph of a graph G if V(H) ⊆ V(G) and E(H) ⊆ E(G). Also, the graph G is said to be a supergraph of H. If the edges of H are precisely the edges of G which join pairs of vertices of H, then H is an induced subgraph of G. If H is a subgraph with V(H) = V(G), then H is called a spanning subgraph of G. There are some special subgraphs and supergraphs of a graph G that occur so frequently that special notation is given to them. If e ∈ E, then G − e is the subgraph of G with the same vertices and edges as G except the edge e is deleted. Similarly, if v ∈ V, then G − v is the subgraph of G with the vertex v deleted from the vertex set and all the edges incident to v deleted from the edge set. There are

obvious generalizations to sets of vertices or edges. Also, if e is an edge not in E, then G + e is the supergraph of G with the same vertex set as G but with e added to the edge set. The concepts of subgraph and supergraph are illustrated in Fig. 2, with G being the graph pictured in Fig. 1.

FIGURE 2

Two graphs G and H are identical (written as G = H) if V(G) = V(H) and E(G) = E(H). However, it is possible for graphs to have the same appearance if they are appropriately labeled but yet not be identical. Two graphs G and H are said to be isomorphic (written as $G \cong H$) if there is a one-to-one mapping (called an isomorphism) θ of V(G) onto V(H) such that edge uv is in E(G) if and only if the edge θ(u)θ(v) is in E(H). The graph G pictured in Fig. 3 is isomorphic to the graph H of Fig. 3; in fact, the map θ defined by θ(u) = a, θ(v) = d, θ(w) = f, θ(x) = b, and θ(y) = c is an isomorphism from G onto H.

FIGURE 3 Isomorphic graph H.

Some classes of graphs occur so frequently that special names are given to them. A graph of order n in which every pair of its vertices are adjacent is called a complete graph and is denoted by $K_n$. Each vertex in $K_n$ has degree n − 1 and the size of the graph is $\binom{n}{2} = n(n-1)/2$. The complement $\bar{G}$ of a graph G is the graph with vertex set V(G) such that two vertices are adjacent in $\bar{G}$ if and only if they are not adjacent in G. Thus $\bar{K}_n$ is a graph with no edges. A bipartite graph is one in which the vertices can be partitioned into two sets, say A and B, such that every edge of the graph joins a vertex in A and a vertex in B. The sets A and B are called the parts of the bipartite graph. If every vertex in A is joined to every vertex in B, then the bipartite graph is called a complete bipartite graph, and is denoted by $K_{m,n}$ if |A| = m and |B| = n. Thus, $K_{m,n}$ has order m + n and size mn. The special complete bipartite graph $K_{1,n}$ is called a star of size n (and order n + 1). A collection of vertices of a graph are said to be independent if there are no edges joining them. Thus, the vertices of the complement of a complete graph and the vertices in the parts of a bipartite graph are independent.

If all of the vertices of a graph have the same degree r, then the graph is said to be r-regular or just regular. The complete graph $K_{r+1}$ and the complete bipartite graph $K_{r,r}$ are examples of r-regular graphs. There are many other special graphs, but we conclude with the graph $C_n$, a cycle of length n. This is a connected graph of order n which is 2-regular. The graph $C_5$ is pictured in Fig. 4 along with other examples of complete graphs, bipartite graphs, and their complements.

FIGURE 4 Special graphs and complements.

Up to this point only one form of a “graph” has been discussed, and that is what is sometimes called a simple graph. No edges of the form vv, which are called loops, were allowed. Also, any pair of vertices in a graph were joined by at most one edge, not multiple edges. If loops and multiple edges are allowed, the structure is called a multigraph. This definition of a multigraph is not completely standard because loops are sometimes not allowed. Figure 5a shows an example of a multigraph which has multiple edges joining vertices u and x and loops at the vertex v. All edges have to this point been assumed to have no direction, so that the edge uv is the same as the edge vu. Sometimes it is appropriate for each edge to have a direction, and such directed edges will be called arcs. Therefore each arc is an ordered pair of distinct vertices, and the arc uv is not the same as the arc vu. Such structures are called directed graphs or digraphs. An example of a digraph is given in Fig. 5b. All of these structures and some additional variations will be considered later.

There are many ways to describe or determine a particular graph. So far, we have generally described graphs by using the definition and listing the vertices and edges of the graph. In a few cases, a picture or drawing of the


graph was used to determine the graph. The adjacency matrix A(G) of a graph G is a common and useful structure to describe a graph. If G has order n, then A(G) is an n by n matrix with n rows and columns and 0 or 1 entries in which the ith vertex corresponds to the ith row and column. The (i, j) entry is 1 if and only if the ith vertex is adjacent to the jth vertex. The matrix A(G) depends on an ordering of the vertices, and a different ordering of the vertices will cause a permutation of the rows and columns of the adjacency matrix. Also, A(G) is a symmetric matrix with only 0’s on the diagonal, and the number of 1’s in the ith row (column) is the degree of the ith vertex. Similar adjacency matrices can be defined for digraphs and multigraphs.

FIGURE 5 (a) Multigraph. (b) Digraph.

The study of procedures or algorithms for solving graphical problems is central in the theory of graphs. After the existence of an algorithm has been verified, the question of efficiency must be answered. An algorithm is said to be good if it is a polynomial-time algorithm. This means that if n is some parameter which describes the magnitude of the graph (such as order or size), then there exist positive constants c and k such that in the worst case the time (the number of operations such as additions or comparisons) required for the algorithm to terminate is bounded above by $cn^k$. An $n^k$ algorithm is one in which the time required is at most $cn^k$ for some positive constant c. If k = 1, then the algorithm is said to be linear.

The reason for searching for good algorithms as opposed to those with no polynomial upper bound is that those with exponential growth can be extremely inefficient. The following is a dramatic example of this. If n = 20, which is a relatively small number, then $n^3 = 8000$. On the other hand n! is a number with 19 digits. If a computer could do 1,000,000,000 calculations a second, it would take less than 1/(100,000)th of a second to do $n^3$ calculations, but it would take over 75 years to do n! calculations. However, some care must be taken in interpreting the relative efficiencies of algorithms for small values of n. Although an $n^3$ algorithm will be quicker than an $n^5$ or an n! algorithm for large values of n, for small values of n the constant associated with each of the algorithms may be the dominant factor. Either an $n^5$ or n! algorithm could be faster than an $n^3$ algorithm for small values of n.

Consider two graphs G and H of the same order (say, n) and size, which are each described by a vertex and edge set. There is a good algorithm to determine if the graphs are identical. First determine if V(G) = V(H). If this is true, order the vertex sets in the same way and form the adjacency matrices A(G) and A(H). If A(G) = A(H), then G and H are identical. Each of these steps takes at most $n^2$ comparisons, so we have a good algorithm; in fact, this is an $n^2$ algorithm for determining if two graphs are identical. Next consider the problem of determining if G and H are isomorphic. A reasonable approach would be to order the sets V(G) and V(H) (they need not be identical) and construct the adjacency matrices A(G) and A(H). However, $G \cong H$ if and only if there is some ordering of the sets V(G) and V(H) such that A(G) = A(H). Unfortunately, there are n! possible permutations of an ordered set with n elements. Since n! is not polynomial in n, this approach will not give a good algorithm. It is not known if a good algorithm for isomorphism exists. In fact, this problem, the isomorphism problem, is one of the outstanding problems in graph theory.

In the remaining sections, selected but representative topics from the theory of graphs will be discussed. The topics are by no means exhaustive; for example, graphical enumeration, algebraic graph theory, and matroids are not discussed, and all are important and interesting topics.

II. CONNECTEDNESS

Connectedness is a fundamental concept in graph theory, and most readers will have an intuitive feel for it. However, we first need to introduce some elementary related ideas to present the concept carefully. In a graph G, a walk from a vertex u to a distinct vertex v is a finite sequence of alternating vertices and edges $W = v_0 e_1 v_1 \cdots v_{k-1} e_k v_k$ with $v_0 = u$, $v_k = v$ and with each edge of the sequence incident in G to each of the vertices on either side of it in the sequence. The length of the walk W is k, the number of edges in W. If all of the edges of the sequence are distinct, then W is called a trail. If, in addition, all of the vertices of W are distinct, it is called a path. If the initial vertex and the final vertex are allowed to be the same (u = v), the walk, trail or path is called closed. A closed path of length at least three determines a graph which is called a cycle. A path (or cycle) with n vertices is usually denoted by $P_n$ (or $C_n$).

Two vertices u and v in a graph G are connected if there is a path from u to v. A graph G is connected if each pair of its vertices are connected. Even when G is not connected, all of the vertices of G connected to a fixed vertex form a connected subgraph of G. Thus, a disconnected graph G can be partitioned into connected subgraphs, which are

called the components of G. The components are the maximal connected subgraphs of a graph. For example, the graph G of Fig. 1 has two components, one with the four vertices {u, v, x, y} and the other with the single vertex w.

Associated with any pair of vertices u and v of a graph G is the distance $d_G(u, v)$ [or just d(u, v) when the graph G is obvious], which is the length of a shortest path from u to v. A natural and interesting problem is the determination of a good procedure or algorithm for finding d(u, v). A brute-force approach of checking all possible paths from u to v would not be very satisfactory, since the number of such paths could be quite large, in fact of order of magnitude (n − 2)! for a graph G of order n.

A good “shortest-path algorithm” was discovered by Dijkstra and determines the distance from a fixed vertex of a graph G of order n to each of the remaining vertices, and it works in order of magnitude $n^2$. Also, this algorithm will handle the more general structure of a weighted graph G in which each edge e of G is assigned a real number w(e), called its weight. For a weighted graph, the length of a path is the sum of the weights of the edges of the path. It should be noted that any graph G can be considered as a weighted graph by assigning weight 1 to each edge of G. The algorithm of Dijkstra is of greedy design and uses a breadth-first search of the graph. It starts by finding the vertex in G which is “closest” to a fixed vertex v. At each step the “next closest” vertex is found using the information about the distance from v to vertices whose distances from v have already been determined. Each of the n steps involves at most n comparisons and n additions of real numbers, and thus $n^2$ is an upper bound on the order of magnitude of the algorithm.

The “shortest-path algorithm” can be used to determine the components of a graph. All vertices in the same component of a fixed vertex v of a graph would have finite distance from v, while the remaining vertices would have infinite distance. Although there are more efficient methods for finding the components of a graph, the shortest path algorithm is a good one for this purpose.

Connectedness in a graph which is being used as a model for a transportation or communication system is clearly desirable. However, connectedness alone might not be sufficient. If the connected graph G which represented a communication network had a vertex v such that G − v was disconnected (v is called a cut-vertex of G), then it would be critical that no failure occur at v, for this could result in a failure of the entire system. Thus, we need some measure of the extent of the connectedness of a graph. With that objective in mind we introduce some additional concepts.

If S is a collection of vertices of a graph G, then G − S is the graph obtained from G by deleting the vertices of S. If G is connected but G − S is disconnected, then S is called a cut-set, and the set S is said to separate vertices x and y if they are in different components of G − S. The connectivity (sometimes called vertex connectivity) κ(G) of a graph G is the minimum number of vertices in a cut-set of G. The only graphs without cut-sets are complete graphs, and there the connectivity is one less than the order of the complete graph. A graph G is k-connected if κ(G) ≥ k. Thus, for example, a graph of order at least 3 and with no cut-vertices is 2-connected. It is easily checked that $\kappa(P_n) = 1$, $\kappa(C_n) = 2$, $\kappa(K_n) = n - 1$, and $\kappa(K_{m,n}) = n$ when m ≥ n. One would expect that as the connectivity of a graph becomes larger, there would be an increase in the number of “alternative” paths between pairs of vertices. The following results verify this expectation.

Theorem 2.1 (Menger): For distinct nonadjacent vertices u and v of a graph, the maximum number of internally disjoint (vertex disjoint except for u and v) paths between u and v is equal to the minimum number of vertices that separate u and v.

Theorem 2.2 (Menger-Whitney): A graph is k-connected if and only if for each pair of distinct vertices of the graph there are at least k internally vertex disjoint paths between them.

Vertex connectivity has an analog, edge connectivity, denoted by $\kappa_1(G)$. Each of the previous definitions and results concerning vertices has a natural edge analog. For example, if e is an edge of a connected graph G and the graph G − e (the graph obtained from G by deleting the edge e) is disconnected, then e is called a cut-edge. An edge-cut-set is a collection of edges which, when deleted, disconnect the graph, and the edge connectivity is the minimum number of edges in an edge-cut-set. There are results analogous to those of Menger and Whitney which also say that high edge connectivity is equivalent to the existence of many edge disjoint paths between any pair of vertices.

III. TREES

A tree is usually defined as a connected graph which contains no cycles. However, this very useful class of graphs has many other characterizations. The following theorem gives five equivalent statements, each of which could be used as a definition of a tree.

Theorem 3.1: The following statements are equivalent for a graph G of order n:
(i) G is connected and has no cycles.
(ii) G is connected and has size n − 1.
(iii) G has no cycles and has size n − 1.
(iv) G is a graph in which every edge is a cut-edge.
(v) G is a graph with a unique path between each pair of distinct vertices.

All nonisomorphic trees of order 5 are pictured in Fig. 6.

FIGURE 6 Trees of order 5.

Any connected graph G has a spanning tree. This is easy to observe. If an edge e is on a cycle of a connected graph, then its deletion will not disconnect the graph. Thus, by successive deletion of edges on cycles of appropriate connected subgraphs of G, one eventually obtains a connected subgraph of G which has no cycles. This graph is a spanning tree T of G. This tree is not in general unique (unless G is itself a tree), but it will have size n − 1 if G has order n, and there will be no smaller size connected spanning subgraph of G. The number of spanning trees in a graph can be quite large, as the following result of Cayley indicates.

Theorem 3.2 (Cayley): For n ≥ 2, the graph $K_n$ has $n^{n-2}$ nonidentical spanning trees.

There is a more general result on the number of spanning trees of a graph, and this result can be used to prove the theorem of Cayley. Let G be a graph and let A(G) be the adjacency matrix of G. The matrix D(G), called the degree matrix, is the diagonal matrix with the degrees of the vertices on the diagonal. Thus, the sum of the entries of each row (column) of the matrix D(G) − A(G) is 0. This implies that all of the cofactors of this matrix D(G) − A(G) are the same. This is the setting for the following result.

Theorem 3.3 (Kirchhoff, matrix tree): The number of nonidentical spanning trees in any graph G is the value of any cofactor of the matrix D(G) − A(G).

Consider the problem of building a transportation system that connects (but not necessarily each pair directly) a collection of cities. The cost of constructing the link between each pair of cities is known, and we want to minimize the total cost of the system by selecting the appropriate links. This translates directly into finding a minimal cost spanning tree of a weighted graph. The result of Cayley indicates that the brute-force method of considering all possible spanning trees is not very efficient. An intuitive and good algorithm [of order $n^2/\ln(n)$] for finding such a minimal spanning tree was developed by Kruskal. First, the edges of the graph are sorted in increasing order by weight. A subgraph is built by successively adding edges from this sorted list (starting with those of smallest weight), subject only to the condition that an edge is never added if it creates a cycle. This will generate a minimal cost spanning tree.

Trees are also useful as information structures. One example of this is a binary tree used as a search tree. The complete binary tree T of depth k can be defined inductively as a tree with a root vertex v which is adjacent to two vertices $v_L$ and $v_R$, called the left and right children of v, such that $v_L$ and $v_R$ are roots of disjoint complete binary trees of depth k − 1 in T − v. Figure 7 shows a complete binary tree of depth 3. The tree T has $2^{k+1} - 1$ vertices, and the paths from the root to other vertices of the tree have length at most k. Elements from an ordered set with $n = 2^{k+1} - 1$ elements can be stored at the vertices of the tree in such a way that for any complete binary subtree the elements in any “left tree” are less than the root and elements in the “right tree” are greater than the root. Thus, with the binary search tree the number of comparisons needed to search for an element from this ordered set with n elements has an upper bound of $\log_2(n + 1)$, which is considerably less than a naive straightforward approach would require.

FIGURE 7 Complete binary tree.

IV. EULERIAN AND HAMILTONIAN GRAPHS

In the first paper on graph theory, Euler (1736) considered the problem of traversing the seven bridges of Königsberg (see Fig. 8a) without crossing any bridge twice. He showed that it was not possible. This is equivalent to showing that the multigraph G of Fig. 8b does not contain a trail which uses all of the edges of G.

FIGURE 8 Königsberg bridges and multigraph.

A (closed) eulerian trail of a graph G is a (closed) trail which uses all of the edges of the graph. A graph which contains a closed eulerian trail is called eulerian.
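The shortest-path procedure of Dijkstra described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the simple scan over unsettled vertices that gives the $n^2$ bound, it assumes nonnegative edge weights, and it stores the graph as a dictionary of dictionaries. The function name `dijkstra_distances` and the choice of weight 1 on every edge of the graph G of Fig. 1 are illustrative choices, not taken from the text.

```python
import math

def dijkstra_distances(graph, source):
    """Distances from `source` in a weighted graph given as
    {vertex: {neighbor: weight}} with nonnegative weights.
    Vertices that cannot be reached keep distance math.inf."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    unsettled = set(graph)
    while unsettled:
        # find the "next closest" vertex among those not yet settled
        u = min(unsettled, key=lambda v: dist[v])
        if dist[u] == math.inf:
            break  # everything left lies in another component
        unsettled.remove(u)
        for w, weight in graph[u].items():
            if dist[u] + weight < dist[w]:
                dist[w] = dist[u] + weight
    return dist

# The graph G of Fig. 1, with weight 1 assigned to each edge.
edges = ["ux", "uy", "vy", "xy"]
G = {v: {} for v in "uvwxy"}
for a, b in edges:
    G[a][b] = 1
    G[b][a] = 1

dist = dijkstra_distances(G, "u")
# vertices at finite distance from u form the component containing u
component_of_u = {v for v, d in dist.items() if d < math.inf}
```

For this graph the distances from u are 1 to x and y, 2 to v, and infinite to the isolated vertex w, so the component of u is {u, v, x, y}, in agreement with the two components noted for Fig. 1.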
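Kruskal's procedure for a minimal cost spanning tree can be sketched as follows. The cycle test here uses a union-find (disjoint-set) structure, which is a common implementation choice and an assumption of this sketch rather than something the text specifies. The function name `kruskal` and the link costs between the four cities are hypothetical.

```python
def kruskal(vertices, weighted_edges):
    """Return (tree_edges, total_weight) for a minimal cost spanning tree
    of a connected weighted graph. `weighted_edges` holds (weight, u, v)."""
    parent = {v: v for v in vertices}

    def find(v):
        # follow parent links to the representative, shortening the path
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree, total = [], 0
    for weight, u, v in sorted(weighted_edges):  # increasing order by weight
        ru, rv = find(u), find(v)
        if ru != rv:  # u and v are in different pieces, so uv creates no cycle
            parent[ru] = rv
            tree.append((u, v))
            total += weight
    return tree, total

# Hypothetical link costs between four cities a, b, c, d.
costs = [(1, "a", "b"), (4, "a", "c"), (3, "b", "c"),
         (2, "c", "d"), (5, "b", "d"), (6, "a", "d")]
tree, total = kruskal("abcd", costs)
```

With these costs the edges ab, cd, and bc are accepted in that order, giving a spanning tree of size 3 = n − 1 and total cost 6; the remaining edges are rejected because each would close a cycle.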
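Euler's conclusion about the Königsberg bridges can be confirmed by exhaustive search. The sketch below assumes the usual description of the problem, with four land masses labeled A, B, C, D and the seven bridges listed in the code; these labels and the function name `has_eulerian_trail` are illustrative and are not read off Fig. 8. The search extends a trail one unused edge at a time, which is practical only for very small multigraphs.

```python
def has_eulerian_trail(edges, closed=False):
    """Brute-force test for a trail that uses every edge exactly once.
    `edges` is a list of vertex pairs; repeated pairs model multiple edges.
    With closed=True the trail must end at the vertex where it started."""
    vertices = {v for e in edges for v in e}
    used = [False] * len(edges)

    def extend(current, start, count):
        if count == len(edges):
            return current == start or not closed
        for i, (a, b) in enumerate(edges):
            if used[i] or current not in (a, b):
                continue
            used[i] = True
            nxt = b if current == a else a
            if extend(nxt, start, count + 1):
                return True
            used[i] = False  # backtrack and try another edge
        return False

    return any(extend(v, v, 0) for v in vertices)

# Assumed labeling: A is the island joined by two bridges to each bank
# (B and C) and by one bridge to D; D is joined once to each bank.
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
bridges_can_be_traversed = has_eulerian_trail(konigsberg)  # False
```

For comparison, a triangle given as `[("a", "b"), ("b", "c"), ("c", "a")]` passes the test with `closed=True`, since the cycle itself is a closed trail through all three edges.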

Euler showed that the graph G of Fig. 8b has no eulerian trail. For a graph to have such a trail, it is clear that the graph must be connected and that each vertex, except for possibly the first and last vertex of the trail, must have even degree. These conditions are also sufficient, as the following result states.

Theorem 4.1 (Euler): Let G be a connected graph (multigraph). Then, G has a closed eulerian trail if and only if each vertex has even degree, and G has an “open” eulerian trail if and only if there are precisely two vertices of odd degree.

A consequence of Theorem 1.1 is that a graph has an even number of vertices of odd degree. There is a useful immediate corollary of Theorem 4.1. If a connected graph G has 2k vertices of odd degree, then the edges of G can be “covered” with k trails, and this is the minimum number of trails which will suffice. This observation is the basis of many puzzles and games.

Also, related to eulerian graphs is the Chinese postman problem, which is to determine the shortest closed walk that contains all of the edges in a connected graph G. Such a walk is called for obvious reasons a postman’s walk. If G has size m, then the postman’s walk will have length m if and only if G is eulerian. At the other extreme, this shortest walk will have length 2m if and only if G is a tree. There is the obvious extension of the Chinese postman problem to weighted graphs and minimizing the sum of the weights along the postman’s walk. There are several good algorithms for solving this problem.

A graph G is hamiltonian if it contains a spanning cycle, and the spanning cycle is called a hamiltonian cycle. The name is derived from the mathematician Sir William Rowan Hamilton, who in 1857 introduced a game, whose object was to form such a cycle. In Euler’s problem the object was to visit each of the edges exactly once. In the hamiltonian case the object is to visit each of the vertices exactly once, so the problems seem closely related. However, in sharp contrast to the eulerian case, there are no known necessary and sufficient conditions for a graph to be hamiltonian, and the problem of finding such conditions is considered to be very difficult. There are numerous sufficient conditions for the existence of a hamiltonian cycle and a few necessary conditions. The following is an example of one of the better-known sufficient conditions.

Theorem 4.2 (Ore): If for each pair of nonadjacent vertices u and v of a graph G of order n ≥ 3, d(u) + d(v) ≥ n, then G is hamiltonian.

A traveling salesman wishes to visit all of the cities on his route precisely one time and return to his home city in the smallest possible time. The traveling salesman problem is to determine the route which will minimize the time (distance) of the trip. Another version of the same problem is presented by a robot that is tightening screws on a piece of equipment on an assembly line. An order for tightening the screws should be determined so that the distance traveled by the arm of the robot is minimized. The corresponding graph problem in both cases is to determine a minimum-weight hamiltonian cycle in a complete graph, with weights assigned to each edge. The weight assigned to an edge would represent the time or cost of that edge. A brute-force approach of examining all possible hamiltonian cycles could be quite expensive, since there are (n − 1)!/2 possibilities in a complete graph of order n. Although there are good solutions for special classes of graphs, no good algorithm is known for determining such a hamiltonian cycle in the general case; in fact, the traveling salesman problem is known to be NP-complete. This means that it is not known if a good algorithm exists, but the existence of a good algorithm to solve this problem would imply the existence of good algorithms to solve many other outstanding problems, such as the graph isomorphism problem. Some of these problems will be mentioned in later sections. Although there is no known good algorithm which always gives a minimum solution, there are procedures which give reasonable solutions most of the time.

V. COLORINGS OF GRAPHS

A coloring (sometimes called a vertex coloring) of the vertices of a graph is an assignment of one color to each vertex. If k colors are used, it is called a k-coloring, and if adjacent vertices are given different colors, it is a proper coloring. The minimum number of colors in a proper coloring of a graph G is called the (vertex) chromatic number of G and is denoted by χ(G). The chromatic number of many special graphs is easy to determine. For example, $\chi(K_n) = n$, $\chi(C_n) = 3$ if n is odd, and χ(B) = 2 for any bipartite graph B with at least one edge. Therefore, all paths, all cycles of even length, and all trees have chromatic number 2, since they are bipartite. In general, the chromatic number of a graph is difficult to determine; in fact, from an algorithmic point of view it is an NP-complete problem just like the traveling salesman problem.

In a proper coloring of a graph the set of all vertices assigned the same color must be independent. A proper k-coloring is thus a way of partitioning the vertices V of a graph G into k independent sets, and thus, the chromatic number is the minimum number of independent sets in such a partition.

Consider the problem of storing n chemicals (or other objects) when certain pairs of these chemicals cannot be

stored in the same building. There is a natural model associated with this problem, which is a graph G of order n. The vertices are the chemicals, and the edges are pairs of objects that cannot be stored together. The chromatic number χ(G) is the minimum number of buildings needed to safely store the objects, since the buildings partition the objects into sets corresponding to independent sets of vertices.

The chromatic number χ(G) is related to other graphical parameters which are easy to determine, such as Δ(G), the maximum degree of a vertex of G. The maximum degree gives an upper bound on the chromatic number.

Theorem 5.1 (Brooks): For any graph G, χ(G) ≤ Δ(G) + 1, and if G is a connected graph which is not an odd cycle or a complete graph, then χ(G) ≤ Δ(G).

A maximal complete subgraph of a graph is called a clique. If a graph G contains a clique of order m, then clearly χ(G) ≥ m. However, the determination of the order of the largest clique in a graph is also an NP-complete problem. In addition, the chromatic number of a graph can be very large without the graph even containing a complete graph with three vertices or any small cycle. There is no relation between χ(G) and the largest clique of a graph G that would simplify the problem of determining χ(G). The girth of a graph G is the length of a shortest cycle in the graph. The following result indicates that a graph can be very sparse and still have a large chromatic number.

Theorem 5.2 (Erdős, Lovász): For any pair of positive integers k and m, there is a graph with chromatic number at least k and girth at least m.

This result is interesting, but the proof technique that was used may be of even more interest. The probabilistic method, which was introduced by Paul Erdős, is the basis for the proof. In its simplest form, this powerful technique merely counts the number of graphs which do not satisfy a certain property and verifies that this number is less than the total number of graphs. Therefore, there must be at least one graph with the desired property. This results in a proof of the existence of a graph, but it does not necessarily exhibit such a graph.

There is an edge analog to the vertex coloring of a graph. An assignment of a color to each edge of a graph is called an edge coloring, and it is called a k-edge coloring if at most k colors are used. A proper edge coloring is one in which adjacent edges are assigned different colors. The

different colors. An upper bound is given by the following result.

Theorem 5.3 (Vizing): For any graph G, $\chi_1(G) = \Delta(G)$ or $\Delta(G) + 1$.

The theorem of Vizing implies that a graph G falls into one of two categories, depending on whether $\chi_1(G) = \Delta(G)$ or not. “Most” graphs are in the first category ($\chi_1 = \Delta$). In particular, all bipartite graphs are in this category. If n is odd, then the cycle $C_n$ has edge chromatic number 3, but maximum degree 2, so there are graphs in the second category. No characterization for graphs in either of these two categories is known.

There are two scheduling problems that can be considered as graph coloring problems. We briefly describe each of the problems along with a solution. The first problem involves a vertex coloring and the second problem involves an edge coloring.

A schedule of classes is to be determined that will accommodate the requests of a group of students. If each class is offered at a different time, then each student will be able to take the classes he or she wanted. This could result in a large number of class periods, with some being at very undesirable times. The problem is to determine the minimum number of time periods in which the classes can be scheduled so that each student will get the schedule he or she requested. Consider a graph G with the vertices being the classes to be scheduled. Two classes are joined by an edge if some student would like to take both classes. If two classes are scheduled at the same time, then no student requested to take both of these classes, so such classes must be independent in the graph G. This suggests that the chromatic number χ(G) is the minimum number of time periods needed for the scheduling. This fact is not difficult to verify.

A similar scheduling problem deals with the scheduling of teachers for classes. A collection of teachers T are to be scheduled to teach a set of classes C. The number of sections of each class for which each teacher is responsible is known. The problem is to determine a schedule which uses a minimum number of time periods. The assumption is made that, in any period, a teacher can teach only one class and each class is taught by only one teacher. Consider the bipartite graph B (actually it is a bipartite multigraph without loops) with T as the vertices in one part and C as the vertices in the other part. If a teacher t in T is scheduled to teach m sections of a class c in C, then place m edges between the vertex t and the vertex c. The construction of
edge chromatic number χ1 (G) of a graph G is the mini- a schedule is really a proper edge coloring of the bipartite
mum k for which there is a proper k-edge coloring of G. graph B using the time periods as colors. This observation
It should be noted that this concept applies to any multi- can be used to verify that the minimum number of periods
graph without loops. A lower bound for χ1 (G) is (G), that must be used in the scheduling is the edge chromatic
since each of the edges incident to a fixed vertex must have number χ1 (B), and this is the maximum degree (B).
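The class-scheduling model can be tried out on a small case. The following is a minimal sketch, not a method taken from the article: it builds the conflict graph from a hypothetical table of student requests and colors it greedily. The names `conflict_graph`, `greedy_coloring`, and the sample data are assumptions made for illustration. A greedy coloring is always proper and uses at most $\Delta(G) + 1$ periods, but it only gives an upper bound on $\chi(G)$; it is not guaranteed to find the minimum.

```python
from itertools import combinations


def conflict_graph(requests):
    """Classes are vertices; two classes are adjacent when some student
    requested both of them."""
    graph = {}
    for wanted in requests.values():
        for c in wanted:
            graph.setdefault(c, set())
        for a, b in combinations(sorted(set(wanted)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Give each class the smallest period not already used by a
    conflicting class, visiting classes by decreasing degree."""
    period = {}
    for v in sorted(graph, key=lambda u: (-len(graph[u]), u)):
        used = {period[w] for w in graph[v] if w in period}
        p = 0
        while p in used:
            p += 1
        period[v] = p
    return period


# Hypothetical requests, invented for this example.
requests = {
    "ann": ["algebra", "biology", "chemistry"],
    "bob": ["biology", "drawing"],
    "cat": ["algebra", "drawing"],
}
graph = conflict_graph(requests)
schedule = greedy_coloring(graph)
num_periods = max(schedule.values()) + 1
print(schedule, num_periods)
```

In this example the three classes requested by one student form a triangle, so no schedule can use fewer than three periods, and the greedy schedule happens to meet that bound.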

VI. PLANAR GRAPHS

The houses-and-utilities problem is a classical puzzle whose solution can be described using graph theory. Suppose there are three houses (A, B, and C), and three utilities (E, G, and W for electricity, gas, and water) as pictured in Fig. 9a. The problem is to connect each of the utilities to each of the houses without any of the utility lines crossing. Pictured in Fig. 9a is an attempt to do this that failed. The graph theoretical form of this problem is to determine if the complete bipartite graph $K_{3,3}$ (Fig. 9b) can be drawn or pictured in the plane in such a way that the lines which represent the edges do not intersect except at the vertices of the graph. This leads to the concepts of plane and planar graphs.

FIGURE 9 Houses-and-utilities problem.

A planar graph is one that can be drawn in the plane with points representing the vertices, and “nice smooth curves or lines” that do not intersect representing the edges. Once such a graph is drawn or embedded in the plane, it is called a plane graph. Thus, the three-utilities problem is equivalent to determining if $K_{3,3}$ is a planar graph. It is also a question that has a surprisingly uncomplicated answer, not only for the special graph $K_{3,3}$, but for any graph.

Figure 10 is a diagram of a plane graph G. Associated with this diagram are points (vertices of G), curves (edges of G), and regions (faces of G). The regions are the connected portions of the plane that remain after the points and curves of the diagram are deleted. Also note that one of these regions, namely $R_4$, is unbounded. If p, q, and r are the number of vertices, edges, and faces of a plane graph G, respectively, then p = 6, q = 8, and r = 4 for the graph in Fig. 10. An induction proof can be employed to verify that the following linear equality is always satisfied by these parameters.

FIGURE 10 Plane graph.

Theorem 6.1 (Euler formula): If G is a connected plane graph with p vertices, q edges, and r faces, then $p - q + r = 2$.

The above result is a useful and powerful tool in proving that certain graphs are not planar. The boundary of each region of a plane graph has at least three edges, and of course each edge can be on the boundary of at most two regions. Thus, the number of edges and regions in a plane graph satisfies the inequality $r \le 2q/3$. This inequality and Theorem 6.1 imply the relationship $q \le 3p - 6$ between the number of vertices and edges in a planar graph. This type of reasoning and the observation that in a bipartite plane graph each region will have at least four edges on its boundary give the following result.

Theorem 6.2: If G is a connected planar graph with p ($\ge 3$) vertices and q edges, then $q \le 3p - 6$. If, in addition, G is a bipartite graph, then $q \le 2p - 4$.

The second inequality of Theorem 6.2 implies that the bipartite graph $K_{3,3}$ is not planar, since $9 > 2(6) - 4$. The first inequality of Theorem 6.2 implies that the complete graph $K_5$ is also not planar, since p = 5 and q = 10 for $K_5$, so that $10 > 3(5) - 6$. In some sense any graph that is not planar must contain either $K_5$ or $K_{3,3}$. We now describe what we mean by “in some sense.” To subdivide an edge uv of a graph is to replace the edge with a vertex w and two edges uw and wv. This is like placing a vertex in the middle of an edge. A subdivision of a graph is obtained by successively subdividing the edges of a graph. Clearly, a subdivision of a graph is planar if and only if the graph is planar. We can now give a characterization of planar graphs.

Theorem 6.3 (Kuratowski): A graph is planar if and only if it contains no subdivision of $K_5$ or $K_{3,3}$.

One of the most famous problems in mathematics is the four-color problem, which was first proposed in 1852. Consider a map, such as the map of Europe. We would like to color the countries (which we will assume to be connected) so that two countries that have a common boundary (not just a point) will have different colors. Is it possible to color any such map with at most four colors? It is easy to show that six colors will suffice, since any planar graph must have a vertex of degree at most 5. Also, there is a clever but elementary proof which shows that five colors are sufficient. However, the question of whether four colors are enough remained open for many years. Many mathematicians (both professional and amateur) worked on this problem, and many incorrect proofs were generated.

A map M with five countries is pictured in Fig. 11a. Associated with this map is a planar graph G, whose

vertices are the countries and whose edges join countries with a common boundary. An embedding of G is given in Fig. 11b. It is important to note that the graph G is planar. Coloring the map M so that adjacent countries have different colors is the same as giving a proper vertex coloring of the planar graph G of Fig. 11b. Associated with any map is a planar graph, and conversely, associated with a plane graph is a map. Thus, solving the four-color problem is equivalent to showing that no planar graph has chromatic number greater than four. In 1976 a positive solution to this problem was obtained by Appel and Haken by reducing the problem to a finite number of classes of planar graphs, and then by analyzing these cases with the aid of a computer. Since that time, the number of cases left to computer verification has been decreased, and also independent computer programs have been created to verify the remaining cases. However, all present proofs are still dependent on computer verification of a large number of cases.

FIGURE 11 Map and plane graph.

Theorem 6.4 (Appel-Haken): For any planar graph G, $\chi(G) \le 4$.

For many applications, planarity is a very desirable property to have. If the graph that represents an electrical circuit is planar, then the circuit can be printed on a board. Also, there are good algorithms (in fact, linear in the number of edges of the graph) for testing graphs for planarity and for embedding planar graphs in the plane.

For graphs which are not planar, there are several measures of how “nonplanar” they are. We mention three of the most common measures. One of these is the genus of a graph. The surface obtained from a sphere by adding m handles (like a handle on a coffee cup) is a surface of genus m. An embedding in such a surface is defined in the same way as it is for the plane: curves which represent edges must not intersect except at points which represent vertices. A graph G has genus m if it can be embedded in a surface of genus m but not in a surface of genus $m - 1$. Any finite graph can be embedded in a surface of sufficiently large genus, so every graph has finite genus. Embedding a graph in the plane is equivalent to embedding the graph on a sphere. Since a sphere is a surface of genus 0, the planar graphs are precisely the graphs of genus 0.

If a graph G is not planar, then it can be “broken” into subgraphs which are planar. The minimum number of planar subgraphs of G, whose union of edges is all of the edges of G, is called the thickness of G. Therefore, a planar graph has thickness 1.

If a graph G is not planar, then any drawing of it in the plane would require that at least two of the edges “cross” at some point other than at a vertex. The minimum number of such crossings possible for some representation in the plane is called the crossing number of G. Clearly, planar graphs have crossing number 0.

In general, the determination of the genus, thickness, or crossing number of a graph is an extremely difficult problem, even for special classes of graphs. For example, both the genus and thickness of all complete graphs are known, but the proofs are not easy. Also, the crossing number is not known for complete graphs with more than 10 vertices.

VII. FACTORIZATION THEORY

A spanning subgraph of a graph G is called a factor of G. If E(G) is the edge disjoint union of the edges of factors of G, then these factors form a factorization of G. Thus a factorization is a partitioning of the edges of the graph. A factor of a graph which is r-regular is called an r-factor, and any factorization of a graph with r-factors is called an r-factorization. A hamiltonian cycle in a graph is an example of a 2-factor. Of particular interest are 1-factors of graphs, also called perfect matchings. A perfect matching of a graph of order n (clearly n must be even) thus consists of n/2 independent edges (edges which are not adjacent). A collection of independent edges, which may not necessarily span the graph, is called a matching. A matching with the most edges is called a maximum matching. In a cycle $C_{2k}$ of even length the alternate edges in the cycle form a perfect matching in the cycle. There are thus two such perfect matchings, and they form a 1-factorization of the cycle.

Factorizations of complete graphs have been studied extensively. For example, $K_{2n}$ has a 1-factorization. Thus, since $K_{2n}$ is $(2n - 1)$-regular, it will have an r-factorization if and only if $2n - 1$ is a multiple of r. Of course $K_{2n+1}$ cannot have a 1-factor because of its order, but it does have a 2-factorization. In fact, it has a factorization consisting of hamiltonian cycles. In Fig. 12a, a 1-factorization of $K_4$ is displayed, and a 2-factorization of $K_5$ is given in Fig. 12b. There are many other kinds of interesting factors and factorizations of complete graphs, such as factoring with paths or trees.

One of the earliest general results in this area involves a characterization of graphs which have 2-factorizations. Clearly, a graph G cannot have an r-factorization unless it is m-regular for some multiple m of r. In the case when r = 2, this condition is also sufficient.
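Before the characterization is stated, the claim that $K_{2n}$ has a 1-factorization can be made concrete. The sketch below uses the classical round-robin ("circle") construction, which is one standard way to build such a factorization and is not necessarily the construction the article has in mind. The function name `one_factorization` and the vertex labels 0, 1, ..., 2n − 1 are assumptions made for illustration.

```python
def one_factorization(order):
    """Return order - 1 perfect matchings that partition the edges of the
    complete graph on vertices 0 .. order - 1 (order must be even).

    Vertices 0 .. order - 2 are placed on a circle and the last vertex is
    a hub. In round r the hub is matched with r, and the remaining
    vertices are matched in pairs symmetric about r on the circle.
    """
    if order < 2 or order % 2:
        raise ValueError("order must be even and at least 2")
    m = order - 1  # number of circle vertices (odd)
    factors = []
    for r in range(m):
        matching = [(r, m)]  # hub edge
        for k in range(1, order // 2):
            a, b = (r + k) % m, (r - k) % m
            matching.append((min(a, b), max(a, b)))
        factors.append(matching)
    return factors


# The three perfect matchings of K_4, one per line.
for factor in one_factorization(4):
    print(factor)
```

Grouping the resulting 1-factors r at a time gives an r-factorization whenever r divides $2n - 1$, which is the easy direction of the divisibility statement above.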

FIGURE 12 (a) 1-Factorization of $K_4$. (b) 2-Factorization of $K_5$.

Theorem 7.1 (Petersen): A graph has a 2-factorization if and only if it is m-regular with m even.

A similar characterization for 1-factorizations in bipartite graphs exists.

Theorem 7.2 (Kőnig): A bipartite graph has a 1-factorization if and only if it is regular.

Perfect matchings of graphs have been investigated more extensively than any other factors. Fortunately, there is a very useful characterization of bipartite graphs which have perfect matchings. If G is a bipartite graph with parts A and B, then a matching M is said to saturate A if each vertex in A is incident to an edge of M. Thus, if |A| = |B|, any matching which saturates A is a perfect matching. Also, if M is a matching which saturates A, then for any subset S of A, the neighborhood N(S) of S (vertices not in S which are adjacent to some vertex of S) must have at least as many vertices as S, because the edges of M alone match the vertices of S to |S| distinct vertices of N(S). This condition is also sufficient to imply the existence of a matching which saturates A.

Theorem 7.3 (Philip Hall): If G is a bipartite graph with parts A and B, then there is a matching which saturates A if and only if $|N(S)| \ge |S|$ for all $S \subseteq A$.

A characterization of all graphs which have perfect matchings also exists, but it is more complicated. If S is a separating set for a graph G, then G − S may have several components. Any component C with an odd number of vertices (called an odd component) cannot within itself have a perfect matching. So, any perfect matching of G would have at least one edge joining a vertex of C with a vertex of S. Thus, if G has a perfect matching, the number of odd components cannot exceed the number of vertices in S. This condition on the number of odd components is also sufficient for the existence of a perfect matching.

Theorem 7.4 (Tutte): A graph G has a perfect matching if and only if for each $S \subseteq V(G)$ the number of odd components in G − S does not exceed |S|.

Although both Theorem 7.3 and Theorem 7.4 are useful results, they cannot be applied directly to obtain good algorithms for finding perfect matchings (or maximum matchings) in graphs. This is true because there are $2^n$ different subsets in a set with n elements. However, good algorithms do exist for finding maximum matchings in graphs. One such algorithm for bipartite graphs uses the idea of an alternating path. If M is a matching of a graph G, then an M-alternating path is a path whose edges alternate between edges in M and edges not in M. If the first and last vertices of the path are not saturated by M, then the path is an M-augmenting path. If P is an M-augmenting path, then replacing the edges of P which are in M with the edges of P not in M will give a larger matching in G. Thus a maximum matching M never has an M-augmenting path. Berge showed that this was a necessary and sufficient condition for a matching to be maximum.

The algorithm based on this result searches for augmenting paths by constructing a tree rooted at a vertex which is not saturated by the matching. Also, all of the paths in this tree are alternating. Either this tree will give an augmenting path, which will give a larger matching, or the matching will be a maximum matching. The condition of Hall can be shown to be violated if the maximum matching is not a perfect matching.

There are many applications of matchings in graphs. An obvious one is the assignment problem. Assume there are n workers and n jobs, and each worker can perform certain of the jobs. The assignment problem is to determine if it is possible to assign each worker a job (two workers cannot be assigned the same job), and if so, make the assignment. The graph model is a bipartite graph with the workers in one part and the jobs in the other part. An edge is placed between a worker and any job that worker can perform. The assignment problem reduces to finding a maximum matching in this bipartite graph. If this matching is a perfect matching, then an assignment can be made. Otherwise, it is not possible. Since there is a good algorithm for finding maximum matchings in bipartite graphs, there is a good algorithm for solving the assignment problem.

The assignment problem can also be considered when the number of jobs and the number of workers are not the same. For example, when the number of workers exceeds the number of jobs, the objective is to assign all of the jobs. This will unfortunately leave some of the workers unemployed. The graph model is still the same bipartite graph, and the objective is to find a matching which saturates the vertices associated with the jobs. The same algorithm used when the number of jobs is the same as the number of workers applies in this case.

A generalization of the assignment problem is the optimal assignment problem. Here there are also n workers and

n jobs. However, in this case, the ith worker can perform the jth job with some efficiency $c_{ij}$. The problem is to assign each worker a job such that the sum of the efficiencies is a maximum. Again, a brute-force approach of examining all possible assignments is not efficient, since there are n! such possibilities. However, there is a good algorithm for solving the optimal assignment problem which utilizes finding maximum matchings in a series of appropriately defined bipartite graphs. Associated with a given assignment of jobs there is a bipartite graph. A maximum matching is determined in this bipartite graph. If the maximum matching is not a perfect matching, then it can be used to obtain a better assignment which determines a new bipartite graph. If the matching is perfect, then the assignment is optimal and the algorithm terminates.

VIII. GRAPH RECONSTRUCTION

A famous unsolved problem in graph theory is the Kelly-Ulam conjecture. A graph G of order n is reconstructible if it is uniquely determined by its n subgraphs G − v for $v \in V(G)$. In Fig. 13 there is an example of the four graphs obtained from single vertex deletions of a graph of order 4, and the graph they uniquely determine.

FIGURE 13 Graph reconstruction.

Reconstruction Conjecture (Kelly-Ulam): Any graph of order at least 3 is reconstructible.

The initial but equivalent formulation of the conjecture involved two graphs. If G and H are graphs with $V(G) = \{u_1, u_2, \ldots, u_n\}$ and $V(H) = \{v_1, v_2, \ldots, v_n\}$, and if $G - u_i \cong H - v_i$ for $1 \le i \le n$, then $G \cong H$. Note that to say that a graph G is reconstructible does not mean that there is a good algorithm which will construct the graph G from the graphs G − v for $v \in V$. A positive solution to the conjecture might still leave open the question of the complexity of algorithms that would generate a solution to the problem.

Although it is not known in general if a graph is reconstructible, certain properties and parameters of the graph are reconstructible. It is straightforward to reconstruct from the vertex-deleted subgraphs both the size of a graph and the degree of each vertex. Let G be a graph of size q with vertices $\{v_1, v_2, \ldots, v_p\}$, and for each i let $q_i$ be the size of the graph $G - v_i$. Each edge in G would appear in precisely $p - 2$ of the vertex-deleted subgraphs, hence

$$q = \sum_{i=1}^{p} \frac{q_i}{p - 2}.$$

Also, clearly the vertex $v_i$ has degree $q - q_i$. An immediate consequence of these facts is that any regular graph is reconstructible.

Several properties dealing with the connectedness of a graph are reconstructible, including the number of components of the graph. A subgraph of a graph is a block if it is a maximal 2-connected subgraph. The blocks of a graph partition the edges of a graph, and the only vertices that are in more than one block are the cut-vertices. If a graph has at least two blocks, then the blocks of the graph can also be determined. However, this does not mean the graph can be reconstructed from the blocks.

There are many special classes of graphs which are reconstructible, but we list only three well-known classes.

Theorem 8.1: The following classes of graphs are reconstructible:

(i) Disconnected graphs
(ii) Trees
(iii) Regular graphs

Corresponding to the “vertex” reconstruction conjecture is an edge reconstruction conjecture, which states that a graph G of size $m \ge 4$ is uniquely determined by the m subgraphs G − e for $e \in E(G)$. Such a graph is said to be edge-reconstructible. Just as in the vertex case, the edge conjecture is open. The two conjectures are related, as the following result indicates.

Theorem 8.2 (Greenwell): If a graph with at least four edges and no isolated vertices is reconstructible, then it is edge-reconstructible.

Theorem 8.2 implies that trees, regular graphs, and disconnected graphs with two nontrivial components are edge-reconstructible. There are also results which show that graphs with “many” edges are edge-reconstructible. For example, Lovász has shown that if a graph G has order n and size m with $m \ge n(n - 1)/4$, then G is edge-reconstructible.

Intuitively, the edge-reconstruction conjecture is weaker than the reconstruction conjecture. This is confirmed by Theorem 8.2. However, there is another way of relating the two conjectures. Associated with each graph G is the line graph L(G) of G. The vertices of L(G) are the edges of G, and two vertices of L(G) (which are edges of G) are adjacent in L(G) if and only if they were adjacent edges in G. The following result relates reconstruction and edge reconstruction.

Theorem 8.3 (Harary, Hemminger, Palmer): A graph with size at least four is edge-reconstructible if and only if its line graph is reconstructible.
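The definition of the line graph is easy to turn into a small computation. The sketch below is an illustration only: the function names `line_graph` and `degree_sequence`, the edge-list representation, and the sample graphs are all assumptions, and the input is taken to be a simple graph with no repeated edges. It builds L(G) directly from the definition and compares degree sequences. For $K_3$ and $K_{1,3}$ both line graphs have three vertices of degree 2, which forces each of them to be a triangle.

```python
from itertools import combinations


def line_graph(edges):
    """Return the adjacency map of L(G). Each vertex of L(G) is an edge
    of G, stored as a frozenset of its two endpoints."""
    verts = [frozenset(e) for e in edges]
    adj = {e: set() for e in verts}
    for e, f in combinations(verts, 2):
        if e & f:  # two edges of G are adjacent when they share an endpoint
            adj[e].add(f)
            adj[f].add(e)
    return adj


def degree_sequence(adj):
    """Sorted list of vertex degrees of a graph given by an adjacency map."""
    return sorted(len(nbrs) for nbrs in adj.values())


triangle = [(0, 1), (1, 2), (0, 2)]             # K_3
claw = [(0, 1), (0, 2), (0, 3)]                 # K_{1,3}
star4 = [(0, 1), (0, 2), (0, 3), (0, 4)]        # K_{1,4}
cycle5 = [(i, (i + 1) % 5) for i in range(5)]   # C_5

for name, g in [("K3", triangle), ("K1,3", claw),
                ("K1,4", star4), ("C5", cycle5)]:
    print(name, degree_sequence(line_graph(g)))
```

Comparing degree sequences is enough to identify the line graph in these small cases, but it is not an isomorphism test in general.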

The line graphs of some special classes of graphs are easy to determine. For example, the line graph of a star $K_{1,n}$ is $K_n$, a complete graph, and the line graph of a cycle $C_n$ is the cycle $C_n$ of the same length. Therefore, the graphs $K_3$ and $K_{1,3}$ have isomorphic line graphs, namely, $K_3$. With this one exception, the line graphs of nonisomorphic connected graphs are also nonisomorphic. However, this does not imply that every graph is the line graph of some graph. In fact, there are numerous characterizations of line graphs. In particular, no graph which has an induced subgraph isomorphic to $K_{1,3}$ can be the line graph of a graph.

Since not every graph is the line graph of some graph, Theorem 8.3 does not imply that the edge reconstruction conjecture and the vertex reconstruction conjecture are equivalent.

IX. EXTREMAL THEORY

There are a large number of optimization problems and results in graph theory which could be considered in a discussion of extremal graph theory. However, we will restrict our consideration to the Turán-type extremal problem of determining the maximum number of edges a graph can have without containing a certain subgraph. For example, consider the problem of determining how many edges there can be in a graph of order n which does not contain a $K_3$ (triangle). For n even, the complete bipartite graph $K_{n/2,n/2}$ has $n^2/4$ edges and no triangle, and, in fact, no odd cycle of any length. Also, any graph with n vertices and $(n^2/4) + 1$ edges (for n even) can be shown to contain a triangle. Thus, when n is even, the maximum number of edges in a triangle-free graph of order n is $n^2/4$.

The original problem of Turán, who initiated this kind of investigation, was to determine the maximum number of edges a graph of order n can have without containing a complete graph $K_p$. The case p = 3 was just considered. The general case, while more complicated, is entirely parallel.

Multipartite graphs are a class of graphs which occur in this type of study. A graph G is called a k-partite graph if the vertices V can be partitioned into k parts such that the only edges in G join vertices in different parts. Thus, bipartite graphs are simply 2-partite graphs. If all edges between each pair of the k parts are in the graph, the graph is called a complete k-partite graph. In any k-partite graph the vertices in each part are independent, and if the graph is a complete k-partite graph then the chromatic number of the graph is clearly k. Let T(n, k) denote the complete k-partite graph of order n in which the difference in the number of vertices in any pair of parts is at most 1. That is, if $n = km + r$ with $0 \le r < k$, there will be r parts with $m + 1$ vertices and the remaining $k - r$ parts will have m vertices. Let t(n, k) denote the size of the graph T(n, k). Turán proved the following result.

Theorem 9.1 (Turán): The maximum number of edges in a graph of order n which does not contain a $K_p$ is $t(n, p - 1)$, and the graph $T(n, p - 1)$ is the only graph of that order and size with no $K_p$.

For a fixed graph F, which we will call the forbidden graph, the extremal number ex(n, F) is the maximum size of a graph of order n which does not contain F as a subgraph. The graphs of order n and size ex(n, F) which do not contain F are called the extremal graphs for F, and the collection of these graphs is denoted by Ex(n, F). The extremal problem consists of finding ex(n, F) and also Ex(n, F), if possible. In the case when the forbidden graph is $K_p$, the result of Turán states that $\mathrm{ex}(n, K_p) = t(n, p - 1)$ and $\mathrm{Ex}(n, K_p) = \{T(n, p - 1)\}$. Note that

$$t(n, p - 1) = \frac{p - 2}{p - 1} \cdot \frac{n^2}{2}$$

when n is a multiple of $p - 1$ and is close to that for all other values of n. If p is thought of as the chromatic number of the complete graph $K_p$, then the extremal result for an arbitrary graph F is closely related to the result for the complete graph. In fact, the chromatic number $\chi(F)$ determines the order of magnitude of the extremal number if the graph is not bipartite. In the following result, an “error function” is needed. Denote by $o(n^2)$ a function of n with the property that $\lim_{n \to \infty} o(n^2)/n^2 = 0$.

Theorem 9.2 (Erdős-Simonovits): For any nontrivial graph F with chromatic number $k \ge 3$,

$$\mathrm{ex}(n, F) = \frac{k - 2}{k - 1} \binom{n}{2} + o(n^2).$$

Also, any graph in Ex(n, F) can be obtained from $T(n, k - 1)$ by deleting or adding at most $o(n^2)$ edges.

If F is not a bipartite graph, then the first term in the expression for ex(n, F) in Theorem 9.2 is of order $n^2$, so the error function $o(n^2)$ becomes insignificant for sufficiently large values of n. However, when F is a bipartite graph, this first term is 0, and thus Theorem 9.2 gives little information. The extremal problem for a bipartite graph is called the degenerate extremal problem. In this case, there are many interesting open questions, and no general asymptotic result like Theorem 9.2 is known. Examples which give lower bounds for the number ex(n, F) are difficult to find when F is a bipartite graph, and in many cases involve designs. We mention two results which give upper bounds. Sharp lower bounds are not known for either of the classes of graphs considered

below, except for a few small-order cases such as $C_4$, $C_6$, $C_{10}$, and $K_{3,3}$.

Theorem 9.3 (Kővári-Sós-Turán): If $r \le s$, then there exists a constant $c_{rs}$ (depending only on r and s) such that

$$\mathrm{ex}(n, K_{r,s}) \le c_{rs}\, n^{2 - 1/r}.$$

Theorem 9.4 (Erdős): There is a constant $c_k$ such that

$$\mathrm{ex}(n, C_{2k}) \le c_k\, n^{1 + 1/k}.$$

In the special case when the forbidden bipartite graph F is a tree, more is known about ex(n, F). When $p - 1$ divides n, a disjoint union of complete graphs $K_{p-1}$ implies that $\mathrm{ex}(n, T_p) \ge (p - 2)n/2$ for any tree $T_p$ with p vertices. Erdős and Sós conjectured that $(p - 2)n/2$ is an upper bound as well for all values of n. This is known to be true for stars and paths.

The extremal problem can be generalized to consider a family of forbidden graphs, not just one graph. However, the extremal results for a family of forbidden graphs are essentially the same as for one forbidden graph. The important parameter in the case of a family of graphs is the smallest chromatic number of a graph which is in the family. Therefore, if there is at least one graph in the family which is bipartite, then the extremal number will be $o(n^2)$. Also, if a graph G of order n has more than ex(n, F) edges, then it has at least one copy of F. An interesting problem is to determine the number of copies of F that G must have as a function of the number of edges k in G.

An area of extremal theory which predates the Turán type of theory is Ramsey theory. In Ramsey theory, pairs of “forbidden” graphs are considered, say, $F_1$ and $F_2$, and one is interested in determining whether for each graph G of order n, $F_1$ is a subgraph of G, or $F_2$ is a subgraph of the complement $\bar{G}$ of G. A consequence of a result of Frank Ramsey proved in 1930 is that this is always true if n is sufficiently large. The smallest n for which it is true is called the Ramsey number of the pair $(F_1, F_2)$ and is denoted by $r(F_1, F_2)$. In general, these numbers are difficult to determine. In the classical case when $F_1$ and $F_2$ are complete graphs, only a few of the Ramsey numbers are known; in fact, Table I gives all of the known Ramsey numbers $r(K_m, K_n)$ for $3 \le m \le n$.

TABLE I Ramsey Numbers $r(K_m, K_n)$

m\n    3    4    5    6    7    8    9
3      6    9   14   18   23   28   36
4          18

X. DIRECTED GRAPHS

A directed graph or digraph D is a finite collection of elements, which are called vertices, and a collection of ordered pairs of these vertices, which are called arcs. Thus, a digraph is similar to a graph except that each arc in a digraph has a direction, while an edge in a graph does not. Just as in the case for graphs, the vertices and arcs of D will be denoted by V and E, respectively. However, in this case the arc e = uv does not join vertices u and v, but it joins u to v. Thus, $uv \ne vu$. The terms adjacent with and incident with will be replaced by the terms adjacent to, adjacent from, incident to, and incident from. The order and size of a digraph are defined as for a graph.

Most of the concepts for graphs have obvious analogs for digraphs, for example, subdigraph and directed walks of various types. However, the degree of a vertex v is split into two parts: $d^+(v)$ is the number of arcs incident from v and is called the outdegree, and $d^-(v)$ is the number of arcs incident to v and is called the indegree. The degree is $d(v) = d^+(v) + d^-(v)$. Associated with each digraph D is an underlying graph G obtained from D by removing the directions from the arcs of D and then removing one of any pair of multiple edges that are produced by removing the directions.

FIGURE 14 (a) Digraph. (b) Components.

The concept of connectivity is more complicated for directed graphs, and several types of connectivity arise. We will mention two that are most commonly used. A digraph D is connected if the underlying graph is connected. A vertex v is reachable from a vertex u if there is a directed path from u to v. The digraph is strongly connected or diconnected if each pair of distinct vertices is reachable from each other. Clearly, diconnectedness implies connectedness, but the digraph in Fig. 14a shows that the two concepts are not the same. Just as connectivity can be used to partition the vertices of a graph into components, diconnectedness can be used to partition the vertices

of a digraph into dicomponents. Any two vertices in the same dicomponent are mutually reachable. The dicomponents of the digraph D in Fig. 14a are shown in Fig. 14b. Thus, the digraph in Fig. 14a is an example of a digraph that is connected yet has three dicomponents.
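
The partition into dicomponents can be computed mechanically. The following is a minimal Python sketch, assuming Kosaraju's two-pass depth-first search (a standard technique chosen here for illustration; the article does not name a specific algorithm). The function name `dicomponents` and the small example digraph are hypothetical, and the example is not the digraph of Fig. 14a.

```python
from collections import defaultdict


def dicomponents(vertices, arcs):
    """Return the dicomponents (strongly connected components) of a digraph.

    Kosaraju's method: a DFS on the digraph records the order in which
    vertices finish; a second sweep over the reversed digraph, started in
    decreasing finishing order, collects one dicomponent per start vertex.
    """
    out_nbrs = defaultdict(list)
    in_nbrs = defaultdict(list)
    for u, v in arcs:
        out_nbrs[u].append(v)
        in_nbrs[v].append(u)

    # Pass 1: iterative DFS, recording vertices in order of completion.
    finished, seen = [], set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(out_nbrs[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(out_nbrs[nxt])))
                    break
            else:
                # All out-neighbours explored: this vertex is finished.
                finished.append(node)
                stack.pop()

    # Pass 2: sweep the reversed digraph in decreasing finishing order.
    assigned, components = set(), []
    for root in reversed(finished):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in in_nbrs[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Hypothetical digraph: its underlying graph is connected,
# yet it splits into three dicomponents.
example_arcs = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)]
example_components = dicomponents(range(1, 7), example_arcs)
```

In this example vertices 1, 2, 3 lie on a directed cycle, as do 4 and 5, while vertex 6 can be reached but cannot reach anything, so it forms a dicomponent by itself.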
The lengths of paths or cycles in the underlying graph of a digraph give no information about the length of directed paths or cycles. However, it is surprising that the chromatic number of a digraph (which is the same as the chromatic number of the underlying graph) does provide some information, as the following two results indicate. The first of these results can be obtained as a corollary of the second.

Theorem 10.1 (Gallai-Roy): Every digraph D contains a directed path of length at least χ(D) − 1.

Theorem 10.2 (Bondy): Every diconnected digraph D contains a cycle of length at least χ(D).

Just as in the case for graphs, there are natural questions concerning the existence of special subdigraphs in digraphs. Conditions which imply the existence of eulerian directed trails, cycles, or other subdigraphs in digraphs are often analogs of conditions in graphs. The following are just two such examples from among many.

Theorem 10.3: A connected digraph D has a closed eulerian directed trail if and only if $d^+(v) = d^-(v)$ for all v ∈ V.

Theorem 10.4 (Woodall): If D is a digraph of order n such that, for distinct vertices u and v, uv ∉ E(D) implies that $d^+(u) + d^-(v) \ge n$, then D has a directed hamiltonian cycle.

Consider a city which has had both one-way and two-way streets, but because of additional traffic flow must make each street one-way. Is it possible to choose the direction of the streets so that a motorist can get from any one point in the city to any other point in the city? Given a graph G, an orientation of the graph is an assignment of a direction to each of the edges of the graph. Thus, the oriented graph obtained in this way is a digraph. The graph theory form of the initial problem is to determine for which graphs there is an orientation which makes the resulting digraph diconnected. If the graph contains a bridge (an edge whose removal disconnects the graph), then clearly no such orientation exists. The lack of such a bridge, which means that at least two edges must be deleted to disconnect the graph and so the graph is 2-edge-connected, is also a sufficient condition.

Theorem 10.5 (Robbins): Every 2-edge-connected graph has an orientation which gives a diconnected digraph.

A digraph obtained by orienting a complete graph is called a tournament. See Fig. 15 for an example of a tournament which is diconnected and in fact has a directed hamiltonian cycle. The use of the term “tournament” is a natural one. If a “tennis tournament” were played with each pair of participants playing each other (round robin tournament), then the results of the tournament could be modeled by this type of directed graph. A directed edge from u to v indicates that u defeated v.

FIGURE 15 Diconnected tournament.

Tournaments are a class of digraphs that has been studied extensively. If T is a tournament of order n, then the chromatic number χ(T) is n. Therefore Theorem 10.1 and Theorem 10.2 have the following immediate consequences.

Theorem 10.6 (Rédei): Every tournament has a directed hamiltonian path.

Theorem 10.7 (Camion): A tournament of order at least 3 has a directed hamiltonian cycle if and only if it is diconnected.

Theorem 10.6 implies that in any round-robin tournament the players can always be ordered such that the ith player defeated the (i + 1)th player. In this case the first player could be considered the “best” player. However, the hamiltonian path is not necessarily unique, so there could be another ordering which would give a different first person who could also claim to be the “best.” Theorem 10.7 implies that the players can be cyclically ordered such that the ith player defeated the (i + 1)th if and only if for each pair of players A and B there is a sequence from A to B such that each player defeated the next player in the sequence. In this case any person could make an argument for being the “best” player in the tournament.

Most of the algorithms dealing with graphs have digraph analogs. Also, in general, the complexity of the corresponding algorithms is approximately the same. If there is a good algorithm in the graph case, then the corresponding digraph algorithm will usually be good, and if the graph problem is NP-complete, then the corresponding digraph problem will be NP-complete. For example, there is a good algorithm to find the dicomponents of a digraph just as there is a good algorithm to find the components of a graph. Deciding the existence of hamiltonian cycles is NP-complete in both the directed and undirected case.
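
The ordering of players promised by Theorem 10.6 can be built one player at a time. The sketch below uses the standard insertion argument, which is one common way to prove the theorem and not necessarily the proof the article has in mind. The function name `hamiltonian_path`, the predicate `beats`, and the four-player results are all hypothetical.

```python
def hamiltonian_path(n, beats):
    """Order players 0..n-1 so that each one defeated the next.

    `beats(u, v)` must return True exactly when u defeated v; in a
    tournament exactly one of beats(u, v), beats(v, u) holds for u != v.
    """
    path = []
    for v in range(n):
        # Skip past every player at the front of the path who defeated v.
        # v is inserted just before the first player that v defeated, so
        # the arc into v and the arc out of v both exist.
        pos = 0
        while pos < len(path) and beats(path[pos], v):
            pos += 1
        path.insert(pos, v)
    return path


# Hypothetical round-robin results: (u, v) means u defeated v.
results = {(0, 1), (1, 2), (2, 0), (0, 3), (3, 1), (2, 3)}
order = hamiltonian_path(4, lambda u, v: (u, v) in results)
```

Each insertion scans the current path once, so the whole construction uses on the order of n² comparisons. This contrasts with the hamiltonian cycle problem for general digraphs, which is NP-complete.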
XI. NETWORKS

The term “network” is used in several different contexts and has many meanings. For example, there are communication networks and there are electrical networks. We will generally restrict our consideration to one kind of network, which we now define. The purpose is to build a model for determining how to maximize the flow of goods from one point to another in some transportation system. A network is a digraph D with two specified vertices s (which is called the source) and t (which is called the sink), and a function c (called the capacity) which assigns to each arc of the network a non-negative integer. A flow on D is a function f from E(D) into the non-negative integers such that the following two conditions are satisfied:

$f(e) \le c(e)$ for all $e \in E$, and

$\sum_{uv \in E} f(uv) = \sum_{vw \in E} f(vw)$ for all $v \in V(D) - \{s, t\}$.

The first condition merely states that the flow cannot exceed the capacity of the arc, and the second condition requires that the flow into an intermediate vertex is the same as the flow out of the vertex. The object is to find a maximum flow from the source s to the sink t. The value to be maximized is the net flow out of s, which is $\sum_{sv \in E} f(sv) - \sum_{vs \in E} f(vs)$. An example of a network with a flow is given in Fig. 16. The first number of the pair by each arc is the flow and the second number is the capacity. The net flow in this example is 4.

FIGURE 16 Network.

A cut in a network D is a partition of the vertices V into two subsets S and S̄ with s ∈ S and t ∈ S̄. The capacity of the cut is the sum of the capacities of the edges from S to S̄. The following is called the max-flow min-cut theorem.

Theorem 11.1 (Ford-Fulkerson): In a network the value of a maximum flow is equal to the capacity of a minimum cut.

There is a good algorithm for finding a maximum flow in a network. It is based on starting with any flow (say, the 0 flow) and successively increasing this flow by finding a path (with possibly some arcs reversed) from the source to the sink on which the flow can be increased. If at any step this cannot be done, the algorithm generates a cut whose capacity is equal to the flow. This, of course, implies that the flow is maximum.

The networks considered here were single-source and single-sink networks. This is not a restriction, since it is simple to go from a multiple-source and multiple-sink network to a network with a single source and a single sink. A new network can be constructed by adding a “super source” which is adjacent to all of the sources and a “super sink” which is adjacent from all of the sinks. A solution to the maximum flow problem for this single-source and single-sink network will give a solution to the maximum flow problem for the multiple-source and multiple-sink network.

There are many applications and consequences of the max-flow min-cut theorem. The matching result of Hall and the various forms of Menger’s theorems for both graphs and digraphs can be proved using this result.

Digraphs with labels on the directed edges, such as the network just described, can be used as models for many situations. In operations research, activity networks are useful as models for the planning of activities of large projects. A digraph with a single source and sink (representing the beginning and end of the project) will have directed edges which represent the activities of the project. The directed edges also indicate which activities must be completed before others can be started. The label on each edge will represent the time required to complete that activity. Algorithms which calculate path lengths can be used to determine optimal schedules for such projects and to determine which activities are critical in keeping such a schedule.

XII. RANDOM GRAPHS

Paul Erdős first introduced probabilistic methods to graph theory in the 1940s, and this powerful technique of studying “random graphs” has had an extraordinary impact on the development of graph theory. There are several models for studying random graphs, but we will discuss just one. Consider the collection of all labeled graphs (the vertices are distinguishable) of order n on the vertices {1, 2, . . . , n}. The complete graph on the set of vertices has $N = \binom{n}{2}$ edges, and the number of different labeled graphs is $2^N$, since there is the choice of either choosing or rejecting each of the N edges. Given a number p with 0 ≤ p ≤ 1, this collection of graphs becomes a probability space, which we will denote by G(n, p), by choosing each edge independently with probability p. Therefore the probability that a random graph in G(n, p) is a fixed graph G that has m edges is $p^m (1 - p)^{N - m}$.

Using the probability space G(n, p), it is possible to prove the existence of a graph with certain properties without actually exhibiting the graph. This is done by counting the number of graphs that do not satisfy a
specified property and showing that this number is smaller than the total number of graphs. An example of this type of result is the following result of Erdős concerning an exponential lower bound for the Ramsey number $r(K_s, K_s)$.

Theorem 12.1 (Erdős): There is a positive constant c such that $r(K_s, K_s) > c\,s\,2^{s/2}$.

The lower bound of the previous theorem has been improved using random graphs (probabilistic techniques), but at this point no graphs have actually been described that satisfy the bound. It was mentioned earlier that given positive integers k and m there is a graph G with chromatic number χ(G) ≥ k and girth g(G) ≥ m. Again, the major tool used in this proof is the probabilistic method.

If R is a graphical property, then the probability that a graph G in G(n, p) has property R will be denoted by P[R ∩ G(n, p)]. We say that almost all graphs have property R if $\lim_{n \to \infty} P[R \cap G(n, p)] = 1$. There is the corresponding definition for almost no graphs. As the value of p increases, the expected number of edges of a random graph G in G(n, p) will also increase. Thus, if p is a fixed positive real number, then the expected number of edges in a random graph G in G(n, p) is $p \binom{n}{2}$, which is substantial. Therefore, these graphs are dense, so it is not surprising that the following is true.

Theorem 12.2: If p is a fixed positive number (0 < p < 1), then for any fixed positive integer k, almost all graphs [in the probability model G(n, p)] have the following properties (among others):

(i) Diameter 2
(ii) Contain any fixed finite graph H as an induced subgraph
(iii) k-Connected
(iv) Hamiltonian
(v) Genus exceeding k
(vi) Chromatic number exceeding k

It is obvious that as p increases, the number of edges of a random graph in G(n, p) increases, but it is not obvious that random graphs undergo significant structural changes as the number of edges increases. This behavior can be described in terms of threshold functions for a graphical property in the probability space G(n, p) for the probability p = p(n). If R is a graphical property, then t(n) is a threshold function for property R if almost all graphs have property R when $\lim_{n \to \infty} p(n)/t(n) = \infty$ and almost no graphs have property R when $\lim_{n \to \infty} p(n)/t(n) = 0$. The following two results are examples of threshold functions for the property of being “connected” and the property of being “hamiltonian.”

Theorem 12.3: Assume that $\lim_{n \to \infty} g(n) = \infty$, and let $p_1(n) = [\ln n - g(n)]/n$ and $p_2(n) = [\ln n + g(n)]/n$. Then, almost no graphs in $G[n, p_1(n)]$ are connected, and almost all graphs in $G[n, p_2(n)]$ are connected.

The previous theorem implies that $t(n) = (\ln n)/n$ is the threshold function for connectivity, and the next result implies that $t(n) = (\ln n + \ln \ln n)/n$ is the threshold function for hamiltonicity.

Theorem 12.4: Assume that $\lim_{n \to \infty} g(n) = \infty$, and let $p_1(n) = [\ln n + \ln \ln n - g(n)]/n$ and $p_2(n) = [\ln n + \ln \ln n + g(n)]/n$. Then, almost no graphs in $G[n, p_1(n)]$ are hamiltonian, and almost all graphs in $G[n, p_2(n)]$ are hamiltonian.

There are corresponding threshold function results for a wide range of graphical properties.

SEE ALSO THE FOLLOWING ARTICLES

ALGEBRA, ABSTRACT • DISCRETE MATHEMATICS AND COMBINATORICS • GRAPHICS IN THE PHYSICAL SCIENCES • PROBABILITY

BIBLIOGRAPHY

Alon, N., and Spencer, J. (1992). “The Probabilistic Method,” Wiley, New York.
Beineke, L. W., and Wilson, R. J., eds. (1983). “Selected Topics in Graph Theory 2,” Academic Press, London.
Beineke, L. W., and Wilson, R. J., eds. (1988). “Selected Topics in Graph Theory 3,” Academic Press, London.
Biggs, N. L., Lloyd, E. K., and Wilson, R. J. (1976). “Graph Theory 1736–1936,” Clarendon Press, Oxford, U.K.
Bollobás, B. (1998). “Modern Graph Theory,” Springer-Verlag, New York.
Bollobás, B. (1978). “Extremal Graph Theory,” Academic Press, New York.
Chartrand, G., and Lesniak, L. (1996). “Graphs and Digraphs,” 3rd ed., Chapman & Hall, London.
Chartrand, G., and Oellermann, O. (1993). “Applied and Algorithmic Graph Theory,” McGraw-Hill, New York.
Clark, J., and Holton, D. (1991). “A First Look at Graph Theory,” World Scientific, Singapore.
Gould, R. (1988). “Graph Theory,” Benjamin/Cummings, Menlo Park, CA.
Graham, R. L., Grötschel, M., and Lovász, L., eds. (1995). “Handbook of Combinatorics, I, II,” MIT Press, Cambridge, MA, and North Holland, Amsterdam, The Netherlands.
Palmer, E. (1985). “Graphical Evolution,” Wiley, New York.
West, D. B. (1996). “Introduction to Graph Theory,” Prentice-Hall, Upper Saddle River, NJ.
Wilson, R. J., and Watkins, J. (1991). “Graphs, An Introductory Approach,” Wiley, New York.
P1: GTV Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN007J-297 July 6, 2001 16:57

Graphics in the Physical Sciences

Daniel B. Carr
George Mason University

I. Univariate Guidelines
II. Bivariate Guidelines
III. Multivariate Visualization
IV. Closing Remarks

GLOSSARY

Box plot Graph displaying the outliers, adjacent values, quartiles, and median. The distribution caricature is especially useful as a terse summary and for comparing the distributions of different groups.
Conditioned plots Multiple juxtaposed panels where only the cases that satisfy constraints associated with a panel appear in or define the contents of the panel. Controls for the variation in related variables. Useful for reducing three-, four-, and five-dimensional relationships to a sequence of planar views.
Conditioned maps Multiple panels of juxtaposed maps where only the regions with variables that satisfy constraints associated with a panel appear highlighted in the panel. Useful for controlling the variation of a variable displayed on a map due to one or more independent variables.
Glyph plots Graphs that encode variables as something other than position along an axis. Sometimes useful for showing several variables.
Linked plots A set of panels representing different variables of the same cases. The representations of the variables of the same case are connected by lines, color, point position, or other features that allow matching. Useful for assessing higher-dimensional relationships using lower-dimensional views.
Mean difference plot A scatterplot that plots the difference of paired values on the y-axis and the mean of paired values on the x-axis.
Multidimensional scaling An algorithmic method that represents points in a lower-dimensional space while attempting to preserve the distances between all point pairs. An infrequently used method for embedding data in lower dimensions.
Paired comparison plot A scatterplot of values that have been paired because they are measurements on the same entity or process in two different circumstances.
Parallel coordinate plots Graph with parallel axes for each of the variables. Lines connect the coordinates of the same case between adjacent axes. Especially useful in showing many variables and graphically highlighting subsets of data.
Principal components A method of constructing linear combinations of the data to produce new orthogonal variables. The first new variable has the largest variance. The second new variable has the second largest variance and so on. Commonly used for approximating the data in a lower-dimensional space by dropping the last of the new variables.

Quantile plot Graph that relates the ordered observations in a batch of data to cumulative probabilities.
Quantile–quantile plot Graph for comparing two distributions. Good for several tasks like judging relative tail thickness, ratios of standard deviations, and changes in mean.
Scatterplot matrix Graph displaying scatterplots of all pairs of variables, arranged in a square array with individual variables corresponding to rows and columns. Especially useful in the early stages of studying joint relationships across several variables.
Three-dimensional scatterplot Graph with points plotted against three orthogonal axes. Perceived via a stereo pair view or via motion parallax. Especially useful in seeing three-dimensional structure. Useful as a framework for showing additional coordinates using glyphs.

THE PURPOSE of graphics in the physical sciences is to help scientific and public understanding of the phenomena being studied. The graphics and design issues for the respective audiences differ. This article emphasizes graphics for scientists who are often interested in the discovery and analysis of data.

Scientists involved in discovery can face many challenges. These may include

• The complexity of the phenomena being studied,
• The difficulty of parsimoniously conceptualizing this complexity,
• The logistic and political impediments to collecting adequate, representative data,
• The limits of computational resources,
• The limits of human perception and cognition for understanding multivariate summaries.

Graphics can help in conceptualizing and characterizing the phenomena being studied. The Crick and Watson discovery of DNA’s double-helix structure was a major step forward. Now the exponentially expanding studies of genes can boggle the mind. Graphical bookkeeping (for example, see biochemical pathways at http://www.expasy.ch/cgi-bin/show_thumbnails.pl) helps scientists cope with the ever-increasing details. The possibilities of combinatorial chemistry and genetics are hard to imagine. The physical scientists monitoring the earth also face huge datasets. Consider one multispectral dataset with 30-m resolution over the 8 million km² of the continental United States. A quick calculation indicates that it takes roughly 7000 screen images to examine this data on a 1024 × 1280 monitor. That encodes all of the multispectral information for a 30-m region into the color of a pixel. Appreciating the multispectral resolution requires many times more images. The human visual system may be able to handle on the order of $10^7$ bits per second of visual input, so flashing through the images may not take all that long. However, encoding the information in a way that helps scientists is a real challenge. Norretranders estimates that consciousness is limited to about 16 bits a second. If care is not taken, none of what is seen will get into those precious 16 bits, or whatever gets in might be distorted.

This brief monochrome article cannot do justice to the methods available for quantitative visualization in the physical sciences. One book of over 400 pages catalogues monochrome chart types. The possibilities grow combinatorially from this when considering the addition of color, texture, interactivity, and further composition of forms.

This article presents some of the basic methods associated with the field of statistical graphics, along with a discussion of related cognitive and perceptual principles that help to provide guidance when moving beyond the basic forms. The principles are, of course, not correct in every detail, but provide a reasonable basis for selecting among the options. Before proceeding, a few comments on data and estimates are appropriate.

The varied sources of physical science data include laboratory studies, field samples, and electronically collected data such as satellite imagery and computer simulation. Most data need to be transformed into estimates that are suitable for scientific analysis. Each type of data and collection circumstance poses its own issues in addressing the processes of producing estimates that are worthy of evaluation. Recurrent types of issues include calibration of instruments, scaling of variables, adequacy of variables as surrogates for the desired variables of interest, adjustments for covariates, representative coverage of the population or phenomena of interest, and validation of simulated or indirect estimates.

Graphics can be no better than the estimates presented (unless they are being used to show how bad the estimates are). Readers should look for indications of estimate quality either in the accompanying text or in the graphics. The display of confidence bounds for estimates provides a good indication. The lack of confidence bounds is sometimes a warning that estimates have not been assessed with respect to accuracy (bias) and precision (variability).

The heart of graphics is visual comparison. One common task is to assess estimated density patterns. This involves comparing local densities from different parts of a plot. Another task compares data to a conjectured functional relationship. Other common tasks include comparing estimates against reference values and comparing two or more sets of estimates against each other. Much of graphics design concerns facilitating accurate comparisons.
I. UNIVARIATE GUIDELINES

Assuming accurate comparison is important, it is advantageous to encode estimates so that humans can decode them accurately. Cleveland and McGill (1984) discuss perceptual accuracy of extraction and indicate preferred methods for univariate encoding. In this research subjects judged relative magnitudes of graphically encoded variables. As presented here the results put the graphical encoding methods into three classes, labeled best, good, and poor.

The two best encoding methods represent variables using position along a common scale as shown in Fig. 1 and position along identical nonaligned scales. Humans do well in judging the position of a point relative to a scale. Locating the position of objects is a fundamental visual task. Mapmakers have long used position along a scale as the fundamental encoding for spatial coordinates. Length, angle, and orientation are good encodings. Figure 2 shows that transforming line segments into a standard position converts the task of judging length into a task of judging the position of one endpoint against a scale. While this is not necessarily what people do, the example suggests that judging line length is more complicated than judging position.

FIGURE 1 Position along a scale. This is the preferred encoding for a continuous variable.

FIGURE 2 Transforming lengths. This changes the decoding task from taking differences to judging position along a scale.

Figure 3 shows angle encoding. Rotation of the angles puts them in position for comparison against equivalent angular scales shown in gray. The transformation suggests that while angle comparisons work pretty well, they are more complicated than direct comparison against angle scales. Area, volume, point density, and color saturation are poor encodings. Those familiar with experimental results involving Stevens’ law will not be surprised about poor results for the area and volume encodings. Stevens’ law states that the perceived magnitude of a stimulus follows a power law,

$p(x) = a x^b$

where x is the magnitude of the true stimulus (for example, length, area, or volume), and where the constants a and b depend on the type of stimulus. The range of characteristic exponents b for length is [.9, 1.1], area [.6, .9] and volume [.5, .8]. With an exponent of about 1, people’s perception of length tends to be directly proportional to object length. However, we tend to judge area and volume nonlinearly. Consider comparing areas, one of 4 square units and the other of 1 square unit. With an exponent of .75, the ratio of perceived magnitudes is 2.8 to 1, rather than 4 to 1. We underjudge the large areas relative to small areas. If everyone had the same exponent, graphical encoding could adjust for systematic human bias. However, the values of the exponent b vary substantially from person to person.

FIGURE 3 Transforming angles. This reduces the decoding task from taking differences to judging position along an angular scale.

Providing a
set of reference symbols in a legend helps people calibrate to the intended interpretation, but the preferred strategy is to use better encodings whenever possible.

Weber’s law, a fundamental law in human perception, also has significant ramifications in terms of accurate human decoding. A simple example gives the basic notion of the law. The probability of detecting that a 1.01-in. line is longer than a 1-in. line is nearly the same as the probability of detecting that a 1.01-ft line is longer than a 1-ft line. In absolute terms .01 in. is much smaller than .01 ft. Using a finer resolution scale allows more accurate judgments on an absolute scale. For example the tic marks on a ruler help us make more accurate assessments. The graphical equivalent uses grid lines to provide a finer resolution scale for more precise comparisons. In interactive graphics, zooming in effectively provides a finer scale. Computer human interface implementations often provide sliders that allow the reader to change the reference scale and make accurate judgments.

FIGURE 4 Density estimate. This shows construction of a kernel density estimate from five points by averaging bumps.
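
The area example given with Stevens’ law in Section I can be reproduced numerically. The sketch below is illustrative only: the function name `perceived_ratio` is hypothetical, and the exponents are the values and ranges quoted in the text. Because the constant a cancels when two perceived magnitudes are divided, only the exponent b matters.

```python
def perceived_ratio(true_ratio, exponent):
    """Ratio of perceived magnitudes under the power law p(x) = a * x**b.

    For two stimuli x1 and x2, p(x1) / p(x2) = (x1 / x2) ** b, so the
    constant a drops out of the comparison.
    """
    return true_ratio ** exponent


# Area example from the text: true ratio 4 to 1, exponent .75.
area_example = perceived_ratio(4, 0.75)

# The same comparison at the ends of the quoted range for area, [.6, .9].
area_low = perceived_ratio(4, 0.6)
area_high = perceived_ratio(4, 0.9)

print(f"b = .75: {area_example:.1f} to 1")
print(f"b = .6 to .9: {area_low:.1f} to {area_high:.1f}")
```

The spread between the two ends of the range illustrates why a single correction for perceptual bias cannot suit every reader.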
We render most graphics on a plane. We could show values of a continuous variable as points on a line. However, we do not usually do so. We may see gaps between points quite well, but we do not judge point density very well. Consequently the standard approach for this simple case is to estimate the density and show it using a bivariate plot.

II. BIVARIATE GUIDELINES

Edward Tufte notes that it took mankind over 5000 years to generalize from early clay tablet maps to representing general variables using a scatterplot. A scatterplot is an excellent representation since the two orthogonal axes allow two coordinates to be independently encoded as position along a common scale. The popular press in the United States still considers the scatterplot too complicated for the general public, but in the sciences the scatterplot is the standard for representing continuous bivariate data. Common bivariate activities include assessing univariate distributions, comparing univariate distributions, and looking for functional relationships.

A. Univariate Density Estimates

When we are interested in data density it is advantageous to compute and show the density as accurately as possible. Figure 4 illustrates the construction of a kernel density estimate based on a sample of five univariate values. The locations of the white triangles relative to the x-axis indicate the magnitudes of observed values. The basic idea is that each observed value is a surrogate for values in a neighborhood. We then construct a relative likelihood for a neighborhood about the value. Figure 4 shows the five likelihoods (or kernels) as bell-shaped curves, one in each of the top panels. For each location where we want to estimate the data density (the x locations of white lines in Fig. 4) we simply average values from the five constructed likelihoods. The white lines indicate the locations of the density estimates. In the bottom panel each white line is the average height of all the white lines directly above it. (When panels have no white lines directly above, they contribute zero to the average.) The construction is straightforward.

Scott (1992) provides the theory behind density estimation. For a valid density estimate, the kernel needs to integrate to 1. The hard part is in deciding how wide to make the kernel. Scott describes methods for making this decision.

B. Quantile Plots

The integral of the estimated density up to a value approximates the cumulative probability of observations being less than or equal to that value. A cumulative distribution plot plots the sorted values on the x-axis and cumulative probabilities on the y-axis. A quantile plot is essentially the same plot transposed. For quantile plots the x-axis shows cumulative probabilities and the y-axis shows the sorted data values. Figure 5, a quantile plot, shows the pairs $(p_i, q_i)$ where $p_i$ is the cumulative probability and $q_i$ represents the ith smallest observation. Cleveland’s convention connects the point pairs as shown.

There are different methods for approximating cumulative probabilities. The order statistics approach here follows Cleveland (1993) and uses the expression (i − .5)/n for i = 1, . . . , n to calculate the cumulative probabilities, where n is the sample size. The probability corresponding
2 •••••
••
••
••• Sept. 1992
••
••
••
••
•••••••••••
•
0 q=-.31
••
•••
••
••
••
•••••••••••••••
•
Quantiles

•
•••••
• Sept. 1982
••
••
••
••
•••••••••••••
•••••
-2 ••••
••• p=0.5
-.1 .2 .5 .8
NDVI from Africa
-4 FIGURE 6 Box plots. This compare two distributions of values
taken from Fig. 8. Vertical lines show the medians. First and third
quartiles bind the thicker gray rectangles. Adjacent values bind
• the thinner gray rectangles. No outlines appear in the example.
0.1 0.3 0.5 0.7 0.9 When interior white rectangles do not overlap, medians are not
significantly different.
Cumulative Probabilities
FIGURE 5
to the smallest observation in a sample of size 10 would be (1 − .5)/10 = .05. An interpretation is that if the observations were from a random sample of size 10, the probability of a future observation being smaller than the currently observed smallest observation is .05.

Another common textbook approach for calculating cumulative probabilities is the empirical method. For this method the cumulative probability for a value is the fraction of observations less than or equal to the value. This approach treats the curve as a step function jumping at each observation. The step function jumps from 0 to 1/10 at the smallest observation in a sample of size 10. This suggests that the probability of a future observation being smaller than the smallest one already observed is 0. The sample does not allow determination of the true probability. The probability .05 seems reasonable. As the sample gets larger the discrepancy between the two calculation approaches gets smaller.

Interpretation is straightforward. For any probability covered on the x-axis it is possible to determine a quantile. To obtain the .5 quantile (or estimated median) go straight up from .5 on the x-axis to the curve and then straight across to the y-axis to read off the quantile (q = −.31 in Fig. 5). Starting with .25 and .75 yields the corresponding quantiles, also known as the 1st and 3rd quartiles, or 25th and 75th percentiles, respectively. Similarly one can go from quantiles to cumulative probabilities. Since scientists use such plots to describe convenience collections from a population as well as random samples from a population, the trickiest interpretation task is often to decide if the probabilities really extend to a larger population.

Quantile plots are helpful on maps and provide a frame of reference for observing change over time. For example, one can tell if the 50th percentile (the median) has changed. For maps it is possible to save space by representing selected cumulative probability and quantile pairs using a parallel coordinate approach.

C. Box Plots

The box plot is a distribution caricature that has achieved wide acceptance. While used to represent individual distributions, the common use is to compare distributions. Figure 6 shows two box plots. The features shown include the median, quartiles, and adjacent values. The notion of adjacent values sets a bound on what will be called outliers and warrants a note on how they are calculated. The upper adjacent value is the largest observation less than the 3rd quartile + 1.5 times the interquartile range (the 3rd quartile − 1st quartile). The lower adjacent value is determined correspondingly. Outliers, if any, are more extreme than the adjacent values. Not all box plot variations are the same. Some just show the maximum and minimum values rather than adjacent values and outliers. The design variation in Fig. 6 has an additional feature. It uses a white line to provide comparison intervals for the medians. If two comparison intervals do not overlap the medians are significantly different.

D. Quantile–Quantile Plots

Some authors consider quantile–quantile (QQ) plots the preferred graphic for making detailed continuous distribution comparisons. For theoretical distributions, the cumulative distribution function, F(), provides the correspondence between the probability and quantile pairs via p = F(q). When F() is a strictly increasing function the quantile function, Q(), is the inverse of F() and Q(p) = q. Familiar pq pairs from the standard normal distribution
are (.5, 0) and (.025, −1.96). Comparison of two distributions, denoted 1 and 2, proceeds by plotting quantile pairs (Q1(p), Q2(p)) over a range of probabilities, such as from .01 to .99 in steps of .01. For two distributions of observed data, the calculations described previously for quantile plots are appropriate. Figure 7 shows a QQplot for two sets of data. The x-axis shows quantiles from Set 1 and the y-axis shows quantiles from Set 2. Sometimes statisticians choose Q1(p) to be from a theoretical family of distributions, such as the normal (Gaussian) family, to see if parametric modeling using that family is reasonable.
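
As a minimal sketch of the calculations described for quantile plots and QQplots, the following Python code computes the (i − .5)/n cumulative probabilities, the empirical alternative i/n, and quantile pairs for a two-sample QQplot. The function names, the use of NumPy, and the linear interpolation between order statistics are illustrative assumptions, not the procedure used to produce the figures in this entry.

```python
import numpy as np


def plotting_positions(n):
    """Cumulative probabilities (i - .5)/n for i = 1, ..., n."""
    i = np.arange(1, n + 1)
    return (i - 0.5) / n


def empirical_positions(n):
    """Empirical method: fraction of observations <= the ith smallest, i/n."""
    i = np.arange(1, n + 1)
    return i / n


def qq_pairs(set1, set2, probs=None):
    """Return (Q1(p), Q2(p)) for two samples of observed data.

    Quantiles are read from each sample's quantile plot by linear
    interpolation (an assumption of this sketch). Probabilities outside
    the range of the plotting positions are clamped to the smallest or
    largest observation, which is how np.interp treats out-of-range x.
    """
    if probs is None:
        probs = np.linspace(0.01, 0.99, 99)  # .01 to .99 in steps of .01

    def quantiles(data):
        q = np.sort(np.asarray(data, dtype=float))
        p = plotting_positions(q.size)
        return np.interp(probs, p, q)

    return quantiles(set1), quantiles(set2)


# Smallest observation in a sample of size 10: .05 versus 1/10.
print(plotting_positions(10)[0], empirical_positions(10)[0])
```

Plotting the second returned array against the first gives the QQplot; when Set 2 is a scaled and shifted copy of Set 1, the pairs fall on a straight line whose slope and intercept are the scale factor and the shift.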
A merit of QQplots is that in simple cases they have a nice interpretation. If points fall on a straight line then the distributions have the same shape, differing only in the first two moments. This is the case in Fig. 7 since the robust fit (thin) line matches the quantiles quite well.

The slope and intercept of the approximating straight line tell about the discrepancies in the second moment (standard deviation) and first moment (mean). The slope of the thin line tells about the ratio of standard deviations. The thick line in the figure is the reference line for identical distributions. The lines are not quite parallel in Fig. 7 so the standard deviations are not quite the same. Multiplying Set 2 quantiles by a factor and then adding a constant changes the slope and intercept of the robust fit line. Graphical input and visual assessment provide a quick way to find the two numbers that make the lines match. These two numbers then indicate the ratio of the standard deviations and the difference in the means after the Set 2 scale adjustment. In Fig. 7 the lines are nearly parallel so a reasonable guess is that the distributions differ in mean by about .5.

QQplots avoid the visually deceptive procedure of superimposing two cumulative distribution functions or two survival curves. As Fig. 8 suggests, we are really poor at judging the distance between curves. Our visual processing assesses the closest differences between curves rather than the correct vertical distances. This deficiency applies to comparing spectra and time series as well. Adding grid lines can help, but it is often better to plot the difference explicitly or make distribution comparisons using QQplots.

FIGURE 7 Quantile–Quantile (QQ) Plot. This compares distributions. When the points fall near a straight line, the distributions have the same shape and differ only in the standard deviation and mean. [Panel title: Two-Sample QQPlot. Thin line = robust fit; thick line = same distribution line. x-axis: Set 1; y-axis: Set 2.]

FIGURE 8 Explicit difference of two curves. Humans tend to see closest differences between curves, not differences parallel to the y-axis.

E. Direct Comparison of Two Distributions and the Mean Difference Plot

Before and after comparisons are common in science. The basic idea is to control for variation in experimental units by studying the change of experimental unit values. This differs from QQ plots in that the study unit is the basis for pairing values rather than cumulative percentages. Figure 9 shows a paired comparison plot for two
low-resolution satellite images of the same region. The top two images show NDVI values on a 1° grid. NDVI stands for normalized difference vegetation index and provides a measure of greenness. The traditional reference line for equality is a 45° line through the origin. John Tukey suggested making the reference line horizontal by plotting differences on the y-axis and means on the x-axis. The bottom of Fig. 9 shows a mean difference plot. Making transformations to simplify the visual reference and reduce the visual calculation burden is an important graphic design principle.

FIGURE 9 Pair Difference Plot (bottom panel). Top panels represent the Normalized Difference Vegetation Index (NDVI) values on a 1° grid using gray levels. The images were taken about a decade apart. The bottom panel shows the difference of location paired values. The mean and difference plot shows the mean on the x-axis and the difference on the y-axis. The reference line is y = 0. This plot handles many thousands of points by hexagon binning. Larger hexagon symbols show more counts. While there are many exceptions, the smooth line indicates a tendency for 1992 values to be larger.
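
The mean difference transformation can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name is hypothetical, the difference is taken as later minus earlier, and the four value pairs are made-up numbers, not the NDVI data of Fig. 9.

```python
import numpy as np


def mean_difference(before, after):
    """Return (mean, difference) for paired values.

    A point on the 45-degree equality line (after == before) maps to
    difference 0, so the reference line becomes horizontal.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    if before.shape != after.shape:
        raise ValueError("paired values must have the same shape")
    mean = (before + after) / 2.0   # plotted on the x-axis
    diff = after - before           # plotted on the y-axis
    return mean, diff


# Hypothetical paired values for the same four locations.
earlier = np.array([0.10, 0.30, 0.50, 0.70])
later = np.array([0.12, 0.38, 0.61, 0.72])
mean, diff = mean_difference(earlier, later)
print(mean)
print(diff)
```

Points with positive differences lie above the horizontal reference line, so an overall increase shows up as an upward shift rather than as a small departure from a diagonal.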
The data for the panels in Fig. 9 consist of over 27,000 point pairs. Overplotting in the bottom panel would cause a problem. Since we do not judge density well the bottom panel bins the data into hexagon bins and represents the count with the size of the hexagon symbol plotted. Such binning is a fast but rudimentary way to obtain a bivariate density estimate. Carr lists the merits of hexagon binning over square binning. An alternative view estimates the density surface on a higher-resolution bivariate grid and displays the surface with perspective views or contours.

The line in the bottom panel of Fig. 9 shows a smooth of the data. The upward shift indicates increased greenness in 1992. Symbols below the zero reference line for no change indicate that the change is a decrease in some locations. The smooth indicates that the increase tends to be highest for intermediate values of NDVI.

F. Functional Relationships and Smoothing

When y is considered a function of x, common practice is to enhance scatterplots of (x, y) pairs by adding a smooth curve. This is done in Fig. 9 to see if the difference is related to the average of the two values. To avoid the considerable human variability in sketching a fit, the standard procedure is to model the data using a procedure that others can replicate. Figure 10 shows a scatterplot with the smooth line generated using loess (see Cleveland et al. for more details). Loess fits the data using weighted local regression. That is, the regression uses data local to x0 to predict a value at x0. Points closest to x0 receive the greatest weight. The use of many local regressions produces a set of pairs (x, y) that are connected to produce the smooth. Each local regression in the smooth shown in Fig. 10 used a linear model in x and included the closest 60% of the observations to the prediction point x0. Such a smooth is reproducible. The smooth in Fig. 10 draws further attention to the distinction between ocean and land states, and additional modeling is appropriate.

FIGURE 10 A smooth. The smooth curve suggests a functional relationship. The two types of points suggest there are two different functional relationships involved. [Panel title: 1950-1959 State Mortality Rates. x-axis: Latitude in Degrees; y-axis: Melanoma Mortality Per Million. + = Ocean State; o = Land State.]

Smoothing is an extremely important visual enhancement technique. It helps us see the structure through the noise. The decomposition of data into smooth and residual parts is a fundamental technique in statistical modeling. Hastie and Tibshirani provide a good introduction to a variety of smoothing methods.

Numerous smoothers are available. Historically, many researchers used cubic splines as smoothers. Cubic splines have a continuous second derivative and that is sufficient to make curves appear smooth to humans. The elegant mathematical formulation behind splines increased their popularity within the statistical community. However, there is no a priori best smoother. New methods, such as wavelet smoothing, keep appearing in statistical software. Recently developed wavelet smoothers are supposed to be better than many smoothers at tracking discontinuities in the functional form. However, the old local median smoothers still do well at handling discontinuities.

Smoothers typically have a smoothing parameter that needs to be estimated or specified by the user. With computational power at hand, cross-validation methods have become increasingly popular as a community standard. This reduces the judgment burdens on the analyst. However, this does not guarantee a match between an empirical curve and a hypothesized true but unknown underlying curve.

III. MULTIVARIATE VISUALIZATION

Visualization in the physical sciences is inherently multivariate. Scientists are interested in the relationships among many attributes and the attributes have space–time coordinates. The purpose of multivariate graphics is to show multivariate patterns and to facilitate comparisons. As in low dimensions, the patterns often concern data distributions or models with at least one dependent variable.

After converting attributes and their space–time coordinates to images for evaluation, human visual comparisons
typically fall into three categories: comparison of external images to each other, comparison of external images to external visual references, and comparison of external images to the analyst’s internal references. Internal references include scientific knowledge, statistical expectations, and process models. The visual investigation process often involves converting internal references into external visual references subject to further manipulation. With external images and references available, the next step often involves graphical transformation to simpler forms in terms of our perceptual–cognitive processing abilities.

Multivariate graphics must deal with the problem of showing higher-dimensional relationships and, as in lower dimensions, often have to deal with the noise that obscures patterns and the overplotting due to many points.

A. Graphical Design Principles and Resources

The description indicates the increased complexity as we move to higher dimensions. At the same time, Kosslyn warns that “The spirit is willing but the mind is weak.” We should approach the multivariate challenge prepared to do battle. Our arsenal of tools includes design principles that help us in uncharted territory. Some of our tools include:

• Distributional caricatures such as box plots to help us deal with large data sets
• Map caricatures that let us show small multiples
• Modeling to reduce noise and complexity
• Layering and separating to constrain and guide the information flow
• Partitioning and sorting to simplify appearance and facilitate comparisons
• Linking low-dimensional views to get insight into higher-dimensional relationships.

The basic formats for comparison graphics include juxtaposition, superposition, or the direct display of differences. The art of multivariate graphics is to select the methods and enhancements that work best in view of the phenomena’s complexity and in view of human perceptual and cognitive limitations.

This entry suggests just a few representation tools and principles. Good references include books by Cleveland, Kosslyn, Wilkinson, and Ware. The books by Tufte provide wonderful examples of putting design principles into practice. MacEachren provides a good introduction on mapping and Cleveland and McGill provide an early survey on dynamic multivariate graphics that go beyond the scope of the entry. As more work is done in computing environments, issues of the computer human interface become increasingly important. Card et al. edited a book of readings that gathers many important concepts.

The entry only briefly describes color since it is not an option here. However, color provides an important tool. Different hues provide a good way to distinguish categories of a categorical variable. The cartographic guidance limits categories or hues to six or fewer. Humans are very sensitive to a second dimension of color, a dark-to-light scale that is referred to in the literature by terms such as value, lightness, or brightness. Gray levels provide an ordered scale, so it is conceivable to represent a continuous variable using value or gray level. Humans are less sensitive to the dimension of color called saturation. A saturation scale going from an achromatic color such as medium-gray to a vivid red is also an ordered scale. However, humans can make fewer saturation distinctions than lightness distinctions. Generally color is a poor choice for representing a continuous variable. Humans perceive hue and brightness as integral dimensions, so we should not use them to encode two or three variables. Brewer et al. discuss additional considerations that apply to the color-vision impaired.

Since most people can work with four chunks of information, this article suggests attempting to fit the guidelines into four broad categories of quantitative design principles:

• Use encodings that have high perceptual accuracy of extraction
• Provide context for appropriate interpretation
• Strive for simple appearance
• Involve the reader

These organizing categories contain some conflicting guidelines. For example, a long list of caveats may provide the context for appropriate interpretation but conflict with simple appearance and reader involvement. Composing graphics that balance among the guidelines remains something of an art form. The communication objectives influence the balance.

B. Communication Objectives

Multivariate graphics can have many different communication objectives. Four common objectives include providing an overview, telling a story, suggesting hypotheses, and criticizing a model. In providing an overview, broad coverage is important. Achieving clarity often involves hiding details and working with caricatures. Similarly, in telling a story the predetermined message must shine through. Scientists often fail to tell simple stories because they are eager to list caveats and a host of details that qualify the basic results. Interactive web graphics can alleviate the archival side of this problem by providing ready access to metadata, supplemental documents, and gigabyte databases. However, it still takes careful design to draw readers to the details.
This entry emphasizes graphics of discovery. Discovery objectives include suggesting hypotheses and criticizing models. The graphics in this article can often be used to display residuals from models and to review patterns in the residuals. For discovery, it is often crucial to see through the known and miscellaneous sources of variation. In the context of mortality mapping, John Tukey said, “the unadjusted plot should not be made.” Today mortality maps control for known variation by being sex and race specific. The maps control for age by limiting the age range or by showing adjusted statistical rates. In the physical sciences there are often known factors that contribute to variation that warrant making adjustments. Thus the graphics of discovery often show model residuals.

C. Surface Views

In some ways multivariate visualization is similar to lower-dimension visualization. The general tasks are still to show densities and functional relationships. The difference is that more variables including spatial and temporal indices are called into play. The density of bivariate points constitutes a third coordinate. The basic idea in bivariate kernel density estimation is similar to that for univariate estimation: average local likelihoods. The key difference is that the local neighborhood is bivariate. The result is a surface z = f(x, y). Figure 11 shows the contours of a bivariate density surface. Figure 12 shows a wireframe perspective view.

FIGURE 11 A contour plot. This plot represents the computed density of bivariate points. [x-axis: x; y-axis: y.]

FIGURE 12 A perspective view or wireframe plot. This surface shows the density of bivariate points. [Axes: x, y, z.]

Estimating functional relationships of the form z = f(x, y) also follows the pattern established with one less variable. In the context of maps some approaches to modeling address the issue of spatial correlation. Common monochrome views again include contour and wireframe plots. Fully rendered color surfaces with highlights provide a better view than wireframes. Pairing of contour and surface plots can aid understanding. The surface tends to provide a good overall impression (except for what is hidden) while the contours help to locate features, such as local extrema, on the plane. Wireframes and translucent surfaces can be superimposed on maps or contour plots. Perspective views appeal to many people and have also been used to show local values, as in animated flyovers of three-dimensional bar charts as well as surfaces. However, perspective foreshortening complicates accurate decoding of values and comparisons. For a detailed study of surfaces, Cleveland recommends dropping back a dimension. By conditioning on intervals of x or y it is possible to study strips of the surface using two-dimensional plots.

For comparing surfaces there are three standard approaches. One is to superimpose surfaces, for example, rendering them in translucent color. A second is to juxtapose displays of two surfaces. The third option of calculating and showing the difference of two surfaces is often best.

D. Distance Judgments and 3-D Scatterplots

In the multivariate context, accurate interpoint distance judgments are crucial to geometrically based visual interpretation. For two points in a scatterplot the cognitive task becomes one of judging the length of an implicit line between the two points. As indicated previously, humans
perceive length in a plane with good perceptual accuracy of extraction. For higher-dimensional representations, assessing interpoint distances becomes increasingly difficult. The position here is that our ability to judge distance in three-, four-, and five-dimensional representations provides a strong basis for ranking point encoding methods.
At first thought, a stereo three-dimensional scatterplot might seem ideal for representing continuous three-dimensional data. Judging distance between two points then equates to judging the length of a line segment. On closer inspection, there is substantial change from two dimensions to three dimensions. Depth perception for the third coordinate derives from horizontal binocular discrepancies (parallax). The horizontal discrepancies involve only a small fraction of our full horizontal field of view. Horizontal visual acuity within this small interval determines how many distinct depth planes we can resolve. This cannot match the horizontal resolution across the full field of view. Stereo three-dimensional plots are less than ideal since humans do not judge depth as accurately as they judge vertical and horizontal position.

While stereo three-dimensional distance judgments are not as accurate as two-dimensional distance judgments, we live in a three-dimensional world and have strong intuitions about three-dimensional relationships. For those with good depth perception the testable conjecture here is that three-dimensional scatterplots allow the most accurate assessment of interpoint distances of the available encodings for three continuous variables.

Historically most analysts created depth views via rotation (motion parallax). This approach is very powerful. The main drawbacks are that moving points are harder to study than stationary points and that interacting with moving data is awkward. Using both stereo and rotation maintains the depth when rotation stops. Overplotting obscures points and some authors suggest we can only see two and one-half dimensions. Mostly one sees the edges of the data cloud. Different viewing angles bring out different edges. Slicing or sectioning helps to reveal the inside of the cloud. In virtual reality (VR) settings one can fly through data clouds and touch points (or density features) to gain additional detail. Since we do not judge density accurately there is good reason to estimate trivariate densities and study the hypersurfaces w = f(x, y, z). Scott provides instructive graphics that simultaneously show three contours of hypersurfaces.

FIGURE 13 A scatterplot matrix. The four-dimensional points fall on a hypersurface defined by w = f(x, y, z). The point cloud edges in the last row and column suggest the data are a lower-dimensional object embedded in four dimensions. The cloud interiors do not show much.

E. Scatterplot Matrices, Parallel Coordinates, and Stereo Ray Glyphs

If we have points on a surface w = f(x, y, z) there are many different ways to view the points. It is instructive to consider how the different views reveal that the data are from a surface and not points distributed throughout four dimensions. The three views considered here are scatterplot matrices, parallel coordinates, and stereo ray glyphs.

Figure 13 shows a scatterplot matrix. The panels in the matrix show scatterplots for all pairs of variables. Some variations of the scatterplot matrix show only the upper or lower triangle of panels since otherwise the same point pairs appear (transposed) in a second panel. In fact, thoughtful construction allows two data sets to be juxtaposed using upper and lower triangles. Returning to the task at hand, the edges of the cloud in the panels of the last row and column of the scatterplot matrix suggest there is some kind of relationship w = f(x, y, z).

Figure 14 shows a parallel coordinate plot. This has four parallel axes. The plot scales each variable to the interval [0, 1] and equates this to the length of the corresponding axis. Each component of a four-dimensional point is plotted on the respective axis and the components are connected. The clue about the relationships is limited to the line patterns in the last pair of axes.

Figure 15 shows a stereo ray glyph plot in a side-by-side stereo pair view. The three coordinates are represented by the point’s three-dimensional location. The angle of the ray represents the fourth coordinate. The small size of a side-by-side stereo plot complicates views, but the smooth variation of the ray angle locally in three-dimensional space is evident. The locally smooth variation of the ray angle indicates there is a relationship.

As a demonstration of the inferior nature of nonpositional glyphs in low dimensions, one can generate, say
a thousand points of three-dimensional data embedded in four dimensions. That is, select the triples (u, v, w) randomly, say from a normal distribution, and then let x1 = f1(u, v, w), x2 = f2(u, v, w), x3 = f3(u, v, w), and x4 = f4(u, v, w) where the four functions are simple distinct polynomials. The structure in the stereo ray glyph plot will immediately suggest the existence of a constraint and the fact the data are not four-dimensional. The ability to observe smooth variation can easily be masked by noise in real data. Still it should be clear that the scatterplot matrix and the parallel coordinate plot do not help us to see in three dimensions as well as three-dimensional scatterplots do. Our brains are not wired for the task of constructing higher-dimensional views from projections in lower dimensions.

FIGURE 14 A parallel coordinate plot. The same four-dimensional points as in Fig. 13. The weak hints of a dimension reducing constraint appear only in lines between the z and w axes. The plot is useful for graphical input and for viewing many coordinates. [Axes: X, Y, Z, W.]

FIGURE 15 A stereo ray glyph plot. The ray points down to show small values and up to show large values. Some people can fuse the stereo pair images without a viewer. The smooth local variation of the ray angle strongly suggests that lower dimensional structure is embedded in four dimensions.

Few researchers have seriously tackled the visualization of six-dimensional data. Bayly and co-authors provide a notable exception. They successfully used colored stereo ellipsoids to evaluate problems in improving an electrostatic potential model. Their article includes color side-by-side stereo figures. In a long sequence of efforts they failed to obtain insight using a wide variety of encodings. Their eventual success suggests that using position along a scale to represent three variables has merit.

As we move beyond four dimensions, our ability to judge interpoint distances deteriorates quickly. We are forced to use clustering algorithms and other computational methods to bring out patterns. Simple graphical methods do not work well except when there is a very simple geometric structure embedded in high-dimensional space. For complex multivariate structure, Furnas and Buja indicate how we can lower the viewing dimensionality by slicing or sectioning (adding linear constraints). We can also condition on categorical variables. This divide-and-conquer approach can reveal a plethora of patterns, but it is only the rare individual that can integrate a catalog of low-dimensional views into a coherent higher-dimensional framework. Few people claim to fully understand the richness of even four-dimensional structures.

F. Glyphs and Layouts

Glyphs provide one way to represent observations with several variables as points. The climate description maps of the U.S. Environmental Data Service have some interesting glyphs. These include a wind rose glyph showing 16 variables representing the percentage of time the wind is blowing in the major compass directions, and bar chart glyphs showing 12 variables indexed by month of the year. In both these cases the glyph controlling variables are statistical summaries. It is not a requirement that variables controlling glyph characteristics be direct measures.

Chernoff faces seem to be the glyphs that draw the most attention. Different variables control different facets of caricature faces. Examples appear in Fig. 16. Since a significant portion of our brain is devoted to face recognition, the encoding idea intrigues many people. While humans get a gestalt impression that they can often relate to an emotion, it is difficult to respond to the individual components of the face. Few glyphs see much use. The star glyph, a variant of a parallel coordinate plot for a single case, seems to appear with some frequency.

A simple example plotting a few hundred faces or stars representing points on a hypersurface is instructive. The hypersurface relationship remains obscure. That is discouraging. In contrast to the map glyphs introduced previously, glyphs are often laid out arbitrarily on a regular grid. This wastes the favored representation, position along a scale, for two coordinates. Positioning possibilities that can lead to overplotting include using the first two principal components of multivariate data, and
multidimensional scaling provides another option. There are additional options that avoid overplotting.

G. Linked Plots

Linking points across plots provides a way to connect cases whose variables are represented in different plots. Linking provides a weaker binding of the multivariate observations than glyphs. Linking methods include linking by lines, colors, names, pointers, and spatial linking by juxtaposition. The following discussion emphasizes line linking and color linking.

The parallel coordinate plot (see Fig. 14) provides an example of linking with lines. The lines connect components of the same case across the axes showing the different variables. In many applications of line linking the lines overplot and this complicates decoding the plot.

FIGURE 16 Three types of glyphs. Here a face, profile, and star glyph represent nine coordinates. In this case the coordinates represent a gene’s mRNA production levels at nine animal ages.

dimensional wireframe plots or other higher-dimensional plots. Many people readily understand one- and two-way conditioning and the corresponding layout of panels. For example, defining each of two conditioning variables as low, medium, and high leads to a 3 × 3 layout of panels showing other variables of interest. Thus conditioned plots are one of the more effective ways to study relationships involving three to five dimensions.

IV. CLOSING REMARKS

This entry provides a description of several templates for low-dimensional graphics. When datasets have higher dimensions there are a variety of methods that will reduce the dimensions of many datasets to something that we can view using the templates. The methods include principal components, projection, and sectioning. The methods have some limitations including the sizes of data sets to which they apply.

Increasingly, the size, detail, and complexity of data sets overwhelm our direct visualization capacity. Our dimension reduction and case reduction methods are stretched thin and overwhelmed at times. The number of plausibly important views of data exceeds the time we have to look and study. In thinking about the future, John Tukey coined the term “cognostics” (diagnostics interpreted by a computer rather than by a human). The idea was to compute features of merit and have computers rank plots by their potential interest to humans. Researchers have developed projection pursuit methods that help us to find interesting views in lower dimensions. There is much more to be done along this line. The work can be frightening because our
Painting points on a screen provides a common way algorithms can miss details like the hole in the ozone layer.
of linking points using color. A common application in- On the other hand we miss a lot because our viewing is
volves scatterplot matrices. Painting some points in one not optimized. We also miss a lot because we do not re-
panel selects the corresponding cases. The color then ally put computers to work to find the representations that
highlights views of the selected cases in the other panels. are optimal for our understanding! While our visualiza-
This notion extends to different types of panels. When tion methods fall further behind the increasing amounts
there are spatial coordinates one of the panels is often of data, it remains the case that we can do much more than
a map. we are doing now.

H. Conditioned Plots
SEE ALSO THE FOLLOWING ARTICLES
Conditioned plots partition the estimates (or data) into
sets based on classes defined for conditioning variables. COLOR SCIENCE • FLOW VISUALIZATION • GRAPH
The different plots for the sets then appear as juxtaposed THEORY • IMAGE PROCESSING
panels. The visualization task is then to study how the
distributions or function relationships shown in the pan-
els vary across the conditioned panels. Tukey and Tukey BIBLIOGRAPHY
provide an early example called a casement display. Con-
ditioned plots appearing as a one-way or two-way layout Bayly, C. I., Cieplak, P., Cornell, W. D., and Kollman, P. A. (1993).
are typically two-dimensional plots, but they can be three- “A Well-Behaved Electostatic Potential Based Method Using Charge
Restraints for Deriving Atomic Charges: The RESP Model,” J. Phys.
Chem. 97, 10269–10280.
Brewer, C. A., MacEachren, A. M., Pickle, L. W., and Herrmann, D.
(1997). “Mapping Mortality: Evaluating color schemes for choropleth
maps,” Annals of the Association of American Geographers 87(3),
411–438.
Card, K., Mackinlay, J. D., and Schneiderman, B. (1999). “Information
Visualization Using Vision To Think,” Morgan Kaufmann Publishers,
Inc., San Francisco, CA.
Carr, D. B. (1991). Looking at large data sets using binned data plots.
“Computing and Graphics in Statistics” (A. Buja and P. Tukey, eds.),
pp. 7–39, Springer-Verlag, New York.
Cleveland, W. S. (1993). “Visualizing Data,” Hobart Press, Summit, NJ.
Cleveland, W. S. (1994). “The Elements of Graphing Data,” Hobart
Press, Summit, NJ.
Cleveland, W. S., and McGill, R. (1984). “Graphical perception:
Theory, experimentation, and application to the development of graphics
methods,” J. Am. Stat. Assoc. 79, 531–554.
Cleveland, W. S., and McGill, M. E. (eds.). (1988). “Dynamic Graphics
for Statistics,” Chapman and Hall, New York.
Furnas, G. W., and Buja, A. (1994). “Prosection views: Dimensional
inference through sections and projections,” J. Comp. Graph. Stat.
3(4), 323–353.
Hastie, T. J., and Tibshirani, R. J. (1990). “Generalized Additive
Models,” Chapman and Hall, New York.
Kosslyn, S. M. (1994). “Elements of Graph Design,” W. H. Freeman and
Company, New York.
MacEachren, A. M. (1994). “Some Truth with Maps: A Primer on
Symbolization & Design,” Association of American Cartographers,
Washington, D.C.
Scott, D. W. (1992). “Multivariate Density Estimation; Theory, Practice
and Visualization,” Wiley, New York.
Tufte, E. R. (1983). “The Visual Display of Quantitative Information,”
Graphics Press, Cheshire, CT.
Tufte, E. R. (1990). “Envisioning Information,” Graphics Press,
Cheshire, CT.
Tufte, E. R. (1997). “Visual Explanations,” Graphics Press, Cheshire,
CT.
Wilkinson, L. (1999). “The Grammar of Graphics,” Springer-Verlag,
New York.
Wood, D. W. (1992). “The Power of Maps,” The Guilford Press, New
York.
P1: LLL/GKP P2: FYK Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN007C-304 June 30, 2001 17:0

Group Theory
Ronald Solomon
Ohio State University

I. Fundamental Concepts
II. Important Examples
III. Basic Constructions
IV. Permutation Groups and Geometric Groups
V. Group Representations and Linear Groups
VI. Finite Simple Groups
VII. Other Topics

GLOSSARY

Abelian group Group in which the operation is commutative.
Center Subgroup of a given group consisting of all elements which
commute with every group element.
Commutator subgroup Subgroup of a given group generated by the subset
of all commutators.
Cyclic group Group generated by a single element.
Homomorphism Function mapping a group into a group and preserving the
group operation.
Isomorphism One-to-one homomorphism from one group onto another.
Linear representation of a group Homomorphism of a group into the
group of all invertible linear operators on a fixed vector space.
Permutation representation of a group Homomorphism of a group into
the group of all permutations of a fixed set.
Quotient group of a group G Group of all cosets of a fixed normal
subgroup of G.
Representation of a group Homomorphism of the group into the full
group of all symmetries or automorphisms of some fixed mathematical
object.
Simple group Group whose only normal subgroups are the identity
subgroup and the group itself.
Solvable group Group which has a normal series of subgroups such that
each of the factors (quotient groups) of the series is abelian.
Subgroup Subset of a group which is a group under the same operation.
Sylow p-subgroup Subgroup of a group which is maximal with respect to
the property that each element has order a power of the prime p.

A GROUP is a nonempty set G of invertible functions mapping a set X to
itself such that whenever f, g ∈ G, then also $f^{-1} \circ g \in G$.
(There is also a more abstract definition which we shall give in the
next section.) Thus, in particular, G is a subset (subgroup) of the
group S(X) of all invertible functions whose domain and range is the
set X.
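
The defining condition can be checked mechanically on a small example. The following Python sketch is an illustration only, not part of the original article: the helper names `compose`, `inverse`, and `is_group_of_permutations` are hypothetical, and a permutation of X = {0, 1, 2} is stored as a tuple p with p[i] the image of i. It tests whether $f^{-1} \circ g$ lies in G for every f, g in G.

```python
from itertools import permutations

def compose(f, g):
    """Return f ∘ g, the permutation sending i to f[g[i]]."""
    return tuple(f[g[i]] for i in range(len(g)))

def inverse(f):
    """Return the inverse permutation of f."""
    inv = [0] * len(f)
    for i, image in enumerate(f):
        inv[image] = i
    return tuple(inv)

def is_group_of_permutations(G):
    """Criterion from the text: G is nonempty and f^-1 ∘ g is in G for all f, g in G."""
    G = set(G)
    return bool(G) and all(compose(inverse(f), g) in G for f in G for g in G)

S3 = set(permutations(range(3)))                 # all six permutations of {0, 1, 2}
rotations = {(0, 1, 2), (1, 2, 0), (2, 0, 1)}    # identity and the two 3-cycles
not_closed = {(0, 1, 2), (1, 0, 2), (0, 2, 1)}   # identity and two transpositions

print(is_group_of_permutations(S3))          # True
print(is_group_of_permutations(rotations))   # True
print(is_group_of_permutations(not_closed))  # False
```

The third set fails because composing its two transpositions gives a 3-cycle that the set does not contain.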

Any such function may be thought of as a permutation or symmetry of the
set X, and S(X) is generally referred to as the symmetry group of X or
as the symmetric group on the set X.
Historically, groups emerged first in the work of Lagrange, Ruffini,
and Cauchy around 1800 as sets of permutations of a finite set X. In
this context the definition may be simplified. A nonempty set G of
permutations of a finite set X is a group if and only if G is closed
under composition of permutations. Closure under functional inversion
follows automatically. The condition of closure under composition puts
severe restrictions on the group G, e.g., the remarkable fact
discovered by Lagrange that if |X| = n [and so |S(X)| = n!] and G is
any subgroup of S(X), then |G| is a divisor of n!.
The subject was greatly enriched by Evariste Galois, who demonstrated
around 1830 how to associate to each polynomial equation a finite group
(called its Galois group) whose structure encodes considerable
information concerning the algebraic relationships among the roots of
the equation. An analogous theory for differential equations was
developed later in the century by Lie, Picard, and Vessiot. The
relevant groups in this setting were no longer finite groups but rather
continuous groups (more commonly known as Lie groups).
Starting around 1850, Cayley championed the idea of studying groups
abstractly rather than via some specific representation. This notion
simplifies numerous arguments and constructions, notably the
construction of quotient groups. The first modern definition of a group
was given by Dyck in 1882, who also pioneered the study of infinite
groups.

I. FUNDAMENTAL CONCEPTS

We shall freely use the standard notation of naive set theory with
regard to sets, subsets, elements, relations, functions, etc. By a
binary operation ◦ on a set S, we shall mean a function ◦ : S × S → S.
It is customary to say that S is closed under the binary operation ◦ in
this situation, and it is customary to write x ◦ y for the value of ◦
at the ordered pair (x, y) ∈ S × S. The symbols Z, Q, R, and C will
denote the sets of integers, rational numbers, real numbers, and
complex numbers, respectively.
We now give the promised abstract definition of a group.

Definition: A group is an ordered pair (G, ◦) such that G is a set
closed under the binary operation ◦ satisfying:

(1) (x ◦ y) ◦ z = x ◦ (y ◦ z) for all x, y, z ∈ G.
(2) There is an element e ∈ G with e ◦ x = x = x ◦ e for all x ∈ G.
(3) For any x ∈ G, there is an element $x^{-1} \in G$ such that
$x^{-1} \circ x = e = x \circ x^{-1}$.

This definition makes the group operation the prominent feature and is
indifferent to the kind of mathematical objects which the elements of G
might be.
The element e is unique and is called the identity element of G, and
the element $x^{-1}$ is uniquely determined by x and is called the
inverse of the element x. Henceforth in this article we shall in
general write xy for x ◦ y. Also, we shall typically speak of a group
G, with the operation ◦ being tacitly understood. It is important to
note that in general $xy \neq yx$. A group in which the commutative law

$$xy = yx \quad \text{for all } x, y \in G$$

holds is called an abelian group in honor of Niels Abel, who first
recognized that a polynomial equation whose Galois group is abelian is
solvable by radicals. Other groups are called nonabelian groups.
Sometimes when G is an abelian group, the operation is denoted by + and
the identity element by 0. Many important sets of numbers form abelian
groups under +, for example, (C, +) and such subsets as (Z, +), (Q, +),
and (R, +).
Groups of numbers like (R, +) seem quite different from groups of
functions. However, we may identify (R, +) with the group (T, ◦) of all
translation functions $T_a : \mathbf{R}^1 \to \mathbf{R}^1$ defined by

$$T_a(x) = a + x \quad \text{for all } x \in \mathbf{R}^1.$$

More generally, as observed by Cayley and Jordan, if (G, ·) is any
abstract group, then G may be regarded as a group of invertible
functions on the set G by regarding the element g ∈ G as the function
$\lambda_g : G \to G$ defined by

$$\lambda_g(x) = gx \quad \text{for all } x \in G.$$

It is for this reason that our two definitions of group are essentially
the same. The mapping G → S(G) taking g to $\lambda_g$ is called the
(left) regular representation of G.
Groups like (Z, +), (Q, +), and (R, +) are instances of subgroups of
(C, +) in the following sense.

Definition: Let (G, ◦) be a group. A nonempty subset H of G is a
subgroup of G if $(H, \circ_H)$ is itself a group, where $\circ_H$ is
the restriction of the function ◦ to H × H.

It follows readily that the identity in H is the identity in G and that
inverses of elements in H are the same as their inverses in G.
Given any subgroup H of a group G, we may define an equivalence
relation on the group G by the rule

$$g \equiv_H g_1 \quad \text{if } g^{-1} g_1 \in H.$$
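
As an illustration of the regular representation and of the equivalence relation just defined, the following Python sketch works with the six permutations of three points. It is a minimal example under stated assumptions, not part of the original article: the helper names `mul`, `inv`, `lam`, and `classes`, and the choice of group and of the two-element subgroup H, are ours. It checks that each $\lambda_g$ is a bijection of G, that distinct elements give distinct functions, that $\lambda_{gh} = \lambda_g \circ \lambda_h$, and then partitions G by the relation $g^{-1} g_1 \in H$.

```python
from itertools import permutations

G = list(permutations(range(3)))   # the six permutations of {0, 1, 2}
e = (0, 1, 2)                      # identity permutation

def mul(g, h):
    """Group operation: the composition g ∘ h, sending i to g[h[i]]."""
    return tuple(g[h[i]] for i in range(3))

def inv(g):
    """Inverse of g, found by search (adequate for a six-element group)."""
    return next(h for h in G if mul(g, h) == e)

def lam(g):
    """The function x -> gx on the set G, stored as a dict."""
    return {x: mul(g, x) for x in G}

# Each lam(g) is a bijection of G, and distinct g give distinct functions.
assert all(sorted(lam(g).values()) == sorted(G) for g in G)
assert len({tuple(sorted(lam(g).items())) for g in G}) == len(G)

# lam(gh) agrees with lam(g) composed with lam(h).
assert all(lam(mul(g, h))[x] == lam(g)[lam(h)[x]]
           for g in G for h in G for x in G)

H = {e, (1, 0, 2)}                 # a subgroup with two elements

def classes(G, H):
    """Partition G into classes of the relation: g ~ g1 when g^-1 g1 is in H."""
    result = []
    for g in G:
        for cls in result:
            if mul(inv(cls[0]), g) in H:   # compare with the class representative
                cls.append(g)
                break
        else:
            result.append([g])
    return result

parts = classes(G, H)
print([len(c) for c in parts])     # [2, 2, 2]
```

Comparing each element only with one representative per class is enough here because the relation is an equivalence relation.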
The equivalence classes determined by this equivalence relation are
called the left cosets of H in G. Thus the left coset containing the
element g is just

$$gH = \{gh : h \in H\}.$$

The set G\H of all left cosets of H in G forms a space on which G acts
by left multiplication, in the sense that to each element g ∈ G there
is naturally associated a function
$L_g : G\backslash H \to G\backslash H$ by $L_g(xH) = gxH$ and the
functional equation $L_{gg_1} = L_g \circ L_{g_1}$ holds for all
$g, g_1 \in G$. The left regular representation of G is the special
case where H = {e}. However, although the map $g \mapsto \lambda_g$ is
one-to-one, in general, many elements of G may give rise to the same
function on G\H.
The fact that all left cosets of H have the same cardinality yields the
first important theorem about finite groups.

Lagrange’s Theorem: Let G be a finite group and H a subgroup of G.
Then |H| is a divisor of |G|.

In a similar way, it is possible to define the right cosets Hg of a
subgroup H. An important class of subgroups, first studied by Galois,
is the normal subgroups, which are those for which the right and left
cosets coincide.

Definition: A subgroup N of a group G is a normal subgroup if and only
if gN = Ng for all g ∈ G. When N is a normal subgroup of G, we write
$N \trianglelefteq G$.

Given a group G and a normal subgroup N, we can construct a new group
G/N, called the quotient group of G by N.

Definition: Let G be a group and N a normal subgroup of G. The set G\N
with the operation $(gN)(g'N) = gg'N$ is a group, called the quotient
group G/N. (Note that it is conventional to replace \ by / when the set
is regarded as a group.)

The normality condition is necessary for the multiplication on G\H to
be well defined. The identity subgroup {e} and the entire group G are
always normal subgroups of every group G. A group G whose only normal
subgroups are {e} and G is said to be a simple group. Simple groups
play the role in the theory of finite groups that prime numbers play in
the theory of numbers, as remarked by Michael Solomon, among others.
All finite groups are “composed” from the simple ones.
At the other extreme, every subgroup of an abelian group is normal. If
G is any group and g ∈ G, then the subset $\{g^n : n \in \mathbf{Z}\}$
forms a subgroup of G called the cyclic subgroup generated by g and
often denoted $\langle g \rangle$. If $G = \langle g \rangle$, then G
is called a cyclic group. It follows that an abelian simple group must
be cyclic, indeed cyclic of cardinality p for some prime p.
The quotient construction affords important examples of cyclic groups.
Namely, for each positive integer n, the subset nZ of all multiples of
n in Z is a subgroup of the abelian group Z and hence we have cyclic
groups Z/nZ, generated by the coset 1 + nZ. Cyclic groups arise in many
natural contexts. For instance, the set Z(n) of all complex nth roots
of 1 forms a cyclic group of cardinality n under multiplication.
Likewise, the group of all rotational symmetries of a regular n-sided
polygon is a cyclic group $C_n$ of cardinality n, under composition of
functions. The groups Z/nZ, Z(n), and $C_n$ have identical properties
as abstract groups, although their manifestations in the “real world”
are different. Under such circumstances the groups are called
isomorphic. The formal definition follows.

Definition: The function $\Phi$ : (G, ◦) → (H, ·) is an isomorphism
(of groups) if $\Phi : G \to H$ is a bijective function satisfying:

$$\Phi(g \circ g') = \Phi(g) \cdot \Phi(g') \quad \text{for all } g, g' \in G.$$

Two groups which are isomorphic are indistinguishable as abstract
groups, and the isomorphism may be regarded as a dictionary for
translating from one group to the other. Much of the power of abstract
group theory derives from the fact that a given theorem about abstract
groups may have numerous important instantiations via different
realizations of the abstract groups. For later reference, we note that
an isomorphism $\Phi : G \to G$ between a group and itself is called an
automorphism of G. The set of all automorphisms of G itself forms a
group, Aut(G), under composition of functions. (We shall refer later to
an automorphism of a field, which has a similar meaning.)
Returning to cyclic and abelian groups, we observe that we now know, up
to isomorphism, every cyclic group and every abelian simple group.

Theorem:

(1) Every cyclic group is isomorphic either to Z or to Z/nZ for some
positive integer n; and
(2) Every abelian simple group is isomorphic to Z/pZ for some prime p.

In contrast to this elementary result, the classification of the finite
nonabelian simple groups was a very challenging and complex
undertaking, and the classification of infinite simple groups is surely
impossible. In any case, a strategy for the analysis of a general
(finite) group is to “decompose” it into simple factors via an iterated
quotient group construction. This is meaningful for finite groups
thanks to the following theorem.

Jordan–Hölder Theorem: Let G be a finite group. Then there is an
integer n ≥ 0 and a chain of subgroups
$$G = G_0 \ge G_1 \ge \cdots \ge G_{n-1} \ge G_n = \{e\}$$

such that $G_{i+1} \trianglelefteq G_i$ for all i < n and
$G_i / G_{i+1}$ is a nonidentity simple group for all i. The chain is
not necessarily unique, but the list
$G_0/G_1, G_1/G_2, \ldots, G_{n-1}/G_n$ is uniquely determined by the
group G, up to changing the order of the groups in the list and
replacing groups by isomorphic copies.

The chain of subgroups is called a composition series for G, and the
set of quotient groups is called the set of composition factors of G.
Continuing the analogy with numbers, the Jordan–Hölder theorem plays
the role of the fundamental theorem of arithmetic. In contrast to a
number, however, a group G is hardly ever determined up to isomorphism
by its set of composition factors. Indeed, already in the elementary
case, where the set of composition factors is {Z/pZ, Z/pZ}, there are
two nonisomorphic possibilities for the group G, one cyclic and the
other not.

II. IMPORTANT EXAMPLES

We begin with some important examples of abelian groups. We have
already mentioned the abelian group (C, +) and some of its subgroups.
Sometimes the operation in an abelian group is naturally thought of as
multiplication. Indeed, the algebraic object called a field may be
defined as a set F closed under two binary operations + and · such that
(F, +) is an abelian group with identity element 0, (F − {0}, ·) is
also an abelian group (with identity element 1), and the two operations
are intertwined by the distributive law:

$$x(y + z) = xy + xz \quad \text{for all } x, y, z \in F.$$

Thus, for example, Q, R, and C are fields, and so in particular
(Q − {0}, ·), (R − {0}, ·), and (C − {0}, ·) are abelian groups with
identity element 1. An important example of a subgroup of C − {0} is
the circle group $S^1$:

$$S^1 = \{a + bi \in \mathbf{C} : a^2 + b^2 = 1\}.$$

Further examples of abelian groups are afforded by any vector space V
with the operation of vector addition.
There are two particularly important sets of examples of nonabelian
groups. The first are the symmetric groups S(X), where X is any set and
S(X) is the group of all bijective functions f : X → X, with the
operation being composition of functions. If $\Phi : X \to Y$ is a
bijection of sets, then there is an induced bijection

$$\tilde{\Phi} : S(X) \to S(Y)$$

by

$$\tilde{\Phi}(f) = \Phi \circ f \circ \Phi^{-1},$$

which is in fact an isomorphism of groups. S(X) is a nonabelian group
whenever |X| ≥ 3.
If X is any set with |X| = n, then we often denote by $S_n$ the
symmetric group S(X). Since the isomorphism type of S(X) is completely
determined by the cardinality of X, this notation is justified. It is
clear that every permutation σ of a finite set X can be written as a
product of transpositions (permutations which interchange two points
and fix the rest). We call σ an even (odd) permutation if the product
involves an even (odd) number of transpositions. (Although the product
is not unique, the parity turns out to depend only on σ.) The set of
all even permutations is a subgroup of S(X) called the alternating
group Alt(X) (or $A_n$) of cardinality n!/2.
The second important class of groups is the family of general linear
groups GL(V), where V is a vector space over a field F and GL(V) is the
group of all invertible linear operators on V, i.e., invertible linear
transformations T : V → V. Again the operation is composition of
functions, and indeed GL(V) is clearly a subgroup of the symmetric
group S(V). GL(V) is nonabelian whenever dim(V) > 1. As above, if
$\Phi : V \to W$ is an isomorphism between the F-vector spaces V and W,
i.e., an invertible linear transformation mapping V onto W, then,
exactly as before, $\Phi$ induces an isomorphism of groups between the
groups GL(V) and GL(W). Thus the isomorphism type of GL(V) is
completely determined by the field F and the dimension of V as an
F-vector space. Indeed, if dim(V) = n is finite, then any choice of
basis for V determines a (noncanonical) isomorphism of groups between
the group GL(V) and the multiplicative group GL(n, F) of all n × n
invertible matrices with entries taken from the field F. (This is
simply an extension of the standard identification of a linear operator
T : V → V with an n × n matrix A, the identification depending on a
choice of basis for V.) If |F| is finite, then it may be shown that (up
to isomorphism of fields) there is only one field of cardinality |F|
and thus the notation GL(n, q), where q = |F|, is common and
unambiguous. This is no longer the case when |F| is infinite.
Certain subgroups of GL(V) are of particular importance and were first
studied in depth by C. Jordan and L. E. Dickson. They are often
referred to as the classical linear groups or classical matrix groups.
First is SL(V), the subgroup of all elements of GL(V) of determinant 1.
The others are subgroups which preserve a geometric structure on the
space V.

Definition. A bilinear form on an F-vector space V is a function
b : V × V → F which is linear in both positions, i.e.,

$$b(cu + dv, w) = c\,b(u, w) + d\,b(v, w)$$

and

$$b(u, cv + dw) = c\,b(u, v) + d\,b(u, w)$$
for all u, v, w ∈ V and all c, d ∈ F. We say that the form is
symmetric if b(u, v) = b(v, u) for all u, v ∈ V. We say that the form
is alternating if b(v, v) = 0 for all v ∈ V.

Definition: A quadratic form on an F-vector space V is a function
Q : V → F such that

(1) $Q(av) = a^2 Q(v)$ for all a ∈ F, v ∈ V; and
(2) b(u, v) = Q(u + v) − Q(u) − Q(v) defines a bilinear form
(necessarily symmetric) on V.

If (V, b) is a space with a symmetric bilinear form and if u and v are
vectors in V of length 1, i.e., b(u, u) = 1 = b(v, v), then b(u, v) may
be thought of as the cosine of the angle between the vectors u and v.
In all cases, if b(u, v) = 0 we say that u and v are orthogonal
vectors. Note that if b(v, v) = 0 for all v ∈ V, then
b(u, v) = −b(v, u) for all u, v ∈ V, and so in both the alternating and
symmetric cases, orthogonality is a symmetric relation on V.

Definition: If (V, b) is a space with an alternating or symmetric form
b, we let

$$V^{\perp} = \{v \in V : b(u, v) = 0 \text{ for all } u \in V\}.$$

If (V, Q) is a space with a quadratic form Q and associated symmetric
bilinear form b, then we let

$$\mathrm{Rad}(V) = \{v \in V^{\perp} : Q(v) = 0\}.$$

We say that a space (V, b) with an alternating form b is a symplectic
space if $V^{\perp} = \{0\}$. It may be shown that a finite-dimensional
symplectic space V must have even dimension and for every even positive
integer 2n and field F, there is (up to isometry) exactly one
symplectic F-space of dimension 2n. We say that a space (V, Q) with a
quadratic form Q is an orthogonal space if Rad(V) = {0}. There may be
many orthogonal spaces of a given dimension over a given field; the
number depends on both the field and the dimension. The classical law
of inertia of Sylvester asserts in particular that there are exactly
n + 1 nonisometric orthogonal structures on a real vector space of
dimension n. An orthogonal space (V, Q) is completely determined by its
associated bilinear structure (V, b) when the field F has
characteristic different from 2 (e.g., when F is a subfield of C).
The symplectic and orthogonal groups are the isometry groups of
symplectic and orthogonal spaces, respectively, in the following sense.

Definition: If (V, b) is a symplectic space and T : V → V is a linear
operator, then T is an isometry of (V, b) if

$$b(u, v) = b(T(u), T(v)) \quad \text{for all } u, v \in V.$$

The symplectic group Sp(V) is the group of all isometries of the
symplectic space (V, b). [As (V, b) is unique up to isometry, it is
unambiguous (up to isomorphism) to write Sp(V).] If (V, Q) is an
orthogonal space and T : V → V is a linear operator, then T is an
isometry of (V, Q) if

$$Q(v) = Q(T(v)) \quad \text{for all } v \in V.$$

The orthogonal group O(V, Q) is the group of all isometries of the
orthogonal space (V, Q).

An isometry is necessarily a bijection, whence Sp(V) and O(V, Q) are
subgroups of GL(V). Classical mechanics focuses on the case where F = R
and Q is positive definite, i.e.,

$$Q(x_1, x_2, \ldots, x_n) = x_1^2 + x_2^2 + \cdots + x_n^2,$$

and so often in the physics literature the group $O(\mathbf{R}^n, Q)$
for this choice of Q is denoted simply O(n). The Lorentz group,
however, which arises in relativity theory, is an orthogonal group
defined with respect to a four-dimensional indefinite quadratic form.
Symplectic isometries always have determinant 1. There is a somewhat
larger group CSp(V) of symplectic similarities, i.e., linear operators
T : (V, b) → (V, b) such that for all v, w ∈ V,

$$b(T(v), T(w)) = \lambda_T\, b(v, w),$$

for some nonzero scalar $\lambda_T \in F$. This group will reappear
toward the end of Section V.
Orthogonal isometries have determinant ±1. The subgroup of O(V, Q)
consisting of isometries of determinant 1 is called the special
orthogonal group, SO(V, Q). In the case of O(n), it is denoted SO(n)
and interpreted as the group of all orientation-preserving isometries
of $\mathbf{R}^n$ fixing the origin. For n = 3, SO(3) is the group of
all rotations of $\mathbf{R}^3$ about an axis through (0, 0, 0). The
earliest classification theorem for a class of nonabelian groups was
the classification of finite subgroups of SO(3), achieved in the
mid-nineteenth century.

Theorem: A finite subgroup of SO(3) is a subgroup of the full group of
symmetries of one of the following objects centered at (0, 0, 0): a
regular polygon lying in a plane through (0, 0, 0) or one of the five
regular polyhedra (Platonic solids): the tetrahedron, the cube, the
octahedron, the dodecahedron, or the icosahedron.

It follows that a finite subgroup G of SO(3) is either a cyclic group
or is the dihedral group $D_n$ of all symmetries of a regular n-gon or
is the tetrahedral, octahedral, or icosahedral group (the group of
rotational symmetries of the tetrahedron, octahedron, or icosahedron,
respectively).
Finally, when the field F admits an involutory automorphism σ (i.e.,
$\sigma^2 = I \neq \sigma$), there is a further important
family of classical linear groups, namely, the unitary Definition: Let G and H be groups. The Cartesian prod-
groups. uct set G × H with the operation
Definition: Let F be a field and let σ be an involutory (g, h) · (g , h ) = (gg , hh )
automorphism of F. A (σ -)hermitian form on an F-vector
is a group, called the direct product G × H.
space V is a function h : V × V → F which is linear in the
first position and which satisfies This
construction can be iterated to define direct prod-
σ ucts G i over arbitary index sets I . Another important
h(v, w) = h(w, v) for all v, w ∈ V.
generalization is the semidirect product. Before defining
A pair (V, h) with V an F-vector space and h a σ - this, we must generalize the notion of an isomorphism of
Hermitian form on V is said to be a unitary space if groups.
V ⊥ = {0}. We denote by U (V, h) the group of all isome-
tries of the unitary space (V, h) and call it the unitary Definition: Let (G, ◦) and (H, ·) be groups. A function
group of (V, h). The subgroup of unitary isometries of f : G → H is a homomorphism of group provided that
determinant 1 is the special unitary group, SU(V, h). f (g ◦ g ) = f (g) · f (g ) for all g, g ∈ G.
Of course the most important of all involutory automor-
Thus an isomorphism of groups is a bijective homo-
phisms is the complex conjugation map on C. When h is
morphism of groups. In general, failure of injectivity for
the standard positive definite inner product on Cn , the uni-
homomorphisms is measured by the following subgroup.
tary group U (Cn , h) is often denoted simply U (n). There
are, however, other inequivalent hermitian forms on Cn . Definition: Let f : G → H be a homomorphism of
There is a famous twofold covering map from SU(2) onto groups. The kernel of f is defined by
SO(3) called the spin covering because of its connection
Ker( f ) = {g ∈ G : f (g) = e H },
with the concept of spin in quantum mechanics.
If F is a finite field with |F| = q 2 for some prime power where e H is the identity element of the group H.
q, then F admits an involutory automorphism and it is
possible to define unitary groups of isometries of finite-dimensional F-vector spaces. Up to isomorphism, in the finite field case, the unitary group is uniquely determined by the field F and the dimension n of the vector space. Unfortunately, if |F| = q^2, some group theorists denote this group U_n(q^2), while others write U_n(q). This is only one instance of the notational inconsistencies which litter the terrain of classical linear groups. The original notations of Jordan and Dickson have been largely abandoned or modified in favor of a “classical” notation dating from the time of Weyl. However, when Cartan (for the field R) and later Chevalley and Steinberg (for arbitrary fields) constructed these groups from a Lie-theoretic standpoint, they adopted the notation of Killing, which is completely different from Weyl’s notation. Moreover, even within the Weyl framework, there are treacherous inconsistencies. For instance, some write Sp_n(F) and others Sp_{2n}(F), both meaning the group of symplectic isometries of a 2n-dimensional F-vector space. Let the reader beware.

III. BASIC CONSTRUCTIONS

In Section I we addressed the problem of decomposing a group into its constituent simple composition factors (when possible). Now we consider the opposite problem of “composing” two or more groups to create a new and larger group. One fundamental construction is the direct product construction.

The pre-image f^{-1}(h) of each element in f(G) is both a left and right coset of Ker(f) = f^{-1}(e_H) and so has the same cardinality as Ker(f). In particular, f is injective if and only if Ker(f) = {e_G}, where e_G is the identity element of G. There is an intimate connection between normal subgroups and homomorphisms.

Theorem: Let f : G → H be a homomorphism of groups with kernel K. Then K is a normal subgroup of G and f induces an isomorphism between G/K and f(G). Conversely, if K is any normal subgroup of the group G, then the function π : G → G/K defined by

    π(g) = gK for all g ∈ G

is a homomorphism of groups with kernel K.

Now we can describe the semidirect product construction of groups.

Definition: Let K and H be groups and let φ : H → Aut(K) be a homomorphism of groups. The semidirect product K :_φ H (or simply K : H) is the group whose underlying set is the Cartesian product set K × H with multiplication defined by

    (k, h)(k′, h′) = (k φ(h)(k′), hh′)

for all k, k′ ∈ K, h, h′ ∈ H.

The direct product K × H is the special case of the above construction in which φ is the homomorphism mapping every element of H to the identity automorphism
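As a small illustration of the semidirect product multiplication rule (k, h)(k′, h′) = (k φ(h)(k′), hh′), the following Python sketch builds Z_n : Z_2 with K = Z_n and H = Z_2 written additively. The helper names `semidirect`, `inverse`, and `is_normal` are hypothetical and chosen only for this example; the choice of φ (inversion on Z_n) is an assumption made to obtain a nonabelian group of order 2n.

```python
from itertools import product


def semidirect(n, twisted=True):
    """Elements and multiplication of Z_n : Z_2 (hypothetical helper).

    K = Z_n and H = Z_2 are written additively. phi(h) is the identity
    automorphism of K when h = 0 and, if `twisted`, the inversion
    k -> -k when h = 1. With twisted=False, phi is trivial and the
    result is the direct product Z_n x Z_2.
    """
    def phi(h, k):
        return (-k) % n if (twisted and h == 1) else k % n

    def mul(a, b):
        (k, h), (k2, h2) = a, b
        # (k, h)(k', h') = (k phi(h)(k'), h h'), in additive notation
        return ((k + phi(h, k2)) % n, (h + h2) % 2)

    elements = list(product(range(n), range(2)))
    return elements, mul


def inverse(g, elements, mul, identity=(0, 0)):
    """Find the inverse of g by brute-force search."""
    return next(x for x in elements if mul(g, x) == identity)


def is_normal(subset, elements, mul):
    """Check that g x g^{-1} stays in `subset` for all g and all x in it."""
    s = set(subset)
    return all(mul(mul(g, x), inverse(g, elements, mul)) in s
               for g in elements for x in s)


elements, mul = semidirect(3)              # a nonabelian group of order 6
K0 = [(k, 0) for k in range(3)]            # the copy {(k, e_H)} of K
H0 = [(0, h) for h in range(2)]            # the copy {(e_K, h)} of H
associative = all(mul(mul(a, b), c) == mul(a, mul(b, c))
                  for a in elements for b in elements for c in elements)
print(associative, is_normal(K0, elements, mul), is_normal(H0, elements, mul))
```

Calling `semidirect(3, twisted=False)` gives the direct product instead, in which the copy of H is normal as well.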
P1: LLL/GKP P2: FYK Final Pages
Encyclopedia of Physical Science and Technology EN007C-304 June 30, 2001 17:0

Group Theory 143

of K. In all cases, {(k, e_H) : k ∈ K} is a normal subgroup of K : H, which is isomorphic to K. The subset H_0 = {(e_K, h) : h ∈ H} is a subgroup of K : H isomorphic to H, but is not in general a normal subgroup of K : H. Indeed, H_0 is also normal if and only if K : H ≅ K × H.

If all composite (i.e., not simple) groups could be constructed from proper subgroups by an iterated semidirect product construction, then the classification of all finite groups, or even all groups having a composition series, would at least be thinkable, if not doable. However, not all composite groups can be so constructed, as is illustrated by the easy example of the cyclic group C_{p^2}. The obstruction to a composite group “splitting” as a semidirect product was first analyzed by Schur. The attempt to parametrize the set of all groups which are an “extension” of a given normal subgroup by a given quotient group was one of the motivating forces for the development of the cohomology theory of groups.

In the context of finitely generated abelian groups, a complete classification is nevertheless possible and essentially goes back to Gauss. (A group G is finitely generated if there is a finite subset S of G such that every element of G is expressible as a finite word in the “alphabet” S. We write G = ⟨S⟩ if G is generated by the elements of the subset S. In particular, every finite group G is finitely generated: you may take S = G.)

Theorem: If A is a finitely generated abelian group, then A is a finite direct product of cyclic groups.

On the other hand, the enumeration of all finite p-groups is an unthinkable problem. For p a prime, we call a group P a p-group if all of the elements of P have order a power of p. By theorems of Lagrange and Cauchy, P is a finite p-group if and only if |P| = p^n for some n ≥ 0. G. Higman and Sims have established that the number P(n) of p-groups of cardinality p^n is asymptotic to a function of the form p^{(2/27) n^3 + o(n^3)} as n → ∞.

In a different direction, another important generalization of the direct product is the central product. First we introduce an important normal subgroup of a group G.

Definition: Let G be a group. The center, Z(G), of G is defined by

    Z(G) = {z ∈ G : zg = gz for all g ∈ G}.

Definition: Let H, K, and C be groups and let α : C → Z(H) and β : C → Z(K) be injective homomorphisms. Let

    Z = {(α(c), β(c)) ∈ H × K : c ∈ C}.

Then the central product H ◦_C K is the quotient group (H × K)/Z.

Sometimes the central product is (ambiguously) written H ◦ K. Often this is done when Z(H) ≅ Z(K) and α and β are understood to be isomorphisms. Like the direct product, all of these product constructions can be iterated.

In some ways dual to the center of G is the commutator quotient.

Definition: Let G be a group. If x, y ∈ G, then the commutator [x, y] = x^{-1} y^{-1} x y. The commutator subgroup [G, G] of G is defined by

    [G, G] = ⟨[x, y] : x, y ∈ G⟩,

i.e., [G, G] is the subgroup of G generated by all commutators of elements of G. Then [G, G] is a normal subgroup of G and the commutator quotient G/[G, G] is the largest abelian quotient group of G.

Note that G is an abelian group if and only if Z(G) = G if and only if [G, G] = {e}. Sometimes G/[G, G] is called the abelianization of G.

This leads to the definition of the following important classes of groups.

Definition: A nonidentity finite group G is quasi-simple if G = [G, G] and G/Z(G) is a (nonabelian) simple group. A finite group G is semisimple if G = G_1 ◦ G_2 ◦ · · · ◦ G_r, where each G_i is a quasi-simple group.

In the contexts of Lie groups and algebraic groups, a group is called simple if every proper closed normal subgroup is finite. Thus the group SL(n, C), whose center is isomorphic to the finite group of complex nth roots of 1, would be called a simple group by a Lie theorist and a quasi-simple (but not simple) group by a finite group theorist. There is a notion of semisimple group in the categories of linear algebraic groups and Lie groups which coincides with the definition above for connected groups.

Some slightly larger classes of groups play an important role in many areas. The following definitions are not standard. The concept of a connected reductive group is fundamental in the theory of algebraic groups, in which context it is defined to be the product of a normal semisimple group and a torus (a group isomorphic to a product of GL_1’s). The second definition below approximates this notion in the category of all groups.

Definitions: A group G is almost simple if G has a normal subgroup E which is quasi-simple and G/Z(E) is isomorphic to a subgroup of Aut(E). A group G is almost semisimple if G has a normal subgroup E = S ◦ Z(E) with S = S_1 ◦ S_2 ◦ · · · ◦ S_r semisimple (with each S_i quasi-simple) and G/Z(E) is isomorphic to a subgroup of Aut(S) normalizing each S_i.

Most of the classical groups described above are almost semisimple. Indeed, S is a quasi-simple group in most cases. Also, the Levi complements of parabolic subgroups
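For small groups, the center Z(G) and the commutator subgroup [G, G] can be computed by brute force directly from their definitions. The Python sketch below represents group elements as permutation tuples; the helper names (`compose`, `generated_subgroup`, `center`, `commutator_subgroup`) are hypothetical and not taken from any library, and the approach is only practical for groups with a handful of elements.

```python
from itertools import permutations


def compose(p, q):
    """Product p*q of two permutations given as tuples (apply q first)."""
    return tuple(p[q[i]] for i in range(len(q)))


def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def generated_subgroup(generators, identity):
    """Closure of a finite generating set under multiplication.

    In a finite group this closure is the generated subgroup <S>.
    """
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(g, s)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def center(group):
    """Z(G) = {z in G : zg = gz for all g in G}."""
    return {z for z in group
            if all(compose(z, g) == compose(g, z) for g in group)}


def commutator_subgroup(group, identity):
    """[G, G], generated by all commutators x^-1 y^-1 x y."""
    comms = {compose(compose(inverse(x), inverse(y)), compose(x, y))
             for x in group for y in group}
    return generated_subgroup(comms, identity)


e = (0, 1, 2)
S3 = set(permutations(range(3)))
print(len(center(S3)), len(commutator_subgroup(S3, e)))
```

For the symmetric group on three letters this reports a trivial center and a commutator subgroup of order 3, so the abelianization has order 2; for an abelian group the two answers swap roles, as in the note that G is abelian if and only if Z(G) = G if and only if [G, G] = {e}.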
(stabilizers of flags of totally singular subspaces) of the classical linear groups are usually almost semisimple, exceptions arising over small fields or because of the peculiar structure of certain four-dimensional orthogonal groups.

In a vast number of cases of importance in mathematics and the physical sciences, the symmetry group (or automorphism group) of an interesting structure is either an almost semisimple group or has the form G = V : H as the semidirect product of an abelian group V and an almost semisimple group H. For example, the group of all rigid motions of Euclidean space R^n has the structure R^n : O(n). Similarly, if V is an affine space, then the group AGL(V) of all affine transformations of V has the structure AGL(V) = V : GL(V).

The proof of the classification of finite simple groups necessitated in fact the classification of all finite almost-simple groups. Thus, as a consequence of the classification (along with the classification of finite abelian groups), there is essentially a complete description of all finite almost-semisimple groups. The analogous result for infinite groups is out of reach, but if one restricts to the categories of algebraic groups or Lie groups, then a description of all connected reductive groups is again a part of the fundamental classification theorems.

IV. PERMUTATION GROUPS AND GEOMETRIC GROUPS

Definition: A group G is said to have a permutation action (or, simply, to act) on a set X if there is a homomorphism f : G → S(X). If f is injective, then we may identify G with f(G), a subgroup of S(X). In this case we say that G is a permutation group. In any case we may write g(x) for f(g)(x).

As noted earlier, the action of G on the set G by left multiplication defines an injective homomorphism of G into S(G), called the left regular representation of G. Thus every group is a permutation group.

Definition: If G acts on the set X and x ∈ X, we say that the G-orbit containing x is the subset

    x^G = {g(x) : g ∈ G}.

The G-orbits form a partition of the set X. If X = x^G, we say that G acts transitively on the set X. We say that the stabilizer in G of the point x is the subgroup

    G_x = {g ∈ G : g(x) = x}.

The following is a version of Lagrange’s theorem.

Theorem: If G is a finite group acting on the set X and if x ∈ X, then

    |G| = |G_x| |x^G|.

If G acts transitively on the set X, then the permutation action of G on X is permutation isomorphic to the action by left multiplication of G on the left coset space G\G_x. Conversely, for any subgroup H of a group G, the left multiplication action of G on G\H defines a transitive permutation action of G. Thus the classification of transitive permutation actions of a group G is equivalent to the classification (up to G-conjugacy) of subgroups of G. Here G-conjugacy refers to yet another important class of permutation actions of a group G.

Definition: Let G be a group and X a subset of G. Let

    X^G = {g X g^{-1} : g ∈ G}.

The members of X^G are called the G-conjugates of X and X^G is called the G-conjugacy class of X. G acts transitively by conjugation on X^G. The subgroup G_X is denoted N_G(X) and is called the normalizer in G of X.

Thus Lagrange’s theorem in this context asserts that |G| = |N_G(X)| |X^G|. (Another notational caution: For many mathematicians, X^G would be interpreted as the subset of X fixed pointwise by G. This is not the usage in this article.)

Definition: A transitive permutation action of a group G on a set X is said to be imprimitive if there is a partition B of X into sets B_i, i ∈ I, with |I| > 1 and |B_i| > 1, satisfying: for all i ∈ I and all g ∈ G, there exists j ∈ I such that g(B_i) = B_j. The partition B is called a system of imprimitivity and the members B_i are called blocks. If there is no such system of imprimitivity for the action of G, the action is called primitive.

In the theory of permutation groups, primitive permutation actions roughly play the role of simple groups in the abstract theory of groups: all permutation actions may be considered to be built up out of primitive permutation actions. Thus the classification of primitive permutation groups and primitive permutation actions has received considerable attention.

An elementary but basic fact is that if G acts transitively on the set X, then the stabilizer G_x of a point x ∈ X is a maximal subgroup of G (i.e., there is no subgroup of G properly between G_x and G) if and only if the action of G on X is primitive. Thus the classification of primitive permutation actions is equivalent to the classification (up to G-conjugacy) of the maximal subgroups of the permutation group G.

An important class of examples of primitive permutation groups is the class of affine general linear groups AGL(V) = V : GL(V) acting naturally on the affine space V as the group of all affine transformations of V. The
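The orbit and stabilizer definitions, and the counting identity |G| = |G_x| |x^G|, can be checked numerically on small permutation groups. The sketch below is a minimal illustration, assuming a group is given as a set of permutation tuples g with g[x] the image of the point x; the function names are hypothetical.

```python
from itertools import permutations


def orbit(group, x):
    """The G-orbit x^G = {g(x) : g in G}."""
    return {g[x] for g in group}


def stabilizer(group, x):
    """The stabilizer G_x = {g in G : g(x) = x}."""
    return {g for g in group if g[x] == x}


def check_orbit_stabilizer(group, x):
    """Test the identity |G| = |G_x| * |x^G| for one point x."""
    return len(group) == len(stabilizer(group, x)) * len(orbit(group, x))


S4 = set(permutations(range(4)))
# A subgroup that does not act transitively: the stabilizer of the point 3.
H = stabilizer(S4, 3)

for name, group in (("S4", S4), ("H", H)):
    for x in range(4):
        print(name, x, len(orbit(group, x)), len(stabilizer(group, x)),
              check_orbit_stabilizer(group, x))
```

The subgroup H has two orbits, {0, 1, 2} and {3}, and the identity holds on each of them separately, which is why the theorem is stated for an arbitrary point x rather than only for transitive actions.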
subgroup V of translation maps is a regular normal subgroup, i.e., V transitively permutes the vectors of V via translation and the stabilizer V_u of any vector u is {I}, the identity function (regarded as translation by 0). Thus AGL(V)_0 = GL(V).

In the theory of permutation groups an important role is played by the socle of G, which is the product of all minimal normal subgroups of G. There exist rather detailed theorems concerning the structure of primitive permutation groups, in particular describing the structure of the socle and its intersection with the stabilizer of a point.

Theorem (O’Nan-Scott): Let G be a primitive permutation group on a finite set X with |X| = n, and let B be the socle of G. Then one of the following holds:

1. X may be regarded as an affine space with n = p^m for some prime p, B is the abelian group of all translations of X and G is a subgroup of the affine group AGL(X); or
2. B = S^k is the direct product of k ≥ 1 isomorphic nonabelian simple groups acting via the product action on the Cartesian product set X = Y^k, where S is a primitive simple subgroup of S(Y); or
3. B = S^k is the direct product of k > 1 nonabelian simple groups acting either via diagonal action with |X| = |S|^{k−1} and with B_x a diagonally embedded copy of S, or via twisted wreath action with |X| = |S|^k and B_x = 1.

The O’Nan-Scott theorem focuses attention on the primitive permutation actions of nonabelian simple groups and indeed, in the wake of the classification of finite simple groups, considerable effort has been invested in the program of determining all maximal subgroups of all finite simple groups and all closed maximal subgroups of all simple algebraic groups. There exist detailed structure theorems for these maximal subgroups, as well as extensive specific information. Besides the case-by-case enumerations of maximal subgroups for the sporadic simple groups, most of this theory has been developed in the context of simple algebraic and Lie groups by Dynkin, Seitz, Liebeck, and others. Lifting theorems, originating with Steinberg, make it possible to a certain extent to transfer maximal subgroup questions concerning finite classical linear and exceptional groups to analogous questions about algebraic groups over the algebraic closure of the finite field.

Since the mid-nineteenth century there has been considerable interest in highly transitive permutation groups.

Definition: Let G be a permutation group on a set X and let k be a positive integer. We say that G acts k-transitively on X if for any two ordered k-tuples (x_1, . . . , x_k) and (y_1, . . . , y_k) with x_i ≠ x_j and y_i ≠ y_j for i ≠ j, there exists an element g ∈ G with g(x_i) = y_i for all i.

Thus transitive is the same as 1-transitive. In the 1860s, Mathieu discovered two remarkable groups, M12 and M24, which act 5-transitively on sets of cardinality 12 and 24, respectively. Despite considerable effort, the only known proof of the following theorem is as a corollary of the classification of finite simple groups.

Theorem: Let G be a finite k-transitive permutation group on a set X for some k ≥ 4. Then either G = S(X) with |X| ≥ 4 or G = Alt(X) with |X| ≥ 6 or G = M11, M12, M23, or M24. [Here M11 (resp. M23) is the stabilizer of a point in M12 (resp. M24).]

Highly transitive permutation groups lead to tight combinatorial designs which may often be interpreted as error-correcting codes or dense sphere-packing lattices. Specifically, Witt described designs (or Steiner systems) S(5, 6, 12) and S(5, 8, 24) associated with the Mathieu groups M12 and M24, respectively. These yield the ternary and binary Golay codes, respectively.

There is also a complete description of finite k-transitive permutation groups for k = 2 and 3. One important class of examples of 2-transitive permutation groups is the class of projective general linear groups PGL(V). For V any vector space (over a field F) of dimension at least 2, we may form the projective space P(V) whose objects are the k-dimensional subspaces of V, k ≥ 1, with incidence being given by symmetrized containment of subspaces. Then GL(V) acts on the points of P(V) and the kernel of the action is Z(GL(V)), the group of scalar linear transformations on V. The projective general linear group PGL(V) = GL(V)/Z(GL(V)) then acts as a 2-transitive permutation group on the set P(V). The image of SL(V) in the symmetric group on P(V) is the subgroup of PGL(V), denoted PSL(V), and it too acts 2-transitively on P(V). The lines of projective spaces are coordinatized by the elements of the field F plus one extra “point at infinity.” As no element of PGL(V) can map three collinear points to three noncollinear points, the action of PGL(V) cannot be 3-transitive unless the geometry contains only one line. This is precisely the case when dim(V) = 2, and indeed in this case the action is 3-transitive. If V is two-dimensional over the finite field F with |F| = q, then we write PGL(V) = PGL(2, q) and PSL(V) = PSL(2, q). If q = 2^m, then PGL(2, q) = PSL(2, q), while if q is odd, then PSL(2, q) is a normal subgroup of PGL(2, q) of index 2.

Theorem: Let G be a finite 3-transitive (but not 4-transitive) permutation group on a set X. Then one of the following holds:
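The definition of k-transitivity can be tested by brute force on very small examples. Because a k-transitive group acts transitively on ordered k-tuples of distinct points, it is enough to check that the images of one fixed k-tuple exhaust all such tuples. The following Python sketch uses that observation; the helper names `is_k_transitive` and `is_even` are hypothetical, and groups are assumed to be given as sets of permutation tuples on range(n).

```python
from itertools import permutations


def is_even(p):
    """Parity of a permutation tuple, computed by counting inversions."""
    inversions = sum(1 for i in range(len(p))
                     for j in range(i + 1, len(p)) if p[i] > p[j])
    return inversions % 2 == 0


def is_k_transitive(group, n, k):
    """True if `group` acts k-transitively on range(n), for 1 <= k <= n.

    The orbit of one ordered k-tuple of distinct points must contain
    every ordered k-tuple of distinct points.
    """
    tuples = list(permutations(range(n), k))
    base = tuples[0]
    images = {tuple(g[x] for x in base) for g in group}
    return len(images) == len(tuples)


S4 = set(permutations(range(4)))
A4 = {p for p in S4 if is_even(p)}
for k in (1, 2, 3, 4):
    print(k, is_k_transitive(S4, 4, k), is_k_transitive(A4, 4, k))
```

On four points the symmetric group passes for every k up to 4, while the alternating group stops at k = 2: it has only 12 elements, too few to reach all 24 ordered triples.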
1. X may be regarded as an affine space with |X| = 2^n and either G = AGL(X) or |X| = 16 and G = X : A_7; or
2. The socle of G is PSL(2, q) and X = PG(1, q) is the projective line of order q, q ≥ 2; or
3. G = M22 (the stabilizer of a point in M23).

The description of finite 2-transitive (but not 3-transitive) permutation groups G is much more elaborate. Again the examples fall into two major classes. The first are affine groups G = V : H, with H acting as a subgroup of GL(V) transitive on the nonzero vectors of V. These were enumerated by Hering, using the classification theorem for finite simple groups. There are three infinite classes and several additional small examples. In the second class, the socle of G is a nonabelian simple group. These too have been completely enumerated using the classification theorem. The socles of the generic examples are PSL(n, q) acting on the points of projective space, Sp(2n, 2) acting on spaces of quadratic forms, and the split BN pairs of rank 1 acting on the coset space G\B. There are also several isolated examples.

At one time it was suspected that every nonabelian finite simple group had a 2-transitive permutation action on some set. This is far from the case. However, a vast number of simple groups come close in the sense of being rank k permutation groups for k ≤ 3, as defined below.

Definition: A permutation group G acting on a set X is said to have a rank k action on X if G is transitive on X and a point stabilizer G_x has exactly k orbits on X.

A 2-transitive group G is a rank 2 group in this sense. A considerable number of finite simple groups are rank 3 permutation groups. For example, the classical projective linear groups PSp(V), PU(V, h), and PO(V, Q) are rank 3 in their actions on the singular points of P(V) [i.e., all points in the symplectic case and those points v such that h(v, v) = 0, resp. Q(v) = 0 in the unitary and orthogonal cases respectively], provided (in the unitary and orthogonal cases) that the geometries contain totally singular planes, i.e., planes on which the form is identically 0.

There are tight combinatorial conditions on rank 3 permutation actions, developed by D. G. Higman and others. The study of rank 3 permutation groups led to the discovery of several of the so-called sporadic simple groups, as discussed in Section VI.

The action of a group G on the coset space G\H for a single subgroup H leads to a transitive permutation action. When a family {G_i}_{i∈I} of subgroups and coset spaces G\G_i are considered simultaneously, this may be regarded as a geometry with objects of type i, i ∈ I, transitively permuted by G, and with incidence defined by

    xG_i is incident with yG_j for i ≠ j if xG_i ∩ yG_j ≠ ∅.

Thus, if G = GL(V) and G_i is taken to be the full stabilizer of a fixed i-dimensional subspace V_i, 0 < i < dim(V), then we recover the projective geometry PG(V). Analogous constructions using stabilizers of totally singular subspaces yield the classical polar geometries for PSp(V), PO(V, Q), and PU(V, h). This type of construction was generalized by Tits to yield the theory of buildings and BN pairs. Tits gave an elegant axiomatic treatment for a large class of geometries, which he dubbed “buildings.” Most of these geometries have large (flag-transitive) automorphism groups, and the corresponding axiomatization of these groups underlies the theory of BN pairs or Tits systems.

V. GROUP REPRESENTATIONS AND LINEAR GROUPS

A linear representation of a group G is a homomorphism ρ : G → GL(V) for some vector space V. Throughout this section all vector spaces will be assumed to be over a given field F. The systematic study of finite-dimensional group representations began in the 1890s, notably in the work of Frobenius. Many of the basic ideas are analogous to those in the theory of permutation representations. Thus the decomposition of a set into G-orbits has its correspondent in the decomposition of the finite-dimensional vector space V into G-invariant subspaces V_i:

    V = V_1 ⊕ · · · ⊕ V_n,

where each V_i is indecomposable in the sense that it cannot be written as the direct sum of two proper G-invariant subspaces. The Krull-Schmidt theorem asserts in this context the uniqueness of the isomorphism classes of G-invariant spaces in the unordered list V_1, . . . , V_n. (In the case of infinite-dimensional representations of a Lie group G, this direct sum decomposition must often be replaced by a direct integral of G-invariant subspaces of a Hilbert space. We see this already in the classical Fourier integral which arises in the representation theory of the real line R^1.)

Immediately at this stage the representation theory of finite groups bifurcates into two cases: ordinary representation theory, where the field F has characteristic 0, and modular representation theory, where the field F has characteristic p and p divides |G|. (When F has characteristic p and p does not divide |G|, the theory is essentially the same as the characteristic 0 theory.)

Maschke’s theorem: If G is a finite group and F is a field of characteristic 0 (e.g., a subfield of C), then every finite-dimensional G-space V is completely reducible
in the sense that each indecomposable summand V_i is irreducible, i.e., V_i has no proper G-invariant subspaces.

On the other hand, every finite group of order divisible by a prime p has a p-modular representation space V which is indecomposable but not irreducible. The p-modular representation theory for cyclic p-groups is precisely the theory of unipotent Jordan canonical matrices. A unipotent Jordan block represents an indecomposable module and is irreducible only if it is a 1 × 1 block. Similar considerations play a significant role in the representation theory of Lie groups and algebraic groups. Thus a fundamental result in Lie theory asserts the complete reducibility of finite-dimensional complex representations of complex semisimple Lie groups. By contrast, nilpotent Lie groups typically have indecomposable finite-dimensional complex representations which are not irreducible.

For the remainder of this section we shall assume that V is finite-dimensional, although many of the concepts such as characters and induced representations have wider applicability.

In characteristic 0, a G-representation ρ is completely determined by its trace function or character χ_ρ, defined by

    χ_ρ(g) = Tr(ρ(g)) for all g ∈ G,

where Tr is the trace of the matrix ρ(g). As similar matrices have the same trace, χ_ρ is a class function, i.e., it is constant on G-conjugacy classes. The (ordinary) character theory of finite groups was developed by Frobenius, Burnside, and Schur around 1900. For abelian groups, complex characters are simply homomorphisms into C^×, and they arose in the number-theoretic investigations of Legendre, Gauss, and Dirichlet. It was Dedekind whose letter to Frobenius stimulated the extension of this theory to nonabelian groups.

The following concept goes back to Galois.

Definition: A group G is solvable if and only if there is a series

    G = G_0 > G_1 > · · · > G_{n−1} > G_n = {e},

with G_{i+1} ⊴ G_i for all i < n and with G_i/G_{i+1} an abelian group.

For finite groups this is equivalent to the assertion that the composition factors of G are all cyclic of prime order. Galois established that a polynomial equation is “solvable by radicals” if and only if its Galois group is a finite solvable group, extending the result of Abel mentioned in Section I. Sylow proved that a finite group of prime-power cardinality is necessarily a solvable group. One of the great early achievements of character theory was the proof of the following result.

Burnside’s p^a q^b Theorem: If G is a finite group with |G| = p^a q^b, with p and q primes, then G is a solvable group.

There is also a theory of Brauer characters in the context of modular representation theory, but the Brauer characters do not determine the modular representations uniquely, except in the case of irreducible modular representations.

A fundamental construction in representation theory, introduced by Frobenius, is the induced representation. First note that for a finite group G and a field F, the F-vector space F[G] with basis G becomes a ring when endowed with the unique extension of the group multiplication to F[G] via the distributive law. F[G] may also be regarded as the algebra of F-valued functions on G with convolution product and, as such, has natural generalizations in the categories of Lie and algebraic groups. An F-vector space V with a G-action is an F[G]-module in a natural way.

Definition: Let G be a finite group, H a subgroup of G, F a field and V an F[H]-module. The induced F[G]-module V^G is the tensor product F[G] ⊗_{F[H]} V with the action of F[G] via left multiplication.

There is a similar construction in the categories of Lie groups and algebraic groups, which plays a fundamental role, for example, in Harish-Chandra’s construction of unitary representations of semisimple Lie groups. In this context it is often best to study representations of the associated Lie algebra, in which case the universal enveloping algebra substitutes for the group algebra.

Induced modules are analogous to imprimitive sets in the theory of permutation groups. Indeed, a G-module V is called imprimitive if V = U^G, where U is an H-module for some proper subgroup H of G. Otherwise V is called a primitive G-module.

A new complication in representation theory relates to the size of the field F. This is the extension of the basic problem of finding eigenvalues in the field for a matrix or linear operator. Thus the theory works most smoothly over algebraically closed fields such as C. Sometimes, however, it is necessary to work over smaller fields and due caution is necessary. For example, an F[G]-module V may be irreducible but not absolutely irreducible, in the sense that V becomes reducible upon suitable extension of the field of scalars. The theory of Schur indices and Brauer groups is important in this context.

There is a further important way of “analyzing” a G-module into simpler constituents. An F[G]-module V is said to be tensor decomposable if V ≅ U ⊗ W with U and W F[G]-modules. [Here ⊗ denotes the ordinary tensor product over F and g(u ⊗ w) = g(u) ⊗ g(w).] Otherwise V is tensor indecomposable. A fundamental theorem
of Steinberg, the tensor product theorem, describes all irreducible rational modules for simple algebraic groups and all irreducible modules for finite groups of Lie type in the defining characteristic (i.e., p-modular representations, where p is the defining prime for the group) as tensor products of certain basic modules and their Galois “twists.”

As tensor decomposability parallels reducibility, so too there is a notion of a tensor-induced F[G]-module which is analogous to the notion of an imprimitive F[G]-module. We can call a module tensor-primitive if it is neither tensor decomposable nor tensor-induced. Thus the “prime” object in this context is an absolutely irreducible primitive and tensor-primitive F[G]-module.

In the context of linear representation theory, the socle is no longer the correct subgroup to consider. For example, whereas the socle of the permutation group S_n (when n ≥ 5) is the nonabelian simple group A_n, which captures much of the structure of S_n, the socle of the general linear group GL(n, q), when n divides q − 1, is a subgroup of the group Z of scalar matrices and captures very little of the structure of GL(n, q). Surprisingly it was not until 1970 that the correct replacement for the socle in the context of linear groups or general finite groups was defined. Before giving the definition, we must finally discuss the important class of nilpotent groups.

Definition: For a group G, we define recursively G_0 = G and G_{i+1} = [G_i, G] for i ≥ 0. [Thus G_1 = [G, G] is the commutator subgroup of G.] A group G is said to be nilpotent (of nilpotence class n) if G_n = {e} for some positive integer n (with n the smallest such positive integer).

Thus an abelian group is a nilpotent group of nilpotence class 1. As G_i is a normal subgroup of G for all i and as G_i/G_{i+1} is abelian, every nilpotent group is a solvable group. However, the converse is false, as shown by the symmetric group S_3. Sylow proved that every finite p-group is nilpotent, and indeed, the following theorem characterizes finite nilpotent groups.

Theorem: A finite group G is nilpotent if and only if G is a direct product of finite p-groups.

In the category of linear algebraic groups, the quintessential (though certainly not the only) examples of connected nilpotent groups are the groups U(n, F) of all upper triangular matrices with all diagonal entries equal to 1. We call a matrix unipotent if it is conjugate to a matrix in U(n, F). We have the following theorems.

Theorem: Let U be a subgroup of GL(n, F) all of whose elements are unipotent matrices. Then U is a nilpotent group and is conjugate to a subgroup of U(n, F).

Lie’s Theorem: Let G be a closed connected solvable subgroup of GL(n, C). Then G is conjugate to a subgroup of B(n, C), the solvable group of all upper triangular matrices in GL(n, C).

Thus every subgroup of U(n, C) is nilpotent and every closed connected nilpotent subgroup of GL(n, C) is conjugate to a subgroup of B(n, C). Lie’s theorem does not extend to arbitrary fields. For example, SO(2) is a closed connected abelian subgroup of SL(2, R) which is not conjugate to a subgroup of B(2, R).

Returning to abstract groups, it can easily be shown that the product of two normal nilpotent subgroups of a group G is again nilpotent. Hence if G is a finite group, then G has a unique maximal normal nilpotent subgroup, called the Fitting subgroup of G, F(G). Likewise the product of two normal semisimple subgroups of a group G is again semisimple and so a finite group G has a unique maximal normal semisimple subgroup denoted E(G). Moreover, the subgroups E(G) and F(G) commute with each other elementwise.

Definition: The generalized Fitting subgroup F*(G) of the finite group G is the central product F*(G) = E(G) ◦ F(G).

Notice that if F(G) is an abelian group, then F*(G) is an almost semisimple group. In the general theory of finite groups (or more generally of groups having a composition series), the group F*(G) plays the role that the socle plays in the theory of permutation groups. Thus, returning to the example of GL(n, q), although the socle is typically a subgroup of the group of scalar matrices Z, F*(GL(n, q)) = SL(n, q) ◦ Z captures most of the structure of GL(n, q).

To further appreciate the fundamental significance of F*(G) we must introduce a basic concept.

Definition: Let X be any subset of the group G. The set

    C_G(X) = {g ∈ G : gx = xg for all x ∈ X}

is a subgroup of G called the centralizer in G of X.

Definition: Let G be a group and N a normal subgroup of G. The conjugation action of G on the elements of N defines a homomorphism G → Aut(N). The kernel of this homomorphism is C_G(N). Thus G/C_G(N) is isomorphic (via this homomorphism) to a subgroup of Aut(N).

Theorem (Fitting–Bender): Let G be a finite group. Then

    C_G(F*(G)) = Z(F(G)).

Thus G/Z(F(G)) is isomorphic to a subgroup of Aut(F*(G)).

In particular, this result says that |G| is bounded by a function of |F*(G)|. Specifically, |G| ≤ (|F*(G)|)!. (This
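The recursive definition G_0 = G, G_{i+1} = [G_i, G] of the lower central series can be run directly on small permutation groups to see which of them are nilpotent. The Python sketch below is a brute-force illustration under the assumption that groups are sets of permutation tuples; the helper names are hypothetical, and the dihedral group of order 8 is generated here by a rotation and a reflection of the square chosen for this example.

```python
from itertools import permutations


def compose(p, q):
    """Product p*q of two permutations given as tuples (apply q first)."""
    return tuple(p[q[i]] for i in range(len(q)))


def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def generated_subgroup(generators, identity):
    """Closure of a finite generating set under multiplication."""
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(g, s)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def lower_central_series(group, identity):
    """Return [G_0, G_1, ...] with G_{i+1} = [G_i, G], until it stabilizes."""
    series = [set(group)]
    while True:
        current = series[-1]
        comms = {compose(compose(inverse(x), inverse(y)), compose(x, y))
                 for x in current for y in group}
        nxt = generated_subgroup(comms, identity)
        if nxt == current:
            return series
        series.append(nxt)


def is_nilpotent(group, identity):
    return lower_central_series(group, identity)[-1] == {identity}


e4 = (0, 1, 2, 3)
D4 = generated_subgroup({(1, 2, 3, 0), (0, 3, 2, 1)}, e4)   # order 8
S3 = set(permutations(range(3)))
print([len(t) for t in lower_central_series(D4, e4)], is_nilpotent(D4, e4))
print([len(t) for t in lower_central_series(S3, (0, 1, 2))],
      is_nilpotent(S3, (0, 1, 2)))
```

The dihedral group of order 8 is a 2-group, and its series reaches the identity after two steps, so it has nilpotence class 2. For S_3 the series gets stuck at the subgroup of order 3, which illustrates the statement that S_3 is solvable but not nilpotent.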
P1: LLL/GKP P2: FYK Final Pages
Encyclopedia of Physical Science and Technology EN007C-304 June 30, 2001 17:0


upper bound is achieved only when G = S_n for n ≤ 4.) In fact, much more is true in most practical applications.

Returning briefly to linear groups, we recall that the fundamental case to analyze was the following: G is a subgroup of GL(V), where V is an absolutely irreducible primitive and tensor-primitive G-module. In this case the following structure theorem is valid. We need one further definition.

Definition: Let p be a prime. An extraspecial p-group E is a finite p-group such that |Z(E)| = p and E/Z(E) is a nontrivial abelian p-group of exponent p.

Extraspecial p-groups are the finite cousins of the Heisenberg groups, which play a prominent role in quantum mechanics.

Theorem: Let G be a finite subgroup of GL(V), where V is an n-dimensional vector space over the field F. Suppose that for every normal subgroup N of G not contained in Z = Z(GL(V)), V is an absolutely irreducible primitive and tensor-primitive F[N]-module. Then either

1. F*(G) = E ∘ (Z ∩ G) with E an extraspecial p-group for some prime p such that the field F contains primitive pth roots of 1; or
2. F*(G) = E ∘ (Z ∩ G) with E a quasi-simple finite group and Z(E) ≤ Z.

A version of this theorem figures prominently in Aschbacher’s analysis of the maximal subgroups of the finite classical matrix groups.

The importance of finite groups G for which F*(G) = E is an extraspecial p-group (or a close relative thereof) affords an excuse for an illustration of the use of the Fitting–Bender theorem and of linear group methods in the study of abstract finite groups. First, let E be an extraspecial p-group and let V = E/Z(E). Then V may be regarded as a finite-dimensional vector space over the finite field F_p. Moreover, we may identify Z(E) with the field F_p. Then the function b : V × V → F_p, defined by

b(u + Z(E), v + Z(E)) = [u, v] for all u, v ∈ E,

is a nondegenerate alternating form on the vector space V. Moreover, if p is odd, then there is a surjective homomorphism Aut(E) → CSp(V), the conformal symplectic group of all similarities of the symplectic space (V, b). The kernel of this homomorphism is the group of all inner automorphisms of E, i.e., the automorphisms induced by the conjugation action of E on itself. This group, Inn(E), may be identified with V = E/Z(E), and then we see that Aut(E) ≅ V : CSp(V). When p = 2, the squaring map Q : V → F_2, defined by

Q(u + Z(E)) = u^2 for all u ∈ E,

is a quadratic form on V whose associated bilinear form is b. Now there is a surjective homomorphism Aut(E) → O(V, Q) whose kernel is again Inn(E) ≅ V. Thus Aut(E)/Inn(E) ≅ O(V, Q). In this case, however, Griess proved that the group Aut(E) is not in general a semidirect product. In any case, we get the following theorem.

Theorem: Let G be a finite group such that F*(G) = E is an extraspecial p-group. Set V = E/Z(E). Then G/E is isomorphic to a subgroup of CSp(V). If p = 2, then in fact G/E is isomorphic to a subgroup of O(V, Q), where Q is the quadratic squaring map. Moreover, G/E has no nontrivial normal p-subgroup.

Thus the structure of G/E for such groups G may be analyzed by the methods of linear representation theory. In fact, a similar line of reasoning may be applied to any finite group G such that F*(G) is a p-group for some prime p. One of the earliest and deepest results in this vein is the Hall–Higman theorem, which illustrates the relevance of the final assertion of the preceding theorem.

Hall–Higman Theorem B: Let V be a finite-dimensional vector space over a field of characteristic p, p an odd prime. Let G be a solvable subgroup of GL(V) having no nontrivial normal p-subgroup. Let x be an element of G of order p^n. Then the minimum polynomial m_x(t) of x is (t − 1)^(p^n), except possibly if p is a Fermat prime and G contains a nonabelian normal 2-subgroup.

Note that p^n is the largest possible size of a Jordan block for a linear transformation of order p^n acting on a vector space in characteristic p.

The general theory of p-modular representations of finite groups was developed primarily by Richard Brauer. The sharpest results are achieved when every p-subgroup of G is cyclic, for example, when |G| is divisible by p but not by p^2. A related theme of great importance is the p-modular representation theory of a finite group of Lie type G(q) defined over a field F_q of characteristic p. The methodology here is completely different from Brauer’s theory and instead is analogous to the theory of (rational) representations of semisimple Lie groups or algebraic groups. Indeed, Steinberg proved that all irreducible representations of G(q) come by restriction from the irreducible rational representations over F̄_q of the algebraic group G(F̄_q), where F̄_q is the algebraic closure of F_q and G(q) arises from G(F̄_q) by taking fixed points of a Frobenius–Steinberg endomorphism. The group G(F̄_q) is the characteristic p analog of the Lie group G(C). Now, the irreducible continuous representations of G(C) are parametrized by “highest weights” and the characters of these representations are given by the Weyl character formula. Curtis and Richen showed that an analogous parametrization holds for the finite groups of Lie type. An

analog of the Weyl character formula has been conjectured by Lusztig and proved generically, but not yet for all primes p.

For applications to topology and number theory, there have been efforts to study integral representations of a finite group G, i.e., homomorphisms of G into GL(n, R), where R is the ring of integers of some algebraic number field, e.g., R = Z. The situation here is quite subtle, and only small cases can be handled effectively.

VI. FINITE SIMPLE GROUPS

Around 1890, Wilhelm Killing published a series of papers classifying the finite-dimensional semisimple Lie algebras over C, modulo a significant error corrected by E. Cartan. In particular, via Lie’s correspondence, this gave a classification of the finite-dimensional simple Lie groups over C, and this classification was shortly extended to the real field as well by Cartan. In addition to the Lie algebras associated with the classical linear groups over C, Killing discovered five exceptional simple Lie algebras over C, which he denoted E_2, F_4, E_6, E_7, and E_8, the subscript denoting the maximum dimension of a diagonalizable subalgebra, later known as a Cartan subalgebra. Later E_2 came to be known as G_2. In Killing’s notation the classical families of Lie groups over C are: A_n(C) = PSL(n + 1, C), B_n(C) = PSO(2n + 1, C), C_n(C) = PSp(2n, C), and D_n(C) = PSO(2n, C).

The classification theorem of Killing and Cartan, which is a fundamental precursor of the classification of the finite simple groups, is the following result.

Theorem: Let G be a finite-dimensional simple Lie group with trivial center. Then G is either (an adjoint version of) a member of one of the four infinite families of classical Lie groups A_n(C), B_n(C), C_n(C), and D_n(C); or G is isomorphic to the adjoint version of one of the five exceptional Lie groups E_6(C), E_7(C), E_8(C), F_4(C), or G_2(C).

In 1955 Chevalley published a paper constructing analogs of Killing’s groups over all fields. In particular, when the fields are finite, these Chevalley groups are finite simple groups, except in a few cases over fields of cardinality 2 or 3. Shortly thereafter, Steinberg did the analog of Cartan’s passage to real forms of Lie groups and constructed the so-called twisted Chevalley groups or Steinberg groups ^2A_n(q) = PSU(n + 1, q), ^2D_n(q), ^3D_4(q), and ^2E_6(q). If V is an even-dimensional vector space over a finite field F_q, then there are exactly two nonisomorphic orthogonal groups on V: O^+(V) and O^−(V), where O^+(V) is the isometry group of an orthogonal space of maximal Witt index (i.e., an orthogonal sum of hyperbolic planes). Then, for n ≥ 3, D_n(q) and ^2D_n(q) are the unique nonabelian simple composition factors of the groups O^+(V), resp. O^−(V), where V is a 2n-dimensional F_q-space. The groups ^3D_4(q) and ^2E_6(q) were new families of nonabelian finite simple groups. ^2E_6 had arisen already as a real form of E_6 in Cartan’s classification, but the ^3D_4 groups were truly new, arising as the fixed points on D_4(F) of ρ ∘ σ, where ρ is the special triality automorphism of the D_4 oriflamme geometry and σ is a field automorphism of F of order 3. [Note: only involutory automorphisms arise in the classification of real forms of Lie groups, since (C : R) = 2.]

As 1960 dawned, the known nonabelian finite simple groups were the alternating groups of degree at least 5, the Chevalley and Steinberg groups (including the classical linear groups), and the five Mathieu groups, which had been dubbed sporadic groups by Burnside around 1900. Each of these groups has even order, and indeed it had been conjectured as early as 1900 by Miller and Burnside that groups of odd order were necessarily solvable groups. Even more, each of these groups has order divisible by 6. In fact, they all have subgroups isomorphic to either SL(2, 3) or PSL(2, 3) ≅ A_4, with the exception of the 2-transitive simple permutation groups PSL(2, 2^n) for n odd, n ≥ 3.

In 1960 Michio Suzuki discovered the first known nonabelian simple groups of order not divisible by 3, the infinite family of 2-transitive simple permutation groups Sz(2^n), n odd, n ≥ 3. Shortly thereafter, Ree showed that a modification of Steinberg’s twist could be applied to the Chevalley groups B_2 and F_4 over fields of order 2^n, n odd, and to G_2 over fields of order 3^n, n odd, to produce the families of simple groups ^2B_2(2^n), ^2F_4(2^n), and ^2G_2(3^n), n odd. In particular, the groups ^2B_2(2^n) were precisely the Suzuki groups Sz(2^n). The other two families were new and were dubbed Ree groups. The finite Chevalley, Steinberg, Suzuki, and Ree groups are now often called the finite groups of Lie type, inasmuch as they can be uniformly constructed as the fixed points of Frobenius–Steinberg endomorphisms of simple algebraic groups over algebraically closed fields of prime characteristic, and these groups are analogous to the simple Lie groups over C classified by Killing and Cartan.

Meanwhile, in 1960–1962, Walter Feit and John G. Thompson wrote the most remarkable paper in the history of finite group theory, proving the Miller–Burnside conjecture.

The Odd-Order Theorem (Feit–Thompson): Every finite group of odd order is solvable.

This paper and the successor “N-Group Paper” of Thompson provided most of the fundamental ideas for the classification of the finite simple groups. The first tool in the analysis of finite simple groups was provided in 1872 by Ludvig Sylow.

Sylow’s Theorem: If G is a finite group with |G| = p^n m for some prime p, with m not divisible by p, then G has subgroups of order p^k for all k, 0 ≤ k ≤ n, and all subgroups of G of order p^n are G-conjugate.

The maximal p-subgroups of G are called Sylow p-subgroups of G. It is interesting to note that there are somewhat analogous results in the category of linear algebraic groups, namely, the conjugacy of Borel subgroups (maximal closed connected solvable subgroups) and of maximal tori (maximal connected semisimple abelian subgroups) of the linear algebraic group G. Note that a Borel subgroup B = N_G(U) = UT, where U is a maximal unipotent subgroup and T is a maximal torus. If G is defined over an algebraically closed field of characteristic p, then U is a maximal p-subgroup of G, in the sense that U is maximal with the property that every element of U has order a power of p.

Returning to finite group theory, the set of all normalizers and centralizers of nonidentity p-subgroups of the finite group G is called the set of p-local subgroups of G, and the study of G via the analysis of these subgroups is called p-local analysis. The “N-Group Paper” combined with the “Odd Order Paper” completed the classification of finite simple groups all of whose p-local subgroups are solvable groups. The strategy in brief is to choose a prime p and a Sylow p-subgroup P for which the p-rank (the dimension of an elementary abelian p-subgroup, thought of as a vector space over F_p) is as large as possible. A theorem of Burnside, using monomial representations, guarantees that p may be chosen so that either the p-rank is at least 3 or p = 2 and |G| is a multiple of 12. One now studies the normalizers of all nonidentity subgroups of P. If G is going to turn out to be a finite group of Lie type defined over a field of characteristic r, then almost always the chosen p will turn out to be r and the geometry defined by the coset spaces G/M_i, where M_i is a p-local subgroup of G containing P, will be precisely the Tits building for the group G. If there are at least two maximal p-local subgroups M_i containing P, then the Tits building will have rank at least 2 and the geometry will be rich enough to characterize the group G. More accurately, finite buildings of rank at least 3 were classified in a major paper of Tits. Finite rank 2 buildings include all finite projective planes. However, the buildings of rank at least 2 which occur in the context of the simple group classification satisfy a “Moufang condition,” which permits their complete identification. On the other hand, if there is a unique maximal p-local subgroup M of G containing P, then the “geometry” is simply the point set G/M. This leads to a problem in permutation group theory.

Problem: Classify finite transitive permutation groups G in which, for some fixed prime p, every nonidentity p-element fixes exactly one point.

For p = 2, this problem was solved by a combination of independent results of Suzuki and Bender.

Theorem: Let G be a finite transitive permutation group in which every involution (element of order 2) fixes exactly one point. Then either G has 2-rank at most 1 or G ≅ SL(2, 2^n), Sz(2^(2n−1)), or PSU(3, 2^n) for some n ≥ 2.

For groups of odd order, the problem was solved by Feit and Thompson in the “Odd Order Paper.” The theory of group characters plays a major and seemingly unavoidable role in this delicate analysis. The complete problem has been solved as a corollary of the classification of finite simple groups, but an independent solution for p odd remains elusive.

A somewhat different strategy for the classification of finite simple groups was pioneered by Richard Brauer starting around 1950. For groups of even order, Brauer championed an inductive approach focusing on the centralizer C_G(t) of an involution t. A philosophical underpinning for this approach was provided by the following result.

Brauer–Fowler Theorem: Let G be a finite simple group of even order containing the involution t. If |C_G(t)| = c, then |G| < (c^2)!.

The Brauer–Fowler bound is useless in practice, but it does establish that given a finite group H with center of even order, the problem of finding all finite simple groups G containing an involution t with C_G(t) ≅ H is a finite problem. In practice, as Brauer, Janko, and their students demonstrated in the ensuing decades, the problem is not only finite, it is doable. Indeed, a posteriori, C_G(t) usually determines G uniquely. Once Feit and Thompson proved that every nonabelian finite simple group has even order, Brauer’s strategy became an inductive strategy for the remainder of the classification proof. Brauer’s strategy is significantly different from the Feit–Thompson strategy in the following sense: since most primes are odd, a finite simple group of Lie type is almost always defined over a field of characteristic r with r ≠ 2. Thus, for most finite simple groups G, each involution t is a semisimple element in the sense that F*(C_G(t)) is an almost semisimple group. Fortunately, the two approaches mesh and complement each other, and the current proof is a blending of both.

In the course of pursuing Brauer’s strategy, Janko investigated many possible specific structures for the centralizer of an involution in a simple group. One of these structures, C_2 × A_5, led him to a new simple group, J_1, the first sporadic group discovered since Mathieu. Shortly thereafter, Janko discovered two more sporadic groups, J_2 and J_3. The former was soon constructed as a rank-3 permutation group by M. Hall. The involution centralizer approach and the rank-3 permutation group approach were the main tactics leading to the discovery of 21 sporadic groups in the decade 1965–1975.
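
The remark that the Brauer–Fowler bound is useless in practice can be made concrete on the smallest nonabelian simple group. The following is a minimal sketch, not taken from the article: it assumes G = A_5 represented as even permutations of the points 0 to 4, and the helper names compose and is_even are illustrative. It counts the centralizer of one involution by brute force and compares |G| with (c^2)!.

```python
from itertools import permutations
from math import factorial


def compose(p, q):
    """Return the permutation 'p after q' on the points 0..n-1."""
    return tuple(p[q[i]] for i in range(len(q)))


def is_even(p):
    """A permutation is even when it has an even number of inversions."""
    n = len(p)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if p[i] > p[j])
    return inversions % 2 == 0


# The alternating group A5: all even permutations of five points.
A5 = [p for p in permutations(range(5)) if is_even(p)]
identity = tuple(range(5))

# An involution t (element of order 2): the double transposition (0 1)(2 3).
t = (1, 0, 3, 2, 4)
assert t in A5 and t != identity and compose(t, t) == identity

# Centralizer C_G(t): the elements of G that commute with t.
centralizer = [g for g in A5 if compose(g, t) == compose(t, g)]
c = len(centralizer)

# Brauer-Fowler: |G| < (c^2)!
bound = factorial(c * c)
print(f"|G| = {len(A5)}, c = {c}, (c^2)! = {bound}")
```

Here the centralizer has order 4, so the bound is 16!, which exceeds |A_5| = 60 by more than eleven orders of magnitude. The theorem therefore guarantees finiteness of the search but gives no practical handle on it.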

In a different vein, John Leech constructed in 1967 the Leech lattice, the densest sphere-packing lattice in R^24, based on the Steiner system S(5, 8, 24) for M_24. John Conway investigated the automorphism group .0 of the Leech lattice and discovered that .0/{±1} = Co_1 was a new finite simple group, as were two of its subgroups, Co_2 and Co_3. In 1974, pursuing the involution centralizer philosophy, starting with a centralizer C built out of the Leech lattice modulo 2, Λ/2Λ, and Co_1, Bernd Fischer and Robert L. Griess, Jr., independently discovered the largest of the sporadic simple groups, dubbed M, the Monster. The Monster was constructed by Griess as the automorphism group of a 196,884-dimensional nonassociative commutative C-algebra, the Griess algebra. Twenty-one of the 26 sporadic simple groups are contained (as quotients of subgroups) in the Monster. The investigation of numerical mysteries (dubbed Monstrous Moonshine) associated with M led Frenkel, Lepowsky, Meurman, and Borcherds to develop the mathematical theory of vertex operator algebras, objects which were first studied by physicists.

The classification theorem for finite simple groups is the following theorem.

Classification Theorem: Let G be a finite simple group. Then G is isomorphic to a member of one of the following families of simple groups:

1. The cyclic groups of prime order;
2. The alternating groups of degree at least 5;
3. The finite simple groups of Lie type; and
4. The 26 sporadic simple groups.

There are a small number of unusual isomorphisms among the simple groups. For example, A_5 ≅ SL(2, 4) ≅ PSL(2, 5), A_6 ≅ PSL(2, 9) ≅ [Sp(4, 2), Sp(4, 2)], and A_8 ≅ SL(4, 2).

The proof is a massive inductive argument, i.e., one assumes throughout that G is a minimal counterexample to the statement of the theorem, and so every composition factor of every proper subgroup of G (in particular, of every p-local subgroup of G) is one of the listed simple groups. Having chosen a prime p upon which to perform p-local analysis according to either the Feit–Thompson or Brauer strategy, one is confronted by the major problem of attempting to establish that for many of the maximal p-local subgroups H of G, either F*(H) is an almost semisimple group or F*(H) is a p-group. In the latter case we say that H is of parabolic type, since in a finite simple group G of Lie type in characteristic p, all subgroups which contain a Sylow p-normalizer (the so-called parabolic subgroups of G) are of parabolic type. Aschbacher and Timmesfeld were leaders in the analysis of such groups in the 1970s. Later, Stroth, Stellmacher, and Meierfrankenfeld also pursued this vein.

The principal obstruction to establishing this structure theorem for maximal p-locals is the hypothetical existence of certain p′-subgroups (subgroups of order relatively prime to p) whose normalizers contain large p-subgroups. These were dubbed p-signalizers by Thompson, who developed the principal strategies for controlling them. These strategies were honed into a “signalizer functor theory” by Gorenstein and Walter with major contributions by Goldschmidt, Glauberman, Harada, Lyons, McBride, and others.

VII. OTHER TOPICS

Thus far the article has focused on finite groups, with some reference to Lie groups and linear algebraic groups. Certain other classes of groups have played a fundamental role in mathematics. Klein and Poincaré pioneered the study of discrete subgroups of PSL(2, R) and PSL(2, C). For instance, PSL(2, R) acts naturally as fractional linear transformations on the upper half-plane of R^2, which may be identified with the hyperbolic plane with PSL(2, R) acting as orientation-preserving isometries. A discrete subgroup is then a subgroup which does not contain arbitrarily small translations or rotations. A similar interpretation may be given to discrete subgroups of PSL(2, C) viewed as isometries of hyperbolic 3-space. An important class of examples of these discrete subgroups is given by PSL(2, Z) and its congruence subgroups, i.e., the kernels of the natural homomorphisms of PSL(2, Z) onto PSL(2, Z/nZ), where Z/nZ is the ring of integers modulo n. These discrete groups may be regarded as the symmetry groups of tessellations (tilings) of hyperbolic space by congruent (hyperbolic) polygons. This theory has been generalized to the study of arithmetic subgroups of higher-dimensional Lie groups. Like the integers, these groups do not have a composition series and so they cannot be “built up from the bottom” like finite groups: they have no bottom from which to build. They are, however, usually generated by a finite set of elements or proper subgroups and may be viewed as a quotient of the universal or freest object so generated.

Thus, if G is generated by the set of elements S, then G is a quotient of the free group F(S). A relator is a word which becomes 1 when interpreted in G rather than in F(S). In general, if R is a set of relators, then there is a natural homomorphism from F(S)/N(R) onto G, where N(R) is the smallest normal subgroup of F(S) containing all of the members of R. If that homomorphism is an isomorphism, then the generators S and relations R “present” the group G. In general, it is quite difficult to “understand” a group given by generators and relations. In fact, it is a theorem (the “Undecidability of the Word Problem”) that there is no

recursive algorithm for deciding whether a given finitely presented group is the identity group.

A striking exception to the intractability of the “Word Problem” is the family of Coxeter groups C(m_ij) defined by the generators {s_i}, i ∈ I, and relators (s_i s_j)^(m_ij), where M = (m_ij) is a symmetric matrix with m_ii = 1 for all i and with m_ij ∈ {2, 3, ..., ∞} for i ≠ j. Classical examples of Coxeter groups are discrete groups of isometries of spherical, Euclidean, or hyperbolic n-spaces. The spherical Coxeter groups are well understood and form the skeletons of Tits’ spherical BN pairs. Beyond the finite dihedral groups, all but two of the spherical Coxeter groups are finite Weyl groups, whose classification is a key step in the modern proof of the Killing–Cartan classification theorem. The Euclidean Coxeter groups are also well understood and play a major role in the theory of affine buildings and p-adic groups. Closely associated to Coxeter groups are the braid groups and Artin groups, whose generators have infinite order but which satisfy relations similar to the Coxeter relations. They have featured in recent work in numerous areas, including knot theory, singularity theory, and inverse Galois theory, i.e., the problem of finding a polynomial (usually in Q[x]) having a specified Galois group.

The braid groups arise as the fundamental groups of topological spaces, an important concept introduced by Poincaré. The elements of the fundamental group are equivalence classes of loops (closed paths) in the space with a distinguished base point. Multiplication is concatenation of loops and inversion is reversal of direction. Fundamental groups are naturally described via generators and relations, with the generators being certain loops and a relation being a closed path which can be contracted to the base point in the space. For example, if the space X is the union of n circles which intersect only at a common base point, then the fundamental group of X is the free group on n generators. The fundamental group acts naturally on the universal cover X̃ of the space X. In this example, X̃ is a regular tree (graph without circuits) of valency 2n.

The study of fundamental groups of topological spaces attached along embedded subspaces motivates the group-theoretic study of so-called free products with amalgamation, HNN extensions, and group actions on trees, all of which is subsumed in the Bass–Serre theory of graphs of groups. The objective is to understand the universal completion of a graph of groups, i.e., the largest possible group containing certain specified subgroups (attached to the vertices and edges of the graph) subject to certain relating homomorphisms mapping edge groups to vertex groups. The amalgamated product of two groups over a specified common subgroup is a “free” construction, which in all nontrivial cases leads to an infinite completion.

The free product H * K of two groups H and K is the freest group generated by H and K with H ∩ K = {e}. Underlying the Brauer–Fowler theorem on finite simple groups is the elementary fact that the free product C_2 * C_2 is an infinite dihedral group generated by an element x of infinite order and an involution y such that yxy = x^(−1). All finite homomorphic images of C_2 * C_2 are finite dihedral groups (the symmetry groups of regular polygons) or are cyclic of order 1 or 2. By contrast, the free product C_2 * C_3 is isomorphic to PSL(2, Z) and almost every finite simple group occurs as a homomorphic image of C_2 * C_3. PSL(2, Z) is in turn the quotient of the 3-string braid group B_3 by an infinite cyclic central subgroup.

It follows from the theory of fundamental groups and covering spaces that the study of discrete subgroups (at least the torsion-free ones, i.e., those having no nontrivial elements of finite order) of PSL(2, R) and PSL(2, C) is essentially identical to the problem of finding Riemannian metrics of constant (sectional) curvature −1 on manifolds of dimensions 2 and 3, respectively. This suggests that one should be especially interested in studying fundamental groups of negatively curved Riemannian manifolds (or negatively curved spaces). Quite recently, Gromov has developed a theory of “word hyperbolic groups” which encompasses the group-theoretic aspects of negative curvature. For example, free groups as well as the fundamental groups of negatively curved manifolds are all word hyperbolic. In contrast to the general undecidability of the Word Problem, it can be shown that the Word Problem is solvable for word hyperbolic groups. Indeed, this has led to a study of the larger class of “automatic groups” for which the Word Problem is decidable by a finite-state automaton.

Recently there have been projects to develop efficient computer algorithms for the identification of finite groups from data consisting either of a set of generating permutations or a set of generating matrices or an abstract set of generators and relators (a black-box group). Most of these algorithms use the classification of finite simple groups as well as numerous corollaries concerning simple groups with small linear or permutation representations. Computer algorithms played a relatively small role in the classification of finite simple groups. The existence of most of the sporadic simple groups can be established as a corollary of the existence of the Monster, which Griess constructed “by hand.” Several of the sporadic groups were first constructed on a machine, however, and for a few this remains the only proof of their existence.

SEE ALSO THE FOLLOWING ARTICLES

GROUP THEORY, APPLIED • SET THEORY

BIBLIOGRAPHY

Borel, A. (1991). “Linear Algebraic Groups,” 2nd ed., Springer-Verlag, Berlin.
Bridson, M., and Haefliger, A. (1999). “Metric Spaces of Non-positive Curvature,” Grundlehren der mathematischen Wissenschaften 319, Springer-Verlag, Berlin.
Curtis, C., and Reiner, I. (1987). “Methods of Representation Theory I, II,” Wiley-Interscience, New York.
Gorenstein, D. (1982). “Finite Simple Groups: An Introduction to Their Classification,” Plenum, New York.
Gorenstein, D., Lyons, R., and Solomon, R. (1994). “The Classification of the Finite Simple Groups, Numbers 1–3,” AMS Math. Surv. Monogr. 40 (1–3).
Huppert, B., and Blackburn, N. (1982). “Finite Groups II, III,” Springer-Verlag, Berlin.
Knapp, A. (1996). “Lie Groups: Beyond an Introduction,” Birkhäuser, Boston.
Suzuki, M. (1977). “Group Theory I, II,” Springer-Verlag, Berlin.
P1: GLQ/GLT P2: GPJ Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN007C-305 June 29, 2001 17:59

Group Theory, Applied

Nyayapathi V. V. J. Swamy
Satyanarayan Nandi
Oklahoma State University

I. Brief History
II. Group Representation Theory
III. Continuous Lie Groups
IV. Applications in Atomic Physics
V. Applications in Nuclear Physics
VI. Applications in Molecular and Solid-State Physics
VII. Application of Lie Groups in Electrical Engineering
VIII. Applications in Particle Physics
IX. Applications in Geometrical Optics
X. The Renormalization Group

GLOSSARY

Brillouin zone Volume in k space (reciprocal lattice space) bounded by planes of energy discontinuities.
Character Trace of a matrix in a representation.
Color Physical property of the strong interactions [SU(3)] analogous to electric charge in electrodynamics.
Critical point The highest pressure and the highest temperature at which a liquid and gas can coexist in equilibrium. This is the point at which the difference between the density of the liquid phase and that of the gaseous phase becomes zero.
Group Finite or infinite set of elements which have certain properties defining a group.
Isospin Symmetry [SU(2)] which elementary particles obey if the electromagnetic interaction is neglected; this accounts for the charge multiplets which are observed in particle physics.
Lie algebra Certain commutation relations obeyed by the generators of a continuous Lie group.
Normal modes Fundamental vibrations into which any vibrational mode of a molecule can be decomposed.
Particle physics Study of the fundamental constituents of matter and their interactions.
Quark model Model in which all hadrons (strongly interacting particles) are described in terms of fundamental particles called quarks.
Representation Set of matrices which represents the elements of a group.
SU(2), SU(3) Lie groups whose elements are unitary matrices in two and three dimensions with determinant +1.


APPLIED GROUP THEORY is the study of the theory of group representations and its applications in various areas of physics such as atomic physics, molecular physics, solid-state physics, and particle physics. Group theory is essentially a mathematical description of symmetry in nature, and it is very interesting that this symmetry plays an important role in understanding many experimental phenomena in physics.

TABLE I Cayley Multiplication Table of C3v or S3

        E   A   B   C   D   F
    E   E   A   B   C   D   F
    A   A   E   D   F   B   C
    B   B   F   E   D   C   A
    C   C   D   F   E   A   B
    D   D   C   A   B   F   E
    F   F   B   C   A   E   D
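
The group properties claimed for Table I can be checked mechanically. The sketch below is an illustration, not part of the original article: it transcribes the six rows of the table, assumes that the entry in row x and column y is the product xy, and tests closure, identity, inverses, associativity, and commutativity. It also checks that the numerical assignment E = D = F = 1, A = B = C = −1 of Eq. (1) respects the table. The names ROWS, mul, and SIGN are hypothetical.

```python
from itertools import product

# Column labels of Table I, in order, and the six rows transcribed as strings.
ELEMENTS = "EABCDF"
ROWS = {
    "E": "EABCDF",
    "A": "AEDFBC",
    "B": "BFEDCA",
    "C": "CDFEAB",
    "D": "DCABFE",
    "F": "FBCAED",
}


def mul(x, y):
    """Entry of Table I in row x, column y (assumed here to mean the product xy)."""
    return ROWS[x][ELEMENTS.index(y)]


pairs = list(product(ELEMENTS, repeat=2))
triples = list(product(ELEMENTS, repeat=3))

closed = all(mul(x, y) in ELEMENTS for x, y in pairs)
has_identity = all(mul("E", x) == x == mul(x, "E") for x in ELEMENTS)
has_inverses = all(
    any(mul(x, y) == "E" == mul(y, x) for y in ELEMENTS) for x in ELEMENTS
)
associative = all(mul(mul(x, y), z) == mul(x, mul(y, z)) for x, y, z in triples)
abelian = all(mul(x, y) == mul(y, x) for x, y in pairs)

# The numbers of Eq. (1): they multiply in the same way as the table entries.
SIGN = {"E": 1, "D": 1, "F": 1, "A": -1, "B": -1, "C": -1}
sign_respects_table = all(SIGN[x] * SIGN[y] == SIGN[mul(x, y)] for x, y in pairs)

print(closed, has_identity, has_inverses, associative, abelian, sign_respects_table)
```

The table passes every group check but is not commutative: for instance, row A, column B gives D, while row B, column A gives F. The ±1 assignment is therefore a realization of the multiplication table that does not distinguish all six elements.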

I. BRIEF HISTORY
made after the operation B, akin to the left multiplication
Ancient as well as modern works of art and architecture of a matrix B by a matrix A). A function of an independent
tell us that humans have long familiarity with symme- variable x is written F(x). These notations are consistent
try. An Egyptian pyramid, a Greek temple, an Indian Taj with the customary usage in group theory books by physi-
Mahal, and a Chinese pagoda strike us as beautiful exam- cists (e.g., Wigner, 1959). The set of six elements form a
ples of symmetry. There is symmetry in works of nature finite group or order g = 6 (G or S3 or C3v ), and from the
as well, in living organisms, and in people. It is interest- table one can see that it is closed with respect to multi-
ing to note, however, that while knowledge of symmetry plication. There exists an identity element (E) and every
existed for centuries, it was only toward the end of the element has an inverse element such that the product of
eighteenth century that it was realized that symmetry has the element and its inverse (or vice versa) is the identity.
a scientific basis in the mathematical theory of groups. If the multiplication is commutative (i.e., AB = B A) the
Since the mathematical structure and analysis of groups is group is known as an abelian group.
the subject matter of another article in this Encyclopedia, This abstract group multiplication table can be “real-
we will discuss here only the applications of group theory ized” in more than one way by different types of elements
in physics and engineering. with a multiplication rule appropriately defined. For in-
In a series of papers published toward the end of the stance, the following numbers satisfy all the group prop-
nineteenth century and at the beginning of the twen- erties under ordinary multiplications:
tieth century Frobenius and Schur laid the foundation
of the theory of group representations for finite groups. E = 1 = D = F, A = B = C = −1. (1)
The structure and representation of continuous groups
is the work of Lie, Cartan, and Weyl. Wigner was probably
the first to recognize the importance of Frobenius’s work This, for instance, is an abelian group. Another realization
to quantum mechanics and modern physics. He applied of the group table is obtained by elements which are op-
representation theory to physical problems in many areas erations of permutations of three objects and this group is
of physics. Subsequently several others played important known as the symmetric group S3 . Explicitly, the elements
roles in developing and applying group theory as a tool— are
Bargmann, Bethe, Casimir, Gell-Mann, Pauli, to name a E A B
few.
1 2 3 1 2 3 1 2 3
, , ,
1 2 3 1 3 2 3 2 1
II. GROUP REPRESENTATION THEORY (2)
C D F

It is necessary to discuss briefly some important theo- 1 2 3 1 2 3 1 2 3
rems in representation theory, since all the applications in , , .
2 1 3 2 3 1 3 1 2
the physical sciences are based on these theorems. First,
a Cayley multiplication table (Table I) defines a group The permutation D is defined to read “1 is replaced by 2, 2
abstractly. In the table the operation of “multiplication” by 3, and 3 by 1.” AB is defined as the result of the permu-
is to be understood as, for instance, the correspondences tation corresponding to A performed after the operation
AB = D and B A = F. A word of caution is in order here. B. Yet another way of realizing this group, which is of
There is usually a notational difference between mathe- importance in molecular physics, is by choosing for the
maticians and physicists. Here the operation AB is to be elements certain geometric symmetry operations. This
performed from right to left (i.e., the operation A to be group is known as C3v . These symmetry operations are

easily understood by applying them to the equilateral triangle in Fig. 1. In this A, B, C are the vertices, A′, B′, C′ are the midpoints of the opposite sides such that AA′, BB′, CC′ are the medians of the triangle. Let us assume the triangle to be in the XY plane of a rectangular coordinate system whose origin is at the centroid of the triangle O. Here the identity element means “leave it alone.” Elements D, F correspond to rotations through 120° and 240°, respectively, about the Z axis through O perpendicular to the plane of the triangle. These are called C3, C3² operations, the subscript 3 referring to the angle of rotation 2π/3 and the letter C standing for the cyclic axis. A, B, C are operations of reflections in three vertical mirror planes passing through AA′, BB′, CC′ and containing the cyclic axis. Although we labeled the vertices to be able to understand the geometric symmetry operations of rotations and reflections, it is important to note that the operations bring the triangle into coincidence with itself, which implies that all the vertices are identical and hence indistinguishable. The ammonia molecule NH3 has the symmetry of the C3v group. The three hydrogen atoms are at the vertices of the triangle and the nitrogen atom sits somewhere on the cyclic axis and thus all the six geometric operations bring the molecule into coincidence with itself.

FIGURE 1 Equilateral triangle (C3v) with coordinates of vertices. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). “Group Theory Made Easy for Scientists and Engineers,” Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

The groups S3 and C3v are said to be isomorphic, which means that the elements E, A, B, . . . of S3 are in one-to-one correspondence with elements E, σv, σv′, . . . of C3v uniquely such that group multiplication is preserved. AB = D in the case of the group S3 means that the product of the elements corresponding to A and B in C3v, in that order, results in the element corresponding to D. Here multiplication means the group operation. Isomorphic groups have the same Cayley multiplication table. A many-to-one correspondence is known as a homomorphism. In this case more than one element of one group corresponds to the same element in the other group, but products do correspond to products in either group. Thus the group of numbers in (1) is homomorphic to S3 or C3v.

We notice from the Cayley table for S3 that subsets of elements (E), (E, A), and (E, D, F) themselves satisfy all the group properties. These are known as subgroups of the main group, of orders h = 1, 2, and 3, respectively. There is a theorem of Lagrange which says that the order h of a subgroup is a divisor of the order g of the group. Thus a group of order 11 can have only one proper subgroup, of order 1, the trivial group with only the identity element. Given an element A, one defines its conjugate as BAB⁻¹, where B is any element of the group. For instance, in Table I, DAD⁻¹ = DAF = DC = B. Thus B is a conjugate of A, the operation of conjugation
resembling a similarity transformation in matrix theory. A set of mutually conjugate elements is called a class, and a group is divided into classes. Thus, A, B, C form a class because the conjugate of A, XAX⁻¹, is again one of the elements of this set no matter which element of the group X is. For any finite group the identity element forms a class all by itself. A subgroup which has whole classes of the original group as its elements is known as an invariant subgroup. Thus for the above group the subgroup E, D, F is an invariant subgroup consisting of the classes K1(E) and K2(D, F). If a group does not have an invariant subgroup other than itself and the trivial subgroup it is called a simple group. An isomorphism of a group with itself (i.e., a one-to-one correspondence between elements of the group preserving multiplication) is called an automorphism. If this is brought about by conjugation it is called an inner automorphism. For instance, the set E, A, B, C, D, F is mapped into itself, element by element, by the inner automorphism induced by the identity as the conjugating element. It is mapped into E, B, A, C, F, D if the conjugation is done by C. This then is an inner automorphism induced by C.

A. Representations and Characters

If a Cayley multiplication table is realized by choosing the elements as (unitary) matrices this set is known as a representation of the group. Frobenius and Schur did pioneering work in demonstrating quite a few important theorems in representation theory. For instance, the following set of 3 × 3 matrices is a representation of S3 or C3v, of dimension 3:

\[
\begin{aligned}
E &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, &
A &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, &
B &= \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \\
C &= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, &
D &= \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, &
F &= \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.
\end{aligned} \tag{3}
\]

Here obviously “multiplication” means matrix multiplication. It is easy to see, however, that another set of matrices, obtained from the one given through a similarity transformation, is also a representation of the group. Two sets of matrices which differ only through a similarity transformation are said to be equivalent representations of the group. Furthermore, this is known as a reducible representation because the following single matrix S generates an equivalent, block-diagonal representation through a similarity transformation:

\[
S = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ 0 & -1/\sqrt{2} & 1/\sqrt{2} \\ \sqrt{2}/\sqrt{3} & -1/\sqrt{6} & -1/\sqrt{6} \end{pmatrix},
\]

\[
S(A)S^{-1} \equiv A' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
= (1) \oplus \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
B' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & \sqrt{3}/2 \\ 0 & \sqrt{3}/2 & -1/2 \end{pmatrix},
\]

\[
C' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & -\sqrt{3}/2 \\ 0 & -\sqrt{3}/2 & -1/2 \end{pmatrix}, \quad
D' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1/2 & -\sqrt{3}/2 \\ 0 & \sqrt{3}/2 & -1/2 \end{pmatrix}, \quad
F' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1/2 & \sqrt{3}/2 \\ 0 & -\sqrt{3}/2 & -1/2 \end{pmatrix}. \tag{4}
\]

(Boldface letters represent operators; letters with arrows represent vectors.) This latter set is seen to be a direct sum of two matrices of dimensions 1 and 2. It can be shown that the two-dimensional representation cannot further be reduced and is therefore called an irreducible representation. It is the irreducible representation of a group that is of paramount importance in physics. The sums of the diagonal elements of the matrices in an irreducible representation, or the traces of the matrices, are called characters χ. The relevant irreducible representation and characters of S3 are rewritten as

\[
\begin{aligned}
E &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\ \chi = 2; &
A &= \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix},\ \chi = 0; &
B &= \begin{pmatrix} 1/2 & \sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = 0; \\
C &= \begin{pmatrix} 1/2 & -\sqrt{3}/2 \\ -\sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = 0; &
D &= \begin{pmatrix} -1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = -1; &
F &= \begin{pmatrix} -1/2 & \sqrt{3}/2 \\ -\sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = -1.
\end{aligned} \tag{5}
\]

It is interesting to note that the representation can be a homomorphism and need not necessarily be an isomorphism. An important lemma, called Schur’s lemma, says that no matrix other than a multiple of the identity matrix commutes with all the matrices in an irreducible representation. This therefore gives a litmus-type test for checking whether a given representation is reducible or irreducible. One of the interesting representations is known as the regular representation. From the multiplication table for the S3 group of order 6 (Table I) we know, for instance, that DA = C. If we consider this result as a linear combination of all the elements we get

\[ DA = 0E + 0A + 0B + 1C + 0D + 0F. \]

If we express the other operations in the Cayley table (i.e., DB, DC, DD, DF, and DE) in the same way we can write the result in matrix form. The transpose of this matrix is the representation matrix of the element D in the regular representation

\[
(D) = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}. \tag{6}
\]

The matrices representing other elements of S3 are obtained similarly. This is, however, a reducible representation. One matrix, by a similarity transformation, can reduce all the six matrices into block-diagonal form as a direct sum of lower-dimensional matrices. It can be shown that in the reduction every irreducible representation occurs as many times as its dimensionality. This regular representation decomposes into two one-dimensional and two two-dimensional representation matrices, because the S3 group happens to have two one-dimensional and one two-dimensional representation.

We take the symmetric group S4 (permutations on four objects) as an example to illustrate important theorems on representations and characters. The multiplication table for this group of order 24 is given in Table II. An interesting

TABLE II Cayley Multiplication Table for S4

E A B C D F G H I J K L M N O P Q R S T U V W X

E E A B C D F G H I J K L M N O P Q R S T U V W X E
A A B E D F C R U O K N I V J L H T X Q S P W M G A
B B E A F C D X P L N J O W K I U S G T Q H M V R B
C C F D E B A K O U R G P T X H L V J W M I Q S N C
D D C F A E B N L P X R H S G U I W K M V O T Q J D
F F D C B A E J I H G X U Q R P O M N V W L S T K F
G G U S K W I E J F H C T P Q R M N O B L A X D V G
H H T X V L J I E G F S D R P Q N O M K A W C U B H
I I W K S U G H F J E V A N O M R P Q X D T B L C I
J J L V X T H F G E I B W O M N Q R P C U D K A S J
K K I W G S U C R A O E M L V J T X H D P F N B Q K
L L V J T H X P D N B M E K I W G U S R C Q A O F L
M M N O P R Q T S V W L K E A B C F D H G X I J U M
N N O M R Q P D X B L A V I W K S G U F H C J E T N
O O M N Q P R U C K A W B J L V X H T G F S E I D O
P P Q R M O N L B X D T C G U S K I W J E V F H A P
Q Q R P O N M W V S T U X F D C B E A I J K H G L Q
R R P Q N M O A K C U D S H T X V J L E I B G F W R
S S G U I K W V M T Q H R D C F A B E L N J P X O S
T T X H L J V M W Q S P G C F D E A B O K N U R I T
U U S G W I K O A R C Q F X H T J L V N B M D P E U
V V J L H X T S Q W M I N A B E D C F U R G O K P V
W W K I U G S Q T M V O J B E A F D C P X R L N H W
X X H T J V L B N D P F Q U S G W K I A O E R C M X

E A B C D F G H I J K L M N O P Q R S T U V W X

theorem in permutation groups is that the number of classes of the symmetric group Sn (of order n!) is equal to the number of distinct partitions of n. We thus have

\[
\begin{aligned}
K_1 &= (E), & 1 + 1 + 1 + 1 &\quad (1^4), & g_1 &= 1, \\
K_2 &= (C, D, F, G, H, Q), & 1 + 1 + 2 + 0 &\quad (1^2\,2), & g_2 &= 6, \\
K_3 &= (K, L, M), & 2 + 2 + 0 + 0 &\quad (2^2), & g_3 &= 3, \\
K_4 &= (A, B, I, J, N, O, V, W), & 1 + 3 + 0 + 0 &\quad (3\,1), & g_4 &= 8, \\
K_5 &= (P, R, S, T, U, X), & 4 + 0 + 0 + 0 &\quad (4^1), & g_5 &= 6.
\end{aligned} \tag{7}
\]

Here we have denoted the classes by K_i and the number of elements in each class by g_i. The partitions of the number 4 associated with the different classes are shown as (1⁴), etc., which can be taken as an alternative description of classes. According to representation theory:

1. The number of inequivalent irreducible representations is equal to the number of classes. S4 thus has five representations.
2. The sum of squares of the dimensions of the irreducible representations equals the order of the group:

\[ n_1^2 + n_2^2 + n_3^2 + n_4^2 + n_5^2 = g = 24 \quad \text{for the } S_4 \text{ group.} \]
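Both statements lend themselves to a quick enumeration. The sketch below is an illustration rather than anything taken from the article: it lists the partitions of n, whose count gives the number of classes of Sn, and searches for all sets of positive integer dimensions whose squares add up to the group order. The function names `partitions` and `dimension_sets` are hypothetical.

```python
def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield (part,) + rest


def dimension_sets(order, count, smallest=1):
    """Yield non-decreasing tuples of `count` positive integers
    whose squares sum to `order`."""
    if count == 0:
        if order == 0:
            yield ()
        return
    d = smallest
    # Every remaining dimension is at least d, so count * d^2 must not exceed order.
    while count * d * d <= order:
        for rest in dimension_sets(order - d * d, count - 1, d):
            yield (d,) + rest
        d += 1


# Number of classes of S4 = number of partitions of 4.
s4_classes = len(list(partitions(4)))

# Candidate dimensions for the irreducible representations of S4 (order 24).
s4_dimensions = list(dimension_sets(24, s4_classes))
```

For S4 the search leaves a single candidate set of dimensions; for S3 (order 6, three classes) the same functions give the dimensions 1, 1, 2 used earlier.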

D (2) : →
 √   √   √ 
−i/2 i/ 2 −i/2 −1/2 −1/ 2 −1/2 −i/2 i/2
1/ 2
 √ √   √ √  √ √
1/ 2 0 −1/ 2 , −1/ 2 0 1/ 2 , −i/ 2 0 i/ 2 ,
D(A) = √ D(F) = √ D(J ) = √
i/2 i/ 2 i/2 −1/2 1/ 2 −1/2 −i/2 −1/ 2 i/2
χ =0 χ = −1 χ =0
 √     
i/2 1/ 2 −i/2 0 0 −i −1 0 0
 √ √     
−i/ 2 0 −i/ 2 , 0 −1 0  ,  0 1 0 ,
D(B) = √ D(G) = D(K ) =
i/2 −1/ 2 −i/2 i 0 0 0 0 −1
χ =0 χ = −1 χ = −1
   √   
0 0 i −1/2 i/ 2 1/2 0 0 −1
   √ √   
 0 −1 0 , −i/ 2 0 −i/ 2 ,  0 −1 0  ,
D(C) = D(H ) = √ D(L) =
−i 0 0 1/2 i/ 2 −1/2 −1 0 0
χ = −1 χ = −1 χ = −1
 √   √   
−1/2 −i/ 2 1/2 i/2 −i/ 2 i/2 0 0 1
 √ √   √ √   
i/ 2 0 i/ 2  , 1/ 2 0 −1/ 2 , 0 −1 0 ,
D(D) = √ D(I ) = √ D(M) =
1/2 −i/ 2 −1/2 −i/2 −i/ 2 −i/2 1 0 0
χ = −1 χ =0 χ = −1
 √   √   √ 
i/2 i/ 2 i/2 1/2 −i/ 2 −1/2 −i/2 −i/ 2 −i/2
 √ √   √ √   √ √ 
−1/ 2 0 1/ 2 , −i/ 2 0 −i/ 2 , −1/ 2 0 1/ 2 ,
D(N ) = √ D(R) = √ D(V ) = √
−i/2 i/ 2 −i/2 −1/2 −i/ 2 1/2 i/2 −i/ 2 i/2
χ =0 χ =1 χ =0
 √   √   √ 
i/2 −1/ 2 −i/2 1/2 i/ 2 −1/2 −i/2 −1/ 2 i/2
 √ √   √ √   √ √ 
i/ 2 0 i/ 2 , 1/ 2 0 i/ 2  , −i/ 2 0 −i/ 2 ,
D(O) = √ D(S) = √ D(W ) = √
i/2 1/ 2 −i/2 −1/2 i/ 2 1/2 −i/2 1/ 2 i/2
χ =0 χ =1 χ =0
     √ 
−i 0 0 i 0 0 1/2 −1/ 2 1/2
     √ √ 
 0 1 0 , 0 1 0  , 1/ 2 0 −1/ 2 ,
D(P) = D(T ) = D(X ) = √
0 0 i 0 0 −i 1/2 1/ 2 1/2
χ =1 χ =1 χ =1
 √   √   
−1/2 +1/ 2 −1/2 1/2 1/ 2 1/2 1 0 0
 √ √   √ √   
1/ 2 −1/ 2 , −1/ 2 1/ 2 , 0 1 0 .
(E) = 
0 0
D(Q) = √ D(U ) = √
−1/2 −1/ 2 −1/2 1/2 −1/ 2 1/2 0 0 1
χ = −1 χ =1 χ =3

This equation has the solution 1² + 1² + 2² + 3² + 3².

Thus S4 has two one-dimensional, two three-dimensional, and one two-dimensional representations. All finite groups have a trivial one-dimensional representation where each element is represented by 1. One set of the three-dimensional representation matrices of S4 is given on the previous page. Swamy and Samuel give all the representation matrices in their work. It is important to note that, while the inequivalent representations will be finite in number, there will be an infinite number of equivalent representations, each differing from the other only by a similarity transformation. We use the notation $D^{(j)}$ for the jth irreducible representation, $D^{(j)}(R)$ for the matrix representing the general group element R in the jth representation, and $D^{(j)}_{mm'}(R)$ for the mm′th matrix element of that particular matrix; $\chi^{(j)}(R)$ is the character of the element R in the jth irreducible representation.

All the elements of one class have the same character in a given irreducible representation. Table III gives the classwise character table for S4. Two important theorems in representation theory are summarized in the formulas

\[
\sum_R D^{(\mu)*}_{il}(R)\, D^{(\nu)}_{jm}(R) = (g/n_\mu)\, \delta_{\mu\nu}\, \delta_{ij}\, \delta_{lm},
\qquad
\sum_R \chi^{(\mu)}(R)\, \chi^{(\nu)*}(R) = g\, \delta_{\mu\nu}, \tag{8}
\]

or, equivalently,

\[ \sum_{i=1}^{k} \chi_i^{(\mu)*}\, \chi_i^{(\nu)}\, g_i = g\, \delta_{\mu\nu}. \]

TABLE III Character Table of S4

          K1        K2        K3        K4        K5
          g1 = 1    g2 = 6    g3 = 3    g4 = 8    g5 = 6
D(1)      1         −1        1         1         −1
D(2)      3         −1        −1        0         1
D(3)      2         0         2         −1        0
D(4)      3         1         −1        0         −1
D(5)      1         1         1         1         1

Here µ and ν label the different irreducible representations, R is a general group element, n_µ the dimensionality of the µth representation, g_i the number of elements in the ith class, and g the total number of elements or the order of the group. In other words, the above theorems go to show that, treated as vectors in the space of group elements, the matrix elements are orthogonal. Similarly, characters $\chi^{(j)}(R)$ of the irreducible representations form an orthogonal vector system in the same space of group elements. It is easy to verify the above numerically with the help of the representation matrices for S4 and its character table (Table III). The third relation above shows that normalized characters $\sqrt{g_i/g}\,\chi_i^{(j)}$ form an orthonormal vector system in the k-dimensional space of classes. Various prescriptions exist for calculating the characters of a finite group, and it is much easier to determine characters in many instances than the representation matrices themselves. Frobenius developed a systematic algebraic procedure for determining the characters of any permutation group. If we know the characters of a permutation group we can infer therefrom the characters of all finite groups because of a theorem in group theory which says that any finite group is isomorphic to a subgroup of some permutation or symmetric group. Isomorphic groups have the same character table. One of the interesting prescriptions for obtaining characters is the class product relation

\[ k_i k_j = \sum_l C_{ijl}\, k_l, \tag{9} \]

where the k_i are classes and the summation is done over all the classes. From this can be derived

\[ g_i\, g_j\, \chi_i^{(\mu)} \chi_j^{(\mu)} = n_\mu \sum_l C_{ijl}\, g_l\, \chi_l^{(\mu)}, \tag{10} \]

where the C coefficients are identical to the ones in Eq. (9). To illustrate the first identity let us take classes k2 and k3 of S4:

\[
\begin{aligned}
k_3 k_2 &= (K, L, M)(C, D, F, G, H, Q) \\
&= (G, S, U, C, R, X),\ (T, H, X, P, D, U),\ (P, R, Q, T, S, F) \\
&= k_2 + 2k_5.
\end{aligned}
\]

This gives for the values of the C coefficients

\[ C_{321} = 0, \quad C_{322} = 1, \quad C_{323} = 0, \quad C_{324} = 0, \quad C_{325} = 2. \tag{11} \]

With these numbers determined it is a trivial exercise to check (10) from the character table for S4; for D(2), for example, both sides equal 18. Frobenius also gave a prescription for determining the characters of a group from the known characters of its subgroups. This is discussed lucidly in the work of Littlewood.

B. Construction of Representations

The construction of the representation of a finite group depends on the choice of a suitable set of basis functions, and the dimensionality of the representation is the number of basis functions chosen. If ψ1, ψ2, . . . , ψn are a set of n

TABLE IV O_R x, O_R y, O_R z

           E        F (132)               D (123)               A (23)    C (12)                B (13)
           E X_i    C3² X_i               C3 X_i                σv X_i    σv′ X_i               σv″ X_i
X1 = X     X        −(1/2)x − (√3/2)y     −(1/2)x + (√3/2)y     −X        (1/2)x − (√3/2)y      (1/2)x + (√3/2)y
X2 = Y     Y        (√3/2)x − (1/2)y      −(√3/2)x − (1/2)y     Y         −(√3/2)x − (1/2)y     (√3/2)x − (1/2)y
X3 = Z     Z        Z                     Z                     Z         Z                     Z

basis functions, the fundamental formula of representation theory then is expressed as

\[ O_R \psi_\nu = \sum_{\mu=1}^{n} D_{\mu\nu}(R)\, \psi_\mu. \tag{12} \]

This means that the result of a group operation O_R (element of the group) on one of the basis functions is a linear combination of the basis functions and the matrix of the coefficients in the linear combination essentially determines the representation. We will illustrate this for the C3v group. In Fig. 1 let us choose a rectangular coordinate system with origin at the centroid O and the Y axis through the median AA′ with the X axis perpendicular to it in its plane. The Z axis will be pointing up from the origin perpendicular to the plane of the figure. The vertex A will then have the coordinates (0, 1, 0), and the vertices B and C lie at (±√3/2, −1/2, 0). In Table IV we gather the results of making all the six operations of the group C3v on x, y, and z. From formula (12) and the table we see that if we choose z as one basis function, we have a one-dimensional representation with the number 1 representing all the elements. In the character table (Table V) this is shown as the A1 representation, which means that the basis function is symmetric with respect to rotation about the cyclic axis. If we choose a pair of functions x = ψ1, y = ψ2 in that order we obtain the two-dimensional (E representation) irreducible representation

\[
\begin{aligned}
(E) &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\ \chi = 2; &
(A) &= \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix},\ \chi = 0; \\
(B) &= \begin{pmatrix} 1/2 & \sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = 0; &
(C) &= \begin{pmatrix} 1/2 & -\sqrt{3}/2 \\ -\sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = 0; \\
(D) &= \begin{pmatrix} -1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = -1; &
(F) &= \begin{pmatrix} -1/2 & \sqrt{3}/2 \\ -\sqrt{3}/2 & -1/2 \end{pmatrix},\ \chi = -1.
\end{aligned} \tag{13}
\]

The characters are shown with each matrix, and Table V gives all the characters classwise. x, y are written against the E representation since these transform accordingly. x is said to belong to the first row of the representation $D^{(E)}$ and y to its second row. Since the A2 representation is also one-dimensional, the matrix elements (numbers which are also the characters) are easily obtained by applying the orthogonality relations either to the matrix elements of different irreducible representations or to the characters.

TABLE V Character Table of S3 or C3v

           K1 (1³) E    K2 (21) (A, B, C)    K3 (3) (D, F)
           g1 = 1       g2 = 3               g3 = 2
D(A1)      1            1                    1                z
D(A2)      1            −1                   1
D(E)       2            0                    −1               (x, y)

It is to be noted that the choice of x and y for the two-dimensional representation is fortuitous inasmuch as it straightaway generated the representation. However, there is a general prescription for obtaining the basis functions starting with any arbitrary function by means of projection operators. The formula for this projection operator is

\[ p_i^{(\mu)} = \frac{n_\mu}{g} \sum_R D_{ii}^{(\mu)*}(R)\, O_R, \tag{14} \]

\[ p_i^2 = p_i, \qquad p_i\, p_j = 0 \quad (i \neq j), \tag{14a} \]

and for the two-dimensional representation these are explicitly

\[
\begin{aligned}
p_1 &= \tfrac{1}{3}\left( E - \tfrac{1}{2} D - \tfrac{1}{2} F - A + \tfrac{1}{2} B + \tfrac{1}{2} C \right), \\
p_2 &= \tfrac{1}{3}\left( E - \tfrac{1}{2} D - \tfrac{1}{2} F + A - \tfrac{1}{2} B - \tfrac{1}{2} C \right).
\end{aligned} \tag{15}
\]

If we choose the arbitrary function φ(x, y, z) = x + y + z, then it is easily seen from Table IV that

\[ p_1 \varphi = x, \qquad p_2 \varphi = y, \qquad \text{or} \qquad \psi_i^{(\nu)} = p_i^{(\nu)} \varphi. \tag{16} \]

We will conclude this discussion of representation theory by introducing the concept of direct product of two groups. Let a group G1 be of order g1 with elements A1, A2, . . . , A_{g1} and another group G2 of order g2 with elements B1, B2, . . . , B_{g2}. A direct product group G1 × G2

is defined if the elements of G1 commute with the elements of G2. This group consists of the ordered pairs of elements A1B1, A1B2, . . . , and will be of the order g1g2. The multiplication rule for the elements of the direct product follows simply from the multiplication rules for each group. Thus if in group G1, A1A2 = A5 and in group G2, B1B2 = B7, then in the direct product group (A1B1) × (A2B2) will be equal to (A1A2) × (B1B2), the element corresponding to A5B7. For example, C3h is a direct product group of the cyclic group C3 (with elements E, C3, and C3²) and the group Ch consisting of the identity element and reflection in a horizontal plane. The rotation–reflection group in Lie groups is a direct product of the rotation group and the inversion group (parity operation). The direct products of some Lie groups decompose into a direct sum of irreducible components. A well-known theorem, called the Clebsch–Gordan theorem, prescribes how to construct basis functions for their irreducible representations (IRs) from the basis functions of the IRs of the factors making up the direct product. The characters of the direct product groups can be calculated easily from the characters of the two “factors” according to the formula

\[ \chi^{(\mu)}(A_1)\, \chi^{(\nu)}(B_2) = \chi^{(\mu \times \nu)}(A_1 B_2). \tag{17} \]

III. CONTINUOUS LIE GROUPS

The theory of continuous or topological groups is due to the Norwegian mathematician Sophus Lie and hence these are familiarly known as Lie groups. In these groups the group element is a continuous invertible transformation of a set of variables and the transformations satisfy the group postulates. For instance, the fundamental property of a group being closed with respect to multiplication here means that the transformation of a transformation is another transformation, and the identity element means a mapping of the variables into themselves. Translations and rotations of a coordinate system in three- or four-dimensional space are examples of these transformations which form a group.

A continuous transformation of a set of variables is best described in terms of parameters which can take on continuous values in a finite (compact) or infinite (noncompact) range, a certain equilibrium value of the parameters corresponding to the identity transformation. If x1, x2, x3, . . . , xn are the variables and a1, a2, . . . , ak are the parameters, we can write the transformation as

\[
\begin{aligned}
x_i' &= f_i(x_i, a_r), \qquad i = 1, \ldots, n, \quad r = 1, \ldots, k, \\
x_i' &= x_i \quad \text{(identity)} \quad \text{for } a_1 = a_1^0, \text{ etc.,}
\end{aligned} \tag{18}
\]

and further, for any arbitrary functions of the variables we have

\[ \varphi(x_i') = \varphi(f_i). \tag{19} \]

While the precise nature of the finite transformations defines the Lie group, Lie proved that these transformations can be built up from the identity (since they are continuous) by means of infinitesimal generators X_k, there being in general as many generators as essential parameters. According to Lie,

\[ X_\alpha = \sum_{i=1}^{n} \left( \frac{\partial x_i'}{\partial a_\alpha} \right)_{a_\alpha^0} \frac{\partial}{\partial x_i}, \tag{20} \]

\[ x_i' = \left\{ e^{\varepsilon_1 X_1 + \varepsilon_2 X_2 + \cdots + \varepsilon_n X_n} \right\} x_i, \tag{21} \]

and

\[ \varphi(x_i') = \left( e^{\sum_i \varepsilon_i X_i} \right) \varphi(x_i). \tag{22} \]

The interesting thing about the Lie group is that it is uniquely defined in terms of certain structure relations that exist among these derived infinitesimal generators. These are

\[ [X_\chi, X_\mu] = \sum_\lambda C^{\lambda}_{\chi\mu}\, X_\lambda, \qquad [A, B] \equiv AB - BA, \tag{23} \]

where $C^{\lambda}_{\chi\mu}$ are called structure constants. Knowing the structure constants is synonymous with knowing the group of finite transformations f_i. The above relations are usually referred to as the Lie algebra of the generators, and often a Lie group is defined in terms of its Lie algebra. Sometimes a subset of the operators X_α itself satisfies a relation of the above type, and in this case we say this smaller set of generators constitutes a subgroup of the Lie group. In quantum mechanics and in particle physics there is a lot of interest in a certain category of Lie groups known as semisimple groups. The French mathematician E. Cartan, to whom we owe several developments of the original Lie theory, established the criterion for a Lie group to be semisimple. If we define

\[ g_{\mu\nu} = \sum_{\alpha\beta} C^{\beta}_{\mu\alpha}\, C^{\alpha}_{\nu\beta}, \tag{24} \]

then $\det(g_{\mu\nu})$ is the determinant of the matrix whose ordered coefficients are $g_{\mu\nu}$, and nonvanishing of this determinant is the condition for the group to be semisimple. Casimir has shown that in the case of such groups there exists a certain function of the infinitesimal generators, called the Casimir operator, that commutes with every generator:

\[ [C, X_\alpha] = 0 \quad \text{for all } \alpha, \qquad C = \sum_{\mu\nu} g^{\mu\nu}\, X_\mu X_\nu, \tag{25} \]

where $g^{\mu\nu}$ is the matrix inverse to $g_{\mu\nu}$. Such an operator which commutes with all the generators is called an invariant of the Lie group, and this plays an important

role in understanding the structure of the group and also as

in constructing its irreducible representations.
xi = ei εi Xi X i = et3 X 3 et2 X 2 et1 X 1 X i . (28)
This is a semisimple group with gµν being a diagonal unit
A. Three-Dimensional Rotation Group
matrix and the Casimir operator, the invariant of the group,
As an application of the preceding ideas we will discuss the is simply the sum of the squares of the generators,
three-dimensional rotation group SO(3) which describes C = X21 + X22 + X33 . (29)
one of the fascinating symmetries in physics. If a sphere
is rotated in space about its radius it is brought into coincidence with itself, and this is known as spherical symmetry. Spherical symmetry (rotational invariance) has experimentally interesting consequences in classical as well as quantum physics. Rotational invariance implies in general conservation of angular momentum. In classical physics, for instance, the latter conservation principle explains why a figure skater draws her outstretched arms closer to her body to increase her speed of gyration, or why a cat always lands on its feet when it falls! Three-dimensional rotations form a Lie group called the SO(3) group.

$$
\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} =
\begin{pmatrix}
\cos t_2 \cos t_3 & -\cos t_2 \sin t_3 & \sin t_2 \\
\cos t_1 \sin t_3 + \sin t_1 \sin t_2 \cos t_3 & \cos t_1 \cos t_3 - \sin t_1 \sin t_2 \sin t_3 & -\sin t_1 \cos t_2 \\
\sin t_1 \sin t_3 - \cos t_1 \sin t_2 \cos t_3 & \sin t_1 \cos t_3 + \cos t_1 \sin t_2 \sin t_3 & \cos t_1 \cos t_2
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
\tag{26}
$$

$$
x'^2 + y'^2 + z'^2 = x^2 + y^2 + z^2 .
$$

The way to describe rotations of a coordinate system mathematically in three-dimensional space was shown by Euler, centuries ago, who introduced three real parameters called the Eulerian angles. An alternative and equivalent description is given in Eq. (26). Here $t_i$ are the parameters of the Lie group and these are naturally functions of the Eulerian angles. Following the Lie prescription, the infinitesimal generators are seen to be

$$
\begin{aligned}
X_{t_1} \equiv X_1 &= z\frac{\partial}{\partial y} - y\frac{\partial}{\partial z}, \\
X_2 &= x\frac{\partial}{\partial z} - z\frac{\partial}{\partial x}, \\
X_3 &= y\frac{\partial}{\partial x} - x\frac{\partial}{\partial y},
\end{aligned}
\tag{27}
$$

and the associated Lie algebra is

$$
[X_1, X_2] = X_3 \quad \text{and cyclically.} \tag{27a}
$$

This relation is written somewhat cryptically as

$$
[X_i, X_j] = \sum_k \varepsilon_{ijk} X_k , \tag{27b}
$$

where the Levi–Civita symbol $\varepsilon_{ijk}$ assumes the value 0 whenever any two indices are equal, equals +1 when $ijk$ is an even permutation of 123, and equals −1 for an odd permutation. The finite transformation (26) can be obtained

In quantum mechanics this operator is related to the square of the angular momentum, which is one of the experimentally observable quantities.

The spherical harmonics $Y_{lm}$ defined on the surface of the unit sphere and originally discovered in the context of solving the Laplace equation are eigenfunctions of this Casimir operator, that is, they satisfy the equation

$$
C\,Y_{lm}(\theta, \phi) = -l(l + 1)\,Y_{lm}(\theta, \phi). \tag{30}
$$

Here $l$ is any nonnegative integer and for a given $l$, $m$ can assume any integral value between $-l$ and $+l$, a total of $2l + 1$ values. Thus for a given $l$ there are $2l + 1$ spherical harmonics and these are the basis functions which generate the odd $(2l + 1)$-dimensional irreducible representations of the three-dimensional rotation group. If a dynamical system in quantum mechanics has a certain symmetry, or more precisely if the Hamiltonian of the dynamical system is invariant to a certain set of group operations, then its solutions generate the representations of that group. In particular, if the dynamical system has spherical symmetry, which means that the Hamiltonian has rotational invariance, the spherical harmonics must be a factor in its solution. The Laplace operator is indeed rotationally invariant, and its solutions are either

$$
r^{l}\,Y_{lm} \quad \text{or} \quad r^{-l-1}\,Y_{lm}. \tag{31}
$$

B. SU(2) Covering Group of the Rotation Group

The even-dimensional representations of SO(3) were first discovered by Weyl. He showed that the covering group of this rotation group is SU(2), which is the unitary unimodular group in two dimensions. This group played a crucial role in characterizing the spin properties of particles such as electrons and protons (fermions), and recently it has proved to be one of the cornerstones of a theory that led to the discovery of the $W^{\pm}$ and $Z^{0}$ bosons.
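The parametrization of Eq. (26) and the algebra (27a) can be checked numerically. The following is a minimal sketch in Python, not part of the original treatment. The helper names (`rotation`, `generator`, `commutator`) are illustrative assumptions, the (2,3) matrix element is taken as −sin t1 cos t2 because orthogonality requires that sign, and the generators used here are the 3 × 3 matrix counterparts of the differential operators of Eq. (27), obtained by differentiating the finite rotation at the identity.

```python
import math

def rotation(t1, t2, t3):
    """Rotation matrix in the three-parameter form of Eq. (26)."""
    c1, s1 = math.cos(t1), math.sin(t1)
    c2, s2 = math.cos(t2), math.sin(t2)
    c3, s3 = math.cos(t3), math.sin(t3)
    return [
        [c2 * c3, -c2 * s3, s2],
        [c1 * s3 + s1 * s2 * c3, c1 * c3 - s1 * s2 * s3, -s1 * c2],
        [s1 * s3 - c1 * s2 * c3, s1 * c3 + c1 * s2 * s3, c1 * c2],
    ]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def close(a, b, tol=1e-6):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(3) for j in range(3))

def generator(k, eps=1e-6):
    """Central-difference derivative of rotation() in parameter k at the identity."""
    plus, minus = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    plus[k], minus[k] = eps, -eps
    rp, rm = rotation(*plus), rotation(*minus)
    return [[(rp[i][j] - rm[i][j]) / (2 * eps) for j in range(3)]
            for i in range(3)]

def commutator(a, b):
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(3)] for i in range(3)]

IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
R = rotation(0.3, -1.1, 2.0)          # arbitrary test parameters
G1, G2, G3 = (generator(k) for k in range(3))

# Orthogonality: R R^T = 1.
is_orthogonal = close(matmul(R, transpose(R)), IDENTITY)

# Invariance of x^2 + y^2 + z^2 for an arbitrary vector.
x, y, z = 1.0, 2.0, 3.0
xp, yp, zp = (R[i][0] * x + R[i][1] * y + R[i][2] * z for i in range(3))
norm_preserved = abs(xp**2 + yp**2 + zp**2 - (x**2 + y**2 + z**2)) < 1e-9

# Lie algebra: [G1, G2] = G3 and cyclic permutations.
algebra_closes = (close(commutator(G1, G2), G3)
                  and close(commutator(G2, G3), G1)
                  and close(commutator(G3, G1), G2))

print(is_orthogonal, norm_preserved, algebra_closes)
```

All three flags should come out true for any choice of the parameters; flipping the sign of the (2,3) element makes the orthogonality check fail.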

A general element of the SU(2) group can be written as a 2 × 2 unitary matrix of determinant +1,

$$
U = \begin{pmatrix} \alpha & \beta \\ -\beta^{*} & \alpha^{*} \end{pmatrix}, \qquad \alpha\alpha^{*} + \beta\beta^{*} = 1. \tag{32}
$$

Here the asterisk stands for the complex conjugate. This is called a covering group of SO(3) because it not only induces three-dimensional rotations, it gives the odd- as well as even-dimensional representations of the latter. For instance, a similarity transformation with $U$ of the matrix

$$
\vec{\sigma} \cdot \vec{r} = \begin{pmatrix} z & x - iy \\ x + iy & -z \end{pmatrix} \tag{33}
$$

gives

$$
\vec{\sigma} \cdot \vec{r}\,' = U^{\dagger}\,(\vec{\sigma} \cdot \vec{r})\,U. \tag{34}
$$

$\vec{r}\,'(x', y', z')$ is related to $\vec{r}(x, y, z)$ through $\alpha$, $\beta$ and their complex conjugates. From the theory of determinants we know that $\vec{\sigma} \cdot \vec{r}\,'$ and $\vec{\sigma} \cdot \vec{r}$ have the same determinant, or

$$
x'^2 + y'^2 + z'^2 = x^2 + y^2 + z^2 .
$$

Wigner has given the explicit connection between the complex $\alpha$, $\beta$ of SU(2) and the Eulerian angles $\alpha$, $\beta$, $\gamma$ which characterize the three-dimensional rotations. It is straightforward algebra to establish the relation between the SU(2) elements and the $t_i$ of (26). The Lie algebra satisfied by the generators of SU(2) is isomorphic to that of the SO(3) group. The Pauli spin operators that describe the magnetic electron in atomic physics, the Cayley–Klein operators that describe rigid body spin in classical mechanics, and quaternions in vector space theory all have algebras similar to the SU(2) algebra. The basis functions which generate the irreducible representations of SU(2) are known as monomials or tensors. A typical even-dimensional representation of SU(2) is given in Eq. (35). Some of the properties of SU(2) groups will be discussed elsewhere in this article in the context of applications in particle physics.

$$
D^{1/2}_{m'm}(\alpha\beta\gamma) =
\begin{pmatrix}
e^{-i(\alpha/2)} \cos(\beta/2)\, e^{-i(\gamma/2)} & -e^{-i(\alpha/2)} \sin(\beta/2)\, e^{i(\gamma/2)} \\
e^{i(\alpha/2)} \sin(\beta/2)\, e^{-i(\gamma/2)} & e^{i(\alpha/2)} \cos(\beta/2)\, e^{i(\gamma/2)}
\end{pmatrix}.
\tag{35}
$$

C. Four-Dimensional Rotation Group and Homogeneous Lorentz Group

The extension of the rotation group to four dimensions is straightforward but not trivial. The SO(4) group elements are the linear transformations that ensure

$$
x_1'^2 + x_2'^2 + x_3'^2 + x_4'^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 , \tag{36}
$$

where each element of the group transforms $x_i$ into $x_i'$:

$$
x'_\mu = \sum_\lambda b_{\mu\lambda}\, x_\lambda . \tag{37}
$$

The transformation matrix $b_{\mu\lambda}$, an element of the SO(4) group, can be expressed in terms of six real parameters $\alpha, \beta, \gamma, \alpha', \beta', \gamma'$. Forsyth has shown that the matrix elements $b_{\mu\lambda}$ are somewhat complicated trigonometric functions of these parameters. This matrix, as well as the element of SO(3), is an orthogonal matrix. The Lie algebra of the six infinitesimal generators is best given in terms of two sets of operators $X_1 = L_1$, $X_2 = L_2$, $X_3 = L_3$, and $X_4 = A_1$, $X_5 = A_2$, $X_6 = A_3$:

$$
\begin{aligned}
[L_i, L_j] &= \sum_k \varepsilon_{ijk} L_k , \\
[A_i, A_j] &= \sum_k \varepsilon_{ijk} L_k , \\
[L_i, A_j] &= \sum_k \varepsilon_{ijk} A_k .
\end{aligned}
\tag{38}
$$

These relations identify the SO(4) group, and any six operators which obey the same commutation relations as the $X_i$ generate a group isomorphic to SO(4). It is interesting to note that the $L_j$ themselves form a closed Lie algebra isomorphic to that of SO(3). This group is then a subgroup of SO(4). The SO(4) group has two invariant operators,

$$
\begin{aligned}
F &= \sum_i \left( L_i^2 + A_i^2 \right), \\
G &= L_1 A_1 + L_2 A_2 + L_3 A_3 .
\end{aligned}
\tag{39}
$$

This group plays an important role in understanding the bound states of the nonrelativistic hydrogen atom as well as in particle physics.

If instead of Eq. (36) we have transformations which leave

$$
X_1^2 + X_2^2 + X_3^2 - X_4^2 \tag{40}
$$

invariant, then these form the elements of the homogeneous Lorentz group, of fundamental importance in the special theory of relativity and elementary particle physics. We have explicitly

$$
X_\mu = \sum_\lambda a_{\mu\lambda}\, x_\lambda . \tag{41}
$$

The $a_{\mu\lambda}$ are once again expressed in terms of six real parameters. How algebraically complicated these functions, elements of $a_{\mu\lambda}$, are can be judged from the typical element

$$
\begin{aligned}
a_{12} = {} & \sin\gamma \cos\gamma \,\{ \cosh\alpha' \cos^2\beta - \cos\alpha \cosh^2\beta' + \cos\alpha \sin^2\beta + \cosh\alpha' \sinh^2\beta' \} \\
& - \sin\alpha \sin\beta \cosh\beta' + \sinh\alpha' \sinh\beta' \cos\beta .
\end{aligned}
\tag{42}
$$

For instance, when $\alpha = 1$, $\beta = 0.7$, $\gamma = 0.35$, $\alpha' = 0.5$, $\beta' = 1.4$, and $\gamma' = 0.1$, the $a_{\mu\lambda}$ becomes a numerical matrix, the element of the homogeneous Lorentz group, as given in Eq. (43). The

$$
a_{\mu\lambda} =
\begin{pmatrix}
0.5930545 & 0.3897004 & -0.9044001 & 0.5670270 \\
1.2037232 & -1.2987247 & 1.5783052 & -2.1509726 \\
0.5232450 & -2.2304747 & 0.7592345 & -2.1966429 \\
-1.0365561 & 2.4111345 & -1.6986536 & 3.2822924
\end{pmatrix}
\tag{43}
$$

infinitesimal generators of this group and the associated Lie algebra are

$$
\begin{aligned}
X_1 &= y\frac{\partial}{\partial z} - z\frac{\partial}{\partial y} \equiv L_1 , \\
X_2 &= z\frac{\partial}{\partial x} - x\frac{\partial}{\partial z} \equiv L_2 , \\
X_3 &= x\frac{\partial}{\partial y} - y\frac{\partial}{\partial x} \equiv L_3 , \\
X_4 &= t\frac{\partial}{\partial x} + x\frac{\partial}{\partial t} \equiv A_1 , \\
X_5 &= t\frac{\partial}{\partial y} + y\frac{\partial}{\partial t} \equiv A_2 , \\
X_6 &= t\frac{\partial}{\partial z} + z\frac{\partial}{\partial t} \equiv A_3 , \qquad (t \equiv x_4);
\end{aligned}
$$

$$
\begin{aligned}
[L_i, L_j] &= -\sum_k \varepsilon_{ijk} L_k , \\
[A_i, A_j] &= \sum_k \varepsilon_{ijk} L_k , \\
[L_i, A_j] &= -\sum_k \varepsilon_{ijk} A_k .
\end{aligned}
\tag{44}
$$

When three of the parameters are set equal to zero we have the "Lorentz rotations" or the Lorentz transformation, where one frame of reference moves with respect to another frame with constant velocity $v$ in an arbitrary direction. In this case the general element reduces to Eq. (45a) below.

$$
a_{\mu\lambda} =
\begin{pmatrix}
\cosh\alpha' & \sinh\alpha' \sinh\beta' & \sinh\alpha' \cosh\beta' \sinh\gamma' & \sinh\alpha' \cosh\beta' \cosh\gamma' \\[4pt]
-\sinh\alpha' \sinh\beta' & \cosh\alpha' + (1 - \cosh\alpha') \cosh^2\beta' & (1 - \cosh\alpha') \sinh\beta' \cdot \cosh\beta' \sinh\gamma' & (1 - \cosh\alpha') \cdot \sinh\beta' \cosh\beta' \cosh\gamma' \\[4pt]
-\sinh\alpha' \cosh\beta' \sinh\gamma' & (1 - \cosh\alpha') \cdot \sinh\beta' \cosh\beta' \sinh\gamma' & (1 - \cosh\alpha') \cosh^2\beta' \cdot \sinh^2\gamma' + 1 & (1 - \cosh\alpha') \cdot \cosh^2\beta' \sinh\gamma' \cosh\gamma' \\[4pt]
\sinh\alpha' \cosh\beta' \cosh\gamma' & (\cosh\alpha' - 1) \cdot \sinh\beta' \cosh\beta' \cosh\gamma' & (\cosh\alpha' - 1) \cdot \cosh^2\beta' \sinh\gamma' \cosh\gamma' & (\cosh\alpha' - 1) \cdot \cosh^2\beta' \cosh^2\gamma' + 1
\end{pmatrix}
\tag{45a}
$$

In (45a),

$$
a \cdot b \equiv ab . \tag{45b}
$$

The above matrix elements are related to the velocity components of the moving frame (with the speed of light $c = 1$) as

$$
\begin{aligned}
v_x &= \tanh\alpha' \cosh\beta' \cosh\gamma' , \\
v_y &= i \tanh\alpha' \cosh\beta' \sinh\gamma' , \\
v_z &= i \tanh\alpha' \sinh\beta' , \\
v &= \sqrt{v_x^2 + v_y^2 + v_z^2} = \tanh\alpha' .
\end{aligned}
\tag{46}
$$

This group also has two invariant operators, somewhat complicated functions of the generators. The representations of the homogeneous Lorentz group can be obtained from its covering group, which was shown by Bargmann to be SL(2, C), an element of which is a 2 × 2 complex matrix with determinant +1. For example, the element of the covering group corresponding to the element of the homogeneous Lorentz group in Eq. (43) is

$$
\begin{pmatrix}
-1.222144 + i\,(0.2411005) & -0.6624081 - i\,(1.0352785) \\
0.1311536 - i\,(0.9976158) & 1.9325431 - i\,(0.4832207)
\end{pmatrix}.
\tag{47}
$$

It is important to mention that the full group of relativistic quantum mechanics is the inhomogeneous Lorentz group or the Poincaré group which has, in addition, elements corresponding to translations in space–time.

IV. APPLICATIONS IN ATOMIC PHYSICS

A. Symmetry of the Hydrogen Atom

The nonrelativistic quantum mechanical Hamiltonian describing the motion of an electron in the hydrogen atom is

$$
H = \frac{-1}{2m} \left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2} \right) - \frac{e^2}{r}, \qquad r = \sqrt{x^2 + y^2 + z^2}, \tag{48}
$$

where $e$ and $m$ are the charge and mass of the electron and $r$ the distance from the nucleus. The electron is in a central field, the potential energy $-e^2/r$ being invariant to rotations in three-dimensional space. This Hamiltonian thus has SO(3) symmetry and its bound-state solutions, which generate the irreducible representations of the three-dimensional rotation group, are expressed in terms of three quantum numbers $n$, $l$, $m$:

$$
U_{nlm}(\vec{r}) = Y_{lm}(\theta, \phi)\, R_{nl}(r). \tag{49}
$$

As can be expected from our discussion of the rotation group, the spherical harmonics $Y_{lm}$ are a factor in the solution. $D^{(l)}_{m'm}$ are the matrices of the irreducible representation of SO(3) of dimension $2l + 1$. $U_{nlm}$ is the general form of a central field wave function. $L^2$ and $L_z$ are the angular momentum operators which commute with the Hamiltonian and $U_{nlm}$ is an eigenfunction of both with eigenvalues $l(l + 1)$ and $m$ in units of $\hbar = 1$. The $L^2$ is just the Casimir operator of the SO(3) group apart from a constant and $L_z$ is essentially one of the infinitesimal generators of the SO(3) group.

Schrödinger, who solved the quantum mechanical equation for the hydrogen atom, pointed out that the energy (eigenvalue of $H$) depends on only one quantum number $n$, the so-called principal quantum number, and is independent of both $l$ and $m$. Since there are $n^2$ different states (combinations of $l$ and $m$ for a given $n$), all these have the same energy and this is known as degeneracy. The independence of the energy on the orbital angular momentum quantum number $l$ is known as accidental degeneracy and was difficult to understand for quite some time. As a result of the work of Pauli, Fock, and Bargmann we now know that this degeneracy exists because the hydrogen atom has a higher symmetry than SO(3). The Hamiltonian is invariant to rotations in four-dimensional space, or the group of this Hamiltonian is SO(4), the four-dimensional rotation group. Fock showed that the Schrödinger equation in momentum space can be written as a four-dimensional Laplacian equation,

$$
\frac{\partial^2}{\partial p_1^2} + \frac{\partial^2}{\partial p_2^2} + \frac{\partial^2}{\partial p_3^2} + \frac{\partial^2}{\partial p_4^2} = 0, \tag{50}
$$

and that its solutions can be written in terms of spherical harmonics defined on the three-dimensional surface of a sphere in four-dimensional space. We recall that there are six generators of SO(4) which satisfy the Lie algebra appropriate to this group. Pauli demonstrated that there exists another vector operator, besides the usual angular momentum operator $\vec{L}$, which commutes with this Hamiltonian. He generalized the classical Runge–Lenz vector to make it a quantum mechanical invariant. This vector operator, which commutes with the Hamiltonian, is

$$
\vec{A} = \frac{1}{\sqrt{-8mH}} \left( \vec{p} \times \vec{L} - \vec{L} \times \vec{p} \right) - e^2 \sqrt{\frac{m}{-2H}}\; \frac{\vec{r}}{r} . \tag{51}
$$

The three components of $\vec{A}$ and the three components of $\vec{L}$ make up the six generators of SO(4), and the quantum mechanical commutation relations among these operators are the same as the Lie algebra of the infinitesimal generators of SO(4). This group has two invariants and one of them, $\vec{L} \cdot \vec{A} = \vec{A} \cdot \vec{L}$, vanishes in this case. The other invariant is simply the sum of the squares of the six components of $\vec{A}$ and $\vec{L}$. It is interesting to note that the linear combinations $\vec{j}_1$ and $\vec{j}_2$ of $\vec{A}$ and $\vec{L}$ commute with each other and each satisfies an SU(2) Lie algebra:

$$
\begin{aligned}
\vec{j}_1 &= \tfrac{1}{2}(\vec{L} + \vec{A}), \qquad \vec{j}_2 = \tfrac{1}{2}(\vec{L} - \vec{A}), \\
[j_{1i}, j_{1j}] &= \sum_k \varepsilon_{ijk}\, j_{1k} , \\
[j_{2i}, j_{2j}] &= \sum_k \varepsilon_{ijk}\, j_{2k} , \\
[j_{1i}, j_{2j}] &= 0 \quad \text{for all } i, j.
\end{aligned}
\tag{52}
$$

This means that SO(4) is the direct product SU(2) ⊗ SU(2). Bargmann showed that this is related to the solution of the Hamiltonian in parabolic coordinates (of importance in the Stark effect resulting from putting the hydrogen atom in an electric field), just as the existence of $\vec{L}$ as an invariant is related to its solutions in spherical polar coordinates. Furthermore, $\vec{j}_1 + \vec{j}_2 = \vec{L}$, which means that the two solutions are connected through a Clebsch–Gordan-type relation.

B. Isotropic Harmonic Oscillator

In atomic physics it is possible to have quantum mechanical Hamiltonians because the forces between particles are well known. The Coulomb field is an outstanding example. Another such example is the three-dimensional isotropic harmonic oscillator, and the Schrödinger equation with this potential has exact solutions which are helpful in understanding its symmetry properties. In sharp contrast, it is difficult to have a Hamiltonian in elementary particle physics because the forces between strongly interacting particles such as the neutron and proton are known only vaguely. One thus derives maximum benefit from knowledge of symmetries such as SO(4), SU(3), and SU(2), in understanding which atomic physics plays a big role.

To understand the symmetry of the isotropic harmonic oscillator it is necessary to express its Hamiltonian in terms

of coordinates and momenta (with units $\hbar = m = c = 1$)

$$
H = \tfrac{1}{2}\left( p_x^2 + p_y^2 + p_z^2 \right) + \tfrac{1}{2}\,\omega^2 (x^2 + y^2 + z^2). \tag{53}
$$

Here $\omega$ is the angular frequency which characterizes the oscillator. In the Schrödinger scheme of quantization the operator $\tfrac{1}{2}(p_x^2 + p_y^2 + p_z^2)$ is the Laplace operator $-\tfrac{1}{2}\nabla^2$ and the Schrödinger equation has exact solutions of the form

$$
U_{nlm}(r^2) = Y_{lm}(\theta, \phi)\, R_{nl}(r^2). \tag{54}
$$

Since the Laplacian

$$
\nabla^2 \equiv \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}
$$

and $r^2 = x^2 + y^2 + z^2$ both do not change on a rotation of the coordinate system in space, the Hamiltonian has rotational invariance and naturally the spherical harmonics $Y_{lm}$ are a factor in the eigenfunctions of $H$. However, the harmonic oscillator, like the hydrogen atom, has a higher symmetry than SO(3). Jauch, Hill, and Baker demonstrated that the group of the isotropic harmonic oscillator is SU(3), a three-dimensional generalization of SU(2), the unitary unimodular group in three dimensions. A typical element of this group is a 3 × 3 unitary matrix with determinant +1. To establish the group structure or symmetry of a Hamiltonian one should find, on the one hand, operators that commute with it and show, on the other, that these operators are the infinitesimal generators of that particular Lie group. In the 3 × 3 matrix there are nine complex elements or 18 real parameters. However, the conditions that need to be satisfied by the rows and columns of a unitary matrix, as well as the requirement that the determinant of the matrix should be +1, reduce the number to eight independent real parameters. These eight parameters lead to eight generators of the SU(3) group. Three of these are the familiar $L_x$, $L_y$, and $L_z$ operators,

$$
y\frac{\partial}{\partial z} - z\frac{\partial}{\partial y}, \qquad z\frac{\partial}{\partial x} - x\frac{\partial}{\partial z}, \qquad \text{and} \qquad x\frac{\partial}{\partial y} - y\frac{\partial}{\partial x},
$$

respectively, which are the operators that generate the SO(3) group. The other generators are the five independent components of the symmetric tensor

$$
A_{ij} = (1/2\omega)\left( p_i p_j + \omega^2 x_i x_j \right), \tag{55}
$$

the sum of whose diagonal elements is essentially the Hamiltonian. The Lie algebra of SU(3) can be conveniently expressed in terms of linear combinations of these eight operators:

$$
\begin{aligned}
H_1 &= (1/2\sqrt{3})\, L_z , \\
H_2 &= (1/6)(A_{11} + A_{22} - 2A_{33}), \\
E_1 &= (1/2\sqrt{6})(A_{11} - A_{22} + 2iA_{12}), \\
E_{-1} &= (1/2\sqrt{6})(A_{11} - A_{22} - 2iA_{12}), \\
E_2 &= (1/4\sqrt{3})(L_x + iL_y - 2A_{13} - 2iA_{23}), \\
E_{-2} &= (1/4\sqrt{3})(L_x - iL_y - 2A_{13} + 2iA_{23}), \\
E_3 &= (1/4\sqrt{3})(L_x + iL_y + 2A_{13} + 2iA_{23}), \\
E_{-3} &= (1/4\sqrt{3})(L_x - iL_y + 2A_{13} - 2iA_{23}).
\end{aligned}
\tag{56}
$$

The structure relations can be written explicitly as

$$
\begin{aligned}
[H_i, H_j] &= 0, && i, j = 1, 2, \ldots, l \quad (l = 2 \text{ here}), \\
[H_i, E_\alpha] &= r_i(\alpha)\, E_\alpha , && \alpha = \pm 1, \pm 2, \ldots, \\
[E_\alpha, E_{-\alpha}] &= \sum_i r_i(\alpha)\, H_i , \\
[E_\alpha, E_\beta] &= \sum_\gamma C^{\gamma}_{\alpha\beta}\, E_\gamma , && \alpha \neq \beta = \pm 1, \pm 2, \ldots .
\end{aligned}
\tag{57}
$$

The number of mutually commuting generators of this group, called the rank of the group $l$, is 2 in this case, $H_1$ and $H_2$ being these generators. $r_i(\alpha)$ is considered as the $i$th component of an $l$-dimensional "root vector" $\mathbf{r}(\alpha)$. The different root vectors and their components are

$$
\begin{aligned}
\mathbf{r}(1) &= (1/\sqrt{3},\, 0) \quad \text{or} \quad r_1(1) = 1/\sqrt{3},\ r_2(1) = 0, \\
\mathbf{r}(-1) &= (-1/\sqrt{3},\, 0), \\
\mathbf{r}(2) &= \left( \tfrac{1}{2\sqrt{3}},\, \tfrac{1}{2} \right), \\
\mathbf{r}(-2) &= \left( -\tfrac{1}{2\sqrt{3}},\, -\tfrac{1}{2} \right), \\
\mathbf{r}(3) &= \left( \tfrac{1}{2\sqrt{3}},\, -\tfrac{1}{2} \right), \\
\mathbf{r}(-3) &= \left( -\tfrac{1}{2\sqrt{3}},\, \tfrac{1}{2} \right).
\end{aligned}
\tag{58}
$$

These are represented graphically in Fig. 2 in a root diagram with $r_1$ and $r_2$ as rectangular axes. The root vectors have the "orthonormal" property

$$
\sum_\alpha r_i(\alpha)\, r_j(\alpha) = \delta_{ij} . \tag{59}
$$

In the relation $[E_\alpha, E_\beta] = C^{\gamma}_{\alpha\beta} E_\gamma$ the structure constants $C^{\gamma}_{\alpha\beta}$ vanish whenever $\mathbf{r}(\alpha) + \mathbf{r}(\beta)$ is not a root vector. The root diagram is a concise way of showing the structure of the Lie algebra.

The well-known solutions of the isotropic harmonic oscillator can be used as basis functions for calculating the irreducible representations of the SU(3) group, and enumeration of the degenerate states gives us the dimensionality of the representations. This degeneracy is easily known from the expression for the energy eigenvalue of the Hamiltonian:
$$
H U_{nlm} = \left[ 2(n - 1) + l + \tfrac{3}{2} \right] \omega\, U_{nlm} \equiv \left( N + \tfrac{3}{2} \right) \omega\, U_{nlm} , \qquad N = 0, 1, 2, 3, \ldots . \tag{60}
$$

Since $n$ assumes an integral value from 1 onward and $l$ likewise from 0 onward, the degeneracy arises from the different partitions of $N$ between $n$ and $l$. We thus have, since for a given $l$ there are $2l + 1$ substates, 1, 3, 6, 10, etc., states having energies $\tfrac{3}{2}$, $\tfrac{5}{2}$, $\tfrac{7}{2}$, etc., in units of $\omega$. Thus the dimensionalities of the representations provided by the solutions of this Hamiltonian are 1, 3, 6, 10, etc. The three basis functions with $n = 1$, $l = 1$ (1p states) are simultaneous eigenfunctions of the commuting generators $H_1$ and $H_2$ with eigenvalues $m_1$ and $m_2$, and the latter are treated as components of a vector in $l$-dimensional space called a "weight vector." The weight vectors for $l = 1$ are shown in Fig. 3, the weight diagram. The explicit matrices of this representation are

$$
\begin{aligned}
H_1 &= \frac{1}{2\sqrt{3}} \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, &
H_2 &= \frac{1}{6} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -2 \end{pmatrix}, \\
E_{-1} &= \frac{1}{\sqrt{6}} \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, &
E_{1} &= \frac{1}{\sqrt{6}} \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \\
E_{2} &= \frac{1}{\sqrt{6}} \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, &
E_{-2} &= \frac{1}{\sqrt{6}} \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \\
E_{3} &= \frac{1}{\sqrt{6}} \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, &
E_{-3} &= \frac{1}{\sqrt{6}} \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}.
\end{aligned}
\tag{61}
$$

The SU(3) group plays an important role in nuclear as well as particle physics.

FIGURE 2 Root vector diagram. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). "Group Theory Made Easy for Scientists and Engineers," Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

FIGURE 3 Weight diagram. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). "Group Theory Made Easy for Scientists and Engineers," Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

C. Splitting of Atomic Levels

The energy levels of an electron in a spherically symmetric field (central field) in an atom are degenerate, more than one state having the same energy. Quantum mechanical perturbation theory shows that very often a perturbation splits the degenerate levels. For instance, in a hydrogen-like atom such as sodium the energy levels, taking the spin of the electron into account, are $2(2l + 1)$-fold degenerate, with 2 S ($l = 0$) levels, 6 P ($l = 1$) levels, and so on. If the magnetic field due to the spin current of the electron interacts with the magnetic field due to its orbital motion

around the nucleus, which is called spin–orbit coupling, then these degenerate levels are split into two groups, giving rise to what is known as fine structure of spectral lines. In the case of sodium, the D line near 5893 Å is really made up of two lines of 5896 and 5890 Å wavelengths, the D1 and D2 lines. The unsplit line is due to an electric dipole transition from one of the six degenerate P states to an S state. The two lines appear because the six states split into two groups and there are transitions between levels in either group and the S state. This splitting can be predicted by group theory if the symmetries of the degenerate states and the symmetry of the perturbation are known, as was first pointed out by Bethe.

We will calculate and see how the levels of the atom are split when the perturbation arises, say from the atom being in a crystalline field of $D_{3d}$ symmetry. The elements of $D_{3d}$ form a subgroup of the full rotation–reflection group and hence the irreducible representations of the latter become reducible representations of the subgroup. The reduction theorem

$$
a_\mu = \frac{1}{g} \sum_i g_i\, \chi_i^{(\mu)*}\, \chi_i \tag{62}
$$

gives $a_\mu$, the number of times the $\mu$th irreducible representation of the subgroup is contained in the reducible representation in which the elements of the $i$th class have the compound character $\chi_i$ (see Table VI). $g_i$ is the number of elements in class $K_i$ which have the character $\chi_i^{(\mu)}$ in the $\mu$th representation. The character table (Table VII) specifies $g_i$, $K_i$, and $\chi_i^{(\mu)}$. The irreducible representations are classified as even or odd ($g$ or $u$) because of inversion (or the parity operation) as an element of the subgroup. In the rotation–reflection group O(3) the rotation angles that correspond to elements in classes $K_1$, $K_2$, and $K_3$ are, respectively, 0, $2\pi/3$, and $2\pi/2$, although the axes for the cyclic and dihedral rotations (i.e., rotations about axes perpendicular to the cyclic axis) are different. The characters are given by the formula

$$
\chi^{(l)}(\alpha) = \frac{\sin\left( l + \tfrac{1}{2} \right)\alpha}{\sin(\alpha/2)} \qquad (\alpha = \text{angle of rotation}). \tag{63}
$$

Applying this formula and remembering that the spherical harmonics $Y_{lm}$ have parity $(-1)^l$, we obtain the compound characters displayed in Table VI. Since we know the compound characters of the subgroup and its characters $\chi_i^{(\mu)}$, calculation of $a_\mu$ following formula (62) is straightforward. For instance, the sevenfold-degenerate level $l = 3$ of the free atom (ignoring spin) is split into one level of symmetry $A_{1u}$, two levels of symmetry $A_{2u}$, and two twofold-degenerate levels each of symmetry $E_u$. The splitting of the other levels is shown below. Thus group theory predicts the number of split levels and their symmetries, which implies that the transitions and spectral lines can be predicted without calculation. The magnitude of the splitting or the frequency of the spectral lines cannot, of course, be known from symmetry considerations alone. These types of symmetry considerations are of great importance in solid-state physics. For instance, the presence in a crystal of an impurity atom occupying a site of cubic symmetry leads to the formation of a localized triply degenerate vibrational mode. If the crystal has $C_{3v}$ point symmetry at that site the triply degenerate level will split into a doublet (two levels, E type) and a singlet (A type). This is of experimental use in infrared studies of crystal defects.

$$
\begin{aligned}
E_0 &\to A_{1g} , \\
E_1 &\to A_{2u} + E_u , \\
E_2 &\to A_{1g} + 2E_g , \\
E_3 &\to A_{1u} + 2A_{2u} + 2E_u , \\
E_4 &\to 2A_{1g} + A_{2g} + 3E_g , \\
E_5 &\to A_{1u} + 2A_{2u} + 4E_u .
\end{aligned}
\tag{64}
$$

TABLE VI Compound Character Table of Subgroup D3d

             E     2C3    3C2     i     2S6    3σd
χ(l = 0)     1      1      1      1      1      1
χ(1)         3      0     −1     −3      0      1
χ(2)         5     −1      1      5     −1      1
χ(3)         7      1     −1     −7     −1      1
χ(4)         9      0      1      9      0      1
χ(5)        11     −1     −1    −11      1      1

D. Selection Rules in Atomic Spectra

The spectrum radiated by an atom, hydrogen for example, consists in general of discrete lines, not all of which have the same intensity. Also, in the spectra of different atoms it usually happens that some expected lines are not seen; these are called forbidden lines. It was Niels Bohr who pointed out that a spectral line is the result of a quantum mechanical transition between two states of energy $E_i$ and $E_f$ and that the frequency of the radiated line is given by the following Bohr frequency condition, which follows from the principle of conservation of energy:

$$
\frac{hc}{\lambda} = h\nu = E_i - E_f . \tag{65}
$$

Here $\nu$ and $\lambda$, respectively, are the frequency and wavelength of the spectral line, $h$ Planck's constant, and $c$ the speed of light. Since more than one atom in a gas makes

the same transition between quantized states, the intensity of a spectral line depends on the population of these atoms, which can be determined from the statistical distribution of the atoms at a given temperature, and on the quantum mechanical transition probability between these states. For instance, the transition probability for transition from a 2P state to a 1S state in a hydrogen atom is $6.25 \times 10^{8}\ \mathrm{s}^{-1}$. The radiation is usually expressed as a sum of multipoles and the transition probability decreases, by several orders of magnitude, with increasing order of multipoles. That certain transitions are seen to happen experimentally and others not is related to selection rules which vary with the multipole order of the radiation. For a linearly polarized electric dipole radiation, the transition probability between states $|i\rangle$ and $|f\rangle$ is given by

$$
A_{if} = (64\pi^4 \nu^3 / 3hc^3)\, |\langle f | ez | i \rangle|^2 . \tag{66}
$$

Here $\nu$ is the frequency of the radiation, $ez$ the dipole operator, and $\langle f | ez | i \rangle$ the quantum mechanical matrix element which depends on the nature of the initial state $|i\rangle$ and the final state $|f\rangle$. It is this matrix element that dictates whether the transition $|i\rangle \to |f\rangle$ is allowed or not. If the matrix element vanishes, then that transition is forbidden, at least in the electric dipole approximation. The selection rules then depend on this matrix element and it is the vanishing or nonvanishing of this matrix element that can be predicted by group theory without actual calculation.

TABLE VII Character Table of D3d

                 K1 (E)    K2 (C3)   K3 (C2)   K4 (i)    K5 (S6 = i × C3)   K6 (σd = i × C2)
                 g1 = 1    g2 = 2    g3 = 3    g4 = 1    g5 = 2             g6 = 3
A1g  (µ = 1)       1         1         1         1         1                  1
A2g  (µ = 2)       1         1        −1         1         1                 −1
Eg   (µ = 3)       2        −1         0         2        −1                  0
A1u  (µ = 4)       1         1         1        −1        −1                 −1
A2u  (µ = 5)       1         1        −1        −1        −1                  1
Eu   (µ = 6)       2        −1         0        −2         1                  0

For example, the entry 2 in row µ = 3 and column K4 is $\chi_4^{(3)}$.

Let us assume that the initial quantum state of the electron $|i\rangle$ in the hydrogen atom making the transition is the central field state $U_{n l_i m_i}$, and that the final state it jumps into is $|f\rangle$, $U_{n l_f m_f}$. The matrix element is explicitly

$$
\int U^{*}_{n l_f m_f}\,(ez)\,U_{n l_i m_i}\, d\tau . \tag{67}
$$

Now $U_{n l_i m_i}$ is the basis function which belongs to the $m_i$th row of the $D^{(l_i)}$ irreducible representation of the rotation group. The operator $z$, which is $Y_{10}$ apart from a constant, belongs to the $D^{(1)}$ representation and the product $z U_{n l_i m_i}$ belongs to the direct product $D^{(1)} \times D^{(l_i)}$. More precisely, $z U_{n l_i m_i}$ is a certain linear combination of the basis functions which generate the representations of the irreducible components of $D^{(1)} \times D^{(l_i)}$. $U_{n l_f m_f}$ belongs to the $m_f$th row of the $D^{(l_f)}$ representation of SO(3). If $D^{(l_f)}$ is not one of the terms in the direct sum into which $D^{(1)} \times D^{(l_i)}$ decomposes, the matrix element vanishes. According to the Clebsch–Gordan theorem we have

$$
D^{(1)} \times D^{(l_i)} = D^{(l_i + 1)} \oplus D^{(l_i)} \oplus D^{(l_i - 1)} . \tag{68}
$$

It is clear that unless $l_f = l_i + 1$, $l_i$, or $l_i - 1$, the matrix element vanishes, with the exception that when $l_i = 0$, $l_f$ can only be +1. Furthermore, the central field functions have definite parity. For a reflection operation $x \to -x$, $y \to -y$, $z \to -z$ these functions are even (do not change sign) if $l$ is even, and vice versa. In other words, they are also basis functions of the inversion or parity group. Since $z$ has odd parity, the matrix element vanishes unless $l_f = l_i + 1$ or $l_i - 1$. Since $l_i$ cannot equal $l_f$, it is customary to say that parity should change in an electric dipole transition. Thus the electron can make a transition between a P state and an S state.

If an atom is not free but is in a crystalline field, say of $C_{3v}$ symmetry, the matrix element can be analyzed as follows. The initial state function $\psi_\mu^{(i)}$ belongs to the $\mu$th row of the $i$th irreducible representation of $C_{3v}$ and $\psi_\nu^{(f)}$ belongs to the $\nu$th row of the $f$th representation. From the character table of $C_{3v}$ we see that $z$ belongs to the one-dimensional representation $D^{(A_1)}$ of $C_{3v}$. Hence the matrix element will vanish unless $D^{(f)}$ occurs in the decomposition of the direct product $D^{(A_1)} \times D^{(i)}$. Without having to apply the systematic reduction formula one can easily see from the character table, for instance,

$$
\begin{aligned}
D^{(A_1)} \times D^{(A_1)} &= D^{(A_1)} \oplus 0\,D^{(A_2)} \oplus 0\,D^{(E)} , \\
D^{(A_1)} \times D^{(A_2)} &= 0\,D^{(A_1)} \oplus D^{(A_2)} \oplus 0\,D^{(E)} .
\end{aligned}
\tag{69}
$$
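The coefficients in Eq. (69), and the level splittings listed in Eq. (64), both follow mechanically from the reduction theorem (62). The following is a minimal sketch in Python, offered as an illustration rather than as part of the original treatment. The function names are hypothetical, the D3d data are copied from Tables VI and VII, and the C3v character table is not printed in this section, so the standard table with classes E, 2C3, 3σv is assumed.

```python
def reduce_characters(class_sizes, irreps, chi):
    """Multiplicities a_mu = (1/g) * sum_i g_i * chi_i^(mu) * chi_i.

    All characters used here are real integers, so complex conjugation
    in formula (62) is a no-op.
    """
    g = sum(class_sizes)  # order of the group
    multiplicities = {}
    for name, irrep_chars in irreps.items():
        total = sum(gi * cm * c
                    for gi, cm, c in zip(class_sizes, irrep_chars, chi))
        if total % g != 0:
            raise ValueError("characters do not give integer multiplicities")
        multiplicities[name] = total // g
    return multiplicities

def product(a, b):
    """Characters of a direct product are the class-wise products."""
    return [x * y for x, y in zip(a, b)]

# C3v with classes (E, 2C3, 3 sigma_v): standard table, assumed here.
C3V_SIZES = [1, 2, 3]
C3V = {"A1": [1, 1, 1], "A2": [1, 1, -1], "E": [2, -1, 0]}

# D3d with classes (E, 2C3, 3C2, i, 2S6, 3 sigma_d), from Table VII.
D3D_SIZES = [1, 2, 3, 1, 2, 3]
D3D = {
    "A1g": [1, 1, 1, 1, 1, 1],
    "A2g": [1, 1, -1, 1, 1, -1],
    "Eg":  [2, -1, 0, 2, -1, 0],
    "A1u": [1, 1, 1, -1, -1, -1],
    "A2u": [1, 1, -1, -1, -1, 1],
    "Eu":  [2, -1, 0, -2, 1, 0],
}

# Compound characters of the free-atom levels l = 0..5, from Table VI.
TABLE_VI = {
    0: [1, 1, 1, 1, 1, 1],
    1: [3, 0, -1, -3, 0, 1],
    2: [5, -1, 1, 5, -1, 1],
    3: [7, 1, -1, -7, -1, 1],
    4: [9, 0, 1, 9, 0, 1],
    5: [11, -1, -1, -11, 1, 1],
}

a1_a1 = reduce_characters(C3V_SIZES, C3V, product(C3V["A1"], C3V["A1"]))
a1_a2 = reduce_characters(C3V_SIZES, C3V, product(C3V["A1"], C3V["A2"]))
splitting = {l: reduce_characters(D3D_SIZES, D3D, chi)
             for l, chi in TABLE_VI.items()}

print(a1_a1)
print(a1_a2)
print(splitting[3])
```

For the two C3v products the only nonzero multiplicities are A1 and A2, respectively, as in Eq. (69), and for l = 3 the result is one A1u, two A2u, and two Eu levels, as in Eq. (64).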

The coefficients of the sums on the right-hand side of the equation are determined by looking at the appropriate characters on both sides. Thus if $\psi_\mu^{(i)}$ refers to $A_1$, then $\psi_\nu^{(f)}$ also must have $A_1$ symmetry, and likewise for $A_2$. Thus an $A_1$-to-$A_2$ transition, or vice versa, is strictly forbidden in electric dipole radiation.

V. APPLICATIONS IN NUCLEAR PHYSICS

Nuclear physics is somewhat unique in that, because the fundamental interaction between two nucleons is only approximately known, there does not exist one model which can account for all the different phenomena observed experimentally. Thus, for instance, the shell model has successfully predicted the ground-state spins and the observed shell structure of many nuclei. The optical model has been very useful in analyzing neutron scattering. The Bohr–Mottelson model successfully explains the properties of deformed nuclei. It is therefore not surprising that more than one group has been used in understanding the different characteristics of nuclear structure and nuclear levels, and once again the pioneering work was done by Wigner in his theory of supermultiplets proposed in 1937. We will discuss here

1. The construction of the many-particle wavefunction following symmetry principles
2. A model based on the SU(3) group and then comment on recent applications.

In the L–S coupling scheme of central field wave functions the many-particle wave function is a simultaneous eigenfunction of $L^2$, $L_z$, $S^2$, and $S_z$, where $\vec{L}$ is a vector sum of orbital angular momenta of all the particles and $\vec{S}$ the sum of spin operators:

$$
\vec{L} = \sum_i \vec{l}_i , \qquad \vec{S} = \sum_i \vec{s}_i . \tag{70}
$$

The corresponding Hamiltonian is symmetrical in the interchange of particle coordinates and is free of any spin operators. The Pauli principle restricts the many-particle function to be antisymmetric in the interchange of the coordinates (space as well as spin coordinates) of any two particles because all the nucleons are fermions with spin $\tfrac{1}{2}$. This implies that, when written as a product of space and spin functions, the function is symmetrical in space coordinates and antisymmetric in spin coordinates and vice versa. To illustrate this let us take a three-nucleon system and consider the spin functions first. Denoting an up-spin state by $\alpha$ and a down-spin state by $\beta$ and labeling the particles 1, 2, 3, we have both symmetric and antisymmetric functions which are built following the Clebsch–Gordan theorem. Two such functions, one symmetric and the other antisymmetric in the interchange of 1 and 2, are given below. These are eigenfunctions of $S^2$ and $S_z$ with $S = \tfrac{1}{2}$ and $M_S = \tfrac{1}{2}$:

$$
\begin{aligned}
\phi_3 \equiv \chi^{1/2}_{1/2} &= \frac{1}{\sqrt{6}} \{ 2\alpha(1)\alpha(2)\beta(3) - \alpha(2)\alpha(3)\beta(1) - \alpha(3)\alpha(1)\beta(2) \}, \\
\phi_4 \equiv \chi^{1/2}_{1/2} &= \frac{1}{\sqrt{2}} \{ \alpha(1)\beta(2)\alpha(3) - \alpha(2)\beta(1)\alpha(3) \}.
\end{aligned}
\tag{71}
$$

Since the Hamiltonian is symmetric in the coordinates of all particles, its eigenfunctions should be basis functions of the irreducible representations of the permutation group, here the group $S_3$. Two of the representation matrices generated by the above set, together with the respective characters, are

$$
(A):\ \begin{pmatrix} -1/2 & \sqrt{3}/2 \\ \sqrt{3}/2 & 1/2 \end{pmatrix},\ \chi = 0; \qquad
(C):\ \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix},\ \chi = 0. \tag{72}
$$

We have seen that the most straightforward way of generating basis functions for irreducible representations, starting with any arbitrary function, is by means of projection operators. For the permutation group there is an elegant theory of Young tableaux for deriving these projection operators. Let us assume three boxes arranged in rows and columns as

| 1 | 2 |        | 1 | 3 |
| 3 |            | 2 |            (73)

A standard tableau is an arrangement of the numbers 1, 2, 3 in the boxes such that they are in increasing order in rows as well as columns, as shown above. The number of such standard tableaux is the dimensionality of the representation, which in this case is two. Since each diagram is obtained from the other by interchanging rows and columns, these are said to be conjugate to each other. The projection operators corresponding to these two diagrams are

$$
\begin{aligned}
P_1 &= \tfrac{1}{3}[E - (13)][E + (12)] = \tfrac{1}{3}(E + C - B - D), \\
P_2 &= \tfrac{1}{3}[E - (12)][E + (13)] = \tfrac{1}{3}(E - C + B - F),
\end{aligned}
\tag{74}
$$
P1: GLQ/GLT P2: GPJ Final Pages
Encyclopedia of Physical Science and Technology EN007C-305 June 29, 2001 17:59

Group Theory, Applied 173

where we used the notation of Section II. Operating with these on $\alpha(1)\alpha(2)\beta(3)$, for instance, generates the basis functions

$$\begin{aligned} &\tfrac23 \{\alpha(1)\alpha(2)\beta(3) - \alpha(2)\alpha(3)\beta(1)\}, \\ &\tfrac13 \{\alpha(2)\alpha(3)\beta(1) - \alpha(1)\alpha(3)\beta(2)\}. \end{aligned} \tag{75}$$

The elements A, C of the $S_3$ group are represented in this basis by the matrices

$$\underset{\chi = 0}{\overset{(A)}{\begin{pmatrix} 0 & -\tfrac12 \\ -2 & 0 \end{pmatrix}}}, \qquad \underset{\chi = 0}{\overset{(C)}{\begin{pmatrix} 1 & 0 \\ 2 & -1 \end{pmatrix}}}. \tag{76}$$

Other matrices can be calculated in a similar manner. The characters show that this representation is equivalent to the one in Eq. (72). Thus linear combinations of functions of the type $\alpha(1)\alpha(2)\beta(3)$ give rise to basis functions identical to the ones in Eq. (71). The functions of the space coordinates can be treated likewise. On the one hand, since the single-particle states are described by central field functions (product of spherical harmonics and appropriate radial function) the three-particle function must be an eigenfunction of $L^2$ and $L_z$. This means that the latter should be a Clebsch–Gordan-type linear combination of products of single-particle functions. On the other hand, they should be basis functions for the irreducible representations of $S_3$, which means that these should be obtainable by means of projection operators associated with appropriate Young tableaux. Since the wavefunction, a product of space and spin functions, has to be antisymmetric in the interchange of all the coordinates (space and spin) of any two particles, we need spatially symmetric functions multiplying spin antisymmetric functions and vice versa. The space and spin functions then correspond to conjugate Young diagrams. More details about these can be found in the work of Swamy and Samuel.

A. The Elliott Model

In 1958 Elliott introduced a model of the nucleus wherein the particles move in a common harmonic oscillator potential and in addition have a mutual interaction of the quadrupole type. Before writing this Hamiltonian which mathematically describes the model, let us recall some of the features of the nonrelativistic isotropic harmonic oscillator in three dimensions discussed earlier. The simultaneous eigenfunctions of energy and angular momentum have the central field form

$$U_{nlm}(\mathbf{r}) = Y_{lm}(\theta, \phi) R_{nl}(r), \tag{77}$$

which means three quantum numbers define a single state. While the energy levels are equally spaced they are degenerate. The ground state is nondegenerate, but the higher excited states have degeneracies 3, 6, 10, etc., which is the underlying basis for the shell structure. These states are labeled 1s, 1p, (2s, 1d), (2p, 1f), and so on. Because of the presence of $Y_{lm}$ as a factor, these are basis functions for the irreducible representation of the rotation group (the orbital angular momentum group), the Hamiltonian being invariant to rotations in space. However, as we saw earlier, the bigger group is SU(3) with eight infinitesimal generators $H_1, H_2, E_1, E_{-1}, E_2, E_{-2}, E_3, E_{-3}$ satisfying the Lie algebra appropriate to this group, and commuting with the Hamiltonian of the oscillator. The Casimir operator $C$ is simply the sum of squares of these eight operators and this commutes with every generator of the group. As Lipkin has shown, the Elliott Hamiltonian can be written

$$H = H_0 + \lambda_1 V C + \lambda_2 V L^2, \tag{78}$$

where $\lambda_1$ and $\lambda_2$ are constants and $V$ is related to the strength of the quadrupole interaction between two particles. $H_0$ describes the independent particle motion in an oscillator potential. The additional terms are intended to remove some of the degeneracies associated with $H_0$; in other words, the SU(3) multiplets are split. Since the added terms commute with $H_0$ the eigenfunction of $H$ is also simultaneously an eigenfunction of $C$ as well as the angular momentum $L^2$. The required eigenfunctions are chosen to make sure that they are SU(3) multiplet states and within these multiplets they are also eigenfunctions of the angular momentum $L^2$. Since the energy levels corresponding to the eigenvalues of $L^2$ are in the nature of rotational energy levels, each SU(3) multiplet constitutes a rotational band. Thus the rotational spectrum, which is assumed to be due to collective motions of the nuclear particles, is derivable from an SU(3)-type independent-particle shell model under the assumption of a quadrupole two-body interaction. Experiments have shown, for instance, that in $^{20}$Ne (whose outer nucleons can be considered to be in the 2s–1d harmonic oscillator shell) there are excited levels, corresponding to a rotational spectrum, of energies 9.48 MeV ($2^+$), 10.24 MeV ($2^+$), 10.64 MeV ($6^-$), 11.99 MeV ($8^+$), and so on.

We conclude this section by remarking that currently much research activity is related to supersymmetries such as $U(6/4)$ and $U(6/12)$ starting with the interacting boson model initiated by Iachello and Arima. In the original interacting boson model the valence neutrons and protons, which are fermions individually, are paired into “s” and “d” bosons in even–even nuclei, similar to Cooper pairs in the theory of superconductivity. The $U(6/4)$ is extended

to $U(6/12)$ to include fermions and account for the spectra of odd–even nuclei as well. In Fig. 4 we show the success of this model in predicting the level structure of $^{195}$Pt. The levels in this figure are not, however, drawn to the exact scale of numerical energies.

FIGURE 4 Experimental and theoretical levels of $^{195}$Pt.

VI. APPLICATIONS IN MOLECULAR AND SOLID-STATE PHYSICS

A. Normal Modes of Vibration in Molecules

FIGURE 5 Arrangement of atoms in OL$_4$ molecule. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). “Group Theory Made Easy for Scientists and Engineers,” Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

The application of group theory to the vibrations of symmetrical molecules was done by Wigner, who analyzed the normal modes of tetrahedral methane. We will illustrate his method by studying the vibrations of an OL$_4$-type molecule having a $C_{4v}$ symmetry. Stephenson and Jones made an experimental study of the vibrational spectra of BrF$_5$ and concluded from its features that this molecule should have $C_{4v}$ symmetry. The geometric arrangement of the five atoms in the OL$_4$ arrangement is shown in Fig. 5. In the case of BrF$_5$, the fifth fluorine atom is vertically above the bromine atom which is located at the pyramidal point O in the figure. The elements of this group of order $g = 8$ are the identity E (E), three cyclic rotations through angles $2\pi/4$, $4\pi/4$, and $6\pi/4$, respectively ($C_4 = M$, $C_4^2 = N$, $C_4^3 = P$), about the Z axis passing through the point O, reflections in vertical planes XZ and YZ (Q and S, respectively), and two reflections in vertical planes containing O and the diagonals $L_1L_3$ and $L_2L_4$ (T and U, respectively). Reflections, or improper rotations so-called, are treated as rotations through the angle 0° in this analysis. The group has five classes $K_1(E)$, $K_2(N_4)$, $K_3(M_4, P_4)$, $K_4(Q, S)$, and $K_5(T, U)$, and therefore has five irreducible representations. Two of these are one-dimensional $A_1$, $A_2$ symmetric with respect to the cyclic

axis, two antisymmetrical one-dimensional representations $B_1$ and $B_2$, and a degenerate two-dimensional representation. The letters A and B indicate whether a representation is symmetric or antisymmetric with respect to rotation about the cyclic axis, and the subscripts 1 and 2 indicate whether it is symmetric or antisymmetric with respect to the vertical reflections. Wigner’s fundamental demonstration has been that the allowed normal modes of any molecule with a certain symmetry (belonging to a group) should correspond to the symmetries of these irreducible representations and none else.

According to the theory of small vibrations in classical mechanics, a system of $N$ particles connected by springs can have only $3N - 6$ normal modes unless all the particles are in one line. In the case of BrF$_5$ this means 12 frequencies, and these modes should have the symmetries of the $C_{4v}$ representations. We will apply the Wigner prescription to ascertain these symmetries. The first step is to calculate the compound character $\chi(R)$ of each symmetry operation following the rule

$$\chi(R) = \begin{cases} (\mu_R - 2)(1 + 2\cos\phi) & \text{for proper rotations}, \\ \mu_R(-1 + 2\cos\phi) & \text{for improper rotations}. \end{cases} \tag{79}$$

Here $\mu_R$ indicates the number of atoms left unchanged by that particular group operation. These compound characters as well as the characters of the $C_{4v}$ group (not classwise) are shown in Table VIII. Following the usual reduction formula relating characters to compound characters, the number of modes of each symmetry type is

$$N_i = \frac{1}{g} \sum_R \chi(R)\chi_i(R). \tag{80}$$

Calculation shows that there will be $3A_1$, $0A_2$, $1B_1$, $2B_2$, and three sets of degenerate E-type vibrations. For instance, for the $A_1$ type we have

$$N_{A_1} = \tfrac18 [(12 \times 1) + (1 \times 0) + (1 \times 0) + (1 \times 0) + (2 \times 1) + (2 \times 1) + (4 \times 1) + (4 \times 1)] = 3. \tag{81}$$

Since the E type are doubly degenerate, we notice that we have accounted for all 12 frequencies. Pictures of the modes of vibration of all important symmetry types are found in the classic work of Herzberg. The transformation properties of the components of the dipole moment vector (essentially $x$ or $y$ or $z$) and the components of the polarizability tensor ($xz$, $yz$, $x^2 - y^2$, etc.) determine whether these modes are Raman active or infrared active. From the character table we notice that both these sets of components transform as the $A_1$ and E representations, whereas only the components of the polarizability tensor have the B-type symmetries. Thus the $A_1$ and E modes are both Raman and infrared active, whereas $B_1$ and $B_2$ are only Raman active. The experimentally measured frequencies of Stephenson and Jones (expressed in units of cm$^{-1}$) are 365, 572, 683 (A type), 315, 481, 536 (B type), 244, 415, 626 (E type).

TABLE VIII Characters and Compound Characters of C4v

                                 E     M     N     P     Q     S     T     U
A1   z, x² + y², z²              1     1     1     1     1     1     1     1
A2                               1     1     1     1    −1    −1    −1    −1
B1   x² − y²                     1    −1     1    −1     1     1    −1    −1
B2   xy                          1    −1     1    −1    −1    −1     1     1
E    (x, y), (xz, yz)            2     0    −2     0     0     0     0     0

φ                                0°   90°  180°  270°    0°    0°    0°    0°
±1 + 2 cos φ                     3     1    −1     1     1     1     1     1
μR                               6     2     2     2     2     2     4     4
χ(R)                            12     0     0     0     2     2     4     4

It is important to note that group theory predicts only the symmetries of the normal modes but not their numerical frequencies. One needs to solve the secular determinant to obtain these frequencies and even then the numerical values are dependent on assumed force constants between the atoms. In general the calculation is cumbersome, but considerable reduction in labor and time will result if “symmetry-adapted eigenfunctions (symmetry coordinates)” are used in the evaluation of the individual matrix elements. This latter set of functions is easily obtained by means of projection operators. Four of the representations are one-dimensional and we give

a set of irreducible representation matrices for the two-dimensional E representation:

$$\begin{gathered} \underset{(E)}{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}, \quad \underset{(M)}{\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}}, \quad \underset{(N)}{\begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}}, \quad \underset{(P)}{\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}}, \\ \underset{(Q)}{\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}}, \quad \underset{(S)}{\begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}}, \quad \underset{(T)}{\begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}}, \quad \underset{(U)}{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}. \end{gathered} \tag{82}$$

Applying formula (14), we can readily obtain the projection operators

$$\begin{aligned} P_{A_1} &= \tfrac18 \{E + M + N + P + Q + S + T + U\}, \\ P_{B_1} &= \tfrac18 \{E - M + N - P + Q + S - T - U\}, \\ P_{B_2} &= \tfrac18 \{E - M + N - P - Q - S + T + U\}, \\ P_{E_1} &= \tfrac14 (E - N + Q - S), \\ P_{E_2} &= \tfrac14 (E - N - Q + S). \end{aligned} \tag{83}$$

A convenient set of symmetry coordinates is easily obtained by simply applying these projection operators to the changes in the various bond lengths and bond angles of the molecule (Table VIII).

B. Brillouin Zones and Compatibility Relations

Because of the lattice structure, an electron moves in a periodic electric potential field in the solid and the Hamiltonian describing its motion has the form

$$H = -\frac{\hbar^2}{2m} \nabla^2 + V(\mathbf{r}), \tag{84}$$

where $V(\mathbf{r})$ satisfies the periodicity condition

$$V(\mathbf{r} + \mathbf{a}) = V(\mathbf{r}). \tag{85}$$

Here $\mathbf{a}$ describes the periodicity of the lattice in three dimensions. This Hamiltonian has translational invariance and its eigenfunctions are basis functions for the irreducible representations of the group of translations in space. Bloch showed that group theory requires the stationary-state solution $\psi_{\mathbf{k}}$ of the Schrödinger equation,

$$H\psi_{\mathbf{k}}(\mathbf{r}) = E(\mathbf{k})\psi_{\mathbf{k}}(\mathbf{r}), \tag{86}$$

be of the form

$$\psi_{\mathbf{k}}(\mathbf{r}) = e^{i\mathbf{k}\cdot\mathbf{r}} U_{\mathbf{k}}(\mathbf{r}), \tag{87}$$

where $U_{\mathbf{k}}$ is a periodic function with the periodicity of the lattice. This translational symmetry of the crystal is in addition to its point group symmetry of the lattice, whose elements are those rotations and reflections which always leave one point fixed. The point group is the local symmetry at the site of an atom. In an earlier section we discussed how the symmetry of the crystalline field splits the degenerate energy levels of a free atom when this is in a crystal. The complete set of symmetry operations carrying a crystal into itself, including translations and the point group operations, is known as the space group.

The eigenvalues $E(\mathbf{k})$ are such that there are planes of energy discontinuity in $\mathbf{k}$ space which are the boundaries of a volume, the Brillouin zone. The translational symmetry enables one to consider a single unit cell in $\mathbf{k}$ space known as the first Brillouin zone or the reduced zone with the corresponding reduced wave vector. Bouckaert, Smoluchowski, and Wigner studied the symmetry properties of the Brillouin zone and derived “compatibility relations” between adjoining points, lines, and planes of symmetry. These relations are of fundamental importance in the analysis of solid-state experiments. For instance, energy-band calculations become considerably simplified for states along symmetry lines and at symmetry points in the Brillouin zone. Selection rules for the absorption of polarized electromagnetic radiation depend on the symmetry associated with a given point in the reduced zone. In the analysis of the vibronic spectra of doped crystals, vibronic selection rules need to be determined for phonons at various points in the Brillouin zone, and compatibility relations are indispensable for doing this. We now discuss these compatibility relations.

The Bloch function is a product of two factors, the phase function $e^{i\mathbf{k}\cdot\mathbf{r}}$ which determines what is known as the “star of $\mathbf{k}$” and the periodic function $U_{\mathbf{k}}$ which determines the “small representations” of the “group of the wave vector $\mathbf{k}$.” The “star of $\mathbf{k}$” is the figure one obtains when a given wave vector $\mathbf{k}$ is subjected to all the symmetry operations of the point group. If $\mathbf{k}$ terminates on a zone boundary, two points separated by a reciprocal lattice vector are considered identical. Those elements of the point group that leave a $\mathbf{k}$ invariant constitute a subgroup called “group of the wave vector.” Suppose an irreducible representation $\Gamma_j$ of the point group is decomposed in terms of the irreducible representations of its subgroup. If a given irreducible representation of the latter occurs in this decomposition, it is said to be compatible with $\Gamma_j$. In band theory, compatibility relates states that can exist together in a single band.

To illustrate the star of $\mathbf{k}$ let us assume a two-dimensional square zone, the length of whose side equals a reciprocal lattice vector. Let OE in Fig. 6 be the position of our $\mathbf{k}$ vector and let this correspond to the identity element of the group $C_{4v}$ of the square. The other symmetry operations, $C_4$, $C_4^2$, $C_4^3$, $\sigma_x$, $\sigma_y$, $\sigma_\alpha$, and $\sigma_{\alpha'}$, take the vector into the positions shown in the figure, which does resemble a

TABLE IX Characters of the Symmetry Elements of Oh

Oh       E    8C3   6C4   3C4²   6C2     i    8S6   6S4   3σv   6σd
A1g      1     1     1     1      1      1     1     1     1     1
Eg       2    −1     0     2      0      2    −1     0     2     0
T1g      3     0     1    −1     −1      3     0     1    −1    −1
A1u      1     1     1     1      1     −1    −1    −1    −1    −1
Eu       2    −1     0     2      0     −2     1     0    −2     0
T1u      3     0     1    −1     −1     −3     0    −1     1     1

C3v
A1       1     1                                                 1
A2       1     1                                                −1
E        2    −1                                                 0

FIGURE 6 Star of k. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). “Group Theory Made Easy for Scientists and Engineers,” Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

star. Since only the identity element leaves OE alone, the group of the wave vector in this case is just this trivial one (identity) element. Other cases of the star of $\mathbf{k}$ can be found in Wigner’s original paper or the book of Tinkham.

To understand the compatibility relations let us consider a Brillouin zone with the symmetry of the simple cubic lattice as in Fig. 7. The wave vector $\mathbf{k} = 0$ ($\Gamma$) has a full symmetry $O_h$, as does the wave vector $R$ in the figure. To establish the compatibility relations, consider a $\mathbf{k}$ vector which lies at an intermediate position along the body diagonal (the line joining $\Gamma$ and $R$). This vector is invariant to three rotations $C_3$, $C_3^2$, $C_3^3$ (identity) about this line as the cyclic axis, and reflections in the diagonal planes of the cube passing through this line. These are the six symmetry operations that are elements of $O_h$ and which also leave the wave vector invariant. It is easy to see that these six operations are elements of the group $C_{3v}$. The characters of these elements in the irreducible representations of $C_{3v}$ and the characters of the elements of $O_h$ in some of its irreducible representations are given in Table IX. With the use of these characters of the reducible representations in the well-known reduction formula, a trivial calculation shows that

$$\begin{aligned} A_{1g} &= A_1, & E_g &= E, & T_{1g} &= A_2 + E, \\ A_{1u} &= A_2, & E_u &= E, & T_{1u} &= A_1 + E. \end{aligned} \tag{88}$$

We thus conclude that $A_{1g}$ is compatible with $A_1$, $E_g$ with $E$, $A_{1u}$ with $A_2$, $E_u$ with $E$, $T_{1g}$ with $A_2$ and $E$, and $T_{1u}$ with $A_1$ and $E$. There is then continuity between the symmetry orbitals at $\Gamma$ and at the intermediate point on the body diagonal.
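The reduction formula, Eq. (80), underlies both the normal-mode count of Eq. (81) and the compatibility relations of Eq. (88). The following is a minimal Python sketch of that bookkeeping, not part of the original treatment. The function and variable names are ours, and it assumes that the three $C_{3v}$ classes correspond to the E, 8C3, and 6σd columns of Table IX, with two rotations and three reflections surviving in the subgroup.

```python
import math
from fractions import Fraction

def reduce_rep(chi, irreps, class_sizes):
    """Multiplicities N_i = (1/g) * sum_R chi(R) * chi_i(R), as in Eq. (80).

    All characters used here are real, so no complex conjugation is needed.
    """
    g = sum(class_sizes)
    return {
        name: Fraction(sum(n * c * ci for n, c, ci in zip(class_sizes, chi, chi_i)), g)
        for name, chi_i in irreps.items()
    }

def compound_character(mu, phi_deg, proper):
    """Compound character of one operation, Eq. (79), rounded to an integer."""
    c = math.cos(math.radians(phi_deg))
    value = (mu - 2) * (1 + 2 * c) if proper else mu * (-1 + 2 * c)
    return round(value)

# C4v, element by element in the order E, M, N, P, Q, S, T, U (Table VIII).
C4V_IRREPS = {
    "A1": [1, 1, 1, 1, 1, 1, 1, 1],
    "A2": [1, 1, 1, 1, -1, -1, -1, -1],
    "B1": [1, -1, 1, -1, 1, 1, -1, -1],
    "B2": [1, -1, 1, -1, -1, -1, 1, 1],
    "E": [2, 0, -2, 0, 0, 0, 0, 0],
}
MU = [6, 2, 2, 2, 2, 2, 4, 4]            # atoms left unmoved in BrF5
PHI = [0, 90, 180, 270, 0, 0, 0, 0]      # rotation angles in degrees
PROPER = [True] * 4 + [False] * 4        # reflections count as improper

CHI_BRF5 = [compound_character(m, p, pr) for m, p, pr in zip(MU, PHI, PROPER)]
MODES = reduce_rep(CHI_BRF5, C4V_IRREPS, [1] * 8)

# Oh restricted to C3v (assumed class correspondence: E, 8C3 -> 2C3, 6 sigma_d -> 3 sigma_v).
C3V_IRREPS = {"A1": [1, 1, 1], "A2": [1, 1, -1], "E": [2, -1, 0]}
C3V_SIZES = [1, 2, 3]
OH_RESTRICTED = {
    "A1g": [1, 1, 1], "Eg": [2, -1, 0], "T1g": [3, 0, -1],
    "A1u": [1, 1, -1], "Eu": [2, -1, 0], "T1u": [3, 0, 1],
}
COMPAT = {name: reduce_rep(chi, C3V_IRREPS, C3V_SIZES) for name, chi in OH_RESTRICTED.items()}

if __name__ == "__main__":
    print({k: int(v) for k, v in MODES.items()})
    for name, parts in COMPAT.items():
        print(name, {k: int(v) for k, v in parts.items() if v})
```

Exact rational arithmetic is used so that a non-integer multiplicity, which would signal a mistyped character, is visible at once instead of being hidden by rounding.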

FIGURE 7 Brillouin zone with the symmetry of a cubic lattice. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). “Group Theory Made Easy for Scientists and Engineers,” Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

VII. APPLICATION OF LIE GROUPS IN ELECTRICAL ENGINEERING

The application of the underlying concepts of Lie groups in frequency modulation (FM) in electrical engineering is best described by means of an example. Figure 8 is a block diagram of a Wien bridge $RC(t)$ variable frequency oscillator (VFO) network, where $\mu$ is the appropriate amplifier circuit. In the mathematical analysis by Gardner of the modulation rate distortion originating in the VFO, the modulated signal output $V_{FM}$ satisfies a somewhat complicated linear differential equation

$$\ddot{V}_{FM} + 3\frac{\dot{c}}{c}\dot{V}_{FM} + \left[\frac{d}{dt}\!\left(\frac{\dot{c}}{c}\right) + 2\frac{\dot{c}^2}{c^2} + \frac{1}{(Rc)^2}\right] V_{FM} = 0. \tag{89}$$

FIGURE 8 Block diagram of Wien bridge oscillator.

Here the dot above a variable means derivative with respect to time. This second-order equation can be transformed into an equivalent matrix differential equation involving coupled first-order quantities by the substitution

$$\dot{V}_1 = -(\dot{c}/c)V_1 - (1/Rc)V_2, \qquad \dot{V}_2 = (1/Rc)V_1 - (\dot{c}/c)V_2, \qquad V_2 = V_{FM}. \tag{90}$$

Thus

$$\begin{pmatrix} \dot{V}_1 \\ \dot{V}_2 \end{pmatrix} = \begin{pmatrix} -\dot{c}/c & -1/Rc \\ 1/Rc & -\dot{c}/c \end{pmatrix} \begin{pmatrix} V_1 \\ V_2 \end{pmatrix}, \tag{91}$$

which can be written in operator form $d\mathbf{V}/dt = \mathbf{A}(t)\mathbf{V}$. Here $\mathbf{V}$ is a column vector and $\mathbf{A}$ a $2 \times 2$ matrix representing the modulator element. This is sought to be solved with the initial condition $\mathbf{V}(0) = I$. This can be treated as a special case of an abstract operator equation,

$$\frac{d\mathbf{U}}{dt} = \mathbf{A}(t)\mathbf{U}, \qquad \mathbf{U}(0) = I, \tag{92}$$

$I$ being the identity operator. Magnus, and Wei and Norman developed a technique for solving this equation in the special case where $\mathbf{A}(t)$ is expandable in terms of the infinitesimal generators $\mathbf{X}_i$ of a certain Lie group. In particular, let $\mathbf{A}(t)$ be written as

$$\mathbf{A}(t) = \sum_{i=1}^{m} a_i(t)\mathbf{X}_i, \qquad m \text{ finite} \tag{93}$$

where the time-independent operators $\mathbf{X}_i$ are the generators of a Lie group. The Lie algebra of the $\mathbf{X}_i$ is given by

$$[\mathbf{X}_\chi, \mathbf{X}_\mu] = \sum_\lambda C^{\lambda}_{\chi\mu}\mathbf{X}_\lambda \tag{94}$$

$C^{\lambda}_{\chi\mu}$ being the structure constants defining the group. In the so-called global representation of the solution Wei and Norman show that this is a product of exponentials,

$$\mathbf{U}(t) = \prod_i \exp[g_i(t)\mathbf{X}_i] \tag{95}$$

where the $g_i$ are related to $a_i(t)$ through a set of simple first-order differential equations, and thus the question of solving (92) reduces to solving the latter, simpler set of equations. As an illustration, let us assume that $\mathbf{A}(t)$ is the matrix

$$\mathbf{A}(t) = \begin{pmatrix} i(1 - \sin t) & e^{2it}\left(\tfrac12 - i\cos t\right) \\ -e^{-2it}\left(\tfrac12 + i\cos t\right) & -i(1 - \sin t) \end{pmatrix} = \sum_k a_k(t)\mathbf{X}_k. \tag{96}$$

It turns out that this particular operator $\mathbf{A}$ can be expanded in terms of the generators of a Lie group isomorphic to the SO(3) group familiar in quantum mechanics, with the defining Lie algebra

$$[\mathbf{X}_i, \mathbf{X}_j] = \varepsilon_{ijk}\mathbf{X}_k. \tag{97}$$

$\varepsilon_{ijk}$ is the well-known Levi–Civita tensor symbol, whose components can only assume values 1, −1, or 0, and Einstein summation convention over repeated indices is implied. In the $2 \times 2$ matrix realization the generators are explicitly

$$\mathbf{X}_1 = \begin{pmatrix} i/2 & 0 \\ 0 & -i/2 \end{pmatrix}, \qquad \mathbf{X}_2 = \begin{pmatrix} 0 & 1/2 \\ -1/2 & 0 \end{pmatrix}, \qquad \mathbf{X}_3 = \begin{pmatrix} 0 & i/2 \\ i/2 & 0 \end{pmatrix}. \tag{98}$$

If we now write

$$\mathbf{U}(t) = e^{\alpha \mathbf{X}_1} e^{\beta \mathbf{X}_2} e^{\gamma \mathbf{X}_3} \tag{99}$$

and use the Baker–Hausdorff expansion

$$\exp(\mathbf{X}_1)\mathbf{X}_2\exp(-\mathbf{X}_1) = \mathbf{X}_2 + [\mathbf{X}_1, \mathbf{X}_2] + \frac{1}{2!}[\mathbf{X}_1, [\mathbf{X}_1, \mathbf{X}_2]] + \cdots \tag{100}$$

we arrive at the defining differential equations for $\alpha$, $\beta$, $\gamma$:

$$\begin{aligned} a_1 &= 2 - 2\sin t = \dot{\alpha} + \dot{\gamma}\sin\beta, \\ a_2 &= \cos 2t + 2\sin 2t\cos t = \dot{\beta}\cos\alpha - \dot{\gamma}\sin\alpha\cos\beta, \\ a_3 &= \sin 2t - 2\cos 2t\cos t = \dot{\beta}\sin\alpha + \dot{\gamma}\cos\alpha\cos\beta. \end{aligned} \tag{101}$$
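The chain from Eq. (96) to Eq. (101) can be checked numerically. The sketch below is illustrative only and uses plain Python with our own helper names. It verifies the Lie algebra of Eq. (97) for the matrices of Eq. (98), confirms that the coefficients $a_k(t)$ of Eq. (101) reproduce $\mathbf{A}(t)$ of Eq. (96), and tests one candidate solution, $\alpha = 2t$, $\beta = t$, $\gamma = -2t$. That candidate is our assumption, obtained by inspection of Eq. (101); the code checks by finite differences that the resulting product of exponentials satisfies Eq. (92).

```python
import cmath
import math

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def add(P, Q, sign=1):
    """Return P + sign*Q."""
    return [[P[i][j] + sign * Q[i][j] for j in range(2)] for i in range(2)]

def scale(c, P):
    return [[c * P[i][j] for j in range(2)] for i in range(2)]

def comm(P, Q):
    return add(matmul(P, Q), matmul(Q, P), -1)

def expm(P, terms=40):
    """Matrix exponential by its Taylor series (adequate for small 2x2 matrices)."""
    result = [[1 + 0j, 0j], [0j, 1 + 0j]]
    term = [[1 + 0j, 0j], [0j, 1 + 0j]]
    for n in range(1, terms):
        term = scale(1.0 / n, matmul(term, P))
        result = add(result, term)
    return result

def close(P, Q, tol=1e-9):
    return all(abs(P[i][j] - Q[i][j]) < tol for i in range(2) for j in range(2))

# Generators of Eq. (98).
X1 = [[0.5j, 0], [0, -0.5j]]
X2 = [[0, 0.5], [-0.5, 0]]
X3 = [[0, 0.5j], [0.5j, 0]]

def A(t):
    """The matrix of Eq. (96)."""
    return [
        [1j * (1 - math.sin(t)), cmath.exp(2j * t) * (0.5 - 1j * math.cos(t))],
        [-cmath.exp(-2j * t) * (0.5 + 1j * math.cos(t)), -1j * (1 - math.sin(t))],
    ]

def a_coeffs(t):
    """Expansion coefficients a_1, a_2, a_3 of Eq. (101)."""
    return (
        2 - 2 * math.sin(t),
        math.cos(2 * t) + 2 * math.sin(2 * t) * math.cos(t),
        math.sin(2 * t) - 2 * math.cos(2 * t) * math.cos(t),
    )

def U(t):
    """Product of exponentials, Eq. (99), with the assumed alpha=2t, beta=t, gamma=-2t."""
    return matmul(matmul(expm(scale(2 * t, X1)), expm(scale(t, X2))), expm(scale(-2 * t, X3)))

def residual_ok(t, h=1e-5, tol=1e-6):
    """Check dU/dt = A(t) U(t) by a central finite difference."""
    dU = scale(1 / (2 * h), add(U(t + h), U(t - h), -1))
    return close(dU, matmul(A(t), U(t)), tol)

if __name__ == "__main__":
    print(close(comm(X1, X2), X3), close(comm(X2, X3), X1), close(comm(X3, X1), X2))
    print(all(residual_ok(t) for t in (0.1, 0.7, 1.3)))
```

The Taylor-series exponential keeps the sketch free of external libraries; for larger matrices or long time intervals a library routine would be the safer choice.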

 
$$\mathbf{U}(t) = \begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix}, \qquad \begin{aligned} U_{11} &= e^{it}[\cos t\cos(t/2) - i\sin t\sin(t/2)], & U_{12} &= e^{it}[\cos t\sin(t/2) - i\sin t\cos(t/2)], \\ U_{21} &= -e^{-it}[\cos t\sin(t/2) + i\sin t\cos(t/2)], & U_{22} &= e^{-it}[\cos t\cos(t/2) + i\sin t\sin(t/2)]. \end{aligned} \tag{102}$$

Solving these leads to the final solution given in Eq. (102) above. If, on the other hand, the given $\mathbf{U}$ is treated as a column vector, then either column of the above matrix satisfies the equation with the initial values $\binom{1}{0}$ and $\binom{0}{1}$, respectively.

It is interesting to note that, unlike in quantum mechanics or elementary particle physics, the properties of the Lie group itself do not seem to be amenable to a direct physical interpretation bearing on the behavior of the VFO, although the structure constants that define the group are involved in the differential relations satisfied by the $g_i$’s in the exponential solutions.

VIII. APPLICATIONS IN PARTICLE PHYSICS

Group theory has been important in particle physics since the early 1960s. The hadrons were first classified into charge multiplets (isospin). Then approximate SU(3) symmetry was used to classify the hadrons into larger multiplets. (This is the so-called flavor SU(3).) More recently, in quantum chromodynamics (QCD), exact SU(3) color symmetry is necessary to describe the strong interactions.

A. Unitary Irreducible Representations of SU(2)

Consider the transformation

$$U(a, b):\quad \begin{pmatrix} u' \\ v' \end{pmatrix} = U(a, b)\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} a^* & -b \\ b^* & a \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix}, \tag{103}$$

with $|a|^2 + |b|^2 = 1$. $U$ is thus a unitary matrix, $U^{-1} = U^{+}$, and is an element of the SU(2) group. To obtain the unitary irreducible representations consider the basis polynomials (functions of $u$ and $v$)

$$f_J^m = \frac{u^{J+m} v^{J-m}}{\sqrt{(J+m)!\,(J-m)!}}, \qquad m = -J, \ldots, +J,$$

and

$$U(a, b) f_J^m = f_J^{\prime m} = f_J^m\!\left(U^{-1}\begin{pmatrix} u \\ v \end{pmatrix}\right) = \frac{(au + bv)^{J+m}(-b^* u + a^* v)^{J-m}}{\sqrt{(J+m)!\,(J-m)!}}. \tag{104}$$

As shown in Hamermesh’s (1959) text, a straightforward binomial expansion leads to the relation

$$U(a, b) f_J^m = \sum_{m'} f_J^{m'} D^J_{m'm}(a, b),$$

where the representation matrix $D^J_{m'm}$ is explicitly

$$D^J_{m'm} = \sum_k \frac{\sqrt{(J+m)!\,(J-m)!\,(J+m')!\,(J-m')!}}{(J+m-k)!\,k!\,(J-m'-k)!\,(m'-m+k)!}\; a^{J+m-k}\, a^{*\,J-m'-k}\, b^{k}\, (-b^*)^{m'-m+k}. \tag{105}$$

We now show that the representation is unitary. The normalization of the $f_J^m$ was chosen so that

$$\sum_{m=-J}^{J} f_J^m f_J^{m*} = \sum_{m=-J}^{J} \frac{|u|^{2(J+m)} |v|^{2(J-m)}}{(J+m)!\,(J-m)!} = \sum_{k=0}^{2J} \frac{|u|^{2k} |v|^{2(2J-k)}}{k!\,(2J-k)!} = \frac{(|u|^2 + |v|^2)^{2J}}{(2J)!} \tag{106}$$

by binomial expansion. Similarly,

$$\sum_{m=-J}^{J} f_J^{\prime m} f_J^{\prime m *} = \sum_{m=-J}^{J} |U(a, b) f_J^m|^2 = \sum_{m=-J}^{J} \frac{|au + bv|^{2(J+m)}\, |-b^* u + a^* v|^{2(J-m)}}{(J+m)!\,(J-m)!} = \frac{[(|u|^2 + |v|^2)(|a|^2 + |b|^2)]^{2J}}{(2J)!} = \frac{(|u|^2 + |v|^2)^{2J}}{(2J)!} = \sum_{m=-J}^{J} |f_J^m|^2. \tag{107}$$
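The explicit matrix of Eq. (105) can also be tested for unitarity directly. The following Python sketch is illustrative only and its names are ours. To avoid half-integer indices it stores the matrix with row index $r = J + m'$ and column index $c = J + m$, both running from 0 to $2J$, and it takes $2J$ as an integer argument.

```python
import cmath
import math

def d_matrix(two_j, a, b):
    """Representation matrix D^J of Eq. (105) for J = two_j / 2.

    Entry D[r][c] holds D^J_{m'm} with r = J + m' and c = J + m.
    The parameters are assumed to satisfy |a|^2 + |b|^2 = 1.
    """
    fact = math.factorial
    dim = two_j + 1
    D = [[0j] * dim for _ in range(dim)]
    for r in range(dim):
        for c in range(dim):
            # c = J+m, two_j-c = J-m, r = J+m', two_j-r = J-m'
            norm = math.sqrt(fact(c) * fact(two_j - c) * fact(r) * fact(two_j - r))
            total = 0j
            for k in range(c + 1):
                p_conj_a = two_j - r - k      # exponent J - m' - k of a*
                p_neg_b = r - c + k           # exponent m' - m + k of (-b*)
                if p_conj_a < 0 or p_neg_b < 0:
                    continue                  # factorial of a negative number: term absent
                denom = fact(c - k) * fact(k) * fact(p_conj_a) * fact(p_neg_b)
                total += (a ** (c - k) * a.conjugate() ** p_conj_a
                          * b ** k * (-b.conjugate()) ** p_neg_b) / denom
            D[r][c] = norm * total
    return D

def is_unitary(D, tol=1e-10):
    """Check sum_m D_{m'm} conj(D_{m''m}) = delta_{m'm''}."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            s = sum(D[i][k] * D[j][k].conjugate() for k in range(n))
            if abs(s - (1 if i == j else 0)) > tol:
                return False
    return True

if __name__ == "__main__":
    a = math.cos(0.3) * cmath.exp(0.7j)
    b = math.sin(0.3) * cmath.exp(-1.1j)
    print([is_unitary(d_matrix(two_j, a, b)) for two_j in range(1, 6)])
```

For $J = \tfrac12$ the routine returns the $2 \times 2$ matrix with entries $a^*$, $b$, $-b^*$, $a$, which is the coefficient matrix of $(au + bv,\, -b^* u + a^* v)$ in the basis $f^{-1/2} = v$, $f^{1/2} = u$.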

Therefore,

$$\sum_{m=-J}^{J} \sum_{m'} \sum_{m''} f_J^{m'} D^J_{m'm} \left(f_J^{m''}\right)^* \left(D^J_{m''m}\right)^* = \sum_{m=-J}^{J} f_J^m \left(f_J^m\right)^*. \tag{108}$$

Since the $(2J + 1)^2$ functions $f^{m'} (f^{m''})^*$ are linearly independent, the representation matrices must satisfy

$$\sum_{m=-J}^{J} D^J_{m'm} \left(D^J_{m''m}\right)^* = \delta_{m'm''}. \tag{109}$$

That is, $D D^{+} = I$ and the representation is unitary.

One can show that a matrix $A$ which commutes with all matrices of the representation $D^J(a, b)$ must be a multiple of the unit matrix. Hence, by Schur’s lemma, the representation matrices $D^J(a, b)$ are irreducible.

B. Clebsch–Gordan Coefficients for SU(2) and Isospin

The SU(2) group has three infinitesimal generators, $J_1$, $J_2$, and $J_3$, and the Casimir invariant $J^2 = J_1^2 + J_2^2 + J_3^2$. The $(2j + 1)$-dimensional representations $D^{(j)}$ are generated by the basis functions $|jm\rangle$, where the labels are, respectively, the eigenvalues of $J^2$ ($j(j + 1)$) and $J_3$ ($m$). The electron spin in quantum mechanics is an outstanding example of SU(2) symmetry. Consider two different irreducible representations $D^{(j_1)}$ and $D^{(j_2)}$ (say, e.g., two electron spins). $D^{(j_1)} \times D^{(j_2)}$, a direct product of the two representations, is, in general, reducible. The Clebsch–Gordan (C–G) theorem essentially relates to the reduction of the representation

$$D^{(j_1)} \times D^{(j_2)} = \sum_J D^{(J)}, \qquad \text{where } |j_1 - j_2| \le J \le (j_1 + j_2), \tag{110}$$

$$|JM\rangle = \sum_{m_1 m_2} \langle j_1 m_1 j_2 m_2 | JM \rangle\, |j_1 m_1\rangle |j_2 m_2\rangle, \tag{111}$$

where $|j_1 m_1\rangle$ are the basis functions of the irreducible representations $D^{(j_1)}$ and similarly for $|j_2 m_2\rangle$; $|JM\rangle$ are the basis functions of the irreducible representations $D^{(J)}$; $\langle j_1 j_2 m_1 m_2 | JM \rangle$ are real numbers called C–G (or Wigner) coefficients.

This corresponds to the quantum mechanical vector addition of two commuting angular momentum vectors $\mathbf{J}_1 + \mathbf{J}_2 = \mathbf{J}$. The derivation of these coefficients is given in several textbooks and Wigner’s well-known classic (1959). Here we give instead a brief demonstration by induction for the case $j_2 = \tfrac12$.

According to the C–G theorem we have

$$\left|J = j_1 + \tfrac12, M = m\right\rangle = \sqrt{\frac{j_1 + m + \frac12}{2j_1 + 1}}\, \left|j_1, m_1 = m - \tfrac12\right\rangle \left|j_2 = \tfrac12, m_2 = +\tfrac12\right\rangle + \sqrt{\frac{j_1 - m + \frac12}{2j_1 + 1}}\, \left|j_1, m_1 = m + \tfrac12\right\rangle \left|j_2 = \tfrac12, m_2 = -\tfrac12\right\rangle, \tag{112}$$

$$\left|j_1 - \tfrac12, m\right\rangle = -\sqrt{\frac{j_1 - m + \frac12}{2j_1 + 1}}\, \left|j_1, m_1 = m - \tfrac12\right\rangle \left|\tfrac12, \tfrac12\right\rangle + \sqrt{\frac{j_1 + m + \frac12}{2j_1 + 1}}\, \left|j_1, m + \tfrac12\right\rangle \left|\tfrac12, -\tfrac12\right\rangle. \tag{113}$$

It is true for $m = j_1 + \tfrac12$; $|j_1 + \tfrac12, j_1 + \tfrac12\rangle = |j_1 j_1\rangle |\tfrac12 \tfrac12\rangle$. Assume it is true for $m$. We now show it is true for $m - 1$. Once a basis function of a representation $D^{(j)}$ is known, its partners can be generated by means of step-up and step-down operators $J_+$ and $J_-$ that satisfy

$$J_\pm |j, j_3\rangle = \sqrt{(j \mp j_3)(j \pm j_3 + 1)}\; |j, j_3 \pm 1\rangle. \tag{114}$$

Similar relations are well known in quantum mechanics where, for instance, $|lm\rangle$ will be a spherical harmonic when $J$ refers to the orbital angular momentum. Incidentally, it is important to note that SU(2) is the covering group for the three-dimensional rotation group O(3) to which the spherical harmonics belong. Operating with $(J_1 + J_2)_- = J_-$ on the state $|j_1 + \tfrac12, m\rangle$ we have

$$J_- \left|j_1 + \tfrac12, m\right\rangle = \sqrt{\frac{j_1 + m + \frac12}{2j_1 + 1}} \left[ \sqrt{\left(j_1 + m - \tfrac12\right)\left(j_1 - m + \tfrac32\right)}\; \left|j_1, m - \tfrac32\right\rangle \left|\tfrac12, \tfrac12\right\rangle + \left|j_1, m - \tfrac12\right\rangle \left|\tfrac12, -\tfrac12\right\rangle \right]$$

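$$\qquad + \sqrt{\frac{j_1 - m + \frac12}{2j_1 + 1}}\, \sqrt{\left(j_1 + m + \tfrac12\right)\left(j_1 - m + \tfrac12\right)}\; \left|j_1, m - \tfrac12\right\rangle \left|\tfrac12, -\tfrac12\right\rangle$$

$$= \sqrt{\frac{j_1 + m + \frac12}{2j_1 + 1}} \left[ \sqrt{\left(j_1 + m - \tfrac12\right)\left(j_1 - m + \tfrac32\right)}\; \left|j_1, m - \tfrac32\right\rangle \left|\tfrac12, \tfrac12\right\rangle + \left(j_1 - m + \tfrac32\right) \left|j_1, m - \tfrac12\right\rangle \left|\tfrac12, -\tfrac12\right\rangle \right]. \tag{115}$$

The representations being unitary, the basis functions must be orthonormal. We normalize Eq. (115) by multiplying it with the factor

$$\frac{1}{\sqrt{\left(j_1 + m + \tfrac12\right)\left(j_1 - m + \tfrac32\right)}}$$

and obtain

$$\left|j_1 + \tfrac12, m - 1\right\rangle = \sqrt{\frac{j_1 + m - \frac12}{2j_1 + 1}}\, \left|j_1, m - \tfrac32\right\rangle \left|\tfrac12, \tfrac12\right\rangle + \sqrt{\frac{j_1 - m + \frac32}{2j_1 + 1}}\, \left|j_1, m - \tfrac12\right\rangle \left|\tfrac12, -\tfrac12\right\rangle, \tag{116}$$

and this means that the result is true for $m - 1$, and therefore true for all $m$. In particular, when $m = j_1 - \tfrac12$ we have

$$\left|j_1 + \tfrac12, j_1 - \tfrac12\right\rangle = \sqrt{\frac{2j_1}{2j_1 + 1}}\, |j_1, j_1 - 1\rangle \left|\tfrac12, \tfrac12\right\rangle + \sqrt{\frac{1}{2j_1 + 1}}\, |j_1, j_1\rangle \left|\tfrac12, -\tfrac12\right\rangle. \tag{117}$$

Now for the case $j = j_1 - \tfrac12$ we note that the basis function $|j_1 - \tfrac12, j_1 - \tfrac12\rangle$ must be normalized and orthogonal to the above state. Equation (113) gives, with $m = j_1 - \tfrac12$,

$$\left|j_1 - \tfrac12, j_1 - \tfrac12\right\rangle = -\sqrt{\frac{1}{2j_1 + 1}}\, |j_1, j_1 - 1\rangle \left|\tfrac12, \tfrac12\right\rangle + \sqrt{\frac{2j_1}{2j_1 + 1}}\, |j_1, j_1\rangle \left|\tfrac12, -\tfrac12\right\rangle, \tag{118}$$

which is indeed orthogonal to $|j_1 + \tfrac12, j_1 - \tfrac12\rangle$ and is normalized. Hence the equation is correct for $m = j_1 - \tfrac12$. By going through a process of induction similar to the one for $j = j_1 + \tfrac12$, the proof is readily established.

C. Isospin Clebsch–Gordan Coefficients

Consider the reactions

(1) $\pi^+ + p \to \pi^+ + p$
(2) $\pi^- + p \to \pi^- + p$
(3) $\pi^- + p \to \pi^0 + n$

The C–G expansion is

$$\begin{aligned} |JM\rangle &= \sum_{m_1 + m_2 = M} \left\langle l m_1 \tfrac12 m_2 \middle| JM \right\rangle |l m_1\rangle \left|\tfrac12 m_2\right\rangle, \\ |JM\rangle &= \sum_{m_1 + m_2 = M} C^{JM}_{l m_1, \frac12 m_2}\, |l, m_1\rangle \left|\tfrac12, m_2\right\rangle, \\ \left|\tfrac32, \tfrac32\right\rangle &= |1, 1\rangle \left|\tfrac12, \tfrac12\right\rangle, \\ \left|\tfrac32, \tfrac12\right\rangle &= \sqrt{\tfrac13}\, |1, 1\rangle \left|\tfrac12, -\tfrac12\right\rangle + \sqrt{\tfrac23}\, |1, 0\rangle \left|\tfrac12, \tfrac12\right\rangle, \\ \left|\tfrac32, -\tfrac12\right\rangle &= \sqrt{\tfrac13}\, |1, -1\rangle \left|\tfrac12, \tfrac12\right\rangle + \sqrt{\tfrac23}\, |1, 0\rangle \left|\tfrac12, -\tfrac12\right\rangle, \\ \left|\tfrac32, -\tfrac32\right\rangle &= |1, -1\rangle \left|\tfrac12, -\tfrac12\right\rangle, \\ -\left|\tfrac12, \tfrac12\right\rangle &= \sqrt{\tfrac13}\, \left|\tfrac12, \tfrac12\right\rangle |1, 0\rangle - \sqrt{\tfrac23}\, \left|\tfrac12, -\tfrac12\right\rangle |1, 1\rangle, \\ \left|\tfrac12, -\tfrac12\right\rangle &= \sqrt{\tfrac13}\, \left|\tfrac12, -\tfrac12\right\rangle |1, 0\rangle - \sqrt{\tfrac23}\, \left|\tfrac12, \tfrac12\right\rangle |1, -1\rangle. \end{aligned} \tag{119}$$

Identifying,

$$|1\,1\rangle = |\pi^+\rangle, \quad |1\,0\rangle = |\pi^0\rangle, \quad |1\,{-1}\rangle = |\pi^-\rangle, \quad \left|\tfrac12\, \tfrac12\right\rangle = |p\rangle, \quad \left|\tfrac12\, {-\tfrac12}\right\rangle = |n\rangle. \tag{120}$$

One obtains

$$\left|\tfrac32, \tfrac32\right\rangle = |p, \pi^+\rangle,$$

$$\left|\tfrac32, \tfrac12\right\rangle = \sqrt{\tfrac13}\, |n, \pi^+\rangle + \sqrt{\tfrac23}\, |p, \pi^0\rangle,$$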

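$$\begin{aligned} \left|\tfrac32, -\tfrac12\right\rangle &= \sqrt{\tfrac13}\, |p, \pi^-\rangle + \sqrt{\tfrac23}\, |n, \pi^0\rangle, \\ \left|\tfrac32, -\tfrac32\right\rangle &= |n, \pi^-\rangle, \\ \left|\tfrac12, \tfrac12\right\rangle &= \sqrt{\tfrac23}\, |n, \pi^+\rangle - \sqrt{\tfrac13}\, |p, \pi^0\rangle, \\ \left|\tfrac12, -\tfrac12\right\rangle &= \sqrt{\tfrac13}\, |n, \pi^0\rangle - \sqrt{\tfrac23}\, |p, \pi^-\rangle, \end{aligned} \tag{121}$$

$$\begin{aligned} |p, \pi^+\rangle &= \left|\tfrac32, \tfrac32\right\rangle, \\ |p, \pi^-\rangle &= \sqrt{\tfrac13}\, \left|\tfrac32, -\tfrac12\right\rangle - \sqrt{\tfrac23}\, \left|\tfrac12, -\tfrac12\right\rangle, \\ |n, \pi^0\rangle &= \sqrt{\tfrac23}\, \left|\tfrac32, -\tfrac12\right\rangle + \sqrt{\tfrac13}\, \left|\tfrac12, -\tfrac12\right\rangle. \end{aligned} \tag{122}$$

By SU(2) invariance of (isospin conservation in) strong interactions,

$$\frac{\langle \pi^+ p | T | \pi^+ p \rangle}{\langle \pi^- p | T | \pi^- p \rangle} = \frac{T^{(3/2)}}{\tfrac13 T^{(3/2)} + \tfrac23 T^{(1/2)}}$$

and

$$\frac{\langle \pi^+ p | T | \pi^+ p \rangle}{\langle \pi^0 n | T | \pi^- p \rangle} = \frac{T^{(3/2)}}{\tfrac{\sqrt2}{3} T^{(3/2)} - \tfrac{\sqrt2}{3} T^{(1/2)}}.$$

At low energy we may take

$$T^{(1/2)} = 0,$$

and

$$\begin{aligned} \sigma(\pi^+ p \to \pi^+ p) : \sigma(\pi^- p \to \pi^- p) : \sigma(\pi^- p \to \pi^0 n) &= |\langle \pi^+ p | T | \pi^+ p \rangle|^2 : |\langle \pi^- p | T | \pi^- p \rangle|^2 : |\langle \pi^0 n | T | \pi^- p \rangle|^2 \\ &= 1 : (1/3)^2 : (\sqrt2/3)^2 \\ &= 9 : 1 : 2. \end{aligned} \tag{123}$$

In general, for $T^{(1/2)} \ne 0$,

$$\sqrt2\, \langle \pi^0 n | T | \pi^- p \rangle + \langle \pi^- p | T | \pi^- p \rangle = \langle \pi^+ p | T | \pi^+ p \rangle.$$

D. SU(3) and Particle Physics

SU(3) is the group of transformations $\psi'_a = U_{ab}\psi_b$, where $U$ is any unitary, unimodular (unit-determinant) $3 \times 3$ matrix. This is the group of the three-dimensional isotropic harmonic oscillator in quantum mechanics (see Schiff 1968). In terms of Lie’s infinitesimal generators $U$ is given by

$$U = \exp\left( i \sum_{k=1}^{8} \varepsilon_k F_k \right)$$

where $F_k \equiv \tfrac12 \lambda_k$ are the infinitesimal generators. An explicit matrix form for the $\lambda_k$’s, due to Gell-Mann, in which $\lambda_3$ and $\lambda_8$ are diagonal is

$$\begin{gathered} \lambda_1 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \lambda_2 = \begin{pmatrix} 0 & -i & 0 \\ i & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \lambda_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \lambda_4 = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \\ \lambda_5 = \begin{pmatrix} 0 & 0 & -i \\ 0 & 0 & 0 \\ i & 0 & 0 \end{pmatrix}, \quad \lambda_6 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad \lambda_7 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -i \\ 0 & i & 0 \end{pmatrix}, \quad \lambda_8 = \begin{pmatrix} 1/\sqrt3 & 0 & 0 \\ 0 & 1/\sqrt3 & 0 \\ 0 & 0 & -2/\sqrt3 \end{pmatrix}. \end{gathered} \tag{124}$$

The usual structure relations satisfied by these generators are

$$[F_\alpha, F_\beta] = \sum_\gamma C^{\gamma}_{\alpha\beta} F_\gamma = i \sum_\gamma f_{\alpha\beta\gamma} F_\gamma.$$

It is more convenient to express the structure constants in terms of $f_{\alpha\beta\gamma}$ rather than $C^{\gamma}_{\alpha\beta}$. The nonvanishing $f_{\alpha\beta\gamma}$ are

$$\begin{aligned} f_{123} &= 1, & f_{246} &= \tfrac12, & f_{367} &= -\tfrac12, \\ f_{147} &= \tfrac12, & f_{257} &= \tfrac12, & f_{458} &= \sqrt3/2, \\ f_{156} &= -\tfrac12, & f_{345} &= \tfrac12, & f_{678} &= \sqrt3/2. \end{aligned} \tag{125}$$

The $f_{\alpha\beta\gamma}$ are odd under permutation of any two indices.

We now define the combinations of generators, using the standard notation $T_\pm$, $T_3$ for isospin instead of $J_\pm$, $J_3$:

$$T_\pm = F_1 \pm iF_2, \qquad U_\pm = F_6 \pm iF_7, \qquad V_\pm = F_4 \pm iF_5, \qquad T_3 = F_3, \qquad \text{and } Y = (2/\sqrt3) F_8. \tag{126}$$

The commutation relations satisfied by these operators can easily be derived. For example,

$$[T_3, T_\pm] = [F_3, F_1 \pm iF_2] = [F_3, F_1] \pm i[F_3, F_2] = iF_2 \pm F_1 = \pm T_\pm.$$

Similarly,

$$[Y, T_\pm] = 0 = [T_3, Y], \qquad [T_3, U_\pm] = \mp\tfrac12 U_\pm, \qquad [Y, U_\pm] = \pm U_\pm, \tag{127}$$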

$$[U_+, U_-] = \tfrac32 Y - T_3 = 2U_3,\qquad [V_+, V_-] = \tfrac32 Y + T_3 = 2V_3,\quad\text{etc.}$$

The generators have been chosen so that $T_3$ and $Y$ are diagonal with eigenvalues $t_3$ and $y$. The states $\psi$ are labeled with $t_3$ and $y$. The action of the shift operators $T_\pm$, $U_\pm$, and $V_\pm$ on the state $\psi(t_3, y)$ is illustrated on the two-dimensional grid in Fig. 9. The action of $T_\pm$ is determined by the commutation relations

$$[T_3, T_\pm] = \pm T_\pm \quad\text{and}\quad [Y, T_\pm] = 0,$$

that of $U_\pm$ by

$$[T_3, U_\pm] = \mp\tfrac12 U_\pm \quad\text{and}\quad [Y, U_\pm] = \pm U_\pm,$$

and that of $V_\pm$ by

$$[T_3, V_\pm] = \pm\tfrac12 V_\pm \quad\text{and}\quad [Y, V_\pm] = \pm V_\pm. \tag{128}$$

One can generate all the states of an irreducible representation by repeated application of the shift operators to any one of them. The irreducible representations can be labeled by the pair of integers (p, q). If one begins at the unique state of maximum $t_3$, $\psi_{\max}$, the boundary of the distribution of occupied sites (states) is p steps long in the −120° direction (i.e., $V_-^{\,p+1}\psi_{\max} = 0$) and $V_-^{\,p}\psi_{\max}$ is proportional to the state at the next corner, moving clockwise along the boundary from $\psi_{\max}$. If one now heads in the $-t_3$ direction, there are q steps until the next corner in the boundary is reached [i.e., $T_-^{\,q+1}(V_-^{\,p}\psi_{\max}) = 0$] and $T_-^{\,q}(V_-^{\,p}\psi_{\max})$ is proportional to the state at the second corner, clockwise along the boundary from $\psi_{\max}$ (the boundary is always convex).

FIGURE 9 Results of shift operations on the state ψ(t3, y). [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). "Group Theory Made Easy for Scientists and Engineers," Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

The multiplicity (dimensionality) of the representation (p, q) is $N = \tfrac12(p+1)(q+1)(p+q+2)$. This can be seen as follows. Consider the case p > q. The boundary is, in general, a six-sided figure with sides of length p, q, p, q, p, q as one goes around clockwise from $\psi_{\max}$, each with interior angles of 120°. There is one state at each site on the boundary. As one moves in from the boundary, each successive six-sided figure has the length of each side reduced by one, and the number of states at each site increased by one. This continues until the short sides are reduced to zero and one has an equilateral triangle with sides of length (p − q) and multiplicity (q + 1). As one continues inward now, the multiplicity remains (q + 1).

The number of states on the boundary and inside the triangle is

$$(q+1)\sum_{k=1}^{p-q+1} k = \tfrac12(q+1)(p-q+1)(p-q+2). \tag{129}$$

The number of states on the six-sided figures is given by

$$3\sum_{k=0}^{q-1}(q-k)(p-q+2k+2).$$

Therefore,

$$\begin{aligned}
N &= \tfrac12(q+1)(p-q+1)(p-q+2) + 3\sum_{k=0}^{q-1}(q-k)(p-q+2k+2)\\
&= \tfrac12(p+1)(q+1)(p-q+2) - \tfrac12 q(q+1)(p-q+2)\\
&\quad + 3(p-q+2)\,\tfrac12 q(q+1) + 6\sum_{k=0}^{q-1}(q-k)k,
\end{aligned} \tag{130}$$

and since

$$\sum_{k=0}^{q-1}(q-k)k = \tfrac16(q+1)q(q-1),$$

we readily get

$$N = \tfrac12(p+1)(q+1)(p+q+2).$$

This result, symmetric under interchange of p and q, is also valid for p < q.

If one now applies $U_+$ to the state of maximum $t_3$, $\psi_{\max}$, one moves along the boundary in the counterclockwise direction, reaching the next corner, $\psi'$, after q steps. These q + 1 states form a U-spin multiplet of

which $\psi_{\max} = |U = \tfrac12 q,\ u_3 = -\tfrac12 q\rangle$ is the lowest state and $\psi' = |U = \tfrac12 q,\ u_3 = +\tfrac12 q\rangle$ is the highest state. $\psi'$ carries the maximum value of y that occurs in the representation, $y_{\max}$, since continuing counterclockwise from $\psi'$ the (convex) boundary runs parallel to the $t_3$ axis and then turns downward to smaller values of y. But $u_3$, $y$, and $t_3$ are not independent: $\tfrac32 y - t_3 = 2u_3$. Applying this equation to $\psi'$ ($t_3 = \tfrac12 p$, $u_3 = \tfrac12 q$) we find

$$y_{\max} = \tfrac43\left(\tfrac12 q\right) + \tfrac23\left(\tfrac12 p\right) = \tfrac13(p+2q). \tag{131}$$

[The eigenvalue of Y for $\psi_{\max}$ is $\tfrac13(p-q)$.] One can also find the value of $t_3$ carried by $\psi_{\max}$, $(t_3)_{\max}$, and hence the largest value of t, $t_{\max}$, which occurs in the representation: $t_{\max} = (t_3)_{\max} = \tfrac12(p+q)$, since in moving back from $\psi'$ to $\psi_{\max}$ the value of $t_3$ increases by $\tfrac12$ for each of the q steps.

One makes here the usual association of the three (3) representation with the quarks and the three-star (3*) representation with the antiquarks, y with hypercharge y = B + S (B is the baryon number and S the strangeness), and $t_3$ with the third component of isospin. The formula $y_{\max} = \tfrac13(p+2q)$ gives $y_{\max} = \tfrac13$ for the three representation and $y_{\max} = \tfrac23$ for the three-star representation. Using $t_{\max} = \tfrac12(p+q)$ one obtains $t_{\max} = \tfrac12$ for both the three and three star. Using the Gell-Mann–Nishijima relation for the charge (in units of the proton charge), $Q = t_3 + \tfrac12 y$, we obtain the usual quantum numbers for the quarks and antiquarks (Table X). These representations are shown in Fig. 10. Figure 11 shows the usual association of the octet with the $0^-$ mesons and the $\tfrac12^+$ baryons (there are other octets with $J^P = 1^-, 2^+$). The octet is easily constructed using the aforementioned rules and the values given by our formulas, $y_{\max} = 1$ and $t_{\max} = 1$. The decuplets (Fig. 12) are easily constructed after determining that $t_{\max} = \tfrac32$, $y_{\max} = 1$ for the 10 representation and $t_{\max} = \tfrac32$, $y_{\max} = 2$ for the 10* representation. The Gell-Mann–Nishijima relation is used to obtain the charges.

TABLE X Quantum Numbers for the Quarks and the Antiquarks

        B       t       t3      y       S       Q
p       1/3     1/2     1/2     1/3     0       2/3
n       1/3     1/2     −1/2    1/3     0       −1/3
λ       1/3     0       0       −2/3    −1      −1/3
p̄       −1/3    1/2     −1/2    −1/3    0       −2/3
n̄       −1/3    1/2     1/2     −1/3    0       1/3
λ̄       −1/3    0       0       2/3     1       1/3

FIGURE 10 The fundamental triplet representation. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). "Group Theory Made Easy for Scientists and Engineers," Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

E. Gell-Mann–Okubo Mass Formula

If one assumes that the mass operator is the sum of two terms, one which transforms like a U-spin scalar (U = 0) and the other which transforms like U = 1 (this is equivalent to the usual "octet enhancement" assumption), one

obtains $M = a + bU_3$ for a given U-spin multiplet. For the $\tfrac12^+$ octet,

$$\begin{aligned}
M(\Xi^0) &= a - b,\\
M(\Sigma_u^0) &= a = \tfrac34 M(\Lambda) + \tfrac14 M(\Sigma^0),\\
M(n) &= a + b,
\end{aligned}$$

where $\Sigma_u^0$ denotes the $U_3 = 0$ member of the U-spin triplet, so that $M(\Xi^0) + M(n) = 2M(\Sigma_u^0)$, which leads to the famous mass formula

$$\tfrac12\left(M(\Xi^0) + M(n)\right) = \tfrac14\left(3M(\Lambda) + M(\Sigma^0)\right). \tag{132}$$

For the $0^-$ octet the corresponding formula (using $M^2$ for mesons instead of $M$) is

$$\tfrac12\left(M^2(\bar K^0) + M^2(K^0)\right) = M^2(K^0) = \tfrac14\left(3M^2(\eta) + M^2(\pi^0)\right). \tag{133}$$

For the $\tfrac32^+$ decuplet,

$$\begin{aligned}
M(\Omega^-) &= a - \tfrac32 b,\\
M(\Xi^{*-}) &= a - \tfrac12 b,\\
M(\Sigma^{*-}) &= a + \tfrac12 b,\\
M(\Delta^-) &= a + \tfrac32 b,
\end{aligned} \tag{134}$$

which leads to the "equal spacing rule,"

$$M(\Omega^-) - M(\Xi^{*-}) = M(\Xi^{*-}) - M(\Sigma^{*-}) = M(\Sigma^{*-}) - M(\Delta^-).$$

FIGURE 11 The octet. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). "Group Theory Made Easy for Scientists and Engineers," Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

F. SU(6) and the Quark Model

We have seen that the hadrons fit well into the representations of SU(3). This can be understood well in terms of the quarks (antiquarks), which fit into the fundamental 3 (and 3*) representations of SU(3). We now have to take spin into account. We assume that the quarks are fermions and have spin $\tfrac12$. This gives us naturally the result that the mesons ($q\bar q$ states) have spin 0 or 1 and the baryons ($qqq$ states) have spin $\tfrac12$ or $\tfrac32$.

Our six quark states (Kokkedee) consist of $|p\uparrow\rangle$, $|p\downarrow\rangle$, $|n\uparrow\rangle$, $|n\downarrow\rangle$, $|\lambda\uparrow\rangle$, and $|\lambda\downarrow\rangle$, where the arrows refer to spin-up and spin-down. The SU(6) generalization for $q\bar q$ is

$$[6] \times [\bar 6] = [1] + [35]. \tag{135}$$

The SU(6) singlet has total spin zero. The 35-plet consists of an SU(3) octet with spin 0, another SU(3) octet with spin 1, and an SU(3) singlet with spin 1. This is represented as

$$[6] = \left[\{3\}, \tfrac12\right],\qquad [\bar 6] = \left[\{\bar 3\}, \tfrac12\right], \tag{136}$$

where the first entry is the SU(3) representation and the second is the spin. With this notation we have

$$[1] = [\{1\}, 0]$$

and

$$[35] = [\{1\}, 1] + [\{8\}, 0] + [\{8\}, 1]. \tag{137}$$

This accommodates very well the pseudoscalar meson octet, the vector meson octet, and the vector meson singlet. For the $qqq$ states the SU(6) expansion is

$$[6] \times [6] \times [6] = [20] + [56] + [70] + [70]. \tag{138}$$

The expansions in terms of SU(3) representations and spin are

$$\begin{aligned}
[20] &= \left[\{1\}, \tfrac32\right] + \left[\{8\}, \tfrac12\right],\\
[56] &= \left[\{8\}, \tfrac12\right] + \left[\{10\}, \tfrac32\right],
\end{aligned}$$

and

$$[70] = \left[\{1\}, \tfrac12\right] + \left[\{8\}, \tfrac12\right] + \left[\{8\}, \tfrac32\right] + \left[\{10\}, \tfrac12\right]. \tag{139}$$

This describes very well the baryon singlets, octets, and decuplets.

G. Gauge Theories: Applications to Elementary Particle Physics

Group theory plays a very important role in our current understanding of the elementary particles, the basic building blocks of nature, and their fundamental interactions.
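The counting argument for the SU(3) multiplicity N(p, q) and the SU(6) decompositions in Eqs. (135)–(139) lend themselves to a quick numerical cross-check. The following is a minimal sketch, not part of the original article: the function names (`su3_dim`, `su3_dim_by_layers`, `su6_dim`) and the `SU6_CONTENT` table are hypothetical helpers introduced here for illustration. It counts states layer by layer on the weight diagram, compares with the closed form, and confirms that each SU(6) multiplet has the dimension implied by its (SU(3) representation, spin) content.

```python
from fractions import Fraction


def su3_dim(p, q):
    """Closed form N = (p + 1)(q + 1)(p + q + 2)/2 for the representation (p, q)."""
    return (p + 1) * (q + 1) * (p + q + 2) // 2


def su3_dim_by_layers(p, q):
    """Count states on the weight diagram layer by layer, from the boundary inward."""
    if p < q:  # the count is symmetric under p <-> q
        p, q = q, p
    total = 0
    # q six-sided layers: layer j has sides (p - j) and (q - j), hence
    # 3 * ((p - j) + (q - j)) sites, each occupied (j + 1) times.
    for j in range(q):
        total += (j + 1) * 3 * ((p - j) + (q - j))
    # What remains is a triangle of side (p - q) with multiplicity (q + 1).
    side = p - q
    total += (q + 1) * (side + 1) * (side + 2) // 2
    return total


def y_max(p, q):
    """Largest hypercharge in (p, q), Eq. (131)."""
    return Fraction(p + 2 * q, 3)


def t_max(p, q):
    """Largest isospin in (p, q)."""
    return Fraction(p + q, 2)


def su6_dim(content):
    """Dimension of an SU(6) multiplet given as (SU(3) dimension, spin) pairs."""
    return sum(d * int(2 * spin + 1) for d, spin in content)


HALF, THREE_HALVES = Fraction(1, 2), Fraction(3, 2)

# (SU(3) dimension, spin) content of the multiplets in Eqs. (137) and (139).
SU6_CONTENT = {
    35: [(1, 1), (8, 0), (8, 1)],
    20: [(1, THREE_HALVES), (8, HALF)],
    56: [(8, HALF), (10, THREE_HALVES)],
    70: [(1, HALF), (8, HALF), (8, THREE_HALVES), (10, HALF)],
}

if __name__ == "__main__":
    for p, q in [(1, 0), (0, 1), (1, 1), (3, 0), (0, 3)]:
        print((p, q), su3_dim(p, q), y_max(p, q), t_max(p, q))
    for n, content in SU6_CONTENT.items():
        print(n, su6_dim(content))
```

Exact fractions are used for the spins and for y and t so that the comparisons involve no rounding. The same check also confirms the dimension bookkeeping of the products: 6 × 6 = 1 + 35 and 6 × 6 × 6 = 20 + 56 + 70 + 70.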

FIGURE 12 The decuplet. [Reprinted with permission from Swamy, N. V. V. J., and Samuel, M. A. (1979). "Group Theory Made Easy for Scientists and Engineers," Wiley-Interscience, New York. Copyright 1979 John Wiley and Sons.]

There are four fundamental interactions: electromagnetic, weak, strong, and gravitational. It is well established now that these are all gauge interactions based on various gauge groups which are continuous Lie groups. All the elementary particles fall neatly into the various representations of the gauge groups. The allowed interactions between the different particles are completely determined by the gauge symmetry, the mathematical requirement that the action be invariant under the transformation of the various fields under these group transformations. Historically, the pioneering step in this development was taken by H. Weyl, who proposed (Weyl, Fock) that the electromagnetic interaction, currently known as quantum electrodynamics (QED), is invariant under a local U(1) gauge symmetry. The experimental consequences of this symmetry have been tested to a very high degree of precision. Then, in 1954, Yang and Mills (Yang, Mills) extended this idea to include nonabelian gauge symmetry, such as SU(2), SU(3), etc. One new feature in this generalization is that the gauge bosons belonging to the adjoint representation of the gauge groups now can have interactions among themselves. [Such an interaction is not present in the U(1) gauge symmetry. The photon cannot interact with itself at the tree level.] One problem in pure gauge theories is that all the gauge particles must be massless. While the photon, the gluons, and the graviton are massless, gauge particles such as the weak intermediate vector bosons, $W^\pm$ and $Z$, are not. Thus, part of the gauge symmetry must be broken so that some of these gauge particles can acquire masses. In 1964, P. Higgs incorporated (Higgs, Englert, Brout, Guralnik, Hagen, Kibble) the idea of spontaneous symmetry breaking with the introduction of additional scalar particles in the theory, now called the Higgs bosons. The self-interaction of these particles (the so-called Higgs potential) is such that the vacuum expectation values of some of these Higgs fields are nonzero and the gauge symmetry is broken spontaneously to a smaller gauge group. As a result, some of the gauge bosons, as well as matter fields, acquire masses. This idea of nonabelian gauge symmetry and spontaneous symmetry breaking was successfully integrated to build a unified theory of weak and electromagnetic interactions (Weinberg, Salam, Glashow). The gauge group is SU(2) × U(1), which is spontaneously broken to $U_{EM}(1)$. Thus the theory has one massless gauge

boson, the photon, and three massive gauge bosons, $W^+$, $W^-$, and $Z$. All three were discovered experimentally (Arnison et al.) at the CERN proton–antiproton collider at their theoretically predicted masses. All the detailed experimental predictions of this theory have been verified to better than 1% accuracy, and not a single deviation has been found. S. L. Glashow, A. Salam, and S. Weinberg were awarded the Nobel Prize in 1979 for proposing this theory, and C. Rubbia and S. van der Meer were awarded the Nobel Prize in 1984 for the experimental discovery of these theoretically predicted gauge bosons, $W^\pm$ and $Z$.

It is now well established that the protons and neutrons, which are the building blocks of the nuclei, are not elementary particles. They are made of elementary constituents called quarks, bound by interaction through a set of gauge particles called gluons. The gauge symmetry group for this interaction is SU(3), sometimes called SU(3) color. The quarks, antiquarks, and gluons have nonzero color charges under this symmetry, and the gluons can interact with themselves in addition to interacting with the quarks and antiquarks. This theory is called quantum chromodynamics (QCD). This interaction, which is responsible for the binding of three quarks, or a quark and an antiquark, is very strong at low energy, and is responsible for the confinement of the quarks and gluons. At very high energy, the interaction becomes weaker and the theory becomes asymptotically free (Gross, Wilczek, Politzer). Many predictions of the theory at high energies have been tested experimentally to a good degree of accuracy.

Our current understanding of all three particle interactions, the electromagnetic, weak, and strong, is thus based on the gauge group SU(3) × SU(2) × U(1), called the standard model of particle physics. All the existing elementary particles, the fermions and gauge bosons, fall neatly into various representations of this gauge group as shown in Table XI, where $u_L$ stands for the left-handed up quark and $u_R$ for the right-handed up quark. The index α is the SU(3) color index and takes the values 1, 2, 3. The same is true for the down quark, d. Here $u_L$ and $d_L$ are doublets under the weak SU(2), with $u_L$ having $I_3 = +\tfrac12$ and $d_L$ having $I_3 = -\tfrac12$. The electron-type neutrino, $\nu_{eL}$, and the electron, $e_L$, have no color, and are doublets under the weak SU(2). All the right-handed fermions are singlets under SU(2). The last entries in the parentheses represent the values of the U(1) hypercharge, Y, and are related to the usual electric charge by the relation

$$Q = I_3 + \frac{Y}{2}. \tag{140}$$

Here $I_3$ is the third component of the weak isospin [the diagonal generator for SU(2)]. Note that all the right-handed particles are singlets under SU(2) weak, and hence they do not participate in the SU(2) weak interaction. This set of 15 chiral fermions $(u_{L\alpha}, d_{L\alpha}, \nu_{eL}, e_L, u_{R\alpha}, d_{R\alpha}, e_R)$ constitutes what is called the first family of fermions. This is called the electron family. There are two other families, the muon family consisting of $(c_{L\alpha}, s_{L\alpha}, \nu_{\mu L}, \mu_L, c_{R\alpha}, s_{R\alpha}, \mu_R)$, and the tau family consisting of $(t_{L\alpha}, b_{L\alpha}, \nu_{\tau L}, \tau_L, t_{R\alpha}, b_{R\alpha}, \tau_R)$. The families are exact replicas of each other except for the particle masses. The $g_a$ (a = 1–8) represent the eight gluons, and belong to the adjoint representation of the color SU(3). The weak gauge bosons $A_a$ (a = 1–3) belong to the adjoint representation of SU(2), and B is the gauge boson for the U(1) group. The observed gauge bosons $W^\pm$ are linear combinations of $A_1$ and $A_2$, and are given by

$$W^\pm = \frac{1}{\sqrt2}\,(A_1 \mp iA_2), \tag{141}$$

while the observed photon γ and the Z boson are linear combinations of $A_3$ and B,

$$\begin{aligned}
\gamma &= B\cos\theta_W + A_3\sin\theta_W,\\
Z &= -B\sin\theta_W + A_3\cos\theta_W.
\end{aligned} \tag{142}$$

The angle $\theta_W$ is known as the weak mixing angle.

TABLE XI Particle Contents of the Standard Model

Particles       SU(3) × SU(2) × U(1) representations
u_Lα            (3, 2, 1/3)
d_Lα            (3, 2, 1/3)
ν_eL            (1, 2, −1)
e_L             (1, 2, −1)
u_Rα            (3, 1, 4/3)
d_Rα            (3, 1, −2/3)
e_R             (1, 1, −2)
g_a             (8, 1, 0)
A_a             (1, 3, [+1, 0, −1])
B               (1, 1, 0)
H               (1, 2, 1)

All the experimental observations, at both low and high energy scales (up to a few hundred GeV), agree very well with the predictions of the standard model. However, the standard model has several theoretically unsatisfactory features. Since it is based on a semisimple group containing the product of three group factors, it has three independent gauge couplings associated with the gauge groups U(1), SU(2), and SU(3), respectively. It would be theoretically much more beautiful, as well as more predictive, if these three couplings could be unified into a simple group. Such a unification group (Georgi, Glashow, Pati, Salam), such as

SU(5), has been proposed. This not only relates the three coupling strengths $g$, $g'$, and $g_3$, it also predicts many interesting new phenomena such as proton decay, nonzero neutrino masses, matter–antimatter asymmetry, etc.

So far, we have left out the gravitational interaction. This is also a gauge theory, the gauge group being that of general coordinate transformations. Is it possible to unify gravity as well with the three gauge interactions of the standard model? The development of a new symmetry, called supersymmetry (Wess, Zumino, Volkov, Akulov), based on graded Lie algebra, makes such unification not only possible, but a requirement. Supersymmetry is a generalization of the 10-parameter Poincaré group to a 14-parameter super-Poincaré group (Wess, Zumino). It has four additional fermionic generators. Making these fermionic parameters local requires the introduction of gravity. An exciting new development in the 1990s, known as string theory (Polchinski) (which includes local supersymmetry), allows the unification of the standard model with gravity. The most interesting prediction of superstring theory is that space–time is 10-dimensional, with the extra six dimensions possibly being compact. Below we discuss the formalism of the standard model, grand unification, and supersymmetry from the group theory point of view.

H. Standard Model

1. Quantum Chromodynamics

The standard model (SM) is based on the local gauge symmetry group SU(3) × SU(2) × U(1). SU(3) corresponds to strong color interactions, while SU(2) × U(1) corresponds to electroweak interactions. Let us discuss SU(3) color interactions first. The representations of the quarks and gluons under the SU(3) color are given in Table XI. The Yang–Mills Lagrangian, for the pure gauge sector involving the gluons only, invariant under the local SU(3) gauge transformations, is given by

$$L_1 = -\tfrac14 F_{\mu\nu}^{\alpha}F^{\mu\nu\alpha}, \tag{143}$$

where

$$F_{\mu\nu}^{\alpha} = \partial_\mu A_\nu^{\alpha} - \partial_\nu A_\mu^{\alpha} + g_3 C_{\alpha\beta\gamma}A_\mu^{\beta}A_\nu^{\gamma}.$$

Here $A_\mu^{\alpha}$ are the SU(3) gauge fields belonging to the adjoint representation; μ, ν are space–time indices; α, β, γ are the adjoint SU(3) indices (α, β, γ = 1–8); $g_3$ is the coupling constant; and the $C_{\alpha\beta\gamma}$ are the structure constants for SU(3) given by

$$[T_\alpha, T_\beta] = iC_{\alpha\beta\gamma}T_\gamma, \tag{144}$$

where the $T_\alpha$ are the SU(3) generators given by Eq. (124) with $T_\alpha = \tfrac12\lambda_\alpha$. The transformation law for the gauge fields $A_\mu^{\alpha}$ is given by

$$A_\mu^{\alpha} \longrightarrow A_\mu^{\prime\alpha} = A_\mu^{\alpha} - \frac{1}{g_3}\partial_\mu\varepsilon^{\alpha} + C_{\alpha\beta\gamma}\varepsilon^{\beta}A_\mu^{\gamma}, \tag{145}$$

where $\varepsilon^{\alpha}$ are the infinitesimal parameters of SU(3).

The Lagrangian describing the interaction of the quarks (q) with the gluons is given by

$$L_2 = \bar q\,(\partial_\mu - ig_3T_\alpha A_\mu^{\alpha})\,q + M\bar qq, \tag{146}$$

where M is the mass of the quark. The SU(3) transformation law for the triplet quark fields $q_i$ (i = 1–3) is given by

$$q_i \longrightarrow \left(e^{i\varepsilon_\alpha T_\alpha}\right)_{ij}q_j, \tag{147}$$

where $\varepsilon_\alpha$ are the local (space–time-dependent) infinitesimal parameters. The sum of the Lagrangians $L_1 + L_2$ describes the complete SU(3) color interactions involving the quarks and the gluons and is known as quantum chromodynamics (QCD). This symmetry is exact, so that the gluons are massless. One very interesting feature of this nonabelian local gauge theory is that the coupling constant decreases logarithmically with the energy scale (this will be discussed in detail later, in the renormalization group equation section). Thus, the coupling constant vanishes at infinite energy, so the theory approaches the behavior of a free field theory as the energy scale approaches infinity. This is known as asymptotic freedom. This logarithmic decrease of the coupling parameter with energy has been tested up to a few hundred GeV. On the other hand, as the energy scale decreases, the coupling increases logarithmically so that it becomes very large at low energy and the theory becomes nonperturbative. This is sometimes called infrared slavery, and it has been speculated that this may be responsible for the confinement of the quarks and the gluons.

2. Electroweak Theory

The SU(2) × U(1) part of the SM is known as the electroweak theory (Weinberg, Salam, Glashow), since it describes the weak and EM interactions. The multiplet structure of the quarks, leptons, and the electroweak gauge bosons is given in Table XI. The gauge boson part of the Lagrangian under this symmetry is given by

$$L_1 = -\tfrac14 F_{\mu\nu}^{a}F^{\mu\nu a} - \tfrac14 B_{\mu\nu}B^{\mu\nu}, \tag{148}$$

where

$$\begin{aligned}
B_{\mu\nu} &= \partial_\mu B_\nu - \partial_\nu B_\mu,\\
F_{\mu\nu}^{a} &= \partial_\mu A_\nu^{a} - \partial_\nu A_\mu^{a} + g\,\varepsilon_{abc}A_\mu^{b}A_\nu^{c}.
\end{aligned}$$
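Equation (144) ties the SU(3) structure constants to the generators $T_\alpha = \tfrac12\lambda_\alpha$ of Eq. (124), so the values listed in Eq. (125) can be recovered numerically. The sketch below is an illustration added here, not code from the article. It assumes that $C_{\alpha\beta\gamma}$ in Eq. (144) coincides with $f_{\alpha\beta\gamma}$ of Eq. (125) and uses the standard trace identity $\mathrm{Tr}([\lambda_a, \lambda_b]\lambda_c) = 4i f_{abc}$; the helper names (`matmul`, `commutator`, `trace`, `f`, `check_algebra`) are hypothetical.

```python
import math

S3 = 1 / math.sqrt(3)

# Gell-Mann matrices lambda_1 ... lambda_8 of Eq. (124), as nested lists.
LAMBDA = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, -1j, 0], [1j, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, -1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [1, 0, 0]],
    [[0, 0, -1j], [0, 0, 0], [1j, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, -1j], [0, 1j, 0]],
    [[S3, 0, 0], [0, S3, 0], [0, 0, -2 * S3]],
]


def matmul(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def commutator(a, b):
    """[a, b] = ab - ba."""
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(3)] for i in range(3)]


def trace(a):
    return sum(a[i][i] for i in range(3))


def f(a, b, c):
    """Structure constant f_abc (1-based indices) from Tr([l_a, l_b] l_c) = 4i f_abc."""
    comm = commutator(LAMBDA[a - 1], LAMBDA[b - 1])
    return (trace(matmul(comm, LAMBDA[c - 1])) / 4j).real


def check_algebra(tol=1e-12):
    """Check [T_a, T_b] = i f_abc T_c, with T_a = lambda_a / 2, for all pairs (a, b)."""
    T = [[[x / 2 for x in row] for row in lam] for lam in LAMBDA]
    for a in range(8):
        for b in range(8):
            lhs = commutator(T[a], T[b])
            coeffs = [f(a + 1, b + 1, c + 1) for c in range(8)]
            for i in range(3):
                for j in range(3):
                    rhs = sum(1j * coeffs[c] * T[c][i][j] for c in range(8))
                    if abs(lhs[i][j] - rhs) > tol:
                        return False
    return True


# Nonvanishing values quoted in Eq. (125).
EQ_125 = {
    (1, 2, 3): 1.0, (2, 4, 6): 0.5, (3, 6, 7): -0.5,
    (1, 4, 7): 0.5, (2, 5, 7): 0.5, (4, 5, 8): math.sqrt(3) / 2,
    (1, 5, 6): -0.5, (3, 4, 5): 0.5, (6, 7, 8): math.sqrt(3) / 2,
}

if __name__ == "__main__":
    for idx, value in EQ_125.items():
        print(idx, round(f(*idx), 6), round(value, 6))
    print("algebra closes:", check_algebra())
```

Plain nested lists are used instead of a numerical library so that the sketch has no dependencies; the matrices are small enough that the cost is negligible.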

The transformation laws for the gauge fields $B_\mu$ and $A_\mu^{a}$ are given by

$$\begin{aligned}
B_\mu &\longrightarrow B'_\mu = B_\mu - \frac{1}{g'}\partial_\mu\varepsilon,\\
A_\mu^{a} &\longrightarrow A_\mu^{\prime a} = A_\mu^{a} - \frac{1}{g}\partial_\mu\varepsilon^{a} - \varepsilon_{abc}\varepsilon^{b}A_\mu^{c}.
\end{aligned}$$

Although the photon is massless, the gauge bosons $W^\pm$ and $Z$ are massive ($M_W$ = 80 GeV, $M_Z$ = 91 GeV). Thus the SU(2) × U(1) symmetry must be broken to $U_{EM}(1)$ at the 100-GeV scale. In analogy with the spontaneous breaking of the rotational symmetry in the case of ferromagnetism, this symmetry is also spontaneously broken by using suitable scalar fields, the so-called Higgs fields. The Higgs field required is an SU(2) weak doublet, as shown in Table XI. The Higgs potential is such that although the Lagrangian is invariant under the symmetry group, the ground state (the vacuum state) is not. The required Higgs Lagrangian is

$$L_2 = D^\mu H^\dagger D_\mu H - V(H), \tag{149}$$

where

$$D_\mu = \partial_\mu - igT_iA_\mu^{i} - i\frac{g'}{2}YB_\mu$$

is known as the gauge-covariant derivative, and Y is the U(1) hypercharge.

$$V(H) = -\mu^2H^\dagger H + \lambda(H^\dagger H)^2 \tag{150}$$

is called the Higgs potential.

Note that for positive values of $\mu^2$ and λ, V(H) has a minimum for nonzero values of H, where $\langle H\rangle \equiv \langle 0|H|0\rangle \equiv V$ is the vacuum expectation value of H. Thus, the SU(2) × U(1) symmetry is broken spontaneously to $U_{EM}(1)$, which is a linear combination of the diagonal part of the SU(2) and the U(1). The charged ($W^\pm$) and the neutral (Z) gauge bosons acquire masses

$$M_W = \tfrac12 gV,\qquad M_Z = \tfrac12\sqrt{g^2 + g'^2}\;V, \tag{151}$$

while the photon $A_\mu$, corresponding to the unbroken $U_{EM}(1)$ gauge symmetry, remains massless. In terms of the original gauge fields ($A_1$, $A_2$, $A_3$) and B, the expressions for the $W^\pm$, Z, and γ fields are given by Eqs. (141) and (142). The expression for the weak mixing angle $\theta_W$ is given by

$$\tan\theta_W = \frac{g'}{g}. \tag{152}$$

The Lagrangian for the fermionic part of the gauge interactions is given by

$$L_3 = \bar\Psi_L D_\mu^{L}\Psi_L + \bar\Psi_R D_\mu^{R}\Psi_R, \tag{153}$$

where

$$\begin{aligned}
D_\mu^{L} &= \partial_\mu - igT_iA_\mu^{i} - i\frac{g'}{2}Y_LB_\mu,\\
D_\mu^{R} &= \partial_\mu - i\frac{g'}{2}Y_RB_\mu.
\end{aligned}$$

Here $\Psi_L$ represents the left-handed quark or lepton doublets, while $\Psi_R$ represents the right-handed quark or lepton singlets under the SU(2) transformations given in Table XI:

$$\Psi_L = q_L = \begin{pmatrix} u\\ d \end{pmatrix}_L,\quad \ell_L = \begin{pmatrix} \nu_e\\ e \end{pmatrix}_L,\qquad \Psi_R = u_R,\ d_R,\ e_R.$$

Finally, the Lagrangian for the Yukawa part of the interaction responsible for giving rise to the masses of the fermions is given by

$$L_4 = f_d\,\bar q_Ld_RH + f_u\,\bar q_Lu_R\tilde H + f_e\,\bar\ell_Le_RH,$$

where

$$\tilde H \equiv i\tau_2H^* \quad\text{and}\quad \tau_2 = \begin{pmatrix} 0&-i\\ i&0 \end{pmatrix}. \tag{154}$$

When the symmetry is broken spontaneously, the fermions acquire masses

$$m_u = f_u\frac{V}{\sqrt2},\qquad m_d = f_d\frac{V}{\sqrt2},\qquad m_e = f_e\frac{V}{\sqrt2}. \tag{155}$$

Thus, in electroweak gauge theory, the gauge boson masses are predicted by the gauge symmetry, while the fermion masses are parametrized in terms of the unknown Yukawa couplings $f_u$, $f_d$, and $f_e$.

I. Grand Unification

We saw that the gauge theory based on the SM gauge group, SU(3) × SU(2) × U(1), involves three independent coupling constants, $g_3$, $g$, and $g'$ (or $g_3$, $g_2$, $g_1$, where $g_2 \equiv g$, $g_1 = \sqrt{5/3}\,g'$ in a different normalization). The question naturally arises whether the SM gauge group can be embedded into a simple group so that it will involve only one coupling constant. It turns out that there are such unification groups, the simplest being SU(5). (Since the SM gauge group has rank 4, the rank of such a unifying symmetry group must be at least 4.) The unification of the three SM gauge interactions in a simple gauge group is known as the grand unification theory (GUT).

J. SU(5) Grand Unification

The SU(5) grand unification theory can accommodate the chiral nature of the fermion representation under SU(2) × U(1) as well as the absence of chiral anomalies

without introducing new fermions. One can also calculate the value of the weak mixing angle. The gauge bosons belong to the adjoint representation of SU(5). The 24 gauge bosons have the following decomposition into SM, SU(3) × SU(2) × U(1), representations:

$$24 = (8, 1, 0) + (1, 3, 0) + \left(3, 2, \tfrac53\right) + \left(3^*, 2, -\tfrac53\right) + (1, 1, 0),$$

where (8, 1, 0) represents the eight gluons, (1, 3, 0) represents the three SU(2) gauge bosons, and (1, 1, 0) represents the U(1) gauge boson. The particles in $(3, 2, \tfrac53)$ and $(3^*, 2, -\tfrac53)$ represent 12 new gauge bosons that are not present in the SM. These are denoted by $X_\alpha$, $Y_\alpha$, the (X, Y) being a doublet under the weak SU(2) and a triplet under the color SU(3) (α = 1, 2, 3). The six gauge bosons in $(3^*, 2, -\tfrac53)$ are the complex conjugate fields of (X, Y). There are 24 generators, $\lambda_A$ (A = 1–24), of SU(5), where $\lambda_\alpha$ (α = 1–8) are the 5 × 5 matrix generalization of those given in Eq. (124). The $\lambda_{20+\alpha}$ (α = 1, 2, 3) correspond to the generalization of the three SU(2) generators, while $\lambda_{24}$ corresponds to that of the hypercharge U(1) generator. The generators $\lambda_A$ (A = 9–20) are the new ones that are not present in the SM gauge group. The corresponding gauge bosons $X_\alpha$, $Y_\alpha$ can change quarks into leptons and thus violate baryon and lepton number conservation. This is a new feature of all grand unified theories.

The fermions belong to the SU(5) representations 5* and 10, which decompose into SU(3) × SU(2) × U(1) representations as follows:

$$\begin{aligned}
5^* &= \left(3^*, 1, \tfrac23\right) + (1, 2, -1),\\
10 &= \left(3^*, 1, -\tfrac43\right) + \left(3, 2, \tfrac13\right) + (1, 1, 2).
\end{aligned}$$

Expressing all fermions as left-handed Weyl fermions, $(3^*, 1, \tfrac23)$ contains the three color-antitriplet, SU(2)-singlet down-type antiquarks $(d_1^c, d_2^c, d_3^c)$; (1, 2, −1) contains the color-singlet, SU(2)-doublet electron-type neutrino and electron $(\nu_e, e)$; $(3, 2, \tfrac13)$ contains the color-triplet, SU(2)-doublet up and down quarks; $(3^*, 1, -\tfrac43)$ contains the color-antitriplet, SU(2)-singlet up-type antiquarks; and (1, 1, 2) contains the positron, a singlet under both SU(3) and SU(2). Note that all the known fermions in one family fit neatly, and no extra fermions are needed.

K. Symmetry Breaking in SU(5)

SU(5) symmetry must be broken spontaneously at a very high energy scale (∼$10^{16}$ GeV) because the X, Y gauge bosons have baryon number-violating interactions, and will cause proton decay. Agreement with the current experimental lifetime of the proton demands $M_{X,Y} \geq 10^{16}$ GeV. There are two stages of symmetry breaking:

$$SU(5) \xrightarrow[\text{first stage}]{\ \mu\sim10^{16}\ \text{GeV}\ } SU(3)\times SU(2)\times U(1) \xrightarrow[\text{second stage}]{\ \mu\sim10^{2}\ \text{GeV}\ } SU(3)\times U_{EM}(1).$$

The first stage of symmetry breaking can be achieved using a 24-dimensional Higgs representation of SU(5), while the second stage of symmetry breaking can be achieved by using a 5-dimensional Higgs representation of SU(5). The gauge bosons X, Y acquire masses after the first symmetry breaking, while $W^\pm$, $Z$ get their masses at the second symmetry breaking. Fermions also receive their masses from the second stage of symmetry breaking.

L. Renormalization Group Equations and Evolutions of Gauge Couplings

In grand unified theories (GUTs) with a simple gauge group G such as SU(5), we have only one coupling constant, $g_G$. Thus, at the energy scale $M_{GUT}$ above which the symmetry is exact, the three couplings $g_3$, $g_2$, $g_1$ are equal. (Note that we are using a normalization for which $g_2 \equiv g$, $g_1 = \sqrt{5/3}\,g'$.) The values of the three couplings are so different at low energy because they evolve in energy differently, making the effective couplings $g_2$ and $g_1$ much smaller than the strong coupling $g_3$. The renormalization group evolution of a coupling $g_i$ is given by

$$\mu\frac{dg_i}{d\mu} = \beta_i(\{g\}), \tag{156}$$

where $\beta_i(\{g\})$ is the beta-function for the coupling $g_i$. At the one-loop level, for the three SM gauge couplings, $\beta_i(\{g_i(\mu)\}) = b_ig_i^3(\mu)$ for $g_i \ll 1$. The boundary condition is

$$g_1 = g_2 = g_3 = g_G \quad\text{at}\quad \mu = M_{GUT}. \tag{157}$$

For the SM,

$$\begin{aligned}
b_3 &= -\frac{1}{16\pi^2}\left(11 - \frac23N_f\right),\\
b_2 &= -\frac{1}{16\pi^2}\left(\frac{22}{3} - \frac23N_f - \frac16N_H\right),\\
b_1 &= \frac{1}{16\pi^2}\left(\frac23N_f + \frac{1}{10}N_H\right),
\end{aligned} \tag{158}$$

where $N_f$ is the number of quark flavors ($N_f$ = 6 for six quark flavors, u, d, c, s, t, b) and $N_H$ is the number of Higgs doublets.

The experimental values of the three couplings at a low energy scale, such as $\mu = M_Z$, are very well determined from studies of $Z$-boson decay:

$$\alpha_3^{-1}(M_Z) = 8.5 \pm 0.5, \qquad \alpha_2^{-1}(M_Z) = 29.61 \pm 0.13, \qquad \alpha_1^{-1}(M_Z) = 98.29 \pm 0.13, \tag{159}$$

where $\alpha_i(\mu) \equiv g_i^2(\mu)/4\pi$. Using these values at $\mu = M_Z$ and evolving the couplings to higher energy scales using Eqs. (156) and (157), we find that the couplings do not unify for this simple minimal SU(5) GUT. Also, the very approximate unification scale is too small, $M_X \sim 10^{13}$ GeV, so the dominant proton decay rate in this model, $p \to e^+ \pi^0$, is too high compared to the experimental limit. Thus, this simple SU(5) GUT theory is ruled out experimentally. However, for supersymmetric SU(5) GUT, the unification works beautifully and the unification scale is $\sim 10^{16}$ GeV, in complete agreement with the current experimental proton lifetime limits. So, next we turn to supersymmetry, the supersymmetric SM (also known as the MSSM), and supersymmetric GUTs.

M. Supersymmetry

Supersymmetry is a beautiful mathematical generalization of the 10-parameter Poincaré group to the 14-parameter super-Poincaré or supersymmetry group. The generators $J_{\mu\nu}$ and $P_\mu$ ($\mu, \nu = 0, 1, 2, 3$) of the 10-parameter Poincaré group satisfy the algebra

$$\begin{aligned}
[J_{\mu\nu}, J_{\rho\sigma}] &= g_{\mu\rho} J_{\nu\sigma} - g_{\nu\rho} J_{\mu\sigma} + g_{\mu\sigma} J_{\nu\rho} - g_{\nu\sigma} J_{\mu\rho}, \\
[P_\mu, P_\nu] &= 0, \qquad [J_{\mu\nu}, P_\rho] = g_{\mu\rho} P_\nu - g_{\nu\rho} P_\mu.
\end{aligned} \tag{160}$$

Wess and Zumino (1974) discovered that this Poincaré algebra beautifully generalizes to a new symmetry algebra by introducing four new generators, $S_\alpha$. The $S_\alpha$ are fermionic, the index $\alpha$ ($\alpha = 1, 2, 3, 4$) is a spinor index, and the corresponding parameters $\varepsilon_\alpha$ are anticommuting Majorana spinors. The $S_\alpha$ satisfy the following algebra:

$$\begin{aligned}
[S_\alpha, P_\mu] &= 0, \qquad [J_{\mu\nu}, S_\alpha] = \tfrac{1}{2} (\sigma_{\mu\nu})_{\alpha\beta} S_\beta, \\
\{S_\alpha, S_\beta\} &= (\gamma^\mu c)_{\alpha\beta} P_\mu,
\end{aligned} \tag{161}$$

where $c$ is the Dirac charge conjugation matrix.

In Eq. (161), the curly braces represent anticommutators. The introduction of anticommuting parameters and the associated fermionic generators, together with the bosonic ones, brought a lot of excitement among mathematicians. The usual Lie algebra has now been generalized to include both commutators and anticommutators and is known as a graded Lie algebra. Graded Lie algebras have now been classified. The representation of the super-Poincaré algebra, Eqs. (160) and (161), involves both bosons and fermions in the same representation and is called Fermi–Bose symmetry or supersymmetry. In a given representation, we have particles of both spin $j$ and $j - \tfrac{1}{2}$. The simplest irreducible representation is called the chiral scalar superfield, in which we have a complex scalar field and a chiral fermionic field. Next is the vector representation, which has a massless vector field and a Majorana spinor field. Thus, in the SM, each representation will be associated with its superpartner. The particle content of the minimal supersymmetric extension of the SM is given in Table XII. The first column represents the usual SM particles, and the second column their supersymmetric partners. The intrinsic spins of the superpartners differ by half a unit. For example, $\tilde{e}_L$ has spin 0, a scalar particle. Note that an additional new feature is that two Higgs doublets are needed to cancel chiral anomalies.

Since none of the superpartners has so far been observed, supersymmetry must be broken at a scale of a few hundred GeV or higher. Two popular supersymmetry breaking mechanisms are the gravity mediated and the gauge mediated (GMSB). The superpartners are expected to have masses below about a TeV for supersymmetry to solve the gauge hierarchy problem, and are expected to be discovered at the Large Hadron Collider (LHC), if not at the upgraded Tevatron. Local supersymmetry, in which the fermionic parameters $\varepsilon_\alpha$ are arbitrary functions of space–time $(x, t)$, necessarily needs the introduction of the gravity supermultiplet containing the spin-2 graviton and its supersymmetric partner, the gravitino.

TABLE XII Particle Contents of the Supersymmetric Standard Model

Particles         Superpartners             SU(3) × SU(2) × U(1) representations
$u_{L\alpha}$     $\tilde{u}_{L\alpha}$     $(3, 2, \tfrac{1}{3})$
$d_{L\alpha}$     $\tilde{d}_{L\alpha}$     $(3, 2, \tfrac{1}{3})$
$\nu_{eL}$        $\tilde{\nu}_{eL}$        $(1, 2, -1)$
$e_L$             $\tilde{e}_L$             $(1, 2, -1)$
$u_{R\alpha}$     $\tilde{u}_{R\alpha}$     $(3, 1, \tfrac{4}{3})$
$d_{R\alpha}$     $\tilde{d}_{R\alpha}$     $(3, 1, -\tfrac{2}{3})$
$e_R$             $\tilde{e}_R$             $(1, 1, -2)$
$g_a$             $\tilde{g}_a$             $(8, 1, 0)$
$A_a$             $\tilde{A}_a$             $(1, 3, (+1, 0, -1))$
$B$               $\tilde{B}$               $(1, 1, 0)$
$H_1$             $\tilde{H}_1$             $(1, 2, 1)$
$H_2$             $\tilde{H}_2$             $(1, 2, -1)$
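The U(1) entries of Table XII can be sanity-checked against the familiar electric charges. The sketch below assumes the convention $Q = T_3 + Y/2$, which is not stated in the text above but reproduces the expected charges from the tabulated hypercharges; the weak-isospin values $T_3$ and the row labels are likewise supplied here for illustration. Each superpartner carries the same gauge quantum numbers as its partner, so the same check covers both columns.

```python
from fractions import Fraction as F

def electric_charge(t3, y):
    """Electric charge under the assumed convention Q = T3 + Y/2."""
    return t3 + y / 2

# (label, assumed T3, hypercharge Y from Table XII, expected charge Q)
rows = [
    ("u_L",   F(1, 2),  F(1, 3),  F(2, 3)),
    ("d_L",   F(-1, 2), F(1, 3),  F(-1, 3)),
    ("nu_eL", F(1, 2),  F(-1),    F(0)),
    ("e_L",   F(-1, 2), F(-1),    F(-1)),
    ("u_R",   F(0),     F(4, 3),  F(2, 3)),
    ("d_R",   F(0),     F(-2, 3), F(-1, 3)),
    ("e_R",   F(0),     F(-2),    F(-1)),
]

for label, t3, y, q_expected in rows:
    q = electric_charge(t3, y)
    assert q == q_expected, label
    print(f"{label:6s} Q = {q}")
```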

In the supersymmetric SM, there will be additional contributions to the evolution of the gauge couplings above the supersymmetric thresholds. Thus, above threshold, the beta-function coefficients $b_1$, $b_2$, and $b_3$ are modified to

$$\tilde{b}_3 = -\frac{1}{16\pi^2}\,(9 - N_f), \qquad
\tilde{b}_2 = -\frac{1}{16\pi^2}\left(6 - N_f - \frac{1}{2}N_H\right), \qquad
\tilde{b}_1 = \frac{1}{16\pi^2}\left(N_f + \frac{3}{10}N_H\right). \tag{162}$$

Starting with the experimental values of $\alpha_1(M_Z)$, $\alpha_2(M_Z)$, and $\alpha_3(M_Z)$ given in Eq. (159), and evolving these to higher energy scales, the couplings unify at an energy scale $\mu = M_{\rm GUT} \simeq 2 \times 10^{16}$ GeV. Thus, although in non-SUSY SU(5) the unification does not take place, unification does occur very accurately in SUSY SU(5). Also, because of the very high scale of the unification, the proton decay limits are easily satisfied.

IX. APPLICATIONS IN GEOMETRICAL OPTICS

A nondegenerate, skew-symmetric bilinear form in $2n$ ($n = 2$, for example) dimensions is usually written as

$$\sum_{i=1}^{4}\sum_{k=1}^{4} a_{ik}\, x_i y_k, \qquad a_{ik} = -a_{ki}. \tag{163}$$

In matrix form this can be expressed as $X^T A Y$, where $x_i$ and $y_k$ are treated as components of the vectors $X$ and $Y$, respectively, and the matrix $A$ satisfies the relation

$$A^T = -A, \tag{164}$$

the superscript $T$ meaning the transposition operation. We assume that the variables as well as the coefficients are real. By suitable transformations the above bilinear form can be reduced to "canonical form,"

$$\sum_{i=1}^{\nu = 2} \left(\xi_i \eta_{i+\nu} - \xi_{i+\nu}\eta_i\right) = \xi^T K \eta. \tag{165}$$

Here $\xi$ is the vector $(\xi_1, \xi_2, \xi_3, \xi_4)$, $\eta$ the vector $(\eta_1, \eta_2, \eta_3, \eta_4)$, and $K$ the matrix

$$K = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix} \equiv \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}. \tag{166}$$

The set of all matrices $S$ that leave this skew-symmetric form invariant constitutes a Lie group called the symplectic group Sp($2n$) (here $2n$ is 4), and this is a subgroup of the general linear group GL($2n$). The requirement of skew symmetry implies that the general element $S$ of this group of transformations should satisfy

$$S^T K S = K. \tag{167}$$

In geometric optics a single ray incident on a refracting surface is usually defined by one of its points in object space, given by a vector $\mathbf{a}$, and a vector $\mathbf{s}$ which gives the direction of the ray. The length of $\mathbf{s}$ is $n$, the refractive index of the medium in that space.

In image space the corresponding vectors describing the refracted ray are $\mathbf{a}'$ and $\mathbf{s}'$, $n'$ being the appropriate refractive index there. A two-dimensional manifold of such rays is adequately described by the components of the vectors expressed as continuous functions of two variable parameters $u$ and $v$. It has been proven by Herzberger that a fundamental invariant governing image formation is $(\mathbf{a}_u \cdot \mathbf{s}_v) - (\mathbf{a}_v \cdot \mathbf{s}_u)$. In other words,

$$(\mathbf{a}_u \cdot \mathbf{s}_v) - (\mathbf{a}_v \cdot \mathbf{s}_u) = (\mathbf{a}'_u \cdot \mathbf{s}'_v) - (\mathbf{a}'_v \cdot \mathbf{s}'_u), \tag{168}$$

equivalent to the Lagrange bracket in analytical mechanics. Here $\mathbf{a}_u$ is a notation for the partial derivative $\partial \mathbf{a}/\partial u$. It turns out that the invariant is valid for any point on the ray, and as such it is convenient to refer to the intersection of the ray with a plane $Z = 0$, so that the components of $\mathbf{a}$ are $(x, y)$. Let the components of $\mathbf{s}$ be $(\xi, \eta, \sqrt{n^2 - \xi^2 - \eta^2})$. Taking the pair of parameters $u, v$ to be $(x, y)$, $(x, \xi)$, $(x, \eta)$, $(y, \xi)$, $(y, \eta)$, and $(\xi, \eta)$ [every pair of the four variables $(x, y, \xi, \eta)$] in turn, the above invariant relation yields the following six equations for the components of the vectors:

$$\begin{aligned}
x'_x \xi'_y + y'_x \eta'_y - x'_y \xi'_x - y'_y \eta'_x &= 0 \\
x'_x \xi'_\xi + y'_x \eta'_\xi - x'_\xi \xi'_x - y'_\xi \eta'_x &= 1 \\
x'_x \xi'_\eta + y'_x \eta'_\eta - x'_\eta \xi'_x - y'_\eta \eta'_x &= 0 \\
x'_y \xi'_\xi + y'_y \eta'_\xi - x'_\xi \xi'_y - y'_\xi \eta'_y &= 0 \\
x'_y \xi'_\eta + y'_y \eta'_\eta - x'_\eta \xi'_y - y'_\eta \eta'_y &= 1 \\
x'_\xi \xi'_\eta + y'_\xi \eta'_\eta - x'_\eta \xi'_\xi - y'_\eta \eta'_\xi &= 0
\end{aligned} \tag{169}$$

This set of equations can be expressed in matrix form as

$$M^T K M = K, \tag{170}$$

where $K$ is given in Eq. (166). The matrix $M$ now becomes

$$M = \begin{pmatrix}
x'_x & x'_y & x'_\xi & x'_\eta \\
y'_x & y'_y & y'_\xi & y'_\eta \\
\xi'_x & \xi'_y & \xi'_\xi & \xi'_\eta \\
\eta'_x & \eta'_y & \eta'_\xi & \eta'_\eta
\end{pmatrix},$$

recalling that $y'_\xi \equiv \partial y'/\partial \xi$, etc. Thus we see that Eq. (170), sometimes called the "lens equation," is identical to

Eq. (167), and the latter describes the symplectic group Sp(4). Thus, the symplectic group is of fundamental importance in geometric optics.

An interesting application of the techniques based on the symplectic Lie group has been made to charged-particle beam optics, which is of importance in particle accelerator physics. The article in Annual Review of Nuclear and Particle Science cited in the Bibliography (Dragt et al., 1988) gives an illuminating discussion of this application.

X. THE RENORMALIZATION GROUP

The renormalization group (RG) theory has applications in several areas of physics, but we shall confine our treatment here to "critical phenomena." The importance of RG theory in elementary particle physics is discussed elsewhere in this article.

The concept of renormalization has its origins in quantum electrodynamics, which is the theory that explains the properties and behavior of electrons and photons or light quanta. According to quantum field theory, particles are the outcome of quantizing appropriate wave fields. For instance, photons are quanta resulting from quantizing the classical electromagnetic field of Maxwell. A disturbing feature of quantum electrodynamics has been the existence of infinities or divergent integrals. For instance, if we assume the charge $e$ of an electron to be uniformly distributed in a sphere of radius $a$, the electrostatic self-energy of this sphere is $\tfrac{3}{5}(e^2/a)$ according to Maxwell's theory, and this becomes infinite in the limit of $a$ going to zero. In other words, a point electron has infinite self-energy. Many such divergences exist, of which two important cases are the self-energy of the photon and the polarizability of the vacuum induced by an external electric field. This vacuum happens to be an infinite sea of occupied electron states of negative energy, which are the inevitable consequences of Dirac's relativistic electron theory. According to Oppenheimer, the roots of charge renormalization lie in the efforts to overcome these infinities by asserting that the experimentally measured charge of the electron is the sum of the "true" and "induced" charges, and it is this prescription that is the genesis of renormalization of charge. The existence of a similar distinction between "bare electron mass" and "dressed electron mass" has been pointed out by Kramers in his study of the interaction between charged particles and the radiation field.

These concepts took concrete shape ten years later in the epoch-making contributions of Schwinger, Tomonaga, and Feynman, and the term renormalization has been in vogue ever since. According to Yukawa, the forces between particles are mediated by quanta and, in particular, the forces between electrons are mediated by light quanta or photons. A popular picture of the electron can thus be that of a bare particle modestly apparelled in photons. It is important to note that underlying all these ideas is the mass–energy equivalence of Einstein's special relativity theory.

As a preliminary to the introduction of the renormalization group, it is helpful to discuss certain ideas of "critical phenomena," where it finds an important application. We shall single out phase transitions in liquids for this purpose. It is common knowledge that under appropriate conditions of temperature and pressure, water exists in the three phases of solid (ice), liquid (water), and gas (steam). In a three-dimensional plot of pressure, density, and temperature, the domains of these phases and their boundaries are clearly demarcated. Across the boundary separating gas from liquid, for instance, there is a discontinuity in density. In his experiments on the liquefaction of carbon dioxide gas by isothermal compression, Andrews noticed in 1869 the existence of a critical temperature $T_c$, above which there is a continuity of the gas and liquid phases. In other words, it is only below the critical temperature that condensation of a gas can take place. The pressure at this point is called the critical pressure $p_c$, and likewise the density $\rho_c$. As the temperature is increased to $T_c$, the density of the gas phase, $\rho_g$, tends to equal the density of the liquid phase, $\rho_l$, and the two become one, $\rho_c$, at the critical temperature $T_c$. For water, for instance, $T_c$ is 648, $p_c$ is 218, and $\rho_c$ is 0.25 in appropriate units.

A classical theoretical foundation for the behavior of a gas at equilibrium temperature $T$ is the equation of state relating the thermodynamic variables pressure $p$, volume $V$ (or density $\rho$), and temperature $T$. Taking into account possible intermolecular forces and the finite size of molecules, van der Waals derived in 1873 the equation of state

$$\left(p + \frac{a}{V^2}\right)(V - b) = NkT, \tag{171}$$

where $a$ and $b$ are constants pertaining to the gas, $k$ is the universal Boltzmann constant, and $N$ is the Avogadro number. This equation does predict the existence of critical constants according to the following relations:

$$\frac{\partial p}{\partial V} = 0, \qquad \frac{\partial^2 p}{\partial V^2} = 0, \qquad
p_c = \frac{a}{27b^2}, \qquad V_c = 3b, \qquad T_c = \frac{8a}{27bNk}. \tag{172}$$

In terms of the reduced variables $\pi = p/p_c$, $\nu = V/V_c$, $t = T/T_c$, the van der Waals equation can be written in the so-called universal form applicable to any gas:

$$\left(\pi + \frac{3}{\nu^2}\right)\left(\nu - \frac{1}{3}\right) = \frac{8}{3}\,t. \tag{173}$$
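The critical constants of Eq. (172) and the reduced form can be checked numerically. The following is a minimal sketch, not a derivation: the values of $a$ and $b$ are arbitrary illustrative numbers, units are chosen so that $Nk = 1$, and the function names are hypothetical. It confirms by finite differences that both derivatives of $p$ vanish at $(V_c, T_c)$, and that an arbitrary state satisfying Eq. (171) also satisfies the reduced equation with $\pi = p/p_c$, $\nu = V/V_c$, $t = T/T_c$.

```python
def pressure(V, T, a, b, Nk=1.0):
    """van der Waals pressure, solved from (p + a/V^2)(V - b) = N k T."""
    return Nk * T / (V - b) - a / V**2

def critical_constants(a, b, Nk=1.0):
    """(p_c, V_c, T_c) as given in Eq. (172)."""
    return a / (27 * b**2), 3 * b, 8 * a / (27 * b * Nk)

def reduced_residual(V, T, a, b, Nk=1.0):
    """Left side minus right side of the reduced equation of state."""
    pc, Vc, Tc = critical_constants(a, b, Nk)
    pi_r = pressure(V, T, a, b, Nk) / pc
    nu, t = V / Vc, T / Tc
    return (pi_r + 3 / nu**2) * (nu - 1 / 3) - 8 * t / 3

a, b = 2.0, 0.5  # arbitrary illustrative gas constants (assumption)
pc, Vc, Tc = critical_constants(a, b)

# Central finite differences for dp/dV and d2p/dV2 on the critical isotherm.
h = 1e-4
p_mid = pressure(Vc, Tc, a, b)
p_hi = pressure(Vc + h, Tc, a, b)
p_lo = pressure(Vc - h, Tc, a, b)
dp_dV = (p_hi - p_lo) / (2 * h)
d2p_dV2 = (p_hi - 2 * p_mid + p_lo) / h**2

assert abs(p_mid - pc) < 1e-12
assert abs(dp_dV) < 1e-6 and abs(d2p_dV2) < 1e-4
assert abs(reduced_residual(2.0, 1.5, a, b)) < 1e-12
print("critical point and reduced form verified")
```

Changing $a$ and $b$ leaves the residual at zero, which reflects the fact that the reduced equation contains no gas-specific constants.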

This is known as the law of corresponding states. The existence of such a universal relation, though not in the exact form of the equation above, has been demonstrated by Guggenheim, who plotted the reduced density ($\rho/\rho_c$) as a function of the reduced temperature ($T/T_c$) for eight different fluids on which experimental measurements have been made. This has given rise to the suggestion that the order parameter $\rho_l - \rho_g$ satisfies a power law in the reduced temperature,

$$\rho_l - \rho_g \sim |\tau|^{\beta}, \qquad \tau = \frac{T}{T_c} - 1 = t - 1, \tag{174}$$

and $\beta$ has since come to be known as a critical exponent. The other critical exponents are $\gamma$, related to the isothermal compressibility,

$$\kappa_T = -\frac{1}{V}\left(\frac{\partial V}{\partial p}\right)_T = \frac{1}{\rho}\left(\frac{\partial \rho}{\partial p}\right)_T \sim |\tau|^{-\gamma}, \tag{175}$$

and $\alpha$, related to the specific heat at constant volume (with $U$ the internal energy),

$$C_V = \left(\frac{\partial U}{\partial T}\right)_V \sim |\tau|^{-\alpha}. \tag{176}$$

From thermodynamic reasoning it has been shown that the exponents are not all independent but are coupled by the relation

$$\alpha + 2\beta + \gamma = 2. \tag{177}$$

One of the achievements of the RG theory has been the derivation of these critical exponents.

The concept of a renormalization group, like the concept of renormalization itself, originated in quantum electrodynamics. Stueckelberg and Petermann introduced it in quantum field theoretic language and worked out its details, including infinitesimal transformations and their Lie algebra. However, this found little application and was almost ignored. The foundation for the currently used RG is the work of Gell-Mann and Low, also in the context of quantum electrodynamics. Recognizing this and making a systematic development of RG theory and its adaptability to critical phenomena analysis has been the indefatigable effort of Wilson.

The RG is a Lie group of transformations which are functions of certain variables and parameters that vary continuously over a given domain. The fundamental group postulate of multiplication here means that a transformation followed by a transformation is still another transformation with different values of the parameters. To illustrate this in one dimension, let $x_i$ denote a variable point on the $X$ axis and $s_j$ a parameter. The transformation with parameter $s_i$ takes $x_j$ to $x_k$ on the line:

$$f(-s_i, x_j)\,x_j = x_k. \tag{178}$$

A second transformation involving a parameter $s_j$ takes $x_k$ to $x_l$. A succession of these two transformations then moves $x_j$ to $x_l$:

$$\{f(-s_j, x_k)\,x_k\}\{f(-s_i, x_j)\,x_j\} = x_l. \tag{179}$$

However, there must also exist a transformation, say with parameter $s_k$, which takes $x_j$ directly to $x_l$:

$$f(-s_k, x_j)\,x_j = x_l. \tag{180}$$

In other words, we have the group multiplication

$$\{f(-s_j, x_k)\,x_k\}\{f(-s_i, x_j)\,x_j\} = f(-s_k, x_j)\,x_j = x_l. \tag{181}$$

We assume this can happen provided the parameters satisfy the relation

$$s_j s_i = s_k. \tag{182}$$

For a certain value of the parameters, say $s = 1$, the transformation must be an identity, and this means that

$$f(-1, x) = 1 \tag{183}$$

for any point $x$. It is generally agreed that the RG is really a semigroup, because the inverse transformation cannot always be uniquely defined. To be applicable to actual physical situations these transformations must be such that the transformed system describes the same physics as the original one. In statistical mechanics, for instance, the partition function will be the same in different systems. In quantum mechanics the original and transformed Hamiltonians relate to the same dynamical system.

The importance of the RG transformation is that it is iterative:

$$f(H) = H', \qquad f(H') = H'', \quad \text{etc.} \tag{184}$$

In one dimension the functional equation is like a recursion relation,

$$f(x) = x', \qquad f(x') = x'', \ \ldots, \tag{185}$$

and this implies that after very many iterations one can hope to approach a fixed point $x^*$,

$$f(x^*) = x^*. \tag{186}$$

As in the case of the roots of an algebraic equation, there can be several fixed points. Let us assume that near a fixed point $x^*$ we have

$$x = x^* + \delta x, \qquad x' = x^* + \delta x'. \tag{187}$$

In the vicinity of the fixed point a Taylor expansion gives

$$f(x^* + \delta x) = f(x^*) + \frac{\partial f}{\partial x^*}\,\delta x + \cdots = x^* + \delta x',$$

or

$$\delta x' = \frac{\partial f}{\partial x^*}\,\delta x + \text{higher powers of } \delta x. \tag{188}$$

We thus have, to a linear approximation in $\delta x$,

$$\delta x' = \frac{\partial f}{\partial x^*}\,\delta x. \tag{189}$$

In higher dimensions the variables can be regarded as components of a vector, and Eq. (185) now generalizes to

$$f(\mathbf{r}) = \mathbf{r}', \qquad f(\mathbf{r}') = \mathbf{r}'', \ \ldots, \qquad f(\mathbf{r}^*) = \mathbf{r}^*. \tag{190}$$

The linear approximation above now becomes

$$\delta \mathbf{r}' = M(\mathbf{r}^*)\,\delta \mathbf{r}, \tag{191}$$

where the matrix elements of $M$ are evaluated at the fixed point $\mathbf{r}^*$. We presume $M$ satisfies the eigenvalue equation

$$M \xi_i = \lambda_i \xi_i. \tag{192}$$

The behavior of $\mathbf{r}$ after several iterations is governed by the magnitude of the eigenvalue $\lambda$. If $|\lambda| < 1$, under iterations the higher powers of $\lambda$ will be progressively smaller, and Eq. (189) shows that $x$ will converge to (be attracted to) the fixed point $x^*$; for this reason this is said to be an "irrelevant" eigenvalue. On the other hand, when $|\lambda| > 1$ the point $x$ will drift farther and farther away from the fixed point. This is then a "relevant" eigenvalue. We will now illustrate this by a hypothetical example in two dimensions.

In component form let us assume the recursion equations

$$x' = \frac{2x}{1 + x^3}, \qquad y' = \frac{y^3}{1 + x^3}. \tag{193}$$

The fixed points $(x^*, y^*)$, obtained by solving for $x' = x$, $y' = y$, are $(0, 0)$, $(0, 1)$, $(1, \sqrt{2})$, $(1, -\sqrt{2})$. We will now set

$$x + \delta x' = \frac{2(x + \delta x)}{1 + (x + \delta x)^3}, \qquad y + \delta y' = \frac{(y + \delta y)^3}{1 + (x + \delta x)^3}. \tag{194}$$

In the linear approximation the matrix $M$ of Eq. (192) is given by

$$\begin{pmatrix} \delta x' \\ \delta y' \end{pmatrix} =
\begin{pmatrix}
\dfrac{2 - 4x^3}{(1 + x^3)^2} & 0 \\[2ex]
\dfrac{-3x^2 y^3}{(1 + x^3)^2} & \dfrac{3y^2}{1 + x^3}
\end{pmatrix}
\begin{pmatrix} \delta x \\ \delta y \end{pmatrix}. \tag{195}$$

It is easy to see that the eigenvalues of this matrix are

$$\lambda_1 = \frac{2 - 4x^3}{(1 + x^3)^2}, \qquad \lambda_2 = \frac{3y^2}{1 + x^3}. \tag{196}$$

At the fixed point $(0, 0)$, $\lambda_1 = 2$, $\lambda_2 = 0$, and we have one relevant eigenvalue and one marginal. At the fixed point $(0, 1)$, $\lambda_1 = 2$, $\lambda_2 = 3$ and both are relevant. At $(1, \sqrt{2})$, $\lambda_1 = -\tfrac{1}{2}$, $\lambda_2 = 3$, and so one is irrelevant and one relevant. At $(1, -\sqrt{2})$, $\lambda_1 = -\tfrac{1}{2}$, $\lambda_2 = 3$, and this is the same case as the fixed point $(1, \sqrt{2})$. The marginal case can be decided only when one calculates beyond the linear approximation. As mentioned earlier, relevant parameters are those that tend to make the system unstable to small disturbances.

The RG transformations in practice are designed to reduce the degrees of freedom of the system from, say, $N$ to $N'$, and this involves a scale change of the vectors $\mathbf{r}$ in coordinate space by the factor $b$,

$$b^d = \frac{N}{N'}, \qquad \mathbf{r}' = \frac{\mathbf{r}}{b}, \tag{197}$$

where $d$ relates to its dimensionality. This makes the eigenvalues $\lambda_i$ functions of the scale factor $b$, and the iterative nature of the transformation makes

$$\lambda_i = b^{\eta_i}, \tag{198}$$

where the $\eta_i$ are independent of the scale factor. The use of this in thermodynamic functions like the free energy eventually yields the critical exponents. A remarkable similarity in the values of these exponents in seemingly different phase transitions led to the hypothesis of "universality," which states that all phase-transition problems belong to a few classes determined by certain parameters such as dimensionality and the order. This is confirmed by RG theory, according to which these universality classes are determined by the different ranges of attraction of the fixed point.

ACKNOWLEDGMENT

We would like to dedicate this to the late Prof. Mark A. Samuel, who was a co-author on the first two editions of this article.

SEE ALSO THE FOLLOWING ARTICLES

ATOMIC PHYSICS • FIELD THEORY AND THE STANDARD MODEL • GROUP THEORY • NUCLEAR PHYSICS • QUANTUM CHROMODYNAMICS • QUANTUM MECHANICS • SET THEORY • PARTICLE PHYSICS, ELEMENTARY • SUPERSTRING THEORY

BIBLIOGRAPHY

Arnison, G., et al. (1983–1984). “Experimental observation of isolated large transverse energy electrons with associated missing energy at $\sqrt{s} = 540$ GeV,” Phys. Lett. 122B, 103 (1983). “Experimental observation of lepton pairs of invariant mass around 95 GeV/$c^2$ at the CERN

collider,” Phys. Lett. 126B, 398 (1983). “Further evidence for charged intermediate vector bosons at the SPS collider,” Phys. Lett. 129B, 273 (1984). “Observation of the muonic decay of the charged intermediate vector boson,” Phys. Lett. 134B, 469 (1984). “Observation of muonic $Z^0$ decay at the $p\bar{p}$ collider,” Phys. Lett. 147B, 241 (1984).
Campbell, J. E. (1966). “Introductory Treatise on Lie’s Theory of Finite Continuous Transformation Groups,” Chelsea, New York.
Cheng, T. P., and Li, L. F. (1984). “Gauge Theory of Elementary Particle Physics,” Oxford University (Clarendon) Press, London/New York.
Cotton, F. A. (1970). “Chemical Applications of Group Theory,” Wiley (Interscience), New York.
Domb, C. (1996). “The Critical Point,” Taylor & Francis, London (all other references cited in the text can be found in this book).
Dragt, A. J., et al. (1988). “Annual Review of Nuclear & Particle Science,” p. 455, Annual Reviews, Palo Alto, CA.
Englert, F., and Brout, R. (1964). “Broken symmetry and the mass of gauge vector mesons,” Phys. Rev. Lett. 13, 321.
Fock, V. (1927). “Über die invariante Form der Wellen- und der Bewegungsgleichungen für einen geladenen Massenpunkt,” Z. Physik 39, 226.
Gasiorowicz, S. (1966). “Elementary Particle Physics,” Wiley, New York.
Gell-Mann, M., and Ne’eman, Y. (1964). “The Eight-Fold Way,” Benjamin, New York.
Georgi, H. (1982). “Lie Algebras in Particle Physics,” Frontiers in Physics, Benjamin/Cummings, New York.
Georgi, H., and Glashow, S. L. (1974). “Unity of all elementary particle forces,” Phys. Rev. Lett. 32, 438.
Glashow, S. L. (1961). “Partial-symmetries of weak interactions,” Nuclear Phys. 22, 579.
Gross, D. J., and Wilczek, F. (1973). “Ultraviolet behavior of non-Abelian gauge theories,” Phys. Rev. Lett. 30, 1343.
Guralnik, G. S., Hagen, C. R., and Kibble, T. W. B. (1964). “Global conservation laws and massless particles,” Phys. Rev. Lett. 13, 585.
Halzen, F., and Martin, A. D. (1984). “Quarks and Leptons,” Wiley, New York.
Hamermesh, M. (1959). “Group Theory,” Addison-Wesley, Reading, MA.
Herzberg, G. (1959). “Infrared and Raman Spectra,” Van Nostrand, Princeton, NJ.
Herzberger, M. (1958). “Modern Geometrical Optics,” Interscience, New York.
Higgs, P. W. (1964). “Broken symmetries and the masses of gauge bosons,” Phys. Rev. Lett. 13, 508.
Kibble, T. W. B. (1967). “Symmetry breaking in non-Abelian gauge theories,” Phys. Rev. 155, 1554.
Kokkedee, J. J. J. (1969). “The Quark Model,” Benjamin, New York.
Lichtenberg, D. B. (1970). “Unitary Symmetry and Elementary Particles,” Academic Press, New York.
Lie, S. (1893). “Theorie der Transformationsgruppen,” F. Engel, Leipzig.
Lipkin, H. J. (1965). “Lie Groups for Pedestrians,” North-Holland, Amsterdam.
Luneburg, R. K. (1964). “Mathematical Theory of Optics,” University of California Press, Los Angeles.
Miller, A. (1994). “Early Quantum Electrodynamics,” Cambridge University Press, Cambridge, U.K.
Pati, J. C., and Salam, A. (1973). “Is baryon number conserved?” Phys. Rev. Lett. 31, 661.
Polchinski, J. (1998). “String Theory,” Vols. I and II, Cambridge University Press, Cambridge, U.K.
Politzer, H. D. (1973). “Reliable perturbative results for strong interactions,” Phys. Rev. Lett. 30, 1346.
Salam, A. (1968). “Elementary Particle Physics,” p. 367, Almquist and Wiksells, Stockholm, Sweden.
Schiff, L. I. (1968). “Quantum Mechanics,” McGraw-Hill, New York.
Schwinger, J. (ed.). (1958). “Quantum Electrodynamics,” Dover, New York. The original papers can be found in the books of Miller and Schwinger.
Segré, E. (1977). “Nuclei and Particles,” Benjamin, New York.
Slater, J. C. (1972). “Symmetry and Energy Bands in Crystals,” Dover, New York.
Stavroudis, O. N. (1972). “The Optics of Rays, Wavefronts and Caustics,” Academic Press, New York.
Swamy, N. V. V. J., and Samuel, M. A. (1979). “Group Theory Made Easy for Scientists & Engineers,” Wiley-Interscience, New York.
’t Hooft, G. (1992). “Conference on Lagrangian Field Theory,” Marseille.
Tinkham, M. (1975). “Group Theory and Quantum Mechanics,” McGraw-Hill, New York.
Volkov, D. V., and Akulov, V. P. (1972). “Universal neutrino interaction,” JETP Lett. 16, 438.
Weinberg, S. (1967). “A model of leptons,” Phys. Rev. Lett. 19, 1264.
Wess, J., and Zumino, B. (1974). “Supergauge transformations in four dimensions,” Nuclear Phys. B70, 39.
Weyl, H. (1929). “Elektron und Gravitation,” Z. Physik 56, 330.
Wigner, E. P. (1959). “Group Theory and Its Application to the Quantum Mechanics of Atomic Spectra,” Academic Press, New York.
Wilson, K. G. (1971). “Renormalization group and strong interactions,” Phys. Rev. D3, 1818.
Yang, C. N., and Mills, R. L. (1954). “Conservation of isotopic spin and isotopic gauge invariance,” Phys. Rev. 96, 191.
Yeomans, J. M. (1992). “Statistical Mechanics of Phase Transitions,” Oxford University Press, New York.
P1: GLQ/GJP P2: FJU Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN007-344 June 30, 2001 18:2

Integral Equations
Ram P. Kanwal
Pennsylvania State University

I. Definitions, Classification, and Notation

II. The Method of Successive Approximations
III. The Fredholm Alternative
IV. The Fredholm Operator
V. Hermitian Kernels and the Hilbert–Schmidt
Theory
VI. Singular Integral Equations on the Real Line
VII. The Cauchy Kernel and the Riemann–Hilbert
Problem
VIII. Wiener–Hopf Integral Equation
IX. Nonlinear Integral Equations
X. A Taylor Expansion Technique

GLOSSARY

Cauchy representation Representation of a function $f(z)$:
$$F(z) = \frac{1}{2\pi i}\int_C \frac{f(\zeta)}{\zeta - z}\, d\zeta,$$
where $z$ and $\zeta$ are points in the complex plane and $C$ is a contour in that plane.
Compact operator Operator that transforms any bounded set in a Hilbert space onto a precompact set.
Eigenvalue Complex numbers $\lambda_n$ of the integral equation $g(x) = \lambda \int_a^b K(x, y)\, g(y)\, dy$.
Eigenfunctions Nonzero solutions $g_n(x)$ of the integral equation $g(x) = \lambda \int_a^b K(x, y)\, g(y)\, dy$.
Fredholm alternative Alternative that specifies the conditions under which a linear integral equation can have a unique solution.
Green’s function Kernel obtained when a differential operator is inverted into an integral operator.
Hermitian kernel Kernel $K(x, y)$ if $K(x, y) = \overline{K(y, x)}$, where the bar indicates complex conjugate.
Hilbert–Schmidt theory Theory pertaining to the series expansion of a function $f(x)$ which can be represented in the form $\int_a^b K(x, y)\, h(y)\, dy$.
Kernel Function $K(x, y)$ occurring under the integral sign of the integral equation $\phi(x) g(x) = f(x) + \lambda \int_a^b K(x, y)\, F\{y, g(y)\}\, dy$.
Neumann series Series obtained by solving the integral equation by successive approximations.
Riemann–Hilbert problem Conversion of a singular integral equation with Cauchy kernel into an algebraic


equation in terms of the boundary values of the function.
Separable kernel Kernel $K(x, y) = \sum_{i=1}^{n} a_i(x)\, b_i(y)$; also called degenerate.
Singular integral equation Equation in which either the kernel in the integral equation is singular or one or both of the limits of integration are infinite.

INTEGRAL EQUATIONS are equations in which the unknown function appears under the integral sign. They arise in the quest for the integral representation formulas for the solution of a differential operator so as to include the boundary and initial conditions. They also arise naturally in describing phenomena by models which require summation over space and time. Among the integral equations which have received the most attention are the Fredholm- and Volterra-type equations. In the study of singular integral equations, the prominent ones are the Abel, Cauchy, and Carleman types.

I. DEFINITIONS, CLASSIFICATION, AND NOTATION

An integral equation is a functional equation in which the unknown function $g(x)$ appears under the integral sign. A general example of an integral equation is

$$\phi(x)\, g(x) = f(x) + \lambda \int_a^b F\{x, y; g(y)\}\, dy, \qquad a \le x \le b, \tag{1}$$

where $\phi(x)$, $f(x)$, and $F\{x, y; g(y)\}$ are known functions and $g(x)$ is to be evaluated. The quantity $\lambda$ is a complex parameter. When $F\{x, y; g(y)\} = K(x, y)\, g(y)$, Eq. (1) becomes a linear integral equation:

$$\phi(x)\, g(x) = f(x) + \lambda \int_a^b K(x, y)\, g(y)\, dy, \qquad a \le x \le b, \tag{2}$$

where $K(x, y)$ is called a kernel. Four special cases of Eq. (2) are extensively studied. In the Fredholm integral equation of the first kind $\phi(x) = 0$, and in the Fredholm equation of the second kind $\phi(x) = 1$; in both cases $a$ and $b$ are constants. The Volterra integral equations of the first and second kinds are like the corresponding Fredholm integral equations except that now $b = x$. If $f(x) = 0$ in either case, the equation is called homogeneous.

A nonlinear integral equation may occur in the form (1), or the function $F\{x, y; g(y)\}$ may have the form $K(x, y)\, F(y, g(y))$, where $F(y, g(y))$ is nonlinear in $g(y)$.

When one or both limits of integration become infinite, or when the kernel becomes infinite at one or more points within the range of integration, the integral equation is called singular.

We shall mainly deal with functions which are either continuous or integrable or square integrable. A function $g(x)$ is square integrable if $\int_a^b |g(x)|^2\, dx < \infty$, and is called an $\mathcal{L}_2$ function. The kernel $K(x, y)$ is an $\mathcal{L}_2$ function if

$$\int_a^b |K(x, y)|^2\, dx < \infty, \qquad
\int_a^b |K(x, y)|^2\, dy < \infty, \qquad
\int_a^b\!\!\int_a^b |K(x, y)|^2\, dx\, dy < \infty. \tag{3}$$

We shall use the inner product (or scalar product) notation

$$\langle \phi, \psi \rangle = \int_a^b \phi(x)\, \bar{\psi}(x)\, dx, \tag{4}$$

where the bar indicates complex conjugate. The functions $\phi$ and $\psi$ are orthogonal if $\langle \phi, \psi \rangle = 0$. The norm of the function $\phi$ is $\|\phi\| = \langle \phi, \phi \rangle^{1/2}$. If $\|\phi\| = 1$, then $\phi$ is called normalized. In terms of this norm the famous Cauchy–Schwarz inequality can be written as

$$|\langle \phi, \psi \rangle| \le \|\phi\|\, \|\psi\|, \tag{5}$$

while the Minkowski inequality is

$$\|\phi + \psi\| \le \|\phi\| + \|\psi\|. \tag{6}$$

NOTATION. We shall sometimes write the right-hand side of Eq. (2) as $f + \lambda K g$ and call $K$ the Fredholm operator. Furthermore, for Fredholm integral equations it will be assumed that the range of integration is $a$ to $b$ unless the contrary is stated. The limits $a$ and $b$ will be omitted.

II. THE METHOD OF SUCCESSIVE APPROXIMATIONS

Our aim is to solve the inhomogeneous Fredholm integral equation

$$g(x) = f(x) + \lambda \int K(x, y)\, g(y)\, dy, \tag{7}$$

where we assume that $f(x)$ and $K(x, y)$ are in the space $\mathcal{L}_2[a, b]$, by Picard’s method of successive approximations. The method is based on choosing the first approximation as $g_0(x) = f(x)$. This is substituted into Eq. (7) under the integral sign to obtain the second approximation, and the process is then repeated. This results in the sequence

$$\begin{aligned}
g_0(x) &= f(x) \\
g_1(x) &= f(x) + \lambda \int K(x, y)g_0(y)\,dy \\
&\;\;\vdots \\
g_m(x) &= f(x) + \lambda \int K(x, y)g_{m-1}(y)\,dy.
\end{aligned} \tag{8}$$

The analysis is facilitated when we utilize the iterated kernels defined as

$$\begin{aligned}
K_1(x, y) &= K(x, y) \\
K_2(x, y) &= \int K(x, s)K_1(s, y)\,ds \\
&\;\;\vdots \\
K_m(x, y) &= \int K(x, s)K_{m-1}(s, y)\,ds.
\end{aligned} \tag{9}$$

It can be proved that $K_{m+n}(x, y) = \int K_m(x, s)K_n(s, y)\,ds$. Thereby, we can express the mth approximation in (8) as

$$g_m(x) = f(x) + \lambda \sum_{n=1}^{m} \lambda^{n-1} \int K_n(x, y) f(y)\,dy. \tag{10}$$

When we let $m \to \infty$, we obtain formally the so-called Neumann series

$$g(x) = \lim_{m \to \infty} g_m(x) = f(x) + \sum_{m=1}^{\infty} \lambda^m \int K_m(x, y) f(y)\,dy. \tag{11}$$

In order to examine the convergence of this series, we apply the Cauchy–Schwarz inequality (5) to the general term $\int K_m(x, y) f(y)\,dy$ and get

$$\left| \int K_m(x, y) f(y)\,dy \right|^2 \le \int |K_m(x, y)|^2\,dy \int |f(y)|^2\,dy. \tag{12}$$

Let us denote the norm $\|f\|$ as D and the upper bound of the integral $\int |K_m(x, y)|^2\,dy$ as $C_m^2$, so that relation (12) becomes

$$\left| \int K_m(x, y) f(y)\,dy \right|^2 \le C_m^2 D^2. \tag{13}$$

We can connect the estimate $C_m^2$ with $C_{m-1}^2$ by applying the Cauchy–Schwarz inequality to relation (9) and then integrating with respect to y so that $\int |K_m(x, y)|^2\,dy \le B^2 C_{m-1}^2$, where $B^2 = \iint |K(x, y)|^2\,dx\,dy$. Continuing this process we get the relation $C_m^2 \le B^{2m-2} C_1^2$. Substituting it in (13) we arrive at the inequality

$$\left| \int K_m(x, y) f(y)\,dy \right|^2 \le C_1^2 D^2 B^{2m-2}. \tag{14}$$

This means that the infinite series (11) converges faster than the geometric series with common ratio $|\lambda|B$. Thus, if $|\lambda|B < 1$, the Neumann series converges uniformly and absolutely. Fortunately, the condition $|\lambda|B < 1$ also assures us that solution (11) is unique as can be easily proved.

In view of the uniform convergence of series (11) we can change the order of integration and summation in it and write it as

$$g(x) = f(x) + \lambda \int \Gamma(x, y; \lambda) f(y)\,dy, \tag{15}$$

where $\Gamma(x, y; \lambda) = \sum_{m=1}^{\infty} \lambda^{m-1} K_m(x, y)$ is called the resolvent kernel. This series is also convergent at least for $|\lambda|B < 1$. Indeed, the resolvent kernel is an analytic function of λ, regular at least inside the circle $|\lambda|B < 1$. From the uniqueness of the solution it can be proved that the resolvent kernel is unique.

A few remarks are in order:

1. We can start with any other suitable function for the first approximation $g_0(x)$.
2. The Neumann series, in general, cannot be summed in closed form.
3. The solution of Eq. (7) may exist even if $|\lambda|B > 1$.

The same iterative scheme is applicable to the Volterra integral equation of the second kind:

$$g(x) = f(x) + \lambda \int_a^x K(x, y)g(y)\,dy. \tag{16}$$

In this case the formulas corresponding to (11) and (15) are

$$g(x) = f(x) + \sum_{m=1}^{\infty} \lambda^m \int_a^x K_m(x, y) f(y)\,dy \tag{17}$$

and

$$g(x) = f(x) + \lambda \int_a^x \Gamma(x, y; \lambda) f(y)\,dy, \tag{18}$$

where the iterated kernel $K_m(x, y)$ satisfies the recurrence formula $K_m(x, y) = \int_y^x K(x, s)K_{m-1}(s, y)\,ds$, with $K_1(x, y) = K(x, y)$ as before. The resolvent kernel is given by the same formula as given previously and is an entire function of λ for any given $(x, y)$.
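The successive approximations of Section II can be tried numerically. The following is only a minimal sketch under stated assumptions: the integral is replaced by Gauss–Legendre quadrature (a discretization choice that is not part of the text), the helper name `picard_iterate` is hypothetical, and the test problem $K(x, y) = xy$, $f(x) = x$, $\lambda = 1$ on $[0, 1]$ is chosen for illustration because $B = 1/3$, so $|\lambda|B < 1$, and the exact solution $g(x) = 3x/2$ is known.

```python
import numpy as np

def picard_iterate(kernel, f, lam, a, b, n_nodes=64, n_iter=50):
    """Hypothetical helper: iterate g_m = f + lam * int K g_{m-1} dy on a quadrature grid."""
    # Gauss-Legendre nodes and weights, mapped from [-1, 1] to [a, b]
    t, w = np.polynomial.legendre.leggauss(n_nodes)
    x = 0.5 * (b - a) * t + 0.5 * (b + a)
    w = 0.5 * (b - a) * w
    K = kernel(x[:, None], x[None, :])   # K[i, j] = K(x_i, y_j)
    fx = f(x)
    g = fx.copy()                        # first approximation g_0 = f
    for _ in range(n_iter):
        g = fx + lam * (K @ (w * g))     # one step of the sequence (8)
    return x, g

# Illustrative problem: K(x, y) = x*y, f(x) = x, lam = 1 on [0, 1].
x, g = picard_iterate(lambda x, y: x * y, lambda x: x, lam=1.0, a=0.0, b=1.0)
max_err = float(np.max(np.abs(g - 1.5 * x)))  # distance from the exact solution 3x/2
print(max_err)
```

Each pass through the loop adds one more term of the Neumann series (11), so the error shrinks roughly like a geometric series with ratio $|\lambda|B$; for $|\lambda|B \ge 1$ the same loop need not converge even when a solution exists.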

III. THE FREDHOLM ALTERNATIVE

Let us consider the inhomogeneous Fredholm integral equation of the second kind

$$g(x) = f(x) + \lambda \int K(x, y)g(y)\,dy, \tag{19}$$

when the kernel is degenerate (separable) (i.e., $K(x, y) = \sum_{k=1}^{n} a_k(x)b_k(y)$, where $a_k(x)$ and $b_k(y)$, $k = 1, \ldots, n$, are linearly independent functions). Thus, Eq. (19) becomes

$$g(x) = f(x) + \lambda \sum_{k=1}^{n} a_k(x) \int b_k(y)g(y)\,dy, \tag{20}$$

where we have exchanged summation with integration. It emerges that the technique of solving Eq. (20) depends on the choice of the complex parameter λ and on the constants $c_k$ defined as

$$c_k = \int b_k(y)g(y)\,dy, \tag{21}$$

which are unknown because $g(y)$ is so. Thereby, Eq. (20) takes the algebraic form

$$g(x) = f(x) + \lambda \sum_{k=1}^{n} c_k a_k(x). \tag{22}$$

Next, we multiply both sides of (22) by $b_i(x)$ and integrate from a to b so that we have a set of linear algebraic equations

$$c_i = f_i + \lambda \sum_{k=1}^{n} a_{ik} c_k, \qquad i = 1, 2, \ldots, n, \tag{23}$$

where

$$f_i = \int b_i(x) f(x)\,dx, \qquad a_{ik} = \int b_i(x)a_k(x)\,dx. \tag{24}$$

Let us write the algebraic system (23) in the matrix form

$$(I - \lambda A)c = f, \tag{25}$$

where I is the identity matrix of order n, A is the matrix $a_{ik}$, while c and f are column matrices.

The determinant $D(\lambda)$ of the algebraic system (23) is

$$D(\lambda) = \begin{vmatrix}
1 - \lambda a_{11} & -\lambda a_{12} & \cdots & -\lambda a_{1n} \\
-\lambda a_{21} & 1 - \lambda a_{22} & \cdots & -\lambda a_{2n} \\
\vdots & \vdots & & \vdots \\
-\lambda a_{n1} & -\lambda a_{n2} & \cdots & 1 - \lambda a_{nn}
\end{vmatrix}, \tag{26}$$

which is a polynomial of degree at most n in λ. Note that $D(\lambda)$ is not identically zero because when $\lambda = 0$, it reduces to unity. Accordingly, for all values of λ for which $D(\lambda) \ne 0$, the algebraic system (25) and thereby integral equation (19) has a unique solution. On the other hand, for all values of λ for which $D(\lambda) = 0$, algebraic system (25), and with it integral equation (19), is either insoluble or has an infinite number of solutions. We discuss both these cases.

The Case $D(\lambda) \ne 0$. In this case the algebraic system (25) has only one solution given by Cramer's rule

$$c_k = \frac{D_{1k} f_1 + \cdots + D_{hk} f_h + \cdots + D_{nk} f_n}{D(\lambda)}, \qquad k = 1, 2, \ldots, n, \tag{27}$$

where $D_{hk}$ denotes the cofactor of the (h, k)th element of the determinant (26). When we substitute (27) in (22) we obtain the unique solution

$$\begin{aligned}
g(x) &= f(x) + \lambda\, \frac{\sum_{j=1}^{n} \sum_{k=1}^{n} D_{jk} f_j a_k(x)}{D(\lambda)} \\
&= f(x) + \frac{\lambda}{D(\lambda)} \int \left[ \sum_{j=1}^{n} \sum_{k=1}^{n} D_{jk} b_j(y) a_k(x) \right] f(y)\,dy,
\end{aligned} \tag{28}$$

where we have used relation (24). This expression can be put in an elegant form if we introduce the determinant

$$D(x, y; \lambda) = \begin{vmatrix}
0 & -a_1(x) & -a_2(x) & \cdots & -a_n(x) \\
b_1(y) & 1 - \lambda a_{11} & -\lambda a_{12} & \cdots & -\lambda a_{1n} \\
b_2(y) & -\lambda a_{21} & 1 - \lambda a_{22} & \cdots & -\lambda a_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
b_n(y) & -\lambda a_{n1} & -\lambda a_{n2} & \cdots & 1 - \lambda a_{nn}
\end{vmatrix}, \tag{29}$$

which is called the Fredholm minor. Then Eq. (28) takes the form

$$g(x) = f(x) + \lambda \int \Gamma(x, y; \lambda) f(y)\,dy, \tag{30a}$$

where the resolvent kernel Γ is the ratio of two determinants, that is,

$$\Gamma(x, y; \lambda) = \frac{D(x, y; \lambda)}{D(\lambda)} = \frac{1}{D(\lambda)} \sum_{j=1}^{n} \sum_{k=1}^{n} D_{jk} b_j(y) a_k(x). \tag{30b}$$

It is clear from the above analysis that if we start with the homogeneous integral equation,

$$g(x) = \lambda \int K(x, y)g(y)\,dy, \tag{31}$$

we shall obtain the homogeneous algebraic system

$$(I - \lambda A)c = 0. \tag{32}$$

When $D(\lambda) \ne 0$, this algebraic system and the homogeneous integral equation (31) have only the trivial solutions.

The Case $D(\lambda) = 0$. In this case the algebraic system (32) and hence the homogeneous integral equation (31) may have either no solution or infinitely many solutions. To examine these possibilities it is necessary to discuss the subject of eigenvalues and eigenfunctions of the homogeneous problem (31). Strictly speaking we should write it as $\int K(x, y)g(y)\,dy = \omega g(x)$ for ω to be an eigenvalue, but in the theory of integral equations it has become customary to call the parameter $\lambda \ne 0$, for which the homogeneous equation (31) has a nontrivial solution, its eigenvalue. The corresponding solution $g(x)$ is called the eigenfunction of the operator K. From the above analysis it follows that the eigenvalues of (31) are the solutions of the polynomial $|I - \lambda A| = 0$. There may exist more than one eigenfunction corresponding to a specific eigenvalue. Let us denote the number r of such eigenfunctions as $g_{i1}, g_{i2}, \ldots, g_{ir}$ corresponding to the eigenvalue $\lambda_i$. The number r is called the index of the eigenvalue $\lambda_i$ (it is also called the geometric multiplicity of $\lambda_i$ while the algebraic multiplicity m means that $D(\lambda) = 0$ has m equal roots). We know from linear algebra that if p is the rank of the determinant $D(\lambda_i) = |I - \lambda_i A|$, then $r = n - p$. If $r = 1$, $\lambda_i$ is called a simple eigenvalue. Let us assume that the eigenfunctions $g_{i1}, \ldots, g_{ir}$ have been normalized (i.e., $\|g_{ij}\| = 1$ for $j = 1, \ldots, r$). Then to each eigenvalue $\lambda_i$ of index $r = n - p$, there corresponds a solution $g_i(x)$ of the homogeneous integral equation (31) of the form $g_i(x) = \sum_{k=1}^{r} \alpha_k g_{ik}(x)$, where $\alpha_k$ are arbitrary constants.

For studying the case when the inhomogeneous integral equation (19) has a solution even when $D(\lambda) = 0$, we need the integral equation

$$\psi(x) = f(x) + \lambda \int \bar{K}(y, x)\psi(y)\,dy, \tag{33}$$

which is called the transpose (or adjoint) of Eq. (19) which is then the transpose of Eq. (33). For the separable kernel $K(x, y)$ as considered in this section, the transpose kernel is $\bar{K}(y, x) = \sum_{k=1}^{n} \bar{a}_k(y)\bar{b}_k(x)$. When we follow the same steps that we followed for integral equation (19) we find that the transposed integral equation (33) leads to the algebraic system $(I - \lambda \bar{A}^T)c = f$, where $A^T$ stands for the transpose of A while $c_k$ and $f_k$ are now defined as

$$c_k = \int \bar{a}_k(y)\psi(y)\,dy \qquad \text{and} \qquad f_k = \int \bar{a}_k(y) f(y)\,dy, \tag{34}$$

respectively. Clearly, the determinant of this algebraic system is $D(\bar{\lambda})$. Accordingly, the transposed integral equation (33) also possesses a unique solution whenever (19) does. Also, the eigenvalues of the homogeneous part of Eq. (33), that is,

$$\psi(x) = \lambda \int \bar{K}(y, x)\psi(y)\,dy, \tag{35}$$

are the complex conjugates of those for (31). The eigenvectors of the homogeneous system $(I - \lambda \bar{A}^T)c = 0$ are, in general, different from the corresponding eigenvectors of system (32). The same applies to the eigenfunctions of the transposed integral equation (35). Because the geometric multiplicity r of $\lambda_i$ for (31) is the same as that of $\bar{\lambda}_i$ for (35), the linearly independent eigenfunctions of the transposed equation (35) corresponding to $\bar{\lambda}_i$ are also r in number, say, $\psi_{i1}, \psi_{i2}, \ldots, \psi_{ir}$, which we assume to be normalized. Accordingly, any solution $\psi_i(x)$ of (35) corresponding to the eigenvalue $\lambda_i$ is of the form $\psi_i(x) = \sum_{k=1}^{r} \beta_k \psi_{ik}(x)$, where $\beta_k$ are arbitrary constants. Incidentally, it can be easily proved that the eigenfunctions $g(x)$ and $\psi(x)$ corresponding to eigenvalues $\lambda_1$ and $\bar{\lambda}_2$ ($\lambda_1 \ne \lambda_2$) of the homogeneous integral equation (31) and its transpose (35), respectively, are orthogonal.

This analysis is sufficient for us to prove that the necessary and sufficient condition for Eq. (19) to have a solution for $\lambda = \lambda_i$, a root of $D(\lambda) = 0$, is that $f(x)$ be orthogonal to the r eigenfunctions $\psi_{ij}$, $j = 1, \ldots, r$, of the transposed equation (35). The necessary part follows from the fact that if Eq. (19) for $\lambda = \lambda_i$ admits a certain solution $g(x)$, then

$$\begin{aligned}
\int f(x)\bar{\psi}_{ij}(x)\,dx &= \int g(x)\bar{\psi}_{ij}(x)\,dx - \lambda_i \int \bar{\psi}_{ij}(x)\,dx \int K(x, y)g(y)\,dy \\
&= \int g(x)\bar{\psi}_{ij}(x)\,dx - \lambda_i \int g(y)\,dy \int K(x, y)\bar{\psi}_{ij}(x)\,dx \\
&= 0
\end{aligned}$$

because $\bar{\lambda}_i$ and $\psi_{ij}(x)$ are an eigenvalue and a corresponding eigenfunction of (35). For the proof of the sufficiency, we appeal to the corresponding condition of orthogonality for the linear algebraic system which assures us that the inhomogeneous system (25) reduces to only $n - r$ independent equations (i.e., the rank of the matrix $(I - \lambda A)$ is exactly $p = n - r$ and therefore the system $(I - \lambda A)c = f$ is soluble). Substituting this value of c in (22) we have the required solution of (19).
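For the regular case $D(\lambda) \ne 0$, the reduction of Eq. (19) with a degenerate kernel to the algebraic system (25) can be sketched in a few lines. This is an illustrative sketch only: the helper name `solve_degenerate`, the use of Gauss–Legendre quadrature for the integrals in (24), and the two-term test kernel $K(x, y) = xy + x^2y^2$ on $[0, 1]$ with $f(x) = x$ and $\lambda = 1$ are all assumptions made here, not part of the text.

```python
import numpy as np

def gauss_nodes(a, b, n_nodes=40):
    """Gauss-Legendre nodes and weights on [a, b] (assumed quadrature rule)."""
    t, w = np.polynomial.legendre.leggauss(n_nodes)
    return 0.5 * (b - a) * t + 0.5 * (b + a), 0.5 * (b - a) * w

def solve_degenerate(a_funcs, b_funcs, f, lam, a, b):
    """Hypothetical helper: solve g = f + lam * int sum_k a_k(x) b_k(y) g(y) dy."""
    x, w = gauss_nodes(a, b)
    a_vals = np.array([ak(x) for ak in a_funcs])   # a_k at the nodes
    b_vals = np.array([bi(x) for bi in b_funcs])   # b_i at the nodes
    A = (b_vals * w) @ a_vals.T                    # a_ik = int b_i a_k dx, Eq. (24)
    fvec = (b_vals * w) @ f(x)                     # f_i  = int b_i f dx,   Eq. (24)
    M = np.eye(len(a_funcs)) - lam * A
    c = np.linalg.solve(M, fvec)                   # (I - lam A) c = f,     Eq. (25)

    def g(xs):
        xs = np.asarray(xs, dtype=float)
        return f(xs) + lam * sum(ck * ak(xs) for ck, ak in zip(c, a_funcs))  # Eq. (22)

    return g, np.linalg.det(M)                     # det(M) plays the role of D(lam)

a_funcs = [lambda x: x, lambda x: x**2]
b_funcs = [lambda y: y, lambda y: y**2]
g, D = solve_degenerate(a_funcs, b_funcs, lambda x: x, lam=1.0, a=0.0, b=1.0)

# Check: substitute g back into Eq. (19) at a few points.
y, w = gauss_nodes(0.0, 1.0)
xs = np.linspace(0.0, 1.0, 5)
K = xs[:, None] * y[None, :] + xs[:, None]**2 * y[None, :]**2
residual = g(xs) - (xs + 1.0 * (K @ (w * g(y))))
print(D, np.max(np.abs(residual)))
```

When $D(\lambda) = 0$ the call to `np.linalg.solve` fails or becomes ill-conditioned, which is the numerical counterpart of the second case discussed above; there a solution exists only if f satisfies the orthogonality condition.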

This analysis is true for a general integrable kernel. Fredholm gave three theorems in this connection and they bear his name. These theorems are of great importance in general discussion but are of little use in constructing closed form solutions or obtaining solutions numerically. Fredholm's first theorem gives the same formula as (30a) where the resolvent kernel is

$$\Gamma(x, y; \lambda) = D(x, y; \lambda)/D(\lambda), \qquad D(\lambda) \ne 0, \tag{36}$$

which is a meromorphic function of the complex variable λ, being the ratio of two entire functions $D(x, y; \lambda)$ and $D(\lambda)$ which are given by suitable Fredholm series of form similar to (30b). The other two theorems discuss the case $D(\lambda) = 0$ for a general kernel. The discussion given above and these three theorems add up to the following important result.

A. The Fredholm Alternative Theorem

For a fixed λ, either the integral equation (19) possesses one and only one solution $g(x)$ for integrable functions $f(x)$ and $K(x, y)$ (in particular the solution $g(x) = 0$ for the homogeneous equation (31)), or the homogeneous equation (31) possesses a finite number r of linearly independent solutions $g_{ij}$, $j = 1, \ldots, r$, with respect to the eigenvalue $\lambda = \lambda_i$. In the first case, the transposed inhomogeneous equation (33) also possesses a unique solution. In the second case, the transposed homogeneous equation (35) also has r linearly independent solutions $\psi_{ij}(x)$, $j = 1, \ldots, r$, corresponding to the eigenvalues $\lambda_i$; and the inhomogeneous integral equation (19) has a solution if and only if the given function $f(x)$ is orthogonal to all the eigenfunctions $\psi_{ij}(x)$. In this case the general solution of the integral equation (19) is determined only up to an additive linear combination $\sum_{j=1}^{r} \alpha_j g_{ij}(x)$.

IV. THE FREDHOLM OPERATOR

We have observed in Section I that the Fredholm operator $Kg(x) = \int K(x, y)g(y)\,dy$ is linear (i.e., $K(\alpha g_1 + \beta g_2) = \alpha K g_1 + \beta K g_2$, where α and β are arbitrary complex numbers). In this section we study some general results for this operator. For this purpose we consider a linear space of an infinite dimension with inner product defined by (4) (i.e., $\langle f, g \rangle = \int f(x)\bar{g}(x)\,dx$). This inner product is a complex number and satisfies the following axioms:

(a) $\langle f, f \rangle = 0$ iff $f = 0$.
(b) $\langle \alpha f_1 + \beta f_2, g \rangle = \alpha \langle f_1, g \rangle + \beta \langle f_2, g \rangle$.
(c) $\langle f, g \rangle = \overline{\langle g, f \rangle}$.

The norm $\|f\|$ defined by (4) generates the natural metric $d(f, g) = \|f - g\|$. Furthermore, we have the Cauchy–Schwarz and Minkowski inequalities as given by (5) and (6), respectively.

An important concept in the study of metric spaces is that of completeness. A metric space is called complete if every Cauchy sequence of functions in this space is a convergent sequence (i.e., the limit is in this space). A Hilbert space H is an inner product linear space that is complete in its natural metric. An important example is the space of square integrable functions on the interval [a, b]. It is denoted as $\mathcal{L}_2[a, b]$, called $\mathcal{L}_2$ space in the sequel.

An operator K is called bounded if there exists a constant $M > 0$, such that $\|Kg\| \le M\|g\|$ for all $g \in \mathcal{L}_2$. We can prove that the Fredholm operator with an $\mathcal{L}_2$ kernel is bounded by starting with the relation $f = Kg$. Then by using the Cauchy–Schwarz inequality we have

$$|f(x)|^2 = \left| \int K(x, y)g(y)\,dy \right|^2 \le \int |K(x, y)|^2\,dy \int |g(y)|^2\,dy.$$

Integrating both sides of this relation we find that $\|f\| = \|Kg\| \le \|g\| \left[ \iint |K(x, y)|^2\,dx\,dy \right]^{1/2}$ and we have established the boundedness of K. The norm $\|K\|$ of an operator K is defined as

$$\|K\| = \sup(\|Kg\| / \|g\|) \tag{37a}$$

or

$$\|K\| = \sup(\|Kg\|;\ \|g\| = 1). \tag{37b}$$

The operator K is called continuous in a Hilbert space if whenever $\{g_n\}$ is a sequence in the domain of K with limit g, then $Kg_n \to Kg$. A linear operator is continuous if it is bounded and conversely.

A set S is called precompact if a convergent subsequence can be extracted from any sequence of elements in S. A bounded linear operator K is called compact if it transforms any bounded set in H onto a precompact set. Any bounded operator K, whose range is finite-dimensional, is compact because it transforms a bounded set in H into a bounded finite-dimensional set which is necessarily precompact. Many interesting integral operators are compact. For instance, if the Hilbert space is $\mathcal{L}_2$ space and $K(x, y)$ is a degenerate kernel $\sum_{i=1}^{n} a_i(x)b_i(y)$, then K is a compact operator. This follows by observing that

$$Kg = \sum_{i=1}^{n} \int a_i(x)b_i(y)g(y)\,dy = \sum_{i=1}^{n} c_i a_i(x)$$

(i.e., the range of K is a finite-dimensional subspace of $\mathcal{L}_2$). Furthermore,

$$\|Kg\| = \left\| \sum_{i=1}^{n} c_i a_i(x) \right\| \le \sum_{i=1}^{n} |c_i|\,\|a_i\| \le \sum_{i=1}^{n} \|a_i\| \int |b_i(y)|\,|g(y)|\,dy.$$

Finally, by applying the Cauchy–Schwarz inequality we have $\|Kg\| \le M\|g\|$, where $M = \sum_{i=1}^{n} \|a_i\|\,\|b_i\|$. Accordingly, K is a bounded linear operator with finite-dimensional range and hence it is compact.

The following property of compact operators will prove useful in the next section. Let K be a compact operator on the Hilbert space H, and L be a bounded operator. Then both KL and LK are compact operators. To prove this property for the case of LK we let $\{f_n\}$ be a uniformly bounded sequence in H. Since K is compact, $\{f_n\}$ contains a subsequence $\{f_{n'}\}$ such that $\{Kf_{n'}\}$ is a Cauchy sequence, and $\|LKf_{n'} - LKf_{m'}\| \le \|L\|\,\|Kf_{n'} - Kf_{m'}\|$. This means that $\{LKf_{n'}\}$ is also a Cauchy sequence. In the case of KL, we observe that if $\{f_n\}$ is uniformly bounded, and L is bounded, then $\{Lf_n\}$ is also uniformly bounded. The compactness of K now assures us that there exists a subsequence $\{KLf_{n'}\}$ which is a Cauchy sequence. Thus, KL is also compact. As a particular case we find that if K is compact then so is $K^2$.

There are many interesting results regarding the compact operators. Some of them are as follows.

1. If $\{K_n\}$ is a sequence of compact operators on a Hilbert space H, such that for some K we have $\lim_{n \to \infty} \|K - K_n\| = 0$, then K is compact.
2. If $K(x, y)$ is continuous for all $a \le x, y \le b$ then K is a compact operator on $\mathcal{L}_2[a, b]$.
3. An $\mathcal{L}_2$ kernel K (i.e., $\iint |K(x, y)|^2\,dx\,dy < \infty$) is a compact operator.

V. HERMITIAN KERNELS AND THE HILBERT–SCHMIDT THEORY

Let us now consider a technique quite different from the Neumann and Fredholm series. This technique is based on considering the eigenvalues and eigenfunctions of the homogeneous integral equation with Hermitian kernel (i.e., $K(x, y) = \bar{K}(y, x)$; then K is called the Hermitian operator). For real K it becomes $K(x, y) = K(y, x)$ (i.e., a symmetric kernel). We shall restrict ourselves to the Hilbert space $\mathcal{L}_2$ of square integrable functions and will benefit from the concepts of the previous section. Because

$$\begin{aligned}
\langle K\phi, \psi \rangle &= \int \bar{\psi}(x) \int K(x, y)\phi(y)\,dy\,dx \\
&= \int \phi(y) \int K(x, y)\bar{\psi}(x)\,dx\,dy \\
&= \int \phi(x) \int K(y, x)\bar{\psi}(y)\,dy\,dx \\
&= \langle \phi, \bar{K}\psi \rangle,
\end{aligned} \tag{38}$$

the operator $\bar{K}(y, x)$ is called the adjoint operator. When K is Hermitian, this result becomes $\langle K\phi, \psi \rangle = \langle \phi, K\psi \rangle$ (i.e., K is self-adjoint). On the other hand, $\langle K\phi, \phi \rangle = \overline{\langle \phi, K\phi \rangle}$. Combining these two results we observe that the inner product $\langle K\phi, \phi \rangle$ is always real and the converse is also true.

Systems of orthogonal functions play an important role in this section so we give a brief account here. A finite or an infinite set $\{\phi_k\}$ is said to be an orthogonal set if $\langle \phi_i, \phi_j \rangle = 0$, $i \ne j$. If none of the elements of this set is a zero vector, it is said to be a proper orthogonal set. A set is orthonormal if $\langle \phi_i, \phi_j \rangle = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta which is zero for $i \ne j$ and 1 if $i = j$. As defined in Section I, a function φ for which $\|\phi\| = 1$ is said to be normalized. Given a finite or a countably infinite independent set of functions $\{\psi_n\}$, we can replace them by an equivalent set $\{\phi_n\}$ which is orthonormal. This is achieved by the so-called Gram–Schmidt procedure:

$$\begin{aligned}
\phi_1 &= \frac{\psi_1}{\|\psi_1\|} \\
\phi_2 &= \frac{\psi_2 - \langle \psi_2, \phi_1 \rangle \phi_1}{\|\psi_2 - \langle \psi_2, \phi_1 \rangle \phi_1\|} \\
&\;\;\vdots \\
\phi_n &= \frac{\psi_n - \sum_{i=1}^{n-1} \langle \psi_n, \phi_i \rangle \phi_i}{\left\| \psi_n - \sum_{i=1}^{n-1} \langle \psi_n, \phi_i \rangle \phi_i \right\|}
\end{aligned}$$

as is easily verified. In case we are given a set of orthogonal functions then we can convert it into an orthonormal set simply by dividing each function by its norm.

Starting from an arbitrary orthonormal system we can construct the theory of Fourier series on the same lines as the trigonometric series. This is achieved if we attempt to find the best approximation of an arbitrary function $\psi(x)$ in terms of a linear combination of an orthonormal set $\{\phi_n\}$. By this we mean that we choose the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_n$ in order to minimize

$$\left\| \psi - \sum_{i=1}^{n} \alpha_i \phi_i \right\|^2 = \|\psi\|^2 + \sum_{i=1}^{n} |\langle \psi, \phi_i \rangle - \alpha_i|^2 - \sum_{i=1}^{n} |\langle \psi, \phi_i \rangle|^2. \tag{39}$$

Clearly, the minimum is achieved by setting $\alpha_i = \langle \psi, \phi_i \rangle = a_i$ (say), which are called the Fourier coefficients of ψ relative to the orthonormal system $\{\phi_n\}$. Then (39) becomes

$$\left\| \psi - \sum_{i=1}^{n} \alpha_i \phi_i \right\|^2 = \|\psi\|^2 - \sum_{i=1}^{n} |a_i|^2,$$

from which it follows that $\sum_{i=1}^{n} |a_i|^2 \le \|\psi\|^2$. For an infinite set $\{\phi_n\}$ this inequality is

$$\sum_{i=1}^{\infty} |a_i|^2 \le \|\psi\|^2, \tag{40}$$

which is called the Bessel inequality. If an orthonormal system of functions can be found in $\mathcal{L}_2$ space such that its every element can be represented linearly in terms of this system, it is called an orthonormal basis. Thus, we have $\psi = \sum_i \langle \psi, \phi_i \rangle \phi_i = \sum_i a_i \phi_i$. From this relation we readily derive the Parseval identity:

$$\|\psi\|^2 = \sum_{i=1}^{\infty} |\langle \psi, \phi_i \rangle|^2.$$

Incidentally, if $\left\| \psi - \sum_{i=1}^{n} \alpha_i \phi_i \right\| \to 0$ as $n \to \infty$, then the series $\sum_{i=1}^{\infty} \alpha_i \phi_i$ is said to converge in the mean to ψ. The reason for this terminology is that the norm entails the integration of the square of the function.

We are now ready to discuss the solutions of the integral equation $\lambda K\phi = \phi$. We take $K(x, y)$ to be a Hermitian kernel so that, in view of relation (38), K is self-adjoint. As such, this operator has some very interesting properties as we establish below.

A. Property 1

1. Existence of an Eigenvalue

If the Hermitian kernel is also an $\mathcal{L}_2$ function then at least one of the quantities $\pm(\|K\|)^{-1}$, where this norm is defined by (37), must be an eigenvalue of $\lambda K f = f$.

2. Proof

Because K is a Hermitian kernel, it is self-adjoint because of (38), and being an $\mathcal{L}_2$ function it is compact as mentioned in the previous section. Accordingly, we consider a sequence $\{f_n\}$ such that $\|f_n\| = 1$ and $\|Kf_n\|$ converges to $\|K\|$. Note that

$$\begin{aligned}
0 &\le \left\| K^2 f_n - \|Kf_n\|^2 f_n \right\|^2 \\
&= \|K^2 f_n\|^2 - 2\|Kf_n\|^2 \langle K^2 f_n, f_n \rangle + \|Kf_n\|^4 \\
&= \|K^2 f_n\|^2 - \|Kf_n\|^4.
\end{aligned}$$

But $\|Kf_n\|$ converges to $\|K\|$, so that $\lim_{n \to \infty} \left\| K^2 f_n - \|Kf_n\|^2 f_n \right\| = 0$. In this relation we can use the fact that $K^2$, being the product of compact operators, is compact and, therefore, we can extract a subsequence $\{f_{n'}\}$ from $\{f_n\}$ so that $\{K^2 f_{n'}\}$ converges to a function, say $\|K\|^2 g$. Thus $\lim_{n' \to \infty} \left\| \|K\|^2 g - \|K\|^2 f_{n'} \right\| = 0$, so that $\{f_{n'}\}$ converges to g and $K^2 g = \|K\|^2 g$. But this means that $((\|K\|)^{-1}K - 1)((\|K\|)^{-1}K + 1)g = 0$ and the result follows.

As mentioned in Section III, an eigenvalue is simple if there is only one corresponding eigenfunction; otherwise the eigenvalue is called degenerate. The spectrum of K is the set of all eigenvalues. In view of the above-mentioned property we find that the spectrum of a Hermitian kernel is never empty.

B. Property 2

The eigenvalues of a Hermitian operator K are all real.

1. Proof

Suppose $\lambda K\phi = \phi$. Then by taking the inner product of this relation with φ, we have $\lambda \langle K\phi, \phi \rangle = \|\phi\|^2$ or $\lambda = \|\phi\|^2 / \langle K\phi, \phi \rangle$. As already observed $\langle K\phi, \phi \rangle$ is real for a Hermitian kernel. The quantity $\|\phi\|^2$ is also real. Thus λ is real.

C. Property 3

All eigenfunctions of a Hermitian operator K corresponding to distinct eigenvalues are orthogonal.

1. Proof

Let $\phi_1$ and $\phi_2$ be the eigenfunctions corresponding, respectively, to the distinct eigenvalues $\lambda_1$ and $\lambda_2$. Then we have $\lambda_1 K\phi_1 = \phi_1$ and $\lambda_2 K\phi_2 = \phi_2$ so that $\langle K\phi_1, \phi_2 \rangle = \lambda_1^{-1} \langle \phi_1, \phi_2 \rangle$ and also $\langle K\phi_1, \phi_2 \rangle = \langle \phi_1, K\phi_2 \rangle = \lambda_2^{-1} \langle \phi_1, \phi_2 \rangle$. Thus $\lambda_2 \langle \phi_1, \phi_2 \rangle = \lambda_1 \langle \phi_1, \phi_2 \rangle$. Since $\lambda_1 \ne \lambda_2$, we find that $\langle \phi_1, \phi_2 \rangle = 0$, as desired.

D. Property 4

The multiplicity of any nonzero eigenvalue is finite for every Hermitian operator K with $\mathcal{L}_2$ kernel.

1. Proof

Let the functions $\phi_{1\lambda}(x), \phi_{2\lambda}(x), \ldots, \phi_{n\lambda}(x), \ldots$ be the linearly independent eigenfunctions which correspond to a nonzero eigenvalue λ. We appeal to the Gram–Schmidt process and then find linear combinations of

these functions which form an orthonormal system $\{u_{k\lambda}(x)\}$. Their complex conjugate functions $\{\bar{u}_{k\lambda}(x)\}$ also form an orthonormal system. Let the function $K(x, y)$ with fixed x be written as $K(x, y) \sim \sum_i a_i \bar{u}_{i\lambda}(y)$, where $a_i = \int K(x, y)u_{i\lambda}(y)\,dy = u_{i\lambda}(x)/\lambda$. By applying the Bessel inequality (40) to this series, we have $\int |K(x, y)|^2\,dy \ge \sum_i (\lambda^{-1})^2 |u_{i\lambda}(x)|^2$, which, when integrated, yields the inequality

$$\iint |K(x, y)|^2\,dx\,dy \ge \sum_i (\lambda^{-1})^2 = \frac{m}{\lambda^2},$$

where m is the multiplicity of λ. But K is an $\mathcal{L}_2$ kernel so the left-hand side of this relation is finite. It follows that m is finite.

E. Property 5

The sequence of eigenfunctions of a Hermitian kernel K can be made orthonormal.

1. Proof

Suppose that, corresponding to a certain eigenvalue, there are m linearly independent eigenfunctions. Because K is a linear operator, every linear combination of these functions is also an eigenfunction. Thus, by the Gram–Schmidt procedure, we can get an equivalent number of eigenfunctions which are orthonormal. On the other hand, for distinct eigenvalues, the corresponding eigenfunctions are orthogonal and can be readily normalized. Combining these two parts we have established the required property.

F. Property 6

The eigenvalues of a Hermitian operator K with $\mathcal{L}_2$ kernel form a finite or an infinite sequence $\{\lambda_n\}$ with no finite limit point. Furthermore, if we include each eigenvalue in the sequence a number of times equal to its algebraic multiplicity, then

$$\sum_{n=1}^{\infty} (\lambda_n^{-1})^2 \le \iint |K(x, y)|^2\,dx\,dy.$$

The proof of this property can be presented by a slight extension of the arguments given in the analysis of Property 4.

G. Property 7

The set of eigenvalues of the nth iterated kernel coincides with the set of nth powers of the eigenvalues of the kernel $K(x, y)$.

1. Proof

Let λ be an eigenvalue of K with corresponding eigenfunction $\phi(x)$ so that $(I - \lambda K)\phi = 0$, where I is the identity operator. The result of operating on both sides of this equation with the operator $(I + \lambda K)$ yields $(I - \lambda^2 K^2)\phi = 0$, or

$$\phi(x) - \lambda^2 \int K_2(x, y)\phi(y)\,dy = 0,$$

which proves that $\lambda^2$ is an eigenvalue of the kernel $K_2(x, y)$. Conversely, if $\lambda^2$ is an eigenvalue of $K_2(x, y)$, with $\phi(x)$ as its corresponding eigenfunction, we have

$$(I - \lambda^2 K^2)\phi = 0$$

or

$$(I - \lambda K)(I + \lambda K)\phi = 0.$$

If λ is an eigenvalue of K this equation is satisfied and we have established the property for $n = 2$. Otherwise, we set $(I + \lambda K)\phi(x) = \Phi(x)$ so that the above equation becomes $(I - \lambda K)\Phi(x) = 0$. Since we have assumed that λ is not an eigenvalue of K, it follows that $\Phi(x) \equiv 0$. This means that $(I + \lambda K)\phi = 0$, or $-\lambda$ is an eigenvalue and we have again proved our result for $n = 2$. The general result follows by continuing this process.

In the next stage we wish to expand a nonzero Hermitian kernel in terms of its eigenfunctions. It may have a finite or infinite number of real eigenvalues. We order them in the sequence $\lambda_1, \lambda_2, \ldots, \lambda_n, \ldots$ in such a way that

1. each eigenvalue is repeated as many times as its multiplicity, and
2. we enumerate these eigenvalues in an order which corresponds to their absolute values (i.e., $0 < |\lambda_1| \le |\lambda_2| \le \cdots \le |\lambda_n| \le |\lambda_{n+1}| \le \cdots$).

Let $\phi_1(x), \phi_2(x), \ldots, \phi_n(x), \ldots$ be the sequence of corresponding orthonormalized eigenfunctions which are arranged in such a way that they are no longer repeated and are linearly independent in each group corresponding to the same eigenvalue. Thus to each eigenvalue $\lambda_k$ there corresponds just one eigenfunction $\phi_k(x)$. We shall have this ordering in mind in the sequel.

Property 1 has assured us that a nonzero Hermitian kernel always has a finite, nonzero lowest eigenvalue $\lambda_1$. Let $\phi_1$ be the corresponding eigenfunction. Now we remove this eigenvalue from the spectrum of K by defining the truncated kernel $K^{(2)}(x, y)$:

$$K^{(2)}(x, y) = K(x, y) - \frac{\phi_1(x)\bar{\phi}_1(y)}{\lambda_1},$$

which is also Hermitian. If $K^{(2)}(x, y)$ is nonzero then let $\lambda_2$ be its lowest eigenvalue with corresponding eigenfunction $\phi_2(x)$. Because

$$K^{(2)}\phi_1 = K\phi_1 - \frac{\phi_1(x)}{\lambda_1} \int \bar{\phi}_1(y)\phi_1(y)\,dy = 0$$

we observe that $\phi_1 \ne \phi_2$ even if $\lambda_1 = \lambda_2$. Similarly, the third truncated Hermitian kernel is

$$K^{(3)}(x, y) = K^{(2)}(x, y) - \frac{\phi_2(x)\bar{\phi}_2(y)}{\lambda_2} = K(x, y) - \sum_{k=1}^{2} \frac{\phi_k(x)\bar{\phi}_k(y)}{\lambda_k}.$$

Continuing this process we have

$$K^{(n+1)}(x, y) = K(x, y) - \sum_{k=1}^{n} \frac{\phi_k(x)\bar{\phi}_k(y)}{\lambda_k}, \tag{41}$$

which yields the (n + 1)th lowest eigenvalue and the corresponding eigenfunction $\phi_{n+1}(x)$. Thereby we find that either this process terminates after n steps (i.e., $K^{(n+1)}(x, y) = 0$), and the kernel $K(x, y)$ is a degenerate kernel $\sum_{k=1}^{n} (\phi_k(x)\bar{\phi}_k(y)/\lambda_k)$, or the process can be continued indefinitely and there are an infinite number of eigenvalues and eigenfunctions so that

$$K(x, y) = \sum_{k=1}^{\infty} \frac{\phi_k(x)\bar{\phi}_k(y)}{\lambda_k}. \tag{42}$$

This is called the bilinear form of the kernel. Recall that we meet a similar situation for a Hermitian matrix A. Indeed, by transforming to an orthonormal basis of the vector space consisting of the eigenvectors of A we can transform it to a diagonal matrix.

From the bilinear form (42) we derive a useful inequality. Let the sequence $\{\phi_k(x)\}$ be all the eigenfunctions of a Hermitian $\mathcal{L}_2$ kernel $K(x, y)$ with $\{\lambda_k\}$ as the corresponding eigenvalues as described and arranged in the above analysis. Then the series satisfies

$$\sum_{n=1}^{\infty} \frac{|\phi_n(x)|^2}{\lambda_n^2} \le C_1^2, \tag{43}$$

where $C_1^2$ is an upper bound of the integral $\int |K(x, y)|^2\,dy$. The proof follows by observing that the Fourier coefficients $a_n$ of the function $K(x, y)$ with fixed x, with respect to the orthonormal system $\bar{\phi}_n(y)$, are $a_n = \langle K(x, y), \bar{\phi}_n(y) \rangle = \phi_n(x)/\lambda_n$. Substituting these values of $a_n$ in the Bessel inequality (40) we derive (43).

The eigenfunctions do not have to form a complete set in order to represent the functions in $\mathcal{L}_2$. Indeed, any function which can be represented "sourcewise" in terms of the kernel K (i.e., any function $f = Kh$) can be expanded in a series of the eigenfunctions of K. This is not surprising because integration smooths out irregularities; or, if the functions f, K, and h are represented by the series, then the convergence of the series representing f will be better than that of the series representing K and h. This important concept for a Hermitian kernel is embodied in the theorem.

H. Hilbert–Schmidt Theorem

If $f(x)$ can be written in the form

$$f(x) = \int K(x, y)h(y)\,dy, \tag{44}$$

where K and h are in $\mathcal{L}_2$ space, then $f(x)$ can be expanded in an absolutely and uniformly convergent Fourier series with respect to the orthonormal system of eigenfunctions of K, that is,

$$f(x) = \sum_{n=1}^{\infty} f_n \phi_n(x), \qquad f_n = \langle f, \phi_n \rangle. \tag{45a}$$

The Fourier coefficients $f_n$ are related to the corresponding coefficients $h_n$ of $h(x)$ as

$$f_n = h_n/\lambda_n = \langle h, \phi_n \rangle/\lambda_n \tag{45b}$$

and $\lambda_n$ are the eigenvalues of K.

1. Proof

The Fourier coefficients of the function $f(x)$ with respect to the orthonormal system $\{\phi_n(x)\}$ are

$$f_n = \langle f, \phi_n \rangle = \langle Kh, \phi_n \rangle = \langle h, K\phi_n \rangle = \langle h, \phi_n \rangle/\lambda_n = h_n/\lambda_n$$

because K is self-adjoint and $\lambda_n K\phi_n = \phi_n$. Accordingly, we can write the correspondence

$$f(x) \sim \sum_{n=1}^{\infty} f_n \phi_n(x) = \sum_{n=1}^{\infty} \frac{h_n \phi_n(x)}{\lambda_n}. \tag{46}$$

The estimate of the remainder term for this series is

$$\left| \sum_{k=n+1}^{n+p} h_k \frac{\phi_k(x)}{\lambda_k} \right|^2 \le \sum_{k=n+1}^{n+p} h_k^2 \sum_{k=n+1}^{n+p} \frac{|\phi_k(x)|^2}{\lambda_k^2} \le \sum_{k=n+1}^{n+p} h_k^2 \sum_{k=1}^{\infty} \frac{|\phi_k(x)|^2}{\lambda_k^2}. \tag{47}$$

Now the series $\sum_{k=1}^{\infty} |\phi_k(x)|^2/\lambda_k^2$ is bounded in view of relation (43) while the partial sum $\sum_{k=n+1}^{n+p} h_k^2$ can be made arbitrarily small because $h(x) \in \mathcal{L}_2$ and, as such, the series $\sum_{k=1}^{\infty} h_k^2$ is convergent. Thus, the estimate (47) can be made arbitrarily small so that series (46) converges absolutely and uniformly. Next, we show that this series converges to $f(x)$ in the mean and for this purpose we denote its partial sum as $\psi_n(x) = \sum_{m=1}^{n} (h_m/\lambda_m)\phi_m(x)$ and estimate the value $\|f(x) - \psi_n(x)\|$. Because

$$f(x) - \psi_n(x) = Kh - \sum_{m=1}^{n} \frac{h_m}{\lambda_m}\phi_m(x) = K^{(n+1)}h,$$

where $K^{(n+1)}$ is the truncated kernel (41), we find that

$$\|f(x) - \psi_n(x)\|^2 = \|K^{(n+1)}h\|^2 = \langle K^{(n+1)}h, K^{(n+1)}h\rangle = \langle h, K^{(n+1)}K^{(n+1)}h\rangle = \langle h, K_2^{(n+1)}h\rangle \tag{48}$$

in view of the self-adjointness of the kernel $K^{(n+1)}$ and the relation $K^{(n+1)}K^{(n+1)} = K_2^{(n+1)}$. Now we use Property 7 for the Hermitian kernels and find that the least eigenvalue of the kernel $K_2^{(n+1)}$ is $\lambda_{n+1}^2$. On the other hand, Property 1 implies that

$$\frac{1}{\lambda_{n+1}^2} = \sup \frac{\|K^{(n+1)}h\|^2}{\|h\|^2} = \sup \frac{\langle h, K_2^{(n+1)}h\rangle}{\|h\|^2}.$$

Combining it with (48) we get $\|f(x) - \psi_n(x)\|^2 \le \|h\|^2/\lambda_{n+1}^2$. Because $\lambda_{n+1} \to \infty$, we have proved that $\|f(x) - \psi_n(x)\| \to 0$ as $n \to \infty$.

In order to prove that $f = \psi$, where $\psi$ is the series with partial sum $\psi_n$, we use the Minkowski inequality (6) and get $\|f - \psi\| \le \|f - \psi_n\| + \|\psi_n - \psi\|$. The first term on the right-hand side of this inequality tends to zero as proved above. Because series (47) converges uniformly, the second term can be made as small as we want (i.e., given an arbitrarily small and positive $\varepsilon$ we can find $n$ large enough that $|\psi_n - \psi| < \varepsilon$). One integration then yields $\|\psi_n - \psi\| < \varepsilon(b - a)^{1/2}$ and we have proved the result.

Let us now use the foregoing theorem for solving the inhomogeneous Fredholm integral equation of the second kind:

$$g(x) = f(x) + \lambda \int K(x, y)g(y)\,dy, \tag{49}$$

with a Hermitian $\mathcal{L}_2$ kernel. First, we assume that $\lambda$ is not an eigenvalue of $K$. Because $g(x) - f(x)$ in this equation has the integral representation of the form (44) we expand both $g(x)$ and $f(x)$ in terms of the eigenfunctions $\phi_n(x)$ given by the homogeneous equation $\phi_n(x) = \lambda_n \int K(x, y)\phi_n(y)\,dy$. Accordingly, we set

$$g(x) = \sum_{n=1}^{\infty} g_n\phi_n(x), \qquad f(x) = \sum_{n=1}^{\infty} f_n\phi_n(x), \tag{50}$$

where $g_n = \langle g, \phi_n\rangle$ is unknown and $f_n = \langle f, \phi_n\rangle$ is known. Substituting these expansions in (49) we obtain

$$\sum_{n=1}^{\infty} g_n\phi_n(x) = \sum_{n=1}^{\infty} f_n\phi_n(x) + \lambda \int K(x, y)\sum_{n=1}^{\infty} g_n\phi_n(y)\,dy. \tag{51}$$

In view of the uniform convergence of the expansion, we can interchange the order of integration and summation, and get

$$\sum_{n=1}^{\infty} g_n\phi_n(x) = \sum_{n=1}^{\infty} f_n\phi_n(x) + \lambda \sum_{n=1}^{\infty} \frac{g_n\phi_n(x)}{\lambda_n}. \tag{52}$$

Now we multiply both sides of (52) by $\phi_k(x)$ and integrate from $a$ to $b$ and appeal to the orthogonality of the eigenfunctions and obtain

$$g_k = f_k + (\lambda/\lambda_k)g_k \tag{53a}$$

or

$$g_k = f_k + \frac{\lambda}{\lambda_k - \lambda}f_k. \tag{53b}$$

Substitution of (53) into (50) leads us to the required solution

$$g(x) = \sum_{n=1}^{\infty}\left[f_n + \frac{\lambda}{\lambda_n - \lambda}f_n\right]\phi_n(x) = f(x) + \lambda\sum_{n=1}^{\infty}\int \frac{\phi_n(x)\phi_n(y)}{\lambda_n - \lambda}f(y)\,dy = f(x) + \lambda\int \Gamma(x, y; \lambda)f(y)\,dy, \tag{54}$$

where the resolvent kernel $\Gamma(x, y; \lambda)$ is expressed by the series

$$\Gamma(x, y; \lambda) = \sum_{n=1}^{\infty}\frac{\phi_n(x)\phi_n(y)}{\lambda_n - \lambda}, \tag{55}$$

and we have again interchanged the integration and summation. It follows from expression (55) that the singular points of the resolvent kernel $\Gamma$ corresponding to a Hermitian $\mathcal{L}_2$ kernel are simple poles and every pole is an eigenvalue of the kernel.

In the event that $\lambda$ in Eq. (49) is equal to one of the eigenvalues, say, $\lambda_p$ of $K$, solution (55) becomes infinite. To remedy it we return to relation (53) which for $k = p$ becomes $g_p = f_p + g_p$. Thus, $g_p$ is arbitrary and $f_p = 0$. This implies that $\int f(x)\phi_p(x)\,dx = 0$ (i.e., $f(x)$ is orthogonal to the eigenfunction $\phi_p(x)$). If this is not the case, we have no solution. If $\lambda_p$ has the algebraic multiplicity $m$, then there are $m$ coefficients $g_p$ which are arbitrary and $f(x)$ is orthogonal to all these $m$ functions.

Integral equations arise in the process of inverting ordinary and partial differential operators. In the quest for the representation formula for the solutions of these operators so as to include the initial or boundary values in it, we arrive at integral equations. In the process there arises the theory of Green's functions which are symmetric and become the kernels of the integral equations. If they are not symmetric, then they can be symmetrized. We illustrate these concepts with the help of the Sturm–Liouville differential operator
$$L = -\frac{d}{dx}\left(p(x)\frac{d}{dx}\right) + q(x)$$

where $p(x)$ and $q(x)$ are continuous in the interval $[a, b]$ and in addition $p(x)$ has a continuous derivative and does not vanish in this interval. We discuss two kinds of equations, that is,

$$Ly = f(x), \qquad a \le x \le b \tag{56}$$

and

$$Ly - \lambda r(x)y = 0, \tag{57}$$

where $f(x)$ and $r(x)$ are given functions. The function $r(x)$ is continuous and nonnegative in $[a, b]$. Each of these equations is subject to the boundary conditions

$$\alpha_1 y(a) + \alpha_2 y'(a) = 0, \qquad \beta_1 y(b) + \beta_2 y'(b) = 0. \tag{58}$$

Let us assume that it is not possible to obtain a nonzero solution of system (57) and (58) for the case $\lambda = 0$ (this means that there is no eigenfunction corresponding to the eigenvalue $\lambda = 0$). Accordingly, we assume that a function $\phi_1$ satisfies the boundary condition $(58)_1$: $\alpha_1\phi_1(a) + \alpha_2\phi_1'(a) = 0$ and another function $\phi_2$ (independent of $\phi_1$) satisfies the boundary condition $(58)_2$: $\beta_1\phi_2(b) + \beta_2\phi_2'(b) = 0$. This amounts to solving two initial value problems, namely, $L\phi_1 = 0$, $\phi_1(a) = -\alpha_2$, $\phi_1'(a) = \alpha_1$ and $L\phi_2 = 0$, $\phi_2(b) = -\beta_2$, $\phi_2'(b) = \beta_1$. With the help of these two linearly independent solutions $\phi_1$ and $\phi_2$ we use the method of variation of parameters and assume the solution of the inhomogeneous equation (56) as

$$y(x) = C_1(x)\phi_1(x) + C_2(x)\phi_2(x), \tag{59}$$

where $C_1$ and $C_2$ must be found from the relations

$$C_1'(x)\phi_1(x) + C_2'(x)\phi_2(x) = 0 \tag{60a}$$

$$C_1'(x)\phi_1'(x) + C_2'(x)\phi_2'(x) = -f(x)/p(x). \tag{60b}$$

We need one more relation for $\phi_1$ and $\phi_2$ to facilitate the solution of (56). This is found from the fact that $\phi_1$ and $\phi_2$ are two linearly independent solutions of the homogeneous equation $Ly = 0$. As such

$$0 = \phi_2 L\phi_1 - \phi_1 L\phi_2 = -\phi_2\frac{d}{dx}\left(p\frac{d\phi_1}{dx}\right) + \phi_1\frac{d}{dx}\left(p\frac{d\phi_2}{dx}\right) = -\frac{d}{dx}\left\{p\left(\phi_2\frac{d\phi_1}{dx} - \phi_1\frac{d\phi_2}{dx}\right)\right\},$$

so that the quantity within the braces is a constant. Inasmuch as $\phi_1$ and $\phi_2$ can be determined up to a constant factor, we can choose this expression to be

$$p\left(\phi_2\frac{d\phi_1}{dx} - \phi_1\frac{d\phi_2}{dx}\right) = 1. \tag{61}$$

From relations (60) and (61) we find that $C_1'(x) = -\phi_2(x)f(x)$ and $C_2'(x) = \phi_1(x)f(x)$. Thus

$$C_1(x) = \int_x^b \phi_2(\xi)f(\xi)\,d\xi, \qquad C_2(x) = \int_a^x \phi_1(\xi)f(\xi)\,d\xi,$$

with convenient constants of integration. Substituting these values in (59) we arrive at the solution

$$y(x) = \phi_1(x)\int_x^b \phi_2(\xi)f(\xi)\,d\xi + \phi_2(x)\int_a^x \phi_1(\xi)f(\xi)\,d\xi = \int_a^b G(x, \xi)f(\xi)\,d\xi, \tag{62}$$

where the function

$$G(x, \xi) = \begin{cases}\phi_1(\xi)\phi_2(x), & \xi \le x \\ \phi_1(x)\phi_2(\xi), & \xi \ge x\end{cases} \tag{63}$$

is called the Green's function of this boundary value problem. It can be written in an elegant form by defining the regions

$$x_< = \min(x, \xi) = \begin{cases}x, & a \le x \le \xi \\ \xi, & \xi \le x \le b\end{cases} \qquad x_> = \max(x, \xi) = \begin{cases}\xi, & a \le x \le \xi \\ x, & \xi \le x \le b.\end{cases}$$

Then $G(x, \xi) = \phi_1(x_<)\phi_2(x_>)$.

Finally, we attend to Eq. (57) whose solution follows from (62) to be

$$y(x) = \lambda\int r(\xi)G(x, \xi)y(\xi)\,d\xi. \tag{64}$$

This is an integral equation with kernel $r(\xi)G(x, \xi)$. Although this kernel is not symmetric, it can be symmetrized by setting $[r(x)]^{1/2}y(x) = g(x)$ and defining the symmetric kernel $K(x, \xi) = G(x, \xi)[r(x)]^{1/2}[r(\xi)]^{1/2}$. Then (64) becomes the integral equation

$$g(x) = \lambda\int K(x, \xi)g(\xi)\,d\xi. \tag{65}$$

If there was a term on the right-hand side of (57) we would have arrived at an inhomogeneous integral equation of the second kind. Incidentally, it can be readily verified that Eq. (64) satisfies the given boundary conditions. The case
when $\lambda = 0$ has a nonzero eigenfunction can be handled by a slight extension of the foregoing arguments.

The theory of Green's functions as derived above can be displayed very elegantly with the help of the Dirac delta function and other generalized functions.

VI. SINGULAR INTEGRAL EQUATIONS ON THE REAL LINE

An integral equation is said to be singular either if the kernel is singular within the range of integration or if one or both limits of integration are infinite. In this section we study some famous singular integral equations. They arise very frequently in various branches of physics and engineering. No general theory is available for these equations but methods are available for solving some special cases. We start with the Abel integral equation.

A. The Abel Integral Equation

This equation is

$$f(x) = \int_a^x \frac{g(y)}{(x - y)^{\alpha}}\,dy, \qquad 0 < \alpha < 1. \tag{66}$$

To solve it we multiply both sides of this equation by $dx/(u - x)^{1-\alpha}$ and integrate with respect to $x$ from $a$ to $u$ so that we have

$$\int_a^u \frac{f(x)\,dx}{(u - x)^{1-\alpha}} = \int_a^u \frac{dx}{(u - x)^{1-\alpha}}\int_a^x \frac{g(y)\,dy}{(x - y)^{\alpha}}.$$

When we change the order of integration on the right-hand side of this equation we have

$$\int_a^u \frac{f(x)\,dx}{(u - x)^{1-\alpha}} = \int_a^u g(y)\,dy\int_y^u \frac{dx}{(u - x)^{1-\alpha}(x - y)^{\alpha}}. \tag{67}$$

Next, we set $t = (u - x)/(u - y)$ in the second integral so that

$$\int_y^u (u - x)^{\alpha-1}(x - y)^{-\alpha}\,dx = \int_0^1 t^{\alpha-1}(1 - t)^{-\alpha}\,dt = \frac{\pi}{\sin\alpha\pi},$$

where we have used the value of the Eulerian beta function $B(\alpha, 1 - \alpha) = \pi\csc\alpha\pi$. Thus relation (67) becomes

$$\frac{\sin\alpha\pi}{\pi}\int_a^u \frac{f(x)\,dx}{(u - x)^{1-\alpha}} = \int_a^u g(y)\,dy. \tag{68}$$

Differentiation of this relation finally yields the solution

$$g(y) = \frac{\sin\alpha\pi}{\pi}\frac{d}{dy}\int_a^y \frac{f(x)}{(y - x)^{1-\alpha}}\,dx. \tag{69}$$

Similarly, the solution of the integral equation

$$\int_x^b \frac{g(y)}{(y - x)^{\alpha}}\,dy = f(x), \qquad 0 < \alpha < 1 \tag{70}$$

is

$$g(y) = -\frac{\sin\alpha\pi}{\pi}\frac{d}{dy}\int_y^b \frac{f(x)}{(x - y)^{1-\alpha}}\,dx. \tag{71}$$

There are many related integral equations which can be solved by similar steps. For instance, the solution of the integral equation

$$f(x) = \int_a^x \frac{g(y)\,dy}{[h(x) - h(y)]^{\alpha}}, \qquad 0 < \alpha < 1, \tag{72}$$

where the function $h(x)$ is a strictly increasing differentiable function with nonzero $h'(x)$ over some interval $a \le x \le b$, is

$$g(y) = \frac{\sin\alpha\pi}{\pi}\frac{d}{dy}\int_a^y \frac{h'(u)f(u)\,du}{[h(y) - h(u)]^{1-\alpha}}. \tag{73}$$

Similarly, the solution of the integral equation

$$f(x) = \int_x^b \frac{g(y)\,dy}{[h(y) - h(x)]^{\alpha}}, \qquad 0 < \alpha < 1 \tag{74}$$

is

$$g(y) = -\frac{\sin\alpha\pi}{\pi}\frac{d}{dy}\int_y^b \frac{h'(u)f(u)\,du}{[h(u) - h(y)]^{1-\alpha}}. \tag{75}$$

These relations remain valid when $a \to -\infty$ and $b \to \infty$.

B. The Cauchy Integral Equation

The equation

$$g(x) = f(x) + \lambda\int_0^1 \frac{g(y)\,dy}{y - x} \tag{76}$$

is the inhomogeneous Cauchy integral equation. Here the integral is the Cauchy principal value. To solve this equation we appeal to the identity

$$\int_0^u \frac{dy}{(u - y)^{1-\alpha}y^{\alpha}(y - x)} = \begin{cases}\dfrac{\pi\cot\alpha\pi}{(u - x)^{1-\alpha}x^{\alpha}}, & 0 < x < u \\[2ex] \dfrac{-\pi\csc\alpha\pi}{(x - u)^{1-\alpha}x^{\alpha}}, & u < x\end{cases} \tag{77}$$

and then define the function $\phi(x, u)$ as

$$\phi(x, u) = \frac{1}{(u - x)^{1-\alpha}x^{\alpha}}, \qquad 0 < x < u, \tag{78}$$

where $\alpha$ is such that $-\pi\cot\alpha\pi = 1/\lambda$. Then $\phi(x, u)$ is the solution of the integral equation

$$-\lambda\int_0^u \frac{\phi(y, u)}{y - x}\,dy = \phi(x, u), \qquad 0 < x < u \tag{79}$$
while

$$\int_0^u \frac{\phi(y, u)}{y - x}\,dy = -\frac{\pi\csc\alpha\pi}{(x - u)^{1-\alpha}x^{\alpha}}, \qquad u < x. \tag{80}$$

When we multiply (76) by $x$ we get

$$\lambda\int_0^1 \frac{y\,g(y)\,dy}{y - x} = xg(x) - xf(x) + c, \tag{81}$$

where $c = \lambda\int_0^1 g(y)\,dy$. Next, we multiply both sides of (81) by $\phi(x, u)$ as defined by (78), integrate from $0$ to $u$ and change the order of integration. The result is

$$-\lambda\int_0^u y\,g(y)\,dy\int_0^u \frac{\phi(x, u)\,dx}{x - y} - \lambda\int_u^1 y\,g(y)\,dy\int_0^u \frac{\phi(x, u)\,dx}{x - y} = \int_0^u xg(x)\phi(x, u)\,dx - \int_0^u xf(x)\phi(x, u)\,dx + c\int_0^u \phi(x, u)\,dx.$$

With the help of relations (79) and (80) and the fact that $\int_0^u \phi(x, u)\,dx = \pi\csc\alpha\pi$, the above relation becomes

$$\lambda\pi\csc\alpha\pi\int_u^1 \frac{y^{1-\alpha}g(y)}{(y - u)^{1-\alpha}}\,dy = -\int_0^u xf(x)\phi(x, u)\,dx + c\,\pi\csc\alpha\pi. \tag{82}$$

This is an Abel-type integral equation whose solution is found from the previous analysis to be

$$\lambda y^{1-\alpha}g(y) = \frac{\sin^2\alpha\pi}{\pi^2}\frac{d}{dy}\int_y^1 (u - y)^{-\alpha}\left[\int_0^u (u - x)^{\alpha-1}x^{1-\alpha}f(x)\,dx\right]du + \frac{c\sin\alpha\pi}{\pi(1 - y)^{\alpha}}. \tag{83}$$

Now we use the relation $-\pi\cot\alpha\pi = 1/\lambda$ and do a little algebraic manipulation and obtain the required solution as

$$g(x) = \frac{f(x)}{1 + \pi^2\lambda^2} + \frac{\lambda}{(1 + \pi^2\lambda^2)x^{1-\alpha}(1 - x)^{\alpha}}\int_0^1 \frac{(1 - y)^{\alpha}y^{1-\alpha}f(y)\,dy}{y - x} + \frac{c}{x^{1-\alpha}(1 - x)^{\alpha}\sqrt{1 + \pi^2\lambda^2}}. \tag{84}$$

Finally, we rescale the variable by $y \mapsto (y - a)/(b - a)$, and find from the above analysis that the solution of the integral equation

$$g(x) = f(x) + \lambda\int_a^b \frac{g(y)\,dy}{y - x} \tag{85}$$

is

$$g(x) = \frac{f(x)}{1 + \pi^2\lambda^2} + \frac{\lambda}{(1 + \pi^2\lambda^2)(x - a)^{1-\alpha}(b - x)^{\alpha}}\int_a^b \frac{(b - y)^{\alpha}(y - a)^{1-\alpha}f(y)\,dy}{y - x} + \frac{c}{(x - a)^{1-\alpha}(b - x)^{\alpha}}, \tag{86}$$

where $c$ is an arbitrary constant.

The solution of the Cauchy-type integral equation of the first kind:

$$\int_a^b \frac{g(y)\,dy}{y - x} = f(x), \qquad a < x < b \tag{87}$$

can be obtained with a very similar analysis and is

$$g(x) = \frac{1}{\pi^2\sqrt{(x - a)(b - x)}}\left[\int_a^b \frac{\sqrt{(y - a)(b - y)}}{x - y}f(y)\,dy + \pi c\right]. \tag{88}$$

In particular, when $a = -1$, $b = 1$, it follows that the solution of the airfoil equation

$$\frac{1}{\pi}\int_{-1}^{1}\frac{g(y)\,dy}{y - x} = f(x), \qquad -1 < x < 1 \tag{89}$$

is

$$g(x) = \frac{1}{\pi\sqrt{1 - x^2}}\int_{-1}^{1}\frac{\sqrt{1 - y^2}\,f(y)}{x - y}\,dy + \frac{c}{\sqrt{1 - x^2}}. \tag{90}$$

C. Singular Integral Equations with a Logarithmic Kernel

We start with the integral equation

$$\int_{-1}^{1}\ln|x - y|\,g_0(y)\,dy = 1, \qquad -1 < x < 1. \tag{91}$$

By setting $x = \cos\alpha$, $y = \cos\beta$, Eq. (91) becomes

$$\int_0^{\pi}\ln|\cos\alpha - \cos\beta|\,G(\beta)\,d\beta = 1, \qquad 0 < \alpha < \pi, \tag{92}$$

where $G(\beta) = g_0(\cos\beta)\sin\beta$. Let us now expand $G(\beta) = \sum_{n=0}^{\infty} b_n\cos n\beta$ and use the summation formula

$$\ln|\cos\alpha - \cos\beta| = -\ln 2 - 2\sum_{n=1}^{\infty}\frac{\cos n\alpha\cos n\beta}{n}. \tag{93}$$
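The summation formula (93) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function name `log_kernel_series`, the number of terms, and the sample angles are illustrative assumptions. The series converges only conditionally, so the partial sum approaches the logarithm slowly.

```python
import math

def log_kernel_series(alpha, beta, terms=20000):
    """Partial sum of the right-hand side of (93) (hypothetical helper)."""
    total = -math.log(2.0)
    for n in range(1, terms + 1):
        total -= 2.0 * math.cos(n * alpha) * math.cos(n * beta) / n
    return total

# Sample angles chosen for illustration; any 0 < alpha != beta < pi works.
alpha, beta = 1.0, 2.0
exact = math.log(abs(math.cos(alpha) - math.cos(beta)))
approx = log_kernel_series(alpha, beta)
print(exact, approx)
```

With 20000 terms the two printed values should agree to roughly three or four decimal places; the error decays only like the reciprocal of the number of terms.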

Then relation (92) becomes

$$\int_0^{\pi}\left[-\ln 2 - 2\sum_{n=1}^{\infty}\frac{\cos n\alpha\cos n\beta}{n}\right]\left[\sum_{m=0}^{\infty} b_m\cos m\beta\right]d\beta = 1,$$

from which it follows, due to orthogonality of cosine functions, that

$$-\pi b_0\ln 2 - \pi\sum_{n=1}^{\infty}\frac{b_n\cos n\alpha}{n} = 1.$$

Thus, $b_0 = -1/(\pi\ln 2)$, $b_n = 0$, $n \ge 1$, and we find that the solution of Eq. (91) is

$$g_0(y) = -\frac{1}{\pi\ln 2}\frac{1}{\sqrt{1 - y^2}}. \tag{94}$$

In passing we observe that by substituting solution (94) in (91) we have the useful identity

$$\int_{-1}^{1}\frac{\ln|x - y|}{(1 - y^2)^{1/2}}\,dy = -\pi\ln 2, \qquad -1 < x < 1. \tag{95}$$

Next, we consider the integral equation

$$\int_{-1}^{1}\ln|x - y|\,g(y)\,dy = f(x), \qquad -1 < x < 1. \tag{96}$$

Differentiation with respect to $x$ gives

$$\int_{-1}^{1}\frac{g(y)}{x - y}\,dy = f'(x), \qquad -1 < x < 1,$$

whose solution follows from (90) to be

$$g(x) = \frac{1}{\pi^2}\int_{-1}^{1}\left(\frac{1 - y^2}{1 - x^2}\right)^{1/2}\frac{f'(y)}{y - x}\,dy + \frac{C}{\pi\sqrt{1 - x^2}}, \tag{97}$$

where $C = \int_{-1}^{1} g(y)\,dy$. To find the constant $C$, we multiply (96) by $1/\sqrt{1 - x^2}$ and integrate it with respect to $x$ from $-1$ to $1$ and change the order of integration. The result is

$$\int_{-1}^{1} g(y)\,dy\int_{-1}^{1}\frac{\ln|x - y|}{(1 - x^2)^{1/2}}\,dx = \int_{-1}^{1}\frac{f(x)}{(1 - x^2)^{1/2}}\,dx,$$

which, in view of identity (95), becomes

$$(-\pi\ln 2)\,C = \int_{-1}^{1}\frac{f(x)}{(1 - x^2)^{1/2}}\,dx.$$

Thus

$$C = -\frac{1}{\pi\ln 2}\int_{-1}^{1}\frac{f(x)}{(1 - x^2)^{1/2}}\,dx,$$

which when substituted in (97) yields the solution

$$g(x) = \frac{1}{\pi^2}\int_{-1}^{1}\left(\frac{1 - y^2}{1 - x^2}\right)^{1/2}\frac{f'(y)}{y - x}\,dy - \frac{1}{\pi^2\ln 2\,(1 - x^2)^{1/2}}\int_{-1}^{1}\frac{f(y)}{(1 - y^2)^{1/2}}\,dy. \tag{98}$$

Various other forms of integral equations with logarithmic kernels can be solved in a similar fashion.

VII. THE CAUCHY KERNEL AND THE RIEMANN–HILBERT PROBLEM

For the study of the singular equations in the complex plane $\mathbb{C}$, we require a few important results from the analysis of a complex variable. We present some of these concepts needed for the Cauchy kernel. Let $C$ be a simple, smooth, and closed curve in the complex $z$ plane endowed with the counterclockwise orientation. The complement $\mathbb{C}\setminus C$ consists of two parts, one interior (bounded) part $S_+$ and the other exterior part $S_-$. A function $F(z)$ defined and analytic in the complement $\mathbb{C}\setminus C$ is called a sectionally analytic function with discontinuity contour $C$. Let $f(\zeta)$ be a continuous function defined for $\zeta \in C$. The Cauchy (or analytic) representation of $f$ is the sectionally analytic function

$$F(z) = F\{f(\zeta); z\} = \frac{1}{2\pi i}\int_C \frac{f(\zeta)\,d\zeta}{\zeta - z}, \qquad z \in \mathbb{C}\setminus C. \tag{99}$$

The boundary values $F_{\pm}(\omega)$ of this function on both sides of $C$ satisfy the Plemelj relations

$$F_+(\omega) = \tfrac{1}{2}f(\omega) - \tfrac{i}{2}H(f), \qquad F_-(\omega) = -\tfrac{1}{2}f(\omega) - \tfrac{i}{2}H(f), \tag{100}$$

where $H(f) = (1/\pi)\int_C (f(\zeta)/(\zeta - \omega))\,d\zeta$, $\omega \in C$, is called the Hilbert transform of $f$. Solving (100) for $f$ and $H(f)$ we get

$$f = F_+ - F_- = [F], \qquad H(f) = \frac{1}{\pi}\int_C \frac{f(\zeta)}{\zeta - \omega}\,d\zeta = i(F_+ + F_-), \tag{101}$$

where $[F]$ is called the jump of $F$ across $C$.

Now let $\Phi_1(\zeta)$ and $\Phi_2(\zeta)$ be two continuous functions defined on $C$. The Riemann–Hilbert problem is to find the sectionally analytic function $Y(z)$ defined on $\mathbb{C}\setminus C$ whose boundary values satisfy

$$\Phi_1(\zeta)Y_+(\zeta) - \Phi_2(\zeta)Y_-(\zeta) = \Psi(\zeta), \tag{102}$$
where $\Psi(\zeta)$ is a function given on $C$. We assume that $\Phi_1$ and $\Phi_2$ never vanish on $C$. It is called the normality condition. When we divide both sides of the above equation by $\Phi_1(\zeta)$ we get

$$Y_+(\zeta) = \Phi(\zeta)Y_-(\zeta) + \psi(\zeta), \tag{103}$$

where $\Phi = \Phi_2/\Phi_1$ and $\psi = \Psi/\Phi_1$. When $\psi = 0$, Eq. (103) becomes the homogeneous Riemann–Hilbert problem

$$X_+(\zeta) = \Phi(\zeta)X_-(\zeta). \tag{104}$$

To solve (103) we first reduce it to the simple form

$$W_+(\zeta) = W_-(\zeta) + \psi(\zeta) \tag{105}$$

which is obtained from (103) by taking $\Phi(\zeta) = 1$ because the solution of (105) is known to have the analytic representation

$$W(z) = F\{\psi(\zeta); z\} = \frac{1}{2\pi i}\int_C \frac{\psi(\zeta)\,d\zeta}{\zeta - z}, \tag{106}$$

where we have appealed to definition (99). We first solve the homogeneous Riemann–Hilbert problem (104). For this purpose we take the logarithm of both sides of (104), and get

$$\ln X_+(\zeta) = \ln X_-(\zeta) + \ln\Phi(\zeta), \tag{107}$$

where we assume, for the time being, that $\ln\Phi(\zeta)$ is single valued on $C$. A particular solution of (107) is given by the analytic representation

$$\ln X(z) = F\{\ln\Phi(\zeta); z\} = \frac{1}{2\pi i}\int_C \frac{\ln\Phi(\zeta)\,d\zeta}{\zeta - z}, \tag{108}$$

or $X(z) = e^{F\{\ln\Phi(\zeta); z\}}$, which is a sectionally analytic function that never vanishes on $\mathbb{C}\setminus C$ and whose boundary values satisfy (104) because

$$\frac{X_+(\zeta)}{X_-(\zeta)} = \exp[F_+\{\ln\Phi(\omega); \zeta\} - F_-\{\ln\Phi(\omega); \zeta\}] = \exp[\ln\Phi(\zeta)] = \Phi(\zeta).$$

Note that this basic solution is normal because $X(\infty) = 1$. Now, if $Y(z)$ is any other solution of the homogeneous problem (104) then the function $Y(z)/X(z)$, which is known to be analytic on $\mathbb{C}\setminus C$, is also analytic on $C$ because its jump across $C$ vanishes:

$$\left[\frac{Y}{X}\right] = \frac{Y_+}{X_+} - \frac{Y_-}{X_-} = \frac{\Phi Y_-}{\Phi X_-} - \frac{Y_-}{X_-} = 0.$$

Thus, $Y/X$ is an entire function and the most general solution of the homogeneous Riemann–Hilbert problem is $Y(z) = P(z)X(z)$ where $P(z)$ is an entire function. It is called the fundamental solution of the Riemann–Hilbert problem (fundamental in the sense that all other solutions can be obtained from it in a suitable way).

Let us now consider the case when $\ln\Phi(\zeta)$ is multiple valued on $C$ and introduce the number $k$

$$k = \frac{1}{2\pi i}\Delta_C(\ln\Phi(\zeta)) = \frac{1}{2\pi}\Delta_C(\arg\Phi(\zeta)), \tag{109}$$

where $\Delta_C(f(\zeta))$ denotes the increment of the function $f(\zeta)$ when the curve $C$ is traversed in the positive direction. Thus, $k$ is the index of the point $z = 0$ with respect to the curve $C'$, the image of the curve $C$ under the function $\Phi(\zeta)$. The number $k$, which is always an integer, is called the index of the Riemann–Hilbert problem.

Let $Y(z)$ be a solution of the homogeneous Riemann–Hilbert problem (104); we define the sectionally analytic function $\bar{Y}(z)$ as

$$\bar{Y}(z) = Y(z), \quad z \in S_+; \qquad \bar{Y}(z) = (z - z_0)^k\,Y(z), \quad z \in S_-. \tag{110}$$

Thus, $\bar{Y}(z)$ satisfies the following boundary value problem where $z_0$ is an arbitrary point of $S_+$:

$$\bar{Y}_+(\zeta) = \Phi_0(\zeta)\bar{Y}_-(\zeta), \qquad \Phi_0(\zeta) = (\zeta - z_0)^{-k}\Phi(\zeta). \tag{111}$$

Thereby $\ln\Phi_0(\zeta)$ has become single valued and we can apply the previous analysis to conclude that the solution of (111) is $\bar{Y}(z) = P(z)\bar{X}(z)$, where $P(z)$ is a polynomial and where $\bar{X}(z) = \exp[F\{\ln\Phi_0(\zeta); z\}]$ is the fundamental solution. Then it follows from (110) that the solutions of (104) are of the form $Y(z) = P(z)X(z)$ where $P(z)$ is an arbitrary polynomial and where the fundamental solution $X(z)$ is given by

$$X(z) = \exp\left[\frac{1}{2\pi i}\int_C \frac{\ln\Phi_0(\zeta)\,d\zeta}{\zeta - z}\right], \qquad z \in S_+ \tag{112a}$$

$$X(z) = (z - z_0)^{-k}\exp\left[\frac{1}{2\pi i}\int_C \frac{\ln\Phi_0(\zeta)\,d\zeta}{\zeta - z}\right], \qquad z \in S_-. \tag{112b}$$

Once a fundamental solution of the Riemann–Hilbert problem has been obtained, we can solve the inhomogeneous problem (103) as follows. Let $X(z)$ be a fundamental solution and let $Y(z)$ be a solution of (103) which we can write as

$$\frac{Y_+}{X_+} = \frac{Y_-}{X_-} + \frac{\psi}{X_+} \tag{113}$$

because $\Phi = X_+/X_-$. The solution of this equation with polynomial behavior at $z = \infty$ is

$$\frac{Y(z)}{X(z)} = P(z) + F\left\{\frac{\psi(\zeta)}{X_+(\zeta)}; z\right\}.$$
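The index (109) is a winding number and is easy to estimate numerically. The sketch below is illustrative only and assumes the contour $C$ is the unit circle; the function name `riemann_hilbert_index` and the sample coefficient functions are hypothetical choices, not taken from the article. It accumulates the increments of the argument of the coefficient in (103) as the circle is traversed once counterclockwise.

```python
import cmath
import math

def riemann_hilbert_index(phi, samples=2000):
    """Estimate k = (1/2 pi) * (increment of arg phi) around the unit circle.

    Assumes phi is continuous and nonvanishing on the circle and that the
    sampling is fine enough that successive arguments differ by less than pi.
    """
    total = 0.0
    prev = phi(1.0 + 0.0j)
    for j in range(1, samples + 1):
        zeta = cmath.exp(2j * math.pi * j / samples)
        cur = phi(zeta)
        total += cmath.phase(cur / prev)  # small signed change of the argument
        prev = cur
    return round(total / (2.0 * math.pi))

print(riemann_hilbert_index(lambda z: z ** 2))     # coefficient with a double zero inside C
print(riemann_hilbert_index(lambda z: 1.0 / z))    # coefficient with a simple pole inside C
print(riemann_hilbert_index(lambda z: z - 2.0))    # no zeros or poles inside C
```

By the argument principle the three sample coefficients have indices 2, -1, and 0, which is what the accumulated phase reproduces.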
Thus

$$Y(z) = P(z)X(z) + X(z)F\left\{\frac{\psi(\zeta)}{X_+(\zeta)}; z\right\}. \tag{114}$$

This formula gives the solution of the Riemann–Hilbert problem with polynomial behavior at $z = \infty$. For obtaining the solutions that vanish at $z = \infty$, it is necessary to consider the sign of the index $k$. When $k \ge 0$, (114) will be a solution provided the degree of $P$ does not exceed $k - 1$. When $k < 0$, however, for the solution to vanish at $z = \infty$, the polynomial $P(z)$ should vanish and so should the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_{-k}$ of the Taylor expansion

$$F\left\{\frac{\psi(\zeta)}{X_+(\zeta)}; z\right\} = \frac{\alpha_1}{z} + \frac{\alpha_2}{z^2} + \cdots \quad \text{at } z = \infty.$$

We are now ready to solve one of the most important singular integral equations, namely, the Carleman integral equation:

$$a(\zeta)g(\zeta) + \frac{b(\zeta)}{\pi i}\int_C \frac{g(\omega)}{\omega - \zeta}\,d\omega = f(\zeta) \tag{115}$$

over a closed contour $C$, where $a(\zeta)$, $b(\zeta)$, and $f(\zeta)$ are given functions on $C$ subject to the normality condition $a^2(\zeta) - b^2(\zeta) \ne 0$. To solve this equation we appeal to the analytic representation

$$G(z) = \frac{1}{2\pi i}\int_C \frac{g(\omega)\,d\omega}{\omega - z} \tag{116}$$

of the unknown function $g(\zeta)$. Next, we substitute the Plemelj formulas (100) in (115) and get

$$G_+(\zeta) = \Phi(\zeta)G_-(\zeta) + \psi(\zeta), \tag{117}$$

where

$$\Phi(\zeta) = \frac{a(\zeta) - b(\zeta)}{a(\zeta) + b(\zeta)}, \qquad \psi(\zeta) = \frac{f(\zeta)}{a(\zeta) + b(\zeta)}. \tag{118}$$

Thus, we have the Riemann–Hilbert problem (117) to solve. For this purpose we consider the index (109) and then define the fundamental solution (112). Now

$$X_+(\zeta) = \exp F_+\{\ln[\Phi(\omega)(\omega - z_0)^{-k}]; \zeta\} = \exp\left\{\tfrac{1}{2}\ln[\Phi(\zeta)(\zeta - z_0)^{-k}] + \gamma(\zeta)\right\} = \frac{\sqrt{\Phi(\zeta)}\,e^{\gamma(\zeta)}}{(\zeta - z_0)^{k/2}}, \tag{119}$$

where

$$\gamma(\zeta) = \frac{1}{2\pi i}\int_C \frac{\ln[\Phi(\omega)(\omega - z_0)^{-k}]}{\omega - \zeta}\,d\omega. \tag{120}$$

Similarly,

$$X_-(\zeta) = \frac{e^{\gamma(\zeta)}}{\sqrt{\Phi(\zeta)}\,(\zeta - z_0)^{k/2}}. \tag{121}$$

If $k \ge 0$, the solution of the Riemann–Hilbert problem (117) is given as

$$G(z) = X(z)P(z) + X(z)F\left\{\frac{f(\zeta)}{(a(\zeta) + b(\zeta))X_+(\zeta)}; z\right\}, \tag{122}$$

where $P(z)$ is a polynomial of degree $(k - 1)$ at most. If $k < 0$, the solution is

$$G(z) = X(z)F\left\{\frac{f(\zeta)}{(a(\zeta) + b(\zeta))X_+(\zeta)}; z\right\}, \tag{123}$$

provided

$$\int_C \frac{f(\zeta)\,\zeta^{j}}{(a(\zeta) + b(\zeta))X_+(\zeta)}\,d\zeta = 0 \quad \text{for } 0 \le j \le -k - 1.$$

Because

$$(a(\zeta) + b(\zeta))X_+(\zeta) = (a(\zeta) + b(\zeta))\sqrt{\Phi(\zeta)}\,(\zeta - z_0)^{-k/2}e^{\gamma(\zeta)} = \sqrt{a^2(\zeta) - b^2(\zeta)}\,(\zeta - z_0)^{-k/2}e^{\gamma(\zeta)},$$

it follows by using the Plemelj formula $g(\zeta) = G_+(\zeta) - G_-(\zeta)$, and relations (119), (121), and (123) that the solution of the integral equation (115) is

$$g(\zeta) = \frac{a(\zeta)f(\zeta)}{a^2(\zeta) - b^2(\zeta)} - \frac{e^{\gamma(\zeta)}(\zeta - z_0)^{-k/2}\,b(\zeta)}{\sqrt{a^2(\zeta) - b^2(\zeta)}\,\pi i}\int_C \frac{f(\omega)e^{-\gamma(\omega)}(\omega - z_0)^{k/2}\,d\omega}{\sqrt{a^2(\omega) - b^2(\omega)}\,(\omega - \zeta)} \tag{124}$$

so long as $k < 0$ and

$$\int_C \frac{f(\omega)e^{-\gamma(\omega)}(\omega - z_0)^{k/2}\,\omega^{j}}{\sqrt{a^2(\omega) - b^2(\omega)}}\,d\omega = 0, \qquad 0 \le j \le -k - 1. \tag{125}$$

Similarly, when $k \ge 0$ the solution becomes

$$g(\zeta) = \frac{a(\zeta)f(\zeta)}{a^2(\zeta) - b^2(\zeta)} - \frac{e^{\gamma(\zeta)}(\zeta - z_0)^{-k/2}\,b(\zeta)}{\sqrt{a^2(\zeta) - b^2(\zeta)}\,\pi i}\left[\int_C \frac{f(\omega)e^{-\gamma(\omega)}(\omega - z_0)^{k/2}\,d\omega}{\sqrt{a^2(\omega) - b^2(\omega)}\,(\omega - \zeta)} + P(\zeta)\right], \tag{126}$$

where $P(\zeta)$ is a polynomial whose degree does not exceed $k - 1$.
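As a sanity check on the Carleman solution in its simplest setting, one can take constant coefficients $a$ and $b$ on the unit circle, where $k = 0$, the polynomial $P$ is absent, and the general formula collapses to $g = af/(a^2 - b^2) - (b/((a^2 - b^2)\pi i))\int_C f(\omega)\,d\omega/(\omega - \zeta)$. The sketch below is a hedged illustration, not the article's method: the helper names `cauchy_pv` and `carleman_constant`, the staggered trapezoidal quadrature for the principal value, and the test data $f(\zeta) = \zeta + 1/\zeta$, $a = 2$, $b = 1$ are all assumptions made for the example.

```python
import cmath
import math

def cauchy_pv(f, zeta_angle, n=64):
    """Approximate (1/(pi i)) PV-integral of f(w)/(w - zeta) over the unit circle.

    Uses a trapezoidal rule whose nodes are staggered symmetrically about
    zeta = exp(i * zeta_angle), so the singular contributions cancel in pairs.
    """
    zeta = cmath.exp(1j * zeta_angle)
    step = 2.0 * math.pi / n
    total = 0.0j
    for j in range(n):
        w = cmath.exp(1j * (zeta_angle + (j + 0.5) * step))
        total += f(w) / (w - zeta) * (1j * w * step)  # dw = i w dtheta
    return total / (math.pi * 1j)

def carleman_constant(f, a, b):
    """Return g for constant a, b (assumed a*a != b*b) on the unit circle."""
    d = a * a - b * b
    def g(w):
        return a * f(w) / d - (b / d) * cauchy_pv(f, cmath.phase(w))
    return g

a, b = 2.0, 1.0
f = lambda w: w + 1.0 / w        # illustrative right-hand side
g = carleman_constant(f, a, b)

angle = 0.7
zeta = cmath.exp(1j * angle)
# Residual of a*g + (b/(pi i)) PV-integral of g/(w - zeta) - f at the test point.
residual = a * g(zeta) + b * cauchy_pv(g, angle) - f(zeta)
print(abs(residual))
```

For this right-hand side the solution can also be written in closed form as $\zeta/(a + b) + 1/((a - b)\zeta)$, which gives an independent check on the numerical value of `g(zeta)`.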
For the special case when $a$ and $b$ are constants, Eq. (115) reduces to the Cauchy integral equation

$$a\,g(\zeta) + \frac{b}{\pi i}\int_C \frac{g(\omega)}{\omega - \zeta}\,d\omega = f(\zeta) \tag{127}$$

while its solution follows by observing that $k = 0$, and $\Phi$ and $\gamma$ are constants. Thus, we appeal to relation (126) and find the solution to be

$$g(\zeta) = \frac{a f(\zeta)}{a^2 - b^2} - \frac{b}{(a^2 - b^2)\pi i}\int_C \frac{f(\omega)\,d\omega}{\omega - \zeta}. \tag{128}$$

VIII. WIENER–HOPF INTEGRAL EQUATION

The integral equation of the type

$$g(x) + \lambda\int_0^{\infty} K(x - y)g(y)\,dy = f(x), \qquad 0 < x < \infty \tag{129}$$

is called the Wiener–Hopf integral equation. Its distinguishing features are the difference kernel and the semi-infinite interval. This is an eigenvalue problem in which the eigenvalue happens to be known from certain physical considerations.

To use a two-sided transform, we have to change Eq. (129) in such a way that it is valid for $x < 0$. To achieve this objective we set

$$\lambda\int_{-\infty}^{\infty} K(x - y)g(y)\,dy = \begin{cases}-g(x) + f(x), & 0 < x < \infty \\ h(x), & -\infty < x < 0,\end{cases} \tag{130}$$

where $h(x)$ is unknown. Next we extend the definition of the functions $f(x)$, $g(x)$, and $h(x)$ as follows:

$$f(x) = 0, \; x < 0, \qquad g(x) = 0, \; x < 0, \qquad h(x) = 0, \; x > 0. \tag{131}$$

This enables us to write Eq. (130) as

$$g_+(x) + \lambda\int_{-\infty}^{\infty} K(x - y)g_+(y)\,dy = f_+(x) + h_-(x), \qquad -\infty < x < \infty, \tag{132}$$

where the subscripts $\pm$ indicate the $\pm$ half line on which the function is nonvanishing. Thereby we have succeeded in transforming the Wiener–Hopf integral Eq. (129) into a convolution-type integral equation.

Let us observe some interesting features of Eq. (132): (1) We assume that the kernel $K(x - y)$ is known for all values of its arguments, namely, $-\infty < x - y < \infty$. (2) The function $h_-(x)$ is unknown and can only be evaluated after we have found $g_+(x)$. (3) Because the integral in this equation extends from $-\infty$ to $+\infty$, we must have growth limits on the functions appearing in Eq. (132). They are: $K(x) = O(e^{-c|x|})$, $f(x) = O(e^{dx})$, as $x \to \infty$, the function $h_-(x) = O(e^{-c|x|})$ as $x \to -\infty$, where $c > 0$ and $d < c$. (4) We look for a solution $g_+(x) = O(e^{dx})$ as $x \to \infty$.

Let us now apply the Fourier transform $\hat{\psi}(u)$

$$\hat{\psi}(u) = \int_{-\infty}^{\infty}\psi(x)e^{iux}\,dx \tag{133}$$

to both sides of equation (132) and get

$$(1 + \lambda\hat{K}(u))\,\hat{g}_+(u) = \hat{f}_+(u) + \hat{h}_-(u). \tag{134}$$

These Fourier transforms have the following strips of definition: $\hat{K}(u)$ in $|\mathrm{Im}\,u| < c$, $\hat{g}_+(u)$ in $\mathrm{Im}\,u > d$, $\hat{f}_+(u)$ in $\mathrm{Im}\,u > d$ and $\hat{h}_-(u)$ in $\mathrm{Im}\,u < c$. Accordingly, Eq. (134) is a well-defined equation in the strip $d < \mathrm{Im}\,u < c$.

At this stage, two splitting techniques help us in solving Eq. (134). The first one is called the quotient splitting. This is achieved by setting

$$1 + \lambda\hat{K}(u) = \frac{K_+(u)}{K_-(u)}, \tag{135}$$

where $K_+$ and $K_-$ have their respective regions of analyticity. When we substitute Eq. (135) in Eq. (134) we obtain

$$K_+(u)\hat{g}_+(u) = K_-(u)\hat{f}_+(u) + K_-(u)\hat{h}_-(u). \tag{136}$$

Next we split the mixed term $K_-\hat{f}_+$ in the previous equation as

$$K_-(u)\hat{f}_+(u) = p_+(u) + p_-(u) \tag{137}$$

so that Eq. (136) becomes

$$K_+(u)\hat{g}_+(u) - p_+(u) = K_-(u)\hat{h}_-(u) + p_-(u). \tag{138}$$

Thereby we have reduced our problem to solving the Riemann–Hilbert boundary value problem as studied in the previous section.

We illustrate all the above-mentioned concepts with the help of the following example. Let us solve the integral equation:

$$\frac{1}{4}g(x) + \int_0^{\infty} e^{-|x - y|}g(y)\,dy = 1. \tag{139}$$

The Fourier transform of the kernel $K(x) = e^{-|x|}$ is $\hat{K}(u) = 2/(1 + u^2)$. Because $1_+$ is the Heaviside function, which is $0$ for $x < 0$ and $1$ for $x > 0$, we have $(1_+)^{\wedge}(u) = i/u$. Note that $2/(1 + u^2)$ is analytic in the strip $-1 < \mathrm{Im}\,u < 1$, while $i/u$ is analytic in the half plane $\mathrm{Im}\,u > 0$.

Now we process Eq. (139) by the Wiener–Hopf technique, as explained previously, and find that this integral equation transforms into
$$\frac{9 + u^2}{4(1 + u^2)}\hat{g}_+(u) = \frac{i}{u} + \hat{h}_-(u), \qquad 0 < \mathrm{Im}\,u < 1. \tag{140}$$

Next we split the coefficient of $\hat{g}_+(u)$ in (140) as the quotient

$$\frac{9 + u^2}{4(1 + u^2)} = \frac{(u + 3i)(u - 3i)}{4(u + i)(u - i)} = \frac{u + 3i}{4(u + i)}\bigg/\frac{u - i}{u - 3i} = \frac{K_+}{K_-}.$$

Then Eq. (140) takes the form

$$\frac{u + 3i}{4(u + i)}\hat{g}_+(u) = \frac{u - i}{u - 3i}\,\frac{i}{u} + \frac{u - i}{u - 3i}\hat{h}_-(u). \tag{141}$$

At this stage we split the first term on the right side of (141) as the sum

$$\frac{u - i}{u - 3i}\,\frac{i}{u} = \frac{i}{3u} + \frac{2i}{3(u - 3i)},$$

so that Eq. (141) becomes

$$\frac{u + 3i}{4(u + i)}\hat{g}_+(u) - \frac{i}{3u} = \frac{2i}{3(u - 3i)} + \frac{u - i}{u - 3i}\hat{h}_-(u). \tag{142}$$

The left side of equation (142) is analytic in $\mathrm{Im}\,u > 0$, while the right side is valid in the strip $0 < \mathrm{Im}\,u < 1$. Because these functions are equal on $\mathrm{Im}\,u = 0$, they are equal everywhere by analytic continuation. Accordingly, they are equal to the same entire function $E(u)$. As $\mathrm{Im}\,u \to \infty$, the left side of Eq. (142) tends to zero. Similarly, as $\mathrm{Im}\,u \to -\infty$, the right side of (142) tends to zero. Therefore $E(u) \equiv 0$. Since our interest is to evaluate $\hat{g}_+(u)$, we evaluate the part

$$\frac{u + 3i}{4(u + i)}\hat{g}_+(u) - \frac{i}{3u} = 0. \tag{143}$$

When inverted, Eq. (143) yields

$$g_+(x) = \frac{4i}{3}\,\frac{1}{2\pi}\int_{ia - \infty}^{ia + \infty}\frac{(u + i)}{u(u + 3i)}e^{-iux}\,du. \tag{144}$$

We evaluate this integral by choosing the contour in the lower half plane. This contour consists of the line from $ia - R$ to $ia + R$, and the semicircle of radius $R$ with center at $(0, a)$, and then we let $R \to \infty$. The value of the integral on the semicircle vanishes. Finally, with the help of the theory of residues we find that the solution is

$$g(x) = \frac{4}{9}(1 + 2e^{-3x})H(x). \tag{145}$$

IX. NONLINEAR INTEGRAL EQUATIONS

Let us finally present an elementary discussion of nonlinear integral equations. Because there are no analytic techniques by which these equations are solved, their treatment is available on an ad hoc basis. A nonlinear Fredholm integral equation is of the form

$$g(x) = f(x) + \lambda\int F\{x, y, g(y)\}\,dy, \tag{146}$$

where both $x$ and $y$ lie in the domain $(a, b)$ and we have omitted the limits $a$, $b$ of integration as in the previous discussion. Equation (146) is called an Uryson equation. Perhaps the most important particular case is the Hammerstein equation:

$$g(x) = f(x) + \lambda\int K(x, y)F(y, g(y))\,dy. \tag{147}$$

These nonlinear integral equations are being presently studied by modern numerical methods. We shall limit ourselves to giving the iterative scheme for solving the general case (146). This scheme is similar to the one given for the linear case in Section II. For this purpose we impose the following conditions on the functions occurring in this equation. We assume that $f(x)$ is continuous in $(a, b)$. The function $F(x, y, g(y))$ is continuous with respect to $x$ and $y$ in the rectangle $a \le x, y \le b$ and satisfies the Lipschitz condition with respect to $g(y)$ (i.e., $|F\{x, y, g_1(y)\} - F\{x, y, g_2(y)\}| \le L|g_1(y) - g_2(y)|$), where $L$ is a positive constant. The continuity of $F$ with respect to $x$ and $y$ assures that $|F\{x, y, g(y)\}| < M$ for bounded $g(y)$, where $M$ is a positive constant.

Now we follow the iterative technique as given in Section II and set the first approximation for $g(x)$ to be $g^{(0)}(x) = f(x)$ and subsequent ones as

$$g^{(n)}(x) = f(x) + \lambda\int F\{x, y, g^{(n-1)}(y)\}\,dy, \qquad n \ge 1. \tag{148}$$

Because

$$g^{(n)}(x) = \sum_{k=1}^{n}\left[g^{(k)}(x) - g^{(k-1)}(x)\right] + g^{(0)}(x)$$

the convergence of the sequence $g^{(n)}(x)$ is equivalent to the convergence of the series whose $k$th term is $(g^{(k)}(x) - g^{(k-1)}(x))$. To examine the convergence of this series we set

$$g^{(1)}(x) - g^{(0)}(x) = \lambda\int F(x, y, f(y))\,dy = \phi(x)$$

as $g^{(0)}(x) = f(x)$, and assume that $|\phi(x)| < A$, a constant. Then we find that
$$\left|g^{(k)}(x) - g^{(k-1)}(x)\right| = |\lambda|\left|\int\left[F\{x, y, g^{(k-1)}(y)\} - F\{x, y, g^{(k-2)}(y)\}\right]dy\right| \le |\lambda|\int\left|F\{x, y, g^{(k-1)}(y)\} - F\{x, y, g^{(k-2)}(y)\}\right|dy \le |\lambda|L\int\left|g^{(k-1)}(y) - g^{(k-2)}(y)\right|dy \le \cdots \le [|\lambda|L(b - a)]^{k-1}A.$$

Thus, if $|\lambda|L(b - a) < 1$, the series will be absolutely and uniformly convergent and $g^{(n)}(x)$ will tend to a function $g(x)$ which will be the solution of Eq. (146).

To prove the uniqueness of the solution, we let $g(x)$ and $h(x)$ be two solutions. Then the value of the difference $\psi(x) = g(x) - h(x)$ is

$$\psi(x) = \lambda\int\left[F\{x, y, g(y)\} - F\{x, y, h(y)\}\right]dy.$$

When we denote by $\psi_{\max}$ the maximum value of $|\psi(x)|$ in $(a, b)$ and appeal to the Lipschitz continuity of $F$, we find that

$$|\psi(x)| \le |\lambda|L\int|\psi(y)|\,dy \le |\lambda|L(b - a)\psi_{\max},$$

which means that $\psi_{\max} \le |\lambda|L(b - a)\psi_{\max}$. But $|\lambda|L(b - a) < 1$, so we have $\psi_{\max} = 0$ and the uniqueness of the solution is proved.

Let us now consider the nonlinear Volterra integral equation

$$g(x) = f(x) + \lambda\int_0^x F\{x, y, g(y)\}\,dy, \tag{149}$$

with the same conditions on the functions $f(x)$ and $F\{x, y, g(y)\}$ as given above. The iterative scheme again yields the sequence $g^{(0)}(x) = f(x)$ and

$$g^{(n)}(x) = f(x) + \lambda\int_0^x F\{x, y, g^{(n-1)}(y)\}\,dy, \qquad n \ge 1 \tag{150}$$

and its convergence is equivalent to that of the series whose $k$th term is $g^{(k)}(x) - g^{(k-1)}(x)$. To study this convergence we observe that

$$\left|g^{(1)}(x) - g^{(0)}(x)\right| = \left|\lambda\int_0^x F\{x, y, f(y)\}\,dy\right| \le |\lambda|\int_0^x|F\{x, y, f(y)\}|\,dy \le |\lambda|\int_0^x M\,dy = M|\lambda|x. \tag{151}$$

Then from (150) it follows that

$$\left|g^{(k)}(x) - g^{(k-1)}(x)\right| = |\lambda|\left|\int_0^x\left[F\{x, y, g^{(k-1)}(y)\} - F\{x, y, g^{(k-2)}(y)\}\right]dy\right| \le |\lambda|L\int_0^x\left|g^{(k-1)}(y) - g^{(k-2)}(y)\right|dy \le \cdots \le M\frac{(L|\lambda||x|)^k}{k!}. \tag{152}$$

But the last term in (152) is the $k$th term of the power series for $M\exp\{L|\lambda||x|\}$, so that the series $g^{(0)}(x) + \sum_{k=1}^{\infty}\{g^{(k)}(x) - g^{(k-1)}(x)\}$ is absolutely and uniformly convergent for all values of $\lambda$ and its sum $\lim_{n\to\infty} g^{(n)}(x)$ is the solution of the integral equation (149). The uniqueness of this solution can be proved by a slight extension of the arguments presented above for the Fredholm case.

X. A TAYLOR EXPANSION TECHNIQUE

In the previous analysis we have presented the Neumann and Hilbert–Schmidt expansion techniques for solving the integral equations. Recently, it has been discovered that both the linear and nonlinear integral equations can also be solved with the help of the Taylor series. To present the basic ideas of this method, we consider the Fredholm integral equation of the second kind

$$g(x) = f(x) + \int_a^b K(x, y)g(y)\,dy \tag{153}$$

and differentiate it $n$ times with respect to $x$ so that we have

$$g^{(n)}(x) = f^{(n)}(x) + \int_a^b \frac{\partial^n K(x, y)}{\partial x^n}g(y)\,dy.$$

For $x = 0$, the above relation becomes

$$g^{(n)}(0) = f^{(n)}(0) + \int_a^b\left[\frac{\partial^n K(x, y)}{\partial x^n}\right]_{x=0}g(y)\,dy. \tag{154}$$

When we substitute the Taylor series

$$g(y) = \sum_{m=0}^{\infty}\frac{1}{m!}g^{(m)}(0)\,y^m \tag{155}$$

for $g(y)$ in (154) we get

$$g^{(n)}(0) = f^{(n)}(0) + \int_a^b\left[\frac{\partial^n K(x, y)}{\partial x^n}\right]_{x=0}\left[\sum_{m=0}^{\infty}\frac{1}{m!}g^{(m)}(0)\,y^m\right]dy. \tag{156}$$
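Relation (156) is linear in the unknown derivatives $g^{(m)}(0)$, so truncating the sum turns it into a small linear system. The sketch below is an illustration under stated assumptions rather than the article's own procedure: it uses the polynomial kernel $K(x, y) = xy + x^2y^2$ on $[-1, 1]$ with $f(x) = (x + 1)^2$ (the data of the worked example treated later in this section), it truncates to a full $3 \times 3$ system instead of a staggered one, and the names `taylor_matrix`, `solve_linear`, and `monomial_integral` are hypothetical helpers. Exact rational arithmetic avoids any rounding.

```python
from fractions import Fraction
from math import factorial

def monomial_integral(power, a, b):
    """Exact integral of y**power over [a, b]."""
    p = power + 1
    return (Fraction(b) ** p - Fraction(a) ** p) / p

# d^n K / dx^n at x = 0 for K(x, y) = x*y + x**2 * y**2,
# stored as {power of y: coefficient}; derivatives of order >= 3 vanish.
kernel_derivs_at_zero = {0: {}, 1: {1: Fraction(1)}, 2: {2: Fraction(2)}}
f_derivs_at_zero = [Fraction(1), Fraction(2), Fraction(2)]  # f(x) = (x + 1)**2

def taylor_matrix(size, a, b):
    """Entries (1/m!) * integral of [d^n K/dx^n]_{x=0} * y**m over [a, b]."""
    T = [[Fraction(0)] * size for _ in range(size)]
    for n in range(size):
        for m in range(size):
            for power, coeff in kernel_derivs_at_zero.get(n, {}).items():
                T[n][m] += coeff * monomial_integral(power + m, a, b) / factorial(m)
    return T

def solve_linear(A, rhs):
    """Gauss-Jordan elimination with exact fractions (assumes A is nonsingular)."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

T = taylor_matrix(3, -1, 1)
# Truncated form of (156): (I - T) g = f for the vector of derivatives at 0.
A = [[(1 if n == m else 0) - T[n][m] for m in range(3)] for n in range(3)]
coeffs = solve_linear(A, f_derivs_at_zero)
print(coeffs)
```

The three values are $g(0)$, $g^{(1)}(0)$, and $g^{(2)}(0)$; dividing the last by $2!$ gives the quadratic Taylor coefficient of the approximate solution.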
Next, we set

\[ T_{mn} = \frac{1}{m!} \int_a^b \left[ \frac{\partial^n K(x, y)}{\partial x^n} \right]_{x=0} y^m\, dy \tag{157} \]

in (139) and obtain

\[ g^{(n)}(0) = f^{(n)}(0) + \sum_{m=0}^{\infty} T_{mn}\, g^{(m)}(0), \tag{158} \]

n = 0, 1, 2, 3, . . . . Accordingly, the evaluation of the solution g(x) of integral equation (136) reduces to solving the above infinite system of algebraic equations for the Taylor coefficients $g^{(m)}(0)$. This is achieved by truncating this system in a suitable manner so that we get a determinate system at every step. Thus, we get an approximate solution to the desired order of accuracy. For instance, let us take (i) n = 0, m = 1; (ii) n = 1, m = 1; (iii) n = 2, m = 2, in (158) so that we get the determinate system of three equations

\[ (T_{00} - 1)\, g(0) + T_{01}\, g^{(1)}(0) = -f(0) \tag{159a} \]

\[ T_{10}\, g(0) + (T_{11} - 1)\, g^{(1)}(0) = -f^{(1)}(0) \tag{159b} \]

\[ T_{20}\, g(0) + T_{21}\, g^{(1)}(0) + (T_{22} - 1)\, g^{(2)}(0) = -f^{(2)}(0) \tag{159c} \]

Equations (159a) and (159b) yield the values of $g(0)$ and $g^{(1)}(0)$, which when substituted in (159c), give the value of $g^{(2)}(0)$. Thus, we obtain the solution g(x) to $O(x^2)$.

The elegance of the method lies in the fact that frequently we get an exact solution of an integral equation. We illustrate this point with the help of the integral equation

\[ g(x) = (x + 1)^2 + \int_{-1}^{1} (xy + x^2 y^2)\, g(y)\, dy. \tag{160} \]

Accordingly, $f(x) = (x + 1)^2$, $K(x, y) = xy + x^2 y^2$, and we have

\[ f(0) = 1, \quad f^{(1)}(0) = 2, \quad f^{(2)}(0) = 2, \]

\[ T_{00} = 0, \quad T_{01} = 0, \quad T_{10} = 0, \quad T_{11} = \tfrac{2}{3}, \quad T_{20} = \tfrac{4}{3}, \quad T_{21} = 0, \quad T_{22} = \tfrac{2}{5}. \]

When we substitute these values in Eq. (159) and solve for the first three Taylor coefficients of g(x), we get

\[ g(0) = 1, \quad g^{(1)}(0) = 6, \quad g^{(2)}(0) = \tfrac{50}{9}. \]

Thus, we have

\[ g(x) = 1 + 6x + \tfrac{25}{9} x^2, \]

which happens to be the exact solution of Eq. (160).

SEE ALSO THE FOLLOWING ARTICLES

CALCULUS • DIFFERENTIAL EQUATIONS, ORDINARY • GREEN’S FUNCTIONS

BIBLIOGRAPHY

Cochran, J. A. (1972). “The Analysis of Linear Integral Equations,” McGraw-Hill, New York.
Estrada, R., and Kanwal, R. P. (1999). “Singular Integral Equations,” Birkhäuser, Boston.
Hochstadt, H. (1973). “Integral Equations,” Wiley, New York.
Jerry, A. J. (1999). “Introduction to Integral Equations with Applications,” Wiley, New York.
Kanwal, R. P. (1997). “Linear Integral Equations,” Second Edition, Birkhäuser, Boston.
Muskhelishvili, N. I. (1953). “Singular Integral Equations,” Noordhoff, Groningen, Holland.
Peters, A. S. (1969). Some integral equations related to Abel’s equation and the Hilbert transform. Comm. Pure Appl. Math., 22, 539–560.
Pipkin, A. C. (1991). “A Course on Integral Equations,” Springer Verlag, New York.
Stakgold, I. (1968). “Boundary Value Problems of Mathematical Physics, Vol. II,” Macmillan, New York.
P1: GRB/GWT P2: GPJ Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN008O-359 July 14, 2001 18:55

Knots
Louis H. Kauffman
University of Illinois

I. Knot Tying and the Reidemeister Moves

II. Invariants of Knots and Links, A First Pass
III. The Jones Polynomial
IV. The Bracket State Sum
V. Vassiliev Invariants
VI. Vassiliev Invariants and Lie Algebras
VII. A Quick Review of Quantum Mechanics
VIII. Knot Amplitudes
IX. Topological Quantum Field Theory, First Steps

GLOSSARY

Coloring A labeling of a combinatorial structure (such as a graph or a knot diagram) with elements of a chosen set (called colors) according to rules that are specified in a given context.
Diagram A graphical structure intended to illustrate or embody a mathematical structure.
Fundamental group The fundamental group of a topological space is a group that is naturally defined by mappings of circles to the space. This group is used throughout topology.
Group An algebraic structure with one binary operation, satisfying the axioms that there is an identity element, every element has an inverse, and associativity. The concept of a group formalizes a general notion of symmetry.
Knot An embedding of a circle in three-dimensional space. A knot is said to be knotted if this embedding cannot be transformed topologically to a flat circle. The intent of this mathematical definition is to capture the topological aspect of a knotted rope in the space of physical experience.
Knot polynomial A method of assigning to each knot or link a polynomial that is a topological invariant of the knot or link.
Link An embedding of several disjoint circles in three-dimensional space. Two links are said to be topologically equivalent if there is an (orientation-preserving) homeomorphism of three-dimensional space that carries one link to the other one.
Quandle An algebraic invariant of knots and links closely related to the fundamental group.
Topological invariant A method of assigning a mathematical object (such as a number, a polynomial, or a group) to each of a class of topological spaces (or in the case of knots and links, to a space and collection of subspaces) so that topologically equivalent spaces (or configurations of spaces) receive the same assigned mathematical structure.


Topology The study of topological spaces: sets endowed with a notion of neighborhoods (called open sets) closed under finite intersections and arbitrary unions, such that the whole space and the empty set are both open. Topological spaces encapsulate the concept of continuity in the structure of the neighborhoods.

THIS ARTICLE constitutes an introduction to the theory of knots as it has been influenced by developments concurrent with the discovery of the Jones polynomial in 1984 and the subsequent explosion of research that followed this signal event in the mathematics of the 20th century. I hope to give the flavor of these extraordinary events in this exposition. Even the act of tying a shoelace can become an adventure. The familiar world of string, rope, and the third dimension becomes an inexhaustible source of ideas and phenomena.

Sections 1 and 2 constitute a start on the subject of knots. Later sections introduce more technical topics. The theme of a relationship of knots with physics begins already with the Jones polynomial and the bracket model for the Jones polynomial as discussed in Section 4. Sections 5 and 6 provide an introduction to Vassiliev invariants and the remarkable relationship between Lie algebras and knot theory. The idea for the bracket model and its generalizations is to regard the knot itself as a discrete physical system and to obtain information about its topology by averaging over the states of the system. In the case of the bracket model this summation is finite and purely combinatorial. Transpositions of this idea occur throughout, involving ideas from quantum mechanics (Sections 7 and 8) and quantum field theory (Section 9). In this way knots have become a testing ground not only for topological ideas, but also for the methods of modern theoretical physics.

This article concentrates on the construction of invariants of knots and the relationships of these invariants to other mathematics (such as Lie algebras) and to physical ideas (quantum mechanics and quantum field theory). There is also a rich vein of knot theory that considers a knot as a physical object in three-dimensional space. Then one can put electrical charge on a knot and watch (in a computer) the knot repel itself to form beautiful shapes in three dimensions. Or one can think of a knot as made of thick rope and ask for an “ideal” form of the knot with minimal length-to-diameter ratio. This idea of physical knots is a current topic of research.

I. KNOT TYING AND THE REIDEMEISTER MOVES

A. Introduction

For this section it is recommended that the reader obtain a length of soft rope for the sake of direct experimentation. We begin by making some knots. In particular, we shall look at the bowline, a most useful knot. The bowline is widely used by persons who need to tie a horse to a post or their boat to a dock. It is easy and quick to make, holds exceedingly well, and can be undone in a jiffy. Figure 1 gives instructions for making the bowline. In showing the bowline we have drawn it loosely. To use it, one grabs the lower loop and pulls it tight by the upper line shown in the drawing. It tightens while maintaining the given size of the loop. Nevertheless, the knot is easily undone, as some experimentation will show.

FIGURE 1 The bowline.

The utility of a schema for drawing a knot is that the schema does not have to indicate all the physical properties of the knot. It is sufficient that the schema should contain the information needed to build the knot. Here is a remarkable use of language. The language of the diagrams for knots implicitly contains all their topological and physical properties, but this information may not be easily available unless the “word is made flesh” in the sense of actually building the knot from rope or cord.

Our aim is to get topological information about knots from their diagrams. Topological information is information about a knot that does not depend upon the material from which it is made and is not changed by stretching or

bending that material so long as it is not torn in the process. We do not want the knot to disappear in the course of such a stretching process by slipping over one of the ends of the rope. The knot theorist’s usual convention for preventing this is to assume that the knot is formed in a closed loop of string. The trefoil knot shown in Fig. 2 is an example of such a closed, knotted loop.

A knot presented in closed-loop form is a robust object, capable of being pushed and twisted into many topologically equivalent forms. For example, the knot shown in Fig. 3 is topologically equivalent to the trefoil shown in Fig. 2.

FIGURE 3 Deformed trefoil.

The existence of innumerable versions of a given knot or link gives rise to a mathematical problem. To state that a loop is knotted is to state that nowhere among the infinity of forms that it can take do we find an unknotted loop. Two loops are said to be (topologically) equivalent if it is possible to deform one smoothly into the other so that all the intermediate stages are loops without self-intersections. In this sense a loop is knotted if it is not equivalent to a simple flat loop in the plane.
FIGURE 2 The trefoil as closed loop.

B. Reidemeister Moves

The key result that makes it possible to begin a (combinatorial) theory of knots is the theorem of Reidemeister (1948, 1932), which states that two diagrams represent equivalent loops if and only if one diagram can be obtained from the other by a finite sequence of special deformations called the Reidemeister moves. I shall illustrate these moves in a moment. The upshot of Reidemeister’s theorem is that the topological problems about knots can all be formulated in terms of knot diagrams.

There is a famous philosophy of mathematics called “formalism” in which mathematics is considered to be a game played with symbols according to specific rules. Knot theory, done with diagrams, illustrates the formalist idea very well. In the formalist point of view a specific mathematical game (formal system) can itself be an object of study for the mathematician. Each particular game may act as a coordinate system, illuminating key aspects of the subject. One can think about knots through the model of the diagrams. Other models (such as regarding the knots as specific kinds of embeddings in three-dimensional space) are equally useful in other contexts. As we shall see, the diagrams are amazingly useful, allowing us to pivot from knots to other ideas and fields and then back to topology again.

The Reidemeister moves are illustrated in Fig. 4. The moves shown in Fig. 4 are intended to indicate changes that are made in a larger diagram. These changes modify the diagram only locally as shown in the figure. Figure 5 shows a sequence of Reidemeister moves from one diagram for a trefoil knot to another. In this illustration we performed two instances of the second Reidemeister move in the first step and a combination of the second move and the third move in the second step, and used “move zero” (a topological rearrangement that does not change any of the crossing patterns) in the last step. Move zero is as important as the other Reidemeister moves, but since it does not change any essential diagrammatic relationships it is left in the background of the discussion.

C. Knots as Analog Computers

We end this section with one more illustration. This time we take the bowline and close it into a loop. A deformation then reveals that the closed-loop form of the bowline

FIGURE 4 Reidemeister moves.

is topologically equivalent to two trefoils clasping one another, as shown in Fig. 6.

This deformation was discovered by making a bowline in a length of rope, closing it into a loop, and fooling about with the rope until the nice pair of clasped trefoils appeared. Note that there is more than one way to close the bowline into a loop. Figure 6 illustrates one choice. After discovering them, it took some time to find a clear pictorial pathway from the closed-loop bowline to the clasped trefoils. The pictorial pathway shown in Fig. 6 can be easily expanded to a full sequence of Reidemeister moves. In this way the model of the knot in real rope is an analog computer that can help to find sequences of deformations that would otherwise be overlooked.

It is a curious reversal of roles that the original physical object of study becomes a computational aid for getting insight into the mathematics. Of course this is really a two-way street. The very close fit between the mathematical model for knots and the topological properties of actual knotted rope is the key ingredient. Knots are analogous to integers. Just as we believe that objects follow the laws of arithmetic, we believe that the topological properties of knotted rope follow the laws of knot topology.

II. INVARIANTS OF KNOTS AND LINKS, A FIRST PASS

We want to be able to calculate numbers (or bits of algebra such as polynomials) from given link diagrams in such a way that these numbers do not change when the diagrams are changed by Reidemeister moves. Numbers or polynomials of this kind are called invariants of the knot or link represented by the diagram. If we produce such invariants, then we are finding topological information about the knot or link. The easiest example of such an invariant is the linking number of two curves, which measures how many times one curve winds around another. In order to calculate the linking number we orient the curves. This means that each curve is equipped with a directional arrow, and we keep track of the direction of the arrow when the curve is deformed by the Reidemeister moves. If the

FIGURE 5 Deforming a Braid to the Trefoil Knot.


FIGURE 6 The Bowline.

curves A and B are represented by an oriented link diagram with two components, attach a sign (+1 or −1) to each crossing as in Fig. 7. Then the linking number Lk(A, B) is the sum of these signs over all the crossings of A with B.

Of course, two singly linked rings receive linking number equal to +1 or −1 as shown in Fig. 8.

It can be shown that the linking number is invariant under the Reidemeister moves. That is, if we take a given diagram D (representing the curves A and B) and change it

FIGURE 7 Crossing Signs.

to a new diagram E by applying one of the Reidemeister moves, then the linking number calculation for D will be the same as the calculation for E. The calculation is unaffected by the first Reidemeister move because self-crossings of a single curve do not figure in the calculation of the linking number. The second Reidemeister move either creates or destroys two crossings of opposite sign, and the third move rearranges a configuration of crossings without changing their signs.

With these observations we have in fact proved that the singly linked rings are indeed linked! There is no possible sequence of Reidemeister moves from these rings to two separated rings because the linking number of separated rings is equal to zero, not to plus or minus one.

It may seem a minor accomplishment to prove something as obvious as the inseparability of this simple configuration, but it is the first step in the successful application of algebraic topology to the study of knots and links. The linking number has a long and interesting history, and there are a number of ways to define it, many considerably more complicated than the sum of diagrammatic signs. We shall discuss some of these alternative definitions at the end of this section.

One of the most fascinating aspects of the linking number is its limitations as an invariant. Figure 9 shows the Whitehead link, a link of two components with linking number equal to zero. The Whitehead link is indeed linked, but it requires methods more powerful than the linking number to demonstrate this fact.

FIGURE 9 The Whitehead link.

Another example of this sort is the Borromean (or Ballantine) rings as shown in Fig. 10. These three rings are topologically inseparable, but if any one of them is ignored, then the other two are not linked.

Just in case these last few examples leave the reader pessimistic about the prospects of the linking number, here is a positive application. We shall use the linking number to show that the Möbius strip is not topologically equivalent to its mirror image. The Möbius strip is a circular band with a half twist in it as illustrated in Fig. 11. The Möbius

FIGURE 8 Two Linking Numbers.


FIGURE 10 The Borromean rings.

is a justly famous example of a surface with only one side and one edge. An observer walking along the surface goes through the half-twist and arrives back where she started only to discover that she is on the other local side of the band! It requires another trip around the band to return to the original local side. As a result there is only one side to the surface in the global sense. It is as though the opposite side of the world were infinitesimally close to us by drilling into the ground, but a full circumnavigation of the globe away by external travel.

To make matters even more surprising, there are actually two Möbius bands, depending on the sense of the half twist. Call them M and M∗ as illustrated in Fig. 11. If one makes these two Möbius bands from strips of paper and tries to deform one into the other without tearing the paper, one will fail (Try it!).

How can we understand the topological nature of the handedness of the Möbius band M? Draw a curve C down the center of the band M as shown in Fig. 11. Compare this curve with the space curve formed by the boundary of the band. Orient these curves in parallel and compute the linking number. It is +1. The very same calculation for the mirror image band M∗ yields the linking number of −1.

If it were possible topologically to deform M to M∗, then the corresponding links (formed by the core curve and the boundary curve of the band) would be topologically equivalent, and hence they would have the same linking number. Since this is not the case, we conclude that M cannot be deformed to M∗.

We have shown that there are two topologically distinct Möbius bands. The two bands are mirror images of one another in the sense that each looks like the image of the other in a reflecting mirror. When an object is topologically inequivalent to its mirror image, it is said to be chiral. We have demonstrated the chirality of the Möbius band.

A. Three Coloring a Knot

There is a remarkable proof that the trefoil knot is knotted. This proof goes as follows: Color the three arcs of the trefoil diagram with three distinct colors, say red, blue, and purple (Fig. 12). Note that in the standard trefoil diagram three distinct colors occur at each crossing. Now adopt the following coloring rule:

• Either three colors or exactly one color occur at any crossing in the colored diagram.

Call a diagram colored if its arcs are colored and they satisfy this rule. Note that the standard unknot diagram is colored by simply assigning one color to its circle. A coloring does not necessarily have three colors on a given diagram. Call a diagram three-colored if it is colored and three colors actually appear on the diagram.

Theorem. Every diagram that is obtained from the standard trefoil diagram by Reidemeister moves can be three-colored. Hence the trefoil diagram represents a knot.

Rather than write a formal proof of this theorem, we illustrate the coloring process in Figs. 13 and 14. Each time a Reidemeister move is performed, it is possible to extend the coloring from the original diagram to the diagram that is obtained from the move. These extensions of colorings involve only local changes in the colorings of the original diagrams. The best way to see that this proof works is to do a few experiments. Figures 13 and 14 are a start.

Note that in the case of the second move performed in the simplifying direction, although a color is lost in the arc that disappears under the move, this color must appear elsewhere in the diagram or else it is not possible for the two arcs in the move to have different colors (since there is a path along the knot from one local arc to the other). Thus three-coloration is preserved under Reidemeister moves, whether they make the diagram simpler or more complicated. As a result, every diagram for the trefoil knot can be colored with three colors according to our rules. This proves that the trefoil is knotted, since an unknotted trefoil would have a simple circle among its diagrams, and the simple circle can be colored with only one color.

B. The Quandle and the Determinant of a Knot

There is a wide generalization of this coloring argument. We shall replace the colors by arbitrary labels for the arcs

FIGURE 11 Möbius and mirror Möbius.

in the diagram and replace the coloring rule by a method for combining these labels. It turns out that a good way to articulate such a rule of combination is to make the label on one of the undercrossing arcs at a crossing a product (in the sense of this new mode of combination) of the labels of the other two arcs. In fact, we shall assume that this product operation depends upon the orientation of the arcs as shown in Fig. 15.

In Fig. 15 we show how a label A on an undercrossing arc combines with a label B on an overcrossing arc to form C = A ∗ B or C = A # B depending upon whether the overcrossing arc is oriented to the left or to the right for an observer facing the overcrossing line and standing on the arc labeled A.

This operation depends upon the orientation of the line labeled B, so that A ∗ B corresponds to B pointing to the right for an observer approaching the crossing along A, and A # B corresponds to B pointing to the left for the same observer. All of this is illustrated in Fig. 15.

The binary operations ∗ and # are not necessarily associative. For example, our original color assignments of R (red), B (blue), and P (purple) for the trefoil knot correspond to products R ∗ R = R, B ∗ B = B, P ∗ P = P, R ∗ B = P,

B ∗ P = R, P ∗ R = B. Then R ∗ (B ∗ P) = R ∗ R = R and (R ∗ B) ∗ P = P ∗ P = P.
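The failure of associativity can be checked mechanically. The sketch below assumes one way of encoding the color products listed above: equal colors give that color, and two distinct colors give the third. The function name `star` and the tuple `COLORS` are illustrative choices, not notation from the text.

```python
COLORS = ("R", "B", "P")  # red, blue, purple

def star(x, y):
    """Three-coloring product: x*x = x, and two distinct colors give the third."""
    if x == y:
        return x
    return next(c for c in COLORS if c not in (x, y))

# The two bracketings from the text.
left = star("R", star("B", "P"))    # R * (B * P)
right = star(star("R", "B"), "P")   # (R * B) * P
print(left, right)  # R P

# Collect every triple on which the two bracketings disagree.
non_associative = [
    (x, y, z)
    for x in COLORS for y in COLORS for z in COLORS
    if star(x, star(y, z)) != star(star(x, y), z)
]
print(len(non_associative), "of 27 triples are non-associative")
```

Listing the disagreeing triples, rather than testing a single one, makes it easy to see that the two bracketings agree exactly when the first and last labels coincide.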
We shall insist that these operations satisfy a number of identities so that the labeling is compatible with the Reidemeister moves.

Figure 16 illustrates the diagrammatic justification for the following algebraic rules about ∗ and #. An algebraic system satisfying these rules is called a quandle.

1. A ∗ A = A and A # A = A for any label A.
2. (A ∗ B) # B = A and (A # B) ∗ B = A for any labels A and B.
3. (A ∗ B) ∗ C = (A ∗ C) ∗ (B ∗ C) and (A # B) # C = (A # C) # (B # C) for any labels A, B, C.
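For a finite set of labels the three rules can be verified by brute force. The following sketch is one possible checker; the function name `is_quandle` and the two sample operations are assumptions made for illustration. It confirms that the three-color product (with ∗ and # taken to be the same operation) satisfies all three rules, while ordinary addition and subtraction modulo 3 do not.

```python
from itertools import product

def is_quandle(elements, star, sharp):
    """Check rules 1-3 for the binary operations star (*) and sharp (#)."""
    elements = list(elements)
    rule1 = all(star(a, a) == a and sharp(a, a) == a for a in elements)
    rule2 = all(
        sharp(star(a, b), b) == a and star(sharp(a, b), b) == a
        for a, b in product(elements, repeat=2)
    )
    rule3 = all(
        star(star(a, b), c) == star(star(a, c), star(b, c))
        and sharp(sharp(a, b), c) == sharp(sharp(a, c), sharp(b, c))
        for a, b, c in product(elements, repeat=3)
    )
    return rule1 and rule2 and rule3

COLORS = ("R", "B", "P")

def third_color(x, y):
    """Three-coloring product: x*x = x, and two distinct colors give the third."""
    if x == y:
        return x
    return next(c for c in COLORS if c not in (x, y))

def add_mod3(a, b):
    return (a + b) % 3

def sub_mod3(a, b):
    return (a - b) % 3

print(is_quandle(COLORS, third_color, third_color))  # True
print(is_quandle(range(3), add_mod3, sub_mod3))      # False: rule 1 fails, since 1 + 1 != 1
```

The second example is included only as a contrast: it satisfies rule 2 but not rule 1, which shows that the rules are independent constraints rather than consequences of one another.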

These rules correspond, respectively, to the Reidemeister moves 1–3. Labelings that obey these rules can

FIGURE 12 The three-colored trefoil.

FIGURE 13 Inheriting coloring under the type 2 move.


FIGURE 14 Coloring under type 2 and 3 moves.

be handled just like the three-coloring that we have already studied. In particular, a given labeling of a knot diagram means that it is possible to label (satisfying the rules given above for the labels) any diagram that is related to it by a sequence of Reidemeister moves. However, not all the labels will necessarily appear on every related diagram, and for a given coloring scheme and a given knot, certain special restrictions can arise.

To illustrate this, consider the color rule for numbers: A ∗ B = A # B = 2B − A. This satisfies the axioms, as is easy to see. Figure 17 shows how, on the trefoil, such a coloring must obey the equations A ∗ B = C, C ∗ A = B, B ∗ C = A. Hence 2B − A = C, 2A − C = B, 2C − B = A. For example, if A = 0 and B = 1, then C = 2B − A = 2 and A = 2C − B = 4 − 1 = 3. We need 3 = 0. Hence this system of equations will be satisfied for appropriate labelings in Z/3Z, the integers modulo three, a modular number system.

For the reader unfamiliar with the concept of modular number system, consider a standard clock whose dial is labeled with the hours 1, 2, 3, . . . , 11, 12. We ask what time is it 4 hr past the hour of 10? The answer is 2, and one can say that in the arithmetic of this clock 10 + 4 = 2. In

FIGURE 15 The quandle operation.

FIGURE 16 Quandle identities.


FIGURE 17 Equations for the trefoil knot.

fact 12 = 0 in this arithmetic because adding 12 hr to the time does not change the time indicated on the clock. We work in clock arithmetic by remembering to set blocks of 12 hr to zero. One can multiply in this arithmetic as well. The square of the present time is 1 o’clock; what time is it? One answer is 7, since 7 squared is 49 and 49 is equal to 1 on the clock.

We say that the clock represents a modular number system Z/12Z with modulus 12. It is convenient in mathematics to think of the elements of Z/12Z as the set {0, 1, 2, . . . , 11}. Since 0 = 12 this takes care of all the hours.

In general we can consider Z/nZ, where n is any positive integer modulus. The resulting modular number system has elements {0, 1, 2, . . . , n − 1} and is handled just as though there were a clock with n hours rather than 12. In such a system one says that x = y (mod n) if the difference between x and y is divisible by n. For example, 49 = 1 (mod 12) since 49 − 1 = 48 is divisible by 12.

The modular number system Z/3Z reproduces exactly the three coloring of the trefoil, and we see that the number 3 emerges as a characteristic of the equations associated with the knot. In fact, 3 is the value of a determinant that is associated with these equations, and its absolute value is an invariant of the knot. For more about this construction, see Kauffman (1993, Part 1, Chapter 13).

Here is another example: For the figure eight knot E, we have that the modulus is 5. This shows that E is indeed knotted and that it is distinct from the trefoil. We can color (label) the figure eight knot with five “colors” 0, 1, 2, 3, and 4 with the rule A ∗ B = 2B − A (mod 5). See Fig. 18.

Note that in coloring the figure eight knot we have only used four of the five available “colors” from the set {0, 1, 2, 3, 4}. Figure 18 uses the colors 0, 1, 2, and 4. The

FIGURE 18 Five colors for the figure eight knot.


coloring number of a knot or link K is the least number of colors (greater than 1) needed to color it in the 2B − A fashion for any diagram of K. It is a nice exercise to verify that the coloring number of the figure eight knot is indeed four. In general the coloring number of a knot or link is not easy to determine. This is an example of a topological invariant that has subtle combinatorial properties.

Other knots and links mentioned in this section can be shown to be knotted and linked by the modular method. The reader should try it for the Borromean rings and the Whitehead link.

The coloring (labeling) rules as we have formalized them can be described as axioms for an algebra associated with the knot. This is called the quandle. It has been generalized to the crystal, the interlock algebra, and the rack. The quandle is itself a generalization of the fundamental group of the knot complement.

C. The Alexander Polynomial

The modular labeling method has a marvellous generalization to the Alexander polynomial of the knot. This comes about through generalized coloring rules $A * B = tA + (1 - t)B$ and $A \# B = t^{-1}A + (1 - t^{-1})B$, where t is an indeterminate. It is a nice exercise to verify that these rules satisfy the axioms for the quandle. This algebraic structure is called the Alexander module.

The case t = −1 gives the rule 2B − A that we have already considered. By coloring diagrams with arbitrary t, we obtain a polynomial that generalizes the modulus. This polynomial is the Alexander polynomial.

Alexander (1923) described it differently in his original paper, and there is a remarkable history to the development of this invariant. See, for example, Crowell and Fox (1963) and Kauffman (1987b) for more information. The flavor of this relationship can be seen by doing a little experiment in labeling the trefoil diagram shown in Fig. 19. The circularity inherent in the knot diagram results in relations that must be satisfied by the module action. In Fig. 19 we see directly by labeling the diagram that if arc 1 is labeled 0 and arc 2 is labeled A, then $(t + (1 - t)^2)A = 0$. In fact, $t + (1 - t)^2 = t^2 - t + 1$ is the Alexander polynomial of the trefoil knot. The Alexander polynomial is an algebraic modulus for the knot.

III. THE JONES POLYNOMIAL

Our next topic describes an invariant of knots and links of quite a different character than the modulus or the Alexander polynomial of the knot. It is a “polynomial” invariant of knots and links discovered by Jones (1985). Jones’ invariant, usually denoted $V_K(t)$, is a polynomial in the variable $t^{1/2}$ and its inverse $t^{-1/2}$. One says that $V_K(t)$ is a Laurent polynomial in $t^{1/2}$. Superficially, the Jones polynomial appears to be just another polynomial invariant of knots and links, somewhat similar to the Alexander polynomial. When I say that the Jones polynomial is of a different

FIGURE 19 Alexander polynomial of the trefoil knot.


character, I mean something deeper, and it will take a little while to explain this difference. A little history will help. The Alexander polynomial was discovered in the 1920s, and until 1984 no one had found another polynomial invariant of knots and links that was not a simple generalization of the Alexander polynomial. Jones discovered a new polynomial invariant of knots and links that had some very remarkable properties. The Alexander polynomial cannot detect the difference between any knot and its mirror image. What made the Jones polynomial such an exciting discovery for knot theorists was the fact that it could detect the difference between many knots and their mirror images. Later, other properties began to emerge. It became a key tool in proving properties of alternating links (and generalizations) that had been conjectured since the last century (see, e.g., Murasugi, 1987a, b; Kauffman, 1987a; and Thistlethwaite, 1987).

It turns out that the Jones polynomial is intimately related to a number of topics in mathematical physics. Curiously, it is actually easier to define and verify the properties of the Jones polynomial than for any other invariant in the theory of knots (except of course the linking number). We shall devote this section to the defining properties of the Jones polynomial, and later sections to the relationships with physics.

Here is a set of axioms for the Jones polynomial. The polynomial was not discovered in the form of these axioms. The axioms are in a format analogous to the framework that Conway (1970; Kauffman, 1980, 1983) discovered for the Alexander polynomial. I am starting with these axioms because they give quick access to the polynomial and to sample computations.

1. Axioms for the Jones Polynomial

1. If two oriented links $K$ and $K'$ are ambient isotopic, then $V_K(t) = V_{K'}(t)$.
2. If U is an unknotted loop, then $V_U(t) = 1$.
3. If $K_+$, $K_-$, and $K_0$ are three links with diagrams that differ only as shown in the neighborhood of a single crossing site for $K_+$ and $K_-$ (see Fig. 20), then

$$t^{-1} V_{K_+}(t) - t V_{K_-}(t) = (t^{1/2} - t^{-1/2}) V_{K_0}(t).$$

FIGURE 20 Skein Triple.

The axioms for $V_K(t)$ are a consequence of Jones’ original definition of his invariant. He was led to this invariant by a trail that began with the study of von Neumann algebras (a branch of algebra directly related to quantum theory and to statistical mechanics) and ended in braids, knots, and links. The Jones polynomial has a distinctly different flavor from the Conway–Alexander polynomial, even though it can be axiomatized in a very similar way. In fact, this similarity of axiomatics points to a common generalization [the Homfly(Pt) polynomial] and to another generalization (the Kauffman polynomial) and then to further generalizations in connection with statistical mechanics (see, e.g., Kauffman, 1989).

To this date no one has found a knotted loop that the Jones polynomial does not declare to be knotted. Thus one can make the following conjecture:

Conjecture. If a single-component loop K is knotted, then $V_K(t)$ is not equal to one.

While it is possible that the Jones polynomial is able to detect the property of being knotted, it is not a complete classifier for knots. There are inequivalent pairs of knots that have the same Jones polynomial. Such a pair is shown in Fig. 21. These two knots, the Kinoshita–Terasaka knot and the Conway knot, have the same Jones polynomial but are different topologically. Incidentally, these two knots are examples of knots whose knottedness cannot be detected by the Alexander polynomial.

Let us use the axioms to compute the Jones polynomial for the trefoil knot. To this end, there is a useful device called the skein tree. A skein tree is obtained from a given knot or link diagram by recording the knots and links obtained from this diagram by smoothing or switching crossings. Each node of the tree is a knot or link. The nodes farthest from the original knot or link are unknotted or unlinked. Such a tree can be produced from a given knot or link by using the fact that any knot or link diagram can

be transformed into an unknotted (unlinked) diagram by a sequence of crossing switches.

FIGURE 21 (Top) Conway Knot. (Bottom) Kinoshita-Terasaka Knot. Two Knots with trivial Alexander Polynomial.

Figure 22 illustrates a “standard unknot diagram.” This diagram is drawn by starting at the arrowhead in the figure and tracing the diagram in such a way that one always draws an overcrossing before drawing an undercrossing. This is the easiest possible knot diagram to draw since one never has to make any corrections: one just passes under when one wants to cross an already created line in the diagram. Standard unknot diagrams are always unknotted. Trying the one in Fig. 22 will show why this is so.
FIGURE 22 A standard unknot.

Using the fact that standard unknot diagrams are available, we can use the difference between a given diagram K and a standard unknot with the same plane projection to give a procedure for switching crossings to unknot the diagram K. This switching procedure can be used to produce a skein tree for calculating the Jones polynomial of K.

Figure 23 shows a skein tree for the computation of the Jones polynomial of the trefoil knot. The tree reduces the calculation of the Jones polynomial of the trefoil diagram to the calculation of certain unknots and unlinks. In order to see how to calculate an unlink it is useful to observe the behavior of the axioms in this case:

$$t^{-1} V_{U_+} - t V_{U_-} = (t^{1/2} - t^{-1/2}) V_{U_0}.$$

Here $U_+$ and $U_-$ denote unknots with single positive and negative twists in them, respectively. $U_0$, obtained by smoothing the crossing of $U_+$ or $U_-$, is an unlinked pair of circles with no twists. See Fig. 24. Since $U_+$ and $U_-$ are unknotted, $V_{U_+} = V_{U_-} = 1$ by axioms 1 and 2. Therefore $(t^{-1} - t) \cdot 1 = (t^{1/2} - t^{-1/2}) V_{U_0}$. Hence

$$d = V_{U_0} = (t^{-1} - t)/(t^{1/2} - t^{-1/2}) = -(t^{1/2} + t^{-1/2}).$$

Thus, we see that an extra unknotted component of the link multiplies the invariant by $d = -(t^{1/2} + t^{-1/2})$. For T the trefoil knot, U the unknot, and L the link of two unknotted circles as shown in Fig. 23, we find that

$$t^{-1} V_T - t V_U = (t^{1/2} - t^{-1/2}) V_L,$$
$$t^{-1} V_L - t d = (t^{1/2} - t^{-1/2}) V_U.$$

Thus, $V_L = t(td + (t^{1/2} - t^{-1/2})) = -t^{5/2} - t^{1/2}$. Hence

$$V_T = t(t + (t^{1/2} - t^{-1/2}) V_L)$$
$$= t(t + (t^{1/2} - t^{-1/2})(-t^{5/2} - t^{1/2}))$$
$$= t(t - t^3 - t + t^2 + 1) = t(-t^3 + t^2 + 1)$$
$$= -t^4 + t^3 + t.$$

The same calculation applied to the mirror image $T^*$ of the trefoil (obtained by reversing all the crossings of T) yields the invariant $V_{T^*} = -t^{-4} + t^{-3} + t^{-1}$. This shows how the Jones polynomial discriminates between the trefoil and its mirror image, thereby proving that there is no ambient isotopy from T to $T^*$.

This method of calculating the Jones polynomial from its axioms does not tell us why the invariant works. It is possible to analyze this method of calculation and show that it does not depend upon the choices that one makes in the process and that it gives topological information about the knot or link in question. There is a different way to proceed that leads to a very nice formula for the Jones polynomial as a sum over “states” of the diagram. In this formulation, the polynomial is well defined from the beginning, and we can see the topological invariance arise in the course of adjusting certain parameters of a well-defined function. Our next topic is this state summation model for the Jones polynomial.

IV. THE BRACKET STATE SUM

In the last section we gave axioms for the Jones polynomial and showed how to compute it by skein calculations from these axioms. In this section we shall show one way to prove that the Jones polynomial is well defined by these axioms and that it is an invariant of ambient isotopy of links in three-dimensional space. In order to accomplish this aim, we shall give a different definition of the polynomial as a certain summation over combinatorial configurations associated with the given link diagram. This summation will be called a state summation model for the Jones polynomial.

In fact, we shall first construct a state summation called the bracket polynomial and then explain how to modify the bracket polynomial to obtain the Jones polynomial. The bracket polynomial has a rather natural development, and is defined for unoriented link diagrams.

We work with diagrams for unoriented knots and links. To each crossing in the diagram assign two local states
with labels A and B as shown in Fig. 25. (In the A state the regions swept out by a counterclockwise turn of the overcrossing line are joined. In the B state the regions swept out by a clockwise turn of the overcrossing line are joined.)

FIGURE 23 Trefoil skein tree.

A state S of a diagram K consists in a choice of local state for each crossing of K. Thus a diagram with N crossings will have $2^N$ states. Two states $S$ and $S'$ of the trefoil diagram are indicated in Fig. 26. States are evaluated in two ways. These ways are denoted by $\langle K \mid S \rangle$ and by $\|S\|$. The norm of the state S, $\|S\|$, is defined to be one less than the number of closed curves in the plane described by S. In the example in Fig. 26, we have $\|S\| = 1$ and $\|S'\| = 0$. The evaluation $\langle K \mid S \rangle$ is defined to be the product of all the state labels (A and B) in the state. Thus, in Fig. 26, we have $\langle K \mid S \rangle = A^3$ and $\langle K \mid S' \rangle = A^2 B$.
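The enumeration of states can be carried out mechanically. The sketch below is illustrative only and rests on assumptions that are not in the text: the trefoil is encoded as a planar-diagram code (each crossing lists its four incident edge labels), and the rule for which pairing of edges counts as the A smoothing is a convention chosen here so that the all-A state has two loops, in agreement with $\|S\| = 1$ for the state with evaluation $A^3$. The names `TREFOIL_PD`, `count_loops`, and `state_table` are hypothetical.

```python
from itertools import product

# Hypothetical planar-diagram code for a trefoil: each crossing lists the
# labels of its four incident edges; every label appears exactly twice.
TREFOIL_PD = [(1, 4, 2, 5), (3, 6, 4, 1), (5, 2, 6, 3)]

def count_loops(pairs):
    """Count closed curves when the edges are glued along the given pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v in pairs:
        parent[find(u)] = find(v)
    return len({find(x) for x in list(parent)})

def state_table(pd):
    """Map each labeling such as 'AAB' to the norm of that state."""
    table = {}
    for labels in product("AB", repeat=len(pd)):
        pairs = []
        for (a, b, c, d), label in zip(pd, labels):
            if label == "A":
                pairs += [(a, d), (b, c)]   # assumed A smoothing
            else:
                pairs += [(a, b), (c, d)]   # assumed B smoothing
        table["".join(labels)] = count_loops(pairs) - 1  # norm = loops - 1
    return table

table = state_table(TREFOIL_PD)
for labels, norm in sorted(table.items()):
    print(labels, "evaluation A^%d B^%d" % (labels.count("A"), labels.count("B")),
          "norm", norm)
```

For this encoding the eight states have norm 1 when the number of B labels is zero or two, norm 0 when it is one, and norm 2 when it is three.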
FIGURE 24 Skein Triple for an Unknot.

FIGURE 25 States of a Crossing.

Taking variables A, B, and d, we define the state summation associated with a given diagram K by the formula

$$\langle K \rangle = \sum_S \langle K \mid S \rangle \, d^{\|S\|}.$$

In other words, for each state we take the product of the labels for that state multiplied by d raised to the norm of the state (one less than the number of loops in the state). $\langle K \rangle$ is the summation of this state evaluation over all the states in the diagram for K. (The notation $\sum_S$ means “sum over all instances of S.”) We will show that the state summation $\langle K \rangle$ is invariant under the second and third Reidemeister moves if we take $B = 1/A$ and $d = -(A^2 + B^2)$. A normalization then enables us to obtain invariance under all three Reidemeister moves, and hence topological information about knots and links. [See Kauffman (1987a) for more information about the bracket and its relationship with the Jones polynomial.] There is a great deal of topological information in the calculations that ensue from the bracket polynomial. In particular, one can distinguish many knots from their mirror images, and it is possible that the bracket calculation can detect whether a given diagram is actually knotted.

A. First Steps in Bracketology

The first constructions related to the bracket polynomial are quite elementary. There are two basic formulas that are reminiscent of the exchange relations we have already seen for the Jones polynomial. These formulas are

as shown in Fig. 27. Here the small diagrams indicate parts of larger diagrams that are otherwise identical. Formula 1 just says that the state summation breaks up into two sums with respect to a given crossing in the diagram. In one sum, we have made a smoothing of type A at the crossing, while in the other sum we have made a smoothing of type B. The factors of A and B indicated in the formula are the contributions to the product of vertex weights from this crossing. All the rest of the two partial sums can be interpreted as bracket evaluations of the smoothed diagrams.

FIGURE 26 Two States of the Trefoil Knot.

Formula 2 in Fig. 27 just states that an extra, simple closed curve in a diagram multiplies its bracket evaluation by the loop value d. Note that a single loop receives the value 1.

FIGURE 27 Bracket equations.

With the help of these two formulas, we can compute some basic bracket evaluations. Note that we have not yet specialized the variables A, B, and d. We shall analyze just what specialization will produce an invariant of knots and links. The advantage of having set up the definition of the bracket polynomial in this way is exactly that we have a method of labeling link diagrams with algebra, and it is possible to then adjust the evaluation so that it is invariant under Reidemeister moves. To this end, the next lemma tells us how the general bracket behaves under a Reidemeister move of type two. Essential diagrams for this lemma are in Fig. 28.

Lemma. Let $K$ be a given link diagram, and let $K'$ denote a diagram that is obtained from $K$ by performing a type 2 Reidemeister move in the simplifying direction (eliminating two crossings from $K$). Let $K''$ be the diagram obtained from $K$ by replacing the site of the type 2 move by two arcs in the opposite

pattern to the form of the simplified site in $K'$. (The diagrams in Fig. 28 illustrate this construction.) Then

$$\langle K \rangle = AB \langle K' \rangle + (ABd + A^2 + B^2) \langle K'' \rangle.$$

FIGURE 28 Type 2 changes.

Proof. Consider the four local state configurations that are obtained from the diagram $K$ on the left-hand side of the equation, as illustrated in Fig. 28. The formula follows from the fact that one of these states has coefficient $AB$ and the other three have the same underlying diagram and respective coefficients $ABd$ (after converting the loop to a value d), $A^2$, and $B^2$. This completes the proof of the Lemma.

With the help of this lemma it is now obvious that if we choose $B = 1/A$ and $d + A^2 + B^2 = 0$, then $\langle K \rangle$ is invariant under the second Reidemeister move.

Once this choice is made, the resulting specialized bracket is invariant under the third Reidemeister move, as illustrated in Fig. 29.
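Spelled out, the invariance under the second move is a two-line substitution into the formula of the lemma. The following worked check uses only that formula and the two stated choices of B and d:

```latex
\begin{align*}
\langle K \rangle
  &= AB\,\langle K' \rangle + (ABd + A^2 + B^2)\,\langle K'' \rangle \\
  &= \langle K' \rangle + (d + A^2 + A^{-2})\,\langle K'' \rangle
     && (B = A^{-1},\ \text{so } AB = 1) \\
  &= \langle K' \rangle
     && (d = -A^2 - A^{-2}).
\end{align*}
```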
FIGURE 29 Type 3 invariance of the bracket.

Finally, we can investigate bracket behavior under the first Reidemeister move.

Lemma. Let $\langle K \rangle$ denote the bracket state sum with $B = A^{-1}$ and $d = -A^2 - A^{-2}$. Then $\langle K \rangle$ is invariant under the Reidemeister moves 2 and 3 and on move 1 behaves as shown:

$$\langle K(+) \rangle = (-A^3) \langle K \rangle,$$
$$\langle K(-) \rangle = (-A^{-3}) \langle K \rangle.$$

Here $K(+)$ denotes a diagram with a simplifying move of type 1 available where the crossing that is to be removed has type +1. $K$ is the diagram obtained from $K(+)$ by doing the type 1 move. Similarly, $K(-)$ denotes a diagram
with a simplifying move of type 1 available where the crossing that is to be removed has type −1. Figure 30 illustrates the diagrams for $K(+)$ and $K(-)$.

FIGURE 30 Bracket under type 1 move.

Proof. See Fig. 30 for the behavior under type 1 moves. We have already verified the other statements in this lemma.

B. Framing Philosophy, Twist and Writhe

Is it unfortunate that the bracket is not invariant under the first Reidemeister move? No, it is fortunate! First of all, the matter is easy to fix by a little adjustment: Let K be an oriented knot or link, and define the writhe of K, denoted $w(K)$, to be the sum of the signs of all the crossings in K. Thus, the writhe of the right-handed trefoil knot is three. The writhe has the following behavior under Reidemeister moves:

(i) $w(K)$ is invariant under the second and third Reidemeister moves.
(ii) $w(K)$ changes by plus or minus one under the first Reidemeister move:

$$w(K(+)) = w(K) + 1,$$
$$w(K(-)) = w(K) - 1.$$

(Here we use the notation of the previous lemma as shown in Fig. 30.)

Thus the writhe behaves in a parallel way to the bracket on the type 1 moves, and we can combine writhe and bracket to make a new calculation that is invariant under all three Reidemeister moves. We call the fully invariant calculation the “f-polynomial” and define it by the equation

$$f_K(A) = (-A^3)^{-w(K)} \langle K \rangle (A).$$

Up to this normalization, the bracket gives a model for the original Jones polynomial. The precise relationship is that $V_K(t) = f_K(t^{-1/4})$, where $w(K)$ is the sum of the crossing signs of the oriented link K, and $\langle K \rangle$ is the bracket polynomial obtained by ignoring the orientation of K.

We shall return to this relationship with the Jones polynomial in a moment, but first a little extra mathematical philosophy: Another way to view the fact of the bracket’s lack of invariance under the first Reidemeister move is to see that the bracket is an invariant of knotted and linked bands embedded in three-dimensional space. Regard a link diagram as shorthand for an embedding of bands as shown in Fig. 31.

Figure 31 illustrates a link diagram for the trefoil knot in a thick, dark mode of drawing. This diagram is juxtaposed
with a drawing of a knotted band that parallels that knot diagram. The band has two boundary components that proceed (mostly in the plane) parallel to one another. The curl in the knot diagram becomes a flat curl in the band that is ambient isotopic to a full twist (two half-twists) in the band. This isotopy is indicated in Fig. 31.

FIGURE 31 Bands and twists.

The top of Fig. 31 shows a full twist in a band and two flat curls that both give rise to this same full twist by ambient isotopy that leaves their ends fixed. Each component of a link diagram is replaced by a paralleled version: the analog of a ribbon-like strip of paper attached to itself with an even number of half-twists. The first Reidemeister move no longer applies to this shorthand since we can, at best, replace a curl by a twist as shown in Fig. 31.

In fact, as Fig. 31 shows, there are two distinct curls corresponding to a single full twist of a band. The bracket (and the writhe) behave the same way on both of these twists. This means that we can reinterpret the bracket as an invariant of the topological embeddings of knotted, linked, and twisted bands in three-dimensional space. This means
that the bracket has a fully three-dimensional interpretation, although its definition depends upon the use of planar projections.

C. Calculating the Bracket

Figure 32 shows a tree for calculating the bracket polynomial of the trefoil knot T.

FIGURE 32 Tree for bracket of trefoil.

It follows at once from the behavior of the bracket on curls that the contributions of the three branches of this tree (those farthest from the trefoil itself) add to give the bracket polynomial of the trefoil:

$$\langle T \rangle = A^2(-A^3) + A A^{-1}(-A^{-3}) + A^{-1}(-A^{-3})^2 = -A^5 - A^{-3} + A^{-7}.$$

Hence, since $w(T) = 3$,

$$f_T = (-A^3)^{-w(T)} \langle T \rangle = A^{-4} + A^{-12} - A^{-16}.$$

Note that we needed only three branches in the tree for this calculation rather than the full expansion into the eight states. A savings like this is always possible because we know how the bracket behaves on curls. The resulting expansion gives a sum of monomials and is useful for thinking about the properties of the invariant.

D. Mirror Mirror

The knot $K^*$ obtained by reversing all the crossings of K is called the mirror image of K. The knot $K^*$ is the mirror

image of the knot that would ensue if the plane on which the knot is drawn were a mirror. It is easy to see that $\langle K^* \rangle (A) = \langle K \rangle (A^{-1})$ and that $f_{K^*}(A) = f_K(A^{-1})$. Thus, if K is ambient isotopic to $K^*$ (all three Reidemeister moves allowed), then

$$f_K(A) = f_{K^*}(A) = f_K(A^{-1}).$$

Returning to the evaluation of the f-invariant for the trefoil, note that $f_T(A^{-1})$ is not equal to $f_T(A)$. Therefore, the trefoil knot T and its mirror image $T^*$ are topologically distinct. The proof that we have given for it is the simplest proof known to this author. Note that we have given a complete proof of this fact, starting with the Reidemeister moves, constructing and applying the bracket invariant.

A knot is said to be chiral if it is not ambient isotopic to its mirror image. The words chiral and chirality come from physical chemistry and natural science. A knot that is equivalent to its mirror image is said to be achiral (or amphicheiral in the speech of knot theorists). Many knots are achiral. The reader may enjoy verifying that the figure eight knot shown in Fig. 33 is ambient isotopic to its mirror image.

FIGURE 33 Figure eight knot and its mirror image.

A complete understanding of the problem of determining whether a knot is chiral remains in the far distance. The new invariants of knots and links have enhanced our understanding of this difficult question.

E. Return to the Jones Polynomial

Now let us verify that the bracket does indeed give a model for the Jones polynomial. To see this, consider $f_K(A) = (-A^3)^{-w(K)} \langle K \rangle (A)$. Since the writhe $w(K)$ is obtained by summing signs over all the crossings of K, we can interpret the factor $(-A^3)^{-w(K)}$ as the product of contributions of $(-A^3)$ or $(-A^3)^{-1}$, one from each crossing and depending upon the sign of the crossing. Thus we can write an oriented state expansion formula for $f_K$ as shown below, where $K_+$ and $K_-$ denote links with corresponding sites with oriented crossings, $K_0$ is the result of smoothing the crossing in an oriented fashion, and $K_\infty$ is the result of smoothing the crossing against the orientation:

$$f_{K_+} = (-A^3)^{-1} A f_{K_0} + (-A^3)^{-1} A^{-1} f_{K_\infty}.$$

Hence,

$$f_{K_+} = -A^{-2} f_{K_0} - A^{-4} f_{K_\infty}$$

and similarly, for a negative crossing,

$$f_{K_-} = -A^{2} f_{K_0} - A^{4} f_{K_\infty}.$$

Letting $V_K(t) = f_K(t^{-1/4})$, we have

$$V_{K_+} = -t^{1/2} V_{K_0} - t V_{K_\infty},$$
$$V_{K_-} = -t^{-1/2} V_{K_0} - t^{-1} V_{K_\infty}.$$

Therefore,

$$t^{-1} V_{K_+} - t V_{K_-} = (t^{1/2} - t^{-1/2}) V_{K_0}.$$

We leave the rest of the verification that $V_K(t)$ is the Jones polynomial (see Section III) to the reader (check that it has the right behavior on unknotted loops).

V. VASSILIEV INVARIANTS

We have seen how it is fundamental to take the difference of an invariant at a positive crossing and at a negative crossing, leaving the rest of the diagram alone. The earliest

instance of this is the Conway polynomial $C_K(z)$ with its exchange identity

$$C_{K_+} - C_{K_-} = z C_{K_0}.$$

Vassiliev (1990) gave new meaning to this sort of identity by thinking of the structure of the entire space of all mappings of a circle into three-dimensional space. This space of mappings includes mappings with singularities where two points on a curve touch. He interpreted the equation

$$Z_{K_+} - Z_{K_-} = Z_{K_\#}$$

as describing the difference of values across a singular embedding $K_\#$, where $K_\#$ has a transverse singularity in the knot space as illustrated in Fig. 34. (In a transverse singularity the curve touches itself along two different directions.)

The Vassiliev formula serves to define the value of the invariant on a singular embedding in terms of the values on two knots “on either side” of this embedding. This Vassiliev formula serves to describe a method of extending a given invariant of knots to a corresponding invariant of embedded graphs with controlled singularities of this transverse type. This idea had been considered before Vassiliev. Vassiliev carried out his program of analyzing the singular knot space using techniques of algebraic topology, and in the course of this investigation he discovered a key concept that had been completely overlooked in the context of graph invariants. That concept is the idea of an invariant of finite type.

Definition. We shall say that $Z_G$ is an invariant of finite type i if $Z_G$ vanishes for all graphs with more than i nodes.

This concept was extracted from Vassiliev’s work by Birman and Lin (1993). A (rigid vertex) invariant of knotted graphs is a Vassiliev invariant of finite type i if it satisfies the identity

$$Z_{K_+} - Z_{K_-} = Z_{K_\#}$$

and is of finite type i in the sense of the definition above. In rigid vertex isotopy the cyclic order at the vertex is preserved, so that the vertex behaves like a rigid disk with flexible strings attached to it at specific points.

Vassiliev invariants form an extraordinary class of knot invariants. It is an open problem whether the Vassiliev invariants are sufficient to distinguish knots that are topologically distinct.

Vassiliev began an analysis of the combinatorial conditions on graph evaluations that could support such invariants. The key observation is the following result:

Lemma. If $Z_G$ is a Vassiliev invariant of finite type i, then $Z_G$ is independent of the embedding of the graph G when G has i vertices.

Proof. Suppose that $G$ is an embedded graph with i nodes. If we switch a crossing in $G$ to form $G'$, then the exchange relation for the Vassiliev invariant says that $Z_G - Z_{G'} = \pm Z_{G''}$, where $G''$ has one more node than $G$ or $G'$ and the sign depends on the sign of the crossing that is switched. But then $G''$ has $i + 1$ nodes and hence $Z_{G''} = 0$. Therefore $Z_G = Z_{G'}$. This shows that we can switch crossings in any embedding of G without changing the value of $Z_G$. It follows from this that $Z_G$ is independent of the embedding and depends only on the graph G. This completes the proof of the lemma.

For a Vassiliev invariant of type i, there is important information in the values it takes on graphs with exactly i nodes. These evaluations do not depend upon the embedding type of the graph. However, not just any such graphical evaluation will extend to give a topological invariant of knots and graphs. There are necessary conditions.
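Applying the identity $Z_{K_+} - Z_{K_-} = Z_{K_\#}$ once at each node expresses the value on a graph with i nodes as a signed sum over the $2^i$ ways of resolving every node into a positive or a negative crossing. The sketch below shows only that bookkeeping. The function `toy_value` is a hypothetical stand-in that depends on nothing but the chosen crossing signs; it is not a knot invariant. It is picked because, being quadratic in the signs, its extension vanishes as soon as there are three nodes, which mimics the finite type condition.

```python
from itertools import product

def extend_to_singular(value_on_resolution, n_singular):
    """Signed sum over all resolutions of n_singular nodes.

    Each node is resolved to a crossing of sign +1 or -1; the resolution
    contributes with coefficient equal to the product of its signs, as
    obtained by applying Z(K+) - Z(K-) = Z(K#) at every node.
    """
    total = 0
    for signs in product((+1, -1), repeat=n_singular):
        coefficient = 1
        for s in signs:
            coefficient *= s
        total += coefficient * value_on_resolution(signs)
    return total

def toy_value(signs):
    """Hypothetical stand-in: sum of products of pairs of crossing signs."""
    n = len(signs)
    return sum(signs[i] * signs[j] for i in range(n) for j in range(i + 1, n))

two_nodes = extend_to_singular(toy_value, 2)
three_nodes = extend_to_singular(toy_value, 3)
print(two_nodes, three_nodes)
```

In this toy model the extension is nonzero on two nodes and zero on three, the pattern required of an invariant of type 2.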

FIGURE 34 Difference equation.

Vassiliev found a version of these conditions through his analysis of the knot space, and Stanford (1996) discovered the beautiful topological meaning of these conditions in relation to the switching identity. Stanford’s argument goes as follows: Consider a singular crossing that has an arc from the diagram passing underneath it as shown in Fig. 35. Four crossing switches will take that arc above the singular crossing and return the diagram to a position that is topologically equivalent to its original position.

Each crossing switch gives an equation. There are four equations. Add them up and one gets an identity among the values of the invariant on four diagrams. Call this the four-term relation. This identity is illustrated in the second box in Fig. 35.

Now recall from the lemma we proved above that for a Vassiliev invariant of type i, the graphs with i nodes have values that are independent of their embeddings in three-dimensional space. This means that at the top level (the i-noded graphs for a Vassiliev invariant of type i will be called the top level) the four-term relations will be relations among the evaluations of abstract graphs. At the top level the four-term relations will be purely combinatorial conditions related to the topology.

How shall we think of abstract four-valent graphs corresponding to singular embeddings of a knot? An abstract knot is just a circle. An abstract singular knot is a circle with pairs of points marked that become the singular points in the embedding. Indicate these paired points by arcs between them. Call the resulting structure a chord diagram. See the example at the beginning of Fig. 36.

In the language of the chord diagrams the four-term relation at the top level (see the discussion of the top level in the paragraph above) becomes the equation shown in Fig. 36. This can be seen by translating the relation in Fig. 35 into the language of chord diagrams. In Fig. 36 we indicated parts of the chord diagram that are neighbors by showing an outer bracket connecting them. Those sites that are neighbors can have no other chords between them. Otherwise there can be many chords in these diagrams that are not indicated, just so long as the diagrams in the equation for the four-term relation differ only as shown in the figure.

If one can write down a top-level evaluation of chord diagrams that satisfies the four-term relation, then one has the raw data for a Vassiliev invariant. Such an evaluation of chord diagrams is called a weight system for a Vassiliev invariant. By the theorems of Kontsevich (1994) and Bar-Natan (1995), these raw data guarantee the existence of at least one invariant that satisfies the top-level evaluation.

The world is rife with Vassiliev invariants. Birman and Lin (1993) showed directly that the Jones polynomial and its generalizations give rise to Vassiliev invariants. In the case of the Jones polynomial here is an easy proof of their result:

Theorem. Let $V_G(t)$ denote the Jones polynomial extended to rigid vertex 4-valent graphs by the formula $V_{K_+} - V_{K_-} = V_{K_\#}$. Let $v_i(G)$ denote the coefficient of $x^i$ in the expansion of $V_G(\exp(x))$. Then $v_i(G)$ is a Vassiliev invariant of type i.

Proof. Use the identities from the end of Section IV:

$$V_{K_+} = -t^{1/2} V_{K_0} - t V_{K_\infty},$$
$$V_{K_-} = -t^{-1/2} V_{K_0} - t^{-1} V_{K_\infty}.$$

Substitute $t = \exp(x)$. It follows at once that $V_{K_\#} = V_{K_+} - V_{K_-}$ is divisible by x. Hence $V_G$ is divisible by $x^i$ when G has i nodes. This implies that the coefficients $v_i(G) = 0$ if G has more than i nodes. Hence the coefficients $v_i(G)$ are of finite type, proving the theorem.

With the help of theorems of this type it is possible to study Vassiliev invariants by studying the structure of known invariants of knots and links. In particular it is possible to justify the structure of many weight systems in terms of known invariants. We shall not go into these sorts of investigations in this exposition. The next section shows how the algebraic study of Lie algebras is directly related to the construction of Vassiliev invariants. This is one beginning of a whole world of relationships between knot theory and algebra.

VI. VASSILIEV INVARIANTS AND LIE ALGEBRAS

The subject of Lie algebras is an algebraic study with a remarkable connection with the topology of knots and links. The purpose of this section is to first give a brief introduction to the concept of a Lie algebra and then to show the deep connection between these algebras and the structure of Vassiliev invariants for knots and links, as described in the previous section.

In order to understand the idea behind a Lie algebra it is helpful to first consider the concept of a group. A set G is said to be a group if it has a single binary operation ∗ such that:

1. Given a and b in G, then $a * b$ is also in G.
2. If a, b, c are in G, then $(a * b) * c = a * (b * c)$.
3. There is an element E in G such that $E * a = a * E = a$ for all a in G.
4. Given a in G, there exists an element $a^{-1}$ in G such that $a * a^{-1} = a^{-1} * a = E$.
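For a finite set the four axioms can be checked by brute force. The sketch below is an illustration only; the function name `is_group` and the two test operations (addition and multiplication of residues modulo 5) are choices made here, not taken from the text.

```python
from itertools import product

def is_group(elements, op):
    """Check the four group axioms for a finite set with binary operation op."""
    elements = list(elements)
    # Axiom 1: closure.
    if any(op(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False
    # Axiom 2: associativity.
    if any(op(op(a, b), c) != op(a, op(b, c))
           for a, b, c in product(elements, repeat=3)):
        return False
    # Axiom 3: a two-sided identity element E.
    identities = [e for e in elements
                  if all(op(e, a) == a and op(a, e) == a for a in elements)]
    if not identities:
        return False
    e = identities[0]
    # Axiom 4: every element has a two-sided inverse.
    return all(any(op(a, b) == e and op(b, a) == e for b in elements)
               for a in elements)

residues = range(5)
additive = is_group(residues, lambda a, b: (a + b) % 5)
multiplicative = is_group(residues, lambda a, b: (a * b) % 5)
print(additive, multiplicative)
```

Addition modulo 5 passes all four checks. Multiplication modulo 5 satisfies the first three axioms but fails the fourth, because 0 has no inverse.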
FIGURE 35 Embedded four-term relation.

FIGURE 36 Abstract four-term relation via chord diagrams.

One of the most fertile sources of groups is matrix algebra. Recall that an n × n matrix A is an array of numbers $A_{ij}$ (real or complex), $A = (A_{ij})$, where i and j range in value from 1 to n. One defines the product of two matrices by the formula $(AB)_{ij} = \sum_k A_{ik} B_{kj}$, where k runs from 1 to n in this summation.

For our purposes it is essential to have a diagrammatic representation for matrix multiplication. This
FIGURE 37

representation is illustrated in Fig. 37. Each matrix is represented by a labeled box with one arrow that enters the box and one arrow that leaves the box. The entering arrow corresponds to the left index $i$ in $A_{ij}$ and the leaving arrow corresponds to the right index $j$. In multiplying two matrices $A$ and $B$ together to form $AB$ we tie the outgoing arrow of $A$ to the ingoing arrow of $B$. By convention, an arrow that has no free ends connotes the summation over all possible choices of index for that arrow.

Many facts about matrices become quite transparent in this notation. For example, the trace of $A$, denoted $\mathrm{tr}(A)$, is the sum of the diagonal entries $A_{ii}$, where $i$ ranges from 1 to $n$. The diagrammatic proof of the basic formula $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ is illustrated in Fig. 38.

For a given value of $n$, we let $M_n(R)$ denote the set of all $n \times n$ matrices with coefficients in the real numbers $R$. We let $A * B = AB$ denote the product of matrices and we let $E$ denote the matrix whose entries are given by $E_{ii} = 1$ for all $i$ and $E_{ij} = 0$ if $i$ is not equal to $j$. With this choice of multiplication and identity element $E$, $M_n(R)$ satisfies the first three axioms for a group. However, there are matrices $A$ that have no inverse (a matrix $A^{-1}$ such that $A A^{-1} = E$). For

FIGURE 38 Diagrams for Matrix Trace.
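The trace formula $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ stated above can also be checked numerically. The following is a minimal sketch in plain Python; the helper names `matmul` and `trace` and the two sample matrices are illustrative choices, not taken from the article.

```python
def matmul(A, B):
    """Matrix product (AB)_ij = sum over k of A_ik * B_kj for square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(A):
    """Sum of the diagonal entries A_ii."""
    return sum(A[i][i] for i in range(len(A)))


# Two arbitrary 3 x 3 integer matrices (hypothetical sample data).
A = [[1, 2, 0], [3, -1, 4], [0, 5, 2]]
B = [[2, 0, 1], [1, 1, 0], [-3, 2, 5]]

AB = matmul(A, B)
BA = matmul(B, A)

# The products differ as matrices, yet their traces agree.
print(AB != BA, trace(AB) == trace(BA))
```

The two products are different matrices, so the equality of traces is not a consequence of commutativity; it reflects the fact that the closed loop in the diagram can be read starting from either box.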


example the matrix 0, all of whose entries are zero, is a matrix without an inverse. Thus $M_n(R)$ is not itself a group. There is a criterion for a matrix to have an inverse. This is simply that the determinant $\mathrm{Det}(A)$ should be nonzero. Thus the largest group of matrices of size $n \times n$ that we can devise is the set of all matrices $A$ such that $\mathrm{Det}(A)$ is nonzero. This is called the general linear group and is denoted by $GL_n(R)$. There are many interesting subgroups of this large group of matrices. One example is the group $SL(n)$ of all matrices with determinant equal to one. We may also restrict to orthogonal matrices $A$ over $R$. These are invertible matrices $A$ such that $A^t = A^{-1}$, where $A^t$ denotes the transpose of the matrix $A$: $A^t_{ij} = A_{ji}$. The group of orthogonal matrices is denoted by $O(n)$.

The intersection of $O(n)$ and $SL(n)$ is denoted $SO(n)$. $SO(n)$ consists of the orthogonal matrices of determinant equal to one. In the case $n = 2$, $SO(2)$ consists of rotations of the plane that fix the origin, and in the case of $n = 3$, $SO(3)$ consists of rotations of three-dimensional space about specified axes.

$SO(3)$ has a fascinating collection of finite subgroups including the symmetries of the classical regular solids: the tetrahedron, the cube, the octahedron, the dodecahedron, and the icosahedron. Ultimately, the matrix groups become a language for the precise expression of symmetry.

We now ask when a matrix $A$ can be written in the form

$A = e^B = E + (1/1!)B + (1/2!)B^2 + (1/3!)B^3 + \cdots$

for some other matrix $B$. Since $e^B = \lim (E + B/m)^m$, where the limit is taken as $m$ approaches infinity, we can regard $(E + B/m)$, for $m$ large, as an “infinitesimal” version of the matrix $A$, and one refers to $B$ as an “infinitesimal generator” for $A$. It is interesting and mathematically significant to compare the algebraic properties of $A$ and $B$. The key property for this comparison is the determinant equation

$\mathrm{Det}(e^B) = e^{\mathrm{tr}(B)}$,

where $\mathrm{tr}(B)$ denotes the trace of $B$. (One way to prove this identity is to use the Jordan canonical form for the matrix and the fact that similar matrices have the same trace and determinant.)

For example, if $\mathrm{Det}(e^B) = 1$, then we need that $\mathrm{tr}(B) = 0$. This means that elements of $SL(n)$ are the exponentials of matrices with trace equal to zero.

Let $sl(n)$ denote the set of $n \times n$ matrices with trace equal to zero. The set $sl(n)$ is not closed under matrix multiplication, but it is closed under the Lie bracket (or commutator) operation $[B, C] = BC - CB$.

If $\mathrm{tr}(B) = \mathrm{tr}(C) = 0$, then

$\mathrm{tr}[B, C] = \mathrm{tr}(BC - CB) = \mathrm{tr}(BC) - \mathrm{tr}(CB) = \mathrm{tr}(BC) - \mathrm{tr}(BC) = 0$

(since $\mathrm{tr}(BC) = \mathrm{tr}(CB)$ for any matrices $B$ and $C$). Thus, if $B$ and $C$ belong to $sl(n)$, then $[B, C]$ also belongs to $sl(n)$. This closure under the bracket operation leads directly to the notion of a Lie algebra.

Definition. A Lie algebra is a vector space $L$ over a field $F$ that is closed under a binary operation, called the Lie bracket and denoted by $[B, C]$ for $B$ and $C$ in $L$. The bracket is assumed to satisfy the following axioms.

1. $[X, Y] = -[Y, X]$ for all $X$ and $Y$ in $L$.
2. $[aX + bY, Z] = a[X, Z] + b[Y, Z]$ for all $a$ and $b$ in $F$ and $X$, $Y$, $Z$ in $L$.
3. $[X, [Y, Z]] + [Z, [X, Y]] + [Y, [Z, X]] = 0$.

This last identity is called the Jacobi identity. It is easy to verify that the bracket operation $[B, C] = BC - CB$ on the vector space of all $n \times n$ matrices over $F$ (e.g., $F = R$) satisfies the axioms given above. Thus, we have so far seen that $sl(n)$ is a Lie algebra that is naturally associated with the group of matrices $SL(n)$. In fact, $sl(n)$ generates $SL(n)$ by exponentiation.

There is a general pattern. Each matrix group has its corresponding Lie algebra. The classification of matrix groups is simplified by a corresponding classification of Lie algebras. As a result, the Lie algebras are a subject in their own right. It has often happened that Lie algebras are connected mathematically with subjects different from their original roots in group theory.

In our context the Lie algebras turn out to be related to the formation of weight systems for Vassiliev invariants. One way to see this is to just take the case of matrix Lie algebras with commutator brackets and interpret diagrammatically the formula that states that the Lie algebra is closed under the bracket operation. This formula states that there is a basis $\{T^1, T^2, \ldots, T^m\}$ for the Lie algebra as a vector space over $F$ such that each $T^a$ is an $n \times n$ matrix and such that

$T^a T^b - T^b T^a = f^{ab}_{c} T^c$,

where $f^{ab}_{c}$ is a set of constants in $F$ depending on the three indices $a$, $b$, $c$ (running from 1 to $m$). The right-hand side of this equation connotes a summation over all values of the index $c = 1, \ldots, m$. The left-hand side is the commutator of $T^a$ and $T^b$ for any given choice of $a$ and $b$.

In Fig. 39 we diagram this equation using the conventions for diagrammatic matrix multiplication explained in this section. The structure constants $f^{ab}_{c}$ are represented by a graphical vertex with three lines attached to it, one for $a$, one for $b$, and one for $c$. For the purpose of discussion,

FIGURE 39 Diagrammatic Lie Algebra.
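The properties of the commutator bracket and of the matrix exponential used in this section can be spot-checked on small examples. The sketch below is an illustration only: the helper names, the three traceless $2 \times 2$ matrices, and the truncation of the exponential series at 40 terms are all assumptions made for the example.

```python
import math


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def bracket(B, C):
    """Commutator [B, C] = BC - CB."""
    return sub(matmul(B, C), matmul(C, B))


def expm(B, terms=40):
    """Truncated series E + B/1! + B^2/2! + ... (an approximation)."""
    n = len(B)
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in term]
    for k in range(1, terms):
        term = [[entry / k for entry in row] for row in matmul(term, B)]
        result = add(result, term)
    return result


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


# Three traceless 2 x 2 integer matrices, i.e. sample elements of sl(2).
X = [[1, 2], [3, -1]]
Y = [[0, 1], [4, 0]]
Z = [[2, -1], [5, -2]]
zero = [[0, 0], [0, 0]]

# sl(2) is not closed under the product but is closed under the bracket.
product_trace = trace(matmul(X, Y))
bracket_trace = trace(bracket(X, Y))

# Left-hand side of the Jacobi identity.
jacobi = add(add(bracket(X, bracket(Y, Z)), bracket(Z, bracket(X, Y))),
             bracket(Y, bracket(Z, X)))

# Det(e^B) = e^{tr(B)}: traceless X, and a matrix W with trace 4.
W = [[1, 2], [0, 3]]
det_exp_X = det2(expm(X))
det_exp_W = det2(expm(W))
```

Here `product_trace` is nonzero while `bracket_trace` vanishes, `jacobi` is the zero matrix, and the two determinants agree with $e^0 = 1$ and $e^4$ up to rounding error.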

we shall assume that $f^{ab}_{c}$ is dependent only on the cyclic order of $abc$. It is convenient to regard the graphical vertex as representing a “tensor” that has this cyclic invariance, since this means that we can slide the diagram for the structure constant tensor around in the plane so long as we keep the cyclic order of its legs unchanged. Such bases can be obtained in many cases of matrix Lie algebras, and the results that we outline can be generalized in any case.

Figure 40 shows a formal version of the commutator relation of Fig. 39, except that the labels and indices have been removed and the boxes for matrix elements have been replaced by graphical vertices. Imagine that the terms in this formal version of the commutator relation are parts of chord diagrams as illustrated with examples in this figure. In other words, recall the method of chord diagrams from the last section and imagine that along with the chords there are also trivalent graphical vertices among the chords, and that these vertices are related to commutators as shown in the figure. Finally, Fig. 41 shows a formal derivation of the four-term relation for chord diagrams from the diagrammatic commutator identity. This means that the four-term relation that we derived from topological considerations in the last section is intimately related to the basic structure of a Lie algebra. This is the essence of the relationship of Vassiliev invariants with Lie algebras and their generalizations.

Concretely, the relationship we have just described means that it is possible to construct weight systems for Vassiliev invariants by using matrix Lie algebras. To see how this works see Fig. 42. Here we have indicated a chord diagram $D$ and a corresponding diagram involving matrices $T^a$ from a Lie algebra basis. The second diagram represents the sum of traces

$\mathrm{wt}(D) = \mathrm{tr}(T^a T^b T^c T^a T^b T^c)$,

where we are summing over all values for the indices $a$, $b$, and $c$. This second diagram represents the weight $\mathrm{wt}(D)$ that is assigned to the first diagram. It follows from our considerations that this weight system satisfies the four-term relation and hence, by the theorem of Kontsevich, is the top-row evaluation for a Vassiliev invariant.

This section has sketched the amazing and deep connection between Lie algebras and invariants of knots and links. The territory is even more surprising as one explores it further. First of all, it should be clear from what we have said that what is really needed here is an appropriate generalization of Lie algebras. In fact, prior to the discovery of the Vassiliev invariants, a very remarkable such generalization called “quantum groups” was discovered through work in statistical mechanics and was applied to knot theory. It was already known that quantum groups provided a strong connection between Lie algebras and their generalizations and invariants of knots and links. Now the matter of finding all weight systems challenges the resources of

FIGURE 40 Lie Algebra and Chord Diagrams.

quantum groups and it is not known if all Vassiliev invariants can be built through the quantum groups.

In the next few sections we shall discuss the physical background behind many of the mathematical ideas discussed so far in this introduction to knot invariants.

VII. A QUICK REVIEW OF QUANTUM MECHANICS

To recall principles of quantum mechanics it is useful to have a quick historical recapitulation. Quantum mechanics really got started when Louis de Broglie introduced the fantastic notion that matter (such as an electron) is accompanied by a wave that guides its motion and produces interference phenomena just like the waves on the surface of the ocean or the diffraction effects of light going through a small aperture.

de Broglie’s idea was successful in explaining the properties of atomic spectra. In this domain, his wave hypothesis led to the correct orbits and spectra of atoms, formally solving a puzzle that had been only described in ad hoc terms by the preceding theory of Niels Bohr. In Bohr’s theory of the atom, the electrons are restricted to move only in certain elliptical orbits. These restrictions are placed in the theory to get agreement with the known atomic spectra,

FIGURE 41 Proof of Four-Term Relation.

FIGURE 42 Lie Algebra Insertion in a Chord Diagram.


and to avoid a paradox! The paradox arises if one thinks of the electron as a classical particle orbiting the nucleus of the atom. Such a particle is undergoing acceleration in order to move in its orbit. Accelerated charged particles emit radiation. Therefore the electron should radiate away its energy and spiral into the nucleus! Bohr commanded the electron to only occupy certain orbits and thereby avoided the spiral death of the atom, at the expense of logical consistency.

de Broglie hypothesized a wave associated with the electron and he said that an integral multiple of the length of this wave must match the circumference of the electron orbit. Thus, not all orbits are possible, only those where the wave pattern can “bite its own tail.” The mathematics works out, providing an alternative to Bohr’s picture.

de Broglie had waves, but he did not have an equation describing the spatial distribution and temporal evolution of these waves. Such an equation was discovered by Erwin Schrödinger. Schrödinger relied on inspired guesswork, based on de Broglie’s hypothesis, and produced a wave equation, known ever since as the Schrödinger equation. Schrödinger’s equation was enormously successful, predicting fine structure of the spectrum of hydrogen and many other aspects of physics. Suddenly a new physics, quantum mechanics, was born from this musical hypothesis of de Broglie.

Along with the successes of quantum mechanics came a host of extraordinary problems of interpretation. What is the status of this wavefunction of Schrödinger and de Broglie? Does it connote a new element of physical reality? Is matter “nothing but” the patterning of waves in a continuum? How can the electron be a wave and still have the capacity to instantiate a very specific event at one place and one time (such as causing a bit of phosphor to glow there on a television screen)? Max Born developed a statistical interpretation of the wavefunction wherein the wave determines a probability for the appearance of the localized particulate phenomenon that one wanted to call an “electron.” In this story the wavefunction $\psi$ takes values in the complex numbers and the associated probability is $\psi^* \psi$, where $\psi^*$ denotes the complex conjugate of $\psi$. Mathematically, this is a satisfactory recipe for dealing with the theory, but it leads to further questions about the exact character of the statistics. If quantum theory is inherently statistical, then it can give no complete information about the motion of the electron. In fact, there may be no such complete information available even in principle. Electrons manifest as particles when they are observed in a certain manner and as waves when they are observed in another, complementary manner. This is a capsule summary of the view taken by Bohr, Heisenberg, and Born. Others, including de Broglie, Einstein, and Schrödinger, hoped for a more direct and deterministic theory of nature.

As we shall see, the statistical nature of quantum theory has a formal side that can be exploited to understand the topological properties of such mundane objects as knotted ropes in space and spaces constructed by identifying the sides of polyhedra. These topological applications of quantum mechanical ideas are exciting in their own right. They may shed light on the nature of quantum theory itself.

In this section we review a bit of the mathematics of quantum theory. Recall the equation for a wave

$f(x, t) = \sin[(2\pi/l)(x - ct)]$.

With $x$ interpreted as the position and $t$ as the time, this function describes a sinusoidal wave traveling with velocity $c$. We define the wave number $k = 2\pi/l$ and the frequency $w = 2\pi c/l$, where $l$ is the wavelength. Thus we can write $f(x, t) = \sin(kx - wt)$. Note that the velocity $c$ of the wave is given by the ratio of frequency to wave number, $c = w/k$.

de Broglie hypothesized two fundamental relationships: between energy and frequency, and between momentum and wave number. These relationships are summarized in the equations

$E = \hbar w, \quad p = \hbar k$,

where $E$ denotes the energy associated with a wave and $p$ denotes the momentum associated with the wave. Here $\hbar = h/2\pi$, where $h$ is Planck’s constant.

For de Broglie the discrete energy levels of the orbits of electrons in an atom of hydrogen could be explained by restrictions on the vibrational modes of waves associated with the motion of the electron. His choices for the energy and the momentum in relation to a wave are not arbitrary. They are designed to be consistent with the notion that the wave or wave packet moves along with the electron. That is, the velocity of the wave packet is designed to be the velocity of the “corresponding” material particle.

It is worth illustrating how de Broglie’s idea works. Consider two waves whose frequencies are very nearly the same. If we superimpose them (as a piano tuner superimposes a tuning fork with the vibration of the piano string), then there will be a new wave produced by the interference of the original waves. This new wave pattern will move at its own velocity, different from (and generally smaller than) the velocity of the original waves. To be specific, let

$f(x, t) = \sin(kx - wt), \quad g(x, t) = \sin(k'x - w't)$.

Let

$h(x, t) = \sin(kx - wt) + \sin(k'x - w't) = f(x, t) + g(x, t)$.

A little trigonometry shows that

$h(x, t) = 2\cos\{[(k - k')/2]x - [(w - w')/2]t\} \times \sin\{[(k + k')/2]x - [(w + w')/2]t\}$.

If we assume that $k$ and $k'$ are very close and that $w$ and $w'$ are very close, then $(k + k')/2$ is approximately $k$, and $(w + w')/2$ is approximately $w$. Thus $h(x, t)$ can be represented by

$H(x, t) = 2\cos[(\delta k/2)x - (\delta w/2)t]\, f(x, t)$,

where $\delta k = k - k'$ and $\delta w = w - w'$. This means that the superposition $H(x, t)$ behaves as the waveform $f(x, t)$ carrying a slower moving “wave packet” $G(x, t) = \cos[(\delta k/2)x - (\delta w/2)t]$. See Fig. 43.

Since the wave packet (seen as the clumped oscillations in Fig. 43) has the equation $G(x, t) = \cos[(\delta k/2)x - (\delta w/2)t]$, we see that the velocity of this wave packet is $v_g = \delta w/\delta k$. Recall that wave velocity is the ratio of frequency to wave number. Now according to de Broglie, $E = \hbar w$ and $p = \hbar k$, where $E$ and $p$ are the energy and momentum, respectively, associated with this wave packet. Thus we get the formula $v_g = dE/dp$. In other words, the velocity of the wave packet is the rate of change of its energy with respect to its momentum. Now this is exactly in accord with the well-known classical laws for a material particle! For such a particle, $E = mv^2/2$ and $p = mv$. Thus $E = p^2/2m$ and $dE/dp = p/m = v$.

It is this astonishing concordance between the simple wave model and the classical notions of energy and momentum that initiated the beginnings of quantum theory.

A. Schrödinger’s Equation

Schrödinger answered the question, Where is the wave equation for de Broglie’s waves? Writing an elementary wave in complex form

$\psi = \psi(x, t) = \exp[i(kx - wt)]$,

we see that we can extract de Broglie’s energy and momentum by differentiating:

$i\hbar\, \partial\psi/\partial t = E\psi$ and $-i\hbar\, \partial\psi/\partial x = p\psi$.

This led Schrödinger to postulate the identification of dynamical variables with operators so that the first equation,

$i\hbar\, \partial\psi/\partial t = E\psi$,

is promoted to the status of an equation of motion, while the second equation becomes the definition of momentum as an operator:

$p = -i\hbar\, \partial/\partial x$.

Once $p$ is identified as an operator, the numerical value of momentum is associated with an eigenvalue of this operator, just as in the example above. In our example

$p\psi = \hbar k \psi$.

In this formulation, the position operator is just multiplication by $x$ itself. Once we have fixed specific operators for position and momentum, the operators for other

FIGURE 43 Waves and wave packets.
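The superposition of two nearby waves and the resulting wave-packet velocity can be checked numerically. The sketch below assumes units in which $\hbar = m = 1$, so that de Broglie's relations together with $E = p^2/2m$ give the dispersion $w = k^2/2$; the function names and the sample wave numbers are illustrative choices. It uses the standard sum-to-product identity, including its overall factor of 2.

```python
import math


def superposition(x, t, k1, w1, k2, w2):
    """h(x, t) = sin(k1 x - w1 t) + sin(k2 x - w2 t)."""
    return math.sin(k1 * x - w1 * t) + math.sin(k2 * x - w2 * t)


def envelope_times_carrier(x, t, k1, w1, k2, w2):
    """Same function written as a slow envelope times a fast carrier."""
    dk, dw = k1 - k2, w1 - w2
    envelope = 2.0 * math.cos(0.5 * dk * x - 0.5 * dw * t)
    carrier = math.sin(0.5 * (k1 + k2) * x - 0.5 * (w1 + w2) * t)
    return envelope * carrier


def omega(k):
    """Assumed free-particle dispersion w = k^2 / 2 (hbar = m = 1)."""
    return 0.5 * k * k


k1, k2 = 1.00, 1.01          # two nearly equal wave numbers
w1, w2 = omega(k1), omega(k2)

# Velocity of the envelope, delta w / delta k.
group_velocity = (w1 - w2) / (k1 - k2)

# Classical velocity p / m at the mean momentum (p = k in these units).
classical_velocity = 0.5 * (k1 + k2)

# Largest pointwise difference between the two forms on a sample grid.
max_difference = max(
    abs(superposition(x, t, k1, w1, k2, w2)
        - envelope_times_carrier(x, t, k1, w1, k2, w2))
    for x in (0.0, 0.7, 3.1, 25.0)
    for t in (0.0, 0.4, 9.5)
)
```

On this example `group_velocity` agrees with `classical_velocity`, which is the concordance $v_g = dE/dp = p/m$ described in the text.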


physical quantities can be expressed in terms of them. We obtain the energy operator by substitution of the momentum operator in the classical formula for the energy:

$E = (1/2)mv^2 + V$
$E = p^2/2m + V$
$E = -(\hbar^2/2m)\, \partial^2/\partial x^2 + V$.

Here $V$ is the potential energy, and its corresponding operator depends upon the details of the application.

With this operator identification for $E$, Schrödinger’s equation

$i\hbar\, \partial\psi/\partial t = -(\hbar^2/2m)\, \partial^2\psi/\partial x^2 + V\psi$

is an equation in the first derivatives of time and in second derivatives of space. In this form of the theory one considers general solutions to the differential equation and this in turn leads to excellent results in a myriad of applications.

In quantum theory, observation is modeled by the concept of eigenvalues for corresponding operators. The quantum model of an observation is a projection of the wavefunction into an eigenstate.

An energy spectrum $\{E_k\}$ corresponds to wavefunctions $\psi$ satisfying the Schrödinger equation such that there are constants $E_k$ with $E\psi = E_k\psi$. An observable (such as energy) $E$ is a Hermitian operator on a Hilbert space of wavefunctions. Since Hermitian operators have real eigenvalues, this provides the link with measurement for the quantum theory.

It is important to notice that there is no mechanism postulated in this theory for how a wavefunction is “sent” into an eigenstate by an observable. Just as mathematical logic need not demand causality behind an implication between propositions, the logic of quantum mechanics does not demand a specified cause behind an observation. This absence of an assumption of causality in logic does not obviate the possibility of causality in the world. Similarly, the absence of causality in quantum observation does not obviate causality in the physical world. Nevertheless, the debate over the interpretation of quantum theory has often led its participants into asserting that causality has been demolished in physics.

Note that the operators for position and momentum satisfy the equation $xp - px = i\hbar$. This corresponds directly to the equation obtained by Heisenberg on other grounds, stating that dynamical variables can no longer necessarily commute with one another. In this way, the points of view of de Broglie, Schrödinger, and Heisenberg came together, and quantum mechanics was born. In the course of this development, interpretations varied widely. Eventually, physicists came to regard the wavefunction not as a generalized wave packet, but as a carrier of information about possible observations. In this way of thinking $\psi^*\psi$ ($\psi^*$ denotes the complex conjugate of $\psi$) represents the probability of finding the “particle” (a particle is an observable with local spatial characteristics) at a given point in spacetime.

B. Dirac Brackets

We now discuss Dirac’s notation $\langle a|b\rangle$ (Dirac, 1958). In this notation $\langle a|$ and $|b\rangle$ are vectors and covectors, respectively. $\langle a|b\rangle$ is the evaluation of $\langle a|$ by $|b\rangle$, hence it is a scalar, and in ordinary quantum mechanics it is a complex number. One can think of this as the amplitude for the state to begin in “a” and end in “b.” That is, there is a process that can mediate a transition from state $a$ to state $b$. Except for the fact that amplitudes are complex valued, they obey the usual laws of probability. This means that if the process can be factored into a set of all possible intermediate states $c_1, c_2, \ldots, c_n$, then the amplitude for $a \to b$ is the sum of the amplitudes for $a \to c_i \to b$. Meanwhile, the amplitude for $a \to c_i \to b$ is the product of the amplitudes of the two subconfigurations $a \to c_i$ and $c_i \to b$. Formally we have

$\langle a|b\rangle = \sum_i \langle a|c_i\rangle \langle c_i|b\rangle$,

where the summation is over all the intermediate states $i = 1, \ldots, n$.

In general, the amplitude for mutually disjoint processes is the sum of the amplitudes of the individual processes. The amplitude for a configuration of disjoint processes is the product of their individual amplitudes.

Dirac’s division of the amplitudes into bras $\langle a|$ and kets $|b\rangle$ is done mathematically by taking a vector space $V$ (a Hilbert space, but it can be finite dimensional) for the bras: $\langle a|$ belongs to $V$. The dual space $V^*$ is the home of the kets. Thus $|b\rangle$ belongs to $V^*$ so that $|b\rangle$ is a linear mapping $|b\rangle : V \to C$, where $C$ denotes the complex numbers. We restore symmetry to the definition by realizing that an element of a vector space $V$ can be regarded as a mapping from the complex numbers to $V$. Given $\langle a| : C \to V$, the corresponding element of $V$ is the image of 1 (in $C$) under this mapping. In other words, $\langle a|(1)$ is a member of $V$. Now we have $\langle a| : C \to V$ and $|b\rangle : V \to C$. The composition $\langle a|\,|b\rangle = \langle a|b\rangle : C \to C$ is regarded as an element of $C$ by taking the specific value $\langle a|b\rangle(1)$. The complex numbers are regarded as the “vacuum,” and the entire amplitude $\langle a|b\rangle$ is a “vacuum-to-vacuum” amplitude for a process that includes the creation of the state $a$, its transition to $b$, and the annihilation of $b$ to the vacuum once more.

Dirac notation has a life of its own. Let $P = |y\rangle\langle x|$ and $\langle x|\,|y\rangle = \langle x|y\rangle$. Then

$PP = |y\rangle\langle x|\,|y\rangle\langle x| = |y\rangle\langle x|y\rangle\langle x| = \langle x|y\rangle P$.
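The identity $PP = \langle x|y\rangle P$ can be verified on a small finite-dimensional example. The sketch below assumes $V = C^3$ and follows this article's convention: $\langle x|$ is modeled as a vector in $V$, $|y\rangle$ as a covector (a list of coefficients acting by the usual pairing), and $P$ as the rank-one operator that first evaluates $|y\rangle$ and then scales $\langle x|$. The component values and helper names are arbitrary illustrative choices. The last two lines also check that dividing $P$ by the scalar $\langle x|y\rangle$ gives an operator equal to its own square.

```python
def pairing(covector, vector):
    """Evaluate a covector on a vector: sum over i of covector_i * vector_i."""
    return sum(c * v for c, v in zip(covector, vector))


def rank_one(vector, covector):
    """Matrix of the map v -> covector(v) * vector, entries vector_i * covector_j."""
    return [[vi * cj for cj in covector] for vi in vector]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def close(A, B, tol=1e-12):
    """Entrywise comparison of two matrices up to a small tolerance."""
    return all(abs(a - b) <= tol
               for ra, rb in zip(A, B) for a, b in zip(ra, rb))


x = [1 + 2j, 0j, 3 + 0j]      # components of <x| (hypothetical values)
y = [2 + 0j, 1j, -1 + 0j]     # coefficients of |y> (hypothetical values)

s = pairing(y, x)             # the scalar <x|y>
P = rank_one(x, y)
PP = matmul(P, P)
sP = [[s * entry for entry in row] for row in P]

Q = [[entry / s for entry in row] for row in P]
QQ = matmul(Q, Q)
```

With these values `s` is $-1 + 4i$, `PP` coincides with `sP`, and `QQ` coincides with `Q`.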

Up to a scalar multiple, $P$ is a projection operator. That is, if we let $Q = P/\langle x|y\rangle$, then

$QQ = PP/(\langle x|y\rangle\langle x|y\rangle) = \langle x|y\rangle P/(\langle x|y\rangle\langle x|y\rangle) = P/\langle x|y\rangle = Q$.

Thus, $QQ = Q$. In this language, the completeness of intermediate states becomes the statement that a certain sum of projections is equal to the identity: Suppose that $\sum_i |c_i\rangle\langle c_i| = 1$ (summing over $i$) with $\langle c_i|c_i\rangle = 1$ for each $i$. Then

$\langle a|b\rangle = \langle a|\,|b\rangle = \langle a| \left( \sum_i |c_i\rangle\langle c_i| \right) |b\rangle = \sum_i \langle a|\,|c_i\rangle\langle c_i|\,|b\rangle = \sum_i \langle a|c_i\rangle\langle c_i|b\rangle$.

Iterating this principle of expansion over a complete set of states leads to the most primitive form of the Feynman integral (Feynman and Hibbs, 1965). Imagine that the initial and final states $a$ and $b$ are points on the vertical lines $x = 0$ and $x = n + 1$, respectively, in the $x$–$y$ plane, and that $(c(k)_{i(k)}, k)$ is a given point on the line $x = k$ for $1 \le i(k) \le m$. Suppose that the sum of projectors for each intermediate state is complete. That is, we assume that the following sum is equal to one, for each $k$ from 1 to $n$:

$|c(k)_1\rangle\langle c(k)_1| + \cdots + |c(k)_m\rangle\langle c(k)_m| = 1$.

Applying the completeness iteratively, we obtain the following expression for the amplitude $\langle a|b\rangle$:

$\langle a|b\rangle = \sum \langle a|c(1)_{i(1)}\rangle \langle c(1)_{i(1)}|c(2)_{i(2)}\rangle \cdots \langle c(n)_{i(n)}|b\rangle$,

where the sum is taken over all $i(k)$ ranging between 1 and $m$, and $k$ ranging between 1 and $n$. Each term in this sum can be construed as a combinatorial path from $a$ to $b$ in the two-dimensional space of the $x$–$y$ plane. Thus the amplitude for going from $a$ to $b$ is seen as a summation of contributions from all the “paths” connecting $a$ to $b$.

Feynman used this description to produce his famous path integral expression for amplitudes in quantum mechanics. His path integral takes the form

$\int dP\, \exp(iS)$,

where $i$ is the square root of minus one, the integral is taken over all paths from point $a$ to point $b$, and $S$ is the action for a particle to travel from $a$ to $b$ along a given path. For the quantum mechanics associated with a classical (Newtonian) particle the action $S$ is given by the integral along the given path from $a$ to $b$ of the difference $T - V$, where $T$ is the classical kinetic energy and $V$ is the classical potential energy of the particle.

The beauty of Feynman’s approach to quantum mechanics is that it shows the relationship between the classical and the quantum in a particularly transparent manner. Classical motion corresponds to those regions where all nearby paths contribute constructively to the summation. This classical path occurs when the variation of the action is null. To ask for those paths where the variation of the action is zero is a problem in the calculus of variations, and it leads directly to Newton’s equations of motion. Thus, with the appropriate choice of action, classical and quantum points of view are unified.

The drawback of this approach lies in the unavailability at the present time of an appropriate measure theory to support all cases of the Feynman integral.

To summarize, Dirac’s notation shows at once how the probabilistic interpretation for amplitudes is tied to the vector space structure of the space of states of the quantum mechanical system. Our strategy for bringing forth relations between quantum theory and topology is to pivot on the Dirac bracket. The Dirac bracket acts as an intermediary between notation and linear algebra. In a very real sense, the connection of quantum mechanics with topology is an amplification of Dirac notation.

The next two sections discuss how topological invariants in low-dimensional topology are related to amplitudes in quantum mechanics. In these cases the relationship with quantum mechanics is primarily mathematical. Ideas and techniques are borrowed. It is not yet clear what the effect of this interaction will be on the physics itself.

VIII. KNOT AMPLITUDES

At the end of the last section we said that the connection of quantum mechanics with topology is an amplification of Dirac notation. Consider first a circle in a spacetime plane with time represented vertically and space horizontally (Fig. 44). The circle represents a vacuum-to-vacuum process that includes the creation of two “particles” (Fig. 45) and their subsequent annihilation (Fig. 46). In accord with our previous description, we could divide the circle into two parts, creation (a) and annihilation (b), and consider the amplitude $\langle a|b\rangle$. Since the diagram for the creation of the two particles ends in two separate points, it is natural to take a vector space of the form $V \otimes V$ (the tensor product of $V$ with $V$) as the target for the bra and as the domain of the ket.

We imagine at least one particle property being catalogued by each dimension of $V$. For example, a basis of $V$ could enumerate the spins of the created particles. If $\{e_a\}$ is a basis for $V$, then $\{e_a \otimes e_b\}$ forms a basis for

$V \otimes V$. The elements of this new basis constitute all possible combinations of the particle properties. Since such combinations are multiplicative, the tensor product is the appropriate construction.

FIGURE 44 A Circle in Space Time.

FIGURE 45 The Cup.

FIGURE 46 The Cap.

In this language the creation bra is a map CUP,

$\mathrm{CUP} = \langle a| : C \to V \otimes V$,

and the annihilation ket is a mapping CAP,

$\mathrm{CAP} = |b\rangle : V \otimes V \to C$.

The first hint of topology comes when we realize that it is possible to draw a much more complicated simple closed curve in the plane that is nevertheless decomposed with respect to the vertical direction into many CUPs and CAPs. In fact, any non-self-intersecting differentiable curve can be rigidly rotated until it is in general position with respect to the vertical. It will then be seen to be decomposed into these minima and maxima. Our prescriptions for amplitudes suggest that we regard any such curve as an amplitude via its description as a mapping from $C$ to $C$.

Each simple closed curve gives rise to an amplitude, but any simple closed curve in the plane is isotopic to a circle, by the Jordan curve theorem. If these are topological amplitudes, then they should all be equal to the original amplitude for the circle. Thus the question: What condition on creation and annihilation will ensure topological amplitudes? The answer derives from the fact that all isotopies of the simple closed curves are generated by the cancellation of adjacent maxima and minima as illustrated in Fig. 47.

In composing mappings it is necessary to use the identifications

$(V \otimes V) \otimes V = V \otimes (V \otimes V)$ and $V \otimes k = k \otimes V = V$,

where $k$ denotes the field of scalars (here $C$). Thus in Fig. 47, the composition on the left is given by

$V = V \otimes k \xrightarrow{\,1 \otimes \mathrm{CUP}\,} V \otimes (V \otimes V) = (V \otimes V) \otimes V \xrightarrow{\,\mathrm{CAP} \otimes 1\,} k \otimes V = V$.

This composition must equal the identity map on $V$ (denoted 1 here) for the amplitudes to have a proper image of the topological cancellation.

This condition is said very simply by taking a matrix representation for the corresponding operators. Specifically, let $\{e_1, e_2, \ldots, e_n\}$ be a basis for $V$. Let $e_{ab} = e_a \otimes e_b$ denote the elements of the tensor basis for $V \otimes V$. Then there are matrices $M_{ab}$ and $M^{ab}$ such that

$\mathrm{cup}(1) = \sum M^{ab} e_{ab}$,

with the summation taken over all values of $a$ and $b$ from 1 to $n$. Similarly, cap is described by $\mathrm{cap}(e_{ab}) = M_{ab}$. Thus, the amplitude for the circle is

$\mathrm{cap}[\mathrm{cup}(1)] = \mathrm{cap}\left( \sum M^{ab} e_{ab} \right) = \sum M^{ab} M_{ab}$.

In general, the value of the amplitude on a simple closed curve is obtained by translating it into an “abstract tensor expression” in the $M_{ab}$ and $M^{ab}$ and then summing over these products for all cases of repeated indices.

Returning to the topological conditions, we see that they are just that the matrices $(M_{ab})$ and $(M^{ab})$ are inverses in the sense that $\sum_i M^{ai} M_{ib} = \delta^a_b$ and $\sum_i M_{ai} M^{ib} = \delta_a^b$ are identity matrices.

Figure 48 shows the diagrammatic representative of the equation $\sum_i M^{ai} M_{ib} = \delta^a_b$.
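The cup and cap conditions can be made concrete with a small example. The sketch below assumes $n = 2$ and one arbitrarily chosen invertible matrix for $M_{ab}$, with $M^{ab}$ its inverse; these numerical values, the dictionary representation of tensors, and the function names are illustrative assumptions rather than the matrices used for any particular knot invariant. Exact rational arithmetic is used so that the checks involve no rounding.

```python
from fractions import Fraction

n = 2

# Assumed sample data: M_low[a][b] plays the role of M_ab and
# M_up[a][b] the role of M^ab, chosen to be inverse matrices.
M_low = [[Fraction(0), Fraction(2)], [Fraction(-1), Fraction(0)]]
M_up = [[Fraction(0), Fraction(-1)], [Fraction(1, 2), Fraction(0)]]


def cup():
    """cup(1) = sum of M^ab e_ab, stored as {(a, b): coefficient}."""
    return {(a, b): M_up[a][b] for a in range(n) for b in range(n)}


def cap(tensor):
    """Linear extension of cap(e_ab) = M_ab to a tensor in V (x) V."""
    return sum(coeff * M_low[a][b] for (a, b), coeff in tensor.items())


def zigzag(c):
    """Image of e_c under (cap (x) 1) composed with (1 (x) cup).

    e_c -> sum of M^ab e_c (x) e_a (x) e_b -> sum of M_ca M^ab e_b.
    Returns the coefficient list of the result in the basis e_0, e_1.
    """
    image = [Fraction(0)] * n
    for (a, b), coeff in cup().items():
        image[b] += M_low[c][a] * coeff
    return image


# Amplitude of the circle: the sum over a, b of M^ab M_ab.
circle_amplitude = cap(cup())

# The other contraction, sum over i of M^ai M_ib, as a matrix.
up_times_low = [[sum(M_up[a][i] * M_low[i][b] for i in range(n))
                 for b in range(n)] for a in range(n)]
```

For this choice the zigzag composition returns each basis vector unchanged, `up_times_low` is the identity matrix, and the circle amplitude is $-5/2$. A different invertible pair would give a different circle value while still satisfying the cancellation condition.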

FIGURE 47 Cup–Cap Cancellation.

In the simplest case cup and cap are represented by $2 \times 2$ matrices. The topological condition implies that these matrices are inverses of each other. Thus the problem of the existence of topological amplitudes is very easily solved for simple closed curves in the plane.

Now we go to knots and links. Any knot or link can be represented by a picture that is configured with respect to a vertical direction in the plane. The picture will decompose into minima (creations), maxima (annihilations), and crossings of the two types shown below. (Here we consider knots and links that are unoriented. They do not have an intrinsic preferred direction of travel.) In Fig. 49, next to each of the crossings we have indicated mappings of $V \otimes V$ to itself, called $R$ and $\bar{R}$, respectively. These mappings represent the transitions corresponding to these elementary configurations.

That $R$ and $\bar{R}$ really must be inverses follows from the isotopy shown in Fig. 50 (this is the second Reidemeister move).

We now have the vocabulary of cup, cap, $R$, and $\bar{R}$. Any knot or link can be written as a composition of these fragments, and consequently a choice of such mappings determines an amplitude for knots and links. In order for such an amplitude to be topological we want it to be

FIGURE 48 Cancellation.
P1: GRB/GWT P2: GPJ Final Pages
Encyclopedia of Physical Science and Technology EN008O-359 July 14, 2001 18:55

240 Knots

FIGURE 49

FIGURE 50 Knot as Vacuum–Vacuum Expectation.

P1: GRB/GWT P2: GPJ Final Pages
Encyclopedia of Physical Science and Technology EN008O-359 July 14, 2001 18:55

Knots 241

invariant under the list of local moves on the diagrams In the first Reidemeister move, a curl in the diagram is
shown in Fig. 51. These moves are an augmented list of created or destroyed. Ambient isotopy (generated by all
the Reidemeister moves (see Fig. 4), adjusted to take care the Reidemeister moves) corresponds to the full topology
of the fact that the diagrams are arranged with respect to of knots and links embedded in three-dimensional space.
a given direction in the plane. Two link diagrams are ambient isotopic via the Reide-
The equivalence relation generated by these moves is meister moves if and only if there is a continuous family
called regular isotopy. It is one move short of the relation of embeddings in three dimensions leading from one link
known as ambient isotopy. The missing move is the first to the other. The moves give a combinatorial reformulation
Reidemeister move shown in Fig. 4. of the spatial topology of knots and links.

FIGURE 51 Augmented Reidemeister moves for regular isotopy.

By ignoring the first Reidemeister move, we allow the possibility that these diagrams can model framed links, that is, links with a normal vector field or, equivalently, embeddings of curves that are thickened into bands. It turns out to be fruitful to study invariants of regular isotopy. In fact, one can usually normalize an invariant of regular isotopy to obtain an invariant of ambient isotopy. We have already discussed this phenomenon with the bracket polynomial in Section 4.

As the reader can see, we have already discussed the algebraic meaning of moves 0 and 2. The other moves translate into very interesting algebra. Move 3, when translated into algebra, is the famous Yang–Baxter equation. The Yang–Baxter equation occurred for the first time in problems related to exactly solved models in statistical mechanics. All the moves taken together are directly related to the axioms for a quasi-triangular Hopf algebra ("quantum group"). We shall not go into this connection here.

There is an intimate connection between knot invariants and the structure of generalized amplitudes, as we have described them in terms of vector space mappings associated with link diagrams. This strategy for the construction of invariants is directly motivated by the concept of an amplitude in quantum mechanics. It turns out that the invariants that can actually be produced by this means (that is, by assigning finite dimensional matrices to the caps, cups, and crossings) are incredibly rich. They encompass, at present, all of the known invariants of polynomial type (Alexander polynomial, Jones polynomial, and their generalizations).

It is now possible to indicate the construction of the Jones polynomial via the bracket polynomial as an amplitude, by specifying its matrices.

The cups and the caps are defined by $(M_{ab}) = (M^{ab}) = M$, where $M$ is the 2 × 2 matrix (with $i^2 = -1$)
$$M = \begin{pmatrix} 0 & iA \\ -iA^{-1} & 0 \end{pmatrix}.$$
Note that $MM = I$, where $I$ is the identity matrix. Note also that the amplitude for the circle is
$$\sum_{a,b} M_{ab} M^{ab} = \sum_{a,b} M_{ab} M_{ab} = \sum_{a,b} (M_{ab})^2 = (iA)^2 + (-iA^{-1})^2 = -A^2 - A^{-2}.$$

The matrix $R$ is then defined by the equation
$$R^{ab}_{cd} = A M^{ab} M_{cd} + A^{-1} \delta^a_c \delta^b_d.$$
Since, diagrammatically, we identify $R$ with a (right-handed) crossing, this equation can be written diagrammatically as

[diagram: crossing] = $A$ [diagram: cup–cap pair] + $A^{-1}$ [diagram: two parallel strands]

Taken together with the loop value of $-A^2 - A^{-2}$,

[diagram: closed loop] = $-A^2 - A^{-2}$,

these equations can be regarded as a recursive algorithm for computing the amplitude. This algorithm is the bracket state model for the (unnormalized) Jones polynomial. This model can be studied on its own grounds as we have already done in Section 4.

IX. TOPOLOGICAL QUANTUM FIELD THEORY, FIRST STEPS

In order to further justify the idea of topology in relation to the amplification of Dirac notation, consider the following scenario. Let $M$ be a three-dimensional manifold; that is, a space that is locally homeomorphic to Euclidean three-dimensional space. Suppose that $F$ is a closed orientable surface inside $M$ dividing $M$ into two pieces $M_1$ and $M_2$. These pieces are 3-manifolds with boundary. They meet along the surface $F$. Now consider an amplitude $\langle M_1 | M_2 \rangle = Z(M)$. The form of this amplitude generalizes our previous considerations, with the surface $F$ constituting the distinction between the "preparation" $M_1$ and the "detection" $M_2$. This generalization of the Dirac amplitude $\langle a | b \rangle$ amplifies the notational distinction consisting of the vertical line of the bracket to a topological distinction in a space $M$. The amplitude $Z(M)$ will be said to be a topological amplitude for $M$ if it is a topological invariant of the 3-manifold $M$. Note that a topological amplitude does not depend upon the choice of surface $F$ that divides $M$.

From a physical point of view the independence of the topological amplitude on the particular surface that divides the 3-manifold is the most important property. An amplitude arises in the condition of one part of the distinction carved in the 3-manifold acting as "the observed" and the other part of the distinction acting as "the observer." If the amplitude is to reflect physical (read topological) information about the underlying manifold, then it should not depend upon this particular decomposition into observer and observed. The same remarks apply to 4-manifolds and interface with ideas in relativity. We mention 3-manifolds because it is possible to describe many examples of topological amplitudes in three dimensions. The matter of four-dimensional amplitudes is a topic of current research. The notion that an amplitude be independent of the distinction producing it is prior to topology. Topological invariance of the amplitude is a convenient and fundamental way to produce such independence.

This sudden jump to topological amplitudes has its counterpart in mathematical physics. Witten (1989) proposed a formulation of a class of 3-manifold invariants as generalized Feynman integrals taking the form $Z(M)$, where
$$Z(M) = \int dA \, e^{(ik/4\pi) S(M,A)}.$$
Here $M$ denotes a 3-manifold without boundary and $A$ is a gauge field (also called a gauge potential or gauge connection) defined on $M$. The gauge field is a one-form on $M$ with values in a representation of a Lie algebra. The group corresponding to this Lie algebra is said to be the gauge group for this particular field. In this integral the "action" $S(M, A)$ is taken to be the integral over $M$ of the trace of the Chern–Simons three-form $CS = A \wedge dA + (2/3) A \wedge A \wedge A$. (The product is the wedge product $\wedge$ of differential forms.)

Instead of integrating over paths, the integral $Z(M)$ integrates over all gauge fields modulo gauge equivalence. This generalization from paths to fields is characteristic of quantum field theory. Quantum field theory was designed in order to accomplish the quantization of electromagnetism. In quantum electrodynamics the classical entity is the electromagnetic field. The question posed in this domain is to find the value of an amplitude for starting with one field configuration and ending with another. The analogue of all paths from point a to point b is "all fields from field A to field B."

Witten's integral $Z(M)$ is, in its form, a typical integral in quantum field theory. In its content $Z(M)$ is highly unusual. The formalism of the integral and its internal logic support the existence of a large class of topological invariants of 3-manifolds and associated invariants of knots and links in these manifolds.

Invariants of 3-manifolds were initiated by Witten as functional integrals and at the same time defined in a combinatorial way by Reshetikhin and Turaev (1991). The Reshetikhin–Turaev definition proceeds in a way that is quite similar to the definition that we gave for the bracket model for the Jones polynomial in Section 1. It is an amazing fact that Witten's definition seems to give the very same invariants. We are not in a position to go into the details of this correspondence here. However, one theme is worth mentioning: For $k$ large, the Witten integral is approximated by those gauge connections $A$ for which $S(M, A)$ has zero variation with respect to change in $A$. These are the so-called flat connections. It is possible in many examples to calculate this contribution via both the functional integral and the combinatorial definition of Reshetikhin and Turaev. In all cases, the two methods agree (see, e.g., Freed and Gompf, 1991; Lawrence and Rozansky, 1997). This is one of the pieces of evidence in a puzzle that everyone expects will eventually justify the formalism of the functional integral.

In order to obtain invariants of knots and links from Witten's integral, one adds an extra bit of machinery to the brew. The new machinery is the Wilson loop. The Wilson loop is an exponentiated version of integrating the gauge field along a loop $K$. We take this loop $K$ in three-space to be an embedding (a knot) or a curve with transversal self-intersections. It is usually indicated by the symbolism $\mathrm{tr}(P e^{\oint_K A})$, where $P$ denotes path-ordered integration, that is, we are integrating and exponentiating matrix-valued functions, and one must keep track of the order of the operations. The symbol tr denotes the trace of the resulting matrix.

With the help of the Wilson loop function on knots and links, Witten writes down a functional integral for link invariants in a 3-manifold $M$:
$$Z(M, K) = \int dA \, e^{(ik/4\pi) S(M,A)} \, \mathrm{tr}(P e^{\oint_K A}).$$
Here $S(M, A)$ is the Chern–Simons Lagrangian, as in the previous discussion.

If one takes the standard representation of the Lie algebra of SU(2) as 2 × 2 complex matrices then it is a fascinating exercise to see that the formalism of $Z(S^3, K)$ ($S^3$ denotes the three-dimensional sphere) yields the original Jones polynomial with the basic properties as discussed in Section 1. See Witten (1989) or Kauffman (1995) for discussions of this part of the heuristics.

This approach to link invariants crosses boundaries between different methods. There are close relations between $Z(S^3, K)$ and the invariants defined by Vassiliev (Kauffman, 1995), to name one facet of this complex crystal.

A. Links and the Wilson Loop

We shall now indicate an analysis of the formalism of this functional integral that reveals quite a bit about its role in knot theory. This analysis depends upon some key facts relating the curvature of the gauge field to both the Wilson loop and the Chern–Simons Lagrangian. To this end, let us recall the local coordinate structure of the gauge field $A(x)$, where $x$ is a point in three-space. We can write $A(x) = A^a_k(x) T_a \, dx^k$, where the index $a$ ranges from 1 to $m$ with the Lie algebra basis $\{T_1, T_2, T_3, \ldots, T_m\}$ and the index $k$ goes from 1 to 3. For each choice of $a$ and $k$, $A^a_k(x)$ is a smooth function defined on three-space. In $A(x)$ we sum over the values of repeated indices. The Lie algebra generators $T_a$ are actually matrices corresponding to a given representation of an abstract Lie algebra.

B. Difference Formula

One can deduce a difference formula for the Witten invariants from the formal properties of the functional integral. Let $K_+$ and $K_-$ denote knots that differ at a single crossing with + and − signs, respectively, and $K_{\#\#}$ the result of replacing the crossing by a transverse singularity (i.e., with distinct tangent directions for the two local curve segments). We take $K_\#$ to denote the insertion of a graphical node at the transverse crossing, as we have done in our discussion of the Vassiliev invariant. The notation $K_{\#\#}$ indicates that the curve intersects itself in space at one point. Let $K_{\#\#} T_a T_a$ denote the result of placing the matrices of the Lie algebra basis into the Wilson line at the singular crossing as shown in Fig. 52. These matrices become part of the big matrix product that generates the Wilson line. Then, up to order $(1/k)$ one has the difference relation
$$Z(K_+) - Z(K_-) = (4\pi i/k) \, Z(K_{\#\#} T_a T_a).$$
This formula is the key to unwrapping many properties of the knot invariants.

C. Graph Invariants and Vassiliev Invariants

Recall, from Section 5, that $V(G)$ is a Vassiliev invariant if
$$V_{K_+} - V_{K_-} = V_{K_\#}.$$
$V(G)$ is said to be of finite type $k$ if $V(G) = 0$ whenever $\#(G) > k$, where $\#(G)$ denotes the number of 4-valent nodes in the graph $G$. See Section 5.

With this definition in hand, let us return to the invariants derived from the functional integral $Z(K)$. We have that
$$Z(K_+) - Z(K_-) = \frac{4\pi i}{k} Z(K_{\#\#} T^a T^a).$$
This formula tells us that for the Vassiliev invariant associated with $Z$ we have
$$Z(K_\#) = \frac{4\pi i}{k} Z(K_{\#\#} T^a T^a).$$
Furthermore, if $V_j(K)$ denotes the coefficient of $(4\pi i/k)^j$ in the expansion of $Z(K)$ in powers of $(1/k)$, then the ambient difference formula implies that $(1/k)^j$ divides $Z(G)$ when $G$ has $j$ or more nodes. Hence $V_j(G) = 0$ if $G$ has more than $j$ nodes. Therefore $V_j(K)$ is a Vassiliev invariant of finite type. [This result was proved by Birman and Lin (1993) by different methods and by Bar-Natan (1995) by methods equivalent to ours.]

The fascinating thing is that the ambient difference formula, appropriately interpreted, actually tells us how to compute $V_k(G)$ when $G$ has $k$ nodes. This result is equivalent to the description of weight systems derived from Lie algebras that we described in Section 6. Thus the approach to link invariants via the functional integral motivates and explains the fundamental structure of Vassiliev invariants.
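The finite-type argument of this subsection can be written out as a short heuristic derivation. This is only a sketch at the level of the formal functional integral: the error terms are an assumption about what "up to order (1/k)" means, and the symbol $G_{\#\#}$ (the graph $G$ with each node replaced by a singular crossing carrying a Lie algebra insertion) is shorthand introduced here.

```latex
% Resolve each of the j nodes of G with V_{K_\#} = V_{K_+} - V_{K_-},
% then apply the difference formula once per node.
\begin{aligned}
Z(K_\#) &= Z(K_+) - Z(K_-)
         = \frac{4\pi i}{k}\, Z(K_{\#\#}\, T^a T^a) + O(k^{-2}), \\
Z(G)    &= \Big(\frac{4\pi i}{k}\Big)^{j} Z(G_{\#\#}) + O(k^{-(j+1)})
         \quad \text{for a graph } G \text{ with } j \text{ nodes}, \\
V_i(G)  &= 0 \quad \text{for all } i < j .
\end{aligned}
```

Read this way, the expansion of $Z(G)$ in powers of $1/k$ starts no earlier than degree $j$, which is the statement that $V_j$ vanishes on graphs with more than $j$ nodes.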

FIGURE 52 Lie Algebra Insertion.


This deep relationship between topological invariants in low-dimensional topology and quantum field theory in the sense of Witten's functional integral is still in its infancy. There will be many surprises in the future as we discover that what has so far been uncovered is only the tip of an iceberg.

ACKNOWLEDGMENT

It gives me great pleasure to thank Vaughan Jones, Ed Witten, Nicolai Reshetikhin, Mario Rasetti, Sostenes Lins, Massimo Ferri, Lee Smolin, Louis Crane, David Yetter, Ray Lickorish, DeWitt Sumners, Hugh Morton, Joan Birman, John Conway, John Simon, and Dennis Roseman for many conversations related to the topics of this paper. This research was partially supported by the National Science Foundation Grant DMS-2528707.

SEE ALSO THE FOLLOWING ARTICLES

QUANTUM MECHANICS • QUANTUM THEORY • STATISTICAL MECHANICS • TOPOLOGY, GENERAL

BIBLIOGRAPHY

Alexander, J. W. (1923). "Topological invariants of knots and links," Trans. Amer. Math. Soc. 20, 275–306.
Atiyah, M. F. (1990). "The Geometry and Physics of Knots," Cambridge University Press, Cambridge.
Bar-Natan, D. (1995). "On the Vassiliev knot invariants," Topology 34, 423–472.
Birman, J., and Lin, X. S. (1993). "Knot polynomials and Vassiliev's invariants," Invent. Math. 111, 225–270.
Conway, J. H. (1970). "An enumeration of knots and links and some of their algebraic properties." In "Computational Problems in Abstract Algebra," Pergamon Press, New York, pp. 329–358.
Crowell, R. H., and Fox, R. H. (1963). "Introduction to Knot Theory," Ginn, Boston.
Dirac, P. A. M. (1958). "Principles of Quantum Mechanics," Oxford University Press, Oxford.
Feynman, R., and Hibbs, A. R. (1965). "Quantum Mechanics and Path Integrals," McGraw-Hill, New York.
Freed, D., and Gompf, R. (1991). "Computer calculations of Witten's 3-manifold invariants," Comm. Math. Phys. 41, 79–117.
Jones, V. F. R. (1985). "A polynomial invariant for links via von Neumann algebras," Bull. Amer. Math. Soc. 129, 103–112.
Kauffman, L. H. (1980). "The Conway polynomial," Topology 20, 101–108.
Kauffman, L. H. (1983). "Formal Knot Theory," Princeton University Press, Princeton, NJ.
Kauffman, L. H. (1987a). "State models and the Jones polynomial," Topology 26, 395–407.
Kauffman, L. H. (1987b). "On Knots," Princeton University Press, Princeton, NJ.
Kauffman, L. H. (1989). "Statistical mechanics and the Jones polynomial." In "Proceedings of the 1986 Santa Cruz Conference on Artin's Braid Group," pp. 263–298, AMS, Providence, RI. [Reprinted in M. Rasetti, ed. (1990). "New Problems, Methods and Techniques in Quantum Field Theory and Statistical Mechanics," pp. 175–222, World Scientific, Singapore.]
Kauffman, L. H. (1993). "Knots and Physics," 2nd ed. World Scientific, Singapore.
Kauffman, L. H. (1995). "Functional integration and the theory of knots," J. Math. Phys. 36, 2402–2429.
Kontsevich, M. (1994). "Feynman diagrams and low-dimensional topology." First European Congress of Mathematics, Vol. II (Paris, 1992), 97–121, Progr. Math., 120, Birkhäuser, Basel.
Lawrence, R., and Rozansky, L. (1997). "Witten–Reshetikhin–Turaev invariants of Seifert manifolds," Preprint.
Murasugi, K. (1987a). "The Jones polynomial and classical conjectures in knot theory," Topology 26, 187–194.
Murasugi, K. (1987b). "Jones polynomials and classical conjectures in knot theory II," Math. Proc. Camb. Phil. Soc. 102, 317–318.
Reidemeister, K. (1948, 1932). "Knotentheorie," Chelsea, New York and Julius Springer, Berlin.
Reshetikhin, N. Y., and Turaev, V. (1990). "Ribbon graphs and their invariants derived from quantum groups," Comm. Math. Phys. 127, 1–26.
Reshetikhin, N. Y., and Turaev, V. (1991). "Invariants of three-manifolds via link polynomials and quantum groups," Invent. Math. 103, 547–597.
Stanford, T. (1996). "Finite-type invariants of knots, links and graphs," Topology 35, 1027–1050.
Thistlethwaite, M. (1987). "A spanning tree expansion of the Jones polynomial," Topology 26, 297–309.
Vassiliev, V. (1990). "Cohomology of knot spaces." In "Theory of Singularities and Its Applications" (V. I. Arnold, ed.), pp. 23–69, AMS, Providence, RI.
Witten, E. (1989). "Quantum field theory and the Jones polynomial," Commun. Math. Phys. 121, 351–399.
P1: GPT/MBG P2: GPJ/GKP P3: FJU/LOT QC: FJS/FYD Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN008A-379 June 29, 2001 15:17

Linear Optimization
C. Roos
Delft University of Technology/University of Leiden

I. Historical Background
II. The Simplex Method
III. Interior-Point Methods
IV. Related Topics
V. Further Extensions

GLOSSARY

Bounded problem An LO-problem whose objective function is bounded from above if the problem is a maximization problem, and bounded from below if it is a minimization problem.
Canonical LO-model All constraints are inequality constraints and all variables are nonnegative.
Feasible problem An LO-problem whose feasible region is nonempty.
Feasible region The set of points satisfying all the constraints.
Infeasible problem An LO-problem whose feasible region is empty.
Integer LO-problem LO-problem whose variables are required to be integral.
Interior-point condition (IPC) There exists a feasible point for which all inequality constraints in the LO-problem are satisfied strictly.
LO-relaxation The LO-problem that arises when the integrality condition on the variables in an integer LO-problem is removed.
Polynomial method A solution method for LO-problems whose running time depends polynomially on the size of the problem.
Size of a problem Length of a binary string needed to encode an LO-problem.
Standard LO-model An LO-problem in which all constraints are equality constraints and all variables are nonnegative.
Unbounded problem An LO-problem that is not bounded.

LINEAR OPTIMIZATION (LO) is the branch of mathematics that deals with minimizing (or maximizing) a linear function whose variables are subject to linear equality or inequality constraints. The standard form of the LO-problem is
$$\begin{aligned} \text{minimize} \quad & p = \sum_{j=1}^{n} c_j x_j \\ \text{subject to} \quad & \sum_{j=1}^{n} a_{ij} x_j = b_i, \quad i = 1, \ldots, m, \\ & x_j \ge 0, \quad j = 1, \ldots, n, \end{aligned}$$
or, in matrix form,
$$\begin{aligned} \text{minimize} \quad & p = c^T x \\ \text{subject to} \quad & Ax = b \\ & x \ge 0, \end{aligned} \qquad (1)$$
where $A$ is an $m \times n$ matrix and $x$, $b$, and $c$ are vectors of dimension $n$, $m$, and $n$, respectively. Here, $x \ge 0$ means that all entries of $x$ are nonnegative and the superscript $T$ refers to taking the transpose.

The relevance of the subject is due to the fact that real-life problems in many branches of applied sciences, like economy, logistics, and engineering, can be modeled as LO-problems. As a result, LO is probably one of the most applied mathematical tools. The beauty of the subject is due to its rich mathematical theory, part of which has developed over the last twenty years.

I. HISTORICAL BACKGROUND

The field of LO¹ arose in the 1940s, due to important work of Dantzig, Kantorovich, Koopmans, Von Neumann, and Morgenstern. Dantzig was involved in the research project Scientific Computation of the Optimum Programs (SCOOP) at the U.S. Air Force. He visited Koopmans in June 1947 and told him of his work on linear models for the optimization of military operations. Koopmans immediately recognized the relevance of this approach for his economic models and made clear to Dantzig that the economists did not have an algorithm for solving such models. In the summer of 1947 Dantzig invented the Simplex method. Koopmans was the leader of a group of economists who developed the theory of optimal assignment of resources by using LO-models. For this work he received the Nobel prize, in 1975, together with Kantorovich.

¹ Historically, the field is named Linear Programming, or LP. This name was proposed shortly after World War II [Dantzig (1963)]. Since then the modern computer came to life and the word "programming" usually refers to the activity of writing computer programs. As a consequence, its use instead of the more natural word "optimization" gives rise to confusion [Williams (1990)].

In the autumn of 1947 another important meeting took place, when Dantzig visited Von Neumann. This meeting clarified the relation between Dantzig's work and the game theory developed by Von Neumann and Morgenstern. Dantzig also heard for the first time of Farkas' lemma and the notion of duality in LO.

From a computational point of view the Simplex method is very simple. Only the elementary arithmetic operations addition, subtraction, division, and multiplication are performed on a rectangular table that initially contains the data of the LO-model. The number of operations, however, is so large that only relatively small models can be solved manually. Fortunately, the invention of the Simplex method coincided more or less with the invention of the electronic computer. The calculations could be automated, thus paving the way for large-scale applications. Oil companies were among the early users of the computer code MPS/360 for LO, released by IBM in 1966, which could run on the IBM 360 computers that were introduced around that time. At the end of the 1960s and in the early 1970s, other software packages for LO became available: MPSX, and later MPSX/370 of IBM, MPS III of Exxon, the UMPIRE system for the UNIVAC 1108, APX III for the CDC computers, LAMPS (of John Forrest), and LINDO. For the history of the implementations of the Simplex method the reader is referred to W. Orchard-Hays (1990).

From a theoretical point of view the Simplex method has some properties that deserve mentioning here, because they were important in the history of LO. First, the method may cycle, i.e., it may happen that after hours of calculations a tableau is generated that already occurred before during the calculations. This phenomenon is called cycling. For a mathematical algorithm this is disastrous. It implies that a computer program may run forever without terminating with a solution. Dantzig already recognized this danger and found a so-called cycle-breaking rule; as a consequence, the Simplex method equipped with such a rule solves any LO-problem correctly.

A second property concerns the computational time. It may be expected that this time will grow with the size of the problem, i.e., with the number n of variables as well as with the number m of constraints. In practice the behavior is such that, as a rule of thumb, the computational time depends linearly on n and m. Many researchers tried to find a theoretical estimate for the computational time in terms of n and m. In all cases they ended up with a formula that is exponential in either n or m. In 1972 it became clear that this behavior is inherent to the Simplex method. In that year Klee and Minty gave an elegant example for which the computational time is exponential in n [Klee and Minty (1972)].

The result of Klee and Minty initiated a period of search for new, more efficient methods in the LO world. In 1979 the front page of the New York Times announced an important result: the Russian mathematician Khachiyan had found a new method with the desired property, the Ellipsoid method [Khachiyan (1979)]. This method is polynomial, i.e., the computational time depends polynomially on the size of the problem. Theoretically, this was an enormous breakthrough; from the practical point of view, however, the result was disappointing. Computer programs based on the Ellipsoid method proved to be much less efficient and less robust than those using the Simplex method. The result was an unpleasant tension between theory and practice: the theoretically efficient Ellipsoid method could not match the theoretically inefficient Simplex method.

A new breakthrough came in 1984. In that year the Indian mathematician Karmarkar published a method that reconciled theory and practice [Karmarkar (1984)]. It was again front page news. Karmarkar worked at the well-known American telephone company AT&T, and had found a completely new approach to the LO-problem. His so-called Projective method was polynomial and, as claimed by Karmarkar, was in practice at least 100 times faster than the Simplex method, especially for large-scale problems. The last claim gave rise to much commotion. It is hard to find another mathematical result that caused so much excitement and dispute. The disputes were due to the fact that the published version of Karmarkar's method differed from the version implemented at AT&T. This was not made public. Later on the reason became clear. AT&T had made a big investment to design a computer program for LO that should be about 200 times faster than the commercial codes for solving LO-problems available at that time. A project group was formed including many prominent researchers, with the task to devise a new LO-program on the basis of Karmarkar's method. During its existence the size of the group grew from 20 to 200 researchers (R. E. Bixby and R. Vanderbei, private communication). The system would be a "complete" hardware/software solution. For the hardware a $1 million vector/parallel machine of Alliant Computers was chosen.

About 5 years later, in 1989, the LO-package was launched under the name KORBX. In 1991, after selling one system to the military airlift command for about $4 million and one system to Delta airlines for $12 million, the venture was essentially abandoned. The LO-package was not portable; it ran only on the Alliant computer. In retrospect, this may have been the main reason for the failure of the project.

Outside AT&T much research activity took place. Initial implementations based on Karmarkar's paper proved to be about 100 times slower than the Simplex-based codes. It looked as if the Simplex method would survive this attack from the Projective method, just as it had survived the Ellipsoid method. Karmarkar, however, persisted with his claims. This inspired further worldwide research. Over a period of 10 years more than 2000 scientific publications appeared on the subject. This led to many new theoretical concepts and algorithmic ideas. These insights were immediately incorporated into computer programs. The work of Lustig, Marsten, and Shanno resulted in a computer code, called OB1, that could compete with the Simplex method. At the time that KORBX entered the market, OB1 was faster, suitable for all current computer platforms, and freely available for research purposes.

The computer code OB1 was based on a method proposed already in the 1950s by Frisch (1956), the logarithmic barrier method. By adjusting this method the so-called path-following methods or interior-point methods arose. The first term refers to the central path of an LO-problem; this is a curve in the interior of the feasible region that converges to an optimal solution. After its rediscovery, independently by several researchers, it became the basis of all modern interior-point methods. These methods are nothing else than numerical recipes that generate a sequence of points, on or close to the central path, that converge to an optimal solution. In this way, several efficient (i.e., polynomial) methods came to life whose practical performance justifies the original claim of Karmarkar [Roos et al. (1997)].

In the meantime it turned out that the implementations of the Simplex method could be accelerated dramatically by implementing techniques already available in the literature. Bixby (Rice University, Houston, TX) in particular did important work in this respect. As a result an exciting competition arose between him and the makers of OB1. In some large-scale applications in the airline industry this has led to a nice synthesis of both approaches, where a close-to-optimal solution is generated by an interior-point method which is then used as input for the Simplex method to generate an exact solution.

A result of the sketched developments is that nowadays LO-problems can be solved about 1,000,000 times faster than in 1984. A factor of 1000 is due to the better algorithmic methods and the other factor of 1000 to the improvements in computer technology (Moore's law). The consequences for the applications are far-reaching. An LO-model that required 1 year of computational time 16 years ago can now be solved in about 30 seconds. It is clear that problems that require a computational time of 1 year are unsolvable from a practical point of view. Therefore, the improved performance, although of a quantitative nature, has qualitative consequences: problems that could not be solved 15 years ago can now be solved in a few seconds. This explains why the use of commercial LO-packages has shown an explosive growth during the last years.

Modern LO-packages contain both a Simplex-based solver and an interior-point solver. Some of the best-known packages are OSL of IBM, CPLEX (based on the Simplex code of Bixby and on OB1), XPRESS-MP of Dash Associates, and MOSEK of the brothers E. D. and K. D. Andersen. The environments in which these packages are used nowadays are quite diverse: aviation industry, oil industry, engineering design, finance, water management, etc.

II. THE SIMPLEX METHOD

The Simplex method can be used to solve any LO-problem in the standard form. The method constructs a sequence of basic solutions of the problem until it either finds an optimal basic solution or detects that such a solution does not exist.

A. Reduction to Standard Form

If an LO-problem does not have the standard form it can easily be put into this form as follows.

If a constraint has the form $a_i^T x \ge b_i$ we replace it by $-a_i^T x \le -b_i$. Thus, we may assume that the inequalities have the form $a_i^T x \le b_i$. For each such inequality constraint we introduce a slack variable $s_i$ according to $a_i^T x + s_i = b_i$. We are then left with a situation where all constraints are equality constraints. Let us write them as $Ax = b$. If b does not belong to the column space of the matrix A, then this system is inconsistent and the problem infeasible. Otherwise, we remove redundant constraints until the matrix A has full row rank. If for each variable a nonnegativity constraint exists, the problem already has the standard form.

Otherwise, let F denote the index set of the "free" variables, i.e., the variables without nonnegativity constraints. The classical way to "remove" the free variables is to substitute $x_i = x_i^{(1)} - x_i^{(2)}$, with $x_i^{(1)} \ge 0$ and $x_i^{(2)} \ge 0$, for each free variable $x_i$. This approach, although theoretically sound, has some serious drawbacks: it increases the number of variables, and when solving the problem with an interior-point method, the new model will not satisfy the interior point condition (IPC).

A better approach is as follows. Let r denote the rank of $A_F$ (the submatrix of A formed by the columns indexed by F). As long as free variables occur in the problem formulation we choose a free variable and a constraint in which it occurs. Then, using this (equality) constraint, we express the free variable in terms of the other variables and by substitution we eliminate it from the other constraints and from the objective function. Since $A_F$ has rank r, we can do this r times, and then the remaining constraints no longer contain free variables. We are left with m equality conditions, r of which express free variables in the remaining variables, while the remaining m − r equalities contain no free variables. Observe that the first r equalities do not impose a condition on the feasibility of the vector x; they simply tell us how the values of r of the free variables in x can be calculated from the remaining variables. Hence, these equalities can be neglected from now on, and we are left with m − r equality constraints in nonnegative variables. If the objective function still contains free variables, the problem is unbounded from below, and we have solved the problem. Otherwise, the remaining problem has the standard form.

B. Basic Solution

Let x denote any solution of $Ax = b$ and let σ(x) denote its support:

$$\sigma(x) := \{\, i : x_i \neq 0 \,\}.$$

We call x a basic solution if the columns in $A_{\sigma(x)}$ are linearly independent. If x is not a basic solution we can easily derive a basic solution from it, as follows. The columns in $A_{\sigma(x)}$ being linearly dependent, there exists a nonzero vector λ such that $A\lambda = 0$, with $\lambda_i = 0$ for $i \notin \sigma(x)$. Now let α ≥ 0 grow until one of the coordinates in σ(x) of $x \pm \alpha\lambda$ becomes zero and let x̄ denote $x \pm \alpha\lambda$ for this value of α. Then σ(x̄) is a proper subset of σ(x). If x̄ is not a basic solution the above procedure is repeated. Since σ(x) is finite this yields a basic solution after a finite number of steps.

C. Basic Feasible Solution

Let x denote any nonnegative solution of $Ax = b$ and let σ(x) be its support. We call x a basic feasible solution if x is a basic solution. If x is not a basic solution, the procedure described in Section II.B can be used to produce a nonnegative basic solution $\bar{x} = x + \alpha\lambda$ (if necessary, use −λ instead of λ). Moreover, if $p = c^T x$, we can find a basic feasible solution x̄ such that $c^T \bar{x} \le p$, or establish that the problem is unbounded. To this end we modify the procedure of Section II.B by taking λ such that $c^T \lambda \le 0$. If this works it gives a basic feasible solution with the desired property. If it does not work and $c^T \lambda < 0$, it follows that $\bar{x} = x + \alpha\lambda$ remains nonnegative, however large the value of α. For large α, $c^T \bar{x}$ goes to minus infinity, proving that the problem is unbounded. If $c^T \lambda = 0$, replacing λ by −λ provides a basic feasible solution with the same objective value.

Let us mention that, in general, interior-point methods generate a nonbasic optimal solution. Note that the procedure described above enables us to construct a basic optimal solution from any given optimal solution.

D. A Simplex Iteration

Suppose we are given a basic feasible solution x̄. There are two main questions:

1. How can we determine whether the given solution is optimal?
2. If the solution is not optimal, how can we find a better solution?

In this section we deal with these two questions.
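The first steps of the reduction in Section II.A (negating '≥' rows and adding one slack variable per inequality) can be sketched as follows. This is only an illustration under simplifying assumptions: all variables are taken to be nonnegative already, no redundancy or rank check is made, and the names `to_standard_form` and `senses` are introduced here and do not come from the text.

```python
def to_standard_form(A, b, c, senses):
    """Turn rows a_i^T x (<=, >=, =) b_i with x >= 0 into equalities.

    senses[i] is one of "<=", ">=", "=". Rows with ">=" are negated first,
    then every inequality row receives its own slack column.
    """
    A = [list(row) for row in A]
    b = list(b)
    for i, sense in enumerate(senses):
        if sense == ">=":                      # a_i^T x >= b_i  ->  -a_i^T x <= -b_i
            A[i] = [-a for a in A[i]]
            b[i] = -b[i]
    slack_rows = [i for i, sense in enumerate(senses) if sense in ("<=", ">=")]
    for i, row in enumerate(A):
        # slack s_i appears only in its own row: a_i^T x + s_i = b_i
        row.extend(1 if i == j else 0 for j in slack_rows)
    return A, b, list(c) + [0] * len(slack_rows)


# Small example: min 3 x1 + 2 x2  s.t.  x1 + x2 >= -1,  x1 + 2 x2 >= 1,  x >= 0
A_std, b_std, c_std = to_standard_form([[1, 1], [1, 2]], [-1, 1], [3, 2], [">=", ">="])
```

Note that this recipe yields slack variables on negated rows; subtracting surplus variables from the original '≥' rows gives an equivalent system with opposite signs.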
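The support-shrinking procedure of Section II.B can be sketched in exact rational arithmetic as below. It is a minimal illustration, not a production routine: the helper names `null_vector` and `to_basic_solution` are hypothetical, the null-space vector λ is found by plain Gauss-Jordan elimination, and the step α is taken with whichever sign reaches a zero coordinate first (so nonnegativity, as required in Section II.C, is not enforced here).

```python
from fractions import Fraction


def null_vector(M):
    """Return a nonzero v with M v = 0, or None if the columns of M are independent."""
    rows = [[Fraction(a) for a in row] for row in M]
    ncols = len(rows[0])
    pivots = []                                   # pivots[i] = pivot column of row i
    for col in range(ncols):
        r = len(pivots)
        if r == len(rows):
            break
        p = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if p is None:
            continue                              # no pivot available: a free column
        rows[r], rows[p] = rows[p], rows[r]
        rows[r] = [a / rows[r][col] for a in rows[r]]
        for i in range(len(rows)):
            if i != r:
                f = rows[i][col]
                rows[i] = [a - f * w for a, w in zip(rows[i], rows[r])]
        pivots.append(col)
    free = [j for j in range(ncols) if j not in pivots]
    if not free:
        return None
    v = [Fraction(0)] * ncols
    v[free[0]] = Fraction(1)
    for i, col in enumerate(pivots):
        v[col] = -rows[i][free[0]]
    return v


def to_basic_solution(A, x):
    """Shrink the support of a solution x of A x = b until its columns are independent."""
    x = [Fraction(v) for v in x]
    while True:
        support = [j for j, v in enumerate(x) if v != 0]
        lam = null_vector([[row[j] for j in support] for row in A]) if support else None
        if lam is None:
            return x
        # smallest step (of either sign) that makes a support coordinate vanish
        alpha = min((-x[j] / l for j, l in zip(support, lam) if l != 0), key=abs)
        for j, l in zip(support, lam):
            x[j] += alpha * l


A = [[1, 1, -1, 0], [1, 2, 0, -1]]
x_basic = to_basic_solution(A, [1, 1, 3, 2])      # (1, 1, 3, 2) solves A x = (-1, 1)
```

Each pass removes at least one index from the support while keeping $Ax = b$, which is the finiteness argument given in the text.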

We assume (without loss of generality) that the problem matrix A has rank m. Since x̄ is a basic solution, its support contains at most m indices and the corresponding columns of A are linearly independent. If necessary, we extend the support of x̄ to obtain a set B of m indices such that the columns $A_k$ of A with k ∈ B are linearly independent. Any such set B is called a basic index set (or basis) for x̄. The remaining indices are called the nonbasic indices; their set is denoted as N.

By construction, the submatrix $A_B$ of A, consisting of the columns indexed by elements of B, is nonsingular. Hence, the current basic feasible solution x̄ is given by

$$\bar{x}_B = A_B^{-1} b, \qquad \bar{x}_N = 0, \tag{2}$$

and, defining $\bar{y} \in \mathbb{R}^m$ by

$$A_B^T \bar{y} = c_B, \tag{3}$$

the current objective value satisfies

$$c^T \bar{x} = c_B^T \bar{x}_B + c_N^T \bar{x}_N = c_B^T A_B^{-1} b = b^T \bar{y}. \tag{4}$$

Letting

$$\bar{s} := c - A^T \bar{y}, \tag{5}$$

we claim that x̄ is an optimal solution if s̄ ≥ 0, thus answering question (1).

Here is the proof of this claim. We rewrite the constraints as

$$A_B x_B + A_N x_N = b, \qquad x_B \ge 0, \quad x_N \ge 0 \tag{6}$$

and multiply from the left by the inverse of $A_B$, leaving us with the equivalent system

$$x_B = \bar{x}_B - A_B^{-1} A_N x_N, \qquad x_B \ge 0, \quad x_N \ge 0. \tag{7}$$

The objective function can now be written as

$$\begin{aligned}
c^T x &= c_B^T x_B + c_N^T x_N \\
&= c_B^T \bar{x}_B + \left( c_N^T - c_B^T A_B^{-1} A_N \right) x_N \\
&= c_B^T \bar{x}_B + \left( c_N - A_N^T \bar{y} \right)^T x_N \\
&= c^T \bar{x} + \bar{s}_N^T x_N.
\end{aligned} \tag{8}$$

Now suppose that s̄ ≥ 0, and let x be any feasible solution. Then, by Eq. (8), we have

$$c^T x = c^T \bar{x} + \bar{s}_N^T x_N \ge c^T \bar{x},$$

since $\bar{s}_N \ge 0$ and $x_N \ge 0$. This proves the claim that x̄ is optimal if s̄ ≥ 0.

Next, we deal with question (2) and consider the case where s̄ is not nonnegative. So s̄ has at least one negative entry. Let $\bar{s}_k < 0$. Then k ∈ N, because $\bar{s}_B = 0$. Expression (8) makes clear how $c^T x$ will change if we increase $x_k$ while leaving the other nonbasic variables zero (their current value). We then have

$$c^T x = c^T \bar{x} + \bar{s}_k x_k. \tag{9}$$

This makes clear that the objective value will decrease if $x_k$ increases. Of course, the larger $x_k$, the larger the decrease of the objective value. However, when increasing $x_k$ we have to take care of the feasibility. In this respect Eq. (7) is useful. Taking for $x_N$ the vector with $x_k$ in the k-position, and zeros elsewhere, we obtain $x_B$ as a function of $x_k$:

$$x_B = \bar{x}_B - x_k A_B^{-1} A_k.$$

It is convenient to introduce the matrix

$$T^B := A_B^{-1} A. \tag{10}$$

Note that this is an m × n matrix whose rows are naturally indexed by the elements of B. We can now rewrite the above expression for $x_B$ as follows:

$$x_B = \bar{x}_B - x_k T^B_k. \tag{11}$$

Hence, if $T^B_k \le 0$ then $x_B \ge 0$ for all nonnegative values of $x_k$. Letting $x_k$ go to infinity, the objective value goes to minus infinity. We conclude that the problem is unbounded if $T^B_k \le 0$.

In the other case, when $T^B_k$ has one or more positive entries, there exists a maximal value of $x_k$ for which $x_B \ge 0$. In fact this value of $x_k$ is given by

$$\tilde{x}_k = \min_{i \in B} \left\{ \frac{\bar{x}_i}{T^B_{ik}} : T^B_{ik} > 0 \right\}. \tag{12}$$

For this value of $x_k$ at least one (but possibly more) entries of $x_B$ vanish. Let $x_\ell$ be one of these entries. Then the new solution is a basic solution with index set

$$B' := (B \cup \{k\}) \setminus \{\ell\}. \tag{13}$$

The new objective value follows from Eq. (9). If $\tilde{x}_k > 0$ the new value is smaller than $c_B^T A_B^{-1} b$, the current value.

Thus, we have answered question (2). In doing so, we have just described a typical iteration of the Simplex method. Starting with a basic feasible solution, with index set B for the basic variables, we moved to another basic feasible solution whose basic index set is given by Eq. (13). We call k the entering index and ℓ the leaving index. The entering index is chosen first: for this one may take any index k such that $\bar{s}_k < 0$. Given k we choose the leaving index according to the above quotient rule [Eq. (12)]. For the leaving index ℓ we may take any index yielding the minimal quotient in Eq. (12). The pair (k, ℓ) is called the pivoting pair of the Simplex iteration.
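As a small illustration of Eqs. (2), (5), and (10), the sketch below computes the basic solution, the matrix $T^B$, and the vector s̄ for a given basis, and arranges them as a tableau whose first row holds the objective value and −s̄. It assumes the given index set really is a basis (a singular $A_B$ would raise an error), uses 1-based variable indices, and relies on the identity $\bar{s} = c - (T^B)^T c_B$, which follows from Eqs. (3), (5), and (10). The function name and the small test problem are choices made here for illustration.

```python
from fractions import Fraction


def tableau_for_basis(A, b, c, basis):
    """Return [[c^T x, -s_1..-s_n], [x_B[0], T row], ...] for 1-based basis indices."""
    m, n = len(A), len(c)
    # Start from [b | A]; Gauss-Jordan on the basic columns yields [x_B | T^B].
    rows = [[Fraction(b[i])] + [Fraction(a) for a in A[i]] for i in range(m)]
    for r, k in enumerate(basis):
        p = next(i for i in range(r, m) if rows[i][k] != 0)
        rows[r], rows[p] = rows[p], rows[r]
        rows[r] = [v / rows[r][k] for v in rows[r]]
        for i in range(m):
            if i != r:
                f = rows[i][k]
                rows[i] = [v - f * w for v, w in zip(rows[i], rows[r])]
    c_B = [Fraction(c[k - 1]) for k in basis]
    value = sum(cb * row[0] for cb, row in zip(c_B, rows))            # c_B^T x_B, Eq. (4)
    s = [Fraction(c[j]) - sum(cb * row[j + 1] for cb, row in zip(c_B, rows))
         for j in range(n)]                                           # s = c - T^T c_B
    return [[value] + [-v for v in s]] + rows


# min 3 x1 + 2 x2  s.t.  x1 + x2 - x3 = -1,  x1 + 2 x2 - x4 = 1,  x >= 0, basis {1, 3}
A = [[1, 1, -1, 0], [1, 2, 0, -1]]
tab = tableau_for_basis(A, [-1, 1], [3, 2, 0, 0], [1, 3])
is_optimal = all(entry <= 0 for entry in tab[0][1:])                  # s >= 0 ?
```

For this basis the first row shows a positive entry in the column of $x_2$, that is, $\bar{s}_2 < 0$, so the optimality test of question (1) fails and a pivot is needed.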
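The iteration described in Section II.D (choose an entering index k with $\bar{s}_k < 0$, apply the quotient rule of Eq. (12), then pivot) can be sketched on a tableau as follows. The layout is an assumption of this sketch: row 0 stores the objective value followed by −s̄, each further row stores a basic value followed by the corresponding row of $T^B$, and `basis[r - 1]` is the 1-based index of the variable belonging to row r. Ties are broken by the smallest index.

```python
from fractions import Fraction


def pivot(tab, r, k):
    """Pivot on row r, column k; column 0 holds the objective and basic values."""
    tab = [[Fraction(v) for v in row] for row in tab]
    tab[r] = [v / tab[r][k] for v in tab[r]]
    for i in range(len(tab)):
        if i != r:
            f = tab[i][k]
            tab[i] = [v - f * w for v, w in zip(tab[i], tab[r])]
    return tab


def primal_step(tab, basis):
    """One primal Simplex iteration; returns (status, tableau, basis)."""
    tab = [[Fraction(v) for v in row] for row in tab]
    entering = [k for k in range(1, len(tab[0])) if tab[0][k] > 0]    # -s_k > 0
    if not entering:
        return "optimal", tab, basis
    k = entering[0]
    rows = [r for r in range(1, len(tab)) if tab[r][k] > 0]
    if not rows:
        return "unbounded", tab, basis
    # quotient rule, Eq. (12); ties resolved by the smallest basic index
    r = min(rows, key=lambda i: (tab[i][0] / tab[i][k], basis[i - 1]))
    return "pivoted", pivot(tab, r, k), basis[:r - 1] + [k] + basis[r:]


# Tableau of min 3 x1 + 2 x2, x1 + x2 - x3 = -1, x1 + 2 x2 - x4 = 1 for the basis {1, 3}
start = [[3, 0, 4, 0, -3], [1, 1, 2, 0, -1], [2, 0, 1, 1, -1]]
status1, tab1, basis1 = primal_step(start, [1, 3])
status2, tab2, basis2 = primal_step(tab1, basis1)
```

Here the first call pivots with the pair (k, ℓ) = (2, 1), and the second call reports that no entering index remains.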

E. Simplex Tableaus

When the given LO-problem is small enough, the Simplex iteration can be performed by hand. In such cases it is usual, and useful, to make use of a so-called Simplex tableau. The Simplex tableau belonging to the basis B looks as follows.

$$\begin{array}{c|c} c^T \bar{x} & -\bar{s}^T \\ \hline \bar{x}_B & T^B \end{array} \tag{14}$$

If the pivoting pair is (k, ℓ), then the Simplex tableau belonging to the new basis B′ simply arises by pivoting on the element $T^B_{\ell k}$.

To illustrate the use of the Simplex tableau we consider the following LO-problem.

$$\begin{array}{rl}
\min & 3x_1 + 2x_2 \\
\text{s.t.} & x_1 + x_2 \ge -1 \\
& x_1 + 2x_2 \ge 1 \\
& x_1 \ge 0, \; x_2 \ge 0.
\end{array}$$

By adding surplus variables $x_3$ and $x_4$ the problem assumes the standard form:

$$\begin{array}{rl}
\min & 3x_1 + 2x_2 \\
\text{s.t.} & x_1 + x_2 - x_3 = -1 \\
& x_1 + 2x_2 - x_4 = 1 \\
& x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0, \; x_4 \ge 0.
\end{array}$$

The vector x̄ = (1, 0, 2, 0) is a basic solution. The set B of basic indices is {1, 3}. We thus have

$$A_B = \begin{pmatrix} 1 & -1 \\ 1 & 0 \end{pmatrix}, \qquad A_B^{-1} = \begin{pmatrix} 0 & 1 \\ -1 & 1 \end{pmatrix}.$$

Furthermore,

$$\bar{y} = A_B^{-T} c_B = \begin{pmatrix} 0 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 3 \end{pmatrix},$$

and, using Eq. (5),

$$\bar{s} = \begin{pmatrix} 3 \\ 2 \\ 0 \\ 0 \end{pmatrix} - \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ -1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} 0 \\ 3 \end{pmatrix} = \begin{pmatrix} 0 \\ -4 \\ 0 \\ 3 \end{pmatrix}.$$

Using this, the Simplex tableau for B becomes

              x1     x2     x3     x4
        3  |   0      4      0     −3
  x1    1  |   1     [2]     0     −1
  x3    2  |   0      1      1     −1

The variable names outside the borders serve to indicate to which variables the rows and columns belong. The tableau shows that $\bar{s}_2 = -4 < 0$. In its column we find the pivot (marked here with brackets) (k, ℓ) = (2, 1) by using the quotient rule. After the pivot we get the tableau

              x1     x2     x3     x4
        1  |  −2      0      0     −1
  x2   1/2 |  1/2     1      0    −1/2
  x3   3/2 | −1/2     0      1    −1/2

Note that the new basic variables are still nonnegative, as they should be. But we now have s̄ ≥ 0, which means that the optimal solution has been found. We read it from the last tableau: the nonbasic variables are equal to zero and the basic variables can be read from its first column: x̄ = (0, 1/2, 3/2, 0). Hence, the optimal solution of the original problem is $x_1 = 0$, $x_2 = 1/2$, and the optimal value is 1.

F. Finiteness of the Simplex Method

Unfortunately, it may happen that the new solution is not really "better" than the old one. The new objective value may be equal to the old one, and this happens exactly if $\tilde{x}_k = 0$, which means that $\bar{x}_\ell = 0$. In that case we call the iteration degenerate. Then no progress is made in terms of the objective value. Even worse, examples exist in which, after a sequence of degenerate iterations, the new basic index set is equal to a basic index set occurring earlier in the sequence [Avis and Chvátal (1978); Beale (1955); Hoffman (1953); Lee (1997)]. This is the so-called phenomenon of cycling.

It is most important to avoid cycling. Although it is impossible to avoid degenerate iterations, it is possible to avoid the occurrence of cycles. This can be achieved by using a cycle-breaking rule. Such a rule imposes some conditions on the choice of the entering and the leaving variable and its use guarantees that the Simplex method will never produce the same basic index set twice, see Section II.G. What does this mean? Since the number of basic index sets is finite it means that after a finite number of iterations the Simplex method terminates. There are two possible reasons for the method to terminate. The first reason is that there is no candidate for k. In that case we have
s̄ ≥ 0, and as we saw before this implies the optimality of the current solution x̄. The second possible reason for termination of the method is that after having found the entering variable k there is no candidate for ℓ; as we established earlier this implies that the problem is unbounded.

G. Cycle-Breaking Rules

One elegant way to avoid cycles is to use a so-called least index rule. Then, whenever there is any ambiguity in the choice of k or ℓ we choose the smallest possible index among the candidate indices. This rule was first proposed by Bland (1977).

A second cycle-breaking rule is the lexicographic rule [Dantzig et al. (1955)]. This rule uses the lexicographic ordering "≺" of vectors: we say that v ≺ w ($v, w \in \mathbb{R}^p$) if the vector u = w − v is lexicographically positive, i.e., u ≠ 0 and the first nonzero coordinate of u is positive. The lexicographic rule gives no condition on the choice of the entering index k, but it requires the leaving index ℓ to be taken such that the pivot element $T^B_{\ell k}$ is positive and the vector

$$\frac{\left( \bar{x}_B \;\; T^B \right)_{\ell,:}}{T^B_{\ell k}}$$

is lexicographically minimal. Here, $(\bar{x}_B \;\; T^B)$ is the matrix obtained by extending $T^B$ to the left with the column $\bar{x}_B$. So this matrix is nothing else than the part of the Simplex tableau below its first row; $(\bar{x}_B \;\; T^B)_{\ell,:}$ denotes the row of the tableau indexed by the basic variable ℓ.

If initially all rows in the tableau below the first row are lexicographically positive, which can be realized easily, then the lexicographic rule guarantees that in all subsequent tableaus this property is maintained, and that the first row is lexicographically decreasing. The last property prevents the occurrence of cycles.

H. Initialization: Two-Phase Method

In Section II.D we assumed that we were given a basic feasible solution x̄. We show in this section how to obtain such a solution, if it exists.

We introduce artificial variables $t_i$ (1 ≤ i ≤ m) and consider the problem

$$\begin{array}{rl}
\text{minimize} & e^T t \\
\text{subject to} & Ax + t = b \\
& x \ge 0, \; t \ge 0,
\end{array} \tag{15}$$

where e denotes the all-one vector. Without loss of generality we may assume that b ≥ 0 in Eq. (1), because otherwise we multiply constraints for which the corresponding entry in b is negative, by −1. Then x = 0, t = b is a feasible solution for Eq. (15), and, obviously, this solution is basic. Hence, we can solve Eq. (15) with the Simplex method. Since Eq. (15) is a bounded problem, this yields a basic feasible solution (x̄, t̄) of Eq. (15). Now two cases can occur: $e^T \bar{t} > 0$ or $e^T \bar{t} = 0$. In the first case the problem must be infeasible, because any feasible solution of Eq. (1) yields a feasible solution to Eq. (15) with t = 0. In the second case x̄ is a basic feasible solution of Eq. (1).

It is now clear that we can solve Eq. (1) with the Simplex method in two phases. In phase I we solve Eq. (15); if the problem is infeasible it is detected in this phase, otherwise we obtain a basic feasible solution of Eq. (1) that can be used in phase II to solve the problem. Phase II either yields an optimal basic solution, or detects that the problem is (feasible and) unbounded.

I. Duality

Suppose that Eq. (1) has an optimal solution. Then, by applying the Simplex method, we can obtain an optimal basic index set B and the corresponding basic feasible solution x̄. Below, we use again the vectors $\bar{y} \in \mathbb{R}^m$ and $\bar{s} \in \mathbb{R}^n$ as introduced in Eqs. (3) and (5), respectively.

1. Dual of the Standard Problem

Recall from Section II.D that s̄ ≥ 0. Hence, (ȳ, s̄) is feasible for the following maximization problem:

$$\begin{array}{rl}
\text{maximize} & d = b^T y \\
\text{subject to} & A^T y + s = c \\
& s \ge 0.
\end{array} \tag{16}$$

2. Duality Results

We claim that (ȳ, s̄) is an optimal solution of Eq. (16). This is a consequence of the following result.

Theorem 1 (Weak duality) Suppose that x and (y, s) are feasible for Eqs. (1) and (16), respectively. Then

$$c^T x - b^T y \ge 0.$$

Proof: We have

$$x^T s = x^T (c - A^T y) = c^T x - (Ax)^T y = c^T x - b^T y.$$

Since x ≥ 0 and s ≥ 0, $x^T s \ge 0$.

Theorem 1 reveals a close relation between problems (1) and (16). From now on we call these problems the primal problem and dual problem, respectively. The theorem implies that if x is feasible for the primal problem (or shortly,
primal feasible) then $c^T x$ is an upper bound for the optimal value $d^*$ of the dual problem, and, vice versa, if (y, s) is dual feasible then $b^T y$ is a lower bound for the optimal value $p^*$ of the primal problem. As a consequence, if $c^T x = b^T y$, then x is an optimal solution of the primal problem and (y, s) is an optimal solution of the dual problem. Hence, by taking x = x̄, y = ȳ and s = s̄, the above claim now follows from Eq. (4).

A direct consequence of Theorem 1 is that if either of problems (1) or (16) is unbounded, then the other problem is infeasible.

In summary, if the primal problem has an optimal solution then so has the dual problem and the optimal values coincide. If the primal problem is unbounded then the dual problem is infeasible. In Section II.I.3 we will see that problem (16) also has a dual problem, which is exactly (1). Hence, we may interchange the words "primal" and "dual" in the first sentence of this paragraph. Thus, we have the following result, due to Von Neumann (1947).

Theorem 2 (Strong duality) If the primal and dual problem are both feasible then both problems have optimal solutions and their optimal values are equal. Otherwise, neither of the two problems has optimal solutions.

If neither of the two problems has optimal solutions then both are infeasible, or one is infeasible while the other is unbounded.

A second duality result is due to Goldman and Tucker (1956).

Theorem 3 (Goldman-Tucker) If both the primal and dual problem are feasible then there exists a strictly complementary pair of optimal solutions, i.e., optimal solutions x and (y, s) such that x + s > 0.

This duality result is less well-known. Its interest is that interior-point methods produce strictly complementary solutions (cf. Section III.J). It has recently become clear that such solutions play an important role when dealing with sensitivity analysis (cf. Section IV.B).

Let us also mention that the primal problem is infeasible if and only if there exists a vector y such that $A^T y \le 0$ and $b^T y > 0$, and the dual problem is infeasible if and only if there exists a vector x ≥ 0 such that $Ax = 0$ and $c^T x < 0$. These statements are examples of theorems of the alternatives and are equivalent to the well-known Farkas' lemma. See, e.g., Schrijver (1986).

3. Dual of a General LO-Problem

The dual of the standard problem, as given in Section II.I.1, can be reformulated in the following way:

$$\begin{array}{rl}
\text{maximize} & b^T y \\
\text{subject to} & A^T y \le c.
\end{array}$$

In this way we get a 1–1 correspondence between the variables in the primal problem and the constraints in the dual problem, and vice versa. Note that the primal constraints are equality constraints and the dual variables are free (i.e., without sign constraints). On the other hand, the primal variables are nonnegative and the dual constraints are inequality constraints of '≤' type. Also note that the roles of b and c are exchanged when taking the dual, and the problem matrix A is transposed.

We can associate with each LO-problem (not necessarily in the standard form) a dual problem. To this end we first put the problem in the standard form (cf. Section II.A), and then take the dual of this problem. The dual problem obtained in this way, however, in general does not have such a nice 1–1 correspondence between variables on one side and constraints on the other side. With little extra effort it is possible to reformulate the dual such that we retain a 1–1 correspondence. In this way we get a simple and natural relation between a general LO-problem and its dual, at the same time making it quite easy to write down the dual problem of an arbitrary LO-problem. The relation is shown in Table I.

TABLE I Scheme for Dualizing

  min c^T x           max b^T y
  '=' Constraint      Free variable
  '≥' Constraint      Variable ≥ 0
  '≤' Constraint      Variable ≤ 0
  Free variable       '=' Constraint
  Variable ≥ 0        '≤' Constraint
  Variable ≤ 0        '≥' Constraint

The scheme can be used both from the left to the right and from the right to the left. Thus, e.g., if the primal problem is a maximization problem, the dual problem is a minimization problem and, e.g., a '≥'-constraint in the primal problem gives rise to a nonpositive variable in the dual problem. An obvious consequence is that we may say, in short, that the dual of the dual problem is the primal problem.

J. The Dual Simplex Method

Given any basic index set B, following the method of Section II.I we can associate with B a basic solution x̄ of the primal problem and a basic solution (ȳ, s̄) of the dual problem as done in Eqs. (2), (3), and (5). Note that if x̄ ≥ 0 and s̄ ≥ 0 then x̄ and (ȳ, s̄) are primal and dual feasible respectively, and due to Eq. (4), these solutions are optimal. In this section we do not assume that x̄ is primal feasible but we assume that (ȳ, s̄) is dual feasible. A typical iteration in the dual Simplex method then goes as follows. Since x̄ is not primal feasible we may choose
an index ℓ ∈ B such that $\bar{x}_\ell < 0$. Our aim is to replace the index ℓ ∈ B by some index k ∉ B, thus getting the new basic index set [see Eq. (13)]

$$B' := (B \cup \{k\}) \setminus \{\ell\}.$$

The new Simplex tableau will be obtained by using $T^B_{\ell k}$ as pivot. Denoting the new basic solutions as x̃ and (ỹ, s̃), we then have

$$\tilde{x}_\ell = \bar{x}_\ell - \tilde{x}_k T^B_{\ell k} = 0 \tag{17}$$

and

$$\tilde{s}_i = \bar{s}_i - \frac{T^B_{\ell i}}{T^B_{\ell k}} \, \bar{s}_k, \qquad 1 \le i \le n. \tag{18}$$

We want to have $\tilde{x}_k > 0$. Therefore, since $\bar{x}_\ell < 0$, Eq. (17) makes it clear that we need

$$T^B_{\ell k} < 0. \tag{19}$$

We further want to maintain dual feasibility, i.e., s̃ ≥ 0. Assuming Eq. (19), it is obvious that $\tilde{s}_i \ge 0$ whenever $T^B_{\ell i} \ge 0$, because $\bar{s}_i \ge 0$ and $\bar{s}_k \ge 0$. Thus, dual feasibility is maintained if and only if k is such that

$$\bar{s}_i - \frac{T^B_{\ell i}}{T^B_{\ell k}} \, \bar{s}_k \ge 0$$

whenever $T^B_{\ell i} < 0$. As a consequence, we obtain

$$k = \arg\min_i \left\{ \frac{\bar{s}_i}{-T^B_{\ell i}} : T^B_{\ell i} < 0 \right\}.$$

This is the quotient rule for the dual Simplex method. With the given leaving index ℓ this rule finds an entering index k such that dual feasibility of the new basic solution is guaranteed.

The above method works only if there are suitable candidates for the entering index k. What if there are no such candidates? Then we have

$$\bar{x}_\ell < 0 \quad \text{and} \quad T^B_{\ell k} \ge 0 \ \text{ for all } k. \tag{20}$$

Now recall from Eq. (7) that

$$x_B = \bar{x}_B - T^B_N x_N.$$

Hence, if Eq. (20) holds then we have for the ℓ-entry of any primal feasible vector x:

$$x_\ell = \bar{x}_\ell - \sum_{k \in N} T^B_{\ell k} x_k \le \bar{x}_\ell < 0,$$

showing that the problem cannot be feasible. We conclude that if no candidate for the entering variable exists then the problem is infeasible. Note that this argument does not use the dual feasibility of the current basic solution.

Let us conclude this section by mentioning that in some natural way the dual Simplex method is completely equivalent with the (primal) Simplex method as treated earlier. The natural correspondence between the two methods follows by noting that when applying the Simplex method to solve problem (1) then this also yields the solution of the dual problem (16). Therefore, since the dual of (16) is exactly (1), when solving (16) with the primal Simplex method we also obtain the solution of (1). In fact, it can be seen that applying the dual Simplex method to (1) is essentially the same as applying the primal Simplex method to (16).

With the above observation in mind it will be no surprise that cycling of the dual Simplex method can be prevented by using the least index rule: whenever there is any ambiguity in the choice of ℓ or k, choose the smallest possible index among the candidate indices.

K. The Criss-Cross Method

So far, we have dealt with two variants of the Simplex method: the primal Simplex method and the dual Simplex method. The primal Simplex method generates a sequence of primal feasible basic solutions, whereas the dual Simplex method uses only dual feasible basic solutions. Feasibility is maintained by choosing the pivoting pair (k, ℓ) in each iteration according to the primal and dual quotient rule, respectively. To start such a method, one first needs to generate a feasible basic solution, thus separating the work into two phases. In Phase I a feasible basic solution is generated; this phase requires the introduction of the artificial variables. Phase II is used to generate an optimal solution.

We discuss in this section how we can avoid artificial variables, and solve the problem in one phase. The underlying idea, due to Zionts (1969), is to neglect the feasibility issue and to aim for feasibility at both sides simultaneously.

Let us call index i primal infeasible if $\bar{x}_i < 0$ and otherwise primal feasible. Similarly, index i is dual infeasible if $\bar{s}_i < 0$ and otherwise dual feasible. Recall that any basic solution satisfies $\bar{x}_N = 0$ and $\bar{s}_B = 0$. Hence, for each index either $\bar{x}_i = 0$ or $\bar{s}_i = 0$. Therefore, it is impossible for an index i to be primal infeasible and dual infeasible at the same time. If all indices are feasible then the basic solutions are optimal, and we are done. Otherwise, we take an infeasible index i. If i is dual infeasible then, putting k := i we look for an index ℓ such that $T^B_{\ell k} > 0$; if such an ℓ does not exist the problem is unbounded, or infeasible. If i is primal infeasible then, putting ℓ := i we look for an index k such that $T^B_{\ell k} < 0$ [cf. Eq. (19)]; if such a k does not exist the problem is infeasible. If a pair (k, ℓ) has been found then we perform an iteration with this pair as pivoting pair. Hoping for the best, this process is repeated.
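The criss-cross iteration of Section II.K can be sketched as follows. Because the text leaves the choice of the infeasible index and of the pivot open, this sketch fixes one particular choice as an assumption: always take the smallest infeasible index and, among the admissible pivots, the smallest index again. The tableau layout (row 0 holds the objective value and −s̄, `basis[r - 1]` is the 1-based variable of row r) and the iteration cap `max_iter` are likewise conventions of this sketch only.

```python
from fractions import Fraction


def pivot(tab, r, k):
    """Pivot on row r, column k; column 0 holds the objective and basic values."""
    tab = [[Fraction(v) for v in row] for row in tab]
    tab[r] = [v / tab[r][k] for v in tab[r]]
    for i in range(len(tab)):
        if i != r:
            f = tab[i][k]
            tab[i] = [v - f * w for v, w in zip(tab[i], tab[r])]
    return tab


def criss_cross(tab, basis, max_iter=100):
    """Criss-cross pivoting with smallest-index choices; no feasible start is needed."""
    tab = [[Fraction(v) for v in row] for row in tab]
    basis = list(basis)
    n = len(tab[0]) - 1
    for _ in range(max_iter):
        value = {basis[r - 1]: tab[r][0] for r in range(1, len(tab))}
        bad = [i for i in range(1, n + 1) if value.get(i, 0) < 0 or tab[0][i] > 0]
        if not bad:
            return "optimal", tab, basis
        i = bad[0]
        if i in basis:                        # primal infeasible: l = i leaves
            r = basis.index(i) + 1
            cand = [k for k in range(1, n + 1) if tab[r][k] < 0]
            if not cand:
                return "infeasible", tab, basis
            k = cand[0]
        else:                                 # dual infeasible: k = i enters
            k = i
            cand = sorted((basis[r - 1], r) for r in range(1, len(tab)) if tab[r][k] > 0)
            if not cand:
                return "unbounded or infeasible", tab, basis
            r = cand[0][1]
        tab = pivot(tab, r, k)
        basis[r - 1] = k
    return "iteration limit", tab, basis


# Two tableaus of min 3 x1 + 2 x2, x1 + x2 - x3 = -1, x1 + 2 x2 - x4 = 1:
# basis {1, 3} (primal feasible only) and basis {1, 2} (neither primal nor dual feasible).
res_a = criss_cross([[3, 0, 4, 0, -3], [1, 1, 2, 0, -1], [2, 0, 1, 1, -1]], [1, 3])
res_b = criss_cross([[-5, 0, 0, -4, 1], [-3, 1, 0, -2, 1], [2, 0, 1, 1, -1]], [1, 2])
```

Both runs stop after a single pivot at the basis {2, 3} with objective value 1, even though the second starting basis is infeasible on both sides.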

L. Example of Cycling

Note that this so-called criss-cross method differs from the primal and the dual Simplex method in many ways. Since no quotient rule is used there is much more freedom in the choice of the pivoting pair, and primal and dual iterations are used in a more or less arbitrary order. It may be clear that there is no guarantee that the process will stop. In fact, it is quite easy to find examples of cycling with this method. Below is an example of a cycle that starts with the first tableau from the example in Section II.E.

              x1     x2     x3     x4
        3  |   0      4      0     −3
  x1    1  |   1      2      0     −1
  x3    2  |   0      1      1     −1

              x1     x2     x3     x4
       −5  |   0      0     −4      1
  x1   −3  |   1      0     −2      1
  x2    2  |   0      1      1     −1

              x1     x2     x3     x4
       −2  |  −1      0     −2      0
  x4   −3  |   1      0     −2      1
  x2   −1  |   1      1     −1      0

              x1     x2     x3     x4
        0  |  −3     −2      0      0
  x4   −1  |  −1     −2      0      1
  x3    1  |  −1     −1      1      0

              x1     x2     x3     x4
        3  |   0      4      0     −3
  x1    1  |   1      2      0     −1
  x3    2  |   0      1      1     −1

The fifth tableau is exactly the same as the first tableau. Cycling of the criss-cross method can be prevented by the least index rule: finiteness of the criss-cross method is then guaranteed [Terlaky (1985)].

M. The Klee-Minty Example

For m ≥ 1 we define the polytope $P_m$ in $\mathbb{R}^m$ as follows:

$$\{\, y : \varepsilon y_{j-1} \le y_j \le 1 - \varepsilon y_{j-1}, \; 1 \le j \le m \,\},$$

where $y_0 = 0$ and 0 ≤ ε < 0.5. We consider the LO-problem

$$\max \{\, y_m : y \in P_m \,\}.$$

Note that y = 0 is feasible for this problem, and one easily verifies that it is a basic solution.

If ε = 0 then $P_m$ is the m-dimensional unit cube:

$$\{\, y \in \mathbb{R}^m : 0 \le y_j \le 1, \; 1 \le j \le m \,\},$$

which has $2^m$ vertices. Hence, for (small) positive values of ε we can consider $P_m$ as a perturbation of the unit cube. In fact, because of 0 ≤ ε < 0.5, the polytope $P_m$ also has $2^m$ vertices. Moreover, both the primal Simplex method and the criss-cross method, when started at y = 0, and with the least-index rule, pass through all these vertices. Hence, the number of iterations will be $2^m - 1$, in both cases [Roos (1990) and Zionts (1969)]. See Fig. 1, which depicts the Klee-Minty path for m = 3. Being optimistic, let us assume that one needs $10^{-9}$ seconds per iteration. Then the computational time becomes $2^m \cdot 10^{-9}$ sec. For m = 50 this is about 13 days, and for m = 60 even 36.5 years!

N. Computational Issues

Despite the exponential example of Klee and Minty, in practice the Simplex method is a very powerful method. This is due to the fact that the average-case behavior is not exponential but polynomial. On the average, taken over a wide class of representative LO-problems, the number of iterations is bounded above by a polynomial in m and n. This has been shown by Smale (1983) and Borgwardt (1982). As a rule of thumb one may state that on the average the number of iterations grows linearly with the number n of variables and the number m of constraints. Actually, this number highly depends on the choice of the pivots. Quite efficient pivot rules are used nowadays. The search for a theoretically efficient pivot rule goes on. So far, the best thing a pivot rule can have is a finiteness proof. Unfortunately, for all rules with such a proof an exponential example like that of Klee and Minty exists. The history of implementations of the Simplex method is well documented in Orchard-Hays (1990).

III. INTERIOR-POINT METHODS

For a long time the Simplex algorithm was the only practical algorithm for linear optimization. Many people tried to justify the remarkable efficiency of the method by providing a theoretical bound on the number of Simplex iterations. To date, no one has yet given a polynomial bound for general problems.
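The running-time estimate quoted for the Klee-Minty example in Section II.M is easy to check numerically. The short sketch below simply evaluates $2^m \cdot 10^{-9}$ seconds and converts the result; the constant and function names are chosen here for illustration, and a year is taken as 365.25 days.

```python
SECONDS_PER_ITERATION = 1e-9      # the optimistic assumption made in the text
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def klee_minty_seconds(m):
    """Time for 2^m iterations at 1e-9 seconds each."""
    return (2 ** m) * SECONDS_PER_ITERATION


days_for_50 = klee_minty_seconds(50) / SECONDS_PER_DAY
years_for_60 = klee_minty_seconds(60) / SECONDS_PER_YEAR
print(f"m = 50: {days_for_50:.1f} days, m = 60: {years_for_60:.1f} years")
```

Ten extra dimensions multiply the iteration count by $2^{10} = 1024$, which is what turns days into decades.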

FIGURE 1 Klee-Minty path for m = 3.

The interest in interior-point methods for linear programming emerged from Karmarkar's contribution in 1984. He proposed a totally new method that enjoyed both polynomial complexity and practical efficiency. The field became one of the most active in the area of mathematical optimization. It introduced new ideas and techniques that now have received their own place among the basic tools in optimization. In this section we survey the theory of interior-point methods, including the analysis of their complexity.

The heart of this theory is the concept of central path, a continuous curve in the interior of the feasible set that converges to an optimal point. Most interior-point methods follow the central path, and can be characterized as path-following methods. The concept captures the basic ideas that underlie some of the most prominent interior-point methods. It provides a unified framework for a variety of existing efficient methods.

The problem we consider in this section is

$$\begin{array}{rl}
\text{minimize} & p = c^T z \\
\text{subject to} & Az \ge b \\
& z \ge 0.
\end{array} \tag{21}$$

This is the LO-problem in canonical form. The dual problem is

$$\begin{array}{rl}
\text{maximize} & d = b^T y \\
\text{subject to} & A^T y \le c \\
& y \ge 0.
\end{array} \tag{22}$$

A. Reduction to Canonical Form

If a given LO-problem does not have the canonical form, using the method of Section II.A the problem is first brought into the standard form (1). Choosing an arbitrary basic index set B, using the notation introduced in Section II.D, we rewrite the constraints as in Eq. (7), using also Eq. (10),

$$z_B = \bar{z}_B - T^B_N z_N, \qquad z_B \ge 0, \quad z_N \ge 0,$$

and the objective function as

$$c^T z = c^T \bar{z} + \bar{s}_N^T z_N.$$

Omitting the constant term $c^T \bar{z}$ the problem gets the canonical form

$$\begin{array}{rl}
\text{minimize} & \bar{s}_N^T z_N \\
\text{subject to} & -T^B_N z_N \ge -\bar{z}_B \\
& z_N \ge 0.
\end{array}$$

B. Reduction to Feasibility Problem

By Theorem 2, if one of the two problems (21) and (22) has an optimal solution then so has the other, and then their optimal values coincide. Using also Theorem 1, this is the case if and only if the system

$$Az \ge b, \quad z \ge 0,$$
$$-A^T y \ge -c, \quad y \ge 0,$$
$$b^T y - c^T z \ge 0$$

has a solution, and any such solution provides optimal solutions for (21) and (22). With $\kappa = 1$, the above system can be rewritten as

$$\begin{pmatrix} 0 & A & -b \\ -A^T & 0 & c \\ b^T & -c^T & 0 \end{pmatrix} \begin{pmatrix} y \\ z \\ \kappa \end{pmatrix} \ge 0, \qquad \begin{pmatrix} y \\ z \\ \kappa \end{pmatrix} \ge 0. \tag{23}$$

Here the zeros represent matrices (or vectors) of appropriate sizes. Note that by introducing the variable $\kappa$, the system became homogeneous: if $(y, z, \kappa)$ is a solution then $\lambda(y, z, \kappa)$ is also a solution, for any positive $\lambda$. Hence, problem (23) has a solution with $\kappa = 1$ if and only if it has a solution with $\kappa > 0$. Given any solution $(y, z, \kappa)$ of (23) with $\kappa > 0$, then

$$z^* = \frac{z}{\kappa}, \qquad y^* = \frac{y}{\kappa}$$

are optimal solutions of (21) and (22).

Thus we may conclude that (21) and (22) have optimal solutions if and only if (23) has a solution with $\kappa > 0$. The extra variable $\kappa$ is called the homogenizing variable. Note that problem (23) always admits the zero solution $z = 0$, $y = 0$, and $\kappa = 0$. If $\kappa = 0$ for every solution, then we may conclude that (21) and (22) have no optimal solutions, i.e., these problems are infeasible or unbounded.

C. Embedding into Self-Dual Model

To simplify notations, we use the matrix $\bar{M}$ and the vector $\bar{x}$ defined by

$$\bar{M} = \begin{pmatrix} 0 & A & -b \\ -A^T & 0 & c \\ b^T & -c^T & 0 \end{pmatrix}, \qquad \bar{x} = \begin{pmatrix} y \\ z \\ \kappa \end{pmatrix}.$$

Then Eq. (23) can be written as

$$\bar{M}\bar{x} \ge 0, \quad \bar{x} \ge 0. \tag{24}$$

We need to find out whether or not this inequality system has a solution with $\kappa > 0$.

When using interior-point methods it is necessary that the IPC is satisfied. In other words, there should exist a vector $x^0 > 0$ such that $\bar{M}x^0 > 0$. The system (23) certainly does not satisfy this condition, because if $y$, $z$, and $\kappa$ solve (23) then $c^T z - b^T y = 0$, and hence the last coordinate of $\bar{M}\bar{x}$ vanishes.

To circumvent this difficulty the problem is embedded into a slightly larger problem that satisfies the IPC. This goes as follows. Letting $n - 1$ denote the size of $\bar{M}$, and with $e$ denoting the all-one vector, as usual of appropriate size, we introduce

$$r = e - \bar{M}e, \qquad q = \begin{pmatrix} 0 \\ n \end{pmatrix}.$$

Now consider the system

$$\begin{pmatrix} \bar{M} & r \\ -r^T & 0 \end{pmatrix} \begin{pmatrix} \bar{x} \\ \vartheta \end{pmatrix} \ge -q, \qquad \begin{pmatrix} \bar{x} \\ \vartheta \end{pmatrix} \ge 0. \tag{25}$$

Note that if $(\bar{x}, \vartheta)$ satisfies (25) and if $\vartheta = 0$, then $\bar{x}$ is a solution of (24). On the other hand, a solution $\bar{x}$ of (24) gives rise to a solution of (25) with $\vartheta = 0$ if and only if

$$r^T \bar{x} \le n.$$

This certainly holds if $r^T \bar{x} \le 0$; otherwise a positive multiple of $\bar{x}$ yields a solution of problem (25) with $\vartheta = 0$. We conclude that the set of solutions of (25) with $\vartheta = 0$ contains all solutions of (24), possibly up to a positive factor. The new variable $\vartheta$ is called the lifting variable.

We now show that the new system satisfies the IPC. In fact the all-one vector does the work. Taking $\bar{x} = e$ and $\vartheta = 1$ we get

$$\bar{M}\bar{x} + \vartheta r = \bar{M}e + r = e$$

and

$$-r^T \bar{x} = -(e - \bar{M}e)^T e = -e^T e = -n + 1,$$

where we used that $e^T \bar{M} e = 0$, since $\bar{M}$ is skew-symmetric. Hence, we obtain

$$\begin{pmatrix} \bar{M} & r \\ -r^T & 0 \end{pmatrix} \begin{pmatrix} e \\ 1 \end{pmatrix} + q = \begin{pmatrix} e \\ 1 \end{pmatrix},$$

proving the claim.

We already observed that a solution of (25) can be useful for us only if $\vartheta = 0$. How do we get such a solution? Note that, since $q \ge 0$, $\bar{x} = 0$ and $\vartheta = 0$ are feasible for problem (25). Therefore, when minimizing $\vartheta$ subject to (25) the optimal value will be equal to zero. Defining

$$M = \begin{pmatrix} \bar{M} & r \\ -r^T & 0 \end{pmatrix}, \qquad x = \begin{pmatrix} \bar{x} \\ \vartheta \end{pmatrix},$$

$q^T x = n\vartheta$ vanishes if and only if $\vartheta = 0$, and thus we are interested in the optimal solutions of the problem

$$\text{minimize } q^T x \quad \text{subject to } Mx \ge -q,\ x \ge 0. \tag{26}$$
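The embedding of this section can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `self_dual_embedding` and the tiny data `A`, `b`, `c` are hypothetical and serve only as an illustration. It assembles the matrix of (24), the vector r, and the data M and q of problem (26), and then verifies skew-symmetry and the interior-point condition at the all-one vector.

```python
import numpy as np

def self_dual_embedding(A, b, c):
    """Build M and q of the self-dual problem (26) from the data of (21)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    m, k = A.shape                       # m constraints, k variables z
    # M_bar acts on x_bar = (y, z, kappa), as in (23) and (24).
    M_bar = np.block([
        [np.zeros((m, m)), A,                -b.reshape(m, 1)],
        [-A.T,             np.zeros((k, k)),  c.reshape(k, 1)],
        [b.reshape(1, m),  -c.reshape(1, k),  np.zeros((1, 1))],
    ])
    e = np.ones(m + k + 1)
    r = e - M_bar @ e                    # r = e - M_bar e
    n = m + k + 2                        # M_bar has size n - 1
    M = np.block([
        [M_bar,              r.reshape(-1, 1)],
        [-r.reshape(1, -1),  np.zeros((1, 1))],
    ])
    q = np.zeros(n)
    q[-1] = n                            # q = (0, ..., 0, n)
    return M, q

# Hypothetical data: minimize z1 + 2*z2 subject to z1 + z2 >= 1, z >= 0.
A = [[1.0, 1.0]]
b = [1.0]
c = [1.0, 2.0]
M, q = self_dual_embedding(A, b, c)
e = np.ones(len(q))
print(np.allclose(M, -M.T))         # M is skew-symmetric
print(np.allclose(M @ e + q, e))    # M e + q = e, so the IPC holds at x = e
```

Dense matrices are used here only for readability; a practical code would exploit the block structure of M.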

When taking the dual of this problem we get

$$\text{maximize } -q^T u \quad \text{subject to } -Mu \le q,\ u \ge 0,$$

where we used that $M^T = -M$. Since maximizing $-q^T u$ is equivalent to minimizing $q^T u$, this is exactly the same problem as (26). This is expressed by saying that problem (26) is self-dual.

We finally point out that the optimal set of (26) is bounded. For any feasible $x$ let $s(x) = Mx + q$. Then

$$\begin{aligned} e^T M x &= e^T(\bar{M}\bar{x} + \vartheta r) - r^T \bar{x} \\ &= e^T(\bar{M}\bar{x} + \vartheta r) - (e - \bar{M}e)^T \bar{x} \\ &= e^T(\bar{M}\bar{x} + \vartheta r) - e^T \bar{x} + e^T \bar{M}^T \bar{x} \\ &= \vartheta e^T r - e^T \bar{x}, \end{aligned}$$

where we used once more that $\bar{M}^T = -\bar{M}$. For the same reason we have $e^T \bar{M} e = 0$, whence $e^T r = e^T e = n - 1$. Hence,

$$e^T s(x) = \vartheta(n - 1) - e^T \bar{x} + n$$

and, finally,

$$e^T x + e^T s(x) = e^T \bar{x} + \vartheta + e^T s(x) = n(1 + \vartheta).$$

Taking $\vartheta = 0$ we get

$$e^T x + e^T s(x) = n.$$

Since $x \ge 0$ and $s(x) \ge 0$, this implies that the optimal set is bounded.

D. Central Path

In the previous section we reduced the general LO-problem to the problem of solving the self-dual problem (26). This is not exactly true, however. Remember that the real issue is to find an optimal solution with a positive homogenizing variable $\kappa = x_{n-1}$, or to establish that such a solution does not exist.

Recall that the all-one vector is feasible and

$$s(e) = e, \tag{27}$$

so the IPC is satisfied. Although the $M$ and $q$ introduced in the previous section have a very special structure, in the analysis below we allow $M$ to be any skew-symmetric matrix and $q$ any nonnegative vector.

Denoting the optimal set of (26) as $S$, and defining the index set $B$ by

$$B := \{i : x_i > 0 \text{ for some } x \in S\}, \tag{28}$$

the question is whether $n - 1 \in B$ or not. If the answer is yes, then any solution with positive homogenizing variable will yield optimal solutions for our original problems (21) and (22).

Since $M$ is skew-symmetric we have for any vector $u$

$$u^T M u = 0. \tag{29}$$

Hence, if $x$ is feasible, then, since $q = s(x) - Mx$,

$$q^T x = (s(x) - Mx)^T x = x^T s(x), \tag{30}$$

and $x$ is optimal if and only if $x_i s_i(x) = 0$ for each $i$. We will use a shorthand notation for the vector with coordinates $x_i s_i(x)$, namely $xs(x)$. Then the optimality conditions for problem (26) are given by

$$x \ge 0, \quad s(x) = Mx + q \ge 0, \quad xs(x) = 0. \tag{31}$$

The last condition is the complementarity condition.

To better appreciate the next theorem, note that problem (31) may have multiple solutions. For example, if

$$M = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \qquad q = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$$

then

$$s(x) = M \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 - x_2 \\ x_1 \end{pmatrix},$$

and the solutions of (31) are the vectors $x = (0, x_2)$ with $0 \le x_2 \le 1$.

Now consider the following system:

$$x \ge 0, \quad s(x) = Mx + q \ge 0, \quad xs(x) = \mu e, \tag{32}$$

where $\mu$ is any positive number.

Theorem 4 For any $\mu > 0$ the system (32) has a unique solution.

The solution of (32) is denoted as $x(\mu)$; it is called the $\mu$-center of (26). When $\mu$ runs through all positive reals then $x(\mu)$ follows a curve in the interior of the feasible region of (26). Note that $z(\mu)/\kappa(\mu)$ and $y(\mu)/\kappa(\mu)$ are feasible solutions for the problems (21) and (22); the corresponding curves are the respective central paths of these two problems.

The central path is quite relevant for two reasons. First, observe that for any $\mu > 0$ we have

$$q^T x(\mu) = n\mu. \tag{33}$$

This is an obvious consequence of Eq. (30) and $x_i s_i(x) = \mu$ for each $i$. Hence, if $\mu$ approaches 0 then $q^T x(\mu)$ goes to zero, which means that $x(\mu)$ approaches the optimal set.

The second reason is that not only the limit

$$x(0) := \lim_{\mu \downarrow 0} x(\mu)$$

exists but, even more importantly, that the support of $x(0)$ is the set $B$. As a consequence, if we know $x(0)$, we have enough information to decide whether the homogenizing variable can be positive in the optimal set, and if this is the case we can derive from $x(0)$ optimal solutions for (21) and (22).

Path-following methods use the central path as a guide to the optimal set. Such a method starts at the point on the central path corresponding to $\mu = 1$, because the 1-center is known: due to Eq. (27) we have $x(1) = e$.

E. Conceptual Method

The parameter $\mu$ is called the barrier parameter. The question we have to deal with is how to obtain the $\mu$-centers for small values of the barrier parameter.

Suppose that we know $x(\mu)$ for some $\mu > 0$, and let $\mu'$ be obtained from $\mu$ by

$$\mu' := (1 - \theta)\mu,$$

where $\theta$ is a positive constant smaller than 1. We may expect that if $\theta$ is not too large, the $\mu'$-center will be close to the given $\mu$-center. For the moment, let us assume that we are able to calculate the $\mu'$-center, provided $\theta$ is not too large. Then the following conceptual algorithm can be used to find an $\varepsilon$-optimal solution of problem (26).

Conceptual Algorithm

Input:
    An accuracy parameter ε > 0;
    a barrier update parameter θ, 0 < θ < 1.
begin
    x := x(1); µ := 1;
    while nµ ≥ ε do
    begin
        µ := (1 − θ)µ;
        x := x(µ);
    end
end

The output of this algorithm is a feasible solution for problem (26) such that the objective value does not exceed $\varepsilon$. How many iterations are needed by the algorithm? The answer is provided by the following lemma.

Lemma 1 After at most

$$\frac{1}{\theta} \log \frac{n}{\varepsilon}$$

iterations we have $n\mu \le \varepsilon$.

Proof: Initially, the objective value is $n$ and in each iteration it is reduced by the factor $1 - \theta$. Hence, after $k$ iterations the objective value, given by $q^T x(\mu) = n\mu$, is smaller than, or equal to, $\varepsilon$ if

$$(1 - \theta)^k n \le \varepsilon.$$

Taking logarithms, this becomes

$$k \log(1 - \theta) + \log n \le \log \varepsilon.$$

Since $-\log(1 - \theta) \ge \theta$, this certainly holds if

$$k\theta \ge \log n - \log \varepsilon = \log \frac{n}{\varepsilon}.$$

This implies the lemma.

The above algorithm uses exact $\mu$-centers. These can be obtained only by solving the nonlinear system (32). To make the algorithm more practical, we have to avoid this. This is the subject of the following sections.

F. Using Approximate Centers

Let us now assume that we have an approximate $\mu$-center, i.e., a positive feasible solution $x$ that is close to $x(\mu)$. The meaning of “close” will be made more precise later on.

We want to find a displacement $\Delta x$ such that

$$x^+ = x + \Delta x \tag{34}$$

is the $\mu$-center. Denoting $s^+ = s(x^+)$, and

$$s^+ = s + \Delta s, \tag{35}$$

neglecting the inequality constraints for the moment, this means that $\Delta x$ and $\Delta s$ should satisfy

$$M(x + \Delta x) + q = s + \Delta s,$$
$$(x + \Delta x)(s + \Delta s) = \mu e.$$

This system can be rewritten as

$$M\Delta x = \Delta s, \tag{36}$$
$$s\Delta x + x\Delta s + \Delta x \Delta s = \mu e - xs. \tag{37}$$

The above system is nonlinear and hard to solve. Following Newton’s method, we linearize the centering condition (37) by neglecting the quadratic term $\Delta x \Delta s$, thus obtaining the Newton equation

$$s\Delta x + x\Delta s = \mu e - xs. \tag{38}$$

Substitution of Eq. (36) leads to the linear equation

$$(S + XM)\Delta x = \mu e - xs.$$

Here $X = \mathrm{diag}(x)$ and $S = \mathrm{diag}(s)$. The coefficient matrix being nonsingular, this equation uniquely defines the Newton direction $\Delta x$ at $x$ to the target $x(\mu)$.
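The Newton direction can be computed with a single linear solve. The following is a minimal sketch, assuming NumPy is available; the function name `newton_direction` and the starting point are hypothetical. It uses the 2 × 2 example of Section III.D, for which the µ-center works out by hand to x(µ) = (2µ, 1/2).

```python
import numpy as np

def newton_direction(M, q, x, mu):
    """Return (dx, ds) with (S + X M) dx = mu*e - x*s and ds = M dx."""
    s = M @ x + q                           # surplus vector s(x)
    rhs = mu * np.ones_like(x) - x * s      # right-hand side of Eq. (38)
    dx = np.linalg.solve(np.diag(s) + np.diag(x) @ M, rhs)
    return dx, M @ dx                       # ds follows from Eq. (36)

# Data of the 2x2 example: M skew-symmetric, q nonnegative.
M = np.array([[0.0, -1.0], [1.0, 0.0]])
q = np.array([1.0, 0.0])
x = np.array([0.6, 0.4])                    # hypothetical positive feasible point
mu = 0.25
dx, ds = newton_direction(M, q, x, mu)
x_plus = x + dx
print(x_plus)                               # compare with the target (0.5, 0.5)
print(dx @ ds, q @ x_plus)                  # orthogonality and objective value
```

In this instance the step moves x from (0.6, 0.4) toward the target (0.5, 0.5), the displacements are orthogonal because M is skew-symmetric, and the objective value after the step equals nµ = 0.5.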

G. Analysis of the Newton Step

Using the notation of Eqs. (34) and (35) we may write

$$x^+ s^+ = (x + \Delta x)(s + \Delta s) = xs + (s\Delta x + x\Delta s) + \Delta x \Delta s.$$

Due to the Newton equation (38) this implies

$$x^+ s^+ = \mu e + \Delta x \Delta s. \tag{39}$$

Since $M$ is skew-symmetric,

$$(\Delta x)^T \Delta s = (\Delta x)^T M \Delta x = 0,$$

proving that $\Delta x$ and $\Delta s$ are orthogonal. Hence, using Eq. (39),

$$q^T x^+ = e^T(x^+ s^+) = e^T(\mu e + \Delta x \Delta s) = n\mu.$$

Thus we have shown that after the Newton step, the objective value has the value at the target $x(\mu)$.

But what about $x^+$? We may not expect that $x^+$ coincides with $x(\mu)$, due to the “error term” $\Delta x \Delta s$ in Eq. (39). But hopefully, $x^+$ is closer to $x(\mu)$ than $x$. For dealing with this we need a quantity to measure proximity to $x(\mu)$. For this purpose we introduce the quantity $\delta(x, \mu)$ defined by

$$\delta(x, \mu) = \frac{1}{2} \left\| \sqrt{\frac{xs}{\mu}} - \sqrt{\frac{\mu e}{xs}} \right\|.$$

The notation, although not quite common, explains itself: all operations are meant to be coordinatewise. Note that

$$x = x(\mu) \iff \sqrt{\frac{xs}{\mu}} = e \iff \sqrt{\frac{\mu e}{xs}} = e.$$

Hence, if $x = x(\mu)$ then $\delta(x, \mu) = 0$, and $\delta(x, \mu) > 0$ otherwise. One may prove the following.

Theorem 5 If $\delta := \delta(x, \mu) \le 1$, then the Newton step is feasible, i.e., $x^+$ and $s^+$ are nonnegative. Moreover, if $\delta < 1$, then $x^+$ and $s^+$ are positive and

$$\delta(x^+, \mu) \le \frac{\delta^2}{\sqrt{2(1 - \delta^2)}}.$$

This remarkable result shows not only that the Newton step is feasible if $x$ is close enough (i.e., $\delta < 1$) to the $\mu$-center, but also that the Newton process is quadratically convergent. In particular,

$$\delta(x, \mu) \le \frac{1}{\sqrt{2}} \implies \delta(x^+, \mu) \le \delta(x, \mu)^2.$$

It is now clear that with Newton’s method we can obtain good approximations of the $\mu$-center in a very efficient way.

H. Approximate Center Method

We modify the conceptual algorithm in Section III.E slightly, by replacing the statement x := x(µ) with x := x + Δx.

Approximate Center Algorithm

Input:
    An accuracy parameter ε > 0;
    a barrier update parameter θ, 0 < θ < 1.
begin
    x := x(1); µ := 1;
    while nµ ≥ ε do
    begin
        µ := (1 − θ)µ;
        x := x + Δx;
    end
end

In the next section we discuss how an appropriate value of the parameter $\theta$ can be obtained, so that during the course of the algorithm the iterates are always close enough to the current $\mu$-center to guarantee that Newton’s method is quadratically convergent.

I. Complexity Analysis

At the start of the algorithm we have $\mu = 1$ and $x = x(1)$, whence $q^T x = n$ and $\delta(x, \mu) = 0$. In each iteration $\mu$ is first reduced by the factor $1 - \theta$ and then the Newton step is made to the new $\mu$-center. It will be clear that the reduction of $\mu$ affects the value of the proximity measure. This effect is fully described by the following lemma.

Lemma 2 Let $x > 0$ and $\mu > 0$ be such that $s = s(x) > 0$ and $q^T x = n\mu$. Moreover, let $\delta := \delta(x, \mu)$ and $\mu' = (1 - \theta)\mu$. Then

$$\delta(x, \mu')^2 = (1 - \theta)\delta^2 + \frac{\theta^2 n}{4(1 - \theta)}.$$

Now let us have

$$q^T x = n\mu \quad \text{and} \quad \delta(x, \mu) \le \frac{1}{2} \tag{40}$$

at the start of some iteration. It certainly holds in the first iteration. We claim that these properties are maintained during the course of the algorithm if $\theta = 1/(2\sqrt{n})$.

When the barrier parameter is updated to $\mu' = (1 - \theta)\mu$, the lemma gives

$$\delta(x, \mu')^2 \le \frac{1 - \theta}{4} + \frac{\theta^2 n}{4(1 - \theta)}.$$

Due to the choice of $\theta$, we get

$$\delta(x, \mu')^2 < \frac{1}{4} + \frac{1}{16(1 - \theta)} \le \frac{1}{4} + \frac{1}{8} < \frac{1}{2}.$$

Thus $\delta(x, \mu') < 1/\sqrt{2}$, and Theorem 5 shows that after the Newton step $\delta(x^+, \mu') \le 1/2$. Also, $q^T x^+ = n\mu'$. This proves the claim.

Thus we have shown that the following theorem holds.

Theorem 6 If $\theta = 1/(2\sqrt{n})$ then the algorithm with full Newton steps requires at most

$$2\sqrt{n} \log \frac{n}{\varepsilon}$$

iterations. The output is a feasible $x > 0$ such that $q^T x = n\mu \le \varepsilon$ and $\delta(x, \mu) \le \frac{1}{2}$.

This theorem shows that we can get an $\varepsilon$-solution $x$ of our self-dual model with $\varepsilon$ as small as desired. But note that $x$ will always be an interior solution, so $x > 0$ and $s = s(x) > 0$.

A crucial question is whether the variable $\kappa = x_{n-1}$ is positive or zero in the limit, when $\mu$ goes to zero. In practice, for small enough $\varepsilon$ it is usually no serious problem to decide which of the two cases occurs. But the theory can help us in an elegant way, as we explain in the next section. This requires some further analysis of the central path.

Before proceeding, we want to agree upon the following. This section has made clear that the conceptual algorithm of Section III.E can be turned into a practical algorithm; the iterates are no longer on the central path but move in a narrow neighborhood around the central path.

In the next sections we will assume that the iterates are on the central path, because this simplifies the analysis significantly. At the cost of some additional technicalities in the analysis we can obtain similar results for the iterates of the practical algorithm. For the technical details the reader is referred to Roos et al. (1997).

J. Finding the Optimal Partition

If $x \in S$, $S$ being the set of optimal solutions, then

$$xs(x) = 0, \quad x + s(x) \ge 0.$$

The equality represents the complementarity property of optimal solutions. If the inequality is strict, we say that $x$ is a strictly complementary solution. Recall from Section III.D that the support of the limit point $x(0)$ of the central path is the set $B$, defined by Eq. (28). Let $N$ denote the complementary set. Then another property of the central path is that the support of $s(0) := s(x(0))$ is $N$. As a consequence, $x(0)$ is a strictly complementary solution:

$$x(0)s(0) = 0, \quad x(0) + s(0) > 0.$$

In fact, this property can be used to prove Theorem 3 in Section II.I.2. Another consequence is that

$$N := \{i : s_i(x) > 0 \text{ for some } x \in S\}. \tag{41}$$

The partition of the index set $\{1, \ldots, n\}$ into the classes $B$ and $N$ is called the optimal partition of (26).

We define the condition number of (26) by

$$\sigma := \min_i \max_{x \in S} \, (x_i + s_i(x)).$$

The calculation of this (positive!) number is even more cumbersome than solving Eq. (26). For our purpose, however, it is sufficient to know the following lower bound

$$\sigma \ge \frac{1}{\prod_{j=1}^{n} \|M_j\|}, \tag{42}$$

where $M_j$ denotes the $j$th column of $M$. For this bound we need to make the assumption that the entries of $M$ and $q$ are integral. In the sequel, in all the bounds where the condition number occurs, it can be safely replaced by this easily computable lower bound.

Lemma 3 For any positive $\mu$ one has

$$x_i(\mu) \ge \frac{\sigma}{n}, \quad i \in B,$$
$$x_i(\mu) \le \frac{n\mu}{\sigma}, \quad i \in N.$$

Proof: Let $i \in N$ and $\tilde{x} \in S$ be such that $\tilde{s}_i := s_i(\tilde{x})$ is maximal. Then the definition of the condition number $\sigma$ implies that $\tilde{s}_i \ge \sigma$. Using the skew-symmetry of $M$ one easily sees that

$$(x(\mu) - \tilde{x})^T (s(\mu) - \tilde{s}) = 0.$$

Since $q^T \tilde{x} = \tilde{x}^T \tilde{s} = 0$, it follows that

$$x(\mu)^T \tilde{s} + s(\mu)^T \tilde{x} = n\mu.$$

This implies

$$x_i(\mu)\tilde{s}_i \le x(\mu)^T \tilde{s} \le n\mu.$$

Dividing by $\tilde{s}_i$ and using that $\tilde{s}_i \ge \sigma$ we obtain the second inequality of the lemma:

$$x_i(\mu) \le \frac{n\mu}{\tilde{s}_i} \le \frac{n\mu}{\sigma}.$$

The first inequality can be derived in a similar way.

Lemma 3 gives a complete separation between variables in $B$ and variables in $N$, provided that $\mu$ is so small that

$$\frac{\sigma}{n} > \frac{n\mu}{\sigma},$$

or, equivalently,

$$n\mu < \frac{\sigma^2}{n}.$$
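To see the separation of Lemma 3 at work, the approximate center algorithm of Section III.H can be run on a small instance and the classes read off by comparing x_i with s_i(x). The following is a minimal sketch, assuming NumPy is available; the function name and the 3 × 3 data are hypothetical, chosen so that M is skew-symmetric, q is nonnegative, and s(e) = e. For this instance the optimal set can be worked out by hand as x = (0, 0, x_3) with 0 ≤ x_3 ≤ 1, so one expects B to contain only the third index.

```python
import math
import numpy as np

def approximate_center_method(M, q, eps):
    """Full Newton steps; assumes M skew-symmetric, q >= 0 and M e + q = e."""
    n = len(q)
    theta = 1.0 / (2.0 * math.sqrt(n))      # barrier update of Theorem 6
    x, mu, iterations = np.ones(n), 1.0, 0  # start at the 1-center x(1) = e
    while n * mu >= eps:
        mu *= 1.0 - theta
        s = M @ x + q
        # Newton direction: (S + X M) dx = mu*e - x*s
        dx = np.linalg.solve(np.diag(s) + np.diag(x) @ M, mu - x * s)
        x = x + dx
        iterations += 1
    return x, mu, iterations

# Hypothetical instance with s(e) = M e + q = e.
M = np.array([[0.0, 1.0, -2.0],
              [-1.0, 0.0, 1.0],
              [2.0, -1.0, 0.0]])
q = np.array([2.0, 1.0, 0.0])
eps = 1e-6
x, mu, iterations = approximate_center_method(M, q, eps)
s = M @ x + q
B = [i for i in range(len(x)) if x[i] > s[i]]   # zero-based indices
N = [i for i in range(len(x)) if x[i] < s[i]]
bound = 2.0 * math.sqrt(len(q)) * math.log(len(q) / eps)
print(iterations, bound)
print(B, N)
```

The iteration count stays below the bound of Theorem 6, and the comparison of x with s(x) recovers the partition found by hand. Forming diag(s) + diag(x) M densely is a simplification that is adequate only for small n.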

Thus, applying the algorithm with

$$\varepsilon = \frac{\sigma^2}{n},$$

we obtain a solution $x$ which reveals the optimal partition according to

$$B = \{i : x_i > s_i(x)\},$$
$$N = \{i : x_i < s_i(x)\}.$$

By substitution of the above value of $\varepsilon$ in Theorem 6 the required number of iterations follows:

$$2\sqrt{n} \log \frac{n^2}{\sigma^2}.$$

After this number of iterations we know if $\kappa$ belongs to $B$ or not. In other words, we then know whether the original problem has an optimal solution or not. If it has none we are done; otherwise we are usually interested in finding a solution. If we are interested in an $\varepsilon$-solution we can proceed with the algorithm until such a solution is obtained. An alternative approach may be to use the rounding procedure described in the next section, which yields an exact solution.

K. Rounding to an Exact Solution

Knowing the optimal partition, we aim to find a strictly complementary solution of Eq. (26).

In one case this is very easy. If it happens that the set $B$ in the optimal partition is empty then $x(0) = 0$ is a strictly complementary solution and we are done. Note that in that case $s(x(0))$ must be positive. Since $s(x(0)) = q$, this case can easily be seen to occur if and only if $q > 0$.

Therefore, we assume from now on that the class $B$ is not empty. Assuming this we describe a rounding procedure that can be applied to any $x$ generated by the algorithm to yield a vector $\bar{x}$ such that $\bar{x}$ and its surplus vector $\bar{s} = s(\bar{x})$ are complementary (in the sense that $\bar{x}_N = \bar{s}_B = 0$). In general, however, these $\bar{x}$ and $\bar{s}$ are not necessarily nonnegative. But, as we will see, after sufficient additional iterations of the algorithm the separation between the “small” and the “large” variables is strong enough to get a strictly complementary solution. All of this can be done in polynomial time.

Partitioning the matrix $M$ and the vectors $x$ and $s$ according to the optimal partition, the relation $s = Mx + q$ can be rewritten as

$$\begin{pmatrix} s_B \\ s_N \end{pmatrix} = \begin{pmatrix} M_{BB} & M_{BN} \\ M_{NB} & M_{NN} \end{pmatrix} \begin{pmatrix} x_B \\ x_N \end{pmatrix} + \begin{pmatrix} q_B \\ q_N \end{pmatrix}.$$

Since $q_B^T x_B(0) = 0$ and $x_B(0) > 0$, it follows that $q_B = 0$. Hence, we have

$$s_B = M_{BB} x_B + M_{BN} x_N.$$

Consider the system of equations in the unknown vector $\xi$ given by

$$M_{BB}\,\xi = s_B - M_{BN} x_N. \tag{43}$$

Note that $\xi = x_B$ is a “large” solution of (43), because the entries of $x_B$ are “large” variables. We can easily see that Eq. (43) has more solutions. This follows by using any optimal solution $\tilde{x}$ with $\tilde{x}_B \ne 0$. One has $\tilde{x}_N = 0$ and $s_B(\tilde{x}) = 0$, whence $M_{BB}\tilde{x}_B = 0$. Since $\tilde{x}_B \ne 0$, it follows that the matrix $M_{BB}$ must be singular, and hence Eq. (43) has multiple solutions.

Now let $\xi$ be any solution of Eq. (43) and consider the vector $\bar{x}$ defined by

$$\bar{x}_B = x_B - \xi, \quad \bar{x}_N = 0.$$

Define $\bar{s} = s(\bar{x})$. Since $\bar{x}_N = 0$, we have

$$\bar{s}_B = M_{BB}\bar{x}_B = M_{BB}(x_B - \xi) = 0.$$

Therefore, $\bar{x}_N = \bar{s}_B = 0$, showing that the vectors $\bar{x}$ and $\bar{s}$ are complementary. It will be clear, however, that the vectors $\bar{x}$ and $\bar{s}$ are not necessarily nonnegative, let alone strictly complementary. This only holds if

$$\bar{x}_B = x_B - \xi > 0, \tag{44}$$

and

$$\begin{aligned} \bar{s}_N &= M_{NB}\bar{x}_B + M_{NN}\bar{x}_N + q_N \\ &= M_{NB}(x_B - \xi) + q_N \\ &= s_N - M_{NN}x_N - M_{NB}\,\xi > 0. \end{aligned} \tag{45}$$

Note that if we run the algorithm long enough, $x_B$ and $s_N$ converge to positive vectors, whereas $x_N$ and $s_B$ (and hence also $\xi$) converge to zero. This makes it plausible that if we take $\varepsilon$ small enough in the algorithm, then we will get a solution $\xi$ of (43) that satisfies (44) and (45), hence giving rise to a strictly complementary solution of Eq. (26).

Omitting further details, we conclude this section by stating the main complexity result for interior-point methods: when solving Eq. (43) by Gaussian elimination, an exact solution of Eq. (26) can be found after at most $7\sqrt{n}\,L$ iterations, where $L$ denotes the binary size of the problem.

L. Other Interior-Point Methods

In the preceding sections we have shown that the LO-problem can be solved in polynomial time. If the problem is infeasible or unbounded then this is detected by the method, otherwise it generates an exact solution. When executed, the number of iterations will be about the same as predicted by the theory. In practice this means that the method is much slower than the Simplex method. There are several ways to make the method more efficient and

competitive with the Simplex method. We briefly discuss some of them.

1. Adaptive-Update Methods

The method we considered generates iterates in a narrow neighborhood of the central path, since at the start of each iteration we have $\delta(x, \mu) \le \frac{1}{2}$. In the theoretical analysis this property is maintained by taking small updates of the barrier parameter ($\theta = 1/(2\sqrt{n})$). In practice, Newton’s method behaves much better than predicted by Theorem 5. As a consequence, $\delta(x^+, \mu)$ will usually be much smaller than $\frac{1}{2}$. It means that we can take larger values of $\theta$ without losing the above property. The adaptive-update method seeks to take $\theta$ as large as possible such that at the start of the next iteration we still have $\delta(x, \mu) \le \frac{1}{2}$. This strategy accelerates the method significantly and does not deteriorate the theoretical iteration bound.

2. Large-Update Methods

A popular strategy is to use larger values of $\theta$, say $\theta = 0.5$, or even $\theta = 0.99$. In this case, after the barrier update, $\delta(x, \mu')$ will be much larger than 1, and Theorem 5 can no longer be used to measure progress. Even worse, the (full) Newton step will no longer be feasible. The remedy is to take a damped Newton step $x^+ = x + \alpha\Delta x$, with damping factor $\alpha$; this factor can be chosen such that $x^+$ is feasible and at the same time the proximity decreases sufficiently to get a provably polynomial method. In practice this approach yields very efficient methods.

3. Predictor-Corrector Method

This is the most popular method. We describe a simple variant. It is based on a very greedy strategy that uses the Newton step targeting the zero vector. So, in the definition of the Newton step one takes $\mu = 0$. The resulting direction is called the affine scaling direction. To stay feasible, this step is damped again, but with a greedy damping factor. As a result, after such a step we may end up far from the central path. To restore the proximity, a so-called centering step is taken. This is a (damped) Newton step with $\mu = q^T x^+/n$, where $x^+$ is the current iterate. This method has not only turned out to be extremely efficient, but also has the nice property that asymptotically the objective value converges quadratically to zero.

4. Infeasible-Start Methods

To start an interior-point method one needs an interior solution of the problem at hand. Usually such a solution is not available, and then the method cannot even be started. The embedding technique, as described in Section III.C, elegantly resolves this initialization problem, at the cost of two additional variables, the homogenizing variable $\kappa$ and the lifting variable $\vartheta$. There exist other solutions to the initialization problem, leading to the so-called infeasible-start methods. We briefly discuss such a method for the standard form (1) and its dual (16). The centering condition for this pair of problems is simply $xs = \mu e$.

Infeasible-start methods take any two positive $n$-vectors $x$ and $s$ and an arbitrary $m$-vector $y$. These vectors need not be feasible. Defining the primal and dual residuals by

$$r_p(x) = Ax - b, \qquad r_d(y, s) = A^T y + s - c,$$

respectively, the Newton-like search directions are defined by the system

$$A\Delta x = -r_p(x),$$
$$A^T \Delta y + \Delta s = -r_d(y, s),$$
$$x\Delta s + s\Delta x = \mu e - xs.$$

The steps are damped to keep the iterates sufficiently bounded away from zero. When starting with $y = 0$ and $x = s = \zeta e$ for some suitable $\zeta$, under a mild condition on $\zeta$ it can be shown that within $O(n^2 |\log \varepsilon|)$ iterations an $\varepsilon$-solution can be found [Wright (1996)]. The generated solutions are not feasible, in general, but the norms of their residuals are bounded in terms of $\varepsilon$.

IV. RELATED TOPICS

A. Integer Programming

Suppose we have a linear maximization problem with feasible region $P$:

$$\max\{c^T x : x \in P\}, \tag{46}$$

and one or more of the variables are required to be integral. In many situations it is natural to impose such a condition on a variable, for example if the variable represents a number of vehicles in the model. If all variables need to be integral the problem is called a pure integer problem, otherwise a mixed integer problem. In the next sections we restrict the discussion to pure integer problems; the generalization to mixed integer problems is straightforward.

1. Branch-and-Bound Methods

This method is an intelligent way of searching for integral solutions in $P$ with an objective value that is better than the current lower bound, which is initially put to $-\infty$. Let $\bar{x}$ be an optimal solution of Eq. (46). Suppose that $\bar{x}_i$ is fractional. Let $P_1$ be the region obtained by adding the

constraint xi ≤ x̄i to P and let P2 be the region obtained 3. Branch-and-Cut Methods

by adding the constraint xi ≥ x̄i + 1 to P. Here, x̄i
Successful solution of integer LO-problems requires a
denotes the integral part of x̄i . Then it will be clear that
clever combination of many different ideas. Branch-
the integral solution to Eq. (46) is the best of the integral
and-cut methods combine branch-and-bound and cutting-
solutions of the two subproblems
plane methods. The cutting-planes are generated through-
out the branch-and-bound tree. The underlying idea is to
max c T x: x ∈ Pi , i = 1, 2.
work on getting as tight as possible bounds in each node
These subproblems can be solved in a similar way: solve of the tree and thus reducing the number of nodes in the
the LO-relaxation and if necessary branch to two new search tree. Of course, there is an obvious trade-off. If
subproblems. many cuts are added at a node, re-optimization may slow
In this way a binary search tree of subproblems is gen- down. In addition, keeping all the information in the tree
erated. In principle all subproblems in the generated tree is more difficult. On the other hand, the size of the search
need to be solved. When a subproblem happens to have tree may be reduced significantly. For more details the
an integral solution its objective value is a lower bound reader is referred to Wolsey (1998).
for the objective value of the original problem, and such
a subproblem needs no further branching; if its objective
B. Sensitivity Analysis
value is higher than the current lower bound we refresh
the lower bound. Also, if a subproblem is infeasible no In practice it is often quite important to know how sen-
further branching is needed. All subproblems for which sitive the optimal value is to perturbations of the data in
the optimal value of the LO-relaxation does not exceed the problem. We restrict ourselves to perturbations in the
the current lower bound can be considered as fathomed. objective vector c. For perturbations in the right-hand side
If all subproblems in the tree are solved or fathomed, then vector b one may consider the dual problem, which has b
there are two possibilities: the lower bound is −∞ or it as objective vector, and similar results will follow.
is a finite value. In the first case no integer solution has Assuming that the given LO-problem is a minimization
been found and hence the original problem has no inte- problem, let us consider the case where one of the coeffi-
ger solution. In the second case the subproblem that gave cients of c, c j say, is varied. It can easily be seen that the
rise to the current lower bound provides the best integer objective value is a convex and piecewise linear function
solution. of c j . The derivative of the optimal-value function at c j is
called its shadow price and the linearity interval to which
c j belongs, its range. If c j is a break point of the optimal
2. Cutting-Plane Methods value function, we need to distinguish between a left- and
a right shadow price, and then the range of $c_j$ is just a singleton.

A crucial result in this respect is that the break points are precisely the points where the optimal partition of the problem changes. If the current optimal partition is known, the range and the shadow price(s) of $c_j$ can be obtained as follows. Assuming that the LO-problem is in standard form, the range of $c_j$ is obtained by minimizing and maximizing $c_j$ over the set

$$\{c_j : A^T y + s = c,\ s_B = 0,\ s_N \ge 0\},$$

and the shadow price(s) follow(s) by minimizing and maximizing $x_j$ over the set

$$\{x_j : Ax = b,\ x_B \ge 0,\ x_N = 0\}.$$

The classical approach, as treated in all textbooks, instead of using the optimal partition in the above formulas, uses the partition formed by the classes $B$ of basic and $N$ of nonbasic variables for some optimal basic solution; this usually gives rise to ambiguous information [Jansen et al. (1997) and Roos et al. (1997)].

If the solution $\bar{x}$ of the problem (46) is not integral, we call an inequality $a^T x \le \beta$ a cutting plane of (46) if $a^T \bar{x} > \beta$ whereas all integer vectors in $P$ satisfy $a^T x \le \beta$. A cutting-plane method adds one or more cutting planes to the problem and then solves the resulting LO-problem. The process is repeated until an integer solution is found or a problem arises that is infeasible.

When applying the Simplex method, there is a nice systematic way to generate cutting planes from optimal tableaus, originally proposed by Gomory. These cutting planes are added to the tableau, and then the tableau is re-optimized. If necessary, the new optimal tableau is used to generate new Gomory-cuts, etc. The process converges to an integer solution, if it exists. Unfortunately, for interior-point methods the situation is not as good. No satisfactory systematic methods exist at present that produce an exact integral solution, although some promising work has been done by Mitchell and Borchers for some specific problems. Part of the problem is due to the fact that at present interior-point methods do not allow fast re-optimizing.
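As an illustration of how a cutting plane can be read off a tableau row, the following is a minimal sketch of the Gomory fractional cut. It assumes a row of the form $x_B + \sum_j a_j x_j = b$ over the nonbasic variables with a fractional right-hand side; the function names and the sample row are hypothetical and are not taken from the text. The resulting cut $\sum_j f_j x_j \ge f_0$ is violated by the current basic solution (where all nonbasic variables are zero) and can be put in the form $a^T x \le \beta$ by multiplying by $-1$.

```python
from fractions import Fraction
from math import floor


def fractional_part(q):
    """Return q - floor(q), a value in [0, 1)."""
    return q - floor(q)


def gomory_fractional_cut(nonbasic_coeffs, rhs):
    """Derive a Gomory fractional cut from one tableau row.

    The row is assumed to read  x_B + sum_j a_j * x_j = rhs  over the
    nonbasic variables j, with all variables nonnegative integers.
    Returns (coeffs, f0) describing the cut  sum_j coeffs[j] * x_j >= f0.
    """
    f0 = fractional_part(rhs)
    if f0 == 0:
        raise ValueError("row has an integral right-hand side; no cut is generated")
    coeffs = {j: fractional_part(a) for j, a in nonbasic_coeffs.items()}
    return coeffs, f0


# Hypothetical tableau row: x1 + (1/4) x3 - (5/4) x4 = 7/2
row = {3: Fraction(1, 4), 4: Fraction(-5, 4)}
cut_coeffs, cut_rhs = gomory_fractional_cut(row, Fraction(7, 2))
print(cut_coeffs, cut_rhs)
```

Exact rational arithmetic is used here because fractional parts computed in floating point can be unreliable near integers.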

V. FURTHER EXTENSIONS

It should be noted that many phenomena in this world cannot be described adequately by a linear model, but require a nonlinear model. The solution of such models goes beyond the scope of this chapter. Some remarks are in order, however.

Inspired by the success of the interior-point approach to LO, Nesterov and Nemirovski recognized that the underlying ideas can be extended to a wide class of nonlinear optimization problems, the so-called convex cone optimization problems. Such problems have the form

$$\min\{c^T x : Ax = b,\ x \in K\}, \tag{47}$$

where $K$ denotes a convex cone. If $K = \mathbb{R}^n_+$, the nonnegative orthant, this is exactly the standard LO-problem. The above authors showed that the interior-point approach also applies to problems of this type for different cones. Note that Eq. (47) looks like a linear problem, but one should realize that the cone $K$ can hide a lot of nonlinearity. One striking example occurs when $K$ is the cone of positive semidefinite matrices. For example, the nonlinear constraint $uv \ge 1$, with $u \ge 0$ and $v > 0$, can be modeled as

$$\begin{pmatrix} u & 1 \\ 1 & v \end{pmatrix} \text{ is positive semidefinite.}$$

As a consequence, the field of semidefinite optimization at present receives much attention. It has important applications in system theory [Boyd et al. (1994)] and in combinatorial optimization. Many combinatorial optimization problems admit a natural relaxation to a semidefinite optimization problem. Using this, Goemans and Williamson could generate approximate solutions of a famous and hard combinatorial problem (finding a maximal cut in a graph), not more than 13% from optimal.

It is generally believed that Karmarkar revealed only the tip of an iceberg; there is still a big mass to be explored.

SEE ALSO THE FOLLOWING ARTICLES

COMPUTER ALGORITHMS • DYNAMIC PROGRAMMING • GAME THEORY • LINEAR SYSTEMS OF EQUATIONS • NONLINEAR PROGRAMMING • OPERATIONS RESEARCH

BIBLIOGRAPHY

Avis, D., and Chvátal, V. (1978). Notes on Bland’s pivoting rule. Math. Program. Study 8, 24–34.
Beale, E. M. L. (1955). Cycling in the dual simplex algorithm. Nav. Res. Logist. Q. 2, 269–276.
Bland, R. (1977). New finite pivoting rules for the simplex method. Math. Oper. Res. 2, 103–107.
Borgwardt, K.-H. (1982). The average number of pivot steps required by the simplex method is polynomial. Z. Oper. Res. 26, 157–177.
Boyd, S. E., El Ghaoui, L., Feron, E., and Balakrishnan, V. (1994). Linear matrix inequalities in system and control theory. SIAM Studies in Applied Mathematics, Vol. 15. SIAM, Philadelphia.
Dantzig, G. B. (1963). “Linear Programming and Extensions,” Princeton Univ. Press, Princeton, NJ.
Dantzig, G. B., Orden A., and Wolfe, Ph. (1955). Notes on linear programming, Part I: the generalized simplex method for minimizing a linear form under linear inequality restrictions. Pac. J. Math. 5(2), 183–195.
Frisch, K. R. (1956). La resolution des problemes de programme lineaire par la methode du potential logarithmique. Cah. Semin. D’Econ. 4, 7–20.
Goemans, M. X., and Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach. 42(6), 1115–1145.
Goldman, A. J., and Tucker, A. W. (1956). Theory of linear programming. In “Linear Inequalities and Related Systems” (H.W. Kuhn and A.W. Tucker, eds.) Annals of Mathematical Studies, No. 38, pp. 53–97, Princeton Univ. Press, Princeton, NJ.
Hoffman, A. J. (1953). Cycling in the simplex algorithm. Technical Report 2974, National Bureau of Standards.
Jansen, B., de Jong, J. J., Roos, C., and Terlaky, T. (1997). Sensitivity analysis in linear programming: just be careful! Eur. J. Oper. Res. 101, 15–28.
Karmarkar, N. K. (1984). A new polynomial-time algorithm for linear programming. Combinatorica 4, 373–395.
Khachiyan, L. G. (1979). A polynomial algorithm in linear programming. Dok. Akad. Nauk SSSR 244, 1093–1096. Translated into English in Sov. Math. Dok. 20, 191–194.
Klee, V., and Minty, G. J. (1972). How good is the simplex algorithm? In “Inequalities III” (O. Shisha, ed.), Academic Press, New York.
Lee, J. (1997). Hoffman’s circle untangled. SIAM Rev. 39, 98–105.
Mitchell, J. E., and Borchers, B. (1996). Solving real-world linear ordering problems using a primal-dual interior point cutting plane method. Ann. Oper. Res. 62, 253–276.
Nesterov, Y., and Nemirovskii, A. S. (1994). Interior point polynomial algorithms in convex programming. SIAM Studies in Applied Mathematics, Vol. 13. SIAM, Philadelphia.
von Neumann, J. (1947). On a maximization problem. Manuscript, Institute for Advanced Studies, Princeton University, Princeton, NJ.
Orchard-Hays, W. (1990). History of the development of LP solvers. Interfaces 20(4), 61–73.
Roos, C. (1990). An exponential example for Terlaky’s pivoting rule for the criss-cross method. Math. Program. 46, 79–84.
Roos, C., Terlaky, T., and Vial, J.-Ph. (1997). “Theory and Algorithms for Linear Optimization. An Interior Approach,” John Wiley & Sons, Chichester, UK.
Schrijver, A. (1986). “Theory of Linear and Integer Programming,” John Wiley & Sons, New York.
Smale, S. (1983). On the average number of steps of the simplex method of linear programming. Math. Program. 27, 241–262.
Terlaky, T. (1985). A convergent criss–cross method. Math. Oper. Stat. ser. Optimization 16, 683–690.
Williams, H. P. (1990). “Model Building in Mathematical Programming,” 3rd edition, John Wiley & Sons, New York.
Wolsey, L. A. (1998). “Integer Programming,” John Wiley & Sons, New York.
Wright, S. J. (1996). “Primal-Dual Interior-Point Methods,” SIAM, Philadelphia.
Zionts, S. (1969). The criss-cross method for solving linear programming problems. Manag. Sci. 15, 426–445.
Encyclopedia of Physical Science and Technology EN008A-387 June 29, 2001 15:29

Loop Groups
Andrew Pressley
King’s College, London

I. Basic Properties of Loop Groups

II. The Fundamental Homogeneous Space
III. Representation Theory of Loop Groups
IV. Relations with Other Parts of Mathematics

GLOSSARY

Central extension A central extension of a group $G$ by an Abelian group $A$ is a group $\tilde{G}$ that has $A$ as a central subgroup such that the quotient group $\tilde{G}/A$ is isomorphic to $G$.
Deformation retract A space $X$ is a deformation retract of a space $Y$ if the identity map $Y \to Y$ can be deformed through continuous maps to a map $Y \to X$.
Lie group A group that is also a smooth manifold such that the group operations are smooth.
Maximal torus A maximal connected Abelian subgroup of a compact Lie group.
Symplectic manifold A smooth manifold equipped with a closed 2-form that is nondegenerate at each point.

LOOP GROUPS are groups of maps from the circle into a finite-dimensional Lie group, usually assumed to be compact. They arise in many areas of mathematics and physics, such as the theory of integrable systems, singularity theory, and two-dimensional quantum field theory. The geometry and representation theory of loop groups is closely analogous to that of compact Lie groups. The central extensions of the Lie algebras of loop groups are the simplest infinite-dimensional examples of Kac–Moody algebras. (Throughout this article, $G$ will denote a compact Lie group.)

I. BASIC PROPERTIES OF LOOP GROUPS

A. Definitions

A loop group $LG$ is the group of maps from the circle $S^1$ into a Lie group $G$. Except where stated otherwise, the maps will be assumed to be smooth, that is, infinitely differentiable, and the group $G$ to be compact. The group operation in $LG$ is given by pointwise multiplication. When provided with the $C^\infty$-topology, $LG$ can be given the structure of an infinite-dimensional Lie group, modeled on the space $L\mathfrak{g}$ of smooth maps from $S^1$ into the Lie algebra $\mathfrak{g}$ of $G$. The model space $L\mathfrak{g}$ is a Lie algebra, again under pointwise operations, and is the Lie algebra of $LG$. It is called a loop algebra. There is an exponential map $L\mathfrak{g} \to LG$ induced from that of $G$; unlike the exponential map of $G$ itself, that of $LG$ is not surjective, although its image is dense in $LG$.
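The pointwise group operation on loops can be made concrete with a small numerical sketch. The example below takes $G = SU(2)$ and represents a loop as a Python function of the angle $\theta$; the helper names (`loop_product`, `torus_loop`, `rotation_loop`, `is_in_su2`) are illustrative choices, not notation from the article. It only illustrates that the pointwise product of two loops is again a loop in $G$ and that the product need not commute.

```python
import cmath
import math


def matmul(a, b):
    """Product of two 2x2 complex matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def loop_product(f, g):
    """Pointwise product of two loops: (f g)(theta) = f(theta) g(theta)."""
    return lambda theta: matmul(f(theta), g(theta))


def torus_loop(n):
    """The loop theta -> diag(e^{i n theta}, e^{-i n theta}) in SU(2)."""
    return lambda theta: (
        (cmath.exp(1j * n * theta), 0j),
        (0j, cmath.exp(-1j * n * theta)),
    )


def rotation_loop(theta):
    """The loop theta -> rotation of the plane by theta, viewed in SU(2)."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c + 0j, -s + 0j), (s + 0j, c + 0j))


def is_in_su2(m, tol=1e-12):
    """Check that m is unitary with determinant 1, up to rounding."""
    adjoint = tuple(tuple(m[j][i].conjugate() for j in range(2)) for i in range(2))
    p = matmul(m, adjoint)
    identity_ok = all(
        abs(p[i][j] - (1 if i == j else 0)) < tol for i in range(2) for j in range(2)
    )
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return identity_ok and abs(det - 1) < tol


# Two products of the same pair of loops, taken in opposite orders.
fg = loop_product(torus_loop(1), rotation_loop)
gf = loop_product(rotation_loop, torus_loop(1))
```

Evaluating `fg` and `gf` at the same angle generally gives different matrices, reflecting that $LG$ is non-Abelian whenever $G$ is.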


One of the many properties of loop groups that does not hold for infinite-dimensional Lie groups in general is the existence of a complexification. This is simply the group $LG_{\mathbb{C}}$ of smooth loops in the complexification $G_{\mathbb{C}}$ of $G$ itself. Its Lie algebra is the complexification of $L\mathfrak{g}$.

B. Twisted Loop Groups

Geometrically, $LG$ can be thought of as the space of smooth sections of the trivial principal $G$-bundle on $S^1$. More generally, if $\alpha$ is any automorphism of $G$, one can form a $G$-bundle on $S^1$ by taking the quotient of $G \times \mathbb{R}$ by the equivalence relation that identifies $(x, t)$ with $(\alpha(x), t + 2\pi)$ for all $x \in G$. The cross sections of this bundle form a group $L_\alpha G$ called a twisted loop group. It is clear that $L_\alpha G$ depends only on the class of $\alpha$ modulo inner automorphisms of $G$, and hence $\alpha$ may be assumed to be of finite order if $G$ is semi-simple. Since all $G$-bundles on $S^1$ are trivial if $G$ is connected, any twisted loop group is isomorphic to an untwisted loop group as an abstract group. However, there are good reasons for treating them separately.

Twisted loop groups arise naturally in the study of maximal Abelian subgroups of $LG$. If $T$ is a maximal torus in $G$, then $LT$ is obviously a maximal Abelian subgroup of $LG$. More generally, if $\tau$ is any smooth loop in the space of maximal tori of $G$ there is a maximal Abelian subgroup consisting of the loops $f$ such that $f(\theta) \in \tau(\theta)$ for all $\theta$. Up to isomorphism, it depends only on the free homotopy class of $\tau$. The space of maximal tori is $G/N(T)$, where $N(T)$ is the normalizer of $T$ in $G$. Its fundamental group is the Weyl group $W = N(T)/T$ of $G$. Hence, there is a maximal Abelian subgroup associated to every conjugacy class in $W$. The group associated to $w \in W$ is the twisted loop group of $T$ associated to the automorphism of $T$ induced by conjugation with a representative of $w$ in $N(T)$.

C. Automorphisms

One of the most important properties of loop groups is the existence of the one-parameter group of automorphisms $R_\phi$ of $LG$ that rotates a loop $f$ rigidly through an angle $\phi$:

$$R_\phi(f)(\theta) = f(\theta + \phi)$$

More generally, the group $\mathrm{Diff}(S^1)$ of all smooth diffeomorphisms of the circle acts as a group of automorphisms of $LG$. There are also the loops in the group of automorphisms of $G$ itself.

Theorem 1. If $G$ is simple, every automorphism of $LG$, even as an abstract group, is a product of these two types.

D. Polynomial Loops

Apart from the smooth loops, it is often convenient to consider other classes of loops. One of the most important is the Laurent polynomial loops; these have finite expansions

$$f(z) = \sum_{k=-N}^{N} A_k z^k$$

The polynomial loops form a subgroup $L_{\mathrm{pol}}G$ of $LG$, but $L_{\mathrm{pol}}G$ is not a Lie group in any reasonable sense. However, it is a good approximation to $LG$ in at least two ways:

Theorem 2. If $G$ is any compact group, $L_{\mathrm{pol}}G$ is a deformation retract of $LG$. If $G$ is also semi-simple, $L_{\mathrm{pol}}G$ is dense in $LG$ in the $C^\infty$-topology.

E. Central Extensions

Throughout this section $G$ will be assumed to be simply-connected. A central extension of the loop algebra $L\mathfrak{g}$ by the one-dimensional Lie algebra $\mathbb{R}$ is a Lie algebra $\tilde{L}\mathfrak{g}$ that fits into a short exact sequence of the form

$$0 \to \mathbb{R} \to \tilde{L}\mathfrak{g} \to L\mathfrak{g} \to 0$$

with $\mathbb{R}$ lying in the center of $\tilde{L}\mathfrak{g}$. Every such extension is isomorphic to one corresponding to the cocycle

$$\omega(\xi, \eta) = \frac{1}{2\pi} \int_{S^1} \langle \xi, d\eta \rangle,$$

where $\langle\,,\rangle$ is an invariant inner product on $\mathfrak{g}$ and $\xi, \eta \in L\mathfrak{g}$. This extension is the vector space $L\mathfrak{g} \oplus \mathbb{R}$ with the bracket

$$[(\xi, \lambda), (\eta, \mu)] = ([\xi, \eta], \omega(\xi, \eta))$$

Note that this cocycle is obviously invariant under the action of the group $\mathrm{Diff}^+(S^1)$ of orientation-preserving diffeomorphisms of $S^1$. The cocycle $\omega$ extends to a left invariant closed 2-form on $LG$ and thus defines a cohomology class in $H^2(LG, \mathbb{R})$.

A central extension of $LG$ by the circle group $S^1$ is defined in an analogous way. Such an extension gives rise by differentiation to a central extension of $L\mathfrak{g}$. The converse question is answered by the following result.

Theorem 3.
(a) There is a central extension of $LG$ by $S^1$ with Lie algebra cocycle $\omega$ if and only if $\omega/2\pi$ is an integral cohomology class. In this case, the central extension $\tilde{L}G$ is unique up to isomorphism, and $\mathrm{Diff}^+(S^1)$ acts on it as a group of automorphisms.
(b) If no nonzero multiple of $\omega$ is integral, $\tilde{L}\mathfrak{g}$ is not the Lie algebra of any Lie group.
(c) The class of $\omega/2\pi$ is integral if and only if $\langle \check{\alpha}, \check{\alpha} \rangle \in 2\mathbb{Z}$ for all coroots $\check{\alpha}$ of $\mathfrak{g}$.
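To see what the cocycle of Section I.E looks like in practice, it can be evaluated on Fourier modes. The following worked computation is a sketch under two assumptions that are not spelled out in the text: the inner product is extended complex-bilinearly to complex-valued loops, and the loops are written as $\xi = X z^m$, $\eta = Y z^n$ with $X, Y \in \mathfrak{g}$ and $z = e^{i\theta}$.

```latex
\begin{aligned}
d\eta &= i n\, Y e^{i n \theta}\, d\theta, \\
\omega(X z^m, Y z^n)
  &= \frac{1}{2\pi} \int_0^{2\pi} \langle X, Y \rangle\, i n\, e^{i(m+n)\theta}\, d\theta
   = i n\, \langle X, Y \rangle\, \delta_{m+n,0}, \\
\omega(Y z^n, X z^m)
  &= i m\, \langle Y, X \rangle\, \delta_{m+n,0}
   = -\, i n\, \langle X, Y \rangle\, \delta_{m+n,0}.
\end{aligned}
```

The last line uses $m = -n$ on the support of the Kronecker delta and confirms that $\omega$ is skew-symmetric, as a Lie algebra cocycle must be. Only modes of opposite degree pair nontrivially.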

(d) If $G$ is simple, the extension of $L\mathfrak{g}$ defined by any nonzero cocycle $\omega$ of the above form is universal, in the sense that every central extension of $L\mathfrak{g}$ by any Abelian Lie algebra $A$ arises from it via a homomorphism $\mathbb{R} \to A$.

For the definition of coroots, see the next subsection.

F. Affine Lie Algebras

The complexified central extensions $\tilde{L}\mathfrak{g}_{\mathbb{C}}$ become, after taking the semi-direct product with the one-dimensional algebra generated by the derivation $d = z\,d/dz$, examples of Kac–Moody algebras, and in this context they are referred to as affine Lie algebras. One can describe analogs of all the usual notions associated to finite-dimensional semi-simple Lie algebras.

Recall the decomposition

$$\mathfrak{g}_{\mathbb{C}} = \mathfrak{t}_{\mathbb{C}} \oplus \bigoplus_{\alpha} \mathfrak{g}_\alpha$$

where $\mathfrak{g}_\alpha$ is the subspace of $\mathfrak{g}_{\mathbb{C}}$ on which the maximal torus $T$ acts (by conjugation) via the homomorphism $\alpha : T \to S^1$. The homomorphisms that occur are called the roots of $\mathfrak{g}$. Roots are often identified with their derivatives at the identity, which are elements of the dual space $\mathfrak{t}^*$ of the Lie algebra $\mathfrak{t}$ of $T$. The element $\check{\alpha} \in [\mathfrak{g}_\alpha, \mathfrak{g}_{-\alpha}]$ such that $\alpha(\check{\alpha}) = 2$ is called the coroot associated to the root $\alpha$.

To formulate the corresponding definitions for loop groups, consider the semi-direct product $LG \mathbin{\tilde{\times}} S^1$, where $S^1$ acts on $LG$ by rotating loops. Then $T \times S^1$ is a maximal Abelian subgroup of $LG \mathbin{\tilde{\times}} S^1$ ($T$ is identified with the constant loops whose image lies in $T$). The Weyl group of $LG$ is thus $W_{\mathrm{aff}} = N(T \times S^1)/(T \times S^1)$. It is called the affine Weyl group, and is isomorphic to the semi-direct product $\check{T} \mathbin{\tilde{\times}} W$, where $\check{T}$ is the lattice of homomorphisms $S^1 \to T$.

One has the following decomposition of the complexified Lie algebra of $LG \mathbin{\tilde{\times}} S^1$:

$$L\mathfrak{g}_{\mathbb{C}} \oplus \mathbb{C} = (\mathfrak{t}_{\mathbb{C}} \oplus \mathbb{C}) \oplus \bigoplus_{k \ne 0} \mathfrak{t}_{\mathbb{C}} z^k \oplus \bigoplus_{(\alpha,k)} \mathfrak{g}_\alpha z^k$$

The pairs $(\alpha, k)$ that occur, including those corresponding to the second summand where $\alpha = 0$, are called the roots of $LG$. The pair $(\alpha, k)$ is regarded as the homomorphism $T \times S^1 \to S^1$ given by $(t, z) \mapsto \alpha(t) z^k$. As in the finite-dimensional case, a subset of the roots is called a simple system if every root can be written as a linear combination of the simple roots, the coefficients being integers all of which have the same sign. A root is called positive if all the coefficients are positive.

The affine Weyl group acts on the Lie algebra of $T \times S^1$ and is generated by the reflections in the hyperplanes $\gamma(x) = 0$, where $\gamma$ runs through the simple roots of $LG$. The length $l(w)$ of an element $w \in W_{\mathrm{aff}}$ is the number of reflections in a minimal expression for $w$.

Choose elements $e_\gamma$ in the root space $\mathfrak{g}_\alpha z^k$ corresponding to each root $\gamma = (\alpha, k)$ with the property

$$[e_\gamma, e_{-\gamma}] = \check{\gamma}$$

for all simple roots $\gamma$ of $LG$. Let $\gamma_1, \ldots, \gamma_l$ be the simple roots of $LG$ and write $e_i = e_{\gamma_i}$, $f_i = e_{-\gamma_i}$. Then, the elements $e_i$, $f_i$, $\check{\gamma}_i$, $i = 1, \ldots, l$, satisfy the following commutation relations:

$$[\check{\gamma}_i, \check{\gamma}_j] = 0$$
$$[e_i, f_j] = \delta_{ij} \check{\gamma}_i$$
$$[\check{\gamma}_i, e_j] = a_{ij} e_j$$
$$[\check{\gamma}_i, f_j] = -a_{ij} f_j$$
$$(\mathrm{ad}\, e_i)^{-a_{ij}+1} e_j = 0 \quad \text{if } i \ne j$$
$$(\mathrm{ad}\, f_i)^{-a_{ij}+1} f_j = 0 \quad \text{if } i \ne j$$

Here, $(a_{ij})$ is a square matrix of integers, called the Cartan matrix of $LG$, satisfying

$$a_{ii} = 2$$
$$a_{ij} \le 0 \quad \text{if } i \ne j$$
$$a_{ij} = 0 \quad \text{if and only if } a_{ji} = 0$$

Theorem 4. These relations define the universal central extension of $L\mathfrak{g}$ defined in the previous section.

If $(a_{ij})$ is any square matrix of integers satisfying the above conditions, then the preceding relations define a so-called Kac–Moody Lie algebra. If $(a_{ij})$ is positive definite, the Lie algebra is finite-dimensional and semi-simple. In the case of affine Lie algebras, $\det(a_{ij}) = 0$, but every proper principal minor of $(a_{ij})$ is positive. In fact, this condition characterizes the affine Lie algebras, and their twisted analogs, among Kac–Moody algebras.

The information contained in the Cartan matrix is often expressed graphically in the form of the Dynkin diagram of the Lie algebra in question. This is a graph that has one node for each generator $e_i$, the $i$th and $j$th nodes being joined by $a_{ij} a_{ji}$ bonds. If $|a_{ij}| > |a_{ji}|$, the bonds carry an arrow pointing toward the $i$th node.

II. THE FUNDAMENTAL HOMOGENEOUS SPACE

A. Differential-Geometric Properties

The homogeneous space $X = LG/G$ plays a fundamental role in the study of loop groups. It is obviously diffeomorphic to the subgroup $\Omega G$ of $LG$ consisting of the based

loops, that is, those satisfying $f(1) = 1$; however, it is better to regard it as a homogeneous space of $LG$, partly because its properties are closely analogous to those of the homogeneous space $G/T$ of $G$, where $T$ is a maximal torus of $G$; of course, $G/T$ is not a group.

The first example of this phenomenon is that $X$, like $G/T$, is a complex manifold. This follows from the following factorization theorem, in which $L^+G_{\mathbb{C}}$ denotes the subgroup of $LG_{\mathbb{C}}$ consisting of the smooth maps $f : S^1 \to G_{\mathbb{C}}$ that extend smoothly to holomorphic maps of the disc $\{z \in \mathbb{C} : |z| < 1\}$ into $G_{\mathbb{C}}$.

Theorem 5. $LG_{\mathbb{C}} = \Omega G \cdot L^+G_{\mathbb{C}}$.

This result shows that $X$ is a homogeneous space of the complex Lie group $LG_{\mathbb{C}}$:

$$X \cong LG_{\mathbb{C}} / L^+G_{\mathbb{C}}$$

One of the most striking facts about $X$ is that it behaves like a compact complex manifold.

Theorem 6.
(a) Every holomorphic map $X \to \mathbb{C}$ is constant on each connected component of $X$.
(b) If $M$ is any compact complex manifold, the space of based holomorphic maps $M \to X$ lying in a fixed homotopy class is finite-dimensional.

The second important aspect of the geometry of $X$ is the existence of an invariant symplectic structure. The tangent space to $\Omega G$ at the identity element is $L\mathfrak{g}/\mathfrak{g}$, where $\mathfrak{g}$ is the Lie algebra of $G$. The formula

$$\omega(\xi, \eta) = \frac{1}{2\pi} \int_0^{2\pi} \langle \xi(\theta), \eta'(\theta) \rangle \, d\theta$$

where $\langle\,,\rangle$ is an invariant inner product on $\mathfrak{g}$, defines a skew form on $L\mathfrak{g}/\mathfrak{g}$; extending by left translation gives a nondegenerate closed 2-form on $X$ (the nondegeneracy is in the weak sense, that $\omega(\xi, \eta) = 0$ for all $\eta$ implies $\xi = 0$).

Moreover, the complex structure and the symplectic structure are compatible, in the sense that they fit together to give a Kähler structure on $X$; this means that $\omega(J\xi, J\eta) = \omega(\xi, \eta)$ and that $\omega(\xi, J\eta)$ is a Riemannian metric on $X$, where $J : L\mathfrak{g}/\mathfrak{g} \to L\mathfrak{g}/\mathfrak{g}$ is the infinitesimal complex structure. The Riemannian metric in question is the Sobolev $\tfrac{1}{2}$-metric, given by the $L^2$-norm of the $\tfrac{1}{2}$-th derivative.

B. Stratifications

There is a canonical real-valued function on $X$, the energy function, given by

$$E(f) = \frac{1}{4\pi} \int_0^{2\pi} \langle f^{-1} f'(\theta), f^{-1} f'(\theta) \rangle \, d\theta$$

The Hamiltonian vector field associated to $E$ by the symplectic structure is the derivative of the circle action on $X$ that rigidly rotates loops. The critical points of $E$ are thus the homomorphisms $S^1 \to G$; they fall into conjugacy classes under the action of $G$ and each conjugacy class is a connected, compact complex manifold; the conjugacy classes are in one-to-one correspondence with the orbits of the action of the Weyl group $W$ of $G$ on the lattice of homomorphisms $S^1 \to T$.

The Kähler metric on $X$ defined above allows one to define the gradient vector field of $E$. Let $\{f_t\}$ be the downward gradient flow of $E$ passing through a loop $f$ at time $t = 0$; in other words, the solution of the ordinary differential equation

$$\frac{\partial f_t}{\partial t} = -\operatorname{grad} E$$

Since $X$ is infinite-dimensional, it is not clear a priori that this exists.

Theorem 7.
(a) The integral curve $f_t$ of the downward gradient flow exists for all $t > 0$ for any initial loop $f$. The integral curve exists for all $t < 0$ if and only if the initial loop $f$ is polynomial.
(b) For all integral curves $f_t$, $\lim_{t \to \infty} f_t$ exists and is a critical point of $E$. If $f_0$ is a polynomial loop, $\lim_{t \to -\infty} f_t$ exists and is a critical point of $E$.

For any conjugacy class $C$ of homomorphisms $S^1 \to G$, let $X_C$ and $X^C$ denote the parts of $X$ that tend to a point of $C$ as $t \to \infty$ and as $t \to -\infty$, respectively. Note that $X$ is the disjoint union of the $X_C$ and $X_{\mathrm{pol}} = L_{\mathrm{pol}}G/G$ is the disjoint union of the $X^C$.

Theorem 8.
(a) For any conjugacy class $C$, $X_C$ and $X^C$ are locally closed complex submanifolds of $X$ of finite codimension and finite dimension, respectively.
(b) The intersection of $X_C$ and $X^C$ is transverse and consists of the conjugacy class $C$.
(c) The stratum $X_1$ corresponding to the identity homomorphism is open and dense in the identity component of $X$.
(d) If $\lambda$ is any homomorphism in $C$,

$$X_C = L^-G_{\mathbb{C}} \cdot \lambda$$
$$X^C = L^+_{\mathrm{pol}}G_{\mathbb{C}} \cdot \lambda$$

Here, $L^-G_{\mathbb{C}}$ is the subgroup of $LG_{\mathbb{C}}$ consisting of the loops that are boundary values of holomorphic maps from the disc $|z| > 1$ in the Riemann sphere into $G_{\mathbb{C}}$.
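The critical values of the energy function are easy to compute by hand. The following short calculation is a sketch that assumes the critical loop is written as $\lambda(\theta) = \exp(\theta\xi)$ for some $\xi \in \mathfrak{g}$ with $\exp(2\pi\xi) = 1$; the symbols $\xi$, $h$ and $n$ are introduced here for illustration only.

```latex
\begin{aligned}
\lambda(\theta) &= \exp(\theta \xi)
  \quad\Longrightarrow\quad
  \lambda(\theta)^{-1} \lambda'(\theta) = \xi \ \text{ for all } \theta, \\
E(\lambda) &= \frac{1}{4\pi} \int_0^{2\pi} \langle \xi, \xi \rangle \, d\theta
  = \tfrac{1}{2} \langle \xi, \xi \rangle. \\[4pt]
\text{For } G = SU(2),\ \xi = n h,\ h = \operatorname{diag}(i, -i):
  \quad E(\lambda) &= \tfrac{1}{2} n^2 \langle h, h \rangle .
\end{aligned}
```

The energy of a homomorphism therefore depends only on the norm of its generator, which is consistent with the critical set being a union of conjugacy classes, since the inner product is invariant under conjugation.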

The first half of part (c) is equivalent to the Birkhoff factorization theorem, which asserts that every loop $f$ in $\Omega G$ can be written as

$$f = f_- \cdot \lambda \cdot f_+$$

where $f_\pm \in L^\pm G_{\mathbb{C}}$.

C. Grassmannian Embedding

Let $H$ be a separable infinite-dimensional Hilbert space equipped with a polarization, that is, an orthogonal splitting $H = H_+ \oplus H_-$ into a pair of closed infinite-dimensional subspaces. The restricted general linear group $GL_{\mathrm{res}}(H)$ is defined as the group of invertible operators on $H$ whose block decomposition

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

with respect to the polarization has the property that $b$ and $c$ are Hilbert–Schmidt operators. Then $a$ and $d$ are necessarily Fredholm operators, and $\mathrm{index}(a) = -\mathrm{index}(d)$. The group $GL_{\mathrm{res}}(H)$ is a Hilbert Lie group and has connected components determined by the index of $a$. Also of importance is the subgroup $U_{\mathrm{res}}(H)$ of unitary operators in $GL_{\mathrm{res}}(H)$.

The restricted Grassmannian $\mathrm{Gr}(H)$ of $H$ is defined to be the set of closed subspaces $W$ of $H$ such that the orthogonal projections $W \to H_+$ and $W \to H_-$ are Fredholm and Hilbert–Schmidt, respectively. The group $GL_{\mathrm{res}}(H)$ acts transitively on $\mathrm{Gr}(H)$. The Grassmannian $\mathrm{Gr}(H)$ is a Hilbert manifold and is homotopy equivalent to $\mathbb{Z} \times BU$, the universal classifying space for vector bundles.

To apply this theory to loop groups, one chooses a finite-dimensional unitary representation $V$ of $G$ and takes $H$ to be the space of $L^2$ functions on the circle with values in $V$. The subspaces $H_+$ and $H_-$ consist of the functions whose Fourier expansions are of the form $\sum_{n \ge 0} v_n z^n$ and $\sum_{n < 0} v_n z^n$, respectively. The loop group $LG_{\mathbb{C}}$ acts on $H$, and $LG$ acts by unitary operators.

Theorem 9. The action of $LG_{\mathbb{C}}$ on $H$ induces a homomorphism $LG_{\mathbb{C}} \to GL_{\mathrm{res}}(H)$ and a smooth map $X \to \mathrm{Gr}(H)$. Both maps are embeddings if $V$ is a faithful representation of $G$.

If $G = U_n$ and $V$ is the natural representation on $\mathbb{C}^n$, the image of $X_{\mathrm{pol}}$ in $\mathrm{Gr}(H)$ can be characterized as follows:

$$X_{\mathrm{pol}} = \{W \in \mathrm{Gr}(H) : zW \subset W \text{ and } z^N H_+ \subset W \subset z^{-N} H_+ \text{ for some } N\}.$$

Similar characterizations are possible for classes of loops other than polynomial and groups other than $U_n$.

D. Determinant Bundle and Central Extensions

If $H$ were a finite-dimensional space, there would be a holomorphic line bundle $\mathrm{Det}$ on $\mathrm{Gr}(H)$ whose fiber over $W \in \mathrm{Gr}(H)$ is the top exterior power of $W$. In the infinite-dimensional case, one can still define the determinant bundle, using the fact that operators on $H$ of the form $1 + (\text{trace class})$ have well-defined determinants. But whereas, in the finite-dimensional case, $\mathrm{Det}$ is homogeneous under the action of $GL(H)$, in the infinite-dimensional case only a central extension $\widetilde{GL}_{\mathrm{res}}(H)$ of $GL_{\mathrm{res}}(H)$ by $\mathbb{C}^\times$ acts on $\mathrm{Det}$.

By combining this with the embedding in Theorem 9, one obtains a central extension $\tilde{L}G_{\mathbb{C}}$ of $LG_{\mathbb{C}}$; up to finite coverings, every extension of $LG_{\mathbb{C}}$ by $\mathbb{C}^\times$ arises in this way. In particular, this shows that all the central extensions of $LG$ described in Section I.E have complexifications.

III. REPRESENTATION THEORY OF LOOP GROUPS

A. Positive-Energy Representations

A representation of $LG$ is said to be of positive energy if it extends to a representation of the semidirect product $S^1 \mathbin{\tilde{\times}} LG$, with $S^1$ acting on $LG$ by rotation, and if $e^{i\theta} \in S^1$ acts by $e^{iR\theta}$, where the spectrum of the self-adjoint operator $R$ is bounded below. The most interesting representations of $LG$ are those of positive energy; it turns out that these representations are projective, that is, representations of a central extension of $LG$. Apart from that, their properties and classification are closely analogous to those of the finite-dimensional representations of $G$. The irreducible positive-energy representations of $LG$ correspond to the so-called integrable highest weight representations of affine Lie algebras.

There are also representations that are not of positive energy. For any $a \in S^1$, there is a homomorphism $LG \to G$ given by evaluating the loop at $a$; pulling back an irreducible representation $V$ of $G$ by this homomorphism gives an irreducible representation $V(a)$ of $LG$; every finite-dimensional irreducible representation of $LG$ is a tensor product of representations of this form. A closely related construction is to let $f \in LG$ act on $LV$ by $(f \cdot v)(z) = f(az)v(z)$; tensor products of such representations are generically irreducible. Neither of these types of representations is of positive energy.

B. The Fundamental Representation and the Spin Representation

The space of holomorphic sections of $\mathrm{Det}$ is a completion of the exterior algebra $\Lambda(H_+ \oplus \bar{H}_-)$, where $\bar{H}_-$ is the orthogonal complement of $H_+$, but with the

complex conjugate complex structure. The induced action of the central extension $\widetilde{GL}_{\mathrm{res}}(H)$ on this space is irreducible; it is called the fundamental representation of $\widetilde{GL}_{\mathrm{res}}(H)$. “Irreducible” here means that the space contains no proper invariant subspace that is closed in the compact-open topology. Moreover, it is a unitary representation of $\tilde{U}_{\mathrm{res}}(H)$, the central extension of $U_{\mathrm{res}}(H)$ by $S^1$ obtained by restricting the extension $\widetilde{GL}_{\mathrm{res}}(H)$. This means that it contains as a dense subspace a Hilbert space on which $\tilde{U}_{\mathrm{res}}(H)$ acts by unitary operators.

One can define in a similar way the restricted orthogonal group $O_{\mathrm{res}}(H_{\mathbb{R}})$ of the real Hilbert space $H_{\mathbb{R}}$ underlying $H$. The fundamental representation can then be realized as the spin representation of $O_{\mathrm{res}}(H_{\mathbb{R}})$. The space of this realization is a direct sum of symmetric algebras indexed by the integers.

The isomorphism between these two realizations is called the boson-fermion correspondence, by analogy with quantum field theory. The space $\Lambda(H_+ \oplus \bar{H}_-)$ of the first realization is the Hilbert space of a system of free fermions, with $H_+$ and $H_-$ being the states of a single particle of positive and negative energy, respectively. Similarly, the symmetric algebra is the Fock space of a system of free bosons.

By restricting the fundamental representation, one obtains an irreducible unitary representation of $\tilde{L}U_n$, called the basic representation of $LU_n$.

C. Borel–Weil Theory

To construct the positive-energy representations of $LG$ for a general compact Lie group $G$, one must consider the homogeneous space $Y = LG/T$, where $T$ is a maximal torus of $G$. Dividing by the action of $G$ exhibits $Y$ as a bundle over $X$ with fiber $G/T$. For simplicity, we shall only consider the case where $G$ is simply-connected and simple. Then $Y$ is connected and simply connected so the complex line bundles over $Y$ are classified by their first Chern class, which is an element of

$$H^2(Y, \mathbb{Z}) \cong \mathbb{Z} \oplus H^2(G/T) \cong \mathbb{Z} \oplus \hat{T}$$

where $\hat{T}$ is the lattice of characters of $T$, that is, the homomorphisms $T \to S^1$. Let $L_{n,\lambda}$ be the line bundle associated to $(n, \lambda) \in \mathbb{Z} \oplus \hat{T}$. Then we have the following classification of the positive-energy representations of $LG$, which is a precise analog of the Borel–Weil theorem, which describes the finite-dimensional representations of $G$ as sections of line bundles over $G/T$.

Theorem 10. Let $G$ be a simply-connected, simple compact Lie group.
(a) Each complex line bundle $L_{n,\lambda}$ has a unique holomorphic structure.
(b) $L_{n,\lambda}$ has nonzero holomorphic sections if and only if $(n, \lambda)$ is dominant, in the sense that

$$0 \le \lambda(\check{\alpha}) \le n \langle \check{\alpha}, \check{\alpha} \rangle$$

for every simple coroot $\check{\alpha}$ of $G$. Here, $\langle\,,\rangle$ is the unique inner product on $\mathfrak{g}$ such that $\langle \check{\alpha}, \check{\alpha} \rangle = 2$ for every simple coroot $\check{\alpha}$.
(c) If $(n, \lambda)$ is dominant, the space $\Gamma(L_{n,\lambda})$ of holomorphic sections of $L_{n,\lambda}$ is an irreducible positive-energy representation of a central extension of $LG$.
(d) Every irreducible positive-energy representation of $LG$ arises in this way.

The positive integer $n$ is called the level of the representation $\Gamma(L_{n,\lambda})$. The pair $(n, \lambda)$ is usually regarded as an element of the dual of the Lie algebra of $\tilde{T}$, the part of the central extension $\tilde{L}G$ lying over $T$.

One can show also that every positive-energy representation of $LG$ is unitary, and hence breaks up into a direct sum of representations of the above type. In particular, this implies that every positive-energy representation of $LG$ extends to the complexification $LG_{\mathbb{C}}$. No such property holds for infinite-dimensional groups in general.

D. Kac Character Formula

V. Kac proved an analog for affine Lie algebras (in fact, for all symmetrizable Kac–Moody algebras) of the Weyl formula for the characters of the irreducible representations of $G$. The character of a representation $V$ of $LG \mathbin{\tilde{\times}} S^1$ is the formal power series

$$\chi_V = \sum_{\gamma} (\dim V_\gamma)\, \gamma(t, z)$$

where $V_\gamma$ is the part of $V$ on which the maximal Abelian subgroup $T \times S^1$ acts by the homomorphism $\gamma$. In some cases this can be interpreted as some kind of generalized function of $(t, z) \in T \times S^1$. The formula for the character is often written

$$\chi_V = \sum_{\gamma} (\dim V_\gamma)\, e^{\gamma}$$

identifying the homomorphism $\gamma$ with an element of the dual of the Lie algebra of $T \times S^1$.

Kac Character Formula

Let $L_\lambda$ be the holomorphic line bundle on $LG/T$ associated to a dominant element $\lambda$ of the dual of the Lie algebra of $\tilde{T}$, as in Theorem 10. Let $V$ be the space of holomorphic sections of $L_\lambda$. Then the character of $V$ is

$$\chi_V = \frac{\sum_{w \in W_{\mathrm{aff}}} (-1)^{l(w)}\, e^{w(\lambda - \rho) + \rho}}{\prod_{\gamma > 0} (1 - e^{\gamma})}$$
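The finite-dimensional Weyl character formula, of which the Kac formula is the affine analog, can be checked numerically in the simplest case. The sketch below takes $G = SU(2)$, where the Weyl group has two elements and the irreducible representation of highest weight $n$ has weights $n, n-2, \ldots, -n$; the function names are illustrative, and the example covers only this finite-dimensional analog, not the affine formula itself.

```python
import cmath


def character_by_weights(n, theta):
    """Sum of e^{i k theta} over the weights k = n, n-2, ..., -n."""
    return sum(cmath.exp(1j * (n - 2 * k) * theta) for k in range(n + 1))


def character_by_weyl_formula(n, theta):
    """Weyl's quotient: alternating sum over the two Weyl group elements,
    divided by the corresponding sum for the trivial representation."""
    numerator = cmath.exp(1j * (n + 1) * theta) - cmath.exp(-1j * (n + 1) * theta)
    denominator = cmath.exp(1j * theta) - cmath.exp(-1j * theta)
    return numerator / denominator


# Compare the two expressions at a few angles away from multiples of pi,
# where the denominator vanishes.
max_gap = max(
    abs(character_by_weights(n, t) - character_by_weyl_formula(n, t))
    for n in range(6)
    for t in (0.3, 1.1, 2.5)
)
print(max_gap)
```

In the affine case the alternating sum runs over the infinite group $W_{\mathrm{aff}}$ and the product over infinitely many positive roots, so the identity becomes one between formal power series rather than finite sums.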

Here, ρ is characterized by ρ(γ̌) = 1 for all simple roots γ of LG.

Regarded as functions of z ∈ S¹, these characters are boundary values of holomorphic functions in the disc |z| < 1. Since the group SL₂(ℤ) acts on the disc by appropriate linear fractional transformations, it acts also on the space of such functions. One of the remarkable facts about the characters is that the finite-dimensional vector space spanned by the characters of the irreducible positive-energy representations of a given level is preserved by the action of SL₂(ℤ). The proper explanation of this is to be found in conformal field theory.

E. Vertex Operators

Historically, the first realization of the basic representation was given in terms of so-called vertex operators, the construction and properties of which were known to physicists in dual resonance theory. The restriction of the central extension of LG to the Abelian group LT is an infinite-dimensional Heisenberg group, and thus has a canonical level 1 irreducible unitary representation, say H. One attempts to extend this representation, initially to the Lie algebra L𝔤, and then to LG, by defining, for each coroot α̌ of G, and each complex-valued function f on S¹, operators V_α(f) that obey the correct commutation relations with each other and with the generators of the action of LT on H. Since 𝔤 is spanned by 𝔱 and the coroots α̌, this will accomplish the task of extending the representation to L𝔤.

The construction works, at least in its simplest form, only when G is simply-laced, which means that all the simple coroots have the same length; the coroots can then be identified with the homomorphisms S¹ → T of minimal length. One defines

$$V_\alpha(f) = \int_0^{2\pi} f(\theta)\, V_\alpha(\theta)\, d\theta$$

for some "operator-valued distribution" V_α(θ) on the circle. To define V_α(θ), think of α as a homomorphism S¹ → T. For each ε > 0, let V_{α,θ,ε} be the element of LT such that

V_{α,θ,ε}(θ′) = 1 if |θ′ − θ| > ε

and such that V_{α,θ,ε}(θ′) traces out the loop α when θ′ goes from θ − ε to θ + ε. Then the limit

$$\lim_{\varepsilon \to 0} \varepsilon^{-1} V_{\alpha,\theta,\varepsilon}$$

exists in the distributional sense and defines the required V_α(θ).

The vertex operator construction is equivariant with respect to the action of diffeomorphisms of the circle. Recall from Section I that Diff(S¹) acts as a group of automorphisms of LG. Moreover, the action of the orientation-preserving diffeomorphisms Diff⁺(S¹) preserves the cocycle defining the central extension of LG, and Diff⁺(S¹) acts on L̃G.

Theorem 11. There is a (projective) action of Diff⁺(S¹) on all positive-energy representations of LG that intertwines with the action of LG.

More recently, the algebra of vertex operators has been formalized and generalized and is capable of describing representations of LG other than the basic one. Vertex algebras have also found other applications, notably to the construction of the "moonshine" representation of the Monster simple group.

IV. RELATIONS WITH OTHER PARTS OF MATHEMATICS

Due to limitations of space, we shall restrict discussion of applications of loop groups to the following two combinatorial topics.

A. Macdonald Identities

I. G. Macdonald discovered a remarkable series of multivariable power series identities, one for each simple compact Lie group G. When G = SU₂, this gives Jacobi's formula

$$\eta(q)^3 = \sum_{j=0}^{\infty} (-1)^j (2j+1)\, q^{(1/8)(2j+1)^2}$$

for the cube of the Dedekind eta-function

$$\eta(q) = q^{1/24} \prod_{j=1}^{\infty} \left(1 - q^{j}\right).$$

For a general G, Macdonald's identities give an analogous formula for η(q)^{dim G}. Kac and R. V. Moody independently observed that Macdonald's identities are precisely the statement of the Kac character formula for the trivial representation (the so-called denominator formula).

More recently, J. Lepowsky has shown that many other well-known power series identities have a Lie algebraic proof.

B. Quivers

A quiver is a finite set of points together with directed arrows joining certain pairs of points. A representation of a quiver is an assignment to each point i of a finite-dimensional vector space V_i and to each arrow going from point i to point j of a linear map from V_i to V_j. Quivers having only finitely many indecomposable representations correspond exactly to the Dynkin diagrams of simply-laced finite-dimensional complex simple Lie algebras. Quivers whose representations can be classified

correspond exactly to the Dynkin diagrams of affine Lie algebras that contain no double bonds.

SEE ALSO THE FOLLOWING ARTICLE

MANIFOLD GEOMETRY

BIBLIOGRAPHY

Kac, V. G. (1985). "Infinite Dimensional Lie Algebras," 2nd ed., Cambridge Univ. Press, Cambridge.
Pressley, A. N., and Segal, G. B. (1986). "Loop Groups," Oxford Univ. Press, Oxford.
P1: GPA Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN009B-400 July 6, 2001 20:51

Manifold Geometry
C. T. J. Dodson
University of Manchester Institute of Science and Technology

I. Preliminary Notions
II. Geometrical Spaces
III. Manifolds and Bundles
IV. Calculus of Sections
V. Metric Geometry
VI. Connection Geometry
VII. Singular Geometry
VIII. Topology, Geometry, and Physics

GLOSSARY

Bundle Superstructure over a base manifold used in the definition and analysis of geometrical entities.
Cohomology Algebraic procedure for studying global properties of spaces on manifolds, via differential forms.
Completeness Freedom from singularities, such as holes, in the underlying manifold.
Connection Fundamental geometrical object for prescribing parallelism structures on a manifold.
Curvature Tensor field carrying information about the noncommutativity of parallel transport.
Differential form Section of a bundle of antisymmetric tensors.
Geodesic General geometrical equivalent of a straight line in Euclidean geometry.
Metric tensor Fundamental tensor field for prescribing metrical geometry on a manifold.
Parallel transport Process available through a connection for the movement of geometrical entities along curves in the manifold.
Section "Slice" of a bundle that chooses one bundle point over each point of the base manifold.
Tensor field Section of a tensor bundle; similarly, vector fields.
Torsion Tensor field carrying information about the asymmetry of a connection.

MANIFOLD GEOMETRY is the latest contribution to one of the oldest branches of mathematics. It exploits modern abstract algebra and abstract analysis in the study of geometrical spaces, called manifolds, that are natural generalizations of spaces studied by Euclid. To handle the greater complexity, modern geometers created superstructures called bundles over manifolds. Through these bundles, generalizations of differential and integral calculus


allow a geometrization of diverse problems in analysis and in physical field theories; they also allow the use of algebraic topological methods of cohomology to study global properties. Very elegant formulations of the concepts of parallelism and curvature arise from the geometrical object called a connection. This distinguishes the geodesic curves that generalize to a manifold the Euclidean notion of a straight line and also gives a test for completeness of the manifold.

I. PRELIMINARY NOTIONS

A. Sets and Maps

We shall accept as sufficient for our purposes the intuitive notion of a set as a collection of distinct and definite elements. Likewise, a map from a set X to a set Y is a rule that prescribes precisely one element of set Y to be related to each element of set X. Then we indicate it by a diagram like

f : X → Y : x ↦ f(x)

and speak of the map f with domain X that sends a typical element x in X to the element f(x) in Y. A convenient way to view f is in terms of its graph, that is, the set of elements (x, f(x)) lying in the X × Y space. Every map f : X → Y defines another map bringing subsets of Y back to subsets of X

f← : sub Y → sub X : B ↦ {x ∈ X | f(x) ∈ B}

Usually, the sets X and Y that we shall be interested in will have some geometric structure and our maps will in some sense respect this structure. Typical problems reduced to geometrical essentials are:

1. Given sets X and Y with particular geometric structures, show that a map f : X → Y satisfying certain requirements exists.
2. Extend the domain of a given map.
3. Given a map f : X → Y and some geometric structure on Y, what geometric structure is induced on (i.e., pulled back to) X?
4. Given a map f : X → Y and some geometric structure on X, what geometric structure is coinduced on (i.e., pushed forward to) Y?

B. Algebra and Analysis

It turns out that two mathematical areas are intertwined in the precise formulation and solution of geometrical problems. On the one hand we need analysis to deal with concepts of nearness and change; on the other hand we need abstract algebra to organize our calculus procedures and to represent symmetries in the geometry. Our first task is to describe the appropriate parts of analysis and algebra to explain the above use of the term "geometric structure" on a set and to explain the kinds of requirements that are often imposed on maps that arise in geometrical contexts. We shall attempt to present them together in a geometrical setting with examples. In order to have a fund of examples we need to assume a working knowledge of certain mathematical raw materials, abbreviations, and basic tools.

We shall suppose some familiarity with elementary algebra and calculus on the real and complex number fields. The analysis on them depends on the modulus or absolute value function a ↦ |a|, which yields a distance function d between two numbers a and b by

d(a, b) = |a − b|

This satisfies the geometrical requirements:

d(a, b) = d(b, a) (symmetry)
d(a, b) = 0 if and only if a = b (positive definiteness)
d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality)

The algebra of the real and complex numbers depends on each of them having double group structures, one under addition with identity zero and one under multiplication (excepting zero) with identity one. Recall that a group is a set with a binary operation that is associative, has identity, and admits inverses.

C. Notations

Symbol      Meaning
ℕ           the set of natural numbers, i.e., 1, 2, 3, . . .
ℤ           the set of integer numbers, i.e., 0, ±1, ±2, . . .
ℚ           the set of rational numbers, i.e., p/q for q ≠ 0, p, q in ℤ
ℝ           the set of real numbers, i.e., the complete geometrical line
ℂ           the set of complex numbers, i.e., x + iy for i = √−1, x, y in ℝ
E²          the real plane, i.e., two-dimensional real space
Eⁿ          n-dimensional real space for n in ℕ: E¹ = ℝ
ℝⁿ          the standard n-dimensional real vector space
[0, 1]      the interval {x ∈ ℝ | 0 ≤ x ≤ 1}
[0, 1)      the interval {x ∈ ℝ | 0 ≤ x < 1}
x ∈ X       x is an element of set X
A ⊆ B       A is a set, all of whose elements are also elements of B; a subset of B
sub B       the set of all subsets of set B
A = B       A ⊆ B and B ⊆ A; A and B are the same set
∅           the empty set; the set with no elements
A ∩ B       the set of elements in both subsets A and B; their intersection

A ∪ B              the set of elements in one or both of subsets A or B; their union
{x ∈ X | P(x)}     the set of elements in X for which statement P(x) is true
⇒                  implies; then
⇔                  implies and is implied by; if and only if
A × B              the set of ordered pairs (a, b) such that a is in A and b is in B
X → Y              a map from set X to set Y
X ↠ Y              a map from set X onto (the whole of) set Y
X ≅ Y              an equivalence of structures (iso/homeo/diffeomorphism)
f ∘ g or fg        the composite map; do g then f

EXAMPLE. Denote by G the multiplicative group of unit modulus complex numbers, that is,

G = {z ∈ ℂ | |z| = 1} = {e^{iθ} | θ ∈ ℝ}

This group acts on the set of complex numbers in a simple way:

α: G × ℂ → ℂ : (e^{iθ}, w) ↦ e^{iθ} w

that is, it simply rotates each complex number w = re^{iφ} by the angle θ to give e^{iθ} × re^{iφ} = re^{i(θ+φ)}, for each choice of φ (Fig. 1). This α is a continuous action since

1. Nearby θ values send a fixed w to nearby numbers in ℂ;
2. Nearby w numbers are sent to nearby numbers by a fixed θ.

Consider seeking a real map (i.e., a function)

f: ℂ → ℝ

that is invariant under the above group action. Evidently we want commutativity of a diagram

$$\begin{array}{ccc}
G \times \mathbb{C} & \xrightarrow{\;\alpha\;} & \mathbb{C} \\
\downarrow{\scriptstyle p_2} & & \downarrow{\scriptstyle f} \\
\mathbb{C} & \xrightarrow{\;f\;} & \mathbb{R}
\end{array}
\qquad\text{that is, on elements,}\qquad
\begin{array}{ccc}
(\theta, w) & \longmapsto & e^{i\theta} w \\
\downarrow & & \downarrow \\
w & \longmapsto & ?
\end{array}$$

FIGURE 1 Rotation through angle θ.

Hence we seek f satisfying, for all θ and w, the condition

f(e^{iθ} w) = f(w)

This requirement is simply rotational symmetry; that is, f(w) is independent of the angular position of w. One such function is the modulus, defined for all w = x + iy by

f(x + iy) = (x² + y²)^{1/2}

Try repeating this example for the same group when acting as a rotation about a point other than the origin.

Whenever we have a group action on a set we wish to know the subsets left invariant by the action, if there are any. In our example they do exist and in fact they are the circles with center the origin; we call such subsets the orbits of the group action. They always form a partition of the set carrying the action into disjoint nonempty subsets. Orbits that consist of single points are called fixed points (the origin is one in the example). Orbits need not look like planetary orbits; for example, the action on the Euclidean plane E² given for the additive group of real numbers by

α: ℝ × E² → E² : (a, (x, y)) ↦ (x + a, y)

has no fixed point, and its orbits are the horizontal lines. This latter group action is simply translation by a real number along the x axis, but in fact any other direction (or all directions) could have been used. Indeed, if we combine all rotations, translations, and reflections into one collection we get the Euclidean group, so called because it leaves invariant the essentials of Euclidean geometry, namely lengths and angles.

D. Geometrical Spaces

Euclidean geometry is the usual geometry of real n-space Eⁿ and it is algebraically easy to handle because Eⁿ is an affine space; this simply means that relative to any choice of origin it is equivalent to a vector space. Non-Euclidean n-dimensional geometries arise from the smooth patching together of pieces of Eⁿ without regard for the preservation of their algebraic structures. Such patchwork structures are called n-manifolds, and they are our principal domain spaces in general geometry. Familiar examples of 2-manifolds are a sphere and a torus; evidently each has some residual local similarity to pieces of the Euclidean plane E² from which they are synthesized, but globally they are very different. One geometrical difference is already apparent: in the Euclidean plane the angle sum of any triangle is 180°, but it is easy to see that on a sphere we can find a triangle with sides the arcs of three great circles and having an angle sum of 270°. In fact, this angular excess is a manifestation of the presence of curvature. The formal way to handle it in general geometry is via an entity called a connection, which governs the definition of

parallelism on a manifold: going due north on a sphere is a curve that propagates in a parallel fashion on the curved surface, that is, maintains its direction as closely as it can. Intuitively, a connection structure provides a general geometer with a tool somewhat like the parallel rulers used by navigators.

Once we can say what we mean by the "same direction" at two different points it is meaningful to pose problems like: what is the rate of change in that direction of some entity defined on the space? By means of a connection we obtain a differential operator that precisely measures all such rates of change in an invariant (sometimes called covariant) way, that is, independently of the manner in which we choose to label the directions and points. The operator involved here is the covariant derivative and it replaces ordinary derivatives in non-Euclidean geometries, so allowing the representation of very intricate systems of differential equations such as those used in electrical engineering or for physical field theories.

Probably the most important differential operator on a manifold is the exterior derivative; this comes free with the manifold: no extra structure need be assumed. In quite a precise way it generalizes the gradient operation on scalar functions and the curl operation on vector functions that are used in vector calculus on E³. In that sense it turns out that curvature is the "curl of the connection." We shall see how the exterior derivative characterizes some intrinsically topological properties of manifolds by means of de Rham cohomology theory.

Perverse though it may seem, no sooner do mathematicians set up a nice machinery than there arises an interest in how and why the machinery could be made to seem inadequate or defective. Clearly, if well-founded logically, then the theory (e.g., general geometry) will not have intrinsic defects. However, near the "edges" of its validity some interesting problems arise. For example, general geometry can be asked to model the environment of a physical singularity, like a black hole. This means devising a geometrical space in which something goes wrong; namely, particles disappear if they follow certain paths. It is an interesting recent result that in such a model the singularity seems to be stable: we cannot make it go away by perturbing the geometry, as done for example in quantization procedures.

II. GEOMETRICAL SPACES

A. Euclidean Spaces

The simplest Euclidean space is the one-dimensional case of the real line, E¹. As is almost always the case in mathematical processes of generalization, there are two major steps: from one to more than one, and from finitely many to infinitely many. Thus the step from E¹ to E², the real plane, is intuitively difficult at first meeting; but the step on, to E⁴ or Eⁿ for any finite n, is simple because the algebra and analysis are essentially unchanged. The further step to infinite dimensions is also difficult because definitions change; the kind of hurdle that arises is like the transition from polynomial functions to power series. In fact, we shall not study infinite-dimensional spaces explicitly but they will arise incidentally.

So for our purposes, a good intuitive grasp of the geometry of the real plane E² is a prime asset. By suppressing coordinates and other manifestations of its "twoness" we shall be able to use the algebra and analysis for higher-dimensional situations. We summarize the geometry as follows:

1. Points of E² are ordered 2-tuples of real numbers like p = (p₁, p₂).
2. There is a vector difference between pairs of points, a map, diff: E² × E² → ℝ² : (p, q) ↦ q − p = (q₁ − p₁, q₂ − p₂), and this correspondence is the best possible in the sense that if we hold one of the points fixed then it is a one-to-one correspondence between all points of E² and all vectors of ℝ²; reversing the points reverses the vector.
3. The parallelogram law holds for vector differences of points.
4. There is a distance function defined between pairs of points, the length (or norm) of their vector difference:

dist(p, q) = ‖diff(p, q)‖

We use property (1) for labeling the points of the space: in E² each point has two coordinates, so E² and ℝ² are equivalent (in bijective correspondence) as sets. The vector difference map gives us two things:

1. The set of all directions at each point; this is the tangent space at the point.
2. A simple algebraic cohesion among points that facilitates the solution of problems by trigonometric methods.

Recall that in ℝ² we have the dot product between two vectors given by

u · v = (u₁, u₂) · (v₁, v₂) = u₁v₁ + u₂v₂

This also determines the length (or norm) of a vector u by

‖u‖ = √(u · u)

and angle θ between nonzero vectors u and v by

u · v = ‖u‖ ‖v‖ cos θ
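
The summary of plane geometry above (vector difference, distance, dot product, norm, and angle) can be sketched in a few lines of Python. This is only an illustrative sketch: points and vectors are assumed to be plain 2-tuples of floats, and the function names `diff`, `dot`, `norm`, `dist`, and `angle` are chosen here to mirror the text rather than taken from any library.

```python
import math

def diff(p, q):
    """Vector difference q - p between two points of the plane."""
    return (q[0] - p[0], q[1] - p[1])

def dot(u, v):
    """Dot product u . v = u1*v1 + u2*v2."""
    return u[0] * v[0] + u[1] * v[1]

def norm(u):
    """Length of a vector: the square root of u . u."""
    return math.sqrt(dot(u, u))

def dist(p, q):
    """Distance between points: the norm of their vector difference."""
    return norm(diff(p, q))

def angle(u, v):
    """Angle theta between nonzero vectors, from u . v = |u||v| cos(theta)."""
    c = dot(u, v) / (norm(u) * norm(v))
    # Clamp against rounding so acos always receives a value in [-1, 1].
    return math.acos(max(-1.0, min(1.0, c)))

p, q = (1.0, 2.0), (4.0, 6.0)
print(diff(p, q))                      # (3.0, 4.0)
print(diff(q, p))                      # (-3.0, -4.0): reversing the points reverses the vector
print(dist(p, q))                      # 5.0
print(math.degrees(angle((1.0, 0.0), (0.0, 2.0))))
```

Nothing in these definitions depends on there being exactly two coordinates except the explicit indexing, which is the sense in which the text says the "twoness" can be suppressed.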

FIGURE 2 Angle between two vectors.

The angle θ between two vectors is illustrated in Fig. 2. Calculus on E² depends on the usual process of partial differentiation applied to functions of two variables taking values in ℝⁿ. There are in particular two operators that readily generalize to Eⁿ: grad and div.

For a differentiable real-valued map

f : E² → ℝ : (x, y) ↦ f(x, y)

its gradient is the vector-valued map

$$\operatorname{grad} f : E^2 \to \mathbb{R}^2 : (x, y) \mapsto \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right),$$

grad f at (x, y) ∈ E² being viewed as a vector in the "tangent space" to E² at (x, y), that is, in the space of directions there.

For a differentiable tangent space-valued map

φ: E² → ℝ² : (x, y) ↦ (φ₁(x, y), φ₂(x, y))

its divergence is the real-valued map

$$\operatorname{div} \varphi : E^2 \to \mathbb{R} : (x, y) \mapsto \frac{\partial \varphi_1}{\partial x} + \frac{\partial \varphi_2}{\partial y}$$

Of course we can apply div to grad f; then we get the Laplacian

$$\Delta f : E^2 \to \mathbb{R} : (x, y) \mapsto \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$$

Typical problems that arise in geometrical form are:

Given X ⊆ E² find f : X → ℝ satisfying Δf = 0 and subject to certain boundary conditions on f.

Given X ⊆ E² and φ: X → ℝ find a curve

c: [0, 1] → E² : t ↦ (x(t), y(t))

beginning at (x₀, y₀), that is, c(0) = (x₀, y₀), with tangent vector there given by ċ(0) = (u₀, v₀) and satisfying

c̈ = grad φ (at t and c(t) respectively)

Here ċ(t) = (dx/dt, dy/dt), c̈(t) = (d²x/dt², d²y/dt²).

B. Non-Euclidean Spaces

We can obtain more general spaces, called manifolds, in two ways:

1. Keep the same point set (e.g., as in E²) but use nonstandard definitions for lengths and angles of vectors in the tangent spaces.
2. Construct a new set of points (e.g., by joining pieces of E²), then add some geometry.

In both cases we would want to meet certain compatibility conditions to avoid abrupt (i.e., not smooth) geometrical changes from place to place. In method 1 we already have all tangent spaces and may vary how we geometrize them in a smoothly changing manner. In method 2 we have only the tangent spaces for each small piece of Euclidean space so we have to make sure that they fit smoothly together, unambiguously overlapping at the joins; then we can introduce smoothly changing measures of lengths and angles for vectors over the whole space.

Method 1 would allow us to construct 2-manifolds that have the appearance of bent and twisted planes, for example, a saddle-shaped surface. However, method 1 would not allow the construction of a sphere or torus. Either of these would involve some joining together (called identification) of certain points of E². For example, to obtain a torus we begin with a rectangle in the plane, join together two edges to form a tube, then join the two open ends together. The joining process can be visualized through paper models, though this and the smoothing process at the joins can be made mathematically precise. There is a subtle point here. When we appeal intuitively to paper models it is implicit that they are viewed in three-dimensional space, and so at each point of such a model surface there is a well-defined tangent space of directions in the surface. It is actually the tangent plane at the point. In the formal mathematical process the tangent spaces can be constructed without leaving the surface. The same is true in general for n-manifolds constructed by joining and smoothing out pieces of Eⁿ. However, there is an important theorem due to Whitney which says that we could, if we wish, view any such n-manifold as embedded in a Euclidean space of sufficiently high dimension. We saw that the torus 2-manifold needed E³; in general an n-manifold may need E^{2n+1}.

One final point before getting to technical definitions. In geometry we often speak of "the unit circle" or "the n-sphere" when in fact there are many ways to describe these spaces. For example S¹, the unit circle, can be viewed as the set of all unimodular complex numbers or perfectly equivalently (i.e., up to structural isomorphism) as the set of vectors of unit norm in ℝ². We often do not bother to make distinctions between them.

III. MANIFOLDS AND BUNDLES

A. Manifolds

An n-manifold is a set M of points with the following properties:

1. M can support the notion of continuous functions on it; (we take M to be Hausdorff with countable base).
2. M is the union of a collection {U_α | α ∈ A} of its subsets; (the collection is an open cover for M).
3. For each α in the indexing set A there is a continuous equivalence between U_α and Eⁿ; (homeomorphisms φ_α : U_α → Eⁿ give coordinates).
4. The change of coordinate maps are smoothly differentiable; (φ_α ∘ φ_β⁻¹ is a diffeomorphism if U_α ∩ U_β ≠ ∅).

We call the collection {(U_α, φ_α) | α ∈ A} an atlas of charts for M. Properties 1–4 are clear enough but it is worth noting that 4 is meaningful because maps like φ_α ∘ φ_β⁻¹ go between pieces of Eⁿ on which we suppose differentiability to be well understood, namely, calculus on ℝⁿ attached to each point. It is property 4 that enables us to say what we mean by a map between two manifolds being differentiable. Let M′ be an m-manifold with atlas {(V_λ, ψ_λ) | λ ∈ B}. Then a map

f : M → M′

is called differentiable if and only if the composite maps ψ_λ ∘ f ∘ φ_α⁻¹ are differentiable wherever fU_α ∩ V_λ ≠ ∅, for any α ∈ A and λ ∈ B. The reason for this is apparent from the diagram

$$\begin{array}{ccc}
U_\alpha & \xrightarrow{\;f\;} & V_\lambda \cap fU_\alpha \\
\downarrow{\scriptstyle \varphi_\alpha} & & \downarrow{\scriptstyle \psi_\lambda} \\
E^n & \xrightarrow{\;\psi_\lambda \circ f \circ \varphi_\alpha^{-1}\;} & E^m
\end{array}$$

Evidently, we are borrowing for manifolds the already known property of differentiability of maps from Eⁿ to E^m. Property 4 is spoken of as giving a differentiable or smooth structure to M.

A similar trick of borrowing is employed to define T_xM, the tangent space to M at x ∈ M. If x ∈ U_α then we certainly have a nice vector space tangent at φ_α(x) ∈ Eⁿ and it is isomorphic to ℝⁿ. The problem is that we may also have x ∈ U_β. In that case φ_α ∘ φ_β⁻¹ and φ_β ∘ φ_α⁻¹ have derivatives that are actually isomorphisms between the tangent spaces at φ_α(x) and φ_β(x); in fact, they will appear as invertible Jacobian matrices with entries the partial derivatives, from calculus on ℝⁿ. Such isomorphisms are used to take equivalence classes of Eⁿ-tangent spaces for synthesizing T_xM. Then the process can be rounded off nicely because the derivative of a map between manifolds actually becomes a linear map between tangent spaces.

The totality of all tangent spaces to an n-manifold M is actually itself a 2n-manifold with point set

$$TM = \bigcup_{x \in M} T_x M = \{(x, u) \mid x \in M,\ u \in T_x M\}$$

and atlas {(TU_α, Tφ_α) | α ∈ A}, where TU_α = ∪_{x∈U_α} T_xM and Tφ_α is the derivative of φ_α. We call this manifold the tangent bundle to M and it is noteworthy that it comes free with M; no further structure was needed. There is a natural smooth map onto M:

p: TM ↠ M : (x, u) ↦ x

The fiber of such a map over x is

p←{x} = {y ∈ TM | p(y) = x}

which evidently coincides with T_xM and therefore looks like ℝⁿ for each x, though there is no unique isomorphism T_xM ≅ ℝⁿ actually determined for each x since each chart gives a different one. The tangent bundle is an example of an important class of structures over manifolds, the vector bundles.

B. Bundles

A vector bundle with fiber F (a vector space) over a manifold M is a smooth surjection p: X ↠ M such that every x ∈ M has a neighborhood U for which there is a diffeomorphism p←U ≅ U × F that is an isomorphism on fibers. This implies that the zero vectors in fibers are joined smoothly together and the set of all of them looks like a copy of M embedded in X. In the case of the tangent bundle over an n-manifold, F = ℝⁿ and the local triviality condition is

TU_α ≅ U_α × ℝⁿ

Thus a vector bundle over M consists of M together with copies of a vector space smoothly fitted over all points of M. We interpret the tangent bundle over M as the set of all points and directions in M.

If we have a vector bundle with fiber F over M then we may be able to use its construction to obtain another vector bundle with fiber the dual space F* consisting of all linear functionals defined on F. One case when this is always possible is for the tangent bundle TM; we replace each tangent space T_xM by its dual (T_xM)* consisting of real-valued linear maps on T_xM. The resulting bundle is T*M, the cotangent bundle. Recall that just as we can dual vector spaces so we can dual linear maps between them by sending linear f to linear f* where

$$F \xrightarrow{\;f\;} G, \qquad G^* \xrightarrow{\;f^*\;} F^*, \qquad\text{with } f^* : \alpha \mapsto \alpha \circ f$$

So the dual map actually goes in the reverse direction. For a finite-dimensional vector space F we may identify (F*)*, the double dual, with F itself by the natural isomorphism

F ≅ (F*)* : v ↦ v̂

where

v̂: F* → ℝ : α ↦ α(v)

EXAMPLE. Take M = S¹ = {z ∈ ℂ | |z| = 1}, the 1-sphere. Then TS¹ = S¹ × ℝ, a cylinder easily visualizable by considering each tangent line to S¹ to be represented by a vertical copy of ℝ¹ so that one such can be fitted to each point. Thus the tangent bundle to S¹ is a trivial product.

On the other hand, we can find another vector bundle over S¹ with fiber ℝ¹. This is the Möbius bundle in which the different copies of ℝ¹ attached to points of S¹ actually rotate through 180° as we pass once around the circle. So this bundle is not a trivial product; it is a twisted product and easily investigated by making a paper model. It is worth adding that the tangent bundle to the 2-sphere S² is not a trivial product; it is twisted. This is rather a deep result but quite believable when you consider the problems of stacking tangent planes on a sphere. On the other hand, there is the normal vector bundle on S², which is a trivial vector bundle with fiber ℝ¹.

Bundles can be made with fibers that are manifolds with structures other than vector spaces, for example, groups. The great importance of bundles is that they allow a crucial generalization of the concept of a map. To see this, consider the two vector bundles with fiber ℝ¹ over S¹ given above. Now a continuous map from S¹ to ℝ¹ is simply a continuous choice of a vector in ℝ¹ for each point in S¹. Given such a map we could draw its graph on the cylinder bundle S¹ × ℝ¹ above, as a continuous curve that projects onto the whole of S¹. Equally, given such a curve then it determines uniquely a continuous map from S¹ to ℝ¹ by sending each point of S¹ to the vector in ℝ¹ cut by the curve above that point. Now consider the twisted bundle. If we draw on this a continuous curve that projects onto S¹ then we do not obtain a unique continuous map from S¹ to ℝ¹ because the ℝ¹ in which we are trying to obtain values changes continuously as we go around S¹. The precise formulation of our generalization of a map is the following.

A section of a bundle p: X ↠ M with fiber F over M is a smooth map σ: M → X such that pσ is the identity map on M. Such a section is sometimes referred to as an X-field on M. So σ smoothly chooses at each point of M an element in the fiber over that point. Thus, if X = M × F, the trivial product bundle, then all sections have the form

σ: M → M × F: a ↦ (a, f(a))

where

f: M → F: a ↦ f(a)

is a map from M to F.

Conversely, if X is not a trivial product bundle then there will be some a ∈ M with neighborhoods U and V such that a ∈ U ∩ V with local triviality

λ: p←U ≅ U × F and µ: p←V ≅ V × F

Therefore we have three diffeomorphisms

$$\begin{array}{ccc}
 & p^{\leftarrow}U \cap p^{\leftarrow}V & \\
{\scriptstyle\lambda}\swarrow & & \searrow{\scriptstyle\mu} \\
(U \cap V) \times F & \xrightarrow{\;\mu\lambda^{-1}\;} & (U \cap V) \times F
\end{array}$$

Now, a section σ: M → X can be locally decomposed by means of λ and µ and U ∩ V thus:

$$\begin{array}{ccccc}
 & & & & (U \cap V) \times F \to F \\
 & & & \nearrow{\scriptstyle\lambda} & \\
U \cap V & \xrightarrow{\;\sigma\;} & p^{\leftarrow}U \cap p^{\leftarrow}V & & \\
 & & & \searrow{\scriptstyle\mu} & \\
 & & & & (U \cap V) \times F \to F
\end{array}$$

which on elements becomes of the form

$$\begin{array}{ccccc}
 & & & & (a, \lambda(a)) \mapsto \lambda(a) \\
 & & & \nearrow & \\
a & \longmapsto & \sigma(a) & & \\
 & & & \searrow & \\
 & & & & (a, \mu(a)) \mapsto \mu(a)
\end{array}$$

Now, as long as λ and µ differ, then σ does not define any unique map from U ∩ V to F.

C. Combining Vector Bundles

We can obtain new vector bundles from old by exploiting two processes available for vector spaces: their direct sum ⊕ and their tensor product ⊗. Consider two finite-dimensional real vector spaces F and G. Then we can represent F ⊕ G and F ⊗ G in terms of any bases {f₁, . . . , f_n} for F and {g₁, . . . , g_m} for G as follows:

F ⊕ G has basis {f₁, . . . , f_n, g₁, . . . , g_m}, dimension n + m
F ⊗ G has basis {f_i ⊗ g_j | i = 1, . . . , n; j = 1, . . . , m}, dimension nm

Thus, F ⊕ G arises from the disjoint union set of vectors in F and G (identifying the two zero vectors) and F ⊗ G arises from the product set of vectors, F × G. The structure of F ⊕ G is clear enough; for example, if F is the (x, y) plane in ℝ³ and G is the z axis then F ⊕ G = ℝ² ⊕ ℝ¹ ≅ ℝ³. Less clear is F ⊗ G, but in concrete applications we shall see a specific role provided for the basis vectors f_i ⊗ g_j. With bases {î, ĵ} for ℝ² and {k̂} for ℝ¹ we have a basis {î ⊗ k̂, ĵ ⊗ k̂} for ℝ² ⊗ ℝ¹ and so ℝ² ⊗ ℝ¹ ≅ ℝ².
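
The basis counts for the direct sum and the tensor product can be checked concretely in coordinates. The sketch below is illustrative only and rests on assumptions not made in the text: F and G are taken to be ℝⁿ and ℝᵐ with their standard bases, x ⊕ y is modeled as concatenation of coordinate lists, and x ⊗ y as the flattened table of products x_i y_j. The helper names are hypothetical.

```python
def standard_basis(n):
    """Standard basis of R^n as a list of coordinate lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def direct_sum(x, y):
    """Model x (+) y by concatenating coordinates: R^n (+) R^m = R^(n+m)."""
    return list(x) + list(y)

def tensor(x, y):
    """Model x (x) y by all products x_i * y_j, flattened: R^n (x) R^m = R^(n*m)."""
    return [xi * yj for xi in x for yj in y]

n, m = 2, 1                      # F = R^2 with basis {i, j}, G = R^1 with basis {k}
F, G = standard_basis(n), standard_basis(m)

# Basis of the direct sum: each f_i padded with zeros, then each g_j padded with zeros.
sum_basis = ([direct_sum(f, [0.0] * m) for f in F]
             + [direct_sum([0.0] * n, g) for g in G])

# Basis of the tensor product: one vector f_i (x) g_j for every pair (i, j).
tensor_basis = [tensor(f, g) for f in F for g in G]

print(len(sum_basis), len(tensor_basis))   # 3 2, i.e. n + m and n * m
print(tensor_basis)                        # [[1.0, 0.0], [0.0, 1.0]]
```

In this model the tensor basis of ℝ² ⊗ ℝ¹ is literally the standard basis of ℝ², which is one way to see the isomorphism stated in the text.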

Both products $\oplus$ and $\otimes$ can be extended in an obvious way to linear maps between spaces. For example, denote by $L(F; J)$ the vector space of linear maps from vector space $F$ to vector space $J$. Then with $f \in L(F; J)$ and $g \in L(G; K)$ we obtain new linear maps:

$$f \oplus g : F \oplus G \to J \oplus K : x \oplus y \mapsto f(x) \oplus g(y)$$
$$f \otimes g : F \otimes G \to J \otimes K : x \otimes y \mapsto f(x) \otimes g(y)$$

In particular, $F^* = L(F; \mathbb{R})$ and a very useful equivalent formulation of $L(F; J)$ itself is obtained:

$$L(F; J) \cong F^* \otimes J$$

For example $(\mathbb{R}^3)^* \otimes \mathbb{R}^2$ can therefore be interpreted as the space of linear maps from $\mathbb{R}^3$ to $\mathbb{R}^2$, namely the space of $2 \times 3$ matrices.

There is a third composition, also based on the tensor product, available for a vector space $F$ with itself. This is the alternating or exterior product $F \wedge F$. In terms of the basis $\{f_1, \ldots, f_n\}$ for $F$:

$F \wedge F$ has basis $\{f_i \otimes f_j - f_j \otimes f_i \mid i < j\}$ and so dimension $\tfrac{1}{2}(n^2 - n)$

We usually write $f_i \wedge f_j = \tfrac{1}{2}(f_i \otimes f_j - f_j \otimes f_i)$. Hence $\mathbb{R}^2 \wedge \mathbb{R}^2$ has basis $\{\hat{\imath} \otimes \hat{\jmath} - \hat{\jmath} \otimes \hat{\imath}\}$ and so $\mathbb{R}^2 \wedge \mathbb{R}^2 \cong \mathbb{R}^1$. Observe that we can only have $(F \wedge F) \cong F$ if $\tfrac{1}{2}(n^2 - n) = n$, that is, if $n = 3$. This fact is closely related to the existence on $\mathbb{R}^3$ only of the vector cross product $\times$ defined by

$$\hat{\imath} \times \hat{\jmath} = \hat{k}, \qquad \hat{\jmath} \times \hat{k} = \hat{\imath}, \qquad \hat{\imath} \times \hat{k} = -\hat{\jmath}$$

For, we can make an isomorphism

$$\mathbb{R}^3 \wedge \mathbb{R}^3 \longrightarrow \mathbb{R}^3 \times \mathbb{R}^3$$

by mapping linearly

$$\hat{\imath} \wedge \hat{\jmath} \mapsto \hat{\imath} \times \hat{\jmath}, \qquad \hat{\imath} \wedge \hat{k} \mapsto \hat{\imath} \times \hat{k}, \qquad \hat{\jmath} \wedge \hat{k} \mapsto \hat{\jmath} \times \hat{k}$$

Once again we can use the product $\wedge$ on spaces to give a product on linear maps. For $f, g \in L(F; J)$ we obtain

$$f \wedge g : F \wedge F \to J \wedge J : x \wedge y \mapsto f(x) \wedge f(y)$$

These vector space processes can be performed smoothly over a manifold to the fibers of vector bundles. In particular, we have automatically the following vector bundles derived from the tangent bundle $TM$ and cotangent bundle $T^*M$ to a smooth manifold:

$T^2_0 M = TM \otimes TM$ with fiber $\mathbb{R}^n \otimes \mathbb{R}^n$
$T^0_2 M = T^*M \otimes T^*M$ with fiber $(\mathbb{R}^n)^* \otimes (\mathbb{R}^n)^*$
$T^1_1 M = TM \otimes T^*M$ with fiber $(\mathbb{R}^n) \otimes (\mathbb{R}^n)^*$
$A^2 M = T^*M \wedge T^*M$ with fiber $(\mathbb{R}^n)^* \wedge (\mathbb{R}^n)^*$

It is clear enough how to get more factors with $\otimes$; less clear is how to get $A^{r+s} M$ from $A^r M$ and $A^s M$. We use the alternating operator (hence the $A$ in $A^2 M$)

$$A_k : F^* \otimes F^* \otimes \cdots \otimes F^* \to A^k F^* : w \mapsto w^A$$

where

$$w^A : F \times F \times \cdots \times F \to \mathbb{R} : (v_1, \ldots, v_k) \mapsto \frac{1}{k!} \sum_{\tau} \operatorname{sgn}(\tau)\, w\bigl(v_{\tau(1)}, v_{\tau(2)}, \ldots, v_{\tau(k)}\bigr)$$

Here $\tau$ runs over all $k!$ permutations of $\{1, 2, \ldots, k\}$, and $\operatorname{sgn}(\tau)$ is $\pm 1$ according as $\tau$ is an even or an odd permutation. Then we can define

$$(A^r F^*) \wedge (A^s F^*) = A^{r+s} F^* = A_{r+s}(A^r F^* \otimes A^s F^*)$$

which agrees with what we gave before for $r = s = 1$, and the exterior product turns out to be associative and distributive but anticommutative:

$$w \wedge v = (-1)^{rs}\, v \wedge w, \qquad w \in A^r F^*, \ v \in A^s F^*$$

It follows that if $\dim F = \dim F^* = n$ then

$$\dim A^r F^* = \binom{n}{r} = \frac{n!}{(n - r)!\, r!}$$

and therefore

$$\dim A^r F^* = \binom{n}{r} = \binom{n}{n - r} = \dim A^{n-r} F^*$$

so in particular

$$\dim A^n F^* = \dim A^0 F^* = 1, \qquad \dim A^k F^* = 0 \ \text{ for } k > n$$

All of this exterior algebra can be carried over to apply smoothly over a manifold $M$ and hence give rise to $A^r M$, the alternating tensor bundles of differential $r$-forms or $r$-form bundles. The collection of all $r$-forms for all $r = 0, 1, \ldots$ on an $n$-manifold $M$ is denoted $\Omega M$ and the exterior product gives $\Omega M$ the structure of a Grassmann or exterior algebra. The collection of all $r$-forms on $M$ is denoted $\Omega^r M$ so $\Omega M$ is the disjoint union of these over all $r = 0, 1, \ldots$. We call the $T^r_s M$ the tensor bundles and we usually adopt the conventional notation

$$T^0_0 M = A^0 M = M \times \mathbb{R}$$
for the trivial product vector bundle with fiber $\mathbb{R}^1$ over $M$. Much of general geometry and its applications is concerned with sections of these bundles. For example, we may point out that

1. A metric structure is determined by a section of $T^0_2 M$.
2. A connection (i.e., parallelism) structure is determined by a section of $A^2 M$.
3. Curvature of a connection is determined by a section of $T^1_3 M$.
4. On an $n$-dimensional configuration space of particles: the kinetic energy is a section of $T^0_2 M$ and the Lagrangian is a section of $A^n M$.

Notation
It is important to distinguish between a bundle $X \overset{p}{\twoheadrightarrow} M$ over $M$ and its space of sections, which sometimes is denoted $\chi(X)$ or $\Gamma(X)$. However, it is common to speak of, for example, the bundle $A^r M$ of $r$-forms on $M$ when properly $A^r M \overset{p}{\twoheadrightarrow} M$ is the bundle and the collection of its sections is quite different, denoted $\Omega^r M$ by us; precisely

$$\Omega^r M = \{\text{smooth } w : M \to A^r M \mid pw = 1_M\}$$

Similarly, we use $\Upsilon^r_s M$ to denote the collection of all sections of $T^r_s M$, that is, the $(r, s)$-tensor fields:

$$\Upsilon^r_s M = \{\text{smooth } v : M \to T^r_s M \mid pv = 1_M\}$$

D. Manifolds with Boundary
The alert reader may have noticed that our definition of an $n$-manifold disqualifies, for example, a unit cylinder and the closed unit disk from being 2-manifolds. They simply fail to be locally like $E^2$ at their boundary edges; instead the edge points have neighborhoods homeomorphic to half-spaces like

$$\{(x, y) \in E^2 \mid y \geq 0\}$$

Evidently such entities are likely to be needed in geometry and physics, so we shall allow such edge points in future. Spaces that have them are called manifolds with boundary and we shall denote the boundary of such an $M$ by $\partial M$. In fact, if $M$ is an $n$-manifold then $\partial M$ is an $(n - 1)$-manifold.

EXAMPLES

a. $M = \{x \in \mathbb{R}^3 \mid \|x\| \leq 1\}$ = closed 3-ball $B^3$
   $\partial M = \{x \in \mathbb{R}^3 \mid \|x\| = 1\} \cong$ 2-sphere $S^2$
b. Similarly, $\partial B^n \cong S^{n-1}$ for $n \geq 1$
c. $M = [0, 1] \times S^1$ = unit cylinder
   $\partial M = \{0\} \times S^1 \cup \{1\} \times S^1 \cong$ two disjoint circles

IV. CALCULUS OF SECTIONS

A. Module of Sections

Intuitively we viewed a section of a bundle $X$ over $M$ as a copy of $M$ lifted vertically up into the manifold $X$ in such a way that each point of $M$ goes to the fiber over itself. For this to be possible at all we need the projection $p$ of $X$ to be surjective onto $M$; this we denote by $X \overset{p}{\twoheadrightarrow} M$. In a precise technical sense a section $\sigma : M \to X$ is actually a lift of the identity map $1_M : M \to M$ since we want Fig. 3 commuting, namely such that $p(\sigma(x)) = x$, that is, $\sigma(x) \in p^{\leftarrow}\{x\}$ for all $x \in M$. Vector bundles always have one smooth such section, the zero section which sends each $x \in M$ to the zero vector in the fiber $p^{\leftarrow}\{x\}$. Some vector bundles (e.g., the trivial product bundles) have a full set of linearly independent sections; such sections, of course, can never cross the zero section. Other vector bundles (e.g., the Möbius bundle over the circle, or $TS^2$, the tangent bundle to the ordinary sphere) do not have even one continuous section that never meets the zero section.

FIGURE 3 Commutative diagram for a section.

The fact that sections take values in fibers means that over any point in $M$ we may use available algebraic structure in the fiber on sections passing through it. Now, the fibers are joined smoothly together, so not surprisingly we find that their algebraic structures change, but only smoothly, so we can apply much of the algebra to sections themselves. In particular, the set of sections of a vector bundle over a manifold is a vector space (infinite-dimensional) with pointwise addition and multiplication by numbers in each fiber. Furthermore, we can actually allow multiplication by numbers that vary smoothly from point to point, that is by sections of the trivial scalar bundle; this means that the set of sections of a vector bundle over a manifold is a module over the ring of smooth scalar-valued maps on the manifold. Thus, if $\sigma$, $\tau$ are two sections of the real vector bundle $X \twoheadrightarrow M$ and $\lambda : M \to \mathbb{R}$ is a smooth map, then also $\sigma + \tau$ and $\lambda\sigma$ are sections of $X \twoheadrightarrow M$ with

$$\sigma + \tau : M \to X : x \mapsto \sigma(x) + \tau(x)$$
$$\lambda\sigma : M \to X : x \mapsto \lambda(x)\sigma(x)$$

Much of geometrical analysis and its applications depend on the study of sections of the tensor bundles; these
are the tensor fields. As we have seen, they are built up from sections of the tangent and cotangent bundles via the tensor product. Accordingly, we shall begin with a study of tangent and cotangent vector fields.

B. Tangent Vector Fields
From the canonical structure of $TM$ inherent in the differentiable structure on an $n$-manifold $M$, it turns out that for each chart

$$\phi : U \to E^n : x \mapsto (x_1, x_2, \ldots, x_n)_x$$

the tangent space $T_x M$ has a basis $(\partial_1, \partial_2, \ldots, \partial_n)_x$, where each $\partial_i$ is the partial differential operator $\partial/\partial x_i$ for real functions defined on neighborhoods of $\phi(x)$. Hence, for all tangent vector fields $\sigma : M \to TM$, on the domain of such a chart, $\sigma$ appears in the form

$$\sigma : U \to TM : x \mapsto \bigl(\sigma^1 \partial_1 + \sigma^2 \partial_2 + \cdots + \sigma^n \partial_n\bigr)_x$$

with each $\sigma^i$ a real function on $U$. A change of chart induces a change of basis and corresponding change of components of $\sigma$, both in agreement with the usual transformation law for partial derivatives. Viewing $\Upsilon M$, the space of sections of $TM$, as a module, it is convenient to think of the map $x \mapsto (\partial_1, \partial_2, \ldots, \partial_n)_x$ as giving a basis for this module in the neighborhood $U$ of $x$.

EXAMPLE. Let $M = S^2$ be the unit sphere in $\mathbb{R}^3$, explicitly:

$$M = \{(x, y, z) \in \mathbb{R}^3 \mid x^2 + y^2 + z^2 = 1\}$$

Define two charts $(U, \phi)$, $(V, \tilde{\phi})$ for

$U = \{(x, y, z) \in M \mid z > 0\}$ (the “northern hemisphere”)
$V = \{(x, y, z) \in M \mid x > 0\}$ (the “eastern hemisphere”)

by

$$\phi : U \to \mathbb{R}^2 : (x, y, z) \mapsto (x, y) = (x_1, x_2), \text{ say}$$
$$\tilde{\phi} : V \to \mathbb{R}^2 : (x, y, z) \mapsto (y, z) = (\tilde{x}_1, \tilde{x}_2), \text{ say}$$

Then on $U \cap V = \{(x, y, z) \in M \mid x > 0, z > 0\}$ (the NE quadrant) we have alternative coordinates: the $(x_i)$ or the $(\tilde{x}_i)$. These determine corresponding local basis sections:

$$(\partial_1, \partial_2) = \left(\frac{\partial}{\partial x}, \frac{\partial}{\partial y}\right), \qquad (\tilde{\partial}_1, \tilde{\partial}_2) = \left(\frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right)$$

Evidently these are related (invertibly) by

$$\tilde{\partial}_1 = \partial_2 \quad \text{and} \quad \tilde{\partial}_2 = \frac{\partial x}{\partial z}\,\partial_1 + \frac{\partial y}{\partial z}\,\partial_2$$

with $x = (1 - y^2 - z^2)^{1/2}$ and $y = (1 - x^2 - z^2)^{1/2}$. Also on $U \cap V$, we could introduce angular coordinates by, say,

$$\hat{\phi} : U \cap V \to \mathbb{R}^2 : (x, y, z) \mapsto (\theta, \psi) = (\hat{x}_1, \hat{x}_2)$$

where $\theta$ and $\psi$ are the angles defined by

$$x = \cos\psi \cos\theta, \qquad y = \cos\psi \sin\theta, \qquad z = \sin\psi$$

so $\psi = 0$ on the equator and $\pi/2$ at the north pole, and $\theta = 0$ in the positive $x$ direction, $\theta = \pm\pi/2$ in the $\pm y$ directions. Again we get a local basis section

$$(\hat{\partial}_1, \hat{\partial}_2) = \left(\frac{\partial}{\partial \theta}, \frac{\partial}{\partial \psi}\right)$$

which is related to the original choice $(\partial_1, \partial_2)$ by

$$\hat{\partial}_1 = \frac{\partial x}{\partial \theta}\,\partial_1 + \frac{\partial y}{\partial \theta}\,\partial_2, \qquad \hat{\partial}_2 = \frac{\partial x}{\partial \psi}\,\partial_1 + \frac{\partial y}{\partial \psi}\,\partial_2$$

In fact, if we wish to exploit the sphericity of $S^2$ then the chart with $(\theta, \psi)$ coordinates is very convenient. For instance,

$$U \cap V \to TM : (\theta, \psi) \mapsto \cos\psi\, \hat{\partial}_1$$

locally models an east–west wind on the earth, decaying from the equator to the north pole. Similarly,

$$U \cap V \to TM : (\theta, \psi) \mapsto -\cos\psi\, \hat{\partial}_2$$

models a north wind.

A local flow about $x_0 \in M$ for a vector field $v$ on $M$ is a map, for some neighborhood $U$ of $x_0$ and positive $\varepsilon$,

$$\Phi : U \times (-\varepsilon, \varepsilon) \to M$$

such that $(\forall x \in U)$

a. $\Phi(x, 0) = x$.
b. $c_x : (-\varepsilon, \varepsilon) \to M : t \mapsto \Phi(x, t)$ is a curve with tangent vector $\dot{c}_x(t) = v(c_x(t))$.

Intuitively we think of $\Phi$ as a family of curves that “join up the arrows” of the vector field $v$. We call each curve $c_x$ an integral curve of $v$ through $x$.

EXAMPLE. Take $U$, $V$, and $M$ as in the previous example. Then a local flow on $S^2$ for the vector field

$$v : U \cap V \to TS^2 : (\theta, \psi) \mapsto \cos\psi\, \hat{\partial}_1$$
is given by

$$\Phi : (U \cap V) \times (-\varepsilon, \varepsilon) \to M : (\theta, \psi, t) \mapsto (\alpha(t), \beta(t))$$

where the real functions $\alpha$, $\beta$ must satisfy

$$\frac{d\alpha}{dt} = \cos\psi, \qquad \frac{d\beta}{dt} = 0, \qquad (\alpha(0), \beta(0)) = (\theta, \psi)$$

Hence $\alpha(t) = \theta + t\cos\psi$, $\beta(t) = \psi$, and the integral curves are parts of circles of latitude. There is a nice existence and uniqueness theorem for smooth local flows of smooth vector fields on manifolds. It says that through each point there is one and only one integral curve and $\Phi_t : x \mapsto \Phi(x, t)$ satisfies $\Phi_{t+s} = \Phi_t \circ \Phi_s$ when $t, s, (t + s) \in (-\varepsilon, \varepsilon)$. For our flow on $S^2$ this is satisfied for small enough $s$, $t$ since $\Phi_t : (\theta, \psi) \mapsto (\theta + t\cos\psi, \psi)$. Moreover, each $\Phi_t$ is a diffeomorphism from $U$ to $\Phi_t U$.

EXAMPLE. To find integral curves and hence a local flow for a vector field. We consider $M = E^2$ with its standard chart. Take the vector field

$$v : E^2 \to TE^2 : (x, y) \mapsto \partial_x + x^2\, \partial_y$$

It is often convenient to use $(x, y)$ as coordinate labels instead of $(x^1, x^2)$; then we denote their induced basis fields by $\partial_x = \partial_1$ and $\partial_y = \partial_2$. We take the open subset of $E^2$ given by

$$U = \bigl\{(x, y) \in E^2 \;\big|\; |x + 1| < \tfrac{1}{4},\ |y| < \tfrac{1}{4}\bigr\}$$

We seek for each $a \in U$ an integral curve

$$c_a : \bigl(-\tfrac{1}{2}, \tfrac{1}{2}\bigr) \to E^2 \quad \text{with } c_a(0) = a \text{ and } \dot{c}_a(t) = v(c_a(t))$$

The differential equation expands into

$$\dot{c}_a(t) = \dot{c}^1(t)\, \partial_x + \dot{c}^2(t)\, \partial_y = \partial_x + (c^1(t))^2\, \partial_y$$

Hence $\dot{c}^1(t) = 1$ and $\dot{c}^2(t) = (c^1(t))^2$. Therefore $c^1(t) = k + t$ and $c^2(t) = \tfrac{1}{3}(k + t)^3 + l$. Taking $a = (\alpha, \beta)$ as the initial point, we find

$$c_a(t) = \bigl(\alpha + t,\ \beta - \tfrac{1}{3}\alpha^3 + \tfrac{1}{3}(\alpha + t)^3\bigr)$$

Accordingly, the local flow is given by

$$\Phi_s : (\alpha, \beta) \mapsto \bigl(\alpha + s,\ \beta - \tfrac{1}{3}\alpha^3 + \tfrac{1}{3}(\alpha + s)^3\bigr)$$

and we easily check that

$$\Phi_{t+s} = \Phi_t \circ \Phi_s : (\alpha, \beta) \mapsto \bigl(\alpha + s + t,\ \beta - \tfrac{1}{3}\alpha^3 + \tfrac{1}{3}(\alpha + s + t)^3\bigr)$$

C. Cotangent Vector Fields

Given an $n$-manifold $M$ and a chart

$$\phi : U \to E^n : x \mapsto (x_1, x_2, \ldots, x_n)_x$$

we obtain a set of local tangent vector fields

$$\partial_i : U \to TU : x \mapsto (\partial_i)_x, \qquad i = 1, 2, \ldots, n$$

that generate all tangent vector fields over $U$. Now, an element of $T^*M$ is a dual vector to some tangent vector on $M$, that is, an element of

$$(T_x M)^* = L(T_x M; \mathbb{R}) \quad \text{for some } x \in M$$

Also, if $(b_1, \ldots, b_n)$ is a basis for $T_x M$ then it is easy to show that $(c^1, \ldots, c^n)$ is a basis for $(T_x M)^*$ with

$$c^j : T_x M \to \mathbb{R} : \lambda^1 b_1 + \cdots + \lambda^n b_n \mapsto \lambda^j, \qquad j = 1, 2, \ldots, n$$

Hence in terms of the Kronecker symbol

$$c^j(b_i) = \delta^j_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad i, j = 1, 2, \ldots, n$$

and we call $(c^1, \ldots, c^n)$ the dual basis to the basis $(b_1, \ldots, b_n)$. Clearly we can extend this process to the fields $(\partial_1, \ldots, \partial_n)$ and obtain their duals $(dx^1, \ldots, dx^n)$ with

$$dx^j(\partial_i) = \delta^j_i, \qquad i, j = 1, 2, \ldots, n$$

This notation is standard and should not be confused with the unfortunate usage of $dx^j$ in older texts to mean a small increment in coordinate $x^j$. The correct way to view any $df$ is as the gradient of a real-valued function $f$ on $M$. With respect to our chart diffeomorphism

$$\phi : U \to E^n : x \mapsto (x_1, \ldots, x_n)_x$$

we get a restriction of $f$ to $U$. Now we define

$$df : U \to T^*M : x \mapsto \bigl(f_1\, dx^1 + \cdots + f_n\, dx^n\bigr)_x$$

where for each $i = 1, \ldots, n$

$$f_i : E^n \to \mathbb{R} : (x_1, \ldots, x_n) \mapsto \frac{\partial (f \circ \phi^{-1})}{\partial x_i}$$

So in particular if we choose $f$ to be $x^j$, that is, the $j$th coordinate function of the chart, then

$$dx^j : U \to T^*M : x \mapsto \left(\frac{\partial x^j}{\partial x^1}\, dx^1 + \cdots + \frac{\partial x^j}{\partial x^n}\, dx^n\right)_x$$

But $\partial x^j / \partial x^i = \delta^j_i$ so $dx^j(x) = (dx^j)_x$, $j = 1, 2, \ldots, n$.
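
The dual-basis relation $c^j(b_i) = \delta^j_i$ used in this section can be checked concretely. The following is a minimal Python sketch for the two-dimensional case, assuming integer basis components so that exact rational arithmetic applies; the names `dual_basis_2d` and `pair` are illustrative only and do not come from the article.

```python
from fractions import Fraction


def dual_basis_2d(b1, b2):
    """Return covectors (c1, c2), as coefficient pairs, with c_j(b_i) = delta_ij.

    b1 and b2 are pairs of integers forming a basis of the plane.
    The covectors are the rows of the inverse of the matrix whose
    columns are b1 and b2.
    """
    det = b1[0] * b2[1] - b1[1] * b2[0]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent")
    c1 = (Fraction(b2[1], det), Fraction(-b2[0], det))
    c2 = (Fraction(-b1[1], det), Fraction(b1[0], det))
    return c1, c2


def pair(c, v):
    """Evaluate the covector c on the vector v."""
    return c[0] * v[0] + c[1] * v[1]


if __name__ == "__main__":
    b1, b2 = (2, 1), (1, 1)
    c1, c2 = dual_basis_2d(b1, b2)
    # Kronecker delta table c_j(b_i)
    print([[pair(c, b) for b in (b1, b2)] for c in (c1, c2)])
    # Components of v = 3*b1 + 5*b2 are recovered by the dual basis
    v = (3 * b1[0] + 5 * b2[0], 3 * b1[1] + 5 * b2[1])
    print(pair(c1, v), pair(c2, v))
```

Applying each $c^j$ to a vector $\lambda^1 b_1 + \lambda^2 b_2$ returns the component $\lambda^j$, which is exactly how $dx^j$ reads off the $j$th component of a tangent vector in the chart basis.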
We have been careful so far to put subscript $x$ on various entities to indicate that they vary with $x$. This is not always necessary and where it is clear whether we mean, for instance, a vector at $x$ or a field about $x$, we shall omit such subscripts. There is another very useful notational trick: the summation convention attributed to Einstein. This is simply to sum over repeated upper and lower indices. For example, if $A$ is an $n \times n$ matrix with entries $a^i_j$ then we can write $\operatorname{trace} A = a^i_i$. Thus the identity matrix has entries $\delta^i_j$ and its trace is $\delta^i_i = n$. We shall use the same convention with more than one symbol present, for instance,

$$f_1\, dx^1 + f_2\, dx^2 + \cdots + f_n\, dx^n = f_i\, dx^i$$
$$v^1 \partial_1 + v^2 \partial_2 + \cdots + v^n \partial_n = v^i \partial_i$$
$$h^{11}\, \partial_1 \otimes \partial_1 + h^{12}\, \partial_1 \otimes \partial_2 + \cdots + h^{nn}\, \partial_n \otimes \partial_n = h^{ij}\, \partial_i \otimes \partial_j$$

D. Commutator or Lie Bracket

Given two tangent vector fields $u$, $v$ with expressions $u = u^i \partial_i$ and $v = v^j \partial_j$ in terms of local partial derivatives defined by some chart $(U, \phi)$, we obtain their commutator vector field

$$[u, v] : U \to TM : x \mapsto \bigl(u^i\, \partial_i v^j - v^i\, \partial_i u^j\bigr)\, \partial_j$$

Independently of any chart we can define the commutator through its action as a derivation on any smooth real function $f : M \to \mathbb{R}$ by

$$[u, v](f) = u(v(f)) - v(u(f))$$

A feel for what the commutator measures can be obtained by examining the flows of $u$ and $v$. Suppose that these flows are, respectively, $\Phi$ and $\Psi$ with

$$\Phi_t : U \to M : x \mapsto \Phi(x, t)$$
$$\Psi_s : U \to M : x \mapsto \Psi(x, s)$$

Then it turns out that on $U$

$$\Phi_t \circ \Psi_s = \Psi_s \circ \Phi_t \quad \text{if and only if} \quad [u, v] = 0$$

Thus, $[u, v]$ measures the failure of the two flows to commute. It follows that every vector field $v$ satisfies the self-commuting property

$$[v, v] = 0$$

and the flow $\Phi$ of $v$ reflects this in the identity $\Phi_t \circ \Phi_s = \Phi_s \circ \Phi_t = \Phi_{s+t}$ wherever all are defined on $U$.

E. Products of Fields

We have seen how $\oplus$ and $\otimes$ can be used in vector spaces, and it is easy to see how these operations can be performed smoothly over a manifold, so giving corresponding products on fields. Locally with respect to some chart $\phi : U \to E^n$, suppose that we have basis fields $\{\partial_1, \ldots, \partial_n\}$ for tangent and $\{dx^1, \ldots, dx^n\}$ for cotangent vector fields. Then we obtain, for example, the basis fields

$\{\partial_i \otimes \partial_j \mid i, j = 1, \ldots, n\}$ for $T^2_0 U$
$\{dx^i \otimes dx^j \mid i, j = 1, \ldots, n\}$ for $T^0_2 U$
$\{dx^i \wedge dx^j \mid 1 \leq i < j \leq n\}$ for $\Omega^2 U$
$\{dx^{i_1} \wedge dx^{i_2} \wedge \cdots \wedge dx^{i_r} \mid 1 \leq i_1 < i_2 < \cdots < i_r \leq n\}$ for $\Omega^r U$

To accommodate the structure of such spaces as $\Omega^r U$ we modify the summation convention by enclosing the lower set of repeated indices in brackets to indicate that the summation is only over increasing sequences. Thus, for example, we would write for $w \in \Omega^2 E^3$

$$w = w_{12}\, dx^1 \wedge dx^2 + w_{13}\, dx^1 \wedge dx^3 + w_{23}\, dx^2 \wedge dx^3 = w_{(ij)}\, dx^i \wedge dx^j$$

EXAMPLE. Consider the simple case $M = E^2$. Then we can illustrate two important fields with respect to the standard chart.

(1) $g : M \to T^0_2 M : (x_1, x_2) \mapsto \delta_{ij}\, dx^i \otimes dx^j$, where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$.

This defines the usual metric structure on $E^2$, that is, the standard dot product on each tangent space $T_x M$:

$$g(u^k \partial_k, v^m \partial_m) = \delta_{ij}\, dx^i \otimes dx^j\, (u^k \partial_k, v^m \partial_m) = \delta_{ij}\, u^k v^m\, dx^i \otimes dx^j\, (\partial_k, \partial_m) = \delta_{ij}\, u^k v^m\, \delta^i_k \delta^j_m = \delta_{ij}\, u^i v^j = u^1 v^1 + u^2 v^2$$

(2) $\omega : M \to A^2 M : (x_1, x_2) \mapsto dx^1 \wedge dx^2$

This defines the usual geometrical measure on $E^2$, that is, the standard parallelogram area mapped out by pairs of vectors in each tangent space $T_x M$:

$$\omega(u^k \partial_k, v^m \partial_m) = \tfrac{1}{2}\bigl(dx^1 \otimes dx^2 - dx^2 \otimes dx^1\bigr)(u^k \partial_k, v^m \partial_m) = \tfrac{1}{2}(u^1 v^2 - u^2 v^1)$$

Both (1) and (2) generalize to $E^n$: $g$ has the same form but we sum over $i, j = 1, \ldots, n$; $\omega = dx^1 \wedge \cdots \wedge dx^n$. Any change of chart induces corresponding changes in their local expressions.
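
The two fields of the preceding example can be evaluated directly on component pairs. The sketch below is a hedged illustration in Python for $E^2$ with the standard chart; the function names `g` and `omega` mirror the symbols in the text, and the factor $\tfrac{1}{2}$ follows the convention $dx^1 \wedge dx^2 = \tfrac{1}{2}(dx^1 \otimes dx^2 - dx^2 \otimes dx^1)$ adopted above. Tangent vectors are assumed to be given by their components $(u^1, u^2)$.

```python
def g(u, v):
    """Standard metric on E^2: g(u, v) = delta_ij u^i v^j = u^1 v^1 + u^2 v^2."""
    return sum(ui * vi for ui, vi in zip(u, v))


def omega(u, v):
    """Evaluate dx^1 ^ dx^2 on (u, v) with the 1/2 normalization used in the text."""
    return 0.5 * (u[0] * v[1] - u[1] * v[0])


if __name__ == "__main__":
    u, v = (1.0, 2.0), (3.0, 4.0)
    print(g(u, v))                    # symmetric: equals g(v, u)
    print(omega(u, v), omega(v, u))   # antisymmetric: opposite signs
    print(omega(u, u))                # vanishes on repeated arguments
```

The symmetry of `g` and the antisymmetry of `omega` are the component-level counterparts of $\otimes$ with $\delta_{ij}$ and of $\wedge$, respectively.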
F. Exterior Derivative
There is an extraordinarily useful derivative on differential forms defined locally by

$$d : \Omega^r M \to \Omega^{r+1} M : w_{(i_1 \ldots i_r)}\, dx^{i_1} \wedge \cdots \wedge dx^{i_r} \mapsto dw_{(i_1 \ldots i_r)} \wedge dx^{i_1} \wedge \cdots \wedge dx^{i_r}$$

where $dw_{i_1 \ldots i_r} = \partial_j w_{i_1 \ldots i_r}\, dx^j$, the gradient of the component function. Surprisingly, this exterior derivative $d$ is uniquely determined by the requirements:

a. $d : \Omega^r M \to \Omega^{r+1} M$ linearly
b. $d : \Omega^0 M \to \Omega^1 M : f \mapsto df$
c. $w \in \Omega^r M$, $v \in \Omega^s M$ $\Rightarrow$ $d(w \wedge v) = dw \wedge v + (-1)^r\, w \wedge dv$
d. $d^2 = 0$

Here of course we are using $\Omega^r M$ to denote the space of sections of the bundle.

EXAMPLE. For a real function $f \in \Omega^0 M$, with respect to some chart

$$df = \partial_i f\, dx^i \in \Omega^1 M$$
$$d^2 f = d(\partial_i f) \wedge dx^i = \partial_j \partial_i f\, dx^j \wedge dx^i = 0 \quad \text{since } \partial_j \partial_i f = \partial_i \partial_j f$$

It is precisely $d^2 = 0$ when applied to $M = E^3$ that gives the familiar identities of ordinary vector calculus:

$$\operatorname{curl} \operatorname{grad} f = 0$$
$$\operatorname{div} \operatorname{curl} v = 0$$

Now, we know that on an $n$-manifold $M$, $T^*M = A^1 M$ is modeled on $(\mathbb{R}^n)^*$, which has dimension $n$. Hence, for $r > n$, $\Omega^r M$ contains only zero forms. Accordingly, the exterior derivative determines a chain complex of the spaces of forms viewed as modules over real functions:

$$0 \to \Omega^0 M \xrightarrow{\ d\ } \Omega^1 M \xrightarrow{\ d\ } \Omega^2 M \to \cdots \xrightarrow{\ d\ } \Omega^n M \to 0$$

Here $\Omega^0 M$ consists of the sections of $M \times \mathbb{R}$ and $\Omega^1 M = \Upsilon^0_1 M$. We always denote the zero module by 0. Properly, we should put indices on each $d$ to indicate its domain, but often they are omitted. At each stage in all such sequences of structure-preserving maps we have a standard terminology, which can be illustrated in Fig. 4. In $\Omega^r M$ we have two distinguished substructures:

1. $r$-forms sent to zero in $\Omega^{r+1} M$, the kernel of $d : \Omega^r M \to \Omega^{r+1} M$
2. $r$-forms arriving from $\Omega^{r-1} M$, the image of $d : \Omega^{r-1} M \to \Omega^r M$

FIGURE 4 Chain complex of modules.

We usually abbreviate those to $\ker d$ and $\operatorname{im} d$, and if we wish to emphasize precisely where they come from we write

$$\ker d = Z^r(M, d) \quad \text{and} \quad \operatorname{im} d = B^r(M, d)$$

Note that it is a consequence of $d^2 = 0$ that

$$B^r(M, d) \subseteq Z^r(M, d)$$

and a consequence of the linearity of $d$ that $B^r(M, d)$ is a submodule of $Z^r(M, d)$ and both are submodules of $\Omega^r M$. When equality occurs we say that the sequence is exact at $\Omega^r M$; this means that

if $dw = 0$ then $w = dv$ for some $v \in \Omega^{r-1} M$

There is a standard way to compare submodules of a module; just as for subgroups of groups and linear subspaces of vector spaces, we can take the quotient

$$H^r(M, d) = Z^r(M, d) / B^r(M, d)$$

We call $H^r(M, d)$ the $r$th de Rham cohomology module of $M$. Amazingly, it gives information about the connectivity of $M$ by indicating the presence of $(r + 1)$-dimensional holes.

EXAMPLE. To get a feel for the quotienting process, suppose that $\Omega^r M$ looks like $\mathbb{R}^4$ and that

$Z^r(M, d)$ is the 3-dimensional subspace $\mathbb{R}^3$
$B^r(M, d)$ is the $(x, y)$ plane in $Z^r(M, d)$

Then

$$H^r(M, d) = Z^r(M, d) / B^r(M, d) = \{[(x, y, z)]\}$$

where $[(x, y, z)] = \{(x', y', z) \in Z^r(M, d) \mid (x', y', 0) \in B^r(M, d)\}$. Hence $H^r(M, d)$ has one point, $[(x, y, z)]$, for each value of $z$ and therefore is isomorphic to $\mathbb{R}^1$. Geometrically we view the quotient of $\mathbb{R}^3$ by its $(x, y)$ subspace as the space consisting of all planes parallel to the $(x, y)$ plane. That is, we stop distinguishing between points in the same horizontal plane; the set of such planes is isomorphic to $\mathbb{R}^1$, the $z$ axis in $\mathbb{R}^3$.
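
The quotienting process of the preceding example can be mimicked in a few lines. This is a minimal Python sketch, assuming $Z = \mathbb{R}^3$ and $B$ = the $(x, y)$ plane as in the example; the helper names `in_B`, `same_class`, and `class_label` are hypothetical and chosen only for illustration.

```python
def in_B(w):
    """True if w = (x, y, z) lies in the (x, y) plane B, that is, z == 0."""
    return w[2] == 0


def same_class(w1, w2):
    """Two elements of Z represent the same class in Z/B iff their difference is in B."""
    diff = tuple(a - b for a, b in zip(w1, w2))
    return in_B(diff)


def class_label(w):
    """Identify the class [(x, y, z)] with its height z, giving Z/B ~ R^1."""
    return w[2]


if __name__ == "__main__":
    p, q, r = (1, 2, 5), (-3, 7, 5), (1, 2, 6)
    print(same_class(p, q))   # same horizontal plane
    print(same_class(p, r))   # different horizontal planes
    print(class_label(p), class_label(q), class_label(r))
```

The label map is linear and its kernel is exactly $B$, which is the sense in which the set of horizontal planes is isomorphic to the $z$ axis.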
G. Cohomology and Exactness

There are some special cases for an $n$-manifold $M$:

a. $B^0(M, d) = 0$, so $H^0(M, d) \cong Z^0(M, d) \cong \mathbb{R}^1$ (constant $f$) if $M$ is connected.
b. $Z^r(M, d) = 0$ for $r > n$, so $H^r(M, d) = 0$ for $r > n$.
c. If $Z^r(M, d) = B^r(M, d)$ then $H^r(M, d) = 0$.

The cohomology module $H^r(M, d)$ measures departure from exactness of $d$ at $\Omega^r M$. There are traditional names for elements:

$Z^r(M, d)$ consists of closed $r$-forms, $w$ with $dw = 0$
$B^r(M, d)$ consists of exact $r$-forms, $w$ with $w = dv$

From the quotient process it turns out that

$$u \in H^r(M, d) \Rightarrow u = w + B^r(M, d) \quad \text{for some } w \in Z^r(M, d)$$

that is: two equivalent representatives of $u$ in $Z^r(M, d)$ must differ only by some $dv \in B^r(M, d)$.

A famous result of Poincaré is the following:

if $M \cong E^n$ then $H^k(M, d) = 0$ for all $k > 0$

Intuitively, there are no holes in Euclidean space or in any manifold that is diffeomorphic to it.

The dimension of $H^r(M, d)$ is called the $r$th Betti number of $M$; it is a topological invariant, so two manifolds that are homeomorphic (not necessarily diffeomorphic) have the same Betti numbers. The Betti numbers are finite for compact manifolds $M$ and then they determine the Euler characteristic

$$\chi(M) = \sum_{r=0}^{\dim M} (-1)^r \dim H^r(M, d)$$

which, like the Betti numbers, can be computed by algebraic topological methods.

EXAMPLE. Take $M = S^1$, the unit circle. This is connected, so we find $H^0(S^1, d) \cong \mathbb{R}^1$, and it is one-dimensional, so $H^r(S^1, d) = 0$ for $r \geq 2$. We find $H^1(S^1, d)$; intuitively, it should not be zero because $S^1$ actually surrounds a two-dimensional hole.

$$0 \to \Omega^0 S^1 \xrightarrow{\ d\ } \Omega^1 S^1 \to 0$$
$$H^1(S^1, d) = Z^1(S^1, d) / B^1(S^1, d)$$

But $Z^1(S^1, d) = \Omega^1 S^1 = \{\text{smooth } w : S^1 \to A^1 S^1 = T^*S^1\}$ and $B^1(S^1, d) = \{df \mid f \in \Omega^0 S^1 = \text{smooth maps } S^1 \to \mathbb{R}\}$. Hence

$$H^1(S^1, d) = \{w + B^1(S^1, d) \mid w \in \Omega^1 S^1\}$$

Now $\Omega^1 S^1$ is generated by any nowhere-zero section of $A^1 S^1$. Denote by $\partial_1$ the element of $\Upsilon S^1$ defined by using an angle $\theta$ for coordinate on $S^1$. (Properly our coordinates should take values in $\mathbb{R}^1$ but we identify 0 and $2\pi$ to save using two charts explicitly.) Let $\omega$ be the dual of $\partial_1$; then $\omega$ is nowhere zero because $\partial_1$ is nowhere zero. Also, by definition, for any $h\partial_1 \in \Upsilon S^1$

$$\omega(h\partial_1) = h \quad \text{since } \omega(\partial_1) = 1$$

Hence, any 1-form $v$ on $S^1$ is given by

$$v : S^1 \to A^1 S^1 : \theta \mapsto \lambda(\theta)\,\omega$$

for some real function $\lambda$.

Now $\omega$ is not exact (though a common notation for it is $d\theta$) since it is not the differential of any smooth real function. For suppose $\omega = df$, then we require $df(\partial_1) = 1$ and hence

$$f : S^1 \to \mathbb{R} : \theta \mapsto f(\theta)$$

must be continuous and smooth with $df/d\theta = 1$ everywhere. Apparently $f(\theta) = \theta$ would satisfy this differential equation, but the problem is $f(0) \neq f(2\pi)$ so it is not continuous. Therefore $\omega + B^1(S^1, d)$ generates $H^1(S^1, d)$ and so

$$H^1(S^1, d) \cong \mathbb{R}^1$$

which has dimension 1, corresponding to one 2-hole in $S^1$.

H. Derivatives of Smooth Maps

A smooth map $f : M \to M'$ between two manifolds preserves smooth things. Now, since the prototypes of smooth things are the tangent structures, it is not surprising that such a map induces a smooth map, the derivative of $f$ between the tangent bundles; we denote it by $Df : TM \to TM'$, though sometimes $Df$ is written $Tf$. This smooth map is actually between two fiber bundles so, as we would wish, it preserves fibers; moreover, since tangent bundles are vector bundles it is actually linear between tangent spaces. Hence at each $x \in M$, $Df$ restricted to $T_x M$ is a linear map

$$D_x f : T_x M \to T_{f(x)} M', \qquad \mathbb{R}^n \xrightarrow{\ [f_{ij}]\ } \mathbb{R}^m$$

which, on choosing charts about $x$ and $f(x)$, appears locally as the Jacobian matrix $[f_{ij}]$ of partial derivatives of components of $f$ with respect to coordinates at $x$. As always, there is a dual to the linear map $D_x f$, going the other way:
$$(D_x f)^* : (T_{f(x)} M')^* \to (T_x M)^*, \qquad \mathbb{R}^m \xrightarrow{\ [f_{ij}]^t\ } \mathbb{R}^n$$

and its corresponding matrix representative is just the transpose of that for $D_x f$. As with the $D_x f$, their duals fit together smoothly to give a vector bundle map $(Df)^* : T^*M' \to T^*M$.

From $Df$ and $(Df)^*$ it is easy to obtain induced vector bundle maps for all of the tensor bundles. So

smooth $f : M \to M'$ $\Rightarrow$ smooth $D^r_s f : T^r_s M \to T^r_s M'$ and smooth $A^r f : A^r M' \to A^r M$

We often abbreviate $D^r_s f$ to $f_*$ and $A^r f$ to $f^*$ when we apply them to sections of these bundles.

It is of fundamental importance that

$f^* : \Omega^r M' \to \Omega^r M$ preserves closedness and exactness of $r$-forms

In consequence $f^*$ induces a linear map on cohomology:

$$H^r(f^*) : H^r(M', d) \to H^r(M, d)$$

Now certainly if $f$ has a smooth inverse (i.e., $f$ is a diffeomorphism) then each $D_x f$ and $(D_x f)^*$ would be an isomorphism and we would expect $f^*$ to be an isomorphism of modules giving an isomorphism in cohomology. In fact, $H^r(f^*)$ is an isomorphism if $f$ is any “smooth deformation” of a diffeomorphism. Precisely, $M$ and $M'$ will have the same cohomology if they have the same homotopy type, that is, if there are smooth maps $f : M \to M'$ and $g : M' \to M$ such that $g \circ f$ can be smoothly deformed into (is homotopic to) the identity $1_M$ and $f \circ g$ is similarly homotopic to $1_{M'}$. The allowed smooth deformations are contracting or stretching, but without cutting or joining or introducing any holes or corners.

$$M \underset{g}{\overset{f}{\rightleftarrows}} M'$$

Having the same cohomology obviously means having the same Betti numbers.

EXAMPLE. $S^1$ and the punctured plane $\mathbb{R}^2 \setminus \{(0,0)\}$ are of the same homotopy type because we have

$$f : S^1 \to \mathbb{R}^2 \setminus \{(0,0)\} : x \mapsto x$$
$$g : \mathbb{R}^2 \setminus \{(0,0)\} \to S^1 : x \mapsto x/\|x\| \quad \text{(usual norm)}$$

Then $g \circ f : S^1 \to S^1 : x \mapsto x$ is the identity. Also

$$f \circ g : \mathbb{R}^2 \setminus \{(0,0)\} \to \mathbb{R}^2 \setminus \{(0,0)\} : x \mapsto x/\|x\|$$

is a smooth deformation of the identity by the homotopy

$$F_t : \mathbb{R}^2 \setminus \{(0,0)\} \to \mathbb{R}^2 \setminus \{(0,0)\} : x \mapsto (1 - t)x + t\, x/\|x\|$$

Evidently, $F_0$ is the identity on $\mathbb{R}^2 \setminus \{(0,0)\}$, $F_1$ is $f \circ g$, and $F_t$ varies smoothly with $t$ in the interval $[0, 1]$. In fact, the same homotopy holds good for the same map between $S^n$ and $\mathbb{R}^{n+1} \setminus \{0\}$ ($\mathbb{R}^{n+1}$ with the origin removed). We conclude that for $0 \leq r \leq n$

$$H^r(S^n, d) \cong H^r(\mathbb{R}^{n+1} \setminus \{0\}, d) \quad \text{for } n \geq 1$$

I. Using Cohomology Data
We saw previously that $H^1(S^1, d) \cong \mathbb{R}^1$, since there is a closed 1-form that is not exact; we now know that there must also be a closed 1-form that is not exact on the punctured plane. Equivalently, there is a smooth curl-free vector field that is not the gradient of any function, for example, rotation about the origin with speed inversely proportional to distance from the origin.

Table I gives the nonzero de Rham cohomology of some familiar spaces. Each entry indicates the dimension of the corresponding module of nonexact closed forms. Recall that we agreed to allow manifolds with boundary.

TABLE I Betti Numbers of Some Familiar Spaces

Dimension of $H^r(M, d)$ (Betti numbers)

Space M                                          r=0    r=1    r=2    ···    r=n
Circle $S^1$                                     1      1
Torus $S^1 \times S^1$                           1      2      1
Sphere $S^2$                                     1      0      1
n-sphere $S^n$                                   1      0      0      ···    1
2-sphere + m handles                             1      2m     1
Projective plane $P^2$                           1
$P^n$ (n even)                                   1
$P^n$ (n odd)                                    1      0      0      ···    1
$\mathbb{R}^n$                                   1
Punctured torus                                  1
Klein bottle                                     1      1
2-sphere + m discs replaced by Möbius strips     1      m−1
n-ball $B^n$                                     1
Punctured n-ball $B^n \setminus \{0\}$           1      0      ···    1 (at r = n−1)

We can immediately infer the following, for example:

1. On a torus $S^1 \times S^1$ we can find $u, v \in \Omega^1(S^1 \times S^1)$ such that $u$ and $v$ are independently directed cotangent fields and neither is a gradient field; moreover, there exists $w \in \Omega^2(S^1 \times S^1)$ that is not
expressible as λ du + µ dv for any real functions not orientable by passing to density forms, but we shall
λ, µ. not pursue this.
2. On a sphere S 2 there is no nongradient cotangent If µ is a volume form on M then so is f µ for any
field: u ∈ 1 S 2 ⇒ u = d f for some f : S 2 → ; also, nowhere-zero smooth real function f , so such an f must
there is a nonzero w ∈ 2 S 2 , which is consequently stay the same sign on each component piece of M. Since
not of the form du for u ∈ 1 S 2 since all such have An M can have only one independent section, then volume
u = d f so du = d 2 f = 0. forms can only differ by such a factor f . We say µ and
3. If M is n (or the open n-ball, {x ∈ n | x < 1}, to f µ are equivalent choices of orientation if f is every-
which it is diffeomorphic) then every closed r -form where positive and opposite choices if f is everywhere
w ∈ r M is expressible as w = du for some negative.
u ∈ r −1 M; hence the partial differential equation Given a volume form µ on M and an atlas {(Uα , φα ) |
dw = 0 has a solution. α ∈ A} then it is possible to define an integral of an n-form
w ∈ n M by means of a technical device called a partition
of unity. This effectively shares w out fairly among the
V. METRIC GEOMETRY
charts involved in any overlapping. Then each φα∗ carries
In an n-manifold M we often have to measure two kinds of things: the "volume" of a piece of M or the "size" of vectors from some bundle over M. The special kind of measure function appropriate for a manifold is a volume form.

A. Volume Forms

A volume form on an n-manifold M is a nowhere-zero n-form $\mu \in \Lambda^n M$. Since $\Lambda^n M$ is a vector bundle [with fiber $\Lambda^n(\mathbb{R}^n)^*$] there may be no such volume form.

On $E^n$ with the standard coordinates there is a volume form, namely $dx^1 \wedge dx^2 \wedge \cdots \wedge dx^n$, as we have pointed out before. Now every point of M has a neighborhood that is diffeomorphic to $E^n$, so in each such neighborhood we could construct a local volume form by pulling it back from $E^n$; the problem is simply that we may not be able to join up these local expressions smoothly into a global volume form on M. A sufficient condition for effecting the joining up is easily motivated:

If $(U, \phi)$, $(V, \psi)$ are two charts about $x \in U \cap V$ with coordinates $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$, respectively, then
\[
dx^1 \wedge \cdots \wedge dx^n = J_{\phi\psi}\, dy^1 \wedge \cdots \wedge dy^n
\]
where $J_{\phi\psi}$ is the Jacobian determinant of the change of coordinates
\[
J_{\phi\psi} = \det(\partial x_i/\partial y_j)
\]
Plainly, a good start is if $J_{\phi\psi}$ is an everywhere positive function for enough overlapping charts to cover M; if it has this property then M is called orientable. In fact, being orientable is necessary and sufficient for M to admit a volume form, since from the outset we assumed that our manifolds have a Hausdorff topology with a countable base, the latter being just the property needed to ensure the proper local joining of local volume forms. Actually, integration can also be performed on manifolds that are

the appropriately weighted fraction of w to $\int_{E^n}$, where integration is as usual, and it remains to sum up the various weighted contributions from overlapping charts. For this to give a finite answer it is necessary that the sum converge, and for this we require M to have finite volume or, if not, then w must become zero on all but a finite region of M. The technical condition is that w has compact support. The integral of w is denoted $\int_M w$ and in particular the volume of M is $\int_M \mu$.

A fundamental application of this process is in Stokes' theorem:
\[
\int_M dv = \int_{\partial M} v \qquad \text{for } v \in \Lambda^{n-1} M \text{ with compact support}
\]
where $\partial M$ is the boundary of an n-manifold with boundary. In this general context, integration by parts has the expression
\[
\int_M df \wedge v = \int_{\partial M} f v - \int_M f\, dv
\]
where $f \in \Lambda^0 M$.

EXAMPLE. Let M be a closed, simply connected region in $E^2$. Take $v \in \Lambda^1 M$ to be given in standard coordinates by
\[
v = \tfrac{1}{2}(x\, dy - y\, dx)
\]
Then
\[
dv = \tfrac{1}{2}(dx \wedge dy - dy \wedge dx) = dx \wedge dy
\]
\[
\int_M dv = \int_{\partial M} v \;\Rightarrow\; \int_M dx \wedge dy = \frac{1}{2} \int_{\partial M} (x\, dy - y\, dx)
\]
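The Stokes identity relating the area integral of $dx \wedge dy$ to the boundary integral $\tfrac{1}{2}\oint (x\,dy - y\,dx)$ can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `enclosed_area` and `ellipse_boundary`, the sample semi-axes a = 3, b = 2, and the number of sample points are all illustrative assumptions. It replaces the boundary curve by an inscribed polygon, on each edge of which the line integral can be evaluated exactly.

```python
import math

def enclosed_area(points):
    """Approximate (1/2) * closed integral of (x dy - y dx) over a polygon.

    `points` lists the (x, y) vertices in counterclockwise order; the last
    vertex is joined back to the first.
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        # Along a straight edge, the integral of x dy - y dx is x0*y1 - x1*y0.
        total += x0 * y1 - x1 * y0
    return 0.5 * total

def ellipse_boundary(a, b, n):
    """Sample n points of the ellipse (a cos t, b sin t), counterclockwise."""
    return [(a * math.cos(2 * math.pi * k / n), b * math.sin(2 * math.pi * k / n))
            for k in range(n)]

# Hypothetical sample values: semi-axes a = 3, b = 2.
area = enclosed_area(ellipse_boundary(3.0, 2.0, 100_000))
print(area, math.pi * 3.0 * 2.0)
```

For a closed curve traversed counterclockwise the two printed numbers should agree closely; reversing the orientation of the boundary flips the sign, as expected for an integral of a differential form.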
P1: GPA Final Pages
Encyclopedia of Physical Science and Technology EN009B-400 July 6, 2001 20:51

Manifold Geometry 65

But $dx \wedge dy$ is the usual volume form for $E^2$, so $\int_M dx \wedge dy$ is just the area enclosed by the curve $\partial M$. In particular, if $\partial M$ is the ellipse L with
\[
L = \{(x = a\cos\theta,\ y = b\sin\theta) \in \mathbb{R}^2 \mid 0 \le \theta \le 2\pi\}
\]
then
\[
x\, dy - y\, dx = (ab\cos^2\theta + ba\sin^2\theta)\, d\theta = ab\, d\theta
\]
and it follows that the area of the ellipse is
\[
\int_M dx \wedge dy = \frac{1}{2}\int_{\partial M} ab\, d\theta = \pi ab
\]
Evidently the volume form on $E^3$ is $dx \wedge dy \wedge dz$ and its restriction to the two-dimensional submanifold $E^2$ which is the (x, y) plane is simply $dx \wedge dy$. Now, the ellipse itself is a one-dimensional submanifold of $E^2$ and, like the circle, it supports a nowhere-zero 1-form $d\theta$. The standard volume form on the ellipse L is actually $r\, d\theta$, where $r^2 = x^2 + y^2$, and therefore the circumference of the ellipse is
\[
\int_L r\, d\theta = \int_0^{2\pi} (a^2\cos^2\theta + b^2\sin^2\theta)^{1/2}\, d\theta = a\int_0^{2\pi} (1 - e^2\sin^2\theta)^{1/2}\, d\theta
\]
the familiar elliptic integral with e the eccentricity $(1 - b^2/a^2)^{1/2}$.

B. Metric Tensors

A metric tensor on an n-manifold M is an element $g \in \Upsilon^0_2 M$, that is, a section of $T^0_2 M$
\[
g: M \to T^0_2 M : x \mapsto g_x \in T_xM^* \otimes T_xM^*
\]
where $g_x: T_xM \times T_xM \to \mathbb{R} : (u, v) \mapsto g_x(u, v)$ satisfies the conditions

1. Symmetry: $g_x(u, v) = g_x(v, u)$.
2. Nondegeneracy: if $g_x(u, v) = 0$ for all $v \in T_xM$ then $u = 0$.

If $(\partial_1, \ldots, \partial_n)_x$ is a basis for $T_xM$ then its dual, $(dx^1, \ldots, dx^n)$, is a basis for $(T_xM)^*$ and so $g_x$ is expressible in the form
\[
g_x = [g_{ij}\, dx^i \otimes dx^j]_x
\]
Then the symmetry condition imposes symmetry on the matrix of numbers $(g_{ij})_x$ and nondegeneracy makes its determinant nonzero.

Usually we drop the subscript x indicating the variability of g and its coordinate expression with position in M.

EXAMPLE. Take $M = E^2$ so each tangent space is isomorphic to $\mathbb{R}^2$ as the set of directions at the point. The simplest choice of metric tensor is given by the usual inner product
\[
g_x: T_xM \times T_xM \to \mathbb{R} : (u^i\partial_i, v^j\partial_j) \mapsto u^1v^1 + u^2v^2
\]
and so here $g = \delta_{ij}\, dx^i \otimes dx^j$. This induces on $E^2$ all of the usual Euclidean geometry, including the usual volume form; it is easily generalized to $E^n$.

Another metric tensor on $E^2$ is given by
\[
\eta_x: T_xM \times T_xM \to \mathbb{R} : (u^i\partial_i, v^j\partial_j) \mapsto u^1v^1 - u^2v^2
\]
and so this is expressed as
\[
\eta = \eta_{ij}\, dx^i \otimes dx^j \quad \text{where} \quad (\eta_{ij}) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}
\]
This induces Minkowski geometry on $E^2$, as used in relativity.

From the example we see that the Euclidean metric tensor satisfies a stronger condition than 2. It is

2′. Positive definiteness: $g_x(u, u) = 0$ if and only if $u = 0$.

This imposes on the matrix $(g_{ij})_x$ that its eigenvalues all be of one sign.

A metric tensor satisfying condition 2′ is called a Riemannian metric; one satisfying only 2 is called an indefinite metric or a pseudo-Riemannian metric. Riemannian metric geometry includes all of the usual curved surfaces and much else; pseudo-Riemannian geometry includes space-time structures where angles and lengths are interpreted very differently from those of common experience.

Since the simplest possible choice of metric tensor gives the whole of standard Euclidean geometry on $E^n$, we can expect that the structure implied by any metric tensor is very rich indeed. This is the case, but before investigating it we should consider existence questions. Now, an n-manifold consists of copies of $E^n$ smoothly joined together and we have a standard metric tensor on each patch of $E^n$, so we are once again faced with patching them together in a consistent way. The same trick works as for volume forms: we use a partition of unity to share out fairly the contributions of each overlapping copy of $E^n$. Thus, all of our manifolds admit a Riemannian metric. However, whereas the module $\Lambda^n M$ of volume forms is one-dimensional and so any two differ only by a scalar function, the module $\Upsilon^0_2 M$ of sections of $T^0_2 M$ is $n^2$-dimensional and no particular Riemannian metric is

distinguished in general. Of course, if M is itself a submanifold of some manifold $\bar{M}$ (especially $E^m$ for some $m \ge n$) that has a metric tensor $\bar{g}$, then the restriction of $\bar{g}$ to M is a metric tensor on M. This is an important source of examples.

We list next some of the important implications of an n-manifold M having a metric tensor g.

1. It induces a measure of length for a curve
\[
c: [0, 1] \to M : t \mapsto c(t)
\]
The tangent vector to c at t is $\dot{c}(t) \in T_{c(t)}M$ and we define the length of c to be
\[
\int_{t=0}^{t=1} |g(\dot{c}(t), \dot{c}(t))|^{1/2}\, dt
\]
In coordinates, if $c(t) \in U$ for some chart $(U, \phi)$ then $\phi(c(t)) = (c^1(t), \ldots, c^n(t))$ and
\[
\dot{c}(t) = \dot{c}^i(t)\, \partial_i \quad \text{with} \quad \dot{c}^i = dc^i/dt.
\]
Hence $g(\dot{c}, \dot{c}) = g_{ij}\, \dot{c}^i \dot{c}^j$.

2. There is a dual metric tensor $g^* \in \Upsilon^2_0 M$ with
\[
g^*_x: T_xM^* \times T_xM^* \to \mathbb{R} : (\alpha, \beta) \mapsto g^*_x(\alpha, \beta)
\]
with coordinate expression
\[
g^*_x = [g^{ij}\, \partial_i \otimes \partial_j]_x \quad \text{and} \quad (g^{ij}) = (g_{ij})^{-1}
\]
as matrices, well defined since $\det g_{ij} \ne 0$.

3. There is an isomorphism $\Upsilon^0_1 M \cong \Upsilon^1_0 M$ of tangent and cotangent fields:
\[
g^{\flat}_x: T_xM \to T_xM^* : v \mapsto v^* \quad \text{with} \quad v^*(u) = g_x(u, v)
\]
which appears in coordinates as
\[
g^{\flat}_x: v^i\partial_i \mapsto g_{ij}\, v^i\, dx^j
\]
with inverse
\[
g^{\#}_x: w_i\, dx^i \mapsto g^{ij}\, w_i\, \partial_j
\]

4. These isomorphisms in 3 can be tensor-producted to give isomorphisms
\[
\Upsilon^r_s M \cong \Upsilon^k_m M \quad \text{for all } r + s = k + m
\]
Consequently, we can switch among tensor fields to suit our convenience when we have a metric tensor.

5. If M is an orientable manifold then a choice of an oriented atlas (Jacobian everywhere positive on overlaps) yields a unique volume form $\mu_g$ determined locally by
\[
\mu_g = |\det g_{ij}|^{1/2}\, dx^1 \wedge \cdots \wedge dx^n
\]
This is nowhere zero since nondegeneracy ensures $\det g_{ij} \ne 0$. Observe that this does agree with the standard volume form on $E^n$, where $\det g_{ij} = 1$ everywhere.

6. If M is oriented then there is a Hodge dual isomorphism on differential forms:
\[
{*}: \Lambda^r M \to \Lambda^{n-r} M : w \mapsto {*}w
\]
To explain the construction of this isomorphism suppose that $(e^1, \ldots, e^n)_x$ is any ordered basis in the orientation for $T_xM^*$ and ordered bases for $\Lambda^r M$ and $\Lambda^{n-r} M$ are given locally by
\[
\{e^{i_1} \wedge e^{i_2} \wedge \cdots \wedge e^{i_r} \mid 1 \le i_1 < \cdots < i_r \le n\}
\]
\[
\{e^{j_1} \wedge e^{j_2} \wedge \cdots \wedge e^{j_{n-r}} \mid 1 \le j_1 < \cdots < j_{n-r} \le n\}
\]
Then with $g_{ij} = g(e_i, e_j)$
\[
{*}\bigl(w_{(i_1 \cdots i_r)}\, e^{i_1} \wedge \cdots \wedge e^{i_r}\bigr) = |\det g_{ij}|^{1/2}\, w^*_{(j_1 \cdots j_{n-r})}\, e^{j_1} \wedge \cdots \wedge e^{j_{n-r}}
\]
where
\[
w^*_{j_1 \cdots j_{n-r}} = g^{k_1 i_1} g^{k_2 i_2} \cdots g^{k_r i_r}\, w_{i_1 \cdots i_r}\, \operatorname{sgn}(k \to i)
\]
and $\operatorname{sgn}(k \to i)$ is the sign of the permutation
\[
(i_1, \ldots, i_r, j_1, \ldots, j_{n-r}) \to (k_1, \ldots, k_r, j_1, \ldots, j_{n-r})
\]
Intuitively, ${*}w$ is the "complement" of w in the volume form $\mu_g$ determined by g; in particular, we find
\[
{*}1 = \mu_g \quad (1 \text{ is the constant unit real function on } M,\ 1 \in \Lambda^0 M)
\]
\[
{*}\mu_g = (-1)^{\nu} \quad (\nu \text{ is the number of negative eigenvalues of } g)
\]
The appearance of ${*}$ is simplest if we choose the oriented base $(e^1, \ldots, e^n)$ for $T_xM^*$ to be orthonormal, that is, $g^*(e^i, e^j) = \pm\delta^{ij}$ so $|\det g_{ij}| = 1$. In this case
\[
{*}(e^{i_1} \wedge \cdots \wedge e^{i_r}) = (e^{j_1} \wedge \cdots \wedge e^{j_{n-r}})
\]
for any even permutation $(i_1, \ldots, i_r, j_1, \ldots, j_{n-r})$ of $(1, \ldots, n)$, and we extend the isomorphism to all r-forms by requiring it to be linear.

Some justification for the presence of $|\det g_{ij}|$ in the volume form can be seen by recalling that $(g_{ij})$ is the local $(n \times n)$ matrix expression of a linear map $T_xM \to T_xM^*$. For such matrix maps from $\mathbb{R}^n$ to $\mathbb{R}^n$ the determinant measures precisely the factor of volume change under the map: a unit n-cube is sent to an n-box of volume $|\det g_{ij}|$.

EXAMPLE. The equations of the electromagnetic field on a space-time manifold can be very neatly expressed in terms of the electromagnetic 2-form $F \in \Lambda^2 M$ and dim M = 4.

Locally, for a basis of 1-form fields $(\omega^i)$,
\[
F = F_{(ij)}\, \omega^i \wedge \omega^j
\]
If we suppose that the $\omega^i$ are mutually orthogonal unit fields, then the metric tensor components $(g_{ij})$ are the eigenvalues of g lying along the diagonal.

The Hodge dual isomorphism gives ${*}(\omega^i \wedge \omega^j) = \omega^m \wedge \omega^k$, where $(i, j, m, k)$ is an even permutation of (1, 2, 3, 4). So ${*}(\omega^1 \wedge \omega^2) = \omega^3 \wedge \omega^4$ and so forth. Similarly, ${*}(\omega^1 \wedge \omega^2 \wedge \omega^3) = \omega^4$ and so forth.

Physical theory leads to the following equations for F in regions that contain negligible amounts of matter:
\[
dF = 0 \qquad {*}d{*}F = J \qquad (\dagger)
\]
where J is the current density. These equations correspond to the usual Maxwell's equations through the vector calculus correspondences:
\[
{*}d \equiv \text{curl} \quad \text{and} \quad {*}d{*} \equiv \text{divergence}
\]
Conservation of charge is expressed by
\[
\operatorname{div} J = 0
\]
This is automatically satisfied when there is negligible matter since it becomes
\[
{*}d{*}\,{*}d{*}F = 0 \quad \text{because } d^2 = 0
\]
However, in the presence of matter, Eq. (†) becomes
\[
dA = 0 \qquad {*}d{*}B = J \quad \text{for some } A, B \in \Lambda^2 M
\]
with A and B related by some transformation, perhaps linear. Once again, $d^2 = 0$ ensures conservation of charge.

Locally, $J = \rho J_i\, \omega^i$, where $\rho$ is the charge density. Then over a compact spacelike submanifold S of M we can measure the total charge $Q_N$ and find that
\[
Q_N = \int_S \rho\, \omega^1 \wedge \omega^2 \wedge \omega^3 = \int_S {*}J = \int_S d{*}B = \int_{\partial S} {*}B
\]

VI. CONNECTION GEOMETRY

Given a tangent vector field $v: M \to TM$ it is quite likely that we shall be interested in measuring its rate of change over M. Now, v is a smooth map between smooth manifolds so it induces the derivative
\[
Dv: TM \to T(TM)
\]
between the corresponding tangent bundles. Unfortunately, this is not useful as a measure of the rate of change of v on M because Dv takes values in T(TM), not in TM. A connection gets us from T(TM) to TM in orderly fashion.

A. Linear Connection

There are more general connections than the one that we shall use; we are interested in those that have particular significance for the structures we already have on M.

A (linear) connection on M is a splitting of the vector bundle T(TM) into a direct sum of a horizontal part HM and a vertical part VM, with HM isomorphic to TM as vector bundles. The motivation is clear: given any vector $u \in TM$ and field $v: M \to TM$ then
\[
Dv(u) \in T(TM) \cong HM \oplus VM
\]
and we interpret the rate of change of v in the direction u as the projection of the HM part onto TM. Equivalently, we define a connection on M to be a map $\nabla$ that assigns to each $u \in TM$ and $v \in \Upsilon^1_0 M$ a vector $\nabla_u v$ in $T_xM$ (where $u \in T_xM$) such that

a. $\nabla_u$ is linear over $u \in T_xM$.
b. $\nabla_u v$ is linear over $v \in \Upsilon^1_0 M$.
c. $\nabla_u (f v) = u(f)\, v(x) + f(x)\, \nabla_u v$ if $f: M \to \mathbb{R}$.
d. If $w, v \in \Upsilon^1_0 M$ then so is $\nabla_w v: x \mapsto \nabla_{w(x)} v$.

We view $\nabla_u v$ as the rate of change of the field v in the direction of the vector u at $x \in M$ and call it the covariant derivative of v with respect to u.

By exploiting the linearity properties we can easily obtain coordinate expressions. Given a chart $(U, \phi)$ with coordinates $(x_1, \ldots, x_n)$ about x and basis fields $(\partial_1, \ldots, \partial_n)$ for tangent vector fields about x, any $u \in T_xM$ is of the form $u = (u^i\partial_i)_x$ and near x any vector field w is of the form $w = w^j\partial_j \in \Upsilon^1_0 U$. Now we must have
\[
\nabla_u w = \nabla_{u^i\partial_i}\, w^j\partial_j \in T_xM
\]
\[
= u^i\, \nabla_{\partial_i}\, w^j\partial_j \quad \text{by a}
\]
\[
= u^i \bigl[(\partial_i w^j)\, \partial_j + w^j\, \nabla_{\partial_i}\partial_j\bigr]_x \quad \text{by b and c}
\]
Hence $\nabla$ is completely determined on U by specification of $\nabla_{\partial_i}\partial_j \in \Upsilon^1_0 U$. But any such field is of the form
\[
\nabla_{\partial_i}\partial_j = \Gamma^k_{ij}\, \partial_k \quad \text{for some } \Gamma^k_{ij}: U \to \mathbb{R}
\]
So locally $\nabla$ appears as an $n^3$ array of smooth real functions on U. These functions are traditionally called the Christoffel symbols of the connection, and they will change smoothly from chart to chart. Substitution of $[(\partial y^m/\partial x^i)\, \hat{\partial}_m]$ for $(\partial_i)$ through a change of chart coordinates from $(x_1, \ldots, x_n)$ to $(y_1, \ldots, y_n)$ can be traced through the above steps to obtain expressions for
\[
\nabla_{\hat{\partial}_i}\hat{\partial}_j = \hat{\Gamma}^k_{ij}\, \hat{\partial}_k
\]

Then the $\hat{\Gamma}^k_{ij}$ can be related to the $\Gamma^k_{ij}$ through the partial derivatives $(\partial y^m/\partial x^k)$. This is one way to see that the Christoffel symbols are not components of any tensor field.

We easily extend $\nabla$ to give the covariant derivative of arbitrary tensor fields by defining for $u \in T_xM$

a. $\nabla_u f = u(f)$ for $f \in \Upsilon^0_0 M$.
b. $\nabla_u (dx^i \otimes \partial_i) = 0$ for mutually dual basis fields.
c. $\nabla_u (w \otimes v) = (\nabla_u w) \otimes v + w \otimes \nabla_u v$.

From b and c with $u = \partial_j$ we get, for example,
\[
\nabla_{\partial_j}(dx^i \otimes \partial_i) = \nabla_{\partial_j} dx^i \otimes \partial_i + dx^i \otimes \nabla_{\partial_j}\partial_i = 0
\]
\[
0 = \nabla_{\partial_j} dx^k \otimes \partial_k + \Gamma^k_{ji}\, dx^i \otimes \partial_k
\]
therefore
\[
\nabla_{\partial_j} dx^k = -\Gamma^k_{ji}\, dx^i
\]
This defines a family of 1-forms called the connection 1-forms.

B. Parallel Transport

Given a curve $c: [0, 1] \to M$, its tangent vector field is denoted $\dot{c}: [0, 1] \to TM$ and in components $\dot{c} = \dot{c}^j\partial_j$. If $w \in \Upsilon^r_s M$ then we say that w is parallel along c if
\[
\nabla_{\dot{c}}\, w = 0
\]
If $w = w^k\partial_k \in \Upsilon^1_0 M$ is parallel along c with tangent vector field $\dot{c}^j\partial_j = d/dt$, then
\[
\nabla_{\dot{c}}\, w = \nabla_{\dot{c}^j\partial_j}\, w^k\partial_k = 0 \in \Upsilon^1_0 M
\]
so $(\dot{c}^j\partial_j w^k)\, \partial_k + \dot{c}^j w^i\, \nabla_{\partial_j}\partial_i = 0$. Hence
\[
\frac{dw^k}{dt} + \frac{dc^j}{dt}\, w^i\, \Gamma^k_{ji} = 0 \in \mathbb{R}, \quad k = 1, 2, \ldots, n.
\]
This system of n linear differential equations, which represents $\nabla_{\dot{c}}\, w = 0$, has a unique solution
\[
w: [0, 1] \to TM : t \mapsto w_t
\]
for given initial value $w(0) = w_0 \in T_{c(0)}M$. Therefore $\nabla$ and c have defined a map
\[
\tau_t: T_{c(0)}M \to T_{c(t)}M : w_0 \mapsto w_t
\]
This turns out to be a very good map indeed: an isomorphism of vector spaces; it is called parallel transport and is available for tensor spaces also. Moreover, parallel transport allows us to express the covariant derivative as a limit of a difference, because with it we can bring all vectors and tensors along c back to c(0). Let $u \in T_xM$ be any vector, $w \in \Upsilon^r_s M$, and c any curve with $c(0) = x$, $\dot{c}(0) = u$; then
\[
(\nabla_u w)_x = \lim_{t \to 0} \frac{\tau_t^{-1}(w(c(t))) - w(c(0))}{t}
\]

EXAMPLE. Consider $E^1$ with $\nabla_{\partial_1}\partial_1 = \lambda\, \partial_1$, for some constant $\lambda \in \mathbb{R}$, with respect to the standard chart. Take $c: [0, 1] \to E^1 : t \mapsto t$, so $\dot{c}(t) = \partial_1$. Then $\tau_t: T_{c(0)}E^1 \to T_{c(t)}E^1 : \alpha_0\, \partial_1 \mapsto \alpha(t)\, \partial_1$ satisfies
\[
\frac{d\alpha}{dt} + \alpha\lambda = 0 \quad \text{so} \quad \alpha(t) = \alpha_0\, e^{-\lambda t}
\]
Evidently $\lambda = 0$ corresponds to the usual connection since we do not usually alter the length of vectors when we move them on $E^1$. Any $\lambda \ne 0$ determines a non-Euclidean parallelism structure on $E^1$. A similar connection could be put on $S^1$.

EXAMPLE. To find a local expression for the parallel transport isomorphism. We consider $M = E^2$ with the standard chart and connection $\nabla$ having constant Christoffel symbols
\[
\Gamma^1_{12} = \Gamma^1_{21} = 1 \quad \text{and all other components zero}
\]
Given the curve $c: [0, 1] \to E^2 : t \mapsto (t, t^2)$ we find the parallel vector field
\[
w: [0, 1] \to TE^2 : t \mapsto f(t)\, \partial_1 + g(t)\, \partial_2
\]
for two independent initial tangent vectors:

a. $w(0) = \partial_1$
b. $w(0) = \partial_2$

The parallel transport condition is $\nabla_{\dot{c}}\, w = 0$ and we are given $\dot{c}(t) = \partial_1 + 2t\, \partial_2$. Substituting
\[
\dot{f}\, \partial_1 + \dot{g}\, \partial_2 + 2t f\, \Gamma^i_{21}\, \partial_i + g\, \Gamma^i_{12}\, \partial_i = 0
\]
\[
(\dot{f} + 2t f + g)\, \partial_1 + \dot{g}\, \partial_2 = 0
\]
so $g(t) = g(0)$.

We solve $\dot{f} + 2t f + g = 0$ for constant g to give:

Case (a): $f(t) = e^{-t^2}$, $g(t) = g(0) = 0$.
Case (b): $f(t) = -e^{-t^2}\int_0^t e^{x^2}\, dx = k(t)$, say, and $g(t) = 1$.

Then parallel transport along c is the isomorphism
\[
\tau_t: T_{c(0)}E^2 \to T_{c(t)}E^2 : \alpha\, \partial_1 + \beta\, \partial_2 \mapsto \bigl(\alpha e^{-t^2} + \beta k(t)\bigr)\, \partial_1 + \beta\, \partial_2
\]
or in matrix form
\[
\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \mapsto \begin{pmatrix} e^{-t^2} & k(t) \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
\]
Evidently $\nabla$ is not compatible with the usual metric tensor on $E^2$ because parallel transport is not an isometry, for instance,
\[
\tau_t(\partial_1) = e^{-t^2}\, \partial_1
\]
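The parallel transport example on $E^2$ (Christoffel symbols $\Gamma^1_{12} = \Gamma^1_{21} = 1$, curve $c(t) = (t, t^2)$) can be checked by integrating the transport equations $\dot{f} + 2tf + g = 0$, $\dot{g} = 0$ numerically. The sketch below is illustrative rather than part of the original text: the function names, the classical fourth-order Runge-Kutta scheme, the step counts, and the use of Simpson's rule to evaluate k(t) are all assumptions chosen for simplicity.

```python
import math

def rhs(t, w):
    """Right-hand side of the transport equations f' = -2 t f - g, g' = 0."""
    f, g = w
    return (-2.0 * t * f - g, 0.0)

def parallel_transport(w0, t_end, steps=1000):
    """Transport w0 = (f(0), g(0)) along c(t) = (t, t^2) with classical RK4."""
    h = t_end / steps
    t, w = 0.0, (float(w0[0]), float(w0[1]))
    for _ in range(steps):
        k1 = rhs(t, w)
        k2 = rhs(t + h / 2, (w[0] + h / 2 * k1[0], w[1] + h / 2 * k1[1]))
        k3 = rhs(t + h / 2, (w[0] + h / 2 * k2[0], w[1] + h / 2 * k2[1]))
        k4 = rhs(t + h, (w[0] + h * k3[0], w[1] + h * k3[1]))
        w = (w[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
             w[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))
        t += h
    return w

def k_closed_form(t, n=2000):
    """k(t) = -exp(-t^2) * integral_0^t exp(x^2) dx, by composite Simpson (n even)."""
    h = t / n
    s = 1.0 + math.exp(t * t)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * math.exp((i * h) ** 2)
    return -math.exp(-t * t) * s * h / 3

case_a = parallel_transport((1.0, 0.0), 1.0)  # initial vector d_1
case_b = parallel_transport((0.0, 1.0), 1.0)  # initial vector d_2
print(case_a, math.exp(-1.0))
print(case_b, k_closed_form(1.0))
```

The two transported vectors are the columns of the matrix form of $\tau_t$ at t = 1; the first has length $e^{-1}$ rather than 1, which is the failure of isometry noted in the text.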

C. Geodesics

It is natural to turn the equation for parallel transport onto the curve itself: solve $\nabla_{\dot{c}}\, \dot{c} = 0$ for c. Such a curve is called a geodesic for the connection $\nabla$ because on the surface of the earth, viewed as a sphere with the usual connection, such curves are great circles and do "divide the earth." On an n-manifold we can always find geodesic curves going in all directions from a point but in general we may not be able to make them go very far. Accordingly, we usually define a geodesic through $x \in M$ with direction $u_0 \in T_xM$ in a manifold M with connection $\nabla$ as any smooth $c: (-\varepsilon, \varepsilon) \to M$, for positive $\varepsilon$, with $c(0) = x$, $\dot{c}(0) = u_0$, and $\nabla_{\dot{c}(t)}\, \dot{c} = 0$ for all $t \in (-\varepsilon, \varepsilon)$. Clearly, if M has a boundary then a geodesic may not be extensible after it meets $\partial M$. Another type of inextensibility can occur if M is incomplete in some sense.

EXAMPLE. Let $M = E^2 \setminus \{(0, 0)\}$, the punctured plane with standard coordinates. Then the Euclidean connection has zero Christoffel symbols and the equation of a geodesic becomes
\[
\nabla_{\dot{c}}\, \dot{c} = \nabla_{\dot{c}^i\partial_i}\, \dot{c}^j\partial_j = \ddot{c}^k\, \partial_k + \dot{c}^i \dot{c}^j\, \Gamma^k_{ij}\, \partial_k = 0
\]
so $\ddot{c}^k = 0$, k = 1, 2. Hence the geodesics are straight lines, as we expect for a submanifold of the Euclidean plane, but they cannot pass through the origin. Thus, for example, the geodesic
\[
c: (-\varepsilon, \varepsilon) \to M : t \mapsto (2 - 2t,\ 1 - t)
\]
which begins at (2, 1) in direction $-2\hat{\imath} - \hat{\jmath}$, is only defined for $\varepsilon \le 1$.

EXAMPLE. To find geodesics (corresponding to free particle trajectories) in Schwarzschild space-time. Here we take $M = \mathbb{R} \times (E^3 \setminus B)$ with $\mathbb{R}$ giving the time coordinate t and $E^3 \setminus B$ being the Euclidean space outside a ball of some radius k > 0 centered on the origin. We give $E^3 \setminus B$ the usual spherical polar coordinates $(r, \theta, \phi)$ and view the region B as containing some spherically symmetric mass, like a star or planet. In general relativity, free particles falling under gravity follow geodesics. The appropriate metric tensor for this physical situation has components
\[
(g_{ij}) = \begin{pmatrix} 1 - 2m/r & 0 & 0 & 0 \\ 0 & -(1 - 2m/r)^{-1} & 0 & 0 \\ 0 & 0 & -r^2 & 0 \\ 0 & 0 & 0 & -r^2\sin^2\theta \end{pmatrix} \quad \text{for } r > k > 2m
\]
where m is the mass of the material contained in B. Denoting the coordinates $(t, r, \theta, \phi)$ by $(x_0, x_1, x_2, x_3)$, we find the metric connection $\nabla$ has only the following nonzero Christoffel symbols:
\[
\Gamma^1_{11} = -\Gamma^0_{10} = -(m/r^2)(1 - 2m/r)^{-1}
\]
\[
\Gamma^2_{12} = \Gamma^3_{13} = 1/r, \qquad \Gamma^1_{33} = \Gamma^1_{22}\sin^2\theta = -r(1 - 2m/r)\sin^2\theta
\]
\[
\Gamma^1_{00} = (m/r^2)(1 - 2m/r), \qquad \Gamma^2_{33} = -\sin\theta\cos\theta
\]
\[
\Gamma^3_{32} = \cot\theta, \quad \text{and} \quad \Gamma^k_{ij} = \Gamma^k_{ji}
\]
Geodesic curves satisfy $\nabla_{\dot{c}}\, \dot{c} = 0$ and we consider two cases, each with parameter s given by $g(\dot{c}, \dot{c}) = 1$.

1. Circular geodesics: $c(s) = (t(s), r(s), \theta(s), \phi(s))$ with $\dot{r}(s) = 0$. We shall take the plane of one of these circular orbits to be $\theta(s) = \pi/2$. Denote differentiation with respect to parameter s by a dot; then we expand $\nabla_{\dot{c}}\, \dot{c} = 0$ to give the system of equations
\[
\ddot{t} = 0 \;\Rightarrow\; \dot{t} = \text{const}
\]
\[
\Gamma^1_{33}(\dot{\phi})^2 + \Gamma^1_{00}(\dot{t})^2 = 0 \;\Rightarrow\; (1 - 2m/r)\bigl((m/r^2)(\dot{t})^2 - r(\dot{\phi})^2\bigr) = 0 \;\Rightarrow\; (\dot{t})^2 = (r^3/m)(\dot{\phi})^2
\]
\[
\ddot{\phi} = 0 \;\Rightarrow\; \dot{\phi} = \text{const} = 2\pi/\text{period} \;\Rightarrow\; \text{period } T = 2\pi/\dot{\phi}
\]
From $g(\dot{c}, \dot{c}) = 1$ we find
\[
(\dot{t})^2(1 - 2m/r) - r^2(\dot{\phi})^2 = 1
\]
Substitution above gives circular orbits with periods
\[
T = 2\pi r\,(r/m - 3)^{1/2} \qquad r > 3m
\]
In time units, for the sun we have $2m = 10^{-5}$ sec and $T = 1$ year $\doteq 10^7\pi$ sec, for the earth orbit, so we deduce that the implied radius is $r \doteq 500$ sec, which is what we observe. Similar data can be checked for the moon or other satellites orbiting the earth. For the earth $2m \doteq 3 \times 10^{-11}$ sec.

2. Radial geodesics: $c(s) = (t(s), r(s), \theta(s), \phi(s))$ with $\theta$, $\phi$ constant. The geodesic equation reduces on $\theta = \pi/2$ to
\[
\ddot{r} - (m/r^2)(1 - 2m/r)^{-1}(\dot{r})^2 + (m/r^2)(1 - 2m/r)(\dot{t})^2 = 0
\]
and we deduce $\ddot{r} = -m/r^2$. Now $\ddot{r}$ measures precisely the acceleration due to gravity at distance r from the center of a spherically symmetric mass m, in agreement to first approximation with Newton's theory. We find, for example, in time units, that


on the earth: $2m \doteq 3 \times 10^{-11}$ sec, $r \doteq 2.1 \times 10^{-2}$ sec, $\ddot{r} \doteq -\tfrac{3}{8} \times 10^{-7}$
on the moon: $2m \doteq 2.5 \times 10^{-13}$ sec, $r \doteq 5.4 \times 10^{-3}$ sec, $\ddot{r} \doteq -\tfrac{14}{3} \times 10^{-9}$
on the sun: $2m \doteq 10^{-5}$ sec, $r \doteq 2$ sec, $\ddot{r} \doteq -\tfrac{1}{8} \times 10^{-5}$

A manifold with connection is called geodesically complete if all of its geodesics can be extended to infinite parameter values (or until they meet the boundary, if M is a manifold with boundary). It is known that, with their standard connections induced from being embedded in Euclidean space, the circle, sphere, and torus are all geodesically complete. Observe that on such spaces the extension of a geodesic to infinite parameter values may involve it in repeatedly covering the same points, for some geodesics become closed curves.

As might be expected, from the fact that some sufficiently small region about $x \in M$ is diffeomorphic to $E^n$, we can get a geodesic going in any direction. That is, for all sufficiently small initial tangent vectors $\dot{c}(0) = u_0 \in T_xM$, there is a geodesic through $c(0) = x$. To make "sufficiently small" precise we need a norm in $T_xM$, but any norm will do equally well. Define
\[
S_x = \{u_0 \in T_xM \mid \text{there is a geodesic } c: [0, 1] \to M \text{ with } c(0) = x \text{ and } \dot{c}(0) = u_0\}
\]
Then there is a nice map called the exponential map at x
\[
\exp_x: S_x \to M : u_0 \mapsto c(1) \quad \text{where c is a geodesic with } c(0) = x \text{ and } \dot{c}(0) = u_0
\]
It turns out that at every point x there is some neighborhood of the zero vector in $T_xM$ on which $\exp_x$ is a diffeomorphism onto its image. If the connection is complete then $\exp_x$ is defined on all of $T_xM$ for all $x \in M$.

D. Metric Connection

We saw that any Euclidean space has a connection, the simplest possible choice, and we implied that any subset of $E^n$ that is a manifold will inherit a unique connection. Now this is a consequence of the following result:

Every metric tensor g determines a unique connection $\nabla^g$, called the metric connection or Levi-Civita connection. There is one obvious condition that it should satisfy:

a. $\nabla^g_u\, g = 0$ for all $u \in \Upsilon^1_0 M$.

This is called being compatible with the metric and it effectively says that the covariant derivative will always view the metric tensor (and its dual $g^*$) as a constant: $\nabla^g$ factors out any variability introduced by peculiarities of g. In particular, it will make parallel transport an isometry as well as an isomorphism and covariant derivatives will commute with the isomorphisms $g^{\#}$ and $g^{\flat}$ induced by g. The second condition is less obvious at first:

b. $\nabla^g_u v - \nabla^g_v u = [u, v]$ for all $u, v \in \Upsilon^1_0 M$.

This is called being symmetric because it implies that with respect to any coordinates the Christoffel symbols $\Gamma^k_{ij}$ are symmetric in ij.

Conditions (a) and (b) are sufficient to select a unique connection for a manifold with metric tensor. They give a system of differential equations that locally allows the Christoffel symbols to be calculated from partial derivatives of the components of g. We find that
\[
\partial_k\, g(\partial_i, \partial_j) = g\bigl(\nabla^g_{\partial_k}\partial_i,\ \partial_j\bigr) + g\bigl(\partial_i,\ \nabla^g_{\partial_k}\partial_j\bigr)
\]
from a so
\[
\partial_k g_{ij} = \Gamma^m_{ki}\, g_{mj} + \Gamma^m_{kj}\, g_{im},
\]
\[
\nabla^g_{\partial_i}\partial_j - \nabla^g_{\partial_j}\partial_i = [\partial_i, \partial_j] = 0
\]
from b so $\Gamma^k_{ij} = \Gamma^k_{ji}$. Hence we use the inverse matrix of $(g_{rs})$ and symmetry to give
\[
\Gamma^k_{ij} = \tfrac{1}{2}\, g^{km}\,(\partial_i g_{jm} + \partial_j g_{im} - \partial_m g_{ij})
\]
It can be shown that every manifold can be given a Riemannian metric that is geodesically complete. Evidently, the fact that $(g_{ij}) = (\delta_{ij})$ everywhere on $E^n$ immediately gives zero Christoffel symbols in the standard coordinates; however, if we use other than rectilinear coordinates, then the corresponding metric tensor components will be nonconstant and some nonzero Christoffel symbols will arise. The idea is clear: if we wish to keep Euclidean geometry but describe it with curvilinear coordinates, then we shall expect any components in these coordinates (of vectors or tensors) to alter as we parallel transport them.

EXAMPLE. To find a metric connection and equations for a parallel vector field along a given curve. We take $M = (0, 2\pi) \times S^1$, an open cylinder with identity coordinate x on the interval $(0, 2\pi)$ and angular coordinate $\theta$ on the circle $S^1$. Consider the expression in these coordinates of the pseudo-Riemannian metric tensor
\[
(g_{ij}) = \begin{pmatrix} -(1 - \cos x)^2 & 0 \\ 0 & (1 - \cos x)^2 \end{pmatrix} \quad \text{at } (x, \theta) \in M
\]
From symmetry and compatibility of the induced metric connection $\nabla$ with Christoffel symbols $(\Gamma^m_{ij})$ we have

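\[
\Gamma^m_{ij} = \tfrac{1}{2}\, g^{mk}\,(\partial_i g_{jk} + \partial_j g_{ik} - \partial_k g_{ij}) = \tfrac{1}{2}\, g^{mk}\, \Gamma_{kij}, \text{ say}
\]
Substitution gives
\[
(\Gamma_{1ij}) = \begin{pmatrix} -2\sin x\,(1 - \cos x) & 0 \\ 0 & -2\sin x\,(1 - \cos x) \end{pmatrix}
\]
\[
(\Gamma_{2ij}) = \begin{pmatrix} 0 & 2\sin x\,(1 - \cos x) \\ 2\sin x\,(1 - \cos x) & 0 \end{pmatrix}
\]
Hence
\[
(\Gamma^1_{ij}) = \begin{pmatrix} \sin x/(1 - \cos x) & 0 \\ 0 & \sin x/(1 - \cos x) \end{pmatrix}
\]
\[
(\Gamma^2_{ij}) = \begin{pmatrix} 0 & \sin x/(1 - \cos x) \\ \sin x/(1 - \cos x) & 0 \end{pmatrix}
\]
For the vertical-going curve $c: (0, 2\pi) \to M : t \mapsto (t, 0)$ a parallel vector field is $v: (0, 2\pi) \to TM : t \mapsto f(t)\, \partial_x + h(t)\, \partial_\theta$ where $\nabla_{\dot{c}}\, v = 0$. This differential equation becomes
\[
\partial_x f + f\, \Gamma^1_{11} = 0
\]
\[
\partial_x h + h\, \Gamma^2_{12} = 0
\]
So suitable f and h must satisfy
\[
\partial_x f = -f\, \frac{\sin x}{1 - \cos x}
\]
and
\[
\partial_x h = -h\, \frac{\sin x}{1 - \cos x}
\]

E. Torsion of a Connection

In general, a connection $\nabla$ need not be related to a metric tensor; for example, the (constant) connection on the unit circle $S^1$ with Christoffel symbol $\Gamma^1_{11} = \lambda$ for fixed $\lambda \ne 0$ does not arise as the metric connection of any metric tensor on $S^1$. However, it is trivially symmetric. The question of whether a connection is symmetric is independent of any metric tensor and it can be formulated in terms of the torsion map
\[
T: \Upsilon^1_0 M \times \Upsilon^1_0 M \to \Upsilon^1_0 M : (u, v) \mapsto \nabla_u v - \nabla_v u - [u, v]
\]
Hence T is zero if and only if $\nabla$ is symmetric. For the usual basis fields $(\partial_1, \ldots, \partial_n)$ arising from coordinates $(x_1, \ldots, x_n)$ we have
\[
T(\partial_i, \partial_j) = \nabla_{\partial_i}\partial_j - \nabla_{\partial_j}\partial_i \quad (\text{since } [\partial_i, \partial_j] = 0)
\]
\[
= \Gamma^k_{ij}\, \partial_k - \Gamma^k_{ji}\, \partial_k = (\Gamma^k_{ij} - \Gamma^k_{ji})\, \partial_k
\]
We can interpret T as a section of $T^1_2 M$ since at each $x \in M$ we have a bilinear map
\[
T: T_xM \times T_xM \to T_xM
\]
but such maps are effectively elements of $T_xM^* \otimes T_xM^* \otimes T_xM$. Hence we may view T as a section of $T^*M \otimes T^*M \otimes TM$ and locally
\[
T = (\Gamma^k_{ij} - \Gamma^k_{ji})\, dx^i \otimes dx^j \otimes \partial_k
\]
The antisymmetry of T in i and j immediately suggests an interpretation of T as some kind of 2-form. This is possible but the presence of the $\partial_k$ direction means that then we view T as a vector-valued 2-form, the torsion form
\[
\Theta: \Upsilon^1_0 M \times \Upsilon^1_0 M \to \Upsilon^1_0 M
\]
which in local coordinates becomes
\[
\Theta = \tfrac{1}{2}\, \Theta_{ij}\, dx^i \wedge dx^j \quad \text{where} \quad \Theta_{ij} = \Gamma^k_{ij}\, \partial_k
\]
so locally $\Theta$ takes values in $\mathbb{R}^n \cong T_xM$.

In much the same way, we can represent any connection by a vector-valued 1-form, the connection form
\[
\omega: \Upsilon^1_0 M \to \Upsilon^1_1 M
\]
which in local coordinates becomes
\[
\omega = \omega_i\, dx^i \quad \text{where} \quad \omega_i = \Gamma^k_{ij}\, dx^j \otimes \partial_k
\]
So, locally $\omega$ takes values in $\mathbb{R}^{n^2}$, the space of $n \times n$ real matrices, which represents $T_xM^* \otimes T_xM \cong L(T_xM, T_xM)$.

EXAMPLE. On $E^2$ with the standard coordinates, one connection $\nabla$ that is not symmetric is given by the Christoffel symbols
\[
(\Gamma^1_{ij}) = \begin{pmatrix} 1 & 8 \\ 4 & 0 \end{pmatrix} \qquad (\Gamma^2_{ij}) = \begin{pmatrix} 1 & 6 \\ 4 & 2 \end{pmatrix}
\]
Its torsion form is $\Theta: \Upsilon^1_0 E^2 \times \Upsilon^1_0 E^2 \to \Upsilon^1_0 E^2$ with
\[
\Theta = \tfrac{1}{2}\, \Gamma^k_{ij}\, \partial_k\, dx^i \wedge dx^j
\]
\[
= \tfrac{1}{2}(8\partial_1 + 6\partial_2)\, dx^1 \wedge dx^2 + \tfrac{1}{2}(4\partial_1 + 4\partial_2)\, dx^2 \wedge dx^1
\]
\[
= \tfrac{1}{2}(4\partial_1 + 2\partial_2)\, dx^1 \wedge dx^2 \quad (\text{with values in } T_xM)
\]
So
\[
\Theta = (2, 1)\, dx^1 \wedge dx^2 \quad (\text{with values in } \mathbb{R}^2)
\]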
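The torsion computation for the nonsymmetric connection on $E^2$ with $(\Gamma^1_{ij}) = \begin{pmatrix} 1 & 8 \\ 4 & 0 \end{pmatrix}$, $(\Gamma^2_{ij}) = \begin{pmatrix} 1 & 6 \\ 4 & 2 \end{pmatrix}$ is mechanical enough to script. The sketch below is an illustration only: the function name `torsion_components` and the nested-list layout `gamma[k][i][j]` (zero-based indices standing for $\Gamma^{k+1}_{(i+1)(j+1)}$) are assumptions, not notation from the text.

```python
def torsion_components(gamma):
    """Return T[k][i][j] = Gamma^k_ij - Gamma^k_ji for gamma[k][i][j] = Gamma^k_ij."""
    n = len(gamma)
    return [[[gamma[k][i][j] - gamma[k][j][i] for j in range(n)]
             for i in range(n)]
            for k in range(n)]

# Christoffel symbols of the example (indices shifted to start at 0).
gamma = [
    [[1, 8], [4, 0]],  # Gamma^1_ij
    [[1, 6], [4, 2]],  # Gamma^2_ij
]

T = torsion_components(gamma)

# Coefficient of dx^1 ^ dx^2 in the torsion form, one entry per basis field d_k.
theta_12 = tuple(0.5 * T[k][0][1] for k in range(2))
print(theta_12)
```

A symmetric connection gives all components zero, so this function doubles as a quick symmetry test for any array of Christoffel symbols.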

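The connection form of this $\nabla$ is $\omega: \Upsilon^1_0 E^2 \to \Upsilon^1_1 E^2$ with
\[
\omega = \Gamma^k_{ij}\, dx^j \otimes \partial_k \otimes dx^i
\]
\[
= \begin{pmatrix} 1\, dx^1 \otimes \partial_1 & 8\, dx^1 \otimes \partial_2 \\ 4\, dx^2 \otimes \partial_1 & 0\, dx^2 \otimes \partial_2 \end{pmatrix} dx^1 + \begin{pmatrix} 1\, dx^1 \otimes \partial_1 & 6\, dx^1 \otimes \partial_2 \\ 4\, dx^2 \otimes \partial_1 & 2\, dx^2 \otimes \partial_2 \end{pmatrix} dx^2
\]
So
\[
\omega = \begin{pmatrix} 1 & 8 \\ 4 & 0 \end{pmatrix} dx^1 + \begin{pmatrix} 1 & 6 \\ 4 & 2 \end{pmatrix} dx^2 \quad (\text{with values in } \mathbb{R}^{2 \times 2})
\]

F. Curvature of a Connection

Intuitively we perceive that the unit sphere in $E^3$ has curvature but the (x, y) plane there has not; both are 2-manifolds inheriting a metric connection from Euclidean $E^3$. A geometer detects the presence of curvature by taking a tangent vector around closed curves by parallel transport. If upon return to the starting point the transported vector is always the same as the initial vector, then the connection used for the parallel transport is called flat; otherwise it is called curved. The amount of curvature at different points and in different directions is measured by the curvature map
\[
R: \Upsilon^1_0 M \times \Upsilon^1_0 M \to L(\Upsilon^1_0 M, \Upsilon^1_0 M) : (u, v) \mapsto R(u, v)
\]
where $R(u, v): \Upsilon^1_0 M \to \Upsilon^1_0 M : w \mapsto \nabla_u\nabla_v w - \nabla_v\nabla_u w - \nabla_{[u,v]} w$. Since $L(\Upsilon^1_0 M, \Upsilon^1_0 M) \cong \Upsilon^1_1 M$, we can consider R to be a member of $L(\Upsilon^2_0 M, \Upsilon^1_1 M) \cong \Upsilon^1_3 M$, that is, a $\binom{1}{3}$-tensor field. Observe that $R(u, v) = -R(v, u)$.

We interpret R(u, v) at a point x as a map of $T_xM$ to itself that is a limiting case of parallel transport around a curvilinear parallelogram determined by $u(x), v(x) \in T_xM$. In components,
\[
R(\partial_i, \partial_j)\, w = \nabla_{\partial_i}\nabla_{\partial_j} w - \nabla_{\partial_j}\nabla_{\partial_i} w \quad \text{since } [\partial_i, \partial_j] = 0
\]
so
\[
R(\partial_i, \partial_j)\, \partial_k = \nabla_{\partial_i}(\Gamma^m_{jk}\, \partial_m) - \nabla_{\partial_j}(\Gamma^m_{ik}\, \partial_m)
\]
\[
= (\partial_i \Gamma^m_{jk} + \Gamma^l_{jk}\Gamma^m_{il})\, \partial_m - (\partial_j \Gamma^m_{ik} + \Gamma^l_{ik}\Gamma^m_{jl})\, \partial_m
\]
\[
= (\partial_i \Gamma^m_{jk} - \partial_j \Gamma^m_{ik} + \Gamma^l_{jk}\Gamma^m_{il} - \Gamma^l_{ik}\Gamma^m_{jl})\, \partial_m
\]
\[
= R^m_{ijk}\, \partial_m
\]
So as a $\binom{1}{3}$-tensor field
\[
R = R^m_{ijk}\, \partial_m \otimes dx^i \otimes dx^j \otimes dx^k
\]
Most conveniently we can interpret R as a vector-valued 2-form, the curvature form
\[
\Omega: \Upsilon^1_0 M \to \Upsilon^1_1 M
\]
which in local coordinates becomes
\[
\Omega = \Omega_{(ij)}\, dx^i \wedge dx^j
\]
where $\Omega_{ij} = (\partial_i \Gamma^m_{jk} - \Gamma^l_{ik}\Gamma^m_{jl})\, \partial_m \otimes dx^k$. So, locally $\Omega$ takes values in $T_xM \otimes T_xM^* \cong \mathbb{R}^{n^2}$, giving a matrix representing the limiting parallel transport map of vectors from $T_xM$ around a parallelogram defined by $\partial_i$, $\partial_j$.

It is easy to extend the exterior product and exterior derivative to vector-valued r-forms, just by applying the usual operations to their components. When we do this for the connection, curvature, and torsion forms we obtain the famous structural equations of E. Cartan, for example,
\[
\Omega(u, v) = d\omega(u, v) + \tfrac{1}{2}[\omega(u), \omega(v)] \qquad u, v \in \Upsilon^1_0 M
\]

EXAMPLE. The previously mentioned Schwarzschild metric tensor on $M = \mathbb{R} \times E^3 \setminus B$ with coordinates $(t, r, \theta, \phi)$ is of the form
\[
(g_{ij}) = \begin{pmatrix} -f^2(r) & 0 & 0 & 0 \\ 0 & f^{-2}(r) & 0 & 0 \\ 0 & 0 & r^2 & 0 \\ 0 & 0 & 0 & r^2\sin^2\theta \end{pmatrix} \quad \text{with f a function of r}
\]
As before we let indices run 0, 1, 2, 3. Evidently, a basis of mutually orthogonal unit 1-form fields, that is, of orthonormal fields, is given by
\[
(\omega^i) = (f\, dt,\ f^{-1}\, dr,\ r\, d\theta,\ r\sin\theta\, d\phi)
\]
Their exterior derivatives satisfy the structural equations
\[
d\omega^i = -\omega^i_j \wedge \omega^j \quad \text{where} \quad \omega^i_j = \Gamma^i_{jk}\, \omega^k
\]
and
\[
\Omega^i_j = d\omega^i_j + \omega^i_k \wedge \omega^k_j \quad \text{where} \quad \Omega^k_l = R^k_{l(ij)}\, \omega^i \wedge \omega^j
\]
Computation of the derivatives yields the following, with $\dot{f}$ denoting the derivative of f with respect to r:
\[
d\omega^0 = \dot{f}\, \omega^1 \wedge \omega^0
\]
\[
d\omega^1 = 0
\]
\[
d\omega^2 = (f/r)\, \omega^1 \wedge \omega^2
\]
\[
d\omega^3 = (f/r)\, \omega^1 \wedge \omega^3 + ((\cot\theta)/r)\, \omega^2 \wedge \omega^3
\]
Then we deduce that the only nonzero $\omega^i_j$ are
\[
\omega^2_1 = -\omega^1_2 = (f/r)\, \omega^2 \quad \text{so} \quad d\omega^2_1 = (f\dot{f}/r)\, \omega^1 \wedge \omega^2
\]
\[
\omega^3_1 = -\omega^1_3 = (f/r)\, \omega^3 \quad \text{so} \quad d\omega^3_1 = (f\dot{f}/r)\, \omega^1 \wedge \omega^3 + (f/r^2)\cot\theta\ \omega^2 \wedge \omega^3
\]
\[
\omega^1_0 = \omega^0_1 = \dot{f}\, \omega^0 \quad \text{so} \quad d\omega^1_0 = (\dot{f}^2 + f\ddot{f})\, \omega^1 \wedge \omega^0
\]
\[
\omega^3_2 = -\omega^2_3 = ((\cot\theta)/r)\, \omega^3 \quad \text{so} \quad d\omega^3_2 = -(1/r^2)\, \omega^2 \wedge \omega^3
\]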

By inspection of the second structural equation we find

    Ω01 = (f f̈ + ḟ²) ω1 ∧ ω0
    Ω02 = (f ḟ/r) ω2 ∧ ω0
    Ω31 = (f ḟ/r) ω1 ∧ ω3 + (f/r²) cot θ ω2 ∧ ω3
    Ω21 = (f ḟ/r) ω1 ∧ ω2
    Ω32 = ((f² − 1)/r²) ω2 ∧ ω3
    Ω03 = (f ḟ/r) ω3 ∧ ω0

Then from the definition of the curvature form we obtain the components $R_l{}^i{}_{kj}$ of the Riemann curvature tensor. For example,

    $R_3{}^2{}_{32} = R_2{}^3{}_{23} = (f^2/r^2) - (1/r^2)$

Einstein’s equation in general relativity can be written

    $R_i{}^k{}_{jk} = R_i{}^0{}_{j0} + R_i{}^1{}_{j1} + R_i{}^2{}_{j2} + R_i{}^3{}_{j3} = 0$

It results in two differential equations for f, reducible to

    f² + r (d/dr)(f²) − 1 = 0

which admits the solution we encountered before:

    f(r) = (1 − 2m/r)^(1/2)

VII. SINGULAR GEOMETRY

In this section we shall look, albeit briefly, at some situations where the geometry goes wrong. We have seen that if we start with a nice geometrical space, like Euclidean E², we can introduce singular behavior by removing points. Spaces obtained by such removal operations are manifestly incomplete. The interesting thing is that some geometrical spaces are not of this type but are incomplete; in other words some singular behavior still persists after all removed points are replaced.

A. Riemannian Completeness

It is an interesting geometrical problem to describe just what constitutes a singularity and where it is in such spaces. Geodesic completeness is one obvious criterion and it solves the problem for Riemannian manifolds: there is a geodesic singularity in a region U ⊆ M if there is a geodesic that enters U and does not leave but cannot be extended. It can be shown by means of the Hopf-Rinow theorem that this criterion covers all other curves as well: a Riemannian manifold (without boundary) is complete with respect to all curves if and only if it is complete with respect to all geodesics. A sufficient condition for a Riemannian manifold to be geodesically complete is for it to be compact, for instance, a closed and bounded subset of some Euclidean Eⁿ. Spheres and the torus are of this kind. It is not a necessary condition because Eⁿ itself is complete but not compact. In a connected, complete Riemannian manifold any two points can be connected by a geodesic that is minimal in length relative to all joining curves between the points. Of course, in any Riemannian manifold, “small enough” regions are always complete.

B. Connection Completeness

Geodesics do not tell the whole story about completeness in non-Riemannian manifolds. It is possible to have geodesic completeness but some other curves may be incomplete in any reasonable definition. For example, it is possible to contrive a model space-time that is geodesically complete but in which an observer, in a rocket say, could with finite energy follow a trajectory that cannot be extended beyond a certain finite time. Such an observer would disappear from that universe. Most realistic cosmological models imply space-time geometries that are of this kind, simply because gravity is attractive.

We are faced with the problem: does a given curve c: [0, 1) → M admit a continuous extension by one point to the closed interval [0, 1]? If not, then we say that the curve is inextensible. This in itself need not imply a problem of incompleteness since, in the Euclidean plane,

    c: [0, 1) → E² : t ↦ (0, (1 − t)⁻¹ − 1)

begins at the origin and proceeds to cover the whole positive y axis but we cannot extend it to a domain including t = 1. The additional test that we need to make is for some kind of finiteness property. In the presence of a Riemannian metric we could use the length of the curve. Clearly, though inextensible, the above curve is not finite in length, so we do not wish to infer from it any intrinsic incompleteness of the space E². In the absence of suitable measurements for length, as in pseudo-Riemannian manifolds, we need another device to test for finiteness. The natural one, which arises when we have a connection, depends on the use of parallel transport along the given curve.

C. b-Incompleteness

Let c: [0, 1) → M be a curve in manifold M with connection ∇ and choose a basis $(b_i)$ for $T_{c(0)}M$. Then the parallel transport isomorphism of tangent spaces along the curve’s path, say,

    $\tau_t : T_{c(0)}M \to T_{c(t)}M : u^i b_i \mapsto u^i \tau_t(b_i) = u^i \beta_i(t)$

defines a basis $(\beta_i(t))$ for each $T_{c(t)}M$. Now, any n-dimensional vector space, once we have chosen a basis,
has a unique isomorphism with ℝⁿ simply by taking components with respect to that basis. Next, ℝⁿ has a natural norm

    $\|(x^i)\| = \left((x^1)^2 + (x^2)^2 + \cdots + (x^n)^2\right)^{1/2}$

So, given the choice of basis $(b_i) = (\beta_i(0))$ at c(0) for our tangent spaces, we can use its parallel transported image to define a b length for the tangent vector

    $\dot c(t) = \dot c^i(t)\,\beta_i(t)$ to c at t

namely

    $\|\dot c(t)\|_b = \|(\dot c^i(t))\|$

This gives us a b length of the curve with respect to basis $(b_i)$ at c(0), defined by

    $L_b(c) = \int_0^1 \|\dot c(t)\|_b \, dt$

In a space-time manifold, a choice of basis for one tangent space is effectively a choice of reference frame of directions and scale of units, that is, an observer. Clearly our b length will vary with the choice of the observer. However, what does not vary is whether it is finite or not. One choice of initial basis $(b_i)$ at c(0) is sufficient to test for finiteness of b length with respect to any basis.

Accordingly, we say that curve c is b incomplete if it is inextensible and has finite b length. This definition only needs the presence of a connection, not a metric tensor. When we do have a Riemannian metric connection then the finiteness of b length is equivalent to the finiteness of ordinary length. We say that a manifold with connection is b incomplete if it contains a b-incomplete curve. A Riemannian manifold is actually b incomplete if and only if it is geodesically incomplete, but this is not the case for pseudo-Riemannian manifolds.

When it was discovered that all realistic models of the universe, relativistic or otherwise, are likely to be incomplete, it was natural to enquire if this was merely an inadequacy of the field theory of gravity. Thus, it might be hoped that an appropriate quantum theory of gravity would be such as to average out any classical singularities. Unfortunately, there is no single quantum theory of gravity that is accepted by all. One theory, geometrical quantization, when applied to a massless Klein–Gordon scalar field on a curved space-time could not prevent the collapse of the state vector: the incompleteness found in the classical geometry could not be quantized away. More generally, it was shown that if there is b incompleteness with respect to one connection then there will be b incompleteness with respect to any nearby connection. So the singularity is stable under perturbations of the connection and unlikely to be removable by any quantum theory of gravity.

VIII. TOPOLOGY, GEOMETRY, AND PHYSICS

The 20th century enjoyed a wealth of developments from the interplay of formal mathematics and natural sciences, and in no other area is this as rich in results as that involving geometry, topology, and theoretical physics. One of the most remarkable families of recent results has been in the work of Freedman and Donaldson that led to their award of Fields Medals at the International Congress of Mathematicians in 1986. An amazing but easily comprehended result is that there are copies of ℝ⁴ that are topologically indistinguishable from ordinary ℝ⁴ but which have different manifold structures; namely, there are exotic 4-spaces and this happens in no other dimension.

In simple terms, we could say that whereas on all ℝⁿ for n = 1, 2, 3, 5, 6, . . . , there is only one way to set up calculus, in the case of ℝ⁴ there are infinitely many different ways to do it. It was already known that there were exotic 7-spheres (but none of lower dimension), but the new results led to the existence of exotic closed 4-manifolds. Many compact (topological) 4-manifolds cannot be given a differentiable structure and those that do admit differentiable structures may allow infinitely many. Interestingly, this was proved by using the methods of Yang-Mills theory from physics. The obstructions to differentiable structures arise from the theory of connections associated with the Yang-Mills equations and instantons. Thus, Donaldson gave an application of physical gauge field theory to geometrical topology.

Cosmology has provided rich areas of application for the curved pseudo-Riemannian geometry needed for general relativity theory and it has generated very precise experiments to detect the physical consequences. In general relativity, spacetime has its curvature controlled by the matter distribution and the curvature controls how freely gravitating bodies will move.

The usual model for a simple, homogeneous isotropic spacetime is ℝ⁴ with the Friedmann-Robertson-Walker metric tensor, given by the arclength expression

    $ds^2 = c^2\,dt^2 - a(t)^2\left[\dfrac{dr^2}{1 - kr^2} + r^2\left(\sin^2\theta\,d\phi^2 + d\theta^2\right)\right].$

Here, k = 1, 0, −1 depending on whether the universe is closed, flat, or open, respectively. In the case k = 1, space is represented at each instant t by a sphere of radius a(t).

Geodesics represent the trajectories of particles free from all influences other than gravitational interactions with the matter in the universe. Photons follow null geodesics, which lie on the boundary of local null cones (ds² = 0) determined by the constancy of the speed of light. Free material particles follow timelike geodesics but even those subject to acceleration are nevertheless
constrained to lie locally in the interior (ds² > 0) of null cones of directions.

In the Friedmann-Robertson-Walker cosmological model, the wavelength of electromagnetic radiation emitted from a distant source at time t and observed at time t₀ has redshift relative to the local wavelength given by:

    $\dfrac{\lambda_{\text{source}}}{\lambda_{\text{local}}} = 1 + z = \dfrac{a(t_0)}{a(t)}.$

Free, initially thermal radiation energy remains thermal during the expansion of the universe, with temperature proportional to a(t)⁻¹ and the relative expansion rate of the spacelike surfaces is given by

    $\left(\dfrac{1}{a}\dfrac{da}{dt}\right)^2 = H^2 = \dfrac{8}{3}\pi G\rho - \dfrac{kc^2}{a^2} + \dfrac{1}{3}\Lambda$

Here, G is the gravitational constant, H is Hubble’s constant at time t, Λ is the cosmological constant and ρ is the mean matter density. The critical mean density that corresponds to k = 0 when Λ = 0, is given at the present epoch when H = H₀, by

    $\rho_{\text{crit}} = \dfrac{3H_0^2}{8\pi G}$

A recent trend has been to allow Λ to differ from zero, since there is evidence that the present mean matter density is less than the critical mean density ρcrit.

A Hot Big Bang some 12 gigayears ago, at t = 0 in the Friedmann-Robertson-Walker model, followed by an adiabatic expansion controlled by general relativity, is the broad scenario most widely accepted by cosmologists. It accounts well for the relative abundance of the lightest nuclides and for the observable microwave background radiation. However, there are some intriguing difficulties. The matter in typical galaxies arose from fluctuations within the first year after the Big Bang, because later the scales would have been too large for causal effects to occur. This gives rise to the question of the origin and nature of the fluctuations in the hot dense early phase; “cosmic inflation” is a currently favored way to answer this question. This notion involves an initial period of exponentially accelerating expansion lasting ∼10⁻³² sec, caused by a positive cosmological constant, which physicists associate with a hypothetical scalar field or “inflaton.” Such an inflation period would precede the normal evolution of a(t).

A special case of interest is when k = 0, the mean matter density is zero, ρ = 0, and the cosmological constant is positive in Friedmann-Robertson-Walker spacetime. This corresponds to the de Sitter cosmological model, which happens also to be the limiting scenario for all indefinitely expanding models with Λ > 0. In de Sitter spacetime, the cosmic inflation corresponds to $a(t) \propto e^{\sqrt{\Lambda/3}\,t}$.

Rather surprisingly, general relativistic cosmology coupled with deep space observations leads cosmologists to conclude that most of the matter in galaxies and clusters is dark matter and so is invisible as far as electromagnetic emission or absorption is concerned. The matter is known to be there through gravitational effects and it is dominant at all scales of structure larger than galactic cores, but it is unclear what form it takes. Inference from interpreting gravitational effects on infrared spectral redshifts indicates the presence of supergalactic sheetlike clusters containing about 60% of all galaxies. The remaining galaxies seem to be about equally shared between dense filamentary conglomerations (“walls”) and sparse filaments. This disposition of matter leaves large voids in the observable universe and the distribution of the sizes of these voids is also an active research area.

Quasars and active galactic nuclei need some source of energy and one possibility is for this to consist of massive black holes, of size 10⁷–10⁹ solar masses. The simplest geometry for an otherwise empty spacetime containing an isolated black hole with mass M is that of the Schwarzschild model. There we have

    $ds^2 = (1 - 2M/r)\,dt^2 - \left[\dfrac{dr^2}{1 - 2M/r} + r^2\left(\sin^2\theta\,d\phi^2 + d\theta^2\right)\right].$

The event horizon consists of the surface at r = 2M, from which neither photons nor material particles can escape to the outside and through which any nearby particles will be drawn toward the central singularity at r = 0. In fact, quantum theory may allow a tunneling escape of matter and a consequential reduction of mass as a result of pair production just outside the event horizon, if only one of the pair is drawn in. Infalling matter swept up by a black hole in a galactic core could generate radiation energy through its acceleration toward the event horizon.

SEE ALSO THE FOLLOWING ARTICLES

ALGEBRA, ABSTRACT • ALGEBRAIC GEOMETRY • COSMOLOGY • LOOP GROUPS • TOPOLOGY, GENERAL

BIBLIOGRAPHY

Arnold, V., Atiyah, M., Lax, P., and Mazur, B., eds. (1999). “Mathematics: Frontiers and Perspectives,” Am. Math. Soc., Providence, RI.
Beem, J. K., and Ehrlich, P. E. (1981). “Global Lorentzian Geometry,” Dekker, New York.
Beem, J. K., Ehrlich, P. E., and Easley, K. L. (1996). “Global Lorentzian Geometry,” 2nd ed., Marcel Dekker, New York.
Berger, M. (1999). “Riemannian Geometry During the Second Half of the Twentieth Century,” Am. Math. Soc., Providence, RI.
Bott, R., and Tu, L. W. (1982). “Differential Forms in Algebraic Topology,” Springer-Verlag, New York.
Choquet-Bruhat, Y., DeWitt-Morette, C., and Dillard-Bleick, M. (1982). “Analysis, Manifolds and Physics,” 2nd ed., North-Holland, Amsterdam.
Cordero, L. A., Dodson, C. T. J., and de Leon, M. (1989). “Differential Geometry of the Frame Bundles,” D. Reidel, Dordrecht.
Dekel, A., and Ostriker, J. P., eds. (1999). “Formation of Structure in the Universe,” Cambridge Univ. Press, Cambridge.
Dodson, C. T. J. (1988). “Categories, Bundles and Spacetime Topology,” 2nd ed., Kluwer, Dordrecht.
Dodson, C. T. J., and Parker, P. (1997). “A Users’ Guide to Algebraic Topology,” Kluwer, Dordrecht.
Dodson, C. T. J., and Poston, T. (1991). “Tensor Geometry,” Graduate Texts in Mathematics 130, 2nd ed., Springer-Verlag, New York.
Donaldson, S. K. (1987). The geometry of 4-manifolds, Proc. Int. Congress of Mathematicians, Berkeley, 1986, pp. 43–54, Am. Math. Soc., Providence, RI.
Gompf, R. E., and Stipsicz, A. L. (1999). “4-Manifolds and Kirby Calculus,” Am. Math. Soc., Providence, RI.
Gray, A. (1998). “Modern Differential Geometry of Curves and Surfaces,” 2nd ed., CRC Press, Boca Raton, FL.
Peebles, P. J. E. (1993). “Principles of Physical Cosmology,” Princeton Univ. Press, Princeton.
Sternberg, S. (1983). “Lectures on Differential Geometry,” 2nd ed., Chelsea, New York.
Thurston, W. P. (1997). “Three-Dimensional Geometry and Topology,” Princeton Univ. Press, Princeton.
Willmore, T. J. (1982). “Total Curvature in Riemannian Geometry,” Ellis Horwood, Chichester.
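
As a numerical footnote to the Friedmann-Robertson-Walker relations quoted in Section VIII, the following Python sketch evaluates the redshift relation 1 + z = a(t₀)/a(t), the critical density 3H₀²/(8πG), and the de Sitter scale factor. It is only an illustration: the function names are invented here, and the value H₀ = 70 km/s/Mpc is an assumed round figure, not one taken from the article.

```python
import math

G = 6.674e-11          # gravitational constant in m^3 kg^-1 s^-2
MPC_IN_M = 3.0857e22   # one megaparsec in metres

def redshift(a_emit, a_obs):
    """Redshift z from 1 + z = a(t0) / a(t)."""
    return a_obs / a_emit - 1.0

def critical_density(h0_km_s_mpc):
    """rho_crit = 3 H0^2 / (8 pi G), returned in kg/m^3."""
    h0 = h0_km_s_mpc * 1.0e3 / MPC_IN_M   # convert km/s/Mpc to 1/s
    return 3.0 * h0 ** 2 / (8.0 * math.pi * G)

def de_sitter_scale_factor(t, lam, a0=1.0):
    """a(t) = a0 * exp(sqrt(lam / 3) * t) for a positive constant lam."""
    return a0 * math.exp(math.sqrt(lam / 3.0) * t)

H0 = 70.0  # km/s/Mpc: an assumed illustrative value
# Light emitted when the scale factor was half its present value:
print("z =", redshift(0.5, 1.0))
print("rho_crit =", critical_density(H0), "kg/m^3")
```

With these assumptions the critical density comes out at roughly 10⁻²⁶ kg/m³, which gives a sense of the scale against which the mean matter density is compared.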
P1: GNHFinal Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN009B-410 July 19, 2001 18:42

Mathematical Logic
Yiannis N. Moschovakis
University of California, Los Angeles,
and University of Athens

I. Propositional Logic, PL
II. First-Order Logic, FOL
III. Gödel’s Incompleteness Theorem
IV. Computability
V. Recursion and Programming
VI. Alternative Logics
VII. Set Theory

GLOSSARY

Church-Turing thesis Claim that every computable function can be computed by a Turing machine.
Computability theory Study of computable functions on the natural numbers.
Continuum hypothesis Conjecture that there are only two sizes of infinite sets of real numbers.
Database Finite, typically relational structure.
First-order logic Mathematical model of the part of language built up from the propositional connectives and the quantifiers.
Incompleteness phenomenon Gödel’s discovery, that sufficiently strong axiomatic theories cannot decide all propositions which they can express.
Model theory Study of formal definability in first-order structures.
Paradox Counterintuitive truth.
Peano arithmetic Axiomatic theory of natural numbers.
Proof theory Study of inference in formal systems independently of their interpretation.
Propositional connectives The linguistic constructs “and,” “not,” “or,” and “implies.”
Quantifiers The linguistic constructs “there exists” and “for all.”
Turing machine Mathematical model of computing device with unbounded memory.
Unsolvable problem A problem whose solution requires a non-existent algorithm.

NARROWLY CONSTRUED, mathematical logic is the study of definition and inference in mathematical models of fragments of language, especially the first-order logic fragment. Logic has made critical contributions to the foundations of science, especially through the work of Kurt Gödel, and it also has numerous applications. For set theory and theoretical computer science, these applications are so important, that parts of these fields are normally included in the modern, broad conception of the discipline.


I. PROPOSITIONAL LOGIC, PL

Each logic L has a syntax which delineates the grammatically correct linguistic expressions of L, a semantics which assigns meaning to the correct expressions, and a structured system of proofs which specifies the rules by which some L-expressions can be inferred from others. There are other words to describe these things: formal language is sometimes used to describe a plain syntax, formal system often identifies a syntax together with an inference system (but without an interpretation), and abstract logic has been used to refer to a syntax together with an interpretation, leaving inference aside. It is, however, a fundamental feature of logic that it draws clean distinctions and studies the connections among these three aspects of language. We explain them first in the simplest example of the “logic of propositions,” which is part of many important logics.

A. Propositional Syntax

The symbols of PL are the connectives

    ¬ (not)    & (and)    ∨ (or)    → (implies, if-then)

the two parentheses ‘(’, ‘)’, and an infinite list of (formal) propositional variables P₀, P₁, P₂, . . . which intuitively stand for declarative propositions, things like “John loves Mary” or “3 is a prime number.” It has only one category of grammatically correct expressions, the formulas, which are strings (finite sequences) of symbols defined inductively by the following conditions:

1. Each Pᵢ is a formula.
2. If A and B are formulas, then so are the expressions

    ¬A    (A & B)    (A ∨ B)    (A → B)

For example, if P and Q are propositional variables, then (P → Q) and (P ∨ ¬P) are formulas, which we read as “if P then Q” and “either P or not P.”

The inductive definition gives a precise specification of exactly which strings of symbols are formulas, and also insures that each formula is either prime, i.e., just a variable Pᵢ, or it can be constructed in exactly one way from its simpler immediate parts, by one of the connectives. This makes it possible to prove properties of formulas and to define operations on them by structural induction on their definition.

More propositional connectives can be introduced as “abbreviations” of formula combinations, e.g.,

    A ↔ B ≡ ((A → B) & (B → A))
    A ∨ B ∨ C ≡ (A ∨ (B ∨ C)).

B. Propositional Semantics

If B stands for some true proposition, then ¬B is false, independently of the “meaning” or internal structure of B. This is an instance of a general Compositionality Principle for PL: The truth value of a formula depends only on the truth values of its immediate parts. The semantics of PL comprise the rules for computing truth values, and they can be summarized in Table I, where 1 stands for “truth” and 0 for “falsity.” By the first line of this table, for example, if A and B are both true, then ¬A is false while (A & B), (A ∨ B), and (A → B) are all true. Notice that if A is false, then (A → B) is reckoned to be true no matter what the truth value of B, so that “if the moon is made of cheese, then 1 + 1 = 5” is true (on the plausible assumption that the moon is not made of cheese). This material implication assumed by Propositional Logic has been attacked as counterintuitive, but it agrees with mathematical practice and it is the only useful interpretation of implication which accords with the Compositionality Principle.

TABLE I Truth Value Semantics

    A   B   ¬A   (A & B)   (A ∨ B)   (A → B)
    1   1   0    1         1         1
    1   0   0    0         1         0
    0   1   1    0         1         1
    0   0   1    0         0         1

Using these rules, we can construct for each formula A a truth table which tabulates its truth value under all assignments of truth values to the variables. For example, the truth table for (Q → P) consists of the first three columns of Table II while the first two and the last column give the truth table for (P → (Q → P)).

TABLE II Truth Table

    P   Q   (Q → P)   (P → (Q → P))
    1   1   1         1
    1   0   1         1
    0   0   1         1
    0   1   0         1

If n variables occur in a formula A, then the truth table for A has 2ⁿ rows and determines an n-ary bit function v_A, with arguments and values in the two-element set {1, 0}. By the Definitional Completeness Theorem, every n-ary bit function is v_A for some A, so that the formulas of PL provide definitions (or “symbolic representations”) for all bit functions.
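
The inductive definition of formulas and the truth-value rules of Table I translate directly into a short program. The Python sketch below is one possible rendering, not part of the original article: formulas are represented as nested tuples (a choice made here for illustration), and the helper names `variables`, `evaluate`, and `truth_table` are hypothetical.

```python
from itertools import product

# Formulas as nested tuples (representation assumed for this sketch):
#   ("var", name), ("not", A), ("and", A, B), ("or", A, B), ("imp", A, B)

def variables(formula):
    """Collect variable names by structural induction on the formula."""
    if formula[0] == "var":
        return {formula[1]}
    return set().union(*(variables(part) for part in formula[1:]))

def evaluate(formula, assignment):
    """Truth value (1 or 0) of a formula, following the rules of Table I."""
    op = formula[0]
    if op == "var":
        return assignment[formula[1]]
    if op == "not":
        return 1 - evaluate(formula[1], assignment)
    a = evaluate(formula[1], assignment)
    b = evaluate(formula[2], assignment)
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "imp":
        return (1 - a) | b   # false only when a = 1 and b = 0
    raise ValueError(f"unknown connective: {op!r}")

def truth_table(formula):
    """Return the sorted variable names and the 2**n rows of the table."""
    names = sorted(variables(formula))
    rows = []
    for values in product((1, 0), repeat=len(names)):
        assignment = dict(zip(names, values))
        rows.append((values, evaluate(formula, assignment)))
    return names, rows

P, Q = ("var", "P"), ("var", "Q")
q_imp_p = ("imp", Q, P)
p_imp_q_imp_p = ("imp", P, q_imp_p)

names, rows = truth_table(p_imp_q_imp_p)
for values, result in rows:
    print(dict(zip(names, values)), "->", result)
```

Because each formula is built in exactly one way from its immediate parts, both `variables` and `evaluate` are defined by recursion on that structure, which is the programming counterpart of structural induction.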

A formula A is a semantic consequence of a set of formulas T (or T-valid) if every assignment to the variables which satisfies (makes true) all the formulas in T also satisfies A. We write

    T |= A ⇔ A is T-valid,

and |= A, in the important special case when T is empty, in which case A is called a tautology. A formula A is satisfiable if some assignment satisfies it, i.e., if ¬A is not a tautology. Let

    A ∼ B ⇔ {A} |= B and {B} |= A
          ⇔ |= A ↔ B,

and call A and B equivalent if A ∼ B. Equivalent formulas define the same bit function, and they can be substituted for each other without changing truth values. Clearly

    (A → B) ∼ (¬A ∨ B),

so that the implication connective is superfluous. In fact, every formula is equivalent to one in disjunctive normal form, i.e., a disjunction A₁ ∨ · · · ∨ Aₖ where each Aᵢ is a conjunction of variables or negations of variables (literals).

C. Applications to Circuits

Each formula A with n variables can be realized by a switching circuit C(A) with n inputs and one output, so that C(Pᵢ) consists of just one input-output edge, C(A & B) is constructed by joining C(A) and C(B) with an and-gate, etc. Figure 1 exhibits the circuit for ((P₁ & P₂) → P₃) using the equivalent formula without implications, so that only ¬-, &-, and ∨-gates are required. These are restricted circuits, of fan-in (maximum number of edges into a node) 2 and fan-out 1, but the Definitional Completeness Theorem implies that every n-ary bit function can be computed by some formula circuit C(A).

FIGURE 1 The circuit for (¬(P₁ & P₂) ∨ P₃).

There are basically two useful measures of circuit complexity, and both of them are faithfully mirrored in formulas. The number of gates of C(A) is exactly the number of connectives in A and measures size complexity (construction cost), while the depth of C(A), which measures the time complexity of computation, is exactly the rank of A, defined inductively so that rk(Pᵢ) = 1, rk(A & B) = max(rk(A), rk(B)) + 1 and similarly for the other connectives. One can now use natural manipulations of formulas to construct circuits which compute a given bit function with minimum size or time complexity, or to establish optimality results for the computation of bit functions by appealing to the formula representations of the circuits which realize them. For example, using disjunctive normal forms, one sees immediately that (if we do not care about cost), every n-ary bit function can be computed by an unbounded fan-in circuit in no more than 3 time units. There is, in general, a substantial trade-off between the size and time complexity of the circuits which compute a given bit function.

D. The Satisfiability Problem

The assertion that “C(A) and C(B) never give the same output on the same inputs” means precisely that “(A ↔ ¬B) is a tautology,” so that to detect that A and B do not have this safety property we need to determine whether the formula ¬(A ↔ ¬B) is satisfiable.

Because of such natural formulations of “error detection” for circuits relative to given specifications, it is very important to find efficient algorithms for determining whether a given formula is satisfiable. The problem is of non-deterministically polynomial time complexity (NP), because it can be resolved by guessing (“non-deterministically”) some assignment and then verifying that it satisfies A in a number of steps which is bounded by a polynomial in the length of A; and it is NP-complete, i.e., every NP-problem can be “reduced” to it by a polynomial reduction. This is a basic result of S. Cook, who introduced the complexity class NP, showed that it contains a large number of important problems, and asked if it coincides with the (seemingly) smaller class P of “feasible,” deterministically polynomial time problems. The question whether P = NP is the fundamental open problem of complexity theory; it amounts simply to the question whether the satisfiability problem can be solved by a deterministic, polynomial algorithm.

E. Propositional Inference

A proof of a formula A from a set of hypotheses T is any finite sequence

    A₀, A₁, . . . , Aₙ₋₁, A

which ends with A, and such that each Aᵢ is either in T, or a PL-axiom, or follows from previously listed formulas by a rule of inference. To make this notion precise we need to specify a set of PL-axioms and rules of inference; and for these to be useful, it should be that they are few and easy to understand, and that the formulas provable from T are exactly the T-tautologies.
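
The semantic notions used in this part of the section (tautology, satisfiability, semantic consequence, equivalence, and disjunctive normal form) can all be checked by brute force over the 2ⁿ assignments. The Python sketch below illustrates this under a simplifying assumption made here: a formula with n variables is represented directly by its bit function, a Python function of n arguments taking values in {1, 0}. All function names are hypothetical, and the exhaustive search is exponential in n; it is not an efficient satisfiability algorithm.

```python
from itertools import product

def assignments(n):
    """All 2**n assignments of truth values (1 or 0) to n variables."""
    return product((1, 0), repeat=n)

def is_tautology(f, n):
    return all(f(*bits) for bits in assignments(n))

def is_satisfiable(f, n):
    # Exhaustive search: the running time grows like 2**n.
    return any(f(*bits) for bits in assignments(n))

def entails(hypotheses, f, n):
    """T |= A: every assignment satisfying all of T also satisfies A."""
    return all(f(*bits) for bits in assignments(n)
               if all(h(*bits) for h in hypotheses))

def equivalent(f, g, n):
    return entails([f], g, n) and entails([g], f, n)

def dnf(f, names):
    """Disjunctive normal form read off the rows on which f takes value 1."""
    terms = []
    for bits in assignments(len(names)):
        if f(*bits):
            literals = [v if b else "¬" + v for v, b in zip(names, bits)]
            terms.append("(" + " & ".join(literals) + ")")
    if not terms:  # unsatisfiable: fall back to an explicit contradiction
        return "(" + names[0] + " & ¬" + names[0] + ")"
    return " ∨ ".join(terms)

def implies(a, b):
    """Material implication as given by the truth table."""
    return 0 if (a == 1 and b == 0) else 1

print(is_tautology(lambda p, q: implies(p, implies(q, p)), 2))
print(equivalent(implies, lambda a, b: (1 - a) | b, 2))
print(dnf(implies, ["A", "B"]))
```

The last line prints a disjunction with one conjunction of literals per satisfying row, which is the construction behind the claim that every formula has an equivalent disjunctive normal form.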

We need just one, binary inference rule:

    A    (A → B)
    ------------   (Modus Ponens)
         B

This is sound, i.e., {A, (A → B)} |= B, so that if A and (A → B) are both T-tautologies, then so is B.

An axiom is any instance of the following axiom schemes, where A, B, and C are arbitrary formulas and we have omitted several parentheses which pedantry would require:

(1) A → (B → A)
(2) (A → B) → ((A → (B → C)) → (A → C))
(3) A → (B → (A & B))
(4) (A & B) → A        (4′) (A & B) → B
(5) A → (A ∨ B)        (5′) B → (A ∨ B)
(6) (A → C) → ((B → C) → ((A ∨ B) → C))
(7) (A → B) → ((A → ¬B) → ¬A)
(8) ¬¬A → A

These are all tautologies, and so every formula provable from T is T-valid. We write

    T ⊢ A ⇔ there is a proof of A from T,

and it is not hard now to establish the basic

Soundness and Completeness Theorem for PL. For all sets T and any A,

    T |= A ⇔ T ⊢ A.

F. Boolean Algebras

A Boolean algebra is a set B with at least two, distinct elements 0 and 1, a unary complementation operation ′, and binary infimum ∩ and supremum ∪ operations such that certain properties hold. The standard example is the set P(M) of all subsets of some nonempty set M, with 0 = ∅, 1 = M and the usual complementation, intersection and union operations, which for a singleton M gives the two-element set {1, 0} of truth values; but there are others, e.g., the set of all finite and cofinite subsets of some infinite set, the set of all “closed and open” subsets of a topological space, etc.

Each formula A with n variables defines an n-ary function on every Boolean algebra B, simply by letting the propositional variables range over B and replacing ¬, &, and ∨ and → by ′, ∩, ∪, and ⇒ respectively, where

    x ⇒ y = x′ ∪ y

on B. Now the axioms for a Boolean algebra insure that every propositional axiom defines a function with constant value 1; in fact the particular choice of axiomatization for Boolean algebras (and there are many) is quite irrelevant as long as this fact obtains; and then the Completeness Theorem implies that two formulas A and B define the same n-ary operation on all Boolean algebras exactly when A ∼ B, i.e., when A and B define the same bit function.

Boolean algebras have many important applications in mathematics (to measure theory, among other things), and they are the subject of the classical Stone Representation Theorem which identifies them all (up to isomorphism) with sub-algebras of powerset algebras. In logic they are mostly used through the “nonstandard” Boolean semantics of this subsection, which extend to richer logics and provide a powerful tool for independence (unprovability) results.

II. FIRST-ORDER LOGIC, FOL

Consider the claim:

    If everybody has a mother, and every mother loves her children, then everybody is loved by somebody.

It is certainly true, it has the “linguistic form” of many similar (more substantial) claims in mathematics, and it appears to be true by virtue of its form and not because of any special properties of the words “mother,” “love,” etc. First-Order Logic makes it possible to express complex assertions of this type and to show that they are true by logic alone. The symbolic expression of this one will be

    [(∀x)(∃y)M(x, y) & (∀x)(∀y)[M(x, y) → L(y, x)]]
        → (∀x)(∃y)L(y, x),

give-or-take a few parentheses and brackets which will be required to make the syntax completely precise.

A. First-Order Syntax

The symbols of FOL are the propositional connectives, the parentheses, the quantifiers

    ∀ (for all)    ∃ (there exists)

the comma ‘,’, the identity symbol ‘=’, an infinite list v₀, v₁, . . . of individual variables which will denote arbitrary objects in some domain, and for each n = 0, 1, . . . , two infinite lists of function and relational symbols

    fⁿ₀, fⁿ₁, . . . ,    Pⁿ₀, Pⁿ₁, . . . ,

which will stand for n-ary functions and relations on the objects.

There are two categories of grammatically correct expressions in FOL, terms and formulas, defined recursively by the following conditions.

T1. Each variable v_i is a term.
T2. If t_1, ..., t_n are terms, then (the string) f_i^n(t_1, ..., t_n) is also a term. When n = 0, we write simply f_i^0.
F1. If t_1, ..., t_n are terms, then the expressions

    t_1 = t_2        P_i^n(t_1, ..., t_n)

are formulas, the latter written simply P_i when n = 0.
F2. If A and B are formulas, then so are the expressions

    ¬A    (A & B)    (A ∨ B)    (A → B)

F3. If A is a formula, then so are the expressions

    (∀v_i)A    (∃v_i)A

Notice that by the notational convention in F1, all PL-formulas are also FOL-formulas.

This logic is called first order because quantification is only allowed over individuals; if we add formula formation rules

    ∀P_i^n A    ∃P_i^n A

we obtain the formulas of second-order logic, SOL.

Consider the simple formula

    (∃v_2)[¬ v_2 = v_1 & P_1^1(v_2)].        (1)

Its “translation” into English by the reading of the symbols we have introduced is

    some object other than v_1 has the property P_1^1

which is exactly how we would translate the result of substituting v_3 for v_2 in it,

    (∃v_3)[¬ v_3 = v_1 & P_1^1(v_3)].

This is because both occurrences of v_2 in Eq. (1) are bound by the quantifier ∃v_2, just as the occurrences of x are bound by the dx in ∫_0^1 x^2 dx and can be replaced by y without changing the meaning of the definite integral. On the other hand, the occurrence of v_1 in Eq. (1) is free, because it is not within the scope of any quantifier, and so the interpretation of v_1 clearly affects the meaning of Eq. (1).

Using the same simple example, consider the results of substituting f_1^1(v_3) and f_1^1(v_2) for v_1 in Eq. (1),

    (∃v_2)[¬ v_2 = f_1^1(v_3) & P_1^1(v_2)],
    (∃v_2)[¬ v_2 = f_1^1(v_2) & P_1^1(v_2)].

The first of these says of f_1^1(v_3) what Eq. (1) says of v_1, but the second says that “something is not a fixed point of f_1^1 and has property P_1^1,” which is quite different, evidently because the variable v_2 in f_1^1(v_2) is “caught” by the quantifier ∃v_2. The first is a free substitution (causing no confusion) while the second is not. We will denote the result of substituting the term t for the free occurrences of the variable x in some formula A by

    A{x :≡ t}

and we will tacitly assume that all substitutions are free.

Formulas of FOL are too messy to write down, and so we often resort to “informal descriptions” of them like the example about mothers loving their children above: recipes, really, from which the full, grammatically correct formula could (in principle) be constructed.

B. First-Order Semantics

Whether Eq. (1) is true or false depends on the object v_1, on the function f_1^1, on the property P_1^1, and (most significantly) on the range of objects over which we interpret the existential quantifier: where do we search for things which may or may not satisfy P_1^1?

To interpret the formulas of FOL we must be given a domain D and an interpretation ı, a function which assigns an object ı(v_i) in D to each individual variable, an n-ary function ı(f_i^n) on D to each n-ary function symbol f_i^n, and an n-ary relation ı(P_i^n) on D to each P_i^n. Using these, we first extend ı inductively to all terms by

    ı(f_i^n(t_1, ..., t_n)) = ı(f_i^n)(ı(t_1), ..., ı(t_n)),

so that ı(t) is some object in D. To assign truth values to formulas, define first, for each variable x and d in D, the update

    ȷ = ı{x := d},

which agrees with ı on all function and relation symbols, and also on all individual variables, except that ȷ(x) = d. With the help of this basic operation, we can state in Table III the classical Tarski truth conditions which determine the truth of formulas relative to a fixed domain D and an interpretation ı.

TABLE III The Tarski Truth Conditions

    ı ⊨ t_1 = t_2               ⇔  ı(t_1) = ı(t_2)
    ı ⊨ P_i^n(t_1, ..., t_n)    ⇔  (ı(P_i^n))(ı(t_1), ..., ı(t_n))
    ı ⊨ ¬A                      ⇔  ı ⊭ A
    ı ⊨ (A & B)                 ⇔  ı ⊨ A and ı ⊨ B
    ı ⊨ (A ∨ B)                 ⇔  ı ⊨ A or ı ⊨ B
    ı ⊨ (A → B)                 ⇔  ı ⊭ A or ı ⊨ B
    ı ⊨ (∀v_i)A                 ⇔  for all d in D, ı{v_i := d} ⊨ A
    ı ⊨ (∃v_i)A                 ⇔  for some d in D, ı{v_i := d} ⊨ A

The truth value of a formula A relative to an interpretation ı is 1 if ı ⊨ A and 0 otherwise, and the Compositionality Principle extends to FOL in a straightforward manner and implies the following basic fact: the truth value of A relative to ı depends only on the
P1: GNHFinal Pages
Encyclopedia of Physical Science and Technology EN009B-410 July 19, 2001 18:42

202 Mathematical Logic

values of ı on the function and relation symbols which occur in A, and on the values ı(x) for the individual variables which occur free in A.

The Tarski conditions do nothing more than translate formulas into English, in effect identifying FOL with a precisely formulated, small but very expressive fragment of natural language.

C. Structures

A vocabulary (or signature) is any finite sequence σ = {f_1, ..., f_k, P_1, ..., P_l} of function and relation symbols, and FOL(σ) is the part of FOL whose formulas involve only the function and relation symbols of σ. The idea is to think of f_1, ..., f_k and P_1, ..., P_l as constants, denoting fixed functions and relations on some set D, and to use the formulas of FOL(σ) to study definability in structures

    M = (D_M, f_1^M, ..., f_k^M, P_1^M, ..., P_l^M)

of vocabulary σ, where the universe D_M of M is any nonempty set, and f_1^M, ..., f_k^M, P_1^M, ..., P_l^M are functions and relations which can be assigned to the vocabulary symbols, e.g., such that f_i^M is n-ary if f_i is n-ary.

An M-assignment is any function α from the variables to D_M, and it extends naturally to an interpretation α_M by the association of f_i^M with f_i and P_i^M with P_i; the standard notation for structure satisfaction is

    M, α ⊨ A ⇔ α_M ⊨ A.

Formulas of FOL(σ) with no free variables are called sentences and (by the Compositionality Principle) they are simply true or false in every σ-structure, without reference to any assignment. They define properties of structures. We write

    M ⊨ A ⇔ for any (and hence all) α, M, α ⊨ A    (A a sentence),

and if M ⊨ A, we say that M satisfies A or is a model of A.

While sentences define properties of structures, formulas with free variables can be used to define relations on structures. If, for example, A has at most one free variable x, we set

    R_A(d) ⇔ M, α{x := d} ⊨ A,

where α is any assignment, since its only relevant value is updated in this definition. In the same way, formulas with n free variables define n-ary relations on σ-structures, the first-order definable relations of M. A function f : D_M^n → D_M is first-order definable if its graph

    G_f(x_1, ..., x_n, w) ⇔ w = f(x_1, ..., x_n)

is first-order definable. Some examples:

A directed graph is a structure G = (D, E), where E is a binary “edge” relation on the set of “nodes” D, and it is a graph (undirected) if it satisfies the sentence

    (∀x)(∀y)[E(x, y) → E(y, x)].

Complete graphs (cliques) are characterized by the sentence

    (∀x)(∀y)E(x, y),

while “diameter ≤ 2” is defined by

    (∀x)(∀y)[x = y ∨ E(x, y) ∨ (∃z)[E(x, z) & E(z, y)]].

Finite directed and undirected graphs are used to model many notions in computer science, e.g., circuits.

A semigroup with identity (monoid) is a structure (S, e, ·) where the identity e is some specified member of S, · is a binary “multiplication” on S, and the following sentences are true:

    (∀x)(∀y)(∀z)[x · (y · z) = (x · y) · z],
    (∀x)(x · e = x & e · x = x).

Here and in the sequel we write t_1 · t_2 rather than the pedantically correct ·(t_1, t_2).

In addition to semigroups, there are groups, rings, fields, and ordered fields, vector spaces, and any number of other structures which are the stuff of “abstract” algebra. These classes of structures are all characterized by first-order axioms, and the use of methods from logic is becoming increasingly important in their study.

Two structures M_1 and M_2 are isomorphic if some one-to-one correspondence between their universes carries the functions and relations of M_1 to those of M_2. Isomorphic structures satisfy the same first-order sentences, but the converse is not true, as we will see in Section II.F.

D. Databases

In the most general terms, a database is just a finite structure, typically relational, i.e., without functions, only relations. “Finite” does not mean “small” or “simple,” and in the interesting applications databases are huge structures of large and complex vocabularies, with basic relations such as “x is an employee born in year n,” “y is the supervisor of x,” etc. Properties of structures are usually called queries in database theory, and one of the main tasks in the field is to develop representations for databases which support fast algorithms for updating (entering new information in the base) and data testing (determining the truth or falsity of queries). As it happens, updating and data testing for first-order queries can be done very efficiently, and so database systems, including the industry standard SQL, make heavy use of methods from first-order logic.
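
The Tarski truth conditions of Table III can be read as a recursive procedure for “data testing” on a finite structure. The following is a minimal sketch of that reading, not an efficient query engine: the encoding of formulas as nested tuples, the function name `holds`, and the sample graph are assumptions made for illustration. It evaluates the three graph sentences of Section II.C on a path with three nodes.

```python
def holds(formula, domain, relations, env=None):
    """Decide satisfaction by the Tarski truth conditions over a finite domain.

    Formulas are nested tuples (an encoding assumed for this sketch):
      ("rel", name, x1, ..., xn), ("eq", x, y), ("not", A), ("and", A, B),
      ("or", A, B), ("implies", A, B), ("forall", x, A), ("exists", x, A).
    env maps variable names to elements of the domain (the assignment).
    """
    env = env or {}
    op = formula[0]
    if op == "rel":
        name, variables = formula[1], formula[2:]
        return tuple(env[v] for v in variables) in relations[name]
    if op == "eq":
        return env[formula[1]] == env[formula[2]]
    if op == "not":
        return not holds(formula[1], domain, relations, env)
    if op == "and":
        return (holds(formula[1], domain, relations, env)
                and holds(formula[2], domain, relations, env))
    if op == "or":
        return (holds(formula[1], domain, relations, env)
                or holds(formula[2], domain, relations, env))
    if op == "implies":
        return (not holds(formula[1], domain, relations, env)
                or holds(formula[2], domain, relations, env))
    if op in ("forall", "exists"):
        var, body = formula[1], formula[2]
        # Update the assignment at var for each d in the domain.
        values = (holds(body, domain, relations, {**env, var: d}) for d in domain)
        return all(values) if op == "forall" else any(values)
    raise ValueError(f"unknown connective: {op!r}")


def E(x, y):
    """Atomic formula E(x, y) for the edge relation."""
    return ("rel", "E", x, y)


SYMMETRIC = ("forall", "x", ("forall", "y", ("implies", E("x", "y"), E("y", "x"))))
COMPLETE = ("forall", "x", ("forall", "y", E("x", "y")))
DIAMETER_2 = ("forall", "x", ("forall", "y",
              ("or", ("eq", "x", "y"),
               ("or", E("x", "y"),
                ("exists", "z", ("and", E("x", "z"), E("z", "y")))))))

# A path 0 - 1 - 2, with each edge recorded in both directions.
nodes = [0, 1, 2]
path = {"E": {(0, 1), (1, 0), (1, 2), (2, 1)}}

for name, sentence in [("symmetric", SYMMETRIC), ("complete", COMPLETE),
                       ("diameter <= 2", DIAMETER_2)]:
    print(name, holds(sentence, nodes, path))
```

Each quantifier loops over the whole domain, so the running time of this sketch grows like |D| raised to the number of nested quantifiers; practical database systems rely on far better evaluation strategies.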

Motivated by database theory, a good deal of research has been done since the 1970s in Finite Model Theory, the mathematical and logical study of finite structures. For a rather surprising, basic result, let

    Prob_σ[M ⊨ A : |D_M| = n] = the proportion of σ-structures of size n which satisfy A,

where structures are counted “up to isomorphism.”

The FOL 0-1 Law. For each sentence A of FOL(σ) in a relational vocabulary, either

    lim_{n→∞} Prob_σ[M ⊨ A : |D_M| = n] = 1,

or

    lim_{n→∞} Prob_σ[M ⊨ A : |D_M| = n] = 0,

i.e., either A or ¬A is asymptotically true.

More advanced work in this area is concerned primarily with the algorithmic analysis of queries on finite structures, especially in logics richer than FOL.

E. Arithmetic

Most basic is the structure of arithmetic

    N = (ℕ, 0, 1, +, ·),

where ℕ = {0, 1, ...} is the set of (non-negative) natural numbers and + and · are the operations of addition and multiplication. The first-order definable relations and functions on N are called arithmetical, and they obviously include addition, multiplication, and the ordering on ℕ, which is defined by the formula

    x ≤ y ≡ (∃z)[x + z = y].

By a basic lemma of Gödel, if a function f is determined from arithmetical functions g and h by the equations

    f(0, x) = g(x)
    f(y + 1, x) = h(f(y, x), y, x),        (2)

then f is also arithmetical. Thus exponentiation x^y is arithmetical, with g(x) = 1, h(w, y, x) = w·x, and, with some work, so is the function p(x) which enumerates the prime numbers,

    p(0) = 2, p(1) = 3, p(2) = 5, ....

In fact, the scheme of Primitive Recursion (2) is the basic method by which functions are introduced in number theory, so that, with some work, all fundamental number theoretic relations and functions are arithmetical, and all celebrated theorems and open problems of the theory of numbers are expressed by first-order sentences of N. These include the Prime Number Theorem, Fermat’s Last (Wiles’) Theorem, and the (still open) question whether there exist infinitely many twin pairs of prime numbers.

F. Model Theory

The mathematical theory of structures starts with the following basic result:

Compactness and Skolem-Löwenheim Theorem. If every finite subset of a set of sentences T has a model, then T has a countable model.

For an impressive application, let (in the vocabulary of arithmetic)

    num(0) ≡ 0,    num(m + 1) ≡ (num(m) + 1),

so that the numeral num(m) is about the simplest term which denotes the number m, add a constant c to the language, and let

    T = {A : N ⊨ A} ∪ {num(0) ≤ c, num(1) ≤ c, num(2) ≤ c, ...}.

Every finite subset S of T has a model, namely

    N_S = (ℕ, 0, 1, +, ·, m),

where the object m which interprets c is some number bigger than all the numerals which occur in formulas of S. So T has a countable model

    N_T = (ℕ̄, 0̄, 1̄, +̄, ·̄, c),

and then N̄ = (ℕ̄, 0̄, 1̄, +̄, ·̄) is a structure for the vocabulary of arithmetic which satisfies all the first-order sentences true in the “standard” structure N but is not isomorphic with N, because it has in it some object c which is “larger” than all the interpretations of the numerals num(0), num(1), .... It follows that, with all its expressiveness, First-Order Logic does not capture the isomorphism type of complex structures such as N.

These nonstandard models of arithmetic were constructed by Skolem in the 1930s. Later, in the 1950s, Abraham Robinson constructed by the same methods nonstandard models of analysis, and provided firm foundations for the classical Calculus of Leibniz with its infinitesimals and “infinitely large” real numbers.

Model Theory has advanced immensely since the early work of Tarski, Abraham Robinson and Malcev. Especially with the contributions of Shelah in the 1970s and, more recently, Hrushovski, it has become one of the most mathematically sophisticated branches of logic, with substantial applications to algebra and number theory.

G. First-Order Inference

The proof system of First-Order Logic is an extension of that for Propositional Logic, first by identity axioms which ensure that = is an equivalence relation and a congruence

for all function and relation symbols, e.g., for unary function symbols,

    (∀x)(∀y)[x = y → f(x) = f(y)].

In addition, there are two axioms for the quantifiers,

    A{x :≡ t} → (∃x)A        (∀x)A → A{x :≡ t},

assuming that the term substitutions are free; and there are two new inference rules,

    C → A                A → C
    -----------          -----------
    C → (∀x)A            (∃x)A → C

which can be used only when the variable x is not free in C. Proofs from a set T of FOL(σ) sentences are defined exactly as for PL, and we set again

    T ⊢ A ⇔ there is a proof of A from T.

Notice that without the restriction on the quantifier rules, the sequence

    P(x) → P(x),    P(x) → (∀x)P(x),    (∃x)P(x) → (∀x)P(x)

would be a proof of (∃x)P(x) → (∀x)P(x), which is, obviously, not valid. With the restriction, however, for every structure M, if every M-assignment satisfies the hypothesis of either new rule, then every M-assignment satisfies the conclusion, so that the quantifier inference rules are sound.

H. Gödel’s Completeness Theorem

A model of a set of sentences T in FOL(σ) is any structure M which satisfies every A in T, in symbols

    M ⊨ T ⇔ for all A in T, M ⊨ A.

We also write

    T ⊨ A ⇔ for all M, M ⊨ T ⇒ M ⊨ A,

which extends to FOL(σ) the semantic consequence relation of PL. From the comments above:

Soundness Theorem for FOL. If T ⊢ A, then T ⊨ A.

The fundamental fact about First-Order Logic is the converse of this result:

Completeness of FOL. If T ⊨ A, then T ⊢ A.

It may be argued that the semantic consequence relation T ⊨ A captures the intuitive notion A follows from the assumptions in T by logic alone, in the sense that it ensures that A is true whenever all the hypotheses in T are true, independently of the meaning of the function and relation symbols. Granting that and considering the strong expressibility of First-Order Logic discussed in Section II.C above, we may then argue further that the Completeness Theorem answers definitively (for science) the ancient question of what follows from what by logic alone: a proposition A follows from certain assumptions T as a matter of logic (and independently of the facts), if A and T can all be expressed faithfully as FOL(σ) assertions about some σ-structure M, and T ⊢ A. On this view, it is hard to overemphasize the importance of this result for the foundations of mathematics and science.

Incidentally, there is an obvious extension of the Tarski conditions to Second-Order Logic, e.g.,

    ı ⊨ ∀P_i^n A ⇔ for all n-ary P on D, ı{P_i^n := P} ⊨ A.

However, there is no useful Completeness Theorem for SOL, as we will see in Section IV.F.

I. Proof Theory

If Model Theory is the study of semantics independently of inference, then Proof Theory can be viewed as the mathematical investigation of formal proofs independently of interpretation. This has always been one of the most active research areas of logic, and it has been invigorated in recent years by its substantial applications to computer science, including automated deduction, an important component of artificial intelligence. Key to these applications (and the basic result of Proof Theory) is the Extended Normal Form Theorem of Gentzen, whose somewhat weaker (but simpler) Herbrand version is fairly easy to describe.

There are four Herbrand inference rules, and they apply to n-ary disjunctions

    A_1 ∨ ··· ∨ A_n.

Two of them are structural, and they clearly preserve meaning: you can interchange the order of the disjuncts, or delete one of two occurrences of the same disjunct. The other two are quantifier rules,

    A_1 ∨ ··· ∨ A_n{x :≡ t}          A_1 ∨ ··· ∨ A_n
    ------------------------         ------------------------ (∗)
    A_1 ∨ ··· ∨ (∃x)A_n              A_1 ∨ ··· ∨ (∀x)A_n

where the ∗ indicates that the ∀-rule can only be used if the variable x is not free in its conclusion. The result applies only to sentences without identity and in prenex normal form, i.e., looking like

    (Q_1 x_1) ··· (Q_n x_n)B

where each Q_i is ∀ or ∃ and B is quantifier-free.

Herbrand’s Theorem. Every provable =-free sentence A of FOL(σ) in prenex form can be derived from a provable quantifier-free disjunction by the four Herbrand rules.
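
The starting point of a Herbrand derivation, a provable quantifier-free disjunction, can be tested mechanically. The sketch below relies on a standard fact that is not stated in the text above: a quantifier-free formula without identity is valid exactly when it is a propositional tautology in its atomic formulas. The tuple encoding, the treatment of atomic formulas as opaque strings, and the function names are assumptions made for illustration.

```python
from itertools import product


def atoms(formula):
    """Collect the atomic formulas (plain strings) occurring in a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(atoms(part) for part in formula[1:]))


def value(formula, valuation):
    """Truth value of a quantifier-free formula under a valuation of its atoms."""
    if isinstance(formula, str):
        return valuation[formula]
    op = formula[0]
    if op == "not":
        return not value(formula[1], valuation)
    if op == "and":
        return value(formula[1], valuation) and value(formula[2], valuation)
    if op == "or":
        return value(formula[1], valuation) or value(formula[2], valuation)
    if op == "implies":
        return (not value(formula[1], valuation)) or value(formula[2], valuation)
    raise ValueError(f"unknown connective: {op!r}")


def is_tautology(formula):
    """Check every truth assignment to the atomic formulas (truth-table method)."""
    names = sorted(atoms(formula))
    return all(
        value(formula, dict(zip(names, row)))
        for row in product([False, True], repeat=len(names))
    )


# (P(x) → P(y)) ∨ (P(y) → P(z)) is a tautology; P(x) → P(y) alone is not.
disjunction = ("or", ("implies", "P(x)", "P(y)"), ("implies", "P(y)", "P(z)"))
print(is_tautology(disjunction))
print(is_tautology(("implies", "P(x)", "P(y)")))
```

The check is exponential in the number of distinct atomic formulas, which is acceptable for the small disjunctions used in examples but not for the formulas handled by real automated deduction systems.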

The restriction to prenex sentences is not essential, because every formula can be converted to an equivalent prenex one by the application of simple rules which can be added to the system.

The theorem asserts (in part) that every provable sentence A has a “normal” proof, in which only formulas of “quantifier rank” no greater than A occur. This is a powerful tool for proof-theoretic studies. As for applications, all automated deduction systems use Herbrand-like inference systems (or their Gentzen variants), and the programming language PROLOG is based entirely on this idea.

The proof of Herbrand’s Theorem is constructive: an algorithm is defined, which computes for each proof π of a prenex sentence A a Herbrand proof π*, and then it is shown by simple, combinatorial arguments that π*, indeed, proves A. The additional, effective content is significant for the foundational applications of the theorem (for example to consistency proofs), and also in the applications to automated deduction.

It should be emphasized that the simplistic slogans “Model Theory = no inference” and “Proof Theory = no semantics” are often honored in the breach: like the Completeness Theorem, most fundamental results of logic are about connections between truth and proof, and some of the deepest results in one part of the discipline depend on methods and ideas from the other.

III. GÖDEL’S INCOMPLETENESS THEOREM

Having established that FOL proves all logical truths, it is natural to ask if it can also prove, from some natural set of axioms, all mathematical truths. This is not possible, by Gödel’s fundamental result, whose special case for arithmetical truths we discuss in this section.

A. The Incompleteness of Peano Arithmetic

The classical Peano axioms for arithmetic comprise the properties of the successor

    x + 1 ≠ 0        x + 1 = y + 1 → x = y,        (3)

the recursive definitions of addition and multiplication,

    x + 0 = x
    x + (y + 1) = (x + y) + 1,        (4)

    x · 0 = 0
    x · (y + 1) = x · y + x,        (5)

and the Induction Axiom which cannot be expressed fully in First-Order Logic. Its Second-Order Logic version is

    (∀P)[(P(0) & (∀x)(P(x) → P(x + 1))) → (∀x)P(x)],

and the best we can do in FOL is to adopt the Axiom Scheme

    (A{y :≡ 0} & (∀x)(A{y :≡ x} → A{y :≡ x + 1})) → (∀x)A{y :≡ x}.        (6)

The set PA of (first-order) Peano axioms is obtained by taking the correctly spelled versions of all the formulas in (3)–(6) and adding enough universal quantifiers in front of them so that they become sentences. This is a very strong set of axioms: it can prove all simple properties of numbers and most of their deep properties too, although proving a theorem from PA is harder than proving it using, say, methods from analysis, and number theorists distinguish and value “elementary proofs” in PA.

Gödel’s First Incompleteness Theorem. There is a sentence g in FOL(0, 1, +, ·), such that N ⊨ g but PA ⊬ g.

One’s first thought is that we can overcome this “incompleteness phenomenon” by strengthening PA, perhaps by adding Gödel’s own g to it, or by using the Second-Order Logic version of the Induction Axiom along with a suitable axiomatization of Second-Order Logic. None of this helps: Gödel’s fundamental discovery is that first-order truth in N (and every other sufficiently rich structure) simply cannot be presented usefully as an “axiomatic theory.” We will make this precise in a more general version of the Incompleteness Theorem in the next section.

B. Coding (Gödel numbering)

The basic ingredients of the proof of the Incompleteness Theorem are coding and self-reference.

In analytic geometry we “code” (represent) points in the plane by pairs of real numbers, their coordinates, so we can translate geometrical questions into algebraic problems and solve them by calculation. Gödel’s basic idea is to code the syntactic objects of FOL(0, 1, +, ·) (terms, formulas, proofs) by natural numbers, so that their properties are translated into properties of numbers, which can then be expressed in FOL(0, 1, +, ·) and (perhaps) proved in PA.

Since all syntactic objects are strings of symbols, if we view a proof A_1, ..., A_{n−1} as a sequence of formulas separated by commas, it is enough to code strings, and we can do this in (at least) one simple-minded way: we enumerate the symbols of the language

    symbol:  ¬   &   ∨   →   (   )   ∀   ∃   ,   =    0    1    +    ·    v_0  v_1  ···
    code:    1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   ···

and we set

    [a_0 a_1 a_2 ··· a_m] = 2^{n_0} 3^{n_1} 5^{n_2} ··· p(m)^{n_m},

where n_i is the code of the symbol a_i and p(i) is the ith prime number. For example, the (correctly spelled) prime formula

    +(v_1, 0) = v_0

has the horrendously large code

    2^13 · 3^5 · 5^16 · 7^9 · 11^11 · 13^6 · 17^10 · 19^15.

The size of codes is irrelevant: what matters is that every string of symbols (and hence every term, formula and proof) has a code from which it can be reconstructed, by the Unique Factorization Theorem for numbers; and (more significantly) that PA is powerful enough to express and prove simple properties of formulas and proofs, thus translated into properties of numbers. For example, if num(n) is the numeral denoting n, as above, then PA can prove all true, basic relations among numerals, e.g.,

    m + n = k ⇒ PA ⊢ num(m) + num(n) = num(k).

Less trivially, the basic (coded) proof relation

    Proof_PA(a, p) ⇔ a is the code of some sentence A and p is the code of a proof of A from PA

is defined by some formula Proof_PA with v_1 and v_2 free, and PA can prove its basic properties, e.g.,

    Proof_PA(a, p) ⇒ PA ⊢ Proof_PA{v_1 :≡ num(a), v_2 :≡ num(p)}.

Similarly, the relation

    D(a, p) ⇔ a is the code of some formula A with only v_1 free, and p is the code of a PA-proof of A{v_1 :≡ num(a)}

is defined by some formula D with just v_1, v_2 free. Set

    A ≡ (∀v_2)¬D

so that only v_1 is free in A, and if a is the code of A, set

    g ≡ A{v_1 :≡ num(a)}.

Unscrambling the definitions, g asserts that there is no PA-proof of A{v_1 :≡ num(a)}; but g is A{v_1 :≡ num(a)}, so that g claims its own unprovability; and a careful analysis of the situation shows that, indeed, g cannot be provable in PA, else PA would prove a contradiction. This also shows that g is true.

It is not that simple, of course, and much delicate analysis and computation must be done to establish that D(a, p) is arithmetical and to derive a formal contradiction from the assumption that g is PA-provable. Key to the proof is the “self-reference” in the definition of D(a, p), which uses the coding, and the argument depends on the strength (not the weakness) of the axiomatic system PA. Coding and self-reference have become standard tools of logic since Gödel’s work, and they have found substantial applications in many areas, including computer science and set theory.

IV. COMPUTABILITY

It is easy to determine whether an arbitrary equation a_0 + a_1 x + ··· + a_n x^n = 0 with integer coefficients a_0, ..., a_n has integer solutions, since every integer root must divide a_0, and so all we have to do is to test the finitely many divisors of a_0. The problem is not so easy for equations in k unknowns

    Σ_{r_1+···+r_k ≤ n} a_{r_1,...,r_k} x_1^{r_1} x_2^{r_2} ··· x_k^{r_k} = 0,        (7)

and it is much more interesting; in fact, the problem

    to find an algorithm which determines whether Eq. (7) has a solution

is No. 10 in David Hilbert’s famous 1900 list of 23 open problems in mathematics. Diophantine equations are notoriously difficult to solve, and one might suspect that no algorithm would do the job, but how can you prove such an assertion? Using ideas and techniques from Gödel’s work and motivated by questions arising from it, logicians developed, in the 1930s, a tool for establishing absolute unsolvability results of this kind which led to some spectacular applications, including a rigorous proof of the unsolvability of Hilbert’s 10th.

The most direct approach was by Turing, who reasoned that algorithms should be implemented by “mechanical devices” and introduced “abstract machines” that can perform symbolic computations some ten years before digital computers were invented.

A. Turing Machines

A Turing machine M is determined by a finite alphabet S_M = {s_0, ..., s_k}, a finite set Q_M = {q_0, ..., q_m} of (internal) states, and a finite table of transitions of the form

    q, s → q′, s′, m

where q, q′ are states, s, s′ are in S_M or the special “blank” symbol ␣, and the move m is −1, 0, or +1. No two transitions are activated by the same pair q, s on the left. We imagine that, at any moment, M is in some internal state q and sits in front of an infinite “tape” with symbols in some of its cells. The machine can only “see” the symbol s just in front of it, and does nothing (halts) unless one of its transitions is activated by the pair q, s; in which case

it switches to state q′, it replaces s by s′ on the tape, and it moves left (if it can), right, or not at all, depending on whether m is −1, +1 or 0 (Fig. 2).

FIGURE 2 The transition q, a → q′, ␣, −1.

A machine M starts computing facing the leftmost cell, with an arbitrary string input u = u_0 ··· u_{m−1} on the tape, and it may diverge (never halt), for example if u = 11 and M has the two transitions

    q_0, 1 → q_0, 1, +1        q_0, ␣ → q_0, 1, +1

If it halts, then its output on u is the string M[u] = v_0 ··· v_{n−1} at the left end of the tape, until the first blank (and it is possible that M[u] is empty).

[Diagram: the tape u_0 u_1 ... u_{m−1} with M in state q_0, and the tape v_0 v_1 ... v_{n−1} with M in state q_t.]

Finally, M computes a string function f : S_1* → S_2* if S_1 ∪ S_2 ⊆ S_M and for every string u ∈ S_1*, M[u] = f(u). By identifying each natural number n with the string || ··· | of n + 1 tallies from the one-member alphabet {|} (unary notation), the notion covers functions whose arguments or values are either strings or numbers. Moreover, if we code strings by numbers as above, then the transformation u → [u] and its inverse can be computed by a Turing machine, so that a string function is Turing computable exactly when its “coded version” is computable, and we can safely confuse the two notions.

B. The Church-Turing Thesis

Turing argued persuasively that the symbolic computations of any “finite mechanical device” with access to unbounded memory can be simulated by one of his machines, and he has been fully justified by the subsequent developments in computers. Church had already made an equivalent (though less well justified) claim, and so the new fundamental principle carries both famous names:

The Church-Turing Thesis: A string function f : S_1* → S_2* is computable if and only if it can be computed by a Turing machine M on some alphabet S_M ⊇ S_1 ∪ S_2.

The Church-Turing Thesis cannot be rigorously proved, as it identifies the intuitive, informal notion of “computability” with the precise, mathematical property of Turing computability. Within mathematics, it is officially a definition, much like the definitions of arclength or area in terms of integrals. But mathematical definitions are not entirely arbitrary: when we “define” the length of the circumference of a circle of radius r by an integral which computes out to 2πr, we fully expect that if we draw such a circle and measure its circumference, it will turn out to be 2πr, within the margin of error of our measurements. Similarly, when we prove that a certain string function f is not Turing computable, we fully expect that nobody will ever discover an algorithm which computes f, because no such algorithm exists. This is the standard method of application of the thesis.

Evidence for the Church-Turing Thesis comes from Turing’s analysis, from the sixty-odd years of failed attempts to contradict it, and from the robustness of the notion of Turing computability. Many classes of functions were defined in the thirties claiming to capture the notion of “computable” from different perspectives, including Church’s λ-definable functions, Post’s canonical systems, the general recursive functions of Gödel, Herbrand, and Kleene, Kleene’s µ-recursive functions and, in the forties, Markov’s (formal) algorithms; each of these was proved equivalent to Turing computability, and the “simulation techniques” developed for these proofs make it seem very unlikely that some algorithm will ever be discovered which cannot be simulated by a Turing machine.

It should be emphasized, however, that the Church-Turing Thesis does not provide a rigorous definition for the notion of algorithm, which remains informal. Complexity results about algorithms are rigorously grounded on various so-called computation models which embody diverse features of actual computers. When we simulate these models by Turing machines, the time and space complexity of computations increase substantially, and so we cannot claim that the informal algorithm has been faithfully modeled. On the other hand, the time complexity increase is bounded by a polynomial factor for all the known simulations, so that the class P of polynomial problems can be defined in terms of Turing machines without ambiguity.

Turing-computable functions are also called recursive, because of the basic Gödel-Herbrand-Kleene characterization mentioned above.

C. Unsolvable Problems

A set of strings (or problem) Q ⊆ S* from a finite “alphabet” S is computable (recursive, solvable, decidable) if some Turing machine M computes its characteristic function

    c_Q(u) = 1 if u ∈ Q,
             0 otherwise,

otherwise it is unsolvable or undecidable. The definitions apply to problems about natural numbers, coded in unary; to problems about FOL-formulas, by identifying (for example) each variable v_i with a similar sequence vv···v of i + 1 v’s, so that the syntax of FOL is based on a finite vocabulary; and to relations (sets of n-tuples) on strings or numbers, by thinking of u_1, ..., u_n as a single string.

Each Turing machine can be represented by a string of 0’s and 1’s which codes its alphabet, internal states, and transitions, and this leads to the first and most basic unsolvability result, due to Turing:

The Halting Problem: It is undecidable whether an arbitrary Turing machine M halts on an arbitrary binary string u.

For the proof, Turing constructed a universal machine U which can simulate every other, i.e.,

    U[M̄, u] = M[u], if M̄ is the code of M.

This treatment of programs as data is, of course, routine today.

All unsolvability results are (ultimately) established by reducing the Halting Problem to them, i.e., showing that if such-and-such a function were computable, then the Halting Problem would be solvable. The proofs are often difficult and generally depend on results specific to the field in which the problem arises.

In mathematics, the problems which have been proved unsolvable include:

Hilbert’s 10th: Whether a given Diophantine equation has integer solutions (Matijasevich, following work of Martin Davis, Hilary Putnam, and Julia Robinson).

The Word Problem for Groups: Whether two words denote the same element in a finitely generated, finitely presented group (P. Novikov, W. Boone).

The Homeomorphism Problem for 4-manifolds: Whether the orientable n-manifolds represented by two triangulations are homeomorphic, for n ≥ 4 (A. Markov). This problem is solvable for 2-manifolds, by their classical representation as spheres with handles, and it is still open for 3-manifolds, pending (among other things) the resolution of the Poincaré Conjecture.

There is also a large number of unsolvable problems in computer science.

D. Undecidable Theories

A theory T in FOL(σ) is any set of sentences closed under consequence,

    T ⊢ A ⇒ A ∈ T.

The two basic examples are theories of σ-structures

    Th(M) = {A | M ⊨ A},

and axiomatic theories of the form

    T = Th(T_0) = {A : T_0 ⊢ A},

where T_0 is a decidable set of axioms. The terminology is natural, because we would certainly demand of any “axiomatization” that it can be decided effectively whether an arbitrary sentence is an axiom.

Every decidable theory T is axiomatizable since Th(T) = T when T is a theory, but the converse fails, in general, and in particular for T_0 = ∅ when the vocabulary is not trivial:

Church’s Theorem: If the vocabulary σ includes at least one binary function or relation symbol, then it is undecidable for a sentence A of FOL(σ) whether ⊢ A.

A FOL(σ)-theory T is consistent if it does not contain a contradiction A & ¬A, and it is complete if for every sentence A, either A or ¬A is in T. It is easy to verify that every consistent, axiomatizable, complete theory is decidable, and we can use this to formulate and prove a very general version of the Gödel Incompleteness Theorem. The key tool is the notion of translation.

Suppose T_1 and T_2 are theories, perhaps in different vocabularies σ_1 and σ_2; e.g., T_1 might be Th(PA), and T_2 might be some axiomatic set theory. A translation of T_1 into T_2 is a computable string function ρ which assigns a sentence ρ(A) of FOL(σ_2) to every sentence A of FOL(σ_1) and preserves propositional logic and T_1-inference, i.e.,

    T_2 ⊢ ρ(¬A) ↔ ¬ρ(A)
    T_2 ⊢ ρ(A & B) ↔ ρ(A) & ρ(B)
    T_1 ⊢ A ⇒ T_2 ⊢ ρ(A).

Notice that the identity function ρ(A) = A translates every theory into itself.

The Gödel Incompleteness Theorem (Rosser’s form): If T is a consistent, axiomatizable theory and Peano arithmetic Th(PA) is translatable into T, then T is undecidable and hence incomplete.

In short, every consistent axiomatic system in which a reasonable amount of mathematics can be developed is undecidable and incomplete.

To state the strongest corresponding result about theories of structures, we need the simple fact that every computable set is arithmetical, essentially due to Gödel.

Tarski’s Theorem: If Th(N) is translatable into Th(M), then Th(M) is not arithmetical; a fortiori it is not decidable.
P1: GNHFinal Pages
Encyclopedia of Physical Science and Technology EN009B-410 August 1, 2001 13:40

Mathematical Logic 209

To apply Tarski’s Theorem, you need (in effect) to give a first-order definition of the natural numbers within the given structure. One of the first results of this type was the undecidability of the theory of rational numbers Th(Q, 0, 1, +, ·) (Julia Robinson), but there are many others, and there are also many difficult open problems in this area.

On the other hand, many interesting theories are decidable, including the following:

• The theory Th(N, 0, 1, +) of arithmetic without multiplication (Presburger).
• The theory Th(Q, ≤). This coincides with the theory of every dense, linear ordering without end points.
• The theory Th(C, 0, 1, +, ·) of the complex number field, which coincides with the theory of every algebraically closed field of characteristic 0 (Tarski, Abraham Robinson).
• The theory Th(R, 0, 1, +, ·, ≤) of the ordered field of real numbers, which coincides with the theory of every real closed field (Tarski).

The classical result here is Tarski’s decidability of the ordered field of real numbers, which (using coordinates) implies that Euclidean geometry is decidable, in a sense trivializing much of ancient Greek mathematics. It is still open whether the extended theory Th(R, 0, 1, +, ·, ≤, ↑) (with x ↑ y = x^y for x > 0) is decidable, but there has been substantial progress in this problem with Wilkie’s Theorem, that every set in R which is first-order definable using exponentials is a finite union of intervals.

E. The Second Incompleteness Theorem

What sorts of sentences are not provable in sufficiently strong axiomatizable theories? If T = Th(T0) is axiomatizable in FOL(σ), then the (coded) proof relation

    Proof_T(a, p) ⇔ a is the code of some sentence A in FOL(σ) and p is the code of a proof of A from T

is Turing computable, and hence arithmetical. Using this, we can construct a sentence Consis_T in the vocabulary of PA which expresses naturally the consistency of T and establish the following:

Gödel’s Second Incompleteness Theorem (Rosser’s form): If T is consistent, axiomatizable and ρ translates Th(PA) into T, then T cannot prove the translation ρ(Consis_T) of its consistency sentence.

The theorem makes it clear that we cannot axiomatize a substantial part of mathematics in any way whatsoever so that the consistency of the system can be established “constructively”: because the (presumably simple) “constructive methods” we would be willing to use in a consistency proof should be part of the “substantial part of mathematics” we want to axiomatize. Beyond its obvious foundational significance, the Second Incompleteness Theorem has numerous applications, especially in comparing the strength of various hypotheses in Axiomatic Set Theory.

F. Hierarchies

A set Q of strings or numbers is Σ^0_2 if

    u ∈ Q ⇔ (∃x1)(∀x2)R(u, x1, x2),

where the quantified variables range over natural numbers and the matrix R is computable, and it is Π^0_3 if, for all u,

    u ∈ Q ⇔ (∀x1)(∃x2)(∀x3)R(u, x1, x2, x3)

with the same restrictions. The definitions extend naturally to all k, and we also set

    Δ^0_k = Σ^0_k ∩ Π^0_k.

Kleene, who introduced these classes, showed that

    Δ^0_1 = the class of recursive sets,

and that a nonempty set Q is Σ^0_1 exactly when it is recursively (or computably) enumerable, i.e., if

    Q = {f(0), f(1), . . .}

with some recursive f : N → S*. Moreover, these classes increase properly and exhaust the arithmetical sets.

            Σ^0_1        Σ^0_2
    Δ^0_1        Δ^0_2        ...
            Π^0_1        Π^0_2

A similar hierarchy

    Σ^1_k, Π^1_k, Δ^1_k

for the analytical (second-order definable) sets is constructed by allowing the quantified variables to range over the unary functions α : N → N and the matrix to be arithmetical, so that all arithmetical sets are in Δ^1_1.

These hierarchies classify the analytical sets of natural numbers and strings by the logical complexity of their (simplest) definitions, and they are powerful tools in the theory of definability. For example, every axiomatizable theory is Σ^0_1. This rules out an axiomatization of Second-Order Logic SOL, whose set of valid sentences (on the empty vocabulary) is not analytical. Somewhat surprisingly, it also rules out an axiomatization of the theory

    T_f = {A | for all finite (D, E), (D, E) ⊨ A}

of finite graphs, which is Π^0_1 but not Σ^0_1 (Trachtenbrot).
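As an illustration of the characterization of recursively enumerable sets given in this section (Q = {f(0), f(1), . . .} for some recursive f), the following Python sketch shows why membership in such a set can be confirmed but not, in general, refuted by search. The names semi_decide, squares, and budget are hypothetical, and the set of perfect squares is chosen only because it is easy to check by hand (it happens to be decidable as well).

```python
from itertools import count


def semi_decide(u, f, budget=None):
    """Search for n with f(n) == u.

    Returns True as soon as a witness n is found. With budget=None the
    search is unbounded and simply never returns when u is not in the
    range of f; with a finite budget it returns None ("unknown") once
    the budget is exhausted.
    """
    steps = count() if budget is None else range(budget)
    for n in steps:
        if f(n) == u:
            return True
    return None


def squares(n):
    """A recursive enumeration f(0), f(1), ... of the perfect squares."""
    return n * n


found = semi_decide(49, squares, budget=100)
unknown = semi_decide(50, squares, budget=100)
```

A positive answer is definitive, while an exhausted budget says nothing about membership; this asymmetry is what separates the enumerable sets from the recursive ones.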

FIGURE 3 q, a, 0 → q′, , 1, −1, +1.

G. Turing Reducibility

Imagine a Turing machine with a second query tape which it handles exactly like its primary tape, implementing somewhat more complex transitions of the form

    q, s1, s2 → q′, s1′, s2′, m1, m2.

It also has a special query state q?, and when it goes into q?, the computation stops and does not resume until some external agent (the oracle) replaces the contents on the query tape by some string (Fig. 3).

A string function f is computable relative to some given g if it can be computed by such an oracle machine, provided each time q? is reached, the string u on the query tape is replaced by the value g(u). We let

    f ≤_T g ⇔ f is computable in g,

and we extend this notion of Turing reducibility to sets of natural numbers via their characteristic functions.

It is not hard to show that there exist Turing-incomparable sets of numbers (Kleene-Post). In fact, there exist Turing-incomparable recursively enumerable sets, but this was quite hard to prove and it was a celebrated open question for some twelve years, known as Post’s Problem. The simultaneous, independent discovery in 1956 by Friedberg and Muchnik of the priority method which proved it initiated an intense study of Turing reducibility which is still, today, one of the most active research areas of logic, the largest (and technically most sophisticated) part of computability or recursion theory.

V. RECURSION AND PROGRAMMING

In its most general form, a recursive definition of a function x is expressed by a recursive (or fixed point) equation

    x(t) = f(t, x),    (8)

where the functional f(t, x) provides a method for computing each value x(t), perhaps using (“calling”) other values of x in the process. It is possible to characterize the computable functions on the natural numbers using simple recursive equations of this form, generalizations of the primitive recursive definition (2) in Section III.E. Though conceptually less direct than Turing’s approach through idealized machines, this modeling of computability by “recursiveness” provides a powerful tool for establishing properties of computable functions, and it is especially useful in the theory of programming languages.

A. Recursive Equations

Not every recursive equation (8) has a solution x, and some have many, e.g., the trivial x(t) = x(t) which is satisfied by every function. The basic result which guarantees canonical solutions to a large class of recursive equations comes from the theory of partially ordered sets.

A partially ordered set or poset is a structure (D, ≤_D), where ≤_D is a binary relation on D and for all x, y, z in D,

    x ≤_D x,
    [x ≤_D y & y ≤_D z] ⇒ x ≤_D z,
    [x ≤_D y & y ≤_D x] ⇒ x = y;

a subset C of D is a chain if every two members of C are ≤_D-comparable, i.e., x ≤_D y or y ≤_D x; and a poset D is complete if every chain in D has a supremum (least upper bound).

Every complete poset has a least element ⊥ (the supremum of the empty chain), and every set A can be turned into a flat poset A_⊥ by adding a “bottom” below all its otherwise incomparable elements (Fig. 4). Other basic examples include the set of all subsets of a set A (under ⊆) and the set of all (finite and infinite) sequences from some set, under “extension.” The Cartesian product of complete posets is complete, and, more importantly, if W is complete, then the function spaces of all arbitrary, monotone or Scott-continuous mappings π : D → W are also complete, with the pointwise partial ordering

    π ≤ ρ ⇔ for all x, π(x) ≤ ρ(x).

Here π : D → W is monotone if

    x ≤_D y ⇒ π(x) ≤_W π(y),

and it is Scott-continuous if, in addition, for every chain C in D,

    π(supremum(C)) = supremum(π[C]).

The Least-Fixed-Point Theorem. If (D, ≤) is a complete poset and π : D → D is monotone, then the recursive equation

    x = π(x)    (9)

has a least solution.

FIGURE 4 Flat poset.
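A minimal Python sketch of how such a least solution can be computed in a finite setting: start at the least element and apply the monotone mapping repeatedly until nothing changes. The poset here is the set of all subsets of {0, ..., 4} under ⊆, and the names least_fixed_point, reach_step, and EDGES are hypothetical, introduced only for this illustration; the example graph is likewise an assumption.

```python
def least_fixed_point(pi, bottom, max_iter=10_000):
    """Iterate x := pi(x) from the least element until x == pi(x).

    On a finite poset with pi monotone the iterates form an increasing
    chain, so the loop terminates at the least fixed point.
    """
    x = bottom
    for _ in range(max_iter):
        nxt = pi(x)
        if nxt == x:
            return x
        x = nxt
    raise RuntimeError("no fixed point reached within max_iter steps")


# A small directed graph, given by successor sets (illustrative data).
EDGES = {0: {1}, 1: {2}, 2: {0}, 3: {4}, 4: set()}


def reach_step(s):
    """Monotone map on subsets: add node 0 and all successors of s."""
    return frozenset({0}) | s | frozenset(v for u in s for v in EDGES[u])


# The least solution of s = reach_step(s) is the set of nodes
# reachable from node 0.
reachable = least_fixed_point(reach_step, frozenset())
```

The equation s = reach_step(s) also has larger solutions (the full node set, for instance); the iteration from the empty set singles out the least one, which is the intended meaning of the recursive definition of reachability.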

The theorem is proved by setting recursively

    x_0 = ⊥, x_{n+1} = π(x_n).    (10)

In the simplest case, which is sufficient for the applications to programming languages, the mapping π is Scott continuous, and then

    x̄ = supremum{x_0, x_1, . . .}

is the least fixed point of π. For the full result we need to extend the iteration (10) into the “transfinite,” using recursion on ordinal numbers.

There is a rich theory of complete posets and various kinds of mappings on them, mostly motivated by the applications to programming, but also by earlier work in abstract recursion, the generalization of computability to abstract structures.

B. Programming Languages

From the mathematical point of view, a programming language P is very much like a logic, with a syntax, a semantics, and an implementation, which plays the role of an inference system.

The syntax is generally much more complex than that of logics, with many different categories of grammatically correct expressions. There are variables of various kinds, some of them for functions of specified types; constants which are meant to denote acts of interaction with the environment (input, output, interrupts); and various ways of combining grammatically correct expressions to produce new ones, using programming constructs like composition, “while loops,” functional abstraction, and recursion. Some closed expressions (with no free variables) corresponding to the “sentences” of a logic are singled out, typically called programs. With all this complexity, the “grammar” is still specified by an induction, as it is for logics, so that it is again possible to prove properties of correct expressions and to define operations on them by structural induction.

In the denotational semantics introduced by Dana Scott, a programming language P is interpreted in a structure (D, —) whose universe D is a complete poset, the domain. The points of D may include concrete data (words from some finite alphabet), but also functions of various sorts and complex mathematical structures which model computations, interactions, etc. For each correct expression A and each assignment α to the variables, the denotation [[A]](α) is a point in D, determined by a structural induction of the following general form: first a (Scott-continuous) recursive equation (9) is constructed from α and the denotations of the parts of A, and then we take

    [[A]](α) = the least fixed point of [x = π(x)].

The use of recursive equations is absolutely essential here, to interpret the iteration and recursive constructs which are at the heart of programming languages.

The implementation is a function which assigns to each program A a “machine” M_A (or, more concretely, code in the machine language of some processor) which computes the denotation [[A]] of A. In the simplest case, [[A]] might just be a sequence of external acts, like “printing” some file or drawing some picture on a monitor; more often [[A]] is a function relating input to output, or a “strategy” in some game, by which the machine responds to a sequence of external stimuli. As with inference systems, implementations come in a great variety of shapes and forms (compilers and interpreters, to name two), but they must have the basic soundness property, that M_A “computes” [[A]] in a well-understood way which relates the abstract (mathematical) denotations of programs to the behavior of machines.

Even with this grossly oversimplified description, it should be clear that the basic methodology of logic (the clean distinction between syntax, semantics, and inference) has had an immense influence on the development of programming languages; and that the fundamental, related notions of symbolic computation and recursion introduced by logicians in the 1930s are essential to the understanding of programming languages.

In the other direction, the study of programming languages, spurred by the need for applications, has introduced a host of interesting problems in logic, chief among them the question of logic of programs: What are the natural formal languages and inference systems in which the fundamental properties of programs can be expressed and rigorously proved? Much work has been done on this, but it is fair to say that the question is still open, and a formidable challenge to logicians and computer scientists.

VI. ALTERNATIVE LOGICS

From the many alternative logics which are obtained by changing the syntax, semantics or inference system of First-Order Logic, we consider, very briefly, just two.

A. Modal and Temporal Logic

Modal Logic goes back to Aristotle, the traditional founder of logic, who took necessity as one of the basic linguistic constructs worthy of logical study. The modern syntax is obtained by adding to FOL the propositional box operator □, so that with each formula A we have the formula □A (with the same free variables), read necessarily A. The possibility operator is defined by the abbreviation ◇A ≡ ¬□¬A.

Modal formulas are interpreted in Kripke structures

    M = (W, s0, {M_s | s ∈ W}, R),

of a specified vocabulary σ, where W is some set of possible worlds; s0 is a specified “actual world”; each M_s is a σ-structure associated with the world s; and R(s, t) is an accessibility relation on the worlds, intuitively standing for “t is a possible alternative to s.” There are no fixed, general assumptions about the accessibility relation or the interpretations of the given relations on the various worlds; it could be, for example, that “Mary is John’s wife” in the actual world s0, but in alternative possible worlds John’s wife might be Ellen, John may not have a wife, or he may not even exist. Assignments associate objects in the possible worlds to individual variables, and the basic semantic relation M_s, α ⊨ A is defined by the Tarski conditions (for structures) with the additional clause

    M_s, α ⊨ □A ⇔ for all t, if R(s, t), then M_t, α ⊨ A.

For example, if R is transitive,

    R(s, t) & R(t, t′) ⟹ R(s, t′),

then the formula

    □A → □□A    (11)

is satisfied by all assignments, in all possible worlds, while it may fail for some A in nontransitive structures. Finally,

    M, α ⊨ A ⇔ M_{s0}, α ⊨ A.

Different conceptions of “necessity” can be modeled by placing appropriate restrictions on the accessibility relation, for example that it be transitive, linear, etc., and there is a question of constructing a suitable inference system and proving the appropriate Completeness Theorem in each case. A great deal of interesting work has been done in this area, much of it motivated by puzzles in the philosophy of language.

If we take W = N for the set of possible worlds, with s0 = 0 and R(s, t) ⇔ s ≤ t, and if we read □A as “from now on A,” we get one version of Temporal Logic, very useful for applications to computing systems. The worlds are interpreted by the states of some finite state machine, the propositional variables stand for properties of states, and the propositional formulas (which suffice) can express interesting properties of the system, especially if we augment the language with some additional, natural primitives like Next with the truth condition

    M_s ⊨ Next A ⇔ M_{s+1} ⊨ A.

For example, □(p → Next q) says that “every state which has property p is followed immediately by one which has property q,” and ◇□p says that “p will eventually become and remain true,” both interesting properties of finite state machines. This temporal logic is decidable, and so are various extensions of it, in which essentially all interesting liveness and fairness properties of finite state machines can be expressed, so that one can mechanically verify the “correct behavior” of finite state machines. The relevant algorithms are practical, if not simple, they are used commercially, and they provide a spectacular example of the emerging field of applied logic.

B. Intuitionistic Logic

First-Order Intuitionistic Logic FOL_I has the same syntax as FOL, and almost the same inference system: we simply replace the Double Negation Law ¬¬A → A, Eq. (8) in Section I.E, by the weaker

    (8)_I  ¬A → (A → B).

Kripke has established a Completeness Theorem for FOL_I using a variation of his semantics for Modal Logic, and this is useful for obtaining unprovability results for FOL_I. The language, however, is meant to be understood constructively, and so it is not really possible to explain its semantics fully within classical mathematics. Aside from philosophical concerns, the real interest of Intuitionistic Logic comes from the proof theory of FOL_I, which, somewhat surprisingly, also has important applications to computer science. Some sample results:

(1) For any two sentences A and B,

    ⊢_I A ∨ B ⟹ ⊢_I A or ⊢_I B,

and hence ⊬_I p ∨ ¬p.

(2) In Heyting Arithmetic, i.e., the axiom system PA of Section III.A with Intuitionistic Logic, for any sentence (∃x)A,

    PA ⊢_I (∃x)A ⇒ for some n, PA ⊢_I A{x :≡ n}.

(3) If PA ⊢_I (∀x)(∃y)A and (∀x)(∃y)A is a sentence (no free variables), then there is a computable function f, such that for all n,

    PA ⊢_I A{x :≡ n, y :≡ f(n)}.

This last result is obtained with Kleene’s Realizability Theory, and it illustrates the following general principle: from a constructive proof of (∀x)(∃y)R(x, y), we can extract an algorithm which computes for each x, some y such that R(x, y). There are obvious applications of this idea in computer science, and much of the current research in intuitionism is motivated by it.
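To make the extraction principle concrete, here is a minimal Python sketch for one familiar statement: for every n there is a prime p > n. Euclid's proof is constructive (any prime factor of n! + 1 exceeds n), so it can be read directly as an algorithm. This is only an informal illustration of the idea, not Kleene's realizability construction itself, and the function names smallest_prime_factor, prime_above, and is_prime are hypothetical.

```python
from math import factorial


def smallest_prime_factor(m):
    """Return the least prime divisor of m, for m >= 2, by trial division."""
    d = 2
    while d * d <= m:
        if m % d == 0:
            return d
        d += 1
    return m


def prime_above(n):
    """The witness function read off from Euclid's argument.

    Every prime q <= n divides n!, so q cannot divide n! + 1; hence the
    least prime factor of n! + 1 is a prime greater than n.
    """
    return smallest_prime_factor(factorial(n) + 1)


def is_prime(p):
    """Check primality using the same trial-division helper."""
    return p >= 2 and smallest_prime_factor(p) == p


witnesses = {n: prime_above(n) for n in range(8)}
```

The witness produced this way need not be the least prime above n (and the computation is far from efficient); the point is only that the proof itself determines a computable f with R(n, f(n)) for all n.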

VII. SET THEORY

Sets are collections into a whole of definite and separate objects of our intuition or thought, according to Georg Cantor, who initiated their mathematical study in the mid 1870s. Thus the basic relation of the theory is membership (∈),

    x ∈ A ⇔ x is a member of A,

and a set is completely determined by its members,

    A = B ⇔ (for all x)[x ∈ A ⇔ x ∈ B].

Finite sets can be simply enumerated, e.g., A = {0, 5, 7}. Infinite sets are usually specified by means of some condition P(x) which characterizes their members, and we write

    A = {x | P(x)}

to indicate that A “is the set of all objects which satisfy P(x).”

Cantor was led to the study of arbitrary, abstract sets in his effort to understand the structure of some specific sets of real numbers or pointsets, and the theory which he created still exhibits today these two related but separate concerns. The theory of pointsets or descriptive set theory is primarily a theory of definability on the real numbers, and it is characterized by its applications to other fields of mathematics, especially analysis. Abstract set theory is primarily a theory of counting, an extension of combinatorics to the transfinite. The best set-theoretic results are about the interaction between these two poles of the subject.

At about the same time as Cantor’s original contributions, Gottlob Frege initiated an effort to create a foundation of mathematics on the basis of set theory. Frege’s approach was different (he took “function” rather than “set” as his primitive notion) and his original program was overly ambitious and failed. He had the right basic idea, however, that all objects of classical mathematics can be “defined within set theory,” so that their properties can be (ultimately) derived from properties of sets. It took some time for this to take hold, but it is fair to say that since the 1930s, set theory has been the official language of mathematics, just as mathematics is the official language of science. This richness of the field makes it fertile ground for logical investigations, and it is not an accident that logicians have been involved with set theory from the beginning.

A. Cardinal Arithmetic

There are exactly as many left shoes in a (normal) shoe store as there are right shoes; we can be sure of this without counting, because of the obvious one-to-one correspondence between left and right shoes. The principle here is that equivalent sets have the same number of members,

    |A| = |B| ⇔ A ∼_c B,    (12)

where A ∼_c B indicates that some one-to-one correspondence exists between the members of A and the members of B, and

    |X| = the number of objects in the set X.

This is a basic tool in mathematics: we count a set A by establishing a one-to-one correspondence between its members and the members of some already-counted set B. Moreover, if we set

    A ≤_c B ⇔ for some subset C ⊆ B, A ∼_c C,

then, obviously,

    |A| ≤ |B| ⇔ A ≤_c B,    (13)

and we can often prove indirectly that there are objects in B which are not in A by showing (using arithmetic) that |A| < |B|, so that B ⊆ A is impossible.

Cantor proposed to associate a cardinal number |X| with every (finite or infinite) set X, so that Eqs. (12) and (13) hold, and then to use similar counting and (infinite) cardinal arithmetic techniques in the study of arbitrary sets. One might expect problems, because a finite set cannot be equivalent with one of its proper subsets (by the so-called Pigeonhole Principle), while

    N = {0, 1, 2, . . .} ∼_c {0, 2, 4, . . .}    (14)

via the correspondence f(n) = 2n. Cantor showed that, despite this “paradox,” his cardinal arithmetic is a powerful tool with important applications in almost all areas of mathematics.

Cantor’s first fundamental discovery was that there are (at least) two infinite sizes of sets: if

    ℵ0 = |N|, 𝔠 = |R| = |the real numbers|

are the cardinal numbers of the two most basic sets in mathematics, then

    ℵ0 < 𝔠.    (15)

A set A is countable if |A| ≤ ℵ0, otherwise it is, like R, uncountable.

To define the arithmetical operations on (possibly infinite) cardinal numbers, choose sets K, L with no members in common so that κ = |K|, λ = |L|, and set

    κ + λ = |K ∪ L|,
    κ · λ = |K × L|,
    κ^λ = |(L → K)|.

Here the union K ∪ L is the set of all objects which belong to either K or L; the Cartesian product K × L is the set of all ordered pairs (x, y) with x ∈ K and y ∈ L; and the function space (L → K) is the set of all functions f : L → K. If κ and λ are finite, we get the usual sum, product and exponential, noting, in particular, that there are κ^λ functions from a set of size λ to one of size κ. Moreover, all the familiar arithmetical identities hold, e.g., addition and multiplication are associative and commutative, multiplication distributes over addition, κ^0 = 1, and

    κ^(λ+µ) = κ^λ · κ^µ, (κ^λ)^µ = κ^(λ·µ).

As examples of “proofs by counting,” Cantor showed first that

    ℵ0 + ℵ0 = ℵ0 · ℵ0 = ℵ0    (16)

(basically because of Eq. (14)), and

    𝔠 = 2^ℵ0.

Both of these facts are easy, but they support the computation

    𝔠^2 = 2^ℵ0 · 2^ℵ0 = 2^(ℵ0+ℵ0) = 2^ℵ0 = 𝔠,

which means that there is a one-to-one correspondence between the line and the plane, and, hence, between the line and real n-space, for every n. This was new, it was surprising, and it was proved by “plain arithmetic.” Eventually it motivated the development of dimension theory, whose basic result is that there is no continuous, one-to-one correspondence of real n-space with real m-space unless n = m. Moreover, the set of rational numbers is countable, and so is the set of algebraic numbers, the solutions of polynomial equations

    a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n = 0

with integer coefficients. Thus, since R is uncountable, “by simple counting” there exist transcendental (not algebraic) real numbers, a famous result of Liouville’s whose original proof had rested on delicate convergence arguments for infinite series. It was a “killer application” which made set theory instantly known (and somewhat notorious) in the mathematical community.

Cardinal addition and multiplication satisfy the following absorption laws which basically trivialize them in the infinite case:

    if 0 < κ ≤ λ and λ is infinite, then κ + λ = κ · λ = λ.

For exponentiation, however, Cantor extended Eq. (15) to the general inequality

    κ < 2^κ,

which provides infinitely many distinct “orders of infinity,” perhaps what people meant when they referred to Cantor’s Paradise. Exponentiation is the source of the deepest questions about infinite sets, chief among them Cantor’s Generalized Continuum Hypothesis (GCH), the claim that for all infinite κ,

    (GCH) 2^κ = κ^+ = the least cardinal number > κ.

The “ordinary” case (CH) 2^ℵ0 = ℵ0^+ was No. 1 in Hilbert’s list, it dominated set-theoretic research in the 20th century and, in a sense, it is still open today.

In addition to the cardinal numbers, which count the members of a set one, two, three, . . . in the finite case, Cantor also introduced infinite versions of the ordinal numbers first, second, third, . . . which assign position in a sequence. These are associated with “transfinite sequences,” i.e., well-ordered structures (A, ≤), where ≤ is a linear ordering on A (so that x ≤ y or y ≤ x, for all x, y in A) and every non-empty subset of A has a least element. Every ordinal number α has a successor α + 1 which defines “the next position,” and every set of ordinal numbers A has a least upper bound sup A. The least infinite ordinal number ω defines the first position with infinitely many predecessors, and it is a limit ordinal, without an immediate predecessor.

    0, 1, 2, . . . ω, ω + 1, ω + 2, . . . ω · 2, ω · 2 + 1, . . .

Ordinal arithmetic has fewer direct applications than the arithmetic of cardinal numbers, but well-ordered structures and ordinal numbers are the fundamental tools in the study of transfinite iteration, which is rich in applications. In a typical case, a function f : A → B is defined by recursion on some well ordered structure (A, ≤), and then the crucial properties of f are established by induction along ≤. Moreover, the exact specification of the relation ≤ is often unimportant: all that matters is that some relation well orders A, in other words that A be well orderable.

    (WOP) Is every set well orderable?

Specifically, is the set R of real numbers (where many of the applications lie) well orderable? The natural ordering of R won’t do, since (for example) R has no least element, and it is hard to imagine how one could arrange all the real numbers into a transfinite sequence, with each point followed by its successor and every nonempty subset having a least element. The Continuum Problem (whether CH is true or not) and this Well-Ordering Problem were the central open problems in set theory at the turn of the 20th century.

B. The Paradoxes

Cantor developed his theory on the basis of the following General Comprehension Principle which flows naturally from his “definition” of sets quoted in the beginning of

this section: every definite (unambiguous) property P(x) of mathematical objects has an extension, the set A = {x | P(x)} which “collects into a whole” all the objects which satisfy P(x), so that

    x ∈ A ⇔ P(x).    (17)

But this is not generally true: because if

    R = {x | x is a set and x ∉ x},

then, from Eq. (17),

    R ∈ R ⇔ R is a set and R ∉ R ⇔ R ∉ R

which is absurd. The argument was discovered in 1902 by Bertrand Russell, and it was not the first contradiction in set theory. However, earlier “paradoxes” (some of them known to Cantor) were technical, not unlike the paradoxes with infinitesimals which had been commonplace in calculus some years earlier, and it was thought that they would go away in a careful development of the subject. The Russell Paradox is not technical, it goes to the heart of the nature of sets, and it threw the mathematical community into a spin.

L. E. J. Brouwer initiated the intuitionistic program which denies that abstract sets are meaningful objects of study and also rejects some of the basic principles of logic. Mathematical objects cannot be said to “exist” in any sense independent of (mental) “mathematical activity”; and to prove that some x has property P, one must construct some specific object x which has property P. It is not enough to derive a contradiction from the assumption that no x has property P. Intuitionism had a strong influence in the philosophy of mathematics and remains a vibrant field of study within logic, but it never found much favor with mathematicians: too much of classical mathematics must be thrown out to satisfy its tenets.

Hilbert proposed to “save” classical mathematics from the paradoxes and Brouwer’s attack by formalizing as large a part of it as possible in some first-order, axiomatic theory T, and then establishing the consistency of T by absolutely safe, finitistic methods. Formalism is the reading of Hilbert’s Program as a philosophical view: it alleges that once T is chosen, then T is all there is; there is nothing more to mathematics than the study of the inference relation T ⊢ A, with no reference to meaning. Aside from the impact of Gödel’s Second Incompleteness Theorem (Section IV.E) which weakens it, formalism also fails to account for the applications of mathematics: it is hard to see how the existence or not of certain patterns of meaningless symbols can have any bearing on the escape velocity of a rocket.

For those reluctant to abandon the traditional, realist view that mathematical objects are, well, real, no matter how abstract and difficult to pin down, Russell first proposed to replace set theory by his famous theory of types: it is claimed (roughly, and in the later simple version due to Ramsey), that every mathematical object is of a certain (natural number) type n, and that every set A is of some successor type n + 1, such that the members of A are of the immediately preceding type n. Type theory is awkward to apply and it yields only a poor shadow of Cantor’s set theory, albeit without the paradoxes. It never gained favor as a true alternative to set theory, although it has been studied extensively as a logical system, it has found its own applications (especially recently, to programming languages), and many of its fundamental ideas were eventually incorporated in the reformulation of set theory which eventually prevailed.

What has prevailed is Axiomatic Set Theory, first proposed in 1904 by Zermelo as a pragmatic way to avoid the paradoxes by rebuilding Cantor’s set theory on the basis of a few set-theoretic principles which are basic, simple, and well understood by their uses in classical mathematics. Formalists can accept it, as nothing more than the choice of a specific set of axioms, whose “truth” is irrelevant, if, at all, meaningful. But it is the realists who, in the end, have received the greatest comfort from axiomatic set theory: because the systematic development of consequences of the axioms eventually led to a narrower, more concrete concept of set, which ultimately justified the axioms.

Much of modern logic was created in response to the challenge of the set-theoretic paradoxes, and that is another reason why the discipline is so intimately tied with set theory.

C. Zermelo-Fraenkel Set Theory

There are eight axioms in ZFC (Zermelo-Fraenkel Set Theory with Choice), and it is assumed that they are interpreted over some given domain of sets V, which comes endowed with a binary membership condition, x ∈ y. The formal theory ZFC is obtained by expressing these axioms by sentences of FOL(∈), and it requires infinitely many sentences, because the Replacement Axiom 5 requires an axiom scheme. Here we will describe them briefly and informally, with a few interspersed comments.

1. Extensionality. Two sets are equal exactly when they have the same members.
2. Empty set and Pairing. There is a set ∅ with no members, and for any two sets a, b, there is a set {a, b} whose members are exactly a and b.
3. Unionset. For each set A, there is a set ∪A whose members are the members of the members of A,
    t ∈ ∪A ⇔ (∃x)[x ∈ A & t ∈ x].
4. Powerset. For each set A, there is a set P(A) whose members are all the subsets of A.

An operation F : V → V is definite if it is first-order definable with parameters, i.e.,

F(x) = G(x, a1, …, ak)

where G(x, y1, …, yk) is first-order definable (Section II.C).

5. Replacement. The image
F[A] = {F(x) | x ∈ A}
of a set A by a definite operation F is a set. (This was formulated in the 1930s, primarily by Skolem, and it is much stronger than Zermelo’s original Separation Axiom.)

For the next two axioms, we need the notion of function f : A → B from one set to another, which is not among our primitives, and so we need to “reduce” the notion of function to that of set. The trick is well known: first we fix some ordered pair operation (x, y) which satisfies the key property

(x, y) = (x′, y′) ⇔ x = x′ & y = y′, (18)

and then we model a function f by its graph,

G_f = {(x, y) ∈ A × B | y = f(x)},

which is just a set with some special properties. It is common to use the so-called Kuratowski pair operation

(x, y) = {{x}, {x, y}},

but there are many others, and all that is needed is some operation which satisfies Eq. (18).

6. Infinity. There is a set I and a one-to-one function f : I → I which is not onto I, f[I] ⊊ I.

Next comes Zermelo’s chief contribution:

[Figure: diagram of a relation R ⊆ A × B with a choice function f]

7. Axiom of Choice (AC). For each binary relation R ⊆ A × B,
(∀x ∈ A)(∃y ∈ B)(x, y) ∈ R ⇒ (∃f : A → B)(∀x ∈ A)(x, f(x)) ∈ R.

In effect, AC postulates a function f which makes a choice f(x) from the nonempty set {y | (x, y) ∈ R}, “simultaneously,” for each x ∈ A. If B carries a well ordering ≤, we could take

f(x) = the ≤-least y such that (x, y) ∈ R;

Zermelo showed that, conversely, AC implies that every set is well orderable, and identified numerous examples where the seemingly controversial AC is routinely used in mathematics. Somewhat later Hartogs showed that AC is also equivalent with the cardinal comparability property

(∀A, B)[A ≤c B ∨ B ≤c A]

without which there is no cardinal arithmetic, and this limited further opposition to AC to those who were willing to abandon completely Cantor’s Paradise.

The last axiom of ZFC involves the cumulative hierarchy of sets, which is defined by recursion on the ordinal numbers as follows:

V_0 = ∅,
V_{α+1} = P(V_α),
V_λ = ⋃_{α<λ} V_α (if λ is limit).

8. Foundation. Every set is a member of some V_α.

This is a limiting axiom, not needed for the development of Cantor’s set theory or its applications, but it is important because it codifies within the axiomatic theory a conception of set which replaced in the 1930s Cantor’s free-wheeling notion of a “collection into a whole”: each set is reached starting with “nothing” (the emptyset ∅), by “indefinite” (never ending) “iteration” of the powerset operation. Admittedly more complex than Cantor’s, this notion of grounded set prohibits the circular constructions which lead to the paradoxes, and it can be described intuitively in sufficiently clear terms to justify the axioms.

To see how classical mathematics can be developed on the basis of these seven axioms, consider first arithmetic. A number system is a triple (N, 0, S) such that N is a set, 0 ∈ N, S : N → N is a one-to-one function which is never 0, and

[X ⊆ N & 0 ∈ X & S[X] ⊆ X] ⇒ X = N;

we prove that there exists a number system and that every two number systems are isomorphic, and then we choose some specific number system and call its members the natural numbers. The real numbers are identified with some complete ordered field, once we prove that one such exists and any two are order isomorphic, and so forth for other structures. This process of “defining” (more accurately: modeling faithfully) mathematical structures in set theory has found such widespread acceptance in mathematics, that “to make a notion precise” is now viewed as synonymous with “defining it in set theory.”
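The modeling of pairs and of the cumulative hierarchy described above can be tried out on finite sets. The following is a minimal Python sketch, not part of the original article: it encodes the Kuratowski pair with frozensets and builds the first few finite levels V_n by iterating the powerset operation. The function names kpair, powerset, and V are illustrative assumptions, and only finite levels are reachable this way.

```python
from itertools import chain, combinations

def kpair(x, y):
    """Kuratowski pair (x, y) = {{x}, {x, y}}, built from frozensets."""
    return frozenset({frozenset({x}), frozenset({x, y})})

def powerset(s):
    """All subsets of a finite set s, returned as a frozenset of frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return frozenset(frozenset(c) for c in subsets)

def V(n):
    """Finite level V_n of the hierarchy: V_0 is empty, V_{n+1} = P(V_n)."""
    level = frozenset()
    for _ in range(n):
        level = powerset(level)
    return level
```

On small inputs one can check that kpair satisfies the key property of Eq. (18), and that each finite level is contained in the next.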

D. Independence Results

It is, perhaps, ironic, that the axiomatization of set theory made it possible to formulate and prove its own limitations. Let ZF be the theory with axioms 1–6, i.e., without the Axiom of Choice:

Theorem. If ZF is consistent, then so are the theories ZFC+GCH (Gödel, 1938) and ZFC+¬CH (Paul Cohen, 1963).

In effect, ZFC can neither disprove nor prove the Continuum Hypothesis, unless a contradiction can be obtained from its “constructive” core. In addition, Cohen showed that ZF cannot prove the Axiom of Choice, and several additional consistency and independence results.

Gödel’s proof uses an inner model, a sub-collection of our intended universe of sets V: using only axioms 1–6, he defines a certain collection L of constructible sets and shows that if we reinterpret “set” to mean “member of L,” then all the axioms of ZFC as well as GCH are true. Cohen’s forcing method builds “virtual universes” which are “larger” than V, and so he must describe them indirectly. This can be done with Boolean-valued models: a collection M ⊂ V and a binary condition E on M are defined, and then it is shown that, for a certain (complete) Boolean algebra B, the Boolean semantics of the “structure” (M, E) assign 1 to all the theorems of ZFC but something other than 1 to CH.

In both of these proofs, logic plays an essential role which goes much beyond providing the context in which their claims can be made precise. For example, the constructible universe L is defined by iterating the operation of taking all first-order definable subsets rather than P(A) in the cumulative hierarchy of sets, and then a strong version of the Skolem-Löwenheim Theorem is used at a crucial point to show that GCH holds in L. Through the work, initially, of Robert Solovay for forcing and Ronald Jensen for constructibility, these theories have been much generalized and continue to be very active research areas of logic, with important applications to analysis, algebra and topology.

E. Current Research in Set Theory

In one direction, set theory is more involved now with applications than ever before. Especially fruitful has been the development in the 1960s of effective descriptive set theory, which incorporates methods from recursion theory into the study of definability on the continuum to yield very substantial applications to analysis.

Beyond the applications, set theory has attempted to confront the fundamental problem posed by the independence results: what does one do with the Continuum Problem, now that we know that it cannot be settled in ZFC? Some have adopted a formalist view, that it is meaningless to ask “whether CH is true or not,” and that “set theory is the study of all models of ZFC.” This is a very active area of research.

In another direction, people have looked for new axioms, extending ZFC, which might provide the needed answers, and a great deal of research has been done in this direction since the 1960s. Generally speaking, two kinds of axioms have been considered. Large cardinal axioms are plausible generalizations of the Axiom of Infinity, which, however, have very few direct consequences for the continuum. Determinacy hypotheses postulate that certain (fairly simple) infinite games on the natural numbers are determined; somewhat technical and not especially plausible, these axioms answer most definability questions about the real numbers that are independent of ZFC, although, unfortunately, they cannot settle the Continuum Problem. In a fundamental advance made in the 1980s, Donald A. Martin, John Steel, and Hugh Woodin showed that the plausible large cardinal axioms imply the fruitful determinacy hypotheses, and so a “unified,” very strong extension of ZFC has been created which is the subject of much current research. Unfortunately, it does not solve the Continuum Problem, and so the search goes on.

It may well be that set theory will continue to be dominated in the 21st century by the search for an answer to the Continuum Problem, as it certainly was during the century just ended.

SEE ALSO THE FOLLOWING ARTICLES

BOOLEAN ALGEBRA • COMPUTER ALGORITHMS • DATABASES • FUZZY SETS, FUZZY LOGIC, AND FUZZY SYSTEMS • SET THEORY

BIBLIOGRAPHY

Abramsky, S., Maibaum, T. S. E., and Gabbay, D. M., eds. (1993). “Handbook of Logic in Computer Science,” Clarendon Press, Oxford.
Buss, S. R., ed. (1998). Handbook of Proof Theory. In “Studies in Logic and the Foundations of Mathematics,” Vol. 137, Elsevier, Amsterdam/New York.
Hodges, W. (1993). Model Theory. In “Encyclopedia of Mathematics and Its Applications,” Vol. 42, Cambridge Univ. Press, Cambridge, U.K.
Rogers, Jr., H. J. (1967). “Theory of Recursive Functions and Effective Computability,” McGraw-Hill, New York.
Kunen, K. (1998). Set Theory. In “Studies in Logic and the Foundations of Mathematics,” Vol. 102, Elsevier, Amsterdam/New York.
Moschovakis, Y. N. (1980). Descriptive Set Theory. In “Studies in Logic and Foundations of Mathematics,” Vol. 100, North Holland, Amsterdam.
P1: GSS Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN009H-411 July 6, 2001 19:51

Mathematical Modeling
Xavier J. R. Avula
University of Missouri, Rolla

I. Introduction
II. Mathematical Modeling
III. Classification of Mathematical Models
IV. Formulation of Mathematical Models
V. Solution Techniques
VI. Model Validation
VII. Chaos and Complexity
VIII. Modeling with Neural Networks
IX. Mathematical Modeling and Computers
X. Concluding Remarks

GLOSSARY

Chaos Irregular and unpredictable system behavior that exhibits sensitive dependence on initial conditions in deterministic physical systems modeled by nonlinear equations.
Complexity Collective behavior of a system containing many interacting subsystems that cannot be explained in terms of the observable entities.
Mathematical modeling Entire process of representing real-world phenomena in terms of mathematical equations and achieving their validity in an iterative manner.
Model Idealized representation of a real-world phenomenon. In the abstract sense, it is a set of instructions for generating behavioral data.
Modeling Act of relating real systems to models.
Artificial Neural Networks Networks of artificial neurons that are interconnected in different patterns to process information in a parallel distributed fashion much like the human brain.
Parameter identification Experimental determination of values of parameters that govern the system behavior.
Simulation Process of relating computers to models generating behavioral data.
System Set of elements plus a collection of rules governing the interrelationships between the elements and the behavior of elements over time.
System identification Determination of a system from its input and output time history.

MATHEMATICAL MODELING is a comprehensive process of representing real-world phenomena in terms of mathematical equations and extracting from them useful information for understanding and prediction. In recent


years, mathematical modeling has become a powerful tool to solve complex, interconnected, and interacting phenomena arising from the rapid developments taking place in science and technology. The success in physical sciences in terms of valid mathematical models has led scientists to extend the modeling methodology to other emerging fields of inquiry in which great strides have been made. The explosive growth of mathematical modeling activity has been a driving force behind the development of high-speed digital computers, which in turn aided model solutions in a symbiotic relationship.

I. INTRODUCTION

Modern civilization has its roots in human endeavor to understand the physical universe. The effort to systematically understand the universe and the various phenomena in it appears as a never-ending struggle imbued with frustration and romance that sparked the imagination of humankind in the course of history. The birth of the scientific method and the ensuing pursuits have led men and women of learning to the systematic study of phenomena and produced a vast edifice of knowledge in sciences and mathematics. Galileo (1564–1642), the famous Italian astronomer and physicist, firmly enunciated that the language of science is mathematics. The German philosopher Immanuel Kant (1724–1804), in the preface to his book “Metaphysical Foundations of Natural Science,” declared “that each particular discipline contains only as much science as it contains mathematics.” No more words need to be wasted to say that science and mathematics are intertwined. If science and mathematics are intertwined, can technology, the child of science, exist without mathematics?

In striving to understand phenomena, one must not forget the Aristotelian adage, “The primary question is not what do we know but how do we know it.” What we know and its reliability should be the consequence of the process that answers the question how do we know it.

The famous mathematician Hilbert, after a sojourn in the chemistry department at the University of Göttingen, said that chemistry is too difficult for chemists. The American mathematician Richard Bellman echoed that medicine is too difficult for physicians, politics is too difficult for politicians, and economics is too difficult for economists. So is engineering for engineers and physics for physicists. Then how can they be enlightened? The enlightenment comes from exploring the behavior of mathematical models of the processes they encounter in their respective endeavors.

The physical universe is an abundant source of wonder. Benevolently exploited, it is a life-enhancing system for all humankind. The regularity observed in the natural processes of the universe is best expressed and explained in mathematical terms. Mathematical description has provided the tools and motivation for numerous discoveries in cosmology and atomic phenomena, as well as in biology, material science, earth sciences and the social behavior of animal and human populations. In engineering and technology, mathematical concepts and analyses have contributed greatly to the understanding, design, and operation of complex systems. The cost and risk involved in testing a real physical or engineering system are usually prohibitive. The alternative is to develop a mathematical model of the system and investigate its performance. How many of us would “pay the price for reaching the sun and learning its shape, its size, and its substance?”

II. MATHEMATICAL MODELING

Derived from its Latin root modus, the word “model” is generally understood to stand for an object that represents a physical entity with a change of scale. For example, a model airplane is a scaled-down version of a “real” airplane by a few orders of magnitude. As any airplane model builder knows, the behavior of the model and the real airplane differ in more ways than one, in spite of their physical similarity, indicating a missing ingredient, something that has fallen out during the model building process. What then is a mathematical model? A mathematical model is a set of mathematical equations representing a process or a system. It is a mathematical idealization of a real-world phenomenon. In the sense of the model just defined, it represents a change on the scale of abstraction. Also, the way one sees the world depends upon the structure of one’s language; different languages give rise to different concepts, and to many “alternate realities.” In the process of idealization, some simplifications will have been made in obtaining the mathematical model. Therefore, the mathematical model is less real than the system it is supposed to represent; it is a mathematical representation of the modeler’s perception of some aspects of reality within the confines of a formal mathematical system. Nevertheless, it is an essential step in the construction of a theory. To quote Boltzmann, there is nothing as practical as a good theory.

In the process of mathematical modeling, the objective of the modeler is, in general, to construct a model of an observed phenomenon and use the model to predict its future course. In some cases, modeling is used to explain the known facts and lay a foundation for the theory behind the phenomenon. Thus, a model is a mathematical manifestation of a particular theory. Some modelers have an altruistic motive to apply the model characteristics to

interpret the behavior of phenomena outside their discipline, recognizing the underlying commonality of mathematical modeling. While this commonality is held in high esteem, some fundamental issues underlying the theory of modeling must be addressed. Some of the issues are:

How can a formal mathematical system represent a natural phenomenon?
Does the similarity of two natural systems imply that their models are similar?
How can multiple models of the same natural phenomenon be compared, and with what measure?
How can one identify key observable parameters that are necessary for the construction of a plausible model?
Under what conditions can the relationship between observables be accepted as a “Law of Nature”?
How can multiple systems that behave similarly be considered as models of the same?
How is a given system related to its subsystems?
What features characterize “good” models?

Attention to these issues will allow mathematical modeling to play an essential part in the formation of scientific theories of phenomena. Scientific literature abounds with the view that the process of mathematical modeling is not unique. In some cases, the mathematical modeling procedure begins with the desire to construct a model from a set of observations made on the behavior of a system. According to the British physicist Maxwell, “the success of any physical investigation depends upon the judicious selection of what is to be observed as of primary importance.” For many problems in science and technology, no wellsprings of observations exist to serve as a source in the modeling process. For example, phenomena such as the behavior of a spacecraft in the environment of an unknown planet, the response of a structure to an earthquake, and the long-term ecological effects of an industrial effluent injected into the biosphere have no groundwork of observations. In such situations, the model-building process originates in imagination modulated by intuition and experience. The schematic diagram of Fig. 1 depicts one of several paths in the process of mathematical modeling.

In Fig. 1 the step from A to B involves to a great extent the elements of intelligence and creativity of the modeler. In bringing out a well-defined problem from the maze of observations, the modeler shows the ability to put different entities into correspondence: in this case, a real situation and a formalized statement of the problem. Mathematical training alone will not give an edge to the modeler at this stage. Experience, intuition, and other nonmathematical skills play an important role.

FIGURE 1 Schematic of a mathematical modeling procedure.

Once the problem is defined, a mathematical model of the real-world phenomenon is to be constructed. The real-world phenomena are exceedingly complex. It is nearly impossible to construct a model that replicates the phenomenon in its totality. The modeler must sort out the essential and significant features that need to be incorporated in the model. Once again, experience and intuition come into play. The modeler translates the features into mathematical entities and relates them under certain simplifying but realistic assumptions and constraints. Now we have a mathematical model.

A word of caution is in order. While the application of well-known equations of physics such as Newton’s laws, Lagrange’s, Hamilton’s, and Schrödinger’s equations results in mathematical models of various physical processes in the form of differential equations, there are other prescriptions such as simulation and laboratory experimentation augmented by system identification techniques for formulating models of phenomena.

A myriad of mathematical solution techniques are available to extract the system behavior from the model. By using one or more of these techniques, the solution of the model is obtained and interpreted from the viewpoint of accuracy and stability. Determination of how closely the solution approaches the original, real-world phenomenon is the next step in the modeling process. If the solution meets the imposed limits of acceptability, the model is

considered valid and then put into practice to predict a future event, or to make a decision. Otherwise, the model is invalidated, and steps B–D are repeated by revising one or more ingredients in the process. Even a scrutiny of the solution technique (step E) is worth considering. Thus the process of mathematical modeling is iterative in nature.

There is no uniqueness in the model-building process. Although the steps in the modeling process are subjective, they are somewhat similar. Figures 2 and 3, extracted from different sources, essentially have the same features presented in Fig. 1, but indicate the role of computers in mathematical modeling rather explicitly.

In recent years, the growth in mathematical modeling has been so phenomenal that it is evolving into a discipline in its own right. The offering of courses on mathematical modeling in the colleges in the United States, Europe, Asia and Australia has been on a steady increase. The international conferences and symposia on mathematical modeling and the journals that deal exclusively with mathematical modeling support this viewpoint. The proliferation of mathematical models in the scientific literature can be attributed to the explosive growth in the application of mathematical and empirical knowledge to the problems of science and technology. Even social sciences and business, which are not traditionally subjected to mathematical treatment, are not spared from the productiveness of mathematical modeling. The catalysts in the growth of mathematical modeling have been

1. The advent of the computer age with rapid developments in computer technology, specifically, the developments in speed, memory, and software
2. The developments in numerical techniques and analysis
3. The developments in systems theory and simulation
4. The progress in empirical knowledge resulting in a greater understanding of various physical processes

The role of experience and intuition in the process of mathematical modeling accords to it an air of art that makes the process subjective. However, the commonality of features in the process of mathematical modeling in all sciences and technology is so overwhelming that the subjectivity is cast as a minor essence. Let us now turn our attention to these common features that enhance the objective view of mathematical modeling.

III. CLASSIFICATION OF
MATHEMATICAL MODELS

Classification of models is important to the understanding of a system’s behavioral data and to interpreting
or communicating the data for prediction and decision
making. An understanding of the system behavior is a
contribution to science, and its use for prediction and de-
cision making constitutes technology. Mathematical mod-
els are broadly classified into dynamic and steady-state
models. Obviously, the element of time is present in the
dynamic models. The dynamic models are further clas-
sified into continuous-time and discrete-time models. In
a continuous-time model the time advances smoothly
through real numbers, while in the discrete-time model
the time advances in finite jumps, each jump representing
a multiple of a selected time unit.
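The distinction between continuous-time and discrete-time models can be made concrete with a small sketch. The example below is an illustration only and is not taken from the article: it assumes a hypothetical decay process, modeled once as the continuous-time equation dx/dt = -k*x (state available at any real time t) and once as the discrete-time recursion x[n+1] = (1 - k*h)*x[n], where time advances in jumps of a chosen unit h. The names continuous_state and discrete_states, and the parameters k and h, are assumptions made for the example.

```python
import math

def continuous_state(x0, k, t):
    """Continuous-time model dx/dt = -k*x: closed-form state at real time t."""
    return x0 * math.exp(-k * t)

def discrete_states(x0, k, h, steps):
    """Discrete-time model x[n+1] = (1 - k*h) * x[n].

    Returns the states at times 0, h, 2h, ..., steps*h.
    """
    xs = [x0]
    for _ in range(steps):
        xs.append((1.0 - k * h) * xs[-1])
    return xs

# Compare the two models at t = 1.0 (ten jumps of the time unit h = 0.1).
xs = discrete_states(1.0, 0.5, 0.1, 10)
gap = abs(xs[-1] - continuous_state(1.0, 0.5, 1.0))
```

With a small time unit the two models stay close to each other; the gap grows as h is made coarser, which is one reason the choice between the two classes matters in practice.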
Models constructed in terms of descriptive state variables are classified into continuous-state, discrete-state, and mixed-state models. In a continuous-state model the range sets of state variables are represented by real numbers, while in a discrete-state model the variables assume a discrete set of values. In a mixed-state model both kinds of variables are present.
FIGURE 2 Basic steps in the modeling process. [From United States General Accounting Office (1979). PAD70-17, Guidelines for Model Evaluation, Washington, DC.]

The broad class of continuous-time models is further subdivided into models in which the state changes continuously and those in which the state changes in discrete

jumps. The former subclass of models is represented by ordinary and/or partial differential equations, for example, for the boundary-layer flow over a wing of an aircraft. On the other hand, the production line in a factory is a discrete-state system.

FIGURE 3 Guide to model building. [From Jacoby, L. S., and Kowalik, J. S. (1980). “Mathematical Modeling with Computers,” Prentice-Hall, Englewood Cliffs, NJ.]

Yet another classification breaks down mathematical models into stochastic (or probabilistic) and deterministic models. Models with at least one random variable in their description are termed stochastic. Models other than stochastic are deterministic. In fact, deterministic is a special case of stochastic.

Another classification of models is based on how the real system interacts with the environment. If the real system is isolated from its environment, the model is called autonomous. If the environment exerts any kind of influence through so-called input variables from the environment without being controlled by the model, then the model is called nonautonomous.

When models are expressed in terms of mathematical equations, the solutions are profoundly affected by nonlinearity. Based on the type of equations, the models are classified as linear or nonlinear. These classifications can be combined with the above-mentioned model types to yield, for example, linear or nonlinear dynamic and linear or nonlinear stochastic models. Although the universe is predominantly nonlinear, some processes are linear within certain bounds. Actually, linearity can be viewed as a special case of nonlinearity. Sometimes, it is mathematically expedient to consider nonlinear processes as piece-wise linear. Systems are classified as linear if and only if they simultaneously satisfy the principle of homogeneity and the principle of superposition. The principle of homogeneity preserves the input function scale factor in transition from input to output. A system satisfies the superposition principle when the superposition of individual input functions results in a response that is the superposition of corresponding outputs. Simulations and exact solutions of linear systems of equations are well established and reported in mathematical literature.

All systems that are not linear belong to the class of nonlinear systems. As there are no general methods to solve and analyze models of nonlinear systems, simulation has become a commonly used tool for this purpose. Impacted by recent developments in numerical analysis and by advances in powerful, high-speed and high-memory digital and analog computers, simulations of nonlinear systems have blossomed into technical endeavors that are broad in scope and variety, encompassing biological, ecological, social, and economic systems in addition to those in hard scientific fields such as physics, chemistry, material science, and mechanics. Modern nonlinear equations frequently encountered in simulation include, for example, the quaternion rotational equations of motion that are treated with geometric algebra, which is of late applied to a range of problems in many fields of science and is being promoted as a unified mathematical language in the 21st century.

Traditional quantitative techniques of modeling cannot be effectively applied to model phenomena in the so-called “soft” disciplines consisting of humanistic systems such as social, economic, ecological, and biological systems because there may arise difficulties associated with multidimensionality, subsystem interactions, inexplicable feedback mechanisms, hierarchical structures, and unpredictable behavioral dynamics. Such difficulties may also be encountered in mechanistic systems that may include some physical and engineering systems. These difficulties lead to making imprecise statements, introducing fuzziness much like human reasoning, about the characteristics and behavior of a system. An important class of models

using fuzzy sets was launched about 35 years ago. The idea of fuzzy sets is centered on the imprecision in the belonging of elements to a set. Intuitively, a fuzzy set is a set of elements that are not precise and in which there is no distinct boundary between the elements that belong to the set and those that do not. In other words, the transition from full membership to nonmembership of an element (or elements) is blurred, so that a gradation of partial membership is possible. Mathematical concepts based on this idea have been successfully applied to modeling of systems in “soft” disciplines as well as in engineering when imprecision is present. Fuzzy systems models are categorized into two types: linguistic and rule-based. An elaborate discussion of these types of models and modeling techniques is beyond the scope of this article.

IV. FORMULATION OF MATHEMATICAL MODELS

A. Conventional Modeling (Direct Modeling)

Experimental observations and measurements are generally accepted to constitute the backbone of physical sciences and engineering because of the physical insight they offer to the scientist for formulating the theory. The concepts that are developed from the observations are used as guides for the design of new experiments, which in turn are used for validation of the theory. Thus, experiments and theory have a hand-in-hand relationship. The information gathered during observation and measurement is usually presented in terms of curves, tables, block diagrams, circuit diagrams, flow diagrams, etc. for convenience of perception in the model building process. These information display techniques have been tremendously aided by the advent of high-performance computers.

Formulation of theory is equivalent to model building. Cast in mathematical terms, the theory stands as a mathematical model of reality. In the conventional sense, the first stage in model building is to propose and gather equations representing all relevant mechanisms of the phenomenon under study. Because of the diversity in the types of information available to the model builder, the equations representing the basic phenomenon may be extensive with complex interrelationships that may be difficult to untangle for easy tractability. To preserve fidelity, all equations that will have significant effect on the system behavior must be included in the model. It is always better to first produce a simple model and then refine it until it closely represents the reality, than to start with a complex model and simplify it for fear of facing mathematical difficulties.

Conventionally, the construction of mathematical models for physical and engineering phenomena begins with the application of physical laws (Newton’s laws, Maxwell’s laws, Kirchhoff’s laws, and balance laws, which include mass balance, energy/heat balance, momentum balance, impulse balance, and entropy balance) to the phenomena being studied. In these laws, a number of relationships between the variables are expressed in terms of ordinary differential equations, partial differential equations, and difference equations. A detailed presentation of these laws is beyond the scope of this article. However, some examples are appropriate.

The first and second laws of thermodynamics in differential form are stated as

dU = dQ − dW    (1)
dS = dQ/T,    (2)

where U is the internal energy, Q is the heat absorbed by the system, W is the work done by the system, S is the entropy, and T is the temperature.

The fundamental law of heat conduction in one dimension is represented by the ordinary differential equation

dQ/dt = −kA (dT/dx),

where dQ/dt is the time rate of heat transfer across the area A, k is the thermal conductivity of the medium, and dT/dx is the temperature gradient.

Newton, in his famous “Principia,” expressed the second law of motion in terms of momentum, which in symbolic form becomes

F = dp/dt,    (3)

where p is the linear momentum (product of mass and velocity, mv). This equation can be written in the familiar form as

F = d(mv)/dt = m(dv/dt) = ma,    (4)

where the mass m is taken as constant and a = dv/dt is the acceleration.

Attempts by scientists to describe the physical world led them to the formulation of various types of partial differential equations. The mathematical model for the one-dimensional wave phenomena is the hyperbolic equation

∂²u/∂t² = c²(∂²u/∂x²).    (5)

The phenomenon of diffusion of heat in solids is modeled by the parabolic equation derived by Fourier in the form

∂u/∂t = σ(∂²u/∂x²).    (6)

The Fourier heat equation in three spatial dimensions and time, modeled as

∂u/∂t = σ(∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²),    (7)
yields for steady state (with ∂u/∂t = 0)

∇²u = ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = 0,    (8)

which is an elliptic equation, also called Laplace’s equation.
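
As a hedged illustration of the link between the diffusion equation (6) and the steady state (8), the following minimal Python sketch advances the one-dimensional heat equation with an explicit finite-difference scheme until the transient has died out. In one dimension, Laplace’s equation reduces to d²u/dx² = 0, so the steady profile between two fixed end values should be a straight line. The function name heat_1d_explicit, the grid, the end values, and the choice σ = 1 are illustrative assumptions and do not come from the article.

```python
def heat_1d_explicit(u0, sigma, dx, dt, steps):
    """Advance du/dt = sigma * d2u/dx2 with fixed end values (explicit scheme)."""
    r = sigma * dt / dx**2
    if r > 0.5:
        # Well-known stability limit of the forward-time, centered-space scheme.
        raise ValueError("explicit scheme needs sigma*dt/dx**2 <= 0.5")
    u = [float(v) for v in u0]
    for _ in range(steps):
        new = u[:]  # copy; the two end values are never modified
        for i in range(1, len(u) - 1):
            # centered second difference in space, forward difference in time
            new[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
        u = new
    return u


# Hypothetical test problem: rod of unit length, u = 0 at x = 0 and u = 1 at x = 1.
n = 21
dx = 1.0 / (n - 1)
x = [i * dx for i in range(n)]
u0 = [0.0] * n
u0[-1] = 1.0

u_late = heat_1d_explicit(u0, sigma=1.0, dx=dx, dt=0.001, steps=5000)

# Steady state of the 1D problem is the straight line u = x.
max_dev = max(abs(u - xi) for u, xi in zip(u_late, x))
print(f"max deviation from the straight-line steady state: {max_dev:.2e}")
```

The time step is chosen so that σ·dt/dx² stays below 0.5; larger values make this particular scheme unstable, which is one reason implicit schemes are often preferred in practice.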
An example from the conservation laws is the law of conservation of mass, which can be expressed in terms of mathematical symbols as

(∂ρ/∂t) + ∇ · (ρv) = 0,    (9)

where ρ is the density, v is the velocity vector, and ∇ stands for the operation î(∂/∂x) + ĵ(∂/∂y) + k̂(∂/∂z) in rectangular Cartesian coordinates, with î, ĵ, k̂ the unit vectors along the coordinate axes x, y, z, respectively.
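
A minimal sketch of what Eq. (9) asserts, under assumptions that are not in the article: one space dimension, a constant velocity v > 0, a periodic domain, and a first-order upwind discretization. The names total_mass and advect_density are hypothetical. Because each cell loses exactly the flux that its neighbor gains, the total mass should stay constant up to rounding error while the density profile is carried along.

```python
def total_mass(rho, dx):
    """Integral of the density over the domain (rectangle rule)."""
    return sum(rho) * dx


def advect_density(rho, v, dx, dt, steps):
    """Upwind update of d(rho)/dt + d(rho*v)/dx = 0 on a periodic grid, v > 0."""
    courant = v * dt / dx
    if not 0.0 < courant <= 1.0:
        raise ValueError("need 0 < v*dt/dx <= 1 for this upwind sketch")
    rho = [float(r) for r in rho]
    n = len(rho)
    for _ in range(steps):
        flux = [v * r for r in rho]  # mass flux through the right face of cell i
        # index i - 1 wraps to the last cell when i == 0 (periodic domain)
        rho = [rho[i] - (dt / dx) * (flux[i] - flux[i - 1]) for i in range(n)]
    return rho


# Hypothetical initial state: a denser block on a uniform background.
dx, dt, v = 0.02, 0.01, 1.0
rho0 = [2.0 if 10 <= i < 20 else 1.0 for i in range(50)]
rho1 = advect_density(rho0, v, dx, dt, steps=100)

mass_change = total_mass(rho1, dx) - total_mass(rho0, dx)
print(f"change in total mass after 100 steps: {mass_change:.2e}")
```

The scheme smears the sharp edges of the block (it is only first-order accurate), but the conservation property holds regardless, which is the point of writing the update in flux form.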
A universally accepted mathematical model for viscous fluid flow is the famous Navier–Stokes equation,

ρ(Dv/Dt) = −∇p − ∇ × [µ∇ × v] + ∇[(2µ + λ)∇ · v],    (10)

where p is the pressure, µ is the coefficient of viscosity, λ is the second coefficient of viscosity, and

Dv/Dt = (∂/∂t + v · ∇)v,    (11)

in which v · ∇ is the convective derivative.

These and other aforementioned equations play a central role in mathematical modeling of a large variety of physical systems. In addition to differential and difference equations, there are other distinct types of equations (algebraic, integral, and functional equations) that appear prominently in mathematical modeling. Not all of these equations have closed-form, analytical solutions. The modeler has to resort to numerical and other approximation methods for solutions.

In science, the purpose of modeling has been primarily for research and understanding natural phenomena. In research, even though the immediate use of the model is not clear, one pursues it for gain in comprehension, for interpreting knowledge, and for formulating clues for further investigation. In engineering, however, in addition to research, mathematical models are used for design and control. Here the knowledge of the components of the total engineering system has to be expressed in a model compatible with design criteria, which may include some or all of stability limits, error criterion, economic yield, and safety criterion. Also, modeling for design is not just for creating a new system but also for the adaptation of an existing system for a higher (or different) performance.

To drive an engineering system in a desired fashion, some control action needs to be imposed in the form of feedback or feed-forward control, or adaptive control. There is a wide spectrum of human-made control systems in engineering, ranging from a simple on-off control to sophisticated computer control of a spacecraft. A common feature of all control systems is the so-called uncertainty factor, which might be limited by using adaptive or self-optimizing principles.

Even if the mathematical model representing a system is formulated in principle, it cannot be used for any practical purpose unless the range of parameters in the model is available. The importance of determining the range of parameters can be observed, for example, in the behavior of the Van der Pol equation, which can be expressed as

(d²u/dt²) + λ(u² − 1)(du/dt) + u = 0,    λ > 0.    (12)

As can be seen in Fig. 4(a), u is almost periodic for small values of λ, while for large values of λ the response of u is as shown in Fig. 4(b). The requirement to determine valid parameters leads to the identification problem. Since one cannot always have the complete knowledge of the physical aspects of phenomena for guidance to construct a model, identification of the system and its parameters becomes an important problem for which several methods have been developed in systems theory.

FIGURE 4 (a) Response u in Van der Pol equation for small values of the parameter λ. (b) Response u in Van der Pol equation for large values of the parameter λ.

B. System Identification and Parameter Estimation (Inverse Modeling)

Determination of the output signal corresponding to the input time history and system characteristics is a central problem in systems theory. In contrast to this problem, given an input and output time history, can one determine
the mathematical model that describes the system behavior? This latter problem is called an inverse problem and generally is referred to as the system identification. Identification is defined as the determination, on the basis of input and output, of a system within a specified class of systems, to which the system (phenomenon) under study is equivalent. In other words, identification is the process of constructing a mathematical model of a system from prior knowledge and observations. Equivalence is often expressed in terms of an error function E, which is a function of system output y and the model output y_m, that is,

E = E(y, y_m).    (13)

Two models m_1 and m_2 are said to be equivalent if the value of the error function is the same for both models, that is,

E(y, y_m1) = E(y, y_m2).    (14)

A model can be identified with reality if the value of the error function is minimum, and in recent years numerous techniques for minimizing error functions have been developed.

Although in identification problems no prior knowledge of phenomena is assumed, in reality the modeler has at least a partial understanding of the process to be modeled, through experience and intuition. In such cases the identification problem is reduced to finding the numerical values of a number of parameters (coefficients in the differential equations representing the phenomenon) and state variables. The identification problem is thus reduced to a parameter estimation problem. Parameter estimation is defined as the experimental determination of values of parameters that govern the system behavior, assuming that the structure of the process is known.

C. Methods of System Identification

1. Frequency Response Method

Certain linear stationary processes can be identified by the so-called frequency response method. In this method, sine-wave inputs are applied to the system, and the steady-state outputs in terms of the magnitude ratio and the phase shift are measured over the entire range of frequencies of interest. Knowing the output Y(s) corresponding to the input X(s), the system behavior (transfer function) G(s) is determined (see Fig. 5) by the operation

G(s) = Y(s)/X(s).    (15)

This method is applicable only to stable systems, because the output Y(s) for unstable systems cannot be measured.

FIGURE 5 Relationship of the transfer function G(s) to the input X(s) and the output Y(s).

2. Step-Response Method

Step input is the simplest input that can be applied to a system, like sudden closing or opening of a valve in a hydraulic network. In reality, strict step input is physically impossible, but as long as the rise time of the step input is much shorter than the period of the highest frequency in the system, the input can be considered as a step input. Step-response problems are considered as impulse-response problems, and the determination of transfer function falls into the category of time-domain problems, which can be solved by one of the gradient methods.

3. Deconvolution Method

Sometimes the input and output of a system can be related by the convolution integral

y(t) = ∫_0^t x(τ) w(t − τ) dτ,    (16)

where w(τ) is the impulse response of the system. The determination of the impulse response from the input and the output is called deconvolution. Once w(τ) is determined, the system transfer function can be found by using the step response method already described.

4. Cost Function Method

In identification problems, cost means penalty for not achieving correct identification. The cost function is usually expressed as a squared error (difference between the assumed value and the actual value of a parameter) function that has to be minimized for correct, or nearly correct, parameter estimation. Cost functions are mainly of two types with some variations depending on the choice of the parameter: (1) the maximum likelihood (ML) cost function in which the ML estimator does not assume any prior knowledge of the parameter, and (2) the maximum a posteriori (MAP) cost function which considers prior knowledge of the parameter obtained by maximizing the probability density of the parameter.

5. Gradient Techniques

Gradient techniques are associated with computational methods for system identification. Here, one attempts to minimize the cost function with each iteration in a direct
fashion after expanding the cost function in Taylor series about an assumed parameter to a desired order. In the computational algorithm, one picks the variation in the parameter to make the steepest descent toward the minimum cost function. The computation is repeated until there is no significant change in the parameter value from iteration to iteration. Widely applied among these methods are the first-order, second-order, and conjugate gradient methods.

6. Quasilinearization Method

As opposed to gradient technique, this is an indirect method in which a sequence of functions is iteratively computed until the functions converge to the solution. In this method, recurrence relations are obtained in the form of linear differential equations, even if the model equation has nonlinearity in its structure; hence the name quasilinearization. This method is also called the generalized Newton–Raphson method.

Consider, for example, the nonlinear differential equation

dx/dt = f(x),    x(0) = c,    t > 0    (17)

in which the function f is continuous in x and time t and has continuous bounded second partial derivatives with respect to x for all x and t. The function can be expanded around x^(0)(t) in Taylor series as

f(x) = f(x^(0)) + [∂f(x^(0))/∂x](x − x^(0)) + ···.    (18)

Neglecting the higher-order terms and combining Eqs. (17) and (18) yields the linear differential equation

dx/dt = f(x^(0)) + [∂f(x^(0))/∂x](x − x^(0))    (19)
x(0) = c.    (20)

Proceeding likewise, one obtains

dx^(n+1)/dt = f(x^(n)) + [∂f(x^(n))/∂x](x^(n+1) − x^(n))    (21)
x^(n+1)(0) = c.    (22)

The iterative process begins with an initial approximation x^(0)(t) and proceeds until the sequence of functions converges.

7. Invariant Imbedding

The basic idea of invariant imbedding is to change a specific problem into a general problem. The solution to a specific problem is imbedded in the solution of a more general problem for which a sequential estimate of system parameters is made in the solution procedure. Invariant imbedding converts many boundary-value problems into initial-value problems, which makes both the analysis and the computational solution easier. This form of imbedding is also useful in engineering problems where sensitivity analysis has to be performed.

V. SOLUTION TECHNIQUES

For a mathematical modeler, the entire world of applied mathematics and new mathematical concepts being developed from time to time are open for use. The advent of high-speed digital computers has aided the solution techniques and opened the doors for modeling more complex systems. Some solution techniques practiced in modeling include:

1. Rigorous analysis
2. Symmetry methods
3. Obtaining a priori bounds on the solutions
4. Method of isoclines
5. Perturbation theory
6. Asymptotic analysis
7. Group theory
8. Inequalities
9. Integration by parts
10. Extremum principles
11. Numerical approximations (including finite difference, finite element, boundary element methods, invariant imbedding, etc.)
12. Variational principles
13. Linearization and quasi-linearization
14. Integral methods
15. Complex-variable methods
16. Graph theory
17. Mathematical programming.

This list is by no means exhaustive. For detailed analyses of these topics, the reader must refer to textbooks on applied mathematics. There are separate textbooks for several of these topics.

VI. MODEL VALIDATION

Developing a mathematical model is a long and arduous task. In mathematical modeling, the goal of the modeler is to ensure that the model replicates the phenomena being modeled to an acceptable degree. The procedures followed to test the fidelity of the model in reproducing the real
system it represents constitute model validation. On the surface, it appears that validation is something that should be done at the end of the model construction. But, in fact, validation should be carried out throughout the modeling process. A valid model can be expected by being logically consistent at each step of the modeling process, by reexamining the assumptions and constraints without sacrificing the mathematical rigor, and by turning every stone of applicable mathematical knowledge. Is it not logical that to obtain valid model behavior one must use valid assumptions and constraints? By doing so one would obtain the most accurate model, but perhaps one difficult to solve. Suppose, by a judicious choice of a solution technique from the myriad of techniques available, the modeler determines the system behavior. This step invariably (at least in physical sciences and technology) involves a numerical approximation associated with computer programming, which should be carefully carried out and correctly implemented. To generate confidence in the model, the system behavior must be compared with the real system data. It is nearly impossible, except for some simple situations, to obtain global agreement (agreement over the entire range of parameters). Then the modeler must choose the range of validity for various model parameters and establish other criteria, such as comparison of results by different solution techniques, by which to judge the model validity.

The validity of a model should not be judged by mathematical rationality alone; nor should it be judged purely by empirical validation at the cost of mathematical and scientific principles. A combination of rationality and empiricism (logic and pragmatism) should be used in the validation. If necessary, all or some of the steps in the modeling process should be repeated several times until the model is acceptable for use.

VII. CHAOS AND COMPLEXITY

In deterministic physical and mathematical systems, when the model equations are nonlinear, the evolution of the system behavior can become irregular and unpredictable and exhibit sensitive dependence on initial conditions. Such behavior is called chaos. It occurs in vibrating objects, in rotating or heated fluids, and in some chemical reactions. Understanding chaotic behavior involves tracking the time evolution of a nonlinear dynamical system or that of natural phenomena modeled by a set of nonlinear differential equations that arise from classical equations of physics. With the exception of some first-order equations, analytical solutions of nonlinear differential equations are either difficult or impossible. The desire to overcome this difficulty has presented a significant motivation for the advancement of numerical methods and innovations of computational algorithms that lead to novel computer architectures. New discoveries in nonlinear dynamics have created new concepts and tools such as fractal dimensions and Lyapunov exponents for detecting and quantifying chaos in physical systems. There are no formal, theoretical criteria to determine under what conditions a dynamical system in general would become chaotic. Significant effort is needed to determine how and when more general and complex physical systems will become chaotic.

Complexity is one of the most perplexing problems of systems theory. Many relatively independent subsystems that are highly interconnected and interactive manifest complex systems. As a result, the collective behavior of complex systems is reflected in reproducing the functions of truly complex, self-organizing, replicating, learning and adaptive systems. Systems consisting of a large number of interacting elements lead to a perception of complex and disorderly behavior. It is generally believed that biological systems involve more interacting elements than physical systems, and therefore the latter are simple with orderly behavior. However, recent developments in irreversible thermodynamics, in the theory of dynamical systems, and in classical mechanics have narrowed the gap between the simple and complex, and between order and disorder; under certain conditions, simple systems are also known to exhibit complex behavior. Complexity in system behavior is characterized by instabilities and bifurcations. In modeling complex behavior of a system, one must first assess the nonlinear character of the underlying dynamics and identify a set of variables that control these instabilities and bifurcations.

VIII. MODELING WITH NEURAL NETWORKS

As processes increase in complexity, they become less amenable to direct mathematical modeling based on physical laws. In the latter half of the 20th century, artificial neural networks made inroads into several disciplines with a wide range of applications. An artificial neural network is a network of interconnected units called artificial neurons that are connected in different patterns to process information in a parallel distributed fashion much like the human brain. A significant aspect of neural networks is their ability to learn how to process information in supervised and unsupervised modes. They are used for simulation of physical systems that are modeled by massively parallel networks. In recent years, applications have been developed for modeling simple biological structures with known functions, for the modeling of higher functions of the central nervous system, for solving complicated problems in artificial intelligence and cognitive sciences,
for pattern recognition, and for solving combinatorial optimization problems. Further applications have been extended to medical diagnosis, financial services including stock price prediction, intelligent control of engineering plants, and manufacturing. Neural networks are also well adapted to ill-posed problems, those with damaged or incomplete data.

IX. MATHEMATICAL MODELING AND COMPUTERS

In all endeavors involving mathematical modeling, one cannot fail to perceive that computers act as an interface between phenomena and the various stages of mathematical modeling, including validation and use. Spectacular advances have been achieved in modeling complex and large systems using computers by removing the need for analytical solutions to differential equations representing the systems. The advances in high-speed computers and efficient numerical algorithms have generated acceptable numerical solutions to systems of equations hitherto intractable and thus made modeling of complicated, interconnected, and interacting systems possible. As a matter of fact, complicated mathematical models expressed in terms of nonlinear partial differential equations and new applications have been the driving force behind the development of large computing machines with huge amounts of shared memory and a processing rate of up to three trillion floating-point operations per second. The need for more efficient computers to simulate complicated models has driven computer scientists to think innovatively in the direction of new architectures and procedures. The greatest potential for achieving higher speed lies in adding parallel processors. In recent years, neurocomputers based on artificial neural networks have been built to process information efficiently and cost-effectively, but complementary to algorithmic computing. With more developments in computer technology on the horizon, mathematical modelers can expect to achieve more success in modeling and simulation of very complicated systems, and also in revising and reworking the old models, in which drastic simplifications had to be made in the past. Despite the advances in computers and computational technology, the role of analytical methods in deciphering mathematical models should not be overlooked because they offer valuable insight into a wide variety of problems.

X. CONCLUDING REMARKS

In this article, mathematical modeling concepts in relation to physical sciences, engineering, and technology have been surveyed. In recent years, mathematical modeling has pervaded all branches of knowledge, bringing forth greater understanding of processes under investigation. In engineering and technology it provides the analytical basis for design and control in which predictions can be confidently made without spending valuable resources of money and effort.

Successful applications of mathematical modeling techniques in engineering sciences have led the way to extend the techniques to more exotic areas of inquiry, like nanotechnology, nuclear-reactor engineering, material science, environment, weather prediction, biological processes, space sciences, cosmology, and also social sciences. Although the general philosophy of modeling in these new areas remains the same as discussed in this article, the simulation procedures and validation criteria are different and dependent on the types of models and the disciplines they belong to.

Mathematical modeling is a vast, multidisciplinary field that pleads to engage the interest and dedication of engineers, scientists and mathematicians to solve the problems facing humankind. A significant development in the mathematical modeling activity is the availability of very-high-speed computers, which can solve a variety of complex models. In spite of all the advances in empirical knowledge, solution techniques, and computer assistance, it must be noted that human intelligence, experience, and intuition still play a significant role in mathematical modeling.

SEE ALSO THE FOLLOWING ARTICLES

CHAOS • CONTROLS, ADAPTIVE SYSTEMS • DIFFERENTIAL EQUATIONS • DISCRETE SYSTEMS MODELING • GROUP THEORY • LINEAR SYSTEMS OF EQUATIONS • NUMERICAL ANALYSIS • PERTURBATION THEORY • PROBABILITY • STOCHASTIC PROCESSES • SYSTEM THEORY
P1: ZCK Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN009B-413 July 18, 2001 0:38

Measure and Integration

G. de Barra
University of London

I. Introduction
II. Lebesgue Measure
III. The Lebesgue Integral
IV. The $L^p$ Spaces and Inequalities for Integrals
V. Differentiation and Integration
VI. Product Spaces and Product Measures
VII. General Measures
VIII. The Radon–Nikodým Theorem and Signed Measures
IX. Extensions of the Definition of Measure
X. Extensions of Measures and Lebesgue–Stieltjes Integrals
XI. The Radon–Nikodým Property for Banach Spaces
XII. Measure and Fractals

GLOSSARY

Almost everywhere (a.e.) Except on a set of measure zero.
Complement Set of points of the space not in $A$, denoted $\complement A$.
Countable set Set which may be enumerated or put in one-to-one correspondence with the integers.
Empty set Set with no elements. (Notation $\emptyset$.)
Measurable function Function whose values above any given value are taken in a measurable set.
Measurable set Set whose measure is defined.
Measure Quantity defined on sets usually taking a positive value, denoted $m(A)$ for $A$ a set of real numbers or $\mu(A)$ more generally.
Set Class of objects with property $P$ (say). (Notation $\{x : x \text{ satisfies } P\}$.)
Set difference Set of points of $A$ not in $B$, denoted $A \setminus B$.

THE LEBESGUE THEORY of measure and integration provides the framework for considering the limiting operations of analysis as used in pure mathematics and theoretical physics. In particular it shows when sums and integrals or limits and integrals may be interchanged. It includes the study of the theory of sets in the real line or higher dimensions with special emphasis on countable operations on sets.

I. INTRODUCTION

The theory of integration arose informally. It was formalized by Riemann but in a way which restricted its application severely. These restrictions are considered in Section III. Meanwhile in applications the subject was expanding as applied mathematicians and students of probability needed results on the limits of integrals, not provided by Riemann's theory. The Lebesgue theory supplied


the main deficiencies and the limit theorems described in Section III together with the results on differentiation of Section V rounded off the subject. The needs of the applied mathematicians were provided for also by the results on the $L^p$ spaces of functions described in Section IV and the results using product spaces of Section VI. The Lebesgue integral turns out to be the appropriate definition to deal with orthogonal expansions and with the Fourier and Laplace transforms. The needs of the probabilists were supplied by the general theory of measure in Section VII, the decomposition theorems of Section VIII, and the results on products of measure spaces in Section VI. The differentiation of measures considered in Section VIII is of central importance in functional analysis, as is seen in Section XI.

II. LEBESGUE MEASURE

For any set $A$ contained in the real line $\mathbb{R}$ we define the Lebesgue outer measure (or outer measure) to be the quantity $m^*(A)$ given by

$$m^*(A) = \inf \sum_n l(I_n)$$

where we are taking the infimum or greatest lower bound over all countable collections $\{I_n\}$ of intervals such that $A \subseteq \bigcup I_n$ and where $l(I)$ denotes the length of the interval $I$. From this definition we have immediately that $m^*(A)$ is nonnegative; $m^*(A) = 0$ if $A$ is a one point set or empty; $m^*(A) \le m^*(B)$ whenever $A \subseteq B$. It is fairly easy to show that the outer measure of any interval equals the length of the interval, so outer measure has for some sets at least the properties we would desire. Also, for any sequence of sets $\{E_i\}$ it is easy to see that

$$m^*\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) \le \sum_{i=1}^{\infty} m^*(E_i)$$

(i.e., $m^*$ is subadditive). We, however, cannot in general assume that equality will occur here even if the sets are pairwise disjoint. The outer measure has another desirable property: the outer measure of a set is unchanged if the set is shifted to left or right (i.e., $m^*$ is translation invariant). It also has a regularity property: for any set $A$ and any $\varepsilon > 0$ there is an open set $O$ containing $A$ and such that $m^*(O) \le m^*(A) + \varepsilon$.

In order to achieve the desired additivity we restrict $m^*$ to the class $\mathcal{M}$ of Lebesgue measurable sets. We define a set $E$ to be Lebesgue measurable (or just measurable) if for each set $A$

$$m^*(A) = m^*(A \cap E) + m^*(A \cap \complement E)$$

So a measurable set divides an arbitrary set in an additive way as regards outer measure. It follows from this definition that intervals are measurable. Also, all sets of zero outer measure are measurable (and so therefore are all subsets of such sets). The class of measurable sets has the important property of being a σ-algebra: that is, the class contains the whole space ($\mathbb{R}$ in this case) and is closed under the formation of complements and countable unions. Only the last part here presents any difficulty. Denote $m^*$ restricted to the σ-algebra $\mathcal{M}$ by $m$, called Lebesgue measure. Then we easily have that for any sequence $\{E_i\}$ of disjoint measurable sets

$$m\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) = \sum_{i=1}^{\infty} m(E_i)$$

(i.e., $m$ is additive on disjoint measurable sets). For some purposes it is convenient to restrict the measure $m$ still further to the Borel sets which may be defined as the smallest σ-algebra containing the intervals. These sets are more natural in some contexts, but difficulties can arise from the fact that whereas for the class $\mathcal{M}$ every subset of a set of zero measure is again measurable, this is no longer true of the smaller class of Borel sets. So we confine our attention to the Lebesgue measurable sets.

It is easily seen that every nonempty open set has positive measure. From the countable additivity of $m$ it is obvious that every countable set has zero measure. It is less obvious that there exist uncountable sets of zero measure. A standard example is the Cantor ternary set. It is formed by removing from the interval $[0, 1]$ the middle third $(\tfrac13, \tfrac23)$, then the middle thirds of the two remaining intervals, namely, $(\tfrac19, \tfrac29)$ and $(\tfrac79, \tfrac89)$, then the middle thirds of the remaining intervals, and so on. The total set removed has measure adding to 1, so the residual set (the Cantor set) has measure 0. But the Cantor set contains all the numbers between 0 and 1 with expansions to base 3 consisting of 0's and 2's. So it has the same cardinality as the set of all binary expansions, and so is uncountable.

It is also true that there exist sets of the real line which are not measurable. But these sets cannot be constructed and indeed can only be shown to exist using the axiom of choice or an equivalent tool of mathematical logic. The existence of these nonmeasurable sets is not crucial for the theory, but if they did not exist the theory would lose some of its content, for all sets would be measurable.

We consider now a "continuity" property of measure. Suppose $\{E_i\}$ is a sequence of measurable sets, $E_0 \subseteq E_1 \subseteq E_2 \subseteq \cdots$. Then $m\bigl(\bigcup_{i=1}^{\infty} E_i\bigr) = \lim m(E_i)$. Also, if $F_0 \supseteq F_1 \supseteq F_2 \supseteq \cdots$, and $m(F_0) < \infty$, then $m\bigl(\bigcap_{i=1}^{\infty} F_i\bigr) = \lim m(F_i)$ where again the sets $\{F_i\}$ are supposed measurable. If, as is conventional, we write here $\lim E_i = \bigcup_{i=1}^{\infty} E_i$ and $\lim F_i = \bigcap_{i=1}^{\infty} F_i$, then this result reads $m(\lim E_i) = \lim m(E_i)$ for any increasing sequence of measurable sets and also for any decreasing sequence of sets of finite measure. Lebesgue measure also has a regularity

property stronger than that of outer measure. If $E$ is any measurable set and $\varepsilon$ any positive number then there exists an open set $O \supseteq E$ with $m(O \setminus E) \le \varepsilon$ and a closed set $F \subseteq E$ with $m(E \setminus F) \le \varepsilon$.

Although the sets of zero measure can be uncountable, they turn out to be negligible for the purposes of integration and we say that a property holds "almost everywhere" (a.e.) if it holds except possibly on a set of zero measure. Note that sets of infinite measure frequently occur (e.g., $[1, \infty)$ is of infinite measure). So identities or inequalities involving the measures of sets must be written carefully with this in mind. In particular, indeterminate expressions of the form $\infty - \infty$ must be avoided.

The importance of Lebesgue measurable sets is that they allow us to define measurable functions. Recall that if a function $f$ is continuous then the set of points $x$ with $f(x) > \alpha$ is open, for each $\alpha$ (i.e., $f^{-1}(\alpha, \infty)$ is open). We define $f$ to be Lebesgue measurable (or just measurable) if $f^{-1}(\alpha, \infty)$ is a measurable set for each $\alpha$. Since open sets are measurable it follows that continuous functions are measurable. It follows easily from the definition that, if $f$ is measurable, $\{x : f(x) = \alpha\}$ is measurable for each $\alpha$. Also, the constant functions are measurable (since they are continuous). Of special importance are the characteristic functions. We denote by $\chi_A$ the characteristic function (or indicator function) of the set $A$: $\chi_A = 1$ on $A$, $\chi_A = 0$ on $\complement A$. Then from the definition of measurable functions we have that a characteristic function $\chi_A$ is measurable if, and only if, the set $A$ is measurable. It is also easy to see that a constant multiple $cf$ of a measurable function $f$ is measurable. With a little more care we can see that the sum of measurable functions is measurable. So any finite linear combination of measurable functions is measurable and the product of measurable functions is measurable.

As indicated earlier it is important in applications to deal with limiting operations. So we note that the sup and inf of any finite or countable family of measurable functions is again measurable; this uses the fact that $\mathcal{M}$ is a σ-algebra allowing countable set operations. It follows that for any sequence $\{f_n\}$ of measurable functions, $\limsup f_n$ and $\liminf f_n$ are again measurable. Here $\limsup f_n = \inf_{N \ge 1} \sup\{f_n : n \ge N\}$, and $\liminf f_n = -\limsup(-f_n)$. If $\lim f_n$ exists it is the common value of $\limsup f_n$ and $\liminf f_n$, so we have that the limit, when it exists, of a sequence of measurable functions is again measurable. The corresponding result is not true of continuous functions. As important special cases we have that $f^+ = \max(f, 0)$, $f^- = \max(-f, 0)$, and $|f| = f^+ + f^-$ are measurable whenever $f$ is.

For continuous functions the sup and inf are important. For a measurable function we are more interested in ess sup $f$ and ess inf $f$ (the essential supremum and essential infimum of $f$) given by $\operatorname{ess\,sup} f = \inf\{\alpha : f \le \alpha \text{ a.e.}\}$, $\operatorname{ess\,inf} f = \sup\{\alpha : f \ge \alpha \text{ a.e.}\}$. So if we disregard sets of measure zero we replace the sup $f$ by the possibly smaller ess sup $f$ and the inf $f$ by the possibly larger ess inf $f$. Functions for which both of these numbers are finite are called essentially bounded. A simple property of the essential supremum to which we refer later is that for any two measurable functions

$$\operatorname{ess\,sup}(f + g) \le \operatorname{ess\,sup} f + \operatorname{ess\,sup} g$$

and

$$\operatorname{ess\,inf}(f + g) \ge \operatorname{ess\,inf} f + \operatorname{ess\,inf} g$$

So finite linear combinations of essentially bounded functions are again essentially bounded. As an example suppose $f(x) = 1$ for $x$ rational, $f(x) = 0$ otherwise. Then $f$ is measurable (it is the characteristic function of a countable set which of course has measure zero), and $\sup f = 1$, but $\operatorname{ess\,sup} f = 0$.

All the functions considered here have been real valued. For some purposes, however, it is convenient to extend the definitions to complex-valued functions. We describe the function $f$ as measurable if $\operatorname{Re}(f)$ and $\operatorname{Im}(f)$, the real and imaginary parts of $f$, are measurable in the previous sense. To avoid difficulties we shall assume that any complex-valued functions considered take only finite values. It is easy to check that $|f|$ is measurable whenever $f$ is a complex-valued measurable function.

III. THE LEBESGUE INTEGRAL

We now show how the Lebesgue integral is defined, how its value may be obtained in practice, what the principal theorems are which make its use attractive, and how these theorems may be used in examples. We revert for the moment to real-valued functions.

As with most definitions of the integral we specify the integral for a certain basic class of functions and show how the definition can be extended to a more general class. Not all functions or even all measurable functions have integrals; the restrictions arise in a natural way from the definition. So we first consider nonnegative simple functions. These are measurable functions taking only a finite number of values. Equivalently we write the real line $\mathbb{R}$ as the union of a finite number of measurable sets

$$\mathbb{R} = \bigcup_{i=1}^{n} A_i$$

and consider the function

$$\varphi(x) = \sum_{i=1}^{n} a_i \chi_{A_i}$$

where the $a_i$ are finite nonnegative numbers and $\chi_{A_i}$, as before, stands for the characteristic function of the set $A_i$. Typically, one of the $a_i$ will be zero. As special cases of simple functions we have characteristic functions of measurable sets (or of intervals, in particular). For such a characteristic function $\chi_A$ we define the integral $\int \chi_A\,dx$ to be $m(A)$. Since we want our integral to be linear we define

$$\int \varphi\,dx = \sum_{i=1}^{n} a_i\, m(A_i)$$

where $\varphi$ is given as above. We use the convention that in such sums $0 \cdot \infty = 0$. This definition is extended immediately to all nonnegative measurable functions $f$ on $\mathbb{R}$ by

$$\int f\,dx = \sup \int \varphi\,dx$$

the supremum being taken over all nonnegative simple functions $\varphi$, $\varphi \le f$.

This would seem to give two ways of arriving at the integral of a nonnegative simple function, but they are easily shown to give the same value. Restricting to a measurable subset $E$ of $\mathbb{R}$ we may write $\int f \chi_E\,dx$ as $\int_E f\,dx$ for any nonnegative measurable function $f$; in particular, if $E = [a, b]$ this is written as just $\int_a^b f\,dx$.

From the definition we have immediately some properties of the integral for nonnegative simple functions: if

$$\varphi = \sum_{i=1}^{n} a_i \chi_{A_i}$$

then

$$\int_E \varphi\,dx = \sum_{i=1}^{n} a_i\, m(E \cap A_i)$$

for any measurable set $E$; $\int_{A \cup B} \varphi\,dx = \int_A \varphi\,dx + \int_B \varphi\,dx$ for any disjoint measurable sets $A$ and $B$, so the integral is additive; $\int a\varphi\,dx = a \int \varphi\,dx$ for any positive number $a$.

From these properties and from the definition we have some properties of the integral for nonnegative measurable functions: $\int f\,dx = 0$ if, and only if, $f = 0$ a.e.; if $f \le g$ then $\int f\,dx \le \int g\,dx$ ($f$ and $g$ nonnegative measurable functions); if $a \ge 0$ then $\int af\,dx = a \int f\,dx$; if $A$ and $B$ are measurable and $A \supseteq B$ then $\int_A f\,dx \ge \int_B f\,dx$.

The important theorems regarding the Lebesgue integral are concerned with sequences or limits, a basic one being Fatou's lemma: let $\{f_n\}$ be a sequence of nonnegative measurable functions; then

$$\int \liminf f_n\,dx \le \liminf \int f_n\,dx$$

So if the sequence $\{f_n\}$ converges at each point to the function $f$ (necessarily measurable) and if for some infinite subsequence we have

$$\int f_{n_i}\,dx \le A$$

then

$$\int f\,dx \le A$$

This result has far-reaching applications and its proof is based directly on the result of the last section that $\lim m(A_i) = m(\lim A_i)$. We must allow integrals and limits of integrals to take infinite values; since all the numbers considered here are nonnegative no difficulties can arise.

As a companion result to Fatou's lemma we have the Lebesgue monotone convergence theorem. Let $\{f_n\}$ be a sequence of nonnegative measurable functions such that for each $x$, $\{f_n(x)\}$ is monotone increasing. So $\{f_n(x)\}$ has a limit $f(x)$ at each point (this limit may be infinite). Then the function $f$ is a nonnegative measurable function and $\int f\,dx = \lim \int f_n\,dx$. This follows from Fatou's lemma since $\int f\,dx \le \liminf \int f_n\,dx \le \limsup \int f_n\,dx \le \int f\,dx$ as $f_n \le f$ for all $n$. So equality holds.

By partitioning the range of a nonnegative measurable function $f$ into intervals and taking the simple function which, on the measurable set for which $f$ has its values in one such interval, takes the minimum of these values we obtain a simple function approximating to $f$. Choosing a sequence of finer partitions gives a sequence of simple functions tending monotonically to $f$. Their integrals converge monotonically to that of $f$ and provide an alternative and more manageable definition of $\int f\,dx$. Indeed this approach shows immediately that for any finite sequence $f_1, \ldots, f_n$ of nonnegative measurable functions we have

$$\int \sum_{i=1}^{n} f_i\,dx = \sum_{i=1}^{n} \int f_i\,dx$$

This extends by the Lebesgue monotone convergence theorem to infinite sums: for any sequence $\{f_n\}$ of nonnegative measurable functions,

$$\int \sum_{n=1}^{\infty} f_n\,dx = \sum_{n=1}^{\infty} \int f_n\,dx$$

This result is very useful in applications; note that it does not require the sum on the right-hand side to be finite.

We now wish to extend the definition of the integral to functions not necessarily nonnegative. If a function $f$ is nonnegative on the set $A$ and negative on the complementary set $B = \complement A$ we want $\int f\,dx$ to be $\int_A f\,dx - \int_B (-f)\,dx$. Equivalently, write $f^+(x) = \max(f(x), 0)$, $f^-(x) = \max(-f(x), 0)$. Then $f^+$ and $f^-$ are measurable provided $f$ is; $f^+$ and $f^-$ are nonnegative; $f = f^+ - f^-$ and $|f| = f^+ + f^-$. We define, for any measurable function $f$, $\int f\,dx$ as $\int f^+\,dx - \int f^-\,dx$ provided at

least one of these integrals is finite; if both are finite we say $f$ is Lebesgue integrable, or integrable in short. So the measurable function $f$ is integrable provided the nonnegative measurable function $|f|$ has a finite integral. Then the integrals of integrable functions inherit the properties of the simpler integral, namely,

$$\int (af + bg)\,dx = a \int f\,dx + b \int g\,dx$$

for any integrable functions $f$ and $g$; the integral is monotone: $f \le g$ implies $\int f\,dx \le \int g\,dx$; the integral is additive on sets: $\int_A f\,dx + \int_B f\,dx = \int_{A \cup B} f\,dx$ for disjoint measurable sets $A$ and $B$. From the definition we have easily that $\bigl|\int f\,dx\bigr| \le \int |f|\,dx$ for any integrable function $f$. Also a mean value theorem applies: if $f$ is measurable and $g$ is integrable and if $\alpha$, $\beta$ are real with $\alpha \le f \le \beta$ a.e. then $\int f|g|\,dx = \gamma \int |g|\,dx$ for some $\gamma$, $\alpha \le \gamma \le \beta$. In particular, if $E$ is measurable with $m(E) < \infty$ and $\alpha \le f \le \beta$ on $E$ then $\alpha\, m(E) \le \int_E f\,dx \le \beta\, m(E)$.

We come now to the main theorem of this section: Lebesgue's dominated convergence theorem. It states that if $\{f_n\}$ is a sequence of measurable functions such that $|f_n| \le g$ where $g$ is integrable and if $\lim f_n = f$ a.e. then $f$ is integrable and

$$\lim \int f_n\,dx = \int f\,dx$$

This follows immediately on applying Fatou's lemma to the nonnegative sequences $\{g + f_n\}$ and $\{g - f_n\}$ in turn. As a corollary we have that $\int |f_n - f|\,dx \to 0$ (i.e., $f_n$ tends to $f$ "in the mean"). It is quite easy to extend this result to a family of measurable functions indexed by a parameter. So if $\{f_\alpha\}$ is such a family and $f_\alpha \to f$ at each point as $\alpha \to \alpha_0$, and $|f_\alpha| \le g$, an integrable function, then $f$ is integrable and $\lim_{\alpha \to \alpha_0} \int f_\alpha\,dx = \int f\,dx$. This version is useful in some applications; in particular it allows us to consider an integral $\int_{-\infty}^{\infty} f\,dx$ as the limit of the integrals $\int_a^b f\,dx$ as $a \to -\infty$, $b \to \infty$, $|f|$ being the "dominating integrable function."

Another useful version of the dominated convergence theorem is as follows. Let $\{f_n\}$ be a sequence of integrable functions such that

$$\sum_{n=1}^{\infty} \int |f_n|\,dx < \infty$$

Then the series $\sum f_n(x)$ converges a.e., its sum $f(x)$ is integrable and $\int f\,dx = \sum_{n=1}^{\infty} \int f_n\,dx$. This follows on using the dominating function $\sum_{n=1}^{\infty} |f_n|$.

In order to apply these limiting theorems to specific examples we need to be able to obtain the integrals of elementary functions. It is easily seen, from the mean value theorem referred to earlier, that if $f$ is a continuous function on a finite interval $[a, b]$ (so that $f$ is certainly measurable) then $f$ is integrable and the function $F(x) = \int_a^x f(t)\,dt$ $(a < x < b)$ is differentiable with $F' = f$. So for continuous functions, integrals can be obtained using indefinite integrals, in the usual elementary manner. In particular the usual devices of integration by parts or by substitution, which follow from the corresponding rules for derivatives, can be applied in the integration of continuous functions.

All the theory of this section has dealt with real-valued functions. It may be extended to complex-valued functions just as in the last section. Define the complex-valued measurable function $f$ to be integrable provided its real and imaginary parts $\operatorname{Re} f$ and $\operatorname{Im} f$ are integrable. The elementary theory will still be true as will the modulus inequality $\bigl|\int f\,dx\bigr| \le \int |f|\,dx$ though the proof now needs a little more care. Lebesgue's monotone convergence theorem and Fatou's lemma have no direct application but Lebesgue's dominated convergence theorem and its counterpart for series apply unchanged to a complex-valued sequence $\{f_n\}$. Indeed, to prove this one need consider only the sequences $\{\operatorname{Re} f_n\}$ and $\{\operatorname{Im} f_n\}$ separately.

We now have several theorems allowing us to take limits "under the integral sign" or to interchange summation and integration. In the case of nonnegative functions this interchange is always possible; in the general case a finiteness condition needs to be imposed. We now give a few examples showing how the theorems may be applied.

EXAMPLE 1. Show that

$$\int_0^1 \frac{x \log x}{1 - x}\,dx = -\sum_{n=1}^{\infty} \frac{1}{(n+1)^2}$$

The integrand on the left-hand side may be considered on $(0, 1)$, as one point does not affect the integral, and there it equals $\sum_{n=0}^{\infty} x^{n+1} \log x$. Since $\sum_{n=0}^{\infty} x^{n+1} \log (1/x)$ is a series of nonnegative functions we may integrate it term by term to get $\sum_{n=1}^{\infty} 1/(n+1)^2$ as required.

EXAMPLE 2. Show that

$$\lim_{n \to \infty} \int_0^{\infty} \frac{dx}{(1 + x/n)^n\, x^{1/n}} = 1$$

We may suppose $x > 0$ in the integral and write $f_n(x)$ for the integrand. Then $\lim f_n(x) = e^{-x}$ and $\int_0^{\infty} e^{-x}\,dx = 1$. So we need to interchange the integral and the limit. We may construct a dominating function $g(x)$ as follows. For $0 < x < 1$,

$$(1 + x/n)^n\, x^{1/n} > x^{1/2} \quad (n > 1)$$

For $1 \le x$,

$$(1 + x/n)^n\, x^{1/n} \ge (1 + x/n)^n \ge 1 + x + \frac{n(n-1)}{2n^2}\,x^2 > \frac{1}{4}\,x^2 \quad (n > 1)$$

So let $g(x) = x^{-1/2}$ $(0 < x < 1)$, $g(x) = 4/x^2$ for $1 \le x$. Then $g$ is integrable and the dominated convergence theorem justifies the interchange.

EXAMPLE 3. Show that

$$\int_0^1 \sin x \log x\,dx = \sum_{n=1}^{\infty} \frac{(-1)^n}{(2n)\,(2n)!}$$

We have

$$\sin x \log x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!} \log x = \sum_{n=0}^{\infty} f_n(x)$$

say. But

$$\int_0^1 |f_n(x)|\,dx = (-1)^{n+1} \int_0^1 f_n(x)\,dx = \frac{1}{(2n+2)\,(2n+2)!}$$

Since the sum of these terms is finite the required interchange $\int \sum f_n\,dx = \sum \int f_n\,dx$ may be carried out and yields the desired result. It is often the case, as here, that the same calculation yields the finiteness condition and provides the answer.

EXAMPLE 4. Show that

$$\int_0^{\infty} \frac{\sin t}{e^t - x}\,dt = \sum_{n=1}^{\infty} \frac{x^{n-1}}{n^2 + 1}, \quad -1 \le x \le 1$$

The integrand here is

$$\lim_{N \to \infty} \sum_{n=0}^{N} e^{-t} \sin t\,(x e^{-t})^n$$

which is in modulus $\le 2t/(e^t - x)$ on summing this finite series. But $2t/(e^t - x)$ is integrable and nonnegative on $(0, \infty)$ so the dominated convergence theorem may be applied to the sequence of partial sums, and we get

$$\int_0^{\infty} \frac{\sin t}{e^t - x}\,dt = \sum_{n=0}^{\infty} x^n \int_0^{\infty} e^{-(n+1)t} \sin t\,dt = \sum_{n=0}^{\infty} \frac{x^n}{1 + (n+1)^2}$$

by the usual elementary integration by parts.

Integrals are often encountered first by the student as Riemann integrals. We recall one form of the definition. Suppose $f$ is bounded on the finite interval $[a, b]$. Write

$$S_D = \sum_{i=1}^{n} M_i\,(\xi_i - \xi_{i-1})$$

where $a = \xi_0 < \xi_1 < \cdots < \xi_n = b$ is a partition $D$ of $[a, b]$ and $M_i = \sup f$ in $[\xi_{i-1}, \xi_i]$. Similarly

$$s_D = \sum_{i=1}^{n} m_i\,(\xi_i - \xi_{i-1})$$

where $m_i = \inf f$ in $[\xi_{i-1}, \xi_i]$. Then $f$ is Riemann integrable on $[a, b]$ if given $\varepsilon > 0$ there exists $D$ such that $S_D - s_D < \varepsilon$. Over all such partitions $D$ we have $\inf S_D = \sup s_D$ with the common value as the Riemann integral of $f$.

Now the sums $S_D$, $s_D$ may be regarded as the integrals (Lebesgue or Riemann) of step functions. When a sequence of partitions is chosen, each being a refinement of the previous one, then a monotone sequence of step functions is obtained and the limit theorems already obtained show that: if a function is bounded on a bounded interval and is Riemann integrable then it is Lebesgue integrable and the values of the integrals are the same. Of course many more functions are Lebesgue integrable than are Riemann integrable. Indeed if we examine the definition of the Riemann integral carefully we find that a function $f$ bounded on a finite interval is Riemann integrable if, and only if, $f$ is continuous a.e. (that is, its set of points of discontinuity has measure zero). This shows how restricted the class of Riemann integrable functions is and also shows that even to consider in detail the "elementary" Riemann integral we need to introduce measurable sets. The relationship is not so clear for "improper" Riemann integrals. For example, $\int_0^{\infty} x^{-1} \sin x\,dx$ exists conventionally as a Riemann integral, with a finite value, but when we try to find it as a Lebesgue integral we obtain a conditionally convergent series with $\int_0^{\infty} |x^{-1} \sin x|\,dx = \infty$, so $x^{-1} \sin x$ is not Lebesgue integrable.

Finally we note that this possibility of approximation by simpler functions which we have seen for Riemann integrable functions holds also for Lebesgue integrable functions. Indeed let $f$ be bounded and measurable on the finite interval $[a, b]$, so $f$ is integrable, and let $\varepsilon > 0$. Then there exists a step function $h$ such that $\int_a^b |f - h|\,dx < \varepsilon$ and a continuous function $g$ vanishing outside a finite interval (not necessarily identical with $[a, b]$) such that $\int_a^b |f - g|\,dx < \varepsilon$. Since, as observed after the dominated convergence theorem, $\int f\,dx$ may be regarded as the limit of integrals over finite intervals, and since in the same way the integral of an unbounded integrable function may be regarded as the limit of the integrals of bounded "truncated" functions tending to $f$, this result applies quite generally.
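The term-by-term integrations in Examples 1 and 3 of this section can be sanity-checked numerically. The Python sketch below is an illustration only and not part of the original treatment: the helper `midpoint_integral`, the grid size, and the choice of a plain midpoint rule (which never evaluates the integrand at the endpoints 0 and 1) are assumptions made here, and the agreement it reports is approximate rather than a proof.

```python
import math

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over (a, b) by the midpoint rule.

    Hypothetical helper for illustration; the midpoints avoid the
    endpoints, where log x or 1/(1 - x) would be undefined.
    """
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# Example 1: integral of x log x / (1 - x) over (0, 1)
# versus -sum_{n>=1} 1/(n+1)^2, which equals 1 - pi^2/6.
lhs1 = midpoint_integral(lambda x: x * math.log(x) / (1.0 - x), 0.0, 1.0)
rhs1 = 1.0 - math.pi ** 2 / 6.0

# Example 3: integral of sin x log x over (0, 1)
# versus sum_{n>=1} (-1)^n / ((2n) (2n)!), truncated after a few terms
# because the factorial makes the tail negligible.
lhs3 = midpoint_integral(lambda x: math.sin(x) * math.log(x), 0.0, 1.0)
rhs3 = sum((-1) ** n / ((2 * n) * math.factorial(2 * n)) for n in range(1, 12))

print(f"Example 1: integral ~ {lhs1:.6f}, series = {rhs1:.6f}")
print(f"Example 3: integral ~ {lhs3:.6f}, series ~ {rhs3:.6f}")
```

In both cases the interchange of sum and integral is what the theorems justify; the code only compares the two sides after the fact.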

IV. THE $L^p$ SPACES AND INEQUALITIES FOR INTEGRALS

We consider the class of real-valued measurable functions such that $|f|^p$ is integrable where $p$ is a fixed positive number. We also introduce the convention that functions equal almost everywhere are to be identified. With this convention the class described is the space $L^p(\mathbb{R})$, or $L^p(a, b)$ if we are restricting ourselves to functions with $|f|^p$ integrable on the interval $(a, b)$. The elements of $L^p(a, b)$ are classes of functions such that in each class any two functions are equal a.e. so that their integrals over all measurable subsets of $(a, b)$ are identical. This distinction between functions and classes of functions need not give rise to difficulty. For $p = 1$ we recover the integrable functions, with the convention stated. In addition we define $L^\infty(\mathbb{R})$ to be the essentially bounded functions of Section II with the same convention. We indicate later how these definitions extend to more general measures than Lebesgue measure.

It is easily seen that if $|f|^p$, $|g|^p$ are integrable so is $|af + bg|^p$ for $a$, $b$ constant. So $L^p(\mathbb{R})$ is a vector space for $0 < p < \infty$. Also from the inequality for $\operatorname{ess\,sup} |f + g|$ of Section II we have that $L^\infty(\mathbb{R})$ is a vector space. The spaces $L^p(a, b)$ are related in a simple way for $a$, $b$ finite: if $0 < p < q \le \infty$, then $L^q(a, b) \subset L^p(a, b)$.

The spaces $L^p(\mathbb{R})$ (or $L^p(a, b)$) for $1 \le p \le \infty$ have the special property of being "normed spaces." A space $X$ is said to be a real normed space if $X$ is a real vector space and we can define a nonnegative number $\|x\|$ for each $x \in X$ in such a way that $\|x\| = 0$ if, and only if, $x = 0$ (the zero of the vector space), and $\|ax + by\| \le |a|\,\|x\| + |b|\,\|y\|$ for any real numbers $a$, $b$. In the case of $L^p$ $(1 \le p < \infty)$ the norm of $f$ may be defined by $\|f\|_p = \bigl(\int |f|^p\,dx\bigr)^{1/p}$; in the case of $L^\infty$ the norm is defined by $\|f\|_\infty = \operatorname{ess\,sup} |f|$. In each case this norm is unaffected if the function changes value on a set of measure zero, so it is genuinely a number associated with an element of the vector space $L^p$. It may be easily seen that for $1 \le p \le \infty$, $\|f\|_p \ge 0$ and equals zero only if $f$ is the zero of $L^p$ (i.e., the class of functions vanishing a.e.). Also clearly $\|af\|_p = |a|\,\|f\|_p$. The subadditive property of the norm on $L^p$ (i.e., $\|f + g\|_p \le \|f\|_p + \|g\|_p$) is Minkowski's inequality (to which we refer again later). The same property for $L^\infty$ follows from the inequality $\operatorname{ess\,sup} |f + g| \le \operatorname{ess\,sup} |f| + \operatorname{ess\,sup} |g|$ referred to previously. The norm on $L^\infty(a, b)$ seems different from those on $L^p(a, b)$ for $1 \le p < \infty$ but a careful limiting argument shows that if $f \in L^\infty(a, b)$ ($a$, $b$ finite), so that $f \in L^p(a, b)$ for each $p$ in $(1, \infty)$, then $\lim_{p \to \infty} \|f\|_p = \|f\|_\infty$.

We now turn to three inequalities which use the idea of convex functions. The real-valued function $f$ is convex if for any two numbers $x$, $y$ and any number $t$, $0 \le t \le 1$, we have

$$f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y)$$

Geometrically this states that the segment joining the points $(x, f(x))$ and $(y, f(y))$ lies above the graph of $f$ between $x$ and $y$ (or more precisely, never lies below it). For example, $f(x) = e^x$ and $f(x) = x^2$ are convex functions. It is easily shown that if a function $f$ has a second derivative $f''$ which is positive then $f$ is convex. If $-f$ is convex then $f$ is said to be concave. Clearly $\log x$ $(x > 0)$ is concave.

We use the inequality

$$a^{1/p}\, b^{1/q} \le \frac{a}{p} + \frac{b}{q}$$

where $a, b, p, q > 0$ and $1/p + 1/q = 1$. On taking logs this is equivalent to

$$\frac{1}{p} \log a + \frac{1}{q} \log b \le \log\Bigl(\frac{a}{p} + \frac{b}{q}\Bigr)$$

which follows from the concavity of $\log x$. The inequality obtained is a generalized arithmetic-geometric mean inequality and substituting

$$a = \frac{|f|^p}{(\|f\|_p)^p}, \qquad b = \frac{|g|^q}{(\|g\|_q)^q}$$

and integrating we get Hölder's inequality. This states that if $1 < p < \infty$, $1 < q < \infty$, $1/p + 1/q = 1$, $f \in L^p(\mathbb{R})$ and $g \in L^q(\mathbb{R})$ then $fg \in L^1(\mathbb{R})$ and

$$\|fg\|_1 = \int |fg|\,dx \le \Bigl(\int |f|^p\,dx\Bigr)^{1/p} \Bigl(\int |g|^q\,dx\Bigr)^{1/q}$$

Here $p$, $q$ are described as conjugate indices; the special case $p = q = 2$ is called the Cauchy-Schwarz inequality. Of course instead of $\mathbb{R}$ we may consider integrals over any interval or any measurable set. Equality occurs in Hölder's inequality if, and only if, $a|f|^p + b|g|^q = 0$ a.e. for some constants $a$, $b$ not both zero.

If $p$, $q$ are conjugate and $p \to 1$ then $q \to \infty$ so we may regard $1$, $\infty$ as conjugate. A special case of Hölder's inequality holds here: if $f \in L^1(\mathbb{R})$ and $g \in L^\infty(\mathbb{R})$ then $fg \in L^1(\mathbb{R})$ and

$$\|fg\|_1 \le \|f\|_1\, \|g\|_\infty$$

This is quite easily proved directly.

With the help of Hölder's inequality we can easily prove Minkowski's inequality: if $p \ge 1$ and $f, g \in L^p(\mathbb{R})$ then

$$\|f + g\|_p \le \|f\|_p + \|g\|_p$$

Equality holds in the case $p > 1$ if, and only if, $af = bg$ a.e. where $a$, $b$ are nonnegative constants, not both zero.
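As a numerical illustration of Hölder's and Minkowski's inequalities, the sketch below replaces the integrals over $(0, 1)$ by midpoint sums on a uniform grid. The functions `f` and `g`, the exponent `p = 3`, the grid size, and the helper name `lp_norm` are assumptions chosen only for this illustration. Because a weighted finite sum is itself an integral with respect to a discrete measure, both inequalities hold for the sums themselves, up to rounding.

```python
import math

def lp_norm(values, p, h):
    """Discrete analogue of (integral of |f|^p dx)^(1/p) with cell width h.

    Hypothetical helper for illustration only.
    """
    return (h * sum(abs(v) ** p for v in values)) ** (1.0 / p)

n = 10_000
h = 1.0 / n
xs = [(k + 0.5) * h for k in range(n)]      # midpoints of (0, 1)

f = [x ** -0.25 for x in xs]                # unbounded near 0, yet |f|^p is integrable for p < 4
g = [math.cos(3.0 * x) for x in xs]         # changes sign on (0, 1)

p = 3.0
q = p / (p - 1.0)                           # conjugate index: 1/p + 1/q = 1

# Hölder: ||fg||_1 <= ||f||_p ||g||_q
holder_lhs = lp_norm([a * b for a, b in zip(f, g)], 1.0, h)
holder_rhs = lp_norm(f, p, h) * lp_norm(g, q, h)

# Minkowski: ||f + g||_p <= ||f||_p + ||g||_p
mink_lhs = lp_norm([a + b for a, b in zip(f, g)], p, h)
mink_rhs = lp_norm(f, p, h) + lp_norm(g, p, h)

print(f"Hölder:    {holder_lhs:.6f} <= {holder_rhs:.6f}")
print(f"Minkowski: {mink_lhs:.6f} <= {mink_rhs:.6f}")
```

Neither inequality is an equality here, since the chosen `f` and `g` are not proportional in the sense required by the equality conditions stated above.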

The special case of Minkowski’s inequality for $p = \infty$ has been referred to already.

Our third inequality is Jensen’s inequality. This states that if $f$ is a measurable function defined on the interval $[0, 1]$ and with values in the finite range $[a, b]$ and if $\varphi$ is a convex function on $[a, b]$ then

$$\varphi\left(\int_0^1 f\,dx\right) \le \int_0^1 (\varphi \circ f)\,dx.$$

The proof requires a careful examination of convex functions. In the case where $\varphi$ is strictly convex (i.e., the segment referred to in the definition is strictly above the graph), we have that equality occurs in Jensen’s inequality if, and only if, $f$ is constant a.e., having the value $\int_0^1 f\,dx$.

An important special case is obtained by assuming $h$ to be a nonnegative measurable function such that $\log h$ is integrable over $[0, 1]$ and taking $f$ to be $\log h$ and $\varphi(x)$ to be $e^x$: then we have

$$\exp\left(\int_0^1 \log h\,dx\right) \le \int_0^1 h\,dx.$$

By induction proofs we can show that Hölder’s and Minkowski’s inequalities extend to $n$ functions. For example, for the case $n = 3$, Hölder’s inequality reads: if $1 < p < \infty$, $1 < q < \infty$, $1 < r < \infty$, $1/p + 1/q + 1/r = 1$, $f \in L^p$, $g \in L^q$, and $h \in L^r$ then $fgh \in L^1$ and $\|fgh\|_1 \le \|f\|_p \|g\|_q \|h\|_r$.

Hölder’s inequality may be used to obtain inequalities for any integral of the product of functions. For example, to obtain an upper bound for $\int_0^{\pi/2} x^{-1/4} \cos x\,dx$ take $f(x) = x^{-1/4}$, $g(x) = \cos x$, $p = q = 2$; then $\|f\|_2 = \sqrt{2}\,(\pi/2)^{1/4}$ and $\|g\|_2 = (\pi/4)^{1/2}$, which gives the bound $(\pi/2)^{3/4}$. As a second example,

$$\left|\int_a^b f\,dx\right| \le \left(\int_a^b |f|^p\,dx\right)^{1/p} (b - a)^{1/q}$$

for conjugate $p$ and $q$ on taking $g(x) \equiv 1$. This shows that $L^p(a, b) \subseteq L^1(a, b)$ for $p > 1$ and in the special case $a = 0$, $b = 1$ is just Jensen’s inequality with $\varphi(x) = x^p$.

As for any normed space the norm on $L^p$ may be used to define a distance function with the “distance” between the functions $f$ and $g$ in $L^p$ defined as $d(f, g) = \|f - g\|_p$. Then Minkowski’s inequality states that the triangle inequality does indeed hold if $p \ge 1$, in which case $L^p$ is a metric space. Convergence in terms of this metric or distance function is called convergence in the mean of order $p$ ($p \ge 1$). So the sequence $\{f_n\}$ converges to $f$ in the mean of order $p$ if each $f_n \in L^p$ and $\|f_n - f\|_p \to 0$ as $n \to \infty$. The special case of $p = 1$ was referred to in Section III, in connection with Lebesgue’s dominated convergence theorem. The most important property of the $L^p$ spaces is that of completeness: every sequence which is a Cauchy sequence, in terms of the distance function $d$, converges. In the usual $L^p$ notation this may be restated as follows: if $1 \le p < \infty$ and $\|f_n - f_m\|_p \to 0$ as $n, m \to \infty$ then there exists a function $f \in L^p(\mathbb{R})$ such that $\|f_n - f\|_p \to 0$ as $n \to \infty$. We have in addition the existence of a subsequence $\{f_{n_i}\}$ such that pointwise convergence holds: $f_{n_i} \to f$ a.e. This result is proved by Fatou’s lemma and Minkowski’s inequality. The corresponding result in $L^\infty(\mathbb{R})$ is also true, with a more direct proof: if each $f_n \in L^\infty(\mathbb{R})$ and $\|f_n - f_m\|_\infty \to 0$ as $n, m \to \infty$ then there exists a function $f \in L^\infty$ such that $\lim f_n = f$ a.e. and $\|f - f_n\|_\infty \to 0$ as $n \to \infty$. These results need to be used with care: even if the functions $f_n$ are specified functions, the limit $f$ in $L^p$ is defined only as an element of $L^p$ and its value at any specific point cannot be obtained. The completeness of the space $L^2$ is of special importance in physical applications.

V. DIFFERENTIATION AND INTEGRATION

Differentiation and integration are closely connected; since we have extended the elementary notion of integral we must deal carefully with differentiation so that this relation continues to hold. The first point to note is that continuous functions are not very relevant here; indeed continuous functions which are nowhere differentiable are easily constructed. For example, on the interval $(0, 1)$ let $f_n(x)$ denote the distance from $x$ to the nearest number of the form $m/10^n$ where $m$ and $n$ are nonnegative integers. Then $f_n$ has a “sawtooth” graph with $10^n$ “teeth” and $\max f_n = \tfrac{1}{2} \cdot 10^{-n}$. So each $f_n$ is certainly continuous, and $\sum_n f_n(x)$ is uniformly convergent with sum $f(x)$, say. Then $f$ is continuous but for each $x \in (0, 1)$ by considering its decimal expansion we can show that the graph of $y = f(x)$ has no tangent at this point.

We now consider an important class of functions which although not necessarily continuous are well behaved as regards differentiability, namely, the functions of bounded variation. We suppose $f$ is defined and finite-valued on the finite interval $[a, b]$ and take a partition $a = x_0 < x_1 < \cdots < x_k = b$ of $[a, b]$. Then we form the sums

$$p = \sum_{i=1}^{k} \bigl(f(x_i) - f(x_{i-1})\bigr)^+$$

$$n = \sum_{i=1}^{k} \bigl(f(x_i) - f(x_{i-1})\bigr)^-$$

$$t = p + n = \sum_{i=1}^{k} |f(x_i) - f(x_{i-1})|$$

where we are using the notation $a^+ = \max(a, 0)$ and $a^- = \max(-a, 0)$. So $t, p, n \ge 0$ and $f(b) - f(a) = p - n$.
Taking upper bounds over all partitions of $[a, b]$, and keeping to the same function $f$ we let $P = \sup p$, $N = \sup n$, $T = \sup t$ and call these quantities the positive, negative, and total variations of $f$ over $[a, b]$. If $T$ is finite, $f$ is said to be of bounded variation over $[a, b]$ or to belong to the class $BV[a, b]$. Taking suprema over partitions the relations for $t$, $p$, $n$ give $T = P + N$ and $f(b) - f(a) = P - N$. Also, each of these variations is additive over intervals, so if $a < c < b$ then $T[a, b] = T[a, c] + T[c, b]$ and similarly for $P$, $N$. This follows immediately from the corresponding identities for $t$, $p$, $n$, and leads to the important result that a function $f$ is of bounded variation over $[a, b]$ if, and only if, it may be written as the difference of two finite-valued monotone increasing functions $g$ and $h$, say. These are obtained by defining $g(x) = P[a, x] + f(a)$ and $h(x) = N[a, x]$. The converse follows from the fact that any finite-valued monotone function is of bounded variation, and so therefore is the difference of two such functions. Indeed, the class of functions $BV[a, b]$ forms a vector space; linear combinations of functions of bounded variation are again functions of bounded variation.

Functions of bounded variation share the “good” properties of monotone functions. Since any finite-valued monotone increasing function is continuous except possibly on a countable set the same is true for functions $f \in BV[a, b]$, so in particular these functions are measurable. An example of a monotone-increasing function with a countable number of discontinuities is provided by letting $\{r_i\}$ be an enumeration of the rational numbers in $[0, 1]$ and defining $f(x) = \sum_{r_n \le x} 2^{-n}$. Then $f$ is discontinuous at each rational in the interval. For an example of a function $f$ not of bounded variation on $[0, 1]$ define $f$ arbitrarily at $x = 0$ and let $f(x) = \sin(1/x)$ otherwise; another example is given by $f(x) = x \sin(1/x)$, $x \ne 0$, $f(0) = 0$. This example shows directly that $f$ may be continuous but not of bounded variation.

Lebesgue proved that if $f \in BV[a, b]$, then $f$ is differentiable a.e. and its derivative is finite a.e. This is a significantly more difficult result to prove than the earlier results of this section. The two examples of the previous paragraph show that the converse of this theorem is not true.

We consider now indefinite integrals and write for any integrable function $f$, $F(x) = \int_a^x f\,dt$, so that $F$ is the indefinite integral of $f$ over the interval $[a, b]$, say, on which $f$ is integrable. Then Lebesgue’s dominated convergence theorem, applied to the family of functions $\chi_{[a, x]} f$ where $x \to x_0$ shows that $F$ is continuous. It also follows easily from the definitions that $F \in BV[a, b]$, its total variation being bounded by $\int_a^b |f|\,dt$. So $F'$ exists a.e. and it can easily be shown that $F' = f$ a.e. in $[a, b]$. We cannot expect $F$ to be everywhere differentiable (e.g., let $f$ be a step function, then $F'$ does not exist at the points of discontinuity). Nor can we assume that whenever $F'$ exists it equals $f$, for we may take a continuous function as $f$ (so that $F'$ exists at all points) and then change $f$ at a single point $x_0$. Then $F$ is unchanged but $F'(x_0)$ cannot equal $f(x_0)$.

Indefinite integrals, however, have the important property of being absolutely continuous. A function $f$ is said to be absolutely continuous on $[a, b]$ if given $\varepsilon > 0$ there exists $\delta > 0$ such that

$$\sum_{i=1}^{n} |f(x_i) - f(y_i)| < \varepsilon$$

whenever $\sum_{i=1}^{n} |x_i - y_i| < \delta$ for any finite set of disjoint intervals $(x_i, y_i)$ in $[a, b]$. Taking the special case $n = 1$ we see that absolutely continuous functions are continuous. Considering any partition of $[a, b]$ and introducing new partition points at a distance at most $\delta$ apart we can show that every absolutely continuous function is of bounded variation.

Now for any integrable function $f$ we have that $\int_E |f|\,dt$ tends to zero as $m(E) \to 0$. This is obvious for a bounded function $f$, and follows for an unbounded function since the integral of $f$ over any set is the limit of integrals of the functions $f_n$ where $f_n = f$ provided $|f| \le n$, $f_n = \pm n$ otherwise. From this it follows (with $E = \bigcup_{i=1}^{n} (x_i, y_i)$) that if $f$ is integrable over $[a, b]$ its indefinite integral is absolutely continuous there. It is a little more difficult to prove the converse: every absolutely continuous function is an indefinite integral, indeed it is the indefinite integral of its derivative, a function which we know to exist a.e. That the derivative of any function of bounded variation $f$ is measurable can be seen from the fact that it may be obtained as the limit of a sequence of ratios $g_n(x) = n(f(x + 1/n) - f(x))$ which are themselves measurable. So, on finite intervals, a function is an indefinite integral if, and only if, it is absolutely continuous.

VI. PRODUCT SPACES AND PRODUCT MEASURES

For any two spaces $X$ and $Y$ the Cartesian product $X \times Y$ is the set of ordered pairs $\{(x, y) : x \in X, y \in Y\}$. We will concern ourselves with the case $X = Y = \mathbb{R}$ so that $X \times Y$ is just the plane $\mathbb{R}^2$. However, the product notation is useful for subsets of $\mathbb{R}^2$. We call a set $E$ in $\mathbb{R}^2$ a rectangle if $E = A \times B$, $A \subseteq \mathbb{R}$, $B \subseteq \mathbb{R}$. For the purpose of measure and integration the important sets are the measurable rectangles: sets of the form $A \times B$ where $A$ and $B$ are measurable sets. These include as special cases the “genuine” rectangles where $A$ and $B$ are intervals.

From the basic measurable rectangles we can form the σ-algebra $\mathcal{M} \times \mathcal{M}$ which is the least σ-algebra containing the measurable rectangles. So within $\mathcal{M} \times \mathcal{M}$ we may
form complements and countable unions. It can be shown that if we take all finite unions of measurable rectangles and then take the smallest class of sets which contains these sets and is closed under the formation of countable increasing unions and countable decreasing intersections then we get just the σ-algebra $\mathcal{M} \times \mathcal{M}$. As we shall see this is crucial for the theory. We shall call the sets of $\mathcal{M} \times \mathcal{M}$ the measurable sets of the plane. An alternative definition extends $\mathcal{M} \times \mathcal{M}$ by also including all subsets of sets of measure zero, so as to get a complete measure space (see Section VII).

For any set $E$ in $\mathbb{R}^2$ define the $x$ section of $E$ to be the set $E_x = \{y : (x, y) \in E\}$ and the $y$ section of $E$ to be the set $E^y = \{x : (x, y) \in E\}$. These are sets in $\mathbb{R}$, not $\mathbb{R}^2$. It can be shown quite easily that measurable sets in the plane have measurable sections. We wish now to define the product measure on the sets of $\mathcal{M} \times \mathcal{M}$. This definition should provide, for the measurable rectangle $A \times B$, the measure $m(A)m(B)$. We use the result that if $E$ is a measurable set in the plane then $\varphi(x) = m(E_x)$ and $\psi(y) = m(E^y)$ are measurable functions of $x$ and $y$, respectively, and

$$\int \varphi\,dx = \int \psi\,dy \tag{1}$$

The common value of these integrals is then taken as the measure of $E$. It is clear that for the measurable rectangle $E = A \times B$, $\varphi(x) = \chi_A(x) m(B)$ so $\int \varphi\,dx = m(A)m(B)$. Also, $\psi(y) = \chi_B(y) m(A)$ so $\int \psi\,dy = m(A)m(B)$ also. So identity (1) holds for measurable rectangles and the measure obtained is the desired one. It holds similarly for finite unions of measurable rectangles. Using the Lebesgue monotone convergence theorem we obtain (1) for the union of monotone increasing sequences of sets $\{E_n\}$, as the sequences $\{\varphi_n\}$ and $\{\psi_n\}$ are similarly monotone. Similarly for a decreasing sequence of sets $\{F_n\}$, contained in a bounded rectangle, the result for the intersection follows from the Lebesgue dominated convergence theorem. Since the plane may be written as the union of a sequence of bounded rectangles

$$\mathbb{R}^2 = \bigcup_{n, m = -\infty}^{\infty} \{[n, n + 1) \times [m, m + 1)\}$$

the result follows for all the sets of $\mathcal{M} \times \mathcal{M}$ on adding together the results for the component pieces.

These definitions and the sketch proof just given work equally well with general measures $\mu$, $\nu$ say (see Section VII), with a measurable rectangle $A \times B$ having the measure $\mu(A)\nu(B)$. The only condition on the measures is that each coordinate space can be written as a countable union of sets of finite measure for $\mu$ or $\nu$, as the case may be, so that the product space may be decomposed into a sequence of sets of finite measure as for $\mathbb{R}^2$ in the preceding paragraph.

If $f$ is a function of $x$ and $y$ we may similarly define the $x$ section of $f$ as the function $f_x(y) = f(x, y)$ for each fixed $x$, and the $y$ section of $f$ as $f^y(x) = f(x, y)$ for each fixed $y$. Since measurable sets have measurable sections it follows easily that if $f$ is measurable with respect to $\mathcal{M} \times \mathcal{M}$ then $f_x$ and $f^y$ are measurable with respect to $m$.

For any nonnegative measurable function $f$ we have the result that the integral of $f$ may be expressed in terms of repeated integrals in either order and both of these integrals are defined. More precisely, let $f$ be nonnegative and measurable with respect to $\mathcal{M} \times \mathcal{M}$; write $\varphi(x) = \int f_x\,dy$, $\psi(y) = \int f^y\,dx$. Then $\varphi$ and $\psi$ are measurable and

$$\int \varphi\,dx = \iint f\,dx\,dy = \int \psi\,dy \tag{2}$$

where the middle integral is the Lebesgue integral of $f$ with respect to the product measure defined earlier in this section. Identity (2) has already been proved for the case of a function of the form $\chi_E$ by identity (1), and so for any nonnegative measurable simple function. Taking a sequence of such functions $f_n$ increasing monotonically to $f$, the sections $(f_n)_x \uparrow f_x$ and $(f_n)^y \uparrow f^y$ and the Lebesgue monotone convergence theorem gives identity (2).

We need now to extend (2) to functions not necessarily nonnegative and expect now to find some finiteness condition coming in. Now, identity (2) applied to $|f|$ states that $|f|$ is integrable if, and only if, each of the iterated integrals of $|f|$ (or, more precisely, of the sections of $|f|$) is finite, and then all three integrals are equal. In this case we write $f$ as the difference $f = f^+ - f^-$ of nonnegative measurable functions and we apply identity (2) to $f^+$, $f^-$, and their sections. On subtracting the results we get

$$\int dx \int f_x\,dy = \iint f\,dx\,dy = \int dy \int f^y\,dx \tag{3}$$

So from the remarks concerning $|f|$ we deduce Fubini’s theorem which states that if $f$ is a measurable function of $x$ and $y$ and either of the iterated integrals of $|f|$ is finite then so is the other, $|f|$ is integrable and identity (3) holds for $f$.

An important application of this theory is in connection with the Laplace and Fourier transforms of an integrable function. We will describe the application in the Fourier transform case; the other is similar. The following result allows us to define the convolution of two functions. Let $f$ and $g$ be integrable functions; then $f(y - x)g(x)$ is an integrable function of $x$ for almost all $y$ and if $h(y)$ is defined for these $y$ by $h(y) = \int f(y - x)g(x)\,dx$ then $h$ is integrable and $\|h\|_1 \le \|f\|_1 \|g\|_1$. To prove this we need to show that $f(y - x)g(x)$ is measurable with respect to $\mathcal{M} \times \mathcal{M}$. This is not very difficult and assuming
Now, as noted in Section II, Lebesgue measure is invariant under translation, so we may make a translation of variables without changing the integral. Thus

$$\int |f(y - x)|\,dy = \int |f(y)|\,dy$$

and

$$\|H\|_1 \le \|g\|_1 \|f\|_1.$$

So $H$ and hence $h$ is finite-valued a.e., $h$ is integrable, and $\|h\|_1 \le \|f\|_1 \|g\|_1$. Extending $(2\pi)^{-1/2} h$ by giving it any fixed value on the exceptional set we obtain the convolution

$$(f * g)(y) = (2\pi)^{-1/2} \int f(y - x)g(x)\,dx$$

of the integrable functions $f$ and $g$. We can easily show that $f * g = g * f$ a.e. and $(f * g) * h = f * (g * h)$ a.e. for any integrable functions $f$, $g$, $h$. For example, to prove the first identity, let $y$ be such that $g(y - x)f(x)$ is integrable with respect to $x$, so $y$ does not lie in the exceptional set of measure zero. For such $y$ let $t = y - x$ so that $\int g(y - x)f(x)\,dx$ becomes

$$\int g(t)f(y - t)\,dt = (2\pi)^{1/2} (f * g)(y)$$

as we may translate the variables.

For any integrable function $f$ we may define the Fourier transform as the function $\hat{f}$ given by

$$\hat{f}(s) = (2\pi)^{-1/2} \int e^{-ist} f(t)\,dt$$

Then $\hat{f}$ is a continuous function, by the Lebesgue dominated convergence theorem with $|f|$ as dominating function. Also, $|\hat{f}| \le (2\pi)^{-1/2} \int |f|\,dt$. The convolution and the Fourier transform are related by the identity

$$\widehat{f * g} = \hat{f}\,\hat{g}$$

true for integrable functions. To see this note that

$$\widehat{f * g}(s) = (2\pi)^{-1} \int dt \int e^{-ist} f(t - x)g(x)\,dx$$

Then the modulus of the integrand here, $|f(t - x)g(x)|$, is integrable by the argument given for the convolution. So by Fubini’s theorem we may interchange the order of integration and substitute $u = t - x$ to get

$$\widehat{f * g}(s) = (2\pi)^{-1} \int e^{-isx} g(x)\,dx \int e^{-isu} f(u)\,du = \hat{f}(s)\hat{g}(s)$$

as required.

The Fourier transform has been defined here for functions of $L^1(\mathbb{R})$. We may extend the definition by an approximation argument to the functions of $L^2(\mathbb{R})$ and show that $\|\hat{f}\|_2 = \|f\|_2$ (Parseval’s theorem). Note that the Fourier transform though continuous and bounded need not be integrable. But if it is (i.e., when both $f$ and $\hat{f}$ are integrable functions), we have the Fourier inversion theorem:

$$f(x) = (2\pi)^{-1/2} \int \hat{f}(t) e^{ixt}\,dt \quad \text{a.e.}$$

The proof is not very difficult and depends on Fubini’s theorem.

VII. GENERAL MEASURES

Measures arise in various ways and in various spaces and the theory outlined above can be applied in the main provided we have an appropriate class of sets defined to be measurable. So we shall suppose we have a set or space $X$ and on it a σ-algebra $\mathcal{S}$ of sets. On the sets of $\mathcal{S}$ our measure $\mu$ is defined and so $\mu$ is presumed to be a nonnegative set function which takes the value 0 on the empty set and which is countably additive (i.e., if $\{A_n\}$ is a sequence of disjoint sets of $\mathcal{S}$ we have $\mu(\bigcup_{n=1}^{\infty} A_n) = \sum_{n=1}^{\infty} \mu(A_n)$). Then the triple $\{X, \mathcal{S}, \mu\}$ is called a measure space. This measure space is said to be σ-finite if we may write $X$ as $X = \bigcup_{n=1}^{\infty} X_n$, $X_n \in \mathcal{S}$, and $\mu(X_n) < \infty$. It is said to be a complete measure space if for any set $E$ of $\mathcal{S}$ with $\mu(E) = 0$ every subset of $E$ also belongs to $\mathcal{S}$ (and then of course has measure zero also).

EXAMPLE 1. $X = \mathbb{R}$, $\mathcal{S} = \mathcal{M}$, $\mu = m$, gives Lebesgue measure on the real line. This is σ-finite and complete.

EXAMPLE 2. $X = \mathbb{R}$, $\mathcal{S}$ = Borel sets, $\mu = m$, gives Borel measure. This is also σ-finite but is not complete.

EXAMPLE 3. $X = \mathbb{R}^2$, $\mathcal{S} = \mathcal{M} \times \mathcal{M}$, $\mu = m \times m$, gives planar measure. This is σ-finite but not complete.

EXAMPLE 4. $X = \mathbb{N}$ the set of natural numbers; $\mathcal{S}$ is the class of all subsets of $\mathbb{N}$. Let $\sum a_n < \infty$ with $a_n \ge 0$ and define $\mu(E) = \sum a_k$ where the summation is over all integers $k$ in $E$. This measure is obviously complete and is finite for the whole space (and so trivially σ-finite).
Results such as $m(\lim E_i) = \lim m(E_i)$ hold as before for monotone sequences of sets. A real function $f$ is measurable if $f^{-1}(\alpha, \infty)$ belongs to $\mathcal{S}$, the family of measurable sets. Integration theory proceeds as before, the nonnegative simple function $\varphi = \sum_{i=1}^{n} a_i \chi_{A_i}$ having the integral $\int \varphi\,d\mu = \sum_{i=1}^{n} a_i \mu(A_i)$ as in Section III. Then the integral $\int f\,d\mu$ of a nonnegative measurable function is the least upper bound of such integrals for all such functions $\varphi \le f$. The integration of functions not necessarily nonnegative and of complex-valued functions is defined as before and the same theorems hold: Fatou’s lemma, the Lebesgue monotone convergence theorem, the Lebesgue dominated convergence theorem, and their variants for series. Of course the Riemann integral will not exist in general so comparisons cannot be made. Also the result noted in Section V holds in general: $\int_E |f|\,d\mu$ tends to zero as $\mu(E) \to 0$.

The constructions and inequalities obtained in Section IV will hold good for the general $L^p$ spaces $L^p(X, \mathcal{S}, \mu)$. So will the results of Section VI, as already noted, for a pair of measures $\mu$ and $\nu$ provided we assume these to be σ-finite.

For some purposes it is convenient to use a complete measure and it is an important fact that a measure may be extended in a unique way so as to be complete. We replace the σ-algebra $\mathcal{S}$ by the larger class $\bar{\mathcal{S}}$ of sets of the form $E \cup N$ where $E \in \mathcal{S}$, $E \cap N = \emptyset$, and $N \subseteq M$ where $M \in \mathcal{S}$ with $\mu(M) = 0$. Then we check that $\bar{\mathcal{S}}$ so defined is a σ-algebra and define $\bar{\mu}$ on $\bar{\mathcal{S}}$ by $\bar{\mu}(E \cup N) = \mu(E)$. This clearly extends $\mu$ and is easily seen to give an unambiguous definition. With this construction $\{X, \bar{\mathcal{S}}, \bar{\mu}\}$ is a complete measure space.

If $\mu(X) = 1$ the triple $\{X, \mathcal{S}, \mu\}$ is called a probability measure space. Then a particular set of terms is used, for historical reasons, and the emphasis is on a particular type of result. In particular for “almost everywhere” one writes “almost surely,” abbreviated a.s.; measurable functions are called random variables; their Fourier transforms $\varphi(x) = \int e^{ixt} f(t)\,d\mu(t)$ are called characteristic functions. Note that Jensen’s inequality will apply directly with $[0, 1]$ replaced by $X$; also that the product of probability spaces is again a probability space.

We shall consider now some further types of convergence which may be applied to sequences of functions. Let $\{f_n\}$ be a sequence of measurable functions and $f$ a measurable function (all on the measure space $\{X, \mathcal{S}, \mu\}$). Then $f_n$ tends to $f$ in measure if for every positive $\varepsilon$,

$$\lim_{n \to \infty} \mu\{x : |f_n(x) - f(x)| > \varepsilon\} = 0$$

If $\mu$ is a probability measure this is termed convergence in probability. It is easily seen that a sequence $\{f_n\}$ can tend at most to one limit function $f$ (up to sets of measure zero). Indeed the sequence $\{f_n\}$ defines the function $f$ a.e., in the sense that if $\{f_n\}$ is a Cauchy sequence with respect to convergence in measure so that for any $\varepsilon > 0$,

$$\lim_{n, m \to \infty} \mu\{x : |f_n(x) - f_m(x)| > \varepsilon\} = 0$$

then there exists a measurable function $f$ so that $f_n \to f$ in measure and also for some subsequence $\{n_i\}$, $f_{n_i} \to f$ a.e. This is proved by a careful choice of $\varepsilon$’s as elements of a convergent series.

An analog of Fatou’s lemma holds for convergence in measure: Let $\{f_n\}$ be a sequence of nonnegative measurable functions and let $f$ be a measurable function such that $f_n \to f$ in measure; then

$$\int f\,d\mu \le \liminf \int f_n\,d\mu$$

The proof depends on applying the original Fatou’s lemma of Section III to subsequences $\{f_{n_i}\}$ tending to $f$ a.e. There is a corresponding analog of the Lebesgue dominated convergence theorem. Let $\{f_n\}$ be a sequence of measurable functions such that $|f_n| \le g$, an integrable function, and let $f_n \to f$ in measure, where $f$ is measurable. Then $f$ is integrable, $\lim \int f_n\,d\mu = \int f\,d\mu$, and $\lim \int |f_n - f|\,d\mu = 0$ (i.e., $f_n \to f$ in the mean). This result is easily proved. Since there exists a subsequence $\{f_{n_i}\}$ with limit $f$ a.e., we have $|f| \le g$ so $f \in L^1(\mu)$. Also, for each $n$, $g + f_n \ge 0$ and $g + f_n \to g + f$ in measure, so by the version of Fatou’s lemma just given

$$\int g\,d\mu + \int f\,d\mu \le \liminf \int (g + f_n)\,d\mu$$

and

$$\int f\,d\mu \le \liminf \int f_n\,d\mu$$

Similarly, $g - f_n \ge 0$, whence

$$\int g\,d\mu - \int f\,d\mu \le \liminf \int (g - f_n)\,d\mu$$

and so

$$\int f\,d\mu \ge \limsup \int f_n\,d\mu \ge \liminf \int f_n\,d\mu \ge \int f\,d\mu$$

Therefore equality holds and the first result follows. Also it is easily seen that $|f_n - f| \to 0$ in measure. But $|f_n - f| \le 2g$ and so the second result follows from the first.

In $L^p(\mu)$, convergence in the sense of the norm is termed convergence in the mean of order $p$ (i.e., $f_n \to f$ in the mean of order $p$ ($p \ge 1$) if $\lim \|f_n - f\|_p = 0$). (It may be defined for any $p > 0$, but for $0 < p < 1$ the norm
notation is not appropriate.) This convergence and convergence in measure are easily shown to be related: if $f_n \to f$ in the mean of order $p$ then $f_n \to f$ in measure. For suppose not. Then there exist $\varepsilon > 0$, $\delta > 0$ such that $\mu\{x : |f_n(x) - f(x)| > \varepsilon\} > \delta$ for infinitely many $n$. But then $\int |f_n - f|^p\,d\mu \ge \varepsilon^p \delta$ for infinitely many $n$, contradicting convergence in the mean of order $p$.

Another important kind of convergence is almost uniform convergence. Let $\{f_n\}$ be a sequence of measurable functions and let $f$ be a measurable function. Then we say that $f_n \to f$ almost uniformly (abbreviated a.u.) if for any $\varepsilon > 0$ there exists a set $E$ with $\mu(E) < \varepsilon$ and such that on the complement $E^c$ of $E$, $f_n \to f$ uniformly (i.e., given $\varepsilon > 0$ there exists $N$ such that for $n > N$, $|f_n(x) - f(x)| < \varepsilon$ for all $x \in E^c$ ($N$ depending on $\varepsilon$ but not on $x$)). Obviously uniform convergence (on the whole space) implies almost uniform convergence (just take $E = \emptyset$ above). In the opposite direction, the sequence $\{x^n\}$ converges to zero almost uniformly on $[0, 1]$ but not uniformly.

If $f_n \to f$ a.u., however, then $f_n \to f$ in measure. For if not then there exist positive $\varepsilon$ and $\delta$ such that

$$\mu\{x : |f_n(x) - f(x)| > \varepsilon\} > \delta$$

for infinitely many $n$. But since there exists $E$ with $\mu(E) < \delta$ and with $f_n \to f$ uniformly on $E^c$ we get a contradiction. We can also show easily that if $f_n \to f$ a.u. then $f_n \to f$ a.e. For if $m$ is any positive integer we can find a set $E_m$ with $\mu(E_m) < 1/m$ and on $E_m^c$, $f_n \to f$ uniformly. But if $x \in \bigcup_{m=1}^{\infty} E_m^c$ we have $x \in E_N^c$, say, so $\lim f_n(x) = f(x)$. Since $\bigl(\bigcup_{m=1}^{\infty} E_m^c\bigr)^c = \bigcap_{m=1}^{\infty} E_m$, a set of measure zero, the result follows.

An important result in the opposite direction is given by Egorov’s theorem: on a space of finite measure convergence almost everywhere implies almost uniform convergence. We can prove this by writing for each pair of positive integers $k$, $n$,

$$E_{k,n} = \{x : |f_m(x) - f(x)| < 1/k \text{ for all } m \ge n\}$$

Then as $f_m \to f$ a.e., we have

$$\mu\Bigl(\bigcap_{n=1}^{\infty} E_{k,n}^c\Bigr) = 0$$

and since all measures are finite,

$$\mu(E_{k,n}^c) \to 0 \quad \text{as } n \to \infty$$

for each fixed $k$. So if $\varepsilon > 0$, for an appropriate $n_k$ we have $\mu(E_{k,n_k}^c) < \varepsilon/2^k$. Then if $E = \bigcap_{k=1}^{\infty} E_{k,n_k}$ we have $\mu(E^c) < \varepsilon$, and on $E$, for each $k$, $|f_m - f| < 1/k$ for $m \ge n_k$. So $f_n \to f$ a.u.

With a slight variation of this proof one can show that, for dominated sequences, convergence almost everywhere implies almost uniform convergence (i.e., if $f_n \to f$ a.e. where $|f_n| \le g$, an integrable function, then $f_n \to f$ a.u.).

Examples can be constructed showing that certain types of convergence do not in general imply certain other ones. For instance, let $\chi_n^i$ be the characteristic function of the interval $[(i - 1)/n, i/n]$, $i = 1, \ldots, n$, with $[0, 1]$ as the whole space and using Lebesgue measure and define $\{f_n\}$ to be the sequence $\chi_1^1, \chi_2^1, \chi_2^2, \chi_3^1, \chi_3^2, \chi_3^3, \ldots$. Then $f_n = \chi_k^i$ where $\tfrac{1}{2}k(k - 1) < n \le \tfrac{1}{2}k(k + 1)$, whence $\int f_n\,dx = 1/k \to 0$ as $n$ (and thus $k$) $\to \infty$, so that $f_n \to 0$ in the mean. But $f_n \not\to 0$ a.e.; indeed for no $x$ in $[0, 1]$ does $f_n(x) \to 0$. Nor does $f_n \to 0$ a.u. The space here, $[0, 1]$, has nevertheless finite measure and the sequence is dominated by the integrable function $\chi_{[0,1]}$.

VIII. THE RADON–NIKODÝM THEOREM AND SIGNED MEASURES

New measures arise from given measures in natural ways. For example, let $f$ be a nonnegative function measurable with respect to the measure $\mu$ on the σ-algebra $\mathcal{S}$, and define $\nu$ on $\mathcal{S}$ by

$$\nu(E) = \int_E f\,d\mu$$

Then $\nu$ is a nonnegative set function, vanishing on the empty set. That $\nu$ is countably additive follows from the corollary to the Lebesgue monotone convergence theorem. For if $\{E_n\}$ is a sequence of disjoint sets in $\mathcal{S}$ then

$$\nu\Bigl(\bigcup_{n=1}^{\infty} E_n\Bigr) = \int \sum_{n=1}^{\infty} \chi_{E_n} f\,d\mu = \sum_{n=1}^{\infty} \int \chi_{E_n} f\,d\mu = \sum_{n=1}^{\infty} \nu(E_n)$$

So $\nu$ is a measure. The two measures $\mu$, $\nu$ are related by the following continuity condition. If $\mu_1$, $\mu_2$ are any two measures on the σ-algebra $\mathcal{S}$ and $\mu_2(E) = 0$ whenever $\mu_1(E) = 0$ we say that $\mu_2$ is absolutely continuous with respect to $\mu_1$, and write $\mu_2 \ll \mu_1$. Obviously for the measure $\nu$ constructed above, $\nu \ll \mu$.

If we remove the restriction that $f$ be nonnegative we obtain measures with negative sign. So we have the following definition: a set function $\nu$ on the σ-algebra $\mathcal{S}$ is a signed measure if its values are real or infinite, $\nu$ takes at most one of the values $\infty$ and $-\infty$,

$$\nu(\emptyset) = 0 \quad \text{and} \quad \nu\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) = \sum_{i=1}^{\infty} \nu(E_i)$$
whenever $\{E_i\}$ is a disjoint sequence of sets of $\mathcal{S}$. Clearly every measure is a signed measure.

EXAMPLE. Let $f$ be an integrable function on $\{X, \mathcal{S}, \mu\}$. Then $\nu(E) = \int_E f\,d\mu$ defines a signed measure. Clearly only the countable additivity needs checking, so for $E \in \mathcal{S}$ write $\nu^+(E) = \int_E f^+\,d\mu$, $\nu^-(E) = \int_E f^-\,d\mu$. Then $\nu^+$, $\nu^-$ are (finite) measures and

$$\nu\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) = \nu^+\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) - \nu^-\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr) = \sum_{i=1}^{\infty} \int_{E_i} f^+\,d\mu - \sum_{i=1}^{\infty} \int_{E_i} f^-\,d\mu = \sum_{i=1}^{\infty} \int_{E_i} f\,d\mu = \sum_{i=1}^{\infty} \nu(E_i)$$

In this example the signed measure $\nu$ has a decomposition as the difference of measures and there is a corresponding decomposition of the space into the sets $\{x : f(x) \ge 0\}$ and $\{x : f(x) < 0\}$ on one of which $\nu$ acts like a measure and on the other $-\nu$ acts like a measure. More generally for any signed measure $\nu$ there is a decomposition $\nu = \nu_1 - \nu_2$ as the difference of measures for which $\nu_1$ and $\nu_2$ are mutually singular (written $\nu_1 \perp \nu_2$) (i.e., for some set $A \in \mathcal{S}$ we have $\nu_2(A) = \nu_1(A^c) = 0$). Then $\nu_1$ and $\nu_2$ are said to be the Jordan decomposition of $\nu$ and are uniquely defined by $\nu$. There is a corresponding decomposition of the space $X$ as the union of disjoint sets $A$, $B$ of $\mathcal{S}$ such that $\nu$ is nonnegative on the measurable subsets of $A$, $-\nu$ is nonnegative on the measurable subsets of $B$. This, the Hahn decomposition, is unique up to sets for which all subsets in $\mathcal{S}$ have zero $\nu$ measure. The example given above where $\nu$ is defined using an integrable function displays both the Jordan and Hahn decompositions. In the general case the Hahn decomposition is established first using an exhaustion argument to find the “largest” set on which $\nu$ acts like a nonnegative measure and the Jordan decomposition follows.

By analogy with the functions of bounded variation discussed in Section V the measure $\nu^+$ is called the positive variation of $\nu$, $\nu^-$ the negative variation, and $|\nu| = \nu^+ + \nu^-$ the total variation of the signed measure $\nu$. Obviously in the example above, $|\nu|$ is given by $|\nu|(E) = \int_E |f|\,d\mu$.

The converse to the situation presented in this example is provided by the Radon–Nikodým theorem. It states that if $\{X, \mathcal{S}, \mu\}$ is a σ-finite measure space and $\nu$ is a σ-finite measure such that $\nu \ll \mu$, then there exists a nonnegative measurable function $f$ on $X$ such that, for each $E \in \mathcal{S}$, $\nu(E) = \int_E f\,d\mu$. This $f$ is unique in the convention given before; any other function with the same property agrees with $f$ almost everywhere. The proof proceeds by reduction to the case for finite measures: write $X$ as the union of a sequence of disjoint sets on which both $\mu$ and $\nu$ are finite, find the corresponding function $f$ for each set, and put these functions together to get the corresponding “density function” on the whole of $X$. In the case of a finite measure, $f$ is obtained as the supremum of functions $g$ which are nonnegative, measurable, and satisfy $\int_E g\,d\mu \le \nu(E)$ for each $E$ in $\mathcal{S}$. However, taking the supremum of such functions involves an uncountable family of functions so a nonmeasurable function could apparently appear. The proof reduces the operation to a countable one. The resulting function $f$ can be regarded as the derivative of the measure $\nu$ with respect to $\mu$ and we write $f = d\nu/d\mu$. The theorem can be extended to signed measures. If $\mu$, $\nu$ are signed measures we say $\nu$ is absolutely continuous with respect to $\mu$ if $\nu(E) = 0$ whenever $|\mu|(E) = 0$. Then $\nu$ can be written in terms of a derivative with respect to the measure $\mu$ by writing $\nu = \nu^+ - \nu^-$, finding the derivatives of $\nu^+$ and $\nu^-$ separately, and taking $d\nu/d\mu$ as their difference.

The analogy between ordinary derivatives and the Radon–Nikodým derivative $d\nu/d\mu$ goes further. The chain rule applies: if $\lambda$, $\mu$, $\nu$ are σ-finite measures such that $\nu \ll \mu$ and $\mu \ll \lambda$ then $\nu \ll \lambda$ and

$$d\nu/d\lambda = (d\nu/d\mu)(d\mu/d\lambda)$$

where equality is in the sense that this equation must hold almost everywhere (in the sense of $\lambda$). If the measure $\nu$ is defined by the integral of a function $f$ with respect to $\mu$ this theorem takes the following form: if $\mu \ll \lambda$ where $\mu$ and $\lambda$ are σ-finite then there exists a measurable function $g$ such that if $f \in L^1(X, \mu)$ then $fg \in L^1(X, \lambda)$ and for each $E$

$$\nu(E) = \int_E f\,d\mu = \int_E fg\,d\lambda$$

If we take any nonnegative integrable function $f$ on the real line we may take a multiple of $f$ so as to get a function $p(x)$ such that $\nu(E) = \int_E p\,dx$ defines a probability measure with $p = d\nu/dm$ as the probability density. The Radon–Nikodým derivative of one probability measure with respect to another is important in connection with conditional probabilities and conditional expectations.

The Radon–Nikodým theorem shows that for a finite measure $\nu$ the relation $\nu \ll \mu$ is a genuine continuity property. For since $\nu(E) = \int_E f\,d\mu$ where $f$ is an integrable function we have the result noted in Section VII that $\nu(E) \to 0$ as $\mu(E) \to 0$; more precisely, given $\varepsilon > 0$ there exists $\delta > 0$ such that whenever $\mu(E) < \delta$ we have $\nu(E) < \varepsilon$.

Example 4 in Section VII exhibits the opposite property: the measure $\mu$ defined there is concentrated on a set of zero measure (the integer points) and so cannot be given as the
P1: ZCK Final Pages
Encyclopedia of Physical Science and Technology EN009B-413 July 18, 2001 0:38

Measure and Integration 245

integral of a density or derivative. If $\sum a_n = 1$ then we have a “discrete probability,” a frequent occurrence in elementary probability. A discrete probability of the type described and Lebesgue measure are mutually singular.

We can show that given any two σ-finite measures $\mu$ and $\nu$, we may decompose one of these, say $\nu$, as $\nu = \nu_0 + \nu_1$ where $\nu_0 \perp \mu$ and $\nu_1 \ll \mu$. This is the Lebesgue decomposition of $\nu$ with respect to $\mu$. It is easily proved using the Radon–Nikodým derivative $f$ of $\mu$ with respect to $\lambda = \mu + \nu$, so that $\mu(E) = \int_E f\,d\lambda$. If $A = \{x: f(x) > 0\}$ and $B = \{x: f(x) = 0\}$ we may write

$$\nu_0(E) = \nu(E \cap B), \qquad \nu_1(E) = \nu(E \cap A)$$

and obtain the desired decomposition. So the two constructions of new measures, one using densities and the other measures concentrated on sets of measure zero, together provide the most general σ-finite measures. The σ-finiteness is essential to these results, for examples can be constructed where absolute continuity holds but no derivative exists; such examples depend on non-σ-finite measures.

IX. EXTENSIONS OF THE DEFINITION OF MEASURE

In the last section we extended the idea of measure to include measures taking negative values. These arose in a natural way in connection with the integrals of functions taking positive and negative values. If as in Section III we allow functions to take complex values we obtain complex-valued measures. But just as complex-valued measurable functions may be dealt with by considering their real and imaginary parts separately, so the properties of a complex-valued measure $\mu = \mu_1 + i\mu_2$, where $\mu_1$ and $\mu_2$ are real measures, and the corresponding integration theory for $\mu$ may be deduced from that for $\mu_1$ and $\mu_2$.

More interesting considerations arise when measures are allowed to take values in a vector space. We shall consider measures taking values in the $n$-dimensional real vector space $\mathbb{R}^n$. (More general spaces could be considered but the essential results are already exhibited in $\mathbb{R}^n$.) An important definition is that of an atomic measure. A set $A$ in a measure space $\{X, \mathcal{S}, \mu\}$ is an atom for $\mu$ if $\mu(A) \neq 0$ and every measurable subset of $A$ has measure zero or $\mu(A)$. The range of the measure $\mu$ is the set of numbers $\{\mu(E): E \in \mathcal{S}\}$. Then the theorem of Liapounoff states that the range of a finite nonatomic measure with values in $\mathbb{R}^n$ is closed and convex. More precisely: let $\mu_1, \ldots, \mu_n$ be finite nonnegative nonatomic measures on $X$. Then the set of points in $\mathbb{R}^n$ of the form $(\mu_1(A), \mu_2(A), \ldots, \mu_n(A))$, with $A$ ranging over the measurable subsets of $X$, is closed and convex. There are various proofs. In one, we consider the set $W = \{g: 0 \le g \le 1\}$ of $L^\infty$ and the mapping $T$ from $W$ to $\mathbb{R}^n$ given by $Tg = (\int g\,d\mu_1, \ldots, \int g\,d\mu_n)$. The set $W$ is convex and closed in the appropriate sense. Let

$$\lambda = (\lambda_1, \ldots, \lambda_n) \in T(W).$$

Then if we show that $W_0 = T^{-1}\lambda$ contains a characteristic function the result is proved, since if $E$ and $F$ are measurable sets then for $0 \le t \le 1$

$$(t\mu_1(E) + (1 - t)\mu_1(F), \ldots, t\mu_n(E) + (1 - t)\mu_n(F))$$

lies in $T(W)$. Now for any convex set a point is an extreme point if it cannot be written as the convex combination of other points of the set. There is a general result that in $\mathbb{R}^n$ every closed bounded convex set has extreme points and may be considered as the set of convex combinations of these extreme points. Now $W_0$ is convex, closed, and bounded; the proof ends with the demonstration that the extreme points of $W_0$ consist of characteristic functions, using the nonatomic property of the measures $\mu_i$.

The result obtained is nontrivial even for ordinary measures ($n = 1$). It is obviously dependent on the nonatomic property: one could choose $X$ as a single point with measure 1, to get a trivial counterexample. The result outlined above has applications in control theory where one wishes to know that a convex combination of states of the system is again a state of the system. It is also of importance in the investigation of higher dimensional vector spaces.

X. EXTENSIONS OF MEASURES AND LEBESGUE–STIELTJES INTEGRALS

In Section II we saw how starting with a “measure” (the length) defined on intervals we can, using an outer measure, extend this measure to a σ-algebra containing the intervals. It is convenient to regard this as a two-stage process. First we extend the definition of length to give a measure on the set of finite unions of intervals, which form a ring, closed under finite unions and set differences. We then extend this measure to a σ-ring closed under countable unions and under set differences. If this σ-ring contains the whole space it is a σ-algebra. This passage from a ring to a σ-ring or σ-algebra can be done in general and the result is essentially unique. An important example of this construction is provided by the construction of Lebesgue–Stieltjes measures. Take a monotone increasing finite-valued function $g$ and for an interval $I = [a, b)$ define a measure on $I$ by $\mu_g(I) = g(b) - g(a)$. Now, measures have continuity properties since if $I_1 \subseteq I_2 \subseteq \cdots$ we have $\mu(\cup I_i) = \lim \mu(I_i)$. So if $I_i = (a_i, b_i)$ we see that
$g$ must be left-continuous (i.e., $g(x) \to g(x_0)$ whenever $x \to x_0$, $x < x_0$). Then $\mu_g$ is indeed a measure on the ring of finite unions of such intervals and extends to a measure on the Borel sets of the real line. We could, at least for the case $g$ bounded, have chosen intervals of the form $(a, b]$ and $g$ right-continuous. This form of the definition is more common in probability theory. The measure $\mu_g$ given by the construction outlined above is the Lebesgue–Stieltjes measure defined by $g$.

It is clear that points of discontinuity of $g$ correspond to atoms of the measure $\mu_g$, as defined in Section IX. Absolute continuity was defined for functions in Section V and for measures in Section VIII. These definitions are drawn together by the following result. Let $g$ be a monotone increasing absolutely continuous function. Then $g$ defines a measure on a σ-algebra. If we assume that this measure has been completed, as defined in Section VII, to get the complete extension $\mu_g$ then $\mu_g$ is defined on the σ-algebra $M$ of Lebesgue measurable sets and $\mu_g \ll m$. Indeed the Radon–Nikodým derivative of $\mu_g$ and the derivative of $g$ (which exists a.e.) correspond: for any $a$, $b$ we have $g(b) - g(a) = \int_a^b g'\,dt$ and so $g' = d\mu_g/dm$. In this case integrals with respect to $\mu_g$ reduce to integrals with respect to Lebesgue measure:

$$\int_E f\,d\mu_g = \int_E f g'\,dt.$$

These results are especially important in probability theory where the finiteness of the measure simplifies the results. Indeed let $\{X, \mathcal{S}, \mu\}$ be a probability measure space and $f$ a finite-valued measurable function, so $f$ is a random variable. Define a function $F$ by $F(x) = \mu f^{-1}(-\infty, x]$ (i.e., $F(x) = \mu\{t: f(t) \le x\}$, so that $F(-\infty) = 0$, $F(\infty) = 1$). Then $F$ is the distribution function of $f$ and is a monotone increasing right-continuous function. The measure $\mu_F$ it defines agrees with the measure $\mu f^{-1}$ on the intervals of $\mathbb{R}$.

XI. THE RADON–NIKODÝM PROPERTY FOR BANACH SPACES

In this section we consider measures taking their values in a Banach space, that is a normed space which is complete (the examples of $L^p$, $L^\infty$ were considered in Section IV). This is more general than the finite dimensional case considered in Section X. We consider first the question of integrating functions taking values in a Banach space, with respect to a real measure. If we then set $m(A) = \int_A f\,d\mu$ we see that such a theory of integration gives rise to a Banach space–valued measure. Which measures are formed in this way depends on the Banach space in question, since the existence of such a function $f$ implies that a version of the Radon–Nikodým theorem holds. The most useful version of this integral is the Bochner integral. First, we describe the details of such measures. Let $Y$ be a Banach space with norm $\|\cdot\|$. We will suppose that we have a space $X$ with measurable sets $\mathcal{S}$. Then $m: \mathcal{S} \to Y$ is a vector measure if $m(\emptyset) = 0$ and whenever $\{A_i\}$ is a countable family of disjoint sets of $\mathcal{S}$ then $m(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} m(A_i)$ where this sum is norm convergent, that is, the sequence of vectors $\sum_{i=1}^{n} m(A_i)$ converges in the normed space $(Y, \|\cdot\|)$ as $n \to \infty$. We will suppose that the space $X$ is equipped with a finite measure $\mu$: for some purposes it is more convenient to assume, as we will, that $\mu(X) = 1$, that is: $\mu$ is a probability measure. The average range of $m$ is the set in $Y$ given by $AR(m) = \{m(A)/\mu(A): A \in \mathcal{S}, \mu(A) > 0\}$. As in the scalar case we say the vector-valued measure $m$ is absolutely continuous with respect to $\mu$ ($m \ll \mu$) if $\mu(A) = 0$ implies $m(A) = 0$ (zero vector).

Then to set up a theory of integration we consider simple functions as before of form $f = \sum_{i=1}^{n} x_i \chi_{A_i}$ with $x_i \in Y$, $A_i \in \mathcal{S}$. Then for any set $A \in \mathcal{S}$, we can define for such an $f$, $\int_A f\,d\mu = \sum_{i=1}^{n} x_i \mu(A \cap A_i)$. Then generally a function $f: X \to Y$ is Bochner integrable if there is a sequence $\{f_n\}$ of simple functions with $\lim_{n\to\infty} f_n(x) = f(x)$ a.e. ($\mu$) and $\lim_{n\to\infty} \int \|f(x) - f_n(x)\|\,d\mu = 0$. This provides an unambiguous definition if we set $\int f\,d\mu = \lim \int f_n\,d\mu$, and we then write $f \in L^1_Y(X, \mathcal{S}, \mu)$, and we say $\int f\,d\mu$ is the Bochner integral of $f$. Regarding measurability, a $Y$-valued function is said to be strongly measurable if it is the limit a.e. of simple functions. A weaker requirement is that each scalar-valued function $F \circ f$, where $F$ is a continuous linear functional on $Y$, should be measurable in the usual sense. For functions $f$ with $\|f\|$ integrable and which satisfy a condition on the range (always true for integrable functions), the definitions are equivalent. Much of the theory extends; for example, we have a dominated convergence theorem: if $\|f_n(x)\| \le g(x)$ a.e. ($\mu$), where $g$ is integrable, and $\lim f_n(x) = f(x)$ a.e., then $f$ is integrable, $\int f\,d\mu = \lim \int f_n\,d\mu$ and $\lim \int \|f_n - f\|\,d\mu = 0$ (i.e., $f_n$ converges to $f$ in the mean).

The Radon–Nikodým property turns out to be very important when considering such vector-valued measures and functions; whether it holds depends on the geometry of $Y$. Let $K$ be a closed bounded convex set in the Banach space $Y$. Then $K$ has the Radon–Nikodým property (RNP) for $\{X, \mathcal{S}, \mu\}$ if for any $Y$-valued measure $m$ which is absolutely continuous with respect to $\mu$ and whose average range $AR(m)$ lies in $K$ there exists a function $f \in L^1_Y(X, \mathcal{S}, \mu)$ such that $m(A) = \int_A f\,d\mu$ for each $A \in \mathcal{S}$. More generally, if $E$ is a closed convex (possibly unbounded) set of $Y$ (e.g., $E = Y$), then $E$ has the RNP for $\{X, \mathcal{S}, \mu\}$ if each closed bounded subset $K$ of $E$ has the RNP for $\{X, \mathcal{S}, \mu\}$. Finally, $K$ has the RNP
if it has the RNP for each $\{X, \mathcal{S}, \mu\}$ for $\mu$ a probability measure. An example of a space without the RNP is at hand.

EXAMPLE 1. [Bourgin]: Let $X = [0, 1]$, $\mathcal{S}$ = Lebesgue measurable sets, $\mu$ = Lebesgue measure. Let $Y = L^1[0, 1]$ and define $m$ by $m(A) = \chi_A$ for each $A \in \mathcal{S}$. Then $m \ll \mu$ and $AR(m)$ lies in the closed unit ball of $Y$, but $Y$ does not have the RNP. For if it had, there would exist a function $f \in L^1_Y(\mu)$ with $m(A) = \int_A f\,d\mu$ for each $A \in \mathcal{S}$. So for each $x$, $f(x)$ is a real-valued measurable function taking the value $f(x)(s)$ at $s \in [0, 1]$. Let $I$ be an interval in $[0, 1]$, then for all $A \in \mathcal{S}$

$$\begin{aligned}
\int_A \int \chi_I(s)\, f(t)(s)\,ds\,dt &= \int \chi_I(s) \Bigl(\int_A f(t)\,dt\Bigr)(s)\,ds \\
&= \int \chi_I(s)\, m(A)(s)\,ds \\
&= \int \chi_I(s)\, \chi_A(s)\,ds \\
&= \int_A \chi_I(t)\,dt.
\end{aligned}$$

So

$$\int_I f(t)(s)\,ds = \int \chi_I(s)\, f(t)(s)\,ds = \chi_I(t)$$

outside a set of measure zero, and for such $t$ not in $I$, $\int_I f(t)(s)\,ds = 0$. Allowing $I$ to vary over all subintervals $I_r$ of $[0, 1]$ with rational end-points we get an exceptional set of measure zero. Any set on which the function $f(t)$ is positive can be approximated by an interval, and choosing a subinterval $I_r$ of this interval with $t \notin I_r$ we see that $f(t)$ must vanish a.e. for almost all $t$. This contradicts the fact that $\int_A f\,d\mu = \chi_A \neq 0$ for $A$ a set of positive measure.

The RNP is closely related to the convergence of martingales, which we now describe. Let $\{\mathcal{S}_n\}$ be a sequence of sub-σ-algebras of $\mathcal{S}$, with $\mathcal{S}_n \subseteq \mathcal{S}_m$ whenever $n < m$. Suppose $f_n \in L^1_Y(X, \mathcal{S}, \mu)$ for each $n$. Suppose also that the functions $f_n$ are strongly $\mathcal{S}_n$ measurable for each $n$ and that $\int_A f_n\,d\mu = \int_A f_m\,d\mu$ for each $A$ in $\mathcal{S}_n$ provided $n < m$. Then the sequence $\{f_n, \mathcal{S}_n\}$ is a $Y$-valued martingale. A closed bounded convex set $K$ in $Y$ has the martingale convergence property (MCP) for $\{X, \mathcal{S}, \mu\}$ if whenever $\{f_n, \mathcal{S}_n\}$ is a martingale such that $\cup_{n=1}^{\infty} \mathcal{S}_n$ generates $\mathcal{S}$ and $f_n \in L^1_K(X, \mathcal{S}_n, \mu)$ for each $n$ then there exists $f \in L^1_K(X, \mathcal{S}, \mu)$ such that $\lim_{n\to\infty} \|f_n(x) - f(x)\| = 0$ a.e. ($\mu$). The corresponding statements for closed convex sets and the definition of “$Y$ has MCP” follow exactly as for RNP. In fact a closed bounded convex set has the RNP if, and only if, it has the MCP. To see that a closed convex set $K$ has the MCP provided it has the RNP we let $m_n(A) = \int_A f_n\,d\mu$ where $\mu$ is a probability measure and $\{f_n, \mathcal{S}_n\}$ form a martingale and $A \in \mathcal{S}_n$. Since $\mu$ is a probability measure $AR(m_n)$ lies in $K$. Then $\{m_n(A)\}$ converges for each $A$ in $\cup \mathcal{S}_n$. By a limiting argument we get $\lim m_n(A) = m(A)$, which is a measure by the Vitali–Hahn–Saks theorem. By construction $m_n \ll \mu$ and in the limit $m \ll \mu$ and $AR(m) \subseteq K$. So by the RNP there exists $f \in L^1_Y(X, \mathcal{S}, \mu)$ such that $\int_A f\,d\mu = m(A)$ for each $A$ in $\mathcal{S}$. By the martingale property $\int_A f\,d\mu = \int_A f_n\,d\mu$ for each $A \in \mathcal{S}_n$ and by a theorem of Lévy we deduce $\lim_{n\to\infty} \|f_n(x) - f(x)\| = 0$ a.e. ($\mu$). This theorem of Lévy uses conditional expectations whose existence depends on the classical Radon–Nikodým result given in Section VII.

A third property of Banach spaces which turns out to be related is that of dentability. A bounded set $D$ in $Y$ is s-dentable if for each $\varepsilon > 0$ there exists $x_\varepsilon$ in $D$ with $x_\varepsilon \notin \text{s-co}(D \setminus U_\varepsilon(x_\varepsilon))$, where $U_\varepsilon(y)$ denotes the ball of radius $\varepsilon$ and center $y$ and

$$\text{s-co}\,B = \Bigl\{\sum_{i=1}^{\infty} \alpha_i x_i : x_i \in B,\ \alpha_i \ge 0,\ \sum_{i=1}^{\infty} \alpha_i = 1\Bigr\}.$$

Replacing s-co by closed convex hull, we get the stronger definition of $D$ dentable. Drawing a diagram we see that $D$ dentable implies that $D$ is in some sense rotund for some part of its boundary. Dentability can be defined equivalently in terms of slices. For a bounded set $D$ in $Y$, the slice is $s(D, f, \alpha) = \{x: x \in D,\ f(x) > \sup_{y \in D} f(y) - \alpha\}$ where $\alpha > 0$, and $f$ is a continuous linear functional on $Y$. Then it is an easy consequence of the Hahn–Banach theorem that $D$ is dentable if, and only if, it has slices of arbitrarily small diameter. Now it follows easily from the definitions that if $D$ is dentable then it is s-dentable. But the converse is not true, as the following example shows.

EXAMPLE 2. Let $Y = C[0, 1]$, the Banach space of functions continuous on $[0, 1]$ with the norm $\|f\| = \max |f(x)|$, and let $D$ be the unit ball $U_1[0]$ of $Y$. Then with $f(x) \equiv 1$ as $x_\varepsilon$, for any $\varepsilon > 0$, we see that $D$ is s-dentable. However, $D$ is not dentable, for if $f \in D$ then for any fixed $n$ we choose functions $f^n_1, \ldots, f^n_n$ with $f^n_i(t) = f(t)$ for $t \notin [(i-1)/n, i/n]$ and $|f^n_i(t) - f(t)| > 1/2$ for some $t \in ((i-1)/n, i/n)$, for each $i$. Then $\|f^n_i - f\| > 1/2$ but $\|\sum_{i=1}^{n} (1/n) f^n_i - f\| \le 2/n$. So $f \in \overline{\mathrm{co}}(D \setminus U_{1/2}(f))$. So $D$ is not dentable. However, it can be shown by a geometrical argument that if $K$ is a closed convex set in $Y$ with interior (int $K$) nonempty then if $K$ is not dentable int $K$ is not s-dentable. See Davis and Phelps for details. So every bounded set of $Y$ is dentable if, and only if, every bounded set is s-dentable. This turns out to be the important property. Indeed, if a closed bounded convex set $K$ has every subset s-dentable then $K$ has the RNP.
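The norm estimate quoted in Example 2 can be checked pointwise. What follows is only a sketch, under the assumptions (suggested by the construction but not spelled out in the text) that every $f^n_i$ lies in the unit ball $D$ and agrees with $f$ outside the $i$-th subinterval $[(i-1)/n, i/n]$; by continuity each $f^n_i$ then also agrees with $f$ at the endpoints of that subinterval, so at any $t$ at most one index $j$ contributes.

```latex
% Sketch under the stated assumptions: for fixed t, at most one
% index j has f_j^n(t) \neq f(t), and both functions lie in the unit ball.
\[
\Bigl| \frac{1}{n}\sum_{i=1}^{n} f^n_i(t) - f(t) \Bigr|
  = \frac{1}{n}\Bigl| \sum_{i=1}^{n} \bigl( f^n_i(t) - f(t) \bigr) \Bigr|
  = \frac{1}{n}\bigl| f^n_j(t) - f(t) \bigr|
  \le \frac{1}{n}\bigl( \|f^n_j\| + \|f\| \bigr)
  \le \frac{2}{n}.
\]
% Taking the maximum over t in [0,1] gives the norm bound 2/n, so f is a
% norm limit of convex combinations of points at distance > 1/2 from f.
```

Since the bound tends to zero as $n \to \infty$ while each $f^n_i$ stays outside $U_{1/2}(f)$, $f$ belongs to the closed convex hull of $D \setminus U_{1/2}(f)$, which is the failure of dentability claimed in the example.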
To outline the proof: consider the special case when $m$ is in the form

$$m(A) = \sum_{i=1}^{n} x_i\, \mu(A \cap B_i),$$

with $\{B_i\}$ being a partition of $X$ into sets of positive measure and with $x_i \in K$. Then for $\mu(A) > 0$ and $A \subseteq B_i$ we have

$$\frac{m(A)}{\mu(A)} = x_i = \frac{m(B_i)}{\mu(B_i)}.$$

Set $f = \sum_{i=1}^{n} x_i \chi_{B_i}$; then for each $A$ in $\mathcal{S}$ we have

$$\int_A f\,d\mu = \sum_{i=1}^{n} x_i\, \mu(A \cap B_i) = m(A).$$

So $f = dm/d\mu$ and $K$ has the RNP. In general $m$ will not be of this simple form, but we can approximate it by a sequence of such measures, obtaining a convergent sequence of such derivatives $f_n$. So we need to know that we can partition $X$ into sets $B_i$ such that for subsets $B$ of $B_i$, with $\mu(B) > 0$, and for a suitable $x_i$ in $K$ we have $\|m(B)/\mu(B) - x_i\| < \varepsilon$. Then we use a sequence of such partitions with $\varepsilon$’s tending to zero to obtain the approximation. So we let $E = \{m(B)/\mu(B): \mu(B) > 0,\ B \subseteq B_i\}$, a subset of $K$ as $AR(m) \subseteq K$. So $E$ is s-dentable and so we can find $x_i \notin \text{s-co}(E \setminus U_\varepsilon(x_i))$, with $x_i \in E$, $x_i = m(B)/\mu(B)$, say. Suppose we can find sets $C$ in $B_i$ with $\|m(C)/\mu(C) - x_i\| \ge \varepsilon$. Maximizing such a family of disjoint sets $C$ we could write $B_i = \cup_{j=1}^{\infty} C_j$ and then

$$x_i = \frac{m(B)}{\mu(B)} = \sum_{j=1}^{\infty} \frac{\mu(C_j)}{\mu(B)}\, x_j \quad\text{with}\quad x_j = \frac{m(C_j)}{\mu(C_j)} \in E \quad\text{and}\quad \sum_{j=1}^{\infty} \frac{\mu(C_j)}{\mu(B)} = 1,$$

contradicting the s-dentability of $E$.

It can be shown fairly easily that if $K$ has the MCP, then its subsets are s-dentable (Bourgin). Suppose not, so there is a subset $D$ of $K$ which is not s-dentable. Then a nonconvergent martingale can be constructed inductively, using the interval $[0, 1]$, $\mu$ = Lebesgue measure. So there exists a positive $\varepsilon$ such that for each $x \in D$, $x \in \text{s-co}(D \setminus U_\varepsilon(x))$. At the first stage choose $x_0 \in D$, define $f_0 = x_0$, a constant function, and let $\mathcal{S}_0 = \{[0, 1), \emptyset\}$, the minimal σ-algebra. By the property of $D$ there exist $\{t_j: 0 < t_j < 1,\ \sum t_j = 1\}$ and points $y_j$ of $D$ with $\|x_0 - y_j\| \ge \varepsilon$, $\sum t_j y_j = x_0$. Partition $[0, 1]$ into half-open intervals $B_j$, one for each $t_j$, with $\mu(B_j) = t_j$. Let $\mathcal{S}_1$ be the σ-algebra generated by $\{B_j\}$, and $f_1 = \sum_{j=1}^{\infty} y_j \chi_{B_j}$. At the next stage of the induction we partition each $B_j$ to get a larger σ-algebra, and a function $f_2$. By the non-s-dentable property of $D$, the martingale so constructed will not converge.

This result establishes the equivalence of the properties RNP, MCP, dentability of subsets and s-dentability of subsets for a Banach space $Y$. Such spaces, to some extent, have the nice properties of finite dimensional spaces. Among the various other properties of such spaces is the fact that they possess the Krein–Milman property (KMP) referred to in Section IX; that is, for any closed bounded convex set $K$ in $Y$, $K$ is the closed convex hull of its extreme points. Indeed we saw earlier that the space $C[0, 1]$ has a unit ball $U_1[0]$, which is not dentable: $U_1[0]$ has just two extreme points, the constant functions $+1$ and $-1$, so it does not possess the KMP. To prove the KMP it is sufficient to show that every bounded convex set has an extreme point. For dentable sets this can be done using the fact that they have slices of arbitrarily small diameter. A nested decreasing sequence of such slices is found, the intersection of which yields the desired extreme point. For further details, related results, and references see Bourgin (1983) and Phelps (1988). We have seen that the spaces $C[0, 1]$, $L^1[0, 1]$ do not have the RNP. However, the spaces $L^p$ of Section IV for which $1 < p < \infty$ have the RNP, as have all reflexive spaces. It is not known whether the KMP implies the RNP. Another important property of finite-dimensional spaces is that convex functions are differentiable almost everywhere. To see how this extends, we say that the real-valued function $\varphi$ on a Banach space $Y$ is Fréchet differentiable at $y$ if there exists a linear functional $\varphi'(y)$ on $Y$ such that for every $\varepsilon > 0$ there exists $\delta > 0$ such that $|\varphi(y + x) - \varphi(y) - \varphi'(y)(x)| \le \varepsilon \|x\|$ whenever $\|x\| \le \delta$. Then the space $Y$ is said to be an Asplund space if every continuous convex function $\varphi$ on a nonempty open set $D$ in $Y$ is Fréchet differentiable at each point $y$ of some dense set $E$ of $D$ where the set $E$ is a $G_\delta$ set, that is: $E = \cap_{i=1}^{\infty} G_i$ where the sets $G_i$ are dense open sets of $D$. Then the space $Y$ is an Asplund space if, and only if, the Banach space $Y^*$ of continuous linear functionals on $Y$ has the RNP.

XII. MEASURE AND FRACTALS

Fractals are often introduced and thought of in pictorial terms. However, the underlying measure theory is important. We start with Hausdorff measure on $\mathbb{R}$. This is defined in two steps: first define the “approximating measure”: $H^*_{s,\delta}(A) = \inf \sum \ell(I_k)^s$ where the infimum is taken over all coverings of the set $A$ by intervals $\{I_k\}$ with $\ell(I_k) \le \delta$. Then let $H^*_s(A) = \lim H^*_{s,\delta}(A)$ as $\delta \to 0$. This limit exists and defines Hausdorff outer measure. This is an outer measure, $H^*_s(\{x\}) = 0$ for each point, $H^*_s(A + x) = H^*_s(A)$ for each set $A$, so it is invariant under translation and, significantly, it “scales” in this way: $H^*_s(kA) = k^s H^*_s(A)$ for any positive $k$. Then $H^*_s$ is a metric outer measure,
that is if two sets $A$ and $B$ are a positive distance apart, $H^*_s(A \cup B) = H^*_s(A) + H^*_s(B)$. Then Borel sets are $H^*_s$ measurable in the usual sense: the proof requires a little care in showing that intervals are $H^*_s$ measurable. It is easily seen that $H^*_1$ is just Lebesgue measure. Hausdorff measure, $H_s$, defined by $H^*_s$, is used to analyze sets of zero Lebesgue measure. For it can be shown fairly easily that if $H^*_s(A) < \infty$ then $H^*_q(A) = 0$ for $q > s$. So if $0 < H^*_s(A) < \infty$ then $H^*_q(A) = \infty$ for $q < s$. So we can define the (Hausdorff) dimension of a set $E$ by: $\dim(E) = \inf\{s: H^*_s(E) = 0\}$. This provides a dimension, usually noninteger, for Borel sets in $\mathbb{R}$. All this may be generalized with, in the definition, $\ell(I_k)^s$ being replaced by $h(\ell(I_k))$ with a suitable function $h$, for example a monotonic increasing function $h(t)$ for $t \ge 0$ with $h(0+) = h(0) = 0$.

Hausdorff dimension can be difficult to calculate. The real numbers have dimension 1, finite sets of points have dimension zero. For the Cantor set, $P$, described in Section II, the calculation is straightforward. This set is “self-similar” in the sense that the subsets in the closed intervals $J_{1,1}$, $J_{1,2}$ left after the first “open third” $I_{1,1}$ has been removed are copies of the Cantor set scaled by $1/3$ and translated. So using the “scaling property” of $H$ described above, we get $H_s(P) = (2/3^s)\, H_s(P)$ and so $s = \log 2/\log 3$, provided we can show that $0 < H_s(P) < \infty$. This part needs care in transforming an arbitrary covering of $P$ into coverings of the subsets of $J_{1,1}$, $J_{1,2}$ mentioned in Section III, and using the metric outer measure property. This method and result generalize immediately to the “Cantor-like set” $P_\xi$ obtained when we remove, not the middle third, but a central open interval of positive length $1 - 2\xi$ at the first stage, with the residual intervals being of length $\xi$ at stage one, $\xi^2$ at stage two, etc. The same argument gives the Hausdorff dimension of $P_\xi$ as $-\log 2/\log \xi$, a number between 0 and 1. This shows that sets in $\mathbb{R}$ exist with all possible positive values of Hausdorff dimension. In forming the Cantor set we removed numbers whose expansion to base 3 contained a 1. If we remove instead those containing a 2 or those containing a zero, we get somewhat different sets whose Hausdorff dimensions were found by Best. Similar results for expansions to other bases are to be found in Weymann. Sets of finite positive $H_s$ measure are called s-sets, or fractals in this context, though some writers use “fractals” to refer to sets which are self-similar at all levels of magnification.

This generalizes easily to two dimensions: from a closed square one deletes a central open cross to leave four closed squares placed at the corners: these are then similarly reduced, etc. The resulting “Cantor square” along with many examples of self-similar sets are considered in the very accessible book of Lauwerier. In two dimensions one may consider the Hausdorff dimension of a curve. With some restrictions to avoid pathological space-filling curves one finds that the curve $C$ of length $L$ has dimension one and Hausdorff measure $H_1(C) = L$ (Falconer, p. 24). This relates to the functions of bounded variation discussed in Section V. In three dimensions we may consider the motion of minute particles in a liquid. This erratic motion is modeled probabilistically by Brownian motion. In Falconer (p. 144) we have that Brownian paths are s-sets of dimension one, with probability one. Applications to many areas are to be found in Mattila, which also gives a comparison of different definitions of “dimension” and contains a sizeable bibliography.

SEE ALSO THE FOLLOWING ARTICLES

CALCULUS • COMPLEX ANALYSIS • CONVEX SETS • DIFFERENTIAL EQUATIONS, ORDINARY • FRACTALS • INTEGRAL EQUATIONS

BIBLIOGRAPHY

Best, E. P. (1992). On sets of fractional dimension III. London Math. Soc. 47(2), 436–454.
Bourgin, R. D. (1983). “Geometric Aspects of Convex Sets with the Radon–Nikodým Property” (Lecture Notes in Mathematics, 993), Springer-Verlag, Berlin.
Cohn, D. L. (1980). “Measure Theory,” Birkhäuser, Boston and Basel.
Davis, W. J., and Phelps, R. R. (1974). The Radon–Nikodým property and dentable sets in Banach spaces. Proc. Amer. Math. Soc. 45, 119–122.
De Barra, G. (1981). “Measure Theory and Integration,” Ellis Horwood, Chichester, England.
Diestel, J., and Uhl, J. J. (1977). “Vector Measures,” Mathematical Surveys 15. Amer. Math. Soc., Providence, RI.
Dinculeanu, N. (1967). “Vector Measures,” Pergamon, London.
Dunford, N., and Schwartz, J. T. (1958). “Linear Operators, Part I,” Interscience Publications Inc., New York.
Falconer, K. J. (1985). “The Geometry of Fractal Sets,” Cambridge Univ. Press, Cambridge, U.K.
Gelbaum, B. R. (1982). “Problems in Analysis,” Springer-Verlag, New York.
Hewitt, E., and Stromberg, K. (1965). “Real and Abstract Analysis,” Springer-Verlag, New York.
Lauwerier, H. (1991). “Fractals,” Penguin Books, Baltimore.
Mattila, P. (1995). “Geometry of Sets and Measures in Euclidean Spaces, Fractals and Rectifiability,” Cambridge Univ. Press, Cambridge, U.K.
Pesin, I. N. (1970). “Classical and Modern Integration Theories,” Academic Press, New York.
Pfeffer, W. F. (1977). “Integrals and Measures,” Dekker, New York.
Phelps, R. R. (1988). “Convex Functions, Monotone Operators and Differentiability,” Lecture Notes, University of Washington, Seattle.
Rogers, C. A. (1970). “Hausdorff Measures,” Cambridge Univ. Press, London and New York.
Rudin, W. (1966). “Real and Complex Analysis,” McGraw-Hill, New York.
Weymann, H. (1971). Das Hausdorff-Mass von Cantormengen, Math. Ann. 193, 7–20.
Wheeden, R. L., and Zygmund, A. (1977). “Measure and Integral,” Dekker, New York.
P1: GNH Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN010D-486 July 14, 2001 18:50

Nonlinear Programming
Jon W. Tolle
University of North Carolina at Chapel Hill

I. Overview
II. Theoretical Aspects
III. Computation
IV. Applications

GLOSSARY

Constraint functions Functions that are used to define the feasible set.
Decision vector Vector of unknown variables on which the objective function is defined.
Feasible set Set of decision vectors that satisfy the constraints and over which the objective function is defined.
Global optimal solution Particular choice of the decision vector that solves the optimization problem.
Lagrangian function Function of the decision vector and the multipliers with critical points that are related to optimal solutions.
Local optimal solution Particular choice of the decision vector that yields the least value of the objective function in some relatively open subset of the feasible set.
Multipliers Coefficients of the constraint gradients in the first-order necessary conditions.
Objective function Function that is to be minimized.
Sequential quadratic programming Computational technique for solving nonlinear programs.

NONLINEAR PROGRAMMING is the study and solution of an optimization problem in which a nonlinear function of several variables is minimized over the solution set of a finite number of functional inequalities and equations. The problem can be written as:

$$\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to:}\quad & g_i(x) \le 0, \quad i = 1, \ldots, M \\
& h_j(x) = 0, \quad j = 1, \ldots, P \\
& x \in \mathbb{R}^N,
\end{aligned}$$

where the objective function $f$ and the constraint functions, $g_i$ and $h_j$, are (usually) continuously differentiable functions. The components of the vector $x$ are called the decision variables. Such nonlinear problems find a wide range of applicability in the physical and social sciences, engineering, and the decision sciences.

I. OVERVIEW

A. Optimization and Nonlinear Programming

A significant proportion of the mathematical models formulated and studied by scientists and engineers involve some form of optimization. For physical scientists, nature seems inherently to be an optimizing force; the laws of conservation and equilibrium can be interpreted as

Encyclopedia of Physical Science and Technology, Third Edition, Volume 10

Copyright C 2002 by Academic Press. All rights of reproduction in any form reserved. 583
optimality conditions. Engineers build controlling devices that perform with minimal energy expenditure and design plant schedules that maximize some measure of production efficiency. Economists and other social scientists theorize that much of human behavior and development, whether individual or collective, is guided by optimization processes. Finally, modern managerial scientists use optimization methods to organize and reduce large amounts of data to make rational decisions on policy issues in the public and private sectors.

Mathematically, the most general form of an optimization model can be characterized by a set $X$ and a real-valued function $f$ defined on $X$. The set $X$, often called the feasible set, consists of the possible realizations of the model. For example, $X$ could consist of a set of permissible trajectories for a satellite launch vehicle, the possible flight schedules for an international cargo airline, or a set of possible purchases by a consumer in a given market. Each $x \in X$ is a vector, the components of which are termed the decision variables of the model. The function $f$, called the objective function, is a measure by which the vectors in $X$ are compared. Thus for these examples, $f(x)$ might represent the energy required to traverse trajectory $x$, the total cost for the implementation when $x$ is an airline schedule, or the value of a consumer’s purchase when $x$ is the vector with components that are quantities of commodities.

The problem to be solved in an optimization model is

$$\text{minimize } f(x), \quad x \in X.$$

That is, an $x^* \in X$ is sought such that for all other $x \in X$, $f(x^*) \le f(x)$. The fact that the problem is formulated as a minimization rather than a maximization is not important because the above is mathematically equivalent to

$$\text{maximize } -f(x), \quad x \in X.$$

The problem as stated is much too general to yield useful information. In order to be able to analyze the problem, more must be known about the properties of the set $X$ and the function $f$. The general field of optimization is broken down into subfields according to these properties. Nonlinear programming is one of these subfields; traditionally, it refers to the case in which the set $X$ is assumed to be functionally prescribed. That is, $X$ is the intersection of a finite number of sets each of which is the solution set of a functional equation or inequality. The equations or inequalities that prescribe the feasible set are called constraints. In an

Using a functional representation of the constraint set the standard nonlinear programming problem, hereafter denoted NLP, can be formulated as

$$\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to:}\quad & g_i(x) \le 0, \quad i = 1, \ldots, M, \\
& h_j(x) = 0, \quad j = 1, \ldots, P, \\
& x \in \mathbb{R}^N.
\end{aligned}$$

The functions $g_i$ and $h_j$ are called the constraint functions and, like $f$, are generally assumed to be continuously differentiable. Nonlinear programming problems are distinguished from the special types of optimization problems called linear programming problems, in which the objective and constraint functions are affine. It is also conventional to separate nonlinear programming, which requires the decision vectors to be finite-dimensional, from problems such as those of optimal control, in which $X$ is a subset of an infinite-dimensional space. However, the computation of the solutions to these latter problems depends on the solution of finite-dimensional approximations (see Section IV.C).

B. The Solution of a Nonlinear Program

A vector, $x^* \in X$, satisfying $f(x^*) \le f(x)$ for all $x \in X$ is called a (global) optimal solution to NLP and the corresponding number $v^* = f(x^*)$ is called the optimal value of NLP. One of the distinguishing characteristics of nonlinear programming is that local optimal solutions are possible. Hence, $x_L \in X$ is a local optimal solution if there is an $\varepsilon > 0$ such that $f(x_L) \le f(x)$ for all $x \in X \cap \{x : |x - x_L| < \varepsilon\}$ ($|z|$ represents the Euclidean length of the vector $z$). Unfortunately, for many nonlinear programs, global optimal solutions cannot be identified until all local optimal solutions are known. Hereafter, the term “optimal solution” will refer to a local solution unless specified otherwise.

The major theoretical questions to be answered concerning a given NLP relate to the existence, characterization, and stability of solutions. Existence refers to the determination of conditions on the objective and constraint functions under which global and local solutions are guaranteed to exist. The characterizations of a solution are theoretical conditions on a point $x$ that are either necessary or sufficient for it to be a global or local solution.
optimization model, an inequality constraint may be used These characterizations are important for identifying the
to represent the limits on an available resource or may be a solution(s) and its properties and in the construction of
production quota that must be exceeded. An equality con- algorithms for computing the solution. Stability refers to
straint might represent the requirement that a consumer the sensitivity of the solution with respect to the pertur-
spend or save all of his or her income or that a trajectory bation of the parameters which define the objective and
begin from a designated location. constraint functions. Stability is important in the practical
use of optimization because these parameters are usually not known precisely. For instance, they might be the observed mean of historical data or they might be the result of a physical measurement with its attendant errors. It is desirable, in these cases, to have assurances that the optimal solution and optimal value are not wildly inaccurate as a result of small errors in the determination of these parameters.

In most cases, it is the numerical values of an optimal solution that are ultimately needed. Thus, the computational question of how best to numerically approximate an optimal solution is as important as the questions of theory. A computationally efficient numerical algorithm that will yield a good approximate solution to NLP must be available. Consequently, the techniques of numerical analysis and computational mathematics play a major role in nonlinear programming.

The following simple one-dimensional example illustrates the major theoretical questions in nonlinear programming:

$$\begin{aligned}
\text{minimize}\quad & f(x)\\
\text{subject to:}\quad & x - \alpha \ge 0,\\
& x - \beta \le 0,
\end{aligned}$$

where $f$ is twice differentiable and $0 < \alpha < \beta$. Since the objective function is continuous and $X = [\alpha, \beta]$ is compact, the existence of a global solution is guaranteed. Elementary calculus shows that any local minimum point must satisfy $f'(x) = 0$ if $\alpha < x < \beta$, $f'(x) \ge 0$ if $x = \alpha$, or $f'(x) \le 0$ if $x = \beta$. These necessary conditions for optimality are examples of the characterizations of the solution. Note that $\{\alpha < x < \beta,\ f'(x) = 0,\ f''(x) > 0\}$ is a sufficient (but not necessary) set of conditions for $x$ to be a local minimum. For an example of the ideas of stability, specify $f(x) = xe^{-x}$, $\alpha = 1/2$, and $\beta > 1$. The global optima are

$$x^* = \begin{cases} \alpha, & \beta \le \beta^*,\\ \beta, & \beta \ge \beta^*, \end{cases}$$

where $\beta^* > 1$ satisfies

$$\beta^* e^{-\beta^*} = (1/2)e^{-(1/2)}.$$

Moreover the optimal value as a function of $\beta$ is

$$v^*(\beta) = \begin{cases} (1/2)e^{-(1/2)}, & \beta \le \beta^*,\\ \beta e^{-\beta}, & \beta \ge \beta^*. \end{cases}$$

$v^*(\beta)$ is a continuous function of the parameter $\beta$ and

$$\frac{dv^*}{d\beta} = \begin{cases} 0, & \beta < \beta^*,\\ (1 - \beta)e^{-\beta}, & \beta > \beta^*. \end{cases}$$

Thus, the global optimal value is differentiable except when $\beta = \beta^*$.

C. Special Types of Nonlinear Programs

The set of nonlinear programs can also be subdivided further into classes associated with special properties of the objective and constraint functions. Some of the more common classes are described in this section.

In theory, the simplest nonlinear program is the unconstrained program (i.e., the problem in which there are no constraints) where the objective function is minimized over all of $\mathbb{R}^N$. These problems are important because they include many of the important least-squares models and also because the computational algorithms for solving them are used as the foundations for algorithms for solving the more general problems. Among the constrained nonlinear programs, the easiest ones to deal with are the convex programs. In these problems every local optimal solution is also a global optimal solution (and the solution is unique when the objective function is strictly convex); much of the theory is analogous to that found in linear programming. Of special interest among the convex programs are the convex quadratic programs. In these problems, the objective function has the form

$$f(x) = \tfrac{1}{2} x^t A x + a^t x,$$

where $A$ is an $N \times N$ symmetric positive definite matrix and $a$ is a fixed $N$-vector, and all of the constraint functions are affine. As will be seen, one approach to solving a general NLP is to approximate it by a quadratic program.

Two classes of problems that are currently at the forefront of research in the field of nonlinear programming are nondifferentiable and large-scale programs. Nondifferentiable programs are those for which the objective function is only piecewise differentiable. This class has many applications, one of the most important of which is the case in which $f$ is itself the solution of an optimization problem; for example,

$$f(x) = \max_{\alpha \in A} \{(c^\alpha)^t x\},$$

where $\{c^\alpha\}$ is a family of vectors. The term large-scale problems refers to problems in which some or all of the parameters $N$, $M$, and $P$ are large. The theory for this class is not changed but special computational techniques are necessary to ensure that solving such a problem is feasible in terms of computer time and storage.

Although there are important applications and results associated with each of these special types of nonlinear programs, the theory and computation presented in the remainder of this article deal primarily with the general problem; the major exception is the specialization of the theory to convex programs.
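The one-dimensional stability example of Section I.B can be checked numerically. The following is a minimal sketch, not part of the original article: it assumes only the stated data $f(x) = xe^{-x}$ and $\alpha = 1/2$, locates $\beta^*$ by bisection (the bracket $[1, 10]$ and the helper names are illustrative choices), and compares the closed-form optimal value with a brute-force grid search over $[\alpha, \beta]$.

```python
import math

ALPHA = 0.5  # lower bound alpha from the example


def f(x):
    """Objective of the example, f(x) = x * exp(-x)."""
    return x * math.exp(-x)


def beta_star(lo=1.0, hi=10.0, tol=1e-12):
    """Solve beta * exp(-beta) = f(ALPHA) for beta > 1 by bisection.

    f is strictly decreasing on (1, inf), so the root is unique there.
    The bracket [1, 10] is an assumption made for this sketch.
    """
    target = f(ALPHA)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_value(beta):
    """Closed form v*(beta): for beta > 1 the minimum sits at an endpoint."""
    return min(f(ALPHA), f(beta))


def optimal_value_grid(beta, n=20001):
    """Brute-force check of v*(beta) on a uniform grid over [ALPHA, beta]."""
    xs = (ALPHA + (beta - ALPHA) * i / (n - 1) for i in range(n))
    return min(f(x) for x in xs)


B_STAR = beta_star()

if __name__ == "__main__":
    print("beta* is approximately", round(B_STAR, 4))
    for beta in (1.2, 1.5, 2.0, 3.0):
        print(beta, optimal_value(beta), optimal_value_grid(beta))
```

For $\beta$ below $\beta^*$ the optimal value stays at $f(\alpha)$, and above it the value follows $\beta e^{-\beta}$, which is the kink in $v^*(\beta)$ described in the example.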
FIGURE 1 Level sets of the objective function with local and global minima. (Plot labels: level sets $L(f, v_L)$ and $L(f, v^*)$; points $x^L$ and $x^*$.)
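As a small illustration of the convex quadratic objective $f(x) = \tfrac{1}{2}x^tAx + a^tx$ from Section I.C, the sketch below treats the unconstrained case, where setting the gradient $Ax + a$ to zero gives the linear system $Ax = -a$. The $2 \times 2$ matrix, the vector, and the helper names are made-up illustrative data, not taken from the article.

```python
import random

# Illustrative data (assumed): A is symmetric with leading minors 4 and 11,
# hence positive definite.
A = [[4.0, 1.0],
     [1.0, 3.0]]
a = [-1.0, 2.0]


def quad(x):
    """f(x) = 0.5 * x^t A x + a^t x for the 2-by-2 data above."""
    Ax = [A[0][0] * x[0] + A[0][1] * x[1],
          A[1][0] * x[0] + A[1][1] * x[1]]
    return 0.5 * (x[0] * Ax[0] + x[1] * Ax[1]) + a[0] * x[0] + a[1] * x[1]


def grad(x):
    """Gradient of f, namely A x + a."""
    return [A[0][0] * x[0] + A[0][1] * x[1] + a[0],
            A[1][0] * x[0] + A[1][1] * x[1] + a[1]]


def solve_2x2(M, rhs):
    """Solve a 2-by-2 linear system by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * rhs[0] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det]


# The unconstrained minimizer solves A x = -a.
x_star = solve_2x2(A, [-a[0], -a[1]])


def never_lower(trials=100, seed=0):
    """Sample random points and confirm none has a smaller objective value."""
    rng = random.Random(seed)
    base = quad(x_star)
    for _ in range(trials):
        y = [x_star[0] + rng.uniform(-1, 1), x_star[1] + rng.uniform(-1, 1)]
        if quad(y) < base - 1e-12:
            return False
    return True
```

Because $A$ is positive definite, $f(x^* + d) - f(x^*) = \tfrac{1}{2}d^tAd > 0$ for every nonzero $d$, which is what the random sampling in `never_lower` probes.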

II. THEORETICAL ASPECTS

A. The Geometry

The theoretical part of nonlinear programming is based on the geometry of the feasible set $X$ and the underlying geometry of the objective function. This geometry can be used to motivate the basic theorems of nonlinear programming (presented later in this section) and also the algorithms used to solve the problems.

The level sets of a function will be important in the following development. Let $w$ be a real-valued function defined on $\mathbb{R}^N$. For each real number $\alpha$, the following level sets are defined:

$$L(w, \alpha) = \{x : w(x) = \alpha\},$$

$$L^-(w, \alpha) = \{x : w(x) \le \alpha\}.$$

With this notation, the feasible set for NLP can be written as a finite intersection of these level sets:

$$X = \left[\bigcap_{i=1}^{M} L^-(g_i, 0)\right] \cap \left[\bigcap_{j=1}^{P} L(h_j, 0)\right].$$

Because $f$, the $g_i$, and the $h_j$ are assumed to be continuous, the level sets and, therefore, $X$ will all be closed sets. Now $v^*$ is an optimal value for NLP if the set

$$X^* = L(f, v^*) \cap X$$

is nonempty and for each $v < v^*$, the sets

$$L(f, v) \cap X$$

are empty. Each $x^* \in X^*$ is then an optimal solution of NLP. Similarly, $x^l \in X$ is a local optimal solution for NLP with local optimal value $v_l = f(x^l)$ if for some $\varepsilon > 0$ the set

$$L(f, v) \cap X \cap \{x : |x - x^l| < \varepsilon\}$$

is empty for each $v < v_l$. An example of the geometry is illustrated in Fig. 1.

The set of optimal solutions is simplified when the feasible set and the level sets $L^-(f, \alpha)$ are convex sets. Mathematically, a subset $C \subset \mathbb{R}^N$ is convex if, for every pair of vectors $x$ and $y$ in $C$ and all $\lambda \in [0, 1]$, the vector $\lambda x + (1 - \lambda)y$ is also in $C$. Thus a convex set is one that has no indentations or protuberances. If the feasible set and
FIGURE 2 Global minimum for a convex program. (Plot labels: level sets $L(f, v)$ and $L(f, v^*)$; point $x^*$.)

the level sets are all convex then, for any local optimal solution $x^l$ with optimal value $v_l$, the sets

$$L^-(f, v) \cap X$$

are empty for all $v < v_l$ and hence there are no local solutions that are not also global solutions. This situation is illustrated in Fig. 2.

These geometric conditions characterize the optimal solutions completely. However, they are not very useful in practice because the level sets cannot be easily mapped. In the next section, these geometric conditions are transformed into algebraic conditions, which will in turn yield qualitative and quantitative results for the nonlinear programs.

B. First-Order Conditions

In order to establish first-order optimality conditions the assumption that the objective and constraint functions are continuously differentiable will be used. This allows the functions to be approximated linearly about a given vector, which in turn leads to a local polyhedral approximation of the feasible set.

Given a continuously differentiable function $w$ defined on $\mathbb{R}^N$ and a vector $\hat{x}$, the affine (linear) approximation to $w$ at $\hat{x}$ is

$$\hat{w}(x) = w(\hat{x}) + \nabla w(\hat{x})^t (x - \hat{x}),$$

where $\nabla w(\hat{x})$ is the gradient of $w$, that is, the $N$-dimensional column vector of partial derivatives of $w$ at
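FIGURE 3 Linear approximation of the feasible set. (Plot labels: level sets $L(g_1, 0)$ and $L(g_2, 0)$, their linearizations $L(\hat{g}_1, 0)$ and $L(\hat{g}_2, 0)$, the point $\hat{x}$, and the approximating set $\hat{X}$.)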

$\hat{x}$. If the objective and constraint functions in NLP are replaced by their affine approximations at a point $\hat{x} \in X$ then the resulting optimization problem is a linear program with a polyhedral feasible set $\hat{X}$, which serves as a local approximation of NLP (although the approximation may be poor). Figure 3 illustrates this approximation in a simple inequality-constrained case.

It can easily be shown that in the inequality-constrained case, $\hat{x}$ is an optimal solution to the approximating linear program if and only if the vector $-\nabla f(\hat{x})$ is a nonnegative combination of the gradients of the active constraints at $\hat{x}$ (i.e., those constraints that have value 0 at $\hat{x}$). As is seen in Fig. 4, this condition means that $\hat{f}$ cannot have a direction of decrease into the interior of $\hat{X}$ from $\hat{x}$ and hence $-\nabla f(\hat{x})$ must lie in the cone $\hat{K}$ generated by the gradients of the active inequality constraints. If equality constraints are present, the situation is slightly more complicated in that $-\nabla f(\hat{x})$ must be a combination of all of the active gradients with the coefficients of the active inequality constraint gradients being nonnegative and the coefficients of the equality constraints being unrestricted in sign; that is,

$$\nabla f(\hat{x}) + \sum_{i=1}^{M} \mu_i \nabla g_i(\hat{x}) + \sum_{j=1}^{P} \omega_j \nabla h_j(\hat{x}) = 0$$

for some $\mu_i \ge 0$ and some $\omega_j$.

The question that now arises is the extent to which the optimality of $\hat{x}$ for the approximating problem is related to the optimality of $\hat{x}$ for NLP. The answer is contained in the following fundamental result of nonlinear programming.

Theorem 1 (First-Order Necessary Conditions): Let $\hat{x} \in \mathbb{R}^N$ and suppose that the set of active constraint gradients at $\hat{x}$,

$$\{\nabla g_i(\hat{x}) : i \text{ such that } g_i(\hat{x}) = 0\} \cup \{\nabla h_j(\hat{x}),\ j = 1, \dots, P\},$$

is linearly independent. Then $\hat{x}$ is a local optimal solution only if there exist vectors $\hat{\mu} \in \mathbb{R}^M$ and $\hat{\omega} \in \mathbb{R}^P$ such that

$$\nabla f(\hat{x}) + \sum_i \hat{\mu}_i \nabla g_i(\hat{x}) + \sum_j \hat{\omega}_j \nabla h_j(\hat{x}) = 0, \tag{1}$$

$$\hat{\mu}_i\, g_i(\hat{x}) = 0, \quad i = 1, \dots, M, \tag{2}$$

$$\hat{\mu}_i \ge 0, \quad i = 1, \dots, M, \tag{3}$$

$$g_i(\hat{x}) \le 0, \quad i = 1, \dots, M, \tag{4}$$

$$h_j(\hat{x}) = 0, \quad j = 1, \dots, P. \tag{5}$$

The condition on the linear independence of the active constraint gradients is called a constraint qualification. It assures that the geometry at $\hat{x}$ is not so pathological
that the set $\hat{X}$ is not a good local approximation of $X$. If this condition holds and $\hat{x}$ is a local optimal solution of NLP, then it is a global optimal solution to the linear approximating problem and hence conditions in Eqs. (1) and (3) hold. It is also a consequence of this constraint qualification that for a given $\hat{x}$ the corresponding $\hat{\mu}$ and $\hat{\omega}$ are unique. Conditions in Eq. (2) are called the complementary slackness conditions; they merely express the condition that $\hat{\mu}_i = 0$ if $g_i(\hat{x}) < 0$ [i.e., inactive constraints do not affect Eq. (1)]. Conditions in Eqs. (4) and (5) are just the feasibility conditions on $\hat{x}$.

FIGURE 4 $-\nabla f(\hat{x})$ in the cone defined by the active constraint gradients. (Plot labels: linearized level sets $L(\hat{g}_1, 0)$ and $L(\hat{g}_2, 0)$, gradients $\nabla g_1(\hat{x})$, $\nabla g_2(\hat{x})$, and $\nabla f(\hat{x})$, the point $\hat{x}$, the set $\hat{X}$, and the cone $\hat{K}$.)

Note that this characterization of $\hat{x}$ is a necessary condition only. It is easy to construct examples for which $\hat{x}$ satisfies Eqs. (1)–(5) but is not a local optimal solution for NLP. This is a consequence of using only linear approximations that are too coarse to reflect the true state of affairs. In order for these first-order conditions to always be sufficient for optimality as well as necessary, the problem must be convex.

A real-valued function $w$ defined on $\mathbb{R}^N$ is said to be convex if for every $x$ and $y$ in $\mathbb{R}^N$ and every $\lambda$, $0 \le \lambda \le 1$, the following inequality holds:

$$w(\lambda x + (1 - \lambda)y) \le \lambda w(x) + (1 - \lambda)w(y).$$

It can easily be established that if $w$ is convex then the level set $L^-(w, \alpha)$ is a convex set for every $\alpha$ for which it is nonempty and that $L(w, \alpha)$ is convex if and only if $w$ is an affine function. For these reasons, NLP is called a convex program if the functions $f$ and $g_i$, $i = 1, \dots, M$, are convex and the functions $h_j$, $j = 1, \dots, P$, are affine. If NLP is convex, then $X$ and $L^-(f, \alpha)$ are convex sets and, as indicated in the preceding section, every local optimal solution is a global optimal solution. The consequences of these facts are summarized in the following theorem.

Theorem 2: Let NLP be a convex program and assume that there is an $x^c \in X$ with $g_i(x^c) < 0$ for $i = 1, \dots, M$. Then $\hat{x}$ is a global optimal solution of NLP if and only if there are vectors $\hat{\mu}$ and $\hat{\omega}$ such that Eqs. (1)–(5) hold.

The condition involving $x^c$ is a standard constraint qualification for convex programs, sometimes called the Slater condition. Theorems 1 and 2 are important in that they severely limit the set of possible optimal points for NLP. These conditions are the higher-dimensional analogues of the conditions given in the example of Section I.B for locating a minimum of a nonlinear function of one variable on a closed interval. Most computational algorithms for solving NLP generate $\hat{x}$, $\hat{\mu}$, and $\hat{\omega}$ that approximately satisfy Eqs. (1)–(5).

To illustrate the application of the necessary conditions, consider the convex quadratic program

$$\begin{aligned}
\text{minimize}\quad & \tfrac{1}{2} x^t A x + a^t x\\
\text{subject to:}\quad & Bx - b \le 0,\\
& Dx - d = 0,
\end{aligned}$$

where $A$ is an $N \times N$ positive definite matrix, $B$ is an $M \times N$ matrix, and $D$ is a $P \times N$ matrix of rank $P$. This is a convex program, and assuming that the constraint qualification of Theorem 2 holds, the solution to this problem is the solution of the conditions

$$\begin{aligned}
Ax + B^t \mu + D^t \omega + a &= 0,\\
\mu^t (Bx - b) &= 0,\\
\mu &\ge 0,\\
Bx - b &\le 0,\\
Dx - d &= 0.
\end{aligned}$$

If there are no inequality constraints then this system reduces to the $(P + N)$-dimensional linear system

$$\begin{aligned}
Ax + D^t \omega &= -a,\\
Dx &= d,
\end{aligned}$$

which has a unique solution; namely, $(x^*, \omega^*)$, the global optimal solution and its multiplier. Thus, in this case
solving the NLP reduces to solving a linear system of equations.

Historically, the necessary conditions in Theorem 1 are a part of classical mathematics for the case when there are only equality constraints; the $\hat{\omega}_j$ are known as Lagrange multipliers and Eq. (1) is then a direct consequence of the implicit function theorem. The inequality-constrained case was treated by Karush, Kuhn, and Tucker and the theory is sometimes referred to by their names. It should be noted that a more general necessary condition can be derived without assuming any constraint qualification. This result, called the Fritz John condition, differs from that of Theorem 1 in that there is a coefficient, possibly zero, in front of $\nabla f(x)$ in Eq. (1).

C. Second-Order Conditions

In this section, it is assumed that the objective and constraint functions are twice continuously differentiable. For unconstrained nonlinear functions, second-order conditions are easily derived. Suppose $\hat{x}$ is a critical point of $f$ (i.e., $\nabla f(\hat{x}) = 0$) and let $Hf(\hat{x})$ denote the Hessian matrix of $f$ at $\hat{x}$, the symmetric matrix of second-order partial derivatives. Then $\hat{x}$ is an unconstrained local minimum of $f$ if $Hf(\hat{x})$ is positive definite and only if $Hf(\hat{x})$ is positive semidefinite.

To state analogous conditions for the constrained problem, the Lagrangian function is introduced. This function of the $N + M + P$ variables $(x, \mu, \omega)$, defined by

$$L(x, \mu, \omega) = f(x) + \sum_i \mu_i g_i(x) + \sum_j \omega_j h_j(x),$$

is central to the theoretical development of the subject. Conditions in Eqs. (1)–(5) show that $(\hat{x}, \hat{\mu}, \hat{\omega})$ is a critical point of $L(x, \mu, \omega)$ with respect to $x$ and $\omega$ and satisfies

$$\frac{\partial L}{\partial \mu_i}(\hat{x}, \hat{\mu}, \hat{\omega}) = g_i(\hat{x}) \le 0.$$

The second-order conditions for NLP are given in Theorems 3 and 4. In these theorems, $I(x)$ will represent the index set of active inequality constraints at $x$; that is,

$$I(x) = \{i : g_i(x) = 0\}.$$

Theorem 3 (Second-Order Necessary Conditions): Let $\hat{x}$ satisfy the conditions of Theorem 1 with corresponding multipliers $\hat{\mu}$ and $\hat{\omega}$. If $\hat{x}$ is a local minimum then for any vector $z$ satisfying

$$\begin{aligned}
\nabla g_i(\hat{x})^t z &= 0, \quad i \in I(\hat{x}),\\
\nabla h_j(\hat{x})^t z &= 0, \quad j = 1, \dots, P,
\end{aligned}$$

it is the case that

$$z^t H_{xx} L(\hat{x}, \hat{\mu}, \hat{\omega}) z \ge 0,$$

where $H_{xx} L(\hat{x}, \hat{\mu}, \hat{\omega})$ is the $N \times N$ Hessian matrix of $L$ with respect to $x$.

The second-order necessary condition requires the Hessian of the Lagrangian to be positive semidefinite on the tangent subspace to the feasible set at $\hat{x}$. In a manner analogous to that of the unconstrained case, a second-order sufficiency condition can be obtained by the more stringent requirement that the Hessian of $L$ be positive definite on this subspace.

Theorem 4 (Second-Order Sufficiency Conditions): Let $\hat{x}$, $\hat{\mu}$, and $\hat{\omega}$ satisfy conditions in Eqs. (1)–(5) of Theorem 1 and suppose that for $i \in I(\hat{x})$, $\hat{\mu}_i > 0$. Further, suppose that for every nonzero $z$ satisfying

$$\begin{aligned}
\nabla g_i(\hat{x})^t z &= 0, \quad i \in I(\hat{x}),\\
\nabla h_j(\hat{x})^t z &= 0, \quad j = 1, \dots, P,
\end{aligned}$$

it is the case that

$$z^t H_{xx} L(\hat{x}, \hat{\mu}, \hat{\omega}) z > 0.$$

Then $\hat{x}$ is an isolated local minimum of NLP.

Here $\hat{x}$ is an isolated local minimum if there is a neighborhood of $\hat{x}$ that contains no other local solution of NLP. There are other slightly weaker versions of the second-order sufficient conditions that do not require that $i \in I(\hat{x})$ implies $\hat{\mu}_i > 0$. However, this restriction, called strict complementary slackness, is required in many important theoretical applications, including those of the next section.

Twice differentiable functions are convex if and only if their Hessian matrices are positive semidefinite. Thus for convex programs, the Hessian of the Lagrangian is positive semidefinite and the second-order conditions are redundant, as is to be expected in light of Theorem 2.

D. Stability and Duality

As stated earlier, the stability of the optimal solution and its optimal value are of major importance in applications of nonlinear programming models. Fortunately, the optimal solution and its optimal value are stable under reasonable conditions. To formally state this result, it is necessary to consider the perturbed version of NLP,

$$\begin{aligned}
\text{minimize}\quad & \hat{f}(x, p)\\
\text{subject to:}\quad & \hat{g}_i(x, p) \le 0, \quad i = 1, \dots, M,\\
& \hat{h}_j(x, p) = 0, \quad j = 1, \dots, P,
\end{aligned}$$

where $p$ is a $Q$-vector, $\hat{f}(x, 0) = f(x)$, $\hat{g}_i(x, 0) = g_i(x)$, and $\hat{h}_j(x, 0) = h_j(x)$. Furthermore, it is assumed that
$\hat{f}$, $\hat{g}_i$, and $\hat{h}_j$ are continuously differentiable functions of $p$ near $p = 0$. The vector $p$ represents the parameters of the problem. The following theorem gives conditions under which the optimal solution and multipliers are smooth functions of the perturbation.

To state the theorem, we define a regular solution of NLP to be a solution at which the linear independence of the active constraint gradients, strict complementary slackness, and the second-order sufficient conditions all hold.

Theorem 5 (Basic Perturbation Theorem): Let $x^*$ be a regular optimal solution to NLP with multipliers $\mu^*$ and $\omega^*$. Let $I(x^*)$ be the index set of active constraints at $x^*$. Then there is an $\varepsilon > 0$ and functions $x^*(p)$, $\mu^*(p)$, $\omega^*(p)$ defined and continuously differentiable on the set $E = \{p : |p| \le \varepsilon\}$ with $x^*(0) = x^*$, $\mu^*(0) = \mu^*$, and $\omega^*(0) = \omega^*$. $x^*(p)$ is a local optimal solution to the corresponding perturbed problem and $\mu^*(p)$ and $\omega^*(p)$ are its multipliers. Moreover, the second-order sufficient conditions hold at $x^*(p)$ and $I(x^*(p)) = I(x^*)$.

One of the most important cases of perturbation occurs when $p$ is an $M$-vector and $\hat{g}_i(x, p) = g_i(x) - p_i$. Under the assumptions of Theorem 5, the optimal solution as a function of $p$, $x^*(p)$, is smooth and it can be shown that the optimal value as a function of $p$, $v^*(p) = f(x^*(p))$, satisfies

$$\nabla_p v^*(0) = -\mu^*.$$

In other words, the instantaneous rate of change in the optimal value as a function of a shift in the value of $g_i$ is the negative of the $i$th multiplier. This gives an interpretation of the multiplier in terms of the model. For example, if the $i$th constraint is a bound on a resource and the objective is measured in dollars, the value of additional units of that resource in terms of decreased optimal value is linearly approximated as $\mu_i^*$ dollars per unit. A similar result holds for perturbations of the equality constraints. As a result of this interpretation, the optimal multipliers are often called shadow prices.

The preceding observations on the properties of the multipliers lead, in linear programming, to the formulation of a dual linear program with the multipliers as the optimal solution. Similar, but much less complete, results can be obtained for nonlinear programs. The appropriate formulation involves the Lagrangian function defined earlier. It can be shown that NLP is equivalent to the min-max problem

$$\min_{x}\ \max_{\mu \ge 0,\ \omega} L(x, \mu, \omega).$$

The dual problem can then be defined as the max-min problem

$$\max_{\mu \ge 0,\ \omega}\ \min_{x} L(x, \mu, \omega).$$

If the program is convex and the Slater condition holds then $x^*$ is an optimal solution to NLP with multipliers $\mu^*$ and $\omega^*$ if and only if $(x^*, \mu^*, \omega^*)$ is a saddle point for $L(x, \mu, \omega)$, that is,

$$L(x^*, \mu, \omega) \le L(x^*, \mu^*, \omega^*) \le L(x, \mu^*, \omega^*)$$

for all $x$, $\mu \ge 0$, and $\omega$. Thus the values of the primal and dual problems are equal at optimality. This result can fail to hold under less restrictive hypotheses on the problem NLP.

III. COMPUTATION

A. Basic Concepts

Finding a numerical approximation to the solution of a nonlinear program can be a difficult task. If the number of variables is large or the functions involved are highly nonlinear, the computation can be time consuming, even on the fastest computers. Moreover, in the case of nonconvex programs, the presence of local solutions can make the determination of a global solution problematic. The methods and procedures described below are concerned with approximating local solutions only.

The algorithms for approximating the solutions of NLP are all iterative in nature; that is, given an initial estimate of the solution, $x^0$, a sequence $\{x^k\}$ of approximate solutions is generated with each iterate, $x^k$, being determined successively from information gathered at the preceding iterations. It is desired that at each iteration, the new iterate is a better approximation, in some sense, of a local solution than the previous ones. In theory, the iterations should converge to a local solution, say $x^*$. In most algorithms approximations to the optimal multipliers, $(\mu^k, \omega^k)$, are computed along with $x^k$ at each iteration and the algorithms terminate when the approximate solutions and multipliers satisfy the first-order necessary conditions to some predetermined degree of accuracy.

In constrained optimization, in addition to minimizing $f$, the optimal solution must also satisfy the feasibility requirements. It is possible (even probable when there are nonlinear equality constraints) that an iterate will not be feasible. Ideally, an algorithm should be designed in such a way that, given a current iterate $x^k$, the new iterate $x^{k+1}$ will be no more infeasible than $x^k$ and will also satisfy $f(x^{k+1}) \le f(x^k)$. Unfortunately, this is not always possible and therefore, a successful algorithm should balance the goals of feasibility and decreasing $f$. In the algorithms described in this article, the iterates are of the form

$$x^{k+1} = x^k + \alpha_k d^k,$$
where the vector $d^k$ gives the direction in which a "step" is taken and the scalar $\alpha_k$, called the step length parameter, determines how far in this direction the next iterate lies. Another approach, called a trust region method, can also be employed to determine the step from $x^k$ to $x^{k+1}$; a reference to these methods can be found in the Bibliography.

In judging the effectiveness of an algorithm, there are three major criteria: (1) robustness, (2) rate of convergence, and (3) efficiency. The first two terms deal with theoretical issues. The term "robustness" refers to the likelihood that the algorithm will yield a sequence of iterates that converge, in theory, to a local solution regardless of the starting point $x^0$, whereas "rate of convergence" refers to the speed with which the iterates converge, for example, how fast the theoretical error terms, $|x^k - x^*|$, tend to zero. Finally, "efficiency" is concerned with practical considerations in implementing the algorithm; it includes such problems as computer time and storage requirements and numerical stability. The descriptions that follow discuss only the first two of these criteria because the third is very dependent on the particular platform and software used.

B. Unconstrained Optimization Algorithms

Any description of algorithms for solving NLP must begin with the computational algorithms for unconstrained optimization problems because the latter are fundamental to the design of the former. The algorithms for unconstrained problems are much less complex because there is no question of feasibility that must be taken into account in the choice of step direction and length. There are two basic approaches to solving the unconstrained problem to be discussed here: the basic descent method and variants of Newton's method. The former emphasizes decreasing $f$ while the latter attempts to solve the necessary conditions, $\nabla f(x) = 0$. Each is discussed briefly, as are hybridizations of the two approaches that attempt to incorporate their best properties.

For the descent method, the choice of the step direction is a direction $d^k$ satisfying

$$\nabla f(x^k)^t d^k < 0. \tag{6}$$

Movement in this direction, called a descent direction, will, for at least a short distance, decrease $f$, and hence $\alpha_k$ can be chosen so that

$$f(x^{k+1}) = f(x^k + \alpha_k d^k) < f(x^k).$$

The specific choice $d^k = -\nabla f(x^k)$ gives the direction of greatest local rate of decrease in $f$ and leads to the steepest descent method. In practice, $\alpha_k$ is chosen so as to minimize a polynomial approximation to $\phi(\alpha) = f(x^k + \alpha d^k)$ subject to the need to reduce the value of $f$. Under fairly reasonable conditions, these descent methods yield a convergent sequence of iterates and the limit point satisfies the equation $\nabla f(x^*) = 0$. These methods are fairly robust; their convergence does not depend on the initial iterate $x^0$. However, the convergence rate of these descent methods is typically very slow and, consequently, they are ill-suited for general use.

The second general approach is Newton's method and its modifications. The pure Newton method determines the step, $d^k$, for solving $\nabla f(x) = 0$ by

$$d^k = -[Hf(x^k)]^{-1} \nabla f(x^k),$$

where $Hf(x^k)$ is the Hessian matrix of $f$ at $x^k$, and requires $\alpha_k = 1$. It is well known that if the initial iterate $x^0$ is sufficiently close to a local solution $x^*$ at which the Hessian $Hf(x^*)$ is nonsingular, the iterates will converge to $x^*$ very rapidly. In particular, they will converge at a quadratic rate, which roughly means that the number of decimal places of accuracy is doubled at each step. The biggest drawback to the use of Newton's method is its lack of robustness. Unless the vector $x^0$ is close to a local solution, these iterates may fail to converge or else converge to a nonminimum point.

There are a number of methods designed to obtain both the robustness of descent methods and the rapid local convergence of Newton's method. If $f$ is a convex function whose Hessian matrix is positive definite, the Newton step is then also a descent step. Applied with a line search to determine $\alpha_k$, this approach, called a damped Newton method, can both be robust and yield rapid local convergence. For nonconvex problems a standard approach for obtaining robustness without sacrificing rapid convergence is to use a quasi-Newton or secant method. In this type of algorithm, the Hessian matrix, $Hf(x^k)$, is approximated by a positive definite matrix, $B_k$, and the step direction is given by

$$d^k = -[B_k]^{-1} \nabla f(x^k).$$

Because $B_k$ is positive definite, the direction $d^k$ is a descent direction [satisfies Eq. (6)] and a line search procedure can be used to determine $\alpha_k$. Because it is a descent method, this algorithm will be robust and if the $B_k$ are relatively good approximations to the Hessians of $f$, the convergence rate should be close to that of Newton's method. The class of matrices that satisfy the generalized secant condition

$$B_{k+1}(x^{k+1} - x^k) = \nabla f(x^{k+1}) - \nabla f(x^k) \tag{7}$$

provide good approximations to the Hessian of $f$ at $x^{k+1}$. These approximations are implemented in an algorithm by an "updating formula" of the form
$$B_{k+1} = B_k + E_k,$$

where $E_k$ is a low rank matrix chosen so that the matrix $B_{k+1}$ is positive definite and Eq. (7) holds. It has been shown, under relatively mild restrictions, that the descent method using appropriate secant updates will be robust and converge at a rate approaching that of Newton's method. Most modern unconstrained optimization algorithms now utilize a version of this procedure. The references on unconstrained optimization contain details for these methods.

C. Penalty and Barrier Functions

An early approach to solving NLP was to attempt to convert the constrained problem into an unconstrained one by incorporating the constraints into the objective function. There are basically two types of methods included in this approach: penalty methods and barrier methods. The two cases are considered separately.

Penalty methods, as the name suggests, incorporate the constraints into the objective function in such a way that infeasibility is penalized. The most common method replaces NLP by the problem of minimizing

$$P(x; \rho) = f(x) + \rho \left( \|h(x)\| + \|g^+(x)\| \right) \tag{8}$$

where

$$g_i^+(x) = \begin{cases} 0 & \text{if } g_i(x) \le 0, \\ g_i(x) & \text{else,} \end{cases}$$

and $\|\cdot\|$ represents any norm. Here $\rho > 0$ is a parameter that plays a major role in the algorithm. Given a value of the parameter $\rho$, the minimization of $P$ will take into consideration values of $x$ that are infeasible but will penalize them proportionally to their "distance" from feasibility, as measured by $\|h(x)\|$ and $\|g^+(x)\|$, and to the value of $\rho$. If a sequence $\rho_k \to \infty$ is chosen and $x(\rho_k)$ is an unconstrained minimum of $P(x; \rho_k)$ with $x(\rho_k) \to \hat{x}$, it can be shown that $\hat{x}$ is a local minimum of NLP. This observation forms the basis for an algorithm for solving NLP in which $x^k$ is taken as an approximation to $x(\rho_k)$ and is used as the starting point for minimizing $P(x; \rho_{k+1})$.

The choice of norm to be used in the definition of $P$ is important. From an ease of computation standpoint, the Euclidean norm is the most convenient. However, the $L_1$ norm,

$$\|z\|_1 = \sum_i |z_i|,$$

provides an important theoretical advantage. Although the use of the $L_1$ norm forces $P(x; \rho)$ to be nondifferentiable on the boundary of the feasible region, it can be shown that the following property holds.

Theorem 6: There exists a $\rho^* > 0$ such that for all $\rho > \rho^*$, an unconstrained minimum of the $L_1$ penalty function is an optimal solution for NLP, and conversely.

A penalty function that has the property described in this theorem is called an exact penalty function. This type of penalty function is important because it obviates the use of limiting processes to obtain a solution to NLP. This is not the panacea that it appears, however, because exact penalty functions are typically so complicated that they do not readily lend themselves to computational procedures; their application is less efficient than applying other techniques for solving NLP. For example, the nondifferentiability of the $L_1$ penalty function precludes using the gradient descent methods described above. However, the $L_1$ penalty function does have a use as a "merit function" in other optimization algorithms for solving NLP, as described below.

Barrier methods are applied primarily to inequality-constrained problems. In particular, they are applicable when the feasible region is known to have a strictly feasible point $x^0$, that is, a point that satisfies $g(x^0) < 0$. The most common barrier function is the so-called logarithmic barrier function,

$$B(x; \rho) = f(x) - \rho \sum_{j=1}^{p} \log(-g_j(x)), \tag{9}$$

where $\rho$ is again a positive parameter. Another type of barrier function is the inverse barrier function,

$$V(x; \rho) = f(x) + \rho \sum_{j=1}^{p} 1/(-g_j(x)).$$

An optimization procedure for minimizing $B(x; \rho)$ for fixed positive $\rho$ started at $x^0$ cannot reach the boundary of $X$ because the log term approaches $+\infty$ as the boundary is neared; that is, the boundary defines a barrier that the optimization procedure cannot cross. As $\rho$ gets smaller, the penalty for approaching the boundary gets smaller, so if $x(\rho_k)$ represents any set of minima of $B(x; \rho_k)$ corresponding to a sequence $\rho_k \to 0$, then a limit point $\hat{x}$ must be an optimal solution of NLP. There is no exact barrier function because, if the solution to NLP is on the boundary of $X$, it can only be reached in the limit. Nevertheless, barrier functions have also found applications to solving NLP in the guise of the interior point methods that are described below.

D. The Sequential Quadratic Programming Algorithm

At one time, it was common to generate approximations to the solution of NLP by linearly approximating the objective function and the constraints at a current iterate
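As a small numerical illustration of Theorem 6 in Section C, the following sketch minimizes the $L_1$ penalty function for a one-variable problem chosen here purely for illustration (it does not come from the article): minimize $(x - 2)^2$ subject to $g(x) = x - 1 \le 0$, whose solution is $x = 1$ with multiplier 2. The function names and the golden-section search are assumptions of this sketch; a derivative-free search is used because the penalty function is not differentiable at the boundary of the feasible region.

```python
import math


def l1_penalty(x: float, rho: float) -> float:
    """P(x; rho) = f(x) + rho * g_plus(x) for f(x) = (x - 2)^2, g(x) = x - 1."""
    f = (x - 2.0) ** 2
    g_plus = max(0.0, x - 1.0)  # zero when the constraint holds, g(x) otherwise
    return f + rho * g_plus


def golden_section_min(func, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Derivative-free minimizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if func(c) < func(d):
            # The minimum lies in [a, d]; reuse c as the new upper probe.
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            # The minimum lies in [c, b]; reuse d as the new lower probe.
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


def penalty_minimizer(rho: float) -> float:
    """Unconstrained minimizer of the L1 penalty function for a given rho."""
    return golden_section_min(lambda x: l1_penalty(x, rho), -5.0, 5.0)


x_exact = penalty_minimizer(3.0)    # rho above the threshold
x_inexact = penalty_minimizer(1.0)  # rho below the threshold
```

For this toy problem the threshold $\rho^*$ equals the multiplier 2: with $\rho = 3$ the unconstrained minimizer coincides with the constrained solution $x = 1$, whereas $\rho = 1$ yields the infeasible point $x = 1.5$.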
and to use this information to generate the next iterate. These methods, as typified by the gradient projection and reduced gradient algorithms, are comparable to the descent method described for unconstrained optimization and hence are not competitive for the more complicated nonlinear constrained problems.

A more general approach that uses higher order information and has been shown to be particularly effective is the sequential quadratic programming algorithm. In this type of algorithm an approximation to NLP is constructed in which the constraints are approximated linearly and a quadratic approximation to the Lagrangian function is employed as an objective function. Specifically, if $x^k$ is a current iterate for NLP, not necessarily feasible, with corresponding multiplier approximations $\mu^k$ and $\omega^k$, the following quadratic programming problem results:

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2} d^t B_k d + \nabla_x L(x^k, \mu^k, \omega^k)^t d \\ \text{subject to:} \quad & \nabla g_i(x^k)^t d + g_i(x^k) \le 0, \quad i = 1, \ldots, M, \\ & \nabla h_j(x^k)^t d + h_j(x^k) = 0, \quad j = 1, \ldots, P, \end{aligned}$$

where $B_k$ is the Hessian of the Lagrangian with respect to $x$ at $(x^k, \mu^k, \omega^k)$, $H_{xx} L(x^k, \mu^k, \omega^k)$, or an approximation thereof. If this problem is solved to obtain a solution $d^k$ with associated multipliers $y^k$ and $z^k$, then the new iterate for NLP is given by

$$x^{k+1} = x^k + \alpha_k d^k,$$

for an appropriate choice of step length $\alpha_k$. Updates for the multipliers are given by

$$\mu^{k+1} = y^k \quad \text{and} \quad \omega^{k+1} = z^k.$$

The use of the Lagrangian in the objective function permits quadratic effects in the constraints to have an effect on the iteration.

As a justification for this approach, it can be seen that if the true Hessian of the Lagrangian is used in a problem with equality constraints only, then the iterate $(x^{k+1}, \omega^{k+1})$ is identical to that obtained by using the Newton step to find a solution of the first-order conditions of NLP:

$$\begin{aligned} \nabla f(x) + \sum_{j=1}^{P} \nabla h_j(x)\, \omega_j &= 0 \\ h(x) &= 0, \end{aligned}$$

starting at $(x^k, \omega^k)$. A similar result holds for the case when inequality constraints are present. Thus, depending on the choice of the step length parameter, this method will yield convergence at a rate similar to that of Newton's method when started near a solution.

In the actual implementations of the algorithm, the matrices $\{B_k\}$ are often quasi-Newton approximations, just as in the algorithms for unconstrained optimization; that is, they satisfy the conditions:

$$B_{k+1}(x^{k+1} - x^k) = \nabla_x L(x^{k+1}, \mu^{k+1}, \omega^{k+1}) - \nabla_x L(x^k, \mu^{k+1}, \omega^{k+1}).$$

These matrices are updated by a rank-two matrix at each iteration and are usually chosen to be positive definite. Since the true Hessian is not positive definite except on a subspace of $\mathbb{R}^N$ (see Theorem 4), the use of positive definite quasi-Newton approximations will usually lead to a slower rate of convergence. Another aspect of this algorithm that requires careful implementation is the determination of the step length parameter $\alpha_k$. In unconstrained optimization, the parameter is chosen so that the objective function is decreased at each step. As was described in the first section of this article, in constrained optimization there is also the requirement of achieving feasibility, or at least decreasing infeasibility. At any step this may be inconsistent with decreasing the objective function and therefore it is not clear how the choice of $\alpha_k$ should be made. This is where the merit function, mentioned in the preceding section, comes into play. In the sequential quadratic programming method, a decrease in a merit function is used to determine the step length parameter for a given iterate. The choice of merit function depends upon the particular version of the algorithm but is usually taken to be one of the standard penalty functions. For example, if the $L_1$ penalty function is used, then at a given step $\alpha_k$ is chosen so that

$$P(x^k + \alpha_k d^k; \rho) < P(x^k; \rho).$$

Because $P$ is decreased at each step and for large $\rho$ a minimum of $P$ is a minimum of NLP, this choice of $\alpha_k$ is justified. However, the value of $\rho$ must generally be adjusted in the algorithm to assure that this property holds.

In spite of the complications involved in its implementation, versions of the sequential quadratic programming algorithm are considered the most effective general purpose algorithms currently available for solving the NLP.

E. Interior Point Methods

Recently, there have been developments in algorithms for NLP that involve generalizing the interior point algorithms that have been so successful in linear programming. In one approach to interior point methods, the inequality constraints are incorporated into the objective function by the use of a barrier function. A similar formulation results from considering the Newton steps for solving a perturbed
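The Newton iteration on the first-order conditions that Section D uses to justify sequential quadratic programming can be sketched for a tiny equality-constrained problem. The problem below is an assumption made only for illustration: minimize $f(x) = x_1 + x_2$ subject to $h(x) = x_1^2 + x_2^2 - 2 = 0$, whose solution is $x = (-1, -1)$ with multiplier $\omega = 1/2$. The helper names are hypothetical, the step length is fixed at 1, and no merit function is used, so this shows only the local Newton behavior and not a complete algorithm.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def newton_kkt_step(x, omega):
    """One Newton step on: grad f + omega * grad h = 0 and h = 0."""
    x1, x2 = x
    grad_f = [1.0, 1.0]
    grad_h = [2.0 * x1, 2.0 * x2]
    h = x1 * x1 + x2 * x2 - 2.0
    # Hessian of the Lagrangian L = f + omega * h is 2 * omega * I here.
    kkt = [
        [2.0 * omega, 0.0, grad_h[0]],
        [0.0, 2.0 * omega, grad_h[1]],
        [grad_h[0], grad_h[1], 0.0],
    ]
    rhs = [
        -(grad_f[0] + omega * grad_h[0]),
        -(grad_f[1] + omega * grad_h[1]),
        -h,
    ]
    dx1, dx2, domega = solve_linear(kkt, rhs)
    return [x1 + dx1, x2 + dx2], omega + domega


def iterate(x, omega, steps=8):
    """Apply a fixed number of full Newton steps."""
    for _ in range(steps):
        x, omega = newton_kkt_step(x, omega)
    return x, omega


x_final, omega_final = iterate([-1.5, -0.8], 1.0)
```

Started reasonably close to the solution, the iterates approach $(-1, -1)$ and $\omega = 1/2$ rapidly, which is the Newton-like local convergence the text attributes to the method.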
form of the first order conditions. A basic version of the latter development for the case where there are only inequality constraints is given here; more examples of the method can be found in the Bibliography.

The first order system to be solved in this method is the system of equations derived from Eqs. (1)–(5), in which a slack variable $z$ has been added to convert the inequalities to equalities:

$$\begin{aligned} \nabla f(x) + \sum_{i=1}^{M} \nabla g_i(x)\, \mu_i &= 0, \\ g(x) + z &= 0, \\ \mu_i z_i &= \beta, \quad i = 1, \ldots, M, \\ z &\ge 0, \\ \mu &\ge 0. \end{aligned}$$

This system is perturbed because the complementary slackness conditions are not satisfied at the zero level but at the $\beta$ level for some $\beta > 0$. As in barrier function methods, it is assumed that there is a family of solutions $(x(\beta), z(\beta), \mu(\beta))$ to this set of equations. Under certain assumptions (usually convexity of the $f$ and $g_i$ functions), this set of solutions defines a curve called the central path that tends to a solution of NLP as $\beta \to 0$. The idea here is to use Newton steps to solve this system while decreasing $\beta$ towards zero. Given a current iterate $(x^k, z^k, \mu^k)$ with $z^k > 0$ and $\mu^k > 0$, the set of equations becomes

$$\begin{bmatrix} H_{xx} L(x^k, \mu^k) & Jg(x^k)^t & 0 \\ Jg(x^k) & 0 & I \\ 0 & Z_k & U_k \end{bmatrix} \begin{bmatrix} dx \\ du \\ dz \end{bmatrix} = \begin{bmatrix} -\nabla_x L(x^k, \mu^k) \\ -g(x^k) - z^k \\ \beta e - Z_k \mu^k \end{bmatrix},$$

where $Jg(x^k)$ is the Jacobian of the vector function $g$, $Z_k$ and $U_k$ are the diagonal matrices with diagonals that are $z^k$ and $\mu^k$, $e$ is the vector of ones, and $L$ is the Lagrangian function. The iterates are now updated by the formulas:

$$\begin{aligned} x^{k+1} &= x^k + \alpha_k\, dx, \\ z^{k+1} &= z^k + \alpha_k\, dz, \\ \mu^{k+1} &= \mu^k + \alpha_k\, du, \end{aligned}$$

where $\alpha_k$ is chosen to maintain the slack variables and multiplier variables as positive and to decrease an appropriate merit function. The key issue in an implementation of this algorithm is how progress toward the point on the central path corresponding to $\beta$ is mixed with decrease of $\beta$. As in the case of sequential quadratic programming, an approximation of the Hessian of the Lagrangian can be used.

IV. APPLICATIONS

A. Nonlinear Models

The description of actual nonlinear optimization models is hindered by the fact that the nonlinearities that occur in the model are usually due to technical theoretical concepts special to the particular field and hence are not easily explained to the nonspecialist. In other cases the objective or constraint functions may have little or no theoretical basis but be nonlinear functions that have been fit to historical data. Linear optimization models in production, blending, and network problems are discussed under the topic of linear programming. Many nonlinear programming models are generalizations of these linear problems in which a more accurate simulation of reality is obtained by allowing nonlinearities in the objective and constraint functions. In models for which the objective or constraint function is a cost function, nonlinearities very often occur when economies of scale are present (i.e., when the per unit cost of a particular item decreases as the amount purchased or produced increases). In blending problems, a particular quality of the blend, for instance, the octane rating of gasoline, is a nonlinear function of the components of the mixture. In network flow problems, nonlinearities may represent increasing per unit cost of shipment as a function of the flow from node to node. Another type of nonlinearity results from the modeling of a learning process in production models; that is, the efficiency of production increases as production rises.

B. Statistical Applications

One of the most important uses of nonlinear programming is in the estimation of the parameters for a statistical distribution using the observed data. This is called maximum likelihood estimation. A related problem is the familiar regression problem in which parameters, say $\theta_i$, $i = 1, \ldots, N$, are estimated so that a particular nonlinear function $w(\theta)$ "best" fits a set of observed data; that is, given values of a control variable $z_k$, $k = 1, \ldots, L$, and a sequence of corresponding experimental responses $y_k$, $k = 1, \ldots, L$, values of $\theta$ are determined so that some measure of the difference vector $e \in \mathbb{R}^L$, with

$$e_k = y_k - w(\theta, z_k), \quad k = 1, \ldots, L,$$

is minimized.

The most common measure is the Euclidean norm of $e$, in which case the optimization problem is

$$\begin{aligned} \text{minimize} \quad & \sum_{k=1}^{L} [y_k - w(\theta, z_k)]^2 \\ \text{subject to:} \quad & \theta \in \Theta, \end{aligned}$$
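The Newton step on the perturbed first-order system of Section E can be sketched for a one-variable problem. Everything specific below is an assumption made for illustration: the problem (minimize $(x - 2)^2$ subject to $g(x) = x - 1 \le 0$, with solution $x = 1$, $\mu = 2$, $z = 0$), the starting point, the factor 0.2 used to shrink $\beta$, and the 0.95 fraction-to-the-boundary rule that keeps $z$ and $\mu$ positive. No merit function is used, so this is a simplified sketch rather than a full implementation.

```python
def interior_point(x=0.0, z=1.0, mu=1.0, beta=1.0, iters=25):
    """Primal-dual Newton iteration for: min (x - 2)^2 subject to x - 1 <= 0."""
    for _ in range(iters):
        # Right-hand side of the Newton system.
        r1 = -(2.0 * (x - 2.0) + mu)   # -grad_x L(x, mu)
        r2 = -((x - 1.0) + z)          # -(g(x) + z)
        r3 = beta - z * mu             # beta * e - Z * mu
        # The 3x3 system  [2 1 0; 1 0 1; 0 z mu] (dx, du, dz) = (r1, r2, r3)
        # is solved by eliminating du = r1 - 2 dx and dz = r2 - dx.
        dx = (z * r1 + mu * r2 - r3) / (2.0 * z + mu)
        du = r1 - 2.0 * dx
        dz = r2 - dx
        # Step length: stay strictly inside z > 0 and mu > 0.
        alpha = 1.0
        for value, step in ((z, dz), (mu, du)):
            if step < 0.0:
                alpha = min(alpha, -0.95 * value / step)
        x += alpha * dx
        z += alpha * dz
        mu += alpha * du
        beta *= 0.2                    # move along the central path
    return x, z, mu


x_ip, z_ip, mu_ip = interior_point()
```

As $\beta$ is driven towards zero the iterates follow the central path to the constrained minimizer, while the slack and the multiplier remain strictly positive throughout.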

which is the so-called least squares regression problem. Here $\Theta$ is the feasible set, which expresses any relations or restrictions on the parameters. This problem, especially for the case in which $\Theta = \mathbb{R}^N$, has been exhaustively studied and many software packages are now available containing algorithms that compute a solution in an efficient manner.

C. Discretization of Infinite-Dimensional Problems

The increase in available computer power has allowed finite-dimensional nonlinear programming techniques to be applied to solve discretized infinite-dimensional optimization problems. Problems generated in this manner are typically of very large dimension and often have special structure. The limiting behavior of the optimal solutions as the number of variables increases is of significance in this type of problem.

As an illustration of this problem, the simple optimal control problem of minimizing an energy function for a prescribed trajectory is formulated as follows.

$$\begin{aligned} \text{minimize} \quad & \int_a^b E(u(t))\, dt \\ \text{subject to:} \quad & dy(t)/dt = w(y(t), u(t)), \\ & y(a) = y_a, \\ & y(b) = y_b, \\ & |u(t)| \le 1, \end{aligned}$$

where $a$, $b$, $y_a$, and $y_b$ are given scalars, and $E$ and $w$ are given real-valued functions. The object is to choose the control function $u(t)$ that solves the problem. To discretize this problem, $[a, b]$ is first partitioned so that

$$a = t_0 < t_1 < t_2 < \cdots < t_N = b.$$

Then, the following finite-dimensional nonlinear program approximates the optimal control problem.

$$\begin{aligned} \text{minimize} \quad & \sum_{j=1}^{N} E(u_j)(t_j - t_{j-1}) \\ \text{subject to:} \quad & y(t_j) - y(t_{j-1}) = (t_j - t_{j-1})\, w(y(t_{j-1}), u_j), \quad j = 1, \ldots, N, \\ & y(t_0) = y_a, \\ & y(t_N) = y_b, \\ & |u_j| \le 1, \quad j = 1, \ldots, N. \end{aligned}$$

The control $u(t)$ is approximated in this model by the piecewise constant function

$$\hat{u}(t) = u_j, \quad \text{for } t_{j-1} \le t < t_j.$$

In theory the approximation should improve as $N$ gets large. Unfortunately, the number of variables and constraints in the nonlinear program increases and the problem becomes more difficult to solve. Problems of this type with tens of thousands of variables are not uncommon.

SEE ALSO THE FOLLOWING ARTICLES

APPROXIMATIONS AND EXPANSIONS • DYNAMIC PROGRAMMING • LINEAR OPTIMIZATION • MATHEMATICAL MODELING

BIBLIOGRAPHY

Avriel, M. (1976). "Nonlinear Programming, Analysis and Methods," Prentice-Hall, Englewood Cliffs, New Jersey.
Biegler, L., Nocedal, J., and Schmid, C. (1995). "A reduced Hessian method for large-scale constrained optimization," SIAM J. Optimization 5, 314–347.
Boggs, P., Kearsley, A., and Tolle, J. (1999). "A global convergence analysis of an algorithm for large scale nonlinear programming problems," SIAM J. Optimization 9, 833–862.
Boggs, P., and Tolle, J. (1995). "Sequential quadratic programming," Acta Numerica. 1–51.
Dennis, J., and Schnabel, R. (1983). "Numerical Methods for Nonlinear Equations and Unconstrained Optimization," Prentice-Hall, Englewood Cliffs, New Jersey.
El-Alem, M. (1995). "A robust trust-region algorithm with a nonmonotonic penalty parameter scheme for constrained optimization," SIAM J. Optimization 5, 348–378.
El-Bakry, A., Tapia, R., Tsuchiya, T., and Zhang, Y. (1996). "On the formulation and theory of the Newton interior point method for nonlinear programming," J. Optimization Theory Applications 89, 507–541.
Nash, S., and Sofer, A. (1996). "Linear and Nonlinear Programming," McGraw-Hill, New York.
Nocedal, J. (1991). "Theory of algorithms for unconstrained optimization," Acta Numerica. 199–242.
P1: GSS Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology En011l-504 July 25, 2001 16:52

Number Theory, Algebraic

and Analytic
H. E. Rose
University of Bristol

I. Algebraic Number Fields

II. Diophantine Equations
III. Elliptic Curves
IV. Diophantine Approximation and Transcendence
V. Prime Number Theory
VI. Partitions
VII. Computational Number Theory

GLOSSARY

Congruence If $a, b, m \in \mathbb{Z}$ and $m > 0$ then $a \equiv b \pmod{m}$ stands for $m$ divides $b - a$; read as "$a$ is congruent to $b$ modulo $m$."
Divisibility If $a, b \in \mathbb{Z}$, we write $b \mid a$ when $b$ divides $a$ exactly.
Equivalence relation A relation $R$ on a set $S$ that satisfies, for all $a, b, c \in S$: $a\,R\,a$; if $a\,R\,b$ then $b\,R\,a$; and if $a\,R\,b$ and $b\,R\,c$ then $a\,R\,c$. The congruence defined above is an example. The subset $\{x : x\,R\,a\}$ of the set $S$, where $a$ is a fixed member of $S$, is called an equivalence class. The relation splits $S$ into a disjoint union of such classes.
Greatest common divisor If $a, b \in \mathbb{Z}$, not both zero, then $(a, b)$ denotes the largest integer that divides both $a$ and $b$. It exists, as $\mathbb{Z}$ has unique factorization.
Integer Member of the set $\{\ldots, -2, -1, 0, 1, 2, 3, \ldots\}$. To distinguish from other sets defined later we often use the term rational integer instead of integer. $\mathbb{Z}$ denotes the set of all integers with the usual operations and rules of addition, subtraction, and multiplication. It is an integral domain.
Order $O$, $\sim$: $f(x) = O(g(x))$, $f$ has order $g$, means that $|f(x)|/g(x)$ is bounded as $x \to \infty$, and $f(x) \sim g(x)$ means that $\lim_{x \to \infty} f(x)/g(x) = 1$.
Polynomial An expression of the form $a_0 x^n + a_1 x^{n-1} + \cdots + a_n$ where each $a_i$ is a constant in the underlying ring or field and $x$ is a variable; $n$ is called the degree of the polynomial. Polynomials in two or more variables are defined similarly. A polynomial is called irreducible if it cannot be written as a product of polynomials of smaller degree. Finally a polynomial is called homogeneous if each of its summands has the same degree; for example, $x^4 + 3xy^3 - xyzt$ is homogeneous of degree 4.
Prime number $p$ is a prime if $p \in \mathbb{Z}$, $p > 1$, and if $a > 0$ and $a$ divides $p$ then $a = 1$ or $a = p$.
Rational function A real function of the form $F/G$ where $F$ and $G$ are polynomials.
Rational number A number of the form $a/b$ where $a, b \in \mathbb{Z}$, $(a, b) = 1$, and $b \ne 0$. $\mathbb{Q}$ denotes the set of all rational numbers with the usual operations and rules of addition, subtraction, multiplication, and division. It is a field.
Unique factorization Every element in $\mathbb{Z}$ can be represented uniquely as a product of primes. An integral domain with this property (using irreducible elements rather than primes) is called a unique factorization domain.

THE CENTRAL CONCERNS of number theory are the properties of the integers and rational numbers. Questions relating to individual real or complex numbers, integer matrices, algebraic number fields, points on curves, lattice points in convex regions, and similar entities are also considered. The subject has a very long history, as long as mathematics itself, and is as actively researched today as at any time in the past.

Number theory is not an organized theory in the usual sense but is a vast collection of individual topics and results, with some coherent subtheories and a long list of unsolved problems. Some of these topics are highly specialized, for example, giving a single solution to a Diophantine equation, whereas others have wide applicability, the Euclidean algorithm being an example. Methods range over virtually all mathematical disciplines, although group and field theory, linear algebra, and both real and complex analysis are the most commonly used. Problems and results are studied entirely for their own sake; the fact that many theorems are useful elsewhere is incidental.

In this article we shall treat a few of the more important topics not considered in the previous article. For each of these topics major results have been established recently, and this progress is likely to continue in the future.

The reader is referred to the article quoted above for an account of the history of the subject and the basic definitions and results concerning the properties of the integers, in particular the fundamental theorem of arithmetic, prime numbers, greatest common divisors, and congruences. The fundamental theorem states that every integer $n > 1$ can be expressed uniquely (except for the order of the factors) as a product of prime numbers. The greatest common divisor of two integers $a$ and $b$ is denoted by $(a, b)$; it can be found using the Euclidean algorithm. The basic results of congruence theory concern the conditions for the solution of linear and quadratic congruences and include the law of quadratic reciprocity of C. F. Gauss (1777–1855).

I. ALGEBRAIC NUMBER FIELDS

Algebraic numbers and integers are generalizations of rational numbers and integers, and algebraic number fields are extensions of the rational field $\mathbb{Q}$. In both cases many properties are preserved in the new situations, but not all; for example, unique factorization is often lost. Although these new notions are important in their own right, they are also important because many problems concerning the rational numbers are best treated in this more general context. For example, to show that one of Fermat's equations (see Section II.A.5) has no rational solutions we work in an algebraic extension of $\mathbb{Q}$ in which the left-hand side of the equation can be factorized into linear factors. We shall define these new notions now and discuss their relationship to the rational case.

Definition: A complex number $\alpha$ is called an algebraic number if rational numbers $c_1, \ldots, c_n$ can be found to satisfy the polynomial equation

$$\alpha^n + c_1 \alpha^{n-1} + \cdots + c_n = 0.$$

The positive integer $n$ is called the degree of $\alpha$ provided that $\alpha$ does not satisfy another polynomial equation whose degree is less than $n$. If each $c_i$ ($i = 1, \ldots, n$) is a rational integer (i.e., each $c_i \in \mathbb{Z}$) then $\alpha$ is called an algebraic integer.

For example, the complex numbers $2/3$, $(2 + i)/3$, and $2^{1/3}$ are algebraic numbers, and $\sqrt{5}$, $\sqrt{-5}$, and $(1 + \sqrt{5})/2$ are algebraic integers. (Note that the last number is an algebraic integer because it satisfies the equation $x^2 - x - 1 = 0$.) The nonalgebraic numbers, that is the transcendental numbers, are discussed in Section IV.

The algebraic number fields are defined as follows. Let $\alpha$ be an algebraic number of degree $n$, and let $\mathbb{Q}(\alpha)$ denote the set of all complex numbers of the form

$$b_0 + b_1 \alpha + \cdots + b_{n-1} \alpha^{n-1},$$

where each $b_i \in \mathbb{Q}$. It can be shown that this set is closed under the usual operations of complex addition, subtraction, multiplication, and division and so is a field. It is called an algebraic number field of degree $n$ over $\mathbb{Q}$ and its algebraic structure is similar to that of $\mathbb{Q}$ itself.

Given a field $K = \mathbb{Q}(\alpha)$, let $O_K$ denote the set of algebraic integers in $K$. This set is closed under sums and products and so is an integral domain called the ring of integers in $K$. If $\alpha$ is an algebraic integer, we also define the set $\mathbb{Z}[\alpha]$ to be the collection of all expressions $g(\alpha)$ where $g$ is a polynomial with rational integer
coefficients. Each element of $\mathbb{Z}[\alpha]$ is an algebraic integer, and so $\mathbb{Z}[\alpha] \subseteq O_K$. Note that this inclusion can be strict. For example, if $\alpha = \sqrt{5}$ then $\mathbb{Z}[\alpha] = \{a + b\sqrt{5} : a, b \in \mathbb{Z}\}$, but as noted above, the number $(1 + \sqrt{5})/2$ is an algebraic integer; hence it belongs to the ring of integers of the field $\mathbb{Q}(\sqrt{5})$ but not to the domain $\mathbb{Z}[\sqrt{5}]$.

It can be shown that for each algebraic number field $K = \mathbb{Q}(\alpha)$ of degree $n$ its ring of integers $O_K$ has a $\mathbb{Z}$-basis $\{\gamma_1, \ldots, \gamma_n\}$. That is, algebraic integers $\gamma_1, \ldots, \gamma_n$ can be found so that each integer in $O_K$ can be expressed in the form $c_1 \gamma_1 + \cdots + c_n \gamma_n$ where $c_i \in \mathbb{Z}$, $i = 1, \ldots, n$. In the example above, the $\mathbb{Z}$-basis for $O_K$, where $K = \mathbb{Q}(\sqrt{5})$, is $\{1, (1 + \sqrt{5})/2\}$.

A "number theory" has been developed for these rings of integers $O_K$; it is similar to that for the rational case, but two main aspects are different: units and factorization. A unit $\varepsilon$ in an algebraic number field $K$ is an algebraic integer that divides 1 (that is, $\varepsilon \varepsilon' = 1$ for some $\varepsilon' \in O_K$). In $\mathbb{Z}$ the only units are 1 and $-1$, but most algebraic number fields contain infinitely many units. For example, $(3 + 2\sqrt{2})^m$ is a unit of the field $\mathbb{Q}(\sqrt{2})$ for each $m \in \mathbb{Z}$. An irreducible element, which is the generalization of the notion of a prime number in $\mathbb{Z}$, is defined as follows: $\xi$ is irreducible (over $K$) if $\xi \in O_K$, $\xi$ is neither a unit nor zero, and whenever we have $\xi = \beta\gamma$, where $\beta$ and $\gamma$ belong to $O_K$, then either $\beta$ or $\gamma$ is a unit. Clearly, each member of $O_K$ can be written as a product of a unit and a number of irreducible elements; but in many cases this representation is not unique, that is, $K$ does not have unique factorization. For example, if $K = \mathbb{Q}(\sqrt{10})$, we have

$$39 = 3 \times 13 = (7 + \sqrt{10})(7 - \sqrt{10}),$$

and 3, 13, $7 + \sqrt{10}$, and $7 - \sqrt{10}$ are all irreducible elements in this field. It is an open problem to list all algebraic number fields that have unique factorization. One important case has been settled recently; that is the imaginary quadratic fields $\mathbb{Q}(\sqrt{n})$ where $n$ is a negative integer. The only fields in this class that have unique factorization are those where $n = -1, -2, -3, -7, -11, -19, -43, -67$, and $-163$ (see Section IV).

The lack of unique factorization in many number fields is a severe drawback; it can be partially retrieved by introducing ideals into the theory. Let $K$ be an algebraic number field with ring of integers $O_K$. An ideal $I$ is a nonempty subset of $O_K$ with the properties: if $a, b \in I$ then $a - b \in I$, and if $a \in I$ and $c \in O_K$ then $ac \in I$. Further, if $I$ and $J$ are ideals we define the product ideal $IJ$ to be the set of all finite sums $\sum a_i b_j$, where $a_i \in I$ and $b_j \in J$. Also $I$ is called a prime ideal when $I \ne O_K$ and, if $J_1$ and $J_2$ are ideals with $J_1 J_2 \subseteq I$, then either $J_1 \subseteq I$ or $J_2 \subseteq I$. It can be shown that each ideal $I$ where $I \ne O_K$ ($O_K$ acts as the identity ideal) can be represented uniquely in the form $I = P_1 \cdots P_n$ where each $P_i$ is a prime ideal.

There is a close connection between the arithmetic of $O_K$ and the arithmetic of these ideals. This is illustrated by the following important result. An ideal $I$ is called principal if $I = \{at : a \in O_K\} = \langle t \rangle$ for some fixed $t \in O_K$; $t$ is called the generator (of $I$) in $O_K$. We have: $O_K$ is a unique factorization domain if and only if each ideal in $O_K$ is principal. To illustrate this theory we shall consider two examples.

(a) Let $K = \mathbb{Q}(i)$ where $i = \sqrt{-1}$. In this case the ring of integers $O_K = \{a + bi : a, b \in \mathbb{Z}\}$; it is known as the set of Gaussian integers. This is a unique factorization domain and so all ideals are principal. For instance, the ideal $\langle 2, 3 + 3i \rangle = \{2a + (3 + 3i)b : a, b \in O_K\}$ equals the principal ideal $\langle 1 + i \rangle$ as $1 + i$ divides both 2 and $3 + 3i$ in the Gaussian integers.

(b) In our second example, let $K = \mathbb{Q}(\sqrt{10})$. Its ring of integers does not have unique factorization, and it has nonprincipal ideals. For instance, as noted above, $39 = 3 \times 13 = (7 + \sqrt{10})(7 - \sqrt{10})$, and these last four entries are irreducible in $O_K$. Further the ideal $\langle 3, 7 + \sqrt{10} \rangle$ is not principal because there is no $t \in O_K$ that divides both 3 and $7 + \sqrt{10}$. For the ideals we have

$$\begin{aligned} \langle 3 \rangle &= \langle 3, 7 + \sqrt{10} \rangle \langle 3, 7 - \sqrt{10} \rangle \\ \langle 13 \rangle &= \langle 13, 7 + \sqrt{10} \rangle \langle 13, 7 - \sqrt{10} \rangle \\ \langle 7 \pm \sqrt{10} \rangle &= \langle 3, 7 \pm \sqrt{10} \rangle \langle 13, 7 \pm \sqrt{10} \rangle \end{aligned}$$

and the ideal $\langle 39 \rangle$ can be written uniquely as the product of the four prime ideals $\langle 3, 7 + \sqrt{10} \rangle$, $\langle 3, 7 - \sqrt{10} \rangle$, $\langle 13, 7 + \sqrt{10} \rangle$, and $\langle 13, 7 - \sqrt{10} \rangle$.

Algebraic number theory can be viewed in two ways. In the first it is a branch of modern algebra and provides some good examples of structures for that theory to work with. In the second way, it provides a more general context in which some problems concerning the rational numbers and integers, especially Diophantine equation problems, can be viewed. For a good introduction to the whole subject see Stewart and Tall (1979).

II. DIOPHANTINE EQUATIONS

This topic is the largest and most important in number theory; it concerns the solution of polynomial equations in some specified number system, often the integers $\mathbb{Z}$ or the rational numbers $\mathbb{Q}$. The name refers to the early Greek mathematician Diophantus, who was working in Alexandria in the middle of the third century. His surviving books show that he had a highly developed knowledge of the solution of equations in integers, especially those involving sums of squares. Nowadays, although some
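The two factorizations of 39 in $\mathbb{Q}(\sqrt{10})$ discussed in Section I can be checked with a few lines of integer arithmetic. The sketch below stores $a + b\sqrt{10}$ as the pair `(a, b)` and uses the field norm $N(a + b\sqrt{10}) = a^2 - 10b^2$, which is multiplicative; the function names are hypothetical and chosen only for this illustration. The irreducibility argument it supports is the standard one: a proper factor of 3, 13, or $7 \pm \sqrt{10}$ would need norm $\pm 3$ or $\pm 13$, and none of these values is a square modulo 10.

```python
def mul(u, v, d=10):
    """Multiply u = a + b*sqrt(d) and v = c + e*sqrt(d), each stored as (a, b)."""
    a, b = u
    c, e = v
    return (a * c + d * b * e, a * e + b * c)


def norm(u, d=10):
    """Field norm N(a + b*sqrt(d)) = a^2 - d*b^2."""
    a, b = u
    return a * a - d * b * b


def norm_is_attainable_mod_d(n, d=10):
    """Necessary condition for a^2 - d*b^2 = n: n must be a square modulo d."""
    return any((a * a - n) % d == 0 for a in range(d))


# (7 + sqrt(10)) * (7 - sqrt(10)) should equal 39 = 3 * 13.
product = mul((7, 1), (7, -1))

# Norms that a proper factor of 3, 13, or 7 +/- sqrt(10) would need to have.
blocked = [n for n in (3, -3, 13, -13) if not norm_is_attainable_mod_d(n)]
```

Since every candidate norm is blocked, no element of $\mathbb{Z}[\sqrt{10}]$ has norm $\pm 3$ or $\pm 13$, so the four factors (of norms 9, 169, 39, and 39) admit no proper factorization, in agreement with the statement that they are irreducible.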
methods recur, a large array of ideas and arguments are used; this is particularly true when integer solutions are sought. When solutions in an algebraic number field (for example $\mathbb{Q}$) are required, a more organized theory based on ideas from algebraic geometry has been developed; we shall discuss this in the next section. A good survey of the whole topic can be found in Mordell (1969).

We begin by listing some of the major results in Diophantine equation theory. The list is by no means complete, but it will give the reader some idea of the range of theorems and methods used.

A. Some Major Results in Diophantine Equation Theory

1. Linear Diophantine Equations

If $a, b, n \in \mathbb{Z}$, then the equation

$$ax + by = n$$

has a solution in integers $x$ and $y$ if and only if $(a, b) \mid n$, and if $x_0, y_0$ is a solution then the general solution is given by $x_0 + tb/(a, b)$, $y_0 - ta/(a, b)$ where $t \in \mathbb{Z}$. The Euclidean algorithm provides a good method for finding solutions of these equations. This result is one of the most useful in the whole of number theory.

2. Pell's Equation and Quadratic Forms

Suppose $d$ is a positive integer that is not a square; then the equation

$$x^2 - dy^2 = 1$$

has infinitely many integer solutions $x, y$, generated from a fundamental solution $x_0, y_0$ using the relation $x + y\sqrt{d} = (x_0 + y_0\sqrt{d})^n$ for $n \in \mathbb{Z}$. The fundamental solution can be found using the continued fraction expansion for $\sqrt{d}$ (see Section IV). It is an accident that Pell's name has been attached to this equation; J.-L. Lagrange (1736–1813) gave the first proof of its solvability.

A quadratic form (over $\mathbb{Z}$) is a function $F: \mathbb{Z}^n \to \mathbb{Z}$ given by the equation

$$F(x_1, \ldots, x_n) = \sum_{i,j=1}^{n} a_{ij} x_i x_j,$$

where, for $1 \le i, j \le n$, $a_{ij}$ is an integer satisfying $a_{ij} = a_{ji}$. An extensive theory has been developed for solutions of equations of the form

$$F(x_1, \ldots, x_n) = m.$$

It relies to some extent on the theory of Pell's equation discussed above. The results are generally straightforward, if a little complicated to state. Equations can have either a finite or an infinite number of solutions (this is related to the underlying geometry); in the latter case the solutions are generated by a finite set of fundamental solutions. Accounts are given in Rose (1999) [two-variable integer case] and Cassels (1978) [general field and integral domain cases].

3. Thue–Siegel–Roth Theorem

This is an example of a major class of equations that have only finitely many solutions. One version of the theorem is as follows: Suppose $f$ is an irreducible homogeneous polynomial, without multiple roots, in the rational number variables $x$ and $y$, and having degree $n > 2$. Then if the equation $f(x, y) = m$, $m \in \mathbb{Z}$, is soluble, it has only finitely many solutions. The proof uses Diophantine approximation theory (see Section IV). T. A. Skolem (1887–1963) has given an ingenious method for solving some of these equations; see Borevich and Shafarevich (1966). Also, G. Faltings has greatly extended this result; see Section III.

4. Goldbach Conjecture

Here, we look for prime number solutions only; the result has not yet been fully established. C. Goldbach (1690–1764) conjectured that every even positive integer larger than 2 can be expressed as a sum of two prime numbers. For example, 10 = 7 + 3, 100 = 53 + 47, 1000 = 509 + 491, etc. At the present time, no counterexample has been found, and J.-R. Chen has proved that every large even integer is the sum of a prime and a number that has at most two prime factors. Also, I. M. Vinogradov (1891–1983) has shown that every large odd positive integer is the sum of three primes. The proofs of both of these results are long and complex.

5. Fermat's "Last Theorem"

First proposed in 1637 and finally proved in 1995, this is undoubtedly the most famous and well-researched Diophantine equation. We shall give a brief history; for more information see Edwards (1977), Ribenboim (1979), Cornell (1997) (for a detailed account of the whole proof), and Singh (1997) (for a popular account including the recent advances).

Although of little intrinsic interest in itself, Fermat's so-called Last Theorem has had a profound influence on the development of pure mathematics over the past three and a half centuries. It states: the equation $F_n$

$$x^n + y^n = z^n$$

has no solution in integers $x$, $y$, and $z$ if $xyz \neq 0$ and $n > 2$. (There are solutions when $n = 2$.) In 1637, P. de Fermat
(1601–1665) proposed this result and claimed to have a proof, which has not survived, although he did prove the result in the case $n = 4$ and so showed that $n$ can be restricted to be an odd prime number. The next step was taken by L. Euler (1707–1783), who proved Fermat's result for $n = 3$. He was the first to realize that the solution to the problem would come by working in a number system larger than the integers; he used the complex numbers in his proof. In the middle of the nineteenth century, E. E. Kummer (1810–1893) greatly extended this work and so proved the result for a large number of cases (and as a by-product helped to lay the foundation of modern algebra). Building on this and using the power of modern computers, Fermat's Last Theorem had been established for all $n < 4{,}000{,}000$ by 1980. Also, by Faltings's work (see next section), it was known that even if one of the equations $F_n$ was solvable, it could only have a finite number of solutions.

In 1985, G. Frey noted an apparently simple connection between Fermat's result and some properties of a class of elliptic curves (see Section III). Frey's elliptic curves have the form

$$y^2 = x(x - a^n)(x - c^n).$$

Using these curves, he made the following conjecture: if Fermat's Last Theorem is false then so is the Taniyama and Weil Elliptic Curve conjecture (see the last paragraph of Section III). He noted that if $a^n - c^n = b^n$ and $a, b, c \in \mathbb{Z}$, then the discriminant of the curve has the form $\Delta = (abc)^{2n}$, and so he conjectured that if an elliptic curve whose discriminant was a $2n$-power existed, then it could not satisfy the Taniyama and Weil conjecture. This implication was proved by K. Ribet in 1990. At about this time, A. Wiles began work on proving that the Taniyama and Weil conjecture is true, one consequence of which would be, by Frey and Ribet's result, that Fermat's result was finally proved. His work was successful, for in 1995, with some input from R. Taylor, he published his proof of the Taniyama and Weil conjecture for a large class of elliptic curves, which includes the Frey curves, and so finally established Fermat's Last Theorem 358 years after it had been first proposed. This work will surely rank as one of the greatest achievements of twentieth century mathematics.

6. Sums of Squares

Here we ask: Can a positive integer $m$ be expressed as a sum of $k$ squares, or $l$ cubes, or in general as a sum of $n$th powers? We shall look at the squares case first. Lagrange's famous four-squares theorem states that every positive integer is a sum of four squares. The usual proof uses the so-called "infinite descent method," which is a kind of mathematical induction. First, by a simple counting argument we find, for our given $m$, positive integers $t, x, y, z, w$ satisfying $x^2 + y^2 + z^2 + w^2 = tm$. Secondly, given this equation, a procedure is devised to give a new solution $t_1, x_1, y_1, z_1, w_1$ to this equation, which satisfies $0 < t_1 < t$. The result follows by finitely many applications of this procedure.

The two-square result, first proved by Fermat, is: the equation

$$x^2 + y^2 = n$$

has a solution $x, y$ provided those prime factors of $n$ that are congruent to 3 modulo 4 occur in the prime factorization of $n$ with an even power. So, for example, 1, 2, 4, 5, 8, 9, 10, 13, 16, 17, and 18 are the positive integers less than 20 that can be expressed as sums of two squares. This can also be established using the infinite descent method, although many other proofs exist. The proofs of both of these results rely on the fact that the product of two sums of 2 or 4 squares is itself a sum of 2 or 4 squares, respectively. For example, we have

$$(x^2 + y^2)(z^2 + w^2) = (xz + yw)^2 + (xw - yz)^2$$

(a result known to Diophantus in the third century AD). No similar identity is available for an odd number of squares, and so in this case different methods involving quadratic form theory are used. The three-squares theorem, first proved by Gauss, states that $n$ can be expressed as a sum of three squares provided it is not of the form $4^a(8b + 7)$. For example, every positive integer less than 30 is a sum of three squares except 7, 15, 23, and 28.

7. Waring's Problem

More generally we can ask if all positive integers are sums of $k$th powers with $g(k)$ summands, where $g$ depends only on $k$. This is known as Waring's problem. In 1770, E. Waring (1736–1798) postulated that $g(2) = 4$ (this is Lagrange's theorem), $g(3) = 9$, $g(4) = 19$, and so on. In 1909, D. Hilbert (1862–1943) showed that $g$ exists for all $k$. It has been established that $g(3) = 9$ and $g(5) = 37$; very recently it has been shown that $g(4) = 19$, the value Waring conjectured. Further, for most $k \ge 6$, $g(k) = 2^k + A - 2$, where $A$ denotes the largest integer less than $(3/2)^k$. For the exact result see Hardy and Wright (1954), pages 335–337.

One aspect distinguishes the case $k = 2$, discussed in Subsection 6, and the case $k > 2$; in the latter case there are a few exceptional integers that require more summands than usual. For instance, only 23 and 239 need nine cubes, and if $n > 8042$ then, in all probability, six cubes are sufficient. If we let $G(k)$ denote the least number of
summands needed to express all sufficiently large positive integers as a sum of $k$th powers, then clearly

$$G(k) \le g(k) \quad \text{and} \quad G(2) = g(2) = 4.$$

For most $k$ the precise value of $G(k)$ is not known. $G(3)$ lies between 4 and 7 with the most likely values 4 or 5, and $G(4) = 16$. G. H. Hardy (1877–1947) and J. E. Littlewood (1885–1977) conjectured that $G(k) \le 2k + 1$ if $k \neq 2^m$, where $m > 1$, and $G(2^m) = 2^{m+2}$, again where $m > 1$. At the moment the best result known is: as $k \to \infty$, $G(k) \le (2 + c_k) k \ln k$, where $c_k \to 0$ as $k \to \infty$. For an account of this work see Vaughan (1981) and Ribenboim (1989), pages 236–245, which gives the current status of this problem when $k \le 10$.

III. ELLIPTIC CURVES

In this section we discuss the solution of two variable Diophantine equations defined over an algebraic number field. Compared with the material discussed in the previous section, a more organized theory has been developed; it uses some basic ideas from algebraic geometry. Also, it is the subject of considerable current research interest.

Let $K$ be an algebraic number field, for example the rational field $\mathbb{Q}$, and let $A^n(K)$ denote the set of $n$-tuples $(x_1, \ldots, x_n)$ where each $x_i \in K$. $A^n(K)$ is called the $n$-dimensional affine space over $K$. On the set $A^3(K) - \{(0, 0, 0)\}$ we define the relation $\sim$ by: $(x, y, z) \sim (x', y', z')$ if and only if $x = x't$, $y = y't$, $z = z't$ for some nonzero $t \in K$. This is an equivalence relation, and the set of equivalence classes is called the two-dimensional projective space $P^2(K)$ over $K$. Points in this space are denoted by $(x:y:z)$. Note that by the relation $\sim$ it is only the ratios of $x$, $y$, and $z$ that matter; so, for example, $(2:3:5)$, $(-4:-6:-10)$, and $(26:39:65)$ all represent the same point.

A curve $C$ in $P^2(K)$ is the set of points $(x:y:z)$ that satisfy a homogeneous polynomial equation $f(x, y, z) = 0$, and the degree of $C$ is the degree of the polynomial $f$. A point $P$ on $C$ is called singular if the partial derivatives ($\partial f/\partial x$, etc.) at $P$ with respect to all three variables $x$, $y$, and $z$ are zero; unique tangents can be drawn to the curve at points that are not singular. Two curves $C$ and $C'$ are said to be birationally equivalent if each can be transformed into the other by a rational function and the correspondence is one-to-one and onto except at a finite number of points. For example, $x' = 1/(x - 1)$, $y' = 1/y$, $z' = 2z$ is birational. Note that the degrees of $C$ and $C'$ need not be identical.

We come now to the important notion of the genus of a curve.

Definition: The genus $g$ of a curve $C$ that has degree $n$ and has $N$ singular points is given by $g = \frac{1}{2}(n - 1)(n - 2) - N$.

It can be shown that $g$ is a nonnegative integer and is unaltered when $C$ is birationally transformed into a new curve $C'$. The genus provides an important classification, into three main classes, for the collection of all curves defined over the field $K$. The first class contains curves of genus zero. All curves in this class can be reduced (by birational transformations) to lines or conics, and the linear and quadratic form material discussed in the previous section applies to the associated Diophantine equations. The second class contains the curves of genus one. Using birational transformations, all curves in this class that have at least one point whose coordinates lie in $K$ can be represented by equations of the form

$$y^2 z = x^3 + a x z^2 + b z^3, \qquad (1)$$

where $a, b \in K$. They are called elliptic curves because they can be parametrized using elliptic functions. The discriminant $\Delta$ of the curve (1) is $-16(4a^3 + 27b^2)$. Many of the basic properties of this curve are determined by this quantity; for example, the curve (1) is elliptic if and only if $\Delta \neq 0$ and, in the real plane, it has one component if $\Delta < 0$ and two if $\Delta > 0$. The last main class contains curves having genus larger than one. Until recently little was known about these equations. In 1922 L. J. Mordell (1888–1972) conjectured that each equation in this class has only finitely many points whose coordinates lie in $K$. This remarkable conjecture was established by Faltings in 1983, using advanced techniques from algebraic geometry. We have already mentioned one consequence of this work. If $n > 3$, the genus of each of the curves associated with the Fermat equation $F_n$ (see Section II.A.5) is larger than one, and so none of these equations can have infinitely many solutions. This fact provided initial support to Wiles and others in their eventual solution of Fermat's result. Table I summarizes the main classes for the field $\mathbb{Q}$.

A. Elliptic Curves

Points on elliptic curves have an important algebraic structure that we shall discuss now. If the point $(x:y:z)$ lies on the curve $C$ (defined over the field $K$) and $x, y, z \in K$, then we call the point rational; and we let $C(K)$ denote the set of rational points on $C$. Using the so-called chord-tangent method, $C(K)$ can be given a group structure as follows. Let us suppose that $C$ is represented in the standard form, Eq. (1). In this case it is a cubic curve and any straight line will intersect it in at most three points. Further, if two of these points are rational then the coefficients of the
equation of the line through these points will belong to $K$, and so the third point will also be rational. This conclusion is valid even if the first two points coincide at $P$, that is, the line is the tangent at $P$.

TABLE I Properties of Homogeneous Three Variable Diophantine Equations Defined Over $\mathbb{Q}$

                                  Maximum number of solutions
Genus   Parametrization           in $\mathbb{Z}$    in $\mathbb{Q}$
0       By rational functions     Infinite           Infinite
1       By elliptic functions     Finite             Infinite but finitely generated
≥2      (none)                    Finite             Finite

Using this chord-tangent process we can define an addition operation on $C(K)$. Let $O$ be the point $(0:1:0)$; it lies on all curves of the form Eq. (1), and it will act as the identity of the group. Suppose $P_1, P_2 \in C(K)$ and the line $P_1 P_2$ meets $C$ again at $Q$; as noted above, $Q \in C(K)$. Now construct the line $OQ$; it will meet $C$ in one further rational point. This new point is called the sum of $P_1$ and $P_2$ and is denoted by $P_1 + P_2$. It can be shown that $C(K)$ with this operation is an Abelian group. The proof is straightforward except for associativity, which requires a result from algebraic geometry known as Bézout's theorem. Some sample constructions are given in Fig. 1. In this example, $P_1 + P_2 = 2P$ where $2P$ denotes the tangent point $P + P$.

FIGURE 1 Sample constructions of rational points on elliptic curves by the chord-tangent method.

In 1922 Mordell showed that the Abelian group $C(\mathbb{Q})$ has only a finite number of generators; that is, all rational points on $C(\mathbb{Q})$ can be constructed by the chord-tangent method from a finite set of points. Later, A. Weil extended this to all algebraic number fields $K$, and the result is now called the Mordell–Weil theorem. The group $C(K)$ can be finite or infinite. Recently B. Mazur has classified all groups that can occur in the finite case when $K = \mathbb{Q}$; they are either cyclic or direct products of two cyclic groups and have orders not larger than 16.

The rank of $C(K)$ is the number of infinite order generators of $C(K)$; it is finite by the Mordell–Weil theorem. This number has been studied extensively, and recently some results have been established. For many curves the rank is zero or one; but examples have been given in which the rank is at least 22. For a fixed field $K$, it is not known if the rank can be arbitrarily large or, conversely, if a constant $m$ exists such that the rank of all elliptic curves over $K$ is less than $m$.

There are many conjectures concerning these curves; the most important involve L-functions, a detailed definition of which can be found in Silverman (1986). Roughly speaking, an L-function for a curve $C$ encodes "local" properties (that is, properties of $C$ defined over finite fields, working modulo a prime for example) in the hope of obtaining "global" properties of $C$ (that is, properties of $C$ valid over the rational numbers or the complex numbers). Two main collections of conjectures have been studied. The first, due to B. Birch and H. P. F. Swinnerton-Dyer, relates the behavior of the L-function of a curve $C$ near 1 to the rank of $C$ over the rational numbers; there is good numerical evidence for this conjecture and it has been proved recently for some classes of curves. The second main collection of conjectures concerns the domain of definition of the L-functions in the complex plane. It is a simple matter to see that these functions are defined at most points $z$ in the right hand side of the plane (to be precise, when the real part of $z$ is larger than 3/2). The Taniyama and Weil conjecture states that the domain of definition can be extended to cover the whole of the complex plane except for the point 1. It is known that if this conjecture holds for the curve $C$ in question, and a set of related curves, then the curve is "modular," which implies that it possesses a number of useful arithmetic properties including a special type of parameterization. It is this conjecture (for a subset of all curves, called the semistable curves, which contains the Frey curves related to Fermat's Problem) that A. Wiles (and R. Taylor) proved in 1993/1995 [see Wiles (1995)] and which finally established that Fermat's Last Theorem is valid in all cases (see Section II.A.5). At the time of writing (November 1999), it has been announced by C. Breuil, B. Conrad, F. Diamond, and R. Taylor that the Taniyama and Weil conjecture is valid for all elliptic curves defined over the rational numbers; no details are available.
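
The chord-tangent addition described in this section can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: it works with the affine form $y^2 = x^3 + ax + b$ of Eq. (1) (obtained by setting $z = 1$) over the rational numbers, it represents the point $O = (0:1:0)$ by `None`, and the function names `discriminant`, `on_curve`, and `add`, as well as the sample curve $y^2 = x^3 - 2$ with the rational point $(3, 5)$, are choices made for this example rather than anything taken from the article.

```python
from fractions import Fraction

# Assumption of this sketch: the point O = (0:1:0) at infinity is represented by None.
O = None


def discriminant(a, b):
    """Discriminant -16(4a^3 + 27b^2) of y^2 = x^3 + a*x + b."""
    return -16 * (4 * a**3 + 27 * b**2)


def on_curve(P, a, b):
    """Check that P is O or satisfies y^2 = x^3 + a*x + b."""
    if P is O:
        return True
    x, y = P
    return y * y == x**3 + a * x + b


def add(P, Q, a, b):
    """Chord-tangent sum P + Q on y^2 = x^3 + a*x + b, using exact rationals."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = map(Fraction, P)
    x2, y2 = map(Fraction, Q)
    if x1 == x2 and y1 == -y2:
        # The line through P and Q is vertical, so it meets the curve again at O.
        return O
    if x1 == x2:
        # P == Q: use the slope of the tangent at P.
        slope = (3 * x1 * x1 + a) / (2 * y1)
    else:
        # Distinct points: use the slope of the chord through P and Q.
        slope = (y2 - y1) / (x2 - x1)
    # Third intersection of the line with the curve, then reflected in the
    # x-axis (the reflection is the second intersection of the line through O).
    x3 = slope * slope - x1 - x2
    y3 = slope * (x1 - x3) - y1
    return (x3, y3)


P = (3, 5)                 # 5^2 = 3^3 - 2, so P lies on y^2 = x^3 - 2
print(add(P, P, 0, -2))    # (Fraction(129, 100), Fraction(-383, 1000))
```

Exact rational arithmetic is used so that the computed points can be checked against the curve equation without rounding error; floating-point coordinates would only satisfy the equation approximately.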
IV. DIOPHANTINE APPROXIMATION AND TRANSCENDENCE

These topics are concerned with the number-theoretic properties of the real and complex numbers. The starting point for this work is the question: How close can a rational number $a/b$ approximate to a real number $\alpha$ assuming some restriction on the size of $a$ and $b$; that is, given $\alpha$ can integers $a$ and $b$ be found to satisfy

$$\left| \alpha - \frac{a}{b} \right| < \frac{1}{f(b)} \qquad (2)$$

for some suitably chosen monotone increasing function $f$? This is the basic question in Diophantine approximation theory. We answer this question and discuss some related topics in the subsections below.

Continued fractions play an important role in this theory; they are defined as follows. Let $\alpha$ be a real number and let $[\alpha]$ denote the largest integer $c$ satisfying $c \le \alpha$; then we can write

$$\alpha = [\alpha] + 1/\alpha_1,$$

where $\alpha_1 > 1$ provided $\alpha$ is not an integer. We repeat this construction by setting

$$\alpha_n = [\alpha_n] + 1/\alpha_{n+1}$$

($n = 1, 2, \ldots$) provided $\alpha_n$ is not an integer. If $\alpha$ is a rational number this process will stop in a finite number of steps, otherwise it will continue indefinitely. We write $q_0$ for $[\alpha]$, $q_n$ for $[\alpha_n]$, and $[q_0, q_1, \ldots, q_n]$ for the expression

$$q_0 + 1/(q_1 + 1/(q_2 + \cdots + 1/q_n \cdots)).$$

This is called the $n$th continued fraction convergent to $\alpha$. For example, the fifth continued fraction convergent to $\sqrt{2}$ is $[1, 2, 2, 2, 2, 2]$.

A. Rational Approximation

1. Best Approximation

By a "good" rational approximation $a/b$ to a real number $\alpha$ we mean that $a/b$ is close to $\alpha$, and $a$ and $b$ are relatively small. We define the best approximation to a real number $\alpha$ relative to $n$ to be the rational number $a/b$ closest to $\alpha$ satisfying $0 < b < n$. For example, 22/7 is a good approximation to $\pi$; it is in fact the best approximation relative to any $n \le 56$. All good approximations to a real number $\alpha$ are continued fraction convergents to $\alpha$.

2. Hurwitz's Theorem

The theorem of A. Hurwitz (1859–1919) puts a limit on how good approximations can be in general. It states that if $\alpha$ is irrational then the inequality (2) has infinitely many solutions $a, b$ provided $f(b) \le b^2\sqrt{5}$. This result can be derived using some elementary theorems concerning continued fractions [see Rose (1999)]. It is best possible, for if $\alpha = (1 + \sqrt{5})/2$ (its continued fraction representation is $[1, 1, 1, \ldots]$) and $f(b) > b^2\sqrt{5}$ then inequality (2) has only finitely many solutions. If this $\alpha$ and any real number whose continued fraction expansion is eventually all ones is excluded, then Hurwitz's theorem can be improved by replacing $\sqrt{5}$ by $\sqrt{8}$. Then $\sqrt{8}$ is best possible when $\alpha = \sqrt{2}$. This process continues; finally it can be shown that there are uncountably many real numbers $\alpha$ for which the inequality (2) has infinitely many solutions if $f(b) = 3b^2$, but only finitely many if the number 3 is replaced by a larger number.

3. Sets of Real Numbers Modulo 1

Let $((\alpha))$ denote the fractional part of $\alpha$, that is $((\alpha)) = \alpha - [\alpha]$. P. L. Chebyshev (1821–1894) was the first to ask: Given $\alpha$, how is the set $\{((n\alpha)): n = 1, 2, \ldots\}$ distributed in the unit interval? Using the methods discussed above, he was able to show that this set is dense in the unit interval and that the distribution is uniform provided $\alpha$ is irrational. His methods apply to a number of similar problems.

B. Transcendental Numbers

A real or complex number that satisfies no polynomial equation with algebraic coefficients is called transcendental. J. Liouville (1809–1882) was the first to show that transcendental numbers exist, although using the diagonal argument of G. Cantor (1845–1918) we now know that almost all real or complex numbers have this property. On the other hand there are many specific numbers whose transcendental status is unknown. Examples are Euler's constant $\gamma$, $e + \pi$, and $\pi^{\sqrt{2}}$.

Liouville began with the inequality (2) and showed that if $\alpha$ is an algebraic number of degree $n$ ($n > 1$) then (2) has only finitely many solutions if $f(b) = b^{n+k}$ and $k > 0$. Using this he was able to show that certain real numbers are transcendental; for example the number whose decimal expansion is $0.11000100\ldots$, that is, all zeros except for 1's at the $n!$ digit places, $n = 1, 2, \ldots$. A number of this type is now called a Liouville number.

C. Hermite (1822–1901) introduced the main method for establishing the transcendence of a number in 1873. It begins by assuming that $\alpha$ is algebraic, satisfying the polynomial equation $f(\alpha) = 0$; that is, the contrary of the required result. Then two properties of a formula $F$ constructed using the coefficients of $f$ are derived that contradict one another, and so transcendence is established. For example, the contradictory properties could be (a) $F$ is a
positive integer and (b) $F$ tends to zero. Hermite used this method to show that $e$ ($= \sum_{n=0}^{\infty} 1/n!$) is transcendental. F. Lindemann (1852–1939) extended this to show that if $\alpha_1, \ldots, \alpha_n$ are distinct algebraic numbers and $c_1, \ldots, c_n$ are nonzero algebraic numbers, then

$$c_1 e^{\alpha_1} + \cdots + c_n e^{\alpha_n} \neq 0. \qquad (3)$$

Using this result he was able to show that the following numbers are transcendental where $\alpha$ is algebraic, nonzero, and not equal to one in the last two cases:

$$\pi,\ e^{\alpha},\ \sin\alpha,\ \cos\alpha,\ \sinh\alpha,\ \cosh\alpha,\ \arcsin\alpha,\ \arccos\alpha,\ \text{and } \ln\alpha.$$

A famous problem first proposed by the ancient Greeks and known as "squaring the circle" was: Construct a square equal in area to a given circle using only a ruler and compass. Lindemann's result shows that this is impossible as $\sqrt{\pi}$ is transcendental. Further extensions of the method enabled A. O. Gelfond (1906–1968) and T. Schneider to show that if $\alpha$ and $\beta$ are algebraic numbers, $\alpha \neq 0$ or 1, and $\beta$ is irrational then $\alpha^{\beta}$ is transcendental.

One consequence of this result is: If $\alpha$, $\beta$, $\gamma$, and $\delta$ are nonzero algebraic numbers and no linear relation with rational coefficients exists between $\ln\beta$ and $\ln\delta$, then

$$\alpha \ln\beta + \gamma \ln\delta \neq 0.$$

In 1966, A. Baker extended this by providing a similar result with $n$ rather than 2 summands [see Lindemann's result, inequality (3)]. A number of important applications have followed. For example, the number $e^{\alpha}\beta^{\gamma}$ is transcendental provided $\alpha$, $\beta$, and $\gamma$ are algebraic and nonzero. We shall mention two further applications.

(a) In Section III, we introduced the elliptic curves; Baker's result provides an effective finite bound on the number of integer (as opposed to rational number) solutions of the associated Diophantine equations.
(b) Baker's result enables us to list all those imaginary quadratic fields that have unique factorization (see Section I).

1. Mahler's Classification

The set of all real and complex numbers is divided into four classes A, S, T, and U as follows:

1. The height $h$ of a polynomial $f$ with rational integer coefficients, not all zero, is the maximum of the absolute values of its coefficients.
2. Given $\gamma$, $n$, and $h > 1$, let $f$ be the polynomial with rational integer coefficients, degree at most $n$, and height at most $h$, such that $|f(\gamma)|$ takes the least nonzero value. Now let $\omega(n, h)$, $\omega_n$, and $\omega$ be given by

$$|f(\gamma)| = h^{-n\omega(n,h)},$$
$$\omega_n = \lim_{h\to\infty} \max_{1 \le k \le h} \omega(n, k),$$
$$\omega = \lim_{n\to\infty} \max_{1 \le m \le n} \omega_m.$$

Finally, let $\nu$ be the least positive integer $n$ such that $\omega_n = \infty$, and let $\nu = \infty$ if $\omega_n < \infty$ for all $n$.

There are four possibilities for the values of $\omega$ and $\nu$ (note $\omega$ and $\nu$ cannot both be finite). The scheme of K. Mahler (1903–1988) defines

$\gamma$ to be an A-number if and only if $\omega = 0$ and $\nu = \infty$,
$\gamma$ to be an S-number if and only if $0 < \omega < \infty$ and $\nu = \infty$,
$\gamma$ to be a T-number if and only if $\omega$ and $\nu$ are both infinite, and
$\gamma$ to be a U-number if and only if $\omega = \infty$ and $\nu$ is finite.

The following results concerning this classification have been proved.

(a) If two numbers $\alpha$ and $\beta$ are algebraically dependent (i.e., $g(\alpha, \beta) = 0$ for some two-variable polynomial $g$) then they belong to the same class.
(b) The A-numbers are exactly the algebraic numbers, and so the S-, T-, and U-numbers are transcendental.
(c) Almost all real and complex numbers are S-numbers.
(d) The Liouville numbers are U-numbers; in the example of a Liouville number given above $\omega_1 = \infty$.
(e) T-numbers exist; this is a recent result of W. M. Schmidt.

Many problems remain; for example, it is known that $\pi$ is an S- or T-number, but which? For further details on all of the results in this section see Baker (1975), LeVeque (1955), and Schmidt (1980).

V. PRIME NUMBER THEORY

The properties of the prime numbers have been studied since the time of the early Greeks Pythagoras and Euclid; even so, many questions remain unanswered today. The first major result, which is attributed to Euclid, states that the set of prime numbers is infinite. The other major result from this period is the so-called "Sieve of Eratosthenes," which is an effective method for enumerating all primes less than some fixed integer.

Two pre-eminent results have been established in modern times. They are the theorem of P. G. L. Dirichlet
(1805–1859) concerning primes in arithmetic progressions and the Prime Number Theorem (PNT), which estimates the density of the primes. We shall discuss these, but we shall begin by considering some of the many unsolved problems in prime number theory. Most of the conjectures associated with these problems are backed up with ample numerical evidence, but this should be treated with some skepticism because in most cases this evidence involves only relatively small numbers. For example, most of the available numerical evidence suggests that $\mathrm{Li}(x) > \pi(x)$ for all $x$. [These two prime density functions are defined below.] But it is now known that $\mathrm{Li}(x) - \pi(x)$ changes sign infinitely often, and the first change occurs before $x = 6.69 \times 10^{370}$.

A. Conjectures Concerning Prime Numbers

1. The Twin Prime Conjecture

A brief study of a table of prime numbers shows that there are many pairs of primes $p, q$ where $q = p + 2$; examples are 3, 5; 101, 103; 1997, 1999; and $10^9 + 7$, $10^9 + 9$. If we let $\pi_2(x)$ denote the number of prime pairs less than $x$, then, for instance, $\pi_2(10^3) = 35$ and $\pi_2(10^6) = 8169$. The twin prime conjecture states that $\pi_2(x) \to \infty$ as $x \to \infty$, and the likely value of $\pi_2(x)$ is $O\!\left(\int_2^x (\ln t)^{-2}\, dt\right)$. Progress has been made on this conjecture, for Chen has shown that there are infinitely many pairs of integers $p$, $p + 2$ where $p$ is prime and $p + 2$ has at most two prime factors.

2. Primes in Intervals

It is a simple matter to show that a prime occurs between $n$ and $2n$ for all integers $n > 0$. This is known as "Bertrand's Postulate"; for a proof see Rose (1999), page 231. Using the Prime Number Theorem this result can be extended to show that for all large $n$ there is a prime between $n$ and $(1 + \varepsilon)n$ for any given fixed $\varepsilon > 0$. But no similar result for a smaller interval has been established. For example, it is not known if a prime occurs between $n^2$ and $(n + 1)^2$ for all large $n$.

3. Extensions of Dirichlet's Theorem

Very little information is available concerning the set $D$ of those $n$ for which $an + b$ is prime. (Dirichlet's theorem tells us that it is infinite.) For example, does $D$ itself contain infinitely many primes, or any primes at all? Another proposed extension replaces the linear form $an + b$ by a quadratic one. For instance, it is not known if there are infinitely many primes of the form $n^2 + n + 1$ or $n^2 + n + p$ for some prime $p$.

A number of other problems concerning the primes have been discussed elsewhere. These include the Goldbach conjecture (see Section II.A.4), problems concerning formulas taking prime values, and Fermat and Mersenne numbers.

We shall discuss now the two major results mentioned earlier and their connections with the Riemann hypothesis.

B. Dirichlet's Theorem

Suppose $(a, b) = 1$; this famous theorem, first established in 1837, states that there are infinitely many prime numbers in the arithmetic progression $A = \{an + b,\ n = 1, 2, \ldots\}$. The proof first considers a series of the form $\sum \chi(p)/p$ where the sum is taken over the primes $p$ in $A$. The function $\chi$ is used to pick out elements of $A$ and is defined using some basic group theory (it is called a multiplicative character). Second, standard analytic techniques are applied to this series to show that it diverges to infinity. Dirichlet's theorem follows, for if there were only finitely many primes in the progression $A$ then the series would have a finite sum.

C. The Prime Number Theorem (PNT)

Let $\pi(x)$ denote the number of primes $p$ satisfying $p \le x$. J. S. Hadamard (1865–1963) and C. de la Vallée Poussin (1866–1962) proved in 1896 that

$$\pi(x) \sim x/\ln x.$$

A more precise version of this result is

$$\pi(x) = \mathrm{Li}(x) + O\!\left(x/(\ln x)^2\right),$$

where

$$\mathrm{Li}(x) = \lim_{\varepsilon \to 0} \left( \int_0^{1-\varepsilon} + \int_{1+\varepsilon}^{x} \right) \frac{dt}{\ln t}.$$

There are two main proofs of PNT, an analytic one and an elementary one. A typical analytic proof begins with the result: if $s$ is a complex variable whose real part is larger than one then

$$\ln \zeta(s) = s \int_2^{\infty} \frac{\pi(x)}{x(x^s - 1)}\, dx. \qquad (4)$$

[The zeta function $\zeta(s)$ is defined below.] An estimate for $\pi(x)$ is then obtained by "solving" this equation for $\pi(x)$. We shall consider the error term in this process later. Elementary proofs of PNT rely on the following formula, which was established by A. Selberg in 1949:

$$\sum_{p \le x} \ln^2 p + \sum_{pq \le x} \ln p \ln q = 2x \ln x + O(x),$$

where the sums in this formula are taken over prime values only. This is used to improve some standard estimates, and the result follows after a lengthy argument. The best estimates for $\pi(x)$ are achieved using analytic methods.
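
The prime-counting statements of this section can be explored numerically with a short Python sketch. It is a minimal illustration, not part of the article: the helper names `primes_up_to`, `prime_count`, and `twin_pair_count` are made up for this example, and the twin-pair count adopts the convention (an assumption) that a pair $p$, $p + 2$ is counted when $p + 2 \le x$. The sieve is the Sieve of Eratosthenes mentioned at the start of Section V.

```python
import math


def primes_up_to(n):
    """Sieve of Eratosthenes: return the list of all primes p <= n."""
    if n < 2:
        return []
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(n) + 1):
        if sieve[i]:
            # Strike out the multiples of i, starting from i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(n + 1) if sieve[i]]


def prime_count(x):
    """pi(x): the number of primes p <= x."""
    return len(primes_up_to(x))


def twin_pair_count(x):
    """Number of pairs (p, p + 2) with both prime and p + 2 <= x (assumed convention)."""
    primes = set(primes_up_to(x))
    return sum(1 for p in primes if p + 2 in primes)


# Compare pi(x) with x / ln x; the PNT says the ratio tends to 1 as x grows.
for x in (10**3, 10**4, 10**5, 10**6):
    pi_x = prime_count(x)
    print(x, pi_x, round(pi_x / (x / math.log(x)), 4))
```

The ratio printed in the last column approaches 1 only slowly, which is consistent with the remark that numerical evidence based on relatively small numbers should be treated with some caution.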
D. The Riemann Hypothesis

For a complex variable s = σ + it we define the Riemann zeta function ζ(s) by

$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$$

for σ > 1 and −∞ < t < ∞. Using analytic continuation, the domain of definition can be extended to the whole complex plane, except the point 1. The connection between the prime numbers and the zeta function is given by the following result [due to Euler]:

$$\zeta(s) = \prod_{p} (1 - p^{-s})^{-1},$$

where the product is taken over all primes p. Taking logs of both sides we obtain

$$\ln \zeta(s) = -\sum_{p} \ln (1 - p^{-s}) = \sum_{p} p^{-s} + \text{finite error term},$$

that is, a sum taken over the prime numbers only has been expressed in standard analytic terms. The result (4) follows from this easily.
We can see from the foregoing equation that it is important to know where the zeros of the zeta function are. It has zeros at s = −2, −4, . . . , and the remaining zeros lie inside the region of the complex plane bounded on the left by the imaginary axis and on the right by the vertical line passing through the point s = 1; this is called the critical strip. The hypothesis of B. Riemann (1826–1866) states that all the zeros in the critical strip lie on the central line, that is, the vertical line through the point s = 1/2. Hardy showed that there are infinitely many zeros on this line, but it is not known if there are any zeros off the line. This is one of the most important unsolved problems in the whole of mathematics. One major consequence of the establishment of this hypothesis would be greatly improved estimates for many theorems in analytic number theory. We give two examples. The error term $x/(\ln x)^2$ in the prime number theorem

$$\pi(x) = \mathrm{Li}(x) + O\bigl(x/(\ln x)^2\bigr)$$

could be replaced by $\sqrt{x}\,\ln x$ if Riemann’s hypothesis were available. Second, it is known that the first quadratic nonresidue n of a prime p satisfies $n < p^{1/4+\varepsilon}$ for large p and ε > 0. (The integer n is a quadratic nonresidue modulo p if the congruence $x^2 \equiv n \pmod{p}$ is insoluble.) If Riemann’s hypothesis were available then this inequality could be replaced by $n < c(\ln p)^2$ for some constant c.
For further applications of this hypothesis see Titchmarsh (1987); for the latest numerical evidence see Odlyzko (1991); and for an informal but detailed discussion of the whole topic of prime numbers see Ribenboim (1989).

VI. PARTITIONS

Unlike much of the previous material, partition theory deals with additive problems. These are often very easy to state, but much of the development is as difficult as any in number theory. Every positive integer can be expressed as a sum of smaller integers in a number of ways. If n > 0 and

$$n = a_1 + a_2 + \cdots + a_r$$

where each $a_i$ is a positive integer, then the set $\{a_1, a_2, \ldots, a_r\}$ is called a partition of n and the individual terms $a_1, a_2, \ldots$ are called parts. The order of the parts is unimportant, although it is usual to write them in decreasing order. The total number of partitions of n is denoted by p(n). For example, p(5) = 7 as

5, 4 + 1, 3 + 2, 3 + 1 + 1, 2 + 2 + 1, 2 + 1 + 1 + 1, 1 + 1 + 1 + 1 + 1

is the set of partitions of 5.
There are many relationships between subsets of the set of partitions of an integer n. The most famous, due to Euler, states that the number of partitions of n all of whose parts are odd integers equals the number of partitions of n where no part is repeated. In the example above, the first three partitions have distinct parts while the first, fourth, and seventh have exclusively odd parts. A number of similar properties will be discussed below.
Euler’s result is remarkable because there is no obvious connection between the two sets, but it becomes almost a triviality once it has been expressed in terms of generating functions. We shall define these now. Let $\{c_0, c_1, c_2, \ldots\}$ be an infinite sequence of positive integers; then the power series

$$f(q) = \sum_{n=0}^{\infty} c_n q^n$$

is called the generating function for this sequence. These power series are treated formally, but in most cases they are absolutely convergent when 0 < q < 1. To prove Euler’s result let $p_1(n)$ denote the number of partitions of n with exclusively odd parts and let $p_2(n)$ denote the number of partitions of n with distinct parts; then we can show that

$$\sum_{n=0}^{\infty} p_1(n) q^n = \prod_{n=1}^{\infty} (1 - q^{2n-1})^{-1}$$

and

$$\sum_{n=0}^{\infty} p_2(n) q^n = \prod_{n=1}^{\infty} (1 + q^n).$$
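Euler’s statement about odd parts and distinct parts can be checked by brute force for small n. The sketch below is illustrative only: the helper names (`partitions`, `count_odd_parts`, `count_distinct_parts`) are not taken from the text, and exhaustive enumeration is practical only for small n.

```python
def partitions(n, max_part=None):
    """Yield every partition of n as a tuple of parts in non-increasing order."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    # Choose the largest part first, then partition the remainder
    # using parts no larger than the one just chosen.
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def count_odd_parts(n):
    """Number of partitions of n whose parts are all odd (p_1(n) in the text)."""
    return sum(1 for p in partitions(n) if all(a % 2 == 1 for a in p))


def count_distinct_parts(n):
    """Number of partitions of n with no repeated part (p_2(n) in the text)."""
    return sum(1 for p in partitions(n) if len(set(p)) == len(p))


if __name__ == "__main__":
    print(list(partitions(5)))  # the seven partitions of 5, in the order listed above
    for n in range(1, 16):
        print(n, count_odd_parts(n), count_distinct_parts(n))
```

For n = 5 the enumeration reproduces the seven partitions listed above, three of which have odd parts only and three of which have distinct parts.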

Now

$$\prod_{n=1}^{\infty} (1 + q^n) = \prod_{n=1}^{\infty} \frac{1 - q^{2n}}{1 - q^n} = \prod_{n=1}^{\infty} (1 - q^{2n-1})^{-1}$$

and this gives $p_1(n) = p_2(n)$ for all n.
To illustrate the theory we list some further results:

1. The number of partitions of n with at most m parts equals the number of partitions of n in which no part exceeds m.
2. The number of partitions of n in which no part occurs more than three times equals the number of partitions of n in which only odd parts may be repeated.
3. If n ≡ 4 (mod 5) then p(n) is divisible by 5. There are similar results for some other moduli, for example 7, 11, 25, 35, . . . .
4. The number of partitions of n with minimum difference two (that is, $|a_i - a_j| > 1$ for all i ≠ j) equals the number of partitions of n in which each part has the form 5m + 1 or 5m + 4.

Most of these results are proved by deriving properties of the corresponding generating functions, as we did above for Euler’s theorem. A typical result is: If 0 < q < 1 then

$$\prod_{n=1}^{\infty} (1 - q^n) = \sum_{m=-\infty}^{\infty} (-1)^m q^{m(3m-1)/2}.$$

This is called Euler’s pentagonal number theorem. Using it we can give a recursive procedure for calculating p(n), viz: If n > 0, p(0) = 1, and p(m) = 0 if m < 0, then

$$p(n) = p(n-1) + p(n-2) - p(n-5) - p(n-7) + \cdots + (-1)^{k-1} p\bigl(n - k(3k-1)/2\bigr) + (-1)^{k-1} p\bigl(n - k(3k+1)/2\bigr) + \cdots.$$

This provides a very efficient method for calculating p(n). The value of p(n) increases rapidly with n, for instance p(10) = 42, p(20) = 627, and p(30) = 5604. Hence, it is important to find an estimate of the approximate value of p(n) for large n. One such estimate is

$$p(n) \sim \frac{1}{4n\sqrt{3}}\, e^{\pi\sqrt{2n/3}}.$$

A more accurate estimate as well as full details on all the topics mentioned in this section can be found in Andrews (1976).

VII. COMPUTATIONAL NUMBER THEORY

With the advent of modern computer systems, a new branch of number theory has developed over the past few decades which provides powerful algorithms for many aspects of the subject; Cohen (1993) gives a good introduction to the whole area. The most spectacular algorithms give very efficient methods for factorizing positive integers or for determining whether a given number is prime or not. Other algorithms include, for example, methods for finding the class group of an algebraic number field (that is, determining the ideal structure of the field, see Section I), efficient methods for dealing with large matrices, and methods for calculating the invariants associated with elliptic curves (for example the discriminant, see Section III). This work has a major bearing on cryptography, and so gives number theory a direct influence on modern commerce and public affairs. Many coding procedures used to keep the transmission of information secret are effective because it is very difficult to factorize a large positive integer, particularly if this integer has two or more large prime factors.
We shall illustrate the methods used in this theory by describing one modern procedure for factorizing a positive integer. It uses elliptic curves and was invented by H. Lenstra; further details can be found in Cohen (1993) or Rose (1999), among others. Before describing Lenstra’s method, we need to consider an earlier method due to Pollard. When attempting to factorize an integer n, Pollard noted that in many cases, but by no means all, if a prime number p divides n, then p − 1 has many small factors and hopefully all its prime factors are small. If all the prime-power factors of p − 1 divide k (so that p − 1 itself divides k), then $a^{p-1} - 1$ divides $a^k - 1$, Fermat’s Theorem shows that p divides $a^{p-1} - 1$ (for a not divisible by p), and hence p divides $a^k - 1$ for a = 2, 3, . . . . It is usual to take k to be of the form j! or LCM(2, 3, . . . , j) for some relatively small integer j. Therefore, for various choices of a and k we calculate the gcd $(n, a^k - 1)$, and if this gcd is larger than 1 and smaller than n, then we have found a factor of n and we can continue the process until all factors have been located. Note that there are very efficient methods for calculating gcds, even if the integers involved are large.
Pollard’s method fails when p − 1 has a large prime factor. Lenstra introduced a similar method replacing the integers modulo a prime p by an elliptic curve C defined over a finite ring K. The expectation that p − 1 has small factors in Pollard’s method is replaced by the expectation that the order of points on the curve C defined over K has many small factors [if P is a point on C(K), its order is the least positive integer t such that tP is the zero point, see Section III]. The ring K is the system of integers modulo n where n is the number to be factored. This integer is not prime (this is checked before the factorization algorithm is
applied because it is much easier to determine the primality status of n than it is to factorize it). Hence the ring K is not a field, that is, division can fail. But this does not cause a problem because it is exactly at the point of failure that a factor of n appears. The method is as follows. We choose an elliptic curve C, a point P on C, and an integer k with many small factors, and then calculate the coordinates of the point kP modulo n. If this works then we have failed to find a factor of n, and so we repeat the process with a different C, or a different P, or a different k until we obtain a calculation that crashes. This particular calculation will have crashed because it involved a division by a factor m of n; this is equivalent to division by zero modulo n, but this is a success because we now have the factor m of n. Each single calculation is no more efficient than in Pollard’s method. Lenstra’s method works well because (a) there is a great range of starting positions (that is, for curves C, points P and numbers k), and (b) each separate calculation can be performed extremely quickly. For example, this method was used to factorize $n = 2^{137} - 1$ on a standard issue desktop computer using the computer package PARI/GP developed by H. Cohen et al., and in under 6 seconds it gave the answer

$$2^{137} - 1 = 32032215596496435569 \times 5439042183600204290159,$$

where the two integers on the right hand side are prime. Using traditional methods, this calculation would have taken many hours.

SEE ALSO THE FOLLOWING ARTICLES

ALGEBRA, ABSTRACT • APPROXIMATIONS AND EXPANSIONS • GROUP THEORY • NUMBER THEORY, ELEMENTARY

BIBLIOGRAPHY

Andrews, G. E. (1976). “The Theory of Partitions,” Addison-Wesley, Reading, Massachusetts.
Baker, A. (1975). “Transcendental Number Theory,” Cambridge Univ. Press, London.
Borevich, Z. I., and Shafarevich, I. R. (1966). “Number Theory” (translated from Russian by N. Greenleaf). Academic Press, New York.
Cassels, J. W. S. (1978). “Rational Quadratic Forms (LMS monograph 13),” Academic Press, London.
Cohen, H. (1993). “A Course in Computational Algebraic Number Theory,” Springer, New York.
Cornell, G., Silverman, J. H., and Stevens, G. (1997). “Modular Forms and Fermat’s Last Theorem,” Springer, New York.
Edwards, H. M. (1977). “Fermat’s Last Theorem,” Springer, New York.
Hardy, G. H., and Wright, E. M. (1954). “An Introduction to the Theory of Numbers,” 3rd ed. Clarendon Press, Oxford.
LeVeque, W. J. (1955). “Topics in Number Theory” (2 volumes). Addison-Wesley, Reading, Massachusetts.
Mordell, L. J. (1969). “Diophantine Equations,” Academic Press, London.
Odlyzko, A. M. (1991). “The $10^{20}$-th Zero of the Riemann Zeta Function and 70 Million of the Neighbors,” AT&T Bell Labs., Murray Hill, N.J.
Ribenboim, P. (1979). “13 Lectures on Fermat’s Last Theorem,” Springer, New York.
Ribenboim, P. (1989). “The Book of Prime Number Records,” Springer, New York.
Rose, H. E. (1999). “A Course in Number Theory,” Revised 2nd ed. Clarendon Press, Oxford.
Schmidt, W. M. (1980). “Diophantine Approximation,” Lecture notes 785, Springer, Berlin.
Silverman, J. (1986). “The Arithmetic of Elliptic Curves,” Springer, New York.
Singh, S. (1997). “Fermat’s Last Theorem,” Fourth Estate, London.
Stewart, I. N., and Tall, D. O. (1979). “Algebraic Number Theory,” Chapman and Hall, London.
Titchmarsh, E. C. (1987). “Theory of the Riemann Zeta Function,” 2nd ed. Clarendon Press, Oxford.
Vaughan, R. C. (1981). “The Hardy-Littlewood Method,” Cambridge Univ. Press, London.
Wiles, A. (1995). “Modular Elliptic Curves and Fermat’s Last Theorem,” Ann. Math. 141, 443–551.
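
Pollard’s p − 1 method, described in Section VII, can be sketched in a few lines. This is a minimal illustration and not the implementation used by any particular package: the function name, the default base a = 2, and the bound `max_j` are assumptions chosen for the example. The exponent k is taken to be j! by raising the running power to successive integers j, and n is assumed to be an odd composite.

```python
from math import gcd


def pollard_p_minus_1(n, base=2, max_j=10_000):
    """Try to find a nontrivial factor of n with Pollard's p - 1 method.

    Returns a factor f with 1 < f < n, or None if none is found within
    the bound. The parameters `base` and `max_j` are illustrative choices.
    """
    a = base % n
    for j in range(2, max_j + 1):
        a = pow(a, j, n)      # a is now base**(j!) modulo n
        g = gcd(a - 1, n)     # the gcd (n, a^k - 1) from the text, with k = j!
        if 1 < g < n:
            return g          # some prime p dividing n has p - 1 dividing j!
        if g == n:
            return None       # all prime factors were caught at once; try another base
    return None


if __name__ == "__main__":
    print(pollard_p_minus_1(299))   # 299 = 13 * 23, and 13 - 1 = 12 divides 4!
```

The method succeeds quickly here because 13 − 1 = 12 has only small prime factors, whereas 23 − 1 = 22 contains the larger prime 11.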
P1: GLQ Final Pages Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN011I-503 July 14, 2001 21:40

Number Theory, Elementary

Robert L. Page
University of Maine, Augusta

I. The Ancient Roots of Number Theory

II. Classical Problems and Results
III. Modern Directions
IV. Unsolved Problems and Conjectures

GLOSSARY

Absolute value Absolute value of a number n is given by |n| = n if n ≥ 0 and |n| = −n if n < 0. Absolute value of a number is always nonnegative.
Composite number Natural number larger than 1 that is not prime. The first 10 composite numbers are 4, 6, 8, 9, 10, 12, 14, 15, 16, and 18. The number 1 is neither prime nor composite.
Definite integral $\int_a^b f(x)\,dx$ is the integral from a to b of the function f with respect to the variable x.
Divisor If a, b, and c are natural numbers with a = bc, b and c are divisors or factors of a. Since 12 = 1 · 12 = 2 · 6 = 3 · 4, it follows that 1, 2, 3, 4, 6, 12 are all factors or divisors of 12.
Greatest integer function [x] is the largest integer whose value does not exceed x. We have [−1.2] = −2, [0.5] = 0, and [6.8] = 6.
n factorial n! = 1(2)(3) · · · (n − 1)(n). That is, 1! = 1, 2! = 1(2) = 2, 3! = 1(2)(3) = 6, and 10! = 1(2)(3) · · · (9)(10) = 3,628,800.
Natural logarithm Natural logarithm of the number x is written ln x and is the exponent t such that $e^t = x$, where e = 2.71828 . . . .
Natural number Member of the set {1, 2, 3, . . .}.
Prime number Natural number larger than 1 whose only divisors are 1 and the number itself. The first 10 primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29.
Proper divisor Any divisor of a natural number except the number itself. Accordingly, 1, 2, 3, 4, and 6 are proper divisors of 12.

NUMBER THEORY is the study of the properties of integers. The numbers used to count, called the counting or natural numbers, are the set {1, 2, 3, 4, . . .}. Closely related are the integers {. . . , −3, −2, −1, 0, +1, +2, +3, . . .}, which, along with the operations of addition and multiplication, are the foundation upon which arithmetic is constructed. From the integers, we may construct other systems of numbers, such as the rational numbers or the real numbers. However, it is the properties of the integers and the solution of problems that can be stated in terms of integers that form the subject matter of number theory, sometimes called the higher arithmetic.


I. THE ANCIENT ROOTS OF NUMBER THEORY

A. Arithmetica versus Logistica

As ancient cultures slowly developed the concept of number and applied it to the solution of everyday problems, it was inevitable that a few people would become interested in the properties of numbers, in relations among numbers and their patterns, and in numbers as logical entities rather than as practical tools.
The use of numbers in commerce and trade, construction of temples, surveying of land, and other practical pursuits came to be known as logistica. Such activities, which could be carried on by anyone with knowledge of the proper procedures, particularly commoners or slaves, were considered a less noble endeavor than was the study of abstract properties and relations among numbers, called arithmetica. The latter was considered the province of scholars and royalty, too precious to be entrusted to the common people. In ancient times, then, the goals of arithmetic were the same as those of number theory today: the study of the properties of integers.

B. Ancient Mathematical Records

Some records have survived through the centuries to shed light on the early history of mathematics. In Czechoslovakia in 1937, a wolf bone was found on which were carved 55 notches in groups of 5. The bone, which was about 30,000 years old, suggests that before mankind learned to write there was a compulsion to record numbers.
About 3000 B.C., a mace belonging to King Menes of Egypt was inscribed with hieroglyphic symbols representing 400,000 oxen, 422,000 goats, and 120,000 prisoners. On the Columna Rostrata in Rome, erected in honor of the victory over Carthage in 260 B.C., the symbol for 100,000 was engraved 31 times, signifying the number 3,100,000.
However, surviving records indicate that early mathematics consisted of more than just recording large numbers. Clay tablets from Babylonia containing columns of numbers, which were originally thought to be merely records of business accounts, were later discovered to be mathematical texts and tables. The tablets date from about 2000 to 200 B.C. Some contain solutions to construction problems such as the calculation of areas or volumes. Others relate to business or legal matters such as the computation of interest or the division of estates. Some were tables of multiplication or squares or square roots, whereas others gave the circumferences of circles of various diameters. A table of inverses was used to reduce the operation of division to the more easily performed operation of multiplication.
The tablet designated Plimpton 322 in the Plimpton Library at Columbia University contains columns of numbers that give the hypotenuse and one leg of a right triangle. The other leg can be calculated from the Pythagorean theorem. The sizes of some numbers imply that the Babylonians possessed a general method for solving the right triangle problem. The last column of this tablet gives the value of the cosecant of the acute angle opposite the longer leg of the triangle, indicating that the Babylonians used and tabulated some of the ratios of the sides of a right triangle, as we do today in trigonometry.
The Babylonians also considered problems of a theoretical nature such as circumscribing a circle about an isosceles triangle or finding the area of a circle, the latter leading to the erroneous value of 3 for π.
Other fertile sources of knowledge of ancient mathematics are the papyrus records from Egypt. One of the most notable, the Rhind Papyrus, is displayed in the British Museum. Dating from about 1650 B.C., the Rhind Papyrus contains material copied by the scribe Ahmes from earlier records. The material consists of problems involving arithmetic as well as ideas from geometry. Some problems involved no practical applications but seemed to be posed for the reader’s amusement.
Early records such as these show that ancient civilizations were highly interested in mathematics, their achievement in mathematics was considerable, and the knowledge they possessed no doubt formed the basis for works of the Greeks and others who followed.

C. Number Theory versus Numerology

Just as the science of astronomy developed hand in hand with the superstition of astrology, so the early history of number theory is interwoven with numerology, the belief that numbers possessed mystical powers and were a dominant influence in human affairs. One form of numerology was gematria, whereby numbers were assigned to letters of the alphabet, for example, a = 1, b = 2, and so forth. The numerical value of words or names could then be calculated and studied, thus revealing hidden relationships or future events. A minor nobleman might ingratiate himself with the king by showing that the numerical values of their names, that is, the sum of the letter values, were equal.
Gematria could also warn the unwary of evil as in the Biblical quotation “Here is wisdom. Let him that hath understanding count the number of the beast; for it is the number of a man and his number is six hundred three score and six.” To whom the passage referred has never been determined, but during the Reformation it was common to attack an enemy by writing the enemy’s name in such a way that the numbers added to 666.

D. The Pythagoreans

One of the strongest influences in ancient number theory came from the Pythagoreans, a brotherhood founded by the philosopher Pythagoras who lived around 570–500 B.C. The members of this mystical order believed that integers were the key to explaining the universe. To support their belief, they had only to point to the fact, which could be verified by trial and error, that halving the length of a vibrating string created a sound an octave higher than the original tone. Further, they knew that when tones whose frequencies were in the ratios of certain whole numbers were sounded together, the result was a harmonious chord.
On such empirical evidence as this, the Pythagoreans based a theory of the universe ruled by integers. Numbers were imbued with human traits or characteristics. Odd numbers such as 1, 3, 5 were called masculine, whereas even numbers such as 2, 4, 6 were feminine. Square numbers such as 4 = 2(2) represented justice, the two factors being equal. The number 6 represented the soul, 7 represented health and understanding, and 8 was the number of love and friendship.
Although few today believe in numerology, we may speculate on the origins of some associations of numbers with philosophical concepts. The number 6 is the smallest number that is the sum of all of its proper divisors: 6 = 1 + 2 + 3. This special property may have led to the number being identified with the important concept of the soul. Again, the Bible states that the earth was created in 6 days.
A few hints of numerology still remain in the ideas that certain numbers such as 7 are “lucky” and that numbers like 13 are “unlucky.” Another example is the common expression “bad news comes in threes.” Although such notions are now taken lightly, we must remember that ancient people took the principles of numerology seriously and that many purely mathematical questions and problems arose from such occult investigations.

E. Pythagorean Triples

The Pythagorean theorem states that the sum of the squares of the legs of a right triangle equals the square of its hypotenuse, that is, $a^2 + b^2 = c^2$, as shown in Fig. 1. This result was certainly known before the time of Pythagoras, but whether he was the first to actually prove the theorem is unknown because of the Pythagoreans’ custom of ascribing all new knowledge to the Master.

FIGURE 1 The Pythagorean theorem: $a^2 + b^2 = c^2$.

Numbers whose values satisfy the Pythagorean theorem, such as 3, 4, and 5 ($3^2 + 4^2 = 9 + 16 = 25 = 5^2$), are permissible values for the sides of a right triangle. In Egypt, men known as rope stretchers made use of the 3, 4, 5 relationship to establish right angles in surveying property or constructing buildings by stretching a rope with knots marking lengths of 3, 4, and 5 units around three stakes, as in Fig. 2.

FIGURE 2 Rope stretching.

Trial and error leads to the discovery of many such Pythagorean triples, for instance, 5, 12, 13 and 8, 15, 17. It is natural to inquire whether formulas exist that will generate sets of Pythagorean triples. One method, which is attributed to Pythagoras, is as follows: Let n be any positive integer. Then $a = 2n + 1$, $b = 2n^2 + 2n$, and $c = 2n^2 + 2n + 1$ constitute a Pythagorean triple. If n = 3, we obtain 7, 24, 25, for which $7^2 + 24^2 = 49 + 576 = 625 = 25^2$. For n = 4 we have 9, 40, 41, whereby $9^2 + 40^2 = 81 + 1600 = 1681 = 41^2$. All triples generated in this way lead to triangles having a hypotenuse that is one unit longer than the larger leg.
A more general method consists of choosing integers u and v with u > v > 0, one of which is odd and the other even, such that u and v have no common divisors other than the number 1. We then say that u and v are relatively prime. Then, letting $a = 2uv$, $b = u^2 - v^2$, and $c = u^2 + v^2$ gives the desired result. The following examples illustrate this method.
Multiplying each member of a Pythagorean triple by the same integer yields another Pythagorean triple.
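
The two generating formulas and the scaling rule can be tried out directly. The following sketch is only illustrative; the function names (`pythagoras_triple`, `general_triple`, `scale`, `is_pythagorean`) are invented for this example, and the check u > v > 0 is included so that the leg $u^2 - v^2$ is positive.

```python
from math import gcd


def pythagoras_triple(n):
    """Triple (2n+1, 2n^2+2n, 2n^2+2n+1) from the method attributed to Pythagoras."""
    return (2 * n + 1, 2 * n * n + 2 * n, 2 * n * n + 2 * n + 1)


def general_triple(u, v):
    """Triple (2uv, u^2 - v^2, u^2 + v^2) for relatively prime u > v > 0 of opposite parity."""
    if not (u > v > 0 and gcd(u, v) == 1 and (u - v) % 2 == 1):
        raise ValueError("need u > v > 0, relatively prime, one odd and one even")
    return (2 * u * v, u * u - v * v, u * u + v * v)


def scale(triple, k):
    """Multiply each member of a triple by the same integer k."""
    return tuple(k * x for x in triple)


def is_pythagorean(a, b, c):
    """True when a^2 + b^2 = c^2."""
    return a * a + b * b == c * c


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, pythagoras_triple(n))
    for u, v in ((2, 1), (3, 2), (4, 1), (4, 3)):
        print(u, v, general_triple(u, v))
    print(scale((3, 4, 5), 2))
```

With n = 3 and n = 4 the first function returns the triples 7, 24, 25 and 9, 40, 41 worked out above, and the choice u = 4, v = 1 gives 8, 15, 17.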

For example,

2(3, 4, 5) = 6, 8, 10,

and

$6^2 + 8^2 = 36 + 64 = 100 = 10^2$.

F. Prime Numbers

A subject of early investigation in number theory and one of continuing interest today is that of prime numbers. Primes are the building blocks from which all natural numbers are formed. The trial-and-error process for finding primes is a tedious and time-consuming one. It is not surprising that the search for methods or formulas for generating primes should have occupied a fundamental place in early number theory.
Eratosthenes (276–194 B.C.) devised a procedure for finding all primes less than or equal to some number N. The method, known as the Sieve of Eratosthenes, requires that the numbers from 2 to N be listed. All multiples of 2, except for 2 itself, are then crossed out, as are all multiples of 3, except for 3 itself, and so on.

 2  3  4  5  6  7  8  9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37

After crossing out multiples of all primes whose values do not exceed $\sqrt{N}$, we have left the primes whose values do not exceed N. Clearly, any composite number N is divisible by a prime whose value is less than or equal to $\sqrt{N}$; otherwise, the product of its prime divisors would be greater than N. The sieve method can be carried out most effectively using computers. Various kinds of sieves have been applied to problems in number theory.
Euclid, who lived from about 330 to 275 B.C., is best known for the geometry contained in his great work, “The Elements.” A considerable portion of this work is devoted to number theory. For example, Euclid proves that if a prime p divides the product ab, then either p divides a or p divides b, or both.
In the ninth volume of “The Elements,” Euclid proved that the number of primes is infinite. To see this, suppose that only a finite number of primes exist, say $p_1, p_2, \ldots, p_n$. Then the integer $R = p_1 p_2 p_3 \cdots p_n + 1$ is either prime or composite. If R is prime, we have found a prime larger than $p_n$, contradicting our original assumption. If R is composite, it must be divisible by a prime p. But R is not divisible by any of the primes $p_1, p_2, p_3, \ldots, p_n$, since it leaves a remainder of 1 upon division by any of them. So p must be different from $p_1, p_2, p_3, \ldots, p_n$, again contradicting our assumption. Thus, the number of primes cannot be finite.

G. The Distribution of Primes

The knowledge that the number of primes is infinite leads to speculation about the distribution of primes throughout the set of natural numbers. In general, the density of primes is quite regular, decreasing slowly as we consider larger integers. The following table illustrates this pattern.

Range                     Number of primes
2–1,000                   168
1,001–2,000               135
2,001–3,000               127
3,001–4,000               120
4,001–5,000               119
. . .                     . . .
9,995,001–9,996,000       62
9,996,001–9,997,000       58
9,997,001–9,998,000       67
9,998,001–9,999,000       64
9,999,001–10,000,000      53

Despite this overall regularity in the distribution of primes, investigation of particular areas of the set of natural numbers reveals great diversity. Indeed, we can find arbitrarily long blocks of consecutive composite numbers. If $q = 2 \cdot 3 \cdot 5 \cdots p$ is the product of all of the primes from 2 to p, then the natural numbers q + 2, q + 3, q + 4, . . . , q + p are all composite. But since there is no limit to the size of p, the result is established.
As an example, if we use the primes through 19, we find that 9,699,692 and the next 18 integers are all composite. Even more remarkable sequences of composite numbers are known. Thus, 370,261 is prime but the next 111 integers are composite.

H. The Fundamental Theorem of Arithmetic

Every composite number has a divisor other than 1 and the number itself. Such divisors are called proper divisors. A square number may have only one proper divisor. Thus, 49 = 7(7). On the other hand, a composite number may have many proper divisors, as in 36 = 2(18) = 3(12) = 4(9) = 6(6). Note that some of the proper divisors of 36, such as 2 and 3, are prime, whereas others, such as 4, 6, 9, 12, and 18, are themselves composite. If we continue to break the composite factors into proper divisors we obtain

36 = 2 · 18 = 2 · 2 · 9 = 2 · 2 · 3 · 3
   = 3 · 12 = 3 · 3 · 4 = 3 · 3 · 2 · 2
   = 4 · 9 = 2 · 2 · 3 · 3
   = 6 · 6 = 2 · 3 · 2 · 3
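
The Sieve of Eratosthenes from Section F and the repeated splitting of 36 shown above can both be sketched briefly. This is an illustrative sketch rather than an optimized implementation; the function names `sieve` and `prime_factors` are chosen for the example, and the factorization uses plain trial division.

```python
def sieve(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:                 # only primes up to sqrt(limit) are needed
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False  # cross out multiples of p, except p itself
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]


def prime_factors(n):
    """Return the prime factors of n >= 2 in increasing order, with repetition."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)                 # whatever remains is itself prime
    return factors


if __name__ == "__main__":
    print(sieve(37))
    primes = sieve(5000)
    for low in range(0, 5000, 1000):
        count = sum(1 for p in primes if low < p <= low + 1000)
        print(f"{low + 1}-{low + 1000}: {count}")
    print(prime_factors(36))              # [2, 2, 3, 3]
```

The counts printed for the first five ranges can be compared with the first rows of the table in Section G.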

Ultimately, the factorization yields $2^2 \cdot 3^2$. This illustrates the fact that if we agree to write the prime factors as powers of primes in increasing order, there is only one way to factor a given integer.

1. Fundamental Theorem of Arithmetic

Every natural number greater than 1 can be written uniquely as the product of primes.
Since 2 is the only even prime, for p > 2, consecutive primes must differ by at least two. Pairs of primes that differ by exactly two, such as (3, 5), (5, 7), (11, 13), (17, 19), (29, 31), etc., are called twin primes. It has been conjectured, but never proven, that there exist an infinite number of pairs of twin primes.
Except for the triplet (3, 5, 7), not all of the numbers p, p + 2, and p + 4 can be prime since one of them must be divisible by 3. To see this, note that if p leaves the remainder 1 when divided by 3, p + 2 is divisible by 3; whereas if p leaves the remainder 2, p + 4 is divisible by 3. However, it is possible for p, p + 2, and p + 6 all to be prime, as in 5, 7, 11 or 17, 19, 23. Similarly, p, p + 4 and p + 6 may all be prime as shown by 13, 17, 19 or 37, 41, 43. It is conjectured, but as yet unproven, that there are infinitely many prime triples of each form.
Certain arithmetic sequences have been shown to contain an infinite number of primes. Thus, it is known that there are infinitely many primes of the form 4n − 1. Suppose there are only a finite number of them, say, $p_1 = 3, p_2, p_3, \ldots, p_k$. Forming $R = 4 p_1 p_2 p_3 \cdots p_k - 1$, we see that R is not divisible by any of the primes $p_1, p_2, p_3, \ldots, p_k$. Now any integer must have one of the forms 4n, 4n + 1, 4n + 2, or 4n − 1, and an odd prime must have either the form 4n − 1 or 4n + 1. The product of two numbers having the latter form also has the same form:

$$(4n_1 + 1)(4n_2 + 1) = 16 n_1 n_2 + 4(n_1 + n_2) + 1 = 4n_3 + 1.$$

Hence if R is composite, at least one of its factors must be a prime of the form 4n − 1. But this factor cannot be any of $p_1, p_2, p_3, \ldots, p_k$. So if R is composite, it has a factor of the form 4n − 1, which is different from the list that was assumed to include all primes of that form.
On the other hand, if R is prime, it has the form 4n − 1 and is different from $p_1, p_2, p_3, \ldots, p_k$. Either way, the assumption that the list includes all primes of the form 4n − 1 is contradicted and so the number of primes having that form cannot be finite.
In a similar fashion, we may show that there are infinitely many primes of the form 4n + 1, 6n + 5, and 8n + 5. All of these cases are included in Dirichlet’s theorem, proved by Peter G. L. Dirichlet (1805–1859).

2. Dirichlet’s Theorem

If a is positive and a and b are relatively prime, then there are infinitely many primes of the form an + b. The proof of this general result is much more difficult than the proof of specific cases.
Dirichlet’s theorem stands out like an oasis in a desert wasteland. Attempts to find other functions that will generate an infinite number of primes have resulted in failure. It is not known whether such a simple sequence as $n^2 + 1$ contains an infinite number of primes.

I. Greatest Common Divisor

The basic operations of arithmetic are addition and multiplication, from which we derive the inverse operations of subtraction and division, respectively. If integers a and b are both divisible by an integer c, we say that c is a common divisor of a and b. Thus, 36 and 48 have the common divisor 2. It is easily seen that 3, 4, 6, and 12 are also common divisors of the two given numbers. The largest of these, 12, is called the greatest common divisor, or gcd, of 36 and 48, and we write (36, 48) = 12. Note that 2, 3, 4, and 6 all divide 12. This illustrates the theorem: If c is a common divisor of a and b, then c divides (a, b).
To show the plausibility of this theorem and to illustrate Euclid’s algorithm for finding the gcd of two numbers, consider the numbers 42 and 135. We divide the larger of these two numbers by the smaller, expressing the result in the form

dividend = quotient(divisor) + remainder.

We have

135 = 3(42) + 9.

Each subsequent equation involves dividing the divisor of the previous equation by the remainder from that equation. The process stops when a zero remainder is obtained, as it must be, since the remainders form a decreasing sequence of nonnegative integers:

42 = 4(9) + 6
9 = 1(6) + 3
6 = 2(3) + 0.

From the last equation we see that 3 divides 6. From the next-to-the-last equation we see that 3 divides 9 since it divides the right-hand side of the equation. Similarly, 3 divides 42 and also 135. Hence, 3 is a common divisor of 42 and 135.
By rewriting the equations as

135 − 3(42) = 9
42 − 4(9) = 6
9 − 1(6) = 3,
P1: GLQ Final Pages
Encyclopedia of Physical Science and Technology EN011I-503 July 14, 2001 21:40

20 Number Theory, Elementary

we can see that any common divisor of 135 and 42 divides 9. Also, any common divisor of 135 and 42 divides 6. Finally, we conclude that any common divisor of 135 and 42 divides 3. Hence, 3 is the gcd of 135 and 42. In general, the gcd of two numbers is always the last nonzero remainder that occurs in the algorithm.

Several results follow from Euclid’s algorithm. If the product ab is divisible by c and if a and c are relatively prime, then c divides b. From this it may easily be seen that if a number is relatively prime to each of two or more numbers, it is relatively prime to their product. Also, if (a, b) = c, then (ka, kb) = kc. The latter may be verified by Euclid’s algorithm. As an example, the gcd of 405 = 3(135) and 126 = 3(42) is 3(3) = 9.

Euclid’s algorithm may be used to prove the following: If c is the gcd of a and b, there exist integers x and y such that ax + by = c. For the preceding example, 5(135) − 16(42) = 3. The numbers x and y can be obtained by solving for c, starting from the next-to-the-last equation and working backward to the first equation.

J. Least Common Multiple

If c is divisible by both a and b, we say that c is a common multiple of a and b. The multiples of 4 are 4, 8, 12, 16, 20, 24, etc., whereas those of 6 are 6, 12, 18, 24, etc. We find that 12 and 24 are both common multiples of 4 and 6.

The smallest of the infinite set of common multiples of a and b is called the least common multiple, or lcm, written [a, b]. Therefore, we have [4, 6] = 12.

It is easily seen that the lcm of two distinct primes is their product. However, our previous example shows that the lcm of two numbers having a common divisor greater than 1 is smaller than their product. In fact, the product of two integers equals the product of their lcm and their gcd. That is, ab = (a, b)[a, b].

Finding the gcd and the lcm for two numbers can be facilitated by writing each one as the product of primes. For example, $84 = 2^2(3)(7)$ and $90 = 2(3^2)(5)$. We find the gcd by taking each factor appearing in both numbers to the smallest power to which it appears in either number. So (84, 90) = 2(3) = 6.

We find the lcm by taking each factor appearing in either number to the largest power to which it appears in either number. So $[84, 90] = 2^2(3^2)(5)(7) = 1260$. Finally, we note that 6(1260) = 7560 = 84(90). The definitions of gcd and lcm can be extended to three or more numbers in a straightforward way.

K. Divisibility Rules

Before the advent of computers, finding divisors of large numbers was not an easy task. Therefore, certain divisibility rules were developed. Obviously, a number is divisible by 2 only when it is even, that is, only when it ends in 0, 2, 4, 6, or 8.

Because 100 and all higher powers of 10 are divisible by 4, a number is divisible by 4 only when it ends in 00 or when the number formed by its last two digits is divisible by 4. Only numbers ending in 0 or 5 are divisible by 5.

One hundred and all higher powers of 10 are divisible by 25. A number is divisible by 25 only when it ends in 00 or when the number formed by its last two digits is divisible by 25. Hence, a number is divisible by 25 only when it ends in 00, 25, 50, or 75.

The number 10 and all higher powers of 10 leave the remainder 1 when divided by 3. This means that the remainder from dividing a number by 3 equals the remainder from dividing the sum of the digits of that number by 3. We have 371 = 3(123) + 2, whereas 3 + 7 + 1 = 11 and 11 = 3(3) + 2. Therefore, a number is divisible by 3 if and only if the sum of its digits is divisible by 3.

A similar statement holds for the divisor 9. A number is divisible by 9 if and only if the sum of its digits is divisible by 9. Thus, 12,681 is divisible by 9 since 1 + 2 + 6 + 8 + 1 = 18 and 18 is divisible by 9. A process called casting out nines, based on the divisibility property of the number 9, can be used to check the results of addition or multiplication. For addition, the digits of each addend are added:

     17    1 + 7 = 8                     8
     31    3 + 1 = 4                     4
     84    8 + 4 = 12    1 + 2 = 3       3
    133                                 15
    1 + 3 + 3 = 7*               1 + 5 = 6*

If the result is a two-digit number, these digits are added and so on until each addend has been reduced to a single-digit number. These resulting numbers for all of the addends are then added, the digits again being added if necessary to reduce the final result to a single-digit number.

The same process is applied to the sum, yielding a single-digit number. If these two single-digit numbers, the one from the addends and the one from the sum, are not equal, the addition is incorrect. On the other hand, if the two single-digit numbers are the same, there is no guarantee that the addition is correct. Two different errors may make these numbers the same but make the addition wrong. In particular, the error of transposing two digits will not be detected by this method. Hence, casting out nines can show that the addition process is wrong but cannot show that it is absolutely correct. The results of our example show that the original sum is incorrect.
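The addition check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `digit_root` and `addition_passes_nines_check` are invented here and do not come from the article.

```python
def digit_root(n: int) -> int:
    """Add the decimal digits of a nonnegative integer repeatedly
    until a single digit remains."""
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return n


def addition_passes_nines_check(addends, claimed_sum) -> bool:
    """Return False when casting out nines proves the claimed sum wrong.

    A True result only means that no error was detected; it does not
    prove that the addition is correct.
    """
    from_addends = digit_root(sum(digit_root(a) for a in addends))
    from_sum = digit_root(claimed_sum)
    return from_addends == from_sum


# The worked example: 17 + 31 + 84 with the claimed sum 133.
print(addition_passes_nines_check([17, 31, 84], 133))
# The true sum, 132, passes the check.
print(addition_passes_nines_check([17, 31, 84], 132))
# A transposition of the true sum (123 for 132) also passes,
# which is the undetected error mentioned in the text.
print(addition_passes_nines_check([17, 31, 84], 123))
```

The third call illustrates why the method can only reject a result: transposing digits leaves the digit sum unchanged.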

For the product of two numbers the process is similar, except that the single-digit numbers of the factors are multiplied and the resulting product, reduced to a single-digit number if necessary, is compared with the single-digit number of the product. Let us check the product

    34(28) = 942
    3 + 4 = 7     2 + 8 = 10     9 + 4 + 2 = 15
    7(10) = 70                   1 + 5 = 6*
    7 + 0 = 7*

Since the two single-digit numbers are not equal, the multiplication is incorrect.

L. Figurate Numbers

It was inevitable that the Greeks would take an interest in the relationship between the two major areas of mathematics, arithmetic and geometry. By associating n points with the number n and arranging the points in the shape of geometric figures, they were able to investigate relationships among the figurate or polygonal numbers. Consideration of the square numbers, Fig. 3, led to the discovery that

    $n^2 + 2n + 1 = (n + 1)^2,$

as well as

    $1 + 3 + 5 + \cdots + (2n - 1) = n^2,$

as shown in Fig. 4. Formulas such as these may be verified by the process of mathematical induction.

FIGURE 3 The square numbers.

FIGURE 4 The sum of the first n odd positive integers is $n^2$.

The triangular numbers are shown in Fig. 5. Consideration of the pattern in Fig. 6 leads to the relation

    $1 + 2 + 3 + \cdots + n = n(n + 1)/2.$

FIGURE 5 The triangular numbers.

FIGURE 6 The sum of the first n positive integers is n(n + 1)/2.

Note that the triangular numbers arise from the pattern 1 + 2 + 3 + · · · , whereas the square numbers arise from 1 + 3 + 5 + 7 + · · · . The next logical pattern, 1 + 4 + 7 + 10 + · · · , gives the pentagonal numbers shown in Fig. 7.

Less obvious is the formula that gives the general expression for pentagonal numbers as

    $1 + 4 + 7 + 10 + \cdots + (3n - 2) = (3n^2 - n)/2.$

The pattern 1 + 5 + 9 + 13 + · · · yields the hexagonal numbers shown in Fig. 8. The general formula for the hexagonal numbers is

    $1 + 5 + 9 + 13 + \cdots + (4n - 3) = n(2n - 1).$

Even more remarkable is the formula generating the nth number of the series of polygons having m sides:

    $\frac{m - 2}{2} n^2 + \frac{4 - m}{2} n.$

M. Perfect Numbers

A perfect number is a natural number that is the sum of its proper divisors, such as 6 = 1 + 2 + 3 or 28 = 1 + 2 + 4 + 7 + 14. A general result concerning perfect numbers appears in Euclid’s “The Elements.”

Theorem. Let p be a prime. The number $N = 2^{p-1}(2^p - 1)$ is a perfect number when the factor $2^p - 1$ is a prime.

The first four perfect numbers generated by this formula are

    $(2^{2-1})(2^2 - 1) = 2^1(3) = 6$
    $(2^{3-1})(2^3 - 1) = 2^2(7) = 28$
    $(2^{5-1})(2^5 - 1) = 2^4(31) = 16(31) = 496$
    $(2^{7-1})(2^7 - 1) = 2^6(128 - 1) = 64(127) = 8128$
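The four values listed above can be reproduced, and checked against the definition of a perfect number, with a short Python sketch. The helper names (`is_prime`, `proper_divisor_sum`, `euclid_perfect_numbers`) are illustrative choices made here, and plain trial division is used only because the numbers involved are small.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def proper_divisor_sum(n: int) -> int:
    """Sum of the divisors of n that are smaller than n."""
    return sum(d for d in range(1, n) if n % d == 0)


def euclid_perfect_numbers(max_p: int) -> list:
    """Numbers 2^(p-1) * (2^p - 1) for primes p <= max_p
    such that 2^p - 1 is also prime."""
    found = []
    for p in range(2, max_p + 1):
        factor = 2 ** p - 1
        if is_prime(p) and is_prime(factor):
            found.append(2 ** (p - 1) * factor)
    return found


numbers = euclid_perfect_numbers(7)
print(numbers)
# Each value equals the sum of its proper divisors.
print(all(proper_divisor_sum(n) == n for n in numbers))
```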

However, for p = 11, we have $2^{11-1}(2^{11} - 1) = 2^{10}(2047)$, which is not a perfect number since 2047 = 23(89) is composite.

FIGURE 7 The pentagonal numbers.

FIGURE 8 The hexagonal numbers.

Nearly 2000 years after the appearance of this theorem, Leonhard Euler (1707–1783) proved that every even perfect number has the form given by Euclid. As a consequence, every even perfect number ends in 6 or 8. There is known to exist a perfect number that requires 54 digits to write in the base 10 system.

The existence of odd perfect numbers is as yet an unresolved question. None has ever been found and it has been shown that none can exist that is smaller than $10^{200}$. This is, of course, not the same as showing that no odd perfect numbers exist.

If the sum of the proper divisors of a number is a multiple of the number itself, we call that number a multiply perfect number. For example, the proper divisors of 120 have a sum of 240. We call 120 a doubly perfect number or a perfect number of class 2. Other multiply perfect numbers of class 2 are 672 and 523,776. More than 300 multiply perfect numbers have been found, including some of class 7.

Numbers for which the sum of the proper divisors is less than the number itself are called deficient numbers, the first few being 2, 3, 4, 5, 7, 8, 9, 10, and 11. Numbers for which the sum of the proper divisors is greater than the number itself are called abundant numbers, of which the first few are 12, 18, 20, 24, 30, and 36. The smallest odd abundant number is 945.

Since a prime has only the number 1 as a proper divisor, it is obvious that all primes are deficient numbers, as are powers of primes. Furthermore, all proper divisors of a perfect number and all divisors of a deficient number are deficient, whereas all proper multiples of a perfect number and all multiples of an abundant number are abundant.

As one example of the role of abundant and deficient numbers in numerology, in the eighth century, Alcuin noted that the original Creation was accomplished in 6 days, 6 being a perfect number. However, the human race was supposed to be descended from 8 persons in the Ark, 8 being a deficient number. This implied that God’s act was perfect and the latter occurrence imperfect.

Two numbers a and b are called amicable numbers if the sum of the proper divisors of a equals b and the sum of the proper divisors of b equals a. For example, the sum of the proper divisors of 220 is 284 and the sum of the proper divisors of 284 is 220. Thus, 220 and 284 form an amicable pair.

Numerologists believed that a couple could strengthen their bond of love or friendship by inscribing each of a pair of keepsakes with one of two amicable numbers, with each person retaining one of the keepsakes. If the names of the two friends could be written in such a way that the number of one name was 220 and the number of the other name was 284, the relationship would be even stronger. Other pairs of amicable numbers are 1184 and 1210 as well as 17,296 and 18,416. Nearly 400 amicable pairs have been discovered, including a pair for which each number requires 50 digits to write in the base 10 system.

N. Diophantine Equations

Diophantus lived about the third or fourth century A.D. His work, “Arithmetic,” deals with the solution of certain algebraic problems in which a symbol is used to represent the unknown quantity, just as we do today. Although we do not know his dates, a problem from the “Palatine Anthology,” which purports to describe the tomb of Diophantus, suggests a way of finding his age at death. The description says that God granted him the sixth part of his life for his youth, that after a twelfth part his cheeks were bearded, after a seventh part he married, a son was born in the fifth year of his marriage, the child was “half of his father,” and that Diophantus grieved the remaining 4 years of his life. Interpreting “half of his father” to mean that the child’s age at death was half of the age at which the father died, we have

    X/6 + X/12 + X/7 + 5 + X/2 + 4 = X,

where X is Diophantus’ age upon his death. Solving the equation gives X = 84.

In another problem, the reader is asked to find two numbers whose sum is 20 and the sum of whose squares is 208. Instead of letting one number be x and the other 20 − x, as a modern student of algebra would do, Diophantus called the required numbers 10 + x and 10 − x. Then the conditions of the problem require that $(10 + x)^2 + (10 - x)^2 = 208$, whence x = 2 and the required numbers are 8 and 12. Diophantus’ selection of 10 + x and 10 − x to represent the two numbers leads to a simpler equation to solve than does the modern approach.
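Both computations above come down to a few lines of algebra. The intermediate steps below are supplied here for illustration and do not appear in the original; the first clears denominators by multiplying through by 84, the least common multiple of 6, 12, 7, and 2.

```latex
\begin{align*}
\frac{X}{6} + \frac{X}{12} + \frac{X}{7} + 5 + \frac{X}{2} + 4 &= X \\
14X + 7X + 12X + 42X + 756 &= 84X \\
9X &= 756 \\
X &= 84
\end{align*}

\begin{align*}
(10 + x)^2 + (10 - x)^2 &= 208 \\
200 + 2x^2 &= 208 \\
x^2 &= 4, \qquad x = 2
\end{align*}
```

For comparison, the substitution x and 20 − x gives $x^2 + (20 - x)^2 = 208$, that is, $x^2 - 20x + 96 = 0$, which still has to be factored as (x − 8)(x − 12) = 0; the symmetric choice removes the linear term altogether.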

Not all of the problems given by Diophantus yield integral solutions. One requires that 13, which equals the sum of the squares 4 and 9, be written as the sum of two other squares. The answer is $\frac{1}{25} + \frac{324}{25}$ or $(\frac{1}{5})^2 + (\frac{18}{5})^2$.

Neither did all of the problems have unique solutions. For example, a tax collector must collect 2 gold pieces for each man, 1 gold piece for each woman, and 1/2 gold piece for each child in a certain town. If he collects 100 gold pieces from 100 residents, how many men, women, and children live in the town? The conditions lead to the equations:

    2x + y + (1/2)z = 100,

and

    x + y + z = 100,

where x, y, and z represent the number of men, women, and children, respectively. Subtracting the second equation from the first, we find

    x − (1/2)z = 0 or z = 2x.

Thus, we know that there are twice as many children as men. (This is as far as mathematics will take us because we have two equations containing three unknowns.) We can choose to let x be any nonnegative integer less than or equal to 33 and determine y and z. Some of the solutions are

     x     z = 2x     y = 100 − x − z
    10       20             70
    20       40             40
    25       50             25
    33       66              1

Such a problem with multiple solutions is called an indeterminate problem. Problems whose solutions are required to be rational numbers or integers came to be known as Diophantine equations. Although Diophantus frequently gave rational as well as integral solutions to his problems, in present day terminology, the word Diophantine usually refers to problems requiring integral solutions.

O. Magic Squares

If a square is divided into $n^2$ smaller squares and one of the integers from 1 to $n^2$ is written in each of the smaller squares (using each number only once) in such a way that the sum of the integers in any row, column, or main diagonal is always the same, the result is called a magic square of order n. A few examples of magic squares are shown in Fig. 9.

FIGURE 9 Magic squares of order 3, 4, and 5.

Interest in magic squares dates back to about 2200 B.C. in China where the diagram of Fig. 10, called the Lo-Shu, appeared.

FIGURE 10 The Lo-Shu.

The sum of the integers in any row, column, or main diagonal of a magic square of order n, constructed from the integers 1, 2, 3, . . . , $n^2$, is $n(n^2 + 1)/2$. Such a magic square is said to be primitive. Many different rules exist for constructing magic squares, with some rules depending on whether n is odd or even.

It is easily seen that no magic squares exist of order 2. Also, any set of $n^2$ consecutive integers may be substituted for the integers in a magic square of order n as long as the original pattern for integers of increasing size is followed. This is, of course, equivalent to adding a constant to each number of a given magic square. Furthermore, the requirement that the integers be consecutive may be weakened to require sequences of integers that differ by a fixed amount.

For example, if we add 11 to each number in the magic square of order 3 given above, we have the result shown in Fig. 11, where the sum is 15 + 3(11) = 48. If we replace the original integers by the sequence 20, 25, 30, . . . , we obtain the magic square of Fig. 12, for which the sum is 120. Certain rows or columns of a magic square can be interchanged without destroying the equal sum property.

While many purely mathematical properties have been discovered concerning magic squares, they were originally of interest because of their supposed power to ensure good fortune or to ward off evil influences. Magic squares engraved on precious metal were worn as amulets.
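Returning to the arithmetic properties stated above, the following Python sketch checks the magic constant $n(n^2 + 1)/2$ and the two substitutions (adding 11, and replacing 1, 2, 3, . . . by 20, 25, 30, . . .). The 3 × 3 arrangement used is the commonly cited Lo-Shu layout and is an assumption here, since the figure itself is not reproduced in this text; the function names are likewise illustrative.

```python
def magic_constant(n: int) -> int:
    """Common sum for a magic square built from 1, 2, ..., n^2."""
    return n * (n * n + 1) // 2


def line_sums(square):
    """Sums of every row, every column, and both main diagonals."""
    n = len(square)
    rows = [sum(row) for row in square]
    cols = [sum(square[r][c] for r in range(n)) for c in range(n)]
    diags = [
        sum(square[i][i] for i in range(n)),
        sum(square[i][n - 1 - i] for i in range(n)),
    ]
    return rows + cols + diags


def is_magic(square) -> bool:
    """True when all rows, columns, and main diagonals share one sum."""
    return len(set(line_sums(square))) == 1


# Assumed Lo-Shu arrangement (not taken from the article's figure).
LO_SHU = [[4, 9, 2],
          [3, 5, 7],
          [8, 1, 6]]

# Adding a constant to every entry.
shifted = [[v + 11 for v in row] for row in LO_SHU]
# Replacing 1, 2, 3, ... by the sequence 20, 25, 30, ...
stepped = [[20 + 5 * (v - 1) for v in row] for row in LO_SHU]

print(is_magic(LO_SHU), line_sums(LO_SHU)[0], magic_constant(3))
print(is_magic(shifted), line_sums(shifted)[0])
print(is_magic(stepped), line_sums(stepped)[0])
```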

A magic square, when engraved on a plate of silver, was believed to protect the owner from the plague. The magic square of order 4, shown in Fig. 9, appears in Dürer’s engraving Melancholia.

FIGURE 11 Magic square formed by the addition of a constant.

FIGURE 12 Magic square formed by substitution of numbers in a sequence.

What must rank near the top of any list of incredible coincidences was discovered by T. E. Lobeck. Starting with the magic square of order 5, shown in Fig. 13, he substituted the nth digit in the decimal expansion of π = 3.14159 . . . for the integer n in the magic square. The result appears in Fig. 14. The sum of each column also occurs as the sum of a row. Perhaps more amazing is the fact that the sum of the two main diagonals is 38 + 27 = 65, which is the sum of the numbers in any row, column, or main diagonal of the original magic square.

FIGURE 13 A magic square of order 5.

FIGURE 14 The result of substituting the digits of π into the magic square of Fig. 13.

II. CLASSICAL PROBLEMS AND RESULTS

Certain themes in number theory that had their origins in antiquity reappeared during the Middle Ages or the Renaissance. Indeed, some of them have continued to be the objects of research to this day. In this section, we examine some of the major results that have been achieved in number theory since the days of Greek mathematics, regardless of whether those results concern problems of ancient derivation or more recent origin.

A. Indeterminate Problems

We start with a simple problem. A barnyard contains chickens and goats. Altogether there are 72 legs in the barnyard. How many chickens and how many goats are there? The conditions result in the equation 4x + 2y = 72, where x is the number of goats and y is the number of chickens. Obviously, there are a multitude of solutions for this equation, some of which we may discover by solving for one variable in terms of the other, y = (72 − 4x)/2:

     x       y
    10      16
    15       6
    −5      46
    √2   −2√2 + 36
    etc.

The realities of the world preclude certain solutions such as those involving negative or irrational numbers. Even with these restrictions, there are 19 integral solutions corresponding to x = 0, 1, 2, . . . , 18.

Of course, the number of variables is not restricted to 2. Consider the following problem: Thirty people enter a movie theater, paying a total of $50. If men pay $3 each, women $2 each, and children $1 each, how many men, how many women, and how many children make up the party?

If we let x equal the number of men, y the number of women, and z the number of children, we have

    x + y + z = 30

and

    3x + 2y + z = 50.

Subtracting the first equation from the second yields

    2x + y = 20 or y = 20 − 2x.

Letting x take on the values 0, 1, 2, . . . , 10 gives these values for y: 20, 18, 16, . . . , 0. Then either equation gives these values for z: 10, 11, 12, . . . , 20. There are eleven solutions consisting of the triples (0, 20, 10), (1, 18, 11), (2, 16, 12), . . . , (10, 0, 20).
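The eleven triples above can also be found by brute force, which is a useful cross-check on the algebra. The sketch below is a minimal illustration; the function name `theater_parties` and its parameters are invented here, and the search simply tries every split of the thirty people.

```python
def theater_parties(people: int = 30, total: int = 50, prices=(3, 2, 1)):
    """List every (men, women, children) triple of nonnegative integers
    with the given head count and the given total payment in dollars."""
    price_man, price_woman, price_child = prices
    solutions = []
    for men in range(people + 1):
        for women in range(people - men + 1):
            children = people - men - women
            paid = price_man * men + price_woman * women + price_child * children
            if paid == total:
                solutions.append((men, women, children))
    return solutions


parties = theater_parties()
print(len(parties))
for triple in parties:
    print(triple)
```

Every triple printed has the form (x, 20 − 2x, 10 + x), in agreement with the derivation.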

Sometimes, additional conditions restrict the number of solutions. In this problem, one might require that the number of women be twice the number of men, yielding the unique solution (5, 10, 15). At other times there may be an infinite number of solutions or even no solution. One renowned indeterminate problem, called the cattle problem of Archimedes, results in seven equations in eight unknowns whose solution yields extremely large numbers.

Many problems of this type, called linear indeterminate problems, originated in India. The Hindu mathematician Brahmagupta (born in 598 A.D.) authored a treatise on astronomy that included chapters devoted to mathematics. This work, “Brahma-Sphuta-Siddhanta” (or “Brahma’s Correct System”) includes solutions of numerous linear indeterminate equations. Also found there is the second-degree indeterminate equation $nx^2 + 1 = y^2$, for which Brahmagupta gives the solution

    $x = 2t/(t^2 - n)$
    $y = (t^2 + n)/(t^2 - n),$

where t is any integer. Thus, if n = 3, we have $3x^2 + 1 = y^2$ and $x = 2t/(t^2 - 3)$ and $y = (t^2 + 3)/(t^2 - 3)$, which leads to

     t       x         y
     1      −1        −2
     2       4         7
     3       1         2
    10     20/97    103/97
    etc.

He also states that the equation $nx^2 - 1 = y^2$ has no integral solutions for x and y unless n is the sum of the squares of two integers. For example, $4x^2 - 1 = y^2$ has no integral solutions, whereas $13x^2 - 1 = y^2$ has integral solutions because $13 = 2^2 + 3^2$. One solution is x = 5, y = 18.

The search for general solutions was aided by the introduction of symbols for certain quantities and operations that occur frequently within a given problem. The use of this technique is credited to Diophantus. For a long period of time, mathematicians did not differentiate between problems leading to determinate or indeterminate solutions. Today, algebra students are exposed almost exclusively to determinate problems.

The following results are useful in solving certain indeterminate problems.

Theorem. If the integers a and b are relatively prime then there exist integers x and y such that ax + by = 1.

This follows directly from Euclid’s algorithm and leads to

Theorem. There exist integers x and y satisfying the equation ax + by = c if and only if the gcd of a and b divides c.

For example, in the equation 14x + 35y = 56, we find that (14, 35) = 7 and 7 divides 56. Dividing, we obtain the equivalent equation 2x + 5y = 8. We now let 2x + 5y = 1 and find the solution x = −2, y = 1. Then x = 8(−2) = −16 and y = 8(1) = 8 are solutions of 2x + 5y = 8, and therefore of 14x + 35y = 56. From a particular solution $x_0$, $y_0$ of ax + by = c, we can find the general solution $x = x_0 + t(b/d)$ and $y = y_0 - t(a/d)$, where d is the gcd of a and b and t is any integer. In our case d = 7, so x = −16 + 5t and y = 8 − 2t.

Diophantine equations involving variables to the second or higher powers can prove difficult or impossible to solve. Although many special results exist, a general method of solving any Diophantine equation or proving the nonexistence of solutions is unknown. We mention a few particular results. It is known that the equations $x^3 + y^3 = z^3$ and $x^4 + y^4 = z^4$ have no positive integral solutions.

The equation $x^4 + y^4 = u^4 + v^4$ has general solutions that we shall not list. The smallest integral solution is $133^4 + 134^4 = 158^4 + 59^4$.

It has been proven that the equation $ax^n + by^n = c$ has only finitely many solutions if n ≥ 3. More generally, Axel Thue (1863–1922) showed that if a function of x and y,

    $f(x, y) = a_n x^n + a_{n-1} x^{n-1} y + \cdots + a_1 x y^{n-1} + a_0 y^n,$

with the $a_i$ integers, cannot be factored into two polynomials having integral coefficients, then the equation f(x, y) = c has only a finite number of solutions for n ≥ 3.

The equation $x^2 - cy^2 = 1$ is known as Pell’s equation, after John Pell (1610–1685), although he had no part in its solution. For any value of c there is the trivial solution x = ±1, y = 0. If c is a square number, the left-hand side is easily factored. If c is a positive nonsquare integer, then it can be proved that the equation always has a nontrivial solution. Such solutions may not necessarily be obtained readily by trial and error. The equation $x^2 - 61y^2 = 1$ has x = 1,766,319,049 and y = 226,153,980 as its smallest positive nontrivial solution.

The foregoing discussion is meant to illustrate the wide range of problems leading to Diophantine equations, the great variety of approaches to solving such problems, and the enormous difficulty that is entailed in the solution of certain of them.

B. Congruences

The theory of congruences was begun by the work of Carl Friedrich Gauss (1777–1855), which originally appeared

in his “Disquisitiones Arithmeticae” in 1801. We say that the two integers a and b are congruent for the modulus m, and write a ≡ b (mod m), if their difference a − b is divisible by the integer m. For example, 21 ≡ 6 (mod 5), since 21 − 6 = 15 is divisible by 5. Similarly, 19 ≡ 54 (mod 7) since 19 − 54 = −35 is divisible by 7. However, as is easily verified, 18 is not congruent to 11 (mod 3) and we write 18 ≢ 11 (mod 3).

Since any integer n leaves a remainder of 0, 1, 2, . . . , m − 1 when it is divided by m, we see that the integers can be partitioned into m classes according to the remainder obtained upon division by m. They are called the residue classes (mod m). Thus, the residue classes (mod 5) consist of

    . . . , −10, −5, 0, 5, 10, . . .
    . . . ,  −9, −4, 1, 6, 11, . . .
    . . . ,  −8, −3, 2, 7, 12, . . .
    . . . ,  −7, −2, 3, 8, 13, . . .
    . . . ,  −6, −1, 4, 9, 14, . . . .

Obviously, any two integers within a given residue class are congruent (mod m).

Certain properties of congruences follow immediately from the definition. Thus, if a ≡ b (mod m) and c ≡ d (mod m), then

(1) (a + c) ≡ (b + d) (mod m),
(2) ac ≡ bd (mod m),
(3) $a^n \equiv b^n \pmod{m}$, and
(4) a ≡ b (mod n) where n is a divisor of m.

Furthermore, if $a \equiv b \pmod{m_1}$ and $a \equiv b \pmod{m_2}$, then a ≡ b (mod M), where M is the lcm of $m_1$ and $m_2$. So, from the congruences 17 ≡ 377 (mod 24) and 17 ≡ 377 (mod 40), we conclude that 17 ≡ 377 (mod 120), where [24, 40] = 120.

Finally, if ac ≡ bc (mod m), then a ≡ b (mod m/d), where d is the gcd of c and m. Thus, when the congruence 48 ≡ 18 (mod 15) is written as 8(6) ≡ 3(6) (mod 15), we see that (6, 15) = 3. Hence, 8 ≡ 3 (mod 5).

To illustrate the utility of calculating with congruences, consider the problem of finding the smallest positive remainder obtained when $15^{19}$ is divided by 23. We have

    $15 \equiv -8 \pmod{23}$
    $15^2 \equiv 64 \equiv -5 \pmod{23}$
    $15^3 \equiv 40 \equiv -6 \pmod{23}$
    $15^4 \equiv 25 \equiv 2 \pmod{23}$
    $15^{16} \equiv (15^4)^4 \equiv 2^4 \equiv 16 \pmod{23}.$

Then,

    $15^{19} \equiv 15^{16}(15^3) \equiv 16(-6) \equiv -96 \equiv 19 \pmod{23}.$

Hence, $15^{19}$ leaves a remainder of 19 when divided by 23.

C. Linear Congruences

Just as in the case of linear equations, linear congruences involve the first power of a variable. Whereas linear equations yield only a single solution, linear congruences may give an unlimited number of solutions. The linear congruence 2x ≡ 6 (mod 12) has the solution x = 3 as well as any integer that is congruent to 3 (mod 12).

Of greater interest, however, is the question of whether two or more noncongruent solutions exist. Trial and error will show that our previous example also has the solution x ≡ 9 (mod 12). Also, unlike linear equations, linear congruences may have no solution. In general, ax ≡ b (mod m) has solutions only when the greatest common divisor of a and m also divides b. In this case, there are exactly (a, m) noncongruent solutions.

Again, the congruence 18x ≡ 24 (mod 30) has (18, 30) = 6 solutions given by the numbers 3, 8, 13, 18, 23, and 28. Note that all solutions after the first are obtained by successively adding 30/6 = 5 to the previous solution. On the other hand, the congruence 15x ≡ 12 (mod 35) has no solution because (15, 35) = 5 does not divide 12.

Two or more linear congruences may be solved simultaneously. By listing the solutions of the two congruences x ≡ 5 (mod 7) and x ≡ 4 (mod 11), which include 12, 19, 26, 33, . . . and 15, 26, 37, . . . , respectively, we see that 26 is a common solution. Then x ≡ 26 (mod 7 · 11) or x ≡ 26 (mod 77) is the simultaneous solution.

D. Chinese Remainder Theorem

Many ancient problems involved linear congruences. One such problem concerned the removal of eggs from a basket 2, 3, 4, 5, and 6 at a time, whereupon 1 egg remained. However, when they were removed 7 at a time, none remained. This may be stated in terms of congruences as follows: x ≡ 1 (mod 2), x ≡ 1 (mod 3), . . . , x ≡ 1 (mod 6), x ≡ 0 (mod 7). These are equivalent to the two congruences x ≡ 1 (mod 60) and x ≡ 0 (mod 7), which have the solution x ≡ 301 (mod 420).

Problems of this type are found in the works of the Chinese mathematician Sun-Tse. When each pair of moduli is relatively prime, the solution is found in the following theorem.

Chinese Remainder Theorem. Given the simultaneous congruences $x \equiv a_i \pmod{m_i}$ for i = 1, 2, 3, . . . , n,

where $(m_i, m_j) = 1$ for all i ≠ j, let $M = m_1 m_2 m_3 \cdots m_n$. If for each i the congruence $b_i(M/m_i) \equiv 1 \pmod{m_i}$ is solved, then the original set of congruences has the solution

    $x \equiv [a_1 b_1 (M/m_1) + a_2 b_2 (M/m_2) + a_3 b_3 (M/m_3) + \cdots + a_n b_n (M/m_n)] \pmod{M}.$

We illustrate the use of the Chinese remainder theorem with the following problem. A shipwrecked sailor passes the time of day by counting the coconuts he has gathered. When he counts by threes, there are 2 coconuts left over. When he counts by fives, there are 4 left over, and when he counts by sevens, there are 5 left over. How many coconuts has the sailor gathered?

The conditions of the problem lead to the congruences x ≡ 2 (mod 3), x ≡ 4 (mod 5), and x ≡ 5 (mod 7); whence $a_1 = 2$, $m_1 = 3$, $a_2 = 4$, $m_2 = 5$, $a_3 = 5$, and $m_3 = 7$. Then M = 3(5)(7) = 105 and $M/m_1 = 35$, $M/m_2 = 21$, and $M/m_3 = 15$. The new set of congruences is $35b_1 \equiv 1 \pmod{3}$, $21b_2 \equiv 1 \pmod{5}$, and $15b_3 \equiv 1 \pmod{7}$. The solutions of the latter are $b_1 = 2$, $b_2 = 1$, and $b_3 = 1$. Thus,

    x ≡ [2(2)(35) + 4(1)(21) + 5(1)(15)] (mod 105)
      ≡ 299 (mod 105) ≡ 89 (mod 105)

is the solution of the original set of congruences.

Other theorems of interest that involve congruences include one due to Pierre de Fermat (1601–1665) called

Fermat’s Little Theorem. If p is a prime number that does not divide the number a, then $a^{p-1} \equiv 1 \pmod{p}$.

For instance, if p = 5 and a = 8, we have $8^{5-1} = 8^4 = 4096 \equiv 1 \pmod{5}$.

Fermat’s little theorem is a special case of a theorem by Euler. For a positive integer n, the number of integers from 1 to n that are relatively prime to n is indicated by φ(n). Thus, φ(8) = 4 since 8 is relatively prime to 1, 3, 5, and 7. Also, φ(11) = 10 because 11 is relatively prime to the integers from 1 to 10. Note that for any prime p, φ(p) = p − 1. We state

Euler’s Theorem. If a and m are relatively prime, then

    $a^{\varphi(m)} \equiv 1 \pmod{m}.$

Thus, if a = 9 and m = 8, we have $9^4 = 6561 \equiv 1 \pmod{8}$, which is easily verified.

E. Wilson’s Theorem

We next consider

Wilson’s Theorem. The integer p is a prime if and only if (p − 1)! ≡ −1 (mod p).

For example, if p = 7, then

    (7 − 1)! = 6! = 720 ≡ −1 (mod 7).

However, for p = 6,

    (6 − 1)! = 5! = 120 ≢ −1 (mod 6).

Wilson’s theorem is credited to John Wilson (1741–1793), although it was first proved by Joseph Louis Lagrange (1736–1813). It is noteworthy because it gives a necessary and sufficient condition for an integer to be prime. In practice, however, it is of limited use in determining the primality of integers, because the factorial expression becomes so large that its computation quickly exceeds all reasonable time limits, even on the most modern computers.

A complementary result is the following theorem.

Theorem. If n is a composite integer different from 4, then (n − 1)! ≡ 0 (mod n).

The fact that 1255 is composite guarantees that (1254)! ≡ 0 (mod 1255).

We now examine some results for congruences involving a variable to the second or higher power. When the highest power of the variable is 2, we call the congruence a quadratic congruence, analogous to a quadratic equation. The following theorems, which are consequences of Wilson’s theorem, give explicit solutions for certain quadratic congruences.

Theorem. If p is a prime having the form 4n + 1, then the roots of the congruence $x^2 \equiv -1 \pmod{p}$ are

    $x \equiv \pm \left(\frac{p - 1}{2}\right)! \pmod{p}.$

To illustrate, let p = 13 = 4(3) + 1. Then the congruence $x^2 \equiv -1 \pmod{13}$ has the solutions

    $x \equiv \left(\frac{13 - 1}{2}\right)! \pmod{13},$

or

    x ≡ 6! (mod 13) ≡ 5 (mod 13),

which we illustrate by

    $5^2 = 25 \equiv -1 \pmod{13}$

and

    x ≡ −6! (mod 13) ≡ 8 (mod 13),

which we illustrate by

    $8^2 = 64 \equiv -1 \pmod{13}.$

Any odd prime must be of the form 4n + 1, considered previously, or of the form 4n + 3, for which the following holds.
P1: GLQ Final Pages
Encyclopedia of Physical Science and Technology EN011I-503 July 14, 2001 21:40

Theorem. If p is a prime of the form 4n + 3, then one of the congruences

    ((p − 1)/2)! ≡ ±1 (mod p)

holds. As an example, we see that when p = 3,

    ((3 − 1)/2)! = 1! = 1 ≡ 1 (mod 3)

and when p = 7,

    ((7 − 1)/2)! = 3! = 6 ≡ −1 (mod 7).

There is a complex procedure for determining whether the positive or negative sign prevails.

A more general result is contained in the following theorem.

Theorem. If p is an odd prime that does not divide the integer a, then x^2 ≡ a (mod p) has a solution when a^((p−1)/2) ≡ 1 (mod p) and has no solution when a^((p−1)/2) ≡ −1 (mod p).

Let p = 7 and a = 8. Then 8^((7−1)/2) = 8^3 = 512 ≡ 1 (mod 7), so the congruence x^2 ≡ 8 (mod 7) has a solution. One easily found solution is x ≡ 1 (mod 7).

On the other hand, if p = 5 and a = 8, then 8^((5−1)/2) = 8^2 = 64 ≡ −1 (mod 5). Hence, the congruence x^2 ≡ 8 (mod 5) has no solutions.

By trial and error we find that the congruence x^2 + 5x ≡ 0 (mod 6) has the four solutions x ≡ 0 (mod 6), x ≡ 1 (mod 6), x ≡ 3 (mod 6), and x ≡ 4 (mod 6). However, if the modulus is prime, a polynomial congruence of degree n cannot have more than n solutions. It may have fewer than n. For example, x^3 ≡ 1 (mod 5) has only the solution x ≡ 1 (mod 5).

In general, in order to solve a polynomial congruence, one has only to solve polynomial congruences having moduli that are powers of a prime. We illustrate the method by considering the congruence

    x^3 − 2x^2 + 12 ≡ 0 (mod 44).

Because 44 = 2^2 · 11, we write two new congruences

    x^3 − 2x^2 + 12 ≡ 0 (mod 2^2)

and

    x^3 − 2x^2 + 12 ≡ 0 (mod 11).

The first has the solutions x ≡ 0 (mod 4) and x ≡ 2 (mod 4), whereas the second has the solutions x ≡ 1 (mod 11), x ≡ 4 (mod 11), and x ≡ 8 (mod 11). By applying the Chinese remainder theorem to pairs of these latter congruences, each pair containing a congruence of modulus 4 and a congruence of modulus 11, we find the solutions of the original congruence to be

    x ≡ 4 (mod 44),   x ≡ 8 (mod 44),
    x ≡ 12 (mod 44),  x ≡ 26 (mod 44),
    x ≡ 30 (mod 44),  x ≡ 34 (mod 44).

F. Quadratic Residues

We now examine the law of quadratic reciprocity, one of the most profound and powerful results in the theory of congruences. If the integers a and m are relatively prime, we say that a is a quadratic residue of m if the congruence x^2 ≡ a (mod m) has a solution. If it has no solution, we say that a is a quadratic nonresidue of m. Since x^2 ≡ 11 (mod 5) has the solution x ≡ 1 (mod 5), we see that 11 is a quadratic residue of 5. However, x^2 ≡ 13 (mod 5) has no solution, showing that 13 is a quadratic nonresidue of 5.

The Legendre symbol simplifies discussion of quadratic reciprocity. If p is an odd prime that does not divide a, we write (a/p) = 1 if a is a quadratic residue of p and (a/p) = −1 if a is a quadratic nonresidue of p. From our previous discussion, we have (11/5) = 1 and (13/5) = −1.

We have already mentioned Euler’s criterion, which states that r is a quadratic residue of an odd prime p if r^((p−1)/2) ≡ 1 (mod p) and r is a quadratic nonresidue of p if r^((p−1)/2) ≡ −1 (mod p). Thus, from the fact that 5^((11−1)/2) = 5^5 = 3125 ≡ 1 (mod 11), we conclude that 5 is a quadratic residue of 11. However, 5^((13−1)/2) = 5^6 = 15,625 ≡ −1 (mod 13), which indicates that 5 is a quadratic nonresidue of 13. Hence, (5/13) = −1 while (5/11) = 1.

Also, x^2 ≡ 11 (mod 7) has the solution x ≡ 2 (mod 7), so (11/7) = 1. However, x^2 ≡ 7 (mod 11) has no solution, whence (7/11) = −1.

We see that in some cases (p/q) = (q/p), as in (5/13) = (13/5) = −1 or (11/5) = (5/11) = 1, but in other cases (p/q) = −(q/p), for example, (11/7) = 1 = −(7/11). The question arises as to when equality holds and when it does not. Euler discovered a pattern that he believed answered this question, and Gauss later gave several proofs that this pattern was indeed true in all cases. The key to the pattern is the observation that every odd prime has either the form 4n + 1 or 4n + 3. If p and q are odd primes and either one of them has the form 4n + 1, then (p/q) = (q/p), as in (5/11) = (11/5). However, if both of the primes have the form 4n + 3, then (p/q) = −(q/p). An example of the latter is (11/7) = −(7/11).

The theorem is usually stated in the more elegant form

Law of Quadratic Reciprocity. When p and q are distinct odd primes then

    (p/q)(q/p) = (−1)^([(p−1)/2][(q−1)/2]).

Gauss gave six proofs of the law of quadratic reciprocity and more than 50 proofs have been devised.
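As a rough illustration of the Legendre symbol and the reciprocity law, the following Python sketch evaluates (a/p) through Euler’s criterion and checks the law for the primes used in the examples above. The function names legendre and reciprocity_holds are illustrative choices, not notation from the article, and the code assumes that p is an odd prime that does not divide a.

```python
def legendre(a, p):
    """Legendre symbol (a/p) via Euler's criterion.

    Assumes p is an odd prime that does not divide a.
    """
    r = pow(a, (p - 1) // 2, p)      # a^((p-1)/2) mod p is either 1 or p - 1
    return -1 if r == p - 1 else r


def reciprocity_holds(p, q):
    """Check (p/q)(q/p) = (-1)^(((p-1)/2)((q-1)/2)) for distinct odd primes."""
    lhs = legendre(p, q) * legendre(q, p)
    rhs = (-1) ** (((p - 1) // 2) * ((q - 1) // 2))
    return lhs == rhs


print(legendre(11, 5), legendre(13, 5))   # 1 -1
print(legendre(5, 11), legendre(5, 13))   # 1 -1
print(legendre(11, 7), legendre(7, 11))   # 1 -1
print(reciprocity_holds(7, 11))           # True
```

The three-argument form of pow keeps every intermediate value below p, so the criterion stays cheap even for large primes.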
A number of assertions by Fermat can be shown to follow from the law of quadratic reciprocity, including the following.

1. Every prime of the form 8n + 1 or 8n + 3 has the form x^2 + 2y^2. For example, 17 = 3^2 + 2(2^2) and 41 = 3^2 + 2(4^2).
2. Every prime 3n + 1 has the form x^2 + 3y^2, but no prime 3n − 1 has this form.
3. Every prime 4n + 1 is the sum of two squares in only one way.

G. Formulas for Primes

A glance at a list of the first few primes is sufficient to convince one that any formula that would yield all of the primes would be a most unusual contrivance. Even the related but less demanding task of finding a formula that would produce only primes seems most formidable.

Certain formulas furnish partial results. For example, if we replace x in the function f(x) = x^2 + x + 41 with x = 0, ±1, ±2, . . . , ±39, or −40, the result in each case is a prime. The same is true of g(x) = x^2 − 79x + 1601 for x = 0, 1, 2, . . . , 79. The function f is a composite of two formulas, one given by Euler in 1772, the other by Legendre in 1798.

Euler also found that 2x^2 + n is prime for x = 0, 1, . . . , n − 1 where n is one of the numbers 3, 5, 11, or 29. Paul Pritchard discovered the imposing expression

    11,410,337,850,553 + 4,609,098,694,200x,

which is prime for x = 0, 1, . . . , 21.

Other formulas of this nature exist, but it can be shown that no polynomial function having integral coefficients can generate only prime numbers for integral values of x. One need only observe the effect of calculating f(41) = (41)^2 + 41 + 41 to understand why any polynomial will eventually produce a composite number.

The function f given above generates primes for 80 consecutive integers. It has been conjectured that this is the largest number of consecutive primes that a quadratic polynomial can produce. All that is known for sure is that no quadratic having the form x^2 + x + c, where c > 41, can yield primes for all values of x = 0, 1, 2, 3, . . . , c − 2.

From algebra we know that a polynomial of degree n can be constructed to assume n + 1 arbitrary values. Thus, the function h(x) = −x^3/6 + x^2 + x/6 + 2 gives the first four primes for x = 0, 1, 2, and 3. However, a similar polynomial yielding the first 101 primes would have degree 100.

The longest known arithmetic progression consisting entirely of primes is given by 223,092,870n + 2,236,133,941 where n = 0, 1, 2, . . . , 15.

W. H. Mills proved the following remarkable theorem: There exists a real number θ such that [θ^(3^n)] is prime for all positive integers n. Unfortunately, there seems to be little hope of determining the value of θ, so the theorem is presently of no use for constructing primes. Similar formulas exist for expressing p_n as a function of some parameter θ. In order to determine θ to an accuracy sufficient to calculate p_n, it proves necessary to know the primes p_1, p_2, p_3, . . . , p_n. Such formulas are thus equally useless for constructing new primes, unless some method can be found for determining θ without using p_1, p_2, p_3, . . . , p_n. No such method has yet been found, nor has it been proved impossible for such a method to exist. The question is thus an open one at present.

The number of primes whose values do not exceed some positive integer x is symbolized by π(x). Thus, π(12) = 5, since 2, 3, 5, 7, and 11 are primes not exceeding 12. Similarly, π(13) = 6.

The method of finding primes known as the sieve of Eratosthenes leads to the formula

    π(N) − π(√N) + 1 = N − [N/p_1] − [N/p_2] − ··· − [N/p_k]
                         + [N/(p_1 p_2)] + [N/(p_1 p_3)] + ··· + [N/(p_(k−1) p_k)]
                         − [N/(p_1 p_2 p_3)] − ··· ,

where p_1, p_2, . . . , p_k are the primes not exceeding √N. The formula involves considerable calculation. To determine the number of primes less than one million, one needs to consider the primes less than 1000. Obviously, today’s computers make such a formula vastly easier to apply and therefore more useful.

There are certain shortcuts that can be employed with the formula for π(x). Before the advent of computers, Bertelsen determined that there are 50,847,478 primes that do not exceed one billion. An analogous formula for the number of pairs of twin primes the larger of which does not exceed N was given by G. J. Kostis and R. L. Page.

There exists a number pattern that generates primes in a curious but inefficient way:

    1 1
    1 2 1
    1 3 2 3 1
    1 4 3 5 2 5 3 4 1
    1 5 4 7 3 8 5 7 2 7 5 8 3 7 4 5 1

Each successive row is formed by inserting between each consecutive pair of integers their sum. Any two consecutive entries are relatively prime. It is also true that the integer n occurs φ(n) times in the nth line, where φ(n) is the number of integers less than n and relatively prime to n. Since a prime p is relatively prime to all p − 1 integers less than itself, we see that n is a prime if and only if it appears n − 1 times in the nth row. Since 2 appears once in row 2, it must be a prime. Similarly, 3 appears twice in row 3 and 5 appears 4 times in row 5. However, 4 appears
only twice in row 4, making 4 a composite number. The number of integers in each row increases rapidly (row n contains 2^(n−1) + 1 entries), there being 513 of them in row 10. Thus, the pattern is of little practical value in finding primes.

H. The Prime Number Theorem

By examining tables of prime numbers and by a great amount of trial-and-error calculation, mathematicians discovered that the quantities π(x) and x/ln x behave in a similar fashion and that their ratio approaches 1 as x increases without bound. This was confirmed upon proof of the following theorem.

Prime Number Theorem. lim_(x→∞) [π(x)/(x/ln x)] = 1.

This was proved independently by Hadamard and de la Vallée Poussin.

Another function that approximates π(x) better than x/ln x is the integral logarithm, defined by

    Li(x) = ∫_2^x dt/ln t.

I. Riemann’s Zeta Function

Another function of great importance in the study of the distribution of primes is Riemann’s zeta function: ζ(s) = Σ_(n=1)^∞ (1/n^s). For example, ζ(1) = 1 + 1/2 + 1/3 + ···, which may be shown to diverge, and ζ(2) = 1 + 1/4 + 1/9 + ···, which converges to π^2/6. The series converges for all s > 1. Its relation to prime numbers stems from an identity, which Euler discovered, that expresses the zeta function as the repeated product of a term evaluated only for primes. Many of the important functions used in studying properties of primes may be expressed as combinations of zeta functions.

While no formula exists for generating every prime, Bertrand’s postulate assures us that if n ≥ 1, there exists at least one prime p such that n < p ≤ 2n.

J. Mersenne Primes

The study of the primality of numbers having a particular form has occupied a great deal of attention in number theory. A number of the form M_n = 2^n − 1 is called a Mersenne number in honor of Marin Mersenne (1588–1648). Thus, M_2 = 3, M_3 = 7, M_4 = 15, etc. When M_n is prime, it is called a Mersenne prime and the number 2^(n−1)(2^n − 1) is a perfect number. Furthermore, every even perfect number must be of this form. Thus, the search for even perfect numbers reduces to a search for Mersenne primes. Because 2^n − 1 can be factored when n is composite, a Mersenne prime must have the form M_p = 2^p − 1, where p is prime. For M_2 = 3, the perfect number is 2(3) = 6 and for M_3 = 7, the perfect number is 4(7) = 28.

K. Fermat Numbers

Numbers of the form F_t = 2^(2^t) + 1 are called Fermat numbers. We see that F_0 = 3, F_1 = 5, F_2 = 17, F_3 = 257, and F_4 = 65,537 are all primes. The next one, F_5 = 2^32 + 1, is difficult to test for factors without the use of a computer. Fermat conjectured that all Fermat numbers were primes, and it was a hundred years before Euler found a counterexample. He did this by proving that any factor of a Fermat number F_t must have the form (2^(t+1))k + 1. For t = 5, these factors are of the form 64k + 1. He then showed that 641 is a factor of F_5.

No further primes have been discovered among the Fermat numbers and many believe that quite the opposite of Fermat’s conjecture is true. Nevertheless, interest in Fermat numbers continues because of a remarkable result due to Gauss. He showed that a regular polygon of N sides can be constructed with only straightedge and compass if N = 2^k p_1 p_2 ··· p_n, where the p_i are distinct Fermat primes. This unexpected connection between number theory and geometry is an example of the richness of results obtained by pondering prime numbers.

L. Twin Primes

Little is known about twin primes, not even whether an infinite number of pairs exist. By examining the first few pairs, (3,5), (5,7), (11,13), (17,19), (29,31), etc., one is led to conjecture: If p and q are twin primes both larger than 3, then they have the form 6k ± 1. This conjecture is readily proven. Any integer must have one of the forms 6k, 6k ± 1, 6k ± 2, or 6k + 3, all of which, except 6k ± 1, are easily seen to be composite when larger than 3.

The largest known pair of twin primes, discovered in 1990 by B. K. Parady, J. F. Smith, and S. Zarantonello, is

    1706595 × 2^11235 ± 1

M. Representation of Numbers in Certain Forms

The operation of division leads to questions of factorability of integers and to congruences, to mention only two things. So, too, does the operation of addition lead to many important questions in number theory. For example, the question of whether a given integer can be represented as the sum of an arbitrary number of squares has occupied many mathematicians over the years. For the case of two squares, we have the following theorem:

Theorem. Every prime of the form 4k + 1 can be written as the sum of two squares.
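The two-squares theorem can be explored with a small brute-force search. The Python sketch below is only an illustration: the helper name two_squares is an assumption of this example, and the search simply tries every candidate a up to √(n/2) rather than using any of the faster constructive methods.

```python
from math import isqrt


def two_squares(n):
    """Return (a, b) with a <= b and a*a + b*b == n, or None if no pair exists."""
    for a in range(isqrt(n // 2) + 1):
        b_squared = n - a * a
        b = isqrt(b_squared)
        if b * b == b_squared:       # the remainder is a perfect square
            return a, b
    return None


# Primes of the form 4k + 1 are always representable.
for p in (5, 13, 17, 29, 37, 41):
    print(p, two_squares(p))

# A prime of the form 4k + 3 is not.
print(7, two_squares(7))             # 7 None
```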
Thus, 29 = 4(7) + 1 has the required form and can be written 29 = 4 + 25 = 2^2 + 5^2. A more powerful result states: An integer n can be written as the sum of two squares if and only if all of its prime factors having the form 4k + 3 occur with even exponents. We have 5(7^2) = 245 = 49 + 196 = 7^2 + 14^2 as the sum of two squares, whereas 7^3 = 343 is not.

The case for three squares is settled by the following theorem:

Theorem. An integer n can be represented as the sum of three squares unless n has the form 4^e (8k + 7) for some integers e and k.

We see that 45 = 4 + 16 + 25 = 2^2 + 4^2 + 5^2, whereas 60 = 4(8 + 7) cannot be so represented.

Lagrange proved that every positive integer is the sum of four squares. The search for the four squares that represent a certain number can be reduced to representing the prime factors of the number as the sum of four squares. An identity then gives the four squares for the original number.

The question of finding the least value of s such that every integer can be expressed as the sum of no more than s kth powers of integers is known as Waring’s problem. It has long been known that every integer can be written as the sum of nine cubes. Kevin S. McCurley proved that every integer exceeding e^(e^13.97) is a sum of seven positive integral cubes.

For fourth powers, the most that can be said is that every integer larger than 10^(10^89) can be represented as a sum of 19 fourth powers. For fifth powers, the question is unresolved. Surprisingly, for many higher powers, the problem has been solved. It is known that for 6 ≤ k ≤ 200,000, every integer can be written as a sum of no more than 2^k + [(3/2)^k] − 2 kth powers. Hence, every integer is the sum of no more than 2^6 + [(3/2)^6] − 2 = 73 sixth powers.

We have seen that every integer can be represented as the sum of no more than nine cubes. In fact, if 23 and 239 are exempted, every integer is the sum of no more than eight cubes. Similarly, every integer larger than 454 is the sum of no more than seven cubes. One is therefore led to ask: What is the smallest number G(k) of kth powers whose sum represents any sufficiently large integer? I. M. Vinogradov showed that

    k + 1 ≤ G(k) ≤ k(3 ln k + 11).

Although one of the outstanding accomplishments in number theory, this nevertheless falls short of answering the question.

From studying tables of primes, one comes to the inescapable conclusion that most integers are composite. This is true in the sense that the ratio π(x)/x approaches zero as x approaches infinity. Nevertheless, because primes are the building blocks for all integers, research concerning their properties continues to be a major area in number theory.

We sometimes refer to numbers such as 1000 or 1,000,000 as nice round numbers because of the zeros in their base 10 representation. More generally, we consider a round number to be one having a large number of relatively small factors. Therefore, 4200 = 2^3 · 3 · 5^2 · 7 would be a round number but 17,858,257 = (3607)(4951) would not. It can be shown that the number of prime factors of an integer x is, on the average, of the order of ln (ln x). That is, for all integers in a large interval, the preponderance of them have as the number of their prime factors a number close to ln (ln x).

N. Factorization Methods

Many methods have been devised to aid in the factorization of large numbers. Fermat’s factorization method depends on writing the number to be factored as the difference of two squares. If

    n = x^2 − y^2 = (x + y)(x − y) = ab,

we may assume that n is odd and then a and b will also be odd. If we write x^2 = n + y^2, then x ≥ √n. We examine x^2 − n for various values of x > √n, and when the difference is a square, we have found the desired factorization.

For example, if n = 26,781, we see that (164)^2 − 26,781 = 115, which is not a square. Instead of considering (165)^2 − 26,781, (166)^2 − 26,781, etc., we can use the easier but equivalent method of adding 2x + 1, 2x + 3, etc., and checking each result for a square.

      x     x^2 − 26,781
    165     115 + 329 = 444
    166     444 + 331 = 775
    167     775 + 333 = 1108
    168     1108 + 335 = 1443
    169     1443 + 337 = 1780
    170     1780 + 339 = 2119
    171     2119 + 341 = 2460
    172     2460 + 343 = 2803
    173     2803 + 345 = 3148
    174     3148 + 347 = 3495
    175     3495 + 349 = 3844 = (62)^2

Then

    (175)^2 − 26,781 = (62)^2

or

    26,781 = (175 + 62)(175 − 62) = 237(113).
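The search just carried out by hand can be sketched in a few lines of Python. This is a minimal illustration, assuming an odd input n > 1; the name fermat_factor is chosen here for convenience, and the running difference is updated by adding 2x + 1 exactly as in the table.

```python
from math import isqrt


def fermat_factor(n):
    """Factor an odd n > 1 as (x + y, x - y), where x*x - n = y*y."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd integer greater than 1")
    x = isqrt(n)
    if x * x < n:
        x += 1                       # smallest x with x*x >= n
    diff = x * x - n
    while True:
        y = isqrt(diff)
        if y * y == diff:            # x*x - n is a perfect square
            return x + y, x - y
        diff += 2 * x + 1            # (x + 1)^2 - n = (x^2 - n) + 2x + 1
        x += 1


print(fermat_factor(26781))          # (237, 113)
```

The loop always terminates for odd n, because x = (n + 1)/2 gives the trivial factorization n · 1; the method is fast only when n has two factors close to √n.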
Euler’s factorization method depends on expressing the number to be factored as the sum of two squares in two different ways: N = a^2 + b^2 = c^2 + d^2. Then a^2 − c^2 = d^2 − b^2 or (a + c)(a − c) = (d + b)(d − b), where we may assume that a and c are odd and b and d are even. Let k be the gcd of a − c and d − b. Then k is even and a − c = kl and d − b = km. Then (a + c)kl = (d + b)km or (a + c)l = (d + b)m. We see that a + c is divisible by m since (l, m) = 1. Thus a + c = nm, where n is even. Also, nml = (d + b)m or nl = d + b. We see that n is the gcd of a + c and d + b. The factorization is

    N = [(k/2)^2 + (n/2)^2](m^2 + l^2).

For example,

    1,000,009 = (235)^2 + (972)^2 = 3^2 + (1000)^2.

We have a − c = 232, a + c = 238, d − b = 28, and b + d = 1972. Then k = (232, 28) = 4, n = (238, 1972) = 34, l = (a − c)/k = 232/4 = 58, and m = (d − b)/k = 28/4 = 7. We find the factorization

    1,000,009 = [(4/2)^2 + (34/2)^2](7^2 + 58^2) = 293(3413).

O. Fibonacci Numbers

Consider the following ancient problem. Assume that we have a pair of newborn rabbits, one male and one female. Suppose that at the end of two months the pair becomes mature. At the end of the third month, and every succeeding month, they produce a male–female pair. Finally, assume that the rules just stated apply to each new pair of rabbits that is born. Assuming no deaths, how many pairs of rabbits will there be at the end of any given month?

If we represent an immature pair of rabbits by an open circle and a mature pair by a shaded circle, we have the results shown in Fig. 15.

FIGURE 15 The Fibonacci sequence.

Notice that the number of mature pairs each month equals the total number of pairs from the previous month. Also, the number of immature pairs each month equals the number of mature pairs from the previous month. But, since the number of mature pairs from the previous month equals the total number of pairs from two months ago, we see that the total number of pairs in any given month is the sum of the total number of pairs from the previous two months.

The sequence 1, 1, 2, 3, 5, 8, . . . , defined by F_1 = 1, F_2 = 1, F_n = F_(n−1) + F_(n−2) for all n ≥ 3, is called the Fibonacci sequence in honor of Leonardo of Pisa (who was known as Fibonacci, the son of Bonacci). Table I lists the first 30 terms of the Fibonacci sequence.

TABLE I The first 30 Fibonacci numbers

     n    F_n       n    F_n       n    F_n
     1    1        11    89       21    10,946
     2    1        12    144      22    17,711
     3    2        13    233      23    28,657
     4    3        14    377      24    46,368
     5    5        15    610      25    75,025
     6    8        16    987      26    121,393
     7    13       17    1597     27    196,418
     8    21       18    2584     28    317,811
     9    34       19    4181     29    514,229
    10    55       20    6765     30    832,040

Many properties have been discovered for the sequence, which has been the subject of extensive study. One readily sees, for example, that every third term is divisible by 2, every fourth term is divisible by 3, and every fifth term is divisible by 5. In fact, it can be shown that every prime p divides an infinitude of Fibonacci numbers.

Several other easily discovered properties are

1. The sum of the first n Fibonacci numbers is one less than F_(n+2).
2. F_(n−1) F_(n+1) − F_n^2 = (−1)^n.
3. F_n^2 + F_(n+1)^2 = F_(2n+1).
4. If A, B, C, and D are four consecutive Fibonacci numbers, then C^2 − B^2 = AD.

The sequence is also related to patterns found in nature. For example, the seeds on the head of a sunflower lie in rows that form clockwise (CW) and counterclockwise (CCW) spirals. The numbers of spirals of each type are often consecutive Fibonacci numbers with the smaller number enumerating the CCW spirals and the larger number the CW ones. A typical sunflower will have 34 CCW and 55 CW spirals, although heads with 144 CCW and 233 CW spirals have been reported.

Other sunflower heads may have numbers of CCW and CW spirals given by consecutive Lucas numbers, named for Edouard Lucas. These numbers are 1, 3, 4, 7, 11, 18, . . . , where L_1 = 1, L_2 = 3, and L_n = L_(n−1) + L_(n−2) for n ≥ 3.

If we consider the ratios F_(n+1)/F_n, we have the sequence 1, 2, 1.5, 1.666, . . . , 1.6154, 1.6190, 1.6176, . . . .
This sequence converges to the number (1 + √5)/2 = 1.6180339 . . . , which is the golden ratio of ancient Greece.

Another interesting relationship involves the first k Fibonacci numbers. Suppose we choose the first 11 of them: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, and 89. We construct a rectangle having sides of 55 and 89 and then divide it into a square of side 55 and a 34-by-55 rectangle as shown in Fig. 16. Now we divide the 34-by-55 rectangle into a square of side 34 and a 34-by-21 rectangle as in Fig. 17. We continue the process of constructing squares in the remaining rectangles until two 1-by-1 squares are constructed. Then in the 55-by-55 square, we draw a quarter circle of radius 55, in the 34-by-34 square we draw a quarter circle of radius 34 that is connected to the first quarter circle, and continue in this fashion until all of the squares have quarter circles. The resulting curve, which closely approximates the logarithmic spiral, a curve found in the shell of the chambered nautilus, is shown in Fig. 18.

FIGURE 16 A rectangle divided in proportion to numbers in the Fibonacci sequence.

FIGURE 17 A rectangle divided in proportion to numbers in the Fibonacci sequence.

FIGURE 18 An approximation to the logarithmic spiral.

III. MODERN DIRECTIONS

Research in number theory flourishes today, the results of that research appearing in numerous scholarly journals on a monthly or quarterly basis. Obviously, this article cannot begin to treat the current state of research in the field. Besides being too large in scope, the subject has become extremely abstract, with many topics which can be fully understood only by experts in the field.

High-speed computers have affected number theory by allowing the consideration of cases involving computations far beyond the capacity of the human mind. Furthermore, by performing thousands of operations a second, computers reduce the time required for certain calculations to within reasonable limits. Starting with M_521, the most recently discovered Mersenne primes were found with the aid of computers. Some results that were established by use of home computers have been published.

Computers have sparked renewed interest in other ancient problems. In 1974, H. J. J. te Riele announced a pair of amicable numbers, each of which has 152 digits. Using a probabilistic algorithm in conjunction with a computer, Michael O. Rabin found a pair of numbers of the order of magnitude 10^123, which he conjectures are twin primes.

In November, 1996, Joel Armengaud and George F. Woltman proved that 2^(1,398,269) − 1 is a Mersenne prime that would require more than 400,000 digits to write in the base 10 system and is the largest known prime.

Research in the distribution of primes continues, including the study of primes that differ by a fixed amount and gaps between primes. Sol Weintraub has found the largest known gap between primes, which consists of 682 consecutive composite numbers following the prime 61,003,096,898,749. In view of the use of computers to facilitate searches, it is safe to say that many of today’s largest results will soon be surpassed.

Sometimes a long-term conjecture falls prey to a counterexample. Euler conjectured that no nth power is a sum of fewer than n nth powers; for example, no cube is the sum of fewer than 3 cubes. In 1966, L. J. Lander and T. R. Parkin found the counterexample:

    144^5 = 27^5 + 84^5 + 110^5 + 133^5.

In 1921, V. Brun showed that even if the number of twin primes should be infinite, they are more thinly distributed throughout the integers than are the primes. More specifically, he showed that the sum of the reciprocals of the primes is infinite but the sum of the reciprocals of twin primes is finite.
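Brun’s contrast between the two reciprocal sums can be made tangible numerically, although no finite computation proves either statement. The following Python sketch is an illustration only: the helper names primes_up_to and reciprocal_sums are assumptions of this example, and the twin-prime sum adds 1/p + 1/(p + 2) for every pair (p, p + 2) within the limit.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: list of all primes <= limit (limit >= 2)."""
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            # Strike out the multiples of i, starting at i*i.
            is_prime[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i in range(limit + 1) if is_prime[i]]


def reciprocal_sums(limit):
    """Return (sum of 1/p over primes p <= limit,
               sum of 1/p + 1/(p + 2) over twin pairs with p + 2 <= limit)."""
    primes = primes_up_to(limit)
    prime_set = set(primes)
    prime_sum = sum(1.0 / p for p in primes)
    twin_sum = sum(1.0 / p + 1.0 / (p + 2) for p in primes if p + 2 in prime_set)
    return prime_sum, twin_sum


for limit in (10**3, 10**4, 10**5):
    prime_sum, twin_sum = reciprocal_sums(limit)
    print(limit, round(prime_sum, 4), round(twin_sum, 4))
```

In such a run the first column of sums keeps creeping upward without bound as the limit grows, while the twin-prime column increases ever more slowly and stays below its finite limit.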
A. The Geometry of Numbers

Elementary number theory refers to those problems whose solution does not require methods from calculus. While this is still an important area in number theory, various other branches have developed in modern times. One such branch, known as the geometry of numbers, arose from a theorem by Hermann Minkowski. In its simplest form, the theorem concerns lattice points in a plane, that is, points whose coordinates are integers.

To state the theorem, we need to define a convex region in the plane, that is, a region R having the property that if any two points of R are connected by a straight line segment, all of the points of the segment will also lie in R. (See Fig. 19.)

FIGURE 19 Examples of convex and nonconvex regions.

Minkowski’s theorem states that any convex region R that is symmetric about the origin and whose area is greater than 4 will contain lattice points other than the origin. (See Fig. 20.) Although the theorem seems self-evident, proving it requires a fair amount of work. The theorem may be extended to n dimensions and various relationships and functions defined on the lattice points of both convex and nonconvex sets. Such advanced study of lattice points constitutes the geometry of numbers.

FIGURE 20 An illustration of Minkowski’s theorem.

B. Analytic Number Theory

Analytic number theory involves the use of methods from analysis or calculus, especially from the theory of complex variables, for solving problems in number theory. This branch arises from the work of Dirichlet and Georg F. B. Riemann (1826–1866), both of whom are sometimes credited with its founding.

A Dirichlet series has the form F(s) = Σ_(n=1)^∞ (α_n/n^s), where α_n is an expression that can be defined for each integer n. The simplest case is 1 + 1/2^s + 1/3^s + ···, which, as we have seen, is called Riemann’s zeta function. Riemann’s hypothesis, his only conjecture that remains unproved concerning the Riemann zeta function, states that the complex zeros of that function, that is, all zeros of the form x + y√−1, have x = 1/2.

This area tends to be abstract and the problems difficult, some yielding only to ingenious methods of attack. It may seem surprising that problems stated in terms of integers may require for their solution powerful methods from analysis, a branch of mathematics dealing with continuous quantities rather than the discrete set of integers. Indeed, some matters, such as the convergence of certain series, may properly be considered areas belonging to analysis.

C. Algebraic Number Theory

The branch known as algebraic number theory developed from attempts to apply concepts from number theory to sets of numbers other than the natural numbers. For example, Gauss considered numbers of the form a + b√−1, where a and b are integers. These are called the Gaussian integers and they follow a unique factorization law analogous to that of the fundamental theorem of arithmetic. Euler, in attempting to prove Fermat’s last theorem (see Section IV.B) for the case in which n = 3, considered numbers of the form a + b√−3, where a and b are integers. For such sets of numbers one may define primes that correspond to primes in the set of natural numbers, although some of them will be quite different from the ordinary primes.

Euler assumed, without proof, that unique factorization held for these numbers. His conclusion, although unproved, was correct. Later, Gabriel Lamé made a similar unproved assumption and announced that he had solved Fermat’s last theorem. His error was quickly pointed out.

About the same time, Ernst Eduard Kummer was attempting to extend Gauss’ quadratic reciprocity law to higher powers. He succeeded in showing that unique factorization does not hold for all sets of numbers. Thus, in the set of numbers a + b√−5, the number 6 has two different prime factorizations:

    6 = 2(3) = (1 + √−5)(1 − √−5).

An equation of the form a_0 x^n + a_1 x^(n−1) + ··· + a_n = 0, where each a_i is rational, is said to be irreducible if the
P1: GLQ Final Pages
Encyclopedia of Physical Science and Technology EN011I-503 July 14, 2001 21:40

Number Theory, Elementary 35

left-hand side cannot be factored into two similar expressions. If r is a root of such an irreducible equation of degree n, then the set of all expressions that can be formed from that root by addition, subtraction, multiplication, and division by nonzero terms is called the algebraic number field of degree n generated by r.

In order to preserve the concept of unique factorization, Kummer introduced what he called ideal numbers. He was able to find a sufficient condition for Fermat’s last theorem to be true and proved it for several particular cases.

Dedekind extended this concept to algebraic number fields in general by considering sets in the fields called ideals. This work led to the development of field theory, an area in which number theory overlaps another branch of modern mathematics called abstract or modern algebra.

Much research is presently conducted on Diophantine equations. A related topic is Diophantine approximation, the approximation of irrational numbers by rational numbers.

A review of current literature in number theory gives evidence of interest in old as well as new topics. Among the former are tests for divisibility, distribution of quadratic residues, Riemann’s zeta function, Pythagorean triples, and the distribution of primes. Among the latter are Fortune’s conjecture (see Section IV.A) and permutable primes. These are numbers such as 13 or 37 that are also prime when their digits are reversed.

IV. UNSOLVED PROBLEMS AND CONJECTURES

The natural numbers are the simplest set of numbers used in mathematics, yet the number of patterns derived from the natural numbers seems almost endless. These patterns have led mathematicians to ask questions such as: Is the pattern true for all integers? Under what conditions does the pattern hold?

A. Conjectures

When statements concerning number patterns are proved to be true, they are called theorems. Before that, they are conjectures, nothing more than educated guesses based on inductive reasoning applied to a finite number of particular cases. Conjectures, then, fall in a kind of no-man’s-land between the list of facts that have been shown to be true and the host of statements that have been shown to be false. They remain there until they are proven to be true theorems or until a counterexample shows them to be false.

Some conjectures enjoy long lives before they are disproved. For example, Fermat’s conjecture that numbers of the form $2^{2^t} + 1$ are always prime survived for a hundred years before it died at the hands of Euler. Many other patterns are disposed of almost as quickly as they appear. As an example, consider some patterns which seem to generate primes. For $p_1^2 + (p_1 + 1)p_2$ we have

$2^2 + 3(3) = 13$
$3^2 + 4(5) = 29$
$5^2 + 6(7) = 67$
$7^2 + 8(11) = 137$
$11^2 + 12(13) = 277$,

all of which are prime. However, $13^2 + 14(17) = 407 = 11(37)$.

Again, consider $p_1^{p_2} + p_1 + p_2$:

$2^3 + 2 + 3 = 13$
$3^5 + 3 + 5 = 251$
$5^7 + 5 + 7 = 78{,}137$,

all of which are prime. Numbers in this sequence grow in size rapidly. Thus, $7^{11} + 7 + 11 = 1{,}977{,}326{,}761$, which is harder to check for primality by use of a pocket calculator than its predecessors. Of course, it is easily handled by a modern computer. Even this is not necessary, however, since the next term $11^{13} + 11 + 13 = 11^{13} + 24$ is clearly divisible by 5.

The numbers 31, 331, 3331, and 33,331 are all primes. In fact, the pattern continues to yield primes until we reach 333,333,331, which is divisible by 17.

Another interesting pattern is

$3! - 2! + 1! = 5$
$4! - 3! + 2! - 1! = 19$
$5! - 4! + 3! - 2! + 1! = 101$
$6! - 5! + 4! - 3! + 2! - 1! = 619$
$7! - 6! + 5! - 4! + 3! - 2! + 1! = 4421$
$8! - 7! + 6! - 5! + 4! - 3! + 2! - 1! = 35{,}899$,

which yields the primes listed. Unfortunately, the next step gives 326,981, which is divisible by 79.

The following pattern gives primes for many steps:

41 + 2 = 43
43 + 4 = 47
47 + 6 = 53
53 + 8 = 61
61 + 10 = 71
71 + 12 = 83.
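Patterns of this kind are easy to test by machine. The following Python sketch uses a simple trial-division test to reproduce several of the patterns above and to report the first composite value in each. The helper names `is_prime` and `alternating_factorial` are illustrative choices and do not come from the article; trial division is used only because the numbers involved are small.

```python
from math import factorial, isqrt

def is_prime(n: int) -> bool:
    """Trial division; adequate for the small values used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def alternating_factorial(n: int) -> int:
    """n! - (n-1)! + (n-2)! - ... with the signs alternating down to 1!."""
    return sum((-1) ** (n - k) * factorial(k) for k in range(1, n + 1))

# p1^2 + (p1 + 1) * p2 over consecutive primes p1 < p2
primes = [2, 3, 5, 7, 11, 13, 17]
pattern1 = [p1 * p1 + (p1 + 1) * p2 for p1, p2 in zip(primes, primes[1:])]

# 31, 331, 3331, ...: k threes followed by a single 1
threes = [int("3" * k + "1") for k in range(1, 9)]

# Alternating factorial sums starting from 3! - 2! + 1!
alt_fact = [alternating_factorial(n) for n in range(3, 10)]

# x^2 + x + 41 for x = 0, 1, ..., 40
euler_poly = [x * x + x + 41 for x in range(41)]

for name, values in [("p1^2 + (p1+1)p2", pattern1),
                     ("33...31", threes),
                     ("alternating factorials", alt_fact),
                     ("x^2 + x + 41", euler_poly)]:
    first_bad = next(v for v in values if not is_prime(v))
    print(f"{name}: first composite value is {first_bad}")
```

Each list is built so that its last entry is the value at which the text says the pattern breaks, so the loop simply confirms that every earlier entry is prime.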
In fact, it will give primes for another 33 steps before the composite number $1681 = 41^2$ appears. The pattern is made up of numbers obtained from the formula $x^2 + x + 41$, which we have previously discussed.

Finally, consider the pattern of integers in which each succeeding row is obtained by inserting the number of the row, n, between each pair of integers from row n − 1 whose sum is n:

n    Row                           Number of terms
1    1 1                           2
2    1 2 1                         3
3    1 3 2 3 1                     5
4    1 4 3 2 3 4 1                 7
5    1 5 4 3 5 2 5 3 4 5 1         11
6    1 6 5 4 3 5 2 5 3 4 5 6 1     13

This pattern fails for row 10, which contains 33 terms. If we count the number of digits in each row, the tenth row contains 37 digits, a prime number. However, the 11th row contains 57 digits, a composite number.

A recent conjecture is due to Reo F. Fortune, who examined the pattern

2 + 1 = 3                                  5 − 2 = 3
2(3) + 1 = 7                               11 − 6 = 5
2(3)(5) + 1 = 31                           37 − 30 = 7
2(3)(5)(7) + 1 = 211                       223 − 210 = 13
2(3)(5)(7)(11) + 1 = 2311                  2333 − 2310 = 23
2(3)(5)(7)(11)(13) + 1 = 30,031            30,047 − 30,030 = 17

The first five sums in the left-hand column are primes, but 30,031 = 59(509). However, for each sum if we find the next larger prime and subtract from it the product of consecutive primes given in that row, the result is prime. Fortune’s conjecture is that this pattern always gives primes. Many feel that the conjecture is true but proving it appears to be a difficult task.

The question of whether there exist an infinite number of Mersenne primes has been unsolved for approximately 300 years, as has the companion question of the existence of an infinite number of even perfect numbers. To date, only 28 Mersenne primes are known, so the question is far from resolved.

Catalan’s Conjecture, due to Eugène Charles Catalan (1814–1894), states that 8 and 9 are the only positive consecutive integral powers of integers.

In general, this suggests that $x^m - y^n = 1$, where x and y are integers greater than 0 and m and n are integers greater than 1, has as its only solution x = 3, y = 2, m = 2, and n = 3. Since m and n vary, as well as x and y, the equation above is a Diophantine equation that is not in polynomial form.

In 1974, Robert Tijdeman proved that there exists a constant k with the property that all powers of integers which equal consecutive integers are less than k. Thus, we know that there can be at most a finite number of such pairs of consecutive powers, although the work of calculating k seems too formidable to allow a definite value to be determined at present.

In 1876, Catalan also examined the sequence $P_0 = 2$ and $P_{n+1} = 2^{P_n} - 1$. Thus,

$P_1 = 2^{P_0} - 1 = 2^2 - 1 = 3$
$P_2 = 2^{P_1} - 1 = 2^3 - 1 = 7$
$P_3 = 2^{P_2} - 1 = 2^7 - 1 = 127$
$P_4 = 2^{P_3} - 1 = 2^{127} - 1$.

He speculated that $P_n$ is prime for n = 1, 2, 3, and 4, all of which were subsequently verified. $P_5$ seems to be undecidable since it has approximately $10^{38}$ digits.

B. Fermat’s Last Theorem

One of the most famous of all problems in number theory, unsolved for over 350 years, goes by the misnomer of Fermat’s last theorem. In the margin of a copy of Diophantus’ “Arithmetic,” opposite a problem concerning writing a square as the sum of two squares, Fermat wrote that it is impossible to write a cube as the sum of two cubes, a fourth power as the sum of two fourth powers, and so forth. In other words, he claimed that the equation $x^n + y^n = z^n$ cannot be solved when n > 2.

He also claimed to “have discovered a truly marvelous demonstration of this proposition that this margin is too narrow to contain.” In view of Fermat’s wrong guess concerning the primality of all Fermat numbers, one must be skeptical of his claim, or at least wish he had had a supply of blank paper at hand.

Only one proof concerning number theory is known to be due to Fermat, that being found in another margin of the same book. This theorem showed that the area of a Pythagorean triangle having integral sides cannot be a square integer. This theorem leads to the proof of Fermat’s last theorem for the case n = 4; that is, $x^4 + y^4 = z^4$ has no solutions.

Fermat claimed to be able to prove the conjecture for n = 3, but published no proof. Euler gave such a proof nearly 100 years later, but it contained some faulty reasoning that fortunately could be corrected.

Work on the problem progressed slowly as other mathematicians proved the conjecture true for n = 5, n = 7, and other particular values of n. It can be shown that if the conjecture is true for some integer k, then it is true for any multiple of k. Hence, it suffices to consider only odd prime powers. Partial results included showing that if the theorem is false for some value of n > 2, then n must exceed 4,000,000. It was also shown that any solution led to
numbers that are inconceivably large, exceeding the capacity of even the most modern computers.

Furthermore, it was known that $x^p + y^p = z^p$ has no solution in integers that are relatively prime to p if p is an odd prime and q = 2p + 1 is also prime. In 1983 Gerd Faltings proved that the formula contained in Fermat’s last theorem has only finitely many rational solutions when n > 2. From time to time, several reputable mathematicians, as well as countless amateurs, gave purported proofs that Fermat’s last theorem was true for all n > 2. Unfortunately, errors were discovered in each of the proofs presented before 1993.

In June, 1993, Andrew J. Wiles of Princeton University gave three lectures at Cambridge University, which culminated in a proof of Fermat’s last theorem. Wiles had been interested in the problem since the age of 10 when he vowed to find a proof. His interest in the problem led him to choose mathematics as a career.

His teachers and professors advised him that he would be wasting his time pursuing such a difficult and uncertain goal. At Cambridge, his advisor guided him to the field of elliptic curves, which are curves having cubic equations and which can be used for calculating the perimeter of ellipses.

Wiles’ concern for the theorem never abated, even while he did research on elliptic curves. In 1986, his determination was strengthened by the work of other mathematicians who postulated and then proved a connection between Fermat’s last theorem and elliptic functions. He devoted himself completely to solving the problem and seven years later he was ready to present his proof to fellow mathematicians.

The reaction to his proof was astonishment and widespread acclaim. Verifying his proof was a slow process for two reasons: (1) his proof was 200 pages long and (2) it was estimated that elliptic curves were understood by approximately only one tenth of one percent of all professional mathematicians.

Minor errors in his proof were found and easily corrected. Then a catastrophe occurred; a flaw in the proof was discovered that could not be easily overcome. Wiles and his former student, Richard Taylor, worked tirelessly to save the proof. They finally decided to employ a different type of elliptic curve that would eliminate the error.

After 14 months Wiles began to believe that his proof would suffer the same fate as all previous ones, failure. Then, one of those events occurred that sound more like fiction than fact. They were close to giving up but on September 19, 1994 they found a way to eliminate the error and save the proof.

In June, 1997 Wiles was awarded the Wolfskehl Prize, which had been unclaimed for 89 years, for proving Fermat’s last theorem. Had he not studied elliptic curves at Cambridge, Wiles might not have been prepared to do the work which led to his successful proof. Thus, his advisor unwittingly helped Wiles achieve his childhood goal.

C. Goldbach’s Conjectures

Nearly 250 years ago, Christian Goldbach, in correspondence with Euler, posed two conjectures:

1. Every even number greater than or equal to 6 is the sum of two odd primes.
2. Every odd number greater than or equal to 9 is the sum of three odd primes.

The second conjecture is actually a consequence of the first, and in 1937 Vinogradoff proved that any odd number that is sufficiently large is the sum of three odd primes. How large the number has to be is not known. It has been established that if N is an even integer larger than $e^{e^{16,038}}$, then N is the sum of no more than four primes.

It is also known that every sufficiently large even integer is the sum of a prime and an integer having no more than two distinct prime factors. Complete resolution of Goldbach’s conjectures awaits future results.

The question of the existence of odd perfect numbers is still unresolved. Most mathematicians are inclined to believe that perfect numbers must be even, although an odd one may be detected someday by methods as yet unimagined. It is known that any such integer must exceed $10^{50}$ and must have at least eight prime factors.

Number theory is the oldest branch of mathematics and concerns the simplest set of numbers, the integers. Because some of its problems can be stated in easily understood terms, it probably has attracted more amateurs than any other branch of mathematics. Although many of its problems today are stated by means of abstract technical definitions not easily mastered by the lay person, the subject that has commanded widespread attention over the past 3000 years should remain a vital area of human learning for at least that far into the future.

SEE ALSO THE FOLLOWING ARTICLES

COMPUTER ALGORITHMS • LINEAR SYSTEMS OF EQUATIONS • NUMBER THEORY, ALGEBRAIC AND ANALYTIC • SET THEORY
P1: GPA/GJK Final Pages P2: GSS Qu: 00, 00, 00, 00
Encyclopedia of Physical Science and Technology EN011G-505 July 25, 2001 18:45

Numerical Analysis
John N. Shoosmith
NASA, Langley Research Center, retired

I. Numerical Analysis as a Subject Area

II. Finite-Precision Numerical Operations
III. Algebraic Equations in a Single Variable
IV. Systems of Algebraic Equations
V. Matrix Eigenproblems
VI. Numerical Representation of Functions
VII. Differentiation and Integration
VIII. Differential Equations
IX. Recent Developments

GLOSSARY

Algorithm A numerical algorithm is a precise, step-by-step description of the implementation of a numerical method.

Condition A problem is ill-conditioned if a small change in the data results in a large change in the solution. Conversely, it is well-conditioned if the solution is relatively insensitive to changes in the data. The condition number is a measure of condition. It is small for a well-conditioned problem and large otherwise.

Convergence A numerical method is said to converge if the solution generated by the method approaches a solution of the problem to which it is applied, either as an iteration count becomes large or as a discretization parameter approaches zero. The order of convergence is a measure of the rate of convergence. For example, in the case of an iteration method, the order of convergence is p if, after a sufficient number of cycles, the error after any cycle is comparable to the error of the previous cycle raised to the power p.

Data By input data for a problem, we mean numbers that are required to be provided in order for the solution process to start or continue. Output data are numbers generated by the solution process.

Error Difference between the approximation to a quantity and its true value. Relative error is the error divided by the true value. An error bound is a positive number that is known to be larger than the magnitude of an error.

Iteration Repeated process, where the input to each cycle is determined from the output of preceding cycles. The input to at least the first cycle is given in order to start the process. Criteria for stopping the iteration must be provided.

Method A numerical method is a procedure used to solve a mathematical problem through the use of numbers.

Parallel algorithm Algorithm designed to use more than one processor at the same time. The computations are organized into tasks that can be assigned to multiple processors.

Roundoff error Error introduced because of the limit to the precision (number of places) to which numbers can be represented in any finite computation.

Stability A numerical method is unstable if, when it is applied to a well-conditioned problem, a small change in the data results in a large change in the numerical solution. It is stable if the change in the solution remains small.

Truncation error Error introduced because of the limit to the number of numbers that can be used in any finite computation. For example, a variable may be represented exactly by an infinite mathematical series, but only the most significant terms can be retained. The discarded terms contribute to the truncation error in the computed solution.

Vector algorithm Algorithm designed to make efficient use of a vector processor. In a vector processor a single instruction causes the same operation to be performed on one or more sequences of numbers (vectors) in an assembly-line fashion.

NUMERICAL ANALYSIS is the study of the solution of mathematical problems through the manipulation of numbers. The solution processes are usually referred to as numerical methods, and the required sequences of numerical and logical operations, when precisely set down, are called algorithms. The solutions thus obtained are usually not exactly correct, but approximately so. In the context of science and engineering, the problems of concern are either the result of reducing the analysis of a physical situation to mathematical equations that cannot be simplified further by usual mathematical means, or the physical laws and constraints governing a system under study are “modeled” by mathematical equations, in which case the numerical solution can be said to simulate the behavior of the system. Although numerical methods have been used for centuries (Johannes Kepler used one to determine the orbit of Mars in 1607), the development of digital computers, with their tremendous capacity for carrying out arithmetic and logical operations on numbers, has made numerical methods practical in a great variety of scientific and technological applications. Today, almost all numerical methods are carried out on computers. From an understanding of number representations and manipulations, the numerical analyst must devise the methods and design the algorithms to solve a variety of problems. In doing so, he or she must be concerned with the source and propagation of errors, and whether or not the method “converges” to a close approximation of the solution. Also, in spite of the great speed of modern computers, the matters of convergence rate and computational efficiency are of paramount importance and may determine whether or not the method is practical. A recent development in computing is the use of collections of computers to solve individual problems (parallel processing), and this requires the consideration of problem partitioning and algorithm development to take advantage of the particular computer architectures involved. It is in this area that numerical analysis and computer science are particularly closely related.

I. NUMERICAL ANALYSIS AS A SUBJECT AREA

A. The Numerical Approach to Problem Solution

The means by which physical situations and processes are described, analyzed, designed, and simulated is through mathematics. Natural laws are stated in terms of mathematical equations, and the behavior of systems that obey those laws is described by their solutions.

Unfortunately, the mathematics of many of the processes we would like to study quickly becomes intractable when approached by conventional means. For example, analytical solutions to most nonlinear systems of equations simply cannot be found. The best that can be done, in the traditional sense, is to attempt a series expansion of the solution.

Today, there is another approach: the problem statement and variables of interest can be approximated numerically. Analysis and problem solution can then be performed through numerical computation with the aid of high-speed digital computers. To be sure, something is lost when it becomes necessary to resort to numerical methods, because characteristics of the solution that are immediately apparent from inspection of analytical expressions may be obscured in listings of numbers; however, a numerical solution is certainly better than no solution, and sometimes its nature can be revealed by repeating the process with small changes in the data. Also, computer-generated graphs or images can often provide a sufficiently accurate visual interpretation of the solution to aid in its understanding.

The numerical approach is illustrated for a very simple problem in Fig. 1. The problem is posed in physical terms in (a), and its mathematical formulation is given in (b). In this situation we know the solution, but in more complex cases, of course, we may not.

The next step is to select or develop a numerical method. Here, we have chosen the Euler method for the solution of initial value problems involving ordinary differential equations, again because of its simplicity. Numerical methods can be thought of as operators that accept numbers as input (in this case the initial velocity
$V_0$, the problem parameters D and M, and the discretization parameter h) and produce other numbers as output (the successive values of time and velocity).

The final stage is to produce an algorithm, a step-by-step implementation of the method. Algorithms are thought of rather like flow charts and are usually described in an unambiguous way by means of an algorithmic or even a computer-programming language. Algorithms are recipes that could conceivably be followed by a person with pencil and paper; however, it is usual to convert them to computer programs, which can then be executed on a suitable computer.
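As an illustration of this last stage, the algorithm might be written as the short Python program below. This is only a sketch: it assumes that the problem of Fig. 1 is the linear drag equation dV/dt = −(D/M)V with V(0) = V0, which is consistent with the Euler step and exact solution quoted in the following subsection. The function name `euler_velocity` and the sample data are assumptions made for the example and are not taken from the article.

```python
def euler_velocity(v0, d, m, h, n_steps):
    """Euler's method for dV/dt = -(D/M) V with V(0) = V0.

    Returns two lists: the successive times and the successive velocities.
    """
    times, velocities = [0.0], [v0]
    for i in range(1, n_steps + 1):
        v = velocities[-1]
        # One Euler step: new value = old value + h * (slope at old value)
        velocities.append(v + h * (-(d / m) * v))
        times.append(i * h)
    return times, velocities

# Sample data (assumed for illustration): V0 = 10, D = 2, M = 4, h = 0.1
t, v = euler_velocity(10.0, 2.0, 4.0, 0.1, 5)
for ti, vi in zip(t, v):
    print(f"t = {ti:.1f}  V = {vi:.4f}")
```

The inputs are exactly the numbers named in the text (the initial velocity, the parameters D and M, and the discretization parameter h), and the outputs are the successive values of time and velocity.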

B. Errors, Their Sources, and Propagation

We use the example of Fig. 1 to illustrate the source of
errors in the numerical solution of a problem.
If we carry out the first step of the Euler method for this example, using symbols $V_0$, D, and M for the data, and the symbol h for the time increment, we arrive at

$$V(h) \approx V_1 = V_0\,[1 - (D/M)h].$$

On the other hand, we know from the series expansion for $e^x$, carried out to two terms with remainder, that the solution at t = h is

$$V(h) = V_0 e^{-(D/M)h} = V_0\left[1 - \frac{D}{M}h + \left(\frac{D}{M}\right)^2 \frac{h^2}{2}\, e^{-(D/M)z}\right],$$

where z appearing in the final (remainder) term has a value between 0 and h. Comparing the approximate and true values of V(h), we see that the difference is

$$\varepsilon(h) = V_0 (D/M)^2 e^{-(D/M)z} h^2/2,$$

where $\varepsilon(h)$ is the truncation error committed by the Euler method in carrying out the first step. Since we know that $e^{-(D/M)x}$ can be no more than 1 for any non-negative value of x, we can say that $\varepsilon(h)$ is bounded by a constant, independent of h, times $h^2$; that is,

$$\varepsilon(h) < K h^2.$$

This tells us that if, for example, we halve the interval, the maximum possible truncation error for the first step will be divided by four.

We are normally interested in computing the solution out to some initially specified time, say $t_f$, which will require $n = t_f/h$ steps. The accumulated truncation error, which we will designate by $\varepsilon(h, t_f)$, is bounded by n times the bound for one step, thus

$$\varepsilon(h, t_f) < n K h^2 = t_f K h.$$
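These two bounds can be checked numerically. The Python sketch below applies the Euler step to dV/dt = −aV, where a stands for D/M, and compares the result with the exact solution for several step sizes. The sample values (a = 1, V0 = 1, tf = 1) and the function names are assumptions chosen for the illustration.

```python
import math

def euler_final(v0, a, h, n_steps):
    """Apply n_steps Euler steps to dV/dt = -a V, with a = D/M."""
    v = v0
    for _ in range(n_steps):
        v *= 1.0 - a * h          # V_{i+1} = V_i [1 - (D/M) h]
    return v

v0, a, tf = 1.0, 1.0, 1.0         # assumed sample values

def one_step_error(h):
    """Truncation error after a single step of size h."""
    return abs(v0 * math.exp(-a * h) - euler_final(v0, a, h, 1))

def accumulated_error(h):
    """Truncation error at time tf after n = tf / h steps."""
    n = round(tf / h)
    return abs(v0 * math.exp(-a * tf) - euler_final(v0, a, h, n))

for h in (0.1, 0.05, 0.025):
    print(f"h = {h:<6} one-step error = {one_step_error(h):.3e}  "
          f"error at tf = {accumulated_error(h):.3e}")
```

Each halving of h should reduce the one-step error by a factor of about four and the error at tf by a factor of about two, in line with the bounds $K h^2$ and $t_f K h$. Roundoff is negligible at these step sizes, so the differences printed are essentially pure truncation error.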
FIGURE 1 The numerical approach to problem solution. (a) Physical situation. (b) Mathematical formulation. (c) Numerical method. (d) Algorithm.

This brings us to the concept of convergence. We see that as the step size h is made smaller, the answer produced by the Euler method (in the absence of any arithmetic error) for the velocity at time $t_f$ becomes more accurate.
In fact, in the limit as h approaches zero, the answer is exact, since the error bound is then zero. Of course, it does not make sense to use a zero interval size, but the point is that we can make the error as small as we wish by selecting h sufficiently small. In this case, the truncation error in computing $V(t_f)$ is bounded in direct proportion to the interval size h, and we say that the method converges with order h.

Turning now to the computation of the solution as described by the algorithm, we first must specify values for the parameters of the problem and the initial data. Because, in any practical computing device, the number of digits allocated to a number is limited, it will probably be necessary to chop or round these numbers before they are stored. The error committed by doing this is called inherent roundoff. Also, during the course of the computation, arithmetic operations are performed that produce results with more digits than the operands, and these results must be chopped or rounded before they are stored. This error is called arithmetic roundoff. A particularly serious consequence of roundoff occurs when two numbers of nearly the same value are rounded before they are subtracted, since this can result in the loss of significant information.

It is possible to minimize the effect of roundoff error by judicious design of the algorithm. For example, the computation of the roots of a quadratic equation

$$ax^2 + bx + c = 0,$$

by the quadratic formula,

$$r = \left(-b \pm \sqrt{b^2 - 4ac}\right)/2a,$$

when b is large relative to a and c, can produce a poor approximation to the smallest in magnitude root. To avoid this, it is possible to use the mathematically equivalent formula,

$$r = 2c/\left(-b \mp \sqrt{b^2 - 4ac}\right)$$

for that root.

Another example is in the calculation of expressions of the type

$$\sum_{i=1}^{n} a_i b_i.$$

The results of each multiplication can be saved and added together in extended precision. (If the computer hardware cannot do this, then it can be done with the use of a special program, treating the lower and upper halves of the multiplication results as separate variables.) Then the accumulated sum may be rounded at the end of the calculation.

In order to assess the effect of an error as it is propagated through the computation, we consider an algorithm as a mapping F from a set of input numbers $X \equiv \{x_1, x_2, \ldots, x_n\}$ to a set of results $Y \equiv \{y_1, y_2, \ldots, y_m\}$ and write

$$y_i = F_i(x_1, x_2, \ldots, x_n), \qquad i = 1, 2, \ldots, m.$$

The error propagation formula is simply a first order difference expansion of this equation. Thus,

$$\Delta y_i \approx \frac{\partial F_i}{\partial x_1}\Delta x_1 + \frac{\partial F_i}{\partial x_2}\Delta x_2 + \cdots + \frac{\partial F_i}{\partial x_n}\Delta x_n, \qquad i = 1, 2, \ldots, m.$$

The effect on the ith output variable of an error in the jth input variable can be estimated by knowing the appropriate partial derivative. Of course, in most cases this will not be known directly; however, it may be possible to find an experimental estimate by observing the change in the output variables as the algorithm is executed with small, incremental changes in the input data, one variable at a time.

Arithmetic operations that introduce roundoff error are often treated as equivalent, exact operations with erroneous operands. By working backward, the effect of arithmetic roundoff can be reduced to an equivalent error in the input, which can then be translated to output error by the use of the error propagation formula.

C. Condition and Numerical Stability

In solving a problem numerically, it is inevitable that errors will be committed, both because of the use of a discrete method (truncation) and because of finite precision calculations (roundoff). There are two effects that can cause such errors to grow to serious proportions, one of which has to do with the problem itself, and the other with the method that is employed.

Some problems are inherently ill-conditioned, which means that a small change in the data results in a large change in the solution. A simple example is the system of two linear equations representing the intersection of two straight lines that are nearly parallel, such as

$$y = x + 1, \qquad y = 1.01x.$$

These equations have the solution (100, 101); however, if the second equation is changed to y = 1.001x, a change of less than 1% in the coefficient of x, the solution is then (1000, 1001), which is a 900% change in x. It is important that the numerical analyst be aware that a problem is ill-conditioned. It may be possible to reformulate it to avoid the ill conditioning, or if not, to use a higher-order method and increased precision arithmetic in order to reduce the introduction of numerical error as much as possible.
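Two of the effects described above can be reproduced in a few lines of Python. The sketch below first compares the quadratic formula with the rearranged formula for the root of smallest magnitude, using double-precision arithmetic and sample coefficients chosen so that b is large relative to a and c. It then solves the pair of nearly parallel lines for both values of the slope. The function names and the sample coefficients are assumptions made for this illustration.

```python
import math

def roots_naive(a, b, c):
    """Both roots from the quadratic formula as usually written."""
    s = math.sqrt(b * b - 4.0 * a * c)
    return (-b + s) / (2.0 * a), (-b - s) / (2.0 * a)

def roots_stable(a, b, c):
    """Avoid subtracting nearly equal numbers.

    The larger-magnitude root is formed from a sum of like-signed terms;
    the smaller one then follows from r = 2c / (-b -/+ sqrt(b^2 - 4ac)).
    Returns (smaller-magnitude root, larger-magnitude root).
    """
    s = math.sqrt(b * b - 4.0 * a * c)
    q = -b - s if b >= 0 else -b + s   # no cancellation in this sum
    return (2.0 * c) / q, q / (2.0 * a)

def intersect(slope):
    """Intersection of y = x + 1 with y = slope * x."""
    x = 1.0 / (slope - 1.0)
    return x, slope * x

a, b, c = 1.0, 1.0e8, 1.0   # assumed sample data; smallest root is close to -1e-8
print("naive small root :", roots_naive(a, b, c)[0])
print("stable small root:", roots_stable(a, b, c)[0])
print("intersection for y = 1.01x :", intersect(1.01))
print("intersection for y = 1.001x:", intersect(1.001))
```

In the first comparison the two formulas are algebraically identical, so any difference between the printed roots is due to roundoff alone. In the second, both intersections are computed accurately; the large change in the answer reflects the conditioning of the problem rather than any defect of the method.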
Numerical instability is a condition that is due to the method employed. A simple example is the solution of a differential equation problem

$$y' = f(x, y), \qquad y(0) = y_0$$

by the “midpoint method,”

$$y_{i+1} = y_{i-1} + 2h f(x_i, y_i), \qquad y_1 = y_0 + h f(x_0, y_0).$$

In this method, a small error can grow as x increases, even when the problem itself is well-conditioned, because of the existence of a parasitic solution. This is explained more fully in the section on differential equations. It is a challenge to the numerical analyst to recognize instability and to provide an alternate, stable method.

II. FINITE-PRECISION NUMERICAL OPERATIONS

A. Number Representation

In our standard, positional, decimal notation, a real number is represented as

$$\pm\, d_n d_{n-1} \cdots d_0 . d_{-1} d_{-2} \cdots,$$

where each d is a digit, that is, a symbol representing zero or one of the first nine natural numbers, and the subscript is an index specifying the position of the digit in the number. The ± indicates that the sign may be either + or −. The period after $d_0$ is called the decimal point, and it separates the integral part of the number (on the left) from the fractional part. The leading digit $d_n$ is nonzero except when the integral part is zero.

$$x = \pm\left(\sum_{i=0}^{n} b_i \times 2^i + \sum_{j=1}^{\infty} b_{-j} \times 2^{-j}\right).$$

In general, for the integer base B > 1, a real number is represented by

$$\pm\, g_n g_{n-1} \cdots g_0 . g_{-1} g_{-2} \cdots,$$

where the g’s are symbols for zero and the first B − 1 positive integers; its value is given by

$$x = \pm\left(\sum_{i=0}^{n} g_i \times B^i + \sum_{j=1}^{\infty} g_{-j} \times B^{-j}\right).$$

Commonly used bases, in addition to 10 and 2, are 8 (octal), which uses the symbols 0 through 7, and 16 (hexadecimal), which uses 0 through 9, then A, B, C, D, E, and F to represent the integer values from 10 to 15. These bases are useful because conversion to and from binary is easily accomplished through the grouping of bits three and four at a time, respectively. Table I gives representations of the first 16 positive integers for five different bases.

Arithmetic is carried out in any base system by using the same procedure as in decimal arithmetic, keeping in mind, however, that “carries” and “borrows” are dependent on the value of the base. Figure 2 gives examples of arithmetic in binary, octal, and hexadecimal.

Conversion from one base to another is accomplished by a slightly different procedure, depending on the base in which the arithmetic is to be performed. Examples are given, in Fig. 3, of conversion between octal and decimal. If the arithmetic is to be done in the base notation from
TABLE I The First Sixteen Natural Numbers Written in Some
The value of the number, which we will designate by Different Base Notations
x, is then given by
Decimal Binary Ternary Octal Hexadecimal
n ∝ (base 10) (base 2) (base 3) (base 8) (base 16)
x =± di × 10i + d− j × 10− j ,
i=0 j=1 1 1 1 1 1
2 10 2 2 2
the first sum being the value of the integral part and the
3 11 10 3 3
second being that of the fraction.
4 100 11 4 4
The number of distinct symbols used to represent a sin-
5 101 12 5 5
gle digit in the decimal system is 10 (probably because we
6 110 20 6 6
have 10 fingers), and we say that the base of the decimal
7 111 21 7 7
system is 10. Electronic devices depend on sensing either
8 1000 22 10 8
the presence or absence of a signal, just two possible states,
9 1001 100 11 9
which we can represent with 0 and 1. Thus, for electronic
10 1010 101 12 A
computers we need to consider the binary representation,
11 1011 102 13 B
which uses the base 2. The binary representation of a real
12 1100 110 14 C
number is
13 1101 111 15 D
±bn bn−1 · · · b0 .b−1 b−2 · · · , 14 1110 112 16 E
15 1111 120 17 F
where each b (bit) is either 0 or 1. The value of this number
16 10000 121 20 10
is

which we are converting, then the procedure is to separate the integer and fractional parts. The integer part is converted by successive divisions by the base to which we are converting, as illustrated in Fig. 3 in Example 1, while the fractional part is converted by successive multiplications as illustrated in Example 2. On the other hand, if the arithmetic is to be performed in the base notation to which we are converting, the procedure is to use the value formula
$$ x = \pm \left( \sum_{i=0}^{n} g_i B^{i} + \sum_{j=1}^{\infty} g_{-j} B^{-j} \right) $$
as illustrated in Example 3.

FIGURE 2 Arithmetic in different base notations.

B. Computer-Representable Numbers

The memory of a computer is composed of cells, called words, each consisting of a number of bits. Most computers use 8, 16, 32, or 64 bits per word. It is usual to store a single number in each word (although multiple and fractional precision modes are possible with some loss of processing speed). There are several ways in which to do this; however, if the word length is m bits, it is possible to store only $2^m$ different numbers in a given word. The bit positions in a word are numbered, starting with zero, from the least significant (which we can think of as the right end) bit to the most significant (left end) bit [Fig. 4(a)]. The most significant bit is usually reserved for the sign of the number, for example, 0 for a positive number and 1 for negative.

It is most efficient, in terms of the use of storage space and processing speed, to store numbers in binary form; however, it should be mentioned that it is possible to represent decimal numbers by allocating groups of four bits to each digit (binary-coded decimal). This is done in calculators and in applications where the input and output of data dominates computation [Fig. 4(b)].

In the “binary integer” mode, integers are stored, right-justified, in the computer word just as they are written in binary [Fig. 4(c)]. All of the integers from $-2^{m-1}$ to $+2^{m-1}$ can thus be represented in a word of length m bits. Counting and indexing operations use this mode.

For more general computation, modern computers use a mode referred to as binary floating point, in which numbers are represented in the form
$$ x = \pm M \times 2^{t}, $$
where M, the “significand,” is in the range 0 ≤ M < 2 and t, the “exponent,” is an integer. If M ≥ 1, the number is said to be “normalized”; otherwise it is “denormalized.” Thus the decimal number 358.625 (101100110.101 in binary) is represented in normalized binary floating point form as
$$ +1.40087890625 \times 2^{8}. $$
Expressed in binary, the significand is 1.01100110101 and the exponent is 1000.

A national standard for binary floating point arithmetic has been developed under the auspices of the Institute of Electrical and Electronics Engineers (IEEE Standard 754-1985). In this standard, a single format number is stored in a 32-bit computer word with the fractional part of the significand, the “fraction,” stored in binary in the least significant 23 bits. The sign is stored as 0 for a positive number and 1 for a negative number in the most significant bit. In order to accommodate negative exponents, the exponent is biased by the addition of 127 (1111111 in binary) and stored in binary in the 8 bits between the sign and fraction. Provided that the biased exponent is not 0 or 255, the number is assumed to be normalized and the integer part of the significand is implicitly taken as 1. The “precision” of a single format normalized binary floating point number is 24 bits. The single format representation of 358.625 is given in Fig. 4(d).

The number zero is stored as a 0 in every bit position, except possibly the most significant (sign). Positive zero

FIGURE 3 Conversion of base.
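The conversion procedures can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the number converted is 358.625 (used elsewhere in this section) rather than the examples of the original figure, which are not reproduced here. Exact rational arithmetic is assumed so that the digits are not disturbed by roundoff.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"

def integer_to_base(n, base):
    """Convert a non-negative integer by successive division by the target base."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)
        out.append(DIGITS[remainder])      # remainders are the digits, lowest first
    return "".join(reversed(out))

def fraction_to_base(frac, base, max_digits=20):
    """Convert a fraction 0 <= frac < 1 by successive multiplication by the target base."""
    frac = Fraction(frac)
    out = []
    while frac != 0 and len(out) < max_digits:
        frac *= base
        digit = frac.numerator // frac.denominator   # integer part is the next digit
        out.append(DIGITS[digit])
        frac -= digit
    return "".join(out)

def value_of(int_digits, frac_digits, base):
    """Evaluate the value formula: sum of g_i * B**i plus sum of g_-j * B**-j."""
    value = Fraction(0)
    for ch in int_digits:
        value = value * base + DIGITS.index(ch)
    for j, ch in enumerate(frac_digits, start=1):
        value += Fraction(DIGITS.index(ch), base ** j)
    return value

print(integer_to_base(358, 2), fraction_to_base(Fraction(5, 8), 2))   # 101100110 101
print(integer_to_base(358, 8), fraction_to_base(Fraction(5, 8), 8))   # 546 5
print(value_of("546", "5", 8))                                        # 2869/8, i.e. 358.625
```

The first two functions do their arithmetic in the base being converted from (here, decimal), while `value_of` does its arithmetic in the base being converted to.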

and negative zero are numerically equivalent but the distinction can be useful; for example, division of a positive number by negative zero produces negative infinity.

The smallest positive normalized single format number is $2^{-126}$. If any operation produces a positive number that is smaller than this, the biased exponent is set to zero and the significand is shifted to the right so that the number is no longer normalized and the precision is less than 24 bits. Similarly, the largest negative normalized number is $-2^{-126}$ and the production of a larger negative number results in a denormalized form. The generation of a denormalized number signals an underflow condition.

A biased exponent of 255 with a zero fraction indicates plus or minus infinity (e.g., the result of dividing a nonzero number by a signed zero). A biased exponent of 255 with a nonzero fraction is defined as a NaN (Not a Number). A NaN is produced by an invalid operation, such as dividing zero by zero. When the magnitude of the result of an operation is equal to or larger than $2^{128}$ (overflow condition), the default result is infinity with the appropriate sign.

The IEEE standard also defines a double format number that is stored in a 64-bit word. In the double format mode the fraction length is 52 bits, the bias is 1023, and the biased exponent length is 11 bits. The precision of a double format normalized number is 53 bits.

In order to reduce the effect of roundoff error, the floating point processing units of modern computers generally contain a number of registers that are longer than either the single or double format standard. Register to register computations take place in extended precision (also defined by the IEEE standard) and only when a result is to be stored in the computer’s memory does it get shortened to single or double precision.

C. Roundoff Error

Exact arithmetic operations typically produce results that contain a greater number of significant positions than the operands. For example, the sum of 8.5 and 9.2 is 17.7 (one more significant digit than the addends) or the product of 1.234 and 1.432 is 1.767088 (the fractional part has twice as many significant digits). In a binary floating point processing unit, the exact result of an arithmetic operation on single format operands may be contained in an extended precision register; however, if the result is to be stored in single format, some loss of precision is inevitable. The shortening of a number to a lower precision is accomplished by “rounding.”

There are two choices in rounding. The first is to merely discard the portion of the significand that follows the least significant bit of the shortened format; and the second is to add one (with appropriate carries to more significant positions) to the least significant bit after discarding the same portion. The default rounding mode in the IEEE standard is “round-to-nearest.” This means that the choice is made

FIGURE 4 Representation of numbers in a 32 bit computer word. (a) Bit numbering convention. (b) Binary-coded
decimal representation of 358. (c) Binary integer representation of +358 (+546 octal). (d) Single format floating point
binary representation of +358.625.
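The single format layout of panel (d) can be reproduced with a short Python sketch. It assumes only the standard `struct` module; the helper name `single_format_fields` is hypothetical.

```python
import struct

def single_format_fields(x):
    """Split the IEEE 754 single format encoding of x into (sign, biased exponent, fraction)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))   # 32-bit pattern as an integer
    sign = word >> 31                   # most significant bit
    biased_exponent = (word >> 23) & 0xFF   # next 8 bits
    fraction = word & 0x7FFFFF          # least significant 23 bits
    return sign, biased_exponent, fraction

sign, biased_exponent, fraction = single_format_fields(358.625)
print(sign, format(biased_exponent, "08b"), format(fraction, "023b"))
# 0 10000111 01100110101000000000000

# Restore the implicit leading 1 and remove the bias of 127.
significand = 1 + fraction / 2**23
print(significand, biased_exponent - 127)
# 1.40087890625 8
```

The biased exponent 10000111 is 135 = 8 + 127, and the stored fraction is the binary significand 1.01100110101 with its leading 1 removed and padded with zeros to 23 bits.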

so that the rounded number is nearest to the infinitely precise result. If both choices are equally near, then the even shortened significand is chosen.

The IEEE standard also provides for user specified “directed” rounding modes in order to reproduce results that may have been obtained on nonstandard implementations. Round-towards-zero (also referred to as chopping) always uses the first choice. Round-towards-positive-infinity uses the first choice if the number is negative and the second if it is positive. Round-towards-negative-infinity uses the first choice if the number is positive and the second if it is negative.

A bound for the relative error (magnitude of the maximum possible error divided by the exact number) is $2^{-p}$ for round-to-nearest and $2^{-p+1}$ for directed rounding, where p is the precision in bits (e.g., p = 24 for single format). This bound is referred to as the machine unit and designated by µ.

The generation of error in sequences of floating point operations is determined by backward error analysis. In this approach, each rounded result is supposed to be obtained by an exact calculation with numbers that are initially in error. It is possible to work backward in this manner to establish a bound on the error in the initial data that would lead to the same final result with exact arithmetic. An estimate of the accumulated error can then be obtained from the error propagation formula. Bounds for the error that have been obtained in this manner for some common sequences of calculations are shown in Table II.

The rate of convergence of iterative methods can be strongly affected by roundoff error; thus, the use of multiple extended precision registers in floating point units, in which sequences such as those shown in Table II can be executed prior to rounding to single or double format, can significantly improve the performance of the computer.

TABLE II Bounds for Errors in Some Common Floating Point Calculations

F(x_1, x_2, ..., x_n)         Absolute error bound (a)
$\sum_{i=1}^{n} x_i$          $\sum_{i=1}^{n} |(n + 1 - i)\, x_i| \,(1.06\mu)$
$\prod_{i=1}^{n} x_i$         $(n - 1) \prod_{i=1}^{n} |x_i| \,(1.06\mu)$
$\sum_{i=1}^{n} x_i y_i$      $\sum_{i=1}^{n} |(n + 2 - i)\, x_i y_i| \,(1.06\mu)$
$\sum_{i=1}^{n} a_i x^{i}$    $\sum_{i=1}^{n} |(2i + 1)\, a_i x^{i}| \,(1.06\mu)$

(a) Note: µ is the machine unit; $2^{1-p}$ for chopping, $2^{-p}$ for rounding, where p is the significand length.
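The relative error bound for round-to-nearest in single format (p = 24) can be checked numerically. The sketch below is an illustration under two assumptions: the double precision inputs are treated as the exact numbers, and the conversion performed by Python's `struct` module rounds to nearest, which is the IEEE default. The sample values are arbitrary.

```python
import struct
from fractions import Fraction

P = 24                          # precision of the single format, in bits
MU = Fraction(1, 2**P)          # machine unit for round-to-nearest

def round_to_single(x):
    """Round a double to the nearest single format number, returned as a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def relative_rounding_error(x):
    """Exact relative error committed when x is shortened to single format."""
    exact = Fraction(x)
    rounded = Fraction(round_to_single(x))
    return abs(rounded - exact) / abs(exact)

samples = [0.1, 1 / 3, 358.625, 2.718281828459045, 1e-3]
for x in samples:
    err = relative_rounding_error(x)
    print(x, float(err), err <= MU)
```

The value 358.625 needs only 12 significant bits, so it is stored without error; each of the other samples is in error by no more than µ.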

III. ALGEBRAIC EQUATIONS IN A SINGLE VARIABLE

A. Background

By an algebraic equation in a single independent variable x, we will mean an equation that can be put in the form
$$ f(x) = 0, $$
where f is a single-valued function of x, containing no derivatives nor integrals with respect to x. Examples are
$$ x - \sin x = 10 $$
$$ x^{3} = 2 $$
$$ e^{x} + \ln x - 3 = 0. $$
For purposes of this discussion, we will assume that f is continuous and differentiable in x. Clearly, by adding x to both sides, we have the equivalent equation
$$ F(x) = f(x) + x = x. $$
A commonly occurring problem is to solve the equation in one of these two forms, by which we mean to find a value of x for which the equation is true. In the first case a solution is called a root of f(x), and in the second it is called a fixed point of F(x).

In the case where f(x) is linear, that is, of the form
$$ f(x) = ax + b, \qquad a \neq 0, $$
a solution exists, is unique, and can be determined immediately for given values of a and b. If f(x) is not linear, the situation is a great deal more complicated, because there may not be a solution or there may be many solutions. In order to get some understanding of the nature of the problem, it is usually advantageous to construct a graph of f(x) versus x and observe if and where this graph crosses the x axis, or alternatively to graph F(x) and observe if and where it crosses the line y = x. This is illustrated in Fig. 5.

FIGURE 5 The graphical determination of a root or fixed point.

Another complication for a nonlinear equation is that (with certain specific exceptions) it is not possible to obtain a solution in a finite number of operations. The best that can be done is to approximate a solution by making an initial numerical estimate and then to methodically refine it by obtaining a succession of (hopefully, ever closer) estimates, until there is little change between successive values. This process is referred to as iteration.

To make this more precise, suppose a solution is $x^*$, the initial estimate is $x_0$, and successive iterations produce the infinite sequence
$$ x_1, x_2, x_3, \ldots, x_i, \ldots; $$
then the absolute error of the ith iterate is $|x_i - x^*|$. The iteration is said to converge to $x^*$ if
$$ \lim_{i \to \infty} |x_i - x^*| = 0. $$
If f(x) is continuous and differentiable, a sufficient condition for convergence is that there exists a constant C such that
$$ \frac{|x_{i+1} - x^*|}{|x_i - x^*|} \le C < 1 $$
for all i greater than some threshold number. If this inequality holds, the convergence is said to be of order 1, or linear. More generally, if it can be shown that there is a number p ≥ 1 and a constant C such that
$$ \frac{|x_{i+1} - x^*|}{|x_i - x^*|^{p}} \le C < 1 $$
for all i greater than some threshold, then the iteration converges with order p. In particular, if p = 2, the convergence is said to be quadratic. The higher the order of convergence, the fewer the number of iterations that should be required to home in on the solution; however, whether or not a particular iteration will converge and how rapidly it will do so depend on a number of factors, such as how close the initial estimate is to the desired solution and how close solutions are to one another. The situation of multiple, identical solutions can be particularly troublesome.

As a practical matter, a test for convergence is usually made at the end of each iteration. If
$$ \frac{|x_{i+1} - x_i|}{|x_i|} < \varepsilon_1 $$

for a prescribed $\varepsilon_1 > 0$, usually of the order of the machine unit µ, the iteration is stopped and the solution is taken as $x_{i+1}$. In the event that $x_i$ is close to zero, however, the test must be modified to the absolute form, and the iteration is stopped when
$$ |x_{i+1} - x_i| < \varepsilon_2 , $$
where, again, $\varepsilon_2$ is a sufficiently small positive number. Finally, in the root-finding case, a termination test can be in the form
$$ |f(x_{i+1})| < \varepsilon_3 . $$

B. Iteration Methods

The simplest possible iteration method is the ancient method of repeated substitution, which can most readily be applied to an algebraic equation in the (fixed-point) form
$$ x = F(x). $$
Starting with an initial estimate of $x_0$, the iteration formula is
$$ x_{i+1} = F(x_i). $$
It can be shown that the method converges to $x^*$ if
$$ |F'(x)| < 1 $$
for $(x^* - |x_0 - x^*|) \le x \le (x^* + |x_0 - x^*|)$; the order of convergence is linear if $F'(x^*) \neq 0$.

Bracketing methods are applied to equations in the (root) form,
$$ f(x) = 0. $$
They rely on first finding two values of x, say a and b, such that f(a) and f(b) have opposite sign. Thus, if f(x) is continuous, a root lies somewhere between a and b. Two methods for successively narrowing the interval containing the root are the bisection method and the method of false position. These methods are illustrated in Fig. 6. Their advantage is that once an interval containing a root has been found, convergence is guaranteed. On the negative side, it is not always easy to find two values that bracket a root, and the order of convergence of both of these methods is only linear. In most cases the method of false position converges more rapidly than the method of bisection; however, there are situations where it is much slower. A common approach is to start with the method of false position and then to revert to the method of bisection if convergence is not achieved after a prescribed number of iterations.

Extrapolation methods use information at previous iterations to extrapolate to a new estimate of a root. Newton’s method uses the value and derivative of f(x) at $x = x_i$ to extrapolate linearly to $x_{i+1}$; the secant method uses the latest two values of f(x) to extrapolate linearly; and Muller’s method uses the latest three values to perform a quadratic extrapolation. The order of convergence of these methods (to simple roots) is 2, 1.62, and 1.84, respectively, but the major disadvantage is that initial estimates must be relatively close in order to achieve convergence. Also, Newton’s method requires the computation of the function and its derivative at each step. Newton’s method and the secant method are illustrated in Fig. 7.

C. Roots of Polynomials

The particular case where f(x) is a polynomial of degree n, that is, to find the roots of
$$ p_n(x) = a_0 x^{n} + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_n ; \qquad a_0 \neq 0 $$
is a frequently occurring problem. Any of the methods of the previous section can, of course, be used for this case; however, it is sometimes possible to take advantage of the special properties of polynomials.

To start with, p(x) can be evaluated for x = α by the following recurrence formula:
$$ b_0 = a_0 $$
$$ b_i = \alpha b_{i-1} + a_i \qquad \text{for } i = 1, 2, \ldots, n. $$
This is called the method of synthetic division, because the numbers $b_0, b_1, \ldots, b_{n-1}$ are the coefficients of the polynomial of degree n − 1 that is the result of dividing p(x) by the binomial (x − α). Then $p(\alpha) = b_n$ is the remainder for this division. Also, once a root $x_1$ has been found, that is, $p(x_1) = 0$, then the b’s of the final iteration define the reduced polynomial which has for its roots the remaining n − 1 roots of p(x), so that additional roots can be sought, starting from the reduced polynomial.

The term p'(x) can be evaluated for x = α by repeating the synthetic division process for the first n − 1 b’s; thus,
$$ c_0 = b_0 $$
$$ c_i = \alpha c_{i-1} + b_i \qquad \text{for } i = 1, 2, \ldots, n - 1, $$
and
$$ p'(\alpha) = c_{n-1}. $$
For this reason, Newton’s method is easy to apply.

A disadvantage of the methods considered so far, except Muller’s method, is that they cannot directly find the complex roots that are usually needed in the case of polynomial equations. Muller’s method and Cauchy’s method, which bears the same relationship to Muller’s method as Newton’s method does to the secant method [that is, it

FIGURE 6 Bracketing method. (a) Bisection method. (b) False position method.
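The two bracketing methods of panels (a) and (b) can be sketched in Python as follows. This is a minimal illustration, not a production root finder; the function names, tolerances, and iteration limits are assumptions, and the test equation x³ = 2 is one of the examples given in the background section.

```python
def bisection(f, a, b, tol=1e-12, max_iter=200):
    """Halve the bracketing interval [a, b] until it is shorter than 2 * tol."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        fm = f(mid)
        if fm == 0 or 0.5 * (b - a) < tol:
            return mid
        if fa * fm < 0:          # root lies in the left half
            b, fb = mid, fm
        else:                    # root lies in the right half
            a, fa = mid, fm
    return 0.5 * (a + b)

def false_position(f, a, b, tol=1e-12, max_iter=200):
    """Replace one end of the bracket by the point where the chord crosses the x axis."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    x = a
    for _ in range(max_iter):
        x = b - fb * (b - a) / (fb - fa)
        fx = f(x)
        if abs(fx) < tol:        # termination test of the form |f(x)| < eps
            return x
        if fa * fx < 0:
            b, fb = x, fx
        else:
            a, fa = x, fx
    return x

def f(x):
    return x**3 - 2              # root form of the example equation x^3 = 2

print(bisection(f, 1.0, 2.0))
print(false_position(f, 1.0, 2.0))
```

Both calls start from the bracket [1, 2], where f changes sign, and both approach the cube root of 2.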

uses the first and second derivative of f(x) at an estimate of a root to extrapolate quadratically to the next estimate], are capable of finding complex roots and are thus more useful for the polynomial case.

A method designed for polynomial equations is due to Lin and Bairstow (not described in detail here because of its complexity). It uses synthetic division of p(x) by a quadratic and iterates to reduce the linear remainder to zero. The quadratic factor thus determined may have complex conjugate roots, and the reduced polynomial has degree 2 less than p(x).

The properties of polynomials can also be invoked to determine approximate locations of roots, which are often necessary in order to improve the chance of convergence of the methods discussed so far. Particularly useful in this regard is the theory of Sturm sequences, which can be used to determine the number of real roots in a prescribed interval.

Finally, for every polynomial, a matrix can be found that has the same characteristic polynomial; thus, the problem of finding roots of a polynomial can be expressed as the problem of finding the eigenvalues of a matrix. Methods exist for finding all or selected numbers of eigenvalues of a matrix that do not depend for convergence on close estimates to start with, and are, therefore, often preferred over the methods discussed in this section. Matrix eigenvalue methods are discussed in Section V.

IV. SYSTEMS OF ALGEBRAIC EQUATIONS

A. Systems of Linear Equations

If $x_1, x_2, \ldots, x_n$ represent n variables, and $a_1, a_2, \ldots, a_n$ are given numbers (called coefficients), then the expression
$$ a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n $$

FIGURE 7 Extrapolation methods. (a) Newton’s method. (b) Secant method.
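The two extrapolation methods of this figure can be sketched in Python. In this illustration Newton's method is applied to a polynomial, with p and p' evaluated by the synthetic division recurrences for the b's and c's; the function names, the tolerance, and the starting values are assumptions, and the test polynomial is x³ − 2. The stopping rule is the relative test on successive iterates.

```python
def synthetic_division(coeffs, alpha):
    """Divide the polynomial with coefficients a_0..a_n by (x - alpha).

    Returns (quotient coefficients b_0..b_{n-1}, remainder b_n = p(alpha)).
    """
    b = [coeffs[0]]
    for a_i in coeffs[1:]:
        b.append(alpha * b[-1] + a_i)
    return b[:-1], b[-1]

def newton_polynomial(coeffs, x0, eps=1e-12, max_iter=50):
    """Newton's method: extrapolate linearly using p and p' at the current estimate."""
    x = x0
    for _ in range(max_iter):
        quotient, p = synthetic_division(coeffs, x)
        _, dp = synthetic_division(quotient, x)     # p'(x) = c_{n-1}
        x_new = x - p / dp
        if abs(x_new - x) < eps * abs(x):
            return x_new
        x = x_new
    raise RuntimeError("Newton iteration did not converge")

def secant(f, x0, x1, eps=1e-12, max_iter=100):
    """Secant method: extrapolate linearly through the latest two function values."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < eps * abs(x1):
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("secant iteration did not converge")

coeffs = [1.0, 0.0, 0.0, -2.0]                      # p(x) = x^3 - 2
print(newton_polynomial(coeffs, 1.0))
print(secant(lambda x: x**3 - 2, 1.0, 2.0))
```

Neither function guards against a zero derivative or equal function values, so both rely on starting estimates that are reasonably close to the root.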

is called a linear combination of the x’s. The equation
$$ a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n = b $$
is a linear equation. A system of m linear equations is written in the form
$$
\begin{aligned}
a_{1,1} x_1 + a_{1,2} x_2 + a_{1,3} x_3 + \cdots + a_{1,n} x_n &= b_1 \\
a_{2,1} x_1 + a_{2,2} x_2 + a_{2,3} x_3 + \cdots + a_{2,n} x_n &= b_2 \\
a_{3,1} x_1 + a_{3,2} x_2 + a_{3,3} x_3 + \cdots + a_{3,n} x_n &= b_3 \\
&\;\;\vdots \\
a_{m,1} x_1 + a_{m,2} x_2 + a_{m,3} x_3 + \cdots + a_{m,n} x_n &= b_m ,
\end{aligned}
$$
where the subscript of the b’s and the first subscript of the a’s designate the equation, while the subscript of the x’s and the second subscript of the a’s designate the variable. We define the m × n matrix A of coefficients as the m-row by n-column array
$$
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,n} \\
\vdots & & & & \vdots \\
a_{m,1} & a_{m,2} & a_{m,3} & \cdots & a_{m,n}
\end{pmatrix}
$$
and we define the n-vector x and the m-vector b as the columnar arrays
$$
x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix},
\qquad
b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}.
$$
The system of linear equations can now be written concisely as
$$ Ax = b, $$
where the multiplication of A times x (the order of these two factors is important) is defined to form an m-vector with the ith component being the inner product of the ith row of A with the vector x
$$ a_{i,1} x_1 + a_{i,2} x_2 + a_{i,3} x_3 + \cdots + a_{i,n} x_n . $$
The characteristics of a linear system are determined by the properties of the matrix A and are addressed in the subject of linear algebra. Except where we discuss overdetermined systems at the end of the next section, we will assume that A is a square matrix (m = n, the order of A), and that there is a unique solution, that is, a unique set of values for the variables $x_1, x_2, \ldots, x_n$ that satisfies the system of equations. Under these circumstances the solution can always be obtained by direct methods in a finite number of arithmetic operations. When n is very large or when A is sparse (has a large proportion of zero elements), it is sometimes more efficient to use an iterative method, following the same general approach as discussed in the previous section on the solution of algebraic equations in a single variable.

B. Direct Methods

If A happens to be lower triangular, that is, has the structure
$$
A = \begin{pmatrix}
a_{1,1} & 0 & 0 & \cdots & 0 \\
a_{2,1} & a_{2,2} & 0 & \cdots & 0 \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & 0 \\
\vdots & & & & \vdots \\
a_{n,1} & a_{n,2} & a_{n,3} & \cdots & a_{n,n}
\end{pmatrix}
$$
then it is straightforward to obtain a solution by the process of forward substitution
$$ x_1 = b_1 / a_{1,1} $$
$$ x_i = \left( b_i - \sum_{j=1}^{i-1} a_{i,j} x_j \right) \Big/ a_{i,i} , \qquad i = 2, 3, \ldots, n. $$
Similarly, if A is upper triangular, which can be visualized by reversing the order of the rows in the above matrix, then the solution can be obtained by back substitution
$$ x_n = b_n / a_{n,n} $$
$$ x_i = \left( b_i - \sum_{j=i+1}^{n} a_{i,j} x_j \right) \Big/ a_{i,i} , \qquad i = n - 1, n - 2, \ldots, 1. $$
(Note that under our assumption that there is a unique solution, it can be shown that the diagonal elements must be nonzero so that the division can always be carried out.) The number of additions and multiplications required for either process is of the order of $\tfrac{1}{2} n^{2}$, and the number of divisions is n.

The objective of direct methods for an arbitrary nonsingular matrix is to transform A into a triangular matrix. This may be done by a process that is equivalent to premultiplying A by a sequence of n − 1 matrices. The product of two matrices is a matrix, each column of which is the product of the first matrix and the corresponding column of the second. Let A be designated as $A^{(0)}$; then the sequence of multiplications can be written recursively as
$$ A^{(k)} = M^{(k)} A^{(k-1)} , \qquad k = 1, 2, 3, \ldots, n - 1, $$
where $M^{(k)}$ is an n × n matrix with its diagonal elements equal to 1 and with 0 in every other position except, possibly, below the diagonal in the kth column, where the elements are
$$ m^{(k)}_{i,k} = -a^{(k-1)}_{i,k} \Big/ a^{(k-1)}_{k,k} . $$
Now $A^{(k)}$ has zeros below the diagonal through the kth column, and the final result $A^{(n-1)}$ is an upper-triangular matrix that we will denote by U. Now let L be the lower-triangular matrix that has diagonal elements equal to 1 and
$$ l_{i,j} = -m^{(j)}_{i,j} $$
for all other elements with i > j. Multiplying L by the same sequence of matrices $M^{(k)}$, it is found that the result is the identity matrix. Thus L is the inverse of the product of the M matrices, and we have established that
$$ A = LU. $$
Finding L and U is referred to as LU factorization. An example for a given 4 × 4 matrix is shown in Fig. 8.

The solution of the linear system Ax = LUx = b can now be found by solving the two triangular systems,
$$ Ly = b, \qquad Ux = y. $$
The number of additions and multiplications required for computing the elements of L and U is of the order of $\tfrac{1}{3} n^{3}$; thus for large n there is considerable savings in

FIGURE 8 LU factorization of a 4 × 4 matrix.
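LU factorization followed by the two triangular solves can be sketched in plain Python. This is an illustrative sketch that includes partial pivoting; the function names are hypothetical, and the 3 × 3 system used below is made up for the demonstration (it is not the 4 × 4 matrix of the figure, which is not reproduced here).

```python
def lu_factor(a):
    """LU factorization with partial pivoting.

    Returns (L, U, perm) such that L times U equals the rows of A taken in the order perm.
    """
    n = len(a)
    u = [[float(v) for v in row] for row in a]
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n - 1):
        # Find the largest (in magnitude) element in column k, on or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]          # carry along multipliers already stored
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]      # l_ik is the negative of m_ik
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    if u[n - 1][n - 1] == 0.0:
        raise ValueError("matrix is singular")
    for i in range(n):
        l[i][i] = 1.0
    return l, u, perm

def forward_substitution(l, b):
    """Solve L y = b for lower-triangular L."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(l[i][j] * y[j] for j in range(i))) / l[i][i]
    return y

def back_substitution(u, y):
    """Solve U x = y for upper-triangular U."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x

def solve(a, b):
    """Solve A x = b by factoring once, then doing the two triangular solves."""
    l, u, perm = lu_factor(a)
    permuted_b = [b[i] for i in perm]        # reorder b to match the row exchanges
    return back_substitution(u, forward_substitution(l, permuted_b))

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
b = [7, 19, 49]                              # chosen so that the solution is (1, 2, 3)
print(solve(A, b))
```

For several right-hand sides, `lu_factor` would be called once and only the two substitutions repeated for each new b.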

performing the LU factorization once, then solving the triangular systems only for each new b for which the solution may be needed.

In solving an arbitrary n × n system it can occur that, at the kth stage, the divisor $a^{(k-1)}_{k,k}$ is zero, in which case the solution can proceed no further. Also, if not identically zero, it may be very small, which can cause relatively large arithmetic roundoff errors to be introduced. To avoid these difficulties the technique of partial pivoting is employed, in which at each stage the element of largest magnitude in the column headed by the pivotal element, $a^{(k-1)}_{k,k}$, is found, and the entire row in which this element occurs is exchanged with the kth row. Partial pivoting does not change the solution since it merely changes the order in which the equations of the system appear; however, the order of elements in the vector b must be adjusted accordingly.

Pivoting is not necessary when the matrix A is diagonally dominant: $|a_{i,i}| \ge \sum_{j \neq i} |a_{i,j}|$ for all i, or when it is symmetric and positive definite, meaning that for any n-vector x that is not identically zero, the quadratic form $x^{T} A x$ is positive. If A is symmetric and pivoting is not necessary, a symmetric version of LU factorization is possible in which only the elements of $A^{(k)}$ above the diagonal are computed and the elements of the $M^{(k)}$ are not saved, thereby reducing the storage required by half and the number of operations to the order of $\tfrac{1}{6} n^{3}$. In this case, after the matrix U has been computed, the solution is found by solving
$$ U^{T} y = b, \qquad U x = D y, $$
where D is the diagonal of U. This is often referred to as the (root-free) Cholesky method.

Systems that have a banded structure (the matrix A has zero elements wherever the difference between the indices i and j is greater than a fixed number less than n − 1) retain their band width during LU factorization if pivoting is not required; thus advantages in storage and computation can be realized when working with them. Matrices that are sparse within a band of some nonzero elements, however, suffer from “zerofill,” which means that no advantage can be taken of sparsity with direct methods, except that by judicious exchanging of equations and order of variables, the bandwidth or profile of the matrix can sometimes be reduced, thereby reducing the number of operations required.

Before considering the error in the solution to a system of equations (either linear or nonlinear), we need to have a means to measure a vector or matrix. Just in measuring

a scalar quantity by its magnitude, we measure a vector

where µ is the machine unit. In practice, it has bee

