All rights reserved. Printed in the United States of America. No part of this book may be
reproduced, stored, or transmitted in any manner without the written permission of the
publisher. For information, write to the Society for Industrial and Applied Mathematics,
3600 University City Science Center, Philadelphia, PA 19104-2688.
Dedicated to
Alan M. Turing
and
James H. Wilkinson
Contents
Preface xxi
3 Basics 67
3.1 Inner and Outer Products . . . . . . . . . . . . . . . . . . . . 68
3.2 The Purpose of Rounding Error Analysis . . . . . . . . . . . . 71
3.3 Running Error Analysis . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Notation for Error Analysis . . . . . . . . . . . . . . . . . . . 73
3.5 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 Complex Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 78
3.7 Miscellany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.8 Error Analysis Demystified . . . . . . . . . . . . . . . . . . . . 82
3.9 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.10 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 84
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4 Summation 87
4.1 Summation Methods . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Compensated Summation . . . . . . . . . . . . . . . . . . . . . 92
4.4 Other Summation Methods . . . . . . . . . . . . . . . . . . . . 97
4.5 Statistical Estimates of Accuracy . . . . . . . . . . . . . . . . 98
4.6 Choice of Method . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 100
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 Polynomials 103
5.1 Horner's Method . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Evaluating Derivatives . . . . . . . . . . . . . . . . . . . . . . 106
5.3 The Newton Form and Polynomial Interpolation . . . . . . . . 109
6 Norms 117
6.1 Vector Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 The Matrix p-Norm . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 126
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
18 QR Factorization 361
18.1 Householder Transformations . . . . . . . . . . . . . . . . . . . 362
18.2 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 363
18.3 Error Analysis of Householder Computations . . . . . . . . . . 364
18.4 Aggregated Householder Transformations . . . . . . . . . . . . 370
18.5 Givens Rotations . . . . . . . . . . . . . . . . . . . . . . . . . 371
18.6 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . 375
18.7 Gram-Schmidt Orthogonalization . . . . . . . . . . . . . . . . 376
18.8 Sensitivity of the QR Factorization . . . . . . . . . . . . . . . 381
18.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 383
18.9.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . 386
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
B Singular Value Decomposition, M-Matrices 579
B.1 Singular Value Decomposition . . . . . . . . . . . . . . . . . . 580
B.2 M-Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
Bibliography 595
14.1 Underestimation ratio for Algorithm 14.4 for 5×5 matrix A(θ). 297
24.1 The possible steps in one iteration of the MDS method when
n=2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
It has been 30 years since the publication of Wilkinson’s books Rounding Er-
rors in Algebraic Processes [1088, 1963] and The Algebraic Eigenvalue Prob-
lem [1089, 1965]. These books provided the first thorough analysis of the
effects of rounding errors on numerical algorithms, and they rapidly became
highly influential classics in numerical analysis. Although a number of more
recent books have included analysis of rounding errors, none has treated the
subject in the same depth as Wilkinson.
This book gives a thorough, up-to-date treatment of the behaviour of
numerical algorithms in finite precision arithmetic. It combines algorithmic
derivations, perturbation theory, and rounding error analysis. Software prac-
ticalities are emphasized throughout, with particular reference to LAPACK.
The best available error bounds, some of them new, are presented in a unified
format with a minimum of jargon. Historical perspective is given to pro-
vide insight into the development of the subject, and further information is
provided in the many quotations. Perturbation theory is treated in detail,
because of its central role in revealing problem sensitivity and providing error
bounds. The book is unique in that algorithmic derivations and motivation
are given succinctly, and implementation details minimized, so that atten-
tion can be concentrated on accuracy and stability results. The book was
designed to be a comprehensive reference and contains extensive citations to
the research literature.
Although the book’s main audience is specialists in numerical analysis, it
will be of use to all computational scientists and engineers who are concerned
about the accuracy of their results. Much of the book can be understood with
only a basic grounding in numerical analysis and linear algebra.
The first two chapters are very general. Chapter 1 describes fundamental
concepts of finite precision arithmetic, giving many examples for illustration
and dispelling some misconceptions. Chapter 2 gives a thorough treatment of
floating point arithmetic and may well be the single most useful chapter in the
book. In addition to describing models of floating point arithmetic and the
IEEE standard, it explains how to exploit “low-level” features not represented
in the models and contains a large set of informative exercises.
In the rest of the book the focus is, inevitably, on numerical linear algebra,
because it is in this area that rounding errors are most influential and have been most extensively studied.
The Notes and References are an integral part of each chapter. In addi-
tion to containing references, historical information, and further details, they
include material not covered elsewhere in the chapter, and should always be
consulted, in conjunction with the index, to obtain the complete picture.
I have included relatively few numerical examples except in the first chap-
ter. There are two reasons. One is to reduce the length of the book. The
second reason is that today it is so easy for the reader to perform experi-
ments in MATLAB* or some other interactive system. To this end I have made
*MATLAB is a registered trademark of The Math Works, Inc.
available the Test Matrix Toolbox, which contains MATLAB M-files for many
of the algorithms and special matrices described in the book; see Appendix E.
This book has been designed to be as easy to use as possible. There are
thorough name and subject indexes, page headings show chapter and section
titles and numbers, and there is extensive cross-referencing. I have adopted
the unusual policy of giving with (nearly) every citation not only its numerical
location in the bibliography but also the names of the authors and the year of
publication. This provides as much information as possible in a citation and
reduces the need for the reader to turn to the bibliography.
A database acc-stab-num-alg.bib containing all the references
in the bibliography is available over the Internet from the bibnet project
(which can be accessed via netlib, described in §C.2).
Special care has been taken to minimize the number of typographical and
other errors, but no doubt, some remain. I will be happy to receive notification
of errors, as well as comments and suggestions for improvement.
Acknowledgements
Mary Rose Muccie (copy editing and indexing), Colleen Robishaw (design),
and Sam Young (production).
Research leading to this book has been supported by grants from the
Engineering and Physical Sciences Research Council, by a Nuffield Science
Research Fellowship from the Nuffield Foundation, and by a NATO Collabo-
rative Research Grant held with J. W. Demmel. I was fortunate to be able
to make extensive use of the libraries of the University of Manchester, the
University of Dundee, Stanford University, and the University of California,
Berkeley.
This book was typeset in LaTeX using the book document style. The
references were prepared in BibTeX and the index with MakeIndex. It is dif-
ficult to imagine how I could have written the book without these wonderful
tools. I used the "big" software from the emTeX distribution, running on a
486DX workstation. I used text editors The Semware Editor (Semware Cor-
poration) and GNU Emacs (Free Software Foundation) and checked spelling
with PC-Write (Quicksoft).
Chapter 1
Principles of Finite Precision
Computation
Since none of the numbers which we take out from logarithmic and
trigonometric tables admit of absolute precision,
but are all to a certain extent approximate only,
the results of all calculations performed
by the aid of these numbers can only be approximately true . . .
It may happen, that in special cases the
effect of the errors of the tables is so augmented that
we may be obliged to reject a method,
otherwise the best, and substitute another in its place.
-CARL FRIEDRICH GAUSS¹, Theoria Motus (1809)

¹Cited in Goldstine [461, 1977, p. 258].
This book is concerned with the effects of finite precision arithmetic on nu-
merical algorithms, particularly those in numerical linear algebra. Central
to any understanding of high-level algorithms is an appreciation of the basic
concepts of finite precision arithmetic. This opening chapter briskly imparts
the necessary background material. Various examples are used for illustra-
tion, some of them familiar (such as the quadratic equation) but several less
well known. Common misconceptions and myths exposed during the chapter
are highlighted towards the end, in §1.19.
This chapter has few prerequisites and few assumptions are made about
the nature of the finite precision arithmetic (for example, the base, number
of digits, or mode of rounding, or even whether it is floating point arith-
metic). The second chapter deals in detail with the specifics of floating point
arithmetic.
A word of warning: some of the examples from §1.12 onward are special
ones chosen to illustrate particular phenomena. You may never see in practice
the extremes of behaviour shown here. Let the examples show you what
can happen, but do not let them destroy your confidence in finite precision
arithmetic!
variable are also described using the colon notation: “i = 1:n” means the
same as “i = 1,2, . . . , n” .
Evaluation of an expression in floating point arithmetic is denoted fl(·),
and we assume that the basic arithmetic operations op = +, -, *, / satisfy

    fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u.   (1.1)

Here, u is the unit roundoff (or machine precision), which is typically of order
10^-8 or 10^-16 in single and double precision computer arithmetic, respectively,
and between 10^-10 and 10^-12 on pocket calculators. For more on floating
point arithmetic see Chapter 2.
Computed quantities (and, in this chapter only, arbitrary approximations)
wear a hat. Thus x̂ denotes the computed approximation to x.
Definitions are often (but not always) indicated by “:=” or “=:”, with the
colon next to the object being defined.
We make use of the floor and ceiling functions: ⌊x⌋ is the largest integer
less than or equal to x, and ⌈x⌉ is the smallest integer greater than or equal
to x.
The normal distribution with mean µ and variance σ² is denoted by
N(µ, σ²).
x = 0.9949,   x̂ = 0.9951.
According to the definition x̂ does not have two correct significant digits
but does have one and three correct significant digits!
or, if the data is itself the solution to another problem, it may be the result
of errors in an earlier computation. The effects of errors in the data are
generally easier to understand than the effects of rounding errors committed
during a computation, because data errors can be analysed using perturbation
theory for the problem at hand, while intermediate rounding errors require
an analysis specific to the given method. This book contains perturbation
theory for most of the problems considered, for example, in Chapters 7 (linear
systems), 19 (the least squares problem), and 20 (underdetermined systems).
Analysing truncation errors, or discretization errors, is one of the ma-
jor tasks of the numerical analyst. Many standard numerical methods (for
example, the trapezium rule for quadrature, Euler’s method for differential
equations, and Newton’s method for nonlinear equations) can be derived by
taking finitely many terms of a Taylor series. The terms omitted constitute
the truncation error, and for many methods the size of this error depends
on a parameter (often called h, “the stepsize”) whose appropriate value is a
compromise between obtaining a small error and a fast computation.
Because the emphasis of this book is on finite precision computation, with
virtually no mention of truncation errors, it would be easy for the reader to
gain the impression that the study of numerical methods is dominated by the
study of rounding errors. This is certainly not the case. Trefethen explains it
well when he discusses how to define numerical analysis [1016, 1992]:
A definition of correct significant digits that does not suffer from the latter
anomaly states that x̂ agrees with x to p significant digits if |x - x̂| is less than
half a unit in the pth significant digit of x. However, this definition implies
that 0.123 and 0.127 agree to two significant digits, whereas many people
would say that they agree to less than two significant digits.
In summary, while the number of correct significant digits provides a useful
way in which to think about the accuracy of an approximation, the relative
error is a more precise measure (and is base independent). Whenever we give
an approximate answer to a problem we should aim to state an estimate or
bound for the relative error.
When x and x̂ are vectors the relative error is most often defined with
a norm, as ||x - x̂||/||x||. For the commonly used norms, such as ||x||_∞ := max_i |x_i|,
the inequality ||x - x̂||/||x|| ≤ ½ × 10^-p
implies that components x̂_i with |x̂_i| ≈ ||x|| have about p correct significant
decimal digits, but for the smaller components the inequality merely bounds
the absolute error.
    A relative error that puts the individual relative errors on an equal footing
is the componentwise relative error

    max_i |x_i - x̂_i| / |x_i|,

which is widely used in error analysis and perturbation analysis (see Chapter 7,
for example).
As an interesting aside we mention the “tablemaker’s dilemma”. Suppose
you are tabulating the values of a transcendental function such as the sine
function and a particular entry is evaluated as 0.124|500000000 correct to a
few digits in the last place shown, where the vertical bar follows the final
significant digit to be tabulated. Should the final significant digit be 4 or
5? The answer depends on whether there is a nonzero trailing digit and, in
principle, we may never be able to answer the question by computing only a
finite number of digits.
Figure 1.1. Backward and forward errors for y = f(x). Solid line = exact; dotted
line = computed.
given problem, and not for each method). We discuss perturbation theory in
the next section.
A method for computing y = f(x) is called backward stable if, for any x,
it produces a computed ŷ with a small backward error, that is, ŷ = f(x + ∆x)
for some small ∆x. The definition of "small" will be context dependent. In
general, a given problem has several methods of solution, some of which are
backward stable and some not.
As an example, assumption (1.1) says that the computed result of the
operation x ± y is the exact result for perturbed data x(1 + δ) and y(1 + δ)
with |δ| ≤ u; thus addition and subtraction are, by assumption, backward
stable operations.
Most routines for computing the cosine function do not satisfy ŷ = cos(x +
∆x) with a relatively small ∆x, but only the weaker relation ŷ + ∆y = cos(x +
∆x), with relatively small ∆y and ∆x. A result of the form

    ŷ + ∆y = f(x + ∆x),   |∆y| ≤ ε|y|,   |∆x| ≤ η|x|,   (1.2)
Figure 1.2. Mixed forward-backward error for y = f(x). Solid line = exact; dotted
line = computed.
1.6. Conditioning
The relationship between forward and backward error for a problem is gov-
erned by the conditioning of the problem, that is, the sensitivity of the solution
to perturbations in the data. Continuing the y = f(x) example of the pre-
vious section, let an approximate solution ŷ satisfy ŷ = f(x + ∆x). Then,
assuming for simplicity that f is twice continuously differentiable,

    ŷ - y = f(x + ∆x) - f(x) = f′(x)∆x + (f″(x + θ∆x)/2!)(∆x)²,   θ ∈ (0, 1),

and we can bound or estimate the right-hand side. This expansion leads to
the notion of condition number. Since

    (ŷ - y)/y = (x f′(x)/f(x)) (∆x/x) + O((∆x)²),

the quantity

    c(x) = |x f′(x)/f(x)|

measures, for small ∆x, the relative change in the output for a given relative
change in the input, and it is called the (relative) condition number of f. If
x or f is a vector then the condition number is defined in a similar way using
norms, and it measures the maximum relative change, which is attained for
some, but not all, vectors ∆x.
    As an example, consider the function f(x) = log x. The condition number
is c(x) = |1/log x|, which is large for x ≈ 1. This means that a small relative
change in x can produce a much larger relative change in log x for x ≈ 1. The
    forward error ≲ condition number × backward error,

with approximate equality possible. One way to interpret this rule of thumb
is to say that the computed solution to an ill-conditioned problem can have a
large forward error. For even if the computed solution has a small backward
error, this error can be amplified by a factor as large as the condition number
when passing to the forward error.
One further definition is useful. If a method produces answers with for-
ward errors of similar magnitude to those produced by a backward stable
method, then it is called forward stable. Such a method need not be back-
ward stable itself. Backward stability implies forward stability, but not vice
versa. An example of a method that is forward stable but not backward stable
is Cramer’s rule for solving a 2 × 2 linear system, which is discussed in §1.10.1.
1.7. Cancellation
Cancellation is what happens when two nearly equal numbers are subtracted.
It is often, but not always, a bad thing. Consider the function f(x) = (1 - cos x)/x².
With x = 1.2 × 10^-5 the value of cos x rounded to 10 significant
figures is

    c = 0.9999 9999 99,

so that

    1 - c = 0.0000 0000 01.

Then (1 - c)/x² = 10^-10/(1.44 × 10^-10) = 0.6944 . . . , which is clearly wrong
given the fact that 0 ≤ f(x) < 1/2 for all x ≠ 0. A 10 significant figure
approximation to cos x is therefore not sufficient to yield a value of f(x) with
even one correct figure. The problem is that 1 - c has only 1 significant
figure. The subtraction 1 - c is exact, but this subtraction produces a result
of the same size as the error in c. In other words, the subtraction elevates the
importance of the earlier error. In this particular example it is easy to rewrite
f(x) to avoid the cancellation. Since cos x = 1 - 2 sin²(x/2),

    f(x) = (1/2) (sin(x/2) / (x/2))².
    To gain more insight into the cancellation phenomenon consider the sub-
traction (in exact arithmetic) x̂ = â - b̂, where â = a(1 + ∆a) and b̂ = b(1 + ∆b).
The terms ∆a and ∆b are relative errors or uncertainties in the data, perhaps
attributable to previous computations. With x = a - b we have

    |x - x̂| / |x| = |-a∆a + b∆b| / |a - b| ≤ max(|∆a|, |∆b|) (|a| + |b|) / |a - b|.

The relative error bound for x̂ is large when |a - b| << |a| + |b|, that is,
when there is heavy cancellation in the subtraction. This analysis shows that
subtractive cancellation causes relative errors or uncertainties already present
in a and b to be magnified. In other words, subtractive cancellation brings
earlier errors into prominence.
It is important to realize that cancellation is not always a bad thing. There
are several reasons. First, the numbers being subtracted may be error free,
as when they are from initial data that is known exactly. The computation
of divided differences, for example, involves many subtractions, but half of
them involve the initial data and are harmless for suitable orderings of the
points (see §5.3 and §21.3). The second reason is that cancellation may be
a symptom of intrinsic ill conditioning of a problem, and may therefore be
unavoidable. Third, the effect of cancellation depends on the role that the
result plays in the remaining computation. For example, if x >> y ≈ z > 0
then the cancellation in the evaluation of x + (y - z) is harmless.
(1.3)
The sample variance of n numbers x_1, . . . , x_n is defined as

    s_n² = (1/(n - 1)) Σ_{i=1}^n (x_i - x̄)²,   (1.4)

where x̄ = (1/n) Σ_{i=1}^n x_i is the sample mean.
Computing s_n² from this formula requires two passes through the data, one
to compute x̄ and the other to accumulate the sum of squares. A two-pass
computation is undesirable for large data sets or when the sample variance
is to be computed as the data is generated. An alternative formula, found
in many statistics textbooks, uses about the same number of operations but
requires only one pass through the data:

    s_n² = (1/(n - 1)) [ Σ_{i=1}^n x_i² - (1/n) (Σ_{i=1}^n x_i)² ].   (1.5)
This formula is very poor in the presence of rounding errors because it com-
putes the sample variance as the difference of two positive numbers, and
therefore can suffer severe cancellation that leaves the computed answer dom-
inated by roundoff. In fact, the computed answer can be negative, an event
aptly described by Chan, Golub, and LeVeque [194, 1983] as “a blessing in
disguise since this at least alerts the programmer that disastrous cancella-
tion has occurred”. In contrast, the original formula (1.4) always yields a
very accurate (and nonnegative) answer, unless n is large (see Problem 1.10).
Surprisingly, current calculators from more than one manufacturer (but not
Hewlett-Packard) appear to use the one-pass formula, and they list it in their
manuals.
    As an example, if x = [10000, 10001, 10002]^T then, in single precision
arithmetic (u ≈ 6 × 10^-8), the sample variance is computed as 1.0 by the
two-pass formula (relative error 0) but 0.0 by the one-pass formula (relative
error 1). It might be argued that this data should be shifted by some estimate
of the mean before applying the one-pass formula (which
does not change the variance), but a good estimate is not always available and there
are alternative one-pass algorithms that will always produce an acceptably
accurate answer. For example, instead of accumulating Σ_i x_i and Σ_i x_i² we
can accumulate

    M_k = M_{k-1} + (x_k - M_{k-1})/k,                      (1.6a)
    Q_k = Q_{k-1} + (k - 1)(x_k - M_{k-1})²/k,   k = 2:n,   (1.6b)

with M_1 = x_1 and Q_1 = 0, after which s_n² = Q_n/(n - 1). Note that the only subtractions in these recur-
rences are relatively harmless ones that involve the data xi . For the numerical
example above, (1.6) produces the exact answer. The updating formulae (1.6)
are numerically stable, though their error bound is not as small as the one
for the two-pass formula (it is proportional to the condition number KN in
Problem 1.7).
The problem of computing the sample variance illustrates well how mathe-
matically equivalent formulae can have different numerical stability properties.
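The contrast between the formulas is easy to reproduce. The following MATLAB sketch (not from the book; variable names are illustrative) applies the two-pass formula (1.4), the one-pass formula (1.5), and the updating recurrence (1.6) to the data of the example above in single precision arithmetic.

% Sample variance in single precision: two-pass formula (1.4), one-pass
% "textbook" formula (1.5), and the updating recurrence (1.6).
x = single([10000 10001 10002]);
n = length(x);

xbar = sum(x)/n;                               % two-pass formula (1.4)
v_twopass = sum((x - xbar).^2)/(n - 1);

v_onepass = (sum(x.^2) - sum(x)^2/n)/(n - 1);  % one-pass formula (1.5)

M = x(1); Q = single(0);                       % updating formulae (1.6)
for k = 2:n
    Q = Q + (k - 1)*(x(k) - M)^2/k;            % Q_k uses M_{k-1}, so update Q first
    M = M + (x(k) - M)/k;
end
v_update = Q/(n - 1);

fprintf('two-pass: %g   one-pass: %g   updating: %g   (exact: 1)\n', ...
        v_twopass, v_onepass, v_update)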
Lemma 1.1. With the notation above, and for the 2-norm,
The scaled residual for GEPP is pleasantly small, of order the unit round-
off. That for Cramer’s rule is ten orders of magnitude larger, showing that the
computed solution from Cramer’s rule does not closely satisfy the equations,
or, equivalently, does not solve a nearby system. The solutions themselves are
similar, both being accurate to three significant figures in each component but
incorrect in the fourth significant figure. This is the accuracy we would expect
from GEPP because of the rule of thumb "forward error ≲ backward error ×
condition number”. That Cramer’s rule is as accurate as GEPP in this ex-
ample, despite its large residual, is perhaps surprising, but it is explained by
the fact that Cramer’s rule is forward stable for n = 2; see Problem 1.9. For
general n, the accuracy and stability of Cramer’s rule depend on the method
used to evaluate the determinants, and satisfactory bounds are not known
even for the case where the determinants are evaluated by GEPP.
The small residual produced by GEPP in this example is typical: error
analysis shows that GEPP is guaranteed to produce a relative residual of
order u when n = 2 (see §9.2). To see how remarkable a property this is,
consider the rounded version of the exact solution: z = fl(x) = x + ∆x,
where ||∆x||₂ ≤ u||x||₂. The residual of z satisfies ||b - Az||₂ = ||-A∆x||₂ ≤
u||A||₂||x||₂ ≈ u||A||₂||z||₂. Thus the computed solution from GEPP has about
as small a residual as the rounded exact solution, irrespective of its accuracy.
Expressed another way, the errors in GEPP are highly correlated so as to
produce a small residual. To emphasize this point, the vector [1.0006,2.0012],
which agrees with the exact solution of the above problem to five significant
figures (and therefore is more accurate than the solution produced by GEPP),
has a relative residual of order 10^-6.
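The following MATLAB sketch illustrates the same point on an ill-conditioned 2×2 system (the data here is illustrative and is not the system used in the example above): GEPP, via MATLAB's backslash, yields a tiny scaled residual, while Cramer's rule yields a much larger one even though its forward error is comparable.

% Scaled residuals of GEPP (backslash) and Cramer's rule for an
% ill-conditioned 2x2 system (illustrative data).
A = [1 1; 1 1 + 1e-10];               % kappa_2(A) is roughly 4e10
x = [1; 2];
b = A*x;

x_ge = A\b;                           % Gaussian elimination with partial pivoting
d = A(1,1)*A(2,2) - A(2,1)*A(1,2);    % Cramer's rule
x_cr = [ (b(1)*A(2,2) - b(2)*A(1,2))/d
         (A(1,1)*b(2) - A(2,1)*b(1))/d ];

res = @(y) norm(b - A*y)/(norm(A)*norm(y));
fprintf('GEPP residual: %.1e   Cramer residual: %.1e\n', res(x_ge), res(x_cr))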
Table 1.1. Computed approximations to e = exp(1) from (1 + 1/n)^n in single
precision, and the absolute errors.

      n       (1 + 1/n)^n    error
    10^1      2.593743       1.25 × 10^-1
    10^2      2.704811       1.35 × 10^-2
    10^3      2.717051       1.23 × 10^-3
    10^4      2.718597       3.15 × 10^-4
    10^5      2.721962       3.68 × 10^-3
    10^6      2.595227       1.23 × 10^-1
    10^7      3.293968       5.76 × 10^-1
Since the first electronic computers were developed in the 1940s, comments
along the following lines have often been made: “The enormous speed of
current machines means that in a typical problem many millions of floating
point operations are performed. This in turn means that rounding errors can
potentially accumulate in a disastrous way.” This sentiment is true, but mis-
leading. Most often, instability is caused not by the accumulation of millions
of rounding errors, but by the insidious growth of just a few rounding errors.
    As an example, let us approximate e = exp(1) by taking finite n in the
definition e := lim_{n→∞} (1 + 1/n)^n. Table 1.1 gives results computed in For-
tran 90 in single precision (u ≈ 6 × 10^-8).
The approximations are poor, degrading as n approaches the reciprocal
of the machine precision. For n a power of 10, 1/n has a nonterminating
binary expansion. When 1 + 1/n is formed for n a large power of 10, only
a few significant digits from 1/n are retained in the sum. The subsequent
exponentiation to the power n, even if done exactly, must produce an inaccu-
rate approximation to e (indeed, doing the exponentiation in double precision
does not change any of the numbers shown in Table 1.1). Therefore a single
rounding error is responsible for the poor results in Table 1.1.
There is a way to compute (1 + 1/n)^n more accurately, using only single
precision arithmetic; it is the subject of Problem 1.5.
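A few lines of MATLAB reproduce the phenomenon (this is a sketch, not the book's Fortran 90 program): the single rounding error committed when 1 + 1/n is formed in single precision spoils the result, even when the exponentiation itself is carried out in double precision.

% Approximating e by (1 + 1/n)^n: the damage is done by the single precision
% rounding of 1 + 1/n, not by an accumulation of many errors.
for k = 1:7
    n = 10^k;
    x = single(1) + single(1)/single(n);   % one rounding error in single precision
    approx = double(x)^n;                  % exponentiation carried out in double
    fprintf('n = 10^%d   (1+1/n)^n = %.6f   error = %.2e\n', ...
            k, approx, abs(approx - exp(1)))
end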
Strassen’s method for fast matrix multiplication provides another exam-
ple of the unpredictable relation between the number of arithmetic operations
and the error. If we evaluate fl(AB) by Strassen’s method, for n×n matrices
A and B, and we look at the error as a function of the recursion threshold
n_0 ≤ n, we find that while the number of operations decreases as n_0 decreases
from n to 8, the error typically increases; see §22.2.2.
for i = 1:60
x = sqrt(x)
end
for i = 1:60
x = x^2
end
Since the computation involves no subtractions and all the intermediate num-
bers lie between 1 and x, we might expect it to return an accurate approxi-
mation to x in floating point arithmetic.
On the HP 48G calculator, starting with x = 100 the algorithm produces
x = 1.0. In fact, for any x, the calculator computes, in place of f(x) = x, the
function
which rounds to 1, since the HP 48G works to about 12 decimal digits. Thus
for x > 1, the repeated square roots reduce x to 1.0, which the squarings leave
unchanged.
For 0 < x < 1 we have
This upper bound rounds to the 12 significant digit number 0.99. . .9. Hence
after the 60 square roots we have on the calculator a number x < 0.99. . .9.
The 60 squarings are represented by s(x) = x^(2^60), and
the computed sum does not change. In Fortran 90 in single precision this
yields the value 1.6447 2532, which is first attained at k = 4096. This agrees
with the exact infinite sum to just four significant digits out of a possible nine.
The explanation for the poor accuracy is that we are summing the numbers
from largest to smallest, and the small numbers are unable to contribute to
the sum. For k = 4096 we are forming s + 4096^-2 = s + 2^-24, where s ≈ 1.6.
Single precision corresponds to a 24-bit mantissa, so the term we are adding
to s “drops off the end” of the computer word, as do all successive terms.
The simplest cure for this inaccuracy is to sum in the opposite order: from
smallest to largest. Unfortunately, this requires knowledge of how many terms
to take before the summation begins. With 10^9 terms we obtain the computed
sum 1.6449 3406, which is correct to eight significant digits.
For much more on summation, see Chapter 4.
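The experiment is easily repeated. The sketch below is an illustration, not the book's code: N is taken smaller than the 10^9 terms used above, and an explicit loop is used so that the summation order is not left to a library routine.

% Recursive summation of 1/k^2 in single precision, largest term first and
% smallest term first. The exact infinite sum is pi^2/6 = 1.64493406...
N = 10^6;
s_fwd = single(0);
for k = 1:N
    s_fwd = s_fwd + single(1/k^2);     % largest to smallest: stagnates early
end
s_bwd = single(0);
for k = N:-1:1
    s_bwd = s_bwd + single(1/k^2);     % smallest to largest: much more accurate
end
fprintf('forward: %.8f   backward: %.8f   pi^2/6: %.8f\n', s_fwd, s_bwd, pi^2/6)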
not monotonic. The reason for the pronounced oscillating behaviour of the
relative error (but not the residual) for the inverse Hilbert matrix is not clear.
    An example in which increasing the precision by several bits does not
improve the accuracy is the evaluation of

    y = x + a sin(bx),

for particular constants a and b.
Figure 1.4 plots t versus the absolute error, for precisions u = 2^-t, t = 10:40.
Since a sin(bx) ≈ -8.55 × 10^-9, for t less than about 20 the error is dominated
by the error in representing x = 1/7. For 22 < t < 31 the accuracy is (exactly)
constant! The plateau over the range 22 < t < 31 is caused by a fortuitous
rounding error in the addition: in the binary representation of the exact
answer the 23rd to 32nd digits are 1s, and in the range of t of interest the
final rounding produces a number with a 1 in the 22nd bit and zeros beyond,
yielding an unexpectedly small error that affects only bits 33 onwards.
A more contrived example in which increasing the precision has no bene-
ficial effect on the accuracy is the following evaluation of z = f(x):
y = abs(3(x - 0.5) - 0.5)/25
if y = 0
z = 1
else
z = e^y  % Store to inhibit extended precision evaluation.
z = (z - 1)/y
end
% Algorithm 1.
if x = 0
f = 1
else
f = (e^x - 1)/x
end

% Algorithm 2.
y = e^x
if y = 1
f = 1
else
f = (y - 1)/log y
end
At first sight this algorithm seems perverse, since it evaluates both exp and
log instead of just exp. Some results computed in MATLAB are shown in
Table 1.2. All the results for Algorithm 2 are correct in all the significant
figures shown, except for x = 10-15, when the last digit should be 1. On the
other hand, Algorithm 1 returns answers that become less and less accurate
as x decreases.
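Both algorithms are easily tried in MATLAB. The sketch below (not from the book) evaluates them over the range of x in Table 1.2; MATLAB's expm1(x)/x is used as a reference value.

% Algorithms 1 and 2 for f(x) = (e^x - 1)/x at small x.
for x = 10.^(-(5:16))
    f1 = (exp(x) - 1)/x;                          % Algorithm 1
    y = exp(x);                                   % Algorithm 2
    if y == 1, f2 = 1; else, f2 = (y - 1)/log(y); end
    fprintf('x = 1e-%02d   alg1 = %.15f   alg2 = %.15f   ref = %.15f\n', ...
            round(-log10(x)), f1, f2, expm1(x)/x)
end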
    To gain insight we look at the numbers in a particular computation with
x = 9 × 10^-8 and u = 2^-24 ≈ 6 × 10^-8, for which the correct answer is
1.00000005 to the significant digits shown. For Algorithm 1 we obtain a
completely inaccurate result, as expected:
Table 1.2. Computed values of (e^x - 1)/x from Algorithms 1 and 2.

      x        Algorithm 1                  Algorithm 2
    10^-5      1.000005000006965            1.000005000016667
    10^-6      1.000000499962184            1.000000500000167
    10^-7      1.000000049433680            1.000000050000002
    10^-8      9.999999939225290 × 10^-1    1.000000005000000
    10^-9      1.000000082740371            1.000000000500000
    10^-10     1.000000082740371            1.000000000050000
    10^-11     1.000000082740371            1.000000000005000
    10^-12     1.000088900582341            1.000000000000500
    10^-13     9.992007221626408 × 10^-1    1.000000000000050
    10^-14     9.992007221626408 × 10^-1    1.000000000000005
    10^-15     1.110223024625156            1.000000000000000
    10^-16     0                            1.000000000000000
Here are the quantities that would be obtained by Algorithm 2 in exact arith-
metic (correct to the significant digits shown):
(1.9)

⁷The analysis from this point on assumes the use of a guard digit in subtraction (see
§2.4); without a guard digit Algorithm 2 is not highly accurate.
For small x (y ≈ 1),
From (1.9) it follows that f̂ approximates f with relative error at most about
3.5u.
    The details of the analysis obscure the crucial property that ensures its
success. For small x, neither ŷ - 1 nor log ŷ agrees with its exact arithmetic
counterpart to high accuracy. But (ŷ - 1)/log ŷ is an extremely good approx-
imation to (y - 1)/log y near y = 1, because the function g(y) = (y - 1)/log y
varies so slowly there (g has a removable singularity at 1 and g′(1) = 1/2). In
other words, the errors in ŷ - 1 and log ŷ almost completely cancel.
1.14.2. QR Factorization
Any matrix A ∈ ℝ^{m×n}, m ≥ n, has a QR factorization A = QR, where
Q ∈ ℝ^{m×n} has orthonormal columns and R ∈ ℝ^{n×n} is upper trapezoidal (r_ij = 0
for i > j). One way of computing the QR factorization is to premultiply A by
a sequence of Givens rotations: orthogonal matrices G that differ from the
identity matrix only in a 2×2 principal submatrix, which has the form

    [  c   s ]
    [ -s   c ],     c = cos θ,  s = sin θ.
Figure 1.5. Relative errors ||A_k - Â_k||₂/||A||₂ for Givens QR factorization. The
dotted line is the unit roundoff level.
Because y is at the roundoff level, the computed ŷ is the result of severe sub-
tractive cancellation and so is dominated by rounding errors. Consequently,
the computed Givens rotations, whose purpose is to zero the
vector ŷ and which are determined by ratios involving the elements of ŷ, bear
little relation to their exact counterparts, causing Â_k to differ greatly from
A_k for k = 11, 12, . . . .
    To shed further light on this behaviour, we note that the Givens QR fac-
torization is perfectly backward stable; that is, the computed R̂ is the exact
R factor of A + ∆A, where ||∆A||₂ ≤ cu||A||₂, with c a modest constant de-
pending on the dimensions (Theorem 18.9). By invoking a perturbation result
for the QR factorization (namely (18.27)) we conclude that the error in R̂ is
bounded by a multiple of κ₂(A)u. Our example is constructed so that κ₂(A) is
small (≈ 24), so we know a priori that the graph in Figure 1.5 must eventually
dip down to the unit roundoff level.
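For concreteness, here is a minimal MATLAB sketch of QR factorization by Givens rotations (an illustration only; it is not the code used to produce Figure 1.5, and no attempt is made at the careful construction of c and s discussed in Chapter 18).

function [Q, R] = givens_qr(A)
% Reduce A to upper trapezoidal form R by Givens rotations, so that A = Q*R
% with Q orthogonal. Each rotation zeros one subdiagonal element.
[m, n] = size(A);
Q = eye(m); R = A;
for j = 1:n
    for i = m:-1:j+1
        r = hypot(R(i-1,j), R(i,j));
        if r ~= 0
            c = R(i-1,j)/r;  s = R(i,j)/r;
            G = [c s; -s c];                    % 2x2 rotation: G*[u; v] = [r; 0]
            R([i-1 i], :) = G*R([i-1 i], :);    % annihilate R(i,j)
            Q(:, [i-1 i]) = Q(:, [i-1 i])*G';   % accumulate the orthogonal factor
        end
    end
end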
end
The theory says that if A has a unique eigenvalue of largest modulus and x
is not deficient in the direction of the corresponding eigenvector v, then the
power method converges to a multiple of v (at a linear rate).
    Consider the matrix

    A = [  0.4  -0.6   0.2
          -0.3   0.7  -0.4
          -0.1  -0.4   0.5 ],

which has eigenvalues 0, 0.4394 and 1.161 (correct to the digits shown) and an
eigenvector [1, 1, 1]^T corresponding to the eigenvalue zero. If we take [1, 1, 1]^T
as the starting vector for the power method then, in principle, the zero vector
is produced in one step, and we obtain no indication of the desired dominant
eigenvalue-eigenvector pair. However, when we carry out the computation in
MATLAB, the first step produces a vector with elements of order 10^-16 and
we obtain after 38 iterations a good approximation to the dominant eigen-
pair. The explanation is that the matrix A cannot be stored exactly in bi-
nary floating point arithmetic. The computer actually works with A + ∆A
for a tiny perturbation ∆A, and the dominant eigenvalue and eigenvector of
A + ∆A are very good approximations to those of A. The starting vector
[1, 1, 1]^T contains a nonzero (though tiny) component of the dominant eigen-
vector of A + ∆A. This component grows rapidly under multiplication by
A + ∆A, helped by rounding errors in the multiplication, until convergence
to the dominant eigenvector is obtained.
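The experiment can be repeated with a few lines of MATLAB (a sketch, not the book's code). The matrix A has zero row sums, so [1, 1, 1]^T is an exact eigenvector for the eigenvalue zero, and its entries are not exactly representable in binary floating point arithmetic.

% Power method started at the null vector [1 1 1]' of A. Rounding errors in
% storing A and in forming A*x supply a tiny component of the dominant
% eigenvector, which then grows by a factor of roughly 1.161/0.4394 per step.
A = [ 0.4 -0.6  0.2
     -0.3  0.7 -0.4
     -0.1 -0.4  0.5];
x = [1; 1; 1];
for k = 1:40
    y = A*x;                     % in exact arithmetic this is zero at step 1
    x = y/norm(y, inf);
end
lambda = (x'*A*x)/(x'*x);        % Rayleigh quotient estimate
fprintf('eigenvalue estimate after 40 iterations: %.4f\n', lambda)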
because the kth row of A^(k-1) is the same as the kth row of A. In floating
point arithmetic the model (1.1) shows that the computed satisfy
where < u, i = 1:3. This equation says that the computed diagonal
elements are the exact diagonal elements corresponding not to A, but to
a matrix obtained from A by changing the diagonal elements to akk ( 1 + )
and the subdiagonal elements to In other
words, the computed are exact for a matrix differing negligibly from A.
The computed determinant d, which is given by
Figure 1.6. Values of rational function r(x) computed by Horner's rule (marked as
"×"), for x = 1.606 + (k - 1)2^-52; solid line is the "exact" r(x).
Concerning the second point, good advice is to look at the numbers gen-
erated during a computation. This was common practice in the early days
of electronic computing. On some machines it was unavoidable because the
contents of the store were displayed on lights or monitor tubes! Wilkinson
gained much insight into numerical stability by inspecting the progress of an
algorithm, and sometimes altering its course (for an iterative process with
parameters): “Speaking for myself I gained a great deal of experience from
user participation, and it was this that led to my own conversion to backward
error analysis" [1099, 1980, pp. 112-113] (see also [1083, 1955]). It is ironic
that with the wealth of facilities we now have for tracking the progress of nu-
merical algorithms (multiple windows in colour, graphical tools, fast printers)
we often glean less than Wilkinson and his co-workers did from mere paper
tape and lights.
1.19. Misconceptions
Several common misconceptions and myths have been dispelled in this chapter
(none of them for the first time; see the Notes and References). We highlight
them in the following list.
6. Rounding errors can only hinder, not help, the success of a computation
(§1.15).
Approximation theory: Clenshaw [212, 1955], Cox [249, 1972], [250, 1975],
[251, 1978], Cox and Harris [253, 1989], de Boor [272, 1972], and de Boor
and Pinkus [273, 1977].
Chaos and dynamical systems: Cipra [211, 1988], Coomes, Koçak, and
Palmer [242, 1995], Corless [246, 1992], [247, 1992], Hammel, Yorke,
and Grebogi [499, 1988], and Sanz-Serna and Larsson [893, 1993].
Nonlinear equations: Dennis and Walker [302, 1984], Spellucci [933, 1980],
and Wozniakowski [1111, 1977].
Optimization: Dennis and Schnabel [300, 1983], Fletcher [377, 1986], [379,
1988], [380, 1993], [381, 1994], Gill, Murray, and Wright [447, 1981],
Gurwitz [490, 1992], Müller-Merbach [783, 1970], and Wolfe [1108, 1965].
Partial differential equations: Ames [14, 1977], Birkhoff and Lynch [101,
1984], Canuto, Hussaini, Quarteroni, and Zang [183, 1988], Douglas [319,
1959], Forsythe and Wasow [397, 1960], Richtmyer and Morton [872,
1967, §1.8], and Trefethen and Trummer [1020, 1987].
is described in §13.5.1.
Classic papers dispensing good advice on the dangers inherent in numer-
ical computation are the “pitfalls” papers by Stegun and Abramowitz [937,
1956] and Forsythe [394, 1970]. The book Numerical Methods That Work by
Acton [4, 1970] must also be mentioned as a fount of hard-earned practical
advice on numerical computation (look carefully and you will see that the
front cover includes a faint image of the word “Usually” before “Work”). If
it is not obvious to you that the equation x² - 10x + 1 = 0 is best thought of
as a nearly linear equation for the smaller root, you will benefit from reading
Acton (see p. 58). Everyone should read Acton’s “Interlude: What Not To
Compute” (pp. 245-257).
Finally, we mention the paper “How to Get Meaningless Answers in Sci-
entific Computation (and What to Do About it)” by Fox [401, 1971]. Fox, a
contemporary of Wilkinson, founded the Oxford Computing Laboratory and
was for many years Professor of Numerical Analysis at Oxford. In this paper
he gives numerous examples in which incorrect answers are obtained from
plausible numerical methods (many of the examples involve truncation errors
as well as rounding errors). The section titles provide a list of reasons why
you might compute worthless answers:
Fox estimates [401, 1971, p. 296] that “about 80 per cent of all the results
printed from the computer are in error to a much greater extent than the user
would believe.”
⁸This reason refers to using an inappropriate convergence test in an iterative process.
Problems
The road to wisdom?
Well, it’s plain and simple to express:
Err
and err
and err again
but less
and less
and less.
-PIET HEIN, Grooks (1966)
1.2. (Skeel and Keiper [923, 1993, §1.2]) The number y = e^(π√163) was evalu-
ated at t digit precision for several values of t, yielding the values shown in the
following table, which are in error by at most one unit in the least significant
digit (the first two values are padded with trailing zeros):
     t     y
    10     262537412600000000
    15     262537412640769000
    20     262537412640768744.00
    25     262537412640768744.0000000
    30     262537412640768743.999999999999
Does it follow that the last digit before the decimal point is 4?
1.3. Show how to rewrite the following expressions to avoid cancellation for
the indicated arguments.
1.4. Give stable formulae for computing the square root x + iy of a complex
number a + ib.
1.5. [523, 1982] By writing (1 + 1/n)^n = exp(n log(1 + 1/n)), show how to
compute (1 + 1/n)^n accurately for large n. Assume that the log function is
computed with a relative error not exceeding u. (Hint: adapt the technique
used in §1.14.1.)
1.6. (Smith [928, 1975]) Type the following numbers into your pocket calcu-
lator, and look at them upside down (you or the calculator):
07734 The famous "hello world" program
38079 Object
318808 Name
35007 Adjective
57738.57734 × 10^40 Exclamation on finding a bug
3331 A high quality floating point arithmetic
Fallen tree trunks
1.7. A condition number for the sample variance (1.4), here denoted by V(x),
can be defined by
Show that
Show that
1.8. (Kahan, Muller, [781, 1989], Francois and Muller [406, 1991]) Consider
the recurrence
    d = a_11 a_22 - a_21 a_12,
    x_1 = (b_1 a_22 - b_2 a_12)/d,
    x_2 = (a_11 b_2 - a_21 b_1)/d.
Show that, assuming d is computed exactly (this assumption has little effect
on the final bounds), the computed solution satisfies
(Note that this error bound does not involve the condition numbers KC or
KN from Problem 1.7, at least in the first-order term. This is a rare instance
of an algorithm that determines the answer more accurately than the data
warrants!)
Chapter 2
Floating Point Arithmetic
⁹In Hennessy and Patterson [515, 1990, App. A].
(2.2)
Notice that the spacing of the floating point numbers jumps by a factor 2
at each power of 2. The spacing can be characterized in terms of machine
epsilon, which is the distance ε_M from 1.0 to the next larger floating point
number. Clearly, ε_M = β^(1-t), and this is the spacing of the floating point
numbers between 1.0 and β; the spacing of the numbers between 1.0 and
1/β is β^(-t) = ε_M/β. The spacing at an arbitrary x ∈ F is estimated by the
following lemma.
Lemma 2.1. The spacing between a normalized floating point number x and
an adjacent normalized floating point number is at least β^(-1) ε_M |x| and at most
ε_M |x| (unless x or the neighbour is zero).
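In IEEE double precision arithmetic (β = 2, t = 53) the bounds of Lemma 2.1 are easy to check in MATLAB, whose eps(x) function returns the spacing from |x| to the next larger floating point number:

% Spacing of the floating point numbers around x, compared with the bounds
% (1/2)*epsM*|x| and epsM*|x| of Lemma 2.1 (beta = 2, so beta^(-1) = 1/2).
epsM = eps;              % machine epsilon, 2^(-52) in IEEE double precision
x = 12345.678;
fprintf('spacing %g lies in [%g, %g]\n', eps(x), epsM*abs(x)/2, epsM*abs(x))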
    Let G ⊂ ℝ denote all real numbers of the form (2.1) with no restriction
on the exponent e. If x ∈ ℝ then fl(x) denotes an element of G nearest to x,
and the transformation x → fl(x) is called rounding. There are several ways
to break ties when x is equidistant from two floating point numbers, including
taking fl(x) to be the number of larger magnitude (round away from zero)
or the one with an even last digit d_t (round to even); the latter rule enjoys
impeccable statistics [144, 1973]. For more on tie-breaking strategies see the
Notes and References.
    Although we have defined fl as a mapping onto G, we are only interested
in the cases where it produces a result in F. We say that fl(x) overflows if
|fl(x)| > max{|y| : y ∈ F} and underflows if 0 < |fl(x)| < min{|y| : 0 ≠ y ∈ F}.
The following result shows that every real number x lying in the range of
F can be approximated by an element of F with a relative error no larger
than u = ½β^(1-t). The quantity u is called the unit roundoff. It is the most
useful quantity associated with F and is ubiquitous in the world of rounding
error analysis.
    x = µ × β^(e-t),    β^(t-1) ≤ µ < β^t,
Hence
The last inequality is strict unless µ = β^(t-1), in which case x = fl(x), hence
the inequality of the theorem is strict.
Theorem 2.2 says that fl(x) is equal to x multiplied by a factor very close
to 1. The representation 1 + δ for the factor is the standard choice, but it is
not the only possibility. For example, we could write the factor as e^a, with a
bound on |a| a little less than u (cf. the rp notation in §3.4).
The following modified version of this theorem can also be useful.
Figure 2.1. Relative distance from x to the next larger machine number (β = 2,
t = 24), displaying wobbling precision.
STANDARD MODEL

    fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u,   op = +, -, *, /.   (2.4)
It is normal to assume that (2.4) holds also for the square root operation.
Note that now we are using fl(·) with an argument that is an arithmetic
expression to denote the computed value of that expression. The model says
that the computed value of x op y is “as good as” the rounded exact answer,
in the sense that the relative error bound is the same in both cases. However,
2.3 IEEE A RITHMETIC 45
the model does not require that δ = 0 when x op y ∈ F (a condition which
obviously does hold for the rounded exact answer), so the model does not
capture all the features we might require of floating point arithmetic. This
model is valid for most computers, and, in particular, holds for IEEE standard
arithmetic. Cases in which the model is not valid are described in §2.4.
    The following modification of (2.4) can also be used (cf. Theorem 2.3):

    fl(x op y) = (x op y)/(1 + δ),   |δ| ≤ u.   (2.5)
Note: Throughout this book, the standard model (2.4) is used unless
otherwise stated. Most results proved using the standard model remain true
with the weaker model (2.6) described below, possibly subject to slight in-
creases in the constants. We identify problems for which the choice of model
significantly affects the results that can be proved.
In both formats one bit is reserved as a sign bit. Since the floating point
numbers are normalized, the most significant bit is always 1 and is not stored
(except for the denormalized numbers described below). This hidden bit ac-
counts for the "+1" in the table.
ponent field as a NaN; the sign bit distinguishes between The infinity
symbol obeys the usual mathematical conventions regarding infinity, such as
    2^1 × 0.0001 = 2^-2 × 0.100
Notice that to do the subtraction we had to line up the binary points, thereby
unnormalizing the second number and using, temporarily, a fourth mantissa
digit, known as a guard digit. Some machines do not have a guard digit.
Without a guard digit in our example we would compute as follows, assuming
the extra digits are simply discarded:
    2^1 × 0.100 -        2^1 × 0.100 -
    2^0 × 0.111          2^1 × 0.011   (last digit dropped)
                         2^1 × 0.001
The computed answer is too big by a factor 2 and so has relative error l! For
machines without a guard digit it is not true that
Theorem 2.4 (Ferguson). Let x and y be floating point numbers for which
e(x - y) ≤ min(e(x), e(y)), where e(x) denotes the exponent of x in its nor-
malized floating point representation. If subtraction is performed with a guard
digit then x - y is computed exactly (assuming x - y does not underflow or
overflow) .
Proof. From the condition of the theorem the exponents of x and y differ
by at most 1. If the exponents are the same then fl(x - y) is computed
exactly, so suppose the exponents differ by 1, which can happen only when x
and y have the same sign. Scale and interchange x and y if necessary so that
β^-1 ≤ y < 1 ≤ x < β, where β is the base. Now x is represented in base β as
x_1.x_2 . . . x_t and the exact difference z = x - y is of the form

    x_1 . x_2  . . .  x_t          -
    0 . y_1  . . .  y_{t-1} y_t
    -----------------------------
    z_1 . z_2  . . .  z_t z_{t+1}

But e(x - y) ≤ e(y) and y < 1, so z_1 = 0. The algorithm for computing z
forms z_1.z_2 . . . z_{t+1} and then rounds to t digits; since z has at most t significant
digits this rounding introduces no error, and thus z is computed exactly.
The next result is a corollary of the previous one but is more well known. It
is worth stating as a separate theorem because the conditions of the theorem
are so elegant and easy to check (being independent of the base), and because
this weaker theorem is sufficient for many purposes.
Theorem 2.5 (Sterbenz). Let x and y be floating point numbers with y/2 ≤
x ≤ 2y. If subtraction is performed with a guard digit then x - y is computed
exactly (assuming x - y does not underflow).
Theorem 2.5 is vital in proving that certain special algorithms work. A
good example involves Heron’s formula for the area A of a triangle with sides
of length a, b, and c:

    A = √(s(s - a)(s - b)(s - c)),    s = (a + b + c)/2.
This formula is inaccurate for needle-shaped triangles: if a ≈ b + c then s ≈ a
and the term s - a suffers severe cancellation. A way around this difficulty,
devised by Kahan, is to rename a, b, and c so that a ≥ b ≥ c and then evaluate

    A = (1/4)√( (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c)) ).   (2.7)
The parentheses are essential! Kahan has shown that this formula gives the
area with a relative error bounded by a modest multiple of the unit roundoff
provided that a guard digit is used in subtraction [457, 1991, Thm. 3], [634,
1990] (see Problem 2.22). If there is no guard digit, the computed result can
be very inaccurate.
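The difference between the two formulas is easy to see in MATLAB. The triangle below is illustrative (it is not an example from the book): a is nearly equal to b + c, so s - a in Heron's formula suffers heavy cancellation, while the parenthesized factors in (2.7) do not.

% Heron's formula versus Kahan's rearrangement (2.7) for a needle-shaped
% triangle with a >= b >= c. The parentheses in (2.7) must not be removed.
a = 1;  b = 0.5 + 1e-12;  c = 0.5 + 1e-12;
s = (a + b + c)/2;
A_heron = sqrt(s*(s - a)*(s - b)*(s - c));
A_kahan = 0.25*sqrt((a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c)));
fprintf('Heron: %.16e   Kahan: %.16e\n', A_heron, A_kahan)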
Kahan has made these interesting historical comments about guard digits
[634, 1990]:
CRAYs are not the first machines to compute differences blighted
by lack of a guard digit. The earliest IBM ’360s, from 1964 to 1967,
subtracted and multiplied without a hexadecimal guard digit un-
til SHARE, the IBM mainframe user group, discovered why the
consequential anomalies were intolerable and so compelled a guard
digit to be retrofitted. The earliest Hewlett-Packard financial cal-
culator, the HP-80, had a similar problem. Even now, many a
calculator (but not Hewlett-Packard’s) lacks a guard digit.
What base β is best for a floating point number system? Most modern com-
puters use base 2. Most hand-held calculators use base 10, since it makes
the calculator easier for the user to understand (how would you explain to a
naive user that 0.1 is not exactly representable on a base 2 calculator?). IBM
mainframes traditionally have used base 16. Even base 3 has been tried, in
an experimental machine called SETUN, built at Moscow State University in
the late 1950s [1066, 1960].
Several criteria can be used to guide the choice of base. One is the impact
of wobbling precision: as we saw at the end of §2.1, the spread of representa-
tion errors is smallest for small bases. Another possibility is to measure the
worst-case representation error or the mean square representation error. The
latter quantity depends on the assumed distribution of the numbers that are
represented. Brent [144, 1973] shows that for the logarithmic distribution the
worst-case error and the mean square error are both minimal for (normalized)
base 2, provided that the most significant bit is not stored explicitly.
    The logarithmic distribution is defined by the property that the proportion
of base β numbers with leading significant digit n is

    log_β(1 + 1/n).

For β = 10 the proportions for the digits 1 to 9 are

     1      2      3      4      5      6      7      8      9
    0.301  0.176  0.125  0.097  0.079  0.067  0.058  0.051  0.046
As an example, here is the leading significant digit distribution for the ele-
ments of the inverse of one random 100 × 100 matrix from the normal N(0, 1)
distribution:
1 2 3 4 5 6 7 8 9
0.334 0.163 0.100 0.087 0.077 0.070 0.063 0.056 0.051
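An experiment of this kind takes only a few lines of MATLAB (a sketch; the counts vary with the random seed and will not exactly reproduce the figures above):

% Leading significant digit distribution of the elements of inv(randn(100)),
% compared with the logarithmic (Benford) proportions log10(1 + 1/d).
X = abs(inv(randn(100)));
d = floor(X(:)./10.^floor(log10(X(:))));      % leading significant digit, 1..9
freq = histcounts(d, 0.5:1:9.5)/numel(d);
disp([1:9; freq])
disp(log10(1 + 1./(1:9)))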
For an entertaining survey of work on the distribution of leading significant
digits see Raimi [856, 1976] (and also the popular article [855, 1969]).
will adequately describe what actually happens” (see also the ensuing note by
Henrici [520, 1966]).
Since the late 1980s Chaitin-Chatelin and her co-workers have been de-
veloping a method called PRECISE, which involves a statistical analysis of
the effect on a computed solution of random perturbations in the data; see
Brunet [152, 1989], Chatelin and Brunet [202, 1990], and Chaitin-Chatelin
and Frayssé [190, 1996]. This approach is superficially similar to the earlier
CESTAC (permutation-perturbation) method of La Porte and Vignes [153,
1986], [682, 1974], [1054, 1986], but differs from it in several respects. CES-
TAC deals with the arithmetic reliability of algorithms, whereas PRECISE
is designed as a tool to explore the robustness of numerical algorithms as a
function of parameters such as mesh size, time step, and nonnormality.
Several authors have investigated the distribution of rounding errors under
the assumption that the mantissas of the operands are from a logarithmic
distribution, and for different modes of rounding; see Barlow and Bareiss [62,
1985] and the references therein.
Other work concerned with statistical modelling of rounding errors in-
cludes that of Tienari [1002, 1970] and Linnainmaa [704, 1975].
the number of bits allocated to the mantissa and exponent is allowed to vary
(within a fixed word size).
Other number systems include those of Swartzlander and Alexopolous [981,
1975], Matula and Kornerup [741, 1985], and Hamada [495, 1987]. For sum-
maries of alternatives to floating point arithmetic see the section “Alternatives
to Floating-Point Some Candidates" in [214, 1989], and Knuth [668, 1981,
Chap. 4].
1. (Cody [221, 1982]) Evaluate sin(22) = -8.8513 0929 0403 8759 2169 ×
10^-3 (shown correct to 21 digits). This is a difficult test for the range
    Machine            sin(22)
    Exact              -8.8513 0929 0403 8759 × 10^-3
    Casio fx-140       -8.8513 62 × 10^-3
    Casio fx-992VB     -8.8513 0929 096 × 10^-3
    HP 48G             -8.8513 0929 040 × 10^-3
    Sharp EL-5020      -8.8513 0915 4 × 10^-3
    MATLAB 4.2         -8.8513 0929 0403 876 × 10^-3
    WATFOR-77          -8.8513 0929 0403 880 × 10^-3
    FTN 90             -8.8513 0929 0403 876 × 10^-3
    QBasic             -8.8513 0929 0403 876 × 10^-3
Table 2.5. Exponentiation test. No entry for last column means same value as
previous column.
reduction used in the sine evaluation (which brings the argument within
the range [-π/2, π/2], and which necessarily uses an approximate value
of π), since 22 is close to an integer multiple of π.
2. (Cody [221, 1982]) Evaluate 2.5^125 = 5.5271 4787 5260 4445 6025 × 10^49
(shown correct to 21 digits). One way to evaluate z = x^y is as z =
exp(y log x). But to obtain z correct to within a few ulps it is not suf-
ficient to compute exp and log correct to within a few ulps; in other
words, the composition of two functions evaluated to high accuracy is
not necessarily obtained to the same accuracy. To examine this partic-
ular case, write

    w := y log x,   z = exp(w).
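A quick MATLAB check (not part of the original test) shows how close double precision evaluation comes to the correctly rounded value quoted above:

% Evaluating 2.5^125 as exp(y*log(x)) and with the power operator, and
% comparing with the value quoted above (correct to 21 significant digits).
ref = 5.52714787526044456025e49;
z1 = exp(125*log(2.5));
z2 = 2.5^125;
fprintf('exp(y*log x): rel. diff. %.1e   power operator: rel. diff. %.1e\n', ...
        abs(z1 - ref)/ref, abs(z2 - ref)/ref)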
3. (Karpinski [645, 1985]) A simple test for the presence of a guard digit
on a pocket calculator is to evaluate the expressions
9/27 * 3 - 1, 9/27 * 3 - 0.5 - 0.5,
which are given in a form that can be typed directly into most four-
function calculators. If the results are equal then a guard digit is present.
Otherwise there is probably no guard digit (we cannot be completely
sure from this simple test). To test for a guard digit on a computer it
is best to run one of the diagnostic codes described in §25.5.
eigenvalue problem. This paper has hardly dated and is still well worth read-
ing.
Another classic book devoted entirely to floating point arithmetic is Ster-
benz’s Floating-Point Computation [938, 1974]. It contains a thorough treat-
ment of low-level details of floating point arithmetic, with particular reference
to IBM 360 and IBM 7090 machines. It also contains a good chapter on round-
ing error analysis and an interesting collection of exercises. R. W. Hamming
has said of this book, “Nobody should ever have to know that much about
floating-point arithmetic. But I’m afraid sometimes you might” [833, 1988].
Although Sterbenz’s book is now dated in some respects, it remains a useful
reference.
A third important reference on floating point arithmetic is Knuth’s Seminu-
merical Algorithms [668, 1981, §4.2], from his Art of Computer Programming
series. Knuth’s lucid presentation includes historical comments and challeng-
ing exercises (with solutions).
The first analysis of floating point arithmetic was given by Samelson and
Bauer [890, 1953]. Later in the same decade Carr [187, 1959] gave a detailed
discussion of error bounds for the basic arithmetic operations.
An up-to-date and very readable reference on floating point arithmetic
is the survey paper by Goldberg [457, 1991], which includes a detailed dis-
cussion of IEEE arithmetic. A less mathematical, more hardware-oriented
discussion can be found in the appendix “Computer Arithmetic” written by
Goldberg that appears in the book on computer architecture by Hennessy and
Patterson [515, 1990].
A fascinating historical perspective on the development of computer float-
ing point arithmetics, including background to the development of the IEEE
standard, can be found in the textbook by Patterson and Hennessy [822,
1994, §4.11].
The idea of representing floating point numbers in the form (2.1) is found,
for example, in the work of Forsythe [393, 1969], Matula [740, 1970], and
Dekker [275, 1971].
An alternative definition of fl(x) is the nearest y ∈ G satisfying |y| ≤
|x|. This operation is called chopping, and it does not satisfy our definition of
rounding. Chopped arithmetic is used in the IBM/370 floating point system.
The difference between chopping and rounding (to nearest) is highlighted
by a discrepancy in the index of the Vancouver Stock Exchange in the early
1980s [852, 1983]. The exchange established an index in January 1982, with
the initial value of 1000. By November 1983 the index had been hitting lows
in the 520s, despite the exchange apparently performing well. The index was
recorded to three decimal places and it was discovered that the computer
program calculating the index was chopping instead of rounding to produce
the final value. Since the index was recalculated thousands of times a day, each
time with a nonpositive final error, the bias introduced by chopping became significant.
base conversion are Goldberg [458, 1967] and Matula [739, 1968], [740, 1970].
It is interesting to note that, in Fortran or C, where the output format for
a “print” statement can be precisely specified, most compilers will, for an
(in)appropriate choice of format, print a decimal string that contains many
more significant digits than are determined by the floating point number whose
value is being represented.
Other authors who have analysed various aspects of floating (and fixed)
point arithmetic include Diamond [305, 1978], Urabe [1037, 1968], and Feld-
stein, Goodman, and co-authors [471, 1975], [368, 1982], [472, 1985], [369,
1986]. For a survey of computer arithmetic up until 1976 that includes a
number of references not given here, see Garner [420, 1976].
Problems
The exercise had warmed my blood, and
I was beginning to enjoy myself amazingly.
-JOHN BUCHAN, The Thirty-Nine Steps (1915)
2.1. How many normalized numbers and how many subnormal numbers are
there in the system F defined in (2.1) with emin < e < emax? What are the
figures for IEEE single and double precision (base 2)?
2.2. Prove Lemma 2.1.
2.3. In IEEE arithmetic how many double precision numbers are there be-
tween any two adjacent nonzero single precision numbers?
2.4. Prove Theorem 2.3.
2.5. Show that
and deduce that 0.1 has the base 2 representation 0.0001100 (repeating the last 4
bits). Let x̂ = fl(0.1) be the rounded value of 0.1 obtained in binary IEEE
single precision arithmetic (u = 2^-24). Show that
2.6. What is the largest integer p such that all integers in the interval [-p,p]
are exactly representable in IEEE double precision arithmetic? What is the
corresponding p for IEEE single precision arithmetic?
2.7. Which of the following statements is true in IEEE arithmetic, assuming
that a and b are normalized floating point numbers and that no exception
occurs in the stated operations?
1. fl(a op b) = fl(b op a), op = +, *.
2. fl(b - a) = -fl(a - b).
3. fl(a + a) = fl(2*a).
4. fl(0.5*a) = fl(a/2).
5. fl((a + b) + c) = fl(a + (b + c)).
6. a ≤ fl((a + b)/2) ≤ b, given that a ≤ b.
2.8. Show that the inequalities a ≤ fl((a + b)/2) ≤ b, where a and b are
floating point numbers with a ≤ b, can be violated in base 10 arithmetic.
Show that a ≤ fl(a + (b - a)/2) ≤ b in base β arithmetic, for any β, assuming
the use of a guard digit.
2.9. What is the result of the computation in IEEE double preci-
sion arithmetic, with and without double rounding from an extended format
with a 64-bit mantissa?
2.10. A theorem of Kahan [457, 1991, Thm. 7] says that if β = 2 and the
arithmetic rounds as specified in the IEEE standard, then for integers m and
n with |m| < 2^{t-1} and n = 2^i + 2^j (some i, j), fl((m/n)*n) = m. Thus,
for example, fl((1/3) * 3) = 1 (even though fl(1/3) ≠ 1/3). The sequence of
allowable n begins 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 16, 17, 18, 20, so Kahan's theorem
covers many common cases. Test the theorem on your computer.
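One possible starting point for such a test is the following MATLAB sketch (my own, assuming IEEE double precision with round to nearest); it spot-checks fl((m/n)*n) = m for a range of m and for n of the form 2^i + 2^j.

bad = 0;
for i = 0:5
    for j = 0:5
        n = 2^i + 2^j;
        for m = -100:100
            if (m/n)*n ~= m, bad = bad + 1; end
        end
    end
end
fprintf('Number of failures: %d\n', bad)   % the theorem predicts 0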
2.11. Investigate the leading significant digit distribution for numbers ob-
tained as follows.
1. k^n, n = 0:1000, for k = 2 and 3.
2. n!, n = 1:1000.
3. The eigenvalues of a random symmetric matrix.
4. Physical constants from published tables.
5. From the front page of the London Times or the New York Times.
(Note that in writing a program for the first case you can form the powers of
2 or 3 in order, following each multiplication by a division by 10, as necessary,
to keep the result in the range [1,10]. Similarly for the second case.)
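Following the hint, one way to gather the leading digits for case 1 in MATLAB is sketched below (my own code, not from the text); the repeated divisions by 10 incur rounding errors, but these are far too small to change the leading digit except in borderline cases.

k = 2; x = 1; d = zeros(1,1001);
for n = 0:1000
    d(n+1) = floor(x);                  % leading significant digit of k^n
    x = x*k;
    while x >= 10, x = x/10; end        % rescale into [1,10)
end
counts = zeros(1,9);
for digit = 1:9, counts(digit) = sum(d == digit); end
disp([counts/1001; log10(1 + 1./(1:9))])  % observed versus Benford frequencies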
2.12. (Edelman [343, 1994]) Let x be a floating point number in IEEE double
precision arithmetic satisfying 1 < x < 2. Show that fl(x*(1/x)) is either 1
or 1 - ε/2, where ε = 2^-52 (the machine epsilon).
2.13. (Edelman [343, 1994]) Consider IEEE double precision arithmetic. Find
the smallest positive integer j such that fl(x*(1/x)) ≠ 1, where x = 1 + jε,
with ε = 2^-52 (the machine epsilon).
2.14. Kahan has stated that “an (over-)estimate of u can be obtained for
almost any machine by computing |3 × (4/3 - 1) - 1| using rounded floating-
point for every operation”. Test this estimate against u on any machines
available to you.
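In MATLAB the test is a one-liner (my own illustration):

est = abs(3*(4/3 - 1) - 1)
% In IEEE double precision this returns eps = 2^-52, which overestimates
% u = 2^-53 by a factor 2: the subtraction 4/3 - 1 is exact and exposes the
% rounding error committed in forming fl(4/3).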
% max(x, y)
if x > y then
    max = x
else
    max = y
end
Does this code always produce the expected answer in IEEE arithmetic?
2.22. Prove that Kahan’s formula (2.7) computes the area of a triangle ac-
curately if a guard digit is used in subtraction. (Hint: you will need one
invocation of Theorem 2.5.)
w = b*c
e = w - b*c
x = (a*d - w) + e
Chapter 3
Basics
Having defined a model for floating point arithmetic in the last chapter, we
now apply the model to some basic matrix computations, beginning with inner
products. This first application is simple enough to permit a short analysis,
yet rich enough to illustrate the ideas of forward and backward error. It also
raises the thorny question of what is the best notation to use in an error
analysis. We introduce the “γn” notation, which we use widely, though not
exclusively, in the book. The inner product analysis leads immediately to
results for matrix-vector and matrix-matrix multiplication.
In the last two sections we determine a model for rounding errors in com-
plex arithmetic and derive some miscellaneous results of use in later chapters.
    ŝ2 = fl(x1y1 + x2y2) = (x1y1(1 + δ1) + x2y2(1 + δ2))(1 + δ3),     (3.1)
where |δi| ≤ u, i = 1:3. For our purposes it is not necessary to distinguish
between the different δi terms, so to simplify the expressions let us drop the
subscripts on the δi and write 1 + δi ≡ 1 ± δ. Then
(3.2)
There are various ways to simplify this expression. A particularly elegant way
is to use the following result.
Lemma 3.1. If |δi| ≤ u and ρi = ±1 for i = 1:n, and nu < 1, then
    (1 + δ1)^ρ1 (1 + δ2)^ρ2 ... (1 + δn)^ρn = 1 + θn,
where
    |θn| ≤ nu/(1 - nu) =: γn.
Applying Lemma 3.1 to (3.2) gives
    ŝn = x1y1(1 + θn) + x2y2(1 + θ'n) + x3y3(1 + θn-1) + ... + xnyn(1 + θ2),     (3.3)
where |θ'n| ≤ γn.
This is a backward error result and may be interpreted as follows: the com-
puted inner product is the exact one for a perturbed set of data x1, ..., xn,
y1(1 + θn), y2(1 + θ'n), ..., yn(1 + θ2) (alternatively, we could perturb the xi
and leave the yi alone). Each relative perturbation is certainly bounded by
γn = nu/(1 - nu), so the perturbations are tiny.
The result (3.3) applies to one particular order of evaluation. It is easy to
see that for any order of evaluation we have, using vector notation,
    fl(x^T y) = (x + Δx)^T y = x^T (y + Δy),   |Δx| ≤ γn|x|,   |Δy| ≤ γn|y|,     (3.4)
where |x| denotes the vector with elements |x i | and inequalities between vec-
tors (and, later, matrices) hold componentwise.
A forward error bound follows immediately from (3.4):
    |x^T y - fl(x^T y)| ≤ γn |x|^T |y|.     (3.5)
If y = x, so that we are forming a sum of squares x^T x, this result shows that
high relative accuracy is obtained. However, in general, high relative accuracy
is not guaranteed if |x^T y| << |x|^T |y|.
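A small MATLAB experiment (my own illustration) shows the forward error bound (3.5) at work; single precision is used for the computed inner product so that double precision can serve as a reference for the exact value.

n = 1000; rng(1)
x = randn(n,1); y = randn(n,1);
sd = x'*y;                          % reference (double precision)
ss = single(x)'*single(y);          % computed in single precision
u = 2^-24; gamma_n = n*u/(1 - n*u);
err = abs(double(ss) - sd);
bound = gamma_n*(abs(x)'*abs(y));
fprintf('error = %.2e, bound = %.2e\n', err, bound)
% The observed error is typically far below the bound, which is a worst case.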
It is easy to see that precisely the same results (3.3)-(3.5) hold when we
use the no-guard-digit rounding error model (2.6). For example, expression
(3.1) becomes ŝ2 = x1y1(1 + δ1)(1 + δ3) + x2y2(1 + δ2)(1 + δ4), where δ4 has
replaced a second occurrence of δ3, but this has no effect on the error bounds.
It is worth emphasizing that the constants in the bounds above can be
reduced by focusing on particular implementations of an inner product. For
example, if n = 2m and we compute
70 BASICS
s1 = x(1:m)^T y(1:m)
s2 = x(m+1:n)^T y(m+1:n)
sn = s1 + s2
Since many of the error analyses in this book are built upon the error analysis
of inner products, it follows that the constants in these higher level bounds
can also be reduced by using one of these nonstandard inner product imple-
mentations. The main significance of this observation is that we should not
attach too much significance to the precise values of the constants in error
bounds.
Inner products are amenable to being calculated in extended precision.
If the working precision involves a t-digit mantissa then the product of two
floating point numbers has a mantissa of 2t - 1 or 2t digits and so can be
represented exactly with a 2t-digit mantissa. Some computers always form the
2t-digit product before rounding to t digits, thus allowing an inner product to
be accumulated at 2t-digit precision at little or no extra cost, prior to a final
rounding.
The extended precision computation can be expressed as fl(fl_e(x^T y)),
where fl_e denotes computation with unit roundoff u_e (u_e < u). Defining
ŝ = fl_e(x^T y), the analysis above holds with u replaced by u_e in (3.3) (and
with the subscripts on the θi reduced by 1 if the multiplications are done
exactly). For the final rounding we have
Hence, as long as nu_e |x|^T|y| ≤ u|x^T y|, the computed inner product is about
as good as the rounded exact inner product. The effect of using extended
s0 = 0
for i = 1:n
si = si-1 + xi yi
end
Similarly, a running error bound can be accumulated alongside the sum, via
µi = µi-1 + |ŝi| + |xi yi|, µ0 = 0, or in algorithmic form:

s = 0
µ = 0
for i = 1:n
    z = xi*yi
    s = s + z
    µ = µ + |s| + |z|
end
µ = u*µ
For more on the derivation of running error bounds see Wilkinson [1100, 1985]
or [1101, 1986]. A running error analysis for Horner’s method is given in §5.1.
Lemma 3.3. For any positive integer k let θk denote a quantity bounded
according to |θk| ≤ γk = ku/(1 - ku). The following relations hold:
but if we are given only the expression (1 + θk)/(1 + θj) and the bounds for
θk and θj, we cannot do better than to replace it by 1 + θk+2j for j > k.
Another style of writing bounds is made possible by the following lemma.
Proof. We have
Note that this lemma is slightly stronger than the corresponding bound we
can obtain from Lemma 3.1: |θn| ≤ nu/(1 - nu) ≤ nu/0.99 = 1.0101...nu.
Lemma 3.4 enables us to derive, as an alternative to (3.5),
(3.9)
At the end of an analysis it is necessary to bound |<k> - 1|, for which any
of the techniques described above can be used.
Wilkinson explained in [1100, 1985] that he used a similar notation in his
research, writing for a product of r factors 1 + δi with |δi | < u. He also
derived results for specific values of n, before treating the general case-a
useful trick of the trade!
An alternative notation to fl(·) to denote the rounded value of a number
or the computed value of an expression is [·], suggested by Kahan. Thus, we
would write [a + [b * c]] instead of fl(a + fl(b * c)).
Proponents of relative precision claim that the symmetry and additivity prop-
erties make it easier to work with than the relative error.
Pryce [845, 1981] gives an excellent appraisal of relative precision, with
many examples. He uses the additional notation 1(δ) to mean a number θ
with θ = 1; rp(δ). The 1(δ) notation is the analogue for relative precision of
Stewart's <k> counter for relative error. In later papers, Pryce extends the
rp notation to vectors and matrices and shows how it can be used in the error
analysis of some basic matrix computations [846, 1984], [847, 1985].
Relative precision has not achieved wide use. The important thing for an
error analyst is to settle on a comfortable notation that does not hinder the
thinking process. It does not really matter which of the notations described
in this section is used, as long as the final result is informative and expressed
in a readable form.
3.5. Matrix Multiplication
Consider first the matrix-vector product y = Ax, where A ∈ IR^{m×n}. When each
yi is computed as an inner product of the ith row of A with x, the analysis of
§3.1 shows that the computed ŷ satisfies
    ŷ = (A + ΔA)x,   |ΔA| ≤ γn|A|,     (3.10)
and hence
    |y - ŷ| ≤ γn|A||x|.     (3.11)
Normwise bounds readily follow (see Chapter 6 for norm definitions): for
example,
    ||y - ŷ||_∞ ≤ γn ||A||_∞ ||x||_∞.
The computation can be organized in two ways:

% Sdot form.
y(1:m) = 0
for i = 1:m
    for j = 1:n
        y(i) = y(i) + a(i,j)*x(j)
    end
end
% Saxpy form.
y(1:m) = 0
for j = 1:n
for i = 1:m
        y(i) = y(i) + a(i,j)*x(j)
end
end
The terms “sdot” and “saxpy” come from the BLAS (see §D.1). Sdot stands
for (single precision) dot product, and saxpy for (single precision) a times
x plus y. The question of interest is whether (3.10) and (3.11) hold for the
saxpy form. They do: the saxpy algorithm still forms the inner products
but instead of forming each one in turn it evaluates them all “in parallel”, a
term at a time. The key observation is that exactly the same operations are
performed, and hence exactly the same rounding errors are committed-the
only difference is the order in which the rounding errors are created.
This “rounding error equivalence” of algorithms that are mathematically
identical but algorithmically different is one that occurs frequently in matrix
computations. The equivalence is not always as easy to see as it is for matrix-
vector products.
Now consider matrix multiplication: C = AB, where A ∈ IR^{m×n} and
B ∈ IR^{n×p}. Matrix multiplication is a triple loop procedure, with six possible
loop orderings, one of which is
C(1:m,1:p) = 0
for i = 1:m
for j = 1:p
for k = 1:n
            C(i,j) = C(i,j) + A(i,k)*B(k,j)
end
end
end
As for the matrix-vector product, all six versions commit the same rounding
errors, so it suffices to consider any one of them. The “jik” and “jki” orderings
both compute C a column at a time: cj = Abj, where cj = C(:,j) and
bj = B(:,j). From (3.10),
    ĉj = (A + ΔAj)bj,   |ΔAj| ≤ γn|A|.
Hence the jth computed column of C has a small backward error: it is the
exact jth column for slightly perturbed data. The same cannot be said for
Ĉ as a whole (see Problem 3.5 for a possibly large backward error bound).
However, we have the forward error bound
    |C - Ĉ| ≤ γn |A||B|.     (3.12)
The bound (3.12) falls short of the ideal form of bound, which would say
that each component of C is computed with high relative accuracy. Never-
theless (3.12) is the best bound we can expect, because it reflects the sen-
sitivity of the product to componentwise relative perturbations in the data:
for any i and j we can find a perturbation ΔA with |ΔA| ≤ u|A| such that
|(A + ΔA)B - AB|_{ij} = u(|A||B|)_{ij} (similarly for perturbations in B).
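The bound (3.12) is easy to check numerically; the following MATLAB fragment (my own sketch) computes the product in single precision and uses the double precision product as a reference for the exact C.

rng(0); m = 30; n = 40; p = 20;
A = randn(m,n); B = randn(n,p);
C  = A*B;                                 % reference (double precision)
Cs = double(single(A)*single(B));         % computed in single precision
u = 2^-24; gamma_n = n*u/(1 - n*u);
max(max(abs(C - Cs) - gamma_n*(abs(A)*abs(B))))
% The displayed value should be nonpositive (up to the tiny error in the
% double precision reference), confirming the componentwise bound (3.12).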
(3.13c)
so
as required.
Multiplication:
where
as required.
Division:
Then
Bounds for the rounding errors in the basic complex arithmetic operations
are rarely given in the literature. Indeed, virtually all published error analyses
in matrix computations are for real arithmetic. However, because the bounds
of Lemma 3.5 are of the same form as for the standard model (2.4) for real
arithmetic, most results for real arithmetic (including virtually all those in
this book) are valid for complex arithmetic, provided that the constants are
increased appropriately.
3.7. Miscellany
In this section we give some miscellaneous results that will be needed in later
chapters. The first two results provide convenient ways to bound the effect
of perturbations in a matrix product. The first result uses norms and the
second, components.
Lemma 3.6. If Xj + ΔXj ∈ IR^{n×n} satisfies ||ΔXj|| ≤ δj||Xj|| for all j for a
consistent norm, then
    || ∏_j (Xj + ΔXj) - ∏_j Xj || ≤ ( ∏_j (1 + δj) - 1 ) ∏_j ||Xj||.
Lemma 3.7. If Xj + ΔXj ∈ IR^{n×n} satisfies |ΔXj| ≤ δj|Xj| for all j, then
    | ∏_j (Xj + ΔXj) - ∏_j Xj | ≤ ( ∏_j (1 + δj) - 1 ) ∏_j |Xj|.
so that
where
and
The kth stage of the algorithm represents a single floating point operation
and xk contains the original data together with all the intermediate quantities
computed so far. Finally, z = where is comprised of a subset of the
columns of the identity matrix (so that each zi is a component of xp+1). In
floating point arithmetic we have
where Δx_{k+1} represents the rounding errors on the kth stage and should be
easy to bound. We assume that the functions gk are continuously differentiable
and denote the Jacobian of gk at a by Jk. Then, to first order,
In most matrix problems there are fewer outputs than inputs (m < n), so
this is an underdetermined system. For a normwise backward error analysis
we want a solution of minimum norm. For a componentwise backward error
analysis, in which we may want (for example) to minimize subject to | ∆ a| <
we can write
for vectors x and y and explains some of its favourable properties for error
analysis.
Wilkinson [1089, 1965, p. 447] gives error bounds for complex arithmetic;
Olver [808, 1983] does the same in the relative precision framework. Dem-
mel [280, 1984] gives error bounds that extend those in Lemma 3.5 by taking
into account the possibility of underflow.
Henrici [521, 1980] gives a brief, easy to read introduction to the use of
the model (2.4) for analysing the propagation of rounding errors in a general
algorithm. He uses a set notation that is another possible notation to add to
those in §3.4.
The perspective on error analysis in §3.8 was suggested by J. W. Demmel.
Problems
3.1. Prove Lemma 3.1.
3.2. (Kielbasinski and Schwetlick [658, 1988], [659, 1992]) Show that if ρi ≡ 1
in Lemma 3.1 then the stronger bound |θn| ≤ nu/(1 - ½nu) holds for nu ≤ 2.
is
q_{n+1} = a_{n+1}
for k = n:-1:0
    q_k = a_k + b_k/q_{k+1}
end
where the weights wi and nodes xi are assumed to be floating point numbers.
Assuming that the sum is evaluated in left-to-right order and that
Chapter 4
Summation
I do hate sums.
There is no greater mistake than to call arithmetic an exact science.
There are . . . hidden laws of Number
which it requires a mind like mine to perceive.
For instance, if you add a sum from the bottom up,
and then again from the top down,
the result is always different.
-MRS. LA TOUCHE (quoted in Mathematical Gazette [730, 1924])
s = 0
for i = 1:n
    s = s + xi
end
S6 = ((x1 + x2) + (x3 + x4)) + (x5 + x6).
Pairwise summation is attractive for parallel computing, because each of the
[log2 n] stages can be done in parallel [573, 1988, §5.2.2].
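A recursive MATLAB implementation of pairwise summation is straightforward (my own sketch; save it as pairwise_sum.m):

function s = pairwise_sum(x)
% Pairwise (cascade) summation of the vector x by recursive halving.
n = length(x);
if n <= 2
    s = sum(x);
else
    m = floor(n/2);
    s = pairwise_sum(x(1:m)) + pairwise_sum(x(m+1:n));
end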
A third summation method is the insertion method. First, the xi are
sorted by order of increasing magnitude (alternatively, some other ordering
could be used). Then x1 + x2 is formed, and the sum is inserted into the
list x3, ..., xn, maintaining the increasing order. The process is repeated
recursively until the final sum is obtained. In particular cases the insertion
method reduces to one of the other two. For example, if xi = 2^{i-1}, the
insertion method is equivalent to recursive summation, since the insertion is
always to the bottom of the list:
On the other hand, if 1 < x1 < x2 < ... < xn < 2, every insertion is to the
end of the list, and the method is equivalent to pairwise summation if n is a
power of 2; for example, if 0 < ε < 1/2,
To choose between these methods we need error analysis, which we develop
in the next section.
Algorithm 4.1. Let S = {x1, ..., xn}.
repeat while S contains more than one element
    Remove two numbers x and y from S,
    and add their sum x + y to S.
end
Assign the remaining element of S to Sn.
Note that since there are n numbers to be added and hence n - 1 additions
to be performed, there must be precisely n - 1 executions of the repeat loop.
First, let us check that the previous methods are special cases of Algo-
rithm 4.1. Recursive summation (with any desired ordering) is obtained by
taking x at each stage to be the sum computed on the previous stage of the
algorithm. Pairwise summation is obtained by [log2 n] groups of executions
of the repeat loop, in each group of which the members of S are broken into
pairs, each of which is summed. Finally, the insertion method is, by definition,
a special case of Algorithm 4.1.
Now for the error analysis. Express the ith execution of the repeat loop
as Ti = xi1 + yil. The computed sums satisfy (using (2.5))
(4.1)
The local error introduced in forming T̂i is δi T̂i/(1 + δi). The overall error is the sum
of the local errors (since summation is a linear process), so overall we have
(4.3)
(This is actually in the form of a running error bound, because it contains the
computed quantities-see §3.3.) It is easy to see that
for each i, and so we have also the weaker bound
(4.4)
How does the decreasing ordering fit into the picture? For the summation
of positive numbers this ordering has little to recommend it. The bound (4.3)
is no smaller, and potentially much larger, than it is for the increasing order-
ing. Furthermore, in a sum of positive terms that vary widely in magnitude
the decreasing ordering may not allow the smaller terms to contribute to the
sum (which is why the harmonic sum Σ 1/i “converges” in floating point
arithmetic as n → ∞). However, consider the example with n = 4 and
    x = [1, M, 2M, -3M],     (4.5)
where M is a floating point number so large that fl(1 + M) = M (thus M ≥
u^{-1}). The three orderings considered so far produce the following results:
Increasing:  ŝ4 = fl(1 + M + 2M - 3M) = 0,
Psum:        ŝ4 = fl(1 + M - 3M + 2M) = 0,
Decreasing:  ŝ4 = fl(-3M + 2M + M + 1) = 1.
Thus the decreasing ordering sustains no rounding errors and produces the
exact answer, while both the increasing and Psum orderings yield computed
sums with relative error 1. The reason why the decreasing ordering performs
so well in this example is that it adds the “1” after the inevitable heavy
cancellation has taken place, rather than before, and so retains the important
information in this term. If we evaluate the term Σ|T̂i| in the error
bound (4.3) for example (4.5) we find
Increasing: Σ|T̂i| = 4M,   Psum: Σ|T̂i| = 3M,   Decreasing: Σ|T̂i| = M + 1,
so (4.3) “predicts” that the decreasing ordering will produce the most accurate
answer, but the bound it provides is extremely pessimistic since there are no
rounding errors in this instance.
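The example (4.5) is easily reproduced in MATLAB (my own illustration): with M = 2^53 we have fl(1 + M) = M in IEEE double precision arithmetic.

M = 2^53;
x = [1 M 2*M -3*M];
s_inc = ((x(1) + x(2)) + x(3)) + x(4)    % increasing ordering: 0
s_dec = ((x(4) + x(3)) + x(2)) + x(1)    % decreasing ordering: 1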
Extrapolating from this example, we conclude that the decreasing or-
dering is likely to yield greater accuracy than the increasing or Psum or-
derings whenever there is heavy cancellation in the sum, that is, whenever
    |Σ_i xi| << Σ_i |xi|.     (4.6)
Since it is proportional to log2 n rather than n, this is a smaller bound than
(4.4), which is the best bound of this form that holds in general for Algo-
rithm 4.1.
    a + b = ŝ + ê,     (4.7)
that is, the computed ê represents the error exactly. This result (which does
not hold for all bases) is proved by Dekker [275, 1971, Thm. 4.7], Knuth [668,
1981, Thm. C, p. 221], and Linnainmaa [703, 1974, Thm. 3]. Note that there
is no point in computing fl(ŝ + ê), since ŝ is already the best floating point
representation of a + b!
Kahan’s compensated summation method employs the correction e on
every step of a recursive summation. After each partial sum is formed, the
correction is computed and immediately added to the next term xi before that
term is added to the partial sum. Thus the idea is to capture the rounding
errors and feed them back into the summation. The method may be written
as follows.
s = 0; e = 0
for i = 1:n
    temp = s
    y = xi + e
    s = temp + y
    e = (temp - s) + y     % Evaluate in the order shown.
end
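The pseudocode above transcribes directly into a MATLAB function (my own version, save as comp_sum.m); comparing, say, comp_sum(repmat(0.1,1,1e6)) with sum(repmat(0.1,1,1e6)) against the exact value 1e5 illustrates the gain in accuracy.

function s = comp_sum(x)
% Compensated (Kahan) summation of the vector x.
s = 0; e = 0;
for i = 1:length(x)
    temp = s;
    y = x(i) + e;
    s = temp + y;
    e = (temp - s) + y;    % evaluate in the order shown
end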
[628, 1973] that (4.8) holds with the stronger bound |µi| ≤ 2u + O((n - i +
1)u^2). The proofs of (4.8) given by Knuth and Kahan are similar; they use the
model (2.4) with a subtle induction and some intricate algebraic manipulation.
The forward error bound corresponding to (4.8) is
(4.9)
(4.10)
provided nu < 0.1; this is weaker than (4.8) in that the second-order term
has an extra factor n. If n^2 u ≤ 0.1 then in (4.10), |µi| ≤ 2.1u. Jankowski,
Smoktunowicz, and Wozniakowski [609, 1983] show that, by using a divide
and conquer implementation of compensated summation, the range of n for
which |µ i | < cu holds in (4.10) can be extended, at the cost of a slight increase
in the size of the constant c.
Neither the correction formula (4.7) nor the result (4.8) for compensated
summation holds under the no-guard-digit model of floating point arithmetic.
Indeed, Kahan [634, 1990] constructs an example where compensated summa-
tion fails to achieve (4.9) on certain Cray machines, but he states that such
failure is extremely rare. In [627, 1972] and [628, 1973] Kahan gives a mod-
ification of the compensated summation algorithm in which the assignment
“e = (temp - s) + y” is replaced by
f = 0
if sign(temp) = sign(y), f = (0.46*s - s) + s, end
e = ((temp - f) - (s - f)) + y
Kahan shows in [628, 1973] that the modified algorithm achieves (4.8) “on
all North American machines with floating hardware” and explains that “The
mysterious constant 0.46, which could perhaps be any number between 0.25
and 0.50, and the fact that the proof requires a consideration of known ma-
chine designs, indicate that this algorithm is not an advance in computer
science.”
Viten’ko [1056, 1968] shows that under the no-guard-digit model (2.6) the
summation method with the optimal error bound (in a certain sense defined
in [1056, 1968]) is pairwise summation. This does not contradict Kahan’s
result because Kahan uses properties of the floating point arithmetic beyond
those in the no-guard-digit model.
A good illustration of the benefits of compensated summation is provided
by Euler’s method for the ordinary differential equation initial value problem
y' = f(x,y), y(a) given, which generates an approximate solution according
to y_{k+1} = yk + h*fk, y0 = y(a). We solved the equation y' = -y with y(0) = 1
over [0,1] using n steps of Euler’s method (nh = 1), with n ranging from 10
to 10^8. With compensated summation we replace the statements x = x + h,
y = y + h*f(x,y) by (with the initialization cx = 0, cy = 0)
dx = h + cx
new_x = x + dx
cx = (x - new_x) + dx
x = new_x
dy = h*f(x,y) + cy
new_y = y + dy
cy = (y - new_y) + dy
y = new_y
Figure 4.2. Errors |y(1) - ŷn| for Euler’s method with (“×”) and without (“o”)
compensated summation.
decreasing order, which removes the need for certain logical tests that would
otherwise be necessary.
Priest [844, 1992, §4.1] analyses this algorithm for t-digit base β arithmetic
that satisfies certain reasonable assumptions - ones which are all satisfied by
IEEE arithmetic. He shows that if n ≤ β^{t-3} then the computed sum
satisfies
by the error bounds; numerical experiments show that, given any two of the
methods, data can be found for which either method is more accurate than
the other [553, 1993]. However, some specific advice on the choice of method
can be given.
1. If high accuracy is important, consider implementing recursive summa-
tion in higher precision; if feasible this may be less expensive (and more
accurate) than using one of the alternative methods at the working pre-
cision. What can be said about the accuracy of the sum computed at
higher precision? If Sn = Σxi is computed by recursive summation
at double precision (unit roundoff u^2) and then rounded to single preci-
sion, an error bound of the form |Ŝn - Sn| ≤ u|Sn| + O(nu^2)Σ|xi| holds.
Hence a relative error of order u is guaranteed if nu Σ|xi| ≤ |Sn|.
Priest [844, 1992, pp. 62-63] shows that if the xi are sorted in decreas-
ing order of magnitude before being summed in double precision, then
this holds provided only that n ≤ β^{t-3} for t-digit base
β arithmetic satisfying certain reasonable assumptions. Therefore the
decreasing ordering may be worth arranging if there is a lot of cancella-
tion in the sum. An alternative to extra precision computation is doubly
compensated summation, which is the only other method described here
that guarantees a small relative error in the computed sum.
2. For most of the methods the errors are, in the worst case, proportional
to n. If n is very large, pairwise summation (error constant log2 n) and
compensated summation (error constant of order 1) are attractive.
3. If the xi all have the same sign then all the methods yield a relative error
of at most nu and compensated summation guarantees perfect relative
accuracy (as long as nu < 1). For recursive summation of one-signed
data, the increasing ordering has the smallest error bound (4.3) and
the insertion method minimizes this error bound over all instances of
Algorithm 4.1.
4. For sums with heavy cancellation (Σ|xi| >> |Σxi|), recursive
summation with the decreasing ordering is attractive, although it cannot
be guaranteed to achieve the best accuracy.
Considerations of computational cost and the way in which the data are
generated may rule out some of the methods. Recursive summation in the
natural order, pairwise summation, and compensated summation can be im-
plemented in O(n) operations for general xi, but the other methods are more
expensive since they require searching or sorting. Furthermore, in an applica-
tion such as the numerical solution of ordinary differential equations, where
x_{k+1} is not known until the sum of the earlier terms has been formed, sorting and searching may
be impossible.
Problems
4.1. Define and evaluate a condition number C(x) for the summation Sn(x) =
Σ_{i=1}^n xi. When does the condition number take the value 1?
4.2. (Wilkinson [1088, 1963, p. 19]) Show that the bounds (4.3) and (4.4) are
nearly attainable for recursive summation. (Hint: assume u = 2^{-t}, set n = 2^r
(r << t), and define
x(1) = 1,
x(2) = 1 - 2^{-t},
x(3:4) = 1 - 2^{1-t},
x(5:8) = 1 - 2^{2-t},
...
x(2^{r-1} + 1 : 2^r) = 1 - 2^{r-1-t}.)
4.8. In numerical methods for quadrature and for solving ordinary differential
equation initial value problems it is often necessary to evaluate a function on
an equally spaced grid of points on a range [a,b]: xi := a + ih, i = 0:n,
where h = (b-a)/n. Compare the accuracy of the following ways to form xi .
Assume that a and b, but not necessarily h, are floating point numbers.
x1 = 2^t + 1,   x2 = 2^{t+1} - 2,   x3 = x4 = x5 = x6 = -(2^t - 1),
Chapter 5
Polynomials
Two common tasks associated with polynomials are evaluation and interpola-
tion: given the polynomial find its values at certain arguments, and given the
values at certain arguments find the polynomial. We consider Horner’s rule
for evaluation and the Newton divided difference polynomial for interpolation.
A third task not considered here is finding the zeros of a polynomial. Much re-
search was devoted to polynomial zero finding up until the late 1960s; indeed,
Wilkinson devotes a quarter of Rounding Errors in Algebraic Processes [1088,
1963] to the topic. Since the development of the QR algorithm for finding
matrix eigenvalues there has been less demand for polynomial zero finding,
since the problem either arises as, or can be converted to (see §26.6 and [346,
1995], [1007, 1994]), the matrix eigenvalue problem.
q_n(x) = a_n
for i = n-1:-1:0
    q_i(x) = x*q_{i+1}(x) + a_i
end
p(x) = q_0(x)
The cost is 2n flops, which is n less than the more obvious method of evaluation
that explicitly forms powers of x (see Problem 5.2).
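In MATLAB, Horner's rule can be written as follows (my own helper function, with the coefficient vector a holding a_0, ..., a_n in positions 1:n+1):

function y = horner(a, x)
% Evaluate p(x) = a(1) + a(2)*x + ... + a(n+1)*x^n by Horner's rule.
n = length(a) - 1;
y = a(n+1);
for i = n-1:-1:0
    y = x*y + a(i+1);
end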
To analyse the rounding errors in Horner’s method it is convenient to use
the relative error counter notation <k> (see (3.9)). We have
(5.2)
where we have used Lemma 3.1, and where |φk| ≤ ku/(1 - ku) =: γk. This
result shows that Horner’s method has a small backward error: the com-
puted p̂(x) is the exact value at x of a polynomial obtained by making relative
perturbations of size at most γ_{2n} to the coefficients of p(x).
A forward error bound is easily obtained: from (5.2) we have
(5.3)
(5.4)
where we have used both (2.4) and (2.5). Defining q̂i =: qi + fi, we have
or
Hence
Algorithm 5.1. This algorithm evaluates y = fl(p(x)) by Horner's rule together
with a running error bound µ.
y = a_n
µ = |y|/2
for i = n-1:-1:0
    y = x*y + a_i
    µ = |x|*µ + |y|
end
µ = u*(2*µ - |y|)
Cost: 4n flops.
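A direct MATLAB transcription is given below (my own function; the final scaling by the unit roundoff follows my reading of Algorithm 5.1, with u = eps/2 for IEEE double precision).

function [y, mu] = horner_err(a, x)
% Horner's rule plus a running error bound mu such that |p(x) - y| <= mu,
% barring underflow and overflow. a(1:n+1) holds a_0, ..., a_n.
u = eps/2;
n = length(a) - 1;
y = a(n+1); mu = abs(y)/2;
for i = n-1:-1:0
    y = x*y + a(i+1);
    mu = abs(x)*mu + abs(y);
end
mu = u*(2*mu - abs(y));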
It is worth commenting on the case where one or more of the ai and x
is complex. The analysis leading to Algorithm 5.1 is still valid for complex
data, but we need to remember that the error bounds for fl(x op y) are not
the same as for real arithmetic. In view of Lemma 3.5, it suffices to replace
the last line of the algorithm by a version with a suitably enlarged constant. An increase in speed of
the algorithm, with only a slight worsening of the bound, can be obtained by
replacing |y| = ((Re y)^2 + (Im y)^2)^{1/2} by |Re y| + |Im y| (and, of course, |x|
should be evaluated once and for all before entering the loop).
One use of Algorithm 5.1 is to provide a stopping criterion for a polynomial
zero-finder: if |fl(p ( x))| is of the same order as the error bound µ, then further
iteration serves no purpose, for as far as we can tell, x could be an exact zero.
As a numerical example, for the expanded form of p(x) = (x + 1)^32 we
found in MATLAB that
In these two cases, the running error bound is, respectively, 62 and 31 times
smaller than the a priori one.
In another experiment we evaluated the expanded form of p(x) = (x - 2)^3
in simulated single precision in MATLAB (u ≈ 6 × 10^-8) for 200 equally spaced
points near x = 2. The polynomial values, the error, and the a priori and
running error bounds are all plotted in Figure 5.1. The running error bound
is about seven times smaller than the a priori one.
Figure 5.1. Computed polynomial values (top) and running and a priori bounds
(bottom) for Horner’s method.
each expression, but there is a more efficient way. Observe that if we define
    q(x) = q_n x^{n-1} + ... + q_2 x + q_1,     r = q_0,
where the qi = qi(a) are generated by Horner’s method for p(a), then
p(x) = (x - a)q(x) + r.
In other words, Horner’s method carries out the process of synthetic division.
Clearly, p'(a) = q(a). If we repeat synthetic division recursively on q(x), we
will be evaluating the coefficients in the Taylor expansion
y0 = an
y(1:k) = 0
for j = n-1:-1:0
    for i = min(k, n-j):-1:1
        y_i = a*y_i + y_{i-1}
    end
    y_0 = a*y_0 + a_j
end
m = 1
for j = 2:k
    m = m*j
    y_j = m*y_j
end
Hence
(5.5)
The recurrence for r0 = p'(a) can be expressed as Unr = q(1:n), where
r = r(0:n - 1), so
Hence
(5.6)
5.3 T HE N EWTON FORM AND P OLYNOMIAL I NTERPOLATION 109
Now
(5.7)
This is essentially the same form of bound as for p(a) in (5.3). Analogous
bounds hold for all derivatives.
(5.8)
The coefficients ck in (5.8) are known as divided differences. Assuming that the points aj are distinct, the divided
differences may be computed from a standard recurrence:
c^(0)(0:n) = f(0:n)
for k = 0:n-1
    for j = n:-1:k+1
        c_j^(k+1) = (c_j^(k) - c_{j-1}^(k)) / (a_j - a_{j-k-1})
    end
end
c = c^(n)
Cost: 3n^2/2 flops.
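The recurrence is easily coded in MATLAB (my own transcription; a(1:n+1) holds the points a_0, ..., a_n and f(1:n+1) the values f_0, ..., f_n):

function c = divdiff(a, f)
% Divided differences of the data (a_i, f_i), overwriting c in place.
n = length(a) - 1;
c = f;
for k = 0:n-1
    for j = n:-1:k+1
        c(j+1) = (c(j+1) - c(j))/(a(j+1) - a(j-k));
    end
end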
Two questions are of interest: how accurate are the computed divided differences, and
what is the effect of rounding errors on the polynomial values obtained by
evaluating the Newton form? To answer the first question we express the
recurrence in matrix-vector form:
c(0)= f , c( k + l ) =L k c( k ), k=0:n-1,
where Lk = is lower bidiagonal, with
Dk = diag(ones(1:k + 1), ak+1 - a 0 ,a k+2 - a1, . . . ,an - an -k -1),
The analysis that follows is based on the model (2.4), and so is valid only for
machines with a guard digit. With the no-guard-digit model (2.6) the bounds
become weaker and more complicated, because of the importance of terms
fl(a_j - a_{j-k-1}) in the analysis.
It is straightforward to show that
(5.9)
where G_k = diag(ones(1:k+1), η_{k,k+2}, ..., η_{k,n+1}), where each η_{ij} is of the
form η_{ij} = (1 + δ1)(1 + δ2)(1 + δ3), |δi| ≤ u. Hence
(5.10)
From Lemma 3.7,
(5.11)
To interpret the bound, note first that merely rounding the data (fi →
fi(1 + δi), |δi| ≤ u) can cause an error Δc as large as e_round = u|L||f|,
where L = L_{n-1} ... L_0, so errors of at least this size are inevitable. Since
|L_{n-1}| ... |L_0| ≥ |L_{n-1} ... L_0| = |L|, the error in the computed divided differ-
ences can be larger than e_round only if there is much subtractive cancellation
in the product L = L_{n-1} ... L_0. If a_0 < a_1 < ... < a_n, then each L_i is
positive on the diagonal and nonpositive on the first subdiagonal; therefore
|L_{n-1}| ... |L_0| = |L_{n-1} ... L_0| = |L|, and we have the very satisfactory bound
|ĉ - c| ≤ ((1 - 3u)^{-n} - 1)|L||f|. This same bound holds if the ai are arranged
in decreasing order.
To examine how well the computed Newton form reproduces the f i we
“unwind” the analysis above. From (5.9) we have
(5.12)
If a_0 < a_1 < ... < a_n then the relevant quantities are positive for all i, and we obtain an analogous,
very satisfactory bound. Again, the same
bound holds for points arranged in decreasing order.
In practice it is found that even when the computed divided differences are
very inaccurate, the computed interpolating polynomial may still reproduce
the original data well. The bounds (5.11) and (5.12) provide insight into this
observed behaviour by showing that the errors can be large only when
there is much cancellation in the products L_{n-1} ... L_0 f and L_0^{-1} ... L_{n-1}^{-1} ĉ,
respectively.
The analysis has shown that the ordering a 0 < a 1 < . . . < an yields
“optimal” error bounds for the divided differences and the residual, and so
may be a good choice of ordering of interpolation points. However, if the
aim is to minimize |p(x) - fl(p(x))| for a given x aj, then other orderings
need to be considered. An ordering with some theoretical support is the Leja
ordering, which is defined by the equations [863, 1990]
(5.13a)
(5.13b)
(1) For fi from the normal N(0,1) distribution the divided differences
range in magnitude from 1 to 10^5, and their relative errors range from 0 (the
first divided difference, f_0, is always exact) to 3 × 10^-7. The ratio p1 = 16.3,
so (5.11) provides a reasonably sharp bound for the error in ĉ. The relative
errors when f is reconstructed from the computed divided differences range
between 0 and 3 × 10^-1 (it makes little difference whether the reconstruction
is done in single or double precision). Again, this is predicted by the analysis,
in this case by (5.12), because p2 = 2 × 10^7. For the Leja ordering, the divided
differences are computed with about the same accuracy, but f is reconstructed
much more accurately, with maximum relative error 7 × 10^-6 (p1 = 1 × 10^3,
p2 = 8 × 10^4).
(2) For fi = exp(ai), the situation is reversed: we obtain inaccurate di-
vided differences but an accurate reconstruction of f. The divided differences
range in magnitude from 10^-4 to 10^-1, and their relative errors are as large as
1, but the relative errors in the reconstructed f are all less than 10^-7. Again,
the error bounds predict this behaviour: p1 = 6 × 10^8, p2 = 1.02. The Leja
ordering performs similarly.
The natural way to evaluate the polynomial (5.8) for a given x is by a
generalization of Horner’s method:
q_n(x) = c_n
for i = n-1:-1:0
    q_i(x) = (x - a_i)*q_{i+1}(x) + c_i
end
p(x) = q_0(x)
A straightforward analysis shows that (cf. (5.2))
taken from a given compact set that satisfy the condition (5.13). For more
on the numerical benefits of the Leja ordering see §21.3.3.
If a polynomial is to be evaluated many times at different arguments it
may be worthwhile to expend some effort transforming it to a form that can
be evaluated more cheaply than by a straightforward application of Horner’s
rule. For example, the quartic
    p(x) = a4 x^4 + a3 x^3 + a2 x^2 + a1 x + a0,     a4 ≠ 0,
can be rewritten as
    p(x) = ((y + x + α2)y + α3)a4,     y = (x + α0)x + α1,
for suitable coefficients αi depending on the ai.
Once the αi have been computed, p(x) can be evaluated in three multiplica-
tions and five additions, as compared with the four multiplications and four
additions required by Horner’s rule. If a multiplication takes longer than an
addition, the transformed polynomial should be cheaper to evaluate. For poly-
nomials of degree n > 4 there exist evaluation schemes that require strictly
less than the 2n total additions and multiplications required by Horner’s rule;
see Knuth [665, 1962], [668, 1981, pp. 471-475] and Fike [373, 1967]. One
application in which such schemes have been used is in evaluating polynomial
approximations in an elementary function library [412, 1991]. Little seems to
be known about the numerical stability of fast polynomial evaluation schemes;
see Problem 5.6.
Problems
5.1. Give an alternative derivation of Algorithm 5.2 by differentiating the
Horner recurrence and rescaling the iterates.
5.2. Give an error analysis for the following “beginner’s” algorithm for eval-
uating p(x) = a 0 + a1 x + . . . + anxn :
q(x) = a_0; y = 1
for i = 1:n
    y = x*y
    q(x) = q(x) + a_i*y
end
p(x) = q(x)
5.3. For a polynomial of even degree n = 2m, consider the splitting
p(x) = (a0 + a2 x^2 + ... + a_{2m} x^{2m}) + (a1 x + a3 x^3 + ... + a_{2m-1} x^{2m-1})
     = a0 + a2 y + ... + a_{2m} y^m + x(a1 + a3 y + ... + a_{2m-1} y^{m-1}),
where y = x^2. Obtain an error bound for fl(p(x)) when p is evaluated using
this splitting (using Horner’s rule on each part).
5.4. Write down an algorithm for computing the Leja ordering (5.13) in n 2
flops.
5.5. If the polynomial p(x) = Σ_{i=0}^n a_i x^i has roots x1, ..., xn, it can be eval-
uated from the root product form p(x) = a_n ∏_{i=1}^n (x - x_i). Give an error
analysis for this evaluation.
5.6. (RESEARCH PROBLEM) Investigate the numerical stability of fast poly-
nomial evaluation schemes (see the Notes and References) by both rounding
error analysis and numerical experiments. For a brief empirical study see
Miller [757, 1975, §10].
Chapter 6
Norms
Matrix norms are defined in many different ways in the older literature,
but the favorite was the Euclidean norm of the matrix
considered as a vector in n 2-space.
Wedderburn (1934) calls this the absolute value of the matrix
and traces the idea back to Peano in 1887.
-ALSTON S. HOUSEHOLDER,
The Theory of Matrices in Numerical Analysis (1964)
(Quoted in Fox [403, 1987].)
The three most useful norms in error analysis and in numerical computa-
tion are
    ||x||_1 = Σ_i |x_i|,   ||x||_2 = (Σ_i |x_i|^2)^{1/2} = (x*x)^{1/2},   ||x||_∞ = max_i |x_i|.
The 2-norm has two properties that make it particularly useful for the-
oretical purposes. First, it is invariant under unitary transformations, for if
Q*Q = I, then ||Qx||_2^2 = x*Q*Qx = x*x = ||x||_2^2. Second, the 2-norm is
differentiable for all x ≠ 0, with gradient vector ∇||x||_2 = x/||x||_2.
A fundamental inequality for vectors is the Hölder inequality (see, for
example, [502, 1967, App. 1])
    |x*y| ≤ ||x||_p ||y||_q,     1/p + 1/q = 1.
This is an equality when p, q > 1 if the vectors (|x_i|^p) and (|y_i|^q) are linearly
dependent and x̄_i y_i lies on the same ray in the complex plane for all i; equality
is also possible when p = 1 and p = ∞, as is easily verified.
with p = q = 2 is called the Cauchy-Schwarz inequality:
    |x*y| ≤ ||x||_2 ||y||_2.     (6.2)
It follows from the Hölder inequality that the dual of the p-norm is the q-norm,
where p^{-1} + q^{-1} = 1. The definition of dual norm yields, trivially, the general
Hölder inequality |x*y| ≤ ||x|| ||y||_D. For a proof of the reassuring result that
the dual of the dual norm is the original norm (the “duality theorem”) see
Horn and Johnson [580, 1985, Thm. 5.5.14].
In some analyses we need the vector z dual to y, which is defined by the
property
    z*y = ||z||_D ||y|| = 1.     (6.3)
That such a vector z exists is a consequence of the duality theorem (see [580,
1985, Cor. 5.5.15]).
How much two p-norms of a vector can differ is shown by the attainable
inequalities [422, 1983, pp. 27-28], [459, 1983, Lem. 1.1]
(6.4)
The p-norms have the properties that ||x|| depends only on the absolute
value of x, and the norm is an increasing function of the absolute values of the
entries of x. These properties are important enough to warrant a definition.
1. monotone if |x| ≤ |y| ⇒ ||x|| ≤ ||y|| for all x, y, and
2. absolute if || |x| || = ||x|| for all x.
The following nonobvious theorem shows that these two properties are
equivalent.
Proof. See Horn and Johnson [580, 1985, Thm. 5.5.10], or Stewart and
Sun [954, 1990, Thm. 2.1.3].
(6.5)
or, equivalently,
(Strictly speaking, this definition uses two different norms: one on C^m in the
numerator of (6.5) and one on C^n in the denominator. Thus the norm used
in the definition is assumed to form a family defined on C^s for any s.)
For the 1-, 2-, and ∞-vector norms it can be shown that
A norm for which ||UAV|| = ||A || for all unitary U and V is called a
unitarily invariant norm. These norms have an interesting theory, which we
will not explore here (see [581, 1991, §3.5] or [954, 1990, §2.3]). Only two
unitarily invariant norms will be needed for our analysis: the 2-norm and
the Frobenius norm. That these two norms are unitarily invariant follows
easily from the formulae above. For any unitarily invariant norm, the useful
property holds that ||A*|| = ||A||. The 2-norm satisfies the additional relation
||A*A||_2 = ||A||_2^2.
The unitary invariance of the 2- and Frobenius norms has implications for
error analysis, for it means that multiplication by unitary matrices does not
magnify errors. For example, if A is contaminated by errors E and
Q is unitary, then
Q(A + E)Q* = QAQ* + F, where ||F||_2 = ||E||_2. For a general similarity
transformation, in contrast, X(A + E)X^{-1} = XAX^{-1} + G gives only
||G||_2 = ||XEX^{-1}||_2 ≤ κ_2(X)||E||_2, where κ(X) = ||X|| ||X^{-1}|| is the
condition number of X. The condition number satisfies κ(X) ≥ 1 (κ_F(X) ≥ n^{1/2})
and can be arbitrarily large.
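The contrast is easy to observe in MATLAB (my own illustration): applying a unitary similarity to an error matrix E leaves its 2-norm unchanged, whereas a general, ill-conditioned similarity can magnify it by as much as the condition number of the transforming matrix.

rng(0); n = 8;
E = randn(n);
[Q,~] = qr(randn(n));                      % a random orthogonal matrix
X = randn(n)*diag(10.^(0:n-1));            % a deliberately ill-conditioned X
norm(Q*E*Q') / norm(E)                     % = 1 (up to roundoff)
norm(X*E/X)  / norm(E)                     % can be as large as cond(X)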
In perturbation theory and error analysis it is often necessary to switch
between norms. Therefore inequalities that bound one norm in terms of an-
other are required. It is well known that on a finite-dimensional space any
two norms differ by at most a constant that depends only on the dimension
(so-called norm equivalence). Tables 6.1 and 6.2 give attainable inequalities
for the vector and matrix norms of most interest.
The definition of subordinate matrix norm can be generalized by permit-
ting different norms on the input and output space:
(6.6)
Table 6.2. Constants α_pq such that ||A||_p ≤ α_pq ||A||_q for A ∈ C^{m×n}. Here, ||A||_M :=
max_{i,j} |a_{ij}| and ||A||_S := Σ_{i,j} |a_{ij}|.
for any third vector norm ||·||_γ. The choice α = 1 and β = ∞ produces the
max norm, mentioned above, ||A||_{1,∞} = max_{i,j} |a_{ij}|.
At least two important results about matrix norms hold for this mixed
subordinate norm. The first is a simple formula for the matrix condition
number of a nonsingular A defined by
Note that this definition uses the ||·||_{α,β} norm on the data space and the
||·||_{β,α} norm on the solution space, as is natural.
We need the following lemma.
Lemma 6.3. Given vector norms ||·||_α and ||·||_β and vectors x, y such
that ||x||_α = ||y||_β = 1, there exists a matrix B with ||B||_{α,β} = 1 such that
Bx = y.
Proof. Recall that the dual of the α-norm is defined by ||z||_D =
max_{||w||_α = 1} |z*w|. Let z be a vector dual to x, so that z*x = ||z||_D ||x||_α = 1,
and hence ||z||_D = 1. Let B = yz*. Then Bx = y and
    ||B||_{α,β} = max_{||w||_α = 1} ||yz*w||_β = ||y||_β ||z||_D = 1,
as required.
(6.9)
That (6.9) holds with the equality replaced by “≤” follows from two applica-
tions of (6.7). To show the opposite inequality, we have
(6.10)
where, for the lower bound, we have chosen y so that ||A^{-1}y||_α = ||A^{-1}||_{β,α},
and where A^{-1}y = ||A^{-1}||_{β,α} x with ||x||_α = 1. Now, from Lemma 6.3, there
exists a matrix ΔA with ||ΔA||_{α,β} = 1 such that ΔAx = y. In (6.10) this
gives the result, as required.
The next result concerns the relative distance to singularity for a matrix
It states that the relative distance to singularity is the reciprocal of the con-
dition number.
giving
6.3. The Matrix p-Norm
The matrix p-norm is the norm subordinate to the Hölder p-norm:
(6.11)
Formulae for ||A||_p are known only for p = 1, 2, ∞. For other values of p, how
to estimate or compute ||A||_p is an interesting problem, the study of which,
as well as being interesting in its own right, yields insight into the properties
of the 1, 2, and ∞ norms.
By taking x = e_j in (6.11), using (6.4), and using (6.21) below, we can
derive the bounds, for
(6.12)
(6.13)
Matrix norms can be compared using the following elegant result of Schnei-
der and Strang [900, 1962] (see also [580, 1985, Thm. 5.6.18]): if ||·|| a and ||·||β
denote two vector norms and the corresponding subordinate matrix norms,
then for
(6.14)
(6.15)
Note that, unlike for vectors, p1 ≤ p2 does not imply ||A||_{p1} ≥ ||A||_{p2}. The
result (6.15) implies, for example, that for all p ≥ 1
(6.16)
(6.17)
Figure 6.1. Plots of p versus ||A||_p, for 1 ≤ p ≤ 15. The fourth plot shows 1/p versus
log ||A||_p for the matrices in the first three plots.
Upper bounds for ||A||p that do not involve m or n can be obtained from
the interesting property that log ||A||p is a convex function of 1/p for p > 1
(see Figure 6.1), which is a consequence of the Riesz-Thorin theorem [503,
1952, pp. 214, 219], [450, 1991]. The convexity implies that if f(α) = ||A||_{1/α},
then for 0 ≤ α, β ≤ 1,
(6.18)
(6.19)
and
(6.20)
(6.21)
The bounds (6.16) and (6.17) imply that given the ability to compute
||A||_1, ||A||_2, and ||A||_∞ we can estimate ||A||_p correct to within a factor n^{1/4}.
These a priori estimates are at their best when p is close to 1, 2, or ∞, but in
general they will not provide even one correct significant digit. The bound in
(6.18) can be much smaller than the other upper bounds given above, but how
tight it is depends on how nearly log ||A||_p is linear in 1/p. Numerical methods
are needed to obtain better estimates; these are developed in Chapter 14.
Problems
Problems worthy
of attack
prove their worth
by hitting back.
-PIET HEIN, Grooks (1966)
6.1. Prove the inequalities given in Tables 6.1 and 6.2. Show that each
inequality in Table 6.2 (except the one for α_{S,2}) is attainable for a matrix of the
form A = xy^T, where x, y ∈ {e, e_j}, where e = [1, 1, ..., 1]^T. Show that equal-
ity in ||A||_S ≤ α_{S,2} ||A||_2 is attained for square real matrices A iff A is a scalar
multiple of a Hadamard matrix (see §9.3 for the definition of a Hadamard
matrix), and for square complex matrices if a_{rs} = exp(2πi(r - 1)(s - 1)/n)
(this is a Vandermonde matrix based on the roots of unity).
6.2. Let x, y ∈ C^n. Show that, for any subordinate matrix norm, ||xy*|| =
||x|| ||y||_D.
6.5. Show that ||ABC||_F ≤ ||A||_2 ||B||_F ||C||_2 for any A, B, and C such that
the product is defined. (This result remains true when the Frobenius norm is
replaced by any unitarily invariant norm [581, 1991, p. 211].)
6.7. Show that for any A ∈ C^{n×n} and any consistent matrix norm, ρ(A) ≤
||A||, where ρ is the spectral radius.
6.8. Show that for any A ∈ C^{n×n} and δ > 0 there is a consistent norm ||·||
(which depends on A and δ) such that ||A|| ≤ ρ(A) + δ, where ρ is the spectral
radius. Hence show that if ρ(A) < 1 then there is a consistent norm ||·|| such
that ||A|| < 1.
6.9. Let A ∈ C^{m×n}. Use the SVD to find expressions for ||A||_2 and ||A||_F
in terms of the singular values of A. Hence obtain a bound of the form
c1 ||A|| 2 < ||A|| F < c2 ||A|| 2, where c 1 and c2 are constants that depend on n.
When is there equality in the upper bound? When is there equality in the
lower bound?
6.10. Show that
(Rohn [879, 1995] shows that the problem of computing ||A||_{∞,1} is NP-hard.)
6.13. Prove that if H ∈ IR^{n×n} is a Hadamard matrix then
(6.22)
(6.23)
Chapter 7
Perturbation Theory for Linear
Systems
Theorem 7.1. The normwise backward error
    η_{E,f}(y) := min{ ε : (A + ΔA)y = b + Δb,  ||ΔA|| ≤ ε||E||,  ||Δb|| ≤ ε||f|| }     (7.1)
is given by
    η_{E,f}(y) = ||r|| / (||E|| ||y|| + ||f||),     (7.2)
where r = b - Ay.
Proof. It is straightforward to show that the right-hand side of (7.2) is a
lower bound for η_{E,f}(y). This lower bound is attained for the perturbations
(7.3)
Theorem 7.2. Let Ax = b and (A + ΔA)y = b + Δb, where ||ΔA|| ≤ ε||E||
and ||Δb|| ≤ ε||f||, and assume that ε||A^{-1}|| ||E|| < 1. Then
(7.4)
to x.
Associated with the way of measuring perturbations used in these two
theorems is the normwise condition number
For the choice E = A and f = b we have κ(A) ≤ κ_{E,f}(A, x) ≤ 2κ(A), and the
bound (7.4) can be weakened slightly to yield the familiar form
Next, let b = Ae, where e = [1, 1, ..., 1]^T, and let z be the solution to the
perturbed system (A + ΔA)z = b + Δb, where ΔA = tol|A| and Δb = tol|b|,
with tol = 8u. We find that
(7.5)
in (7.6), and so if w_{E,f}(y) is small then y solves a problem that is close to
the original one in the sense of componentwise relative perturbations and has
the same sparsity pattern. Another attractive property of the componentwise
relative backward error is that it is insensitive to the scaling of the system: if
Ax = b is scaled to (S1AS2)(S2^{-1}x) = S1b, where S1 and S2 are diagonal, and
y is scaled to S2^{-1}y, then w remains unchanged.
The choice E = |A|ee^T, f = |b| gives a row-wise backward error. The
constraint |ΔA| ≤ εE is now |Δa_{ij}| ≤ εα_i, where α_i is the 1-norm of the ith
row of A, so perturbations to the ith row of A are being measured relative to
the norm of that row. A columnwise backward error can be formulated in a
similar way, by taking E = eeT|A| and f =
The third natural choice of tolerances is E = ||A||ee^T and f = ||b||e, for
which w_{E,f}(y) is the same as the normwise backward error η_{E,f}(y), up to a
constant.
As for the normwise backward error in Theorem 7.1, there is a simple
formula for w_{E,f}(y).
Theorem 7.3. The componentwise backward error is given by
    w_{E,f}(y) = max_i |r_i| / (E|y| + f)_i,     (7.7)
where r = b - Ay.
Proof. It is easy to show that the right-hand side of (7.7) is a lower bound
for w(y), and that this bound is attained for the perturbations
    ΔA = D1ED2,   Δb = -D1f,     (7.8)
Theorem 7.4. Let Ax = b and (A + ΔA)y = b + Δb, where |ΔA| ≤ εE and
|Δb| ≤ εf, and assume that ε|| |A^{-1}|E || < 1, where ||·|| is an absolute norm.
Then
(7.9)
Proof. The bound (7.9) follows easily from the equation A(y - x) =
Δb - ΔAx + ΔA(x - y). For the ∞-norm the bound is attained, to first
order in ε, for perturbations of the form ΔA = εD1ED2 and Δb = -εD1f, where D2 = diag(sign(xi))
and D1 = diag(ξj) with ξj = sign((A^{-1})_{kj}) and
is given by
(7.10)
For the special case E = |A| and f = |b| we have the condition numbers
introduced by Skeel [919, 1979]:
    cond(A, x) := || |A^{-1}||A||x| ||_∞ / ||x||_∞,     cond(A) := cond(A, e) = || |A^{-1}||A| ||_∞.     (7.11)
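In MATLAB these condition numbers can be coded directly as follows (my own sketch; the explicit inverse is used only for illustration on small matrices, since in practice one would estimate || |A^{-1}||A| ||_∞ without forming A^{-1}):

cond_Ax = @(A,x) norm(abs(inv(A))*abs(A)*abs(x), inf)/norm(x, inf);
cond_A  = @(A)   norm(abs(inv(A))*abs(A), inf);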
How does cond compare with κ? Since cond(A) is invariant under row
scaling Ax = b → (DA)x = Db, where D is diagonal, it can be arbitrarily
smaller than κ_∞(A). In fact, it is straightforward to show that
    (7.12)
where the optimal scaling DR equilibrates the rows of A, that is, DRA has
rows of unit 1-norm (DR|A|e = e).
Chandrasekaran and Ipsen [197, 1995] note the following inequalities. First,
with DR as just defined,
    (7.13)
(these inequalities imply (7.12)). Thus cond(A) can be much smaller than
κ_∞(A) only when the rows of A are badly scaled. Second, if DC equilibrates
the columns of A (e^T|A|DC = e^T) then
These inequalities show that cond(A, x) can be much smaller than κ_∞(A) only
when the columns of either A or A^{-1} are badly scaled.
Returning to the numerical example of §7.1, we find that w_{E,f}(y) = 1.10 ×
10^{-12} for E = |A| and f = |b| or f = 0. This tells us that if we measure
changes to A in a componentwise relative sense, then for y to be the first
column of the inverse of a perturbed matrix we must make relative changes to
A four orders of magnitude larger than the unit roundoff. For the perturbed
system, Theorem 7.4 with ε = tol, E = |A|, and f = |b| gives the bound
which is eight orders of magnitude smaller than the normwise bound from
Theorem 7.2, and only a factor 170 larger than the actual forward error (7.5).
An example of Kahan [626, 1966] is also instructive. Let
(7.14)
Theorem 7.5 (van der Sluis). Let A ∈ C^{m×n} have full rank, let D
denote the set of nonsingular diagonal matrices, and define
Then
(7.15)
(7.16)
(7.17)
Therefore
(7.18)
Now, for any D
(7.19)
using the first inequality in (7.17). Multiplying (7.18) and (7.19) and min-
imizing over D, we obtain (7.15). Inequality (7.16) follows by noting that
κ_p(DA) = κ_q(A^T D), where p^{-1} + q^{-1} = 1 (see (6.21)).
For p = ∞, (7.16) confirms what we already know from (7.12) and (7.13):
that in the ∞-norm, row equilibration is an optimal row scaling strategy.
Similarly, for p = 1, column equilibration is the best column scaling, by
(7.15). Theorem 7.5 is usually stated for the 2-norm, for which it shows that
row and column equilibration produce condition numbers within factors √m
and √n, respectively, of the minimum 2-norm condition numbers achievable
by row and column scaling.
As a corollary of Theorem 7.5 we have the result that among two-sided
diagonal scalings of a symmetric positive definite matrix, the one that gives
A a unit diagonal is not far from optimal.
(7.20)
A (that is, there exists a permutation matrix P such that PAP T can be
expressed as a block 2 × 2 matrix whose (1,1) and (2,2) blocks are diagonal).
Thus, for example, any symmetric positive definite tridiagonal matrix with
unit diagonal is optimally scaled.
We note that by using (6.22) in place of (7.17), the inequalities of Theo-
rem 7.5 and Corollary 7.6 can be strengthened by replacing m and n with the
maximum number of nonzeros per column and row, respectively.
Here is an independent result for the Frobenius norm.
where the maximum is taken over all signature matrices Si = diag(±1) and
where
for a constant γn. The lower bound always holds and Demmel identifies
several classes of matrices for which the upper bound holds. This conjecture
is both plausible and aesthetically pleasing because d_{|A|}(A) is invariant under
two-sided diagonal scalings of A and ρ(|A^{-1}||A|) is the minimum ∞-norm
condition number achievable by such scalings, as shown by Theorem 7.8.
7.5. Extensions
The componentwise analyses can be extended in three main ways.
(1) We can use more general measures of size for the data and the solution.
Higham and Higham [528, 1992] measure ΔA, Δb, and Δx by
This latter case has also been considered by Rohn [876, 1989] and Gohberg
and Koltracht [455, 1993]. It is also possible to obtain individual bounds
for the components of the error by refraining from taking norms in the analysis: see
Chandrasekaran and Ipsen [197, 1995] and Problem 7.1.
(2) The backward error results and the perturbation theory can be ex-
tended to systems with multiple right-hand sides. For the general vp measure
described in (1), the backward error can be computed by finding the mini-
mum p-norm solutions to n underdetermined linear systems. For details, see
Higham and Higham [528, 1992].
(3) Structure in A and b can be preserved in the analysis. For example, if A
is symmetric or Toeplitz then its perturbation can be forced to be symmetric
or Toeplitz too, while still using componentwise measures. References include
Higham and Higham [527, 1992] and Gohberg and Koltracht [455, 1993] for
linear structure, and Bartels and D. J. Higham [76, 1992] for Vandermonde
structure. A symmetry-preserving normwise backward error is explored by
Bunch, Demmel, and Van Loan [163, 1989], while Smoktunowicz [930, 1995]
considers the componentwise case (see Problem 7.11). Symmetry-preserving
normwise condition numbers are treated by D. J. Higham [526, 1995].
7.6. Numerical Stability
There are so many possible definitions of numerical stability, tailored to
different problems, that to define and name each one tends to cloud the issues
of interest. We therefore adopt an informal approach.
A numerical method for solving a square, nonsingular linear system Ax = b
is normwise backward stable if it produces a computed solution whose normwise
backward error is of order the unit roundoff. How large we allow this backward error to be,
while still declaring the method backward stable, depends on the context. It
is usually implicit in this definition that the backward error is O(u) for all A and b, and
a method that yields a backward error of O(u) for a particular A and b is said to have
performed in a normwise backward stable manner.
The significance of normwise backward stability is that the computed so-
lution solves a slightly perturbed problem, and if the data A and b contain
uncertainties bounded only normwise (with a relative perturbation to A of
order u, and similarly for b), then the computed solution may be the exact
solution to the problem we wanted to solve, for all we know.
Componentwise backward stability is defined in a similar way: we now re-
quire the componentwise backward error to be of order u. This is a
more stringent requirement than normwise backward stability. The rounding
errors incurred by a method that is componentwise backward stable are in
size and effect equivalent to the errors incurred in simply converting the data
A and b to floating point numbers before the solution process begins.
If a method is normwise backward stable then, by Theorem 7.2, the for-
ward error is bounded by a multiple of κ(A)u. However, a method
can produce a solution whose forward error is bounded in this way without the
normwise backward error being of order u. Hence it is useful to define
a method for which the forward error is bounded by a multiple of κ(A)u
as normwise forward stable. By similar reasoning involving cond(A, x), we say a method is componentwise
forward stable if the forward error is bounded by a multiple of cond(A, x)u. Table 7.1 summarizes the
definitions and the relations between them. There are several examples in
this book of linear-equation-solving algorithms that are forward stable but
not backward stable: Cramer’s rule for n = 2 (§1.10.1), Gauss-Jordan elim-
ination (§13.4), and the seminormal equations method for underdetermined
systems (§20.3).
Other definitions of numerical stability can be useful (for example, row-
wise backward stability requires a rowwise bound on the backward perturbation),
and they will be introduced when needed.
(7.23)
Lemma 7.9. The upper bound in (7.23) is at least as large as the upper bound
in
(7.25)
(7.26)
Therefore a strict bound, and one that should be used in practice in place of
(7.25), is
(7.27)
This bound shows that κ(A) is a key quantity in measuring the sensitivity of
a linear system. A componentwise bound could have been obtained just as
easily.
We normally express perturbations of the data in the form
To use the calculus framework we can take A(0) as the original matrix A and
write but the perturbation bound
then becomes a first-order one.
The calculus technique is a useful addition to the armoury of the error
analyst (it is used by Golub and Van Loan [470, 1989], for example), but the
algebraic approach is preferable for deriving rigorous perturbation bounds of
the standard forms.
of analysis can also be found in the book by Kuperman [681, 1971], which
includes an independent derivation of Theorem 7.3. See also Hartfiel [505,
1980].
Theorem 7.4 is a straightforward generalization of a result of Skeel [919,
1979, Thms. 2.1 and 2.2]. It is clear from Bauer’s comments in [80, 1966]
that the bound (7.9), with E = |A| and f = |b|, was known to him, though
he does not state the bound. This is the earliest reference we know in which
componentwise analysis is used to derive forward perturbation bounds.
Theorem 7.8 is from Bauer [79, 1963]. Bauer actually states that equality
holds in (7.21) for any A, but his proof of equality is valid only when |A^{-1}||A|
and |A||A^{-1}| have positive Perron vectors. Businger [168, 1968] proves that
a sufficient condition for the irreducibility condition of Theorem 7.8 to hold
(which, of course, implies the positivity of the Perron vectors) is that there do
not exist permutations P and Q such that PAQ is in block triangular form.
Theorem 7.7 is from Stewart and Sun [954, 1990, Thm. 4.3.5].
Further results on scaling to minimize the condition number κ(A) are given
by Forsythe and Straus [386, 1955], Bauer [81, 1969], Golub and Varah [465,
1974], McCarthy and Strang [742, 1974], Shapiro [913, 1982], [914, 1985], [915,
1991], and Watson [1067, 1991].
Chan and Foulser [193, 1988] introduce the idea of “effective conditioning”
for linear systems, which takes into account the projections of b onto the range
space of A. See Problem 7.5, and for an application to partial differential
equations see Christiansen and Hansen [208, 1994].
For an example of how definitions of numerical stability for linear equa-
tion solvers can be extended to incorporate structure in the problem, see
Bunch [162, 1987].
An interesting application of linear system perturbation analysis is to
Markov chains. A discrete-time Markov chain can be represented by a square
matrix P, where pij is the probability of a transition from state i to state j .
Since state i must lead to some state, Σj pij = 1 with pij ≥ 0, and these conditions
can be written in matrix-vector form as
Pe = e,  where e = [1, 1, . . . , 1]^T. (7.28)
by Ipsen and Meyer [605, 1994], who measure the perturbation matrix norm-
wise, and by O’Cinneide [800, 1993], who measures the perturbation matrix
componentwise. For a matrix-oriented development of Markov chain theory
see Berman and Plemmons [94, 1994].
It is possible to develop probabilistic perturbation theory for linear systems
and other problems by making assumptions about the statistical distribution
of the perturbations. We do not consider this approach here (though see Prob-
lem 7.13), but refer the interested reader to the papers by Fletcher [376, 1985],
Stewart [948, 1990], and Weiss, Wasilkowski, Wozniakowski, and Shub [1073,
1986].
Problems
7.1. Under the conditions of Theorem 7.4, show that
Obtain an explicit formula for xE,f(A, x). Show that xE,f(A, x) ≥ 1 if
E = |A| or f = |b|. Derive the corresponding normwise condition number,
in which the constraints are ||∆A||2 ≤ and ||∆b||2 ≤
7.9. (Bauer [79, 1963]) Let A, B, C (a) Prove that if B and C have
positive elements then
(d) Strengthen (b) by showing that for any nonsingular A such that
|A||A^{-1}| and |A^{-1}||A| are irreducible,
where the and δi are independent random variables, each having zero
mean and variance σ^2. (As usual, the matrix E and vector f represent fixed
tolerances.) Let ε denote the expected value.
(a) Show that
Chapter 8
Triangular Systems
In the end there is left the coefficient of one unknown and the constant term.
An elimination between this equation and
one from the previous set that contains two unknowns
yields an equation with the coefficient of
another unknown and another constant term, etc.
The quotient of the constant term by the unknown
yields the value of the unknown in each case.
-JOHN V. ATANASOFF, Computing Machine for the Solution of
Large Systems of Linear Algebraic Equations (1940)
• a triangular matrix may be much more or less ill conditioned than its
transpose; and
• the use of pivoting in LU, QR, and Cholesky factorizations can greatly
improve the conditioning of a resulting triangular system.
xn = bn/unn
for i = n-1:-1:1
    s = bi
    for j = i+1:n
        s = s - uij xj
    end
    xi = s/uii
end
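For readers who want to experiment, the following is a direct transcription of the back substitution loop above into NumPy (an illustrative sketch, not code from the book; it assumes U is nonsingular).

import numpy as np

def back_substitution(U, b):
    # Solve Ux = b for upper triangular, nonsingular U by the loop above.
    n = U.shape[0]
    x = np.empty(n)
    x[n-1] = b[n-1] / U[n-1, n-1]
    for i in range(n-2, -1, -1):
        s = b[i]
        for j in range(i+1, n):
            s -= U[i, j] * x[j]
        x[i] = s / U[i, i]
    return x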
We will not state the analogous algorithm for solving a lower triangu-
lar system, forward substitution. All the results below for back substitution
have obvious analogues for forward substitution. Throughout this chapter T
denotes a matrix that can be upper or lower triangular.
To analyse the errors in substitution we need the following lemma.
s = c
for i = 1:k-1
    s = s - ai bi
end
y = s/bk
Then the computed ŷ satisfies
(8.1)
Theorem 8.3 holds only for the particular ordering of arithmetic operations
used in Algorithm 8.1. A result that holds for any ordering is a consequence
of the next lemma.
Proof. The result is not hard to see after a little thought, but a formal
proof is tedious to write down. Note that the ordering used in Lemma 8.2 is
the one for which this lemma is least obvious! The last part of the lemma
is useful when analysing unit lower triangular systems, and in various other
contexts.
In technical terms, this result says that the computed solution has a tiny componentwise relative
backward error. In other words, the backward error is about as small as we
could possibly hope.
In most of the remaining error analyses in this book, we will derive re-
sults that, like the one in Theorem 8.5, do not depend on the ordering of
the arithmetic operations. Results of this type are more general, usually no
less informative, and easier to derive, than ones that depend on the order-
ing. However, it is important to realise that the actual error does depend on
the ordering, possibly strongly so for certain data. This point is clear from
Chapter 4 on summation.
8.2. Forward Error Analysis
where
(8.2)
for which
(8.3)
(8.4)
Then the unit upper triangular matrix W = |U^{-1}||U| satisfies wij < 2^{j-i} for
all j > i.
Theorem 8.7. Under the conditions of Lemma 8.6, the computed solution
to Ux = b obtained by substitution satisfies
Lemma 8.6 shows that for matrices satisfying (8.4), cond(T) is bounded
for fixed n, no matter how large κ(T). The error bounds in Theorem 8.7,
although large if n is large and i is small, decay exponentially with increasing
i; thus, later components of x are always computed to high accuracy relative
to the elements already computed.
Analogues of Lemma 8.6 and Theorem 8.7 hold for lower triangular L
satisfying
|lii| > |lij| for all j < i. (8.5)
Note, however, that if the upper triangular matrix T satisfies (8.4) then T^T
does not necessarily satisfy (8.5). In fact, cond(T^T) can be arbitrarily large,
as shown by the example
It is easy to see that such a matrix has an inverse with nonnegative elements,
and hence is an M-matrix (for definitions of an M-matrix see Appendix B).
Associated with any square matrix A is the comparison matrix M(A), with
elements mii = |aii| and mij = -|aij| for i ≠ j. (8.6)
Proof. The inequality follows from |T^{-1}| ≤ M(T)^{-1}, together with |T| =
|M(T)|. Since M(T)^{-1} ≥ 0, we have
This means, for example, that the system U(1)x = b (see (8.2)), where x = e,
is about as ill conditioned with respect to componentwise relative perturba-
tions in U(1) as it is with respect to normwise perturbations in U(1).
The next result gives a forward error bound for substitution that is proved
directly, without reference to the backward error result Theorem 8.5 (indeed, it
cannot be obtained from that result!). The bound can be very weak, because
||M(T)^{-1}|| can be arbitrarily larger than ||T^{-1}|| (see Problem 8.2), but it
yields a pleasing bound in the special case described in the corollary.
so that
(8.7)
Write
8.3. Bounds for the Inverse
In this section we describe bounds for the inverse of a triangular matrix and
show how they can be used to bound condition numbers. All the bounds in
this section have the property that they depend only on the absolute values
of the elements of the matrix. The norm estimation methods of Chapter 14,
on the other hand, do take account of the signs of the elements.
The bounds are all based on the following theorem, whose easy proof we
omit.
yn = 1/|unn|
for i = n-1:-1:1
    s = 1
    for j = i+1:n
        s = s + |uij| yj
    end
    yi = s/|uii|
end
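The loop above solves M(U)y = e, where e is the vector of ones; since M(U)^{-1} is nonnegative, max_i yi equals ||M(U)^{-1}||_∞ and hence bounds ||U^{-1}||_∞. A NumPy sketch of the same computation (the function name is mine, not the book's):

import numpy as np

def comparison_bound_inf_norm(U):
    # Solve M(U) y = e by back substitution; max(y) = ||M(U)^{-1}||_inf
    # is an upper bound for ||U^{-1}||_inf, since |U^{-1}| <= M(U)^{-1}.
    n = U.shape[0]
    y = np.empty(n)
    y[n-1] = 1.0 / abs(U[n-1, n-1])
    for i in range(n-2, -1, -1):
        s = 1.0
        for j in range(i+1, n):
            s += abs(U[i, j]) * y[j]
        y[i] = s / abs(U[i, i])
    return y.max()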
µ=
How good are these upper bounds? We know from Problem 8.2 that the
ratio ||M(T)^{-1}||/||T^{-1}|| can be arbitrarily large; therefore any of the upper
bounds can be arbitrarily poor. However, with suitable assumptions on T,
more can be said.
It is easy to show that if T is bidiagonal then |T^{-1}| = M(T)^{-1}. Since
a bidiagonal system can be solved in O(n) flops, it follows that the three
condition numbers of interest can each be computed exactly in O(n) flops
when T is bidiagonal.
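A sketch of the bidiagonal case (my own illustration, assuming an upper bidiagonal B stored as its diagonal d and superdiagonal e): because |B^{-1}| = M(B)^{-1}, the quantity ||B^{-1}||_∞, and hence κ∞(B), follows from a single O(n) recurrence.

import numpy as np

def cond_inf_upper_bidiagonal(d, e):
    # B has diagonal d (length n) and superdiagonal e (length n-1).
    # Since |B^{-1}| = M(B)^{-1}, we get ||B^{-1}||_inf = max_i z_i where
    # M(B) z = ones, solved by an O(n) backward recurrence.
    n = len(d)
    z = np.empty(n)
    z[n-1] = 1.0 / abs(d[n-1])
    for i in range(n-2, -1, -1):
        z[i] = (1.0 + abs(e[i]) * z[i+1]) / abs(d[i])
    norm_B = np.max(np.abs(d) + np.r_[np.abs(e), 0.0])   # ||B||_inf (row sums)
    return norm_B * z.max()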
As in the previous section, triangular matrices that result from a pivoting
strategy also lead to a special result.
(8.8)
Proof. The left-hand inequality is trivial. The right-hand inequality
follows from the expression (see Problem 8.5),
together with ||A||2 <
The inequalities from the second on in (8.8) are all equalities for the matrix
with uii = 1 and uij = -1 (j > i). The question arises of whether equality is
possible for the upper triangular matrices arising from QR factorization with
column pivoting, which satisfy the inequalities (see Problem 18.5)
(8.9)
(8.10)
Thus as where
and hence, for small θ,
(8.11)
where all the products appearing within a particular size of parenthesis can
be evaluated in parallel. In general, the evaluation can be expressed as a
binary tree of depth [log(n + 1)] + 1, with products M1 b and Mi Mi-1 (i =
3, 5, . . . , 2[(n - 1)/2] + 1) at the top level and a single product yielding x at
the bottom level. This algorithm was proposed and analysed by Sameh and
Brent [889, 1977], who show that it can be implemented in
time steps on processors. The algorithm requires about n^3/10
operations and thus is of no interest for serial computation. Some pertinent
comments on the practical significance of log n terms in complexity results are
given by Edelman [341, 1993].
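The following serial NumPy sketch (my own illustration, not Sameh and Brent's implementation) mimics the fan-in evaluation for a unit lower triangular system: the elementary factors are inverted explicitly and the product is accumulated pairwise, as in the binary tree just described.

import numpy as np

def fanin_solve_unit_lower(L, b):
    # L is unit lower triangular; L = L_1 L_2 ... L_n with elementary factors,
    # so x = inv(L_n) ... inv(L_1) b.  Each inverse is obtained by negating the
    # relevant subdiagonal column, and the product is evaluated as a binary tree.
    n = L.shape[0]
    nodes = []
    for i in range(n):
        Mi = np.eye(n)
        Mi[i+1:, i] = -L[i+1:, i]
        nodes.append(Mi)
    nodes[0] = nodes[0] @ b            # top level: M_1 b
    while len(nodes) > 1:              # combine adjacent pairs, preserving order
        merged = [nodes[j+1] @ nodes[j] for j in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            merged.append(nodes[-1])
        nodes = merged
    return nodes[0]

L = np.tril(np.random.rand(7, 7), -1) + np.eye(7)
b = np.random.rand(7)
print(np.allclose(fanin_solve_unit_lower(L, b), np.linalg.solve(L, b)))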
To derive an error bound while avoiding complicated notation that ob-
scures the simplicity of the analysis, we take n = 7. The result we obtain is
easily seen to be valid for all n. We will not be concerned with the precise
values of constants, so we write cn for a constant depending on n . We assume
that the inverses Mi are formed exactly, because the errors in forming
them affect only the constants. From the error analysis of matrix-vector and
matrix-matrix multiplication (§3.5), we find that the computed solution
satisfies
(8.13)
where
All these terms share the same upper bound, which we derive for just one of
them. For j = 5, k =
(8.14)
(8.15)
By considering the binary tree associated with the fan-in algorithm, and
using the fact that the matrices at the ith level of the tree have at most 2^{i-1}
nontrivial columns, it is easy to see that we can take dn = a n log n, where a
is a constant of order 1.
It is not hard to find numerical examples where the bound in (8.15) is
approximately attained (for dn = 1) and greatly exceeds the magnitude
required for normwise backward stability. One way to construct
such examples is to use direct search (see Chapter 24).
The key fact revealed by (8.15) is that the fan-in algorithm is only condi-
tionally stable. In particular, the algorithm is normwise backward stable if L
is well conditioned. A special case in which (8.14) simplifies is when L is an M-matrix
and b > 0: Problem 8.4 shows that in this case |L^{-1}||L||x| ≤ (2n-1)|x|,
so (8.14) yields a bound of (2n - 1)^2 dn u |L||x| + O(u^2), and we have compo-
nentwise backward stability (to first order).
(8.16)
which was proved by Sameh and Brent [889, 1977] (with an = ¼ n^2 log n +
O(n log n)). However, (8.16) is a much weaker bound than (8.14) and (8.15).
In particular, a diagonal scaling of the system Lx = b (where each Dj is
diagonal) leaves (8.14) (and, to a lesser extent, (8.15)) essentially unchanged,
but can change the bound (8.16) by an arbitrary amount.
A forward error bound can be obtained directly from (8.13). We find that
(8.17)
where M(L) is the comparison matrix (a bound of the same form as that
in Theorem 8.9 for substitution; see the Notes and References and Prob-
lem 8.10), which can be weakened to
(8.18)
(8.19)
A result of the form of Theorem 8.9 holds for any triangular system solver
that does not rely on algebraic cancellation; in particular, for the fan-in al-
gorithm, as already seen in (8.17). See Problem 8.10 for a more precise for-
mulation of this general result.
The bounds in §8.3 have been investigated by various authors. The unified
presentation given here is based on Higham [534, 1987]. Karasalo [642, 1974]
derives an O(n) flops algorithm for computing ||M(T)^{-1}||F. Manteuffel [726,
1981] derives the first two inequalities in Theorem 8.11, and Algorithm 8.12.
A different derivation of the equations in Algorithm 8.12 is given by Jen-
nings [613, 1982, §9]. The formulae given in Problem 8.5 are derived directly
as upper bounds for by Lemeire [699, 1975].
That can be computed in O(n) flops when B is bidiagonal, as
was first pointed out by Higham [531, 1986]. Demmel and
Kahan [296, 1990] derive an estimate for the smallest singular value σmin of
a bidiagonal matrix B by using the inequality where
They compute in O(n) flops as
Section 8.4 is adapted from Higham [560, 1995], in which error analysis is
given for several parallel methods for solving triangular systems.
The fan-in method is topical because the fan-in operation is a special case
of the parallel prefix operation and several fundamental computations in linear
algebra are amenable to a parallel prefix-based implementation, as discussed
by Demmel [287, 1992], [288, 1993]. (For a particularly clear explanation of the
parallel prefix operation see the textbook by Buchanan and Turner [154, 1992,
§13.21.) The important open question of the stability of the parallel prefix
implementation of Sturm sequence evaluation for the symmetric tridiagonal
eigenproblem has recently been answered by Mathias [734, 1995]. Mathias
shows that for positive definite matrices the relative error in a computed minor
can be as large as a multiple of where is the smallest eigenvalue of
the matrix; the corresponding bound for serial evaluation involves The
analogy with (8.19), where we also see a condition cubing effect, is intriguing.
Higham and Pothen [568, 1994] analyse the stability of the “partitioned
inverse method” for parallel solution of sparse triangular systems with many
right-hand sides. This method has been studied by several authors in the
1990s; see Alvarado, Pothen, and Schreiber [13, 1993] and the references
therein. The idea of the method is to factor a sparse triangular matrix
L as L = L1 L2 . . . Ln = G1 G2 . . . Gm, where each Gi is a prod-
uct of consecutive Lj terms and 1 ≤ m ≤ n, with m as small as possible
subject to the Gi being sparse. Then the solution to Lx = b is evaluated as
x = Gm^{-1} . . . G1^{-1} b, where each Gi^{-1} is formed explicitly and the product is evaluated from right
to left. The advantage of this approach is that x can be computed in m serial
steps of parallel matrix-vector multiplication.
8.5.1. LAPACK
Computational routine xTRTRS solves a triangular system with multiple right-
hand sides; xTBTRS is an analogue for banded triangular matrices. There is
no driver routine for triangular systems.
Problems
Before you start an exercise session,
make sure you have a glass of water and
a mat or towel nearby.
-MARIE HELVIN, Model Tips for a Healthy Future (1994)
8.1. Show that under the no-guard-digit model (2.6), Lemma 8.2 remains
true if (8.1) is changed to
8.2. Show that for a triangular matrix T the ratio ||M(T)^{-1}||/||T^{-1}|| can be
arbitrarily large.
8.3. Suppose the unit upper triangular matrix U satisfies |uij| ≤ 1
for j > i. By using Theorem 8.9, show that the computed solution from
substitution on Ux = b satisfies
8.7. Bounds from diagonal dominance. (a) Prove the following result (Ahlberg
and Nilson [8, 1963], Varah [1049, 1975]): if A (not necessarily trian-
gular) satisfies
for some positive diagonal matrix D = diag(di) (that is, AD is strictly diag-
onally dominant by rows), then
(c) Use part (b) to provide another derivation of the upper bound
(8.20)
Give an example of a triangular system solver for which (8.20) is not satisfied.
Chapter 9
LU Factorization and Linear Equations
At the start of the kth stage, rows k and r and columns k and s are
interchanged, where
Note that this requires O(n^3) comparisons in total, compared with O(n^2)
for partial pivoting. Because of the searching overhead, and because partial
pivoting works so well, complete pivoting is rarely used in practice.
Much insight into GE is obtained by expressing it in matrix notation. We
can write
and so
From (9.1) follows the expression ukk = det(Ak)/det(Ak-1). In fact, all
the elements of L and U can be expressed by determinantal formulae (see,
e.g., Gantmacher [413, 1959, p. 35] or Householder [587, 1964, p. 11]):
(9.2a)
(9.2b)
If these nonlinear equations are examined in the right order, they are easily
solved. For added generality let A be m × n (m ≥ n) and consider an LU
factorization with an m × n L and an n × n U (L is lower trapezoidal: lij = 0
for i < j). Suppose we know the first k - 1 columns of L and the first k - 1
rows of U. Setting lkk = 1,
(9.3)
(9.4)
We can solve for the boxed elements in the kth row of U and then the kth
column of L. This process is called Doolittle’s method.
for k = 1:n
    for j = k:n
        (*)
    end
    for i = k+1:m
        (**)
    end
end
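A hedged NumPy sketch of Doolittle's method (my own code, using the standard update formulas for the kth row of U and kth column of L; the book's (*) and (**) may be written differently, and no pivoting is performed, so the sketch can break down if a leading principal submatrix is singular):

import numpy as np

def doolittle_lu(A):
    # Doolittle's method: lkk = 1; at step k compute row k of U, then column k of L.
    m, n = A.shape
    L = np.eye(m, n)
    U = np.zeros((n, n))
    for k in range(n):
        for j in range(k, n):                       # (*): kth row of U
            U[k, j] = A[k, j] - L[k, :k] @ U[:k, j]
        for i in range(k+1, m):                     # (**): kth column of L
            L[i, k] = (A[i, k] - L[i, :k] @ U[:k, k]) / U[k, k]
    return L, U

A = np.random.rand(5, 4)
L, U = doolittle_lu(A)
print(np.allclose(L @ U, A))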
(9.5)
and similarly for (9.3). Had we chosen the normalization uii ≡ 1, we would
have obtained the Crout method. The Crout and Doolittle methods are well
suited to calculations by hand or with a desk calculator, because they obviate
the need to store the intermediate quantities aij^(k). They are also attractive
when we can accumulate inner products in extended precision.
It is straightforward to incorporate partial pivoting into Doolittle’s method
(see, e.g., Stewart [941, 1973, p. 138]). However, complete pivoting cannot be
incorporated without changing the method.
The assignments (*) and (**) in Algorithm 9.2 are of the form y = ( c -
which is analysed in Lemma 8.4. Applying the lemma, we
deduce that, no matter what the ordering of the inner products, the computed
matrices L and U satisfy (with lkk := 1)
(9.6)
With only a little more effort, a backward error result can be obtained for
the solution of Ax = b.
(9.7)
Thus
so that
(9.8)
This result says that the computed solution has a small componentwise relative backward error.
One class of matrices that has nonnegative LU factors is defined as follows.
A matrix is totally positive (nonnegative) if the determinant of every square
submatrix is positive (nonnegative). In particular, this definition requires that
aij and det(A) be positive or nonnegative. Some examples of totally nonneg-
ative matrices are given in Chapter 26. If A is totally nonnegative then it has
an LU factorization A = LU in which L and U are totally nonnegative, so
that L ≥ 0 and U ≥ 0 (see Problem 9.6); moreover, the computed factors
are nonnegative for sufficiently small values of the unit roundoff u [273,
1977]. Inverses of totally nonnegative matrices also have the property that
|A| = |L||U| (see Problem 9.7). Note that the property of a matrix or its
inverse being totally nonnegative is generally destroyed under row permuta-
tions. Hence for totally nonnegative matrices and their inverses it is best to
use Gaussian elimination without pivoting.
One important fact that follows from (9.6) and (9.7) is that the stability
of GE is determined not by the size of the multipliers but by the size of the
matrix |L||U|. This matrix can be small when the multipliers are large, and
large when the multipliers are of order 1 (as we will see in the next section).
To understand the stability of GE further we turn to norms. For GE with-
out pivoting, the ratio || |L||U| ||/||A|| can be arbitrarily large. For example,
for the matrix [ε 1; 1 1] the ratio is of order ε^{-1}. Assume then that partial piv-
oting is used. Then |lij| ≤ 1 for all i > j, since the lij are the multipliers.
And it is easy to show by induction that |uij| ≤ 2^{i-1} max_{k≤i} |akj|. Hence, for
partial pivoting, L is small and U is bounded relative to A.
Traditionally, backward error analysis for GE is expressed in terms of the
growth factor
    ρn = max_{i,j,k} |aij^(k)| / max_{i,j} |aij|,
which involves all the elements aij^(k) (k = 1:n) that occur during the elimina-
tion. Using the bound |aij^(k)| ≤ ρn max_{i,j} |aij| we obtain the following
classic theorem.
(9.9)
For these matrices, no interchanges are required by partial pivoting, and there
is exponential growth of elements in the last column of the reduced matrices.
In fact, this is just one of a nontrivial class of matrices for which partial
pivoting achieves maximal growth. When necessary in the rest of this chapter,
we denote the growth factor for partial pivoting by and that for complete
pivoting by
Theorem 9.6 (Higham and Higham). All real n × n matrices A for which
the growth factor for partial pivoting equals 2^{n-1} are of the form
integral equation gives rise to linear systems for which partial pivoting again
gives large growth factors.
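The growth factor is easy to monitor experimentally. A minimal NumPy sketch (not the book's code) that performs GE with partial pivoting and returns ρn computed from the intermediate elements:

import numpy as np

def growth_factor_partial_pivoting(A):
    # Returns rho_n = max_{i,j,k} |aij^(k)| / max_{i,j} |aij| for GE with
    # partial pivoting, monitoring the active submatrix after each stage.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    max_orig = np.abs(A).max()
    max_elem = max_orig
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))          # row interchange
        A[[k, p]] = A[[p, k]]
        A[k+1:, k] /= A[k, k]                        # multipliers
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        max_elem = max(max_elem, np.abs(A[k+1:, k+1:]).max())
    return max_elem / max_orig

print(growth_factor_partial_pivoting(np.random.randn(100, 100)))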
There exist some well-known matrices that give unusually large, but not
exponential, growth. They can be found using the following theorem, which is
applicable whatever the strategy for interchanging rows and columns in GE.
(9.10)
(9.11)
is the symmetric, orthogonal eigenvector matrix for the second difference ma-
trix (the tridiagonal matrix with typical row (-1, 2, -1); see §26.5); it arises,
for example, in the analysis of time series [19, 1971, §6.5]. Theorem 9.7 gives
ρn(SN) ≥ (n + 1)/2.
(2) A Hadamard matrix Hn is an n × n matrix with elements hij = ±1
and for which the rows of Hn are mutually orthogonal. Hadamard matrices
exist only for certain n; a necessary condition for their existence if n > 2 is
that n is a multiple of 4. For more about Hadamard matrices see Hall [494,
1967, Chap. 14], Wallis [1062, 1993], and Wallis, Street, and Wallis [1063,
1972]. We have Hn^T Hn = nI, and so Theorem 9.7 gives
ρn(Hn) ≥ n.
(3) The next matrix is a complex Vandermonde matrix based on the roots
of unity, which occurs in the evaluation of Fourier transforms (see §23.1):
(9.12)
(9.13)
and that this bound is not attainable. The bound is a much more slowly
growing function than 2^{n-1}, but can still be quite large (e.g., it is 3570 for
n = 100). As for partial pivoting, in practice the growth factor is usually
small. Wilkinson stated that “no matrix has been encountered in practice for
which p1/pn was as large as 8” [1085, 1961, p. 285] and that “no matrix has
yet been discovered for which f(r) > r” [1089, 1965, p. 213] (here, pi is the
(n - i + 1)st pivot and f(r)
Cryer [256, 1968] defined
(9.14)
• g(2) = 2 (trivial).
Tornheim [1012, 1965] (see also Cryer [256, 1968]) showed that the complete
pivoting growth factor is at least n for any n × n Hadamard matrix Hn (a bound
which, as we saw above, holds for any form of pivoting). For n such that a
Hadamard matrix does not exist, the best known lower bound is g(n) ≥ (n + 1)/2 (see (9.11)).
Cryer [256, 1968] conjectured that for real matrices the growth factor for
complete pivoting is at most n, with equality if and only if A is a Hadamard
matrix. The conjecture became one of the most famous open problems in
numerical analysis, and has been investigated by many mathematicians. The
conjecture was finally shown to be false in 1991. Using a package LANCELOT
[236, 1992] designed for large-scale nonlinear optimization, Gould [474, 1991]
discovered a 13 × 13 matrix for which the growth factor is 13.0205 in IEEE
double precision floating point arithmetic. Edelman subsequently showed,
using the symbolic manipulation packages Mathematica and Maple, that a
growth factor 13.02 can be achieved in exact arithmetic by making a small
perturbation (of relative size 10^{-7}) to one element of Gould’s matrix [338,
1992], [348, 1991]. A more striking counterexample to the conjecture is a
matrix of order 25 for which the growth factor is 32.986341 [338, 1992].
Interesting problems remain, such as determining lim_{n→∞} g(n)/n and
evaluating the growth factor for Hadamard matrices (see Problem 9.15).
For complex matrices the maximum growth factor is at least n for any
n (see (9.12)). The growth can exceed n, even for n = 3:
Tornheim [1012, 1965] constructed the example
Proof. The result follows immediately from the more general Theo-
rems 12.5 and 12.6 for block diagonally dominant matrices.
Note that for a matrix diagonally dominant by rows the multipliers can
be arbitrarily large but, nevertheless, ρn ≤ 2, so GE is perfectly stable.
A smaller bound for the growth factor also holds for an upper Hessenberg
matrix. (H is upper Hessenberg if hij = 0 for i > j + 1.)
(9.15)
(9.16)
Hence
(9.17)
(9.18)
(9.19)
Proof. For (a), it is well known that a symmetric positive definite A has
an LU factorization in which U = DL^T, where D is diagonal with positive
diagonal elements. Hence |L||U| = |L||D||L^T| = |LDL^T| = |LU|, where the
middle equality requires a little thought. In (b) and (c) the equivalences,
and the existence of an LU factorization, follow from known results on totally
nonnegative matrices [258, 1976] and M-matrices [94, 1994]; |L||U| = |LU| is
immediate from the sign properties of L and U. (d) is trivial.
For diagonally dominant matrices, |L||U| is not equal to |LU| = |A|, but
it cannot be much bigger.
Proof. If u is sufficiently small then for types (a)-(c) the diagonal elements
of U will be positive, since they converge to the exact, positive uii as u → 0. It is easy to see that
for all i ensures that The argument is similar for type (d). The
result therefore follows from (9.19) and (9.8). The last part is trivial.
A corollary of Theorem 9.13 is that it is not necessary to pivot for the
matrices specified in the theorem (and, indeed, pivoting could vitiate the result
of the theorem). Note that large multipliers may occur under the conditions
of the theorem, but they do not affect the stability. For example, consider the
M-matrix
the residuals were again of order 10^{-10}, that is, of the size cor-
responding to the exact solution rounded to ten decimals. It is
interesting that in connection with this example we subsequently
performed one or two steps of what would now be called “iterative
refinement,” and this convinced us that the first solution had had
almost six correct figures.
1966], and Forsythe and Moler [396, 1967, Chap. 21]. Fox gives a simplified
analysis for fixed-point arithmetic under the assumption that the growth fac-
tor is of order 1. Forsythe and Moler give a particularly readable backward
error analysis that has been widely quoted.
Wilkinson’s 1961 result is essentially the best that can be obtained by
a normwise analysis. Subsequent work in error analysis for GE has mainly
been concerned with bounding the backward error componentwise, as in The-
orems 9.3 and 9.4. We note that Wilkinson could have given a componentwise
bound for the backward perturbation ∆A, since most of his analysis is at the
element level.
Chartres and Geuder [200, 1967] analyse the Doolittle version of GE. They
derive a backward error result (A + ∆A)x̂ = b, with a componentwise bound
on ∆A; although they do not recognize it, their bound can be written in the
form |∆A| ≤ cnu
Reid [867, 1971] shows that the assumption in Wilkinson’s analysis that
partial pivoting or complete pivoting is used is unnecessary. Without making
any assumptions on the pivoting strategy, he derives for LU factorization the
result LU = A + ∆A, |∆aij| ≤ 3.01 min(i - 1, j) u maxk |aij^(k)|. Again, this is a
componentwise bound. Erisman and Reid [355, 1974] note that for a sparse
matrix, the term min(i - 1, j) in Reid’s bound can be replaced by mij , where
mij is the number of multiplications required in the calculation of lij (i > j)
or u ij (i < j).
de Boor and Pinkus [273, 1977] give the result stated in Theorem 9.4.
They refer to the original 1972 German edition of [955, 1980] for a proof
of the result and explain several advantages to be gained by working with a
componentwise bound for ∆A, one of which is the strong result that ensues for
totally nonnegative matrices. A result very similar to Theorem 9.4 is proved
by Sautter [895, 1978].
Skeel [919, 1979] carried out a detailed componentwise error analysis of
GE with a different flavour to the analysis given in this chapter. His aim was
to understand the numerical stability of GE (in a precisely defined sense) and
to determine the proper way to scale a system by examining the behaviour
of the backward and forward errors under scaling (see §9.7). He later used
this analysis to derive important results about fixed precision iterative refine-
ment (see Chapter 11). Skeel’s work popularized the use of componentwise
backward error analysis and componentwise perturbation theory.
The componentwise style of backward error analysis for GE is
now well known, as evidenced by its presence in the textbooks of Conte and
de Boor [237, 1980], Golub and Van Loan [470, 1989] (also the 1983 first
edition), and Stoer and Bulirsch [955, 1980].
Forward error analyses have also been done for GE. The analyses are more
complicated and more difficult to interpret than the backward error analyses.
Olver and Wilkinson [810, 1982] derive a posteriori forward error bounds that
require the computation of A^{-1}. Further results are given in a series of papers
by Stummel [965, 1982], [966, 1985], [967, 1985], [968, 1985].
Finally, probabilistic error analysis for GE is given by Barlow and Bareiss
[63, 1985].
9.7. Scaling
Prior to solving a linear system Ax = b by GE we are at liberty to scale the
rows and the columns:
(9.20)
(9.21)
A result of Peña [825, 1994] shows that if there exists a permutation matrix P
such that PA has an LU factorization PA = LU with |PA| = |L||U|, then such
a factorization will be produced by the pivoting scheme (9.21). This means
that, unlike for partial pivoting, we can use the pivoting scheme (9.21) with
impunity on totally nonnegative matrices and their inverses, row permutations
of such matrices, and any matrix for which some row permutation has the
“|PA| = |L||U|” property. However, this pivoting strategy is as expensive as
complete pivoting to implement, and for general A it is not guaranteed to
produce a factorization as stable as that produced by partial pivoting.
(9.22)
O(n^2) operations using the matrix norm estimator in Algorithm 14.4. Another
possibility is to use a running error analysis, in which an error bound is
computed concurrently with the factors (see §3.3).
Although Theorem 9.3 bounds the backward error of the computed LU factors
L and U, it does not give any indication about the size of the forward errors
in L and U. For most applications of the LU factorization it is the
backward error and not the forward errors that matters, but it is still of some
interest to know how big the forward errors can be. This is a question of
perturbation theory and is answered by the next result.
(9.23)
where stril(·) and triu(·) denote, respectively, the strictly lower triangular part
and the upper triangular part of their matrix arguments.
The normwise bounds (9.23) imply that χ(A) := ||L^{-1}||2 ||U^{-1}||2 ||A||2 is
an upper bound for the condition numbers of L and U under normwise per-
turbations. We have
and the ratio χ(A)/κ2(A) can be arbitrarily large (though if partial pivoting
is used then κ2(L) ≤ n 2^{n-1}).
The componentwise bounds in Theorem 9.14 are a little unsatisfactory in
that they involve the unknown matrices ∆L and ∆U, but we can set these
terms to zero and obtain a bound correct to first order.
9.10. Notes and References
for k = 1:n
    for j = k+1:n
        for i = k+1:n
            aij = aij - (aik/akk)akj
        end
    end
end
This is identified as the kji form of GE. Altogether there are six possible
orderings of the three loops. Doolittle’s method (Algorithm 9.2) is the ijk
or jik variant of Gaussian elimination. The choice of loop ordering does not
affect the stability of the computation, but can greatly affect the efficiency of
GE on a high performance computer. For more on the different loop orderings
of GE see Chapter 12; Dongarra, Gustavson, and Karp [310, 1984]; and the
books by Dongarra, Duff, Sorensen, and van der Vorst [315, 1991] and Golub
and Van Loan [470, 1989].
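For reference, a direct Python transcription of the kji loops displayed above (purely illustrative; on a real machine the two inner loops would be vectorized or blocked for efficiency):

def ge_kji(a):
    # In-place reduction of the square array a (list of lists or NumPy array),
    # kji loop ordering, no pivoting and no storage of the multipliers.
    n = len(a)
    for k in range(n):
        for j in range(k+1, n):
            for i in range(k+1, n):
                a[i][j] -= (a[i][k] / a[k][k]) * a[k][j]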
This chapter draws on the survey paper Higham [545, 1990]. Theorems 9.6
and 9.7 are from Higham and Higham [562, 1989].
Myrick Hascall Doolittle (1830-1913) was a “computer of the United States
coast and geodetic survey” [362, 1987]. Crout’s method was published in an
engineering journal in 1941 [255, 1941].
GE and its variants were known by various descriptive names in the early
days of computing. These include the bordering method, the escalator method
(for matrix inversion), the square root method (Cholesky factorization), and
pivotal condensation. A good source for details of these methods is Fad-
deeva [360, 1959].
In a confidential 1948 report that “covers the general principles of both the
design of the [Automatic Computing Engine] and the method of programming
adopted for it”, Wilkinson gives a program implementing GE with partial
pivoting and iterative refinement [1080, 1948, p. 111]. This was probably the
first such program to be written and for a machine that had not yet been
built!
The terms “partial pivoting” and “complete pivoting” were introduced by
Wilkinson in [1085, 1961]. The pivoting techniques themselves were in use
in the 1940s and it is not clear who, if anyone, can be said to have invented
them: indeed, von Neumann and Goldstine [1057, 1947, §4.2] refer to complete
pivoting as the “customary procedure”.
There is a long history of published programs for GE, beginning with Crout
routines of Forsythe [390, 1960], Thacher [999, 1961], McKeeman [745, 1962],
and Bowdler, Martin, Peters, and Wilkinson [138, 1966], all written in Algol
60 (which was the “official” language for publishing mathematical software in
the 1960s, and a strong competitor to Fortran for practical use at that time).
The GE routines in LAPACK are the latest in a lineage beginning with the
Fortran routines decomp and solve in Forsythe and Moler's book [396, 1967],
and continuing with routines by Moler [766, 1972], [767, 1972] (which achieve
improved efficiency in Fortran by accessing arrays by column rather than by
row), Forsythe, Malcolm, and Moler [395, 1977] (these routines incorporate
condition estimation; see Chapter 14), and LINPACK [307, 1979].
LU factorization of totally nonnegative matrices has been investigated by
Cryer [257, 1973], [258, 1976], Ando [21, 1987], and de Boor and Pinkus
[273, 1977]. It is natural to ask whether we can test for total nonnegativity
without computing all the minors. The answer is yes: for an n × n matrix
total nonnegativity can be tested in O(n^3) operations, as shown by Gasca
and Peña [421, 1992]. The test involves carrying out a modified form of GE
in which all the elimination operations are between adjacent rows and then
checking whether certain pivots are positive. Note the analogy with positive
definiteness, which holds for a symmetric matrix if and only if all the pivots
in GE are positive.
The dilemma of whether to define the growth factor in terms of exact or
computed quantities is faced by all authors; most make one choice or the other,
and go on to derive, without comment, bounds that are strictly incorrect.
Theorem 9.8, for example, bounds the exact growth factor; the computed one
could conceivably violate the bound, but only by a tiny relative amount. van
Veldhuizen [1045, 1977] shows that for a variation of partial pivoting that
allows either a row or column interchange at each stage, the growth factor
defined in terms of computed quantities is at most about (1 + 3nu)2^{n-1},
compared with the bound 2^{n-1} for the exact growth factor.
The idea of deriving error bounds for GE by analysing the equations ob-
tained by solving A = LU is exploited by Wilkinson [1097, 1974], who gives a
general analysis that includes Cholesky factorization. This paper gives a con-
cise summary of error analysis of factorization methods for linear equations
and least squares problems.
Various authors have tabulated growth factors in extensive tests with ran-
dom matrices. In tests during the development of LINPACK, the largest value
observed was a growth factor of 23, occurring for a random matrix of 1s, 0s,
and -1s [307, 1979, p. 1.21]. Macleod [720, 1989] recorded a value of 35.1,
which occurred for a symmetric matrix with elements from the uniform
distribution on [-1, 1]. In one MATLAB test of 1000 matrices of dimension
100 from the normal N(0, 1) distribution, I found the largest growth factor
to be 9.59.
Gould [474, 1991] used the optimization package LANCELOT [236, 1992] to maxi-
mize the nth pivot for complete pivoting as a function of about n^3/3 variables
comprising the intermediate elements of the elimination; constraints were
included that normalize the matrix A, describe the elimination equations, and
impose the complete pivoting conditions. Gould’s package found many local
maxima, and many different starting values had to be tried in order to lo-
cate the matrix for which the growth factor exceeds 13. In an earlier attempt at
maximizing the growth factor, Day and Peterson [271, 1988] used a problem
formulation in which the variables are the n^2 elements of A, which makes the
constraints and objective function substantially more nonlinear than in Gould’s formulation.
Using the package NPSOL [444, 1986], they obtained “largest known” growth
factors for 5 ≤ n ≤ 7.
Theoretical progress into understanding the behaviour of the growth fac-
tor for complete pivoting has been made by Day and Peterson [271, 1988],
Puschmann and Cortés [849, 1983], Puschmann and Nordio [850, 1985], and
Edelman and Mascarenhas [345, 1995].
A novel alternative to partial pivoting for stabilizing GE is proposed by
Stewart [942, 1974]. The idea is to modify the pivot element to make it suit-
ably large, and undo this rank one change later using the Sherman-Morrison
formula. Stewart gives error analysis that bounds the backward error for this
modified form of GE.
Theorem 9.8 is proved for matrices diagonally dominant by columns by
Wilkinson [1085, 1961, pp. 288-289]. Theorem 9.9 is proved in the same paper.
That ρn ≤ 2 for matrices diagonally dominant by rows does not appear to
be well known, but it is proved by Wendroff [1074, 1966, pp. 122-123], for
example.
The results in §9.5 for tridiagonal matrices are taken from Higham [541,
1990]. Another method for solving tridiagonal systems is cyclic reduction,
which was developed in the 1960s [171, 1970]. Error analysis given by Amodio
and Mazzia [15, 1994] shows that cyclic reduction is normwise backward stable
for diagonally dominant tridiagonal matrices.
The chapter “Scaling Equations and Unknowns” of Forsythe and Moler
[396, 1967] is a perceptive, easy to understand treatment that is still well worth
reading. Early efforts at matrix scaling for GE were directed to equilibrating
either just the rows or the rows and columns simultaneously (so that all the
rows and columns have similar norms). An algorithm with the latter aim
is described by Curtis and Reid [259, 1972]. Other important references on
scaling are the papers by van der Sluis [1040, 1970] and Stewart [945, 1977],
which employ normwise analysis, and those by Skeel [919, 1979], [921, 1981],
which use componentwise analysis.
Much is known about the existence and stability of LU factorizations of
M-matrices and related matrices. A is an H-matrix if the comparison matrix
M(A) (defined in (8.6)) is an M-matrix. Funderlic, Neumann, and Plem-
mons [410, 1982] prove the existence of an LU factorization for an H-matrix
A that is generalized diagonally dominant, that is, DA is diagonally dom-
inant by columns for some nonsingular diagonal matrix D; they show that
the growth factor satisfies ρn ≤ 2 maxi |dii|/mini |dii|. Neumann and Plem-
mons [791, 1984] obtain a growth factor bound for an inverse of an H-matrix.
Ahac, Buoni, and Olesky [7, 1988] describe a novel column-pivoting scheme
for which the growth factor can be bounded by n when A is an H-matrix.
The normwise bounds in Theorem 9.14 are due to Barrlund [71, 1991]
and the componentwise ones to Sun [972, 1992]. Similar bounds are given
by Stewart [951, 1993] and Sun [973, 1992]. Barrlund [72, 1992] describes a
general technique for deriving matrix perturbation bounds using integrals.
Interval arithmetic techniques (see §24.4) are worth considering if high ac-
curacy or guaranteed accuracy is required when solving a linear system. We
mention just one paper, that by Demmel and Krückeberg [297, 1985], which
provides a very readable introduction to the subject and contains further ref-
erences.
For several years Edelman has been collecting information on the solution
of large, dense linear algebra problems. His papers [337, 1991], [341, 1993],
[342, 1994] present statistics and details of the applications in which large
dense problems arise. Edelman also discusses relevant issues such as what
users expect of the computed solutions and how best to make use of parallel
computers. Table 9.2 contains “world records” for linear systems from Edel-
man’s surveys. For all the records shown the matrix was complex and the
system was solved in double precision arithmetic by some version of LU fac-
torization. Most of the very large systems currently being solved come from
the solution of boundary integral equations, a major application being the
analysis of radar cross sections; the resulting systems have coefficient matri-
ces that are complex symmetric (but not Hermitian). A recent reference is
Wang [1064, 1991].
9.10.1. LAPACK
Driver routines xGESV (simple) and xGESVX (expert) use LU factorization with
partial pivoting to solve a general system of linear equations with multiple
right-hand sides. The expert driver incorporates iterative refinement, condi-
tion estimation, and backward and forward error estimation and has an option
to scale (equilibrate) the system AX = B before solution.
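As a usage sketch (mine, not LAPACK documentation): SciPy's lu_factor and lu_solve wrap the LAPACK computational routines xGETRF and xGETRS, on which these drivers are built, so a solve with multiple right-hand sides can be written as follows.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 3))          # multiple right-hand sides
lu, piv = lu_factor(A)                     # LU factorization with partial pivoting
X = lu_solve((lu, piv), B)
# Normwise relative residual, which should be of order the unit roundoff.
print(np.linalg.norm(A @ X - B) / (np.linalg.norm(A) * np.linalg.norm(X)))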
Table 9.2. Records for largest dense linear systems solved (dimension n).
Problems
9.1. (Completion of proof of Theorem 9.1.) Show that if a singular matrix
A has a unique LU factorization then Ak is nonsingular for k =
1:n - 1.
9.2. Define A(σ) = A - σI, where and For how many
values of σ, at most, does A(σ) fail to have an LU factorization without
pivoting?
9.3. Show that A has a unique LU factorization if 0 does not belong
to the field of values of A.
9.4. State analogues of Theorems 9.3 and 9.4 for LU factorization with row
and column interchanges: PAQ = LU.
9.5. Give a 2 × 2 matrix A having an LU factorization A = LU such that
|L||U| ≤ c|A| does not hold for any c, yet the growth factor is of order 1.
9.6. Show that if A is nonsingular and totally nonnegative it has an
LU factorization A = LU with L ≥ 0 and U ≥ 0. (Hint: use the inequality
which holds for any totally nonnegative A [414, 1959, p. 100].) Deduce that
the growth factor for GE without pivoting satisfies ρn = 1.
Hence, for g(n) defined in (9.14) and Sn in (9.11), deduce a larger lower bound
than g(2n) >
9.11. Explain the errors in the following criticism of GE with complete piv-
oting.
Gaussian elimination with complete pivoting maximizes the pivot
at each stage of the elimination. Since the product of the pivots is
the determinant (up to sign), which is fixed, making early pivots
large forces later ones to be small. These small pivots will have large
relative errors due to the accumulation of rounding errors during the
algorithm, and dividing by them therefore introduces larger errors.
9.13. (RESEARCH PROBLEM) Obtain sharp bounds for the growth factor for
GE with partial pivoting applied to (a) a matrix with lower bandwidth p
and upper bandwidth q (thus generalizing Theorem 9.10), and (b) a quasi-
tridiagonal matrix (an n × n matrix that is tridiagonal except for nonzero
(1, n) and (n, 1) elements).
9.14. (RESEARCH PROBLEM) Explain why the growth factor for GE with
partial pivoting is almost always small in practice.
9.15. (RESEARCH PROBLEM) For GE with complete pivoting what is the
value of lim_{n→∞} g(n)/n (see (9.14))? Is the growth factor equal to n for Hadamard matrices?
Chapter 10
Cholesky Factorization
(10.1)
if
(10.2)
(10.3)
Equation (10.2) has a unique solution since Rn-1 is nonsingular. Then (10.3)
gives β^2 = a - r^T r. It remains to check that there is a unique real, positive β
satisfying this equation. From the equation
end
end
Cost: n^3/3 flops (half the cost of LU factorization).
As for Gaussian elimination (GE), there are different algorithmic forms of
Cholesky factorization. Algorithm 10.2 is the jik or “sdot” form. We describe
the kij, outer product form in §10.3.
Given the Cholesky factorization A = RTR, a linear system Ax = b can
be solved via the two triangular systems RTy = b and Rx = y.
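A minimal sketch of this solution process (my own illustration, assuming NumPy/SciPy; it is not the book's algorithm statement): form R by an inner-product style loop, then solve the two triangular systems.

import numpy as np
from scipy.linalg import solve_triangular

def cholesky_upper(A):
    # Compute upper triangular R with A = R^T R (A symmetric positive definite).
    n = A.shape[0]
    R = np.zeros_like(A, dtype=float)
    for j in range(n):
        for i in range(j):
            R[i, j] = (A[i, j] - R[:i, i] @ R[:i, j]) / R[i, i]
        R[j, j] = np.sqrt(A[j, j] - R[:j, j] @ R[:j, j])
    return R

def solve_spd(A, b):
    R = cholesky_upper(A)
    y = solve_triangular(R, b, trans='T')   # solve R^T y = b
    return solve_triangular(R, y)           # solve R x = y

A = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
print(solve_spd(A, b), np.linalg.solve(A, b))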
If we define D = diag(rii)^2 then the Cholesky factorization A = R^T R
can be rewritten as A = LDL^T, where L = R^T diag(rii)^{-1} is unit lower
triangular.
(10.4)
(10.5)
where for the last inequality we have assumed that nγn+1 < 1/2. Another
indicator of stability is that the growth factor for GE is exactly 1 (see Prob-
lem 10.4). It is important to realize that the multipliers can be arbitrarily
large (consider, for example, as θ → 0). But, remarkably, for a positive
definite matrix the size of the multipliers has no effect on stability.
Note that the perturbation ∆A in (10.6) is not symmetric, in general,
because the backward error matrices for the triangular solves with R and
RT are not the transposes of each other. For conditions guaranteeing that a
“small” symmetric ∆A can be found, see Problem 7.11.
The following rewritten version of Theorem 10.3 provides further insight
into Cholesky factorization.
where di =
giving
(10.8)
and the required bound for ∆A.
Standard perturbation theory applied to Theorem 10.4 yields a bound
of the form However, with the aid of
Theorem 10.5 we can obtain a potentially much smaller bound. The idea is
to write A = DHD, where D = diag(A)^{1/2}, so that H has unit diagonal. van
der Sluis’s result (Corollary 7.6) shows that
(10.9)
where the constant equals 2n(1 - γn+1)^{-1} γn+1.
Proof. Straightforward analysis shows that (cf. the proof of Theorem 9.4)
(A + ∆A)x̂ = b, where
with |∆A1| ≤ (1 - γn+1)^{-1} γn+1 dd^T (by Theorem 10.5), |∆1| ≤ diag(γi),
and |∆2| ≤ diag(γn-i+1). Scaling with D, we have
But, using (10.8) and ||D^{-1} dd^T D^{-1}||2 = ||ee^T||2 = n, we have
using the interlacing of the eigenvalues [470, 1989, Cor. 8.1.4] and the con-
dition of the theorem. Hence is positive definite, and
therefore so is the congruent matrix Ak + ∆A k, showing that must be real
and nonsingular, as required to complete the induction.
If Cholesky succeeds, then, by Theorem 10.5, D^{-1}(A + ∆A)D^{-1} is positive
definite and so 0 <
Hence if λmin(H) < -nγn+1/(1 - γn+1) then the computation must fail.
Note that, since ||H||2 ≥ 1, the condition for success of Cholesky factor-
ization can be written as κ2(H) nγn+1/(1 - γn+1) < 1.
Note that the Cholesky factor of Ak = A(1:k, 1:k) is Rk, and κ2(Ak+1) ≥
κ2(Ak) by the interlacing property of the eigenvalues. Hence if Ak+1 (and
hence A) is ill conditioned but Ak is well conditioned then Rk will be relatively
insensitive to perturbations in A but the remaining columns of R will be much
more sensitive.
(10.11)
column. Ignoring pivoting for the moment, at the start of the kth stage we
have
(10.12)
where ri^T = [0, . . . , 0, rii, . . . , rin]. The reduction is carried one stage further
by computing
Overall we have,
(10.13)
(10.14)
(10.15)
where R11 is the Cholesky factor of A 11, R12 = and
where W =
Proof. We can expand
(where S0(A) := A). Then, for sufficiently small ||E||2, A + E = cp(A + E).
For E = with |γ| sufficiently small,
Proof. Note that since A = cp(A), (10.17) simply states that there
are no ties in the pivoting strategy (since (Si(A))11 in (10.12)).
Lemma 10.10 shows that Si(A + E) = Si(A) + O(||E||2), and so, in view of
(10.17), for sufficiently small ||E||2 we have
This shows that A + E = cp(A + E). The last part then follows from
Lemma 10.10.
We now examine the quantity ||W||2. We show first that
||W||2 can be bounded in terms of the square root of the condition number of
A11.
Proof. Write
together with which follows
from the fact that the Schur complement is positive semidef-
inite.
Note that, by the arithmetic-geometric mean inequality √(xy) ≤ (x + y)/2
(x, y ≥ 0), we also have, from Lemma 10.12,
||A22||2)/2.
The inequality of Lemma 10.12 is attained for the positive semidefinite
matrix
where Ip,q is the p × q identity matrix. This example shows that ||W||2 can
be arbitrarily large. However, for A := cp(A), ||W||2 can be bounded solely
in terms of n and k. The essence of the proof, in the next lemma, is that
large elements in are countered by small elements in A12. Hereafter we
set k = r, the value of interest in the following sections.
(10.18)
(10.19)
with c = cos θ, s = sin θ. This is the r × n version of Kahan’s matrix (8.10). R
satisfies the inequalities (10.13) (as equalities) and so A(θ) = cp(A(θ)).
We conclude this section with a “worst-case” example for the Cholesky fac-
torization with complete pivoting. Let U(θ) = diag(r, r-1, . . . , 1)R(θ), where
R(θ) is given by (10.19), and define the rank-r matrix C(θ) = U(θ)^T U(θ).
Then C(θ) satisfies the conditions of Lemma 10.11. Also,
Thus, from Lemma 10.11, for E = with |γ| and θ sufficiently small,
This example can be interpreted as saying that in exact arithmetic the resid-
ual after an r-stage Cholesky factorization of a semidefinite matrix A can
overestimate the distance of A from the rank-r semidefinite matrices by a
factor as large as (n – r)(4r – 1)/3.
(10.20)
where A11 = D11 H11 D11 and D11 = diag(A11)^{1/2}. Then, in floating point
arithmetic, the Cholesky algorithm applied to A successfully completes r stages
(barring underflow and overflow), and the computed r × n Cholesky factor
satisfies
(10.21)
where W =
(10.22)
where
and
(10.23)
which implies
(10.24)
Theorem 10.14 is just about the best result that could have been expected,
because the bound (10.21) is essentially the same as the bound obtained on
taking norms in Lemma 10.10. In other words, (10.21) simply reflects the
inherent mathematical sensitivity of A - R^T R to small perturbations in A.
We turn now to the issue of stability. Ideally, for A as defined in Theo-
rem 10.14, the computed Cholesky factor Rr produced after r stages of the
algorithm would satisfy
(10.25)
Usually k > r + 1, due to the effect of rounding errors.
A more sophisticated termination criterion is to stop as soon as
(10.26)
for some readily computed norm and a suitable tolerance. This criterion
terminates as soon as a stable factorization is achieved, avoiding unnecessary
work in eliminating negligible elements in the computed Schur complement.
Note that this quantity is indeed a reliable order-of-magnitude estimate of the
true residual, since it is the only nonzero block of Â(k) and, by (10.22) and
(10.24), with ||E|| = O(u)(||A|| + ||Â(k)||).
Another possible stopping criterion is
(10.27)
Practical experience shows that the criteria (10.26) and (10.27) with the
tolerance equal to nu both work well, and much more effectively than (10.25) [540, 1990]. We
favour (10.27) because of its negligible cost.
There is arbitrarily large element growth for 0 < << 1, and the factorization
does not exist for = 0.
The most popular approach for solving symmetric indefinite systems is to
use a block LDL^T factorization
PAP^T = LDL^T,
13. The inertia of a symmetric matrix is an ordered triple {i+, i-, i0}, where i+ is the
number of positive eigenvalues, i- the number of negative eigenvalues, and i0 the number
of zero eigenvalues.
à = B – CE- 1C T .
The cost of the method is n 3/3 flops (the same as the cost of Cholesky fac-
torization of a positive definite matrix) plus the cost of determining the per-
mutations 17. This method for computing the block LDLT factorization is
called the diagonal pivoting method. It can be thought of as a generalization
of Lagrange’s method for reducing a quadratic form to diagonal form (devised
by Lagrange in 1759 and rediscovered by Gauss in 1823) [763, 1961, p. 371].
One conceivable difficulty with the diagonal pivoting method can be dis-
posed of immediately. If a nonsingular pivot matrix E of dimension 1 or 2
cannot be found, then all 1 × 1 and 2 × 2 principal submatrices of the sym-
metric matrix A are singular, and this is easily seen to imply that A is the
zero matrix.
The strategy for choosing Π is crucial for achieving stability. A suitable
modification of the error analysis for block LU factorization (Theorem 12.4)
tells us that, provided linear systems involving 2 × 2 pivots are solved in a
normwise backward-stable way, the condition ||L|| ||D|| ||L^T|| < cn||A||, for
a modest constant cn, is sufficient to ensure stability. A key requirement,
therefore, is to choose the pivot E so that the Schur complement Ã is suitably
bounded, since D is made up of elements of Schur complements. We describe
two suitable pivoting strategies.
Now consider the case s = 2. The (i, j) element of the Schur complement
Ã = B − CE^{-1}C^T is
(10.28)
Now
Bunch and Kaufman [164, 1977] devised a pivoting strategy for the diago-
nal pivoting method that requires only O(n^2) comparisons. At each stage
it searches at most two columns and so is analogous to partial pivoting for
LU factorization. The strategy contains several logical tests. As before, we
describe the pivoting for the first stage only. Recall that s denotes the size of
the pivot block.
if |a11|σ ≥ αω1^2
    (2) s = 1, Π = I
else if |arr| ≥ ασ
    (3) s = 1 and choose Π to swap rows and columns 1 and r.
else
    (4) s = 2 and choose Π to swap rows and columns 2 and r,
        so that |(ΠAΠ^T)21| = ω1.
end
end
and to note that the pivot is one of a11, arr, and the 2 × 2 matrix [a11 ω1; ω1 arr] (or, rather, since
ω1 = |ar1|, this matrix with ω1 replaced by ar1).
To bound the element growth reconsider each case in turn, noting that
for cases (1) and (2) the elements of the Schur complement are given by^14
Case (1):
Case (3): The original arr is now the pivot, and |arr| ≥ ασ, so
Case (4): This is where we use a 2 × 2 pivot, which, after the interchanges,
is E = (ΠAΠ^T)(1:2, 1:2). Now
so
This analysis shows that the bounds for the element growth for s = 1 and
s = 2 are the same as the respective bounds for the complete pivoting strategy.
Hence, using the same reasoning, we again choose α = (1 + √17)/8.
The growth factor for the partial pivoting strategy is bounded by (2.57)^{n−1}.
As for GEPP, large growth factors do not seem to occur in practice. But un-
like for GEPP, no example is known for which the bound is attained [164,
1977]; see Problem 10.18.
14 We commit a minor abuse of notation, in that in the rest of this section ãij should
really be ã_{i−1,j−1} (s = 1) or ã_{i−2,j−2} (s = 2).
where p1 and p2 are linear polynomials. For the partial pivoting strategy,
Higham shows that if linear systems involving 2 × 2 pivots are solved by GEPP
or by use of the explicit inverse, then the computed solutions do indeed have
a small componentwise relative backward error, and that, moreover,
where ||A||M = max i,j |aij|. Thus the diagonal pivoting method with partial
pivoting is stable if the growth factor is small.
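For experimentation, the diagonal pivoting method with partial pivoting is accessible in MATLAB through the ldl function, which returns a block LDL^T factorization computed by the LAPACK symmetric indefinite routines. The sketch below, with an illustrative random matrix and right-hand side, also evaluates the quantity ||L|| ||D|| ||L^T||/||A|| that governs stability in the analysis above.

    % Hedged sketch: solving a symmetric indefinite system via block LDL^T.
    % The random test matrix and right-hand side are illustrative only.
    n = 8;
    A = randn(n);  A = A + A';             % a symmetric, generally indefinite, matrix
    b = randn(n, 1);
    [L, D, P] = ldl(A);                    % P'*A*P = L*D*L', D block diagonal
    x = P*(L'\(D\(L\(P'*b))));             % solve Ax = b using the factorization
    growth = norm(L, inf)*norm(D, inf)*norm(L', inf)/norm(A, inf);   % stability indicator
    resid  = norm(b - A*x, inf)/(norm(A, inf)*norm(x, inf));         % normwise relative residual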
(10.29)
the normwise one in Theorem 10.8 is derived and explored by Chang and
Paige [198, 1995]. Perturbation results of a different flavour, including one
for structured perturbations of the form of ∆A in Theorem 10.5, are given by
Drmač, Omladič, and Veselić [321, 1994].
The perturbation and error analysis of §10.3 for semidefinite matrices is
from Higham [540, 1990], where a perturbation result for the QR factorization
with column pivoting is also given. For an application in optimization
that makes use of Cholesky factorization with complete pivoting and the anal-
ysis of §10.3.1 see Forsgren, Gill, and Murray [384, 1995].
Fletcher and Powell [382, 1974] describe several algorithms for updating
an LDLT factorization of a symmetric positive definite A when A is modified
by a rank-1 matrix. They give detailed componentwise error analysis for some
of the methods.
An excellent way to test whether a given symmetric matrix A is positive
(semi) definite is to attempt to compute a Cholesky factorization. This test
is less expensive than computing the eigenvalues and is numerically stable.
Indeed, if the answer “yes” is obtained, it is the right answer for a nearby
matrix, whereas if the answer is “no” then A must be close to an indefinite
matrix. See Higham [535, 1988] for an application of this definiteness test.
An algorithm for testing the definiteness of a Toeplitz matrix is developed by
Cybenko and Van Loan [260, 1986], as part of a more complicated algorithm.
According to Kerr [654, 1990], misconceptions of what is a sufficient condition
for a matrix to be positive (semi) definite are rife in the engineering literature
(for example, that it suffices to check the definiteness of all 2 × 2 submatrices).
See also Problem 10.8. For some results on definiteness tests for Toeplitz
matrices, see Makhoul [722, 1991].
A major source of symmetric indefinite linear systems is the least squares
problem, because the augmented system is symmetric indefinite; see Chap-
ter 19. Other sources of such systems are interior methods for solving con-
strained optimization problems (see Forsgren, Gill, and Shinnerl [385, 1996],
Turner [1030, 1991], and Wright [1115, 1992]) and linearly constrained opti-
mization problems (see Gill, Murray, Saunders, and Wright [445, 1990], [446,
1991]).
The idea of using a block LDLT factorization with some form of pivoting
for symmetric indefinite matrices was first suggested by Kahan in 1965 [166,
1971]. Bunch and Parlett [166, 1971] developed the complete pivoting strategy
and Bunch [158, 1971] proved its stability. Bunch [160, 1974] discusses a rather
expensive partial pivoting strategy that requires repeated scalings. Bunch and
Kaufman [164, 1977] found the efficient partial pivoting strategy presented
here, which is the one now widely used, and Bunch, Kaufman and Parlett [165,
1976] give an Algol code implementing the diagonal pivoting method with this
pivoting strategy. Dongarra, Duff, Sorensen, and van der Vorst [315, 1991,
§5.4.5] show how to develop a partitioned version of the diagonal pivoting method.
10.6.1. LAPACK
Driver routines xPOSV (simple) and xPOSVX (expert) use the Cholesky fac-
torization to solve a symmetric (or Hermitian) positive definite system of
linear equations with multiple right-hand sides. (There are corresponding
routines for packed storage, in which one triangle of the matrix is stored in
a one-dimensional array: PP replaces PO in the names.) The expert driver
incorporates iterative refinement, condition estimation, and backward and
forward error estimation and has an option to scale the system AX = B
to (D^{-1}AD^{-1})DX = D^{-1}B, where D = diag(aii^{1/2}). Modulo the rounding
errors in computing and applying the scaling, the scaling has no effect on
the accuracy of the solution prior to iterative refinement, in view of Theo-
rem 10.6. The Cholesky factorization is computed by the routine xPOTRF ,
which uses a partitioned algorithm that computes R a block row at a time.
The drivers xPTSV and xPTSVX for symmetric positive definite tridiagonal ma-
trices use LDLT factorization. LAPACK does not currently contain a routine
for Cholesky factorization of a positive semidefinite matrix, but there is such
a routine in LINPACK (xCHDC ).
Driver routines xSYSV (simple) and xSYSVX (expert) use the block LDLT
factorization (computed by the diagonal pivoting method) with partial piv-
oting to solve a symmetric indefinite system of linear equations with multi-
ple right-hand sides. For Hermitian matrices the corresponding routines are
xHESV (simple) and xHESVX (expert). (Variants of these routines for packed
storage have names in which SP replaces SY and HP replaces HE.) The expert
drivers incorporate iterative refinement, condition estimation, and backward
and forward error estimation. The factorization is computed by the routine
xSYTRF or xHETRF.
Problems
10.1. Show that if A is symmetric positive definite then
where ||A||M = maxi,j |aij| (which is not a consistent matrix norm—see §6.2).
The significance of this result is that the bound for ||∆A||M/||A||M contains a
linear polynomial in n, rather than the quadratic that appears for the 2-norm
in (10.7).
10.6. Let A = cp(A) be positive semidefinite of rank r and suppose it
has the Cholesky factorization (10.11) with Π = I. Show that Z = [W, −I]^T
is a basis for the null space of A, where W =
10.7. Prove that (10.13) holds for the Cholesky decomposition with complete
pivoting.
10.8. Give an example of a symmetric matrix for which the leading
principal submatrices Ak satisfy det(Ak) ≥ 0, k = 1:n, but A is not positive
semidefinite (recall that det(Ak) > 0, k = 1:n, implies that A is positive
definite). State a condition on the minors of A that is both necessary and
sufficient for positive semidefiniteness.
10.9. Suppose the outer product Cholesky factorization algorithm terminates
at the (k+1)st stage (see (10.15)), with a negative pivot in the (k + 1, k + 1)
position. Show how to construct a direction of negative curvature for A (a
vector p such that pTAp < 0).
10.10. What is wrong with the following argument? A positive semidefinite
matrix is the limit of a positive definite one as the smallest eigenvalue tends to
zero. Theorem 10.3 shows that Cholesky factorization is stable for a positive
definite matrix, and therefore, by continuity, it must be stable for a positive
semidefinite matrix, implying that Theorem 10.14 is unnecessarily weak (since
||W|| 2 can be large).
10.11. Consider the diagonal pivoting method applied to a symmetric ma-
trix. Show that with complete pivoting or partial pivoting any 2 × 2 pivot
is indefinite. Hence give a formula for the inertia in terms of the block sizes
of the block diagonal factor. Show how to avoid overflow in computing the
inverse of a 2 × 2 pivot.
10.12. Describe the effect of applying the diagonal pivoting method with
partial pivoting to a 2 × 2 symmetric matrix.
10.13. What factorization is computed if the diagonal pivoting method with
partial pivoting is applied to a symmetric positive definite matrix?
10.14. (Sorensen and Van Loan; see [315, 1991, §5.3.2]) Suppose the partial
pivoting strategy for the diagonal pivoting method is modified by redefining
(thus "σnew = max(σold, |arr|)"). Show that the same growth factor bound
holds as before and that for a positive definite matrix no interchanges are
done and only 1 × 1 pivots are used.
10.15. Let
where 0 < < 1, and suppose the diagonal pivoting method is applied to
A, yielding a factorization PAP T = LDL T. Show that with partial pivot-
ing is unbounded as whereas with complete pivoting is
bounded independently of
10.16. Let
Chapter 11
Iterative Refinement
2. Solve Ad = r.
3. Update y = y + d.
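A minimal MATLAB sketch of these three steps, with LU factorization as the solver and everything carried out in the working (double) precision, is given below; the iteration limit and the use of the componentwise relative backward error as the stopping test are illustrative choices, not a transcription of any particular routine.

    % Hedged sketch: fixed precision iterative refinement with an LU solver.
    % The stopping test and the iteration limit are illustrative choices.
    [Lf, Uf, pf] = lu(A, 'vector');              % A(pf,:) = Lf*Uf
    y = Uf\(Lf\b(pf));                           % initial solution
    for iter = 1:5
        r = b - A*y;                             % 1. residual (working precision)
        w = max(abs(r)./(abs(A)*abs(y) + abs(b)));   % componentwise rel. backward error
        if w <= eps, break, end
        d = Uf\(Lf\r(pf));                       % 2. solve A d = r
        y = y + d;                               % 3. update
    end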
(11.1)
Hence
(11.4)
Note that
As long as A is not too ill conditioned and the solver is not too unstable, we
have which means that the error contracts until we reach a point
at which the gi term becomes significant. The limiting normwise accuracy,
that is, the minimum size of
Moreover, if for some
µ, then we can expect to obtain a componentwise relative error of order µu ,
that is, mini
We concentrate now on the case where the solver uses LU factorization. In
the traditional use of iterative refinement the residual is computed at precision
ū = u^2, and one way to summarize our findings is as follows.
This theorem is stronger than the standard results in the literature, which
have in place of η. We can have η << since η is independent
of the row scaling of A (modulo changes in the pivot sequence). For example,
if then η cond(A)u, and cond(A) can be arbitrarily smaller
than
Consider now the case where ū = u, which is called fixed precision iterative
refinement. We have an analogue of Theorem 11.1.
(this bound is obtained by applying Theorems 7.4 and 9.4, or from (11.4) with
i = 0!). In fact, a relative error bound of order cond(A, x)u is the best we can
possibly expect if we do not use higher precision, because it corresponds to the
(11.5)
(11.7)
First we give an asymptotic result that does not make any further assump-
tions on the linear equation solver.
(11.8)
where q = O(u) if
Proof. The residual r = b – A of the original computed solution
satisfies
(11.9)
The computed residual is = r + ∆r , where |∆ r| < The computed
correction d satisfies
Hence
(11.12)
where
The claim about the order of q follows since and are all of
order u.
Theorem 11.3 shows that, to first order, the componentwise relative back-
ward error w|A|,|b| will be small after one step of iterative refinement as long as
and are bounded by a modest scalar multiple of
This is true for t if the residual is computed in the conventional way (see
(11.7)), and in some cases we may take h ≡ 0, as shown below. Note that the
function g of (11.5) does not appear in the first-order term of (11.8). This
is the essential reason why iterative refinement improves stability: potential
instability manifested in g is suppressed by the refinement stage.
A weakness of Theorem 11.3 is that the bound (11.8) is asymptotic. Since
a strict bound for q is not given, it is difficult to draw firm conclusions about
the size of w|A|,|b|. The next result overcomes this drawback, at the cost of
some specialization (and a rather long proof).
We introduce a measure of ill scaling of the vector |B||x|,
Theorem 11.4. Under the conditions of Theorem 11.3, suppose that g(A, b) =
G|A| and h(A, b) = H|b|, where G and H have nonnegative entries, and
that the residual is computed in the conventional manner. Then there is a
function
such that if
then
Proof. As with the analysis in the previous section, this proof can be
skipped without any real loss of understanding. From (11.12) in the proof of
Theorem 11.3, using the formula (11.7) for t, we have
(11.13)
(11.15)
where
(11.18)
where
If < 1/2 (say) then (I − uM3)^{-1} > 0 with ||(I − uM3)^{-1}|| < 2 and
we can rewrite (11.18) as
(11.19)
where
The analysis in §11.1 extends existing results in the literature. The analysis
in §11.2 is from Higham [549, 1991].
The quantity σ(A, x) appearing in Theorem 11.4 can be interpreted as
follows. Consider a linear system Ax = b for which (|A||x|)i = 0 for some i.
While the componentwise relative backward error w|A|,|b|( x ) of the exact so-
lution x is zero, an arbitrarily small change to a component xj where aij ≠ 0
yields w|A|,|b| (x + ∆x) > 1. Therefore solving Ax = b to achieve a small
componentwise relative backward error can be regarded as an ill-posed prob-
lem when |A||x| has a zero component. The quantity σ(A, x) reflects this
ill-posedness because it is large when |A||x| has a relatively small component.
For a lucid survey of both fixed and mixed precision iterative refinement
and their applications, see Björck [111, 1990]. For particular applications of
fixed precision iterative refinement, see Govaerts and Pryce [475, 1990] and
Jankowski and Wozniakowski [611, 1985].
By increasing the precision from one refinement iteration to the next it
is possible to compute solutions to arbitrarily high accuracy, an idea first
suggested by Stewart in an exercise [941, 1973, pp. 206–207]. For algorithms,
see Kielbasinski [656, 1981] and Smoktunowicz and Sokolnicka [931, 1984].
There are a number of practical issues to attend to when implementing iter-
ative refinement. Mixed precision iterative refinement cannot be implemented
in a portable way when the working precision is already the highest precision
supported by a compiler. This is the main reason why iterative refinement is
not supported in LINPACK. (The LINPACK manual lists a subroutine that
implements mixed precision iterative refinement for single precision data, but
it is not part of LINPACK [307, 1979, pp. 1.8–1.10].) For either form of refine-
ment, a copy of the matrix A needs to be kept in order to form the residual,
and this necessitates an extra n 2 elements of storage. A convergence test for
terminating the refinement is needed. In addition to revealing when conver-
gence has been achieved, it must signal lack of (sufficiently fast) convergence,
which may certainly be experienced when A is very ill conditioned. In the
LAPACK driver xGESVX, fixed precision iterative refinement is terminated if
the componentwise relative backward error w = w|A|,|b| satisfies
1. w < u,
These criteria were chosen to be robust in the face of different BLAS imple-
mentations and machine arithmetics. In an implementation of mixed precision
iterative refinement it is more natural to test for convergence of the sequence
with a test such as < u (see, e.g., Forsythe and
Moler [396, 1967, p. 65]). However, if A is so ill conditioned that Theorem 11.1
is not applicable, the sequence could converge to a vector other than the
solution. This behaviour is very unlikely, and Kahan [626, 1966] quotes a
“prominent figure in the world of error-analysis” as saying “Anyone unlucky
enough to encounter this sort of calamity has probably already been run over
by a truck.”
A by-product of extended precision iterative refinement is an estimate of
the condition number. Since the error decreases by a factor approximately
η = on each iteration (Theorem 11.1), the relative change
made to x on the first iteration should be about η , that is,
Now that reliable and inexpensive condition estimators are
available (Chapter 14) this rough estimate is less important.
An unusual application of iterative refinement is to fault-tolerant com-
puting. Boley et al. [132, 1994] propose solving Ax = b by GEPP or QR
factorization, performing one step of fixed precision iterative refinement and
then testing whether the a priori residual bound in Theorem 11.4 is satisfied.
If the bound is violated then a hardware fault may have occurred and special
action is taken.
11.3.1. LAPACK
Iterative refinement is carried out by routines whose names end -RFS, and
these routines are called by the expert drivers (name ending -SVX). Iterative
refinement is available for all the standard matrix types except triangular ma-
trices, for which the original computed solution already has a componentwise
relative backward error of order u. As an example, the expert driver xGESVX
uses LU factorization with partial pivoting and fixed precision iterative refine-
ment to solve a general system of linear equations with multiple right-hand
sides, and the refinement is actually carried out by the routine xGERFS .
Problems
11.1. Show that for and where
σ = maxi |xi |/mini |xi |.
11.2. Use the analysis of §11.1 to show that, under the conditions of Theo-
rem 11.4, is bounded by a multiple of cond(A, x)u for GEPP
after one step of fixed precision iterative refinement.
11.3. Investigate empirically the size of for L from GEPP.
11.4. (Demmel and Higham [291, 1992]) Suppose GEPP with fixed precision
iterative refinement is applied to the multiple-right-hand side system AX = B,
and that refinement of the columns of X is done “in parallel”: R = B – AX,
Chapter 12
Block LU Factorization
(12.1)
In general, the blocks can be of different dimensions. Note that this fac-
torization is not the same as a standard LU factorization, because U is not triangular.
(12.2)
1. U11 = A11, U12 = A12.
2. Solve L21A11 = A21 for L21.
3. S = A22 − L21A12 (Schur complement).
4. Compute the block LU factorization of S, recursively.
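A recursive MATLAB sketch of these four steps follows; the block size nb and the use of the slash operator for step 2 are illustrative choices.

    % Hedged sketch: block LU factorization computed recursively (steps 1-4).
    % L is block lower triangular with identity diagonal blocks; U is block
    % upper triangular whose diagonal blocks are full, not triangular.
    function [L, U] = block_lu(A, nb)
    n = size(A, 1);
    if n <= nb
        L = eye(n);  U = A;                  % a single block: L = I, U = A
        return
    end
    k = nb;
    A11 = A(1:k, 1:k);      A12 = A(1:k, k+1:n);
    A21 = A(k+1:n, 1:k);    A22 = A(k+1:n, k+1:n);
    U11 = A11;  U12 = A12;                   % step 1
    L21 = A21/A11;                           % step 2: solve L21*A11 = A21
    S = A22 - L21*A12;                       % step 3: Schur complement
    [Ls, Us] = block_lu(S, nb);              % step 4: recurse on S
    L = [eye(k), zeros(k, n-k); L21, Ls];
    U = [U11, U12; zeros(n-k, k), Us];

In exact arithmetic this sketch reproduces A = LU exactly; the error analysis below concerns its floating point behaviour.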
What can be said about the numerical stability of partitioned and block
LU factorization? Because the partitioned algorithm is just a rearrangement
of standard GE, the standard error analysis applies if the matrix operations
are computed in the conventional way. However, if fast matrix multiplication
techniques are used (for example, Strassen’s method), the standard results
are not applicable. Standard results are, in any case, not applicable to block
LU factorization; its stability can be very different from that of LU factor-
ization. Therefore we need error analysis for both partitioned and block LU
factorization based on general assumptions that permit the use of fast matrix
multiplication.
Unless otherwise stated, in this chapter an unsubscripted norm denotes
||A|| := maxi,j |aij|. We make two assumptions about the underlying level-3
BLAS (matrix-matrix operations).
(1) If and then the computed approximation to
C = AB satisfies
(12.3)
(12.4)
(12.5)
(12.9)
It follows that
(12.10a)
(12.10b)
The remainder of the algorithm consists of the computation of the LU fac-
torization of B, and by our inductive assumption (12.6), the computed LU
factors satisfy
(12.11a)
(12.11b)
Combining (12.10) and (12.11), and bounding using (12.9), we obtain
(12.12)
(12.14)
Theorem 12.4 (Demmel, Higham, and Schreiber). Let L̂ and Û be the computed
block LU factors of A from Algorithm 12.2 (with Implementation 1),
and let x̂ be the computed solution to Ax = b. Under the assumptions
(12.3), (12.13), and (12.14),
(12.15)
Proof. We omit the proof (see Demmel, Higham, and Schreiber [293,
1995] for details). It is similar to the proof of Theorem 12.3.
The bounds in Theorem 12.4 are valid also for other versions of block LU
factorization obtained by “block loop reordering”, such as a block gaxpy based
algorithm.
Theorem 12.4 shows that the stability of block LU factorization is de-
termined by the ratio ||L||||U||/||A|| (numerical experiments show that the
bounds are, in fact, reasonably sharp). If this ratio is bounded by a mod-
est function of n, then L and U are the true factors of a matrix close to
A, and x̂ solves a slightly perturbed system. However, ||L|| ||U|| can exceed
||A|| by an arbitrary factor, even if A is symmetric positive definite or di-
agonally dominant by rows. Indeed, ||L|| ≥ ||L21|| = ||A21A11^{-1}||, using the
partitioning (12.2), and this lower bound for ||L|| can be arbitrarily large.
In the following two subsections we investigate this instability more closely
and show that ||L||||U|| can be bounded in a useful way for particular classes
of A. Without further comment we make the reasonable assumption that
|| |L||U| || ≈ ||L|| ||U||, so that these bounds may be used in Theorem 12.4.
What can be said for Implementation 2? Suppose, for simplicity, that the
inverses (which are used in step 2 of Algorithm 12.2 and in the block
back substitution) are computed exactly. Then the best bounds of the forms
(12.13) and (12.14) are
Working from these results, we find that Theorem 12.4 still holds provided the
first-order terms in the bounds in (12.15) are multiplied by maxi κ(Uii). This
suggests that Implementation 2 of Algorithm 12.2 can be much less stable
than Implementation 1 when the diagonal blocks of U are ill conditioned, and
this is confirmed by numerical experiments.
(12.16)
is obtained. Note that for the 1- and ∞-norms diagonal dominance does not
imply block diagonal dominance, nor does the reverse implication hold (see
Problem 12.2). Throughout our analysis of block diagonal dominance we take
the norm to be an arbitrary subordinate matrix norm.
First, we show that for block diagonally dominant matrices a block LU
factorization exists, using the key property that block diagonal dominance is
inherited by the Schur complements obtained in the course of the factorization.
In the analysis we assume that A has m block rows and columns.
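The following MATLAB sketch tests block diagonal dominance by columns for a given block partitioning. It assumes that (12.16) has the standard form ||Ajj^{-1}||^{-1} ≥ Σ_{i≠j} ||Aij|| for every block column j; the function name and the blocksizes argument are illustrative.

    % Hedged sketch: test block diagonal dominance by columns, assuming (12.16)
    % has the standard form  ||Ajj^{-1}||^{-1} >= sum_{i ~= j} ||Aij||  for all j.
    % blocksizes lists the dimension of each diagonal block; nrm is 1, 2 or inf.
    function dom = is_block_diag_dom(A, blocksizes, nrm)
    m = numel(blocksizes);
    e = cumsum(blocksizes(:)');  s = [1, e(1:end-1) + 1];    % block boundaries
    dom = true;
    for j = 1:m
        cols = s(j):e(j);
        offsum = 0;
        for i = 1:m
            if i ~= j
                offsum = offsum + norm(A(s(i):e(i), cols), nrm);
            end
        end
        if 1/norm(inv(A(cols, cols)), nrm) < offsum
            dom = false;
            return
        end
    end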
using (12.16)
using (12.16),
(12.17)
showing that A(2) is block diagonally dominant by columns. The result follows
by induction.
The next result allows us to bound ||U|| for a block diagonally dominant
matrix.
Theorem 12.6 (Demmel, Higham, and Schreiber). Let A satisfy the condi-
tions of Theorem 12.5. If A(k) denotes the matrix obtained after k – 1 steps
of Algorithm 12.2, then
Proof. Let A be block diagonally dominant by columns (the proof for row
diagonal dominance is similar). Then
The implications of Theorems 12.5 and 12.6 for stability are as follows.
Suppose A is block diagonally dominant by columns. Also, assume for the
moment that the (subordinate) norm has the property that
(12.18)
which holds for any p-norm, for example. The subdiagonal blocks in the first
block column of L are given by Li1 = Ai1A11^{-1} and so ||Li1|| < 1, by
(12.16) and (12.18). From Theorem 12.5 it follows that ||Lij|| <
1 for j = 2:m. Since Uij = Aij^{(i)} for j ≥ i, Theorem 12.6 shows that ||Uij|| <
2||A|| for each block of U (and ||U1j|| < ||A||). Therefore ||L|| < m and ||U|| <
m^2||A||, and so ||L|| ||U|| < m^3||A||. For particular norms the bounds on the
blocks of L and U yield a smaller bound for ||L|| and ||U||. For example, for
the 1-norm we have ||L||1 ||U||1 < 2m||A||1 and for the ∞-norm
We conclude that block LU factorization is stable if A is block
diagonally dominant by columns with respect to any subordinate matrix norm
satisfying (12.18).
Unfortunately, block LU factorization can be unstable when A is block
diagonally dominant by rows, for although Theorem 12.6 guarantees that
||Uij|| < 2||A||, ||L|| can be arbitrarily large. This can be seen from the
example
(12.19)
and denote by pn the growth factor for GE without pivoting. We assume that
GE applied to A succeeds.
To bound ||L||, we note that, under the partitioning (12.19), for the first
block stage of Algorithm 12.2 we have ||L21|| = ||A21A11^{-1}|| ≤ n pn κ(A) (see
Problem 12.4). Since the algorithm works recursively with the Schur com-
plement S, and since every Schur complement satisfies κ( S) < pn κ(A) (see
Problem 12.4), each subsequently computed subdiagonal block of L has norm
at most Since U is composed of elements of A together with ele-
ments of Schur complements of A,
||U|| < pn||A||. (12.20)
The definiteness implies certain relations among the submatrices Aij that can
be used to obtain a stronger bound for ||L|| 2 than can be deduced for a general
matrix (cf. Problem 12.4).
Table 12.1. Stability of block and point LU factorization. pn is the growth factor for
GE without pivoting.
(12.23)
It follows from Theorem 12.4 that when Algorithm 12.2 is applied to a sym-
metric positive definite matrix A, the backward errors for the LU factorization
and the subsequent solution of a linear system are both bounded by
(12.24)
12.4.1. LAPACK
LAPACK does not implement block LU factorization, but its LU factorization
(and related) routines for full matrices employ partitioned LU factorization
in order to exploit the level-3 BLAS and thereby to be efficient on high-
performance machines.
Problems
12.1. (Varah [1048, 1972]) Suppose A is block tridiagonal and has the block
LU factorization A = LU (so that L and U are block bidiagonal and Ui,i+1 =
Ai,i+1). Show that if A is block diagonally dominant by columns then
What can be deduced about the stability of the factorization for this class
of matrices?
12.2. Show that for the 1- and ∞-norms diagonal dominance does not imply
block diagonal dominance, and vice versa.
12.3. If A is symmetric, has positive diagonal elements, and is block
diagonally dominant by rows, must it be positive definite?
12.4. Let be partitioned
(12.25)
Chapter 13
Matrix Inversion
Almost anything you can do with A–1 can be done without it.
— GEORGE E. FORSYTHE and CLEVE B. MOLER,
Computer Solution of Linear Algebraic Systems (1967)
                  min        max
x = A^{-1}×b      6.66e-12   1.69e-10
GEPP              3.44e-18   7.56e-17
(13.1)
(13.2)
(13.3)
(Note that (13.3) can also be derived from (13.1) or (13.2).) The bounds
(13.1)–(13.3) represent "ideal" bounds for a computed approximation Y to
A^{-1}, if we regard ε as a small multiple of the unit roundoff u. We will show
that, for triangular matrix inversion, appropriate methods do indeed achieve
(13.1) or (13.2) (but not both) and (13.3).
It is important to note that neither (13.1), (13.2), nor (13.3) implies that
Y + ∆Y = (A + ∆A)^{-1} with ∆Y and ∆A suitably small; that
is, Y need not be close to the inverse of a matrix near to A, even in the norm
sense. Indeed, such a result would imply that both the left and right residuals
are small in norm, and this is not the case for any
of the methods we will consider.
To illustrate the latter point we give a numerical example. Define the
matrix An as triu(qr(vand(n))), in MATLAB notation (vand is a
routine from the Test Matrix Toolbox—see Appendix E); in other words, An
is the upper triangular QR factor of the n × n Vandermonde matrix based on
are plotted in Figure 13.1. We see that while the left residual is always less
than the unit roundoff, the right residual becomes large as n increases. These
matrices are very ill conditioned (singular to working precision for n > 20),
yet it is still reasonable to expect a small residual, and we will prove in §13.3.2
that the left residual must be small, independent of the condition number.
In most of this chapter we are not concerned with the precise values of
constants (§13.4 is the exception); thus cn denotes a constant of order n. To
simplify the presentation we introduce a special notation. Let Ai,
i = 1:k, be matrices such that the product A1A2 · · · Ak is defined and let
Method 1.
for j = 1:n
Method 2.
for j = n:–1:1
(13.4)
and the componentwise forward error bound
(13.5)
(13.6)
which is invariant under row and column scaling of L. If we take norms we
obtain normwise relative error bounds that are either row or column scaling
independent: from (13.6) we have
(13.7)
and the same bound holds with cond(L^{-1}) replaced by cond(L).
Notice that (13.4) is a bound for the right residual, LX – I. This is because
Method 1 is derived by solving LX = I. Conversely, Method 2 can be derived
by solving XL = I, which suggests that we should look for a bound on the
left residual for this method.
(13.8)
Proof. The proof is by induction on n, the case n = 1 being trivial.
Assume the result is true for n – 1 and write
Thus
By assumption, the corresponding inequality holds for the (2:n , 2:n ) subma-
trices and so the result is proved.
Lemma 13.1 shows that Method 2 has a left residual analogue of the right
residual bound (13.4) for Method 1. Since there is, in general, no reason to
choose between a small right residual and a small left residual, our conclusion
is that Methods 1 and 2 have equally good numerical stability properties.
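As a concrete illustration, here is a MATLAB sketch of Method 1 for a lower triangular L, obtained by solving LX = I a column at a time by forward substitution; since the detailed listing of Method 1 is not reproduced above, this is a plausible reading of it rather than a verbatim copy.

    % Hedged sketch of Method 1: invert a lower triangular L by solving LX = I
    % column by column (forward substitution).
    function X = tri_inv_method1(L)
    n = size(L, 1);
    X = zeros(n);
    for j = 1:n
        X(j, j) = 1/L(j, j);
        for i = j+1:n
            X(i, j) = -(L(i, j:i-1)*X(j:i-1, j))/L(i, i);
        end
    end

In experiments the right residual L*X − I of this X is small, consistent with (13.4), while the left residual X*L − I can be much larger; a Method 2 implementation behaves the other way around.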
More generally, it can be shown that all three i, j, and k inversion variants
that can be derived from the equations LX = I produce identical rounding
errors under suitable implementations, and all satisfy the same right residual
bound; likewise, the three variants corresponding to the equation XL = I
all satisfy the same left residual bound. The LINPACK routine xTRDI uses
a k variant derived from XL = I; the LINPACK routines xGEDI and xPODI
contain analogous code for inverting an upper triangular matrix (but the LIN-
PACK Users’ Guide [307, 1979, Chaps. 1 and 3] describes a different variant
from the one used in the code).
(13.9)
where we place no restrictions on the block sizes, other than to require the
diagonal blocks to be square. The most natural block generalizations of Meth-
ods 1 and 2 are as follows. Here, we use the notation Lp:q,r:s to denote the
Method 1B.
for j = 1:N
    Xjj = Ljj^{-1} (by Method 1)
    Xj+1:N,j = −Lj+1:N,j Xjj
    Solve Lj+1:N,j+1:N Xj+1:N,j = Xj+1:N,j, by forward substitution
end

Method 2B.
for j = N:−1:1
    Xjj = Ljj^{-1} (by Method 2)
    Xj+1:N,j = Xj+1:N,j+1:N Lj+1:N,j
    Xj+1:N,j = −Xj+1:N,j Xjj
end
One can argue that Method 1B carries out the same arithmetic operations
as Method 1, although possibly in a different order, and that it therefore
satisfies the same error bound (13.4). For completeness, we give a direct
proof.
(13.11)
where L11 and X11 denote the (1, 1) blocks in the partitioning of (13.9).
X11 is computed by Method 1 and so, from (13.4),
(13.12)
Hence
(13.13)
Together, inequalities (13.12) and (13.13) are equivalent to (13.11) with j = 1,
as required.
We can attempt a similar analysis for Method 2B. With the same notation
as above, X21 is computed as X21 = −X22L21X11. Thus
(13.14)
To bound the left residual we have to postmultiply by L11 and use the fact
that X11 is computed by Method 2:
which would be of the desired form in (13.8) were it not for the factor
This analysis suggests that the left residual is not guaranteed
to be small. Numerical experiments confirm that the left and right residuals
can be large simultaneously for Method 2B, although examples are quite hard
to find [322, 1992]; therefore the method must be regarded as unstable when
the block size exceeds 1.
The reason for the instability is that there are two plausible block gen-
eralizations of Method 2 and we have chosen an unstable one that does not
carry out the same arithmetic operations as Method 2. If we perform a solve
with Ljj instead of multiplying by Xjj we obtain the second variation, which
is used by LAPACK’s xTRTRI :
Method 2C.
for j = N:−1:1
    Xjj = Ljj^{-1} (by Method 2)
    Xj+1:N,j = Xj+1:N,j+1:N Lj+1:N,j
    Solve Xj+1:N,j Ljj = −Xj+1:N,j by back substitution.
end
For this method, the analogue of (13.14) is
which yields
13.3.1. Method A
Perhaps the most frequently described method for computing X = A-1 is the
following one.
Method A.
for j = 1:n
Solve Axj = ej
end
13.3.2. Method B
Next, we consider the method used by LINPACK's xGEDI, LAPACK's xGETRI,
and MATLAB's inv function.
Method B.
Compute U^{-1} and then solve for X the equation XL = U^{-1}.
We also assume that the triangular solve from the right with L is done by
back substitution. The computed X therefore satisfies XL = XU + ∆(X, L)
and so
(13.18)
which is the left residual analogue of (13.16). From (13.18) we obtain the
forward error bound
Note that Methods A and B are equivalent, in the sense that Method A
solves for X the equation LUX = I while Method B solves XLU = I. Thus
the two methods carry out analogous operations but in different orders. It fol-
lows that the methods must satisfy analogous residual bounds, and so (13.18)
can be deduced from (13.16).
We mention in passing that the LINPACK manual states that for Method B
a bound holds of the form ||AX – I|| < dnu||A|| ||X|| [307, 1979, p. 1.20]. This
is incorrect, although counterexamples are rare; it is the left residual that is
bounded this way, as follows from (13.18).
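In MATLAB-like terms Method B can be sketched as follows, using backslash and slash for the triangular solves; the listing is an illustration of the method as described above, not the LINPACK or LAPACK code.

    % Hedged sketch of Method B: given PA = LU, compute XU ~ U^{-1} by
    % substitution and then solve X*L = XU, so that X ~ A^{-1} = U^{-1}*L^{-1}*P.
    [Lf, Uf, P] = lu(A);                 % PA = LU, with P a permutation matrix
    XU = Uf\eye(size(A));                % U^{-1}, a column at a time
    X = (XU/Lf)*P;                       % solve X*L = XU, then fold in the permutation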
13.3.3. Method C
The next method that we consider is from Du Croz and Higham [322, 1992].
It solves the equation UXL = I, computing X a partial row and column at a
time. To derive the method partition
where the (1, 1) blocks are scalars, and assume that the trailing submatrix
X 22 is already known. Then the rest of X is computed according to
Method C.
for k = n:–1:1
end
which is (13.17) with |A^{-1}| replaced by its upper bound |U^{-1}||L^{-1}| + O(u)
and the factors reordered.
The LINPACK routine xSIDI uses a special case of Method C in con-
junction with the diagonal pivoting method to invert a symmetric indefinite
matrix; see Du Croz and Higham [322, 1992] for details.
13.3.4. Method D
The next method is based on another natural way to form A -1 and is used
by LAPACK’S xPOTRI , which inverts a symmetric positive definite matrix.
Method D.
Compute L^{-1} and U^{-1} and then form A^{-1} = U^{-1} × L^{-1}.
The advantage of this method is that no extra workspace is needed; U^{-1}
and L^{-1} can overwrite U and L, and can then be overwritten by their product.
However, Method D is significantly slower on some machines than Methods
B or C, because it uses a smaller average vector length for vector operations.
To analyse Method D we will assume initially that L^{-1} is computed by
Method 2 (or Method 2C) and, as for Method B above, that U^{-1} is computed
by an analogue of Method 2 or 2C for upper triangular matrices. We have
(13.20)
Since A = LU – ∆A,
(13.21)
Rewriting the first term of the right-hand side using XLL = I + ∆(XL, L),
and similarly for U, we obtain
(13.22)
and so
(13.23)
and since L is unit lower triangular with |lij| < 1, we have |(L^{-1})ij| < 2^{n−1},
which places a bound on how much the left and right residuals of XL can differ.
Furthermore, since the matrices L from GEPP tend to be well conditioned
and since our numerical experience is that large residuals
tend to occur only for ill-conditioned matrices, we would expect the left and
right residuals of XL almost always to be of similar size. We conclude that
even in the “conflicting residuals” case, Method D will, in practice, usually
satisfy (13.23) or its right residual counterpart, according to whether XU has a
small left or right residual respectively. Similar comments apply to Method B
when U –1 is computed by a method yielding a small right residual.
These considerations are particularly pertinent when we consider Method
D specialized to symmetric positive definite matrices and the Cholesky fac-
torization A = R^TR. Now A^{-1} is obtained by computing XR = R^{-1} and
then forming A^{-1} = XRXR^T; this is the method used in the LINPACK rou-
tine xPODI [307, 1979, Chap. 3]. If XR has a small right residual then XR^T
has a small left residual, so in this application we naturally encounter con-
flicting residuals. Fortunately, the symmetry and definiteness of the problem
help us to obtain a satisfactory residual bound. The analysis parallels the
derivation of (13.23), so it suffices to show how to treat the term
(cf. (13.21)), where R now denotes the computed Cholesky factor. Assuming
RX R = I + ∆(R, XR), and using (13.24) with L replaced by R, we have
and
Since the computed inverse and A are symmetric the same bound holds for the right residual.
13.3.5. Summary
In terms of the error bounds, there is little to choose between Methods A, B,
C, and D. Numerical results reported in [322, 1992] show good agreement with
the bounds. Therefore the choice of method can be based on other criteria,
such as performance and the use of working storage. Table 13.3 gives some
performance figures for a Cray 2, covering both blocked (partitioned) and
unblocked forms of Methods B, C, and D.
On a historical note, Tables 13.4 and 13.5 give timings for matrix inversion
on some early computing devices; times for two modern machines are given
for comparison. The inversion methods used for the timings on the early
computers in Table 13.4 are not known, but are probably methods from this
section or the next.
Table 13.4. Times (minutes and seconds) for inverting an n × n matrix. Source for
DEUCE, Pegasus, and Mark 1 timings: [181, 1981].
for k = 1:n
    Find r such that |ark| = max_{i≥k} |aik|.
    A(k, k:n) ↔ A(r, k:n), b(k) ↔ b(r)   % Swap rows k and r.
    row_ind = [1:k−1, k+1:n]             % Row indices of elements to eliminate.
    m = A(row_ind, k)/A(k, k)            % Multipliers.
    A(row_ind, k:n) = A(row_ind, k:n) − m*A(k, k:n)
    b(row_ind) = b(row_ind) − m*b(k)
end
xi = bi/aii, i = 1:n
Cost: n^3 flops.
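The pseudocode above translates almost directly into the following MATLAB function (b is assumed to be a column vector; the function name is illustrative).

    % Hedged sketch: Gauss-Jordan elimination with partial pivoting for Ax = b,
    % following the pseudocode above.
    function x = gje_solve(A, b)
    n = length(b);
    for k = 1:n
        [~, r] = max(abs(A(k:n, k)));  r = r + k - 1;    % partial pivoting
        A([k r], k:n) = A([r k], k:n);  b([k r]) = b([r k]);
        row_ind = [1:k-1, k+1:n];                        % eliminate above and below the pivot
        m = A(row_ind, k)/A(k, k);                       % multipliers
        A(row_ind, k:n) = A(row_ind, k:n) - m*A(k, k:n);
        b(row_ind) = b(row_ind) - m*b(k);
    end
    x = b./diag(A);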
The numerical stability properties of GJE are rather subtle and error anal-
ysis is trickier than for GE. An observation that simplifies the analysis is that
we can consider the algorithm as comprising two stages. The first stage is
identical to GE and forms Mn−1Mn−2 · · · M1A = U, where U is upper trian-
gular. The second stage reduces U to diagonal form by elementary operations:
(13.25a)
(13.25b)
(13.26)
Without loss of generality, we now assume that the final diagonal matrix D is
the identity (i.e., the pivots are all 1); this simplifies the analysis a little and
has a negligible effect on the final bounds. Thus (13.25) and (13.26) yield
(13.27)
(13.28)
Now
and, similarly,
 n        ηU,b
 16    2.0e-14    5.8e-11
 32    6.4e-10    7.6e-6
 64    1.7e-2     6.6e4
Proof. For the first stage (which is just GE), we have A + ∆A1 =
by Theorems 9.3 and
8.5.
Using (13.30), we obtain
or A = b – r, where
(13.33)
Theorem 13.5 shows that the stability of GJE depends not only on the size
of |L||U| (as in GE), but also on the condition of U. The term is
an upper bound for and if this bound is sharp then the residual bound
is very similar to that for LU factorization. Note that for partial pivoting we
have
The bounds in Theorem 13.5 have the pleasing property that they are
invariant under row or column scaling of A, though of course if we are using
partial pivoting then row scaling can change the pivots and alter the bound.
As mentioned earlier, to obtain bounds for matrix inversion we simply
take b to be each of the unit vectors in turn. For example, the residual bound
becomes
Like the matrix inverse, the determinant is a quantity that rarely needs to
be computed. The common fallacy that the determinant is a measure of ill
conditioning is displayed by the observation that if Q is orthogonal
then det(aQ) = a^n det(Q) = ±a^n, which can be made arbitrarily small or
large despite the fact that aQ is perfectly conditioned. Of course, we could
normalize the matrix before taking its determinant and define, for example,
where D –1 A has rows of unit 2-norm. This function is called the Hadamard
condition number by Birkhoff [99, 1975], because Hadamard’s determinantal
inequality (see Problem 13.11) implies that ψ(A) ≥ 1, with equality if and
only if A has mutually orthogonal rows. Unless A is already row equilibrated
(see Problem 13.13), ψ(A) does not relate to the conditioning of linear systems
in any straightforward way.
The matrix H 1 is H with the first row cyclically permuted to the bottom, so
det(H 1 ) = (–1) n -1 det(H). Since T is a nonsingular upper triangular matrix,
we have the LU factorization
(13.35)
where
Since fl(det(T)) = det(T)(1 + δ1). . .(1 + δn -1), |δ i | < u, the computed de-
terminant satisfies
where |θ n| < γn . This is not a backward error result, because only one of
the two Ts in this expression is perturbed. However, we can write det( T ) =
det(T + ∆T)(1 + θ( n -1)2), so that
of this chapter suggest. The results of §8.2 provide some insight. For example,
if T is a triangular M-matrix then, as noted after Corollary 8.10, its inverse
is computed to high relative accuracy, no matter how ill conditioned T may
be.
Sections 13.2 and 13.3 are based closely on Du Croz and Higham [322,
1992 ].
Method D in §13.3.4 is used by the Hewlett-Packard HP-15C calculator,
for which the method’s lack of need for extra storage is an important prop-
erty [523, 1982].
Higham [560, 1995] gives error analysis of a divide-and-conquer method
for inverting a triangular matrix that has some attractions for parallel com-
putation.
GJE is an old method. It was discovered independently by the geodesist
Wilhelm Jordan (1842-1899) (not the mathematician Camille Jordan (1838-
1922)) and B.-I. Clasen [12, 1987].
An Algol routine for inverting positive definite matrices by GJE was pub-
lished in the Handbook for Automatic Computation by Bauer and Reinsch [83,
1971]. As a means of solving a single linear system, GJE is 50% more expensive
than GE when cost is measured in flops; the reason is that GJE takes O(n^3)
flops to solve the upper triangular system that GE solves in n^2 flops. How-
ever, GJE has attracted interest for vector computing because it maintains
full vector lengths throughout the algorithm. Hoffmann [577, 1987] found that
it was faster than GE on the CDC Cyber 205, a now-defunct machine with a
relatively large vector startup overhead.
Turing [1027, 1948] gives a simplified analysis of the propagation of errors
in GJE, obtaining a forward error bound proportional to κ(A). Bauer [80,
1966] does a componentwise forward error analysis of GJE for matrix inver-
sion and obtains a relative error bound proportional to for symmetric
positive definite A. Bauer’s paper is in German and his analysis is not easy to
follow. A summary of Bauer's analysis (in English) is given by Meinguet [746,
1969].
The first thorough analysis of the stability of GJE was by Peters and
Wilkinson [828, 1975]. Their paper is a paragon of rounding error analysis.
Peters and Wilkinson observe the connection with GE, then devote their at-
tention to the “second stage” of GJE, in which Ux = y is solved. They show
that each component of x is obtained by solving a lower triangular system, and
they deduce that, for each i, (U + ∆Ui)x^(i) = y + ∆yi, where |∆Ui| < γn|U|
and |∆yi| < γn|y|, and where the ith component of x^(i) is the ith component
of the computed solution. They then show that the computed solution has relative error bounded by
but that it does not necessarily have a small backward error. The more direct
approach used in our analysis is similar to that of Dekker and Hoffmann [276,
1989], who give a normwise analysis of a variant of GJE that uses row pivot-
ing (elimination across rows) and column interchanges. Our componentwise
13.6.1. LAPACK
Routine xGETRI computes the inverse of a general square matrix by Method
B using an LU factorization computed by xGETRF . The corresponding routine
for a symmetric positive definite matrix is xPOTRI , which uses Method D,
with a Cholesky factorization computed by xPOTRF . Inversion of a symmetric
indefinite matrix is done by xSYTRI . Triangular matrix inversion is done by
xTRTRI , which uses Method 2C. None of the LAPACK routines compute the
determinant, but it is easy for the user to evaluate it after computing an LU
factorization.
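For example, with an LU factorization from partial pivoting the determinant is the product of the diagonal of U times the sign of the permutation, as in the following MATLAB sketch; overflow and underflow of the product are ignored here (in practice one might accumulate log|det| instead).

    % Hedged sketch: evaluating det(A) from a partial pivoting LU factorization.
    [~, Uf, p] = lu(A, 'vector');        % A(p,:) = L*U
    s = 1;  pp = p;
    for k = 1:numel(pp)                  % sign of the permutation by counting transpositions
        while pp(k) ~= k
            j = pp(k);
            pp([k j]) = pp([j k]);
            s = -s;
        end
    end
    d = s*prod(diag(Uf));                % det(A), ignoring possible over/underflow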
Problems
13.1. Reflect on this cautionary tale told by Acton [4, 1970, p. 246].
“It was 1949 in Southern California. Our computer was a very new CPC
(model 1, number 1) —a 1-second-per-arithmetic-operation clunker that was
holding the computational fort while an early electronic monster was being
coaxed to life in an adjacent room. From a nearby aircraft company there
arrived one day a 16 × 16 matrix of 10-digit numbers whose inverse was desired
. . . We labored for two days and, after the usual number of glitches that
accompany any strange procedure involving repeated handling of intermediate
decks of data cards, we were possessed of an inverse matrix. During the
checking operations . . . it was noted that, to eight significant figures, the
inverse was the transpose of the original matrix! A hurried visit to the aircraft
company to explore the source of the matrix revealed that each element had
been laboriously hand computed from some rather simple combinations of
sines and cosines of a common angle. It took about 10 minutes to prove that
the matrix was, indeed, orthogonal!”
13.2. Rework the analysis of the methods of §13.2.2 using the assumptions
(12.3) and (12.4), thus catering for possible use of fast matrix multiplication
techniques.
This inequality shows that the left and right residuals of X as an approxima-
tion to A–1 can differ greatly only if A is ill conditioned.
13.4. (Mendelssohn [748, 1956]) Find parametrized 2 × 2 matrices A and X
such that the ratio ||AX – I||/||XA – I|| can be made arbitrarily large.
13.5. Let X and Y be approximate inverses of that satisfy
and
Show that
Derive forward error bounds for and Interpret all these bounds.
13.6. What is the relation between the matrix on the front cover of the LA-
PACK Users’ Guide [17, 1995] and that on the back cover? Answer the same
question for the LINPACK Users’ Guide [307, 1979].
13.7. Show that for any matrix having a row or column of 1s, the elements
of the inverse sum to 1.
13.8. Let X = A + iB be nonsingular. Show that X-1 can be
expressed in terms of the inverse of the real matrix of order 2n,
where ak = A(:, k). When is there equality? (Hint: use the QR factorization.)
13.12. (a) Show that if AT = QR is a QR factorization then the Hadamard
condition number where pi = ||R(:, i)||2. (b) Evaluate
for A = U(1) (see (8.2)) and for the Pei matrix A = (a – 1)I + eeT.
13.13. (Guggenheimer, Edelman, and Johnson [486, 1995]) (a) Prove that for
a nonsingular
Chapter 14
Condition Number Estimation
(14.1)
x = x0/||x0||p
repeat
    y = Ax
    z = A^T dualp(y)
    if ||z||q ≤ z^Tx
        γ = ||y||p
        quit
    end
    x = dualq(z)
end
First, we note that the subdifferential (that is, the set of subgradients) of an
arbitrary vector norm ||·|| is given by [378, 1987, p. 379]
(14.2)
We assume now that A has full rank, 1 < p < ∞, and x ≠ 0. Then it
is easy to see that there is a unique vector dualp(x), so the subdifferential has just one
element, that is, ||x||p is differentiable. Hence we have
(14.3)
(14.4)
(14.5)
(14.2), g has the form A^T dualp(Ax); this completes the derivation of the
iteration.
For all values of p the power method has the desirable property of gener-
ating an increasing sequence of norm approximations.
Lemma 14.2. In Algorithm 14.1, the vectors from the kth iteration satisfy
The first inequality in (ii) is strict if convergence is not obtained on the kth
iteration.
Proof. Then
For the last part, note that, in view of (i), the convergence test "||zk||q ≤ zk^Txk"
can be written as "||zk||q ≤ ||yk||p".
It is clear from Lemma 14.2 (or directly from the Hölder inequality) that
the convergence test "||z||q ≤ z^Tx" in Algorithm 14.1 is equivalent to "||z||q =
z^Tx" and, since ||x||p = 1, this is equivalent to x = dualq(z). Thus, although
the convergence test compares two scalars, it is actually testing for equality
in the vector equation (14.4).
The convergence properties of Algorithm 14.1 merit a careful description.
First, in view of Lemma 14.2, the scalars γk = ||yk||p form an increasing and
convergent sequence. This does not necessarily imply that Algorithm 14.1
converges, since the algorithm tests for convergence of the xk , and these vec-
tors could fail to converge. However, a subsequence of the xk must converge
to a limit, x̄ say. Boyd [139, 1974] shows that if x̄ is a strong local maximum
of F with no zero components, then xk → x̄ linearly.
If Algorithm 14.1 converges it converges to a stationary point of F(x)
when 1 < p < ∞. Thus, instead of the desired global maximum ||A||p, we
may obtain only a local maximum or even a saddle point. When p = 1 or ∞,
if the algorithm converges to a point at which F is not differentiable, that
point need not even be a stationary point. On the other hand, for p = 1
or ∞, Algorithm 14.1 terminates in at most n + 1 iterations (assuming that
when dualp or dualq is not unique an extreme point of the unit ball is taken),
since the algorithm moves between the vertices ei of the unit ball in the 1-
norm, increasing F on each stage (x = ±ei for p = 1, and dualp(y) = ±ei for
p = ∞). An example where n iterations are required for p = 1 is given in
Problem 14.2.
For two special types of matrix, more can be said about Algorithm 14.1.
(1) If A = xyT (rank 1), the algorithm converges on the second step with
γ = ||A||p = ||x||p||y||q , whatever x0 .
(2) Boyd [139, 1974] shows that if A has nonnegative elements, ATA is
irreducible, 1 < p < and x0 has positive elements, then the xk converge
and γ k ||A||p .
For values of p strictly between 1 and ∞ the convergence behaviour of
Algorithm 14.1 is typical of a linearly convergent method: exact convergence is
not usually obtained in a finite number of steps and arbitrarily many steps can
x = n^{-1}e
repeat
    y = Ax
    ξ = sign(y)
    z = A^Tξ
    if ||z||∞ ≤ z^Tx
        γ = ||y||1
        quit
    end
    x = ej, where |zj| = ||z||∞ (smallest such j)
end
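A runnable MATLAB version of Algorithm 14.3 is easy to write; the treatment of zero components of y (assigning them sign 1) and the explicit iteration cap are illustrative safeguards.

    % Hedged sketch: the 1-norm power method (Algorithm 14.3); est is a lower
    % bound for norm(A,1).
    n = size(A, 1);
    x = ones(n, 1)/n;
    for it = 1:n+1
        y = A*x;
        est = norm(y, 1);
        xi = sign(y);  xi(xi == 0) = 1;       % a subgradient of the 1-norm at y
        z = A'*xi;
        if norm(z, inf) <= z'*x, break, end   % converged: est = ||y||_1
        [~, j] = max(abs(z));                 % smallest index attaining the max
        x = zeros(n, 1);  x(j) = 1;           % move to the vertex e_j
    end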
where Ce = C^Te = 0 (there are many possible choices for C). For any such
matrix, Algorithm 14.3 computes y = n^{-1}e, ξ = e, z = e, and hence the
algorithm terminates at the end of the first iteration with
The problem is that the algorithm stops at a local maximum that can differ
from the global one by an arbitrarily large factor.
A more reliable and more robust algorithm is produced by the following
modifications of Higham [537, 1988].
Definition of estimate. To overcome most of the poor estimates, γ is
redefined as
where
The vector b is considered likely to “pick out” any large elements of A in those
cases where such elements fail to propagate through to y.
Convergence test. The algorithm is limited to a minimum of two and a
maximum of five iterations. Further, convergence is declared after comput-
ing ξ if the new ξ is the same as the previous one; this event signals that
convergence will be obtained on the current iteration and that the next (and
final) multiplication ATξ is unnecessary. Convergence is also declared if the
new ||y||1 is no larger than the previous one. This nonincrease of the norm
can happen only in finite precision arithmetic and signals the possibility of a
vertex ej being revisited—the onset of “cycling.”
The improved algorithm is as follows. This algorithm is the basis of all
the condition number estimation in LAPACK.
υ = A(n^{-1}e)
if n = 1, quit with γ = |υ1|, end
γ = ||υ||1
ξ = sign(υ)
x = A^Tξ
k = 2
repeat
    j = min{ i: |xi| = ||x||∞ }
    υ = Aej
    γ̄ = γ
    γ = ||υ||1
    if sign(υ) = ξ or γ ≤ γ̄, goto (*), end
    ξ = sign(υ)
    x = A^Tξ
    k = k + 1
until (||x||∞ = xj or k > 5)
(*) xi = (−1)^{i+1}(1 + (i−1)/(n−1)), i = 1:n
x = Ax
if 2||x||1/(3n) > γ then
    υ = x
    γ = 2||x||1/(3n)
end
Figure 14.1. Underestimation ratio for Algorithm 14.4 for the 5 × 5 matrix A(θ) of (14.6)
with 150 equally spaced values of θ ∈ [0, 10].
2. Solve Tx = y.
3. Estimate ||T^{-1}|| ≈ ||x||/||y||.
In LINPACK the norm is the 1-norm, but the algorithm can also be used
for the 2-norm or the ∞-norm. The motivation for step 2 is based on a singular
value decomposition analysis. Roughly, if ||y||/||d|| is large then ||x||/||y||
will almost certainly be at least as large, and it could be
a much better estimate. Notice that T^TTx = d, so the algorithm is related to
the power method on the matrix (T^TT)^{-1} with the specially chosen starting
vector d.
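The flavour of the LINPACK approach can be conveyed by the following simplified MATLAB sketch for a triangular T, in which the right-hand side d is chosen with random ±1 entries; the actual LINPACK algorithm instead chooses d adaptively by the look-ahead strategy described next.

    % Hedged sketch of a LINPACK-style estimate of norm(inv(T)) for triangular T.
    % Here d is random +/-1; LINPACK chooses d by the look-ahead strategy below.
    n = size(T, 1);
    d = sign(randn(n, 1));  d(d == 0) = 1;
    y = T'\d;                       % step 1: solve T'y = d (d chosen to make y large)
    x = T\y;                        % step 2: solve Tx = y
    est = norm(x, 1)/norm(y, 1);    % step 3: lower bound for norm(inv(T), 1)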
To examine step 1 more closely, suppose that T = U^T is lower triangular
and note that the equation Uy = d can be solved by the following column-
oriented (saxpy) form of substitution:
end
The idea is to choose the elements of the right-hand side vector d adaptively
as the solution proceeds, with dj = ±1. At the jth stage of the algorithm
dn, . . . , dj+1 have been chosen and yn, . . . , yj+1 are known. The next element
dj ∈ {+1, −1} is chosen so as to maximize a weighted sum of dj − pj and the
partial sums p1, . . . , pj, which would be computed during the next execution
of statement (*) above. Hence the algorithm looks ahead, trying to gauge
the effect of the choice of d j on future solution components. This heuristic
algorithm for choosing d is expressed in detail as follows.
p(1:j– 1 ) = p -(1:j–1)
end
end
Cost: 4n^2 flops.
LINPACK takes the weights wj ≡ 1, though another possible (but more
expensive) choice would be wj = 1/|ujj|, which corresponds to how pj is
weighted in the expression yj = (dj – pj)/ujj.
To estimate ||A- 1 || for a full matrix A, the LINPACK estimator makes
use of an LU factorization of A. Given PA = LU, the equations solved are
U^Tz = d, L^Ty = z, and Ax = P^Ty, where for the first system d is constructed
by the analogue of Algorithm 14.5 for lower triangular matrices; the estimate
is ||x||1/||y||1 ≤ ||A^{-1}||1. Since d is chosen without reference to L, there is
an underlying assumption that any ill condition in A is reflected in U. This
assumption may not be true; see Problem 14.3.
In contrast to the LAPACK norm estimator, the LINPACK estimator re-
quires explicit access to the elements of the matrix. Hence the estimator
cannot be used to estimate componentwise condition numbers. Furthermore,
separate code has to be written for each different type of matrix and factoriza-
tion. Consequently, while LAPACK has just a single norm estimation routine,
which is called by many other routines, LINPACK has multiple versions of its
algorithm, each tailored to the specific matrix or factorization.
Several years after the LINPACK condition estimator was developed, sev-
eral parametrized counterexamples were found by Cline and Rew [217, 1983].
Numerical counterexamples can also be constructed by direct search, as shown
in §24.3.1. Despite the existence of these counterexamples the LINPACK esti-
mator has been widely used and is regarded as being almost certain to produce
an estimate correct to within a factor 10 in practice.
A 2-norm condition estimator was developed by Cline, Conn, and Van
Loan [218, 1982, Algorithm 1]; see also Van Loan [1043, 1987] for another
explanation. The algorithm builds on the ideas underlying the LINPACK es-
timator by using "look-behind" as well as look-ahead. It estimates σmin(R) =
1/||R^{-1}||2 or σmax(R) = ||R||2 for a triangular matrix R, where σmin and σmax
denote the smallest and largest singular values, respectively. Full matrices
can be treated if a factorization A = QR is available ( Q orthogonal, R up-
per triangular), since R and A have the same singular values. The estimator
performs extremely well in numerical tests, often producing an estimate that
has some correct digits [218, 1982], [534, 1987]. No counterexamples to the
estimator were known until Bischof [103, 1990] obtained counterexamples as
a by-product of the analysis of a different but related method, mentioned at
the end of this section.
All the methods described so far have the property that when applied
repeatedly to a given matrix they always produce the same estimate. Another
(14.7)
where z1, . . . , zn are independent random variables from the normal N(0, 1)
distribution [668, 1981, p. 130]. If, for example, n = 100 and θ has the rather
large value 6400 then inequality (14.7) holds with probability at least 0.9.
In order to take a smaller constant θ, for fixed n and a desired probability,
we can use larger values of k. If k = 2j is even then we can simplify (14.7),
obtaining
(14.8)
and the minimum probability stated by the theorem is 1 - 0.8θ^{-j}n^{1/2}. Taking
j = 3, for the same value n = 100 as before, we find that (14.8) holds with
probability at least 0.9 for the considerably smaller value θ = 4.31.
Probabilistic condition estimation has not yet been adopted in any ma-
jor software packages, perhaps because the other techniques work so well.
For more on the probabilistic power method approach see Dixon [306, 1983],
Higham [534, 1987], and Kuczynski and Wozniakowski [676, 1992] (who also
analyse the more powerful Lanczos method with a random starting vector).
For a probabilistic condition estimation method of very general applicability
see Kenney and Laub [652, 1994] and Gudmundsson, Kenney, and Laub [485,
1995 ].
The condition estimators described above assume that a single estimate
is required for a matrix given in its entirety. Condition estimators have also
been developed for more specialized situations. Bischof [103, 1990] develops a
method for estimating the smallest singular value of a triangular matrix which
processes the matrix a row or a column at a time. This “incremental condition
estimation” method can be used to monitor the condition of a triangular ma-
trix as it is generated, and so is useful in the context of matrix factorization
such as the QR factorization with column pivoting. The estimator is general-
ized to sparse matrices by Bischof, Lewis, and Pierce [104, 1990]. Barlow and
Vemulapati [67, 1992] develop a 1-norm incremental condition estimator with
look-ahead for sparse matrices.
Condition estimates are also required in applications where a matrix fac-
torization is repeatedly updated as a matrix undergoes low rank changes.
Algorithms designed for a recursive least squares problem and employing
the Lanczos method are described by Ferng, Golub, and Plemmons [372,
1991 ]. Pierce and Plemmons [831, 1992 ] describe an algorithm for use with
the Cholesky factorization as the factorization is updated, while Shroff and
Bischof [918, 1992] treat the QR factorization.
that is, if
(14.9)
we have
(14.10)
It follows that we can compute any of the condition numbers or forward error
bounds of interest exactly by solving two bidiagonal systems. The cost is
O(n) flops, as opposed to the O(n^2) flops needed to compute the inverse of a
tridiagonal matrix.
When does the condition |A| = |L||U| hold? Theorem 9.11 shows that it
holds if the tridiagonal matrix A is symmetric positive definite, totally posi-
tive, or an M-matrix. So for these types of matrix we have a very satisfactory
way to compute the condition number.
If A is tridiagonal and diagonally dominant by rows, then we can compute
in O(n) flops an upper bound for the condition number that is not more than
a factor 2n – 1 too large.
where the bidiagonal matrix V = diag(u_ii)^{-1} U has v_ii ≡ 1 and |v_{i,i+1}| =
|e_i/u_ii| < 1 (see the proof of Theorem 9.12). Thus
14.6.1. LAPACK
Algorithm 14.4 is implemented in routine xLACON , which has a reverse commu-
nication interface. The LAPACK routines xPTCON and xPTRFS for symmet-
ric positive definite tridiagonal matrices compute condition numbers using
(14.10); the LAPACK routines xGTCON and xGTRFS for general tridiagonal
matrices use Algorithm 14.4. LINPACK's tridiagonal matrix routines do not
incorporate condition estimation.
Problems
14.1. (Higham [551, 1992]) The purpose of this problem is to develop a non-
iterative method for choosing a starting vector for Algorithm 14.1. The idea
is to choose the components of x in the order x_1, x_2, ..., x_n in an attempt
to maximize ||Ax||_p/||x||_p. Suppose x_1, ..., x_{k-1} satisfying ||x(1:k-1)||_p = 1
have been determined and let γ_{k-1} = ||A(:, 1:k-1)x(1:k-1)||_p. We now try
to choose x_k, and at the same time revise x(1:k-1), to give the next partial
product a larger norm. Defining
we set
where
Develop this outline into a practical algorithm. What can you prove about
the quality of the estimate ||Ax||p /||x||p that it produces?
Note that, for all a, ||T_n(a)e_{n-1}||_1 = ||T_n(a)||_1. Show that if Algorithm 14.3
is applied to T_n(a) with 0 < a < 1 then x = e_{i-1} on the ith iteration, for
i = 2, ..., n, with convergence on the nth iteration. Algorithm 14.4, however,
terminates after five iterations with y_5 = T_n(a)e_4, and
Show that the extra estimate saves the day, so that Algorithm 14.4 returns a
final estimate that is within a factor 3 of the true norm, for any a < 1.
14.3. Let PA = LU be an LU factorization with partial pivoting of A ∈ ℝ^{n×n}.
Show that
14.4. (Higham [537, 1988]) Investigate the behaviour of Algorithms 14.3 and
14.4 for the Pei matrix, A = aI + eeT ( a > 0), and for the upper bidiagonal
matrix with 1s on the diagonal and the first superdiagonal.
14.5. (Ikebe [600, 1979], Higham [531, 1986]) Let A ∈ ℝ^{n×n} be nonsingular,
tridiagonal, and irreducible. By equating the last columns in AA^{-1} = I
and the first rows in A^{-1}A = I, show how to compute the vectors x and
y in Theorem 14.9 in O(n) flops. Hence obtain an O(n) flops algorithm for
computing, where d > 0,
14.6. The representation of Theorem 14.9 for the inverse of a nonsingular, tridi-
agonal, and irreducible A ∈ ℝ^{n×n} involves 4n - 2 parameters, yet A depends only
on 3n - 2 parameters. Obtain an alternative representation that involves only
3n - 2 parameters. (Hint: symmetrize the matrix.)
14.7. (RESEARCH PROBLEM) (Demmel [286, 1992]) Show that estimating
||A^{-1}|| to within a factor depending only on the dimension of A is at least as
expensive as computing A^{-1}.
14.8. (RESEARCH PROBLEM) Let A ∈ ℝ^{n×n} be diagonally dominant by rows,
let A = LU be an LU factorization, and let y > 0. What is the maximum
size of the ratio in question? This is an open problem raised in [541,
1990]. In a small number of numerical experiments with full random matrices
the ratio has been found to be less than 2 [541, 1990], [790, 1986].
Chapter 15
The Sylvester Equation
AX – XB = C, (15.1)
1. linear system: Ax = c,
3. matrix inversion: AX = I,
5. commuting matrices: AX - XA = 0.
The Sylvester equation arises in its full generality in various applications. For
example, the equations
(See Horn and Johnson [581, 1991, Chap. 4] for a detailed presentation of
properties of the Kronecker product and the vec operator). The mn × mn
coefficient matrix in (15.2) has a very special structure, illustrated for n = 3
by
In dealing with the Sylvester equation it is vital to consider this structure and
not treat (15.2) as a general linear system.
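The special structure is easy to see numerically. The following Python/NumPy sketch compares the Kronecker-product form of AX - XB = C with SciPy's Schur-based solver (the dimensions and data are illustrative; forming the mn × mn matrix is only sensible for tiny m and n):

    import numpy as np
    from scipy.linalg import solve_sylvester

    rng = np.random.default_rng(0)
    m, n = 4, 3
    A = rng.standard_normal((m, m))
    B = rng.standard_normal((n, n))
    C = rng.standard_normal((m, n))

    K = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))   # I_n (x) A - B^T (x) I_m
    X_kron = np.linalg.solve(K, C.flatten(order="F")).reshape((m, n), order="F")
    X_schur = solve_sylvester(A, -B, C)                   # solves AX + X(-B) = C
    print(np.allclose(X_kron, X_schur))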
Since the mn eigenvalues of I_n ⊗ A - B^T ⊗ I_m are given by
(15.3)
(15.5)
Suppose now that R and S are quasi-triangular, and for definiteness as-
sume that they are both upper quasi-triangular. Partitioning Z = (Zij ) con-
(15.6)
(15.8)
(15.9)
We saw in the last section that standard methods for solving the Sylvester
equation are guaranteed to produce a small relative residual. Does a small
relative residual imply a small backward error? The answer to this question
for a general linear system is yes (Theorem 7.1). But for the highly structured
Sylvester equation the answer must be no, because for the special case of ma-
trix inversion we know that a small residual does not imply a small backward
error (§13.1). In this section we investigate the relationship between residual
and backward error for the Sylvester equation.
The normwise backward error of an approximate solution Y to (15.1) is
defined by
(15.10)
(15.11)
(15.12)
To explore the converse question of what the residual implies about the
backward error we begin by transforming (15.11) using the SVD of Y, Y =
UΣV^T, where U ∈ ℝ^{m×m} and V ∈ ℝ^{n×n} are orthogonal and Σ = diag(σ_i).
The numbers σ_1 ≥ σ_2 ≥ ... ≥ σ_{min(m,n)} ≥ 0 are the singular values
(15.13)
where
(15.14)
where
(15.16)
For notational convenience we extend (if m < n) or (if m > n) to dimension
m × n; the “fictitious” elements will be set to zero by the minimization.
This expression shows that the backward error is approximately equal not to
the normwise relative residual but to a componentwise residual corresponding
to the diagonalized equation (15.13).
From (15.15) and (15.16) we deduce that
(15.17)
where
(15.18)
(15.19)
(15.20)
Although the computed solution has a very acceptable residual (as it must in view of (15.9)), its
backward error is eight orders of magnitude larger than is necessary to achieve
backward stability. We solved the same Sylvester equation using GEPP on the
system (15.2). The relative residual was again less than u, but the backward
error was appreciably larger:
One conclusion we can draw from the analysis is that standard methods for
solving the Sylvester equation are at best conditionally backward stable, since
there exist rounding errors such that is the only nonzero element of
and then (15.17) is an approximate equality, with µ possibly large.
AX + XA^T = C,
which is called the Lyapunov equation. This equation plays a major role in
control and systems theory and it can be solved using the same techniques as
for the Sylvester equation.
If C = C^T then transposing AX + XA^T = C gives AX^T + X^T A^T = C^T = C, so X and
X^T are both solutions to the Lyapunov equation. If the Lyapunov equation
is nonsingular (equivalently, λ_i + λ_j ≠ 0 for all i and j, by (15.3)) it
therefore has a unique symmetric solution.
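This property is easy to check numerically. A hedged Python/SciPy sketch (the shift and data are illustrative; solve_continuous_lyapunov solves AX + XA^H = Q):

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5)) - 5.0 * np.eye(5)  # shift keeps lambda_i + lambda_j away from 0
    C = rng.standard_normal((5, 5))
    C = C + C.T                                        # symmetric right-hand side
    X = solve_continuous_lyapunov(A, C)
    print(np.allclose(A @ X + X @ A.T, C), np.allclose(X, X.T))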
(15.21)
where
(15.22)
(15.23)
is the corresponding condition number for the Sylvester equation. The bound
(15.23) can be weakened to
(15.25)
where
The sep function is an important tool for measuring invariant subspace sen-
sitivity [470, 1989, §7.2.5], [940, 1973], [1050, 1979].
For the Lyapunov equation, a similar derivation to the one above shows
that the condition number is
(15.27)
where Π is the vec-permutation matrix, which is defined by the property that
vec(A^T) = Π vec(A).
How much can the bounds (15.23) and (15.25) differ? The answer is by
an arbitrary factor. To show this we consider the case where B is normal
(or equivalently, A is normal if we transpose the Sylvester equation). We
can assume B is in Schur form, thus B = diag(μ_j) (with the μ_j possibly
complex). Then P = diag(A - μ_j I_m), and it is straightforward to show that
if X = [x_1, ..., x_n], and if we approximate the 2-norms in the definitions of
and Φ by Frobenius norms, then
while
Then if
A similar but potentially much smaller bound is described in the next section.
(15.29)
properties than (15.28), although (15.29) is not actually invariant under di-
agonal scalings of the Sylvester equation.
We give a numerical example to illustrate the advantage of (15.29) over
(15.28). Let
confirming that the usual perturbation bound (15.25) for the Sylvester equa-
tion can be very pessimistic. Furthermore,
15.5. Extensions
The Sylvester equation can be generalized in two main ways. One retains the
linearity but adds extra coefficient matrices, yielding the generalized Sylvester
equations
AXB + CXD = E (15.30)
and
AX – YB = C, DX – YE = F. (15.31)
These two forms are equivalent, under conditions on the coefficient matrices
[210, 1987]; for example, defining Z := XB and W := –CX, (15.30) becomes
AZ – WD = E, CZ + WB = 0. Applications of generalized Sylvester equa-
tions include the computation of stable eigendecompositions of matrix pencils
[294, 1987], [295, 1988], [622, 1993], [623, 1994] and the implementation of
(15.32)
real parts and C is positive semidefinite; his method directly computes the
Cholesky factor of the solution (which is indeed symmetric positive definite—
see Problem 15.2).
A survey of the vec operator, the Kronecker product, and the vec-permu-
tation matrix is given together with historical comments by Henderson and
Searle [514, 1981]. Historical research by Henderson, Pukelsheim, and Searle
[513, 1983] indicates that the Kronecker product should be called the Zehfuss
product, in recognition of an 1858 paper by Zehfuss that gives a determinantal
result involving the product.
The vec-permutation matrix Π (which appears in (15.27)) is given explicitly by
15.6.1. LAPACK
The computations discussed in this chapter can all be done using LAPACK.
The Bartels-Stewart algorithm can be implemented by calling xGEES to com-
pute the Schur decomposition, using the level-3 BLAS routine xGEMM to trans-
form the right-hand side C, calling xTRSYL to solve the (quasi-) triangular
Sylvester equation, and using xGEMM to transform back to the solution X.
The error bound (15.29) can be estimated using xLACON in conjunction with
the above routines. A Fortran 77 code dggsvx [556, 1993] of Higham follows
this outline and may appear in a future release of LAPACK.
Routine xLASY2 solves a real Sylvester equation AX ± XB = σC in which
A and B have dimension 1 or 2 and σ is a scale factor. It is called by xTRSYL .
Kågström and Poromaa have developed codes for solving (15.31), which
are intended for a future release of LAPACK [622, 1993], [623, 1994].
Problems
15.1. Show that the Sylvester equation AX – XA = I has no solution.
15.2. (Bellman [89, 1970, §10.18]) Show that if the expression

    X = -∫_0^∞ e^{At} C e^{Bt} dt

exists for all C it represents the unique solution of the Sylvester equation
AX + XB = C. (Hint: consider the matrix differential equation dZ/dt =
AZ(t) + Z(t)B, Z(0) = C.) Deduce that the Lyapunov equation AX +
XAT = –C has a symmetric positive definite solution if A has eigenvalues
with negative real parts and C is symmetric positive definite.
15.3. (Byers and Nash [176, 1987]) Let A ∈ ℝ^{n×n} and consider
Chapter 16
Stationary Iterative Methods
Gauss refers here to his relaxation method for solving the normal equations. The
translation is taken from Forsythe [387, 1951 ].
Table 16.1. Dates of publication of selected iterative methods. Based on Young [1123, 1989].
Iterative methods for solving linear systems have a long history, going back
at least to Gauss. Table 16.1 shows the dates of publication of selected meth-
ods. It is perhaps surprising, then, that rounding error analysis for iterative
methods is not well developed. There are two main reasons for the paucity
of error analysis. One is that in many applications accuracy requirements
are modest and are satisfied without difficulty, resulting in little demand for
error analysis. Certainly there is no point in computing an answer to greater
accuracy than that determined by the data, and in scientific and engineering
applications the data often has only a few correct digits. The second reason
is that rounding error analysis for iterative methods is inherently more diffi-
cult than for direct methods, and the bounds that are obtained are harder to
interpret.
In this chapter we consider a simple but important class of iterative meth-
ods, stationary iterative methods, for which a reasonably comprehensive error
analysis can be given. The basic question that our analysis attempts to answer
is “What is the limiting accuracy of a method in floating point arithmetic?”
Specifically, how small can we guarantee that the backward or forward error
will be over all iterations k = 1, 2 ,. . .? Without an answer to this question
we cannot be sure that a convergence test of the form (say)
will ever be satisfied, for any given value of
As an indication of the potentially devastating effects of rounding errors
we present an example constructed and discussed by Hammarling and Wilkin-
son [497, 1976]. Here, A is the 100 × 100 lower bidiagonal matrix with a_ii ≡ 1.5
and a_{i,i-1} ≡ 1, and b_i ≡ 2.5. The successive overrelaxation (SOR) method
is applied in MATLAB with parameter ω = 1.5, starting with the rounded
version of the exact solution x, given by x_i = 1 - (-2/3)^i. The forward errors
and the ∞-norm backward errors are plotted in
Figure 16.1. The SOR method converges in exact arithmetic, since the itera-
tion matrix has spectral radius 1/2, but in the presence of rounding errors it
diverges. The iterate has a largest element of order 10^13, for
the SOR method. With the aid of numerical examples, they emphasize that
while it is the spectral radius of the iteration matrix M^{-1}N that determines
the asymptotic rate of convergence, it is the norms of the powers of this matrix
that govern the behaviour of the iteration in the early stages. This point is
also explained by Trefethen [1017, 1992], using the tool of pseudospectra.
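The Hammarling-Wilkinson example described above is easy to reproduce. The following is a minimal Python/NumPy sketch under the data stated earlier (the iteration count is illustrative, and this is not the MATLAB code used for Figure 16.1):

    import numpy as np

    n, omega = 100, 1.5
    b = 2.5 * np.ones(n)
    x_exact = 1.0 - (-2.0 / 3.0) ** np.arange(1, n + 1)
    x = x_exact.copy()                        # start from the (rounded) exact solution
    for k in range(300):                      # iteration count is illustrative
        for i in range(n):
            s = b[i] - (x[i - 1] if i > 0 else 0.0)   # subdiagonal term uses updated x
            x[i] = (1.0 - omega) * x[i] + omega * s / 1.5
    print(np.linalg.norm(x - x_exact, np.inf))        # grows despite rho(G) = 1/2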
Dennis and Walker [302, 1984] obtain bounds for
for stationary iteration as a special case of error analysis of quasi-Newton
methods for nonlinear systems. The bounds in [302, 1984] do not readily
yield information about normwise or componentwise forward stability.
Bollen [133, 1984] analyses the class of “descent methods” for solving Ax =
b, where A is required to be symmetric positive definite; these are obtained by
iteratively using exact line searches to minimize the quadratic function F(x) =
(A^{-1}b - x)^T A(A^{-1}b - x). The choice of search direction p_k = b - Ax_k =: r_k
yields the steepest descent method, while p_k = e_j (unit vector), where |r_k|_j = ||r_k||_∞,
gives the Gauss-Southwell method. Bollen shows that both methods
are normwise backward stable as long as a condition of the form cn κ(A)u < 1
holds. If the pk are cyclically chosen to be the unit vectors e 1, e2, . . ., en then
the Gauss–Seidel method results, but unfortunately no results specific to this
method are given in [133, 1984].
Wozniakowski [1112, 1977] shows that the Chebyshev semi-iterative method
is normwise forward stable but not normwise backward stable, and in [1113,
1978] he gives a normwise error analysis of stationary iterative methods. Some
of the assumptions in [1113, 1978] are difficult to justify, as explained by
Higham and Knight [563, 1993].
In [1114, 1980] Wozniakowski analyses a class of conjugate gradient al-
gorithms (which does not include the usual conjugate gradient method). He
obtains a forward error bound proportional to κ(A)^{3/2} and a residual bound
proportional to κ(A), from which neither backward nor forward normwise sta-
bility can be deduced. We note that as part of the analysis in [1114, 1980]
Wozniakowski obtains a residual bound for the steepest descent method that
is proportional to κ(A), and is therefore much weaker than the bound obtained
by Bollen [133, 1984].
Zawilski [1125, 1991] shows that the cyclic Richardson method for sym-
metric positive definite systems is normwise forward stable provided the pa-
rameters are suitably ordered. He also derives a sharp bound for the residual
that includes a factor κ(A), and which therefore shows that the method is not
normwise backward stable.
Arioli and Romani [29, 1992] give a statistical error analysis of stationary
iterative methods. They investigate the relations between a statistically de-
fined asymptotic stability factor, ill conditioning of M^{-1}A, where A = M - N
is the splitting, and the rate of convergence.
Greenbaum [479, 1989] presents a detailed error analysis of the conjugate
gradient method, but her concern is with the rate of convergence rather than
the attainable accuracy. An excellent survey of work concerned with the effects
of rounding error on the conjugate gradient method (and the Lanczos method)
is given by Greenbaum and Strakos in the introduction of [480, 1992]; see also
Greenbaum [481, 1994]. Notay [799, 1993] analyses how rounding errors influ-
ence the convergence rate of the conjugate gradient method for matrices with
isolated eigenvalues at the ends of the spectrum. Van der Vorst [1058, 1990]
examines the effect of rounding errors on preconditioned conjugate gradient
methods with incomplete Cholesky preconditioners.
The analysis given in the remainder of this chapter is from Higham and
Knight [563, 1993], [564, 1993], wherein more details are given. Error analysis
of Kaczmarz’s row-action method is given by Knight [663, 1993].
which we write as
(16.1)
where
We will assume that M is triangular (as is the case for the Jacobi, Gauss–
Seidel, SOR, and Richardson iterations), so that and f k
accounts solely for the errors in forming Hence
(16.2)
(16.3)
(16.4)
(16.5)
We have
(16.6)
where μ_k is the bound for ξ_k defined in (16.2). The first term, |G^{m+1}e_0|, is
the error of the iteration in exact arithmetic and is negligible for large m. The
accuracy that can be guaranteed by the analysis is therefore determined by
the last term in (16.6), and it is this term on which the rest of the analysis
focuses.
At this point we can proceed by using further componentwise inequalities
or by using norms. First we consider the norm approach. By taking norms in
(16.6) and defining
(16.7)
we obtain
(16.8)
where the existence of the sum is assured by the result of Problem 16.1.
If = q < 1 then (16.8) yields
Thus if q is not too close to 1 (q < 0.9, say), and γx and are not too
large, a small forward error is guaranteed for sufficiently large m.
Of more interest is the following componentwise development of (16.6).
Defining
(16.9)
(16.10)
(16.11)
where, again, the existence of the sum is assured by the result of Problem 16.1.
Since A = M - N = M(I - M^{-1}N) we have
The sum in (16.11) is clearly an upper bound for |A^{-1}|. Defining c(A) ≥ 1 by
(16.12)
we have our final bound
(16.13)
methods, this formula can be taken as being indicative of the size of c(A)
when M^{-1}N is diagonalizable with a well-conditioned matrix of eigenvectors.
We therefore have the heuristic inequality, for general A,
(16.14)
(16.15)
If θ_x c(A) = O(1) and |M| + |N| ≤ a|A|, with a = O(1), this bound is of the
form c_n cond(A, x)u as m → ∞ and we have componentwise forward stability.
Now we specialize the forward error bound (16.15) to the Jacobi, Gauss–
Seidel, and SOR iterations.
(16.16)
Wozniakowski [1113, 1978, Ex. 4.1] cites the symmetric positive definite
matrix
as a matrix for which the Jacobi method can be unstable, in the sense that
there exist rounding errors such that no iterate has a relative error bounded by
Let us see what our analysis predicts for this example. Straight-
forward manipulation shows that if a = 1/2 – then
so as (The heuristic lower bound (16.14) is approximately
in this case.) Therefore (16.16) suggests that the Jacobi iteration can
be unstable for this matrix. To confirm the instability we applied the Jacobi
method to the problem with x = [1, 1, 1]^T and a = 1/2 - 8^{-j}, j = 1:5. We
took a random x_0 with ||x - x_0||_2 = 10^{-10}, and the iteration was terminated
when there was no decrease in the norm of the residual for 50 consecutive
iterations. Table 16.2 reports the smallest value of
over all iterations, for each j; the number of iterations is shown in the column
"Iters."
The ratio takes the values 8.02, 7.98, 8.02,
7.98 for j = 1:4, showing excellent agreement with the behaviour predicted
by (16.16), since Moreover, in these tests and setting
the bound (16.16) is at most a factor 13.3 larger than the observed
error, for each j.
If -1/2 < a < 0 then A is an M-matrix and c(A) = 1. The bound (16.16)
shows that if we set a = -(1/2 - 8^{-j}) and repeat the above experiment then
the Jacobi method will perform in a componentwise forward stable manner
(clearly, this is to be expected). We carried out the modified experiment,
obtaining the results shown in Table 16.3. All the values are less
than cond(A, x)u, so the Jacobi iteration is indeed componentwise forward
stable in this case. Note that since ρ(M^{-1}N) and ||M^{-1}N||_2 take the same
values for a and -a, the usual rate of convergence measures cannot distinguish
between these two examples.
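The measurement loop used in these experiments is straightforward to sketch. The following Python/NumPy function is illustrative only (names, tolerances and stopping rule are a hedged reading of the experiment described above); the matrix, right-hand side, exact solution and starting vector are supplied by the caller.

    import numpy as np

    def jacobi_min_relative_error(A, b, x_exact, x0, maxit=100000, stall=50):
        # Run the Jacobi iteration and return the smallest 2-norm relative error
        # over all iterates, stopping once the residual norm has not decreased
        # for `stall` consecutive iterations.
        d = np.diag(A)
        x = x0.copy()
        best_err, best_res, since = np.inf, np.inf, 0
        for _ in range(maxit):
            x = x + (b - A @ x) / d                   # Jacobi update: M = diag(A)
            best_err = min(best_err,
                           np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact))
            res = np.linalg.norm(b - A @ x)
            if res < best_res:
                best_res, since = res, 0
            else:
                since += 1
                if since >= stall:
                    break
        return best_err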
(16.17)
(16.18)
where
A potentially much smaller bound can be obtained under the assumption that
H is diagonalizable. If H = XDX^{-1}, with D = diag(λ_i), then
(16.20)
A componentwise residual bound can also be obtained, but it does not lead
to any identifiable classes of matrix or iteration for which the componentwise
relative backward error is small.
To conclude, we return to our numerical examples. For the SOR example
at the start of the chapter, c(A) = O(10^45) and σ = O(10^30), so our error
bounds for this problem are all extremely large. In this problem max_i |1 -
where so (16.14) is very weak; (16.20) is
not applicable since M^{-1}N is defective.
For the first numerical example in §16.2.1, Table 16.2 reports the minimum
backward errors For this problem it is straightforward to
show that The ratios of backward errors for
successive values of j are 7.10, 7.89, 7.99, 8.00, so we see excellent agreement
with the behaviour predicted by the bounds. Table 16.3 reports the normwise
backward errors for the second numerical example in §16.2.1. The backward
errors are all less than u, which again is close to what the bounds predict,
since it can be shown that σ < 5 for -1/2 < a < 0. In both of the examples
of §16.2.1 the componentwise backward error and in
our practical experience this behaviour is typical of the Jacobi and SOR
iterations.
is also known as the group inverse of A and is denoted by A^#. The Drazin
inverse is an "equation-solving inverse" precisely when index(A) ≤ 1, for then
AA^D A = A, and so if Ax = b is a consistent system then A^D b is a solution.
As we will see, however, the Drazin inverse of the coefficient matrix A itself
plays no role in the analysis. The Drazin inverse can be represented explicitly
as follows. If
where P and B are nonsingular and N has only zero eigenvalues, then
Further details of the Drazin inverse can be found in Campbell and Meyer’s
excellent treatise [180, 1979, Chap. 7].
Let A ∈ ℝ^{n×n} be a singular matrix and consider solving Ax = b by
stationary iteration with a splitting A = M – N, where M is nonsingular.
First, we examine the convergence of the iteration in exact arithmetic. Since
any limit point x of the sequence {x k } must satisfy Mx = Nx + b, or Ax = b,
we restrict our attention to consistent linear systems. (For a thorough analysis
of stationary iteration for inconsistent systems see Dax [269, 1990]. ) As in
the nonsingular case we have the relation (cf. (16.4)):
(16.21)
where G = M^{-1}N. Since A is singular, G has an eigenvalue 1, so G^m does
not tend to zero as m → ∞, that is, G is not convergent. If the iteration
is to converge for all x_0 then lim_{m→∞} G^m must exist. Following Meyer and
Plemmons [752, 1977], we call a matrix B for which lim_{m→∞} B^m exists semi-
convergent.
We assume from this point on that G is semiconvergent. It is easy to see
[94, 1994, Lem. 6.9] that G must have the form
(16.22)
(16.23)
(16.24)
Hence
(16.25)
To evaluate the limit of the second term in (16.21) we note that, since the
system is consistent, M^{-1}b = M^{-1}Ax = (I - G)x, and so
(16.26)
The first term in this limit is in null(I - G) and the second term is in range(I -
G). To obtain the unique solution in range(I - G) we should take for x_0 any
vector in range(I - G) (x_0 = 0, say).
where
(16.27)
(16.28)
(16.29)
The convergence of the two infinite sums is assured by the result of Prob-
lem 16.1, since by (16.22)-(16.24),
G i E = Gi (I – G)D (I – G)
(16.30)
where
We conclude that we have the normwise error bound
(16.31)
The form of the sum in (16.29) prompts us to define the scalar
c(A) ≥ 1 by
(16.32)
error is no larger than the uncertainty in x caused by rounding the data. The
constants θ_x and c(A) should be bounded by d_n, where d_n denotes a slowly
growing function of n; the inequality |M| + |N| ≤ should hold, as it
does for the Jacobi method and for the SOR method when where
β is positive and not too close to zero; and the "exact error" G^{m+1}e_0 must
decay quickly enough to ensure that the term (m + 1)|(I - E)M^{-1}| does not
grow too large before the iteration is terminated.
Numerical results given in [564, 1993] show that the analysis can correctly
predict forward and backward stability, and that for certain problems linear
growth of the component of the error in null(A) can indeed cause an otherwise
convergent iteration to diverge, even when starting very close to a solution.
From Theorem 7.1 we have the following equivalences, for any subordinate
matrix norm:
(16.33a)
(16.33b)
(16.33c)
These inequalities remain true with norms replaced by absolute values (Theo-
rem 7.3), but to evaluate (16.33b) and (16.33c) a matrix-vector product |A||y|
must be computed, which is a nontrivial expense in an iterative method.
Of these tests, (16.33c) is preferred in general, assuming it is acceptable
to perturb both A and b. Note the importance of including both ||A|| and
||y|| in the test on ||r||; a test ||r|| < though scale independent, does
not bound any relative backward error. Test (16.33a) is commonly used in
existing codes, but may be very stringent, and possibly unsatisfiable. To see
why, note that the residual of the rounded exact solution fl(x) = x + Δx,
|Δx| ≤ u|x|, satisfies, for any absolute norm,
and
Problems
16.1. Show that if B ∈ ℝ^{n×n} and ρ(B) < 1, then the series Σ_{k≥0} ||B^k|| and
Σ_{k≥0} |B^k| are both convergent, where ||·|| is any norm.
16.2. (Descloux [304, 1963]) Consider the (nonlinear) iterative process
where satisfies
(16.34)
for some and where ||ek|| < a for all k. Note that a must satisfy
G(a) = a.
(a) Show that
(b) Show that the sequence {x k} is bounded and its points of accumulation
x satisfy
Chapter 17
Matrix Powers
Unfortunately, the roundoff errors in the mth power of a matrix, say Bm,
are usually small relative to ||B||m rather than ||Bm||.
— CLEVE B. MOLER and CHARLES F. VAN LOAN,
Nineteen Dubious Ways to Compute the Exponential of a Matrix (1978)
• For a matrix with ρ(A) < 1, how does the sequence {||A^k||} behave? In
particular, what is the size of the "hump" max_k ||A^k||?
(17.1b)
(17.2)
with and a ≫ 1. While the spectral radius determines the asymptotic
rate of growth of matrix powers, the norm influences the initial behaviour of
the powers. The interesting result that ρ(A) = lim_{k→∞} ||A^k||^{1/k} for any norm
(see Horn and Johnson [580, 1985, p. 299], for example) confirms the asymp-
totic role of the spectral radius. This formula for ρ(A) has actually been con-
sidered as a means for computing it; see Wilkinson [1089, 1965, pp. 615-617]
and Friedland [408, 1991].
An important quantity is the "hump" max_k ||A^k||, which can be arbitrarily
large for a convergent matrix. Figure 17.1 shows the hump for the 3 × 3
upper triangular matrix with diagonal entries 3/4 and off-diagonal entries 2;
this matrix has 2-norm 3.57. The shape of the plot is typical of that for a
convergent matrix with norm bigger than 1. Note that if A is normal (so
that in (17.1a) J is diagonal and X can be taken to be unitary) we have
||A^k||_2 = ||A||_2^k = ρ(A)^k, so the problem of bounding ||A^k||
is of interest only for nonnormal matrices. The hump phenomenon arises in
various areas of numerical analysis. For example, it is discussed for matrix
powers in the context of stiff differential equations by D. J. Higham and
Trefethen [529, 1993], and by Moler and Van Loan [775, 1978] for the matrix
exponential e^{At} with t
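The hump is easy to observe numerically. A short Python/NumPy sketch for the 3 × 3 matrix just mentioned (the power range is illustrative):

    import numpy as np

    A = 0.75 * np.eye(3) + np.triu(2.0 * np.ones((3, 3)), 1)   # diag 3/4, off-diag 2
    norms = [np.linalg.norm(np.linalg.matrix_power(A, k), 2) for k in range(1, 60)]
    print(max(norms), 1 + int(np.argmax(norms)))   # hump height and the k where it occurs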
More insight into the behaviour of matrix powers can be gained by con-
sidering the 2 × 2 matrix (17.2) with and a > 0. We have
and
(17.3)
Figure 17.3. Infinity norms of powers of the 2 × 2 matrix J in (17.2), for λ = 0.99 and
a = 0 (bottom line) and a = 10^{-k}, k = 0:3.
Hence
It follows that the norms of the powers can increase for arbitrarily many steps
until they ultimately decrease. Moreover, because kλ^{k-1} tends to zero quite
slowly as k → ∞, the rate of convergence of ||J^k|| to zero can be much
slower than the convergence of λ^k to zero (see (17.3)) when λ is close to 1. In
other words, nontrivial Jordan blocks retard the convergence to zero.
For this 2 × 2 matrix, the hump max_k ||J^k|| is easily shown to be approx-
imately
in various bounds in this chapter. The matrix X in the Jordan form (17.1a)
is by no means unique [413, 1959, pp. 220–221], [467, 1976]: if A has distinct
eigenvalues (hence J is diagonal) then X can be replaced by XD , for any
nonsingular diagonal D, while if A has repeated eigenvalues then X can be
replaced by XT, where T is a block matrix with block structure conformal
with that of J and which contains some arbitrary upper trapezoidal Toeplitz
blocks. We adopt the convention that κ(X) denotes the minimum possible
value of κ(X) over all possible choices of X. This minimum value is not
known for general A, and the best we can hope is to obtain a good estimate
of it. However, if A has distinct eigenvalues then the results in Theorems 7.5
and 7.7 on diagonal scalings are applicable and enable us to determine (an
approximation to) the minimal condition number. Explicit expressions can be
given for the minimal 2-norm condition number for n = 2; see Young [1122,
1971, §3.8].
A trivial bound is ||A^k|| ≤ ||A||^k. A sharper bound can be derived in terms
of the numerical radius

    r(A) = max{ |x*Ax| : x ∈ ℂ^n, ||x||_2 = 1 },

which is the point of largest modulus in the field of values of A. It is not hard
to show that ||A||_2/2 ≤ r(A) ≤ ||A||_2 [580, 1985, p. 331]. The (nontrivial)
inequality r(A^k) ≤ r(A)^k [580, 1985, p. 333] leads to the bound
(17.4)
for any p-norm. (Since ρ(A) ≤ ||A|| for any norm, we also have the lower bound
ρ(A)^k ≤ ||A^k||_p.) This bound is unsatisfactory for two reasons. First, by
choosing A to have well-conditioned large eigenvalues and ill-conditioned small
eigenvalues we can make the bound arbitrarily pessimistic (see Problem 17.1).
Second, it models norms of powers of convergent matrices as monotonically
decreasing sequences, which is qualitatively incorrect if there is a large hump.
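The numerical radius can be estimated from the standard characterization r(A) = max_θ λ_max((e^{iθ}A + e^{-iθ}A*)/2). A discretized Python/NumPy sketch (a lower bound that improves as the grid is refined; the grid size is illustrative):

    import numpy as np

    def numerical_radius(A, npoints=720):
        # Discretized estimate of r(A): maximize, over theta, the largest
        # eigenvalue of the Hermitian part of e^{i theta} A.
        A = np.asarray(A, dtype=complex)
        thetas = np.linspace(0.0, 2.0 * np.pi, npoints, endpoint=False)
        return max(np.linalg.eigvalsh((np.exp(1j * t) * A
                                       + np.exp(-1j * t) * A.conj().T) / 2.0)[-1]
                   for t in thetas)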
The Jordan canonical form can also be used to bound the norms of the
powers of a defective matrix. If XJX^{-1} is the Jordan canonical form of δ^{-1}A
then
(17.5)
for all δ > 0. This is a special case of a result of Ostrowski [812, 1973,
Thm. 20.1] and the proof is straightforward: we can write δ^{-1}A = X(δ^{-1}D +
M)X^{-1}, where D = diag(λ_i) and M is the off-diagonal part of the Jordan
For the Frobenius norm, Henrici shows that ||N||F is independent of the par-
ticular Schur form and that
László [690, 1994] has recently shown that ∆F(A) is within a constant factor
of the distance from A to the nearest normal matrix:
(17.7)
Empirical evidence suggests that the first bound in (17.7) can be very pes-
simistic. However, for normal matrices both the bounds are equalities.
Another bound involving nonnormality is given by Golub and Van Loan [470,
1989, Lem. 7.3.2]. They show that, in the above notation,
for any θ > 0. This bound is an analogue of (17.5) with the Schur form
replacing the Jordan form. Again, there is equality when A is normal (if we
set θ = 0).
To compare bounds based on the Schur form with ones based on the Jordan
form we need to compare ∆(A) with κ(X). If A is diagonalizable then [710,
1969, Thm. 4]
(17.8)
where the
(17.9)
Their bound is
where e < a_m := (1 + 1/m)^{m+1} < 4. Note that d(A) < 1 when ρ(A) < 1, as
is easily seen from the Schur decomposition. The distance d(A) is not easy to
compute. One approach is a bisection technique of Byers [175, 1988].
Finally, we mention that the Kreiss matrix theorem provides a good esti-
mate of sup_{k≥0} ||A^k|| for a general A ∈ ℂ^{n×n}, albeit in terms of an expression
that involves the resolvent and is not easy to compute:
where φ(A) = sup{ (|z| - 1)||(zI - A)^{-1}||_2 : |z| > 1 } and e = exp(1). Details
and references are given by Wegert and Trefethen [1071, 1994].
(17.12)
Theorem 17.1 (Higham and Knight). Let A ∈ ℂ^{n×n} with the Jordan form
(17.1) have spectral radius ρ(A) < 1. A sufficient condition for fl(A^m) → 0
as m → ∞ is
(17.13)
where t = max_i n_i.
Proof. It is easy to see that if we can find a nonsingular matrix S such
that
(17.14)
for all i, then the product
Now consider the matrix Its ith diagonal block is of the form
where the only nonzeros in N are 1s on the first super-
diagonal, and so
that is, if
We saw in the last section that the powers of A can be bounded in terms
of the pseudospectral radius. Can the pseudospectrum provide information
about the behaviour of the computed powers? Figure 17.5 shows approxi-
mations to the for the matrices used in Figure 17.4, where
the (computed) eigenvalues are plotted as crosses “×”. We see
that the pseudospectrum lies inside the unit disc precisely when the powers
converge to zero.
A heuristic argument based on (17.10) and (17.11) suggests that, if for ran-
domly chosen perturbations ΔA_i with ||ΔA_i|| ≤ c_n u||A||, most of the eigen-
values of the perturbed matrices lie outside the unit disc, then we can expect
a high percentage of the terms A + ΔA_i in (17.10) to have spectral radius
bigger than 1 and hence we can expect the product to diverge. On the other
hand, if the c_n u||A||-pseudospectrum is wholly contained within the unit disc,
each A + ΔA_i will have spectral radius less than 1 and the product can be
expected to converge. (Note, however, that if ρ(A) < 1 and ρ(B) < 1 it is not
necessarily the case that ρ(AB) < 1.) To make this heuristic precise, we need
an analogue of Theorem 17.1 phrased in terms of the pseudospectrum rather
than the Jordan form.
Proof. It can be shown (see [565, 1995]) that the conditions on ||X|| 1 and
imply there is a perturbation à = A + ∆A of A with ||∆ A|| 2 =
such that
Using Theorem 17.1 we have the required result for cn = 4 n 2 (n + 2), since
t = 1.
Suppose we compute the eigenvalues of A by a backward stable algorithm,
that is, one that yields the exact eigenvalues of A + E, where ||E||_2 ≤ c_n u||A||_2,
with c_n a modest constant. (An example of such an algorithm is the QR
algorithm [470, 1989, §7.5].) Then the computed spectral radius satisfies
In view of Theorem 17.2 we can formulate a rule of thumb, one
that bears a pleasing symmetry with the theoretical condition for convergence:
and we observed convergence of the computed powers for C_8 and C_13 + 0.01I
and divergence for the other matrices.
The analysis for the powers of the matrix (17.2) is modelled on that of
Stewart [953, 1994], who uses the matrix to construct a Markov chain whose
second largest eigenvalue does not correctly predict the decay of the transient.
For some results on the asymptotic behaviour of the powers of a nonneg-
ative matrix, see Friedland and Schneider [409, 1980].
Another application of matrix powering is in the scaling and squaring
method for computing the matrix exponential, which uses the identity e^A =
(e^{A/m})^m together with a Taylor or Padé approximation to e^{A/m}; see Moler
and Van Loan [775, 1978].
Problems
17.1. Let A ∈ ℂ^{n×n} be diagonalizable: A = XΛX^{-1}, Λ = diag(λ_i). Con-
struct a parametrized example to show that the bound
can be arbitrarily weak.
17.2. Show that ρ(|A|)/ρ(A) can be arbitrarily large for
17.3. (RESEARCH PROBLEM) Explain the scalloping patterns in the curves of
norms of powers of a matrix, as seen, for example, in Figure 17.4. (Consider
exact arithmetic, as the phenomenon is not rounding error dependent. )
17.4. (RESEARCH PROBLEM) Obtain a sharp sufficient condition for fl(A^k) →
0 in terms of the Schur decomposition of A ∈ ℂ^{n×n} with ρ(A) < 1.
Chapter 18
QR Factorization
Figure 18.1 illustrates this formula and makes it clear why P is sometimes
called a Householder reflector: it reflects x about the hyperplane span(v)^⊥.
Householder matrices are powerful tools for introducing zeros into vectors.
Consider the question "given x and y, can we find a Householder matrix P such
that Px = y?" Since P is orthogonal we clearly require that ||x||_2 = ||y||_2.
Now
and this last equation has the form αv = x - y for some α. But P is indepen-
dent of the scaling of v, so we can set α = 1.
With v = x - y we have
Therefore
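A minimal Python/NumPy sketch of the most common use, mapping x to a multiple of e_1 with the usual sign choice to avoid cancellation, may be helpful. Note that it uses the convention P = I - (2/v^T v)vv^T rather than a pre-normalized v; the function name is illustrative.

    import numpy as np

    def householder_vector(x):
        # Return v such that (I - 2 v v^T / (v^T v)) x = -sign(x_1) ||x||_2 e_1.
        # Assumes x is a nonzero vector; the sign choice avoids cancellation in v[0].
        v = x.astype(float).copy()
        sigma = 1.0 if x[0] >= 0 else -1.0
        v[0] += sigma * np.linalg.norm(x)
        return v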
18.2. QR Factorization
(18.2)
which shows that the matrix product can be formed as a matrix-vector prod-
uct followed by an outer product. This approach is much more efficient than
forming P explicitly and doing a matrix multiplication.
The overall cost of the Householder reduction to triangular form is 2n^2(m -
n/3) flops. The explicit formation of Q requires a further 4(m^2 n - mn^2 + n^3/3)
flops, but for many applications it suffices to leave Q in factored form.
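Putting the pieces together, here is a hedged sketch of Householder QR with Q kept in factored form (it uses householder_vector from the sketch above, assumes m ≥ n and full column rank, and is not the book's algorithm verbatim):

    import numpy as np

    def house_qr(A):
        # Reduce A (m x n, m >= n) to upper triangular form; return R and the
        # Householder vectors, which represent Q in factored form.
        R = A.astype(float).copy()
        m, n = R.shape
        vs = []
        for k in range(n):
            v = householder_vector(R[k:, k])
            beta = 2.0 / (v @ v)
            R[k:, k:] -= np.outer(v, beta * (v @ R[k:, k:]))  # matrix-vector + outer product
            vs.append(v)
        return np.triu(R[:n, :]), vs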
the exact one and the computed update is the exact update of a tiny norm-
wise perturbation of the original matrix [1089, 1965, pp. 153-162, 236], [1090,
1965]. Wilkinson also showed that the Householder QR factorization algo-
rithm is normwise backward stable [1089, p. 236]. In this section we give a
combined componentwise and normwise error analysis of Householder matrix
computations. The componentwise bounds provide extra information over
the normwise ones that is essential in certain applications (for example, the
analysis of iterative refinement).
(18.3)
where, as required for the next two results, the dimension is now m. Here, we
have introduced the generic constant
(recall Lemma 3.3), where we signify by the prime that the constant c on the
right-hand side is different from that on the left.
The next result describes the application of a Householder matrix to a
vector, and is the basis of all the subsequent analysis. In the applications of
interest P is defined as in Lemma 18.1, but we will allow P to be an arbitrary
Householder matrix. Thus v is an arbitrary, normalized vector, and the only
assumption we make is that the computed satisfies (18.3).
where P = I - vvT.
Proof. (Cf. the proof of Lemma 3.8.) We have
We have
(18.4)
(In fact, we can take G = m^{-1}ee^T, where e = [1, 1, ..., 1]^T.) In the special
case n = 1, so that A ≡ a, we have with
Proof. First, we consider the jth column of A, a_j, which undergoes the
transformations By Lemma 18.2 we have
where each ΔP_k depends on j and satisfies ||ΔP_k||_F ≤ γ_cm. Using Lemma 3.6
we obtain
(18.6)
Note that the componentwise bound for ΔA in Lemma 18.3 does not imply
the normwise one, because of the extra factor m in the componentwise bound.
This is a nuisance, because it means we have to state both bounds in this and
other analyses.
We now apply Lemma 18.3 to the computation of the QR factorization of
a matrix.
where ||ΔA||_F ≤ nγ_cm||A||_F and |ΔA| ≤ mnγ_cm G|A|, with ||G||_F = 1. The
matrix Q is given explicitly as Q = (P_n P_{n-1} ... P_1)^T, where P_k is the House-
holder matrix that corresponds to the exact application of the kth step of the
algorithm to A_k.
Proof. This is virtually a direct application of Lemma 18.3, with P_k
defined as the Householder matrix that produces zeros below the diagonal in
the kth column of the computed matrix A_k. One subtlety is that we do not
explicitly compute the lower triangular elements of R, but rather set them to
zero explicitly. However, it is easy to see that the conclusions of Lemmas 18.2
and 18.3 are still valid in these circumstances; the essential reason is that the
elements of ΔP in Lemma 18.2 that correspond to elements that are zeroed
by the Householder matrix P are forced to be zero, and hence we can set the
corresponding rows of ΔP to zero too, without compromising the bound on
||ΔP||_F.
Finally, we consider use of the QR factorization to solve a linear system.
Given a QR factorization of a nonsingular matrix A ∈ ℝ^{n×n}, a linear sys-
tem Ax = b can be solved by forming Q^T b and then solving Rx = Q^T b.
where
Premultiplying by Q yields
(18.7)
(18.8)
(18.9)
which involves only level-3 BLAS operations. The process is now repeated on
the last m – r rows of B.
When considering numerical stability, two aspects of the WY representa-
tion need investigating: its construction and its application. For the construc-
tion, we need to show that satisfies
(18.10)
(18.11)
(18.12)
Analysing this level-3 BLAS-based computation using (18.12) and the very
general assumption (12.3) on matrix multiplication (for the 2-norm), it is
straightforward to show that
(18.13)
This result shows that the computed update is an exact orthogonal update of
a perturbation of B, where the norm of the perturbation is bounded in terms
of the error constants for the level-3 BLAS.
Two conclusions can be drawn. First, algorithms that employ the WY rep-
resentation with conventional level-3 BLAS are as stable as the corresponding
point algorithms. Second, the use of fast BLAS3 for applying the updates af-
fects stability only through the constants in the backward error bounds. The
same conclusions apply to the more storage-efficient compact WY representa-
tion of Schreiber and Van Loan [905, 1989], and the variation of Puglisi [848,
1992].
and so yj = 0 if
(18.14)
Givens rotations are therefore useful for introducing zeros into a vector one
at a time. Note that there is no need to work out the angle θ, since c and s
in (18.14) are all that are needed to apply the rotation. In practice, we would
scale the computation to avoid overflow (cf. §25.8).
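A hedged sketch of one common way to compute c and s without forming x_i^2 + x_j^2 directly (conventions for the sign of r vary between texts and libraries):

    import numpy as np

    def givens(a, b):
        # Return (c, s) with [[c, s], [-s, c]] @ [a, b] = [r, 0], computed so
        # that the intermediate quantities cannot overflow when |a| and |b|
        # differ greatly in magnitude.
        if b == 0.0:
            return 1.0, 0.0
        if abs(b) > abs(a):
            t = a / b
            s = 1.0 / np.sqrt(1.0 + t * t)
            return t * s, s
        t = b / a
        c = 1.0 / np.sqrt(1.0 + t * t)
        return c, t * c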
To compute the QR factorization, Givens rotations are used to eliminate
the elements below the diagonal in a systematic fashion. Various choices and
orderings of rotations can be used; a natural one is illustrated as follows for a
generic 4 × 3 matrix:
(18.15)
where
where G_ij is an exact Givens rotation based on c and s in (18.15). All the
rows of ΔG_ij except the ith and jth are zero.
so that We take
For the next result we need the notion of disjoint Givens rotations. Rota-
tions G_{i_1,j_1}, ..., G_{i_r,j_r} are disjoint if i_s ≠ i_t, j_s ≠ j_t, and i_s ≠ j_t for s ≠ t. Disjoint ro-
tations are "nonconflicting" and therefore commute; it matters neither math-
ematically nor numerically in which order the rotations are applied. (Disjoint
rotations can therefore be applied in parallel, though that is not our inter-
est here. ) Our approach is to take a given sequence of rotations and reorder
them into groups of disjoint rotations. The reordered algorithm is numerically
equivalent to the original one, but allows a simpler error analysis.
As an example of a rotation sequence already ordered into disjoint groups,
consider the following sequence and ordering illustrated for a 6 × 5 matrix:
Here, an integer k in position (i, j) denotes that the (i, j) element is eliminated
on the kth step by a rotation in the (j, i) plane, and all rotations on the kth
step are disjoint. For an m × n matrix with m > n there are r = m + n — 2
stages, and the Givens QR factorization can be written as Wr Wr-1 . . . W1 A =
R, where each Wi is a product of at most n disjoint rotations. It is easy to see
that an analogous grouping into disjoint rotations can be done for the scheme
illustrated at the start of this section.
(In fact, we can take G = m^{-1}ee^T, where e = [1, 1, ..., 1]^T.) In the special
case n = 1, so that A = a, we have â^{(r+1)} = (Q + ΔQ)^T a with ||ΔQ||_F ≤ γ_cr.
where each ∆Wk depends on j and satisfies ||∆ Wk|| 2 < Using Lemma 3.6
we obtain
(18.16)
Hence
where
Hence we can compute Q and R a column at a time. To ensure that r_jj > 0
we require that A has full rank.
for j = 1:n
for i = 1:j –1
end
Cost: 2mn^2 flops (2n^3/3 flops more than Householder QR factorization
with Q left in factored form).
In the classical Gram-Schmidt method (CGS), a_j appears in the compu-
tation only at the jth stage. The method can be rearranged so that as soon as
q_j is computed, all the remaining vectors are orthogonalized against q_j. This
gives the modified Gram-Schmidt method (MGS).
end
end
It is worth noting that there are two differences between the CGS and
MGS methods. The first is the order in which the calculations are performed:
in the modified method each remaining vector is updated once on each step
instead of having all its updates done together on one step. This is purely a
matter of the order in which the operations are performed. Second, and more
crucially in finite precision computation, two different (but mathematically
equivalent) formulae for r_kj are used: in the classical method, r_kj = q_k^T a_j,
which involves the original vector a_j, whereas in the modified method a_j is
replaced in this formula by the partially orthogonalized vector a_j^{(k)}. Another
way to view the difference between the two Gram-Schmidt methods is via
representations of an orthogonal projection; see Problem 18.7.
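A side-by-side Python/NumPy sketch of the two methods may make the distinction concrete (full-rank A is assumed and the code is illustrative, not the book's algorithms); on ill-conditioned matrices ||Q^T Q - I|| is typically much larger for CGS than for MGS.

    import numpy as np

    def cgs(A):
        m, n = A.shape
        Q, R = np.zeros((m, n)), np.zeros((n, n))
        for j in range(n):
            R[:j, j] = Q[:, :j].T @ A[:, j]          # r_kj computed from the original a_j
            v = A[:, j] - Q[:, :j] @ R[:j, j]
            R[j, j] = np.linalg.norm(v)
            Q[:, j] = v / R[j, j]
        return Q, R

    def mgs(A):
        Q = A.astype(float).copy()
        n = Q.shape[1]
        R = np.zeros((n, n))
        for k in range(n):
            R[k, k] = np.linalg.norm(Q[:, k])
            Q[:, k] /= R[k, k]
            R[k, k+1:] = Q[:, k] @ Q[:, k+1:]        # r_kj from partially orthogonalized vectors
            Q[:, k+1:] -= np.outer(Q[:, k], R[k, k+1:])
        return Q, R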
The MGS procedure can be expressed in matrix terms by defining A_k =
[q_1, ..., q_{k-1}, a_k^{(k)}, ..., a_n^{(k)}]. MGS transforms A_1 = A into A_{n+1} = Q by the
sequence of transformations A_k = A_{k+1}R_k, where R_k is equal to the identity
except in the kth row, where it agrees with the final R. For example, if n = 4
and k = 3,
Thus R = R_n ... R_1.
The Gram-Schmidt methods produce Q explicitly, unlike the Householder
and Givens methods, which hold Q in factored form. While this is a benefit,
in that no extra work is required to form Q, it is also a weakness, because
there is nothing in the methods to force the computed Q to be orthonormal in
the face of roundoff. Orthonormality of Q is a consequence of orthogonality
relations that are implicit in the methods, and these relations may be vitiated
by rounding errors.
Some insight is provided by the case n = 2, for which the CGS and MGS
methods are identical. Given a_1, a_2 we compute q_1 = a_1/||a_1||_2, which
we will suppose is done exactly, and then we form the unnormalized vector
q_2 = a_2 - (q_1^T a_2)q_1. The computed vector satisfies
where
Hence
(18.18)
(18.19)
and that the multiplication A 2 = carries out the first step of the MGS
method on A, producing the first row of R and
(18.20)
(18.23)
Proof. To prove (18.21) we use the matrix form of the MGS method. For
the computed matrices we have
Hence
(18.24)
(18.25)
where Sk-1 agrees with in its first k – 1 rows and the identity in its last
n – k + 1 rows. Assume for simplicity that (this does not affect the
final result). We have and Lemma 3.8 shows that the
computed vector satisfies
provided that
To prove the last two parts of the theorem we exploit the Householder–
MGS connection. By applying Theorem 18.4 to (18.19) we find that there is
an orthogonal such that
(18.26)
with
This does not directly yield (18.23), since is not orthonormal. However,
it can be shown that if we define Q to be the nearest orthonormal matrix
to in the Frobenius norm, then (18.23) holds with c3 = (see
Problem 18.11).
Now (18.21) and (18.23) yield
(18.27)
(18.28)
where c_n is a constant. Here, and throughout this section, we use the "econ-
omy size" QR factorization with R a square matrix normalized to have nonneg-
ative diagonal elements. Similar normwise bounds are given by Stewart [951,
1993] and Sun [971, 1991], and, for ΔQ only, by Bhatia and Mukherjea [95,
1994] and Sun [975, 1995].
Componentwise sensitivity analyses have been given by Zha [1126, 1993]
and Sun [972, 1992], [973, 1992]. Zha’s bounds can be summarized as follows,
with the same assumptions and notation as for Stewart’s result above. Let
|∆A| < where G is nonnegative with Then, for sufficiently
small
(18.29)
and
Since φ(A) ≈ 2.8 × 10^8 and φ(B) ≈ 3.8, the actual errors match the bound
(18.30) very well. Note that κ_2(A) ≈ 2.8 × 10^8 and κ_2(B) ≈ 2.1 × 10^8, so
the normwise perturbation bound (18.28) is not strong enough to predict the
difference in accuracy of the computed Q factors, unless we scale the factor
out of the last column of B to leave a well-conditioned matrix. The virtue
of the componentwise analysis is that it does not require a judicious scaling
in order to yield useful results.
Schreiber and Parlett develop theory and algorithms for block reflectors, in
both of which the polar decomposition plays a key role.
Sun and Bischof [977, 1995] show that any orthogonal matrix can be ex-
pressed in the form Q = I – YSYT, even with S triangular, and they explore
the properties of this representation.
Another important use of Householder matrices, besides computation of
the QR factorization, is to reduce a matrix to a simpler form prior to itera-
tive computation of eigenvalues (Hessenberg or tridiagonal form) or singular
values (bidiagonal form). For these two-sided transformations an analogue of
Lemma 18.3 holds with normwise bounds (only) on the perturbation. Error
analyses of two-sided application of Householder transformations are given by
Ortega [811, 1963] and Wilkinson [1086, 1962], [1089, 1965, Chap. 6].
Mixed precision iterative refinement for solution of linear systems by House-
holder QR factorization is discussed by Wilkinson [1090, 1965, §10], who notes
that convergence is obtained as long as a condition of the form c_n κ_2(A)u < 1
holds.
Fast Givens rotations can be applied to a matrix with half the number
of multiplications of conventional Givens rotations, and they do not involve
square roots. They were developed by Gentleman [435, 1973] and Hammar-
ling [498, 1974]. Fast Givens rotations are as stable as conventional ones-see
the error analysis by Parlett in [820, 19 8 0 , §6.8.3], for example-but, for
the original formulations, careful monitoring is required to avoid overflow.
Rath [861, 1982] investigates the use of fast Givens rotations for performing
similarity transformations in solving the eigenproblem. Barlow and Ipsen [65,
1987] propose a class of scaled Givens rotations suitable for implementation
on systolic arrays, and they give a detailed error analysis. Anda and Park [16,
1994] develop fast rotation algorithms that use dynamic scaling to avoid over-
flow.
Rice [870, 1966] was the first to point out that the MGS method produces
a more nearly orthonormal matrix than the CGS method in the presence
of rounding errors. Björck [107, 1967] gives a detailed error analysis, proving
(18.21) and (18.22) but not (18.23), which is an extension of the corresponding
normwise result of Björck and Paige [119, 1992]. Björck and Paige give a
detailed assessment of MGS versus Householder QR factorization.
The difference between the CGS and MGS methods is indeed subtle.
Wilkinson [1095, 1971] admitted that “I used the modified process for many
years without even noticing explicitly that I was not performing the classical
algorithm.”
The orthonormality of the matrix from Gram-Schmidt can be improved
by reorthogonalization, in which the orthogonalization step of the classical
or modified method is iterated. Analyses of Gram–Schmidt with reorthog-
onalization are given by Abdelmalek [2, 1971], Ruhe [883, 1983], and Hoff-
mann [578, 1989]. Daniel, Gragg, Kaufman, and Stewart [263, 1976] analyse
the use of classical Gram–Schmidt with reorthogonalization for updating a
QR factorization after a rank one change to the matrix.
The mathematical and numerical equivalence of the MGS method with
Householder QR factorization of the augmented matrix was known in the 1960s
(see the Björck quotation at the start of the chapter) and the mathematical
equivalence was pointed out by Lawson and Hanson [695, 1974, Ex. 19.39].
A block Gram-Schmidt method is developed by Jalby and Philippe [608,
1991 ] and error analysis given. See also Björck [115, 1994 ], who gives an
up-to-date survey of numerical aspects of the Gram–Schmidt method.
For more on Gram-Schmidt methods, including historical comments, see
Björck [116, 1996].
One use of the QR factorization is to orthogonalize a matrix that, because
of rounding or truncation errors, has lost its orthogonality; thus we compute
A = QR and replace A by Q. An alternative approach is to replace A ∈ ℝ^{m×n}
(m > n) by the nearest orthonormal matrix, that is, the matrix Q that solves
min{ ||A - Q|| : Q^T Q = I }. For the 2- and Frobenius norms, the optimal
18.9.1. LAPACK
LAPACK contains a rich selection of routines for computing and manipulat-
ing the QR factorization and its variants. Routine xGEQRF computes the QR
factorization A = QR of an m × n matrix A by the Householder QR algo-
rithm. If m < n (which we ruled out in our analysis, merely to simplify the
notation), the factorization takes the form A = Q[ R 1 R2], where R1 is m × m
upper triangular. The matrix Q is represented as a product of Householder
transformations and is not formed explicitly. A routine xORGQR (or xUNGQR in
the complex case) is provided to form all or part of Q, and routine xORMQR (or
xUNMQR ) will pre- or postmultiply a matrix by Q or its (conjugate) transpose.
Routine xGEQPF computes the QR factorization with column pivoting (see
Problem 18.5).
An LQ factorization is computed by xGELQF . When A is m × n with m < n
Problems
18.1. Find the eigenvalues of a Householder matrix and a Givens matrix.
18.2. Let where and are the computed quantities described
in Lemma 18.1. Derive a bound for
18.3. A complex Householder matrix has the form
so that, in particular, |r11| ≥ |r22| ≥ · · · ≥ |rnn|. (These are the same
equations as (10.13), which hold for the Cholesky factorization with complete
pivoting—why?)
18.6. Let be a product of disjoint Givens rotations. Show that
18.7. The CGS method and the MGS method applied to ( m > n)
compute a QR factorization A = QR, Define the orthogonal
projection Pi = where qi = Q(:, i). Show that
which is a matrix of the form discussed by Läuchli [692, 1961]. Assuming
that evaluate the Q matrices produced by the CGS and MGS
methods and assess their orthonormality.
18.10. Show that the matrix P in (18.19) has the form
where both P11 and P21 have at least as many rows as columns, show that
there exists an orthonormal Q such that A + ∆A = QR, where
18.13. (Higham [557, 1994]) Let A (m ≥ n) have the polar decom-
position A = UH. Show that
This result shows that the two measures of orthonormality ||A^TA − I||_2 and
||A − U||_2 are essentially equivalent (cf. (18.22)).
Chapter 19
The Least Squares Problem
(19.1)
(19.2)
Proof. We defer a proof until 19.8, since the techniques used in the proof
are not needed elsewhere in this chapter.
The bound (19.1) is usually interpreted as saying that the sensitivity of the
LS problem is measured by κ_2(A) when the residual is small or zero and by
κ_2(A)^2 otherwise. This means that the sensitivity of the LS problem depends
strongly on b as well as A, unlike for a square linear system.
Here is a simple example where the κ_2(A)^2 effect is seen:
(19.3)
which is simply another way of writing the normal equations, ATAx = ATb.
This is a square nonsingular system, so standard techniques can be applied.
The perturbed system of interest is
(19.4)
(19.6)
(19.7)
(19.8)
(Note that ||I – AA+||2 = min{1, m – n}, as is easily proved using the sin-
gular value decomposition (SVD).) On taking norms we obtain the desired
perturbation result.
(19.9)
(19.10)
then
Theorem 19.3. Let A (m ≥ n) have full rank and suppose the
LS problem minx ||b – Ax||2 is solved using the Householder QR factorization
method. The computed solution is the exact LS solution to
where ||G||F = 1.
As for Theorem 18.5 (see (18.7)), Theorem 19.3 remains true if we set
∆ b = 0, but in general there is no advantage to restricting the perturbations
to A.
Theorem 19.3 is a strong result, but it does not bound the residual of the
computed solution, which, after all, is what we are trying to minimize. How
close, then, is ||b − Ax̂||_2 to min_x ||b − Ax||_2? We can answer this question using
the perturbation theory of §19.1. With r̂ := b + ∆b − (A + ∆A)x̂, x := x_LS,
and r := b − Ax, (19.6) yields
so that
Substituting the bounds for ∆A and ∆b from Theorem 19.3, and noting that
||AA+||2 = 1, we obtain
This bound contains two parts. The term mnγ_{cm} || |b| + |A||x̂| ||_2 is a multiple
of the bound for the error in evaluating fl(b − Ax̂), and so is to be expected.
The factor 1 + mnγ_{cm} cond_2(A^T) will be less than 1.1 (say) provided that
cond_2(A^T) is not too large. Note that cond_2(A^T) ≤ nκ_2(A) and cond_2(A^T) is
invariant under column scaling of A (A ← A diag(d_i), d_i ≠ 0). The conclusion
is that, unless A is very ill conditioned, the residual b − Ax̂ will not exceed
the larger of the true residual r = b − Ax and a constant multiple of the error
in evaluating fl(r), which is a very satisfactory result.
We have
Since q_{n+1} is orthogonal to the columns of Q_1, ||b − Ax||_2^2 = ||Rx − z||_2^2 + ρ^2,
so the LS solution is x = R^{−1}z. Of course, z = Q_1^Tb, but z is now computed
as part of the MGS procedure instead of as a product between Q_1^T and b.
Björck [107, 1967] shows that this algorithm is forward stable, in the
sense that the forward error ||x − x̂|| is as small as that for a norm-
wise backward stable method. It has recently been shown by Björck and
Paige [119, 1992] that the algorithm is, in fact, normwise backward stable
(see also Björck [115, 1994]), that is, a normwise result of the form in Theo-
rem 19.3 holds. Moreover, a componentwise result of the form in Theorem 19.3
holds too (see Problem 19.5). Hence the possible lack of orthonormality of Q̂
does not impair the stability of the MGS method as a means for solving the
LS problem.
for which
By Theorems 10.3 and 10.4, the computed Cholesky factor and solution
satisfy
(19.11)
Overall,
(19.12)
(19.13a)
(19.13b)
These bounds show that we have solved the normal equations in a backward
stable way, as long as ||A||_2||b||_2 ≈ ||A^Tb||_2. But if we try to translate this
result into a backward error result for the LS problem itself, we find that
the best backward error bound contains a factor κ_2(A) [569, 1987]. The best
forward error bound we can expect, in view of (19.13), is of the form
(19.14)
(since κ_2(A^TA) = κ_2(A)^2). Now we know from Theorem 19.1 that the sen-
sitivity of the LS problem is measured by κ_2(A)^2 if the residual is large, but
by κ_2(A) if the residual is small. It follows that the normal equations method
has a forward error bound that can be much larger than that possessed by a
backward stable method.
A mitigating factor for the normal equations method is that, in view of
Theorem 10.6, we can replace (19.14) by the (not entirely rigorous) bound
Hence the normal equations method is, to some extent, insensitive to poor
column scaling of A.
Although numerical analysts almost invariably solve the full rank LS prob-
lem by QR factorization, statisticians frequently use the normal equations
(though perhaps less frequently than they used to, thanks to the influence of
numerical analysts). The normal equations do have a useful role to play. In
many statistical problems the regression matrix is contaminated by errors of
measurement that are very large relative to the roundoff level; the effects of
rounding errors are then likely to be insignificant compared with the effects
of the measurement errors, especially if IEEE double precision (as opposed to
single precision) arithmetic is used.
The normal equations (NE) versus (Householder) QR factorization debate
can be summed up as follows.
● The two methods have a similar computational cost if m ≈ n, but the
NE method is up to twice as fast for m >> n. (This statement assumes
that A and b are dense; for details of the storage requirements and
computational cost of each method for sparse matrices, see, for example,
Björck [116, 1996] and Heath [510, 1984].)
● The QR method is always backward stable. The NE method is guaran-
teed to be backward stable only if A is well conditioned.
● The forward error from the NE method can be expected to exceed that
for the QR method when A is ill conditioned and the residual of the LS
problem is small.
● The QR method lends itself to iterative refinement, as described in the
next section. Iterative refinement can be applied to the NE method, but
the rate of convergence inevitably depends on κ_2(A)^2 instead of κ_2(A).
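To make the comparison in the list above concrete, here is a small NumPy experiment (entirely illustrative: the construction of A, its condition number, and the consistent right-hand side are my choices). With a small residual and κ_2(A) ≈ 10^6, the normal equations solution typically loses roughly twice as many digits as the QR solution.

import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 10
# Construct A with singular values 1, ..., 1e-6, so kappa_2(A) ~ 1e6.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T
x_true = rng.standard_normal(n)
b = A @ x_true                       # consistent system, so the residual is tiny

# Normal equations: form A^T A and solve via its Cholesky factorization.
L = np.linalg.cholesky(A.T @ A)
x_ne = np.linalg.solve(L.T, np.linalg.solve(L, A.T @ b))

# Householder QR factorization followed by a triangular solve.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

print("NE forward error:", np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true))
print("QR forward error:", np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true))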
1. Compute r = b − Aŷ.
2. Solve the LS problem min_d ||Ad − r||_2.
3. Update y = ŷ + d.
(Repeat from step 1 if necessary, with ŷ replaced by y.)
This scheme is certainly efficient: a computed QR factorization (for example)
of A can be reused at each step 2. Golub and Wilkinson [466, 1966] inves-
tigated this scheme and found that it works well only for nearly consistent
systems.
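A literal NumPy transcription of steps 1–3 above (a sketch under my own naming; the number of steps and the working-precision residual are illustrative and, as just noted, little improvement should be expected unless the system is nearly consistent).

import numpy as np

def ls_iterative_refinement(A, b, steps=3):
    # One QR factorization of A is computed and then reused at each step 2.
    Q, R = np.linalg.qr(A)
    y = np.linalg.solve(R, Q.T @ b)       # initial LS solution
    for _ in range(steps):
        r = b - A @ y                     # step 1: residual (working precision)
        d = np.linalg.solve(R, Q.T @ r)   # step 2: LS solution of min ||A d - r||_2
        y = y + d                         # step 3: update
    return y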
An alternative approach suggested by Björck [106, 1967] is to apply iter-
ative refinement to the augmented system (19.3), so that both x and r are
refined simultaneously. Since this is a square, nonsingular system, existing re-
sults on the convergence and stability of iterative refinement can be applied,
and we would expect the scheme to work well. To make precise statements
we need to examine the augmented system method in detail.
For the refinement steps we need to consider an augmented system with
an arbitrary right-hand side:
r + Ax = f, (19.15a)
ATr = g. (19.15b)
h = R^{−T}g,
x = R^{−1}(d_1 − h).
The next result describes the effect of rounding errors on the solution process.
Theorem 19.4. Let A be of full rank n ≤ m and suppose the aug-
mented system (19.15) is solved using a Householder QR factorization of A
as described above. The computed x̂ and r̂ satisfy
where
(19.16)
In view of the Oettli–Prager result (Theorem 7.3) this inequality tells us that,
asymptotically, the solution produced after one step of fixed precision it-
erative refinement has a small componentwise relative backward error with
respect to the augmented system. However, this backward error allows the
two occurrences of A in the augmented system coefficient matrix to be per-
turbed differently, and thus is not a true componentwise backward error for
the LS problem. Nevertheless, the result tells us that iterative refinement can
be expected to produce some improvement in stability. Note that the bound in
Theorem 19.2 continues to hold if we perturb the two occurrences of A in the
augmented system differently. Therefore the bound is applicable to iterative
refinement (with E = |A|, f = |b|), and so we can expect iterative refine-
ment to mitigate the effects of poor row and column scaling of A. Numerical
experiments show that these predictions are borne out in practice [549, 1991].
Turning to mixed precision iterative refinement, we would like to apply
the analysis of §11.1, with "Ax = b" again identified with the augmented
system. However, the analysis of §11.1 requires a backward error result in
which only the coefficient matrix is perturbed (see (11.1)). This causes no
difficulties because from Theorem 19.4 we can deduce a normwise result (cf.
Problem 7.7):
The theory of §11.1 then tells us that mixed precision iterative refinement will
converge as long as the condition number of the augmented system matrix is
not too large and that the rate of convergence depends on this condition
number. How does the condition number of the augmented system relate to
that of A? Consider the matrix that results from the scaling
(α > 0):
(19.17)
(19.18)
( 0’, m - n times,
(19.19)
with min_α κ_2(C(α)) being achieved for α =      (see Problem 19.7). Hence
C(α) may be much more ill conditioned than A. However, in our analysis
R^TRx = A^Tb,
which are called the seminormal equations. The solution x can be determined
from these equations without the use of Q. Since the cross product matrix
ATA is not formed explicitly and R is determined stably via a QR factor-
ization, we might expect this approach to be more stable than the normal
equations method.
Björck [110, 1987] has done a detailed error analysis of the seminormal
equations (SNE) method, under the assumption that R is computed by a
backward stable method. His forward error bound for the SNE method is of
the same form as that for the normal equations method, involving a factor
κ_2(A)^2. Thus the SNE method is not backward stable. Björck considers
applying one step of fixed precision iterative refinement to the SNE method,
and he calls the resulting process the corrected seminormal equations (CSNE)
method:
R^TRx = A^Tb
r = b − Ax
R^TRw = A^Tr
y = x + w
Hence, if κ_2(A)^2u < 1, the CSNE method has a forward error bound similar
to that for a backward stable method, and the bound is actually smaller than
that for the QR method if κ_2(A)^2u << 1 and r is small. However, the CSNE
method is not backward stable for all A.
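A NumPy sketch of the CSNE steps listed above, assuming R is obtained from a backward stable QR factorization of A (the routine name is mine).

import numpy as np

def csne(A, b):
    # Corrected seminormal equations: solve R^T R x = A^T b using R from a
    # QR factorization of A, then apply one step of fixed precision refinement.
    R = np.linalg.qr(A, mode='r')                              # A = QR; Q is not needed
    x = np.linalg.solve(R, np.linalg.solve(R.T, A.T @ b))      # R^T R x = A^T b
    r = b - A @ x
    w = np.linalg.solve(R, np.linalg.solve(R.T, A.T @ r))      # R^T R w = A^T r
    return x + w                                               # y = x + w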
(19.20)
is given by
where
(19.21a)
(19.21b)
∆A = G_1∆A_1 + G_2∆A_2,    (19.22)
such that
Note that (19.22) implies a bound stronger than just ||∆A||_2 ≤ ||∆A_1||_2 + ||∆A_2||_2:
p = 2, F.
Turning to componentwise backward error, the simplest approach is to ap-
ply the componentwise backward error to the augmented system (19.3),
setting
so as not to perturb the diagonal blocks I and O of the augmented system co-
efficient matrix. However, this approach allows A and AT to undergo different
perturbations ∆A1 and ∆A2 with and thus does not give a true
backward error, and Lemma 19.6 is of no help. This problem can be over-
come by using a structured componentwise backward error to force symmetry
of the perturbations; see Higham and Higham [527, 1992] for details. One
problem remains: as far as the backward error of y is concerned, the vector r
in the augmented system is a vector of free parameters, so to obtain the true
componentwise backward error we have to minimize the structure-preserving
componentwise backward error over all r. This is a nonlinear optimization
problem to which no closed-form solution is known. Experiments show that
when y is a computed LS solution, r = b – Ay is often a good approximation
to the minimizing r [527, 1992], [549, 1991].
that is,
Proof. We have
The result then follows from the (nontrivial) equality ||P_A(I − P_B)||_2 = ||P_B(I − P_A)||_2;
see, for example, Stewart [943, 1977, Thm. 2.3] or Stewart and Sun
[954, 1990, Lem. 3.3.5], where proofs that use the CS decomposition are
given.
Proof of Theorem 19.1. Let B := A + ∆A. We have, with r = b − Ax,
(19.25)
Similarly,
(19.26)
The bound for ||x – y||2 /||x||2 is obtained by using inequalities (19.25) and
(19.26) in (19.23).
Turning to the residual, using (19.23) we find that
s − r = ∆b + B(x − y) − ∆Ax
     = ∆b − BB^+(r − ∆Ax + ∆b) − ∆Ax
     = (I − BB^+)(∆b − ∆Ax) − BB^+r.
Since ||I − BB^+||_2 = min{1, m − n},
Hence
Wei [1072, 1990] gives a normwise perturbation result for the LS problem
with a rank deficient A that allows rank(A + ∆A) > rank(A).
Componentwise perturbation bounds of the form in Theorem 19.2 were
first derived by Björck in 1988 and variations have been given by Arioli, Duff,
and de Rijk [25, 1989], Björck [113, 1991], and Higham [542, 1990].
Higham [542, 1990] examined the famous test problem from Longley [711,
1967], a regression problem which has a notoriously ill-conditioned 16 × 7
coefficient matrix with κ_2(A) ≈ 5 × 10^9. The inequality (19.8) was found
to give tight bounds for the effect of random componentwise relative per-
turbations of the problem generated in experiments of Beaton, Rubin, and
Barone [86, 1976]. Thus componentwise perturbation bounds are potentially
useful in regression analysis as an alternative to the existing statistically based
techniques.
The tools required for a direct proof of the normwise backward error result
in Theorem 19.3 are developed in Wilkinson’s book The Algebraic Eigenvalue
Problem [1089, 1965]. Results of this form were derived informally by Golub
and Wilkinson (assuming the use of extended precision inner products) [466,
1966], stated by Wilkinson [1090, 1965, p. 93] and Stewart [941, 1973], and
proved by Lawson and Hanson [695, 1974, Chap. 16].
The idea of using QR factorization to solve the LS problem was mentioned
in passing by Householder [586, 1958]. Golub [463, 1965] worked out the
details, using Householder QR factorization, and this method is sometimes
called “Golub’s method”. In the same paper, Golub suggested the form of
iterative refinement described at the start of §19.5 (which is implemented in
a procedure by Businger and Golub [167, 1965]), and showed how to use QR
factorization to solve an LS problem with a linear constraint Bx = c.
It was Björck [106, 1967] who first recognized that iterative refinement
should be applied to the augmented system for best results, and he gave a
detailed rounding error analysis for the use of a QR factorization computed by
the Householder or MGS methods. Björck and Golub [118, 1967] give an Algol
code for computation and refinement of the solution to an LS problem with
a linear constraint; they use Householder transformations, while Björck [108,
1968] gives a similar code based on the Gram–Schmidt method. In [109, 1978],
Björck dispels some misconceptions of statisticians about (mixed precision)
iterative refinement for the LS problem; he discusses standard refinement
together with two versions of refinement based on the seminormal equations.
Error analysis for solution of the LS problem by the classical Gram-
Schmidt method with reorthogonalization is given by Abdelmalek [2, 1971],
who obtains a forward error bound as good as that for a backward stable
method.
Higham and Stewart [569, 1987] compare the normal equations method
with the QR factorization method, with emphasis on aspects relevant to re-
gression problems in statistics.
Foster [398, 1991] proposes a class of methods for solving the LS problem
that are intermediate between the normal equations method and the MGS
method, and that can be viewed as block MGS algorithms.
The most general analysis of QR factorization methods for solving the LS
and related problems is by Björck and Paige [120, 1994], who consider an
augmented system with an arbitrary right-hand side (see Problem 20.1) and
prove a number of subtle stability results.
Theorem 19.4 and the following analysis are from Higham [549, 1991].
Arioli, Duff, and de Rijk [25, 1989] investigate the application of fixed pre-
cision iterative refinement to large, sparse LS problems, taking the basic solver
to be the block LDLT factorization code MA27 [329, 1982] from the Harwell
Subroutine Library (applied to the augmented system); in particular, they
use scaling of the form (19.17). Björck [114, 1992] determines, via an error
analysis for solution of the augmented system by block LDLT factorization, a
choice of α in (19.17) that minimizes a bound on the forward error.
The idea of implementing iterative refinement with a precision that in-
creases on each iteration (see the Notes and References to Chapter 11) can be
applied to the LS problem; see Gluchowska and Smoktunowicz [453, 1990].
The use of SNE was first suggested by Kahan, in the context of iterative
refinement, as explained by Golub and Wilkinson [466, 1966].
Stewart [945, 1977] discusses the problem of finding the normwise back-
ward error for the LS problem and offers some backward perturbations that
are candidates for being of minimal norm. The problem is also discussed by
Higham [542, 1990]. Componentwise backward error for the LS problem has
been investigated by Arioli, Duff, and de Rijk [25, 1989], Björck [113, 1991],
and Higham [542, 1990].
Theorem 19.5 has been extended to the multiple right-hand side LS prob-
lem by Sun [976, 1996].
Lemma 19.6 is from a book by Kiełbasiński and Schwetlick, which has
been published in German [658, 1988] and Polish [659, 1992] editions, but
not in English. The lemma is their Lemma 8.2.11, and can be shown to be
equivalent to a result of Stewart [943, 1977, Thm. 5.3].
Other methods for solving the LS problem not considered in this chapter
include those of Peters and Wilkinson [826, 1970], Cline [215, 1973], and
Plemmons [835, 1974], all of which begin by computing an LU factorization
of the rectangular matrix A. Error bounds for these methods can be derived
using results from this chapter and Chapters 9 and 18.
In this chapter we have not specifically treated LS problems whose coeffi-
cient matrices have rows varying greatly in norm, and we have not considered
weighted LS problems minx ||D(b — Ax)||2, where D = diag(d i ). Error anal-
ysis for Householder QR factorization with column pivoting applied to badly
row-scaled problems is given by Powell and Reid [840, 1969]. Methods and
error analysis for weighted LS problems are given by Barlow [59, 1988], Bar-
low and Handy [64, 1988], Barlow and Vemulapati [66, 1992], Gulliksson and
Wedin [489, 1992], Gulliksson [487, 1994], [488, 1995], and Van Loan [1042,
1985].
Another topic not considered here is the constrained LS problem, where
x is required to satisfy linear equality and/or inequality constraints. Nu-
merical methods are described by Lötstedt [713, 1984] and Golub and Van
Loan [470, 1989, §12.1], and perturbation theory is developed by Eldén [350,
1980], Lötstedt [712, 1983], and Wedin [1070, 1985].
19.9.1. LAPACK
Driver routine xGELS solves the full rank LS problem by Householder QR
factorization. It caters for multiple right-hand sides, each of which defines a
separate LS problem. Thus, xGELS solves min_X ||B − AX||_F for a full rank A.
This routine does not return any
error bounds, and iterative refinement is not supported for LS problems in
LAPACK.
Driver routines xGELSX and xGELSS solve the rank-deficient LS problem
with multiple right-hand sides, using, respectively, a complete orthogonal fac-
torization (computed via QR factorization with column pivoting) and the
SVD.
LAPACK also contains routines for solving the linearly constrained LS
problem (xGGLSE) and a generalized form of weighted LS problem (xGGGLM).
Problems
19.1. Show that any solution to the LS problem minx ||b – Ax||2 satisfies the
normal equations A T Ax = A T b. What is the geometrical interpretation of
these equations?
19.2. Prove Theorem 19.3.
19.3. The pseudo-inverse of A can be defined as the
unique matrix X satisfying the four Moore–Penrose conditions
Chapter 20
Underdetermined Systems
(20.1)
where
If A has full rank then y_1 = R^{−T}b is uniquely determined and all solutions of
Ax = b are given by
(20.3)
(20.4)
(20.6)
For any Hölder p-norm, the bound is attainable to within a constant factor
depending on n.
to obtain
The required bound follows on using absolute value inequalities and taking
norms. That the bound is attained to within a constant factor depending on
n for Hölder p-norms is a consequence of the fact that the two vectors on the
right-hand side of (20.7) are orthogonal.
Two special cases are worth noting, for later use. We will use the equality
||I − A^+A||_2 = min{1, n − m}, which can be derived by consideration of the QR
factorization (20.1), for example. If E = |A|H, where H is a given nonnegative
matrix, and f = |b|, then we can put (20.6) in the form
(20.8)
where
cond_2(A) = || |A^+| |A| ||_2.
Note that cond_2(A) is independent of the row scaling of A (cond_2(DA) =
cond_2(A) for nonsingular diagonal D). If E = ||A||_2 e_m e_n^T and f = ||b||_2 e_m,
where e_m denotes the m-dimensional vector of 1s, then
(20.9)
The following analogue of Lemma 19.6 will be needed for the error anal-
ysis in the next section. It says that if we perturb the two occurrences of
A in the normal equations AA^Tx = b differently, then the solution of the
perturbed system is the solution of normal equations in which there is only
one perturbation of A and, moreover, this single perturbation is no larger, in
the normwise or componentwise sense, than the two perturbations we started
with.
(A + ∆A 1 ) = b, = (A + ∆A 2 ) T
Assume that 3 max(||A^+∆A_1||_2, ||A^+∆A_2||_2) < 1. Then there is a vector
and a perturbation ∆A with
∆A = ∆A_1G_1 + ∆A_2G_2,    G_1 + G_2 = I,
such that
Proof. The proof is similar to that of Lemma 19.6, but differs in some
details. If (A + ∆A_2)^T     = 0 we take ∆A = ∆A_2; otherwise, we set
Note the requirement in this definition that y be the minimum norm solution;
the usual componentwise backward error (see (7.6)) is a generally
smaller quantity. Let us say that a method is row-wise backward stable if it
produces a computed solution for which the componentwise backward error
and
||G||F = 1.
Proof. The Q method solves the triangular system R^Ty_1 = b and then
forms x =      . Assuming the use of Householder QR factorization,
from Theorem 18.4 we have that
(20.10)
We now rewrite the latter two equations in such a way that we can apply
Lemma 20.2:
and ||∆Ai ||F < mγ´||A||F, i = 1:2. The result follows on invocation of
Lemma 20.2.
Theorem 20.3 says that the Q method is row-wise backward stable. This is
not altogether surprising, since (Householder or Givens) QR factorization for
the LS problem enjoys an analogous backward stability result (Theorem 19.3),
albeit without the restriction of a minimum norm solution. Applying (20.8)
to Theorem 20.3 we obtain the forward error bound
(20.11)
The same form of forward error bound (20.11) can be derived for the SNE
method as for the Q method [292, 1993]. However, it is not possible to obtain
a result analogous to Theorem 20.3, nor even to obtain a residual bound of the
form      (which would imply that the computed solution solved a nearby
system, though it would not necessarily be the minimum norm solution). The
method of solution guarantees only that the seminormal equations themselves
have a small residual. Thus, as in the context of overdetermined LS problems,
the SNE method is not backward stable. A possible way to improve the
stability is by iterative refinement, as shown in [292, 1993].
Note that the forward error bound (20.11) is independent of the row scaling
of A, since cond_2(A) is. The bound is therefore potentially much smaller than
the bound
obtained by Paige [813, 1973] for the SNE method and by Jennings and Os-
borne [614, 1974] and Arioli and Laratta [26, 1985, Thm. 4] for the Q method.
Finally, we mention an alternative version of the Q method that is based
on the modified Gram–Schmidt (MGS) method. The obvious approach is
to compute the QR factorization A^T = QR using MGS,
solve R^Ty = b, and then form x = Qy. Since Q is provided explicitly
by the MGS method, the final stage is a full matrix–vector multiplication,
unlike for the Householder method. However, because the computed Q may
depart from orthonormality, this method is unstable in the form described.
The formation of x = Qy should instead be done as follows:
The recurrence can be written as x^{(k−1)} = x^{(k)} + y_kq_k − (q_k^Tx^{(k)})q_k, and the
last term is zero in exact arithmetic if the q_k are mutually orthogonal. In finite
precision arithmetic this correction term has the "magical" effect of making
the algorithm stable, in the sense that it satisfies essentially the same result
as the Q method in Theorem 20.3; see Björck and Paige [120, 1994].
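Here is a NumPy sketch of this stabilized MGS variant for the minimum 2-norm solution of an underdetermined full rank system (my transcription: the MGS routine, loop order, and starting vector reflect my reading of the recurrence above; the test data are illustrative).

import numpy as np

def mgs_qr(B):
    # Modified Gram-Schmidt QR factorization of B (full column rank assumed).
    m, n = B.shape
    Q = B.astype(float).copy()
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        R[k, k+1:] = Q[:, k] @ Q[:, k+1:]
        Q[:, k+1:] -= np.outer(Q[:, k], R[k, k+1:])
    return Q, R

def min_norm_mgs(A, b):
    # Minimum 2-norm solution of Ax = b, A of full rank with m < n:
    # factor A^T = QR by MGS, solve R^T y = b, then accumulate x with the
    # correction term  x <- x + y_k q_k - (q_k^T x) q_k  described above.
    Q, R = mgs_qr(A.T)
    y = np.linalg.solve(R.T, b)
    x = np.zeros(A.shape[1])
    for k in range(len(y) - 1, -1, -1):
        qk = Q[:, k]
        x = x + y[k] * qk - (qk @ x) * qk
    return x

A = np.random.default_rng(0).standard_normal((5, 8))
b = np.random.default_rng(1).standard_normal(5)
x = min_norm_mgs(A, b)
print("residual:", np.linalg.norm(A @ x - b))
print("norms:", np.linalg.norm(x), np.linalg.norm(np.linalg.lstsq(A, b, rcond=None)[0]))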
A numerical example is instructive. Take the 20 × 30 Vandermonde matrix
     where the p_i are equally spaced on [0, 1], and let b have elements
equally spaced on [0, 1]. The condition number κ_2(A) = 4.35 × 10^{14}. The
(standard) normwise backward errors in the 2-norm are shown in Table 20.1.
For A^T, the Q̂ supplied by MGS satisfies ||Q̂^TQ̂ − I||_2 = 1.41 × 10^{−3}, which
explains the instability of the "obvious" MGS solver.
with m < n.
20.4.1. LAPACK
The same routines that solve the (overdetermined) LS problem also solve un-
derdetermined systems for the solution of minimal 2-norm. Thus, xGELS solves
a full rank underdetermined system with multiple right-hand sides by the Q
method. Routines xGELSX and xGELSS solve rank-deficient problems with mul-
tiple right-hand sides, using, respectively, a complete orthogonal factorization
(computed via QR factorization with column pivoting) and the singular value
decomposition.
Problems
20.1. (Björck [114, 1992]) Show that the system
(20.12)
(20.13)
(20.14)
Chapter 21
Vandermonde Systems
(21.1)
where σk(y1, . . . , yn) denotes the sum of all distinct products of k of the argu-
ments y1 , . . . , yn (that is σ k is the kth elementary symmetric function). An
efficient way to find the wij is first to form the master polynomial
Algorithm 21.1. Given distinct scalars α0, α1, . . . . ,αn this algorithm
computes W =
basis for the polynomials on the real line. A variety of bounds for the con-
dition number of a Vandermonde matrix have been derived by Gautschi and
his co-authors. Let Vn = V(α0, α1, . . . , α n–1 ) For arbitrary distinct
points αi ,
(21.3)
with equality on the right when for all j with a fixed θ (in
particular, when αj > 0 for all j) [424, 1962], [426, 1978]. Note that the
upper and lower bounds differ by at most a factor 2 n–1. More specific bounds
are given in Table 21.1, on which we now comment.
Bound (V1) and estimate (V4) follow from (21.3). The condition number
for the harmonic points 1/(i + 1) grows faster than n!; by contrast, the con-
dition numbers of the notoriously ill-conditioned Hilbert and Pascal matrices
grow only exponentially (see 26.1 and 26.4). For any choice of points the
rate of growth is at least exponential (V2), and this rate is achieved for points
equally spaced on [0, 1]. For points equally spaced on [– 1, 1], the condition
number grows at a slower exponential rate than that for [0, 1], and the growth
rate is slower still for the zeros of the nth degree Chebyshev polynomial (V6).
For one set of points the Vandermonde matrix is perfectly conditioned: the
roots of unity, for which is unitary.
(21.4)
The second, third, and fifth columns are obtained by “differentiating” the
previous column. The transpose of a confluent Vandermonde matrix arises
in Hermite interpolation; it is nonsingular if the points corresponding to the
“nonconfluent columns” are distinct.
A Vandermonde-like matrix is defined by where pi is a
polynomial of degree i. The case of practical interest is where the pi satisfy a
three-term recurrence relation. In the rest of this chapter we will assume that
the pi do satisfy a three-term recurrence relation. A particular application is
the solution of certain discrete Chebyshev approximation problems [11, 1984].
Incorporating confluence, we obtain a confluent Vandermonde-like matrix,
defined by
P = P(α_0, α_1, . . . , α_n) = [q_0(α_0), q_1(α_1), . . . , q_n(α_n)]
where the αi are ordered so that equal points are contiguous, that is,
(21.5)
and the vectors qj(x) are defined recursively by
if j = 0 or
otherwise.
For all polynomials and points, P is nonsingular; this follows from the deriva-
tion of the algorithms below. One reason for the interest in Vandermonde-like
matrices is that for certain polynomials they tend to be better conditioned
than Vandermonde matrices (see, for example, Problem 21.5). Gautschi [427,
1983] derives bounds for the condition numbers of Vandermonde-like matrices.
Fast algorithms for solving the confluent Vandermonde-like primal and
dual systems Px = b and PTa = f can be derived under the assumption that
the p j(x) satisfy the three-term recurrence relation
(21.6a)
(21.6b)
where for all j. Note that in this chapter γi denotes a constant in the
recurrence relation and not iu/(1 – iu) as elsewhere in the book. The latter
notation is not used in this chapter.
(21.7)
satisfies
i = 0:n.
Thus ψ is a Hermite interpolating polynomial for the data {α_i, f_i}, and our
task is to obtain its representation in terms of the basis {p_j}. As a first
step we construct the divided difference form of ψ,
(21.8)
(21.9)
Now we need to generate the ai in (21.7) from the ci in (21.8). The idea is
to expand (21.8) using nested multiplication and use the recurrence relations
(21.6) to express the results as a linear combination of the pj. Define
q n (x) = cn , (21.10)
(21.11)
(21.12)
(21.13)
in which the empty summation is defined to be zero. For the special case
k = n − 1 we have
(21.14)
% Stage I:
Set c = f
for k = 0:n − 1
    clast = c_k
    for j = k + 1:n
        if α_j = α_{j−k−1} then
            c_j = c_j/(k + 1)
        else
            temp = c_j
            c_j = (c_j − clast)/(α_j − α_{j−k−1})
            clast = temp
        end
    end
end
% Stage II:
Set a = c
a_{n−1} = a_{n−1} + (β_0 − α_{n−1})a_n
a_n = a_n/θ_0
for k = n − 2: −1: 0
    for j = 1:n − k − 2
    end
    a_n = a_n/θ_{n−k−1}
end
Assuming that the values are given (note that γj appears only in
the terms the computational cost of Algorithm 21.2 is at most 9n 2 / 2
flops. The vectors c and a have been used for clarity; in fact both can be
replaced by f, so that the right-hand side is transformed into the solution
without using any extra storage.
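For the special case of the monomials and distinct (nonconfluent) points, the two stages reduce to the familiar divided-difference and Newton-to-power-basis recurrences. The NumPy sketch below is my transcription of that special case, not the confluent algorithm itself.

import numpy as np

def vandermonde_dual_monomial(alpha, f):
    # Dual system V^T a = f (polynomial interpolation) for the monomial
    # case with distinct points: stage I builds the divided differences,
    # stage II converts the Newton form to power-series coefficients.
    a = np.array(f, dtype=float)
    n = len(a) - 1
    for k in range(n):                    # stage I
        for j in range(n, k, -1):
            a[j] = (a[j] - a[j - 1]) / (alpha[j] - alpha[j - k - 1])
    for k in range(n - 1, -1, -1):        # stage II
        for j in range(k, n):
            a[j] -= alpha[k] * a[j + 1]
    return a

alpha = np.array([0.0, 1.0, 2.0, 3.0])
f = alpha**3 - 2.0 * alpha + 1.0            # values of x^3 - 2x + 1
print(vandermonde_dual_monomial(alpha, f))  # expect [1, -2, 0, 1]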
Values of θ_j, β_j, and γ_j for some polynomials of interest are collected in Ta-
ble 21.2.
The key to deriving a corresponding algorithm for solving the primal sys-
tem is to recognize that Algorithm 21.2 implicitly computes a factorization of
P^{−T} into the product of 2n triangular matrices. In the rest of this chapter
we adopt the convention that the subscripts of all vectors and matrices run
from 0 to n. In stage I, letting c^{(k)} denote the vector c at the start of the kth
iteration of the outer loop, we have
The matrix Lk is lower triangular and agrees with the identity matrix in rows
Polynomial     θ_j                β_j        γ_j
Monomials      1                  0          0
Chebyshev      2 (θ_0 = 1)        0          1
Legendre*      (2j + 1)/(j + 1)   0          j/(j + 1)
Hermite        2                  0          2j
Laguerre       −1/(j + 1)         2j + 1     j/(j + 1)
* Normalized so that p_j(1) = 1.
if αj = αj - k - 1 ,
some s < j, otherwise,
end
end
% Stage II:
Set x = d
for k = n − 1: −1: 0
    xlast = 0
    for j = n: −1: k + 1
        if α_j = α_{j−k−1} then
            x_j = x_j/(k + 1)
        else
            temp = x_j/(α_j − α_{j−k−1})
            x_j = temp − xlast
            xlast = temp
        end
    end
    x_k = x_k − xlast
end
21.3. Stability
Algorithms 21.2 and 21.3 have interesting stability properties. Depending on
the problem parameters, the algorithms can range from being very stable (in
either a backward or forward sense) to very unstable.
When the p_i are the monomials and the points α_i are distinct, the algo-
rithms reduce to those of Björck and Pereyra [121, 1970]. Björck and Pereyra
found that for the system Vx = b with α_i = 1/(i + 3), b_i = 2^{−i}, n = 9, and
on a computer with u ≈ 10^{−16},
Thus the computed solution has a tiny componentwise relative error, despite
the extreme ill condition of V. Björck and Pereyra comment “It seems as if
at least some problems connected with Vandermonde systems, which tradi-
tionally have been considered too ill-conditioned to be attacked, actually can
be solved with good precision.” This high accuracy can be explained with the
aid of the error analysis below.
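The experiment is easy to repeat. The following NumPy sketch is my transcription of the Björck–Pereyra algorithm for the primal system Vx = b in the monomial, nonconfluent case, applied to the data quoted above; the comparison with Gaussian elimination is purely illustrative.

import numpy as np

def bjorck_pereyra_primal(alpha, b):
    # Solve V x = b, where V[i, j] = alpha[j]**i is an (n+1) x (n+1)
    # Vandermonde matrix with distinct points alpha (my transcription).
    x = np.array(b, dtype=float)
    n = len(x) - 1
    for k in range(n):                       # stage I
        for i in range(n, k, -1):
            x[i] -= alpha[k] * x[i - 1]
    for k in range(n - 1, -1, -1):           # stage II
        for i in range(k + 1, n + 1):
            x[i] /= alpha[i] - alpha[i - k - 1]
        for i in range(k, n):
            x[i] -= x[i + 1]
    return x

n = 9
alpha = 1.0 / (np.arange(n + 1) + 3.0)       # alpha_i = 1/(i+3)
b = 2.0 ** (-np.arange(n + 1, dtype=float))  # b_i = 2^(-i)
V = np.vander(alpha, increasing=True).T      # V[i, j] = alpha[j]**i
x_fast = bjorck_pereyra_primal(alpha, b)
x_gepp = np.linalg.solve(V, b)               # Gaussian elimination, for comparison
print("condition number of V:", np.linalg.cond(V))
print("Bjorck-Pereyra solution:", x_fast)
print("GEPP solution          :", x_gepp)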
The analysis can be kept quite short by exploiting the interpretation of
the algorithms in terms of matrix–vector products. Because of the inherent
duality between Algorithms 21.2 and 21.3, any result for one has an analogue
for the other, so we will consider only Algorithm 21.2.
The analysis that follows is based on the model (2.4), and so is valid only
for machines with a guard digit. With the no-guard-digit model (2.6) the
(21.18)
Proof. First, note that Algorithm 21.2 must succeed in the absence of
underflow and overflow, because division by zero cannot occur.
The analysis of the computation of the c(k) vectors is exactly the same
as that for the nonconfluent divided differences in §5.3 (see (5.9) and (5.10)).
However, we obtain a slightly cleaner error bound by dropping the γ k notation
and instead writing
(21.21)
Applying Lemma 3.7 to (21.21) and using (21.17), we obtain the desired bound
for the forward error.
The product |U_0| · · · |U_{n−1}||L_{n−1}| · · · |L_0| in (21.18) is an upper bound for
|U_0 · · · U_{n−1}L_{n−1} · · · L_0| = |P^{−T}| and is equal to it when there is no subtrac-
tive cancellation in the latter product. To gain insight, suppose the points are
distinct and consider the case n = 3. We have
(21.22)
Corollary 21.5. If 0 < α0 < α1 < · · · < α n then for the monomials, or the
Chebyshev, Legendre, or Hermite polynomials,
Corollary 21.5 explains the high accuracy observed by Björck and Pereyra.
Note that if
|P^{−T}||f| ≤ t_n|P^{−T}f| = t_n|a|
then, under the conditions of the corollary,      which shows
that the componentwise relative error is bounded by c(n, u)t_n. For the prob-
lem of Björck and Pereyra it can be shown that t_n ≈ n^4/24. Another factor
contributing to the high accuracy in this problem is that many of the sub-
tractions αj – αj-k-1 = 1/(j + 3) – 1/(j – k + 2) are performed exactly, in
view of Theorem 2.5.
Note that under the conditions of the corollary P^{−T} has the alternating
sign pattern, since each of its factors does. Thus if (−1)^i f_i > 0 then t_n =
1, and the corollary implies that the computed solution is accurate essentially to full machine
precision, independent of the condition number of P. In particular, taking
f to be a column of the identity matrix shows that we can compute the inverse
of P to high relative accuracy, independent of its condition number.
21.3.2. Residual
(21.23)
From the proof of Theorem 21.4 and the relation (5.9) it is easy to show that
Strictly, an analogous bound for (U_k + ∆U_k)^{−1} does not hold, since ∆U_k
cannot be expressed in the form of a diagonal matrix times U_k. However,
it seems reasonable to make a simplifying assumption that such a bound is
valid, say
Theorem 21.6. Under the assumption (21.24), the residual of the computed
solution from Algorithm 21.2 is bounded by
(21.25)
Corollary 21.7. Let 0 < α0 < α1 < · · · < αn, and consider Algorithm 21.2
for the monomials. Under the assumption (21.24), the computed solution
satisfies
n 10 20 30 40
-1 -7 -2
2.5 x 10 2 6.3 X 10 4.7 X 10 1.8 X 103
where h_a depends only on the θ_i, and since partial pivoting maximizes the
pivot at each stage, this ordering of the α_i can be computed at the start of the
algorithm in n^2 flops (see Problem 21.8). This ordering is essentially the Leja
ordering (5.13) (the difference is that partial pivoting leaves α_0 unchanged).
To attempt to cure observed instability we can use iterative refinement
in fixed precision. Ordinarily, residual computation for linear equations is
trivial, but in this context the coefficient matrix is not given explicitly and
computing the residual turns out to be conceptually almost as difficult, and
computationally as expensive, as solving the linear system!
To compute the residual for the dual system we need a means for evalu-
ating ψ(t) in (21.7) and its first k < n derivatives, where k = max i (i – r(i))
is the order of confluence. Since the polynomials pj satisfy a three-term re-
currence relation we can use an extension of the Clenshaw recurrence formula
(which evaluates ψ but not its derivatives). The following algorithm imple-
ments the appropriate recurrences, which are given by Smith [927, 1965] (see
Problem 21.9).
Computing the residual using Algorithm 21.8 costs between 3n^2 flops (for
full confluence) and 6n^2 flops (for the nonconfluent case); recall that Algo-
rithm 21.2 costs at most 9n^2/2 flops!
The use of iterative refinement can be justified with the aid of Theo-
rem 11.3. For (confluent) Vandermonde matrices, the residuals are formed
using Horner's rule and (11.7) holds in view of (5.3) and (5.7). Hence for
standard Vandermonde matrices, Theorem 11.3 leads to an asymptotic com-
ponentwise backward stability result. A complete error analysis of Algo-
rithm 21.8 is not available for (confluent) Vandermonde-like matrices, but
it is easy to see that (11.7) will not always hold. Nevertheless it is clear that
a normwise bound can be obtained (see Oliver [805, 1977] for the special case
of the Chebyshev polynomials) and hence an asymptotic normwise stability
result can be deduced from Theorem 11.3. Thus there is theoretical backing
for the use of iterative refinement with Algorithm 21.8.
Numerical experiments using Algorithm 21.8 in conjunction with the par-
tial pivoting reordering and fixed precision iterative refinement show that both
techniques are effective means for stabilizing the algorithms, but that iterative
refinement is likely to fail once the instability is sufficiently severe. Because
of its lower cost, the reordering approach is preferable.
Two heuristics are worth noting. Consider a (confluent) Vandermonde-
like system Px = b. Heuristic 1: it is systems with a large-normed solution
that are solved to high accuracy by the fast algorithms.
To produce a large solution the algorithms must sustain little cancellation,
and the error analysis shows that cancellation is the main cause of instability.
Heuristic 2: GEPP is unable to solve accurately Vandermonde systems with
a very large-normed solution. The pivots for GEPP
will tend to satisfy      , so that the computed solution will tend
to satisfy      . A consequence of these two heuristics is that
for Vandermonde(-like) systems with a very large-normed solution the fast
algorithms will be much more accurate (but no more backward stable) than
GEPP. However, we should be suspicious of any framework in which such sys-
tems arise; although the solution vector may be obtained accurately (barring
overflow), subsequent computations with numbers of such a wide dynamic
range will probably themselves be unstable.
Problems
21.1. Derive a modified version of Algorithm 21.1 in which the scale factor
(Hint: show that T = LV for a lower triangular matrix L and use the fact
that
21.6. Show that for a nonconfluent Vandermonde-like matrix P =
where the pi satisfy (21.6),
Show that
cond(V) =
and derive a corresponding condition number for the dual system V^Ta = f.
Chapter 22
Fast Matrix Multiplication
22.1. Methods
A fast matrix multiplication method forms the product of two n × n matrices
in O(n^ω) arithmetic operations, where ω < 3. Such a method is more efficient
asymptotically than direct use of the definition
(22.1)
which requires O(n^3) operations. For over a century after the development of
matrix algebra in the 1850s by Cayley, Sylvester and others, this definition
provided the only known method for multiplying matrices. In 1967, however,
to the surprise of many, Winograd found a way to exchange half the multi-
plications for additions in the basic formula [1105, 1968]. The method rests
on the identity, for vectors of even dimension n,
(22.2)
When this identity is applied to a matrix product AB, with x a row of A and
y a column of B, the second and third summations are found to be common
to the other inner products involving that row or column, so they can be
computed once and reused. Winograd’s paper generated immediate practical
interest because on the computers of the 1960s floating point multiplication
was typically two or three times slower than floating point addition. (On
today's machines these two operations are usually similar in cost.)
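A NumPy sketch of the identity (22.2) for a single inner product (the function name is mine); the two one-vector sums are the quantities that can be precomputed and reused across an entire matrix product.

import numpy as np

def winograd_dot(x, y):
    # Winograd's inner product formula for even-length vectors:
    # pair up consecutive entries, form one cross product per pair, and
    # subtract the two sums that depend on x alone and on y alone.
    x0, x1 = x[0::2], x[1::2]
    y0, y1 = y[0::2], y[1::2]
    xi = np.dot(x0, x1)                  # depends only on x: reusable per row of A
    eta = np.dot(y0, y1)                 # depends only on y: reusable per column of B
    return np.dot(x0 + y1, x1 + y0) - xi - eta

x = np.arange(1.0, 9.0)                  # even dimension n = 8
y = np.linspace(0.5, 4.0, 8)
print(winograd_dot(x, y), np.dot(x, y))  # the two values should agree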
Shortly after Winograd's discovery, Strassen astounded the computer sci-
ence community by finding an O(n^{log_2 7}) operations method for matrix multi-
plication (log_2 7 ≈ 2.807). A variant of this technique can be used to compute
A^{−1} (see Problem 22.8) and thereby to solve Ax = b, both in O(n^{log_2 7}) op-
erations. Hence the title of Strassen's 1969 paper [962, 1969], which refers to
the question of whether Gaussian elimination is asymptotically optimal for
solving linear systems.
Strassen’s method is based on a circuitous way to form the product of a
pair of 2 x 2 matrices in 7 multiplications and 18 additions, instead of the usual
8 multiplications and 4 additions. As a means of multiplying 2 x 2 matrices the
formulae have nothing to recommend them, but they are valid more generally
for block 2 x 2 matrices. Let A and B be matrices of dimensions m x n and
n x p respectively, where all the dimensions are even, and partition each of A,
B, and C = AB into four equally sized blocks:
(22.3)
Counting the additions (A) and multiplications (M) we find that while con-
ventional multiplication requires
S1 = A21 + A22,    M1 = S2S6,      T1 = M1 + M2,
S2 = S1 − A11,     M2 = A11B11,    T2 = T1 + M4,
S3 = A11 − A21,    M3 = A12B21,
S4 = A12 − S2,     M4 = S3S7,                                (22.6)
S5 = B12 − B11,    M5 = S1S5,      C11 = M2 + M3,
S6 = B22 − S5,     M6 = S4B22,     C12 = T1 + M5 + M6,
S7 = B22 − B12,    M7 = A22S8,     C21 = T2 − M7,
S8 = S6 − B21,                     C22 = T2 + M5.
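The formulae translate directly into code. Below is a NumPy sketch of one level of recursion of (22.6) for matrices with even dimensions (the block partitioning and the test matrices are illustrative).

import numpy as np

def winograd_strassen_once(A, B):
    # One level of the Winograd variant (22.6): 7 block multiplications
    # and 15 block additions for even-dimensioned matrices.
    m, n = A.shape
    p = B.shape[1]
    A11, A12 = A[:m//2, :n//2], A[:m//2, n//2:]
    A21, A22 = A[m//2:, :n//2], A[m//2:, n//2:]
    B11, B12 = B[:n//2, :p//2], B[:n//2, p//2:]
    B21, B22 = B[n//2:, :p//2], B[n//2:, p//2:]

    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    S5 = B12 - B11; S6 = B22 - S5; S7 = B22 - B12; S8 = S6 - B21

    M1 = S2 @ S6;  M2 = A11 @ B11; M3 = A12 @ B21; M4 = S3 @ S7
    M5 = S1 @ S5;  M6 = S4 @ B22;  M7 = A22 @ S8

    T1 = M1 + M2; T2 = T1 + M4
    C11 = M2 + M3
    C12 = T1 + M5 + M6
    C21 = T2 - M7
    C22 = T2 + M5
    return np.block([[C11, C12], [C21, C22]])

A = np.random.default_rng(2).standard_normal((4, 4))
B = np.random.default_rng(3).standard_normal((4, 4))
print(np.linalg.norm(winograd_strassen_once(A, B) - A @ B))  # ~1e-16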
Until the late 1980s there was a widespread view that Strassen’s method
was of theoretical interest only, because of its supposed large overheads for
dimensions of practical interest (see, for example, [909, 1988]), and this view
is still expressed by some [842, 1992]. However, in 1970 Brent implemented
Strassen’s algorithm in Algol-W on an IBM 360/67 and concluded that in this
environment, and with just one level of recursion, the method runs faster than
the conventional method for n > 110 [142, 1970]. In 1988, Bailey compared
his Fortran implementation of Strassen’s algorithm for the Cray-2 with the
Cray library routine for matrix multiplication and observed speedup factors
ranging from 1.45 for n = 128 to 2.01 for n = 2048 (although 35% of these
speedups were due to Cray-specific techniques) [43, 1988]. These empirical
results, together with more recent experience of various researchers, show that
Strassen’s algorithm is of practical interest, even for n in the hundreds. In-
deed, Fortran codes for (Winograd's variant of) Strassen's method have been
supplied with IBM’s ESSL library [595, 1988] and Cray’s UNICOS library
[602, 1989] since the late 1980s.
Strassen's paper raised the question "what is the minimum exponent ω
such that multiplication of n × n matrices can be done in O(n^ω) operations?"
Clearly, ω ≥ 2, since each element of each matrix must partake in at least one
operation. It was 10 years before the exponent was reduced below Strassen's
log2 7 . A flurry of publications, beginning in 1978 with Pan and his expo-
nent 2.795 [815, 1978], resulted in reduction of the exponent to the current
record 2.376, obtained by Coppersmith and Winograd in 1987 [245, 1987].
Figure 22.1 plots exponent versus time of publication (not all publications are
represented in the graph); in principle, the graph should extend back to 1850!
Some of the fast multiplication methods are based on a generalization of
Strassen’s idea to bilinear forms. Let A, B A bilinear noncommuta-
(22.7a)
(22.7b)
where the elements of the matrices W, U^{(k)}, and V^{(k)} are constants. This
algorithm can be used to multiply n × n matrices A and B, where n = h^k, as
follows: partition A, B, and C into h^2 blocks Aij, Bij, and Cij of size h^{k−1},
then compute C = AB by the bilinear algorithm, with the scalars aij, bij, and
cij replaced by the corresponding matrix blocks. (The algorithm is applicable
to matrices since, by assumption, the underlying formulae do not depend on
commutativity.) To form the t products Pk of (n/h) × (n/h) matrices, partition
them into h^2 blocks of dimension n/h^2 and apply the algorithm recursively.
The total number of scalar multiplications required for the multiplication is
t^k = n^α, where α = log_h t.
Strassen's method has h = 2 and t = 7. For 3 × 3 multiplication (h = 3),
the smallest t obtained so far is 23 [683, 1976]; since log_3 23 ≈ 2.854 > log_2 7,
this does not yield any improvement over Strassen's method. The method
computes the product of two complex numbers using three real multiplications
instead of the usual four. Since the formula does not rely on commutativity
it extends to matrices. Let A = A1 + iA2 and B = B1 + iB2, where Aj and Bj
are real, and define C = C1 + iC2 = AB. Then C can be formed using three
real matrix multiplications as
T1 = A1B1,    T2 = A2B2,
C1 = T1 − T2,                                   (22.9)
C2 = (A1 + A2)(B1 + B2) − T1 − T2.
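A NumPy sketch of (22.9) (the function name is mine); it checks the 3M product against conventional complex multiplication.

import numpy as np

def mult_3m(A, B):
    # 3M method: form the complex product AB from three real matrix
    # multiplications instead of the usual four.
    A1, A2 = A.real, A.imag
    B1, B2 = B.real, B.imag
    T1 = A1 @ B1
    T2 = A2 @ B2
    C1 = T1 - T2
    C2 = (A1 + A2) @ (B1 + B2) - T1 - T2
    return C1 + 1j * C2

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
print(np.linalg.norm(mult_3m(A, B) - A @ B))   # ~1e-16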
Miller [756, 1975] shows that any polynomial algorithm for multiplying n × n
matrices that satisfies a bound of the form (22.10) (perhaps with a different
constant) must perform at least n^3 multiplications. (A polynomial algorithm
is essentially one that uses only scalar addition, subtraction, and multiplica-
tion.) Hence Strassen’s method, and all other polynomial algorithms with an
exponent less than 3, cannot satisfy (22.10). Miller also states, without proof,
that any polynomial algorithm in which the multiplications are all of the form
must satisfy a bound of the form
(22.11)
As noted in §6.2, this is not a consistent matrix norm, but we do have the
bound ||AB|| ≤ n||A|| ||B|| for n × n matrices.
(22.12)
(22.13)
The bound (22.12) for Winograd's method exceeds the bound (22.13) by a fac-
tor approximately     . Therefore Winograd's method
is stable if ||x|| and ||y|| have similar magnitude, but potentially unstable
if they differ widely in magnitude. The underlying reason for the instabil-
ity is that Winograd's method relies on cancellation of terms x_{2i−1}x_{2i} and
y_{2i−1}y_{2i} that can be much larger than the final answer x^Ty;
therefore the intermediate rounding errors can swamp
the desired inner product.
A simple way to avoid the instability is to scale x ← µx and y ← µ^{−1}y
before applying Winograd's method, where µ, which in practice might be a
power of the machine base to avoid roundoff, is chosen so that the scaled
vectors have similar norms.
When using Winograd's method for a matrix multiplication AB it suffices to
carry out a single scaling A ← µA and B ← µ^{−1}B such that ||A|| ≈ ||B||. If
A and B are scaled so that τ^{−1} ≤ ||A||/||B|| ≤ τ, then
(22.14)
Proof. We will use without comment the norm inequality ||AB|| ≤
n||A|| ||B|| = 2^k||A|| ||B||.
Assume that the computed product Ĉ of AB from Strassen's method
satisfies
Ĉ = AB + E,    ||E|| ≤ c_ku||A|| ||B|| + O(u^2),    (22.15)
where c_k is a constant. In view of (22.10), (22.15) certainly holds for n = n_0,
with c_r =     . Our aim is to verify (22.15) inductively and at the same time
to derive a recurrence for the unknown constant c_k.
Consider C11 in (22.4), and, in particular, its subterm P1. Accounting for
the errors in matrix addition and invoking (22.15), we obtain
where
|∆A| ≤ u|A11 + A22|,
|∆B| ≤ u|B11 + B22|,
||E1|| ≤ c_{k−1}u||A11 + A22 + ∆A|| ||B11 + B22 + ∆B|| + O(u^2)
      ≤ 4c_{k−1}u||A|| ||B|| + O(u^2).
Hence
P̂1 = P1 + F1,
||F1|| ≤ (8 · 2^{k−1} + 4c_{k−1})u||A|| ||B|| + O(u^2).
Similarly,
P̂4 = A22(B21 − B11 + ∆B) + E4,
where
|∆B| ≤ u|B21 − B11|,
||E4|| ≤ c_{k−1}u||A22|| ||B21 − B11 + ∆B|| + O(u^2),
which gives
P̂4 = P4 + F4,
||F4|| ≤ (2 · 2^{k−1} + 2c_{k−1})u||A|| ||B|| + O(u^2).
Now
Clearly, the same bound holds for the other three ||∆Cij|| terms. Thus, overall,
Because c22 involves subterms of order unity, the error in the computed c22 will be of
order u. Thus the relative error in c22 is much
larger than u if ε is small. This is an example where Strassen's method does
not satisfy the bound (22.10). For another example, consider the product
X = P_32 E, where P_n is the n × n Pascal matrix (see §26.4) and e_ij = 1/3.
With just one level of recursion in Strassen's method we find in MATLAB that
the componentwise relative error is of order 10^{−5}, so that, again, some elements of the
computed product have high relative error.
It is instructive to compare the bound (22.14) for Strassen’s method with
the weakened, normwise version of (22.10) for conventional multiplication:
(22.17)
The bounds (22.14) and (22.17) differ only in the constant term. For Strassen's
method, the greater the depth of recursion the bigger the constant in (22.14):
if we use just one level of recursion (n_0 = n/2) then the constant is 3n^2 +
25n, whereas with full recursion (n_0 = 1) the constant is
6n^{3.585} − 5n. It is also interesting to note that the bound for Strassen's method
(minimal for n_0 = n) is not correlated with the operation count (minimal for
n_0 = 8).
Our conclusion is that Strassen’s method has less favorable stability prop-
erties than conventional multiplication in two respects: it satisfies a weaker
error bound (normwise rather than componentwise) and it has a larger con-
stant in the bound (how much larger depending on n_0).
Another interesting property of Strassen’s method is that it always involves
some genuine subtractions (assuming that all additions are of nonzero terms).
This is easily deduced from the formulae (22.4). This makes Strassen’s method
unattractive in applications where all the elements of A and B are nonnegative
(for example, in Markov processes). Here, conventional multiplication yields
low componentwise relative error because, in (22.10), |A||B| = |AB| = |C|,
yet comparable accuracy cannot be guaranteed for Strassen’s method.
An analogue of Theorem 22.2 holds for Winograd’s variant of Strassen’s
method.
(22.18)
Proof. The proof is analogous to that of Theorem 22.2, but more tedious.
It suffices to analyse the computation of C12, and the recurrence corresponding
to (22.16) is
c_k = 18c_{k−1} + 89 · 2^{k−1},    k > r,    c_r = 4^r.
Note that the bound for the Winograd–Strassen method has exponent
log_2 18 ≈ 4.170 in place of log_2 12 ≈ 3.585 for Strassen's method, suggesting
that the price to be paid for a reduction in the number of additions is an
increased rate of error growth. All the comments above about Strassen’s
method apply to the Winograd variant.
Two further questions are suggested by the error analysis:
● How do the actual errors compare with the bounds?
● Which formulae are the more accurate in practice, Strassen's or Winograd's variant?
To give some insight we quote results obtained with a single precision For-
tran 90 implementation of the two methods (the code is easy to write if we
exploit the language’s dynamic arrays and recursive procedures). We take
random n × n matrices A and B and look at ||AB − fl(AB)||/(||A|| ||B||) for
n_0 = 1, 2, . . . , 2^k = n (note that this is not the relative error, since the de-
nominator is ||A|| ||B|| instead of ||AB||, and note that n_0 = n corresponds
to conventional multiplication). Figure 22.2 plots the results for one random
matrix of order 1024 from the uniform [0, 1] distribution and another matrix of
the same size from the uniform [−1, 1] distribution. The error bound (22.14)
for Strassen's method is also plotted. Two observations can be made.
● Winograd's variant can be more accurate than Strassen's formulae, for
all n_0, despite its larger error bound.
● The error bound overestimates the actual error by a factor up to 1.8 × 10^6
for n = 1024, but the variation of the errors with n_0 is roughly as
predicted by the bound.
Figure 22.2. Errors for Strassen's method with two random matrices of dimension
n = 1024. Strassen's formulae: "x", Winograd's variant: "o". X-axis contains log_2
of recursion threshold n_0, 1 ≤ n_0 ≤ n. Dot-dash line is error bound for Strassen's
formulae.
where α and β are constants that depend on the number of nonzero terms in
the matrices U, V and W that define the algorithm.
satisfies
(22.20)
(22.21)
(22.22)
Two notable features of the bound (22.22) are as follows. First, it is of
a different and weaker form than (22.21); in fact, it exceeds the sum of the
bounds (22.20) and (22.21). Second and more pleasing, it retains the property
of (22.20) and (22.21) of being invariant under diagonal scalings
C = AB → D_1AD_2 · D_2^{−1}BD_3 = D_1CD_3,    D_j diagonal,
in the sense that the upper bound ∆C_2 in (22.22) scales also according to
D_1∆C_2D_3. (The "hidden" second-order terms in (22.20)–(22.22) are invariant
under these diagonal scalings.)
(22.23)
(22.24)
(having used || |A1| + |A2| || < ||Al + iA2 ||). Combining these with an anal-
ogous weakening of (22.20), we find that for both conventional multiplication
and the 3M method the computed complex matrix satisfies
where cn = O(n).
The conclusion is that the 3M method produces a computed product
whose imaginary part may be contaminated by relative errors much larger
than those for conventional multiplication (or, equivalently, much larger than
can be accounted for by small componentwise perturbations in the data A
and B). However, if the errors are measured relative to ||A|| ||B||, which is a
natural quantity to use for comparison when employing matrix norms, then
they are just as small as for conventional multiplication.
It is straightforward to show that if the 3M method is implemented us-
ing Strassen’s method to form the real matrix products, then the computed
complex product satisfies the same bound (22.14) as for Strassen’s method
itself, but with an extra constant multiplier of 6 and with 4 added to the
expression inside the square brackets.
an unpublished technical report that has not been widely cited [142, 1970].
Section 22.2.2 is based on Higham [544, 1990].
According to Knuth, the 3M formula was suggested by P. Ungar in 1963
[668, 1981, p. 647]. It is analogous to a formula of Karatsuba and Ofman [643,
1963] for squaring a 2n-digit number using three squarings of n-digit num-
bers. That three multiplications (or divisions) are necessary for evaluating
the product of two complex numbers was proved by Winograd [1106, 1971].
Section 22.2.4 is based on Higham [552, 1992].
The answer to the question “What method should we use to multiply
complex matrices?” depends on the desired accuracy and speed. In a Fortran
environment an important factor affecting the speed is the relative efficiency
of real and complex arithmetic, which depends on the compiler and the com-
puter (complex arithmetic is automatically converted by the compiler into
real arithmetic). For a discussion and some statistics see [552, 1992].
The efficiency of Winograd’s method is very machine dependent. Bjørstad,
Manne, Sørevik, and Vajteršic [122, 1992] found the method useful on the
MasPar MP-1 parallel machine, on which floating point multiplication takes
about three times as long as floating point addition at 64-bit precision. They
also implemented Strassen’s method on the MP-1 (using Winograd’s method
at the bottom level of recursion) and obtained significant speedups over con-
ventional multiplication for dimensions as small as 256.
As noted in §22.1, Strassen [962, 1969] gave not only a method for multiplying n x n matrices in O(n^{log2 7}) operations, but also a method for inverting an n x n matrix with the same asymptotic cost. The method is described in Problem 22.8. For more on Strassen's inversion method see §24.3.2, Bailey and Ferguson [41, 1988], and Balle, Hansen, and Higham [51, 1993].
Problems
22.1. (Knight [664, 1995]) Suppose we have a method for multiplying n x n matrices in O(n^α) operations, where 2 < α < 3. Show that if A is m x n and B is n x p then the product AB can be formed in O(n1^{α-2} n2 n3) operations, where n1 = min(m, n, p) and n2 and n3 are the other two dimensions.
22.2. Work out the operation count for Winograd’s method applied to n x n
matrices.
22.3. Let Sn(n0) denote the operation count (additions plus multiplications) for Strassen's method applied to n x n matrices, with recursion down to the level of n0 x n0 matrices. Assume that n and n0 are powers of 2. For large n, estimate Sn(8)/Sn(n) and Sn(1)/Sn(8) and explain the significance of these ratios (use (22.5)).
22.4. (Knight [664, 1995]) Suppose that Strassen’s method is used to multiply
The matrix multiplications are done by Strassen’s method and the inversions
determining P1 and P6 are done by recursive invocations of the method itself.
(a) Verify these formulae, using a block LU factorization of A, and show that they permit the claimed O(n^{log2 7}) complexity. (b) Show that if A is upper triangular, Strassen's method is equivalent to (the unstable) Method 2B of §13.2.2.
(For a numerical investigation into the stability of Strassen’s inversion
method, see §24.3.2.)
Chapter 23
The Fast Fourier Transform and Applications
Theorem 23.2. Let ŷ = fl(Fn x), where n = 2^t, be computed using the
Cooley-Tukey radix 2 FFT, and assume that (23.2) holds. Then
using the fact that each Ak has only two nonzeros per row, and recalling that
we are using complex arithmetic. In view of (23.2),
Hence, overall,
Theorem 23.2 says that the Cooley-Tukey radix 2 FFT yields a tiny forward error, provided that the weights are computed stably. It follows immediately that the computation is backward stable, since ŷ = y + ∆y = Fnx + ∆y implies ŷ = Fn(x + ∆x) with ||∆x||2/||x||2 = ||∆y||2/||y||2. If we form y = Fnx by conventional multiplication using the exact Fn, then (Problem 3.7) the error bound is proportional to n rather than to log2 n. Hence, when µ is of order u, the FFT has an error bound smaller than that for conventional multiplication by the same factor as the reduction in complexity of the method. To sum up, the FFT is perfectly stable.
Figure 23.1. Error in FFT followed by inverse FFT (“o”). Dotted line is error
bound.
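A quick MATLAB experiment in the spirit of Figure 23.1 applies an FFT followed by an inverse FFT and measures the normwise relative error; the error is of the order of the unit roundoff.

n = 2^10;
x = randn(n, 1);
err = norm(ifft(fft(x)) - x)/norm(x)   % of order the unit roundoff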
Circulant matrices have the important property that they are diagonalized by the DFT matrix Fn:

    Fn C Fn^{-1} = D = diag(di).

Moreover, the eigenvalues are given by d = Fnc. Hence a linear system Cx = b can be solved in O(n log n) operations with the aid of the FFT as follows:

    (1) d = Fn c,
    (2) g = Fn b,
    (3) h = D^{-1} g,
    (4) x = Fn^{-1} h.

The computation involves two FFTs, a diagonal scaling, and an inverse FFT.
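The following is a minimal MATLAB sketch of steps (1)-(4). It assumes that c denotes the first column of C (an assumption about the notation), and it uses fft and ifft in place of multiplication by Fn and Fn^{-1}; the scaling conventions cancel in the final result. The last two lines build C explicitly only to check the residual.

n = 8;
c = randn(n, 1);                       % first column of the circulant matrix C
b = randn(n, 1);
d = fft(c);                            % (1) eigenvalues of C
g = fft(b);                            % (2)
h = g ./ d;                            % (3) diagonal solve
x = ifft(h);                           % (4)
C = toeplitz(c, [c(1); c(n:-1:2)]);    % explicit circulant, for checking only
norm(C*x - b)/norm(b)                  % of order the unit roundoff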
We now analyse the effect of rounding errors. It is convenient to write the
result of Theorem 23.2 in the equivalent form (from (23.3))
(23.4)
The conclusion is that the FFT method for solving a circulant system is
normwise backward stable provided that the FFT itself is computed stably.
Although we have shown that the computed solution solves a slightly perturbed system, the perturbed matrix C + ∆C is not a circulant. In fact, the computed solution does not, in general, solve a nearby circulant system, as can be verified experimentally by computing the "circulant backward error" using techniques from [527, 1992]. The basic reason for the lack of this stronger form of stability is that there are too few parameters in the matrix onto which to throw the backward error.
A forward error bound can be obtained by applying standard perturbation theory to Theorem 23.3: the forward error is bounded by a multiple of κ2(C)u. That the forward error can be as large as κ2(C)u is clear from the analysis above, because (23.5) shows that the computed eigenvalue is contaminated by an error of order u
Problems
23.1. (Bailey [44, 1993]) In high-precision multiplication we have two integer n-vectors x and y representing high-precision numbers and we wish to form the terms of the product. By padding the vectors with n zeros, these products can be expressed in the form of a convolution sum in which k + 1 - j is interpreted as k + 1 - j + n if k + 1 - j is negative. These products represent a convolution: a matrix-vector product involving a circulant matrix. Analogously to the linear system solution in §23.2, this product can be evaluated in terms of discrete Fourier transforms as z = Fn^{-1}(Fn x · Fn y), where the dot denotes componentwise multiplication of two vectors. Since x and y are integer vectors, the zi should also be integers, but in practice they will be contaminated by rounding errors. Obtain a bound on z - ẑ, where ẑ denotes the computed vector, and deduce a sufficient condition for the nearest integer vector to ẑ to be the exact z.
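A minimal MATLAB sketch of the padded convolution in this problem is given below, with made-up example digit vectors; it illustrates the mechanism only and is not a full high-precision multiplication routine.

x = [3 1 4 1 5 9];                        % example digit vectors (for illustration)
y = [2 7 1 8 2 8];
n = length(x);
xp = [x zeros(1, n)]; yp = [y zeros(1, n)];
z = round(real(ifft(fft(xp).*fft(yp))))   % the zi are integers, up to rounding errors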
Chapter 24
Automatic Error Analysis
Automatic error analysis is any process in which we use the computer to help
us analyse the accuracy or stability of a numerical computation. The idea
of automatic error analysis goes back to the dawn of scientific computing.
For example, running error analysis, described in §3.3, is a form of automatic
error analysis; it was used extensively by Wilkinson on the ACE machine.
Various forms of automatic error analysis have been developed. In this chapter
we describe in detail the use of direct search optimization for investigating
questions about the stability and accuracy of algorithms. We also describe
interval analysis and survey other forms of automatic error analysis.
Is Algorithm X numerically stable? How large can the growth factor be for
Gaussian elimination (GE) with pivoting strategy P? By how much can con-
dition estimator C underestimate the condition number of a matrix? These
types of questions are common, as we have seen in this book. Usually, we
attempt to answer such questions by a combination of theoretical analysis
and numerical experiments with random and nonrandom data. But a third
approach can be a valuable supplement to the first two: phrase the question
as an optimization problem and apply a direct search method.
A direct search method for the problem

    max{ f(x) : x ∈ R^n }                                    (24.1)

uses only values of the function f. As an example, consider the growth factor for GE on A,

    ρn(A) = max_{i,j,k} |aij^(k)| / max_{i,j} |aij|,

where the aij^(k) are the intermediate elements generated during the elimination. We know from §9.2 that the growth factor governs the stability of GE, so for a given pivoting strategy we would like to know how big ρn(A) can be. To obtain an optimization problem of the form (24.1) we let x = vec(A) and we define f(x) := ρn(A). Then we wish to determine
for which ρ4(B) = 1.23 × 10^5. (The large growth is a consequence of the
submatrix B(1:3, 1:3) being ill conditioned; B itself is well conditioned.) Thus
the optimizer readily shows that ρn(A) can be very large for GE without
pivoting.
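As an illustration of this approach, the following MATLAB sketch phrases the growth factor question as a maximization of the form (24.1) and attacks it with the built-in Nelder-Mead routine fminsearch; the experiments in this chapter instead use the MDS and AD maximizers from the Test Matrix Toolbox, and the helper growth_nopiv below is ours, not the book's.

function [rho, B] = maximize_growth(n)
% A sketch: search for matrices with a large growth factor for GE without pivoting.
if nargin < 1, n = 4; end
x0 = eye(n);                                     % simple starting matrix
obj = @(x) -growth_nopiv(reshape(x, n, n));      % fminsearch minimizes, so negate f
xbest = fminsearch(obj, x0(:), optimset('MaxFunEvals', 1e4));
B = reshape(xbest, n, n);
rho = growth_nopiv(B);
end

function rho = growth_nopiv(A)
% Growth factor rho_n(A) = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij| for GE
% without pivoting.
n = size(A, 1);
maxA = max(abs(A(:)));
maxAk = maxA;
for k = 1:n-1
    A(k+1:n, k+1:n) = A(k+1:n, k+1:n) - A(k+1:n, k)/A(k, k)*A(k, k+1:n);
    maxAk = max(maxAk, max(max(abs(A(k+1:n, k+1:n)))));
end
rho = maxAk/maxA;
end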
Next, consider GE with partial pivoting (GEPP). Because the elimination
cannot break down, f is now defined everywhere, but it is usually discontinu-
ous when there is a tie in the choice of pivot element, because then an arbitrar-
ily small change in A can alter the pivot sequence. We applied the maximizer
MDS to f, this time starting with the orthogonal matrix A with aij = (2/(2n+1)^{1/2}) sin(2ijπ/(2n+1)) (cf. (9.11)), for which ρ4(A) = 2.32.
After 29 iterations and 1169 function evaluations the maximizer converged to
a matrix B with ρ 4(B) = 5.86. We used this matrix to start the maximizer
18. In the optimizations of this section we used the convergence tests described in §24.2 with tol = 10^{-3}. There is no guarantee that when convergence is achieved it is to a local maximum; see §24.2.
19. All numbers quoted are rounded to the number of significant figures shown.
20. This matrix is orthog(n, 2) from the Test Matrix Toolbox; see Appendix E.
for which ρ4(C) = 7.939. This is one of the matrices described in Theorem 9.6,
modulo the convergence tolerance.
These examples, and others presented below, illustrate the following at-
tractions of using direct search methods to investigate the stability of a nu-
merical computation.
(1) The simplest possible formulation of the optimization problem is often sufficient to yield useful results. Derivatives are not needed, and direct search methods tend to be insensitive to lack of smoothness in the objective function f. Unboundedness of f is a favorable property: direct search methods usually locate large values of f quickly.
(2) Good progress can often be made from simple starting values, such as
an identity matrix. However, prior knowledge of the problem may provide
a good starting value that can be substantially improved (as in the partial
pivoting example).
(3) Usually it is the global maximum of f in (24.1) that is desired (although it is often sufficient to know that f can exceed a specified value). When a direct search method converges it will, in general, at best have located a local maximum, and in practice the maximizer may simply have stagnated, particularly if a slack convergence tolerance is used. However, further progress can often be made by restarting the same (or a different) maximizer, as in the partial pivoting example. This is because, for methods that employ a simplex (such as the MDS method), the behaviour of the method starting at x0 is determined not just by x0 but also by the n + 1 vectors in the initial simplex constructed at x0.
(4) The numerical information revealed by direct search provides a starting point for further theoretical analysis. For example, the GE experiments above strongly suggest the (well known) results that ρn(A) is unbounded without pivoting and bounded by 2^{n-1} for partial pivoting, and inspection of the numerical data suggests the methods of proof.
When applied to smooth problems the main disadvantages of direct search
methods are that they have at best a linear rate of convergence and they are
unable to determine the nature of the point at which they terminate (since
derivatives are not calculated). These disadvantages are less significant for
the problems we consider, where it is not necessary to locate a maximum
to high accuracy and objective functions are usually nonsmooth. (Note that
these disadvantages are not necessarily shared by methods that implicitly or
repeat
    % One iteration comprises a loop over all components of x.
    for i = 1:n
        find α such that f(x + αei) is maximized (line search)
        Set x ← x + αei
    end
until converged
where fk is the highest function value at the end of the k th iteration. The AD
method has the very modest storage requirement of just a single n-vector.
The second method is the multidirectional search method (MDS) of Dennis
and Torczon [1008, 1989], [1009, 1991], [301, 1991]. This method employs a
simplex, which is defined by n + 1 vectors v0, v1, . . . , vn. One iteration in the
case n = 2 is represented pictorially in Figure 24.1, and may be explained as
follows.
Figure 24.1. The possible steps in one iteration of the MDS method when n = 2.
The initial simplex is {v0, v1, v2} and it is assumed that f(v0) = maxi f(vi).

The purpose of an iteration is to produce a new simplex at one of whose vertices f exceeds f(v0). In the first step the vertices v1 and v2 are reflected about v0, along the lines joining them to v0, yielding r1 and r2 and the reflected simplex {v0, r1, r2}. If this reflection step is successful, that is, if maxi f(ri) > f(v0), then the edges from v0 to ri are doubled in length to give an expanded simplex {v0, e1, e2}. The original simplex is then replaced by {v0, e1, e2} if maxi f(ei) > maxi f(ri), and otherwise by {v0, r1, r2}. If the reflection step is unsuccessful then the edges v0 - vi of the original simplex are shrunk to half their length to give the contracted simplex {v0, c1, c2}. This becomes the new simplex if maxi f(ci) > maxi f(vi), in which case the current iteration is complete; otherwise the algorithm jumps back to the reflection step, now working with the contracted simplex. For further details of the MDS method see Dennis and Torczon [301, 1991] and Torczon [1008, 1989], [1009, 1991].
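The following MATLAB sketch implements one iteration of this scheme for maximizing f, following the description above; it is only a sketch and is not the Test Matrix Toolbox implementation used in this chapter. The calling convention (a function handle f and an n-by-(n+1) vertex array V with the best vertex in the first column) is our own assumption.

function V = mds_iteration(f, V)
% One MDS iteration for MAXIMIZING f.  V is n-by-(n+1); V(:,1) = v0 must be
% the vertex with the largest function value.
fmax = @(W) max(cellfun(f, num2cell(W, 1)));  % largest f value over the columns of W
v0 = V(:, 1);
f0 = f(v0);
while true
    R = 2*v0 - V(:, 2:end);                   % reflect v1,...,vn through v0
    if fmax(R) > f0
        E = 3*v0 - 2*V(:, 2:end);             % expand the reflected simplex
        if fmax(E) > fmax(R), V(:, 2:end) = E; else, V(:, 2:end) = R; end
        return
    end
    V(:, 2:end) = (v0 + V(:, 2:end))/2;       % contract toward v0
    if fmax(V(:, 2:end)) > f0, return, end    % else reflect the contracted simplex
end
end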
The MDS method requires at least 2n independent function evaluations
per iteration, which makes it very suitable for parallel implementation. Gen-
eralizations of the MDS method that are even more suitable for parallel com-
putation are described in [301, 1991] and [1010, 1992]. The MDS method
requires O(n2) elements of storage for the simplices, but this can be reduced
to O(n) (at the cost of extra bookkeeping) if an appropriate choice of initial
simplex is made [301, 1991].
Unlike most direct search methods, the MDS method possesses some convergence theory. Torczon [1009, 1991] shows that if the level set of f at
(24.3)
Unless otherwise stated, we used tol = 10^{-3} in (24.2) and (24.3) in all our experiments.
The third method that we have used is the Nelder–Mead direct search
method [787, 1965], [303, 1987], which also employs a simplex but which
is fundamentally different from the MDS method. We omit a description
since the method is described in textbooks (see, for example, Gill, Murray,
and Wright [447, 1981, §4.2.2], or Press, Teukolsky, Vetterling, and Flannery
[842, 1992, §10.4]). Our limited experiments with the Nelder-Mead method
indicate that while it can sometimes out-perform the MDS method, the MDS
method is generally superior for our purposes. No convergence results of the
form described above for the MDS method are known for the Nelder-Mead
method.
For general results on the convergence of “pattern search” methods, see
Torczon [1011, 1993]. The AD and MDS methods are pattern search methods,
but the Nelder–Mead method is not.
It is interesting to note that the MDS method, the Nelder-Mead method,
and our particular implementation of the AD method do not exploit the nu-
merical values of f: their only use of f is to compare two function values to
see which is the larger!
Our MATLAB implementations of the AD, MDS, and Nelder-Mead direct
search methods are in the Test Matrix Toolbox, described in Appendix E.
where est(A) ≤ κ1(A) is the condition estimate. We note that, since the
algorithms underlying RCOND and LACON contain tests and branches, there
are matrices A for which an arbitrarily small change in A can completely
change the condition estimate; hence for both algorithms f has points of
discontinuity.
We applied the MDS maximizer to RCOND starting at the 5 x 5 Hilbert
matrix. After 67 iterations and 4026 function evaluations the maximizer had
located a matrix for which f(x) = 226.9. We then started the Nelder-
Mead method from this matrix; after another 4947 function evaluations it
had reached the matrix (shown to 5 significant figures)
for which
This example is interesting because the matrix is well scaled, while the pa-
rametrized counterexamples of Cline and Rew [217, 1983] all become badly
scaled when the parameter is chosen to make the condition estimate poor.
For LACON we took as starting matrix the 4 x 4 version of the n x n matrix with aij = cos((i - 1)(j - 1)π/(n - 1)) (this is a Chebyshev-Vandermonde matrix, as used in §21.3.3, and is orthog(n, -1) in the Test Matrix Tool-
box). After 11 iterations and 1001 function evaluations the AD maximizer
had determined a (well-scaled) matrix A for which
With relatively little effort on our part (most of the effort was spent ex-
perimenting with different starting matrices), the maximizers have discovered
examples where both condition estimators fail to achieve their objective of
producing an estimate correct to within an order of magnitude. The value of
direct search maximization in this context is clear: it can readily demonstrate
the fallibility of a condition estimator—a task that can be extremely diffi-
cult to accomplish using theoretical analysis or tests with random matrices.
Moreover, the numerical examples obtained from direct search may provide
a starting point for the construction of parametrized theoretical ones, or for
the improvement of a condition estimation algorithm.
In addition to measuring the quality of a single algorithm, direct search
can be used to compare two competing algorithms to investigate whether one
algorithm performs uniformly better than the other. We applied the MDS
maximizer to the function
where estL(A) and estR(A) are the condition estimates from LACON and
RCOND, respectively. If f(x) > 1 then LACON has produced a larger lower bound for κ1(A) than RCOND. Starting with a random 5 x 5 matrix the Nelder-Mead maximizer produced after 1788 function evaluations a matrix A for which estL(A) = κ1(A) and f(x) = 1675.4. With f defined as
f(x) = estR(A)/estL(A), and starting with I4, after 6065 function evalua-
tions the MDS maximizer produced a matrix for which f(x) = 120.8. This
experiment shows that neither estimator is uniformly superior to the other.
This conclusion would be onerous to reach by theoretical analysis of the algo-
rithms.
P1 = A11^{-1},     P2 = A21P1,
P3 = P1A12,        P4 = A21P3,
P5 = P4 - A22,     P6 = P5^{-1},
where each of the matrix products is formed using Strassen’s fast matrix mul-
tiplication method. Strassen’s inversion method is clearly unstable for general
A, because the method breaks down if A11 is singular. Indeed Strassen’s inver-
sion method has been implemented on a Cray-2 by Bailey and Ferguson [41,
1988] and tested for n < 2048, and these authors observe empirically that
the method has poor numerical stability. Direct search can be used to gain
insight into the numerical stability.
With x = vec(A) define the stability measure
(24.4)
y^3 + py + q = 0,
(24.5)
For either choice of sign, the three cube roots for w yield the roots of the original cubic, on transforming back from w to y to z.
Are these formulae for solving a cubic numerically stable? Direct search
provides an easy way to investigate. The variables are the three coefficients
a, b, c and for the objective function we take an approximation to the relative error of the computed roots ẑ. We compute the "exact" roots z using MATLAB's roots function (which uses the QR algorithm to compute the eigenvalues of the companion matrix) and then our relative error measure is the normwise relative error minimized over all six permutations Π of the computed roots.
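A minimal MATLAB sketch of such an objective function follows. Here cubic_roots is a hypothetical routine implementing the explicit formulae (24.5)/(24.6) for the roots of x^3 + a*x^2 + b*x + c; it is not part of MATLAB or the Test Matrix Toolbox, and the exact form of the error measure is our own choice.

function err = cubic_relerr(coef)
a = coef(1); b = coef(2); c = coef(3);
z = roots([1 a b c]);            % "exact" roots, via the companion matrix
zhat = cubic_roots(a, b, c);     % roots from the explicit formulae (hypothetical routine)
z = z(:); zhat = zhat(:);
P = perms(1:3);                  % all six orderings of the computed roots
err = inf;
for k = 1:size(P, 1)
    err = min(err, norm(zhat(P(k, :)) - z, inf)/norm(z, inf));
end
end

A direct search maximizer is then applied to cubic_relerr, with the coefficient vector as the variable.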
First, we arbitrarily take the "+" square root in (24.5). With almost
any starting vector of coefficients, the MDS maximizer rapidly identifies co-
efficients for which the computed roots are very inaccurate. For example,
starting with [1, 1, 1]^T we are led to the vector
[a b c]^T = [1.732 1 1.2704]^T,
for which the computed and “exact” roots are
21. Edelman and Murakami [346, 1995] and Toh and Trefethen [1007, 1994] analyse the stability of this method of finding polynomial roots; the method is stable.
When the cubic is evaluated at the computed roots the results are of order 10^{-2}, whereas they are of order 10^{-15} for z. Since the roots are well separated
the problem of computing them is not ill conditioned, so we can be sure that
z is an accurate approximation to the exact roots. The conclusion is that the
formulae, as programmed, are numerically unstable.
Recalling the discussion for a quadratic equation in §1.8, a likely reason
for the observed instability is cancellation in (24.5). Instead of always taking
the “+” sign, we therefore take
(24.6)
When the argument of the square root is nonnegative, this formula suffers
no cancellation in the subtraction; the same is true when the argument of
the square root is negative, because then the square root is pure imaginary.
With the use of (24.6), we were unable to locate instability using the objective
function described above. However, an alternative objective function can be
derived as follows. It is reasonable to ask that the computed roots be the
exact roots of a slightly perturbed cubic. Thus each computed root should
be a root of
We will use the notation [x] for an interval [x1, x2] and we define width([x]) := x2 - x1.
In floating point arithmetic, an interval containing fl([x] op [y]) is obtained by rounding computations on the left endpoint to -∞ and those on the right endpoint to +∞ (both these rounding modes are supported in IEEE arithmetic).
The success of an interval analysis depends on whether an answer is pro-
duced at all (an interval algorithm will break down with division by zero if
it attempts to divide by an interval containing zero), and on the width of
the interval answer. A one-line program that prints [-∞, ∞] would be cor-
rect, robust, and fast, but useless. Interval analysis is controversial because,
in general, narrow intervals cannot be guaranteed. One reason is that when
dependencies occur in a calculation, in the sense that a variable appears more
than once, final interval lengths can be pessimistic. For example, if [x ] = [1,2]
then
[ x] - [x] = [-1, 1], [x]/[x] = [1/2, 2],
whereas the optimal intervals for these calculations are, respectively, [0, 0] and
[1, 1]. These calculations can be interpreted as saying that there is no additive
or multiplicative inverse in interval arithmetic.
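A minimal MATLAB sketch of these two calculations, with intervals stored as [lo, hi] pairs and directed rounding ignored (a real implementation would round the left endpoint towards -∞ and the right endpoint towards +∞):

isub = @(x, y) [x(1) - y(2), x(2) - y(1)];
idiv = @(x, y) [min([x(1)/y(1) x(1)/y(2) x(2)/y(1) x(2)/y(2)]), ...
                max([x(1)/y(1) x(1)/y(2) x(2)/y(1) x(2)/y(2)])];  % assumes 0 is not in y
x = [1 2];
isub(x, x)   % [-1 1], not the optimal [0 0]
idiv(x, x)   % [0.5 2], not the optimal [1 1]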
An example of an algorithm for which interval arithmetic can be ineffective
is GEPP, which in general gives very pessimistic error bounds and is unstable
in interval arithmetic even for a well-conditioned matrix [798, 1977]. The
basic problem is that the interval sizes grow exponentially. For example, in the 2 x 2 reduction, if [y] ≈ [x] then

    width([y] - [x]) ≈ width([x] - [x]) = 2 width([x]).
This type of growth is very likely to happen, unlike the superficially similar
phenomenon of element growth in standard GEPP. The poor interval bounds
are entirely analogous to the pessimistic results returned by early forward
error analyses of GEPP (see §9.6). Nickel [798, 1977] states that “The interval
Gauss elimination method is one of the most unfavorable cases of interval
computation . . . nearly all other numerical methods give much better results
if transformed to interval methods”. Interval GE is effective, however, for
certain special classes of matrices, such as M-matrices.
As already mentioned, there is a large body of literature on interval arith-
metic, though, as Skeel notes (in an article that advocates interval arith-
metic), “elaborate formalisms and abstruse notation make much of the lit-
erature impenetrable to all but the most determined outsiders” [922, 1989].
Good sources of information include various conference proceedings and the
journal Computing. The earliest book is by Moore [778, 1966], whose later
book [779, 1979] is one of the best references on the subject. Nickel [798,
1977] gives an easy to read summary of research up to the mid 1970s. A more
recent reference is Alefeld and Herzberger [10, 1983].
Yohe [1120, 1979] describes a Fortran 66 package for performing interval arithmetic in which machine-specific features are confined to a few modules. It is designed to work with a precompiler for Fortran called Augment [254, 1979], which allows the user to write programs as though Fortran had extended data types, in this case an INTERVAL type. A version of the package
that allows arbitrary precision interval arithmetic by incorporating Brent’s
multiple precision arithmetic package (see §25.9) is described in [1121, 1980].
In [1119, 1979], Yohe describes general principles of implementing nonstan-
dard arithmetic in software, with particular reference to interval arithmetic.
Kulisch and Miranker have proposed endowing a computer arithmetic with
a super-accurate inner product, that is, a facility to compute an exactly
rounded inner product for any two vectors of numbers at the working pre-
cision [678, 1981], [679, 1983], [680, 1986]. This idea has been implemented
in the package ACRITH from IBM, which employs interval arithmetic [124,
1985]. Anomalies in early versions of ACRITH are described by Kahan and
LeBlanc [639, 1985] and Jansen and Weidner [612, 1986]. For a readable
discussion of interval arithmetic and the super-accurate inner product, see
Rall [858, 1991]. Cogent arguments against adopting a super-accurate inner
product are given by Demmel [284, 1991].
(the interval could be much bigger than f2(x2) - f2(x2 ± ε2), depending on the algorithm used to evaluate f2). In other words, the width of the interval containing x3 is roughly proportional to the condition number of f2. When the output of the f2 computation is fed into f3 the interval width is multiplied by f′3. The width of the final interval containing xn+1 is proportional to the product of the condition numbers of all the functions fi, and if there are enough functions, or if they are conditioned badly enough, the final interval will provide no useful information. The only way to avoid such a failure is to use variable precision arithmetic or to reformulate the problem to escape the product of the condition numbers of the fi.
Sections 24.1 – 24.3.2 and §24.5 are based on Higham [555, 1993].
The Release Notes for MATLAB 4.1 state that “This release of MATLAB
fixes a bug in the rcond function. Previously, rcond returned a larger than
expected estimate for some matrices . . . rcond now returns an estimate that
matches the value returned by the Fortran LINPACK library.” In the direct
search experiments in [555, 1993] we used MATLAB 3.5, and we found it much
easier to generate counterexamples to rcond than we do now with MATLAB
4.2. It seems that the maximizations in [555, 1993] were not only defeating the algorithm underlying rcond, but also, unbeknown to us, exploiting a bug in the implementation of the function. The conclusions of [555, 1993] are unaffected, however.
Another way to solve a cubic is to use Newton’s method to find a real
zero, and then to find the other two zeros by solving a quadratic equation.
Problems
24.1. Let A and let and be the computed orthogonal
QR factors produced by the classical and modified Gram-Schmidt methods,
respectively. Use direct search to maximize the functions
In order to keep κ2(A) small, try maximizing fi(A) - θ max(κ2(A) - µ, 0), where θ is a large positive constant and µ is an upper bound on the acceptable condition numbers.
24.2. It is well known that if A ∈ R^{n×n} is nonsingular and v^T A^{-1}u ≠ -1 then

    (A + uv^T)^{-1} = A^{-1} - A^{-1}uv^T A^{-1}/(1 + v^T A^{-1}u).

(a) Investigate the stability using direct search. Let both linear systems with coefficient matrix A be solved by GEPP. Take A, u, and v as the data and let the function to be maximized be the normwise backward error ηC,b.
(b) (RESEARCH PROBLEM) Obtain backward and forward error bounds for
the method (for some existing analysis see Yip [1118, 1986]).
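A minimal MATLAB sketch of the method whose stability is in question: solve (A + u*v')x = b via two GEPP solves with A (the Sherman-Morrison formula) and measure the normwise backward error for the system (A + u*v')x = b. The use of the infinity norm here is our assumption.

n = 8;
A = randn(n); u = randn(n, 1); v = randn(n, 1); b = randn(n, 1);
[L, U, P] = lu(A);                     % GEPP factorization, reused for both solves
y = U\(L\(P*b));                       % solves A*y = b
z = U\(L\(P*u));                       % solves A*z = u
x = y - z*(v'*y)/(1 + v'*z);           % Sherman-Morrison solution of (A + u*v')x = b
C = A + u*v';
eta = norm(b - C*x, inf)/(norm(C, inf)*norm(x, inf) + norm(b, inf))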
24.3. (RESEARCH PROBLEM) Investigate the stability of the formulae of §24.3.3 for computing the roots of a cubic.
Chapter 25
Software Issues in Floating Point Arithmetic
max = -inf
for j = 1:n
    if aj > max
        max = aj
    end
end
r(x) = 7 - 3/(x - 2 - 1/(x - 7 + 10/(x - 2 - 2/(x - 3)))),
which is plotted over the range [0,4] in Figure 25.1. Another way to write the
Figure 25.2. Error in evaluating rational function r. Solid line: continued fraction,
dotted line: usual rational form.
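A minimal MATLAB sketch of evaluating r by the continued fraction form given above: in IEEE arithmetic a zero denominator produces inf, which propagates harmlessly through the remaining divisions, so no special tests are needed.

r = @(x) 7 - 3./(x - 2 - 1./(x - 7 + 10./(x - 2 - 2./(x - 3))));
x = linspace(0, 4, 401);
plot(x, r(x))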
The difference x - y of two machine numbers can evaluate to zero even when x ≠ y, because of underflow. This makes a test such as

if x ≠ y
    f = f/(x - y)
end
Carter [188, 1991] describes how a Cray computer at the NASA Ames Lab-
oratories produced puzzling results that were eventually traced to properties
of its floating point arithmetic. Carter used Cholesky factorization on a Cray
Y-MP to solve a sparse symmetric positive definite system of order 16146
resulting from a finite element model of the National Aerospace Plane. The
results obtained for several computers are shown in Table 25.1, where “Dis-
placement” denotes the largest component of the solution and incorrect digits
are set in italics and underlined. Since the coefficient matrix has a condition
number of order 10^{11}, the errors in the displacements are consistent with the
error bounds for Cholesky factorization.
Given that both the Cray 2 and the Cray Y-MP lack a guard digit, it is not surprising that they give a less accurate answer than the other machines with a similar unit roundoff. What is surprising is that, even though both machines use 64-bit binary arithmetic with a 48-bit mantissa, the result from the Cray Y-MP is much less accurate than that from the Cray 2. The explanation (diagnosed over the telephone to Carter by Kahan, as he scratched on the back of an envelope) lies in the fact that the Cray 2 implementation of subtraction
without a guard digit produces more nearly unbiased results (average error
zero), while the Cray Y-MP implementation produces biased results, causing
fl(x - y) to be too big if x > y > 0. Since the inner loop of the Cholesky algorithm contains an operation of the form aii = aii - x^2, the errors in the diagonal elements reinforce as the factorization proceeds on the Cray
Y-MP, producing a Cholesky factor with a diagonal that is systematically
too large. Similarly, since Carter’s matrix has off-diagonal elements that are
mostly negative or zero, the Cray Y-MP produces a Cholesky factor with
off-diagonal elements that are systematically too large and negative. For the
large value of n used by Carter, these two effects in concert are large enough
to cause the loss of a further two digits in the answer.
An inexpensive way to improve the stability of the computed solution is to
use iterative refinement in fixed precision (see Chapter 11). This was tried by
Carter. He found that after one step of refinement the Cray Y-MP solution
was almost as accurate as that from the Cray 2 without refinement.
25.4. Compilers
Some of the pitfalls a compiler writer should avoid when implementing float-
ing point arithmetic are discussed by Farnum [364, 1988]. The theme of his
paper is that programmers must be able to predict the floating point opera-
tions that will be performed when their codes are compiled; this may require,
for example, that the compiler writer forgoes certain “optimizations”. In par-
ticular, compilers should not change groupings specified by parentheses. For
example, the two expressions
(1.0E-30 + 1.0E+30) - 1.0E+30
1.0E-30 + (1.0E+30 - 1.0E+30)
will produce different answers on many machines. Farnum explains that
strange behaviour of compilers and says that he was unable to make his code
run correctly on the Cyber 205. A routine based on machar is given in Nu-
merical Recipes [842, 1992, §20.1].
LAPACK contains a routine xLAMCH for determining machine parameters.
Because of the difficulty of handling the existing variety of machine arithmetics
it is over 850 lines long (including comments and the subprograms it calls).
Programs such as machar, paranoia, and xLAMCH are difficult to write; for
example, xLAMCH tries to determine the overflow threshold without invoking
overflow. The Fortran version of paranoia handles overflow by requiring the
user to restart the program, after which checkpointing information previously
written to a file is read to determine how to continue.
Fortran 90 contains environmental inquiry functions, which for REAL arguments return the precision (PRECISION), exponent range (RANGE), machine epsilon (EPSILON), largest number (HUGE), and so on, corresponding to that argument [749, 1990]. The values of these parameters are chosen by the im-
plementor to best fit a model of the machine arithmetic due to Brown [150,
1981] (see §25.7.4). Fortran 90 also contains functions for manipulating float-
ing point numbers: for example, to set or return the exponent or fractional
part (EXPONENT, FRACTION, SET_EXPONENT) and to return the spacing of the
numbers having the same exponent as the argument (SPACING).
25.7. Portability
Software is portable if it can be made to run on different systems with just a few straightforward changes (ideally, we would like to have to make no changes, but this level of portability is often impossible to achieve). Sometimes
the word “transportable” is used to describe software that requires certain
well-defined changes to run on different machines. For example, Demmel,
Dongarra, and Kahan [290, 1992] describe LAPACK as “a transportable way
to achieve high efficiency on diverse modern machines”, noting that to achieve
high efficiency the BLAS need to be optimized for each machine. A good
example of a portable collection of Fortran codes is LINPACK. It contains
no machine-dependent constants and uses the PFORT subset of Fortran 77;
it uses the level-1 BLAS, so, ideally, optimized BLAS would be used on each
machine.
type of rounding, can all vary. Some possible ways for a code to acquire these
parameters are as follows.
(1) The parameters are evaluated and embedded in the program in PARAM-
ETER and DATA statements. This is conceptually the simplest approach, but
it is not portable.
(2) Functions are provided that return the machine parameters. Bell Laboratories' PORT library [405, 1978] has three functions: R1MACH, D1MACH, and I1MACH. R1MACH returns the underflow and overflow thresholds, the smallest and largest relative spacing (β^{-t} and β^{1-t}, respectively), and log10 β, where β is the base and t the number of digits in the mantissa. I1MACH returns standard input, output and error units and further floating point information, such as β, t, and the minimum and maximum exponents emin and emax. These functions
do not carry out any computation; they contain DATA statements with the
parameters for most common machines in comment lines, and the relevant
statements have to be uncommented for a particular machine. This approach
is more sensible for a library than the previous one, because only these three
routines have to be edited, instead of every routine in the library.
The NAG Library takes a similar approach to PORT. Each of the 18
routines in its X02 chapter returns a different machine constant. For example,
X02AJF returns the unit roundoff and X02ALF returns the largest positive
floating point number. These values are determined once and for all when the
NAG library is implemented on a particular platform using a routine similar
to paranoia and machar, and they are hard coded into the Chapter X02
routines.
(3) The information is computed at run-time, using algorithms such as
those described in §25.5.
SLAEV2, and 207 for SLASY2. If the availability of extended precision arith-
metic (possibly simulated using the working precision) or IEEE arithmetic
can be assumed, the codes can be simplified significantly. Complicated and
less efficient code for these 2 x 2 problem solvers is a price to be paid for
portability across a wide range of floating point arithmetics.
1981] notes, “Programming for the [IEEE] standard is like programming for
one of a family of well-known machines, whereas programming for a model is
like programming for a horde of obscure and ill-understood machines all at
once.” Although Brown’s model was used in the Ada language to provide a
detailed specification of floating point arithmetic, the model is still somewhat
controversial.
Wichmann [1077, 1989] gives a formal specification of floating point arith-
metic in the VDM specification language based on a modification of Brown’s
model.
The most recent model is the Language Independent Arithmetic (LIA-
1) [702, 1993]. The LIA-1 specifies floating point arithmetic far more tightly
than Brown’s model. It, too, is controversial; an explanation of flaws in an
earlier version (know then as the Language Compatible Arithmetic Standard)
was published by Kahan [635, 1991].
For a more detailed critique of models of floating point arithmetic see
Priest [844, 1992].
s = 0
for i = 1:n
    s = s + (x(i)/t)^2
end

The trouble with this algorithm is that it requires n divisions and two passes over the data (the first to evaluate t = maxi |x(i)|), so it is slow. (It also involves
more rounding errors than the unscaled evaluation, which could be obviated
by scaling by a power of the machine base.) Blue [128, 1978] develops a one-
pass algorithm that avoids both overflow and underflow and requires between
O and n divisions, depending on the data, and he gives an error analysis to
show that the algorithm produces an accurate answer. The idea behind the
algorithm is simple. The sum of squares is collected in three accumulators,
one each for small, medium, and large numbers. After the summation, various
logical tests are used to decide how to combine the accumulators to obtain
the final answer.
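For comparison, here is a minimal MATLAB sketch of the simple two-pass scaled evaluation discussed above (not Blue's one-pass algorithm): scaling by t = max|x(i)| avoids overflow and damaging underflow at the cost of a second pass and n divisions.

x = randn(1000, 1);
t = max(abs(x));
s = sum((x/t).^2);
nrm2 = t*sqrt(s)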
The original, portable implementation of the level-l BLAS routine xNRM2
(listed in [307, 1979]) was written by C. L. Lawson in 1978 and, according to
Lawson, Hanson, Kincaid, and Krogh [694, 1979], makes use of Blue’s ideas.
However, xNRM2 is not a direct coding of Blue's algorithm and is extremely difficult to understand: a classic piece of "Fortran spaghetti"! Nevertheless, the routine clearly works and is reliable, because it has been very widely used without any reported problems. Lawson's version of xNRM2 has now been superseded in the LAPACK distribution by a concise and elegant version by S. J. Hammarling, which implements a one-pass algorithm; see Problem 25.5.
A special case of the vector 2-norm is the Pythagorean sum (a^2 + b^2)^{1/2}, which occurs in many numerical computations. One way to compute it is by a beautiful cubically convergent, square-root-free iteration devised by Moler and Morrison [773, 1983]; see Problem 25.6. LAPACK has a routine xLAPY2 that computes (x^2 + y^2)^{1/2} and another routine xLAPY3 that computes (x^2 + y^2 + z^2)^{1/2}; both routines avoid overflow by using the algorithm listed at the start of this section with n = 2 or 3.
Pythagorean sums arise in computing the 1-norm of a complex vector:

    ||x||1 = Σi ((Re xi)^2 + (Im xi)^2)^{1/2}.

The level-1 BLAS routine xCASUM does not compute the 1-norm, but the more easily computed quantity

    Σi (|Re xi| + |Im xi|).

The reason given by the BLAS developers is that it was assumed that users would expect xCASUM to compute a less expensive measure of size than the 2-norm [694, 1979]. This reasoning is sound, but many users have been confused not to receive the true 1-norm. See Problem 6.16 for more on this pseudo 1-norm.
Another example where underflow and overflow can create problems is in complex division. If we use the formula

    (a + ib)/(c + id) = (ac + bd)/(c^2 + d^2) + i(bc - ad)/(c^2 + d^2),

then overflow will occur whenever c or d exceeds the square root of the overflow threshold, even if the quotient (a + ib)/(c + id) does not overflow. Certain Cray and NEC machines implement complex division in this way [290, 1992]; on these machines the exponent range is effectively halved from the point of view of the writer of robust software. Smith [929, 1962] suggests a simple way to avoid overflow: if |c| ≥ |d| use the following formula, obtained by multiplying the numerators and denominators by c^{-1},

    (a + ib)/(c + id) = (a + b(d/c))/(c + d(d/c)) + i(b - a(d/c))/(c + d(d/c)),        (25.1)
and if |d| > |c| use the analogous formula involving d^{-1}. Stewart [947, 1985] points out that underflow is still possible in these formulae, and suggests a more complicated algorithm that avoids both underflow and overflow.
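A minimal MATLAB sketch of Smith's formula (25.1), together with the analogous branch for |d| > |c|:

function [x, y] = cdiv_smith(a, b, c, d)
% Real and imaginary parts of (a+ib)/(c+id), avoiding overflow when |c| or
% |d| exceeds the square root of the overflow threshold.
if abs(c) >= abs(d)
    r = d/c;  t = c + d*r;
    x = (a + b*r)/t;  y = (b - a*r)/t;
else
    r = c/d;  t = c*r + d;
    x = (a*r + b)/t;  y = (b*r - a)/t;
end
end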
Demmel [280, 1984] discusses in detail the effects of underflow on numerical
software and analyses different ways of implementing underflows, the main two
of which are flush to zero and gradual underflow (as used in IEEE standard
arithmetic). Cody [222, 1988] makes the following observations:
which uses three real multiplications instead of the usual four. As we saw in
§22.2.4, the 3M method produces a computed imaginary part that can have
large relative error, but this instability is not a serious obstacle to its use in
MPFUN.
Bailey provides a translator that takes Fortran 77 source containing direc-
tives in comments that specify the precision level and which variables are to
be treated as multiprecision, and produces a Fortran 77 program containing
the appropriate multiprecision subroutine calls. He also provides a Fortran 90
version of the package that employs derived data types and operator exten-
sions [45, 1994]. This Fortran 90 version is a powerful and easy to use tool
for doing high precision numerical computations.
Bailey’s own use of the packages includes high precision computation of
constants such as π [42, 1988] and Euler’s constant γ. A practical application
to a fluid flow problem is described by Bailey, Krasny, and Pelz [46, 1993].
They found that converting an existing code to use MPFUN routines increased
execution times by a factor of about 400 at 56 decimal digit precision, and
they comment that this magnitude of increase is fairly typical for multiple
precision computation.
Another recent Fortran 77 program for multiple precision arithmetic is the FM package of Smith [926, 1991]. It is functionally quite similar to Bailey's MPFUN. No translator is supplied for converting from standard Fortran 77 source code to code that invokes FM, but Smith notes that the general purpose precompiler Augment [254, 1979] can be used for this purpose.
Fortran routines for high precision computation can also be found in Nu-
merical Recipes [842, 1992, §20.6], and high precision numerical computation
is supported by many symbolic manipulation packages, including Maple [199,
1991] and Mathematica [1109, 1991].
A Pascal-like programming language called Turing [579, 1988] developed
at the University of Toronto in the 1980s is the basis of an extension called
Numerical Turing, developed by Hull and his co-workers [591, 1985]. Numer-
ical Turing includes decimal floating point arithmetic whose precision can be
dynamically varied throughout the course of a program, a feature argued for
in [590, 1982] and implemented in hardware in [231, 1983].
An extension to the level-2 BLAS (see Appendix D) is proposed in [313,
1988, App. B], comprising routines having the same specifications as those in
the standard BLAS but which calculate in extended precision.
Patriot lost track of its target. Note that the numbers in Table 25.2 are consistent with a relative error of 2^{-20} in the computer's representation of 0.1, this constant being used to convert from the system clock's tenths of a second to seconds (2^{-20} is the relative error introduced by chopping 0.1 to 23 bits after the binary point).
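A quick MATLAB check of this parenthetical claim:

x = 0.1;
xc = floor(x*2^23)/2^23;        % 0.1 chopped to 23 bits after the binary point
relerr = (x - xc)/x             % approximately 9.5e-7, that is, about 2^(-20)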
Problems
25.1. (a) The following MATLAB code attempts to compute µ:

x = 1;
xp1 = x + 1;
while xp1 > 1
    x = x/2;
    xp1 = x + 1;
end
x = 2*x;
On my workstation, running this code gives
>> mu
ans =
2.2204e-016
Try this code, or a translation of it into another language, on any machines
available to you. Are the answers what you expected?
(b) Under what circumstances might the code in (a) fail to compute µ?
(Hint: consider optimizing compilers.)
(c) What is the value of µ in terms of the arithmetic parameters β and
t ? Note: the answer depends subtly on how rounding is performed, and in
particular on whether double rounding is in effect; see Higham [550, 1991]
and Moler [771, 1990]. On a workstation with an Intel 486DX chip (in which
double rounding does occur), the following behaviour is observed in MATLAB:
>> format hex; format compact % Hexadecimal format.
y=
3ff0000000000001 3ff0000000000000 3ca0020000000001
y=
3ff0000000000000 3ff0000000000000 3ca0020000000000
25.2. Show that Smith’s formulae (25.1) can be derived by applying Gaussian
elimination with partial pivoting to the 2 x 2 linear system obtained from
(c + id)(x + iy) = a + ib.
25.3. The following MATLAB function implements an algorithm of Malcolm [724,
1972] for determining the floating point arithmetic parameters β and t.
function [beta, t] = param(x)
% a and b are floating point variables.
a = 1;
while (a+1) - a == 1
    a = 2*a;
end
b = 2;
while (a+b) == a
    b = 2*b;
end
beta = (a+b) - a;
t = 1;
a = beta;
while (a+1) - a == 1
    t = t+1;
    a = a*beta;
end
Run this code, or a translation into another language, on any machines avail-
able to you. Explain how the code works. (Hint: consider the integers that can
be represented exactly as floating point numbers.) Under what circumstances
might it fail to produce the desired results?
25.4. Hough [585, 1989] formed a 512 x 512 matrix using the following Fortran
random number generator:
      subroutine matgen(a, lda, n, b, norma)
      REAL a(lda,1), b(1), norma
c
      init = 1325
      norma = 0.0
      do 30 j = 1,n
         do 20 i = 1,n
            init = mod(3125*init, 65536)
            a(i,j) = (init - 32768.0)/16384.0
            norma = max(a(i,j), norma)
   20    continue
   30 continue
He then solved a linear system using this matrix, on a workstation that uses
IEEE single precision arithmetic. He found that the program took an inordi-
nate amount of time to run and tracked the problem to underflows. On the
system in question underflows are trapped and the correct result recomputed
very slowly in software. Hough picks up the story:
I started printing out the pivots in the program. They started out as normal numbers like 1 or -10, then suddenly dropped to about 1e-7, then later to 1e-14, and then:
k 82 pivot -1.8666e-20 k 98 pivot 1.22101e-21
k 83 pivot -2.96595e-14 k 99 pivot -7.12407e-22
k 84 pivot 2.46156e-14 k 100 pivot -1.75579e-21
k 85 pivot 2.40541e-14 k 101 pivot 3.13343e-21
k 86 pivot -4.99053e-14 k 102 pivot -6.99946e-22
k 87 pivot 1.7579e-14 k 103 pivot 3.82048e-22
k 88 pivot 1.69295e-14 k 104 pivot 8.05538e-22
k 89 pivot -1.56396e-14 k 105 pivot -1.18164e-21
k 90 pivot 1.37869e-14 k 106 pivot -6.349e-22
k 91 pivot -3.10221e-14 k 107 pivot -2.48245e-21
k 92 pivot 2.35206e-14 k 108 pivot -8.89452e-22
k 93 pivot 1.32175e-14 k 109 pivot -8.23235e-22
k 94 pivot -7.77593e-15 k 110 pivot 4.40549e-21
k 95 pivot 1.34815e-14 k 111 pivot 1.12387e-21
k 96 pivot -1.02589e-21 k 112 pivot -4.78853e-22
k 97 pivot 4.27131e-22 k 113 pivot 4.38739e-22
k 114 pivot 7.3868e-28
t = 0
s = 1
for i = 1:n
    if |xi| > t
        s = 1 + s*(t/xi)^2
        t = |xi|
    else
        s = s + (xi/t)^2
    end
end
function p = pythag(a, b)
p = max(abs(a), abs(b));
q = min(abs(a), abs(b));
while 1
    r = (q/p)^2;
    if r+4 == 4, return, end
    s = r/(4+r);
    p = p + 2*s*p;
    q = s*q;
    fprintf('p = %19.15e, q = %19.15e\n', p, q)
end
The algorithm is immune to underflow and overflow (unless the result overflows), is very accurate, and converges in at most three iterations, assuming the unit roundoff u > 10^{-20}.
Example:
p = pythag(3,4); (p-5)/5
p = 4.986301369863014e+000, q = 3.698630136986301e-001
p = 4.999999974188253e+000, q = 5.080526329415358e-004
p = 5.000000000000001e+000, q = 1.311372652397091e-012
ans =
1.7764e-016
The purpose of this problem is to show that pythag is Halley’s method
applied to a certain equation. Halley’s method for solving f(x) = 0 is the
iteration
where fn, f′n, and f″n denote the values of f and its first two derivatives at xn.
(a) For given x0 and y0 such that 0 < y0 < x0, the Pythagorean sum p = (x0^2 + y0^2)^{1/2} is a root of the equation f(x) = x^2 - p^2 = 0. Show that Halley's method applied to this equation gives
Show that if 0 < x0 < p then xn < xn+1 < p for all n. Deduce that yn := (p^2 - xn^2)^{1/2} is defined and that
for k = 0, 1, 2, . . .
    Rk = (QkPk^{-1})^2
    Sk = Rk(4I + Rk)^{-1}
    Pk+1 = Pk + 2SkPk
    Qk+1 = SkQk
end
Chapter 26
A Gallery of Test Matrices

24. From interviews by Albers in [9, 1991].
Ever since the first computer programs for matrix computations were written
in the 1940s, researchers have been devising matrices suitable for test purposes
and investigating the properties of these matrices. In the 1950s and 1960s it
was common for a whole paper to be devoted to a particular test matrix:
typically its inverse or eigenvalues would be obtained in closed form.
Early collections of test matrices include those of Newman and Todd [796,
1958] and Rutishauser [886, 1968]; most of Rutishauser’s matrices come from
continued fractions or moment problems. Two well-known books present col-
lections of test matrices. Gregory and Karney [482, 1969] deal exclusively with
the topic, while Westlake [1076, 1968] gives an appendix of test matrices.
In this chapter we present a gallery of matrices. We describe their prop-
erties and explain why they are useful (or not so useful, as the case may be)
for test purposes. The coverage is limited. A comprehensive, up-to-date, and
well-documented collection of parametrized test matrices may be found in
the Test Matrix Toolbox, described in Appendix E. Of course, MATLAB itself
contains a number of special matrices that can be used for test purposes (type
help specmat).
Several other types of matrices would have been included in this chapter
had they not been discussed elsewhere in the book. These include magic
squares (Problem 6.4), the Kahan matrix (8.10), Hadamard matrices (§9.3),
and Vandermonde matrices (Chapter 21).
The matrices described here can be modified in various ways while still
preserving some or all of their interesting properties. Among the many ways
of constructing new test matrices from old are
• Powers A → A^k.
Despite its past popularity and notoriety, the Hilbert matrix is not a good
test matrix. It is too special. Not only is it symmetric positive definite, but
it is totally positive. This means, for example, that Gaussian elimination
without pivoting is guaranteed to produce a small componentwise relative
backward error (as is Cholesky factorization). Thus the Hilbert matrix is not
a typical ill-conditioned matrix.
The (i, j) element of the inverse of the Hilbert matrix Hn is the integer
(26.1)
and
(26.2)
There are many ways to rewrite the formula (26.1). These formulae are best
obtained as special cases of those for the Cauchy matrix below.
The Cholesky factor Rn of the inverse of the Hilbert matrix is known explicitly, as is Rn^{-1}:
(26.3)
(26.4)
n     κ2(Hn)      κ2(Pn)
2     1.9281e1    6.8541e0
3     5.2406e2    6.1984e1
4     1.5514e4    6.9194e2
5     4.7661e5    8.5175e3
6     1.4951e7    1.1079e5
7     4.7537e8    1.4934e6
8     1.5258e10   2.0645e7
9     4.9315e11   2.9078e8
10    1.6026e13   4.1552e9
11    5.2307e14   6.0064e10
12    1.7132e16   8.7639e11
13    5.6279e17   1.2888e13
14    1.8534e19   1.9076e14
15    6.1166e20   2.8396e15
16    2.0223e22   4.2476e16
where |∆H| ≤ uHn, and (Hn + ∆H)^{-1} can differ greatly from Hn^{-1}, in view of the ill conditioning. A possible way round this difficulty is to start with the integer matrix Hn^{-1}, but its entries are so large that they are exactly representable in IEEE double precision arithmetic only for n less than 13.
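A quick MATLAB illustration of this difficulty, using the built-in hilb and invhilb functions:

n = 12;
relerr = norm(inv(hilb(n)) - invhilb(n), inf)/norm(invhilb(n), inf)
% relerr is large, of the order of kappa_2(H_12)*u, because hilb(n) is H_n
% with its entries rounded and the inversion itself incurs further errors.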
The Hilbert matrix is a special case of a Cauchy matrix Cn whose elements are cij = 1/(xi + yj), where x, y are given n-vectors (take xi = yi = i - 0.5 for the Hilbert matrix). The following formulae give the inverse and determinant of Cn, and therefore generalize those for the Hilbert matrix. The (i, j) element of Cn^{-1} is
and
the latter formula having been published by Cauchy in 1841 [189, 1841,
pp. 151–159]. The LDU factors of Cn have been found explicitly by Gohberg
It is known that Cn is totally positive if 0 < x1 < · · · < xn and 0 < y1 < · · · < yn (as is true for the Hilbert matrix) [998, 1962, p. 295]. Interestingly, the sum of all the elements of Cn^{-1} is Σi xi + Σj yj [667, 1973, Ex. 44, §1.2.3].
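A minimal MATLAB sketch constructing a Cauchy matrix and recovering the Hilbert matrix as a special case:

n = 5;
x = (1:n)' - 0.5;  y = x;
C = 1./(x*ones(1, n) + ones(n, 1)*y');   % c_ij = 1/(x_i + y_j)
norm(C - hilb(n), inf)                   % zero (or at worst of order eps)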
practice, never produce them for a random matrix (see Chapter 14). The role
of random matrices here is to indicate the average quality of the estimates.
Edelman [340, 1993] summarizes the properties of random matrices well
when he says that
Various results are known about the behaviour of matrices with elements from the normal N(0, 1) distribution. Matrices of this type are generated by MATLAB's randn function. Let An denote an n x n matrix from this distribution and let E(·) be the expectation operator. Then, in the appropriate probabilistic sense, the following results hold as n → ∞:
For details of (26.5)-(26.8) see Edelman [335, 1988]. Edelman conjectures that the condition number results are true for any distribution with mean 0, in particular, the uniform [-1, 1] distribution used by MATLAB's rand function.
The results (26.5) and (26.6) show that random matrices from the normal
N(0, 1) distribution tend to be very well conditioned.
The spectral radius result (26.9) has been proved as an inequality by Ge-
man [432, 1986] for independent identically distributed random variables aij
with zero mean and unit variance, and computer experiments suggest the
approximate equality for the standard normal distribution [432, 1986].
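A quick MATLAB experiment related to the spectral radius result, assuming (as the surrounding discussion suggests) that (26.9) states that ρ(An) is approximately n^{1/2}:

n = 200;
A = randn(n);
rho_over_sqrtn = max(abs(eig(A)))/sqrt(n)   % close to 1 for large n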
A question of interest in eigenvalue applications is how many eigenvalues of a random real matrix are real. The answer has been given by Edelman, Kostlan, and Shub [344, 1994]: denoting by En the expected number of real eigenvalues of an n x n matrix from the normal N(0, 1) distribution, En ~ (2n/π)^{1/2} as n → ∞.
>> P = pascal(6)
P =
1 1 1 1 1 1
1 2 3 4 5 6
1 3 6 10 15 21
1 4 10 20 35 56
1 5 15 35 70 126
1 6 21 56 126 252
The earliest references to the Pascal matrix appear to be in 1958 by Newman
and Todd [796, 1958] and by Rutishauser [887, 1958] (see also Newman [795, 1962, pp. 240–241]); Newman and Todd say that the matrix was introduced
to them by Rutishauser. The matrix was independently suggested as a test
matrix by Caffney [177, 1963].
Rutishauser [886, 1968, §8] notes that Pn belongs to the class of moment
matrices M whose elements are contour integrals
26. In the MATLAB displays below we use the pascal function from the Test Matrix Toolbox. This differs from the pascal function supplied with MATLAB 4.2 only in that pascal(n, 2) is rearranged.
>> R = chol(P)
R=
1 1 1 1 1 1
0 1 2 3 4 5
0 0 1 3 6 10
0 0 0 1 4 10
0 0 0 0 1 5
0 0 0 0 0 1
The scaled and transposed Cholesky factor S = R^T diag(1, -1, 1, -1, . . . , (-1)^{n+1}) is returned by pascal(n, 1):
>> S = pascal(6, 1)
S =
1 0 0 0 0 0
1 -1 0 0 0 0
1 -2 1 0 0 0
1 -3 3 -1 0 0
1 -4 6 -4 1 0
1 -5 10 -10 5 -1
so P and P^{-1} are similar and hence have the same eigenvalues. In other words, the eigenvalues appear in reciprocal pairs. In fact, the characteristic polynomial πn has a palindromic coefficient vector, which implies the reciprocal eigenvalues property, since πn(λ) = λ^n πn(1/λ). This is illustrated as follows (making use of MATLAB's Symbolic Math Toolbox):
>> charpoly(P)
ans =
1-351*x+6084*x^2-13869*x^3+6084*x^4-351*x^5+x^6
>> eig(P)
ans =
0.0030
0.0643
0.4893
2.0436
15.5535
332.8463
Since P is symmetric, its eigenvalues are its singular values and so we also have that ||P||2 = ||P^{-1}||2 and ||P||F = ||P^{-1}||F. Now
where for the last equality we used a binomial coefficient summation identity from [477, 1989, p. 161]. Hence, using Stirling's approximation n! ~ (2πn)^{1/2}(n/e)^n,
The Pascal matrix can be made singular simply by subtracting 1 from the (n, n) element. To see this, note that
This perturbation, ∆P = -enen^T, is far from being the smallest one that makes P singular, which is ∆Popt = -λnvnvn^T, where λn is the smallest eigenvalue of P and vn is the corresponding unit eigenvector, for ||∆Popt||2 = λn is of order (n!)^2/(2n)!, as we saw above.
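A quick MATLAB check of the statement above, using the built-in pascal function:

P = pascal(6);
P(6, 6) = P(6, 6) - 1;     % subtract 1 from the (n,n) element
min(svd(P))                % zero, up to roundoff: the perturbed matrix is singular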
A more subtle property of the Pascal matrix is that it is totally positive. Karlin [644, 1968, p. 137] shows that the matrix with elements (i + j)!/(i! j!) (i, j = 0, 1, . . .) is totally positive; the Pascal matrix is a submatrix of this one and hence is also totally positive. From the total positivity it follows that the Pascal matrix has distinct eigenvalues, which (as we already know from the positive definiteness) are real and positive, and that its ith eigenvector (corresponding to the ith largest eigenvalue) has exactly i - 1 sign changes [414, 1959, Thm. 13, p. 105].
T = pascal(n, 2) is obtained by rotating S clockwise through 90 degrees and multiplying by -1 if n is even:
>> T = pascal(6, 2)
T=
-1 -1 -1 -1 -1 -1
5 4 3 2 1 0
-10 -6 -3 -1 0 0
10 4 1 0 0 0
-5 -1 0 0 0 0
1 0 0 0 0 0
It has the surprising property that it is a cube root of the identity, a property
noted by Turnbull [1028, 1929, p. 332]:
>> T*T
ans =
 0  0  0  0  0  1
 0  0  0  0 -1 -5
 0  0  0  1  4 10
 0  0 -1 -3 -6 -10
 0  1  2  3  4  5
-1 -1 -1 -1 -1 -1
>> T*T*T
ans =
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
Such matrices arise, for example, when discretizing partial differential equa-
tions or boundary value problems for ordinary differential equations. The
eigenvalues are known explicitly [884, 1947], [885, 1952], [1006, 1977, pp. 155–156]:

    d + 2(ce)^{1/2} cos(kπ/(n + 1)),    k = 1:n.
The eigenvalues are also known for certain variations of the symmetric matrix
T n ( c, d, c) in which the (1,1) and (n, n) elements are modified; see Gregory
and Karney [482, 1969].
The matrix T_n(−1, 2, −1) is minus the well-known second difference ma-
trix, which arises in applying central differences to a second derivative oper-
ator. Its inverse has (i, j) element −i(n − j + 1)/(n + 1) (cf. Theorem 14.9).
The condition number satisfies κ_2(T_n) ~ (4/π^2)n^2.
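The eigenvalue formula is easily verified numerically; here is a sketch for a symmetric
example, built with plain MATLAB rather than a Toolbox routine:

n = 8;  c = -1;  d = 2;  e = -1;
T = diag(d*ones(n,1)) + diag(c*ones(n-1,1),-1) + diag(e*ones(n-1,1),1);
k = (1:n)';
formula = sort(d + 2*sqrt(c*e)*cos(k*pi/(n+1)));
disp(max(abs(sort(eig(T)) - formula)))   % of the order of the rounding error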
One interesting property of T_n(c, d, e) is that the diagonals of its LU fac-
torization converge as n → ∞ when T_n is symmetric and diagonally dominant,
and this allows some savings in the computation of the LU factorization, as
shown by Malcolm and Palmer [725, 1974]. Similar properties hold for cyclic
reduction; see Bondeli and Gander [134, 1994].
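As an illustration for T_n(−1, 2, −1), the recurrence for the diagonal of U in the LU
factorization of a tridiagonal matrix (u_1 = d, u_k = d − ce/u_{k−1}) can be iterated
directly; a MATLAB sketch:

c = -1;  d = 2;  e = -1;  n = 40;
u = zeros(n,1);  u(1) = d;
for k = 2:n
    u(k) = d - c*e/u(k-1);     % diagonal of U in the LU factorization of T_n(c,d,e)
end
disp(u([5 10 20 40])')          % converges (here to 1, slowly, since T_n is only
                                % weakly diagonally dominant)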
of A is the matrix
The Test Matrix Toolbox function compan computes C, via the call C =
compan(A). It is easy to check that C has the same characteristic polyno-
mial as A, and that if λ is an eigenvalue of C then [λ^{n−1}, . . . , λ, 1]^T is
a corresponding eigenvector. Since C − λI has rank at least n − 1 for any
λ, C is nonderogatory, that is, in the Jordan form no eigenvalue appears in
more than one Jordan block. It follows that A is similar to C only if A is
nonderogatory.
There are no explicit formulae for the eigenvalues of C, but, perhaps sur-
prisingly, the singular values have simple representations, as found by Kenney
and Laub [651, 1988] (see also Kittaneh [660, 1995]):
where
The compan function is a useful means for generating new test matrices
from old. For any A , compan(A) is a nonnormal matrix with the same
eigenvalues as A (to be precise, compan(A) is normal if and only if a0 = 1 and
ai = 0 for i > 0).
Companion matrices tend to have interesting pseudospectra, as illustrated
in Figure 26.2. For more information on the pseudospectra of companion
matrices see Toh and Trefethen [1007, 1994].
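As a quick illustration of using a companion matrix as a nonnormal test matrix with
prescribed eigenvalues, here is a sketch that uses MATLAB's built-in compan, which takes
a coefficient vector (the Toolbox version mentioned above takes a matrix argument):

A = randn(5);  A = A + A';     % symmetric, so the eigenvalues are real
p = poly(A);                   % characteristic polynomial coefficients
C = compan(p);                 % nonnormal matrix with the same characteristic polynomial
disp(norm(sort(eig(C)) - sort(eig(A))))  % small for modest n; passing through the
                                         % characteristic polynomial is risky for large n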
Hilbert matrix; two of particular interest are by Todd [1004, 1954], [1005,
1961].
Other references on the eigenvalues and condition numbers of random
matrices include Edelman [336, 1991], [339, 1992] and Kostlan [671, 1992].
Anderson, Olkin, and Underhill [20, 1987] suggest another way to con-
struct random orthogonal matrices from the Haar distribution, based on prod-
ucts of random Givens rotations. Marsaglia and Olkin [728, 1984] discuss the
generation of random correlation matrices (symmetric positive semidefinite
matrices with ones on the diagonal).
The involutory triangular matrix pascal(n, 1) arises in the step-size chang-
ing mechanism in an ordinary differential equation code based on backward
differentiation formulae; see Shampine and Reichelt [912, 1995].
We mention some other collections of test matrices. The Harwell-Boeing
collection of sparse matrices, largely drawn from practical problems, is pre-
sented by Duff, Grimes, and Lewis [327, 1989], [328, 1992]. Bai [36, 1994] is
building a collection of test matrices for large-scale nonsymmetric eigenvalue
problems. Zielke [1128, 1986] gives various parametrized rectangular matrices
of fixed dimension with known generalized inverses. Demmel and McKen-
ney [299, 1989] present a suite of Fortran 77 codes for generating random
square and rectangular matrices with prescribed singular values, eigenvalues,
band structure, and other properties. This suite was the inspiration for the
randsvd routine in the Test Matrix Toolbox and is part of the testing code
for LAPACK (see below).
26.7.1. LAPACK
The LAPACK distribution contains a suite of routines for generating test ma-
trices, located in the directory LAPACK/TESTING/MATGEN (in Unix notation).
These routines (whose names are of the form xLAxxx) are used for testing
when LAPACK is installed and are not described in the LAPACK Users’
Guide [17, 1995].
Problems
[Figure: pseudospectra of pentoep(32,0,1/2,0,0,1), inv(pentoep(32,0,1,1,0,.25)), pentoep(32,0,1/2,1,1,1), and pentoep(32,0,1,0,0,1/4).]
Appendix A
Solutions to Problems
Hence if < 0.01, say, then there is no difference between Erel and for
practical purposes.
1.2. Nothing can be concluded about the last digit before the decimal point. Eval-
uating y to higher precision yields
t y
35 262537412640768743.99999999999925007
40 262537412640768743.9999999999992500725972
This shows that the required digit is, in fact, 3. The interesting fact that y is so close
to an integer was pointed out by Lehmer [697, 1943], who explains its connection
with number theory. It is known that y is irrational [959, 1991].
1.3.
1.
2. 2sin
3. ( x – y)(x + y). Cancellation has not been avoided, but it is now harmless if x
and y are known exactly (see also Problem 3.9).
4. sin x/(1 + cos x).
5. c = ((a − b)^2 + ab(2 sin(θ/2))^2)^{1/2}.
1.4. a + ib = (x + iy)^2 = x^2 − y^2 + 2ixy, so b = 2xy and a = x^2 − y^2, giving
x^2 − b^2/(4x^2) = a, or 4x^4 − 4ax^2 − b^2 = 0. Hence
x^2 = (a ± (a^2 + b^2)^{1/2})/2.
If a > 0 we use this formula with the plus sign, since x^2 > 0. If a < 0 this formula
is potentially unstable, so we use the rewritten form
x^2 = b^2/(2((a^2 + b^2)^{1/2} − a)).
529
In either case we get two values for x and recover y from y = b /(2x ). Note that
there are other issues to consider here, such as scaling to avoid overflow.
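A MATLAB sketch of the resulting algorithm (the function name is illustrative, scaling
against overflow and underflow is ignored, and the pure-imaginary case is handled crudely):

function [x, y] = sqrt_complex(a, b)
% One square root x + i*y of a + i*b, using the stable formula for each sign of a.
r = sqrt(a^2 + b^2);           % a scaled computation would avoid overflow here
if a >= 0
    x = sqrt((a + r)/2);
else
    x = sqrt(b^2/(2*(r - a)));
end
if x ~= 0
    y = b/(2*x);
else
    y = sqrt(max(-a, 0));      % b = 0 and a <= 0: the root is i*sqrt(-a)
end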
1.5. We need a way to compute f(x) = log(1 + x) accurately for all x > 0. A
straightforward evaluation of log(1 + x) is not sufficient, since the addition 1 + x
loses significant digits of x when x is small. The following method is effective (for
another approach, see Hull, Fairgrieve, and Tang [592, 1994]): calculate w = 1 + x
and then compute
The explanation of why this method works is similar to that for the method in
§1.14.1. We have ŵ = (1 + x)(1 + δ), |δ| ≤ u, and if ŵ = 1 then |x| ≤ u + u^2 + u^3 + · · ·,
so from the Taylor series f(x) = x(1 − x/2 + x^2/3 − · · ·) we see that f(x) = x is a
correctly rounded result. If ŵ ≠ 1 then
Defining ŵ =: 1 + z,
so if ŵ ≠ 1 (thus x ≠ 0) then
Thus, with 1 + θ :=
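In code the method reads as follows (a sketch; the correction formula f = x·log(w)/(w − 1)
for ŵ ≠ 1 is the standard one and is assumed here, and the function name is illustrative):

function f = log1p_sketch(x)
% Accurate log(1+x) for x > -1, in the style described above.
w = 1 + x;
if w == 1
    f = x;                     % |x| below the unit roundoff: log(1+x) = x to working accuracy
else
    f = log(w)*x/(w - 1);      % the rounding errors committed in forming 1+x largely cancel
end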
where a, b, and c are arbitrary constants. The particular starting values chosen
yield a = 0, b = c = 1, so that
for a constant η of order the unit roundoff. Hence the computed iterates rapidly
converge to 100. Note that resorting to higher precision merely delays the inevitable
convergence to 100. Priest [844, 1992, pp. 54–56] gives some interesting observations
on the stability of the evaluation of the recurrence.
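A sketch of this behaviour, under the assumption that the recurrence in question is the
classic example x_{k+1} = 111 − 1130/x_k + 3000/(x_k x_{k−1}) with x_0 = 11/2, x_1 = 61/11
(whose exact limit is 6):

x = zeros(30,1);
x(1) = 11/2;  x(2) = 61/11;
for k = 2:29
    x(k+1) = 111 - 1130/x(k) + 3000/(x(k)*x(k-1));
end
disp(x([5 10 15 20 25 30])')   % in double precision the iterates drift away from 6
                               % and settle at 100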
1.9. Writing
C:= adj(A) =
we have
so
Also, r = b − Ax̂ = −(1/d)A∆Cb, so |r| ≤ γ_3|A||A^{-1}||b|, which implies the normwise
residual bound. Note that the residual bound shows that the normwise backward error will
be small if x is a large-normed solution.
1.10. Let For any standard summation method we have (see §4.2)
Hence, defining
that is,
where
2.1. There are 1 + 2(e_max − e_min + 1)(β − 1)β^{t−1} normalized numbers (the "1" is for
zero), and 2(β^{t−1} − 1) subnormal numbers. For IEEE arithmetic we therefore have

                         Normalized      Subnormal
    single precision     4.3 × 10^9      1.7 × 10^7
    double precision     1.8 × 10^19     9.0 × 10^15
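These counts can be reproduced directly from the formulas; a MATLAB sketch (the
exponent ranges e_min = −125, e_max = 128 for single and e_min = −1021, e_max = 1024
for double are the standard IEEE values in the convention used here):

beta = 2;
t    = [24 53];                          % precision: single, double
emin = [-125 -1021];  emax = [128 1024];
normalized = 1 + 2*(emax - emin + 1)*(beta-1).*beta.^(t-1);
subnormal  = 2*(beta.^(t-1) - 1);
disp([normalized; subnormal])            % columns: single, double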
2.2. Without loss of generality suppose x > 0. We can write x = m × β^{e−t}, where
β^{t−1} ≤ m < β^t. The next larger floating point number is x + ∆x, where ∆x = β^{e−t},
and
The same upper bound clearly holds for the “next smaller” case, and the lower
bound in this case is also easy to see.
2.3. The answer is the same for all adjacent nonzero pairs of single precision num-
bers. Suppose the numbers are 1 and 1 + ε(single) = 1 + 2^{−23}. The spacing of the
double precision numbers on [1, 2] is 2^{−52}, so the answer is 2^{29} − 1 ≈ 5.4 × 10^8.
2.4. Inspecting the proof of Theorem 2.2 we see that |y_i| ≥ β^{e−1}, i = 1, 2, and so we
also have |fl(x) − x|/|fl(x)| ≤ u, that is, fl(x) = x/(1 + δ), |δ| ≤ u. Note that this
is not the same δ as in Theorem 2.2, although it has the same bound, and unlike in
Theorem 2.2 there can be equality in this bound for δ.
and so
2.6. Since the double precision mantissa contains 53 bits, p = 2^53 = 9007199254740992
≈ 9.01 × 10^15. For single precision, p = 2^24 = 16777216 ≈ 1.68 × 10^7.
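A quick check of the double precision value in MATLAB (a sketch):

p = 2^53;
disp(p - (p-1))        % 1: the integers up to p are exactly representable
disp((p+1) - p)        % 0: p+1 is not representable (it rounds back to p)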
2.7.
1. True, since in IEEE arithmetic fl( a op b) is defined to be the rounded value
of a op b, round(a op b), which is the same as round(b op a).
2. True for round to nearest (the default) and round to zero, but false for round
to +∞ and round to −∞.
3. True, because fl(a + a) := round(a + a) = round(2 * a) =: fl(2 * a).
4. True: similar to 3.
5. False, in general.
6. True for binary arithmetic. Since the division by 2 is exact, the inequality is
equivalent to 2a < fl(a + b) < 2b. But a + b < 2b, so, by the monotonicity
of rounding, fl(a + b) = round(a + b) < round(2b) = 2b. The lower bound is
verified similarly.
2.9.
(binary).
Rounded directly to 53 bits this yields 1 − 2^{−53}. But rounded first to 64 bits it yields
and when this number is rounded to 53 bits using the round to even rule it yields
1.0.
2.12. The spacing between the floating point numbers in the interval (1/2,1] is
(cf. Lemma 2.1), so |1/x – fl(1/x)| < which implies that |1 – xfl(1/x)| <
Thus
Since the spacing of the floating point numbers just to the right of 1 is xfl(1/x)
must round to either 1 – or 1.
2.13. If there is no double rounding the answer is 257,736,490. For a proof that
combines mathematical analysis with computer searching, see Edelman [343, 1994].
2.15. The IEEE standard does not define the results of exponentiation. The choice
0^0 = 1 can be justified in several ways. For example, if p(x) = Σ_{i=0}^n a_i x^i then
p(0) = a_0 = a_0 × 0^0, and the binomial expansion (x + y)^n = Σ_{k=0}^n C(n, k)x^k y^{n−k}
yields 1 = 0^0 for x = 0, y = 1. For more detailed discussions see Goldberg [457,
1991, p. 32] and Knuth [669, 1992, pp. 406–408].
2.17. For IEEE arithmetic the answer is no. Since fl(x op y) is defined to be the
rounded version of x op y, b^2 − ac ≥ 0 implies fl(b^2) − fl(ac) ≥ 0 (rounding is a
monotonic operation). The final computed answer is
fl(fl(b^2) − fl(ac)) = (fl(b^2) − fl(ac))(1 + δ) ≥ 0, |δ| ≤ u.
2.19. The function √ maps the set of positive floating point numbers onto a
set of floating point numbers with about half as many elements. Hence there exist
two distinct floating point numbers x having the same value of fl(√x), and so the
condition fl(fl(√x)^2) = |x| cannot always be satisfied in floating point arithmetic. The
requirement fl(√(fl(x^2))) = |x| is reasonable for base 2, however, and is satisfied in IEEE
arithmetic, as we now show.
Without loss of generality, we can assume that 1 ≤ x < 2, since scaling x by a
power of 2 does not alter the result. By definition, fl(√(fl(x^2))) is the nearest floating
point number to √(fl(x^2)), and
√(fl(x^2)) = (x^2(1 + δ))^{1/2} = x(1 + δ)^{1/2} =: x(1 + θ), |δ| ≤ u.
Now the spacing of the floating point numbers between 1 and 2 is 2^{1−t} = 2u, so
Hence |θ| ≤ u if u ≤ 1/4 (say), and then |x| is the nearest floating point number to
√(fl(x^2)), so that fl(√(fl(x^2))) = |x|.
In base-10 floating point arithmetic, the condition can be violated.
For example, working to 5 significant decimal digits, if x = 3.1625 then fl(x^2) =
fl(10.00140625) = 10.001, and fl(√10.001) = fl(3.16243577 . . .) = 3.1624 < x.
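The base-2 result is easy to test empirically (a sketch in IEEE double precision; it is
of course not a proof):

x = randn(1e5,1);
disp(all(sqrt(x.^2) == abs(x)))   % 1 in IEEE base-2 arithmetic, barring over/underflow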
2.20. On a Cray Y-MP the answer is yes, but in base-2 IEEE arithmetic the answer
is no. It suffices to demonstrate that fl(x/√(fl(x^2))) = sign(x), which is shown
by the proof of Problem 2.19.
2.21. The test "x > y" returns false if x or y is a NaN, so the code computes
max(NaN, 5) = 5 and max(5, NaN) = NaN, which violates the obvious requirement
that max(x, y) = max(y, x). Since the test x ≠ x identifies a NaN, the following
code implements a reasonable definition of max(x, y):
% max(x, y)
if x ≠ x then
   max = y
else if y ≠ y then
   max = x
else if y > x then
max = y
else
max = x
end
end
end
A further refinement is to ensure that max(−0, +0) = +0, which is not satisfied by
the code above since −0 and +0 compare as equal; this requires bit-level programming.
2.22. We give an informal proof; the details are obtained by using the model
f l(x op y) = (x op y)(1 + δ), but they obscure the argument.
Since a, b, and c are nonnegative, a + (b + c) is computed accurately. Since
c < b < a, c + (a – b) and a + (b – c) are the sums of two positive numbers and
so are computed accurately. Since a, b, and c are the lengths of sides of a triangle,
a < b + c; hence, using c < b < a,
2.23. For a machine with a guard digit, y = x, by Theorem 2.5 (assuming 2x does
not overflow). If the machine lacks a guard digit then the subtraction produces x
if the last bit of x is zero, otherwise it produces an adjacent floating point number
with a zero last bit; in either case the result has a zero last bit. Gu, Demmel, and
Dhillon [484, 1994] apply this bit zeroing technique to numbers d_1, d_2, . . . , d_n arising
in a divide and conquer bidiagonal SVD algorithm, their motivation being that the
differences d_i − d_j can then be computed accurately even on machines without a
guard digit.
2.24. The function f(x) = 3x − 1 has the single root x = 1/3. We can have
fl(f(x)) = 0 only for x ≈ 1/3. For x ≈ 1/3 the evaluation can be summarized as
follows:
The first, second, and fourth subtractions are done exactly, by Theorem 2.5. The
result of the first subtraction has a zero least-significant bit and the result of the
second has two zero least-significant bits; consequently the third subtraction suffers
no loss of trailing bits and is done exactly. Therefore f(x) is computed exactly
for x = fl(x) near 1/3. But fl(x) can never equal 1/3 on a binary machine, so
fl(f(x)) ≠ 0 for all x.
2.25. We have
which implies high relative accuracy unless u|bc| >> |x|. For comparison, the bound
for standard evaluation of fl(ad − bc) is |x − x̂| ≤ γ_2(|ad| + |bc|).
2.27. We would like (2.8) to be satisfied, that is, we want = fl(x/y) to satisfy
(A.1)
3.2. This result can be proved by a modification of the proof of Lemma 3.1. But it
follows immediately from the penultimate line of the proof of Lemma 3.4.
3.3. The computed iterates satisfy
This gives
The running error bound µ can therefore be computed along with the continued
fraction as follows:
q_{n+1} = a_{n+1}
f_{n+1} = 0
for k = n: −1: 0
    r_k = b_k/q_{k+1}
    q_k = a_k + r_k
    f_k = |q_k| + |r_k| + |b_k| f_{k+1}/((|q_{k+1}| − u f_{k+1})|q_{k+1}|)
end
µ = u f_0
The error bound is valid provided that |q_{k+1}| − u f_{k+1} > 0 for all k. Otherwise a
more sophisticated approach is necessary (for example, to handle the case where
q_{k+1} = 0, q_k = ∞, and q_{k−1} is finite).
3.4. We prove just the division result and the last result. Let
Then α =
if j < k.
For the last result,
3.6. We have
which implies
It is easy to show that |α_i| ≤ γ_{n+2}, so the only change required in (3.4) is to replace
γ_n by γ_{n+2}. The complex analogue of (3.10) is ŷ = (A + ∆A)x, |∆A| ≤ γ_{n+2}|A|.
3.8. Without loss of generality we can suppose that the columns of the product are
computed one at a time. With xj = A1 . . . Akej we have, using (3.10),
so
We have
It follows that
which shows that the computed value differs negligibly from y_{m+1}. For the repeated squarings,
however, we find that
where we have used Lemma 3.1. Hence the squarings introduce a relative error that
can be approximately as large as 2^m u. Since u = 2^{−53} this relative error is of order
0.1 for m = 50, which explains the observed results for m = 50.
For m = 75, the behaviour on the Sun is analogous to that on the HP calculator
described in §1.12.2. On the 486DX, however, numbers less than 1 are mapped to
1. The difference is due to the fact that the 486DX uses double rounding and the
Sun does not; see Problem 2.9.
3.12. The analysis is just a slight extension of that for an inner product. The
analogue of (3.3) is
Hence
(A.2)
C(x) = max
which agrees with the actual error to within a factor 3; thus the smaller upper
bound of (4.3) is also correct to within this factor. The example just quoted is, of
course, a very special one, and as Wilkinson [1088, 1963, p. 20] explains, “in order
to approach the upper bound as closely as this, not only must each error take its
maximum value, but all the terms must be almost equal.”
4.3. With S_k = we have
which yields the required expression for E_n. The bound on |E_n| is immediate.
The bound is minimized if the x_i are in increasing order of absolute value. This
observation is common in the literature and it is sometimes used to conclude that
the increasing ordering is the best one to use. This reasoning is fallacious, because
minimizing an error bound is not the same as minimizing the error itself. As (4.3)
shows, if we know nothing about the signs of the rounding errors then the "best"
ordering to choose is one that minimizes the partial sums.
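A tiny illustrative MATLAB example of how the size of the partial sums, rather than
the ordering per se, governs the error:

x = [1 1e16 -1e16];
s1 = (x(1) + x(2)) + x(3)      % partial sum 1e16 is huge: result 0
s2 = (x(2) + x(3)) + x(1)      % partial sums stay small: result 1 (exact)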
4.4. Any integer between 0 and 10 inclusive can be reproduced. For example, fl(1 +
2 + 3 + 4 + M − M) = 0, fl(M − M + 1 + 2 + 3 + 4) = 10, and fl(2 + 3 + M − M + 1 + 4) = 5.
4.5. This method is sometimes promoted on the basis of the argument that it
minimizes the amount of cancellation in the computation of S_n. This is incorrect:
the "±" method does not reduce the amount of cancellation—it simply concentrates
all the cancellation into one step. Moreover, cancellation is not a bad thing per se,
as explained in §1.7.
The "±" method is an instance of Algorithm 4.1 (assuming that S_+ and S_−
are computed using Algorithm 4.1) and it is easy to see that it maximizes max_i |T_i|
over all methods of this form (where, as in §4.2, T_i is the sum computed at the ith
stage). Moreover, when there is heavy cancellation the value of max_i |T_i| tends to be
much larger for the "±" method than for the other methods we have considered.
4.6. The main concern is to evaluate the denominator accurately when the x_i are
close to convergence. The bound (4.3) tells us to minimize the partial sums; these
are, approximately, for x_i ≈ ξ, (a) −ξ, 0, (b) 0, 0, (c) 2ξ, 0. Hence the error analysis
of summation suggests that (b) is the best expression, with (a) a distant second.
That (b) is the best choice is confirmed by Theorem 2.5, which shows there will be
only one rounding error when the x_i are close to convergence. A further reason to
prefer (b) is that it is less prone to overflow than (a) and (c).
4.7. This is, of course, not a practical method, not least because it is very prone
to overflow and underflow. However, its error analysis is interesting. Ignoring the
error in the log evaluation, and assuming that exp is evaluated with relative error
bounded by u, we have, with |δ| < u for all i, and for some δ2 n
The error bound for (b) is about a factor i smaller than that for (a). Note that
method (c) is the only one guaranteed to yield = b (assuming fl(n/n) = 1, as
holds in IEEE arithmetic), which may be important when integrating a differential
equation to a given end-point.
If a > 0 then the bounds imply
Thus (b) and (c) provide high relative accuracy for all i, while the relative accuracy
of (a) can be expected to degrade as i increases.
5.1. By differentiating the Horner recurrence q_i = xq_{i+1} + a_i, q_n = a_n, we obtain
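For illustration, a MATLAB sketch of Horner's method extended to return the derivative
as well (the standard construction obtained by differentiating the recurrence; the coefficient
vector follows MATLAB's descending-power convention, and the names are illustrative):

function [p, dp] = horner_with_derivative(a, x)
% Evaluate p(x) = a(1)*x^n + ... + a(n)*x + a(n+1) and p'(x) by Horner's method.
n  = length(a) - 1;
p  = a(1);
dp = 0;
for i = 2:n+1
    dp = x*dp + p;             % derivative recurrence from differentiating Horner
    p  = x*p + a(i);
end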
5.3. Accounting for the error in forming y, we have, using the relative error counter
notation (3.9),
Thus the relative backward perturbations are bounded by (3n/2 + 1)u instead of
2nu for Horner's rule.
n = max(size(a));
perm = (1:n)';
% a(1) = max(abs(a)).
[t, i] = max(abs(a));
if i ~= 1
   a([1 i]) = a([i 1]);
   perm([1 i]) = perm([i 1]);
end
p = ones(n,1);
for k = 2:n-1
   for i = k:n
       p(i) = p(i)*(a(i)-a(k-1));
   end
   [t, i] = max(abs(p(k:n)));
   i = i+k-1;
   if i ~= k
      a([k i]) = a([i k]);
      p([k i]) = p([i k]);
      perm([k i]) = perm([i k]);
   end
end
5.5. It is easy to show that the computed p̂ satisfies p̂ = p(x)(1 + θ_{2n+1}), |θ_{2n+1}| ≤
γ_{2n+1}. Thus p̂ has a tiny relative error. Of course, this assumes that the roots x_i
are known exactly!
6.1. For then, using the Cauchy–Schwarz inequality,
The first inequality is an equality iff |aij| = α, and the second inequality is an
equality iff A is a multiple of a matrix with orthonormal columns. If A is real
and square, these requirements are equivalent to A being a scalar multiple of a
Hadamard matrix. If A is complex and square, the requirements are satisfied by the
given Vandermonde matrix, which is √n times a unitary matrix.
6.2.
= ||A||2||B||F.
Similarly, ||BC||F < ||B||F||C||2, and these two inequalities together imply the re-
quired one.
6.7. Let λ be an eigenvalue of A and x the corresponding eigenvector, and form the
matrix X = [x, x, . . . , x]. Then AX = λX, so |λ| ||X|| = ||AX|| ≤ ||A|| ||X||,
showing that |λ| ≤ ||A||. For a subordinate norm it suffices to take norms in the
equation Ax = λx.
6.8. The following proof is much simpler than the usual proof based on diagonal
scaling to make the off-diagonal of the Jordan matrix small (see, e.g., Horn and John-
son [580, 1985, Lem. 5.6.10]). The proof is from Ostrowski [812, 1973, Thm. 19.3].
Let δ^{-1}A have the Jordan canonical form δ^{-1}A = XJX^{-1}. We can write J = δ^{-1}D + N,
where D = diag(λ_i) and the λ_i are the eigenvalues of A. Then
A = X(D + δN)X^{-1}, so
Note that we actually have ||A|| = ρ(A) + δ if the largest eigenvalue occurs in a
Jordan block of size greater than 1. If A is diagonalizable then with δ = 0 we get
||A|| = ρ(A). The last part of the result is trivial.
6.9. Let A have the SVD A = UΣV*. By the unitary invariance of the 2- and
Frobenius norms, ||A||_2 = ||Σ||_2 = σ_1, ||A||_F = ||Σ||_F = (Σ_i σ_i^2)^{1/2}. Thus
||A||_2 ≤ ||A||_F ≤ √n ||A||_2 (in fact, we can replace √n by √r, where r = rank(A)).
There is equality on the left when σ_2 = · · · = σ_n = 0, that is, when A has rank 1
(A = xy*) or A = 0. There is equality on the right when σ_1 = · · · = σ_n = α, that
is, when A = αQ where Q has orthonormal columns.
Equality is attained for an x that gives equality in the Holder inequality involving
the kth row of A, where the maximum is attained for i = k. Finally, from either
formula,
6.14. We prove the lower bound in (6.22) and the upper bound in (6.23); the other
bounds follow on using ||A^T||_p = ||A||_q. First, note that ||A||_p ≥ ||Ae_j||_p = ||A(:, j)||_p,
which gives the lower bound in (6.22). Now assume that A has at most µ nonzeros
per column. Define
D_i = diag(s_{i1}, . . . , s_{in}),
We have
6.15. The lower bound follows from ||Ax||_p/||x||_p ≤ || |A||x| ||_p/|| |x| ||_p. From (6.12)
we have
By (6.21), we also have || |A| ||_p = || |A^T| ||_q ≤ n^{1−1/q}||A^T||_q = n^{1/p}||A||_p and the
result follows.
6.16. The function v is not a vector norm because does not hold for
all However, and the other two norm conditions
hold, so it makes sense to define the “subordinate norm”. We have
There is equality for xj an appropriate unit vector ej. Hence v(A) = maxj v(A(:,j)).
7.2. Take norms in r = A(x–y) and x–y = A–lr. The result says that the normwise
relative error is at least as large as the normwise relative residual and possibly K(A)
times as large. Since the upper bound is attainable, the relative residual is not a
good predictor of the relative error unless A is very well conditioned.
7.3. Let DR equilibrate the rows of A, so that B = DRA satisfies |B|e = e. Then
7.4. The first inequality is trivial. For the second, since hii = 1 and |hij| < 1 we
have |H| > 1 and < n. Hence
7.7. We will prove the result for w; the proof for η is entirely analogous. The lower
bound is trivial. Let Then and
(A + ∆A)y = b + ∆b with |∆ A| < |A| and Hence |b| = |(A + ∆A)y - ∆b| <
yielding Thus
The lower bound for follows from the inequalities |c^Tx| = |c^TA^{-1}·Ax| ≤
|c^TA^{-1}||A||x| and |c^Tx| = |c^TA^{-1}b| ≤ |c^TA^{-1}||b|. A slight modification to the
The rest of the proof shows that equality can be attained in this inequality.
Let x1 > 0 be a right Perron vector of BC, so that BCx1 = πx1, where π =
ρ(BC) > 0. Define
x2 = Cx1, (A.5)
so that x2 > 0 and
Bx2 = πx 1 . (A.6)
(We note, incidentally, that x2 is a right Perron vector of CB: CBx2 = πx 2 .)
Now define
D1 = diag(x 1 ) - 1 , D2 = diag(x 2 ). (A.7)
T
Then, with e = [1, 1, . . . , 1]^T, (A.6) can be written BD_2e = πD_1^{-1}e, or D_1BD_2e =
πe. Since D_1BD_2 ≥ 0, this gives ||D_1BD_2||_∞ = π. Similarly, (A.5) can be written
D_2e = CD_1^{-1}e, or D_2^{-1}CD_1^{-1}e = e, which gives ||D_2^{-1}CD_1^{-1}||_∞ = 1. Hence for
D_1 and D_2 defined by (A.7) we have ||D_1BD_2||_∞||D_2^{-1}CD_1^{-1}||_∞ = π = ρ(BC), as
required.
Note that for the optimal D_1 and D_2, D_1BD_2 and D_2^{-1}CD_1^{-1} both have the
property that all their row sums are equal.
(b) Take B = |A| and C = |A-l|, and note that
Now apply (a).
(c) We can choose F_1 > 0 and F_2 > 0 so that |A| + tF_1 > 0 and |A^{-1}| + tF_2 > 0
for all t > 0. Hence, using (a),
(d) A nonnegative irreducible matrix has a positive Perron vector, from stan-
dard Perron–Frobenius theory (see, e.g., Horn and Johnson [580, 1985, Chap. 8]).
Therefore the result follows by noting that in the proof of (a) all we need is for D_1
and D_2 in (A.7) to be defined and nonsingular, that is, for x_1 and x_2 to be positive
vectors. This is the case if BC and CB are irreducible (since x_2 is a Perron vector
of CB).
(e) Using and it is easy to
show that the results of (a)–(d) remain true with the ∞-norm replaced by the 1-
norm. From it then follows that inf
In fact, the result in (a) holds for any p-norm, though the optimal D1
and D2 depend on the norm; see Bauer [79, 1963, Lem. 1(i)].
7.10. That it cannot exceed the claimed expression follows by taking absolute
values in the expression ∆X = −(A^{-1}∆AA^{-1} + A^{-1}∆A∆X). To show it can equal
it we need to show that if the maximum is attained for (i, j) = (r, s) then
where t =
and
(b) We can assume, without loss of generality, that
|y1| < |y2| < · · · < |yn|. (A.8)
Define the off-diagonal of H by h_{ij} = h_{ji} = g_{ij} for j > i, and let h_{11} = g_{11}. The ith
equation of the constraint Hy = Gy will be satisfied if
(A.9)
(A.10)
which yields
Hence satisfies
(A.11)
which is used in place of (7.26) to obtain the desired bound.
7.13. (a) Writing A-1 = we have
Hence
so
But
Hence
This inequality shows that when perturbations are measured normwise there is little
difference between the average and worst-case condition numbers.
7.14. We have
Note that this result gives a lower bound for the optimal condition number, while
Bauer’s result in Problem 7.9(c) gives an upper bound. There is equality for diagonal
A, trivially. For triangular A there is strict inequality in general since the lower
bound is 1!
8.1. Straightforward.
8.2. Let
Then
Now 0 ≤ b = Tx = (D − U)x, so Dx ≥ Ux, that is, x ≥ D^{-1}Ux. Hence
it follows that
8.8. (a) Using the formula det(I + xyT) = 1 + yTx we have det
Hence we take
if otherwise there is no αij that makes A + singular.
It follows that the “best” place to perturb A to make it singular (the place that
gives the smallest αij ) is in the (s, r) position, where the element of largest absolute
value of A–1 is in the (r, s) position.
(b) The off-diagonal elements of are given by Hence,
using part (a), is singular, where α = −2^{2−n}. In fact, T_n is also made
singular by subtracting 2^{1−n} from all the elements in the first column.
8.9. Here is Zha’s proof. If s = 1 the result is obvious, so assume s < 1. Define the
n-vectors
and
which shows that σ is a singular value of Un (θ). With σι denoting the ith largest
singular value,
Therefore
8.10. For a solver of this form, it is not difficult to see that
where denotes fi with all its coefficients replaced by their absolute values, and
where (M(T), |b| ) is a rational expression consisting entirely of nonnegative terms.
This is the required bound expressed in a different notation. An example (admit-
tedly, a contrived one) of a solver that does not satisfy (8.20) is, for a 2 x 2 lower
triangular system LX = b,
9.1. The proof is by induction. Assume the result is true for matrices of order n – 1,
and suppose
so the implication is only one way. Note also that 0 ∉ F(A) iff e^{iθ}A has positive
definite Hermitian part for some real θ (Horn and Johnson [581, 1991, Thm. 1.3.5]).
9.4. The changes are minor. Denoting by and the computed permutations,
the result of Theorem 9.3 becomes
9.5
9.7. The given fact implies that JAJ is totally nonnegative. Hence it has an
LU factorization JAJ = LU with L ≥ 0 and U ≥ 0. This means that A =
(JLJ)(JUJ) is an LU factorization, and |JLJ||JUJ| = LU = JAJ = |A|.
implying
9.9. By inspecting the equations in Algorithm 9.2 we see that the computed LU
factors satisfy
So
Then Hence
as required.
so
as required.
10.7. The inequalities (10.13) follow from the easily verified fact that, for j > k,
10.8. Examples of indefinite matrices with nonnegative leading principal minors are
A necessary and sufficient condition for positive semidefiniteness is that all the principal
minors of A (of which there are 2^n − 1) be nonnegative (not just the leading principal minors);
see, e.g., Horn and Johnson [580, 1985, p. 405] or Mirsky [763, 1961, p. 405] (same
page number in both books!).
10.10. Theorem 10.3 is applicable only if Cholesky succeeds, which can be guaran-
teed only if the (suitably scaled) matrix A is not too ill conditioned (Theorem 10.7).
Therefore the standard analysis is not applicable to positive semidefinite matrices
that are very close to being singular. Theorem 10.14 provides a bound on the resid-
ual after rank(A) stages and, in particular, on the computed Schur complement,
which would be zero in exact arithmetic. The condition of Theorem 10.7 ensures
that all the computed Schur complements are positive definite, so that even if
magnification of errors occurs, it is absorbed by the next part of the Cholesky factor.
The proposed appeal to continuity is simply not valid.
10.11. The analysis in §10.4 shows that for a 2 × 2 pivot E, det(E) ≤ (α^2 − 1)µ_0^2 for
complete pivoting and det(E) ≤ (α^2 − 1)λ^2 for partial pivoting. Now α^2 − 1 < 0 and
µ_0 and λ are nonzero if a 2 × 2 pivot is needed. Hence det(E) < 0, which means that
E has one positive and one negative eigenvalue. Note that if A is positive definite
it follows that all pivots are 1 × 1.
If the block diagonal factor has p+ positive 1 x 1 diagonal blocks, p– negative
1 x 1 diagonal blocks, p0 zero 1 x 1 diagonal blocks, and q 2 x 2 diagonal blocks,
then the inertia is (+, –, 0) = (p+ + q,p- + q,p0).
Denote a 2 × 2 pivot by E = [a b; b c]
and consider partial pivoting. We know det(E) = ac − b^2 < 0 and |b| > |a|, so the
formula det(E) = [(a/b)c – b]b minimizes the risk of overflow. Similarly, the formula
helps to avoid overflow; this is the formula used in LINPACK’S xSIDI and LAPACK’S
xSYTRI. The same formulae are suitable for complete pivoting because then |b| >
max( |a|, |c| ).
10.12. The partial pivoting strategy simplifies as follows: if |a_{11}| > α|a_{21}| use a
1 × 1 pivot a_{11}, if |a_{22}| > α|a_{12}| use a 1 × 1 pivot a_{22}, else use a 2 × 2 pivot, that is,
do nothing.
10.13. There may be interchanges, because the tests |a11| > αλ and αλ2 < |a11|
can both fail, for example for A = with But there can be no 2 x 2
pivots, as they would be indefinite (Problem 10.11). Therefore the factorization is
PAPT = LDLT for a diagonal D with positive diagonal entries.
10.14. That the growth factor bound is unchanged is straightforward to check. No
2x2 pivots are used for a positive definite matrix because, as before (Problem 10.11),
any 2x2 pivot is indefinite. To show that no interchanges are required for a positive
definite matrix we show that the second test, αλ2 < |a11| is always passed. The
submatrix is positive definite, so a_{11}a_{rr} − a_{r1}^2 > 0. Hence
as required.
10.15. With partial pivoting the diagonal pivoting method produces the factoriza-
tion, with P = I,
Putting x1 = we obtain
11.1
11.2. The inequality (11.4) yields, with x0 := 0, and dropping the subscripts on the
gi and Gi ,
under the conditions of Theorem 11.4. The term Gg can be bounded in a similar
way. The required bound for follows.
12.1. The equations for the blocks of L and U are U11 = A11 and
First, consider the case where A is block diagonally dominant by columns. We prove
by induction that
which implies both the required bounds. This inequality clearly holds for k = 2;
suppose it holds for k = i. We have
where
using the block diagonal dominance for the last inequality. Hence
as required.
The proof for A block diagonally dominant by rows is similar. The inductive
hypothesis is that
giving as required.
For block diagonal dominance by columns in the -norm we have
and so block LU factorization is stable. If A is block diagonally
dominant by rows, stability is assured if ||A_{i,i−1}||/||A_{i−1,i}|| is suitably bounded for
all i.
12.2. The block 2 x 2 matrix
is block diagonally dominant by rows and columns in the 1- and ∞-norms for ε = 1/2,
but is not point diagonally dominant by rows or columns. The block 2 x 2 matrix
is point diagonally dominant by rows and columns but not block diagonally dominant
by rows or columns in the ∞-norm or 1-norm.
12.3. No. A counterexample is the first matrix in the solution to Problem 12.2,
with ε = 1/2, which is clearly not positive definite because the largest element does
not lie on the diagonal.
12.4. From (12.2) it can be seen that (A^{-1})_{21} = where the Schur
complement S = Hence
S is the trailing submatrix that would be obtained after r − 1 steps of GE. It follows
immediately that ||S|| ≤ ρ_n ||A||.
For the last part, note that ||S^{-1}|| ≤ ||A^{-1}||, because S^{-1} is the (2,2) block of
A^{-1}, as is easily seen from (12.2).
12.5. The proof is similar to that of Problem 8.7(a). We will show that
1. Let y = and let The kth equation of gives
Hence
so that
det(X) = det(A) × det(D − CA^{-1}B).
Hence det(X) = det(AD − ACA^{-1}B), which equals det(AD − CB) if C commutes
with A.
13.2. The new bounds have norms in place of absolute values and the constants are
different.
A=
Then
while
Hence ||AX – I||/||XA – I|| Note that in this example every element
of AX – I is large.
(A.12)
Hence where
Clearly,
From (A.12), so
The first conclusion is that the approximate left inverse yields the smaller residual
bound, while the approximate right inverse yields the smaller forward error bound.
Therefore which inverse is “better” depends on whether a small backward error or
a small forward error is desired. The second conclusion is that neither approximate
inverse yields a componentwise backward stable algorithm, despite the favorable
assumptions on and Multiplying by an explicit inverse is simply not a good
way to solve a linear system.
13.6. Here is a hint: notice that the matrix on the front cover of the LAPACK
Users’ Guide has the form
13.7. If the ith row of A contains all 1s then simply sum the elements in the ith
row of the equation AA^{-1} = I.
13.8. (A + iB)(P + iQ) = I is equivalent to AP − BQ = I and AQ + BP = 0, or
so X^{-1} is obtainable from the first n columns of Y^{-1}. The definiteness result follows
from
(x + iy)*(A + iB)(x + iy) = x^T(Ax − By) + y^T(Ay + Bx)
+ i[x^T(Ay + Bx) − y^T(Ax − By)]
where we have used the fact that A = A^T and B = −B^T.
(from X to Y) multiplies the number of flops by a factor of 8, since the flop count
for inversion is cubic in the dimension. Yet complex operations should, in theory,
cost between about two and eight times as much as real ones (the extremes being
for addition and division). The actual relative costs of inverting X and Y depend
on the machine and the compiler, so it is not possible to draw any firm conclusions.
Note also that Y requires twice the storage of X.
13.10. As in the solution to Problem 8.8, we have
det(A + αe_ie_j^T) = det(A)(1 + αe_j^TA^{-1}e_i).
If (A^{-1})_{ji} = 0, this expression is independent of α, and hence det(A) is independent
of a_{ij}. (This result is clearly correct for a triangular matrix.) That det(A) can be
independent of a_{ij} shows that det(A), or even a scaling of it, is an arbitrarily poor
measure of conditioning, for A + αe_ie_j^T approaches a multiple of a rank-1 matrix as
α → ∞.
Since the geometric mean does not exceed the arithmetic mean,
This result holds for any order of evaluation and is proved (for any particular order
of evaluation) using a variation on the proof of Lemma 8.2 in which we do not divide
through by the product of 1 + δi terms. Using (A. 13) we have
If then
Thus
A diagonal similarity therefore has no effect on the error bound and so there is no
point in scaling H before applying Hyman’s method.
where Aij denotes the submatrix of A obtained by deleting row i and column j. A
definition of condition number that is easy to evaluate is
There is one degree of freedom in the vectors x and y, which can be expended by
setting xl = 1, say.
15.1. The equation would imply, on taking traces, 0 = trace(I), which is false.
15.2. It is easily checked that the differential equation given in the hint has the
solution Z(t) = eAt CeBt . Integrating the differential equation between 0 and
gives, assuming that
15.3. Since
(b) We break the proof into two cases. First, suppose that
for some m. By part (a), the same inequality holds for all k > m and we are
finished. Otherwise, for all k. By part (a) the positive
sequence ||xk – a|| is monotone decreasing and therefore it converges to a limit l.
Suppose that l = where δ > 0. Since β < 1, for some k we must have
By (A.14),
But
18.7. Straightforward. This problem shows that the CGS and MGS methods
correspond to two different ways of representing the orthogonal projection onto
span{q1, . . . ,qj}.
18.8. Assume, without loss of generality, that ||a1||2 < ||a2||2. If E is any matrix
such that A + E is rank deficient then
Hence
This is of the required form, with Q = [q1,.. ., qn] the matrix from the MGS method
applied to A.
∆A = QR − A = (Q − P_{21})R + ∆A_2
= (VW^T − P_{21})R + ∆A_2
In fact, the result holds for any unitarily invariant norm (but the ||A||2 + 1 in the
denominator must be retained).
19.1. One approach is to let x be a solution and y an arbitrary vector, and consider
f(α) = ||A(x + αy) − b||_2. By expanding this expression it can be seen that if the
normal equations are not satisfied then α and y can be chosen so that f(α) < f(0).
Alternatively, note that for f(x) = (b − Ax)^T(b − Ax) = x^TA^TAx − 2b^TAx + b^Tb
we have ∇f(x) = 2(A^TAx − A^Tb) and ∇^2f(x) = 2A^TA, so any solution of the
normal equations is a local minimum of f, and hence a global minimum since f is
quadratic. The normal equations can be written as (b – Ax)TA = 0, which shows
that the residual b – Ax is orthogonal to range(A).
We have
where G > G1 and ||G||F = 1. The normwise bounds are proved similarly.
We have
by (19.12).
19.8. Let y = 0 and ||(A + ∆A)y − b||_2 = min. If b = 0 then we can take
∆A = 0. Otherwise, the normal equations tell us that (A + ∆A)^Tb = 0, so ||∆A||_2 ≥
||A^Tb||_2/||b||_2. This lower bound is achieved, and the normal equations satisfied, for
∆A^T = −(A^Tb)b^T/b^Tb. Hence
20.1. Setting the gradient of (20.13) to zero gives A^TAx − A^Tb + c = 0, which
can be written as y = b − Ax, A^Ty = c, which is (20.12). The Lagrangian for
(20.14) is L(y, x) = ½(y − b)^T(y − b) + x^T(A^Ty − c). ∇_y L(y, x) = y − b + Ax, and
∇_x L(y, x) = A^Ty − c. Setting the gradients to zero gives the system (20.12).
21.1. The modified version has the same flop count as the original version.
21.4. The summation gives
which implies the desired equality. It follows that all columns of V^{-1} except the
first must have both positive and negative entries. In particular, V^{-1} ≥ 0 is not
possible. The elements of V^{-1} sum to 1, independently of the points α_i (see also
Problem 13.7).
and Le = e, so U(i,:)L^T = W(i,:), or U(i,:)^T = L^{-T}W(i,:)^T. But L^{-T} ≥ 0
by the given formula, so ||L^{-T}||_1 = ||L^{-1}||_∞ = ||L^{-1}e||_∞ = ||e||_∞ = 1, hence
||U(i,:)||_1 ≤ ||W(i,:)||_1.
As an aside, we can evaluate ||L|| as
CDCT =
21.8. The increasing ordering is never produced, since the algorithm must choose
α_1 to maximize |α_1 − α_0|.
22.1. Consider the case where m < min(n,p), and suppose n = jm and p = km for
some integers j and k. Then the multiplication of the two matrices can be split into
m x m blocks:
22.3. For large n = 2^k, S_n(8)/S_n(n) ≈ 1.96 × (7/8)^k and S_n(1)/S_n(8) ≈ 1.79.
The ratio Sn(8)/Sn(n) measures the best possible reduction in the amount of arith-
metic by using Strassen’s method in place of conventional multiplication. The ratio
S n (1)/S n (8) measures how much more arithmetic is used by recurring down to the
scalar level instead of stopping once the optimal dimension n0 is reached. Of course,
the efficiency of a practical code for Strassen’s method also depends on the various
non-floating-point operations.
22.5. Apart from the differences in stability, the key difference is that Winograd’s
formula relies on commutativity and so cannot be generalized to matrices.
22.7. Some brief comments are given by Douglas, Heroux, Slishman, and Smith
[317, 1994].
Hence we can form AB by inverting a matrix of thrice the dimension. This result
is not of practical value, but it is useful for computational complexity analysis.
24.2. With n = 3 and almost any starting data, the backward error can easily be
made of order 1, showing that the method is unstable. However, the backward error
is found to be roughly of order so the method may have a backward error
bound proportional to (this is an open question).
25.1. (b) An optimizing compiler might convert the test xp1 > 1 to x + 1 > 1 and
then to x > 0. (For a way to stop this conversion in Fortran, see the solution to
Problem 25.3.) Then the code would compute a number of order 2^{e_min} instead of a
number of order 2^{−t}.
25.3. The algorithm is based on the fact that the positive integers that can be
exactly represented are 1, 2, . . . , β^t; in the interval [β^t, β^{t+1}] the floating point
numbers are spaced β apart. This interval must contain a power of 2, a = 2^k. The
first while loop finds such an a (or, rather, the floating point representation of such
an a) by testing successive powers 2^i to see if both 2^i and 2^i + 1 are representable.
The next while loop adds successive powers
of 2 until the next floating point number is found; on subtracting a the base β is
produced. Finally t is determined as the smallest power of β for which the distance
to the next floating point number exceeds 1.
The routine can fail for at least two reasons. First, an optimizing compiler might
simplify the test while (a+1) - a == 1 to while 1 == 1, thus altering the meaning
of the program. Second, in the same test the result (a+1) might be held in an
extra length register and the subtraction done in extra precision. The computation
would then reflect this higher precision and not the intended one. We could try
to overcome this problem by saving the result a+1 in a variable, but an optimizing
compiler might undo this effort by dispensing with the variable and storing the
result in a register. In Fortran, the compiler's unwanted efforts can be nullified by
a test of the form while foo(a+1) - a == 1, where foo is a function that simply
returns its argument. The problems caused by the use of extra length registers were
discussed by Gentleman and Marovich [437, 1974]; see also Miller [759, 1984, §2.2].
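A compact MATLAB rendering of the procedure just described (a sketch; as noted above,
optimizing compilers or extended precision registers can defeat such tests in other
environments, but in MATLAB's IEEE double arithmetic it returns β = 2 and t = 53):

a = 1;
while ((a+1) - a) == 1
    a = 2*a;                   % find a power of 2 beyond which the spacing exceeds 1
end
b = 1;
while ((a+b) - a) == 0
    b = 2*b;                   % smallest power of 2 that changes a
end
beta = (a+b) - a;              % the spacing just above a, i.e., the base
t = 1;  x = beta;
while ((x+1) - x) == 1
    x = x*beta;  t = t + 1;    % precision: first power of beta with spacing > 1
end
disp([beta t])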
25.4. (This description is based on Schreiber [903, 1989].) The random number
generator in matgen repeats after 16384 numbers [668, 1981, p. 19]. The dimension
n = 512 divides the period of the generator (16384 = 512 x 32), with the effect that
the first 32 columns of the matrix are repeated 16 times (512 = 32 x 16), so the
matrix has this structure:
B = rand(512,32);
A = [B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B];
but that is less than the underflow threshold. The actual pivots do not exactly match
the analysis, which is probably due to rank deficiency of one of the submatrices
generated. Also, underflow occurs earlier than predicted, apparently because two
small numbers (both O(10^{−21})) are multiplied in a saxpy operation.
25.5. Let s_i and t_i denote the values of s and t at the start of the ith stage of the
algorithm. Then
(A.15)
so with s_{i+1} = (t_i/x_i)^2 s_i + 1 and t_{i+1} = |x_i|, (A.15) continues to hold for the next
value of i. The same is true trivially in the case |x_i| ≤ t_i.
This is a one-pass algorithm using n divisions that avoids overflow except, pos-
sibly, on the final stage in forming t√s, which can overflow only if ||x||_2 does.
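A MATLAB sketch of the algorithm described (essentially the running scale/sum-of-squares
update used in the BLAS and LAPACK 2-norm routines; the function name is illustrative):

function nrm = twonorm_scaled(x)
% One-pass 2-norm that avoids overflow by carrying a running scale factor t.
t = 0;  s = 1;                 % invariant: sum of squares processed so far = s*t^2
for i = 1:length(x)
    if x(i) ~= 0
        if abs(x(i)) > t
            s = 1 + s*(t/abs(x(i)))^2;
            t = abs(x(i));
        else
            s = s + (x(i)/t)^2;
        end
    end
end
nrm = t*sqrt(s);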
Appendix B
Singular Value Decomposition,
M-Matrices
B.2. M-Matrices
A matrix is an M-matrix if a_{ij} ≤ 0 for all i ≠ j and all the
eigenvalues of A have nonnegative real part. This is one of many equivalent
definitions [94, 1994, Chap. 6].
An M-matrix may be singular. A particularly useful characterization of a
nonsingular M-matrix is a nonsingular matrix for which a_{ij} ≤ 0
for all i ≠ j and A^{-1} has nonnegative elements (written as A^{-1} ≥ 0).
For more information on M-matrices see Berman and Plemmons [94, 1994]
and Horn and Johnson [581, 1991, §2.5].
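For a concrete example, the following MATLAB lines check the characterization for a
small nonsingular M-matrix (a sketch):

A = [ 2 -1  0
     -1  2 -1
      0 -1  2 ];               % off-diagonal entries are <= 0
disp(min(real(eig(A))))        % nonnegative (here positive, so A is nonsingular)
B = inv(A);
disp(min(B(:)))                % nonnegative: A^(-1) >= 0, so A is an M-matrix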
Appendix C
Acquiring Software
Caveat receptor . . .
Anything free comes with no guarantee!
— JACK DONGARRA and ERIC GROSSE, Netlib mail header
C. 1. Internet
A huge variety of information and software is available over the Internet, the
worldwide combination of interconnected computer networks. The location of
a particular object is specified by a URL, which stands for "Uniform Resource
Locator". Examples of URLs are
http://www.netlib.org/index.html
ftp://ftp.netlib.org
The first example specifies a World Wide Web server (http = hypertext
transfer protocol) together with a file in hypertext format (html = hyper-
text markup language), while the second specifies an anonymous ftp site. In
any URL, the site address may, optionally, be followed by a filename that
specifies a particular file. For more details about the Internet see on-line in-
formation, or one of the many books on the subject, such as Krol [674, 1994].
C.2. Netlib
Netlib is a repository of freely available mathematical software, documents,
and databases of interest to the scientific computing community [316, 1987],
[151, 1994]. It includes
• research codes,
• golden oldies (classic programs that are not available in standard libraries),
• the collected algorithms of the ACM,
• program libraries such as EISPACK, LINPACK, LAPACK, and MINPACK.
Netlib also enables the user to download technical reports from certain in-
stitutions, to download software and errata for textbooks, and to search the
SIAM membership list and a "white pages" database.
Netlib can be accessed in several ways.
C.3. M ATLAB
M ATLAB is a commercial program sold by The MathWorks, Inc. It runs
on a variety of platforms. The MathWorks maintains a collection of user-
contributed M-files, which is accessible over the Internet.
For information contact
NAG Ltd.
Wilkinson House
Appendix D
Program Libraries
In this appendix we briefly describe some of the freely available program li-
braries that have been mentioned in this book. These packages are all available
from netlib (see §C.2).
S real
D double precision
C complex
Z complex*16, or double complex
Level 2: [313, 1988], [314, 1988] Matrix–vector operations. Matrix times vec-
tor (gaxpy): y ← αAx + βy (xGEMV); rank-1 update: A ← A + αxy^T
(xGER); triangular solve: x ← T^{-1}x (xTRSV); and variations on these.
code that tests the numerical stability is provided with the model implemen-
tations [309, 1990], [314, 1988].
For more details on the BLAS and the advantages of using them, see
the defining papers listed above, or, for example, [315, 1991] or [470, 1989,
Chap. 1].
D.2. EISPACK
EISPACK is a collection of Fortran 66 subroutines for computing eigenvalues
and eigenvectors of matrices [925, 1976], [415, 1977]. It contains 58 subrou-
tines and 13 drivers. The subroutines are the basic building blocks for eigen-
system computations; they perform such tasks as reduction to Hessenberg
form, computation of some or all of the eigenvalues/vectors, and back trans-
formations, for various types of matrix (real, complex, symmetric, banded,
etc.). The driver subroutines provide easy access to many of EISPACK’s ca-
pabilities; they call from one to five other EISPACK subroutines to do the
computations.
EISPACK is primarily based on Algol 60 procedures developed in the
1960s by 19 different authors and published in the journal Numerische Math-
ematik. An edited collection of these papers was published in the Handbook
for Automatic Computation [1102, 1971].
D.3. LINPACK
LINPACK is a collection of Fortran 66 subroutines that analyse and solve
linear equations and linear least squares problems [307, 1979]. The package
solves linear systems whose matrices are general, banded, symmetric indefi-
nite, symmetric positive definite, triangular, or tridiagonal. In addition, the
package computes the QR and singular value decompositions and applies them
to least squares problems. All the LINPACK routines use calls to the level-1
BLAS in the innermost loops; thus most of the floating point arithmetic in
LINPACK is done within the level-1 BLAS.
D.4. LAPACK
LAPACK [17, 1995] was released on February 29, 1992. As the announce-
ment stated, “LAPACK is a transportable library of Fortran 77 subroutines
for solving the most common problems in numerical linear algebra: systems
of linear equations, linear least squares problems, eigenvalue problems, and
singular value problems. It has been designed to be efficient on a wide range
of modern high-performance computers.”
LAPACK has been developed over a period that began in 1987 by a team
of 11 numerical analysts in the UK and the USA. LAPACK can be regarded as
a successor to LINPACK and EISPACK; it has virtually all their capabilities
and much more besides. LAPACK improves on LINPACK and EISPACK in
four main respects: speed, accuracy, robustness, and functionality. It was
designed at the outset to exploit the level-3 BLAS.
Development of LAPACK continues under the auspices of two follow-on
projects, LAPACK 2 and ScaLAPACK. An object-oriented C++ extension
to LAPACK has been produced, called LAPACK++ [311, 1995]. CLAPACK
is a C version of LAPACK, converted from the original Fortran version using
the f2c converter [367, 1990]. ScaLAPACK comprises a subset of LAPACK
routines redesigned for distributed memory parallel machines [206, 1992], [205,
1994]. Other work includes developing codes that take advantage of the careful
rounding and exception handling of IEEE arithmetic [298, 1994]. For more
details of all these topics see [17, 1995].
LAPACK undergoes regular updates, which are announced on the elec-
tronic newsletter NA-Digest. At the time of writing, the current release is
version 2.0, dated September 30, 1994, and the package contains over 1000
routines and over 735,000 lines of Fortran 77 code, including testing and tim-
ing code.
Mark 16 onward of the NAG Fortran 77 Library contains much of LA-
PACK in Chapters F07 and F08.
BD bidiagonal
GB general band
GE general
GT general tridiagonal
HS upper Hessenberg
OR (real) orthogonal
PO symmetric or Hermitian positive definite
PT symmetric or Hermitian positive definite tridiagonal
SB (real) symmetric band
ST (real) symmetric tridiagonal
SY symmetric
TR triangular (or quasi-triangular)
The last three characters specify the computation performed.
TRF factorize
TRS solve a (multiple right-hand side) linear system using
the factorization
CON estimate 1/κ_1(A) (or compute it exactly when A is tridiagonal
and symmetric positive definite or Hermitian positive definite)
RFS apply fixed precision iterative refinement and com-
pute the componentwise relative backward error and
a forward error bound
TRI use the factorization to compute A–1
EQU compute factors to equilibrate the matrix
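For example, with this convention xGETRF (x = S, D, C, or Z) factorizes a general
matrix, xGETRS solves linear systems using that factorization, xGECON estimates the
reciprocal condition number, xGERFS performs iterative refinement with error bounds,
and xGETRI forms the inverse; DPOTRF is the double precision Cholesky factorization
routine for a symmetric positive definite matrix.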
The auxiliary routines follow a similar naming convention, with most of
them having yy = LA.
Appendix E
The Test Matrix Toolbox
ftp://ftp.mathworks.com/pub/contrib/linalg/testmatrix
Demonstration
tmtdemo Demonstration of Test Matrix Toolbox.
Visualization
fv Field of values (or numerical range).
gersh Gershgorin disks.
ps Dot plot of a pseudospectrum.
pscont Contours and colour pictures of pseudospectra.
see Pictures of a matrix and its (pseudo-) inverse.
Miscellaneous
bandred Band reduction by two-sided unitary transformations.
chop Round matrix elements.
comp Comparison matrices.
cond Matrix condition number in 1, 2, Frobenius, or ∞-norm.
cpltaxes Determine suitable axis for plot of complex vector.
dual Dual vector with respect to Hölder p-norm.
eigsens Eigenvalue condition numbers.
house Householder matrix.
matrix Test Matrix Toolbox information and matrix access by
number.
matsignt Matrix sign function of a triangular matrix.
pnorm Estimate of matrix p-norm (1 ≤ p ≤ ∞).
qmult Pre-multiply by random orthogonal matrix.
rq Rayleigh quotient.
seqa Additive sequence.
seqcheb Sequence of points related to Chebyshev polynomials.
seqm Multiplicative sequence.
show Display signs of matrix elements.
skewpart Skew-symmetric (skew-Hermitian) part.
sparsify Randomly sets matrix elements to zero.
sub Principal submatrix.
symmpart Symmetric (Hermitian) part.
trap2tri Unitary reduction of trapezoidal matrix to triangular form.
Bibliography
[1] Jan Ole Aasen. On the reduction of a symmetric matrix to tridiagonal form.
BIT, 11:233-242, 1971.
[2] Nabih N. Abdelmalek. Round off error analysis for Gram-Schmidt method
and solution of linear least squares problems. BIT, 11 :345–368, 1971.
[3] ACM Turing Award Lectures: The First Twenty Years, 1966–1985. Addi-
son-Wesley, Reading, MA, USA, 1987. xviii+483 pp. ISBN 0-201-54885-2.
[4] Forman S. Acton. Numerical Methods That Work. Harper and Row, New
York, 1970. xviii+541 pp. Reprinted by Mathematical Association of Amer-
ica, Washington, DC, with new preface and additional problems, 1990. ISBN
0-88385-450-3.
[5] Duane A. Adams. A stopping criterion for polynomial root finding. Comm.
ACM, 10:655-658, 1967.
[6] Vijay B. Aggarwal and James W. Burgmeier. A round-off error model with
applications to arithmetic expressions. SIAM J. Comput., 8(1):60-72, 1979.
[7] Alan A. Ahac, John J. Buoni, and D. D. Olesky. Stable LU factorization of
H-matrices. Linear Algebra Appl., 99:97–110, 1988.
[8] J. H. Ahlberg and E. N. Nilson. Convergence properties of the spline fit. J.
Soc. Indust. Appl. Math., 11(1):95-104, 1963.
[9] Paul Halmos by parts (interviews by Donald J. Albers). In Paul Halmos:
Celebrating 50 Years of Mathematics, John H. Ewing and F. W. Gehring,
editors, Springer-Verlag, Berlin, 1991, pages 3–32.
[10] Götz Alefeld and Jürgen Herzberger. Introduction to Interval Computations.
Academic Press, New York, 1983. xviii+333 pp. ISBN 0-12-049820-0.
[11] M. Almacany, C. B. Dunham, and J. Williams. Discrete Chebyshev approx-
imation by interpolating rationals. IMA J. Numer. Anal., 4:467–477, 1984.
[12] Steven C. Althoen and Renate McLaughlin. Gauss-Jordan reduction: A brief
history. Amer. Math. Monthly, 94(2):130-142, 1987.
[13] Fernando L. Alvarado, Alex Pothen, and Robert S. Schreiber. Highly par-
allel sparse triangular solution. In Graph Theory and Sparse Matrix Com-
putations, J. Alan George, John R. Gilbert, and Joseph W. H. Liu, editors,
volume 56 of IMA Volumes in Mathematics and its Applications, Springer-
Verlag, New York, 1993, pages 141-158.
[59] Jesse L. Barlow. Error analysis and implementation aspects of deferred cor-
rection for equality constrained least squares problems. SIAM J. Numer.
Anal., 25(6):1340-1358, 1988.
[60] Jesse L. Barlow. On the discrete distribution of leading significant digits
in finite precision arithmetic. Technical Report CS-88-35, Department of
Computer Science, Pennsylvania State University, University Park, PA, USA,
September 1988. 15 pp.
[61] Jesse L. Barlow. Error analysis of a pairwise summation algorithm to com-
pute the sample variance. Numer. Math., 58:583-590, 1991.
[62] Jesse L. Barlow and E. H. Bareiss. On roundoff error distributions in floating
point and logarithmic arithmetic. Computing, 34:325–347, 1985.
[63] Jesse L. Barlow and E. H. Bareiss. Probabilistic error analysis of Gaussian
elimination in floating point and logarithmic arithmetic. Computing, 34:
349-364, 1985.
[64] Jesse L. Barlow and Susan L. Handy. The direct solution of weighted and
equality constrained least-squares problems. SIAM J. Sci. Statist. Comput.,
9(4):704-716, 1988.
[65] Jesse L. Barlow and Ilse C. F. Ipsen. Scaled Givens rotations for the solution
of linear least squares problems on systolic arrays. SIAM J. Sci. Statist.
Comput., 8(5):716–733, 1987.
[66] Jesse L. Barlow and Udaya B. Vemulapati. A note on deferred correction for
equality constrained least squares problems. SIAM J. Numer. Anal., 29(1):
249-256, 1992.
[67] Jesse L. Barlow and Udaya B. Vemulapati. Rank detection methods for
sparse matrices. SIAM J. Matrix Anal. Appl., 13(4):1279-1297, 1992.
[68] S. Barnett and C. Storey. Some applications of the Lyapunov matrix equa-
tion. J. Inst. Maths Applies, 4:33–42, 1968.
[69] Geoff Barrett. Formal methods applied to a floating-point number system.
IEEE Trans. Software Engrg., 15(5):611–621, 1989.
[70] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Do-
nato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and
Henk van der Vorst. Templates for the Solution of Linear Systems: Building
Blocks for Iterative Methods. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 1994. xiii+112 pp. ISBN 0-89871-328-5.
[71] Anders Barrlund. Perturbation bounds for the LDL^H and LU decomposi-
tions. BIT, 31:358–363, 1991.
[72] Anders Barrlund. How integrals can be used to derive matrix perturbation
bounds. Report UMINF 92.11, Institute of Information Processing, Univer-
sity of Umeå, Sweden, September 1992. 8 pp.
[73] D. W. Barron and H. P. F. Swinnerton-Dyer. Solution of simultaneous linear
equations using a magnetic-tape store. Comput. J., 3(1):28–33, 1960.
[90] Frank Benford. The law of anomalous numbers. Proceedings of the American
Philosophical Society, 78(4):551–572, 1938.
[91] Commandant Benoit. Note sur une méthode de résolution des équations
normales provenant de l’application de la méthode des moindres carrés
à un système d’équations linéaires en nombre inférieur à celui des in-
connues. Application de la méthode à la résolution d’un système défini
d’équations linéaires (Procédé du Commandant Cholesky). Bulletin
Géodésique (Toulouse), 7(1):67-77, 1924. Cited in [22, Cottle’s translation].
[92] N. F. Benschop and H. C. Ratz. A mean square estimate of the generated
roundoff error in constant matrix iterative processes. J. Assoc. Comput.
Mach., 18(1):48-62, 1971.
[93] M. C. Berenbaum. Direct search methods in the optimisation of cancer
chemotherapy. Br. J. Cancer, 61: 101–109, 1991.
[94] Abraham Berman and Robert J. Plemmons. Nonnegative Matrices in the
Mathematical Sciences. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 1994. xx+340 pp. Corrected republication, with
supplement, of work first published in 1979 by Academic Press. ISBN 0-
89871-321-8.
[95] Rajendra Bhatia and Kalyan Mukherjea. Variation of the unitary part of a
matrix. SIAM J. Matrix Anal. Appl., 15(3):1007–1014, 1994.
[96] Rajendra Bhatia and Peter Rosenthal. How and why to solve the operator
equation AX – XB = Y. Bull. London Math. Soc., 1996. To appear.
[97] Dario Bini and Grazia Lotti. Stability of fast algorithms for matrix multipli-
cation. Numer. Math., 36:63–72, 1980.
[98] Dario Bini and Victor Y. Pan. Polynomial and Matrix Computations. Volume
1: Fundamental Algorithms. Birkhäuser, Boston, MA, USA, 1994. xvi+415
pp. ISBN 0-8176-3786-9.
[99] Garrett Birkhoff. Two Hadamard numbers for matrices. Comm. ACM, 18
(1):25-29, 1975.
[100] Garrett Birkhoff and Surender Gulati. Isotropic distributions of test matrices.
J. Appl. Math. Phys. (ZAMP), 30:148-158, 1979.
[101] Garrett Birkhoff and Robert E. Lynch. Numerical Solution of Elliptic Prob-
lems. Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1984. xi+319 pp. ISBN 0-89871-197-5.
[102] Garrett Birkhoff and Saunders Mac Lane. A Survey of Modern Algebra.
Fourth edition, Macmillan, New York, 1977. xi+500 pp. ISBN 0-02-310070-
2.
[103] Christian H. Bischof. Incremental condition estimation. SIAM J. Matrix
Anal. Appl., 11(2):312-322, 1990.
[104] Christian H. Bischof, John G. Lewis, and Daniel J. Pierce. Incremental
condition estimation for sparse matrices. SIAM J. Matrix Anal. Appl., 11
(4):644-659, 1990.
[187] John W. Carr III. Error analysis in floating point arithmetic. Comm. ACM,
2(5):10-15, 1959.
[188] Russell Carter. Y-MP floating point and Cholesky factorization. Internat.
J. High Speed Computing, 3(3/4):215-222, 1991.
[189] A. Cauchy. Exercices d’Analyse et de Phys. Math., volume 2. Paris, 1841.
Cited in [896].
[190] Françoise Chaitin-Chatelin and Valérie Frayssé. Lectures on Finite Precision
Computations. Society for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 1996. To appear.
[191] Raymond H. Chan, James G. Nagy, and Robert J. Plemmons. Circulant
preconditioned Toeplitz least squares iterations. SIAM J. Matrix Anal. Appl.,
15(1):80-97, 1994.
[192] T. F. Chan and P. C. Hansen. Some applications of the rank revealing QR
factorization. SIAM J. Sci. Statist. Comput., 13(3):727-741, 1992.
[193] Tony F. Chan and David E. Foulser. Effectively well-conditioned linear sys-
tems. SIAM J. Sci. Statist. Comput., 9(6):963–969, 1988.
[194] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for com-
puting the sample variance: Analysis and recommendations. Amer. Statist.,
37(3):242-247, 1983.
[195] Tony F. Chan and John Gregg Lewis. Computing standard deviations: Ac-
curacy. Comm. ACM, 22(9):526–531, 1979.
[196] Shivkumar Chandrasekaran and Ilse C. F. Ipsen. Backward errors for eigen-
value and singular value decompositions. Numer. Math., 68:215–223, 1994.
[197] Shivkumar Chandrasekaran and Ilse C. F. Ipsen. On the sensitivity of solu-
tion components in linear systems of equations. SIAM J. Matrix Anal. Appl.,
16(1):93-112, 1995.
[198] Xiao-Wen Chang and C. C. Paige. New perturbation bounds for the Cholesky
factorization. Manuscript, February 1995. 13 pp.
[199] Bruce W. Char, Keith O. Geddes, Gaston H. Gonnet, Benton L. Leong,
Michael B. Monagan, and Stephen M. Watt. Maple V Library Reference
Manual. Springer-Verlag, Berlin, 1991. xxv+698 pp. ISBN 3-540-97592-6.
[200] Bruce A. Chartres and James C. Geuder. Computable error bounds for direct
solution of linear equations. J. Assoc. Comput. Mach., 14(1):63–71, 1967.
[201] Françoise Chatelin. Eigenvalues of Matrices. Wiley, Chichester, UK, 1993.
xviii+382 pp. ISBN 0-471-93538-7.
[202] Françoise Chatelin and Marie-Christine Brunet. A probabilistic round-off
error propagation model. Application to the eigenvalue problem. In Reliable
Numerical Computation, M. G. Cox and S. J. Hammarling, editors, Oxford
University Press, Oxford, UK, 1990, pages 139-160.
[203] Françoise Chatelin and Valérie Frayssé. Elements of a condition theory for
the computational analysis of algorithms. In Iterative Methods in Linear
Algebra, R. Beauwens and P. de Groen, editors, Elsevier (North-Holland),
Amsterdam, The Netherlands, 1992, pages 15–25.
[204] Denise Chen and Cleve Moler. Symbolic Math Toolbox: User’s Guide. The
MathWorks, Inc., Natick, MA, USA, 1993.
[205] Jaeyoung Choi, Jack J. Dongarra, Susan Ostrouchov, Antoine P. Petitet,
David W. Walker, and R. Clint Whaley. The design and implementation
of the ScaLAPACK LU, QR and Cholesky factorization routines. Report
ORNL/TM-12470, Oak Ridge National Laboratory, Oak Ridge, TN, USA,
September 1994. 26 pp. LAPACK Working Note 80.
[206] Jaeyoung Choi, Jack J. Dongarra, Roldan Pozo, and David W. Walker.
ScaLAPACK: A scalable linear algebra library for distributed memory con-
current computers. Technical Report CS-92-181, Department of Computer
Science, University of Tennessee, Knoxville, TN, USA, November 1992. 8 pp.
LAPACK Working Note 55.
[207] Man-Duen Choi. Tricks or treats with the Hilbert matrix. Amer. Math.
Monthly, 90:301-312, 1983.
[208] Søren Christiansen and Per Christian Hansen. The effective condition number
applied to error analysis of certain boundary collocation methods. J. Comp.
Appl. Math., 54:15-36, 1994.
[209] Eleanor Chu and Alan George. A note on estimating the error in Gaussian
elimination without pivoting. ACM SIGNUM Newsletter, 20(2):2–7, 1985.
[210] King-wah Eric Chu. The solution of the matrix equations AXB - CXD = E
and (YA – DZ, YC – BZ) = (E, F). Linear Algebra Appl., 93:93–105, 1987.
[211] Barry A. Cipra. Computer-drawn pictures stalk the wild trajectory. Science,
241:1162–1163, 1988.
[212] C. W. Clenshaw. A note on the summation of Chebyshev series. M.T.A.C.,
9(51):118-120, 1955.
[213] C. W. Clenshaw and F. W. J. Olver. Beyond floating point. J. Assoc.
Comput. Mach., 31(2):319-328, 1984.
[214] C. W. Clenshaw, F. W. J. Olver, and P. R. Turner. Level-index arithmetic:
An introductory survey. In Numerical Analysis and Parallel Processing, Lan-
caster 1987, Peter R. Turner, editor, volume 1397 of Lecture Notes in Math-
ematics, Springer-Verlag, Berlin, 1989, pages 95–168.
[215] A. K. Cline. An elimination method for the solution of linear least squares
problems. SIAM J. Numer. Anal., 10(2):283-289, 1973.
[216] A. K. Cline, C. B. Moler, G. W. Stewart, and J. H. Wilkinson. An estimate
for the condition number of a matrix. SIAM J. Numer. Anal., 16(2):368-375,
1979.
[217] A. K. Cline and R. K. Rew. A set of counter-examples to three condition
number estimators. SIAM J. Sci. Statist. Comput., 4(4):602-611, 1983.
[218] Alan K. Cline, Andrew R. Conn, and Charles F. Van Loan. Generalizing the
LINPACK condition estimator. In Numerical Analysis, Mexico 1981, J. P.
Hennart, editor, volume 909 of Lecture Notes in Mathematics, Springer-Ver-
lag, Berlin, 1982, pages 73-83.
[251] M. G. Cox. The numerical evaluation of a spline from its B-spline represen-
tation. J. Inst. Maths Applics, 21:135–143, 1978.
[252] M. G. Cox and S. J. Hammarling, editors. Reliable Numerical Computation.
Oxford University Press, Oxford, UK, 1990. xvi+339 pp. ISBN 0-19-853564-
3.
[253] M. G. Cox and P. M. Harris. Overcoming an instability arising in a spline
approximation algorithm by using an alternative form of a simple rational
function. IMA Bulletin, 25(9):228-232, 1989.
[254] Fred D. Crary. A versatile precompiler for nonstandard arithmetics. ACM
Trans. Math. Software, 5(2):204-217, 1979.
[255] Prescott D. Crout. A short method for evaluating determinants and solving
systems of linear equations with real or complex coefficients. Trans. Amer.
Inst. Elec. Engrg., 60:1235-1241, 1941.
[256] C. W. Cryer. Pivot size in Gaussian elimination. Numer. Math., 12:335-345,
1968.
[257] Colin W. Cryer. The LU-factorization of totally positive matrices. Linear
Algebra Appl., 7:83-92, 1973.
[258] Colin W. Cryer. Some properties of totally positive matrices. Linear Algebra
Appl., 15:1-25, 1976.
[259] A. R. Curtis and J. K. Reid. On the automatic scaling of matrices for Gaus-
sian elimination. J. Inst. Maths Applics, 10:118–124, 1972.
[260] George Cybenko and Charles F. Van Loan. Computing the minimum eigen-
value of a symmetric positive definite Toeplitz matrix. SIAM J. Sci. Statist.
Comput., 7(1):123-131, 1986.
[261] G. Dahlquist. On matrix majorants and minorants, with applications to
differential equations. Linear Algebra Appl., 52/53:199-216, 1983.
[262] Germund Dahlquist and Åke Björck. Numerical Methods. Prentice-Hall,
Englewood Cliffs, NJ, USA, 1974. xviii+573 pp. Translated by Ned Anderson.
ISBN 0-13-627315-7.
[263] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart. Reorthogonal-
ization and stable algorithms for updating the Gram-Schmidt QR factoriza-
tion. Math. Comp., 30(136):772–795, 1976.
[264] B. Danloy. On the choice of signs for Householder’s matrices. J. Comp. Appl.
Math., 2(1):67-69, 1976.
[265] Karabi Datta. The matrix equation XA – BX = R and its applications.
Linear Algebra Appl., 109:91-105, 1988.
[266] Philip J. Davis. Circulant Matrices. Wiley, New York, 1979. xv+250 pp.
ISBN 0-471-05771-1.
[267] Philip J. Davis and Philip Rabinowitz. Methods of Numerical Integration.
Second edition, Academic Press, London, 1984. xiv+612 pp. ISBN 0-12-
206360-0.
[268] Achiya Dax. Partial pivoting strategies for symmetric Gaussian elimination.
Math. Programming, 22:288-303, 1982.
[269] Achiya Dax. The convergence of linear stationary iterative processes for
solving singular unstructured systems of linear equations. SIAM Rev., 32(4):
611-635, 1990.
[270] Achiya Dax and S. Kaniel. Pivoting techniques for symmetric Gaussian
elimination. Numer. Math., 28:221-241, 1977.
[271] Jane M. Day and Brian Peterson. Growth in Gaussian elimination. Amer.
Math. Monthly, 95(6):489–513, 1988.
[272] Carl de Boor. On calculating with B-splines. J. Approx. Theory, 6:50–62,
1972.
[273] Carl de Boor and Allan Pinkus. Backward error analysis for totally positive
linear systems. Numer. Math., 27:485-490, 1977.
[274] Lieuwe Sytse de Jong. Towards a formal definition of numerical stability.
Numer. Math., 28:211-219, 1977.
[275] T. J. Dekker. A floating-point technique for extending the available precision.
Numer. Math., 18:224-242, 1971.
[276] T. J. Dekker and W. Hoffmann. Rehabilitation of the Gauss-Jordan algo-
rithm. Numer. Math., 54:591–599, 1989.
[277] Cédric J. Demeure. Fast QR factorization of Vandermonde matrices. Linear
Algebra Appl., 122/123/124:165–194, 1989.
[278] Cédric J. Demeure. QR factorization of confluent Vandermonde matrices.
IEEE Trans. Acoust., Speech, Signal Processing, 38(10):1799-1802, 1990.
[279] James W. Demmel. The condition number of equivalence transformations
that block diagonalize matrix pencils. SIAM J. Numer. Anal., 20(3):599-
610, 1983.
[280] James W. Demmel. Underflow and the reliability of numerical software.
SIAM J. Sci. Statist. Comput., 5(4):887-919, 1984.
[281] James W. Demmel. On condition numbers and the distance to the nearest
ill-posed problem. Numer. Math., 51:251–289, 1987.
[282] James W. Demmel. On error analysis in arithmetic with varying relative
precision. In Proceedings of the Eighth Symposium on Computer Arithmetic,
Como, Italy, Mary Jane Irwin and Renato Stefanelli, editors, IEEE Computer
Society, Washington, DC, 1987, pages 148-152.
[283] James W. Demmel. On floating point errors in Cholesky. Technical Re-
port CS-89-87, Department of Computer Science, University of Tennessee,
Knoxville, TN, USA, October 1989. 6 pp. LAPACK Working Note 14.
[284] James W. Demmel. On the odor of IEEE arithmetic. NA Digest, Volume 91,
Issue 39, 1991. (Response to a message “IEEE Arithmetic Stinks” in Volume
91, Issue 33). Electronic mail magazine: na.help@na-net.ornl.gov.
[285] James W. Demmel. The componentwise distance to the nearest singular
matrix. SIAM J. Matrix Anal. Appl., 13(1):10-19, 1992.
[286] James W. Demmel. Open problems in numerical linear algebra. IMA Preprint
Series #961, Institute for Mathematics and its Applications, University of
Minnesota, Minneapolis, MN, USA, April 1992. 21 pp. LAPACK Working
Note 47.
[287] James W. Demmel. A specification for floating point parallel prefix. Techni-
cal Report CS-92-167, Department of Computer Science, University of Ten-
nessee, Knoxville, TN, USA, July 1992. 8 pp. LAPACK Working Note 49.
[288] James W. Demmel. Trading off parallelism and numerical stability. In Lin-
ear Algebra for Large Scale and Real-Time Applications, Marc S. Moonen,
Gene H. Golub, and Bart L. De Moor, editors, volume 232 of NATO ASI
Series E, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1993,
pages 49-68.
[289] James W. Demmel, Inderjit Dhillon, and Huan Ren. On the correctness of
parallel bisection in floating point. Technical Report CS-94-228, Department
of Computer Science, University of Tennessee, Knoxville, TN, USA, March
1994. 38 pp. LAPACK Working Note 70.
[290] James W. Demmel, J. J. Dongarra, and W. Kahan. On designing portable
high performance numerical libraries. In Numerical Analysis 1991, Proceed-
ings of the 14th Dundee Conference, D. F. Griffiths and G. A. Watson, editors,
volume 260 of Pitman Research Notes in Mathematics, Longman Scientific
and Technical, Essex, UK, 1992, pages 69-84.
[291] James W. Demmel and Nicholas J. Higham. Stability of block algorithms
with fast level-3 BLAS. ACM Trans. Math. Software, 18(3):274–291, 1992.
[292] James W. Demmel and Nicholas J. Higham. Improved error bounds for
underdetermined system solvers. SIAM J. Matrix Anal. Appl., 14(1):1-14,
1993.
[293] James W. Demmel, Nicholas J. Higham, and Robert S. Schreiber. Stability
of block LU factorization. Numerical Linear Algebra with Applications, 2(2):
173-190, 1995.
[294] James W. Demmel and Bo Kågström. Computing stable eigendecompositions
of matrix pencils. Linear Algebra Appl., 88/89:139-186, 1987.
[295] James W. Demmel and Bo Kågström. Accurate solutions of ill-posed prob-
lems in control theory. SIAM J. Matrix Anal. Appl., 9(1):126-145, 1988.
[296] James W. Demmel and W. Kahan. Accurate singular values of bidiagonal
matrices. SIAM J. Sci. Statist. Comput., 11(5):873–912, 1990.
[297] James W. Demmel and F. Krückeberg. An interval algorithm for solving
systems of linear equations to prespecified accuracy. Computing, 34:117–129,
1985.
[298] James W. Demmel and Xiaoye Li. Faster numerical algorithms via exception
handling. IEEE Trans. Comput., 43(8):983-992, 1994.
[299] James W. Demmel and A. McKenney. A test matrix generation suite.
Preprint MCS-P69-0389, Mathematics and Computer Science Division, Ar-
gonne National Laboratory, Argonne, IL, USA, March 1989. 16 pp. LAPACK
Working Note 9.
[300] J. E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Uncon-
strained Optimization and Nonlinear Equations. Prentice-Hall, Englewood
Cliffs, NJ, USA, 1983. xiii+378 pp. ISBN 0-13-627216-9.
[301] J. E. Dennis, Jr. and Virginia Torczon. Direct search methods on parallel
machines. SIAM J. Optim., 1(4):448-474, 1991.
[302] J. E. Dennis, Jr. and Homer F. Walker. Inaccuracy in quasi-Newton methods:
Local improvement theorems. Math. Prog. Study, 22:70-85, 1984.
[303] John E. Dennis, Jr. and Daniel J. Woods. Optimization on microcomput-
ers: The Nelder-Mead simplex algorithm. In New Computing Environments:
Microcomputers in Large-Scale Computing, Arthur Wouk, editor, Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, 1987, pages
116-122.
[304] J. Descloux. Note on the round-off errors in iterative processes. Math. Comp.,
17:18-27, 1963.
[305] Harold G. Diamond. Stability of rounded off inverses under iteration. Math.
Comp., 32(141):227-232, 1978.
[306] John D. Dixon. Estimating extremal eigenvalues and condition numbers of
matrices. SIAM J. Numer. Anal., 20(4):812–814, 1983.
[307] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK
Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 1979. ISBN 0-89871-172-X.
[308] J. J. Dongarra, J. J. Du Croz, I. S. Duff, and S. J. Hammarling. A set of
level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16:
1–17, 1990.
[309] J. J. Dongarra, J. J. Du Croz, I. S. Duff, and S. J. Hammarling. Algorithm
679. A set of level 3 basic linear algebra subprograms: Model implementation
and test programs. ACM Trans. Math. Software, 16:18–28, 1990.
[310] J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra
algorithms for dense matrices on a vector pipeline machine. SIAM Rev., 26
(1):91-112, 1984.
[311] Jack Dongarra, Roldan Pozo, and David W. Walker. LAPACK++ V1.0:
High performance linear algebra users’ guide. Technical Report CS-95-290,
Department of Computer Science, University of Tennessee, Knoxville, TN,
USA, May 1995. 31 pp. LAPACK Working Note 98.
[312] Jack J. Dongarra. Performance of various computers using standard linear
equations software. Technical Report CS-89-85, Department of Computer
Science, University of Tennessee, Knoxville, TN, USA, February 1995. 34
pp.
[313] Jack J. Dongarra, Jeremy J. Du Croz, Sven J. Hammarling, and Richard J.
Hanson. An extended set of Fortran basic linear algebra subprograms. ACM
Trans. Math. Software, 14(1):1-17, 1988.
[330] Iain S. Duff and John K. Reid. MA47, a Fortran code for direct solution
of indefinite sparse symmetric linear systems. Report RAL-95-001, Atlas
Centre, Rutherford Appleton Laboratory, Didcot, Oxon, UK, January 1995.
63 pp.
[331] Iain S. Duff, John K. Reid, Niels Munksgaard, and Hans B. Nielsen. Direct
solution of sets of linear equations whose matrix is sparse, symmetric and
indefinite. J. Inst. Maths Applics, 23:235–250, 1979.
[332] William Dunham. Journey Through Genius: The Great Theorems of Math-
ematics. Penguin, New York, 1990. xiii+300 pp. ISBN 0-14-014739-X.
[333] Paul S. Dwyer. Linear Computations. Wiley, New York, 1951. xi+344 pp.
[334] Carl Eckart and Gale Young. The approximation of one matrix by another
of lower rank. Psychometrika, 1(3):211–218, 1936.
[335] Alan Edelman. Eigenvalues and condition numbers of random matrices.
SIAM J. Matrix Anal. Appl., 9(4):543-560, 1988.
[336] Alan Edelman. The distribution and moments of the smallest eigenvalue of
a random matrix of Wishart type. Linear Algebra Appl., 159:55–80, 1991.
[337] Alan Edelman. The first annual large dense linear system survey. ACM
SIGNUM Newsletter, 26:6-12, October 1991.
[338] Alan Edelman. The complete pivoting conjecture for Gaussian elimination
is false. The Mathematica Journal, 2:58–61, 1992.
[339] Alan Edelman. On the distribution of a scaled condition number. Math.
Comp., 58(197):185-190, 1992.
[340] Alan Edelman. Eigenvalue roulette and random test matrices. In Linear Al-
gebra for Large Scale and Real-Time Applications, Marc S. Moonen, Gene H.
Golub, and Bart L. De Moor, editors, volume 232 of NATO ASI Series E,
Kluwer Academic Publishers, Dordrecht, The Netherlands, 1993, pages 365-
368.
[341] Alan Edelman. Large dense numerical linear algebra in 1993: The parallel
computing influence. Int. J. Supercomputer Appl., 7(2):113–128, 1993.
[342] Alan Edelman. Scalable dense numerical linear algebra in 1994: The multi-
computer influence. In Proceedings of the Fifth SIAM Conference on Applied
Linear Algebra, John G. Lewis, editor, Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 1994, pages 344-348.
[343] Alan Edelman. When is x * (1/x) ≠ 1? Manuscript, 1994.
[344] Alan Edelman, Eric Kostlan, and Michael Shub. How many eigenvalues of a
random matrix are real? J. Amer. Math. Soc., 7(1):247–267, 1994.
[345] Alan Edelman and Walter Mascarenhas. On the complete pivoting conjecture
for a Hadamard matrix of order 12. Linear and Multilinear Algebra, 38(3):
181–188, 1995.
[346] Alan Edelman and H. Murakami. Polynomial roots from companion matrix
eigenvalues. Math. Comp., 64(210):763–776, 1995.
[347] Alan Edelman and G. W. Stewart. Scaling for orthogonality. IEEE Trans.
Signal Processing, 41(4):1676-1677, 1993.
[348] Editor’s note. SIAM J. Matrix Anal. Appl., 12(3), 1991.
[349] Timo Eirola. Aspects of backward error analysis of numerical ODEs. J.
Comp. Appl. Math., 45(1-2):65-73, 1993.
[350] Lars Eldén. Perturbation theory for the least squares problem with linear
equality constraints. SIAM J. Numer. Anal., 17(3):338–350, 1980.
[351] Samuel K. Eldersveld and Michael A. Saunders. A block-LU update for
large-scale linear programming. SIAM J. Matrix Anal. Appl., 13(1):191-201,
1992.
[352] W. H. Enright. A new error-control for initial value solvers. Appl. Math.
Comput., 31:288-301, 1989.
[353] Michael A. Epton. Methods for the solution of AXD – BXC = E and its ap-
plication in the numerical solution of implicit ordinary differential equations.
BIT, 20:341–345, 1980.
[354] A. M. Erisman, R. G. Grimes, J. G. Lewis, W. G. Poole, and H. D. Si-
mon. Evaluation of orderings for unsymmetric sparse matrices. SIAM J. Sci.
Statist. Comput., 8(4):600-624, 1987.
[355] A. M. Erisman and J. K. Reid. Monitoring the stability of the triangular
factorization of a sparse matrix. Numer. Math., 22:183-186, 1974.
[356] Terje O. Espelid. On floating-point summation. Report No. 67, Department
of Applied Mathematics, University of Bergen, Bergen, Norway, December
1978.
[357] Christopher Evans. Interview with J. H. Wilkinson. Number 10 in Pioneers of
Computing, 60-Minute Recordings of Interviews. Science Museum, London,
1976.
[358] John Ewing, editor. A Century of Mathematics Through the Eyes of the
Monthly. Mathematical Association of America, Washington, DC, 1994.
xi+323 pp. ISBN 0-88385-459-7.
[359] John H. Ewing and F. W. Gehring, editors. Paul Halmos: Celebrating 50
Years of Mathematics. Springer-Verlag, New York, 1991. viii+320 pp. ISBN
3-540-97509-8.
[360] V. N. Faddeeva. Computational Methods of Linear Algebra. Dover, New
York, 1959. x+252 pp. ISBN 0-486-60424-1.
[361] Ky Fan and A. J. Hoffman. Some metric inequalities in the space of matrices.
Proc. Amer. Math. Soc., 6:111-116, 1955.
[362] R. W. Farebrother. A memoir of the life of M. H. Doolittle. IMA Bulletin,
23(6/7):102, 1987.
[363] R. W. Farebrother. Linear Least Squares Computations. Marcel Dekker, New
York, 1988. xiii+293 pp. ISBN 0-8247-7661-5.
[364] Charles Farnum. Compiler support for floating-point computation.
Software—Practice and Experience, 18(7):701-709, 1988.
[429] Walter Gautschi and Gabriele Inglese. Lower bounds for the condition num-
ber of Vandermonde matrices. Numer. Math., 52:241-250, 1988.
[430] Werner Gautschi. The asymptotic behaviour of powers of matrices. Duke
Math. J., 20:127-140, 1953.
[431] David M. Gay. Correctly rounded binary-decimal and decimal-binary con-
versions. Numerical Analysis Manuscript 90-10, AT&T Bell Laboratories,
Murray Hill, NJ, USA, November 1990. 16 pp.
[432] Stuart Geman. The spectral radius of large random matrices. Ann. Probab.,
14(4):1318-1328, 1986.
[433] W. M. Gentleman. An error analysis of Goertzel’s (Watt’s) method for com-
puting Fourier coefficients. Comput. J., 12:160-165, 1969.
[434] W. M. Gentleman and G. Sande. Fast Fourier transforms-for fun and profit.
In Fall Joint Computer Conference, volume 29 of AFIPS Conference Proceed-
ings, Spartan Books, Washington, DC, 1966, pages 563–578.
[435] W. Morven Gentleman. Least squares computations by Givens transforma-
tions without square roots. J. Inst. Maths Applics, 12:329–336, 1973.
[436] W. Morven Gentleman. Error analysis of QR decompositions by Givens
transformations. Linear Algebra Appl., 10:189–197, 1975.
[437] W. Morven Gentleman and Scott B. Marovich. More on algorithms that
reveal properties of floating point arithmetic units. Comm. ACM, 17(5):
276-277, 1974.
[438] Alan George and Joseph W-H Liu. Computer Solution of Large Sparse Pos-
itive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, USA, 1981.
xii+324 pp. ISBN 0-13-165274-5.
[439] A. J. Geurts. A contribution to the theory of condition. Numer. Math., 39:
85-96, 1982.
[440] Ali R. Ghavimi and Alan J. Laub. Backward error, sensitivity, and refine-
ment of computed solutions of algebraic Riccati equations. Numerical Linear
Algebra with Applications, 2(1):29-49, 1995.
[441] Ali R. Ghavimi and Alan J. Laub. Residual bounds for discrete-time Lya-
punov equations. IEEE Trans. Automat. Control, 40(7):1244-1249, 1995.
[442] P. E. Gill and W. Murray. A numerically stable form of the simplex algorithm.
Linear Algebra Appl., 7:99-138, 1973.
[443] Philip E. Gill, Walter Murray, Dulce B. Ponceleón, and Michael A. Saun-
ders. Preconditioners for indefinite systems arising in optimization. SIAM
J. Matrix Anal. Appl., 13(1):292-311, 1992.
[444] Philip E. Gill, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
User’s guide for NPSOL (version 4.0): A Fortran package for nonlinear pro-
gramming. Technical Report SOL 86-2, Department of Operations Research,
Stanford University, Stanford, CA, January 1986. 53 pp.
[445] Philip E. Gill, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
A Schur-complement method for sparse quadratic programming. In Reliable
Numerical Computation, M. G. Cox and S. J. Hammarling, editors, Oxford
University Press, Oxford, UK, 1990, pages 113-138.
[446] Philip E. Gill, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
Inertia-controlling methods for general quadratic programming. SIAM Rev.,
33(1):1–36, 1991.
[447] Philip E. Gill, Walter Murray, and Margaret H. Wright. Practical Optimiza-
tion. Academic Press, London, 1981. xvi+401 pp. ISBN 0-12-283952-8.
[448] Philip E. Gill, Michael A. Saunders, and Joseph R. Shinnerl. On the stability
of Cholesky factorization for symmetric quasi-definite systems. SIAM J.
Matrix Anal. Appl., 17(1):35-46, 1996.
[449] S. Gill. A process for the step-by-step integration of differential equations
in an automatic digital computing machine. Proc. Cambridge Phil. Soc., 47:
96-108, 1951.
[450] T. A. Gillespie. Noncommutative variations on theorems of Marcel Riesz and
others. In Paul Halmos: Celebrating 50 Years of Mathematics, John H. Ewing
and F. W. Gehring, editors, Springer-Verlag, Berlin, 1991, pages 221–236.
[451] Wallace J. Givens. Numerical computation of the characteristic values of a
real symmetric matrix. Technical Report ORNL-1574, Oak Ridge National
Laboratory, Oak Ridge, TN, USA, 1954. 107 pp.
[452] James Glanz. Mathematical logic flushes out the bugs in chip designs. Sci-
ence, 267:332–333, 1995. 20 January.
[453] J. Gluchowska and A. Smoktunowicz. Solving the linear least squares problem
with very high relative accuracy. Computing, 45:345–354, 1990.
[454] I. Gohberg and I. Koltracht. Error analysis for triangular factorization of
Cauchy and Vandermonde matrices. Manuscript, 1990.
[455] I. Gohberg and I. Koltracht. Mixed, componentwise, and structured condition
numbers. SIAM J. Matrix Anal. Appl., 14(3):688–704, 1993.
[456] I. Gohberg and V. Olshevsky. Fast inversion of Chebyshev–Vandermonde
matrices. Numer. Math., 67(1):71–92, 1994.
[457] David Goldberg. What every computer scientist should know about floating-
point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
[458] I. Bennett Goldberg. 27 bits are not enough for 8-digit accuracy. Comm.
ACM, 10(2):105-106, 1967.
[459] Moshe Goldberg and E. G. Straus. Multiplicativity of lp norms for matrices.
Linear Algebra Appl., 52/53:351–360, 1983.
[460] Herman H. Goldstine. The Computer: From Pascal to von Neumann. Prince-
ton University Press, Princeton, NJ, USA, 1972. xii+378 pp. 1993 printing
with new preface. ISBN 0-691-02367-0.
[493] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Spring-
er-Verlag, Berlin, 1991. xv+601 pp. ISBN 3-540-53775-9.
[494] Marshall Hall, Jr. Combinatorial Theory. Blaisdell, Waltham, MA, USA,
1967. x+310 pp.
[495] Hozumi Hamada. A new real number representation and its operation. In
Proceedings of the Eighth Symposium on Computer Arithmetic, Como, Italy,
Mary Jane Irwin and Renato Stefanelli, editors, IEEE Computer Society,
Washington, DC, 1987, pages 153–157.
[496] S. J. Hammarling. Numerical solution of the stable, non-negative definite
Lyapunov equation. IMA J. Numer. Anal., 2:303-323, 1982.
[497] S. J. Hammarling and J. H. Wilkinson. The practical behaviour of linear iter-
ative methods with particular reference to S.O.R. Report NAC 69, National
Physical Laboratory, Teddington, UK, September 1976. 19 pp.
[498] Sven Hammarling. A note on modifications to the Givens plane rotation. J.
Inst. Maths Applics, 13:215-218, 1974.
[499] Stephen M. Hammel, James A. Yorke, and Celso Grebogi. Numerical orbits
of chaotic processes represent true orbits. Bull. Amer. Math. Soc., 19(2):
465–469, 1988.
[500] Rolf Hammer, Matthias Hocks, Ulrich Kulisch, and Dietmar Ratz. Numerical
Toolbox for Verified Computing I. Basic Numerical Problems: Theory, Algo-
rithms, and Pascal-XSC Programs. Springer-Verlag, Berlin, 1993. xiii+337
pp. ISBN 3-540-57118-3.
[501] R. W. Hamming. Numerical Methods for Scientists and Engineers. Second
edition, McGraw-Hill, New York, 1973. ix+721 pp. ISBN 0-07-025887-2.
[502] G. H. Hardy. A Course of Pure Mathematics. Tenth edition, Cambridge
University Press, Cambridge, UK, 1967. xii+509 pp. ISBN 0-521-09227-2.
[503] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Second edition,
Cambridge University Press, Cambridge, UK, 1952. xii+324 pp.
[504] Richard Harter. The optimality of Winograd’s formula. Comm. ACM, 15
(5):352, 1972.
[505] D. J. Hartfiel. Concerning the solution set of Ax = b where P < A < Q and
p < b < q. Numer. Math., 35:355-359, 1980.
[506] A Manual of Operation for the Automatic Sequence Controlled Calculator.
Harvard University Press, Cambridge, MA, USA, 1946. Reprinted, with
new foreword and introduction, Volume 8 in the Charles Babbage Institute
Reprint Series for the History of Computing, MIT Press, Cambridge, MA,
USA, 1985. xxxii+561 pp. ISBN 0-262-01084-4.
[507] Proceedings of a Symposium on Large-Scale Digital Calculating Machinery,
volume 16 of The Annals of the Computation Laboratory of Harvard Univer-
sity. Harvard University Press, Cambridge, MA, USA, 1948. Reprinted, with
a new introduction by William Aspray, Volume 7 in the Charles Babbage In-
stitute Reprint Series for the History of Computing, MIT Press, Cambridge,
MA, USA, 1985. xxix+302 pp. ISBN 0-262-08152-0.
[525] Desmond J. Higham. Remark on Algorithm 669. ACM Trans. Math. Soft-
ware, 17(3):424-426, 1991.
[526] Desmond J. Higham. Condition numbers and their condition numbers. Linear
Algebra Appl., 214:193-213, 1995.
[527] Desmond J. Higham and Nicholas J. Higham. Backward error and condition
of structured linear systems. SIAM J. Matrix Anal. Appl., 13(1):162-175,
January 1992.
[528] Desmond J. Higham and Nicholas J. Higham. Componentwise perturbation
theory for linear systems with multiple right-hand sides. Linear Algebra
Appl., 174:111-129, 1992.
[529] Desmond J. Higham and Lloyd N. Trefethen. Stiffness of ODEs. BIT, 33:
285-303, 1993.
[530] Nicholas J. Higham. Computing the polar decomposition—with applications.
SIAM J. Sci. Statist. Comput., 7(4):1160-1174, 1986.
[531] Nicholas J. Higham. Efficient algorithms for computing the condition number
of a tridiagonal matrix. SIAM J. Sci. Statist. Comput., 7(1):150-165, 1986.
[532] Nicholas J. Higham. Computing real square roots of a real matrix. Linear
Algebra Appl., 88/89:405-430, 1987.
[533] Nicholas J. Higham. Error analysis of the Björck-Pereyra algorithms for
solving Vandermonde systems. Numer. Math., 50(5):613-632, 1987.
[534] Nicholas J. Higham. A survey of condition number estimation for triangular
matrices. SIAM Rev., 29(4):575–596, 1987.
[535] Nicholas J. Higham. Computing a nearest symmetric positive semidefinite
matrix. Linear Algebra Appl., 103:103–118, 1988.
[536] Nicholas J. Higham. Fast solution of Vandermonde-like systems involving
orthogonal polynomials. IMA J. Numer. Anal., 8:473-486, 1988.
[537] Nicholas J. Higham. FORTRAN codes for estimating the one-norm of a real
or complex matrix, with applications to condition estimation (Algorithm
674). ACM Trans. Math. Software, 14(4):381-396, 1988.
[538] Nicholas J. Higham. The accuracy of solutions to triangular systems. SIAM
J. Numer. Anal., 26(5):1252-1265, 1989.
[539] Nicholas J. Higham. Matrix nearness problems and applications. In Appli-
cations of Matrix Theory, M. J. C. Gover and S. Barnett, editors, Oxford
University Press, Oxford, UK, 1989, pages 1–27.
[540] Nicholas J. Higham. Analysis of the Cholesky decomposition of a semi-
definite matrix. In Reliable Numerical Computation, M. G. Cox and S. J.
Hammarling, editors, Oxford University Press, Oxford, UK, 1990, pages 161–
185.
[541] Nicholas J. Higham. Bounding the error in Gaussian elimination for tridiag-
onal systems. SIAM J. Matrix Anal. Appl., 11(4):521–530, 1990.
[588] D. Y. Hu and L. Reichel. Krylov subspace methods for the Sylvester equation.
Linear Algebra Appl., 172:283-313, 1992.
[589] T. E. Hull. Correctness of numerical software. In Performance Evaluation of
Numerical Software, Lloyd D. Fosdick, editor, North-Holland, Amsterdam,
The Netherlands, 1979, pages 3–15.
[590] T. E. Hull. Precision control, exception handling and a choice of numerical
algorithms. In Numerical Analysis Proceedings, Dundee 1981, G. A. Watson,
editor, volume 912 of Lecture Notes in Mathematics, Springer-Verlag, Berlin,
1982, pages 169-178.
[591] T. E. Hull, A. Abraham, M. S. Cohen, A. F. X. Curley, C. B. Hall, D. A.
Penny, and J. T. M. Sawchuk. Numerical TURING. ACM SIGNUM Newslet-
ter, 20(3):26–34, 1985.
[592] T. E. Hull, Thomas F. Fairgrieve, and Ping Tak Peter Tang. Implementing
complex elementary functions using exception handling. ACM Trans. Math.
Software, 20(2):215-244, 1994.
[593] T. E. Hull and J. R. Swenson. Tests of probabilistic models for propagation
of roundoff errors. Comm. ACM, 9(2):108–113, 1966.
[594] M. A. Hyman. Eigenvalues and eigenvectors of general matrices. Presented
at the 12th National Meeting of the Association for Computing Machinery,
Houston, Texas, 1957. Cited in [1088].
[595] IBM. Engineering and Scientific Subroutine Library, Guide and Reference,
Release 3. Fourth Edition (Program Number 5668-863), 1988.
[596] AIX Version 3.2 for RISC System\6000: Optimization and Tuning Guide for
Fortran, C, and C++. IBM, December 1993. viii+305 pp. Publication No.
SC09-1705-00.
[597] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard
754-1985. Institute of Electrical and Electronics Engineers, New York, 1985.
Reprinted in SIGPLAN Notices, 22(2):9-25, 1987.
[598] A Radix-Independent Standard for Floating-Point Arithmetic, IEEE Stan-
dard 854-1987. IEEE Computer Society, New York, 1987.
[599] IEEE Computer Society Microprocessor Standards Committee, Floating-
Point Working Group. A proposed standard for binary floating-point arith-
metic, Draft 8.0 of IEEE Task P754 (with introductory comments by David
Stevenson). Computer, 14:51-62, 1981.
[600] Yasuhiko Ikebe. On inverses of Hessenberg matrices. Linear Algebra Appl.,
24:93-97, 1979.
[601] ILAS Education Committee. Report on graduate linear algebra courses.
Manuscript from the International Linear Algebra Society, November 1993.
URL = http://gauss.technion.ac.il/iic/GRAD-ED.SYLLABI. 4 pp.
[602] Cray Research Inc. UNICOS Math and Scientific Library Reference Manual.
Number SR-2081, Version 5.0, Eagan, MN, USA, 1989.
[618] T. L. Jordan. Experiments on error growth associated with some linear least-
squares procedures. Math. Comp., 22:579–588, 1968.
[619] George Gheverghese Joseph. The Crest of the Peacock: Non-European Roots
of Mathematics. Penguin, New York, 1991. xv+371 pp. ISBN 0-14-012529-9.
[620] Bo Kågström. A perturbation analysis of the generalized Sylvester equation
(AR – LB, DR – LE) = (C, F). SIAM J. Matrix Anal. Appl., 15(4):1045-
1060, 1994.
[621] Bo Kågström and Peter Poromaa. Distributed and shared memory block
algorithms for the triangular Sylvester equation with sep-1 estimators. SIAM
J. Matrix Anal. Appl., 13(1):90-101, 1992.
[622] Bo Kågström and Peter Poromaa. LAPACK-style algorithms and software
for solving the generalized Sylvester equation and estimating the separation
between regular matrix pairs. Report UMINF 93.23, Institute of Information
Processing, University of Umeå, Umeå, Sweden, December 1993. 35 pp.
LAPACK Working Note 75.
[623] Bo Kågström and Peter Poromaa. Computing eigenspaces with specified
eigenvalues of a regular matrix pair (A, B) and condition estimation: Theory,
algorithms and software. Report UMINF 94.04, Institute of Information
Processing, University of Umeå, Umeå, Sweden, September 1994. 65 pp.
LAPACK Working Note 87.
[624] Bo Kågström and Lars Westin. Generalized Schur methods with condition
estimators for solving the generalized Sylvester equation. IEEE Trans. Au-
tomat. Control, AC-34(7):745-751, 1989.
[625] W. Kahan. Further remarks on reducing truncation errors. Comm. ACM, 8
(1):40, 1965.
[626] W. Kahan. Numerical linear algebra. Canad. Math. Bull., 9:757-801, 1966.
[627] W. Kahan. A survey of error analysis. In Proc. IFIP Congress, Ljubljana,
Information Processing 71, North-Holland, Amsterdam, The Netherlands,
1972, pages 1214-1239.
[628] W. Kahan. Implementation of algorithms (lecture notes by W. S. Haugeland
and D. Hough). Technical Report 20, Department of Computer Science,
University of California, Berkeley, CA, USA, 1973.
[629] W. Kahan. Interval arithmetic options in the proposed IEEE floating point
arithmetic standard. In Interval Mathematics 1980, Karl L. E. Nickel, editor,
Academic Press, New York, 1980, pages 99-128.
[630] W. Kahan. Why do we need a floating-point arithmetic standard? Technical
report, University of California, Berkeley, CA, USA, February 1981. 41 pp.
[631] W. Kahan. To solve a real cubic equation. Technical Report PAM-352,
Center for Pure and Applied Mathematics, University of California, Berkeley,
CA, USA, November 1986. 20 pp.
[632] W. Kahan. Branch cuts for complex elementary functions or much ado about
nothing’s sign bit. In The State of the Art in Numerical Analysis, A. Iserles
[650] Charles Kenney and Gary Hewer. The sensitivity of the algebraic and differ-
ential Riccati equations. SIAM J. Control Optim., 28(1):50-69, 1990.
[651] Charles Kenney and Alan J. Laub. Controllability and stability radii for
companion form systems. Math. Control Signals Systems, 1:239-256, 1988.
[652] Charles S. Kenney and Alan J. Laub. Small-sample statistical condition
estimates for general matrix functions. SIAM J. Sci. Comput., 15(1):36-61,
1994.
[653] Charles S. Kenney, Alan J. Laub, and Philip M. Papadopoulos. Matrix sign
algorithms for Riccati equations. IMA J. of Math. Control Inform., 9:331-
344, 1992.
[654] Thomas H. Kerr. Fallacies in computational testing of matrix positive defi-
niteness/semidefiniteness. IEEE Trans. Aerospace Electron. Systems, 26(2):
415–421, 1990.
[655] Andrzej Kiełbasiński. Summation algorithm with corrections and some of its
applications. Math. Stos., 1:22–41, 1973. (In Polish, cited in [609] and [611]).
[656] Andrzej Kiełbasiński. Iterative refinement for linear systems in variable
precision arithmetic. BIT, 21:97-103, 1981.
[657] Andrzej Kiełbasiński. A note on rounding-error analysis of Cholesky factor-
ization. Linear Algebra Appl., 88/89:487–494, 1987.
[658] Andrzej Kiełbasiński and Hubert Schwetlick. Numerische Lineare Algebra:
Eine Computerorientierte Einführung. VEB Deutscher Verlag der Wissenschaften, Berlin, 1988. 472 pp.
ISBN 3-87144999-7.
[659] Andrzej Kiełbasiński and Hubert Schwetlick. Numeryczna Algebra Liniowa:
Wprowadzenie do Obliczeń Zautomatyzowanych. Wydawnictwa Naukowo-
Techniczne, Warszawa, 1992. 502 pp. ISBN 83-204-1260-9.
[660] Fuad Kittaneh. Singular values of companion matrices and bounds on zeros
of polynomials. SIAM J. Matrix Anal. Appl., 16(1):333–340, 1995.
[661] R. Klatte, U. W. Kulisch, C. Lawo, M. Rauch, and A. Wiethoff. C-XSC:
A C++ Class Library for Extended Scientific Computing. Springer-Verlag,
Berlin, 1993. ISBN 0-387-56328-8.
[662] R. Klatte, U. W. Kulisch, M. Neaga, D. Ratz, and Ch. Ullrich. PASCAL-
XSC—Language Reference With Examples. Springer-Verlag, Berlin, 1992.
[663] Philip A. Knight. Error Analysis of Stationary Iteration and Associated
Problems. Ph.D. thesis, University of Manchester, Manchester, England,
September 1993. 135 pp.
[664] Philip A. Knight. Fast rectangular matrix multiplication and QR decompo-
sition. Linear Algebra Appl., 221:69–81, 1995.
[665] Donald E. Knuth. Evaluation of polynomials by computer. Comm. ACM, 5
(12):595-599, 1962.
[666] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley,
Reading, MA, USA, 1973-1981. Three volumes.
[699] Frans Lemeire. Bounds for condition numbers of triangular and trapezoid
matrices. BIT, 15:58–64, 1975.
[700] H. Leuprecht and W. Oberaigner. Parallel algorithms for the rounding exact
summation of floating point numbers. Computing, 28:89-104, 1982.
[701] T. Y. Li and Z. Zeng. Homotopy-determinant algorithm for solving nonsym-
metric eigenvalue problems. Math. Comp., 59(200):483–502, 1992.
[702] Information Technology-Language Independent Arithmetic—Part I: Integer
and Floating Point Arithmetic, Draft International Standard (Version 4.1),
ISO/IEC DIS 10967-1:1993. August 1993.
[703] Seppo Linnainmaa. Analysis of some known methods of improving the accu-
racy of floating-point sums. BIT, 14:167–202, 1974.
[704] Seppo Linnainmaa. Towards accurate statistical estimation of rounding er-
rors in floating-point computations. BIT, 15:165-173, 1975.
[705] Seppo Linnainmaa. Taylor expansion of the accumulated rounding error.
BIT, 16:146-160, 1976.
[706] Seppo Linnainmaa. Software for doubled-precision floating-point computa-
tions. ACM Trans. Math. Software, 7(3):272–283, 1981.
[707] Peter Linz. Accurate floating-point summation. Comm. ACM, 13(6):361–
362, 1970.
[708] Elliot Linzer. On the stability of transform-based circular deconvolution.
SIAM J. Numer. Anal., 29(5):1482-1492, 1992.
[709] Joseph W. H. Liu. A partial pivoting strategy for sparse symmetric matrix
decomposition. ACM Trans. Math. Software, 13(2):173-182, 1987.
[710] Georghios Loizou. Nonnormality and Jordan condition numbers of matrices.
J. Assoc. Comput. Mach., 16(4):580-584, 1969.
[711] J. W. Longley. An appraisal of least squares programs for the electronic
computer from the point of view of the user. J. Amer. Statist. Assoc., 62:
819-841, 1967.
[712] Per Lötstedt. Perturbation bounds for the linear least squares problem sub-
ject to linear inequality constraints. BIT, 23:500-519, 1983.
[713] Per Lötstedt. Solving the minimal least squares problem subject to bounds
on the variables. BIT, 24:206-224, 1984.
[714] Hao Lu. Fast solution of confluent Vandermonde linear systems. SIAM J.
Matrix Anal. Appl., 15(4):1277-1289, 1994.
[715] Hao Lu. Fast algorithms for confluent Vandermonde linear systems and
generalized Trummer’s problem. SIAM J. Matrix Anal. Appl., 16(2):655-
674, 1995.
[716] Hao Lu. Solution of Vandermonde-like systems and confluent Vandermonde-
like systems. SIAM J. Matrix Anal. Appl., 17(1):127-138, 1996.
[717] J. N. Lyness. The effect of inadequate convergence criteria in automatic
routines. Computer Journal, 12:279-281, 1969. See also letter and response
in 13 (1970), p. 121.
[718] J. N. Lyness and C. B. Moler. Van der Monde systems and numerical differ-
entiation. Numer. Math., 8:458–464, 1966.
[719] M. Stuart Lynn. On the round-off error in the method of successive over-
relaxation. Math. Comp., 18(85):36–49, 1964.
[720] Allan J. Macleod. Some statistics on Gaussian elimination with partial piv-
oting. ACM SIGNUM Newsletter, 24(2/3):10-14, 1989.
[721] J. H. Maindonald. Statistical Computation. Wiley, New York, 1984. xviii+370
pp. ISBN 0-471-86452-8.
[722] John Makhoul. Toeplitz determinants and positive semidefiniteness. IEEE
Trans. Signal Processing, 39(3):743-746, 1991.
[723] Michael A. Malcolm. On accurate floating-point summation. Comm. ACM,
14(11):731-736, 1971.
[724] Michael A. Malcolm. Algorithms to reveal properties of floating-point arith-
metic. Comm. ACM, 15(11):949–951, 1972.
[725] Michael A. Malcolm and John Palmer. A fast method for solving a class of
tridiagonal linear systems. Comm. ACM, 17(1):14–17, 1974.
[726] Thomas A. Manteuffel. An interval analysis approach to rank determination
in linear least squares problems. SIAM J. Sci. Statist. Comput., 2(3):335–348,
1981.
[727] John Markoff. Circuit flaw causes Pentium chip to miscalculate, Intel admits.
New York Times, 1994. 24 November.
[728] George Marsaglia and Ingram Olkin. Generating correlation matrices. SIAM
J. Sci. Statist. Comput., 5(2):470-475, 1984.
[729] R. S. Martin, G. Peters, and J. H. Wilkinson. Iterative refinement of the
solution of a positive definite system of equations. Numer. Math., 8:203–216,
1966. Also in [1102, pp. 31–44], Contribution 1/2.
[730] Gleanings far and near. Mathematical Gazette, 22(170):95, 1924.
[731] Roy Mathias. Matrices with positive definite Hermitian part: Inequalities
and linear systems. SIAM J. Matrix Anal. Appl., 13(2):640-654, 1992.
[732] Roy Mathias. Accurate eigensystem computations by Jacobi methods. SIAM
J. Matrix Anal. Appl., 16(3):977-1003, 1995.
[733] Roy Mathias. Analysis of algorithms for orthogonalizing products of unitary
matrices. Numerical Linear Algebra with Applications, 1995. To appear.
[734] Roy Mathias. The instability of parallel prefix matrix multiplication. SIAM
J. Sci. Comput., 16(4):956-973, 1995.
[735] MATLAB User’s Guide. The MathWorks, Inc., Natick, MA, USA, 1992.
[736] Shouichi Matsui and Masao Iri. An overflow/underflow-free floating-point
representation of numbers. J. Inform. Process., 4(3):123–133, 1981.
[737] R. M. M. Mattheij. Stability of block LU-decompositions of matrices arising
from BVP. SIAM J. Alg. Discrete Methods, 5(3):314-331, 1984.
[757] Webb Miller. Software for roundoff analysis. ACM Trans. Math. Software, 1
(2):108-128, 1975.
[758] Webb Miller. Graph transformations for roundoff analysis. SIAM J. Comput.,
5(2):204–216, 1976.
[759] Webb Miller. The Engineering of Numerical Software. Prentice-Hall, Engle-
wood Cliffs, NJ, USA, 1984. viii+167 pp. ISBN 0-13-279043-2.
[760] Webb Miller and David Spooner. Software for roundoff analysis, II. ACM
Trans. Math. Software, 4(4):369-387, 1978.
[761] Webb Miller and David Spooner. Algorithm 532: Software for roundoff anal-
ysis. ACM Trans. Math. Software, 4(4):388-390, 1978.
[762] Webb Miller and Celia Wrathall. Software for Roundoff Analysis of Matrix
Algorithms. Academic Press, New York, 1980. x+151 pp. ISBN 0-12-497250-
0.
[763] L. Mirsky. An Introduction to Linear Algebra. Oxford University Press, 1961.
viii+440 pp. Reprinted by Dover, New York, 1990. ISBN 0-486-66434-1.
[764] Herbert F. Mitchell, Jr. Inversion of a matrix of order 38. M.T.A.C., 3:
161-166, 1948.
[765] Cleve B. Moler. Iterative refinement in floating point. J. Assoc. Comput.
Mach., 14(2):316-321, 1967.
[766] Cleve B. Moler. Matrix computations with Fortran and paging. Comm.
ACM, 15(4):268-270, 1972.
[767] Cleve B. Moler. Algorithm 423: Linear equation solver. Comm. ACM, 15
(4):274, 1972.
[768] Cleve B. Moler. Cramer’s rule on 2-by-2 systems. ACM SIGNUM Newsletter,
9(4):13-14, 1974.
[769] Cleve B. Moler. Three research problems in numerical linear algebra. In Nu-
merical Analysis, G. H. Golub and J. Oliger, editors, volume 22 of Proceed-
ings of Symposia in Applied Mathematics, American Mathematical Society,
Providence, RI, USA, 1978, pages 1-18.
[770] Cleve B. Moler. Cleve’s corner: The world’s simplest impossible problem.
The Math Works Newsletter, 4(2):6-7, 1990.
[771] Cleve B. Moler. Technical note: Double-rounding and implications for nu-
meric computations. The Math Works Newsletter, 4:6, 1990.
[772] Cleve B. Moler. A tale of two numbers. SIAM News, 28:1,16, January 1995.
Also in MATLAB News and Notes, Winter 1995, 10-12.
[773] Cleve B. Moler and Donald Morrison. Replacing square roots by Pythagorean
sums. IBM J. Res. Develop., 27(6):577–581, 1983.
[774] Cleve B. Moler and Donald Morrison. Singular value analysis of cryptograms.
Amer. Math. Monthly, 90:78-87, 1983.
[775] Cleve B. Moler and Charles F. Van Loan. Nineteen dubious ways to compute
the exponential of a matrix. SIAM Rev., 20(4):801–836, 1978.
[831] Daniel J. Pierce and Robert J. Plemmons. Fast adaptive condition estima-
tion. SIAM J. Matrix Anal. Appl., 13(1):274-291, 1992.
[832] Robert Piessens, Elise de Doncker-Kapenga, Christoph W. Überhuber, and
David K. Kahaner. QUADPACK: A Subroutine Package for Automatic In-
tegration. Springer-Verlag, Berlin, 1983. ISBN 3-540-12553-1.
[833] P. J. Plauger. Properties of floating-point arithmetic. Computer Language,
5(3):17-22, 1988.
[834] R. J. Plemmons. Regular splittings and the discrete Neumann problem.
Numer. Math., 25:153–161, 1976.
[835] Robert J. Plemmons. Linear least squares by elimination and MGS. J. Assoc.
Comput. Mach., 21(4):581-585, 1974.
[836] Svatopluk Poljak and Jiří Rohn. Checking robust nonsingularity is NP-hard.
Math. Control Signals Systems, 6:1-9, 1993.
[837] Ben Polman. Incomplete blockwise factorization of (block) H-matrices. Lin-
ear Algebra Appl., 90:119–132, 1987.
[838] M. J. D. Powell. A survey of numerical methods for unconstrained optimiza-
tion. SIAM Rev., 12(1):79-97, 1970.
[839] M. J. D. Powell. A view of unconstrained minimization algorithms that do
not require derivatives. ACM Trans. Math. Software, 1(2):97–107, 1975.
[840] M. J. D. Powell and J. K. Reid. On applying Householder transformations to
linear least squares problems. In Proc. IFIP Congress 1968, North-Holland,
Amsterdam, The Netherlands, 1969, pages 122-126.
[841] Stephen Power. The Cholesky decomposition in Hilbert space. IMA Bulletin,
22(11/12):186-187, 1986.
[842] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P.
Flannery. Numerical Recipes in FORTRAN: The Art of Scientific Comput-
ing. Second edition, Cambridge University Press, Cambridge, UK, 1992.
xxvi+963 pp. ISBN 0-521-43064-X.
[843] Douglas M. Priest. Algorithms for arbitrary precision floating point arith-
metic. In Proc. 10th IEEE Symposium on Computer Arithmetic, Peter Ko-
rnerup and David W. Matula, editors, IEEE Computer Society Press, Los
Alamitos, CA, USA, 1991, pages 132-143.
[844] Douglas M. Priest. On Properties of Floating Point Arithmetics: Numeri-
cal Stability and the Cost of Accurate Computations. Ph.D. thesis, Mathe-
matics Department, University of California, Berkeley, CA, USA, November
1992. 126 pp. URL = ftp://ftp.icsi.berkeley.edu/pub/theory/priest-
thesis.ps.Z.
[845] J. D. Pryce. Round-off error analysis with fewer tears. IMA Bulletin, 17:
40-47, 1981.
[846] J. D. Pryce. A new measure of relative error for vectors. SIAM J. Numer.
Anal., 21(1):202-215, 1984.
[899] R. Scherer and K. Zeller. Shorthand notation for rounding errors. Computing,
Suppl. 2:165-168, 1980.
[900] Hans Schneider and W. Gilbert Strang. Comparison theorems for supremum
norms. Numer. Math., 4:15–20, 1962.
[901] J. L. Schonfelder and M. Razaz. Error control with polynomial approxima-
tions. IMA J. Numer. Anal., 1:105-114, 1980.
[902] Robert S. Schreiber. Block algorithms for parallel machines. In Numeri-
cal Algorithms for Modern Parallel Computer Architectures, M. H. Schultz,
editor, number 13 in IMA Volumes In Mathematics and Its Applications,
Springer-Verlag, Berlin, 1988, pages 197-207.
[903] Robert S. Schreiber. Hough’s random story explained. NA Digest, Volume
89, Issue 3, 1989. Electronic mail magazine: [email protected].
[904] Robert S. Schreiber and Beresford N. Parlett. Block reflectors: Theory and
computation. SIAM J. Numer. Anal., 25(1):189-205, 1988.
[905] Robert S. Schreiber and Charles F. Van Loan. A storage efficient WY repre-
sentation for products of Householder transformations. SIAM J. Sci. Statist.
Comput., 10:53-57, 1989.
[906] N. L. Schryer. A test of a computer’s floating-point arithmetic unit. Com-
puting Science Technical Report No. 89, AT&T Bell Laboratories, Murray
Hill, NJ, USA, 1981.
[907] N. L. Schryer. Determination of correct floating-point model parameters.
In Sources and Development of Mathematical Software, Wayne R. Cowell,
editor, Prentice-Hall, Englewood Cliffs, NJ, USA, 1984, pages 360-366.
[908] Gunther Schulz. Iterative Berechnung der reziproken Matrix. Z. Angew.
Math. Mech., 13:57-59, 1933.
[909] Robert Sedgewick. Algorithms. Second edition, Addison-Wesley, Reading,
MA, USA, 1988. xii+657 pp. ISBN 0-201-06673-4.
[910] Lawrence F. Shampine. Numerical Solution of Ordinary Differential Equa-
tions. Chapman and Hall, New York, 1994. x+484 pp. ISBN 0-412-05151-6.
[911] Lawrence F. Shampine and Richard C. Allen, Jr. Numerical Computing: An
Introduction. W. B. Saunders, Philadelphia, PA, USA, 1973. vii+258 pp.
ISBN 0-7216-8150-6.
[912] Lawrence F. Shampine and Mark W. Reichelt. The MATLAB ODE suite.
Manuscript, 1995. 35 pp.
[913] Alexander Shapiro. Optimally scaled matrices, necessary and sufficient con-
ditions. Numer. Math., 39:239-245, 1982.
[914] Alexander Shapiro. Optimal block diagonal l2-scaling of matrices. SIAM J.
Numer. Anal., 22(1):81-94, 1985.
[915] Alexander Shapiro. Upper bounds for nearly optimal diagonal scaling of
matrices. Linear and Multilinear Algebra, 29:147–147, 1991.
[1002] Martti Tienari. A statistical model of roundoff error for varying length
floating-point arithmetic. BIT, 10:355-365, 1970.
[1003] J. Todd. On condition numbers. In Programmation en Mathématiques
Numériques, Besançon, 1966, volume 7 (no. 165) of Éditions Centre Nat.
Recherche Sci., Paris, 1968, pages 141-159.
[1004] John Todd. The condition of the finite segments of the Hilbert matrix. In
Contributions to the Solution of Systems of Linear Equations and the Deter-
mination of Eigenvalues, Olga Taussky, editor, number 39 in Applied Math-
ematics Series, National Bureau of Standards, United States Department of
Commerce, Washington, DC, 1954, pages 109-116.
[1005] John Todd. Computational problems concerning the Hilbert matrix. J. Res.
National Bureau Standards-B, 65(1):19-22, 1961.
[1006] John Todd. Basic Numerical Mathematics, Vol. 2: Numerical Algebra.
Birkhäuser, Basel, and Academic Press, New York, 1977. 216 pp. ISBN
0-12-692402-3.
[1007] Kim-Chuan Toh and Lloyd N. Trefethen. Pseudozeros of polynomials and
pseudospectra of companion matrices. Numer. Math., 68(3):403-425, 1994.
[1008] Virginia J. Torczon. Multi-Directional Search: A Direct Search Algorithm for
Parallel Machines. Ph.D. thesis, Rice University, Houston, TX, USA, May
1989. vii+85 pp.
[1009] Virginia J. Torczon. On the convergence of the multidirectional search algo-
rithm. SIAM J. Optim., 1(1):123–145, 1991.
[1010] Virginia J. Torczon. PDS: Direct search methods for unconstrained optimiza-
tion on either sequential or parallel machines. Report TR92-9, Department
of Mathematical Sciences, Rice University, Houston, TX, USA, March 1992.
To appear in ACM Trans. Math. Software.
[1011] Virginia J. Torczon. On the convergence of pattern search algorithms. SIAM
J. Optim., 7(1), 1997. To appear.
[1012] L. Tornheim. Maximum third pivot for Gaussian reduction. Technical report,
Calif. Research Corp., Richmond, CA, USA, 1965. Cited in [256].
[1013] J. F. Traub. Associated polynomials and uniform methods for the solution
of linear problems. SIAM Rev., 8(3):277-301, 1966.
[1014] Lloyd N. Trefethen. Three mysteries of Gaussian elimination. ACM SIGNUM
Newsletter, 20:2-5, 1985.
[1015] Lloyd N. Trefethen. Approximation theory and numerical linear algebra.
In Algorithms for Approximation II, J. C. Mason and M. G. Cox, editors,
Chapman and Hall, London, 1990, pages 336-360.
[1016] Lloyd N. Trefethen. The definition of numerical analysis. SIAM News, 25:
6 and 22, November 1992. Reprinted in IMA Bulletin, 29 (3/4), pp. 47-49,
1993.
[1035] Evgenij E. Tyrtyshnikov. How bad are Hankel matrices? Numer. Math., 67
(2):261-269, 1994.
[1036] Patriot missile defense: Software problem led to system failure at Dhahran,
Saudi Arabia. Report GAO/IMTEC-92-26, Information Management and
Technology Division, United States General Accounting Office, Washington,
DC, February 1992. 16 pp.
[1037] Minoru Urabe. Roundoff error distribution in fixed-point multiplication and
a remark about the rounding rule. SIAM J. Numer. Anal., 5(2):202-210,
1968.
[1038] J. V. Uspensky. Theory of Equations. McGraw-Hill, New York, 1948. vii+353
pp.
[1039] A. van der Sluis. Condition numbers and equilibration of matrices. Numer.
Math., 14:14-23, 1969.
[1040] A. van der Sluis. Condition, equilibration and pivoting in linear algebraic
systems. Numer. Math., 15:74–86, 1970.
[1041] A. van der Sluis. Stability of the solutions of linear least squares problems.
Numer. Math., 23:241-254, 1975.
[1042] Charles F. Van Loan. On the method of weighting for equality-constrained
least-squares problems. SIAM J. Numer. Anal., 22(5):851-864, 1985.
[1043] Charles F. Van Loan. On estimating the condition of eigenvalues and eigen-
vectors. Linear Algebra Appl., 88/89:715–732, 1987.
[1044] Charles F. Van Loan. Computational Frameworks for the Fast Fourier Trans-
form. Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1992. xiii+273 pp. ISBN 0-89871-285-8.
[1045] M. van Veldhuizen. A note on partial pivoting and Gaussian elimination.
Numer. Math., 29:1-10, 1977.
[1046] A. van Wijngaarden. Numerical analysis as an independent science. BIT, 6:
66-81, 1966.
[1047] Robert J. Vanderbei. Symmetric quasi-definite matrices. SIAM J. Optim., 5
(1):100-113, 1995.
[1048] J. M. Varah. On the solution of block-tridiagonal systems arising from certain
finite-difference equations. Math. Comp., 26(120):859-868, 1972.
[1049] J. M. Varah. A lower bound for the smallest singular value of a matrix.
Linear Algebra Appl., 11:3-5, 1975.
[1050] J. M. Varah. On the separation of two matrices. SIAM J. Numer. Anal., 16
(2):216-222, 1979.
[1051] Richard S. Varga. On diagonal dominance arguments for bounding $\|A^{-1}\|_\infty$.
Linear Algebra Appl., 14:211-217, 1976.
[1052] Richard S. Varga. Scientific Computation on Mathematical Problems and
Conjectures. Society for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 1990. vi+122 pp. ISBN 0-89871-257-2.
[1053] Frank M. Verzuh. The solution of simultaneous linear equations with the aid
of the 602 calculating punch. M.T.A.C., 3:453–462, 1949.
[1054] J. Vignes and R. Alt. An efficient stochastic method for round-off error
analysis. In Accurate Scientific Computations, Proceedings, 1985, Willard L.
Miranker and Richard A. Toupin, editors, volume 235 of Lecture Notes in
Computer Science, Springer-Verlag, Berlin, 1986, pages 183-205.
[1055] Emil Vitasek. The numerical stability in solution of differential equations. In
Conference on the Numerical Solution of Differential Equations, J. Ll. Morris,
editor, volume 109 of Lecture Notes in Mathematics, Springer-Verlag, Berlin,
1969, pages 87–111.
[1056] I. V. Viten’ko. Optimum algorithms for adding and multiplying on computers
with a floating point. U.S.S.R. Comput. Math. Math. Phys., 8(5):183-195,
1968.
[1057] John von Neumann and Herman H. Goldstine. Numerical inverting of ma-
trices of high order. Bull. Amer. Math. Soc., 53:1021–1099, 1947. Reprinted
in [995, pp. 479–557].
[1058] H. A. van der Vorst. The convergence behaviour of preconditioned CG
and CG-S in the presence of rounding errors. In Preconditioned Conjugate
Gradient Methods, Owe Axelsson and Lily Yu. Kolotilina, editors, volume
1457 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1990, pages
126–136.
[1059] Eugene L. Wachspress. Iterative solution of the Lyapunov matrix equation.
Appl. Math. Lett., 1(1):87-90, 1988.
[1060] Bertil Waldén, Rune Karlson, and Ji-guang Sun. Optimal backward pertur-
bation bounds for the linear least squares problem. Numerical Linear Algebra
with Applications, 2(3):271–286, 1995.
[1061] Peter J. L. Wallis, editor. Improving Floating-Point Programming. Wiley,
London, 1990. xvi+191 pp. ISBN 0-471-92437-7.
[1062] W. D. Wallis. Hadamard matrices. In Combinatorial and Graph-Theoretical
Problems in Linear Algebra, Richard A. Brualdi, Shmuel Friedland, and Vic-
tor Klee, editors, volume 50 of IMA Volumes in Mathematics and its Appli-
cations, Springer-Verlag, New York, 1993, pages 235–243.
[1063] W. D. Wallis, Anne Penfold Street, and Jennifer Seberry Wallis. Combina-
torics: Room Squares, Sum-Free Sets, Hadamard Matrices, volume 292 of
Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1972. 508 pp. ISBN
3-540-06035-9.
[1064] Johnson J. H. Wang. Generalized Moment Methods in Electromagnetics:
Formulation and Computer Solution of Integral Equations. Wiley, New York,
1991. xiii+553 pp. ISBN 0-471-51443-8.
[1065] Robert C. Ward. The QR algorithm and Hyman’s method on vector com-
puters. Math. Comp., 30(133):132–142, 1976.
[1066] Willis H. Ware, editor. Soviet computer technology–1959. Comm. ACM, 3
(3):131-166, 1960.
[1083] J. H. Wilkinson. The use of iterative methods for finding the latent roots and
vectors of matrices. Mathematical Tables and Other Aids to Computation, 9:
184–191, 1955.
[1084] J. H. Wilkinson. Error analysis of floating-point computation. Numer. Math.,
2:319-340, 1960.
[1085] J. H. Wilkinson. Error analysis of direct methods of matrix inversion. J.
Assoc. Comput. Mach., 8:281–330, 1961.
[1086] J. H. Wilkinson. Error analysis of eigenvalue techniques based on orthogonal
transformations. J. Soc. Indust. Appl. Math., 10(1):162-195, 1962.
[1087] J. H. Wilkinson. Plane-rotations in floating-point arithmetic. In Experi-
mental Arithmetic, High Speed Computing and Mathematics, volume 15 of
Proceedings of Symposia in Applied Mathematics, American Mathematical
Society, Providence, RI, USA, 1963, pages 185-198.
[1088] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Notes on Applied
Science No. 32, Her Majesty’s Stationery Office, London, 1963. vi+161 pp.
Also published by Prentice-Hall, Englewood Cliffs, NJ, USA. Reprinted by
Dover, New York, 1994. ISBN 0-486-67999-3.
[1089] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press,
1965. xviii+662 pp. ISBN 0-19-853403-5 (hardback), 0-19-853418-3 (paperback).
[1090] J. H. Wilkinson. Error analysis of transformations based on the use of ma-
trices of the form $I - 2ww^H$. In Error in Digital Computation, Louis B. Rall,
editor, volume 2, Wiley, New York, 1965, pages 77-101.
[1091] J. H. Wilkinson. Błędy Zaokrągleń w Procesach Algebraicznych. PWN,
Warszawa, 1967. Polish translation of [1088].
[1092] J. H. Wilkinson. A priori error analysis of algebraic processes. In Proc.
International Congress of Mathematicians, Moscow 1966, I. G. Petrovsky,
editor, Mir Publishers, Moscow, 1968, pages 629–640.
[1093] J. H. Wilkinson. Rundungsfehler. Springer-Verlag, Berlin, 1969. German
translation of [1088].
[1094] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Nauka, U.S.S.R.
Academy of Sciences, 1970. 564 pp. Russian translation of [1089].
[1095] J. H. Wilkinson. Modern error analysis. SIAM Rev., 13(4):548-568, 1971.
[1096] J. H. Wilkinson. Some comments from a numerical analyst. J. Assoc. Com-
put. Mach., 18(2):137-147, 1971. Reprinted in [3].
[1097] J. H. Wilkinson. The classical error analyses for the solution of linear systems.
IMA Bulletin, 10(5/6):175-180, 1974.
[1098] J. H. Wilkinson. Numerical linear algebra on digital computers. IMA Bul-
letin, 10(9/10):354-356, 1974.
[1099] J. H. Wilkinson. Turing’s work at the National Physical Laboratory and the
construction of Pilot ACE, DEUCE, and ACE. In A History of Computing
Name Index
A suffix “t” after a page number denotes a table, “f” a figure, “n” a footnote, and
“q” a quotation.
386, 389, 407, 410, 411, 417, 422, 440, 443, 460, 461, 489, 508, 574, 577
Hilbert, David, 526
Hildebrand, F. B., 33, 39q
Hocks, Matthias, 487
Hodel, A. Scottedward, 323
Hodges, Andrew, xxvii
Hoffman, A. J., 386
Hoffmann, Christoph M., 34
Hoffmann, W., 284, 385
Hooper, Judith A., 504
Horn, Roger A., 119, 150, 310, 348, 546, 551, 555, 558, 580
Horning, Jim, 491q
Hotelling, Harold, 186, 187, 445q
Hough, David, 47, 509
Householder, Alston S., 2, 117q, 126, 160, 172, 383, 410, 565
Hull, T. E., 52, 61, 488, 506, 530
Huskey, H. D., 188
Hussaini, M. Yousuff, 32
Hyman, M. A., 282

Ikebe, Yasuhiko, 303, 305, 307
Incertis, F., 512
Ipsen, Ilse C. F., 136, 141, 147, 385, 386
Iri, Masao, 53, 488
Isaacson, Eugene, 189, 257

Jalby, William, 385
Jankowski, M., 94, 241, 242
Jansen, Paul, 486
Jennings, A., 165
Jennings, L. S., 421
Johnson, Charles R., 119, 150, 287, 310, 348, 546, 551, 555, 558, 580
Johnson, Samuel, 675
Jones, Mark T., 226
Jones, William B., 507
Jordan, Camille, 284
Jordan, T. L., 409
Jordan, Wilhelm, 284

Kågström, Bo, 305, 320, 322-324
Kahan, William M. (Velvel), 1q, 29, 33, 34, 46, 47, 47q, 50, 50q, 59, 63-65, 75, 86, 92-95, 98, 113, 123, 126, 136, 161, 165, 169q, 225, 243, 486, 490, 494, 496, 497, 499, 501, 502, 507
Kahaner, David K., 391q
Kailath, T., 441
Kala, R., 323, 324
Kaniel, S., 226
Karasalo, Ilkka, 165
Karatsuba, A., 461
Karlin, Samuel, 523
Karlson, Rune, 404, 413
Karney, David L., 514, 525
Karp, A., 195
Karpinski, Richard, 56, 497
Kato, Tosio, 126
Kaufman, Linda, 221, 225, 226, 385
Keiper, Jerry B., 36
Keller, Herbert Bishop, 189, 257, 325q
Kennedy, Jr., William J., 34
Kenney, Charles S., 301, 323, 525
Kerr, Thomas H., 225
Kiełbasiński, Andrzej, 84, 94, 224, 242, 406, 411, 418, 422
Kincaid, D. R., 503
Kittaneh, Fuad, 525
Knight, Philip A., 328, 329, 354, 356, 358, 460, 461
Knuth, Donald E., xxiv, 54, 57, 58, 67q, 87q, 93, 94, 114, 461, 491q, 520q, 526, 534
Koçak, Hüseyin, 32
Koltracht, I., 141, 517
Körner, T. W., 465q
Kornerup, Peter, 54
Kostlan, Eric, 518, 527
Kovarik, Z. V., 145
Kowalewski, G., 440
Krasny, Robert, 505
Kreczmar, Antoni, 460
Krogh, F. T., 503
Krol, Ed, 582
Kruckeberg, F., 198
Kubota, Koichi, 488
Subject Index
A suffix “t” after a page number denotes a table, “f” a figure, “n” a footnote, and
“q” a quotation. Mathematical symbols and Greek letters are indexed as if they
were spelled out. The solution to a problem is not indexed if the problem itself is
indexed under the same term.
$0^0$, definition, 64
What is the most accurate way to sum floating point numbers? What
are the advantages of IEEE arithmetic? How accurate is Gaussian
elimination, and what were the key breakthroughs in the development
of error analysis for the method? The answers to these and many
related questions are included in Accuracy and Stability of Numerical
Algorithms.
This book gives a thorough treatment of the behavior of numerical
algorithms in finite precision arithmetic. It combines algorithmic
derivations, perturbation theory, and rounding error analysis. Software
practicalities are emphasized throughout, with particular reference to
LAPACK and MATLAB. The best available error bounds, some of
them new, are presented in a unified format with a minimum of jargon,
and perturbation theory is treated in detail.
Historical perspective and insight are given, with particular reference
to the fundamental work of Wilkinson and Turing. The many
quotations provide further information in an accessible format.
The book is unique in that algorithmic developments and
motivations are given succinctly and implementation details are
minimized so that readers can concentrate on accuracy and stability
results. Not since Wilkinson’s Rounding Errors in Algebraic
Processes (1963) and The Algebraic Eigenvalue Problem (1965) has
any volume treated this subject in such depth. A number of topics are
treated that are not usually covered in numerical analysis textbooks,
including floating point summation, block LU factorization, condition
number estimation, the Sylvester equation, powers of matrices, finite
precision behavior of stationary iterative methods, Vandermonde
systems, and fast matrix multiplication.
Nicholas J. Higham is Professor of Applied Mathematics at the
University of Manchester, England. He is the author of more than 40
publications and is a member of the editorial boards of the SIAM
Journal on Matrix Analysis and Applications and the IMA Journal of
Numerical Analysis. His book Handbook of Writing for the
Mathematical Sciences was published by SIAM in 1993.
SIAM
Society for Industrial and Applied Mathematics
3600 University City Science Center
Philadelphia, PA 19104-2688 USA
Telephone: 215-382-9800 / Fax: 215-386-7999
[email protected]
http://www.siam.org
ISBN 0-89871-355-2