Accuracy and Stability


of Numerical Algorithms
Nicholas J. Higham
University of Manchester
Manchester, England

Accuracy and Stability


of Numerical Algorithms

Society for Industrial and Applied Mathematics


Philadelphia
Copyright © 1996 by the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2

All rights reserved. Printed in the United States of America. No part of this book may be
reproduced, stored, or transmitted in any manner without the written permission of the
publisher. For information, write to the Society for Industrial and Applied Mathematics,
3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data

Higham, Nicholas J., 1961-


Accuracy and stability of numerical algorithms / Nicholas J.
Higham.
p. cm.
Includes bibliographical references (p. - ) and index.
ISBN 0-89871-355-2 (pbk.)
1. Numerical analysis--Data processing. 2. Computer algorithms.
I. Title.
QA297.H53 1996
519.4'0285'51--dc20 95-39903

o is a registered trademark.
Dedicated to

Alan M. Turing
and
James H. Wilkinson
Contents

List of Figures xvii

List of Tables xix

Preface xxi

About the Dedication xxvii

1 Principles of Finite Precision Computation 1


1.1 Notation and Background . . . . . . . . . . . . . . . . . . . . 2
1.2 Relative Error and Significant Digits . . . . . . . . . . . . . . 4
1.3 Sources of Errors . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Precision Versus Accuracy . . . . . . . . . . . . . . . . . . . . 7
1.5 Backward and Forward Errors . . . . . . . . . . . . . . . . . . 7
1.6 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8 Solving a Quadratic Equation . . . . . . . . . . . . . . . . . . 11
1.9 Computing the Sample Variance . . . . . . . . . . . . . . . . . 12
1.10 Solving Linear Equations . . . . . . . . . . . . . . . . . . . . . 13
1.10.1 GEPP Versus Cramer’s Rule . . . . . . . . . . . . . . . 14
1.11 Accumulation of Rounding Errors . . . . . . . . . . . . . . . . 16
1.12 Instability Without Cancellation . . . . . . . . . . . . . . . . . 17
1.12.1 The Need for Pivoting . . . . . . . . . . . . . . . . . . 17
1.12.2 An Innocuous Calculation? . . . . . . . . . . . . . . . . 17
1.12.3 An Infinite Sum . . . . . . . . . . . . . . . . . . . . . . 18
1.13 Increasing the Precision . . . . . . . . . . . . . . . . . . . . . . 19
1.14 Cancellation of Rounding Errors . . . . . . . . . . . . . . . . . 21
1.14.1 Computing (e^x - 1)/x . . . . . . . . . . . . . . . . . 22
1.14.2 QR Factorization . . . . . . . . . . . . . . . . . . . . . 24
1.15 Rounding Errors Can Be Beneficial . . . . . . . . . . . . . . . 26
1.16 Stability of an Algorithm Depends on the Problem . . . . . . 27
1.17 Rounding Errors Are Not Random . . . . . . . . . . . . . . . 29
1.18 Designing Stable Algorithms . . . . . . . . . . . . . . . . . . . 30
1.19 Misconceptions . . . . . . . . . . . . . . . . . . . . . . . . . . 31


1.20 Rounding Errors in Numerical Analysis . . . . . . . . . . . . . 32


1.21 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 32
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2 Floating Point Arithmetic 39


2.1 Floating Point Number System . . . . . . . . . . . . . . . . . 40
2.2 Model of Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 IEEE Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4 Aberrant Arithmetics . . . . . . . . . . . . . . . . . . . . . . . 48
2.5 Choice of Base and Distribution of Numbers . . . . . . . . . . 51
2.6 Statistical Distribution of Rounding Errors . . . . . . . . . . . 52
2.7 Alternative Number Systems . . . . . . . . . . . . . . . . . . . 53
2.8 Accuracy Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 56
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3 Basics 67
3.1 Inner and Outer Products . . . . . . . . . . . . . . . . . . . . 68
3.2 The Purpose of Rounding Error Analysis . . . . . . . . . . . . 71
3.3 Running Error Analysis . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Notation for Error Analysis . . . . . . . . . . . . . . . . . . . 73
3.5 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 Complex Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 78
3.7 Miscellany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.8 Error Analysis Demystified . . . . . . . . . . . . . . . . . . . . 82
3.9 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.10 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 84
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4 Summation 87
4.1 Summation Methods . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Compensated Summation . . . . . . . . . . . . . . . . . . . . . 92
4.4 Other Summation Methods . . . . . . . . . . . . . . . . . . . . 97
4.5 Statistical Estimates of Accuracy . . . . . . . . . . . . . . . . 98
4.6 Choice of Method . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 100
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5 Polynomials 103
5.1 Horner‘s Method . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Evaluating Derivatives . . . . . . . . . . . . . . . . . . . . . . 106
5.3 The Newton Form and Polynomial Interpolation . . . . . . . . . 109

5.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 113


Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6 Norms 117
6.1 Vector Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 The Matrix p-Norm . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 126
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7 Perturbation Theory for Linear Systems 131


7.1 Normwise Analysis . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Componentwise Analysis . . . . . . . . . . . . . . . . . . . . . 134
7.3 Scaling to Minimize the Condition Number . . . . . . . . . . . 137
7.4 The Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . 140
7.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.6 Numerical Stability . . . . . . . . . . . . . . . . . . . . . . . . 141
7.7 Practical Error Bounds . . . . . . . . . . . . . . . . . . . . . . 142
7.8 Perturbation Theory by Calculus . . . . . . . . . . . . . . . . 144
7.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 145
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

8 Triangular Systems 151


8.1 Backward Error Analysis . . . . . . . . . . . . . . . . . . . . . 152
8.2 Forward Error Analysis . . . . . . . . . . . . . . . . . . . . . . 155
8.3 Bounds for the Inverse . . . . . . . . . . . . . . . . . . . . . . 159
8.4 A Parallel Fan-In Algorithm . . . . . . . . . . . . . . . . . . . 162
8.5 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 164
8.5.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

9 LU Factorization and Linear Equations 169


9.1 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . 170
9.2 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
9.3 The Growth Factor . . . . . . . . . . . . . . . . . . . . . . . . 177
9.4 Special Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.5 Tridiagonal Matrices . . . . . . . . . . . . . . . . . . . . . . . 183
9.6 Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . 186
9.7 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.8 A Posteriori Stability Tests . . . . . . . . . . . . . . . . . . . . 192
9.9 Sensitivity of the LU Factorization . . . . . . . . . . . . . . . 194
9.10 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 195
9.10.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 198

Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

10 Cholesky Factorization 203


10.1 Symmetric Positive Definite Matrices . . . . . . . . . . . . . . 204
10.1.1 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . 205
10.2 Sensitivity of the Cholesky Factorization . . . . . . . . . . . . 209
10.3 Positive Semidefinite Matrices . . . . . . . . . . . . . . . . . . 210
10.3.1 Perturbation Theory . . . . . . . . . . . . . . . . . . . 211
10.3.2 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . 214
10.4 Symmetric Indefinite Matrices and Diagonal Pivoting
Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
10.4.1 Complete Pivoting . . . . . . . . . . . . . . . . . . . . 219
10.4.2 Partial Pivoting . . . . . . . . . . . . . . . . . . . . . . 221
10.5 Nonsymmetric Positive Definite Matrices . . . . . . . . . . . . 223
10.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 224
10.6.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

11 Iterative Refinement 231


11.1 Convergence of Iterative Refinement . . . . . . . . . . . . . . . 232
11.2 Iterative Refinement Implies Stability . . . . . . . . . . . . . . 235
11.3 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 241
11.3.1 LAPACK. . . . . . . . . . . . . . . . . . . . . . . . . . 243
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

12 Block LU Factorization 245


12.1 Block Versus Partitioned LU Factorization . . . . . . . . . . . 246
12.2 Error Analysis of Partitioned LU Factorization . . . . . . . . . 248
12.3 Error Analysis of Block LU Factorization . . . . . . . . . . . . 250
12.3.1 Block Diagonal Dominance . . . . . . . . . . . . . . . . 251
12.3.2 Symmetric Positive Definite Matrices . . . . . . . . . . 255
12.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 257
12.4.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

13 Matrix Inversion 261


13.1 Use and Abuse of the Matrix Inverse . . . . . . . . . . . . . . 262
13.2 Inverting a Triangular Matrix . . . . . . . . . . . . . . . . . . 265
13.2.1 Unblocked Methods . . . . . . . . . . . . . . . . . . . . 265
13.2.2 Block Methods . . . . . . . . . . . . . . . . . . . . . . 267
13.3 Inverting a Full Matrix by LU Factorization . . . . . . . . . . 270
13.3.1 Method A . . . . . . . . . . . . . . . . . . . . . . . . . 270
13.3.2 Method B . . . . . . . . . . . . . . . . . . . . . . . . . 271

13.3.3 Method C . . . . . . . . . . . . . . . . . . . . . . . . . 272


13.3.4 Method D . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.3.5 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . 275
13.4 Gauss-Jordan Elimination . . . . . . . . . . . . . . . . . . . . 275
13.5 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . 281
13.5.1 Hyman’s Method . . . . . . . . . . . . . . . . . . . . . 282
13.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 283
13.6.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

14 Condition Number Estimation 289


14.1 How to Estimate Componentwise Condition Numbers . . . . . . 290
14.2 The p-Norm Power Method. . . . . . . . . . . . . . . . . . . . 291
14.3 LAPACK 1-Norm Estimator . . . . . . . . . . . . . . . . . . . 294
14.4 Other Condition Estimators . . . . . . . . . . . . . . . . . . . 297
14.5 Condition Numbers of Tridiagonal Matrices . . . . . . . . . . 301
14.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 304
14.6.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306

15 The Sylvester Equation 309


15.1 Solving the Sylvester Equation . . . . . . . . . . . . . . . . . . 311
15.2 Backward Error . . . . . . . . . . . . . . . . . . . . . . . . . . 313
15.2.1 The Lyapunov Equation . . . . . . . . . . . . . . . . . 316
15.3 Perturbation Result . . . . . . . . . . . . . . . . . . . . . . . . 318
15.4 Practical Error Bounds . . . . . . . . . . . . . . . . . . . . . . 320
15.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
15.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 322
15.6.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324

16 Stationary Iterative Methods 325


16.1 Survey of Error Analysis . . . . . . . . . . . . . . . . . . . . . 327
16.2 Forward Error Analysis . . . . . . . . . . . . . . . . . . . . . . 329
16.2.1 Jacobi’s Method . . . . . . . . . . . . . . . . . . . . . . 332
16.2.2 Successive Overrelaxation . . . . . . . . . . . . . . . . 334
16.3 Backward Error Analysis . . . . . . . . . . . . . . . . . . . . . 334
16.4 Singular Systems . . . . . . . . . . . . . . . . . . . . . . . . . 336
16.4.1 Theoretical Background . . . . . . . . . . . . . . . . . 336
16.4.2 Forward Error Analysis . . . . . . . . . . . . . . . . . . 338
16.5 Stopping an Iterative Method . . . . . . . . . . . . . . . . . . 341
16.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 343

Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

17 Matrix Powers 345


17.1 Matrix Powers in Exact Arithmetic . . . . . . . . . . . . . . . 346
17.2 Bounds for Finite Precision Arithmetic . . . . . . . . . . . . . 353
17.3 Application to Stationary Iteration . . . . . . . . . . . . . . . 358
17.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 358
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

18 QR Factorization 361
18.1 Householder Transformations . . . . . . . . . . . . . . . . . . . 362
18.2 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 363
18.3 Error Analysis of Householder Computations . . . . . . . . . . 364
18.4 Aggregated Householder Transformations . . . . . . . . . . . . 370
18.5 Givens Rotations . . . . . . . . . . . . . . . . . . . . . . . . . 371
18.6 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . 375
18.7 Gram-Schmidt Orthogonalization . . . . . . . . . . . . . . . . 376
18.8 Sensitivity of the QR Factorization . . . . . . . . . . . . . . . 381
18.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 383
18.9.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . 386
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

19 The Least Squares Problem 391


19.1 Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . . 392
19.2 Solution by QR Factorization . . . . . . . . . . . . . . . . . . 395
19.3 Solution by the Modified Gram-Schmidt Method . . . . . . . 396
19.4 The Normal Equations . . . . . . . . . . . . . . . . . . . . . . 397
19.5 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . 399
19.6 The Seminormal Equations . . . . . . . . . . . . . . . . . . . . 403
19.7 Backward Error . . . . . . . . . . . . . . . . . . . . . . . . . . 404
19.8 Proof of Wedin’s Theorem . . . . . . . . . . . . . . . . . . . . 407
19.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 409
19.9.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412

20 Underdetermined Systems 415


20.1 Solution Methods . . . . . . . . . . . . . . . . . . . . . . . . . 416
20.2 Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . . 417
20.3 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
20.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 422
20.4.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423

21 Vandermonde Systems 425


21.1 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . 426
21.2 Primal and Dual Systems . . . . . . . . . . . . . . . . . . . . . 428
21.3 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
21.3.1 Forward Error . . . . . . . . . . . . . . . . . . . . . . . 435
21.3.2 Residual . . . . . . . . . . . . . . . . . . . . . . . . . . 437
21.3.3 Dealing with Instability . . . . . . . . . . . . . . . . . . 438
21.4 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 440
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441

22 Fast Matrix Multiplication 445


22.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
22.2 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
22.2.1 Winograd’s Method . . . . . . . . . . . . . . . . . . . . 451
22.2.2 Strassen’s Method . . . . . . . . . . . . . . . . . . . . . 452
22.2.3 Bilinear Noncommutative Algorithms . . . . . . . . . . 456
22.2.4 The 3M Method . . . . . . . . . . . . . . . . . . . . . . 458
22.3 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 459
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461

23 The Fast Fourier Transform and Applications 465


23.1 The Fast Fourier Transform . . . . . . . . . . . . . . . . . . . 466
23.2 Circulant Linear Systems . . . . . . . . . . . . . . . . . . . . . 468
23.3 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 470
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471

24 Automatic Error Analysis 473


24.1 Exploiting Direct Search Optimization . . . . . . . . . . . . . 474
24.2 Direct Search Methods . . . . . . . . . . . . . . . . . . . . . . 477
24.3 Examples of Direct Search . . . . . . . . . . . . . . . . . . . . 479
24.3.1 Condition Estimation . . . . . . . . . . . . . . . . . . . 480
24.3.2 Fast Matrix Inversion . . . . . . . . . . . . . . . . . . . 481
24.3.3 Solving a Cubic . . . . . . . . . . . . . . . . . . . . . . 483
24.4 Interval Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 485
24.5 Other Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
24.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 489
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490

25 Software Issues in Floating Point Arithmetic 491


25.1 Exploiting IEEE Arithmetic . . . . . . . . . . . . . . . . . . . 492
25.2 Subtleties of Floating Point Arithmetic . . . . . . . . . . . . . 495
25.3 Cray Peculiarities . . . . . . . . . . . . . . . . . . . . . . . . . 496
25.4 Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497

25.5 Determining Properties of Floating Point Arithmetic . . . . . 497


25.6 Testing a Floating Point Arithmetic . . . . . . . . . . . . . . . 498
25.7 Portability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
25.7.1 Arithmetic Parameters . . . . . . . . . . . . . . . . . . 499
25.7.2 2×2 Problems in LAPACK . . . . . . . . . . . . . . . 500
25.7.3 Numerical Constants . . . . . . . . . . . . . . . . . . . 501
25.7.4 Models of Floating Point Arithmetic . . . . . . . . . . 501
25.8 Avoiding Underflow and Overflow . . . . . . . . . . . . . . . . 502
25.9 Multiple Precision Arithmetic . . . . . . . . . . . . . . . . . . 504
25.10 Patriot Missile Software Problem . . . . . . . . . . . . . . . . 506
25.11 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 507
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508

26 A Gallery of Test Matrices 513


26.1 The Hilbert and Cauchy Matrices . . . . . . . . . . . . . . . . 514
26.2 Random Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 517
26.3 “Randsvd” Matrices . . . . . . . . . . . . . . . . . . . . . . . . 519
26.4 The Pascal Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 520
26.5 Tridiagonal Toeplitz Matrices . . . . . . . . . . . . . . . . . . 524
26.6 Companion Matrices . . . . . . . . . . . . . . . . . . . . . . . 525
26.7 Notes and References . . . . . . . . . . . . . . . . . . . . . . . 526
26.7.1 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527

A Solutions to Problems 529

B Singular Value Decomposition, M-Matrices 579
B.1 Singular Value Decomposition . . . . . . . . . . . . . . . . . . 580
B.2 M-Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580

C Acquiring Software 581


C.1 Internet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
C.2 Netlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
C.3 MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
C.4 NAG Library and FTN90 Compiler . . . . . . . . . . . . . . . 583

D Program Libraries 585


D.1 Basic Linear Algebra Subprograms . . . . . . . . . . . . . . . 586
D.2 EISPACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
D.3 LINPACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
D.4 LAPACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
D.4.1 Structure of LAPACK . . . . . . . . . . . . . . . . . . 588

E The Test Matrix Toolbox 591



Bibliography 595

Name Index 665

Subject Index 675


List of Figures

1.1 Backward and forward errors for y = f(x). . . . . . . . . . . . 8


1.2 Mixed forward-backward error for y = f(x). . . . . . . . . . . 9
1.3 Forward errors and relative residuals versus precision. . . . . . 20
1.4 Absolute error versus precision, t = -log2 u . . . . . . . . . . . 21
1.5 Relative errors ||Ak -Âk ||2/||A||2 for Givens QR factorization. 25
1.6 Values of rational function computed by Horner’s rule. . . 29

2.1 Relative distance from x to the next larger machine number
(β = 2, t = 24), displaying wobbling precision. . . . . . . . . . . 44

4.1 Recovering the rounding error. . . . . . . . . . . . . . . . . . . 92


4.2 Errors for Euler's method with and without compensated
summation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.1 Computed polynomial values and running and a priori bounds


for Horner’s method. . . . . . . . . . . . . . . . . . . . . . . . . 107

6.1 Plots of p versus ||A||p, for 1 < p < 15. . . . . . . . . . . . . . . 125

9.1 A banded matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . 182

13.1 Residuals for inverses computed by MATLAB'S INV function. . . 264

14.1 Underestimation ratio for Algorithm 14.4 for 5×5 matrix A( θ). 297

16.1 Forward and backward errors for SOR iteration. . . . . . . . . 327

17.1 A typical hump for a convergent, nonnormal matrix. . . . . . . 347


17.2 Diverging powers of a nilpotent matrix, C14. . . . . . . . . . . 347
17.3 Infinity norms of powers of 2 × 2 matrix J in (17.2). . . . . . . 349
17.4 Computed powers of chebspec matrices. . . . . . . . . . . . . . 356
17.5 Pseudospectra for chebspec matrices. . . . . . . . . . . . . . . 357
17.6 Pseudospectrum for SOR iteration matrix. . . . . . . . . . . . . 359

xvii

18.1 Householder matrix P times vector x. . . . . . . . . . . . . . 363


18.2 Givens rotation, y = G(i, j, θ)x. . . . . . . . . . . . . . . . . 372

22.1 Exponent versus time for matrix multiplication. . . . . . . . . . 449


22.2 Errors for Strassen's method with two random matrices of
dimension n = 1024. . . . . . . . . . . . . . . . . . . . . . . . . 457

23.1 Error in FFT followed by inverse FFT. . . . . . . . . . . . . . . 468

24.1 The possible steps in one iteration of the MDS method when
n=2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478

25.1 Rational function r . . . . . . . . . . . . . . . . . . . . . . . . 493


25.2 Error in evaluating rational function r . . . . . . . . . . . . . . 494

26.1 spy(rem(pascal(32),2)). . . . . . . . . . . . . . . . . . . . . 524


26.2 Pseudospectra of compan(A). . . . . . . . . . . . . . . . . . . 526
26.3 Pseudospectra of 32 × 32 pentadiagonal Toeplitz matrices. . . . 528
List of Tables

1.1 Computed approximations fl((1 + 1/n)^n) to e = 2.71828 . . . . 16


1.2 Computed values of (e^x - 1)/x from Algorithms 1 and 2. . . . 23
1.3 Results from GE without pivoting on an upper Hessenberg
matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.1 Floating point arithmetic parameters. . . . . . . . . . . . . . . 41


2.2 IEEE arithmetic exceptions and default results. . . . . . . . . . 46
2.3 Test arithmetics. . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4 Sine test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5 Exponentiation test. . . . . . . . . . . . . . . . . . . . . . . . . 55

4.1 Mean square errors for nonnegative x_i. . . . . . . . . . . . . . 99

6.1 Constants αpq such that ||x||p ≤ αpq ||x||q. . . . . . . . . . . . 121

6.2 Constants αpq such that ||A||p ≤ αpq ||A||q. . . . . . . . . . . 122

7.1 Backward and forward stability. . . . . . . . . . . . . . . . . . . 143

9.1 Times for solution of a linear system of order n . . . . . . . . . 189


9.2 Records for largest dense linear systems solved. . . . . . . . . . 199

11.1 ω|A|,|b| values for A = orthog(25). . . . . . . . . . . . . . . . . 240


11.2 ω|A|,|b| values for A = clement(50) . . . . . . . . . . . . . . . . . 241
11.3 ω|A|,|b| values for A = gfpp(50) . . . . . . . . . . . . . . . . . . . 241
12.1 Stability of block and point LU factorization. . . . . . . . . . . 256

13.1 Backward errors for the ∞-norm. . . . . . . . . . . . . . . . . 262


13.2 Mflop rates for inverting a triangular matrix on a Cray 2. . . . 270
13.3 Mflop rates for inverting a full matrix on a Cray 2. . . . . . . . 275
13.4 Times (minutes and seconds) for inverting an n × n matrix. . . 276
13.5 Additional timings for inverting an n × n matrix. . . . . . . . . 276
13.6 Gauss-Jordan elimination for Ux = b. . . . . . . . . . . . . . . 279

16.1 Dates of publication of selected iterative methods. . . . . . . . 326


16.2 Results for Jacobi method, α = 1/2 - 8^-j. . . . . . . . . . . . 333


16.3 Results for Jacobi method, α = -(1/2 - 8^-j). . . . . . . . . . 333

19.1 LS backward errors and residual for Vandermonde system. . . . 405

20.1 Backward errors for underdetermined Vandermonde system. . . 422

21.1 Bounds and estimates for . . . . . . . . . . . . . . . . 428


21.2 Parameters in the three-term recurrence (21.6). . . . . . . . . . 433
21.3 Results for dual Chebyshev-Vandermonde-like system. . . . . . 438

25.1 Results from Cholesky factorization. . . . . . . . . . . . . . . . 496


25.2 Effect of extended run time on Patriot missile operation. . . . . 507

26.1 Condition numbers of Hilbert and Pascal matrices. . . . . . . . 516


Preface

It has been 30 years since the publication of Wilkinson’s books Rounding Er-
rors in Algebraic Processes [1088, 1963] and The Algebraic Eigenvalue Prob-
lem [1089, 1965]. These books provided the first thorough analysis of the
effects of rounding errors on numerical algorithms, and they rapidly became
highly influential classics in numerical analysis. Although a number of more
recent books have included analysis of rounding errors, none has treated the
subject in the same depth as Wilkinson.
This book gives a thorough, up-to-date treatment of the behaviour of
numerical algorithms in finite precision arithmetic. It combines algorithmic
derivations, perturbation theory, and rounding error analysis. Software prac-
ticalities are emphasized throughout, with particular reference to LAPACK.
The best available error bounds, some of them new, are presented in a unified
format with a minimum of jargon. Historical perspective is given to pro-
vide insight into the development of the subject, and further information is
provided in the many quotations. Perturbation theory is treated in detail,
because of its central role in revealing problem sensitivity and providing error
bounds. The book is unique in that algorithmic derivations and motivation
are given succinctly, and implementation details minimized, so that atten-
tion can be concentrated on accuracy and stability results. The book was
designed to be a comprehensive reference and contains extensive citations to
the research literature.
Although the book’s main audience is specialists in numerical analysis, it
will be of use to all computational scientists and engineers who are concerned
about the accuracy of their results. Much of the book can be understood with
only a basic grounding in numerical analysis and linear algebra.
The first two chapters are very general. Chapter 1 describes fundamental
concepts of finite precision arithmetic, giving many examples for illustration
and dispelling some misconceptions. Chapter 2 gives a thorough treatment of
floating point arithmetic and may well be the single most useful chapter in the
book. In addition to describing models of floating point arithmetic and the
IEEE standard, it explains how to exploit “low-level” features not represented
in the models and contains a large set of informative exercises.
In the rest of the book the focus is, inevitably, on numerical linear algebra,
because it is in this area that rounding errors are most influential and have


been most extensively studied. However, I found that it was impossible to


cover the whole of numerical linear algebra in a single volume. The main
omission is the area of eigenvalue and singular value computations, which
is still the subject of intensive research and requires a book of its own to
summarize algorithms, perturbation theory, and error analysis. This book is
therefore certainly not a replacement for The Algebraic Eigenvalue Problem.
Two reasons why rounding error analysis can be hard to understand are
that, first, there is no standard notation and, second, error analyses are often
cluttered with re-derivations of standard results. In this book I have used no-
tation that I find nearly always to be the most convenient for error analysis:
the key ingredient is the symbol γ n = nu/(1 - nu), explained in §3.1. I have
also summarized many basic error analysis results (for example, in Chapters 3
and 8) and made use of them throughout the book. I like to think of these
basic results as analogues of the Fortran BLAS (Basic Linear Algebra Sub-
programs): once available in a standard form they can be used as black boxes
and need not be reinvented.
A number of the topics included here have not been treated in depth in pre-
vious numerical analysis textbooks. These include floating point summation,
block LU factorization, condition number estimation, the Sylvester equation,
powers of matrices, finite precision behaviour of stationary iterative methods,
Vandermonde systems, and fast matrix multiplication, each of which has its
own chapter. But there are also some notable omissions. I would have liked
to include a chapter on Toeplitz systems, but this is an area in which sta-
bility and accuracy are incompletely understood and where knowledge of the
underlying applications is required to guide the investigation. The important
problems of updating and downdating matrix factorizations when the matrix
undergoes a “small” change have also been omitted due to lack of time and
space. A further omission is analysis of parallel algorithms for all the problems
considered in the book (though blocked and partitioned algorithms and one
particular parallel method for triangular systems are treated). Again, there
are relatively few results and this is an area of active research.
Throughout the history of numerical linear algebra, theoretical advances
have gone hand in hand with software development. This tradition has con-
tinued with LAPACK (1987-), a project to develop a state-of-the-art Fortran
package for solving linear equations and eigenvalue problems. LAPACK has
enjoyed a synergy with research that has led to a number of important break-
throughs in the design and analysis of algorithms, from the standpoints of
both performance and accuracy. A key feature of this book is that it pro-
vides the material needed to understand the numerical properties of many of
the algorithms in LAPACK, the exceptions being the routines for eigenvalue
and singular value problems. In particular, the error bounds computed by
the LAPACK linear equation solvers are explained, the LAPACK condition
estimator is described in detail, and some of the software issues confronted by

the LAPACK developers are highlighted. Chapter 25 examines the influence


of floating point arithmetic on general numerical software, offering salutary
stories, useful techniques, and brief descriptions of relevant codes.
This book has been written with numerical analysis courses in mind, al-
though it is not designed specifically as a textbook. It would be a suitable
reference for an advanced course (for example, for a graduate course on Nu-
merical Linear Algebra following the syllabus recommended by the ILAS Ed-
ucation Committee [601, 1993]), and could be used by instructors at all levels
as a supplementary text from which to draw examples, historical perspective,
statements of results, and exercises. The exercises (actually labelled “prob-
lems”) are an important part of the book, and many of them have not, to my
knowledge, appeared in textbooks before. Where appropriate I have indicated
the source of an exercise; a name without a citation means that the exercise
came from private communication or unpublished notes. Research problems
given at the end of some sets of exercises emphasize that most of the areas
covered are still active.
In addition to surveying and unifying existing results (including some that
have not appeared in the mainstream literature) and sometimes improving
upon their presentation or proof, this book contains new results. Some of
particular note are as follows.

1. The error analysis in §5.3 for evaluation of the Newton interpolating


polynomial.
2. The forward error analysis for iterative refinement in §11.1.
3. The error analysis of Gauss-Jordan elimination in §13.4.
4. The unified componentwise error analysis of QR factorization methods
in Chapter 18, and the corresponding analysis of their use for solving
the least squares problem in Chapter 19.
5. Theorem 20.3, which shows the backward stability of the QR factoriza-
tion method for computing the minimum 2-norm solution to an under-
determined system.

The Notes and References are an integral part of each chapter. In addi-
tion to containing references, historical information, and further details, they
include material not covered elsewhere in the chapter, and should always be
consulted, in conjunction with the index, to obtain the complete picture.
I have included relatively few numerical examples except in the first chap-
ter. There are two reasons. One is to reduce the length of the book. The
second reason is that today it is so easy for the reader to perform experi-
ments in MATLAB* or some other interactive system. To this end I have made
*MATLAB is a registered trademark of The Math Works, Inc.

available the Test Matrix Toolbox, which contains MATLAB M-files for many
of the algorithms and special matrices described in the book; see Appendix E.
This book has been designed to be as easy to use as possible. There are
thorough name and subject indexes, page headings show chapter and section
titles and numbers, and there is extensive cross-referencing. I have adopted
the unusual policy of giving with (nearly) every citation not only its numerical
location in the bibliography but also the names of the authors and the year of
publication. This provides as much information as possible in a citation and
reduces the need for the reader to turn to the bibliography.
A BibTeX database acc-stab-num-alg.bib containing all the references
in the bibliography is available over the Internet from the bibnet project
(which can be accessed via netlib, described in §C.2).
Special care has been taken to minimize the number of typographical and
other errors, but no doubt, some remain. I will be happy to receive notification
of errors, as well as comments and suggestions for improvement.

Acknowledgements

Three books, in addition to Wilkinson’s, have strongly influenced my research


in numerical linear algebra and have provided inspiration for this book: Golub
and Van Loan’s Matrix Computations [470, 1989] (first edition 1983), Parlett’s
The Symmetric Eigenvalue Problem [820, 1980], and Stewart’s Introduction
to Matrix Computations [941, 1973]. Knuth’s The Art of Computer Program-
ming books [666, 1973-1981] have also influenced my style and presentation.
Jim Demmel has contributed greatly to my understanding of the subject
of this book and provided valuable technical help and suggestions. The first
two chapters owe much to the work of Velvel Kahan; I am grateful to him
for giving me access to unpublished notes and for suggesting improvements to
early versions of Chapters 2 and 25. Des Higham read various drafts of the
book, offering sound advice and finding improvements that had eluded me.
Other people who have given valuable help, suggestions, or advice are

Zhaojun Bai, Brad Baxter, Åke Björck, Martin Campbell-Kelly,


Shivkumar Chandrasekaran, Alan Edelman, Warren Ferguson, Philip
Gill, Gene Golub, George Hall, Sven Hammarling, Andrzej Kiełbasiński,
Philip Knight, Beresford Parlett, David Silvester, Michael
Saunders, Ian Smith, Doron Swade, Nick Trefethen, Jack Williams,
and Hongyuan Zha.

David Carlisle provided invaluable help and advice concerning LaTeX.


Working with SIAM on the publication of this book was a pleasure. Special
thanks go to Nancy Abbott (design), Susan Ciambrano (acquisition), Ed Cil-
urso (production), Beth Gallagher (copy editing), Corey Gray (production),

Mary Rose Muccie (copy editing and indexing), Colleen Robishaw (design),
and Sam Young (production).
Research leading to this book has been supported by grants from the
Engineering and Physical Sciences Research Council, by a Nuffield Science
Research Fellowship from the Nuffield Foundation, and by a NATO Collabo-
rative Research Grant held with J. W. Demmel. I was fortunate to be able
to make extensive use of the libraries of the University of Manchester, the
University of Dundee, Stanford University, and the University of California,
Berkeley.
This book was typeset in LaTeX using the book document style. The
references were prepared in BibTeX and the index with MakeIndex. It is dif-
ficult to imagine how I could have written the book without these wonderful
tools. I used the “big” software from the distribution, running on a
486DX workstation. I used text editors The Semware Editor (Semware Cor-
poration) and GNU Emacs (Free Software Foundation) and checked spelling
with PC-Write (Quicksoft).

Manchester Nicholas J. Higham


April 1995
About the Dedication

This book is dedicated to the memory of two remarkable English mathemati-


cians, James Hardy Wilkinson (1919-1986), FRS, and Alan Mathison Turing
(1912-1954), FRS, both of whom made immense contributions to scientific
computation.
Turing’s achievements include his paper “On Computable Numbers, with
an Application to the Entscheidungsproblem”, which answered Hilbert’s de-
cidability question using the abstract device now known as a Turing machine
[1025, 1936]; his work at Bletchley Park during World War II on breaking
the ciphers of the Enigma machine; his 1945 report proposing a design for
the Automatic Computing Engine (ACE) at the National Physical Labora-
tory [1026, 1945]; his 1948 paper on LU factorization and its rounding error
analysis [1027, 1948]; his consideration of fundamental questions in artificial
intelligence (including his proposal of the “Turing test”); and, during the last
part of his life, spent at the University of Manchester, his work on morpho-
genesis (the development of structure and form in an organism). Turing is
remembered through the Turing Award of the Association for Computing Ma-
chinery (ACM), which has been awarded yearly since 1966 [3, 1987]. For more
about Turing, read the superb biography by Hodges [575, 1983], described by
a reviewer as “one of the finest pieces of scholarship to appear in the history
of computing” [182, 1984].
Wilkinson, like Turing a Cambridge-trained mathematician, was Turing’s
assistant at the National Physical Laboratory. When Turing left, Wilkinson
managed the group that built the Pilot ACE, contributing to the design and
construction of the machine and its software. Subsequently, he used the ma-
chine to develop and study a variety of numerical methods. He developed
backward error analysis in the 1950s and 1960s, publishing the books Round-
ing Errors in Algebraic Processes [1088, 1963]† (REAP) and The Algebraic
Eigenvalue Problem [1089, 1965]‡ (AEP), both of which rapidly achieved the
status of classics. (AEP was reprinted in paperback in 1988 and, after being
out of print for many years, REAP is now also available in paperback.) The
AEP was described by the late Professor Leslie Fox as “almost certainly the
most important and widely read title in numerical analysis”. Wilkinson also

†REAP has been translated into Polish [1091, 1967] and German [1093, 1969].

‡AEP has been translated into Russian [1094, 1970].


contributed greatly to the development of mathematical software. The vol-


ume Handbook for Automatic Computation, Volume II: Linear Algebra [1102,
1971 ], co-edited with Reinsch, contains high-quality, properly documented
software and has strongly influenced subsequent software projects such as the
NAG Library, EISPACK, LINPACK, and LAPACK.
Wilkinson received the 1970 Turing Award. In his Turing Award lec-
ture he described life with Turing at the National Physical Laboratory in the
1940s [1096, 1971].
Wilkinson is remembered through SIAM’s James H. Wilkinson Prize in
Numerical Analysis and Scientific Computing, awarded every 4 years; the
Wilkinson Prize for Numerical Software, awarded by Argonne National Lab-
oratory, the National Physical Laboratory, and the Numerical Algorithms
Group; and the Wilkinson Fellowship in Scientific Computing at Argonne
National Laboratory. For more about Wilkinson see the biographical mem-
oir by Fox [403, 1987], Fox’s article [402, 1978], Parlett’s essay [821, 1990],
the prologue and epilogue of the proceedings [252, 1990] of a conference held
in honour of Wilkinson at the National Physical Laboratory in 1987, and the
tributes in [23, 1987]. Lists of Wilkinson’s publications are given in [403, 1987]
and in the special volume of the journal Linear Algebra and its Applications
(88/89, April 1987) published in his memory.

Chapter 1
Principles of Finite Precision
Computation

Numerical precision is the very soul of science.


-SIR D’ARCY WENTWORTH THOMPSON, On Growth and Form (1942)

There will always be a small but steady demand for error-analysts to . . .


expose bad algorithms’ big errors and, more important,
supplant bad algorithms with provably good ones.
-WILLIAM M. KAHAN, Interval Arithmetic Options in
the Proposed IEEE Floating Point Arithmetic Standard (1980)

Since none of the numbers which we take out from logarithmic and
trigonometric tables admit of absolute precision,
but are all to a certain extent approximate only,
the results of all calculations performed
by the aid of these numbers can only be approximately true . . .
It may happen, that in special cases the
effect of the errors of the tables is so augmented that
we may be obliged to reject a method,
otherwise the best, and substitute another in its place.
-CARL FRIEDRICH GAUSS¹, Theoria Motus (1809)

Backward error analysis is no panacea;


it may explain errors but not excuse them.
-HEWLETT-PACKARD, HP-15C Advanced Functions Handbook (1982)

¹Cited in Goldstine [461, 1977, p. 258].

This book is concerned with the effects of finite precision arithmetic on nu-
merical algorithms’, particularly those in numerical linear algebra. Central
to any understanding of high-level algorithms is an appreciation of the basic
concepts of finite precision arithmetic. This opening chapter briskly imparts
the necessary background material. Various examples are used for illustra-
tion, some of them familiar (such as the quadratic equation) but several less
well known. Common misconceptions and myths exposed during the chapter
are highlighted towards the end, in §1.19.
This chapter has few prerequisites and few assumptions are made about
the nature of the finite precision arithmetic (for example, the base, number
of digits, or mode of rounding, or even whether it is floating point arith-
metic). The second chapter deals in detail with the specifics of floating point
arithmetic.
A word of warning: some of the examples from §1.12 onward are special
ones chosen to illustrate particular phenomena. You may never see in practice
the extremes of behaviour shown here. Let the examples show you what
can happen, but do not let them destroy your confidence in finite precision
arithmetic!

1.1. Notation and Background


We describe the notation used in the book and briefly set up definitions needed
for this chapter.
Generally, we use
capital letters A, B, C, ∆, Λ for matrices,
subscripted lower case letters a_ij, b_ij, c_ij, δ_ij, λ_ij for matrix elements,
lower case letters x, y, z, c, g, h for vectors,
lower case Greek letters α, β, γ, φ, ρ for scalars,

following the widely used convention originally introduced by Householder [587,
1964].
The vector space of all real m × n matrices is denoted by IR^{m×n} and the
vector space of real n-vectors by IR^n. Similarly, C^{m×n} denotes the vector
space of complex m × n matrices.
Algorithms are expressed using a pseudocode based on the MATLAB lan-
guage [232, 1988], [735, 1992]. Comments begin with the % symbol.
Submatrices are specified with the colon notation, as used in MATLAB and
Fortran 90: A( p :q, r:s) denotes the submatrix of A formed by the intersection
of rows p to q and columns r to s. As a special case, a lone colon as the row or
column specifier means to take all entries in that row or column; thus A (:,j )
is the jth column of A and A(i,:) the ith row. The values taken by an integer
²For the purposes of this book an algorithm is a MATLAB program; cf. Smale [924, 1990].

variable are also described using the colon notation: “i = 1:n” means the
same as “i = 1,2, . . . , n” .
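For illustration, here is a small MATLAB sketch of the colon notation just described (the particular matrix, magic(5), is an arbitrary choice made for this example and is not taken from the book):

    A = magic(5);       % an arbitrary 5-by-5 test matrix
    B = A(2:4, 1:3);    % submatrix: rows 2 to 4, columns 1 to 3
    c = A(:, 3);        % the whole of column 3
    r = A(2, :);        % the whole of row 2
    for i = 1:5         % i takes the values 1, 2, ..., 5
        A(i, i) = 0;    % zero the diagonal entries
    end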
Evaluation of an expression in floating point arithmetic is denoted fl(·),
and we assume that the basic arithmetic operations op = +, -, *, / satisfy

    fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u.   (1.1)

Here, u is the unit roundoff (or machine precision), which is typically of order
10^-8 or 10^-16 in single and double precision computer arithmetic, respectively,
and between 10^-10 and 10^-12 on pocket calculators. For more on floating
point arithmetic see Chapter 2.
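As a simple illustration of (1.1) and of the size of u (a sketch assuming IEEE double precision arithmetic, not an experiment from the book), the following MATLAB lines show that repeatedly adding 0.1, which is not exactly representable in binary, gives a sum differing from 1 by a small multiple of u:

    s = 0;
    for i = 1:10
        s = s + 0.1;    % each addition obeys (1.1): relative error at most u
    end
    abs(s - 1)          % nonzero, and of order u (about 1.1e-16)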
Computed quantities (and, in this chapter only, arbitrary approximations)
wear a hat. Thus x̂ denotes the computed approximation to x.
Definitions are often (but not always) indicated by ":=" or "=:", with the
colon next to the object being defined.
We make use of the floor and ceiling functions: ⌊x⌋ is the largest integer
less than or equal to x, and ⌈x⌉ is the smallest integer greater than or equal
to x.
The normal distribution with mean µ and variance σ^2 is denoted by
N(µ, σ^2).

We measure the cost of algorithms in flops. A flop is an elementary floating


point operation: +,-,/, or *. We normally state only the highest-order terms
of flop counts. Thus, when we say that an algorithm for n × n matrices requires
2n^3/3 flops, we really mean 2n^3/3 + O(n^2) flops.
Other definitions and notation are introduced when needed. Two topics,
however, do not fit comfortably into the main text and are described in
Appendix B: the singular value decomposition (SVD) and M-matrices.
All our numerical experiments were carried out either in MATLAB 4.2 [735,
1992], sometimes in conjunction with the Symbolic Math Toolbox [204, 1993],
or with the Salford Software/Numerical Algorithms Group FTN90³ Fortran 90
compiler, Version 1.2 [888, 1993]. Whenever we say a computation was “done
in Fortran 90” we are referring to the use of this compiler. All the results
quoted were obtained on a 486DX workstation, unless otherwise stated, but
many of the experiments were repeated on a Sun SPARCstation, using the
NAGWare⁴ FTN90 compiler [785, 1992]. Both machines use IEEE standard
floating point arithmetic and the unit roundoff is u = 2^-53 ≈ 1.1 × 10^-16
in MATLAB and in double precision in Fortran 90. (Strictly speaking, in
Fortran 90 we should not use the terminology single and double precision but
should refer to the appropriate KIND parameters; see, e.g., Metcalf and Reid
[749, 1990, §2.6]. However, these terms are vivid and unambiguous in IEEE
arithmetic, so we use them throughout the book.)
³FTN90 is a joint trademark of Salford Software Ltd. and The Numerical Algorithms Group Ltd.
⁴NAGWare is a trademark of The Numerical Algorithms Group Ltd.

1.2. Relative Error and Significant Digits


Let x̂ be an approximation to a real number x. The most useful measures of
the accuracy of x̂ are its absolute error

    E_abs(x̂) = |x - x̂|,

and its relative error

    E_rel(x̂) = |x - x̂|/|x|

(which is undefined if x = 0). An equivalent definition of relative error is
E_rel(x̂) = |ρ|, where x̂ = x(1 + ρ). Some authors omit the absolute values
from these definitions. When the sign is important we will simply talk about
"the error x - x̂".
In scientific computation, where answers to problems can vary enormously
in magnitude, it is usually the relative error that is of interest, because it is
scale independent: scaling x → αx and x̂ → αx̂ leaves E_rel(x̂) unchanged.
Relative error is connected with the notion of correct significant digits (or
correct significant figures). The significant digits in a number are the first
nonzero digit and all succeeding digits. Thus 1.7320 has five significant digits,
while 0.0491 has only three. What is meant by correct significant digits in
a number that approximates another seems intuitively clear, but a precise
definition is problematic, as we explain in a moment. First, note that for a
number with p significant digits there are only p +1 possible answers to the
question "how many correct significant digits does x̂ have?" (assuming x̂ is
not a constant such as 2.0 that is known exactly). Therefore the number of
correct significant digits is a fairly crude measure of accuracy in comparison
with the relative error. For example, in the following two cases x̂ agrees with
x to three but not four significant digits by any reasonable definition, yet the
relative errors differ by a factor of about 44:

    x = 1.00000,  x̂ = 1.00499,  E_rel(x̂) = 4.99 × 10^-3,
    x = 9.00000,  x̂ = 8.99899,  E_rel(x̂) = 1.12 × 10^-4.
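These two relative errors are easily reproduced in MATLAB (a trivial check, included only to make the factor of 44 concrete):

    x1 = 1.00000;  x1hat = 1.00499;
    x2 = 9.00000;  x2hat = 8.99899;
    e1 = abs(x1 - x1hat)/abs(x1)    % 4.99e-3
    e2 = abs(x2 - x2hat)/abs(x2)    % 1.12e-4
    ratio = e1/e2                   % approximately 44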
Here is a possible definition of correct significant digits: an approximation
x̂ to x has p correct significant digits if x̂ and x round to the same number to
p significant digits. Rounding is the act of replacing a given number by the
nearest p significant digit number, with some rule for breaking ties when there
are two nearest. This definition of correct significant digits is mathematically
elegant and agrees with intuition most of the time. But consider the numbers

    x = 0.9949,  x̂ = 0.9951.

According to the definition x̂ does not have two correct significant digits
but does have one and three correct significant digits!
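The anomaly can be checked directly. The following MATLAB sketch (written for this illustration, not code from the book) rounds each number to p significant digits and compares the results for p = 1, 2, 3:

    x = 0.9949;  xhat = 0.9951;
    for p = 1:3
        e = floor(log10(abs(x))) - p + 1;    % position of the pth significant digit
        rx    = round(x/10^e)*10^e;          % x rounded to p significant digits
        rxhat = round(xhat/10^e)*10^e;       % xhat rounded to p significant digits
        disp([p, rx, rxhat, rx == rxhat])    % agreement for p = 1 and 3, but not p = 2
    end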
A definition of correct significant digits that does not suffer from the latter
anomaly states that x̂ agrees with x to p significant digits if |x - x̂| is less than
half a unit in the pth significant digit of x. However, this definition implies
that 0.123 and 0.127 agree to two significant digits, whereas many people
would say that they agree to less than two significant digits.
In summary, while the number of correct significant digits provides a useful
way in which to think about the accuracy of an approximation, the relative
error is a more precise measure (and is base independent). Whenever we give
an approximate answer to a problem we should aim to state an estimate or
bound for the relative error.
When x̂ and x are vectors the relative error is most often defined with
a norm, as ||x - x̂||/||x||. For the commonly used norms, such as
||x||∞ := max_i |x_i|, the inequality ||x - x̂||∞/||x||∞ ≤ ½ × 10^-p
implies that components x̂_i with |x̂_i| ≈ ||x||∞ have about p correct significant
decimal digits, but for the smaller components the inequality merely bounds
the absolute error.
A relative error that puts the individual relative errors on an equal footing
is the componentwise relative error

    max_i |x_i - x̂_i|/|x_i|,

which is widely used in error analysis and perturbation analysis (see Chapter 7,
for example).
As an interesting aside we mention the "tablemaker's dilemma". Suppose
you are tabulating the values of a transcendental function such as the sine
function and a particular entry is evaluated as 0.124|500000000 correct to a
few digits in the last place shown, where the vertical bar follows the final
significant digit to be tabulated. Should the final significant digit be 4 or
5? The answer depends on whether there is a nonzero trailing digit and, in
principle, we may never be able to answer the question by computing only a
finite number of digits.

1.3. Sources of Errors

There are three main sources of errors in numerical computation: rounding,
data uncertainty, and truncation.
Rounding errors, which are an unavoidable consequence of working in finite
precision arithmetic, are largely what this book is about. The remainder of
this chapter gives basic insight into rounding errors and their effects.
Uncertainty in the data is always a possibility when we are solving practical
problems. It may arise in several ways: from errors in measuring physical
quantities, from errors in storing the data on the computer (rounding errors),
or, if the data is itself the solution to another problem, it may be the result
of errors in an earlier computation. The effects of errors in the data are
generally easier to understand than the effects of rounding errors committed
during a computation, because data errors can be analysed using perturbation
theory for the problem at hand, while intermediate rounding errors require
an analysis specific to the given method. This book contains perturbation
theory for most of the problems considered, for example, in Chapters 7 (linear
systems), 19 (the least squares problem), and 20 (underdetermined systems).
Analysing truncation errors, or discretization errors, is one of the ma-
jor tasks of the numerical analyst. Many standard numerical methods (for
example, the trapezium rule for quadrature, Euler's method for differential
equations, and Newton's method for nonlinear equations) can be derived by
taking finitely many terms of a Taylor series. The terms omitted constitute
the truncation error, and for many methods the size of this error depends
on a parameter (often called h, "the stepsize") whose appropriate value is a
compromise between obtaining a small error and a fast computation.
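To make the role of the stepsize concrete, the following MATLAB sketch (an illustration constructed for this discussion, not an experiment from the book) approximates the derivative of exp at x = 1 by the forward difference (f(x+h) - f(x))/h; the truncation error decreases with h, but once h is very small rounding errors dominate and the total error grows again:

    x = 1;
    for h = 10.^(-(1:15))
        fd = (exp(x + h) - exp(x))/h;    % forward difference approximation to f'(x)
        disp([h, abs(fd - exp(x))])      % total error is smallest for h of order 1e-8
    end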
is the componentwise relative error

which is widely used in error analysis and perturbation analysis (see Chapter 7,
for example).
As an interesting aside we mention the “tablemaker’s dilemma”. Suppose
you are tabulating the values of a transcendental function such as the sine
function and a particular entry is evaluated as 0.124|500000000 correct to a
few digits in the last place shown, where the vertical bar follows the final
significant digit to be tabulated. Should the final significant digit be 4 or
5? The answer depends on whether there is a nonzero trailing digit and, in
principle, we may never be able to answer the question by computing only a
finite number of digits.

1.3. Sources of Errors


There are three main sources of errors in numerical computation: rounding,
data uncertainty, and truncation.
Rounding errors, which are an unavoidable consequence of working in finite
precision arithmetic, are largely what this book is about. The remainder of
this chapter gives basic insight into rounding errors and their effects.
Uncertainty in the data is always a possibility when we are solving practical
problems. It may arise in several ways: from errors in measuring physical
quantities, from errors in storing the data on the computer (rounding errors),

1.4. Precision Versus Accuracy


The terms accuracy and precision are often confused or used interchangeably,
but it is worth making a distinction between them. Accuracy refers to the
absolute or relative error of an approximate quantity. Precision is the accu-
racy with which the basic arithmetic operations +,-,*,/ are performed, and
for floating point arithmetic is measured by the unit roundoff u (see (1.1)).
Accuracy and precision are the same for the scalar computation c = a*b, but
accuracy can be much worse than precision in the solution of a linear system
of equations, for example.
It is important to realize that accuracy is not limited by precision, at least
in theory. This may seem surprising, and may even appear to contradict many
of the results in this book. However, arithmetic of a given precision can be
used to simulate arithmetic of arbitrarily high precision, as explained in §25.9.
(The catch is that such simulation is too expensive to be of practical use for
routine computation.) In all our error analyses there is an implicit assumption
that the given arithmetic is not being used to simulate arithmetic of a higher
precision.

1.5. Backward and Forward Errors

Suppose that an approximation ŷ to y = f(x) is computed in an arithmetic
of precision u, where f is a real scalar function of a real scalar variable. How
should we measure the “quality” of ŷ?
In most computations we would be happy with a tiny relative error,
E_rel(ŷ) ≈ u, but this cannot always be achieved. Instead of focusing on the
relative error of ŷ we can ask “for what set of data have we actually solved
our problem?”, that is, for what ∆x do we have ŷ = f(x + ∆x)? In general,
there may be many such ∆x, so we should ask for the smallest one. The value
of |∆x| (or min |∆x|), possibly divided by |x|, is called the backward error.
The absolute and relative errors of are called forward errors, to distinguish
them from the backward error. Figure 1.1 illustrates these concepts.
The process of bounding the backward error of a computed solution is
called backward error analysis, and its motivation is twofold. First, it inter-
prets rounding errors as being equivalent to perturbations in the data. The
data frequently contains uncertainties due to previous computations or er-
rors committed in storing numbers on the computer. If the backward error
is no larger than these uncertainties then the computed solution can hardly
be criticized-it may be the solution we are seeking, for all we know. The
second attraction of backward error analysis is that it reduces the question of
bounding or estimating the forward error to perturbation theory, which for
many problems is well understood (and only has to be developed once, for the

Figure 1.1. Backward and forward errors for y = f(x). Solid line = exact; dotted
line = computed.

given problem, and not for each method). We discuss perturbation theory in
the next section.
A method for computing y = f(x) is called backward stable if, for any x,
it produces a computed ŷ with a small backward error, that is, ŷ = f(x + ∆x)
for some small ∆x. The definition of “small” will be context dependent. In
general, a given problem has several methods of solution, some of which are
backward stable and some not.
As an example, assumption (1.1) says that the computed result of the
operation x ± y is the exact result for perturbed data x(1 + δ) and y(1 + δ)
with |δ| ≤ u; thus addition and subtraction are, by assumption, backward
stable operations.
Most routines for computing the cosine function do not satisfy ŷ = cos(x +
∆x) with a relatively small ∆x, but only the weaker relation ŷ + ∆y = cos(x +
∆x), with relatively small ∆y and ∆x. A result of the form

    ŷ + ∆y = f(x + ∆x),    |∆y| ≤ ε|y|,    |∆x| ≤ η|x|,        (1.2)

is known as a mixed forward-backward error result and is illustrated in Fig-
ure 1.2. Provided that ε and η are sufficiently small, (1.2) says that the
computed value ŷ scarcely differs from the value y + ∆y that would have been
produced by an input x + ∆x scarcely different from the actual input x. Even
more simply, ŷ is almost the right answer for almost the right data.
In general, an algorithm is called numerically stable if it is stable in the
mixed forward-backward error sense of (1.2) (hence a backward stable algo-
rithm can certainly be called numerically stable). Note that this definition is
specific to problems where rounding errors are the dominant form of errors.
The term stability has different meanings in other areas of numerical analysis.

Figure 1.2. Mixed forward-backward error for y = f(x). Solid line = exact; dotted
line = computed.

1.6. Conditioning
The relationship between forward and backward error for a problem is gov-
erned by the conditioning of the problem, that is, the sensitivity of the solution
to perturbations in the data. Continuing the y = f(x) example of the pre-
vious section, let an approximate solution ŷ satisfy ŷ = f(x + ∆x). Then,
assuming for simplicity that f is twice continuously differentiable,

    ŷ - y = f(x + ∆x) - f(x) = f'(x)∆x + (f''(x + θ∆x)/2!)(∆x)^2,    θ ∈ (0, 1),

and we can bound or estimate the right-hand side. This expansion leads to
the notion of condition number. Since

    (ŷ - y)/y ≈ (x f'(x)/f(x)) (∆x/x),

the quantity

    c(x) = |x f'(x)/f(x)|

measures, for small ∆x, the relative change in the output for a given relative
change in the input, and it is called the (relative) condition number of f. If
x or f is a vector then the condition number is defined in a similar way using
norms and it measures the maximum relative change, which is attained for
some, but not all, vectors ∆x.
As an example, consider the function f(x) = log x. The condition number
is c(x) = |1/log x|, which is large for x ≈ 1. This means that a small relative
change in x can produce a much larger relative change in log x for x ≈ 1. The
reason is that a small relative change in x produces a small absolute change
in f(x) = log x (since f(x + ∆x) ≈ f(x) + f'(x)∆x = f(x) + ∆x/x), and that
change in log x may be large in a relative sense.
When backward error, forward error, and the condition number are defined
in a consistent fashion we have the useful rule of thumb that

    forward error ≲ condition number × backward error,

with approximate equality possible. One way to interpret this rule of thumb
is to say that the computed solution to an ill-conditioned problem can have a
large forward error. For even if the computed solution has a small backward
error, this error can be amplified by a factor as large as the condition number
when passing to the forward error.
One further definition is useful. If a method produces answers with for-
ward errors of similar magnitude to those produced by a backward stable
method, then it is called forward stable. Such a method need not be back-
ward stable itself. Backward stability implies forward stability, but not vice
versa. An example of a method that is forward stable but not backward stable
is Cramer’s rule for solving a 2 × 2 linear system, which is discussed in §1.10.1.

1.7. Cancellation
Cancellation is what happens when two nearly equal numbers are subtracted.
It is often, but not always, a bad thing. Consider the function f(x) = (1 -
cos x)/x^2. With x = 1.2×10^-5 the value of cos x rounded to 10 significant
figures is
    c = 0.9999 9999 99,
so that
    1 - c = 0.0000 0000 01.
Then (1 - c)/x^2 = 10^-10/1.44×10^-10 = 0.6944. . . , which is clearly wrong
given the fact that 0 < f(x) < 1/2 for all x ≠ 0. A 10 significant figure
approximation to cos x is therefore not sufficient to yield a value of f(x) with
even one correct figure. The problem is that 1 - c has only 1 significant
figure. The subtraction 1 - c is exact, but this subtraction produces a result
of the same size as the error in c. In other words, the subtraction elevates the
importance of the earlier error. In this particular example it is easy to rewrite
f(x) to avoid the cancellation. Since cos x = 1 - 2 sin^2(x/2),

    f(x) = (1/2) (sin(x/2) / (x/2))^2.

Evaluating this second formula for f(x) with a 10 significant figure approxi-
mation to sin(x/2) yields f(x) = 0.5, which is correct to 10 significant figures.
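The rewriting can be tried directly on a computer. The following sketch (in
Python, which is not used elsewhere in this book) imitates the 10 significant
figure scenario by rounding cos x and sin(x/2) to 10 significant figures before
using them; the helper names are ours.

    import math

    x = 1.2e-5

    # Original formula: the subtraction 1 - c promotes the rounding error in c.
    c = float(f"{math.cos(x):.10g}")      # cos x rounded to 10 significant figures
    f_naive = (1 - c) / x**2

    # Rewritten formula based on cos x = 1 - 2 sin^2(x/2): no cancellation.
    s = float(f"{math.sin(x/2):.10g}")    # sin(x/2) rounded to 10 significant figures
    f_rewritten = 0.5 * (s / (x/2))**2

    print(f_naive)        # about 0.69, no correct figures
    print(f_rewritten)    # 0.5, correct to 10 significant figures

The only difference between the two computations is the algebraic form; the
data and the precision of the trigonometric values are the same.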

To gain more insight into the cancellation phenomenon consider the sub-
traction (in exact arithmetic) x̂ = â - b̂, where â = a(1 + ∆a) and b̂ = b(1 + ∆b).
The terms ∆a and ∆b are relative errors or uncertainties in the data, perhaps
attributable to previous computations. With x = a - b we have

    |x - x̂| / |x| = |-a∆a + b∆b| / |a - b| ≤ max(|∆a|, |∆b|) (|a| + |b|) / |a - b|.

The relative error bound for x̂ is large when |a - b| << |a| + |b|, that is,
when there is heavy cancellation in the subtraction. This analysis shows that
subtractive cancellation causes relative errors or uncertainties already present
in a and b to be magnified. In other words, subtractive cancellation brings
earlier errors into prominence.
It is important to realize that cancellation is not always a bad thing. There
are several reasons. First, the numbers being subtracted may be error free,
as when they are from initial data that is known exactly. The computation
of divided differences, for example, involves many subtractions, but half of
them involve the initial data and are harmless for suitable orderings of the
points (see §5.3 and §21.3). The second reason is that cancellation may be
a symptom of intrinsic ill conditioning of a problem, and may therefore be
unavoidable. Third, the effect of cancellation depends on the role that the
result plays in the remaining computation. For example, if x >> y ≈ z > 0
then the cancellation in the evaluation of x + (y - z) is harmless.

1.8. Solving a Quadratic Equation


Mathematically, the problem of solving the (real) quadratic equation ax^2 +
bx + c = 0 is trivial: there are two roots (if a ≠ 0), given by

    x = (-b ± √(b^2 - 4ac)) / (2a).        (1.3)

Numerically, the problem is more challenging, as neither the successful evalua-


tion of (1.3) nor the accuracy of the computed roots can be taken for granted.
The easiest issue to deal with is the choice of formula for computing the
roots. If b^2 >> |4ac| then √(b^2 - 4ac) ≈ |b|, and so for one choice of sign the for-
mula (1.3) suffers massive cancellation. This is damaging cancellation because
one of the arguments, fl(√(b^2 - 4ac)), is inexact, so the subtraction brings into
prominence the earlier rounding errors. How to avoid the cancellation is well
known: obtain the larger root (in absolute value), x1, from

    x1 = -(b + sign(b)√(b^2 - 4ac)) / (2a),

and the other from the equation x1 x2 = c/a.
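A sketch of this recipe in Python (our own code, not from the original text;
the scaling needed to avoid overflow and underflow, discussed below, is
omitted):

    import math

    def quadratic_roots(a, b, c):
        # Assumes a != 0, b^2 - 4ac >= 0, and not both b = 0 and c = 0.
        d = math.sqrt(b*b - 4.0*a*c)
        # Larger root in absolute value: b and sign(b)*d have the same sign,
        # so the addition involves no cancellation.
        x1 = -(b + math.copysign(d, b)) / (2.0*a)
        # Other root from x1*x2 = c/a.
        x2 = c / (a * x1)
        return x1, x2

    # b^2 >> |4ac|: the naive formula would lose figures in the smaller root.
    print(quadratic_roots(1.0, -1e8, 1.0))    # roots near 1e8 and 1e-8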



Unfortunately, there is a more pernicious source of cancellation: the sub-


traction b^2 - 4ac. Accuracy is lost here when b^2 ≈ 4ac (the case of nearly
equal roots), and no algebraic rearrangement can avoid the cancellation. The
only way to guarantee accurate computed roots is to use double precision (or
some trick tantamount to the use of double precision) in the evaluation of
b 2 - 4ac.
Another potential difficulty is underflow and overflow. If we apply the
formula (1.3) in IEEE single precision arithmetic (described in §2.3) to the
equation 10^20 x^2 - 3·10^20 x + 2·10^20 = 0 then overflow occurs, since the maxi-
mum floating point number is of order 10^38; the roots, however, are innocuous:
x = 1 and x = 2. Dividing through the equation by max(|a|, |b|, |c|) = 10^20
cures the problem, but this strategy is ineffective for the equation 10^-20 x^2 -
3x + 2·10^20 = 0, whose roots are 10^20 and 2·10^20. In the latter equation we need
to scale the variable: defining x = 10^20 y gives 10^20 y^2 - 3·10^20 y + 2·10^20 = 0,
which is the first equation we considered. These ideas can be built into a
general scaling strategy (see the Notes and References), but the details are
nontrivial.
As this discussion indicates, not only is it difficult to devise an accurate and
robust algorithm for solving a quadratic equation, but it is a nontrivial task
to prepare specifications that define precisely what “accurate” and “robust”
mean for a given system of floating point arithmetic.

1.9. Computing the Sample Variance


In statistics the sample variance of n numbers x1, . . . , xn is defined as

    s_n^2 = (1/(n-1)) Σ_{i=1}^n (x_i - x̄)^2,        (1.4)

where the sample mean

    x̄ = (1/n) Σ_{i=1}^n x_i.

Computing s_n^2 from this formula requires two passes through the data, one
to compute and the other to accumulate the sum of squares. A two-pass
computation is undesirable for large data sets or when the sample variance
is to be computed as the data is generated. An alternative formula, found
in many statistics textbooks, uses about the same number of operations but
requires only one pass through the data:

    s_n^2 = (1/(n-1)) ( Σ_{i=1}^n x_i^2 - (1/n)(Σ_{i=1}^n x_i)^2 ).        (1.5)

This formula is very poor in the presence of rounding errors because it com-
putes the sample variance as the difference of two positive numbers, and
therefore can suffer severe cancellation that leaves the computed answer dom-
inated by roundoff. In fact, the computed answer can be negative, an event
aptly described by Chan, Golub, and LeVeque [194, 1983] as “a blessing in
disguise since this at least alerts the programmer that disastrous cancella-
tion has occurred”. In contrast, the original formula (1.4) always yields a
very accurate (and nonnegative) answer, unless n is large (see Problem 1.10).
Surprisingly, current calculators from more than one manufacturer (but not
Hewlett-Packard) appear to use the one-pass formula, and they list it in their
manuals.
As an example, if x = [10000, 10001, 10002]^T then, in single precision
arithmetic (u ≈ 6 × 10^-8), the sample variance is computed as 1.0 by the
two-pass formula (relative error 0) but 0.0 by the one-pass formula (relative
error 1). It might be argued that this data should be shifted by some estimate
of the mean before applying the one-pass formula (a shift of the data does
not change s_n^2), but a good estimate is not always available and there
are alternative one-pass algorithms that will always produce an acceptably
accurate answer. For example, instead of accumulating Σ_i x_i and Σ_i x_i^2, we
can accumulate

    M_k = (1/k) Σ_{i=1}^k x_i,        Q_k = Σ_{i=1}^k (x_i - M_k)^2,

which can be done via the updating formulae

    M_1 = x_1,    M_k = M_{k-1} + (x_k - M_{k-1})/k,    k = 2:n,        (1.6a)
    Q_1 = 0,      Q_k = Q_{k-1} + (k-1)(x_k - M_{k-1})^2/k,    k = 2:n,        (1.6b)

after which s_n^2 = Q_n/(n - 1). Note that the only subtractions in these recur-
rences are relatively harmless ones that involve the data xi . For the numerical
example above, (1.6) produces the exact answer. The updating formulae (1.6)
are numerically stable, though their error bound is not as small as the one
for the two-pass formula (it is proportional to the condition number KN in
Problem 1.7).
The problem of computing the sample variance illustrates well how mathe-
matically equivalent formulae can have different numerical stability properties.
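The three computations can be compared directly; the following Python
sketch (ours, not the book's) applies them to the data above in IEEE single
precision:

    import numpy as np

    def var_two_pass(x):
        xbar = sum(x) / len(x)
        return sum((xi - xbar)**2 for xi in x) / (len(x) - 1)

    def var_one_pass(x):
        # Textbook one-pass formula (1.5): subtracts two large, nearly equal sums.
        n = len(x)
        return (sum(xi*xi for xi in x) - sum(x)**2 / n) / (n - 1)

    def var_updating(x):
        # Updating formulae (1.6): M is the running mean, Q the running sum of
        # squared deviations; the only subtractions involve the data itself.
        M, Q = x[0], 0.0
        for k in range(2, len(x) + 1):
            d = x[k-1] - M
            Q = Q + (k - 1) * d * d / k
            M = M + d / k
        return Q / (len(x) - 1)

    x = np.array([10000.0, 10001.0, 10002.0], dtype=np.float32)
    print(var_two_pass(x), var_one_pass(x), var_updating(x))
    # Typical output: 1.0 (two-pass), 0.0 (one-pass), 1.0 (updating);
    # the exact sample variance is 1.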

1.10. Solving Linear Equations


For an approximate solution y to a linear system Ax = b (A ∈ IR^{n×n},
b ∈ IR^n) the forward error is defined as ||x - y||/||x||, for some appropriate norm.

Another measure of the quality of y, more or less important depending on


the circumstances, is the size of the residual r = b - Ay . When the linear
system comes from an interpolation problem, for example, we are probably
more interested in how closely Ay represents b than in the accuracy of y. The
residual is scale dependent: multiply A and b by α and r is multiplied by
α. One way to obtain a scale-independent quantity is to divide by ||A|| ||y||,
yielding the relative residual

    ρ(y) = ||b - Ay|| / (||A|| ||y||).
The importance of the relative residual is explained by the following re-
sult, which was probably first proved by Wilkinson (see the Notes and Ref-
erences). We use the 2-norm, defined by ||x||_2 = (x^T x)^{1/2} and ||A||_2 =
max_{x≠0} ||Ax||_2/||x||_2.
Lemma 1.1. With the notation above, and for the 2-norm,

    ρ(y) = min{ ||∆A||_2/||A||_2 : (A + ∆A)y = b }.

Proof. If (A + ∆A)y = b then r := b - Ay = ∆Ay, so ||r||_2 ≤ ||∆A||_2 ||y||_2,
giving

    ρ(y) = ||r||_2 / (||A||_2 ||y||_2) ≤ ||∆A||_2 / ||A||_2.        (1.7)

On the other hand, (A + ∆A)y = b for ∆A = ry^T/(y^T y) and ||∆A||_2 =
||r||_2/||y||_2, so the bound (1.7) is attainable.
Lemma 1.1 says that ρ(y) measures how much A (but not b) must be
perturbed in order for y to be the exact solution to the perturbed system,
that is, ρ(y) equals a normwise relative backward error. If the data A and b
are uncertain and ρ(y) is no larger than this uncertainty (e.g., ρ(y) = O(u))
then the approximate solution y must be regarded as very satisfactory. For
other problems the backward error may not be as easy to compute as it is for
a general linear system, as we will see for the Sylvester equation (§15.2) and
the least squares problem (§19.7).
To illustrate these concepts we consider two specific linear equation solvers:
Gaussian elimination with partial pivoting (GEPP) and Cramer’s rule.

1.10.1. GEPP Versus Cramer’s Rule


Cramer’s rule says that the components of the solution to a linear system
Ax = b are given by x i = det(A i (b))/det(A), where Ai ( b) denotes A with its
ith column replaced by b. These formulae are a prime example of a method

that is mathematically elegant, but useless for solving practical problems.


The two flaws in Cramer’s rule are its computational expense and its nu-
merical instability. The computational expense needs little comment, and is,
fortunately, explained in most modern linear algebra textbooks (for example,
Strang [961, 1993] cautions the student “it would be crazy to solve equations
that way”). The numerical instability is less well known, but not surprising.
It is present even for n = 2, as a numerical example shows.
We formed a 2 × 2 system Ax = b with condition number κ_2(A) =
||A||_2 ||A^{-1}||_2 ≈ 10^13, and solved the system by both Cramer’s rule and GEPP
in MATLAB (unit roundoff u ≈ 1.1 × 10^-16). The results were as follows,
where r = b - Ax̂:

                Cramer’s rule                        GEPP

    1.0000    1.5075 × 10^-7           1.0002    -4.5689 × 10^-17
    2.0001    1.9285 × 10^-7           2.0004    -2.1931 × 10^-17

The scaled residual for GEPP is pleasantly small-of order the unit round-
off. That for Cramer’s rule is ten orders of magnitude larger, showing that the
computed solution from Cramer’s rule does not closely satisfy the equations,
or, equivalently, does not solve a nearby system. The solutions themselves are
similar, both being accurate to three significant figures in each component but
incorrect in the fourth significant figure. This is the accuracy we would expect
from GEPP because of the rule of thumb “forward error backward error ×
condition number”. That Cramer’s rule is as accurate as GEPP in this ex-
ample, despite its large residual, is perhaps surprising, but it is explained by
the fact that Cramer’s rule is forward stable for n = 2; see Problem 1.9. For
general n, the accuracy and stability of Cramer’s rule depend on the method
used to evaluate the determinants, and satisfactory bounds are not known
even for the case where the determinants are evaluated by GEPP.
The small residual produced by GEPP in this example is typical: error
analysis shows that GEPP is guaranteed to produce a relative residual of
order u when n = 2 (see §9.2). To see how remarkable a property this is,
consider the rounded version of the exact solution: z = fl(x) = x + ∆x,
where ||∆x||_2 ≤ u||x||_2. The residual of z satisfies ||b - Az||_2 = ||-A∆x||_2 ≤
u||A||_2 ||x||_2 ≈ u||A||_2 ||z||_2. Thus the computed solution from GEPP has about
as small a residual as the rounded exact solution, irrespective of its accuracy.
Expressed another way, the errors in GEPP are highly correlated so as to
produce a small residual. To emphasize this point, the vector [1.0006,2.0012],
which agrees with the exact solution of the above problem to five significant
figures (and therefore is more accurate than the solution produced by GEPP),
has a relative residual of order 10^-6.
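The contrast between a large residual and a respectable forward error is easy
to reproduce. The sketch below (Python; the 2 × 2 system is our own
ill-conditioned example, not the one used in the text) solves the same system
by Cramer's rule and by LU factorization with partial pivoting (which is what
numpy.linalg.solve uses), and prints the relative residuals:

    import numpy as np

    # Two nearly parallel rows give a condition number of about 1e12.
    A = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-12]])
    b = A @ np.array([1.0, 1.0])

    # Cramer's rule.
    det = A[0, 0]*A[1, 1] - A[0, 1]*A[1, 0]
    x_cramer = np.array([(b[0]*A[1, 1] - A[0, 1]*b[1]) / det,
                         (A[0, 0]*b[1] - b[0]*A[1, 0]) / det])

    # GEPP via LAPACK.
    x_gepp = np.linalg.solve(A, b)

    def rel_residual(x):
        return np.linalg.norm(b - A @ x) / (np.linalg.norm(A, 2) * np.linalg.norm(x))

    print(rel_residual(x_cramer), rel_residual(x_gepp))
    # The residual from Cramer's rule is typically many orders of magnitude
    # larger than the one from GEPP, which is of order the unit roundoff.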

Table 1.1. Computed approximations ê = fl((1 + 1/n)^n) to e = 2.71828. . . .

      n          ê            |ê - e|
    10^1      2.593743     1.25 × 10^-1
    10^2      2.704811     1.35 × 10^-2
    10^3      2.717051     1.23 × 10^-3
    10^4      2.718597     3.15 × 10^-4
    10^5      2.721962     3.68 × 10^-3
    10^6      2.595227     1.23 × 10^-1
    10^7      3.293968     5.76 × 10^-1

1.11. Accumulation of Rounding Errors

Since the first electronic computers were developed in the 1940s, comments
along the following lines have often been made: “The enormous speed of
current machines means that in a typical problem many millions of floating
point operations are performed. This in turn means that rounding errors can
potentially accumulate in a disastrous way.” This sentiment is true, but mis-
leading. Most often, instability is caused not by the accumulation of millions
of rounding errors. but by the insidious growth of just a few rounding errors.
As an example, let us approximate e = exp(1) by taking finite n in the
definition e := lim_{n→∞} (1 + 1/n)^n. Table 1.1 gives results computed in For-
tran 90 in single precision (u ≈ 6 × 10^-8).
The approximations are poor, degrading as n approaches the reciprocal
of the machine precision. For n a power of 10, l/n has a nonterminating
binary expansion. When 1+1/n is formed for n a large power of 10, only
a few significant digits from l/n are retained in the sum. The subsequent
exponentiation to the power n, even if done exactly, must produce an inaccu-
rate approximation to e (indeed, doing the exponentiation in double precision
does not change any of the numbers shown in Table 1.1). Therefore a single
rounding error is responsible for the poor results in Table 1.1.
There is a way to compute (1+1/n) n more accurately, using only single
precision arithmetic; it is the subject of Problem 1.5.
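The degradation is easy to observe in any single precision arithmetic. The
following sketch (Python with numpy's float32; ours, not the book's Fortran 90
program, so the digits will not match Table 1.1 exactly) shows the same
behaviour:

    import numpy as np

    e = np.exp(1.0)
    for k in range(1, 8):
        n = np.float32(10.0)**k
        # 1 + 1/n is formed in single precision, so most of the digits of 1/n
        # are lost; the subsequent exponentiation cannot recover them.
        approx = (np.float32(1.0) + np.float32(1.0)/n)**n
        print(10**k, approx, abs(float(approx) - e))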
Strassen’s method for fast matrix multiplication provides another exam-
ple of the unpredictable relation between the number of arithmetic operations
and the error. If we evaluate fl(AB) by Strassen’s method, for n×n matrices
A and B, and we look at the error as a function of the recursion threshold
n_0 ≤ n, we find that while the number of operations decreases as n_0 decreases
from n to 8, the error typically increases; see §22.2.2.

1.12. Instability Without Cancellation


It is tempting to assume that calculations free from subtractive cancellation
must be accurate and stable, especially if they involve only a small number
of operations. The three examples in this section show the fallacy of this
assumption.

1.12.1. The Need for Pivoting


Suppose we wish to compute an LU factorization

    A = [ ε   1 ]  =  [  1    0 ] [ u11  u12 ]  =  LU,      0 < ε << 1.
        [ 1   1 ]     [ l21   1 ] [  0   u22 ]

Clearly, u11 = ε, u12 = 1, l21 = 1/ε, and u22 = 1 - l21 u12 = 1 - 1/ε.
In floating point arithmetic, if ε is sufficiently small then u22 = fl(1 - 1/ε)
evaluates to -1/ε. Assuming l21 is computed exactly, we then have

    L̂Û = [  1    0 ] [ ε    1  ]  =  [ ε  1 ],      L̂Û - A = [ 0   0 ].
          [ 1/ε  1 ] [ 0  -1/ε ]     [ 1  0 ]                 [ 0  -1 ]

Thus the computed LU factors fail completely to reproduce A. Notice that
there is no subtraction in the formation of L and U. Furthermore, the matrix
A is very well conditioned (κ_∞(A) ≈ 4). The problem, of course, is with
the choice of ε as the pivot. The partial pivoting strategy would interchange
the two rows of A before factorizing it, resulting in a stable factorization.
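The same 2 × 2 failure can be seen numerically. In the Python sketch below
(ours) the factorization is computed explicitly in IEEE double precision with
ε = 10^-20:

    import numpy as np

    eps = 1e-20
    A = np.array([[eps, 1.0],
                  [1.0, 1.0]])

    l21 = A[1, 0] / A[0, 0]          # 1/eps = 1e20
    u11, u12 = A[0, 0], A[0, 1]
    u22 = A[1, 1] - l21 * u12        # fl(1 - 1e20) evaluates to -1e20

    L = np.array([[1.0, 0.0], [l21, 1.0]])
    U = np.array([[u11, u12], [0.0, u22]])

    print(L @ U)        # [[1e-20, 1], [1, 0]]: the (2,2) entry of A is lost
    print(L @ U - A)    # the backward error is of order 1, despite A being
                        # very well conditioned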

1.12.2. An Innocuous Calculation?


For any x > 0 the following computation leaves x unchanged:

    for i = 1:60
        x = sqrt(x)
    end
    for i = 1:60
        x = x^2
    end

Since the computation involves no subtractions and all the intermediate num-
bers lie between 1 and x, we might expect it to return an accurate approxi-
mation to x in floating point arithmetic.
On the HP 48G calculator, starting with x = 100 the algorithm produces
x = 1.0. In fact, for any x, the calculator computes, in place of f(x) = x, the
function

    f̂(x) = 0  for 0 < x < 1,      f̂(x) = 1  for x ≥ 1.

The calculator is producing a completely inaccurate approximation to f(x) in


just 120 operations on nonnegative numbers. How can this happen?
The positive numbers x representable on the HP 48G satisfy 10^-499 ≤ x ≤
9.999. . . × 10^499. If we define r(x) = x^(1/2^60) then, for any machine number
x > 1,

    1 < r(x) ≤ (10^500)^(1/2^60) < 1 + 10^-14,

which rounds to 1, since the HP 48G works to about 12 decimal digits. Thus
for x > 1, the repeated square roots reduce x to 1.0, which the squarings leave
unchanged.
For 0 < x < 1 we have

    x ≤ 0.999 999 999 999    (twelve 9s)

on a 12-digit calculator, so we would expect the square root to satisfy

    √x ≤ √0.999999999999 = 0.9999999999994999. . . .

This upper bound rounds to the 12 significant digit number 0.99. . .9. Hence
after the 60 square roots we have on the calculator a number x ≤ 0.99. . .9.
The 60 squarings are represented by s(x) = x^(2^60), and

    s(0.99. . .9) = (1 - 10^-12)^(2^60) ≈ 10^-500700.
Because it is smaller than the smallest positive representable number, this


result is set to zero on the calculator--a process known as underflow. (The
converse situation, in which a result exceeds the largest representable number,
is called overflow.)
The conclusion is that there is nothing wrong with the calculator. This
innocuous-looking calculation simply exhausts the precision and range of a
machine with 12 digits of precision and a 3-digit exponent.
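IEEE double precision arithmetic has a far wider exponent range and more
digits than the HP 48G, but the same calculation collapses there too, as the
following Python sketch (ours) shows:

    import math

    def sqrt_then_square(x, n=60):
        for _ in range(n):
            x = math.sqrt(x)
        for _ in range(n):
            x = x * x
        return x

    # For x > 1 the repeated square roots are eventually rounded to exactly 1.0,
    # which the squarings leave unchanged.  For 0 < x < 1 the square roots stall
    # at the largest double below 1, and the squarings then drive that value
    # down to a tiny number (no underflow to zero here, because the exponent
    # range of IEEE double precision is much wider than the calculator's).
    print(sqrt_then_square(100.0))    # 1.0, not 100
    print(sqrt_then_square(0.5))      # a tiny positive number, nowhere near 0.5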

1.12.3. An Infinite Sum


It is well known that Σ_{k=1}^∞ 1/k^2 = π^2/6 = 1.6449 3406 6848. . . . Suppose we
were not aware of this identity and wished to approximate the sum numeri-
cally. The most obvious strategy is to evaluate the sum for increasing k until

the computed sum does not change. In Fortran 90 in single precision this
yields the value 1.6447 2532, which is first attained at k = 4096. This agrees
with the exact infinite sum to just four significant digits out of a possible nine.
The explanation for the poor accuracy is that we are summing the numbers
from largest to smallest, and the small numbers are unable to contribute to
the sum. For k = 4096 we are forming s + 4096^-2 = s + 2^-24, where s ≈ 1.6.
Single precision corresponds to a 24-bit mantissa, so the term we are adding
to s “drops off the end” of the computer word, as do all successive terms.
The simplest cure for this inaccuracy is to sum in the opposite order: from
smallest to largest. Unfortunately, this requires knowledge of how many terms
to take before the summation begins. With 10^9 terms we obtain the computed
sum 1.6449 3406, which is correct to eight significant digits.
For much more on summation, see Chapter 4.
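The effect of the summation order can be demonstrated with a more modest
number of terms. The Python sketch below (ours) sums 10^5 terms in numpy
single precision, first from largest to smallest and then in the reverse order;
with this many terms even the smallest-first sum is limited by truncation of
the series, but the ordering effect is already clear:

    import numpy as np

    s_fwd = np.float32(0.0)
    for k in range(1, 10**5 + 1):          # largest terms first
        s_fwd += np.float32(1.0) / np.float32(k)**2

    s_bwd = np.float32(0.0)
    for k in range(10**5, 0, -1):          # smallest terms first
        s_bwd += np.float32(1.0) / np.float32(k)**2

    print(np.pi**2 / 6, s_fwd, s_bwd)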

1.13. Increasing the Precision


When the only source of errors is rounding, a common technique for estimating
the accuracy of an answer is to recompute it at a higher precision and to see
how many digits of the original and the (presumably) more accurate answer
agree. We would intuitively expect any desired accuracy to be achievable by
computing at a high enough precision. This is certainly the case for algorithms
possessing an error bound proportional to the precision, which includes all the
algorithms described in the subsequent chapters of this book. However, since
an error bound is not necessarily attained, there is no guarantee that a result
computed in t digit precision will be more accurate than one computed in
s digit precision, for a given t > s; in particular, for a very ill conditioned
problem both results could have no correct digits.
For illustration, consider the system Ax = b, where A is the inverse of the
5×5 Hilbert matrix and b_i = (-1)^i i. (For details of the matrices used in
this experiment see Chapter 26.) We solved the system in varying precisions
with unit roundoffs u = 2^-t, t = 15:40, corresponding to about 4 to 12
decimal places of accuracy. (This was accomplished in MATLAB by using
the function chop from the Test Matrix Toolbox to round the result of every
arithmetic operation to t bits; see Appendix E.) The algorithm used was
Gaussian elimination (without pivoting), which is perfectly stable for this
symmetric positive definite matrix. The upper plot of Figure 1.3 shows t
against the relative errors ||x - x̂||/||x|| and the relative residuals
||b - Ax̂||/(||A|| ||x̂||). The lower plot of Figure 1.3 gives corresponding results
for A = P5 + 5I, where P5 is the Pascal matrix of order 5. The condition
numbers κ(A) are 1.62×10^2 for the inverse Hilbert matrix and 9.55×10^5 for
the shifted Pascal matrix. In both cases the general trend is that increasing
the precision decreases the residual and relative error, but the behaviour is

Figure 1.3. Forward errors ||x - x̂||/||x|| and relative residuals ||b - Ax̂||/(||A|| ||x̂||)
versus precision t = -log2 u on the x axis.

not monotonic. The reason for the pronounced oscillating behaviour of the
relative error (but not the residual) for the inverse Hilbert matrix is not clear.
An example in which increasing the precision by several bits does not
improve the accuracy is the evaluation of

    y = x + a sin(bx),    a = 10^-8,    b = 2^24.        (1.8)

Figure 1.4 plots t versus the absolute error, for precisions u = 2^-t, t = 10:40.
Since a sin(bx) ≈ -8.55×10^-9, for t less than about 20 the error is dominated
by the error in representing x = 1/7. For 22 < t < 31 the accuracy is (exactly)
constant! The plateau over the range 22 < t < 31 is caused by a fortuitous
rounding error in the addition: in the binary representation of the exact
answer the 23rd to 32nd digits are 1s, and in the range of t of interest the
final rounding produces a number with a 1 in the 22nd bit and zeros beyond,
yielding an unexpectedly small error that affects only bits 33rd onwards.
Figure 1.4. Absolute error versus precision, t = -log2 u.

A more contrived example in which increasing the precision has no bene-
ficial effect on the accuracy is the following evaluation of z = f(x):

    y = abs(3(x - 0.5) - 0.5)/25
    if y = 0
        z = 1
    else
        z = e^y    % Store to inhibit extended precision evaluation.
        z = (z - 1)/y
    end

In exact arithmetic, z = f(2/3) = 1, but in Fortran 90 on a Sun SPARCsta-


tion and on a 486DX workstation, = fl(f(2/3)) = 0.0 in both single and
double precision arithmetic. A further example is provided by the “innocu-
ous calculation” of §1.12.2, in which a step function is computed in place of
f(x) = x for a wide range of precisions.
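The f(2/3) example needs no special software to reproduce. In IEEE double
precision (Python sketch, ours) the computed value is again 0.0:

    import math

    def f(x):
        y = abs(3.0*(x - 0.5) - 0.5) / 25.0
        if y == 0.0:
            return 1.0
        z = math.exp(y)          # for x = fl(2/3), y is tiny but nonzero,
        return (z - 1.0) / y     # exp(y) rounds to 1, and the quotient is 0

    print(f(2.0/3.0))    # prints 0.0, although f(2/3) = 1 in exact arithmetic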
It is worth stressing that how precision is increased can greatly affect
the results obtained. Increasing the precision without preserving important
properties such as monotonicity of rounding can vitiate an otherwise reliable
algorithm. Increasing the precision without maintaining a correct relationship
among the precisions in different parts of an algorithm can also be harmful
to the accuracy.

1.14. Cancellation of Rounding Errors


It is not unusual for rounding errors to cancel in stable algorithms, with the
result that the final computed answer is much more accurate than the inter-

mediate quantities. This phenomenon is not universally appreciated, perhaps


because we tend to look at the intermediate numbers in an algorithm only
when something is wrong, not when the computed answer is satisfactory. We
describe two examples. The first is a very short and rather unusual com-
putation, while the second involves a well-known algorithm for computing a
standard matrix decomposition.

1.14.1. Computing (e^x - 1)/x


Consider the function f(x) = (e^x - 1)/x = Σ_{i=0}^∞ x^i/(i + 1)!, which arises in
various applications. The obvious way to evaluate f is via the algorithm

    % Algorithm 1.
    if x = 0
        f = 1
    else
        f = (e^x - 1)/x
    end

This algorithm suffers severe cancellation for |x | << 1, causing it to produce an


inaccurate answer (0 instead of 1, if x is small enough) . Here is an alternative:

    % Algorithm 2.
    y = e^x
    if y = 1
        f = 1
    else
        f = (y - 1)/log y
    end

At first sight this algorithm seems perverse, since it evaluates both exp and
log instead of just exp. Some results computed in MATLAB are shown in
Table 1.2. All the results for Algorithm 2 are correct in all the significant
figures shown, except for x = 10^-15, when the last digit should be 1. On the
other hand, Algorithm 1 returns answers that become less and less accurate
as x decreases.
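The two algorithms are easily compared in any floating point arithmetic. The
Python sketch below (ours) reproduces the flavour of Table 1.2 in IEEE double
precision, using the library function expm1 as a reference:

    import math

    def alg1(x):
        # Algorithm 1: ruined by cancellation in e^x - 1 for small x.
        return 1.0 if x == 0.0 else (math.exp(x) - 1.0) / x

    def alg2(x):
        # Algorithm 2: the errors in y - 1 and log y cancel in the division.
        y = math.exp(x)
        return 1.0 if y == 1.0 else (y - 1.0) / math.log(y)

    for k in range(5, 17):
        x = 10.0**(-k)
        ref = math.expm1(x) / x
        print(f"1e-{k:02d}  {alg1(x):.15f}  {alg2(x):.15f}  {ref:.15f}")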
To gain insight we look at the numbers in a particular computation with
x = 9 × 10^-8 and u = 2^-24 ≈ 6 × 10^-8, for which the correct answer is
1.00000005 to the significant digits shown. For Algorithm 1 we obtain a
completely inaccurate result, as expected:

Table 1.2. Computed values of (e^x - 1)/x from Algorithms 1 and 2.

       x          Algorithm 1                   Algorithm 2
     10^-5        1.000005000006965             1.000005000016667
     10^-6        1.000000499962184             1.000000500000167
     10^-7        1.000000049433680             1.000000050000002
     10^-8        9.999999939225290 × 10^-1     1.000000005000000
     10^-9        1.000000082740371             1.000000000500000
     10^-10       1.000000082740371             1.000000000050000
     10^-11       1.000000082740371             1.000000000005000
     10^-12       1.000088900582341             1.000000000000500
     10^-13       9.992007221626408 × 10^-1     1.000000000000050
     10^-14       9.992007221626408 × 10^-1     1.000000000000005
     10^-15       1.110223024625156             1.000000000000000
     10^-16       0                             1.000000000000000

Algorithm 2 produces a result correct in all but the last digit:

Here are the quantities that would be obtained by Algorithm 2 in exact arith-
metic (correct to the significant digits shown):

We see that Algorithm 2 obtains very inaccurate values of ex - 1 and log e x,


but the ratio of the two quantities it computes is very accurate. Conclusion:
errors cancel in the division in Algorithm 2.
A short error analysis explains this striking cancellation of errors. We
assume that the exp and log functions are both computed with a relative error
not exceeding the unit roundoff u. The algorithm first computes ŷ = e^x(1 + δ),
|δ| ≤ u. If ŷ = 1 then e^x(1 + δ) = 1, so

    x = -log(1 + δ) = -δ + δ^2/2 - δ^3/3 + . . . ,    |δ| ≤ u,

which implies that the correctly rounded value of f(x) = 1 + x/2 + x^2/6 + . . .
is 1, and so f has been evaluated correctly, to the working precision. If ŷ ≠ 1
then^7, using (1.1),

    (1.9)
7. The analysis from this point on assumes the use of a guard digit in subtraction (see
§2.4); without a guard digit Algorithm 2 is not highly accurate.

where |ε_i| ≤ u, i = 1:3. Defining υ = ŷ - 1, we have

For small x (ŷ ≈ 1),

From (1.9) it follows that f̂ approximates f with relative error at most about
3.5u.
The details of the analysis obscure the crucial property that ensures its
success. For small x, neither ŷ - 1 nor log ŷ agrees with its exact arithmetic
counterpart to high accuracy. But (ŷ - 1)/log ŷ is an extremely good approx-
imation to (y - 1)/log y near y = 1, because the function g(y) = (y - 1)/log y
varies so slowly there (g has a removable singularity at 1, with g(1) = 1 and
g'(1) = 1/2). In other words, the errors in ŷ - 1 and log ŷ almost completely cancel.

1.14.2. QR Factorization
Any matrix A ∈ IR^{m×n}, m ≥ n, has a QR factorization A = QR, where Q ∈
IR^{m×n} has orthonormal columns and R ∈ IR^{n×n} is upper trapezoidal (r_ij = 0
for i > j). One way of computing the QR factorization is to premultiply A by
a sequence of Givens rotations--orthogonal matrices G that differ from the
identity matrix only in a 2×2 principal submatrix, which has the form

    [  cos θ   sin θ ]
    [ -sin θ   cos θ ].
With A_1 := A, a sequence of matrices A_k satisfying A_k = G_k A_{k-1} is gen-
erated. Each A_k has one more zero than the last, so A_p = R for p =
n(m - (n + 1)/2). To be specific, we will assume that the zeros are intro-
duced in the order (n, 1), (n - 1, 1), . . . , (2, 1); (n, 2), . . . , (3, 2); and so on.
For a particular 10×6 matrix A, Figure 1.5 plots the relative errors
||A_k - Â_k||_2/||A||_2, where Â_k denotes the matrix computed in single precision
arithmetic (u ≈ 6 × 10^-8). We see that many of the intermediate matrices are
very inaccurate, but the final computed R̂ has an acceptably small relative
error, of order u. Clearly, there is heavy cancellation of errors on the last
few stages of the computation. This matrix A ∈ IR^{10×6} was specially chosen,

Figure 1.5. Relative errors ||A_k - Â_k||_2/||A||_2 for Givens QR factorization. The
dotted line is the unit roundoff level.

following a suggestion of Wilkinson [1100, 1985], as a full matrix such that


||A||_2 ≈ 1 and A_10 has the form

Because y is at the roundoff level, the computed ŷ is the result of severe sub-
tractive cancellation and so is dominated by rounding errors. Consequently,
the computed Givens rotations Ĝ_11, . . . , whose purpose is to zero the
vector ŷ and which are determined by ratios involving the elements of ŷ, bear
little relation to their exact counterparts, causing Â_k to differ greatly from
A_k for k = 11, 12, . . . .
To shed further light on this behaviour, we note that the Givens QR fac-
torization is perfectly backward stable; that is, the computed R̂ is the exact
R factor of A + ∆A, where ||∆A||_2 ≤ cu||A||_2, with c a modest constant de-
pending on the dimensions (Theorem 18.9). By invoking a perturbation result
for the QR factorization (namely (18.27)) we conclude that ||R - R̂||_2/||R||_2 is
bounded by a multiple of κ_2(A)u. Our example is constructed so that κ_2(A) is
small (≈ 24), so we know a priori that the graph in Figure 1.5 must eventually
dip down to the unit roundoff level.

We also note that ||Q - Q̂||_2 is of order u in this example, as again we can
show it must be from perturbation theory. Since Q is a product of Givens
rotations, this means that even though some of the intermediate Givens rota-
tions are very inaccurate, their product is highly accurate, so in the formation
of Q, too, there is extensive cancellation of rounding errors.

1.15. Rounding Errors Can Be Beneficial


An old method for computing the largest eigenvalue (in absolute value) of
a matrix A and the corresponding eigenvector is the power method, which
consists of repeatedly multiplying a given starting vector by A. With scaling
to avoid underflow and overflow, the process in its simplest form is

    % Choose a starting vector x.
    while not converged
        x := Ax
        x := x / ||x||_∞
    end

The theory says that if A has a unique eigenvalue of largest modulus and x
is not deficient in the direction of the corresponding eigenvector υ, then the
power method converges to a multiple of υ (at a linear rate).
Consider the matrix

    A = [  0.4  -0.6   0.2 ]
        [ -0.3   0.7  -0.4 ]
        [ -0.1  -0.4   0.5 ]
which has eigenvalues 0, 0.4394 and 1.161 (correct to the digits shown) and an
eigenvector [1, 1, 1]^T corresponding to the eigenvalue zero. If we take [1, 1, 1]^T
as the starting vector for the power method then, in principle, the zero vector
is produced in one step, and we obtain no indication of the desired dominant
eigenvalue-eigenvector pair. However, when we carry out the computation in
MATLAB, the first step produces a vector with elements of order 10^-16 and
we obtain after 38 iterations a good approximation to the dominant eigen-
pair. The explanation is that the matrix A cannot be stored exactly in bi-
nary floating point arithmetic. The computer actually works with A + ∆A
for a tiny perturbation ∆A, and the dominant eigenvalue and eigenvector of
A + ∆A are very good approximations to those of A. The starting vector
[1, 1, 1]^T contains a nonzero (though tiny) component of the dominant eigen-
vector of A + ∆A. This component grows rapidly under multiplication by
A + ∆A, helped by rounding errors in the multiplication, until convergence
to the dominant eigenvector is obtained.
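A Python sketch of this experiment, using the 3 × 3 matrix above (the text's
computation was done in MATLAB), behaves in the same way:

    import numpy as np

    A = np.array([[ 0.4, -0.6,  0.2],
                  [-0.3,  0.7, -0.4],
                  [-0.1, -0.4,  0.5]])

    x = np.array([1.0, 1.0, 1.0])     # eigenvector of A for the eigenvalue 0

    for i in range(50):
        y = A @ x                     # in exact arithmetic the first product is zero
        lam = np.linalg.norm(y, np.inf)
        x = y / lam                   # scale to avoid underflow and overflow

    # Rounding errors inject a small component of the dominant eigenvector of
    # the stored matrix, which then grows until the iteration converges.
    print(lam)    # approximately 1.161
    print(x)      # approximately the dominant eigenvector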

Perhaps an even more striking example of beneficial effects of rounding


errors is in inverse iteration, which is just the power method applied to the
shifted and inverted matrix (A - µI )-1. The shift µ is usually an approximate
eigenvalue. The closer µ is to an eigenvalue, the more nearly singular A - µI
is, and hence the larger the error in computing y = (A - µI)-1 x (which is done
by solving (A-µ I)y = x). However, it can be shown that the error lies almost
entirely in the direction of the required eigenvector, and so is harmless; see, for
example, Parlett [820, 1980, §4.31] or Golub and Van Loan [470, 1989, §7.6.1].

1.16. Stability of an Algorithm Depends on the Problem


An algorithm can be stable as a means for solving one problem but unsta-
ble when applied to another problem. One example is the modified Gram-
Schmidt method, which is stable when used to solve the least squares problem
but can give poor results when used to compute an orthonormal basis of a
matrix (see §§518.7 and 19.3).
A lesser known and much simpler example is Gaussian elimination (GE)
without pivoting for computing the determinant of an upper Hessenberg ma-
trix. A square matrix A is upper Hessenberg if a_ij = 0 for i > j + 1. GE
transforms A to upper triangular form by n - 1 row eliminations, one for each
of the boxed entries in this 4 × 4 illustration:

The determinant of A is given by the product of the diagonal elements of U.


It is easy to show that this is a stable way to evaluate det(A), even though
arbitrarily large multipliers may arise during the elimination. Note, first, that,
if A^(k) denotes the matrix at the start of the kth stage (A^(1) = A), then

because the kth row of A^(k-1) is the same as the kth row of A. In floating
point arithmetic the model (1.1) shows that the computed û_kk satisfy

Table 1.3. Results from GE without pivoting on an upper Hessenberg matrix.

Exact Computed Relative error

where |θ_i| ≤ u, i = 1:3. This equation says that the computed diagonal
elements û_kk are the exact diagonal elements corresponding not to A, but to
a matrix obtained from A by changing the diagonal elements to a_kk(1 + θ_1)
and the subdiagonal elements to perturbed values a_{k,k-1}(1 + θ'_k). In other
words, the computed û_kk are exact for a matrix differing negligibly from A.
The computed determinant d̂, which is given by

    d̂ = fl(û_11 û_22 . . . û_nn),
is therefore a tiny relative perturbation of the determinant of a matrix differing


negligibly from A. so this method for evaluating det(A) is numerically stable
(in the mixed forward backward error sense of (1.2)).
However, if we use GE without pivoting to solve an upper Hessenberg
linear system then large multipliers can cause the solution process to be un-
stable. If we try to extend the analysis above we find that the computed LU
factors (as opposed to just the diagonal of U) do not, as a whole, necessarily
correspond to a small perturbation of A.
A numerical example illustrates these ideas. Let

We took a = 10^-7 and b = Ae (e = [1, 1, 1, 1]^T) and used GE without pivoting
in single precision arithmetic (u ≈ 6 × 10^-8) to solve Ax = b and compute
det(A). The computed and exact answers are shown to five significant figures
det(A). The computed and exact answers are shown to five significant figures
in Table 1.3. Not surprisingly, the computed determinant is very accurate.
But the computed solution to Ax = b has no correct figures in its first com-
ponent. This reflects instability of the algorithm rather than ill conditioning
of the problem, because the condition number is only 16. The source of the
instability is the large first multiplier, a_21/a_11 = 10^7.
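The phenomenon can be reproduced with a few lines of code. The Python
sketch below (ours; the Hessenberg matrix is an illustrative one of the same
flavour, since the matrix used in the text is not reproduced here) applies GE
without pivoting in IEEE double precision with a tiny (1,1) element:

    import numpy as np

    def ge_no_pivot(A, b):
        # Gaussian elimination without pivoting; returns the solution and the
        # determinant (the product of the diagonal of U).
        A, b = A.astype(float).copy(), b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):
            for i in range(k + 1, n):
                if A[i, k] != 0.0:
                    m = A[i, k] / A[k, k]
                    A[i, k:] -= m * A[k, k:]
                    b[i] -= m * b[k]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        return x, np.prod(np.diag(A))

    a = 1e-16
    A = np.array([[  a, 1.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0, 1.0],
                  [0.0, 1.0, 2.0, 1.0],
                  [0.0, 0.0, 1.0, 2.0]])
    b = A @ np.ones(4)               # exact solution is [1, 1, 1, 1]

    x, det = ge_no_pivot(A, b)
    print(det, 5*a - 2.0)    # the computed and exact determinants agree closely
    print(x)                 # but x[0] is nowhere near 1: the large multiplier
                             # 1/a has destroyed the accuracy of the solution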

Figure 1.6. Values of rational function r(x) computed by Horner’s rule (marked as
“×”), for x = 1.606 + (k - 1)2^-52; solid line is the “exact” r(x).

1.17. Rounding Errors Are Not Random

Rounding errors, and their accumulated effect on a computation, are not


random. This fact underlies the success of many computations, including some
of those described earlier in this chapter. The validity of statistical analysis of
rounding errors is discussed in §2.6. Here we simply give a revealing numerical
example (due to W. Kahan).
Define the rational function

which is expressed in a form corresponding to evaluation of the quartic poly-


nomials in the numerator and denominator by Horner’s rule. We evaluated
r(x) by Horner’s rule in double precision arithmetic for 361 consecutive float-
ing point numbers starting with a = 1.606, namely x = a + (k - 1)2^-52,
k = 1:361; the function r(x) is virtually constant on this interval. Figure 1.6
plots the computed function values together with a much more accurate ap-
proximation to r(x) (computed from a continued fraction representation).
The striking pattern formed by the values computed by Horner’s rule shows
clearly that the rounding errors in this example are not random.
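The experiment is easy to repeat. The Python sketch below is ours: the
definition of r(x) is not reproduced in this copy of the text, so an illustrative
ratio of two quartics is used instead, evaluated by Horner's rule at 361
consecutive double precision numbers starting at 1.606:

    import numpy as np

    def horner(coeffs, x):
        # Evaluate a polynomial (coefficients given from highest degree down)
        # by Horner's rule.
        p = np.full_like(x, coeffs[0])
        for c in coeffs[1:]:
            p = p * x + c
        return p

    # 361 consecutive IEEE double precision numbers: the spacing of doubles
    # in [1, 2) is exactly 2^-52.
    x = 1.606 + np.arange(361) * 2.0**-52

    num = horner([4.0, -59.0, 324.0, -751.0, 622.0], x)    # illustrative quartic
    den = horner([1.0, -14.0, 72.0, -151.0, 112.0], x)     # illustrative quartic
    r = num / den

    print(r.min(), r.max())
    # The variation in the computed values over this tiny interval is dominated
    # by rounding error, and a plot of r against x shows a regular, decidedly
    # non-random pattern (cf. Figure 1.6).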

1.18. Designing Stable Algorithms


There is no simple recipe for designing numerically stable algorithms. While
this helps to keep numerical analysts in business (even in proving each other’s
algorithms to be unstable!) it is not good news for computational scientists
in general. The best advice is to be aware of the need for numerical stability
when designing an algorithm and not to concentrate solely on other issues,
such as computational cost and parallelizability.
A few guidelines can be given.

1. Try to avoid subtracting quantities contaminated by error (though such


subtractions may be unavoidable).
2. Minimize the size of intermediate quantities relative to the final solu-
tion. The reason is that if intermediate quantities are very large then
the final answer may be the result of damaging subtractive cancella-
tion. Looked at another way, large intermediate numbers swamp the
initial data, resulting in loss of information. The classic example of an
algorithm where this consideration is important is Gaussian elimination
(§9.2), but an even simpler one is recursive summation (§4.2).
3. Look for different formulations of a computation that are mathemati-
cally but not numerically equivalent. For example, the classical Gram-
Schmidt method is unstable, but a trivial modification produces the
stable modified Gram-Schmidt (MGS) method (§18.7). There are two
ways of using the MGS method to solve a least squares problem, the
more obvious of which is unstable (§19.3).
4. It is advantageous to express update formulae as

new-value = old-value + small-correction

if the small correction can be computed with many correct significant


figures. Numerical methods are often naturally expressed in this form;
examples include methods for solving ordinary differential equations,
where the correction is proportional to a step size, and Newton’s method
for solving a nonlinear system. A classic example of the use of this
update strategy is in iterative refinement for improving the computed
solution to a linear system Ax = b, in which by computing residuals
r = b - Ay in extended precision and solving update equations that
have the residuals as right-hand sides a highly accurate solution can be
computed: see Chapter 11. For another example (in which the correction
is not necessarily small), see Problem 2.8.
5. Use only well-conditioned transformations of the problem. In matrix
computations this amounts to multiplying by orthogonal matrices where

possible, instead of nonorthogonal, and possibly, ill-conditioned matri-


ces. See §6.2 for a simple explanation of this advice in terms of norms.

6. Take precautions to avoid unnecessary overflow and underflow (see §25.8).

Concerning the second point, good advice is to look at the numbers gen-
erated during a computation. This was common practice in the early days
of electronic computing. On some machines it was unavoidable because the
contents of the store were displayed on lights or monitor tubes! Wilkinson
gained much insight into numerical stability by inspecting the progress of an
algorithm, and sometimes altering its course (for an iterative process with
parameters): “Speaking for myself I gained a great deal of experience from
user participation, and it was this that led to my own conversion to backward
error analysis” [1099, 1980, pp. 112-113] (see also [1083, 1955]). It is ironic
that with the wealth of facilities we now have for tracking the progress of nu-
merical algorithms (multiple windows in colour, graphical tools, fast printers)
we often glean less than Wilkinson and his co-workers did from mere paper
tape and lights.

1.19. Misconceptions

Several common misconceptions and myths have been dispelled in this chapter
(none of them for the first time-see the Notes and References). We highlight
them in the following list.

1. Cancellation in the subtraction of two nearly equal numbers is always a


bad thing (§1.7).

2. Rounding errors can overwhelm a computation only if vast numbers of


them accumulate (§1.11).

3. A short computation free from cancellation, underflow, and overflow


must be accurate (§1.12).

4. Increasing the precision at which a computation is performed increases


the accuracy of the answer (§1.13).

5. The final computed answer from an algorithm cannot be more accurate


than any of the intermediate quantities, that is, errors cannot cancel
(§1.14).

6. Rounding errors can only hinder, not help, the success of a computation
(§1.15).

1.20. Rounding Errors in Numerical Analysis


Inevitably, much of this book is concerned with numerical linear algebra, be-
cause this is the area of numerical analysis in which the effects of rounding
errors are most important and have been most studied. In nonlinear prob-
lems rounding errors are often not a major concern because other forms of
error dominate. Although we give examples for numerical methods involving
derivatives and integrals (for example, Euler’s method in §4.3 and quadrature
in Problem 3.12), it is beyond our scope to give a treatment of the effects and
influence of rounding errors on these methods. We do, however, give some
selected references to the literature, grouped by subject area.

Approximation theory: Clenshaw [212, 1955], Cox [249, 1972], [250, 1975],
[251, 1978], Cox and Harris [253, 1989], de Boor [272, 1972], and de Boor
and Pinkus [273, 1977].

Chaos and dynamical systems: Cipra [211, 1988], Coomes, Koçak, and
Palmer [242, 1995], Corless [246, 1992], [247, 1992], Hammel, Yorke,
and Grebogi [499, 1988], and Sanz-Serna and Larsson [893, 1993].

Nonlinear equations: Dennis and Walker [302, 1984], Spellucci [933, 1980],
and Wozniakowski [1111, 1977].

Optimization: Dennis and Schnabel [300, 1983], Fletcher [377, 1986], [379,
1988], [380, 1993], [381, 1994], Gill, Murray, and Wright [447, 1981],
Gurwitz [490, 1992], Müller-Merbach [783, 1970], and Wolfe [1108, 1965].

Ordinary differential equation initial value problems: Henrici
[517, 1962], [518, 1963], [519, 1964], D. J. Higham [525, 1991], Sanz-
Serna [891, 1992, §12], Sanz-Serna and Calvo [892, 1994], and Shampine
[910, 1994, §3.3, §5.6].

Partial differential equations: Ames [14, 1977], Birkhoff and Lynch [101,
1984], Canuto, Hussaini, Quarteroni, and Zang [183, 1988], Douglas [319,
1959], Forsythe and Wasow [397, 1960], Richtmyer and Morton [872,
1967, §1.8], and Trefethen and Trummer [1020, 1987].

Quadrature: Davis and Rabinowitz [267, 1984, §4.2], Lyness [717, 1969],
and Piessens et al. [832, 1983, §3.4.3.1].

1.21. Notes and References


The term “correct significant digits” is rarely defined in textbooks: it is ap-
parently assumed that the definition is obvious. One of the earliest books on

numerical analysis, by Scarborough [897, 1950] (first edition 1930), is note-


worthy for containing theorems describing the relationship between correct
significant digits and relative error.
The first definition of correct significant digits in §1.2 is suggested by
Hildebrand [571, 1974, §1.41], who notes its weaknesses.
For a formal proof and further explanation of the fact that precision does
not limit accuracy see Priest [844, 1992].
It is possible to develop formal definitions of numerical stability, either
with respect to a particular problem, as is frequently done in research papers,
or for a very general class of problems, as is done, for example, by de Jong [274,
1977 ]. Except in §7.6, we do not give formal definitions of stability in this
book, preferring instead to adapt informally the basic notions of backward
and forward stability to each problem, and thereby to minimize the amount
of notation and abstraction.
Backward error analysis was systematically developed, exploited, and popu-
larized by Wilkinson in the 1950s and 1960s in his research papers and, in
particular, through his books [1088, 1963], [1089, 1965] (for more about the
books see the Notes and References for Chapter 2). Backward error ideas had
earlier appeared implicitly in papers by von Neumann and Goldstine [1057,
1947] and Turing [1027, 1948], both of which deal with the solution of lin-
ear systems, and explicitly in an unpublished technical report of Givens [451,
1954] on the solution of the symmetric eigenproblem by reduction to tridiag-
onal form followed by the use of Sturm sequences. The concept of backward
error is not limited to numerical linear algebra. It is used, for example, in
the numerical solution of differential equations; see Eirola [349, 1993], En-
right [352, 1989], and Shampine [910, 1994, §2.2], in addition to the references
of Sanz-Serna and his co-authors cited in §1.20. Backward error is also used
in understanding chaotic behaviour of iterations; see the references in §1.20.
Conditioning of problems has been studied by numerical analysts since
the 1940s but the first general theory was developed by Rice [871, 1966]. In
numerical linear algebra, developing condition numbers is part of the subject
of perturbation theory, on which there is a large literature.
The solution of a quadratic equation is a classic problem in numerical anal-
ysis. In 1969 Forsythe [393, 1969] pointed out “the near absence of algorithms
to solve even a quadratic equation in a satisfactory way on actually used dig-
ital computer systems” and he presented specifications suggested by Kahan
for a satisfactory solver. Similar, but less technical, presentations are given by
Forsythe [392, 1969] and Forsythe, Malcolm, and Moler [395, 1977, §2.6]. Ka-
han [627, 1972] and Sterbenz [938, 1974] both present algorithms for solving
a quadratic equation, accompanied by error analysis.
For more details of algorithms for computing the sample variance and
their error analysis, see Chan and Lewis [195, 1979], Chan, Golub, and LeV-
eque [194, 1983], Barlow [61, 1991], and the references therein. Good general

references on computational aspects of statistics are Kennedy and Gentle [649,


1980] and Thisted [1000, 1988].
The issues of conditioning and numerical stability play a role in any disci-
pline in which finite precision computation is performed, but the understand-
ing of these issues is less well developed in some disciplines than in others.
In geometric computation, for example, there has been much interest since
the late 1980s in the accuracy and robustness of geometric algorithms; see
Milenkovic [754, 1988], Hoffmann [576, 1989], and Priest [843, 1991], [844,
1992 ].
It was after discovering Lemma 1.1 that Wilkinson began to develop back-
ward error analysis systematically in the 1950s. He explains that in solving
eigenproblems Ax = λx by deflation, the residual of the computed solution,
(with the normalization was “always at noise level rel-
ative to A” [1101, 1986]. He continues, “After some years’ experience of this
I happened, almost by accident, to observe that . . . In
other words and were exact for a matrix and since
this meant that they were exact for a matrix differing from A at the noise
level of the computer.” For further details see [1101, 1986] or [1100, 1985].
The numerical stability of Cramer’s rule for 2 × 2 systems has been inves-
tigated by Moler [768, 1974] and Stummel [964, 1981, §3.3].
The example in §1.12.2 is taken from the HP-15C Advanced Functions
Handbook [523, 1982], and a similar example is given by Kahan [629, 1980].
For another approach to analysing this “innocuous calculation” see Prob-
lem 3.11. The “f(2/3)” example in §1.13 is also taken from [629, 1980], in
which Kahan states three “anti-theorems” that are included among our mis-
conceptions in §1.19.
The example (1.8) is adapted from an example of Sterbenz [938, 1974,
p. 220], who devotes a section to discussing the effects of rerunning a compu-
tation at higher precision.
The function expm1 := ex - 1 is provided in some floating point processors
and mathematics libraries as a more accurate alternative to forming ex and
subtracting 1 [991, 1992]. It is important in the computation of sinh and tanh,
for example (since sinh x = e-x(e2 x - 1)/2). Algorithm 2 in §1.14.1 is due to
Kahan [629, 1980].
The instability and stability of GE without pivoting applied to an upper
Hessenberg matrix (§1.16) was first pointed out and explained by Wilkin-
son [1084, 1960]; Parlett [818, 1965] also gives a lucid discussion. In the 1950s
and 1960s. prior to the development of the QR algorithm, various methods
were proposed for the nonsymmetric eigenvalue problem that involved trans-
forming a matrix to Hessenberg form H and then finding the zeros of the char-
acteristic polynomial det(H - λI). The most successful method of this type
was Laguerre’s iteration, described by Parlett [817, 1964], and used in con-
junction with Hyman’s method for evaluating det(H - λI). Hyman’s method

is described in §13.5.1.
Classic papers dispensing good advice on the dangers inherent in numer-
ical computation are the “pitfalls” papers by Stegun and Abramowitz [937,
1956] and Forsythe [394, 1970]. The book Numerical Methods That Work by
Acton [4, 1970] must also be mentioned as a fount of hard-earned practical
advice on numerical computation (look carefully and you will see that the
front cover includes a faint image of the word “Usually” before “Work”). If
it is not obvious to you that the equation x^2 − 10x + 1 = 0 is best thought of
as a nearly linear equation for the smaller root, you will benefit from reading
Acton (see p. 58). Everyone should read Acton’s “Interlude: What Not To
Compute” (pp. 245-257).
Finally, we mention the paper “How to Get Meaningless Answers in Sci-
entific Computation (and What to Do About it)” by Fox [401, 1971]. Fox, a
contemporary of Wilkinson, founded the Oxford Computing Laboratory and
was for many years Professor of Numerical Analysis at Oxford. In this paper
he gives numerous examples in which incorrect answers are obtained from
plausible numerical methods (many of the examples involve truncation errors
as well as rounding errors). The section titles provide a list of reasons why
you might compute worthless answers:

• Your problem might be ill conditioned.

• Your method might be unstable.

• You expect too much “analysis” from the computers.

• Your intuition fails you.

• You accept consistency too easily.

• A successful method may fail in slightly different circumstances.

• Your test examples may be too special.

Fox estimates [401, 1971, p. 296] that “about 80 per cent of all the results
printed from the computer are in error to a much greater extent than the user
would believe.”

8 This reason refers to using an inappropriate convergence test in an iterative process.
Problems
The road to wisdom?
Well, it’s plain and simple to express:
Err
and err
and err again
but less
and less
and less.
-PIET HEIN, Grooks (1966)

1.1. In error analysis it is sometimes convenient to bound . . . instead of . . . . Obtain inequalities between . . . and . . . .

1.2. (Skeel and Keiper [923, 1993, §1.2]) The number y = e^(π√163) was evaluated at t digit precision for several values of t, yielding the values shown in the following table, which are in error by at most one unit in the least significant digit (the first two values are padded with trailing zeros):

    t     y
    10    262537412600000000
    15    262537412640769000
    20    262537412640768744.00
    25    262537412640768744.0000000
    30    262537412640768743.999999999999

Does it follow that the last digit before the decimal point is 4?
1.3. Show how to rewrite the following expressions to avoid cancellation for
the indicated arguments.

1.4. Give stable formulae for computing the square root x + iy of a complex
number a + ib.
1.5. [523, 1982] By writing (1 + 1/n)^n = exp(n log(1 + 1/n)), show how to compute (1 + 1/n)^n accurately for large n. Assume that the log function is computed with a relative error not exceeding u. (Hint: adapt the technique used in §1.14.1.)
1.6. (Smith [928, 1975]) Type the following numbers into your pocket calculator, and look at them upside down (you or the calculator):

    07734                   The famous “_____ world” program
    38079                   Object
    318808                  Name
    35007                   Adjective
    57738.57734 × 10^40     Exclamation on finding a bug
    3331                    A high quality floating point arithmetic
                            Fallen tree trunks
1.7. A condition number for the sample variance (1.4), here denoted by V(x), can be defined by . . . . Show that . . . . This condition number measures perturbations in x componentwise. A corresponding normwise condition number is . . . . Show that . . . .

1.8. (Kahan, Muller [781, 1989], Francois and Muller [406, 1991]) Consider the recurrence

    x_{k+1} = 111 − (1130 − 3000/x_{k−1})/x_k,   x_0 = 11/2,   x_1 = 61/11.

In exact arithmetic the x_k form a monotonically increasing sequence that converges to 6. Implement the recurrence on your computer or pocket calculator and compare the computed x_34 with the true value 5.998 (to four correct significant figures). Explain what you see.
The following questions require knowledge of material from later chapters.
1.9. Cramer’s rule solves a 2 × 2 system Ax = b according to

    d = a_11 a_22 − a_21 a_12,
    x_1 = (b_1 a_22 − b_2 a_12)/d,
    x_2 = (a_11 b_2 − a_21 b_1)/d.

Show that, assuming d is computed exactly (this assumption has little effect on the final bounds), the computed solution satisfies . . . , where γ_3 = 3u/(1 − 3u), cond(A, x) = . . . and cond(A) = . . . . This forward error bound is as small as that for a backward stable method (see §7.2, §7.6), so Cramer’s rule is forward stable for 2 × 2 systems.
1.10. Show that the computed sample variance fl(V(x)) produced by the two-pass formula (1.4) satisfies . . . . (Note that this error bound does not involve the condition numbers κ_C or κ_N from Problem 1.7, at least in the first-order term. This is a rare instance of an algorithm that determines the answer more accurately than the data warrants!)
Chapter 2
Floating Point Arithmetic

From 1946-1948 a great deal of quite detailed coding was done.


The subroutines for floating-point arithmetic were . . .
produced by Alway and myself in 1947 . . .
They were almost certainly the earliest floating-point subroutines.
-J. H. WILKINSON, Turing’s Work at the
National Physical Laboratory . . . (1980)

When discussing the floating-point capabilities of a new machine,


we always ask the manufacturer two questions:
Does the machine use IEEE arithmetic?
Does it support graceful underflow and provide
user control of rounding mode and exception flags?
Frequently the designer believes his machine is using IEEE arithmetic
when it is using only the IEEE formats without the other required features.
-W. J. CODY, Floating-Point Standards-Theory and Practice (1988)

Arithmetic on Cray computers is interesting because it is driven by a


motivation for the highest possible floating-point performance . . .
Addition on Cray computers does not have a guard digit,
and multiplication is even less accurate than addition . . .
At least Cray computers serve to keep numerical analysts on their toes!
-DAVID GOLDBERG 9, Computer Arithmetic (1990)

It is rather conventional to obtain a “realistic” estimate


of the possible overall error due to k roundoffs,
when k is fairly large,
by replacing k by √k in an expression for (or an estimate of)
the maximum resultant error.
-F. B. HILDEBRAND, Introduction to Numerical Analysis (1974)

9 In Hennessy and Patterson [515, 1990, App. A].

2.1. Floating Point Number System


A floating point number system F ⊆ IR is a subset of the real numbers whose elements have the form

    y = ±m × β^(e−t).   (2.1)

The system F is characterized by four integer parameters:

• the base β (sometimes called the radix),

• the precision t, and

• the exponent range emin ≤ e ≤ emax.

The mantissa m is an integer satisfying 0 ≤ m ≤ β^t − 1. To ensure a unique representation for each y ∈ F it is assumed that m ≥ β^(t−1) if y ≠ 0, so that the system is normalized. The range of the nonzero floating point numbers in F is given by β^(emin−1) ≤ |y| ≤ β^(emax)(1 − β^(−t)). Values of the parameters for some machines of interest are given in Table 2.1 (the unit roundoff u is defined on page 42).
Note that an alternative (and more common) way of expressing y is

    y = ±β^e × .d_1 d_2 . . . d_t,   (2.2)

where each digit d_i satisfies 0 ≤ d_i ≤ β − 1, and d_1 ≠ 0 for normalized numbers. We prefer the more concise representation (2.1), which we usually find easier to work with. This “nonpositional” representation has pedagogical advantages, being entirely integer based and therefore simpler to grasp. In the representation (2.2), d_1 is called the most significant digit and d_t the least significant digit.
It is important to realize that the floating point numbers are not equally
spaced. If β = 2, t = 3, emin = -1, and emax = 3 then the nonnegative
floating point numbers are

0, 0.25, 0.3125, 0.3750, 0.4375, 0.5, 0.625, 0.750, 0.875,
1.0, 1.25, 1.50, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0.
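As an illustrative aside, the following Python sketch (not an algorithm from this book) enumerates the nonnegative numbers of this toy system directly from the representation (2.1) and reproduces the list above:

    # Enumerate the nonnegative numbers of the toy system of (2.1):
    # beta = 2, t = 3, emin = -1, emax = 3 (normalized numbers, plus zero).
    beta, t, emin, emax = 2, 3, -1, 3

    values = {0.0}
    for e in range(emin, emax + 1):
        for m in range(beta ** (t - 1), beta ** t):   # normalized mantissas
            values.add(m * float(beta) ** (e - t))

    print(sorted(values))
    # 0.0, 0.25, 0.3125, 0.375, 0.4375, 0.5, ..., 6.0, 7.0 -- matching the text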

They can be represented pictorially as points on the real line, with the spacing between adjacent numbers growing as the magnitude of the numbers grows.
Table 2.1. Floating point arithmetic parameters.

    Machine and arithmetic        β    t    emin     emax        u
    Cray-1 single                 2   48   -8192     8191    4 × 10^−15
    Cray-1 double                 2   96   -8192     8191    1 × 10^−29
    DEC VAX G format, double      2   53   -1023     1023    1 × 10^−16
    DEC VAX D format, double      2   56    -127      127    1 × 10^−17
    HP 28 and 48G calculators    10   12    -499      499    5 × 10^−12
    IBM 3090 single              16    6     -64       63    5 × 10^−7
    IBM 3090 double              16   14     -64       63    1 × 10^−16
    IBM 3090 extended            16   28     -64       63    2 × 10^−33
    IEEE single                   2   24    -125      128    6 × 10^−8
    IEEE double                   2   53   -1021     1024    1 × 10^−16
    IEEE extended (typical)       2   64  -16381    16384    5 × 10^−20

Notice that the spacing of the floating point numbers jumps by a factor 2 at each power of 2. The spacing can be characterized in terms of machine epsilon, which is the distance ε_M from 1.0 to the next larger floating point number. Clearly, ε_M = β^(1−t), and this is the spacing of the floating point numbers between 1.0 and β; the spacing of the numbers between 1.0 and 1/β is β^(−t) = ε_M/β. The spacing at an arbitrary x ∈ F is estimated by the following lemma.

Lemma 2.1. The spacing between a normalized floating point number x and an adjacent normalized floating point number is at least β^(−1)ε_M|x| and at most ε_M|x| (unless x or the neighbour is zero).

Proof. See Problem 2.2.


The system F can be extended by including subnormal numbers (also known as denormalized numbers), which, in the notation of (2.1), are the numbers

    y = ±m × β^(emin−t),   0 < m < β^(t−1),

which have the minimum exponent and are not normalized (equivalently, in (2.2) e = emin and the most significant digit d_1 = 0). The subnormal numbers have fewer digits of precision than the normalized numbers. The smallest positive normalized floating point number is λ = β^(emin−1) while the smallest positive subnormal number is µ = β^(emin−t) = λε_M. The subnormal numbers fill the gap between λ and 0 and are equally spaced, with spacing the same as that of the numbers of F between λ and βλ, namely λε_M = β^(emin−t). For example, in the system illustrated above with β = 2, t = 3, emin = −1, and emax = 3, we have λ = 2^−2 and µ = 2^−4; the subnormal numbers are

    0.0625, 0.125, 0.1875,

and the complete number system can be represented pictorially as before, now with these subnormal numbers filling the gap between 0 and λ.
Let G ⊆ IR denote all real numbers of the form (2.1) with no restriction on the exponent e. If x ∈ IR then fl(x) denotes an element of G nearest to x, and the transformation x → fl(x) is called rounding. There are several ways to break ties when x is equidistant from two floating point numbers, including taking fl(x) to be the number of larger magnitude (round away from zero) or the one with an even last digit d_t (round to even); the latter rule enjoys impeccable statistics [144, 1973]. For more on tie-breaking strategies see the Notes and References.
Although we have defined fl as a mapping onto G, we are only interested in the cases where it produces a result in F. We say that fl(x) overflows if |fl(x)| > max{|y| : y ∈ F} and underflows if 0 < |fl(x)| < min{|y| : 0 ≠ y ∈ F}.
The following result shows that every real number x lying in the range of F can be approximated by an element of F with a relative error no larger than u = ½β^(1−t). The quantity u is called the unit roundoff. It is the most useful quantity associated with F and is ubiquitous in the world of rounding error analysis.

Theorem 2.2. If x ∈ IR lies in the range of F then

    fl(x) = x(1 + δ),   |δ| < u.   (2.3)


Proof. We can assume that x > 0. Writing the real number x in the form

    x = µ × β^(e−t),   β^(t−1) ≤ µ < β^t,

we see that x lies between the adjacent floating point numbers y_1 = ⌊µ⌋β^(e−t) and y_2 = ⌈µ⌉β^(e−t). Thus fl(x) = y_1 or y_2 and we have

    |fl(x) − x| ≤ |y_2 − y_1|/2 ≤ β^(e−t)/2.

Hence

    |fl(x) − x|/|x| ≤ β^(e−t)/(2µβ^(e−t)) = 1/(2µ) ≤ 1/(2β^(t−1)) = u.

The last inequality is strict unless µ = β^(t−1), in which case x = fl(x), hence the inequality of the theorem is strict.
Theorem 2.2 says that fl(x) is equal to x multiplied by a factor very close to 1. The representation 1 + δ for the factor is the standard choice, but it is not the only possibility. For example, we could write the factor as e^α, with a bound on |α| a little less than u (cf. the ρ notation in §3.4).
The following modified version of this theorem can also be useful.

Theorem 2.3. If x ∈ IR lies in the range of F then

    fl(x) = x/(1 + δ),   |δ| ≤ u.


The widely used IEEE standard arithmetic (described in §2.3) has β = 2 and supports two precisions. Single precision has t = 24, emin = −125, emax = 128, and u = 2^−24 ≈ 5.96 × 10^−8. Double precision has t = 53, emin = −1021, emax = 1024, and u = 2^−53 ≈ 1.11 × 10^−16. IEEE arithmetic uses round to even.
It is easy to see that

    . . .

Hence, while the relative error in representing x is bounded by ½β^(1−t) (as it must be, by Theorem 2.2), the relative error varies with x by as much as a factor β. This phenomenon is called wobbling precision, and is one of the reasons why small bases (in particular, β = 2) are favoured. The effect of wobbling precision is clearly displayed in Figure 2.1, which plots machine numbers x versus the relative distance from x to the next larger machine number, for 1 ≤ x ≤ 16 in IEEE single precision arithmetic. In this plot, the relative distances range from about 2^−23 ≈ 1.19 × 10^−7 just to the right of a power of 2 to about 2^−24 ≈ 5.96 × 10^−8 just to the left of a power of 2 (see Lemma 2.1).
The notion of ulp, or “unit in last place”, is sometimes used when describing the accuracy of a floating point result. One ulp of the normalized floating point number y = ±β^e × .d_1 d_2 . . . d_t is ulp(y) = β^e × .00 . . . 01 = β^(e−t). If x is any real number we can say that y and x agree to |y − x|/ulp(y) ulps in y. This measure of accuracy “wobbles” when y changes from a power of β to the next smaller floating point number, since ulp(y) decreases by a factor β.
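As an illustrative aside, Python’s math.ulp function (Python 3.9 and later) returns the spacing from a positive number to the next larger one, and makes the wobble easy to observe numerically (here in double precision, β = 2, t = 53, rather than the single precision of Figure 2.1):

    import math

    # Relative spacing ulp(x)/x for IEEE double precision (beta = 2, t = 53).
    # It jumps by a factor of 2 as x crosses a power of 2 ("wobbling precision").
    for x in [1.0, 1.5, 1.99, 2.0, 3.0, 3.99, 4.0]:
        print(f"x = {x:<5}  ulp(x)/x = {math.ulp(x) / x:.3e}")
    # The printed ratios range between about 1.1e-16 (just below a power of 2)
    # and 2.2e-16 (at and just above a power of 2).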
In MATLAB the permanent variable eps represents the machine epsilon
(not the unit roundoff as is sometimes thought). MATLAB uses IEEE standard
Figure 2.1. Relative distance from x to the next larger machine number (β = 2,
t = 24), displaying wobbling precision.

double precision arithmetic on those machines that support it in hardware.


In Fortran 90 the intrinsic function EPSILON returns the machine epsilon cor-
responding to the KIND of its REAL argument.

2.2. Model of Arithmetic


To carry out rounding error analysis of an algorithm we need to make some
assumptions about the accuracy of the basic arithmetic operations. The most
common assumptions are embodied in the following model, in which x, y
F:

STANDARD MODEL

f l(xopy) = ( x opy)(1+ δ ), |δ|< u, op = +, -, *, /. (2.4)

It is normal to assume that (2.4) holds also for the square root operation.
Note that now we are using fl(·) with an argument that is an arithmetic
expression to denote the computed value of that expression. The model says
that the computed value of x op y is “as good as” the rounded exact answer,
in the sense that the relative error bound is the same in both cases. However,
the model does not require that δ = 0 when x op y ∈ F (a condition which obviously does hold for the rounded exact answer), so the model does not capture all the features we might require of floating point arithmetic. This model is valid for most computers, and, in particular, holds for IEEE standard arithmetic. Cases in which the model is not valid are described in §2.4.
The following modification of (2.4) can also be used (cf. Theorem 2.3):

    fl(x op y) = (x op y)/(1 + δ),   |δ| ≤ u.   (2.5)

Note: Throughout this book, the standard model (2.4) is used unless
otherwise stated. Most results proved using the standard model remain true
with the weaker model (2.6) described below, possibly subject to slight in-
creases in the constants. We identify problems for which the choice of model
significantly affects the results that can be proved.
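As a quick empirical check of the standard model (an illustration only, not part of the error analysis machinery of this book), exact rational arithmetic can be used to measure the relative error δ incurred by each basic operation in IEEE double precision, for which u = 2^−53:

    import operator
    from fractions import Fraction

    u = Fraction(1, 2 ** 53)   # unit roundoff for IEEE double precision

    def rel_error(x, y, op):
        # |fl(x op y) - (x op y)| / |x op y|, with the exact result formed
        # in rational arithmetic from the stored operands x and y.
        computed = Fraction(op(x, y))           # fl(x op y), converted exactly
        exact = op(Fraction(x), Fraction(y))    # x op y exactly
        return abs(computed - exact) / abs(exact)

    for op in (operator.add, operator.sub, operator.mul, operator.truediv):
        d = rel_error(0.1, 0.3, op)
        print(f"{op.__name__:8s} delta = {float(d):.2e}  within bound: {d <= u}")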

2.3. IEEE Arithmetic


IEEE standard 754, published in 1985 [597, 1985], defines a binary floating
point arithmetic system. It is the result of several years’ work by a working
group of a subcommittee of the IEEE Computer Society Computer Standards
Committee.
Among the design principles of the standard were that it should encourage
experts to develop robust, efficient, and portable numerical programs, enable
the handling of arithmetic exceptions, and provide for the development of
transcendental functions and very high precision arithmetic.
The standard specifies floating point number formats, the results of the
basic floating point operations and comparisons, rounding modes, floating
point exceptions and their handling, and conversion between different arith-
metic formats. Square root is included as a basic operation. The standard
says nothing about exponentiation or transcendental functions such as exp
and cos.
Two main floating point formats are defined:

    Type     Size      Mantissa    Exponent   Unit roundoff            Range
    Single   32 bits   23+1 bits    8 bits    2^−24 ≈ 5.96 × 10^−8     10^±38
    Double   64 bits   52+1 bits   11 bits    2^−53 ≈ 1.11 × 10^−16    10^±308

In both formats one bit is reserved as a sign bit. Since the floating point
numbers are normalized, the most significant bit is always 1 and is not stored
(except for the denormalized numbers described below). This hidden bit ac-
counts for the “+1” in the table.
Table 2.2. IEEE arithmetic exceptions and default results.

    Exception type       Example                        Default result
    Invalid operation    0/0, 0 × ∞                     NaN (Not a Number)
    Overflow                                            ±∞
    Divide by zero       Finite nonzero/0               ±∞
    Underflow                                           Subnormal numbers
    Inexact              Whenever fl(x op y) ≠ x op y   Correctly rounded result

The standard specifies that all arithmetic operations are to be performed


as if they were first calculated to infinite precision and then rounded according
to one of four modes. The default rounding mode is to round to the nearest
representable number, with rounding to even (zero least significant bit) in the
case of a tie. With this default mode, the model (2.4) is obviously satisfied.
Note that computing with a single guard bit (see §2.4) will not always give the
same answer as computing the exact result and then rounding. But the use
of a second guard bit and a third sticky bit (the logical OR of all succeeding
bits) enables the rounded exact result to be computed. Rounding to plus or
minus infinity is also supported by the standard: this facilitates the imple-
mentation of interval arithmetic. The fourth supported mode is rounding to
zero (truncation, or chopping).
IEEE arithmetic is a closed system: every arithmetic operation produces
a result, whether it is mathematically expected or not, and exceptional oper-
ations raise a signal. The default results are shown in Table 2.2. The default
response to an exception is to set a flag and continue, but it is also possible
to take a trap (pass control to a trap handler).
A NaN is a special bit pattern that cannot be generated in the course of
unexceptional operations because it has a reserved exponent field. Since the
mantissa is arbitrary, subject to being nonzero, a NaN can have something
about its provenance encoded in it, and this information can be used for
retrospective diagnostics. A NaN is generated by operations such as 0/0, 0 × ∞, and ∞ − ∞. One creative use of the NaN is to denote uninitialized or missing data. Arithmetic operations involving a NaN return a NaN as the answer. A NaN compares as unordered and unequal with everything including itself (a NaN can be tested with the predicate x ≠ x or with the IEEE recommended function isnan, if provided).
The IEEE standard provides distinct representations for +0 and -0, but
comparisons are defined so that +0 = -0. Signed zeros provide an elegant
way to handle branch cuts in complex arithmetic; for details, see Kahan [632, 1987].
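The following Python snippet is an illustrative sketch of a few of these NaN and signed zero properties on an IEEE-conforming system (Python itself raises an exception for division by zero, so copysign is used to reveal the sign of zero):

    import math

    nan = float("nan")
    inf = float("inf")

    print(nan == nan)               # False: a NaN compares unequal with everything
    print(nan != nan)               # True: the x != x predicate detects a NaN
    print(math.isnan(0.0 * inf))    # True: an invalid operation produces a NaN
    print(math.isnan(inf - inf))    # True

    pz, nz = 0.0, -0.0
    print(pz == nz)                                         # True: +0 = -0
    print(math.copysign(1.0, pz), math.copysign(1.0, nz))   # 1.0 -1.0: signs differ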
The infinity symbol is represented by a zero mantissa and the same exponent field as a NaN; the sign bit distinguishes between +∞ and −∞. The infinity symbol obeys the usual mathematical conventions regarding infinity, such as ∞ + ∞ = ∞ and (finite)/∞ = 0.
The standard allows subnormal numbers to be represented, instead of


flushing them to zero as in many systems, and this feature permits gradual
underflow (sometimes called graceful underflow). Gradual underflow makes
it easier to write reliable numerical software; see Demmel [280, 1984].
The standard may be implemented in hardware or software. The first
hardware implementation was the Intel 8087 floating point coprocessor, which
was produced in 1981 and implements an early draft of the standard (the
8087 very nearly conforms to the present standard). This chip, together with
its bigger and more recent brothers the Intel 80287, 80387, 80486 and the
Pentium, is used in IBM PC compatibles (the 80486DX and Pentium are
general-purpose chips that incorporate a floating point coprocessor). Other
manufacturers that produce processors implementing IEEE arithmetic include
DEC (Alpha), Hewlett Packard (Precision Architecture), IBM (RS/6000),
Inmos (T800, T900), Motorola (680x0), and Sun (SPARCstation).
The IEEE standard defines minimum requirements for two extended num-
ber formats: single extended and double extended. The double extended for-
mat has at least 79 bits, with at least 63 bits in the mantissa and at least 15 in the exponent; it therefore surpasses the double format in both precision and range, having unit roundoff u ≤ 5.42 × 10^−20 and range at least 10^±4932. The purpose of the extended precision formats is not to provide for
higher precision computation per se, but to enable double precision results
to be computed more reliably (by avoiding intermediate overflow and under-
flow) and more accurately (by reducing the effect of cancellation) than would
otherwise be possible. In particular, extended precision makes it easier to
write accurate routines to evaluate the elementary functions, as explained by
Hough [584, 1981].
A double extended format of 80 bits is supported by the Intel and Motorola
chips mentioned above (which are used in many PC and Macintosh comput-
ers); these chips, in fact, normally do all their floating point arithmetic in 80
bit arithmetic (even for arguments of the single or double format). However,
double extended is not supported by Sun SPARCstations or machines that use
the PowerPC or DEC Alpha chips. Furthermore, the extended format (when
available) is supported little by compilers and packages such as Mathematica
and Maple. Kahan [636, 1994] notes that “What you do not use, you are
destined to lose”, and encourages the development of benchmarks to measure
accuracy and related attributes. He also explains that

For now the 10-byte Extended format is a tolerable compromise


between the value of extra-precise arithmetic and the price of im-
plementing it to run fast; very soon two more bytes of precision
will become tolerable, and ultimately a 16-byte format . . . That


kind of gradual evolution towards wider precision was already in
view when IEEE Standard 754 for Floating-Point Arithmetic was
framed.
A possible side effect of the use of an extended format is the phenomenon
of double rounding, whereby a result computed “as if to infinite precision” (as
specified by the standard) is rounded first to the extended format and then to
the destination format. Double rounding (which is allowed by the standard)
can give a different result from that obtained by rounding directly to the
destination format, and so can lead to subtle differences between the results
obtained with different implementations of IEEE arithmetic (see Problems 2.9
and 3.11).
An IEEE Standard 854, which generalizes the binary standard 754, was
published in 1987 [598, 1987]. It is a standard for floating point arithmetic
that is independent of word length and base (although in fact only bases 2 and
10 are provided in the standard, since the drafting committee “could find no
valid technical reason for allowing other radices, and found several reasons for
not allowing them” [223, 1988]). Base 10 IEEE 854 arithmetic is implemented
in the HP-71B calculator.

2.4. Aberrant Arithmetics


Unfortunately, not all computer floating point arithmetics adhere to the model
(2.4). The most common reason for noncompliance with the model is that
the arithmetic lacks a guard digit in subtraction. The role of a guard digit is
easily explained with a simple example.
Consider a floating point arithmetic system with base β = 2 and t = 3
digits in the mantissa. Subtracting from 1.0 the next smaller floating number
we have, in binary notation,
    2^1 × 0.100          2^1 × 0.100 −
    2^0 × 0.111          2^1 × 0.0111
                         ────────────
                         2^1 × 0.0001 = 2^−2 × 0.100
Notice that to do the subtraction we had to line up the binary points, thereby
unnormalizing the second number and using, temporarily, a fourth mantissa
digit, known as a guard digit. Some machines do not have a guard digit.
Without a guard digit in our example we would compute as follows, assuming
the extra digits are simply discarded:
    2^1 × 0.100          2^1 × 0.100 −
    2^0 × 0.111          2^1 × 0.011   (last digit dropped)
                         ────────────
                         2^1 × 0.001 = 2^−1 × 0.100


The computed answer is too big by a factor 2 and so has relative error 1! For machines without a guard digit it is not true that

    fl(x ± y) = (x ± y)(1 + δ),   |δ| < u,

but it is true that

Our model of floating point arithmetic becomes

NO GUARD DIGIT MODEL

    fl(x ± y) = x(1 + α) ± y(1 + β),   |α|, |β| < u,   (2.6a)

    fl(x op y) = (x op y)(1 + δ),   |δ| < u,   op = *, /,   (2.6b)

where we have stated a weaker condition on α and β that is generally easier


to work with.
Notable examples of machines that lack guard digits are several models
of Cray computers (Cray 1, 2, X-MP, Y-MP, and C90). On these computers
subtracting any power of 2 from the next smaller floating point number gives
an answer that is either a factor of 2 too large (as in the example above-e.g.,
Cray X-MP or Y-MP) or is zero (Cray 2). In 1992 Cray announced that it
would produce systems that use IEEE standard double precision arithmetic
by 1995.
The lack of a guard digit is a serious drawback. It causes many algorithms
that would otherwise work perfectly to fail some of the time (e.g., compensated
summation-see §4.3). Here is an example of a result that holds only when a
guard digit is used. This result holds for any base β .

Theorem 2.4 (Ferguson). Let x and y be floating point numbers for which e(x − y) ≤ min(e(x), e(y)), where e(x) denotes the exponent of x in its normalized floating point representation. If subtraction is performed with a guard digit then x − y is computed exactly (assuming x − y does not underflow or overflow).

Proof. From the condition of the theorem the exponents of x and y differ
by at most 1. If the exponents are the same then fl(x - y) is computed
exactly, so suppose the exponents differ by 1, which can happen only when x
and y have the same sign. Scale and interchange x and y if necessary so that
β^(−1) ≤ y < 1 ≤ x < β, where β is the base. Now x is represented in base β as x_1.x_2 . . . x_t and the exact difference z = x − y is of the form

      x_1 . x_2 . . . x_t
    − 0 . y_1 . . . y_{t−1} y_t
      ─────────────────────────
      z_1 . z_2 . . . z_t z_{t+1}

But e(x − y) ≤ e(y) and y < 1, so z_1 = 0. The algorithm for computing z forms z_1.z_2 . . . z_t z_{t+1} and then rounds to t digits; since z has at most t significant digits this rounding introduces no error, and thus z is computed exactly.
The next result is a corollary of the previous one but is more well known. It
is worth stating as a separate theorem because the conditions of the theorem
are so elegant and easy to check (being independent of the base), and because
this weaker theorem is sufficient for many purposes.

Theorem 2.5 (Sterbenz). Let x and y be floating point numbers with y/2 ≤ x ≤ 2y. If subtraction is performed with a guard digit then x − y is computed exactly (assuming x − y does not underflow).
Theorem 2.5 is vital in proving that certain special algorithms work. A
good example involves Heron’s formula for the area A of a triangle with sides
of length a, b, and c:

    A = √(s(s − a)(s − b)(s − c)),   s = (a + b + c)/2.
This formula is inaccurate for needle-shaped triangles: if a ≈ b + c then s ≈ a and the term s − a suffers severe cancellation. A way around this difficulty, devised by Kahan, is to rename a, b, and c so that a ≥ b ≥ c and then evaluate

    A = ¼√((a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c))).   (2.7)
The parentheses are essential! Kahan has shown that this formula gives the
area with a relative error bounded by a modest multiple of the unit roundoff
provided that a guard digit is used in subtraction [457, 1991, Thm. 3], [634,
1990] (see Problem 2.22). If there is no guard digit, the computed result can
be very inaccurate.
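The following Python sketch (an illustration only; the reference value is obtained by forming Heron’s product in exact rational arithmetic, so that only the final square root is rounded) shows the difference on a needle-shaped triangle:

    import math
    from fractions import Fraction

    def area_heron(a, b, c):
        # Classical Heron formula: s - a cancels badly for needle-shaped triangles.
        s = (a + b + c) / 2
        return math.sqrt(s * (s - a) * (s - b) * (s - c))

    def area_kahan(a, b, c):
        # Kahan's rearrangement (2.7); requires a >= b >= c and the parentheses.
        a, b, c = sorted((a, b, c), reverse=True)
        return 0.25 * math.sqrt((a + (b + c)) * (c - (a - b))
                                * (c + (a - b)) * (a + (b - c)))

    def area_reference(a, b, c):
        # Heron's product in exact rational arithmetic; only the sqrt is rounded.
        a, b, c = map(Fraction, (a, b, c))
        s = (a + b + c) / 2
        return math.sqrt(s * (s - a) * (s - b) * (s - c))

    a = b = 1.0e8          # two long sides
    c = 1.0e-6             # one tiny side: a needle-shaped triangle
    print(area_heron(a, b, c))      # noticeably inaccurate here
    print(area_kahan(a, b, c))      # close to the reference value
    print(area_reference(a, b, c))  # approximately 50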
Kahan has made these interesting historical comments about guard digits
[634, 1990]:
CRAYs are not the first machines to compute differences blighted
by lack of a guard digit. The earliest IBM ’360s, from 1964 to 1967,
subtracted and multiplied without a hexadecimal guard digit un-
til SHARE, the IBM mainframe user group, discovered why the
consequential anomalies were intolerable and so compelled a guard
digit to be retrofitted. The earliest Hewlett-Packard financial cal-
culator, the HP-80, had a similar problem. Even now, many a
calculator (but not Hewlett-Packard’s) lacks a guard digit.
2.5. Choice of Base and Distribution of Numbers

What base β is best for a floating point number system? Most modern com-
puters use base 2. Most hand-held calculators use base 10, since it makes
the calculator easier for the user to understand (how would you explain to a
naive user that 0.1 is not exactly representable on a base 2 calculator?). IBM
mainframes traditionally have used base 16. Even base 3 has been tried-in
an experimental machine called SETUN, built at Moscow State University in
the late 1950s [1066, 1960].
Several criteria can be used to guide the choice of base. One is the impact
of wobbling precision: as we saw at the end of §2.1, the spread of representa-
tion errors is smallest for small bases. Another possibility is to measure the
worst-case representation error or the mean square representation error. The
latter quantity depends on the assumed distribution of the numbers that are
represented. Brent [144, 1973] shows that for the logarithmic distribution the
worst-case error and the mean square error are both minimal for (normalized)
base 2, provided that the most significant bit is not stored explicitly.
The logarithmic distribution is defined by the property that the proportion
of base β numbers with leading significant digit n is

    log_β(n + 1) − log_β(n) = log_β(1 + 1/n).
It appears that in practice real numbers are logarithmically distributed. In


1938, Benford [90, 1938] noticed, as had Newcomb [794, 1881] before him,
that the early pages of logarithm tables showed greater signs of wear than the
later ones. (He was studying dirty books!) This prompted him to carry out a
survey of 20,229 “real-life” numbers, whose decimal representations he found
matched the logarithmic distribution closely.
The observed logarithmic distribution of leading significant digits has not
been fully explained. Some proposed explanations are based on the assump-
tion that the actual distribution is scale invariant, but this assumption is
equivalent to the observation to be explained [1032, 1984]. Barlow [57, 1981],
[58, 1981], [60, 1988] and Turner [1031, 1982], [1032, 1984] give useful insight
by showing that if uniformly distributed numbers are multiplied together, then
the resulting distribution converges to the logarithmic one; see also Boyle [141,
1994]. Furthermore, it is an interesting result that the leading significant digits of the numbers q^k, k = 0, 1, 2, . . . , are logarithmically distributed if q is positive and is not a rational power of 10; when q = 2 and the digit is 7 this is Gelfand’s problem [829, 1981, pp. 50-51].
The nature of the logarithmic distribution is striking. For decimal num-
bers, the digits 1 to 9 are not equally likely to be a leading significant digit.
The probabilities are as follows:
1 2 3 4 5 6 7 8 9
0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
As an example, here is the leading significant digit distribution for the ele-
ments of the inverse of one random 100 × 100 matrix from the normal N(0, 1)
distribution:
1 2 3 4 5 6 7 8 9
0.334 0.163 0.100 0.087 0.077 0.070 0.063 0.056 0.051
For an entertaining survey of work on the distribution of leading significant
digits see Raimi [856, 1976] (and also the popular article [855, 1969]).
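The probabilities above are log10(1 + 1/n), and the matrix experiment is easily repeated; the following Python sketch (an illustration only, with frequencies that will of course vary from one random matrix to another) does so:

    import math
    from collections import Counter
    import numpy as np

    # Logarithmic (Benford) probabilities for leading decimal digits 1..9.
    print({d: round(math.log10(1 + 1 / d), 3) for d in range(1, 10)})

    # Leading significant digits of the entries of the inverse of a random
    # 100 x 100 matrix with entries from the normal N(0,1) distribution.
    rng = np.random.default_rng(0)
    X = np.abs(np.linalg.inv(rng.standard_normal((100, 100)))).ravel()
    counts = Counter(int(f"{x:e}"[0]) for x in X)
    print({d: round(counts[d] / X.size, 3) for d in range(1, 10)})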

2.6. Statistical Distribution of Rounding Errors


Most rounding error analyses, including all the ones in this book, are designed
to produce worst-case bounds for the error. The analyses ignore the signs of
rounding errors and are often the result of many applications of the triangle
inequality and the submultiplicative inequality. Consequently, although the
bounds may well give much insight into a method, they tend to be pessimistic
if regarded as error estimates.
Statistical statements about the effect of rounding on a numerical process
can be obtained from statistical analysis coupled with probabilistic models
of the rounding errors. For example, a well-known rule of thumb is that a
more realistic error estimate for a numerical met hod is obtained by replacing
the dimension-dependent constants in a rounding error bound by their square
root: thus if the bound is f(n)u, the rule of thumb says that the error is typi-
cally of order (see, for example, Wilkinson [1088, 1963, pp. 26, 102]).
This rule of thumb can be supported by assuming that the rounding errors
are independent random variables and applying the central limit theorem.
Statistical analysis of rounding errors goes back to one of the first papers on
rounding error analysis, by Goldstine and von Neumann [462, 1951].
As we noted in §1.17, rounding errors are not random. See Problem 2.10
for an example of how two rounding errors cancel in one particular class of
computations. Forsythe [389, 1959] points out that rounding errors do not
necessarily behave like independent random variables and proposes a random
form of rounding (intended for computer testing) to which statistical analysis
is applicable.
Henrici [517, 1962], [518, 1963], [519, 1964] assumes models for rounding
errors and then derives the probability distribution of the overall error, mainly
in the context of difference methods for differential equations. Hull and Swen-
son [593, 1966] give an insightful discussion of probabilistic models, pointing
out that “There is no claim that ordinary rounding and chopping are random
processes, or that successive errors are independent. The question to be de-
cided is whether or not these particular probabilistic models of the processes
will adequately describe what actually happens” (see also the ensuing note by
Henrici [520, 1966]).
Since the late 1980s Chaitin-Chatelin and her co-workers have been de-
veloping a method called PRECISE, which involves a statistical analysis of
the effect on a computed solution of random perturbations in the data; see
Brunet [152, 1989], Chatelin and Brunet [202, 1990], and Chaitin-Chatelin
and Frayssé [190, 1996]. This approach is superficially similar to the earlier
CESTAC (permutation-perturbation) method of La Porte and Vignes [153,
1986], [682, 1974], [1054, 1986], but differs from it in several respects. CES-
TAC deals with the arithmetic reliability of algorithms, whereas PRECISE
is designed as a tool to explore the robustness of numerical algorithms as a
function of parameters such as mesh size, time step, and nonnormality.
Several authors have investigated the distribution of rounding errors under
the assumption that the mantissas of the operands are from a logarithmic
distribution, and for different modes of rounding; see Barlow and Bareiss [62,
1985] and the references therein.
Other work concerned with statistical modelling of rounding errors in-
cludes that of Tienari [1002, 1970] and Linnainmaa [704, 1975].

2.7. Alternative Number Systems


The floating point format is not the only means for representing numbers in
finite precision arithmetic. Various alternatives have been proposed, though
none has achieved widespread use.
A particularly elegant system is the “level index arithmetic” of Clenshaw, Olver, and Turner, in which a number x > 1 is represented by ℓ + f, where f ∈ [0, 1] and

    x = exp(exp(. . . exp(f) . . .))   or   f = ln(ln(. . . (ln x) . . .)),

where the exponentiation or logarithm is performed ℓ times (ℓ is the “level”).
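As a rough illustration (ignoring entirely how ℓ and f would actually be stored), the level and fraction of a number x > 1 can be obtained in Python by taking logarithms repeatedly:

    import math

    def level_index(x):
        # Take natural logarithms until the value drops below 1; the number of
        # logarithms taken is the level l and the final value is the fraction f,
        # so that x is represented by l + f.
        assert x > 1.0
        level, f = 0, x
        while f >= 1.0:
            f = math.log(f)
            level += 1
        return level, f

    print(level_index(2.0))      # small numbers have level 1
    print(level_index(1.0e300))  # even a huge number has a small level (here 4)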


If 0 < x < 1, then x is represented by the reciprocal of the representation for 1/x. An obvious feature of the level index system is that it can represent
much larger and smaller numbers than the floating point system, for similar
word lengths. A price to be paid is that addition and subtraction are more
complicated (and more costly) than in floating point arithmetic. For very
readable introductions to level index arithmetic see Clenshaw and Olver [213,
1984] and Turner [1033, 1991], and for more details see Clenshaw, Olver, and
Turner [214, 1989]. Level index arithmetic is somewhat controversial in that
there is disagreement about its advantages and disadvantages with respect to
floating point arithmetic; see Demmel [282, 1987]. A number system involving
levels has also been proposed by Matsui and Iri [736, 1981]; in their system,
Table 2.3. Test arithmetics.

    Hardware                Software                    |3 × (4/3 − 1) − 1|ᵃ
    Casio fx-140 (1979)                                 1 × 10^−9
    Casio fx-992VB (1990)                               1 × 10^−13
    HP 48G (1993)                                       1 × 10^−11
    Sharp EL-5020 (1994)                                0.0ᵇ
    486DX                   MATLAB 4.2 (1994)           2.2... × 10^−16
    486DX                   WATFOR-77ᶜ V3.0 (1988)      2.2... × 10^−16
    486DX                   FTN 90ᵈ (1993)              2.2... × 10^−16
    486DX                   MS-DOS QBasic 1.1           1.1... × 10^−19ᵉ

    a. Integers in the test expression are typed as real constants 3.0, etc., for the Fortran tests.
    b. 1 × 10^−9 if 4/3 is stored and recalled from memory.
    c. WATCOM Systems Inc.
    d. Salford Software/Numerical Algorithms Group, Version 1.2.
    e. 2.2... × 10^−16 if 4/3 is stored and recalled from a variable.

the number of bits allocated to the mantissa and exponent is allowed to vary
(within a fixed word size).
Other number systems include those of Swartzlander and Alexopolous [981,
1975], Matula and Kornerup [741, 1985], and Hamada [495, 1987]. For sum-
maries of alternatives to floating point arithmetic see the section “Alternatives
to Floating-Point Some Candidates” in [214, 1989], and Knuth [668, 1981,
Chap. 4].

2.8. Accuracy Tests


How can you test the accuracy of the floating point arithmetic on a computer
or pocket calculator? There is no easy way, though a few software packages are
available to help with the tasks in specific programming languages (see §25.6).
There are, however, a few quick and easy tests that may reveal weaknesses.
The following list is far from comprehensive and good performance on these
tests does not imply that an arithmetic is correct. Results from the tests are
given in Tables 2.4-2.5 for the selected floating point arithmetics described in
Table 2.3. Double precision was used for the compiled languages. The last
column of Table 2.3 gives an estimate of the unit roundoff (see Problem 2.14).
The estimate produced by QBasic indicates that the compiler used extended
precision in evaluating the estimate.

1. (Cody [221, 1982]) Evaluate sin(22) = −8.8513 0929 0403 8759 2169 × 10^−3 (shown correct to 21 digits). This is a difficult test for the range
Table 2.4. Sine test.

    Machine           sin(22)
    Exact             −8.8513 0929 0403 8759 × 10^−3
    Casio fx-140      −8.8513 62 × 10^−3
    Casio fx-992VB    −8.8513 0929 096 × 10^−3
    HP 48G            −8.8513 0929 040 × 10^−3
    Sharp EL-5020     −8.8513 0915 4 × 10^−3
    MATLAB 4.2        −8.8513 0929 0403 876 × 10^−3
    WATFOR-77         −8.8513 0929 0403 880 × 10^−3
    FTN 90            −8.8513 0929 0403 876 × 10^−3
    QBasic            −8.8513 0929 0403 876 × 10^−3

Table 2.5. Exponentiation test. No entry for last column means same value as previous column.

    Machine           2.5^125                          exp(125 log(2.5))
    Exact             5.5271 4787 5260 4446 × 10^49    5.5271 4787 5260 4446 × 10^49
    Casio fx-140      5.5271 477 × 10^49               5.5271 463 × 10^49
    Casio fx-992VB    5.5271 4787 526 × 10^49
    HP 48G            5.5271 4787 526 × 10^49          5.5271 4787 377 × 10^49
    Sharp EL-5020     5.5271 4787 3 × 10^49            5.5271 4796 2 × 10^49
    MATLAB 4.2        5.5271 4787 5260 445 × 10^49     5.5271 4787 5260 459 × 10^49
    WATFOR-77         5.5271 4787 5260 450 × 10^49     5.5271 4787 5260 460 × 10^49
    FTN 90            5.5271 4787 5260 445 × 10^49     5.5271 4787 5260 459 × 10^49
    QBasic            5.5271 4787 5260 444 × 10^49

reduction used in the sine evaluation (which brings the argument within the range [−π/2, π/2], and which necessarily uses an approximate value of π), since 22 is close to an integer multiple of π.

2. (Cody [221, 1982]) Evaluate 2.5^125 = 5.5271 4787 5260 4445 6025 × 10^49 (shown correct to 21 digits). One way to evaluate z = x^y is as z = exp(y log x). But to obtain z correct to within a few ulps it is not sufficient to compute exp and log correct to within a few ulps; in other words, the composition of two functions evaluated to high accuracy is not necessarily obtained to the same accuracy. To examine this particular case, write

    w := y log x,   z = exp(w).
If w → w + ∆w then z → z + ∆z, where z + ∆z = exp(w + ∆w) = exp(w) exp(∆w) ≈ exp(w)(1 + ∆w), so ∆z/z ≈ ∆w. In other words, the relative error of z depends on the absolute error of w and hence on the size of w. To obtain z correct to within a few ulps it is necessary to use extra precision in calculating the logarithm and exponential [228, 1980, Chap. 7].

3. (Karpinski [645, 1985]) A simple test for the presence of a guard digit
on a pocket calculator is to evaluate the expressions
9/27 * 3 - 1, 9/27 * 3 - 0.5 - 0.5,

which are given in a form that can be typed directly into most four-
function calculators. If the results are equal then a guard digit is present.
Otherwise there is probably no guard digit (we cannot be completely
sure from this simple test). To test for a guard digit on a computer it
is best to run one of the diagnostic codes described in §25.5.
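For convenience, here is a Python transcription of the three tests (an illustration only); with IEEE double precision arithmetic the first two results should agree with the reference values to roughly 15 or 16 significant digits, and the two guard digit expressions should be equal:

    import math

    # Test 1: sin(22); reference -8.8513 0929 0403 8759 2169e-3.
    print(f"{math.sin(22.0):.20e}")

    # Test 2: 2.5^125 two ways; reference 5.5271 4787 5260 4445 6025e49.
    print(f"{2.5 ** 125:.20e}")
    print(f"{math.exp(125.0 * math.log(2.5)):.20e}")

    # Test 3: guard digit expressions, plus the unit roundoff estimate of Table 2.3.
    print(9.0 / 27.0 * 3.0 - 1.0, 9.0 / 27.0 * 3.0 - 0.5 - 0.5)
    print(abs(3.0 * (4.0 / 3.0 - 1.0) - 1.0))   # about 2.2e-16 in IEEE double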

2.9. Notes and References


The classic reference on floating point arithmetic, and on all aspects of round-
ing error analysis, is Wilkinson’s Rounding Errors in Algebraic Processes
(REAP) [1088, 1963]. Wilkinson was uniquely qualified to write such a book,
for not only was he the leading expert in rounding error analysis, but he was
one of the architects and builders of the Automatic Computing Engine (ACE)
at the National Physical Laboratory [1082, 1954]. The Pilot (prototype) ACE
first operated in May 1950, and an engineered version was later sold commer-
cially as the DEUCE Computer by the English Electric Company. Wilkinson
and his colleagues were probably the first to write subroutines for floating
point arithmetic, and this enabled them to accumulate practical experience
of floating point arithmetic much earlier than anyone else [357, 1976], [1099,
1980].
In REAP, Wilkinson gives equal consideration to fixed point and floating
point arithmetic. In fixed point arithmetic, all numbers are constrained to lie
in a range such as [-1,1], as if the exponent were frozen in the floating point
representation (2.1). Preliminary analysis and the introduction of scale factors
during the computation is needed to keep numbers in the permitted range.
We consider only floating point arithmetic in this book. REAP, together with
Wilkinson’s second book, The Algebraic Eigenvalue Problem (AEP) [1089,
1965], has been immensely influential in the areas of floating point arithmetic
and rounding error analysis.
Wilkinson’s books were preceded by the paper Error Analysis of Floating-
Point Computation [1084, 1960], in which he presents the model (2.4) for
floating point arithmetic and applies the model to several algorithms for the
eigenvalue problem. This paper has hardly dated and is still well worth read-
ing.
Another classic book devoted entirely to floating point arithmetic is Ster-
benz’s Floating-Point Computation [938, 1974]. It contains a thorough treat-
ment of low-level details of floating point arithmetic, with particular reference
to IBM 360 and IBM 7090 machines. It also contains a good chapter on round-
ing error analysis and an interesting collection of exercises. R. W. Hamming
has said of this book, “Nobody should ever have to know that much about
floating-point arithmetic. But I’m afraid sometimes you might” [833, 1988].
Although Sterbenz’s book is now dated in some respects, it remains a useful
reference.
A third important reference on floating point arithmetic is Knuth’s Seminu-
merical Algorithms [668, 1981, §4.2], from his Art of Computer Programming
series. Knuth’s lucid presentation includes historical comments and challeng-
ing exercises (with solutions).
The first analysis of floating point arithmetic was given by Samelson and
Bauer [890, 1953]. Later in the same decade Carr [187, 1959] gave a detailed
discussion of error bounds for the basic arithmetic operations.
An up-to-date and very readable reference on floating point arithmetic
is the survey paper by Goldberg [457, 1991], which includes a detailed dis-
cussion of IEEE arithmetic. A less mathematical, more hardware-oriented
discussion can be found in the appendix “Computer Arithmetic” written by
Goldberg that appears in the book on computer architecture by Hennessy and
Patterson [515, 1990].
A fascinating historical perspective on the development of computer float-
ing point arithmetics, including background to the development of the IEEE
standard, can be found in the textbook by Patterson and Hennessy [822,
1994, §4.11].
The idea of representing floating point numbers in the form (2.1) is found,
for example, in the work of Forsythe [393, 1969], Matula [740, 1970], and
Dekker [275, 1971].
An alternative definition of fl(x) is the nearest y ∈ G satisfying |y| ≤ |x|. This operation is called chopping, and does not satisfy our definition of rounding. Chopped arithmetic is used in the IBM/370 floating point system.
The difference between chopping and rounding (to nearest) is highlighted
by a discrepancy in the index of the Vancouver Stock Exchange in the early
1980s [852, 1983]. The exchange established an index in January 1982, with
the initial value of 1000. By November 1983 the index had been hitting lows
in the 520s, despite the exchange apparently performing well. The index was
recorded to three decimal places and it was discovered that the computer
program calculating the index was chopping instead of rounding to produce
the final value. Since the index was recalculated thousands of times a day, each
time with a nonpositive final error, the bias introduced by chopping became
significant. Upon recalculation with rounding the index almost doubled!


When there is a tie in rounding, two possible strategies are to round to the number with an even last digit and to round to the one with an odd last digit. Both are stable forms of rounding in the sense that

    fl((((x + y) − y) + y) − y) = fl((x + y) − y),

as shown by Reiser and Knuth [869, 1975], [668, 1981, p. 222]. For other rules, such as round away from zero, repeated subtraction and addition of the same number can yield an increasing sequence, a phenomenon known as drift.
For bases 2 and 10 rounding to even is preferred to rounding to odd. After rounding to even a subsequent rounding to one less place does not involve a tie. Thus we have the rounding sequence 2.445, 2.44, 2.4 with round to even, but 2.445, 2.45, 2.5 with round to odd. For base 2, round to even causes
computations to produce integers more often [640, 1979] as a consequence of
producing a zero least significant bit. Rounding to even in the case of ties
seems to have first been suggested by Scarborough in the first edition (1930)
of [897, 1950].
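Both rounding sequences can be reproduced with Python’s decimal module (an illustration only; decimal has no round-to-odd mode, so round half away from zero is used for the second sequence, which gives the same results on this example):

    from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

    x = Decimal("2.445")

    # Round to even: 2.445 -> 2.44, then 2.44 -> 2.4 (no tie at the second step).
    a = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    print(a, a.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN))

    # Round half away from zero: 2.445 -> 2.45, then 2.45 -> 2.5 (a second tie).
    b = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    print(b, b.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))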
Predictions based on the growth in the size of mathematical models solved as the memory and speed of computers increase suggest that floating point arithmetic with unit roundoff u ≈ 10^−32 will be needed for some applications on future supercomputers [48, 1989].
The model (2.4) does not fully describe any floating point arithmetic. It is
merely a tool for error analysis-one that has been remarkably successful in
view of our current understanding of the numerical behaviour of algorithms.
There have been various attempts to devise formal models of floating point
arithmetic, by specifying sets of axioms in terms of which error analysis can be
performed. Some attempts are discussed in §25.7.4. No model yet proposed
has been truly successful. Priest [844, 1992] conjectures that the task of
“encapsulating all that we wish to know about floating point arithmetic in
a single set of axioms” is impossible, and he gives some motivation for this
conjecture.
Under the model (2.4), floating point arithmetic is not associative with respect to any of the four basic operations: fl(fl(a op b) op c) need not equal fl(a op fl(b op c)) for op = +, −, *, /. Nevertheless, floating point arithmetic enjoys some algebraic structure, and it is possible to carry out error analysis in this algebra. Fortunately, it was recognized by Wilkinson
and others in the 1950s that this laboured approach is unnecessarily compli-
cated, and that it is much better to work with the exact equations satisfied
by the computed quantities. As Parlett [821, 1990] notes, though, “There
have appeared a number of ponderous tomes that do manage to abstract the
computer’s numbers into a formal structure and burden us with more jargon.”
A draft proposal of IEEE Standard 754 is defined and described in [599,
1981]. That article, together with others in the same issue of the journal
Computer, provides a very readable description of IEEE arithmetic. In par-


ticular, an excellent discussion of gradual underflow is given by Coonen [243,
1981]. A draft proposal of IEEE Standard 854 is presented, with discussion,
in [225, 1984].
W. M. Kahan of the University of California at Berkeley received the
1989 ACM Turing Award for his contributions to computer architecture and
numerical analysis, and in particular for his work on IEEE floating point
arithmetic standards 754 and 854.
An interesting examination of the implications of the IEEE standard for
high-level languages such as Fortran is given by Fateman [365, 1982]. Topics
discussed include trap handling and how to exploit NaNs. For an overview of
hardware implementations of IEEE arithmetic, and software support for it,
see Cody [223, 1988].
Producing a fast and correct implementation of IEEE arithmetic is a dif-
ficult task. Correctness is especially important for a microprocessor (as op-
posed to a software) implementation, because of the logistical difficulties of
correcting errors when they are found. In late 1994, much publicity was gen-
erated by the discovery of a flaw in the floating point divide instruction of
Intel’s Pentium chip. Because of some missing entries in a lookup table on
the chip, the FPDIV instruction could give as few as four correct significant
decimal digits for double precision floating point arguments with certain spe-
cial bit patterns [916, 1994]. The flaw had been discovered by Intel in the
summer of 1994 during ongoing testing of the Pentium processor, but it had
not been publically announced. In October 1994, a mathematician doing re-
search into prime numbers independently discovered the flaw and reported it
to the user community. Largely because of the way in which Intel responded
to the discovery of the flaw, the story was reported in national newspapers
(e.g., the New York Times [727, 1994]) and generated voluminous discussion
on Internet newsgroups (notably comp.sys.intel). Intel corrected the bug
in 1994 and, several weeks after the bug was first reported, offered to replace
faulty chips. For a very readable account of the Pentium FPDIV bug story,
see Moler [772, 1995]. To emphasize that bugs in implementations of floating
point arithmetic are far from rare, we mention that the Calculator application
in Microsoft Windows 3.1 evaluates fl(2.01 - 2.00) = 0.0.
Computer chip designs can be tested in two main ways: by software sim-
ulations and by applying formal verification techniques. Formal verification
aims to prove mathematically that the chip design is correct, and this ap-
proach is now being used by Intel and other chip manufacturers [452, 1995].
The implementation of IEEE arithmetic for the Inmos T800 transputer in the
1980s was done with the help of formal methods. The IEEE standard was
translated into the set-theoretic specification language Z, and then Occam
procedures were written that were proved to adhere to the specifications. For
details, see Barrett [69, 1989] or, for a more informal overview, Shepherd and
Wilson [917, 1989].


The floating point operation op (op = +, −, *, or /) is monotonic if fl(a op b) ≤ fl(c op d) whenever a, b, c, and d are floating point numbers for which a op b ≤ c op d and neither fl(a op b) nor fl(c op d) overflows. IEEE arithmetic is monotonic, as is any correctly rounded arithmetic. Monotonic arithmetic is important in the bisection algorithm for finding the eigenvalues of a symmetric tridiagonal matrix; see Demmel, Dhillon, and Ren [289, 1994], who give rigorous correctness proofs of some bisection implementations
in floating point arithmetic. Ferguson and Brightman [371, 1991] derive con-
ditions that ensure that an approximation to a monotonic function preserves
the monotonicity on a set of floating point numbers.
On computers of the 1950s (fixed point) multiplication was slower than
(fixed point) addition by up to an order of magnitude [693, 1980, Apps. 2, 3].
For most modern computers it is a rule of thumb that a floating point addition
and multiplication take about the same amount of time, while a floating point
division is 2-10 times slower, and a square root operation (in hardware) is 1-2
times slower than a division.
Some computers have the ability to perform a floating point multiplication
followed by an addition or subtraction, x * y + z or x * y - z , as though it were a
single floating point operation. For example, the IBM RISC System/6000 has
a fused multiply-add (FMA) operation that forms x * y + z with just a single
rounding error, at the end, the multiplication and addition being performed
at twice the precision of the operands (and by overlapping additions and
multiplications the RS/6000 can perform a sequence of FMAs in one cycle
each) [596, 1993]. For a clever use of an FMA operation to achieve increased
accuracy in a computation, see Problem 2.25.
During the design of the IBM 7030, Sweeney [982, 1965] collected statistics
on the floating point additions carried out by selected application programs
on an IBM 704. He found that 11% of all instructions traced were floating
point additions. Details were recorded of the shifting needed to align floating
point numbers prior to addition, and the results were used in the design of
the shifter on the IBM 7030.
The word bit, meaning binary digit, first appeared in print in a 1948
paper of Claude E. Shannon, but the term was apparently coined by John
W. Tukey [1022, 1984]. The word byte, meaning a group of (usually eight)
bits, did not appear in print until 1959 [156, 1981].
The earliest reference we know for Theorem 2.5 is Sterbenz [938, 1974,
Thm. 4.3.1]. Theorem 2.4 is due to Ferguson [370, 1995], who proves a more
general version of the theorem that allows for trailing zero digits in x and y.
A variation in which the condition is 0 < y < x < y + β^e, where e = min{j : β^j > y}, is stated by Ziv [1132, 1991] and can be proved in a similar way.
For more on the choice of base, see Cody [227, 1973] and Kuki and Cody
[677, 1973]. Buchholz’s paper Fingers or Fists? [155, 1959] on binary versus
decimal representation of numbers on a computer deserves mention for its


clever title, though the content is only of historical interest.
The model (2.4) ignores the possibility of underflow and overflow. To take
underflow into account the model must be modified to

    fl(x op y) = (x op y)(1 + δ) + η,   op = +, −, *, /.   (2.8)

As before, |δ| < u. If underflow is gradual, as in IEEE arithmetic, then |η| ≤ λu, which is half the spacing between the subnormal numbers (λ = β^(emin−1) is the smallest positive normalized floating point number); if underflows are flushed to zero then |η| ≤ λ. Only one of δ and η is nonzero: δ if no underflow occurs, otherwise η. With gradual underflow the absolute error of an underflowed result is no greater than the smallest (bound for the) absolute error that arises from an operation fl(x op y) in which the arguments and result are normalized. For more details, and a thorough discussion of how error analysis of standard algorithms is affected by using the model (2.8), see the perceptive paper by Demmel [280, 1984]. Another relevant reference is Neumaier [789, 1985].
Neumaier [789, 1985].
Algorithms for evaluating elementary functions in IEEE arithmetic are
developed by Tang [987, 1989], [989, 1990], [991, 1992], Gal and Bachelis [412,
1991], and Ziv [1132, 1991]. Tang [990, 1991] gives a very readable description
of table lookup algorithms for evaluating elementary functions, which are used
in a number of current chips.
Algorithms for evaluating complex elementary functions that exploit ex-
ception handling and assume the availability of algorithms for the real elemen-
tary functions are presented by Hull, Fairgrieve, and Tang [592, 1994]. For
details of how elementary functions are evaluated on many of today’s pocket
calculators see Schelin [898, 1983].
An important problem not considered in this chapter is the conversion of
numbers between decimal and binary representations. These conversions are
needed whenever numbers are read into a computer or printed out. They tend
to be taken for granted, but if not done carefully they can lead to puzzling
behaviour, such as a number read in as 0.1 being printed out as 0.099. . .9.
To be precise, the problems of interest are (a) convert a number represented
in decimal notation to the best binary floating point representation of a given
precision, and (b) given a binary floating point number, print a correctly
rounded decimal representation, either to a given number of significant digits
or to the smallest number of significant digits that allows the number to be
re-read without loss of accuracy. Algorithms for solving these problems are
given by Clinger [220, 1990] and Steele and White [936, 1990]; Gay [431, 1990]
gives some improvements to the algorithms and C code implementing them.
Precise requirements for binary-decimal conversion are specified in the IEEE
arithmetic standard. A program for testing the correctness of binary-decimal
conversion routines is described by Paxson [823, 1991]. Early references on
base conversion are Goldberg [458, 1967] and Matula [739, 1968], [740, 1970].
It is interesting to note that, in Fortran or C, where the output format for
a “print” statement can be precisely specified, most compilers will, for an
(in)appropriate choice of format, print a decimal string that contains many
more significant digits than are determined by the floating point number whose
value is being represented.
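As a small illustration of this last point (our own, assuming IEEE double precision arithmetic and a MATLAB-like environment), requesting 25 significant digits when printing fl(0.1) reveals digits beyond the roughly 16 that the double precision format determines:

% Print fl(0.1) to more digits than double precision determines.
fprintf('%.25g\n', 0.1)   % displays 0.1000000000000000055511151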
Other authors who have analysed various aspects of floating (and fixed)
point arithmetic include Diamond [305, 1978], Urabe [1037, 1968], and Feld-
stein, Goodman, and co-authors [471, 1975], [368, 1982], [472, 1985], [369,
1986]. For a survey of computer arithmetic up until 1976 that includes a
number of references not given here, see Garner [420, 1976].

Problems
The exercise had warmed my blood, and
I was beginning to enjoy myself amazingly.
-JOHN BUCHAN, The Thirty-Nine Steps (1915)
2.1. How many normalized numbers and how many subnormal numbers are
there in the system F defined in (2.1) with emin < e < emax? What are the
figures for IEEE single and double precision (base 2)?
2.2. Prove Lemma 2.1.
2.3. In IEEE arithmetic how many double precision numbers are there be-
tween any two adjacent nonzero single precision numbers?
2.4. Prove Theorem 2.3.
2.5. Show that

    0.1 = Σ_{i=1}^∞ (2^(-4i) + 2^(-4i-1)),

and deduce that 0.1 has the base 2 representation 0.0001100 (repeating last 4
bits). Let x̂ = fl(0.1) be the rounded version of 0.1 obtained in binary IEEE
single precision arithmetic (u = 2^(-24)). Show that (x̂ - 0.1)/0.1 = u/4.
2.6. What is the largest integer p such that all integers in the interval [-p,p]
are exactly representable in IEEE double precision arithmetic? What is the
corresponding p for IEEE single precision arithmetic?
2.7. Which of the following statements is true in IEEE arithmetic, assuming
that a and b are normalized floating point numbers and that no exception
occurs in the stated operations?
1. fl(a op b) = fl(b op a), op = +,*.
2. fl(b - a) = -fl(a - b).
3. fl(a + a) = fl(2*a).
4. fl(0.5*a) = fl(a/2).
5. fl((a + b) + c) = fl(a + (b + c)).
6. a < fl((a + b)/2) < b, given that a < b.
2.8. Show that the inequalities a < fl((a + b)/2) < b, where a and b are
floating point numbers with a < b, can be violated in base 10 arithmetic.
Show that a < fl(a+(b-a)/2) < b in base β arithmetic, for any β, assuming
the use of a guard digit.
2.9. What is the result of the computation in IEEE double preci-
sion arithmetic, with and without double rounding from an extended format
with a 64-bit mantissa?
2.10. A theorem of Kahan [457, 1991, Thm. 7] says that if β = 2 and the
arithmetic rounds as specified in the IEEE standard, then for integers m and
n with |m| < 2^(t-1) and n = 2^i + 2^j (some i, j), fl((m/n)*n) = m. Thus,
for example, fl((1/3)*3) = 1 (even though fl(1/3) ≠ 1/3). The sequence of
allowable n begins 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 16, 17, 18, 20, so Kahan's theorem
covers many common cases. Test the theorem on your computer.
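A minimal MATLAB test harness for this problem (our own sketch; variable names are illustrative, and IEEE double precision with round to nearest is assumed) is:

% Check fl((m/n)*n) = m for n of the form 2^i + 2^j and modest |m|.
fails = 0;
for n = [1 2 3 4 5 6 8 9 10 12 16 17 18 20]
    for m = -1000:1000
        if (m/n)*n ~= m
            fails = fails + 1;
        end
    end
end
fails   % expected to be 0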
2.11. Investigate the leading significant digit distribution for numbers ob-
tained as follows.
1. kn , n = 0:1000 for k = 2 and 3.
2. n!, n = 1:1000.
3. The eigenvalues of a random symmetric matrix.
4. Physical constants from published tables.
5. From the front page of the London Times or the New York Times.
(Note that in writing a program for the first case you can form the powers of
2 or 3 in order, following each multiplication by a division by 10, as necessary,
to keep the result in the range [1,10]. Similarly for the second case.)
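For the first case, a MATLAB sketch along the lines of the parenthetical note (the variable names are our own) is:

% Leading significant digits of 2^n, n = 0:1000, avoiding overflow.
k = 2; count = zeros(1,9);
x = 1;
for n = 0:1000
    d = floor(x);                  % leading decimal digit of k^n
    count(d) = count(d) + 1;
    x = x*k;
    while x >= 10, x = x/10; end   % rescale to keep x in [1,10)
end
count   % frequencies of leading digits 1,...,9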
2.12. (Edelman [343, 1994]) Let x be a floating point number in IEEE double
precision arithmetic satisfying 1 < x < 2. Show that fl(x*(1/x)) is either 1
or 1 - ε/2, where ε = 2^(-52) (the machine epsilon).
2.13. (Edelman [343, 1994]) Consider IEEE double precision arithmetic. Find
the smallest positive integer j such that fl(x*(1/x)) ≠ 1, where x = 1 + jε,
with ε = 2^(-52) (the machine epsilon).
2.14. Kahan has stated that “an (over-)estimate of u can be obtained for
almost any machine by computing |3 × (4/3 - 1) - 1| using rounded floating-
point for every operation”. Test this estimate against u on any machines
available to you.
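In MATLAB, for instance, the estimate can be compared with the double precision unit roundoff as follows (a sketch assuming IEEE arithmetic with round to nearest):

u_est = abs(3*(4/3 - 1) - 1)   % Kahan's (over-)estimate of u
u = eps/2                      % unit roundoff for IEEE double precision, 2^(-53)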
2.15. What is 0^0 in IEEE arithmetic?


2.16. Evaluate these expressions in any IEEE arithmetic environment avail-
able to you. Are the values returned what you would expect? (None of the
results is specified by the IEEE standard.)

2.17. In the course of solving ax^2 - 2bx + c = 0 for x, the expression

    √(b^2 - ac)

must be computed. Can the true value of b^2 - ac be nonnegative and yet its
computed value be negative?
2.18. Can Theorem 2.4 be strengthened to say that fl(x - y) is computed
exactly whenever the exponents of x > 0 and y > 0 differ by at most 1?
2.19. Two requirements that we might ask of a routine for computing √x in
floating point arithmetic are that the identities √(x^2) = |x| and (√x)^2 = x be
satisfied. Which, if either, of these is a reasonable requirement?
2.20. Are there any floating point values of x and y (excepting values both
0, or so huge or tiny as to cause overflow or underflow) for which the computed
value of x/√(x^2 + y^2) exceeds 1?
2.21. (Kahan) A natural way to compute the maximum of two numbers x
and y is with the code

% max(x, y)
if x > y then
max = x
else
max = y
end

Does this code always produce the expected answer in IEEE arithmetic?
2.22. Prove that Kahan’s formula (2.7) computes the area of a triangle ac-
curately if a guard digit is used in subtraction. (Hint: you will need one
invocation of Theorem 2.5.)
2.23. (Kahan) Describe the result of the computation y = (x + x) - x on a
binary machine with a guard digit and one without a guard digit.
2.24. (Kahan) Let f(x) = (((x - 0.5) + x) - 0.5) + x. Show that if f is
evaluated as shown in single or double precision binary IEEE arithmetic then
f(x) ≠ 0 for all floating point numbers x.
2.25. (Kahan) Consider a machine that can perform a fused multiply-add
operation with just a single rounding error:

    fl(x + y*z) = (x + y*z)(1 + δ), |δ| < u.

Show that, on such a machine, the algorithm

    w = b*c
    e = w - b*c
    x = (a*d - w) + e

computes x = ad - bc with high relative accuracy.


2.26. Derive Newton’s method for solving f(x) = a - 1/x = 0. This method
was used on early computers (and is still used on some Cray computers, for
example) to implement reciprocation in terms of multiplication and thence
division as a/b = a * (1/b); see, e.g., [506, 1946].
2.27. Suppose we have an iterative algorithm for computing z = x/y. Derive
a convergence test that terminates the iteration (only) when full accuracy has
been achieved. Assume the use of IEEE arithmetic with gradual underflow
(use (2.8)).

Chapter 3
Basics

A method of inverting the problem of round-off error is proposed


which we plan to employ in other contexts and
which suggests that it may be unwise to
separate the estimation of round-off error
from that due to observation and truncation.
-WALLACE J. GIVENS, Numerical Computation of the
Characteristic Values of a Real Symmetric Matrix (1954)

The enjoyment of one’s tools is an essential ingredient of successful work.


-DONALD E. KNUTH, The Art of Computer Programming,
Volume 2, Seminumerical Algorithms (1981)

The subject of propagation of rounding error,


while of undisputed importance in numerical analysis,
is notorious for the difficulties which it presents when it is to be
taught in the classroom in such a manner that the student is
neither insulted by lack of mathematical content
nor bored by lack of transparence and clarity.
-PETER HENRICI, A Model for the Propagation
of Rounding Error in Floating Arithmetic (1980)

The two main classes of rounding error analysis are not,


as my audience might imagine, ‘backwards’ and ‘forwards’,
but rather ‘one’s own’ and ‘other people’s’.
One’s own is, of course, a model of lucidity;
that of others serves only to obscure the
essential simplicity of the matter in hand.
-J. H. WILKINSON, The State of the Art in Error Analysis (1985)


Having defined a model for floating point arithmetic in the last chapter, we
now apply the model to some basic matrix computations, beginning with inner
products. This first application is simple enough to permit a short analysis,
yet rich enough to illustrate the ideas of forward and backward error. It also
raises the thorny question of what is the best notation to use in an error
analysis. We introduce the “γ n ” notation, which we use widely, though not
exclusively, in the book. The inner product analysis leads immediately to
results for matrix-vector and matrix-matrix multiplication.
In the last two sections we determine a model for rounding errors in com-
plex arithmetic and derive some miscellaneous results of use in later chapters.

3.1. Inner and Outer Products


Consider the inner product s_n = x^T y, where x, y ∈ IR^n. Since the order of
evaluation of s_n = x1 y1 + . . . + xn yn affects the analysis (but not the final error
bounds), we will assume that the evaluation is from left to right. (The effect
of particular orderings is discussed in detail in Chapter 4, which considers
the special case of summation.) In the following analysis, and throughout the
book, a hat denotes a computed quantity.
Let si = x1 y1 + . . . + xi yi denote the ith partial sum. Using the standard
model (2.4), we have

(3.1)

where |δi | < u, i = 1:3. For our purposes it is not necessary to distinguish
between the different δi terms, so to simplify the expressions let us drop the
subscripts on the δi and write 1 + δi ≡ 1 ± δ. Then

The pattern is clear. Overall, we have

(3.2)

There are various ways to simplify this expression. A particularly elegant way
is to use the following result.

Lemma 3.1. If |δi| < u and pi = ±1 for i = 1:n, and nu < 1, then

    Π_{i=1}^n (1 + δi)^pi = 1 + θn,   where   |θn| < nu/(1 - nu) =: γn.

Proof. See Problem 3.1. □


The θn and γn notation will be used throughout this book. Whenever we
write γn there is an implicit assumption that nu < 1, which is true in virtu-
ally any circumstance that might arise with IEEE single or double precision
arithmetic.
Applying the lemma to (3.2) we obtain

(3.3)
This is a backward error result and may be interpreted as follows: the com-
puted inner product is the exact one for a perturbed set of data x 1,. . . , xn ,
y1(1 + θn ),y 2(1 + θ' n ),. . . , yn (1 + θ2) (alternatively, we could perturb the xi
and leave the yi alone). Each relative perturbation is certainly bounded by
γn = nu/(1 - nu), so the perturbations are tiny.
The result (3.3) applies to one particular order of evaluation. It is easy to
see that for any order of evaluation we have, using vector notation,

(3.4)
where |x| denotes the vector with elements |x i | and inequalities between vec-
tors (and, later, matrices) hold componentwise.
A forward error bound follows immediately from (3.4):

    |x^T y - fl(x^T y)| < γn |x|^T |y|. (3.5)

If y = x, so that we are forming a sum of squares x Tx, this result shows that
high relative accuracy is obtained. However, in general, high relative accuracy
is not guaranteed if |xT y| << |x| T |y|.
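The following MATLAB sketch (our own; it assumes IEEE single precision arithmetic is available via the single data type, with u = 2^(-24)) compares the actual error of a recursively evaluated single precision inner product with the bound (3.5):

n = 1000;
xs = single(randn(n,1)); ys = single(randn(n,1));
exact = double(xs)'*double(ys);     % reference value computed in double precision
s = single(0);
for i = 1:n
    s = s + xs(i)*ys(i);            % recursive evaluation in single precision
end
u = 2^(-24); gamma_n = n*u/(1 - n*u);
err = abs(double(s) - exact)
bound = gamma_n*double(abs(xs)'*abs(ys))   % right-hand side of (3.5)

Typically err is far smaller than bound, in line with the remarks in §2.6 about the pessimism of worst-case constants.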
It is easy to see that precisely the same results (3.3)-(3.5) hold when we
use the no-guard-digit rounding error model (2.6). For example, expression
(3.1) becomes ŝ2 = x1 y1(1 + δ1)(1 + δ3) + x2 y2(1 + δ2)(1 + δ4), where δ4 has
replaced a second occurrence of δ3, but this has no effect on the error bounds.
It is worth emphasizing that the constants in the bounds above can be
reduced by focusing on particular implementations of an inner product. For
example, if n = 2m and we compute
s1 = x(1:m)^T y(1:m)
s2 = x(m+1:n)^T y(m+1:n)
sn = s1 + s2

then |x^T y - ŝn| < γ_{n/2+1} |x|^T |y|. By accumulating the inner product in two
pieces we have almost halved the error bound. This idea can be gener-
alized by breaking the inner product into k pieces, with each mini inner
product of length n/k being evaluated separately and the results summed.
The error bound is now γ_{n/k+k-1} |x|^T |y|, which achieves its minimal value of
γ_{2√n-1} |x|^T |y| for k = √n (or, rather, we should take k to be the nearest
integer to √n). But it is possible to do even better by using pairwise sum-
mation of the products xi yi (this method is described in §4.1). With pairwise
summation, the error bound becomes

    |x^T y - fl(x^T y)| < γ_{⌈log2 n⌉+1} |x|^T |y|.

Since many of the error analyses in this book are built upon the error analysis
of inner products, it follows that the constants in these higher level bounds
can also be reduced by using one of these nonstandard inner product imple-
mentations. The main significance of this observation is that we should not
attach too much significance to the precise values of the constants in error
bounds.
Inner products are amenable to being calculated in extended precision.
If the working precision involves a t-digit mantissa then the product of two
floating point numbers has a mantissa of 2t - 1 or 2t digits and so can be
represented exactly with a 2t-digit mantissa. Some computers always form the
2t-digit product before rounding to t digits. thus allowing an inner product to
be accumulated at 2t-digit precision at little or no extra cost, prior to a final
rounding.
The extended precision computation can be expressed as fl(fle(x^T y)),
where fle denotes computations with unit roundoff ue (ue < u). Defining
ŝn = fle(x^T y), the analysis above holds with u replaced by ue in (3.3) (and
with the subscripts on the θi , reduced by 1 if the multiplications are done
exactly). For the final rounding we have

and so, overall,

Hence, as long as nue |x|^T |y| < u|x^T y|, the computed inner product is about
as good as the rounded exact inner product. The effect of using extended
precision inner products in an algorithm is typically to enable a factor n to
be removed from the overall error bound.
Because extended precision inner product calculations are machine depen-
dent it is difficult or impossible to write portable programs that use them.
Most modern numerical codes (for example those in EISPACK, LINPACK,
and LAPACK) do not use extended precision inner products. One process in
which these more accurate products are needed is the traditional formulation
of iterative refinement, in which the aim is to improve the accuracy of the
computed solution to a linear system (see Chapter 11).
We have seen that computation of an inner product is a backward stable
process. What can be said for an outer product A = xy^T, where x, y ∈ IR^n?
The analysis is easy. We have âij = xi yj (1 + δij), |δij| < u, so

    Â = xy^T + ∆, |∆| < u|xy^T|. (3.6)

This is a satisfying result, but the computation is not backward stable. In


fact, Â = (x + ∆x)(y + ∆y)T does not hold for any ∆x and ∆y (let alone a
small ∆x and ∆y) because A is not in general a rank 1 matrix.
This distinction between inner and outer products illustrates a general
principle: a numerical process is more likely to be backward stable when the
number of outputs is small compared with the number of inputs, so that there
is an abundance of data onto which to “throw the backward error”. An inner
product has the minimum number of outputs for its 2n scalar inputs, and
an outer product has the maximum number (among standard linear algebra
operations).

3.2. The Purpose of Rounding Error Analysis


Before embarking on further error analyses, it is worthwhile to consider what
a rounding error analysis is designed to achieve. The purpose is to show the
existence of an a priori bound for some appropriate measure of the effects
of rounding errors on an algorithm. Whether a bound exists is the most
important question. Ideally, the bound is small for all choices of problem
data. If not, it should reveal features of the algorithm that characterize any
potential instability, and thereby suggest how the instability can be cured
or avoided. For some unstable algorithms, however, there is no useful error
bound. (For example, no bound is known for the loss of orthogonality due to
rounding error in the classical Gram-Schmidt method; see §18.7.)
The constant terms in an error bound (those depending only on the prob-
lem dimensions) are the least important parts of it. As discussed in §2.6, the
constants usually cause the bound to overestimate the actual error by orders
of magnitude. It is not worth spending much effort to minimize constants
because the achievable improvements are usually insignificant.
It is worth spending effort, though, to put error bounds in a concise, easily
interpreted form. Part of the secret is in the choice of notation, which we
discuss in §3.4, including the question of what symbols to choose for variables
(see the discussion in Higham [554, 1993, §3.5]).
If sharp error estimates or bounds are desired they should be computed
a posteriori, so that the actual rounding errors that occur can be taken into
account. One approach is to use running error analysis, described in the next
section. Other possibilities are to compute the backward error explicitly, as
can be done for linear equation and least squares problems (see §§7.1, 7.2, and
19.7), or to apply iterative refinement to obtain a correction that approximates
the forward error (see Chapter 11).

3.3. Running Error Analysis


The forward error bound (3.5) is an a priori bound that does not depend on
the actual rounding errors committed. We can derive a sharper, a posteri-
ori bound by reworking the analysis. The inner product evaluation may be
expressed as

s0 = 0
for i = 1:n
si = si-1 + xi yi
end

Write the computed partial sums as and let We


have, using (2.5),

Similarly, or

Hence ei = ei-1 - which gives

Since e0 = 0, we have |e n| < uµ n , where

µ i = µ i-1 + µ 0 = 0.

Algorithm 3.2. Given x, y ∈ IR^n this algorithm computes s = fl(x^T y) and
a number µ such that |s - x^T y| < µ.

s = 0
µ = 0
for i = 1:n
    z = xi yi
    s = s + z
    µ = µ + |s| + |z|
end
µ = µ*u

This type of computation, where an error bound is computed concurrently


with the solution, is called running error analysis. The underlying idea is
simple: we use the modified form (2.5) of the standard rounding error model
to write
|x op y - fl(x op y)| < u|fl(x op y)|,
which gives a bound for the error in x op y that is easily computed, since
fl(x op y) is stored on the computer. Key features of a running error analysis
are that few inequalities are involved in the derivation of the bound and that
the actual computed intermediate quantities are used, enabling advantage
to be taken of cancellation and zero operands. A running error bound is,
therefore, usually smaller than an a priori one.
There are, of course, rounding errors in the computation of the running
error bound, but their effect is negligible for nu << 1 (we do not need many
correct significant digits in an error bound).
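For concreteness, a MATLAB translation of Algorithm 3.2 (our own function name; u = eps/2 is the unit roundoff for IEEE double precision) is:

function [s, mu] = inner_run_err(x, y)
%INNER_RUN_ERR  Inner product with a running error bound (Algorithm 3.2).
%   Returns s = fl(x'*y) and mu such that |s - x'*y| <= mu, ignoring the
%   (negligible) rounding errors committed in forming mu itself.
u = eps/2;
s = 0; mu = 0;
for i = 1:length(x)
    z = x(i)*y(i);
    s = s + z;
    mu = mu + abs(s) + abs(z);
end
mu = mu*u;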
Running error analysis is a somewhat neglected practice nowadays, but it
was widely used by Wilkinson in the early years of computing. It is applicable
to almost any numerical algorithm. Wilkinson explains [1101, 1986]

When doing running error analysis on the ACE at no time did I


write down these expressions. I merely took an existing program
(without any error analysis) and modified it as follows. As each
arithmetic operation was performed I added the absolute value of
the computed result (or of the dividend) into the accumulating
error bound.

For more on the derivation of running error bounds see Wilkinson [1100, 1985]
or [1101, 1986]. A running error analysis for Horner’s method is given in §5.1.

3.4. Notation for Error Analysis


Another way to write (3.5) is

|x^T y - fl(x^T y)| < nu|x|^T|y| + O(u^2). (3.7)


In general, which form of bound is preferable depends on the context. The
use of first-order bounds such as (3.7) can simplify the algebra in an analysis.
But there can be some doubt as to the size of the constant term concealed by
the big-oh notation. Furthermore, in vector inequalities an O(u^2) term hides
the structure of the vector it is bounding and so can make interpretation of
the result difficult; for example, the inequality |x - y| < nu|x| + O(u^2) does
not imply that y approximates every element of x with the same relative error
(indeed the relative error could be infinite when xi = 0, as far as we can tell
from the bound).
In more complicated analyses based on Lemma 3.1 it is necessary to ma-
nipulate the 1 + θk and γk terms. The next, lemma provides the necessary
rules.

Lemma 3.3. For any positive integer k let θ k denote a quantity bounded
according to |θ k| < γk = ku/(1 - ku). The following relations hold:

Proof. See Problem 3.4.


Concerning the second rule in Lemma 3.3, we certainly have

but if we are given only the expression (1 + θ k )/(1 + θj ) and the bounds for
θ k and θj , we cannot do better than to replace it by θ k+2j for j > k.
Another style of writing bounds is made possible by the following lemma.

Lemma 3.4. If |δi| < u for i = 1:n and nu < 0.01, then

where |η n | < 1.01nu.



Proof. We have

Since 1 + x < e^x for x > 0, we have (1 + u)^n < exp(nu), and so

Note that this lemma is slightly stronger than the corresponding bound we
can obtain from Lemma 3.1: |θ n | < nu/(1 - nu) < nu/0.99 = 1.0101. . . nu.
Lemma 3.4 enables us to derive, as an alternative to (3.5),

|x^T y - fl(x^T y)| < 1.01nu |x|^T |y|. (3.8)


A convenient device for keeping track of powers of 1 + δ terms was intro-
duced by Stewart [941, 1973, App. 3]. His relative error counter < k > denotes
a product

(3.9)

The counters can be manipulated using the rules

<j><k> = <j + k>,

At the end of an analysis it is necessary to bound |<k > - 1|, for which any
of the techniques described above can be used.
Wilkinson explained in [1100, 1985] that he used a similar notation in his
research, writing for a product of r factors 1 + δi with |δi | < u. He also
derived results for specific values of n, before treating the general case-a
useful trick of the trade!
An alternative notation to fl(·) to denote the rounded value of a number
or the computed value of an expression is [·], suggested by Kahan. Thus, we
would write [a + [b * c]] instead of fl(a + fl(b * c)).
A completely different notation for error analysis has been proposed by
Olver [807, 1978], and subsequently used by him and several other authors.
For scalars x and y of the same sign, Olver defines the relative precision rp as
follows:
y ≈ x; rp(a) means that y = e^δ x, |δ| < a.
Since e^δ = 1 + δ + O(δ^2), this definition implies that the relative error in x as
an approximation to y (or vice versa) is at most a + O(a^2). But, unlike the
usual definition of relative error, the relative precision possesses the properties
of

symmetry: y ≈ x; rp(a) ⟺ x ≈ y; rp(a),

additivity: y ≈ x; rp(a) and z ≈ y; rp(β) ⟹ z ≈ x; rp(a + β).

Proponents of relative precision claim that the symmetry and additivity prop-
erties make it easier to work with than the relative error.
Pryce [845, 1981] gives an excellent appraisal of relative precision, with
many examples. He uses the additional notation 1(δ) to mean a number θ
with θ ≈ 1; rp(δ). The 1(δ) notation is the analogue for relative precision of
Stewart’s <k> counter for relative error. In later papers, Pryce extends the
rp notation to vector and matrices and shows how it can be used in the error
analysis of some basic matrix computations [846, 1984], [847, 1985].
Relative precision has not achieved wide use. The important thing for an
error analyst is to settle on a comfortable notation that does not hinder the
thinking process. It does not really matter which of the notations described
in this section is used, as long as the final result is informative and expressed
in a readable form.

3.5. Matrix Multiplication


Given error analysis for inner products it is straightforward to analyse matrix-
vector and matrix-matrix products. Let A ∈ IR^(m×n), x ∈ IR^n and y = Ax.
The vector y can be formed as m inner products, yi = ai^T x, i = 1:m, where
ai^T is the ith row of A. From (3.4) we have

This gives the backward error result

    ŷ = (A + ∆A)x, |∆A| < γn |A|, (3.10)

which implies the forward error bound

    |y - ŷ| < γn |A||x|. (3.11)

Normwise bounds readily follow (see Chapter 6 for norm definitions): for
example,

This inner product formation of y can be expressed algorithmically as

% Sdot or inner product form.


y(1:m) = 0
for i = 1:m
    for j = 1:n
        y(i) = y(i) + a(i,j)x(j)
    end
end

The two loops can be interchanged to give

% Saxpy form.
y(1:m) = 0
for j = 1:n
    for i = 1:m
        y(i) = y(i) + a(i,j)x(j)
    end
end

The terms “sdot” and “saxpy” come from the BLAS (see §D.1). Sdot stands
for (single precision) dot product, and saxpy for (single precision) a times
x plus y. The question of interest is whether (3.10) and (3.11) hold for the
saxpy form. They do: the saxpy algorithm still forms the inner products
but instead of forming each one in turn it evaluates them all “in parallel”, a
term at a time. The key observation is that exactly the same operations are
performed, and hence exactly the same rounding errors are committed-the
only difference is the order in which the rounding errors are created.
This “rounding error equivalence” of algorithms that are mathematically
identical but algorithmically different is one that occurs frequently in matrix
computations. The equivalence is not always as easy to see as it is for matrix-
vector products.
Now consider matrix multiplication: C = AB, where A ∈ IR^(m×n) and
B ∈ IR^(n×p). Matrix multiplication is a triple loop procedure, with six possible
loop orderings, one of which is

C(1:m,1:p) = 0
for i = 1:m
    for j = 1:p
        for k = 1:n
            C(i,j) = C(i,j) + A(i,k)B(k,j)
        end
    end
end

As for the matrix-vector product, all six versions commit the same rounding
errors, so it suffices to consider any one of them. The “jik” and “jki” orderings
both compute C a column at a time: c j = Ab j, where cj = C(:,j) and
b j = B(:,j). From (3.10),

Hence the jth computed column of C has a small backward error: it is the
exact jth column for slightly perturbed data. The same cannot be said for
Ĉ as a whole (see Problem 3.5 for a possibly large backward error bound).
However, we have the forward error bound

    |C - Ĉ| < γn |A||B|, (3.12)

and the corresponding normwise bounds include

The bound (3.12) falls short of the ideal bound |C - Ĉ| < γn |C|, which says
that each component of C is computed with high relative accuracy. Never-
theless (3.12) is the best bound we can expect, because it reflects the sen-
sitivity of the product to componentwise relative perturbations in the data:
for any i and j we can find a perturbation ∆A with |∆A| < u|A| such that
|(A+∆A)B-AB|ij =u(|A||B|)ij (similarly for perturbations in B).
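As an illustration (our own sketch, assuming IEEE single and double precision are available as MATLAB's single and double types), the componentwise bound (3.12) can be checked numerically for a single precision product:

m = 30; n = 40; p = 20;
A = single(randn(m,n)); B = single(randn(n,p));
C = A*B;                              % product computed in single precision
Cref = double(A)*double(B);           % reference product in double precision
u = 2^(-24); gamma_n = n*u/(1 - n*u);
err = abs(double(C) - Cref);
bound = gamma_n*double(abs(A))*double(abs(B));   % right-hand side of (3.12)
all(err(:) <= bound(:))               % expected to be 1 (true)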

3.6. Complex Arithmetic


To carry out error analysis of algorithms in complex arithmetic we need a
model for the basic arithmetic operations. Since complex arithmetic must be
implemented using real arithmetic, the complex model is a consequence of the
corresponding real one. We will assume that for complex numbers x = a + ib
and y = c + id we compute

x ± y = a ± c + i(b ± d), (3.13a)

xy = ac - bd + i(ad + bc), (3.13b)

x/y = (ac + bd)/(c^2 + d^2) + i(bc - ad)/(c^2 + d^2). (3.13c)

Lemma 3.5. For x, y ∈ C the basic arithmetic operations computed according
to (3.13) under the standard model (2.4) satisfy

Proof. Throughout the proof, δi denotes a number bounded by |δi | < u.


Addition/subtraction:

so

as required.
Multiplication:

where

as required.
Division:

Then
where, using Lemma 3.3,

Using the analogous formula for the error in fl(Imx / y ) ,

which completes the proof.


It is worth stressing that δ in Lemma 3.5 is a complex number, so we
cannot conclude from the lemma that the real and imaginary parts of fl( x op y)
are obtained to high relative accuracy---only that they are obtained to high
accuracy relative to |x op y|.
As explained in §25.8, the formula (3.13c) is not recommended for practical
use since it is susceptible to overflow. For the alternative formula (25.1), which
avoids overflow, similar analysis to that in the proof of Lemma 3.5 shows that

Bounds for the rounding errors in the basic complex arithmetic operations
are rarely given in the literature. Indeed, virtually all published error analyses
in matrix computations are for real arithmetic. However, because the bounds
of Lemma 3.5 are of the same form as for the standard model (2.4) for real
arithmetic, most results for real arithmetic (including virtually all those in
this book) are valid for complex arithmetic, provided that the constants are
increased appropriately.

3.7. Miscellany
In this section we give some miscellaneous results that will be needed in later
chapters. The first two results provide convenient ways to bound the effect
of perturbations in a matrix product. The first result uses norms and the
second, components.

Lemma 3.6. If Xj + ∆Xj ∈ IR^(n×n) satisfies ||∆Xj|| < δj ||Xj|| for all j for a
consistent norm, then

Proof. The proof is a straightforward induction, which we leave as an


exercise (Problem 3.10).
A componentwise result is entirely analogous.

Lemma 3.7. If Xj + ∆Xj ∈ IR^(n×n) satisfies |∆Xj| < δj |Xj| for all j, then

The final result describes the computation of the “rank 1 update” y =


(I - ab^T)x, which is an operation arising in various algorithms, including the
Gram-Schmidt method and Householder QR factorization.

Lemma 3.8. Let a, b, x ∈ IR^n and let y = (I - ab^T)x be computed as ŷ =
fl(x - a(b^T x)). Then ŷ = y + ∆y, where

so that

Proof. Consider first the computation of w = a(b^T x). We have

where

Finally, ŷ = fl(x - ŵ) satisfies

and

Hence ŷ = y + ∆y, where



3.8. Error Analysis Demystified


The principles underlying an error analysis can easily be obscured by the
details. It is therefore instructive to examine the basic mechanism of forward
and backward error analyses. We outline a general framework that reveals
the essential simplicity.
Consider the problem of computing z = f(a), where f : IR^n → IR^m.
Any algorithm for computing z can be expressed as follows. Let x1 = a and
x_{k+1} = g_k(x_k), k = 1:p, where

The kth stage of the algorithm represents a single floating point operation
and xk contains the original data together with all the intermediate quantities
computed so far. Finally, z = where is comprised of a subset of the
columns of the identity matrix (so that each zi is a component of xp+1). In
floating point arithmetic we have

where ∆x_{k+1} represents the rounding errors on the kth stage and should be
easy to bound. We assume that the functions g k are continuously differentiable
and denote the Jacobian of gk at a by Jk. Then, to first order,

The pattern is clear: for the final we have

In a forward error analysis we bound f(a) - ẑ, which requires bounds for


(products of) the Jacobians Jk. In a backward error analysis we write, again
to first order,

where Jf is the Jacobian of f. So we need to solve, for ∆a,

In most matrix problems there are fewer outputs than inputs (m < n), so
this is an underdetermined system. For a normwise backward error analysis
we want a solution of minimum norm. For a componentwise backward error
analysis, in which we may want (for example) to minimize subject to | ∆ a| <
we can write

and then we want the solution c of minimal m-norm.


The conclusions are that forward error analysis corresponds to bounding
derivatives and that backward error analysis corresponds to solving a large
underdetermined linear system for a solution of minimal norm. In principle,
therefore, error analysis is straightforward! Complicating factors in practice
are that the Jacobians Jk may be difficult to obtain explicitly, that an error
bound has to be expressed in a form that can easily be interpreted, and that
we may want to keep track of higher-order terms.

3.9. Other Approaches


In this book we do not describe all possible approaches to error analysis. Some
others are mentioned in this section.
Linearized rounding error bounds can be developed by applying equations
that describe, to first order, the propagation of absolute or relative errors in
the elementary operations +,-,*,/. The basics of this approach are given in
many textbooks (see, for example, Dahlquist and Björck [262, 1974, §2.2] or
Stoer and Bulirsch [955, 1980, §1.3]), but for a thorough treatment see Stum-
mel [963, 1980], [964, 1981]. Ziv [1133, 1995] shows that linearized bounds
can be turned into true bounds by increasing them by a factor that depends
on the algorithm.
Rounding error analysis can be phrased in terms of graphs. This appears
to have been first suggested by McCracken and Dorn [743, 1964], who use
“process graphs” to represent a numerical computation and thereby to analyse
the propagation of rounding errors. Subsequent more detailed treatments
include those of Bauer [82, 1974], Miller [758, 1976], and Yalamov [1117, 1994].
The work on graphs falls under the heading of automatic error analysis (for
more on which see Chapter 24) because processing of the large graphs required
to represent practical computations is impractical by hand. Linnainmaa [705,
1976] shows how to compute the Taylor series expansion of the forward error
in an algorithm in terms of the individual rounding errors, and he presents a
graph framework for the computation.
Some authors have taken a computational complexity approach to error
analysis, by aiming to minimize the number of rounding error terms in a
forward error bound, perhaps by rearranging a computation. Because this
approach ignores the possibility of cancellation of rounding errors, the results
need to be interpreted with care. See Aggarwal and Burgmeier [6, 1979] and
Tsao [1024, 1983].

3.10. Notes and References


The use of Lemma 3.1 for rounding error analysis appears to originate with
the original 1972 German edition of a book by Stoer and Bulirsch [955, 1980].
The lemma is also used, with p i = 1, by Shampine and Allen [911, 1973,
p. 18].
Lemma 3.4 is given by Forsythe and Moler [396, 1967, p. 92]. Wilkinson
made frequent use of a slightly different version of Lemma 3.4 in which the
assumption is nu < 0.1 and the bound for |ηn | is 1.06nu (see, e.g., [1089,
1965, p. 113]).
A straightforward notation for rounding errors that is subsumed by the
notation described in this chapter is suggested by Scherer and Zeller [899,
1980].
Ziv [1131, 1982] proposes the relative error measure

for vectors x and y and explains some of its favourable properties for error
analysis.
Wilkinson [1089, 1965, p. 447] gives error bounds for complex arithmetic;
Olver [808, 1983] does the same in the relative precision framework. Dem-
mel [280, 1984] gives error bounds that extend those in Lemma 3.5 by taking
into account the possibility of underflow.
Henrici [521, 1980] gives a brief, easy to read introduction to the use of
the model (2.4) for analysing the propagation of rounding errors in a general
algorithm. He uses a set notation that is another possible notation to add to
those in §3.4.
The perspective on error analysis in §3.8 was suggested by J. W. Demmel.

Problems
3.1. Prove Lemma 3.1.
3.2. (Kielbasinski and Schwetlick [658, 1988], [659, 1992]) Show that if pi ≡ 1
in Lemma 3.1 then the stronger bound |θn| < nu/(1 - ½nu) holds for nu < 2.

3.3. One algorithm for evaluating a continued fraction

is

qn+1 = an+1
for k = n:-1:0
    qk = ak + bk/qk+1
end

Derive a running error bound for this algorithm.


3.4. Prove Lemma 3.3.
3.5. (Backward error result for matrix multiplication.) Let A ∈ IR^(n×n) and
B ∈ IR^(n×n) both be nonsingular. Show that fl(AB) = (A + ∆A)B, where
|∆A| < γn |A||B||B^(-1)|, and derive a corresponding bound in which B is per-
turbed.
3.6. (Backward error definition for matrix multiplication.) Let A ∈ IR^(m×n)
and B ∈ IR^(n×p) be of full rank and suppose C ≈ AB. Define the component-
wise backward error

where E and F have nonnegative entries. Show that

where R = C - AB and G = EF. Explain why the definition of w makes


sense only when A and B have full rank. Define a mixed backward/forward
error applicable to the general case.
3.7. Give analogues of the backward error results (3.4) and (3.10) for complex
x, y, and A.
3.8. Let A1, . . . , Ak ∈ IR^(n×n). Show that

3.9. Which is the more accurate way to compute x^2 - y^2: as x^2 - y^2 or as
(x + y)(x - y)? (Assume the use of a guard digit. Note that this computation
arises when squaring a complex number.)
3.10. Prove Lemma 3.6.


3.11. (Kahan [629, 1980]) Consider this MATLAB function, which returns the
absolute value of its first argument x ∈ IR^n:
function z = absolute(x, m)
y = x.^2;
for i = 1:m
    y = sqrt(y);
end
z = y;
for i = 1:m-1
    z = z.^2;
end
Here is some output from a 486DX workstation that uses IEEE standard
double precision arithmetic:
>> x = [.25 .5 .75 1.25 1.5 2]; z = absolute(x,50); [x; z]
ans =
0.2500 0.5000 0.7500 1.2500 1.5000 2.0000
0.2528 0.5028 0.7788 1.2840 1.4550 2.1170
Give an error analysis to explain the results.
The same machine produced this output:
>> x = [.25 .5 .75 1.25 1.5 2]; z = absolute(x,75); [x; z]
ans =
0.2500 0.5000 0.7500 1.2500 1.5000 2.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
But a Sun SPARCstation, which also uses IEEE standard double precision
arithmetic, produced
ans =
0.2500 0.5000 0.7500 1.2500 1.5000 2.0000
0 0 0 1.0000 1.0000 1.0000
Explain these results and why they differ (cf. §1.12.2).
3.12. Consider the quadrature rule

where the weights wi and nodes xi are assumed to be floating point numbers.
Assuming that the sum is evaluated in left-to-right order and that

obtain and interpret a bound for where



Chapter 4
Summation

I do hate sums.
There is no greater mistake than to call arithmetic an exact science.
There are . . . hidden laws of Number
which it requires a mind like mine to perceive.
For instance, if you add a sum from the bottom up,
and then again from the top down,
the result is always different.
-MRS. LA TOUCHE10

Joseph Fourier introduced this delimited Σ-notation in 1820,


and it soon took the mathematical world by storm.
-RONALD L. GRAHAM, DONALD E. KNUTH, and
OREN PATASHNIK, Concrete Mathematics (1989)

One of the major difficulties in a practical [error] analysis


is that of description.
An ounce of analysis follows a pound of preparation.
-BERESFORD N. PARLETT, Matrix Eigenvalue Problems (1965)

10. Quoted in Mathematical Gazette [730, 1924].


Sums of floating point numbers are ubiquitous in scientific computing. They


occur when evaluating inner products, means, variances, norms, and all kinds
of nonlinear functions. Although at first sight summation might appear to
offer little scope for algorithmic ingenuity, the usual “recursive summation”
(with various orderings) is just one of a variety of possible techniques. We
describe several summation methods and their error analyses in this chap-
ter. No one method is uniformly more accurate than the others, but some
guidelines can be given on the choice of method in particular cases.

4.1. Summation Methods


In most circumstances in scientific computing we would naturally translate a
sum into code of the form

s = 0
for i = 1:n
    s = s + xi
end

This is known as recursive summation. Since the individual rounding errors


depend on the operands being summed, the accuracy of the computed sum
varies with the ordering of the xi . (Hence Mrs. La Touche, quoted at the
beginning of the chapter, was correct if we interpret her remarks as applying
to floating point arithmetic.) Two interesting choices of ordering are the
increasing order |x1| < | x 2 | < . . . < |x n |, and the decreasing order |x 1| >
|x2| > . . . > |xn|.
Another method is pairwise summation (also known as cascade summa-
tion, or fan-in summation), in which the xi are summed in pairs according
to

and this pairwise summation process is repeated recursively on the yi, i =


1:⌈(n + 1)/2⌉. The sum is obtained in ⌈log2 n⌉ stages. For n = 6, for example,
pairwise summation forms

s6 = ((x1 + x2) + (x3 + x4)) + (x5 + x6).
Pairwise summation is attractive for parallel computing, because each of the
⌈log2 n⌉ stages can be done in parallel [573, 1988, §5.2.2].
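A recursive MATLAB sketch of pairwise summation (our own function name), written top down by splitting the sum in half, which gives the same ⌈log2 n⌉ depth, is:

function s = pairwise_sum(x)
%PAIRWISE_SUM  Sum the elements of the vector x by pairwise summation.
n = length(x);
if n == 1
    s = x(1);
else
    m = floor(n/2);
    s = pairwise_sum(x(1:m)) + pairwise_sum(x(m+1:n));
end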
A third summation method is the insertion method. First, the xi are
sorted by order of increasing magnitude (alternatively, some other ordering
could be used). Then x1 + x2 is formed, and the sum is inserted into the
l i s t x2,. . . ,x n , maintaining the increasing order. The process is repeated
recursively until the final sum is obtained. In particular cases the insertion
method reduces to one of the other two. For example, if xi = 2^(i-1), the
insertion method is equivalent to recursive summation, since the insertion is
always to the bottom of the list:

    1, 2, 4, 8  →  3, 4, 8  →  7, 8  →  15.

On the other hand, if 1 < x1 < x2 < ... < xn < 2, every insertion is to the
end of the list, and the method is equivalent to pairwise summation if n is a
power of 2; for example, if 0 < < 1/2,

To choose between these methods we need error analysis, which we develop
in the next section.

4.2. Error Analysis


Error analysis can be done individually for the recursive, pairwise and inser-
tion summation methods, but it is more profitable to recognize that each is a
special case of a general algorithm and to analyse that algorithm.

Algorithm 4.1. Given numbers x1, . . . , xn this algorithm computes Sn = Σ_{i=1}^n xi.

Let S = {x1, . . . , xn}.
repeat while S contains more than one element
Remove two numbers x and y from S,
and add their sum x + y to S.
end
Assign the remaining element of S to S n .

Note that since there are n numbers to be added and hence n - 1 additions
to be performed, there must be precisely n - 1 executions of the repeat loop.
First, let us check that the previous methods are special cases of Algo-
rithm 4.1. Recursive summation (with any desired ordering) is obtained by
taking x at each stage to be the sum computed on the previous stage of the
algorithm. Pairwise summation is obtained by [log2 n] groups of executions
of the repeat loop, in each group of which the members of S are broken into
pairs, each of which is summed. Finally, the insertion method is, by definition,
a special case of Algorithm 4.1.
Now for the error analysis. Express the ith execution of the repeat loop
as Ti = xi1 + yil. The computed sums satisfy (using (2.5))

(4.1)
The local error introduced in forming T̂i is δi T̂i. The overall error is the sum
of the local errors (since summation is a linear process), so overall we have

(4.2)

The smallest possible error bound is therefore

    u Σ_{i=1}^{n-1} |T̂i|. (4.3)

(This is actually in the form of a running error bound, because it contains the
computed quantities-see §3.3.) It is easy to see that
for each i, and so we have also the weaker bound

(4.4)

This is a forward error bound. A backward error result showing that is


the exact sum of terms with can be deduced from (4.1),
using the fact that no number xi takes part in more than n - 1 additions.
The following criterion is apparent from (4.2) and (4.3):

In designing or choosing a summation method to achieve high ac-


curacy, the aim should be to minimize the absolute values of the
intermediate sums Ti .

The aim specified in this criterion is surprisingly simple to state. When we


consider specific methods, however, we find that the aim is difficult to achieve.
Consider recursive summation, for which Ideally,
we would like to choose the ordering of the xi to minimize This
is a combinatorial optimization problem that is too expensive to solve in the
context of summation. A reasonable compromise is to determine the ordering
sequentially by minimizing, in turn, |x 1|, |S 2|, . . . , |S n-1|. This ordering
strategy, which we denote by Psum, can be implemented with O(n log n)
comparisons. If we are willing to give up the property that the ordering is
influenced by the signs of the xi we can instead use the increasing ordering,
which in general will lead to a larger value of than that for the
Psum ordering. If all the x i have the same sign then all these orderings
are equivalent. Therefore when summing nonnegative numbers by recursive
summation the increasing ordering is the best ordering, in the sense of having
the smallest a priori forward error bound.
How does the decreasing ordering fit into the picture? For the summation
of positive numbers this ordering has little to recommend it. The bound (4.3)
is no smaller, and potentially much larger, than it is for the increasing order-
ing. Furthermore, in a sum of positive terms that vary widely in magnitude
the decreasing ordering may not allow the smaller terms to contribute to the
sum (which is why the harmonic sum Σ_{k=1}^n 1/k “converges” in floating point
arithmetic as n → ∞). However, consider the example with n = 4 and
x = [1, M, 2M, -3M], (4.5)
where M is a floating point number so large that fl (1 + M) = M (thus M >
u^(-1)). The three orderings considered so far produce the following results:
Increasing:  Ŝn = fl(1 + M + 2M - 3M) = 0,
Psum:        Ŝn = fl(1 + M - 3M + 2M) = 0,
Decreasing:  Ŝn = fl(-3M + 2M + M + 1) = 1.
Thus the decreasing ordering sustains no rounding errors and produces the
exact answer, while both the increasing and Psum orderings yield computed
sums with relative error 1. The reason why the decreasing ordering performs
so well in this example is that it adds the “1” after the inevitable heavy
cancellation has taken place, rather than before, and so retains the important
information in this term. If we evaluate the term µ = Σ_{i=1}^{n-1} |T̂i| in the error
bound (4.3) for example (4.5) we find
Increasing: µ = 4M, Psum: µ = 3M, Decreasing: µ = M + 1,
so (4.3) “predicts” that the decreasing ordering will produce the most accurate
answer, but the bound it provides is extremely pessimistic since there are no
rounding errors in this instance.
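In MATLAB the example can be reproduced with, say, M = 2^60, which satisfies fl(1 + M) = M in IEEE double precision arithmetic (a sketch of our own):

M = 2^60;
x = [1 M 2*M -3*M];
s_inc = ((x(1) + x(2)) + x(3)) + x(4)   % increasing ordering: returns 0
s_dec = ((x(4) + x(3)) + x(2)) + x(1)   % decreasing ordering: returns 1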
Extrapolating from this example, we conclude that the decreasing or-
dering is likely to yield greater accuracy than the increasing or Psum or-
derings whenever there is heavy cancellation in the sum, that is, whenever

Turning to the insertion method, a good explanation of the insertion strat-


egy is that it attempts to minimize, one at a time, the terms
in the error bound (4.3). Indeed, if the xi are all nonnegative the insertion
method minimizes this bound over all instances of Algorithm 4.1.
Finally, we note that a stronger form of the bound (4.4) holds for pairwise
summation. It can be deduced from (4.3) or derived directly, as follows.
Assume for simplicity that n = 2^r. Unlike in recursive summation each addend
takes part in the same number of additions, log2 n. Therefore we have a
relation of the form
Figure 4.1. Recovering the rounding error.

which leads to the bound

(4.6)

Since it is proportional to log2 n rather than n, this is a smaller bound than
(4.4), which is the best bound of this form that holds in general for Algo-
rithm 4.1.

4.3. Compensated Summation


We have left to last the compensated summation method, which is recursive
summation with a correction term cleverly designed to diminish the rounding
errors. Compensated summation is worth considering whenever an accurate
sum is required and computations are already taking place at the highest
precision supported by the hardware or the programming language in use.
In 1951 Gill [449, 1951] noticed that the rounding error in the sum of two
numbers could be estimated by subtracting one of the numbers from the sum,
and he made use of this estimate in a Runge-Kutta code in a program library
for the EDSAC computer. Gill’s estimate is valid for fixed point arithmetic
only. Kahan [625, 1965] and Møller [777, 1965] both extended the idea to
floating point arithmetic. Møller shows how to estimate a + b - fl (a + b) in
chopped arithmetic, while Kahan uses a slightly simpler estimate to derive
the compensated summation method for computing Σ_{i=1}^n xi.
The estimate used by Kahan is perhaps best explained with the aid of a
diagram. Let a and b be floating point numbers with |a| > |b|, let ŝ = fl(a + b),
and consider Figure 4.1, which uses boxes to represent the mantissas of a and
b. The figure suggests that if we evaluate

in floating point arithmetic, in the order indicated by the parentheses, then


the computed ê will be a good estimate of the error (a + b) - ŝ. In fact, for
rounded floating point arithmetic in base 2, we have

a + b = ŝ + ê, (4.7)

that is, the computed ê represents the error exactly. This result (which does
not hold for all bases) is proved by Dekker [275, 1971, Thm. 4.7], Knuth [668,
1981, Thm. C, p. 221], and Linnainmaa [703, 1974, Thm. 3]. Note that there
is no point in computing fl( + ê), since is already the best floating point
representation of a + b !
Kahan’s compensated summation method employs the correction e on
every step of a recursive summation. After each partial sum is formed, the
correction is computed and immediately added to the next term xi before that
term is added to the partial sum. Thus the idea is to capture the rounding
errors and feed them back into the summation. The method may be written
as follows.

Algorithm 4.2 (compensated summation). Given floating point numbers


x1, . . . , xn this algorithm forms the sum Σ_{i=1}^n xi by compensated sum-
mation.

s = 0; e = 0
for i = 1:n
    temp = s
    y = xi + e
    s = temp + y
    e = (temp - s) + y   % Evaluate in the order shown.
end
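A direct MATLAB transcription of Algorithm 4.2 (our own function name; the intermediate variables preserve the stated evaluation order, provided the expressions are not algebraically rearranged by an optimizing compiler) is:

function s = comp_sum(x)
%COMP_SUM  Sum the elements of the vector x by compensated summation.
s = 0; e = 0;
for i = 1:length(x)
    temp = s;
    y = x(i) + e;
    s = temp + y;
    e = (temp - s) + y;   % estimate of the rounding error in forming s
end

Applying comp_sum to a vector of class single and comparing the result with a sum accumulated in double precision illustrates the improved accuracy predicted by (4.8) and (4.9).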

The compensated summation method has two weaknesses: ê is not neces-


sarily the exact correction, since (4.7) is based on the assumption that | a | > |b|,
and the addition y = xi + e is not performed exactly. Nevertheless, the use
of the corrections brings a benefit in the form of an improved error bound.
Knuth [668, 1981, Ex. 19, pp. 229, 572-573] shows that the computed sum
satisfies
(4.8)

which is an almost ideal backward error result (a more detailed version of


Knuth’s proof is given by Goldberg [457, 1991]).
In [627, 1972] and [628, 1973] Kahan describes a variation of compensated
summation in which the final sum is also corrected (thus “s = s + e” is
appended to the algorithm above). Kahan states in [627, 1972] and proves in
[628, 1973] that (4.8) holds with the stronger bound |µi| < 2u + O((n - i + 1)u^2).
The proofs of (4.8) given by Knuth and Kahan are similar; they use the
model (2.4) with a subtle induction and some intricate algebraic manipulation.
The forward error bound corresponding to (4.8) is

(4.9)

As long as nu < 1, the constant in this bound is independent of n, and


so the bound is a significant improvement over the bounds (4.4) for recur-
sive summation and (4.6) for pairwise summation. Note, however, that if
Σ |xi| >> |Σ xi| then compensated summation is not guaranteed to yield a
small relative error.
Another version of compensated summation has been investigated by sev-
eral authors: Jankowski, Smoktunowicz, and Wozniakowski [609, 1983], Jank-
owski and Wozniakowski [611, 1985], Kielbasinski [655, 1973], Neumaier [788,
1974], and Nickel [797, 1970]. Here, instead of immediately feeding each cor-
rection back into the summation, the corrections are accumulated separately
by recursive summation and then the global correction is added to the com-
puted sum. For this version of compensated summation Kielbasinski [655,
1973] and Neumaier [788, 1974] show that

(4.10)

provided nu < 0.1; this is weaker than (4.8) in that the second-order term
has an extra factor n. If n^2 u < 0.1 then in (4.10), |µi| < 2.1u. Jankowski,
Smoktunowicz, and Wozniakowski [609, 1983] show that, by using a divide
and conquer implementation of compensated summation, the range of n for
which |µ i | < cu holds in (4.10) can be extended, at the cost of a slight increase
in the size of the constant c.
Neither the correction formula (4.7) nor the result (4.8) for compensated
summation holds under the no-guard-digit model of floating point arithmetic.
Indeed, Kahan [634, 1990] constructs an example where compensated summa-
tion fails to achieve (4.9) on certain Cray machines, but he states that such
failure is extremely rare. In [627, 1972] and [628, 1973] Kahan gives a mod-
ification of the compensated summation algorithm in which the assignment
“e = (temp - s) + y” is replaced by

f = 0
if sign(temp) = sign(y), f = (0.46*s - s) + s, end
e = ((temp - f) - (s - f)) + y
Kahan shows in [628, 1973] that the modified algorithm achieves (4.8) “on
all North American machines with floating hardware” and explains that “The
mysterious constant 0.46, which could perhaps be any number between 0.25
and 0.50, and the fact that the proof requires a consideration of known ma-
chine designs, indicate that this algorithm is not an advance in computer
science.”
Viten’ko [1056, 1968] shows that under the no-guard-digit model (2.6) the
summation method with the optimal error bound (in a certain sense defined
in [1056, 1968]) is pairwise summation. This does not contradict Kahan’s
result because Kahan uses properties of the floating point arithmetic beyond
those in the no-guard-digit model.
A good illustration of the benefits of compensated summation is provided
by Euler’s method for the ordinary differential equation initial value problem
y' = f(x,y), y(a) given, which generates an approximate solution according
to y_{k+1} = yk + h fk, y0 = y(a). We solved the equation y' = -y with y(0) = 1
over [0,1] using n steps of Euler's method (nh = 1), with n ranging from 10
to 10^8. With compensated summation we replace the statements x = x + h,
y = y + h * f(x,y) by (with the initialization cx = 0, cy = 0)

dx = h + cx
new_x = x + dx
cx = (x - new_x) + dx
x = new_x

dy = h*f(x,y) + cy
new_y = y + dy
cy = (y - new_y) + dy
y = new_y

Figure 4.2 shows the errors en = |y(1) - ŷn|, where ŷn is the computed
approximation to y(1). The computations were done in Fortran 90 in single
precision arithmetic on a Sun SPARCstation (u ≈ 6 × 10^(-8)). Since Euler's
method has global error of order h, the error curve on the plot should be
approximately a straight line. For the standard implementation of Euler's
method the errors en start to increase steadily beyond n = 20,000 because
of the influence of rounding errors. With compensated summation the errors
en are much less affected by rounding errors and do not grow in the range of
n shown (for n = 10^8, en is about 10 times larger than it would be in exact
arithmetic). Plots of U-shaped curves showing total error against stepsize
are common in numerical analysis textbooks (see, e.g., Forsythe, Malcolm,
and Moler [395, 1977, p. 119] and Shampine [910, 1994, p. 259]), but the
textbooks rarely point out that the “U” can be flattened out by compensated
summation.
Figure 4.2. Errors |y(1) - ŷn| for Euler's method with (“×”) and without (“o”)
compensated summation.

The cost of applying compensated summation in an ordinary differential


equation solver is almost negligible if the function f is at all expensive to eval-
uate. But, of course, the benefits it brings are noticeable only when a vast
number of integration steps are taken. Very long-term integrations are un-
dertaken in celestial mechanics, where roundoff can affect the ability to track
planetary orbits. Researchers in astronomy use compensated summation, and
other techniques, to combat roundoff. An example application is a 3 million
year integration of the planets in the solar system by Quinn, Tremaine, and
Duncan [854, 1991]; it used a linear multistep method of order 13 with a
constant stepsize of 0.75 days and took 65 days of machine time on a Silicon
Graphics 4D-25 workstation. See also Quinn and Tremaine [853, 1991] and
Quinlan [851, 1994].
Finally, we describe an even more ingenious algorithm called doubly com-
pensated summation, derived by Priest [844, 1992] from a related algorithm
of Kahan. It is compensated summation with 2 extra applications of the
correction process11 and it requires 10 instead of 4 additions per step. The
algorithm is tantamount to simulating double precision arithmetic with sin-
gle precision arithmetic; it requires that the summands first be sorted into
11. The algorithm should perhaps be called triply compensated summation, but we adopt
Priest's terminology.

decreasing order, which removes the need for certain logical tests that would
otherwise be necessary.

Algorithm 4.3 (doubly compensated summation). Given floating point numbers
x1, . . . , xn, this algorithm forms the sum Sn = x1 + . . . + xn by doubly
compensated summation. All expressions should be evaluated in the order specified
by the parentheses.

Sort the xi so that |x1| > |x2| > . . . > |xn|.


s1 = x1; c1 = 0
for k = 2:n
yk = ck-1 + xk
uk = xk - (yk - ck-1)
tk = yk + sk-1
vk = yk - (tk - sk-1)
zk = uk + vk
sk = tk + zk
ck = zk - (sk - tk)
end
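A direct Python transcription of Algorithm 4.3 (an illustrative sketch; the
function name is ours, and Python floats are IEEE double precision, so t = 53
in the hypothesis n < β^(t-3)) is:

    def doubly_compensated_sum(x):
        # Algorithm 4.3: doubly compensated summation.  The summands are
        # first sorted into decreasing order of absolute value; the
        # parentheses fix the evaluation order and must not be re-associated.
        x = sorted(x, key=abs, reverse=True)
        s, c = x[0], 0.0
        for xk in x[1:]:
            y = c + xk
            u = xk - (y - c)
            t = y + s
            v = y - (t - s)
            z = u + v
            s_new = t + z
            c = z - (s_new - t)
            s = s_new
        return s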

Priest [844, 1992, §4.1] analyses this algorithm for t-digit base β arithmetic
that satisfies certain reasonable assumptions, ones which are all satisfied by
IEEE arithmetic. He shows that if n < β^(t-3) then the computed sum
satisfies

that is, the computed sum is accurate virtually to full precision.

4.4. Other Summation Methods


We mention briefly two further classes of summation algorithms. The first
builds the sum in a series of accumulators, which are themselves added to give
the sum. As originally described by Wolfe [1107, 1964] each accumulator holds
a partial sum lying in a different interval. Each term xi is added to the lowest-
level accumulator; if that accumulator overflows it is added to the next-highest
one and then reset to zero, and this cascade continues until no overflow occurs.
Modifications of Wolfe’s algorithm are presented by Malcolm [723, 1971] and
Ross [880, 1965]. Malcolm [723, 1971] gives a detailed error analysis to show
that his method achieves a relative error of order u. A drawback of the
algorithm is that it is strongly machine dependent. An interesting and crucial
feature of Malcolm’s algorithm is that on the final step the accumulators
are summed by recursive summation in order of decreasing absolute value,
which in this particular situation precludes severe loss of significant digits
and guarantees a small relative error.

Another class of algorithms, referred to as “distillation algorithms” by


Kahan [633, 1987], work as follows: given xi = fl(xi), i = 1:n, they iteratively
construct floating point numbers x1^(k), . . . , xn^(k) whose sum equals the original sum,
terminating when xn^(k) approximates Sn with relative error at most u.
Kahan states that these algorithms appear to have average run times of order
at least n log n. See Bohlender [130, 1977], Kahan [633, 1987], Leuprecht and
Oberaigner [700, 1982], Pichat [830, 1972], and Priest [844, 1992, pp. 66-69]
for further details and references.

4.5. Statistical Estimates of Accuracy


The rounding error bounds presented above can be very pessimistic, because
they account for the worst-case propagation of errors. An alternative way
to compare summation methods is through statistical estimates of the error,
which may be more representative of the average case. A statistical analysis
of three summation methods has been given by Robertazzi and Schwartz [874,
1988] for the case of nonnegative xi . They assume that the relative errors in
floating point addition are statistically independent and have zero mean and
finite variance σ^2. Two distributions of nonnegative xi are considered: the
uniform distribution on [0, 2µ], and the exponential distribution with mean
µ. Making various simplifying assumptions Robertazzi and Schwartz esti-
mate the mean square error (that is, the variance of the absolute error) of
the computed sums from recursive summation with random, increasing, and
decreasing orderings, and from insertion summation and pairwise summation
(with the increasing ordering). Their results for the summation of n numbers
are given in Table 4.1.
The results show that for recursive summation the ordering affects only
the constant in the mean square error, with the increasing ordering having
the smallest constant and the decreasing ordering the largest; since the xi are
nonnegative, this is precisely the ranking given by the rounding error bound
(4.3). The insertion and pairwise summation methods have mean square
errors proportional to n 2 rather than n 3 for recursive summation, and the
insertion method has a smaller constant than pairwise summation. This is also
consistent with the rounding error analysis, in which for nonnegative xi the
insertion method satisfies an error bound no larger than pairwise summation
and the latter method has an error bound with a smaller constant than for
recursive summation (log2 n versus n).

4.6. Choice of Method


There is a wide variety of summation methods to choose from. For each
met hod the error can vary greatly with the data, within the freedom afforded

Table 4.1. Mean square errors for nonnegative xi .

Distrib.       Increasing        Random            Decreasing        Insertion        Pairwise

Unif(0, 2µ)    0.20µ^2 n^3 σ^2   0.33µ^2 n^3 σ^2   0.53µ^2 n^3 σ^2   2.6µ^2 n^2 σ^2   2.7µ^2 n^2 σ^2
Exp(µ)         0.13µ^2 n^3 σ^2   0.33µ^2 n^3 σ^2   0.63µ^2 n^3 σ^2   2.6µ^2 n^2 σ^2   4.0µ^2 n^2 σ^2

by the error bounds; numerical experiments show that, given any two of the
methods, data can be found for which either method is more accurate than
the other [553, 1993]. However, some specific advice on the choice of method
can be given.
1. If high accuracy is important, consider implementing recursive summa-
tion in higher precision; if feasible this may be less expensive (and more
accurate) than using one of the alternative methods at the working pre-
cision. What can be said about the accuracy of the sum computed at
higher precision? If Sn = x1 + . . . + xn is computed by recursive summation
at double precision (unit roundoff u^2) and then rounded to single precision,
an error bound of the form |Ŝn - Sn| < u|Sn| + nu^2 Σi |xi| holds.
Hence a relative error of order u is guaranteed if nu Σi |xi| < |Sn|.
Priest [844, 1992, pp. 62-63] shows that if the xi are sorted in decreasing
order of magnitude before being summed in double precision, then |Ŝn - Sn| < 2u|Sn|
holds provided only that n < β^(t-3) for t-digit base
β arithmetic satisfying certain reasonable assumptions. Therefore the
decreasing ordering may be worth arranging if there is a lot of cancella-
tion in the sum. An alternative to extra precision computation is doubly
compensated summation, which is the only other method described here
that guarantees a small relative error in the computed sum.
2. For most of the methods the errors are, in the worst case, proportional
to n. If n is very large, pairwise summation (error constant log2 n) and
compensated summation (error constant of order 1) are attractive.
3. If the xi all have the same sign then all the methods yield a relative error
of at most nu and compensated summation guarantees perfect relative
accuracy (as long as nu < 1). For recursive summation of one-signed
data, the increasing ordering has the smallest error bound (4.3) and
the insertion method minimizes this error bound over all instances of
Algorithm 4.1.
4. For sums with heavy cancellation (Σi |xi| >> |Σi xi|), recursive
summation with the decreasing ordering is attractive, although it cannot
be guaranteed to achieve the best accuracy.

Considerations of computational cost and the way in which the data are
generated may rule out some of the methods. Recursive summation in the
natural order, pairwise summation, and compensated summation can be im-
plemented in O(n) operations for general xi, but the other methods are more
expensive since they require searching or sorting. Furthermore, in an applica-
tion such as the numerical solution of ordinary differential equations, where
xk+1 is not known until the sum x1 + . . . + xk has been formed, sorting and
searching may be impossible.

4.7. Notes and References


This chapter is based on Higham [553, 1993]. Analysis of Algorithm 4.1 and
compensated summation can also be found in Espelid [356, 1978].
The earliest error analysis of summation is that of Wilkinson for recursive
summation in [1084, 1960], [1088, 1963].
Pairwise summation was first discussed by McCracken and Dorn [743,
1964, pp. 61-63], Babuska [35, 1969], and Linz [707, 1970]. Caprani [185, 1971]
shows how to implement the method on a serial machine using temporary
storage of size [log2 n] + 1 (without overwriting the xi ).
The use of compensated summation with a Runge-Kutta formula is de-
scribed by Vitasek [1055, 1969]. See also Butcher [170, 1987, pp. 118-120]
and the experiments of Linnainmaa [703, 1974]. Davis and Rabinowitz [267,
1984, §4.2.1] discuss pairwise summation and compensated summation in the
context of quadrature.

Problems
4.1. Define and evaluate a condition number C(x) for the summation Sn(x) =
x1 + x2 + . . . + xn. When does the condition number take the value 1?

4.2. (Wilkinson [1088, 1963, p. 19]) Show that the bounds (4.3) and (4.4) are
nearly attainable for recursive summation. (Hint: assume u = 2^-t, set n = 2^r
(r << t), and define

x(1) = 1,
x(2) = 1 - 2^-t,
x(3:4) = 1 - 2^(1-t),
x(5:8) = 1 - 2^(2-t),
...
x(2^(r-1) + 1 : 2^r) = 1 - 2^(r-1-t).)

4.3. Let Sn = x1 + x2 + . . . + xn be computed by recursive summation in the natural


order. Show that

and hence that En = Sn - Ŝn satisfies

Which ordering of the xi minimizes this bound?


4.4. Let M be a floating point number so large that fl(10 + M) = M. What
are the possible values of fl(x1 + x2 + . . . + x6), where {xi} = {1, 2, 3, 4, M, -M}
and the sum is evaluated by recursive summation?
4.5. The "±" method for computing Sn = x1 + . . . + xn is defined as follows: form
the sum of the positive numbers, S+, and the sum of the nonpositive numbers,
S-, separately, by any method, and then form Sn = S- + S+. Discuss the
pros and cons of this method.
4.6. Let {xi} be a convergent sequence with limit ξ. Aitken's ∆2-method
(Aitken extrapolation) generates a transformed sequence {yi} defined by

yi = xi - (xi+1 - xi)^2 / (xi+2 - 2xi+1 + xi).

Under suitable conditions (typically that {xi} is linearly convergent), the yi
converge to ξ faster than the xi. Which of the following expressions should
be used to evaluate the denominator in the formula for yi?
(a) (xi+2 - 2xi+1) + xi.
(b) (xi+2 - xi+1) - (xi+1 - xi).
(c) (xi+2 + xi) - 2xi+1.
4.7. Analyse the accuracy of the following method for evaluating

4.8. In numerical methods for quadrature and for solving ordinary differential
equation initial value problems it is often necessary to evaluate a function on
an equally spaced grid of points on a range [a,b]: xi := a + ih, i = 0:n,
where h = (b-a)/n. Compare the accuracy of the following ways to form xi .
Assume that a and b, but not necessarily h, are floating point numbers.

(a) x i = x i-1 + h (x0 = a).


(b) xi = a + ih.
(c) xi = a(1 - i/n) + (i/n)b.
Note that (a) is typically used without comment in, for example, a Newton-
Cotes quadrature rule or a Runge-Kutta method with fixed step size.
4.9. (R ESEARCH P ROBLEM ) Priest [844, 1992 , pp. 61-62] has proved that
if |x1 | > |x2| > |x3| then compensated summation computes the sum x 1 +
x2 + x3 with a relative error of order u (under reasonable assumptions on the
arithmetic, such as the presence of a guard digit). He also gives the example

x1 = 2^(t+1),   x2 = 2^(t+1) - 2,   x3 = x4 = x5 = x6 = -(2^t - 1),

for which the exact sum is 2 but compensated summation computes 0 in


IEEE single precision arithmetic (t=24). What is the smallest n for which
compensated summation applied to x1,. . . , xn ordered by decreasing absolute
value can produce a computed sum with large relative error?

Chapter 5
Polynomials

The polynomial (z - 1)(z - 2) . . . (z - 20) is not a ‘difficult’ polynomial per se . . .


The ‘difficulty’ with the polynomial Π(z - i) is that of
evaluating the explicit polynomial accurately.
If one already knows the roots, then the polynomial can be evaluated
without any loss of accuracy.
-J. H. WILKINSON, The Perfidious Polynomial (1984)

I first used backward error analysis in connection with


simple programs for computing zeros of polynomials
soon after the PILOT ACE came into use.
-J. H. WILKINSON, The State of the Art in Error Analysis (1985)

The Fundamental Theorem of Algebra asserts that


every polynomial equation over the complex field has a root.
It is almost beneath the dignity of such a majestic theorem
to mention that in fact it has precisely n roots.
-J. H. WILKINSON, The Perfidious Polynomial (1984)


Two common tasks associated with polynomials are evaluation and interpola-
tion: given the polynomial find its values at certain arguments, and given the
values at certain arguments find the polynomial. We consider Horner’s rule
for evaluation and the Newton divided difference polynomial for interpolation.
A third task not considered here is finding the zeros of a polynomial. Much re-
search was devoted to polynomial zero finding up until the late 1960s; indeed,
Wilkinson devotes a quarter of Rounding Errors in Algebraic Processes [1088,
19 6 3 ] to the topic. Since the development of the QR algorithm for finding
matrix eigenvalues there has been less demand for polynomial zero finding,
since the problem either arises as, or can be converted to (see §26.6 and [346,
1995], [1007, 1994]), the matrix eigenvalue problem.

5.1. Horner’s Method


The standard method for evaluating a polynomial

p(x) = a0 + a1x + . . . + anx^n                    (5.1)

is Horner’s method (also known as Horner’s rule and nested multiplication),


which consists of the following recurrence:

qn(x) = an
for i = n-1:-1:0
qi(x) = xqi+1(x) + ai
end
p(x) = q0(x)

The cost is 2n flops, which is n less than the more obvious method of evaluation
that explicitly forms powers of x (see Problem 5.2).
To analyse the rounding errors in Horner's method it is convenient to use
the relative error counter notation <k> (see (3.9)). We have

It is easy to either guess or prove by induction that

(5.2)

where we have used Lemma 3.1, and where |φk| < ku/(1 - ku) =: γk. This
result shows that Horner's method has a small backward error: the com-
puted value is the exact value at x of a polynomial obtained by making relative
perturbations of size at most γ2n to the coefficients of p(x).
A forward error bound is easily obtained: from (5.2) we have

(5.3)

where The relative error is bounded according to

Clearly, the factor ψ(p, x) can be arbitrarily large. However, ψ(p, x) = 1 if
ai > 0 for all i and x > 0, or if (-1)^i ai > 0 for all i and x < 0.
In a practical computation we may wish to compute an error bound along
with the computed value. The bound (5.3) is entirely adequate for theoretical purposes and can
itself be computed by Horner’s method. However, it lacks sharpness for two
reasons. First, the bound is the result of replacing each γ k by γ2 n . Second,
and more importantly, it is an a priori bound and so takes no account of
the actual rounding errors that occur. We can derive a sharper, a posteriori
bound by a running error analysis.
For the ith step of Horner’s method we can write

(5.4)
where we have used both (2.4) and (2.5). Defining the computed q̂i =: qi + fi, we have

or

Hence

Since fn = 0, we have |fi | < uπi , where

We can slightly reduce the cost of evaluating the majorizing sequence πi by

working with a rescaled sequence, which satisfies the recurrence

We can now furnish Horner’s method with a running error bound.



Algorithm 5.1. This algorithm evaluates y = fl(p(x)) by Horner’s method,


where p(x) = a0 + a1x + . . . + anx^n. It also evaluates a quantity µ such that |y - p(x)| < µ.

y = an
µ = |y|/2
for i = n-1:-1:0
y = xy + ai
µ = |x|µ + |y|
end
µ = u(2µ - |y|)
Cost: 4n flops.
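A Python version of Algorithm 5.1 might look as follows (a sketch; the
function name is ours, and u is the unit roundoff of the working precision,
2^-53 for IEEE double precision):

    def horner_with_bound(a, x, u=2.0**-53):
        # Evaluate p(x) = a[0] + a[1]*x + ... + a[n]*x**n by Horner's rule,
        # returning y = fl(p(x)) and a running error bound mu with
        # |y - p(x)| <= mu (Algorithm 5.1).
        n = len(a) - 1
        y = a[n]
        mu = abs(y) / 2.0
        for i in range(n - 1, -1, -1):
            y = x * y + a[i]
            mu = abs(x) * mu + abs(y)
        return y, u * (2.0 * mu - abs(y))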
It is worth commenting on the case where one or more of the ai and x
is complex. The analysis leading to Algorithm 5.1 is still valid for complex
data, but we need to remember that the error bounds for fl ( x op y) are not
the same as for real arithmetic. In view of Lemma 3.5, it suffices to replace
the last line of the algorithm by µ = An increase in speed of
the algorithm, with only a slight worsening of the bound, can be obtained by
replacing |y| = ((Re y)^2 + (Im y)^2)^(1/2) by |Re y| + |Im y| (and, of course, |x|
should be evaluated once and for all before entering the loop).
One use of Algorithm 5.1 is to provide a stopping criterion for a polynomial
zero-finder: if |fl(p ( x))| is of the same order as the error bound µ, then further
iteration serves no purpose, for as far as we can tell, x could be an exact zero.
As a numerical example, for the expanded form of p(x) = (x + 1)^32 we
found in MATLAB that

and for p(x) the Chebyshev polynomial of degree 32,

In these two cases, the running error bound is, respectively, 62 and 31 times
smaller than the a priori one.
In another experiment we evaluated the expanded form of p(x) = (x - 2)^3
in simulated single precision in MATLAB (u ≈ 6 × 10^-8) for 200 equally spaced
points near x = 2. The polynomial values, the error, and the a priori and
running error bounds are all plotted in Figure 5.1. The running error bound
is about seven times smaller than the a priori one.

5.2. Evaluating Derivatives


Suppose now that we wish to evaluate derivatives of p. We could simply
differentiate (5.1) as many times as necessary and apply Horner’s method to

Figure 5.1. Computed polynomial values (top) and running and a priori bounds
(bottom) for Horner’s method.

each expression, but there is a more efficient way. Observe that if we define

q(x) = q1 + q2x + . . . + qnx^(n-1),   r = q0,

where the qi = qi (a) are generated by Horner’s method for p(a), then

p(x) = (x - a)q(x) + r.

In other words, Horner’s method carries out the process of synthetic division.
Clearly, p'(a) = q(a). If we repeat synthetic division recursively on q(x), we
will be evaluating the coefficients in the Taylor expansion

and after a final scaling by factorials, we will obtain the derivatives of p at a .


The resulting algorithm is quite short.

Algorithm 5.2. This algorithm evaluates the polynomial


p(x) = a0 + a1x + . . . + anx^n and its first k derivatives at a, returning yi = p^(i)(a), i = 0:k.

y0 = an
y(1:k) = 0
for j = n-1:-1:0
for i = min(k, n-j):-1:1
yi = ayi + yi-1
end
y0 = ay0 + aj
end
m = 1
for j = 2:k
m = m*j
yj = m*yj
end

Cost: nk + 2(k + n) - k^2/2 flops.
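For reference, here is the same scheme in Python (an illustrative sketch;
the function name is ours, a[i] is the coefficient of x^i, and alpha is the
evaluation point):

    def horner_derivatives(a, alpha, k):
        # Algorithm 5.2: return [p(alpha), p'(alpha), ..., p^(k)(alpha)] for
        # p(x) = a[0] + a[1]*x + ... + a[n]*x**n, by repeated synthetic
        # division followed by the factorial scaling.
        n = len(a) - 1
        y = [0.0] * (k + 1)
        y[0] = a[n]
        for j in range(n - 1, -1, -1):
            for i in range(min(k, n - j), 0, -1):
                y[i] = alpha * y[i] + y[i - 1]
            y[0] = alpha * y[0] + a[j]
        m = 1
        for j in range(2, k + 1):
            m *= j
            y[j] *= m
        return y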


How is the error bounded for the derivatives in Algorithm 5.2? To answer
this question with the minimum of algebra, we express the algorithm in matrix
notation. Horner’s method for evaluating p(a) is equivalent to solution of the
bidiagonal system

By considering (5.4), we see that

Hence
(5.5)
The recurrence for r0 = p'(a) can be expressed as Unr = q(1:n), where
r = r(0:n - 1), so

Hence

This gives, using (5.5),

(5.6)

Now

By looking at the form of r and q, we find from (5.6) that

(5.7)
This is essentially the same form of bound as for p(a) in (5.3). Analogous
bounds hold for all derivatives.

5.3. The Newton Form and Polynomial Interpolation


An alternative to the monomial representation of a polynomial is the Newton
form

p(x) = c0 + c1(x - a0) + c2(x - a0)(x - a1) + . . . + cn(x - a0)(x - a1) . . . (x - an-1),    (5.8)

which is commonly used for polynomial interpolation. The interpolation prob-


lem is to choose p so that p(ai) = fi, i = 0:n, and the numbers ci are known

as divided differences. Assuming that the points aj are distinct, the divided
differences may be computed from a standard recurrence:
c^(0)(0:n) = f(0:n)
for k = 0:n-1
for j = n:-1:k+1
cj^(k+1) = (cj^(k) - cj-1^(k))/(aj - aj-k-1)
end
end
c = c^(n)
Cost: 3n^2/2 flops.
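In Python the recurrence can be written, for example, as follows (a sketch;
the function name is ours, and the array c is overwritten in place so that
c^(k) never needs to be stored separately):

    def divided_differences(a, f):
        # Divided differences c[0], ..., c[n] for the data (a[j], f[j]),
        # computed by the standard recurrence
        #   c[j] <- (c[j] - c[j-1]) / (a[j] - a[j-k-1]).
        n = len(a) - 1
        c = list(f)
        for k in range(n):
            for j in range(n, k, -1):
                c[j] = (c[j] - c[j - 1]) / (a[j] - a[j - k - 1])
        return c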
Two questions are of interest: how accurate are the computed ĉi, and
what is the effect of rounding errors on the polynomial values obtained by
evaluating the Newton form? To answer the first question we express the
recurrence in matrix-vector form:
c^(0) = f,   c^(k+1) = Lk c^(k),   k = 0:n-1,
where Lk is lower bidiagonal, with
Dk = diag(ones(1:k+1), ak+1 - a0, ak+2 - a1, . . . , an - an-k-1),

The analysis that follows is based on the model (2.4), and so is valid only for
machines with a guard digit. With the no-guard-digit model (2.6) the bounds
become weaker and more complicated, because of the importance of terms
fl(aj - aj-k-1) in the analysis.
It is straightforward to show that
(5.9)
where Gk = diag(ones(1:k+1), ηk,k+2, . . . , ηk,n+1), where each ηij is of the
form ηij = (1 + δ1)(1 + δ2)(1 + δ3), |δi| < u. Hence
(5.10)
From Lemma 3.7,

(5.11)

To interpret the bound, note first that merely rounding the data (fi →
fi(1 + δi), |δi| < u) can cause an error ∆c as large as eround = u|L||f|,
where L = Ln-1 . . . L0, so errors of at least this size are inevitable. Since
|Ln-1| . . . |L0| > |Ln-1 . . . L0| = |L|, the error in the computed divided differ-
ences can be larger than eround only if there is much subtractive cancellation
in the product L = Ln-1 . . . L0. If a0 < a1 < . . . < an, then each Li is
positive on the diagonal and nonpositive on the first subdiagonal; therefore
|Ln-1| . . . |L0| = |Ln-1 . . . L0| = |L|, and we have the very satisfactory bound
|ĉ - c| < ((1 - 3u)^-n - 1)|L||f|. This same bound holds if the ai are arranged
in decreasing order.
To examine how well the computed Newton form reproduces the f i we
“unwind” the analysis above. From (5.9) we have

By invoking Lemma 3.7 again, we obtain

(5.12)

If a0 < a1 < . . . < an then the relevant terms are nonnegative for all i, and
we obtain a correspondingly satisfactory bound. Again, the same
bound holds for points arranged in decreasing order.
In practice it is found that even when the computed divided differences are
very inaccurate, the computed interpolating polynomial may still reproduce
the original data well. The bounds (5.11) and (5.12) provide insight into this
observed behaviour by showing that these errors can be large only when
there is much cancellation in the products Ln-1 . . . L0 f and its counterpart for the
reconstruction, respectively.
The analysis has shown that the ordering a 0 < a 1 < . . . < an yields
“optimal” error bounds for the divided differences and the residual, and so
may be a good choice of ordering of interpolation points. However, if the
aim is to minimize |p(x) - fl(p(x))| for a given x ≠ aj, then other orderings
need to be considered. An ordering with some theoretical support is the Leja
ordering, which is defined by the equations [863, 1990]

(5.13a)

(5.13b)

For a given set of n + 1 points ai, the Leja ordering can be computed in n^2


flops (see Problem 5.4).
We give a numerical example to illustrate the analysis. Let n = 16 and
let a 0 < . . . < an be equally spaced points on [-1,1] . Working in simulated

single precision with u = 2^-24 ≈ 6 × 10^-8, we computed divided differences


for two different vectors f. Error statistics were computed by regarding the
solutions computed in double precision as exact. We define the ratios

(1) For fi from the normal N(0,1) distribution the divided differences
range in magnitude from 1 to 10^5, and their relative errors range from 0 (the
first divided difference, f0, is always exact) to 3 × 10^-7. The ratio p1 = 16.3,
so (5.11) provides a reasonably sharp bound for the error in ĉ. The relative
errors when f is reconstructed from the computed divided differences range
between 0 and 3 × 10^-1 (it makes little difference whether the reconstruction
is done in single or double precision). Again, this is predicted by the analysis,
in this case by (5.12), because p2 = 2 × 10^7. For the Leja ordering, the divided
differences are computed with about the same accuracy, but f is reconstructed
much more accurately, with maximum relative error 7 × 10^-6 (p1 = 1 × 10^3,
p2 = 8 × 10^4).
(2) For fi = exp(ai), the situation is reversed: we obtain inaccurate di-
vided differences but an accurate reconstruction of f. The divided differences
range in magnitude from 10^-4 to 10^-1, and their relative errors are as large as
1, but the relative errors in the reconstructed f are all less than 10^-7. Again,
the error bounds predict this behaviour: p1 = 6 × 10^8, p2 = 1.02. The Leja
ordering performs similarly.
The natural way to evaluate the polynomial (5.8) for a given x is by a
generalization of Horner's method:

qn(x) = cn
for i = n-1:-1:0
qi(x) = (x - ai)qi+1(x) + ci
end
p(x) = q0(x)
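For example (an illustrative Python sketch with a function name of our own,
using the divided differences computed earlier):

    def newton_eval(c, a, x):
        # Evaluate the Newton form p(x) = c[0] + c[1](x - a[0]) + ...
        #   + c[n](x - a[0]) ... (x - a[n-1]) by the generalized Horner
        # recurrence.
        n = len(c) - 1
        q = c[n]
        for i in range(n - 1, -1, -1):
            q = (x - a[i]) * q + c[i]
        return q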
A straightforward analysis shows that (cf. (5.2))

Hence the computed value is the exact value corresponding to a polynomial with


slightly perturbed divided differences. The corresponding forward error bound
is

5.4. Notes and References


Backward and forward error analysis for Horner’s rule was given by Wilkin-
son [1088, 1963, pp. 36-37, 49-50]; our results are simply Wilkinson’s pre-
sented in a different notation. The analysis has been redone by many other au-
thors, sometimes without reference to Wilkinson’s results. Another early ref-
erence, which gives a forward error bound only, is McCracken and Dorn [743,
1964, §3.5].
For more on running error analysis see §3.3.
Müller [782, 1983] gives a first-order error analysis for the evaluation of the
divided difference form of a polynomial. Olver [809, 1986] derives a posteriori
error bounds for the Horner scheme with derivatives (Algorithm 5.2), phrasing
them in terms of his relative precision notation. Stewart [939, 1971] analyses
synthetic division, using a matrix-oriented approach similar to that in §5.2.
The relative merits of the monomial and Chebyshev representations of
a polynomial are investigated, with respect to accuracy of evaluation, by
Newbery [793, 1974] and Schonfelder and Razaz [901, 1980]. Clenshaw [212,
1955] showed how Horner’s method could be extended to evaluate a polyno-
mial expressed in the Chebyshev form p(x) = Σi aiTi(x), where Ti is the
Chebyshev polynomial of degree i. Error analysis of Clenshaw’s method, and
variations of it, are given by Gentleman [433, 1969], Newbery [792, 1973], and
Oliver [805, 1977], [806, 1979]. Clenshaw’s scheme can be generalized to ex-
pansions in terms of arbitrary orthogonal polynomials; see Smith [927, 1965]
and Algorithm 21.8.
Running error bounds for Horner’s method were included in algorithms of
Kahan and Farkas [637, 1963], [638, 1963] without explanation. Adams [5,
1967] derives the bounds and extends them to evaluation of a real polynomial
at a complex argument. Algorithm 5.1 is given in [5, 1967], and also in the clas-
sic paper by Peters and Wilkinson [827, 1971], which describes many aspects
of the solution of polynomial equations. Wilkinson’s paper “The Perfidious
Polynomial” [1103, 1984] (for which he was awarded the Chauvenet Prize) is
highly recommended as a beautifully written introduction to backward error
analysis in general and error analysis for polynomials in particular.
There seems to be little work on choosing the ordering of interpolation
points to minimize the effect of rounding errors on the construction or eval-
uation of the interpolating polynomial. Werner [1075, 1984] examines ex-
perimentally the effect of different orderings on the computed value of an
interpolating polynomial at a single point, for several forms of interpolating
polynomial.
The Leja ordering, which was proposed by Leja in a 1957 paper, is analysed
in detail by Reichel [863, 1990]. He shows that an appropriately defined
condition number for the Newton form of interpolating polynomial grows at a
slower than exponential rate in the degree n for Leja points, which are points

taken from a given compact set that satisfy the condition (5.13). For more
on the numerical benefits of the Leja ordering see §21.3.3.
If a polynomial is to be evaluated many times at different arguments it
may be worthwhile to expend some effort transforming it to a form that can
be evaluated more cheaply than by a straightforward application of Horner’s
rule. For example, the quartic

p(x) = a4x^4 + a3x^3 + a2x^2 + a1x + a0,   a4 ≠ 0,

can be rewritten as [668, 1981, p. 471]

p(x) = ((y + x + â2)y + â3)â4,   y = (x + â0)x + â1,

where the coefficients âi are determined from the ai.

Once the âi have been computed, p(x) can be evaluated in three multiplica-
tions and five additions, as compared with the four multiplications and four
additions required by Horner’s rule. If a multiplication takes longer than an
addition, the transformed polynomial should be cheaper to evaluate. For poly-
nomials of degree n > 4 there exist evaluation schemes that require strictly
less than the 2n total additions and multiplications required by Horner’s rule;
see Knuth [665, 1962], [668, 1981, pp. 471-475] and Fike [373, 1967]. One
application in which such schemes have been used is in evaluating polynomial
approximations in an elementary function library [412, 1991]. Little seems to
be known about the numerical stability of fast polynomial evaluation schemes;
see Problem 5.6.

Problems
5.1. Give an alternative derivation of Algorithm 5.2 by differentiating the
Horner recurrence and rescaling the iterates.
5.2. Give an error analysis for the following “beginner’s” algorithm for eval-
uating p(x) = a0 + a1x + . . . + anx^n:

q(x) = a0; y = 1
for i = 1:n
y = xy
q(x) = q(x) + aiy
end
p(x) = q(x)

5.3. Let p(x) = a0 + a1x + . . . + anx^n and n = 2m. Then

p(x) = (a0 + a2x^2 + . . . + a2mx^2m) + (a1x + a3x^3 + . . . + a2m-1x^(2m-1))
     = a0 + a2y + . . . + a2my^m + x(a1 + a3y + . . . + a2m-1y^(m-1)),

where y = x^2. Obtain an error bound for fl(p(x)) when p is evaluated using
this splitting (using Horner’s rule on each part).
5.4. Write down an algorithm for computing the Leja ordering (5.13) in n 2
flops.
5.5. If the polynomial p(x) = a0 + a1x + . . . + anx^n has roots x1, . . . , xn, it can be
evaluated from the root product form p(x) = an(x - x1)(x - x2) . . . (x - xn). Give an
error analysis for this evaluation.
5.6. (RESEARCH PROBLEM) Investigate the numerical stability of fast poly-
nomial evaluation schemes (see the Notes and References) by both rounding
error analysis and numerical experiments. For a brief empirical study see
Miller [757, 1975, §10].

Chapter 6
Norms

While it is true that all norms are equivalent theoretically,


only a homely one like the ∞-norm is truly useful numerically.
-J. H. WILKINSON 12, Lecture at Stanford University (1984)

Matrix norms are defined in many different ways in the older literature,
but the favorite was the Euclidean norm of the matrix
considered as a vector in n 2-space.
Wedderburn (1934) calls this the absolute value of the matrix
and traces the idea back to Peano in 1887.
-ALSTON S. HOUSEHOLDER,
The Theory of Matrices in Numerical Analysis (1964)

12
Quoted in Fox [403, 1987].


Norms are an indispensable tool in numerical linear algebra. Their ability to


compress the mn numbers in an m × n matrix into a single scalar measure of
size enables perturbation results and rounding error analyses to be expressed
in a concise and easily interpreted form. In problems that are badly scaled,
or contain a structure such as sparsity, it is often better to measure matrices
and vectors componentwise. But norms remain a valuable instrument for the
error analyst, and in this chapter we describe some of their most useful and
interesting properties.

6.1. Vector Norms


A vector norm is a function ||·|| : Cn → IR satisfying the following conditions:

1. ||x|| > 0 with equality iff x = 0.

2. ||ax|| = |a| ||x|| for all a ∈ C, x ∈ Cn.

3. ||x + y|| < ||x|| + ||y|| for all x, y ∈ Cn (the triangle inequality).

The three most useful norms in error analysis and in numerical computa-
tion are

||x||1 = |x1| + |x2| + . . . + |xn|,   the "Manhattan" or "taxi cab" norm,
||x||2 = (|x1|^2 + . . . + |xn|^2)^(1/2) = (x*x)^(1/2),   the Euclidean length,
||x||∞ = maxi |xi|.

These are all special cases of the Hölder p-norm:

||x||p = (|x1|^p + |x2|^p + . . . + |xn|^p)^(1/p),   p > 1.

The 2-norm has two properties that make it particularly useful for the-
oretical purposes. First, it is invariant under unitary transformations, for if
Q*Q = I, then ||Qx||2^2 = x*Q*Qx = x*x = ||x||2^2. Second, the 2-norm is
differentiable for all x ≠ 0, with gradient vector ∇||x||2 = x/||x||2.
A fundamental inequality for vectors is the Hölder inequality (see, for
example, [502, 1967, App. 1])

|x*y| < ||x||p ||y||q , (6.1)



This is an equality when p, q > 1 if the vectors (|x i |p ) and (|y i |q ) are linearly
dependent and xi yi lies on the same ray in the complex plane for all i; equality
is also possible when p = 1 and p = ∞, as is easily verified. The special case
with p = q = 2 is called the Cauchy-Schwarz inequality:
|x*y| < ||x||2 ||y||2.

For an arbitrary vector norm ||·|| the dual norm is defined by

||z||D = max{ |z*x| : ||x|| = 1 }.    (6.2)

It follows from the Hölder inequality that the dual of the p -norm is the q-norm,
where p- 1 +q -1 = 1. The definition of dual norm yields, trivially, the general
Hölder inequality |x*y| < ||x|| ||y||D . For a proof of the reassuring result that
the dual of the dual norm is the original norm (the “duality theorem”) see
Horn and Johnson [580, 1985, Thm. 5.5.14].
In some analyses we need the vector z dual to y, which is defined by the
property
z*y = ||z||D ||y || = 1. (6.3)
That such a vector z exists is a consequence of the duality theorem (see [580,
1985, Cor. 5.5.15]).
How much two p-norms of a vector can differ is shown by the attainable
inequalities [422, 1983, pp. 27-28], [459, 1983, Lem. 1.1]

(6.4)

The p-norms have the properties that ||x|| depends only on the absolute
value of x, and the norm is an increasing function of the absolute values of the
entries of x. These properties are important enough to warrant a definition.

Definition 6.1. A norm on is

1. monotone if |x| < |y| ||x|| < ||y|| for all x, y and
2. absolute if || |x| || = ||x|| for all x
The following nonobvious theorem shows that these two properties are
equivalent.

Theorem 6.2 (Bauer, Stoer, and Witzgall). A norm on is monotone if


and only if it is absolute.

Proof. See Horn and Johnson [580, 19 8 5 , Thm. 5.5.10], or Stewart and
Sun [954, 1990, Thm. 2.1.3].

6.2. Matrix Norms


A matrix norm is a function ||·|| : Cm×n → IR satisfying obvious analogues
of the three vector norm properties. The simplest example is the Frobenius
norm,

||A||F = (Σi,j |aij|^2)^(1/2)

(which is sometimes called the Euclidean norm and denoted ||·||E).


A very important class of matrix norms are those subordinate to vector
norms. Given a vector norm on Cn, the corresponding subordinate matrix
norm on Cm×n is defined by

||A|| = max{ ||Ax|| / ||x|| : x ≠ 0 },    (6.5)

or, equivalently, ||A|| = max{ ||Ax|| : ||x|| = 1 }.

(Strictly speaking, this definition uses two different norms: one on Cm in the
numerator of (6.5) and one on Cn in the denominator. Thus the norm used
in the definition is assumed to form a family defined on Cs for any s.)
For the 1-, 2-, and ∞-vector norms it can be shown that

||A||1 = maxj Σi |aij|   (max column sum),
||A||∞ = maxi Σj |aij|   (max row sum),
||A||2 = (ρ(A*A))^(1/2) = σmax(A),

where the spectral radius ρ(B) is the largest absolute value of an eigenvalue of B,
and where σmax(A) denotes the largest singular value of A. To remember
the formulae for the 1- and ∞-norms, note that 1 is a vertical symbol (for
columns) and ∞ is a horizontal symbol (for rows).
A norm is consistent if it satisfies ||AB|| < ||A|| ||B|| whenever the prod-
uct AB is defined. The Frobenius norm and all subordinate norms are con-
sistent. An example of a norm that is not consistent is the “max norm”
||A|| = maxi , j |a ij |. The best bound that holds for all A and
B is ||AB|| < n||A|| ||B||, with equality when aij 1 and bij 1.

Table 6.1. Constants apq such that ||x||p < apq||x||q, x

A norm for which ||UAV|| = ||A || for all unitary U and V is called a
unitarily invariant norm. These norms have an interesting theory, which we
will not explore here (see [581, 1991, §3.5] or [954, 1990, §2.3]). Only two
unitarily invariant norms will be needed for our analysis: the 2-norm and
the Frobenius norm. That these two norms are unitarily invariant follows
easily from the formulae above. For any unitarily invariant norm, the useful
property holds that ||A*|| = ||A||. The 2-norm satisfies the additional relation
||A*A||2 = ||A||2^2.
The unitary invariance of the 2- and Frobenius norms has implications for
error analysis, for it means that multiplication by unitary matrices does not
magnify errors. For example, if A is contaminated by errors E and
Q is unitary, then
Q(A+E)Q* = QAQ*+F,

and ||F||2 = ||QEQ*||2 = ||E||2. In contrast, if we do a general, nonsingular


similarity transformation

X(A+E)X -1 = XAX- 1 +G,

then ||G||2 = ||XEX^-1||2 < κ2(X)||E||2, where κ(X) = ||X|| ||X^-1|| is the
condition number of X. The condition number satisfies κ(X) > 1 (κF(X) > n^(1/2))
and can be arbitrarily large.
In perturbation theory and error analysis it is often necessary to switch
between norms. Therefore inequalities that bound one norm in terms of an-
other are required. It is well known that on a finite-dimensional space any
two norms differ by at most a constant that depends only on the dimension
(so-called norm equivalence). Tables 6.1 and 6.2 give attainable inequalities
for the vector and matrix norms of most interest.
The definition of subordinate matrix norm can be generalized by permit-
ting different norms on the input and output space:

||A||α,β = max{ ||Ax||β / ||x||α : x ≠ 0 }.    (6.6)

Table 6.2. Constants apq such that ||A||p < apq||A||q, A ∈ Cm×n. Here, ||A||M :=
maxi,j |aij| and ||A||S := Σi,j |aij|.

Note that, in general, the submultiplicative property ||AB||α,β < ||A||α,β ||B||α,β

does not hold, but we do have

||AB||α,β < ||A||γ,β ||B||α,γ    (6.7)

for any third vector norm ||·||γ. The choice α = 1 and β = ∞ produces the
max norm, mentioned above, ||A||1,∞ = maxi,j |aij|.
At least two important results about matrix norms hold for this mixed
subordinate norm. The first is a simple formula for the matrix condition
number of a nonsingular A defined by

Note that this definition uses the ||·||a,β norm on the data space and the
||·||β,a norm on the solution space, as is natural.
We need the following lemma.

Lemma 6.3. Given vector norms ||·||α and ||·||β and vectors x, y such
that ||x||α = ||y||β = 1, there exists a matrix B with ||B||α,β = 1 such that
Bx = y.
Proof. Recall that the dual of the α-norm is defined by ||z||αD =
max||w||α=1 |z*w|. Let z be a vector dual to x, so that z*x = ||z||αD ||x||α = 1,
and hence ||z||αD = 1. Let B = yz*. Then Bx = y and ||B||α,β = ||y||β ||z||αD = 1,

as required.

Theorem 6.4. For nonsingular A the matrix condition number


κα,β(A) satisfies
κα,β(A) = ||A||α,β ||A^-1||β,α.    (6.8)
Proof. In view of the expansion

(A + ∆A)^-1 - A^-1 = -A^-1 ∆A A^-1 + O(||∆A||^2),

the result is proved if we can show that

(6.9)

That (6.9) holds with the equality replaced by “<” follows from two applica-
tions of (6.7). To show the opposite inequality, we have

(6.10)

where, for the lower bound, we have chosen y so that ||A^-1y||α = ||A^-1||β,α,
and where A^-1y = ||A^-1||β,α x with ||x||α = 1. Now, from Lemma 6.3, there
exists a matrix ∆A with ||∆A||α,β = 1 such that ∆Ax = y. In (6.10) this
gives the required lower bound.
The next result concerns the relative distance to singularity for a matrix

It states that the relative distance to singularity is the reciprocal of the con-
dition number.

Theorem 6.5 (Gastinel, Kahan). For nonsingular A ∈ Cn×n we have

distα,β(A) = (||A||α,β ||A^-1||β,α)^-1 = κα,β(A)^-1.

Proof. If A + ∆A is singular, then there exists x ≠ 0 such that (A +


∆A)x = 0. Hence

giving

To show that a suitable perturbation achieves this lower bound, let y be


such that ||y||β = 1 and ||A^-1y||α = ||A^-1||β,α, and write x = A^-1y. By
Lemma 6.3 there exists B with ||B||α,β = 1 such that Bx/||x||α = -y. Letting
∆A = B/||x||α we have ||∆A||α,β/||A||α,β = κα,β(A)^-1, and A + ∆A is singular
because (A + ∆A)A^-1y = 0.

6.3. The Matrix p-Norm
The matrix p-norm is the norm subordinate to the Hölder p-norm:

||A||p = max{ ||Ax||p / ||x||p : x ≠ 0 }.    (6.11)

Formulae for ||A||p are known only for p = 1, 2, ∞. For other values of p, how
to estimate or compute ||A||p is an interesting problem, the study of which,
as well as being interesting in its own right, yields insight into the properties
of the 1-, 2-, and ∞-norms.
By taking x = ej in (6.11), using (6.4), and using (6.21) below, we can
derive the bounds, for A ∈ Cm×n,

(6.12)

(6.13)

Matrix norms can be compared using the following elegant result of Schnei-
der and Strang [900, 1962] (see also [580, 1985, Thm. 5.6.18]): if ||·||α and ||·||β
denote two vector norms and the corresponding subordinate matrix norms,
then for A ∈ Cm×n

(6.14)

From (6.4) and (6.14), we have, when m = n ,

(6.15)

Note that, unlike for vectors, p 1 < p 2 does not imply ||A||p 1 > ||A||p 2. The
result (6.15) implies, for example, that for all p > 1

(6.16)

(6.17)

Figure 6.1. Plots of p versus ||A||p, for 1 < p < 15. Fourth plot shows 1/ p versus
log ||A||p for the matrices in the first three plots.

Upper bounds for ||A||p that do not involve m or n can be obtained from
the interesting property that log ||A||p is a convex function of 1/p for p > 1
(see Figure 6.1), which is a consequence of the Riesz-Thorin theorem [503,
1952, pp. 214, 219], [450, 1991]. The convexity implies that if f(α) = ||A||1/α,
then for 0 < α, β < 1,

log f(θα + (1 - θ)β) < θ log f(α) + (1 - θ) log f(β),   0 < θ < 1.

Writing p1 = 1/α and p2 = 1/β, this inequality can be expressed as

(6.18)

Two interesting special cases are

(6.19)

and
(6.20)

Note that (6.19) includes the well-known inequality

||A||2 < (||A||1 ||A||∞)^(1/2).

Two further results that are familiar for p = 1, 2, ∞ are

(6.21)

(see, for example, [580, 1985, Thm. 5.6.36]) and

The bounds (6.16) and (6.17) imply that given the ability to compute
||A||1, ||A||2, and ||A||∞ we can estimate ||A||p correct to within a factor n^(1/4).
These a priori estimates are at their best when p is close to 1, 2, or ∞, but in
general they will not provide even one correct significant digit. The bound in
(6.18) can be much smaller than the other upper bounds given above, but how
tight it is depends on how nearly log ||A||p is linear in p. Numerical methods
are needed to obtain better estimates; these are developed in Chapter 14.

6.4. Notes and References


The matrix condition number appears to have been first introduced explicitly
by Turing [1027, 1948], who defined, for example, the N-condition number
of A ∈ IRn×n as n^-1 N(A)N(A^-1), where N(·) is Turing's notation for the
Frobenius norm. Todd [1003, 1968] gives a short survey of the matrix condi-
tion number with many references.
Theorem 6.2 was originally proved by Bauer, Stoer, and Witzgall, in a
paper that contains many interesting results on monotonic norms [84, 1961].
Tables of constants in inequalities between different norms have been given
by various authors: see, for example, Stone [957, 1962] and Zielke [1129, 1988].
Our development of the mixed subordinate norm ||·|| a ,β is based on that
of D. J. Higham [526, 1995].
Theorem 6.5 is proved by Kahan [626, 1966, pp. 775-776], who attributes
it to Gastinel but gives no reference. For the 2-norm, this result goes back
to a paper by Eckart and Young [334, 1936]. Theorem 6.5 is an instance
of a relationship that holds for many problems: the condition number is the
reciprocal of the distance to the nearest singular problem (one with an infinite
condition number). This relationship applies to matrix inversion, eigenvalue
and eigenvector computation, polynomial zero-finding, and pole assignment
in linear control systems. For an in-depth study see Demmel [281, 1987].
Direct proofs of inequality (6.19) can be found in Kato [646, 1976, p. 29]
and Todd [1006, 1977, pp. 25-26]. The inequality does not seem to be well
known.
For historical comments on the development of norms in numerical anal-
ysis, see Householder [587, 1964, Chap. 2] and Stewart and Sun [954, 1990,
Chap. 2].

Problems
Problems worthy
of attack
prove their worth
by hitting back.
-PIET HEIN, Grooks (1966)

6.1. Prove the inequalities given in Tables 6.1 and 6.2. Show that each
inequality in Table 6.2 (except the one for aS,2) is attainable for a matrix of the
form A = xy^T, where x, y ∈ {e, ej}, where e = [1, 1, . . . , 1]^T. Show that equal-
ity in ||A||S < aS,2 ||A||2 is attained for square real matrices A iff A is a scalar
multiple of a Hadamard matrix (see §9.3 for the definition of a Hadamard
matrix), and for square complex matrices if ars = exp(2πi(r - 1)(s - 1)/n)
(this is a Vandermonde matrix based on the roots of unity).

6.2. Let x, y ∈ Cn. Show that, for any subordinate matrix norm, ||xy*|| =
||x|| ||y||D.

6.3. Show that a subordinate matrix norm ||·|| on Cn×n satisfies

From ancient times until now the


study of magic squares has flourished as a kind of cult,
often with occult trappings, whose initiates range from
such eminent mathematicians as Arthur Cayley and Oswald Veblen
to laymen such as Benjamin Franklin.
-MARTIN GARDNER, More Mathematical Puzzles and Diversions (1961)

6.4. Let Mn denote a magic square matrix, that is, an n × n matrix

containing the integers from 1 to n^2 arranged in such a way that the row and
column sums are all the same. Let µn denote the magic sum of Mn (thus,
µn = n(n^2 + 1)/2). Show that ||Mn||p = µn for all 1 < p < ∞. (This result
is a special case of an apparently little-known result of Stoer and Witzgall,
which states that the norm of a doubly stochastic matrix is 1 for any norm
subordinate to a permutation-invariant absolute vector norm [956, 1962].)

6.5. Show that ||ABC||F < ||A||2 ||B||F||C|| 2 for any A, B, and C such that
the product is defined. (This result remains true when the Frobenius norm is
replaced by any unitarily invariant norm [581, 1991, p. 211].)

6.6. Show that for any nonsingular

6.7. Show that for any A ∈ Cn×n and any consistent matrix norm, ρ(A) <
||A||, where ρ is the spectral radius.
6.8. Show that for any A ∈ Cn×n and δ > 0 there is a consistent norm ||·||
(which depends on A and δ) such that ||A|| < ρ(A) + δ, where ρ is the spectral
radius. Hence show that if ρ(A) < 1 then there is a consistent norm ||·|| such
that ||A|| < 1.
6.9. Let A ∈ Cm×n. Use the SVD to find expressions for ||A||2 and ||A||F
in terms of the singular values of A. Hence obtain a bound of the form
c1 ||A|| 2 < ||A|| F < c2 ||A|| 2, where c 1 and c2 are constants that depend on n.
When is there equality in the upper bound? When is there equality in the
lower bound?
6.10. Show that

Deduce that when ||F||2 = 1, the norm is the golden ratio.


6.11. Let A ∈ Cm×n. Prove that (a) ||A||1,β = maxj ||A(:,j)||β, and (b)
||A||α,∞ = maxi ||A(i,:)*||αD. What is ||A||1,∞?
6.12. (Tao [994, 1984]) Show that if A is Hermitian positive definite then

(Rohn [879, 1995] shows that the problem of computing ||A||∞,1 is NP-hard.)
6.13. Prove that if H ∈ IRn×n is a Hadamard matrix then

||H||p = max{ n^(1/p), n^(1-1/p) }.

(See §9.3 for the definition of a Hadamard matrix.)


6.14. Show that if A ∈ IRm×n has at most µ nonzeros per row then

(6.22)

while if A has at most µ nonzeros per column then

(6.23)

where p -1 + q-1 = 1. (These inequalities generalize (6.12) and (6.13).)


6.15. Show that if A ∈ Cn×n then for any p-norm (1 < p < ∞),

6.16. Define the function v : Cn → IR by

Is v a vector norm on Cn? Derive an explicit expression for



Chapter 7
Perturbation Theory for Linear
Systems

Our hero is the intrepid, yet sensitive matrix A.


Our villain is E, who keeps perturbing A.
When A is perturbed he puts on a crumpled hat: Ã = A + E.
-G. W. STEWART and JI-GUANG SUN, Matrix Perturbation Theory (1990)

The expression ‘ill-conditioned’ is sometimes used merely as a


term of abuse applicable to matrices or equations . . .
It is characteristic of ill-conditioned sets of equations that
small percentage errors in the coefficients given may lead to
large percentage errors in the solution.
-A. M. TURING, Rounding-Off Errors in Matrix Processes (1948)


In this chapter we are concerned with a linear system Ax = b, where A ∈
IRn×n. In the context of uncertain data or inexact arithmetic there are three
important questions:
(1) How much does x change if we perturb A and b ; that is, how sensitive
is the solution to perturbations in the data?
(2) How much do we have to perturb the data A and b for an approximate
solution y to be the exact solution of the perturbed system, in other words,
what is the backward error of y?
(3) What bound should we compute in practice for the forward error of a
given approximate solution?
To answer these questions we need both normwise and componentwise
perturbation theory.

7.1. Normwise Analysis


First, we present some classical normwise perturbation results. We denote by
||·|| any vector norm and the corresponding subordinate matrix norm. As
usual, κ(A) = ||A|| ||A^-1|| is the matrix condition number. Throughout this
chapter the matrix E and the vector f are arbitrary and represent tolerances
against which the perturbations are measured (their role becomes clear when
we consider componentwise results).
Our first result makes precise the intuitive feeling that if the residual is
small then we have a “good” approximate solution.

Theorem 7.1 (Rigal and Gaches). The normwise backward error

ηE,f(y) := min{ ε : (A + ∆A)y = b + ∆b, ||∆A|| < ε||E||, ||∆b|| < ε||f|| }
(7.1)
is given by
ηE,f(y) = ||r|| / (||E|| ||y|| + ||f||),    (7.2)

where r = b - Ay.
Proof. It is straightforward to show that the right-hand side of (7.2) is a
lower bound for ηE , f (y ). This lower bound is attained for the perturbations

(7.3)

where z is a vector dual to y (see §6.1).


For the particular choice E = A and f = b, η E , f (y) is called the normwise
relative backward error.
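In the ∞-norm, the formula (7.2) is trivial to evaluate; a numpy sketch
(illustrative only, with a function name of our own and E = A, f = b as
default tolerances) is:

    import numpy as np

    def normwise_backward_error(A, b, y, E=None, f=None):
        # eta_{E,f}(y) = ||r|| / (||E|| ||y|| + ||f||), r = b - A y,
        # evaluated in the infinity norm (Theorem 7.1).
        if E is None:
            E = A
        if f is None:
            f = b
        r = b - A @ y
        nrm = lambda v: np.linalg.norm(v, np.inf)
        return nrm(r) / (nrm(E) * nrm(y) + nrm(f))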
The next result measures the sensitivity of the system.

Theorem 7.2. Let Ax = b and (A + ∆A)y = b + ∆b, where ||∆A|| < ε||E||
and ||∆b|| < ε||f||, and assume that ε||A^-1|| ||E|| < 1. Then

(7.4)

and this bound is attainable to first order in ε.

Proof. The bound (7.4) follows easily from the equation A( y - x) =


∆b - ∆Ax + ∆A(x - y). It is attained to first order in ε for ∆A = ε||E|| ||x||wv^T
and ∆b = ε||f||w, where ||w|| = 1, ||A^-1w|| = ||A^-1||, and v is a vector dual
to x.
Associated with the way of measuring perturbations used in these two
theorems is the normwise condition number

Because the bound of Theorem 7.2 is sharp, it follows that

For the choice E = A and f = b we have κ (A) < κE,f (A, x) < 2κ(A), and the
bound (7.4) can be weakened slightly to yield the familiar form

A numerical example illustrates the above results. Let A be the 8 × 8


Vandermonde matrix with (i, j) element j^(2(i-1)), and let b = e1 be the first unit
vector, so that x is the first column of A^-1. We take y to be the approximate
solution to Ax = b computed by Gaussian elimination with partial pivoting.
Computations are performed in MATLAB (u ≈ 1.1 × 10^-16). We find that
ηA,b(y) = 3.05 × 10^-21 for the ∞-norm, and κ∞(A) = 1.68 × 10^13. This
is an admirably small backward error, but it may be uninformative for two
reasons. First, the elements of A vary over 12 orders of magnitude, so while
our backward error perturbations are small compared with the large elements
of A, we may be making large perturbations in the small elements (indeed we
are in this particular example). Second, we are perturbing the zero elements
of b (as can be seen from (7.3) together with the fact that for this example
the residual r has no zero entries); this is unsatisfactory if we wish to regard
y as the first column of the inverse of a perturbed matrix.

Next, let b = Ae, where e = [1, 1, . . . , 1]^T, and let z be the solution to the
perturbed system (A + ∆A)z = b + ∆b, where ∆A = tol|A| and ∆b = tol|b|,
with tol = 8u. We find that

(7.5)

while the corresponding bound from (7.4) with ε = tol, E = A, and f = b is


3.03 × 10^-2. Thus the normwise forward error bound is extremely pessimistic
for this special choice of perturbation.
To obtain a more satisfactory backward error measure and a sharper per-
turbation bound in this example, we need componentwise analysis.

7.2. Componentwise Analysis


The componentwise backward error is defined as
wE,f(y) := min{ ε : (A + ∆A)y = b + ∆b, |∆A| < εE, |∆b| < εf },    (7.6)
where E and f are now assumed to have nonnegative entries. Inequalities
between matrices or vectors are understood to hold componentwise. In this
definition each element of a perturbation is measured relative to its individual
tolerance, so, unlike in the normwise definition, we are making full use of the
n 2 + n parameters in E and f.
How should E and f be chosen? The most common choice of tolerances is
E = |A| and f = |b|, which yields the componentwise relative backward error.
For this choice

in (7.6), and so if wE,f (y) is small then y solves a problem that is close to
the original one in the’ sense of componentwise relative perturbations and has
the same sparsity pattern. Another attractive property of the componentwise
relative backward error is that it is insensitive to the scaling of the system: if
Ax = b is scaled to (S1AS2)(S2^-1x) = S1b, where S1 and S2 are diagonal, and
y is scaled to S2^-1y, then w remains unchanged.
The choice E = |A|ee^T, f = |b| gives a row-wise backward error. The
constraint |∆A| < εE is now |∆aij| < εαi, where αi is the 1-norm of the ith
row of A, so perturbations to the ith row of A are being measured relative to
the norm of that row. A columnwise backward error can be formulated in a
similar way, by taking E = ee^T|A| and f =
The third natural choice of tolerances is E = ||A||eeT and f = ||b||e, for
which wE,f(y) is the same as the normwise backward error ηE,f(y), up to a
constant.
As for the normwise backward error in Theorem 7.1, there is a simple
formula for wE,f(y).

Theorem 7.3 (Oettli and Prager). The componentwise backward error is


given by

wE,f(y) = maxi |ri| / (E|y| + f)i,    (7.7)

where r = b - Ay, and ξ/0 is interpreted as zero if ξ = 0 and infinity otherwise.

Proof. It is easy to show that the right-hand side of (7.7) is a lower bound
for w(y), and that this bound is attained for the perturbations

∆A = D1ED2,   ∆b = -D1f,    (7.8)

where D1 = diag(ri /(E|y| + f)i) and D2 = diag(sign(yi)).

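A corresponding numpy sketch of the Oettli-Prager formula (7.7), with the
componentwise relative choice E = |A|, f = |b| as the default (the function
name is ours):

    import numpy as np

    def componentwise_backward_error(A, b, y, E=None, f=None):
        # w_{E,f}(y) = max_i |r_i| / (E|y| + f)_i, with 0/0 read as 0 and
        # a nonzero numerator over a zero denominator read as infinity.
        if E is None:
            E = np.abs(A)
        if f is None:
            f = np.abs(b)
        r = np.abs(b - A @ y)
        denom = E @ np.abs(y) + f
        with np.errstate(divide="ignore", invalid="ignore"):
            ratios = np.where(denom > 0, r / denom,
                              np.where(r > 0, np.inf, 0.0))
        return float(ratios.max())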

The next result gives a forward error bound corresponding to the compo-
nentwise backward error.

Theorem 7.4. Let Ax = b and (A + ∆A)y = b + ∆b, where |∆A| < εE and
|∆b| < εf, and assume that ε|| |A^-1|E || < 1, where ||·|| is an absolute norm.
Then
(7.9)

and for the ∞-norm this bound is attainable to first order in ε.

Proof. The bound (7.9) follows easily from the equation A(y - x) =
∆b - ∆Ax + ∆A(x - y). For the ∞-norm the bound is attained, to first
order in ε, for ∆A = εD1ED2 and ∆b = εD1f, where D2 = diag(sign(xi))
and D 1 = diag where = sign( A - 1 ) k j and

Theorem 7.4 implies that the condition number

is given by
(7.10)

For the special case E = |A| and f = |b| we have the condition numbers
introduced by Skeel [919, 1979]:

which differs from cond|A|,|b|(A, x) by at most a factor 2, and

(7.11)

How does cond compare with κ? Since cond(A) is invariant under row
scaling Ax = b → (DA)x = Db, where D is diagonal, it can be arbitrarily
smaller than κ∞(A). In fact, it is straightforward to show that

(7.12)

where the optimal scaling DR equilibrates the rows of A, that is, DRA has
rows of unit 1-norm (DR|A|e = e).
Chandrasekaran and Ipsen [197, 1995] note the following inequalities. First,
with DR as just defined,

(7.13)

(these inequalities imply (7.12)). Thus cond(A) can be much smaller than
κ∞(A) only when the rows of A are badly scaled. Second, if DC equilibrates
the columns of A (e^T|A|DC = e^T) then

These inequalities show that cond(A, x) can be much smaller than κ∞(A) only
when the columns of either A or A^-1 are badly scaled.
Returning to the numerical example of §7.1, we find that wE,f (y) = 1.10 ×
10^-12 for E = |A| and f = |b| or f = 0. This tells us that if we measure
changes to A in a componentwise relative sense, then for y to be the first
column of the inverse of a perturbed matrix we must make relative changes to
A four orders of magnitude larger than the unit roundoff. For the perturbed
system, Theorem 7.4 with ε = tol, E = |A|, and f = |b| gives the bound

which is eight orders of magnitude smaller than the normwise bound from
Theorem 7.2, and only a factor 170 larger than the actual forward error (7.5).
An example of Kahan [626, 1966] is also instructive. Let

(7.14)

where so that x = The normwise condition number


so the system is very sensitive to arbitrary perturbations
in A and b. Moreover,

so cond(A) = 3 + which implies that the system is also very sen-


sitive to componentwise perturbations for some right-hand sides. However,
cond(A, x) = 5/2 + so for this particular b the system is very well condi-
tioned under componentwise perturbations.
A word is in order concerning the choice of condition number. Every
condition number for a linear system is defined with respect to a particular
class of perturbations. It is important to use the right condition number for
the occasion. For example, if x̂ is a computed solution to Ax = b and we
know its normwise backward error η_{A,b}(x̂), then it is the normwise condition
number κ(A) that appears in the relevant forward error bound (multiplying
η_{A,b}(x̂)) and therefore tells us something about the accuracy of x̂. The
componentwise condition number cond(A, x) is relevant only if we are dealing with the
componentwise relative backward error, w_{|A|,|b|}(x̂). Looked at another way,
each algorithm has an associated error analysis that determines the condition
number relevant to that algorithm.

7.3. Scaling to Minimize the Condition Number


In the last section we noted the invariance of cond(A) under row scaling, which
contrasts with the strong dependence of κ_∞(A) upon the row scaling. The
opportunity to scale the rows or columns of A arises in various applications,
so we now take a closer look at the effect of scaling on the normwise condition
number.
First, we consider one-sided scaling, by giving a generalization of a well-
known result of van der Sluis [1039, 1969]. It shows that, for one-sided scaling
in a Hölder p-norm, equilibrating the rows or columns is a nearly optimal
strategy. We state the result for rectangular matrices A, for which we define
where A+ is the pseudo-inverse of A (see Problem 19.3).

Theorem 7.5 (van der Sluis). Let have full rank, let
denote the set of nonsingular diagonal matrices, and define

Then

(7.15)

(7.16)

Proof. For any X we have, from (6.12),

(7.17)

Therefore
(7.18)
Now, for any D

(7.19)

using the first inequality in (7.17). Multiplying (7.18) and (7.19) and min-
imizing over D, we obtain (7.15). Inequality (7.16) follows by noting that
κp (DA) = κq (A T D), where p -1 + q -1 = 1 (see (6.21)).
For p = ∞, (7.16) confirms what we already know from (7.12) and (7.13):
that in the ∞-norm, row equilibration is an optimal row scaling strategy.
Similarly, for p = 1, column equilibration is the best column scaling, by
(7.15). Theorem 7.5 is usually stated for the 2-norm, for which it shows that
row and column equilibration produce condition numbers within factors √m
and √n, respectively, of the minimum 2-norm condition numbers achievable
by row and column scaling.
As a corollary of Theorem 7.5 we have the result that among two-sided
diagonal scalings of a symmetric positive definite matrix, the one that gives
A a unit diagonal is not far from optimal.

Corollary 7.6 (van der Sluis). Let A ∈ R^{n×n} be symmetric positive definite
and let D* = diag(a_ii)^{-1/2}. Then

    κ_2(D*AD*) ≤ n min{ κ_2(DAD) : D diagonal, nonsingular }.      (7.20)

Proof. Let A = RTR be a Cholesky factorization, note that κ2 (DAD) =


κ2 (RD)2, and apply Theorem 7.5 to RD.
Is the scaling D * in Corollary 7.6 ever optimal? Forsythe and Straus [386,
1955] show that it is optimal if A is symmetric positive definite with property

A (that is, there exists a permutation matrix P such that PAP T can be
expressed as a block 2 × 2 matrix whose (1,1) and (2,2) blocks are diagonal).
Thus, for example, any symmetric positive definite tridiagonal matrix with
unit diagonal is optimally scaled.
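The practical import of Corollary 7.6 is that the cheap scaling D* is essentially as good as the unknown optimal one. A hedged NumPy sketch (ours; the test matrix is an arbitrary construction):

import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)                      # symmetric positive definite
S = np.diag(10.0 ** rng.uniform(-4, 4, 5))
A_bad = S @ A @ S                                # badly scaled, still SPD
Dstar = np.diag(1.0 / np.sqrt(np.diag(A_bad)))   # D*: gives unit diagonal
print(np.linalg.cond(A_bad))                     # large
print(np.linalg.cond(Dstar @ A_bad @ Dstar))     # within a factor n of the optimum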
We note that by using (6.22) in place of (7.17), the inequalities of Theo-
rem 7.5 and Corollary 7.6 can be strengthened by replacing m and n with the
maximum number of nonzeros per column and row, respectively.
Here is an independent result for the Frobenius norm.

Theorem 7.7 (Stewart and Sun). Let A = [a 1, . . . , an] be nonsin-


gular, with B := A-1 = [b 1, . . . , bn ]T, and let DC
Then

Proof. For D = diag(di ) we have, using the Cauchy-Schwarz


inequality,

with equality if d j||aj||2 = for all j, for some a 0. There is


equality for
As we have seen in this and the previous section, the minimum value of
κ_∞(DA) over diagonal D is cond(A) = || |A^{-1}||A| ||_∞. The next result shows that for two-sided scalings
the matrix |A^{-1}||A| again features in the formula for the minimal condition
number. A matrix is irreducible if it cannot be symmetrically permuted to
block triangular form. A Perron vector of B ≥ 0 is a nonnegative eigenvector
corresponding to the eigenvalue ρ(B), where ρ denotes the spectral radius.

Theorem 7.8 (Bauer). Let A ∈ R^{n×n} be nonsingular and suppose that |A||A^{-1}|
and |A^{-1}||A| are irreducible. Then

    min{ κ_∞(D1AD2) : D1, D2 diagonal, nonsingular } = ρ(|A^{-1}||A|).      (7.21)

The minimum is attained for D1 = diag(x)^{-1} and D2 = diag(|A^{-1}|x), where
x > 0 is a right Perron vector of |A||A^{-1}| (so that |A||A^{-1}|x = ρ(|A||A^{-1}|)x).
Proof. See Problem 7.9.
For the Kahan example (7.14),

and, in fact, 3 for D = so a symmetric


two-sided scaling is nearly optimal in this case.

7.4. The Matrix Inverse


We briefly discuss componentwise perturbation theory for the matrix inverse.
With X := A^{-1} and X + ∆X := (A + ∆A)^{-1}, a componentwise condition
number is

    µ_E(A) := lim_{ε→0} sup{ ||∆X||_∞ / (ε||X||_∞) : |∆A| ≤ εE } ≤ || |A^{-1}|E|A^{-1}| ||_∞ / ||A^{-1}||_∞.      (7.22)

In general, the inequality is strict, but there is equality when |A^{-1}| = D1A^{-1}D2
for D_i of the form diag(±1) [407, 1992, Thm. 1.10], [439, 1982]. Another
componentwise condition number is evaluated in Problem 7.10. We saw in
Theorem 6.5 that the reciprocal of the normwise condition number for matrix
inversion is the normwise relative distance to singularity. Is the same true
for an appropriate componentwise condition number? The componentwise
distance to singularity,

    d_E(A) := min{ ε : A + ∆A singular, |∆A| ≤ εE },

has been characterized by Rohn [877, 1989], [878, 1990] as

    d_E(A) = 1 / max_{S1,S2} ρ_0(S1 A^{-1} S2 E),

where the maximum is taken over all signature matrices S_i = diag(±1) and
where

    ρ_0(X) = max{ |λ| : λ is a real eigenvalue of X }.

This formula involves 4^n eigenproblems and thus is computationally intractable


(in fact it has been shown to be NP-hard by Poljak and Rohn [836, 1993]).
Demmel [285, 1992] shows by complexity arguments that there can be no
simple relationship between d_E(A) and the quantity || |A^{-1}|E|A^{-1}| ||/||A^{-1}||, which is an
upper bound for µ_E(A). He also presents evidence for the conjecture that

    1/ρ(|A^{-1}||A|)  ≤  d_{|A|}(A)  ≤  γ_n/ρ(|A^{-1}||A|)

for a constant γ_n. The lower bound always holds and Demmel identifies
several classes of matrices for which the upper bound holds. This conjecture
is both plausible and aesthetically pleasing because d_{|A|}(A) is invariant under
two-sided diagonal scalings of A and ρ(|A^{-1}||A|) is the minimum ∞-norm
condition number achievable by such scalings, as shown by Theorem 7.8.

7.5. Extensions
The componentwise analyses can be extended in three main ways.

(1) We can use more general measures of size for the data and the solution.
Higham and Higham [528, 1992] measure ∆A , ∆b, and ∆x by

where and the e ij , fi , and gi are


tolerances. They show that the corresponding backward error is given by the
explicit formula

where r = b - Ay, Dj = diag(e j1, . . . , ejn, fj), and p -1 + q -1 = 1; bounds


for the corresponding condition number are also obtained. Theorem 7.3, and
Theorem 7.4 with the -norm, correspond to p = and gi If
we take p = and g = |x|, we are measuring the change in the solution in
a componentwise relative sense, as and the
condition number is [528, 1992]

This latter case has also been considered by Rohn [876, 1989] and Gohberg
and Koltracht [455, 1993]. It is also possible to obtain individual bounds
for by refraining from taking norms in the analysis: see
Chandrasekaran and Ipsen [197, 1995] and Problem 7.1.
(2) The backward error results and the perturbation theory can be ex-
tended to systems with multiple right-hand sides. For the general vp measure
described in (1), the backward error can be computed by finding the mini-
mum p-norm solutions to n underdetermined linear systems. For details, see
Higham and Higham [528, 1992].
(3) Structure in A and b can be preserved in the analysis. For example, if A
is symmetric or Toeplitz then its perturbation can be forced to be symmetric
or Toeplitz too, while still using componentwise measures. References include
Higham and Higham [527, 1992] and Gohberg and Koltracht [455, 1993] for
linear structure, and Bartels and D. J. Higham [76, 1992] for Vandermonde
structure. A symmetry-preserving normwise backward error is explored by
Bunch, Demmel, and Van Loan [163, 1989], while Smoktunowicz [930, 1995]
considers the componentwise case (see Problem 7.11). Symmetry-preserving
normwise condition numbers are treated by D. J. Higham [526, 1995].

7.6. Numerical Stability


The backward errors examined in this chapter lead to definitions of numerical
stability of algorithms for solving linear systems. Precise and formal defi-
nitions of stability can be given, but there are so many possibilities, across

different problems, that to define and name each one tends to cloud the issues
of interest. We therefore adopt an informal approach.
A numerical method for solving a square, nonsingular linear system Ax = b
is normwise backward stable if it produces a computed solution x̂ such that
η_{A,b}(x̂) is of order the unit roundoff. How large we allow η_{A,b}(x̂) to be,
while still declaring the method backward stable, depends on the context. It
is usually implicit in this definition that η_{A,b}(x̂) = O(u) for all A and b, and
a method that yields η_{A,b}(x̂) = O(u) for a particular A and b is said to have
performed in a normwise backward stable manner.
The significance of normwise backward stability is that the computed so-
lution solves a slightly perturbed problem, and if the data A and b contain
uncertainties bounded only normwise with
and similarly for b), then may be the exact solution to the problem we
wanted to solve, for all we know.
Componentwise backward stability is defined in a similar way: we now require
the componentwise backward error w_{|A|,|b|}(x̂) to be of order u. This is a
more stringent requirement than normwise backward stability. The rounding
errors incurred by a method that is componentwise backward stable are in
size and effect equivalent to the errors incurred in simply converting the data
A and b to floating point numbers before the solution process begins.
If a method is normwise backward stable then, by Theorem 7.2, the forward
error ||x - x̂||/||x|| is bounded by a multiple of κ(A)u. However, a method
can produce a solution whose forward error is bounded in this way without the
normwise backward error η_{A,b}(x̂) being of order u. Hence it is useful to define
a method for which ||x - x̂||/||x|| = O(κ(A)u) as normwise forward stable.
By similar reasoning involving w_{|A|,|b|}(x̂), we say a method is componentwise
forward stable if ||x - x̂||/||x|| = O(cond(A, x)u). Table 7.1 summarizes the
definitions and the relations between them. There are several examples in
this book of linear-equation-solving algorithms that are forward stable but
not backward stable: Cramer's rule for n = 2 (§1.10.1), Gauss-Jordan elimination
(§13.4), and the seminormal equations method for underdetermined
systems (§20.3).
Other definitions of numerical stability can be useful (for example, rowwise
backward stability means that the backward error w_{E,f}(x̂) with E = |A|ee^T
and f = |b| is of order u), and they will be introduced when needed.

7.7. Practical Error Bounds

Suppose we have a computed approximation x̂ to the solution of a linear
system Ax = b, where A ∈ R^{n×n} is nonsingular. What error bounds should we compute?

Table 7.1. Backward and forward stability.

Componentwise backward stability  =>  Componentwise forward stability
              ⇓                                     ⇓
Normwise backward stability       =>  Normwise forward stability

The backward error can be computed exactly, from the formulae

    η_{A,b}(x̂) = ||r|| / (||A|| ||x̂|| + ||b||),    w_{|A|,|b|}(x̂) = max_i |r_i| / (|A||x̂| + |b|)_i,      (7.23)

at the cost of one or two matrix-vector products, for r = b - Ax̂ and |A||x̂|.
The only question is what to do if the denominator is so small as to cause
overflow or division by zero in the expression for w_{|A|,|b|}(x̂). This could happen,
for example, when E = |A| and f = |b| and, for some i, a_ij x̂_j = 0 for
all j, as is most likely in a sparse problem. LAPACK's xyyRFS ("refine solution")
routines apply iterative refinement in fixed precision, in an attempt
to satisfy w_{|A|,|b|}(x̂) = O(u). If the ith component of the denominator in (7.23)
is less than safe_min/u, where safe_min is the smallest number such that
1/safe_min does not overflow, then they add (n + 1) safe_min to the ith
components of the numerator and denominator. A more sophisticated strategy is
advocated for sparse problems by Arioli, Demmel, and Duff [24, 1989]. They
suggest modifying the formula (7.23) by replacing |b_i| in the denominator by
a quantity of the size of the ith row of |A| times ||x̂||_∞ when the ith denominator is very small. See [24, 1989] for
details and justification of this strategy.
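A NumPy sketch in the spirit of this safeguard (ours, not the LAPACK code; safe_min is approximated by the smallest normalized floating point number):

import numpy as np

def guarded_componentwise_backward_error(A, b, xhat):
    # Evaluate w_{|A|,|b|}(xhat) from (7.23), perturbing tiny denominators so
    # that the division cannot overflow or produce 0/0.
    n = A.shape[0]
    u = np.finfo(float).eps / 2                  # unit roundoff
    safe_min = np.finfo(float).tiny              # 1/safe_min does not overflow
    r = b - A @ xhat
    num = np.abs(r)
    denom = np.abs(A) @ np.abs(xhat) + np.abs(b)
    small = denom < safe_min / u
    num[small] += (n + 1) * safe_min
    denom[small] += (n + 1) * safe_min
    return np.max(num / denom)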
Turning to the forward error, one approach is to evaluate the forward error
bound from Theorem 7.2 or Theorem 7.4, with ε equal to the corresponding
backward error. Because x in (7.9) is unknown, we should use the modified
bound

    ||x - x̂||_∞ / ||x̂||_∞  ≤  ε || |A^{-1}|E|x̂| + |A^{-1}|f ||_∞ / ||x̂||_∞.      (7.24)

If we have a particular E and f in mind for backward error reasons, then it is


natural to use them in (7.24). However, the size of the forward error bound
varies with E and f, so it is natural to ask which choice minimizes the bound.

Lemma 7.9. The upper bound in (7.24), with ε = w_{E,f}(x̂), is at least as large as the upper bound
in

    ||x - x̂||_∞ / ||x̂||_∞  ≤  || |A^{-1}||r| ||_∞ / ||x̂||_∞,      (7.25)

and is equal to it when E|x̂| + f is a multiple of |r|.

Proof. First note that r = b - Ax̂ implies x - x̂ = A^{-1}r, which implies
(7.25). Now, for z > 0,

    |A^{-1}||r|  ≤  max_i (|r_i|/z_i) |A^{-1}|z,

with equality if z is a multiple of |r|. Taking z = E|x̂| + f gives

    |A^{-1}||r|  ≤  w_{E,f}(x̂) |A^{-1}|(E|x̂| + f),

with equality when E|x̂| + f is a multiple of |r|. The truth of this statement
is preserved when ∞-norms are taken, so the result follows.
Since the bound (7.25) is obtained by taking absolute values in the equation
x - x̂ = A^{-1}r, it is clearly the smallest possible such bound subject
to ignoring signs in A^{-1} and r. It is reasonable to ask why we do not take
||A^{-1}r||_∞/||x̂||_∞ as our error bound. (Theoretically it is an exact bound!)
The reason is that we cannot compute r or A^{-1}r exactly. In place of r
we compute r̂ = fl(b - Ax̂), and

    r̂ = r + ∆r,    |∆r| ≤ γ_{n+1}(|A||x̂| + |b|).      (7.26)

Therefore a strict bound, and one that should be used in practice in place of
(7.25), is

    ||x - x̂||_∞ / ||x̂||_∞  ≤  || |A^{-1}|(|r̂| + γ_{n+1}(|A||x̂| + |b|)) ||_∞ / ||x̂||_∞.      (7.27)

This forward error bound is estimated and returned by LAPACK‘s xyyRFS


routines. For details on how this is done without computing A -1, see Chap-
ter 14.
The LAPACK linear equation solvers estimate only one condition number:
the standard condition number κ1 (A) (or, rather, its reciprocal, referred to as
rcond), which is returned by the xyyCON routines.

7.8. Perturbation Theory by Calculus


The perturbation results in this book are all derived algebraically, without any
use of derivatives. Calculus can also be used to derive perturbation bounds,
often in a straight forward fashion.

As a simple example, consider a linear system A(t)x(t) = b(t), where
A(t) ∈ R^{n×n} and x(t), b(t) are assumed to be continuously differentiable
functions of t. Differentiating gives

    A'(t)x(t) + A(t)x'(t) = b'(t),

or, dropping the t arguments,

    x' = A^{-1}(b' - A'x).

Taking norms, we obtain

    ||x'|| / ||x||  ≤  κ(A) ( ||b'|| / (||A|| ||x||) + ||A'|| / ||A|| ).

This bound shows that κ(A) is a key quantity in measuring the sensitivity of
a linear system. A componentwise bound could have been obtained just as
easily.
We normally express perturbations of the data in the form
To use the calculus framework we can take A(0) as the original matrix A and
write but the perturbation bound
then becomes a first-order one.
The calculus technique is a useful addition to the armoury of the error
analyst (it is used by Golub and Van Loan [470, 1989], for example), but the
algebraic approach is preferable for deriving rigorous perturbation bounds of
the standard forms.

7.9. Notes and References


This chapter draws on the survey paper Higham [558, 1994].
Theorem 7.3 is due to Oettli and Prager [802, 196 4], and predates the
normwise backward error result Theorem 7.1 of Rigal and Gaches [873, 1967].
In addition to Theorem 7.1, Rigal and Gaches give a more general result
based on norms of blocks that includes Theorems 7.3 and 7.1 as special cases.
Theorem 7.1 is also obtained by Kovarik [672, 1976].
Theorems 7.1 and 7.3 both remain valid when A is rectangular. Compo-
nentwise backward error for rectangular A was considered by Oettli, Prager,
and Wilkinson [803, 1965], but their results are subsumed by those of Oettli
and Prager [802, 1964] and Rigal and Gaches [873, 1967].
For a linear system Ax = b subject to componentwise perturbations, Oet-
tli [801, 1965] shows how linear programming can be used to obtain bounds
on the components of x when all solutions of the perturbed system lie in the
same orthant. Cope and Rust [244, 1979] extend this approach by showing, in
general, how to bound all the solutions that lie in a given orthant. This type

of analysis can also be found in the book by Kuperman [681, 1971], which
includes an independent derivation of Theorem 7.3. See also Hartfiel [505,
1 9 8 0 ].
Theorem 7.4 is a straightforward generalization of a result of Skeel [919,
1979 , Thms. 2.1 and 2.2]. It is clear from Bauer’s comments in [80, 19 66]
that the bound (7.9), with E = |A| and f = |b|, was known to him, though
he does not state the bound. This is the earliest reference we know in which
componentwise analysis is used to derive forward perturbation bounds.
Theorem 7.8 is from Bauer [79, 1963]. Bauer actually states that equality
holds in (7.21) for any A, but his proof of equality is valid only when |A - 1||A|
and |A||A- 1| have positive Perron vectors. Businger [168, 1968] proves that
a sufficient condition for the irreducibility condition of Theorem 7.8 to hold
(which, of course, implies the positivity of the Perron vectors) is that there do
not exist permutations P and Q such that PAQ is in block triangular form.
Theorem 7.7 is from Stewart and Sun [954, 1990, Thm. 4.3.5].
Further results on scaling to minimize the condition number κ(A) are given
by Forsythe and Straus [386, 1955], Bauer [81, 1969], Golub and Varah [465,
1974], McCarthy and Strang [742, 1974], Shapiro [913, 1982]. [914, 1985], [915,
1991], and Watson [1067, 1991].
Chan and Foulser [193, 1988] introduce the idea of “effective conditioning”
for linear systems, which takes into account the projections of b onto the range
space of A. See Problem 7.5, and for an application to partial differential
equations see Christiansen and Hansen [208, 1994].
For an example of how definitions of numerical stability for linear equa-
tion solvers can be extended to incorporate structure in the problem, see
Bunch [162, 1987].
An interesting application of linear system perturbation analysis is to
Markov chains. A discrete-time Markov chain can be represented by a square
matrix P, where pij is the probability of a transition from state i to state j .
Since state i must lead to some other state, Σ_j p_ij = 1 for each i, and these conditions
can be written in matrix-vector form as
Pe = e. (7.28)

A nonnegative matrix satisfying (7.28) is called a stochastic matrix. The


initial state of the Markov chain can be defined by a vector z^T, where z_i
denotes the probability that the ith state of the chain is occupied. Then the
state of the chain at the next time unit is given by z^TP. The steady state or
stationary vector of the chain is given by

    π^T = lim_{k→∞} z^T P^k.
An important question is the sensitivity of the individual components of the
steady-state vector to perturbations in P. This is investigated, for example,
by Ipsen and Meyer [605, 1994], who measure the perturbation matrix normwise,
and by O'Cinneide [800, 1993], who measures the perturbation matrix
componentwise. For a matrix-oriented development of Markov chain theory
see Berman and Plemmons [94, 1994].
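For readers who wish to experiment, here is a minimal NumPy sketch (ours, not from the book) that computes the stationary vector of a small stochastic matrix as the left eigenvector belonging to the eigenvalue 1:

import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])        # stochastic: rows sum to 1, so Pe = e
vals, vecs = np.linalg.eig(P.T)        # left eigenvectors of P
k = np.argmin(np.abs(vals - 1.0))
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                      # stationary vector: pi^T P = pi^T
print(pi)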
It is possible to develop probabilistic perturbation theory for linear systems
and other problems by making assumptions about the statistical distribution
of the perturbations. We do not consider this approach here (though see Prob-
lem 7.13), but refer the interested reader to the papers by Fletcher [376, 1985],
Stewart [948, 1990], and Weiss, Wasilkowski, Wozniakowski, and Shub [1073,
19 86].

Problems
7.1. Under the conditions of Theorem 7.4, show that

Hence derive a first-order bound for |xi - yi |/|xi |.


7.2. Let Ax = b, where Show that for any vector y and any
subordinate matrix norm,

where the residual r = b - Ay. Interpret this result.


7.3. Prove (7.13) and deduce (7.12).
7.4. Let be symmetric positive definite and let A = DHD,
where D = (this is the scaling used in Corollary 7.6). Show that
cond(H) < < n cond(H).
7.5. (Chan and Foulser [193, 1988]) Let have the SVD A =
where and define the projection matrix Pk :=
where Uk = U(:,n + 1 - k:n). Show that if Ax = b and A(x + ∆x) =
(b + ∆b) then

What is the interpretation of this result?


7.6. (a) For the choice of tolerances E = |A|eeT, f = |b|, corresponding to a
row-wise backward error, show that

(b) For E = eeT|A| and f = corresponding to a columnwise back-


ward error, show that

7.7. Show that

7.8. Let be nonsingular. A componentwise condition number for


the problem of computing cTx, where Ax = b, can be defined by

Obtain an explicit formula for x E,f (A , x). Show that xE,f(A, x) > 1 if
E = |A| or f = |b|. Derive the corresponding normwise condition number
in which the constraints are ||∆A||2 < and ||∆b|| 2 <
7.9. (Bauer [79, 1963]) Let A, B, C (a) Prove that if B and C have
positive elements then

where = {diag(d i ) : di > 0, i = 1:n}. (Hint: consider D 1 = diag(x1 ) - 1


and D 2 = diag(Cx1), where x1 > 0 is a right Perron vector of BC: BCx1 =
p (B C) x1 . )
(b) Deduce that if |A| and |A - 1 | have positive entries, then

(c) Show that for any nonsingular A,

(d) Strengthen (b) by showing that for any nonsingular A such that
|A||A- 1 | and |A- 1 ||A|| are irreducible,

(e) What can you deduce about for the 1- and


2-norms?
7.10. (Bauer [80, 1966, p. 413], Rohn [876, 1989]) We can modify the def-
inition of µE(A) in (7.22) by measuring ∆X componentwise relative to X,
giving

7.11. Let be symmetric and let y be an approximate solution to


Ax = b. If y has a small backward error, so that y solves a nearby system. does
it follow that y solves a nearby symmetric system? This problem answers the
question for both the normwise and componentwise relative backward errors.
(a) (Bunch, Demmel, and Van Loan [163, 1989]) Show that if (A+G)y = b
then there exists H = HT such that (A + H)y = b with ||H|| 2 < ||G||2 and
||H|| F < (This result does not require A = AT.)
(b) (Smoktunowicz [930, 1995]) Show that if A is symmetric and diagonally
dominant and (A + G)y = b with |G| < then there exists H = HT such
that (A + H)y = b with |H| < (For a general symmetric A there may
not exist such an H, as is easily shown by a 2 × 2 example [527, 1992].)
(c) (Smoktunowicz [930, 1995]) Show that if A is symmetric positive def-
inite and (A + G)y = b with |G| < then there exists H = H T such that
(A + H)y = b with |H| <
7.12. Suppose that has wi nonzeros in its ith row, i = 1:n. Show
that the inequality (7.27) can be replaced by

where This bound is potentially much smaller than (7.27)


for large, sparse matrices.
7.13. (D. J. Higham, after Fletcher [376, 198 5]) Suppose the nonsingular,
square matrix A is perturbed to A + ∆A and b to b + ∆b. Then, to first order,
the solution of Ax = b is perturbed to x + ∆x, where

Suppose that the perturbations have the form

where the and δi are independent random variables, each having zero
mean and variance σ2. (As usual, the matrix E and vector f represent fixed
tolerances.) Let ε denote the expected value.
(a) Show that

where square brackets denote the operation of elementwise squaring: [B] ij =

(b) Hence explain why

may be regarded as an “expected condition number” for the linear system


Ax = b.
(c) For the case where eij ||A|| 2 and fj ||b|| 2. compare condexp(A, x)
with the “worst-case” condition number κA , b (A, x) for the 2-norm.
7.14. (Horn and Johnson [581, 1991, p. 331]) Prove that for any nonsingular

where is the Hadamard product (A B = (aijbij)) and is defined as in


Problem 7.9. (Hint: use the inequality ||A B||2 < ||A||2 ||B||2.) Discuss the
attainability of this bound.

Chapter 8
Triangular Systems

In the end there is left the coefficient of one unknown and the constant term.
An elimination between this equation and
one from the previous set that contains two unknowns
yields an equation with the coefficient of
another unknown and another constant term, etc.
The quotient of the constant term by the unknown
yields the value of the unknown in each case.
-JOHN V. ATANASOFF, Computing Machine for the Solution of
Large Systems of Linear Algebraic Equations (1940)

The solutions of triangular systems are usually computed to high accuracy.


This fact . . . cannot be proved in general, for counterexamples exist.
However, it is true of many special kinds of triangular matrices and
the phenomenon has been observed in many others.
The practical consequences of this fact cannot be over-emphasized.
-G. W. STEWART, Introduction to Matrix Computations (1973)

In practice one almost invariably finds that


if L is ill-conditioned, so that
then the computed solution of Lx = b (or the computed inverse)
is far more accurate than [standard norm bounds] would suggest.
-J. H. WILKINSON, Rounding Errors in Algebraic Processes (1963)


Triangular systems play a fundamental role in matrix computations. Many


methods are built on the idea of reducing a problem to the solution of one
or more triangular systems, including virtually all direct methods for solving
linear systems. On serial computers triangular systems are universally solved
by the standard back and forward substitution algorithms. For parallel com-
putation there are several alternative methods, one of which we analyse in
§8.4.
Backward error analysis for the substitution algorithms is straightforward
and the conclusion is well known: the algorithms are extremely stable. The
behaviour of the forward error, however, is intriguing, because the forward
error is often surprisingly small, much smaller than we would predict from
the normwise condition number κ, or, sometimes, even the componentwise
condition number cond. The quotes from Stewart and Wilkinson at the start
of this chapter emphasize the high accuracy that is frequently observed in
practice. The analysis we give in this chapter provides a partial explanation
for the observed accuracy of the substitution algorithms. In particular, it
reveals three important but nonobvious properties:

• the accuracy of the computed solution from substitution depends strongly


on the right-hand side:

• a triangular matrix may be much more or less ill conditioned than its
transpose; and

• the use of pivoting in LU, QR, and Cholesky factorizations can greatly
improve the conditioning of a resulting triangular system.

As well as deriving backward and forward error bounds, we show how to


compute upper and lower bounds for the inverse of a triangular matrix.

8.1. Backward Error Analysis


Recall that for an upper triangular matrix U ∈ R^{n×n} the system Ux = b can
be solved using the formula x_i = (b_i - Σ_{j=i+1}^{n} u_ij x_j)/u_ii, which yields the
components of x in order from last to first.

Algorithm 8.1 (back substitution). Given a nonsingular upper triangular
matrix U ∈ R^{n×n}, this algorithm solves the system Ux = b.

xn = bn/unn
for i = n-1:-1:1
    s = bi
    for j = i+1:n
        s = s - uij xj
    end
    xi = s/uii
end
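A direct, unvectorized NumPy translation of Algorithm 8.1 (a sketch for experimentation, not an optimized routine):

import numpy as np

def back_substitution(U, b):
    # Solve U x = b for nonsingular upper triangular U, last component first.
    n = len(b)
    x = np.empty(n)
    x[n - 1] = b[n - 1] / U[n - 1, n - 1]
    for i in range(n - 2, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= U[i, j] * x[j]
        x[i] = s / U[i, i]
    return x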
We will not state the analogous algorithm for solving a lower triangu-
lar system, forward substitution. All the results below for back substitution
have obvious analogues for forward substitution. Throughout this chapter T
denotes a matrix that can be upper or lower triangular.
To analyse the errors in substitution we need the following lemma.

Lemma 8.2. Let y = (c - Σ_{i=1}^{k-1} a_i b_i)/b_k be evaluated in floating point
arithmetic according to

s = c
for i = 1:k-1
    s = s - ai bi
end
y = s/bk

Then the computed ŷ satisfies

    b_k ŷ(1 + θ_k) = c - Σ_{i=1}^{k-1} a_i b_i (1 + θ_i),      (8.1)

where |θ_i| ≤ γ_i = iu/(1 - iu).


Proof. Analysis very similar to that leading to (3.2) shows that :=
satisfies

where The final division yields, using (2.5),


so that, after dividing through by
we have

The result is obtained on invoking Lemma 3.1.


Two remarks are in order. First, we chose the particular form of (8.1),
in which c is not perturbed, in order to obtain a backward error result for
Ux = b in which b is not perturbed. Second, we carefully kept track of the
terms 1 + δi in the proof, so as to obtain the best possible constants. Direct
application of the lemma to Algorithm 8.1 yields a backward error result.

Theorem 8.3. The computed solution from Algorithm 8.1 satisfies

Theorem 8.3 holds only for the particular ordering of arithmetic operations
used in Algorithm 8.1. A result that holds for any ordering is a consequence
of the next lemma.

Lemma 8.4. If y = (c - Σ_{i=1}^{k-1} a_i b_i)/b_k is evaluated in floating point
arithmetic, then, no matter what the order of evaluation,

    b_k ŷ(1 + θ_0) = c - Σ_{i=1}^{k-1} a_i b_i (1 + θ_i),

where |θ_i| ≤ γ_k for all i. If b_k = 1, so that there is no division, then
|θ_i| ≤ γ_{k-1} for all i.

Proof. The result is not hard to see after a little thought , but a formal
proof is tedious to write down. Note that the ordering used in Lemma 8.2 is
the one for which this lemma is least obvious! The last part of the lemma
is useful when analysing unit lower triangular systems, and in various other
contexts.

Theorem 8.5. Let the triangular system Tx = b, where T ∈ R^{n×n} is
nonsingular, be solved by substitution, with any ordering. Then the computed
solution x̂ satisfies

    (T + ∆T)x̂ = b,    |∆T| ≤ γ_n|T|.

In technical terms, this result says that x̂ has a tiny componentwise relative
backward error. In other words, the backward error is about as small as we
could possibly hope.
In most of the remaining error analyses in this book, we will derive re-
sults that, like the one in Theorem 8.5, do not depend on the ordering of
the arithmetic operations. Results of this type are more general, usually no
less informative. and easier to derive, than ones that depend on the order-
ing. However, it is important to realise that the actual error does depend on
the ordering, possibly strongly so for certain data. This point is clear from
Chapter 4 on summation.

8.2. Forward Error Analysis


From Theorems 8.5 and 7.4 there follows the forward error bound

    ||x - x̂||_∞ / ||x||_∞  ≤  cond(T, x) γ_n / (1 - cond(T) γ_n),

where

    cond(T, x) = || |T^{-1}||T||x| ||_∞ / ||x||_∞.

This bound can, of course, be arbitrarily smaller than the corresponding
bound involving κ_∞(T) = ||T||_∞||T^{-1}||_∞, for the reasons explained in
Chapter 7. For further insight, note that, in terms of the traditional condition
number, κ(T), ill conditioning of a triangular matrix stems from two possible
sources: variation in the size of the diagonal elements, and rows with
off-diagonal elements which are large relative to the diagonal elements. Significantly,
because of its row scaling invariance, cond(T, x) is susceptible only
to the second source.
Despite its pleasing properties, cond(T , x) can be arbitrarily large. This
is illustrated by the upper triangular matrix

(8.2)

for which
(8.3)

We have cond(U(α), e) = cond(U(α)) ≈ 2α^{n-1} as α → ∞. Therefore we
cannot assert that all triangular systems are solved to high accuracy. Nevertheless,
for any T there is always at least one system for which high accuracy
is obtained: the system Tx = e_1 if T is upper triangular, or Tx = e_n if T
is lower triangular. In both cases cond(T, x) = 1, and the solution comprises
the computation of just a single scalar reciprocal.
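To see the growth of cond(U(α), e) numerically, here is a hedged NumPy sketch (ours); it assumes, consistently with (8.3) and the limit just quoted, that U(α) is unit upper triangular with every element above the diagonal equal to -α:

import numpy as np

def cond(T, x):
    # cond(T, x) = || |T^{-1}||T||x| ||_inf / ||x||_inf.
    v = np.abs(np.linalg.inv(T)) @ (np.abs(T) @ np.abs(x))
    return np.linalg.norm(v, np.inf) / np.linalg.norm(x, np.inf)

n, alpha = 6, 10.0
U = np.eye(n) - alpha * np.triu(np.ones((n, n)), 1)   # assumed form of U(alpha)
print(cond(U, np.ones(n)), 2 * alpha ** (n - 1))      # comparable magnitudes
x = np.linalg.solve(U, np.eye(n)[:, 0])               # solution of U x = e_1
print(cond(U, x))                                     # equals 1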
To gain further insight we consider special classes of triangular matrices,
beginning with one produced by certain standard factorizations with pivoting.
In all the results below, the triangular matrices are assumed to be n × n and
nonsingular, and is the computed solution from substitution.

Lemma 8.6. Suppose the upper triangular matrix U ∈ R^{n×n} satisfies

    |u_ii| ≥ |u_ij|  for all j > i.      (8.4)

Then the unit upper triangular matrix W = |U^{-1}||U| satisfies w_ij ≤ 2^{j-i} for
all j > i.

Proof. We can write W = |V^{-1}||V| where V = D^{-1}U and D = diag(u_ii).
The matrix V is unit upper triangular with |v_ij| ≤ 1, and it is easy to show
that |(V^{-1})_{ij}| ≤ 2^{j-i-1} for j > i. Thus, for j > i,

Theorem 8.7. Under the conditions of Lemma 8.6, the computed solution
to Ux = b obtained by substitution satisfies

Proof. From Theorem 8.5 we have

Using Lemma 8.6 we obtain

Lemma 8.6 shows that for matrices satisfying (8.4), cond(T) is bounded
for fixed n, no matter how large κ(T). The bounds for |x_i - x̂_i| in Theorem 8.7,
although large if n is large and i is small, decay exponentially with increasing
i; thus, later components of x are always computed to high accuracy relative
to the elements already computed.
Analogues of Lemma 8.6 and Theorem 8.7 hold for lower triangular L
satisfying
    |l_ii| ≥ |l_ij|  for all j < i.      (8.5)
Note, however, that if the upper triangular matrix T satisfies (8.4) then T^T
does not necessarily satisfy (8.5). In fact, cond(T^T) can be arbitrarily large,
as shown by the example

An important conclusion is that a triangular system Tx = b can be much


more or less ill conditioned than the system TTy = c, even if T satisfies (8.4).
Theorem 8.7, or its lower triangular analogue, is applicable to

• the lower triangular matrices from Gaussian elimination with partial


pivoting or complete pivoting;

• the upper triangular matrices from Gaussian elimination with complete


pivoting;

• the upper triangular matrices from the Cholesky and QR factorizations


with complete pivoting and column pivoting, respectively.

Next, we consider triangular T satisfying

    t_ii > 0,    t_ij ≤ 0  for all i ≠ j.

It is easy to see that such a matrix has an inverse with nonnegative elements,
and hence is an M-matrix (for definitions of an M-matrix see Appendix B).
Associated with any square matrix A is the comparison matrix

    M(A) = (m_ij),    m_ii = |a_ii|,    m_ij = -|a_ij| (i ≠ j).      (8.6)

For any nonsingular triangular T, M(T) is an M-matrix. Furthermore, it is
easy to show that |T^{-1}| ≤ M(T)^{-1} (see Theorem 8.11).
The following result shows that among all matrices R such that |R| = |T|,
R = M(T) is the one that maximizes cond(R , x ).

Lemma 8.8. For any triangular T,

Proof. The inequality follows from |T- 1 | < M(T)-1, together with |T| =
|M(T)|. Since M(T)-1 > 0, we have

|M(T)- 1 ||M(T)| = M(T) -1 (2diag(|tii| ) - M(T))


= 2M(T) -1 diag(|tii |) - I,

which yields the equality.


If T=M(T) has unit diagonal then, using Lemma 8.8,

This means, for example, that the system U(1)x = b (see (8.2)), where x = e,
is about as ill conditioned with respect to componentwise relative perturba-
tions in U(1) as it is with respect to normwise perturbations in U(1).

The next result gives a forward error bound for substitution that is proved
directly, without reference to the backward error result Theorem 8.5 (indeed. it
cannot be obtained from that result!). The bound can be very weak, because
||M(T) - 1 || can be arbitrarily larger than ||T- 1|| (see Problem 8.2), but it
yields a pleasing bound in the special case described in the corollary.

Theorem 8.9. The computed solution obtained from substitution applied


to the triangular system Tx = b of order n satisfies

Proof. Without loss of generality, suppose T = L, is lower triangular. The


proof is by induction on the components of x. The result clearly holds for
the first component. Assume that it holds for the first n - 1 components. An
analogue of Lemma 8.3 shows that

where for all j. Subtracting from lnnxn =


gives

so that

(8.7)

Write

Then the inductive assumption can be written as


which implies
Hence (8.7) gives

Corollary 8.10. The computed solution obtained from substitution applied


to the triangular system Tx = b of order n, where T = M(T) and b > 0,
satisfies

Corollary 8.10 shows that, when T is an M-matrix and the right-hand


side is nonnegative, the solution is obtained to high relative accuracy in every
component. The reason for the high accuracy is that for such a system there
are no subtractions of like-signed numbers, so that each xi is computed as
a sum of nonnegative quantities. A consequence of the corollary is that the
inverse of a triangular M-matrix can be computed to high relative accuracy.
Triangular systems of the type in Corollary 8.10 occur in linear equations
obtained by discretizing certain elliptic partial differential equations, such as
the Poisson equation on a rectangle, with zero boundary conditions and a
positive forcing function: these problems yield symmetric positive definite
M-matrices, and the LU factors of an M-matrix are themselves M-matrices.
Such systems also occur when evaluating the bounds of the next section.

8.3. Bounds for the Inverse

In this section we describe bounds for the inverse of a triangular matrix and
show how they can be used to bound condition numbers. All the bounds in
this section have the property that they depend only on the absolute values
of the elements of the matrix. The norm estimation methods of Chapter 14,
on the other hand, do take account of the signs of the elements.
The bounds are all based on the following theorem, whose easy proof we
omit.

Theorem 8.11. If U is a nonsingular upper triangular matrix then

    |U^{-1}| ≤ M(U)^{-1} ≤ W(U)^{-1} ≤ Z(U)^{-1},

where the upper triangular matrices W(U) and Z(U) are defined as follows:

where α = min_k |u_kk| and β = max_{i<j} |u_ij|/|u_ii|.

Theorem 8.11 is a special case of results in the theory of M-matrices.


For more general results couched in terms of matrix minorants and diagonal
dominance, respectively, see Dahlquist [261, 1983] and Varga [1051, 1976]; see
also Householder [587, 1964, Exercise 15, p. 58].
An obvious implication of the theorem is that for any vector z and any
absolute norm

By taking z = |U|e, z = |U||x|, and z = e, respectively, we obtain upper
bounds for cond(U), cond(U, x), and ||U^{-1}||_∞. The cost of computing these
bounds is just the cost of solving a triangular system with coefficient matrix
M(U), W(U), or Z(U), which is easily seen to be O(n²), O(n), and O(1)
flops, respectively. By comparison, computing any of these condition numbers
exactly costs O(n³) flops.
As an example, here is how to compute an upper bound for ||U^{-1}||_∞ in n²
flops.

Algorithm 8.12. Given a nonsingular upper triangular matrix U ∈ R^{n×n},
this algorithm computes µ = ||M(U)^{-1}||_∞ ≥ ||U^{-1}||_∞.

yn = 1/|unn|
for i = n-1:-1:1
    s = 1
    for j = i+1:n
        s = s + |uij| yj
    end
    yi = s/|uii|
end
µ = max_i yi
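In NumPy the same computation reads as follows (a sketch; the function name is ours):

import numpy as np

def inv_norm_upper_bound(U):
    # mu = ||M(U)^{-1}||_inf >= ||U^{-1}||_inf, computed by solving M(U) y = e
    # by back substitution; every term added is nonnegative (Algorithm 8.12).
    n = U.shape[0]
    y = np.empty(n)
    y[n - 1] = 1.0 / abs(U[n - 1, n - 1])
    for i in range(n - 2, -1, -1):
        s = 1.0
        for j in range(i + 1, n):
            s += abs(U[i, j]) * y[j]
        y[i] = s / abs(U[i, i])
    return y.max()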

How good are these upper bounds? We know from Problem 8.2 that the
ratio ||M(T)- 1 ||/||T - 1 || can be arbitrarily large, therefore any of the upper
bounds can be arbitrarily poor. However, with suitable assumptions on T,
more can be said.
It is easy to show that if T is bidiagonal then |T- 1 | = M(T)- 1 . Since
a bidiagonal system can be solved in O(n) flops, it follows that the three
condition numbers of interest can each be computed exactly in O(n) flops
when T is bidiagonal.
As in the previous section, triangular matrices that result from a pivoting
strategy also lead to a special result.

Theorem 8.13. Suppose the upper triangular matrix U ∈ R^{n×n} satisfies

    |u_ii| ≥ |u_ij|  for all j > i.

Then, for the 1-, 2-, and ∞-norms,

(8.8)

Proof. The left-hand inequality is trivial. The right-hand inequality
follows from the expression given in Problem 8.5,
together with ||A||_2 ≤ (||A||_1 ||A||_∞)^{1/2}.
The inequalities from the second on in (8.8) are all equalities for the matrix
with u ii = 1 and u ij = -1 (j > i). The question arises of whether equality is
possible for the upper triangular matrices arising from QR factorization with
column pivoting, which satisfy the inequalities (see Problem 18.5)

    u_kk^2 ≥ Σ_{i=k}^{j} u_ij^2,    j = k+1: n.      (8.9)

That equality is possible is shown by the parametrized matrix of Kahan [626,
1966]

    U_n(θ) = diag(1, s, . . . , s^{n-1}) (I - cN),      (8.10)

where c = cos(θ), s = sin(θ), and N is strictly upper triangular with every
element above the diagonal equal to 1. It is easily verified that U_n(θ) satisfies the
inequalities (8.9), as equalities in fact. From (8.3), U_n(θ)^{-1} = (β_ij) is given
by

Thus as where
and hence, for small θ,

It can also be verified that the matrix Y defined by y_ij =
(-1)^{j-i}|u_ij| satisfies, for small θ, a much smaller value of ||Y^{-1}||_∞, while
||M(U_n(θ))^{-1}||_∞ ≈ 2^{n-1}/|u_nn|. Hence the upper bounds for ||U^{-1}|| can all be too big by a factor
of order 2^{n-1}.
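A hedged NumPy sketch of this comparison (ours): it builds U_n(θ) from (8.10), forms the sign-flipped matrix Y with y_ij = (-1)^{j-i}|u_ij|, and compares the bound ||M(U)^{-1}||_∞ with the true inverse norms.

import numpy as np

def kahan(n, theta):
    # U_n(theta) = diag(1, s, ..., s^(n-1)) (I - c N), per (8.10).
    c, s = np.cos(theta), np.sin(theta)
    return np.diag(s ** np.arange(n)) @ (np.eye(n) - c * np.triu(np.ones((n, n)), 1))

n, theta = 15, 0.1
U = kahan(n, theta)
idx = np.arange(n)
signs = np.where((idx[:, None] - idx[None, :]) % 2 == 0, 1.0, -1.0)
Y = signs * np.abs(U)                                  # same moduli, alternating signs
M = 2.0 * np.diag(np.diag(np.abs(U))) - np.abs(U)      # comparison matrix M(U) = M(Y)
bound = np.linalg.norm(np.linalg.inv(M), np.inf)
print(bound / np.linalg.norm(np.linalg.inv(U), np.inf))  # about 1: bound attained for U
print(bound / np.linalg.norm(np.linalg.inv(Y), np.inf))  # overestimate of order 2^(n-1)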

8.4. A Parallel Fan-In Algorithm


Substitution is not the only way to solve a triangular system. In this section we
describe a different approach that has been suggested for parallel computation.
Any lower triangular matrix can be factorized L = L1 L2 . . . Ln ,
where Lk differs from the identity matrix only in the kth column:

(8.11)

The solution to a linear system Lx = b may therefore be expressed as

    x = L^{-1}b = M_n M_{n-1} . . . M_1 b,      (8.12)

where M_i = L_i^{-1}. When evaluated in the natural right-to-left order, this
formula yields a trivial variation of a column-oriented version of substitution.
The fan-in algorithm evaluates the product (8.12) in [log(n + 1)] steps by
the fan-in operation (which is the operation used in pairwise summation: see
§4.1). For example, for n = 7 the calculation is specified by

where all the products appearing within a particular size of parenthesis can
be evaluated in parallel. In general, the evaluation can be expressed as a
binary tree of depth [log(n + 1)] + 1, with products M1b and Mi Mi-1 (i =
3, 5,. . . , 2[(n - 1)/2] + 1) at the top level and a single product yielding x at
the bottom level. This algorithm was proposed and analysed by Sameh and
Brent [889, 1977], who show that it can be implemented in
time steps on processors. The algorithm requires about n 3 /10
operations and thus is of no interest for serial computation. Some pertinent
comments on the practical significance of log n terms in complexity results are
given by Edelman [341, 1993].
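A serial NumPy sketch of the fan-in evaluation (ours; it mimics the pairwise grouping that would be done in parallel, and forms the M_i explicitly, as the analysis below assumes):

import numpy as np

def fan_in_solve(L, b):
    # Evaluate x = M_n M_{n-1} ... M_1 b from (8.12), where M_i = L_i^{-1} and
    # L_i agrees with the identity except in column i.  Adjacent factors are
    # combined pairwise, level by level, as in the fan-in algorithm.
    n = L.shape[0]
    terms = [b.astype(float)]
    for i in range(n):
        Li = np.eye(n)
        Li[i:, i] = L[i:, i]               # column i of L goes into L_i
        terms.append(np.linalg.inv(Li))    # M_i, formed explicitly
    while len(terms) > 1:                  # combine adjacent pairs
        nxt = [terms[k + 1] @ terms[k] for k in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])
        terms = nxt
    return terms[0]

For lower triangular L with nonzero diagonal this reproduces the substitution solution, but, as the analysis that follows shows, with different stability properties.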
To derive an error bound while avoiding complicated notation that ob-
scures the simplicity of the analysis, we take n = 7. The result we obtain is
easily seen to be valid for all n. We will not be concerned with the precise
values of constants, so we write c_n for a constant depending on n. We assume
that the inverses M_i = L_i^{-1} are formed exactly, because the errors in forming
them affect only the constants. From the error analysis of matrix-vector and
matrix-matrix multiplication (§3.5), we find that the computed solution x̂
satisfies

(8.13)

where

Premultiplying (8.13) on the left by L, we find that the residual r = b - Lx̂
is a sum of terms of the form

All these terms share the same upper bound, which we derive for just one of
them. For j = 5, k =

where we have used the property that, for any


The overall residual bound is therefore of the form

(8.14)

or, on taking norms,

(8.15)

By considering the binary tree associated with the fan-in algorithm, and
using the fact that the matrices at the ith level of the tree have at most 2i - 1
nontrivial columns, it is easy to see that we can take dn = an log n , where a
is a constant of order 1.
It is not hard to find numerical examples where the bound in (8.15) is
approximately attained (for d_n = 1) and greatly exceeds u||L||_∞||x̂||_∞, which
is the magnitude required for normwise backward stability. One way to con-
struct such examples is to use direct search (see Chapter 24).
The key fact revealed by (8.15) is that the fan-in algorithm is only conditionally
stable. In particular, the algorithm is normwise backward stable if L
is well conditioned. A special case in which (8.14) simplifies is when L is an
M-matrix and b > 0: Problem 8.4 shows that in this case |L^{-1}||L||x| ≤ (2n-1)|x|,
so (8.14) yields |r| ≤ (2n - 1)² d_n u|L||x| + O(u²), and we have componentwise
backward stability (to first order).

We can obtain from (8.15) the result

(8.16)

which was proved by Sameh and Brent (889, 1977] (with a n = ¼n 2 log n +
O(n log n)). However, (8.16) is a much weaker bound than (8.14) and (8.15).
In particular, a diagonal scaling Lx = b → (D1LD2)(D2^{-1}x) = D1b (where the D_j are
diagonal) leaves (8.14) (and, to a lesser extent, (8.15)) essentially unchanged,
but can change the bound (8.16) by an arbitrary amount.
A forward error bound can be obtained directly from (8.13). We find that

(8.17)

where M(L) is the comparison matrix (a bound of the same form as that
in Theorem 8.9 for substitution; see the Notes and References and Problem
8.10), which can be weakened to

(8.18)

We also have the bound

(8.19)

which is an immediate consequence of (8.14). Either bound in (8.18) and


(8.19) can be arbitrarily larger than the other, for fixed n. An example where
(8.19) is the better bound (for large n) is provided by the matrix with l_ij ≡ 1,
for which |L^{-1}||L| has maximum element 2 and M(L)^{-1}|L| has maximum
element 2^{n-1}.

8.5. Notes and References


Section 8.2 is based on Higham [538, 198 9]. Many of the results presented
in §§8.2 and 8.3 have their origin in the work of Wilkinson. Indeed, these
sections are effectively a unification and extension of Wilkinson’s results in
[1085, 1961], [1088, 1963], [1089, 1965].
Classic references for Theorems 8.3 and 8.5 are Wilkinson [1085, 196 1,
p. 294], [1088, 1963, pp. 100-102], Forsythe and Moler [396, 1967, §21], and
Stewart [941, 1973, pp. 150, 408-410].
Analogues of Theorem 8.7 and Corollary 8.10 for matrix inversion are
proved by Wilkinson in [1085, 1961, pp. 322 323], and Corollary 8.10 itself is
proved in [1089, 1965, pp. 250-251].

A result of the form of Theorem 8.9 holds for any triangular system solver
that does not rely on algebraic cancellation-in particular, for the fan-in al-
gorithm, as already seen in (8.17). See Problem 8.10 for a more precise for-
mulation of this general result.
The bounds in §8.3 have been investigated by various authors. The unified
presentation given here is based on Higham [534, 1987]. Karasalo [642, 1974]
derives an O(n) flops algorithm for computing ||M(T) - 1 ||F. Manteuffel [726,
1981] derives the first two inequalities in Theorem 8.11, and Algorithm 8.12.
A different derivation of the equations in Algorithm 8.12 is given by Jen-
nings [613, 1982, §9]. The formulae given in Problem 8.5 are derived directly
as upper bounds for by Lemeire [699, 1975].
That can be computed in O(n) flops when B is bidiagonal, as
was first pointed out by Higham [531, 1986]. Demmel and
Kahan [296, 1990] derive an estimate for the smallest singular value σmin of
a bidiagonal matrix B by using the inequality where
They compute in O(n) flops as

Section 8.4 is adapted from Higham [560, 1995], in which error analysis is
given for several parallel methods for solving triangular systems.
The fan-in method is topical because the fan-in operation is a special case
of the parallel prefix operation and several fundamental computations in linear
algebra are amenable to a parallel prefix-based implementation, as discussed
by Demmel [287, 1992], [288, 1993]. (For a particularly clear explanation of the
parallel prefix operation see the textbook by Buchanan and Turner [154, 1992,
§13.21.) The important open question of the stability of the parallel prefix
implementation of Sturm sequence evaluation for the symmetric tridiagonal
eigenproblem has recently been answered by Mathias [734, 1995]. Mathias
shows that for positive definite matrices the relative error in a computed minor
can be as large as a multiple of where is the smallest eigenvalue of
the matrix; the corresponding bound for serial evaluation involves The
analogy with (8.19), where we also see a condition cubing effect, is intriguing.
Higham and Pothen [568, 1994] analyse the stability of the “partitioned
inverse method” for parallel solution of sparse triangular systems with many
right-hand sides. This method has been studied by several authors in the
1990s; see Alvarado, Pothen, and Schreiber [13, 1993] and the references
therein. The idea of the method is to factor a sparse triangular matrix
as L = L 1 L 2 . . . Ln = G 1 G 2 . . . Gm , where each Gi is a prod-
uct of consecutive Lj terms and 1 < m < n, with m as small as possible
subject to the Gi being sparse. Then the solution to Lx = b is evaluated as

where each is formed explicitly and the product is evaluated from right
to left. The advantage of this approach is that x can be computed in m serial
steps of parallel matrix-vector multiplication.

8.5.1. LAPACK
Computational routine xTRTRS solves a triangular system with multiple right-
hand sides: xTBTRS is an analogue for banded triangular matrices. There is
no driver routine for triangular systems.

Problems
Before you start an exercise session,
make sure you have a glass of water and
a mat or towel nearby.
-MARIE HELVIN, Model Tips for a Healthy Future (1994)
8.1. Show that under the no-guard-digit model (2.6). Lemma 8.2 remains
true if (8.1) is changed to

and that the corresponding modification of Theorem 8.5 has

8.2. Show that for a triangular matrix T the ratio ||M(T) - 1||/||T - 1 || can be
arbitrarily large.
8.3. Suppose the unit upper triangular matrix satisfies |uij| < 1
for j > i. By using Theorem 8.9. show that the computed solution from
substitution on Ux = b satisfies

Compare with the result of applying Theorem 8.7.


8.4. Let T ∈ R^{n×n} be triangular and suppose T = M(T) and Tx = b > 0.
Show that |T^{-1}||T||x| ≤ (2n-1)|x|, and hence that cond(T, x) ≤ 2n - 1. This
result shows that a triangular M-matrix system with a nonnegative right-hand
side is very well conditioned with respect to componentwise relative
perturbations, irrespective of the size of κ(T) (and so leads to an alternative
proof of Corollary 8.10).
8.5. Show that for a triangular for
both the l- and -norms (a and β are defined in Theorem 8.11).

8.6. Write detailed algorithms for efficiently computing and

8.7. Bounds from diagonal dominance. (a) Prove the following result (Ahlberg
and Nilson [8, 1963], Varah [1049, 1975]): if (not necessarily trian-
gular) satisfies

(that is, A is strictly diagonally dominant by rows), then

(b) Hence show that (Varga [1051, 1976]) if satisfies

for some positive diagonal matrix D = diag(di ) (that is, AD is strictly diag-
onally dominant by rows), then

(c) Use part (b) to provide another derivation of the upper bound

8.8. (a) Let A ∈ R^{n×n} be nonsingular. For a given i and j, determine, if
possible, α_ij such that A + α_ij e_i e_j^T is singular. Where is the "best" place to
perturb A to make it singular?
(b) Let T = U(1) in (8.2), so that, for example,

Show that T_n is made singular by subtracting 2^{2-n} from a certain element of
T_n.
8.9. (Zha [1127, 1993]) Show that if c and s are nonnegative (with c2 + s2 = 1)
then the Kahan matrix Un (θ) in (8.10) has as its second smallest
singular value. (That there should be such an explicit formula is surprising;
none is known for the smallest singular value.)

8.10. Consider a method for solving triangular systems Tx = b that computes


xi = fi (T, b) where, for all i, fi is a multivariate rational function in which the
only divisions are by diagonal elements of L and such that when T = M(T)
and b > 0 there are no subtractions in the evaluation of fi . Show that a bound
holds of the form in Theorem 8.9, namely, for

(8.20)

Give an example of a triangular system solver for which (8.20) is not satisfied.

Chapter 9
LU Factorization and Linear Equations

It appears that Gauss and Doolittle applied


the method only to symmetric equations.
More recent authors, for example, Aitken, Banachiewicz, Dwyer, and Crout . . .
have emphasized the use of the method, or variations of it,
in connection with non-symmetric problems . . .
Banachiewicz . . . saw the point . . .
that the basic problem is really one of matrix factorization,
or “decomposition” as he called it.
-PAUL S. DWYER, Linear Computations (1951)

Intolerable pivot-growth [with partial pivoting] is a phenomenon that happens


only to numerical analysts who are looking for that phenomenon.
-WILLIAM M. KAHAN, Numerical Linear Algebra (1966)

By 1949 the major components of the


Pilot ACE were complete and undergoing trials . . .
During 1951 a programme for solving simultaneous
linear algebraic equations was used for the first time.
26th June, 1951 was a landmark in the history of the machine,
for on that day it first rivalled alternative computing methods
by yielding by 3 p.m. the solution to
a set of 17 equations submitted the same morning.
-MICHAEL WOODGER, The History and Present Use of
Digital Computers at the National Physical Laboratory (1958).

The closer one looks,


the more subtle and remarkable Gaussian elimination appears.
-LLOYD N. TREFETHEN, Three Mysteries of Gaussian Elimination (1985)


9.1. Gaussian Elimination


We begin by giving a traditional description of Gaussian elimination (GE) for
solving a linear system Ax = b, where A ∈ R^{n×n} is nonsingular.
The strategy of GE is to reduce a problem we can’t solve (a full linear
system) to one that we can (a triangular system), using elementary row op-
erations. There are n - 1 stages, beginning with A(1) := A, b (1) := b, and
finishing with the upper triangular system A(n)x = b (n) .
At the kth stage we have converted the original system to A^{(k)}x = b^{(k)},
where A^{(k)} is upper triangular in its first k - 1 columns. The purpose of the kth stage of
the elimination is to zero the elements below the diagonal in the kth column
of A^{(k)}. This is accomplished by the operations

    a_ij^{(k+1)} = a_ij^{(k)} - m_ik a_kj^{(k)},    i, j = k+1: n,
    b_i^{(k+1)} = b_i^{(k)} - m_ik b_k^{(k)},       i = k+1: n,

where the multipliers m_ik = a_ik^{(k)}/a_kk^{(k)}, i = k+1: n. At the end of the (n - 1)st
stage we have the upper triangular system A^{(n)}x = b^{(n)}, which is solved by
back substitution. For an n × n matrix, GE requires 2n³/3 flops.
There are two problems with the method as described. First, there is
a breakdown with division by zero if a_kk^{(k)} = 0. Second, if we are working in
finite precision and some multiplier m_ik is large, then there is a possible loss of
significance: in the subtraction a_ij^{(k)} - m_ik a_kj^{(k)}, low-order digits of a_ij^{(k)} could be
lost. Losing these digits could correspond to making a relatively large change
to the original matrix A. The simplest example of this phenomenon is for the
matrix [ε 1; 1 1]: here m_21 = 1/ε and a_22^{(2)} = 1 - 1/ε, and if 1/ε is so large
that fl(1 - 1/ε) = -1/ε then the computed a_22^{(2)} equals -1/ε, which would
be the exact answer if we changed a_22 from 1 to 0.
These observations motivate the strategy of partial pivoting. At the start
of the kth stage, the kth and rth rows are interchanged, where

    |a_rk^{(k)}| := max_{k ≤ i ≤ n} |a_ik^{(k)}|.

Partial pivoting ensures that the multipliers are nicely bounded:

    |m_ik| ≤ 1,    i = k+1: n.
A more expensive pivoting strategy, which interchanges both rows and


columns, is complete pivoting.

At the start of the kth stage, rows k and r and columns k and s are
interchanged, where

    |a_rs^{(k)}| := max{ |a_ij^{(k)}| : k ≤ i, j ≤ n }.

Note that this requires O(n³) comparisons in total, compared with O(n²)
for partial pivoting. Because of the searching overhead, and because partial
pivoting works so well, complete pivoting is rarely used in practice.
Much insight into GE is obtained by expressing it in matrix notation. We
can write

    A^{(k+1)} = M_k A^{(k)}.

The matrix M_k can be expressed compactly as M_k = I - m_k e_k^T, where e_k is
the kth unit vector and m_k = (0, . . . , 0, m_{k+1,k}, . . . , m_{nk})^T, so that (m_k)_i = 0
for i ≤ k. To invert M_k, just flip the signs
of the multipliers: M_k^{-1} = I + m_k e_k^T. Overall,

    M_{n-1} M_{n-2} . . . M_1 A = A^{(n)} =: U,

and so

    A = M_1^{-1} M_2^{-1} . . . M_{n-1}^{-1} U =: LU.

The conclusion is that GE computes an LU factorization of A: A = LU,
where L is unit lower triangular and U is upper triangular.
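The outer-product form of this computation is easy to write down; here is a hedged NumPy sketch (ours) of GE without pivoting that returns the LU factors:

import numpy as np

def ge_lu(A):
    # Gaussian elimination without pivoting: A = L U with L unit lower
    # triangular and U upper triangular (assumes no zero pivots arise).
    U = A.astype(float).copy()
    n = U.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]                # multipliers m_ik
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])    # eliminate column k
    return L, np.triu(U)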
We introduce the shorthand notation Ak := A(1:k, 1:k).

Theorem 9.1. There exists a unique LU factorization of A ∈ R^{n×n} if and
only if A_k is nonsingular for k = 1: n - 1. If A_k is singular for some 1 ≤ k ≤
n - 1 then the factorization may exist, but if so it is not unique.

Proof. Suppose Ak is nonsingular for k = 1:n - 1. The existence of an


LU factorization can be proved by examining the steps of GE, but a more
elegant proof, which also gives uniqueness, can be obtained by an inductive
bordering construction. Suppose A_{k-1} has the unique LU factorization
A_{k-1} = L_{k-1}U_{k-1} (this supposition clearly holds for k - 1 = 1). We look for
a factorization

    A_k = [ A_{k-1}   a  ]   =   [ L_{k-1}  0 ] [ U_{k-1}   u  ]
          [   c^T   a_kk ]       [  l^T     1 ] [    0    u_kk ].

The equations to be satisfied are L_{k-1}u = a, U_{k-1}^T l = c, and a_kk = l^T u +
u_kk. The matrices L_{k-1} and U_{k-1} are nonsingular, since 0 ≠ det(A_{k-1}) =
det(L_{k-1}) det(U_{k-1}), so the equations have a unique solution, completing the
induction.
We prove the converse, under the assumption that A is nonsingular; for the
case A singular see Problem 9.1. Suppose an LU factorization exists. Then
Ak = LkUk for k = 1:n, which gives

det(A k) = det(U k) = u 11 . . . ukk. (9.1)

Setting k = n we find that 0 ≠ det(A) = u_11 . . . u_nn, and hence det(A_k) =
u_11 . . . u_kk ≠ 0, k = 1: n - 1.
Examples that illustrate the last part of the theorem are
which holds for any l, and which does not have an LU factoriza-
tion.

Visually, the condition of Theorem 9.1 is (for n = 5) that the indicated


submatrices must be nonsingular:

From (9.1) follows the expression u_kk = det(A_k)/det(A_{k-1}). In fact, all
the elements of L and U can be expressed by determinantal formulae (see,
e.g., Gantmacher [413, 1959, p. 35] or Householder [587, 1964, p. 11]):

(9.2a)

(9.2b)

The effect of partial pivoting is easily seen by considering the case n = 4.


We have

where, for example, For k =


1,2,3, is the same as Mk except the multipliers are interchanged. Hence,
for n = 4, GE with partial pivoting (GEPP) applied to A is equivalent to GE
without pivoting applied to the row-permuted matrix PA. This conclusion
is true for any n: GEPP computes a factorization PA = LU. Similarly, GE
with complete pivoting computes a factorization PAQ = LU, where P and Q
are permutation matrices.
Exploitation of the LU factorization streamlines both the error analysis
and the practical solution of linear systems. Solution of Ax = b breaks into
a factorization phase, PA = LU for partial pivoting (O(n 3 ) flops), and a
substitution phase, where the triangular systems Ly = Pb, Ux = y are solved
(O(n^2) flops). If more than one system is to be solved with the same coefficient
matrix but different right-hand sides, the factorization can be reused, with a
consequent saving in work.
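As an illustration of this point (not part of the book's text), SciPy exposes the two phases separately, so one factorization can serve several right-hand sides:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))

# Factorization phase: PA = LU (O(n^3) flops), done once.
lu, piv = lu_factor(A)

# Substitution phase: O(n^2) flops per right-hand side; the factors are reused.
for _ in range(3):
    b = rng.standard_normal(500)
    x = lu_solve((lu, piv), b)
    print(np.linalg.norm(b - A @ x) / (np.linalg.norm(A) * np.linalg.norm(x)))
```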
Computing an LU factorization A = LU is equivalent to solving the equations

    a_ij = Σ_{k=1}^{min(i,j)} l_ik u_kj.

If these nonlinear equations are examined in the right order, they are easily
solved. For added generality let A ∈ R^{m×n} (m ≥ n) and consider an LU
factorization with L ∈ R^{m×n} and U ∈ R^{n×n} (L is lower trapezoidal: l_ij = 0
for i < j). Suppose we know the first k − 1 columns of L and the first k − 1
rows of U. Setting l_kk = 1,

    a_kj = Σ_{i=1}^{k−1} l_ki u_ij + u_kj,    j = k:n,           (9.3)

    a_ik = Σ_{j=1}^{k−1} l_ij u_jk + l_ik u_kk,    i = k+1:m.    (9.4)

We can solve for the elements u_kj (j = k:n) in the kth row of U and then the
elements l_ik (i = k+1:m) in the kth column of L. This process is called
Doolittle's method.

Algorithm 9.2 (Doolittle's method). This algorithm computes an LU fac-
torization A = LU of A ∈ R^{m×n}, where m ≥ n (assuming the factorization
exists), by Doolittle's method.

for k = 1:n
    for j = k:n
        u_kj = a_kj − Σ_{i=1}^{k−1} l_ki u_ij    (*)
    end
    for i = k+1:m
        l_ik = (a_ik − Σ_{j=1}^{k−1} l_ij u_jk)/u_kk    (**)
    end
end

Cost: n^2(m − n/3) flops.
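A direct transcription of Algorithm 9.2 into Python/NumPy might look as follows; this is our own sketch, assuming the factorization exists, with the assignments (*) and (**) realized as inner products.

```python
import numpy as np

def doolittle(A):
    """Doolittle's method: A = L*U with L (m-by-n) unit lower trapezoidal
    and U (n-by-n) upper triangular, assuming the factorization exists."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    L = np.zeros((m, n))
    U = np.zeros((n, n))
    for k in range(n):
        L[k, k] = 1.0
        # (*): kth row of U from inner products with earlier columns of L.
        for j in range(k, n):
            U[k, j] = A[k, j] - L[k, :k] @ U[:k, j]
        # (**): kth column of L.
        for i in range(k + 1, m):
            L[i, k] = (A[i, k] - L[i, :k] @ U[:k, k]) / U[k, k]
    return L, U
```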


Doolittle's method is mathematically equivalent to GE without pivoting,
for we have, in (9.3),

    u_kj = a_kj − Σ_{i=1}^{k−1} l_ki u_ij = a_kj^(k),            (9.5)

and similarly for (9.4). Had we chosen the normalization u_ii ≡ 1, we would
have obtained the Crout method. The Crout and Doolittle methods are well
suited to calculations by hand or with a desk calculator, because they obviate
the need to store the intermediate quantities a_ij^(k). They are also attractive
when we can accumulate inner products in extended precision.
    It is straightforward to incorporate partial pivoting into Doolittle's method
(see, e.g., Stewart [941, 1973, p. 138]). However, complete pivoting cannot be
incorporated without changing the method.

9.2. Error Analysis


The error analysis of GE is a combination of the error analyses of inner
products and substitution. When this fact is realized, the analysis becomes
straightforward. The key observation leading to this viewpoint is that all
mathematically equivalent variants of GE satisfy a common error bound. To
see why, first note the connection between standard GE, as we have described
it, and the Doolittle method, as shown in (9.5). Whether the inner product
in (9.5) is calculated as one operation, or whether its terms are calculated
many operations apart, precisely the same rounding errors are sustained (as-
suming that extended precision accumulation of inner products is not used);
all that changes is the moment when those rounding errors are committed. If
we allow inner products to be reordered, so that, for example, the summation
(*) in Algorithm 9.2 is calculated with the index i decreasing from k − 1 to
1, instead of increasing from 1 to k − 1, then the actual rounding errors are
different but a common bound holds for all orderings.
    It suffices, then, to analyse the Doolittle method. It also suffices to analyse
the method without pivoting, because GE with partial or complete pivoting
is equivalent to GE without pivoting applied to a permuted matrix.

The assignments (*) and (**) in Algorithm 9.2 are of the form
y = (c − Σ_{i=1}^{k−1} a_i b_i)/b_k, which is analysed in Lemma 8.4. Applying the
lemma, we deduce that, no matter what the ordering of the inner products, the
computed matrices L̂ and Û satisfy (with l̂_kk := 1)

These inequalities constitute a backward error result for LU factorization.

Theorem 9.3. If GE applied to A ∈ R^{m×n} (m ≥ n) runs to completion then
the computed LU factors L̂ and Û satisfy

    L̂Û = A + ΔA,    |ΔA| ≤ γ_n |L̂||Û|.                          (9.6)

With only a little more effort, a backward error result can be obtained for
the solution of Ax = b.

Theorem 9.4. Let A ∈ R^{n×n} and suppose GE produces computed LU factors
L̂, Û, and a computed solution x̂ to Ax = b. Then

    (A + ΔA)x̂ = b,    |ΔA| ≤ 2γ_n |L̂||Û|.                        (9.7)

Proof. From Theorem 9.3, L̂Û = A + ΔA_1 with |ΔA_1| ≤ γ_n|L̂||Û|. By
Theorem 8.5, substitution produces ŷ and x̂ satisfying

    (L̂ + ΔL)ŷ = b,    |ΔL| ≤ γ_n|L̂|,
    (Û + ΔU)x̂ = ŷ,    |ΔU| ≤ γ_n|Û|.

Thus

    b = (L̂ + ΔL)(Û + ΔU)x̂ = (A + ΔA_1 + L̂ΔU + ΔLÛ + ΔLΔU)x̂ =: (A + ΔA)x̂,

where |ΔA| ≤ (3γ_n + γ_n^2)|L̂||Û|. We need a constant 2γ_n instead of
3γ_n + γ_n^2. Although it is not usually worth expending effort reducing constants
in error bounds (see the Wilkinson quotation at the start of Chapter 10), we
will optimize constants in this important case. Consideration of Lemma 8.4
shows that we actually have slightly stronger bounds on ΔA_1, ΔL, and ΔU;
combining them using Lemma 3.3 gives the required constant 2γ_n.


How do we interpret Theorem 9.4? Ideally, we would like |ΔA| ≤ u|A|,
which corresponds to the uncertainty introduced by rounding the elements of
A, but because each element of A undergoes up to n arithmetic operations we
cannot expect better than a bound |ΔA| ≤ c_n u|A|, where c_n is a constant of
order n. Such a bound holds if L̂ and Û satisfy |L̂||Û| = |L̂Û|, which certainly
holds if L̂ and Û are nonnegative, because then (9.6) gives

    |L̂||Û| = |L̂Û| = |A + ΔA| ≤ |A| + γ_n|L̂||Û|.                  (9.8)

Substituting into (9.7), we obtain

    (A + ΔA)x̂ = b,    |ΔA| ≤ 2γ_n(1 − γ_n)^{-1}|A|.

This result says that x̂ has a small componentwise relative backward error.
One class of matrices that has nonnegative LU factors is defined as follows.
A ∈ R^{n×n} is totally positive (nonnegative) if the determinant of every square
submatrix is positive (nonnegative). In particular, this definition requires that
a_ij and det(A) be positive or nonnegative. Some examples of totally nonneg-
ative matrices are given in Chapter 26. If A is totally nonnegative then it has
an LU factorization A = LU in which L and U are totally nonnegative, so
that L ≥ 0 and U ≥ 0 (see Problem 9.6); moreover, the computed factors L̂
and Û are nonnegative for sufficiently small values of the unit roundoff u [273,
1977]. Inverses of totally nonnegative matrices also have the property that
|A| = |L||U| (see Problem 9.7). Note that the property of a matrix or its
inverse being totally nonnegative is generally destroyed under row permuta-
tions. Hence for totally nonnegative matrices and their inverses it is best to
use Gaussian elimination without pivoting.
One important fact that follows from (9.6) and (9.7) is that the stability
of GE is determined not by the size of the multipliers but by the size of the
matrix |L̂||Û|. This matrix can be small when the multipliers are large, and
large when the multipliers are of order 1 (as we will see in the next section).
    To understand the stability of GE further we turn to norms. For GE with-
out pivoting, the ratio || |L||U| ||/||A|| can be arbitrarily large. For example,
for the matrix [ε 1; 1 1] with 0 < ε ≪ 1 the ratio is of order ε^{-1}. Assume then
that partial pivoting is used. Then |l_ij| ≤ 1 for all i ≥ j, since the l_ij are the
multipliers. And it is easy to show by induction that |u_ij| ≤ 2^{i−1} max_{k≤i} |a_kj|.
Hence, for partial pivoting, L is small and U is bounded relative to A.
Traditionally, backward error analysis for GE is expressed in terms of the
growth factor

    ρ_n = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|,

which involves all the elements a_ij^(k) (k = 1:n) that occur during the elimina-
tion. Using the bound |u_ij| ≤ ρ_n max_{i,j} |a_ij| we obtain the following
classic theorem.

Theorem 9.5 (Wilkinson). Let A ∈ R^{n×n} and suppose GE with partial piv-
oting produces a computed solution x̂ to Ax = b. Then

    (A + ΔA)x̂ = b,    ||ΔA||_∞ ≤ 2n^2 γ_n ρ_n ||A||_∞.            (9.9)

We hasten to admit to using an illicit manoeuvre in the derivation of this
theorem: we have used bounds for L̂ and Û that strictly are valid only for the
exact L and U. We could instead have defined the growth factor in terms of
the computed elements â_ij^(k), but then any bounds for the growth factor would
involve the unit roundoff (similarly, we can only guarantee that |l̂_ij| ≤ 1 + u). Our
breach of correctness is harmless for the purposes to which we will put the
theorem.
The assumption in Theorem 9.5 that partial pivoting is used is not nec-
essary: essentially the same result holds for GE without pivoting (see Prob-
lem 9.8). The normwise backward stability of GE with or without pivoting is
therefore governed by the growth factor, to which we now turn our attention.

9.3. The Growth Factor


It is easy to show that ρ_n ≤ 2^{n−1} for partial pivoting. Wilkinson notes that
this upper bound is achieved for matrices of the form illustrated for n = 4 by

    [  1   0   0   1 ]
    [ −1   1   0   1 ]
    [ −1  −1   1   1 ]
    [ −1  −1  −1   1 ].

For these matrices, no interchanges are required by partial pivoting, and there
is exponential growth of elements in the last column of the reduced matrices.
In fact, this is just one of a nontrivial class of matrices for which partial
pivoting achieves maximal growth. When necessary in the rest of this chapter,
we denote the growth factor for partial pivoting by ρ_n^p and that for complete
pivoting by ρ_n^c.

Theorem 9.6 (Higham and Higham). All real n × n matrices A for which
ρ_n^p(A) = 2^{n−1} are of the form

where D = diag(±1), M is unit lower triangular with m_ij = −1 for i > j,
T is an arbitrary nonsingular upper triangular matrix of order n−1, d =
(1, 2, 4, ..., 2^{n−1})^T, and α is a scalar such that α := |a_1n| = max_{i,j} |a_ij|.

Proof. GE with partial pivoting applied to a matrix A gives a factorization
B := PA = LU, where P is a permutation matrix. It is easy to show that
|u_ij| ≤ 2^{i−1} max_{r≤i} |b_rj|, with equality for i = s only if there is equality for
i = 1:s − 1. Thus ρ_n = 2^{n−1} implies that the last column of U has the form
αDd, and also that |b_1n| = max_{i,j} |b_ij|. By considering the final column of B,
and imposing the requirement that |l_ij| ≤ 1, it is easy to show that the unit
lower triangular matrix L must have the form L = DMD. It follows that
at each stage of the reduction every multiplier is ±1; hence no interchanges
are performed, that is, P = I. The only requirement on T is that it be
nonsingular, for if t_ii = 0 then the ith elimination stage would be skipped
because of a zero pivot column and no growth would be produced on that
stage.
Note that by varying the elements m_ij (i > j) and the vector d in The-
orem 9.6 we can construct matrices for which ρ_n^p achieves any desired value
between 1 and 2^{n−1}.
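As a quick numerical illustration (ours, not from the book), the following sketch builds the n = 8 instance of the matrix displayed above and measures the growth factor by monitoring the reduced matrices during the elimination; partial pivoting performs no interchanges and the growth factor is 2^{n−1} = 128.

```python
import numpy as np

def growth_factor_partial_pivoting(A):
    """Growth factor max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij| for GE with
    partial pivoting (a simple dense sketch that tracks the reduced matrices)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    max_original = np.abs(A).max()
    max_element = max_original
    for k in range(n - 1):
        r = k + np.argmax(np.abs(A[k:, k]))
        A[[k, r], :] = A[[r, k], :]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        max_element = max(max_element, np.abs(A[k+1:, k+1:]).max())
    return max_element / max_original

n = 8
W = np.tril(-np.ones((n, n)), -1) + np.eye(n)  # -1s below the diagonal, 1s on it
W[:, -1] = 1.0                                 # 1s in the last column
print(growth_factor_partial_pivoting(W))       # 2^(n-1) = 128
```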
Despite the existence of matrices for which ρ_n is large with partial piv-
oting, the growth factor is almost invariably small in practice. For example,
Wilkinson says "It is our experience that any substantial increase in size of
elements of successive A_r is extremely uncommon even with partial pivoting
... No example which has arisen naturally has in my experience given an
increase by a factor as large as 16" [1089, 1965, pp. 213-214].
Until recently, there were no reports in the literature of large growth factors
being observed in practical applications. However, Wright [1116, 1993] has
found a class of two-point boundary value problems that, when solved by the
multiple shooting method, yield a linear system for which partial pivoting
suffers exponential growth. The matrix is block lower bidiagonal, except for
a nonzero block in the top right-hand corner. Furthermore, Foster [399, 1994]
shows that a quadrature method for solving a practically occurring Volterra

integral equation gives rise to linear systems for which partial pivoting again
gives large growth factors.
There exist some well-known matrices that give unusually large, but not
exponential, growth. They can be found using the following theorem, which is
applicable whatever the strategy for interchanging rows and columns in GE.

Theorem 9.7 (Higham and Higham). Let A ∈ C^{n×n} be nonsingular and set
α = max_{i,j} |a_ij|, β = max_{i,j} |(A^{−1})_ij|, and θ = (αβ)^{−1}. Then θ ≤ n, and for
any permutation matrices P and Q such that PAQ has an LU factorization,
the growth factor ρ_n for GE without pivoting on PAQ satisfies ρ_n ≥ θ.

Proof. The inequality θ ≤ n follows from 1 = |(AA^{−1})_{ii}| ≤ nαβ. Consider
an LU factorization PAQ = LU computed by GE. We have

    |u_nn|^{−1} = |(U^{−1})_{nn}| = |(U^{−1}L^{−1})_{nn}| = |((PAQ)^{−1})_{nn}| ≤ β.    (9.10)

Hence |u_nn| ≥ β^{−1}, and the result follows, since ρ_n ≥ |u_nn|/α.

Note that θ^{−1} = αβ satisfies n^{−2}κ_∞(A) ≤ αβ ≤ κ_∞(A). Clearly, A
has to be very well conditioned for the theorem to provide a lower bound θ
near the maximum of n.
We apply the theorem to three noncontrived matrices that appear in prac-
tical applications.
    (1) The matrix

    S_n = ( sqrt(2/(n+1)) sin(ijπ/(n+1)) )_{i,j=1}^{n}            (9.11)

is the symmetric, orthogonal eigenvector matrix for the second difference ma-
trix (the tridiagonal matrix with typical row (−1, 2, −1); see §26.5); it arises,
for example, in the analysis of time series [19, 1971, §6.5]. Theorem 9.7 gives
ρ_n(S_n) ≥ (n + 1)/2.
(2) A Hadamard matrix H_n is an n × n matrix with elements h_ij = ±1
and for which the rows of H_n are mutually orthogonal. Hadamard matrices
exist only for certain n; a necessary condition for their existence if n > 2 is
that n is a multiple of 4. For more about Hadamard matrices see Hall [494,
1967, Chap. 14], Wallis [1062, 1993], and Wallis, Street, and Wallis [1063,
1972]. We have H_n H_n^T = nI, and so Theorem 9.7 gives ρ_n ≥ n.
    (3) The next matrix is a complex Vandermonde matrix based on the roots
of unity, which occurs in the evaluation of Fourier transforms (see §23.1):

    V_n = ( ω^{(i−1)(j−1)} )_{i,j=1}^{n},    ω = exp(−2πi/n).     (9.12)

Since V_n^{−1} = n^{−1}V_n^*, Theorem 9.7 gives ρ_n(V_n) ≥ n.
    Note that each of these matrices is orthogonal or unitary (to within a row
scaling in the case of the Hadamard matrix), so it is not necessary to apply GE
to them! This may explain why growth factors of order n for these matrices
have not been reported in the literature as occurring in practice.
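The lower bound θ of Theorem 9.7 is trivial to evaluate numerically; the following sketch (ours) does so for the Fourier matrix V_n of (9.12), for which θ = n.

```python
import numpy as np

def theta_lower_bound(A):
    """theta = (alpha*beta)^(-1) from Theorem 9.7: a lower bound on the
    growth factor for GE, whatever pivoting strategy is used."""
    alpha = np.abs(A).max()
    beta = np.abs(np.linalg.inv(A)).max()
    return 1.0 / (alpha * beta)

n = 16
omega = np.exp(-2j * np.pi / n)
V = omega ** np.outer(np.arange(n), np.arange(n))   # V_ij = omega^{(i-1)(j-1)}
print(theta_lower_bound(V))   # equals n (here 16), up to roundoff
```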
To summarize, although there are practically occurring matrices for which
partial pivoting yields a moderately large, or even exponentially large, growth
factor, the growth factor is almost invariably found to be small. Explaining
this fact remains one of the major unsolved problems in numerical analysis.
The best attempt to date is by Trefethen and Schreiber [1019, 1990], who
develop a statistical model of the average growth factor for partial pivoting
and complete pivoting. Their model supports their empirical findings that for
various distributions of random matrices the average growth factor (normal-
ized by the standard deviation of the initial matrix elements) is close to n^{2/3}
for partial pivoting and n^{1/2} for complete pivoting. Extensive experiments by
Edelman suggest that for random matrices from the normal N(0, 1) distribu-
tion the unnormalized growth factor for partial pivoting grows like n^{1/2} [345,
1995].
We turn now to complete pivoting. Wilkinson [1085, 1961, pp. 282-285]
showed that

    ρ_n^c ≤ n^{1/2} (2 · 3^{1/2} · 4^{1/3} ··· n^{1/(n−1)})^{1/2} ~ c n^{1/2} n^{(log n)/4},    (9.13)

and that this bound is not attainable. The bound is a much more slowly
growing function than 2^{n−1}, but can still be quite large (e.g., it is 3570 for
n = 100). As for partial pivoting, in practice the growth factor is usually
small. Wilkinson stated that "no matrix has been encountered in practice for
which p_1/p_n was as large as 8" [1085, 1961, p. 285] and that "no matrix has
yet been discovered for which f(r) > r" [1089, 1965, p. 213] (here, p_i is the
(n − i + 1)st pivot and f(r) ...).
Cryer [256, 1968] defined

    g(n) = sup{ ρ_n^c(A) : A ∈ R^{n×n} }.                        (9.14)

The following results are known:

• g(2) = 2 (trivial).

• g(3) = 2¼; Tornheim [1012, 1965] and Cohen [229, 1974].

• g(4) = 4; Cryer [256, 1968] and Cohen [229, 1974].

• g(5) < 5.005; Cohen [229, 1974].



Tornheim [1012, 1965] (see also Cryer [256, 1968]) showed that ρ_n^c(H_n) ≥ n
for any n × n Hadamard matrix H_n (a bound which, as we saw above, holds
for any form of pivoting). For n such that a Hadamard matrix does not exist,
the best known lower bound is g(n) ≥ ρ_n^c(S_n) ≥ (n + 1)/2 (see (9.11)).
    Cryer [256, 1968] conjectured that for real matrices ρ_n^c(A) ≤ n, with equal-
ity if and only if A is a Hadamard matrix. The conjecture ρ_n^c ≤ n became
one of the most famous open problems in numerical analysis, and has been
investigated by many mathematicians. The conjecture was finally shown to be
false in 1991. Using the package LANCELOT [236, 1992], designed for large-scale
nonlinear optimization, Gould [474, 1991] discovered a 13 × 13 matrix for which
the growth factor is 13.0205 in IEEE double precision floating point arith-
metic. Edelman subsequently showed, using the symbolic manipulation pack-
ages Mathematica and Maple, that a growth factor 13.02 can be achieved in
exact arithmetic by making a small perturbation (of relative size 10^{−7}) to one
element of Gould's matrix [338, 1992], [348, 1991]. A more striking counterex-
ample to the conjecture is a matrix of order 25 for which ρ_25^c = 32.986341 [338,
1992]. Interesting problems remain, such as determining lim_{n→∞} g(n)/n and
evaluating ρ_n^c for Hadamard matrices (see Problem 9.15).
For complex matrices the maximum growth factor is at least n for any
n, since ρ_n^c(V_n) ≥ n (see (9.12)). The growth can exceed n, even for n = 3:
Tornheim [1012, 1965] constructed an example

for which ρ_3^c = 3.079.

9.4. Special Matrices


For matrices with certain special properties, more can be said about the be-
haviour of GE and, in particular, the size of the growth factor.
    As a first example, suppose A ∈ C^{n×n} is diagonally dominant by rows,

    |a_ii| ≥ Σ_{j≠i} |a_ij|,    i = 1:n,

or diagonally dominant by columns, that is, A* is diagonally dominant by
rows. Then GE without pivoting is perfectly stable.

Theorem 9.8 (Wilkinson). If A ∈ C^{n×n} is diagonally dominant by rows or
columns then A has an LU factorization without pivoting and the growth factor
ρ_n ≤ 2. If A is diagonally dominant by columns then |l_ij| ≤ 1 for all i and
j in the LU factorization without pivoting (hence GEPP does not require any
row interchanges).

Figure 9.1. A banded matrix.

Proof. The result follows immediately from the more general Theo-
rems 12.5 and 12.6 for block diagonally dominant matrices.
    Note that for a matrix diagonally dominant by rows the multipliers can
be arbitrarily large but, nevertheless, ρ_n ≤ 2, so GE is perfectly stable.
    A smaller bound for the growth factor also holds for an upper Hessenberg
matrix. (H is upper Hessenberg if h_ij = 0 for i > j + 1.)

Theorem 9.9 (Wilkinson). If H ∈ C^{n×n} is upper Hessenberg then ρ_n ≤ n.

Proof. The structure of an upper Hessenberg H means that at each stage
of GEPP we just add a multiple of the pivot row to the next row (after
possibly swapping these two rows). That ρ_n ≤ n is a consequence of the
following claim, which is easily proved by induction: at the start of stage k,
row k + 1 of the reduced matrix is the same as row k + 1 of the original matrix,
and the pivot row has elements of modulus at most k times the largest element
of H.

A matrix A ∈ R^{n×n} has lower bandwidth p and upper bandwidth q if a_ij = 0
for i > j + p and j > i + q; see Figure 9.1. It is well known that in an LU
factorization of a banded matrix the factors inherit A's band structure: L
has lower bandwidth p and U has upper bandwidth q. If partial pivoting is
used then, in PA = LU, L has at most p + 1 nonzeros per column and U has
upper bandwidth p + q. (For proofs of these properties see Golub and Van
Loan [470, 1989, §4.3].) It is not hard to see that for a banded matrix, γ_n in
Theorem 9.3 can be replaced by γ_{max(p+1, q+1)} and 2γ_n in Theorem 9.4 can be
replaced by γ_{max(p+1, q+1)} + γ_{p+q+1}.
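These structural facts are easy to verify numerically; the following sketch (ours, not from the book) checks them with SciPy's partial-pivoting LU on a random matrix with lower bandwidth p = 2 and upper bandwidth q = 1.

```python
import numpy as np
from scipy.linalg import lu

n, p, q = 12, 2, 1
rng = np.random.default_rng(1)
# Random matrix with lower bandwidth p and upper bandwidth q.
A = np.triu(np.tril(rng.standard_normal((n, n)), q), -p)

P, L, U = lu(A)                                    # A = P @ L @ U
nnz_per_col_L = (np.abs(L) > 1e-14).sum(axis=0).max()
i, j = np.nonzero(np.abs(U) > 1e-14)
upper_bandwidth_U = (j - i).max()
print(nnz_per_col_L <= p + 1)        # True: at most p+1 nonzeros per column of L
print(upper_bandwidth_U <= p + q)    # True: upper bandwidth of U is at most p+q
```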
The following result bounds the growth factor for partial pivoting on a
banded matrix.

Theorem 9.10 (Bohte). If A ∈ R^{n×n} has upper and lower bandwidths p then

and this bound is almost attainable when n = 2p+1.

Proof. See Bohte [131, 1975]. An example with n = 9 and p = 4 in which
equality is almost attained is the matrix

where ε is an arbitrarily small positive number, which ensures that rows 1
and 5 are interchanged on the first stage of the elimination, this being the
only row interchange required. Ignoring terms in ε, the last column of U
in PA = LU is [1, 1, 2, 4, 8, 16, 31, 60, 116]^T and the growth factor is
116.
A special case of Theorem 9.10 is the easily verified result that for a tridi-
agonal matrix, ρ_n ≤ 2. Hence GEPP achieves a small normwise backward
error for tridiagonal matrices. In the next section we show that for several
types of tridiagonal matrix GE without pivoting achieves a small component-
wise backward error.

9.5. Tridiagonal Matrices


Consider the nonsingular tridiagonal matrix A ∈ R^{n×n} with diagonal
elements d_i, subdiagonal elements c_i (a_{i,i−1} = c_i), and superdiagonal elements
e_i (a_{i,i+1} = e_i), and assume that A has an LU factorization A = LU, where
L is unit lower bidiagonal with subdiagonal elements l_i and U is upper
bidiagonal with diagonal elements u_i and superdiagonal elements e_i (the same
as those of A).                                                  (9.15)

GE for computing L and U is described by the recurrence relations

    u_1 = d_1;    l_i = c_i/u_{i−1},   u_i = d_i − l_i e_{i−1},   i = 2:n.    (9.16)
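In code, the recurrence (9.16) is a single loop; the sketch below (ours, not the book's) computes the factorization and then solves Ax = b by forward and back substitution, using the notation of (9.15). No pivoting is performed, so it is intended for the matrix classes discussed later in this section.

```python
import numpy as np

def tridiag_lu_solve(c, d, e, b):
    """Solve a tridiagonal system via the recurrence (9.16), without pivoting.
    d[0..n-1] is the diagonal, c[1..n-1] the subdiagonal (c[0] unused),
    e[0..n-2] the superdiagonal."""
    n = len(d)
    l = np.zeros(n)           # multipliers l_2..l_n (l[0] unused)
    u = np.zeros(n)           # diagonal of U
    u[0] = d[0]
    for i in range(1, n):
        l[i] = c[i] / u[i - 1]
        u[i] = d[i] - l[i] * e[i - 1]
    # Forward substitution: L y = b.
    y = np.zeros(n)
    y[0] = b[0]
    for i in range(1, n):
        y[i] = b[i] - l[i] * y[i - 1]
    # Back substitution: U x = y (U has diagonal u and superdiagonal e).
    x = np.zeros(n)
    x[-1] = y[-1] / u[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - e[i] * x[i + 1]) / u[i]
    return x
```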

For the computed quantities, we have

Hence

In matrix terms these bounds may be written as

(9.17)

If the LU factorization is used to solve a system Ax = b by forward and back


substitution then it is straightforward to show that the computed solution
satisfies

(9.18)

Combining (9.17) and (9.18) we have, overall,

(9.19)

The backward error result (9.19) applies to arbitrary nonsingular tridiag-
onal A having an LU factorization. We are interested in determining classes
of tridiagonal A for which a bound of the form |ΔA| ≤ g(u)|A| holds. Such a
bound will hold if |L||U| = |LU|, as noted in §9.2 (see (9.8)).
    Three classes of matrices for which |L||U| = |LU| holds for the exact L
and U are identified in the following theorem.

Theorem 9.11. Let A ∈ R^{n×n} be nonsingular and tridiagonal. If any of the
following conditions hold then A has an LU factorization and |L||U| = |LU|:
    (a) A is symmetric positive definite;
    (b) A is totally nonnegative, or equivalently, L ≥ 0 and U ≥ 0;
    (c) A is an M-matrix, or equivalently, L and U have positive diagonal
elements and nonpositive off-diagonal elements;
    (d) A is sign equivalent to a matrix B of type (a)-(c), that is, A = D_1 B D_2,
where |D_1| = |D_2| = I.

Proof. For (a), it is well known that a symmetric positive definite A has
an LU factorization in which U = DL^T, where D is diagonal with positive
diagonal elements. Hence |L||U| = |L||D||L^T| = |LDL^T| = |LU|, where the
middle equality requires a little thought. In (b) and (c) the equivalences,
and the existence of an LU factorization, follow from known results on totally
nonnegative matrices [258, 1976] and M-matrices [94, 1994]; |L||U| = |LU| is
immediate from the sign properties of L and U. (d) is trivial.
    For diagonally dominant matrices, |L||U| is not equal to |LU| = |A|, but
it cannot be much bigger.

Theorem 9.12. Suppose A ∈ R^{n×n} is nonsingular, tridiagonal, and diag-
onally dominant by rows or columns, and let A have the LU factorization
A = LU. Then |L||U| ≤ 3|A|.

Proof. If |i − j| = 1 then (|L||U|)_ij = |a_ij|, so it suffices to consider the
diagonal elements and show that (using the notation of (9.15))

    |l_i e_{i−1}| + |u_i| ≤ 3|d_i|.

The rest of the proof is for the case where A is diagonally dominant by rows;
the proof for diagonal dominance by columns is similar.
    First, we claim that |e_i| ≤ |u_i| for all i. The proof is by induction. For
i = 1 the result is immediate, and if it is true for i − 1 then, from (9.16),

    |u_i| ≥ |d_i| − |c_i| |e_{i−1}|/|u_{i−1}| ≥ |d_i| − |c_i| ≥ |e_i|,

as required. Note that, similarly, |u_i| ≤ |d_i| + |c_i|. Finally,

    |l_i e_{i−1}| + |u_i| ≤ |c_i| + (|d_i| + |c_i|) ≤ 3|d_i|.
Theorem 9.13. If the nonsingular tridiagonal matrix A is of type (a)-(d) in
Theorem 9.11, and if the unit roundoff u is sufficiently small, then GE for
solving Ax = b succeeds and the computed solution x̂ satisfies

The same conclusion is true if A is diagonally dominant by rows or columns,
with no restriction on u, provided the bound is multiplied by 3.

Proof. If u is sufficiently small then for types (a)-(c) the diagonal elements
of U will be positive, since û_i → u_i as u → 0. It is easy to see that û_i > 0
for all i ensures that |L̂||Û| = |L̂Û|. The argument is similar for type (d). The
result therefore follows from (9.19) and (9.8). The last part is trivial.
    A corollary of Theorem 9.13 is that it is not necessary to pivot for the
matrices specified in the theorem (and, indeed, pivoting could vitiate the result
of the theorem). Note that large multipliers may occur under the conditions
of the theorem, but they do not affect the stability. For example, consider the
M-matrix

where 0 ≤ ε ≪ 1. The multiplier l_32 is unbounded as ε → 0, but |L||U| = |A|
and GE performs very stably, as Theorem 9.13 shows it must.

9.6. Historical Perspective


GE was the first numerical algorithm to be subjected to rounding error anal-
ysis, so it is instructive to follow the development of the error analysis from
its beginnings in the 1940s.
    In the 1940s there were three major papers giving error analyses of GE.
Hotelling [583, 1943] presented a short forward error analysis of the LU factor-
ization stage of GE. Under the assumptions that |a_ij| ≤ 1 and |b_i| ≤ 1 for all
i and j and that the pivots are all of modulus unity, Hotelling derives a bound
containing a factor 4^{n−1} for the error in the elements of the reduced upper
triangular system. Hotelling's work was quoted approvingly by Bargmann,
Montgomery, and von Neumann [55, 1946], who dismiss elimination methods
for the solution of a linear system Ax = b as being numerically unstable. In-
stead, they recommended computation of A^{−1} via the Newton-Schulz iteration
[908, 1933] (which was also discussed by Hotelling). In one paragraph they
outline the alleged shortcomings of elimination methods as follows:

In the elimination method a series of n compound operations is
performed each of which depends on the preceding. An error at
any stage affects all succeeding results and may become greatly
magnified; this explains roughly why instability should be ex-
pected. It should be noticed that at each step a division is per-
formed by a number whose size cannot be estimated in advance
and which might be so small that any error in it would be greatly
magnified by division. In fact such small divisors must occur if the
determinant of the matrix is small and may occur even if it is not

. . . Another reason to expect instability is that once the variable
x_n is obtained all the other variables are expressed in terms of it.
As Wilkinson [1098, 1974, p. 354] notes of this paragraph, “almost every state-
ment in it is either wrong or misleading”.
Hotelling’s result led to general pessimism about the practical effectiveness
of GE for solving large systems of equations. Three papers later in the same
decade helped to restore confidence in GE.
Goldstine [460, 1972, p. 290] says of his discussions with von Neumann:
We did not feel it reasonable that so skilled a computer as Gauss
would have fallen into the trap that Hotelling thought he had noted
. . . Von Neumann remarked one day that even though errors may
build up during one part of the computation, it was only relevant
to ask how effective is the numerically obtained solution, not how
close were some of the auxiliary numbers, calculated on the way
to their correct counterparts. We sensed that at least for positive
definite matrices the Gaussian procedure could be shown to be
quite stable.
von Neumann and Goldstine [1057, 1947] subsequently gave a long and diffi-
cult rigorous fixed-point error analysis for the inversion of a symmetric pos-
itive definite matrix A via GE. They showed that the computed inverse X̂
satisfies ||AX̂ − I||_2 ≤ 14.2 n^2 u κ_2(A). Parlett [821, 1990] explains that "the joy
of this result was getting a polynomial in n, and the pain was obtaining 14.2,
a number that reflects little more than the exigencies of the analysis." Wilkin-
son [1095, 1971] gives an interesting critique of von Neumann and Goldstine's
paper and points out that the residual bound could hardly be improved using
modern error analysis techniques. In a later paper [462, 1951], Goldstine and
von Neumann gave a probabilistic analysis, which Goldstine summarizes as
showing that "under reasonable probabilistic assumptions the error estimates
of the previous paper could be reduced from a proportionality of n^2 to n"
[460, 1972, p. 291].
In his 1970 Turing Award Lecture [1096, 1971], Wilkinson recounts how in
the early 1940s he solved a system of 12 linear equations on a desk calculator,
obtaining a small residual. He goes on to describe a later experience:
It happened that some time after my arrival [at the National Physi-
cal Laboratory in 1946], a system of 18 equations arrived in Mathe-
matics Division and after talking around it for some time we finally
decided to abandon theorizing and to solve it . . . The operation
was manned by Fox, Goodwin, Turing, and me, and we decided
on Gaussian elimination with complete pivoting . . . Again the sys-
tem was mildly ill-conditioned, the last equation had a coefficient
of order 10^{−4} (the original coefficients being of order unity) and
the residuals were again of order 10^{−10}, that is of the size cor-
responding to the exact solution rounded to ten decimals. It is
interesting that in connection with this example we subsequently
performed one or two steps of what would now be called "iterative
refinement," and this convinced us that the first solution had had
almost six correct figures.

(Fox [403, 1987] notes that the computation referred to in this quotation
took about two weeks using desk computing equipment!) In a subsequent
paper, Fox, Huskey, and Wilkinson [404, 1948] presented empirical evidence
in support of GE, commenting that "in our practical experience on matrices
of orders up to the twentieth, some of them very ill-conditioned, the errors
were in fact quite small".
The experiences of Fox, Huskey, and Wilkinson prompted Turing to write
a remarkable paper "Rounding-off errors in matrix processes" [1027, 1948].
In this paper, Turing made several important contributions. He formulated
the LU (actually, the LDU) factorization of a matrix, proving the "if" part
of Theorem 9.1 and showing that GE computes an LDU factorization. He
introduced the term "condition number" and defined two matrix condition
numbers, one of which is n^{−1} N(A) N(A^{−1}), where N(A) = ||A||_F, the "N-
condition number of A". He used the word "preconditioning" to mean im-
proving the condition of a system of linear equations (a term that did not
come into popular use until the 1970s). He described iterative refinement
for linear systems. He exploited backward error ideas, for example by noting
that "the triangular resolution obtained is an exact resolution of a matrix
A − S, where M(S) ≤ ..." (M(S) := max_{i,j} |s_ij|). Finally, and perhaps most
importantly, he analysed GEPP for general matrices and obtained a bound
for the error in the computed solution that contains a term proportional to
κ(A)^2. (By making a trivial change in the analysis, namely replacing A^{−1}b
by x, Turing's bound can be made proportional only to κ(A).) Turing also
showed that the factor 4^{n−1} in Hotelling's bound can be improved to 2^{n−1},
and that even then the bound is attained only in exceptional cases.
In a review of Turing's paper, Bodewig [129, 1949] described the error
bounds as "impractical" and advocated computing the residual of the com-
puted solution and then determining "the exact correction by solving a new
system." That another researcher could miss the point of Turing's analysis
emphasizes how new the concept of rounding error analysis was in the 1940s.
Table 9.1 shows the time for solution of linear systems by GE on some
early computing devices. The performance of modern computers on two linear
system benchmarks is summarized by Dongarra [312, 1995]; Dongarra’s report
is regularly updated and can be obtained from netlib under the benchmark
directory.
Douglas [319, 1959] presented a forward error analysis for GE applied to

Table 9.1. Times for solution of a linear system of order n.

    Machine                       Year      n      Time           Reference

    Logarithm tables              c. 1884   29^a   7 weeks        [952, 1994]
    Desk computing equipment      c. 1946   18     2 weeks        [403, 1987]
    Harvard Mark 1                1947      10     45 minutes^b
    IBM 602 Calculating Punch     1949      10     4 hours        [1053, 1949]
    Pilot ACE                     1951      17     over 3 hours   [1110, 1958]
    Pilot ACE^c                   1954      30     1½ mins        [1110, 1958]
    ACE                           1958      30     5 seconds      [1110, 1958]
    EDSAC 2                       1960      31     4 seconds      [73, 1960]
    EDSAC 2^d                     1960      100    7 minutes      [73, 1960]

    ^a Symmetric positive definite system.
    ^b [127, 1948, p. 27], [507, 1948, p. 336].
    ^c With magnetic drum store.
    ^d Using magnetic tape for auxiliary storage.

diagonally dominant tridiagonal systems arising in the solution of the heat


equation by finite differences. He concluded that the whole procedure of
solving this partial differential equation “is stable against round-off error”. It
is surprising that Douglas’ paper is little known, because irrespective of the
fact that his analysis can be simplified and clarified using modern techniques,
his is one of the first truly positive rounding error results to be published.
A major breakthrough in the error analysis of GE came with Wilkinson’s
pioneering backward error analysis, in which he proved Theorem 9.5 [1085,
1961], [1088, 1963]. Apart from its simplicity and elegance and the realistic
nature of the bounds, the main feature that distinguishes Wilkinson’s analysis
from the earlier error analyses of GE is that it bounds the normwise backward
error rather than the forward error.
Wilkinson had been aware of the properties of the growth factor for par-
tial pivoting long before developing his backward error analysis. In a 1954
paper [1081, 1954] he noted that

After m reductions the largest element is at most 2^m times as large
as the largest original coefficient. It is possible to construct sets
in which this factor is achieved but in practice an increase seldom
takes place; more frequently the coefficients become progressively
smaller, particularly if the equations are ill-conditioned.

This quote summarizes most of what we know today!


Four of the first textbooks to incorporate Wilkinson’s analysis were those of
Fox [400, 1964, pp. 161-174], Isaacson and Keller [607, 1966], Wendroff [1074,

1966], and Forsythe and Moler [396, 1967, Chap. 21]. Fox gives a simplified
analysis for fixed-point arithmetic under the assumption that the growth fac-
tor is of order 1. Forsythe and Moler give a particularly readable backward
error analysis that has been widely quoted.
Wilkinson's 1961 result is essentially the best that can be obtained by
a normwise analysis. Subsequent work in error analysis for GE has mainly
been concerned with bounding the backward error componentwise, as in The-
orems 9.3 and 9.4. We note that Wilkinson could have given a componentwise
bound for the backward perturbation ΔA, since most of his analysis is at the
element level.
    Chartres and Geuder [200, 1967] analyse the Doolittle version of GE. They
derive a backward error result (A + ΔA)x̂ = b, with a componentwise bound
on ΔA; although they do not recognize it, their bound can be written in the
form |ΔA| ≤ c_n u |L̂||Û|.
Reid [867, 1971] shows that the assumption in Wilkinson's analysis that
partial pivoting or complete pivoting is used is unnecessary. Without making
any assumptions on the pivoting strategy, he derives for LU factorization the
result L̂Û = A + ΔA, |Δa_ij| ≤ 3.01 min(i − 1, j) u max_k |â_ij^(k)|. Again, this is a
componentwise bound. Erisman and Reid [355, 1974] note that for a sparse
matrix, the term min(i − 1, j) in Reid's bound can be replaced by m_ij, where
m_ij is the number of multiplications required in the calculation of l_ij (i > j)
or u_ij (i < j).
de Boor and Pinkus [273, 1977] give the result stated in Theorem 9.4.
They refer to the original 1972 German edition of [955, 1980] for a proof
of the result and explain several advantages to be gained by working with a
componentwise bound for ΔA, one of which is the strong result that ensues for
totally nonnegative matrices. A result very similar to Theorem 9.4 is proved
by Sautter [895, 1978].
Skeel [919, 1979] carried out a detailed componentwise error analysis of
GE with a different flavour to the analysis given in this chapter. His aim was
to understand the numerical stability of GE (in a precisely defined sense) and
to determine the proper way to scale a system by examining the behaviour
of the backward and forward errors under scaling (see §9.7). He later used
this analysis to derive important results about fixed precision iterative refine-
ment (see Chapter 11). Skeel's work popularized the use of componentwise
backward error analysis and componentwise perturbation theory.
    The componentwise style of backward error analysis for GE is
now well known, as evidenced by its presence in the textbooks of Conte and
de Boor [237, 1980], Golub and Van Loan [470, 1989] (also the 1983 first
edition), and Stoer and Bulirsch [955, 1980].
Forward error analyses have also been done for GE. The analyses are more
complicated and more difficult to interpret than the backward error analyses.
Olver and Wilkinson [810, 1982] derive a posteriori forward error bounds that

require the computation of A^{−1}. Further results are given in a series of papers
by Stummel [965, 1982], [966, 1985], [967, 1985], [968, 1985].
Finally, probabilistic error analysis for GE is given by Barlow and Bareiss
[63, 1985].

9.7. Scaling
Prior to solving a linear system Ax = b by GE we are at liberty to scale the
rows and the columns:

    Ax = b   →   (D_1 A D_2)(D_2^{−1} x) = D_1 b,  i.e.,  A'y = c,            (9.20)

where D_1 and D_2 are nonsingular diagonal matrices. We apply GE to the
scaled system A'y = c and then recover x from x = D_2 y. Although scaling
was used in some of the earliest published programs for GE [396, 1967], [745,
1962], how best to choose a scaling is still not well understood, and no single
scaling algorithm can be guaranteed always to perform satisfactorily. Wilkin-
son's remark "We cannot decide whether equations are ill-conditioned without
examining the way in which the coefficients were derived" [1089, 1965, p. 198]
sums up the problem of scaling rather well.
The effect of scaling in GE without pivoting is easy to describe. If the
elements of D_1 and D_2 are powers of the machine base β (so that the scaling
is done without error) and GE produces L̂ and Û satisfying L̂Û = A + ΔA,
then GE on A' = D_1 A D_2 produces factors satisfying L̂'Û' = A' + D_1 ΔA D_2.
In other words, the rounding errors in GE scale in the same way as A. This
is a result of Bauer [78, 1963] (see [396, 1967, Chap. 11] for a clear proof and
discussion). With partial pivoting, however, the choice of pivots is affected
by the row scaling (though not the column scaling), and in a way that is
difficult to predict.
We can take a method-independent approach to scaling, by considering
any method for solving Ax = b that yields a solution x̂ whose forward error
is bounded by c_n u κ_∞(A), with c_n a constant. For the scaled system (9.20)
the same bound holds with κ_∞(D_1 A D_2) in place of κ_∞(A), so it is natural
to choose D_1 and D_2 to minimize κ_∞(D_1 A D_2). As we saw in
§7.3 (Theorem 7.8), the minimum possible value is no larger than ρ(|A^{−1}||A|).
However, a column scaling has the (usually) undesirable effect of changing the
norm in which the error is measured. With row scaling only, the minimum

value of κ_∞(D_1 A) is cond(A) = || |A^{−1}||A| ||_∞, achieved when D_1 A has rows
of unit 1-norm (see (7.12)). Thus row equilibration yields a cond-bounded
forward error. For GE, though, it is possible to do even better. Skeel [919,
1979] shows that for D_1 = diag(|A||x|)^{−1}, the forward error bound for GEPP
is proportional to cond(A, x) = || |A^{−1}||A||x| ||_∞ / ||x||_∞; the catch is, of course,
that the scaling depends on the unknown solution x! Row equilibration can
be regarded as approximating x by e in this "optimal" scaling.
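For illustration only (not from the book), the two row scalings just described can be coded as follows; Skeel's scaling is included merely to show its dependence on the solution x.

```python
import numpy as np

def row_equilibrate(A, b):
    """Scale so that each row of D1*A has unit 1-norm (row equilibration)."""
    d = 1.0 / np.abs(A).sum(axis=1)
    return d[:, None] * A, d * b

def skeel_scaling(A, x):
    """Skeel's 'optimal' scaling D1 = diag(|A||x|)^{-1}; it requires the true
    solution x, so in practice it can only be approximated (e.g., by x = e)."""
    d = 1.0 / (np.abs(A) @ np.abs(x))
    return np.diag(d)
```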
The LINPACK LU factorization routines do not include scaling, while
in the LAPACK driver routine xGESVX an initial scaling is optional. One
reason why scaling is not popular with numerical analysts is that a cond(A, x)-
bounded forward error and a small componentwise relative backward error are
both achieved by fixed precision iterative refinement (assuming it converges);
see Chapter 11. Even Skeel's optimal scaling does not guarantee a small
componentwise relative backward error.
    Some programs for GEPP incorporate row scaling implicitly. They com-
pute row scale factors d_1, ..., d_n, but, instead of applying GEPP to diag(d_i)^{−1} ×
A, they apply it to A and choose as pivot row at the kth stage a row r for
which |a_rk^(k)|/d_r is maximal. This type of scaling has the sole effect of influenc-
ing the choice of pivots. There is little justification for using it, and the best
bound for the growth factor is 2^{n−1} multiplied by a product of terms d_{i_1}/d_{i_2}
that can be large.
There is, however, one situation in which a form of implicit row scaling
is beneficial. Consider the pivoting strategy that selects as the kth pivot an
element for which

                                                                  (9.21)

A result of Peña [825, 1994] shows that if there exists a permutation matrix P
such that PA has an LU factorization PA = LU with |PA| = |L||U|, then such
a factorization will be produced by the pivoting scheme (9.21). This means
that, unlike for partial pivoting, we can use the pivoting scheme (9.21) with
impunity on totally nonnegative matrices and their inverses, row permutations
of such matrices, and any matrix for which some row permutation has the
"|PA| = |L||U|" property. However, this pivoting strategy is as expensive as
complete pivoting to implement, and for general A it is not guaranteed to
produce a factorization as stable as that produced by partial pivoting.

9.8. A Posteriori Stability Tests


Having solved a linear system by LU factorization we can compute the com-
ponentwise or normwise backward error at the cost of evaluating one or two
matrix-vector products (see Theorems 7.1 and 7.3). In some situations,

though, we may wish to assess the stability of a computed LU factoriza-


tion before using it to solve one or more linear systems. One possibility is
to compute the growth factor by monitoring the size of elements during the
elimination, at a cost of O(n^3) comparisons. This has been regarded as rather
expensive, and more efficient ways to estimate ρ_n have been sought.
Businger [169, 1971] describes a way to obtain an upper bound for ρ_n in
O(n^2) operations. This approach is generalized by Erisman and Reid [355,
1974], who apply the Hölder inequality to the equation

to obtain the bound

                                                                  (9.22)

where p^{−1} + q^{−1} = 1. In practice, p = 1, 2, ∞ are the values of interest.
Barlow [56, 1986] notes that application of the Hölder inequality instead to

yields a sometimes sharper bound.


It is interesting to note that in light of experience with the bound (9.22),
Reid [868, 1987] recommends computing the growth factor explicitly in the
context of sparse matrices, arguing that the expense is justified because (9.22)
can be a very weak bound. See Erisman et al. [354, 1987] for some empirical
results on the quality of the bound.
    Chu and George [209, 1985] observe that the ∞-norm of the matrix |L̂||Û|
can be computed in O(n^2) operations without forming the matrix explicitly,
since

    || |L̂||Û| ||_∞ = || |L̂|(|Û|e) ||_∞,    e = (1, 1, ..., 1)^T.

Thus one can cheaply compute a bound on ||ΔA||_∞ from the componentwise
backward error bounds in (9.6) and (9.7).
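In code, the observation is simply that the ∞-norm of the nonnegative matrix |L̂||Û| is its largest row sum, obtainable from two matrix-vector products; a sketch (ours):

```python
import numpy as np

def norm_inf_absL_absU(L, U):
    """Compute || |L||U| ||_inf in O(n^2) operations without forming |L||U|:
    the infinity norm of a nonnegative matrix is its largest row sum, i.e.,
    the largest entry of |L| (|U| e) with e the vector of ones."""
    e = np.ones(U.shape[1])
    return (np.abs(L) @ (np.abs(U) @ e)).max()
```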
All the methods discussed in this section make use of an a priori error
analysis to compute bounds on the backward error. Because the bounds do not
take into account the statistical distribution of rounding errors, and because
they involve somewhat pessimistic constant terms, they cannot be expected
to be very sharp. Thus it is important not to forget that it is straightforward
to compute the backward error itself: A − L̂Û. Exact computation costs a
prohibitively expensive O(n^3) operations, but ||A − L̂Û||_1 can be estimated in
O(n^2) operations using the matrix norm estimator in Algorithm 14.4. Another
possibility is to use a running error analysis, in which an error bound is
computed concurrently with the factors (see §3.3).
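For completeness, here is a sketch (ours) of the direct computation mentioned at the start of this section: the normwise and componentwise relative backward errors of a computed solution (cf. Theorems 7.1 and 7.3), each requiring one matrix-vector product.

```python
import numpy as np

def backward_errors(A, b, x):
    """Normwise and componentwise relative backward errors of x as an
    approximate solution of Ax = b."""
    r = b - A @ x
    eta = np.linalg.norm(r, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
        + np.linalg.norm(b, np.inf))
    denom = np.abs(A) @ np.abs(x) + np.abs(b)
    omega = np.max(np.abs(r) / denom)   # assumes denom has no zero entries
    return eta, omega
```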

9.9. Sensitivity of the LU Factorization

Although Theorem 9.3 bounds the backward error of the computed LU factors
L̂ and Û, it does not give any indication about the size of the forward errors
L − L̂ and U − Û. For most applications of the LU factorization it is the
backward error and not the forward errors that matters, but it is still of some
interest to know how big the forward errors can be. This is a question of
perturbation theory and is answered by the next result.

Theorem 9.14 (Barrlund, Sun). Let the nonsingular matrices A ∈ R^{n×n}
and A + ΔA have LU factorizations A = LU and A + ΔA = (L + ΔL)(U + ΔU),
and assume that ||G||_2 < 1, where G = L^{−1} ΔA U^{−1}. Then

                                                                  (9.23)

Moreover, if ||Ĝ||_2 < 1, where Ĝ = (L + ΔL)^{−1} ΔA (U + ΔU)^{−1}, then

where stril(·) and triu(·) denote, respectively, the strictly lower triangular part
and the upper triangular part of their matrix arguments.

The normwise bounds (9.23) imply that χ(A) := ||L^{−1}||_2 ||U^{−1}||_2 ||A||_2 is
an upper bound for the condition numbers of L and U under normwise per-
turbations. We have κ_2(A) ≤ χ(A), and the ratio χ(A)/κ_2(A) can be arbitrarily
large (though if partial pivoting is used then κ_2(L) ≤ n2^{n−1}).
    The componentwise bounds in Theorem 9.14 are a little unsatisfactory in
that they involve the unknown matrices ΔL and ΔU, but we can set these
terms to zero and obtain a bound correct to first order.

9.10. Notes and References


A variant of GE was used by the Chinese around the first century AD; the Jiu
Zhang Suanshu (Nine Chapters of the Mathematical Art) contains a worked
example for a system of five equations in five unknowns [619, 1991, pp. 156-177],
[696, 1989].
    Gauss, who was a great statistician and numerical analyst, developed his
elimination method as a tool to help him prove results in linear regression
theory. The first published appearance of GE is in his Theoria Motus (1809).
Stewart [952, 1994] gives a survey of Gauss's work on solving linear systems;
see also the afterword in [423, 1995].
    The traditional form of GE, as given at the start of this chapter, can be
expressed algorithmically as

for k = 1:n
    for j = k+1:n
        for i = k+1:n
            a_ij = a_ij − (a_ik/a_kk) a_kj
        end
    end
end

This is identified as the kji form of GE. Altogether there are six possible
orderings of the three loops. Doolittle’s method (Algorithm 9.2) is the ijk
or jik variant of Gaussian elimination. The choice of loop ordering does not
affect the stability of the computation, but can greatly affect the efficiency of
GE on a high performance computer. For more on the different loop orderings
of GE see Chapter 12; Dongarra, Gustavson, and Karp [310, 1984]; and the
books by Dongarra, Duff, Sorensen, and van der Vorst [315, 1991] and Golub
and Van Loan [470, 1989].
This chapter draws on the survey paper Higham [545, 1990]. Theorems 9.6
and 9.7 are from Higham and Higham [562, 1989].
Myrick Hascall Doolittle (1830-1913) was a “computer of the United States
coast and geodetic survey” [362, 1987]. Crout’s method was published in an
engineering journal in 1941 [255, 1941].
GE and its variants were known by various descriptive names in the early
days of computing. These include the bordering method, the escalator method
(for matrix inversion), the square root method (Cholesky factorization), and
pivotal condensation. A good source for details of these methods is Fad-
deeva [360, 1959].
In a confidential 1948 report that “covers the general principles of both the
design of the [Automatic Computing Engine] and the method of programming
adopted for it”, Wilkinson gives a program implementing GE with partial
pivoting and iterative refinement [1080, 1948, p. 111]. This was probably the

first such program to be written and for a machine that had not yet been
built!
The terms "partial pivoting" and "complete pivoting" were introduced by
Wilkinson in [1085, 1961]. The pivoting techniques themselves were in use
in the 1940s and it is not clear who, if anyone, can be said to have invented
them; indeed, von Neumann and Goldstine [1057, 1947, §4.2] refer to complete
pivoting as the "customary procedure".
    There is a long history of published programs for GE, beginning with Crout
routines of Forsythe [390, 1960], Thacher [999, 1961], McKeeman [745, 1962],
and Bowdler, Martin, Peters, and Wilkinson [138, 1966], all written in Algol
60 (which was the "official" language for publishing mathematical software in
the 1960s, and a strong competitor to Fortran for practical use at that time).
The GE routines in LAPACK are the latest in a lineage beginning with the
Fortran routines decomp and solve in Forsythe and Moler‘s book [396, 1967],
and continuing with routines by Moler [766, 1972], [767, 1972] (which achieve
improved efficiency in Fortran by accessing arrays by column rather than by
row), Forsythe, Malcolm, and Moler [395, 1977] (these routines incorporate
condition estimation-see Chapter 14), and LINPACK [307, 1979].
LU factorization of totally nonnegative matrices has been investigated by
Cryer [257, 1973], [258, 1976], Ando [21, 1987], and de Boor and Pinkus
[273, 1977]. It is natural to ask whether we can test for total nonnegativity
without computing all the minors. The answer is yes: for an n × n matrix
total nonnegativity can be tested in O(n^3) operations, as shown by Gasca
and Peña [421, 1992]. The test involves carrying out a modified form of GE
in which all the elimination operations are between adjacent rows and then
checking whether certain pivots are positive. Note the analogy with positive
definiteness, which holds for a symmetric matrix if and only if all the pivots
in GE are positive.
The dilemma of whether to define the growth factor in terms of exact or
computed quantities is faced by all authors; most make one choice or the other,
and go on to derive, without comment, bounds that are strictly incorrect.
Theorem 9.8, for example, bounds the exact growth factor; the computed one
could conceivably violate the bound, but only by a tiny relative amount. van
Veldhuizen [1045, 1977] shows that for a variation of partial pivoting that
allows either a row or column interchange at each stage, the growth factor
defined in terms of computed quantities is at most about (1 + 3nu)2^{n−1},
compared with the bound 2^{n−1} for the exact growth factor.
The idea of deriving error bounds for GE by analysing the equations ob-
tained by solving A = LU is exploited by Wilkinson [1097, 1974], who gives a
general analysis that includes Cholesky factorization. This paper gives a con-
cise summary of error analysis of factorization methods for linear equations
and least squares problems.
Various authors have tabulated growth factors in extensive tests with ran-
dom matrices. In tests during the development of LINPACK, the largest value
observed was ρ_n = 23, occurring for a random matrix of 1s, 0s, and -1s [307,
1979, p. 1.21]. Macleod [720, 1989] recorded a value ρ_n = 35.1, which oc-
curred for a symmetric matrix with elements from the uniform distribution
on [-1, 1]. In one MATLAB test of 1000 matrices of dimension 100 from the
normal N(0, 1) distribution, I found the largest growth factor to be ρ_n = 9.59.
    Gould [474, 1991] used the optimization package LANCELOT [236, 1992] to maxi-
mize the nth pivot for complete pivoting as a function of about n^3/3 variables
comprising the intermediate elements a_ij^(k) of the elimination; constraints were
included that normalize the matrix A, describe the elimination equations, and
impose the complete pivoting conditions. Gould's package found many local
maxima, and many different starting values had to be tried in order to lo-
cate the matrix for which ρ_13^c > 13. In an earlier attempt at maximizing the
growth factor, Day and Peterson [271, 1988] used a problem formulation in
which the variables are the n^2 elements of A, which makes the constraints and
objective function substantially more nonlinear than in Gould's formulation.
Using the package NPSOL [444, 1986], they obtained "largest known" growth
factors for 5 ≤ n ≤ 7.
Theoretical progress into understanding the behaviour of the growth fac-
tor for complete pivoting has been made by Day and Peterson [271, 1988],
Puschmann and Cortés [849, 1983], Puschmann and Nordio [850, 1985], and
Edelman and Mascarenhas [345, 1995].
A novel alternative to partial pivoting for stabilizing GE is proposed by
Stewart [942, 1974]. The idea is to modify the pivot element to make it suit-
ably large, and undo this rank one change later using the Sherman-Morrison
formula. Stewart gives error analysis that bounds the backward error for this
modified form of GE.
Theorem 9.8 is proved for matrices diagonally dominant by columns by
Wilkinson [1085, 1961, pp. 288-289]. Theorem 9.9 is proved in the same paper.
That ρ_n ≤ 2 for matrices diagonally dominant by rows does not appear to
be well known, but it is proved by Wendroff [1074, 1966, pp. 122-123], for
example.
    The results in §9.5 for tridiagonal matrices are taken from Higham [541,
1990]. Another method for solving tridiagonal systems is cyclic reduction,
which was developed in the 1960s [171, 1970]. Error analysis given by Amodio
and Mazzia [15, 1994] shows that cyclic reduction is normwise backward stable
for diagonally dominant tridiagonal matrices.
The chapter “Scaling Equations and Unknowns” of Forsythe and Moler
[396, 1967] is a perceptive, easy to understand treatment that is still well worth
reading. Early efforts at matrix scaling for GE were directed to equilibrating
either just the rows or the rows and columns simultaneously (so that all the
rows and columns have similar norms). An algorithm with the latter aim
is described by Curtis and Reid [259, 1972]. Other important references on
198 LU FACTORIZATION AND LINEAR E QUATIONS

scaling are the papers by van der Sluis [1040, 1970] and Stewart [945, 1977],
which employ normwise analysis, and those by Skeel [919, 1979], [921, 1981],
which use componentwise analysis.
Much is known about the existence and stability of LU factorizations of
M-matrices and related matrices. A is an H-matrix if the comparison matrix
M(A) (defined in (8.6)) is an M-matrix. Funderlic, Neumann, and Plem-
mons [410, 1982] prove the existence of an LU factorization for an H-matrix
A that is generalized diagonally dominant, that is, DA is diagonally dom-
inant by columns for some nonsingular diagonal matrix D; they show that
the growth factor satisfies ρ_n ≤ 2 max_i |d_ii| / min_i |d_ii|. Neumann and Plem-
mons [791, 1984] obtain a growth factor bound for an inverse of an H-matrix.
Ahac, Buoni, and Olesky [7, 1988] describe a novel column-pivoting scheme
for which the growth factor can be bounded by n when A is an H-matrix.
The normwise bounds in Theorem 9.14 are due to Barrlund [71, 1991]
and the componentwise ones to Sun [972, 1992]. Similar bounds are given
by Stewart [951, 1993] and Sun [973, 1992]. Barrlund [72, 1992] describes a
general technique for deriving matrix perturbation bounds using integrals.
Interval arithmetic techniques (see §24.4) are worth considering if high ac-
curacy or guaranteed accuracy is required when solving a linear system. We
mention just one paper, that by Demmel and Krückeberg [297, 1985], which
provides a very readable introduction to the subject and contains further ref-
erences.
For several years Edelman has been collecting information on the solution
of large, dense linear algebra problems. His papers [337, 1991], [341, 1993],
[342, 1994] present statistics and details of the applications in which large
dense problems arise. Edelman also discusses relevant issues such as what
users expect of the computed solutions and how best to make use of parallel
computers. Table 9.2 contains “world records” for linear systems from Edel-
man’s surveys. For all the records shown the matrix was complex and the
system was solved in double precision arithmetic by some version of LU fac-
torization. Most of the very large systems currently being solved come from
the solution of boundary integral equations, a major application being the
analysis of radar cross sections; the resulting systems have coefficient matri-
ces that are complex symmetric (but not Hermitian). A recent reference is
Wang [1064, 1991].

9.10.1. LAPACK
Driver routines xGESV (simple) and xGESVX (expert) use LU factorization with
partial pivoting to solve a general system of linear equations with multiple
right-hand sides. The expert driver incorporates iterative refinement, condi-
tion estimation, and backward and forward error estimation and has an option
to scale the system AX = B to (D_R A D_C)(D_C^{−1} X) = D_R B before solution,

Table 9.2. Records for largest dense linear systems solved (dimension n).

Year n Computer Time


1991 55,296 Connection Machine CM-2 4.4 days
1992/3 75,264 Intel iPSC/860 2 2/3 days
1994 76,800 Connection Machine CM-5 4.1 days
1995 128,600 Intel Paragon 1 hour

where D_R = diag(r_i), r_i = (max_j |a_ij|)^{−1}, and D_C = diag(c_j), c_j = (max_i r_i|a_ij|)^{−1};
the scaling is done by the routine xGEEQU. The LU factorization is computed by
the routine xGETRF, which uses a partitioned outer product algorithm. The
expert driver also returns the quantity ||A||/||Û||, where ||A|| := max_{i,j} |a_ij|,
which is an estimate of the reciprocal of the growth factor ρ_n. A value
much less than 1 signals that the condition estimate and forward error bound
could be unreliable.
For band matrices, the driver routines are xGBSV and xGBSVX, and for
tridiagonal matrices, xGTSV and xGTSVX; again, these use LU factorization
with partial pivoting.

Problems
9.1. (Completion of proof of Theorem 9.1.) Show that if a singular matrix
A ∈ R^{n×n} has a unique LU factorization then A_k is nonsingular for k =
1:n − 1.
9.2. Define A(σ) = A − σI, where A ∈ C^{n×n} and σ ∈ C. For how many
values of σ, at most, does A(σ) fail to have an LU factorization without
pivoting?
9.3. Show that A ∈ C^{n×n} has a unique LU factorization if 0 does not belong
to the field of values of A.
9.4. State analogues of Theorems 9.3 and 9.4 for LU factorization with row
and column interchanges: PAQ = LU.
9.5. Give a 2 × 2 matrix A having an LU factorization A = LU such that
|L||U| ≤ c|A| does not hold for any c, yet κ_∞(A) is of order 1.
9.6. Show that if A ∈ R^{n×n} is nonsingular and totally nonnegative it has an
LU factorization A = LU with L > 0 and U > 0. (Hint: use the inequality

which holds for any totally nonnegative A [414, 1959, p. 100].) Deduce that
the growth factor for GE without pivoting is p_n = 1.

9.7. Show that if A ∈ R^{n×n} is nonsingular and its inverse is totally nonnega-


tive then it has an LU factorization A = LU with |A| = |L||U| . (Use the fact
that if C is totally nonnegative and nonsingular then JC^{-1}J is totally non-
negative, where J = diag((−1)^{i+1}) (this can be proved using determinantal
identities; see [21, 1987, Thm. 3.3]).)
9.8. Show that Theorem 9.5 is valid for GE without pivoting, with a different
constant.
9.9. Suppose that GE without pivoting is applied to a linear system Ax = b ,
where A ∈ R^{n×n} is nonsingular, and that all operations are performed exactly
except for the division determining a single multiplier lij (where i > j and
A = LU), which is computed with relative error ε: l̂_ij = l_ij(1 + ε). Evaluate
the difference x − x̂ between the exact and computed solutions. (The answer
allows us to see clearly the effect of a computational blunder, which could, for
example, be the result of the malfunction of a computer's divide operation.)
9.10. Show that θ in Theorem 9.7 satisfies

Hence, for g(n) defined in (9.14) and S_n in (9.11), deduce a larger lower bound
than g (2n) >
9.11. Explain the errors in the following criticism of GE with complete piv-
oting.
Gaussian elimination with complete pivoting maximizes the pivot
at each stage of the elimination. Since the product of the pivots is
the determinant (up to sign), which is fixed, making early pivots
large forces later ones to be small. These small pivots will have large
relative errors due to the accumulation of rounding errors during the
algorithm, and dividing by them therefore introduces larger errors.

9.12. In sparse matrix computations the choice of pivot in GE has to be


made with the aim of preserving sparsity as well as maintaining stability. In
threshold pivoting, a pivot element is chosen from among those elements in
column k that satisfy |a_ik^{(k)}| ≥ τ max_{i≥k} |a_ik^{(k)}|, where τ ∈ (0, 1] is a parameter
(see, for example, Duff, Erisman, and Reid [325, 1986, §5.4]). Show that for
threshold pivoting

where µ j is the number of nonzero entries in the jth column of U. Hence


obtain a bound for p n .

9.13. (RESEARCH PROBLEM) Obtain sharp bounds for the growth factor for
GE with partial pivoting applied to (a) a matrix with lower bandwidth p
and upper bandwidth q (thus generalizing Theorem 9.10), and (b) a quasi-
tridiagonal matrix (an n × n matrix that is tridiagonal except for nonzero
(1, n) and (n, 1) elements).
9.14. (RESEARCH PROBLEM) Explain why the growth factor for GE with
partial pivoting is almost always small in practice.
9.15. (RESEARCH PROBLEM) For GE with complete pivoting what is the
value of lim_{n→∞} g(n)/n (see (9.14))? Is g(n) equal to n for Hadamard matrices?

Chapter 10
Cholesky Factorization

The matrix of that equation system is negative definite-which is a


positive definite system that has been multiplied through by – 1.
For all practical geometries the common finite difference
Laplacian operator gives rise to these,
the best of all possible matrices.
Just about any standard solution method will succeed,
and many theorems are available for your pleasure.
—FORMAN S. ACTON, Numerical Methods That Work (1970)

Many years ago we made out of half a dozen transformers


a simple and rather inaccurate machine for
solving simultaneous equations—the solutions being
represented as flux in the cores of the transformers.
During the course of our experiments we
set the machine to solve the equations—
X + Y + Z = 1
X + Y + Z = 2
X + Y + Z = 3
The machine reacted sharply—it blew the main fuse and put all the lights out.
—B. V. BOWDEN, The Organization of a Typical Machine (1953)

There does seem to be some misunderstanding about the


purpose of an a priori backward error analysis.
All too often, too much attention is paid
to the precise error bound that has been established.
The main purpose of such an analysis is either to
establish the essential numerical stability of an algorithm or to
show why it is unstable and in doing so to
expose what sort of change is necessary to make it stable.
The precise error bound is not of great importance.
—J. H. WILKINSON, Numerical Linear Algebra on Digital Computers (1974)


10.1. Symmetric Positive Definite Matrices


Symmetric positive definiteness is one of the highest accolades to which a
matrix can aspire. Symmetry confers major advantages and simplifications
in the eigenproblem and, as we will see in this chapter, positive definiteness
permits economy and numerical stability in the solution of linear systems.
A symmetric matrix A ∈ R^{n×n} is positive definite if x^T Ax > 0 for all
nonzero x ∈ R^n. Well-known equivalent conditions to A = A^T being positive
definite are

• det(A k) > 0, k = 1:n, where Ak = A(1:k, 1:k) is the leading principal


submatrix of order k.

• λ_k(A) > 0, k = 1:n, where λ_k(A) denotes the kth largest eigenvalue.

The first of these conditions implies that A has an LU factorization, A = LU


(see Theorem 9.1). Another characterization of positive definiteness is that the
pivots in LU factorization are positive, since u kk = det(A k)/det(A k–1). By
factoring out the diagonal of U and taking its square root, the LU factorization
can be converted into a Cholesky factorization: A = RTR, where R is upper
triangular with positive diagonal elements. This factorization is so important
that it merits a direct proof.

Theorem 10.1. If A ∈ R^{n×n} is symmetric positive definite then there is a
unique upper triangular R ∈ R^{n×n} with positive diagonal elements such that
A = R^T R.

Proof. The proof is by induction. The result is clearly true for n = 1.
Assume it is true for n − 1. The leading principal submatrix A_{n−1} =
A(1:n−1, 1:n−1) is positive definite, so it has a unique Cholesky factorization
A_{n−1} = R_{n−1}^T R_{n−1}. We have a factorization

    A = [ A_{n−1}  c ] = [ R_{n−1}  r ]^T [ R_{n−1}  r ]            (10.1)
        [  c^T     a ]   [   0      β ]   [   0      β ]
if
    R_{n−1}^T r = c,                                                 (10.2)
    r^T r + β^2 = a.                                                 (10.3)

Equation (10.2) has a unique solution since Rn-1 is nonsingular. Then (10.3)
gives β 2 = a – rTr. It remains to check that there is a unique real, positive β
satisfying this equation. From the equation

0 < det(A) = det(R^T) det(R) = det(R_{n−1})^2 β^2



we see that β 2 >0, hence there is a unique β > 0.


The proof of the theorem is constructive, and provides a way to compute
the Cholesky factorization that builds R a column at a time. Alternatively,
we can work directly from the equations

    a_ij = Σ_{k=1}^{min(i,j)} r_ki r_kj,

which follow by equating (i, j) elements in A = R^T R. By solving these equa-


tions in the order (1,1), (1,2), (2,2), (1,3), (2,3), (3,3), . . . , (n,n), we obtain
the following algorithm.

Algorithm 10.2. Given a symmetric positive definite A ∈ R^{n×n} this algo-
rithm computes the Cholesky factorization A = R^T R.

for j = 1:n
    for i = 1:j−1
        r_ij = (a_ij − Σ_{k=1}^{i−1} r_ki r_kj)/r_ii
    end
    r_jj = (a_jj − Σ_{k=1}^{j−1} r_kj^2)^{1/2}
end
Cost: n 3/3 flops (half the cost of LU factorization).
As for Gaussian elimination (GE), there are different algorithmic forms of
Cholesky factorization. Algorithm 10.2 is the jik or “sdot” form. We describe
the kij, outer product form in §10.3.
Given the Cholesky factorization A = RTR, a linear system Ax = b can
be solved via the two triangular systems RTy = b and Rx = y.
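
As an illustration, here is a minimal Python/NumPy sketch of Algorithm 10.2 together with the accompanying pair of triangular solves. It is a direct transcription for experimentation, not an optimized or robust implementation, and the function names are ours.

    import numpy as np
    from scipy.linalg import solve_triangular

    def cholesky_jik(A):
        # Algorithm 10.2 (jik, or "sdot", form): A = R^T R with R upper
        # triangular and positive diagonal.  A is assumed symmetric
        # positive definite; no checks are made.
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        R = np.zeros((n, n))
        for j in range(n):
            for i in range(j):
                R[i, j] = (A[i, j] - R[:i, i] @ R[:i, j]) / R[i, i]
            R[j, j] = np.sqrt(A[j, j] - R[:j, j] @ R[:j, j])
        return R

    def cholesky_solve(R, b):
        # Solve Ax = b given A = R^T R via R^T y = b, then R x = y.
        y = solve_triangular(R, b, trans='T', lower=False)
        return solve_triangular(R, y, lower=False)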
If we define D = diag(r_ii)^2 then the Cholesky factorization A = R^T R
can be rewritten as A = LDL^T, where L = R^T diag(r_ii)^{−1} is unit lower

triangular. The LDLT factorization is sometimes preferred over the Cholesky


factorization because it avoids the need to compute the n square roots that
determine the rii . The LDLT factorization is certainly preferred for solving
tridiagonal systems, as it requires n fewer divisions than Cholesky factorization
in the substitution stage. All the results for Cholesky factorization in this
chapter have analogues for the LDLT factorization. Block LDLT factorization
for indefinite matrices is discussed in §10.4.
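
For concreteness, the relationship between the two factorizations can be written down in a few lines. The sketch below uses NumPy's built-in Cholesky routine rather than Algorithm 10.2, and the function name is ours.

    import numpy as np

    def ldlt_from_cholesky(A):
        # Sketch: convert the Cholesky factorization A = R^T R into
        # A = L D L^T via L = R^T diag(r_ii)^{-1}, D = diag(r_ii^2).
        R = np.linalg.cholesky(A).T      # NumPy returns the lower factor
        d = np.diag(R)
        L = R.T / d                      # unit lower triangular
        D = np.diag(d ** 2)
        return L, D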

10.1.1. Error Analysis


Error bounds for Cholesky factorization are derived in a similar way to those
for LU factorization. Consider Algorithm 10.2. Using Lemma 8.4 we have

(10.4)

From a variation of Lemma 8.4 in which the division is replaced by a square


root (see Problem 10.3), we have

A backward error result is immediate.

Theorem 10.3. If Cholesky factorization applied to the symmetric positive


definite matrix A ∈ R^{n×n} runs to completion then the computed factor R̂
satisfies

    R̂^T R̂ = A + ΔA,    |ΔA| ≤ γ_{n+1} |R̂^T||R̂|.                    (10.5)

Theorem 10.4. Let A ∈ R^{n×n} be symmetric positive definite and suppose
Cholesky factorization produces a computed factor R̂ and a computed solution
x̂ to Ax = b. Then

    (A + ΔA)x̂ = b,    |ΔA| ≤ γ_{3n+1} |R̂^T||R̂|.                    (10.6)

Proof. The proof is analogous to the proof of Theorem 9.4.


These results imply that Cholesky factorization enjoys perfect normwise
backward stability. The key inequality is

whose analogue for the computed is, from (10.5),

Thus (10.6) implies


(10.7)

where for the last inequality we have assumed that nγ n +1 < 1/2. Another
indicator of stability is that the growth factor for GE is exactly 1 (see Prob-
lem 10.4). It is important to realize that the multipliers can be arbitrarily
large (consider, for example, as θ 0). But, remarkably, for a positive
definite matrix the size of the multipliers has no effect on stability.
Note that the perturbation ∆A in (10.6) is not symmetric, in general,
because the backward error matrices for the triangular solves with R and
RT are not the transposes of each other. For conditions guaranteeing that a
“small” symmetric ∆A can be found, see Problem 7.11.
The following rewritten version of Theorem 10.3 provides further insight
into Cholesky factorization.

Theorem 10.5 (Demmel). If Cholesky factorization applied to the symmet-


ric positive definite matrix A ∈ R^{n×n} runs to completion then the computed
factor R̂ satisfies

    R̂^T R̂ = A + ΔA,    |ΔA| ≤ (1 − γ_{n+1})^{−1} γ_{n+1} dd^T,

where d_i = a_ii^{1/2}.

Proof. Theorem 10.3 shows that = A+∆A with |∆ A| < γn + 1


Denoting by the ith column of we have

so that Then, using the Cauchy–Schwarz inequality,

giving
(10.8)
and the required bound for ∆A.
Standard perturbation theory applied to Theorem 10.4 yields a bound
of the form However, with the aid of
Theorem 10.5 we can obtain a potentially much smaller bound. The idea is
to write A = DHD where D = diag(A )1/2, so that H has unit diagonal. van
der Sluis’s result (Corollary 7.6) shows that

(10.9)

so D is nearly a condition-minimizing diagonal scaling. It follows that κ 2 (H) <


nκ 2 (A) and that κ 2 (H) << κ 2 (A) is possible if A is badly scaled. Note that
1 ≤ ||H||_2 ≤ n, since H is positive definite with h_ii ≡ 1.
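
The effect of the scaling is easy to observe numerically. The following sketch builds an arbitrarily badly scaled positive definite matrix (our own example, chosen only for illustration) and compares κ_2(A) with κ_2(H):

    import numpy as np

    n = 6
    rng = np.random.default_rng(1)
    B = rng.standard_normal((n, n))
    A0 = B @ B.T + n * np.eye(n)          # modestly conditioned SPD matrix
    S = np.diag(10.0 ** np.arange(n))     # widely varying diagonal scaling
    A = S @ A0 @ S                        # badly scaled, still SPD

    Dinv = np.diag(1.0 / np.sqrt(np.diag(A)))
    H = Dinv @ A @ Dinv                   # A = DHD, H has unit diagonal

    print("kappa_2(A) =", np.linalg.cond(A))   # huge: artificial ill conditioning
    print("kappa_2(H) =", np.linalg.cond(H))   # modest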

Theorem 10.6 (Demmel, Wilkinson). Let A = DHD ∈ R^{n×n} be symmetric
positive definite, where D = diag(A)^{1/2}, and suppose Cholesky factorization
successfully produces a computed solution x̂ to Ax = b. Then the scaled error
D(x − x̂) satisfies
(10.10)

where = 2n(1 - γ n + l ) – 1 γ n + 1 .
Proof. Straightforward analysis shows that (cf. the proof of Theorem 9.4)
(A + ∆A) = b, where

with |∆A 1 | < (1–γn + 1 ) – 1 γ n+ 1ddT (by Theorem 10.5) and |∆ 1| < diag( γi )
|∆2| < diag( γn - i+ 1 ) Scaling with D, we have

and standard perturbation theory gives

- 1 T -1
But, using (10.8) and ||D dd D || 2 = ||eeT||2 = n, we have

using Lemma 3.3, which yields the result.


Care needs to be exercised when interpreting bounds that involve scaled
quantities, but in this case the interpretation is relatively easy. Suppose that
H is well conditioned and κ 2 (D) is large, which represents the artificial ill
conditioning that the DHD scaling is designed to clarify. The vector Dx =
H - 1D – 1b is likely to have components that do not vary much in magnitude.
Theorem 10.6 then guarantees that we obtain the components of Dx to good
relative accuracy and this means that the components of x (which will vary
greatly in magnitude) are obtained to good relative accuracy.
So far, our results have all contained the proviso that Cholesky factoriza-
tion runs to completion—in other words, the results assume that the argument
of the square root is always positive. Wilkinson [1092, 1968] showed that suc-
cess is guaranteed if 20n3 / 2 κ 2 (A )u < 1, that is, if A is not too ill conditioned.
It would be nice to replace A in this condition by H , where A = DHD. Justi-
fication for doing so is that Algorithm 10.2 is scale invariant, in the sense that
if we scale A FAF, where F is diagonal, then R scales to RF; moreover, if
F comprises powers of the machine base, then even the rounding errors scale
according to F. The following theorem gives an appropriate modification of
Wilkinson’s condition.

Theorem 10.7 (Demmel). Let A = DHD ∈ R^{n×n} be symmetric positive
definite, where D = diag(A)^{1/2}. If λ_min(H) > nγ_{n+1}/(1 − γ_{n+1}) then Cholesky
factorization applied to A succeeds (barring underflow and overflow) and pro-
duces a nonsingular R̂. If λ_min(H) < −nγ_{n+1}/(1 − γ_{n+1}) then the computation
is certain to fail.

Proof. The proof is by induction. Consider Algorithm 10.2. The first


stage obviously succeeds and gives r̂_11 > 0, since a_11 > 0. Suppose the
algorithm has successfully completed k – 1 stages, producing a nonsingular
R k-1, and consider equations (10.1)-(10.3) with n replaced by k. The kth
stage can be completed, but may give a pure imaginary (it will if fl(a –
< 0). However, in the latter event, the error analysis of Theorem 10.5
is still valid! Thus we obtain satisfying = Ak + ∆A k, |∆Ak| < (1 –
where Now, with D k = diag(d k),
we have

using the interlacing of the eigenvalues [470, 1989, Cor. 8.1.4] and the con-
dition of the theorem. Hence is positive definite, and
therefore so is the congruent matrix Ak + ∆A k, showing that must be real
and nonsingular, as required to complete the induction.
If Cholesky succeeds, then, by Theorem 10.5, D –1 (A + ∆A)D –1 is positive
definite and so 0 <
Hence if min(H) < -nγ n+1/(1– γ n +1) then the computation must fail.

Note that, since ||H||2 > 1, the condition for success of Cholesky factor-
ization can be written as κ 2 (H)nγn +1/(1– γ n + 1 ) < 1 .

10.2. Sensitivity of the Cholesky Factorization


The Cholesky factorization has perturbation bounds that are similar to those
for LU factorization, but of a simpler form thanks to the positive definiteness
(||A- 1 ||2 replaces ||U- 1||2 ||L - 1 ||2 in the normwise bounds).

Theorem 10.8 (Sun). Let A ∈ R^{n×n} be symmetric positive definite with the
Cholesky factorization A = R^T R and let ΔA be a symmetric matrix satisfying
||A^{−1}ΔA||_2 < 1. Then A + ΔA has the Cholesky factorization A + ΔA =
(R + ΔR)^T (R + ΔR), where

Moreover, if < 1, where = ( R + ∆R) - T ∆A(R + ∆R)-1, then

| ∆R| < triu

where triu(·) denotes the upper triangular part.

Note that the Cholesky factor of A_k = A(1:k, 1:k) is R_k, and κ_2(A_{k+1}) ≥
κ_2(A_k) by the interlacing property of the eigenvalues. Hence if A_{k+1} (and
hence A) is ill conditioned but Ak is well conditioned then Rk will be relatively
insensitive to perturbations in A but the remaining columns of R will be much
more sensitive.

10.3. Positive Semidefinite Matrices


If A is symmetric and positive semidefinite (x^T Ax ≥ 0 for all x) then a
Cholesky factorization exists, but the theory and computation are more subtle
than for positive definite A.
The questions of existence and uniqueness of a Cholesky factorization are
answered by the following result.

Theorem 10.9. Let A ∈ R^{n×n} be positive semidefinite of rank r. (a) There
exists at least one upper triangular R with nonnegative diagonal elements such
that A = R^T R. (b) There is a permutation Π such that Π^T AΠ has a unique
Cholesky factorization, which takes the form

    Π^T AΠ = R^T R,    R = [ R_11  R_12 ]
                           [  0     0   ],                          (10.11)

where R11 is r × r upper triangular with positive diagonal elements.

Proof. (a): Let the symmetric positive semidefinite square root X of A


have the QR factorization X = QR with r_ii ≥ 0. Then A = X^2 = X^T X =
R T Q T QR = R T R. (b): The algorithm with pivoting described below amounts
to a constructive proof.
Note that the factorization in part (a) is not in general unique. For exam-
ple,

For practical computations a factorization of the form (10.11) is needed,


because this factorization so conveniently displays the rank deficiency. Such
a factorization can be computed using an outer product Cholesky algorithm,
comprising r = rank(A) stages. At the kth stage, a rank-1 matrix is sub-
tracted from A so as to introduce zeros into positions k:n in the k th row and

column. Ignoring pivoting for the moment, at the start of the kth stage we
have

(10.12)

where = [0, . . . , 0, rii , . . ., rin ]. The reduction is carried one stage further
by computing

Overall we have,

To avoid breakdown when vanishes (or is negative because of rounding


errors), pivoting is incorporated into the algorithm as follows. At the start
of the kth stage an element > 0 (s > k) is selected as pivot, and rows
and columns k and s of Ak, and the kth and sth elements of ri , i = 1:k – 1,
are interchanged. The overall effect is to compute the decomposition (10.11),
where the permutation Π takes account of all the interchanges.
The standard form of pivoting is defined by

This is equivalent to complete pivoting in GE, since Ak is positive semidefinite


so its largest element lies on the diagonal. We note for later reference that
this pivoting strategy produces a matrix R that satisfies (cf. Problem 18.5)

    r_kk^2 ≥ Σ_{i=k}^{j} r_ij^2,    j = k+1:n,  k = 1:r.            (10.13)

It will be convenient to denote by cp(A) := ΠT AΠ the permuted matrix


obtained from the Cholesky algorithm with complete pivoting.
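
A sketch of the outer product algorithm with complete pivoting is given below. The termination test (largest remaining diagonal element at most a small multiple of the largest initial one) is one simple choice of ours, in the spirit of the criteria discussed later in §10.3.2, and the function name is also ours.

    import numpy as np

    def cholesky_complete_pivoting(A, tol=None):
        # Sketch: outer product Cholesky with complete (diagonal) pivoting
        # for a symmetric positive semidefinite A.  Returns R (k x n) and a
        # permutation p such that A[p][:, p] is approximately R^T R.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        if tol is None:
            tol = n * np.finfo(float).eps
        thresh = tol * np.max(np.diag(A))
        p = np.arange(n)
        R = np.zeros((n, n))
        for k in range(n):
            s = k + int(np.argmax(np.diag(A)[k:]))   # largest remaining diagonal
            if A[s, s] <= thresh:
                return R[:k], p                      # numerical rank reached
            # symmetric interchange of rows/columns k and s
            A[[k, s]] = A[[s, k]]; A[:, [k, s]] = A[:, [s, k]]
            R[:, [k, s]] = R[:, [s, k]]; p[[k, s]] = p[[s, k]]
            R[k, k] = np.sqrt(A[k, k])
            R[k, k+1:] = A[k, k+1:] / R[k, k]
            A[k+1:, k+1:] -= np.outer(R[k, k+1:], R[k, k+1:])
        return R, p

For a rank-deficient A the returned R has the trapezoidal form (10.11), with p recording the permutation Π.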

10.3.1. Perturbation Theory


In this section we analyse, for a positive semidefinite matrix, the effect on
the Cholesky factorization of perturbations in the matrix. This perturbation
theory will be used in the error analysis of the next section.

Throughout this section A is assumed to be an n × n positive semidefinite


matrix of rank r whose leading principal submatrix of order r is positive
definite. For k = 1:r we will write

(10.14)

and other matrices will be partitioned conformably.


We have the identity

(10.15)
where R_11 is the Cholesky factor of A_11, R_12 = R_11^{−T}A_12, and

is the Schur complement of A_11 in A. Note that S_r(A) ≡ 0, so that for k = r,


(10.15) is the (unique) Cholesky factorization of A. The following lemma
shows how S k(A) changes when A is perturbed.

Lemma 10.10. Let E be symmetric and assume A 11 + E 11 is nonsingular.


Then
(10.16)

where W = A_11^{−1}A_12.
Proof. We can expand

The result is obtained by substituting this expansion into Sk( A+E) = (A 22 +


E22) – (A 12 + E1 2 ) T (A11 + E1 1 ) – 1 ( A 12 + E12), and collecting terms.
Lemma 10.10 shows that the sensitivity of S k( A ) to perturbations in A
is governed by the matrix W = A_11^{−1}A_12. The question arises of whether,
for a given A, the potential magnification of E indicated by (10.16)
is attainable. For the no-pivoting strategy, Π = I, the answer is trivially
“yes”, since we can take E = with |γ| small, to obtain ||Sk (A+E) -
For complete pivoting, however, the answer
is complicated by the possibility that the sequence of pivots will be different for
A+ E than for A, in which case Lemma 10.10 is not applicable. Fortunately, a
mild assumption on A is enough to rule out this technical difficulty, for small
||E|| 2. In the next lemma we redefine A := cp(A ) in order to simplify the
notation.

Lemma 10.11. Let A := cp(A). Suppose that

( S i ( A ) ) 11 > (S i ( A ))j j, j = 2: n - i , i = 0: r- 1 (10.17)

(where S 0 (A) := A). Then, for sufficiently small ||E||2 , A+E = cp(A + E) .
For E = with |γ| sufficiently small,

Proof. Note that since A = cp(A ), (10.17) simply states that there
are no ties in the pivoting strategy (since (S i ( A ) ) 11 in (10.12)).
Lemma 10.10 shows that Si (A+E) = Si (A) + O(||E|| 2), and so, in view of
(10.17), for sufficiently small ||E||2 we have

This shows that A + E = c p(A+E). The last part then follows from
Lemma 10.10.
We now examine the quantity ||W||_2 = ||A_11^{−1}A_12||_2. We show first that
||W|| 2 can be bounded in terms of the square root of the condition number of
A1 1 .

Lemma 10.12. If A, partitioned as in (10.14), is symmetric positive semidefinite
and A_11 is positive definite then

    ||A_11^{−1}A_12||_2 ≤ ( ||A_11^{−1}||_2 ||A_22||_2 )^{1/2}.

Proof. Write
together with which follows
from the fact that the Schur complement is positive semidef-
inite.
Note that, by the arithmetic–geometric mean inequality (xy)^{1/2} ≤ (x + y)/2
(x, y ≥ 0), we also have, from Lemma 10.12, ||W||_2 ≤ ( ||A_11^{−1}||_2 +
||A_22||_2 )/2.
The inequality of Lemma 10.12 is attained for the positive semidefinite
matrix

where Ip,q is the p × q identity matrix. This example shows that ||W||2 can
be arbitrarily large. However, for A := cp(A ), ||W||2 can be bounded solely
in terms of n and k. The essence of the proof, in the next lemma, is that
large elements in are countered by small elements in A12. Hereafter we
set k = r, the value of interest in the following sections.

Lemma 10.13. Let A := cp(A) and set k = r. Then

    ||W||_2 ≤ ( (n − r)(4^r − 1)/3 )^{1/2}.                          (10.18)

There is a parametrized family of rank-r matrices A(θ ) = cp(A(θ )),


for which

Proof. The proof is a straightforward computation. The matrix A(θ) :=
R(θ)^T R(θ), where

(10.19)
with c = cosθ, s = sin θ. This is the r × n version of Kahan’s matrix (8.10). R
satisfies the inequalities (10.13) (as equalities) and so A(θ ) = cp(A(θ )).
We conclude this section with a “worst-case” example for the Cholesky fac-
torization with complete pivoting. Let U( θ) = diag(r, r–1, . . . . 1)R(θ), where
R(θ) is given by (10.19), and define the rank-r matrix C(θ) = U(θ) T U(θ).
Then C(θ) satisfies the conditions of Lemma 10.11. Also,

Thus, from Lemma 10.11, for E = with |γ| and θ sufficiently small,

This example can be interpreted as saying that in exact arithmetic the resid-
ual after an r-stage Cholesky factorization of a semidefinite matrix A can
overestimate the distance of A from the rank-r semidefinite matrices by a
factor as large as (n − r)(4^r − 1)/3.

10.3.2. Error Analysis


In this section we present a backward error analysis for the Cholesky factor-
ization of a positive semidefinite matrix. An important consideration is that

a matrix of floating point numbers is very unlikely to be “exactly” positive


semidefinite; errors in forming or storing A will almost certainly render the
smallest eigenvalue nonzero, and possibly negative. Therefore error analysis
for a rank r positive semidefinite matrix may appear, at first sight, to be of
limited applicability. One way around this difficulty is to state results for
Ã = A + ΔA, where Ã is the matrix stored on the computer, A is positive
semidefinite of rank r, and ΔA is a perturbation, which could represent the
rounding errors in storing A, for example. However, if the perturbation ΔA
is no larger than the backward error for the Cholesky factorization, then this
extra formalism can be avoided by thinking of ∆A as being included in the
backward error matrix. Hence for simplicity, we frame the error analysis for
a positive semidefinite A.
The analysis makes no assumptions about the pivoting strategy, so A
should be thought of as the pre-permuted matrix Π T AΠ.

Theorem 10.14. Let A be an n × n symmetric positive semidefinite matrix


of floating point numbers of rank r < n. Assume that A11 = A(1:r, 1:r) is
positive definite with

(10.20)

where A11 = D 11 H11 D 11 and D11 = diag(A 11)1/2. Then, in floating point
arithmetic, the Cholesky algorithm applied to A successfully completes r stages
(barring underflow and overflow), and the computed r × n Cholesky factor
satisfies

(10.21)

where W = A_11^{−1}A_12.

Proof. First, note that condition (10.20) guarantees successful completion


of the first r stages of the algorithm by Theorem 10.7.
Analysis very similar to that leading to Theorem 10.3 shows that

(10.22)

where

and
(10.23)

Taking norms in (10.23) and using the inequality


we obtain

which implies

(10.24)

Our aim is to obtain an a priori bound for the residual A − R̂_r^T R̂_r. It is clear
from (10.22)–(10.24) that to do this we have only to bound ||Â^{(r+1)}||_2. To this end,
we interpret (10.22) in such a way that the perturbation theory of §10.3.1 may
be applied.
Equation (10.22) shows that is the true Schur complement for the
matrix A + E and that is positive definite. Hence we can
use Lemma 10.10 to deduce that

Substituting from (10.24) we find that

Finally, using (10.22) and (10.24), we obtain

Theorem 10.14 is just about the best result that could have been expected,
because the bound (10.21) is essentially the same as the bound obtained on
taking norms in Lemma 10.10. In other words, (10.21) simply reflects the
inherent mathematical sensitivity of A–RTR to small perturbations in A.
We turn now to the issue of stability. Ideally, for A as defined in Theo-
rem 10.14, the computed Cholesky factor Rr produced after r stages of the
algorithm would satisfy

where cn is a modest constant. Theorem 10.14 shows that stability depends


on the size of ||W||_2 = ||A_11^{−1}A_12||_2 (to the extent that ||W||_2 appears in a
realistic bound for the backward error).
If no form of pivoting is used then ||W|| 2 can be arbitrarily large for fixed
n (see §10.3.1) and the Cholesky algorithm must in this case be classed as
unstable. But for complete pivoting we have from Lemma 10.13 the upper
bound ||W||_2 ≤ ( (n − r)(4^r − 1)/3 )^{1/2}. Thus the Cholesky algorithm with
complete pivoting is stable if r is small, but stability cannot be guaranteed,
and seems unlikely in practice, if ||W|| 2 (and hence, necessarily, r and n) is
large.
Numerical experiments show that ||W||2 is almost always small in practice
(typically less than 10) [540, 1990]. However, it is easy to construct examples
where ||W||2 is large. For example, if R is a Cholesky factor of A from complete
pivoting then let C = M(R ) T M( R), where M( R) is the comparison matrix;
C will usually have a much larger value of ||W|| 2 than A.
An important practical issue is when to terminate the Cholesky factoriza-
tion of a semidefinite matrix. The LINPACK routine xCHDC proceeds with
the factorization until a nonpositive pivot is encountered, that is, up to and
including stage k – 1, where k is the smallest integer for which

(10.25)
Usually k > r + 1, due to the effect of rounding errors.
A more sophisticated termination criterion is to stop as soon as

(10.26)
for some readily computed norm and a suitable tolerance ε. This criterion
terminates as soon as a stable factorization is achieved, avoiding unnecessary
work in eliminating negligible elements in the computed Schur complement
Note that is indeed a reliable order-of-magnitude estimate of the
true residual, since is the only nonzero block of Â(k) and, by (10.22) and
(10.24), with ||E|| = O(u)(||A|| + ||Â(k)||).
Another possible stopping criterion is
(10.27)

This is related to (10.26) in that if A (pre-permuted) and Âk are positive


semidefinite then = maxi,j |aij| ||A|| 2, and similarly max
Note that (10.27) bounds since if (10.27) holds first at the
kth stage then, using Theorem 8.13,

Practical experience shows that the criteria (10.26) and (10.27) with ε = nu
both work well, and much more effectively than (10.25) [540, 1990]. We
favour (10.27) because of its negligible cost.

10.4. Symmetric Indefinite Matrices and the Diagonal Pivoting Method
Let A ∈ R^{n×n} be symmetric but indefinite, that is, (x^T Ax)(y^T Ay) < 0 for
some x and y. How can we solve Ax = b efficiently?
Gaussian elimination with partial pivoting (GEPP) can be used to com-
pute the factorization PA = LU, but it does not take advantage of the symme-
try to reduce the cost and storage. We might try to construct a factorization
A = L D L T, where L is unit lower triangular and D is diagonal. But this
factorization may not exist, even if symmetric pivoting is allowed, and if it
does exist its computation may be unstable. For example, consider

There is arbitrarily large element growth for 0 < ε << 1, and the factorization
does not exist for ε = 0.
The most popular approach for solving symmetric indefinite systems is to
use a block LDLT factorization

P A P T = L D L T,

where L is unit lower triangular and D is block diagonal with 1 × 1 or 2 × 2


diagonal blocks. This factorization is essentially a symmetric block form of
GE, with pivoting. Note that by Sylvester’s inertia theorem, A and D have
the same inertia13, which is easily determined from D (see Problem 10.11).
To begin the computation of the factorization we choose a permutation Π
and an integer s = 1 or 2 so that

    ΠAΠ^T = [ E   C^T ]
            [ C    B  ],    E ∈ R^{s×s},

with E nonsingular. Then we compute the factorization

13. The inertia of a symmetric matrix is an ordered triple {i_+, i_−, i_0}, where i_+ is the
number of positive eigenvalues, i_− the number of negative eigenvalues, and i_0 the number
of zero eigenvalues.

This process is repeated on the (n – s) × (n – s) Schur complement

à = B – CE- 1C T .

The cost of the method is n 3/3 flops (the same as the cost of Cholesky fac-
torization of a positive definite matrix) plus the cost of determining the per-
mutations Π. This method for computing the block LDL^T factorization is
called the diagonal pivoting method. It can be thought of as a generalization
of Lagrange’s method for reducing a quadratic form to diagonal form (devised
by Lagrange in 1759 and rediscovered by Gauss in 1823) [763, 1961, p. 371].
One conceivable difficulty with the diagonal pivoting method can be dis-
posed of immediately. If a nonsingular pivot matrix E of dimension 1 or 2
cannot be found, then all 1 × 1 and 2 × 2 principal submatrices of the sym-
metric matrix A are singular, and this is easily seen to imply that A is the
zero matrix.
The strategy for choosing Π is crucial for achieving stability. A suitable
modification of the error analysis for block LU factorization (Theorem 12.4)
tells us that, provided linear systems involving 2 x 2 pivots are solved in a
normwise backward-stable way, the condition ||L|| ||D|| ||LT|| < cn||A||, for
a modest constant c n , is sufficient to ensure stability. A key requirement,
therefore, is to choose the pivot E so that the Schur complement à is suitably
bounded, since D is made up of elements of Schur complements. We describe
two suitable pivoting strategies.

10.4.1. Complete Pivoting


Bunch and Parlett [166, 1971] devised the following strategy for choosing Π.
It suffices to describe the interchanges for the first stage of the factorization.

Let µ_0 = max_{i,j} |a_ij|, µ_1 = max_i |a_ii|, and choose α ∈ (0, 1).


if µ_1 ≥ αµ_0
Set s = 1, and choose Π so that |e 11| = µ 1 .
else
Set s = 2, and choose Π so that |e 21| = µ 0 .
end

Note that µ 1 is the best 1 × 1 pivot under symmetric permutations and


µ 0 is the pivot that would be chosen by GE with complete pivoting. This
strategy says “as long as there is a diagonal pivot element not much smaller
than the complete pivot, choose it as a 1 × 1 pivot”, that is, “choose a 1 × 1
pivot whenever possible”. If the strategy dictates the use of a 2 × 2 pivot then
that pivot E is indefinite (see Problem 10.11).
It remains to determine α. This is done by minimizing a bound on the
element growth. For the following analysis we assume that the interchanges

have already been done. If s = 1 then

Now consider the case s = 2. The (i, j) element of the Schur complement
à = B – CE - 1 CT is

(10.28)

Now

and, using the symmetry of E,

Since α ∈ (0, 1), we have |det(E)| > (1 − α²)µ_0². Thus

Since |cij| < µ 0, we obtain from (10.28)

To determine α we equate the maximum growth for two s = 1 steps with


that for one s = 2 step:

which reduces to the quadratic equation 4α² − α − 1 = 0. We require the
positive root

    α = (1 + √17)/8 ≈ 0.64.
The analysis guarantees a growth factor bound of (1 + α^{−1})^{n−1} ≈ (2.57)^{n−1}.


This bound is pessimistic, however; a much more detailed analysis by Bunch
[158, 1971] shows that the growth factor is no more than 3.07(n − 1)^{0.446} times
larger than the bound (9.13) for LU factorization with complete pivoting—a
very satisfactory result. Strictly speaking, bounding the growth factor bounds
only ||D||, not ||L||. But it is easy to show that for s = 1 and 2 no element of
CE^{−1} exceeds max{1/α, 1/(1 − α)} in absolute value, and so ||L|| is bounded
independently of A.
Since complete pivoting requires the whole active submatrix to be searched
at each stage, it needs up to n3/6 comparisons, and so the method is rather
expensive.
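
As a small illustration of the test, the following sketch determines the pivot for the first stage only; the function name is ours, and a full factorization would repeat this choice on successive Schur complements.

    import numpy as np

    ALPHA = (1 + np.sqrt(17)) / 8      # positive root of 4*alpha^2 - alpha - 1 = 0

    def bunch_parlett_pivot(A):
        # Sketch: choose the first pivot for the diagonal pivoting method
        # with complete pivoting.  Returns (s, indices): the pivot size and
        # the row/column indices to bring to the leading position.
        absA = np.abs(A)
        i0, j0 = np.unravel_index(np.argmax(absA), absA.shape)   # mu_0 = |a_{i0,j0}|
        mu0 = absA[i0, j0]
        k = int(np.argmax(np.diag(absA)))                        # mu_1 = |a_kk|
        mu1 = absA[k, k]
        if mu1 >= ALPHA * mu0:
            return 1, [k]                          # 1x1 pivot a_kk
        return 2, sorted([i0, j0])                 # 2x2 pivot containing mu_0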

10.4.2. Partial Pivoting

Bunch and Kaufman [164, 1977] devised a pivoting strategy for the diago-
nal pivoting method that requires only O( n 2) comparisons. At each stage
it searches at most two columns and so is analogous to partial pivoting for
LU factorization. The strategy contains several logical tests. As before, we
describe the pivoting for the first stage only. Recall that s denotes the size of
the pivot block.

Choose α ∈ (0, 1).
ω_1 := max{ |a_i1| : i ≥ 2 }
If ω_1 = 0 there is nothing to do on this stage of the elimination.
r := min{ i ≥ 2 : |a_i1| = ω_1 }
if |a_11| ≥ αω_1
    (1) s = 1, Π = I
else
    σ := max{ |a_ir| : i ≠ r }
    if |a_11|σ ≥ αω_1²
        (2) s = 1, Π = I
    else if |a_rr| ≥ ασ
        (3) s = 1 and choose Π to swap rows and columns 1 and r.
    else
        (4) s = 2 and choose Π to swap rows and columns 2 and r,
            so that |(ΠAΠ^T)_21| = ω_1.
    end
end

To understand the algorithm it helps to consider the 2 × 2 matrix

    [ a_11  ω_1  ]
    [ ω_1   a_rr ]

and to note that the pivot is one of a_11, a_rr, and this matrix (or, rather, since
ω_1 = |a_r1|, this matrix with ω_1 replaced by a_r1).
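
Restated as code, the choice for the first stage looks as follows. This is a sketch only: the function name is ours, and α = (1 + √17)/8 is the value derived below.

    import numpy as np

    def bunch_kaufman_pivot(A, alpha=(1 + 17 ** 0.5) / 8):
        # Sketch: pivot choice for the first stage of the diagonal pivoting
        # method with partial pivoting.  Returns (s, swap): the pivot size
        # and the symmetric interchange to apply (None means no interchange).
        a = np.abs(np.asarray(A, dtype=float))
        n = a.shape[0]
        if n == 1 or a[1:, 0].max() == 0.0:
            return 1, None                         # nothing to eliminate
        omega1 = a[1:, 0].max()
        r = 1 + int(np.argmax(a[1:, 0]))           # row of largest |a_{i1}|
        if a[0, 0] >= alpha * omega1:
            return 1, None                         # case (1)
        sigma = a[r, :r].max()                     # largest off-diagonal in row/col r
        if r + 1 < n:
            sigma = max(sigma, a[r+1:, r].max())
        if a[0, 0] * sigma >= alpha * omega1 ** 2:
            return 1, None                         # case (2)
        if a[r, r] >= alpha * sigma:
            return 1, (0, r)                       # case (3): swap 1 and r
        return 2, (1, r)                           # case (4): 2x2 pivot, swap 2 and r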

To bound the element growth reconsider each case in turn, noting that
for cases (1) and (2) the elements of the Schur complement are given by14

Case (1):

Case (2): Using symmetry,

Case (3): The original a_rr is now the pivot, and |a_rr| ≥ ασ, so

Case (4): This is where we use a 2 × 2 pivot, which, after the interchanges,
is E = (Π^T AΠ)(1:2, 1:2). Now

The elements of the Schur complement Ã = B − CE^{−1}C^T are given by

so

This analysis shows that the bounds for the element growth for s = 1 and
s = 2 are the same as the respective bounds for the complete pivoting strategy.
Hence, using the same reasoning, we again choose α = (1 + √17)/8.
The growth factor for the partial pivoting strategy is bounded by (2.57)^{n−1}.
As for GEPP, large growth factors do not seem to occur in practice. But un-
like for GEPP, no example is known for which the bound is attained [164,
1977]; see Problem 10.18.
14. We commit a minor abuse of notation, in that in the rest of this section ã_ij should
really be ã_{i−1,j−1} (s = 1) or ã_{i−2,j−2} (s = 2).

As noted in the previous subsection, a bound on the growth factor p n does


not in itself ensure stability. Indeed although ||D||/||A|| is bounded for partial
pivoting, ||L||/||A|| can be arbitrarily large for fixed n; see Problem 10.15.
Higham [559, 1995] gives a detailed error analysis of the diagonal pivoting
method with an arbitrary pivoting strategy, under the assumption that com-
puted solutions to linear systems involving 2 × 2 pivots have a small com-
ponentwise relative backward error. The conclusions are that the computed
factors satisfy

and the computed solution to a linear system Ax = b satisfies

where p1 and p 2 are linear polynomials. For the partial pivoting strategy,
Higham shows that if linear systems involving 2 × 2 pivots are solved by GEPP
or by use of the explicit inverse, then the computed solutions do indeed have
a small componentwise relative backward error, and that, moreover,

|| |L||D||L^T| ||_M ≤ 36n p_n ||A||_M,

where ||A||M = max i,j |aij|. Thus the diagonal pivoting method with partial
pivoting is stable if the growth factor is small.

10.5. Nonsymmetric Positive Definite Matrices


The notion of positive definiteness can be extended to nonsymmetric matrices.
A nonsymmetric matrix A ∈ R^{n×n} is positive definite if x^T Ax > 0 for all
x ≠ 0. This is equivalent to the condition that the symmetric part A_S of
A is positive definite, where A = A_S + A_K with A_S = (A + A^T)/2 and
A_K = (A − A^T)/2. A positive definite matrix clearly has nonsingular leading
principal submatrices, and so has an LU factorization, A = LU. It can even
be shown that the pivots u_ii are positive. However, there is no guarantee that the
factorization is stable without pivoting, as the example shows. The
standard error analysis for LU factorization applies (Theorems 9.3 and 9.4),
and so the question is whether can be suitably bounded. Golub and
Van Loan [469, 1979] show that, for the exact LU factors,

(10.29)

Let which is just κ 2 (A) when A is sym-


metric. Mathias [731, 1992] shows that || |L||U| ||F (involving now the com-
puted LU factors) is at most a factor 1 + 30un3 / 2 X(A) times larger than the

upper bound in (10.29), and that the LU factorization (without pivoting)


succeeds if 24n3 / 2 X (A )u < 1.
These results show that it is safe not to pivot provided that the symmetric
part of A is not too ill conditioned relative to the norm of the skew-symmetric
part. If A is symmetric (AK = 0) then we recover the results for symmetric
positive definite matrices.

10.6. Notes and References


André-Louis Cholesky (1875-1918) was a French military officer involved in
geodesy and surveying in Crete and North Africa. In some books his name
is misspelled “Choleski”. Details of Cholesky’s life (and a discussion about
the pronunciation of his name!) can be found in the electronic mail magazine
NA-Digest, volume 90, 1990, issues 7, 8, 10–12, and 24; see, in particular,
the biography [22, 1922]. Cholesky’s work was published posthumously on his
behalf by Benoit [91, 1924].
The properties of the Cholesky factorization are intimately associated with
the properties of the Schur complement, as is apparent from some of the proofs
in this chapter. The same is true for GE in general. An excellent survey of the
Schur complement, containing historical comments, theory, and applications,
is given by Cottle [248, 1974].
For results on the Cholesky factorization in Hilbert space see Power [841,
1986].
A book by George and Liu [438, 1981] is devoted to the many practical is-
sues in the implementation of Cholesky factorization for the solution of sparse
symmetric positive definite systems.
There is no floating point error analysis of Cholesky factorization in Wilkin-
son’s books, but he gives a detailed analysis in [1092, 1968], showing that
R̂^T R̂ = A + E, with ||E||_2 ≤ 2.5n^{3/2}u||A||_2. It is unfortunate that this paper
is in a rather inaccessible proceedings, because it is a model of how to phrase
and interpret an error analysis. Meinguet [747, 1983] and Sun [973, 1992] give
componentwise backward error bounds similar to those in Theorems 10.3 and
Kielbasinski [657, 1987] reworks Wilkinson’s analysis to improve the
constant.
The fact that κ2 (H) can replace the potentially much larger κ 2 (A) in
the forward error bound for the Cholesky method was stated informally and
without proof by Wilkinson [1092, 1968, p. 638]. Demmel [283, 1989] made
this observation precise and explored its implications; Theorems 10.5, 10.6,
and 10.7 are taken from [283, 1989].
The bounds in Theorem 10.8 are from Sun [971, 1991], [972, 1992]. Similar
bounds are given by Stewart [944, 1977], [951, 1993], Barrlund [71, 1991],
and Sun [973, 1992]. A perturbation bound that can be much smaller than

the normwise one in Theorem 10.8 is derived and explored by Chang and
Paige [198, 1995]. Perturbation results of a different flavour, including one
for structured perturbations of the form of ∆A in Theorem 10.5, are given by
Drmac, Omladc, and Veselic [321, 1994].
The perturbation and error analysis of §10.3 for semidefinite matrices is
from Higham [540, 1990], wherein a perturbation result for the QR factoriza-
tion with column pivoting is also given. For an application in optimization
that makes use of Cholesky factorization with complete pivoting and the anal-
ysis of §10.3.1 see Forsgren, Gill, and Murray [384, 1995].
Fletcher and Powell [382, 1974] describe several algorithms for updating
an LDLT factorization of a symmetric positive definite A when A is modified
by a rank-1 matrix. They give detailed componentwise error analysis for some
of the methods.
An excellent way to test whether a given symmetric matrix A is positive
(semi) definite is to attempt to compute a Cholesky factorization. This test
is less expensive than computing the eigenvalues and is numerically stable.
Indeed, if the answer “yes” is obtained, it is the right answer for a nearby
matrix, whereas if the answer is “no” then A must be close to an indefinite
matrix. See Higham [535, 1988] for an application of this definiteness test.
An algorithm for testing the definiteness of a Toeplitz matrix is developed by
Cybenko and Van Loan [260, 1986], as part of a more complicated algorithm.
According to Kerr [654, 1990], misconceptions of what is a sufficient condition
for a matrix to be positive (semi) definite are rife in the engineering literature
(for example, that it suffices to check the definiteness of all 2 × 2 submatrices).
See also Problem 10.8. For some results on definiteness tests for Toeplitz
matrices, see Makhoul [722, 1991].
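
In NumPy terms the test amounts to a few lines. This is a sketch using the library's Cholesky routine rather than Algorithm 10.2; numpy.linalg.cholesky raises an exception when a nonpositive pivot is encountered.

    import numpy as np

    def is_positive_definite(A):
        # Sketch: test definiteness by attempting a Cholesky factorization.
        # A "yes" answer is the right answer for a matrix within rounding
        # error of A; a "no" answer means A is close to an indefinite matrix.
        try:
            np.linalg.cholesky((A + A.T) / 2)    # symmetrize first
            return True
        except np.linalg.LinAlgError:
            return False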
A major source of symmetric indefinite linear systems is the least squares
problem, because the augmented system is symmetric indefinite; see Chap-
ter 19. Other sources of such systems are interior methods for solving con-
strained optimization problems (see Forsgren, Gill, and Shinnerl [385, 1996],
Turner [1030, 1991], and Wright [1115, 1992]) and linearly constrained opti-
mization problems (see Gill, Murray, Saunders, and Wright [445, 1990], [446,
1991 ]).
The idea of using a block LDLT factorization with some form of pivoting
for symmetric indefinite matrices was first suggested by Kahan in 1965 [166,
1971]. Bunch and Parlett [166, 1971] developed the complete pivoting strategy
and Bunch [158, 1971] proved its stability. Bunch [160, 1974] discusses a rather
expensive partial pivoting strategy that requires repeated scalings. Bunch and
Kaufman [164, 1977] found the efficient partial pivoting strategy presented
here, which is the one now widely used, and Bunch, Kaufman and Parlett [165,
1976] give an Algol code implementing the diagonal pivoting method with this
pivoting strategy. Dongarra, Duff, Sorensen, and van der Vorst [315, 1991,
§5.4.5] show how to develop a partitioned version of the diagonal pivoting

method with partial pivoting.


Liu [709, 1987] shows how to incorporate a threshold into the Bunch–
Kaufman partial pivoting strategy for sparse symmetric matrices; see also Duff
et al. [326, 1991]. The partial pivoting strategy and variants of it described
by Bunch and Kaufman [164, 1977] do not preserve band structure, but the
fill-in depends on the number of 2 × 2 pivots, which is bounded by the number
of negative eigenvalues (see Problem 10.11). Jones and Patrick [615, 1993],
[616, 1994] show how to exploit this fact.
The complete and partial pivoting strategies of Bunch et al. use a fixed
number of tests to determine each pivot. Another possibility is to prescribe
growth bounds corresponding to 1 × 1 and 2 × 2 pivots and to search in
some particular order for a pivot satisfying the bound. Fletcher [375, 1976]
uses this approach to define a pivoting strategy that usually requires only
O(n 2) operations. Duff, Reid, and co-workers apply the same approach to the
diagonal pivoting method for sparse matrices, where sparsity considerations
also influence the choice of pivot [331, 1979], [326, 1991]; their Fortran codes
MA27 [329, 1982] and MA47 [330, 1995] implement the methods.
Gill, Murray, Ponceleón, and Saunders [443, 1992] show how for sparse,
symmetric indefinite systems the diagonal pivoting factorization can be used
to construct a (positive definite) preconditioner for an iterative method.
Another method for solving symmetric indefinite systems is Aasen’s method
[1, 1971], which employs the factorization PAPT = LTL T , where L is unit
lower triangular and T is tridiagonal. It is competitive with the diagonal
pivoting method in terms of speed. Barwell and George [77, 1976] compare
the performance of Fortran codes for several methods for solving symmet-
ric indefinite systems, including the diagonal pivoting method and Aasen’s
method.
Dax and Kaniel [270, 1977] propose computing a factorization PAPT =
LDL T for symmetric indefinite matrices by an extended form of Gaussian
elimination in which extra row operations are used to “build up” a pivot ele-
ment prior to the elimination operations; here, L is unit lower triangular and
D is diagonal. A complete pivoting strategy for determining the permutation
P is described in [270, 1977] and partial pivoting strategies in Dax [268, 1982].
Analogues of the factorization for symmetric matrices exist for skew-
symmetric matrices; see Bunch [161, 1982].
Bunch [159, 1971] shows how to scale a symmetric matrix so that in every
nonzero row and column the largest magnitude of an element is 1.

10.6.1. LAPACK
Driver routines xPOSV (simple) and xPOSVX (expert) use the Cholesky fac-
torization to solve a symmetric (or Hermitian) positive definite system of
linear equations with multiple right-hand sides. (There are corresponding

routines for packed storage, in which one triangle of the matrix is stored in
a one-dimensional array: PP replaces PO in the names.) The expert driver
incorporates iterative refinement, condition estimation, and backward and
forward error estimation and has an option to scale the system AX = B
to (D^{−1}AD^{−1})DX = D^{−1}B, where D = diag(a_ii^{1/2}). Modulo the rounding
errors in computing and applying the scaling, the scaling has no effect on
the accuracy of the solution prior to iterative refinement, in view of Theo-
rem 10.6. The Cholesky factorization is computed by the routine xPOTRF ,
which uses a partitioned algorithm that computes R a block row at a time.
The drivers xPTSV and xPTSVX for symmetric positive definite tridiagonal ma-
trices use LDLT factorization. LAPACK does not currently contain a routine
for Cholesky factorization of a positive semidefinite matrix, but there is such
a routine in LINPACK (xCHDC ).
Driver routines xSYSV (simple) and xSYSVX (expert) use the block LDLT
factorization (computed by the diagonal pivoting method) with partial piv-
oting to solve a symmetric indefinite system of linear equations with multi-
ple right-hand sides. For Hermitian matrices the corresponding routines are
xHESV (simple) and xHESVX (expert). (Variants of these routines for packed
storage have names in which SP replaces SY and HP replaces HE.) The expert
drivers incorporate iterative refinement, condition estimation, and backward
and forward error estimation. The factorization is computed by the routine
xSYTRF or xHETRF.

Problems
10.1. Show that if A ∈ R^{n×n} is symmetric positive definite then

    |a_ij| ≤ (a_ii a_jj)^{1/2},    i, j = 1:n.

What does this statement imply about max_{i,j} |a_ij|?


10.2. If A is a symmetric positive definite matrix, how would you compute
x T A- 1 x?

10.3. Let y = be evaluated in floating point arithmetic in


any order. Show that

where for all i, and |θk + 1| < γ k +1 .


10.4. Let be symmetric positive definite. Show that the reduced
submatrix B of order n—1 at the end of the first stage of GE is also symmetric

positive definite. Deduce that 0 < a_kk^{(k)} ≤ a_kk and hence
that the growth factor p_n = 1.
10.5. Show that the backward error result (10.6) for the solution of a sym-
metric positive definite linear system by Cholesky factorization implies

where ||A||M = maxi,j |aij| (which is not a consistent matrix norm—see §6.2).
The significance of this result is that the bound for ||∆ A||M/||A||M contains a
linear polynomial in n, rather than the quadratic that appears for the 2-norm
in (10.7).
10.6. Let A = cp(A) be positive semidefinite of rank r and suppose it
has the Cholesky factorization (10.11) with Π = I. Show that Z = [W , –I] T
is a basis for the null space of A, where W = R_11^{−1}R_12.
10.7. Prove that (10.13) holds for the Cholesky decomposition with complete
pivoting.
10.8. Give an example of a symmetric matrix for which the leading
principal submatrices A_k satisfy det(A_k) ≥ 0, k = 1:n, but A is not positive
semidefinite (recall that det(A k) > 0, k = 1:n, implies that A is positive
definite). State a condition on the minors of A that is both necessary and
sufficient for positive semidefiniteness.
10.9. Suppose the outer product Cholesky factorization algorithm terminates
at the (k+1)st stage (see (10.15)), with a negative pivot in the (k + 1, k + 1)
position. Show how to construct a direction of negative curvature for A (a
vector p such that pTAp < 0).
10.10. What is wrong with the following argument? A positive semidefinite
matrix is the limit of a positive definite one as the smallest eigenvalue tends to
zero. Theorem 10.3 shows that Cholesky factorization is stable for a positive
definite matrix, and therefore, by continuity, it must be stable for a positive
semidefinite matrix, implying that Theorem 10.14 is unnecessarily weak (since
||W|| 2 can be large).
10.11. Consider the diagonal pivoting method applied to a symmetric ma-
trix. Show that with complete pivoting or partial pivoting any 2 × 2 pivot
is indefinite. Hence give a formula for the inertia in terms of the block sizes
of the block diagonal factor. Show how to avoid overflow in computing the
inverse of a 2 × 2 pivot.
10.12. Describe the effect of applying the diagonal pivoting method with
partial pivoting to a 2 × 2 symmetric matrix.
10.13. What factorization is computed if the diagonal pivoting method with
partial pivoting is applied to a symmetric positive definite matrix?

10.14. (Sorensen and Van Loan; see [315, 1991, §5.3.2]) Suppose the partial
pivoting strategy for the diagonal pivoting method is modified by redefining

(thus “σ_new = max(σ_old, |a_rr|)”). Show that the same growth factor bound
holds as before and that for a positive definite matrix no interchanges are
done and only 1 × 1 pivots are used.
10.15. Let

where 0 < ε < 1, and suppose the diagonal pivoting method is applied to
A, yielding a factorization PAP^T = LDL^T. Show that with partial pivot-
ing ||L|| is unbounded as ε → 0, whereas with complete pivoting ||L|| is
bounded independently of ε.
10.16. Let

be nonsymmetric positive definite. Show that the Schur complement S =


is also positive definite. In other words, show that GE
preserves positive definiteness.
10.17. A matrix of the form

where and are symmetric positive definite has been


called a symmetric quasidefinite matrix by Vanderbei [1047, 1995]. Show that
(a) A is nonsingular, (b) for any permutation Π, Π T AΠ has an LU factoriza-
tion, (c) AS is nonsymmetric positive definite, where S = diag(I, –I). (This
last property reduces the question of the stability of an LDLT factorization of
A to that of the stability of the LU factorization of a nonsymmetric positive
definite matrix, for which see §10.5. This reduction has been pointed out and
exploited by Gill, Saunders, and Shinnerl [448, 1996].)
10.18. (RESEARCH PROBLEM) Is the growth factor bound (2.57)^{n−1} for the
diagonal pivoting method with partial pivoting attainable? If not, how big
can the growth factor be? Similarly, what is a sharp bound for the complete
pivoting growth factor?
10.19. (RESEARCH PROBLEM) Bound the growth factor for Aasen’s method
[1, 1971].

Chapter 11
Iterative Refinement

The ILLIAC’s memory is sufficient to accommodate a system of 39 equations


when used with Routine 51.
The additional length of Routine 100 restricts to 37
the number of equations that it can handle.
With 37 equations the operation time of Routine 100 is about
4 minutes per iteration.
—JAMES N. SNYDER, On the improvement of the Solutions to a Set of
Simultaneous Linear Equations Using the ILLIAC (1955)

In a short mantissa computing environment


the presence of an iterative improvement routine can
significantly widen the class of solvable Ax = b problems.
— GENE H. GOLUB and CHARLES F. VAN LOAN,
Matrix Computations (1989)

Most problems involve inexact input data and


obtaining a highly accurate solution to an
imprecise problem may not be justified.
— J. J. DONGARRA, J. R. BUNCH, C. B. MOLER, and G. W. STEWART,
LINPACK Users’ Guide (1979)


Iterative refinement is an established technique for improving a computed


solution to a linear system Ax = b. The process consists of three steps:
1. Compute r = b − Ax̂.

2. Solve Ad = r.

3. Update y = x̂ + d.

(Repeat from step 1 if necessary, with x̂ replaced by y.)


If there were no rounding errors in the computation of r, d, and y, then y would
be the exact solution to the system. The idea behind iterative refinement is
that if r and d are computed accurately enough then some improvement in
the accuracy of the solution will be obtained. The economics of iterative
refinement are favorable for solvers based on a factorization of A, because
the factorization used to compute x̂ can be reused in the second step of the
refinement.
Traditionally, iterative refinement is used with Gaussian elimination (GE),
and r is computed in extended precision before being rounded to working pre-
cision. Iterative refinement for GE was used in the 1940s on desk calculators,
but the first thorough analysis of the method was given by Wilkinson in 1963
[1088, 1963]. The behaviour of iterative refinement for GE is usually sum-
marized as follows: if double precision is used in the computation of r, and
A is not too ill conditioned, then the iteration produces a solution correct to
working precision and the rate of convergence depends on the condition num-
ber of A. In the next section we give a componentwise analysis of iterative
refinement that confirms this summary and provides some further insight.
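
The process is summarized in the following sketch of the fixed precision variant (all three steps at working precision); for the traditional mixed precision variant the residual b − Ax would instead be accumulated in extended precision, which is not shown here. The function name and the number of refinement steps are illustrative choices of ours.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def refine(A, b, steps=3):
        # Sketch: fixed precision iterative refinement with GE (LU with
        # partial pivoting).  The factorization is computed once and reused
        # to solve for each correction d.
        lu, piv = lu_factor(A)
        x = lu_solve((lu, piv), b)
        for _ in range(steps):
            r = b - A @ x                  # step 1: residual
            d = lu_solve((lu, piv), r)     # step 2: solve A d = r
            x = x + d                      # step 3: update
        return x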

11.1. Convergence of Iterative Refinement


Let A ∈ R^{n×n} be nonsingular and let x̂ be a computed solution to Ax = b.
Define x_1 = x̂ and consider the following iterative refinement process: r_i =
b − Ax_i (precision ū), solve Ad_i = r_i (precision u), x_{i+1} = x_i + d_i (precision
u), i = 1, 2, . . . . For traditional iterative refinement, ū = u².
chapter subscripts specify members of a vector sequence, not vector elements.
We henceforth define ri , di , and xi to be the computed quantities (to avoid
a profusion of hats). The only assumption we will make on the solver is that
the computed solution to a system Ay = c satisfies

(11.1)

Thus the solver need not be LU factorization or even a factorization method.


The page or so of analysis that follows is straightforward but tedious. The
reader is invited to jump straight to (11.4), at least on first reading.

Consider first the computation of ri . There are two stages. First, si =


f l( b – Axi ) = b – Axi + ∆si is formed in the (possibly) extended precision
so that |∆ si | < (cf. (3.10)), where
Second, the residual is rounded to the working precision: ri = fl(s i) = si + fi ,
where |fi | < u|si |. Hence

By writing xi = x + (xi – x), we obtain the bound


(11.2)

For the second step we have, by (11.1), (A + ∆A i)di = ri . Now write

where, since θi := Hence


(11.3)

For the last step,

Using (11.3) we have

Hence

Substituting the bound for |∆ ri | from (11.2) gives

(11.4)

Note that

As long as A is not too ill conditioned and the solver is not too unstable, we
have which means that the error contracts until we reach a point
at which the gi term becomes significant. The limiting normwise accuracy,
that is, the minimum size of
Moreover, if for some
µ, then we can expect to obtain a componentwise relative error of order µu ,
that is, mini
We concentrate now on the case where the solver uses LU factorization. In
the traditional use of iterative refinement, = u 2, and one way to summarize
our findings is as follows.

Theorem 11.1 (mixed precision iterative refinement). Let iterative refine-


ment be applied to the nonsingular linear system Ax = b, using LU fac-
torization and with residuals computed in double the working precision. Let
η = u || |A^{−1}||L̂||Û| ||, where L̂ and Û are the computed LU factors of A.
Then, provided η is sufficiently less than 1, iterative refinement reduces the
error by a factor approximately η at each stage, until

This theorem is stronger than the standard results in the literature, which
have in place of η. We can have η << since η is independent
of the row scaling of A (modulo changes in the pivot sequence). For example,
if then η cond(A)u, and cond(A) can be arbitrarily smaller
than
Consider now the case where = u, which is called fixed precision iterative
refinement. We have an analogue of Theorem 11.1.

Theorem 11.2 (fixed precision iterative refinement). Let iterative refine-


ment in fixed precision be applied to the nonsingular linear system Ax = b
of order n, using LU factorization. Let η = u|| |A⁻¹||L||U| || where L and U
are the computed LU factors of A. Then, provided η is sufficiently less than
1, iterative refinement reduces the error by a factor approximately η at each
stage, until ||x – xi||/||x|| ≈ 2n cond(A, x)u.
The key difference between mixed and fixed precision iterative refinement
is that in the latter case a relative error of order u is no longer ensured. But
we do have a relative error bound of order cond(A, x)u. This is a stronger
bound than holds for the original computed solution for which we can say
only that

(this bound is obtained by applying Theorems 7.4 and 9.4, or from (11.4) with
i = 0!). In fact, a relative error bound of order cond(A, x)u is the best we can
possibly expect if we do not use higher precision, because it corresponds to the

uncertainty introduced by making componentwise relative perturbations to A


of size u (again, see Theorem 7.4); this level of uncertainty is usually present,
because of errors in computing A or in rounding its elements to floating point
form.
The gist of this discussion is that iterative refinement is beneficial even if
residuals are computed only at the working precision. This fact became widely
appreciated only after the publication of Skeel’s 1980 paper [920, 1980]. One
reason for the delayed appreciation may be that comments such as that made
by Forsythe and Moler, “It is absolutely essential that the residuals rk be
computed with a higher precision than that of the rest of the computation”
[396, 1967, p. 49], were incorrectly read to mean that without the use of higher
precision no advantage at all could be obtained from iterative refinement. In
the next section we will see that fixed precision iterative refinement does more
than just produce a cond(A, x)u-bounded forward error for LU factorization—
it brings componentwise backward stability as well.

11.2. Iterative Refinement Implies Stability


We saw in the last section that fixed precision iterative refinement can improve
the accuracy of a solution computed by GE. The question arises of what the
refinement process does to the backward error. To answer this question we give
a general backward error analysis that is applicable to a wide class of linear
equation solvers. Throughout this section, “iterative refinement” means fixed
precision iterative refinement.
We assume that the computed solution to Ax = b satisfies

(11.5)

where g : and h : have nonnegative


entries. The functions g and h may depend on n and u as well as on the data
A and b. We also assume that the residual r = b – Ax̂ is computed in such a
way that
(11.6)
where t : is nonnegative. If r is computed in the conventional
way, then we can take

(11.7)

First we give an asymptotic result that does not make any further assump-
tions on the linear equation solver.

Theorem 11.3. Let A ∈ ℝⁿˣⁿ be nonsingular. Suppose the linear system


Ax = b is solved in floating point arithmetic using a solver S together with one

step of iterative refinement. Assume that the computed solution produced by


S satisfies (11.5) and that the computed residual r̂ satisfies (11.6). Then the
corrected solution satisfies

(11.8)

where q = O(u) if
Proof. The residual r = b – Ax̂ of the original computed solution x̂
satisfies
(11.9)
The computed residual is r̂ = r + ∆r, where ∆r is bounded as in (11.6). The computed
correction d satisfies

Finally, for the corrected solution we have

Collecting together the above results we obtain

Hence
(11.12)

where

The claim about the order of q follows since and are all of
order u.
Theorem 11.3 shows that, to first order, the componentwise relative back-
ward error w|A|,|b| will be small after one step of iterative refinement as long as
and are bounded by a modest scalar multiple of
This is true for t if the residual is computed in the conventional way (see
(11.7)), and in some cases we may take h ≡ 0, as shown below. Note that the
function g of (11.5) does not appear in the first-order term of (11.8). This
is the essential reason why iterative refinement improves stability: potential
instability manifested in g is suppressed by the refinement stage.
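For reference, w|A|,|b| itself is easy to evaluate. The following MATLAB sketch
uses the standard componentwise backward error (Oettli–Prager) formula; the
function name is ours, and the convention for zero denominators is omitted.

    % Componentwise relative backward error of y for Ax = b:
    %   w_{|A|,|b|}(y) = max_i |b - A*y|_i / (|A||y| + |b|)_i.
    % Name and interface are illustrative; zero denominators are not handled.
    function w = comp_bwd_err(A, b, y)
    w = max( abs(b - A*y) ./ (abs(A)*abs(y) + abs(b)) );
    end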
A weakness of Theorem 11.3 is that the bound (11.8) is asymptotic. Since
a strict bound for q is not given, it is difficult to draw firm conclusions about

the size of w|A|,|b|. The next result overcomes this drawback, at the cost of
some specialization (and a rather long proof).
We introduce a measure of ill scaling of the vector |B||x|,

    σ(B, x) := maxi (|B||x|)i / mini (|B||x|)i.

Theorem 11.4. Under the conditions of Theorem 11.3, suppose that g(A, b) =
G|A| and h(A, b) = H|b|, where G and H have nonnegative entries, and
that the residual is computed in the conventional manner. Then there is a
function

such that if

then

Proof. As with the analysis in the previous section, this proof can be
skipped without any real loss of understanding. From (11.12) in the proof of
Theorem 11.3, using the formula (11.7) for t, we have

(11.13)

The inequality (11.9) implies

or (I – uH)|b| < (I + uG) If < 1/2 (say) then I – uH is


nonsingular with a nonnegative inverse satisfying ||(I – uH) –1 || < 2 and we
can solve for |b| to obtain |b| < (I – uH)-1 (I + uG) |A| It follows from this
relation and consideration of the rest of the proof that the simplifying step
of replacing b by 0 in the analysis has little effect on the bounds—it merely
produces unimportant perturbations in f in the statement of the theorem.
Making this replacement in (11.13) and approximating γn+1 + u ≈ γn+1, we
have
(11.14)

Our task is now to bound and in terms of By manip-


ulating (11.11) we obtain the inequality

(11.15)

Also, we can bound by

and dropping the |b| terms and using (11.15) gives

Substituting from (11.15) and (11.16) into (11.14) we find

where

Now from (11.10), making use of (11.16),

After premultiplying by |A| this may be rearranged as

(11.18)

where

Using γn+1/u < (n + 1)/(1 – (n + 1)u) ≈ n + 1, we have the bounds

If < 1/2 (say) then (I – uM3)-1 > 0 with ||(I – uM3)-1|| < 2 and
we can rewrite (11.18) as

(11.19)

Substituting this bound into (11.17) we obtain

where

(see Problem 11.1). Finally, we bound f. Writing g = ||G|| and h = ||H||,


we have

and this expression is approximately bounded by u²(h(g + n + 1) + 2(g + n +


2)²(1 + uh)² cond(A⁻¹)). Requiring not to exceed γn+1 leads
to the result.
Theorem 11.4 says that as long as A is not too ill conditioned, |A||x̂| is not
too badly scaled (σ(A, x̂) is not too large), and the solver is not too
unstable (||G|| and ||H|| are not too large), then w|A|,|b| < 2γn+1 after one
step of iterative refinement. Note that the term γn+1 in (11.20) comes
from the error bound for evaluation of the residual, so this bound for w is
about the smallest we could expect to prove.
Let us apply Theorem 11.4 to GE with or without pivoting. If there is
pivoting, assume (without loss of generality) that no interchanges are required.
Theorem 9.4 shows that we can take

where L and U are the computed LU factors of A. To apply Theorem 11.4 we


use A ≈ LU and write

which shows that we can take

Without pivoting the growth factor-type term is unbounded,


but with partial pivoting it cannot exceed 2n and is typically O(n) [1019,
1990] .
We can conclude that, for GE with partial pivoting (GEPP), one step of
iterative refinement will usually be enough to yield a small componentwise

Table 11.1. w|A|,|b| values for A = orthog(25).

relative backward error as long as A is not too ill conditioned and |A||x̂| is


not too badly scaled. Without pivoting the same holds true with the added
proviso that the computation of the original x̂ must not be too unstable.
These results for GE are very similar to those of Skeel [920, 1980]. The
main differences are that Skeel’s analysis covers an arbitrary number of refine-
ment steps with residuals computed in single or double precision, his analysis
is specific to GE, and his results involve σ(A, x) rather than σ(A, x̂).
One interesting problem remains: to reconcile Theorem 11.4 with Theo-
rem 11.2. Under the conditions of Theorem 11.4 the componentwise relative
backward error is small after one step of iterative refinement, so the forward
error is certainly bounded by a multiple of cond(A, x)u. How can this be
shown (for GE) using the analysis of §11.1? An explanation is nontrivial; see
Problem 11.2.
We will see applications of Theorems 11.3 and 11.4 to other types of linear
equation solver in Chapters 18, 19, and 21.
Tables 11.1–11.3 show the performance of fixed precision iterative refine-
ment for GE without pivoting, GEPP, and Householder QR factorization (see
§18.6). The matrices are from the Test Matrix Toolbox (see Appendix E),
and may be summarized as follows. Clement(n) is tridiagonal with zero diag-
onal entries; orthog(n) is a symmetric and orthogonal matrix, and gfpp(n)
is a matrix for which the growth factor for GEPP is maximal. In each
case the right-hand side b was chosen as a random vector from the uniform
distribution on [0, 1]. We report the componentwise relative backward er-
rors for the initial solution and the refined iterates (refinement was termi-
nated when GEPP performs as predicted by both our and
Skeel’s analyses. In fact, iterative refinement converges in one step even when
θ(A, x) := cond(A⁻¹)σ(A, x) exceeds u⁻¹ in the examples reported and in
most others we have tried. GE also achieves a small componentwise relative
backward error, but can require more than one refinement step, even when
θ(A, x) is small.

Table 11.2. w|A|,|b| values for A = clement (50).

Table 11.3. w|A|,|b| values for A = gfpp(50).

11.3. Notes and References


Wilkinson [1088, 1963] gave a detailed analysis of iterative refinement in a kind
of scaled fixed point arithmetic called block-floating arithmetic. Moler [765,
1967] extended the analysis to floating point arithmetic. Very readable analy-
ses of iterative refinement are given in the books by Forsythe and Moler [396,
1967, §22] and Stewart [941, 1973, §4.5].
As we mentioned in §9.10, as early as 1948 Wilkinson had written a pro-
gram for the ACE to do GEPP and iterative refinement. Other early imple-
mentations of iterative refinement are in a code for the University of Illinois’
ILLIAC by Snyder [932, 1955], the Algol code of McKeeman [745, 1962], and
the Algol codes in the Handbook [138, 1966], [729, 1966]. Some of the ma-
chines for which these codes were intended could accumulate inner products
in extended precision, and so were well suited to mixed precision iterative
refinement.
Interest in fixed precision iterative refinement was sparked by two papers
that appeared in the late 1970s. Jankowski and Wozniakowski [610, 1977]
proved that an arbitrary linear equation solver is made normwise backward
stable by the use of fixed precision iterative refinement, as long as the solver
is not too unstable to begin with and A is not too ill conditioned. Skeel [920,
1980] analysed iterative refinement for GEPP and showed that one step of
refinement yields a small componentwise relative backward error, as long as
cond(A⁻¹)σ(A, x) is not too large.

The analysis in §11.1 extends existing results in the literature. The analysis
in §11.2 is from Higham [549, 1991].
The quantity σ(A, x) appearing in Theorem 11.4 can be interpreted as
follows. Consider a linear system Ax = b for which (|A||x|)i = 0 for some i.
While the componentwise relative backward error w|A|,|b|( x ) of the exact so-
lution x is zero, an arbitrarily small change to a component xj where aij 0
yields w|A|,|b| (x + ∆x) > 1. Therefore solving Ax = b to achieve a small
componentwise relative backward error can be regarded as an ill-posed prob-
lem when |A||x| has a zero component. The quantity σ(A, x) reflects this
ill-posedness because it is large when |A||x| has a relatively small component.
For a lucid survey of both fixed and mixed precision iterative refinement
and their applications, see Björck [111, 1990]. For particular applications of
fixed precision iterative refinement, see Govaerts and Pryce [475, 1990] and
Jankowski and Wozniakowski [611, 1985].
By increasing the precision from one refinement iteration to the next it
is possible to compute solutions to arbitrarily high accuracy, an idea first
suggested by Stewart in an exercise [941, 1973, pp. 206–207]. For algorithms,
see Kielbasinski [656, 1981] and Smoktunowicz and Sokolnicka [931, 1984].
There are a number of practical issues to attend to when implementing iter-
ative refinement. Mixed precision iterative refinement cannot be implemented
in a portable way when the working precision is already the highest precision
supported by a compiler. This is the main reason why iterative refinement is
not supported in LINPACK. (The LINPACK manual lists a subroutine that
implements mixed precision iterative refinement for single precision data, but
it is not part of LINPACK [307, 1979, pp. 1.8–1.10].) For either form of refine-
ment, a copy of the matrix A needs to be kept in order to form the residual,
and this necessitates an extra n² elements of storage. A convergence test for
terminating the refinement is needed. In addition to revealing when conver-
gence has been achieved, it must signal lack of (sufficiently fast) convergence,
which may certainly be experienced when A is very ill conditioned. In the
LAPACK driver xGESVX, fixed precision iterative refinement is terminated if
the componentwise relative backward error w = w|A|,|b| satisfies one of the following (see the sketch after the list):

1. w < u,

2. w has not decreased by a factor of at least 2 during the current iteration,


or

3. five iterations have been performed.
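The following MATLAB fragment sketches a refinement loop driven by these
three criteria; it is written in the spirit of xGESVX rather than being a
transcription of it, and it reuses the function comp_bwd_err and the factors
L, U, P and iterate x from the earlier sketches.

    % Refinement loop with xGESVX-style termination: stop when w <= u,
    % when w fails to halve, or after five iterations.
    u = eps/2;  w_old = inf;
    for iter = 1:5
        r = b - A*x;
        d = U\(L\(P*r));
        x = x + d;
        w = comp_bwd_err(A, b, x);
        if w <= u || w > w_old/2, break, end
        w_old = w;
    end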

These criteria were chosen to be robust in the face of different BLAS imple-
mentations and machine arithmetics. In an implementation of mixed precision
iterative refinement it is more natural to test for convergence of the sequence
{xi}, with a test such as ||xi+1 – xi||/||xi+1|| < u (see, e.g., Forsythe and

Moler [396, 1967, p. 65]). However, if A is so ill conditioned that Theorem 11.1
is not applicable, the sequence could converge to a vector other than the
solution. This behaviour is very unlikely, and Kahan [626, 1966] quotes a
“prominent figure in the world of error-analysis” as saying “Anyone unlucky
enough to encounter this sort of calamity has probably already been run over
by a truck.”
A by-product of extended precision iterative refinement is an estimate of
the condition number. Since the error decreases by a factor approximately
η = u|| |A⁻¹||L||U| || on each iteration (Theorem 11.1), the relative change
made to x on the first iteration should be about η, that is,
||d1||/||x1|| ≈ u|| |A⁻¹||L||U| ||.
Now that reliable and inexpensive condition estimators are
available (Chapter 14) this rough estimate is less important.
An unusual application of iterative refinement is to fault-tolerant com-
puting. Boley et al. [132, 1994] propose solving Ax = b by GEPP or QR
factorization, performing one step of fixed precision iterative refinement and
then testing whether the a priori residual bound in Theorem 11.4 is satisfied.
If the bound is violated then a hardware fault may have occurred and special
action is taken.

11.3.1. LAPACK
Iterative refinement is carried out by routines whose names end -RFS, and
these routines are called by the expert drivers (name ending -SVX). Iterative
refinement is available for all the standard matrix types except triangular ma-
trices, for which the original computed solution already has a componentwise
relative backward error of order u. As an example, the expert driver xGESVX
uses LU factorization with partial pivoting and fixed precision iterative refine-
ment to solve a general system of linear equations with multiple right-hand
sides, and the refinement is actually carried out by the routine xGERFS .

Problems
11.1. Show that for and where
σ = maxi |xi |/mini |xi |.
11.2. Use the analysis of §11.1 to show that, under the conditions of Theo-
rem 11.4, the forward error is bounded by a multiple of cond(A, x)u for GEPP
after one step of fixed precision iterative refinement.
11.3. Investigate empirically the size of for L from GEPP.
11.4. (Demmel and Higham [291, 1992]) Suppose GEPP with fixed precision
iterative refinement is applied to the multiple-right-hand side system AX = B,
and that refinement of the columns of X is done “in parallel”: R = B – AX,

AD = R, Y = X + D. What can be said about the stability of the process


if R is computed by conventional multiplication but the second step is done
using a fast multiplication technique for which only (12.3) holds?
11.5. (RESEARCH PROBLEM) Is one step of fixed precision iterative refinement
sufficient to produce a componentwise relative backward error of order u for
Cholesky factorization applied to a symmetric positive definite system Ax = b,
assuming cond(A⁻¹)σ(A, x) is not too large? Answer the same question for
the diagonal pivoting method with partial pivoting applied to a symmetric
system Ax = b.

Chapter 12
Block LU Factorization

Block algorithms are advantageous for at least two important reasons.


First, they work with blocks of data having b² elements,
performing O(b³) operations.
The O(b) ratio of work to storage means that
processing elements with an O(b) ratio of
computing speed to input/output bandwidth can be tolerated.
Second, these algorithms are usually rich in matrix multiplication.
This is an advantage because
nearly every modern parallel machine is good at matrix multiplication.
—ROBERT S. SCHREIBER, Block Algorithms for Parallel Machines (1988)

It should be realized that, with partial pivoting,


any matrix has a triangular factorization.
DECOMP actually works faster when zero pivots occur because they mean that
the corresponding column is already in triangular form.
— GEORGE E. FORSYTHE, MICHAEL A. MALCOLM, and CLEVE B. MOLER,
Computer Methods for Mathematical Computations (1977)

It was quite usual when dealing with very large matrices to


perform an iterative process as follows:
the original matrix would be read from cards and the reduced matrix punched
without more than a single row of the original matrix
being kept in store at any one time;
then the output hopper of the punch would be
transferred to the card reader and the iteration repeated.
— MARTIN CAMPBELL-KELLY, Programming the Pilot ACE (1981)


12.1. Block Versus Partitioned LU Factorization


As we noted in Chapter 9 (Notes and References), Gaussian elimination (GE)
comprises three nested loops that can be ordered in six ways, each yielding a
different algorithmic variant of the method. These variants involve different
computational kernels: inner product and saxpy operations (level-1 BLAS),
or outer product and gaxpy operations (level-2 BLAS). To introduce matrix–
matrix operations (level-3 BLAS), which are beneficial for high-performance
computing, further manipulation beyond loop reordering is needed. We will
use the following terminology, which emphasises an important distinction.
A partitioned algorithm is a scalar (or point) algorithm in which the op-
erations have been grouped and reordered into matrix operations.
A block algorithm is a generalization of a scalar algorithm in which the
basic scalar operations become matrix operations (α + β, αβ, and α/β become
A + B, AB, and AB⁻¹), and a matrix property based on the nonzero structure
becomes the corresponding property blockwise (in particular, the scalars 0
and 1 become the zero matrix and the identity matrix, respectively). A block
factorization is defined in a similar way and is usually what a block algorithm
computes.
A partitioned version of the outer product form of LU factorization may
be developed as follows. For A ∈ ℝⁿˣⁿ and a given block size r, write

    A = [ A11  A12 ]
        [ A21  A22 ]                                    (12.1)

where A11 is r × r. One step of the algorithm consists of factoring A11 =


L11U11, solving the multiple right-hand side triangular systems L11U12 = A12
and L21U11 = A21 for U12 and L21, respectively, and then forming B =
A22 – L21U12; this procedure is repeated on B. The block operations defining
U 12, L21, and B are level-3 BLAS operations. This partitioned algorithm does
precisely the same arithmetic operations as any other variant of GE, but it
does the operations in an order that permits them to be expressed as matrix
operations.
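A bare-bones MATLAB sketch of this partitioned algorithm follows; pivoting is
omitted for clarity, so it illustrates the grouping of the arithmetic rather
than being a robust code, and the function names are ours.

    % Partitioned outer product LU factorization, block size r (no pivoting).
    function [L,U] = part_lu(A, r)
    n = size(A,1);  L = eye(n);  U = zeros(n);
    for k = 1:r:n
        b1 = k:min(k+r-1,n);  b2 = min(k+r-1,n)+1:n;
        [L(b1,b1), U(b1,b1)] = point_lu(A(b1,b1));    % A11 = L11*U11
        U(b1,b2) = L(b1,b1)\A(b1,b2);                 % solve L11*U12 = A12
        L(b2,b1) = A(b2,b1)/U(b1,b1);                 % solve L21*U11 = A21
        A(b2,b2) = A(b2,b2) - L(b2,b1)*U(b1,b2);      % B = A22 - L21*U12
    end
    end

    function [L,U] = point_lu(A)      % point GE without pivoting on a block
    n = size(A,1);  L = eye(n);  U = A;
    for k = 1:n-1
        L(k+1:n,k) = U(k+1:n,k)/U(k,k);
        U(k+1:n,:) = U(k+1:n,:) - L(k+1:n,k)*U(k,:);
    end
    end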
A genuine block algorithm computes a block LU factorization, which is a
factorization A = LU where L and U are block triangular and L has
identity matrices on the diagonal:

In general, the blocks can be of different dimensions. Note that this fac-
torization is not the same as a standard LU factorization, because U is not

triangular. However, the standard and block LU factorizations are related as


follows: if A = LU is a block LU factorization and each Uii has LU factor-
ization Uii = L̃iiŨii, then A = (L diag(L̃ii)) (diag(L̃ii)⁻¹U) is an LU factorization.
Conditions for the existence of a block LU factorization are easy to state.

Theorem 12.1. The matrix A = (Aij), with m block rows and columns, has a unique block LU


factorization if and only if the first m – 1 leading principal block submatrices
of A are nonsingular.

Proof. The proof is entirely analogous to the proof of Theorem 9.1.


This theorem makes clear that a block LU factorization may exist when
an LU factorization does not.
If A11 is nonsingular we can write

    A = [ A11  A12 ] = [  I    0 ] [ A11  A12 ]
        [ A21  A22 ]   [ L21   I ] [  0    S  ]                    (12.2)

which describes one block step of an outer-product-based algorithm for com-


puting a block LU factorization. Here, L21 = A21A11⁻¹ and S = A22 – A21A11⁻¹A12 is the Schur
complement of A11 in A. If the (1, 1) block of S of appropriate dimension is
nonsingular then we can factorize S in a similar manner, and this process can
be continued recursively to obtain the complete block LU factorization. The
overall algorithm can be expressed as follows.

Algorithm 12.2 (block LU factorization). This algorithm computes a block


LU factorization A = LU using the notation of (12.2).

1. U11 = A11, U12 = A12.
2. Solve L21A11 = A21 for L21.
3. S = A22 – L21A12 (Schur complement).
4. Compute the block LU factorization of S, recursively.
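A recursive MATLAB sketch of Algorithm 12.2 follows, in the manner of
Implementation 1 (the solves with A11 use MATLAB's backslash and slash, i.e.,
GEPP on the diagonal block); the block size and interface are illustrative.

    % Block LU factorization (Algorithm 12.2): L is block unit lower
    % triangular, U is block upper triangular with full diagonal blocks.
    function [L,U] = block_lu(A, r)
    n = size(A,1);  L = eye(n);  U = zeros(n);
    if n <= r, U = A; return, end           % a single block: L = I, U = A
    b1 = 1:r;  b2 = r+1:n;
    U(b1,b1) = A(b1,b1);                    % U11 = A11
    U(b1,b2) = A(b1,b2);                    % U12 = A12
    L(b2,b1) = A(b2,b1)/A(b1,b1);           % step 2: solve L21*A11 = A21
    S = A(b2,b2) - L(b2,b1)*A(b1,b2);       % step 3: Schur complement
    [L(b2,b2), U(b2,b2)] = block_lu(S, r);  % step 4: recur on S
    end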

Given a block LU factorization of A, the solution to a system Ax = b can


be obtained by solving Ly = b by forward substitution (since L is triangular)
and solving Ux = y by block back substitution. There is freedom in how
step 2 of Algorithm 12.2 is accomplished, and how the linear systems with
coefficient matrices U ii that arise in the block back substitution are solved.
The two main possibilities are as follows.
Implementation 1: A11 is factorized by GE with partial pivoting. Step 2
and the solution of linear systems with Uii are accomplished by substitution
with the LU factors of A1 1 .
Implementation 2: A11⁻¹ is computed explicitly, so that step 2 becomes a
matrix multiplication and Ux = y is solved entirely by matrix–vector multi-
plications. This approach is attractive for parallel machines.

What can be said about the numerical stability of partitioned and block
LU factorization? Because the partitioned algorithm is just a rearrangement
of standard GE, the standard error analysis applies if the matrix operations
are computed in the conventional way. However, if fast matrix multiplication
techniques are used (for example, Strassen’s method), the standard results
are not applicable. Standard results are, in any case, not applicable to block
LU factorization; its stability can be very different from that of LU factor-
ization. Therefore we need error analysis for both partitioned and block LU
factorization based on general assumptions that permit the use of fast matrix
multiplication.
Unless otherwise stated, in this chapter an unsubscripted norm denotes
||A|| := maxi,j |aij|. We make two assumptions about the underlying level-3
BLAS (matrix-matrix operations).
(1) If A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵖ then the computed approximation Ĉ to
C = AB satisfies

    Ĉ = AB + ∆C,    ||∆C|| < c1(m, n, p)u||A|| ||B|| + O(u²),        (12.3)

where c1(m, n, p) denotes a constant depending on m, n and p.


(2) The computed solution X̂ to the triangular systems TX = B, where
T ∈ ℝᵐˣᵐ and B ∈ ℝᵐˣᵖ, satisfies

    TX̂ = B + ∆B,    ||∆B|| < c2(m, p)u||T|| ||X̂|| + O(u²).        (12.4)

For conventional multiplication and substitution, conditions (12.3) and


(12.4) hold with c1(m, n, p) = n² and c2(m, p) = m². For implementations
based on Strassen’s method, (12.3) and (12.4) hold with c1 and c2 rather
complicated functions of the dimensions m, n, p and the threshold n0 that
determines the level of recursion (see Theorem 22.2 and [544, 1990]).

12.2. Error Analysis of Partitioned LU Factorization


An error analysis for partitioned LU factorization must answer two questions.
The first is whether partitioned LU factorization becomes unstable in some
fundamental way when fast matrix multiplication is used. The second is
whether the constants in (12.3) and (12.4) are propagated stably into the
final error bound (exponential growth of the constants would be disastrous).
We will assume that the block level LU factorization is done in such a way
that the computed LU factors of A1 1 satisfy

(12.5)

Theorem 12.3 (Demmel and Higham). Under the assumptions (12.3),


(12.4), and (12.5), the LU factors of A ∈ ℝⁿˣⁿ computed using the partitioned

outer product form of LU factorization with block size r satisfy L̂Û = A + ∆A,


where
(12.6)
and where

Proof. The proof is essentially inductive. To save clutter we will omit


“+O(u²)” from each bound. For n = r, the result holds trivially. Consider
the first block stage of the factorization, with the partitioning (12.1). The
assumptions imply that
(12.7)
(12.8)

To obtain B = A22 – L21U12 we first compute C = L21U12, obtaining

and then subtract from A22, obtaining

(12.9)
It follows that

(12.10a)

(12.10b)
The remainder of the algorithm consists of the computation of the LU fac-
torization of B, and by our inductive assumption (12.6), the computed LU
factors satisfy

(12.11a)
(12.11b)
Combining (12.10) and (12.11), and bounding using (12.9), we obtain

(12.12)

Collecting (12.5), (12.7), (12.8), and (12.12) we have L̂Û = A + ∆A, where


bounds on ||∆Aij|| are given in the equations just mentioned. These bounds
for the blocks of ∆A can be weakened slightly and expressed together in the
more succinct form (12.6).
These recurrences for δ(n,r) and θ(n,r) show that the basic error constants
in assumptions (12.3), (12.4), and (12.5) combine additively at worst. Thus,
the backward error analysis for the LU factorization is commensurate with
the error analysis for the particular implementation of the BLAS3 employed
in the partitioned factorization. In the case of the conventional BLAS3 we
obtain a Wilkinson-style result for GE without pivoting, with θ(n,r) = O(n³)
(the growth factor is hidden in and
Although the above analysis is phrased in terms of the partitioned outer
product form of LU factorization, the same result holds for other “ijk” par-
titioned forms (with slightly different constants), for example, the gaxpy or
sdot forms. There is no difficulty in extending the analysis to cover partial
pivoting and solution of Ax = b using the computed LU factorization (see
Problem 12.6).

12.3. Error Analysis of Block LU Factorization


Now we turn to block LU factorization. We assume that the computed ma-
trices L21 from step 2 of Algorithm 12.2 satisfy

We also assume that when a system U i i x i = d i of order r is solved, the


computed solution satisfies

(12.14)

The assumptions (12.13) and (12.14) are satisfied for Implementation 1 of


Algorithm 12.2 and are sufficient to prove the following result.

Theorem 12.4 (Demmel, Higham, and Schreiber). Let L̂ and Û be the com-
puted block LU factors of A ∈ ℝⁿˣⁿ from Algorithm 12.2 (with Implementa-
tion 1), and let x̂ be the computed solution to Ax = b. Under the assumptions
(12.3), (12.13), and (12.14),

(12.15)

where the constant dn is commensurate with those in the assumptions.



Proof. We omit the proof (see Demmel, Higham, and Schreiber [293,
1995] for details). It is similar to the proof of Theorem 12.3.
The bounds in Theorem 12.4 are valid also for other versions of block LU
factorization obtained by “block loop reordering”, such as a block gaxpy based
algorithm.
Theorem 12.4 shows that the stability of block LU factorization is de-
termined by the ratio ||L||||U||/||A|| (numerical experiments show that the
bounds are, in fact, reasonably sharp). If this ratio is bounded by a mod-
est function of n, then L and U are the true factors of a matrix close to
A, and x̂ solves a slightly perturbed system. However, ||L|| ||U|| can exceed
||A|| by an arbitrary factor, even if A is symmetric positive definite or di-
agonally dominant by rows. Indeed, ||L|| > ||L21|| = ||A21A11⁻¹||, using the
partitioning (12.2), and this lower bound for ||L|| can be arbitrarily large.
In the following two subsections we investigate this instability more closely
and show that ||L||||U|| can be bounded in a useful way for particular classes
of A. Without further comment we make the reasonable assumption that
|| |L||U| || ≈ ||L|| ||U||, so that these bounds may be used in Theorem 12.4.
What can be said for Implementation 2? Suppose, for simplicity, that the
inverses Uii⁻¹ (which are used in step 2 of Algorithm 12.2 and in the block
back substitution) are computed exactly. Then the best bounds of the forms
(12.13) and (12.14) are

Working from these results, we find that Theorem 12.4 still holds provided the
first-order terms in the bounds in (12.15) are multiplied by max i κ(U i i ). This
suggests that Implementation 2 of Algorithm 12.2 can be much less stable
than Implementation 1 when the diagonal blocks of U are ill conditioned, and
this is confirmed by numerical experiments.

12.3.1. Block Diagonal Dominance


One class of matrices for which block LU factorization has long been known
to be stable is block tridiagonal matrices that are diagonally dominant in
an appropriate block sense. A general matrix A ∈ ℝⁿˣⁿ is block diagonally
dominant by columns with respect to a given partitioning A = (Aij) and a
given norm if, for all j,

    ||Ajj⁻¹||⁻¹ – Σi≠j ||Aij|| =: γj ≥ 0.                    (12.16)

A is block diagonally dominant by rows if AT is block diagonally dominant by


columns. For the block size 1, the usual property of point diagonal dominance

is obtained. Note that for the 1- and ∞-norms diagonal dominance does not
imply block diagonal dominance, nor does the reverse implication hold (see
Problem 12.2). Throughout our analysis of block diagonal dominance we take
the norm to be an arbitrary subordinate matrix norm.
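As a small illustration, the column-dominance test (12.16) can be checked
directly. The MATLAB sketch below assumes equal block sizes r dividing n and
uses the ∞-norm; the function name and interface are ours, not the book's.

    % Test block diagonal dominance by columns in the sense of (12.16).
    function [isdom, gamma] = block_diag_dom(A, r)
    n = size(A,1);  m = n/r;  gamma = zeros(m,1);
    for j = 1:m
        cj = (j-1)*r+1:j*r;
        offsum = 0;
        for i = [1:j-1, j+1:m]
            offsum = offsum + norm(A((i-1)*r+1:i*r, cj), inf);
        end
        gamma(j) = 1/norm(inv(A(cj,cj)), inf) - offsum;
    end
    isdom = all(gamma >= 0);
    end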
First, we show that for block diagonally dominant matrices a block LU
factorization exists, using the key property that block diagonal dominance is
inherited by the Schur complements obtained in the course of the factorization.
In the analysis we assume that A has m block rows and columns.

Theorem 12.5 (Demmel, Higham, and Schreiber). Suppose A ∈ ℝⁿˣⁿ is


nonsingular and block diagonally dominant by rows or columns with respect to
a subordinate matrix norm in (12.16). Then A has a block LU factorization,
and all the Schur complements arising in Algorithm 12.2 have the same kind
of diagonal dominance as A.

Proof. This proof is a generalization of Wilkinson’s proof of the corre-


sponding result for point diagonally dominant matrices [1085, 1961, pp. 288–
289], [470, 1989, p. 120] (as is the proof of Theorem 12.6 below). We consider
the case of block diagonal dominance by columns; the proof for row-wise di-
agonal dominance is analogous.
The first step of Algorithm 12.2 succeeds, since A 11 is nonsingular, pro-
ducing a matrix that we can write as

For j = 2:m we have

using (12.16)

using (12.16),

(12.17)

Now if is singular it follows that therefore A(2), and


hence also A, is singular, which is a contradiction. Thus is nonsingular,
and (12.17) can be rewritten

showing that A(2) is block diagonally dominant by columns. The result follows
by induction.
The next result allows us to bound ||U|| for a block diagonally dominant
matrix.

Theorem 12.6 (Demmel, Higham, and Schreiber). Let A satisfy the condi-
tions of Theorem 12.5. If A(k) denotes the matrix obtained after k – 1 steps
of Algorithm 12.2, then

Proof. Let A be block diagonally dominant by columns (the proof for row
diagonal dominance is similar). Then

using (12.16). By induction, using Theorem 12.5, it follows that


This yields

The implications of Theorems 12.5 and 12.6 for stability are as follows.
Suppose A is block diagonally dominant by columns. Also, assume for the
moment that the (subordinate) norm has the property that
(12.18)

which holds for any p-norm, for example. The subdiagonal blocks in the first
block column of L are given by Li1 = Ai1A11⁻¹ and so ||Li1|| < 1, by
(12.16) and (12.18). From Theorem 12.5 it follows that <
1 for j = 2:m. Since Uij = for j > i, Theorem 12.6 shows that ||Uij|| <
2||A|| for each block of U (and ||Uij|| < ||A||). Therefore ||L|| < m and ||U|| <
m2 ||A||, and so ||L||||U|| < m 3 ||A|| . For particular norms the bounds on the
blocks of L and U yield a smaller bound for ||L|| and ||U||. For example, for
the 1-norm we have ||L||1 ||U||1 < 2m||A||1 and for the ∞-norm
We conclude that block LU factorization is stable if A is block
diagonally dominant by columns with respect to any subordinate matrix norm
satisfying (12.18).
Unfortunately, block LU factorization can be unstable when A is block
diagonally dominant by rows, for although Theorem 12.6 guarantees that
||U i j || < 2||A||, ||L|| can be arbitrarily large. This can be seen from the
example

where A is block diagonally dominant by rows in any subordinate norm for


any nonsingular matrix A11. It is easy to confirm numerically that block LU
factorization can be unstable on matrices of this form.
Next, we bound ||L||||U|| for a general matrix and then specialize to point
diagonal dominance. From this point on we use the norm ||A|| := max i,j |aij|.
We partition A according to

(12.19)

and denote by pn the growth factor for GE without pivoting. We assume that
GE applied to A succeeds.
To bound ||L||, we note that, under the partitioning (12.19), for the first
block stage of Algorithm 12.2 we have ||L21|| = ||A21A11⁻¹|| < n pn κ(A) (see
Problem 12.4). Since the algorithm works recursively with the Schur com-
plement S, and since every Schur complement satisfies κ( S) < pn κ(A) (see
Problem 12.4), each subsequently computed subdiagonal block of L has norm
at most Since U is composed of elements of A together with ele-
ments of Schur complements of A,
||U|| < pn||A||. (12.20)

Overall, then, for a general matrix


(12.21)
Thus, block LU factorization is stable for a general matrix A as long as GE
is stable for A (that is, pn is of order 1) and A is well conditioned.
If A is point diagonally dominant by columns then, since every Schur
complement enjoys the same property, we have ||Lij || < 1 for i > j, by
Problem 12.5. Hence ||L|| = 1. Furthermore, pn < 2 (Theorem 9.8 or Theo-
rem 12.6), giving ||U|| < 2||A|| by (12.20), and so
||L||||U|| < 2||A||.
Thus block LU factorization is perfectly stable for a matrix point diagonally
dominant by columns.
If A is point diagonally dominant by rows then the best we can do is to
take pn < 2 in (12.21), obtaining

||L|| ||U|| < 8nκ(A)||A||. (12.22)


Hence for point row diagonally dominant matrices, stability is guaranteed if A
is well conditioned. This in turn is guaranteed if the row diagonal dominance
amounts γj in the analogue of (12.16) for point row diagonal dominance are
sufficiently large relative to ||A||, because ||A⁻¹|| < (minj γj)⁻¹ (see Prob-
lem 8.7(a)).

12.3.2. Symmetric Positive Definite Matrices


Further useful results about the stability of block LU factorization can be
derived for symmetric positive definite matrices. First, note that the existence
of a block LU factorization is immediate for such matrices, since all their
leading principal submatrices are nonsingular. Let A be a symmetric positive
definite matrix, partitioned as

The definiteness implies certain relations among the submatrices Aij that can
be used to obtain a stronger bound for ||L|| 2 than can be deduced for a general
matrix (cf. Problem 12.4).

Lemma 12.7. If A is symmetric positive definite then ||A21A11⁻¹||2 < κ2(A)1/2.


Proof. This lemma is a corollary of Lemma 10.12, but we give a separate
proof. Let A have the Cholesky factorization

Table 12.1. Stability of block and point LU factorization. pn is the growth factor for
GE without pivoting.

    Matrix property                      Block LU     Point LU

    Symmetric positive definite          κ(A)1/2      1
    Block column diagonally dominant     1            pn
    Point column diagonally dominant     1            1
    Block row diagonally dominant                     pn
    Point row diagonally dominant        κ(A)         1
    Arbitrary                                         pn

The following lemma is proved in a way similar to the second inequality in


Problem 12.4.

Lemma 12.8. If A is symmetric positive definite then the Schur complement


S = A22 – A21A11⁻¹A12 satisfies κ2(S) < κ2(A).
Using the same reasoning as in the last subsection, we deduce from these
two lemmas that each subdiagonal block of L is bounded in 2-norm by κ2(A)1/2.
Therefore ||L||2 < 1 + mκ2(A)1/2, where there are m block stages in the algo-
rithm. Also, it can be shown that ||U||2 < Hence

(12.23)

It follows from Theorem 12.4 that when Algorithm 12.2 is applied to a sym-
metric positive definite matrix A, the backward errors for the LU factorization
and the subsequent solution of a linear system are both bounded by

(12.24)

Any resulting bound for the forward error ||x – x̂||2/||x||2 will be proportional to κ2(A)3/2, rather


than κ2(A) as for a stable method. This suggests that block LU factorization
can lose up to 50% more digits of accuracy in x than a stable method for
solving symmetric positive definite linear systems. The positive conclusion to
be drawn, however, is that block LU factorization is guaranteed to be stable
for a symmetric positive definite matrix that is well conditioned.
The stability results for block LU factorization are summarized in Ta-
ble 12.1, which tabulates a bound for ||A – LU||/(cnu||A||) for block and point

LU factorization for the matrix properties considered in this chapter. The


constant cn incorporates any constants in the bound that depend polynomi-
ally on the dimension, so a value of 1 in the table indicates unconditional
stability.

12.4. Notes and References


The distinction between a partitioned algorithm and a block algorithm is
rarely made in the literature (exceptions include the papers by Schreiber [902,
1988] and Demmel, Higham, and Schreiber [293, 1995]); the term “block al-
gorithm” is frequently used to describe both types of algorithm. A parti-
tioned algorithm might also be called a “blocked algorithm” (as is done by
Dongarra, Duff, Sorensen, and van der Vorst [315, 1991]), but the similar-
ity of this term to “block algorithm” can cause confusion and so we do not
recommend this terminology. Note that in the particular case of matrix mul-
tiplication, partitioned and block algorithms are equivalent. Our treatment of
partitioned LU factorization has focused on the stability aspects; for further
details, particularly concerning implementation on high-performance comput-
ers, see Dongarra, Duff, Sorensen, and van der Vorst [315, 1991] and Golub
and Van Loan [470, 1989].
Block LU factorization appears to have first been proposed for block tridi-
agonal matrices, which frequently arise in the discretization of partial dif-
ferential equations. References relevant to this application include Isaacson
and Keller [607, 1966, p. 59], Varah [1048, 1972], Bank and Rose [53, 1977],
Mattheij [737, 1984], [738, 1984], and Concus, Golub, and Meurant [235, 1985].
For an application of block LU factorization to linear programming, see
Eldersveld and Saunders [351, 1992].
Theorem 12.3 is from Demmel and Higham [291, 1992]. The results in
§12.3 are from Demmel, Higham, and Schreiber [293, 1995], which extends
earlier analysis of block LU factorization by Demmel and Higham [291, 1992].
Block diagonal dominance was introduced by Feingold and Varga [366,
1962], and has been used mainly in generalizations of the Gershgorin circle
theorem. Varah [1048, 1972] obtained bounds on ||L|| and ||U|| for block
diagonally dominant block tridiagonal matrices; see Problem 12.1.
Theorem 12.5 is obtained in the case of block diagonal dominance by rows
with minj γ j > 0 by Polman [837, 1987]; the proof in [837, 1987] makes use of
the corresponding result for point diagonal dominance and thus differs from
the proof we have given.
At the cost of a much more difficult proof, Lemma 12.7 can be strengthened
to the attainable bound ||A21A11⁻¹||2 < (κ2(A)1/2 – κ2(A)-1/2)/2, as shown
by Demmel [279, 1983, Thm. 4], but the weaker bound is sufficient for our
purposes.

12.4.1. LAPACK
LAPACK does not implement block LU factorization, but its LU factorization
(and related) routines for full matrices employ partitioned LU factorization
in order to exploit the level-3 BLAS and thereby to be efficient on high-
performance machines.

Problems
12.1. (Varah [1048, 1972]) Suppose A is block tridiagonal and has the block
LU factorization A = LU (so that L and U are block bidiagonal and Ui,i + 1 =
A i , i +1). Show that if A is block diagonally dominant by columns then

while if A is block diagonally dominant by rows then

What can be deduced about the stability of the factorization for these two
classes of matrices?
12.2. Show that for the 1- and ∞-norms diagonal dominance does not imply
block diagonal dominance, and vice versa.
12.3. If A ∈ ℝⁿˣⁿ is symmetric, has positive diagonal elements, and is block
diagonally dominant by rows, must it be positive definite?
12.4. Let A ∈ ℝⁿˣⁿ be partitioned

(12.25)

with A11 nonsingular. Let ||A|| := maxi,j |aij|. Show that ||A21A11⁻¹|| <


n pn κ(A). Show that the Schur complement S = A22 – A21A11⁻¹A12 satisfies
κ(S) < pn κ(A).
12.5. Let A ∈ ℝⁿˣⁿ be partitioned as in (12.25), with A11 nonsingular,
and suppose that A is point diagonally dominant by columns. Show that
||A21A11⁻¹|| < 1.
12.6. Show that under the conditions of Theorem 12.3 the computed solution
to Ax = b satisfies

and the computed solution to the multiple right-hand side system AX = B


(where (12.4) is assumed to hold for the multiple right-hand side triangular
solves) satisfies

In both cases, cn is a constant depending on n and the block size.


12.7. Let X = [A B; C D], where A is square and nonsingular. Show
that
det(X) = det(A) det(D – CA⁻¹B).
Assuming A, B, C, D are all m × m, give a condition under which det(X ) =
det( AD – CB).

Chapter 13
Matrix Inversion

It is amusing to remark that we were so involved with


matrix inversion that we probably talked of nothing else for months.
Just in this period Mrs. von Neumann acquired a big,
rather wild but gentle Irish Setter puppy,
which she called inverse in honor of our work!
— HERMAN H. GOLDSTINE, The Computer:
From Pascal to von Neumann (1972)

The most computationally intensive portion


of the tasks assigned to the processors is
integrating the KKR matrix inverse over the first Brillouin zone.
To evaluate the integral,
hundreds or possibly thousands of complex double precision matrices
of order between 80 and 300 must be formed and inverted.
Each matrix corresponds to a different vertex of the tetrahedrons
into which the Brillouin zone has been subdivided.
— M. T. HEATH, G. A. GEIST, and J. B. DRAKE, Superconductivity
in Early Experience with the Intel iPSC/860
at Oak Ridge National Laboratory (1990)

Press to invert the matrix.


Note that matrix inversion can produce erroneous results
if you are using ill-conditioned matrices.
— HEWLETT-PACKARD, HP 48G Series User’s Guide (1993)

Almost anything you can do with A–1 can be done without it.
— GEORGE E. FORSYTHE and CLEVE B. MOLER,
Computer Solution of Linear Algebraic Systems (1967)


13.1. Use and Abuse of the Matrix Inverse


To most numerical analysts, matrix inversion is a sin. Forsythe, Malcolm,
and Moler put it well when they say [395, 1977, p. 31] “In the vast major-
ity of practical computational problems, it is unnecessary and inadvisable to
actually compute A– 1 .” The best example of a problem in which the ma-
trix inverse should not be computed is the linear equations problem Ax = b.
Computing the solution as x = A⁻¹ × b requires 2n³ flops, assuming A⁻¹
is computed by Gaussian elimination with partial pivoting (GEPP), whereas
GEPP applied directly to the system costs only 2n³/3 flops.
Not only is the inversion approach three times more expensive, but it is
much less stable. Suppose X = A–1 is formed exactly, and that the only
rounding errors are in forming x = fl(Xb). Then x̂ = (X + ∆X)b, where
|∆X| < γn|X|, by (3.10). So Ax̂ = A(X + ∆X)b = (I + A∆X)b, and the best
possible residual bound is

For GEPP, Theorem 9.4 yields

Since it is usually true that for GEPP, we see that the


matrix inversion approach is likely to give a much larger residual than GEPP
if A is ill conditioned and if For example, we solved 50
25×25 systems Ax = b in MATLAB, where the elements of x are taken from the
normal N(0, 1) distribution and A is random with κ2(A) = u⁻¹/² ≈ 9 × 10⁷.
As Table 13.1 shows, the inversion approach provided much larger backward
errors than GEPP in this experiment.
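The experiment is easy to repeat in outline. The sketch below is not the exact
setup behind Table 13.1 (the random matrix generator and the single system are
illustrative), but it shows the quantities being compared: the ∞-norm backward
error η(y) = ||b – Ay||/(||A|| ||y|| + ||b||) for the two solution methods.

    % Compare backward errors of x = inv(A)*b and GEPP on one system.
    n = 25;
    A = gallery('randsvd', n, sqrt(2/eps));   % kappa_2(A) ~ 9e7 (illustrative)
    x = randn(n,1);  b = A*x;
    y1 = inv(A)*b;                            % inversion approach
    y2 = A\b;                                 % GEPP
    eta = @(y) norm(b - A*y,inf)/(norm(A,inf)*norm(y,inf) + norm(b,inf));
    [eta(y1), eta(y2)]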
Given the inexpedience of matrix inversion, why devote a chapter to it?
The answer is twofold. First, there are situations in which a matrix inverse
must be computed. Examples are in statistics [54, 1974, §7.5], [721, 1984,
§2.3], [744, 1989, p. 342 ff], where the inverse can convey important statistical
information, in certain matrix iterations arising in eigenvalue-related problems
[37, 1993], [174, 1987], [566, 1994], and in numerical integrations arising in

Table 13.1. Backward errors ηA,b for the ∞-norm.

                      min         max
    x = A⁻¹ × b       6.66e-12    1.69e-10
    GEPP              3.44e-18    7.56e-17

superconductivity computations [509, 1990] (see the quotation at the start of


the chapter). Second, methods for matrix inversion display a wide variety
of stability properties, making for instructive and challenging error analysis.
(Indeed, the first major rounding error analysis to be published, that of von
Neumann and Goldstine, was for matrix inversion; see §9.6).
Matrix inversion can be done in many different ways—in fact, there are
more computationally distinct possibilities than for any other basic matrix
computation. For example, in triangular matrix inversion different loop order-
ings are possible and either triangular matrix–vector multiplication, solution
of a triangular system, or a rank-1 update of a rectangular matrix can be em-
ployed inside the outer loop. More generally, given a factorization PA = LU,
two ways to evaluate A⁻¹ are: as A⁻¹ = U⁻¹ × L⁻¹ × P, and as the solution
to UA–1 = L –1 × P. These methods generally achieve different levels of ef-
ficiency on high-performance computers, and they propagate rounding errors
in different ways. We concentrate in this chapter on the numerical stability,
but comment briefly on performance issues.
The quality of an approximation Y ≈ A⁻¹ can be assessed by looking
at the right and left residuals, AY – I and YA – I, and the forward error,
Y – A⁻¹. Suppose we perturb A → A + ∆A with |∆A| < ε|A|; thus, we
are making relative perturbations of size at most ε to the elements of A. If
Y = (A + ∆A)⁻¹ then (A + ∆A)Y = Y(A + ∆A) = I, so that

(13.1)
(13.2)

and, since (A + ∆A)⁻¹ = A⁻¹ – A⁻¹∆AA⁻¹ +

(13.3)

(Note that (13.3) can also be derived from (13.1) or (13.2 ).) The bounds
(13.1)-(13.3) represent “ideal” bounds for a computed approximation Y to
A⁻¹, if we regard ε as a small multiple of the unit roundoff u. We will show
that, for triangular matrix inversion, appropriate methods do indeed achieve
(13.1) or (13.2) (but not both) and (13.3).
It is important to note that neither (13.1), (13.2), nor (13.3) implies that
Y + ∆Y = (A + ∆A)-1 with and that
is, Y need not be close to the inverse of a matrix near to A, even in the norm
sense. Indeed, such a result would imply that both the left and right residuals
are bounded in norm by and this is not the case for any
of the methods we will consider.
To illustrate the latter point we give a numerical example. Define the
matrix An as triu(qr(vand(n))), in MATLAB notation (vand is a
routine from the Test Matrix Toolbox—see Appendix E); in other words, An
is the upper triangular QR factor of the n × n Vandermonde matrix based on

Figure 13.1. Residuals for inverses computed by MATLAB’S INV function.

equispaced points on [0, 1]. We inverted An , for n = 1:80, using MATLAB’S


INV function, which uses GEPP. The left and right normwise relative residuals

are plotted in Figure 13.1. We see that while the left residual is always less
than the unit roundoff, the right residual becomes large as n increases. These
matrices are very ill conditioned (singular to working precision for n > 20),
yet it is still reasonable to expect a small residual, and we will prove in §13.3.2
that the left residual must be small, independent of the condition number.
In most of this chapter we are not concerned with the precise values of
constants (§13.4 is the exception); thus cn denotes a constant of order n. To
simplify the presentation we introduce a special notation. Let Ai,
i = 1:k, be matrices such that the product A1A2 . . . Ak is defined and let

Then ∆(A1, A2, . . . , Ak) denotes a matrix bounded according to



This notation is chosen so that if = fl(A 1 A 2 . . .A k), with the product


evaluated in any order, then

13.2. Inverting a Triangular Matrix


We consider the inversion of a lower triangular matrix L ∈ ℝⁿˣⁿ, treating
unblocked and blocked methods separately. We do not make a distinction
between partitioned and block methods in this section. All the results in this
and the next section are from Du Croz and Higham [322, 1992].

13.2.1. Unblocked Methods


We focus our attention on two “j” methods that compute L–1 a column at a
time. Analogous “i” and “k” methods exist, which compute L–1 row-wise or
use outer products, respectively, and we comment on them at the end of the
section.
The first method computes each column of X = L-1 independently, using
forward substitution. We write it as follows, to facilitate comparison with the
second method.

Method 1.
for j = 1:n
    xjj = 1/ljj
    X(j + 1:n, j) = –xjjL(j + 1:n, j)
    Solve L(j + 1:n, j + 1:n)X(j + 1:n, j) = X(j + 1:n, j),
      by forward substitution.
end

In BLAS terminology, this method is dominated by n calls to the level-2


BLAS routine xTRSV (Triangular SolVe).
The second method computes the columns in the reverse order. On the
jth step it multiplies by the previously computed inverse L(j + 1:n, j + 1:n)⁻¹
instead of solving a system with coefficient matrix L(j + 1:n, j + 1:n).

Method 2.
for j = n:–1:1
    xjj = 1/ljj
    X(j + 1:n, j) = X(j + 1:n, j + 1:n)L(j + 1:n, j)
    X(j + 1:n, j) = –xjjX(j + 1:n, j)
end

Method 2 uses n calls to the level-2 BLAS routine xTRMV (Triangular


Matrix times Vector). On most high-performance machines xTRMV can be
implemented to run faster than xTRSV , so Method 2 is generally preferable to
Method 1 from the point of view of efficiency (see the performance figures at
the end of §13.2.2). We now compare the stability of the two methods.
Theorem 8.5 shows that the jth column of the computed X̂ from Method 1
satisfies

It follows that we have the componentwise residual bound

(13.4)
and the componentwise forward error bound

(13.5)

Since X̂ = L⁻¹ + O(u), (13.5) can be written as

(13.6)
which is invariant under row and column scaling of L. If we take norms we
obtain normwise relative error bounds that are either row or column scaling
independent: from (13.6) we have

(13.7)

and the same bound holds with cond(L –1) replaced by cond(L).
Notice that (13.4) is a bound for the right residual, LX – I. This is because
Method 1 is derived by solving LX = I. Conversely, Method 2 can be derived
by solving XL = I, which suggests that we should look for a bound on the
left residual for this method.

Lemma 13.1. The computed inverse X̂ from Method 2 satisfies

(13.8)
Proof. The proof is by induction on n, the case n = 1 being trivial.
Assume the result is true for n – 1 and write

where and Method 2 computes


the first column of X by solving XL = I according to
β = α⁻¹, z = −βNy.

In floating point arithmetic we obtain

Thus

This may be written as

By assumption, the corresponding inequality holds for the (2:n , 2:n ) subma-
trices and so the result is proved.
Lemma 13.1 shows that Method 2 has a left residual analogue of the right
residual bound (13.4) for Method 1. Since there is, in general, no reason to
choose between a small right residual and a small left residual, our conclusion
is that Methods 1 and 2 have equally good numerical stability properties.
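This behaviour is easy to observe. The following MATLAB sketch runs both
methods on an illustrative random lower triangular matrix (typically very ill
conditioned at this size) and reports the right and left residuals of each,
normalized by ||L|| ||X||; the test matrix and norms are our choices.

    % Methods 1 and 2 for X = inv(L), with right and left residuals.
    n = 50;  L = tril(randn(n));  I = eye(n);
    X1 = zeros(n);                       % Method 1
    for j = 1:n
        X1(j,j) = 1/L(j,j);
        X1(j+1:n,j) = -X1(j,j)*L(j+1:n,j);
        X1(j+1:n,j) = L(j+1:n,j+1:n)\X1(j+1:n,j);
    end
    X2 = zeros(n);                       % Method 2
    for j = n:-1:1
        X2(j,j) = 1/L(j,j);
        X2(j+1:n,j) = X2(j+1:n,j+1:n)*L(j+1:n,j);
        X2(j+1:n,j) = -X2(j,j)*X2(j+1:n,j);
    end
    res = @(R,X) norm(R,1)/(norm(L,1)*norm(X,1));
    disp([res(L*X1-I,X1) res(X1*L-I,X1)])   % Method 1: right, left residual
    disp([res(L*X2-I,X2) res(X2*L-I,X2)])   % Method 2: right, left residual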
More generally, it can be shown that all three i, j, and k inversion variants
that can be derived from the equations LX = I produce identical rounding
errors under suitable implementations, and all satisfy the same right residual
bound; likewise, the three variants corresponding to the equation XL = I
all satisfy the same left residual bound. The LINPACK routine xTRDI uses
a k variant derived from XL = I; the LINPACK routines xGEDI and xPODI
contain analogous code for inverting an upper triangular matrix (but the LIN-
PACK Users’ Guide [307, 1979, Chaps. 1 and 3] describes a different variant
from the one used in the code).

13.2.2. Block Methods


Let the lower triangular matrix L ∈ ℝⁿˣⁿ be partitioned in block form as

(13.9)

where we place no restrictions on the block sizes, other than to require the
diagonal blocks to be square. The most natural block generalizations of Meth-
ods 1 and 2 are as follows. Here, we use the notation Lp:q,r:s to denote the

submatrix comprising the intersection of block rows p to q and block columns


r to s of L.

Method 1B.
for j = 1:N
    Xjj = Ljj⁻¹ (by Method 1)
    Xj+1:N,j = –Lj+1:N,jXjj
    Solve Lj+1:N,j+1:NXj+1:N,j = Xj+1:N,j,
      by forward substitution
end

Method 2B.
for j = N:–1:1
    Xjj = Ljj⁻¹ (by Method 2)
    Xj+1:N,j = Xj+1:N,j+1:NLj+1:N,j
    Xj+1:N,j = –Xj+1:N,jXjj
end

One can argue that Method 1B carries out the same arithmetic operations
as Method 1, although possibly in a different order, and that it therefore
satisfies the same error bound (13.4). For completeness, we give a direct
proof.

Lemma 13.2. The computed inverse X̂ from Method 1B satisfies


(13.10)

Proof. Equating block columns in (13.10), we obtain the N independent


inequalities

(13.11)

It suffices to verify the inequality with j = 1. Write

where L 11, X1 1 and L11 is the (1, 1) block in the partitioning of (13.9).
X 11 is computed by Method 1 and so, from (13.4),

(13.12)

X21 is computed by forming T = –L21X11 and solving L22X21 = T. The


computed X21 satisfies

Hence

(13.13)
Together, inequalities (13.12) and (13.13) are equivalent to (13.11) with j = 1,
as required.
We can attempt a similar analysis for Method 2B. With the same notation
as above, X21 is computed as X21 = –X22L21X11. Thus
(13.14)
To bound the left residual we have to postmultiply by L 11 and use the fact
that X11 is computed by Method 2:

This leads to a bound of the form

which would be of the desired form in (13.8) were it not for the factor
This analysis suggests that the left residual is not guaranteed
to be small. Numerical experiments confirm that the left and right residuals
can be large simultaneously for Method 2B, although examples are quite hard
to find [322, 1992]; therefore the method must be regarded as unstable when
the block size exceeds 1.
The reason for the instability is that there are two plausible block gen-
eralizations of Method 2 and we have chosen an unstable one that does not
carry out the same arithmetic operations as Method 2. If we perform a solve
with Ljj instead of multiplying by Xjj we obtain the second variation, which
is used by LAPACK’s xTRTRI :
Method 2C.
for j = N:-1:1
    Xjj = Ljj⁻¹ (by Method 2)
    Xj+1:N,j = Xj+1:N,j+1:NLj+1:N,j
    Solve Xj+1:N,jLjj = –Xj+1:N,j by back substitution.
end
For this method, the analogue of (13.14) is

which yields

Hence Method 2C enjoys a very satisfactory residual bound.



Table 13.2. Mflop rates for inverting a triangular matrix on a Cray 2.

                                 n = 128   n = 256   n = 512   n = 1024

    Unblocked:       Method 1       95       162       231        283
                     Method 2      114       211       289        330
                     k variant     114       157       178        191
    Blocked:         Method 1B     125       246       348        405
    (block size 64)  Method 2C     129       269       378        428
                     k variant     148       263       344        383

Lemma 13.3. The computed inverse X̂ from Method 2C satisfies

In summary, block versions of Methods 1 and 2 are available that have


the same residual bounds as the point methods. However, in general, there is
no guarantee that stability properties remain unchanged when we convert a
point method to block form, as shown by Method 2B.
In Table 13.2 we present some performance figures for inversion of a lower
triangular matrix on a Cray 2. These clearly illustrate the possible gains in
efficiency from using block methods, and also the advantage of Method 2 over
Method 1. For comparison, the performance of a k variant is also shown
(both k variants run at the same rate). The performance characteristics of
the i variants are similar to those of the j variants, except that since they are
row oriented rather than column oriented, they are liable to be slowed down
by memory-bank conflicts, page thrashing, or cache missing.

13.3. Inverting a Full Matrix by LU Factorization


Next, we consider four methods for inverting a full matrix given an
LU factorization computed by GEPP. We assume, without loss of generality,
that there are no row interchanges. We write the computed LU factors as L
and U. Recall that A + ΔA = LU, with |ΔA| ≤ c_n u|L||U| (Theorem 9.3).

13.3.1. Method A
Perhaps the most frequently described method for computing X = A^{-1} is the
following one.

Method A.
for j = 1:n

Solve Ax_j = e_j
end

Compared with the methods to be described below, Method A has the


disadvantages of requiring more temporary storage and of not having a conve-
nient partitioned version. However, it is simple to analyse. From Theorem 9.4
we have
(13.15)
and so
(13.16)
This bound departs from the form (13. 1) only in that |A| is replaced by its
upper bound |L||U| + O(u). The forward error bound corresponding to (13.16)
is
(13.17)
Note that (13.15) says that x_j is the jth column of the inverse of a matrix
close to A, but it is a different perturbation ΔA_j for each column. It is
not true that X itself is the inverse of a matrix close to A, unless A is well
conditioned.
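
As an illustration, here is a minimal MATLAB sketch of Method A, assuming that GEPP factors are obtained from MATLAB's lu (the function name is ours):

function X = inv_by_columns(A)
%INV_BY_COLUMNS  Method A: compute the inverse one column at a time
%                by solving A*x_j = e_j with the GEPP factorization.
n = size(A, 1);
[L, U, P] = lu(A);                     % P*A = L*U
X = zeros(n);
I = eye(n);
for j = 1:n
    X(:, j) = U \ (L \ (P*I(:, j)));   % two triangular solves per column
end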

13.3.2. Method B
Next, we consider the method used by LINPACK’S xGEDI , LAPACK’S xGETRI,
and MATLAB’S INV function.

Method B.
Compute U^{-1} and then solve for X the equation XL = U^{-1}.

To analyse this method we will assume that U^{-1} is computed by an


analogue of Method 2 or 2C for upper triangular matrices that obtains the
columns of U^{-1} in the order 1 to n. Then the computed inverse X_U ≈ U^{-1}
will satisfy the residual bound

We also assume that the triangular solve from the right with L is done by
back substitution. The computed X therefore satisfies XL = X_U + Δ(X, L)
and so

This leads to the residual bound

(13.18)

which is the left residual analogue of (13.16). From (13.18) we obtain the
forward error bound

Note that Methods A and B are equivalent, in the sense that Method A
solves for X the equation LUX = I while Method B solves XLU = I. Thus
the two methods carry out analogous operations but in different orders. It fol-
lows that the methods must satisfy analogous residual bounds, and so (13.18)
can be deduced from (13.16).
We mention in passing that the LINPACK manual states that for Method B
a bound holds of the form ||AX - I|| ≤ d_n u||A|| ||X|| [307, 1979, p. 1.20]. This
is incorrect, although counterexamples are rare; it is the left residual that is
bounded this way, as follows from (13.18).
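
For illustration, a MATLAB sketch of Method B starting from GEPP factors (the function name is ours; inv is applied only to the triangular factor U, as a stand-in for a column-oriented triangular inversion, and the permutation is folded in at the end because the analysis above assumes no row interchanges):

function X = inv_by_method_b(A)
%INV_BY_METHOD_B  Method B: compute inv(U), then solve X*L = inv(U).
[L, U, P] = lu(A);       % P*A = L*U, so inv(A) = inv(U)*inv(L)*P
XU = inv(U);             % triangular inversion
X  = (XU / L) * P;       % back substitution from the right, then apply P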

13.3.3. Method C
The next method that we consider is from Du Croz and Higham [322, 1992].
It solves the equation UXL = I, computing X a partial row and column at a
time. To derive the method partition

where the (1, 1) blocks are scalars, and assume that the trailing submatrix
X 22 is already known. Then the rest of X is computed according to

The method can also be derived by forming the product X = U^{-1} × L^{-1}


using the representation of L and U as a product of elementary matrices (and
diagonal matrices in the case of U). In detail the method is as follows.

Method C.
for k = n:–1:1

end

The method can be implemented so that X overwrites L and U, with the


aid of a work vector of length n (or a work array to hold a block row or column

in the partitioned case). Because most of the work is performed by matrix–


vector (or matrix-matrix) multiplication, Method C is likely to be the fastest
of those considered in this section on many machines. (Some performance
figures are given at the end of the section.)
A straightforward error analysis of Method C shows that the computed
satisfies
(13.19)
We will refer to this as a "mixed residual". From (13.19) we can obtain
bounds on the left and right residuals that are weaker than those in (13.18) and
(13.16) by a factor |U^{-1}||U| on the left or |L||L^{-1}| on the right, respectively.
We also obtain from (13.19) the forward error bound

which is (13.17) with |A^{-1}| replaced by its upper bound |U^{-1}||L^{-1}| + O(u)
and the factors reordered.
The LINPACK routine xSIDI uses a special case of Method C in con-
junction with the diagonal pivoting method to invert a symmetric indefinite
matrix; see Du Croz and Higham [322, 1992] for details.

13.3.4. Method D
The next method is based on another natural way to form A -1 and is used
by LAPACK’S xPOTRI , which inverts a symmetric positive definite matrix.
Method D.
Compute L^{-1} and U^{-1} and then form A^{-1} = U^{-1} × L^{-1}.
The advantage of this method is that no extra workspace is needed; U^{-1}
and L^{-1} can overwrite U and L, and can then be overwritten by their product.
However, Method D is significantly slower on some machines than Methods
B or C, because it uses a smaller average vector length for vector operations.
To analyse Method D we will assume initially that L^{-1} is computed by
Method 2 (or Method 2C) and, as for Method B above, that U^{-1} is computed
by an analogue of Method 2 or 2C for upper triangular matrices. We have
(13.20)

Since A = LU – ∆A,
(13.21)
Rewriting the first term of the right-hand side using X_L L = I + Δ(X_L, L),
and similarly for U, we obtain
(13.22)

and so

(13.23)

This bound is weaker than (13.18), since + O(u). Note,


however, that the term Δ(X_U, X_L)A in the residual (13.22) is an unavoidable
consequence of forming X_U X_L, and so the bound (13.23) is essentially the
best possible.
The analysis above assumes that X_L and X_U both have small left residuals.
If they both have small right residuals, as when they are computed using
Method 1, then it is easy to see that a bound analogous to (13.23) holds for
the right residual AX - I. If X_L has a small left residual and X_U has a small
right residual (or vice versa) then it does not seem possible to derive a bound
of the form (13.23). However, we have

|X_L L - I| = |L^{-1}(L X_L - I)L| ≤ |L^{-1}| |L X_L - I| |L|,    (13.24)

and since L is unit lower triangular with |l_ij| ≤ 1, we have |(L^{-1})_ij| ≤ 2^{n-1},
which places a bound on how much the left and right residuals of X_L can differ.
Furthermore, since the matrices L from GEPP tend to be well conditioned
and since our numerical experience is that large residuals
tend to occur only for ill-conditioned matrices, we would expect the left and
right residuals of X_L almost always to be of similar size. We conclude that
even in the "conflicting residuals" case, Method D will, in practice, usually
satisfy (13.23) or its right residual counterpart, according to whether X_U has a
small left or right residual respectively. Similar comments apply to Method B
when U^{-1} is computed by a method yielding a small right residual.
These considerations are particularly pertinent when we consider Method
D specialized to symmetric positive definite matrices and the Cholesky fac-
torization A = R^TR. Now A^{-1} is obtained by computing X_R = R^{-1} and
then forming A^{-1} = X_R X_R^T; this is the method used in the LINPACK rou-
tine xPODI [307, 1979, Chap. 3]. If X_R has a small right residual then X_R^T
has a small left residual, so in this application we naturally encounter con-
flicting residuals. Fortunately, the symmetry and definiteness of the problem
help us to obtain a satisfactory residual bound. The analysis parallels the
derivation of (13.23), so it suffices to show how to treat the term
(cf. (13.21)), where R now denotes the computed Cholesky factor. Assuming
R X_R = I + Δ(R, X_R), and using (13.24) with L replaced by R, we have

Table 13.3. Mflop rates for inverting a full matrix on a Cray 2.

n = 64 n = 128 n = 256 n = 512


Unblocked: Method B 118 229 310 347
Method C 125 235 314 351
Method D 70 166 267 329
Blocked: Method B 142 259 353 406
(block size 64) Method C 144 264 363 415
Method D 70 178 306 390

and

From the inequality together with ||A||2 =


+ O(u), it follows that

and thus overall we have a bound of the form

Since X and A are symmetric, the same bound holds for the right residual.

13.3.5. Summary
In terms of the error bounds, there is little to choose between Methods A, B,
C, and D. Numerical results reported in [322, 1992] show good agreement with
the bounds. Therefore the choice of method can be based on other criteria,
such as performance and the use of working storage. Table 13.3 gives some
performance figures for a Cray 2, covering both blocked (partitioned) and
unblocked forms of Methods B, C, and D.
On a historical note, Tables 13.4 and 13.5 give timings for matrix inversion
on some early computing devices; times for two modern machines are given
for comparison. The inversion methods used for the timings on the early
computers in Table 13.4 are not known, but are probably methods from this
section or the next.

13.4. Gauss–Jordan Elimination


Whereas Gaussian elimination (GE) reduces a matrix to triangular form by
elementary operations, Gauss–Jordan elimination (GJE) reduces it all the way

Table 13.4. Times (minutes and seconds) for inverting an n × n matrix. Source for
DEUCE, Pegasus, and Mark 1 timings: [181, 1981].

DEUCE Pegasus Manchester HP 48G Sun SPARC-


(English Electric) (Ferranti) Mark 1 Calculator station ELC
n 1955 1956 1951 1993 1991
8 20s 26s — 4s .004s
16 58s 2m 37s — 18s .01s
24 3m 36s 7m 57s 8m 48s .02s
32 7m 44s 17m 52s 16m — .04s

Table 13.5. Additional timings for inverting an n × n matrix.

Machine Year n Time Reference


Aiken Relay Calculator                  1948    38      59½ hours   [764, 1948]
IBM 602 Calculating Punch               1949    10      8 hours     [1053, 1949]
SEAC (National Bureau of Standards)     1954    49      3 hours     [1004, 1954]
Datatron                                1957    80^b    2½ hours    [753, 1957]
IBM 704                                 1957    115^a   19m 30s     [320, 1957]

^a Block tridiagonal matrix, using an inversion method designed for such matrices.
^b Symmetric positive definite matrix.

to diagonal form. GJE is usually presented as a method for matrix inversion,


but it can also be regarded as a method for solving linear equations. We
will take the method in its latter form, since it simplifies the error analysis.
Error bounds for matrix inversion are obtained by taking unit vectors for the
right-hand sides.
At the kth stage of GJE, all off-diagonal elements in the kth column are
eliminated, instead of just those below the diagonal, as in GE. Since the ele-
ments in the lower triangle (including the pivots on the diagonal) are identical
to those that occur in GE, Theorem 9.1 tells us that GJE succeeds if all the
leading principal submatrices of A are nonsingular. With no form of pivoting
GJE is unstable in general, for the same reasons that GE is unstable. Partial
and complete pivoting are easily incorporated.

Algorithm 13.4 (Gauss–Jordan elimination). This algorithm solves the lin-


ear system Ax = b, where A ∈ ℝ^{n×n} is nonsingular, by GJE with partial
pivoting.

for k = 1:n
    Find r such that |a_rk| = max_{i≥k} |a_ik|.
    A(k, k:n) ↔ A(r, k:n), b(k) ↔ b(r)      % Swap rows k and r.
    row_ind = [1:k-1, k+1:n]                % Row indices of elements to eliminate.
    m = A(row_ind, k)/A(k, k)               % Multipliers.
    A(row_ind, k:n) = A(row_ind, k:n) - m*A(k, k:n)
    b(row_ind) = b(row_ind) - m*b(k)
end
x_i = b_i/a_ii, i = 1:n

Cost: n^3 flops.
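
Algorithm 13.4 translates almost directly into runnable MATLAB; the following sketch (function name ours) assumes b is a column vector.

function x = gje(A, b)
%GJE  Solve A*x = b by Gauss-Jordan elimination with partial pivoting.
n = length(b);
for k = 1:n
    [~, r] = max(abs(A(k:n, k)));   r = r + k - 1;        % pivot row
    A([k r], k:n) = A([r k], k:n);  b([k r]) = b([r k]);  % swap rows k and r
    row_ind = [1:k-1, k+1:n];             % rows in which to eliminate
    m = A(row_ind, k) / A(k, k);          % multipliers
    A(row_ind, k:n) = A(row_ind, k:n) - m*A(k, k:n);
    b(row_ind) = b(row_ind) - m*b(k);
end
x = b ./ diag(A);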
The numerical stability properties of GJE are rather subtle and error anal-
ysis is trickier than for GE. An observation that simplifies the analysis is that
we can consider the algorithm as comprising two stages. The first stage is
identical to GE and forms M_{n-1} M_{n-2} ... M_1 A = U, where U is upper trian-
gular. The second stage reduces U to diagonal form by elementary operations:

    N_n N_{n-1} ... N_2 U = D,    D diagonal.

The solution x is formed as x = D^{-1} N_n ... N_2 y, where y = M_{n-1} ... M_1 b.


The rounding errors in the first stage are precisely the same as those in GE,
so it suffices to consider the second stage. We will assume, for simplicity, that
there are no row interchanges. As with GE, this is equivalent to assuming
that all row interchanges are done at the start of the algorithm.
Define U_{k+1} = N_k ... N_2 U (so U_2 = U) and note that N_k and U_k have the
forms

where n_k = [-u_{1k}/u_{kk}, ..., -u_{k-1,k}/u_{kk}]^T. The computed matrices obviously


satisfy

(13.25a)

(13.25b)

(to be precise, is defined as Nk but with Similarly,


with x_{k+1} = N_k ... N_2 y, we have

(13.26)

Because of the structure of Δ_k and f_k, we have the useful property that

Without loss of generality, we now assume that the final diagonal matrix D is
the identity (i.e., the pivots are all 1); this simplifies the analysis a little and
has a negligible effect on the final bounds. Thus (13.25) and (13.26) yield

(13.27)

(13.28)

Now

and, similarly,

But defining ñk we have

where X = U^{-1} + O(u) (by (13.27)). Hence

Combining (13.27) and (13.28) we have, for the solution of Ux = y ,

which gives the componentwise forward error bound


(13.29)

Table 13.6. Gauss-Jordan elimination for Ux = b.

n ηU , b
16 2.0e-14 5.8e-11
32 6.4e-10 7.6e-6
64 1.7e-2 6.6e4

This is an excellent forward error bound: it says that the error in the computed x is


no larger than the error we would expect for an approximate solution with
a tiny componentwise relative backward error. In other words, the forward
error is bounded in the same way as for substitution. However, we have not,
and indeed cannot, show that the method is backward stable. The best we
can do is to derive from (13.27) and (13.28) the result
(13.30a)
(13.30b)
(13.30c)

using = U + O(u). These bounds show that has a normwise


backward error bounded approximately by Hence the back-
ward error can be guaranteed to be small only if U is well conditioned. This
agrees with the comments of Peters and Wilkinson [828, 1975] that “the resid-
uals corresponding to the Gauss-Jordan solution are often larger than those
corresponding to back-substitution by a factor of order κ.”
A numerical example helps to illustrate the results. We take U to be
the upper triangular matrix with 1s on the diagonal and – 1s everywhere
above the diagonal (U = U(1) from (8.2)). This matrix has condition number
Table 13.6 reports the normwise relative backward error
(see (7.2)) for b = U x, where
x = e/3. Clearly, GJE is backward unstable for these matrices—the backward
errors show a dependence on However, the relative distance between
and the computed solution from substitution (not shown in the table) is less
than which shows that the forward error is bounded by
confirming (13.29).
By bringing in the error analysis for the reduction to upper triangular
form, we obtain an overall result for GJE.

Theorem 13.5. Suppose GJE successfully computes an approximate solution


to Ax = b, where A is nonsingular. Then
(13.31)
(13.32)

where LU is the factorization computed by GE.

Proof. For the first stage (which is just GE), we have A + ∆A1 =
by Theorems 9.3 and
8.5.
Using (13.30), we obtain

or Ax = b - r, where

(13.33)

The bounds (13.31) and (13.32) follow easily on using (13.30).

Theorem 13.5 shows that the stability of GJE depends not only on the size
of |L||U| (as in GE), but also on the condition of U. The term is
an upper bound for and if this bound is sharp then the residual bound
is very similar to that for LU factorization. Note that for partial pivoting we
have
The bounds in Theorem 13.5 have the pleasing property that they are
invariant under row or column scaling of A, though of course if we are using
partial pivoting then row scaling can change the pivots and alter the bound.
As mentioned earlier, to obtain bounds for matrix inversion we simply
take b to be each of the unit vectors in turn. For example, the residual bound
becomes

For the special case of symmetric positive definite matrices, an informative


normwise result follows from Theorem 13.5. We make the natural assumption
that symmetry is exploited in the elimination.

Corollary 13.6. Suppose GJE successfully computes an approximate solu-


tion to Ax = b, where A is symmetric positive definite. Then

where is the factorization computed by symmetric GE.

Proof. By Theorem 9.3 we have A + ∆A = where ∆A is symmetr-


ic and satisfies |∆ A| < Defining D = we have, by

symmetry, A + ΔA = R^TR. Hence

Furthermore, it is straightforward to show that


n (1- γ n ) - 1 ||A|| 2 . The required bounds follow.
Corollary 13.6 shows that GJE is forward stable for symmetric positive
definite matrices, but it bounds the backward error only by a multiple of
κ_2(A)^{1/2}. Numerical experiments show that the backward error is usually
much less than κ_2(A)^{1/2}u, but (very ill-conditioned) matrices can certainly be
found for which the backward error is many orders of magnitude larger than
u. Hence GJE is not backward stable even for symmetric positive definite
matrices.

13.5. The Determinant


It may be too optimistic to hope that determinants will
fade out of the mathematical picture in a generation;
their notation alone is a thing of beauty
to those who can appreciate that sort of beauty.
— E. T. BELL, Review of “Contributions to the History of Determinants,
1900–1920”, by Sir Thomas Muir (1931)

Like the matrix inverse, the determinant is a quantity that rarely needs to
be computed. The common fallacy that the determinant is a measure of ill
conditioning is displayed by the observation that if Q ∈ ℝ^{n×n} is orthogonal
then det(αQ) = α^n det(Q) = ±α^n, which can be made arbitrarily small or
large despite the fact that αQ is perfectly conditioned. Of course, we could
normalize the matrix before taking its determinant and define, for example,

where D^{-1}A has rows of unit 2-norm. This function is called the Hadamard
condition number by Birkhoff [99, 1975], because Hadamard’s determinantal
inequality (see Problem 13.11) implies that it is at least 1, with equality if and
only if A has mutually orthogonal rows. Unless A is already row equilibrated
(see Problem 13.13), this quantity does not relate to the conditioning of linear systems
in any straightforward way.

As good a way as any to compute the determinant of a general matrix


is via GEPP. If PA = LU then

det(A) = det(P)^{-1} det(U) = (-1)^r u_{11} ... u_{nn},        (13.34)

where r is the number of row interchanges during the elimination. If we


use (13.34) then, barring underflow and overflow, the computed determinant
= fl(det(A)) satisfies

where |θ_n| ≤ γ_n, so we have a tiny relative perturbation of the exact determi-


nant of a perturbed matrix A + ∆A, where ∆A is bounded in Theorem 9.3.
In other words, we have almost the right determinant for a slightly perturbed
matrix (assuming the growth factor is small).
However, underflow and overflow in computing the determinant are quite
likely, and the determinant itself may not be a representable number. One pos-
sibility is to compute log|det(A)| = Σ_i log|u_ii|; as pointed out by Forsythe
and Moler [396, 1967, p. 55], however, the computed sum may be inaccurate
due to cancellation. Another approach, used in LINPACK, is to compute and
represent det(A) in the form y × 10^e, where 1 ≤ |y| < 10 and e is an integer
exponent.
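
For example, given the GEPP factorization in MATLAB, the sign and the logarithm of |det(A)| can be accumulated separately, which sidesteps the overflow and underflow just described (a sketch only; the y × 10^e representation can be recovered from these two quantities):

[L, U, P] = lu(A);                 % P*A = L*U
u = diag(U);
s = det(P) * prod(sign(u));        % sign of det(A); det(P) is +1 or -1
logabsdet = sum(log(abs(u)));      % log|det(A)|, immune to overflow/underflow
% If it is representable, det(A) = s*exp(logabsdet).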

13.5.1. Hyman’s Method


In §1.16 we analysed the evaluation of the determinant of a Hessenberg matrix
H by GE. Another method for computing det(H) is Hyman’s method [594,
1957 ], which is superficially similar to GE but algorithmically different. Hy-
man’s method has an easy derivation in terms of LU factorization that leads to
a very short error analysis. Let H ∈ ℝ^{n×n} be an unreduced upper Hessenberg
matrix (h_{i+1,i} ≠ 0 for all i) and write

The matrix H_1 is H with the first row cyclically permuted to the bottom, so
det(H_1) = (-1)^{n-1} det(H). Since T is a nonsingular upper triangular matrix,
we have the LU factorization

(13.35)

from which we obtain det(H_1) = det(T)(η - h^T T^{-1} y). Therefore

det(H) = (-1)^{n-1} det(T)(η - h^T T^{-1} y).        (13.36)



Hyman's method consists of evaluating (13.36) in the natural way: by solving


the triangular system Tx = y by back substitution, then forming η - h^Tx and
its product with det(T) = h_21 h_32 ... h_{n,n-1}.
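
In MATLAB the method is only a few lines (a sketch, with the partitioning written out explicitly; the function name is ours and H is assumed unreduced):

function d = hyman_det(H)
%HYMAN_DET  Determinant of an unreduced upper Hessenberg matrix
%           by Hyman's method.
n   = size(H, 1);
T   = triu(H(2:n, 1:n-1));   % upper triangular; its diagonal is the subdiagonal of H
y   = H(2:n, n);
h   = H(1, 1:n-1)';          % first row of H, partitioned as [h' eta]
eta = H(1, n);
x   = T \ y;                 % back substitution
d   = (-1)^(n-1) * prod(diag(T)) * (eta - h'*x);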
For the error analysis we assume that no underflows or overflows occur.
From the backward error analysis for substitution (Theorem 8.5) and for an
inner product ((3.4)) we have immediately, on defining µ := η - h^T T^{-1} y,

where

Since fl(det(T)) = det(T)(1 + δ_1) ... (1 + δ_{n-1}), |δ_i| ≤ u, the computed de-
terminant satisfies

where |θ_n| ≤ γ_n. This is not a backward error result, because only one of
the two T's in this expression is perturbed. However, we can write det(T) =
det(T + ΔT)(1 + θ_{(n-1)^2}), so that

We conclude that the computed determinant is the exact determinant of a


perturbed Hessenberg matrix in which each element has undergone a relative
perturbation not exceeding γ_{n^2-n+1} ≈ n^2u, which is a very satisfactory result.
In fact, the constant γ_{n^2-n+1} may be reduced to γ_{2n-1} by a slightly modified
analysis; see Problem 13.14.

13.6. Notes and References


Classic references on matrix inversion are Wilkinson’s paper Error Analysis
of Direct Methods of Matrix Inversion [1085, 1961] and his book Rounding
Errors in Algebraic Processes [1088, 1963]. In discussing Method 1 of §13.2.1,
Wilkinson says “The residual of X as a left-hand inverse may be larger than
the residual as a right-hand inverse by a factor as great as ||L|| ||L^{-1}|| . . . We
are asserting that the computed X is almost invariably of such a nature that
XL - I is equally small” [1088, 1963, p. 107]. Our experience concurs with
the latter statement. Triangular matrix inversion provides a good example of
the value of rounding error analysis: it helps us identify potential instabilities,
even if they are rarely manifested in practice, and it shows us what kind of
stability is guaranteed to hold.
In [1085, 1961] Wilkinson identifies various circumstances in which trian-
gular matrix inverses are obtained to much higher accuracy than the bounds

of this chapter suggest. The results of §8.2 provide some insight. For example,
if T is a triangular M-matrix then, as noted after Corollary 8.10, its inverse
is computed to high relative accuracy, no matter how ill conditioned T may
be.
Sections 13.2 and 13.3 are based closely on Du Croz and Higham [322,
1992 ].
Method D in §13.3.4 is used by the Hewlett-Packard HP-15C calculator,
for which the method’s lack of need for extra storage is an important prop-
erty [523, 1982].
Higham [560, 1995] gives error analysis of a divide-and-conquer method
for inverting a triangular matrix that has some attractions for parallel com-
putation.
GJE is an old method. It was discovered independently by the geodesist
Wilhelm Jordan (1842-1899) (not the mathematician Camille Jordan (1838-
1922)) and B.-I. Clasen [12, 1987].
An Algol routine for inverting positive definite matrices by GJE was pub-
lished in the Handbook for Automatic Computation by Bauer and Reinsch [83,
1971]. As a means of solving a single linear system, GJE is 50% more expensive
than GE when cost is measured in flops; the reason is that GJE takes O(n^3)
flops to solve the upper triangular system that GE solves in n^2 flops. How-
ever, GJE has attracted interest for vector computing because it maintains
full vector lengths throughout the algorithm. Hoffmann [577, 1987] found that
it was faster than GE on the CDC Cyber 205, a now-defunct machine with a
relatively large vector startup overhead.
Turing [1027, 1948] gives a simplified analysis of the propagation of errors
in GJE, obtaining a forward error bound proportional to κ(A). Bauer [80,
1966] does a componentwise forward error analysis of GJE for matrix inver-
sion and obtains a relative error bound proportional to for symmetric
positive definite A. Bauer’s paper is in German and his analysis is not easy to
follow. A summary of Bauer’s analysis (in English) is given by Meinguet [746,
19 6 9 ].
The first thorough analysis of the stability of GJE was by Peters and
Wilkinson [828, 1975]. Their paper is a paragon of rounding error analysis.
Peters and Wilkinson observe the connection with GE, then devote their at-
tention to the “second stage” of GJE, in which Ux = y is solved. They show
that each component of x is obtained by solving a lower triangular system, and
they deduce that, for each i, (U + ΔU_i)x^{(i)} = y + Δy_i, where |ΔU_i| ≤ γ_n|U|
and |Δy_i| ≤ γ_n|y|, and where the ith component of x^{(i)} is the ith component
of the computed solution. They then show that the computed solution has relative error bounded by
but that it does not necessarily have a small backward error. The more direct
approach used in our analysis is similar to that of Dekker and Hoffmann [276,
1989], who give a normwise analysis of a variant of GJE that uses row pivot-
ing (elimination across rows) and column interchanges. Our componentwise

bounds (13.29)–(13.32) are new.


Goodnight [473, 1979] describes the use of GJE in statistics for solving
least squares problems.
Error analysis of Hyman’s method is given by Wilkinson [1084, 1960],
[1088, 1963, pp. 147–154], [1089, 1965, pp. 426–431]. Although it dates
from the 1950s, Hyman’s method is not obsolete: it has found use in meth-
ods designed for high-performance computers; see Ward [1065, 1976], Li and
Zeng [701, 1992], and Dubrulle and Golub [324, 1994].

13.6.1. LAPACK
Routine xGETRI computes the inverse of a general square matrix by Method
B using an LU factorization computed by xGETRF . The corresponding routine
for a symmetric positive definite matrix is xPOTRI , which uses Method D,
with a Cholesky factorization computed by xPOTRF . Inversion of a symmetric
indefinite matrix is done by xSYTRI . Triangular matrix inversion is done by
xTRTRI , which uses Method 2C. None of the LAPACK routines compute the
determinant, but it is easy for the user to evaluate it after computing an LU
factorization.

Problems

13.1. Reflect on this cautionary tale told by Acton [4, 1970, p. 246].
“It was 1949 in Southern California. Our computer was a very new CPC
(model 1, number 1) —a 1-second-per-arithmetic-operation clunker that was
holding the computational fort while an early electronic monster was being
coaxed to life in an adjacent room. From a nearby aircraft company there
arrived one day a 16 × 16 matrix of 10-digit numbers whose inverse was desired
. . . We labored for two days and, after the usual number of glitches that
accompany any strange procedure involving repeated handling of intermediate
decks of data cards, we were possessed of an inverse matrix. During the
checking operations . . . it was noted that, to eight significant figures, the
inverse was the transpose of the original matrix! A hurried visit to the aircraft
company to explore the source of the matrix revealed that each element had
been laboriously hand computed from some rather simple combinations of
sines and cosines of a common angle. It took about 10 minutes to prove that
the matrix was, indeed, orthogonal!”

13.2. Rework the analysis of the methods of §13.2.2 using the assumptions
(12.3) and (12.4), thus catering for possible use of fast matrix multiplication
techniques.

13.3. Show that for any nonsingular matrix A,

This inequality shows that the left and right residuals of X as an approxima-
tion to A^{-1} can differ greatly only if A is ill conditioned.
13.4. (Mendelssohn [748, 1956]) Find parametrized 2 × 2 matrices A and X
such that the ratio ||AX – I||/||XA – I|| can be made arbitrarily large.
13.5. Let X and Y be approximate inverses of that satisfy

and

Show that

Derive forward error bounds for and Interpret all these bounds.
13.6. What is the relation between the matrix on the front cover of the LA-
PACK Users’ Guide [17, 1995] and that on the back cover? Answer the same
question for the LINPACK Users’ Guide [307, 1979].
13.7. Show that for any matrix having a row or column of 1s, the elements
of the inverse sum to 1.
13.8. Let X = A + iB ∈ ℂ^{n×n} be nonsingular. Show that X^{-1} can be
expressed in terms of the inverse of the real matrix of order 2n,

Show that if X is Hermitian positive definite then Y is symmetric positive


definite. Compare the economics of real versus complex inversion.
13.9. For a given nonsingular A ∈ ℝ^{n×n} and X ≈ A^{-1}, it is interesting to ask
what is the smallest ε such that X + ΔX = (A + ΔA)^{-1} with ||ΔX|| ≤ ε||X||
and ||ΔA|| ≤ ε||A||. We require (A + ΔA)(X + ΔX) = I, or

A ∆X + ∆AX + ∆A∆X = I – AX.

It is reasonable to drop the second-order term, leaving a generalized Sylvester


equation that can be solved using singular value decompositions of A and X
(cf. §15.2). Investigate this approach computationally for a variety of A and
methods of inversion.

13.10. For a nonsingular A ∈ ℝ^{n×n} and given integers i and j, under what


conditions is det(A) independent of a_ij? What does this imply about the
suitability of det(A) for measuring conditioning?
13.11. Prove Hadamard's inequality

    |det(A)| ≤ ∏_{k=1}^{n} ||a_k||_2,

where a_k = A(:, k). When is there equality? (Hint: use the QR factorization.)
13.12. (a) Show that if A^T = QR is a QR factorization then the Hadamard
condition number where p_i = ||R(:, i)||_2. (b) Evaluate
for A = U(1) (see (8.2)) and for the Pei matrix A = (α - 1)I + ee^T.
13.13. (Guggenheimer, Edelman, and Johnson [486, 1995]) (a) Prove that for
a nonsingular

(Hint: apply the arithmetic-geometric mean inequality to the n numbers


where the σ_i are the singular values of A.) (b)
Deduce that if A has rows of unit 2-norm then

where is the Hadamard condition number.


13.14. Show that Hyman's method for computing det(H), where H ∈ ℝ^{n×n}
is an unreduced upper Hessenberg matrix, computes the exact determinant
of H + ΔH where |ΔH| ≤ γ_{2n-1}|H|, barring underflow and overflow. What
is the effect on the error bound of a diagonal similarity transformation H ←
D^{-1}HD, where D = diag(d_i), d_i ≠ 0?
13.15. What is the condition number of the determinant?
13.16. (RESEARCH PROBLEM) Obtain backward and forward error bounds
for GJE applied to a diagonally dominant matrix. Peters and Wilkinson [828,
1975] state that “it is well known that Gauss–Jordan is stable” for a diagonally
dominant matrix, but a proof does not seem to have been published.

Chapter 14
Condition Number Estimation

Most of LAPACK’S condition numbers and error bounds are based on


estimated condition numbers . . .
The price one pays for using an estimated
rather than an exact condition number is
occasional (but very rare) underestimates of the true error;
years of experience attest to the reliability of our estimators,
although examples where they badly underestimate the error can be constructed.
— E. ANDERSON et al., LAPACK Users' Guide, Release 2.0 (1995)

The importance of the counter-examples is that they make clear that


any effort toward proving that the algorithms
always produce useful estimations is fruitless.
It may be possible to prove that the algorithms
produce useful estimations in certain situations, however,
and this should be pursued.
An effort simply to construct more complex algorithms is dangerous.
— A. K. CLINE and R. K. REW, A Set of Counter-Examples
to Three Condition Number Estimators (1983)

Singularity is almost invariably a clue.


— SIR ARTHUR CONAN DOYLE, The Boscombe Valley Mystery (1892)


14.1. How to Estimate Componentwise Condition


Numbers
When bounding the forward error of a computed solution to a linear system
we would like to obtain the bound with an order of magnitude less work than
is required to compute the solution. For a dense n × n system, where the
solution process usually requires O(n 3) operations, we need to compute the
bound in O(n 2) operations. An estimate of the bound that is correct to within
a factor 10 is usually acceptable, because it is the magnitude of the error that
is of interest, not its precise value.
In the perturbation theory of Chapter 7 for linear equations we obtained
perturbation bounds that involve the condition numbers

and their variants. To compute these condition numbers exactly we need


to compute A^{-1}, which requires O(n^3) operations, even if we are given a
factorization of A. Is it possible to produce reliable estimates of both condition
numbers in O(n^2) operations? The answer is yes, but to see why we first need
to rewrite the expression for cond_{E,f}(A, x). Consider the general expression
|| |A^{-1}|d ||_∞, where d is a given nonnegative vector (thus, d = f + E|x| for
cond_{E,f}(A, x)); note that the practical error bound (7.27) is of this form.
Writing D = diag(d) and e = [1, 1, ..., 1]^T, we have

    || |A^{-1}|d ||_∞ = || |A^{-1}|De ||_∞ = || |A^{-1}D|e ||_∞ = || A^{-1}D ||_∞.        (14.1)

Hence the problem is equivalent to that of estimating || A^{-1}D ||_∞.


If we can estimate this quantity then we can estimate any of the condition
numbers or perturbation bounds for a linear system. There is an algorithm,
described in §14.3, that produces reliable order-of-magnitude estimates of
||B||1, for an arbitrary B, by computing just a few matrix-vector products
Bx and B^Ty for carefully selected vectors x and y and then approximating
||B||_1 ≈ ||Bx||_1/||x||_1. If we assume that we have a factorization of A (say,
PA = LU or A = QR), then we can form the required matrix–vector products
for B = A^{-1}D in O(n^2) flops. Since ||B||_1 = ||B^T||_∞, it follows that we can
use the algorithm to estimate ||A^{-1}D||_∞ in O(n^2) flops.
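
For instance, with GEPP factors available in MATLAB, both kinds of product needed by the estimator can be formed in O(n^2) flops each without ever constructing A^{-1}D explicitly (a sketch; the function handle names are ours):

[L, U, P] = lu(A);                          % P*A = L*U
D = diag(d);                                % d is the nonnegative vector above
Bmul  = @(x) U \ (L \ (P*(D*x)));           % forms B*x  for B = A^{-1}*D
Btmul = @(y) D * (P' * (L' \ (U' \ y)));    % forms B'*y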
Before presenting the 1-norm condition estimation algorithm, we describe
a more general method that estimates ||B||_p for any 1 ≤ p ≤ ∞. This more
general method is of interest in its own right and provides insight into the
special case p = 1.

14.2. The p-Norm Power Method


An iterative "power method" for computing ||A||_p was proposed by Boyd in
1974 [139, 1974]. When p = 2 it reduces to the usual power method applied
to A^TA. We use the notation dual_p(x) to denote any vector y of unit q-
norm such that equality holds in the Hölder inequality x^Ty ≤ ||x||_p||y||_q (this
normalization is different from the one we used in (6.3), but is more convenient
for our present purposes). Throughout this section, p ≥ 1 and q is defined by
p^{-1} + q^{-1} = 1.

Algorithm 14.1 (p-norm power method). Given A ∈ ℝ^{m×n} and x_0 ∈ ℝ^n,


this algorithm computes γ and x such that γ ≤ ||A||_p and ||Ax||_p = γ||x||_p.

x = x_0/||x_0||_p
repeat
    y = Ax
    z = A^T dual_p(y)
    if ||z||_q ≤ z^Tx
        γ = ||y||_p
        quit
    end
    x = dual_q(z)
end

Cost: 4rmn flops (for r iterations).
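
A compact MATLAB realization of Algorithm 14.1 for 1 < p < ∞ is sketched below (function name ours). The dual vectors are obtained from the formula dual_p(y)_i = sign(y_i)|y_i|^{p-1}/||y||_p^{p-1}, which has unit q-norm and gives equality in the Hölder inequality, and an iteration cap is added as a practical safeguard.

function [gamma, x] = pnorm_power(A, p, x0)
%PNORM_POWER  Estimate norm(A,p) by the power method of Algorithm 14.1.
%             Returns gamma <= norm(A,p) with norm(A*x,p) = gamma*norm(x,p).
q = 1/(1 - 1/p);                                           % p^(-1) + q^(-1) = 1
dual = @(y, r) sign(y).*abs(y).^(r-1) / norm(y, r)^(r-1);  % dual_r(y)
x = x0 / norm(x0, p);
for k = 1:100                      % iteration cap (a practical safeguard)
    y = A*x;
    z = A' * dual(y, p);
    if norm(z, q) <= z'*x          % convergence test of Algorithm 14.1
        break
    end
    x = dual(z, q);
end
gamma = norm(A*x, p) / norm(x, p);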


There are several ways to derive Algorithm 14.1. Perhaps the most natural
way is to examine the optimality conditions for maximization of

First, we note that the subdifferential (that is, the set of subgradients) of an
arbitrary vector norm ||·|| is given by [378, 1987, p. 379]

If x 0 then from the Holder inequality, and so, if


x 0,

It can also be shown that if A has full rank,

(14.2)

We assume now that A has full rank, 1 < p < ∞, and x ≠ 0. Then it
is easy to see that there is a unique vector dual_p(x), so the subdifferential has just one
element, that is, ||x||_p is differentiable. Hence we have

(14.3)

The first-order Kuhn–Tucker condition for a local maximum of F is therefore

Since dual_q(dual_p(x)) = x/||x||_p if p ≠ 1, this equation can be written

(14.4)

The power method is simply functional iteration applied to this transformed


set of Kuhn–Tucker equations (the scale factor is irrelevant since
F(αx) = F(x)).
For the 1- and ∞-norms, which are not everywhere differentiable, a differ-
ent derivation can be given. The problem can be phrased as one of maximizing
the convex function F(x) := ||Ax||_p over the convex set S := {x : ||x||_p ≤ 1}.
The convexity of F and S ensures that, for any u ∈ S, at least one vector g
exists such that

(14.5)

Vectors g satisfying (14.5) are called subgradients of F (see, for example,


[378, 1987, p. 364]). Inequality (14.5) suggests the strategy of choosing a
subgradient g and moving from u to a point that maximizes g^T(υ - u),
that is, a vector that maximizes g^Tυ. Clearly, υ_* = dual_q(g). Since, from
(14.2), g has the form A^T dual_p(Ax), this completes the derivation of the
iteration.
For all values of p the power method has the desirable property of gener-
ating an increasing sequence of norm approximations.

Lemma 14.2. In Algorithm 14.1, the vectors from the kth iteration satisfy

The first inequality in (ii) is strict if convergence is not obtained on the kth
iteration.

Proof. Then

For the last part, note that, in view of (i), the convergence test "||z_k||_q ≤ z_k^Tx_k"
can be written as "||z_k||_q ≤ ||y_k||_p".
It is clear from Lemma 14.2 (or directly from the Hölder inequality) that
the convergence test "||z||_q ≤ z^Tx" in Algorithm 14.1 is equivalent to "||z||_q =
z^Tx" and, since ||x||_p = 1, this is equivalent to x = dual_q(z). Thus, although
the convergence test compares two scalars, it is actually testing for equality
in the vector equation (14.4).
The convergence properties of Algorithm 14.1 merit a careful description.
First, in view of Lemma 14.2, the scalars γ_k = ||y_k||_p form an increasing and
convergent sequence. This does not necessarily imply that Algorithm 14.1
converges, since the algorithm tests for convergence of the x_k, and these vec-
tors could fail to converge. However, a subsequence of the x_k must converge
to a limit, x̄ say. Boyd [139, 1974] shows that if x̄ is a strong local maximum
of F with no zero components, then x_k → x̄ linearly.
If Algorithm 14.1 converges it converges to a stationary point of F(x)
when 1 < p < ∞. Thus, instead of the desired global maximum ||A||_p, we
may obtain only a local maximum or even a saddle point. When p = 1 or ∞,
if the algorithm converges to a point at which F is not differentiable, that
point need not even be a stationary point. On the other hand, for p = 1
or ∞, Algorithm 14.1 terminates in at most n + 1 iterations (assuming that
when dual_p or dual_q is not unique an extreme point of the unit ball is taken),
since the algorithm moves between the vertices e_i of the unit ball in the 1-
norm, increasing F on each stage (x = ±e_i for p = 1, and dual_p(y) = ±e_i for
p = ∞). An example where n iterations are required for p = 1 is given in
Problem 14.2.
For two special types of matrix, more can be said about Algorithm 14.1.
(1) If A = xy^T (rank 1), the algorithm converges on the second step with
γ = ||A||_p = ||x||_p||y||_q, whatever x_0.
(2) Boyd [139, 1974] shows that if A has nonnegative elements, A^TA is
irreducible, 1 < p < ∞, and x_0 has positive elements, then the x_k converge
and γ_k → ||A||_p.
For values of p strictly between 1 and ∞, the convergence behaviour of
Algorithm 14.1 is typical of a linearly convergent method: exact convergence is
not usually obtained in a finite number of steps and arbitrarily many steps can

be required for convergence, as is well-known for the 2-norm power method.


Fortunately, there is a method for choosing a good starting vector that can
be combined with Algorithm 14.1 to produce a reliable norm estimator; see
the Notes and References and Problem 14.1.
We now turn our attention to the extreme values of p: 1 and ∞.

14.3. LAPACK 1-Norm Estimator


Algorithm 14.1 has two remarkable properties when p = 1: it almost always
converges within four iterations (when x_0 = [1, 1, ..., 1]^T, say) and it fre-
quently yields ||A||_1 exactly. This rapid, finite termination is also obtained
for p = ∞ and is related to the fact that Algorithm 14.1 moves among the
finite set of extreme points of the unit ball. Numerical experiments suggest
that the accuracy is about the same for both norms but that slightly more
iterations are required on average for p = ∞. Hence we will confine our
attention to the 1-norm.
The 1-norm version of Algorithm 14.1 was derived independently of the
general algorithm by Hager [492, 1984] and can be expressed as follows. The
notation ξ = sign(y) means that ξ_i = 1 or -1 according as y_i ≥ 0 or y_i < 0.
We now specialize to square matrices.

Algorithm 14.3 (1-norm power method). Given A ∈ ℝ^{n×n}, this algorithm


computes γ and x such that γ ≤ ||A||_1 and ||Ax||_1 = γ||x||_1.

x = n^{-1}e
repeat
    y = Ax
    ξ = sign(y)
    z = A^Tξ
    if ||z||_∞ ≤ z^Tx
        γ = ||y||_1
        quit
    end
    x = e_j, where |z_j| = ||z||_∞ (smallest such j)
end

Numerical experiments show that the estimates produced by Algorithm


14.3 are frequently exact (γ = ||A||_1), usually "acceptable" (γ ≥ ||A||_1/10),
and sometimes poor (γ < ||A||_1/10).
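
In practice the power method is applied not to A itself but to A^{-1}, with the matrix–vector products supplied by substitution using an LU factorization. The following MATLAB sketch uses Algorithm 14.3 in this way to estimate κ_1(A) (variable names are ours; MATLAB's condest function packages up the refined Algorithm 14.4 described below):

[L, U, P] = lu(A);                      % P*A = L*U
n = size(A, 1);
x = ones(n, 1)/n;                       % starting vector of Algorithm 14.3
for k = 1:5                             % a handful of iterations almost always suffices
    y  = U \ (L \ (P*x));               % y = inv(A)*x
    xi = sign(y);  xi(xi == 0) = 1;
    z  = P' * (L' \ (U' \ xi));         % z = inv(A)'*xi
    [zmax, j] = max(abs(z));
    if zmax <= z'*x, break, end         % convergence test
    x = zeros(n, 1);  x(j) = 1;         % x = e_j
end
kappa_est = norm(A, 1) * norm(y, 1);    % norm(y,1) <= norm(inv(A),1)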
An important question for any norm or condition estimator is whether
there exists a “counterexample”—a parametrized matrix for which the quo-
tient “estimate divided by true norm” can be made arbitrarily small (or large,
depending on whether the estimator produces a lower bound or an upper

bound) by varying a parameter. A general class of counterexample for Algo-


rithm 14.3 is given by the matrices

where Ce = C^Te = 0 (there are many possible choices for C). For any such
matrix, Algorithm 14.3 computes y = n^{-1}e, ξ = e, z = e, and hence the
algorithm terminates at the end of the first iteration with

The problem is that the algorithm stops at a local maximum that can differ
from the global one by an arbitrarily large factor.
A more reliable and more robust algorithm is produced by the following
modifications of Higham [537, 1988].
Definition of estimate. To overcome most of the poor estimates, γ is
redefined as

    γ = max{ γ, 2||Ab||_1/(3n) },

where

    b_i = (-1)^{i+1}( 1 + (i-1)/(n-1) ),    i = 1:n.
The vector b is considered likely to “pick out” any large elements of A in those
cases where such elements fail to propagate through to y.
Convergence test. The algorithm is limited to a minimum of two and a
maximum of five iterations. Further, convergence is declared after comput-
ing ξ if the new ξ is the same as the previous one; this event signals that
convergence will be obtained on the current iteration and that the next (and
final) multiplication A^Tξ is unnecessary. Convergence is also declared if the
new ||y||1 is no larger than the previous one. This nonincrease of the norm
can happen only in finite precision arithmetic and signals the possibility of a
vertex ej being revisited—the onset of “cycling.”
The improved algorithm is as follows. This algorithm is the basis of all
the condition number estimation in LAPACK.

Algorithm 14.4 (LAPACK norm estimator). Given A ∈ ℝ^{n×n}, this algo-


rithm computes γ and υ = Aw such that γ ≤ ||A||_1 with ||υ||_1/||w||_1 = γ (w
is not returned).

υ = A(n^{-1}e)
if n = 1, quit with γ = |υ_1|, end
γ = ||υ||_1
ξ = sign(υ)
x = A^Tξ
k = 2
repeat
    j = min{ i : |x_i| = ||x||_∞ }
    υ = Ae_j
    γ̄ = γ
    γ = ||υ||_1
    if sign(υ) = ξ or γ ≤ γ̄, goto (*), end
    ξ = sign(υ)
    x = A^Tξ
    k = k + 1
until ||x||_∞ = x_j or k > 5
(*) x_i = (-1)^{i+1}(1 + (i-1)/(n-1)), i = 1:n
x = Ax
if 2||x||_1/(3n) > γ then
    υ = x
    γ = 2||x||_1/(3n)
end

Algorithm 14.4 can still be “defeated”: it returns an estimate 1 for matrices


A(θ ) of the form

A(θ) = I + θP,    where P = P^T, Pe = 0, Pe_1 = 0, Pb = 0.        (14.6)

(P can be constructed as I - Q where Q is the orthogonal projection onto


span{e, e_1, b}.) Indeed, the existence of counterexamples is intuitively obvi-
ous since Algorithm 14.4 samples the behaviour of A on fewer than n vectors
in ℝ^n. Numerical counterexamples (not parametrized) can be constructed
automatically by direct search, as described in §24.3.1. Despite these weak-
nesses, practical experience with Algorithm 14.4 shows that it is very rare
for the estimate to be more than three times smaller than the actual norm,
independent of the dimension n. Therefore Algorithm 14.4 is, in practice, a
very reliable norm estimator. The number of matrix-vector products required
is at least 4 and at most 11, and averages between 4 and 5.
There is an analogue of Algorithm 14.3 for complex matrices, in which ξi
is defined as yi /|yi | if yi 0 and 1 otherwise. In the corresponding version of
Algorithm 14.4 the test for repeated ξ vectors is removed, because ξ now has
noninteger, complex components and so is unlikely to repeat.
It is interesting to look at a performance profile of Algorithm 14.4. A per-
formance profile is a plot of some measure of the performance of an algorithm
versus a problem parameter. In this case, the natural measure of performance
is the underestimation ratio, γ/||A|| 1. Figure 14.1 shows the performance
profile for a 5 × 5 matrix A( θ) of the form (14.6), with P constructed as

Figure 14.1. Underestimation ratio for Algorithm 14.4 for 5 × 5 matrix A(θ) of (14.6)
with 150 equally spaced values of θ ∈ [0, 10].

described above (because of rounding errors in constructing A(θ ) and within


the algorithm, the computed norm estimates differ from those that would be
produced in exact arithmetic). The jagged nature of the performance curve is
typical for algorithms that contain logical tests and branches. Small changes
in the parameter θ, which themselves result in different rounding errors, can
cause the algorithm to visit different vertices in this example.

14.4. Other Condition Estimators


The first condition estimator to be widely used is the one employed in LIN-
PACK. It was developed by Cline, Moler, Stewart, and Wilkinson [216, 1979].
The idea behind this condition estimator originates with Gragg and Stew-
art [476, 1976], who were interested in computing an approximate null vector
rather than estimating the condition number itself.
We will describe the algorithm as it applies to a triangular matrix T
There are three steps:

1. Choose a vector d such that ||y|| is as large as possible relative to ||d||,
   where T^Ty = d.

2. Solve Tx = y.

3. Estimate ||T^{-1}|| ≈ ||x||/||y||.

In LINPACK the norm is the 1-norm, but the algorithm can also be used
for the 2-norm or the ∞-norm. The motivation for step 2 is based on a singular
value decomposition analysis. Roughly, if ||y||/||d|| is large then ||x||/||y||
will almost certainly be at least as large, and it could be
a much better estimate. Notice that T^TTx = d, so the algorithm is related to
the power method on the matrix (T^TT)^{-1} with the specially chosen starting
vector d.
To examine step 1 more closely, suppose that T = U^T is lower triangular
and note that the equation Uy = d can be solved by the following column-
oriented (saxpy) form of substitution:

end

The idea is to choose the elements of the right-hand side vector d adaptively
as the solution proceeds, with d_j = ±1. At the jth stage of the algorithm
d_n, ..., d_{j+1} have been chosen and y_n, ..., y_{j+1} are known. The next element
d_j ∈ {+1, -1} is chosen so as to maximize a weighted sum of d_j - p_j and the
partial sums p_1, ..., p_j, which would be computed during the next execution
of statement (*) above. Hence the algorithm looks ahead, trying to gauge
the effect of the choice of d j on future solution components. This heuristic
algorithm for choosing d is expressed in detail as follows.

Algorithm 14.5 (LINPACK condition estimator). Given a nonsingular up-


per triangular matrix U ∈ ℝ^{n×n} and a set of nonnegative weights {w_i}, this
algorithm computes a vector y such that Uy = d, where the elements d_j = ±1
are chosen to make ||y|| large.

p(1:j– 1 ) = p -(1:j–1)
end
end

Cost: 4n^2 flops.
LINPACK takes the weights w_j ≡ 1, though another possible (but more
expensive) choice would be w_j = 1/|u_jj|, which corresponds to how p_j is
weighted in the expression y_j = (d_j - p_j)/u_jj.
To estimate ||A^{-1}|| for a full matrix A, the LINPACK estimator makes
use of an LU factorization of A. Given PA = LU, the equations solved are
U^Tz = d, L^Ty = z, and Ax = P^Ty, where for the first system d is constructed
by the analogue of Algorithm 14.5 for lower triangular matrices; the estimate
is ||x||_1/||y||_1 ≈ ||A^{-1}||_1. Since d is chosen without reference to L, there is
an underlying assumption that any ill condition in A is reflected in U. This
assumption may not be true; see Problem 14.3.
In contrast to the LAPACK norm estimator, the LINPACK estimator re-
quires explicit access to the elements of the matrix. Hence the estimator
cannot be used to estimate componentwise condition numbers. Furthermore,
separate code has to be written for each different type of matrix and factoriza-
tion. Consequently, while LAPACK has just a single norm estimation routine,
which is called by many other routines, LINPACK has multiple versions of its
algorithm, each tailored to the specific matrix or factorization.
Several years after the LINPACK condition estimator was developed, sev-
eral parametrized counterexamples were found by Cline and Rew [217, 1983].
Numerical counterexamples can also be constructed by direct search, as shown
in §24.3.1. Despite the existence of these counterexamples the LINPACK esti-
mator has been widely used and is regarded as being almost certain to produce
an estimate correct to within a factor 10 in practice.
A 2-norm condition estimator was developed by Cline, Conn, and Van
Loan [218, 1982, Algorithm 1]; see also Van Loan [1043, 1987] for another
explanation. The algorithm builds on the ideas underlying the LINPACK es-
timator by using "look-behind" as well as look-ahead. It estimates σ_min(R) =
1/||R^{-1}||_2 or σ_max(R) = ||R||_2 for a triangular matrix R, where σ_min and σ_max
denote the smallest and largest singular values, respectively. Full matrices
can be treated if a factorization A = QR is available ( Q orthogonal, R up-
per triangular), since R and A have the same singular values. The estimator
performs extremely well in numerical tests, often producing an estimate that
has some correct digits [218, 1982], [534, 1987]. No counterexamples to the
estimator were known until Bischof [103, 1990] obtained counterexamples as
a by-product of the analysis of a different but related method, mentioned at
the end of this section.
All the methods described so far have the property that when applied
repeatedly to a given matrix they always produce the same estimate. Another

approach is to introduce some randomness, so that the output of the method


depends on the particular random numbers chosen. A natural idea along these
lines is to apply the power method to the matrix (AA^T)^{-1} with a randomly
chosen starting vector. If a factorization of A is available, the power method
vectors can be computed inexpensively by solving linear systems with A and
AT. Analysis based on the singular value decomposition suggests that there
is a high probability that a good estimate of ||A^{-1}||_2 will be obtained. This
notion is made precise by Dixon [306, 1983], who proves the following result.

Theorem 14.6 (Dixon). Let A ∈ ℝ^{n×n} be nonsingular and let θ > 1 be a


constant. If x is a random vector from the uniform distribution on the
unit sphere S_n = {x ∈ ℝ^n : ||x||_2 = 1}, then the inequality

(14.7)

holds with probability at least 1 - 0.8θ^{-k/2}n^{1/2} (k ≥ 1).


Note that the left-hand inequality in (14.7) always holds; it is only the
right-hand inequality that is in question.
For k = 1, (14.7) can be written as

which suggests the simple estimate where x is chosen ran-


domly from the uniform distribution on S_n. Such vectors x can be generated
from the formula

    x = z/||z||_2,    z = [z_1, ..., z_n]^T,

where z_1, ..., z_n are independent random variables from the normal N(0,1)
distribution [668, 1981, p. 130]. If, for example, n = 100 and θ has the rather
large value 6400 then inequality (14.7) holds with probability at least 0.9.
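
As a concrete MATLAB sketch of this random sampling (assuming A and n are in the workspace), note that the estimate can be evaluated with a single solve, since ||A^{-1}x||_2 ≤ ||A^{-1}||_2 for any unit vector x:

z = randn(n, 1);
x = z / norm(z);        % uniformly distributed on the unit sphere S_n
est = norm(A \ x);      % a lower bound for norm(inv(A),2) that, by Theorem 14.6,
                        % is unlikely to be a severe underestimate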
In order to take a smaller constant θ, for fixed n and a desired probability,
we can use larger values of k. If k = 2j is even then we can simplify (14.7),
obtaining
(14.8)
and the minimum probability stated by the theorem is 1 - 0.8θ^{-j}n^{1/2}. Taking
j = 3, for the same value n = 100 as before, we find that (14.8) holds with
probability at least 0.9 for the considerably smaller value θ = 4.31.
Probabilistic condition estimation has not yet been adopted in any ma-
jor software packages, perhaps because the other techniques work so well.
For more on the probabilistic power method approach see Dixon [306, 1983],
Higham [534, 1987], and Kuczynski and Wozniakowski [676, 1992] (who also
analyse the more powerful Lanczos method with a random starting vector).
For a probabilistic condition estimation method of very general applicability

see Kenney and Laub [652, 1994] and Gudmundsson, Kenney, and Laub [485,
1995 ].
The condition estimators described above assume that a single estimate
is required for a matrix given in its entirety. Condition estimators have also
been developed for more specialized situations. Bischof [103, 1990] develops a
method for estimating the smallest singular value of a triangular matrix which
processes the matrix a row or a column at a time. This “incremental condition
estimation” method can be used to monitor the condition of a triangular ma-
trix as it is generated, and so is useful in the context of matrix factorization
such as the QR factorization with column pivoting. The estimator is general-
ized to sparse matrices by Bischof, Lewis, and Pierce [104, 1990]. Barlow and
Vemulapati [67, 1992] develop a 1-norm incremental condition estimator with
look-ahead for sparse matrices.
Condition estimates are also required in applications where a matrix fac-
torization is repeatedly updated as a matrix undergoes low rank changes.
Algorithms designed for a recursive least squares problem and employing
the Lanczos method are described by Ferng, Golub, and Plemmons [372,
1991 ]. Pierce and Plemmons [831, 1992 ] describe an algorithm for use with
the Cholesky factorization as the factorization is updated, while Shroff and
Bischof [918, 1992] treat the QR factorization.

14.5. Condition Numbers of Tridiagonal Matrices


For a bidiagonal matrix B, |B^{-1}| = M(B)^{-1} (see §8.3), so the condition num-
bers κ_{E,f} and cond_{E,f} can be computed exactly with an order of magnitude
less work than is required to compute B^{-1} explicitly. This property holds
more generally for several types of tridiagonal matrix, as a consequence of
the following result. Recall that the LU factors of a tridiagonal matrix are
bidiagonal and may be computed using the formulae (9.16).

Theorem 14.7. If the nonsingular tridiagonal matrix A ∈ ℝ^{n×n} has the LU


factorization A = LU and |L||U| = |A|, then |U^{-1}||L^{-1}| = |A^{-1}|.
Proof. Using the notation of (9.15), |L||U| = |A| = |LU| if and only if,
for all i ≥ 2,

that is, if
(14.9)

Using the formulae



we have

Thus, in view of (14.9), it is clear that |U^{-1}L^{-1}|_{ij} = (|U^{-1}||L^{-1}|)_{ij}, as re-


quired.
Since L and U are bidiagonal, |U^{-1}| = M(U)^{-1} and |L^{-1}| = M(L)^{-1}.
Hence, if |A| = |L||U|, then, from Theorem 14.7,

    |A^{-1}| = |U^{-1}||L^{-1}| = M(U)^{-1}M(L)^{-1}.        (14.10)

It follows that we can compute any of the condition numbers or forward error
bounds of interest exactly by solving two bidiagonal systems. The cost is
O(n) flops, as opposed to the O(n^2) flops needed to compute the inverse of a
tridiagonal matrix.
When does the condition |A| = |L||U| hold? Theorem 9.11 shows that it
holds if the tridiagonal matrix A is symmetric positive definite, totally posi-
tive, or an M-matrix. So for these types of matrix we have a very satisfactory
way to compute the condition number.
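
The computation implied by (14.10) amounts to two bidiagonal substitutions. A MATLAB sketch (the function and vector names are ours: l holds the subdiagonal of L, and u and e the diagonal and superdiagonal of U):

function w = abs_inv_times(l, u, e, y)
%ABS_INV_TIMES  Compute w = |inv(A)|*y in O(n) flops for a tridiagonal A = L*U
%               with |A| = |L||U|, using |inv(A)| = M(U)^(-1)*M(L)^(-1) from (14.10).
n = length(y);
w = y(:);
for i = 2:n                        % forward substitution with M(L) (unit lower bidiagonal)
    w(i) = w(i) + abs(l(i-1))*w(i-1);
end
w(n) = w(n) / abs(u(n));           % back substitution with M(U) (upper bidiagonal)
for i = n-1:-1:1
    w(i) = (w(i) + abs(e(i))*w(i+1)) / abs(u(i));
end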
If A is tridiagonal and diagonally dominant by rows, then we can compute
in O(n) flops an upper bound for the condition number that is not more than
a factor 2n – 1 too large.

Theorem 14.8. Suppose the nonsingular, row diagonally dominant tridiag-


onal matrix A ∈ ℝ^{n×n} has the LU factorization A = LU. Then, if y > 0,

Proof. We have L^{-1} = UA^{-1}, so

where the bidiagonal matrix V = diag(u_ii)^{-1}U has υ_ii ≡ 1 and |υ_{i,i+1}| =
|e_i/u_ii| ≤ 1 (see the proof of Theorem 9.12). Thus

and the result follows on taking norms.


In fact, it is possible to compute |A^{-1}|y exactly in O(n) operations for any
tridiagonal matrix. This is a consequence of the special form of the inverse of
a tridiagonal matrix.

Theorem 14.9 (Ikebe). Let A ∈ ℝ^{n×n} be tridiagonal and irreducible (that


is, a_{i+1,i} and a_{i,i+1} are nonzero for all i). Then there are vectors x, y, p, and
q such that

This result says that the inverse of an irreducible tridiagonal matrix is


the upper triangular part of a rank-1 matrix joined along the diagonal to the
lower triangular part of another rank-1 matrix. If A is reducible then it has
the block form (or its transpose), and this blocking can be applied
recursively until the diagonal blocks are all irreducible, at which point the
theorem can be applied to the diagonal blocks.
The vectors x, y, p, and q in Theorem 14.9 can all be computed in O(n)
flops, and this enables the condition numbers and forward error bounds to be
computed also in O(n) flops (see Problem 14.5). Unfortunately, the vectors
x, y, p, and q can have a huge dynamic range, causing the computation to
break down because of overflow or underflow. For example, for the diagonally
dominant tridiagonal matrix with a_ii ≡ 4, a_{i+1,i} = a_{i,i+1} ≡ 1, we have (x_1 =
1) |x_n| ≈ θ^{n-1}, |y_1| ≈ θ^{-1}, and |y_n| ≈ θ^{-n}, where θ = 2 + √3 ≈ 3.73. These
numerical problems can be overcome, but only at a nontrivial increase in
cost. Therefore, we do not recommend the use of Theorem 14.9 for computing
condition numbers. For a general tridiagonal matrix it is probably better to
estimate the condition number using Algorithm 14.4.

14.6. Notes and References


The clever trick (14.1) for converting the norm into the norm of
a matrix with which products are easily formed is due to Arioli, Demmel, and
Duff [24, 1989].
The p-norm power method was first derived and analysed by Boyd [139,
1974] and was later investigated by Tao [993, 1984]. Tao applies the method
to an arbitrary mixed subordinate norm ||A||_{α,β} (see (6.6)), while Boyd takes
the α- and β-norms to be p-norms (possibly different). Algorithm 14.1 can be
converted to estimate ||A||_{α,β} by making straightforward modifications to the
norm-dependent terms. An algorithm that estimates ||A||p using the power
method with a specially chosen starting vector is developed by Higham [551,
1992]; the method for obtaining the starting vector is outlined in Problem 14.1.
The estimate produced by this algorithm is always within a factor n^{1−1/p} of
||A||p and the cost is about 70n 2 flops. A MATLAB M-file pnorm implementing
this method is part of the Test Matrix Toolbox (see Appendix E).
The finite convergence of the power method for p = 1 and p = ∞ holds
more generally: if the power method is applied to the norm ||·||_{α,β} and
one of the α and β norms is polyhedral (that is, its unit ball has a finite
number of extreme points), then the iteration converges in a finite number
of steps. Moreover, under a reasonable assumption, this number of steps can
be bounded in terms of the number of extreme points of the unit balls in the
α-norm and the dual of the β-norm. See Bartels [75, 1991] and Tao [993,
1984] for further details.
Hager [492, 1984] gave a derivation of the 1-norm estimator based on sub-
gradients and used the method to estimate κ_1(A). That the method is of
wide applicability because it accesses A only through matrix-vector products
was recognized by Higham, who developed Algorithm 14.4 and its complex
analogue and wrote Fortran 77 implementations, which use a reverse commu-
nication interface [537, 1988], [543, 1990]. These codes are used in LAPACK,
the NAG library, and various other program libraries. A version of Algo-
rithm 14.4 dedicated to estimating κ 1 (A) is supplied with MATLAB as M-file
condest. Algorithm 14.4 is also implemented in ROM on the Hewlett-Packard
HP 48G and HP 48GX calculators (along with several other LAPACK rou-
tines), in a form that estimates κ 1 (A). The Hewlett-Packard implementation
is instructive because it shows that condition estimation can be efficient even
for small dimensions: on a standard HP 48G, inverting A and estimating its
condition number (without being given a factorization of A in either case)
both take about 5 seconds for n = 10, while for n = 20 inversion takes 30
seconds and condition estimation only 20 seconds.
Moler [769, 1978] describes an early version of the LINPACK condition
estimator and raises the question of the existence of counterexamples. An
early version without look-ahead was incorporated in the Fortran code decomp

in the book of Forsythe, Malcolm, and Moler [395, 1977].


Matrices for which condition estimators perform poorly can be very hard
to find theoretically or with random testing, but for all the estimators de-
scribed in this chapter they can be found quite easily by applying direct
search optimization to the under- or overestimation ratio; see §24.3.1.
Both LINPACK and LAPACK return estimates of the reciprocal of the
condition number, in a variable rcond < 1. Overflow for a very ill condi-
tioned matrix is thereby avoided, and rcond is simply set to zero when sin-
gularity is detected. MATLAB has a built-in function rcond that implements
the LINPACK condition estimation algorithm.
A simple modification to the LINPACK estimator that can produce a
larger estimate is suggested by O’Leary [804, 198 0]. For sparse matrices,
Grimes and Lewis [483, 1981] suggest a way to reduce the cost of the scaling
strategy used in LINPACK to avoid overflow in the condition estimation.
Zlatev, Wasniewski, and Schaumburg [1134, 1986] describe their experience
in implementing the LINPACK condition estimation algorithm in a software
package for sparse matrices.
Stewart [946, 1980] describes an efficient way to generate random matrices
of a given condition number and singular value distribution (see §26.3) and
tests the LINPACK estimator on such random matrices.
Condition estimators specialized to the (generalized) Sylvester equation
have been developed by Byers [172, 1984], Kågström and Westin [624, 1989],
and Kågström and Poromaa [621, 1992].
A survey of condition estimators up to 1987, which includes counterex-
amples and the results of extensive numerical tests, is given by Higham [534,
19 8 7 ].
Theorems 14.7 and 14.8 are from Higham [541, 1990]. That the condition number can
be computed in O(n) flops for symmetric positive definite tridiagonal A was
first shown in Higham [531, 1986].
Theorem 14.9 has a long history, having been discovered independently
in various forms by different authors. The earliest reference we know for
the result as stated is Ikebe [600, 1979], where a more general result for
Hessenberg matrices is proved. A version of Theorem 14.9 for symmetric
tridiagonal matrices was proved by Bukhberger and Emel’yanenko [157, 1973].
The culmination of the many papers on inverses of tridiagonal and Hessenberg
matrices is a result of Cao and Stewart on the form of the inverse of a block
matrix (Aij) with Aij = 0 for i > j + s [184, 1986]; despite the generality of
this result, the proof is short and elegant. Any banded matrix has an inverse
with a special “low rank” structure; the earliest reference on the inverse of a
general band matrix is Asplund [32, 1959]. For a recent survey on the inverses
of symmetric tridiagonal and block tridiagonal matrices see Meurant [751,
1992 ].
For symmetric positive definite tridiagonal A the standard way to solve

Ax = b is by using a Cholesky or LDL^T factorization, rather than an LU


factorization. The LINPACK routine SPTSL uses a nonstandard “LUB” fac-
torization resulting from the BABE (“burn at both ends”) algorithm, which
eliminates from the middle of the matrix to the top and bottom simultane-
ously (see the LINPACK Users' Guide [307, 1979, Chap. 7] and Higham [531,
1986]). The results of §14.5 are applicable to all these factorizations, with
minor modifications.

14.6.1. LAPACK
Algorithm 14.4 is implemented in routine xLACON , which has a reverse commu-
nication interface. The LAPACK routines xPTCON and xPTRFS for symmet-
ric positive definite tridiagonal matrices compute condition numbers using
(14.10); the LAPACK routines xGTCON and xGTRFS for general tridiagonal
matrices use Algorithm 14.4. LINPACK’S tridiagonal matrix routines do not
incorporate condition estimation.

Problems
14.1. (Higham [551, 1992]) The purpose of this problem is to develop a non-
iterative method for choosing a starting vector for Algorithm 14.1. The idea
is to choose the components of x in the order x_1, x_2, . . . , x_n in an attempt
to maximize ||Ax||_p/||x||_p. Suppose x_1, . . . , x_{k-1} satisfying ||x(1:k-1)||_p = 1
have been determined and let γ_{k-1} = ||A(:, 1:k-1)x(1:k-1)||_p. We now try
to choose x_k, and at the same time revise x(1:k-1), to give the next partial
product a larger norm. Defining

we set

where

Then ||x(1:k)||p = 1 and

Develop this outline into a practical algorithm. What can you prove about
the quality of the estimate ||Ax||p /||x||p that it produces?

14.2. (Higham [543, 1990]) Let the n × n symmetric tridiagonal matrix T_n(a) =
(t_{ij}) be defined by

For example, T6 (a) is given by

Note that, for all a, ||T_n(a)e_{n-1}||_1 = ||T_n(a)||_1. Show that if Algorithm 14.3
is applied to T_n(a) with 0 < a < 1 then x = e_{i-1} on the ith iteration, for
i = 2, . . . , n, with convergence on the nth iteration. Algorithm 14.4, however,
terminates after five iterations with y_5 = T_n(a)e_4, and

Show that the extra estimate saves the day, so that Algorithm 14.4 returns a
final estimate that is within a factor 3 of the true norm, for any a < 1.
14.3. Let PA = LU be an LU factorization with partial pivoting of A ∈ R^{n×n}.
Show that

14.4. (Higham [537, 1988]) Investigate the behaviour of Algorithms 14.3 and
14.4 for the Pei matrix, A = aI + eeT ( a > 0), and for the upper bidiagonal
matrix with 1s on the diagonal and the first superdiagonal.
14.5. (Ikebe [600, 1979], Higham [531, 1986]) Let A ∈ R^{n×n} be nonsingular,
tridiagonal, and irreducible. By equating the last columns in AA^{-1} = I
and the first rows in A^{-1}A = I, show how to compute the vectors x and
y in Theorem 14.9 in O(n) flops. Hence obtain an O(n) flops algorithm for
computing |A^{-1}|d, where d > 0.

14.6. The representation of Theorem 14.9 for the inverse of a nonsingular, tridi-
agonal, and irreducible A ∈ R^{n×n} involves 4n parameters, yet A depends only
on 3n − 2 parameters. Obtain an alternative representation that involves only
3n − 2 parameters. (Hint: symmetrize the matrix.)
14.7. (RESEARCH PROBLEM) (Demmel [286, 1992]) Show that estimating
||A^{-1}|| to within a factor depending only on the dimension of A is at least as
expensive as computing A^{-1}.
14.8. (RESEARCH PROBLEM) Let A ∈ R^{n×n} be diagonally dominant by rows,
let A = LU be an LU factorization, and let y > 0. What is the maximum
size of the overestimation ratio in Theorem 14.8? This is an open problem raised in [541,
1990]. In a small number of numerical experiments with full random matrices
the ratio has been found to be less than 2 [541, 1990], [790, 1986].

Chapter 15
The Sylvester Equation

We must commence, not with a square,


but with an oblong arrangement of terms consisting, suppose,
of m lines and n columns.
This will not in itself represent a determinant,
but is, as it were, a Matrix out of which we may form
various systems of determinants by fixing upon a number p,
and selecting at will p lines and p columns,
the squares corresponding to which may be termed
determinants of the pth order.
— J. J. SYLVESTER, Additions to the Articles, “On a New Class
of Theorems,” and “On Pascal’s Theorem” (1850)

I have in previous papers defined a “Matrix” as a rectangular array of terms,


out of which different systems of determinants may be engendered,
as from the womb of a common parent;
these cognate determinants being
by no means isolated in their relations to one another,
but subject to certain simple laws of
mutual dependence and simultaneous deperition.
— J. J. SYLVESTER, On the Relation Between the Minor
Determinants of Linearly Equivalent Quadratic Functions (1851)


The linear matrix equation

AX – XB = C, (15.1)

where A ∈ R^{m×m}, B ∈ R^{n×n}, and C ∈ R^{m×n} are given and X ∈ R^{m×n}
is to be determined, is called the Sylvester equation. It is of pedagogical
interest because it includes as special cases several important linear equation
problems:

1 . linear system: Ax = c,

2 . multiple right-hand side linear system: AX = C,

3 . matrix inversion: AX = I,

4 . eigenvector corresponding to given eigenvalue b: (A – bI)x = 0,

5 . commuting matrices: AX – XA = 0.

The Sylvester equation arises in its full generality in various applications. For
example, the equations

show that block-diagonalizing a block triangular matrix is equivalent to solv-


ing a Sylvester equation. The Sylvester equation can also be produced from
finite difference discretization of a separable elliptic boundary value problem
on a rectangular domain, where A and B represent application of a difference
operator in the “y” and “x” directions, respectively [935, 1991].
That (15.1) is merely a linear system is emphasized by writing it in the
form

    (I_n ⊗ A − B^T ⊗ I_m) vec(X) = vec(C),        (15.2)

where A ⊗ B := (a_{ij}B) is a Kronecker product and the vec operator stacks
the columns of a matrix into one long vector. For future reference, we note
the useful relation

    vec(AXB) = (B^T ⊗ A) vec(X).
(See Horn and Johnson [581, 1991, Chap. 4] for a detailed presentation of
properties of the Kronecker product and the vec operator). The mn × mn
coefficient matrix in (15.2) has a very special structure, illustrated for n = 3
by

In dealing with the Sylvester equation it is vital to consider this structure and
not treat (15.2) as a general linear system.
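To make the structure of (15.2) concrete, here is a small Python/NumPy sketch
(purely illustrative, not a recommended method) that forms the Kronecker
coefficient matrix explicitly and solves for vec(X); as noted below, this costs
a prohibitive O(m^3 n^3) flops and is sensible only for tiny dimensions:

    import numpy as np

    m, n = 4, 3
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, m))
    B = rng.standard_normal((n, n))
    C = rng.standard_normal((m, n))

    P = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))   # I_n (x) A - B^T (x) I_m
    x = np.linalg.solve(P, C.flatten(order='F'))          # vec(C): stack the columns
    X = x.reshape((m, n), order='F')
    print(np.linalg.norm(A @ X - X @ B - C))              # residual should be tiny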
Since the mn eigenvalues of I_n ⊗ A − B^T ⊗ I_m are given by

    λ_i(A) − µ_j(B),    i = 1:m,  j = 1:n,        (15.3)

the Sylvester equation is nonsingular precisely when A and B have no eigen-


values in common.
In this chapter we briefly discuss the Schur method for solving the Sylvester
equation and summarize its rounding error analysis. Then we determine the
backward error for the Sylvester equation, investigate its relationship with
the residual, and derive a condition number. All these results respect the
structure of the Sylvester equation and are relevant to any solution method.
We also consider the special case of the Lyapunov equation and mention how
the results extend to generalizations of the Sylvester equation.

15.1. Solving the Sylvester Equation


One way to solve the Sylvester equation is to apply Gaussian elimination with
partial pivoting (GEPP) to the “big” system (15.2), but the structure of the
coefficient matrix cannot be exploited and the cost is a prohibitive O( m 3 n 3 )
flops. A more efficient method, requiring O( m 3 + n3) flops, is obtained with
the aid of Schur decompositions of A and B. Let A and B have the real Schur
decompositions
A = URU^T,    B = VSV^T,        (15.4)
where U and V are orthogonal and R and S are quasi-triangular, that is, block
triangular with 1 × 1 or 2 × 2 diagonal blocks, and with any 2 × 2 diagonal
blocks having complex conjugate eigenvalues. (See Golub and Van Loan [470,
1989, §7.4.1] for more details of the real Schur decomposition.)
With the decompositions (15.4), the Sylvester equation transforms to

    RZ − ZS = D,    Z := U^T XV,    D := U^T CV,        (15.5)

or, equivalently, Pz = d, where P = I_n ⊗ R − S^T ⊗ I_m, z = vec(Z), and
d = vec(D). If R and S are both triangular then P is block triangular
with triangular diagonal blocks, so Pz = d can be solved by substitution.
Expressed in the notation of (15.5), the solution process takes the form of n
substitutions: if S is upper triangular then

Suppose now that R and S are quasi-triangular, and for definiteness as-
sume that they are both upper quasi-triangular. Partitioning Z = (Zij ) con-

formally with R = (Rij ) and S = (Sij ) we have

(15.6)

These equations can be used to determine the blocks of Z working up the


block columns from first to last. Since Rii and Sjj are both of order 1 or 2,
each system (15.6) is a linear system of order 1, 2, or 4 for Zij ; in the latter
two cases it is usually solved by GEPP (or even Gaussian elimination with
complete pivoting—see Problem 15.4).
This Schur decomposition method for solving the Sylvester equation is
due to Bartels and Stewart [74, 1972]. What can be said about its numerical
stability? In the case where R and S are both triangular, Theorem 8.5 shows
that
(15.7)
where c_{m,n} denotes a constant depending on the dimensions m and n (in fact,
we can take c_{m,n} = mn). Thus the computed solution satisfies a componentwise
residual bound, which implies the weaker inequality

(15.8)

If R or S is quasi-triangular then the error analysis depends on how the


systems of dimension 2 or 4 are solved. If GEPP followed by fixed precision
iterative refinement is used for each of these systems and if, for each system,
the coefficient matrix is not too ill conditioned and the right-hand side vector is not too badly
scaled, then (15.7) and (15.8) remain valid (see §11.2). Otherwise, we have
only a normwise bound

Because the transformation of a matrix to Schur form (by the QR algorithm)


is a backward stable process15 it is true overall that

(15.9)

Thus the relative residual is guaranteed to be bounded by a modest multiple


of the unit roundoff u.
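For readers who want to experiment, SciPy's solve_sylvester follows the same
Schur-decomposition (Bartels-Stewart) approach, so the relative residual of its
output can be checked directly; this is a hedged illustration, not part of the
book's development, and the normalization of the residual below is just one
reasonable choice. Since the routine solves AX + XB = Q, equation (15.1),
AX − XB = C, is obtained by negating B:

    import numpy as np
    from scipy.linalg import solve_sylvester

    m, n = 5, 3
    rng = np.random.default_rng(1)
    A = rng.standard_normal((m, m))
    B = rng.standard_normal((n, n))
    C = rng.standard_normal((m, n))

    X = solve_sylvester(A, -B, C)      # solves A X + X (-B) = C, i.e. A X - X B = C
    rel_res = np.linalg.norm(A @ X - X @ B - C, 'fro') / (
        (np.linalg.norm(A, 'fro') + np.linalg.norm(B, 'fro')) * np.linalg.norm(X, 'fro')
        + np.linalg.norm(C, 'fro'))
    print(rel_res)                     # a modest multiple of the unit roundoff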
Golub, Nash, and Van Loan [464, 1979] suggested a modification of the
Bartels-Stewart algorithm in which A is reduced only to upper Hessenberg
form: A = UHU T . The reduced system HZ – ZS = D can be solved
by solving n systems that are either upper Hessenberg or differ from upper
15
See Golub and Van Loan [470, 19 8 9 , §7.5.6]. A proof is outside the scope of this book,
but the necessary tools are Lemmas 18.3 and 18.8 about Householder and Givens rotations.

Hessenberg form by the addition of an extra subdiagonal. As shown in [464,


1979 ], the Hessenberg–Schur algorithm has a smaller flop count than the
Bartels-Stewart algorithm, with the improvement depending on the relative
sizes of m and n. The computed solution again satisfies (15.9).
The use of iterative methods to solve the Sylvester equation has attracted
attention recently for applications where A and B are large and sparse [572,
1995 ], [588, 1992 ], [935, 1991 ], [1059, 19 88]. The iterations are usually ter-
minated when an inequality of the form (15.9) holds, so here the size of the
relative residual is known a priori (assuming the method converges).

15.2. Backward Error

We saw in the last section that standard methods for solving the Sylvester
equation are guaranteed to produce a small relative residual. Does a small
relative residual imply a small backward error? The answer to this question
for a general linear system is yes (Theorem 7. 1). But for the highly structured
Sylvester equation the answer must be no, because for the special case of ma-
trix inversion we know that a small residual does not imply a small backward
error (§13.1). In this section we investigate the relationship between residual
and backward error for the Sylvester equation.
The normwise backward error of an approximate solution Y to (15.1) is
defined by

(15.10)

The tolerances α, β, and γ provide some freedom in how we measure the
perturbations. Of most interest is the choice α = ||A||_F, β = ||B||_F, γ =
||C||_F, for which we will call η the normwise relative backward error. The
equation (A + ∆A)Y – Y(B + ∆B) = C + ∆C may be written

(15.11)

where the residual R = C – (AY – YB). A small backward error implies a


small relative residual since, using the optimal perturbations from (15. 10) in
(15.11), we have

(15.12)

To explore the converse question of what the residual implies about the
backward error we begin by transforming (15.11) using the SVD of Y, Y =
UΣV^T, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal and Σ = diag(σ_i).
The numbers σ_1 ≥ σ_2 ≥ · · · ≥ σ_{min(m,n)} ≥ 0 are the singular values

of Y and we define, in addition, σ_{min(m,n)+1} = · · · = σ_{max(m,n)} = 0. Equation


(15.11) transforms to

(15.13)

where

This is an underdetermined system, with mn equations in m^2 + n^2 + mn


unknowns. We can write it in the uncoupled form16

(15.14)

For each i and j it is straightforward to show that the minimum of


subject to (15.14) is attained for

These matrices minimize

Since η(Y) is the minimum value of max{||α^{-1}∆A||_F, ||β^{-1}∆B||_F, ||γ^{-1}∆C||_F},
it follows that
(15.15)

where

(15.16)

16
For notational convenience we extend (if m < n) or (if m > n) to dimension
m × n; the “fictitious” elements will be set to zero by the minimization.

This expression shows that the backward error is approximately equal not to
the normwise relative residual but to a component-
wise residual corresponding to the diagonalized equation (15.13).
From (15.15) and (15.16) we deduce that

(15.17)

where
(15.18)

The scalar µ > 1 is an amplification factor that measures by how much, at


worst, the backward error can exceed the normwise relative residual. We now
examine µ more closely, concentrating on the normwise relative backward
error, for which α = ||A||_F, β = ||B||_F, and γ = ||C||_F.
First, note that if n = 1 and B = 0, so that the Sylvester equation
reduces to a linear system Ay = c, then σ_1 = ||y||_2 and σ_k = 0 for k > 1,
and so
we recover Theorem 7.1 (for the 2-norm) from (15.12) and (15.17), to within
a constant factor.
If m = n then

(15.19)

We see that µ is large only when

(15.20)

that is, when Y is ill conditioned and Y is a large-normed solution to the


Sylvester equation. In the general case, with m ≠ n, one of σ_m and σ_n is
always zero and hence µ can be large for a third reason: A (if m < n) or B
(if m > n) greatly exceeds the rest of the data in norm; in these cases the
Sylvester equation is badly scaled. However, if we set α = β = ||A||_F + ||B||_F,
which corresponds to regarding A and B as comprising a single set of data,
then bad scaling does not affect µ.
If we allow only A and B to be perturbed in (15.10) (as may be desirable
if the right-hand side C is known exactly), then γ = 0 and (15.19) and (15.20)
remain valid with ||C||_F replaced by zero. In this case µ ≥ ||Y||_F||Y^+||_2 ≥
κ_2(Y) (for any m and n), so µ is large whenever Y is ill conditioned (and
included in this case is matrix inversion). Conditions involving controllability
which guarantee that the solution to the Sylvester equation with m = n is
nonsingular are given by Hearon [508, 1977], while Datta [265, 1988] gives a
determinantal condition for nonsingularity. It is an open problem to derive

conditions for the Sylvester equation to have a well-conditioned solution (see


Problem 15.5).
The following numerical example illustrates the above analysis. This par-
ticular example was carefully chosen so that the entries of A and B are of a
simple form, but equally effective examples are easily generated using random,
ill-conditioned A and B of dimension m, n > 2. Let

Define C by the property that vec(C) is the singular vector corresponding to


the smallest singular value of I_n ⊗ A − B^T ⊗ I_m. With a = 10^{-6}, we solved the
Sylvester equation in MATLAB by the Bartels-Stewart algorithm and found
that the computed solution X satisfies

Although the computed solution has a very acceptable residual (as it must in view of (15.9)), its
backward error is eight orders of magnitude larger than is necessary to achieve
backward stability. We solved the same Sylvester equation using GEPP on the
system (15.2). The relative residual was again less than u, but the backward
error was appreciably larger:
One conclusion we can draw from the analysis is that standard methods for
solving the Sylvester equation are at best conditionally backward stable, since
there exist rounding errors such that is the only nonzero element of
and then (15.17) is an approximate equality, with µ possibly large.

15.2.1. The Lyapunov Equation


If we put B = –AT in the Sylvester equation we obtain

AX+ XA T = C,

which is called the Lyapunov equation. This equation plays a major role in
control and systems theory and it can be solved using the same techniques as
for the Sylvester equation.
If C = C^T then C^T = (AX + XA^T)^T = AX^T + X^T A^T = C, so X and
X^T are both solutions to the Lyapunov equation. If the Lyapunov equation
is nonsingular (equivalently, λ_i(A) + λ_j(A) ≠ 0 for all i and j, by (15.3)) it
therefore has a unique symmetric solution.
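As a small illustration (assuming SciPy is available; this is not code from the
book), solve_continuous_lyapunov solves AX + XA^T = Q and, for a symmetric
right-hand side and a nonsingular equation, returns a symmetric solution:

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 4)) - 4.0 * np.eye(4)   # shift eigenvalues into the
                                                        # left half-plane so lambda_i + lambda_j != 0
    C = rng.standard_normal((4, 4))
    C = C + C.T                                         # symmetric right-hand side
    X = solve_continuous_lyapunov(A, C)
    print(np.linalg.norm(X - X.T))                      # X is symmetric (up to roundoff)
    print(np.linalg.norm(A @ X + X @ A.T - C))          # small residual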

We assume that C is symmetric and that Y is a symmetric approximate


solution. The definition of backward error is now

The analogue of (15.11) is


Let Y = UΛU^T be a spectral decomposition, with Λ = diag(λ_i). Then the
residual equation transforms to

where and This system can be


written in uncoupled form as

(15.21)

We can obtain the minimum value of by minimizing

subject to (15.21), for i, j = 1:n. The solution is

(Note that is symmetric since is.) It follows that


where

where the last inequality is usually a good approximation. Comparing with


(15.16) we see that respecting the extra structure of the Lyapunov equation
has essentially no effect on the backward error.
Finally, the analogue of (15.17) and (15.18) is

where

15.3. Perturbation Result


To derive a perturbation result we consider the perturbed Sylvester equation

which, on dropping second-order terms, becomes

This system may be written in the form

(15.22)

where P = I_n ⊗ A − B^T ⊗ I_m. If we measure the perturbations normwise by

where α, β, and γ are tolerances as in (15.10), then

(15.23)

is a sharp bound (to first order in ), where


(15.24)

is the corresponding condition number for the Sylvester equation. The bound
(15.23) can be weakened to

(15.25)

where

If then twice the upper bound in (15.25) can be shown


to be a strict bound for the error. The perturbation bound (15.25) with
α = ||A||_F, β = ||B||_F, and γ = ||C||_F is the one that is usually quoted in
the literature for the Sylvester equation (see [464, 1979] and [522, 1988], for
example), and corresponds to applying standard perturbation theory for Ax =
b to (15.2). Note that ||P^{-1}||_2 = sep(A, B)^{-1}, where sep is the separation of
A and B,
    sep(A, B) = min_{X ≠ 0} ||AX − XB||_F / ||X||_F.        (15.26)

The sep function is an important tool for measuring invariant subspace sen-
sitivity [470, 1989, §7.2.5], [940, 1973], [1050, 1979].
For the Lyapunov equation, a similar derivation to the one above shows
that the condition number is

(15.27)
where Π is the vec-permutation matrix, which is defined by the property that
vec(A^T) = Π vec(A).
How much can the bounds (15.23) and (15.25) differ? The answer is by
an arbitrary factor. To show this we consider the case where B is normal
(or equivalently, A is normal if we transpose the Sylvester equation). We
can assume B is in Schur form, thus B = diag(µ_j) (with the µ_j possibly
complex). Then P = diag(A − µ_j I_m), and it is straightforward to show that
if X = [x_1, . . . , x_n], and if we approximate the 2-norms in the definitions of
Ψ and Φ by Frobenius norms, then

while

These formulae show that in general Ψ and Φ will be of similar magnitude,


and we know that Ψ < Φ from the definitions. However, Ψ can be much
smaller than Φ. For example, suppose that γ = 0 and

Then if

we have Ψ << Φ. Such examples are easily constructed. To illustrate, let


A = diag(2, 2, . . . , 2, 1) and B = diag(1/2, 1/2, . . . , 1/2, 1 − ε) with ε > 0, so
that A − µ_nn I_m = diag(1 + ε, . . . , 1 + ε, ε), and let X = (A − µ_nn I_m)Y,
where Y = [y, y, . . . , y, 0] with ||(A − µ_nn I_m)y||_2 = ||A − µ_nn I_m||_2 and ||y||_2 = 1.
Then, if γ =

To summarize, the “traditional” perturbation bound (15.25) for the Syl-


vester equation can severely overestimate the effect of a perturbation on the
data when only A and B are perturbed, because it does not take account
of the special structure of the problem. In contrast, the perturbation bound
(15.23) does respect the Kronecker structure, and consequently is attainable
for any given A, B, and C.
To obtain an a posteriori error bound for a computed solution X̂ = X +
∆X we can set ∆A = 0, ∆B = 0, and ∆C = AX̂ − X̂B − C =: R in (15.22),
which leads to
(15.28)

A similar but potentially much smaller bound is described in the next section.

15.4. Practical Error Bounds


For the Sylvester equation we can obtain an analogue of the practical error
bound (7.27) by identifying Ax = b with (15.2). For the computed residual of
a computed solution X we have

Therefore the bound is

(15.29)

where ||X|| := max_{i,j} |x_{ij}|. After transformation by the technique illustrated


in (14.1), this bound can be estimated by the LAPACK norm estimator (Algo-
rithm 14.4) at the cost of solving a few linear systems with coefficient matrices
I_n ⊗ A − B^T ⊗ I_m and its transpose—in other words, solving a few Sylvester
equations AX − XB = C and A^T X − XB^T = D. If the Bartels-Stewart
algorithm is used, these solutions can be computed with the aid of the pre-
viously computed Schur decompositions of A and B. The condition number
Ψ in (15.24) and sep(A, B) = ||P^{-1}||_2^{-1} can both be estimated in much the
same way; alternatively, the power method can be used (see Ghavimi and
Laub [440, 1995]). Other algorithms for efficiently estimating sep(A, B) given
Schur decompositions of A and B are given by Byers [172, 1984] and Kågström
and Poromaa [621, 1992].
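As a hedged sketch of this idea (the names and the choice of norm are
illustrative; onenormest is SciPy's block 1-norm estimator, so it estimates
||P^{-1}||_1 rather than the 2-norm of (15.26)), each product with P^{-1} or its
transpose below is performed by one Sylvester solve, so P is never formed:

    import numpy as np
    from scipy.linalg import solve_sylvester
    from scipy.sparse.linalg import LinearOperator, onenormest

    m, n = 6, 4
    rng = np.random.default_rng(3)
    A = rng.standard_normal((m, m))
    B = rng.standard_normal((n, n))

    def apply_Pinv(c, transpose=False):
        C = np.reshape(c, (m, n), order='F')
        if transpose:
            X = solve_sylvester(A.T, -B.T, C)   # A^T X - X B^T = C
        else:
            X = solve_sylvester(A, -B, C)       # A X - X B = C
        return X.flatten(order='F')

    Pinv = LinearOperator((m * n, m * n),
                          matvec=lambda c: apply_Pinv(c),
                          rmatvec=lambda c: apply_Pinv(c, transpose=True))
    est = onenormest(Pinv)                      # estimate of ||P^{-1}||_1
    print(1.0 / est)                            # corresponding estimate of sep(A, B)

In practice the Schur decompositions of A and B would be computed once and
reused for every solve, as described above.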
The attraction of (15.29) is that large elements in the jth column of P^{-1}
may be countered by a small jth element of |vec(R̂)| + vec(R_u), making the
bound much smaller than (15.28). In this sense (15.29) has better scaling

properties than (15.28), although (15.29) is not actually invariant under di-
agonal scalings of the Sylvester equation.
We give a numerical example to illustrate the advantage of (15.29) over
(15.28). Let

where J_n(λ) denotes a Jordan block of size n with eigenvalue λ. Solving


the Sylvester equation by the Bartels–Stewart algorithm we found that the
bounds are

(where in evaluating (15.28) we replaced R by + Ru, as in (15.29)). Here,


sep(A, B) = 1.67 × 10^{-16}, and the bound (15.29) is small because relatively
large columns of P^{-1} are nullified by relatively small elements of |vec(R̂)| +
vec(R_u). For this example, with α = ||A||_F, β = ||B||_F, γ = ||C||_F, we have

Ψ = 7.00 × 10^9,    Φ = 1.70 × 10^{16},

confirming that the usual perturbation bound (15.25) for the Sylvester equa-
tion can be very pessimistic. Furthermore,

so we have an example where the backward error is small despite a large µ .

15.5. Extensions
The Sylvester equation can be generalized in two main ways. One retains the
linearity but adds extra coefficient matrices, yielding the generalized Sylvester
equations
AXB + CXD = E (15.30)
and
AX – YB = C, DX – YE = F. (15.31)
These two forms are equivalent, under conditions on the coefficient matrices
[210, 1987]; for example, defining Z := XB and W := –CX, (15.30) becomes
AZ – WD = E, CZ + WB = 0. Applications of generalized Sylvester equa-
tions include the computation of stable eigendecompositions of matrix pencils
[294, 198 7], [295, 1988], [622, 1993], [623, 1994] and the implementation of

numerical methods for solving implicit ordinary differential equations [353,


19 8 0 ].
The second generalization incorporates a quadratic term, yielding the al-
gebraic Riccati equation

(15.32)

This general Riccati equation and its symmetric counterpart with B = AT


and F and G symmetric are widely used in control theory.
The backward error results and perturbation theory of this chapter can be
generalized in a straightforward way to (15.31) and (15.32). See Kågström [620,
1994] for (15.31) and Ghavimi and Laub [440, 1995] for (15.32). The back-
ward error derivations do not extend to (15.30), because in this equation the
coefficient matrices appear nonlinearly.
A variation of the Lyapunov equation called the discrete-time Lyapunov
equation has the form

where As in (15.30), the data appears nonlinearly. Ghavimi


and Laub [441, 1995] show how to derive an approximation to the backward
error by linearizing an equation characterizing the optimal perturbations.
Another generalization of the Sylvester equation, mainly of theoretical
interest, is

where and See Lancaster [685, 1970] for


associated theory.

15.6. Notes and References


This chapter is based on Higham [556, 1993]. The backward error derivations
make use of ideas of Ghavimi and Laub [440, 1995].
The Sylvester equation is so named because Sylvester considered the ho-
mogeneous version of the equation [985, 1884].
Bhatia and Rosenthal [96, 1996] give a survey of theoretical results for the
Sylvester equation in both finite- and infinite-dimensional spaces.
For details of the role of the Sylvester equation in the eigenproblem see
Bai, Demmel, and McKenney [38, 1993], [40, 1993] and the references therein.
Iterative methods that make use of matrix inversion to solve the Sylvester
equation are described by Miller [755, 1988] and Roberts [875, 1980].
Hammarling [496, 198 2] gives a method for solving the Lyapunov equa-
tion AX + XAT = –C in the case where A has eigenvalues with negative

real parts and C is positive semidefinite; his method directly computes the
Cholesky factor of the solution (which is indeed symmetric positive definite—
see Problem 15.2).
A survey of the vec operator, the Kronecker product, and the vec-permu-
tation matrix is given together with historical comments by Henderson and
Searle [514, 1981]. Historical research by Henderson, Pukelsheim, and Searle
[513, 1983] indicates that the Kronecker product should be called the Zehfuss
product, in recognition of an 1858 paper by Zehfuss that gives a determinantal
result involving the product.
The vec-permutation matrix Π (which appears in (15.27)) is given explic-
itly by

and has the property that


Applications of the Lyapunov equation in control theory, including special
situations where an approximate solution of low rank is required, are discussed
by Hodel [574, 1992]. A much older reference to applications is Barnett and
Storey [68, 1968].
Algorithms and software for solving (15.30) are developed by Gardiner,
Wette, Laub, Amato, and Moler [417, 1982], [418, 1982].
Perturbation theory for Lyapunov and Riccati equations can be found in
the work of Byers [173, 1985], Hewer and Kenney [522, 1988], [650, 1990], and
Gahinet, Laub, Kenney, and Hewer [411, 1990].
Chu [210, 1987] determines conditions for the existence of unique solutions
to the generalized Sylvester equations (15.30) and (15.31). The appropriate
conditions for (15.30) are that the pencils A + and D + are regular
and the spectra of the pencils have an empty intersection, which neatly gen-
eralizes the conditions for the Sylvester equation to have a unique solution;
the conditions for (15.31) are analogous.
There is much work on algorithms and software for solving the algebraic
Riccati equation. For a sampling, see Laub [691, 1979], Arnold and Laub [30,
1984], Byers [174, 1987], Gardiner and Laub [416, 1991], and Kenney, Laub,
and Papadopoulos [653, 1992].
An algorithm for estimating a generalization of sep that occurs in pertur-
bation theory for the generalized Sylvester equation (15.31) is developed by
Kågström and Westin [624, 1989].
Another generalization of the Sylvester equation is to take just one equa-
tion from (15.31), AX – YB = C ((15.13) is of this form). This equation can
be underdetermined or overdetermined, depending on the dimensions of the
coefficient matrices. Conditions involving generalized inverses that are both
necessary and sufficient for the existence of a solution are given by Baksalary
and Kala [49, 1979]. Zietak examines the inconsistent case [1130, 1985] for one

choice of dimensions giving an overdetermined system. Stewart [949, 1992]


shows how to compute a minimum Frobenius norm least squares solution.
The even more general equation AXB + CYD = E has also been analysed
by Baksalary and Kala [50, 198 0], who again give necessary and sufficient
conditions for the existence of a solution.

15.6.1. LAPACK
The computations discussed in this chapter can all be done using LAPACK.
The Bartels-Stewart algorithm can be implemented by calling xGEES to com-
pute the Schur decomposition, using the level-3 BLAS routine xGEMM to trans-
form the right-hand side C, calling xTRSYL to solve the (quasi-) triangular
Sylvester equation, and using xGEMM to transform back to the solution X.
The error bound (15.29) can be estimated using xLACON in conjunction with
the above routines. A Fortran 77 code dggsvx [556, 1993] of Higham follows
this outline and may appear in a future release of LAPACK.
Routine xLASY2 solves a real Sylvester equation AX ± XB = σC in which
A and B have dimension 1 or 2 and σ is a scale factor. It is called by xTRSYL .
Kågström and Poromaa have developed codes for solving (15.31), which
are intended for a future release of LAPACK [622, 1993], [623, 1994].

Problems
15.1. Show that the Sylvester equation AX – XA = I has no solution.
15.2. (Bellman [89, 1970, §10.18]) Show that if the expression

exists for all C it represents the unique solution of the Sylvester equation
AX + XB = C. (Hint: consider the matrix differential equation dZ/dt =
AZ(t) + Z(t)B, Z(0) = C.) Deduce that the Lyapunov equation AX +
XAT = –C has a symmetric positive definite solution if A has eigenvalues
with negative real parts and C is symmetric positive definite.
15.3. (Byers and Nash [176, 1987]) Let and consider

Show that there exists a minimizer X that is either symmetric or skew-


symmetric.
15.4. How would you solve a Sylvester equation AX – XB = C in which A
and B are of dimension 1 or 2? Compare your method with the one used in
the LAPACK routine XLASY2 .
15.5. (RESEARCH PROBLEM) Derive conditions for the Sylvester equation to
have a well-conditioned solution.

Chapter 16
Stationary Iterative Methods

I recommend this method to you for imitation.


You will hardly ever again eliminate directly,
at least not when you have more than 2 unknowns.
The indirect [iterative] procedure can be done while half asleep,
or while thinking about other things.17
— CARL FRIEDRICH GAUSS, Letter to C. L. Gerling (1823)

The iterative method is commonly called the “Seidel process, ”


or the “Gauss–Seidel process. ”
But, as Ostrowski (1952) points out,
Seidel (1874) mentions the process but advocates not using it.
Gauss nowhere mentions it.
— GEORGE E. FORSYTHE,
Solving Linear Algebraic Equations Can Be Interesting (1953)

The spurious contributions in null(A)


grow at worst linearly and
if the rounding errors are small the scheme can be quite effective.
— HERBERT B. KELLER,
On the Solution of Singular and Semidefinite
Linear Systems by Iteration (1965)

17
Gauss refers here to his relaxation method for solving the normal equations. The
translation is taken from Forsythe [387, 1951 ].


Table 16.1. Dates of publication of selected iterative methods. Based on Young [1123,
1989].

  1845        Jacobi                                    Jacobi method
  1874        Seidel                                    Gauss-Seidel method
  1910        Richardson                                Richardson's method
  1938–1939   Temple                                    Method of steepest descent
  1940s       Various (analysis by Young and Frankel)   Successive overrelaxation (SOR) method
  1952        Hestenes and Stiefel                      Conjugate gradient method

Iterative methods for solving linear systems have a long history, going back
at least to Gauss. Table 16.1 shows the dates of publication of selected meth-
ods. It is perhaps surprising, then, that rounding error analysis for iterative
methods is not well developed. There are two main reasons for the paucity
of error analysis. One is that in many applications accuracy requirements
are modest and are satisfied without difficulty, resulting in little demand for
error analysis. Certainly there is no point in computing an answer to greater
accuracy than that determined by the data, and in scientific and engineering
applications the data often has only a few correct digits. The second reason
is that rounding error analysis for iterative methods is inherently more diffi-
cult than for direct methods, and the bounds that are obtained are harder to
interpret.
In this chapter we consider a simple but important class of iterative meth-
ods, stationary iterative methods, for which a reasonably comprehensive error
analysis can be given. The basic question that our analysis attempts to answer
is “What is the limiting accuracy of a method in floating point arithmetic?”
Specifically, how small can we guarantee that the backward or forward error
will be over all iterations k = 1, 2 ,. . .? Without an answer to this question
we cannot be sure that a convergence test of the form (say)
will ever be satisfied, for any given value of
As an indication of the potentially devastating effects of rounding errors
we present an example constructed and discussed by Hammarling and Wilkin-
son [497, 1976]. Here, A is the 100 × 100 lower bidiagonal matrix with a_ii ≡ 1.5
and a_{i,i-1} ≡ 1, and b_i ≡ 2.5. The successive overrelaxation (SOR) method
is applied in MATLAB with parameter w = 1.5, starting with the rounded
version of the exact solution x, given by x_i = 1 − (−2/3)^i. The forward errors
and the ∞-norm backward errors are plotted in
Figure 16.1. The SOR method converges in exact arithmetic, since the itera-
tion matrix has spectral radius 1/2, but in the presence of rounding errors it
diverges. The iterate has a largest element of order 10^{13}, for

Figure 16.1. SOR iteration.

k > 238, and for k > 100, The divergence


is not a result of ill conditioning of A, since The reason for the
initial rapid growth of the errors in this example is that the iteration matrix
is far from normal; this allows the norms of its powers to become very large
before they ultimately decay by a factor 1/2 with each successive power.
The effect of rounding errors is to cause the forward error curve in Figure 16.1
to level off near k = 100, instead of decaying to zero as it would in exact arith-
metic. More insight into the initial behaviour of the errors can be obtained
using the notion of pseudo-eigenvalues; see §17.3.
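The experiment is easy to reproduce in outline; the following Python sketch is
written under the stated assumptions and is not the original MATLAB code of
Hammarling and Wilkinson:

    import numpy as np

    n = 100
    A = 1.5 * np.eye(n) + np.diag(np.ones(n - 1), -1)   # a_ii = 1.5, a_{i,i-1} = 1
    b = 2.5 * np.ones(n)
    x = 1.0 - (-2.0 / 3.0) ** np.arange(1, n + 1)        # exact solution

    w = 1.5
    D = np.diag(np.diag(A)); L = np.tril(A, -1); U = np.triu(A, 1)
    M = D / w + L                                        # SOR splitting A = M - N
    N = (1.0 / w - 1.0) * D - U
    xk = x.copy()                                        # start from the rounded exact solution
    for k in range(1, 301):
        xk = np.linalg.solve(M, N @ xk + b)              # in practice a triangular solve
        if k % 60 == 0:
            print(k, np.linalg.norm(x - xk, np.inf) / np.linalg.norm(x, np.inf))

Run in this way, the forward error grows by many orders of magnitude and then
levels off, rather than decaying to zero as it would in exact arithmetic, even
though the iteration matrix M^{-1}N has spectral radius 1/2.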

16.1. Survey of Error Analysis


Before analysing stationary iterative methods, we briefly survey the published
error analysis for iterative methods. For symmetric positive definite systems,
Golub [468, 196 2] derives both statistical and nonstatistical bounds for the
forward error and residual of the Richardson method. Benschop and Ratz [92,
1971] give a statistical analysis of the effect of rounding errors on stationary
iteration, under the assumption that the rounding errors are independent
random variables with zero mean. Lynn [719, 196 4] presents a statistical
analysis for the SOR method with a symmetric positive definite matrix.
Hammarling and Wilkinson [497, 1976] give a normwise error analysis for

the SOR method. With the aid of numerical examples, they emphasize that
while it is the spectral radius of the iteration matrix M – 1 N that determines
the asymptotic rate of convergence, it is the norms of the powers of this matrix
that govern the behaviour of the iteration in the early stages. This point is
also explained by Trefethen [1017, 1992], using the tool of pseudospectra.
Dennis and Walker [302, 1984] obtain error bounds
for stationary iteration as a special case of error analysis of quasi-Newton
methods for nonlinear systems. The bounds in [302, 1984] do not readily
yield information about normwise or componentwise forward stability.
Bollen [133, 1984] analyses the class of “descent methods” for solving Ax =
b, where A is required to be symmetric positive definite; these are obtained by
iteratively using exact line searches to minimize the quadratic function F(x) =
(A^{-1}b − x)^T A(A^{-1}b − x). The choice of search direction p_k = b − Ax_k =: r_k
yields the steepest descent method, while p_k = e_j (unit vector), where |r_k|_j =
||r_k||_∞, gives the Gauss–Southwell method. Bollen shows that both methods
are normwise backward stable as long as a condition of the form cn κ(A)u < 1
holds. If the pk are cyclically chosen to be the unit vectors e 1, e2, . . ., en then
the Gauss–Seidel method results, but unfortunately no results specific to this
method are given in [133, 1984].
Wozniakowski [1112, 1977] shows that the Chebyshev semi-iterative method
is normwise forward stable but not normwise backward stable, and in [1113,
1978] he gives a normwise error analysis of stationary iterative methods. Some
of the assumptions in [1113, 1978] are difficult to justify, as explained by
Higham and Knight [563, 1993].
In [1114, 1980] Wozniakowski analyses a class of conjugate gradient al-
gorithms (which does not include the usual conjugate gradient method). He
obtains a forward error bound proportional to κ(A)^{3/2} and a residual bound
proportional to κ(A), from which neither backward nor forward normwise sta-
bility can be deduced. We note that as part of the analysis in [1114, 1980]
Wozniakowski obtains a residual bound for the steepest descent method that
is proportional to κ(A), and is therefore much weaker than the bound obtained
by Bollen [133, 1984].
Zawilski [1125, 1991] shows that the cyclic Richardson method for sym-
metric positive definite systems is normwise forward stable provided the pa-
rameters are suitably ordered. He also derives a sharp bound for the residual
that includes a factor κ(A), and which therefore shows that the method is not
normwise backward stable.
Arioli and Romani [29, 1992] give a statistical error analysis of stationary
iterative methods. They investigate the relations between a statistically de-
fined asymptotic stability factor, ill conditioning of M- 1 A, where A = M – N
is the splitting, and the rate of convergence.
Greenbaum [479, 1989] presents a detailed error analysis of the conjugate
gradient method, but her concern is with the rate of convergence rather than

the attainable accuracy. An excellent survey of work concerned with the effects
of rounding error on the conjugate gradient method (and the Lanczos method)
is given by Greenbaum and Strakos in the introduction of [480, 1992]; see also
Greenbaum [481, 1994]. Notay [799, 1993] analyses how rounding errors influ-
ence the convergence rate of the conjugate gradient method for matrices with
isolated eigenvalues at the ends of the spectrum. Van der Vorst [1058, 1990]
examines the effect of rounding errors on preconditioned conjugate gradient
methods with incomplete Cholesky preconditioners.
The analysis given in the remainder of this chapter is from Higham and
Knight [563, 1993], [564, 1993], wherein more details are given. Error analysis
of Kaczmarz’s row-action method is given by Knight [663, 1993].

16.2. Forward Error Analysis


A stationary iterative method has the form

    Mx_{k+1} = Nx_k + b,

where A = M − N ∈ R^{n×n} is nonsingular and M is nonsingular. We assume


that the spectral radius ρ(M^{-1}N) < 1, so that in exact arithmetic the itera-
tion converges for any starting vector x0. We are not concerned with the size
of constants in this analysis, so we denote by cn a constant of order n.
The computed vectors satisfy an equality of the form

which we write as
(16.1)
where

We will assume that M is triangular (as is the case for the Jacobi, Gauss–
Seidel, SOR, and Richardson iterations), so that and f k
accounts solely for the errors in forming Hence
(16.2)

Solving the recurrence (16.1) we obtain

(16.3)

where G = M^{-1}N. Since the iteration is stationary at x,

(16.4)

and so the error e_{m+1} := x − x̂_{m+1} satisfies

(16.5)

We have
(16.6)

where µ_k is the bound for ξ_k defined in (16.2). The first term, |G^{m+1}e_0|, is
the error of the iteration in exact arithmetic and is negligible for large m. The
accuracy that can be guaranteed by the analysis is therefore determined by
the last term in (16.6), and it is this term on which the rest of the analysis
focuses.
At this point we can proceed by using further componentwise inequalities
or by using norms. First we consider the norm approach. By taking norms in
(16.6) and defining
(16.7)

we obtain

(16.8)

where the existence of the sum is assured by the result of Problem 16.1.
If = q < 1 then (16.8) yields

Thus if q is not too close to 1 (q < 0.9, say), and γx and are not too
large, a small forward error is guaranteed for sufficiently large m.
Of more interest is the following componentwise development of (16.6).
Defining
(16.9)

so that for all k, we have from (16.2),

(16.10)

Hence (16.6) yields

(16.11)

where, again, the existence of the sum is assured by the result of Problem 16.1.
Since A = M − N = M(I − M^{-1}N) we have

The sum in (16.11) is clearly an upper bound for |A^{-1}|. Defining c(A) ≥ 1 by

(16.12)
we have our final bound

(16.13)

An interesting feature of stationary iteration methods is that if the ele-


ments of M and N are multiples of the elements in the corresponding positions
of A, then any scaling of the form A → D_1AD_2 (D_1, D_2 diagonal) leaves the
eigenvalues of M^{-1}N unchanged; hence the asymptotic
convergence rate is independent of row and column scaling. This scale inde-
pendence applies to the Jacobi and SOR iterations, but not, for example, to
the stationary Richardson iteration, for which M = I. One of the benefits of
doing a componentwise analysis is that under the above assumptions on M
and N the bound (16.13) largely shares the scale independence. In (16.13)
the scalar c(A) is independent of the row and column scaling of A, and the
term |A^{-1}|(|M| + |N|)|x| scales in the same way as x. Furthermore, θ_x can be
expected to depend only mildly on the row and column scaling, because the
bound in (16.2) for the rounding error terms has the correct scaling properties.
What can be said about c(A)? In general, it can be arbitrarily large.
Indeed, c(A) is infinite for the Jacobi and Gauss–Seidel iterations for any
n > 3 if A is the symmetric positive definite matrix with a_{ij} = min(i, j),
because A^{-1} is tridiagonal and (M^{-1}N)^k M^{-1} is not.
If M^{-1} and M^{-1}N both have nonnegative elements then c(A) = 1; as we
will see in the next section, this condition holds in some important instances.
Some further insight into c(A) can be obtained by examining the case
where M^{-1}N is diagonal with eigenvalues λ_i. It is easy to show that
c(A) = max_i |1 − λ_i|/(1 − |λ_i|), so c(A) can be large only if ρ(M^{-1}N) is close
to 1. Although M^{-1}N cannot be diagonal for the Jacobi or Gauss–Seidel

methods, this formula can be taken as being indicative of the size of c(A)
when M– 1 N is diagonalizable with a well-conditioned matrix of eigenvectors.
We therefore have the heuristic inequality, for general A,

    c(A) ≳ max_i |1 − λ_i|/(1 − |λ_i|),    λ_i = λ_i(M^{-1}N).        (16.14)

In practical problems where stationary iteration is used, we would expect


c(A) to be of modest size (O(n), say) for two reasons. First, to achieve
a reasonable convergence rate, ρ(M^{-1}N) has to be safely less than 1, which
implies that the heuristic lower bound (16.14) for c(A) is not too large. Second,
even if A is sparse, A^{-1} will usually be full, and so there are unlikely to be
zeros on the right-hand side of (16.12). (Such zeros are dangerous because
they can make c(A) infinite.)
Note that in (16.13) the only terms that depend on the history of the
iteration are |G^{m+1}e_0| and θ_x. In using this bound we can redefine x_0 to be
any iterate x̂_k, thereby possibly reducing θ_x. This is a circular argument if
used to obtain a priori bounds, but it does suggest that the potentially large
θ_x term will generally be innocuous. Note that if x_i = 0 for some i then θ_x is
infinite unless (x̂_k)_i = 0 for all k. This difficulty with zero components of x
can usually be overcome by redefining

for which the above bounds remain valid if θx is replaced by 2θ x .


Finally, we note that (16.13) implies

(16.15)

If θ_x c(A) = O(1) and |M| + |N| ≤ α|A|, with α = O(1), this bound is of the
form c_n cond(A, x)u as m → ∞, and we have componentwise forward stability.
Now we specialize the forward error bound (16.15) to the Jacobi, Gauss–
Seidel, and SOR iterations.

16.2.1. Jacobi’s Method


For the Jacobi iteration, M = D = diag(A) and N = diag(A) – A. Hence
|M| + |N| = |M – N| = |A|, and so (16.15) yields

(16.16)

If A is an M-matrix then M^{-1} ≥ 0 and M^{-1}N ≥ 0, so c(A) = 1. Hence in


this case we have componentwise forward stability as m → ∞ if θ_x is suitably
bounded.

Table 16.2. Jacobi method, a = 1/2 − 8^{-j}.

         ρ(M^{-1}N)    Iters.    cond(A, x)
j = 1      0.75            90       3.40       2.22e-16    1.27e-16
j = 2      0.97           352       4.76       1.78e-15    9.02e-16
j = 3      0.996         1974       4.97       1.42e-14    7.12e-15
j = 4      1.00         11226       5.00       1.14e-13    5.69e-14
j = 5      1.00         55412       5.00       9.10e-13    4.55e-13

Table 16.3. Jacobi method, a = −(1/2 − 8^{-j}).

         ρ(M^{-1}N)    Iters.    cond(A, x)
j = 1      0.75            39       7.00       4.44e-16    5.55e-17
j = 2      0.97           273       6.30e1     4.88e-15    7.63e-17
j = 3      0.996         1662       5.11e2     4.22e-14    8.24e-17
j = 4      1.00          9051       4.09e3     3.41e-13    8.32e-17
j = 5      1.00         38294       3.28e4     2.73e-12    8.33e-17

Wozniakowski [1113, 1978, Ex. 4.1] cites the symmetric positive definite
matrix

         [ 1   a   a ]
    A =  [ a   1   a ]
         [ a   a   1 ]
as a matrix for which the Jacobi method can be unstable, in the sense that
there exist rounding errors such that no iterate has a relative error bounded by
Let us see what our analysis predicts for this example. Straight-
forward manipulation shows that if a = 1/2 – then
so as (The heuristic lower bound (16.14) is approximately
in this case.) Therefore (16.16) suggests that the Jacobi iteration can
be unstable for this matrix. To confirm the instability we applied the Jacobi
method to the problem with x = [1, 1, 1]^T and a = 1/2 − 8^{-j}, j = 1:5. We
took a random x_0 with ||x − x_0||_2 = 10^{-10}, and the iteration was terminated
when there was no decrease in the norm of the residual for 50 consecutive
iterations. Table 16.2 reports the smallest value of
over all iterations, for each j; the number of iterations is shown in the column
“Iters.”
The ratio takes the values 8.02, 7.98, 8.02,

7.98 for j = 1:4, showing excellent agreement with the behaviour predicted
by (16.16), since Moreover, in these tests and setting
the bound (16.16) is at most a factor 13.3 larger than the observed
error, for each j.
If –1/2 < a < 0 then A is an M-matrix and c(A) = 1. The bound (16.16)
shows that if we set a = −(1/2 − 8^{-j}) and repeat the above experiment then
the Jacobi method will perform in a componentwise forward stable manner
(clearly, is to be expected). We carried out the modified experiment,
obtaining the results shown in Table 16.3. All the values are less
than cond(A,x) u, so the Jacobi iteration is indeed componentwise forward
stable in this case. Note that since ρ(M^{-1}N) and ||M^{-1}N||_2 take the same
values for a and –a, the usual rate of convergence measures cannot distinguish
between these two examples.
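The experiment is simple to repeat in outline; the Python sketch below assumes
the 3 × 3 matrix with unit diagonal and off-diagonal a displayed above and uses
a simplified stopping rule, so it reproduces the tables only qualitatively:

    import numpy as np

    def jacobi_min_error(a, maxit=60000):
        A = np.full((3, 3), a); np.fill_diagonal(A, 1.0)
        x = np.ones(3); b = A @ x
        M = np.diag(np.diag(A)); N = M - A                 # Jacobi splitting
        xk = x + 1e-10 * np.random.default_rng(1).standard_normal(3)
        best = np.inf
        for _ in range(maxit):
            xk = (N @ xk + b) / np.diag(A)
            best = min(best, np.max(np.abs(x - xk) / np.abs(x)))
        return best

    for j in range(1, 6):
        a = 0.5 - 8.0 ** (-j)
        print(j, jacobi_min_error(a), jacobi_min_error(-a))

For a = 1/2 − 8^{-j} the smallest error stagnates well above the unit roundoff
and grows with j, whereas for a = −(1/2 − 8^{-j}) (the M-matrix case) it stays at
the roundoff level, in line with Tables 16.2 and 16.3.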

16.2.2. Successive Overrelaxation


The SOR method can be written in the form Mx_{k+1} = Nx_k + b, where

    M = w^{-1}D + L,    N = (w^{-1} − 1)D − U,

and where A = D + L + U, with L and U strictly lower triangular and upper


triangular, respectively. The matrix |M| + |N| agrees with |A| everywhere
except, possibly, on the diagonal, and the best possible componentwise in-
equality between these two matrices is

    |M| + |N| ≤ f(w)|A|,    where f(w) = max(1, 2/w − 1).        (16.17)

Note that f(w) = 1 for 1 < w < 2, and f(w) → ∞ as w → 0. From (16.15)


we have

If A is an M-matrix and 0 < w < 1 then M^{-1} ≥ 0 and M^{-1}N ≥ 0, so


c(A) = 1. The Gauss–Seidel method corresponds to w = 1, and it is interesting
to note that for this method the forward error bound has exactly the same form
as that for the Jacobi method (though c(A) and θ_x are, of course, different
for the two methods).
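A quick numerical check of (16.17), under the SOR splitting written above (an
illustrative Python fragment, not from the book):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 5))
    D = np.diag(np.diag(A)); L = np.tril(A, -1); U = np.triu(A, 1)
    for w in (0.3, 0.8, 1.0, 1.5, 1.9):
        M = D / w + L
        N = (1.0 / w - 1.0) * D - U
        ratio = (np.abs(M) + np.abs(N)) / np.abs(A)     # elementwise; differs from 1 only on the diagonal
        print(w, ratio.max(), max(1.0, 2.0 / w - 1.0))  # the maximum ratio equals f(w)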

16.3. Backward Error Analysis


We now turn our attention to bounding the residual vector, r_k = b − Ax̂_k.
From (16.3) and (16.4) we find that

It is easy to show that AG^k = H^k A, where H := NM^{-1} (recall that G =
M^{-1}N). Therefore

(16.18)

Taking norms and using (16.2) gives, similarly to (16.8),


(16.19)

where

The following bound shows that σ is small if with q not too


close to 1:

A potentially much smaller bound can be obtained under the assumption that
H is diagonalizable. If H = XDX^{-1}, with D = diag(λ_i), then

(16.20)

Note that so we see the reappearance


of the term in the heuristic bound (16.14). The bound (16.20) is of modest
size if the eigenproblem for H is well conditioned (κ(X) is small) and ρ(H)
is not too close to 1. Note that real eigenvalues of H near +1 do not affect
the bound for σ, even though they may cause slow convergence.
To summarize, (16.19) shows that, for large m, the normwise backward
error for the is certainly no larger than

Note that for the Jacobi and Gauss-Seidel methods,


and also for the SOR method if w > 1.

A componentwise residual bound can also be obtained, but it does not lead
to any identifiable classes of matrix or iteration for which the componentwise
relative backward error is small.
To conclude, we return to our numerical examples. For the SOR example
at the start of the chapter, c(A) = O(10^{45}) and σ = O(10^{30}), so our error
bounds for this problem are all extremely large. In this problem max_i |1 − λ_i|/(1 − |λ_i|)
is small, since every eigenvalue of M^{-1}N equals −1/2, so (16.14) is very weak; (16.20) is
not applicable since M– 1 N is defective.
For the first numerical example in §16.2.1, Table 16.2 reports the minimum
backward errors For this problem it is straightforward to
show that The ratios of backward errors for
successive value of j are 7.10, 7.89, 7.99, 8.00, so we see excellent agreement
with the behaviour predicted by the bounds. Table 16.3 reports the normwise
backward errors for the second numerical example in §16.2.1. The backward
errors are all less than u, which again is close to what the bounds predict,
since it can be shown that σ < 5 for –1/2 < a < 0. In both of the examples
of §16.2.1 the componentwise backward error and in
our practical experience this behaviour is typical of the Jacobi and SOR
iterations.

16.4. Singular Systems


Singular linear systems occur in a variety of applications, including the com-
putation of the stationary distribution vector in a Markov chain [94, 1994],
[647, 1983] and the solution of a Neumann boundary value problem by finite
difference methods [834, 1976]. Because of the structure and the possibly large
dimension of the coefficient matrices in these applications, iterative solution
methods are frequently used. An important question is how the rather deli-
cate convergence properties of the iterative methods are affected by rounding
errors. In this section we extend the analysis of stationary iteration to singular
systems.

16.4.1. Theoretical Background


A useful tool in analysing the behaviour of stationary iteration for a singular
system is the Drazin inverse. This can be defined, for A ∈ ℂ^{n×n}, as the
unique matrix A^D such that

where k = index(A). The index of A is the smallest nonnegative integer k


such that rank(Ak) = rank(A k+1); it is characterized as the dimension of the
largest Jordan block of A with eigenvalue zero. If index(A) = 1 then AD

is also known as the group inverse of A and is denoted by A#. The Drazin
inverse is an “equation-solving inverse” precisely when index(A) ≤ 1, for then
AADA = A, and so if Ax = b is a consistent system then ADb is a solution.
As we will see, however, the Drazin inverse of the coefficient matrix A itself
plays no role in the analysis. The Drazin inverse can be represented explicitly
as follows. If

where P and B are nonsingular and N has only zero eigenvalues, then

Further details of the Drazin inverse can be found in Campbell and Meyer’s
excellent treatise [180, 1979, Chap. 7].
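In symbols, the standard facts just described can be summarized as follows (a brief sketch in the usual notation; see [180, 1979] for the formal statements). The Drazin inverse A^D is the unique matrix satisfying

A^D A A^D = A^D,    A A^D = A^D A,    A^{k+1} A^D = A^k,    k = index(A),

and in the block representation above

A = P diag(B, N) P^{-1}    implies    A^D = P diag(B^{-1}, 0) P^{-1}.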
Let be a singular matrix and consider solving Ax = b by
stationary iteration with a splitting A = M – N, where M is nonsingular.
First, we examine the convergence of the iteration in exact arithmetic. Since
any limit point x of the sequence {x k } must satisfy Mx = Nx + b, or Ax = b,
we restrict our attention to consistent linear systems. (For a thorough analysis
of stationary iteration for inconsistent systems see Dax [269, 1990]. ) As in
the nonsingular case we have the relation (cf. (16.4)):

(16.21)

where G = M^{-1}N. Since A is singular, G has an eigenvalue 1, so G^m does
not tend to zero as m → ∞; that is, G is not convergent. If the iteration
is to converge for all x_0 then lim_{m→∞} G^m must exist. Following Meyer and
Plemmons [752, 1977], we call a matrix B for which lim_{m→∞} B^m exists semi-
convergent.
We assume from this point on that G is semiconvergent. It is easy to see
[94, 1994, Lem. 6.9] that G must have the form

(16.22)

where P is nonsingular and p(Γ) < 1. Hence

To rewrite this limit in terms of G, we note that

(16.23)

and, since I − Γ is nonsingular,

(16.24)

Hence
(16.25)

To evaluate the limit of the second term in (16.21) we note that, since the
system is consistent, M^{-1}b = M^{-1}Ax = (I − G)x, and so

We note in passing that the condition that G is semiconvergent is equivalent


to I – G having index 1, in view of (16.23), but that this condition does not
imply that A = M(I – G) has index 1.
The conclusion is that if G is semiconvergent, stationary iteration con-
verges to a solution of Ax = b that depends on x_0:

(16.26)

The first term in this limit is in null(I – G) and the second term is in range(I –
G). To obtain the unique solution in range(I – G) we should take for x_0 any
vector in range(I – G) (x_0 = 0, say).

16.4.2. Forward Error Analysis


We wish to bound e_{m+1} = x − x̂_{m+1}, where x is the limit in (16.26) cor-
responding to the given starting vector x0. The analysis proceeds as in the
nonsingular case, up to the derivation of equation (16.5):

As before, the first term, G^{m+1}e_0, is negligible for large m, because it


is the error after m + 1 stages of the exact iteration and this error tends
to zero. To obtain a useful bound for the second term, we cannot simply
take norms or absolute values, because grows unboundedly with m
(recall that G has an eigenvalue 1). Our approach is to split the vectors
according to where and

null(I – G); this is a well-defined splitting because range(I – G) and null(I – G)


are complementary subspaces (since index(I – G) = 1, or equivalently, G is
semiconvergent). Using the properties of the splitting the error can be written
as

We achieve the required splitting for ξi via the formulae

where

Hence the error can be written as

(16.27)

Clearly, as m → ∞ the final term in this expression can become unbounded,


but since it grows only linearly in the number of iterations it is unlikely to
have a significant effect in applications where stationary iteration converges
quickly enough to be of practical use.
Now we bound the term

(16.28)

Using inequality (16.2) and the definition of γx in (16.7) and θx in (16.9), we


have

(16.29)

The convergence of the two infinite sums is assured by the result of Prob-
lem 16.1, since by (16.22)-(16.24),
G^iE = G^i(I − G)^D(I − G)

(16.30)

where
We conclude that we have the normwise error bound

(16.31)

On setting E = I we recover the result (16.8) for the nonsingular case. If we


assume that Γ is diagonal, so that P in (16.30) is a matrix of eigenvectors of
G, then

This bound shows that a small forward error is guaranteed if


O(1) and the second largest eigenvalue of G is not too close to 1. (It is this
subdominant eigenvalue that determines the asymptotic rate of convergence
of the iteration.)
Turning to the componentwise case, we see from (16.24) and (16.30) that

Because of the form of the sum in (16.29), this prompts us to define the scalar
c(A) > 1 by

in terms of which we have the componentwise error bound

(16.32)

Again, as a special case we have the result for nonsingular A, (16.13).


To what should we compare this bound? A perturbation result for Ax = b
is given in [564, 1993] that projects the perturbations of A and b into range (I –
G) and thus can be thought of as gauging the effect of perturbations to the
“nonsingular part of the system”. For perturbations of order it gives an
expression

Hence we can deduce conditions on a stationary iterative method that ensure


it is componentwise forward stable, in the sense of yielding a solution whose

error is no larger than the uncertainty in x caused by rounding the data. The
constants θ_x and c(A) should be bounded by d_n, where d_n denotes a slowly
growing function of n; the inequality |M| + |N| < should hold, as it
does for the Jacobi method and for the SOR method when where
β is positive and not too close to zero; and the “exact error” G^{m+1}e_0 must
decay quickly enough to ensure that the term (m + 1)|(I − E)M^{-1}| does not
grow too large before the iteration is terminated.
Numerical results given in [564, 1993] show that the analysis can correctly
predict forward and backward stability, and that for certain problems linear
growth of the component of the error in null(A) can indeed cause an otherwise
convergent iteration to diverge, even when starting very close to a solution.

16.5. Stopping an Iterative Method


What convergence test should be used to stop an iterative linear equation
solver? In this section we explain how backward errors and condition numbers
help answer this question. Note first that most iterative methods for solving
Ax = b compute all or part of a matrix–vector product w = Aυ on each
iteration, and in floating point arithmetic we have

where m is the maximum number of nonzeros per row of A. The method


therefore cannot distinguish between A and A + ∆A where |∆ A| < γm |A|, and
so there is no point in trying to achieve a componentwise relative backward
error less than γm . Of course, instability of a method (or simply lack of
convergence) may pose further restrictions on how small a backward error
can be achieved.
It is worth keeping in mind throughout this discussion that in practical
applications accuracy and stability requirements are often quite modest be-
cause of large errors or uncertainties in the data, or because the iteration is
an “inner step” of some “outer” iterative process. Indeed, one of the advan-
tages of iterative methods is that the slacker the convergence tolerance the
less computational effort is required, though the relation between tolerance
and work depends very much on the method.
Natural stopping criteria for an iterative method are that some measure
of backward error or forward error does not exceed a tolerance. We will
assume that the residual r = b – Ay is available for each iterate y , and that
norms of y, r, and A can be computed or estimated. If r is not computed
directly, but is recurred by the method, as, for example, in the conjugate
gradient method, then the norm of the computed residual may differ from
that of the true residual by several orders of magnitude; clearly, this affects
the way that the stopping tests are interpreted.

From Theorem 7.1 we have the following equivalences, for any subordinate
matrix norm:

(16.33a)
(16.33b)

(16.33c)

These inequalities remain true with norms replaced by absolute values (Theo
rem 7.3), but to evaluate (16.33b) and (16.33c) a matrix–vector product |A||y|
must be computed, which is a nontrivial expense in an iterative method.
Of these tests, (16.33c) is preferred in general, assuming it is acceptable
to perturb both A and b. Note the importance of including both ||A|| and
||y|| in the test on ||r||; a test ||r|| < though scale independent, does
not bound any relative backward error. Test (16.33a) is commonly used in
existing codes, but may be very stringent, and possibly unsatisfiable. To see
why, note that the residual of the rounded exact solution fl(x) = x + ∆ x ,
|∆x| < u|x|, satisfies, for any absolute norm,

and

If A is ill conditioned and x is a large-normed solution (that is, ||x|| ≈
||A^{-1}|| ||b||), so that ||b|| is close to its lower bound, then (16.33a) is much
harder to satisfy than (16.33c).
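As an illustration, a stopping test based on (16.33c) can be coded in MATLAB along the following lines, where the tolerance tol and the use of the ∞-norm are illustrative choices (any convenient norm could be used):

r = b - A*y;                    % residual of the current iterate y
eta = norm(r, inf) / (norm(A, inf)*norm(y, inf) + norm(b, inf));
if eta <= tol
    converged = true;           % y solves (A+dA)y = b+db with ||dA|| <= tol*||A||, ||db|| <= tol*||b||
end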
If the forward error is to be bounded, then, for a nonsingular problem, tests
can be derived involving the residual and A^{-1}: the equality x − y = A^{-1}r
leads to normwise and componentwise forward error bounds, such as ||x −
y||/||y|| ≤ ||A^{-1}|| ||r||/||y||. Since these bounds involve A^{-1}, they are nontrivial
to compute. Some iterative methods automatically produce estimates of the
extremal eigenvalues, and hence of For large,
sparse symmetric positive definite matrices ||A^{-1}||_2 can be cheaply estimated
using the Lanczos method. Another possibility is to use condition estimation
techniques (Chapter 14).
The discussion in this section has inevitably been very general. Other
considerations in practice include detecting nonconvergence of a method (due
to rounding errors or otherwise), adapting the tests to accommodate a pre-
conditioner (the residual r provided by the method may now be that for the
preconditioned system), and using a norm that corresponds to a quantity be-
ing minimized by the method (a norm that may be nontrivial to compute).

16.6. Notes and References


The Gauss–Seidel method was chosen by Wilkinson [1080, 1948] as an “ex-
ample of coding” for the ACE. Speaking of experience at that time at the
National Physical Laboratory, he explained that “In general, direct methods
have been used on those equations which did not yield readily to relaxation,
and hence those solved by direct methods have nearly always been of an ill-
conditioned type”.
Stationary iterative methods are relatively easy to program, although there
are many issues to consider when complicated data structures or parallel ma-
chines are used. A good source of straightforward C, Fortran, and MATLAB
implementations of the Jacobi, Gauss–Seidel, and SOR methods, and other
nonstationary iterative methods, is the book Templates for the Solution of
Linear Systems [70, 1994]; the software is available from netlib. The book
contains theoretical discussions of the methods, together with practical ad-
vice on topics such as data structures and the choice of a stopping criterion.
The choice of stopping criteria for iterative methods is also discussed by Arioli,
Duff, and Ruiz [28, 1992].
An up-to-date textbook on iterative methods is Axelsson [34, 1994].

Problems
16.1. Show that if and p(B) < 1, then the series and
are both convergent, where ||·|| is any norm.
16.2. (Descloux [304, 1963]) Consider the (nonlinear) iterative process

where satisfies
(16.34)
for some and where ||ek|| < a for all k. Note that a must satisfy
G(a) = a.
(a) Show that

(b) Show that the sequence {x k} is bounded and its points of accumulation
x satisfy

(c) Explain the practical relevance of the result of (b).



Chapter 17
Matrix Powers

Unfortunately, the roundoff errors in the mth power of a matrix, say B^m,
are usually small relative to ||B||^m rather than ||B^m||.
— CLEVE B. MOLER and CHARLES F. VAN LOAN,
Nineteen Dubious Ways to Compute the Exponential of a Matrix (1978)

It is the size of the hump that matters:


the behavior of ||p(A∆t)^n|| = ||p(A∆t)^{t/∆t}|| for small but nonzero ∆t.
The eigenvalues and the norm, by contrast, give sharp information
only about the limits t → ∞ or t → 0.
— DESMOND J. HIGHAM and LLOYD N. TREFETHEN,
Stiffness of ODEs (1993)


Powers of matrices occur in many areas of numerical analysis. One approach


to proving convergence of multistep methods for solving differential equations
is to show that a certain parameter-dependent matrix is uniformly “power
bounded” [493, 1991 , §V.7], [862, 1992 ]. Stationary iterative methods for
solving linear equations converge precisely when the powers of the iteration
matrix converge to zero. And the power method for computing the largest
eigenvalue of a matrix computes the action of powers of the matrix on a vector.
It is therefore important to understand the behaviour of matrix powers, in
both exact and finite precision arithmetic.
It is well known that the powers A^k of A ∈ ℂ^{n×n} tend to zero as k → ∞ if
p(A) < 1, where p is the spectral radius. However, this simple statement does
not tell the whole story. Figure 17.1 plots the 2-norms of the first 30 powers of
a certain 3 × 3 matrix with p(A) = 0.75. The powers do eventually decay, but
initially they grow rapidly. (In this and other similar plots, k on the x -axis
is plotted against ||fl(A^k)||_2 on the y-axis, and the norm values are joined
by straight lines for plotting purposes.) Figure 17.2 plots the 2-norms of the
first 250 powers of a 14 × 14 nilpotent matrix C_14 discussed by Trefethen
and Trummer [1020, 1987] (see §17.2 for details). The plot illustrates the
statement of these authors that the matrix is not power bounded in floating
point arithmetic, even though its 14th power should be zero.
These examples suggest two important questions.

• For a matrix with p(A) < 1, how does the sequence {||A^k||} behave? In
particular, what is the size of the “hump” max_k ||A^k||?

• What conditions on A ensure that fl(A^k) → 0 as k → ∞?

We examine these questions in the next two sections.

17.1. Matrix Powers in Exact Arithmetic

In exact arithmetic the limiting behaviour of the powers of A ∈ ℂ^{n×n} is

determined by A's eigenvalues. As already noted, if p(A) < 1 then A^k → 0
as k → ∞; if p(A) > 1, ||A^k|| → ∞ as k → ∞. In the remaining case of
p(A) = 1, ||A^k|| → ∞ if A has a defective eigenvalue of modulus 1;
A^k does not converge if A has a nondefective eigenvalue of modulus 1 other than 1
(although the norms of the powers may converge); otherwise, the only
eigenvalue of modulus 1 is the nondefective eigenvalue 1 and A^k converges
to a nonzero matrix. These statements are easily proved using the Jordan
canonical form
(17.1a)

Figure 17.1. A typical hump for a convergent, nonnormal matrix.

Figure 17.2. Diverging powers of a nilpotent matrix, C 1 4 .



where X is nonsingular and

(17.1b)

where n_1 + n_2 + · · · + n_s = n. We will call a matrix for which A^k → 0 as

k → ∞ (or equivalently, p(A) < 1) a convergent matrix.
The norm of a convergent matrix can be arbitrarily large, as is shown
trivially by the scaled Jordan block

(17.2)

with |λ| < 1 and a ≫ 1. While the spectral radius determines the asymptotic
rate of growth of matrix powers, the norm influences the initial behaviour of
the powers. The interesting result that p(A) = lim_{k→∞} ||A^k||^{1/k} for any norm
(see Horn and Johnson [580, 1985, p. 299], for example) confirms the asymp-
totic role of the spectral radius. This formula for p(A) has actually been con-
sidered as a means for computing it; see Wilkinson [1089, 1965, pp. 615–617]
and Friedland [408, 1991].
An important quantity is the “hump” max_k ||A^k||, which can be arbitrarily
large for a convergent matrix. Figure 17.1 shows the hump for the 3 × 3
upper triangular matrix with diagonal entries 3/4 and off-diagonal entries 2;
this matrix has 2-norm 3.57. The shape of the plot is typical of that for a
convergent matrix with norm bigger than 1. Note that if A is normal (so
that in (17.1a) J is diagonal and X can be taken to be unitary) we have
||A^k||_2 = ||J^k||_2 = p(A)^k, so the problem of bounding ||A^k||
is of interest only for nonnormal matrices. The hump phenomenon arises in
various areas of numerical analysis. For example, it is discussed for matrix
powers in the context of stiff differential equations by D. J. Higham and
Trefethen [529, 1993], and by Moler and Van Loan [775, 1978] for the matrix
exponential e^{At} with t ≥ 0.
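A plot such as Figure 17.1 is easily reproduced; the following MATLAB sketch computes the 2-norms of the first 30 powers of the 3 × 3 matrix just described (the number of powers and the plotting details are incidental):

A = 0.75*eye(3) + 2*triu(ones(3), 1);   % diagonal 3/4, off-diagonal entries 2
nrm = zeros(30, 1);
X = eye(3);
for k = 1:30
    X = A*X;                            % repeated multiplication
    nrm(k) = norm(X, 2);
end
plot(1:30, nrm)                         % the norms rise to a hump, then decay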
More insight into the behaviour of matrix powers can be gained by con-
sidering the 2 × 2 matrix (17.2) with and a > 0. We have

and
(17.3)

Figure 17.3. Infinity norms of powers of 2 × 2 matrix J in (17.2), for λ = 0.99 and
a = 0 (bottom line) and a = 10 -k, k = 0:3.

Hence

It follows that the norms of the powers can increase for arbitrarily many steps
until they ultimately decrease. Moreover, because kλ^{k−1} tends to zero quite
slowly as k → ∞, the rate of convergence of to zero can be much
slower than the convergence of to zero (see (17.3)) when λ is close to 1. In
other words, nontrivial Jordan blocks retard the convergence to zero.
For this 2 × 2 matrix, the hump maxk is easily shown to be approx-
imately

where this value being attained for


Figure 17.3 displays the norms of the first 400 powers of the matrices with
λ = 0.99 and a = 0, 0.001, 0.01, 0.1, 1. The size and location of the hump are
complicated expressions even in this simple case. When we generalize to direct
sums of larger Jordan blocks and incorporate a similarity transformation,
giving (17.1a), the qualitative behaviour of the powers becomes too difficult
to describe precisely.
In the rest of this section we briefly survey bounds for ||A^k||. First, how-
ever, we comment on the condition number κ(X) that appears

in various bounds in this chapter. The matrix X in the Jordan form (17.1a)
is by no means unique [413, 1959, pp. 220–221], [467, 1976]: if A has distinct
eigenvalues (hence J is diagonal) then X can be replaced by XD , for any
nonsingular diagonal D, while if A has repeated eigenvalues then X can be
replaced by XT, where T is a block matrix with block structure conformal
with that of J and which contains some arbitrary upper trapezoidal Toeplitz
blocks. We adopt the convention that κ(X) denotes the minimum possible
value of κ(X) over all possible choices of X. This minimum value is not
known for general A, and the best we can hope is to obtain a good estimate
of it. However, if A has distinct eigenvalues then the results in Theorems 7.5
and 7.7 on diagonal scalings are applicable and enable us to determine (an
approximation to) the minimal condition number. Explicit expressions can be
given for the minimal 2-norm condition number for n = 2; see Young [1122,
1971, §3.8].
A trivial bound is ||A^k|| ≤ ||A||^k. A sharper bound can be derived in terms
of the numerical radius

which is the point of largest modulus in the field of values of A. It is not hard
to show that ||A||_2/2 ≤ r(A) ≤ ||A||_2 [580, 1985, p. 331]. The (nontrivial)
inequality r(A^k) ≤ r(A)^k [580, 1985, p. 333] leads to the bound

If A is diagonalizable then, from (17.1a), we have the bound

||A^k||_p ≤ κ_p(X) p(A)^k,                    (17.4)

for any p-norm. (Since p(A) ≤ ||A|| for any norm, we also have the lower bound
p(A)^k ≤ ||A^k||_p.) This bound is unsatisfactory for two reasons. First, by
choosing A to have well-conditioned large eigenvalues and ill-conditioned small
eigenvalues we can make the bound arbitrarily pessimistic (see Problem 17.1).
Second, it models norms of powers of convergent matrices as monotonically
decreasing sequences, which is qualitatively incorrect if there is a large hump.
The Jordan canonical form can also be used to bound the norms of the
powers of a defective matrix. If XJX^{-1} is the Jordan canonical form of δ^{-1}A
then
(17.5)
for all δ > 0. This is a special case of a result of Ostrowski [812, 1973,
Thm. 20.1] and the proof is straightforward: we can write δ^{-1}A = X(δ^{-1}D +
M)X^{-1}, where D = diag(λ_i) and M is the off-diagonal part of the Jordan

form. Then A = X(D + δM)X^{-1}, and (17.5) follows by taking norms. An


alternative way of writing this bound is

where A = XJX–1 and D = Note that this is not


the same X as in (17.5): multiplying A by a scalar changes κ(X) when A is
not diagonalizable. Both bounds suffer from the same problems as the bound
(17.4) for diagonalizable matrices.
Another bound in terms of the Jordan canonical form (17.1) of A is given
by Gautschi [430, 1953]. For convergent matrices, it can be written in the
form
(17.6)
where p = and c is a constant depending only on A (c is not
defined explicitly in [430, 1953]). The factor k^{p−1} makes this bound somewhat
more effective at predicting the shapes of the actual curve than (17.5), but
again c can be unsuitably large.
Since the norm estimation problem is trivial for normal matrices, it is
natural to look for bounds that involve a measure of nonnormality. Consider
the Schur decomposition Q*AQ = D+N, where N is strictly upper triangular,
and let S represent the set of all such N. The nonnormality of A can be
measured by Henrici’s departure from normality [516, 1962]

For the Frobenius norm, Henrici shows that ||N||F is independent of the par-
ticular Schur form and that

László [690, 1994] has recently shown that ∆F(A) is within a constant factor
of the distance from A to the nearest normal matrix:

where v(A) = min{||E|| F : A + E is normal}. Henrici uses the departure


from normality to derive the 2-norm bounds

(17.7)
Empirical evidence suggests that the first bound in (17.7) can be very pes-
simistic. However, for normal matrices both the bounds are equalities.
Another bound involving nonnormality is given by Golub and Van Loan [470,
1989, Lem. 7.3.2]. They show that, in the above notation,

for any θ > 0. This bound is an analogue of (17.5) with the Schur form
replacing the Jordan form. Again, there is equality when A is normal (if we
set θ = 0).
To compare bounds based on the Schur form with ones based on the Jordan
form we need to compare ∆(A) with κ(X). If A is diagonalizable then [710,
1969, Thm. 4]

it can be shown by a 2 × 2 example that min_X κ_2(X) can exceed ∆_F(A)/||A||_F


by an arbitrary factor [201, 1993, §4.2.7], [190, 1996, §9.1.1].
Another tool that can be used to bound the norms of powers is the pseu-
dospectrum of a matrix, popularized by Trefethen [1017, 1992], [1018]. The
ε-pseudospectrum of A is defined, for a given ε > 0, to be the set

and it can also be represented, in terms of the resolvent (zI − A)^{-1}, as

As Trefethen notes [1017, 1992], by using the Cauchy integral representation


of Ak (which involves a contour integral of the resolvent) one can show that

(17.8)
where the
(17.9)

This bound is very similar in flavour to (17.5). The difficulty is transferred


from estimating κ(X) to choosing and estimating
Bai, Demmel, and Gu [39, 1994] consider A with p (A) < 1 and manipu-
late the Cauchy integral representation of Ak in a slightly different way from
Trefethen to produce a bound in terms of the distance to the nearest unstable
matrix,

Their bound is

where e < a_m := (1 + 1/m)^{m+1} ≤ 4. Note that d(A) < 1 when p(A) < 1, as
is easily seen from the Schur decomposition. The distance d(A) is not easy to
compute. One approach is a bisection technique of Byers [175, 1988].
Finally, we mention that the Kreiss matrix theorem provides a good esti-
mate of sup_{k≥0} ||A^k|| for a general A ∈ ℂ^{n×n}, albeit in terms of an expression
that involves the resolvent and is not easy to compute:

where φ(A) = sup{(|z| − 1)||(zI − A)^{-1}||_2 : |z| > 1} and e = exp(1). Details
and references are given by Wegert and Trefethen [1071, 1994].

17.2. Bounds for Finite Precision Arithmetic


The formulae A · A^k or A^k · A can be implemented in several ways, corre-
sponding to different loop orderings in each individual product, but as long
as each product is formed using the standard formula (AB)_{ij} = Σ_k a_{ik}b_{kj},
all these variations satisfy the same rounding error bounds. We do not
analyse here the use of the binary powering technique, where, for exam-
ple, A^9 is formed as A((A^2)^2)^2; alternate multiplication on the left and right
(fl(A^k) = fl(A fl(A^{k−2}) A)); or fast matrix multiplication techniques such as
Strassen’s method. None of these methods is equivalent to repeated multipli-
cation in finite precision arithmetic.
We suppose, without loss of generality, that the columns of A^m are com-
puted one at a time, the jth as fl(A(A(. . . (Ae_j) . . .))), where e_j is the jth
unit vector. The error analysis for matrix–vector multiplication shows that
the jth computed column of A^m satisfies
(17.10)
where, for both real and complex matrices, we have (Problem 3.7)
(17.11)
It follows that

and so a sufficient condition for convergence of the computed powers is that

(17.12)

This result is useful in certain special cases: p(|A|) = p(A) if A is triangular


or has a checkerboard sign pattern (since then |A| = DAD^{-1} where D =
diag(±1)); if A is normal then p(|A|) ≤ n^{1/2}p(A) (this bound being attained
for a Hadamard matrix); and in Markov processes, where the a_{ij} are transition
probabilities, |A| = A. However, in general p(|A|) can exceed p(A) by an
arbitrary factor (see Problem 17.2).
To obtain sharper and more informative results it is necessary to use more
information about the matrix. In the following theorem we give a sufficient
condition, based on the Jordan canonical form, for the computed powers of a
matrix to converge to zero. Although the Jordan form is usually eschewed by
numerical analysts because of its sensitivity to perturbations, it is convenient
to work with in this application and leads to an informative result.

Theorem 17.1 (Higham and Knight). Let A ∈ ℂ^{n×n} with the Jordan form
(17.1) have spectral radius p(A) < 1. A sufficient condition for fl(A^m) → 0
as m → ∞ is
(17.13)
where t = max_i n_i.
Proof. It is easy to see that if we can find a nonsingular matrix S such
that
(17.14)
for all i, then the product

tends to 0 as m → ∞. In the rest of the proof we construct such a matrix S


for the ∆Ai in (17.10).

Now consider the matrix Its ith diagonal block is of the form
where the only nonzeros in N are 1s on the first super-
diagonal, and so

Defining S = we have and


(17.15)
Now we set where 0 < θ < 1 and we determine θ so that
(17.14) is satisfied, that is, so that for all i. From (17.11)
and (17.15) we have

Therefore (17.14) is satisfied if

that is, if

If the integer t is greater than 1 then the function f(θ) = (1 − θ)^{t−1}θ has

a maximum on [0,1] at θ* = t^{−1} and f(θ*) = (t − 1)^{−1}(1 − t^{−1})^t satisfies
(4(t − 1))^{−1} < f(θ*) < e^{−1}. We conclude that for all integers 1 < t < n.

is sufficient to ensure that (17.14) holds.


If A is normal then ||A||2 = p(A) < 1, t = 1, and κ2 (X) = 1, so (17.13)
can be written in the form

where cn is a constant depending on n. This condition is also easily derived


by taking 2-norms in (17.10) and (17.11).
We can show the sharpness of the condition in Theorem 17.1 by using the
Chebyshev spectral differentiation matrix Cn described by Trefethen
and Trummer [1020, 1987]. The matrix Cn arises from degree n – 1 polynomial
interpolation of n arbitrary data values at n Chebyshev points, including a
boundary condition at 1. It is nilpotent and is similar to a single Jordan block
of dimension n. We generate Cn in MATLAB using the routine chebspec from
the Test Matrix Toolbox (see Appendix E). Figure 17.4 shows the norms of
the powers of four variants of the Cn matrix.
The powers of C_8 converge to zero, while the powers of 15C_8 diverge.
Using a technique for estimating κ_2(X) described in [565, 1995], we find that
1.08 × 10^{−9}, which is safely less than 1, so that Theorem 17.1
predicts convergence. For 15C_8 we have 2.7, so the theorem
correctly does not predict convergence.
Next, for the matrix A = C_13 + 0.36I, whose powers diverge, we have
13.05, and for A = C_13 + 0.01I, whose powers
converge, 0.01, so again the theorem is reasonably
sharp.
The plots reveal interesting scalloping patterns in the curves of the norms.
For C8 and 15C8 the dips are every 8 powers, but the point of first dip and
the dipping intervals are altered by adding different multiples of the identity
matrix, as shown by the C13 examples. Explaining this behaviour is an open
problem (see Problem 17.3).

Figure 17.4. Computed powers of chebspec matrices.

We saw in the last section that the powers of A can be bounded in terms
of the pseudospectral radius. Can the pseudospectrum provide information
about the behaviour of the computed powers? Figure 17.5 shows approxi-
mations to the for the matrices used in Figure 17.4, where
the (computed) eigenvalues are plotted as crosses “×”. We see
that the pseudospectrum lies inside the unit disc precisely when the powers
converge to zero.
A heuristic argument based on (17.10) and (17.11) suggests that, if for ran-
domly chosen perturbations ∆Ai with ||∆Ai || < cnu||A||, most of the eigen-
values of the perturbed matrices lie outside the unit disc, then we can expect
a high percentage of the terms A + ∆Ai in (17.10) to have spectral radius
bigger than 1 and hence we can expect the product to diverge. On the other
hand, if the cnu||A||-pseudospectrum is wholly contained within the unit disc,
each A + ∆Ai will have spectral radius less than 1 and the product can be
expected to converge. (Note, however, that if p(A) < 1 and p (B) < 1 it is not
necessarily the case that p(AB) < 1.) To make this heuristic precise, we need
an analogue of Theorem 17.1 phrased in terms of the pseudospectrum rather
than the Jordan form.

Theorem 17.2 (Higham and Knight). Suppose that is diagonal-


izable with A = and has a unique eigenvalue of largest
modulus. Suppose that and where

Figure 17.5. Pseudospectra for chebspec matrices.

X -1 = (yij). If < 1 for = cnu||A||2, where cn is a constant de-


pending only on n, then, provided that a certain term can be ignored,
fl(Am ) = 0.

Proof. It can be shown (see [565, 1995]) that the conditions on ||X|| 1 and
imply there is a perturbation à = A + ∆A of A with ||∆ A|| 2 =
such that

Hence, if < 1 then Ignoring the


term and rearranging gives

Using Theorem 17.1 we have the required result for cn = 4 n 2 (n + 2), since
t = 1.
Suppose we compute the eigenvalues of A by a backward stable algorithm,
that is, one that yields the exact eigenvalues of A+E, where ||E||2 < cnu||A|| 2 ,
with cn a modest constant. (An example of such an algorithm is the QR
algorithm [470, 1989, §7.5]). Then the computed spectral radius satisfies <
In view of Theorem 17.2 we can formulate a rule of thumb, one

that bears a pleasing symmetry with the theoretical condition for convergence:

The computed powers of A can be expected to converge to 0 if the


spectral radius computed via a backward stable eigensolver is less
than 1.
This rule of thumb has also been discussed by Trefethen and Trummer [1020,
1987] and Reichel and Trefethen [866, 1992]. In our experience the rule of
thumb is fairly reliable when is not too close to 1. For the matrices used in
our examples we have, using MATLAB’S eig function,

and we observed convergence of the computed powers for C 8 and C13 + 0.01I
and divergence for the other matrices.
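The computations behind this comparison are easily repeated. The MATLAB sketch below assumes the chebspec routine is available (in later versions of MATLAB the same matrix can be generated as gallery('chebspec', n)) and applies the rule of thumb by comparing the computed spectral radius with 1:

C8  = chebspec(8);                      % or gallery('chebspec', 8)
C13 = chebspec(13);
tests = {C8, 15*C8, C13 + 0.36*eye(13), C13 + 0.01*eye(13)};
for i = 1:length(tests)
    rho = max(abs(eig(tests{i})));      % spectral radius from a backward stable eigensolver
    fprintf('computed spectral radius = %9.2e\n', rho)
end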

17.3. Application to Stationary Iteration


As we saw in the previous chapter, the errors in stationary iteration satisfy
e_k = (M^{-1}N)^ke_0, so convergence of the iteration depends on the convergence
of (M^{-1}N)^k to zero as k → ∞. While the errors in stationary iteration are not
precisely modelled by the errors in matrix powering, because matrix powers
are not formed explicitly, the behaviour of the computed powers fl((M^{-1}N)^k)
can be expected to give some insight into the behaviour of stationary iteration.
For the successive overrelaxation (SOR) example at the start of Chap-
ter 16, the matrix G = M^{-1}N is lower triangular with g_{ij} = 0.5(−1)^{i−j}.
The computed powers of G in MATLAB reach a maximum norm of 10^{28}
at k = 99 and then decay to zero; the eventual convergence is inevitable in
view of the condition (17.12), which clearly is satisfied for this triangular G.
An approximation to the u||G||2-pseudospectrum is plotted in Figure 17.6,
and we see clearly that part of the pseudospectrum lies outside the unit disk.
These facts are consistent with the nonconvergence of the SOR iteration (see
Figure 16.1).
That the pseudospectrum of G gives insight into the behaviour of station-
ary iteration has also been observed by Trefethen [1015, 1990], [1017, 1992],
[1018] and Chatelin and Frayssé [203, 1992], but no rigorous results about the
connection are available.

17.4. Notes and References


This chapter is based closely on Higham and Knight [565, 1995].

Figure 17.6. Pseudospectrum for SOR iteration matrix.

The analysis for the powers of the matrix (17.2) is modelled on that of
Stewart [953, 1994], who uses the matrix to construct a Markov chain whose
second largest eigenvalue does not correctly predict the decay of the transient.
For some results on the asymptotic behaviour of the powers of a nonneg-
ative matrix, see Friedland and Schneider [409, 1980].
Another application of matrix powering is in the scaling and squaring
method for computing the matrix exponential, which uses the identity e^A =
(e^{A/m})^m together with a Taylor or Padé approximation to e^{A/m}; see Moler
and Van Loan [775, 1978].

Problems
17.1. Let A ∈ ℂ^{n×n} be diagonalizable: A = XΛX^{-1}, Λ = diag(λ_i). Con-
struct a parametrized example to show that the bound
can be arbitrarily weak.
17.2. Show that p(|A|)/p(A) can be arbitrarily large for
17.3. (RESEARCH PROBLEM) Explain the scalloping patterns in the curves of
norms of powers of a matrix, as seen, for example, in Figure 17.4. (Consider
exact arithmetic, as the phenomenon is not rounding error dependent. )
17.4. (RESEARCH PROBLEM) Obtain a sharp sufficient condition for fl(A^k) →
0 in terms of the Schur decomposition of A ∈ ℂ^{n×n} with p(A) < 1.

Chapter 18
QR Factorization

Any orthogonal matrix can be written as the product of reflector matrices.


Thus the class of reflections is rich enough for all occasions
and yet each member is characterized by a single vector
which serves to describe its mirror.
— BERESFORD N. PARLETT, The Symmetric Eigenvalue Problem (1980)

A key observation for understanding the numerical properties of the


modified Gram–Schmidt algorithm is that it can be interpreted as
Householder QR factorization applied to the matrix A
augmented with a square matrix of zero elements on top.
These two algorithms are not only mathematically . . .
but also numerically equivalent.
This key observation, apparently by Charles Sheffield,
was relayed to the author in 1968 by Gene Golub.
— AKE BJÖRCK, Numerics of Gram-Schmidt Orthogonalization (1994)

The great stability of unitary transformations in numerical analysis


springs from the fact that both the 2-norm
and the Frobenius norm are unitarily invariant.
This means in practice that even when rounding errors are made,
no substantial growth takes place in the
norms of the successive transformed matrices.
— J. H. WILKINSON,
Error Analysis of Transformations Based on the
Use of Matrices of the Form I – 2wwH (1965)


The QR factorization is a versatile computational tool that finds use in lin-


ear equation, least squares and eigenvalue problems. It can be computed in
several ways, including by the use of Householder transformations and Givens
rotations, and by the Gram–Schmidt method. We explore the numerical prop-
erties of all three methods in this chapter. We also examine the use of iterative
refinement on a linear system solved with a QR factorization and consider the
inherent sensitivity of the QR factorization.

18.1. Householder Transformations


A Householder matrix (also known as a Householder transformation, or House-
holder reflector) is a matrix of the form

P = I − (2/(v^Tv))vv^T,   v ≠ 0.

It enjoys the properties of symmetry and orthogonality, and, consequently, is
involutory (P^2 = I). The application of P to a vector yields

Figure 18.1 illustrates this formula and makes it clear why P is sometimes
called a Householder reflector: it reflects x about the hyperplane
Householder matrices are powerful tools for introducing zeros into vectors.
Consider the question “given x and y can we find a Householder matrix P such
that Px = y?” Since P is orthogonal we clearly require that ||x|| 2 = ||y||2 .
Now

and this last equation has the form aυ = x – y for some a. But P is indepen-
dent of the scaling of v, so we can set a = 1.
With υ = x – y we have

and, since xTx = yTy,

Therefore

as required. We conclude that, provided ||x||2 = ||y||2, we can find a House-


holder matrix P such that Px = y. (Strictly speaking, we have to exclude the
case x = y, which would require υ = 0, making P undefined).

Figure 18.1. Householder matrix P times vector x.

Normally we choose y to have a special pattern of zeros. The usual choice


is y = σe_1 where σ = ±||x||_2, which yields the maximum number of zeros in
y. Then

We choose sign(σ) = −sign(x_1) to avoid cancellation in the expression for v_1.
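The construction just described can be coded directly; the following MATLAB sketch (with the illustrative name house_sketch) returns v and σ such that (I − 2vv^T/(v^Tv))x = σe_1:

function [v, sigma] = house_sketch(x)
% Form a Householder vector v and scalar sigma with Px = sigma*e_1,
% where P = I - 2*v*v'/(v'*v); the sign of sigma is chosen as
% sign(sigma) = -sign(x(1)) to avoid cancellation in v(1).
s = sign(x(1));
if s == 0, s = 1; end           % either sign will do when x(1) = 0
sigma = -s*norm(x);
v = x;
v(1) = x(1) - sigma;            % v = x - sigma*e_1
end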

18.2. QR Factorization

A QR factorization of A ∈ ℝ^{m×n} with m ≥ n is a factorization A = QR,

where Q ∈ ℝ^{m×m} is orthogonal and R = [R_1^T 0]^T with R_1 ∈ ℝ^{n×n} upper triangular. The


matrix R is called upper trapezoidal, since the term triangular applies only to
square matrices. Depending on the context, either the full factorization A =
QR or the “economy size” version A = Q 1 R 1 can be called a QR factorization.
A quick existence proof of the QR factorization is provided by the Cholesky
factorization: if A has full rank and A^TA = R^TR is a Cholesky factorization,
then A = AR^{-1} · R is a QR factorization. The QR factorization is unique
if A has full rank and we require R to have positive diagonal elements (A =
QD · DR is a QR factorization for any D = diag(±1)).
The QR factorization can be computed by premultiplying the given ma-
trix by a suitably chosen sequence of Householder matrices. The process is

illustrated for a generic 4 × 3 matrix as follows:

The general process is adequately described by the k th stage of the re-


duction to triangular form. With A1 = A we have, at the start of the kth
stage,

where R_{k−1} is upper triangular. Choose a Householder matrix such that


and embed into an m × m matrix

(18.2)

Then let A_{k+1} = P_kA_k. Overall, we obtain R = P_nP_{n−1} . . . P_1A =: Q^TA

(P_n = I if m = n).
To compute A_{k+1} we need to form P_kA_k. We can write

which shows that the matrix product can be formed as a matrix–vector prod-
uct followed by an outer product. This approach is much more efficient than
forming explicitly and doing a matrix multiplication.
The overall cost of the Householder reduction to triangular form is 2n^2(m −
n/3) flops. The explicit formation of Q requires a further 4(m^2n − mn^2 + n^3/3)
flops, but for many applications it suffices to leave Q in factored form.
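The following MATLAB sketch carries out the reduction just described, applying each P_k as a matrix–vector product followed by an outer product and leaving Q in factored form (the Householder vectors are returned in the columns of V; the function name is illustrative):

function [R, V] = houseqr_sketch(A)
% Householder QR factorization: P_n*...*P_1*A = R with P_k = I - beta*v*v'.
[m, n] = size(A);
V = zeros(m, n);
for k = 1:min(n, m-1)
    x = A(k:m, k);
    if norm(x) == 0, continue, end
    s = sign(x(1));  if s == 0, s = 1; end
    v = x;  v(1) = x(1) + s*norm(x);     % sign chosen to avoid cancellation
    beta = 2/(v'*v);
    A(k:m, k:n) = A(k:m, k:n) - (beta*v)*(v'*A(k:m, k:n));   % matrix-vector product, then outer product
    V(k:m, k) = v;
end
R = triu(A);
end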

18.3. Error Analysis of Householder Computations


It is well known that computations with Householder matrices are very sta-
ble. Wilkinson showed that the computation of a Householder vector, and
the application of a Householder matrix to a given matrix, are both normwise
stable, in the sense that the computed Householder vector is very close to

the exact one and the computed update is the exact update of a tiny norm-
wise perturbation of the original matrix [1089, 1965, pp. 153–162, 236], [1090,
1965]. Wilkinson also showed that the Householder QR factorization algo-
rithm is normwise backward stable [1089, p. 236]. In this section we give a
combined componentwise and normwise error analysis of Householder matrix
computations. The componentwise bounds provide extra information over
the normwise ones that is essential in certain applications (for example, the
analysis of iterative refinement).

Lemma 18.1. Let Consider the following construction of


and such that Px = σe_1, where P = I − βvv^T is a Householder matrix
with β = 2/(v^Tv):

In floating point arithmetic the computed and satisfy


and

where |θk| < γk .

Proof. We sketch the proof. Each occurrence of δ denotes a different


number bounded by |δ| < u. We compute fl(x^Tx) = (1 + θ_n)x^Tx, and then
(the latter term
1 + θ_{n+1} is suboptimal, but our main aim is to keep the analysis simple).
Hence
For notational convenience, define w = υ 1 + s. We have
(essentially because there is no cancellation in the sum).
Hence

For convenience we will henceforth write Householder matrices in the form


I − vv^T, which requires ||v||_2 = √2 and amounts to redefining v :=

and β := 1 in the representation of Lemma 18.1. We can then write, using


Lemma 18.1,

(18.3)

where, as required for the next two results, the dimension is now m. Here, we
have introduced the generic constant

in which c denotes a small integer constant whose exact value is unimportant.


We will make frequent use of γcm in the rest of this chapter, because it is not
worthwhile to evaluate the integer constants in our bounds explicitly. Because
we are not “chasing constants” we can afford to be somewhat cavalier and
freely use inequalities such as

(recall Lemma 3.3), where we signify by the prime that the constant c on the
right-hand side is different from that on the left.
The next result describes the application of a Householder matrix to a
vector, and is the basis of all the subsequent analysis. In the applications of
interest P is defined as in Lemma 18.1, but we will allow P to be an arbitrary
Householder matrix. Thus v is an arbitrary, normalized vector, and the only
assumption we make is that the computed satisfies (18.3).

Lemma 18.2. Let and consider the computation of y = = (I –


when satisfies (18.3): The computed satisfies

where P = I - vvT.
Proof. (Cf. the proof of Lemma 3.8,) We have

where and |∆b| < γm|b|. Hence

where |∆w| < γc m |υ||υT||b|. Then

We have

Hence where But then


where satisfies
Next, we consider a sequence of Householder transformations applied to a
matrix. Again, each Householder matrix is arbitrary and need have no con-
nection to the matrix to which it is being applied. In the cases of interest, the
Householder matrices Pk have the form (18.2), and so are of ever-decreasing
effective dimension, but to exploit this property would not lead to any signif-
icant improvement in the bounds.
For the remaining results in this section, we make the (reasonable) as-
sumption that an inequality of the form

(18.4)

holds, where r is the number of Householder transformations and it is implicit


that G and H denote nonnegative matrices.

Lemma 18.3. Consider the sequence of transformations

where A1 = and Pk = is a Householder matrix.


Assume that the transformations are performed using computed Householder
vectors that satisfy (18.3). The computed matrix Ar+1 satisfies
(18.5)

where Q T = PrPr–1 . . . P1 and ∆A satisfies the normwise and componentwise


bounds

(In fact, we can take G = m- 1 eeT, where e = [1, 1,..., 1]T .) In the special
case n = 1, so that A a, we have with
Proof. First, we consider the jth column of A, aj, which undergoes the
transformations By Lemma 18.2 we have

where each ∆Pk depends on j and satisfies ||∆ Pk||F < γcm. Using Lemma 3.6
we obtain

(18.6)

using Lemma 3.1 and assumption (18.4). Hence ∆A in (18.5) satisfies

Now, since ||a_j||_2 ≤ ||a_j||_1 = e^T|a_j|, from (18.6) we have

where ||G|| F = 1 (since ||ee T || F = m). Finally, if n = 1, so that A is a


column vector, then (as in the proof of Lemma 18.2) we can rewrite (18.5)
as where

Note that the componentwise bound for ∆A in Lemma 18.3 does not imply
the normwise one, because of the extra factor m in the componentwise bound.
This is a nuisance, because it means we have to state both bounds in this and
other analyses.
We now apply Lemma 18.3 to the computation of the QR factorization of
a matrix.

Theorem 18.4. Let be the computed upper trapezoidal QR factor


of (m > n) obtained via the Householder QR algorithm. Then
there exists an orthogonal such that

where ||∆A||_F ≤ nγ_{cm}||A||_F and |∆A| ≤ mnγ_{cm}G|A|, with ||G||_F = 1. The
matrix Q is given explicitly as Q = (P_nP_{n−1} . . . P_1)^T, where P_k is the House-
holder matrix that corresponds to the exact application of the kth step of the
algorithm to Ak.
Proof. This is virtually a direct application of Lemma 18.3, with Pk
defined as the Householder matrix that produces zeros below the diagonal in
the kth column of the computed matrix Â_k. One subtlety is that we do not
explicitly compute the lower triangular elements of R, but rather set them to
zero explicitly. However, it is easy to see that the conclusions of Lemmas 18.2
and 18.3 are still valid in these circumstances; the essential reason is that the
elements of ∆P in Lemma 18.2 that correspond to elements that are zeroed
by the Householder matrix P are forced to be zero, and hence we can set the
corresponding rows of ∆P to zero too, without compromising the bound on
||∆P||F.
Finally, we consider use of the QR factorization to solve a linear system.
Given a QR factorization of a nonsingular matrix A ∈ ℝ^{n×n}, a linear sys-
tem Ax = b can be solved by forming Q^Tb and then solving Rx = Q^Tb.

From Theorem 18.4, the computed is guaranteed to be nonsingular if


We give only componentwise bounds.

Theorem 18.5. Let be nonsingular. Suppose we solve the system


Ax = b with the aid of a QR factorization computed by the Householder
algorithm. The computed satisfies

where

Proof. By Theorem 18.4, the computed upper triangular factor satisfies


A + ∆A = QR̂ with |∆A| ≤ n^2γ_{cn}G_1|A| and ||G_1||_F = 1. By Lemma 18.3,
the computed transformed right-hand side satisfies with
Importantly, the same orthogonal matrix Q appears in
the equations involving and
By Theorem 8.5, the computed solution to the triangular system
satisfies

Premultiplying by Q yields

that is, where Using


we have

where G ≥ G_1 and ||G||_F = 1.


The proof of Theorem 18.5 naturally leads to a result in which b is per-
turbed. However, we can easily modify the proof so that only A is perturbed:
the trick is to use Lemma 18.3 to write where
and to premultiply by (Q + ∆Q)-T instead of Q in the middle of the proof.
This leads to the result

(18.7)

An interesting application of Theorem 18.5 is to iterative refinement, as


explained in §18.6.

18.4. Aggregated Householder Transformations


In Chapter 12 we noted that LU factorization algorithms can be partitioned so
as to express the bulk of the computation as matrix-matrix operations (level-
3 BLAS). For computations with Householder transformations the same goal
can be achieved by aggregating the transformations. This technique is widely
used in LAPACK.
One form of aggregation is the “WY” representation of Bischof and Van
Loan [105, 1987]. This involves representing the product Q_r = P_rP_{r−1} . . . P_1
of r Householder transformations P_i = I − v_iv_i^T (where v_i^Tv_i = 2)
in the form Q_r = I + W_rY_r^T.

This can be done using the recurrence

(18.8)

Using the WY representation, a partitioned QR factorization can be de-


veloped as follows. Partition as

(18.9)

and compute the Householder QR factorization of A 1 ,

The product P_rP_{r−1} . . . P_1 = I + W_rY_r^T is accumulated using (18.8), as the


Pi are generated, and then B is updated according to

which involves only level-3 BLAS operations. The process is now repeated on
the last m – r rows of B.
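One way to build the aggregated representation is sketched below in MATLAB: starting from Q_1 = P_1 = I + W_1Y_1^T with W_1 = −v_1 and Y_1 = v_1, each new Householder vector appends one column to W and one to Y. This is only a sketch of a recurrence of the type (18.8); the precise form used in [105, 1987] may differ in details such as signs and storage.

function [W, Y] = wy_accumulate(V)
% Columns of V are Householder vectors v_1,...,v_r with v_k'*v_k = 2,
% so that P_k = I - v_k*v_k'.  On exit, P_r*...*P_1 = I + W*Y'.
[m, r] = size(V);
W = -V(:, 1);
Y =  V(:, 1);
for k = 2:r
    v = V(:, k);
    y = v + Y*(W'*v);        % y = (I + W*Y')'*v, i.e. Q_{k-1}'*v
    W = [W, -v];
    Y = [Y, y];
end
end

The trailing block B can then be updated as B + W*(Y'*B), which involves only matrix–matrix products (level-3 BLAS).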
When considering numerical stability, two aspects of the WY representa-
tion need investigating: its construction and its application. For the construc-
tion, we need to show that satisfies

(18.10)
(18.11)

for modest constants d 1 , d 2, and d 3. Now



But this last equation is essentially a standard multiplication by a Householder


matrix, albeit with less opportunity for rounding errors.
It follows from Lemma 18.3 that the near orthogonality of is inherited by
the condition on in (18.11) follows similarly and that on is trivial.
Note that the condition (18.10) implies that

(18.12)

that is, is close to an exactly orthogonal matrix (see Problem 18.13).


Next we consider the application of Suppose we form
for the B in (18.9), so that

Analysing this level-3 BLAS-based computation using (18.12) and the very
general assumption (12.3) on matrix multiplication (for the 2-norm), it is
straightforward to show that

(18.13)

This result shows that the computed update is an exact orthogonal update of
a perturbation of B, where the norm of the perturbation is bounded in terms
of the error constants for the level-3 BLAS.
Two conclusions can be drawn. First, algorithms that employ the WY rep-
resentation with conventional level-3 BLAS are as stable as the corresponding
point algorithms. Second, the use of fast BLAS3 for applying the updates af-
fects stability only through the constants in the backward error bounds. The
same conclusions apply to the more storage-efficient compact WY representa-
tion of Schreiber and Van Loan [905, 1989], and the variation of Puglisi [848,
1992].

18.5. Givens Rotations

Another way to compute the QR factorization is with Givens rotations. A


Givens rotation (or plane rotation) G(i, j, θ) is equal to the identity
matrix except that

Figure 18.2. Givens rotation, y = G(i, j, θ)x.

where c = cosθ and s = sinθ. The multiplication y = G(i, j, θ)x rotates x


through θ radians clockwise in the (i, j) plane; see Figure 18.2. Algebraically,

and so yj = 0 if
(18.14)

Givens rotations are therefore useful for introducing zeros into a vector one
at a time. Note that there is no need to work out the angle θ, since c and s
in (18.14) are all that are needed to apply the rotation. In practice, we would
scale the computation to avoid overflow (cf. §25.8).
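As an illustration, c and s can be computed and applied as follows in MATLAB, assuming the standard convention in which y_i = cx_i + sx_j and y_j = −sx_i + cx_j; the formulae for c and s are the natural ones suggested by (18.14), and hypot is used to form (x_i^2 + x_j^2)^{1/2} without overflow:

function x = givens_apply(x, i, j)
% Apply a Givens rotation in the (i,j) plane chosen to zero x(j).
xi = x(i);  xj = x(j);
if xj == 0
    c = 1;  s = 0;
else
    t = hypot(xi, xj);       % sqrt(xi^2 + xj^2) computed safely
    c = xi/t;  s = xj/t;
end
x(i) =  c*xi + s*xj;
x(j) = -s*xi + c*xj;         % exactly zero in exact arithmetic
end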
To compute the QR factorization, Givens rotations are used to eliminate
the elements below the diagonal in a systematic fashion. Various choices and
orderings of rotations can be used; a natural one is illustrated as follows for a
generic 4 x 3 matrix:

The operation count for Givens QR factorization of a general m × n matrix

(m ≥ n) is 3n^2(m − n/3) flops, which is 50% more than that for Householder
QR factorization. The main use of Givens rotations is to operate on structured
matrices—for example, to compute the QR factorization of a tridiagonal or
Hessenberg matrix, or to carry out delicate zeroing in updating or downdating
problems [470, 1989, §12.6].
Error analysis for Givens rotations is similar to that for Householder
matrices—but a little easier. We omit the (straightforward) proof of the first
result.

Lemma 18.6. Let a Givens rotation G(i, j, θ) be constructed according to


(18.14). The computed ĉ and ŝ satisfy

(18.15)

where

Lemma 18.7. Let and consider the computation of y = where


is a computed Givens rotation in the (i, j) plane for which and satisfy
(18.15). The computed satisfies

where Gij is an exact Givens rotation based on c and s in (18.15). All the
rows of ∆Gij except the ith and jth are zero.

Proof. The vector differs from x only in elements i and j. We have

where and similarly for Hence

so that We take
For the next result we need the notion of disjoint Givens rotations. Rota-
tions G_{i_1,j_1}, . . . , G_{i_r,j_r} are disjoint if {i_s, j_s} ∩ {i_t, j_t} = ∅ for s ≠ t. Disjoint ro-
tations are “nonconflicting” and therefore commute; it matters neither math-
ematically nor numerically in which order the rotations are applied. (Disjoint

rotations can therefore be applied in parallel, though that is not our inter-
est here. ) Our approach is to take a given sequence of rotations and reorder
them into groups of disjoint rotations. The reordered algorithm is numerically
equivalent to the original one, but allows a simpler error analysis.
As an example of a rotation sequence already ordered into disjoint groups,
consider the following sequence and ordering illustrated for a 6 × 5 matrix:

Here, an integer k in position (i, j) denotes that the (i, j) element is eliminated
on the kth step by a rotation in the (j, i) plane, and all rotations on the kth
step are disjoint. For an m × n matrix with m > n there are r = m + n — 2
stages, and the Givens QR factorization can be written as Wr Wr-1 . . . W1 A =
R, where each Wi is a product of at most n disjoint rotations. It is easy to see
that an analogous grouping into disjoint rotations can be done for the scheme
illustrated at the start of this section.

Lemma 18.8. Consider the sequence of transformations

A_{k+1} = W_kA_k,   k = 1:r,

where A1 = and each Wk is a product of disjoint Givens rotations.


Assume that the individual Givens rotations are performed using computed
sine and cosine values related to the exact values defining the Wk by (18.15).
Then the computed matrix Âr+1 satisfies

Â_{r+1} = Q^T(A + ∆A),

where Q^T = W_rW_{r−1} . . . W_1 and ∆A satisfies the normwise and component-


wise bounds

(In fact, we can take G = m^{−1}ee^T, where e = [1, 1, . . . , 1]^T.) In the special
case n = 1, so that A = a, we have â^{(r+1)} = (Q + ∆Q)^Ta with ||∆Q||_F ≤ γ_{cr}.

Proof. The proof is analogous to that of Lemma 18.3, so we offer only


a sketch. First, we consider the jth column of A, aj, which undergoes the

transformations = W r . . . W1 aj. By Lemma 18.7 and the disjointness


of the rotations, we have

where each ∆Wk depends on j and satisfies ||∆ Wk|| 2 < Using Lemma 3.6
we obtain

(18.16)
Hence

The inequalities (18.16) for j = 1:n imply that

where ||G||_F = 1. The result for n = 1 is proved as in Lemma 18.3.


We are now suitably equipped to give a result for Givens QR factorization.

Theorem 18.9. Let be the computed upper trapezoidal QR factor


of (m > n) obtained via the Givens QR algorithm, with any
standard choice and ordering of rotations. Then there exists an orthogonal
such that

with ||∆A||_F ≤ γ_{c(m+n)}||A||_F and |∆A| ≤ mγ_{c(m+n)}G|A|, ||G||_F = 1. (The


matrix Q is a product of Givens rotations, the kth of which corresponds to the
exact application of the kth step of the algorithm to Âk.)
It is interesting that the error bounds for QR factorization with Givens
rotations are a factor n smaller than those for Householder QR factorization.
This appears to be an artefact of the analysis, and we are not aware of any
difference in accuracy in practice.

18.6. Iterative Refinement


Consider a nonsingular linear system Ax = b, where A ∈ ℝ^{n×n}. Suppose we
solve the system using a QR factorization A = QR computed using House-
holder or Givens transformations (thus, x is obtained by solving Rx = Q^Tb).
Theorem 18.5, and its obvious analogue for Givens rotations, show that
satisfies
(18.17)

where ||G||F = 1 and p is a low-degree polynomial. Hence has a normwise


backward error of order p(n)u. However, since G is a full matrix, (18.17) sug-
gests that the componentwise relative backward error need not be
small. In fact, we know of no nontrivial class of matrices for which Householder
or Givens QR factorization is guaranteed to yield a small componentwise rel-
ative backward error.
Suppose that we carry out a step of fixed precision iterative refinement, to
obtain The form of the bound (18.17) enables us to invoke Theorem 11.4.
We conclude that the componentwise relative backward error af-
ter one step of iterative refinement will be small as long as A is not too ill
conditioned and is not too badly scaled. This conclusion is similar to
that for Gaussian elimination with partial pivoting (GEPP), except that for
GEPP there is the added requirement that the LU factorization not suffer
large element growth.
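As an illustration, the following MATLAB sketch (ours, not from the text; function and variable names are illustrative) solves a square system by Householder QR factorization and then applies one step of fixed precision iterative refinement, reusing the factorization for the correction.

    function x = qr_solve_refine(A, b)
    % Sketch: solve Ax = b by Householder QR, then one refinement step.
    [Q, R] = qr(A);          % Householder QR factorization, A = Q*R
    x = R \ (Q'*b);          % initial solution
    r = b - A*x;             % residual, computed in working precision
    d = R \ (Q'*r);          % correction from the same factorization
    x = x + d;               % refined solution
    end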
Recall from (18.7) that we also have a backward error result in which only
A is perturbed. The analysis of §11.1 is therefore applicable, and analogues
of Theorems 11.1 and 11.2 hold in which η =
The performance of QR factorization with fixed precision iterative refine-
ment is illustrated in Tables 11.1–11.3 in §11.2. The performance is as pre-
dicted by the analysis. Notice that the initial componentwise relative back-
ward error is large in Table 11.2 but that iterative refinement successfully
reduces it to the roundoff level (despite the ill conditioning). It is
worth stressing that the QR factorization yielded a small normwise relative
backward error in each example, as we know it must.

18.7. Gram–Schmidt Orthogonalization

The oldest method for computing a QR factorization is the Gram-Schmidt


orthogonalization method. It can be derived directly from the equation A =
QR, where Q is m × n with orthonormal columns and R is n × n upper
triangular (Gram–Schmidt does not compute the m × m matrix Q of the full
QR factorization and hence does not provide a basis for the orthogonal
complement of range(A)). Denoting by aj and qj
the jth columns of A and Q, respectively, we have

    aj = r1j q1 + r2j q2 + . . . + rjj qj.

Premultiplying by qiT yields, since Q has orthonormal columns, rij = qiT aj,
i = 1:j - 1. Further,

    rjj qj = aj - Σ_{i<j} rij qi,    where    rjj = || aj - Σ_{i<j} rij qi ||2.

Hence we can compute Q and R a column at a time. To ensure that rjj > 0
we require that A has full rank.

Algorithm 18.10 (classical Gram–Schmidt). Given an m × n matrix A of rank n,
this algorithm computes the QR factorization A = QR, where Q is m × n and
R is n × n, by the classical Gram–Schmidt method.

for j = 1:n
    for i = 1:j-1
        rij = qiT aj
    end
    v = aj - Σ_{i<j} rij qi
    rjj = ||v||2;  qj = v/rjj
end

Cost: 2mn2 flops (2n 3/3 flops more than Householder QR factorization
with Q left in factored form).
In the classical Gram–Schmidt method (CGS), aj appears in the compu-
tation only at the jth stage. The method can be rearranged so that as soon as
qj is computed, all the remaining vectors are orthogonalized against qj. This
gives the modified Gram-Schmidt method (MGS).

Algorithm 18.11 (modified Gram–Schmidt). Given an m × n matrix A of rank n,
this algorithm computes the QR factorization A = QR, where Q is m × n and
R is n × n, by the MGS method.

end
end

Cost: 2mn2 flops.



It is worth noting that there are two differences between the CGS and
MGS methods. The first is the order in which the calculations are performed:
in the modified method each remaining vector is updated once on each step
instead of having all its updates done together on one step. This is purely a
matter of the order in which the operations are performed. Second, and more
crucially in finite precision computation, two different (but mathematically
equivalent) formulae for rkj are used: in the classical method, rkj = qkT aj,
which involves the original vector aj, whereas in the modified method aj is
replaced in this formula by the partially orthogonalized vector aj(k). Another
way to view the difference between the two Gram–Schmidt methods is via
representations of an orthogonal projection; see Problem 18.7.
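For concreteness, here is a MATLAB sketch of the two methods (illustrative code; the function names are ours, and the loop organization follows Algorithms 18.10 and 18.11). The only essential difference in finite precision is the vector used in the inner product that defines the off-diagonal entries of R.

    function [Q, R] = cgs(A)
    % Classical Gram-Schmidt: R(i,j) is formed from the ORIGINAL column a_j.
    [m, n] = size(A);  Q = zeros(m, n);  R = zeros(n);
    for j = 1:n
        v = A(:,j);
        for i = 1:j-1
            R(i,j) = Q(:,i)'*A(:,j);     % inner product with the original a_j
            v = v - R(i,j)*Q(:,i);
        end
        R(j,j) = norm(v);  Q(:,j) = v/R(j,j);
    end
    end

    function [Q, R] = mgs(A)
    % Modified Gram-Schmidt: R(k,j) is formed from the partially
    % orthogonalized column, updated as soon as q_k is available.
    [m, n] = size(A);  Q = zeros(m, n);  R = zeros(n);  V = A;
    for k = 1:n
        R(k,k) = norm(V(:,k));  Q(:,k) = V(:,k)/R(k,k);
        for j = k+1:n
            R(k,j) = Q(:,k)'*V(:,j);     % inner product with the updated column
            V(:,j) = V(:,j) - R(k,j)*Q(:,k);
        end
    end
    end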
The MGS procedure can be expressed in matrix terms by defining Ak =
[q1, . . . , qk-1, ak(k), . . . , an(k)]. MGS transforms A1 = A into An+1 = Q by the
sequence of transformations Ak = Ak+1 Rk, where Rk is equal to the identity
except in the kth row, where it agrees with the final R. For example, if n = 4
and k = 3,

Thus R = Rn . . . R 1 .
The Gram-Schmidt methods produce Q explicitly, unlike the Householder
and Givens methods, which hold Q in factored form. While this is a benefit,
in that no extra work is required to form Q, it is also a weakness, because
there is nothing in the methods to force the computed Q̂ to be orthonormal in
the face of roundoff. Orthonormality of Q is a consequence of orthogonality
relations that are implicit in the methods, and these relations may be vitiated
by rounding errors.
Some insight is provided by the case n = 2, for which the CGS and MGS
methods are identical. Given a1, a2 we compute q 1 = a 1 /||a 1 ||2, which
we will suppose is done exactly, and then we form the unnormalized vector
q̃2 = a2 - (q1T a2) q1. The computed vector q̂2 satisfies

where

Hence

and so the normalized inner product satisfies

(18.18)

where θ is the angle between a1 and a2. But


where A = [a 1, a2] (Problem 18.8). Hence, for n = 2, the loss of orthogonality
can be bounded in terms of κ2(A). The same is true in general for the MGS
method, as proved by Björck [107, 1967]. A direct proof is quite long and
complicated, but a recent approach of Björck and Paige [119, 1992] enables a
much shorter derivation; we take this approach here.
The observation that simplifies the error analysis of the MGS method
is that the method is equivalent, both mathematically and numerically, to
Householder QR factorization of the padded matrix To
understand this equivalence, consider the Householder QR factorization

(18.19)

Let q1, . . . , qn be the vectors obtained by applying the MGS method to


A. Then it is easy to see that

and that the multiplication A2 = P1 A1 carries out the first step of the MGS
method on A, producing the first row of R and

The argument continues in the same way, and we find that

(18.20)

With the Householder–MGS connection established, we are ready to derive


error bounds for the MGS method by making use of our existing error analysis
for the Householder method.

Theorem 18.12. Suppose the MGS method is applied to A (m > n) of rank
n, yielding computed matrices Q̂ and R̂. Then there are
constants ci = ci(m, n) such that
(18.21)
(18.22)

and there exists an orthonormal matrix Q such that

(18.23)

Proof. To prove (18.21) we use the matrix form of the MGS method. For
the computed matrices we have

Expanding this recurrence, we obtain

Hence

(18.24)

and a typical term has the form

(18.25)

where Sk-1 agrees with R̂ in its first k - 1 rows and with the identity in its last
n – k + 1 rows. Assume for simplicity that (this does not affect the
final result). We have and Lemma 3.8 shows that the
computed vector satisfies

which implies and, from we


have Using (18.24) and exploiting
the form of (18.25) we find, after a little working, that

provided that
To prove the last two parts of the theorem we exploit the Householder–
MGS connection. By applying Theorem 18.4 to (18.19) we find that there is
an orthogonal such that

(18.26)

with

This does not directly yield (18.23), since is not orthonormal. However,
it can be shown that if we define Q to be the nearest orthonormal matrix
to in the Frobenius norm, then (18.23) holds with c3 = (see
Problem 18.11).
Now (18.21) and (18.23) yield

where c5 = c1 and we have used (18.23) to bound This bound


implies (18.22) with c2 = 2c5 (use the first inequality in Problem 18.13).
We note that (18.22) can be strengthened by replacing κ 2 (A) in the bound
by the minimum over positive diagonal matrices D of κ 2 (AD). This follows
from the observation that in the MGS method the computed Q̂ is invariant
under scalings A → AD, at least if D comprises powers of the machine base.
As a check, note that the bound in (18.18) for the case n = 2 is independent
of the column scaling, since sin θ is unchanged when a1 and a2 are rescaled.
Theorem 18.12 tells us three things. First, the computed QR factors from
the MGS method have a small residual. Second, the departure from orthonor-
mality of Q is bounded by a multiple of κ 2 (A)u, so that Q is guaranteed to be
nearly orthonormal if A is well conditioned. Finally, R is the exact triangular
QR factor of a matrix near to A in a componentwise sense, so it is as good
an R-factor as that produced by Householder QR factorization applied to A.
In terms of the error analysis, the MGS method is weaker than Householder
QR factorization only in that Q is not guaranteed to be nearly orthonormal.
For the CGS method the residual bound (18.21) still holds, but no bound
of the form (18.22) holds for n > 2 (see Problem 18.9).
Here is a numerical example to illustrate the behaviour of the Gram–
Schmidt methods. We take the 25 × 15 Vandermonde matrix A = (p_i^(j-1)),
where the pi are equally spaced on [0, 1]. The condition number κ2(A) =
1.47 × 10^9. Both methods produce a small residual for the QR factorization,
but while CGS produces a Q̂ showing no semblance of orthogonality, for MGS
the measure ||Q̂TQ̂ - I||2 is of order κ2(A)u, as Theorem 18.12 predicts.
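A sketch of this experiment in MATLAB (illustrative code; it assumes the cgs and mgs sketches given earlier in this section, and the usual Vandermonde convention A(i,j) = p_i^(j-1)):

    % Sketch: 25 x 15 Vandermonde experiment comparing CGS and MGS.
    m = 25;  n = 15;
    p = linspace(0, 1, m)';            % equally spaced points on [0,1]
    A = zeros(m, n);
    for j = 1:n, A(:,j) = p.^(j-1); end
    [Qc, Rc] = cgs(A);  [Qm, Rm] = mgs(A);
    resid_c = norm(A - Qc*Rc, 'fro')/norm(A, 'fro');  % small for both methods
    resid_m = norm(A - Qm*Rm, 'fro')/norm(A, 'fro');
    orth_c  = norm(Qc'*Qc - eye(n));   % O(1): CGS loses orthogonality
    orth_m  = norm(Qm'*Qm - eye(n));   % roughly kappa_2(A)*u for MGS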

18.8. Sensitivity of the QR Factorization


How do the QR factors of a matrix behave under small perturbations of the
matrix? This question was first considered by Stewart [944, 1977]. He showed

that if A has rank n and

    A = QR    and    A + ∆A = (Q + ∆Q)(R + ∆R)

are QR factorizations, then, for sufficiently small ∆A,

(18.27)

(18.28)

where cn is a constant. Here, and throughout this section, we use the “econ-
omy size” QR factorization with R a square matrix normalized to have nonneg-
ative diagonal elements. Similar normwise bounds are given by Stewart [951,
1993] and Sun [971, 1991], and, for AQ only, by Bhatia and Mukherjea [95,
1994] and Sun [975, 1995].
Componentwise sensitivity analyses have been given by Zha [1126, 1993]
and Sun [972, 1992], [973, 1992]. Zha’s bounds can be summarized as follows,
with the same assumptions and notation as for Stewart’s result above. Let
|∆A| < where G is nonnegative with Then, for sufficiently
small

(18.29)

where cm,n is a constant depending on m and n. The quantity φ(A) =
cond(R^-1) can therefore be thought of as a condition number for the QR
factorization under the columnwise class of perturbations considered. Note
that φ is independent of the column scaling of A.
As an application of these bounds, consider a computed QR factorization
A QR obtained via the Householder algorithm, where Q is the computed
product of the computed Householder matrices. Theorem 18.4 shows that A+
∆A = where is orthogonal and with ||G1 ||F = 1.
Applying Lemma 18.3 to the computation of Q̂, we have that
where with ||G2 ||F = 1. From the expression
we have, on applying (18.29) to the first
term,
(18.30)
To illustrate this analysis we describe an example of Zha. Let

where is a parameter. It is easy to verify that A = QRA and B = QRB


are QR factorizations (the same Q in each case), where

With the parameter equal to 10^-8, the computed Q factors from MATLAB are

and

Since φ(A) ≈ 2.8 × 10^8 and φ(B) ≈ 3.8, the actual errors match the bound
(18.30) very well. Note that κ2(A) ≈ 2.8 × 10^8 and κ2(B) ≈ 2.1 × 10^8, so
the normwise perturbation bound (18.28) is not strong enough to predict the
difference in accuracy of the computed Q factors, unless we scale the factor
out of the last column of B to leave a well-conditioned matrix. The virtue
of the componentwise analysis is that it does not require a judicious scaling
in order to yield useful results.

18.9. Notes and References


The earliest appearance of Householder matrices is in the book by Turnbull
and Aitken [1029, 1932, pp. 102–105]. These authors show that if ||x||2 = ||y||2
then a unitary matrix of the form R = azz* – I (in their notation)
can be constructed so that Rx = y. They use this result to prove the existence
of the Schur decomposition. The first systematic use of Householder matrices
for computational purposes was by Householder [586, 1958], who used them
to construct the QR factorization. Householder’s motivation was to compute
the QR factorization with fewer arithmetic operations (in particular, fewer square
roots) than are required by the use of Givens rotations.
In the construction of §18.1 for a Householder matrix P such that Px =
σe 1, the other choice of sign, sign(σ) = sign(x1), can be used, provided that
υ 1 is formed in a numerically stable way. The appropriate formula is derived
as follows [264, 1976], [819, 1971], [820, 1980, p. 91]:

A detailed analysis of different algorithms for constructing P such that Px =


σe 1 is given by Parlett [819, 1971].

Tsao [1023, 1975] describes an alternative way to form the product of a


Householder matrix with a vector and gives an error analysis. There is no
major advantage over the usual approach.
As for Householder matrices, normwise error analysis for Givens rota-
tions was given by Wilkinson [1087, 1963], [1089, 1965, pp. 131–139]. Wilkin-
son analysed QR factorization by Givens rotations for square matrices [1089,
1965, pp. 240-241], and his analysis was extended to rectangular matrices by
Gentleman [435, 1973]. The idea of exploiting disjoint rotations in the er-
ror analysis was developed by Gentleman [436, 1975], who gave a normwise
analysis that is simpler and produces smaller bounds than Wilkinson’s (our
normwise bound in Theorem 18.9 is essentially the same as Gentleman’s).
For more details of algorithmic and other aspects of Householder and
Givens QR factorization, see Golub and Van Loan [470, 1989, §5.2].
The error analysis in §18.3 is a refined and improved version of analysis
that appeared in the technical report [546, 1990] and was quoted without
proof in Higham [549, 1991].
The WY representation for a product of Householder transformations
should not be confused with a genuine block Householder transformation.
Schreiber and Parlett [904, 1988] define, for a given (m > n), the
“block reflector that reverses the range of Z” as

If n = 1 this is just a standard Householder transformation. A basic task is


as follows: given (m > n) find a block reflector H such that

Schreiber and Parlett develop theory and algorithms for block reflectors, in
both of which the polar decomposition plays a key role.
Sun and Bischof [977, 1995] show that any orthogonal matrix can be ex-
pressed in the form Q = I – YSYT, even with S triangular, and they explore
the properties of this representation.
Another important use of Householder matrices, besides computation of
the QR factorization, is to reduce a matrix to a simpler form prior to itera-
tive computation of eigenvalues (Hessenberg or tridiagonal form) or singular
values (bidiagonal form). For these two-sided transformations an analogue of
Lemma 18.3 holds with normwise bounds (only) on the perturbation. Error
analyses of two-sided application of Householder transformations is given by
Ortega [811, 1963] and Wilkinson [1086, 1962], [1089, 1965, Chap. 6].
Mixed precision iterative refinement for solution of linear systems by House-
holder QR factorization is discussed by Wilkinson [1090, 1965, §10], who notes
that convergence is obtained as long as a condition of the form cn κ 2 (A)u < 1
holds.

Fast Givens rotations can be applied to a matrix with half the number
of multiplications of conventional Givens rotations, and they do not involve
square roots. They were developed by Gentleman [435, 1973] and Hammar-
ling [498, 1974]. Fast Givens rotations are as stable as conventional ones (see
the error analysis by Parlett in [820, 1980, §6.8.3], for example), but, for
the original formulations, careful monitoring is required to avoid overflow.
Rath [861, 1982] investigates the use of fast Givens rotations for performing
similarity transformations in solving the eigenproblem. Barlow and Ipsen [65,
1987] propose a class of scaled Givens rotations suitable for implementation
on systolic arrays, and they give a detailed error analysis. Anda and Park [16,
1994] develop fast rotation algorithms that use dynamic scaling to avoid over-
flow.
Rice [870, 1966] was the first to point out that the MGS method produces
a more nearly orthonormal matrix than the CGS method in the presence
of rounding errors. Björck [107, 1967] gives a detailed error analysis, proving
(18.21) and (18.22) but not (18.23), which is an extension of the corresponding
normwise result of Björck and Paige [119, 1992]. Björck and Paige give a
detailed assessment of MGS versus Householder QR factorization.
The difference between the CGS and MGS methods is indeed subtle.
Wilkinson [1095, 1971] admitted that “I used the modified process for many
years without even noticing explicitly that I was not performing the classical
algorithm.”
The orthonormality of the matrix from Gram-Schmidt can be improved
by reorthogonalization, in which the orthogonalization step of the classical
or modified method is iterated. Analyses of Gram–Schmidt with reorthog-
onalization are given by Abdelmalek [2, 1971], Ruhe [883, 1983], and Hoff-
mann [578, 1989]. Daniel, Gragg, Kaufman, and Stewart [263, 1976] analyse
the use of classical Gram–Schmidt with reorthogonalization for updating a
QR factorization after a rank one change to the matrix.
The mathematical and numerical equivalence of the MGS method with
Householder QR factorization of the padded matrix was known in the 1960s
(see the Björck quotation at the start of the chapter) and the mathematical
equivalence was pointed out by Lawson and Hanson [695, 1974, Ex. 19.39].
A block Gram–Schmidt method is developed by Jalby and Philippe [608,
1991], and an error analysis is given. See also Björck [115, 1994], who gives an
up-to-date survey of numerical aspects of the Gram–Schmidt method.
For more on Gram-Schmidt methods, including historical comments, see
Björck [116, 1996].
One use of the QR factorization is to orthogonalize a matrix that, because
of rounding or truncation errors, has lost its orthogonality; thus we compute
A = QR and replace A by Q. An alternative approach is to replace A
(m > n) by the nearest orthonormal matrix, that is, the matrix Q that solves
{ ||A - Q|| : QTQ = I} = min. For the 2- and Frobenius norms, the optimal

Q is the orthonormal polar factor U of A, where A = UH is a polar decom-


position: U has orthonormal columns and H is symmetric
positive semidefinite. If m = n, U is the nearest orthogonal matrix to A in any
unitarily invariant norm, as shown by Fan and Hoffman [361, 1955]. Chan-
drasekaran and Ipsen [196, 1994] show that the QR and polar factors satisfy
under the assumptions that A has full rank and
columns of unit 2-norm and that R has positive diagonal elements. Sun [974,
1995] proves a similar result and also obtains a bound for ||Q — U||F in terms
of ||ATA – I||F. Algorithms for maintaining orthogonality in long products of
orthogonal matrices, which arise, for example, in subspace tracking problems
in signal processing, are analysed by Edelman and Stewart [347, 1993] and
Mathias [733, 1995].
Various iterative methods are available for computing the orthonormal
polar factor U, and they can be competitive in cost with computation of a
QR factorization. For more details on the theory and numerical methods, see
Higham [530, 1986], [539, 1989], Higham and Papadimitriou [567, 1994], and
the references therein.
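As an illustration, the orthonormal polar factor can be obtained directly from an SVD. The following MATLAB sketch (ours; the function name is illustrative) replaces a nearly orthonormal A by the nearest orthonormal matrix in the 2- and Frobenius norms.

    function U = nearest_orthonormal(A)
    % Sketch: orthogonalize A (m >= n) by its orthonormal polar factor,
    % i.e., the U in the polar decomposition A = U*H, computed via an SVD.
    [P, S, V] = svd(A, 'econ');   % A = P*S*V'
    U = P*V';                     % polar factor; H = V*S*V' is not needed here
    end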
A notable omission from this chapter is a treatment of rank-revealing QR
factorizations, ones in which the rank of A can readily be determined from
R. This topic is not one where rounding errors play a major role, and hence it
is outside the scope of this book. Pointers to the literature include Golub and
Van Loan [470, 1989, §5.4], Chan and Hansen [192, 1992], and Björck [116,
1996]. A column pivoting strategy for the QR factorization, described in
Problem 18.5, ensures that if A has rank r then only the first r rows of R
are nonzero. A perturbation theorem for the QR factorization with column
pivoting is given by Higham [540, 1990]; it is closely related to the perturba-
tion theory in §10.3.1 for the Cholesky factorization of a positive semidefinite
matrix.

18.9.1. LAPACK
LAPACK contains a rich selection of routines for computing and manipulat-
ing the QR factorization and its variants. Routine xGEQRF computes the QR
factorization A = QR of an m × n matrix A by the Householder QR algo-
rithm. If m < n (which we ruled out in our analysis, merely to simplify the
notation), the factorization takes the form A = Q[ R 1 R2], where R1 is m × m
upper triangular. The matrix Q is represented as a product of Householder
transformations and is not formed explicitly. A routine xORGQR (or xUNGQR in
the complex case) is provided to form all or part of Q, and routine xORMQR (or
xUNMQR ) will pre- or postmultiply a matrix by Q or its (conjugate) transpose.
Routine xGEQPF computes the QR factorization with column pivoting (see
Problem 18.5).
An LQ factorization is computed by xGELQF . When A is m × n with m < n

it takes the form A = [L 0] Q. It is essentially the same as a QR factorization


of AT and hence can be used to find the minimum 2-norm solution to an
underdetermined system (see §20.1).
LAPACK also computes two nonstandard factorizations of an m × n A: a QL
factorization A = QL (routine xGEQLF) and an RQ factorization A = RQ (routine
xGERQF), where L is lower trapezoidal and R upper trapezoidal.

Problems
18.1. Find the eigenvalues of a Householder matrix and a Givens matrix.
18.2. Let where and are the computed quantities described
in Lemma 18.1. Derive a bound for
18.3. A complex Householder matrix has the form

    P = I - βυυ*,    where υ ∈ Cn and β = 2/(υ*υ).

For given x, y ∈ Cn show how to
determine, if possible, P so that Px = y.
18.4. (Wilkinson [1089, 1965, p. 242]) Let x ∈ Rn and let P be a Householder
matrix such that Px = ±||x||2 e1. Let G12, . . . , Gn-1,n be Givens rotations
such that Qx := G12 . . . Gn-1,n x = ±||x||2 e1. True or false: P = Q?
18.5. In the QR factorization with column pivoting, columns are interchanged
at the start of the kth stage of the reduction so that, in the notation of (18.1),
||xk|| 2 > ||Ck(:, j)||2 for all j. Show that the resulting R factor satisfies

so that, in particular, |r11| > |r22| > . . . > |rnn|. (These are the same
equations as (10.13), which hold for the Cholesky factorization with complete
pivoting—why?)
18.6. Let W be a product of disjoint Givens rotations. Show that

18.7. The CGS method and the MGS method applied to A (m > n)
compute a QR factorization A = QR. Define the orthogonal
projection Pi = qi qiT, where qi = Q(:, i). Show that

Show that the CGS method corresponds to the operations

while MGS corresponds to

18.8. Let A = [a1, a2] and denote the angle between a1 and a2 by
θ, 0 < θ < π/2. (Thus, cos θ := |a1T a2|/(||a1||2 ||a2||2).) Show that

18.9. (Björck [107, 1967]) Let

which is a matrix of the form discussed by Läuchli [692, 1961]. Assuming that
fl(1 + ε^2) = 1, evaluate the Q̂ matrices produced by the CGS and MGS
methods and assess their orthonormality.
18.10. Show that the matrix P in (18.19) has the form

where Q is the matrix obtained from the MGS method applied to A.


18.11. (Björck and Paige [119, 1992]) For any matrices satisfying

where both P11 and P21 have at least as many rows as columns, show that
there exists an orthonormal Q such that A + ∆A = QR, where

(Hint: use the CS decomposition P11 = UCWT, P21 = VSWT, where U


and V have orthonormal columns, W is orthogonal, and C and S are square,
nonnegative diagonal matrices with C2 + S2 = I. Let Q = VWT. Note,
incidentally, that P21 = VWT · WSWT, so Q = VWT is the orthonormal
polar factor of P21, and hence is the nearest orthonormal matrix to P21 in the
2- and Frobenius norms. For details of the CS decomposition see Golub and
Van Loan [470, 1989, pp. 77, 471] and Paige and Wei [814, 1994].)

18.12. We know that Householder QR factorization of the padded matrix is equivalent to


the MGS method applied to A, and Problem 18.10 shows that the orthonor-
mal matrix Q from MGS is a submatrix of the orthogonal matrix P from the
Householder method. Since Householder’s method produces a nearly orthog-
onal P, does it not follow that MGS must also produce a nearly orthonormal
Q?

18.13. (Higham [557, 1994]) Let A (m > n) have the polar decom-
position A = UH. Show that

This result shows that the two measures of orthonormality ||ATA – I|| 2 and
||A - U||2 are essentially equivalent (cf. (18.22)).

Chapter 19
The Least Squares Problem

For some time it has been believed that orthogonalizing methods


did not suffer this squaring of the condition number . . .
It caused something of a shock, therefore,
when in 1966 Golub and Wilkinson . . . asserted that
already the multiplications QA and Qb may produce errors in the solution
containing a factor κ2(A).
— A. VAN DER SLUIS,
Stability of the Solutions of Linear Least Squares Problems (1975)

Most packaged regression problems do compute a cross-products matrix


and solve the normal equations using a matrix inversion subroutine.
All the programs . . . that disagreed
(and some of those that agreed) with the unperturbed solution
tried to solve the normal equations.
— ALBERT E. BEATON, DONALD B. RUBIN, and JOHN L. BARONE,
The Acceptability of Regression Solutions:
Another Look at Computational Accuracy (1976)

On January 1, 1801 Giuseppe Piazzi discovered the asteroid Ceres.


Ceres was only visible for forty days
before it was lost to view behind the sun . . .
Gauss, using three observations, extensive analysis,
and the method of least squares, was able to
determine the orbit with such accuracy that Ceres was
easily found when it reappeared in late 1801.
— DAVID K. KAHANER, CLEVE B. MOLER, and STEPHEN G. NASH,
Numerical Methods and Software (1989)


In this chapter we consider the least squares (LS) problem

    minx ||b - Ax||2,

where A (m > n) has full rank. We begin by examining the sen-
sitivity of the LS problem to perturbations. Then we examine the stability
of methods for solving the LS problem, covering QR factorization methods,
the normal equations and seminormal equations methods, and iterative refine-
ment. Finally, we show how to compute the backward error of an approximate
LS solution.
We do not develop the basic theory of the LS problem, which can be found
in standard textbooks (see, for example, Golub and Van Loan [470, 1989]).
However, we recall the fundamental result that any solution of the
LS problem satisfies the normal equations ATAx = ATb (see Problem 19.1).
Therefore if A has full rank there is a unique LS solution. More generally,
whatever the rank of A the vector xLS = A+b is an LS solution, and it is the
solution of minimal 2-norm. Here, A+ is the pseudo-inverse of A (given by
A+ = (ATA)-1AT when A has full rank); see Problem 19.3. (For more on the
pseudo-inverse see Stewart and Sun [954, 1990, §3.1].)

19.1. Perturbation Theory


Perturbation theory for the LS problem is, not surprisingly, more complicated
than for linear systems, and there are several forms in which bounds can be
stated. We begin with a normwise perturbation theorem that is a restatement
of a result of Wedin [1069, 1973, Thm. 5.1]. For an arbitrary rectangular
matrix A we define the condition number κ2(A) = ||A||2 ||A+||2. If A has
r = rank(A) nonzero singular values, σ1 ≥ · · · ≥ σr, then κ2(A) = σ1/σr.

Theorem 19.1 (Wedin). Let A (m > n) and A + ∆A both be of full


rank, and let

Then, provided that K2(A)ε < 1,

(19.1)

(19.2)

These bounds are approximately attainable.



Proof. We defer a proof until §19.8, since the techniques used in the proof
are not needed elsewhere in this chapter.
The bound (19.1) is usually interpreted as saying that the sensitivity of the
LS problem is measured by κ2(A) when the residual is small or zero and by
κ2(A)2 otherwise. This means that the sensitivity of the LS problem depends
strongly on b as well as A, unlike for a square linear system.
Here is a simple example where the κ2(A)2 effect is seen:

It is a simple exercise to verify that

Since K2(A) = 1/ε,

Surprisingly, it is easier to derive componentwise perturbation bounds than


normwise ones for the LS problem. The key idea is to express the LS solution
and its residual as the solution of the augmented system

(19.3)

which is simply another way of writing the normal equations, ATAx = ATb.
This is a square nonsingular system, so standard techniques can be applied.
The perturbed system of interest is

(19.4)

where we assume that


(19.5)

From (19.3) and (19.4) we obtain



Premultiplying by the inverse of the matrix on the left gives

(19.6)

Looking at the individual block components we obtain

(19.7)
(19.8)
(Note that ||I – AA+||2 = min{1, m – n}, as is easily proved using the sin-
gular value decomposition (SVD).) On taking norms we obtain the desired
perturbation result.

Theorem 19.2. Let A (m > n) and A + ∆A be of full rank. For


the perturbed LS problem described by (19.4) and (19.5) we have

(19.9)

(19.10)

for any monotonic norm. These bounds are approximately attainable.


For a square system, we have s = 0, and we essentially recover Theo-
rem 7.4. Note, however, that the bounds contain the perturbed vectors y and
s. For theoretical analysis it may be preferable to use alternative bounds in
which x and r replace y and s and there is an extra factor

where the term in parentheses is assumed to be positive. For practical compu-


tation (19.9) is unsatisfactory because we do not know s = b + ∆ b – (A + ∆A)y.
However, as Stewart and Sun observe [954, 1990, p. 159], b - Ay is com-
putable and

and using this bound in (19.9) makes only a second-order change.


The componentwise bounds enjoy better scaling properties than the norm-
wise ones. If E = |A| and f = |b| then the bounds (19.7) and (19.8), and
to a lesser extent (19.9) and (19.10), are invariant under column scalings
(D diagonal). Row scaling does affect the com-
ponentwise bounds, since it changes the LS solution, but the componentwise
bounds are less sensitive to the row scaling than the normwise bounds, in a
way that is difficult to make precise.

19.2. Solution by QR Factorization


Let A be m × n with m > n and rank(A) = n. If A has the QR factorization

    A = Q [R; 0],    and we write    QTb = [c; d],    c ∈ Rn,

then

    ||b - Ax||2^2 = ||QT(b - Ax)||2^2 = ||c - Rx||2^2 + ||d||2^2.

It follows that the unique LS solution is x = R^-1 c, and the residual ||b -
Ax||2 = ||d||2. Thus the LS problem can be solved with relatively little extra
work beyond the computation of a QR factorization. Note that Q is not
required explicitly; we just need the ability to apply QT to a vector.
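In MATLAB the method can be sketched as follows (illustrative code, with our own function name): an "economy size" Householder QR factorization is computed, QT is applied to b, and a triangular solve gives x.

    function [x, resnorm] = ls_qr(A, b)
    % Sketch: solve min ||b - A*x||_2 for full-rank A (m >= n) via QR.
    [Q, R] = qr(A, 0);        % economy size Householder QR: A = Q*R
    c = Q'*b;                 % c = Q_1^T b
    x = R \ c;                % LS solution x = R^{-1} c
    resnorm = norm(b - A*x);  % equals ||d||_2 up to roundoff
    end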
It is well known that the Givens and Householder QR factorization algo-
rithms provide a normwise backward stable way to solve the LS problem. The
next result expresses this fact for the Householder method and also provides
componentwise backward error bounds (essentially the same result holds for
the Givens method).
As in Chapter 18, we will use the generic constant γ cm, in which c denotes
a small integer. We will assume implicitly that a condition holds of the form
m nγ cm < 1/2.

Theorem 19.3. Let A (m > n) have full rank and suppose the
LS problem minx ||b - Ax||2 is solved using the Householder QR factorization
method. The computed solution x̂ is the exact LS solution to

    minx ||(A + ∆A)x - (b + ∆b)||2,
where the perturbations satisfy the normwise bounds

and the componentwise bounds

where ||G||F = 1.

Proof. The proof is a straightforward generalization of the proof of The-


orem 18.5 and is left as an exercise (Problem 19.2).

As for Theorem 18.5 (see (18.7)), Theorem 19.3 remains true if we set
∆ b = 0, but in general there is no advantage to restricting the perturbations
to A.
Theorem 19.3 is a strong result, but it does not bound the residual of the
computed solution, which, after all, is what we are trying to minimize. How
close, then, is ||b - Ax̂||2 to minx ||b - Ax||2? We can answer this question using
the perturbation theory of §19.1. With s := b + ∆b - (A + ∆A)x̂, x := xLS,
and r := b - Ax, (19.6) yields

so that

Substituting the bounds for ∆A and ∆b from Theorem 19.3, and noting that
||AA+||2 = 1, we obtain

where cond2(A) := || |A+||A| ||2. Hence

This bound contains two parts. The term mnγ cm || |b| + |A||x| ||2 is a multiple
of the bound for the error in evaluating fl(b – Ax), and so is to be expected.
The factor 1 + mnγ c m cond2 (AT) will be less than 1.1 (say) provided that
cond2(AT) is not too large. Note that cond2(AT) < nκ2(A) and cond2(AT) is
invariant under column scaling of A (A → A diag(di), di ≠ 0). The conclusion
is that, unless A is very ill conditioned, the residual b - Ax̂ will not exceed
the larger of the true residual r = b - Ax and a constant multiple of the error
in evaluating fl(r), which is a very satisfactory result.

19.3. Solution by the Modified Gram–Schmidt Method


The modified Gram-Schmidt (MGS) method can be used to solve the LS
problem. However, we must not compute x from x = R^-1(QTb), because the
lack of orthonormality of the computed Q̂ would adversely affect the stability.
Instead we apply MGS to the augmented matrix [A b]:

We have

Since qn+1 is orthogonal to the columns of Q1, ||b - Ax||2^2 = ||Rx - z||2^2 + ρ^2,
so the LS solution is x = R^-1 z. Of course, z = Q1Tb, but z is now computed
as part of the MGS procedure instead of as a product between Q1T and b.
Björck [107, 1967] shows that this algorithm is forward stable, in the
sense that the forward error ||x - x̂|| is as small as that for a norm-
wise backward stable method. It has recently been shown by Björck and
Paige [119, 1992] that the algorithm is, in fact, normwise backward stable
(see also Björck [115, 1994]), that is, a normwise result of the form in Theo-
rem 19.3 holds. Moreover, a componentwise result of the form in Theorem 19.3
holds too (see Problem 19.5). Hence the possible lack of orthonormality of Q̂
does not impair the stability of the MGS method as a means for solving the
LS problem.
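The following MATLAB sketch (ours; it assumes A has full rank) applies MGS to the augmented matrix [A b] and recovers the LS solution from R and z without forming or storing Q.

    function x = ls_mgs(A, b)
    % Sketch: LS solution via MGS applied to the augmented matrix [A b].
    [m, n] = size(A);
    V = [A b];                         % columns are overwritten by MGS
    R = zeros(n, n+1);                 % stores [R z]
    for k = 1:n
        R(k,k) = norm(V(:,k));
        q = V(:,k)/R(k,k);
        for j = k+1:n+1
            R(k,j) = q'*V(:,j);
            V(:,j) = V(:,j) - R(k,j)*q;
        end
    end
    x = R(1:n,1:n) \ R(1:n,n+1);       % x = R^{-1} z
    end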

19.4. The Normal Equations


The oldest method of solving the LS problem is to form and solve the normal
equations, A T Ax = A T b. Assuming that A has full rank, we can use the
following procedure:
Form C = ATA and c = ATb.
Compute the Cholesky factorization C = RTR.
Solve RTy = c, Rx = y.
Cost: n2(m + n/3) flops.
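A sketch of this procedure in MATLAB (illustrative code):

    function x = ls_normal_equations(A, b)
    % Sketch: normal equations method for min ||b - A*x||_2, full-rank A.
    C = A'*A;   c = A'*b;    % form the cross-product matrix and right-hand side
    R = chol(C);             % Cholesky factorization C = R'*R
    y = R'\c;                % forward substitution
    x = R\y;                 % back substitution
    end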
If m >> n, the normal equations method requires about half as many
flops as the Householder QR factorization approach (or the MGS method).
However, it has less satisfactory numerical stability properties. There are
two problems. The first is that information may be lost when Ĉ = fl(ATA) is
formed, essentially because forming the cross product is a squaring operation
that increases the dynamic range of the data. A simple example is the matrix

for which

Even though A is distance approximately ε from a rank-deficient matrix, and


hence unambiguously full rank when ε is well above the unit roundoff u, the
computed cross product is singular. In general, whenever κ2(A) > u^-1/2 we
can expect Ĉ to be singular
singular. In general, whenever K2(A) > u –1/2 we can expect to be singular


or indefinite, in which case Cholesky factorization is likely to break down
(Theorem 10.7).
The second weakness of the normal equations method is more subtle and
is explained by a rounding error analysis. In place of C = ATA and c = ATb
we compute

By Theorems 10.3 and 10.4, the computed Cholesky factor and solution
satisfy

(19.11)

Overall,

(19.12)

By bounding with the aid of (19.11), we find that

(19.13a)
(19.13b)

These bounds show that we have solved the normal equations in a backward
stable way, as long as ||A||2||b||2 ||ATb||2 . But if we try to translate this
result into a backward error result for the LS problem itself, we find that
the best backward error bound contains a factor K2(A) [569, 1987]. The best
forward error bound we can expect, in view of (19.13), is of the form

(19.14)

(since K2 (A T A) = K2 (A)2). Now we know from Theorem 19.1 that the sen-
sitivity of the LS problem is measured by K2(A)2 if the residual is large, but
by K2(A) if the residual is small. It follows that the normal equations method
has a forward error bound that can be much larger than that possessed by a
backward stable method.
A mitigating factor for the normal equations method is that, in view of
Theorem 10.6, we can replace (19.14) by the (not entirely rigorous) bound

where A = BD, with D = diag(||A(:, i)||2), so that B has columns of unit


2-norm. Van der Sluis’s result (Theorem 7.5) shows that

Hence the normal equations method is, to some extent, insensitive to poor
column scaling of A.
Although numerical analysts almost invariably solve the full rank LS prob-
lem by QR factorization, statisticians frequently use the normal equations
(though perhaps less frequently than they used to, thanks to the influence of
numerical analysts). The normal equations do have a useful role to play. In
many statistical problems the regression matrix is contaminated by errors of
measurement that are very large relative to the roundoff level; the effects of
rounding errors are then likely to be insignificant compared with the effects
of the measurement errors, especially if IEEE double precision (as opposed to
single precision) arithmetic is used.
The normal equations (NE) versus (Householder) QR factorization debate
can be summed up as follows.
● The two methods have a similar computational cost if m ≈ n, but the
NE method is up to twice as fast for m >> n. (This statement assumes
that A and b are dense; for details of the storage requirements and
computational cost of each method for sparse matrices, see, for example,
Björck [116, 1996] and Heath [510, 1984].)
● The QR method is always backward stable. The NE method is guaran-
teed to be backward stable only if A is well conditioned.
● The forward error from the NE method can be expected to exceed that
for the QR method when A is ill conditioned and the residual of the LS
problem is small.
● The QR method lends itself to iterative refinement, as described in the
next section. Iterative refinement can be applied to the NE method, but
the rate of convergence inevitably depends on K2(A)2 instead of K2(A).

19.5. Iterative Refinement


As for square linear systems, iterative refinement can be used to improve
the accuracy and stability of an approximate LS solution. However, for the
LS problem there is more than one way to construct an iterative refinement
scheme.
By direct analogy with the square system case, we might consider the
scheme

1. Compute r = b - Aŷ.
2. Solve the LS problem mind ||Ad - r||2.
3. Update y = ŷ + d.
(Repeat from step 1 if necessary, with ŷ replaced by y.)
This scheme is certainly efficient: a computed QR factorization (for example)
of A can be reused at each step 2. Golub and Wilkinson [466, 1966] inves-
tigated this scheme and found that it works well only for nearly consistent
systems.
An alternative approach suggested by Björck [106, 1967] is to apply iter-
ative refinement to the augmented system (19.3), so that both x and r are
refined simultaneously. Since this is a square, nonsingular system, existing re-
sults on the convergence and stability of iterative refinement can be applied,
and we would expect the scheme to work well. To make precise statements
we need to examine the augmented system method in detail.
For the refinement steps we need to consider an augmented system with
an arbitrary right-hand side:

r + Ax = f, (19.15a)
ATr = g. (19.15b)

If A has the QR factorization

where R is n × n upper triangular, then (19.15) transforms to

This system can be solved as follows:

h = R^-T g,
[d1; d2] = QTf,    r = Q [h; d2],
x = R^-1(d1 - h).
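The following MATLAB sketch (ours; function names are illustrative) carries out this solution process with a thin QR factorization, forming r without Q2 by using Q2 Q2T f = f - Q1 d1, and uses it to apply one step of fixed precision iterative refinement to an LS solution.

    function [x, r] = ls_qr_refine(A, b)
    % Sketch: LS solve plus one refinement step via the augmented system.
    [Q1, R] = qr(A, 0);                                % A = Q1*R, Q1 is m x n
    [x, r] = aug_solve(Q1, R, b, zeros(size(A,2),1));  % initial LS solution
    f = b - r - A*x;                                   % augmented system residuals
    g = -A'*r;
    [dx, dr] = aug_solve(Q1, R, f, g);                 % corrections
    x = x + dx;  r = r + dr;
    end

    function [x, r] = aug_solve(Q1, R, f, g)
    % Solve  r + A*x = f,  A'*r = g,  given the thin QR factorization A = Q1*R.
    d1 = Q1'*f;
    h  = R'\g;                 % h = R^{-T} g
    t  = d1 - h;
    x  = R\t;                  % x = R^{-1}(d1 - h)
    r  = f - Q1*t;             % r = Q*[h; d2] without forming Q2
    end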

The next result describes the effect of rounding errors on the solution process.

Theorem 19.4. Let A be of full rank n < m and suppose the aug-
mented system (19.15) is solved using a Householder QR factorization of A
as described above. The computed x̂ and r̂ satisfy

where

with ||G||F = 1, ||Hi ||F = 1, i = 1:3.


Proof. The proof does not involve any new ideas and is rather tedious,
so we omit it.
Consider first fixed precision iterative refinement. Theorem 19.4 implies
that the computed solution to (19.15) satisfies

Unfortunately, this bound is not of a form that allows us to invoke The-


orem 11.4. However, we can apply Theorem 11.3, which tells us that the
corrected solution obtained after one step of iterative refinement
satisfies

(19.16)

Here, denotes the residual of the augmented system corresponding


to the original computed solution. We will make two simplifications to the
bound (19.16). First, since = O(u), the first term in the bound may
2
be included in the O(u ) term. Second, (19.16) yields O(u)
and so With these two simplifications, (19.16) may
be written

In view of the Oettli–Prager result (Theorem 7.3) this inequality tells us that,
asymptotically, the solution produced after one step of fixed precision it-
erative refinement has a small componentwise relative backward error with
respect to the augmented system. However, this backward error allows the
two occurrences of A in the augmented system coefficient matrix to be per-
turbed differently, and thus is not a true componentwise backward error for
the LS problem. Nevertheless, the result tells us that iterative refinement can
be expected to produce some improvement in stability. Note that the bound in
Theorem 19.2 continues to hold if we perturb the two occurrences of A in the
augmented system differently. Therefore the bound is applicable to iterative
refinement (with E = |A|, f = |b|), and so we can expect iterative refine-
ment to mitigate the effects of poor row and column scaling of A. Numerical
experiments show that these predictions are borne out in practice [549, 1991].
Turning to mixed precision iterative refinement, we would like to apply
the analysis of §11.1, with "Ax = b" again identified with the augmented
system. However, the analysis of §11.1 requires a backward error result in
which only the coefficient matrix is perturbed (see (11.1)). This causes no
difficulties because from Theorem 19.4 we can deduce a normwise result (cf.
Problem 7.7):

The theory of §11.1 then tells us that mixed precision iterative refinement will
converge as long as the condition number of the augmented system matrix is
not too large and that the rate of convergence depends on this condition
number. How does the condition number of the augmented system relate to
that of A? Consider the matrix that results from the scaling
(α > 0):
(19.17)

Björck [106, 1967] shows that the eigenvalues of C(α) are

    λ = (α ± (α^2 + 4σi^2)^1/2)/2,   i = 1:n,                      (19.18)
    λ = α,   m - n times,

where σi, i = 1:n, are the singular values of A, and that

(19.19)

with minα κ2(C(α)) being achieved for α = 2^-1/2 σn (see Problem 19.7). Hence
C (α) may be much more ill conditioned than A. However, in our analysis

we are at liberty to take minα K2(C(α )) as the condition number, because


scaling the LS problem according to b - Ax → (b - Ax)/α does not change
the computed solution or the rounding errors in any way (at least not if
α is a power of the machine base). Therefore it is κ2(A) that governs the
behaviour of mixed precision iterative refinement, irrespective of the size of
the LS residual. As Björck [111, 1990] explains, this means that “in a sense
iterative refinement is even more satisfactory for large residual least-squares
problems.” He goes on to explain that “When residuals to the augmented
system are accumulated in precision β -t2, t2 > 2t1, this scheme gives solutions
to full single-precision accuracy even though the initial solution may have no
correct significant figures.”
Iterative refinement can be applied with the MGS method. Björck [106,
1967] gives the details and shows that mixed precision refinement works just
as well as it does for Householder’s method.

19.6. The Seminormal Equations


When we use a QR factorization to solve an LS problem minx ||b – Ax||2,
the solution x is determined from the equation Rx = QTb (or via a similar
expression involving Q for the MGS method). But if we need to solve for
several right-hand sides that are not all available when the QR factorization
is computed, we need to store Q before applying it to the right-hand sides.
If A is large and sparse it is undesirable to store Q, as it can be much more
dense than A. We can, however, rewrite the normal equations as

R T Rx = A T b,

which are called the seminormal equations. The solution x can be determined
from these equations without the use of Q. Since the cross product matrix
ATA is not formed explicitly and R is determined stably via a QR factor-
ization, we might expect this approach to be more stable than the normal
equations method.
Björck [110, 1987] has done a detailed error analysis of the seminormal
equations (SNE) method, under the assumption that R is computed by a
backward stable method. His forward error bound for the SNE method is of
the same form as that for the normal equations method, involving a factor
κ2(A)2. Thus the SNE method is not backward stable. Björck considers
applying one step of fixed precision iterative refinement to the SNE method,
and he calls the resulting process the corrected seminormal equations (CSNE)
method:

RTRx = ATb
r = b - Ax
RTRw = ATr
y = x + w

It is important that the normal equations residual be computed as shown, as
AT(b - Ax), and not as ATb - ATAx. Björck derives a forward error bound
for the CSNE method that is roughly of the form

Hence, if κ2(A)2u < 1, the CSNE method has a forward error bound similar
to that for a backward stable method, and the bound is actually smaller than
that for the QR method if κ2(A)2u << 1 and r is small. However, the CSNE
method is not backward stable for all A.
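A sketch of the CSNE method in MATLAB (illustrative; R is assumed to come from a backward stable QR factorization of A, and the normal equations residual is formed as A'*(b - A*x), as stressed above):

    function y = csne(A, b, R)
    % Sketch: corrected seminormal equations for min ||b - A*x||_2,
    % reusing only the triangular factor R of a QR factorization of A.
    x = R \ (R' \ (A'*b));       % seminormal equations: R'*R*x = A'*b
    r = b - A*x;                 % LS residual
    w = R \ (R' \ (A'*r));       % correction: R'*R*w = A'*r = A'*(b - A*x)
    y = x + w;                   % corrected solution
    end

    % Usage sketch:  [~, R] = qr(A, 0);  y = csne(A, b, R);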

19.7. Backward Error


Although it has been known since the 1960s that a particular method for
solving the LS problem, namely the Householder QR factorization method,
yields a small normwise backward error (see §19.2), it was for a long time
an open problem to obtain a formula for the backward error of an arbitrary
approximate solution. Little progress had been made towards solving this
problem until Waldén, Karlson, and Sun [1060, 1995] found an extremely
elegant solution. We will denote by λ min and σmin the smallest eigenvalue
of a symmetric matrix and the smallest singular value of a general matrix,
respectively.

Theorem 19.5 (Waldén, Karlson, and Sun). Let A (m > n), b, and y be
given, and let r = b - Ay. The normwise backward error

(19.20)

is given by

where

The backward error (19.20) is not a direct generalization of the usual


normwise backward error for square linear systems, because it minimizes

Table 19.1. LS backward errors and residual for Vandermonde system.

instead of the quantity minimized in the square case. However, the pa-
rameter θ allows us some flexibility: taking the limit θ → ∞ forces ∆b = 0,
giving the case where only A is perturbed.
Theorem 19.5 can be interpreted as saying that if λ* > 0 then the backward
error is essentially that given by Theorem 7.1 for a consistent system. If
λ* < 0, however, the nearest perturbed system of which y is the LS solution
is inconsistent. A sufficient condition for λ* < 0 is b ∉ range(A) (assuming
µ ≠ 0), that is, the original system is inconsistent.
The formulae given in Theorem 19.5 are unsuitable for computation be-
cause they can suffer from catastrophic cancellation when λ * < 0. Instead,
the following alternative formula derived in [1060, 1995] should be used (see
Problem 19.9):

(19.21a)

(19.21b)

To illustrate Theorem 19.5, we consider an LS problem with a 25 x 15


Vandermonde matrix A = (p_i^(j-1)), where the pi are equally spaced on [0, 1],
and b = Ax with xi = i (giving a zero residual problem). The condition
number κ2(A) = 1.47 x 10^9. We solved the LS problem in MATLAB in four
different ways: by using the NE with Cholesky factorization, via Householder
QR factorization, and via the MGS method, using both the stable approach
described in §19.3 and the unstable approach in which QTb is formed as
a matrix–vector product (denoted MGS(bad)). The results, including the
norms of the residuals r = b - Ax̂, are shown in Table 19.1. As would be
expected from the analysis in this chapter, the QR and stable MGS methods
produce backward stable solutions, but the NE method and the unstable MGS
approach do not.
As we saw in the analysis of iterative refinement, sometimes we need to
consider the augmented system with different perturbations to A and AT.
The next result shows that from the point of view of normwise perturbations

and componentwise perturbations bounded by |∆ A| < εG|A|, the lack of


“symmetry” in the perturbations has little effect on the backward error of y.

Lemma 19.6 (Kiełbasiński and Schwetlick). Let A (m > n) and


consider the perturbed augmented system

There is a vector and a perturbation ∆A with

∆ A = G1∆ A1 + G2 ∆ A 2 , (19.22)

such that

that is, y solves the LS problem minx ||(A + ∆A)x – b||2.


Proof. If s = b– (A + ∆A1)y = 0 we take ∆A = ∆A1. Otherwise, we set

∆ A := P∆ A2 + (I – P)∆ A1 =: ∆A1 + PH,

where P = ssT/sTs and H = ∆A2 – ∆A1. We have

where β = 1 – s T Hy/s T s. Then

Note that (19.22) implies a bound stronger than just ||∆A||2 < ||∆A1||2 +
||∆A2||2:

    ||∆A||p ≤ (||∆A1||p^2 + ||∆A2||p^2)^1/2,    p = 2, F.
Turning to componentwise backward error, the simplest approach is to ap-
ply the componentwise backward error to the augmented system (19.3),
setting

so as not to perturb the diagonal blocks I and O of the augmented system co-
efficient matrix. However, this approach allows A and AT to undergo different
perturbations ∆A1 and ∆A2 with and thus does not give a true
backward error, and Lemma 19.6 is of no help. This problem can be over-
come by using a structured componentwise backward error to force symmetry

of the perturbations; see Higham and Higham [527, 1992] for details. One
problem remains: as far as the backward error of y is concerned, the vector r
in the augmented system is a vector of free parameters, so to obtain the true
componentwise backward error we have to minimize the structure-preserving
componentwise backward error over all r. This is a nonlinear optimization
problem to which no closed-form solution is known. Experiments show that
when y is a computed LS solution, r = b – Ay is often a good approximation
to the minimizing r [527, 1992], [549, 1991].

19.8. Proof of Wedin’s Theorem


In this section we give a proof of Theorem 19.1. We define PA := AA+, the
orthogonal projection onto range(A).

Lemma 19.7. Let A, B be m × n. If rank(A) = rank(B) and η = ||A+||2||A -


B||2 < 1, then

Proof. Let r = rank(A). A standard result on the perturbation of singular


values gives

that is,

which gives the result on rearranging.

Lemma 19.8. Let A, B be m × n. If rank(A) = rank(B) then

Proof. We have

The result then follows from the (nontrivial) equality ||PA(I - PB)||2 = ||PB(I -
PA)||2; see, for example, Stewart [943, 1977, Thm. 2.3] or Stewart and Sun

[954, 1990, Lem. 3.3.5], where proofs that use the CS decomposition are
given.
Proof of Theorem 19.1. Let B := A + ∆A. We have, with r = b - Ax,

y - x = B+(b + ∆b) - x = B+(r + Ax + ∆b) - x


= B + (r + Bx – ∆Ax + ∆b) – x
= B + (r – ∆Ax + ∆b) – (I – B+B)x
= B + ( r – ∆Ax + ∆b), (19.23)
since B has full rank. Now
B + r = B + (BB + )r = B + P B r = B + P B (I – P A )r. (19.24)
Applying Lemmas 19.7 and 19.8, we obtain

(19.25)

Similarly,

(19.26)

The bound for ||x – y||2 /||x||2 is obtained by using inequalities (19.25) and
(19.26) in (19.23).
Turning to the residual, using (19.23) we find that
s – r = ∆b + B(x – y) – ∆Ax
= ∆b – BB+(r – ∆Ax + ∆b) – ∆Ax
= (I – BB + )(∆b – ∆Ax) – BB+r.
Since ||I - BB+||2 = min{1, m - n},

Using (19.24), Lemma 19.8, and ||BB+||2 = 1, we obtain



Hence

For the attainability, see Wedin [1069, 1973, §6].


Note that, as the proof has shown, Wedin’s theorem actually holds with-
out any restriction on m and n, provided we define x = A+b and y =
(A + ∆A)+(b + ∆b) when m < n (in which case r = 0). We consider un-
derdetermined systems in detail in the next chapter. The original version of
Wedin’s theorem also requires only rank(A ) = rank(A + ∆A) and not that A
have full rank.

19.9. Notes and References


The most comprehensive and up to date treatment of the LS problem is the
book by Björck [116, 1996], which is an updated and expanded version of [112,
1990]. It treats many aspects not considered here, including rank-deficient,
weighted, and constrained problems. An early book devoted to numerical
aspects of the LS problem was written by Lawson and Hanson [695, 1974],
who, together with Stewart [941, 1973], were the first to present error analysis
for the LS problem in textbook form.
The history of the LS problem is described in the statistically oriented
book by Farebrother [363, 1988].
The pseudo-inverse A+ underlies the theory of the LS problem, since the
LS solution can be expressed as x = A+b. An excellent reference for pertur-
bation theory of the pseudo-inverse is Stewart and Sun [954, 1990, 3.3]. The
literature on pseudo-inverses is vast, as evidenced by the annotated bibliogra-
phy of Nashed and Rall [786, 1976], which contains 1,776 references published
up to 1976.
Normwise perturbation theory for the LS problem was developed by vari-
ous authors in the 1960s and 1970s. The earliest analysis was by Golub and
Wilkinson [466, 1966], who gave a first-order bound and were the first to rec-
ognize the potential k2 (A)2 sensitivity. A nonasymptotic perturbation bound
was given by Björck [107, 1967], who worked from the augmented system.
An early set of numerical experiments on the Householder, Gram-Schmidt,
and normal equations methods for solving the LS problem was presented by
Jordan [618, 1968]; this paper illustrates the incomplete understanding of
perturbation theory and error analysis for the LS problem at that time.
van der Sluis [1041, 1975] presents a geometric approach to LS perturba-
tion theory and gives lower bounds for the effect of worst-case perturbations.
Golub and Van Loan [470, 1989, Thm. 5.3.1] give a first-order analogue of
Theorem 19.1 expressed in terms of the angle θ between b and range(A ) in-
stead of the residual r.

Wei [1072, 1990] gives a normwise perturbation result for the LS problem
with a rank deficient A that allows rank(A + ∆A) > rank(A).
Componentwise perturbation bounds of the form in Theorem 19.2 were
first derived by Björck in 1988 and variations have been given by Arioli, Duff,
and de Rijk [25, 1989], Björck [113, 1991], and Higham [542, 1990].
Higham [542, 1990] examined the famous test problem from Longley [711,
1967], a regression problem which has a notoriously ill-conditioned 16 x 7
coefficient matrix with κ2(A) ≈ 5 x 10^9. The inequality (19.8) was found
to give tight bounds for the effect of random componentwise relative per-
turbations of the problem generated in experiments of Beaton, Rubin, and
Barone [86, 1976]. Thus componentwise perturbation bounds are potentially
useful in regression analysis as an alternative to the existing statistically based
techniques.
The tools required for a direct proof of the normwise backward error result
in Theorem 19.3 are developed in Wilkinson’s book The Algebraic Eigenvalue
Problem [1089, 1965]. Results of this form were derived informally by Golub
and Wilkinson (assuming the use of extended precision inner products) [466,
1966], stated by Wilkinson [1090, 1965, p. 93] and Stewart [941, 1973], and
proved by Lawson and Hanson [695, 1974, Chap. 16].
The idea of using QR factorization to solve the LS problem was mentioned
in passing by Householder [586, 1958]. Golub [463, 1965] worked out the
details, using Householder QR factorization, and this method is sometimes
called “Golub’s method”. In the same paper, Golub suggested the form of
iterative refinement described at the start of 19.5 (which is implemented in
a procedure by Businger and Golub [167, 1965]), and showed how to use QR
factorization to solve an LS problem with a linear constraint Bx = c.
It was Björck [106, 1967] who first recognized that iterative refinement
should be applied to the augmented system for best results, and he gave a
detailed rounding error analysis for the use of a QR factorization computed by
the Householder or MGS methods. Björck and Golub [118, 1967] give an Algol
code for computation and refinement of the solution to an LS problem with
a linear constraint; they use Householder transformations, while Björck [108,
1968] gives a similar code based on the Gram–Schmidt method. In [109, 1978],
Björck dispels some misconceptions of statisticians about (mixed precision)
iterative refinement for the LS problem; he discusses standard refinement
together with two versions of refinement based on the seminormal equations.
Error analysis for solution of the LS problem by the classical Gram-
Schmidt method with reorthogonalization is given by Abdelmalek [2, 1971],
who obtains a forward error bound as good as that for a backward stable
method.
Higham and Stewart [569, 1987] compare the normal equations method
with the QR factorization method, with emphasis on aspects relevant to re-
gression problems in statistics.

Foster [398, 1991] proposes a class of methods for solving the LS problem
that are intermediate between the normal equations method and the MGS
method, and that can be viewed as block MGS algorithms.
The most general analysis of QR factorization methods for solving the LS
and related problems is by Björck and Paige [120, 1994], who consider an
augmented system with an arbitrary right-hand side (see Problem 20.1) and
prove a number of subtle stability results.
Theorem 19.4 and the following analysis are from Higham [549, 1991].
Arioli, Duff, and de Rijk [25, 1989] investigate the application of fixed pre-
cision iterative refinement to large, sparse LS problems, taking the basic solver
to be the block LDLT factorization code MA27 [329, 1982] from the Harwell
Subroutine Library (applied to the augmented system); in particular, they
use scaling of the form (19.17). Björck [114, 1992] determines, via an error
analysis for solution of the augmented system by block LDLT factorization, a
choice of α in (19.17) that minimizes a bound on the forward error.
The idea of implementing iterative refinement with a precision that in-
creases on each iteration (see the Notes and References to Chapter 11) can be
applied to the LS problem; see Gluchowska and Smoktunowicz [453, 1990].
The use of SNE was first suggested by Kahan, in the context of iterative
refinement, as explained by Golub and Wilkinson [466, 1966].
Stewart [945, 1977] discusses the problem of finding the normwise back-
ward error for the LS problem and offers some backward perturbations that
are candidates for being of minimal norm. The problem is also discussed by
Higham [542, 1990]. Componentwise backward error for the LS problem has
been investigated by Arioli, Duff, and de Rijk [25, 1989], Björck [113, 1991],
and Higham [542, 1990].
Theorem 19.5 has been extended to the multiple right-hand side LS prob-
lem by Sun [976, 1996].
Lemma 19.6 is from a book by Kiełbasiński and Schwetlick, which has
been published in German [658, 1988] and Polish [659, 1992] editions, but
not in English. The lemma is their Lemma 8.2.11, and can be shown to be
equivalent to a result of Stewart [943, 1977, Thm. 5.3].
Other methods for solving the LS problem not considered in this chapter
include those of Peters and Wilkinson [826, 1970], Cline [215, 1973], and
Plemmons [835, 1974], all of which begin by computing an LU factorization
of the rectangular matrix A. Error bounds for these methods can be derived
using results from this chapter and Chapters 9 and 18.
In this chapter we have not specifically treated LS problems whose coeffi-
cient matrices have rows varying greatly in norm, and we have not considered
weighted LS problems minx ||D(b − Ax)||2, where D = diag(di). Error anal-
ysis for Householder QR factorization with column pivoting applied to badly
row-scaled problems is given by Powell and Reid [840, 1969]. Methods and
error analysis for weighted LS problems are given by Barlow [59, 1988], Bar-

low and Handy [64, 1988], Barlow and Vemulapati [66, 1992], Gulliksson and
Wedin [489, 1992], Gulliksson [487, 1994], [488, 1995], and Van Loan [1042,
1985].
Another topic not considered here is the constrained LS problem, where
x is required to satisfy linear equality and/or inequality constraints. Nu-
merical methods are described by Lötstedt [713, 1984] and Golub and Van
Loan [470, 1989, §12.1], and perturbation theory is developed by Eldén [350,
1980], Lötstedt [712, 1983], and Wedin [1070, 1985].

19.9.1. LAPACK
Driver routine xGELS solves the full rank LS problem by Householder QR
factorization. It caters for multiple right-hand sides, each of which defines a
separate LS problem. Thus, xGELS solves minX ||B − AX||F, where
A ∈ R^{m×n} (m ≥ n) has full rank. This routine does not return any
error bounds, and iterative refinement is not supported for LS problems in
LAPACK.
Driver routines xGELSX and xGELSS solve the rank-deficient LS problem
with multiple right-hand sides, using, respectively, a complete orthogonal fac-
torization (computed via QR factorization with column pivoting) and the
SVD.
LAPACK also contains routines for solving the linearly constrained LS
problem (xGGLSE) and a generalized form of weighted LS problem (xGGGLM).

Problems
19.1. Show that any solution to the LS problem minx ||b – Ax||2 satisfies the
normal equations A T Ax = A T b. What is the geometrical interpretation of
these equations?
19.2. Prove Theorem 19.3.
19.3. The pseudo-inverse X = A^+ ∈ R^{n×m} of A ∈ R^{m×n} can be defined as the
unique matrix satisfying the four Moore–Penrose conditions

(i) AXA = A, (ii) XAX = X,


(iii) AX = (AX)T, (iv) XA = (XA) T .

Let A = UΣV^T be an SVD, with Σ = diag(σi), and let r = rank(A). Show
that X = V diag(σ1^{−1}, . . . , σr^{−1}, 0, . . . , 0)U^T satisfies (i)–(iv) and hence is the
pseudo-inverse of A. Show that (A+)+ = A.
19.4. Show that the pseudo-inverse A^+ of A ∈ R^{m×n} solves the problem
minX ||AX − I||F.
Is the solution unique?


19.5. Prove a result analogous to Theorem 19.3 for the MGS method, as
described in 19.3.
19.6. Consider the LS problem minx||b – Ax||2, where Let be
the computed LS solution obtained from the normal equations method and
x the exact solution, and define Using (19.12) and
(19.13) show that a bound holds of the form

19.7. Prove (19.18) and (19.19).


19.8. (Waldén, Karlson, and Sun [1060, 1995]) Partially complete the gap in
Theorem 19.5 by evaluating η F (0) for the case that is, ∆b = 0.
19.9. Prove (19.21).

Chapter 20
Underdetermined Systems

I’m thinking of two numbers.


Their average is 3.
What are the numbers?
— CLEVE B. MOLER, The World’s Simplest Impossible Problem (1990)

This problem arises in important algorithms


used in mathematical programming . . .
In these cases, B is usually very large and sparse and,
because of storage difficulties,
it is often uneconomical to store and access Q1 . . .
Sometimes it has been thought that [the seminormal equations method]
could be disastrously worse than [the Q method] . . .
It is the purpose of this note to show that such algorithms are
numerically quite satisfactory.
— C. C. PAIGE, An Error Analysis of a
Method for Solving Matrix Equations (1973)


Having considered well-determined and overdetermined linear systems, we


now turn to the remaining class of linear systems: those that are underdeter-
mined.

20.1. Solution Methods


Consider the underdetermined system Ax = b, where A ∈ R^{m×n} with m < n.
The system can be analysed using a QR factorization

    A^T = Q [ R ]
            [ 0 ],          (20.1)

where Q ∈ R^{n×n} is orthogonal and R ∈ R^{m×m} is upper triangular. (We
could, alternatively, use an LQ factorization of A, but we will keep to the
standard notation.) We have

    b = Ax = [R^T  0] Q^T x = R^T y1,          (20.2)

where y = Q^T x is partitioned as y = [y1^T  y2^T]^T with y1 ∈ R^m.
If A has full rank then y1 = R^{−T}b is uniquely determined and all solutions of
Ax = b are given by

    x = Q [ y1 ]
          [ y2 ],    y2 ∈ R^{n−m} arbitrary.

The unique solution xLS that minimizes ||x||2 is obtained by setting y2 = 0.
We have

    xLS = Q [ R^{−T}b ]          (20.3)
            [    0    ]

        = A^T (AA^T)^{−1} b = A^+ b,          (20.4)

where A^+ = A^T(AA^T)^{−1} is the pseudo-inverse of A. Hence xLS can be
characterized as xLS = A^T y, where y solves the normal equations AA^T y = b.
Equation (20.3) defines one way to compute xLS . We will refer to this
method as the “Q method”. When A is large and sparse it is desirable to
avoid storing and accessing Q, which can be expensive. An alternative method

with this property uses the QR factorization (20.1) but computes xLS as
xLS = A^T y, where

    R^T R y = b          (20.5)

(cf. (20.4)). These latter equations are called the seminormal equations (SNE).
As the “semi” denotes, however, this method does not explicitly form AA^T,
which would be undesirable from the standpoint of numerical stability. Note
that equations (20.5) are different from the equations R^T R x = A^T b for an
overdetermined least squares (LS) problem, where A = Q[R^T 0]^T ∈ R^{m×n}
with m > n, which are also called seminormal equations (see §19.6).
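As an illustration only (not code from the book), the two solution methods can be sketched in Python with NumPy as follows; a library QR factorization stands in for an explicit Householder reduction, the function names are ours, and A is assumed to have full rank.

import numpy as np

def min_norm_q(A, b):
    # Q method: A^T = Q [R; 0]; solve R^T y1 = b; x_LS = Q(:, 1:m) y1.
    Q, R = np.linalg.qr(A.T)           # "economy" QR: Q is n-by-m, R is m-by-m
    y1 = np.linalg.solve(R.T, b)       # lower triangular solve
    return Q @ y1

def min_norm_sne(A, b):
    # Seminormal equations: R^T R y = b, x_LS = A^T y, with R from QR of A^T.
    R = np.linalg.qr(A.T, mode='r')    # only the triangular factor is needed
    y = np.linalg.solve(R.T @ R, b)    # in practice: two triangular solves
    return A.T @ y

# Example with a random full rank 3-by-5 system.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
b = rng.standard_normal(3)
x1, x2 = min_norm_q(A, b), min_norm_sne(A, b)
print(np.linalg.norm(A @ x1 - b), np.linalg.norm(x1 - x2))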

20.2. Perturbation Theory


A componentwise perturbation result for the minimum 2-norm solution to an
underdetermined system is readily obtained.

Theorem 20.1 (Demmel and Higham). Let A ∈ R^{m×n} (m < n) be of full


rank and Suppose ||A+ ∆A||2 < 1 and

If x and y are the minimum 2-norm solutions to Ax = b and (A + ∆A)y =


b + ∆b, respectively, then, for any monotonic norm,

(20.6)
For any Hölder p-norm, the bound is attainable to within a constant factor
depending on n.

Proof. The perturbed matrix A + ∆A = A(I + A+ ∆A) has full rank, so


we can manipulate the equation

y = (A+ ∆A)T((A + ∆A)(A + ∆A)T)-l(b + ∆b)

to obtain

y – x = (I – A+ A) ∆AT(AAT)-lb + A+(∆b – ∆Ax) + O( 2 )


= (I - A + A) ∆ATA+Tx + A+(∆b – ∆Ax) + O(ε2). (20.7)

The required bound follows on using absolute value inequalities and taking
norms. That the bound is attained to within a constant factor depending on
n for Holder p-norms is a consequence of the fact that the two vectors on the
right-hand side of (20.7) are orthogonal.

Two special cases are worth noting, for later use. We will use the equality
||I − A^+A||2 = min{1, n − m}, which can be derived by consideration of the QR
factorization (20.1), for example. If E = |A|H, where H is a given nonnegative
matrix, and f = |b|, then we can put (20.6) in the form

(20.8)

where
cond2(A) = || |A^+| |A| ||2.
Note that cond2(A) is independent of the row scaling of A (cond2(DA) =
cond2(A) for nonsingular diagonal D). If E = ||A||2 em en^T and f = ||b||2 em,
where em denotes the m-dimensional vector of 1s, then

(20.9)

The following analogue of Lemma 19.6 will be needed for the error anal-
ysis in the next section, It says that if we perturb the two occurrences of
A in the normal equations AATx = b differently, then the solution of the
perturbed system is the solution of normal equations in which there is only
one perturbation of A and, moreover, this single perturbation is no larger, in
the normwise or componentwise sense, than the two perturbations we started
with.

Lemma 20.2 (Kiełbasiński and Schwetlick). Let A ∈ R^{m×n} (m < n) be of
full rank and suppose

(A + ∆A 1 ) = b, = (A + ∆A 2 ) T

Assume that 3 max (||Α+∆A1||2, ||A+ ∆A2||2) < 1. Then there is a vector
and a perturbation ∆A with

∆A = ∆AIGI + ∆A 2 G 2 , Gl + G2 = I,

such that

that is, is the minimum 2-norm solution to (A + ∆A)x = b.

Proof. The proof is similar to that of Lemma 19.6, but differs in some
details. If = (A+ ∆A 2 ) T = 0 we take ∆A = ∆A2. otherwise, we set

∆A := ∆A1P + ∆A2(I − P) =: ∆A2 + HP,



where and H = ∆A1 - ∆A2. We have

where β = which shows that we need to set To


check that (A + ∆A ) = b, we evaluate

as required. The vector is undefined if β = 0. But

which is positive if 3 max (||A+ ∆A 1 || 2 , ||A + ∆A 2 || 2 ) < 1.


Note that in Lemma 20.2 we have the normwise bound

20.3. Error Analysis


We now consider the stability of the Q method and the SNE method. For both
methods we assume that the QR factorization is computed using Householder
or Givens transformations.
Before presenting the results we define a measure of stability. The com-
ponentwise backward error for a minimum-norm underdetermined system
Ax = b is defined as

s.t. y is the min. norm solution to (A+ ∆A)y = b + ∆b },

Note the requirement in this definition that y be the minimum norm solution;
the usual componentwise backward error (see (7.6)) is a generally
smaller quantity. Let us say that a method is row-wise backward stable if it
produces a computed solution for which the componentwise backward error

is of order u, where E = |A|emeTn and f = |b|. This condition requires


that solve a perturbed minimum norm problem in which the perturbations
to the ith row of A are small compared with the norm of that row (similarly
for b); cf. the discussion of componentwise backward errors in 7.2.

Theorem 20.3. Let A ∈ R^{m×n} with rank(A) = m < n, and assume that
a condition of the form cond2(A)mnγcn < 1 holds. Suppose the underdeter-
mined system Ax = b is solved in the minimum 2-norm sense using the Q
method. Then the computed solution is the minimum 2-norm solution to
(A+ ∆A)x = b, where

and
||G||F = 1.
Proof. The Q method solves the triangular system R^T y1 = b and then
forms x = Q[y1^T  0]^T. Assuming the use of Householder QR factorization,
from Theorem 18.4 we have that

for some orthogonal matrix Q, where ||∆A0||F < mγ cn||A||F and


with ||G0 ||F = 1. The computed satisfies

From Lemma 18.3, the computed solution satisfies

(20.10)

We now rewrite the latter two equations in such a way that we can apply
Lemma 20.2:

It is straightforward to show that

|∆ Ai | < mnγ´cn|A|Gi , ||Gi ||F = 1, i = 1:2,

and ||∆Ai ||F < mγ´||A||F, i = 1:2. The result follows on invocation of
Lemma 20.2.

Theorem 20.3 says that the Q method is row-wise backward stable. This is
not altogether surprising, since (Householder or Givens) QR factorization for
the LS problem enjoys an analogous backward stability result (Theorem 19.3),
albeit without the restriction of a minimum norm solution. Applying (20.8)
to Theorem 20.3 we obtain the forward error bound

(20.11)

The same form of forward error bound (20.11) can be derived for the SNE
method as for the Q method [292, 1993]. However, it is not possible to obtain
a result analogous to Theorem 20.3, nor even to obtain a residual bound of the
form (which would imply that the computed solution solved a nearby
system, though it would not necessarily be the minimum norm solution). The
method of solution guarantees only that the seminormal equations themselves
have a small residual. Thus, as in the context of overdetermined LS problems,
the SNE method is not backward stable. A possible way to improve the
stability is by iterative refinement, as shown in [292, 1993].
Note that the forward error bound (20.11) is independent of the row scaling
of A, since cond2 (A) is. The bound is therefore potentially much smaller than
the bound

obtained by Paige [813, 1973] for the SNE method and by Jennings and Os-
borne [614, 1974] and Arioli and Laratta [26, 1985, Thm. 4] for the Q method.
Finally, we mention an alternative version of the Q method that is based
on the modified Gram–Schmidt (MGS) method. The obvious approach is
to compute the QR factorization A^T = QR (Q ∈ R^{n×m}, R ∈ R^{m×m}) using MGS,
solve R^T y = b, and then form x = Qy. Since Q is provided explicitly
by the MGS method, the final stage is a full matrix–vector multiplication,
unlike for the Householder method. However, because the computed Q may
depart from orthonormality, this method is unstable in the form described.
The formation of x = Qy should instead be done as follows:

The recurrence can be written as x^(k−1) = x^(k) + yk qk − (qk^T x^(k))qk, and the
last term is zero in exact arithmetic if the qk are mutually orthogonal. In finite
precision arithmetic this correction term has the “magical” effect of making
the algorithm stable, in the sense that it satisfies essentially the same result
as the Q method in Theorem 20.3; see Björck and Paige [120, 1994].
A numerical example is instructive. Take the 20 × 30 Vandermonde matrix
whose points pi are equally spaced on [0, 1], and let b have elements
equally spaced on [0, 1]. The condition number κ2(A) = 4.35 × 10^{14}. The
(standard) normwise backward errors in the 2-norm are shown in Table 20.1.
For A^T, the computed Q̂ supplied by MGS satisfies ||Q̂^T Q̂ − I||2 = 1.41 × 10^{−3}, which
explains the instability of the “obvious” MGS solver.

Table 20.1. Backward errors for underdetermined Vandermonde system.

    Householder QR                          9.76 × 10^{−18}
    MGS with x := Qy                        4.10 × 10^{−4}
    MGS with x formed stably (see text)     2.25 × 10^{−17}
    SNE method (using Householder QR)       1.99 × 10^{−4}
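A NumPy sketch of the two MGS variants follows (an illustration, not the code used for Table 20.1); the column-by-column correction in the stable variant implements the recurrence above, and the MGS routine is a straightforward textbook implementation.

import numpy as np

def mgs(B):
    # Modified Gram-Schmidt QR of B (n-by-m, n >= m): B = QR.
    n, m = B.shape
    Q = B.astype(float).copy()
    R = np.zeros((m, m))
    for k in range(m):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        R[k, k+1:] = Q[:, k] @ Q[:, k+1:]
        Q[:, k+1:] -= np.outer(Q[:, k], R[k, k+1:])
    return Q, R

def min_norm_mgs(A, b, stable=True):
    Q, R = mgs(A.T)                    # A^T = QR, Q is n-by-m
    y = np.linalg.solve(R.T, b)        # R^T y = b
    if not stable:
        return Q @ y                   # "obvious" but unstable formation of x = Qy
    x = np.zeros(A.shape[1])
    for k in range(A.shape[0] - 1, -1, -1):
        q = Q[:, k]
        x = x + (y[k] - q @ x) * q     # correction term is zero in exact arithmetic
    return x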

20.4. Notes and References


The seminormal equations method was suggested by Gill and Murray [442,
1973] and Saunders [894, 1972]. Other methods for obtaining minimal 2-norm
solutions of underdetermined systems are surveyed by Cline and Plemmons
[219, 1976].
Theorem 20.1 is from Demmel and Higham [292, 1993]. The bound (20.9)
is well known; it follows from Wedin’s original version of our Theorem 19.1,
which applies to minimum 2-norm underdetermined problems as well as LS
problems.
Theorem 20.3 is new. Demmel and Higham [292, 1993] prove the weaker
result that the computed solution from the Q method is very close to a vector that satisfies the
criterion for row-wise backward stability, and Lawson and Hanson [695, 1974,
Thm. 16.18] give a corresponding result in which the computed solution satisfies the criterion for
general normwise backward stability. The key to showing actual backward
stability is the use of Kiełbasiński and Schwetlick’s lemma, which is a mod-
ification of Lemma 8.2.11 in [658, 1988] and [659, 1992] (our Lemma 19.6).
Demmel and Higham [292, 1993] also give error analysis for the seminormal
equations method.
The new MGS algorithm for solving the minimum norm problem was first
suggested by Björck and Paige [119, 1992]; see also Björck [115, 1994].
Arioli and Laratta [27, 1986] give error analysis of QR factorization meth-
ods for solving the general problem min{ ||x − c||2 : Ax = b }, where A ∈ R^{m×n}
with m < n.

20.4.1. LAPACK
The same routines that solve the (overdetermined) LS problem also solve un-
derdetermined systems for the solution of minimal 2-norm. Thus, xGELS solves
a full rank underdetermined system with multiple right-hand sides by the Q
method. Routines xGELSX and xGELSS solve rank-deficient problems with mul-
tiple right-hand sides, using, respectively, a complete orthogonal factorization
(computed via QR factorization with column pivoting) and the singular value
decomposition.

Problems
20.1. (Björck [114, 1992]) Show that the system

(20.12)

characterizes the solution to the following generalizations of the LS problem


and the problem of finding the minimum norm solution to an underdetermined
system:

(20.13)
(20.14)

20.2. (RESEARCH PROBLEM) Find a formula for the backward error of an


arbitrary approximation to the minimum 2-norm solution of an underdeter-
mined system. That is, for A with rank(A) = m < n, find

m(y) := min{ e : y is the minimum 2-norm solution to (A + ∆A)y = b + ∆b,


where ||∆A||2 < e||A||2, ||∆b||2 < e||b||2 }.

Chapter 21
Vandermonde Systems

We began, 25 years ago, to take up [the conditioning of]


the class of Vandermonde matrices.
The original motivation came from unpleasant experiences with the
computation of Gauss type quadrature rules from the
moments of the underlying weight function.
— WALTER GAUTSCHI, How (Un)stable are Vandermonde Systems? (1990)

Extreme ill-conditioning of the [Vandermonde] linear systems


will eventually manifest itself as n increases by yielding
an error curve which is not sufficiently levelled on the current reference . . .
or more seriously fails to have the correct number of sign changes.
— M. ALMACANY, C. B. DUNHAM, and J. WILLIAMS,
Discrete Chebyshev Approximation by Interpolating Rationals (1984)


A Vandermonde matrix is defined in terms of scalars α0, α1, . . . , αn by

    V = V(α0, α1, . . . , αn) = (α_j^i),  i, j = 0:n,  V ∈ R^{(n+1)×(n+1)}.
Vandermonde matrices play an important role in various problems, such as


in polynomial interpolation. Suppose we wish to find the polynomial pn(x) =
anx^n + an−1x^{n−1} + · · · + a0 that interpolates to the data fi at the distinct
points αi, that is, pn(αi) = fi, i = 0:n. Then the desired coefficient vector
a = [a0, a1, . . . , an]^T is the solution of the dual Vandermonde system
V T a = f (dual).
The primal system
Vx = b (primal)
represents a moment problem, which arises, for example, when determining
the weights for a quadrature rule: given moments bi, find weights xj such that

    Σ_{j=0}^{n} α_j^i xj = bi,    i = 0:n.
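As a small illustration (not from the book; the point and data values below are arbitrary), the following NumPy fragment sets up V and solves the dual and primal systems directly.

import numpy as np

# Points and data to interpolate (assumed values, for illustration only).
alpha = np.array([0.0, 0.5, 1.0, 2.0])
f = np.array([1.0, 2.0, 0.0, 4.0])
n = len(alpha) - 1

# V(alpha_0,...,alpha_n): entry (i, j) is alpha_j**i.
V = np.vander(alpha, n + 1, increasing=True).T

# Dual system V^T a = f: a holds the coefficients of the interpolating
# polynomial p_n(x) = a_0 + a_1 x + ... + a_n x^n.
a = np.linalg.solve(V.T, f)
print(np.polyval(a[::-1], alpha))    # reproduces f

# Primal system V x = b: the moment problem (b chosen arbitrarily here).
b = np.array([1.0, 0.5, 0.25, 0.125])
x = np.linalg.solve(V, b)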
Because a Vandermonde matrix depends on only n+ 1 parameters and has


a great deal of structure, it is possible to perform standard computations with
reduced complexity. The easiest algorithm to derive is for matrix inversion.

21.1. Matrix Inversion


Assume that V is nonsingular and let V^{−1} = W = (wij). The ith row of
the equation WV = I may be written

    Σ_{j=0}^{n} wij α_k^j = δik,    k = 0:n.
These equations specify a fundamental interpolation problem that is solved


by the Lagrange basis polynomial:

    ℓi(x) = Σ_{j=0}^{n} wij x^j = Π_{j≠i} (x − αj)/(αi − αj).          (21.1)

The inversion problem is now reduced to finding the coefficients of li (x). It is


clear from (21.1) that V is nonsingular iff the αi are distinct. It also follows
from (21.1) that V^{−1} is given explicitly by

where σk(y1, . . . , yn) denotes the sum of all distinct products of k of the argu-
ments y1 , . . . , yn (that is σ k is the kth elementary symmetric function). An
efficient way to find the wij is first to form the master polynomial

and then to recover each Lagrange polynomial by synthetic division:

The scalars qi(αi) can be computed by Horner’s rule as the coefficients of qi


are formed.

Algorithm 21.1. Given distinct scalars α0, α1, . . . , αn this algorithm
computes W = V(α0, α1, . . . , αn)^{−1}.

% Stage 1: Construct the master polynomial.


a0 = −α0;  a1 = 1
for k = 1:n
    ak+1 = 1
    for j = k: −1: 1
        aj = aj−1 − αk aj
    end
    a0 = −αk a0
end

% Stage 2: Synthetic division.


for i = 0:n
    wi,n = 1;  s = 1
    for j = n − 1: −1: 0
        wij = aj+1 + αi wi,j+1
        s = αi s + wij
    end
    w(i, :) = w(i, :)/s
end

Cost: 6n^2 flops.


The O(n2) complexity is optimal, since the algorithm has n2 output values,
each of which must partake in at least one operation.
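A direct transcription of Algorithm 21.1 into NumPy is given below (a sketch only; indexing is 0-based, and the leading-coefficient update a0 = −αk a0 in stage 1 is our reading of the algorithm rather than a quotation of it).

import numpy as np

def vandermonde_inverse(alpha):
    # Algorithm 21.1: W = V(alpha_0,...,alpha_n)^{-1} via the master
    # polynomial and synthetic division; alpha must have distinct entries.
    alpha = np.asarray(alpha, dtype=float)
    n = len(alpha) - 1
    # Stage 1: coefficients a_0,...,a_{n+1} of q(x) = prod_k (x - alpha_k).
    a = np.zeros(n + 2)
    a[0], a[1] = -alpha[0], 1.0
    for k in range(1, n + 1):
        a[k + 1] = 1.0
        for j in range(k, 0, -1):
            a[j] = a[j - 1] - alpha[k] * a[j]
        a[0] = -alpha[k] * a[0]
    # Stage 2: synthetic division recovers each Lagrange polynomial,
    # with s = q_i(alpha_i) accumulated by Horner's rule as a by-product.
    W = np.zeros((n + 1, n + 1))
    for i in range(n + 1):
        W[i, n] = 1.0
        s = 1.0
        for j in range(n - 1, -1, -1):
            W[i, j] = a[j + 1] + alpha[i] * W[i, j + 1]
            s = alpha[i] * s + W[i, j]
        W[i, :] /= s
    return W

For modest n one can check that W @ np.vander(alpha, increasing=True).T is close to the identity.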
Vandermonde matrices have the deserved reputation of being ill condi-
tioned. The ill conditioning is a consequence of the monomials being a poor
basis for the polynomials on the real line. A variety of bounds for the con-
dition number of a Vandermonde matrix have been derived by Gautschi and
his co-authors. Let Vn = V(α0, α1, . . . , αn−1) ∈ R^{n×n}. For arbitrary distinct
points αi,

    (21.3)

with equality on the right when αj = |αj|e^{iθ} for all j with a fixed θ (in
particular, when αj > 0 for all j) [424, 1962], [426, 1978]. Note that the
upper and lower bounds differ by at most a factor 2^{n−1}. More specific bounds
are given in Table 21.1, on which we now comment.

Table 21.1. Bounds and estimates for the condition number of Vn.

    αi                                  Bound or estimate    Reference
    (V1): 1/(i + 1)                                          [428, 1990]
    (V2): arbitrary                                          [1035, 1994]
    (V3): αi > 0                                             [429, 1988]
    (V4): equispaced on [0, 1]                               [428, 1990]
    (V5): equispaced on [−1, 1]                              [425, 1975]
    (V6): Chebyshev nodes on [−1, 1]                         [425, 1975]
    (V7): roots of unity                                     well known
Bound (V1) and estimate (V4) follow from (21.3). The condition number
for the harmonic points 1/(i + 1) grows faster than n!; by contrast, the con-
dition numbers of the notoriously ill-conditioned Hilbert and Pascal matrices
grow only exponentially (see §26.1 and §26.4). For any choice of points the
rate of growth is at least exponential (V2), and this rate is achieved for points
equally spaced on [0, 1]. For points equally spaced on [– 1, 1], the condition
number grows at a slower exponential rate than that for [0, 1], and the growth
rate is slower still for the zeros of the nth degree Chebyshev polynomial (V6).
For one set of points the Vandermonde matrix is perfectly conditioned: the
roots of unity, for which n^{−1/2}Vn is unitary.

21.2. Primal and Dual Systems

The standard Vandermonde matrix can be generalized in at least two ways:


by allowing confluence of the points αi and by replacing the monomials by

other polynomials. An example of a confluent Vandermonde matrix is

(21.4)

The second, third, and fifth columns are obtained by “differentiating” the
previous column. The transpose of a confluent Vandermonde matrix arises
in Hermite interpolation; it is nonsingular if the points corresponding to the
“nonconfluent columns” are distinct.
A Vandermonde-like matrix is defined by P = (pi(αj)), where pi is a
polynomial of degree i. The case of practical interest is where the pi satisfy a
three-term recurrence relation. In the rest of this chapter we will assume that
the pi do satisfy a three-term recurrence relation. A particular application is
the solution of certain discrete Chebyshev approximation problems [11, 1984].
Incorporating confluence, we obtain a confluent Vandermonde-like matrix,
defined by
P = P(α0, α1,. . . ,αn ) = [q0 (α0), q1 (α1), . . . . qn (α n )]
where the αi are ordered so that equal points are contiguous, that is,
(21.5)
and the vectors qj(x) are defined recursively by

if j = 0 or

otherwise.

For all polynomials and points, P is nonsingular; this follows from the deriva-
tion of the algorithms below. One reason for the interest in Vandermonde-like
matrices is that for certain polynomials they tend to be better conditioned
than Vandermonde matrices (see, for example, Problem 21.5). Gautschi [427,
1983] derives bounds for the condition numbers of Vandermonde-like matrices.
Fast algorithms for solving the confluent Vandermonde-like primal and
dual systems Px = b and P^T a = f can be derived under the assumption that
the pj(x) satisfy the three-term recurrence relation

    p1(x) = θ0(x − β0)p0(x),    p0(x) ≡ 1,                                  (21.6a)
    p_{j+1}(x) = θj(x − βj)pj(x) − γj p_{j−1}(x),    j = 1:n − 1,           (21.6b)

where θj ≠ 0 for all j. Note that in this chapter γi denotes a constant in the
recurrence relation and not iu/(1 − iu) as elsewhere in the book. The latter
notation is not used in this chapter.

The algorithms exploit the connection with interpolation. Denote by


r(i) ≥ 0 the smallest integer for which αi = αi−1 = · · · = αr(i). Consid-
ering first the dual system P^T a = f, we note that

    ψ(x) = Σ_{j=0}^{n} aj pj(x)                                  (21.7)

satisfies
    ψ^{(i−r(i))}(αi) = fi,    i = 0:n.
Thus ψ is a Hermite interpolating polynomial for the data {αi, fi }, and our
task is to obtain its representation in terms of the basis As a first
step we construct the divided difference form of ψ,

(21.8)

The (confluent) divided differences ci = f[α0, α1,. . . . αi] may be generated


using the recurrence relation

(21.9)

Now we need to generate the ai in (21.7) from the ci in (21.8). The idea is
to expand (21.8) using nested multiplication and use the recurrence relations
(21.6) to express the results as a linear combination of the pj. Define

q n (x) = cn , (21.10)
(21.11)

from which q0(x) = ψ(x). Let

(21.12)

To obtain recurrences for the coefficients we expand the right-hand side


of (21.11), giving

Using the relations, from (21.6),

we obtain, for k = 0:n – 2,

(21.13)

in which the empty summation is defined to be zero. For the special case
k = n – l we have

(21.14)

Recurrences for the coefficients in terms of j = k +1: n,


follow immediately by comparing (21.12) with (21.13) and (21.14).
In the following algorithm, stage I computes the confluent divided differ-
ences and stage II implements the recurrences derived above.

Algorithm 21.2 (dual, P^T a = f). Given parameters θj, βj, γj in (21.6), a vec-
tor f, and points α0, α1, . . . , αn satisfying (21.5), this algorithm solves the
dual confluent Vandermonde-like system P^T a = f.

% Stage I:
Set c = f

for k = 0:n − 1
    clast = ck
    for j = k + 1:n
        if αj = αj−k−1 then
            cj = cj/(k + 1)
        else
            temp = cj
            cj = (cj − clast)/(αj − αj−k−1)
            clast = temp
        end
    end
end

% Stage II:
Set a = c
an−1 = an−1 + (β0 − αn−1)an
an = an/θ0
for k = n − 2: −1: 0

    for j = 1:n − k − 2

    end

    an = an/θn−k−1
end

Assuming that the values are given (note that γj appears only in
the terms the computational cost of Algorithm 21.2 is at most 9n 2 / 2
flops. The vectors c and a have been used for clarity; in fact both can be
replaced by f, so that the right-hand side is transformed into the solution
without using any extra storage.
Values of θj, βj, and γj for some polynomials of interest are collected in Ta-
ble 21.2.
The key to deriving a corresponding algorithm for solving the primal sys-
tem is to recognize that Algorithm 21.2 implicitly computes a factorization of
P - T into the product of 2n triangular matrices. In the rest of this chapter
we adopt the convention that the subscripts of all vectors and matrices run
from 0 to n. In stage I, letting c(k) denote the vector c at the start of the k t h
iteration of the outer loop, we have

c(0) = f, c(k+l) = Lkc(k), k = 0:n – 1. (21.15)

The matrix Lk is lower triangular and agrees with the identity matrix in rows
0 to k.

Table 21.2. Parameters in the three-term recurrence (21.6).

    Polynomial      θj      βj          γj
    Monomials       1       0           0
    Chebyshev       2*      0           1       (*θ0 = 1)
    Legendre*               0                   (*normalized so that pj(1) = 1)
    Hermite         2       0           2j
    Laguerre                2j + 1

The remaining rows of Lk can be described, for k + 1 ≤ j ≤ n, by

    (row j of Lk) = ej^T/(k + 1),                                    if αj = αj−k−1,
                  = (ej − es)^T/(αj − αj−k−1)  for some s < j,       otherwise,

where ej is column j of the identity matrix. Similarly, stage II can be expressed


as
a^(n) = c^(n),    a^(k) = Uk a^(k+1),    k = n − 1: −1: 0.          (21.16)
The matrix Uk is upper triangular, it agrees with the identity matrix in rows
0 to k – 1, and it has zeros everywhere above the first two superdiagonals.
From (21.15) and (21.16) we see that the overall effect of Algorithm 21.2
is to evaluate, step-by-step, the product

a = U0 · · · Un−1 Ln−1 · · · L0 f = P^{−T} f.          (21.17)

Taking the transpose of this product we obtain a representation of P^{−1}, from
which it is easy to write down an algorithm for computing x = P^{−1}b.

Algorithm 21.3 (primal, Px = b). Given parameters θj, βj, γj in (21.6), a vec-
tor b, and points α0, α1, . . . , αn satisfying (21.5), this algorithm solves the
primal confluent Vandermonde-like system Px = b.
% Stage I:
Set d = b
for k = 0:n – 2
for j = n – k: – 1:2

end

end

% Stage II:
Set x = d
for k = n − 1: −1: 0
    xlast = 0
    for j = n: −1: k + 1
        if αj = αj−k−1 then
            xj = xj/(k + 1)
        else
            temp = xj/(αj − αj−k−1)
            xj = temp − xlast
            xlast = temp
        end
    end
    xk = xk − xlast
end

Algorithm 21.3 has, by construction, the same operation count as Algo-


rithm 21.2.
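In the nonconfluent monomial case the algorithms reduce to those of Björck and Pereyra (as noted in the next section); a NumPy sketch of that specialization of Algorithm 21.3 is given below (an illustration only; distinct points, monomial basis, function name ours).

import numpy as np

def bjorck_pereyra_primal(alpha, b):
    # Solve V(alpha) x = b for distinct points and the monomial basis.
    alpha = np.asarray(alpha, dtype=float)
    x = np.asarray(b, dtype=float).copy()
    n = len(alpha) - 1
    # Stage I: apply the transposed "Horner" factors.
    for k in range(n):
        for j in range(n, k, -1):
            x[j] = x[j] - alpha[k] * x[j - 1]
    # Stage II: apply the transposed divided-difference factors.
    for k in range(n - 1, -1, -1):
        for j in range(k + 1, n + 1):
            x[j] = x[j] / (alpha[j] - alpha[j - k - 1])
        for j in range(k, n):
            x[j] = x[j] - x[j + 1]
    return x

# Bjorck and Pereyra's test problem from the next section:
# alpha_i = 1/(i+3), b_i = 2^{-i}, n = 9.
i = np.arange(10.0)
x = bjorck_pereyra_primal(1.0 / (i + 3), 2.0 ** (-i))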

21.3. Stability
Algorithms 21.2 and 21.3 have interesting stability properties. Depending on
the problem parameters, the algorithms can range from being very stable (in
either a backward or forward sense) to very unstable.
When the pi are the monomials and the points αi are distinct, the algo-
rithms reduce to those of Björck and Pereyra [121, 1970]. Björck and Pereyra
found that for the system Vx = b with αi = 1/(i + 3), bi = 2^{−i}, n = 9, and
on a computer with u ≈ 10^{−16},

Thus the computed solution has a tiny componentwise relative error, despite
the extreme ill condition of V. Björck and Pereyra comment “It seems as if
at least some problems connected with Vandermonde systems, which tradi-
tionally have been considered too ill-conditioned to be attacked, actually can
be solved with good precision.” This high accuracy can be explained with the
aid of the error analysis below.
The analysis can be kept quite short by exploiting the interpretation of
the algorithms in terms of matrix–vector products. Because of the inherent
duality between Algorithms 21.2 and 21.3, any result for one has an analogue
for the other, so we will consider only Algorithm 21.2.
The analysis that follows is based on the model (2.4), and so is valid only
for machines with a guard digit. With the no-guard-digit model (2.6) the

bounds become weaker and more complicated, because of the importance of


terms in the analysis.

21.3.1. Forward Error


Theorem 21.4. If no underflow or overflows are encountered then Algo-
rithm 21.2 runs to completion and the computed solution satisfies

(21.18)

where c(n, u) := (1 + u)^{7n} − 1 = 7nu + O(u^2).

Proof. First, note that Algorithm 21.2 must succeed in the absence of
underflow and overflow, because division by zero cannot occur.
The analysis of the computation of the c(k) vectors is exactly the same
as that for the nonconfluent divided differences in §5.3 (see (5.9) and (5.10)).
However, we obtain a slightly cleaner error bound by dropping the γ k notation
and instead writing

|∆Lk| < [(1 + u)^3 − 1]|Lk|.          (21.19)

Turning to the equations (21.16), we can regard the multiplication a(k) =


U ka (k+l) as comprising a sequence of three-term inner products. Analysing
these in standard fashion we arrive at the equation

|∆Uk| < [(1 + u)^4 − 1]|Uk|,          (21.20)

where we have taken into account the rounding errors in forming

Since (21.19) and (21.20) imply that

(21.21)

Applying Lemma 3.7 to (21.21) and using (21.17), we obtain the desired bound
for the forward error.
The product |U0| . . . |Un-1||Ln-1| . . . |L0| in (21.18) is an upper bound for
|U0. . .Un-1Ln-1 . . . L0| = |P–T| and is equal to it when there is no subtrac-
tive cancellation in the latter product. To gain insight, suppose the points are
distinct and consider the case n = 3. We have

(21.22)

There is no subtractive cancellation in this product as long as each matrix


has the alternating (checkerboard) sign pattern defined, for A = (aij ), by
(−1)^{i+j} aij ≥ 0. This sign pattern holds for the matrices Li if the points
α i are arranged in increasing order. The matrices Ui have the required sign
pattern provided that (in general)

θi > 0, γi > 0 for all i, and β i - αk < 0 for all i + k < n – 1.

In view of Table 21.2 we have the following result.

Corollary 21.5. If 0 < α0 < α1 < · · · < αn then for the monomials, or the
Chebyshev, Legendre, or Hermite polynomials,

    |U0| · · · |Un−1| |Ln−1| · · · |L0| = |P^{−T}|.
Corollary 21.5 explains the high accuracy observed by Björck and Pereyra.
Note that if
    |P^{−T}| |f| < tn |P^{−T} f| = tn |a|
then, under the conditions of the corollary, |a − â| < c(n, u)tn |a|, which shows
that the componentwise relative error is bounded by c(n, u)tn. For the prob-
lem of Björck and Pereyra it can be shown that tn ≈ n^4/24. Another factor
contributing to the high accuracy in this problem is that many of the sub-
tractions αj – αj-k-1 = 1/(j + 3) – 1/(j – k + 2) are performed exactly, in
view of Theorem 2.5.
Note that under the conditions of the corollary P^{−T} has the alternating
sign pattern, since each of its factors does. Thus if (−1)^i fi ≥ 0 then tn =
1, and the corollary implies that â is accurate essentially to full machine
precision, independent of the condition number. In particular, taking
f to be a column of the identity matrix shows that we can compute the inverse
of P to high relative accuracy, independent of its condition number.

21.3.2. Residual

Next we look at the residual, r = f − P^T â. Rewriting (21.21),

(21.23)
From the proof of Theorem 21.4 and the relation (5.9) it is easy to show that

(Lk + ∆Lk)^{−1} = Lk^{−1} + Ek,    |Ek| < [(1 − u)^{−3} − 1]|Lk^{−1}|.

Strictly, an analogous bound for (Uk + ∆Uk)-l does not hold, since ∆U k
cannot be expressed in the form of a diagonal matrix times U k. However,
it seems reasonable to make a simplifying assumption that such a bound is
valid, say

(Uk + ∆Uk)^{−1} = Uk^{−1} + Fk,    |Fk| < [(1 − u)^{−4} − 1]|Uk^{−1}|.          (21.24)

Then, writing (21.23) as

and invoking Lemma 3.7, we obtain the following result.

Theorem 21.6. Under the assumption (21.24), the residual of the computed
solution from Algorithm 21.2 is bounded by

(21.25)

where d(n, u) := (1 − u)^{−7n} − 1 = 7nu + O(u^2).

For the monomials, with distinct, nonnegative points arranged in increas-


ing order, the matrices Li and U i are bidiagonal with the alternating sign
property, as we saw above. Thus Li^{−1} ≥ 0 and Ui^{−1} ≥ 0, and since P^T =
L0^{−1} · · · Ln−1^{−1} Un−1^{−1} · · · U0^{−1}, we obtain from (21.25) the following pleasing re-
sult, which guarantees a tiny componentwise relative backward error.

Corollary 21.7. Let 0 < α0 < α1 < · · · < αn, and consider Algorithm 21.2
for the monomials. Under the assumption (21.24), the computed solution
satisfies

21.3.3. Dealing with Instability

The potential instability of Algorithm 21.2 is illustrated by the following ex-
ample. Take the Chebyshev polynomials Ti with the points αi = cos(iπ/n)
(the extrema of Tn), and define the right-hand side by fi = (−1)^i. The exact
solution to P^T a = f is the last column of the identity matrix. Relative errors
and residuals are shown in Table 21.3 (u ≈ 10^{−16}). Even though κ2(P) < 2
for all n (see Problem 21.7), the forward errors and relative residuals are large
and grow with n. The reason for the instability is that there are large inter-
mediate numbers in the algorithm (at the start of stage II for n = 40, ||c||∞
is of order 10^{15}); hence severe cancellation is necessary to produce the final
answer of order 1. Looked at another way, the factorization of P^T used by
the algorithm is unstable because it is very sensitive to perturbations in the
factors.

Table 21.3. Results for dual Chebyshev–Vandermonde-like system.

    n       10              20              30              40
            2.5 × 10^{−12}  6.3 × 10^{−7}   4.7 × 10^{−2}   1.8 × 10^{3}
            6.0 × 10^{−13}  1.1 × 10^{−7}   5.3 × 10^{−3}   8.3 × 10^{−2}
How can we overcome this instability? There are two possibilities: preven-
tion and cure. The only means at our disposal for preventing instability is to
reorder the points αi . The ordering is arbitrary subject to condition (21.5)
being satisfied. Recall that the algorithms construct an LU factorization of
PT in factored form, and note that permuting the rows of PT is equivalent
to reordering the points αi . A reasonable approach is therefore to take what-
ever ordering of the points would be produced by Gaussian elimination with
partial pivoting (GEPP) applied to P^T. Since the diagonal elements of U in
P^T = LU have the form

where hi depends only on the θi, and since partial pivoting maximizes the
pivot at each stage, this ordering of the αi can be computed at the start of the
algorithm in n^2 flops (see Problem 21.8). This ordering is essentially the Leja
ordering (5.13) (the difference is that partial pivoting leaves α0 unchanged).
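A sketch of such a reordering is shown below (a greedy, Leja-style ordering that keeps α0 first, as partial pivoting does, and at each step picks the remaining point maximizing the product of distances to the points already placed; this is an illustration and is not claimed to coincide exactly with the GEPP ordering in every case).

import numpy as np

def leja_order(alpha):
    # Greedy Leja-style ordering of distinct points; alpha[0] stays first.
    alpha = np.asarray(alpha, dtype=float)
    order = [0]
    remaining = list(range(1, len(alpha)))
    while remaining:
        prods = [np.prod(np.abs(alpha[i] - alpha[order])) for i in remaining]
        order.append(remaining.pop(int(np.argmax(prods))))
    return alpha[order]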
To attempt to cure observed instability we can use iterative refinement
in fixed precision. Ordinarily, residual computation for linear equations is

trivial, but in this context the coefficient matrix is not given explicitly and
computing the residual turns out to be conceptually almost as difficult, and
computationally as expensive, as solving the linear system!
To compute the residual for the dual system we need a means for evalu-
ating ψ(t) in (21.7) and its first k < n derivatives, where k = max i (i – r(i))
is the order of confluence. Since the polynomials pj satisfy a three-term re-
currence relation we can use an extension of the Clenshaw recurrence formula
(which evaluates ψ but not its derivatives). The following algorithm imple-
ments the appropriate recurrences, which are given by Smith [927, 1965] (see
Problem 21.9).

Algorithm 21.8 (extended Clenshaw recurrence). This algorithm computes


the k + 1 values yj = ψ^{(j)}(x), 0 ≤ j ≤ k, where ψ is given by (21.7) and k < n.
It uses a work vector z of order k.

Set y(0: k) = z(0: k) = 0


y0 = an
for j = n – 1: – 1:0
temp = y0
y0 = θj(x − βj)y0 + aj − γj+1 z0
z0 = temp
for i = 1: min(k, n – j)
temp = yi
yi = θj((x − βj)yi + zi−1) − γj+1 zi
zi = temp
end
end
m=1
for i = 2:k
m=m*i
yi = m * yi
end
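A NumPy transcription of Algorithm 21.8 follows (a sketch only; indexing is 0-based, theta and beta are indexed 0:n−1 and gamma 0:n, and the form of the y0 update assumes the recurrence (21.6) as written above).

import numpy as np

def extended_clenshaw(a, x, theta, beta, gamma, k):
    # Evaluate psi(x) = sum_j a_j p_j(x) and its first k derivatives, where
    # p_{j+1} = theta_j (x - beta_j) p_j - gamma_j p_{j-1}, p_0 = 1.
    n = len(a) - 1
    y = np.zeros(k + 1)        # y[i] accumulates psi^(i)(x)/i! during the loop
    z = np.zeros(k + 1)        # work vector holding the previous iterates
    y[0] = a[n]
    for j in range(n - 1, -1, -1):
        temp = y[0]
        y[0] = theta[j] * (x - beta[j]) * y[0] + a[j] - gamma[j + 1] * z[0]
        z[0] = temp
        for i in range(1, min(k, n - j) + 1):
            temp = y[i]
            y[i] = theta[j] * ((x - beta[j]) * y[i] + z[i - 1]) - gamma[j + 1] * z[i]
            z[i] = temp
    # Rescale: y[i] currently holds psi^(i)(x)/i!.
    m = 1
    for i in range(2, k + 1):
        m *= i
        y[i] *= m
    return y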

Computing the residual using Algorithm 21.8 costs between 3n^2 flops (for
full confluence) and 6n^2 flops (for the nonconfluent case); recall that Algo-
rithm 21.2 costs at most 9n^2/2 flops!
The use of iterative refinement can be justified with the aid of Theo-
rem 11.3. For (confluent) Vandermonde matrices, the residuals are formed
using Horner’s rule and (11.7) holds in view of (5.3) and (5.7). Hence for
standard Vandermonde matrices, Theorem 11.3 leads to an asymptotic com-
ponentwise backward stability result. A complete error analysis of Algo-
rithm 21.8 is not available for (confluent) Vandermonde-like matrices, but
it is easy to see that (11.7) will not always hold. Nevertheless it is clear that

a normwise bound can be obtained (see Oliver [805, 1977] for the special case
of the Chebyshev polynomials) and hence an asymptotic normwise stability
result can be deduced from Theorem 11.3. Thus there is theoretical backing
for the use of iterative refinement with Algorithm 21.8.
Numerical experiments using Algorithm 21.8 in conjunction with the par-
tial pivoting reordering and fixed precision iterative refinement show that both
techniques are effective means for stabilizing the algorithms, but that iterative
refinement is likely to fail once the instability is sufficiently severe. Because
of its lower cost, the reordering approach is preferable.
Two heuristics are worth noting. Consider a (confluent) Vandermonde-
like system Px = b. Heuristic 1: it is systems with a large-normed solution
that are solved to high accuracy by the fast algorithms.
To produce a large solution the algorithms must sustain little cancellation,
and the error analysis shows that cancellation is the main cause of instability.
Heuristic 2: GEPP is unable to solve accurately Vandermonde systems with
a very large-normed solution. The pivots for GEPP
will tend to satisfy so that the computed solution will tend
to satisfy A consequence of these two heuristics is that
for Vandermonde(-like) systems with a very large-normed solution the fast
algorithms will be much more accurate (but no more backward stable) than
GEPP. However, we should be suspicious of any framework in which such sys-
tems arise; although the solution vector may be obtained accurately (barring
overflow), subsequent computations with numbers of such a wide dynamic
range will probably themselves be unstable.

21.4. Notes and References


The formulae (21.1) and (21.2), and inversion methods based on these formu-
lae, have been discovered independently by many authors. Traub [1013, 1966,
14] gives a short historical survey, his earliest reference being a 1932 book
by Kowalewski. There does not appear to be any published error analysis
for Algorithm 21.1 (see Problem 21.3). There is little justification for using
the output of the algorithm to solve the primal or dual linear system, as is
done in [842, 1992, §2.8]; Algorithms 21.2 and 21.3 are more efficient and al-
most certainly at least as stable. Calvetti and Reichel [179, 1993] generalize
Algorithm 21.1 to Vandermonde-like matrices, but they do not present any
error analysis. Gohberg and Olshevsky [456, 1994] give another O(n2) flops
algorithm for inverting a Chebyshev–Vandermonde matrix.
The standard condition number K(V) is not an appropriate measure of
sensitivity when only the points αi are perturbed, because it does not reflect
the special structure of the perturbations. Appropriate condition numbers
were first derived by Higham [533, 1987] and are comprehensively investigated

by Bartels and D. J. Higham [76, 1992]; see Problem 21.10.


Methods for solving the dual and primal Vandermonde systems have an in-
teresting history. The earliest algorithm was derived by Lyness and Moler [718,
1966] via Neville interpolation; it solves the dual system in O(n3) flops. The
first O(n2 ) algorithm was obtained by Ballester and Pereyra [52, 1967]; it
computes the LU factors of the Vandermonde matrix and requires O(n2) el-
ements of storage. Björck and Pereyra [121, 1970] derived the specialization
of Algorithms 21.2 and 21.3 to nonconfluent Vandermonde matrices; these
algorithms require no storage other than that for the problem data. The al-
gorithms of Björck and Pereyra were generalized by Björck and Elfving to
confluent systems [117, 1973], and by Higham to Vandermonde-like systems
[536, 1988] and confluent Vandermonde-like systems [547, 1990]. The error
analysis in this chapter is taken from [547, 1990]. Tang and Golub [992, 1981]
give a block algorithm that requires only real arithmetic to solve a Vander-
monde system in which all the points appear in complex conjugate pairs.
Other O(n2 ) algorithms for solving Chebyshev-Vandermonde systems are
given by Reichel and Opfer [865, 1991] and Calvetti and Reichel [178, 1992].
The former algorithms are progressive, in that they allow the solution to be
updated when a new point αi is added; they generalize progressive algorithms
of Björck and Pereyra [121, 1970]. Boros, Kailath, and Olshevsky [135, 1994]
use the concept of displacement structure to derive further O(n2) algorithms
for solving Vandermonde and Chebyshev–Vandermonde systems. No error
analysis is given in [135, 1994], [178, 1992], or [865, 1991].
The O(n2) complexity of the algorithms mentioned above for solving Van-
dermonde-like systems is not optimal. Lu [714, 1994], [715, 1995], [716, 1996]
derives O(n log n log p) flops algorithms, where p is the number of distinct
points. The numerical stability and practical efficiency of the algorithms re-
main to be determined. Bini and Pan [98, 1994] give an O ( n log2 n) algorithm
for solving a dual Vandermonde system that involves solving related Toeplitz
and Hankel systems.
Since Vandermonde systems can be solved in less than O(n3) flops it is
natural to ask whether the O(mn^2) complexity of QR factorization of an
m × n matrix can be bettered for a Vandermonde matrix. QR factorization
algorithms with O(mn) flop counts have been derived by Demeure [277, 1989],
[278, 1990], and for Vandermonde-like matrices where the polynomials satisfy
a three-term recurrence by Reichel [864, 1991]. No error analysis has been
published for these algorithms. Demeure’s algorithms are likely to be unstable,
because they form the normal equations matrix VTV.

Problems

21.1. Derive a modified version of Algorithm 21.1 in which the scale factor

s = qi(αi) is computed directly as

    s = Π_{j≠i} (αi − αj).

What is the flop count for this version?


21.2. (Calvetti and Reichel [179, 1993]) Generalize Algorithm 21.1 to the
inversion of a Vandermonde-like matrix for polynomials that satisfy a three-
term recurrence relation.
21.3. Investigate the stability of Algorithm 21.1 and the modified version
of Problem 21.1. (a) Evaluate the left and right residuals of the computed
inverses; compare the results with those for GEPP. (b) Show that Algo-
rithm 21.1 always performs subtractions of like-signed numbers and explain
the implications for stability. (Does Algorithm 21.1 share the high accuracy
properties discussed at the end of §21.3.1?) (c) (RESEARCH PROBLEM) Derive
and explore forward error bounds and residual bounds for both algorithms.
Extend the analysis to the algorithms of Calvetti and Reichel [179, 1993].
21.4. By summing (21.1) for i = 0:n, show that Σ_{i=0}^{n} ℓi(x) ≡ 1. What does
this imply about the sign pattern of V^{−1}? What is the sum of all the elements
of V^{−1}?
21.5. Let T = T(α0, α1, . . . , αn) be a Chebyshev–Vandermonde
matrix (Tj is the Chebyshev polynomial of degree j), with T^{−1} = (uij).
Analogously to (21.1) we have Hence
where V^{−1} = V(α0, α1, . . . , αn)^{−1} = (wij). Show that

and hence that

(Hint: show that T = LV for a lower triangular matrix L and use the fact
that
21.6. Show that for a nonconfluent Vandermonde-like matrix P =
where the pi satisfy (21.6),

(Hint: study the structure of (21.22 ).)



21.7. Show that for the Chebyshev–Vandermonde-like matrix T = T(α0,
α1, . . . , αn),

1. κ2(T) =      for αi = cos((i + ½)π/(n + 1)) (zeros of Tn+1).

2. κ2(T) < 2,   for αi = cos(iπ/n) (extrema of Tn).

(Hint: use the discrete orthogonality properties of the Chebyshev polynomials;


see, e.g., Hamming [501, 1973, pp.472–473].)
21.8. Derive an O(n2 ) flops algorithm that reorders the distinct points α0 ,
α1, . . . . αn according to the same permutation that would be produced by
GEPP applied to PT(α0, α1,. . . . , αn ). (Cf. Problem 5.4. ) Can your algorithm
ever produce the increasing ordering?
21.9. Derive Algorithm 21.8 by differentiating repeatedly the original Clen-
shaw recurrence (which is Algorithm 21.8 with k = 0) and rescaling so as to
consign factorial terms to a cleanup loop at the end. Derive an algorithm for
computing the residual for the primal system in a similar way, using recur-
rences obtained by differentiating (21.6).
21.10. (Higham [533, 1987]) A structured condition number for the primal
Vandermonde system Vx = b, where V = V(α0, α1, . . . . ,α n ), can be defined
by

Show that
cond(V) =

and derive a corresponding condition number for the dual system VTa = f.

Chapter 22
Fast Matrix Multiplication

A simple but extremely valuable bit of equipment in matrix multiplication


consists of two plain cards,
with a re-entrant right angle cut out of one or both of them
if symmetric matrices are to be multiplied.
In getting the element of the ith row and jth column of the product,
the ith row of the first factor and the jth column of the second
should be marked by a card beside, above, or below it.
— HAROLD HOTELLING, Some New Methods in Matrix Calculation (1943)

It was found that multiplication of matrices using punched card storage


could be a highly efficient process on the Pilot ACE,
due to the relative speeds of the Hollerith card reader used for input
(one number per 16 ins.) and the automatic multiplier (2 ins.).
While a few rows of one matrix were held in the machine
the matrix to be multiplied by it was passed through the card reader.
The actual computing and selection of numbers from store
occupied most of the time between the passage of
successive rows of the cards through the reader,
so that the overall time was but little longer
than it would have been if the machine
had been able to accommodate both matrices.
— MICHAEL WOODGER, The History and Present Use of
Digital Computers at the National Physical Laboratory (1958)


22.1. Methods
A fast matrix multiplication method forms the product of two n × n matrices
in O(n^ω) arithmetic operations, where ω < 3. Such a method is more efficient
asymptotically than direct use of the definition

    cij = Σ_{k=1}^{n} aik bkj,          (22.1)

which requires O(n3) operations. For over a century after the development of
matrix algebra in the 1850s by Cayley, Sylvester and others, this definition
provided the only known method for multiplying matrices. In 1967, however,
to the surprise of many, Winograd found a way to exchange half the multi-
plications for additions in the basic formula [1105, 1968]. The method rests
on the identity, for vectors x, y of even dimension n,

    x^T y = Σ_{i=1}^{n/2} (x_{2i−1} + y_{2i})(x_{2i} + y_{2i−1}) − Σ_{i=1}^{n/2} x_{2i−1} x_{2i} − Σ_{i=1}^{n/2} y_{2i−1} y_{2i}.          (22.2)

When this identity is applied to a matrix product AB, with x a row of A and
y a column of B, the second and third summations are found to be common
to the other inner products involving that row or column, so they can be
computed once and reused. Winograd’s paper generated immediate practical
interest because on the computers of the 1960s floating point multiplication
was typically two or three times slower than floating point addition. (On
today’s machines these two operations are usually similar in cost.)
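As an illustration of how the row and column sums are reused, a straightforward (unoptimized) NumPy sketch of Winograd's method for C = AB is given below; n is assumed even and the function name is ours.

import numpy as np

def winograd_multiply(A, B):
    # Precompute the row terms of A and column terms of B in (22.2) so that
    # each inner product needs only about n/2 multiplications.
    n = A.shape[0]                                 # A, B are n-by-n, n even
    r = np.sum(A[:, 0::2] * A[:, 1::2], axis=1)    # row sums of A
    c = np.sum(B[0::2, :] * B[1::2, :], axis=0)    # column sums of B
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            C[i, j] = np.sum((A[i, 0::2] + B[1::2, j]) *
                             (A[i, 1::2] + B[0::2, j])) - r[i] - c[j]
    return C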
Shortly after Winograd’s discovery, Strassen astounded the computer sci-
ence community by finding an O(n^{log2 7}) operations method for matrix multi-
plication (log2 7 ≈ 2.807). A variant of this technique can be used to compute
A^{−1} (see Problem 22.8) and thereby to solve Ax = b, both in O(n^{log2 7}) op-
erations. Hence the title of Strassen’s 1969 paper [962, 1969], which refers to
the question of whether Gaussian elimination is asymptotically optimal for
solving linear systems.
Strassen’s method is based on a circuitous way to form the product of a
pair of 2 x 2 matrices in 7 multiplications and 18 additions, instead of the usual
8 multiplications and 4 additions. As a means of multiplying 2 x 2 matrices the
formulae have nothing to recommend them, but they are valid more generally
for block 2 × 2 matrices. Let A and B be matrices of dimensions m × n and
n × p respectively, where all the dimensions are even, and partition each of A,
B, and C = AB into four equally sized blocks:

    A = [ A11  A12 ],    B = [ B11  B12 ],    C = [ C11  C12 ].          (22.3)
        [ A21  A22 ]         [ B21  B22 ]         [ C21  C22 ]

Strassen’s formulae are

    P1 = (A11 + A22)(B11 + B22),
    P2 = (A21 + A22)B11,
    P3 = A11(B12 − B22),
    P4 = A22(B21 − B11),
    P5 = (A11 + A12)B22,
    P6 = (A21 − A11)(B11 + B12),                    (22.4)
    P7 = (A12 − A22)(B21 + B22),
    C11 = P1 + P4 − P5 + P7,
    C12 = P3 + P5,
    C21 = P2 + P4,
    C22 = P1 + P3 − P2 + P6.

Counting the additions (A) and multiplications (M) we find that while con-
ventional multiplication requires

mnpM + m(n – 1)pA,

Strassen’s algorithm, using conventional multiplication at the block level, re-


quires

Thus, if m, n, and p are large, Strassen’s algorithm reduces the arithmetic


by a factor of about 7/8. The same idea can be used recursively on the
multiplications associated with the Pi. In practice, recursion is only performed
down to the “crossover” level at which any savings in floating point operations
are outweighed by the overheads of a computer implementation.
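A recursive NumPy sketch follows (an illustration, not a tuned implementation; the dimension is assumed to be a power of 2, and n0 is the crossover size below which conventional multiplication is used).

import numpy as np

def strassen(A, B, n0=8):
    # One step of Strassen's recursion on the block 2x2 partitioning (22.3)-(22.4).
    n = A.shape[0]                      # n assumed a power of 2
    if n <= n0:
        return A @ B                    # conventional multiplication at the base
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    P1 = strassen(A11 + A22, B11 + B22, n0)
    P2 = strassen(A21 + A22, B11, n0)
    P3 = strassen(A11, B12 - B22, n0)
    P4 = strassen(A22, B21 - B11, n0)
    P5 = strassen(A11 + A12, B22, n0)
    P6 = strassen(A21 - A11, B11 + B12, n0)
    P7 = strassen(A12 - A22, B21 + B22, n0)
    C = np.empty_like(A)
    C[:h, :h] = P1 + P4 - P5 + P7
    C[:h, h:] = P3 + P5
    C[h:, :h] = P2 + P4
    C[h:, h:] = P1 + P3 - P2 + P6
    return C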
To state a complete operation count, we suppose that m = n = p = 2^k
and that recursion is terminated when the matrices are of dimension n0 = 2^r,
at which point conventional multiplication is used. The number of multipli-
cations and additions can be shown to be

    M(k) = 7^{k−r} 8^r,          (22.5)

The sum M(k) + A(k) is minimized over all integers r by r = 3; interestingly,


this value is independent of k. The total operation count for the “optimal”
n0 = 8 is less than

Hence, in addition to having a lower exponent, Strassen’s method has a rea-


sonable constant.

Winograd found a variant of Strassen’s formulae that requires the same


number of multiplications but only 15 additions (instead of 18). This vari-
ant therefore has slightly smaller constants in the operation count for n x n
matrices. For the product (22.3) the formulae are

    S1 = A21 + A22,    M1 = S2 S6,     T1 = M1 + M2,
    S2 = S1 − A11,     M2 = A11 B11,   T2 = T1 + M4,
    S3 = A11 − A21,    M3 = A12 B21,
    S4 = A12 − S2,     M4 = S3 S7,
                                                          (22.6)
    S5 = B12 − B11,    M5 = S1 S5,     C11 = M2 + M3,
    S6 = B22 − S5,     M6 = S4 B22,    C12 = T1 + M5 + M6,
    S7 = B22 − B12,    M7 = A22 S8,    C21 = T2 − M7,
    S8 = S6 − B21,                     C22 = T2 + M5.

Until the late 1980s there was a widespread view that Strassen’s method
was of theoretical interest only, because of its supposed large overheads for
dimensions of practical interest (see, for example, [909, 1988]), and this view
is still expressed by some [842, 1992]. However, in 1970 Brent implemented
Strassen’s algorithm in Algol-W on an IBM 360/67 and concluded that in this
environment, and with just one level of recursion, the method runs faster than
the conventional method for n > 110 [142, 1970]. In 1988, Bailey compared
his Fortran implementation of Strassen’s algorithm for the Cray-2 with the
Cray library routine for matrix multiplication and observed speedup factors
ranging from 1.45 for n = 128 to 2.01 for n = 2048 (although 35% of these
speedups were due to Cray-specific techniques) [43, 1988]. These empirical
results, together with more recent experience of various researchers, show that
Strassen’s algorithm is of practical interest, even for n in the hundreds. In-
deed, Fortran codes for (Winograd’s variant of) Strassen’s method have been
supplied with IBM’s ESSL library [595, 1988] and Cray’s UNICOS library
[602, 1989] since the late 1980s.
Strassen’s paper raised the question “what is the minimum exponent ω
such that multiplication of n × n matrices can be done in O(n^ω) operations?”
Clearly, ω > 2, since each element of each matrix must partake in at least one
operation. It was 10 years before the exponent was reduced below Strassen’s
log2 7 . A flurry of publications, beginning in 1978 with Pan and his expo-
nent 2.795 [815, 1978], resulted in reduction of the exponent to the current
record 2.376, obtained by Coppersmith and Winograd in 1987 [245, 1987].
Figure 22.1 plots exponent versus time of publication (not all publications are
represented in the graph); in principle, the graph should extend back to 1850!
Figure 22.1. Exponent versus time for matrix multiplication.

Some of the fast multiplication methods are based on a generalization of
Strassen’s idea to bilinear forms. Let A, B ∈ R^{h×h}. A bilinear noncommuta-
tive algorithm over R for multiplying A and B with t “nonscalar multiplica-


tions” forms the product C = AB according to

(22.7a)

(22.7b)

where the elements of the matrices W, U^(k), and V^(k) are constants. This
algorithm can be used to multiply n × n matrices A and B, where n = h^k, as
follows: partition A, B, and C into h^2 blocks Aij, Bij, and Cij of size h^{k−1},
then compute C = AB by the bilinear algorithm, with the scalars aij, bij, and
cij replaced by the corresponding matrix blocks. (The algorithm is applicable
to matrices since, by assumption, the underlying formulae do not depend on
commutativity.) To form the t products Pk of (n/h) × (n/h) matrices, partition
them into h^2 blocks of dimension n/h^2 and apply the algorithm recursively.
The total number of scalar multiplications required for the multiplication is
t^k = n^α, where α = log_h t.
Strassen’s method has h = 2 and t = 7. For 3 × 3 multiplication (h = 3),
the smallest t obtained so far is 23 [683, 1976]; since log_3 23 ≈ 2.854 > log_2 7,
this does not yield any improvement over Strassen’s method. The method

described in Pan’s 1978 paper has h = 70 and t = 143,640, which yields


α = log_70 143,640 = 2.795 . . . .
In the methods that achieve exponents lower than 2.775, various intricate
techniques are used. Laderman, Pan, and Sha [684, 1992] explain that for
these methods “very large overhead constants are hidden in the ’O’ nota-
tion’’, and that the methods “improve on Strassen’s (and even the classical)
algorithm only for immense numbers N“.
A further method that is appropriate to discuss in the context of fast
multiplication methods, even though it does not reduce the exponent, is a
method for efficient multiplication of complex matrices. The clever formula

(a + ib)(c + id) = ac - bd + i[(a + b)(c + d) - ac - bd] (22.8)

computes the product of two complex numbers using three real multiplications
instead of the usual four. Since the formula does not rely on commutativity
it extends to matrices. Let A = A1 + iA2 and B = B1 + iB2, where Aj, Bj ∈ R^{n×n},
and define C = C1 + iC2 = AB. Then C can be formed using three
real matrix multiplications as

T1 = A1 B1 , T2 = A2 B2 ,
C1 = T1 – T2, (22.9)
C2 = (A1 + A2)(B1 + B2) – T1 – T2,

which we will refer to as the "3M method". This computation involves 3n^3
scalar multiplications and 3n^3 + 2n^2 scalar additions. Straightforward evalua-
tion of the conventional formula C = A1B1 - A2B2 + i(A1B2 + A2B1) requires
4n^3 multiplications and 4n^3 - 2n^2 additions. Thus, the 3M method requires
strictly fewer arithmetic operations than the conventional means of multiply-
ing complex matrices for n > 3, and it achieves a saving of about 25% for
n > 30 (say). Similar savings occur in the important special case where A or
B is triangular. This kind of clear-cut computational saving is rare in matrix
computations!
IBM’s ESSL library and Cray’s UNICOS library both contain routines for
complex matrix multiplication that apply the 3M method and use Strassen’s
method to evaluate the resulting three real matrix products.
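For illustration, the 3M method (22.9) can be coded in a few lines of MATLAB.
The following sketch (the function name is ours) forms the three real products
with conventional multiplication, but Strassen's method could equally be used
for each product, as in the library routines just mentioned.

   function C = mult3m(A, B)
   %MULT3M   Complex matrix product C = A*B by the 3M method (22.9).
   A1 = real(A); A2 = imag(A);
   B1 = real(B); B2 = imag(B);
   T1 = A1*B1;                           % First real multiplication.
   T2 = A2*B2;                           % Second real multiplication.
   C1 = T1 - T2;                         % Real part, A1*B1 - A2*B2.
   C2 = (A1 + A2)*(B1 + B2) - T1 - T2;   % Imaginary part: third multiplication.
   C = C1 + sqrt(-1)*C2;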

22.2. Error Analysis


To be of practical use, a fast matrix multiplication method needs to be faster
than conventional multiplication for reasonable dimensions without sacrificing
numerical stability. The stability properties of a fast matrix multiplication
method are much harder to predict than its practical efficiency, and need
careful investigation.

The forward error bound (3.12) for conventional computation of C = AB,
where A, B ∈ R^{n x n}, can be written

   |C - Ĉ| ≤ γ_n |A| |B|,   Ĉ := fl(AB).                                  (22.10)

Miller [756, 1975] shows that any polynomial algorithm for multiplying n x n
matrices that satisfies a bound of the form (22.10) (perhaps with a different
constant) must perform at least n 3 multiplications. (A polynomial algorithm
is essentially one that uses only scalar addition, subtraction, and multiplica-
tion.) Hence Strassen’s method, and all other polynomial algorithms with an
exponent less than 3, cannot satisfy (22.10). Miller also states, without proof,
that any polynomial algorithm in which the multiplications are all of the form
(Σ_{i,j} α_ij a_ij)(Σ_{i,j} β_ij b_ij) must satisfy a bound of the form

   ||C - Ĉ|| ≤ f_n u ||A|| ||B|| + O(u^2).                                (22.11)

It follows that any algorithm based on recursive application of a bilinear non-


commutative algorithm satisfies (22.11); however, the all-important constant
f n is not specified. These general results are useful because they show us
what types of results we can and cannot prove and thereby help to focus our
efforts.
In the subsections below we analyse particular methods.
Throughout the rest of this chapter an unsubscripted matrix norm denotes
the max norm, ||A|| = max_{i,j} |a_ij|.
As noted in §6.2, this is not a consistent matrix norm, but we do have the
bound ||AB|| ≤ n||A|| ||B|| for n x n matrices.

22.2.1. Winograd’s Method


Winograd’s method does not satisfy the conditions required for the bound
(22.11), since it involves multiplications with operands of the form aij + brs.
However, it is straightforward to derive an error bound.

Theorem 22.1 (Brent). Let x, y ∈ R^n, where n is even. The inner prod-


uct computed by Winograd’s method satisfies

(22.12)

Proof. A straightforward adaptation of the inner product error analysis


in §3.1 produces the following analogue of (3.3):

where the α_i and β_i are all bounded in modulus by γ_{n/2+4}. Hence

The analogue of (22.12) for matrix multiplication bounds ||AB - fl(AB)||.

Conventional evaluation of x^T y yields the bound (see (3.5))

   |x^T y - fl(x^T y)| ≤ γ_n |x|^T |y|.                                    (22.13)

The bound (22.12) for Winograd's method exceeds the bound (22.13) by a fac-
tor that grows with the disparity between ||x||_∞ and ||y||_∞. Therefore Winograd's method
is stable if ||x||_∞ and ||y||_∞ have similar magnitude, but potentially unstable
if they differ widely in magnitude. The underlying reason for the instabil-
ity is that Winograd's method relies on cancellation of terms x_{2i-1}x_{2i} and
y_{2i-1}y_{2i} that can be much larger than the final answer x^T y;
therefore the intermediate rounding errors can swamp
the desired inner product.
A simple way to avoid the instability is to scale x ← µx and y ← µ^{-1}y
before applying Winograd's method, where µ, which in practice might be a
power of the machine base to avoid roundoff, is chosen so that ||µx||_∞ ≈ ||µ^{-1}y||_∞.
When using Winograd's method for a matrix multiplication AB it suffices to
carry out a single scaling A ← µA and B ← µ^{-1}B such that ||A|| ≈ ||B||. If
A and B are scaled so that τ^{-1} ≤ ||A||/||B|| ≤ τ, then the error bound exceeds
that for conventional multiplication by a factor that depends only on τ.
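As an illustration, a MATLAB sketch of Winograd's inner product formula
(the standard form of (22.2) is assumed) with this scaling might look as
follows; the function name and the particular way of choosing µ as a power
of 2 are ours.

   function s = winograd_ip(x, y)
   %WINOGRAD_IP   Inner product x'*y of real vectors by Winograd's formula,
   %              with the scaling x <- mu*x, y <- y/mu.  n is assumed even.
   n = length(x);
   mu = 2^round( 0.5*log2( norm(y,inf)/norm(x,inf) ) );  % Power of 2: exact scaling.
   x = mu*x; y = y/mu;                % norm(x,inf), norm(y,inf) now of similar size.
   xo = x(1:2:n); xe = x(2:2:n);      % Odd- and even-indexed components.
   yo = y(1:2:n); ye = y(2:2:n);
   s = sum( (xo + ye).*(xe + yo) ) - sum(xo.*xe) - sum(yo.*ye);

Note that (µx)^T(µ^{-1}y) = x^T y, so no unscaling of the result is needed.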

22.2.2. Strassen’s Method


Until recently there was a widespread belief that Strassen’s method is numer-
ically unstable. The next theorem, originally proved by Brent in 1970, shows
that this belief is unfounded.

Theorem 22.2 (Brent). Let A, B ∈ R^{n x n}, where n = 2^k. Suppose that
C = AB is computed by Strassen's method and that n0 = 2^r is the threshold
at which conventional multiplication is used. The computed product Ĉ satisfies

   ||C - Ĉ|| ≤ [ (n/n0)^{log2 12} (n0^2 + 5n0) - 5n ] u ||A|| ||B|| + O(u^2).   (22.14)

Proof. We will use without comment the norm inequality ||AB|| ≤
n||A|| ||B|| = 2^k ||A|| ||B||.
Assume that the computed product Ĉ = fl(AB) from Strassen's method
satisfies
   Ĉ = AB + E,   ||E|| ≤ c_k u ||A|| ||B|| + O(u^2),                        (22.15)
where c_k is a constant. In view of (22.10), (22.15) certainly holds for n = n0,
with c_r = n0^2. Our aim is to verify (22.15) inductively and at the same time
to derive a recurrence for the unknown constant ck.
Consider C11 in (22.4), and, in particular, its subterm P1. Accounting for
the errors in matrix addition and invoking (22.15), we obtain

   P̂1 = (A11 + A22 + ∆A)(B11 + B22 + ∆B) + E1,

where
   |∆A| ≤ u|A11 + A22|,
   |∆B| ≤ u|B11 + B22|,
   ||E1|| ≤ c_{k-1} u ||A11 + A22 + ∆A|| ||B11 + B22 + ∆B|| + O(u^2)
         ≤ 4c_{k-1} u ||A|| ||B|| + O(u^2).
Hence
   P̂1 = P1 + F1,   ||F1|| ≤ (8 · 2^{k-1} + 4c_{k-1}) u ||A|| ||B|| + O(u^2).
Similarly,
   P̂4 = A22(B21 - B11 + ∆B) + E4,
where
   |∆B| ≤ u|B21 - B11|,
   ||E4|| ≤ c_{k-1} u ||A22|| ||B21 - B11 + ∆B|| + O(u^2),
which gives

   P̂4 = P4 + F4,   ||F4|| ≤ (2 · 2^{k-1} + 2c_{k-1}) u ||A|| ||B|| + O(u^2).

Now

   Ĉ11 = fl(P̂1 + P̂4 - P̂5 + P̂7),

where P̂5 =: P5 + F5 and P̂7 =: P7 + F7 satisfy exactly the same error bounds
as P̂4 and P̂1, respectively. Assuming that these four matrices are added in
the order indicated, we have

   ||∆C11|| ≤ (46 · 2^{k-1} + 12c_{k-1}) u ||A|| ||B|| + O(u^2).

Clearly, the same bound holds for the other three ||∆Cij|| terms. Thus, overall,

   Ĉ = AB + ∆C,   ||∆C|| ≤ (46 · 2^{k-1} + 12c_{k-1}) u ||A|| ||B|| + O(u^2).


A comparison with (22.15) shows that we need to define the c_k by

   c_k = 12c_{k-1} + 46 · 2^{k-1},   k > r,   c_r = n0^2 = 4^r.             (22.16)

Solving this recurrence we obtain

   c_k ≤ (n/n0)^{log2 12} (n0^2 + 5n0) - 5n,

which gives (22.14).


The forward error bound for Strassen’s method is not of the componentwise
form (22.10) that holds for conventional multiplication, which we know it
cannot be by Miller’s result. One unfortunate consequence is that while the
scaling AB → (AD)(D^{-1}B), where D is diagonal, leaves (22.10) unchanged,
it can alter (22.14) by an arbitrary amount.
The reason for the scale dependence is that Strassen’s method adds to-
gether elements of A matrix-wide (and similarly for B); for example, in (22.4)
A11 is added to A22 , A12 , and A21 . This intermingling of elements is partic-
ularly undesirable when A or B has elements of widely differing magnitudes
because it allows large errors to contaminate small components of the product.
This phenomenon is well illustrated by the example

which is evaluated exactly in floating point arithmetic if we use conventional


multiplication. However, Strassen’s method computes

Because c22 involves subterms of order unity, the error c22 - ĉ22 will be of
order u. Thus the relative error |c22 - ĉ22|/|c22| can be much
larger than u if ε is small. This is an example where Strassen's method does
not satisfy the bound (22.10). For another example, consider the product
X = P_32 E, where P_n is the n x n Pascal matrix (see §26.4) and e_ij = 1/3.
With just one level of recursion in Strassen's method we find in MATLAB that
the largest relative error max_{i,j} |x_ij - x̂_ij|/|x_ij| is of order 10^-5,
so that, again, some elements of the
computed product have high relative error.
It is instructive to compare the bound (22.14) for Strassen’s method with
the weakened, normwise version of (22.10) for conventional multiplication:

   ||C - Ĉ|| ≤ n^2 u ||A|| ||B|| + O(u^2).                                  (22.17)

The bounds (22.14) and (22.17) differ only in the constant term. For Strassen's
method, the greater the depth of recursion the bigger the constant in (22.14):
if we use just one level of recursion (n0 = n/2) then the constant is 3n^2 +
25n, whereas with full recursion (n0 = 1) the constant is
6n^3.585 - 5n. It is also interesting to note that the bound for Strassen's method
(minimal for n0 = n) is not correlated with the operation count (minimal for
n0 = 8).
Our conclusion is that Strassen’s method has less favorable stability prop-
erties than conventional multiplication in two respects: it satisfies a weaker
error bound (normwise rather than componentwise) and it has a larger con-
stant in the bound (how much larger depending on n0).
Another interesting property of Strassen’s method is that it always involves
some genuine subtractions (assuming that all additions are of nonzero terms).
This is easily deduced from the formulae (22.4). This makes Strassen’s method
unattractive in applications where all the elements of A and B are nonnegative
(for example, in Markov processes). Here, conventional multiplication yields
low componentwise relative error because, in (22.10), |A||B| = |AB| = |C|,
yet comparable accuracy cannot be guaranteed for Strassen’s method.
An analogue of Theorem 22.2 holds for Winograd’s variant of Strassen’s
method.

Theorem 22.3. Let A, B ∈ R^{n x n}, where n = 2^k. Suppose that C = AB is
computed by Winograd's variant (22.6) of Strassen's method and that n0 = 2^r
is the threshold at which conventional multiplication is used. The computed
product Ĉ satisfies

   ||C - Ĉ|| ≤ [ (n/n0)^{log2 18} (n0^2 + 6n0) - 6n ] u ||A|| ||B|| + O(u^2).   (22.18)

Proof. The proof is analogous to that of Theorem 22.2, but more tedious.
It suffices to analyse the computation of C12, and the recurrence corresponding

to (22.16) is

   c_k = 18c_{k-1} + 89 · 2^{k-1},   k > r,   c_r = 4^r.

Note that the bound for the Winograd–Strassen method has exponent
log2 18 ≈ 4.170 in place of log2 12 ≈ 3.585 for Strassen's method, suggesting
that the price to be paid for a reduction in the number of additions is an
increased rate of error growth. All the comments above about Strassen’s
method apply to the Winograd variant.
Two further questions are suggested by the error analysis:
• How do the actual errors compare with the bounds?
• Which formulae are the more accurate in practice, Strassen's or Wino-
grad's variant?
To give some insight we quote results obtained with a single precision For-
tran 90 implementation of the two methods (the code is easy to write if we
exploit the language’s dynamic arrays and recursive procedures). We take
random n x n matrices A and B and look at ||AB – fl(AB)||/(||A|| ||B||) for
n0 = 1, 2, 4, . . . , 2^k = n (note that this is not the relative error, since the de-
nominator is ||A|| ||B|| instead of ||AB||, and note that n0 = n corresponds
to conventional multiplication). Figure 22.2 plots the results for one random
matrix of order 1024 from the uniform [0, 1] distribution and another matrix of
the same size from the uniform [– 1, 1] distribution. The error bound (22.14)
for Strassen’s method is also plotted. Two observations can be made.
• Winograd's variant can be more accurate than Strassen's formulae, for
all n0, despite its larger error bound.
• The error bound overestimates the actual error by a factor up to 1.8 x 10^6
for n = 1024, but the variation of the errors with n0 is roughly as
predicted by the bound.
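To make the structure of such an implementation concrete, here is a MATLAB
sketch (ours) of Strassen's method with recursion threshold n0; it assumes
that n and n0 are powers of 2 with n0 ≤ n, and it uses the standard Strassen
formulae for the seven products.

   function C = strassen(A, B, n0)
   %STRASSEN   Product C = A*B by Strassen's method, with conventional
   %           multiplication once the dimension is n0 or less.
   %           n = size(A,1) and n0 are assumed to be powers of 2.
   n = size(A,1);
   if n <= n0
      C = A*B;                         % Conventional multiplication.
   else
      m = n/2; i = 1:m; j = m+1:n;
      P1 = strassen(A(i,i)+A(j,j), B(i,i)+B(j,j), n0);
      P2 = strassen(A(j,i)+A(j,j), B(i,i),        n0);
      P3 = strassen(A(i,i),        B(i,j)-B(j,j), n0);
      P4 = strassen(A(j,j),        B(j,i)-B(i,i), n0);
      P5 = strassen(A(i,i)+A(i,j), B(j,j),        n0);
      P6 = strassen(A(j,i)-A(i,i), B(i,i)+B(i,j), n0);
      P7 = strassen(A(i,j)-A(j,j), B(j,i)+B(j,j), n0);
      C = [P1+P4-P5+P7, P3+P5; P2+P4, P1-P2+P3+P6];
   end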

22.2.3. Bilinear Noncommutative Algorithms


Bini and Lotti [97, 1980] have analysed the stability of bilinear noncommuta-
tive algorithms in general. They prove the following result.

Theorem 22.4 (Bini and Lotti). Let A, B ∈ R^{n x n} (n = h^k) and let the


product C = AB be formed by a recursive application of the bilinear noncom-
mutative algorithm (22.7), which multiplies h x h matrices using t nonscalar
multiplications. The computed product satisfies
(22.19)

Figure 22.2. Errors for Strassen’s method with two random matrices of dimension
n = 1024. Strassen’s formulae: “x”, Winograd’s variant: "o". X-axis contains log2
of recursion threshold n0, 1 < n0 < n. Dot-dash line is error bound for Strassen’s
formulae.

where α and β are constants that depend on the number of nonzero terms in
the matrices U, V and W that define the algorithm.

The precise definition of α and β is given in [97, 1980]. If we take k = 1,


so that h = n, and if the basic algorithm (22.7) is chosen to be conventional
multiplication, then it turns out that α = n – 1 and β = n, so the bound
of the theorem becomes (n – 1)nu||A|| ||B|| + O(u2), which is essentially the
same as (22.17). For Strassen’s method, h = 2 and t = 7, and α = 5, β = 12,
so the theorem produces a bound that
is a factor log2 n larger than (22.14) (with n0 = 1). This extra weakness of
the bound is not surprising given the generality of Theorem 22.4.
Bini and Lotti consider the set of all bilinear noncommutative algorithms
that form 2 x 2 products in 7 multiplications and that employ integer constants
of the form ±2^i, where i is an integer (this set breaks into 26 equivalence
classes). They show that Strassen’s method has the minimum exponent β
in its error bound in this class (namely, β = 12). In particular, Winograd's
variant of Strassen’s method has β = 18, so Bini and Lotti’s bound has the
same exponent log2 18 as in Theorem 22.3.

22.2.4. The 3M Method


A simple example reveals a fundamental weakness of the 3M method. Con-
sider the computation of the scalar

   y = Im z,   z = (θ + iθ^{-1})(θ + iθ^{-1}).

In floating point arithmetic, if y is computed in the usual way, as y = θ(1/θ) +
(1/θ)θ, then no cancellation occurs and the computed ŷ has high relative
accuracy. The 3M method computes

   ŷ = fl( (θ + 1/θ)(1/θ + θ) - θ·θ - (1/θ)(1/θ) ).

If |θ| is large this formula expresses a number of order 1 as the difference
of large numbers. The computed ŷ will almost certainly be contaminated
by rounding errors of order uθ^2, in which case the relative error |y - ŷ|/|y| is
large. However, if we measure the error in ŷ relative to z,
then it is acceptably small: |y - ŷ|/|z| = O(u). This example suggests that
the 3M method may be stable, but in a weaker sense than for conventional
multiplication.
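The phenomenon is easy to reproduce. With θ = 10^6, say, the following
MATLAB lines (ours) compare the two ways of forming the imaginary part.

   theta = 1e6;
   a = theta; b = 1/theta;          % z = (a + i*b)*(c + i*d), with c = a, d = b.
   c = theta; d = 1/theta;
   y_conv = a*d + b*c               % Conventional formula: 2, to full accuracy.
   y_3m = (a+b)*(c+d) - a*c - b*d   % 3M formula: error of order u*theta^2.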
To analyse the general case, consider the product C1 + iC2 = (A1 +
iA2)(B1 + iB2), where Ak, Bk, Ck ∈ R^{n x n}, k = 1:2. Using (22.10) we find
that the computed product Ĉ1 + iĈ2 from conventional multiplication satisfies

                                                                            (22.20)
                                                                            (22.21)

For the 3M method C1 is computed in the conventional way, and so (22.20)
holds. It is straightforward to show that Ĉ2 satisfies

(22.22)
Two notable features of the bound (22.22) are as follows. First, it is of
a different and weaker form than (22.21); in fact, it exceeds the sum of the
bounds (22.20) and (22.21). Second and more pleasing, it retains the property
of (22.20) and (22.21) of being invariant under diagonal scalings
   C = AB → D1AD2 · D2^{-1}BD3 = D1CD3,   Dj diagonal,

in the sense that the upper bound for ∆C2 in (22.22) scales also according to
∆C2 → D1∆C2D3. (The "hidden" second-order terms in (22.20)-(22.22) are invariant
under these diagonal scalings. )

The disparity between (22.21) and (22.22) is, in part, a consequence of


the differing numerical cancellation properties of the two methods. It is easy
to show that there are always subtractions of like-signed numbers in the 3M
method, whereas if A1, A2, B1, and B2 have nonnegative elements (for exam-
ple) then no numerical cancellation takes place in conventional multiplication.
We can define a measure of stability with respect to which the 3M method
matches conventional multiplication by taking norms in (22.21) and (22.22).
We obtain the weaker bounds

(22.23)
(22.24)

(having used || |A1| + |A2| || < ||A1 + iA2||). Combining these with an anal-
ogous weakening of (22.20), we find that for both conventional multiplication
and the 3M method the computed complex matrix satisfies

where cn = O(n).
The conclusion is that the 3M method produces a computed product
whose imaginary part may be contaminated by relative errors much larger
than those for conventional multiplication (or, equivalently, much larger than
can be accounted for by small componentwise perturbations in the data A
and B). However, if the errors are measured relative to ||A|| ||B||, which is a
natural quantity to use for comparison when employing matrix norms, then
they are just as small as for conventional multiplication.
It is straightforward to show that if the 3M method is implemented us-
ing Strassen’s method to form the real matrix products, then the computed
complex product satisfies the same bound (22.14) as for Strassen’s method
itself, but with an extra constant multiplier of 6 and with 4 added to the
expression inside the square brackets.

22.3. Notes and References


A good introduction to the construction of fast matrix multiplication methods
is provided by the papers of Pan [816, 1984] and Laderman, Pan, and Sha [684,
1992].
Harter [504, 1972] shows that Winograd’s formula (22.2) is the best of its
kind, in a sense made precise in [504, 1972].
How does one derive formulae such as those of Winograd and Strassen, or
that in the 3M method? Inspiration and ingenuity seem to be the key. A fairly
straightforward, but still not obvious, derivation of Strassen’s method is given
by Yuval [1124, 1978]. Gustafson and Aluru [491, 1996] develop algorithms

that systematically search for fast algorithms, taking advantage of a parallel


computer. In an exhaustive search taking 21 hours of computation time on
a 256 processor nCUBE 2, they were able to find 12 methods for multiplying
2 complex numbers in 3 multiplications and 5 additions; they could not find
a method with fewer additions, thus proving that such a method does not
exist. However, they estimate that a search for Strassen’s method on a 1024
processor nCUBE 2 would take many centuries, even using aggressive pruning
rules, so human ingenuity is not yet redundant!
To obtain a useful implementation of Strassen’s method a number of issues
have to be addressed, including how to program the recursion, how best to
handle rectangular matrices of arbitrary dimension (since the basic method
is defined only for square matrices of dimension a power of 2), and how to
cater for the extra storage required by the method. These issues are discussed
by Bailey [43, 1988], Bailey, Lee, and Simon [47, 1991], Fischer [374, 1974],
Higham [544, 1990], Kreczmar [673, 1976], and [934, 1976], among oth-
ers. Douglas, Heroux, Slishman, and Smith [317, 1994] present a portable
Fortran implementation of Winograd’s variant of Strassen’s method for real
and complex matrices, with a level-3 BLAS interface; they take care to use
a minimal amount of extra storage (about 2n 3/3 elements of extra storage is
required when multiplying n x n matrices).
Higham [544, 1990] shows how Strassen’s method can be used to produce
algorithms for all the level-3 BLAS operations that are asymptotically faster
than the conventional algorithms. Most of the standard algorithms in numer-
ical linear algebra remain stable (in an appropriately weakened sense) when
fast level-3 BLAS are used. See, for example, Chapter 12, §18.4, and Problems
11.4 and 13.2.
Knight [664, 1995] shows how to choose the recursion threshold to minimize
the operation count of Strassen’s method for rectangular matrices. He also
shows how to use Strassen’s method to compute the QR factorization of an
m x n matrix in O(mn^1.838) operations instead of the usual O(mn^2).
Bailey, Lee, and Simon [47, 1991] substituted their Strassen’s method code
for a conventionally coded BLAS3 subroutine SGEMM and tested LAPACK's
LU factorization subroutine SGETRF on a Cray Y-MP. They obtained speed
improvements for matrix dimensions 1024 and larger.
The Fortran 90 standard includes an intrinsic function MATMUL that returns
the product of its two matrix arguments. The standard does not specify which
method is to be used for the multiplication. An IBM compiler supports the use
of Winograd’s variant of Strassen’s method, via an optional third argument
to MATMUL (an extension to Fortran 90) [318, 1994].
Brent was the first to point out the possible instability of Winograd’s
method [143, 1970]. He presented a full error analysis (including Theo-
rem 22.1) and showed that scaling ensures stability.
An error analysis of Strassen’s method was given by Brent in 1970 in

an unpublished technical report that has not been widely cited [142, 1970].
Section 22.2.2 is based on Higham [544, 1990].
According to Knuth, the 3M formula was suggested by P. Ungar in 1963
[668, 1981, p. 647]. It is analogous to a formula of Karatsuba and Ofman [643,
1963] for squaring a 2n-digit number using three squarings of n-digit num-
bers. That three multiplications (or divisions) are necessary for evaluating
the product of two complex numbers was proved by Winograd [1106, 1971].
Section 22.2.4 is based on Higham [552, 1992].
The answer to the question “What method should we use to multiply
complex matrices?” depends on the desired accuracy and speed. In a Fortran
environment an important factor affecting the speed is the relative efficiency
of real and complex arithmetic, which depends on the compiler and the com-
puter (complex arithmetic is automatically converted by the compiler into
real arithmetic). For a discussion and some statistics see [552, 1992].
The efficiency of Winograd’s method is very machine dependent. Bjørstad,
Manne, Sørevik, and Vajteršič [122, 1992] found the method useful on the
MasPar MP-1 parallel machine, on which floating point multiplication takes
about three times as long as floating point addition at 64-bit precision. They
also implemented Strassen’s method on the MP-1 (using Winograd’s method
at the bottom level of recursion) and obtained significant speedups over con-
ventional multiplication for dimensions as small as 256.
As noted in §22.1, Strassen [962, 1969] gave not only a method for multi-
plying n x n matrices in O(n^{log2 7}) operations, but also a method for inverting
an n x n matrix with the same asymptotic cost. The method is described in
Problem 22.8. For more on Strassen's inversion method see §24.3.2, Bailey
and Ferguson [41, 1988], and Bane, Hansen, and Higham [51, 1993].

Problems
22.1. (Knight [664, 1995]) Suppose we have a method for multiplying n x n
matrices in O(n^α) operations, where 2 < α < 3. Show that if A is m x n and
B is n x p then the product AB can be formed in O(n1^{α-2} n2 n3) operations,
where n1 = min(m, n, p) and n2 and n3 are the other two dimensions.
22.2. Work out the operation count for Winograd’s method applied to n x n
matrices.
22.3. Let S n (n0 ) denote the operation count (additions plus multiplications)
for Strassen’s method applied to n x n matrices, with recursion down to the
level of n0 x n0 matrices. Assume that n and n0 are powers of 2. For large n,
estimate S_n(8)/S_n(n) and S_n(1)/S_n(8) and explain the significance of these
ratios (use (22.5)).
22.4. (Knight [664, 1995]) Suppose that Strassen’s method is used to multiply

an m x n matrix by an n x p matrix, where m = a2^j, n = b2^j, p = c2^j, and
that conventional multiplication is used once any dimension is 2^r or less. Show
that the operation count is α7^j + β4^j, where
that the operation count is α7 j + β4 j, where

Show that by setting r = 0 and a = 1 a special case of the result of Prob-


lem 22.1 is obtained.
22.5. Compare and contrast Winograd’s inner product formula for n = 2
with the imaginary part of the 3M formula (22.8).
22.6. Prove the error bound described at the end of 22.2.4 for the combina-
tion of the 3M method and Strassen’s method.
22.7. Two fast ways to multiply complex matrices are (a) to apply the 3M
method to the original matrices and to use Strassen’s method to form the
three real matrix products, and (b) to use Strassen’s method with the 3M
method applied at the bottom level of recursion. Investigate the merits of the
two approaches with respect to speed and storage.
22.8. Strassen [962, 1969] gives a method for inverting an n x n matrix in
O(n^{log2 7}) operations. Assume that n is even and write

   A = [ A11  A12 ]
       [ A21  A22 ],   A11 of dimension n/2.

The inversion method is based on the following formulae for C = A^{-1}:

   P1 = A11^{-1},        P2 = A21 P1,
   P3 = P1 A12,          P4 = A21 P3,
   P5 = P4 - A22,        P6 = P5^{-1},
   C12 = P3 P6,          C21 = P6 P2,
   C11 = P1 - P3 C21,    C22 = -P6.

The matrix multiplications are done by Strassen's method and the inversions
determining P1 and P6 are done by recursive invocations of the method itself.
(a) Verify these formulae, using a block LU factorization of A, and show that
they permit the claimed complexity. (b) Show that if A is upper
triangular, Strassen’s method is equivalent to (the unstable) Method 2B of
§13.2.2.
(For a numerical investigation into the stability of Strassen's inversion
method, see §24.3.2.)

22.9. Find the inverse of the block upper triangular matrix

   [ I  A  0 ]
   [ 0  I  B ]
   [ 0  0  I ].

Deduce that matrix multiplication can be reduced to matrix inversion.


22.10. (RESEARCH PROBLEM) Carry out extensive numerical experiments to
test the accuracy of Strassen’s method and Winograd’s variant (cf. the results
at the end of §22.2.2).

Chapter 23
The Fast Fourier Transform and
Applications

Once the [FFT] method was established


it became clear that it had a long and interesting prehistory
going back as far as Gauss.
But until the advent of computing machines
it was a solution looking for a problem.
— T. W. KÖRNER, Fourier Analysis (1988)

Life as we know it would be very different without the FFT.


— CHARLES F. VAN LOAN, Computational
Frameworks for the Fast Fourier Transform (1992)


23.1. The Fast Fourier Transform


The matrix-vector product y = F_n x, where

   (F_n)_{pq} = ω_n^{(p-1)(q-1)},   ω_n = exp(-2πi/n),

is the key computation in the numerical evaluation of Fourier transforms. If


the product is formed in the obvious way then O(n2) operations are required.
The fast Fourier transform (FFT) is a way to compute y in just O ( n log n)
operations. This represents a dramatic reduction in complexity.
The FFT is best understood (at least by a numerical analyst!) by inter-
preting it as the application of a clever factorization of the discrete Fourier
transform (DFT) matrix Fn .

Theorem 23.1 (Cooley-Tukey radix 2 factorization). If n = 2^t then the DFT


matrix Fn may be factorized as
   F_n = A_t . . . A_1 P_n,                                                 (23.1)
where P_n is a permutation matrix and each A_k is sparse, with just two
nonzeros per row.

Proof. See Van Loan [1044, 1992, Thm. 1.3.3].


The theorem shows that we can write y = Fnx as
   y = A_t . . . A_1 P_n x,
which is formed as a sequence of matrix-vector products. It is the sparsity of
the Ak (two nonzeros per row) that yields the O ( n log n) operation count.
We will not consider the implementation of the FFT, and therefore we
do not need to define the "bit reversing" permutation matrix P_n in (23.1).
However, the way in which the weights are computed does affect the accu-
racy. We will assume that computed weights are used that satisfy, for all
j and k,
(23.2)
Among the many methods for computing the weights are ones for which we
can take µ = cu, µ = cu log j, and µ = cuj, where c is a constant that depends
on the method; see Van Loan [1044, 1992, 1.4].
We are now ready to prove an error bound.

Theorem 23.2. Let ŷ = fl(F_n x), where n = 2^t, be computed using the
Cooley-Tukey radix 2 FFT, and assume that (23.2) holds. Then

Proof. Denote by Âk the matrix Ak defined in terms of the computed
weights. Then

using the fact that each Ak has only two nonzeros per row, and recalling that
we are using complex arithmetic. In view of (23.2),

Hence, overall,

Invoking Lemma 3.7 we find that

using Lemma 3.1 for the second inequality. Now


Hence
(23.3)

Finally, the result follows because F_n is n^{1/2} times a unitary matrix.

Theorem 23.2 says that the Cooley–Tukey radix 2 FFT yields a tiny for-
ward error, provided that the weights are computed stably. It follows immedi-
ately that the computation is backward stable, since ŷ = y + ∆y = F_n x + ∆y
implies ŷ = F_n(x + ∆x) with ||∆x||2/||x||2 = ||∆y||2/||y||2. If we form
y = F_n x by conventional multiplication using the exact F_n, then the forward
error is bounded by a quantity of order nu||y||2 (Problem 3.7). Hence when
µ is of order u, the FFT has an error bound smaller than that for conven-
tional multiplication by the same factor as the reduction in complexity of the
method. To sum up, the FFT is perfectly stable.

Figure 23.1. Error in FFT followed by inverse FFT (“o”). Dotted line is error
bound.

The inverse transform x = F_n^{-1} y = n^{-1} F_n^* y can again be computed
in O(n log n) operations using the Cooley-Tukey radix 2 factorization, and
the computed x̂ satisfies the same bound as in Theorem 23.2. (Strictly, we
should replace t by t + 1 in the bound to account for the rounding error in
dividing by n.)
For other variations of the FFT, based on different radices or different
factorization of Fn , results analogous to Theorem 23.2 hold.
A simple way to test the error bounds is to compute the FFT followed
by the inverse transform, and to evaluate e_n = ||x - x̂||2/||x||2, where x̂ is
the computed result. Our analysis gives the bound e_n ≤ 2n^{1/2} log2 n η + O(η^2). Fig-
ure 23.1 plots e_n and the approximate error bound n^{1/2} log2 n u for n = 2^k,
k = 0:16, with random x from the normal N(0,1) distribution (the FFTs
were computed using MATLAB’s fft and ifft functions). The errors grow at
roughly the same rate as the bound and are on average about a factor of 10
smaller than the bound.
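The experiment is easily repeated; the following MATLAB lines (ours) compute
e_n and the approximate bound for n = 2^k, k = 0:16.

   u = eps/2;                       % Unit roundoff for IEEE double precision.
   e = zeros(17,1); bnd = zeros(17,1);
   for k = 0:16
      n = 2^k;
      x = randn(n,1);               % Random x from the normal N(0,1) distribution.
      xhat = ifft(fft(x));          % FFT followed by the inverse FFT.
      e(k+1) = norm(x - xhat)/norm(x);
      bnd(k+1) = sqrt(n)*log2(n)*u; % Approximate error bound, as in Figure 23.1.
   end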

23.2. Circulant Linear Systems

A circulant matrix (or circulant, for short) is a special Toeplitz matrix in


which the diagonals wrap around:

Circulant matrices have the important property that they are diagonalized by
the DFT matrix Fn :
   F_n C F_n^{-1} = D = diag(d_i).
Moreover, the eigenvalues are given by d = F_n c, where c is the first column
of C. Hence a linear system Cx = b can be solved in O(n log n) operations
with the aid of the FFT as follows:
(1) d = F_n c,
(2) g = F_n b,
(3) h = D^{-1} g,
(4) x = F_n^{-1} h.
The computation involves two FFTs, a diagonal scaling, and an inverse FFT.
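In MATLAB each step is a one-liner. The sketch below (ours) assumes that c
is the first column of C and that C is nonsingular; for real c and b one would
take the real part of the final inverse transform to remove the tiny imaginary
parts introduced by roundoff.

   function x = circ_solve(c, b)
   %CIRC_SOLVE   Solve Cx = b, where C is the circulant matrix with first
   %             column c, by the FFT method in steps (1)-(4).
   d = fft(c);          % (1) Eigenvalues of C.
   g = fft(b);          % (2)
   h = g ./ d;          % (3) Diagonal system (C assumed nonsingular).
   x = ifft(h);         % (4)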
We now analyse the effect of rounding errors. It is convenient to write the
result of Theorem 23.2 in the equivalent form (from (23.3))

(23.4)

Steps (1) and (2) yield

   d̂ = (F_n + ∆1)c,   ||∆1||2 ≤ f(n, u),                                   (23.5)

   ĝ = (F_n + ∆2)b,   ||∆2||2 ≤ f(n, u).

Equation (23.5) implies that D̂ = D + ∆0, ||∆0||2 ≤ f(n, u)||C||2.
For steps (3) and (4) we have, using Lemma 3.5,

Putting these equations together we have


(23.6)
or, working to first order,

We obtain a backward stability result.



Theorem 23.3. Let C be a circulant and suppose the system Cx = b


is solved by the FFT process described above, where the FFT satisfies (23.4).
Then the computed solution x̂ satisfies (C + ∆C)x̂ = b, where

The conclusion is that the FFT method for solving a circulant system is
normwise backward stable provided that the FFT itself is computed stably.
Although we have shown that solves a slightly perturbed system, the
perturbed matrix C + ∆C is not a circulant. In fact, does not, in general,
solve a nearby circulant system, as can be verified experimentally by comput-
ing the "circulant backward error" using techniques from [527, 1992]. The
basic reason for the lack of this stronger form of stability is that there are too
few parameters in the matrix onto which to throw the backward error.
A forward error bound can be obtained by applying standard perturba-
tion theory to Theorem 23.3: the forward error is bounded by a multiple of
k2 (C)u. That the forward error can be as large as k2 (C)u is clear from the
analysis above, because (23.5) shows that each computed eigenvalue d̂_i
is contaminated by an error of order u||C||2.

23.3. Notes and References


For a unified treatment of the many different versions of the FFT, including
implementation details, see Van Loan [1044, 1992].
For a comprehensive survey of the discrete Fourier transform see Briggs
and Henson [148, 1995].
The Cooley–Tukey radix 2 FFT was presented in [240, 1965], which is one
of the most cited mathematics papers of all time [554, 1993, p. 171].
The history of the FFT is discussed by Cooley [238, 1990], [239, 1994]
and Cooley and Tukey [241, 1993]. Cooley [238, 1990] states that the earliest
known reference to the FFT is an obscure 1866 paper of Gauss in neoclassic
Latin, and he recommends that researchers not publish papers in neoclassic
Latin!
Theorem 23.2 is not new, but the proof is more concise than most in
the literature. Early references that derive error bounds using the matrix
factorization formulation of the FFT are Gentleman and Sande [434, 1966]
and Ramos [859, 1971]. A full list of references for error analysis of the FFT
is given by Van Loan [1044, 1992, 1.4].
Linzer [708, 1992] shows that the FFT-based circulant solver is forward
stable and poses the question of whether or not the solver is backward stable.
Our backward error analysis answers this question positively and therefore
also proves forward stability.

One application of circulant linear systems is in the preconditioned con-


jugate gradient method for solving Toeplitz systems. The idea of using a
circulant preconditioner was suggested by Strang [960, 1986], and the theory
and practice of this technique is now well developed. For more details see
Chan, Nagy, and Plemmons [191, 1994] and the references therein. A good
source of results about circulant matrices is the book by Davis [266, 1979].

Problems
23.1. (Bailey [44, 1993]) In high-precision multiplication we have two integer
n -vectors x and y representing high-precision numbers and we wish to form
the terms By padding the vectors with n ze-
ros, these products can be expressed in the form where
k + 1 – j is interpreted as k + 1 – j + n if k + 1 – j is negative. These prod-
ucts represent a convolution: a matrix–vector product involving a circulant
matrix. Analogously to the linear system solution in 23.2, this product can
be evaluated in terms of discrete Fourier transforms as z =
where the dot denotes componentwise multiplication of two vectors. Since x
and y are integer vectors, the zi should also be integers, but in practice they
will be contaminated by rounding errors. Obtain a bound on z - ẑ and deduce
a sufficient condition for the nearest integer vector to ẑ to be the exact z.

Chapter 24
Automatic Error Analysis

Given the pace of technology,


I propose we leave math to the machines and go play outside.
— CALVIN, Calvin and Hobbes by Bill Watterson (1992)

To analyse a given numerical algorithm we proceed as follows.


A number which measures the effect of roundoff error
is assigned to each set of data.
“Hill-climbing” procedures are then applied to search for
values large enough to signal instability.
— WEBB MILLER, Software for Roundoff Analysis (1975)


Automatic error analysis is any process in which we use the computer to help
us analyse the accuracy or stability of a numerical computation. The idea
of automatic error analysis goes back to the dawn of scientific computing.
For example, running error analysis, described in 3.3, is a form of automatic
error analysis; it was used extensively by Wilkinson on the ACE machine.
Various forms of automatic error analysis have been developed. In this chapter
we describe in detail the use of direct search optimization for investigating
questions about the stability and accuracy of algorithms. We also describe
interval analysis and survey other forms of automatic error analysis.

24.1. Exploiting Direct Search Optimization

Is Algorithm X numerically stable? How large can the growth factor be for
Gaussian elimination (GE) with pivoting strategy P? By how much can con-
dition estimator C underestimate the condition number of a matrix? These
types of questions are common, as we have seen in this book. Usually, we
attempt to answer such questions by a combination of theoretical analysis
and numerical experiments with random and nonrandom data. But a third
approach can be a valuable supplement to the first two: phrase the question
as an optimization problem and apply a direct search method.
A direct search method for the problem

   max{ f(x) : x ∈ R^n }                                                    (24.1)

is a numerical method that attempts to locate a maximizing point using func-


tion values only and does not use or approximate derivatives of f. Such meth-
ods are usually based on heuristics that do not involve assumptions about the
function f. Various direct search methods have been developed; for surveys
see Powell [838, 1970] and Swann [978, 1972], [979, 1974]. Most of these meth-
ods were developed in the 1960s, in the early years of numerical optimization.
For problems in which f is smooth, direct search methods have largely been
supplanted by more sophisticated optimization methods that use derivatives
(such as quasi-Newton methods and conjugate gradient methods), but they
continue to find use in applications where f is not differentiable, or even
not continuous. These applications range from chemical analysis [881, 1977],
where direct search methods have found considerable use, to the determina-
tion of drug doses in the treatment of cancer [93, 1991]; in both applications
the evaluation of f is affected by experimental errors. Lack of smoothness of
f, and the difficulty of obtaining derivatives when they exist, are characteristic
of the optimization problems we consider here.
The use of direct search can be illustrated with the example of the growth

factor for GE on A ∈ R^{n x n}:

   ρ_n(A) = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|,

where the a_ij^(k) are the intermediate elements generated during the elimination.
We know from §9.2 that the growth factor governs the stability of GE, so for
a given pivoting strategy we would like to know how big ρ n (A) can be.
To obtain an optimization problem of the form (24.1) we let x = vec(A)
and we define f(x) := ρ_n(A). Then we wish to determine max_x f(x).

Suppose, first, that no pivoting is done. Then f is defined and continuous at


all points where the elimination does not break down, and it is differentiable
except at points where there is a tie for the maximum in the numerator or
denominator of the expression defining ρ_n(A). We took n = 4 and applied
the direct search maximizer MDS (described in §24.2) to f(x), starting with
the identity matrix A = I4. After 11 iterations and 433 function evaluations,
the maximizer converged^18, having located a matrix^19 B

for which ρ4(B) = 1.23 x 10^5. (The large growth is a consequence of the
submatrix B(1:3, 1:3) being ill conditioned; B itself is well conditioned.) Thus
the optimizer readily shows that ρn(A) can be very large for GE without
pivoting.
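To make the setup concrete, the objective function for GE without pivoting
might be coded as follows (a sketch; the function name is ours, and no
attempt is made to handle a zero pivot). It can be passed to any of the
direct search maximizers.

   function f = growth_nopiv(x)
   %GROWTH_NOPIV   Growth factor rho_n(A) for GE without pivoting,
   %               where x = vec(A).
   n = round(sqrt(length(x)));
   A = reshape(x, n, n);
   maxA = max(max(abs(A)));                   % max |a_ij|.
   maxU = maxA;                               % Running max over reduced matrices.
   for k = 1:n-1
      i = k+1:n;
      A(i,k) = A(i,k)/A(k,k);                 % Multipliers (no pivoting).
      A(i,i) = A(i,i) - A(i,k)*A(k,i);        % Update the trailing submatrix.
      maxU = max(maxU, max(max(abs(A(i,i)))));
   end
   f = maxU/maxA;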
Next, consider GE with partial pivoting (GEPP). Because the elimination
cannot break down, f is now defined everywhere, but it is usually discontinu-
ous when there is a tie in the choice of pivot element, because then an arbitrar-
ily small change in A can alter the pivot sequence. We applied the maximizer
MDS to f, this time starting with the orthogonal matrix^20 A with
a_ij as in (9.11), for which ρ4(A) = 2.32.
After 29 iterations and 1169 function evaluations the maximizer converged to
a matrix B with ρ 4(B) = 5.86. We used this matrix to start the maximizer
^18 In the optimizations of this section we used the convergence tests described in §24.2
with tol = 10^-3. There is no guarantee that when convergence is achieved it is to a local
maximum; see §24.2.
^19 All numbers quoted are rounded to the number of significant figures shown.
^20 This matrix is orthog(n, 2) from the Test Matrix Toolbox; see Appendix E.

AD (described in §24.2); it took 5 iterations and 403 function evaluations to


converge to a matrix C

for which ρ4(C) = 7.939. This is one of the matrices described in Theorem 9.6,
modulo the convergence tolerance.
These examples, and others presented below, illustrate the following at-
tractions of using direct search methods to investigate the stability of a nu-
merical computation.
(1) The simplest possible formulation of optimization problem is often suf-
ficient to yield useful results. Derivatives are not needed, and direct search
methods tend to be insensitive to lack of smoothness in the objective func-
tion f. Unboundedness of f is a favorable property: direct search methods
usually quickly locate large values of f.
(2) Good progress can often be made from simple starting values, such as
an identity matrix. However, prior knowledge of the problem may provide
a good starting value that can be substantially improved (as in the partial
pivoting example).
(3) Usually it is the global maximum of f in (24.1) that is desired (although
it is often sufficient to know that f can exceed a specified value). When a
direct search method converges it will, in general, at best have located a
local maximum, and in practice the maximizer may simply have stagnated,
particularly if a slack convergence tolerance is used. However, further progress
can often be made by restarting the same (or a different) maximizer, as in the
partial pivoting example. This is because for methods that employ a simplex
(such as the MDS method), the behaviour of the method starting at x0 is
determined not just by x0, but also by the n + 1 vectors in the initial simplex
constructed at x0 .
(4) The numerical information revealed by direct search provides a starting
point for further theoretical analysis. For example, the GE experiments above
strongly suggest the (well known) results that ρ n(A) is unbounded without
pivoting and bounded by 2n–1 for partial pivoting, and inspection of the
numerical data suggests the methods of proof.
When applied to smooth problems the main disadvantages of direct search
methods are that they have at best a linear rate of convergence and they are
unable to determine the nature of the point at which they terminate (since
derivatives are not calculated). These disadvantages are less significant for
the problems we consider, where it is not necessary to locate a maximum
to high accuracy and objective functions are usually nonsmooth. (Note that
these disadvantages are not necessarily shared by methods that implicitly or

explicitly estimate derivatives using function values, such as methods based


on conjugate directions, for which see Powell [838, 1970], [839, 1975]; however,
these are not normally regarded as direct search methods.)
A final attraction of direct search is that it can be used to test the cor-
rectness of an implementation of a stable algorithm. The software in question
can be used in its original form and does not have to be translated into some
other representation.

24.2. Direct Search Methods


For several years I have been using MATLAB implementations of three direct
search methods. The first is the alternating directions (AD) method (also
known as the coordinate search method). Given a starting value x it attempts
to solve the problem (24.1) by repeatedly maximizing over each coordinate
direction in turn:

repeat
% One iteration comprises a loop over all components of x.
for i = l:n
find α such that f(x + αei ) is maximized (line search)
Set x ← x + αe_i
end
until converged

AD is one of the simplest of all optimization methods and the fundamental


weakness that it ignores any interactions between the variables is well known.
Despite the poor reputation of AD we have found that it can perform well
on the types of problems considered here. In our MATLAB implementation of
AD the line search is done using a crude scheme that begins by evaluating
f(x + he_i) with h = 10^-4 x_i (or h = 10^-4 max_j |x_j| if x_i = 0); if f(x +
he_i) < f(x) then the sign of h is reversed. Then if f(x + he_i) > f(x), h
is doubled at most 25 times until no further increase in f is obtained. Our
convergence test checks for a sufficient relative increase in f between one
iteration and the next: convergence is declared when

   f_k - f_{k-1} ≤ tol |f_{k-1}|,                                           (24.2)

where fk is the highest function value at the end of the k th iteration. The AD
method has the very modest storage requirement of just a single n-vector.
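A bare-bones version of this AD maximizer is easy to write; the following
MATLAB sketch is ours (the Test Matrix Toolbox implementation is more
refined), takes a function handle f, a starting vector x, and the tolerance tol
of (24.2), and it simplifies the treatment of zero components of x.

   function [x, fmax] = ad_max(f, x, tol)
   %AD_MAX   Simplified alternating directions maximizer (a sketch).
   fmax = f(x);
   done = 0;
   while ~done
      fold = fmax;
      for i = 1:length(x)
         h = 1e-4*x(i);
         if h == 0, h = 1e-4; end            % Simplified handling of x(i) = 0.
         e = zeros(size(x)); e(i) = 1;
         if f(x + h*e) < fmax, h = -h; end   % Try the other direction.
         for j = 1:25                        % Crude line search: keep doubling h.
            if f(x + 2*h*e) > f(x + h*e)
               h = 2*h;
            else
               break
            end
         end
         if f(x + h*e) > fmax
            x = x + h*e; fmax = f(x);        % Accept the step.
         end
      end
      done = (fmax - fold <= tol*abs(fold)); % Convergence test (24.2).
   end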
The second method is the multidirectional search method (MDS) of Dennis
and Torczon [1008, 1989], [1009, 1991], [301, 1991]. This method employs a
simplex, which is defined by n + 1 vectors v0, v1, . . . , vn. One iteration in the
case n = 2 is represented pictorially in Figure 24.1, and may be explained as
follows.

Figure 24.1. The possible steps in one iteration of the MDS method when n = 2.

The initial simplex is {v0, v1, v2} and it is assumed that f(v0) = max i f(vi ).
The purpose of an iteration is to produce a new simplex at one of whose ver-
tices f exceeds f(v0 ). In the first step the vertices v1 and v2 are reflected
about v0, along the lines joining them to v0, yielding r1 and r2 and the re-
flected simplex {v 0 ,r 1 ,r 2 }. If this reflection step is successful, that is, if
maxi f(ri ) > f(v0), then the edges from v0 to ri are doubled in length to
give an expanded simplex {v0, el, e2 }. The original simplex is then replaced
by {v0, el, e2 } if maxi f(ei ) > maxi f(ri ), and otherwise by {v0, r1, r2 }. If
the reflection step is unsuccessful then the edges v0 – vi of the original sim-
plex are shrunk to half their length to give the contracted simplex {v0, c1, c2}.
This becomes the new simplex if max i f(ci ) > maxi f(vi ), in which case the
current iteration is complete; otherwise the algorithm jumps back to the re-
flection step, now working with the contracted simplex. For further details
of the MDS method see Dennis and Torczon [301, 1991] and Torczon [1008,
1989], [1009, 1991].
The MDS method requires at least 2n independent function evaluations
per iteration, which makes it very suitable for parallel implementation. Gen-
eralizations of the MDS method that are even more suitable for parallel com-
putation are described in [301, 1991] and [1010, 1992]. The MDS method
requires O(n2) elements of storage for the simplices, but this can be reduced
to O(n) (at the cost of extra bookkeeping) if an appropriate choice of initial
simplex is made [301, 1991].
Unlike most direct search methods, the MDS method possesses some con-
vergence theory. Torczon [1009, 1991] shows that if the level set of f at the
starting point is compact and f is continuously differentiable on this level set
then a subsequence of the points v_0^(k) (where k denotes the iteration index) converges to a
stationary point of f. Moreover, she gives an extension of this result that re-
quires only continuity of f and guarantees convergence to either a stationary
point off or a point where f is not continuously differentiable.
Our implementation of the MDS method provides two possible starting
simplices, both of which include the starting point x0: a regular one (all sides
of equal length) and a right-angled one based on the coordinate axes, both as
described by Torczon in [1008, 1989]. The scaling is such that each edge of
the regular simplex, or each edge of the right-angled simplex that is joined to
x0, has a fixed length. Also as in [1008, 1989], the main termination
test halts the computation when the relative size of the simplex is no larger
than a tolerance tol, that is, when

(24.3)

Unless otherwise stated, we used tol = 10^-3 in (24.2) and (24.3) in all our
experiments.
The third method that we have used is the Nelder–Mead direct search
method [787, 1965], [303, 1987], which also employs a simplex but which
is fundamentally different from the MDS method. We omit a description
since the method is described in textbooks (see, for example, Gill, Murray,
and Wright [447, 1981, §4.2.2], or Press, Teukolsky, Vetterling, and Flannery
[842, 1992, §10.4]). Our limited experiments with the Nelder-Mead method
indicate that while it can sometimes out-perform the MDS method, the MDS
method is generally superior for our purposes. No convergence results of the
form described above for the MDS method are known for the Nelder-Mead
method.
For general results on the convergence of “pattern search” methods, see
Torczon [1011, 1993]. The AD and MDS methods are pattern search methods,
but the Nelder–Mead method is not.
It is interesting to note that the MDS method, the Nelder-Mead method,
and our particular implementation of the AD method do not exploit the nu-
merical values of f: their only use of f is to compare two function values to
see which is the larger!
Our MATLAB implementations of the AD, MDS, and Nelder-Mead direct
search methods are in the Test Matrix Toolbox, described in Appendix E.

24.3. Examples of Direct Search


In this section we give examples of the use of direct search to investigate the
behaviour of numerical algorithms.

24.3.1. Condition Estimation


We have experimented with MATLAB implementations of two matrix con-
dition number estimators. RCOND is the LINPACK estimator, described in
§14.4, as implemented in the built-in MATLAB function rcond. LACON is the
LAPACK condition estimator, as given in Algorithm 14.4 and implemented
in MATLAB’S condest. Both estimators compute a lower bound for k1(A)
by estimating ||A–1 ||1 (||A||1 is computed explicitly as the maximum column
sum).
To put the problem in the form of (24.1), we define x = vec(A), A ∈ R^{n x n},
and

   f(x) = k1(A)/est(A),

where est(A) ≤ k1(A) is the condition estimate. We note that, since the
algorithms underlying RCOND and LACON contain tests and branches, there
are matrices A for which an arbitrarily small change in A can completely
change the condition estimate; hence for both algorithms f has points of
discontinuity.
We applied the MDS maximizer to RCOND starting at the 5 x 5 Hilbert
matrix. After 67 iterations and 4026 function evaluations the maximizer had
located a matrix for which f(x) = 226.9. We then started the Nelder-
Mead method from this matrix; after another 4947 function evaluations it
had reached the matrix (shown to 5 significant figures)

for which

   k1(A) = 3.38 x 10^5,   est(A) = 1.65 x 10^1.

This example is interesting because the matrix is well scaled, while the pa-
rametrized counterexamples of Cline and Rew [217, 1983] all become badly
scaled when the parameter is chosen to make the condition estimate poor.
For LACON we took as starting matrix the 4 x 4 version of the n x n matrix
with a ij = cos((i – 1)(j – 1)π /( n – 1)) (this is a Chebyshev–Vandermonde
matrix, as used in §21.3.3, and is orthog(n, - 1) in the Test Matrix Tool-
box). After 11 iterations and 1001 function evaluations the AD maximizer
had determined a (well-scaled) matrix A for which

   k1(A) = 2.94 x 10^5,   est(A) = 4.81.



With relatively little effort on our part (most of the effort was spent ex-
perimenting with different starting matrices), the maximizers have discovered
examples where both condition estimators fail to achieve their objective of
producing an estimate correct to within an order of magnitude. The value of
direct search maximization in this context is clear: it can readily demonstrate
the fallibility of a condition estimator—a task that can be extremely diffi-
cult to accomplish using theoretical analysis or tests with random matrices.
Moreover, the numerical examples obtained from direct search may provide
a starting point for the construction of parametrized theoretical ones, or for
the improvement of a condition estimation algorithm.
In addition to measuring the quality of a single algorithm, direct search
can be used to compare two competing algorithms to investigate whether one
algorithm performs uniformly better than the other. We applied the MDS
maximizer to the function

   f(x) = estL(A)/estR(A),

where estL(A) and estR(A) are the condition estimates from LACON and
RCOND, respectively. If f(x) > 1 then LACON has produced a larger
lower bound for k1 (A) than RCOND. Starting with a random 5 x 5 ma-
trix the Nelder–Mead maximizer produced after 1788 function evaluations a
matrix A for which estL(A) = k1(A) and f(x) = 1675.4. With f defined as
f(x) = estR(A)/estL(A), and starting with I4, after 6065 function evalua-
tions the MDS maximizer produced a matrix for which f(x) = 120.8. This
experiment shows that neither estimator is uniformly superior to the other.
This conclusion would be onerous to reach by theoretical analysis of the algo-
rithms.

24.3.2. Fast Matrix Inversion


We recall Strassen's inversion method from Problem 22.8: for

   A = [ A11  A12 ]
       [ A21  A22 ],   A11 of dimension n/2,

it uses the formulae

   P1 = A11^{-1},        P2 = A21 P1,
   P3 = P1 A12,          P4 = A21 P3,
   P5 = P4 - A22,        P6 = P5^{-1},
   X12 = P3 P6,          X21 = P6 P2,
   X11 = P1 - P3 X21,    X22 = -P6,      X = A^{-1},

where each of the matrix products is formed using Strassen’s fast matrix mul-
tiplication method. Strassen’s inversion method is clearly unstable for general
A, because the method breaks down if A11 is singular. Indeed Strassen’s inver-
sion method has been implemented on a Cray-2 by Bailey and Ferguson [41,
1988] and tested for n < 2048, and these authors observe empirically that
the method has poor numerical stability. Direct search can be used to gain
insight into the numerical stability.
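A MATLAB sketch of the method with one level of recursion (ours) is given
below: P1 and P6 are formed by MATLAB's inv, that is, by GEPP, and the
six products here use ordinary matrix multiplication, whereas in the experiment
below the products are formed by Strassen's method.

   function X = strassen_inv(A)
   %STRASSEN_INV   Strassen's inversion formulae, one level of recursion.
   %               The dimension n is assumed even.
   n = size(A,1); m = n/2; i = 1:m; j = m+1:n;
   P1 = inv(A(i,i));
   P2 = A(j,i)*P1;
   P3 = P1*A(i,j);
   P4 = A(j,i)*P3;
   P5 = P4 - A(j,j);
   P6 = inv(P5);
   X12 = P3*P6;
   X21 = P6*P2;
   X11 = P1 - P3*X21;
   X22 = -P6;
   X = [X11, X12; X21, X22];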
With x = vec(A) define the stability measure

(24.4)

where X̂ is the inverse of A computed using Strassen's inversion method.


This definition of f is appropriate because, as shown in Chapter 13, for most
conventional matrix inversion methods either the left residual XA – I or the
right residual AX – I is guaranteed to have norm of order u||X|| ||A||. To treat
Strassen’s inversion method as favorably as possible we use just one level of
recursion; thus P1 and P6 are computed using GEPP but the multiplications
are done with Strassen’s method. We applied the MDS maximizer, with
tol = 10^-9 in (24.3), starting with the 4 x 4 Vandermonde matrix whose (i, j)
element is ((j - 1)/3)^{i-1}. After 34 iterations the maximizer had converged with
f = 0.838, which represents complete instability. The corresponding matrix
A is well conditioned, with k2(A) = 82.4. For comparison, the value of f
when A is inverted using Strassen’s method with conventional multiplication
is f = 6.90 x 10^-2; this confirms that the instability is not due to the use of
fast multiplication techniques—it is inherent in the inversion formulae.
If A is a symmetric positive definite matrix then its leading principal sub-
matrices are no more ill conditioned than the matrix itself, so we might expect
Strassen’s inversion method to be stable for such matrices. To investigate
this possibility we carried out the same maximization as before, except we
enforced positive definiteness as follows: when the maximizer generates a vec-
tor x = vec(B), A in (24.4) is defined as A = B^T B. Starting with a 4 x 4
random matrix A with k2(A) = 6.71 x 10^7, the maximization yielded the value
f = 3.32 x 10^-8 after 15 iterations, and the corresponding value of f when
conventional multiplication is used is f = 6.61 x 10^-11 (the "maximizing"
matrix A has condition number k2(A) = 3.58 x 10^9).
The conclusion from these experiments is that Strassen’s inversion method
cannot be guaranteed to produce a small left or right residual even when A
is symmetric positive definite and conventional multiplication is used. Hence
the method must be regarded as being fundamentally unstable.

24.3.3. Solving a Cubic


Explicit formulae can be obtained for the roots of a cubic equation using tech-
niques associated with the 16th century mathematicians del Ferro, Cardano,
Tartaglia, and Vieta [140, 1968], [332, 1990]. The following development is
based on Birkhoff and Mac Lane [102, 1977, §5.5].
Any nondegenerate cubic equation can be put in the form x^3 + ax^2 + bx +
c = 0 by dividing through by the leading coefficient. We will assume that
the coefficients are real. The change of variable x = y - a/3 eliminates the
quadratic term:

   y^3 + py + q = 0,   p = b - a^2/3,   q = c - ab/3 + 2a^3/27.

Then Vieta's substitution y = w - p/(3w) yields

   w^3 - p^3/(27w^3) + q = 0,

and hence a quadratic equation in w^3: (w^3)^2 + qw^3 - p^3/27 = 0. Hence

   w^3 = -q/2 ± (q^2/4 + p^3/27)^{1/2}.                                     (24.5)

For either choice of sign, the three cube roots for w yield the roots of the
original cubic, on transforming back from w to y to x.
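In MATLAB the formulae translate directly into a few lines. This sketch
(ours) uses the "+" square root in (24.5), which is the version examined first
below; no special handling is provided for the degenerate case p = q = 0.

   function z = cubic_roots(a, b, c)
   %CUBIC_ROOTS   Roots of x^3 + a*x^2 + b*x + c = 0 from the explicit
   %              formulae, taking the "+" square root in (24.5).
   p = b - a^2/3;
   q = c - a*b/3 + 2*a^3/27;
   w3 = -q/2 + sqrt(q^2/4 + p^3/27);            % A root w^3 of the quadratic (24.5).
   w = w3^(1/3) * exp(2*sqrt(-1)*pi*(0:2)/3);   % The three cube roots of w3.
   y = w - p./(3*w);                            % Invert Vieta's substitution.
   z = y - a/3;                                 % Invert x = y - a/3.

Using the sign choice (24.6) instead requires changing only the line that
defines w3.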
Are these formulae for solving a cubic numerically stable? Direct search
provides an easy way to investigate. The variables are the three coefficients
a, b, c and for the objective function we take an approximation to the relative
error of the computed roots ẑ. We compute the "exact" roots z us-
ing MATLAB's roots function (which uses the QR algorithm to compute the
eigenvalues of the companion matrix^21) and then our relative error measure
is the relative difference between ẑ and a permutation of z, minimized over
all six permutations Π.
First, we arbitrarily take the "+" square root in (24.5). With almost
any starting vector of coefficients, the MDS maximizer rapidly identifies co-
efficients for which the computed roots are very inaccurate. For example,
starting with [1, 1, 1]^T we are led to the vector

   [a  b  c]^T = [1.732   1   1.2704]^T,

for which the computed and "exact" roots are

^21 Edelman and Murakami [346, 1995] and Toh and Trefethen [1007, 1994] analyse the
stability of this method of finding polynomial roots; the method is stable.

When the cubic is evaluated at the computed roots the results are of order
10^-2, whereas they are of order 10^-15 for z. Since the roots are well separated
the problem of computing them is not ill conditioned, so we can be sure that
z is an accurate approximation to the exact roots. The conclusion is that the
formulae, as programmed, are numerically unstable.
Recalling the discussion for a quadratic equation in §1.8, a likely reason
for the observed instability is cancellation in (24.5). Instead of always taking
the “+” sign, we therefore take

w^3 = -( q + sign(q) sqrt(q^2 + 4p^3/27) ) / 2.                      (24.6)

When the argument of the square root is nonnegative, this formula suffers
no cancellation in the subtraction; the same is true when the argument of
the square root is negative, because then the square root is pure imaginary.
With the use of (24.6), we were unable to locate instability using the objective
function described above. However, an alternative objective function can be
derived as follows. It is reasonable to ask that the computed roots be the
exact roots of a slightly perturbed cubic. Thus each computed root should
be a root of

(1 + ε3)x^3 + a(1 + ε2)x^2 + b(1 + ε1)x + c(1 + ε0) = 0,

where each |εi| is of the order of the unit roundoff. Notice that we are allowing
the leading coefficient of unity to be perturbed. Denoting the unperturbed
cubic by f, we find that this condition implies that

max_i |f(ẑ_i)| / ( |ẑ_i|^3 + |a||ẑ_i|^2 + |b||ẑ_i| + |c| )           (24.7)

is of order u. We therefore take the quantity in (24.7) as the function to be
maximized. On applying the MDS maximizer with starting vector [1, 1, 1]^T,
we obtain after 10 iterations an objective function value of 1.2 x 10^-11. The
cubic coefficient vector is (to three significant figures)

[a b c]^T = [-5.89 x 10^2   3.15 x 10^2   -1.36 x 10^1],

and the computed roots are (ignoring tiny, spurious imaginary parts)

ẑ = [4.75 x 10^-2   4.87 x 10^-1   5.89 x 10^2].

The value of the objective function corresponding to the “exact” roots (com-
puted as described above) is of order 10^-16 (and the value of the previous
relative error objective function for the computed roots is of order 10^-14).
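In MATLAB the quantity (24.7) can be computed along the following lines (a
sketch only, using the form of (24.7) as written above; zhat denotes the vector
of computed roots):

% Largest relative backward error of the computed roots zhat of
% x^3 + a*x^2 + b*x + c, as in (24.7).
backerr = @(zhat, a, b, c) max( abs(polyval([1 a b c], zhat)) ./ ...
    (abs(zhat).^3 + abs(a)*abs(zhat).^2 + abs(b)*abs(zhat) + abs(c)) );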
The conclusion is that even using (24.6) the formulae for the roots are
numerically unstable. However, further theoretical analysis is needed to un-
derstand the stability fully; see Problem 24.3.

24.4. Interval Analysis


Interval analysis has been an area of extensive research since the 1960s, and it
had been considered earlier by Turing and Wilkinson in the 1940s [1099, 1980,
p. 104]. As the name suggests, the idea is to perform arithmetic on intervals
[a,b] (b > a). The aim is to provide as the solution to a problem an interval
in which the desired result is guaranteed to lie, which may be particularly
appropriate if the initial data is uncertain (and hence can be thought of as an
interval).
For the elementary operations, arithmetic on intervals is defined by

[a,b] op [c,d] = { x op y : x ∈ [a,b], y ∈ [c,d] },    op = +, -, *, /,

and the results are given directly by the formulae

[a,b] + [c,d] = [a + c, b + d],
[a,b] - [c,d] = [a - d, b - c],
[a,b] * [c,d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)],
[a,b] / [c,d] = [a,b] * [1/d, 1/c],    0 ∉ [c,d].

We will use the notation [x] for an interval [x1, x2] and we define width([x]) :=
x2 - x1.
In floating point arithmetic, an interval containing fl([x] op [y]) is ob-
tained by rounding computations on the left endpoint to -∞ and those on
the right endpoint to +∞ (both these rounding modes are supported in IEEE
arithmetic).
The success of an interval analysis depends on whether an answer is pro-
duced at all (an interval algorithm will break down with division by zero if
it attempts to divide by an interval containing zero), and on the width of
the interval answer. A one-line program that prints [-∞, ∞] would be cor-
rect, robust, and fast, but useless. Interval analysis is controversial because,
in general, narrow intervals cannot be guaranteed. One reason is that when
dependencies occur in a calculation, in the sense that a variable appears more
than once, final interval lengths can be pessimistic. For example, if [x] = [1,2]
then
[x] - [x] = [-1, 1],    [x]/[x] = [1/2, 2],
whereas the optimal intervals for these calculations are, respectively, [0, 0] and
[1, 1]. These calculations can be interpreted as saying that there is no additive
or multiplicative inverse in interval arithmetic.
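These effects are easily reproduced. The following MATLAB sketch implements
the formulae above with exact endpoint arithmetic (a real implementation would
round the endpoints outwards, as described above); intervals are stored as
[lo hi] row vectors:

iadd = @(x,y) [x(1)+y(1), x(2)+y(2)];
isub = @(x,y) [x(1)-y(2), x(2)-y(1)];
imul = @(x,y) [min([x(1)*y(1) x(1)*y(2) x(2)*y(1) x(2)*y(2)]), ...
               max([x(1)*y(1) x(1)*y(2) x(2)*y(1) x(2)*y(2)])];
idiv = @(x,y) imul(x, [1/y(2), 1/y(1)]);   % assumes 0 is not in y
x = [1 2];
isub(x, x)    % gives [-1 1], not the optimal [0 0]
idiv(x, x)    % gives [0.5 2], not the optimal [1 1]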
An example of an algorithm for which interval arithmetic can be ineffective
is GEPP, which in general gives very pessimistic error bounds and is unstable
in interval arithmetic even for a well-conditioned matrix [798, 1977]. The

basic problem is that the interval sizes grow exponentially. For example, in
the 2 x 2 reduction, if [y] ≈ [x] then

width([y] - [x]) ≈ width([x] - [x]) = 2 width([x]).
This type of growth is very likely to happen, unlike the superficially similar
phenomenon of element growth in standard GEPP. The poor interval bounds
are entirely analogous to the pessimistic results returned by early forward
error analyses of GEPP (see §9.6). Nickel [798, 1977] states that “The interval
Gauss elimination method is one of the most unfavorable cases of interval
computation . . . nearly all other numerical methods give much better results
if transformed to interval methods”. Interval GE is effective, however, for
certain special classes of matrices, such as M-matrices.
As already mentioned, there is a large body of literature on interval arith-
metic, though, as Skeel notes (in an article that advocates interval arith-
metic), “elaborate formalisms and abstruse notation make much of the lit-
erature impenetrable to all but the most determined outsiders” [922, 1989].
Good sources of information include various conference proceedings and the
journal Computing. The earliest book is by Moore [778, 1966], whose later
book [779, 1979] is one of the best references on the subject. Nickel [798,
1977] gives an easy to read summary of research up to the mid 1970s. A more
recent reference is Alefeld and Herzberger [10, 1983].
Yohe [1120, 1979] describes a Fortran 66 package for performing interval
arithmetic in which machine-specific features are confined to a few modules.
It is designed to work with a precompiler for Fortran called Augment [254,
1979], which allows the user to write programs as though Fortran had ex-
tended data types—in this case an INTERVAL type. A version of the package
that allows arbitrary precision interval arithmetic by incorporating Brent’s
multiple precision arithmetic package (see §25.9) is described in [1121, 1980].
In [1119, 1979], Yohe describes general principles of implementing nonstan-
dard arithmetic in software, with particular reference to interval arithmetic.
Kulisch and Miranker have proposed endowing a computer arithmetic with
a super-accurate inner product, that is, a facility to compute an exactly
rounded inner product for any two vectors of numbers at the working pre-
cision [678, 1981], [679, 1983], [680, 1986]. This idea has been implemented
in the package ACRITH from IBM, which employs interval arithmetic [124,
1985]. Anomalies in early versions of ACRITH are described by Kahan and
LeBlanc [639, 1985] and Jansen and Weidner [612, 1986]. For a readable
discussion of interval arithmetic and the super-accurate inner product, see
Rall [858, 1991]. Cogent arguments against adopting a super-accurate inner
product are given by Demmel [284, 1991].

Fortran and Pascal compilers (Fortran SC and Pascal-XSC) and a C++


class library that support a super-accurate inner product and interval arith-
metic have been developed jointly by researchers at IBM and the University of
Karlsruhe [125, 1987], [661, 1993], [662, 1992]. A toolbox of routines written
in Pascal-XSC for solving basic numerical problems and verifying the results
is presented by Hammer, Hocks, Kulisch, and Ratz [500, 1993].
Finally, we explain why any attempt to compute highly accurate answers
using interval arithmetic of a fixed precision (possibly combined with a super-
accurate inner product ) cannot always succeed. Suppose we have a sequence
of problems to solve, where the output of one is the input to the next: xi+1 =
fi(xi), i = 1:n, for smooth functions fi: R → R. Suppose x1 is known
exactly (interval width zero). Working in finite precision interval arithmetic,
we obtain an interval [x2 – α2, x2 + β 2] containing x2. This is used as input to
the computation of f 2 . Even under the favorable assumption of no rounding
errors in the evaluation of f2, we obtain an interval answer whose width must
be of order ε2 |f2'(x2)|, where ε2 = max(α2, β2) (the interval could be much
bigger than f2(x2) - f2(x2 ± ε2), depending on the algorithm used to evaluate
f2). In other words, the width of the interval containing x3 is roughly
proportional to the condition number of f2. When the output of the f2
computation is fed into f3 the interval width is multiplied by |f3'|. The
width of the final interval containing xn+1 is proportional to the product of
the condition numbers of all the functions fi, and if there are enough functions,
or if they are conditioned badly enough, the final interval will provide no
useful information. The only way to avoid such a failure is to use variable
precision arithmetic or to reformulate the problem to escape the product of
the condition numbers of the fi.

24.5. Other Work


In this section we outline other work on automatic error analysis.
In the 1970s, Miller and his co-authors developed methods and software for
automatically searching for numerical instability in algebraic processes [757,
1975], [760, 1978], [762, 1980]. In [757, 1975] Miller defines a quantity σ ( d )
that bounds, to first order, the sensitivity of an algorithm to perturbations in
the data d and in the intermediate quantities that the algorithm generates.
He then defines the forward stability measure ρ(d) = σ(d)/k(d), where k(d)
is a condition number for the problem under consideration. The algorithms
to be analysed are required to contain no loops or conditional branches and
are presented to Miller’s Fortran software in a special numeric encoding. The
software automatically computes the partial derivatives needed to evaluate

ρ(d), and attempts to maximize ρ using the method of alternating directions.


Miller gives several examples illustrating the scope of his software; he shows,
for example, that it can identify the instability of the classical Gram-Schmidt
method for orthogonalizing a set of vectors.
In [760, 1978], [761, 1978] Miller and Spooner extend the work in [757,
1975] in several ways. The algorithm to be analysed is expressed in a Fortran-
like language that allows for-loops but not logical tests. The definition of ρ is
generalized and a method of computing it is developed that involves solving
a generalized eigenvalue problem. The book by Miller and Wrathall [762,
1980] gives a thorough development of the work of [760, 1978], including
a description of the graph theory techniques used to compute the partial
derivatives, and it provides further examples of the use of the software. The
potential of Miller and Spooner’s software for exposing numerical instability is
clearly demonstrated by the case studies in these references, yet the software
has apparently not been widely used. This is probably largely due to the
inability of the software to analyse algorithms expressed in Fortran, or any
other standard language.
A different approach to algorithm analysis is taken by Larson and Sameh
[688, 1978], [689, 1980], and implemented in software by Larson, Pasternak,
and Wisniewski [687, 1983]. Here, errors are measured in a relative rather
than an absolute sense, and the stability is analysed at fixed data instead of
attempting to maximize instability over all data; however, the analysis is still
linearized.
The idea of applying automatic differentiation to a computational graph
to obtain estimates for the forward error in an algorithm is found not only in
the references cited above, but also in the automatic differentiation literature;
see Rall [857, 1981], Iri [606, 1991], and Kubota [675, 1991], for example.
Hull [589, 1979] discusses the possibility of applying program verification
techniques to computations that are subject to rounding errors, with the aim
of proving a program “correct”. Difficulties include deciding what “correct”
means and formulating appropriate assertions. Although he reports some
progress, Hull concludes that “there is not a great deal to be gained by trying
to apply the techniques of program verification to programs for numerical
calculations”.
Bliss, Brunet, and Gallopoulos [126, 1992] develop Fortran preprocessor
tools for implementing the local relative error approach of Larson and Sameh.
Their tools also implement a statistical technique of Brunet and Chatelin
[152, 1989], [202, 1990] in which the result of every floating point operation
is randomly perturbed and statistical methods are used to measure the effect
on the output of the algorithm.
Rowan [882, 1990] develops another way to search for numerical instabil-
ity. For an algorithm with data d he maximizes S(d) = e(d)/ k (d) using a
direct search maximizer he has developed called the subplex method (which

is based on the Nelder–Mead simplex method). Here, e(d) = ||yacc - ŷ|| is an
approximation to the forward error in the computed solution ŷ, where yacc is a
more accurate estimate of the true solution than ŷ, and the condition number
k(d) is estimated using finite difference approximations. The quantity S(d) is
a lower bound on the backward error of the algorithm at d. Fortran software
given in [882, 1990] implements this “functional stability analysis”. The soft-
ware takes as input two user-supplied Fortran subprograms; one implements
the algorithm to be tested in single precision, and the other provides a more
accurate solution, typically by executing the same algorithm in double preci-
sion. The examples in [882, 1990] show that Rowan’s software is capable of
detecting numerical instability in a wide variety of numerical algorithms.
A technique called “significance arithmetic” was studied by Ashenhurst,
Metropolis, and others in the 1960s. It involves performing arithmetic on
unnormalized numbers in such a way that the number of significant digits
in the answers provides estimates of the accuracy of those answers. Signif-
icance arithmetic is therefore a form of automatic error analysis. For more
details see, for example, Ashenhurst and Metropolis [31, 1965], [750, 1977]
and Sterbenz [938, 1974, §7.2]; Sterbenz explains several drawbacks of the
technique.
Stoutemyer [958, 1977] describes the use of the symbolic manipulation
package REDUCE to analyse the propagation of data and rounding errors in
numerical algorithms.
Finally, we note that running error analysis (see §3.3) is a form of au-
tomatic error analysis and is an attractive alternative to interval arithmetic
as a means of computing a posteriori error bounds for almost any numerical
algorithm.

24.6. Notes and References

Sections 24.1 – 24.3.2 and §24.5 are based on Higham [555, 1993].
The Release Notes for MATLAB 4.1 state that “This release of MATLAB
fixes a bug in the rcond function. Previously, rcond returned a larger than
expected estimate for some matrices . . . rcond now returns an estimate that
matches the value returned by the Fortran LINPACK library.” In the direct
search experiments in [555, 1993] we used MATLAB 3.5, and we found it much
easier to generate counterexamples to rcond than we do now with MATLAB
4.2. It seems that the maximizations in [555, 1993] were not only defeating
the algorithm underlying rcond, but also, unbeknown to us, exploiting a bug
in the implementation of the function. The conclusions of [555, 1993] are
unaffected, however.
Another way to solve a cubic is to use Newton’s method to find a real
zero, and then to find the other two zeros by solving a quadratic equation.

In a detailed investigation, Kahan [631, 1986] finds this iterative method to


be preferable to (sophisticated) use of the explicit formulae. Other useful
references on the numerical computation of roots of a cubic are Lanczos [686,
1956, Chap. 1], Press, Teukolsky, Vetterling, and Flannery [842, 1992, §5.6],
and Uspensky [1038, 1948].

Problems
24.1. Let A ∈ R^{m×n} and let Q̂1 and Q̂2 be the computed orthogonal
QR factors produced by the classical and modified Gram-Schmidt methods,
respectively. Use direct search to maximize the loss-of-orthogonality measures

fi(A) = ||I - Q̂i^T Q̂i||,    i = 1, 2.

In order to keep k2(A) small, try maximizing fi(A) - θ max(k2(A) - µ, 0),
where θ is a large positive constant and µ is an upper bound on the acceptable
condition numbers.
24.2. It is well known that if A ∈ R^{n×n} is nonsingular and v^T A^{-1} u ≠ -1
then

(A + uv^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u).

This is known as the Sherman-Morrison formula. For a history and general-


izations see Henderson and Searle [512, 1981]. A natural question is whether
this formula provides a stable way to solve a rank-1 perturbed linear system.
That is, is the following algorithm stable?

% Solve Cx := (A + uv^T)x = b.
Solve Ay = b for y.
Solve Az = u for z.
x = y - (v^T y)(1 + v^T z)^{-1} z

(a) Investigate the stability using direct search. Let both linear systems
with coefficient matrix A be solved by GEPP. Take A, u, and v as the data
and let the function to be maximized be the normwise backward error η_{C,b} in
the ∞-norm.
(b) (RESEARCH PROBLEM) Obtain backward and forward error bounds for
the method (for some existing analysis see Yip [1118, 1986]).
24.3. (RESEARCH PROBLEM) Investigate the stability of the formulae of §24.3.3
for computing the roots of a cubic.

Chapter 25
Software Issues in Floating Point
Arithmetic

The first American Venus probe was lost due to a


program fault caused by the
inadvertent substitution of a statement of the form
DO 3 I = 1.3 for one of the form DO 3 I = 1,3.
— JIM HORNING, Note on Program Reliability^22 (1979)

Numerical subroutines should deliver results that satisfy simple,


useful mathematical laws whenever possible.
— DONALD E. KNUTH, The Art of Computer
Programming, Volume 2, Seminumerical Algorithms (1981)

No method of solving a computational problem is


really available to a user until it is
completely described in an algebraic computing language
and made completely reliable.
Before that, there are indeterminate aspects in any algorithm.
— GEORGE E. FORSYTHE,
Today’s Computational Methods of Linear Algebra (1967)

The extended precision calculation of pi has substantial application as a


test of the “global integrity” of a supercomputer . . .
Such calculations . . . are apparently now used routinely
to check supercomputers before they leave the factory.
A large-scale calculation of pi is entirely unforgiving;
it soaks into all parts of the machine and a
single bit awry leaves detectable consequences.
— J. M. BORWEIN, P. B. BORWEIN, and D. H. BAILEY,
Ramanujan, Modular Equations, and Approximations to Pi
or How to Compute One Billion Digits of Pi (1989)
22
Quoted, with further details, in Tropp [1021, 1984].


In this chapter we discuss some miscellaneous aspects of floating point arith-


metic that have an impact on software development.

25.1. Exploiting IEEE Arithmetic


IEEE standard 754 and 854 arithmetic can be exploited in several ways
in numerical software, provided that proper access to the arithmetic’s fea-
tures is available in the programming environment. Unfortunately, although
most commercially significant floating point processors at least nearly conform
to the IEEE standards, language standards and compilers generally provide
poor support (exceptions include the Standard Apple Numerics Environment
(SANE) [233, 1988], Apple’s PowerPC numerics environment [234, 1994], and
Sun’s SPARCstation compilers [969, 1992], [970, 1992]). We give four exam-
ples to show the benefits of IEEE arithmetic on software.
Suppose we wish to evaluate the dual of the vector p-norm ||x||_p,
that is, the q-norm, where p^-1 + q^-1 = 1. In MATLAB notation we simply
write norm(x, 1/(1-1/p)), and the extreme cases p = 1 and p = ∞ correctly
yield q = 1/(1 - 1/p) = ∞ and 1 in IEEE arithmetic. (Note that the formula
q = p/(p - 1) would not work, because inf/inf evaluates to a NaN.)
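For example (a small MATLAB illustration):

x = [1 -2 3];
p = 1;    norm(x, 1/(1 - 1/p))    % the inf-norm: q = 1/(1-1/p) = inf
p = inf;  norm(x, 1/(1 - 1/p))    % the 1-norm:   q = 1
p = inf;  p/(p - 1)               % NaN: why q = p/(p-1) would fail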
The following code finds the biggest element in an array a(1:n), and ele-
gantly solves the problem of choosing a “sufficiently negative number” with
which to initialize the variable max:

max = -inf
for j = 1:n
if aj > max
max = aj
end
end

Any unknown or missing elements of the array a could be assigned NaNs


(a(j) = NaN) and the code would (presumably) still work, since a NaN com-
pares as unordered with everything.
Consider the following continued fraction [630, 1981]:

r(x) = 7 - 3/(x - 2 - 1/(x - 7 + 10/(x - 2 - 2/(x - 3)))),

which is plotted over the range [0,4] in Figure 25.1. Another way to write the

Figure 25.1. Rational function r.

rational function r(x) is as an explicit quotient of two (quartic) polynomials,
in which the numerator and denominator are written in
the form in which they would be evaluated by Horner's rule. Examining
these two representations of r, we see that the continued fraction requires
fewer arithmetic operations to evaluate (assuming it is evaluated in the obvious
“bottom up” fashion) but it incurs division by zero at the points x = 1:4,
even though r is well behaved at these points. However, in IEEE arithmetic r
evaluates correctly at these points because of the rules of infinity arithmetic.
For x = 10^77, the rational form suffers overflow, while the continued fraction
evaluates correctly to 7.0; indeed, in IEEE arithmetic the continued fraction is
immune to overflow. Figure 25.2 shows the relative errors made in evaluating
r in double precision on an equally spaced grid of points on the range [0,4]
(many of the errors for the continued fraction are zero, hence the gaps in
the error curve); clearly, the continued fraction produces the more accurate
function values. The conclusion is that in IEEE arithmetic the continued
fraction representation is the preferred one for evaluating r.
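For example, evaluating the continued fraction as reconstructed above in
MATLAB (IEEE double precision arithmetic):

r = @(x) 7 - 3./(x - 2 - 1./(x - 7 + 10./(x - 2 - 2./(x - 3))));
r([1 2 3 4])    % finite values: the divisions by zero produce infinities
                % that are absorbed by the rules of infinity arithmetic
r(1e77)         % evaluates to 7; no overflow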
The exception handling provided by IEEE arithmetic can be used to sim-
plify and speed up certain types of software, provided that the programming
environment properly supports these facilities. Recall that the exceptional

Figure 25.2. Error in evaluating rational function r. Solid line: continued fraction,
dotted line: usual rational form.

operations underflow, overflow, divide by zero, invalid operation, and inexact


deliver a result, set a flag, and continue. The flags (one for each type of excep-
tion) are “sticky”, remaining set until explicitly cleared, and implementations
of the IEEE standard are required to provide a means to read and write the
flags. Unfortunately, compiler writers have been slow to make exception han-
dling, and some of IEEE arithmetic’s other unusual features, available to the
programmer. Kahan has commented that “the fault lies partly with the IEEE
standard, which neglected to spell out standard names for the exceptions, their
flags, or even ∞ and NaN”.
As an example of how exception handling can be exploited, consider the
use of the LAPACK norm estimator (Algorithm 14.4) to estimate ||A^-1||_1.
The algorithm requires the solution of linear systems of the form Ax = b and
ATy = c, which is typically done with the aid of an LU factorization PA =
LU. Solution of the resulting triangular systems is prone to both overflow and
division by zero. In LAPACK, triangular systems are solved in this context not
with the level-2 BLAS routine xTRSV but with a routine xLATRS that contains
elaborate tests and scalings to avoid overflow and division by zero [18, 1991].
In an IEEE arithmetic environment a simpler and potentially faster approach
is possible: solve the triangular systems with xTRSV and after computing each
solution check the flags to see whether any exceptions occurred. Provided that
some tests are added to take into account overflow due solely to ||A||1 being

tiny, the occurrence of an exception allows the conclusion that A is singular


to working precision; see Demmel and Li [298, 1994] for the details.
For the condition estimator that uses exception handling to be uniformly
faster than the code with scaling it is necessary that arithmetic with NaNs,
infinities , and subnormal numbers be performed at the same speed as conven-
tional arithmetic. While this requirement is often satisfied, there are machines
for which arithmetic on NaNs, infinities, and subnormal numbers is between
one and three orders of magnitude slower than conventional arithmetic [298,
1994, Table 2]. Demmel and Li compared the LAPACK condition estima-
tion routines xxxCON with modified versions containing exception handling
and found the latter to be up to 6 times faster and up to 13 times slower,
depending on the machine and the matrix.
Appendices to the IEEE standards 754 and 854 recommend 10 auxiliary
functions and predicates to support portability of programs across different
IEEE systems. These include nextafter(x,y), which returns the next floating
point number to x in the direction of y, and scalb(x,n), which returns x × β^n
without explicitly computing β^n, where n is an integer and β the base. Not
all implementations of IEEE arithmetic support these functions. Portable C
versions of six of the functions are presented by Cody and Coonen [226, 1993].

25.2. Subtleties of Floating Point Arithmetic

The difference x - y of two machine numbers can evaluate to zero even when
x ≠ y, because of underflow. This makes a test such as

if x ≠ y
    f = f/(x - y)
end

unreliable. However, in a system that supports gradual underflow (such as


IEEE arithmetic) x - y always evaluates as nonzero when x ≠ y, as is easy
to verify. On several models of Cray computer (Cray 1, 2, X-MP, Y-MP,
and C90) this code could fail for another reason: they compute f/(x – y) as
f * (1/(x – y)) and 1/(x – y) could overflow.
It is a general principle that one should not test two floating point numbers
for equality, but rather use a test such as if |x – y| < tol (there are exceptions,
such as Algorithm 2 in §1.14.1). Of course, skill may be required to choose
an appropriate value for tol. A test of this form would correctly avoid the
division in the example above when underflow occurs.

Table 25.1. Results from Cholesky factorization.

Bits      u       Computer      Displacement
128     1e-29     Cray 2        .447440341
 64     1e-17     IBM 3090      .447440344
 64     1e-16     Convex 220    .447440339
 64     1e-16     IRIS          .447440339
 64     1e-15     Cray 2        .447440303
 64     1e-15     Cray Y-MP     .447436106

25.3. Cray Peculiarities

Carter [188, 1991] describes how a Cray computer at the NASA Ames Lab-
oratories produced puzzling results that were eventually traced to properties
of its floating point arithmetic. Carter used Cholesky factorization on a Cray
Y-MP to solve a sparse symmetric positive definite system of order 16146
resulting from a finite element model of the National Aerospace Plane. The
results obtained for several computers are shown in Table 25.1, where “Dis-
placement” denotes the largest component of the solution and incorrect digits
are set in italics and underlined. Since the coefficient matrix has a condition
number of order 1011, the errors in the displacements are consistent with the
error bounds for Cholesky factorization.
Given that both the Cray 2 and the Cray Y-MP lack guard digits, it is not
surprising that they give a less accurate answer than the other machines with
a similar unit roundoff. What is surprising is that, even though both machines
use 64-bit binary arithmetic with a 48-bit mantissa, the result from the Cray
Y-MP is much less accurate than that from the Cray 2. The explanation
(diagnosed over the telephone to Carter by Kahan, as he scratched on the back
of an envelope) lies in the fact that the Cray 2 implementation of subtraction
without a guard digit produces more nearly unbiased results (average error
zero), while the Cray Y-MP implementation produces biased results, causing
fl(x – y) to be too big if x > y > 0. Since the inner loop of the Cholesky
algorithm contains an operation of the form aii = aii - r_ki^2, the errors in
the diagonal elements reinforce as the factorization proceeds on the Cray
Y-MP, producing a Cholesky factor with a diagonal that is systematically
too large. Similarly, since Carter’s matrix has off-diagonal elements that are
mostly negative or zero, the Cray Y-MP produces a Cholesky factor with
off-diagonal elements that are systematically too large and negative. For the
large value of n used by Carter, these two effects in concert are large enough
to cause the loss of a further two digits in the answer.
An inexpensive way to improve the stability of the computed solution is to

use iterative refinement in fixed precision (see Chapter 11). This was tried by
Carter. He found that after one step of refinement the Cray Y-MP solution
was almost as accurate as that from the Cray 2 without refinement.

25.4. Compilers
Some of the pitfalls a compiler writer should avoid when implementing float-
ing point arithmetic are discussed by Farnum [364, 1988]. The theme of his
paper is that programmers must be able to predict the floating point opera-
tions that will be performed when their codes are compiled; this may require,
for example, that the compiler writer forgoes certain “optimizations”. In par-
ticular, compilers should not change groupings specified by parentheses. For
example, the two expressions
(1.0E-30 + 1.0E+30) - 1.0E+30
1.0E-30 + (1.0E+30 - 1.0E+30)
will produce different answers on many machines. Farnum explains that

Compiler texts and classes rarely address the peculiar problems


of floating-point computation, and research literature on the topic
is generally confined to journals read by numerical analysts, not
compiler writers. Many production-quality compilers that are ex-
cellent in other respects make basic mistakes in their compilation
of floating-point, resulting in programs that produce patently ab-
surd results, or worse, reasonable but inaccurate results.

25.5. Determining Properties of Floating Point Arithmetic


Clever algorithms have been devised that attempt to reveal the properties
and parameters of a floating point arithmetic. The first algorithms were pub-
lished by Malcolm [724, 1972] (see Problem 25.3). These have been improved
and extended by other authors. Kahan’s paranoia code carries out detailed
investigations of a computer’s floating point arithmetic; there are Basic, C,
Modula, Pascal, and Fortran versions, all available from netlib. In addition to
computing the arithmetic parameters, paranoia tries to determine how well
the arithmetic has been implemented (so it can be regarded as a program to
test a floating point arithmetic—see the next section). Karpinski [645, 1985]
gives an introduction to paranoia for the layman, but the best documentation
for the code is the output it produces.
Cody has a Fortran routine machar for determining 13 parameters asso-
ciated with a floating point arithmetic system [222, 1988] (an earlier version
was published in the book by Cody and Waite [228, 1980]). Cody notes some

strange behaviour of compilers and says that he was unable to make his code
run correctly on the Cyber 205. A routine based on machar is given in Nu-
merical Recipes [842, 1992, §20.1].
LAPACK contains a routine xLAMCH for determining machine parameters.
Because of the difficulty of handling the existing variety of machine arithmetics
it is over 850 lines long (including comments and the subprograms it calls).
Programs such as machar, paranoia, and xLAMCH are difficult to write; for
example, xLAMCH tries to determine the overflow threshold without invoking
overflow. The Fortran version of paranoia handles overflow by requiring the
user to restart the program, after which checkpointing information previously
written to a file is read to determine how to continue.
Fortran 90 contains environmental inquiry functions, which for REAL argu-
ments return the precision (PRECISION23), exponent range (RANGE), machine
epsilon (EPSILON), largest number (HUGE), and so on, corresponding to that
argument [749, 1990]. The values of these parameters are chosen by the im-
plementor to best fit a model of the machine arithmetic due to Brown [150,
1981] (see §25.7.4). Fortran 90 also contains functions for manipulating float-
ing point numbers: for example, to set or return the exponent or fractional
part (EXPONENT, FRACTION, SET_EXPONENT) and to return the spacing of the
numbers having the same exponent as the argument (SPACING).

25.6. Testing a Floating Point Arithmetic


How can we test whether a particular implementation of floating point arith-
metic is correct? It is impractical to test a floating point operation with all
possible arguments because there are so many of them—about 10^19 in IEEE
double precision arithmetic, for example! Special techniques are therefore
needed that test with a limited number of arguments.
A package FPV [784, 1986] from NAG Ltd. attempts to verify experimen-
tally that a floating point arithmetic has been correctly implemented accord-
ing to its specification. FPV must be supplied by the user with the arithmetic
parameters β, t, emin, emax, and with the rounding rule; it then attempts to
verify that the arithmetic conforms to these parameters by probing for errors.
The tests cover the basic arithmetic operations (including square root) but
not the elementary functions. FPV, which is available in both Fortran 77 and
Pascal versions, adopts an approach used in an earlier program FPTST [906,
1981], [907, 1984] by Schryer of AT&T Bell Laboratories: it tests arithmetic
operations with a limited set of operands that are regarded as being most
likely to reveal errors. This approach is based on the premise that errors are
most likely to occur as “edge effects”, induced by operands close to a discon-
tinuity or boundary of the floating point number system (such as a power of
23
Intrinsic function names are shown in parentheses.

the base β).


FPV and FPTST have both revealed implementation errors in floating
point arithmetics on commercial computers. Errors detected include multi-
plication and negation producing unnormalized results, x * y differing from
(–x) * (–y), and the product of two numbers greater than 1 being less than 1!
Wichmann [1078, 1992] suggests that it was probably revelations such as these
that led the UK Ministry of Defence to issue an interim standard prohibiting
the use of floating point arithmetic in safety-critical systems.
A Fortran package ELEFUNT by Cody and Waite contains programs to
test the elementary functions [221, 1982], [228, 1980]; the package is available
from netlib. It checks identities such as cos(x) = cos(x/3)(4 cos^2(x/3) - 3),
taking care to choose arguments x for which the errors in evaluating the
identities are negligible. A package CELEFUNT serves the same purpose
for the complex elementary functions [224, 1993]. Tang [988, 1990] develops
table-driven techniques for testing the functions exp and log.
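The flavour of such a test is easily illustrated (a sketch only, not the ELEFUNT
code, in MATLAB):

x = 1.234;                               % an arbitrary test argument
cos(x) - cos(x/3)*(4*cos(x/3)^2 - 3)     % should be of order eps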

25.7. Portability
Software is portable if it can be made to run on different systems with just
a few straightforward changes (ideally, we would like to have to make no
changes, but this level of portability is often impossible to achieve). Sometimes
the word “transportable” is used to describe software that requires certain
well-defined changes to run on different machines. For example, Demmel,
Dongarra, and Kahan [290, 1992] describe LAPACK as “a transportable way
to achieve high efficiency on diverse modern machines”, noting that to achieve
high efficiency the BLAS need to be optimized for each machine. A good
example of a portable collection of Fortran codes is LINPACK. It contains
no machine-dependent constants and uses the PFORT subset of Fortran 77;
it uses the level-1 BLAS, so, ideally, optimized BLAS would be used on each
machine.

25.7.1. Arithmetic Parameters


Differences between floating point arithmetics cause portability problems.
First, what is meant by REAL and DOUBLE PRECISION in Fortran varies greatly
between machines, as shown by the following table:

Second, for a given level of precision, u, the various arithmetic parameters


such as base, unit roundoff, largest and smallest machine numbers, and the

type of rounding, can all vary. Some possible ways for a code to acquire these
parameters are as follows.
(1) The parameters are evaluated and embedded in the program in PARAM-
ETER and DATA statements. This is conceptually the simplest approach, but
it is not portable.
(2) Functions are provided that return the machine parameters. Bell Lab-
oratories’ PORT library [405, 1978] has three functions:

R1MACH(k)    1 ≤ k ≤ 5     real parameters,
D1MACH(k)    1 ≤ k ≤ 5     double precision parameters,
I1MACH(k)    1 ≤ k ≤ 16    integer parameters.

R1MACH returns the underflow and overflow thresholds, the smallest and largest
relative spacing (β^-t and β^(1-t), respectively), and log10 β, where β is the base
and t the number of digits in the mantissa. I1MACH returns standard input,
output and error units and further floating point information, such as β, t,
and the minimum and maximum exponents emin and emax. These functions
do not carry out any computation; they contain DATA statements with the
parameters for most common machines in comment lines, and the relevant
statements have to be uncommented for a particular machine. This approach
is more sensible for a library than the previous one, because only these three
routines have to be edited, instead of every routine in the library.
The NAG Library takes a similar approach to PORT. Each of the 18
routines in its X02 chapter returns a different machine constant. For example,
X02AJF returns the unit roundoff and X02ALF returns the largest positive
floating point number. These values are determined once and for all when the
NAG library is implemented on a particular platform using a routine similar
to paranoia and machar, and they are hard coded into the Chapter X02
routines.
(3) The information is computed at run-time, using algorithms such as
those described in §25.5.
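Some systems effectively build approach (2) into the language itself; in MATLAB,
for example, the machine parameters for the working precision are available
directly (a small illustration):

eps          % relative spacing of the floating point numbers near 1
realmin      % smallest normalized positive number
realmax      % largest finite number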

25.7.2. 2 x 2 Problems in LAPACK

LAPACK contains 10 or so auxiliary routines for solving linear equation and


eigenvalue problems of dimension 2. For example, SLAS2 computes the singu-
lar values of a 2 x 2 triangular matrix, SLAEV2 computes the eigenvalues and
eigenvectors of a 2 x 2 symmetric matrix, and SLASY2 solves a Sylvester equa-
tion AX – XB = C where A and B have order 1 or 2. These routines need
to be reliable and robust because they are called repeatedly by higher-level
solvers. Because LAPACK makes minimal assumptions about the underlying
floating point arithmetic, the 2 x 2 routines are nontrivial to write and are
surprisingly long: the counts of executable statements are 39 for SLAS2, 61 for

SLAEV2, and 207 for SLASY2. If the availability of extended precision arith-
metic (possibly simulated using the working precision) or IEEE arithmetic
can be assumed, the codes can be simplified significantly. Complicated and
less efficient code for these 2 x 2 problem solvers is a price to be paid for
portability across a wide range of floating point arithmetics.

25.7.3. Numerical Constants


How should numerical constants be specified, e.g., for a Gauss quadrature rule
or a Runge–Kutta method? Some compilers limit the number of significant
digits that can be specified in a DATA statement or an assignment statement.
One possibility is to use rational approximations, but again the number of
digits required to cater for all precisions may be too large for some compilers.
Another scheme is to store a carefully designed table of leading and trailing
parts of constants. A constant such as π is best computed via a statement such
as pi = 4.0*atan(1.0), if you trust your language’s mathematics library.

25.7.4. Models of Floating Point Arithmetic


The model (2.4) of floating point arithmetic contains only one parameter, the
unit roundoff. Other more detailed models of floating point arithmetic have
been proposed, with the intention of promoting the development of portable
mathematical software. A program developed under a model and proved
correct for that model has the attraction that it necessarily performs correctly
on any machine for which the model is valid.
The first detailed model was proposed by van Wijngaarden [1046, 1966];
it contained 32 axioms and was unsuccessful because it was “mathematically
intimidating” and did not cover the CDC 6600, an important high-performance
computer of that time [635, 1991]. A more recent model is that of Brown [150,
1981]. Brown’s model contains four parameters: the base β, the precision t,
and the minimum and maximum exponents emin and emax together with a
number of axioms describing the behaviour of the floating point arithmetic.
Aberrant arithmetics are accommodated in the model by penalizing their
parameters (for example, reducing t by 1 from its actual machine value if
there is no guard digit). Brown builds a substantial theory on the model
and gives an example program to compute the 2-norm of a vector, which is
accompanied by a correctness proof for the model.
Brown’s model is intended to cater to diverse computer arithmetics. In
fact, “any behaviour permitted by the axioms of the model is actually exhib-
ited by at least one commercially important computer” [150, 1981, p. 457].
This broadness is a weakness. It means that there are important proper-
ties shared by many (but not all) machines that the model cannot reflect,
such as the exactness of fl(x – y) described in Theorem 2.4. As Kahan [630,

1981] notes, “Programming for the [IEEE] standard is like programming for
one of a family of well-known machines, whereas programming for a model is
like programming for a horde of obscure and ill-understood machines all at
once.” Although Brown’s model was used in the Ada language to provide a
detailed specification of floating point arithmetic, the model is still somewhat
controversial.
Wichmann [1077, 1989] gives a formal specification of floating point arith-
metic in the VDM specification language based on a modification of Brown’s
model.
The most recent model is the Language Independent Arithmetic (LIA-
1) [702, 1993]. The LIA-1 specifies floating point arithmetic far more tightly
than Brown’s model. It, too, is controversial; an explanation of flaws in an
earlier version (known then as the Language Compatible Arithmetic Standard)
was published by Kahan [635, 1991].
For a more detailed critique of models of floating point arithmetic see
Priest [844, 1992].

25.8. Avoiding Underflow and Overflow


A classic example showing how care is needed to avoid underflow and overflow
is the evaluation of a vector 2-norm, ||x||_2 = (sum_i x_i^2)^(1/2). For about half of
all machine numbers x, x^2 either underflows or overflows! Overflow is avoided
by the following algorithm:

t = max_i |x(i)|
s = 0
for i = 1:n
    s = s + (x(i)/t)^2
end
||x||_2 = t*sqrt(s)

The trouble with this algorithm is that it requires n divisions and two passes
over the data (the first to evaluate t), so it is slow. (It also involves
more rounding errors than the unscaled evaluation, which could be obviated
by scaling by a power of the machine base.) Blue [128, 1978] develops a one-
pass algorithm that avoids both overflow and underflow and requires between
0 and n divisions, depending on the data, and he gives an error analysis to
show that the algorithm produces an accurate answer. The idea behind the
algorithm is simple. The sum of squares is collected in three accumulators,
one each for small, medium, and large numbers. After the summation, various
logical tests are used to decide how to combine the accumulators to obtain
the final answer.
The original, portable implementation of the level-l BLAS routine xNRM2
(listed in [307, 1979]) was written by C. L. Lawson in 1978 and, according to

Lawson, Hanson, Kincaid, and Krogh [694, 1979], makes use of Blue’s ideas.
However, xNRM2 is not a direct coding of Blue’s algorithm and is extremely
difficult to understand—a classic piece of “Fortran spaghetti”! Nevertheless,
the routine clearly works and is reliable, because it has been very widely used
without any reported problems. Lawson’s version of xNRM2 has now been su-
perseded in the LAPACK distribution by a concise and elegant version by S.
J. Hammarling, which implements a one-pass algorithm; see Problem 25.5.
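The following MATLAB sketch illustrates the idea of such a one-pass scaled
algorithm; it is in the spirit of, but is not, the LAPACK code:

function nrm = norm2s(x)
% One-pass scaled 2-norm: scale holds the largest |x(i)| seen so far and
% ssq the scaled sum of squares, so that scale^2*ssq = sum of squares.
scale = 0; ssq = 1;
for i = 1:length(x)
    if x(i) ~= 0
        absxi = abs(x(i));
        if scale < absxi
            ssq = 1 + ssq*(scale/absxi)^2;   % rescale the running sum
            scale = absxi;
        else
            ssq = ssq + (absxi/scale)^2;
        end
    end
end
nrm = scale*sqrt(ssq);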
A special case of the vector 2-norm is the Pythagorean sum (a^2 + b^2)^(1/2),
which occurs in many numerical computations. One way to compute it
is by a beautiful cubically convergent, square-root-free iteration devised by
Moler and Morrison [773, 1983]; see Problem 25.6. LAPACK has a routine
xLAPY2 that computes (x^2 + y^2)^(1/2) and another routine xLAPY3 that computes
(x^2 + y^2 + z^2)^(1/2); both routines avoid overflow by using the algorithm listed at
the start of this section with n = 2 or 3.
Pythagorean sums arise in computing the 1-norm of a complex vector:

||x||_1 = sum_i ( (Re x_i)^2 + (Im x_i)^2 )^(1/2).

The level-1 BLAS routine xCASUM does not compute the 1-norm, but the more
easily computed quantity

sum_i ( |Re x_i| + |Im x_i| ).

The reason given by the BLAS developers is that it was assumed that users
would expect xCASUM to compute a less expensive measure of size than the 2-
norm [694, 1979]. This reasoning is sound, but many users have been confused
not to receive the true 1-norm. See Problem 6.16 for more on this pseudo 1-
norm.
Another example where underflow and overflow can create problems is in
complex division. If we use the formula

(a + ib)/(c + id) = (ac + bd)/(c^2 + d^2) + i (bc - ad)/(c^2 + d^2),

then overflow will occur whenever c or d exceeds the square root of the overflow
threshold, even if the quotient (a + ib)/(c + id) does not overflow. Certain Cray
and NEC machines implement complex division in this way [290, 1992]; on
these machines the exponent range is effectively halved from the point of view
of the writer of robust software. Smith [929, 1962] suggests a simple way to
avoid overflow: if |c| ≥ |d| use the following formula, obtained by multiplying
the numerators and denominators by c^-1,

(a + ib)/(c + id) = (a + b(d/c))/(c + d(d/c)) + i (b - a(d/c))/(c + d(d/c)),   (25.1)

and if |d| > |c| use the analogous formula involving d^-1. Stewart [947, 1985]
points out that underflow is still possible in these formulae, and suggests a
more complicated algorithm that avoids both underflow and overflow.
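A MATLAB sketch of Smith's method, combining (25.1) with the analogous
formula for |d| > |c| (overflow-free but, as just noted, not underflow-free):

function [x, y] = cdiv(a, b, c, d)
% (a + ib)/(c + id) = x + iy by Smith's formulae.
if abs(c) >= abs(d)
    r = d/c;  den = c + d*r;
    x = (a + b*r)/den;  y = (b - a*r)/den;
else
    r = c/d;  den = d + c*r;
    x = (a*r + b)/den;  y = (b*r - a)/den;
end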
Demmel [280, 1984] discusses in detail the effects of underflow on numerical
software and analyses different ways of implementing underflows, the main two
of which are flush to zero and gradual underflow (as used in IEEE standard
arithmetic). Cody [222, 1988] makes the following observations:

The variety in the treatment of underflow is wondrous. There ex-


ist machines in which the exponent wraps around and underflow
results in a large floating-point number, perhaps with an accompa-
nying sign change. There are also machines on which underflow is
replaced by the smallest in magnitude nonzero number, machines
on which small positive numbers vanish when added to themselves,
and machines on which such small numbers vanish when multi-
plied by 1.0. Finally, on IEEE-style machines, underflow may be
graceful.

25.9. Multiple Precision Arithmetic


If we wish to work at a precision beyond that provided by a Fortran compiler’s
DOUBLE PRECISION (or the quadruple precision supported as an extension by
some compilers) there are two basic approaches. The first is to implement
higher precision arithmetic using arrays of double precision variables, where
the unevaluated sum of the elements of an array represents a high precision
variable. Arithmetic can be implemented in this system using techniques
such as those of Dekker [275, 1971] (cf. compensated summation, described in
§4.3), as refined by Linnainmaa [706, 1981] and Priest [843, 1991], [844, 1992].
These techniques require a “well-behaved” floating point arithmetic, and are
not easy to apply.
The second approach is to use arrays to represent numbers in “standard”
floating point form with a large base and a long mantissa spread across the
elements of an array. All the software described in this section is of the second
type.
The first major piece of multiple precision Fortran software was Brent’s
package [145, 1978], [146, 1978]. This is a portable collection of Fortran 66
subroutines with wide capabilities, including the evaluation of special func-
tions. Brent’s package represents multiple precision numbers as arrays of
integers and operates on them with integer arithmetic. An interface between
Brent’s package and the Fortran precompiler Augment [254, 1979] is described
by Brent, Hooper, and Yohe [147, 1980]; it makes Brent’s package much easier
to use.

Recently, Bailey has written a multiple precision arithmetic package MP-


FUN [44, 1993], which consists of about 10,000 lines of Fortran 77 code in 87
subprograms. In MPFUN, a multiprecision (MP) number is a vector of single
precision floating point numbers; it represents a number in base 2^24 (for IEEE
arithmetic). Complex multiprecision numbers are also supported.
MPFUN routines are available to perform the basic arithmetic operations,
evaluate nth roots and transcendental functions, compare MP numbers, pro-
duce a random MP number, solve polynomial equations, and perform other
operations. For many of these routines both simple and advanced algorithms
are provided; the advanced algorithms are intended for computations at pre-
cision levels above 1000 digits. One advanced algorithm is a fast Fourier
transform technique for evaluating the convolution involved in a multiplica-
tion. Another is used for division: x/y is evaluated as x * (1/y ), where 1/y is
evaluated by Newton’s method (see Problem 2.26), which involves only mul-
tiplications. An interesting aspect of Newton’s method in this context is that
it can be implemented with a precision level that doubles at each stage, since
the iteration damps out any errors. Another interesting feature of a variable
precision environment is that numerical instability can be tackled simply by
increasing the precision. As an example, MPFUN does complex multiplication
using the 3M method

which uses three real multiplications instead of the usual four. As we saw in
§22.2.4, the 3M method produces a computed imaginary part that can have
large relative error, but this instability is not a serious obstacle to its use in
MPFUN.
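In MATLAB the 3M method can be sketched as follows (working precision
rather than multiprecision, but the arithmetic structure is the same):

function [x, y] = cmul3m(a, b, c, d)
% (a + ib)(c + id) = x + iy using three real multiplications.
t1 = a*c;
t2 = b*d;
t3 = (a + b)*(c + d);
x = t1 - t2;         % real part: ac - bd
y = t3 - t1 - t2;    % imaginary part: ad + bc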
Bailey provides a translator that takes Fortran 77 source containing direc-
tives in comments that specify the precision level and which variables are to
be treated as multiprecision, and produces a Fortran 77 program containing
the appropriate multiprecision subroutine calls. He also provides a Fortran 90
version of the package that employs derived data types and operator exten-
sions [45, 1994]. This Fortran 90 version is a powerful and easy to use tool
for doing high precision numerical computations.
Bailey’s own use of the packages includes high precision computation of
constants such as π [42, 1988] and Euler’s constant γ. A practical application
to a fluid flow problem is described by Bailey, Krasny, and Pelz [46, 1993].
They found that converting an existing code to use MPFUN routines increased
execution times by a factor of about 400 at 56 decimal digit precision, and
they comment that this magnitude of increase is fairly typical for multiple
precision computation.
Another recent Fortran 77 program for multiple precision arithmetic is the
FM package of Smith [926, 1991]. It is functionally quite similar to Bailey’s
MPFUN. No translator is supplied for converting from standard Fortran 77

source code to code that invokes FM, but Smith notes that the general purpose
precompiler Augment [254, 1979] can be used for this purpose.
Fortran routines for high precision computation can also be found in Nu-
merical Recipes [842, 1992, §20.6], and high precision numerical computation
is supported by many symbolic manipulation packages, including Maple [199,
1991] and Mathematica [1109, 1991].
A Pascal-like programming language called Turing [579, 1988] developed
at the University of Toronto in the 1980s is the basis of an extension called
Numerical Turing, developed by Hull and his co-workers [591, 1985]. Numer-
ical Turing includes decimal floating point arithmetic whose precision can be
dynamically varied throughout the course of a program, a feature argued for
in [590, 1982] and implemented in hardware in [231, 1983].
An extension to the level-2 BLAS (see Appendix D) is proposed in [313,
1988, App. B], comprising routines having the same specifications as those in
the standard BLAS but which calculate in extended precision.

25.10. Patriot Missile Software Problem


A report from the United States General Accounting Office begins “On Febru-
ary 25, 1991, a Patriot missile defense system operating at Dhahran, Saudi
Arabia, during Operation Desert Storm failed to track and intercept an incom-
ing Scud. This Scud subsequently hit an Army barracks, killing 28 Americans”
[1036, 1992]. The report finds that the failure to track the Scud missile was
caused by a precision problem in the software.
The computer used to control the Patriot missile is based on a 1970s design
and uses 24-bit arithmetic. The Patriot system tracks its targets by measuring
the time it takes for radar pulses to bounce back from them. Time is recorded
by the system clock in tenths of a second, but is stored as an integer. To enable
tracking calculations the time is converted to a 24 bit floating point number.
Rounding errors in the time conversions cause shifts in the system’s “range
gate”, which is used to track the target.
On February 11, 1991 the Patriot Project Office received field data identi-
fying a 20% shift in the Patriot system’s range gate after the system had been
running continuously for 8 hours. This data implied that after 20 consecutive
hours of use the system would fail to track and intercept a Scud. Modified
software that compensated for the inaccurate time calculations was released
on February 16 by army officials. On February 25, Alpha Battery, which was
protecting the Dhahran Air Base, had been in continuous operation for over
100 hours. The inaccurate time calculations caused the range gate to shift so
much that the system could not track the incoming Scud. On February 26,
the next day, the modified software arrived in Dhahran. Table 25.2, taken
from [1036, 1992], shows clearly how, with increasing time of operation, the

Table 25.2. Effect of extended run time on Patriot missile operation.

Hours    Seconds    Calculated time    Inaccuracy    Approximate shift in
                    (seconds)          (seconds)     range gate (meters)
  0            0            0             0                 0
  1         3600         3599.9966       .0034              7
  8        28800        28799.9725       .0275             55
 20^a      72000        71999.9313       .0687            137
 48       172800       172799.8352       .1648            330
 72       259200       259199.7528       .2472            494
100^b     360000       359999.6567^c     .3433            687

^a For continuous operation exceeding 20 hours target is outside range gate.
^b Alpha battery ran continuously for about 100 hours.
^c Corrected value.

Patriot lost track of its target. Note that the numbers in Table 25.2 are con-
sistent with a relative error of 2^-20 in the computer’s representation of 0.1,
this constant being used to convert from the system clock’s tenths of a second
to seconds (2^-20 is the relative error introduced by chopping 0.1 to 23 bits
after the binary point).
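This consistency is easily checked; assuming the error in the stored value of
0.1 accumulates linearly with the number of tenth-of-a-second clock ticks, a
short MATLAB calculation reproduces the inaccuracy column of Table 25.2:

err_per_tick = 0.1*2^(-20);      % absolute error in the stored 0.1 (seconds)
hours = [1 8 20 48 72 100];
ticks = hours*3600*10;           % number of tenth-of-a-second ticks
err = err_per_tick*ticks         % approx .0034 .0275 .0687 .1648 .2472 .3433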

25.11. Notes and References


The issues discussed in this chapter are rarely treated in textbooks. One book
that is worth consulting is Miller’s Engineering of Numerical Software [759,
1984], which is concerned with the designing of reliable numerical software
and presents practical examples.
Kahan’s paper [627, 1972] is not, as the title suggests, a survey of error
analysis, but treats in detail some specific problems, including solution of a
quadratic equation and summation, giving error analysis, programs illustrat-
ing foibles of Fortran compilers, and ingenious workarounds.
The collection of papers Improving Floating-Point Programming [1061,
1990] contains, in addition to an introduction to floating point arithmetic and
Brown’s model, descriptions of recent work on embedding interval arithmetic
with a super-accurate inner product in Pascal and Ada (see §24.4).
There are several algorithms for evaluating a continued fraction; see, for
example, Blanch [123, 1964] and Press, Teukolsky, Vetterling, and Flannery
[842, 1992, §5.2]. Rounding error analysis for the “bottom up” algorithm is
given by Jones and Thron [617, 1974].
A standard set of names and definitions of machine parameters was pro-
posed by the IFIP Working Group on Numerical Software (WG 2.5) [383,

1978], though these do not appear to have been widely adopted.


The use of high precision computations to investigate the Riemann hy-
pothesis and other mathematical problems and conjectures is described by
Varga [1052, 1990], who has made use of Brent’s package [145, 1978], [146,
1978].

Problems
25.1. (a) The following MATLAB code attempts to compute

µ := min{floating point x : fl(1 + x) > 1},

under the assumption that the base β = 2.


function x = mu()

x = 1;
xp1 = x + 1;
while xp1 > 1
   x = x/2;
   xp1 = x + 1;
end
x = 2*x;
On my workstation, running this code gives
>> mu

ans =
2.2204e-016
Try this code, or a translation of it into another language, on any machines
available to you. Are the answers what you expected?
(b) Under what circumstances might the code in (a) fail to compute µ?
(Hint: consider optimizing compilers.)
(c) What is the value of µ in terms of the arithmetic parameters β and
t ? Note: the answer depends subtly on how rounding is performed, and in
particular on whether double rounding is in effect; see Higham [550, 1991]
and Moler [771, 1990]. On a workstation with an Intel 486DX chip (in which
double rounding does occur), the following behaviour is observed in MATLAB:
>> format hex; format compact % Hexadecimal format.

>> x = 2^(-53) + 2^(-64) + 2^(-105); y = [1+x 1 x]

y =
3ff0000000000001 3ff0000000000000 3ca0020000000001

>> x = 2^(-53) + 2^(-64) + 2^(-106); y = [1+x 1 x]

y =
3ff0000000000000 3ff0000000000000 3ca0020000000000
25.2. Show that Smith’s formulae (25.1) can be derived by applying Gaussian
elimination with partial pivoting to the 2 x 2 linear system obtained from
(c + id)(x + iy) = a + ib.
25.3. The following MATLAB function implements an algorithm of Malcolm [724,
1972] for determining the floating point arithmetic parameters β and t.
function [beta, t] = param
% a and b are floating point variables.

a = 1;
while (a+1) - a == 1
   a = 2*a;
end
b = 2;
while (a+b) == a
   b = 2*b;
end
beta = (a+b) - a;
t = 1;
a = beta;
while (a+1) - a == 1
   t = t+1;
   a = a*beta;
end
Run this code, or a translation into another language, on any machines avail-
able to you. Explain how the code works. (Hint: consider the integers that can
be represented exactly as floating point numbers.) Under what circumstances
might it fail to produce the desired results?
25.4. Hough [585, 1989] formed a 512 x 512 matrix using the following Fortran
random number generator:
      subroutine matgen(a, lda, n, b, norma)
      REAL a(lda,1), b(1), norma
c
      init = 1325
      norma = 0.0
      do 30 j = 1,n
         do 20 i = 1,n
            init = mod(3125*init,65536)
            a(i,j) = (init - 32768.0)/16384.0
            norma = max(a(i,j), norma)
   20    continue
   30 continue
He then solved a linear system using this matrix, on a workstation that uses
IEEE single precision arithmetic. He found that the program took an inordi-
nate amount of time to run and tracked the problem to underflows. On the
system in question underflows are trapped and the correct result recomputed
very slowly in software. Hough picks up the story:

I started printing out the pivots in the program. They started out
as normal numbers like 1 or -10, then suddenly dropped to about
1e-7, then later to 1e-14, and then:
k 82 pivot -1.8666e-20 k 98 pivot 1.22101e-21
k 83 pivot -2.96595e-14 k 99 pivot -7.12407e-22
k 84 pivot 2.46156e-14 k 100 pivot -1.75579e-21
k 85 pivot 2.40541e-14 k 101 pivot 3.13343e-21
k 86 pivot -4.99053e-14 k 102 pivot -6.99946e-22
k 87 pivot 1.7579e-14 k 103 pivot 3.82048e-22
k 88 pivot 1.69295e-14 k 104 pivot 8.05538e-22
k 89 pivot -1.56396e-14 k 105 pivot -1.18164e-21
k 90 pivot 1.37869e-14 k 106 pivot -6.349e-22
k 91 pivot -3.10221e-14 k 107 pivot -2.48245e-21
k 92 pivot 2.35206e-14 k 108 pivot -8.89452e-22
k 93 pivot 1.32175e-14 k 109 pivot -8.23235e-22
k 94 pivot -7.77593e-15 k 110 pivot 4.40549e-21
k 95 pivot 1.34815e-14 k 111 pivot 1.12387e-21
k 96 pivot -1.02589e-21 k 112 pivot -4.78853e-22
k 97 pivot 4.27131e-22 k 113 pivot 4.38739e-22
k 114 pivot 7.3868e-28

SIGFPE 8: numerical exception, CHK, or TRAP


stopped at daxpy+0x18c: movl a4@(0xe10),a3@

Explain this behaviour.


25.5. The version of the level-1 BLAS routine xNRM2 that is distributed with
LAPACK implements the following algorithm for computing ||x||_2:

   t = 0
   s = 1
   for i = 1:n
      if |x_i| > t
         s = 1 + s(t/x_i)^2
         t = |x_i|
      else
         s = s + (x_i/t)^2
      end
   end
   ||x||_2 = ts^{1/2}

Prove that the algorithm works and describe its properties.
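For reference, here is a direct MATLAB transcription of the pseudocode (a sketch;
like the LAPACK code it skips zero elements, so that no division by t = 0 can occur):

   function nrm = nrm2(x)
   %NRM2   Sketch of the scaled 2-norm algorithm above: ||x||_2 = t*sqrt(s).
   t = 0; s = 1;
   for i = 1:length(x)
       if x(i) ~= 0                   % skip zeros (cf. the LAPACK code)
           if abs(x(i)) > t
               s = 1 + s*(t/x(i))^2;
               t = abs(x(i));
           else
               s = s + (x(i)/t)^2;
           end
       end
   end
   nrm = t*sqrt(s);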


25.6. (Moler and Morrison [773, 1983], Dubrulle [323, 1983]) This MATLAB
M-file computes the Pythagorean sum (a^2 + b^2)^{1/2} using an algorithm of Moler
and Morrison.

function p = pythag(a, b)
%PYTHAG Pythagorean sum.

p = max(abs(a), abs(b));
q = min(abs(a), abs(b));
while 1
   r = (q/p)^2;
   if r+4 == 4, return, end
   s = r/(4+r);
   p = p+2*s*p;
   q = s*q;
   fprintf('p = %19.15e, q = %19.15e\n', p, q)
end

The algorithm is immune to underflow and overflow (unless the result over-
flows), is very accurate, and converges in at most three iterations, assuming
the unit roundoff u > 10^{-20}.
Example:
p = pythag(3,4); (p-5)/5

p = 4.986301369863014e+000, q = 3.698630136986301e-001
p = 4.999999974188253e+000, q = 5.080526329415358e-004
p = 5.000000000000001e+000, q = 1.311372652397091e-012
ans =
1.7764e-016
The purpose of this problem is to show that pythag is Halley's method
applied to a certain equation. Halley's method for solving f(x) = 0 is the
iteration

   x_{n+1} = x_n - (f_n/f'_n)(1 - f_nf''_n/(2f'_n^2))^{-1},

where f_n, f'_n and f''_n denote the values of f and its first two derivatives at x_n.
(a) For given x_0 and y_0 such that 0 < y_0 < x_0, the Pythagorean sum
p = (x_0^2 + y_0^2)^{1/2} is a root of the equation f(x) = x^2 - p^2 = 0. Show that
Halley's method applied to this equation gives

   x_{n+1} = x_n(x_n^2 + 3p^2)/(3x_n^2 + p^2).

Show that if 0 < x_0 < p = (x_0^2 + y_0^2)^{1/2} then x_n < x_{n+1} < p for all n. Deduce
that y_n := (p^2 - x_n^2)^{1/2} is defined and that

   x_{n+1} = x_n + 2x_ny_n^2/(4x_n^2 + y_n^2),   y_{n+1} = y_n^3/(4x_n^2 + y_n^2).

Confirm that pythag implements a scaled version of these equations.

(b) Show that

   p - x_{n+1} = (p - x_n)^3/(3x_n^2 + p^2),

which displays the cubic convergence of the iteration.


25.7. (Incertis [604, 1985]) (a) Given a skew-symmetric matrix Y ∈ R^{n×n}
with ||Y||_2 < 1, show that there is a real, symmetric X satisfying X^2 = I + Y^2
such that X + Y is orthogonal.
(b) Consider the following iteration, for given P_0, Q_0 ∈ R^{n×n} with ||P_0||_2 >
||Q_0||_2, which generalizes to matrices the "pythag" iteration in Problem 25.6.

   for k = 0, 1, 2, ...
      R_k = (Q_kP_k^{-1})^2
      S_k = R_k(4I + R_k)^{-1}
      P_{k+1} = P_k + 2S_kP_k
      Q_{k+1} = S_kQ_k
   end

Show that this iteration can be used to compute X in part (a).


(c) Investigate the convergence of the iteration in (b) for general P0 and
Q0.
25.8. Why might we prefer the expression in an algorithm
intended to be robust in floating point arithmetic?

Chapter 26
A Gallery of Test Matrices

Many tricks or treats associated with the Hilbert matrix


may seem rather frightening or fascinating.
— MAN-DUEN CHOI, Tricks or Treats with the Hilbert Matrix (1983)

I start by looking at a 2 by 2 matrix.


Sometimes I look at a 4 by 4 matrix.
That’s when things get out of control and too hard.
Usually 2 by 2 or 3 by 3 is enough, and I look at them,
and I compute with them, and I try to guess the facts.
First, think of a question.
Second, I look at examples, and then third,
guess the facts.
— PAUL R. HALMOS^{24} (1991)

When people look down on matrices,


remind them of great mathematicians such as
Frobenius, Schur, C. L. Siegel, Ostrowski, Motzkin, Kac, etc.,
who made important contributions to the subject.
— OLGA TAUSSKY, How I Became a Torchbearer for Matrix Theory (1988)

^{24}From interviews by Albers in [9, 1991].


Ever since the first computer programs for matrix computations were written
in the 1940s, researchers have been devising matrices suitable for test purposes
and investigating the properties of these matrices. In the 1950s and 1960s it
was common for a whole paper to be devoted to a particular test matrix:
typically its inverse or eigenvalues would be obtained in closed form.
Early collections of test matrices include those of Newman and Todd [796,
1958] and Rutishauser [886, 1968]; most of Rutishauser’s matrices come from
continued fractions or moment problems. Two well-known books present col-
lections of test matrices. Gregory and Karney [482, 1969] deal exclusively with
the topic, while Westlake [1076, 1968] gives an appendix of test matrices.
In this chapter we present a gallery of matrices. We describe their prop-
erties and explain why they are useful (or not so useful, as the case may be)
for test purposes. The coverage is limited. A comprehensive, up-to-date, and
well-documented collection of parametrized test matrices may be found in
the Test Matrix Toolbox, described in Appendix E. Of course, MATLAB itself
contains a number of special matrices that can be used for test purposes (type
help specmat).
Several other types of matrices would have been included in this chapter
had they not been discussed elsewhere in the book. These include magic
squares (Problem 6.4), the Kahan matrix (8.10), Hadamard matrices (§9.3),
and Vandermonde matrices (Chapter 21).
The matrices described here can be modified in various ways while still
preserving some or all of their interesting properties. Among the many ways
of constructing new test matrices from old are

• Similarity transformations A → X^{-1}AX.

• Unitary transformations A → UAV, where U*U = V*V = I.

• Kronecker products A → A ⊗ B or B ⊗ A (for which MATLAB has a
  routine kron).

• Powers A → A^k.
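For example, these transformations might be applied in MATLAB as follows (a
sketch; the particular matrices X, U, V and the starting matrix hilb(5) are arbitrary
choices):

   A = hilb(5);
   X = eye(5) + triu(rand(5), 1);      % nonsingular
   B1 = X\A*X;                         % similarity transformation
   [U, RU] = qr(randn(5));
   [V, RV] = qr(randn(5));
   B2 = U*A*V;                         % orthogonal transformation
   B3 = kron(A, eye(2));               % Kronecker product
   B4 = A^3;                           % power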

26.1. The Hilbert and Cauchy Matrices


The Hilbert matrix H_n ∈ R^{n×n}, with elements h_ij = 1/(i + j - 1), is perhaps
the most famous of all test matrices. It was widely used in the 1950s and
1960s for testing inversion algorithms and linear equation solvers. Its attrac-
tions were threefold: it is very ill conditioned for even moderate values of n,
formulae are known for the elements of the inverse, and the matrix arises in
a practical problem: least squares fitting by a polynomial expressed in the
monomial basis.

Despite its past popularity and notoriety, the Hilbert matrix is not a good
test matrix. It is too special. Not only is it symmetric positive definite, but
it is totally positive. This means, for example, that Gaussian elimination
without pivoting is guaranteed to produce a small componentwise relative
backward error (as is Cholesky factorization). Thus the Hilbert matrix is not
a typical ill-conditioned matrix.
The (i, j) element of the inverse of the Hilbert matrix H_n is the integer

   (H_n^{-1})_{ij} = (-1)^{i+j}(i + j - 1)\binom{n+i-1}{n-j}\binom{n+j-1}{n-i}\binom{i+j-2}{i-1}^2,   (26.1)

and

   det(H_n) = c_n^4/c_{2n},   c_n = \prod_{i=1}^{n-1} i!.   (26.2)

There are many ways to rewrite the formula (26.1). These formulae are best
obtained as special cases of those for the Cauchy matrix below.
The Cholesky factor Rn of the inverse of the Hilbert matrix is known
explicitly, as is Rn-1:

(26.3)

(26.4)

One interesting application of these formulae is to compute the eigenvalues


of H n as the squares of the singular values of Rn ; if Rn is evaluated from
(26.3) and the one-sided Jacobi algorithm is used to compute the singular
values then high relative accuracy is obtained for every eigenvalue, as shown
by Mathias [732, 1995].
The condition number of the Hilbert matrix grows at an exponential rate:
κ_2(H_n) ~ exp(3.5n) [1004, 1954]. See Table 26.1 for the first few condition
numbers (these were obtained by computing the inverse exactly using MAT-
LAB’S Symbolic Math Toolbox [204, 1993] and then computing the norm of
the numeric representation of the inverse; the numbers given are correct to
the figures shown).
It is an interesting fact that the matrix A_n = (1/(i + j)) (a submatrix
of H_{n+1}) satisfies µ_n := ||A_n||_2 = π + O(1/log n) as n → ∞, as proved by
Taussky [996, 1949]. The convergence to π is very slow: µ_200 = 2.01,
µ_300 = 2.08, µ_400 = 2.12.
That H_n^{-1} is known explicitly is not as useful a property for testing an
inversion algorithm as it might appear, because H_n cannot be stored exactly
in floating point arithmetic. This means that when we attempt to invert H_n
we at best invert H_n + ∆H (the matrix actually stored on the computer),

Table 26.1. Condition numbers of Hilbert and Pascal matrices.

 n     κ2(Hn)       κ2(Pn)
 2     1.9281e1     6.8541e0
 3     5.2406e2     6.1984e1
 4     1.5514e4     6.9194e2
 5     4.7661e5     8.5175e3
 6     1.4951e7     1.1079e5
 7     4.7537e8     1.4934e6
 8     1.5258e10    2.0645e7
 9     4.9315e11    2.9078e8
10     1.6026e13    4.1552e9
11     5.2307e14    6.0064e10
12     1.7132e16    8.7639e11
13     5.6279e17    1.2888e13
14     1.8534e19    1.9076e14
15     6.1166e20    2.8396e15
16     2.0223e22    4.2476e16

where |∆H| < uH_n, and (H_n + ∆H)^{-1} can differ greatly from H_n^{-1}, in view
of the ill conditioning. A possible way round this difficulty is to start with
the integer matrix H_n^{-1}, but its entries are so large that they are exactly
representable in IEEE double precision arithmetic only for n less than 13.
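This effect is easily observed in MATLAB (a small sketch; invhilb(n) returns the
exact inverse provided its entries are representable):

   n = 10;
   H = hilb(n);                          % rounded Hilbert matrix
   E = inv(H) - invhilb(n);
   norm(E)/norm(invhilb(n))              % far from the unit roundoff
   cond(H)                               % roughly 1.6e13; cf. Table 26.1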
The Hilbert matrix is a special case of a Cauchy matrix C_n = C_n(x, y) ∈ R^{n×n},
whose elements are c_ij = 1/(x_i + y_j), where x, y ∈ R^n are given n-vectors
(take x_i = y_i = i - 0.5 for the Hilbert matrix). The following formulae give the
inverse and determinant of C_n, and therefore generalize those for the Hilbert
matrix. The (i, j) element of C_n^{-1} is

   (C_n^{-1})_{ij} = \frac{\prod_{k=1}^n (x_j + y_k)(x_k + y_i)}
                        {(x_j + y_i)\prod_{k \ne j}(x_j - x_k)\prod_{k \ne i}(y_i - y_k)},

and

   det(C_n) = \frac{\prod_{i=2}^n \prod_{j=1}^{i-1} (x_i - x_j)(y_i - y_j)}
                   {\prod_{i=1}^n \prod_{j=1}^n (x_i + y_j)},

the latter formula having been published by Cauchy in 1841 [189, 1841,
pp. 151–159]. The LDU factors of C_n have been found explicitly by Gohberg

and Koltracht [454, 1990]: lkk = ukk = 1 and

It is known that C_n is totally positive if 0 < x_1 < ··· < x_n and 0 < y_1 <
··· < y_n (as is true for the Hilbert matrix) [998, 1962, p. 295]. Interestingly,
the sum of all the elements of C_n^{-1} is \sum_{i=1}^n (x_i + y_i) [667, 1973, Ex. 44, §1.2.3].
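The last fact is easily checked numerically; the following sketch uses arbitrarily
chosen points x and y:

   n = 5; x = (1:n)'; y = (1:n)' - 0.5;
   C = 1./(x*ones(1,n) + ones(n,1)*y');     % Cauchy matrix C(x, y)
   sum(sum(inv(C))) - sum(x + y)            % zero, up to roundoff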

26.2. Random Matrices


Random matrices are widely used for test purposes. Perhaps the earliest use
of random matrices in numerical analysis was by von Neumann and Golds-
tine [1057, 1947], [462, 1951], who estimated the size of their error bounds for
matrix inversion (see §9.6) for the case of random matrices; to do this, they
had to estimate the condition number of a random matrix.
Intuitively, one might expect random matrices to be good at revealing
programming errors and unusual behaviour of algorithms, but this expec-
tation is not necessarily correct. For example, Miller [759, 1984, pp. 96-97]
describes a mutation experiment involving Fortran codes for Gaussian elimina-
tion without pivoting, Gaussian elimination with partial pivoting, and Gauss–
Jordan elimination with partial pivoting. For each code, all possible mutants
were generated, where a mutant is obtained by making a single typographical
change to the source code. All the mutants were tested on a single random
linear system Ax = b, with known solution, where a ij was chosen from the
uniform [– 1, 1] distribution. Many mutants were detected by their failure to
pass the test of producing a solution with forward error less than a tolerance.
However, some mutants passed this test, including all those that solve a sys-
tem correctly in exact arithmetic; mutants in the latter class include those
that select an incorrect pivot row and thus implement a numerically unstable
algorithm. A conclusion to be drawn from Miller’s experiment is that random
test data can reveal some programming errors, but will not reveal all.
A good example of a problem for which random test matrices are very poor
at revealing algorithmic weaknesses is matrix condition number estimation.
The popular condition estimation algorithms can yield poor estimates but, in

practice, never produce them for a random matrix (see Chapter 14). The role
of random matrices here is to indicate the average quality of the estimates.
Edelman [340, 1993] summarizes the properties of random matrices well
when he says that

What is a mistake is to psychologically link a random matrix with


the intuitive notion of a “typical” matrix or the vague concept of
“any old matrix.” In contrast, we argue that “random matrices”
are very special matrices. The larger the size of the matrices the
more predictable they are because of the central limit theorem.

Various results are known about the behaviour of matrices with elements
from the normal N(0, 1) distribution. Matrices of this type are generated by
MATLAB’S randn function. Let An denote an n x n matrix from this distri-
bution and let E(·) be the expectation operator. Then, in the appropriate
probabilistic sense, the following results hold as n → ∞:

(real data), (26.5)


(complex data), (26.6)
(real data), (26.7)
(complex data), (26.8)
(real or complex data). (26.9)

For details of (26.5)-(26.8) see Edelman [335, 1988]. Edelman conjectures that
the condition number results are true for any distribution with mean 0—in
particular, the uniform [-1, 1] distribution used by MATLAB's rand function.
The results (26.5) and (26.6) show that random matrices from the normal
N(0, 1) distribution tend to be very well conditioned.
The spectral radius result (26.9) has been proved as an inequality by Ge-
man [432, 1986] for independent identically distributed random variables aij
with zero mean and unit variance, and computer experiments suggest the
approximate equality for the standard normal distribution [432, 1986].
A question of interest in eigenvalue applications is how many eigenvalues
of a random real matrix are real. The answer has been given by Edelman,
Kostlan, and Shub [344, 1994]: denoting by E_n the expected number of real
eigenvalues of an n x n matrix from the normal N(0,1) distribution,

   E_n ~ (2n/π)^{1/2}   as n → ∞.

Thus the proportion of real eigenvalues, E_n/n, tends to zero as n → ∞. Exact
formulae for E_n for finite n are also given in [344, 1994].
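A small sampling experiment (a sketch, not from the references above) illustrates
the asymptotic formula:

   n = 50; trials = 200; total = 0;
   for k = 1:trials
       e = eig(randn(n));
       total = total + sum(imag(e) == 0);   % count the real eigenvalues
   end
   total/trials                             % sample mean, close to ...
   sqrt(2*n/pi)                             % ... the asymptotic value, about 5.6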

26.3. “Randsvd” Matrices


By randsvd^{25} we mean a matrix A ∈ R^{m×n} formed as A = UΣV^T, where
U ∈ R^{m×m} and V ∈ R^{n×n} are random orthogonal matrices and Σ = diag(σ_i) ∈
R^{m×n} is a given matrix of singular values. This type of matrix has a
predetermined singular value distribution (and 2-norm condition number),
but is, nevertheless, random.
Randsvd matrices have been widely used as test matrices, for example
for condition estimators [534, 1987], [946, 1980], and in the LAPACK testing
software. Singular value distributions of interest include, with α := K2(A) > 1
a parameter,

1. one large singular value: σ_1 = 1, σ_i = α^{-1}, i = 2:n;

2. one small singular value: σ_i = 1, i = 1:n - 1, σ_n = α^{-1};

3. geometrically distributed singular values: σ_i = β^{1-i}, i = 1:n, where
   β = α^{1/(n-1)};

4. arithmetically distributed singular values: σ_i = 1 - (1 - α^{-1})(i - 1)/(n - 1),
   i = 1:n.

To be precise about what we mean by “random orthogonal matrix” we


specify matrices from the Haar distribution, which is a natural distribution
over the space of orthogonal matrices, defined in terms of a measure called
the Haar measure [780, 1982, §2.1.4]. If A ∈ R^{n×n} has elements from the
normal N(0, σ^2) distribution and A has the QR factorization A = QR, with
the factorization normalized so that the diagonal elements of R are positive,
then Q is from the Haar distribution, for any variance σ^2 [100, 1979], [946,
1980]. If we compute Q from this prescription, the cost is 2n^3 flops. For our
randsvd application, a more efficient approach is based on the following result
of Stewart [946, 1980].

Theorem 26.1 (Stewart). Let the independent vectors x_i ∈ R^{n-i+1} have ele-
ments from the normal N(0, 1) distribution for i = 1:n - 1. Let P_i =
diag(I_{i-1}, \bar{P}_i), where \bar{P}_i is the Householder transformation that reduces x_i
to r_{ii}e_1. Then the product Q = DP_1P_2 . . . P_{n-1} is a random orthogonal ma-
trix from the Haar distribution, where D = diag(sign(r_{ii})).
This result allows us to compute a product form representation of a random
n x n orthogonal matrix from the Haar distribution in O(n^2) flops. If we
implicitly compute U ∈ R^{m×m} and V ∈ R^{n×n} using the construction in
Theorem 26.1, and then form A = UΣV^T, exploiting the structure, the cost
^{25}Randsvd is the name of the MATLAB M-file in the Test Matrix Toolbox that generates
matrices of this type.

is m^3 + n^3 flops. The alternative, which involves obtaining U and V from the


QR factorization of random matrices A, is about twice as expensive.
Note that forming UΣV^T with U and V single Householder matrices, as is
sometimes done in the literature, is not recommended, as it produces matrices
of a very special form: diagonal plus a rank-2 correction.
The construction of randsvd matrices is, of course, easily adapted to pro-
duce random symmetric matrices A = QΛQT with given eigenvalues.
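The following fragment sketches a simplified randsvd-style construction based on
the QR prescription above; it is not the Toolbox routine randsvd itself:

   n = 8; alpha = 1e6;                          % target 2-norm condition number
   sigma = alpha.^(-(0:n-1)/(n-1));             % mode 3: sigma_i = beta^(1-i)
   [U, R] = qr(randn(n)); U = U*diag(sign(diag(R)));
   [V, R] = qr(randn(n)); V = V*diag(sign(diag(R)));
   A = U*diag(sigma)*V';
   cond(A)                                      % approximately alpha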

26.4. The Pascal Matrix


The numbers in Pascal’s triangle satisfy, practically speaking,
infinitely many identities.
— RONALD L. GRAHAM, DONALD E. KNUTH, and OREN PATASHNIK,
Concrete Mathematics (1989)
A particularly entertaining test matrix is the Pascal matrix P_n ∈ R^{n×n},
defined by

   p_ij = \binom{i+j-2}{i-1}.

The rows of Pascal's triangle appear as anti-diagonals of P_n^{26}:


>> P = pascal(6)

P =
1 1 1 1 1 1
1 2 3 4 5 6
1 3 6 10 15 21
1 4 10 20 35 56
1 5 15 35 70 126
1 6 21 56 126 252
The earliest references to the Pascal matrix appear to be in 1958 by Newman
and Todd [796, 1958] and by Rutishauser [887, 1958] (see also Newman [795,
1962, pp. 240–241]); Newman and Todd say that the matrix was introduced
to them by Rutishauser. The matrix was independently suggested as a test
matrix by Caffney [177, 1963].
Rutishauser [886, 1968, §8] notes that Pn belongs to the class of moment
matrices M whose elements are contour integrals

26
In the MATLAB displays below we use the pascal function from the Test Matrix Toolbox.
This differs from the pascal function supplied with Matlab 4.2 only in that pascal (n, 2) is
rearranged.

All moment matrices corresponding to a positive weight function w(z) on a


contour C are Hermitian positive definite (as is easily verified by considering
the quadratic form y*My). The choice C = [0, 1], with weight function w(z) =
1, yields the Hilbert matrix. The Pascal matrix is obtained for C the circle
{z : |z - 1| = 1} and w(z) = (2πi(z - 1))^{-1} (not w(z) = (2π)^{-1} as stated in
[886, 1968]); the change of variable z = 1 + exp(iθ) yields a moment integral
with a positive weight function.
Remarkably, the Cholesky factor of the Pascal matrix again contains the
rows of Pascal’s triangle, now arranged columnwise:

>> R = chol(P)

R=
1 1 1 1 1 1
0 1 2 3 4 5
0 0 1 3 6 10
0 0 0 1 4 10
0 0 0 0 1 5
0 0 0 0 0 1

The scaled and transposed Cholesky factor S = R^T diag(1, -1, 1, -1, ..., (-1)^{n+1})
is returned by pascal(n, 1):

>> S = pascal(6, 1)

S =
1 0 0 0 0 0
1 -1 0 0 0 0
1 -2 1 0 0 0
1 -3 3 -1 0 0
1 -4 6 -4 1 0
1 -5 10 -10 5 -1

It is involutory: S^2 = I. This special property leads us to several more
properties of P = P_n. First, since P = SS^T, P^{-1} = S^{-T}S^{-1} = S^TS, and so
P^{-1} has integer entries (as is also clear from the fact that det(P) = det(R)^2 =
1). Moreover,

   P = SS^T = S(S^TS)S^{-1} = SP^{-1}S^{-1},

so P and P^{-1} are similar and hence have the same eigenvalues. In other
words, the eigenvalues appear in reciprocal pairs. In fact, the characteristic
polynomial π_n has a palindromic coefficient vector, which implies the recip-
rocal eigenvalues property, since π_n(λ) = λ^nπ_n(1/λ). This is illustrated as
follows (making use of MATLAB's Symbolic Math Toolbox):

>> charpoly(P)

ans =
1-351*x+6084*x^2-13869*x^3+6084*x^4-351*x^5+x^6

>> eig(P)

ans =
0.0030
0.0643
0.4893
2.0436
15.5535
332.8463
Since P is symmetric, its eigenvalues are its singular values and so we also
have that ||P||_2 = ||P^{-1}||_2 and ||P||_F = ||P^{-1}||_F. Now

where for the last equality we used a binomial coefficient summation iden-
tity from [477, 1989, p. 161]. Hence, using Stirling's approximation
(n! ~ (2πn)^{1/2}(n/e)^n),

Thus P_n is exponentially ill conditioned as n → ∞.


It is worth pointing out that it is not hard to generate symmetric positive
definite matrices with determinant 1 and the reciprocal root property. Let
X = ZDZ^{-1}, where Z is nonsingular and D = diag(±1) ≠ ±I. Then X^2 = I
and the matrix A = X^TX has the desired properties. If we choose Z lower
triangular then X is the Cholesky factor of A up to a column scaling by
diag(±1).
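In MATLAB the construction just described might read (a sketch):

   n = 6;
   Z = tril(rand(n), -1) + eye(n);      % unit lower triangular, nonsingular
   D = diag((-1).^(1:n));               % diag(+-1), not equal to +-I
   X = Z*D/Z;                           % X^2 = I
   A = X'*X;                            % symmetric positive definite
   det(A)                               % = 1
   lambda = sort(eig(A));
   lambda.*flipud(lambda)               % reciprocal pairs: all ones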
The inverse of the Pascal matrix was found explicitly by Cohen [230, 1975]:
the (i,j) element is

The Pascal matrix can be made singular simply by subtracting 1 from the
(n, n) element. To see this, note that

   det(P - e_ne_n^T) = det(P)(1 - e_n^TP^{-1}e_n) = 1 - (P^{-1})_{nn} = 0,

since (P^{-1})_{nn} = (S^TS)_{nn} = 1. This perturbation, ∆P = -e_ne_n^T, is far from
being the smallest one that makes P singular, which is ∆P_opt = -λ_nv_nv_n^T,
where λ_n is the smallest eigenvalue of P and v_n is the corresponding unit
eigenvector, for ||∆P_opt||_2 = λ_n is of order (n!)^2/(2n)! ~ (πn)^{1/2}4^{-n}, as we
saw above.
A more subtle property of the Pascal matrix is that it is totally posi-
tive. Karlin [644, 1968, p. 137] shows that the matrix with elements \binom{i+j}{i}
(i, j = 0, 1, ...) is totally positive; the Pascal matrix is a submatrix of this one
and hence is also totally positive. From the total positivity it follows that the
Pascal matrix has distinct eigenvalues, which (as we already know from the
positive definiteness) are real and positive, and that its ith eigenvector (cor-
responding to the ith largest eigenvalue) has exactly i - 1 sign changes [414,
1959, Thm. 13, p. 105].
T = pascal(n, 2) is obtained by rotating S clockwise through 90 degrees
and multiplying by -1 if n is even:
>> T = pascal(6, 2)

T=
-1 -1 -1 -1 -1 -1
5 4 3 2 1 0
-10 -6 -3 -1 0 0
10 4 1 0 0 0
-5 -1 0 0 0 0
1 0 0 0 0 0

It has the surprising property that it is a cube root of the identity, a property
noted by Turnbull [1028, 1929, p. 332]:
>> T*T

ans =
0 0 0 0 0 1
0 0 0 0 -1 -5
0 0 0 1 4 10
0 0 -1 -3 -6 -10
0 1 2 3 4 5
-1 -1 -1 -1 -1 -1

>> T*T*T

ans =
1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1

Figure 26.1. spy(rem(pascal(32),2)).

Finally, we note that it is trivial to plot an approximation to the Sierpinski


gasket [824, 1992, §2.2], [478,1992] in MATLAB: simply type spy(rem(pascal (n),
2)). See Figure 26.1. The picture produced by this command is incorrect for
large n, however, because the elements of pascal(n) become too large to be
exactly representable in floating point arithmetic.

26.5. Tridiagonal Toeplitz Matrices


A tridiagonal Toeplitz matrix T_n(c, d, e) ∈ R^{n×n} has d along the main diagonal
and the constants c and e along the two adjacent diagonals.
Such matrices arise, for example, when discretizing partial differential equa-
tions or boundary value problems for ordinary differential equations. The

eigenvalues are known explicitly [884, 1947], [885, 1952], [1006, 1977, pp. 155–
156]:

   d + 2(ce)^{1/2} cos(kπ/(n + 1)),   k = 1:n.
The eigenvalues are also known for certain variations of the symmetric matrix
T n ( c, d, c) in which the (1,1) and (n, n) elements are modified; see Gregory
and Karney [482, 1969].
The matrix T_n(-1, 2, -1) is minus the well-known second difference ma-
trix, which arises in applying central differences to a second derivative oper-
ator. Its inverse has (i, j) element -i(n - j + 1)/(n + 1) for i ≤ j (cf. Theo-
rem 14.9). The condition number satisfies κ_2(T_n) ~ (4/π^2)n^2.
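The eigenvalue formula above is easily checked in MATLAB (a sketch, for the
symmetric case c = e = -1, d = 2):

   n = 8; c = -1; d = 2; e = -1;
   T = d*eye(n) + c*diag(ones(n-1,1),-1) + e*diag(ones(n-1,1),1);
   k = (1:n)';
   formula = d + 2*sqrt(c*e)*cos(k*pi/(n+1));
   norm(sort(eig(T)) - sort(formula))        % small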
One interesting property of T_n(c, d, e) is that the diagonals of its LU fac-
torization converge as n → ∞ when T_n is symmetric and diagonally dominant,
and this allows some savings in the computation of the LU factorization, as
shown by Malcolm and Palmer [725, 1974]. Similar properties hold for cyclic
reduction; see Bondeli and Gander [134, 1994].

26.6. Companion Matrices


The companion matrix associated with the characteristic polynomial

of A is the matrix

The Test Matrix Toolbox function compan computes C, via the call C =
compan(A). It is easy to check that C has the same characteristic polyno-
mial as A, and that if λ is an eigenvalue of C then [λ^{n-1}, λ^{n-2}, ..., λ, 1]^T is
a corresponding eigenvector. Since C - λI has rank at least n - 1 for any
λ, C is nonderogatory, that is, in the Jordan form no eigenvalue appears in
more than one Jordan block. It follows that A is similar to C only if A is
nonderogatory.
There are no explicit formulae for the eigenvalues of C, but, perhaps sur-
prisingly, the singular values have simple representations, as found by Kenney
and Laub [651, 1988] (see also Kittaneh [660, 1995]):

Figure 26.2. Pseudospectra of compan(A).

where
The compan function is a useful means for generating new test matrices
from old. For any A, compan(A) is a nonnormal matrix with the same
eigenvalues as A (to be precise, compan(A) is normal if and only if a_0 = 1 and
a_i = 0 for i > 0).
Companion matrices tend to have interesting pseudospectra, as illustrated
in Figure 26.2. For more information on the pseudospectra of companion
matrices see Toh and Trefethen [1007, 1994].

26.7. Notes and References


The earliest reference to the Hilbert matrix appears to be [570, 1894], wherein
Hilbert obtains the formula (26.2) for det(H_n).
The formula (26.1) for H_n^{-1} is taken from Choi [207, 1983], who describes
various interesting properties of the Hilbert matrix and its infinite analogue.
An excellent reference for the derivation of formulae for the inverse of the
Cauchy and Hilbert matrices is Knuth [667, 1973, pp. 35–37]. Another refer-
ence for the Cauchy matrix is Tyrtyshnikov [1034, 1991]. The formulae (26.3)
and (26.4) for the Cholesky factor and its inverse are from Choi [207, 1983]
and Todd [1004, 1954].
Forsythe and Moler [396, 1967] have a chapter devoted to the Hilbert ma-
trix, in which they describe the underlying least squares problem and discuss
numerical computation of the inverse. There have been many papers on the

Hilbert matrix; two of particular interest are by Todd [1004, 1954], [1005,
1961].
Other references on the eigenvalues and condition numbers of random
matrices include Edelman [336, 1991], [339, 1992] and Kostlan [671, 1992].
Anderson, Olkin, and Underhill [20, 1987] suggest another way to con-
struct random orthogonal matrices from the Haar distribution, based on prod-
ucts of random Givens rotations. Marsaglia and Olkin [728, 1984] discuss the
generation of random correlation matrices (symmetric positive semidefinite
matrices with ones on the diagonal).
The involutory triangular matrix pascal(n, 1) arises in the step-size chang-
ing mechanism in an ordinary differential equation code based on backward
differentiation formulae; see Shampine and Reichelt [912, 1995].
We mention some other collections of test matrices. The Harwell-Boeing
collection of sparse matrices, largely drawn from practical problems, is pre-
sented by Duff, Grimes, and Lewis [327, 1989], [328, 1992]. Bai [36, 1994] is
building a collection of test matrices for large-scale nonsymmetric eigenvalue
problems. Zielke [1128, 1986] gives various parametrized rectangular matrices
of fixed dimension with known generalized inverses. Demmel and McKen-
ney [299, 1989] present a suite of Fortran 77 codes for generating random
square and rectangular matrices with prescribed singular values, eigenvalues,
band structure, and other properties. This suite was the inspiration for the
randsvd routine in the Test Matrix Toolbox and is part of the testing code
for LAPACK (see below).

26.7.1. LAPACK
The LAPACK distribution contains a suite of routines for generating test ma-
trices, located in the directory LAPACK/TESTING/MATGEN (in Unix notation).
These routines (whose names are of the form xLAxxx) are used for testing
when LAPACK is installed and are not described in the LAPACK Users’
Guide [17, 1995].

Problems

26.1. Investigate the spectral and pseudospectral properties of pentadiag-


nal Toeplitz matrices. See Figure 26.3 (pentoep is a function in the Test
Matrix Toolbox that generates matrices of this type). References are Beam
and Warming [85, 1993] and Reichel and Trefethen [866, 1992].

Figure 26.3. Pseudospectra of 32 x 32 pentadiagonal Toeplitz matrices:
pentoep(32,0,1/2,0,0,1), inv(pentoep(32,0,1,1,0,.25)),
pentoep(32,0,1/2,1,1,1), pentoep(32,0,1,0,0,1/4).

26.2. (RESEARCH PROBLEM) Two methods for generating a real orthogo-
nal matrix from the Haar distribution are Stewart's method (Theorem 26.1),
based on Householder transformations, and a method of Anderson, Olkin,
and Underhill [20, 1987], based on Givens rotations. Compare the efficiency
of these two methods when used to generate randsvd matrices. Investigate


the use of the Givens rotations approach to construct random banded matri-
ces with given singular values and random symmetric banded matrices with
given eigenvalues, and compare with the technique of generating a full matrix
and then using band reduction (as implemented in the routine randsvd in the
Test Matrix Toolbox).
26.3. (RESEARCH PROBLEM) Develop an efficient algorithm for computing a
unit upper triangular n x n matrix with prescribed singular values σ_1, ..., σ_n,
where σ_1σ_2 ··· σ_n = 1.

Appendix A
Solutions to Problems

1.1 Since we have

Hence if < 0.01, say, then there is no difference between Erel and for
practical purposes.
1.2. Nothing can be concluded about the last digit before the decimal point. Eval-
uating y to higher precision yields
t y
35 262537412640768743.99999999999925007
40 262537412640768743.9999999999992500725972
This shows that the required digit is, in fact, 3. The interesting fact that y is so close
to an integer was pointed out by Lehmer [697, 1943], who explains its connection
with number theory. It is known that y is irrational [959, 1991].
1.3.
1.
2. 2sin
3. ( x – y)(x + y). Cancellation has not been avoided, but it is now harmless if x
and y are known exactly (see also Problem 3.9).
4. sin x/(1 + cos x).
5. c = ((a – b)2 + ab(2sinθ/ 2 ) 2 ) 1 / 2 .
1.4. a + ib = (x + iy)^2 = x^2 - y^2 + 2ixy, so b = 2xy and a = x^2 - y^2, giving
x^2 - b^2/(4x^2) = a, or 4x^4 - 4ax^2 - b^2 = 0. Hence

   x^2 = (a ± (a^2 + b^2)^{1/2})/2.

If a > 0 we use this formula with the plus sign, since x^2 > 0. If a < 0 this formula
is potentially unstable, so we use the rewritten form

   x^2 = b^2/(2((a^2 + b^2)^{1/2} - a)).

In either case we get two values for x and recover y from y = b /(2x ). Note that
there are other issues to consider here, such as scaling to avoid overflow.
1.5. We need a way to compute f(x) = log(1 + x) accurately for all x > 0. A
straightforward evaluation of log(1 + x) is not sufficient, since the addition 1 + x
loses significant digits of x when x is small. The following method is effective (for
another approach, see Hull, Fairgrieve, and Tang [592, 1994]): calculate w = 1 + x
and then compute

   f(x) = x,                  if ŵ = 1,
          x log(ŵ)/(ŵ - 1),   otherwise.

The explanation of why this method works is similar to that for the method in
§1.14.1. We have ŵ = (1 + x)(1 + δ), |δ| < u, and if ŵ = 1 then |x| < u + u^2 + u^3 + ···,
so from the Taylor series f(x) = x(1 - x/2 + x^2/3 - ···) we see that f(x) = x is a
correctly rounded result. If ŵ ≠ 1 then

Defining =: 1 + z,

so if (thus x 0) then

Thus, with 1 + θ :=

showing that is an accurate approximation to f(x) = log(1 + x).


1.7. From (1.4) we have

This yields the following inequality, which is attainable to first order:

Dividing by and taking the limit as gives the expression for


The result for follows from

where the inequality is attainable (Cauchy–Schwarz), together with the relation


That is easily verified.

1.8. The general solution to the recurrence is

   x_k = (a·100^{k+1} + b·6^{k+1} + c·5^{k+1})/(a·100^k + b·6^k + c·5^k),

where a, b, and c are arbitrary constants. The particular starting values chosen
yield a = 0, b = c = 1, so that

   x_k = (6^{k+1} + 5^{k+1})/(6^k + 5^k).

Rounding errors in the evaluation (and in the representation of x_1 on a binary
machine) cause a nonzero "a" term to be introduced, and the computed values are
approximately of the form

   x̂_k ≈ (η·100^{k+1} + 6^{k+1} + 5^{k+1})/(η·100^k + 6^k + 5^k)
for a constant η of order the unit roundoff. Hence the computed iterates rapidly
converge to 100. Note that resorting to higher precision merely delays the inevitable
convergence to 100. Priest [844, 1992, pp. 54–56] gives some interesting observations
on the stability of the evaluation of the recurrence.

1.9. Writing
C:= adj(A) =

we have

so

Also, r = b – – (1/d) A ∆Cb, so |r| < γ3 |A||A-l||b|, which implies the normwise
residual bound. Note that the residual bound shows that will
be small if x is a large-normed solution
1.10. Let For any standard summation method we have (see §4.2)

Then, since satisfies

Hence, defining

that is,

where

2.1. There are 1 + 2(e_max - e_min + 1)(β - 1)β^{t-1} normalized numbers (the "1" is for
zero), and 2(β^{t-1} - 1) subnormal numbers. For IEEE arithmetic we therefore have

                       Normalized     Subnormal
   single precision    4.3 x 10^9     1.7 x 10^7
   double precision    1.8 x 10^19    9 x 10^15

2.2. Without loss of generality suppose x > 0. We can write x = m × β^{e-t}, where
β^{t-1} < m < β^t. The next larger floating point number is x + ∆x, where ∆x = β^{e-t},
and

The same upper bound clearly holds for the “next smaller” case, and the lower
bound in this case is also easy to see.

2.3. The answer is the same for all adjacent nonzero pairs of single precision num-
bers. Suppose the numbers are 1 and 1 + 2^{-23}. The spacing of the double precision
numbers on [1, 2] is 2^{-52}, so the answer is 2^{29} - 1 ≈ 5.4 x 10^8.

2.4. Inspecting the proof of Theorem 2.2 we see that |y_i| > β^{e-1}, i = 1, 2, and so we
also have |fl(x) – x|/|fl(x)| < u, that is, fl(x) = x /(1 + δ), |δ| < u. Note that this
is not the same δ as in Theorem 2.2, although it has the same bound, and unlike in
Theorem 2.2 there can be equality in this bound for δ.

2.5. The first part is trivial. Since, in binary notation,


   x = 0.1100 1100 1100 1100 1100 1100 | 1100 ... x 2^{-3},
we have, rounding to 24 bits,
   0.1100 1100 1100 1100 1100 1101 x 2^{-3}.
Thus

and so

2.6. Since the double precision mantissa contains 53 bits, p = 2^53 = 9007199254740992
≈ 9.01 x 10^15. For single precision, p = 2^24 = 16777216 ≈ 1.68 x 10^7.

2.7.
1. True, since in IEEE arithmetic fl( a op b) is defined to be the rounded value
of a op b, round(a op b), which is the same as round(b op a).
2. True for round to nearest (the default) and round to zero, but false for round
to ±∞.
3. True, because fl(a + a) := round(a + a) = round(2 * a) =: fl(2 * a).
4. True: similar to 3.
5. False, in general.

6. True for binary arithmetic. Since the division by 2 is exact, the inequality is
equivalent to 2a < fl(a + b) < 2b. But a + b < 2b, so, by the monotonicity
of rounding, fl(a + b) = round(a + b) < round(2b) = 2b. The lower bound is
verified similarly.

2.8. Examples in 3-digit decimal arithmetic are fl((5.01 + 5.02)/2) = fl(10.0/2) =


5.0 and fl((5.02 + 5.04)/2) = fl(10.1/2) = 5.05.
The left-hand inequality is immediate, since fl((b – a)/2) > 0. The right-hand
inequality can possibly be violated only if a and b are close (otherwise (a + b)/2 is
safely less than b, so the inequality will hold). But then, by Theorem 2.5, fl( b – a) is
obtained exactly, and the result will have several zero low-order digits in the mantissa
because of cancellation. Consequently, if the base is even the division is done exactly,
and the right-hand inequality follows by the monotonicity of rounding. Alternatively,
and even more simply, for any base we can argue that fl((b – a)/2) < b – a, so that
a + fl((b – a)/2) < b and, again, the result follows by the monotonicity of rounding.

2.9.

(binary).

Rounded directly to 53 bits this yields 1 – 2-53. But rounded first to 64 bits it yields

and when this number is rounded to 53 bits using the round to even rule it yields
1.0.

2.12. The spacing between the floating point numbers in the interval (1/2,1] is
(cf. Lemma 2.1), so |1/x – fl(1/x)| < which implies that |1 – xfl(1/x)| <
Thus

Since the spacing of the floating point numbers just to the right of 1 is xfl(1/x)
must round to either 1 – or 1.

2.13. If there is no double rounding the answer is 257,736,490. For a proof that
combines mathematical analysis with computer searching, see Edelman [343, 1994].

2.15. The IEEE standard does not define the results of exponentiation. The choice
0^0 = 1 can be justified in several ways. For example, if p(x) = \sum_{i=0}^n a_ix^i then
p(0) = a_0 = a_0 × 0^0, and the binomial expansion (x + y)^n = \sum_{k=0}^n \binom{n}{k}x^ky^{n-k}
yields 1 = 0^0 for x = 0, y = 1. For more detailed discussions see Goldberg [457,
1991, p. 32] and Knuth [669, 1992, pp. 406–408].

2.17. For IEEE arithmetic the answer is no. Since fl( x op y) is defined to be the
rounded version of x op y, b2 – ac > 0 implies fl(b2) – fl(ac) > 0 (rounding is a
monotonic operation). The final computed answer is
fl(fl(b2) - fl(ac)) = (fl(b2) - fl(ac))(1 + δ), |δ| < u
> 0.

2.18. No. A counterexample in 2-digit base-10 arithmetic is fl(2.0 – 0.91) =


fl(1.09) = 1.1.

2.19. The function maps the set of positive floating point numbers onto a
set of floating point numbers with about half as many elements. Hence there exist
two distinct floating point numbers x having the same value of and so the
condition = |x| cannot always be satisfied in floating point arithmetic. The
requirement |x| is reasonable for base 2, however, and is satisfied in IEEE
arithmetic, as we now show.
Without loss of generality, we can assume that 1 < x < 2, since scaling x by a
power of 2 does not alter By definition, is the nearest floating
point number to and

Now the spacing of the floating point numbers between 1 and 2 is = 2u, so

Hence |θ| < u if u < 1/4 (say), and then |x| is the nearest floating point number to
so that
In base-10 floating point arithmetic, the condition can be violated.
For example, working to 5 significant decimal digits, if x = 3.1625 then fl(x2) =
fl (10.0014 0625) = 10.001, and = fl(3.1624 3577 . . .) = 3.1624 < x.

2.20. On a Cray Y-MP the answer is yes, but in base-2 IEEE arithmetic the answer
is no. It suffices to demonstrate that = sign(x) which is shown
by the proof of Problem 2.19.

2.21. The test "x > y" returns false if x or y is a NaN, so the code computes
max(NaN, 5) = 5 and max(5, NaN) = NaN, which violates the obvious requirement
that max(x, y) = max(y, x). Since the test x ≠ x identifies a NaN, the following
code implements a reasonable definition of max(x, y):

   % max(x, y)
   if x ≠ x then
      max = y
   else if y ≠ y then
      max = x
   else if y > x then
      max = y
   else
      max = x
   end
   end
   end
A further refinement is to ensure that max(-0, + 0) = +0, which is not satisfied by
the code above since –0 and +0 compare as equal; this requires bit-level program-
ming.

2.22. We give an informal proof; the details are obtained by using the model
f l(x op y) = (x op y)(1 + δ), but they obscure the argument.
Since a, b, and c are nonnegative, a + (b + c) is computed accurately. Since
c < b < a, c + (a – b) and a + (b – c) are the sums of two positive numbers and
so are computed accurately. Since a, b, and c are the lengths of sides of a triangle,
a < b + c; hence, using c < b < a,

b < a < b + c < 2b ,


which implies that a – b is computed exactly, by Theorem 2.5. Hence c – ( a – b) is the
difference of two exactly represented numbers and so is computed accurately. Thus
f l(A) = where is an accurate approximation to the desired argument x
of the square root. It follows that fl(A) is accurate.

2.23. For a machine with a guard digit, y = x, by Theorem 2.5 (assuming 2 x does
not overflow). If the machine lacks a guard digit then the subtraction produces x
if the last bit of x is zero, otherwise it produces an adjacent floating point number
with a zero last bit; in either case the result has a zero last bit. Gu, Demmel, and
Dhillon [484, 1994] apply this bit zeroing technique to numbers d1, d2,. . . . dn arising
in a divide and conquer bidiagonal SVD algorithm, their motivation being that the
differences di – dj can then be computed accurately even on machines without a
guard digit.

2.24. The function f(x) = 3x - 1 has the single root x = 1/3. We can have
fl(f(x)) = 0 only for x ≈ 1/3. For x ≈ 1/3 the evaluation can be summarized as
follows:

The first, second, and fourth subtractions are done exactly, by Theorem 2.5. The
result of the first subtraction has a zero least-significant bit and the result of the

second has two zero least-significant bits; consequently the third subtraction suffers
no loss of trailing bits and is done exactly. Therefore f(x) is computed exactly
for x = fl(x) near 1/3. But fl(x) can never equal 1/3 on a binary machine, so
fl(f(x)) ≠ 0 for all x.

2.25. We have

where |δi | < u, i = 1:4. Hence

(1 + δ4 ) = (ad – bc – bcδ 1)(1 + δ3) + bcδ1(1 + δ2) = x + xδ3 – bcδ1(δ3 – δ2),


so that

which implies high relative accuracy unless u|bc| >> |x|. For comparison, the bound
for standard evaluation of fl(ad – bc) is |x – < γ2(|ad| + |bc|).

2.26. Newton’s method is

The quadratic convergence can be seen from 1 – xk+l a = (1 – xka ) 2 .

2.27. We would like (2.8) to be satisfied, that is, we want = fl(x/y) to satisfy
(A.1)

This implies that

In place of r = – x we have to use where


and where we can assume the subtraction is exact, since the case of interest is where
(see Theorem 2.5). Thus
using (Al) we obtain

Therefore the convergence test is

Since underflows to zero it cannot be precomputed, and we should instead com-


pute

3.1. The proof is by induction. Inductive step: for ρ n = +l,

For ρn = – 1 we find, similarly, that

3.2. This result can be proved by a modification of the proof of Lemma 3.1. But it
follows immediately from the penultimate line of the proof of Lemma 3.4.
3.3. The computed iterates satisfy

Defining = qk + ek, we have

This gives

The running error bound µ can therefore be computed along with the continued
fraction as follows:
   q_{n+1} = a_{n+1}
   f_{n+1} = 0
   for k = n: -1: 0
       r_k = b_k/q_{k+1}
       q_k = a_k + r_k
       f_k = |q_k| + |r_k| + |b_k|f_{k+1}/((|q_{k+1}| - uf_{k+1})|q_{k+1}|)
   end
   µ = uf_0

The error bound is valid provided that |q_{k+1}| - uf_{k+1} > 0 for all k. Otherwise a
more sophisticated approach is necessary (for example, to handle the case where
q_{k+1} = 0, q_k = ∞, and q_{k-1} is finite).

3.4. We prove just the division result and the last result. Let
Then α =

However, it is easy to verify that, in fact,

if j < k.
For the last result,

3.5. We have fl(AB) = AB + ∆C = (A + ∆CB-1)B =: (A + ∆A)B, where


∆ A = ∆CB-l, and |∆C| < by (3.12). Similarly, fl(AB) = A(B + ∆B),

3.6. We have

which implies

Solving these inequalities for gives the required lower bound.


The definition is flawed in general as can be seen from rank considerations.
For example if A and B are vectors and rank(C) ≠ 1, then no perturbations ∆A
and ∆B exist to satisfy C = (A + ∆A)(B + ∆B)! Thus we have to use a mixed
forward/backward stability definition in which we perturb C by at most as well
as A and B.
3.7. Lemma 3.5 shows that (3.2) holds provided that each (1 + δ)^k product is replaced
by
xi yi . Thus we have

It is easy to show that |αι| < γ n+2, so the only change required in (3.4) is to replace
γ n by γn+2. The complex analogue of (3.10) is = (A+ ∆A)x, |∆A| < γn + 2 |A|.

3.8. Without loss of generality we can suppose that the columns of the product are
computed one at a time. With xj = A1 . . . Akej we have, using (3.10),

and so, by Lemma 3.6,

Squaring these inequalities and summing over j yields


||A1 . . . Ak – fl(Al . . . Ak)||F <
which gives the result.
Note that the product ||A1||2 . . . ||Ak||2 can be much smaller than ||A||F . . . ||Ak||F;
the extreme case occurs when the Ai are orthogonal.
3.9. We have fl((x + y)(x – y)) = (x + y)(x – y)(1 + θ3), |θ3| < γ3 = 3u/(1 – 3u), so
the computed result has small relative error. Moreover, if y/2 < x < y then x – y is
computed exactly, by Theorem 2.5, hence fl((x + y)(x – y)) = (x + y)(x – y) (1 + θ2).
However, fl(x2 – y2) = x2 (1 + θ2) – y2 (1 + θ´2), so that

and we cannot guarantee a small relative error.


If |x| >> |y| then fl(x2 – y2) suffers only two rounding errors, since the error in
forming f1(y2) will not affect the final result, while fl((x + y)(x – y)) suffers three
rounding errors; in this case fl(x2 – y2) is likely to be the more accurate result.
3.10. Assume the result is true for m – 1. Now

so

3.11. The computations can be expressed as

We have

where |δi |, < u. Solving these recurrences, we find that

It follows that

which shows that differs negligibly from ym+l. For the repeated squarings,
however, we find that

where we have used Lemma 3.1. Hence the squarings introduce a relative error that
can be approximately as large as 2mu. Since u = 2 –53 this relative error is of order
0.1 for m = 50, which explains the observed results for m = 50.
For m = 75, the behaviour on the Sun is analogous to that on the HP calculator
described in §1.12.2. On the 486DX, however, numbers less than 1 are mapped to
1. The difference is due to the fact that the 486DX uses double rounding and the
Sun does not; see Problem 2.9.

3.12. The analysis is just a slight extension of that for an inner product. The
analogue of (3.3) is

Hence

Setting M = max{ |f(x)| : a < x < b}, we have

(A.2)

Any reasonable quadrature rule designed for polynomic f has so


one implication of (A.2) is that it is best not to have weights of large magnitude and
varying sign; ideally, wi > 0 for all i (as for Gaussian integration rules, for example),
so that
4.1. A condition number is

C(x) = max

It is easy to show that

The condition number is 1 if the xi all have the same sign.


4.2. In the (i – 1)st floating point addition the “2k-t” portion of xi does not
propagate into the sum (assuming that the floating point arithmetic uses round to
nearest with ties broken by rounding to an even last bit or rounding away from
zero), thus there is an error of 2k-t and = i. The total error is

while the upper bound of (4.4) is

which agrees with the actual error to within a factor 3; thus the smaller upper
bound of (4.3) is also correct to within this factor. The example just quoted is, of
course, a very special one, and as Wilkinson [1088, 1963, p. 20] explains, “in order
to approach the upper bound as closely as this, not only must each error take its
maximum value, but all the terms must be almost equal.”
4.3. With S k = we have

By repeated use of this relation it follows that

which yields the required expression for The bound on |En| is immediate.
The bound is minimized if the xi are in increasing order of absolute value. This
observation is common in the literature and it is sometimes used to conclude that
the increasing ordering is the best one to use. This reasoning is fallacious, because”
minimizing an error bound is not the same as minimizing the error itself. As (4.3)
shows, if we know nothing about the signs of the rounding errors then the “best”
ordering to choose is one that minimizes the partial sums.

4.4. Any integer between 0 and 10 inclusive can be reproduced. For example, fl(1 +
2 + 3 + 4 + M - M) = 0, fl(M - M + 1 + 2 + 3 + 4) = 10, and fl(2 + 3 + M - M + 1 + 4) = 5.

4.5. This method is sometimes promoted on the basis of the argument that it
minimizes the amount of cancellation in the computation of Sn. This is incorrect:
the “±” method does not reduce the amount of cancellation—it simply concentrates
all the cancellation into one step. Moreover, cancellation is not a bad thing per se,
as explained in §1.7.
The “±’ method is an instance of Algorithm 4.1 (assuming that S+ and S–
are computed using Algorithm 4.1) and it is easy to see that it maximizes max i |Ti |
over all methods of this form (where, as in §4.2, Ti is the sum computed at the i t h
stage). Moreover, when the value of maxi |Ti | tends to be
much larger for the “±” method than for the other methods we have considered.

4.6. The main concern is to evaluate the denominator accurately when the xi are
close to convergence. The bound (4.3) tells us to minimize the partial sums; these
are, approximately, for x_i ≈ ξ, (a) -ξ, 0, (b) 0, 0, (c) 2ξ, 0. Hence the error analysis
of summation suggests that (b) is the best expression, with (a) a distant second.
That (b) is the best choice is confirmed by Theorem 2.5, which shows there will be
only one rounding error when the xi are close to convergence. A further reason to
prefer (b) is that it is less prone to overflow than (a) and (c).

4.7. This is, of course, not a practical method, not least because it is very prone
to overflow and underflow. However, its error analysis is interesting. Ignoring the
error in the log evaluation, and assuming that exp is evaluated with relative error
bounded by u, we have, with |δ| < u for all i, and for some δ2 n

Hence the best relative error bound we can obtain is


Clearly, this method of evaluation guarantees a small absolute error, but not a small
relative error when |Sn| << 1.

4.8. Method (a) is recursive summation of a, h, h,. . . . h. From (4.3) we have |a +


Hence, since

For (b), using the relative error counter notation (3.9),


a <1> + ih <3>. Hence

For (c), = a(1 – <l> i/n) <2> + (i/n) b<3>, hence

The error bound for (b) is about a factor i smaller than that for (a). Note that
method (c) is the only one guaranteed to yield = b (assuming fl(n/n) = 1, as
holds in IEEE arithmetic), which may be important when integrating a differential
equation to a given end-point.
If a > 0 then the bounds imply

Thus (b) and (c) provide high relative accuracy for all i, while the relative accuracy
of (a) can be expected to degrade as i increases.
5.1. By differentiating the Horner recurrence q_i = xq_{i+1} + a_i, q_n = a_n, we obtain

The factors 2, 3, . . . . can be removed by redefining Then

5.2. Analysis similar to that for Horner's rule shows that


   fl(p(x)) = a_0<n> + a_1x<n+1> + ··· + a_nx^n<n+1>.

The total number of rounding errors is the same as for Horner's algorithm, but they
are distributed more equally among the terms of the polynomial. Horner's rule can
be expected to be more accurate when the terms |a_ix^i| decrease rapidly with i, such
as when p(x) is the truncation of a rapidly convergent power series. Of course, this
algorithm requires twice as many multiplications as Horner's method.

5.3. Accounting for the error in forming y, we have, using the relative error counter
notation (3.9),

Thus the relative backward perturbations are bounded by (3n/2 + 1)u instead of
2nu for Homer’s rule.

5.4. Here is a MATLAB M-file to perform the task.


function [a, perm] = leja(a)
%LEJA    LEJA ordering.
%        [A, PERM] = LEJA(A) reorders the points A by the
%        Leja ordering and returns the permutation vector that
%        effects the ordering in PERM.

n = max(size(a));
perm = (1:n)';

% a(1) = max(abs(a)).
[t, i] = max(abs(a));
if i ~= 1
   a([1 i]) = a([i 1]);
   perm([1 i]) = perm([i 1]);
end

p = ones(n,1);
for k = 2:n-1
    for i = k:n
        p(i) = p(i)*(a(i)-a(k-1));
    end
    [t, i] = max(abs(p(k:n)));
    i = i+k-1;
    if i ~= k
       a([k i]) = a([i k]);
       p([k i]) = p([i k]);
       perm([k i]) = perm([i k]);
    end
end
5.5. It is easy to show that the computed p̂(x) satisfies p̂(x) = p(x)(1 + θ_{2n+1}),
|θ_{2n+1}| < γ_{2n+1}. Thus p̂(x) has a tiny relative error. Of course, this assumes that
the roots x_i are known exactly!
6.1. For then, using the Cauchy–Schwarz inequality,

The first inequality is an equality iff |aij| = α, and the second inequality is an
equality iff A is a multiple of a matrix with orthonormal columns. If A is real
and square, these requirements are equivalent to A being a scalar multiple of a
Hadamard matrix. If A is complex and square, the requirements are satisfied by the
given Vandermonde matrix, which is times a unitary matrix.

6.2.

6.3. By the Holder inequality,


(A.3)
We now show that equality is possible throughout (A.3). Let x satisfy ||A|| =
||Ax||/||x|| and let y be dual to Ax. Then
Re y*Ax = y*Ax = ||y||D||Ax|| = ||y||D||A|| ||x||,
as required.
6.4. From (6.19) we have ||Mn||p But by taking x in the
definition (6. 11) to be the vector of all ones, we see that ||Mn||P > µ n .
6.5. If A = PDQ* is an SVD then
||AB||F = ||PDQ*B||F = ||DQ*B||F

= ||A||2||B||F.
Similarly, ||BC||F < ||B||F||C||2, and these two inequalities together imply the re-
quired one.

6.6. By (6.6) and (6.8) it suffices to show that ||A-1||β,α =


We have

6.7. Let λ be an eigenvalue of A and x the corresponding eigenvector, and form the
matrix X = [x, x, ..., x]. Then AX = λX, so |λ| ||X|| = ||AX|| < ||A|| ||X||,
showing that |λ| < ||A||. For a subordinate norm it suffices to take norms in the
equation Ax = λx.
6.8. The following proof is much simpler than the usual proof based on diagonal
scaling to make the off-diagonal of the Jordan matrix small (see, e.g., Horn and John-
son [580, 1985, Lem. 5.6.10]). The proof is from Ostrowski [812, 1973, Thm. 19.3].
Let δ^{-1}A have the Jordan canonical form δ^{-1}A = XJX^{-1}. We can write
where D = diag(λ i ) and the λi are the eigenvalues of A. Then
A = X(D + δN)X-1, so

Note that we actually have ||A|| = ρ(A) + δ if the largest eigenvalue occurs in a
Jordan block of size greater than 1. If A is diagonalizable then with δ = 0 we get
||A|| = ρ(A). The last part of the result is trivial.

6.9. Let A have the SVD A = UΣV*. By the unitary invariance of the 2- and
Frobenius norms, ||A||_2 = ||Σ||_2 = σ_1 and ||A||_F = ||Σ||_F = (σ_1^2 + ··· + σ_n^2)^{1/2}. Thus
||A||_2 < ||A||_F < n^{1/2}||A||_2 (in fact, we can replace n^{1/2} by r^{1/2}, where r = rank(A)).
There is equality on the left when σ_2 = ··· = σ_n = 0, that is, when A has rank 1
(A = xy*) or A = 0. There is equality on the right when σ_1 = ··· = σ_n = α, that
is, when A = αQ where Q has orthonormal columns.

6.10. Let F = PΣQ* be an SVD. Then

But can be permuted into the form diag(Di ), where Di = It is easy


to find the singular values of Di , and the maximum value is attained for σ1 = ||F||2.
6.11 (a)

with equality for x = ek, where the maximum is attained for j = k.


(b)

Equality is attained for an x that gives equality in the Hölder inequality involving
the kth row of A, where the maximum is attained for i = k. Finally, from either
formula,

6.12. Using the Cholesky factorization A = R*R,

6.13. H^T is also a Hadamard matrix, so by the duality result (6.21) it suffices to
show that ||H||_p = n^{1/p} for 1 < p < 2. Since |h_ij| = 1, (6.12) gives ||H||_p > n^{1/p}.
Since ||H||_1 = n and ||H||_2 = n^{1/2}, (6.20) gives ||H||_p < n^{1/p}, and so ||H||_p = n^{1/p}
for 1 < p < 2, as required.

6.14. We prove the lower bound in (6.22) and the upper bound in (6.23); the other
bounds follow on using ||A^T||_p = ||A||_q. First, note that ||A||_p > ||Ae_j||_p = ||A(:, j)||_p,
which gives the lower bound in (6.22). Now assume that A has at most µ nonzeros
per column. Define

D_i = diag(s_i1, . . . , s_in),

and note that

We have

which gives the upper bound in (6.23).

6.15. The lower bound follows from ||Ax||_p/||x||_p < || |A| |x| ||_p/|| |x| ||_p. From (6.12)
we have

By (6.21), we also have || |A| ||_p = || |A^T| ||_q < n^{1-1/q}||A^T||_q = n^{1/p}||A||_p and the
result follows.

6.16. The function v is not a vector norm because does not hold for
all However, and the other two norm conditions
hold, so it makes sense to define the “subordinate norm”. We have

There is equality for xj an appropriate unit vector ej. Hence v(A) = maxj v(A(:,j)).

7.1. It is straightforward to obtain from A(y – x) = ∆b – ∆Ax + ∆A(x – y) the


inequality In general, if B > 0 and ρ(B) < 1
then I – B is nonsingular. Since we can premultiply
by to obtain the bound for |x – y|. For the last part,

7.2. Take norms in r = A(x–y) and x–y = A–lr. The result says that the normwise
relative error is at least as large as the normwise relative residual and possibly K(A)
times as large. Since the upper bound is attainable, the relative residual is not a
good predictor of the relative error unless A is very well conditioned.

7.3. Let DR equilibrate the rows of A, so that B = DRA satisfies |B|e = e. Then

Hence = cond(D R A) = cond(A), which implies cond(A).


The inequality cond(A) < is trivial, and the deduction of (7. 12) is immediate.

7.4. The first inequality is trivial. For the second, since hii = 1 and |hij| < 1 we
have |H| > 1 and < n. Hence

7.5. We have ∆x = A-l ∆b, which yields Now

which yields the result.


If we take k = n we obtain the bound that would be obtained by applying
standard perturbation theory (Theorem 7.2). The gist of this result is that the full
K2(A) magnification of the perturbation will not be felt if b contains a significant
component in the subspace span(U k) for a k such that This latter
condition says that x must be a large-normed solution:

7.6. (a) Use For the


upper bound, use f < |A| |x|.
(b) Use

7.7. We will prove the result for w; the proof for η is entirely analogous. The lower
bound is trivial. Let Then and
(A + ∆A)y = b + ∆b with |∆ A| < |A| and Hence |b| = |(A + ∆A)y - ∆b| <
yielding Thus

which implies the result, by Theorem 7.3.



7.8. We have ∆x = A-l (∆b – ∆Ax) + Therefore cT∆ x = cTA-l (∆b –


∆ Ax) + and so

this inequality being sharp. Hence

The lower bound for follows from the inequalities |c^Tx| = |c^TA^{-1}·Ax| <
|c^TA^{-1}| |A| |x| and |c^Tx| = |c^TA^{-1}b| < |c^TA^{-1}| |b|. A slight modification to the
derivation of (A, x) yields

7.9. (a) For any D1, D2 we have


(A.4)

The rest of the proof shows that equality can be attained in this inequality.
Let x1 > 0 be a right Perron vector of BC, so that BCx1 = πx1, where π =
ρ(BC) > 0. Define
x2 = Cx1, (A.5)
so that x2 > 0 and
Bx2 = πx 1 . (A.6)
(We note, incidentally, that x2 is a right Perron vector of CB: CBx2 = πx 2 .)
Now define
D1 = diag(x_1)^{-1},    D2 = diag(x_2). (A.7)
Then, with e = [1, 1, . . . , 1]^T, (A.6) can be written BD2e = or
π e. Since D1BD2 > 0, this gives similarly, (A.5) can be written
D2e = or = e, which gives Hence for
D1 and D2 defined by (A.7) we have as
required.
Note that for the optimal D1 and D2, D1BD2 and both have the
property that all their row sums are equal.
(b) Take B = |A| and C = |A-l|, and note that
Now apply (a).
(c) We can choose F_1 > 0 and F_2 > 0 so that |A| + tF_1 > 0 and |A^{-1}| + tF_2 > 0
for all t > 0. Hence, using (a),

Taking the limit as t → 0 yields the result, using continuity of eigenvalues.



(d) A nonnegative irreducible matrix has a positive Perron vector, from stan-
dard Perron-Frobenius theory (see, e.g., Horn and Johnson [580, 1985, Chap. 8]).
Therefore the result follows by noting that in the proof of (a) all we need is for D1
and D2 in (A.7) to be defined and nonsingular, that is, for x1 and x2 to be positive
vectors. This is the case if BC and CB are irreducible (since x_2 is a Perron vector
of CB).
(e) Using and it is easy to
show that the results of (a)–(d) remain true with the ∞-norm replaced by the 1-
norm. From it then follows that inf
In fact, the result in (a) holds for any p-norm, though the optimal D1
and D2 depend on the norm; see Bauer [79, 1963, Lem. 1(i)].
7.10. That cannot exceed the claimed expression follows by taking absolute
values in the expression ∆X = –(A^{-1}∆A A^{-1} + A^{-1}∆A ∆X). To show it can equal
it we need to show that if the maximum is attained for ( i, j) = (r,s) then

can be attained, to first order in Equality is attained for ∆A = D1ED2, where


D1 = diag(sign((A^{-1})_{ri})),    D2 = diag(sign((A^{-1})_{is})).
7.11. (a) We need to find a symmetric H satisfying Hy = b – Ay =: r. Consider
the QR factorization

The constraint can be rewritten QTHQ·QTy = QTr, and it is satisfied by QTHQ :=


diag( H, 0 n -2) if we can find H such that

where t =

We can take := (||u||2/||t||2)Q, where Q is either a suitably chosen


Householder matrix, or the identity if t is a positive multiple of u. Then

and
(b) We can assume, without loss of generality, that
|y1| < |y2| < · · · < |yn|. (A.8)
Define the off-diagonal of H by h_ij = h_ji = g_ij for j > i, and let h_11 = g_11. The ith
equation of the constraint Hy = Gy will be satisfied if

(A.9)

If yi =0 set hii = 0; then (A.9) holds by (A.8). Otherwise, set

(A.10)

which yields

by the diagonal dominance.


(c) Let D = and A =: Then = 1 and < 1 for
i j. The given condition can be written where
= Dy and = D-lb. Defining as in the proof of (b), we find from (A. 10) that
so that < (2n - 1) With H := we
find that H = HT, (A + H)y = b, and |H| < (2n – 1)
7.12. By examining the error analysis for an inner product (33.1), it is easy to see
that for any x

Hence satisfies
(A.11)
which is used in place of (7.26) to obtain the desired bound.
7.13. (a) Writing A-1 = we have

By the independence of the and δι.

Hence

(b) A traditional condition number would be based on the maximum of


over all “perturbations to the data of relative size σ". The expression condexp(A, b)
uses the “expected perturbation” rather than the worst case.
(c) For eij = ||A||2 and fi = ||b||2 we have

so

But

Hence

This inequality shows that when perturbations are measured normwise there is little
difference between the average and worst-case condition numbers.
7.14. We have

Note that this result gives a lower bound for the optimal condition number, while
Bauer’s result in Problem 7.9(c) gives an upper bound. There is equality for diagonal
A, trivially. For triangular A there is strict inequality in general since the lower
bound is 1!
8.1. Straightforward.
8.2. Let

Then

As This 3 x 3 example can be extended to


an n x n one by padding with an identity matrix.
8.3. The bound follows from Theorem 8.9, since
Using we get a similar result to that given by Theo-
rem 8.7.
8.4. Assume T is upper triangular, and write T = D – U, where D = diag( tii ) > 0
and U > 0 is strictly upper triangular. Then, using the fact that (D-1 U)n = 0,

Now 0 < b = TX = (D – U)x, so Dx > Ux, that is, x > D-l Ux. Hence

which gives the result, since x = T-1b > 0.

8.5. Use (8.3).

8.7. (a) Write

where Ab = y. Let |bk| = From

it follows that

and hence that


(b) Apply the result of (a) to AD and use the inequality
(c) If A is triangular then we can achieve any diagonal dominance factors β_i we
like, by appropriate column scaling. In particular, we can take β_i = 1 in (b), which
requires M(A)d = e, where d = De. Then so the
bound is as required.

8.8. (a) Using the formula det(I + xy^T) = 1 + y^Tx we have det
Hence we take
if otherwise there is no αij that makes A + singular.
It follows that the “best” place to perturb A to make it singular (the place that
gives the smallest αij ) is in the (s, r) position, where the element of largest absolute
value of A–1 is in the (r, s) position.
(b) The off-diagonal elements of are given by Hence,
using part (a), is singular, where α = –2^{2–n}. In fact, T_n is also made
singular by subtracting 2^{1–n} from all the elements in the first column.

8.9. Here is Zha’s proof. If s = 1 the result is obvious, so assume s < 1. Define the
n-vectors

and

and let It is easy to check that



which shows that σ is a singular value of U_n(θ). With σ_i denoting the ith largest
singular value,

Now we prove by induction that For n = 2 it is easy to check by


direct computation that Using the interlacing property of the
singular values [470, 1989, §8.3.1] and the inductive assumption, we have

Therefore
8.10. For a solver of this form, it is not difficult to see that

where denotes fi with all its coefficients replaced by their absolute values, and
where (M(T), |b| ) is a rational expression consisting entirely of nonnegative terms.
This is the required bound expressed in a different notation. An example (admit-
tedly, a contrived one) of a solver that does not satisfy (8.20) is, for a 2 x 2 lower
triangular system LX = b,

9.1. The proof is by induction. Assume the result is true for matrices of order n – 1,
and suppose

is a unique LU factorization. Then MV = C – la^T has a unique LU factorization,


and so by the inductive hypothesis has nonsingular leading principal submatrices
of order 1 to n – 2. Thus v_ii ≠ 0, i = 1:n – 2. If α = 0 then b = 0 and l is
arbitrary subject to C – la^T having an LU factorization. But C – (l + ∆l)a^T has
nonsingular leading principal submatrices of order 1 to n – 2 for sufficiently small
∆ l (irrespective of a), so has an LU factorization for sufficiently small ∆l. Thus if
α = 0 we have a contradiction to the uniqueness of the factorization. Hence α ≠ 0,
which completes the proof.
9.2. A(σ) fails to have an LU factorization without pivoting only if one of its
leading principal submatrices is singular, that is, if for k ∈
{1, 2, . . . , n – 1}. There are thus 1 + 2 + · · · + (n – 1) = ½(n – 1)n “danger” values
of σ, which may not all be distinct.
9.3. If 0 ∉ F(A) then all the principal submatrices of A must be nonsingular, so
the result follows by Theorem 9.1. Note that A may have a unique LU factorization
even when 0 is in the field of values, as shown by the matrices

so the implication is only one way. Note also that 0 ∉ F(A) iff e^{iθ}A has positive
definite Hermitian part for some real θ (Horn and Johnson [581, 1991, Thm. 1.3.5]).

9.4. The changes are minor. Denoting by and the computed permutations,
the result of Theorem 9.3 becomes

and that of Theorem 9.4 becomes

9.5

9.6. Since A is nonsingular and every submatrix has a nonnegative determinant,


the determinantal inequality shows that det (A (1: p, 1: p)) > 0 for p = 1: n –1, which
guarantees the existence of an LU factorization (Theorem 9.1). That the elements
of L and U are nonnegative follows from (9.2).
GE computes since
Thus For i
for j Thus 0 for all i, j, k
and hence 1. But = 1.

9.7. The given fact implies that JAJ is totally nonnegative. Hence it has an
LU factorization JAJ = LU with L > 0 and U > 0. This means that A =
(JLJ)(JUJ) is an LU factorization, and |JLJ| |JUJ| = LU = JAJ = |A|.

9.8. We start with Theorem 9.4, and so need to bound Now


Hence

implying

Hence (with the same caveat as mentioned after Theorem 9.5)

9.9. By inspecting the equations in Algorithm 9.2 we see that the computed LU
factors satisfy

Since = b, we have, writing α = and using the Sherman–Morrison


formula (Problem 24.2),

So

The error x – is a rational function of and is zero if xj = 0, but it will typically


be of order

9.10. α(B) = α(A), and B-l = so (B) = (A). Hence (B) =


Taking A = Sn g(2n) > = n +1.
9.11. First, the size or accuracy of the pivots is not the fundamental issue. The error
analysis shows that instability corresponds to large elements in the intermediate
matrices or in L or U. Second, in PAQ = LU, is an element of A-1 (see
(9.10)), so it is not necessarily true that the pivoting strategy can force unn to be
small. Third, from a practical point of view it is good to obtain small final pivots
because they reveal near singularity of the matrix.
9.12. Because the rows of U are the pivot rows, µ j is the maximum number of times
any element in the jth column of A was modified during the reduction. Since the
multipliers are bounded by τ^{-1}, the bound for max_i |a_ij^{(k)}| follows easily. Thus
ρ_n < (1 + τ^{-1})^{max_j µ_j}.
10.1. Observe that if z = αe_i – e_j, where e_i is the ith unit vector, then for any α
we have, using the symmetry of A,

Since the right-hand side is a quadratic in α, the discriminant must be negative,
that is, 4a_ij^2 – 4a_ii a_jj < 0, which yields the desired result. This result can also
be proved using the Cholesky decomposition and the Cauchy–Schwarz inequality:
A = R^TR (there is strict inequality for since
The inequality implies that |a_ij| < max(a_ii, a_jj), which shows
that max_{i,j} |a_ij| = max_i a_ii, that is, the largest element of A lies on the diagonal.
10.2. Compute the Cholesky factorization A = R^TR, solve R^Ty = x, then compute
y^Ty. In addition to being more efficient than forming A^{-1} explicitly, this approach
guarantees a nonnegative result.
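In MATLAB the computation might be coded as follows (a sketch only, with arbitrary example data; A must be symmetric positive definite):

A = [4 1; 1 3];  x = [1; 2];           % example data (arbitrary)
R = chol(A);                           % A = R'*R with R upper triangular
y = R' \ x;                            % forward substitution for R'*y = x
q = y'*y                               % q = x'*inv(A)*x, guaranteed nonnegative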
10.3. Let s = c – By Lemma 8.4,

Then Hence

as required.

10.4. A = [α  a^T; a  C], where B = C – aa^T/α. Let y = [1, z^T]^T.

Then 0 < for all and so the discriminant 4(z^Ta)^2 –
4αz^TCz < 0 if z ≠ 0, that is, z^TBz = z^TCz – (z^Ta)^2/α > 0 (since α = a_11 > 0).
This shows that B is positive definite.
By induction, the elimination succeeds (i.e., all pivots are nonzero) and all the
reduced submatrices are positive definite. Hence > 0 for all r. Therefore

Thus a kk = > 0. Since the largest element of a positive definite


matrix lies on the diagonal (Problem 10.1), for any i, j, k there exists r such that

which shows that the growth factor pn = 1 (since pn > 1).

10.5. From (10.8),

so

as required.

10.6. so RZ = 0 and hence AZ = 0. Z is of dimension
n × (n – r) and of full rank, so it spans null(A).

10.7. The inequalities (10.13) follow from the easily verified fact that, for j > k,

10.8. Examples of indefinite matrices with nonnegative leading principal minors are

A necessary and sufficient condition for positive semidefiniteness is that all the principal minors
of A (of which there are 2^n – 1) be nonnegative (not just the leading principal minors);
see, e.g., Horn and Johnson [580, 1985, p. 405] or Mirsky [763, 1961, p. 405] (same
page number in both books!).

10.9. For the matrix Z = we have, from (10.15),


ZTAZ = Sk (A) and so we can take p = Zel, the first column of Z.

10.10. Theorem 10.3 is applicable only if Cholesky succeeds, which can be guaran-
teed only if the (suitably scaled) matrix A is not too ill conditioned (Theorem 10.7).
Therefore the standard analysis is not applicable to positive semidefinite matrices
that are very close to being singular. Theorem 10.14 provides a bound on the resid-
ual after rank(A) stages and, in particular, on the computed Schur complement,
which would be zero in exact arithmetic. The condition of Theorem 10.7 ensures
that all the computed Schur complements are positive definite, so that even if
magnification of errors occurs, it is absorbed by the next part of the Cholesky factor.
The proposed appeal to continuity is simply not valid.
10.11. The analysis in §10.4 shows that for a 2 x 2 pivot E, det(E) < ( α2 – l) for
complete pivoting and det(E) < (α 2 – 1)λ 2 for partial pivoting. Now α2 – 1 < 0 and
µ 0 and λ are nonzero if a 2 x 2 pivot is needed. Hence det(E ) <0, which means that
E has one positive and one negative eigenvalue. Note that if A is positive definite
it follows that all pivots are 1 x 1.
If the block diagonal factor has p+ positive 1 x 1 diagonal blocks, p– negative
1 x 1 diagonal blocks, p0 zero 1 x 1 diagonal blocks, and q 2 x 2 diagonal blocks,
then the inertia is (+, –, 0) = (p+ + q,p- + q,p0).
Denote a 2 x 2 pivot by

and consider partial pivoting. We know det(E) = ac – b2 < 0 and |b| > |a|, so the
formula det(E) = [(a/b)c – b]b minimizes the risk of overflow. Similarly, the formula

helps to avoid overflow; this is the formula used in LINPACK’S xSIDI and LAPACK’S
xSYTRI. The same formulae are suitable for complete pivoting because then |b| >
max( |a|, |c| ).
10.12. The partial pivoting strategy simplifies as follows: if |a_11| > α|a_21| use a
1 x 1 pivot a_11, if |a_22| > α|a_12| use a 1 x 1 pivot a_22, else use a 2 x 2 pivot, that is,
do nothing.
10.13. There may be interchanges, because the tests |a11| > αλ and αλ2 < |a11|
can both fail, for example for A = with But there can be no 2 x 2
pivots, as they would be indefinite (Problem 10.11). Therefore the factorization is
PAPT = LDLT for a diagonal D with positive diagonal entries.
10.14. That the growth factor bound is unchanged is straightforward to check. No
2x2 pivots are used for a positive definite matrix because, as before (Problem 10.11),
any 2x2 pivot is indefinite. To show that no interchanges are required for a positive
definite matrix we show that the second test, αλ^2 < |a_11|, is always passed. The
submatrix is positive definite, so a_11 a_rr – a_r1^2 > 0. Hence

as required.

10.15. With partial pivoting the diagonal pivoting method produces the factoriza-
tion, with P = I,

As 0, ||L|| In contrast, complete pivoting yields

and now ||L|| is bounded independently of

10.16. For any nonzero x we have

Putting x1 = we obtain

which shows that S is positive definite.

10.17. (a) The nonsingularity of A follows from the factorization

since G + BH-1 BT is symmetric positive definite.


(b) It suffices to show that the first n + m – 1 leading principal submatrices of
are nonsingular for any permutation matrix But these submatrices are of
the form where AP is a principal submatrix of A and is a permutation
matrix. Any such AP is of the form

where Hp and GP are principal submatrices of H and G, respectively, and so are


positive definite. Thus AP is symmetric quasidefinite and hence nonsingular by (a),
as required.
(c)

so (AS + (AS)T)/2 = diag(H, G), which is symmetric positive definite.



11.1

11.2. The inequality (11.4) yields, with x0 := 0, and dropping the subscripts on the
gi and Gi ,

Now G |F| < where B = can be


assumed to have a modest -norm. Now, using Problem 11.1,

under the conditions of Theorem 11.4. The term Gg can be bounded in a similar
way. The required bound for follows.

12.1. The equations for the blocks of L and U are U11 = A11 and

First, consider the case where A is block diagonally dominant by columns. We prove
by induction that

which implies both the required bounds. This inequality clearly holds for k = 2;
suppose it holds for k = i. We have

where

using the block diagonal dominance for the last inequality. Hence

as required.
The proof for A block diagonally dominant by rows is similar. The inductive
hypothesis is that

and with P defined as before, we have

giving as required.
For block diagonal dominance by columns in the -norm we have
and so block LU factorization is stable. If A is block diagonally
dominant by rows, stability is assured if ||A_{i,i-1}||/||A_{i-1,i}|| is suitably bounded for
all i.
12.2. The block 2 x 2 matrix

is block diagonally dominant by rows and columns in the 1- and ∞-norms for = 1/2,
but is not point diagonally dominant by rows or columns. The block 2 x 2 matrix

is point diagonally dominant by rows and columns but not block diagonally dominant
by rows or columns in the ∞-norm or 1-norm.

12.3. No. A counterexample is the first matrix in the solution to Problem 12.2,
with = 1/2, which is clearly not positive definite because the largest element does
not lie on the diagonal.

12.4. Form (12.2) it can be seen that (A-1)21 = where the Schur
complement S = Hence

S is the trailing submatrix that would be obtained after r – 1 steps of GE. It follows
immediately that ||S|| < pn ||A||.
For the last part, note that ||S-1|| < ||A-l||, because S–l is the (2,2) block of
-1
A , as is easily seen from (12.2).

12.5. The proof is similar to that of Problem 8.7(a). We will show that
1. Let y = and let The kth equation of gives

Hence

which yields in view of the column diagonal dominance, as required.

12.7. We have the block LU factorization

so that
det(X) = det(A) × det(D – CA^{-1}B).
Hence det(X) = det(AD – ACA^{-1}B), which equals det(AD – CB) if C commutes
with A.

13.2. The new bounds have norms in place of absolute values and the constants are
different.

13.3. Immediate from AX – I = A(XA – I)A-1.

13.4. With 0 < « 1, let

A=

Then

while

Hence ||AX – I||/||XA – I|| Note that in this example every element
of AX – I is large.

13.5. We have where Hence


where Similarly,

(A.12)

Hence where

Clearly,

From (A.12), so

The first conclusion is that the approximate left inverse yields the smaller residual
bound, while the approximate right inverse yields the smaller forward error bound.
Therefore which inverse is “better” depends on whether a small backward error or
a small forward error is desired. The second conclusion is that neither approximate
inverse yields a componentwise backward stable algorithm, despite the favorable
assumptions on and Multiplying by an explicit inverse is simply not a good
way to solve a linear system.
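The point is easily observed in a small MATLAB experiment (for illustration only; the test matrix below is an arbitrary choice, not one from the text):

n = 100;
A = randn(n)*diag(2.^(-(0:n-1)/4));     % a mildly ill-conditioned matrix
b = randn(n,1);
x1 = inv(A)*b;                          % solve via the explicit inverse
x2 = A\b;                               % solve via LU factorization
eta1 = norm(b-A*x1)/(norm(A)*norm(x1)+norm(b));
eta2 = norm(b-A*x2)/(norm(A)*norm(x2)+norm(b));
disp([eta1 eta2])                       % eta1 is typically the larger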
13.6. Here is a hint: notice that the matrix on the front cover of the LAPACK
Users’ Guide has the form

13.7. If the ith row of A contains all 1s then simply sum the elements in the ith
row of the equation AA–1 = I.
13.8. (A + iB)(P + iQ) = I is equivalent to AP – BQ = I and AQ + BP = 0, or

[A  –B; B  A][P; Q] = [I; 0],

so X^{-1} is obtainable from the first n columns of Y^{-1}. The definiteness result follows
from
(x + iy)*(A + iB)(x + iy) = x^T(Ax – By) + y^T(Ay + Bx)
+ i[x^T(Ay + Bx) – y^T(Ax – By)]

where we have used the fact that A = A^T and B = –B^T. Doubling the dimension
(from X to Y) multiplies the number of flops by a factor of 8, since the flop count
for inversion is cubic in the dimension. Yet complex operations should, in theory,
cost between about two and eight times as much as real ones (the extremes being
for addition and division). The actual relative costs of inverting X and Y depend
on the machine and the compiler, so it is not possible to draw any firm conclusions.
Note also that Y requires twice the storage of X.
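A MATLAB sketch of the real formulation (illustration only, with arbitrary data; A and B denote the real and imaginary parts of X):

A = [3 1; 0 2];  B = [1 -1; 2 0];      % real and imaginary parts (arbitrary data)
n = size(A,1);
Y = [A -B; B A];
Z = Y \ [eye(n); zeros(n)];            % first n columns of inv(Y)
P = Z(1:n,:);  Q = Z(n+1:2*n,:);
Xinv = P + 1i*Q;                       % inv(A + 1i*B), up to roundoff
disp(norm(Xinv - inv(A + 1i*B)))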
13.10. As in the solution to Problem 8.8, we have
det
If (A^{-1})_{ji} = 0, this expression is independent of α, and hence det(A) is independent

of aij. (This result is clearly correct for a triangular matrix. ) That det(A) can be
independent of aij shows that det(A), or even a scaling of it, is an arbitrarily poor
measure of conditioning, for A + αe i approaches a multiple of a rank-1 matrix as

13.11. That Hadamard’s inequality can be deduced from QR factorization was


noted by Householder [586, 1958, p. 341]. From A = QR we have so
that

Hadamard’s inequality follows since |det(A)| = |det(R)| = |r11 . . . rnn|. There is


equality when for k = 1: n, that is, when ak = αkqk, k = 1: n, for
some scalars αk. In other words, there is equality when R in the QR factorization
is diagonal.
13.12. (a) Straightforward. (b) For For the Pei matrix,

13.13. (a) The geometric mean of is

Since the geometric mean does not exceed the arithmetic mean,

which gives the required bound. (b) is trivial.


13.14. The key observation is that for a triangular system the
computed solution from substitution satisfies

This result holds for any order of evaluation and is proved (for any particular order
of evaluation) using a variation on the proof of Lemma 8.2 in which we do not divide
through by the product of 1 + δ_i terms. Using (A.13) we have

where , and But, since diag(∆T) = 0,


det(T) = det(T + ∆T), hence

where The conclusion is that where

If then

Thus

A diagonal similarity therefore has no effect on the error bound and so there is no
point in scaling H before applying Hyman’s method.

13.15. For we have, for any i ε {1,2, . . . ,n},

where Aij denotes the submatrix of A obtained by deleting row i and column j. A
definition of condition number that is easy to evaluate is

14.3 Straightforward, since L is unit lower triangular with |lij| < 1.

14.6 Let D = diag(dj), where d1 = 1 and

Then T = DA is tridiagonal, symmetric and irreducible. By applying Theorem 14.9


and using symmetry, we find that

There is one degree of freedom in the vectors x and y, which can be expended by
setting xl = 1, say.

15.1. The equation would imply, on taking traces, 0 = trace(I), which is false.

15.2. It is easily checked that the differential equation given in the hint has the
solution Z(t) = e^{At}Ce^{Bt}. Integrating the differential equation between 0 and ∞
gives, assuming that

Hence – dt satisfies the Sylvester equation. For the


Lyapunov equation, the integral – exists under the assumption
on A, and the corresponding quadratic form is easily seen to be positive.

15.3. Since

XT is a minimizer if X is. If X + XT = 0 then X is a skew-symmetric minimizer.


Otherwise, note that the minimization problem is equivalent to finding the right
singular vector corresponding to the smallest singular value σ min of P = I
Since vec(X) and vec(XT) (suitably normalized) are both singular vectors
corresponding to σmin, so is their sum, vec(X + XT) 0. Thus the symmetric
matrix X + XT is a minimizer.
Byers and Nash [176, 1987] investigate conditions under which a symmetric
minimizer exists.
15.4. xLASY2 uses Gaussian elimination with complete pivoting to solve the 2 x 2
or 4 x 4 linear system that arises on converting to the Kronecker product form
(15.2). For complete pivoting on a 4 x 4 matrix the growth factor is bounded by
4 (§9.3), versus 8 for partial pivoting. The reasons for using complete pivoting
rather than partial pivoting are that, first, the small increase in cost is regarded as
worthwhile for a reduction in the error bound by a factor of 2 and, second, complete
pivoting reveals the rank better than partial pivoting, enabling better handling of
ill-conditioned systems [38, 1993, p. 78].
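For a small Sylvester equation AX + XB = C, the Kronecker product conversion referred to here can be sketched in MATLAB as follows (an illustration only, with arbitrary data; whether this matches the conventions of (15.2) exactly should be checked against that equation):

m = 2;  n = 2;
A = [1 2; 0 3];  B = [4 1; 1 2];  C = [1 0; 0 1];   % arbitrary example data
K = kron(eye(n), A) + kron(B.', eye(m));  % coefficient matrix acting on vec(X)
X = reshape(K \ C(:), m, n);
disp(norm(A*X + X*B - C))                 % residual should be tiny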
16.1. Since ρ(B) < 1, a standard result (Problem 6.8) guarantees the existence of a
consistent norm ||·||_p for which ||B||_p < 1. The series
(1 - ||B||_p)^{-1} is clearly convergent, and so, by the equivalence of norms,
is convergent for any norm.
Since the convergence of ensures that of
(The convergence of can also be proved directly using the
Jordan canonical form.)
16.2. (a) We have
(A.14)
If then

If so that then, using (A.14),

(b) We break the proof into two cases. First, suppose that
for some m. By part (a), the same inequality holds for all k > m and we are
finished. Otherwise, for all k. By part (a) the positive
sequence ||xk – a|| is monotone decreasing and therefore it converges to a limit l.
Suppose that l = where δ > 0. Since β < 1, for some k we must have

By (A.14),

which is a contradiction. Hence l = α/(1 – β), as required.


(c) The vectors ek can be regarded as representing the rounding errors on the
kth iteration. The bound in (b) tells us that provided the error vectors are bounded
by α, we can expect to achieve a relative error no larger than α/(1 – β). In other
words, the result gives us an upper bound on the achievable accuracy. When this
result is applied to stationary iteration, β is a norm of the iteration matrix; to make
β less than 1 we may have to use a scaled norm of the form ||A|| := ||XAX^{-1}||.

17.1. Let Xn be the n x n version of the upper triangular matrix

where D = diag so that ||Xn (:,j)||2 = 1, j = l:n. From (8.3),


is the obvious n x n analogue of

Let A = diag(0,...,0, λ), with |λ| < 1. Then

But

Therefore There is still the question of


whether Xn is optimally scaled. We can perturb the zero eigenvalues of A to distinct,
sufficiently small numbers so that ||A^k||_2 is essentially unchanged and so that the
only freedom in X_n is a diagonal scaling. Since X_n has columns of unit 2-norm,
min_{F=diag(f_i)} κ_2(X_nF) > n^{–1/2}κ_2(X_n) (Theorem 7.5), so even for the optimally
scaled X_n the bound can be arbitrarily poor.

17.2. A simple example is

for which λ(A) = {–1, 1, 3} and λ(|A|) = {–1, 2 ± (4α+1)^{1/2}}, so ρ(|A|)/ρ(A)



18.1. For the Householder matrix I – (2/v^Tv)vv^T there is an eigenvalue –1 with
eigenvector v, and there are n – 1 eigenvalues 1, with eigenvectors any basis for
span(v)^⊥. A Givens matrix G(i, j, θ) has eigenvalues e^{±iθ}, together with n – 2
eigenvalues 1.
18.2. Straightforward manipulation shows that a bound holds of the form cnu +
O(u^2), where c is a constant of order 10.
18.3. We must have x*Px = x*y, so x*y must be real. Also, we need x*x = y*y
and x ≠ y. Then, with v = x – y, v*v = 2v*x, so Px = y.
LAPACK uses modified Householder matrices P = I – τvv* that are unitary
but not Hermitian (τ is not real). The benefit of this approach is that such a P
can always be chosen so that Px = αe1, with α real (for this equation to hold for
a genuine Householder matrix P it would be necessary that x1 be real). For more
details see Lehoucq [698, 1994].
18.4. False. det(P) = – 1 for any Householder matrix and det(G) = 1 for any
Givens matrix, so det(Q) = 1. Moreover, whereas P is generally a full, symmetric
matrix, Q has some zero entries and is generally nonsymmetric.
18.5. The inequalities follow from the fact that, in the notation of (18.1), ||xk||2 =
|rkk| and ||ck(:,j)|| = the latter equality being a consequence of the
invariance of the 2-norm under orthogonal transformations. The last part follows
because QR factorization with column pivoting on A is essentially equivalent to
Cholesky factorization with complete pivoting applied to ATA.
18.6. If y = |W|x and W comprises r disjoint rotations then, in a rather loose
notation,

18.7. Straightforward. This problem shows that the CGS and MGS methods
correspond to two different ways of representing the orthogonal projection onto
span{q1, . . . ,qj}.
18.8. Assume, without loss of generality, that ||a1||2 < ||a2||2. If E is any matrix
such that A + E is rank deficient then

We take E = [el, 0], where el is chosen so that = 0 and al + el = αa2, for


some α. From Pythagoras’s theorem we have that ||el||2 = tan θ ||al||2, and so

Together with the trivial bound this yields


the result.

18.9. We find that

For for i j, but for It is easy to see


that = 1 implies and that Hence for
showing, as expected, that the loss of orthogonality for MGS
is bounded in terms of k2(A)u

18.10. P is the product Pn . . . P1, where Pk is defined in (18.20). Since


= 0, we have

Hence

This is of the required form, with Q = [q1,.. ., qn] the matrix from the MGS method
applied to A.

18.11. With Q defined as in the hint,

∆A = QR – A = (Q – P_21)R + ∆A_2
   = (VW^T – P_21)R + ∆A_2
   = V(I – S)W^TR + ∆A_2
   = V(I + S)^{-1}C^2 W^TR + ∆A_2
   = V(I + S)^{-1}W^T · WCU^T · UCW^T · R + ∆A_2

and the result follows.

18.12. To produce the Householder method’s P we have to explicitly form the


product of the individual Householder transformations. As long as this is done in
a stable way the computed P is guaranteed to be nearly orthogonal. MGS’s Q is
formed in an algebraically different way, and the rounding errors in its formation are
different from those in the formation of P; in other words, Q is not a submatrix of
P. Consequently, there is no reason to expect Q to be nearly orthonormal. Further
insight can be gained by a detailed look at the structure of the computed P; see
Björck and Paige [119, 1992, §4].

18.13. It is straightforward to show that A^TA – I = (A – U)^T(A + U). Taking
norms gives the lower bound. Since A + U = U(H + I) we have, from the previous
relation,
(A – U)^TU = (A^TA – I)(H + I)^{-1}.
Hence

In fact, the result holds for any unitarily invariant norm (but the ||A||2 + 1 in the
denominator must be retained).

19.1. One approach is to let x be a solution and y an arbitrary vector, and consider
f (α) = ||A(x + αy) – b|| By expanding this expression it can be seen that if the
normal equations are not satisfied then α and y can be chosen so that f(α) < f(0).
Alternatively, note that for f(z) = (b – Ax)T(b – Ax) = xTATAx – 2bTAx + bTb
we have f(x) = 2(ATAx – ATb) and 2f(x) = 2ATA, so any solution of the
normal equations is a local minimum of f, and hence a global minimum since f is
quadratic. The normal equations can be written as (b – Ax)TA = 0, which shows
that the residual b – Ax is orthogonal to range(A).

19.2. A = Q [~]. The computed upper triangular QR factor satisfies A +


∆A = with |∆A| < mnγ cmG1|Al and ||G1||F = 1, by Theorem 18.4. BY
Lemma 18.3 the computed transformed right-hand side satisfies = QT(b + ∆b),
with |∆b| < mnγ cmG1 |b|.
By Theorem 8.5, the computed solution to the triangular system
satisfies

So exactly solves the LS problem

which is equivalent to the LS problem, on premultiplying by Q,

We have

where G > G1 and ||G||F = 1. The normwise bounds are proved similarly.

19.3. A straightforward verification.



19.4. Notice that

which is minimized if ||Ax_i – e_i||_2 is minimized for i = 1: m. Thus we have m
independent LS problems. The solution is x_i = A^+e_i, i = 1: m, that is, X =
A^+I_m = A^+. This is the unique solution only if A has full rank, but it is always the
unique minimum Frobenius norm solution.

19.5. By Theorem 18.12 there is an orthonormal matrix [W1, wn+l]


such that

with The computed solution to


satisfies

Therefore exactly solves the LS problem

which, on premultiplying by [W1, wn+1], is equivalent to the LS problem

We have

where and = 1. Normwise bounds on and follow similarly.

19.6. Subtracting from we have

by (19.12).

Taking norms gives the result.



19.7. By constructing a block LU factorization it is straightforward to show that


det(C(α) – λI) = (α – λ)m-n det(ATA + λ(α – λ)).
Hence the eigenvalues of C(α) are λ = α (m – n times) together with the solutions
of λ(α – λ) = namely,
Now

so the minimum condition number must occur when This


gives for which

The lower bound for the maximum is achieved for

19.8. Let y = 0 and ||(A + ∆A)y – b||_2 = min. If b = 0 then we can take
∆A = 0. Otherwise, the normal equations tell us that (A + ∆A)^Tb = 0, so ||∆A||_2 >
||A^Tb||_2/||b||_2. This lower bound is achieved, and the normal equations satisfied, for
∆A^T = –(A^Tb)b^T/b^Tb. Hence

19.9. For the case λ∗ < 0 we have

20.1. Setting the gradient of (20.13) to zero gives A^TAx – A^Tb + c = 0, which
can be written as y = b – Ax, A^Ty = c, which is (20.12). The Lagrangian for
(20.14) is L(y, x) = ½(y – b)^T(y – b) + x^T(A^Ty – c). ∇_y L(y, x) = y – b + Ax, and
∇_x L(y, x) = A^Ty – c. Setting the gradients to zero gives the system (20.12).

21.1. The modified version has the same flop count as the original version.
21.4. The summation gives

which implies the desired equality. It follows that all columns of V^{-1} except the
first must have both positive and negative entries. In particular, V^{-1} > 0 is not
possible. The elements of V^{-1} sum to 1, independently of the points α_i (see also
Problem 13.7).

21.5. We have U(i,:)T(α_0, . . . , α_n) = W(i,:)V(α_0, . . . , α_n). But T = LV, where


L has the form illustrated for n = 5 by

and Le = e, so U(i,:)L = W(i,:), or U(i,:)^T = L^{-T}W(i,:)^T. But L^{-T} > 0
by the given formula, so ||L^{-T}||_1 = ||L^{-1}||_∞ = ||L^{-1}e||_∞ = ||e||_∞ = 1, hence
||U(i,:)||_1 < ||W(i,:)||_1.
As an aside, we can evaluate ||L|| as

after a little trigonometric algebra.

21.7. Denote the matrix by C. For the zeros of Tn+1 we have

It follows that giving the result.


Extrema of Tn: we have

CDCT =

Hence B = CD½ satisfies BBT = and so k 2 (B) = K 2 (D) ½ = Then


k2 (C) < k2(D½)k2 (B) = 2.

21.8. The increasing ordering is never produced, since the algorithm must choose
α_1 to maximize |α_1 – α_0|.

21.10. The dual condition number is

See Higham [533, 1987] for proofs.



22.1. Consider the case where m < min(n,p), and suppose n = jm and p = km for
some integers j and k. Then the multiplication of the two matrices can be split into
m x m blocks:

which involves a total of jk multiplications of m x m matrices, each involving O(m^α)
operations. Thus the total number of operations is O(jkm^α), or O(m^{α–2}np), as
required, and we can show similar results for the cases when n and p are the smallest
dimensions.

22.2. ½n3 + n2 multiplications and ³/2n3 + 2n(n – 1) additions.

22.3. For large n = 2^k, S_n(8)/S_n(n) ≈ 1.96 × (7/8)^k and S_n(1)/S_n(8) ≈ 1.79.
The ratio S_n(8)/S_n(n) measures the best possible reduction in the amount of arith-
metic by using Strassen's method in place of conventional multiplication. The ratio
S_n(1)/S_n(8) measures how much more arithmetic is used by recurring down to the
scalar level instead of stopping once the optimal dimension n0 is reached. Of course,
the efficiency of a practical code for Strassen’s method also depends on the various
non-floating-point operations.

22.5. Apart from the differences in stability, the key difference is that Winograd’s
formula relies on commutativity and so cannot be generalized to matrices.

22.7. Some brief comments are given by Douglas, Heroux, Slishman, and Smith
[317, 1994].

22.9. The inverse is

Hence we can form AB by inverting a matrix of thrice the dimension. This result
is not of practical value, but it is useful for computational complexity analysis.
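A quick MATLAB check of this device (assuming the block matrix in question is [I A 0; 0 I B; 0 0 I], whose inverse is [I -A A*B; 0 I -B; 0 0 I]; the data below are an arbitrary example):

n = 3;
A = magic(n);  B = pascal(n);
I = eye(n);  O = zeros(n);
M = [I A O; O I B; O O I];
Minv = inv(M);
disp(norm(Minv(1:n, 2*n+1:3*n) - A*B))   % the (1,3) block of inv(M) is A*B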

24.2. With n = 3 and almost any starting data, the backward error can easily be
made of order 1, showing that the method is unstable. However, the backward error
is found to be roughly of order so the method may have a backward error
bound proportional to (this is an open question).

25.1. (b) An optimizing compiler might convert the test xp1 > 1 to x+1 > 1 and
then to x > 0. (For a way to stop this conversion in Fortran, see the solution to
Problem 25.3.) Then the code would compute a number of order 2^{e_min} instead of a
number of order 2^{–t}.

25.3. The algorithm is based on the fact that the positive integers that can be
exactly represented are 1, 2, . . . , β^t and

In the interval [β^t, β^{t+1}] the floating point numbers are spaced β apart. This interval
must contain a power of 2, a = 2^k. The first while loop finds such an a (or, rather,
the floating point representation of such an a) by testing successive powers 2^i to see
if both 2^i and 2^i + 1 are representable. The next while loop adds successive powers
of 2 until the next floating point number is found; on subtracting a the base β is
produced. Finally t is determined as the smallest power of β for which the distance
to the next floating point number exceeds 1.
The routine can fail for at least two reasons. First, an optimizing compiler might
simplify the test while (a+l) - a == 1 to while 1 == 1, thus altering the meaning
of the program. Second, in the same test the result (a+l) might be held in an
extra length register and the subtraction done in extra precision. The computation
would then reflect this higher precision and not the intended one. We could try
to overcome this problem by saving the result a+l in a variable, but an optimizing
compiler might undo this effort by dispensing with the variable and storing the
result in a register. In Fortran, the compiler’s unwanted efforts can be nullified by
a test of the form while foo (a+l) - a == 1, where foo is a function that simply
returns its argument. The problems caused by the use of extra length registers were
discussed by Gentleman and Marovich [437, 1974]; see also Miller [759, 1984, §2.2].
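A rough MATLAB transcription of the algorithm just described (a sketch only; as noted above, an optimizing compiler or extended precision registers can defeat such tests in a compiled language):

a = 2;
while (a+1) - a == 1, a = 2*a; end     % find a power of 2 with fl(a+1) = a
b = 1;
while (a+b) - a == 0, b = 2*b; end     % add powers of 2 until a is incremented
beta = (a+b) - a;                      % the base
t = 1;  x = beta;
while (x+1) - x == 1, t = t+1; x = x*beta; end   % number of base-beta digits
disp([beta t])                         % in IEEE double precision arithmetic: 2 and 53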

25.4. (This description is based on Schreiber [903, 1989].) The random number
generator in matgen repeats after 16384 numbers [668, 1981, p. 19]. The dimension
n = 512 divides the period of the generator (16384 = 512 x 32), with the effect that
the first 32 columns of the matrix are repeated 16 times (512 = 32 x 16), so the
matrix has this structure:

B = rand(512,32);
A = [B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B];

Now consider the first 32 steps of Gaussian elimination. We apply 32 transformations


to A that have the effect, in exact arithmetic, of making B upper trapezoidal. In
floating point arithmetic, they leave a residue of small numbers (about u ≈ 10^{–7}
in size) in rows 33 onward. Because of the structure of the matrix, identical small
numbers occur in each of the 15 blocks of A to the right of the first. Thus, the
remaining (512 – 32) x (512 – 32) submatrix has the same block structure as A (but
with 15 block columns). Hence this process repeats every 32 steps:

after 32 steps the elements drop to 10^{–7};
after 64 steps the elements drop to 10^{–14};
after 96 steps the elements drop to 10^{–21};
after 128 steps the elements drop to 10^{–28};
after 160 steps the elements drop to 10^{–35};
after 192 steps the elements would drop to 10^{–42},

but that is less than the underflow threshold. The actual pivots do not exactly match
the analysis, which is probably due to rank deficiency of one of the submatrices
generated. Also, underflow occurs earlier than predicted, apparently because two
small numbers (both O(10^{–21})) are multiplied in a saxpy operation.

25.5. Let s_i and t_i denote the values of s and t at the start of the ith stage of the
algorithm. Then
(A.15)

If |x_i| > t_i then the algorithm uses the relation

so with s_{i+1} = (t_i/x_i)^2 s_i + 1 and t_{i+1} = |x_i|, (A.15) continues to hold for the next
value of i. The same is true trivially in the case |x_i| < t_i.
This is a one-pass algorithm using n divisions that avoids overflow except, pos-
sibly, on the final stage in forming which can overflow only if ||x||_2 does.
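In MATLAB the algorithm might be written as follows (a sketch; the variables s and t mirror those above, and x is a given vector):

s = 0;  t = 0;                    % invariant: s*t^2 = sum of squares so far
for i = 1:length(x)
   if abs(x(i)) > t
      s = 1 + s*(t/abs(x(i)))^2;
      t = abs(x(i));
   elseif t ~= 0
      s = s + (abs(x(i))/t)^2;
   end
end
nrm = t*sqrt(s);                  % ||x||_2; overflow is possible only here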

25.7. (a) The matrix A := I + Y^2 = I – Y^TY is symmetric positive semidefinite
since ||Y||_2 < 1, hence it has a (unique) symmetric positive semidefinite square root
X satisfying X^2 = A = I + Y^2. The square root X is a polynomial in A (see,
for example, Higham [532, 1987]) and hence a polynomial in Y; therefore X and Y
commute, which implies that (X + Y)^T(X + Y) = (X – Y)(X + Y) = X^2 – Y^2 = I,
as required.

25.8. |x|/3 could underflow to zero, or become unnormalized, when does


not.

Appendix B
Singular Value Decomposition,
M-Matrices

The singular value decomposition is a matrix factorization which can produce


approximations to large arrays.
Cryptanalysis is the task of breaking coded messages.
In this paper, we present an unusual merger of the two in which the
singular value decomposition may aid the cryptanalyst in discovering
vowels and consonants in messages coded in certain
variations of simple substitution ciphers.
— CLEVE B. MOLER and DONALD MORRISON,
Singular Value Analysis of Cryptograms (1983)

then each of the following 50 conditions is equivalent to the statement:


“A is a nonsingular M-matrix”.
— ABRAHAM BERMAN and ROBERT J. PLEMMONS,
Nonnegative Matrices in the Mathematical Sciences (1994)


In this appendix we define the singular value decomposition and M-matrices.

B.1. Singular Value Decomposition


Any matrix A ∈ C^{m×n} has a singular value decomposition (SVD)

A = UΣV*,   Σ = diag(σ_1, σ_2, . . . , σ_p) ∈ R^{m×n},   p = min(m, n),

where σ_1 ≥ σ_2 ≥ · · · ≥ σ_p ≥ 0 and U ∈ C^{m×m}, V ∈ C^{n×n} are both unitary.
The σ_i are the singular values of A and the columns of U and V are the left
and right singular vectors of A, respectively.
For more details on the SVD see Golub and Van Loan [470, 1989, pp. 71-
73], Stewart and Sun [954, 1990, pp. 30-34], and Horn and Johnson [580, 1985,
§7.3], [581, 1991, §3.1]. The history of the SVD is described by Stewart [950,
1993] and Horn and Johnson [581, 1991, §3.0].

B . 2 . M-Matrices
A matrix A ∈ R^{n×n} is an M-matrix if a_ij ≤ 0 for all i ≠ j and all the
eigenvalues of A have nonnegative real part. This is one of many equivalent
definitions [94, 1994, Chap. 6].
An M-matrix may be singular. A particularly useful characterization of a
nonsingular M-matrix is a nonsingular matrix A for which a_ij ≤ 0
for all i ≠ j and A^{-1} has nonnegative elements (written as A^{-1} ≥ 0).
For more information on M-matrices see Berman and Plemmons [94, 1994]
and Horn and Johnson [581, 1991, §2.5].

Appendix C
Acquiring Software

Caveat receptor . . .
Anything free comes with no guarantee!
— JACK DONGARRA and ERIC GROSSE, Netlib mail header


In this appendix we provide information on how to acquire software mentioned


in the book. First, we describe some basic aspects of the Internet.

C. 1. Internet
A huge variety of information and software is available over the Internet, the
worldwide combination of interconnected computer networks. The location of
a particular object is specified by a URL, which stands for “Uniform Resource
Locator”. Examples of URLs are
http://www.netlib.org/index.html
ftp://ftp.netlib.org
The first example specifies a World Wide Web server (http = hypertext
transfer protocol) together with a file in hypertext format (html = hyper-
text markup language), while the second specifies an anonymous ftp site. In
any URL, the site address may, optionally, be followed by a filename that
specifies a particular file. For more details about the Internet see on-line in-
formation, or one of the many books on the subject, such as Krol [674, 1994].

C.2. Netlib
Netlib is a repository of freely available mathematical software, documents,
and databases of interest to the scientific computing community [316, 1987],
[151, 1994]. It includes
l research codes,
l golden oldies (classic programs that are not available in standard li-
braries),
l the collected algorithms of the ACM,
l program libraries such as EISPACK, LINPACK, LAPACK, and MIN-
PACK

l back issues of NA-Digest, a weekly digest for the numerical analysis


community,
l databases of conferences and performance data for a wide variety of
machines.

Netlib also enables the user to download technical reports from certain in-
stitutions, to download software and errata for textbooks, and to search the
SIAM membership list and a “white pages” database.
Netlib can be accessed in several ways.

l Over the World Wide Web. The URL is


http://www.netlib.org/index.html

l Via Xnetlib, an X Windows application that provides interactive access


to netlib.

l By anonymous ftp. The URL is


ftp://ftp.netlib.org
l By electronic mail. For an introduction and master index, send a one-
line email message as follows:
mail [email protected]
send index

Netlib is mirrored at various sites throughout the world.

C.3. M ATLAB
M ATLAB is a commercial program sold by The MathWorks, Inc. It runs
on a variety of platforms. The MathWorks maintains a collection of user-
contributed M-files, which is accessible over the Internet.
For information contact

The MathWorks, Inc.


24 Prime Park Way
Natick, MA 01760-1500
USA
Tel: 5086531415
Fax: 5086532997
email: [email protected]
URL: ftp://ftp.mathworks.com
URL: http://www.mathworks.com

C.4. NAG Library and FTN90 Compiler


The Numerical Algorithms Group (NAG) produces a variety of software prod-
ucts. Relevant to this book are the FTN90 compiler and the NAG Library, a
large numerical program library available in several programming languages.
For information contact

NAG Ltd.
Wilkinson House

Jordan Hill Road


Oxford, OX2 8DR
UK
Tel: +44 1865 511245
Fax: +44 1865 310139
email: [email protected]
URL: http://www.nag.co.uk:70/

NAG has subsidiaries and distributors, whose addresses can be obtained


from the above sources.

Appendix D
Program Libraries

Since the programming is likely to be the


main bottleneck in the use of an electronic computer
we have given a good deal of thought to the
preparation of standard routines of considerable generality for the
more important processes involved in computation.
By this means we hope to reduce the time taken
to code up large-scale computing problems,
by building them up, as it were,
from prefabricated units.
— J. H. WILKINSON, The Automatic Computing Engine at the
National Physical Laboratory (1948)

In spite of the self-contained nature of the linear algebra field,


experience has shown that even here
the preparation of a fully tested set of algorithms
is a far greater task than had been anticipated.
— J. H. WILKINSON and C. REINSCH, Handbook for
Automatic Computation: Linear Algebra (1971)


In this appendix we briefly describe some of the freely available program li-
braries that have been mentioned in this book. These packages are all available
from netlib (see §C.2).

D.1. Basic Linear Algebra Subprograms


The Basic Linear Algebra Subprograms (BLAS) are sets of Fortran primitives
for matrix and vector operations. They cover all the common operations in
linear algebra. There are three levels, corresponding to the types of object
operated upon. In the examples below, x and y are vectors, A, B, C are
rectangular matrices, and T is a square triangular matrix. Names of BLAS
routines are given in parentheses. The leading “x” denotes the Fortran data
type, whose possible values are some or all of

S real
D double precision
C complex
Z complex* 16, or double complex

Level 1: [694, 1979] Vector operations. Inner product: x^Ty (xDOT); y ←
αx + y (xAXPY); vector 2-norm (y^Ty)^{1/2} (xNRM2); swap vectors x ↔ y
(xSWAP); scale a vector x ← αx (xSCAL); and other operations.

Level 2: [313, 1988], [314, 1988] Matrix-vector operations. Matrix times vec-
tor (gaxpy): y ← αAx + βy (xGEMV); rank-1 update: A ← A + αxy^T
(xGER); triangular solve: x ← T^{-1}x (xTRSV); and variations on these.

Level 3: [308, 1990], [309, 1990] Matrix-matrix operations. Matrix multi-
plication: C ← αAB + βC (xGEMM); multiple right-hand side triangular
solve: A ← αT^{-1}A (xTRSM); rank-r and rank-2r updates of a symmetric
matrix (xSYRK, xSYR2K); and variations on these.

The BLAS are intended to be called in innermost loops of linear algebra


codes. Usually, most of the computation in a code that uses BLAS calls is
done inside these calls. LINPACK [307, 1979] uses the level-1 BLAS through-
out (model Fortran implementations of the level-1 BLAS are listed in [307,
1979, App. D]). LAPACK [17, 1995] exploits all three levels, using the highest
possible level at all times.
Each set of BLAS comprises a set of subprogram specifications only. The
specifications define the parameters to each routine and state what the routine
must do, but not how it must do it. Thus the implementor has freedom over
the precise implementation details (loop orderings, block algorithms, special
code for special cases) and even the method (fast versus conventional matrix
multiply), but the implementation is required to be numerically stable, and

code that tests the numerical stability is provided with the model implemen-
tations [309, 1990], [314, 1988].
For more details on the BLAS and the advantages of using them, see
the defining papers listed above, or, for example, [315, 1991] or [470, 1989,
Chap. 1].
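To fix ideas, the following MATLAB fragment shows the operations computed by one representative routine from each level (an illustration only; the BLAS themselves are Fortran subprograms):

n = 4;  alpha = 2;  beta = 0.5;
x = randn(n,1);  y = randn(n,1);
A = randn(n);  B = randn(n);  C = randn(n);
T = triu(randn(n)) + n*eye(n);     % nonsingular upper triangular matrix
s = x'*y;                  % level 1: inner product (xDOT)
y = alpha*x + y;           % level 1: y <- alpha*x + y (xAXPY)
y = alpha*A*x + beta*y;    % level 2: gaxpy (xGEMV)
A = A + alpha*x*y';        % level 2: rank-1 update (xGER)
x = T\x;                   % level 2: triangular solve (xTRSV)
C = alpha*A*B + beta*C;    % level 3: matrix multiplication (xGEMM)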

D.2. EISPACK
EISPACK is a collection of Fortran 66 subroutines for computing eigenvalues
and eigenvectors of matrices [925, 1976], [415, 1977]. It contains 58 subrou-
tines and 13 drivers. The subroutines are the basic building blocks for eigen-
system computations; they perform such tasks as reduction to Hessenberg
form, computation of some or all of the eigenvalues/vectors, and back trans-
formations, for various types of matrix (real, complex, symmetric, banded,
etc.). The driver subroutines provide easy access to many of EISPACK’s ca-
pabilities; they call from one to five other EISPACK subroutines to do the
computations.
EISPACK is primarily based on Algol 60 procedures developed in the
1960s by 19 different authors and published in the journal Numerische Math-
ematik. An edited collection of these papers was published in the Handbook
for Automatic Computation [1102, 1971].

D.3. LINPACK
LINPACK is a collection of Fortran 66 subroutines that analyse and solve
linear equations and linear least squares problems [307, 1979]. The package
solves linear systems whose matrices are general, banded, symmetric indefi-
nite, symmetric positive definite, triangular, or tridiagonal. In addition, the
package computes the QR and singular value decompositions and applies them
to least squares problems. All the LINPACK routines use calls to the level-1
BLAS in the innermost loops; thus most of the floating point arithmetic in
LINPACK is done within the level-1 BLAS.

D.4. LAPACK
LAPACK [17, 1995] was released on February 29, 1992. As the announce-
ment stated, “LAPACK is a transportable library of Fortran 77 subroutines
for solving the most common problems in numerical linear algebra: systems
of linear equations, linear least squares problems, eigenvalue problems, and
singular value problems. It has been designed to be efficient on a wide range
of modern high-performance computers.”

LAPACK has been developed over a period that began in 1987 by a team
of 11 numerical analysts in the UK and the USA. LAPACK can be regarded as
a successor to LINPACK and EISPACK; it has virtually all their capabilities
and much more besides. LAPACK improves on LINPACK and EISPACK in
four main respects: speed, accuracy, robustness, and functionality. It was
designed at the outset to exploit the level-3 BLAS.
Development of LAPACK continues under the auspices of two follow-on
projects, LAPACK 2 and ScaLAPACK. An object-oriented C++ extension
to LAPACK has been produced, called LAPACK++ [311, 1995]. CLAPACK
is a C version of LAPACK, converted from the original Fortran version using
the f2c converter [367, 1990]. ScaLAPACK comprises a subset of LAPACK
routines redesigned for distributed memory parallel machines [206, 1992], [205,
1994]. Other work includes developing codes that take advantage of the careful
rounding and exception handling of IEEE arithmetic [298, 1994]. For more
details of all these topics see [17, 1995].
LAPACK undergoes regular updates, which are announced on the elec-
tronic newsletter NA-Digest. At the time of writing, the current release is
version 2.0, dated September 30, 1994, and the package contains over 1000
routines and over 735,000 lines of Fortran 77 code, including testing and tim-
ing code.
Mark 16 onward of the NAG Fortran 77 Library contains much of LA-
PACK in Chapters F07 and F08.

D.4.1. Structure of LAPACK

The LAPACK routines can be divided into three classes.


The drivers solve a complete problem. The simple drivers have a minimal
specification, while the expert drivers have additional capabilities of inter-
est to the sophisticated user. The computational routines perform individual
tasks such as computing a factorization or reducing a matrix to condensed
form; they are called by the drivers. The auxiliary routines perform relatively
low-level operations such as unblocked factorization, estimating or comput-
ing matrix norms, and solving a triangular system with scaling to prevent
overflow.
The driver and computational routines have names of the form xyyzzz.
The first letter specifies the data type, which is one of S, D, C, and Z. The
second two letters refer to the type of matrix. A partial list of types is as
follows (there are 27 types in all):

BD bidiagonal
GB general band
GE general
GT general tridiagonal
HS upper Hessenberg
OR (real) orthogonal
PO symmetric or Hermitian positive definite
PT symmetric or Hermitian positive definite tridiagonal
SB (real) symmetric band
ST (real) symmetric tridiagonal
SY symmetric
TR triangular (or quasi-triangular)
The last three characters specify the computation performed.
TRF factorize
TRS solve a (multiple right-hand side) linear system using
the factorization
CON estimate 1/κ_1(A) (or compute it exactly when A is
tridiagonal and symmetric positive definite or Her-
mitian positive definite)
RFS apply fixed precision iterative refinement and com-
pute the componentwise relative backward error and
a forward error bound
TRI use the factorization to compute A–1
EQU compute factors to equilibrate the matrix
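Thus, for example, DGETRF computes an LU factorization of a double precision
general matrix, DGETRS uses the factorization to solve a linear system with
one or more right-hand sides, and SPOCON estimates the reciprocal of the
condition number of a single precision symmetric positive definite matrix from
its Cholesky factorization.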
The auxiliary routines follow a similar naming convention, with most of
them having yy = LA.

Appendix E
The Test Matrix Toolbox

The Test Matrix Toolbox is a collection of MATLAB M-files containing test
matrices, routines for visualizing matrices, routines for direct search optimiza-
tion, and miscellaneous routines that provide useful additions to MATLAB's ex-
isting set of functions. There are over 50 parametrized test matrices, which are
mostly square, dense, nonrandom, and of arbitrary dimension. The test ma-
trices include ones with known inverses or known eigenvalues; ill-conditioned
or rank deficient matrices; and symmetric, positive definite, orthogonal, de-
fective, involutory, and totally positive matrices. The visualization routines
display surface plots of a matrix and its (pseudo-) inverse, the field of values,
Gershgorin disks, and two- and three-dimensional views of pseudospectra.
The direct search optimization routines implement the alternating directions
method, the multidirectional search method, and the Nelder–Mead simplex
method (which are described in §24.2).
Among the miscellaneous routines are one for rounding matrix elements
to a specified number of bits and ones that implement the classical and modi-
fied Gram-Schmidt methods and LU factorization without pivoting and with
complete pivoting.
The Test Matrix Toolbox was originally released in 1989, and a version was
published as ACM Algorithm 694 [548, 1991]. The current version, version
3.0, is described in the manual [561, 1995], which should be consulted for
further details.
The Test Matrix Toolbox is available by anonymous ftp from The Math-
Works; the URL is

ftp://ftp.mathworks.com/pub/contrib/linalg/testmatrix

The manual [561, 1995] is testmatrix.ps in the same location.


We summarize the contents of the toolbox in the following tables, which
list the M-files by category, with short descriptions.


Demonstration
tmtdemo Demonstration of Test Matrix Toolbox.

Test Matrices, A–K


augment Augmented system matrix.
cauchy Cauchy matrix.
chebspec Chebyshev spectral differentiation matrix.
chebvand Vandermonde-like matrix for the Chebyshev polynomials.
chow Chow matrix—a singular Toeplitz lower Hessenberg matrix.
circul Circulant matrix.
clement Clement matrix—tridiagonal with zero diagonal entries.
compan Companion matrix.
condex “Counterexamples” to matrix condition number estimators.
cycol Matrix whose columns repeat cyclically.
dingdong Dingdong matrix—a symmetric Hankel matrix.
dorr Dorr matrix—diagonally dominant, ill conditioned,
tridiagonal.
dramadah A (0, 1) matrix whose inverse has large integer entries.
fiedler Fiedler matrix—symmetric.
forsythe Forsythe matrix—a perturbed Jordan block.
frank Frank matrix—ill conditioned eigenvalues.
gallery Famous, and not so famous, test matrices.
gearm Gear matrix.
gfpp Matrix giving maximal growth factor for Gaussian elimination
with partial pivoting.
grcar Grcar matrix—a Toeplitz matrix with sensitive eigenvalues.
hadamard Hadamard matrix.
hanowa A matrix whose eigenvalues lie on a vertical line in the complex
plane.
hilb Hilbert matrix.
invhess Inverse of an upper Hessenberg matrix.
invol An involutory matrix.
ipjfact A Hankel matrix with factorial elements.
jordbloc Jordan block.
kahan Kahan matrix—upper trapezoidal.
kms Kac–Murdock–Szegö Toeplitz matrix.
krylov Krylov matrix.

Test Matrices, L–Z


lauchli Lauchli matrix—rectangular.
lehmer Lehmer matrix-symmetric positive definite.
lesp A tridiagonal matrix with real, sensitive eigenvalues.
lotkin Lotkin matrix.
makejcf A matrix with given Jordan canonical form.
minij Symmetric positive definite matrix min(i, j).
moler Moler matrix-symmetric positive definite.
neumann Singular matrix from the discrete Neumann problem (sparse).
ohess Random, orthogonal upper Hessenberg matrix.
orthog Orthogonal and nearly orthogonal matrices.
parter Parter matrix—a Toeplitz matrix with singular values near π.
pascal Pascal matrix.
pdtoep Symmetric positive definite Toeplitz matrix.
pei Pei matrix.
pentoep Pentadiagonal Toeplitz matrix (sparse).
poisson Block tridiagonal matrix from Poisson’s equation (sparse).
prolate Prolate matrix—symmetric, ill-conditioned Toeplitz matrix.
rando Random matrix with elements – 1, 0, or 1.
randsvd Random matrix with pre-assigned singular values.
redheff A (0,1) matrix of Redheffer associated with the Riemann
hypothesis.
riemann A matrix associated with the Riemann hypothesis.
rschur An upper quasi-triangular matrix.
smoke Smoke matrix-complex, with a “smoke ring”
pseudospectrum.
tridiag Tridiagonal matrix (sparse).
triw Upper triangular matrix discussed by Wilkinson and others.
vand Vandermonde matrix.
wathen Wathen matrix—a finite element matrix (sparse, random
entries).
wilk Various specific matrices devised/discussed by Wilkinson.

Visualization
fv Field of values (or numerical range).
gersh Gershgorin disks.
ps Dot plot of a pseudospectrum.
pscont Contours and colour pictures of pseudospectra.
see Pictures of a matrix and its (pseudo-) inverse.

Decompositions and Factorization


cgs Classical Gram–Schmidt QR factorization.
cholp Cholesky factorization with pivoting of a positive semidefinite
matrix.
cod Complete orthogonal decomposition.
diagpiv Diagonal pivoting factorization with partial pivoting.
ge Gaussian elimination without pivoting.
gecp Gaussian elimination with complete pivoting.
gj Gauss-Jordan elimination to solve Ax = b.
mgs Modified Gram–Schmidt QR factorization.
poldec Polar decomposition.
signm Matrix sign decomposition.

Direct Search Optimization


adsmax Alternating directions direct-search method.
mdsmax Multidirectional search method for direct search optimization.
nmsmax Nelder–Mead simplex method for direct search optimization.

Miscellaneous
bandred Band reduction by two-sided unitary transformations.
chop Round matrix elements.
comp Comparison matrices.
cond Matrix condition number in 1, 2, Frobenius, or ∞-norm.
cpltaxes Determine suitable axis for plot of complex vector.
dual Dual vector with respect to Holder p-norm.
eigsens Eigenvalue condition numbers.
house Householder matrix.
matrix Test Matrix Toolbox information and matrix access by
number.
matsignt Matrix sign function of a triangular matrix.
pnorm Estimate of matrix p-norm (1 ≤ p ≤ ∞).
qmult Pre-multiply by random orthogonal matrix.
rq Rayleigh quotient.
seqa Additive sequence.
seqcheb Sequence of points related to Chebyshev polynomials.
seqm Multiplicative sequence.
show Display signs of matrix elements.
skewpart Skew-symmetric (skew-Hermitian) part.
sparsify Randomly sets matrix elements to zero.
sub Principal submatrix.
symmpart Symmetric (Hermitian) part.
trap2tri Unitary reduction of trapezoidal matrix to triangular form.
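
To give a flavour of how the toolbox routines fit together, the following short
MATLAB session is a minimal sketch only: the function names are those listed
in the tables above, but the argument lists are abbreviated and the default
values of any optional parameters are assumed; the full calling sequences are
given in the manual [561].

    % Minimal sketch; optional arguments take their default values [561].
    A = frank(8);          % Frank matrix, with ill conditioned eigenvalues
    [Q, R] = mgs(A);       % modified Gram-Schmidt QR factorization
    norm(Q'*Q - eye(8))    % loss of orthogonality in the computed Q
    see(A)                 % surface plots of A and its (pseudo-) inverse
    gersh(A)               % Gershgorin disks of A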

Bibliography

[1] Jan Ole Aasen. On the reduction of a symmetric matrix to tridiagonal form.
BIT, 11:233-242, 1971.
[2] Nabih N. Abdelmalek. Round off error analysis for Gram-Schmidt method
and solution of linear least squares problems. BIT, 11 :345–368, 1971.
[3] ACM Turing Award Lectures: The First Twenty Years, 1966–1985. Addi-
son-Wesley, Reading, MA, USA, 1987. xviii+483 pp. ISBN 0-201-54885-2.
[4] Forman S. Acton. Numerical Methods That Work. Harper and Row, New
York, 1970. xviii+541 pp. Reprinted by Mathematical Association of Amer-
ica, Washington, DC, with new preface and additional problems, 1990. ISBN
0-88385-450-3.
[5] Duane A. Adams. A stopping criterion for polynomial root finding. Comm.
ACM, 10:655-658, 1967.
[6] Vijay B. Aggarwal and James W. Burgmeier. A round-off error model with
applications to arithmetic expressions. SIAM J. Comput., 8(1):60-72, 1979.
[7] Alan A. Ahac, John J. Buoni, and D. D. Olesky. Stable LU factorization of
H-matrices. Linear Algebra Appl., 99:97–110, 1988.
[8] J. H. Ahlberg and E. N. Nilson. Convergence properties of the spline fit. J.
Soc. Indust. Appl. Math., 11(1):95-104, 1963.
[9] Paul Halmos by parts (interviews by Donald J. Albers). In Paul Halmos:
Celebrating 50 Years of Mathematics, John H. Ewing and F. W. Gehring,
editors, Springer-Verlag, Berlin, 1991, pages 3–32.
[10] Götz Alefeld and Jürgen Herzberger. Introduction to Interval Computations.
Academic Press, New York, 1983. xviii+333 pp. ISBN 0-12-049820-0.
[11] M. Almacany, C. B. Dunham, and J. Williams. Discrete Chebyshev approx-
imation by interpolating rationals. IMA J. Numer. Anal., 4:467–477, 1984,
[12] Steven C. Althoen and Renate McLaughlin. Gauss-Jordan reduction: A brief
history. Amer. Math. Monthly, 94(2):130-142, 1987.
[13] Fernando L. Alvarado, Alex Pothen, and Robert S. Schreiber. Highly par-
allel sparse triangular solution. In Graph Theory and Sparse Matrix Com-
putations, J. Alan George, John R. Gilbert, and Joseph W. H. Liu, editors,
volume 56 of IMA Volumes in Mathematics and its Applications, Springer-
Verlag, New York, 1993, pages 141-158.


[14] William F. Ames. Numerical Methods for Partial Differential Equations.
Second edition, Academic Press, New York, 1977. xiv+365 pp. ISBN 0-12-
056760-1.
[15] Pierluigi Amodio and Francesca Mazzia. Backward error analysis of cyclic
reduction for the solution of tridiagonal systems. Math. Comp., 62(206):
601-617, 1994.
[16] Andrew A. Anda and Haesun Park. Fast plane rotations with dynamic scal-
ing. SIAM J. Matrix Anal. Appl., 15(1):162–174, 1994.
[17] E. Anderson, Z. Bai, C. H. Bischof, J. W. Demmel, J. J. Dongarra, J. J.
Du Croz, A. Greenbaum, S. J. Hammarling, A. McKenney, S. Ostrouchov,
and D. C. Sorensen. LAPACK Users’ Guide, Release 2.0. Second edition,
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA,
1995. xix+325 pp. ISBN 0-89871-345-5.
[18] Edward Anderson. Robust triangular solves for use in condition estimation.
Technical Report CS-91-142, Department of Computer Science, University
of Tennessee, Knoxville, TN, USA, August 1991. 35 pp. LAPACK Working
Note 36.
[19] T. W. Anderson. The Statistical Analysis of Time Series. Wiley, New York,
1971. xiv+704 pp. ISBN 0-471-02900-9.
[20] T. W. Anderson, I. Olkin, and L. G. Underhill. Generation of random or-
thogonal matrices. SIAM J. Sci. Statist. Comput., 8(4):625–629, 1987.
[21] T. Ando. Totally positive matrices. Linear Algebra Appl., 90:165-219, 1987.
[22] Anonymous. Le Commandant Cholesky. Bulletin Géodésique, pages 159-
161, 1922. Translation by Richard W. Cottle (“Major Cholesky” ) appears in
[721, Appendix] and in NA-Digest, Volume 90, Issue 7, 1990.
[23] Anonymous. James Wilkinson (1919-1986). Ann. Hist. Comput., 9(2):205-
210, 1987. From the introduction: “A series of lightly edited extracts from
messages that were sent over various computer networks during the period
October 5, 1986-February 13, 1987”.
[24] M. Arioli, J. W. Demmel, and I. S. Duff. Solving sparse linear systems with
sparse backward error. SIAM J. Matrix Anal. Appl., 10(2):165–190, 1989.
[25] M. Arioli, I. S. Duff, and P. P. M. de Rijk. On the augmented system approach
to sparse least-squares problems. Numer. Math., 55:667-684, 1989.
[26] M. Arioli and A. Laratta. Error analysis of an algorithm for solving an
underdetermined linear system. Numer. Math., 46:255-268, 1985.
[27] M. Arioli and A. Laratta. Error analysis of algorithms for computing the
projection of a point onto a linear manifold. Linear Algebra Appl., 82:1–26,
1986.
[28] Mario Arioli, Iain S. Duff, and Daniel Ruiz. Stopping criteria for iterative
solvers. SIAM J. Matrix Anal. Appl., 13(1):138–144, 1992.
[29] Mario Arioli and Francesco Romani. Stability, convergence, and conditioning
of stationary iterative methods of the form x(i+1) = Px(i) + q for the solution
of linear systems. IMA J. Numer. Anal., 12:21–30, 1992.

[30] William F. Arnold and Alan J. Laub. Generalized eigenproblem algorithms
and software for algebraic Riccati equations. Proc. IEEE, 72(12):1746–1754,
1984.
[31] R. L. Ashenhurst and N. Metropolis. Error estimation in computer calcula-
tion. Amer. Math. Monthly, 72(2):47–58, 1965.
[32] Edgar Asplund. Inverses of matrices {aij} which satisfy aij = 0 for j > i + p.
Math. Scand., 7:57-60, 1959.
[33] John V. Atanasoff. Computing machine for the solution of large systems
of linear algebraic equations. Unpublished manuscript, Iowa State College,
Ames, IA, USA, August 1940. Reprinted in [860, pp. 315-335].
[34] Owe Axelsson. Iterative Solution Methods. Cambridge University Press, 1994.
xiii+654 pp. ISBN 0-521-44524-8.
[35] Ivo Babuška. Numerical stability in mathematical analysis. In Proc.
IFIP Congress, Information Processing 68, North-Holland, Amsterdam, The
Netherlands, 1969, pages 11-23.
[36] Zhaojun Bai. A collection of test matrices for large scale nonsymmetric
eigenvalue problems (version 1.0). Manuscript, July 1994.
[37] Zhaojun Bai and James W. Demmel. Design of a parallel nonsymmetric
eigenroutine toolbox, Part I. In Proceedings of the Sixth SIAM Conference on
Parallel Processing for Scientific Computing, Volume I, Richard F. Sincovec,
David E. Keyes, Michael R. Leuze, Linda R. Petzold, and Daniel A. Reed,
editors, Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1993, pages 391–398.
[38] Zhaojun Bai and James W. Demmel. On swapping diagonal blocks in real
Schur form. Linear Algebra Appl., 186:73-95, 1993.
[39] Zhaojun Bai, James W. Demmel, and Ming Gu. Inverse free parallel spectral
divide and conquer algorithms for nonsymmetric eigenproblems. Computer
Science Division Report UCB/CSD-94-793, University of California, Berke-
ley, February 1994. 34 pp.
[40] Zhaojun Bai, James W. Demmel, and Alan McKenney. On computing condi-
tion numbers for the nonsymmetric eigenproblem. ACM Trans. Math. Soft-
ware, 19(2):202–223, 1993.
[41] D. H. Bailey and H. R. P. Ferguson. A Strassen-Newton algorithm for high-
speed parallelizable matrix inversion. In Proceedings of Supercomputing ’88,
IEEE Computer Society Press, New York, 1988, pages 419–424.
[42] David H. Bailey. The computation of π to 29,360,000 decimal digits using
Borweins’ quartically convergent algorithm. Math. Comp., 50(181):283-296,
1988.
[43] David H. Bailey. Extra high speed matrix multiplication on the Cray-2.
SIAM J. Sci. Statist. Comput., 9(3):603-607, 1988.
[44] David H. Bailey. Algorithm 719: Multiprecision translation and execution of
FORTRAN programs. ACM Trans. Math. Software, 19(3):288-319, 1993.

[45] David H. Bailey. A Fortran-90 based multiprecision system. Technical Report
RNR-94-013, NASA Ames Research Center, Moffett Field, CA, USA, June
1994. 12 pp.
[46] David H. Bailey, Robert Krasny, and Richard Pelz. Multiple precision, mul-
tiple processor vortex sheet roll-up computation. In Proceedings of the Sixth
SIAM Conference on Parallel Processing for Scientific Computing, Volume
I, Richard F. Sincovec, David E. Keyes, Michael R. Leuze, Linda R. Petzold,
and Daniel A. Reed, editors, Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 1993, pages 52-56.
[47] David H. Bailey, King Lee, and Horst D. Simon. Using Strassen’s algorithm
to accelerate the solution of linear systems. J. Supercomputing, 4:357–371,
1991.
[48] David H. Bailey, Horst D. Simon, John T. Barton, and Martin J. Fouts.
Floating point arithmetic in future supercomputers. Internat. J. Supercom-
puter Appl., 3(3):86-90, 1989.
[49] J. K. Baksalary and R. Kala. The matrix equation AX – YB = C. Linear
Algebra Appl., 25:41-43, 1979.
[50] J. K. Baksalary and R. Kala. The matrix equation AXB+CYD = E. Linear
Algebra Appl., 30:141-147, 1980.
[51] Susanne M. Balle, Per Christian Hansen, and Nicholas J. Higham. A
Strassen-type matrix inversion algorithm for the Connection Machine. Tech-
nical Report CNC/1993/028, Centre for Novel Computing, University of
Manchester, Manchester, England, October 1993. 29 pp.
[52] C. Ballester and V. Pereyra. On the construction of discrete approximations
to linear differential expressions. Math. Comp., 21:297–302, 1967.
[53] Randolph E. Bank and Donald J. Rose. Marching algorithms for elliptic
boundary value problems. I: The constant coefficient case. SIAM J. Numer.
Anal., 14(5):792-829, 1977.
[54] Yonathan Bard. Nonlinear Parameter Estimation. Academic Press, New
York, 1974. x+341 pp. ISBN 0-12-078250-2.
[55] V. Bargmann, D. Montgomery, and J. von Neumann. Solution of linear
systems of high order. Report prepared for Navy Bureau of Ordnance, 1946.
Reprinted in [995, pp. 421-477].
[56] J. L. Barlow. A note on monitoring the stability of triangular decomposition
of sparse matrices. SIAM J. Sci. Statist. Comput., 7(1):166–168, 1986.
[57] Jesse L. Barlow. On the distribution of accumulated roundoff error in floating
point arithmetic. In Proc. 5th IEEE Symposium on Computer Arithmetic,
Ann Arbor, MI, 1981, pages 100-105.
[58] Jesse L. Barlow. Probabilistic Error Analysis of Floating Point and CRD
Arithmetics. Ph.D. thesis, Northwestern University, Evanston, IL, USA, June
1981.

[59] Jesse L. Barlow. Error analysis and implementation aspects of deferred cor-
rection for equality constrained least squares problems. SIAM J. Numer.
Anal., 25(6):1340-1358, 1988.
[60] Jesse L. Barlow. On the discrete distribution of leading significant digits
in finite precision arithmetic. Technical Report CS-88-35, Department of
Computer Science, Pennsylvania State University, University Park, PA, USA,
September 1988. 15 pp.
[61] Jesse L. Barlow. Error analysis of a pairwise summation algorithm to com-
pute the sample variance. Numer. Math., 58:583-590, 1991.
[62] Jesse L. Barlow and E. H. Bareiss. On roundoff error distributions in floating
point and logarithmic arithmetic. Computing, 34:325–347, 1985.
[63] Jesse L. Barlow and E. H. Bareiss. Probabilistic error analysis of Gaussian
elimination in floating point and logarithmic arithmetic. Computing, 34:
349-364, 1985.
[64] Jesse L. Barlow and Susan L. Handy. The direct solution of weighted and
equality constrained least-squares problems. SIAM J. Sci. Statist. Comput.,
9(4):704-716, 1988.
[65] Jesse L. Barlow and Ilse C. F. Ipsen. Scaled Givens rotations for the solution
of linear least squares problems on systolic arrays. SIAM J. Sci. Statist.
Comput., 8(5):716–733, 1987.
[66] Jesse L. Barlow and Udaya B. Vemulapati. A note on deferred correction for
equality constrained least squares problems. SIAM J. Numer. Anal., 29(1):
249-256, 1992.
[67] Jesse L. Barlow and Udaya B. Vemulapati. Rank detection methods for
sparse matrices. SIAM J. Matrix Anal. Appl., 13(4):1279-1297, 1992.
[68] S. Barnett and C. Storey. Some applications of the Lyapunov matrix equa-
tion. J. Inst. Maths Applies, 4:33–42, 1968.
[69] Geoff Barrett. Formal methods applied to a floating-point number system.
IEEE Trans. Software Engrg., 15(5):611-621, 1989.
[70] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Do-
nato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and
Henk van der Vorst. Templates for the Solution of Linear Systems: Building
Blocks for Iterative Methods. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 1994. xiii+112 pp. ISBN 0-89871-328-5.
[71] Anders Barrlund. Perturbation bounds for the LDL H and LU decomposi-
tions. BIT, 31:358–363, 1991.
[72] Anders Barrlund. How integrals can be used to derive matrix perturbation
bounds. Report UMINF 92.11, Institute of Information Processing, Univer-
sity of Umeå, Sweden, September 1992. 8 pp.
[73] D. W. Barron and H. P. F. Swinnerton-Dyer. Solution of simultaneous linear
equations using a magnetic-tape store. Comput. J., 3(1):28–33, 1960.

[74] R. H. Bartels and G. W. Stewart. Algorithm 432: Solution of the matrix
equation AX + XB = C. Comm. ACM, 15(9):820-826, 1972.
[75] Sven G. Bartels. Two topics in matrix analysis: Structured sensitivity for
Vandermonde-like systems and a subgradient method for matrix norm esti-
mation. M.Sc. thesis, Department of Mathematics and Computer Science,
University of Dundee, Dundee, Scotland, September 1991.
[76] Sven G. Bartels and Desmond J. Higham. The structured sensitivity of
Vandermonde-like systems. Numer. Math., 62:17-33, 1992.
[77] Victor Barwell and Alan George. A comparison of algorithms for solving sym-
metric indefinite systems of linear equations. ACM Trans. Math. Software,
2(3):242–251, 1976.
[78] F. L. Bauer. Optimal scaling of matrices and the importance of the minimal
condition. In Proc. IFIP Congress 1962, Cicely M. Popplewell, editor, Infor-
mation Processing 62, North-Holland, Amsterdam, The Netherlands, 1963,
pages 198–201.
[79] F. L. Bauer. Optimally scaled matrices. Numer. Math., 5:73-87, 1963.
[80] F. L. Bauer. Genauigkeitsfragen bei der Lösung linearer Gleichungssysteme.
Z. Angew. Math. Mech., 46(7):409-421, 1966.
[81] F. L. Bauer. Remarks on optimally scaled matrices. Numer. Math., 13:1-3,
1969.
[82] F. L. Bauer. Computational graphs and rounding errors. SIAM J. Numer.
Anal., 11(1):87-96, 1974.
[83] F. L. Bauer and C. Reinsch. Inversion of positive definite matrices by the
Gauss-Jordan method. In Linear Algebra, J. H. Wilkinson and C. Reinsch,
editors, volume II of Handbook for Automatic Computation, Springer-Verlag,
Berlin, 1971, pages 45-49. Contribution 1/3.
[84] F. L. Bauer, J. Stoer, and C. Witzgall. Absolute and monotonic norms.
Numer. Math., 3:257-264, 1961.
[85] Richard M. Beam and Robert F. Warming. The asymptotic spectra of banded
Toeplitz and quasi-Toeplitz matrices. SIAM J. Sci. Comput., 14(4):971-1006,
1993.
[86] Albert E. Beaton, Donald B. Rubin, and John L. Barone. The acceptability
of regression solutions: Another look at computational accuracy. J. Amer.
Statist. Assoc., 71(353):158-168, 1976.
[87] C. Gordon Bell and Allen Newell. Computer Structures: Readings and Ex-
amples. McGraw-Hill, New York, 1971. xix+668 pp. ISBN 07-004357-4.
[88] E. T. Bell. Review of “Contributions to the History of Determinants, 1900-
1920”, by Sir Thomas Muir. Amer. Math. Monthly, 38:161-164, 1931.
Reprinted in [358].
[89] Richard Bellman. Introduction to Matrix Analysis. Second edition, McGraw-
Hill, New York, 1970. xxiii+403 pp. First edition (1960) reprinted by Society
for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1995. ISBN
0-89871-346-3.

[90] Frank Benford, The law of anomalous numbers. Proceedings of the American
Philosophical Society, 78(4):551–572, 1938.
[91] Commandant Benoit. Note sur une méthode de resolution des équation
normales provenant de l’application de la méthode des moindres carrés
á un système d’équations linéaires en nombre inférieur à celui des in-
connues. Application de la méthode à la résolution d’un système défini
d’équations linéaires (Procédé de Commandant Cholesky). B u l l e t i n
Géodésique (Toulouse), 7(1):67-77, 1924. Cited in [22, Cottle’s translation].
[92] N. F. Benschop and H. C. Ratz. A mean square estimate of the generated
roundoff error in constant matrix iterative processes. J. Assoc. Comput.
Mach., 18(1):48-62, 1971.
[93] M. C. Berenbaum. Direct search methods in the optimisation of cancer
chemotherapy. Br. J. Cancer, 61: 101–109, 1991.
[94] Abraham Berman and Robert J. Plemmons. Nonnegative Matrices in the
Mathematical Sciences. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 1994. xx+340 pp. Corrected republication, with
supplement, of work first published in 1979 by Academic Press. ISBN O-
89871-321-8.
[95] Rajendra Bhatia and Kalyan Mukherjea. Variation of the unitary part of a
matrix. SIAM J. Matrix Anal. Appl., 15(3):1007–1014, 1994.
[96] Rajendra Bhatia and Peter Rosenthal. How and why to solve the operator
equation AX – XB = Y. Bull. London Math. Soc., 1996. To appear.
[97] Dario Bini and Grazia Lotti. Stability of fast algorithms for matrix multipli-
cation. Numer. Math., 36:63–72, 1980.
[98] Dario Bini and Victor Y. Pan. Polynomial and Matrix Computations. Volume
1: Fundamental Algorithms. Birkhäuser, Boston, MA, USA, 1994. xvi+415
pp. ISBN 0-8176-3786-9.
[99] Garrett Birkhoff. Two Hadamard numbers for matrices. Comm. ACM, 18
(1):25-29, 1975.
[100] Garrett Birkhoff and Surender Gulati. Isotropic distributions of test matrices.
J. Appl. Math. Phys. (ZAMP), 30:148-158, 1979.
[101] Garrett Birkhoff and Robert E. Lynch. Numerical Solution of Elliptic Prob-
lems. Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1984. xi+319 pp. ISBN 0-89871-197-5.
[102] Garrett Birkhoff and Saunders Mac Lane. A Survey of Modern Algebra.
Fourth edition, Macmillan, New York, 1977. xi+500 pp. ISBN 0-02-310070-
2.
[103] Christian H. Bischof. Incremental condition estimation. SIAM J. Matrix
Anal. Appl., 11(2):312-322, 1990.
[104] Christian H. Bischof, John G. Lewis, and Daniel J. Pierce. Incremental
condition estimation for sparse matrices. SIAM J. Matrix Anal. Appl., 11
(4):644-659, 1990.

[105] Christian H. Bischof and Charles F. Van Loan. The WY representation
for products of Householder matrices. SIAM J. Sci. Statist. Comput., 8(1):
s2–s13, 1987.
[106] Åke Björck. Iterative refinement of linear least squares solutions I. BIT, 7:
257–278, 1967.
[107] Åke Björck. Solving linear least squares problems by Gram-Schmidt orthog-
onalization. BIT, 7:1–21, 1967.
[108] Åke Björck. Iterative refinement of linear least squares solutions II. BIT, 8:
8-30, 1968.
[109] Åke Björck. Comment on the iterative refinement of least-squares solutions.
J. Amer. Statist. Assoc., 73(361):161-166, 1978.
[110] Åke Björck. Stability analysis of the method of seminormal equations for
linear least squares problems. Linear Algebra and Appl., 88/89:31-48, 1987.
[111] Åke Björck. Iterative refinement and reliable computing. In Reliable Nu-
merical Computation, M. G. Cox and S. J. Hammarling, editors, Oxford
University Press, 1990, pages 249-266.
[112] Åke Björck. Least squares methods. In Handbook of Numerical Anal-
ysis, P. G. Ciarlet and J. L. Lions, editors, volume I: Finite Difference
Methods—Solution of Equations in ℝⁿ, Elsevier/North-Holland, Amster-
dam, The Netherlands, 1990.
[113] Åke Björck. Component-wise perturbation analysis and error bounds for
linear least squares solutions. BIT, 31:238-244, 1991.
[114] Åke Björck. Pivoting and stability in the augmented system method. In
Numerical Analysis 1991, Proceedings of the 14th Dundee Conference, D. F.
Griffiths and G. A. Watson, editors, volume 260 of Pitman Research Notes
in Mathematics, Longman Scientific and Technical, Essex, UK, 1992, pages
1-16.
[115] Åke Björck. Numerics of Gram-Schmidt orthogonalization. Linear Algebra
Appl., 197/198:297-316, 1994.
[116] Åke Björck. Numerical Methods for Least Squares Problems. Society for In-
dustrial and Applied Mathematics, Philadelphia, PA, USA, 1996. To appear.
[117] Åke Björck and Tommy Elfving. Algorithms for confluent Vandermonde
systems. Numer. Math., 21:130-137, 1973.
[118] Åke Björck and Gene H. Golub. Iterative refinement of linear least squares
solutions by Householder transformation. BIT, 7:322–337, 1967.
[119] Åke Björck and C. C. Paige. Loss and recapture of orthogonality in the
modified Gram–Schmidt algorithm. SIAM J. Matrix Anal. Appl., 13(1):176–
190, 1992.
[120] Åke Björck and C. C. Paige. Solution of augmented linear systems using
orthogonal factorization. BIT, 34:1–24, 1994.
[121] Åke Björck and Victor Pereyra. Solution of Vandermonde systems of equa-
tions. Math. Comp., 24(112):893-903, 1970.

[122] P. Bjørstad, F. Manne, T. Sørevik, and M. Vajteršic. Efficient matrix multi-
plication on SIMD computers. SIAM J. Matrix Anal. Appl., 13(1):386-401,
1992.
[123] G. Blanch. Numerical evaluation of continued fractions. SIAM Rev., 6(4):
383-421, 1964.
[124] J. H. Bleher, A. E. Roeder, and S. M. Rump. ACRITH: High-accuracy
arithmetic. An advanced tool for numerical computation. In Proceedings
of the 7th Symposium on Computer Arithmetic, Kai Hwang, editor, IEEE
Computer Society Press, Silver Spring, MD, USA, 1985, pages 318-321.
[125] J. H. Bleher, S. M. Rump, U. Kulisch, M. Metzger, Ch. Ullrich, and W. Wal-
ter. A study of a FORTRAN extension for engineering/scientific computation
with access to ACRITH. Computing, 39:93–110, 1987.
[126] B. Bliss, M.-C. Brunet, and E. Gallopoulos. Automatic program instrumenta-
tion with applications in performance and error analysis. In Expert Systems
for Scientific Computing, E. N. Houstis, J. R. Rice, and R. Vichnevetsky,
editors, North-Holland, Amsterdam, The Netherlands, 1992, pages 235–260.
[127] Richard M. Bloch. Mark I calculator. In [507], pages 23-30.
[128] James L. Blue. A portable Fortran program to find the Euclidean norm of a
vector. ACM Trans. Math. Software, 4(1):15–23, 1978.
[129] E. Bodewig. Review of “Rounding-Off Errors in Matrix Processes” by A. M.
Turing. Math. Rev., 10:405, 1949.
[130] Gerd Bohlender. Floating-point computation of functions with maximum
accuracy. IEEE Trans. Comput., C-26(7):621–632, 1977.
[131] Z. Bohte. Bounds for rounding errors in the Gaussian elimination for band
systems. J. Inst. Maths Applies, 16:133–142, 1975.
[132] Daniel Boley, Gene H. Golub, Samy Makar, Nirmal Saxena, and Edward J.
McCluskey. Floating point fault tolerance with backward error assertions.
IEEE Trans. Comput., 44(2):302-311, 1994.
[133] Jo A. M. Bollen. Numerical stability of descent methods for solving linear
equations. Numer. Math., 43:361–377, 1984.
[134] S. Bondeli and W. Gander. Cyclic reduction for special tridiagonal systems.
SIAM J. Matrix Anal. Appl., 15(1):321-330, 1994.
[135] T. Boros, T. Kailath, and V. Olshevsky. Fast algorithms for solving Vander-
monde and Chebyshev–Vandermonde systems. Manuscript, 1994. 22 pp.
[136] J. M. Borwein, P. B. Borwein, and D. H. Bailey. Ramanujan, modular equa-
tions, and approximations to Pi or how to compute one billion digits of Pi.
Amer. Math. Monthly, 96(3):201-219, 1989.
[137] B. V. Bowden. The organization of a typical machine. In Faster than Thought:
A Symposium on Digital Computing Machines, B. V. Bowden, editor, Pit-
man, London, 1953, pages 67–77.

[138] H. J. Bowdler, R. S. Martin, G. Peters, and J. H. Wilkinson. Solution of real
and complex systems of linear equations. Numer. Math., 8:217–234, 1966.
Also in [1102, pp. 93-110], Contribution I/7.
[139] David W. Boyd. The power method for norms. Linear Algebra Appl., 9:
95-101, 1974.
[140] Carl B. Boyer. A History of Mathematics. Wiley, New York, 1968. xv+717
pp. Reprinted by Princeton University Press, Princeton, NJ, USA, 1985.
ISBN 0-691-02391-3.
[141] Jeff Boyle. An application of Fourier series to the most significant digit
problem. Amer. Math. Monthly, 101(11):879-886, 1994.
[142] R. P. Brent. Algorithms for matrix multiplication. Technical Report CS 157,
Department of Computer Science, Stanford University, Stanford, CA, USA,
March 1970. ii+52 pp.
[143] R. P. Brent. Error analysis of algorithms for matrix multiplication and trian-
gular decomposition using Winograd’s identity. Numer. Math., 16:145-156,
1970.
[144] Richard P. Brent. On the precision attainable with various floating-point
number systems. IEEE Trans. Comput., C-22(6):601-607, 1973.
[145] Richard P. Brent. A Fortran multiple-precision arithmetic package. ACM
Trans. Math. Software, 4(1):57-70, 1978.
[146] Richard P. Brent. ALGORITHM 524 MP, a Fortran multiple-precision arith-
metic package. ACM Trans. Math. Software, 4(1):71–81, 1978.
[147] Richard P. Brent, Judith A. Hooper, and J. Michael Yohe. An AUGMENT in-
terface for Brent’s multiple precision arithmetic package. ACM Trans. Math.
Software, 6(2):146-149, 1980.
[148] William L. Briggs and Van Emden Henson. The DFT: An Owner’s Manual
for the Discrete Fourier Transform. Society for Industrial and Applied Math-
ematics, Philadelphia, PA, USA, 1995. xv+434 pp. ISBN 0-89871-342-0.
[149] J. L. Britton, editor. Collected Works of A. M. Turing: Pure Mathematics.
North-Holland, Amsterdam, The Netherlands, 1992. xxii+287 pp. ISBN
0-444-88059-3.
[150] W. S. Brown. A simple but realistic model of floating-point arithmetic. ACM
Trans. Math. Software, 7(4):445–480, 1981.
[151] Shirley V. Browne, Jack J. Dongarra, Stan C. Green, Keith Moore,
Thomas H. Rowan, and Reed C. Wade. Netlib services and resources. Report
ORNL/TM-12680, Oak Ridge National Laboratory, Oak Ridge, TN, USA,
April 1994. 42 pp.
[152] Marie-Christine Brunet. Contribution à la Fiabilité de Logiciels Numériques
et à L‘analyse de Leur Comportement: Une Approche Statistique. Ph.D. the-
sis, Université de Paris IX Dauphine, U.E.R. Mathématiques de la Décision,
January 1989. viii+214 pp.

[153] Marie-Christine Brunet and Françoise Chatelin. CESTAC, a tool for a
stochastic round-off error analysis in scientific computing. In Numerical
Mathematics and Applications, R. Vichnevetsky and J. Vignes, editors, Else-
vier Science Publishers B.V. (North-Holland), Amsterdam, The Netherlands,
1986, pages 11–20.
[154] James L, Buchanan and Peter R. Turner. Numerical Methods and Analy-
sis. McGraw-Hill, New York, 1992. xv+751 pp. ISBN 0-07-008717-2, 0-07-
112922-7 (international paperback edition).
[155] W. Buchholz. Fingers or fists? (The choice of decimal or binary representa-
tion). Comm. ACM, 2(12):3–11, 1959.
[156] W. Buchholz. Origin of the word byte. Ann. Hist. Comput., 3(1):72, 1981.
[157] B. Bukhberger and G. A. Emel’yanenko. Methods of inverting tridiagonal
matrices. U.S. S. R. Computational Math. Math. Phys., 13:10–20, 1973.
[158] James R. Bunch. Analysis of the diagonal pivoting method. SIAM J. Numer.
Anal., 8(4):656-680, 1971.
[159] James R. Bunch. Equilibration of symmetric matrices in the max-norm. J.
Assoc. Comput. Mach., 18(4):566-572, 1971.
[160] James R. Bunch. Partial pivoting strategies for symmetric matrices. SIAM
J. Numer. Anal., 11(3):521-528, 1974.
[161] James R. Bunch. A note on the stable decomposition of skew-symmetric
matrices. Math. Comp., 38(158):475–479, 1982.
[162] James R. Bunch. The weak and strong stability of algorithms in numerical
linear algebra. Linear Algebra Appl., 88/89:49-66, 1987.
[163] James R. Bunch, James W. Demmel, and Charles F. Van Loan. The strong
stability of algorithms for solving symmetric linear systems. SIAM J. Matrix
Anal. Appl., 10(4):494-499, 1989.
[164] James R. Bunch and Linda Kaufman. Some stable methods for calculating
inertia and solving symmetric linear systems. Math. Comp., 31(137):163–179,
1977.
[165] James R. Bunch, Linda Kaufman, and Beresford N. Parlett. Decomposition
of a symmetric matrix. Numer. Math., 27:95–109, 1976.
[166] James R. Bunch and Beresford N. Parlett. Direct methods for solving sym-
metric indefinite systems of linear equations. SIAM J. Numer. Anal., 8(4):
639-655, 1971.
[167] P. Businger and G. H. Golub. Linear least squares solutions by Householder
transformations. Numer. Math., 7:269-276, 1965. Also in [1102, pp. 111-118],
Contribution I/8.
[168] P. A. Businger. Matrices which can be optimally scaled. Numer. Math., 12:
346-348, 1968.
[169] P. A. Businger. Monitoring the numerical stability of Gaussian elimination.
Numer. Math., 16:360–361, 1971.

[170] J. C. Butcher. The Numerical Analysis of Ordinary Differential Equations:
Runge-Kutta and General Linear Methods. John Wiley, Chichester, UK,
1987. xv+512 pp. ISBN 0-471-91046-5.
[171] B. L. Buzbee, G. H. Golub, and C. W. Nielson. On direct methods for solving
Poisson’s equations. SIAM J. Numer. Anal., 7(4):627-656, 1970.
[172] Ralph Byers. A LINPACK-style condition estimator for the equation AX –
XB^T = C. IEEE Trans. Automat. Control, AC-29(10):926-928, 1984.
[173] Ralph Byers. Numerical condition of the algebraic Riccati equation. In Linear
Algebra and its Role in Systems Theory, B. N. Datta, editor, volume 47 of
Contemporary Math., American Mathematical Society, Providence, RI, USA,
1985, pages 35–49.
[174] Ralph Byers. Solving the algebraic Riccati equation with the matrix sign
function. Linear Algebra Appl., 85:267-279, 1987.
[175] Ralph Byers. A bisection method for measuring the distance of a stable
matrix to the unstable matrices. SIAM J. Sci. Statist. Comput., 9:875-881,
1988.
[176] Ralph Byers and Stephen Nash. On the singular “vectors” of the Lyapunov
operator. SIAM J. Alg. Discrete Methods, 8(1):59-66, 1987.
[177] John Caffney. Another test matrix for determinants and inverses. Comm.
ACM, 6(6):310, 1963.
[178] D. Calvetti and L. Reichel. A Chebychev-Vandermonde solver. Linear Alge-
bra Appl., 172:219-229, 1992.
[179] D. Calvetti and L. Reichel. Fast inversion of Vandermonde-like matrices
involving orthogonal polynomials. BIT, 33:473–484, 1993.
[180] S. L. Campbell and C. D. Meyer, Jr. Generalized Inverses of Linear Trans-
formations. Pitman, London, 1979. xi+272 pp. Reprinted by Dover, New
York, 1991. ISBN 0-486-66693-X.
[181] Martin Campbell-Kelly. Programming the Pilot ACE: Early programming
activity at the National Physical Laboratory. Ann. Hist. Comput., 3(2):
133–162, 1981.
[182] Martin Campbell-Kelly. Review of “Alan Turing: The Enigma”, by Andrew
Hodges. Ann. Hist. Comput., 6(2):176–178, 1984.
[183] Claudio Canuto, M. Yousuff Hussaini, Alfio Quarteroni, and Thomas A.
Zang. Spectral Methods in Fluid Dynamics. Springer-Verlag, Berlin, 1988.
xv+567 pp. ISBN 3-540-52205-0.
[184] Wei-Lu Cao and William J. Stewart. A note on inverses of Hessenberg-like
matrices. Linear Algebra Appl., 76:233–240, 1986.
[185] Ole Caprani. Implementation of a low round-off summation method. BIT,
11:271–275, 1971.
[186] B. E. Carpenter and R. W. Doran, editors. A. M. Turing’s ACE Report
of 1946 and Other Papers, volume 10 of Charles Babbage Institute Reprint
Series for the History of Computing. MIT Press, Cambridge, MA, USA,
1986. vii+141 pp. ISBN 0-262-031140.

[187] John W. Carr III. Error analysis in floating point arithmetic. Comm. ACM,
2(5):10-15, 1959.
[188] Russell Carter. Y-MP floating point and Cholesky factorization. Internat.
J. High Speed Computing, 3(3/4):215-222, 1991.
[189] A. Cauchy. Exercices d'Analyse et de Phys. Math., volume 2. Paris, 1841.
Cited in [896].
[190] Françoise Chaitin-Chatelin and Valérie Frayssé. Lectures on Finite Precision
Computations. Society for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 1996. To appear.
[191] Raymond H. Chan, James G. Nagy, and Robert J. Plemmons. Circulant
preconditioned Toeplitz least squares iterations. SIAM J. Matrix Anal. Appl.,
15(1):80-97, 1994.
[192] T. F. Chan and P. C. Hansen. Some applications of the rank revealing QR
factorization. SIAM J. Sci. Statist. Comput., 13(3):727-741, 1992.
[193] Tony F. Chan and David E. Foulser. Effectively well-conditioned linear sys-
tems. SIAM J. Sci. Statist. Comput., 9(6):963–969, 1988.
[194] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for com-
puting the sample variance: Analysis and recommendations. Amer. Statist.,
37(3):242-247, 1983.
[195] Tony F. Chan and John Gregg Lewis. Computing standard deviations: Ac-
curacy. Comm. ACM, 22(9):526–531, 1979.
[196] Shivkumar Chandrasekaran and Ilse C. F. Ipsen. Backward errors for eigen-
value and singular value decompositions. Numer. Math., 68:215–223, 1994.
[197] Shivkumar Chandrasekaran and Ilse C. F. Ipsen. On the sensitivity of solu-
tion components in linear systems of equations. SIAM J. Matrix Anal. Appl.,
16(1):93-112, 1995.
[198] Xiao-Wen Chang and C. C. Paige. New perturbation bounds for the Cholesky
factorization. Manuscript, February 1995. 13 pp.
[199] Bruce W. Char, Keith O. Geddes, Gaston H. Gonnet, Benton L. Leong,
Michael B. Monagan, and Stephen M. Watt. Maple V Library Reference
Manual. Springer-Verlag, Berlin, 1991. xxv+698 pp. ISBN 3-540-97592-6.
[200] Bruce A. Chartres and James C. Geuder. Computable error bounds for direct
solution of linear equations. J. Assoc. Comput. Mach., 14(1):63–71, 1967.
[201] Françoise Chatelin. Eigenvalues of Matrices. Wiley, Chichester, UK, 1993.
xviii+382 pp. ISBN 0-471-93538-7.
[202] Françoise Chatelin and Marie-Christine Brunet. A probabilistic round-off
error propagation model. Application to the eigenvalue problem. In Reliable
Numerical Computation, M. G. Cox and S. J. Hammarling, editors, Oxford
University Press, Oxford, UK, 1990, pages 139-160.
[203] Françoise Chatelin and Valérie Frayssé. Elements of a condition theory for
the computational analysis of algorithms. In Iterative Methods in Linear
Algebra, R. Beauwens and P. de Groen, editors, Elsevier (North-Holland),
Amsterdam, The Netherlands, 1992, pages 15–25.

[204] Denise Chen and Cleve Moler. Symbolic Math Toolbox: User’s Guide. The
MathWorks, Inc., Natick, MA, USA, 1993.
[205] Jaeyoung Choi, Jack J. Dongarra, Susan Ostrouchov, Antoine P. Petitet,
David W. Walker, and R. Clint Whaley. The design and implementation
of the ScaLAPACK LU, QR and Cholesky factorization routines. Report
ORNL/TM-12470, Oak Ridge National Laboratory, Oak Ridge, TN, USA,
September 1994. 26 pp. LAPACK Working Note 80.
[206] Jaeyoung Choi, Jack J. Dongarra, Roldan Pozo, and David W. Walker.
ScaLAPACK: A scalable linear algebra library for distributed memory con-
current computers. Technical Report CS-92-181, Department of Computer
Science, University of Tennessee, Knoxville, TN, USA, November 1992. 8 pp.
LAPACK Working Note 55.
[207] Man-Duen Choi. Tricks or treats with the Hilbert matrix. Amer. Math.
Monthly, 90:301-312, 1983.
[208] Søren Christiansen and Per Christian Hansen. The effective condition number
applied to error analysis of certain boundary collocation methods. J. Comp.
Appl. Math., 54:15-36, 1994.
[209] Eleanor Chu and Alan George. A note on estimating the error in Gaussian
elimination without pivoting. ACM SIGNUM Newsletter, 20(2):2–7, 1985.
[210] King-wah Eric Chu. The solution of the matrix equations AXB - CXD = E
and (YA – DZ, YC – BZ) = (E, F). Linear Algebra Appl., 93:93–105, 1987.
[211] Barry A. Cipra. Computer-drawn pictures stalk the wild trajectory. Science,
241:1162–1163, 1988.
[212] C. W. Clenshaw. A note on the summation of Chebyshev series. M.T.A.C.,
9(51):118-120, 1955.
[213] C. W. Clenshaw and F. W. J. Olver. Beyond floating point. J. Assoc.
Comput. Mach., 31(2):319-328, 1984.
[214] C. W. Clenshaw, F. W. J. Olver, and P. R. Turner. Level-index arithmetic:
An introductory survey. In Numerical Analysis and Parallel Processing, Lan-
caster 1987, Peter R. Turner, editor, volume 1397 of Lecture Notes in Math-
ematics, Springer-Verlag, Berlin, 1989, pages 95–168.
[215] A. K. Cline. An elimination method for the solution of linear least squares
problems. SIAM J. Numer. Anal., 10(2):283-289, 1973.
[216] A. K. Cline, C. B. Moler, G. W. Stewart, and J. H. Wilkinson. An estimate
for the condition number of a matrix. SIAM J. Numer. Anal., 16(2):368-375,
1979.
[217] A. K. Cline and R. K. Rew. A set of counter-examples to three condition
number estimators. SIAM J. Sci. Statist. Comput., 4(4):602-611, 1983.
[218] Alan K. Cline, Andrew R. Conn, and Charles F. Van Loan. Generalizing the
LINPACK condition estimator. In Numerical Analysis, Mexico 1981, J. P.
Hennart, editor, volume 909 of Lecture Notes in Mathematics, Springer-Ver-
lag, Berlin, 1982, pages 73-83.

[219] R. E. Cline and R. J. Plemmons. l2-solutions to underdetermined linear
systems. SIAM Rev., 18(1):92–106, 1976.
[220] William D. Clinger. How to read floating point numbers accurately. SIG-
PLAN Notices, 25(6):92-101, 1990.
[221] W. J. Cody. Implementation and testing of function software. In Problems
and Methodologies in Mathematical Software Production, Paul C. Messina
and Almerico Murli, editors, volume 142 of Lecture Notes in Computer Sci-
ence, Springer-Verlag, Berlin, 1982, pages 24–47.
[222] W. J. Cody. ALGORITHM 665 MACHAR: A subroutine to dynamically
determine machine parameters. ACM Trans. Math. Software, 14(4):303-311,
1988.
[223] W. J. Cody. Floating-point standards-Theory and practice. In Reliability in
Computing: The Role of Interval Methods in Scientific Computing, Ramon E.
Moore, editor, Academic Press, Boston, MA, USA, 1988, pages 99-107.
[224] W. J. Cody. Algorithm 714. CELEFUNT: A portable test package for com-
plex elementary functions. ACM Trans. Math. Software, 19(1):1–21, 1993.
[225] W. J. Cody, J. T. Coonen, D. M. Gay, K. Hanson, D. Hough, W. Kahan,
R. Karpinski, J. Palmer, F. N. Ris, and D. Stevenson. A proposed radix-
and word-length-independent standard for floating-point arithmetic. IEEE
Micro, 4(4):86-100, 1984.
[226] W. J. Cody and Jerome T. Coonen. ALGORITHM 722: Functions to support
the IEEE standard for binary floating-point arithmetic. ACM Trans. Math.
Software, 19(4):443-451, 1993.
[227] William J. Cody, Jr. Static and dynamic numerical characteristics of floating-
point arithmetic. IEEE Trans. Comput., C-22(6):598-601, 1973.
[228] William J. Cody, Jr. and William Waite. Software Manual for the Elementary
Functions. Prentice-Hall, Englewood Cliffs, NJ, USA, 1980. x+269 pp. ISBN
0-13-822064-6.
[229] A. M. Cohen. A note on pivot size in Gaussian elimination. Linear Algebra
Appl., 8:361-368, 1974.
[230] A. M. Cohen. The inverse of a Pascal matrix. Mathematical Gazette, 59
(408):111-112, 1975.
[231] Marty S. Cohen, T. E. Hull, and V. Carl Hamacher. CADAC: A controlled-
precision decimal arithmetic unit. IEEE Trans. Comput., C-32(4):370-377,
1983.
[232] Thomas F. Coleman and Charles F. Van Loan. Handbook for Matrix Compu-
tations. Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1988. vii+264 pp. ISBN 0-89871-227-0.
[233] Apple Computer. Apple Numerics Manual. Second edition, Addison-Wesley,
Reading, MA, USA, 1988.
[234] Apple Computer. Inside Macintosh: PowerPC Numerics. Addison-Wesley,
Reading, MA, USA, 1994. ISBN 0-201-40728-0.

[235] P. Concus, G. H. Golub, and G. Meurant. Block preconditioning for the
conjugate gradient method. SIAM J. Sci. Statist. Comput., 6(1):220-252,
1985.
[236] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. LANCELOT:
A Fortran Package for Large-Scale Nonlinear Optimization (Release A).
Springer-Verlag, Berlin, 1992. xviii+330 pp. ISBN 0-387-55470-X.
[237] Samuel D. Conte and Carl de Boor. Elementary Numerical Analysis: An
Algorithmic Approach. Third edition, McGraw-Hill, Tokyo, 1980. xii+432
pp. ISBN 0-07-066228-2.
[238] James W. Cooley. How the FFT gained acceptance. In A History of Scientific
Computing, Stephen G. Nash, editor, Addison-Wesley, Reading, MA, USA,
1990, pages 133-140.
[239] James W. Cooley. Lanczos and the FFT: A discovery before its time. In
Proceedings of the Cornelius Lanczos International Centenary Conference,
J. David Brown, Moody T. Chu, Donald C. Ellison, and Robert J. Plemmons,
editors, Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1994, pages 3-9.
[240] James W. Cooley and John W. Tukey. An algorithm for the machine calcu-
lation of complex Fourier series. Math. Comp., 19(90):297-301, 1965.
[241] James W. Cooley and John W. Tukey. On the origin and publication of the
FFT paper. Current Contents, (51-52):8-9, 1993.
[242] Brian A. Coomes, Huseyin Koçak, and Kenneth J. Palmer. Rigorous com-
putational shadowing of orbits of ordinary differential equations. Numer.
Math., 69(4):401-421, 1995.
[243] Jerome T. Coonen. Underflow and the denormalized numbers. Computer,
14:75-87, 1981.
[244] J. E. Cope and B. W. Rust. Bounds on solutions of linear systems with
inaccurate data. SIAM J. Numer. Anal., 16(6):950-963, 1979.
[245] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arith-
metic progressions. In Proceedings of the Nineteenth Annual ACM Sympo-
sium on Theory of Computing, ACM Press, New York, 1987, pages 1-6.
[246] Rob Corless. Continued fractions and chaos. Amer. Math. Monthly, 99(3):
203-215, 1992.
[247] Robert M. Corless. Defect-controlled numerical methods and shadowing for
chaotic differential equations. Physics D, 60:323–334, 1992.
[248] Richard W. Cottle. Manifestations of the Schur complement. Linear Algebra
Appl., 8:189-211, 1974.
[249] M. G. Cox. The numerical evaluation of B-splines. J. Inst. Maths Applies,
10:134–149, 1972.
[250] M. G. Cox. An algorithm for spline interpolation. J. Inst. Maths Applies,
15:95–108, 1975.

[251] M. G. Cox. The numerical evaluation of a spline from its B-spline represen-
tation. J. Inst. Maths Applies, 21:135–143, 1978.
[252] M. G. Cox and S. J. Hammarling, editors. Reliable Numerical Computation.
Oxford University Press, Oxford, UK, 1990. xvi+339 pp. ISBN 0-19-853564-
3.
[253] M. G. Cox and P. M. Harris. Overcoming an instability arising in a spline
approximation algorithm by using an alternative form of a simple rational
function. IMA Bulletin, 25(9):228-232, 1989.
[254] Fred D. Crary. A versatile precompiler for nonstandard arithmetics. ACM
Trans. Math. Software, 5(2):204-217, 1979.
[255] Prescott D. Crout. A short method for evaluating determinants and solving
systems of linear equations with real or complex coefficients. Trans. Amer.
Inst. Elec. Engrg., 60:1235-1241, 1941.
[256] C. W. Cryer. Pivot size in Gaussian elimination. Numer. Math., 12:335-345,
1968.
[257] Colin W. Cryer. The LU-factorization of totally positive matrices. Linear
Algebra Appl., 7:83-92, 1973.
[258] Colin W. Cryer. Some properties of totally positive matrices. Linear Algebra
Appl., 15:1-25, 1976.
[259] A. R. Curtis and J. K. Reid. On the automatic scaling of matrices for Gaus-
sian elimination. J. Inst. Maths Applies, 10:118–124, 1972.
[260] George Cybenko and Charles F. Van Loan. Computing the minimum eigen-
value of a symmetric positive definite Toeplitz matrix. SIAM J. Sci. Statist.
Comput., 7(1):123-131, 1986.
[261] G. Dahlquist. On matrix majorants and minorants, with applications to
differential equations. Linear Algebra Appl., 52/53:199-216, 1983.
[262] Germund Dahlquist and Åke Björck. Numerical Methods. Prentice-Hall,
Englewood Cliffs, NJ, USA, 1974. xviii+573 pp. Translated by Ned Anderson.
ISBN 0-13-627315-7.
[263] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart. Reorthogonal-
ization and stable algorithms for updating the Gram-Schmidt QR factoriza-
tion. Math. Comp., 30(136):772–795, 1976.
[264] B. Danloy. On the choice of signs for Householder’s matrices. J. Comp. Appl.
Math., 2(1):67-69, 1976.
[265] Karabi Datta. The matrix equation XA – BX = R and its applications.
Linear Algebra Appl., 109:91-105, 1988.
[266] Philip J. Davis. Circulant Matrices. Wiley, New York, 1979. xv+250 pp.
ISBN 0-471-05771-1.
[267] Philip J. Davis and Philip Rabinowitz. Methods of Numerical Integration.
Second edition, Academic Press, London, 1984. xiv+612 pp. ISBN 0-12-
206360-0.

[268] Achiya Dax. Partial pivoting strategies for symmetric Gaussian elimination.
Math. Programming, 22:288-303, 1982.
[269] Achiya Dax. The convergence of linear stationary iterative processes for
solving singular unstructured systems of linear equations. SIAM Rev., 32(4):
611-635, 1990.
[270] Achiya Dax and S. Kaniel. Pivoting techniques for symmetric Gaussian
elimination. Numer. Math., 28:221-241, 1977.
[271] Jane M. Day and Brian Peterson. Growth in Gaussian elimination. Amer.
Math. Monthly, 95(6):489–513, 1988.
[272] Carl de Boor. On calculating with B-splines. J. Approx. Theory, 6:50–62,
1972.
[273] Carl de Boor and Allan Pinkus. Backward error analysis for totally positive
linear systems. Numer. Math., 27:485-490, 1977.
[274] Lieuwe Sytse de Jong. Towards a formal definition of numerical stability.
Numer. Math., 28:211-219, 1977.
[275] T. J. Dekker. A floating-point technique for extending the available precision.
Numer. Math., 18:224-242, 1971.
[276] T. J. Dekker and W. Hoffmann. Rehabilitation of the Gauss-Jordan algo-
rithm. Numer. Math., 54:591–599, 1989.
[277] Cédric J. Demeure. Fast QR factorization of Vandermonde matrices. Linear
Algebra and Appl., 122/3/4:165-194, 1989.
[278] Cédric J. Demeure. QR factorization of confluent Vandermonde matrices.
IEEE Trans. Acoust., Speech, Signal Processing, 38(10):1799-1802, 1990.
[279] James W. Demmel. The condition number of equivalence transformations
that block diagonalize matrix pencils. SIAM J. Numer. Anal., 20(3):599-
610, 1983.
[280] James W. Demmel. Underflow and the reliability of numerical software.
SIAM J. Sci. Statist. Comput., 5(4):887-919, 1984.
[281] James W. Demmel. On condition numbers and the distance to the nearest
ill-posed problem. Numer. Math., 51:251–289, 1987.
[282] James W. Demmel. On error analysis in arithmetic with varying relative
precision. In Proceedings of the Eighth Symposium on Computer Arithmetic,
Como, Italy, Mary Jane Irwin and Renato Stefanelli, editors, IEEE Computer
Society, Washington, DC, 1987, pages 148-152.
[283] James W. Demmel. On floating point errors in Cholesky. Technical Re-
port CS-89-87, Department of Computer Science, University of Tennessee,
Knoxville, TN, USA, October 1989. 6 pp. LAPACK Working Note 14.
[284] James W. Demmel. On the odor of IEEE arithmetic. NA Digest, Volume 91,
Issue 39, 1991. (Response to a message “IEEE Arithmetic Stinks” in Volume
91, Issue 33). Electronic mail magazine: na.help@na-net.ornl.gov.
[285] James W. Demmel. The componentwise distance to the nearest singular
matrix. SIAM J. Matrix Anal. Appl., 13(1):10-19, 1992.

[286] James W. Demmel. Open problems in numerical linear algebra. IMA Preprint
Series #961, Institute for Mathematics and its Applications, University of
Minnesota, Minneapolis, MN, USA, April 1992. 21 pp. LAPACK Working
Note 47.
[287] James W. Demmel. A specification for floating point parallel prefix. Techni-
cal Report CS-92-167, Department of Computer Science, University of Ten-
nessee, Knoxville, TN, USA, July 1992. 8 pp. LAPACK Working Note 49.
[288] James W. Demmel. Trading off parallelism and numerical stability. In Lin-
ear Algebra for Large Scale and Real-Time Applications, Marc S. Moonen,
Gene H. Golub, and Bart L. De Moor, editors, volume 232 of NATO ASI
Series E, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1993,
pages 49-68.
[289] James W. Demmel, Inderjit Dhillon, and Huan Ren. On the correctness of
parallel bisection in floating point. Technical Report CS-94-228, Department
of Computer Science, University of Tennessee, Knoxville, TN, USA, March
1994. 38 pp. LAPACK Working Note 70.
[290] James W. Demmel, J. J. Dongarra, and W. Kahan. On designing portable
high performance numerical libraries. In Numerical Analysis 1991, Proceed-
ings of the 14th Dundee Conference, D. F. Griffiths and G. A. Watson, editors,
volume 260 of Pitman Research Notes in Mathematics, Longman Scientific
and Technical, Essex, UK, 1992, pages 69-84.
[291] James W. Demmel and Nicholas J. Higham. Stability of block algorithms
with fast level-3 BLAS. ACM Trans. Math. Software, 18(3):274–291, 1992.
[292] James W. Demmel and Nicholas J. Higham. Improved error bounds for
underdetermined system solvers. SIAM J. Matrix Anal. Appl., 14(1):1-14,
1993.
[293] James W. Demmel, Nicholas J. Higham, and Robert S. Schreiber. Stability
of block LU factorization. Numerical Linear Algebra with Applications, 2(2):
173-190, 1995.
[294] James W. Demmel and Bo Kågström. Computing stable eigendecompositions
of matrix pencils. Linear Algebra Appl., 88/89:139-186, 1987.
[295] James W. Demmel and Bo Kågström. Accurate solutions of ill-posed prob-
lems in control theory. SIAM J. Matrix Anal. Appl., 9(1):126-145, 1988.
[296] James W. Demmel and W. Kahan. Accurate singular values of bidiagonal
matrices. SIAM J. Sci. Statist. Comput., 11(5):873–912, 1990.
[297] James W. Demmel and F. Krückeberg. An interval algorithm for solving
systems of linear equations to prespecified accuracy. Computing, 34:117–129,
1985.
[298] James W. Demmel and Xiaoye Li. Faster numerical algorithms via exception
handling. IEEE Trans. Comput., 43(8):983-992, 1994.
[299] James W. Demmel and A. McKenney. A test matrix generation suite.
Preprint MCS-P69-0389, Mathematics and Computer Science Division, Ar-
gonne National Laboratory, Argonne, IL, USA, March 1989. 16 pp. LAPACK
Working Note 9.

[300] J. E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Uncon-
strained Optimization and Nonlinear Equations. Prentice-Hall, Englewood
Cliffs, NJ, USA, 1983. xiii+378 pp. ISBN 0-13-627216-9.
[301] J. E. Dennis, Jr. and Virginia Torczon. Direct search methods on parallel
machines. SIAM J. Optim., 1(4):448-474, 1991.
[302] J. E. Dennis, Jr. and Homer F. Walker. Inaccuracy in quasi-Newton methods:
Local improvement theorems. Math. Prog. Study, 22:70-85, 1984.
[303] John E. Dennis, Jr. and Daniel J. Woods. Optimization on microcomput-
ers: The Nelder-Mead simplex algorithm. In New Computing Environments:
Microcomputers in Large-Scale Computing, Arthur Wouk, editor, Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, 1987, pages
116-122.
[304] J. Descloux. Note on the round-off errors in iterative processes. Math. Comp.,
17:18-27, 1963.
[305] Harold G. Diamond. Stability of rounded off inverses under iteration. Math.
Comp., 32(141):227-232, 1978.
[306] John D. Dixon. Estimating extremal eigenvalues and condition numbers of
matrices. SIAM J. Numer. Anal., 20(4):812–814, 1983.
[307] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK
Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 1979. ISBN 0-89871-172-X.
[308] J. J. Dongarra, J. J. Du Croz, I. S. Duff, and S. J. Hammarling. A set of
level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16:
1–17, 1990.
[309] J. J. Dongarra, J. J. Du Croz, I. S. Duff, and S. J. Hammarling. Algorithm
679. A set of level 3 basic linear algebra subprograms: Model implementation
and test programs. ACM Trans. Math. Software, 16:18–28, 1990.
[310] J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra
algorithms for dense matrices on a vector pipeline machine. SIAM Rev., 26
(1):91-112, 1984.
[311] Jack Dongarra, Roldan Pozo, and David W. Walker. LAPACK++ V1.0:
High performance linear algebra users’ guide. Technical Report CS-95-290,
Department of Computer Science, University of Tennessee, Knoxville, TN,
USA, May 1995. 31 pp. LAPACK Working Note 98.
[312] Jack J. Dongarra. Performance of various computers using standard linear
equations software. Technical Report CS-89-85, Department of Computer
Science, University of Tennessee, Knoxville, TN, USA, February 1995. 34 pp.
[313] Jack J. Dongarra, Jeremy J. Du Croz, Sven J. Hammarling, and Richard J.
Hanson. An extended set of Fortran basic linear algebra subprograms. ACM
Trans. Math. Software, 14(1):1-17, 1988.
[314] Jack J. Dongarra, Jeremy J. Du Croz, Sven J. Hammarling, and Richard J.
Hanson. Algorithm 656. An extended set of Fortran basic linear algebra
subprograms: Model implementation and test programs. ACM Trans. Math.
Software, 14(1):18-32, 1988.
[315] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der
Vorst. Solving Linear Systems on Vector and Shared Memory Computers.
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA,
1991. x+256 pp. ISBN 0-89871-270-X.
[316] Jack J. Dongarra and Eric Grosse. Distribution of mathematical software via
electronic mail. Comm. ACM, 30(5):403–407, 1987.
[317] Craig C. Douglas, Michael Heroux, Gordon Slishman, and Roger M. Smith.
GEMMW: A portable level 3 BLAS Winograd variant of Strassen’s matrix–
matrix multiply algorithm. J. Comput. Phys., 110:1–10, 1994.
[318] Craig C. Douglas and Gordon Slishman. Variants of matrix-matrix multipli-
cation for Fortran-90. ACM SIGNUM Newsletter, 29:4–6, 1994.
[319] Jim Douglas, Jr. Round-off error in the numerical solution of the heat equa-
tion. J. Assoc. Comput. Mach., 6:48–58, 1959.
[320] Thomas C. Doyle. Inversion of symmetric coefficient matrix of positive-
definite quadratic form. M.T.A.C., 11:55-58, 1957.
[321] Zlatko Drmač, Matjaž Omladič, and Krešimir Veselić. On the perturbation
of the Cholesky factorization. SIAM J. Matrix Anal. Appl., 15(4):1319–1332,
1994.
[322] Jeremy J. Du Croz and Nicholas J. Higham. Stability of methods for matrix
inversion. IMA J. Numer. Anal., 12:1–19, 1992.
[323] Augustin A. Dubrulle. A class of numerical methods for the computation of
Pythagorean sums. IBM J. Res. Develop., 27(6):582-589, 1983.
[324] Augustin A. Dubrulle and Gene H. Golub. A multishift QR iteration without
computation of the shifts. Numerical Algorithms, 7: 173–181, 1994.
[325] I. S. Duff, A.M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices.
Oxford University Press, 1986. xiii+341 pp. ISBN 0-19-853408-6.
[326] I. S. Duff, N. I. M. Gould, J. K. Reid, J. A. Scott, and K. Turner. The
factorization of sparse symmetric indefinite matrices. IMA J. Numer. Anal.,
11:181–204, 1991.
[327] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Sparse matrix test prob-
lems. ACM Trans. Math. Software, 15(1):1-14, 1989.
[328] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Users’ guide for the
Harwell-Boeing sparse matrix collection (release 1). Report RAL-92-086,
Atlas Centre, Rutherford Appleton Laboratory, Didcot, Oxon, UK, Decem-
ber 1992. 84 pp.
[329] Iain S. Duff and John K. Reid. MA27—A set of Fortran subroutines for
solving sparse symmetric sets of linear equations. Technical Report AERE
R10533, AERE Harwell Laboratory, 1982.
[330] Iain S. Duff and John K. Reid. MA47, a Fortran code for direct solution
of indefinite sparse symmetric linear systems. Report RAL-95-001, Atlas
Centre, Rutherford Appleton Laboratory, Didcot, Oxon, UK, January 1995.
63 pp.
[331] Iain S. Duff, John K. Reid, Niels Munksgaard, and Hans B. Nielsen. Direct
solution of sets of linear equations whose matrix is sparse, symmetric and
indefinite. J. Inst. Maths Applics, 23:235–250, 1979.
[332] William Dunham. Journey Through Genius: The Great Theorems of Math-
ematics. Penguin, New York, 1990. xiii+300 pp. ISBN 0-14-014739-X.
[333] Paul S. Dwyer. Linear Computations. Wiley, New York, 1951. xi+344 pp.
[334] Carl Eckart and Gale Young. The approximation of one matrix by another
of lower rank. Psychometrika, 1(3):211–218, 1936.
[335] Alan Edelman. Eigenvalues and condition numbers of random matrices.
SIAM J. Matrix Anal. Appl., 9(4):543-560, 1988.
[336] Alan Edelman. The distribution and moments of the smallest eigenvalue of
a random matrix of Wishart type. Linear Algebra Appl., 159:55–80, 1991.
[337] Alan Edelman. The first annual large dense linear system survey. ACM
SIGNUM Newsletter, 26:6-12, October 1991.
[338] Alan Edelman. The complete pivoting conjecture for Gaussian elimination
is false. The Mathematica Journal, 2:58–61, 1992.
[339] Alan Edelman. On the distribution of a scaled condition number. Math.
Comp., 58(197):185-190, 1992.
[340] Alan Edelman. Eigenvalue roulette and random test matrices. In Linear Al-
gebra for Large Scale and Real-Time Applications, Marc S. Moonen, Gene H.
Golub, and Bart L. De Moor, editors, volume 232 of NATO ASI Series E,
Kluwer Academic Publishers, Dordrecht, The Netherlands, 1993, pages 365-
368.
[341] Alan Edelman. Large dense numerical linear algebra in 1993: The parallel
computing influence. Int. J. Supercomputer Appl., 7(2):113–128, 1993.
[342] Alan Edelman. Scalable dense numerical linear algebra in 1994: The multi-
computer influence. In Proceedings of the Fifth SIAM Conference on Applied
Linear Algebra, John G. Lewis, editor, Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 1994, pages 344-348.
[343] Alan Edelman. When is x * (1/x) ≠ 1? Manuscript, 1994.
[344] Alan Edelman, Eric Kostlan, and Michael Shub. How many eigenvalues of a
random matrix are real? J. Amer. Math. Soc., 7(1):247–267, 1994.
[345] Alan Edelman and Walter Mascarenhas. On the complete pivoting conjecture
for a Hadamard matrix of order 12. Linear and Multilinear Algebra, 38(3):
181–188, 1995.
[346] Alan Edelman and H. Murakami. Polynomial roots from companion matrix
eigenvalues. Math. Comp., 64(210):763–776, 1995.
[347] Alan Edelman and G. W. Stewart. Scaling for orthogonality. IEEE Trans.
Signal Processing, 41(4):1676-1677, 1993.
[348] Editor’s note. SIAM J. Matrix Anal. Appl., 12(3), 1991.
[349] Timo Eirola. Aspects of backward error analysis of numerical ODEs. J.
Comp. Appl. Math., 45(1-2):65-73, 1993.
[350] Lars Eldén. Perturbation theory for the least squares problem with linear
equality constraints. SIAM J. Numer. Anal., 17(3):338–350, 1980.
[351] Samuel K. Eldersveld and Michael A. Saunders. A block-LU update for
large-scale linear programming. SIAM J. Matrix Anal. Appl., 13(1):191-201,
1992.
[352] W. H. Enright. A new error-control for initial value solvers. Appl. Math.
Comput., 31:288-301, 1989.
[353] Michael A. Epton. Methods for the solution of AXD – BXC = E and its ap-
plication in the numerical solution of implicit ordinary differential equations.
BIT, 20:341–345, 1980.
[354] A. M. Erisman, R. G. Grimes, J. G. Lewis, W. G. Poole, and H. D. Si-
mon. Evaluation of orderings for unsymmetric sparse matrices. SIAM J. Sci.
Statist. Comput., 8(4):600-624, 1987.
[355] A. M. Erisman and J. K. Reid. Monitoring the stability of the triangular
factorization of a sparse matrix. Numer. Math., 22:183-186, 1974.
[356] Terje O. Espelid. On floating-point summation. Report No. 67, Department
of Applied Mathematics, University of Bergen, Bergen, Norway, December
1978.
[357] Christopher Evans. Interview with J. H. Wilkinson. Number 10 in Pioneers of
Computing, 60-Minute Recordings of Interviews. Science Museum, London,
1976.
[358] John Ewing, editor. A Century of Mathematics Through the Eyes of the
Monthly. Mathematical Association of America, Washington, DC, 1994.
xi+323 pp. ISBN 0-88385-459-7.
[359] John H. Ewing and F. W. Gehring, editors. Paul Halmos: Celebrating 50
Years of Mathematics. Springer-Verlag, New York, 1991. viii+320 pp. ISBN
3-540-97509-8.
[360] V. N. Faddeeva. Computational Methods of Linear Algebra. Dover, New
York, 1959. x+252 pp. ISBN 0-486-60424-1.
[361] Ky Fan and A. J. Hoffman. Some metric inequalities in the space of matrices.
Proc. Amer. Math. Soc., 6:111-116, 1955.
[362] R. W. Farebrother. A memoir of the life of M. H. Doolittle. IMA Bulletin,
23(6/7):102, 1987.
[363] R. W. Farebrother. Linear Least Squares Computations. Marcel Dekker, New
York, 1988. xiii+293 pp. ISBN 0-8247-7661-5.
[364] Charles Farnum. Compiler support for floating-point computation.
Software—Practice and Experience, 18(7):701-709, 1988.
[365] Richard J. Fateman. High-level language implications of the proposed IEEE
floating-point standard. ACM Trans. Program. Lang. Syst., 4(2):239-257,
1982.
[366] David G. Feingold and Richard S. Varga. Block diagonally dominant matrices
and generalizations of the Gerschgorin circle theorem. Pacific J. Math., 12:
1241–1250, 1962.
[367] S. I. Feldman, David M. Gay, Mark W. Maimone, and N. L. Schryer. A
Fortran to C converter. Computing Science Technical Report No. 149, AT&T
Bell Laboratories, Murray Hill, NJ, USA, 1990. 26 pp.
[368] Alan Feldstein and Richard Goodman. Loss of significance in floating point
subtraction and addition. IEEE Trans. Comput., C-31(4):328–335, 1982.
[369] Alan Feldstein and Peter Turner. Overflow, underflow, and severe loss of
significance in floating-point addition and subtraction. IMA J. Numer. Anal.,
6:241-251, 1986.
[370] Warren E. Ferguson, Jr. Exact computation of a sum or difference with appli-
cations to argument reduction. In Proc. 12th IEEE Symposium on Computer
Arithmetic, Bath, England, Simon Knowles and William H. McAllister, ed-
itors, IEEE Computer Society Press, Los Alamitos, CA, USA, 1995, pages
216–221.
[371] Warren E. Ferguson, Jr. and Tom Brightman. Accurate and monotone ap-
proximations of some transcendental functions. In Proc. 10th IEEE Sym-
posium on Computer Arithmetic, Peter Kornerup and David W. Matula,
editors, IEEE Computer Society Press, Los Alamitos, CA, USA, 1991, pages
237-244.
[372] William R. Ferng, Gene H. Golub, and Robert J. Plemmons. Adaptive Lanc-
zos methods for recursive condition estimation. Numer. Algorithms, 1(1):
1–19, 1991.
[373] C. T. Fike. Methods of evaluating polynomial approximations in function
evaluation routines. Comm. ACM, 10(3):175-178, 1967.
[374] Patrick C. Fischer. Further schemes for combining matrix algorithms. In
Automata, Languages and Programming, Jacques Loeckx, editor, volume 14
of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1974, pages
428-436.
[375] R. Fletcher. Factorizing symmetric indefinite matrices. Linear Algebra Appl.,
14:257–272, 1976.
[376] R. Fletcher. Expected conditioning. IMA J. Numer. Anal., 5:247-273, 1985.
[377] R. Fletcher. Cancellation errors in quasi-Newton methods. SIAM J. Sci.
Statist. Comput., 7(4):1387-1399, 1986.
[378] R. Fletcher. Practical Methods of Optimization. Second edition, Wiley, Chich-
ester, UK, 1987. xiv+436 pp. ISBN 0-471-91547-5.
[379] R. Fletcher. Degeneracy in the presence of roundoff errors. Linear Algebra
Appl., 106:149-183, 1988.
[380] R. Fletcher. Resolving degeneracy in quadratic programming. Ann. Oper.
Res., 47:307-334, 1993.
[381] R. Fletcher. Steepest edge, degeneracy and conditioning in LP. Numer-
ical Analysis Report NA/154, Department of Mathematics and Computer
Science, University of Dundee, Dundee, Scotland, November 1994. 21 pp.
[382] R. Fletcher and M. J. D. Powell. On the modification of LDL^T factorization.
Math. Comp., 28(128):1067-1087, 1974.
[383] Brian Ford. Parameterization of the environment for transportable numerical
software. ACM Trans. Math. Software, 4(2):100–103, 1978.
[384] Anders Forsgren, Philip E. Gill, and Walter Murray. Computing modified
Newton directions using a partial Cholesky factorization. SIAM J. Sci. Com-
put., 16(1):139-150, 1995.
[385] Anders Forsgren, Philip E. Gill, and Joseph R. Shinnerl. Stability of sym-
metric ill-conditioned systems arising in interior methods for constrained op-
timization. SIAM J. Matrix Anal. Appl., 17(1):187–211, 1996.
[386] G. E. Forsythe and E. G. Straus. On best conditioned matrices. Proc. Amer.
Math. Soc., 6:340-345, 1955.
[387] George E. Forsythe. Gauss to Gerling on relaxation. Mathematical Tables
and Other Aids to Computation, 5:255–258, 1951.
[388] George E. Forsythe. Solving linear algebraic equations can be interesting.
Bull. Amer. Math. Soc., 59(4):299-329, 1953.
[389] George E. Forsythe. Reprint of a note on rounding-off errors. SIAM Rev., 1
(1):66-67, 1959.
[390] George E. Forsythe. Algorithm 16: Crout with pivoting. Comm. ACM, 3(9):
507-508, 1960.
[391] George E. Forsythe. Today’s computational methods of linear algebra. SIAM
Rev., 9:489-515, 1967.
[392] George E. Forsythe. Solving a quadratic equation on a computer. In The
Mathematical Sciences: A Collection of Essays, National Research Coun-
cil’s Committee on Support of Research in the Mathematical Sciences, editor,
M.I.T. Press, Cambridge, MA, USA, 1969, pages 138-152.
[393] George E. Forsythe. What is a satisfactory quadratic equation solver? In
Constructive Aspects of the Fundamental Theorem of Algebra, Bruno Dejon
and Peter Henrici, editors, Wiley-Interscience, London, 1969, pages 53–61.
[394] George E. Forsythe. Pitfalls in computation, or why a math book isn’t
enough. Amer. Math. Monthly, 77:931–956, 1970.
[395] George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler. Computer
Methods for Mathematical Computations. Prentice-Hall, Englewood Cliffs,
NJ, USA, 1977. xi+259 pp. ISBN 0-13-165332-6.
[396] George E. Forsythe and Cleve B. Moler. Computer Solution of Linear Al-
gebraic Systems. Prentice-Hall, Englewood Cliffs, NJ, USA, 1967. xi+148
pp.
[397] George E. Forsythe and Wolfgang R. Wasow. Finite-Difference Methods for
Partial Differential Equations. Wiley, New York, 1960. x+444 pp.
[398] Leslie Foster. Modifications of the normal equations method that are nu-
merically stable. In Numerical Linear Algebra, Digital Signal Processing and
Parallel Algorithms, G. H. Golub and P. M. Van Dooren, editors, volume
F70 of NATO ASI Series, Springer-Verlag, Berlin, New York, 1991, pages
501-512.
[399] Leslie V. Foster. Gaussian elimination with partial pivoting can fail in prac-
tice. SIAM J. Matrix Anal. Appl., 15(4):1354-1362, 1994.
[400] L. Fox. An Introduction to Numerical Linear Algebra. Oxford University
Press, Oxford, UK, 1964. xi+328 pp.
[401] L. Fox. How to get meaningless answers in scientific computation (and what
to do about it). IMA Bulletin, 7(10):296-302, 1971.
[402] L. Fox. All about Jim Wilkinson, with a commemorative snippet on back-
ward error analysis. In The Contribution of Dr. J. H. Wilkinson to Numerical
Analysis, Symposium Proceedings Series No. 19, The Institute of Mathemat-
ics and its Applications, Southend-On-Sea, Essex, UK, 1978, pages 1–20.
[403] L. Fox. James Hardy Wilkinson, 1919-1986. Biographical Memoirs of Fellows
of the Royal Society, 33:671–708, 1987.
[404] L. Fox, H. D. Huskey, and J. H. Wilkinson. Notes on the solution of algebraic
linear simultaneous equations. Quart. J. Mech. Appl. Math., 1:149-173, 1948.
[405] P. A. Fox, A. D. Hall, and N. L. Schryer. The PORT mathematical subroutine
library. ACM Trans. Math. Software, 4(2):104-126, 1978.
[406] Philippe Francois and Jean-Michel Muller. The SCALP perturbation
method. In Proceedings of IMACS ’91, 13th World Congress on Computation
and Applied Mathematics, Dublin, Ireland, Volume 1, 1991, pages 59-60.
[407] V. Frayssé. Reliability of Computer Solutions. PhD thesis, L’Institut National
Polytechnique de Toulouse, Toulouse, France, July 1992. CERFACS Report
TH/PA/92/11.
[408] Shmuel Friedland. Revisiting matrix squaring. Linear Algebra Appl., 154-156:
59-63, 1991.
[409] Shmuel Friedland and Hans Schneider. The growth of powers of a nonnegative
matrix. SIAM J. Alg. Discrete Methods, 1(2):185–200, 1980.
[410] R. E. Funderlic, M. Neumann, and R. J. Plemmons. LU decompositions of
generalized diagonally dominant matrices. Numer. Math., 40:57-69, 1982.
[411] Pascal M. Gahinet, Alan J. Laub, Charles S. Kenney, and Gary A. Hewer.
Sensitivity of the stable discrete-time Lyapunov equation. IEEE Trans. Au-
tomat. Control, AC-35(11):1209-1217, 1990.
[412] Shmuel Gal and Boris Bachelis. An accurate elementary mathematical library
for the IEEE floating point standard. ACM Trans. Math. Software, 17(1):
26–45, 1991.
[413] F. R. Gantmacher. The Theory of Matrices, volume 1. Chelsea, New York,
1959. x+374 pp. ISBN 0-8284-0131-4.
[414] F. R. Gantmacher. The Theory of Matrices, volume 2. Chelsea, New York,
1959. ix+276 pp. ISBN 0-8284-0133-0.
[415] B. S. Garbow, J. M. Boyle, J. J. Dongarra, and C. B. Moler. Matrix Eigen-
system Routines—EISPACK Guide Extension, volume 51 of Lecture Notes
in Computer Science. Springer-Verlag, Berlin, 1977.
[416] Judith D. Gardiner and Alan J. Laub. Parallel algorithms for algebraic
Riccati equations. Int. J. Control, 54(6):1317-1333, 1991.
[417] Judith D. Gardiner, Alan J. Laub, James J. Amato, and Cleve B. Moler.
Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM
Trans. Math. Software, 18(2):223–231, 1992.
[418] Judith D. Gardiner, Matthew R. Wette, Alan J. Laub, James J. Amato,
and Cleve B. Moler. Algorithm 705: A FORTRAN-77 software package for
solving the Sylvester matrix equation AXB^T + CXD^T = E. ACM Trans.
Math. Software, 18(2):232-238, 1992.
[419] Martin Gardner. More Mathematical Puzzles and Diversions. Penguin, New
York, 1961. 187 pp. ISBN 0-14-020748-1.
[420] Harvey L. Garner. A survey of some recent contributions to computer arith-
metic. IEEE Trans. Comput., C-25(12):1277–1282, 1976.
[421] M. Gasca and J. M. Peña. Total positivity and Neville elimination. Linear
Algebra Appl., 165:25-44, 1992.
[422] Noel Gastinel. Linear Numerical Analysis. Kershaw Publishing, London,
1983. ix+341 pp. First published in English by Academic Press, New York,
1970. ISBN 0-901665-16-9.
[423] Carl Friedrich Gauss. Theory of the Combination of Observations Least Sub-
ject to Errors. Part One, Part Two, Supplement. Society for Industrial and
Applied Mathematics, Philadelphia, PA, USA, 1995. xi+241 pp. Translated
from the Latin and German by G. W. Stewart. ISBN 0-89871-347-1.
[424] Walter Gautschi. On inverses of Vandermonde and confluent Vandermonde
matrices. Numer. Math., 4:117–123, 1962.
[425] Walter Gautschi. Norm estimates for inverses of Vandermonde matrices.
Numer. Math., 23:337-347, 1975.
[426] Walter Gautschi. On inverses of Vandermonde and confluent Vandermonde
matrices III. Numer. Math., 29:445–450, 1978.
[427] Walter Gautschi. The condition of Vandermonde-like matrices involving or-
thogonal polynomials. Linear Algebra Appl., 52/53:293–300, 1983.
[428] Walter Gautschi. How (un)stable are Vandermonde systems? In Asymptotic
and Computational Analysis, R. Wong, editor, volume 124 of Lecture Notes
in Pure and Applied Mathematics, Marcel Dekker, New York and Basel, 1990,
pages 193–210.
[429] Walter Gautschi and Gabriele Inglese. Lower bounds for the condition num-
ber of Vandermonde matrices. Numer. Math., 52:241-250, 1988.
[430] Werner Gautschi. The asymptotic behaviour of powers of matrices. Duke
Math. J., 20:127-140, 1953.
[431] David M. Gay. Correctly rounded binary-decimal and decimal-binary con-
versions. Numerical Analysis Manuscript 90-10, AT&T Bell Laboratories,
Murray Hill, NJ, USA, November 1990. 16 pp.
[432] Stuart Geman. The spectral radius of large random matrices. Ann. Probab.,
14(4):1318-1328, 1986.
[433] W. M. Gentleman. An error analysis of Goertzel’s (Watt’s) method for com-
puting Fourier coefficients. Comput. J., 12:160-165, 1969.
[434] W. M. Gentleman and G. Sande. Fast Fourier transforms-for fun and profit.
In Fall Joint Computer Conference, volume 29 of AFIPS Conference Proceed-
ings, Spartan Books, Washington, DC, 1966, pages 563–578.
[435] W. Morven Gentleman. Least squares computations by Givens transforma-
tions without square roots. J. Inst. Maths Applics, 12:329–336, 1973.
[436] W. Morven Gentleman. Error analysis of QR decompositions by Givens
transformations. Linear Algebra Appl., 10:189–197, 1975.
[437] W. Morven Gentleman and Scott B. Marovich. More on algorithms that
reveal properties of floating point arithmetic units. Comm. ACM, 17(5):
276-277, 1974.
[438] Alan George and Joseph W-H Liu. Computer Solution of Large Sparse Pos-
itive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, USA, 1981.
xii+324 pp. ISBN 0-13-165274-5.
[439] A. J. Geurts. A contribution to the theory of condition. Numer. Math., 39:
85-96, 1982.
[440] Ali R. Ghavimi and Alan J. Laub. Backward error, sensitivity, and refine-
ment of computed solutions of algebraic Riccati equations. Numerical Linear
Algebra with Applications, 2(1):29-49, 1995.
[441] Ali R. Ghavimi and Alan J. Laub. Residual bounds for discrete-time Lya-
punov equations. IEEE Trans. Automat. Control, 40(7):1244-1249, 1995.
[442] P. E. Gill and W. Murray. A numerically stable form of the simplex algorithm.
Linear Algebra Appl., 7:99-138, 1973.
[443] Philip E. Gill, Walter Murray, Dulce B. Ponceleón, and Michael A. Saun-
ders. Preconditioners for indefinite systems arising in optimization. SIAM
J. Matrix Anal. Appl., 13(1):292-311, 1992.
[444] Philip E. Gill, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
User’s guide for NPSOL (version 4.0): A Fortran package for nonlinear pro-
gramming. Technical Report SOL 86-2, Department of Operations Research,
Stanford University, Stanford, CA, January 1986. 53 pp.
[445] Philip E. Gill, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
A Schur-complement method for sparse quadratic programming. In Reliable
Numerical Computation, M. G. Cox and S. J. Hammarling, editors, Oxford
University Press, Oxford, UK, 1990, pages 113-138.
[446] Philip E. Gill, Walter Murray, Michael A. Saunders, and Margaret H. Wright.
Inertia-controlling methods for general quadratic programming. SIAM Rev.,
33(1):1–36, 1991.
[447] Philip E. Gill, Walter Murray, and Margaret H. Wright. Practical Optimiza-
tion. Academic Press, London, 1981. xvi+401 pp. ISBN 0-12-283952-8.
[448] Philip E. Gill, Michael A. Saunders, and Joseph R. Shinnerl. On the stability
of Cholesky factorization for symmetric quasi-definite systems. SIAM J.
Matrix Anal. Appl., 17(1):35-46, 1996.
[449] S. Gill. A process for the step-by-step integration of differential equations
in an automatic digital computing machine. Proc. Cambridge Phil. Soc., 47:
96-108, 1951.
[450] T. A. Gillespie. Noncommutative variations on theorems of Marcel Riesz and
others. In Paul Halmos: Celebrating 50 Years of Mathematics, John H. Ewing
and F. W. Gehring, editors, Springer-Verlag, Berlin, 1991, pages 221–236.
[451] Wallace J. Givens. Numerical computation of the characteristic values of a
real symmetric matrix. Technical Report ORNL-1574, Oak Ridge National
Laboratory, Oak Ridge, TN, USA, 1954. 107 pp.
[452] James Glanz. Mathematical logic flushes out the bugs in chip designs. Sci-
ence, 267:332–333, 1995. 20 January.
[453] J. Gluchowska and A. Smoktunowicz. Solving the linear least squares problem
with very high relative accuracy. Computing, 45:345–354, 1990.
[454] I. Gohberg and I. Koltracht. Error analysis for triangular factorization of
Cauchy and Vandermonde matrices. Manuscript, 1990.
[455] I. Gohberg and I. Koltracht. Mixed, componentwise, and structured condition
numbers. SIAM J. Matrix Anal. Appl., 14(3):688–704, 1993.
[456] I. Gohberg and V. Olshevsky. Fast inversion of Chebyshev–Vandermonde
matrices. Numer. Math., 67(1):71–92, 1994.
[457] David Goldberg. What every computer scientist should know about floating-
point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
[458] I. Bennett Goldberg. 27 bits are not enough for 8-digit accuracy. Comm.
ACM, 10(2):105-106, 1967.
[459] Moshe Goldberg and E. G. Straus. Multiplicativity of l_p norms for matrices.
Linear Algebra Appl., 52/53:351–360, 1983.
[460] Herman H. Goldstine. The Computer: From Pascal to von Neumann. Prince-
ton University Press, Princeton, NJ, USA, 1972. xii+378 pp. 1993 printing
with new preface. ISBN 0-691-02367-0.
[461] Herman H. Goldstine. A History of Numerical Analysis From the 16th
Through the 19th Century. Springer-Verlag, New York, 1977. xiv+348 pp.
ISBN 0-387-90277-5.
[462] Herman H. Goldstine and John von Neumann. Numerical inverting of ma-
trices of high order II. Proc. Amer. Math. Soc., 2:188-202, 1951. Reprinted
in [995, pp. 558–572].
[463] G. H. Golub. Numerical methods for solving linear least squares problems.
Numer. Math., 7:206–216, 1965.
[464] G. H. Golub, S. Nash, and C. F. Van Loan. A Hessenberg-Schur method
for the problem AX + XB = C. IEEE Trans. Automat. Control, AC-24(6):
909-913, 1979.
[465] G. H. Golub and J. M. Varah. On a characterization of the best l_2-scaling
of a matrix. SIAM J. Numer. Anal., 11(3):472–479, 1974.
[466] G. H. Golub and J. H. Wilkinson. Note on the iterative refinement of least
squares solution. Numer. Math., 9:139–148, 1966.
[467] G. H. Golub and J. H. Wilkinson. Ill-conditioned eigensystems and the com-
putation of the Jordan canonical form. SIAM Rev., 18(4):578–619, 1976.
[468] Gene H. Golub. Bounds for the round-off errors in the Richardson second
order method. BIT, 2:212–223, 1962.
[469] Gene H. Golub and Charles F. Van Loan. Unsymmetric positive definite
linear systems. Linear Algebra Appl., 28:85-97, 1979.
[470] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Second edi-
tion, Johns Hopkins University Press, Baltimore, MD, USA, 1989. xix+642
pp. ISBN 0-8018-3772-3 (hardback), 0-8018-3739-1 (paperback).
[471] R. Goodman and A. Feldstein. Round-off error in products. Computing, 15:
263-273, 1975.
[472] R. H. Goodman, A. Feldstein, and J. Bustoz. Relative error in floating-point
multiplication. Computing, 35:127–139, 1985.
[473] James H. Goodnight. A tutorial on the SWEEP operator. Amer. Statist., 33
(3):149-158, 1979.
[474] N. I. M. Gould. On growth in Gaussian elimination with complete pivoting.
SIAM J. Matrix Anal. Appl., 12(2):354-361, 1991.
[475] W. Govaerts and J. D. Pryce. Block elimination with one iterative refinement
solves bordered linear systems accurately. BIT, 30:490-507, 1990.
[476] W. B. Gragg and G. W. Stewart. A stable variant of the secant method for
solving nonlinear equations. SIAM J. Numer. Anal., 13:889-903, 1976.
[477] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathe-
matics: A Foundation for Computer Science. Addison-Wesley, Reading, MA,
USA, 1989. xiii+625 pp. ISBN 0-201-14236-8.
[478] Andrew Granville. Zaphod Beeblebrox’s brain and the fifty-ninth row of
Pascal’s triangle. Amer. Math. Monthly, 99:318-331, 1992.
[479] A. Greenbaum. Behavior of slightly perturbed Lanczos and conjugate-
gradient recurrences. Linear Algebra Appl., 113:7-63, 1989.
[480] A. Greenbaum and Z. Strakos. Predicting the behavior of finite precision
Lanczos and conjugate gradient computations. SIAM J. Matrix Anal. Appl.,
13(1):121-137, 1992.
[481] Anne Greenbaum. The Lanczos and conjugate gradient algorithms in finite
precision arithmetic. In Proceedings of the Cornelius Lanczos International
Centenary Conference, J. David Brown, Moody T. Chu, Donald C. Ellison,
and Robert J. Plemmons, editors, Society for Industrial and Applied Math-
ematics, Philadelphia, PA, USA, 1994, pages 49–60.
[482] Robert T. Gregory and David L. Karney. A Collection of Matrices for Testing
Computational Algorithms. Wiley, New York, 1969. ix+154 pp. Reprinted
with corrections by Robert E. Krieger, Huntington, New York, 1978. ISBN
0-88275-649-4.
[483] Roger G. Grimes and John G. Lewis. Condition number estimation for sparse
matrices. SIAM J. Sci. Statist. Comput., 2(4):384–388, 1981.
[484] Ming Gu, James W. Demmel, and Inderjit Dhillon. Efficient computation of
the singular value decomposition with applications to least squares problems.
Technical Report CS-94-257, Department of Computer Science, University of
Tennessee, Knoxville, TN, USA, October 1994. 19 pp. LAPACK Working
Note 88.
[485] Thorkell Gudmundsson, Charles Kenney, and Alan J. Laub. Small-sample
statistical estimates for matrix norms. SIAM J. Matrix Anal. Appl., 16(3):
776–792, 1995.
[486] Heinrich W. Guggenheimer, Alan S. Edelman, and Charles R. Johnson. A
simple estimate of the condition number of a linear system. College Mathe-
matics Journal, 26(1):2–5, 1995.
[487] Mårten Gulliksson. Iterative refinement for constrained and weighted linear
least squares. BIT, 34:239-253, 1994.
[488] Mårten Gulliksson. Backward error analysis for the constrained and weighted
linear least squares problem when using the weighted QR factorization. SIAM
J. Matrix Anal. Appl., 16(2):675-687, 1995.
[489] Mårten Gulliksson and Per-Åke Wedin. Modifying the QR-decomposition to
constrained and weighted linear least squares. SIAM J. Matrix Anal. Appl.,
13(4):1298-1313, 1992.
[490] Chaya Gurwitz. A test for cancellation errors in quasi-Newton methods.
SIAM J. Sci. Statist. Comput., 18(2):134-140, 1992.
[491] John L. Gustafson and Srinivas Aluru. Massively parallel searching for better
algorithms or, how to do a cross product with five multiplications. Scientific
Programming, 1996. To appear.
[492] William W. Hager. Condition estimates. SIAM J. Sci. Statist. Comput., 5
(2):311-316, 1984.
[493] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Spring-
er-Verlag, Berlin, 1991. xv+601 pp. ISBN 3-540-53775-9.
[494] Marshall Hall, Jr. Combinatorial Theory. Blaisdell, Waltham, MA, USA,
1967. x+310 pp.
[495] Hozumi Hamada. A new real number representation and its operation. In
Proceedings of the Eighth Symposium on Computer Arithmetic, Como, Italy,
Mary Jane Irwin and Renato Stefanelli, editors, IEEE Computer Society,
Washington, DC, 1987, pages 153–157.
[496] S. J. Hammarling. Numerical solution of the stable, non-negative definite
Lyapunov equation. IMA J. Numer. Anal., 2:303-323, 1982.
[497] S. J. Hammarling and J. H. Wilkinson. The practical behaviour of linear iter-
ative methods with particular reference to S.O.R. Report NAC 69, National
Physical Laboratory, Teddington, UK, September 1976. 19 pp.
[498] Sven Hammarling. A note on modifications to the Givens plane rotation. J.
Inst. Maths Applics, 13:215-218, 1974.
[499] Stephen M. Hammel, James A. Yorke, and Celso Grebogi. Numerical orbits
of chaotic processes represent true orbits. Bull. Amer. Math. Soc., 19(2):
465–469, 1988.
[500] Rolf Hammer, Matthias Hocks, Ulrich Kulisch, and Dietmar Ratz. Numerical
Toolbox for Verified Computing I. Basic Numerical Problems: Theory, Algo-
rithms, and Pascal-XSC Programs. Springer-Verlag, Berlin, 1993. xiii+337
pp. ISBN 3-540-57118-3.
[501] R. W. Hamming. Numerical Methods for Scientists and Engineers. Second
edition, McGraw-Hill, New York, 1973. ix+721 pp. ISBN 0-07-025887-2.
[502] G. H. Hardy. A Course of Pure Mathematics. Tenth edition, Cambridge
University Press, Cambridge, UK, 1967. xii+509 pp. ISBN 0-521-09227-2.
[503] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Second edition,
Cambridge University Press, Cambridge, UK, 1952. xii+324 pp.
[504] Richard Harter. The optimality of Winograd’s formula. Comm. ACM, 15
(5):352, 1972.
[505] D. J. Hartfiel. Concerning the solution set of Ax = b where P ≤ A ≤ Q and
p ≤ b ≤ q. Numer. Math., 35:355-359, 1980.
[506] A Manual of Operation for the Automatic Sequence Controlled Calculator.
Harvard University Press, Cambridge, MA, USA, 1946. Reprinted, with
new foreword and introduction, Volume 8 in the Charles Babbage Institute
Reprint Series for the History of Computing, MIT Press, Cambridge, MA,
USA, 1985. xxxii+561 pp. ISBN 0-262-01084-4.
[507] Proceedings of a Symposium on Large-Scale Digital Calculating Machinery,
volume 16 of The Annals of the Computation Laboratory of Harvard Univer-
sity. Harvard University Press, Cambridge, MA, USA, 1948. Reprinted, with
a new introduction by William Aspray, Volume 7 in the Charles Babbage In-
stitute Reprint Series for the History of Computing, MIT Press, Cambridge,
MA, USA, 1985. xxix+302 pp. ISBN 0-262-08152-0.
[508] John Z. Hearon. Nonsingular solutions of TA – BT = C. Linear Algebra
Appl., 16:57-63, 1977.
[509] M. T. Heath, G. A. Geist, and J. B. Drake. Early experience with the Intel
iPSC/860 at Oak Ridge National Laboratory. Report ORNL/TM-11655,
Oak Ridge National Laboratory, Oak Ridge, TN, USA, September 1990. 26 pp.
[510] Michael T. Heath. Numerical methods for large sparse linear least squares
problems. SIAM J. Sci. Statist. Comput., 5(3):497–513, 1984.
[511] Piet Hein. Grooks. Number 85 in Borgen’s Pocketbooks. Second edition,
Narayana Press, Gylling, 1992. 53 pp. ISBN 87-418-1079-1.
[512] H. V. Henderson and S. R. Searle. On deriving the inverse of a sum of
matrices. SIAM Rev., 23(1):53-60, 1981.
[513] Harold V. Henderson, Friedrich Pukelsheim, and Shayle R. Searle. On the
history of the Kronecker product. Linear and Multilinear Algebra, 14:113-
120, 1983.
[514] Harold V. Henderson and S. R. Searle. The vec-permutation matrix, the vec
operator and Kronecker products: A review. Linear and Multilinear Algebra,
9:271-288, 1981.
[515] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann, San Mateo, CA, USA, 1990.
xxviii+594+appendices pp. ISBN 1-55860-188-0.
[516] Peter Henrici. Bounds for iterates, inverses, spectral variation and fields of
values of non-normal matrices. Numer. Math., 4:24-40, 1962.
[517] Peter Henrici. Discrete Variable Methods in Ordinary Differential Equations.
John Wiley, New York, 1962. xi+407 pp.
[518] Peter Henrici. Error Propagation for Difference Methods. John Wiley, New
York, 1963. vi+73 pp.
[519] Peter Henrici. Elements of Numerical Analysis. Wiley, New York, 1964.
xv+328 pp.
[520] Peter Henrici. Test of probabilistic models for the propagation of roundoff
errors. Comm. ACM, 9(6):409-410, 1966.
[521] Peter Henrici. A model for the propagation of rounding error in floating
arithmetic. In Interval Mathematics 1980, Karl L. E. Nickel, editor, Academic
Press, New York, 1980, pages 49-73.
[522] Gary Hewer and Charles Kenney. The sensitivity of the stable Lyapunov
equation. SIAM J. Control Optim., 26(2):321–344, 1988.
[523] HP-15C Advanced Functions Handbook. Hewlett-Packard, Portable Com-
puter Division, Corvallis, OR, USA, 1982. 221 pp. Part number 00015-90011
Rev. C.
[524] HP 48G Series User’s Guide. Hewlett-Packard, Corvallis Division, Corvallis,
OR, USA, 1993. Part number 00048-90126, Edition 3.
[525] Desmond J. Higham. Remark on Algorithm 669. ACM Trans. Math. Soft-
ware, 17(3):424-426, 1991.
[526] Desmond J. Higham. Condition numbers and their condition numbers. Linear
Algebra Appl., 214:193-213, 1995.
[527] Desmond J. Higham and Nicholas J. Higham. Backward error and condition
of structured linear systems. SIAM J. Matrix Anal. Appl., 13(1):162-175,
January 1992.
[528] Desmond J. Higham and Nicholas J. Higham. Componentwise perturbation
theory for linear systems with multiple right-hand sides. Linear Algebra
Appl., 174:111-129, 1992.
[529] Desmond J. Higham and Lloyd N. Trefethen. Stiffness of ODEs. BIT, 33:
285-303, 1993.
[530] Nicholas J. Higham. Computing the polar decomposition—with applications.
SIAM J. Sci. Statist. Comput., 7(4):1160-1174, 1986.
[531] Nicholas J. Higham. Efficient algorithms for computing the condition number
of a tridiagonal matrix. SIAM J. Sci. Statist. Comput., 7(1):150-165, 1986.
[532] Nicholas J. Higham. Computing real square roots of a real matrix. Linear
Algebra Appl., 88/89:405-430, 1987.
[533] Nicholas J. Higham. Error analysis of the Björck-Pereyra algorithms for
solving Vandermonde systems. Numer. Math., 50(5):613-632, 1987.
[534] Nicholas J. Higham. A survey of condition number estimation for triangular
matrices. SIAM Rev., 29(4):575–596, 1987.
[535] Nicholas J. Higham. Computing a nearest symmetric positive semidefinite
matrix. Linear Algebra Appl., 103:103–118, 1988.
[536] Nicholas J. Higham. Fast solution of Vandermonde-like systems involving
orthogonal polynomials. IMA J. Numer. Anal., 8:473-486, 1988.
[537] Nicholas J. Higham. FORTRAN codes for estimating the one-norm of a real
or complex matrix, with applications to condition estimation (Algorithm
674). ACM Trans. Math. Software, 14(4):381-396, 1988.
[538] Nicholas J. Higham. The accuracy of solutions to triangular systems. SIAM
J. Numer. Anal., 26(5):1252-1265, 1989.
[539] Nicholas J. Higham. Matrix nearness problems and applications. In Appli-
cations of Matrix Theory, M. J. C. Gover and S. Barnett, editors, Oxford
University Press, Oxford, UK, 1989, pages 1–27.
[540] Nicholas J. Higham. Analysis of the Cholesky decomposition of a semi-
definite matrix. In Reliable Numerical Computation, M. G. Cox and S. J.
Hammarling, editors, Oxford University Press, Oxford, UK, 1990, pages 161–
185.
[541] Nicholas J. Higham. Bounding the error in Gaussian elimination for tridiag-
onal systems. SIAM J. Matrix Anal. Appl., 11(4):521–530, 1990.
[542] Nicholas J. Higham. Computing error bounds for regression problems. In
Statistical Analysis of Measurement Error Models and Applications, Con-
temporary Mathematics 112, Philip J. Brown and Wayne A. Fuller, editors,
American Mathematical Society, Providence, RI, USA, 1990, pages 195-208.
[543] Nicholas J. Higham. Experience with a matrix norm estimator. SIAM J.
Sci. Statist. Comput., 11(4):804-809, 1990.
[544] Nicholas J. Higham. Exploiting fast matrix multiplication within the level 3
BLAS. ACM Trans. Math. Software, 16(4):352-368, 1990.
[545] Nicholas J. Higham. How accurate is Gaussian elimination? In Numerical
Analysis 1989, Proceedings of the 13th Dundee Conference, D. F. Griffiths and
G. A. Watson, editors, volume 228 of Pitman Research Notes in Mathematics,
Longman Scientific and Technical, Essex, UK, 1990, pages 137-154.
[546] Nicholas J. Higham. Iterative refinement enhances the stability of QR fac-
torization methods for solving linear equations. Numerical Analysis Report
No. 182, University of Manchester, Manchester, England, April 1990.
[547] Nicholas J. Higham. Stability analysis of algorithms for solving confluent
Vandermonde-like systems. SIAM J. Matrix Anal. Appl., 11(1):23-41, 1990.
[548] Nicholas J. Higham. Algorithm 694: A collection of test matrices in MAT-
LAB. ACM Trans. Math. Software, 17(3):289-305, 1991.
[549] Nicholas J. Higham. Iterative refinement enhances the stability of QR fac-
torization methods for solving linear equations. BIT, 31:447–468, 1991.
[550] Nicholas J. Higham. Three measures of precision in floating point arithmetic.
NA Digest, Volume 91, Issue 16, 1991. Electronic mail magazine:
[email protected].
[551] Nicholas J. Higham. Estimating the matrix p-norm. Numer. Math., 62:
539-555, 1992.
[552] Nicholas J. Higham. Stability of a method for multiplying complex matrices
with three real matrix multiplications. SIAM J. Matrix Anal. Appl., 13(3):
681–687, 1992.
[553] Nicholas J. Higham. The accuracy of floating point summation. SIAM J.
Sci. Comput., 14(4):783-799, 1993.
[554] Nicholas J. Higham. Handbook of Writing for the Mathematical Sciences.
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA,
1993. xii+241 pp. ISBN 0-89871-314-5.
[555] Nicholas J. Higham. Optimization by direct search in matrix computations.
SIAM J. Matrix Anal. Appl., 14(2):317-333, 1993.
[556] Nicholas J. Higham. Perturbation theory and backward error for AX - XB =
C. BIT, 33:124-136, 1993.
[557] Nicholas J. Higham. The matrix sign decomposition and its relation to the
polar decomposition. Linear Algebra Appl., 212/213:3–20, 1994.
[558] Nicholas J. Higham. A survey of componentwise perturbation theory in
numerical linear algebra. In Mathematics of Computation 1943–1993: A Half
Century of Computational Mathematics, Walter Gautschi, editor, volume 48
of Proceedings of Symposia in Applied Mathematics, American Mathematical
Society, Providence, RI, USA, 1994, pages 49-77.
[559] Nicholas J. Higham. Stability of the diagonal pivoting method with partial
pivoting. Numerical Analysis Report No. 265, University of Manchester,
Manchester, England, July 1995. 17 pp.
[560] Nicholas J. Higham. Stability of parallel triangular system solvers. SIAM J.
Sci. Comput., 16(2):400-413, 1995.
[561] Nicholas J. Higham. The Test Matrix Toolbox for MATLAB, version 3.0.
Numerical Analysis Report No. 276, University of Manchester, Manchester,
England, September 1995.
[562] Nicholas J. Higham and Desmond J. Higham. Large growth factors in Gaus-
sian elimination with pivoting. SIAM J. Matrix Anal. Appl., 10(2):155–164,
1989.
[563] Nicholas J. Higham and Philip A. Knight. Componentwise error analysis
for stationary iterative methods. In Linear Algebra, Markov Chains, and
Queueing Models, Carl D. Meyer and Robert J. Plemmons, editors, volume 48
of IMA Volumes in Mathematics and its Applications, Springer-Verlag, New
York, 1993, pages 29-46.
[564] Nicholas J. Higham and Philip A. Knight. Finite precision behavior of sta-
tionary iteration for solving singular systems. Linear Algebra Appl., 192:
165–186, 1993.
[565] Nicholas J. Higham and Philip A. Knight. Matrix powers in finite precision
arithmetic. SIAM J. Matrix Anal. Appl., 16(2):343–358, 1995.
[566] Nicholas J. Higham and Pythagoras Papadimitriou. A new parallel algorithm
for computing the singular value decomposition. In Proceedings of the Fifth
SIAM Conference on Applied Linear Algebra, John G. Lewis, editor, Society
for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1994, pages
80-84.
[567] Nicholas J. Higham and Pythagoras Papadimitriou. A parallel algorithm
for computing the polar decomposition. Parallel Comput., 20(8):1161-1173,
1994.
[568] Nicholas J. Higham and Alex Pothen. Stability of the partitioned inverse
method for parallel solution of sparse triangular systems. SIAM J. Sci. Com-
put., 15(1):139-148, 1994.
[569] Nicholas J. Higham and G. W. Stewart. Numerical linear algebra in statistical
computing. In The State of the Art in Numerical Analysis, A. Iserles and
M. J. D. Powell, editors, Oxford University Press, Oxford, UK, 1987, pages
41-57.
[570] David Hilbert. Ein Beitrag zur Theorie des Legendre’schen Polynoms. Acta
Math., 18:155–159, 1894.
[571] F. B. Hildebrand. Introduction to Numerical Analysis. Second edition, Mc-
Graw-Hill, New York, 1974. xiii+669 pp. Reprinted by Dover, New York,
1987. ISBN 0-486-65363-3.
[572] Marlis Hochbruck and Gerhard Starke. Preconditioned Krylov subspace
methods for Lyapunov matrix equations. SIAM J. Matrix Anal. Appl., 16
(1):156-171, 1995.
[573] R. W. Hockney and C. R. Jesshope. Parallel Computers 2: Architecture,
Programming and Algorithms. Adam Hilger, Bristol, 1988. xv+625 pp. ISBN
0-85274-812-4.
[574] A. Scottedward Hodel. Recent applications of the Lyapunov equation in
control theory. In Iterative Methods in Linear Algebra, R. Beauwens and
P. de Groen, editors, Elsevier (North-Holland), Amsterdam, The Nether-
lands, 1992, pages 217-227.
[575] Andrew Hodges. Alan Turing: The Enigma. Burnett Books, London, 1983.
1992 edition with preface, Vintage, London. xix+586 pp. ISBN 0-09-911641-
3.
[576] Christoph M. Hoffmann. The problems of accuracy and robustness in geo-
metric computation. Computer, March:31–41, 1989.
[577] W. Hoffmann. Solving linear systems on a vector computer. J. Comput.
Appl. Math., 18:353-367, 1987.
[578] W. Hoffmann. Iterative algorithms for Gram-Schmidt orthogonalization.
Computing, 41:335-348, 1989.
[579] R. C. Holt and J. R. Cordy. The Turing programming language. Comm.
ACM, 31(12):1410-1423, 1988.
[580] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge Uni-
versity Press, Cambridge, UK, 1985. xiii+561 pp. ISBN 0-521-30586-1.
[581] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cam-
bridge University Press, Cambridge, UK, 1991. viii+607 pp. ISBN 0-521-
30587-X.
[582] Jim Horning. Note on program reliability. ACM SIGSOFT Software Engi-
neering Notes, 4(4):6, 1979. Cited in [1021].
[583] Harold Hotelling. Some new methods in matrix calculation. Ann. Math.
Statist., 14(1):1-34, 1943.
[584] David Hough. Applications of the proposed IEEE 754 standard for floating-
point arithmetic. Computer, 14:70-74, 1981.
[585] David Hough. Random story. NA Digest, Volume 89, Issue 1, 1989. Electronic
mail magazine: [email protected].
[586] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix.
J. Assoc. Comput. Mach., 5:339-342, 1958.
[587] Alston S. Householder. The Theory of Matrices in Numerical Analysis. Blais-
dell, New York, 1964. xi+257 pp. Reprinted by Dover, New York, 1975. ISBN
0-486-61781-5.
[588] D. Y. Hu and L. Reichel. Krylov subspace methods for the Sylvester equation.
Linear Algebra Appl., 172:283-313, 1992.
[589] T. E. Hull. Correctness of numerical software. In Performance Evaluation of
Numerical Software, Lloyd D. Fosdick, editor, North-Holland, Amsterdam,
The Netherlands, 1979, pages 3–15.
[590] T. E. Hull. Precision control, exception handling and a choice of numerical
algorithms. In Numerical Analysis Proceedings, Dundee 1981, G. A. Watson,
editor, volume 912 of Lecture Notes in Mathematics, Springer-Verlag, Berlin,
1982, pages 169-178.
[591] T. E. Hull, A. Abraham, M. S. Cohen, A. F. X. Curley, C. B. Hall, D. A.
Penny, and J. T. M. Sawchuk. Numerical TURING. ACM SIGNUM Newslet-
ter, 20(3):26–34, 1985.
[592] T. E. Hull, Thomas F. Fairgrieve, and Ping Tak Peter Tang. Implementing
complex elementary functions using exception handling. ACM Trans. Math.
Software, 20(2):215-244, 1994.
[593] T. E. Hull and J. R. Swenson. Tests of probabilistic models for propagation
of roundoff errors. Comm. ACM, 9(2):108–113, 1966.
[594] M. A. Hyman. Eigenvalues and eigenvectors of general matrices. Presented
at the 12th National Meeting of the Association for Computing Machinery,
Houston, Texas, 1957. Cited in [1088].
[595] IBM. Engineering and Scientific Subroutine Library, Guide and Reference,
Release 3. Fourth Edition (Program Number 5668-863), 1988.
[596] AIX Version 3.2 for RISC System/6000: Optimization and Tuning Guide for
Fortran, C, and C++. IBM, December 1993. viii+305 pp. Publication No.
SC09-1705-00.
[597] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard
754-1985. Institute of Electrical and Electronics Engineers, New York, 1985.
Reprinted in SIGPLAN Notices, 22(2):9-25, 1987.
[598] A Radix-Independent Standard for Floating-Point Arithmetic, IEEE Stan-
dard 854-1987. IEEE Computer Society, New York, 1987.
[599] IEEE Computer Society Microprocessor Standards Committee, Floating-
Point Working Group. A proposed standard for binary floating-point arith-
metic, Draft 8.0 of IEEE Task P754 (with introductory comments by David
Stevenson). Computer, 14:51-62, 1981.
[600] Yasuhiko Ikebe. On inverses of Hessenberg matrices. Linear Algebra Appl.,
24:93-97, 1979.
[601] ILAS Education Committee. Report on graduate linear algebra courses.
Manuscript from the International Linear Algebra Society, November 1993.
URL = http://gauss.technion.ac.il/iic/GRAD-ED.SYLLABI. 4 pp.
[602] Cray Research Inc. UNICOS Math and Scientific Library Reference Manual.
Number SR-2081, Version 5.0, Eagan, MN, USA, 1989.
[603] D. C. Ince, editor. Collected Works of A.M. Turing: Mechanical Intelligence.
North-Holland, Amsterdam, The Netherlands, 1992. xix+227 pp. ISBN
0-444-88058-5.
[604] F. Incertis. A faster method of computing matrix Pythagorean sums. IEEE
Trans. Automat. Control, AC-30(3):273-275, 1985.
[605] Ilse C. F. Ipsen and Carl D. Meyer. Uniform stability of Markov chains.
SIAM J. Matrix Anal. Appl., 15(4):1061-1074, 1994.
[606] Masao Iri. History of automatic differentiation and rounding error estimation.
In Automatic Differentiation of Algorithms: Theory, Implementation, and
Application, Andreas Griewank and George F. Corliss, editors, Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, 1991, pages
3-16.
[607] Eugene Isaacson and Herbert Bishop Keller. Analysis of Numerical Methods.
Wiley, New York, 1966. xv+541 pp. Reprinted by Dover, New York, 1994.
ISBN 0-486-68029-0.
[608] William Jalby and Bernard Philippe. Stability analysis and improvement of
the block Gram–Schmidt algorithm. SIAM J. Sci. Statist. Comput., 12(5):
1058–1073, 1991.
[609] M. Jankowski, A. Smoktunowicz, and H. Woźniakowski. A note on floating-
point summation of very many terms. J. Inform. Process. Cybernet., 19(9):
435-440, 1983.
[610] M. Jankowski and H. Woźniakowski. Iterative refinement implies numerical
stability. BIT, 17:303–311, 1977.
[611] M. Jankowski and H. Woźniakowski. The accurate solution of certain con-
tinuous problems using only single precision arithmetic. BIT, 25:635–651,
1985.
[612] Paul Jansen and Peter Weidner. High-accuracy arithmetic software—some
tests of the ACRITH problem-solving routines. ACM Trans. Math. Software,
12(1):62–70, 1986.
[613] A. Jennings. Bounds for the singular values of a matrix. IMA J. Numer.
Anal., 2:459-474, 1982.
[614] L. S. Jennings and M. R. Osborne. A direct error analysis for least squares.
Numer. Math., 22:325-332, 1974.
[615] Mark T. Jones and Merrell L. Patrick. Bunch-Kaufman factorization for real
symmetric indefinite banded matrices. SIAM J. Matrix Anal. Appl., 14(2):
553–559, 1993.
[616] Mark T. Jones and Merrell L. Patrick. Factoring symmetric indefinite matri-
ces on high-performance architectures. SIAM J. Matrix Anal. Appl., 15(1):
273–283, 1994.
[617] William B. Jones and W. J. Thron. Numerical stability in evaluating con-
tinued fractions. Math. Comp., 28(127):795-810, 1974.
[618] T. L. Jordan. Experiments on error growth associated with some linear least-
squares procedures. Math. Comp., 22:579–588, 1968.
[619] George Gheverghese Joseph. The Crest of the Peacock: Non-European Roots
of Mathematics. Penguin, New York, 1991. xv+371 pp. ISBN 0-14012529-9.
[620] Bo Kågström. A perturbation analysis of the generalized Sylvester equation
(AR – LB, DR – LE) = (C, F). SIAM J. Matrix Anal. Appl., 15(4):1045-
1060, 1994.
[621] Bo Kågström and Peter Poromaa. Distributed and shared memory block
algorithms for the triangular Sylvester equation with sep^{-1} estimators. SIAM
J. Matrix Anal. Appl., 13(1):90-101, 1992.
[622] Bo Kågström and Peter Poromaa. LAPACK-style algorithms and software
for solving the generalized Sylvester equation and estimating the separation
between regular matrix pairs. Report UMINF 93.23, Institute of Information
Processing, University of Umeå, Umeå, Sweden, December 1993. 35 pp.
LAPACK Working Note 75.
[623] Bo Kågström and Peter Poromaa. Computing eigenspaces with specified
eigenvalues of a regular matrix pair (A, B) and condition estimation: Theory,
algorithms and software. Report UMINF 94.04, Institute of Information
Processing, University of Umeå, Umeå, Sweden, September 1994. 65 pp.
LAPACK Working Note 87.
[624] Bo Kågström and Lars Westin. Generalized Schur methods with condition
estimators for solving the generalized Sylvester equation. IEEE Trans. Au-
tomat. Control, AC-34(7):745-751, 1989.
[625] W. Kahan. Further remarks on reducing truncation errors. Comm. ACM, 8
(1):40, 1965.
[626] W. Kahan. Numerical linear algebra. Canad. Math. Bull., 9:757-801, 1966.
[627] W. Kahan. A survey of error analysis. In Proc. IFIP Congress, Ljubljana,
Information Processing 71, North-Holland, Amsterdam, The Netherlands,
1972, pages 1214-1239.
[628] W. Kahan. Implementation of algorithms (lecture notes by W. S. Haugeland
and D. Hough). Technical Report 20, Department of Computer Science,
University of California, Berkeley, CA, USA, 1973.
[629] W. Kahan. Interval arithmetic options in the proposed IEEE floating point
arithmetic standard. In Interval Mathematics 1980, Karl L. E. Nickel, editor,
Academic Press, New York, 1980, pages 99-128.
[630] W. Kahan. Why do we need a floating-point arithmetic standard? Technical
report, University of California, Berkeley, CA, USA, February 1981. 41 pp.
[631] W. Kahan. To solve a real cubic equation. Technical Report PAM-352,
Center for Pure and Applied Mathematics, University of California, Berkeley,
CA, USA, November 1986. 20 pp.
[632] W. Kahan. Branch cuts for complex elementary functions or much ado about
nothing’s sign bit. In The State of the Art in Numerical Analysis, A. Iserles
and M. J. D. Powell, editors, Oxford University Press, Oxford, UK, 1987,
pages 165–211.
[633] W. Kahan. Doubled-precision IEEE standard 754 floating-point arithmetic.
Manuscript, February 1987.
[634] W. Kahan. How Cray’s arithmetic hurts scientific computation (and what
might be done about it). Manuscript prepared for the Cray User Group
meeting in Toronto, June 1990. 42 pp.
[635] W. Kahan. Analysis and refutation of the LCAS. ACM SIGNUM Newsletter,
26(3):2–15, 1991.
[636] W. Kahan. Computer benchmarks versus accuracy. Draft manuscript, June
1994.
[637] W. Kahan and I. Farkas. Algorithm 168: Newton interpolation with back-
ward divided differences. Comm. ACM, 6(4):165, 1963.
[638] W. Kahan and I. Farkas. Algorithm 169: Newton interpolation with forward
divided differences. Comm. ACM, 6(4):165, 1963.
[639] W. Kahan and E. LeBlanc. Anomalies in the IBM ACRITH package. In
Proceedings of the 7th Symposium on Computer Arithmetic, Kai Hwang, ed-
itor, IEEE Computer Society Press, Silver Spring, MD, USA, 1985, pages
322-331.
[640] W. Kahan and J. Palmer. On a proposed floating-point standard. ACM
SIGNUM Newsletter, 14:13-21, October 1979.
[641] David K. Kahaner, Cleve B. Moler, and Stephen G. Nash. Numerical Methods
and Software. Prentice-Hall, Englewood Cliffs, NJ, USA, 1989. xii+495 pp.
ISBN 0-13-627258-4.
[642] Ilkka Karasalo. A criterion for truncation of the QR-decomposition algorithm
for the singular linear least squares problem. BIT, 14:156-166, 1974.
[643] A. Karatsuba and Yu. Ofman. Multiplication of multidigit numbers on au-
tomata. Soviet Phys. Dokl., 7(7):595–596, January 1963.
[644] Samuel Karlin. Total Positivity, volume 1. Stanford University Press, Stan-
ford, CA, USA, 1968.
[645] Richard Karpinski. Paranoia: A floating-point benchmark. BYTE, 10(2):
223-235, 1985.
[646] Tosio Kato. Perturbation Theory for Linear Operators. Second edition,
Springer-Verlag, Berlin, 1976. xxi+619 pp. ISBN 3-540-97588-5.
[647] Linda Kaufman. Matrix methods for queuing problems. SIAM J. Sci. Statist.
Comput., 4(3):525-552, 1983.
[648] Herbert B. Keller. On the solution of singular and semidefinite linear systems
by iteration. SIAM J. Numer. Anal., 2(2):281-290, 1965.
[649] William J. Kennedy, Jr. and James E. Gentle. Statistical Computing. Marcel
Dekker, New York, 1980. xi+591 pp. ISBN 0-8247-6898-1.
[650] Charles Kenney and Gary Hewer. The sensitivity of the algebraic and differ-
ential Riccati equations. SIAM J. Control Optim., 28(1):50-69, 1990.
[651] Charles Kenney and Alan J. Laub. Controllability and stability radii for
companion form systems. Math. Control Signals Systems, 1:239-256, 1988.
[652] Charles S. Kenney and Alan J. Laub. Small-sample statistical condition
estimates for general matrix functions. SIAM J. Sci. Comput., 15(1):36-61,
1994.
[653] Charles S. Kenney, Alan J. Laub, and Philip M. Papadopoulos. Matrix sign
algorithms for Riccati equations. IMA J. of Math. Control Inform., 9:331-
344, 1992.
[654] Thomas H. Kerr. Fallacies in computational testing of matrix positive defi-
niteness/semidefiniteness. IEEE Trans. Aerospace Electron. Systems, 26(2):
415–421, 1990.
[655] A. Kiełbasiński. Summation algorithm with corrections and some of its
applications. Math. Stos., 1:22–41, 1973. (In Polish, cited in [609] and [611]).
[656] A. Kiełbasiński. Iterative refinement for linear systems in variable
precision arithmetic. BIT, 21:97-103, 1981.
[657] A. Kiełbasiński. A note on rounding-error analysis of Cholesky factor-
ization. Linear Algebra Appl., 88/89:487–494, 1987.
[658] Andrzej Kiełbasiński and Hubert Schwetlick. Numerische Lineare Algebra:
Eine Computerorientierte Einführung. VEB Deutscher Verlag der Wissenschaften,
Berlin, 1988. 472 pp. ISBN 3-87144999-7.
[659] Andrzej Kiełbasiński and Hubert Schwetlick. Numeryczna Algebra Liniowa:
Wprowadzenie do Obliczeń Zautomatyzowanych. Wydawnictwa Naukowo-
Techniczne, Warszawa, 1992. 502 pp. ISBN 83-2041260-9.
[660] Fuad Kittaneh. Singular values of companion matrices and bounds on zeros
of polynomials. SIAM J. Matrix Anal. Appl., 16(1):333–340, 1995.
[661] R. Klatte, U. W. Kulisch, C. Lawo, M. Rauch, and A. Wiethoff. C-XSC:
A C++ Class Library for Extended Scientific Computing. Springer-Verlag,
Berlin, 1993. ISBN 0-387-56328-8.
[662] R. Klatte, U. W. Kulisch, M. Neaga, D. Ratz, and Ch. Ullrich. PASCAL-
XSC—Language Reference With Examples. Springer-Verlag, Berlin, 1992.
[663] Philip A. Knight. Error Analysis of Stationary Iteration and Associated
Problems. Ph.D. thesis, University of Manchester, Manchester, England,
September 1993. 135 pp.
[664] Philip A. Knight. Fast rectangular matrix multiplication and QR decompo-
sition. Linear Algebra Appl., 221:69–81, 1995.
[665] Donald E. Knuth. Evaluation of polynomials by computer. Comm. ACM, 5
(12):595-599, 1962.
[666] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley,
Reading, MA, USA, 1973-1981. Three volumes.
[667] Donald E. Knuth. The Art of Computer Programming, Volume 1, Funda-
mental Algorithms. Second edition, Addison-Wesley, Reading, MA, USA,
1973. xxi+634 pp. ISBN 0-201-03821-8.
[668] Donald E. Knuth. The Art of Computer Programming, Volume 2, Seminu-
merical Algorithms. Second edition, Addison-Wesley, Reading, MA, USA,
1981. xiii+688 pp. ISBN 0-201-03822-6.
[669] Donald E. Knuth. Two notes on notation. Amer. Math. Monthly, 99(5):
403-422, 1992.
[670] T. W. Körner. Fourier Analysis. Cambridge University Press, Cambridge,
UK, 1988. xii+591 pp. ISBN 0521389917.
[671] Eric Kostlan. On the spectra of Gaussian matrices. Linear Algebra Appl.,
162-164:385–388, 1992.
[672] Z. V. Kovarik. Compatibility of approximate solutions of inaccurate linear
equations. Linear Algebra Appl., 15:217–225, 1976.
[673] Antoni Kreczmar. On memory requirements of Strassen’s algorithms. In vol-
ume 45 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1976,
pages 404–407.
[674] Ed Krol. The Whole Internet User’s Guide & Catalog. Second edition,
O’Reilly & Associates, Sebastopol, CA, USA, 1994. xxv+543 pp. ISBN
1-56592-063-5.
[675] Koichi Kubota. PADRE2, a Fortran precompiler yielding error estimates
and second derivatives. In Automatic Differentiation of Algorithms: Theory,
Implementation, and Application, Andreas Griewank and George F. Corliss,
editors, Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1991, pages 251-262.
[676] J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the
power and Lanczos algorithms with a random start. SIAM J. Matrix Anal.
Appl., 13(4):1094-1122, 1992.
[677] H. Kuki and W. J. Cody. A statistical study of the accuracy of floating point
number systems. Comm. ACM, 16(4):223–230, 1973.
[678] Ulrich W. Kulisch and Willard L. Miranker. Computer Arithmetic in Theory
and in Practice. Academic Press, New York, 1981. xiii+249 pp. ISBN
0-12-428650-X.
[679] Ulrich W. Kulisch and Willard L. Miranker, editors. A New Approach to
Scientific Computation. Academic Press, New York, 1983. xv+384 pp. ISBN
0-12-428660-7.
[680] Ulrich W. Kulisch and Willard L. Miranker. The arithmetic of the digital
computer: A new approach. SIAM Rev., 28(1):1–40, 1986.
[681] I. B. Kuperman. Approximate Linear Algebraic Equations. Van Nostrand
Reinhold, London, 1971. xi+225 pp. ISBN 0-442-04546-8.
[682] M. La Porte and J. Vignes. Etude statistique des erreurs dans l’arithmétique
des ordinateurs; application au contrôle des résultats d’algorithmes
numériques. Numer. Math., 23:63–72, 1974.
[683] J. D. Laderman. A noncommutative algorithm for multiplying 3 × 3 matrices
using 23 multiplications. Bull. Amer. Math. Soc., 82(1):126–128, 1976.
[684] Julian Laderman, Victor Pan, and Xuan-He Sha. On practical algorithms
for accelerated matrix multiplication. Linear Algebra Appl., 162-164:557–588,
1992.
[685] Peter Lancaster. Explicit solutions of linear matrix equations. SIAM Rev.,
12(4):544–566, 1970.
[686] Cornelius Lanczos. Applied Analysis. Prentice Hall, Englewood Cliffs, NJ,
USA, 1956. xx+539 pp. Reprinted by Dover, New York, 1988. ISBN 0-486-
65656-X.
[687] John L. Larson, Mary E. Pasternak, and John A. Wisniewski. Algorithm
594: Software for relative error analysis. ACM Trans. Math. Software, 9(1):
125–130, 1983.
[688] John L. Larson and Ahmed H. Sameh. Efficient calculation of the effects of
roundoff errors. ACM Trans. Math. Software, 4(3):228–236, 1978. Errata
5(3):372, 1979.
[689] John L. Larson and Ahmed H. Sameh. Algorithms for roundoff error
analysis—A relative error approach. Computing, 24:275–297, 1980.
[690] Lajos László. An attainable lower bound for the best normal approximation.
SIAM J. Matrix Anal. Appl., 15(3):1035-1043, 1994.
[691] Alan J. Laub. A Schur method for solving algebraic Riccati equations. IEEE
Trans. Automat. Control, AC-24(6):913-921, 1979.
[692] Peter Läuchli. Jordan-Elimination und Ausgleichung nach kleinsten
Quadraten. Numer. Math., 3:226-240, 1961.
[693] Simon Lavington. Early British Computers: The Story of Vintage Computers
and the People Who Built Them. Manchester University Press, 1980. 139
pp. ISBN 0-7190-0803-4.
[694] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear
algebra subprograms for Fortran usage. ACM Trans. Math. Software, 5(3):
308-323, 1979.
[695] Charles L. Lawson and Richard J. Hanson. Solving Least Squares Problems.
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA,
1995. xii+340 pp. Corrected republication of work first published in 1974 by
Prentice-Hall. ISBN 0-89871-356-0.
[696] Lam Lay-Yong and Shen Kangshen. Methods of solving linear equations in
traditional China. Historia Mathematica, 16(2):107–122, 1989.
[697] D. H. Lehmer. Tables to many places of decimals. Mathematical Tables and
Other Aids to Computation, 1(1):30-31, 1943.
[698] R. B. Lehoucq. The computation of elementary unitary matrices. Techni-
cal Report CS-94-233, Department of Computer Science, University of Ten-
nessee, Knoxville, TN, USA, May 1994. 9 pp. LAPACK Working Note 72.
[699] Frans Lemeire. Bounds for condition numbers of triangular and trapezoid
matrices. BIT, 15:58–64, 1975.
[700] H. Leuprecht and W. Oberaigner. Parallel algorithms for the rounding exact
summation of floating point numbers. Computing, 28:89-104, 1982.
[701] T. Y. Li and Z. Zeng. Homotopy-determinant algorithm for solving nonsym-
metric eigenvalue problems. Math. Comp., 59(200):483–502, 1992.
[702] Information Technology-Language Independent Arithmetic—Part I: Integer
and Floating Point Arithmetic, Draft International Standard (Version 4.1),
ISO/IEC DIS 10967-1:1993. August 1993.
[703] Seppo Linnainmaa. Analysis of some known methods of improving the accu-
racy of floating-point sums. BIT, 14:167–202, 1974.
[704] Seppo Linnainmaa. Towards accurate statistical estimation of rounding er-
rors in floating-point computations. BIT, 15:165-173, 1975.
[705] Seppo Linnainmaa. Taylor expansion of the accumulated rounding error.
BIT, 16:146-160, 1976.
[706] Seppo Linnainmaa. Software for doubled-precision floating-point computa-
tions. ACM Trans. Math. Software, 7(3):272–283, 1981.
[707] Peter Linz. Accurate floating-point summation. Comm. ACM, 13(6):361–
362, 1970.
[708] Elliot Linzer. On the stability of transform-based circular deconvolution.
SIAM J. Numer. Anal., 29(5):1482-1492, 1992.
[709] Joseph W. H. Liu. A partial pivoting strategy for sparse symmetric matrix
decomposition. ACM Trans. Math. Software, 13(2):173-182, 1987.
[710] Georghios Loizou. Nonnormality and Jordan condition numbers of matrices.
J. Assoc. Comput. Mach., 16(4):580-584, 1969.
[711] J. W. Longley. An appraisal of least squares programs for the electronic
computer from the point of view of the user. J. Amer. Statist. Assoc., 62:
819-841, 1967.
[712] Per Lötstedt. Perturbation bounds for the linear least squares problem sub-
ject to linear inequality constraints. BIT, 23:500-519, 1983.
[713] Per Lötstedt. Solving the minimal least squares problem subject to bounds
on the variables. BIT, 24:206-224, 1984.
[714] Hao Lu. Fast solution of confluent Vandermonde linear systems. SIAM J.
Matrix Anal. Appl., 15(4):1277-1289, 1994.
[715] Hao Lu. Fast algorithms for confluent Vandermonde linear systems and
generalized Trummer’s problem. SIAM J. Matrix Anal. Appl., 16(2):655-
674, 1995.
[716] Hao Lu. Solution of Vandermonde-like systems and confluent Vandermonde-
like systems. SIAM J. Matrix Anal. Appl., 17(1):127-138, 1996.
[717] J. N. Lyness. The effect of inadequate convergence criteria in automatic
routines. Computer Journal, 12:279-281, 1969. See also letter and response
in 13 (1970), p. 121.
[718] J. N. Lyness and C. B. Moler. Van der Monde systems and numerical differ-
entiation. Numer. Math., 8:458–464, 1966.
[719] M. Stuart Lynn. On the round-off error in the method of successive over-
relaxation. Math. Comp., 18(85):36–49, 1964.
[720] Allan J. Macleod. Some statistics on Gaussian elimination with partial piv-
oting. ACM SIGNUM Newsletter, 24(2/3):10-14, 1989.
[721] J. H. Maindonald. Statistical Computation. Wiley, New York, 1984. xviii+370
pp. ISBN 0-471-86452-8.
[722] John Makhoul. Toeplitz determinants and positive semidefiniteness. IEEE
Trans. Signal Processing, 39(3):743-746, 1991.
[723] Michael A. Malcolm. On accurate floating-point summation. Comm. ACM,
14(11):731-736, 1971.
[724] Michael A. Malcolm. Algorithms to reveal properties of floating-point arith-
metic. Comm. ACM, 15(11):949–951, 1972.
[725] Michael A. Malcolm and John Palmer. A fast method for solving a class of
tridiagonal linear systems. Comm. ACM, 17(1):14–17, 1974.
[726] Thomas A. Manteuffel. An interval analysis approach to rank determination
in linear least squares problems. SIAM J. Sci. Statist. Comput., 2(3):335–348,
1981.
[727] John Markoff. Circuit flaw causes Pentium chip to miscalculate, Intel admits.
New York Times, 1994. 24 November.
[728] George Marsaglia and Ingram Olkin. Generating correlation matrices. SIAM
J. Sci. Statist. Comput., 5(2):470-475, 1984.
[729] R. S. Martin, G. Peters, and J. H. Wilkinson. Iterative refinement of the
solution of a positive definite system of equations. Numer. Math., 8:203–216,
1966. Also in [1102, pp. 31–44], Contribution 1/2.
[730] Gleanings far and near. Mathematical Gazette, 22(170):95, 1924.
[731] Roy Mathias. Matrices with positive definite Hermitian part: Inequalities
and linear systems. SIAM J. Matrix Anal. Appl., 13(2):640-654, 1992.
[732] Roy Mathias. Accurate eigensystem computations by Jacobi methods. SIAM
J. Matrix Anal. Appl., 16(3):977-1003, 1995.
[733] Roy Mathias. Analysis of algorithms for orthogonalizing products of unitary
matrices. Numerical Linear Algebra with Applications, 1995. To appear.
[734] Roy Mathias. The instability of parallel prefix matrix multiplication. SIAM
J. Sci. Comput., 16(4):956-973, 1995.
[735] MATLAB User’s Guide. The MathWorks, Inc., Natick, MA, USA, 1992.
[736] Shouichi Matsui and Masao Iri. An overflow/underflow-free floating-point
representation of numbers. J. Inform. Process., 4(3):123–133, 1981.
[737] R. M. M. Mattheij. Stability of block LU-decompositions of matrices arising
from BVP. SIAM J. Alg. Discrete Methods, 5(3):314-331, 1984.
[738] R. M. M. Mattheij. The stability of LU-decompositions of block tridiagonal
matrices. Bull. Austral. Math. Soc., 29:177–205, 1984.
[739] David W. Matula. In-and-out conversions. Comm. ACM, 11(1):47-50, 1968.
[740] David W. Matula. A formalization of floating-point numeric base conversion.
IEEE Trans. Comput., C-19(8):681-692, 1970.
[741] David W. Matula and Peter Kornerup. Finite precision rational arithmetic:
Slash number systems. IEEE Trans. Comput., C-34(1):3-18, 1985.
[742] Charles McCarthy and Gilbert Strang. Optimal conditioning of matrices.
SIAM J. Numer. Anal., 10(2):370-388, 1974.
[743] Daniel D. McCracken and William S. Dorn. Numerical Methods and Fortran
Programming: With Applications in Science and Engineering. Wiley, New
York, 1964. xii+457 pp.
[744] P. McCullagh and J. A. Nelder. Generalized Linear Models. Second edition,
Chapman and Hall, London, 1989. xix+511 pp. ISBN 0-412-31760-5.
[745] William Marshall McKeeman. Algorithm 135: Crout with equilibration and
iteration. Comm. ACM, 5:553–555, 1962.
[746] Jean Meinguet. On the estimation of significance. In Topics in Interval
Analysis, Eldon Hansen, editor, Oxford University Press, Oxford, UK, 1969,
pages 47–64.
[747] Jean Meinguet. Refined error analyses of Cholesky factorization. SIAM J.
Numer. Anal., 20(6):1243-1250, 1983.
[748] N. S. Mendelsohn. Some elementary properties of ill conditioned matrices
and linear equations. Amer. Math. Monthly, 63(5):285–295, 1956.
[749] Michael Metcalf and John K. Reid. Fortran 90 Explained. Oxford University
Press, Oxford, UK, 1990. xiv+294 pp. ISBN 0-19-853772-7.
[750] N. Metropolis. Methods of significance arithmetic. In The State of the Art
in Numerical Analysis, David A. H. Jacobs, editor, Academic Press, London,
1977, pages 179-192.
[751] Gérard Meurant. A review on the inverse of symmetric tridiagonal and block
tridiagonal matrices. SIAM J. Matrix Anal. Appl., 13(3):707–728, 1992.
[752] Carl D. Meyer, Jr. and R. J. Plemmons. Convergent powers of a matrix
with applications to iterative methods for singular linear systems. SIAM J.
Numer. Anal., 14(4):699-705, 1977.
[753] H. I. Meyer and B. J. Hollingsworth. A method of inverting large matrices
of special form. M.T.A.C., 11:94–97, 1957.
[754] Victor J. Milenkovic. Verifiable implementations of geometric algorithms
using finite precision arithmetic. Artif. Intell., 37:377–401, 1988.
[755] D. F. Miller. The iterative solution of the matrix equation XA+BX+C = 0.
Linear Algebra Appl., 105:131-137, 1988.
[756] Webb Miller. Computational complexity and numerical stability. SIAM J.
Comput., 4(2):97-107, 1975.
[757] Webb Miller. Software for roundoff analysis. ACM Trans. Math. Software, 1
(2):108-128, 1975.
[758] Webb Miller. Graph transformations for roundoff analysis. SIAM J. Comput.,
5(2):204–216, 1976.
[759] Webb Miller. The Engineering of Numerical Software. Prentice-Hall, Engle-
wood Cliffs, NJ, USA, 1984. viii+167 pp. ISBN 0-13-279043-2.
[760] Webb Miller and David Spooner. Software for roundoff analysis, II. ACM
Trans. Math. Software, 4(4):369-387, 1978.
[761] Webb Miller and David Spooner. Algorithm 532: Software for roundoff anal-
ysis. ACM Trans. Math. Software, 4(4):388-390, 1978.
[762] Webb Miller and Celia Wrathall. Software for Roundoff Analysis of Matrix
Algorithms. Academic Press, New York, 1980. x+151 pp. ISBN 0-12-497250-
0.
[763] L. Mirsky. An Introduction to Linear Algebra. Oxford University Press, 1961.
viii+440 pp. Reprinted by Dover, New York, 1990. ISBN 0-486-66434-1.
[764] Herbert F. Mitchell, Jr. Inversion of a matrix of order 38. M.T.A.C., 3:
161-166, 1948.
[765] Cleve B. Moler. Iterative refinement in floating point. J. Assoc. Comput.
Mach., 14(2):316-321, 1967.
[766] Cleve B. Moler. Matrix computations with Fortran and paging. Comm.
ACM, 15(4):268-270, 1972.
[767] Cleve B. Moler. Algorithm 423: Linear equation solver. Comm. ACM, 15
(4):274, 1972.
[768] Cleve B. Moler. Cramer’s rule on 2-by-2 systems. ACM SIGNUM Newsletter,
9(4):13-14, 1974.
[769] Cleve B. Moler. Three research problems in numerical linear algebra. In Nu-
merical Analysis, G. H. Golub and J. Oliger, editors, volume 22 of Proceed-
ings of Symposia in Applied Mathematics, American Mathematical Society,
Providence, RI, USA, 1978, pages 1-18.
[770] Cleve B. Moler. Cleve’s corner: The world’s simplest impossible problem.
The Math Works Newsletter, 4(2):6-7, 1990.
[771] Cleve B. Moler. Technical note: Double-rounding and implications for nu-
meric computations. The Math Works Newsletter, 4:6, 1990.
[772] Cleve B. Moler. A tale of two numbers. SIAM News, 28:1,16, January 1995.
Also in MATLAB News and Notes, Winter 1995, 10-12.
[773] Cleve B. Moler and Donald Morrison. Replacing square roots by Pythagorean
sums. IBM J. Res. Develop., 27(6):577–581, 1983.
[774] Cleve B. Moler and Donald Morrison. Singular value analysis of cryptograms.
Amer. Math. Monthly, 90:78-87, 1983.
[775] Cleve B. Moler and Charles F. Van Loan. Nineteen dubious ways to compute
the exponential of a matrix. SIAM Rev., 20(4):801–836, 1978.
[776] Ole Møller. Note on quasi double-precision. BIT, 5:251-255, 1965.
[777] Ole Møller. Quasi double-precision in floating point addition. BIT, 5:37-50,
1965. See also [776] for remarks on this article.
[778] Ramon E. Moore. Interval Analysis. Prentice-Hall, Englewood Cliffs, NJ,
USA, 1966. xi+145 pp.
[779] Ramon E. Moore. Methods and Applications of Interval Analysis. Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, 1979. xi+190
pp. ISBN 0-89871-161-4.
[780] Robb J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley, New
York, 1982. xix+673 pp. ISBN 0-471-09442-0.
[781] Jean-Michel Muller. Arithmétique des Ordinateurs. Masson, Paris, 1989. 214
pp. In French. Cited in [406]. ISBN 2-225-81689-1.
[782] K. H. Müller. Rounding error analysis of Horner’s scheme. Computing, 30:
285-303, 1983.
[783] H. Müller-Merbach. On Round-Off Errors in Linear Programming, volume 37
of Lecture Notes in Operations Research and Mathematical Systems. Spring-
er-Verlag, Berlin, 1970. 48 pp.
[784] FPV: A floating-point validation package. Release I. User’s guide. Technical
Report NP 1201, NAG Ltd., Oxford, UK, May 1986.
[785] NAGWare FTN90 Reference Manual. NAG Ltd., Oxford, UK, 1992. ISBN
1-85206-080-8.
[786] M. Zuhair Nashed and L. B. Rall. Annotated bibliography on generalized in-
verses and applications. In Generalized Inverses and Applications, M. Zuhair
Nashed, editor, Academic Press, New York, 1976, pages 771-1041.
[787] J. A. Nelder and R. Mead. A simplex method for function minimization.
Comput. J., 7:308-313, 1965.
[788] A. Neumaier. Rundungsfehleranalyse einiger Verfahren zur Summation
endlicher Summen. Z. Angew. Math. Mech., 54:39–51, 1974.
[789] A. Neumaier. Inner product rounding error analysis in the presence of un-
derflow. Computing, 34:365-373, 1985.
[790] A. Neumaier. On the comparison of H-matrices with M-matrices. Linear
Algebra Appl., 83:135–141, 1986.
[791] M. Neumann and R. J. Plemmons. Backward error analysis for linear systems
associated with inverses of H-matrices. BIT, 24:102–112, 1984.
[792] A. C. R. Newbery. Error analysis for Fourier series evaluation. Math. Comp.,
27(123):639-644, 1973.
[793] A. C. R. Newbery. Error analysis for polynomial evaluation. Math. Comp.,
28(127):789-793, 1974.
[794] Simon Newcomb. Note on the frequency of use of the different digits in
natural numbers. Amer. J. Math., 4:39–40, 1881. Cited in [856].
[795] Morris Newman. Matrix computations. In Survey of Numerical Analysis,
John Todd, editor, McGraw-Hill, New York, 1962, pages 222-254.
[796] Morris Newman and John Todd. The evaluation of matrix inversion pro-
grams. J. Soc. Indust. Appl. Math., 6(4):466–476, 1958.
[797] K. Nickel. Das Kahan-Babuška Summierungsverfahren in Triplex-
ALGOL 60. Z. Angew. Math. Mech., 50:369–373, 1970.
[798] Karl Nickel. Interval-analysis. In The State of the Art in Numerical Analysis,
David A. H. Jacobs, editor, Academic Press, London, 1977, pages 193-225.
[799] Yvan Notay. On the convergence rate of the conjugate gradients in presence
of rounding errors. Numer. Math., 65:301–317, 1993.
[800] Colm Art O’Cinneide. Entrywise perturbation theory and error analysis for
Markov chains. Numer. Math., 65:109-120, 1993.
[801] W. Oettli. On the solution set of a linear system with inaccurate coefficients.
SIAM J. Numer. Anal., 2(1):115-118, 1965.
[802] W. Oettli and W. Prager. Compatibility of approximate solution of linear
equations with given error bounds for coefficients and right-hand sides. Nu-
mer. Math., 6:405-409, 1964.
[803] W. Oettli, W. Prager, and J. H. Wilkinson. Admissible solutions of linear
systems with not sharply defined coefficients. SIAM J. Numer. Anal., 2(2):
291-299, 1965.
[804] Dianne Prost O’Leary. Estimating matrix condition numbers. SIAM J. Sci.
Statist. Comput., 1(2):205-209, 1980.
[805] J. Oliver. An error analysis of the modified Clenshaw method for evaluating
Chebyshev and Fourier series. J. Inst. Maths Applics, 20:379–391, 1977.
[806] J. Oliver. Rounding error propagation in polynomial evaluation schemes. J.
Comput. Appl. Math., 5(2):85-97, 1979.
[807] F. W. J. Olver. A new approach to error arithmetic. SIAM J. Numer. Anal.,
15(2):368-393, 1978.
[808] F. W. J. Olver. Error analysis of complex arithmetic. In Computational
Aspects of Complex Analysis, volume 102 of NATO Advanced Study Institute
Series C, D. Reidel, Dordrecht, Holland, 1983, pages 279–292.
[809] F. W. J. Olver. Error bounds for polynomial evaluation and complex arith-
metic. IMA J. Numer. Anal., 6:373–379, 1986.
[810] F. W. J. Olver and J. H. Wilkinson. A posteriori error bounds for Gaussian
elimination. IMA J. Numer. Anal., 2:377–406, 1982.
[811] James M. Ortega. An error analysis of Householder’s method for the sym-
metric eigenvalue problem. Numer. Math., 5:211-225, 1963.
[812] A. M. Ostrowski. Solution of Equations in Euclidean and Banach Spaces.
Academic Press, New York, 1973. xx+412 pp. Third edition of Solution of
Equations and Systems of Equations. ISBN 0-12530260-6.
[813] C. C. Paige. An error analysis of a method for solving matrix equations.
Math. Comp., 27(122):355-359, 1973.
[814] C. C. Paige and M. Wei. History and generality of the CS decomposition.
Linear Algebra Appl., 208/209:303-326, 1994.
[815] V. Pan. Strassen algorithm is not optimal. Trilinear technique of aggregating,
uniting and canceling for constructing fast algorithms for matrix multiplica-
tion. In Proc. 19th Annual Symposium on the Foundations of Computer
Science, Ann Arbor, MI, USA, 1978, pages 166-176.
[816] Victor Pan. How can we speed up matrix multiplication? SIAM Rev., 26(3):
393-415, 1984.
[817] Beresford N. Parlett. Laguerre’s method applied to the matrix eigenvalue
problem. Math. Comp., 18:464-485, 1964.
[818] Beresford N. Parlett. Matrix eigenvalue problems. Amer. Math. Monthly, 72
(2):59-66, 1965.
[819] Beresford N. Parlett. Analysis of algorithms for reflections in bisectors. SIAM
Rev., 13(2):197-208, 1971.
[820] Beresford N. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall,
Englewood Cliffs, NJ, USA, 1980. xix+348 pp. ISBN 0-13-880047-2.
[821] Beresford N. Parlett. The contribution of J. H. Wilkinson to numerical analy-
sis. In A History of Scientific Computing, Stephen G. Nash, editor, Addison-
Wesley, Reading, MA, USA, 1990, pages 17–30.
[822] David A. Patterson and John L. Hennessy. Computer Organization and De-
sign: The Hardware/Software Interface. Morgan Kaufmann, San Mateo, CA,
USA, 1994. xxiv+648+appendices pp. ISBN 1-55860-282-8.
[823] Vern Paxson. A program for testing IEEE decimal-binary conversion.
Manuscript. URL = ftp://ftp.ee.lbl.gov/testbase-report.ps.Z, May
1991. 40 pp.
[824] Heinz-Otto Peitgen, Hartmut Jürgens, and Dietmar Saupe. Fractals for the
Classroom. Part one: Introduction to Fractals and Chaos. Springer-Verlag,
New York, 1992. xiv+450 pp. ISBN 0-387-97041-X.
[825] J. M. Peña. Pivoting strategies leading to small bounds of the errors for
certain linear systems. IMA J. Numer. Anal., 1994. Submitted.
[826] G. Peters and J. H. Wilkinson. The least squares problem and pseudo-
inverses. Comput. J., 13(3):309-316, 1970.
[827] G. Peters and J. H. Wilkinson. Practical problems arising in the solution of
polynomial equations. J. Inst. Maths Applics, 8:16–35, 1971.
[828] G. Peters and J. H. Wilkinson. On the stability of Gauss-Jordan elimination
with pivoting. Comm. ACM, 18(1):20-24, 1975.
[829] Karl Petersen. Ergodic Theory. Cambridge University Press, Cambridge,
UK, 1981. xi+329 pp. ISBN 0-521-23632-0.
[830] M. Pichat. Correction d’une somme en arithmétique à virgule flottante.
Numer. Math., 19:400–406, 1972.
[831] Daniel J. Pierce and Robert J. Plemmons. Fast adaptive condition estima-
tion. SIAM J. Matrix Anal. Appl., 13(1):274-291, 1992.
[832] Robert Piessens, Elise de Doncker-Kapenga, Christoph W. Überhuber, and
David K. Kahaner. QUADPACK: A Subroutine Package for Automatic In-
tegration. Springer-Verlag, Berlin, 1983. ISBN 3-540-12553-1.
[833] P. J. Plauger. Properties of floating-point arithmetic. Computer Language,
5(3):17-22, 1988.
[834] R. J. Plemmons. Regular splittings and the discrete Neumann problem.
Numer. Math., 25:153–161, 1976.
[835] Robert J. Plemmons. Linear least squares by elimination and MGS. J. Assoc.
Comput. Mach., 21(4):581-585, 1974.
[836] Svatopluk Poljak and Jiří Rohn. Checking robust nonsingularity is NP-hard.
Math. Control Signals Systems, 6:1-9, 1993.
[837] Ben Polman. Incomplete blockwise factorization of (block) H-matrices. Lin-
ear Algebra Appl., 90:119–132, 1987.
[838] M. J. D. Powell. A survey of numerical methods for unconstrained optimiza-
tion. SIAM Rev., 12(1):79-97, 1970.
[839] M. J. D. Powell. A view of unconstrained minimization algorithms that do
not require derivatives. ACM Trans. Math. Software, 1(2):97–107, 1975.
[840] M. J. D. Powell and J. K. Reid. On applying Householder transformations to
linear least squares problems. In Proc. IFIP Congress 1968, North-Holland,
Amsterdam, The Netherlands, 1969, pages 122-126.
[841] Stephen Power. The Cholesky decomposition in Hilbert space. IMA Bulletin,
22(11/12):186-187, 1986.
[842] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P.
Flannery. Numerical Recipes in FORTRAN: The Art of Scientific Comput-
ing. Second edition, Cambridge University Press, Cambridge, UK, 1992.
xxvi+963 pp. ISBN 0-521-43064-X.
[843] Douglas M. Priest. Algorithms for arbitrary precision floating point arith-
metic. In Proc. 10th IEEE Symposium on Computer Arithmetic, Peter Ko-
rnerup and David W. Matula, editors, IEEE Computer Society Press, Los
Alamitos, CA, USA, 1991, pages 132-143.
[844] Douglas M. Priest. On Properties of Floating Point Arithmetics: Numeri-
cal Stability and the Cost of Accurate Computations. Ph.D. thesis, Mathe-
matics Department, University of California, Berkeley, CA, USA, November
1992. 126 pp. URL = ftp://ftp.icsi.berkeley.edu/pub/theory/priest-
thesis.ps.Z.
[845] J. D. Pryce. Round-off error analysis with fewer tears. IMA Bulletin, 17:
40-47, 1981.
[846] J. D. Pryce. A new measure of relative error for vectors. SIAM J. Numer.
Anal., 21(1):202-215, 1984.
[847] J. D. Pryce. Multiplicative error analysis of matrix transformation algo-
rithms. IMA J. Numer. Anal., 5:437–445, 1985.
[848] Chiara Puglisi. Modification of the Householder method based on the com-
pact WY representation. SIAM J. Sci. Statist. Comput., 13(3):723-726, 1992.
[849] Heinrich Puschmann and Joaquín Cortés. The coordinex problem and its
relation to the conjecture of Wilkinson. Numer. Math., 42:291–297, 1983.
[850] Heinrich Puschmann and Marcelo Nordio. Zwei unzulässige Verstärkungen
der Vermutung von Wilkinson. Linear Algebra Appl., 72:167–176, 1985. In
German.
[851] Gerald D. Quinlan. Round-off error in long-term orbital integrations using
multistep methods. Celestial Mechanics and Dynamical Astronomy, 58:339–
351, 1994.
[852] Kevin Quinn. Ever had problems rounding off figures? This stock exchange
has. Wall Street Journal, 1983. 8 November.
[853] Thomas Quinn and Scott Tremaine. Roundoff error in long-term planetary
orbit integrations. Astron. J., 99(3): 1016–1023, 1990.
[854] Thomas R. Quinn, Scott Tremaine, and Martin Duncan. A three million year
integration of the earth’s orbit. Astron. J., 101(6):2287–2305, 1991.
[855] Ralph A. Raimi. The peculiar distribution of first digits. Scientific American,
221(6):109-120, 1969.
[856] Ralph A. Raimi. The first digit problem. Amer. Math. Monthly, 83:521-538,
1976.
[857] Louis B. Rall. Automatic Differentiation: Techniques and Applications, vol-
ume 120 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1981.
viii+165 pp.
[858] Louis B. Rall. Tools for mathematical computation. In Computer Aided
Proofs in Analysis, Kenneth R. Meyer and Dieter S. Schmidt, editors, vol-
ume 28 of IMA Volumes in Mathematics and its Applications, Springer-Ver-
lag, New York, 1991, pages 217–228.
[859] George U. Ramos. Roundoff error analysis of the fast Fourier transform.
Math. Comp., 25(116):757-768, 1971.
[860] Brian Randell, editor. The Origins of Digital Computers: Selected Papers.
Third edition, Springer-Verlag, Berlin, 1975. xvi+580 pp. ISBN 3-540-11319-3.
[861] Wolfgang Rath. Fast Givens rotations for orthogonal similarity transforma-
tions. Numer. Math., 40:47–56, 1982.
[862] Satish C. Reddy and Lloyd N. Trefethen. Stability of the method of lines.
Numer. Math., 62:235-267, 1992.
[863] Lothar Reichel. Newton interpolation at Leja points. BIT, 30:332–346, 1990.
[864] Lothar Reichel. Fast QR decomposition of Vandermonde-like matrices and
polynomial least squares approximation. SIAM J. Matrix Anal. Appl., 12(3):
552-564, 1991.
[865] Lothar Reichel and Gerhard Opfer. Chebyshev-Vandermonde systems. Math.
Comp., 57(196):703-721, 1991.
[866] Lothar Reichel and Lloyd N. Trefethen. Eigenvalues and pseudo-eigenvalues
of Toeplitz matrices. Linear Algebra Appl., 162–164:153–185, 1992.
[867] J. K. Reid. A note on the stability of Gaussian elimination. J. Inst. Maths
Applics, 8:374-375, 1971.
[868] J. K. Reid. Sparse matrices. In The State of the Art in Numerical Analysis,
A. Iserles and M. J. D. Powell, editors, Oxford University Press, Oxford, UK,
1987, pages 59–85.
[869] John F. Reiser and Donald E. Knuth. Evading the drift in floating-point
addition. Inform. Process. Lett., 3(3):84–87, 1975.
[870] John R. Rice. Experiments on Gram-Schmidt orthogonalization. Math.
Comp., 20:325-328, 1966.
[871] John R. Rice. A theory of condition. SIAM J. Numer. Anal., 3(2):287-310,
1966.
[872] Robert D. Richtmyer and K. W. Morton. Difference Methods for Initial-Value
Problems. Second edition, Interscience, New York, 1967. xiv+405 pp.
[873] J. L. Rigal and J. Gaches. On the compatibility of a given solution with the
data of a linear system. J. Assoc. Comput. Mach., 14(3):543-548, 1967.
[874] T. G. Robertazzi and S. C. Schwartz. Best “ordering” for floating-point
addition. ACM Trans. Math. Software, 14(1):101–110, 1988.
[875] J. D. Roberts. Linear model reduction and solution of the algebraic Riccati
equation by use of the sign function. Int. J. Control, 32(4):677-687, 1980.
First issued as report CUED/B-Control/TR13, Department of Engineering,
University of Cambridge, 1971.
[876] Jiří Rohn. New condition numbers for matrices and linear systems. Comput-
ing, 41:167–169, 1989.
[877] Jiří Rohn. Systems of linear interval equations. Linear Algebra Appl., 126:
39-78, 1989.
[878] Jiří Rohn. Nonsingularity under data rounding. Linear Algebra Appl., 139:
171–174, 1990.
[879] Jiří Rohn. NP-hardness results for some linear and quadratic problems. Tech-
nical Report No. 619, Institute of Computer Science, Academy of Sciences of
the Czech Republic, Prague, January 1995. 11 pp.
[880] D. R. Ross. Reducing truncation errors using cascading accumulators.
Comm. ACM, 8(1):32-33, 1965.
[881] M. W. Routh, P. A. Swartz, and M. B. Denton. Performance of the super
modified simplex. Analytical Chemistry, 49(9):1422–1428, 1977.
[882] Thomas Harvey Rowan. Functional Stability Analysis of Numerical Algo-
rithms. Ph.D. thesis, University of Texas at Austin, Austin, TX, USA, May
1990. xii+205 pp.
[883] Axel Ruhe. Numerical aspects of Gram-Schmidt orthogonalization of vectors.
Linear Algebra Appl., 52/53:591-601, 1983.
[884] D. E. Rutherford. Some continuant determinants arising in physics and chem-
istry. Proc. Royal Soc. Edinburgh, 62A:229–236, 1947.
[885] D. E. Rutherford. Some continuant determinants arising in physics and
chemistry—II. Proc. Royal Soc. Edinburgh, 63A:232–241, 1952.
[886] H. Rutishauser. On test matrices. In Programmation en Mathématiques
Numériques, Besançon, 1966, volume 7 (no. 165) of Éditions Centre Nat.
Recherche Sci., Paris, 1968, pages 349-365.
[887] Heinz Rutishauser. Solution of eigenvalue problems with the LR-
transformation. In Further Contributions to the Solution of Simultaneous
Linear Equations and the Determination of Eigenvalues, number 49 in Ap-
plied Mathematics Series, National Bureau of Standards, United States De-
partment of Commerce, Washington, DC, 1958, pages 47–81.
[888] FTN90 Reference Manual. Second edition, Salford Software Ltd. and the
Numerical Algorithms Group Ltd., Salford and Oxford, UK, 1993.
[889] Ahmed H. Sameh and Richard P. Brent. Solving triangular systems on a
parallel computer. SIAM J. Numer. Anal., 14(6):1101–1113, 1977.
[890] Klaus Samelson and Friedrich L. Bauer. Optimale Rechengenauigkeit bei
Rechenanlagen mit gleitendem Komma. Zeitschrift für Angewandte Mathe-
matik und Physik, 4:312–316, 1953.
[891] J. M. Sanz-Serna. Symplectic integrators for Hamiltonian problems: An
overview. In Acta Numerica, Cambridge University Press, Cambridge, UK,
1992, pages 243–286.
[892] J. M. Sanz-Serna and M. P. Calvo. Numerical Hamiltonian Problems. Chap-
man and Hall, London, 1994. xii+207 pp. ISBN 0-412-54290-0.
[893] J. M. Sanz-Serna and S. Larsson. Shadows, chaos, and saddles. Appl. Numer.
Math., 13:181-190, 1993.
[894] M. A. Saunders. Large-scale linear programming using the Cholesky factor-
ization. Report CS 252, Department of Computer Science, Stanford Univer-
sity, January 1972.
[895] Werner Sautter. Error analysis of Gauss elimination for the best least squares
problem. Numer. Math., 30:165–184, 1978.
[896] I. Richard Savage and Eugene Lukacs. Tables of inverses of finite segments
of the Hilbert matrix. In Contributions to the Solution of Systems of Linear
Equations and the Determination of Eigenvalues, Olga Taussky, editor, num-
ber 39 in Applied Mathematics Series, National Bureau of Standards, United
States Department of Commerce, Washington, DC, 1954, pages 105-108.
[897] James B. Scarborough. Numerical Mathematical Analysis. Second edition,
Johns Hopkins University Press, Baltimore, MD, USA, 1950. xviii+511 pp.
[898] Charles W. Schelin. Calculator function approximation. Amer. Math.
Monthly, 90(5):317-325, 1983.
[899] R. Scherer and K. Zeller. Shorthand notation for rounding errors. Computing,
Suppl. 2:165-168, 1980.
[900] Hans Schneider and W. Gilbert Strang. Comparison theorems for supremum
norms. Numer. Math., 4:15–20, 1962.
[901] J. L. Schonfelder and M. Razaz. Error control with polynomial approxima-
tions. IMA J. Numer. Anal., 1:105-114, 1980.
[902] Robert S. Schreiber. Block algorithms for parallel machines. In Numeri-
cal Algorithms for Modern Parallel Computer Architectures, M. H. Schultz,
editor, number 13 in IMA Volumes In Mathematics and Its Applications,
Springer-Verlag, Berlin, 1988, pages 197-207.
[903] Robert S. Schreiber. Hough’s random story explained. NA Digest, Volume
89, Issue 3, 1989. Electronic mail magazine.
[904] Robert S. Schreiber and Beresford N. Parlett. Block reflectors: Theory and
computation. SIAM J. Numer. Anal., 25(1):189-205, 1988.
[905] Robert S. Schreiber and Charles F. Van Loan. A storage efficient WY repre-
sentation for products of Householder transformations. SIAM J. Sci. Statist.
Comput., 10:53-57, 1989.
[906] N. L. Schryer. A test of a computer’s floating-point arithmetic unit. Com-
puting Science Technical Report No. 89, AT&T Bell Laboratories, Murray
Hill, NJ, USA, 1981.
[907] N. L. Schryer. Determination of correct floating-point model parameters.
In Sources and Development of Mathematical Software, Wayne R. Cowell,
editor, Prentice-Hall, Englewood Cliffs, NJ, USA, 1984, pages 360-366.
[908] Gunther Schulz. Iterative Berechnung der reziproken Matrix. Z. Angew.
Math. Mech., 13:57-59, 1933.
[909] Robert Sedgewick. Algorithms. Second edition, Addison-Wesley, Reading,
MA, USA, 1988. xii+657 pp. ISBN 0-201-06673-4.
[910] Lawrence F. Shampine. Numerical Solution of Ordinary Differential Equa-
tions. Chapman and Hall, New York, 1994. x+484 pp. ISBN 0-412-05151-6.
[911] Lawrence F. Shampine and Richard C. Allen, Jr. Numerical Computing: An
Introduction. W. B. Saunders, Philadelphia, PA, USA, 1973. vii+258 pp.
ISBN 0-7216-8150-6.
[912] Lawrence F. Shampine and Mark W. Reichelt. The MATLAB ODE suite.
Manuscript, 1995. 35 pp.
[913] Alexander Shapiro. Optimally scaled matrices, necessary and sufficient con-
ditions. Numer. Math., 39:239-245, 1982.
[914] Alexander Shapiro. Optimal block diagonal l2-scaling of matrices. SIAM J.
Numer. Anal., 22(1):81-94, 1985.
[915] Alexander Shapiro. Upper bounds for nearly optimal diagonal scaling of
matrices. Linear and Multilinear Algebra, 29:147–147, 1991.
[916] H. P. Sharangpani and M. L. Barton. Statistical analysis of floating point
flaw in the Pentium processor (1994). Technical report, Intel Corporation,
November 1994. 31 pp.
[917] David Shepherd and Greg Wilson. Making chips that work. New Scientist,
pages 61–64, 1989. 13 May.
[918] Gautam M. Shroff and Christian H. Bischof. Adaptive condition estimation
for rank-one updates of QR factorization. SIAM J. Matrix Anal. Appl., 13
(4):1264-1278, 1992.
[919] Robert D. Skeel. Scaling for numerical stability in Gaussian elimination. J.
Assoc. Comput. Mach., 26(3):494-526, 1979.
[920] Robert D. Skeel. Iterative refinement implies numerical stability for Gaussian
elimination. Math. Comp., 35(151):817–832, 1980.
[921] Robert D. Skeel. Effect of equilibration on residual size for partial pivoting.
SIAM J. Numer. Anal., 18(3):449-454, 1981.
[922] Robert D. Skeel. Safety in numbers: The boundless errors of numerical
computation. Working Document 89-3, Department of Computer Science,
University of Illinois, Urbana, IL, USA, 1989. 9 pp.
[923] Robert D. Skeel and Jerry B. Keiper. Elementary Numerical Computing
with Mathematica. McGraw-Hill, New York, 1993. xiv+434 pp. ISBN 0-07-
057820-6.
[924] Steve Smale. Some remarks on the foundations of numerical analysis. SIAM
Rev., 32(2):211-220, 1990.
[925] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C.
Klema, and C. B. Moler. Matrix Eigensystem Routines—EISPACK Guide,
volume 6 of Lecture Notes in Computer Science. Springer-Verlag, Berlin,
1976.
[926] David M. Smith. Algorithm 693: A FORTRAN package for floating-point
multiple-precision arithmetic. ACM Trans. Math. Software, 17(2):273-283,
1991.
[927] Francis J. Smith. An algorithm for summing orthogonal polynomial series and
their derivatives with applications to curve-fitting and interpolation. Math.
Comp., 19:33-36, 1965.
[928] Jon M. Smith. Scientific Analysis on the Pocket Calculator. Wiley, New
York, 1975. xii+380 pp. ISBN 0-471-79997-1.
[929] Robert L. Smith. Algorithm 116: Complex division. Comm. ACM, 5(8):435,
1962.
[930] Alicja Smoktunowicz. A note on the strong componentwise stability of algo-
rithms for solving symmetric linear systems. Demonstratio Mathematica, 28
(2), 1995.
[931] Alicja Smoktunowicz and Jolanta Sokolnicka. Binary cascades iterative re-
finement in doubled-mantissa arithmetics. BIT, 24:123–127, 1984.
[932] James N. Snyder. On the improvement of the solutions to a set of simultane-
ous linear equations using the ILLIAC. Mathematical Tables and Other Aids
to Computation, 9:177–184, 1955.
[933] P. Spellucci. An approach to backward analysis for linear and nonlinear
iterative methods. Computing, 25:269–282, 1980.
[934] J. Spieß. Untersuchungen des Zeitgewinns durch neue Algorithmen zur
Matrix-Multiplikation. Computing, 17:23-36, 1976.
[935] Gerhard Starke and Wilhelm Niethammer. SOR for AX – XB = C. Linear
Algebra Appl., 154-156:355-375, 1991.
[936] Guy L. Steele, Jr. and Jon L. White. How to print floating-point numbers
accurately. SIGPLAN Notices, 25(6):112–126, 1990.
[937] Irene A. Stegun and Milton Abramowitz. Pitfalls in computation. J. Soc.
Indust. Appl. Math., 4(4):207-219, 1956.
[938] Pat H. Sterbenz. Floating-Point Computation. Prentice-Hall, Englewood
Cliffs, NJ, USA, 1974. xiv+316 pp. ISBN 0-13-322495-3.
[939] G. W. Stewart. Error analysis of the algorithm for shifting the zeros of a
polynomial by synthetic division. Math. Comp., 25(113):135-139, 1971.
[940] G. W. Stewart. Error and perturbation bounds for subspaces associated with
certain eigenvalue problems. SIAM Rev., 15(4):727-764, 1973.
[941] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New
York, 1973. xiii+441 pp. ISBN 0-12-670350-7.
[942] G. W. Stewart. Modifying pivot elements in Gaussian elimination. Math.
Comp., 28(126):537-542, 1974.
[943] G. W. Stewart. On the perturbation of pseudo-inverses, projections and
linear least squares problems. SIAM Rev., 19(4):634-662, 1977.
[944] G. W. Stewart. Perturbation bounds for the QR factorization of a matrix.
SIAM J. Numer. Anal., 14(3):509-518, 1977.
[945] G. W. Stewart. Research, development, and LINPACK. In Mathematical
Software III, John R. Rice, editor, Academic Press, New York, 1977, pages
1–14.
[946] G. W. Stewart. The efficient generation of random orthogonal matrices with
an application to condition estimators. SIAM J. Numer. Anal., 17(3):403–
409, 1980.
[947] G. W. Stewart. A note on complex division. ACM Trans. Math. Software,
11(3):238-241, 1985.
[948] G. W. Stewart. Stochastic perturbation theory. SIAM Rev., 32(4):579–610,
1990.
[949] G. W. Stewart. Note on a generalized Sylvester equation. IMA Preprint
Series #985, Institute for Mathematics and its Applications, University of
Minnesota, Minneapolis, MN, USA, May 1992. 3 pp.
[950] G. W. Stewart. On the early history of the singular value decomposition.
SIAM Rev., 35(4):551-566, 1993.
[951] G. W. Stewart. On the perturbation of LU, Cholesky, and QR factorizations.
SIAM J. Matrix Anal. Appl., 14(4):1141–1145, 1993.
[952] G. W. Stewart. Gauss, statistics, and Gaussian elimination. Technical Report
TR-3307, Department of Computer Science, University of Maryland, College
Park, MD, USA, August 1994. 14 pp.
[953] G. W. Stewart. On Markov chains with sluggish transients. Technical Report
TR-3306, Department of Computer Science, University of Maryland, College
Park, MD, USA, June 1994. 8 pp.
[954] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic
Press, London, 1990. xv+365 pp. ISBN 0-12-670230-6.
[955] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer-
Verlag, New York, 1980. ix+609 pp. ISBN 0-387-90420-4.
[956] J. Stoer and C. Witzgall. Transformations by diagonal matrices in a normed
space. Numer. Math., 4:158–171, 1962.
[957] Betty Jane Stone. Best possible ratios of certain matrix norms. Numer.
Math., 4:114-116, 1962.
[958] David R. Stoutemyer. Automatic error analysis using computer algebraic
manipulation. ACM Trans. Math. Software, 3(1):26–43, 1977.
[959] David R. Stoutemyer. Crimes and misdemeanours in the computer algebra
trade. Notices Amer. Math. Soc., 38(7):778–785, 1991.
[960] Gilbert Strang. A proposal for Toeplitz matrix calculations. Stud. Appl.
Math., 74:171-176, 1986.
[961] Gilbert Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press,
Wellesley, MA, USA, 1993. viii+472 pp. ISBN 0-9614088-5-5.
[962] V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354-356,
1969.
[963] F. Stummel. Rounding error analysis of elementary numerical algorithms.
Computing, Suppl. 2:169-195, 1980.
[964] F. Stummel. Perturbation theory for evaluation algorithms of arithmetic
expressions. Math. Comp., 37(156):435–473, 1981.
[965] F. Stummel. Optimal error estimates for Gaussian elimination in floating-
point arithmetic. Z. Angew. Math. Mech., 62:T355–T357, 1982.
[966] F. Stummel. Strict optimal error estimates for Gaussian elimination. Z.
Angew. Math. Mech., 65:T405-T407, 1985.
[967] Friedrich Stummel. Forward error analysis of Gaussian elimination, Part I:
Error and residual estimates. Numer. Math., 46:365-395, 1985.
[968] Friedrich Stummel. Forward error analysis of Gaussian elimination, Part II:
Stability theorems. Numer. Math., 46:397-415, 1985.
[969] SPARCompiler FORTRAN 2.0.1: User’s Guide. Sun Microsystems, Inc.,
Mountain View, CA, USA, October 1992. Part No. 800-6552-11, Revision A.
[970] SPARCompiler FORTRAN: Numerical Computation Guide. Sun Microsys-
tems, Inc., Mountain View, CA, USA, October 1992. Part No. 800-7097-11,
Revision A.
[971] Ji-guang Sun. Perturbation bounds for the Cholesky and QR factorization.
BIT, 31:341-352, 1991.
[972] Ji-guang Sun. Componentwise perturbation bounds for some matrix decom-
positions. BIT, 32(4):702–714, 1992.
[973] Ji-guang Sun. Rounding-error and perturbation bounds for the Cholesky and
LDL^T factorization. Linear Algebra Appl., 173:77–97, 1992.
[974] Ji-guang Sun. A note on backward perturbations for the Hermitian eigenvalue
problem. BIT, 35, 1995. To appear.
[975] Ji-guang Sun. On perturbation bounds for the QR factorization. Linear
Algebra Appl., 215:95-111, 1995.
[976] Ji-guang Sun. Optimal backward perturbation bounds for the linear LS
problem with multiple right-hand sides. IMA J. Numer. Anal., 1996. To
appear.
[977] Xiaobai Sun and Christian H. Bischof. A basis-kernel representation of or-
thogonal matrices. SIAM J. Matrix Anal. Appl., 16(4):1184-1196, 1995.
[978] W. H. Swann. Direct search methods. In Numerical Methods for Uncon-
strained Optimization, W. Murray, editor, Academic Press, London, 1972,
pages 13-28.
[979] W. H. Swann. Constrained optimization by direct search. In Numerical
Methods for Constrained Optimization, P. E. Gill and W. Murray, editors,
Academic Press, London, 1974, pages 191-217.
[980] Earl E. Swartzlander, Jr., editor. Computer Arithmetic, volume 21 of Bench-
mark Papers in Electrical Engineering and Computer Science. Dowden,
Hutchinson and Ross, Stroudsburg, PA, USA, 1980.
[981] Earl E. Swartzlander, Jr. and Aristides G. Alexopoulos. The sign/logarithm
number system. IEEE Trans. Comput., C-24(12):1238–1242, 1975.
[982] D. W. Sweeney. An analysis of floating-point addition. IBM Systems Journal,
4:31-42, 1965. Reprinted in [980, pp. 317-328].
[983] J. J. Sylvester. Additions to the articles, “On a New Class of Theorems,”
and “On Pascal’s Theorem”. Philosophical Magazine, 37:363-370, 1850.
Reprinted in [986, pp. 145–151].
[984] J. J. Sylvester. On the relation between the minor determinants of linearly
equivalent quadratic functions. Philosophical Magazine, (Fourth Series) 1:
295-305, 1851. Reprinted in [986, pp. 241-250].
[985] J. J. Sylvester. Sur l’équation en matrices px = xq. Comptes Rendus de
l‘Académie des Sciences, pages 67-71 and 115-116, 1884.
[986] The Collected Mathematical Papers of James Joseph Sylvester, volume 1
(1837-1853). Cambridge University Press, Cambridge, UK, 1904. xii+650
pp.
[987] Ping Tak Peter Tang. Table-driven implementation of the exponential func-
tion in IEEE floating-point arithmetic. ACM Trans. Math. Software, 15(2):
144–157, 1989.
[988] Ping Tak Peter Tang. Accurate and efficient testing of the exponential and
logarithm functions. ACM Trans. Math. Software, 16(3):185-200, 1990.
[989] Ping Tak Peter Tang. Table-driven implementation of the logarithm function
in IEEE floating-point arithmetic. ACM Trans. Math. Software, 16(3):378–
400, 1990.
[990] Ping Tak Peter Tang. Table-lookup algorithms for elementary functions and
their error analysis. In Proc. 10th IEEE Symposium on Computer Arith-
metic, Peter Kornerup and David W. Matula, editors, IEEE Computer So-
ciety Press, Los Alamitos, CA, USA, 1991, pages 232–236.
[991] Ping Tak Peter Tang. Table-driven implementation of the expm1 function in
IEEE floating-point arithmetic. ACM Trans. Math. Software, 18(2):211-222,
1992.
[992] W. P. Tang and G. H. Golub. The block decomposition of a Vandermonde
matrix and its applications. BIT, 21:505–517, 1981.
[993] Pham Dinh Tao. Convergence of a subgradient method for computing the
bound norm of matrices. Linear Algebra Appl., 62:163–182, 1984. In French.
[994] Pham Dinh Tao. Some methods for computing the maximum of quadratic
form on the unit ball of the maximum norm. Numer. Math., 45:377–401,
1984. In French.
[995] A. H. Taub, editor. John von Neumann Collected Works, volume V, Design
of Computers, Theory of Automata and Numerical Analysis. Pergamon,
Oxford, UK, 1963. ix+784 pp.
[996] Olga Taussky. A remark concerning the characteristic roots of the finite
segments of the Hilbert matrix. Quart. J. Math., 20:80–83, 1949.
[997] Olga Taussky. How I became a torchbearer for matrix theory. Amer. Math.
Monthly, 95(9):801–812, November 1988.
[998] Olga Taussky and Marvin Marcus. Eigenvalues of finite matrices. In Survey
of Numerical Analysis, John Todd, editor, McGraw-Hill, New York, 1962,
pages 279–313.
[999] Henry C. Thacher, Jr. Algorithm 43: Crout with pivoting II. Comm. ACM,
4(4):176-177, 1961.
[1000] Ronald A. Thisted. Elements of Statistical Computing: Numerical Computa-
tion. Chapman and Hall, New York, 1988. xx+427 pp. ISBN 0-412-01371-1.
[1001] D’Arcy Wentworth Thompson. On Growth and Form. The Complete Revised
Edition. Cambridge University Press, 1942. viii+1116 pp. Reprinted by
Dover, New York, 1992. ISBN 0-486-67135-6.
[1002] Martti Tienari. A statistical model of roundoff error for varying length
floating-point arithmetic. BIT, 10:355-365, 1970.
[1003] J. Todd. On condition numbers. In Programmation en Mathématiques
Numériques, Besançon, 1966, volume 7 (no. 165) of Éditions Centre Nat.
Recherche Sci., Paris, 1968, pages 141-159.
[1004] John Todd. The condition of the finite segments of the Hilbert matrix. In
Contributions to the Solution of Systems of Linear Equations and the Deter-
mination of Eigenvalues, Olga Taussky, editor, number 39 in Applied Math-
ematics Series, National Bureau of Standards, United States Department of
Commerce, Washington, DC, 1954, pages 109-116.
[1005] John Todd. Computational problems concerning the Hilbert matrix. J. Res.
National Bureau Standards-B, 65(1):19-22, 1961.
[1006] John Todd. Basic Numerical Mathematics, Vol. 2: Numerical Algebra.
Birkhäuser, Basel, and Academic Press, New York, 1977. 216 pp. ISBN
0-12-692402-3.
[1007] Kim-Chuan Toh and Lloyd N. Trefethen. Pseudozeros of polynomials and
pseudospectra of companion matrices. Numer. Math., 68(3):403-425, 1994.
[1008] Virginia J. Torczon. Multi-Directional Search: A Direct Search Algorithm for
Parallel Machines. Ph.D. thesis, Rice University, Houston, TX, USA, May
1989. vii+85 pp.
[1009] Virginia J. Torczon. On the convergence of the multidirectional search algo-
rithm. SIAM J. Optim., 1(1):123–145, 1991.
[1010] Virginia J. Torczon. PDS: Direct search methods for unconstrained optimiza-
tion on either sequential or parallel machines. Report TR92-9, Department
of Mathematical Sciences, Rice University, Houston, TX, USA, March 1992.
To appear in ACM Trans. Math. Software.
[1011] Virginia J. Torczon. On the convergence of pattern search algorithms. SIAM
J. Optim., 7(1), 1997. To appear.
[1012] L. Tornheim. Maximum third pivot for Gaussian reduction. Technical report,
Calif. Research Corp., Richmond, CA, USA, 1965. Cited in [256].
[1013] J. F. Traub. Associated polynomials and uniform methods for the solution
of linear problems. SIAM Rev., 8(3):277-301, 1966.
[1014] Lloyd N. Trefethen. Three mysteries of Gaussian elimination. ACM SIGNUM
Newsletter, 20:2-5, 1985.
[1015] Lloyd N. Trefethen. Approximation theory and numerical linear algebra.
In Algorithms for Approximation II, J. C. Mason and M. G. Cox, editors,
Chapman and Hall, London, 1990, pages 336-360.
[1016] Lloyd N. Trefethen. The definition of numerical analysis. SIAM News, 25:
6 and 22, November 1992. Reprinted in IMA Bulletin, 29 (3/4), pp. 47-49,
1993.
[1017] Lloyd N. Trefethen. Pseudospectra of matrices. In Numerical Analysis 1991,
Proceedings of the 14th Dundee Conference, D. F. Griffiths and G. A. Watson,
editors, volume 260 of Pitman Research Notes in Mathematics, Longman
Scientific and Technical, Essex, UK, 1992, pages 234-266.
[1018] Lloyd N. Trefethen. Spectra and Pseudospectra: The Behavior of Non-Normal
Matrices and Operators. Book in preparation.
[1019] Lloyd N. Trefethen and Robert S. Schreiber. Average-case stability of Gaus-
sian elimination. SIAM J. Matrix Anal. Appl., 11(3):335–360, 1990.
[1020] Lloyd N. Trefethen and Manfred R. Trummer. An instability phenomenon
in spectral methods. SIAM J. Numer. Anal., 24(5):1008–1023, 1987.
[1021] Henry S. Tropp. FORTRAN anecdotes. Ann. Hist. Comput., 6(1):59-64,
1984.
[1022] Henry S. Tropp. Origin of the term bit. Ann. Hist. Comput., 6(2):152-155,
1984.
[1023] Nai-kuan Tsao. A note on implementing the Householder transformation.
SIAM J. Numer. Anal., 12(1):53-58, 1975.
[1024] Nai-kuan Tsao. A simple approach to the error analysis of division-free
numerical algorithms. IEEE Trans. Comput., C-32(4):343–351, 1983.
[1025] A. M. Turing. On computable numbers, with an application to the Entschei-
dungsproblem. Proc. London Math. Soc., 42:230-265, 1936.
[1026] A. M. Turing. Proposal for development in the Mathematics Division of an
Automatic Computing Engine (ACE). Report E.882, Executive Committee,
National Physical Laboratory, Teddington, UK, 1945. Reprinted in [186,
pp. 20-105] and [603, pp. 1-86].
[1027] A. M. Turing. Rounding-off errors in matrix processes. Quart. J. Mech.
Appl. Math., 1:287–308, 1948. Reprinted in [149] with summary and notes
(including corrections).
[1028] H. W. Turnbull. The Theory of Determinants, Matrices, and Invariants.
Blackie, London and Glasgow, 1929. xvi+338 pp.
[1029] H. W. Turnbull and A. C. Aitken. An Introduction to the Theory of Canonical
Matrices. Blackie, London and Glasgow, 1932. xiii+200 pp. Reprinted with
appendix, 1952.
[1030] Kathryn Turner. Computing projections for the Karmarkar algorithm. Linear
Algebra Appl., 152:141-154, 1991.
[1031] Peter R. Turner. The distribution of leading significant digits. IMA J. Numer.
Anal., 2:407-412, 1982.
[1032] Peter R. Turner. Further revelations on L.S.D. IMA J. Numer. Anal., 4:
225-231, 1984.
[1033] Peter R. Turner. Will the “real” real arithmetic please stand up? Notices
Amer. Math. Soc., 38(4):298-304, 1991.
[1034] Evgenij E. Tyrtyshnikov. Cauchy-Toeplitz matrices and some applications.
Linear Algebra Appl., 149:1-18, 1991.
[1035] Evgenij E. Tyrtyshnikov. How bad are Hankel matrices? Numer. Math., 67
(2):261-269, 1994.
[1036] Patriot missile defense: Software problem led to system failure at Dhahran,
Saudi Arabia. Report GAO/IMTEC-92-26, Information Management and
Technology Division, United States General Accounting Office, Washington,
DC, February 1992. 16 pp.
[1037] Minoru Urabe. Roundoff error distribution in fixed-point multiplication and
a remark about the rounding rule. SIAM J. Numer. Anal., 5(2):202-210,
1968.
[1038] J. V. Uspensky. Theory of Equations. McGraw-Hill, New York, 1948. vii+353
pp.
[1039] A. van der Sluis. Condition numbers and equilibration of matrices. Numer.
Math., 14:14-23, 1969.
[1040] A. van der Sluis. Condition, equilibration and pivoting in linear algebraic
systems. Numer. Math., 15:74–86, 1970.
[1041] A. van der Sluis. Stability of the solutions of linear least squares problems.
Numer. Math., 23:241-254, 1975.
[1042] Charles F. Van Loan. On the method of weighting for equality-constrained
least-squares problems. SIAM J. Numer. Anal., 22(5):851-864, 1985.
[1043] Charles F. Van Loan. On estimating the condition of eigenvalues and eigen-
vectors. Linear Algebra Appl., 88/89:715–732, 1987.
[1044] Charles F. Van Loan. Computational Frameworks for the Fast Fourier Trans-
form. Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 1992. xiii+273 pp. ISBN 0-89871-285-8.
[1045] M. van Veldhuizen. A note on partial pivoting and Gaussian elimination.
Numer. Math., 29:1-10, 1977.
[1046] A. van Wijngaarden. Numerical analysis as an independent science. BIT, 6:
66-81, 1966.
[1047] Robert J. Vanderbei. Symmetric quasi-definite matrices. SIAM J. Optim., 5
(1):100-113, 1995.
[1048] J. M. Varah. On the solution of block-tridiagonal systems arising from certain
finite-difference equations. Math. Comp., 26(120):859–868, 1972.
[1049] J. M. Varah. A lower bound for the smallest singular value of a matrix.
Linear Algebra Appl., 11:3-5, 1975.
[1050] J. M. Varah. On the separation of two matrices. SIAM J. Numer. Anal., 16
(2):216-222, 1979.
[1051] Richard S. Varga. On diagonal dominance arguments for bounding ||A^{-1}||_∞.
Linear Algebra Appl., 14:211-217, 1976.
[1052] Richard S. Varga. Scientific Computation on Mathematical Problems and
Conjectures. Society for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 1990. vi+122 pp. ISBN 0-89871-257-2.
[1053] Frank M. Verzuh. The solution of simultaneous linear equations with the aid
of the 602 calculating punch. M.T.A.C., 3:453–462, 1949.
[1054] J. Vignes and R. Alt. An efficient stochastic method for round-off error
analysis. In Accurate Scientific Computations, Proceedings, 1985, Willard L.
Miranker and Richard A. Toupin, editors, volume 235 of Lecture Notes in
Computer Science, Springer-Verlag, Berlin, 1986, pages 183-205.
[1055] Emil Vitasek. The numerical stability in solution of differential equations. In
Conference on the Numerical Solution of Differential Equations, J. Ll. Morris,
editor, volume 109 of Lecture Notes in Mathematics, Springer-Verlag, Berlin,
1969, pages 87–111.
[1056] I. V. Viten’ko. Optimum algorithms for adding and multiplying on computers
with a floating point. U.S.S.R. Comput. Math. Math. Phys., 8(5):183-195,
1968.
[1057] John von Neumann and Herman H. Goldstine. Numerical inverting of ma-
trices of high order. Bull. Amer. Math. Soc., 53:1021–1099, 1947. Reprinted
in [995, pp. 479–557].
[1058] H. A. Van Der Vorst. The convergence behaviour of preconditioned CG
and CG-S in the presence of rounding errors. In Preconditioned Conjugate
Gradient Methods, Owe Axelsson and Lily Yu. Kolotilina, editors, volume
1457 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1990, pages
126–136.
[1059] Eugene L. Wachspress. Iterative solution of the Lyapunov matrix equation.
Appl. Math. Lett., 1(1):87-90, 1988.
[1060] Bertil Waldén, Rune Karlson, and Ji-guang Sun. Optimal backward pertur-
bation bounds for the linear least squares problem. Numerical Linear Algebra
with Applications, 2(3):271–286, 1995.
[1061] Peter J. L. Wallis, editor. Improving Floating-Point Programming. Wiley,
London, 1990. xvi+191 pp. ISBN 0-471-92437-7.
[1062] W. D. Wallis. Hadamard matrices. In Combinatorial and Graph-Theoretical
Problems in Linear Algebra, Richard A. Brualdi, Shmuel Friedland, and Vic-
tor Klee, editors, volume 50 of IMA Volumes in Mathematics and its Appli-
cations, Springer-Verlag, New York, 1993, pages 235–243.
[1063] W. D. Wallis, Anne Penfold Street, and Jennifer Seberry Wallis. Combina-
torics: Room Squares, Sum-Free Sets, Hadamard Matrices, volume 292 of
Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1972. 508 pp. ISBN
3-540-06035-9.
[1064] Johnson J. H. Wang. Generalized Moment Methods in Electromagnetics:
Formulation and Computer Solution of Integral Equations. Wiley, New York,
1991. xiii+553 pp. ISBN 0-471-51443-8.
[1065] Robert C. Ward. The QR algorithm and Hyman’s method on vector com-
puters. Math. Comp., 30(133):132–142, 1976.
[1066] Willis H. Ware, editor. Soviet computer technology–1959. Comm. ACM, 3
(3):131-166, 1960.
[1067] G. A. Watson. An algorithm for optimal l2 scaling of matrices. IMA J.
Numer. Anal., 11:481-492, 1991.
[1068] J. H. M. Wedderburn. Lectures On Matrices, volume 17 of American Math-
ematical Society Colloquium Publications. American Mathematical Society,
Providence, RI, USA, 1934. vii+205 pp.
[1069] Per-Åke Wedin. Perturbation theory for pseudo-inverses. BIT, 13:217-232,
1973.
[1070] Per-Åke Wedin. Perturbation theory and condition numbers for generalized
and constrained linear least squares problems. Report UMINF 125.85, In-
stitute of Information Processing, University of Umeå, Umeå, Sweden, May
1985.
[1071] Elias Wegert and Lloyd N. Trefethen. From the Buffon needle problem to
the Kreiss matrix theorem. Amer. Math. Monthly, 101(2):132–139, 1994.
[1072] Musheng Wei. Perturbation of the least squares problem. Linear Algebra
Appl., 141:177-182, 1990.
[1073] N. Weiss, G. W. Wasilkowski, H. Woźniakowski, and M. Shub. Average
condition number for solving linear equations. Linear Algebra Appl., 83:
79-102, 1986.
[1074] Burton Wendroff. Theoretical Numerical Analysis. Academic Press, New
York, 1966. xi+239 pp.
[1075] Wilhelm Werner. Polynomial interpolation: Lagrange versus Newton. Math.
Comp., 43(167):205-217, 1984.
[1076] Joan R. Westlake. A Handbook of Numerical Matrix Inversion and Solution
of Linear Equations. Wiley, New York, 1968.
[1077] B. A. Wichmann. Towards a formal specification of floating point. Comput.
J., 32:432-436, 1989.
[1078] B. A. Wichmann. A note on the use of floating point in critical systems.
Comput. J., 35(1):41-44, 1992.
[1079] J. H. Wilkinson. The Automatic Computing Engine at the National Physical
Laboratory. Proc. Roy. Soc. London Ser. A, 195:285-286, 1948.
[1080] J. H. Wilkinson. Progress report on the Automatic Computing Engine.
Report MA/17/1024, Mathematics Division, Department of Scientific and
Industrial Research, National Physical Laboratory, Teddington, UK, April
1948. 127 pp.
[1081] J. H. Wilkinson. Linear algebra on the Pilot ACE. In Automatic Digital
Computation, Her Majesty’s Stationery Office, London, 1954. Reprinted in
[1104, pp. 337-344].
[1082] J. H. Wilkinson. The Pilot ACE. In Automatic Digital Computation, Her
Majesty’s Stationery Office, London, 1954, pages 5-14. Reprinted in [87,
pp. 193-199] and [1104, pp. 219-228].
[1083] J. H. Wilkinson. The use of iterative methods for finding the latent roots and
vectors of matrices. Mathematical Tables and Other Aids to Computation, 9:
184–191, 1955.
[1084] J. H. Wilkinson. Error analysis of floating-point computation. Numer. Math.,
2:319-340, 1960.
[1085] J. H. Wilkinson. Error analysis of direct methods of matrix inversion. J.
Assoc. Comput. Mach., 8:281–330, 1961.
[1086] J. H. Wilkinson. Error analysis of eigenvalue techniques based on orthogonal
transformations. J. Soc. Indust. Appl. Math., 10(1):162–195, 1962.
[1087] J. H. Wilkinson. Plane rotations in floating-point arithmetic. In Experi-
mental Arithmetic, High Speed Computing and Mathematics, volume 15 of
Proceedings of Symposia in Applied Mathematics, American Mathematical
Society, Providence, RI, USA, 1963, pages 185-198.
[1088] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Notes on Applied
Science No. 32, Her Majesty’s Stationery Office, London, 1963. vi+161 pp.
Also published by Prentice-Hall, Englewood Cliffs, NJ, USA. Reprinted by
Dover, New York, 1994. ISBN 0-486-67999-3.
[1089] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press,
1965. xviii+662 pp. ISBN 0-19-853403-5 (hardback), 0-19-853418-3 (paper-
back) .
[1090] J. H. Wilkinson. Error analysis of transformations based on the use of ma-
trices of the form I – 2ww^H. In Error In Digital Computation, Louis B. Rall,
editor, volume 2, Wiley, New York, 1965, pages 77–101.
[1091] J. H. Wilkinson. Błędy Zaokrągleń w Procesach Algebraicznych. PWN,
Warszawa, 1967. Polish translation of [1088].
[1092] J. H. Wilkinson. A priori error analysis of algebraic processes. In Proc.
International Congress of Mathematicians, Moscow 1966, I. G. Petrovsky,
editor, Mir Publishers, Moscow, 1968, pages 629–640.
[1093] J. H. Wilkinson. Rundungsfehler. Springer-Verlag, Berlin, 1969. German
translation of [1088].
[1094] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Nauka, U.S.S.R.
Academy of Sciences, 1970. 564 pp. Russian translation of [1089].
[1095] J. H. Wilkinson. Modern error analysis. SIAM Rev., 13(4):548-568, 1971.
[1096] J. H. Wilkinson. Some comments from a numerical analyst. J. Assoc. Com-
put. Mach., 18(2):137-147, 1971. Reprinted in [3].
[1097] J. H. Wilkinson. The classical error analyses for the solution of linear systems.
IMA Bulletin, 10(5/6):175-180, 1974.
[1098] J. H. Wilkinson. Numerical linear algebra on digital computers. IMA Bul-
letin, 10(9/10):354-356, 1974.
[1099] J. H. Wilkinson. Turing’s work at the National Physical Laboratory and the
construction of Pilot ACE, DEUCE, and ACE. In A History of Computing
in the Twentieth Century: A Collection of Essays, N. Metropolis, J. Howlett,


and Gian-Carlo Rota, editors, Academic Press, New York, 1980, pages 101–
114.
[1100] J. H. Wilkinson. The state of the art in error analysis. In NAG Newsletter
2/85, Numerical Algorithms Group, Oxford, UK, November 1985, pages 5–
28.
[1101] J. H. Wilkinson. Error analysis revisited. IMA Bulletin, 22(11/12):192-200,
1986.
[1102] J. H. Wilkinson and C. Reinsch, editors. Linear Algebra, volume II of Hand-
book for Automatic Computation. Springer-Verlag, Berlin, 1971. ix+439 pp.
ISBN 3-540-05414-6.
[1103] James H. Wilkinson. The perfidious polynomial. In Studies in Numerical
Analysis, G. H. Golub, editor, volume 24 of Studies in Mathematics, Mathe-
matical Association of America, Washington, DC, 1984, pages 1–28.
[1104] M. R. Williams and Martin Campbell-Kelly, editors. The Early British Com-
puter Conferences, volume 10 of Charles Babbage Institute Reprint Series for
the History of Computing. MIT Press, Cambridge, MA, USA, 1989. xvi+508
pp. ISBN 0-262-23136-0.
[1105] S. Winograd. A new algorithm for inner product. IEEE Trans. Comput.,
C-18:693-694, 1968.
[1106] S. Winograd. On multiplication of 2 × 2 matrices. Linear Algebra Appl., 4:
381-388, 1971.
[1107] Jack M. Wolfe. Reducing truncation errors by programming. Comm. ACM,
7(6):355–356, 1964.
[1108] Philip Wolfe. Error in the solution of linear programming problems. In Error
In Digital Computation, Louis B. Rall, editor, volume 2, Wiley, New York,
1965, pages 271-284.
[1109] Stephen Wolfram. Mathematica: A System for Doing Mathematics by Com-
puter. Second edition, Addison-Wesley, Reading, MA, USA, 1991. xxii+961
pp. ISBN 0-201-51507-5.
[1110] Michael Woodger. The history and present use of digital computers at the
National Physical Laboratory. Process Control and Automation, pages 437–
442, November 1958. Reprinted in [186, pp. 125-140].
[1111] H. Woźniakowski. Numerical stability for solving nonlinear equations. Nu-
mer. Math., 27:373–390, 1977.
[1112] H. Woźniakowski. Numerical stability of the Chebyshev method for the so-
lution of large linear systems. Numer. Math., 28:191-209, 1977.
[1113] H. Woźniakowski. Roundoff-error analysis of iterations for large linear sys-
tems. Numer. Math., 30:301-314, 1978.
[1114] H. Woźniakowski. Roundoff-error analysis of a new class of conjugate-
gradient algorithms. Linear Algebra Appl., 29:507-529, 1980.
[1115] Margaret H. Wright. Interior methods for constrained optimization. In Acta
Numerica, Cambridge University Press, Cambridge, UK, 1992, pages 341–
407.
[1116] Stephen J. Wright. A collection of problems for which Gaussian elimination
with partial pivoting is unstable. SIAM J. Sci. Statist. Comput., 14(1):231–
238, 1993.
[1117] P. Y. Yalamov. Roundoff errors and graphs. Manuscript, 1994.
[1118] E. L. Yip. A note on the stability of solving a rank-p modification of a linear
system by the Sherman–Morrison–Woodbury formula. SIAM J. Sci. Statist.
Comput., 7(3):507-513, 1986.
[1119] J. M. Yohe. Implementing nonstandard arithmetics. SIAM Rev., 21(1):34-56,
1979.
[1120] J. M. Yohe. Software for interval arithmetic: A reasonably portable package.
ACM Trans. Math. Software, 5(1):50–63, 1979.
[1121] J. M. Yohe. Portable software for interval arithmetic. Computing, Suppl. 2:
211-229, 1980.
[1122] David M. Young. Iterative Solution of Large Linear Systems. Academic
Press, New York, 1971. xxiv+570 pp. ISBN 0-12-773050-8.
[1123] David M. Young. A historical overview of iterative methods. Comput. Phys.
Comm., 53:1-17, 1989.
[1124] Gideon Yuval. A simple proof of Strassen’s result. Inform. Process. Lett., 7
(6):285-286, 1978.
[1125] Adam T. Zawilski. Numerical stability of the cyclic Richardson iteration.
Numer. Math., 60:251-290, 1991.
[1126] Hongyuan Zha. A componentwise perturbation analysis of the QR decom-
position. SIAM J. Matrix Anal. Appl., 14(4):1124–1131, 1993.
[1127] Hongyuan Zha. Problem 10312. Amer. Math. Monthly, 100(5):499, 1993.
[1128] G. Zielke. Report on test matrices for generalized inverses. Computing, 36:
105-162, 1986.
[1129] Gerhard Zielke. Some remarks on matrix norms, condition numbers, and
error estimates for linear equations. Linear Algebra Appl., 110:29–41, 1988.
[1130] K. Ziętak. On a particular case of the inconsistent linear matrix equation
AX + YB = C. Linear Algebra Appl., 66:249-258, 1985.
[1131] Abraham Ziv. Relative distance-An error measure in round-off error anal-
ysis. Math. Comp., 39(160):563–569, 1982.
[1132] Abraham Ziv. Fast evaluation of elementary mathematical functions with
correctly rounded last bit. ACM Trans. Math. Software, 17(3):410–423, 1991.
[1133] Abraham Ziv. Converting approximate error bounds into exact ones. Math.
Comp., 64(209):265-277, 1995.
[1134] Zahari Zlatev, Jerzy Wasniewski, and Kjeld Schaumburg. Condition number
estimators in a sparse matrix software. SIAM J. Sci. Statist. Comput., 7(4):
1175–1189, 1986.
Name Index

Science is organized knowledge.


— HERBERT SPENCER, Essays On Education (1861)

A suffix “t” after a page number denotes a table, “f” a figure, “n” a footnote, and
“q” a quotation.

Aasen, Jan Ole, 226 Axelsson, Owe, 343


Abdelmalek, Nabih N., 385, 410
Abramowitz, Milton, 35 Babuška, Ivo, 100
Acton, Forman S., 35, 203 q, 285 q Bachelis, Boris, 61
Adams, Duane A., 113 Bai, Zhaojun, 322, 352, 527
Aggarwal, Vijay B., 84 Bailey, David H., 448, 460, 461, 471,
Ahac, Alan A., 198 482, 491 q, 505
Ahlberg, J. H., 167 Baksalary, J. K., 323, 324
Aitken, A. C., 383 Bane, Susanne M., 461
Albers, Donald J., 513 n Ballester, C., 441
Bank, Randolph E., 257
Alefeld, Göltz, 486
Bareiss, E. H., 53, 191
Alexopolous, Aristides G., 54
Bargmann, V., 186
Allen, Jr., Richard C., 84
Barlow, Jesse L., 33, 51, 53, 191, 193,
Almacany, Montaha, 425 q
301, 385, 411, 412
Aluru, Srinivas, 459
Barnett, S., 323
Alvarado, Fernando L., 165 Barone, John L., 391 q, 410
Amato, James J., 323 Barrett, Geoff, 59
Ames, William F., 32 Barrlund, Anders, 194, 198, 224
Amodio, Pierluigi, 197 Bartels, R. H., 312
Anda, Andrew A., 385 Bartels, Sven G., 141, 304, 441
Anderson, E., 289 q Barwell, Victor, 226
Anderson, T. W., 527 Bauer, F. L., 57, 83, 119, 126, 139,
Ando, T., 196 146, 148, 149, 191, 284, 551
Arioli, Mario, 143, 304, 328, 343, 410, Beam, Richard M., 527
411, 421, 422 Beaton, Albert E., 391 q, 410
Arnold, William F., 323 Bell, E. T., 281 q
Ashenhurst, R. L., 489 Bellman, Richard, 324
Asplund, Edgar, 305 Benford, Frank, 51
Atanasoff, John V., 151 q Benôit, Commandant, 224


Benschop, N. F., 327 Caffney, John, 520


Berman, Abraham, 147, 579 q, 580 Calvetti, D., 440-442
Bhatia, Rajendra, 322, 382 Calvin (and Hobbes), 473 q
Bini, Dario, 441, 456, 457 Calvo, M. P., 32
Birkhoff, Garrett, 32, 281, 483 Campbell, S. L., 337
Bischof, Christian H., 299, 301, 370, Campbell-Kelly, Martin, 245 q
384 Canuto, Claudio, 32
Björck, Åke, 83, 242, 361q, 379, 385, Cao, Wei-Lu, 305
386, 388, 397, 399, 400, 402, Caprani, Ole, 100
403, 409-411, 422, 423, 434, Cardano, Geronimo, 483
436, 441, 570 Carr III, John W., 57
Bjørstad, Petter, 461 Carter, Russell, 496
Blanch, G., 507 Cauchy, Augustin-Louis, 516
Bliss, B., 488 Cayley, Arthur, 446
Blue, James L., 502 Chaitin-Chatelin, Françoise, 53, 358,
Bodewig, E., 188 488
Bohlender, Gerd, 98 Chan, Raymond H., 471
Bohte, Z., 183 Chan, Tony F., 13, 33, 146, 147, 386
Boley, Daniel, 243 Chandrasekaran, Shivkumar, 136, 141,
Bollen, Jo A. M., 328 386
Bondeli, S., 525 Chang, Xiao-Wen, 225
Boros, T., 441 Chartres, Bruce A., 190
Borwein, J. M., 491 q Chatelin, Françoise, see Chaitin-Chatelin,
Borwein, P. B., 491 q Françoise
Bowden, B. V., 203q Choi, Man-Duen, 513 q, 526
Bowdler, H. J., 196 Cholesky, André-Louis, 224
Boyd, David W., 291, 293, 304 Christiansen, Søren, 146
Boyle, Jeff, 51 Chu, Eleanor, 193
Brent, Richard P., 51, 162, 164, 448, Chu, King-wah Eric, 323
451, 453, 460, 486, 504, 508 Cipra, Barry A., 32
Briggs, William L., 470 Clasen, B.-I., 284
Brightman, Tom, 60 Clenshaw, C. W., 32, 53, 113
Brown, W. S., 498, 501 Cline, Alan K., 289q, 297, 299, 411,
Brunet, Marie-Christine, 53, 488 480
Buchan, John, 62 q Cline, R. E., 422
Buchanan, James L., 165 Clinger, William D., 61
Buchholz, W., 60 Cody, Jr., William J., 39q, 54, 55, 59,
Bukhberger, B., 305 60, 495, 497, 499, 504
Bulirsch, R., 83, 84, 190 Cohen, A. M., 180, 522
Bunch, James R., 141, 146, 149, 219- Concus, P., 257
221, 225, 226, 231 q Corm, Andrew R., 299
Buoni, John J., 198 Conte, Samuel D., 190
Burgmeier, James W., 84 Cooley, James W., 470
Businger, Peter A., 146, 193, 410 Coomes, Brian A., 32
Butcher, J. C., 100 Coonen, Jerome T., 59, 495
Byers, Ralph, 305, 320, 323, 324, 353, Cope, J. E., 145
567 Coppersmith, Don, 448
Corless, Robert M., 32 Dunham, C. B., 425q


Cortés, Joaquín, 197 Dwyer, Paul S., 169 q
Cottle, Richard W., 224
Cox, M. G., 32 Eckart, Carl, 126
Crout, Prescott D., 195 Edelman, Alan, 63, 162, 180, 181, 197,
Cryer, Colin W., 180, 181, 196 198, 287, 386, 483n, 518,
Curtis, A. R., 197 518 q, 527, 534
Cybenko, George, 225 Eirola, Timo, 33
Eldén, Lars, 412
Dahlquist, Germund, 83, 160 Eldersveld, Samuel K., 257
Daniel, J. W., 385 Elfving, Tommy, 441
Datta, Karabi, 315 Emel’yanenko, G. A., 305
Davis, Philip J., 32, 100, 471 Enright, Wayne H., 33
Dax, Achiya, 226, 337 Erisman, A. M., 190, 193, 200
Day, Jane M., 197 Espelid, Terje O., 100
de Boor, Carl, 32, 190, 196
de Jong, Lieuwe Sytse, 33 Faddeeva, V. N., 195
de Rijk, P. P. M., 410, 411 Fairgrieve, Thomas F., 61, 530
Dekker, T. J., 57, 93, 284, 504 Fan, Ky, 386
del Ferro, Scipione, 483 Farebrother, R. W., 409
Demeure, Cédric J., 441 Farkas, I., 113
Demmel, James W., 47, 53, 60, 61, Farnum, Charles, 497
84, 126, 140, 141, 143, 149, Fateman, Richard J., 59
165, 198, 207, 208, 224, 243, Feingold, David G., 257
248, 250-253, 257, 304,308, Feldstein, A., 62
322, 352, 417, 422, 486, 495, Ferguson, H. R. P., 461, 482
499, 504, 527, 536 Ferguson, Jr., Warren E., 49, 60
Dennis, Jr., J. E., 32, 328, 477, 478 Ferng, William R., 301
Descloux, J., 343 Fike, C. T., 114
Dhillon, Inderjit, 60, 536 Fischer, Patrick C., 460
Diamond, Harold G., 62 Flannery, Brian P., 479, 490, 507
Dixon, John D., 300 Fletcher, R., 32, 147, 149, 225, 226
Dongarra, Jack J., 188, 195,225,231 q, Forsgren, Anders, 225
257, 499, 581 q Forsythe, George E., 32, 33, 35, 52,
Doolittle, Myrick Hascall, 195 57, 84, 95, 138, 146, 164,
Dorn, William S., 83, 100, 113 190, 196, 197, 235, 241, 242,
Douglas, Craig C., 460, 575 245 q, 261 q, 262, 282, 305,
Douglas, Jr., Jim, 32, 188 325 q, 491 q, 526
Doyle, Sir Arthur Conan, 289 q Foster, Leslie V., 178, 411
Drake, J. B., 261 q Foulser, David E., 146, 147
Drmač, Zlatko, 225
Du Croz, Jeremy J., 265, 272, 273, 190
284 Frayssé, Valérie, 53, 358
Dubrulle, Augustin A., 285, 511 Friedland, Shmuel, 348, 359
Duff, Iain S., 143, 195, 200, 225, 226, Funderlic, R. E., 198
257, 304, 343, 410, 411, 527
Duncan, Martin, 96 Gaches, J., 132, 145
Gahinet, Pascal M., 323 Grimes, Roger G., 305, 527


Gal, Shmuel, 61 Grosse, Eric, 581 q
Gallopoulos, E., 488 Gu, Ming, 352, 536
Gander, Walter, 525 Gudmundsson, Thorkell, 301
Gantmacher, F. R., 172 Guggenheimer, Heinrich W., 287
Gardiner, Judith D., 323 Gulliksson, Mårten, 412
Gardner, Martin, 127 q Gurwitz, Chaya, 32
Garner, Harvey L., 62 Gustafson, John L., 459
Gasca, M., 196 Gustavson, F. G., 195
Gastinel, Noel, 123, 126
Gauss, Carl Friedrich, 1q, 195, 219, Hager, William W., 294, 304
325 q, 391 Hall, Jr., Marshall, 179
Gautschi, Walter, 425 q, 428, 429 Halmos, Paul R., 513 q
Gautschi, Werner, 351 Hamada, Hozumi, 54
Gay, David M., 61 Hammarling, Sven J., 322, 326, 327,
Geist, G. A., 261 q 385, 503
Geman, Stuart, 518 Hammel, Stephen M., 32
Gentle, James E., 34 Hammer, Rolf, 487
Gentleman, W. Morven, 113,384,385, Hamming, R. W., 57, 443
470, 576 Handy, Susan L., 412
George, Alan, 193, 224, 226 Hansen, Per Christian, 146, 386, 461
Geuder, James C., 190 Hanson, Richard J., 385, 409, 410,
Ghavimi, Ali R., 320, 322 422, 503
Gill, Philip E., 32, 225, 226, 229, 422, Harris, P. M., 32
479 Harter, Richard, 459
Gill, S., 92 Hartfiel, D. J., 146
Givens, Wallace J., 33, 67 q Hearon, John Z., 315
Gluchowska, J., 411 Heath, Michael T., 261 q, 399
Gohberg, I., 141, 440, 516 Hein, Piet, 36 q, 127q
Goldberg, David, 39 q, 57, 93, 534 Helvin, Marie, 166 q
Goldberg, I. Bennett, 62 Henderson, Harold V., 323, 490
Goldstine, Herman H., 1 n, 33,52, 187, Hennessy, John L., 39 n, 57
196, 261q, 263, 517 Henrici, Peter, 32, 52, 53, 67q, 84,
Golub, Gene H., xxiv, 13, 27, 33, 146, 351
182, 190, 195, 223, 231 q, Henson, Van Emden, 470
257, 285, 301, 311, 312, 312 n, Heroux, Michael, 460, 575
327, 352, 384, 386, 388, 391, Herzberger, Jurgen, 486
392, 400, 409-412, 441, 580 Hewer, Gary, 323
Goodman, R., 62 Higham, Desmond J., 32, 126, 141,
Goodnight, James H., 285 149, 178, 179, 195, 345q,
Gould, Nicholas I. M., 181, 197 348, 407, 441
Govaerts, W., 242 Higham, Nicholas J., 72, 100, 141, 145,
Gragg, W. B., 297, 385 164, 165, 178, 179, 195, 197,
Graham, Ronald L., 87 q, 520 q 223, 225, 242, 243, 248, 250-
Grebogi, Celso, 32 253, 257, 265, 272, 273, 284,
Greenbaum, A., 328, 329 295, 300, 304-307, 322, 324,
Gregory, Robert T., 514, 525 328, 329, 354, 356, 358, 384,
386, 389, 407, 410, 411, 417, Kahan, William M. (Velvel), 1q, 29,
422, 440, 443, 460, 461, 489, 33, 34, 46, 47, 47q, 50, 50q,
508, 574, 577 59, 63-65, 75, 86, 92-95, 98,
Hilbert, David, 526 113, 123, 126, 136, 161, 165,
Hildebrand, F. B., 33, 39 q 169 q, 225, 243, 486, 490,
Hocks, Matthias, 487 494, 496, 497, 499, 501, 502,
Hodel, A. Scottedward, 323 507
Hodges, Andrew, xxvii Kahaner, David K., 391 q
Hoffman, A. J., 386 Kailath, T., 441
Hoffmann, Christoph M., 34 Kala, R., 323, 324
Hoffmann, W., 284, 385 Kaniel, S., 226
Hooper, Judith A., 504 Karasalo, Ilkka, 165
Horn, Roger A., 119, 150, 310, 348, Karatsuba, A., 461
546, 551, 555, 558, 580 Karlin, Samuel, 523
Horning, Jim, 491 q Karlson, Rune, 404, 413
Hotelling, Harold, 186, 187, 445 q Karney, David L., 514, 525
Hough, David, 47, 509 Karp, A., 195
Householder, Alston S., 2, 117q, 126, Karpinski, Richard, 56, 497
160, 172, 383, 410, 565 Kate, Tosio, 126
Hull, T. E., 52, 61, 488, 506, 530 Kaufman, Linda, 221, 225, 226, 385
Huskey, H. D., 188 Keiper, Jerry B., 36
Hussaini, M. Yousuff, 32 Keller, Herbert Bishop, 189,257,325 q
Hyman, M. A., 282 Kennedy, Jr., William J., 34
Kenney, Charles S., 301, 323, 525
Ikebe, Yasuhiko, 303, 305, 307 Kerr, Thomas H., 225
Incertis, F., 512 Kiełbasiński, Andrzej, 84, 94, 224, 242,
Ipsen, Ilse C. F., 136, 141, 147, 385, 406, 411, 418, 422
386 Kincaid, D. R., 503
Iri, Masao, 53, 488 Kittaneh, Fuad, 525
Isaacson, Eugene, 189, 257 Knight, Philip A., 328, 329, 354, 356,
358, 460, 461
Jalby, William, 385 Knuth, Donald E., xxiv, 54, 57, 58,
Jankowski, M., 94, 241, 242 67q, 87q, 93, 94, 114, 461,
Jansen, Paul, 486 491 q, 520q, 526, 534
Jennings, A., 165 Koçak, Hüseyin, 32
Jennings, L. S., 421 Koltracht, I., 141, 517
Johnson, Charles R., 119, 150, 287, Korner, T. W., 465q
310, 348, 546, 551, 555, 558, Kornerup, Peter, 54
580 Kostlan, Eric, 518, 527
Johnson, Samuel, 675 Kovarik, Z. V., 145
Jones, Mark T., 226 Kowalewski, G., 440
Jones, William B., 507 Krasny, Robert, 505
Jordan, Camille, 284 Kreczmar, Antoni, 460
Jordan, T. L., 409 Krogh, F. T., 503
Jordan, Wilhelm, 284 Krol, Ed, 582
Kruckeberg, F., 198
Kågstrom, Bo, 305, 320, 322-324 Kubota, Koichi, 488
Kuczyński, J., 300 Malcolm, Michael A., 33, 95, 97, 196,


Kuki, H., 60 245q, 262, 305, 497, 509,
Kulisch, Ulrich W., 486, 487 525
Kuperman, I. B., 146 Marine, F., 461
Manteuffel, Thomas A., 165
Marovich, Scott B., 576
La Porte, M., 53
Marsaglia, George, 527
La Touche, Mrs., 87q, 88
Martin, R, S., 196
Laderman, Julian, 450, 459
Mascarenhas, Walter, 197
Lagrange, Joseph Louis, 219
Mathias, Roy, 165, 223, 386, 515
Lancaster, Peter, 322 Matsui, Shouichi, 53
Lanczos, Cornelius, 490 Mattheij, R. M. M., 257
Laratta, A., 421, 422 Matula, David W., 54, 57, 62
Larson, John L., 488 Mazzia, Francesca, 197
Larsson, S., 32 McCarthy, Charles, 146
László, Lajos, 351 McCracken, Daniel D., 83, 100, 113
Laub, Alan J., 301, 320, 322, 323, 525 McKeeman, William Marshall, 196,
Läuchli, Peter, 388 241
Lawson, Charles L., 385, 409, 410, McKenney, Alan, 322, 527
422, 502 Meinguet, Jean, 224, 284
LeBlanc, E., 486 Mendelssohn, N. S., 286
Lee, King, 460 Metcalf, Michael, 3
Lehmer, D. H., 529 Metropolis, N., 489
Lehoucq, R. B., 569 Meurant, Gérard, 257, 305
Lemeire, Frans, 165 Meyer, Jr., Carl D., 147, 337
Leuprecht, H., 98 Milenkovic, Victor J., 34
LeVeque, Randall J., 13, 33 Miller, D. F., 322
Lewis, John G., 33, 301, 305, 527 Miller, Webb, 83, 115, 451, 454, 473q,
Li, T. Y., 285 487, 488, 507, 517, 576
Li, Xiaoye, 495 Miranker, Willard L., 486
Linnainmaa, Seppo, 53, 83, 93, 100, Mirsky, L., 558
504 Moler, Cleve B., 33, 34, 59, 84, 95,
Linz, Peter, 100 164, 190, 196, 197, 231 q,
Linzer, Elliot, 470 235, 241, 243, 245q, 261q,
Liu, Joseph W. H., 224, 226 262, 282, 297, 304, 305, 323,
Longley, James W., 410 345q, 348, 359, 391q, 415q,
Lotstedt, Per, 412 441, 503, 508, 511, 526, 579q
Lotti, Grazia, 456, 457 Møller, Ole, 92
Lu, Hao, 441 Montgomery, D., 186
Moore, Ramon E., 486
Lynch, Robert E., 32
Morrison, Donald, 503, 511, 579 q
Lyness, J. N., 32, 441
Morton, K. W., 32
Lynn, M. Stuart, 327
Mukherjea, Kalyan, 382
Müller, K. H., 113
Mac Lane, Saunders, 483 Müller-Merbach, H., 32
Macleod, Allan J., 197 Murakami, H., 483 n
Makhoul, John, 225 Murray, Walter, 32,225,226,422,479
Nagy, James G., 471 Pereyra, Victor, 434, 436, 441


Nash, Stephen G., 312, 324, 391q, 567 Peters, G., 113, 196, 279, 284, 287,
Nashed, M. Zuhair, 409 411
Neumaier, A., 61, 94 Peterson, Brian, 197
Neumann, M., 198 Philippe, Bernard, 385
Newbery, A. C. R., 113 Pichat, M., 98
Newcomb, Simon, 51 Pierce, Daniel J., 301
Newman, Morris, 514, 520 Piessens, Robert, 32
Nickel, Karl, 94, 486 Pinkus, Allan, 32, 190, 196
Nilson, E. N., 167 Plemmons, Robert J., 147, 198, 301,
Nordio, Marcelo, 197 337, 411, 422, 471, 579q,
Notay, Yvan, 329 580
Poljak, Svatopluk, 140
Oberaigner, W., 98 Polman, Ben, 257
O’Cinneide, Colm Art, 147 Ponceleón, Dulce B., 226
Oettli, W., 135, 145 Poromaa, Peter, 305, 320, 324
Ofman, Yu., 461 Pothen, Alex, 165
O’Leary, Dianne Prost, 305 Powell, M. J. D., 225, 411, 474, 477
Olesky, D. D., 198 Power, Stephen, 224
Oliver, J., 113, 440 Prager, W., 135, 145
Olkin, Ingram, 527 Press, William H., 479, 490, 507
Olshevsky, V., 440, 441 Priest, Douglas M., 33, 34, 58, 96-99,
Olver, F. W. J., 53, 76, 84, 113, 190 102, 502, 504, 531
225 Pryce, J. D., 76, 242
Opfer, Gerhard, 441 Puglisi, Chiara, 371
Ortega, James M., 384 Pukelsheim, Friedrich, 323
Osborne, M. R., 421 Puschmann, Heinrich, 197
Ostrowski, A. M., 350, 546
Quarteroni, Alfio, 32
Paige, C. C., 225, 379, 385, 388, 397, Quinlan, Gerald D., 96
411, 415q, 421, 422, 570 Quinn, Thomas R., 96
Palmer, John, 525
Palmer, Kenneth J., 32 Rabinowitz, Philip, 32, 100
Pan, Victor, 441, 448, 450, 459 Raimi, Ralph A., 52
Papadimitriou, Pythagoras, 386 Rail, Louis B., 409, 486, 488
Papadopoulos, Philip M., 323 Ramos, George U., 470
Park, Haesun, 385 Rath, Wolfgang, 385
Parlett, Beresford N., xxiv, xxviii, 27, Ratz, Deitmar, 487
34, 58, 87q, 187, 219, 225, Ratz, H. C., 327
361 q, 383-385 Razaz, M., 113
Pasternak, Mary E., 488 Reichel, Lothar, 113, 358, 440-442,
Patashnik, Oren, 87 q, 520 q 527
Patrick, Merrell L., 226 Reichelt, Mark W., 527
Patterson, David A., 39 n, 57 Reid, J. K., 3, 190, 193, 197, 200, 226,
Paxson, Vern, 61 411
Pelz, Richard, 505 Reinsch, C., xxviii, 284, 585 q
Peña, J. M., 192, 196 Reiser, John F., 58
Ren, Huan, 60 Shroff, Gautam M., 301


Rew, R. K., 289 q, 299, 480 Shub, Michael, 147, 518
Rice, John R., 33, 385 Simon, Horst D., 460
Richtmyer, Robert D., 32 Skeel, Robert D., 36, 135, 146, 190,
Rigal, J. L., 132, 145 192, 198, 235, 240, 241, 486
Robertazzi, T. G., 98 Slishman, Gordon, 460, 575
Roberts, J. D., 322 Smale, Steve, 2 n
Rohn, 128, 140, 141, 149 Smith, David M., 505
Romani, Francesco, 328 Smith, Francis J., 113, 439
Rose, Donald J., 257 Smith, Jon M., 37
Rosenthal, Peter, 322 Smith, Robert L., 503, 509
Ross, D. R., 97 Smith, Roger M., 460, 575
Rowan, Thomas Harvey, 488, 489 Smoktunowicz, Alicja, 94, 141, 149,
Rubin, Donald B., 391 q, 410 242, 411
Ruhe, Axel, 385 Snyder, James N., 231 q, 241
Ruiz, Daniel, 343 Sokolnicka, Jolanta, 242
Rust, B. W., 145
Sorensen, Danny C., 195, 225, 229,
Rutishauser, Heinz, 514, 520
257
Sørevik, T., 461
Sameh, Ahmed H., 162, 164, 488
Spellucci, P., 32
Samelson, Klaus, 57
Spencer, Herbert, 665
Sande, G., 470
Spieß, J., 460
Sanz-Serna, J. M., 32, 33
Saunders, Michael A., 225, 226, 229, Spooner, David, 488
257, 422 Steele, Jr., Guy L., 61
Sautter, Werner, 190 Stegun, Irene A., 35
Scarborough, James B., 33, 58 Sterbenz, Pat H., 33, 34, 50, 57, 60,
Schaumburg, Kjeld, 305 489
Schelin, Charles W., 61 Stewart, G. W. (Pete), xxiv, 75, 113,
Scherer, R., 84 119, 126, 131q, 139, 146,
Schnabel, Robert B., 32 147, 151 q, 164, 174, 195,
Schneider, Hans, 124, 359 197, 198, 224, 231 q, 241,
Schonfelder, J. L., 113 242, 297, 305, 312, 324, 359,
Schreiber, Robert S., 165, 180, 245 q, 381,382,385,386,392, 394,
250–253, 257, 371, 384, 576 407,409-411,504, 519,527,
Schryer, N. L., 498 580
Schwartz, S. C., 98 Stewart, William J., 305
Schwetlick, Hubert, 84,406,411,418, Steer, J., 83, 84, 119, 126, 127, 190
422 Stone, Betty Jane, 126
Searle, Shayle R., 323, 490 Storey, C., 323
Sha, Xuan-He, 450, 459 Stoutemyer, David R., 489
Shampine, Lawrence F., 32, 33, 84, Strakos, Zdenek, 329
95, 527 Strang, Gilbert, 15, 124, 146, 471
Shannon, Claude E., 60 Strassen, Volker, 446, 461, 462
Shapiro, Alexander, 146 Straus, E. G., 138, 146
Shepherd, David, 59 Street, Anne Penfold, 179
Shinnerl, Joseph R., 225, 229 Stummel, Friedrich, 34, 83, 191
Sun, Ji-guang, 119, 126, 131 q, 139, Underhill, L. G., 527


146, 194, 198,209,224,382, Ungar, Peter, 461
386,392,394,404,407, 409, Urabe, Minoru, 62
411, 413, 580 Uspensky, J. V., 490
Sun, Xiaobai, 384
Swann, W. H., 474 M., 461
Swartzlander, Jr., Earl E., 54 van der Sluis, A., 137, 138, 198, 207,
Sweeney, D. W., 60 391 q, 409
Swenson, J. R., 52 Van der Vorst, Henk A., 195, 225,
Sylvester, James Joseph, 309 q, 322, 257, 329
446 Van Loan, Charles F., xxiv, 27, 141,
149, 182, 190, 195, 223, 225,
Tang, Ping Tak Peterj 61, 499, 530 229, 231 q, 257, 299, 311,
Tang, W. P., 441 312, 312n, 345q, 348, 352,
Tao, Pham Dinh, 128, 304 359, 370, 371, 384, 386, 388,
Tartaglia, Niccolo, 483 392, 409, 412, 465q, 466,
Taussky, Olga, 513 q, 515 470, 580
Teukolsky, Saul A., 479, 490, 507 van Veldhuizen, M., 196
Thacher, Jr., Henry C., 196 van Wijngaarden, A., 501
Thisted, Ronald A., 34 Vanderbei, Robert J., 229
Thompson, Sir D’arcy Wentworth, 1q Varah, James M., 146, 167, 257, 258
Thron, W. J., 507 Varga, Richard S., 160, 167, 257, 508
Tienari, Martti, 53 Vemulapati, Udaya B., 301, 412
Todd, John, 126, 514, 520, 526, 527 225
Toh, Kim-Chuan, 483 n, 526 Vetterling, William T., 479, 490, 507
Torczon, Virginia J., 477-479 Vieta, Franciscus, 483
Tornheim, L., 180, 181 Vignes, J., 53
Traub, J. F., 440 Vitasek, Emil, 100
Trefethen, Lloyd N., 6 q, 32, 169 q, Viten’ko, I. V., 95
180, 328, 345 q, 346, 348, von Neumann, John, 33, 52, 186, 187,
352, 353, 355, 358, 483 n, 196, 263, 517
526, 527
Tremaine, Scott, 96 Waite, William, 497, 499
Tropp, Henry S., 491n Waldén, Bertil, 404, 413
Trummer, Manfred R., 32, 346, 355, Walker, Homer F., 32, 328
358 Wallis, Jennifer Seberry, 179
Tsao, Nai-kuan, 84, 384 Wallis, W. D., 179
Tukey, John W., 60, 470 Wang, Johnson J. H., 198
Turing, Alan Mathison, xxvii, xxviii, Ward, Robert C., 285
33, 126, 131 q, 188, 284,485 Warming, Robert F., 527
contributions in 1948 paper “Rounding- Wasilkowski, G. W., 147
off errors...”, 188, 284 Wasniewski, Jerzy, 305
Turnbull, H. W., 383, 523 Wasow, Wolfgang R., 32
Turner, Kathryn, 225 Watson, G. A., 146
Turner, Peter R., 51, 53, 165 Watterson, Bill, 473
Tyrtyshnikov, Evgenij E., 526 Wedderburn, J. H. M., 117q
Wedin, Per-Åke, 392, 407, 409, 412, Yalamov, P. Y., 83


422 Yip, E. L., 490
Wegert, Elias, 353 Yohe, J. Michael, 486, 504
Wei, Musheng, 388, 410 Yorke, James A., 32
Weidner, Peter, 486 Young, David M., 326 t, 350
Weiss, N., 147 Young, Gale, 126
Wendroff, Burton, 189, 197 Yuval, Gideon, 459
Werner, Wilhelm, 113
Westin, Lars, 305, 323 Zang, Thomas A., 32
Westlake, Joan R., 514 Zawilski, Adam T., 328
Wette, Matthew R., 323 Zehfuss, Johann Georg, 323
White, Jon L., 61 Zeller, K., 84
Zeng, Z., 285
Wichmann, B. A., 499, 502
Zha, Hongyuan, 167, 382
Wilkinson, J. H., xxi, xxiv, xxvii, xxviii,
Zielke, Gerhard, 126, 527
14, 25, 31, 33, 34, 39q, 52,
K., 323
56, 58, 67q, 73, 75, 84, 100,
Ziv, Abraham, 60, 61, 83, 84
l03q, 104, 113, 117q, 145,
Zlatev, Zahari, 305
151 q, 164, 175, 177, 178,
180-182, 187-191, 195-197,
203 q, 207, 208, 224, 232,
241, 252, 279, 283-285, 287,
297, 326, 327, 343 q, 348,
361 q, 364, 365, 384, 385,
387, 391, 400, 409-411, 474,
485, 542, 585 q
first program for Gaussian elim-
ination, 195–196
on the purpose of a priori error
analysis, 203 q
solving linear systems on desk cal-
culator in 1940s, 187–188
user participation in a computa-
tion, 31
Williams, Jack, 425 q
Wilson, Greg, 60
Winograd, Shmuel, 446, 448, 461
Wisniewski, John A., 488
Witzgall, C., 119, 126, 127
Wolfe, Jack M., 97
Wolfe, Philip, 32
Woodger, Michael, 169 q, 445 q
Woźniakowski, H., 32, 94, 147, 241,
242, 300, 328
Wrathall, Celia, 488
Wright, Margaret H., 32, 225, 479
Wright, Stephen J., 178
Subject Index

Knowledge is of two kinds.


We know a subject ourselves,
or we know where we can find information upon it.
— SAMUEL JOHNSON, Boswell’s Life of Johnson (1775)

A suffix “t” after a page number denotes a table, “f” a figure, “n” a footnote, and
“q” a quotation. Mathematical symbols and Greek letters are indexed as if they
were spelled out. The solution to a problem is not indexed if the problem itself is
indexed under the same term.

Aasen’s method, 226, 229 componentwise, 134, 141


absolute error, 4 evaluating, 143
absolute norm, 119 componentwise relative, 134
accuracy versus precision, 7, 33 definition, 7
ACRITH, 486 least squares problem, 404-407,
Aitken extrapolation, 101 413
alternating directions method, 477 linear system
approximation theory, references for Oettli-Prager theorem, 135
rounding error analysis, 32 Rigal-Gaches theorem, 14, 132
Augment precompiled, 486, 504, 506 Lyapunov equation, 316–317
augmented system matrix, 393 mixed forward-backward error, 8
scaling and conditioning, 402 normwise, 132
Automatic Computing Engine (ACE),
normwise relative, 132
56, 189t, 195, 343
preserving symmetric structure,
automatic differentiation, 488
149, 406–407
automatic error analysis, 473–490, see
structured, 141
also interval analysis; run-
ning error analysis Sylvester equation, 313-316
condition estimation, 480-481 underdetermined system, 419, 423
solving a cubic, 483–484 backward error analysis
Strassen’s inversion method, 481- development by Wilkinson, 33–
482 34, 189
using direct search optimization, in differential equations, 33
474-477 motivation, 7
not a panacea, 1q
backward error, 7-8 purpose, 71–72, 203 q

backward stability cancellation, 10–11, 30


componentwise, 142 in summation, 91, 543
definition, 8 not a bad thing, 11
normwise, 142 of rounding errors, 21–26
banded matrix, growth factor, 183 Cauchy matrix, 516-517
Bartels-Stewart method, 311-313 inverse, 516
Bauer’s scaling theorem, 139, 146 LDU factors, 516-517
bilinear noncommutative matrix mul- Cauchy-Schwarz inequality, 119
tiplication algorithm, 449– CELEFUNT, 499
450 chaos, references for rounding error,
error analysis, 456–457 analysis, 32
binary-decimal conversion, 61-62 Chebyshev spectral differentiation ma-
bit, 60 trix, 346, 355
BLAS (Basic Linear Algebra Subpro- Cholesky factorization, 204
grams), 586-587 computation of, 205
fast level 3, 460 conditions for success in floating
level 2 extended precision exten- point, 208-209
sion, 506 error analysis, 205–209
xNRM2 (2-norm), 502–503, 510– existence and uniqueness, 204
511 perturbation bounds, 209–210
block algorithm semidefinite matrix
advantages of, 245 q complete pivoting, 211
definition, 246 computation of, 210–21 1
block diagonal dominance, 251-255, error analysis, 214–218
257 existence and uniqueness, 210
and block LU factorization, 252– perturbation theory, 21 1–214
255 termination criteria, 21 7–218
definition, 251 chopping, 57
block LU factorization, 246-259 circulant matrix, 468
computation, 247 circulant system, error analysis for so-
definition, 246 lution by FFT, 468-470
error analysis, 250-257 CLAPACK, 588
existence and uniqueness, 247 colon notation, 2–3
stability companion matrix, 525-526
for (point) diagonally dominant singular values, 525–526
matrix, 255 comparison matrix, 157
for block diagonally dominant compensated summation, 92–97
matrix, 25 1–255 complete pivoting, 170
for block tridiagonal matrix, early use of, 196
258 fallacious criticism of, 200
for symmetric positive definite growth factor, 180-181, 197
matrix, 255–257 conjecture proved false, 181
Bunch-Kaufman factorization, 221-223 complex arithmetic, error analysis, 78–
Bunch-Parlett factorization, 219-220 80, 84
byte, 60 complex number
division without overflow, 503-
calculator, displaying words on, 37 504, 509
square root of, 36 UNICOS library, 448, 450


componentwise relative error, 5 Crout’s method, 174
condition number CS decomposition, 388, 408
distance to singularity and, 123, cubic equation
126, 140 Newton’s method, 489-490
estimation, 289–308 stability of explicit formulae for
counterexamples, 289 q, 294– roots, 483484
296, 299, 304, 305 cyclic reduction, 197
counterexamples by direct search,
480-481 denormalized numbers, see subnormal
for tridiagonal matrices, 301– numbers
303 departure from normality (Henrici’s),
incremental, 301 351-352
LAPACK estimator, 294-297, determinant, 281-283
480481 computation of, 282–283
LINPACK estimator, 297-299 condition number of, 287
probabilistic methods, 300-301 of upper Hessenberg matrix, 27–
general theory, 33 28
Hadamard, 281, 287 diagonal dominance, 181
minimizing by scaling, 136–139, and block LU factorization, 255
146 block, 251-255, 257
of function, 9 bound for LU factors of tridiag-
of linear system onal matrix, 185
componentwise, 135 growth factor, 181
normwise, 133 matrix inverse bound, 167
of rectangular matrix, 392 for tridiagonal matrix, 303
of square matrix, 121–123, 126 diagonal pivoting method, 218-223
of summation, 100 complete pivoting and its stabil-
Skeel’s, 135 ity, 219–220
conjugate gradient method, 328, 329, growth factor
341 complete pivoting, 220
circulant preconditioned, 471 partial pivoting, 222
continued fraction partial pivoting and its stability,
algorithms and error analysis, 507 221-223
evaluating in IEEE arithmetic, diagonally dominant matrix
492-493 bound for inverse, 167
running error bound, 85 growth factor for, 181
convergent matrix, 348 differential equations, see ordinary dif-
conversion, binary-decimal, 61–62 ferential equations; partial
correct significant digits, 4–5, 32 differential equations, refer-
Cramer’s rule, (instability of, 14-15, ences for rounding error anal-
34, 37 ysis
Cray computers direct search optimization methods,
adoption of IEEE arithmetic, 49 477-479
arithmetic on, 39 q, 495 discretization error, 6
puzzling results from Cray Y-MP distance to singularity
and Cray 2, 496497 componentwise, 140
normwise, 123 Winograd’s variant, 448, 455–


divided differences, 109-112 456
confluent, 430 3M method for complex multi-
Doolittle’s method, 173-174, 195 plication, 450
double rounding, 48, 63, 541 Winograd’s method, 446
Drazin inverse, 336-337 fixed point arithmetic, 56
drift, in floating point arithmetic, 58 fl operator (rounding), 42
dual norm, 119 floating point arithmetic, 39-65
dual vector, 119 alternatives to, 53-54
dynamical systems, references for round- banned from safety-critical sys-
ing error analysis, 32 tems, 499
binary-decimal conversion, 61-62
effective conditioning, 146 choice of base, 51, 60–61
EISPACK, 587 compiler optimization, dangers of,
ELEFUNT package, 499 497
equilibration, 136, 138, 192 determining properties of, 497-
error 498
absolute, 4 drift in, 58
backward, see backward error earliest subroutines, 39 q
forward, see forward error formal algebra, 58
mixed forward-backward error, 8 fused multiply-add operation, 60,
relative, 4, 5 65
sources of, 5-6 IEEE arithmetic, see IEEE arith-
error analysis, see rounding error anal- metic
ysis Language Independent Arithmetic
ESSL library (IBM), 448, 450 (LIA-1), 502
expm 1 function (ex – 1), 34 model, 58, 501–502
Brown’s, 498, 501-502
fan-in algorithm, 165 standard, 44
for summation, 88 with underflow, 61
for triangular system solution, 162– without guard digit, 49
164 multiple precision, 504–506
fast Fourier transform, 465-471 parameters for selected machines,
Cooley-Tukey factorization of DFT 41t
matrix, 466 parameters in software, specify-
error bound, 467 ing, 499–500
for solving circulant systems, 468- representation error, 51
470 rounding, see rounding
fast matrix multiplication, 445463 software issues, 491–512
bilinear noncommutative algorithm, speed of operations (relative), 60
449450 subnormal numbers, 41, 47, 495
deriving methods, 459460 subtraction done exactly, 49–50
error analysis, 450-459 testing accuracy of, 54–56
in the level 3 BLAS, 460 testing correctness of, 498499
Miller’s error results, 451 unit roundoff, 3, 42
record exponent, 448 wobbling precision, 43, 51
Strassen’s method, 446448 floating point coprocessor, 47
floating point numbers in ancient China, 195


characterization, 40 loop orderings, 195
normalized, 40 need for pivoting, 170
spacing between, 41 on Hessenberg matrix, 27–28
subnormal, 41, 47 partial pivoting, 170, 173
testing for equality, 495 pessimism of its accuracy in 1940s,
flop, 3 186–187
Fortran 90, 3 row and column scaling, 191–192
environmental inquiry functions, threshold pivoting, 200
498 use by Gauss, 195
matmul, 460 versus Cramer’s rule, 14–15
forward error, 7–8 without pivoting, instability of,
definition, 7 17
for linear system, 13 Gelfand’s problem, 51
mixed forward-backward error, 8 geometric computation, accuracy of
forward stability algorithms in, 34
componentwise, 142 Givens rotation, 371
definition, 10 disjoint rotations, 373-375, 387
normwise, 142 fast, 385
Fourier matrix, 179 gradual underflow, 47, 61
FPV (floating point verification) pack- Gram-Schmidt method, 376-381
age, 498499 classical
Frobenius norm, 120 algorithm, 377
fused multiply-add operation, 60, 65 error analysis, 378–379, 381
modified
γ n (error constant) algorithm, 377
definition, 69 connection with Householder
properties, 74 QR factorization, 361q, 379,
Gauss-Jordan elimination, 275-281,284- 385
285 error analysis, 378–381
algorithm, 276 error analysis for application
error analysis, 277–281 to LS problem, 396-397
Gauss-Seidel method, 325 q, 334 stability, 27
Gaussian elimination, 170–174, see also reorthogonalization, 385
LU factorization growth factor, 177-183
a posteriori stability tests, 192– a posteriori estimates for, 193
194 define using exact or computed
complete pivoting, 170 quantities?, 177, 196
computer programs for banded matrix, 183
first, 195-196 for complete pivoting, 180-181
history of, 196 for diagonal pivoting method, 220,
connection with LU factorization, 222
171 for diagonally dominant matrix,
error analysis, 174–177 181
history of, 186-191 for partial pivoting, 177-183
growth factor, 177–183, see also large growth in practical prob-
growth factor lems, 178
for random matrices, 196-197 IBM, ESSL library, 448, 450


for tridiagonal matrix, 183 IEEE arithmetic, 43, 4548
for upper Hessenberg matrix, 182 double rounding, 48, 63, 541
lower bound for, 179 exception handling, 46, 493–495
maximization by direct search, 475– exceptions, 4647
476 exploiting in software, 492–495
numerical maximization for com- extended formats, 47
plete pivoting, 181, 197 gradual underflow, 47
statistical model of, 180 implementation using formal meth-
guard digit, 48 ods, 59-60
test for, 56 46–47, 492, 495
NaN, 46, 492, 495
Haar distribution, random orthogonal parameters, 41 t, 45
matrix from, 519–520 recommended auxiliary functions,
Hadamard condition number, 281,287 495
Hadamard matrix, 128, 179, 181, 201 rounding modes, 46
Hadamard’s inequality, 287 signed zeros, 46
Harwell-Boeing sparse matrix collec- Standard 754, 45
tion, 527 Standard 854, 48
Heron’s formula, 50 subnormal numbers, 47, 495
Hessenberg matrix index of a matrix, 336
determinant of, 27-28, 34, 282- (IEEE arithmetic), 46–47, 492, 495
283 inner product
Gaussian elimination, 27-28, 34 error analysis, 68–71
growth factor for, 182 in extended precision, 70
Hewlett-Packard HP 48G calculator reducing constant in error bound,
condition estimator, 304 69
exhausting its range and preci- Intel Pentium chip, division bug, 59
sion, 17–18 Internet, 582
hidden bit, 45 interval analysis, 46, 198, 485–487
Hilbert matrix, 514-517 dependencies, 485
Cholesky factor, 515 fallibility, 487
inverse, 515 Gaussian elimination, 485-486
Hölder inequality, 118, 119 super-accurate inner product, 486
Homer’s method, 104-115 interval arithmetic, see interval anal-
for derivatives, 106-109 ysis
for rational function, 29 inverse iteration, 27
running error bound, 105–106, 113 inverse matrix, 26 1–287
Hough’s underflow story, 509-510 bound using diagonal dominance,
Householder transformation, 362-363 167
aggregated (WY representation), error analysis
370-371 for Gauss-Jordan elimination,
block, 384 277–281
error analysis, 364–369 for LU factorization, 270-275
history of, 383 for triangular matrix, 265-270
in QR factorization, 363–364 high accuracy computation of, 284
Hyman’s method, 34, 282-283, 287 in solving Ax = b, stability, 262
left and right residuals, 263-264 condition number estimation, 294–


perturbation theory, 140 297, 306
times for computation on early diagonal pivoting method, 227
computers, 276 t forward error bound for linear sys-
triangular tems, 144
bounds for, 159-161 iterative refinement, 242–243
error analysis, 265–270 least squares problem, 412
why not to compute, 262 LU factorization, 192, 198-199,
involutary matrix, 521 258
iterative methods, see also stationary matrix l-norm estimator, 294–
iterative methods 297, 480–481
dates of publication, 326 t matrix inversion, 285
error analysis, 329-341 QR factorization, 386-387
survey of, 327–329 Sylvester equation, 324
stopping criteria, 341–342 test matrix generation, 527
iterative refinement, 30, 188, 231–244, triangular systems, 166
497 2 x 2 problems, solving, 500-501
backward error analysis, 235-240 underdetermined system, 423
behaviour xLAMCH for determining machine
with GEPP, 239 parameters, 498
with QR factorization, 375–376 Latin, neoclassic, publishing papers
condition number estimate from, in, 470
243 LDLT factorization, 205
for least squares problem, 399- block, 218
403, 410 least significant digit, 40
for Vandermonde system, 438– least squares problem, 391-413
440 augmented system matrix, 393
forward error analysis, 232-235 scaling and conditioning of, 402
in fixed precision, 234 backward error, 404-407, 413
in mixed precision, 234 constrained, 412
LAPACK convergence test, 242 iterative refinement, 399–403, 410
practical issues, 242–243 Longley test problem, 410
modified Gram–Schmidt, error anal-
Jacobi method, forward error analy- ysis, 396–397
sis, 332–334 normal equations
Jordan canonical form, 346-348 error analysis, 397–399
versus QR factorization, 399
Kahan matrix, 161, 214 perturbation theory, 392–394
second smallest singular value, 167 QR factorization, error analysis,
Kreiss matrix theorem, 353 395-396
Kronecker product, 310, 323 seminormal equations, 403–404
weighted, 411–412
LANCELOT, 181, 197 Leja ordering, 111, 113, 115, 438
LAPACK, 587-589 level index arithmetic, 53
block and partitioned LU factor- linear system
ization, 258 large dense, in applications, 198
Cholesky factorization, 226-227 perturbation theory, 131–150
practical forward error bounds, without pivoting, instability of,


143–144 17
records for largest solved, 199 t Lyapunov equation
scaling before Gaussian elimina- backward error, 316-317
tion, 191–192, 197–198 discrete-time, 322
times for solution on early com-
puters, 189 t M-matrix, 580
LINPACK, 587 stability of LU factorization, 198
Cholesky factorization of semidef- triangular, 157, 159, 160
inite matrix, 217 machar code, 497498
condition estimator, 297–299, 305 machine epsilon, 41
iterative refinement, 242 magic square matrix, p norm of, 127
LU factorization, 192 mantissa, 40
matrix inversion, 267, 271–274 Maple, 6, 181, 506
tridiagonal system solution, 306 Markov chain, perturbation analysis
logarithmic distribution of numbers, for, 146
51, 53 Mathematical, 6, 181, 506
Longley test problem, 410 MATLAB, 3, 43, 583
condest, 304, 480
LU factorization, 169-201, see also Gaus-
fft, 468
sian elimination
inv, 271
a posteriori stability tests, 192–
rand, 518
194
randn, 518
block, see block LU factorization
rcond, 305, 480, 489
complete pivoting, 170
roots, 483
Crout’s method, 174
special matrices, 514
determinantal formulae for fac-
Symbolic Math Toolbox, 3, 521
tors, 172
Test Matrix Toolbox, 591–594
Doolittle’s method, 173-174
matrix
error analysis, 174–177 block diagonally dominant, 251
history of, 186-191 Cauchy, 516-517
existence and uniqueness, 171- circulant, 468
172 companion, 525–526
for nonsymmetric positive defi- comparison, 157
nite matrix, 223–224 condition number, 121–123, 126,
growth factor, 177-183, see also 392
growth factor confluent Vandermonde-like, 429
loop orderings, 195 convergent, 348
of Hessenberg matrix, 27–28 diagonally dominant, 181
of tridiagonal matrix, 184 distance to singularity, 123, 140
partial pivoting, 170, 173 Drazin inverse, 336-337
partitioned, error analysis of, 248– Fourier, 179
250 Hadamard, 128, 179, 181, 201
perturbation bounds, 194 Hilbert, 514–517
row and column scaling, 191–192 inversion, 261–287, see also in-
stability for M-matrix, 198 verse matrix
versus Cramer’s rule, 14–15 involuntary, 521
Kahan, 161, 167, 214 Bailey’s package MPFUN, 505


M-matrix, 580 Brent’s package, 486, 504
magic square, 127 mutation testing, 517
moment, 520
nonsymmetric positive definite, 223
Pascal, 520-524 NAG Library, 583-584
powers of, 345–359, see also pow- LAPACK in, 588
ers of a matrix machine constants, 500
pseudo-inverse, 392, 409, 412, 416 NaN (not a number), 46, 492, 495
random, 517–520 Nelder-Mead simplex method, 479
randsvd, 519–520 netlib, 582–583
second difference, 525 nonlinear equations, references for round-
semiconvergent, 337 ing error analysis, 32
Sylvester’s introduction of term, nonsymmetric positive definite matrix,
309 223
symmetric indefinite, 218 LU factorization, stability of, 223-
symmetric positive definite, 204 224
symmetric positive semidefinite, norm, 117–129
210 || · ||α,B, explicit formulae for, 128
symmetric quasidefinite, 229 absolute, 119
test, 513–528 consistent, 120
totally nonnegative, 176 dual, 119
tridiagonal, 183–186 Frobenius, 120
Toeplitz, 524–525 Holder inequality, 118
Vandermonde, 426 matrix, 120–124
Vandermonde-like, 429 norm equivalence constants, 122t
vet-permutation, 319, 323
matrix p-norm, 124–126
matrix multiplication
of magic square matrix, 127
backward error, 85
monotone, 119
error analysis, 76–78, 85
subordinate matrix, 120, 121
fast methods, 445-463
matrix norm, see norm 2-norm, evaluation without over-
meaningless answers, why you might flow, 502-503
get them, 35 unitarily invariant, 121
misconceptions, of floating point arith- vector, 118–119
metic, 31 norm equivalence constants, 121t
mixed forward–backward error, 8 normal equations, error analysis, 397–
modified Gram–Schmidt method, see 399
Gram–Schmidt method, mod- notation, explanation of, 2–3, 73–76
ified 0°, definition, 64
moment matrix, 520 NPSOL, 197
monotone norm, 119 numerical analysis, definition, 6
Moore-Penrose conditions, 412 numerical radius, 350
most significant digit, 40 numerical stability
multi-directional search method, 477– definition, 8, 33
479 for linear equation solvers, 141-
multiple precision arithmetic, 504–506 142
Oettli–Prager backward error theorem, polar decomposition, 386, 389


135 polynomials, 103–1 15, see also Horner’s
optimization, references for rounding method
error analysis, 32 divided differences, 109–1 12
ordinary differential equations fast evaluation schemes, 114, 115
accuracy of mesh point forma- Newton form, 109-112
tion, 101 PORT library, machine constants, 500
backward error in, 33 portability of software, 499-502
Euler’s method with compensated positive definite matrix, see nonsym-
summation, 95 metric positive definite ma-
references for rounding error anal- trix; symmetric positive def-
ysis, 32 inite matrix
outer product, error analysis, 71 power method, 26
overflow, 18, 42 for matrix l-norm estimation, 294-
avoiding, 502–504 297
for matrix p-norm estimation, 291-
p-norm power method, 291–294, 304 294, 304
parallel prefix operation, 165 powers of a matrix, 345–359
paranoia code, 497498 behaviour of stationary iteration,
partial differential equations, references 358
for rounding error analysis, departure from normality, 351-
32 352
partial pivoting, 170, 173 hump, 348-349
early use of, 196 in exact arithmetic, 346–353
growth factor, 177-183 in finite precision arithmetic, 353–
large growth in practical prob- 358
lems, 178 pseudospectrum, 352, 356-358
threshold pivoting, 200 role of spectral radius, 346–348
partitioned algorithm, definition, 246 precision
partitioned LU factorization, error anal-
effect of increasing, 19-21
ysis, 248–250
versus accuracy, 7, 33
Pascal matrix, 520-524
program verification, applied to error
Cholesky factor, 521
analysis, 488
inverse, 522
pseudo-inverse, 392, 409, 412, 416
total positivity, 523
ε-pseudospectral radius, 352
Patriot missile software problem, 506–
pseudospectrum, 352, 356-358
507
of companion matrix, 526
Pentium chip, division bug, 59
Pythagorean sum, 503, 511-512
performance profile, for LAPACK norm
estimator, 296–297
perturbation theory QR factorization, 361-389
by calculus, 144-145 column pivoting, 387
linear systems, 131-150 Givens, 371-373
statistical, 147, 149-150 cancellation of errors in, 24–26
Sylvester equation, 318-320 error analysis, 373–375
p i (π ), high precision calculation as Householder, 363-364
computer test, 491q error analysis, 368–369
error analysis for application automatic, 473–490


to LS problem, 395-396 demystified, 82–83
error analysis for partitioned graphs in, 83
(WY representation), 370- model
371 standard, 44
iterative refinement for linear sys- with underflow, 61
tem, 375–376 without guard digit, 49
perturbation theory, 381–383 notation, 73–76
rank-revealing, 386 ordering of operations, effect of,
quadratic equation, solving, 1 1–12, 33 77, 154
quadrature purpose of, 71–72, 203 q
accuracy of grid formation, 101 statistical approach, 52–53
error bound for evaluation of rule, rounding errors
86 accumulation of, 16
references for rounding error anal- are not random, 29, 52
ysis, 32 beneficial effects of, 26-27
cancellation of, 21–26
random matrices, 5 17–520 in subtraction, 49–50
condition number of, 518 statistical assumptions on, 52–53
expected number of real eigen- rules of thumb
values, 518 condition for computed powers
orthogonal, 519, 527 of matrix to converge to zero,
spectral radius of, 518 358
tend to be well conditioned, 518 forward error related to backward
2-norm of, 518 error and condition number,
with given singular values, 519– 10
520 relative speed of floating point
randsvd matrix, 519–520, 527 operations, 60
range reduction, 54 square root of constants in error
RCOND condition estimator (LINPACK, bound, 52
MATLAB), 305, 480–481, 489 Runge-Kutta method, 92, 100
relative error, 4, 5 running error analysis, 73, 489
componentwise, 5 for continued fraction, 85
relative error counter, <k>, 75 for Homer’s method, 105–106, 113
relative precision, 76 for inner product, 72–73
relative residual, 14
research problems, 102, 115, 201, 229, sample variance, see variance
244, 287, 308,324,359,423, ScaLAPACK, 588
442, 463, 490, 527, 528 scaling a linear system before Gaus-
residual, relative, 14 sian elimination, 191-192, 197–
Riccati equation, algebraic, 322 198
Rigal–Gaches backward error theorem, scaling to minimize the condition num-
132 ber, 136-139, 191
rounding, 4, 42 Schur complement, 219, 224, 247, 252
dealing with ties, 42, 58 perturbation bounds for symmet-
to even versus to odd, 58 ric positive semidefinite ma-
rounding error analysis trix, 212–218
second difference matrix, 525
semiconvergent matrix, 337
seminormal equations
    for least squares problem, 403–404
    for underdetermined system, 417
separation (sep), of two matrices, 318
Sherman–Morrison formula, 197, 490
significance arithmetic, 489
significant digits
    correct, 4–5, 32
    least and most significant, 40
singular value decomposition (SVD), 580
software
    avoiding underflow and overflow, 502–504
    effects of underflow, 504
    issues in floating point arithmetic, 491–512
    portability, 499–502
    specifying arithmetic parameters, 499–500
    specifying numerical constants, 501
SOR method, forward error analysis, 334
square root, of complex number, 36
stable algorithms, designing, 30–31
stationary iterative methods, 325–343
    and powers of a matrix, 358
    backward error analysis, 334–336
    forward error analysis, 329–334
        singular systems, 338–341
    Jacobi method, 332–334
    scale independence, 331
    singular systems, theory for, 336–338
    SOR, 334
    stopping criteria, 341–342
statistics, see also variance
    computational references, 34
sticky bit, 46
Strassen's method, 446–448
    accuracy compared with conventional multiplication, 454–455
    error analysis, 452–456
    error versus operation count, 16
    for inversion, 461–462, 481–482
    implementation issues, 460
    Winograd's variant, 448, 455–456
subdifferential, of a vector norm, 291
subgradient, 292
subnormal numbers, 41, 47, 495
summation, 87–102
    choice of method: summary, 98–100
    compensated and applications, 92–97
    condition number, 100
    criterion for minimizing error, 90
    distillation algorithms, 98
    doubly compensated, 96–97
    error analysis, 89–92
    insertion method, 88
    pairwise (fan-in), 88
    recursive, 88
        ordering in, 19, 90–91
    statistical error estimates, 98
SVD (singular value decomposition), 580
Sylvester equation, 309–324
    backward error, 313–316
    Bartels–Stewart method, 311–313
    generalizations, 321–324
    perturbation theory, 318–320
    practical error bounds, 320–321
    solution methods, 311–313
symbolic manipulation package, 6
symmetric indefinite factorization, 218, see also diagonal pivoting method
symmetric indefinite matrix, 218
symmetric positive definite matrix, 204
    and block LU factorization, 255–257
    practical test for, 225
symmetric positive semidefinite matrix, 210
    determinantal conditions for, 228
symmetric quasidefinite matrix, 229
synthetic division, 107
tablemaker's dilemma, 5
test
    for accuracy of floating point arithmetic, 54–56
    for guard digit, 56
test matrices, 513–528
    Harwell–Boeing collection, 527
Test Matrix Toolbox, 591–594
3M method, 450, 461, 505
    error analysis, 458–459
Toeplitz matrix
    pseudospectra, 527, 528
    tridiagonal, 524–525
totally nonnegative matrix, 176, 523
    LU factorization, 176, 196
    row scaling in, 192
    test for, 196
transformations, well conditioned, 30
transputer (Inmos), proof of arithmetic's correctness, 59
triangular matrix
    bounds for inverse, 159–161
    condition numbers, 155
    inversion, 265–270
    inversion methods
        blocked, 267–270
        unblocked, 265–267
    M-matrix, 157, 159, 160
triangular systems, 151–168
    accurate solution of, 151q, 155, 156, 159
    conditioning, 156–157
    fan-in algorithm, 162–164
    partitioned inverse method, 165
    substitution
        backward error analysis, 152–154
        forward error analysis, 155–159
tridiagonal matrix, 183–186
    condition number estimation, 301–303
    growth factor, 183
    LU factorization, 184
        error analysis of, 184–186
    structure of inverse, 303, 305
    Toeplitz, 524–525
truncation error, 6
Turing Award of the ACM, xxvii, 59
Turing programming language, 506
    Numerical Turing, 506
2 × 2 problems, reliable solution of, 500–501
ulp (unit in last place), 43
uncertainty, in data, 5
underdetermined system, 415–423
    backward error, 419, 423
    backward stability, definition, 419
    modified Gram–Schmidt, 421–422
    perturbation theory, 417–419
    Q method (QR factorization), 416
        error analysis, 419–422
    seminormal equations, 417
        error analysis, 421
underflow, 18, 42
    avoiding, 502–504
    effects on software, 504
    model for error analysis, 61
UNICOS library (Cray), 448, 450
unit roundoff, 3, 42
update formula, involving small correction, 30
van der Sluis's theorem, 137
Vancouver Stock Exchange, inaccurate index, 57–58
Vandermonde matrix
    bounds and estimates for condition number, 428
    definition, 426
    inverse, 426–428
    inversion algorithm, 427
    LU factorization in factored form, 432–433
    QR factorization, 441
    structured condition number, 440–441
Vandermonde system, 425–443
    accuracy independent of condition number, 436
    algorithm
        for dual, 431–432
        for primal, 433–434
        for residual of confluent system, 439
    complexity results, 441
    curing instability, 438–440
    forward error analysis, 435–436
    history of solution methods, 441
    preventing instability, 438
Vandermonde-like matrix
    confluent, definition, 429
    definition, 429
    determinant, 442
variance
    algorithms for computing, 12–13, 33
    condition numbers for, 37
    error bound for two-pass formula, 38
vec operator, 310
vec-permutation matrix, 319, 323
Venus probe, loss due to program bug, 491q
Wedin's least squares perturbation theorem, 392–393
    proof, 407–409
Winograd's method, 446, 461
    error analysis, 451–452
    scaling for stability, 452
wobbling precision, 43, 51
WY representation of product of Householder matrices, 370–371
0°, definition, 64
What is the most accurate way to sum floating point numbers? What
are the advantages of IEEE arithmetic? How accurate is Gaussian
elimination, and what were the key breakthroughs in the development
of error analysis for the method? The answers to these and many
related questions are included in Accuracy and Stability of Numerical
Algorithms.
This book gives a thorough treatment of the behavior of numerical
algorithms in finite precision arithmetic. It combines algorithmic
derivations, perturbation theory, and rounding error analysis. Software
practicalities are emphasized throughout, with particular reference to
LAPACK and MATLAB. The best available error bounds, some of
them new, are presented in a unified format with a minimum of jargon,
and perturbation theory is treated in detail.
Historical perspective and insight are given, with particular reference
to the fundamental work of Wilkinson and Turing. The many
quotations provide further information in an accessible format.
The book is unique in that algorithmic developments and
motivations are given succinctly and implementation details are
minimized so that readers can concentrate on accuracy and stability
results. Not since Wilkinson’s Rounding Errors in Algebraic
Processes (1963) and The Algebraic Eigenvalue Problem (1965) has
any volume treated this subject in such depth. A number of topics are
treated that are not usually covered in numerical analysis textbooks,
including floating point summation, block LU factorization, condition
number estimation, the Sylvester equation, powers of matrices, finite
precision behavior of stationary iterative methods, Vandermonde
systems, and fast matrix multiplication.
Nicholas J. Higham is Professor of Applied Mathematics at the
University of Manchester, England. He is the author of more than 40
publications and is a member of the editorial boards of the SIAM
Journal on Matrix Analysis and Applications and the IMA Journal of
Numerical Analysis. His book Handbook of Writing for the
Mathematical Sciences was published by SIAM in 1993.
____________________________________________________________

For more information about SIAM books, journals, conferences,
memberships, or activities, contact:

SIAM 
Society for Industrial and Applied Mathematics
3600 University City Science Center
Philadelphia, PA 19104-2688 USA
Telephone: 215-382-9800 / Fax: 215-386-7999
[email protected]
http://www.siam.org

ISBN 0-89871-355-2
