Analysis of Numerical Methods
Ebook · 896 pages · 7 hours · Dover Books on Mathematics


About this ebook

In this age of omnipresent digital computers and their capacity for implementing numerical methods, no applied mathematician, physical scientist, or engineer can be considered properly trained without some understanding of those methods. This text, suitable for advanced undergraduate and graduate-level courses, supplies the required knowledge — not just by listing and describing methods, but by analyzing them carefully and stressing techniques for developing new methods.
Based on each author's more than 40 years of experience in teaching university courses, this book offers lucid, carefully presented coverage of norms, numerical solution of linear systems and matrix factoring, iterative solutions of nonlinear equations, eigenvalues and eigenvectors, polynomial approximation, numerical solution of differential equations, and more. No mathematical preparation beyond advanced calculus and elementary linear algebra (or matrix theory) is assumed. Examples and problems are given that extend or amplify the analysis in many cases.

Language: English
Publisher: Dover Publications
Release date: Apr 26, 2012
ISBN: 9780486137988

    Book preview

    Analysis of Numerical Methods - Eugene Isaacson

    1

    Norms, Arithmetic, and Well-Posed Computations

    0.INTRODUCTION

    In this chapter, we treat three topics that are generally useful for the analysis of the various numerical methods studied throughout the book. In Section 1, we give the elements of the theory of norms of finite dimensional vectors and matrices. This subject properly belongs to the field of linear algebra. In later chapters, we may occasionally employ the notion of the norm of a function. This is a straightforward extension of the notion of a vector norm to the infinite-dimensional case. On the other hand, we shall not introduce the corresponding natural generalization, i.e., the notion of the norm of a linear transformation that acts on a space of functions. Such ideas are dealt with in functional analysis, and might profitably be used in a more sophisticated study of numerical methods.

    We study briefly, in Section 2, the practical problem of the effect of rounding errors on the basic operations of arithmetic. Except for calculations involving only exact integer arithmetic, rounding errors are invariably present in any computation. A most important feature of the later analysis of numerical methods is the incorporation of a treatment of the effects of such rounding errors.

    Finally, in Section 3, we describe the computational problems that are reasonable in some general sense. In effect, a numerical method which produces a solution insensitive to small changes in data or to rounding errors is said to yield a well-posed computation. How to determine the sensitivity of a numerical procedure is dealt with in special cases throughout the book. We indicate heuristically that any convergent algorithm is a well-posed computation.

    1.NORMS OF VECTORS AND MATRICES

    We assume that the reader is familiar with the basic theory of linear algebra, not necessarily in its abstract setting, but at least with specific reference to finite-dimensional linear vector spaces over the field of complex scalars. By basic theory we of course include: the theory of linear systems of equations, some elementary theory of determinants, and the theory of matrices or linear transformations to about the Jordan normal form. We hardly employ the Jordan form in the present study. In fact a much weaker result can frequently be used in its place (when the divisor theory or invariant subspaces are not actually involved). This result is all too frequently skipped in basic linear algebra courses, so we present it as

    THEOREM 1. For any square matrix A of order n there exists a non-singular matrix P, of order n, such that

    B = P−1AP

    is upper triangular and has the eigenvalues of A, say λj ≡ λj(A), j = 1, 2, …, n, on the principal diagonal (i.e., any square matrix is equivalent to a triangular matrix).

    Proof.We sketch the proof of this result. The reader should have no difficulty in completing the proof in detail.

    Let λ1 be an eigenvalue of A with corresponding eigenvector u1.† Then pick a basis for the n-dimensional complex vector space, Cn, with u1 as the first such vector. Let the independent basis vectors be the columns of a non-singular matrix P1, which then determines the transformation to the new basis. In this new basis the transformation determined by A is given by B1 ≡ P1−1AP1 and since Au1 = λ1u1,

    where A2 is some matrix of order n − 1.

    The characteristic polynomial of B1 is clearly

    det (λIn − B1) = (λ − λ1) det (λIn−1 − A2),

    where In is the identity matrix of order n. Now pick some eigenvalue λ2 of A2 and corresponding (n − 1)-dimensional eigenvector, v2; i.e.,

    A2v2 = λ2v2.

    With this vector we define the independent n-dimensional vectors

    Note that with the scalar α = α1υ1,2 + α2υ2,2 + … + αn−1υn−1,2

    B1û1 = λ1û1,  B1û2 = λ2û2 + αû1,

    and thus if we set u1 = P1û1,  u2 = P1û2, then

    Au1 = λ1u1,  Au2 = λ2u2 + αu1.

    Now we introduce a new basis of Cn with the first two vectors being u1 and u2. The non-singular matrix P2 which determines this change of basis has u1 and u2 as its first two columns; and the original linear transformation in the new basis has the representation

    where A3 is some matrix of order n − 2.

    The theorem clearly follows by the above procedure; a formal inductive proof could be given.
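
    As a quick numerical illustration of Theorem 1 (a sketch in Python with SciPy, not from the text; the matrix is hypothetical), a unitary triangularizing matrix can be computed with the Schur factorization: the diagonal of the triangular factor then carries the eigenvalues of A.

        import numpy as np
        from scipy.linalg import schur

        A = np.array([[4.0, 1.0, 2.0],
                      [0.5, 3.0, 1.0],
                      [1.0, 0.0, 2.0]])      # hypothetical test matrix

        # Complex Schur form: A = P T P*, with T upper triangular and P unitary,
        # so P plays the role of the matrix P in Theorem 1.
        T, P = schur(A, output='complex')

        print(np.allclose(P @ T @ P.conj().T, A))   # A is recovered from P and T
        print(np.diag(T))                           # eigenvalues of A on the diagonal of T
        print(np.linalg.eigvals(A))                 # same values, possibly in another order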

    It is easy to prove the related stronger result of Schur stated in Theorem 2.4 of Chapter 4 (see Problem 2.13(b) of Chapter 4). We turn now to the basic content of this section, which is concerned with the generalization of the concept of distance in n-dimensional linear vector spaces.

    The distance between a vector and the null vector, i.e., the origin, is a measure of the size or length of the vector. This generalized notion of distance or size is called a norm. In particular, all such generalizations are required to have the following properties:

    (0)To each vector x in the linear space, , say, a unique real number is assigned; this number, denoted by ||x|| or N(x), is called the norm of x iff:

    (i)||x|| ≥ 0 for all x ∈ and ||x|| = 0 iff x = o; where o denotes the zero vector (if ≡ Cn, then oi = 0);

    (ii)||αx|| = ||α|| · ||x|| for all scalars α and all x ∈ ;

    (iii)||x + y|| ≤ ||x|| + ||y||, the triangle inequality,† for all x, y ∈ .

    Some examples of norms in the complex n-dimensional space Cn are

    It is an easy exercise for the reader to justify the use of the notation in (1d) by verifying that

    The norm, ||·||2, is frequently called the Euclidean norm as it is just the formula for distance in ordinary three-dimensional Euclidean space extended to dimension n. The norm, ||·||∞, is called the maximum norm or occasionally the uniform norm. In general, ||·||p, for p ≥ 1 is termed the p-norm.
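
    As a concrete check (a sketch in Python with NumPy; the two vectors are hypothetical), the norms (1) for p = 1, 2, ∞ and the triangle inequality (iii) can be evaluated directly:

        import numpy as np

        x = np.array([3.0 + 4.0j, -1.0, 2.0j])   # hypothetical vectors in C^3
        y = np.array([1.0, 1.0 - 1.0j, -2.0])

        for p in (1, 2, np.inf):
            nx, ny, nxy = (np.linalg.norm(v, p) for v in (x, y, x + y))
            # triangle inequality (iii): ||x + y|| <= ||x|| + ||y||
            print(p, nxy, nx + ny, nxy <= nx + ny)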

    To verify that (1) actually defines norms, we observe that conditions (0), (i), and (ii) are trivially satisfied. Only the triangle inequality, (iii), offers any difficulty. However,

    and

    so (1a) and (1d) define norms.

    The proof of (iii) for (1b), the Euclidean norm, is based on the well known Cauchy-Schwarz inequality which states that

    To prove this basic result, let |x| and |y| be the n-dimensional vectors with components |xj| and |yj|, j = 1, 2, …, n, respectively. Then for any real scalar, ξ,

    But since the real quadratic polynomial in ξ above does not change sign, its discriminant must be non-positive; i.e.,

    However, we note that

    and (2) follows from the above pair of inequalities.

    Now we form

    An application of the Cauchy-Schwarz inequality yields finally

    N2(x + y) ≤ N2(x) + N2(y)

    and so the Euclidean norm also satisfies the triangle inequality.

    The statement that

    is known as Minkowski’s inequality. We do not derive it here as general p-norms will not be employed further. (A proof of (3) can be found in most advanced calculus texts.)

    We can show quite generally that all vector norms are continuous functions in Cn. That is,

    LEMMA 1. Every vector norm, N(x), is a continuous function of x1, x2, …, xn, the components of x.

    Proof.For any vectors x and δ we have by (iii)

    N(x + δ) ≤ N(x) + N(δ),

    so that

    N(x + δ) − N(x) ≤ N(δ).

    On the other hand, by (ii) and (iii),

    so that

    N(δ) ≤ N(x + δ) − N(x).

    Thus, in general

    |N(x + δ) − N(x)| ≤ N(δ).

    With the unit vectors † {ek}, any δ has the representation

    Using (ii) and (iii) repeatedly implies

    where

    Using this result in the previous inequality yields, for any ε > 0 and all δ with N∞(δ) ≤ ε/M,

    |N(x + δ) − N(x)| ≤ ε.

    This is essentially the definition of continuity for a function of the n variables x1, x2, …, xn.

    See Problem 6 for a mild generalization.

    Now we can show that all vector norms are equivalent in the sense of

    THEOREM 2. For each pair of vector norms, say N(x) and N′(x), there exist positive constants m and M such that for all x ∈ Cn:

    mN′(x) ≤ N(x) ≤ MN′(x).

    Proof.The proof need only be given when one of the norms is N∞, since N and N′ are equivalent if they are each equivalent to N∞. Let S ⊂ Cn be defined by

    S ≡ {x | N∞(x) = 1, x ∈ Cn}

    (this is frequently called the surface of the unit ball in Cn). S is a closed bounded set of points. Then since N(x) is a continuous function (see Lemma 1), we conclude by a theorem of Weierstrass that N(x) attains its minimum and its maximum on S at some points of S. That is, for some x⁰ ∈ S and x¹ ∈ S

    or

    0 < N(x⁰) ≤ N(x) ≤ N(x¹) < ∞

    for all x ∈ S.

    For any y ≠ o we see that y/N∞(y) is in S and so

    or

    N(x⁰)N∞(y) ≤ N(y) ≤ N(x¹)N∞(y).

    The last two inequalities yield

    mN∞(y) ≤ N(y) ≤ MN∞(y),

    where m ≡ N(x⁰) and M ≡ N(x¹).
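
    For the particular pair N1, N∞ the constants can be written down at once: N∞(x) ≤ N1(x) ≤ nN∞(x) for every x ∈ Cn, so m = 1 and M = n will do. The following sketch (Python with NumPy; the sample size and dimension are arbitrary) spot-checks these bounds on random complex vectors:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 5
        for _ in range(1000):
            x = rng.normal(size=n) + 1j * rng.normal(size=n)
            n1, ninf = np.linalg.norm(x, 1), np.linalg.norm(x, np.inf)
            # equivalence of N_1 and N_inf with m = 1 and M = n
            assert ninf <= n1 <= n * ninf + 1e-12
        print("bounds held on all samples")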

    A matrix of order n could be treated as a vector in a space of dimension n² (with some fixed convention as to the manner of listing its elements). Then matrix norms satisfying the conditions (0)–(iii) could be defined as in (1). However, since the product of two matrices of order n is also such a matrix, we impose an additional condition on matrix norms, namely that

    (iv)||AB|| ≤ ||A|| · ||B||.

    With this requirement the vector norms (1) do not all become matrix norms (see Problem 2). However, there is a more natural, geometric, way in which the norm of a matrix can be defined. Thus, if x ∈ Cn and ||·|| is some vector norm on Cn, then ||x|| is the length of x, ||Ax|| is the length of Ax, and we define a norm of A, written as ||A|| or N(A), by the maximum relative stretching,

    Note that we use the same notation, ||·||, to denote vector and matrix norms; the context will always clarify which is implied. We call (5) a natural norm or the matrix norm induced by the vector norm, ||·||. This is also known as the operator norm in functional analysis. Since for any x ≠ o we can define u = x/||x|| so that ||u|| = 1, the definition (5) is equivalent to

    That is, by Problems 6 and 7, ||Au|| is a continuous function of u and hence the maximum is attained for some y, with ||y|| = 1.

    Before verifying the fact that (5) or (6) defines a matrix norm, we note that they imply, for any vector x, that

    There are many other ways in which matrix norms may be defined. But if (7) holds for some such norm then it is said to be compatible with the vector norm employed in (7). The natural norm (5) is essentially the smallest matrix norm compatible with a given vector norm.
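
    Definition (5) can be probed directly by maximizing ||Ax||/||x|| over a large sample of vectors; any such sample estimate must lie below the induced norm itself. A sketch (Python with NumPy; the test matrix and sample size are arbitrary), using the Euclidean vector norm:

        import numpy as np

        rng = np.random.default_rng(1)
        A = rng.normal(size=(4, 4))                 # hypothetical test matrix

        xs = rng.normal(size=(4, 100000))           # many random nonzero vectors
        stretch = np.linalg.norm(A @ xs, 2, axis=0) / np.linalg.norm(xs, 2, axis=0)

        est = stretch.max()                         # sample maximum of ||Ax|| / ||x||
        exact = np.linalg.norm(A, 2)                # induced (natural) 2-norm of A
        print(est, exact, est <= exact + 1e-12)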

    To see that (5) yields a norm, we first note that conditions (i) and (ii) are trivially verified. For checking the triangle inequality, let y be such that ||y|| = 1 and from (6),

    ||(A + B)|| = ||(A + B)y||.

    But then, upon recalling (7),

    Finally, to verify (iv), let y with ||y|| = 1 now be such that

    ||(AB)|| = ||(AB)y||.

    Again by (7), we have

    so that (5) and (6) do define a matrix norm.

    We shall now determine the natural matrix norms induced by some of the vector p-norms (p = 1, 2, ∞) defined in (1). Let the nth order matrix A have elements ajk, j, k = 1, 2, …, n.

    (A) The matrix norm induced by the maximum norm (1d) is

    i.e., the maximum absolute row sum. To prove (8), let y be such that ||y||∞ = 1 and

    ||A||∞ = ||Ay||∞.

    Then,

    so the right-hand side of (8) is an upper bound of ||A||∞. Now if the maximum row sum occurs for, say, j = J then let x have the components

    Clearly ||x||∞ = 1, if A is non-trivial, and

    so (8) holds. [If A ≡ O, property (ii) implies ||A|| = 0 for any natural norm.]

    (B) Next, we claim that

    i.e., the maximum absolute column sum. Now let y, with ||y||1 = 1, be such that

    ||A||1 = ||Ay||1.

    Then,

    and the right-hand side of (9) is an upper bound of ||A||1. If the maximum is attained for m = K, then this bound is actually attained for x = eK, the Kth unit vector, since ||eK||1 = 1 and

    Thus (9) is established.

    (C) Finally, we consider the Euclidean norm, for which case we recall the notation for the Hermitian transpose or conjugate transpose of any rectangular matrix A ≡ (aij),

    A* ≡ ĀT,

    i.e., if A* ≡ (bij), then bij = āij. Further, the spectral radius of any square matrix A is defined by

    where λs(A) denotes the sth eigenvalue of A. Now we can state that

    To prove (11), we again pick y such that ||y||2 = 1 and

    ||A||2 = ||Ay||2.

    From (1b) it is clear that ||x||2² = x*x, since x* ≡ (x̄1, x̄2, …, x̄n). Therefore, from the identity (Ay)* = y*A*, we find

    But since A*A is Hermitian it has a complete set of n orthonormal eigenvectors, say u1, u2, …, un, such that

    The multiplication of (13b) by us* on the left yields further

    λs = us*A*Aus ≥ 0.

    Every vector has a unique expansion in the basis {us}. Say in particular that

    and then (12) becomes, upon recalling (13),

    Thus ρ¹/²(A*A) is an upper bound of ||A||2. However, using y = us, where λs = ρ(A*A), we get

    and so (11) follows.
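
    The three formulas (8), (9), and (11) are easy to confirm numerically; NumPy's matrix norms with ord = ∞, 1, 2 compute exactly these induced quantities. A sketch with an arbitrary test matrix:

        import numpy as np

        A = np.array([[1.0, -2.0,  3.0],
                      [0.0,  4.0, -1.0],
                      [2.0,  1.0,  0.5]])             # hypothetical test matrix

        row_sum = np.max(np.sum(np.abs(A), axis=1))   # (8): maximum absolute row sum
        col_sum = np.max(np.sum(np.abs(A), axis=0))   # (9): maximum absolute column sum
        spec    = np.sqrt(np.max(np.linalg.eigvalsh(A.conj().T @ A)))   # (11): sqrt of rho(A*A)

        print(np.isclose(row_sum, np.linalg.norm(A, np.inf)))
        print(np.isclose(col_sum, np.linalg.norm(A, 1)))
        print(np.isclose(spec,    np.linalg.norm(A, 2)))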

    We have observed that a matrix of order n can be considered as a vector of dimension n². But since every matrix norm satisfies the conditions (0)–(iii) of a vector norm the results of Lemma 1 and Theorem 2 also apply to matrix norms. Thus we have

    LEMMA 1′. Every matrix norm, ||A||, is a continuous function of the n² elements aij of A.

    THEOREM 2′. For each pair of matrix norms, say ||A|| and ||A||′ there exist positive constants m and M such that for all nth order matrices A

    m||A||′ ≤ ||A|| ≤ M||A||′.

    The proofs of these results follow exactly the corresponding proofs for vector norms so we leave their detailed exposition to the reader.

    There is frequently confusion between the spectral radius (10) of a matrix and the Euclidean norm (11) of a matrix. (To add to this confusion, ||A||2 is sometimes called the spectral norm of A.) It should be observed that if A is Hermitian, i.e., A* = A, then λs(A*A) = λs²(A) and so the spectral radius is equal to the Euclidean norm for Hermitian matrices. However, in general this is not true, but we have

    LEMMA 2. For any natural norm, ||·||, and square matrix, A,

    ρ(A) ≤ ||A||.

    Proof.For each eigenvalue λs(A) there is a corresponding eigenvector, say us, which can be chosen to be normalized for any particular vector norm, ||us|| = 1. But then for the corresponding natural matrix norm

    As this holds for all s = 1, 2, …, n, the result follows.
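
    A quick check of Lemma 2 (a sketch in Python with NumPy; the matrix is hypothetical): the spectral radius should not exceed any of the induced norms computed above.

        import numpy as np

        A = np.array([[ 0.0, 2.0, -1.0],
                      [ 1.0, 0.5,  3.0],
                      [-2.0, 1.0,  1.0]])             # hypothetical test matrix

        rho = np.max(np.abs(np.linalg.eigvals(A)))    # spectral radius rho(A)
        for p in (1, 2, np.inf):
            # rho(A) <= ||A|| for every natural norm (Lemma 2)
            print(p, rho <= np.linalg.norm(A, p) + 1e-12)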

    On the other hand, for each matrix some natural norm is arbitrarily close to the spectral radius. More precisely we have

    THEOREM 3. For each nth order matrix A and each arbitrary ε > 0 a natural norm, ||A||, can be found such that

    ρ(A) ≤ ||A|| ≤ ρ(A) + ε.

    Proof.The left-hand inequality has been verified above. We shall show how to construct a norm satisfying the right-hand inequality. By Theorem 1 we can find a non-singular matrix P such that

    PAP−1 ≡ B ≡ Λ + U

    where Λ is the diagonal matrix with the eigenvalues λj(A) on its diagonal and U ≡ (uij) has zeros on and below the diagonal. With δ > 0 a sufficiently small positive number, we form the diagonal matrix of order n

    Now consider

    C = DBD−1 = Λ + E,

    where E ≡ (eij) = DUD−1 has elements

    Note that the elements eij can be made arbitrarily small in magnitude by choosing δ appropriately. Also we have that

    A = P−1D−1CDP.

    Since DP is non-singular, a vector norm can be defined by

    ||x|| ≡ N2(DPx) = (x*P*D*DPx)¹/².

    The proof of this fact is left to the reader in Problem 5. The natural matrix norm induced by this vector norm is of course

    However, from the above form for A, we have, for any y,

    ||Ay|| = N2(DPAy) = N2(CDPy).

    If we let z ≡ DPy, this becomes

    ||Ay|| = N2(Cz) = (z*C*Cz)¹/².

    Now observe that

    Here the term 𝒪(δ) represents an nth order matrix each of whose terms is 𝒪(δ).† Thus, we can conclude that

    since

    |z*𝒪(δ)z| ≤ n²z*z 𝒪(δ) = z*z 𝒪(δ).

    Recalling ||y|| = N2(z), we find from ||y|| = 1 that z*z = 1. Then it follows that

    For δ sufficiently small, 𝒪(δ) < ε.

    It should be observed that the natural norm employed in Theorem 3 depends upon the matrix A as well as the arbitrarily small parameter ε. However, this result leads to an interesting characterization of the spectral radius of any matrix; namely,

    COROLLARY. For any square matrix A

    where the inf is taken over all vector norms, N(·); or equivalently

    where the inf is taken over all natural norms, ||·||.

    Proof.By using Lemma 2 and Theorem 3, since ε > 0 is arbitrary and the natural norm there depends upon ε, the result follows from the definition of inf.

    1.1.Convergent Matrices

    To study the convergence of various iteration procedures as well as for many other purposes, we investigate matrices A for which

    where O denotes the zero matrix all of whose entries are 0. Any square matrix satisfying condition (14) is said to be convergent. Equivalent conditions are contained in

    THEOREM 4. The following three statements are equivalent:

    (a)A is convergent;

    (b) ||Am|| → 0 as m → ∞, for some matrix norm;

    (c)ρ(A) < 1.

    Proof.We first show that (a) and (b) are equivalent. Since ||·|| is continuous, by Lemma 1′, and ||O|| = 0, then (a) implies (b). But if (b) holds for some norm, then Theorem 2′ implies there exists an M such that

    ||Am||∞ ≤ M||Am|| → 0.

    Hence, (a) holds.

    Next we show that (b) and (c) are equivalent. Note that by Theorem 2′ there is no loss in generality if we assume the norm to be a natural norm. But then, by Lemma 2 and the fact that λ(Am) = λm(A), we have

    ||Am|| ≥ ρ(Am) = ρm(A),

    so that (b) implies (c). On the other hand, if (c) holds, then by Theorem 3 we can find an ε > 0 and a natural norm, say N(·), such that

    N(A) ≤ ρ(A) + ε ≡ θ < 1.

    Now use the property (iv) of matrix norms to get

    N(Am) ≤ [N(A)]m ≤ θm

    so that N(Am) → 0 as m → ∞ and hence (b) holds.

    A test for convergent matrices which is frequently easy to apply is the content of the

    COROLLARY. A is convergent if for some matrix norm

    ||A|| < 1.

    Proof.Again by (iv) we have

    ||Am|| ≤ ||A||m

    so that condition (b) of Theorem 4 holds.
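
    A small numerical experiment (a sketch in Python with NumPy; the matrix is hypothetical) shows the behavior described by Theorem 4 and its corollary: here ||A||∞ = 0.9 < 1, so A is convergent, and the powers Am die out at the rate suggested by ρ(A).

        import numpy as np

        A = np.array([[0.5, 0.4],
                      [0.1, 0.3]])                    # hypothetical matrix

        print(np.linalg.norm(A, np.inf))              # 0.9 < 1: the corollary applies
        print(np.max(np.abs(np.linalg.eigvals(A))))   # spectral radius, about 0.62

        for m in (5, 20, 50):
            Am = np.linalg.matrix_power(A, m)
            print(m, np.linalg.norm(Am, np.inf))      # ||A^m|| tends to zero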

    Another important characterization and property of convergent matrices is contained in

    THEOREM 5. (a) The geometric series

    I + A + A² + A³ + …,

    converges iff A is convergent.

    (b) If A is convergent, then I − A is non-singular and

    (I − A)−1 = I + A + A² + A³ + ···.

    Proof.A necessary condition for the series in part (a) to converge is that Am → O as m → ∞, i.e., that A be convergent. The sufficiency will follow from part (b).

    Let A be convergent, whence by Theorem 4 we know that ρ(A) < 1. Since the eigenvalues of I − A are 1 − λ(A), it follows that det (I − A) ≠ 0 and hence this matrix is non-singular. Now consider the identity

    (I − A)(I + A + A² + ··· + Am) = I − Aᵐ⁺¹

    which is valid for all integers m. Since A is convergent, the limit as m → ∞ of the right-hand side exists. The limit, after multiplying both sides on the left by (I − A)−1, yields

    (I + A + A² + ···) = (I − A)−1

    and part (b) follows.
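
    The series of Theorem 5(b) is easy to observe numerically (a sketch in Python with NumPy; the matrix is hypothetical, with ||A||∞ = 0.5 < 1 so that it is convergent by the corollary to Theorem 4):

        import numpy as np

        A = np.array([[0.2, -0.3],
                      [0.1,  0.4]])                   # hypothetical convergent matrix

        partial = np.zeros((2, 2))
        term = np.eye(2)
        for _ in range(80):                           # partial sums I + A + A^2 + ...
            partial += term
            term = term @ A

        inv = np.linalg.inv(np.eye(2) - A)
        print(np.allclose(partial, inv))              # the series converges to (I - A)^{-1}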

    A useful corollary to this theorem is

    COROLLARY. If in some natural norm, ||A|| < 1, then I − A is non-singular and

    1/(1 + ||A||) ≤ ||(I − A)−1|| ≤ 1/(1 − ||A||).

    Proof.By the corollary to Theorem 4 and part (b) of Theorem 5 it follows that I − A is non-singular. For a natural norm we note that ||I|| = 1 and so taking the norm of the identity

    I = (I − A)(I − A)−1

    yields

    Thus the left-hand inequality is established.

    Now write the identity as

    (I − A)−1 = I + A(I − A)−1

    and take the norm to get

    ||(I − A)−1|| ≤ 1 + ||A|| · ||(I − A)−1||.

    Since ||A|| < 1 this yields

    It should be observed that if A is convergent, so is (−A), and ||A|| = ||−A||. Thus Theorem 5 and its corollary are immediately applicable to matrices of the form I + A. That is, if in some natural norm, ||A|| < 1, then

    1/(1 + ||A||) ≤ ||(I + A)−1|| ≤ 1/(1 − ||A||).

    PROBLEMS, SECTION 1

    1. (a) Verify that (1b) defines a norm in the linear space of square matrices of order n; i.e., check properties (i)–(iv), for ||A||E² = Σi,j |aij|².

    (b) Similarly, verify that (1a) defines a matrix norm, i.e., ||A|| = Σi,j |aij|.

    2. Show by example that the maximum vector norm, η(A) = maxi,j |aij|, when applied to a matrix, does not satisfy condition (iv) that we impose on a matrix norm.

    3. Show that if A is non-singular, then B ≡ A*A is Hermitian and positive definite. That is, x*Bx > 0 if x ≠ o. Hence the eigenvalues of B are all positive.

    4. Show for any non-singular matrix A and any matrix norm that

    [Hint: ||I|| = ||II|| ≤ ||I||²; ||A−1A|| ≤ ||A−1|| · ||A||.]

    5. Show that if η(x) is a norm and A is any non-singular matrix, then N(x) defined by

    N(x) ≡ η(Ax),

    is a (vector) norm.

    6. We call η(x) a semi-norm iff η(x) satisfies all of the conditions, (0)–(iii), for a norm with condition (i) replaced by the weaker condition

    (i′): η(x) ≥ 0 for all x ∈ .

    We say that η(x) is non-trivial iff η(x) > 0 for some x ∈ . Prove the following generalization of Lemma 1:

    LEMMA 1″. Every non-trivial semi-norm, η(x), is a continuous function of x1, x2, …, xn, the components of x. Hence every semi-norm is continuous.

    7. Show that if η(x) is a semi-norm and A any square matrix, then N(x) ≡ η(Ax) defines a semi-norm.

    2.FLOATING-POINT ARITHMETIC AND ROUNDING ERRORS

    In the following chapters we will have to refer, on occasion, to the errors due to rounding in the basic arithmetic operations. Such errors are inherent in all computations in which only a fixed number of digits are retained. This is, of course, the case with all modern digital computers and we consider here an example of one way in which many of them do or can do arithmetic; so-called floating-point arithmetic. Although most electronic computers operate with numbers in some kind of binary representation, most humans still think in terms of a decimal representation and so we shall employ the latter here.

    Suppose the number a ≠ 0 has the exact decimal representation

    where q is an integer and the d1, d2, …, are digits with d1 ≠ 0. Then the "t-digit floating-decimal representation of a, or for brevity the floating a" used in the machine, is of the form

    where δ1 ≠ 0 and δ1, δ2, …, δt are digits. The number (.δ1δ2 … δt) is called the mantissa and q is called the exponent of fl(a). There is usually a restriction on the exponent, of the form

    for some large positive integers N, M. If a number a ≠ 0 has an exponent outside of this range it cannot be represented in the form (2), (3). If, during the course of a calculation, some computed quantity has an exponent q > M (called overflow) or q < − N (called underflow), meaningless results usually follow. However, special precautions can be taken on most computers to at least detect the occurrence of such over- or underflows. We do not consider these practical difficulties further; rather, we shall assume that they do not occur or are somehow taken into account.

    There are two popular ways in which the floating digits δj are obtained from the exact digits, dj. The obvious chopping representation takes

    Thus the exact mantissa is chopped off after the tth decimal digit to get the floating mantissa. The other and preferable procedure is to round, in which case

    and the brackets on the right-hand side indicate the integral part. The error in either of these procedures can be bounded as in

    LEMMA 1. The error in t-digit floating-decimal representation of a number a ≠ 0 is bounded by

    Proof.From (1), (2), and (4) we have

    But since 1 ≤ d1 ≤ 9 and 0.dt+1dt+2··· ≤ 1, this implies

    |a − fl(a)| ≤ 10¹⁻ᵗ|a|,

    which is the bound for the chopped representation. For the case of rounding we have, similarly,

    We shall assume that our idealized computer performs each basic arithmetic operation correctly to 2t digits and then either rounds or chops the result to a t-digit floating number. With such operations it clearly follows from Lemma 1 that

    In many calculations, particularly those concerned with linear systems, the accumulation of products is required (e.g., the inner product of two vectors). We assume that rounding (or chopping) is done after each multiplication and after each successive addition. That is,

    and in general

    The result of such computations can be represented as an exact inner product with, say, the ai slightly altered. We state this as

    LEMMA 2. Let the floating-point inner product (7) be computed with rounding. Then if n and t satisfy

    it follows that

    where

    Proof.By (6b) we can write

    fl(akbk) = akbk(1 + φk10⁻ᵗ),  |φk| ≤ 5,

    since rounding is assumed. Similarly from (6a) and (7b) with n = k we have

    where

    θ1 = 0;|θk| ≤ 5,k = 2, 3, ….

    Now a simple recursive application of the above yields

    where we have introduced Ek by

    A formal verification of this result is easily obtained by induction.

    Since θ1 = 0, it follows that

    (1 − 5·10⁻ᵗ)ⁿ⁻ᵏ⁺² ≤ 1 + Ek ≤ (1 + 5·10⁻ᵗ)ⁿ⁻ᵏ⁺²,  k = 2, 3, …, n,

    and

    (1 − 5·10⁻ᵗ)ⁿ ≤ 1 + E1 ≤ (1 + 5·10⁻ᵗ)ⁿ.

    Hence, with = 5·10−t,

    But, for p n,(8) implies that p ≤ , so that

    Therefore,

    |Ek| ≤ (n − k + 2)10¹⁻ᵗ,  k = 2, 3, …, n.

    Clearly for k = 1 we find, as above with k = 2, that

    |E1| ≤ n·10¹⁻ᵗ.

    The result now follows upon setting

    δak = akEk.

    (Note that we could just as well have set δbk = bkEk.)

    Obviously a similar result can be obtained for the error due to chopping if condition (8) is strengthened slightly; see Problem 1.
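
    Lemma 2 can be observed experimentally by forcing every product and every partial sum through a t-digit decimal rounding, as in (7a, b). A sketch in Python (the rounding helper fl and the data vectors are hypothetical, chosen only for illustration):

        import math

        def fl(a, t=6):                                   # t-digit decimal rounding of a
            if a == 0.0:
                return 0.0
            q = math.floor(math.log10(abs(a))) + 1        # exponent: |a| = 10**q * (0.d1 d2 ...)
            kept = math.floor(abs(a) / 10 ** (q - t) + 0.5)
            return math.copysign(kept * 10 ** (q - t), a)

        a = [1.0 / (k + 1) for k in range(50)]            # hypothetical data
        b = [(-1.0) ** k for k in range(50)]

        s = 0.0
        for ak, bk in zip(a, b):                          # round after each * and each +
            s = fl(s + fl(ak * bk))

        exact = sum(ak * bk for ak, bk in zip(a, b))
        print(abs(s - exact))                             # small, as Lemma 2 predicts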

    PROBLEMS, SECTION 2

    1. Determine the result analogous to Lemma 2, when chopping replaces rounding in the statement.

    [Hint: The factor 10¹⁻ᵗ need only be replaced by 2·10¹⁻ᵗ, throughout.]

    2. (a) Find a representation for .

    (b) If c1 > c2 > … > cn > 0, in what order should be calculated to minimize the effect of rounding?

    3. What are the analogues of equations (6a, b, c) in the binary representation:

    fl(a) = ±2q(.δ1δ2 ··· δt)

    where δ1 = 1 and δj = 0 or 1?

    3.WELL-POSED COMPUTATIONS

    Hadamard introduced the notion of well-posed or properly posed problems in the theory of partial differential equations (see Section 0 of Chapter 9). However, it seems that a related concept is quite useful in discussing computational problems of almost all kinds. We refer to this as the notion of a well-posed computing problem.

    First, we must clarify what is meant by a computing problem in general. Here we shall take it to mean an algorithm or equivalently: a set of rules specifying the order and kind of arithmetic operations (i.e., rounding rules) to be used on specified data. Such a computing problem may have as its object, for example, the determination of the roots of a quadratic equation or of an approximation to the solution of a nonlinear partial differential equation. How any such rules are determined for a particular purpose need not concern us at present (this is, in fact, what much of the rest of this book is about).

    Suppose the specified data for some particular computing problem are the quantities a1, a2, …, am, which we denote as the m-dimensional vector a. Then if the quantities to be computed are x1, x2, …, xn, we can write

    where of course the n-dimensional function f(·) is determined by the rules.

    Now we will define a computing problem to be well-posed iff the algorithm meets three requirements. The first requirement is that a solution, x, should exist for the given data, a. This is implied by the notation (1). However, if we recall that (1) represents the evaluation of some algorithm it would seem that a solution (i.e., a result of using the algorithm) must always exist. But this is not true, a trivial example being given by data that lead to a division by zero in the algorithm. (The algorithm in this case is not properly specified since it should have provided for such a possibility. If it did not, then the corresponding computing problem is not well-posed for data that lead to this difficulty.) There are other, more subtle situations that result in algorithms which cannot be evaluated and it is by no means easy, a priori, to determine that x is indeed defined by (1).

    The second requirement is that the computation be unique. That is, when performed several times (with the same data) identical results are obtained. This is quite invariably true of algorithms which can be evaluated. If in actual practice it seems to be violated, the trouble usually lies with faulty calculations (i.e., machine errors). The functions f(a) must be single valued to insure uniqueness.

    The third requirement is that the result of the computation should depend Lipschitz continuously on the data with a constant that is not too large. That is, “small” changes in the data, a, should result in only “small” changes in the computed x. For example, let the computation represented by (1) satisfy the first two requirements for all data a in some set, say a ∈ D. If we change the data a by a small amount δa so that (a + δa) ∈ D, then we can write the result of the computation with the altered data as

    Now if there exists a constant M such that for any δa,

    we say that the computation depends Lipschitz continuously on the data. Finally, we say (1) is well-posed iff the three requirements are satisfied and (3) holds with a not too large constant, M = M(a, η), for some not too small η > 0 and all δa such that ||δa|| ≤ η. Since the Lipschitz constant M depends on (a, η) we see that a computing problem or algorithm may be well-posed for some data, a, but not for all data.

    Let (a) denote the original problem which the algorithm (1) was devised to solve. This problem is also said to be well-posed if it has a unique solution, say

    y = g(a),

    which depends Lipschitz continuously on the data. That is, (a) is well-posed if for all δa satisfying ||δa|| ≤ ζ, there is a constant N = N(a, ζ) such that

    We call the algorithm (1) convergent iff f depends on a parameter, say ε (e.g., ε may determine the size of the rounding errors), so that for any small ε > 0,

    for all δa such that ||δa|| ≤. Now, if (a) is well-posed and (1) is convergent, then (4) and (5) yield

    Thus, recalling (3), we are led to the heuristic

    OBSERVATION 1. If (a) is a well-posed problem, then a necessary condition that (1) be a convergent algorithm is that (1) be a well-posed computation.

    Therefore we are interested in determining whether a given algorithm (1) is a well-posed computation simply because only such an algorithm is sure to be convergent for all problems of the form (a + δa), when (a) is well-posed and ||δa|| ≤ δ.

    Similarly, by interchanging f and g in (6), we may justify

    OBSERVATION 2. If (a) is a not well-posed problem, then a necessary condition that (1) be an accurate algorithm is that (1) be a not well-posed computation.

    In fact, for certain problems of linear algebra (see SubSection 1.2 of Chapter 2), it has been possible to prove that the commonly used algorithms, (1), produce approximations, x, which are exact solutions of slightly perturbed original mathematical problems. In these algebraic cases, the accuracy of the solution x, as measured in (5), is seen to depend on the well-posedness of the original mathematical problem. In algorithms, (1), that arise from differential equation problems, other techniques are developed to estimate the accuracy of the approximation. For differential equation problems the well-posedness of the resulting algorithms (1) is referred to as the stability of the finite difference schemes (see Chapters 8 and 9).

    We now consider two elementary examples to illustrate some of the previous notions.

    The most overworked example of how a simple change in the algorithm can affect the accuracy of a single precision calculation is the case of determining the smallest root of a quadratic equation. If in

    x² + 2bx + c,

    b < 0 and c are given to t digits, but |c|/b² < 10⁻ᵗ, then the smallest root, x2, should be found from x2 = c/x1, after finding x1 = −b + √(b² − c) in single precision arithmetic. Using

    x2 = −b − √(b² − c)

    in single precision arithmetic would be disastrous!
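
    The effect is easy to reproduce (a sketch in Python, using numpy.float32 as a stand-in for single precision; the values of b and c are hypothetical, chosen so that |c|/b² falls below single-precision resolution):

        import numpy as np

        b = np.float32(-10000.0)                 # b < 0, as in the text
        c = np.float32(1.0)

        disc = np.sqrt(b * b - c)                # computed in float32
        x1 = -b + disc                           # larger root: no cancellation
        x2_bad = -b - disc                       # subtraction: catastrophic cancellation
        x2_good = c / x1                         # smaller root from x2 = c / x1

        print(x2_bad, x2_good)                   # 0.0 versus about 5e-05 (the correct value)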

    A more sophisticated well-posedness discussion, without reference to the type of arithmetic, is afforded by the problem of determining the zeros of a polynomial

    Pn(z) = zⁿ + an−1zⁿ⁻¹ + … + a1z + a0.

    If Qn(z) ≡ zⁿ + bn−1zⁿ⁻¹ + … + b1z + b0, then for small |ε| the zeros of Pn(z; ε) ≡ Pn(z) + εQn(z) are close to the zeros of Pn(z). That is, in the theory of functions of a complex variable it is shown that

    LEMMA. If z = z1 is a simple zero of Pn(z), then for |ε| sufficiently small Pn(z; ε) has a zero z1(ε), such that

    If z1 is a zero of multiplicity r of Pn(z), there are r neighboring zeros of Pn(z; ε) with

    Now it is clear that in the case of a simple zero, z1, the computing problem, to determine the zero, might be well-posed if Pn′(z1) were not too small and Qn(z1) not too large, since then |z1(ε) − z1|/|ε| would not be large for small ε. On the other hand, the determination of the multiple root would most likely lead to a not well-posed computing problem.
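
    The r-th-root behavior at a multiple zero is easy to see numerically (a sketch in Python with NumPy; the cubic and the perturbation size are hypothetical). Perturbing the constant term of (z − 1)³ by ε = 10⁻⁹ moves the triple zero by roughly ε to the power 1/3, i.e. about 10⁻³, rather than by something of order ε:

        import numpy as np

        eps = 1e-9
        # coefficients of (z - 1)^3 + eps, highest power first
        coeffs = np.array([1.0, -3.0, 3.0, -1.0 + eps])

        zeros = np.roots(coeffs)
        print(np.abs(zeros - 1.0))      # each distance is about 1e-3, i.e., eps**(1/3)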

    The latter example illustrates Observation (2), that is, a computing problem is not well-posed if the original mathematical problem is not well-posed. On the other hand, the example of the quadratic equation indicates how an ill-chosen formulation of an algorithm may be well-posed but yet inaccurate in single precision.

    Given an ε > 0 and a problem (a) we do not, in general, know how to determine an algorithm, (1), that requires the least amount of work to find x so that ||x − y|| ≤ ε. This is an important aspect of algorithms for which there is no general mathematical theory. For most of the algorithms that are described in later chapters, we estimate the number of arithmetic operations required to find x.

    PROBLEM, SECTION 3

    1. For the quadratic equation

    x² + 2bx + c = 0,

    find the small root by using single precision arithmetic in the iterative schemes

    and

    If your computer has a mantissa with approximately t = 2p digits, use

    c = 1,  b = −10ᵖ

    for the two initial values

    Which scheme gives the smaller root to approximately t digits with the smaller number of iterations? Which scheme requires less work?

    † Unless otherwise indicated, boldface type denotes column vectors. For example, an n-dimensional vector uk has the components uik; i.e.,

    † For complex numbers x and y the elementary inequality |x + y| ≤ |x| + |y| expresses the fact that the length of any side of a triangle is not greater than the sum of the lengths of the other two sides.

    ek has the components eik, where eik = 0, i ≠ k; ekk = 1.

    † A quantity, say f, is said to be 𝒪(δ), or briefly f = 𝒪(δ), iff for some constants K ≥ 0 and δ0 > 0,

    |f| ≤ K|δ|,  for |δ| ≤ δ0.

    † For simplicity we are neglecting the special case that occurs when d1 = d2 = … = dt = 9 and dt + 1 ≥ 5. Here we would increase the exponent q in (2) by unity and set δ1 = 1, δj = 0, j > 1. Note that when dt + 1 = 5, if we were to round up iff dt is odd, then an unbiased rounding procedure would result. Some electronic computers employ an unbiased rounding procedure (in a binary system).

    2

    Numerical Solution of

    Linear Systems and Matrix Inversion

    0.INTRODUCTION

    Finding the solution of a linear algebraic equation system of large order and calculating the inverse of a matrix of large order can be difficult numerical tasks. While in principle there are standard methods for solving such problems, the difficulties are practical and stem from

    (a) the labor required in a lengthy sequence of calculations, and

    (b) the possible loss of accuracy in such lengthy calculations performed with a fixed number of decimal places.

    The first difficulty renders manual computation impractical and the second limits the applicability of high speed digital computers with fixed word length. Thus to determine the feasibility of solving a particular problem with given equipment, several questions should be answered:

    (i)How many arithmetic operations are required to apply a proposed method?

    (ii)What will be the accuracy of a solution to be found by the proposed method (a priori estimate)?

    (iii)How can the accuracy of the computed answer be checked (a posteriori estimate)?

    The first question can frequently† be answered in a straightforward manner and this is done, by means of an operational count, for most of the methods in this chapter. The third question can be easily answered if we have a bound for the norm of the inverse matrix. We therefore indicate, in SubSection 1.3, how such a bound may be obtained if we have an approximate inverse. However, the second question has only been recently answered for some methods. After discussing the notions of well-posed problem and condition number of a matrix, we give an account of Wilkinson’s a priori estimate for the Gaussian elimination method in SubSection 1.2. This treatment, in Section 1, of the Gaussian elimination method is followed, in Section 2, by a discussion of some modifications of the procedure. Direct factorization methods, which include Gaussian elimination as a special case, are described in Section 3. Iterative methods and techniques for accelerating them are studied in the remaining three sections.

    The matrix inversion problem may be formulated as follows: Given a square matrix of order n,

    find its inverse, i.e., a square matrix of order n, say A−1, such that

    Here I is the nth order identity matrix whose elements are given by the Kronecker delta:

    It is well known that this problem has one and only one solution iff the determinant of A is non-zero (det A ≠ 0), i.e., iff A is non-singular.

    The problem of solving a general linear system is formulated as follows: Given a square matrix A and an arbitrary n-component column vector f, find a vector x which satisfies

    or, in component form,

    Again it is known that this problem has a solution which is unique for every inhomogeneous term f, iff A is non-singular. [If A is singular the system (4) will have a solution only for special vectors f and such a solution is not unique. The numerical solution of such singular problems is briefly touched on
