
INDEX

Week 1
1. Lesson 1 - Introduction, Motivation
2. Lesson 2 - Part 1 - Mathematical Preliminaries, Polynomial Interpolation - 1
3. Lesson 2 - Part 2 - Mathematical Preliminaries, Polynomial Interpolation - 1
4. Lesson 3 - Part 1 - Polynomial Interpolation - 2
5. Lesson 3 - Part 2 - Polynomial Interpolation - 2
6. Lesson 4 - Polynomial Interpolation - 3

Week 2
7. Lagrange Interpolation Polynomial, Error in Interpolation - 1
8. Lagrange Interpolation Polynomial, Error in Interpolation - 1
9. Error in Interpolation - 2
10. Error in Interpolation - 2
11. Divided Difference Interpolation Polynomial
12. Properties of Divided Differences, Introduction to Inverse Interpolation
13. Properties of Divided Differences, Introduction to Inverse Interpolation

Week 3
14. Inverse Interpolation, Remarks on Polynomial Interpolation
15. Numerical Differentiation - 1: Taylor Series Method
16. Numerical Differentiation - 2: Method of Undetermined Coefficients
17. Numerical Differentiation - 2: Polynomial Interpolation Method
18. Numerical Differentiation - 3: Operator Method; Numerical Integration - 1

Week 4
19. Numerical Integration - 2: Error in Trapezoidal Rule, Simpson's Rule
20. Numerical Integration - 3: Error in Simpson's Rule, Composite Trapezoidal Rule, Error
21. Numerical Integration - 4: Composite Simpson's Rule, Error, Method of Undetermined Coefficients
22. Numerical Integration - 5: Gaussian Quadrature (Two-Point Method)

Week 5
23. Numerical Integration - 5: Gaussian Quadrature (Three-Point Method), Adaptive Quadrature
24. Numerical Solution of Ordinary Differential Equations (ODE) - 1
25. Numerical Solution of ODE - 2: Stability, Single-Step Methods - 1, Taylor Series Method
26. Numerical Solution of ODE - 3: Examples of Taylor Series Method, Euler's Method

Week 6
27. Numerical Solution of ODE - 4: Runge-Kutta Methods
28. Numerical Solution of ODE - 5: Example for RK Method of Order 2, Modified Euler's Method
29. Numerical Solution of Ordinary Differential Equations - 6: Predictor-Corrector Methods (Adams-Moulton)
30. Numerical Solution of Ordinary Differential Equations - 7

Week 7
31. Numerical Solution of Ordinary Differential Equations - 8
32. Numerical Solution of Ordinary Differential Equations - 9
33. Numerical Solution of Ordinary Differential Equations - 10
34. Numerical Solution of Ordinary Differential Equations - 11

Week 8
35. Root Finding Methods - 1: The Bisection Method - 1
36. Root Finding Methods - 2: The Bisection Method - 2
37. Root Finding Methods - 3: Newton-Raphson Method - 1
38. Root Finding Methods - 4: Newton-Raphson Method - 2

Week 9
39. Root Finding Methods - 5: Secant Method, Method of False Position
40. Root Finding Methods - 6: Fixed Point Methods - 1
41. Root Finding Methods - 7: Fixed Point Methods - 2
42. Root Finding Methods - 8: Fixed Point Iteration Methods - 3

Week 10
43. Root Finding Methods - 9: Practice Problems
44. Solution of Linear Systems of Equations - 1
45. Solution of Linear Systems of Equations - 2
46. Solution of Linear Systems of Equations - 3

Week 11
47. Solution of Linear Systems of Equations - 4
48. Solution of Linear Systems of Equations - 5
49. Solution of Linear Systems of Equations - 6
50. Solution of Linear Systems of Equations - 7

Week 12
51. Solutions of Linear Systems of Equations - 8: Iterative Methods - 1
52. Solutions of Linear Systems of Equations - 8: Iterative Methods - 2
53. Matrix Eigenvalue Problems - 2: Power Method - 2
54. Practice Problems
Numerical Analysis
Lecture-1
Numerical Analysis is a branch of Mathematics concerned with the theoretical foundations of numerical algorithms, which provide systematic procedures for finding approximate solutions to problems that arise in scientific applications. In this course, we will focus on the development of algorithms for solving certain classes of problems and on the mathematical analysis of these techniques. Since we obtain approximate solutions, we will pay particular attention to how accurate a solution is, what error is incurred at each step of the computation, and how we can obtain reliable results correct to a desired degree of accuracy.
A good knowledge of the tools of calculus is necessary for understanding numerical algorithms, and we will review these tools before discussing the main topics.
In scientific and engineering applications, we come across situations where we have to solve equations that are linear or nonlinear, or that involve derivatives, integrals, or a combination of these. It may not be possible to write an exact solution of these using elementary functions. The topics that we discuss in this course will help us construct approximate solutions to such otherwise intractable problems. While doing this, we will compare and assess different methods for solving a class of problems, with our focus on the efficiency and accuracy of these methods.
Numerical Analysis lies at the heart of Scientific Computing and of the computational sciences and engineering, and it underpins the development of high-performance computing skills. In this course we focus on laying a strong foundation in the various methods used for obtaining approximate solutions to certain classes of problems, with emphasis on error analysis, and we postpone the development of numerical codes for implementation to a future course.
In this course, we will consider polynomial interpolation, numerical differentiation, numerical integration, and the numerical solution of initial value problems governed by first order ordinary differential equations. Enclosure methods and fixed point iteration methods for obtaining a simple root of algebraic and transcendental equations will be discussed. Decomposition methods and iterative methods will be discussed for solving systems of linear algebraic equations, together with their error analysis. This will be followed by some simple methods for solving algebraic eigenvalue problems.

1 Polynomial Interpolation
We begin our study with the problems of polynomial interpolation. Polynomial interpolation
is one of the most fundamental problems in numerical methods. A function y = f(x) may be known at (n + 1) discrete points x0, x1, x2, . . . , xn; that is, f(x0), f(x1), f(x2), . . . , f(xn) are given. Alternatively, we may not know the function itself, but only data y0, y1, y2, . . . , yn collected experimentally at the points x0, x1, x2, . . . , xn. In either case, our goal is to construct a function that passes through these points; that is, to find a function p(x) such that the conditions

p(xi) = f(xi) (or yi) for i = 0, 1, . . . , n

are satisfied. We are interested in a smooth function, and we therefore look for a polynomial. The points are called interpolation points. By "interpolating the data", we mean that the polynomial passes through the points (xi, f(xi)), so that pn(xi) = f(xi), i = 0, 1, 2, . . . , n. The function pn(x) that interpolates the data is called an interpolating polynomial.
We shall develop methods for constructing interpolation polynomials in this course.
Note that the interpolating polynomial pn (x) and the underlying function f (x) are gen-
erally different except that they agree at the interpolation points. So, we focus on error
in interpolation which is a measure on the difference between the two functions f (x) and
pn (x).

2 Numerical Differentiation
We next consider numerical approximation of derivatives. Why do we need to approximate
derivatives?
The function values may be known only at a set of discrete points and we may have
to obtain a derivative of the function and understand the changes in the data that are
related to derivatives, or the computation of the derivative of the given known function
may be complicated, and it may be easier to approximate the derivative. For example, suppose you measure the position of a moving vehicle every minute through some mechanism (say, GPS) and you want to compute its speed. The position of the vehicle is known only at discrete times, so analytical differentiation is not possible. We therefore require a numerical method for
computing the derivative of a function whose values are known at a set of discrete points.
For the above and other reasons, we wish to develop methods for approximating various
order derivatives.

3 Numerical Integration
Integration is a basic mathematical operation with a wide range of applications in science and engineering. Many integrals cannot be evaluated by the analytical methods that you
have learnt in school. Also, you may know function values (such as speed of a vehicle) at
a set of discrete points (times) and you may want to compute the integral of this function
(distance travelled by the vehicle in unit time). The solution to this problem is to use a
numerical method to evaluate the required integral. We shall consider a number of methods
for evaluating an integral and while doing so, we shall understand a general strategy for
developing such methods. This will help you to develop new techniques that you may need
for your applications. The techniques developed allow us to keep track of the error incurred
in these methods and we perform error analysis and also determine a bound on the error.

4 Numerical Solution of Ordinary Differential Equations
Ordinary differential equations (ODEs) of order one occur as mathematical models in many branches of science and engineering. These equations may not belong to the class of equations for which exact analytical solutions can be obtained, and therefore the solutions cannot be expressed in closed form. We therefore seek approximate solutions by means of numerical methods. These methods provide a way of obtaining accurate results with a reliable bound on the error. We will focus on the development and analysis of numerical techniques for solving initial value problems governed by first order ODEs.
We begin with some preliminary results that give the conditions under which an initial value problem (IVP) possesses a unique solution in an interval containing x0 (an initial point). The methods that we consider are for IVPs that satisfy the conditions stated in the existence and uniqueness theorem for solutions of IVPs.

5 Root finding problem: solving f(x)=0


We then move on to developing methods that help us find an approximation to a simple root of an equation f(x) = 0, where f(x) is an algebraic or transcendental function. The methods developed belong either to a class of enclosure methods or to fixed point iteration methods. The iterative methods generate a sequence of approximations to a root of f(x) = 0, and we need to discuss the conditions under which this sequence converges; so we perform convergence analysis and determine the order of convergence of these methods. We will also consider in detail the error estimation for each of the methods that we develop, and see the advantages and disadvantages of each.

6 Solution of systems of equations: Ax = b


We next consider two main classes of methods for solving linear systems Ax = b, where A is an n × n matrix and x and b are n × 1 vectors, namely direct methods and iterative methods. We consider only simple iterative methods in this course, as it is important for a student to be able to use these methods efficiently before studying more efficient and modern methods. Among the direct methods, we consider decomposition methods and understand the conditions under which the Gauss elimination method is guaranteed to yield a solution in the absence of round-off errors. We will also consider the effects of such errors and find an estimate on the error bound.
We finally consider the algebraic eigenvalue problems which are important in stability
analysis. We begin with a brief description of the eigenvalue problem and present a simple
but powerful method of computing the dominant eigenvalue of a matrix and corresponding
eigenvector.
The lecture notes and the material presented in the classes are based on the content from
the following books. The presentation and explanations are excellent in these books and I
have enjoyed reading and understanding Numerical Analysis from these books. Some subtle
points are explained very well and I have closely followed the text books and prepared the
lecture notes. I sincerely thank and acknowledge the authors of these text books for their
clear and excellent presentation in their books. I have also given below the book references
from where I have collected problems.

References
[1] David Kincaid and Ward Cheney, Numerical Analysis: Mathematics of Scientific Computing, Pure and Applied Undergraduate Texts, Vol. 2, AMS, ISBN-13: 978-0-8218-4788-6, (2002).

[2] R.L. Burden and J.D. Faires, Numerical Analysis, 7th Edition, Brooks/Cole, (2001).

[3] E. Süli and D.F. Mayers, An Introduction to Numerical Analysis, Cambridge University Press, (2003).

[4] M.K. Jain, S.R.K. Iyengar and R.K. Jain, Numerical Methods for Scientific and Engineering Computation, New Age International (P) Limited, New Delhi.

[5] Robert J. Schilling and Sandra L. Harris, Applied Numerical Methods for Engineers Using MATLAB and C, Brooks/Cole, New York.

[6] M.K. Jain, S.R.K. Iyengar and R.K. Jain, Numerical Methods: Problems and Solutions, New Age International, (2007).

Numerical Analysis
Lecture-2

1 Mathematical Preliminaries
Notations:
• C(R) : the set of all functions f that are continuous on the entire real line R.

• C^1(R) : the set of all functions f for which f' is continuous on the entire real line R.
Note. C^1(R) ⊂ C(R) and the inclusion is proper, since there exist continuous functions whose derivative fails to exist at some point; for example, f(x) = |x| is continuous but not differentiable at x = 0.

• C^2(R) : the set of all functions f for which f'' is continuous on the entire real line R.

• C^n(R) : the set of all functions f for which f^(n) is continuous everywhere, n ∈ N (the natural numbers).

• C^∞(R) : the set of functions each of whose derivatives is continuous; for example, f(x) = e^x.
We have C^∞(R) ⊂ · · · ⊂ C^2(R) ⊂ C^1(R) ⊂ C(R).

• C^n[a, b] : the set of functions f for which f^(n) exists and is continuous on [a, b].

Theorem 1.1 (Taylor's theorem with Lagrange's form of the remainder). If f ∈ C^n[a, b] and if f^(n+1) exists on (a, b), then for any points c and x in [a, b] there exists a point ξ between c and x such that

f(x) = Σ_{k=0}^{n} (1/k!) f^(k)(c) (x − c)^k + E_n(x),    (1)

where

E_n(x) = (1/(n + 1)!) f^(n+1)(ξ) (x − c)^(n+1).    (2)

Taylor's theorem allows us to represent a function exactly as a polynomial plus a remainder of known form. In a computational setting, such a function can be replaced by a simple polynomial, and we are able to bound the error made.

Illustration
Let f(x) = ln x, a = 1, b = 2, c = 1.
f'(x) = 1/x, f''(x) = −1/x^2, f'''(x) = 2/x^3, . . .
In general, for k ≥ 1, f^(k)(x) = (−1)^(k−1) (k−1)!/x^k; so f^(k)(1) = (−1)^(k−1) (k−1)! and f(1) = ln 1 = 0.

∴ ln x = Σ_{k=1}^{n} (−1)^(k−1) ((k−1)!/k!) (x − 1)^k + (1/(n+1)!) f^(n+1)(ξ) (x − 1)^(n+1),   1 ≤ x ≤ 2, where 1 < ξ < x.

Simplifying,

ln x = Σ_{k=1}^{n} ((−1)^(k−1)/k) (x − 1)^k + ((−1)^n/(n+1)) (x − 1)^(n+1)/ξ^(n+1).

Now Σ_{k=1}^{n} ((−1)^(k−1)/k) (x − 1)^k is a polynomial in x. This can be regarded as a simple function that approximates ln x.
The last term ((−1)^n/(n+1)) (x − 1)^(n+1)/ξ^(n+1) can be regarded as an error term.
Note that the last term is not a polynomial, because ξ depends on x. E_n(x) tells us how much the polynomial approximation differs from ln x.

ln x = (x − 1) − (1/2)(x − 1)^2 + (1/3)(x − 1)^3 − · · · + ((−1)^(n−1)/n)(x − 1)^n + E_n(x)    (∗)

where |E_n(x)| = (1/(n+1)) (x − 1)^(n+1)/ξ^(n+1) < (1/(n+1)) (x − 1)^(n+1)   (since ξ > 1, 1/ξ^(n+1) < 1).

Use (∗) to compute ln 2:

ln 2 = 1 − 1/2 + 1/3 − 1/4 + · · · + (−1)^(n−1)/n + E_n(2),   where |E_n(2)| < 1/(n + 1).

∴ To compute ln 2 with an accuracy of one part in 10^8, n must be chosen such that

|E_n(2)| ≤ 10^(−8)  ⇒  1/(n + 1) ≤ 10^(−8)  ⇒  n + 1 ≥ 10^8.

So about one hundred million terms of the polynomial are required to compute ln 2 to the desired accuracy.
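
A short Python sketch (an illustration, not part of the original notes) that sums the series (∗) at x = 2 and compares the actual error with the bound 1/(n + 1) makes the slow convergence visible:

import math

# Partial sums of ln 2 = 1 - 1/2 + 1/3 - 1/4 + ...  (series (*) evaluated at x = 2)
def ln2_partial(n):
    return sum((-1) ** (k - 1) / k for k in range(1, n + 1))

for n in (10, 1000, 100000):
    s = ln2_partial(n)
    # actual error vs. the theoretical bound 1/(n+1)
    print(n, s, abs(math.log(2) - s), 1 / (n + 1))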

Theorem 1.2 (Mean Value Theorem (MVT)). If f ∈ C[a, b] and if f' exists on (a, b), then for x and c in [a, b],

f(x) = f(c) + f'(ξ)(x − c),   where ξ lies between c and x.

Taking x = b and c = a,

f(b) − f(a) = f'(ξ)(b − a),   a < ξ < b.    (3)

Theorem 1.3 (Rolle's Theorem (a special case of the MVT)). Let f be continuous on [a, b]. If f' exists on (a, b) and if f(a) = f(b) = 0, then f'(ξ) = 0 for some ξ in (a, b).

Note that Rolle's Theorem is an immediate consequence of the MVT.

In error analysis, it is necessary to estimate the error incurred in each step of a computation. Rounding is an important concept in scientific computing.
Let x be a positive decimal number of the form x = 0.d1 d2 . . . dn, with n digits to the right of the decimal point. We want to round x to m decimal places (m < n) in a manner that depends on the value of the (m + 1)st digit.

• If this digit is a 0, 1, 2, 3 or 4, then the mth digit is not changed and all the following
digits are discarded.

• If this digit is a 5, 6, 7, 8 or 9, then the mth digit is increased by one unit and the
remaining digits are discarded.

Examples
Seven-digit numbers correctly rounded to four digits:
.1735499 → .1735
.9999500 → 1.000
.4321609 → .4322

If x is rounded so that x̃ is the m-digit approximation to it, then

|x − x̃| ≤ (1/2) × 10^(−m).    (4)

Case (i): If the (m + 1)st digit of x is 0, 1, 2, 3 or 4, then x = x̃ + ε with ε < (1/2) × 10^(−m), and (4) follows.

Case (ii): If the (m + 1)st digit of x is 5, 6, 7, 8 or 9, then x̃ = x̂ + 10^(−m), where x̂ is the number with the same first m digits as x and all digits beyond the mth equal to 0. Now x = x̂ + δ × 10^(−m) with δ ≥ 1/2, and x̃ − x = 10^(−m)(1 − δ). Since 1 − δ ≤ 1/2, |x − x̃| ≤ (1/2) × 10^(−m).

The chopped or truncated m-digit approximation to a decimal number x is the number x̂ obtained by discarding all digits beyond the mth. In this case,

|x − x̂| < 10^(−m).    (5)

Here x and x̂ are related in such a way that x − x̂ has 0 in the first m decimal places, so x = x̂ + δ × 10^(−m) with 0 ≤ δ < 1.
∴ |x − x̂| = |δ| × 10^(−m) < 10^(−m), and (5) follows.
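
The rounding and chopping rules above can be checked with a small Python sketch (illustrative only; note that Python's round() uses round-half-to-even, which agrees with the half-up rule of the text except in the exact-tie case):

import math

def rounded(x, m):
    # nearest m-decimal-digit approximation x~
    return round(x * 10**m) / 10**m

def chopped(x, m):
    # discard all digits beyond the m-th: the truncated approximation x^
    return math.floor(x * 10**m) / 10**m

x, m = 0.1735499, 4
x_round, x_chop = rounded(x, m), chopped(x, m)
print(x_round, abs(x - x_round), 0.5 * 10**-m)   # bound (4): |x - x~| <= (1/2) 10^-m
print(x_chop, abs(x - x_chop), 10**-m)           # bound (5): |x - x^| <  10^-m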

Absolute Error and Relative Error


When a real number x is approximated by another number x∗, the error is x − x∗.

The absolute error is |x − x∗|.

The relative error is |x − x∗|/|x|, provided that x is nonzero.

The absolute error is the most natural measure of the accuracy of an approximation, but it can be misleading.

Example
Suppose x = 10^(−16) and x∗ = 2 × 10^(−16). The absolute error in this approximation is

E_abs = |2 × 10^(−16) − 10^(−16)| = 10^(−16),

which suggests that the computation is accurate, because this error is small. However, the relative error is

E_rel = |2 × 10^(−16) − 10^(−16)|/10^(−16) = 1,

which suggests that the computation is completely erroneous, because by this measure the error is equal in magnitude to the exact value; that is, the error is 100%. This example, although an extreme case, illustrates why the absolute error can be a misleading measure of error.
Why do we need different measures of error?
Let x = e^(−16) = 0.1125351747 × 10^(−6) (x is small), and let x∗ = 0 be an approximation to x. Then

|x − x∗| < 1.2 × 10^(−7) (absolute error)   and   |x − x∗|/|x| = 1 (relative error),

which tells us that not many digits are matched in this approximation of x by x∗.
Now consider z = e^16 = 0.8886110521 × 10^7 (z is large), and let z∗ = 0.8886110517 × 10^7. Then

|z − z∗| = 4 × 10^(−3) (not very small, even though z∗ matches z to about nine digits),

and |z − z∗|/|z| = (4 × 10^(−3))/(0.8886110521 × 10^7) = 0.4501 × 10^(−9), showing that about nine digits are correct.

Though the absolute error is a natural measure of the accuracy of an approximation, it can be misleading. The relative error, on the other hand, protects us from misjudging the accuracy of an approximation because of scale extremes (very large or very small numbers).
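
A minimal Python sketch (illustrative) reproducing the two computations above:

import math

def abs_err(x, x_star):
    return abs(x - x_star)

def rel_err(x, x_star):
    return abs(x - x_star) / abs(x)   # requires x != 0

# small number: absolute error looks harmless, relative error shows every digit is lost
x, x_star = math.exp(-16), 0.0
print(abs_err(x, x_star), rel_err(x, x_star))      # about 1.1e-07 and 1.0

# large number: absolute error looks big, relative error shows about nine correct digits
z, z_star = 0.8886110521e7, 0.8886110517e7
print(abs_err(z, z_star), rel_err(z, z_star))      # about 4e-03 and 4.5e-10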
Remark. The representation of a real number x on a computing device is only approximate, since the device can hold only a finite number of digits. As a result, all digits of x after a finite number of digits are discarded (a real number can have infinitely many digits). Even though the discarded part of x may be very small, it can lead to large errors in numerical computation. In this lecture, we introduced the chopped and rounded machine approximations of a real number x and the associated types of errors.
Computers store information and allow us to operate on it. A computer has finite memory, so it is not possible to store the infinite range of numbers that exist. In view of this, approximations are made, and these approximations can accumulate when many arithmetic operations are performed.

Round-off error is the error arising from the fact that not every number is exactly representable in the finite-precision floating-point format.

Suppose we can keep only 4 significant digits, and let us compute √(x+1) − √x taking x = 1984.
√(x+1) − √x = 44.55 − 44.54 = 0.01, keeping only 4 digits in each step of the computation. We are unable to get the required precision. Now, consider

√(x+1) − √x = (√(x+1) − √x) · (√(x+1) + √x)/(√(x+1) + √x) = 1/(√(x+1) + √x).

So, √1985 − √1984 = 1/(√1985 + √1984) = 1/(44.55 + 44.54) = 0.01122.
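
A small Python sketch (illustrative; a crude 4-significant-digit "machine" is simulated by rounding every intermediate result) shows the loss of precision and the cure:

from math import sqrt

def fl(v, digits=4):
    # keep only `digits` significant figures, mimicking a 4-digit machine
    return float(f"{v:.{digits - 1}e}")

x = 1984.0
# naive form: subtracting two nearly equal 4-digit numbers cancels all useful digits
naive = fl(fl(sqrt(x + 1)) - fl(sqrt(x)))
# rewritten form 1/(sqrt(x+1) + sqrt(x)) avoids the cancellation
rewritten = fl(1.0 / fl(fl(sqrt(x + 1)) + fl(sqrt(x))))
print(naive, rewritten, sqrt(x + 1) - sqrt(x))     # 0.01   0.01122   0.011223...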

Let us take another example: compute e^(−24) using Taylor's theorem, keeping terms up to x^n:

e^x ≈ 1 + x + x^2/2! + x^3/3! + · · · + x^n/n!

The error in this approximation is less than (|x|^(n+1)/(n+1)!) max{1, e^x}.

If we compute e^(−24) by adding the terms until a term is smaller than the machine precision, we find

e^(−24) ≈ 3.443053542881011977E−007, but
e^(−24) = 3.77513454427909773E−011 (true value).

We observe that the error |true value − approximate value| is larger than the true answer itself.

Consider another example, where we want to compute the volume of a spherical shell,

V = (4/3)π[R1^3 − R2^3],

which in some applications involves the cancellation of two nearly equal large numbers. We can reduce the round-off error by rewriting it as

V = (4/3)π(R1 − R2)(R1^2 + R1 R2 + R2^2).

Round-off error is one type of error that we come across during our computations. There is another type of error, truncation error, which arises from translating continuous mathematical expressions into discrete forms. Consider sin x = x − x^3/3! + x^5/5! − · · ·, the Taylor series expansion of sin x. If x is small, we can neglect the higher order terms by truncating the series. This introduces an error.

2 Polynomial Interpolation-I
Calculus provides us with many tools with the help of which we can understand the be-
haviour of functions. However, it is important that these functions are continuous or dif-
ferentiable. In most of the real applications, we have information about functions given
in the form of a relationship between two quantities and this consists of a set of discrete
data points. There is a need for reconstructing continuous functions based on discrete
data. Polynomials are continuous functions and are simple to work with. We focus our
attention therefore on the problem of constructing a polynomial in such a way that it has
the same value as that of the function at the given discrete points; that is, if (xi , f (xi ))
is known at a set of say, (n + 1) points i = 0, 1, 2, . . . , n, then we develop a method for
constructing a polynomial of degree at most n such that pn (xi ) = f (xi ), i = 0, 1, 2, . . . , n.
The polynomial so obtained is called the interpolating polynomial that interpolates the
function at a set of discrete points (xi , f (xi )), i = 0, 1, 2, . . . , n. The points x0 , x1 , . . . , xn
are called interpolation points. Or, we seek a function pn (polynomial) passing through
(xi , f (xi )), i = 0, 1, 2, . . . , n with the property pn (xi ) = f (xi ).

Numerical Analysis
Lecture-3

1 Polynomial Interpolation-II
We now address the problem of representing functions within a computer.

1. function values may be known at a set of discrete points, say, x0 , x1 , x2 , . . . , xn .


The function values are f (x0 ), f (x1 ), f (x2 ), . . . , f (xn ) or fi , i = 0, 1, 2, . . . , n and are
known. The aim is to reconstruct the unknown function f by seeking a polynomial pn
of degree n whose graph in the (x, y) plane passes through the points with coordinates
(xi , f (xi )), i = 0, 1, . . . , n.
Note. In general, the resulting polynomial pn will differ from f (unless f itself is a
polynomial of the same degree as pn ), so an error will be incurred.

We shall consider later results which provide bound on the size of this error.

2. The function values f(x) may be known for all x in a closed interval [a, b]. In this case, the goal is to approximate the function f by a polynomial over the interval [a, b]. Such polynomials are called interpolating polynomials, say pn.

1.1 Properties of pn
a) pn is a polynomial of degree n passing through (n+1) discrete points x0 , x1 , x2 , . . . , xn .

b) pn (x) interpolates the function f (x) such that

pn (xi ) = f (xi ) = fi , i = 0, 1, . . . , n.

Remark. We all know that any polynomial can be completely specified by its finitely many
coefficients. Therefore, storing the interpolation polynomial for f in a computer will be
generally more economical than storing f itself. Thus, interpolation by polynomials provides
a way of representing functions within a computer.

1.2 Problem formulation


Given a set of (n + 1) data points (xi , yi ):

x x0 x1 x2 ... xn
y y0 y1 y2 ... yn

seek a polynomial p of lowest possible degree for which p(xi ) = yi , i = 0, 1, . . . , n. Such a


polynomial is said to interpolate the data.

Theorem 1.1. If x0, x1, x2, . . . , xn are distinct real numbers, then for arbitrary values of y0, y1, y2, . . . , yn, there is a unique polynomial pn of degree at most n such that pn(xi) = yi, i = 0, 1, . . . , n.

Proof. Uniqueness
Suppose there were two such polynomials pn and qn . Then,
pn (xi ) = yi , i = 0, 1, . . . , n.
qn (xi ) = yi , i = 0, 1, . . . , n.
Consider the polynomial pn − qn .
(pn − qn )(xi ) = pn (xi ) − qn (xi ) = yi − yi = 0, i = 0, 1, . . . , n.
Since the degree of pn − qn can be at most n, the polynomial pn − qn can have at most n zeros if it is not the zero polynomial. Since x0, x1, x2, . . . , xn are (n + 1) distinct points and (pn − qn)(xi) = 0, i = 0, 1, . . . , n, the polynomial pn − qn has (n + 1) zeros. So
pn − qn must be the zero polynomial,
i.e., pn ≡ qn.
Thus the interpolating polynomial, if it exists, is unique.
We now present the proof of existence of such a polynomial.
Existence
We show that there exists a polynomial of degree atmost n such that pn (xi ) = yi , i =
0, 1, . . . , n.
Proof by induction
For n = 0, choose a constant function p0 (polynomial of degree ≤ 0) such that p0 (x0 ) = y0 .
Suppose that we have obtained a polynomial pk−1 of degree ≤ k − 1 with pk−1 (xi ) = yi ,
for i = 0, 1, . . . , k − 1 (induction hypothesis).
We now try to construct pk in the form

pk(x) = pk−1(x) + A(x − x0)(x − x1) . . . (x − xk−1),

which is a polynomial of degree at most k. For i = 0, 1, . . . , k − 1,

pk(xi) = pk−1(xi) + A(xi − x0)(xi − x1) . . . (xi − xk−1) = yi + 0,

since pk−1(xi) = yi and the product contains the factor (xi − xi) = 0. So pk(xi) = yi for i = 0, 1, . . . , k − 1; that is, pk interpolates the data at x0, x1, x2, . . . , xk−1.
In addition, requiring pk(xk) = yk gives

yk = pk−1(xk) + A(xk − x0)(xk − x1) . . . (xk − xk−1)
∴ A = [yk − pk−1(xk)] / [(xk − x0)(xk − x1) . . . (xk − xk−1)]

Hence,

pk(x) = pk−1(x) + [(x − x0)(x − x1) . . . (x − xk−1)] / [(xk − x0)(xk − x1) . . . (xk − xk−1)] · [yk − pk−1(xk)],

which is a polynomial of degree at most k such that pk(xi) = yi, i = 0, 1, . . . , k.
By induction, the result is true for all n, i.e., there exists a polynomial pn of degree at most n such that pn(xi) = yi, i = 0, 1, . . . , n.
Note. The polynomials p0 , p1 , p2 , . . . , pn constructed in the proof have the property that
each pk is obtained simply by adding a single term to pk−1 .
i.e.

p0 (x) = c0
p1 (x) = c0 + c1 (x − x0 )
p2 (x) = c0 + c1 (x − x0 ) + c2 (x − x0 )(x − x1 )
... ... ... ... ... ... ...
pk (x) = c0 + c1 (x − x0 ) + c2 (x − x0 )(x − x1 ) + · · · + ck (x − x0 )(x − x1 ) . . . (x − xk−1 )
These polynomials are called the interpolation polynomials in Newton’s form.
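
The constructive proof above is itself an algorithm: each coefficient ck is obtained from the previous polynomial pk−1. A minimal Python sketch (illustrative, with hypothetical sample data taken from y = x^2):

def eval_newton(coeffs, xs, x):
    # evaluate c0 + c1 (x - x0) + c2 (x - x0)(x - x1) + ...
    total, prod = 0.0, 1.0
    for k, c in enumerate(coeffs):
        total += c * prod
        prod *= x - xs[k]
    return total

def newton_coeffs(xs, ys):
    # build c0, c1, ..., cn exactly as in the existence proof:
    # c_k = (y_k - p_{k-1}(x_k)) / ((x_k - x0)(x_k - x1)...(x_k - x_{k-1}))
    coeffs = [ys[0]]
    for k in range(1, len(xs)):
        prod = 1.0
        for i in range(k):
            prod *= xs[k] - xs[i]
        coeffs.append((ys[k] - eval_newton(coeffs, xs, xs[k])) / prod)
    return coeffs

xs, ys = [1.0, 2.0, 4.0], [1.0, 4.0, 16.0]      # sample data from y = x^2
c = newton_coeffs(xs, ys)
print(c)                                         # [1.0, 3.0, 1.0]
print([eval_newton(c, xs, xi) for xi in xs])     # reproduces ys at the nodes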

1.3 Linear interpolation (n+1=2; n=1)


Approximate the curve of f by a chord joining two adjacent x-values x0 and x1 .

[Figure: the chord joining (x0, f0) and (x1, f1) on the curve y = f(x); the point x0 + rh lies between x0 and x1.]

Let x1 − x0 = h and let f(x0 + rh) denote the approximate value of f at x = x0 + rh, 0 ≤ r ≤ 1. The equation of the line joining (x0, f0) and (x1, f1) gives

(f − f1)/(f1 − f0) = (x − x1)/(x1 − x0) = (x0 + rh − x1)/(x1 − x0) = (rh − h)/(x1 − x0) = (r − 1)h/h = r − 1.

∴ f = f1 + (f1 − f0)(r − 1) = f0 + r(f1 − f0).

Thus, f(x) ≈ p1(x) = f0 + r(f1 − f0), i.e.

f(x) ≈ f0 + ((x − x0)/h)(f1 − f0).
Note. Linear interpolation will be satisfactory as long as the x-values in a table are close
to each other. In this case, chords deviate very little from the curve of f (x).
In quadratic interpolation, we approximate the curve of the function f between x0 and
x0 +2h = x2 by the quadratic parabola which passes through the three points (x0 , f0 ), (x1 , f1 ), (x2 , f2 ).
Note that xi = x0 + ih, i = 0, 1, 2.
Let f(x) ≈ p2(x) = c0 + c1(x − x0) + c2(x − x0)(x − x1), where p2 is such that p2(xi) = f(xi), i = 0, 1, 2.

[Figure: the parabola through the three points (x0, f0), (x1, f1), (x2, f2).]

So, f(x0) = p2(x0) = c0.

f(x1) = p2(x1) = c0 + c1(x1 − x0)
∴ f(x1) = f(x0) + c1(x1 − x0)
∴ c1 = (f(x1) − f(x0))/(x1 − x0).

Similarly, f(x2) = p2(x2) = c0 + c1(x2 − x0) + c2(x2 − x0)(x2 − x1)
        = f(x0) + ((f1 − f0)/(x1 − x0))(x2 − x0) + c2(x2 − x0)(x2 − x1).
∴ c2 = (f(x2) − f(x0))/((x2 − x0)(x2 − x1)) − (f1 − f0)/((x1 − x0)(x2 − x1))
     = (f2 − f0)/((x2 − x0)(x2 − x1)) − (f1 − f0)/((x1 − x0)(x2 − x1)).

∴ f(x) ≈ p2(x) = f(x0) + ((f1 − f0)/(x1 − x0))(x − x0) + [ (f2 − f0)/((x2 − x0)(x2 − x1)) − (f1 − f0)/((x1 − x0)(x2 − x1)) ] (x − x0)(x − x1).

For equally spaced points (x1 − x0 = x2 − x1 = h) this becomes

p2(x) = f0 + ((f1 − f0)/h)(x − x0) + ((f2 − 2f1 + f0)/2) · ((x − x0)/h) · ((x − x0 − h)/h)
      = f0 + ((x − x0)/h)(f1 − f0) + ((x − x0)/h)((x − x0)/h − 1) · (f2 − 2f1 + f0)/2.

The computations become tedious if one continues in this way to obtain higher degree interpolating polynomials. The concept of finite differences is very useful and is important for the interpolation of a tabulated function.

1.4 Finite-difference operators


Forward differences
Given a table of values (xi , yi ), i = 0, 1, 2, . . . , n of a function y = f (x), where xi ’s are
equally spaced. That is, xi = x0 + ih, i = 0, 1, 2, . . . , n. Define the forward difference
operator ∆ as follows:

• The first forward difference is:


∆yi = yi+1 − yi , i = 0, 1, 2, . . . , n − 1
i.e.,
∆y0 = y1 − y0
∆y1 = y2 − y1
... ... ...
∆yn−1 = yn − yn−1

• The second forward differences are the differences of the first forward differences:
∆2 yi = ∆yi+1 − ∆yi , i = 0, 1, 2, . . . , n − 2
i.e.,
∆2 y0 = ∆y1 − ∆y0 = (y2 − y1 ) − (y1 − y0 ) = y2 − 2y1 + y0
∆2 y1 = ∆y2 − ∆y1 = (y3 − y2 ) − (y2 − y1 ) = y3 − 2y2 + y1
... ... ... ... ... ... ... ... ... ... ...
∆2 yn−2 = ∆yn−1 − ∆yn−2 = (yn − yn−1 ) − (yn−1 − yn−2 ) = yn − 2yn−1 + yn−2

Given a set of values (xi , yi ), i = −2, −1, 0, 1 where the xi ’s are equally spaced, we present
below the forward difference table.
Here xi = x0 + ih i = −2, −1, 0, 1. So, x−2 = x0 − 2h, x−1 = x0 − h, x1 = x0 + h.
Forward Difference Table

x y ∆y ∆2 y ∆3 y
x−2 y−2
∆y−2 = y−1 − y−2
x−1 y−1 ∆2 y−2 = ∆y−1 − ∆y−2
∆y−1 = y0 − y−1 ∆3 y−2 = ∆2 y−1 − ∆2 y−2
x0 y0 ∆2 y−1 = ∆y0 − ∆y−1
∆y0 = y1 − y0
x1 y1
• each difference is located in its column, midway between the elements of the previous
column.
• subscript remains the same along each diagonal of the table.
• y−2 is called the leading term
• ∆y−2 , ∆2 y−2 , ∆3 y−2 are the leading order differences.

Shift Operator E
Let y = f(x), and let x take the values x, x + h, x + 2h, . . . , x + nh, etc. Define an operator E having the property
Ef(x) = f(x + h).
E is called the shift operator.
E^2 f(x) = E(E(f(x))) = E(f(x + h)) = f(x + 2h).
By similar reasoning, if we apply E n times to f(x), then
E^n f(x) = f(x + nh)
When x0 , x1 , x2 , . . . , xn are equally spaced, then
x1 = x0 + h
x2 = x0 + 2h
... ... ...
xn = x0 + nh

∴ E k f (x0 ) = f (x0 + kh) = f (xk ), k = 1, 2, . . . , n

(or) Ey0 = y1 ; E 2 y0 = y2 ; E 3 y0 = y3 ; . . . ; E n y0 = yn ;

The inverse operator E −1 is defined similarly:


E^(−1) f(x) = f(x − h).
So,
E^(−2) f(x) = E^(−1)(E^(−1)(f(x))) = E^(−1)(f(x − h)) = f(x − 2h),
and in general
E^(−n) f(x) = f(x − nh).

Relation between ∆ and E


∆f (x) = f (x + h) − f (x) = Ef (x) − f (x) = (E − 1)f (x)
∴∆≡E−1

1.4.1 Newton’s forward difference interpolation polynomial (for equally spaced
xi values, i = 0, 1, . . . , n)
Let y = f (x) be a function which takes values f (x0 ), f (x0 + h), f (x0 + 2h), . . . , f (x0 + nh)
corresponding to equispaced points x0 , x1 = x0 + h, x2 = x0 + 2h, . . . , xn = x0 + nh.
Suppose we want to evaluate the function f(x) at a value x0 + ph, where p is any real number with x0 ≤ x0 + ph ≤ x0 + nh. For any real number p, the operator E is such that E^p f(x) = f(x + ph).
Let x = x0 + ph. Then

f(x0 + ph) = E^p f(x0) = (1 + ∆)^p f(x0) = [1 + p∆ + (p(p−1)/2!)∆^2 + (p(p−1)(p−2)/3!)∆^3 + . . . ] f(x0)    (∗∗)

How many terms are there within the bracket [ . ]? We answer this question first, using the following result, and then complete the derivation of the interpolating polynomial for a given set of (n + 1) equispaced points.
Theorem 1.2. The nth differences of a polynomial of degree n are constant when the values of the independent variable are given at equal intervals.
Proof. Consider a polynomial of degree n in the form
y(x) = a0 x^n + a1 x^(n−1) + a2 x^(n−2) + · · · + a_(n−1) x + a_n, with a0 ≠ 0 and a0, a1, . . . , an constants.
Let xi = x0 + ih, i = 0, 1, . . . , n. Then
y(x + h) = a0 (x + h)^n + a1 (x + h)^(n−1) + a2 (x + h)^(n−2) + · · · + a_(n−1)(x + h) + a_n.
So,
∆y(x) = y(x + h) − y(x)
      = a0 [(x + h)^n − x^n] + a1 [(x + h)^(n−1) − x^(n−1)] + · · · + a_(n−1)[x + h − x]
      = a0 [x^n + nC1 x^(n−1) h + nC2 x^(n−2) h^2 + · · · + h^n − x^n]
        + a1 [x^(n−1) + (n−1)C1 x^(n−2) h + (n−1)C2 x^(n−3) h^2 + · · · + h^(n−1) − x^(n−1)]
        + . . .
        + a_(n−1)[x + h − x]
      = a0 n h x^(n−1) + [a0 nC2 h^2 + a1 (n−1)C1 h] x^(n−2) + · · · + a_(n−1) h
∴ ∆y(x) = a0 n h x^(n−1) + B1 x^(n−2) + · · · + Kx + L,
where B1, . . . , K, L are constants involving h but not x.
∴ The first difference of a polynomial of degree n is a polynomial of degree (n − 1).
∆^2 y(x) = ∆(∆y(x)) = ∆y(x + h) − ∆y(x)
         = a0 n h [(x + h)^(n−1) − x^(n−1)] + B1 [(x + h)^(n−2) − x^(n−2)] + · · · + K[x + h − x],
which reduces to a polynomial of degree (n − 2) in x given by
∆^2 y(x) = a0 n(n − 1) h^2 x^(n−2) + B1' x^(n−3) + · · · + K'x + L'.
Similarly, by computing the next higher order differences, the degree of the polynomial is reduced by one each time.
After taking differences n times, we are left with the single term
∆^n y(x) = a0 n(n − 1)(n − 2) . . . (2)(1) h^n = a0 n! h^n = constant (independent of x).
Since ∆^n y(x) is constant, ∆^(n+1) y(x) = 0.
∴ The (n + 1)st and higher order differences of a polynomial of degree n are zero.
In view of this result, only terms up to the nth order differences appear within the bracket [ . ] in (∗∗) above in the derivation of the Newton-Gregory forward difference interpolation polynomial; the (n + 1)st and higher order differences vanish.

Numerical Analysis
Lecture-4

1 Polynomial Interpolation-III
Continuing our discussion from (∗∗) in Lecture 3, we have

f(x0 + ph) = f(x0) + p∆f(x0) + (p(p−1)/2!)∆^2 f(x0) + · · · + (p(p−1)(p−2)...(p−(n−1))/n!)∆^n f(x0)
           = y0 + p∆y0 + (p(p−1)/2!)∆^2 y0 + · · · + (p(p−1)(p−2)...(p−(n−1))/n!)∆^n y0    (∗ ∗ ∗)

where p = (x − x0)/h.
Given a set of equally spaced points xi, i = 0, 1, . . . , n, and the corresponding function values yi = f(xi), i = 0, 1, . . . , n, form a forward difference table and then use (∗ ∗ ∗) to determine the interpolating polynomial pn(x) of degree at most n that interpolates the function at the xi, that is, pn(xi) = f(xi), i = 0, 1, . . . , n.

y = y0 + p∆y0 + (p(p−1)/2!)∆^2 y0 + · · · + (p(p−1)(p−2)...(p−(n−1))/n!)∆^n y0

is the Newton-Gregory forward difference interpolating polynomial of degree n that interpolates f(x) at a set of (n + 1) equally spaced points.


Example 1.1. Given the table of values, find the Newton’s forward difference interpolation
formula and evaluate f (15).

The forward difference table is (x0 = 10, h = 10):

x = 10: y0 = 46,  ∆y0 = 20,  ∆^2 y0 = −5,  ∆^3 y0 = 2,  ∆^4 y0 = −3
x = 20: y1 = 66,  ∆y1 = 15,  ∆^2 y1 = −3,  ∆^3 y1 = −1
x = 30: y2 = 81,  ∆y2 = 12,  ∆^2 y2 = −4
x = 40: y3 = 93,  ∆y3 = 8
x = 50: y4 = 101

∴ y(x) = y(x0 + ph) = y0 + p∆y0 + (p(p−1)/2!)∆^2 y0 + (p(p−1)(p−2)/3!)∆^3 y0 + (p(p−1)(p−2)(p−3)/4!)∆^4 y0
       = 46 + 20p + (p(p−1)/2)(−5) + (p(p−1)(p−2)/6)(2) + (p(p−1)(p−2)(p−3)/24)(−3),

where p = (x − x0)/h = (x − 10)/10.

y(x) is a 4th degree interpolating polynomial. We want y(15). Here p = (15 − 10)/10 = 1/2.

∴ y(15) = 46 + (1/2)(20) + (1/2)(−1/2)(−5)/2 + (1/2)(−1/2)(−3/2)(2)/6 + (1/2)(−1/2)(−3/2)(−5/2)(−3)/24
        = 46 + 10 + 5/8 + 1/8 + 15/128
        = 56 + 0.625 + 0.125 + 0.1172
        = 56.8672, correct to four decimal places.
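
A small Python sketch (illustrative, not part of the notes) that builds the forward differences and evaluates the Newton-Gregory forward formula; for the table above it reproduces y(15):

def forward_differences(ys):
    # returns the top edge of the difference table: [y0, ∆y0, ∆^2 y0, ...]
    table, row = [ys[0]], list(ys)
    while len(row) > 1:
        row = [row[i + 1] - row[i] for i in range(len(row) - 1)]
        table.append(row[0])
    return table

def newton_forward(xs, ys, x):
    # Newton-Gregory forward formula at x = x0 + p h (equally spaced xs assumed)
    h = xs[1] - xs[0]
    p = (x - xs[0]) / h
    value, factor = 0.0, 1.0
    for k, d in enumerate(forward_differences(ys)):
        value += factor * d
        factor *= (p - k) / (k + 1)     # builds p(p-1)...(p-k)/(k+1)!
    return value

xs, ys = [10, 20, 30, 40, 50], [46, 66, 81, 93, 101]
print(newton_forward(xs, ys, 15))       # 56.8671875, i.e. 56.8672 to four decimals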

Example 1.2. Find a cubic polynomial in x which takes values −3, 3, 11, 27, 57 and 107
when x = 0, 1, 2, 3, 4 and 5 respectively.

Given 6 points xi = x0 + ih, i = 0, 1, 2, 3, 4, 5, which are equally spaced with h = 1.


We use Newton Gregory forward difference interpolation polynomial.

The forward difference table is:

x = 0: f0 = −3,  ∆f0 = 6,   ∆^2 f0 = 2,   ∆^3 f0 = 6,  ∆^4 f0 = 0,  ∆^5 f0 = 0
x = 1: f1 = 3,   ∆f1 = 8,   ∆^2 f1 = 8,   ∆^3 f1 = 6,  ∆^4 f1 = 0
x = 2: f2 = 11,  ∆f2 = 16,  ∆^2 f2 = 14,  ∆^3 f2 = 6
x = 3: f3 = 27,  ∆f3 = 30,  ∆^2 f3 = 20
x = 4: f4 = 57,  ∆f4 = 50
x = 5: f5 = 107

p = (x − x0)/h = (x − 0)/1 = x.

∴ f(x0 + ph) = f(x)
 = f0 + p∆f0 + (p(p−1)/2!)∆^2 f0 + (p(p−1)(p−2)/3!)∆^3 f0 + (p(p−1)(p−2)(p−3)/4!)∆^4 f0 + (p(p−1)(p−2)(p−3)(p−4)/5!)∆^5 f0
 = −3 + 6p + (p(p−1)/2)(2) + (p(p−1)(p−2)/6)(6) + 0 + 0
 = −3 + 6x + x(x − 1) + x(x − 1)(x − 2)
 = −3 + 6x + x^2 − x + x^3 − 3x^2 + 2x

∴ f(x) = x^3 − 2x^2 + 7x − 3 is the required cubic polynomial.
Note. You can check whether your computations are correct by verifying that p3(xi) = f(xi), i = 0, 1, . . . , 5. For example, f(0) = −3 = y0, which is what is given, and f(4) = 4^3 − 2(4^2) + 7(4) − 3 = 57, which is also what is given.
Example 1.3. Estimate the missing figure in the following table:

x 1 2 3 4 5
y = f (x) 2 5 7 ? 32
(missing)

Given 4 entries (xi , fi ) of the function f (x), where y = f (x), the function can be represented
by a polynomial of degree 3.
∴ ∆3 f (x) = constant and
∆4 f (x) = 0 for all x.
In particular, ∆4 f (x0 ) = 0
that is, (E − 1)4 f (x0 ) = 0.

∴ (E 4 − 4E 3 + 6E 2 − 4E + 1)f (x0 ) = 0
or, f (x4 ) − 4f (x3 ) + 6f (x2 ) − 4f (x1 ) + f (x0 ) = 0
or 32 − 4f (x3 ) + 6(7) − 4(5) + 2 = 0
or 32 + 42 − 20 + 2 = 4f (x3 )
=⇒ f (x3 ) = 14

∴ the missing entry is 14.

1.1 Backward Differences


For a given table of values (xi , yi ), i = 0, 1, 2, . . . , n of a function y = f (x) with equally
spaced xi i = 0, 1, 2, . . . , n, xi = x0 + ih, the first backward differences are expressed in
terms of the backward difference operator ∇ as:
∇yi = yi − yi−1 ,   i = n, n − 1, . . . , 1
i.e.,
∇y1 = y1 − y0
∇y2 = y2 − y1
... ... ...
∇yn = yn − yn−1

The differences of these differences are called the second backward differences, denoted
by ∇2 y2 , ∇2 y3 , ∇2 y4 , . . . , ∇2 yn .
∇2 y2 = ∇y2 − ∇y1 = (y2 − y1 ) − (y1 − y0 ) = y2 − 2y1 + y0
∇2 y3 = ∇y3 − ∇y2 = (y3 − y2 ) − (y2 − y1 ) = y3 − 2y2 + y1
... ... ... ... ... ... ... ... ... ... ...
∇2 yn = ∇yn − ∇yn−1 = (yn − yn−1 ) − (yn−1 − yn−2 ) = yn − 2yn−1 + yn−2

In general,∇2 yi = ∇yi − ∇yi−1 , i = n, n − 1, . . . , 2


k th backward differences are
∇k yi = ∇k−1 yi − ∇k−1 yi−1 , i = n, n − 1, . . . , k.

The backward differences can be arranged for a table of values (xi , yi ), i = 0, 1, 2, 3, 4.


This is shown below.
x y ∇y ∇2 y ∇3 y ∇4 y
x0 y 0
∇y1 = y1 − y0
x1 y 1 ∇2 y2 = ∇y2 − ∇y1
∇y2 = y2 − y1 ∇3 y3 = ∇2 y3 − ∇2 y2
x2 y 2 ∇2 y3 = ∇y3 − ∇y2 ∇4 y4 = ∇3 y4 − ∇3 y3
∇y3 = y3 − y2          ∇3 y4 = ∇2 y4 − ∇2 y3
x3 y 3 ∇2 y4 = ∇y4 − ∇y3
∇y4 = y4 − y3
x4 y 4
Note. subscript remains constant along every backward diagonal
∇y1 = y1 − y0 = y1 − E^(−1) y1 = (1 − E^(−1)) y1. This is true for the other i's as well. Hence

∇ = 1 − E^(−1) = (E − 1)/E

∴ E = (1 − ∇)^(−1)

Newton’s Backward Difference Interpolation Formula


This formula is used if one wants to interpolate the value of the function y = f (x) near the
end of table of values.
Let y = f (x) be a function taking values f (xn ), f (xn − h), f (xn − 2h), . . . , f (x0 )
corresponding to equispaced points xn , xn−1 = xn − h, xn−2 = xn − 2h, . . . , x0 = xn − nh.
We want to evaluate the function f (x) for a value x = xn + ph, where p is any real number.
h = xi − xi−1 , i = n, n − 1, . . . , 1.
f(xn + ph) = E^p f(xn) = (E^(−1))^(−p) f(xn) = (1 − ∇)^(−p) f(xn)
           = [1 + p∇ + (p(p+1)/2!)∇^2 + (p(p+1)(p+2)/3!)∇^3 + · · · + (p(p+1)(p+2)...(p+n−1)/n!)∇^n] f(xn)    (∗)

where p = (x − xn)/h.

(∗) is known as Newton-Gregory backward difference interpolation polynomial of degree


n, interpolating the function at (n + 1) discrete equidistant points.
Example 1.4. Estimate f (7.5) from the following table

The backward difference table (the tabulated function is y = x^3 at x = 1, 2, . . . , 8) is:

x = 1: y = 1
x = 2: y = 8,    ∇y = 7
x = 3: y = 27,   ∇y = 19,   ∇2 y = 12
x = 4: y = 64,   ∇y = 37,   ∇2 y = 18,  ∇3 y = 6
x = 5: y = 125,  ∇y = 61,   ∇2 y = 24,  ∇3 y = 6,  ∇4 y = 0
x = 6: y = 216,  ∇y = 91,   ∇2 y = 30,  ∇3 y = 6,  ∇4 y = 0
x = 7: y = 343,  ∇y = 127,  ∇2 y = 36,  ∇3 y = 6,  ∇4 y = 0
x = 8 = xn: y = 512 = yn,  ∇yn = 169,  ∇2 yn = 42,  ∇3 yn = 6,  ∇4 yn = 0

The fourth and higher order differences are zero. The backward interpolation polynomial is, therefore,

y(x) = yn + p∇yn + (p(p+1)/2!)∇2 yn + (p(p+1)(p+2)/3!)∇3 yn

Here p = (x − 8)/1 = x − 8. We want to estimate f at x = 7.5, so p = 7.5 − 8 = −0.5.

y(7.5) = 512 + (−0.5)(169) + ((−0.5)(0.5)/2)(42) + ((−0.5)(0.5)(1.5)/6)(6)
       = 512 − 84.5 − 5.25 − 0.375
       = 421.875

∴ y(7.5) = 421.875
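
A companion Python sketch (illustrative) for the backward formula; for the cubic data above it reproduces y(7.5) = 421.875 exactly:

def backward_differences(ys):
    # returns the bottom edge of the difference table: [yn, ∇yn, ∇^2 yn, ...]
    table, row = [ys[-1]], list(ys)
    while len(row) > 1:
        row = [row[i + 1] - row[i] for i in range(len(row) - 1)]
        table.append(row[-1])
    return table

def newton_backward(xs, ys, x):
    # Newton-Gregory backward formula at x = xn + p h (equally spaced xs assumed)
    h = xs[1] - xs[0]
    p = (x - xs[-1]) / h
    value, factor = 0.0, 1.0
    for k, d in enumerate(backward_differences(ys)):
        value += factor * d
        factor *= (p + k) / (k + 1)     # builds p(p+1)...(p+k)/(k+1)!
    return value

xs = list(range(1, 9))                  # 1, 2, ..., 8
ys = [v ** 3 for v in xs]               # the tabulated function is y = x^3
print(newton_backward(xs, ys, 7.5))     # 421.875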
Example 1.5. The sales in a particular store in various years are given below. Estimate the sales for the year 1979.

The backward difference table is:

x = 1974: y = 40
x = 1976: y = 43,  ∇y = 3
x = 1978: y = 48,  ∇y = 5,  ∇2 y = 2
x = 1980: y = 52,  ∇y = 4,  ∇2 y = −1,  ∇3 y = −3
x = 1982 = xn: y = 57 = yn,  ∇yn = 5,  ∇2 yn = 1,  ∇3 yn = 2,  ∇4 yn = 5

p = (1979 − 1982)/2 = −1.5

∴ y(1979) = 57 + (−1.5)(5) + ((−1.5)(−0.5)/2)(1) + ((−1.5)(−0.5)(0.5)/6)(2) + ((−1.5)(−0.5)(0.5)(1.5)/24)(5)
          = 57 − 7.5 + 0.375 + 0.125 + 0.1172
          = 50.1172

Numerical Analysis
Lecture-5

1 Lagrange Interpolation (xi need not be equally spaced)


Let y = f (x) be a function taking values y0 , y1 , y2 , ..., yn corresponding to x0 , x1 , x2 , ..., xn .
xi , i = 0, 1, 2, ..., n need not be equally spaced. Since there are (n + 1) values of y corre-
sponding to distinct (n + 1) values of x, we can represent the function f (x) by a polynomial
of degree n, say pn (x).

pn (x) = A0 xn + A1 xn−1 + ... + An−1 x + An

or

pn (x) = a0 (x − x1 )(x − x2 )(x − x3 )...(x − xn )


+ a1 (x − x0 )(x − x2 )(x − x3 )...(x − xn )
+ a2 (x − x0 )(x − x1 )(x − x3 )...(x − xn )
+ ... ... ... ...
+ an (x − x0 )(x − x1 )(x − x2 )...(x − xn−1 ).

The coefficients a0 , a1 , ..., an are determined using pn (xi ) = f (xi ), i = 0, 1, 2, ..., n.

f(x0) = pn(x0) = a0 (x0 − x1)(x0 − x2)(x0 − x3)...(x0 − xn)
  ⇒ a0 = f(x0) / [(x0 − x1)(x0 − x2)(x0 − x3)...(x0 − xn)]

f(x1) = pn(x1) = a1 (x1 − x0)(x1 − x2)(x1 − x3)...(x1 − xn)
  ⇒ a1 = f(x1) / [(x1 − x0)(x1 − x2)(x1 − x3)...(x1 − xn)]
  ...
f(xn) = pn(xn) = an (xn − x0)(xn − x1)(xn − x2)...(xn − xn−1)
  ⇒ an = f(xn) / [(xn − x0)(xn − x1)(xn − x2)...(xn − xn−1)]

Therefore

pn(x) = [(x − x1)(x − x2)(x − x3)...(x − xn)] / [(x0 − x1)(x0 − x2)(x0 − x3)...(x0 − xn)] · f(x0)
      + [(x − x0)(x − x2)(x − x3)...(x − xn)] / [(x1 − x0)(x1 − x2)(x1 − x3)...(x1 − xn)] · f(x1)
      + [(x − x0)(x − x1)(x − x3)...(x − xn)] / [(x2 − x0)(x2 − x1)(x2 − x3)...(x2 − xn)] · f(x2)
      + ...
      + [(x − x0)(x − x1)(x − x2)...(x − xn−1)] / [(xn − x0)(xn − x1)(xn − x2)...(xn − xn−1)] · f(xn)
      = L0(x) f(x0) + L1(x) f(x1) + ... + Ln(x) f(xn)
      = Σ_{k=0}^{n} Lk(x) f(xk),

where Lk(x) = [(x − x0)(x − x1)...(x − xk−1)(x − xk+1)...(x − xn)] / [(xk − x0)(xk − x1)...(xk − xk−1)(xk − xk+1)...(xk − xn)].

We observe that Lk(xk) = 1 and Lk(xi) = 0 for i ≠ k. pn(x) is the required interpolating polynomial, the Lagrange interpolating polynomial, of degree at most n, that interpolates the function at xi, i = 0, 1, 2, ..., n.

pn (xi ) = L0 (xi )f (x0 ) + L1 (xi )f (x1 ) + ... + Li (xi )f (xi ) + ... + Ln (x)f (xn )
= 0 + 0 + ... + 1.f (xi ) + ... + 0
= f (xi ), i = 0, 1, 2, ..., n.

We now show that pn is the unique polynomial satisfying the interpolation property pn (xi ) =
yi , i = 0, 1, 2, ..., n.
Suppose there exists qn different from pn , such that qn (xi ) = f (xi ) = yi , i = 0, 1, 2, ..., n,
then pn − qn is such that pn (xi ) − qn (xi ) = f (xi ) − f (xi ) = 0, i = 0, 1, 2, ..., n. Therefore,
pn − qn has (n + 1) distinct roots xi , i = 0, 1, 2, ..., n. Since a polynomial of degree n cannot
have more than n distinct roots, unless it is identically zero, it follows that pn (x)−qn (x) = 0
and thus pn (x) = qn (x). This contradicts our assumption that pn and qn are distinct.
Therefore, there exists only one polynomial pn of degree atmost n that interpolates at
(n + 1) distinct points (xi , yi ), i = 0, 1, 2, ..., n. The points xi , i = 0, 1, 2, ..., n are called the
interpolation points for the function f .

Example 1.1. Construct the Lagrange interpolation polynomial of degree 2 for the func-
tion f (x) = ex on [−1, 1] with interpolation points x0 = −1, x1 = 0, x2 = 1.

L0(x) = (x − x1)(x − x2)/[(x0 − x1)(x0 − x2)] = x(x − 1)/[(−1)(−1 − 1)] = x(x − 1)/2

L1(x) = (x − x0)(x − x2)/[(x1 − x0)(x1 − x2)] = (x + 1)(x − 1)/[(0 + 1)(0 − 1)] = (x^2 − 1)/(−1) = 1 − x^2

L2(x) = (x − x0)(x − x1)/[(x2 − x0)(x2 − x1)] = (x + 1)(x − 0)/[(1 + 1)(1 − 0)] = x(x + 1)/2

Thus the quadratic interpolating polynomial p2(x) that interpolates f(x) = e^x at (x0, f(x0)), (x1, f(x1)) and (x2, f(x2)) is

p2(x) = L0(x) f(x0) + L1(x) f(x1) + L2(x) f(x2)
      = (x(x − 1)/2) e^(−1) + (1 − x^2) e^0 + (x(x + 1)/2) e
      = x(x − 1)/(2e) + (1 − x^2) + x(x + 1) e/2
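
A short Python sketch of Lagrange interpolation (illustrative); applied to f(x) = e^x at the nodes -1, 0, 1 it agrees with p2(x) above:

import math

def lagrange(xs, ys, x):
    # p_n(x) = sum_k L_k(x) y_k,  L_k(x) = prod_{i != k} (x - x_i)/(x_k - x_i)
    total = 0.0
    for k, xk in enumerate(xs):
        Lk = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                Lk *= (x - xi) / (xk - xi)
        total += Lk * ys[k]
    return total

xs = [-1.0, 0.0, 1.0]
ys = [math.exp(v) for v in xs]
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    p2 = x * (x - 1) / (2 * math.e) + (1 - x * x) + x * (x + 1) * math.e / 2
    print(x, lagrange(xs, ys, x), p2, math.exp(x))   # the first two columns agree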
Remark. pn coincides with f at the interpolation points, so that f(xi) = pn(xi), i = 0, 1, 2, ..., n; but f(x) may be quite different from pn(x) when x is not an interpolation point. How large is the difference f(x) − pn(x) when x ≠ xi, i = 0, 1, 2, ..., n? Assuming that the function f is sufficiently smooth, an estimate of the size of the interpolation error f(x) − pn(x) is given in the following theorem.
Theorem 1.1. Suppose that n ≥ 0 and that f is a real-valued function defined and continuous on the closed interval [a, b], such that the derivative of f of order (n + 1) exists and is continuous on [a, b]. Then, given x ∈ [a, b], there exists ξ = ξ(x) in (a, b) such that

f(x) − pn(x) = (f^(n+1)(ξ)/(n + 1)!) π_{n+1}(x),    (1)

where π_{n+1}(x) = (x − x0)(x − x1)...(x − xn). Moreover,

|f(x) − pn(x)| ≤ (M_{n+1}/(n + 1)!) |π_{n+1}(x)|,

where M_{n+1} = max_{s ∈ [a,b]} |f^(n+1)(s)|.
Proof. If x = xi for some i = 0, 1, 2, ..., n, then the right hand side of equation (1) is zero, since

π_{n+1}(xi) = (xi − x0)(xi − x1)...(xi − xi)...(xi − xn) = 0,

and the left hand side is f(xi) − pn(xi) = 0, since xi is an interpolation point. Thus equation (1) holds.
Suppose now that x ∈ [a, b] and x ≠ xi, i = 0, 1, 2, ..., n. For such a value of x, consider the function φ(t) defined on [a, b] by

φ(t) = f(t) − pn(t) − ([f(x) − pn(x)]/π_{n+1}(x)) π_{n+1}(t).

Note that x is fixed. Now,

φ(xi) = f(xi) − pn(xi) − ([f(x) − pn(x)]/π_{n+1}(x)) π_{n+1}(xi) = 0 − ([f(x) − pn(x)]/π_{n+1}(x)) (0) = 0,   i = 0, 1, 2, ..., n.

Also φ(x) = f(x) − pn(x) − ([f(x) − pn(x)]/π_{n+1}(x)) π_{n+1}(x) = 0. Therefore φ vanishes at the (n + 2) distinct points x0, x1, ..., xn and x in [a, b]. Thus, by Rolle's theorem, φ'(t) vanishes at (n + 1) points in (a, b), one between each pair of consecutive points at which φ vanishes. Applying Rolle's theorem again, φ'' vanishes at n distinct points.
Our assumption that the derivative of f of order (n + 1) exists and is continuous on [a, b] allows us to apply Rolle's theorem (n + 1) times in succession, showing that φ^(n+1) vanishes at some point ξ ∈ (a, b), ξ = ξ(x). Now,

φ^(n+1)(t) = f^(n+1)(t) − pn^(n+1)(t) − ([f(x) − pn(x)]/π_{n+1}(x)) π_{n+1}^(n+1)(t)
           = f^(n+1)(t) − ([f(x) − pn(x)]/π_{n+1}(x)) (n + 1)!

(since pn is a polynomial of degree at most n, its (n + 1)th derivative is zero, and π_{n+1} is a polynomial of degree (n + 1) with leading term t^(n+1), so its (n + 1)th derivative is (n + 1)!).
Since φ^(n+1)(ξ) = 0, we have 0 = φ^(n+1)(ξ) = f^(n+1)(ξ) − (n + 1)! [f(x) − pn(x)]/π_{n+1}(x). This gives

f(x) − pn(x) = (f^(n+1)(ξ)/(n + 1)!) π_{n+1}(x),

which is equation (1).
Now, as f^(n+1) is a continuous function on [a, b], the same is true of |f^(n+1)|. Thus |f^(n+1)| is bounded on [a, b] and achieves its maximum there; |f^(n+1)| ≤ M_{n+1}.

∴ |f(x) − pn(x)| ≤ (M_{n+1}/(n + 1)!) |π_{n+1}(x)|.

Using this inequality, we can provide an upper bound on the size of the interpolation
error.

Example 1.2. If the function f(x) = sin x is approximated by a polynomial of degree 9 that interpolates f at 10 points in the interval [0, 1], how large is the error on this interval?

f(x) − pn(x) = (f^(n+1)(ξ)/(n + 1)!) π_{n+1}(x),   π_{10}(x) = (x − x0)(x − x1)...(x − x9).

Now |f^(n+1)(ξ)| = |f^(10)(ξ)| ≤ 1 and |(x − x0)(x − x1)...(x − x9)| ≤ 1 for x ∈ [0, 1]. Thus, for all x ∈ [0, 1],

|sin x − p9(x)| ≤ 1/10! < 2.8 × 10^(−7).
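
The bound can also be checked numerically. A Python sketch (illustrative; it assumes 10 equally spaced nodes in [0, 1], a choice the example does not specify) compares the observed maximum error of p9 with 1/10!:

import math

def lagrange(xs, ys, x):
    total = 0.0
    for k, xk in enumerate(xs):
        Lk = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                Lk *= (x - xi) / (xk - xi)
        total += Lk * ys[k]
    return total

nodes = [i / 9 for i in range(10)]                 # 10 equally spaced points in [0, 1]
vals = [math.sin(t) for t in nodes]
err = max(abs(math.sin(x) - lagrange(nodes, vals, x))
          for x in [j / 1000 for j in range(1001)])
print(err, 1 / math.factorial(10))                 # observed error is well below 2.76e-7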

Numerical Analysis
Lecture-6

1 Error in Interpolation II
We now consider the special case in which the points xi are equally spaced, and use the results of the theorem above to provide an upper bound on the size of the interpolation error.
1. If there is only one interpolation point (x0, f(x0)), then n = 0, p0(x) = f(x0), and π1(x) = x − x0. Therefore,

   f(x) − p0(x) = ((x − x0)/1!) f'(ξ),

   where ξ lies in the smallest interval containing x and x0, i.e. in [x0, x] or [x, x0]. Note that this result is the Mean Value Theorem of differential calculus: f(x) − f(x0) = f'(ξ)(x − x0), with ξ between x and x0.
2. If there are two interpolation points (x0, f(x0)), (x1, f(x1)), we obtain the linear interpolating polynomial p1(x). The error is

   f(x) − p1(x) = ((x − x0)(x − x1)/2) f''(ξ),   ξ ∈ (x0, x1).

   What is |f(x) − p1(x)|? For x ∈ [x0, x1] we have |x − x0| ≤ x1 − x0 and |x − x1| ≤ x1 − x0, so |(x − x0)(x − x1)| ≤ (x1 − x0)^2 and

   |(x − x0)(x − x1)/2| ≤ (x1 − x0)^2/2 ≤ (x1 − x0)^2.

   ∴ If M2 = max_{x ∈ [x0, x1]} |f''(x)|, then |f(x) − p1(x)| ≤ (x1 − x0)^2 M2.

   In what follows, we obtain a better bound on the linear interpolation error. Consider

   f(x) − p1(x) = ((x − x0)(x − x1)/2) f''(ξ).

   Define the function h by

   h(x) = π2(x)/2 = (x − x0)(x − x1)/2 = (x^2 − x(x0 + x1) + x0 x1)/2.

   h'(x) = (2x − (x0 + x1))/2 = x − (x0 + x1)/2 = 0 when x = (x0 + x1)/2.

   In the interval [x0, x1], h(x) ≤ 0. Since h''(x) = 1 > 0, h(x) has a minimum at x = (x0 + x1)/2, and since h(x) is negative there, |h(x)| has its maximum at (x0 + x1)/2 in [x0, x1]. At x = (x0 + x1)/2,

   h(x) = ((x1 − x0)/2)((x0 − x1)/2)/2 = −(x1 − x0)^2/8.

   ∴ |f(x) − p1(x)| ≤ |h(x)| M2 = ((x1 − x0)^2/8) M2,   M2 = max_{x ∈ [x0, x1]} |f''(x)|.
Example 1.1. Determine the step-size h to be used in a table of f(x) = sin x on the interval [1, 3] so that linear interpolation will be correct to four decimal places.

f(x) = sin x, f'(x) = cos x, f''(x) = −sin x, and max_{x ∈ [1,3]} |f''| = max_{x ∈ [1,3]} |−sin x| = 1 = M2.

∴ (h^2/8) M2 ≤ 0.00005 ⇒ h^2/8 ≤ 0.00005 ⇒ h^2 ≤ 0.0004 ⇒ h ≤ 0.02.
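
A quick numerical check of this step-size (an illustrative Python sketch, sampling the piecewise linear interpolant of sin x on [1, 3] with h = 0.02):

import math

h = 0.02
nodes = [1 + i * h for i in range(101)]            # 1.00, 1.02, ..., 3.00

def linear_interp(x):
    i = min(int((x - nodes[0]) / h), len(nodes) - 2)   # locate the subinterval
    x0, x1 = nodes[i], nodes[i + 1]
    f0, f1 = math.sin(x0), math.sin(x1)
    return f0 + (x - x0) / h * (f1 - f0)

worst = max(abs(math.sin(x) - linear_interp(x))
            for x in [1 + j * 0.0001 for j in range(20001)])
print(worst, h * h / 8)                            # observed max error vs. the bound 5e-5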

3. If there are three interpolation points (x0, f(x0)), (x1, f(x1)), (x2, f(x2)), where x0, x1, x2 are equally spaced, then we have a quadratic interpolating polynomial p2(x) that interpolates the function at x0, x1, x2, such that f(xi) = p2(xi), i = 0, 1, 2. The error in quadratic interpolation is

   f(x) − p2(x) = ((x − x0)(x − x1)(x − x2)/3!) f'''(ξ),   ξ ∈ (x0, x2).

   Let x1 − x0 = x2 − x1 = h and write x = x1 + rh, −1 ≤ r ≤ 1. When r = −1, x = x1 − h = x0; when r = 0, x = x1; when r = 1, x = x1 + h = x2.

   Let

   Q(x) = (x − x0)(x − x1)(x − x2) = (x1 + rh − x0)(x1 + rh − x1)(x1 + rh − x2)
        = (h + rh)(rh)(rh − h) = h^3 (r + 1) r (r − 1) = h^3 r(r^2 − 1) = h^3 φ(r),   −1 ≤ r ≤ 1.

   Consider φ(r) = r(r^2 − 1). Then φ'(r) = 3r^2 − 1 and φ''(r) = 6r, and φ'(r) = 0 when r = ±1/√3.
   When r^2 < 1/3, φ'(r) < 0 and therefore φ is decreasing in (−1/√3, 1/√3); when r^2 > 1/3, φ'(r) > 0 and φ is increasing in (−∞, −1/√3) and (1/√3, ∞).

   max_{−1 ≤ r ≤ 1} |φ(r)| = |φ(−1/√3)| = |(−1/√3)(1/3 − 1)| = 2/(3√3).

   Also, |f'''(ξ)| ≤ M3. Thus

   max_{x ∈ (x0, x2)} |Q(x)| = 2h^3/(3√3);   |f(x) − p2(x)| ≤ (2h^3/(3√3)) (M3/3!) = h^3 M3/(9√3).

Example 1.2. Determine the step-size that can be used in the tabulation of f(x) = sin x in the interval [0, π/4] at equally spaced nodal points so that the truncation error of quadratic interpolation is less than 5 × 10^(−8).

Let xi−1, xi, xi+1 denote the three equispaced interpolation points with step-size h. For f(x) = sin x, |f'''(x)| ≤ 1 on [0, π/4], so M3 = 1. Thus |f(x) − p2(x)| ≤ (h^3/(9√3)) M3. Determine h such that

h^3/(9√3) ≤ 5 × 10^(−8)  ⇒  h ≈ 0.009.

Example 1.3. Practice Problem 1: Obtain an error bound for cubic interpolation of f(x) with equally spaced points.
Solution: Here we have n = 3. Thus

f(x) − p3(x) = ((x − x0)(x − x1)(x − x2)(x − x3)/4!) f^(4)(ξ),   ξ ∈ (x0, x3).

Let Q(x) = (x − x0)(x − x1)(x − x2)(x − x3) and let x1 − x0 = x2 − x1 = x3 − x2 = h, so that xi = x0 + ih, i = 0, 1, 2, 3. Let x = x0 + rh, 0 ≤ r ≤ 3. Then Q(x) = h^4 r(r − 1)(r − 2)(r − 3), 0 ≤ r ≤ 3.

Put r = 3/2 + t/2, so that −3 ≤ t ≤ 3. Then

Q(x) = h^4 (3/2 + t/2)(1/2 + t/2)(−1/2 + t/2)(−3/2 + t/2) = (h^4/16)(t^2 − 1)(t^2 − 9) = (h^4/16)(t^4 − 10t^2 + 9).

Let

φ(t) = t^4 − 10t^2 + 9
φ'(t) = 4t^3 − 20t = 4t(t^2 − 5) = 4t(t − √5)(t + √5)
φ''(t) = 12t^2 − 20
φ'''(t) = 24t

φ'(t) = 0 ⇒ t(t^2 − 5) = 0 ⇒ t = 0 or t = ±√5.
For 0 < t < √5, φ'(t) < 0 ⇒ φ is decreasing in (0, √5).
For t > √5, φ'(t) > 0 ⇒ φ is increasing for t > √5.

φ''(0) = −20 < 0,   φ''(√5) = 40 > 0,   φ''(−√5) = 40 > 0.

The maximum of φ occurs at t = 0, i.e. at r = 3/2, and the minima of φ occur at t = ±√5, i.e. at r = 3/2 ± √5/2.

∴ Maximum of Q(x) = h^4 (3/2)(1/2)(−1/2)(−3/2) = 9h^4/16.
∴ Maximum |Q(x)| = 9h^4/16.
∴ |f(x) − p3(x)| ≤ (9h^4/16)(M4/24) = (3h^4/128) M4,   where |f^(4)(ξ)| ≤ M4.

Example 1.4. Practice Problem 2: Approximate the function f(x) = 1/x by the cubic polynomial p3(x) = −(1/24)(x^3 − 10x^2 + 35x − 50). How well does p3(x) approximate f(x) over the interval [1, 4]?

Solution: |f(x) − p3(x)| ≤ (9h^4/16)(M4/24) = (3h^4/128) M4, where |f^(4)(ξ)| ≤ M4.

f(x) = 1/x
f'(x) = −1/x^2
f''(x) = 2/x^3
f'''(x) = −6/x^4
f^(4)(x) = 24/x^5

∴ f^(4)(x) = 24/x^5 is bounded by M4 = 24 on [1, 4]. Here h = 1 (since x0 = 1, x1 = 2, x2 = 3, x3 = 4).

∴ |f(x) − p3(x)| ≤ (3/128)(1)(24) = 9/16 = 0.5625.

Numerical Analysis
Lecture-7

Divided Difference Interpolation Polynomial


Let y = f(x) be known at xi, i = 0, 1, . . . , n. The xi need not be equally spaced.
The divided differences of orders 0, 1, . . . , n are defined as

y[x0] = y0   (zeroth order divided difference)

y[x0, x1] = (y1 − y0)/(x1 − x0)   (first order divided difference)

y[x0, x1, x2] = (y[x1, x2] − y[x0, x1])/(x2 − x0)
             = (1/(x2 − x0)) [ (y2 − y1)/(x2 − x1) − (y1 − y0)/(x1 − x0) ]   (second order divided difference)

and so on, and

y[x0, x1, x2, . . . , xn] = (y[x1, x2, . . . , xn] − y[x0, x1, . . . , xn−1])/(xn − x0).

Divided difference table

x y 1st order 2nd order 3rd order 4th order


x0 y0
y[x0 , x1 ]
x1 y1 y[x0 , x1 , x2 ]
y[x1 , x2 ] y[x0 , x1 , x2 , x3 ]
x2 y2 y[x1 , x2 , x3 ] y[x0 , x1 , x2 , x3 , x4 ]
y[x2 , x3 ] y[x1 , x2 , x3 , x4 ]
x3 y3 y[x2 , x3 , x4 ]
y[x3 , x4 ]
x4 y4

One can construct a divided difference table as shown above, given a set of 5 discrete
points.

We verify below that the divided difference is a symmetric function of its arguments.

We show this for first order and second order divided differences.

1 28
In the case of first order divided difference,

y[x0 , x1 ] = y[x1 , x0 ]
y1 − y0 y0 y1
LHS = y[x0 , x1 ] = = +
x 1 − x0 x0 − x1 x1 − x0
y0 − y1 y0 y1
RHS = y[x1 , x0 ] = = + = LHS
x 0 − x1 x0 − x1 x1 − x0

which shows that the first order divided difference is symmetric of its arguments. Now,

y[x1 , x2 ] − y[x0 , x1 ]
y[x0 , x1 , x2 ] =
x2 − x0
" #
1 y2 − y1 y1 − y0
= −
x2 − x0 x2 − x1 x1 − x0
y0 y1 y1 y2
= + − +
(x0 − x1 )(x0 − x2 ) (x1 − x2 )(x2 − x0 ) (x1 − x0 )(x2 − x0 ) (x2 − x0 )(x2 − x1 )
y0 xZ1 − x0 − Z
y 1 [Z xZ1 + x2 ] y2
= + +
(x0 − x1 )(x0 − x2 ) (x2 − x0 )(x1 − x0 )(x1 − x2 ) (x2 − x0 )(x2 − x1 )
y0 y1 y2
= + +
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 ) (x2 − x0 )(x2 − x1 )

We must show that

y[x0 , x1 , x2 ] = y[x1 , x2 , x0 ] = y[x2 , x0 , x1 ]

Now, following the computations as above, we have


y1 y2 y0
y[x1 , x2 , x0 ] = + +
(x1 − x2 )(x1 − x0 ) (x2 − x1 )(x2 − x0 ) (x0 − x1 )(x0 − x2 )
= y[x0 , x1 , x2 ]

and
y2 y0 y1
y[x2 , x0 , x1 ] = + +
(x2 − x0 )(x2 − x1 ) (x0 − x2 )(x0 − x1 ) (x1 − x2 )(x1 − x0 )
= y[x0 , x1 , x2 ]

showing that the second order divided difference is a symmetric function of its arguments.

Newton’s divided difference interpolation polynomial


Let y = f (x) be a function taking values yo , y1 , . . . , yn corresponding to x0 , x1 , . . . , xn .

2 29
Let pn (x), the interpolating polynomial of degree atmost n, that interpolates f (x) at
(xi , yi ), i = 0, 1, 2, . . . , n be taken in the form

pn (x) = A0 + A1 (x − x0 )
+ A2 (x − x0 )(x − x1 )
+ A3 (x − x0 )(x − x1 )(x − x2 )
+ ...............
+ An (x − x0 )(x − x1 )(x − x2 ) . . . (x − xn−1 )

We want pn (xi ) = yi = f (xi ), i = 0, 1, 2, . . . , n

pn (x0 ) = y0 = f (x0 ) = A0 ⇒ A0 = y0 = f [x0 ]


y1 − y0
pn (x1 ) = A0 + A1 (x1 − x0 ) = y1 ⇒ A1 = = f [x0 , x1 ]
x1 − x 0
pn (x2 ) = A0 + A1 (x2 − x0 ) + A2 (x2 − x0 )(x2 − x1 )
= f [x0 ] + f [x0 , x1 ](x2 − x0 ) + A2 (x2 − x0 )(x2 − x1 )
= f (x2 )
f (x2 ) − f (x0 ) − (x2 − x0 )f [x0 , x1 ]
⇒ A2 =
(x2 − x0 )(x2 − x1 )
f (x1 )−f (x0 )
f (x2 ) − f (x1 ) + (x1 −x0 )
(x1 − x0 ) − (x2 − x0 )f [x0 , x1 ]
=
(x2 − x0 )(x2 − x1 )
f (x2 ) − f (x1 ) + (x1 − x0 )f [x0 , x1 ] − (x2 − x0 )f [x0 , x1 ]
=
(x2 − x0 )(x2 − x1 )
f (x2 ) − f (x1 ) + f [x0 , x1 ][x1 − x0 − x2 + x0 ]
=
(x2 − x0 )(x2 − x1 )
f (x2 )−f (x1 )
2−
f [x0 , x1 ] (xX x1X)
X
x2 −x1 XX
= − XX
(x2 − x0 ) (x2 −XX x1X)(x2 − x0 )
f [x1 , x2 ] f [x0 , x1 ]
= − = f [x0 , x1 , x2 ]
(x2 − x0 ) (x2 − x0 )

By continuing this way, one can show that

An = f [x0 , x1 , x2 , . . . , xn ]

∴ Newton’s divided difference interpolation polynomial is

y = f (x) = f [x0 ] + (x − x0 )f [x0 , x1 ]


+ f [x0 , x1 , x2 ](x − x0 )(x − x1 )
+ f [x0 , x1 , x2 , x3 ](x − x0 )(x − x1 )(x − x2 )
+ .....................
+ f [x0 , x1 , x2 , . . . , xn ](x − x0 )(x − x1 ) . . . (x − xn−1 )

Example 1. Find the interpolating polynomial using Newton’s divided differences.

3 30
x y 1st order 2nd order 3rd order
0 1 = f [x0 ]
1−1
1−0
= 0 = f [x0 , x1 ]
1−0 1
1 1 2−0
= 2
= f [x0 , x1 , x2 ]
1
2−1 − 12 −2 −1
2−1
=1 6
4−0
= 24
= 12
= f [x0 , x1 , x2 , x3 ]
3
−1 1
2 2 2
4−1
= 6
5−3 3
4−2
= 2
4 5

1 1
y = f (x) = 1 + 0(x − 0) + (x − 0)(x − 1) − (x − 0)(x − 1)(x − 2)
2 12
x(x − 1) 1
=1+ − (x)(x − 1)(x − 2)
2 12
3 2
−x 3x 2x
= + − + 1 → (∗)
12 4 3

Let us find Lagrange interpolation polynomial for the above data

(x − 1)(x − 2)(x − 4) (x − 0)(x − 2)(x − 4)


y = f (x) = (1) + (1)
(0 − 1)(0 − 2)(0 − 4) (1 − 0)(1 − 2)(1 − 4)
(x − 0)(x − 1)(x − 4) (x − 0)(x − 1)(x − 2)
+ (2) + (5)
(2 − 0)(2 − 1)(2 − 4) (4 − 0)(4 − 1)(4 − 2)
−(x3 − 7x2 + 14x − 8) x3 − 6x2 + 8x (x3 − 5x2 + 4x) 5(x3 − 3x2 + 2x)
= + − +
8 3 2 24
−x3 3x2 2x
= + − +1
12 4 3

and is the same as (*) above.

4 31
Numerical Analysis
Lecture-8

Properties of Divided Differences, Introduction to In-


verse Interpolation
Divided diffference polynomial computation requires less arithmatic operations than that
of Lagrange interpolation polynomial.

We shall now show that the divided differences satisfy the equation

f [x1 , x2 , . . . , xn ] − f [x0 , x1 , . . . , xn−1 ]


f [x0 , x1 , x2 , . . . , xn ] =
xn − x0
Proof: Let pk denote the polynomial of degree atmost k that interpolates f at the nodes
x0 , x1 , x2 , . . . , xk .
Let q denote the polynomial of degree atmost (n−1) that interpolates f at x1 , x2 , . . . , xn
i.e, pn−1 (xi ) = f (xi ), i = 0, 1, 2, . . . , n − 1 (taking k as (n − 1))
and q(xi ) = f (xi ), i = 1, 2, . . . , n
Then, we claim that

(x − xn )
pn (x) = q(x) + [q(x) − pn−1 (x)] → (∗∗)
(xn − x0 )

Why?

1. L.H.S is a polynomial of degree atmost n. R.H.S is also a polynomial of degree atmost


n.

2. pn (xi ) = f (xi ), i = 0, 1, 2, . . . , n (taking k as n) RHS of (∗∗) evaluated at xi gives

(xi − xn )
q(xi ) + [q(xi ) − pn−1 (xi )]
(xn − x0 )
(xi − xn )
= f (xi ) + [f (xi ) − f (xi )] = f (xi ), i = 1, 2, . . . , n − 1
(xn − x0 )

For i=0

(x0 − xn )
RHS = q(x0 ) + [q(x0 ) − pn−1 (x0 )] = q(x0 ) − [q(x0 ) − f (x0 )]
(xn − x0 )
= f (x0 )
= pn (x0 ) = LHS

1 32
For i=n

(xn − xn )
RHS = q(xn ) + [q(xn ) − pn−1 (xn )] = q(xn ) + 0
(xn − x0 )
= q(xn )
= f (xn ) = LHS
(xi − xn )
pn (xi ) = q(xi ) + [q(xi ) − pn−1 (xi )], i = 0, 1, 2, . . . , n
(xn − x0 )
∴ the polynomials must be identical.
We now consider the coefficient of xn on both sides of (∗∗).
These coefficients must be equal.

Coefficient of xn on the LHS is f [x0 , x1 , x2 , . . . , xn ]

coefficient of xn on the RHS is

" #
1
f [x1 , x2 , . . . , xn ] − f [x0 , x1 , x2 , . . . , xn−1 ]
(xn − x0 )

f [x1 , x2 , . . . , xn ] − f [x0 , x1 , x2 , . . . , xn−1 ]


f [x0 , x1 , x2 , . . . , xn ] =
xn − x0
We now show that the divided difference is a symmetric function of its arguments.
Thus, if (z0 , z1 , z2 , . . . , zn ) is a permutation of (x0 , x1 , x2 , . . . , xn ), then

f [z0 , z1 , . . . , zn ] = f [x0 , x1 , x2 , . . . , xn ] → (+)

The divided difference f [z0 , z1 , . . . , zn ] on the LHS of (+) is the coefficient of xn in the
interpolating polynomial of degree atmost n interpolating f at the points z0 , z1 , . . . , zn .

The divided difference on the right, f [x0 , x1 , x2 , . . . , xn ] is the coefficient of xn in the poly-
nomial of degree atmost n that interpolates f at the points x0 , x1 , x2 , . . . , xn .

These polynomials are the same and ∴

f [x0 , x1 , x2 , . . . , xn ] = f [z0 , z1 , z2 , . . . , zn ]

We now obtain error in polynomial interpolation in terms of divided difference.

Let pn be the polynomial of degree atmost n that interpolates a function f at a set of


6 xi , i = 0, 1, . . . , n ( that is, t is a point
(n + 1) distinct points x0 , x1 , x2 , . . . , xn . If t =
different from the nodes), then
n
Y
f (t) − pn (t) = f [x0 , x1 , . . . , xn , t] (t − xj )
j=0

2 33
Where
n
Y
(t − xj ) = (t − x0 )(t − x1 ) . . . . . . . . . (t − xn )
j=0

Proof: Let q be the polynomial of degree atmost (n + 1) that interpolates f at the nodes
x0 , x1 , . . . , xn , t.

We know that q is obtained from pn by adding one term

q(x) = pn (x) + f [x0 , x1 , . . . , xn , t](x − x0 )(x − x1 ) . . . (x − xn )


q(t) = f (t)

putting x=t, we have

f (t) = pn (t) + f [x0 , x1 , . . . , xn , t](t − x0 )(t − x1 ) . . . (t − xn )


Y n
= pn (t) + f [x0 , x1 , . . . , xn , t] (t − xj )
j=0


n
Y
f (t) − pn (t) = f [x0 , x1 , . . . , xn , t] (t − xj )
j=0

We now show that if f is n times differentiable on [a, b], and if x0 , x1 , . . . , xn are distinct
points in [a, b], then there exists a point ξ in (a, b) such that

1 (n)
f [x0 , x1 , x2 , . . . , xn ] = f (ξ)
n!
Proof: Let pn−1 be the polynomial of degree atmost (n − 1) that interpolates f at the
nodes x0 , x1 , . . . , xn−1 .

Then, there exists a point ξ ∈ (a, b) such that


n−1
1 (n) Y
f (xn ) − pn−1 (xn ) = f (ξ) (xn − xj )
n! j=0

1
Qn
(this is the error in interpolation that we had obtained earlier; f (x)−pn (x) = (n+1)!
f (n+1) (ξ) j=0 (x−
xj ). Here x = xn ).

By the previous result,


n−1
Y
f (xn ) − pn−1 (xn ) = f [x0 , x1 , . . . , xn ] (xn − xj )
j=0

3 34
∴ comparing, we get,

1 (n)
f [x0 , x1 , . . . , xn ] = f (ξ)
n!
Problem1: A function is known at three points x1 , x2 , x3 in the vicinity of an extreme
point x0 . Show that x0 can be approximated by
x1 + 2x2 + x3 {f [x1 , x2 ] + f [x2 + x3 ]}
x0 ' −
4 4f [x1 , x2 , x3 ]

x f 1st order 2nd order


x1 f (x1 )
f [x1 , x2 ]
x2 f (x2 ) f [x1 , x2 , x3 ]
f [x2 , x3 ]
x3 f (x3 )
Now

f (x) = f (x1 ) + (x − x1 )f [x1 , x2 ] + (x − x1 )(x − x2 )f [x1 , x2 , x3 ]

At an extreme points, f 0 (x) = 0

⇒ 0 = f [x1 , x2 ] + [2x − (x1 + x2 )]f [x1 , x2 , x3 ]


x1 + x2 f [x1 , x2 ]
⇒x= − → (1)
2 2f [x1 , x2 , x3 ]

Also,

f (x) = f (x3 ) + (x − x3 )f [x2 , x3 ] + (x − x2 )(x − x3 )f [x1 , x2 , x3 ]

At extreme point, f’(x)=0

⇒ 0 = f [x2 , x3 ] + [2x − (x2 + x3 )]f [x1 , x2 , x3 ]


x2 + x3 f [x2 , x3 ]
⇒x= − → (2)
2 2f [x1 , x2 , x3 ]

(1)+(2)
2
gives

x1 + 2x2 + x3 {f [x1 , x2 ] + f [x2 , x3 ]}


x= −
4 4f [x1 , x2 , x3 ]
Inverse Interpolation provides a powerful way to find zeros of functions.

Let the function be y = f (x) for which we want to find the zeros. Suppose y = f (x) is
tabulated at a set of discrete points (need not be equally spaced), say

x x0 x1 x2 ... xn
y = f (x) f (x0 ) f (x1 ) f (x2 ) ... f (xn )

4 35
Let us suppose that on the interval [x0 , xn ], f (x) satisfies the condition f 0 (x) 6= 0, so
that we can write x = g(y), where g = f −1 .

∴ finding the value of g(0) is equivalent to finding a zero of f (x).

To estimate g(0), we write the above table as

y f (x0 ) f (x1 ) f (x2 ) ... f (xn )


x = g(y) x0 x1 x2 ... xn

In terms of polynomial interpolation, let f (x0 ), f (x1 ), . . . , f (xn ) be the tabulated values
of the independent variable y (in general not equally spaced) and let x0 , x1 , . . . , xn be the
function values at these points.

We can ∴ find Lagrange interpolation polynomial that approximates g(y). Then, inter-
polate at the point y = 0, then we get the desired approximation to p = g(0).

The following example demonstrates how inverse interpolation is used to finding a root
of f (x) = 0.
Problem2: The equation x3 − 15x + 4 = 0 has a root close to 0.3. Obtain this root with
four decimal places accuracy.

f (x) = x3 − 15x + 4
f (0.2) = (0.2)3 − 15(0.2) + 4 = 0.008 − 3.0 + 4 = 1.008 > 0
f (0.3) = (0.3)3 − 15(0.3) + 4 = 0.027 − 4.5 + 4 = −0.473 < 0

function changes sign between x = 0.2 and x = 0.3


⇒ f vanishes at some p ∈ (0.2, 0.3)

∴ We want to find that x = p at which f (p) = 0.

Taking nodes as f (x0 ), f (x1 ) where

f (x0 ) = f (0.2) = 1.008 = y0


f (x1 ) = f (0.3) = −0.473 = y1

We seek an interpolating polynomial (linear) such that it interpolates the function at points
(y0 , 0.2) and (y1 , 0.3).

We seek a linear polynomial x = g(y).

Once it is determined, then we find x = p such that f (p) = 0. Then x = p is a root of


f (x) = 0. The problem is solved using linear inverse interpolation and we shall continue
with this in the next class.

5 36
Numerical Analysis
Lecture-9

Inverse Interpolation, Remarks on Polynomial Interpo-


lation
We use Lagrange linear interpolation polynomial that interpolates the function at a set of
two points (y0 , x0 ) and (y1 , x1 ) in the problem considered in the previous class

y = x3 − 15x + 4
y(0.2) = 1.008
y(0.3) = −0.473

y + 0.473 y − 1.008
y = g(y) = (0.2) + (0.3)
1.008 + 0.473 −0.473 − 1.008
(0.473)(0.2) (−1.008)(0.3)
= +
1.481 −1.481
0.0946 + 0.3024 0.3970
= =
1.481 1.481
= 0.26806217

A root of the equation x3 − 15x + 4 = 0 that lies in the vicinity o 0.3 is obtained as
p = 0.26806217 using the ideas of inverse interpolation.

In what follows, we explain briefly what is inverse interpolation

Inverse Interpolation to determine a zero of f

Suppose f ∈ C | [a, b], f 0 (x) 6= 0 on [a, b] and f has one zero ’p’ in [a, b].

Let x0 , x1 , x2 , . . . , xn be (n+1) distinct numbers in [a, b], with f (xk ) = yk ,k = 0, 1, 2, . . . , n

To approximate ’p’, construct the interpolation polynomial of degree atmost n on the


nodes y0 , y1 , y2 , . . . , yn for f −1 (the inveres of f )

Since yk = f (xk )
and 0 = f (p)

it follows that f −1(yk ) = xk and p = f −1 (0)

Error in linear inverse interpolation

1 37
If y = f (x), and f 0 (x) 6= 0 for x0 < x < x1 , then show that the error in linear inverse in-
00
terpolation based on corresponding values (x0 , y0 ) and (x1 , y1 ) is given by −(y−y2[f
0 )(y−y1 )f (r)
0 (r)]3 ,
x0 < ξ < x1
Let g(y) be the interpolating polynomial

x = g(y) = g(f (x))

Differentiate with respect to x

1 = g 0 (f (x)).f 0 (x)
= g 0 (y)f 0 (x)
1
⇒ g 0 (y) = 0 → (1)
f (x)

Differentiate (1) again with respect to x,

0 = g 00 (y)f 0 (x)f 0 (x) + g 0 (y)f 00 (x)


−f 00 (x)g 0 (y) −f 00 (x) 1
⇒ g 00 (y) = 0 = (∵ g 0
(y) = )
f (x)f 0 (x) f 0 (x)f 0 (x)f 0 (x) f 0 (x)

But, by error in the interpolation (linear)

(y − y0 )(y − y1 ) 00
f −1 (y) − g(y) = g (ξ), ξ ∈ (y0 , y1 )
2!
−(y − y0 )(y − y1 ) f 00 (ξ)
=
2 [f 0 (ξ)]3
In inverse interpolation, the values of the independent variable is determined when the
values of the dependent variable y at that x takes on a prescribed value.

This problem arises mainly in solution of nonlinear equation.

Note: It is important to ensure that the inverse function exists and is single-valued.
Otherwise, interpolation may provide completely useless results.

Remarks on Polynomial Interpoaltion

If the table of function values is sparse, it may not be possible to obtain the desired
degree of accuracy by interpolation, no matter how many terms are included.

The indication is that the terms in interpolation formula do not decrease very fast as
we keep adding more terms.

Another indication is that one may observe a strong variation in the higher order differ-
ences.

2 38
There is no conclusive test for convergence.

Consider f (x) = log(1 + x) on [0,1]

n!
|f ( n + 1)(ξ)| = (1+ξ)n+1
< n! on (0,1)

n!
So,|error in interpolation of f (x) by a polynomial of degree atmost n | ≤ |Πn+1 (x)| (n+1)! ≤
1
n+1
(∵ |x − xk | ≤ 1 for each x, xk for k = 0, 1, . . . , n in [0,1]

⇒ |Πn+1 (x)| ≤ 1

In fact, for many x, e.g for x = 12 ,

Πn+1 ( 21 ) ≤ 1
2n+1
as | 12 − xk | ≤ 1
2

This shows the important fact that the error can be large at the end points, an effect
known as ’Runge phenomenon’.

This is a famous example due to Runge, where the error from the interpolation poly-
1
nomial approximation to f (x) = 1+x 2 , for (n + 1) equally spaced points on [-5,5] diverges

near ± 5 as n → ∞ since high degree polynomials tend to have larger oscillations. Runge
discovered that oscillations near the end points of an interval tended to grow with the order
of the polynomial in certain cases and this is known as Runge’s phenomenon.

Let us try to understand the Runge phenomenon by seeking an answer to the following
questions.

Does a sequence {pn } of interpolation polynomials for a continous function f converges


to f as n → ∞ ?

We consider the sequence of Lagrange interpolation polynomials pn , n = 0, 1, 2, . . . with


1
equally spaced points on [-5, 5] to f (x) = 1+x 2 , x ∈[-5, 5]. The table below presents the

characteristic behaviour displayed by the sequence of interpolation polynomials pn .

n 2 4 6 8 10 12 14 16 18 20 22 24
ME 0.65 0.44 0.61 1.04 1.92 3.66 7.15 14.25 28.74 58.59 121.02 252.78

ME = maximum error= maximum difference between f (x) and pn (x) for −5 ≤ x ≤ 5


for values of n from 2 to 24. Maximum error increases exponentially as n increases.

p10 (x) with equally spaced points xj = −5 + j, j = 0, 1, . . . , 10 has a local maximum


near ± 5 and this maximum grows exponentially as the degree n increases.

1
Note that the function f (x) = 1+x 2 is well behaved on [-5,5]; All its derivates are conti-

nous and bounded for all x ∈ [-5,5]. But, the sequence {pn } of interpolation polynomials is
such that

3 39
lim maxx∈[−5,5] |f (x) − pn (x)| 6= 0
n→∞

This example is taken from Numerical Analysis book by Suli & Mayers.

We now present some examples where the results are totally unreliable but there are no
indications.

1. Consider a table of sin πx, for x = a + 2i, i = 0, 1, 2, . . .


x sin πx
a sin πa
a+2 sin π(a + 2) = sin(πa + 2π) = sin πa
a+4 sin π(a + 4) = sin(πa + 4π) = sin πa
. .
. .
. .
. .

In this case, all order differences are zero and interpolation yields the value sin πa (con-
stant polynomial).
P20 P20
2. Consider the function f (x) = n=1 an sin(2πnx) and g(x) = n=1 an +a10+n sin(2πnx)

i
Let xi = 10
,i = 0, 1, . . .

The table of values of f (x) and g(x) at these xi , i = 0, 1, 2, . . . are identical.

We get the same interpolation polynomial for the two functions f (x) and g(x) but the
two functions are completely different.

The above examples clearly demonstrate that unless we have an idea of the typical scales
at which the function changes significantly, interpolation could be useless.

The given table of values may be smooth, even though the function itself may not be.

When the function is known only through a few experimental measurements, it is nearly
impossible to give any reliable estimate of truncation error.

In such cases, one has to resort to more general class of approximations which are
obtained by minimizing some reasonable norm of TE in approximation, where TE is the
truncation error.

This is beyond the scope of the present course on Basic Numerical Analysis and we
conclude the discussions on Interpolation with the above remarks.

4 40
Numerical Analysis
Lecture-10

1 Numerical Differentiation
We now consider numerical approximation of derivatives.
Suppose that we are given values of a function at a discrete set of points (result of some
measurement or computer experiment). We can not get an exact formula for the derivative
f 0 (x) = lim4x−→0 f (x+4x)−f
4x
(x)
, since this involves an infinite number of computations or
measurements with smaller and smaller 4x. However, we can get an appropriate formula
for the derivative f 0 (x) at some point x, by fixing one small 4x; f 0 (x) ' f (x+4x)−f
4x
(x)
.
We demonstrate how we can generate numerical differentiation formulas from (i) inter-
polation polynomials, (ii) using Taylor series with remainder and (iii) using the method of
undetermined coefficients.
If the values of a function f are given at a set of discrete points x0 , x1 , ..., xn or f (x)
is known on some closed interval [a, b], then can this information be used to estimate a
derivative f 0 (c)?
We know that f 0 (x) = limh−→0 f (x+h)−f h
(x)
. A formula for numerical differentiation
emerges directly from the above definition, namely,
f (x + h) − f (x)
f 0 (x) ' ............(1)
h
where we assume that h > 0.
What do we understand when we say that the expression on the right hand side of (1) is
an approximation to the derivative?
For linear functions, f (x) = ax+b, (1) is exact, since LHS=f 0 (x) = a and RHS= f (x+h)−f
h
(x)
=
a(x+h)+b−(ax+b)
h
= a and therefore (1) is actually an exact expression for the derivative and it
yields the correct value of f 0 (x) for any nonzero value of h. For almost all other functions,
(1) is not the exact derivative.
Let us assess the error involved in this formula for numerical differentiation.
From Taylor’s theorem,

h2 00
f (x + h) = f (x) + hf 0 (x) + f (ξ), x < ξ < x + h............(2)
2
For validity of (2), f and f 0 should be continuous on [x, x + h], f 0 exists in (x, x + h). (2)
means that we can now replace the approximation (1) with an exact formula

f (x + h) − f (x) h 00
f 0 (x) = − f (ξ).............(3)
h 2
(3) gives a numerical differentiation formula along with the error term and gives us the
information on the class of functions for which the estimate is valid.
This is provided by the error term h2 f 00 (ξ). Note that as h −→ 0, the error term goes to
zero. The rapidity of this convergence depends on the power of h.
The factor f 00 (ξ) in the error term tells us what smoothness class, the function must
belong to, so that the estimate is valid.

1 41
Example 1.1. How accurate our result is if we use (3) to compute derivative of cos x at
x = π4 with h = 0.01?

f (x + h) − f (x) 0.700000476 − 0.707106781


f 0 (x) = =
h 0.01
= −0.71063051

The error term is | h2 f 00 (ξ)| = 0.005 |cos ξ| ≤ 0.005.

Numerical differentiation formulas are very useful in obtaining numerical solution of


differential equations. The derivatives in the differential equations are replaced by their
approximations.
We now derive such approximations to various order derivatives.

h2 00 h3
f (x + h) = f (x) + hf 0 (x) + f (x) + f 000 (ξ1 )
2 6
2 3
h h
f (x − h) = f (x) ∗ hf 0 (x) + f 00 (x) − f 000 (ξ2 )
2 6
2
Subtracting, f (x+h)−f 2h
(x−h)
− h12 [f 000 (ξ1 ) + f 000 (ξ2 )] = f 0 (x).
000
Assume that f exists and is continuous on [x − h, x + h]. Let M and m denote
the maximum and minimum values of f 000 on [x − h, x + h]. Then f 000 (ξ1 ), f 000 (ξ2 ) and
c = 21 [f 000 (ξ1 ) + f 000 (ξ2 )] all lie in [m, M ]. Since f 000 is continuous, it assumes the value ‘c’ at
some point ξ in [x − h, x + h]. Hence f 000 (ξ) = c = 21 [f 000 (ξ1 ) + f 000 (ξ2 )].

f (x + h) − f (x − h) h2 000
∴ f 0 (x) = − f (ξ).
2h 6
Truncation error ' O(h2 ).
Now,

h2 00 h3 h4
f (x + h) = f (x) + hf 0 (x) + f (x) + f 000 (x) + f (4) (ξ1 )
2 6 24
2 3
h h h4
f (x − h) = f (x) ∗ hf 0 (x) + f 00 (x) − f 000 (x) + f (4) (ξ2 )
2 6 24
Adding,

f (x + h) − 2f (x) + f (x − h) h2 (4)
f 00 (x) = − f (ξ), ξ ∈ (x − h, x + h)
h2 12
using a similar argument as above.
Similarly one can derive approximations for various order derivatives using Taylor series
with remainder.

Method of undetermined coefficients


This method is very useful when the table of function values is available with uniform
spacing and derivatives are required at the tabular points. It is a very practical way for
generating approximation of derivatives.

2 42
We assume a formula of some form for the required derivative and determine the co-
efficients in this formula by demanding that the formula is exact for polynomials upto a
certain degree.
Consider a five-point formula of the form
1
f 0 (x) = [A1 f (x − 2h) + A2 f (x − h) + A3 f (x) + A4 f (x + h) + A5 f (x + 2h)]..............(∗)
h
for calculating the first derivative.
The coefficients A1 , A2 , A3 , A4 , A5 can be determined by demanding that (∗) is exact for
polynomials of degree ≤ 4.
Since the formula should be independent of any shift in the origin, we can assume x = 0 in
(∗). Also, the coefficients should be independent of h and so we can assume h = 1.

f 0 (0) = A1 f (−2) + A2 f (−1) + A3 f (0) + A4 f (1) + A5 f (2)


f (x) = 1, f 0 (x) = 0
∴ 0 = A1 + A2 + A3 + A4 + A5 ............(1)
f (x) = x, f 0 (x) = 1 =⇒ f 0 (0) = 1
∴ 1 = −2A1 − A2 + A4 + 2A5 ................(2)
f (x) = x2 , f 0 (x) = 2x =⇒ f 0 (0) = 0
∴ 0 = 4A1 + A2 + A4 + 4A5 ...................(3)
f (x) = x3 , f 0 (x) = 3x2 =⇒ f 0 (0) = 0
∴ 0 = −8A1 − A2 + A4 + 8A5 ..................(4)
f (x) = x4 , f 0 (x) = 4x3 =⇒ f 0 (0) = 0
∴ 0 = 16A1 + A2 + A4 + 16A5 .................(5)

3 43
Solving, we get
1
A1 = −A5 =
12
2
A4 = −A2 = , A3 = 0
3
1
∴ f 0 (x) = [f (x − 2h) − 8f (x − h) + 8f (x + h) − f (x + 2h)] + E
12h
1
where, E = f 0 (x) − [f (x − 2h) − 8f (x − h) + 8f (x + h) − f (x + 2h)]
12h
1 (2h)2 00 8h3 000 (2h)4 (4) (2h)5 (5)
=− [f (x) − 2hf 0 (x) + f (x) − f (x) + f (x) − f (ξ1 )]
12h 2 6 4! 5!
h2 h3 h4 h5
− 8[f (x) − hf 0 (x) + f 00 (x) − f 000 (x) + f (4) (x) − f (5) (ξ2 )]
2! 6 4! 5!
2 3 4 5
h h h h
+ 8[f (x) + hf 0 (x) + f 00 (x) + f 000 (x) + f (4) (x) + f (5) (ξ3 )]
2! 6 4! 5!
2 3 4
(2h) 00 8h 000 (2h) (4) (2h)5 (5)
− [f (x) + 2hf 0 (x) + f (x) + f (x) + f (x) + f (ξ4 )]
2 6 4! 5!
1 1 f 000 (x) h3
= f 0 (x)[1 − (−2h − 2h + 8h + 8h)] + f 00 (x). (0) + . (−8 − 8 + 8 + 8)
12h 12h 12h 3!
f (4) (x) 1 4 48 h5 (5)
+ . .h (−16 + 8 − 8 + 16) + . f (ξ), x − 2h < ξ < x + 2h
4! 12h 5! 12h
h4
= f (5) (ξ)
30
h4
∴ E = f (5) (ξ), x − 2h < ξ < x + 2h.
30
Similarly we can derive the following formulas for calculating the derivatives at a point which
is at the center of the region covered by the points used in the formula. Such formulas are
referred to as Central Difference formulas.
f (x−h)−2f (x)+f (x+h) h2 (4)
1. f 0 (x) = h2
− 12
f (ξ),
4
2. f 00 (x) = 1
12h2
[−f (x − 2h) + 16f (x − h) − 30f (x) + 16f (x + h) − f (x + 2h)] − h90 f (6) (ξ).

If the derivatives at one end point are needed, we derive forward difference formulas as given
below.
−3f (x)+4f (x+h)−f (x+2h) h2 000
1. f 0 (x) = 2h
+ 3
f (ξ).
−11f (x)+18f (x+h)−9f (x+2h)=2f (x+3h) h3 (4)
2. f 0 (x) = 6h
− 4
f (ξ).

Find an approximation of the second derivative f 00 (x) that is based on the values of
the function at three equally spaced points f (x − h), f (x), f (x + h) by the method of
undetermined coefficients.
Let f 00 (x) ' Af (x + h) + 13f (x) + cf (x − h)....(∗)
The coefficients A, B, C are determined in such a way that (∗) is an approximation of the

4 44
second derivative.
h2 00 h3 h4
f (x + h) = f (x) + hf 0 (x) + f (x) + f 000 (x) + f (4) (ξ1 )
2 6 24
2 3
h h h4
f (x − h) = f (x) − hf 0 (x) + f 00 (x) − f 000 (x) + f (4) (ξ2 )
2 6 24
where x − h ≤ ξ2 ≤ x ≤ ξ1 ≤ x + h (assuming h > 0).

Using the expansions in (∗),

f 00 (x) ' Af (x + h) + Bf (x) + Cf (x − h)


h2
0
= (A + B + C)f (x) + h(A − C)f (x) + (A + C)f 00 (x)
2
h3 h4
+ (A − C)f 000 (x) + [Af (4) (ξ1 ) + Cf (4) (ξ2 )]..........(∗∗)
6 24
Equating the coefficients of f (x), f 0 (x), f 00 (x) on both sides of (∗∗),

A+B+C =0
A − C = 0 =⇒ A = C
2 2 1
A + C = 2 =⇒ 2A = 2 =⇒ A = 2 = C
h h h
2
∴B=− 2
h
f (x − h) − 2f (x) + f (x + h) h2 (4)
f 00 (x) = − [f (ξ1 ) + f (4) (ξ2 )]
h2 24
Assuming that f (x) has four continuous derivatives, the last two terms can be combined
using Intermediate Value Theorem:

h2 (4) h2
[f (ξ1 ) + f (4) (ξ2 )] = f (4) (ξ), ξ ∈ (x − h, x + h)
24 24
∴ We have second-order approximation of the second derivative

f (x − h) − 2f (x) + f (x + h) h2 (4)
f 00 (x) = − f (ξ1 )
h2 24
In the next lecture, we demonstrate how we can generate numerical differentiation formulas
from polynomial interpolants that interpolate at a set of given discrete points. An approxi-
mation of the derivative at any point can then be obtained by differentiating the polynomial
interpolant.

5 45
Numerical Analysis
Lecture-11

1 Numerical Differentiation (Interpolation Polynomi-


als)
We can use polynomial interpolation to derive formulas for approximating the derivative.
Since polynomials can approximate more complicated functions reasonably well, we expect
that the derivative of a polynomial interpolant might approximate the derivative of the
complicated function reasonably well.

From the Lagrange interpolation polynomial pn , defined by


n
X
pn (x) = Lk (x)f (xk ),
k=0

(xi , i = 0, 1, ..., n are the nodes), which is an approximation to f .

One can obtain the polynomial p0n , which is an approximation to the derivative f 0 .
The polynomial p0n is given by
n
X
p0n (x) = L0k (x)f (xk ), n ≥ 1
k=0

The degree of p0n is atmost n − 1. p0n is a linear combination of derivatives of the polynomial
Lk , k = 0, 1, 2, ..., n, with coefficients as the values of f at the interpolation points xk , k =
0, 1, 2, ..., n.
What is f 0 (x) − p0n (x)?
(n+1) (ξ)π
f (x) − pn (x) = f (n+1)! n+1 (x)
, x0 < ξ < xn , where πn+1 (x) = (x − x0 )(x − x1 )...(x − xn ).
Therefore,

d f (n+1) (ξ)πn+1 (x)


 
0 0
f (x) − pn (x) =
dx (n + 1)!
(n+1)
f (ξ(x)) 0 πn+1 (x) d (n+1)
= πn+1 (x) + [f (ξ(x))]..........(∗)
(n + 1)! (n + 1)! dx
Since the dependence of ξ on x is not known, the second term can not be estimated. The
following theorem gives us an alternate approach.

Theorem 1.1. Let n ≥ 1 and suppose that f is a real-valued function defined and continuous
on [a, b] such that f (n+1) is continuous on [a, b]. Suppose further that xi , i = 0, 1, ..., n are
distinct points in [a, b] and pn is the Lagrange interpolation polynomial of degree atmost n
for f defined by these points. Then, there exists distinct points ηi , i = 0, 1, ..., n in (a, b) and
corresponding to each x in [a, b], there exists ξ = ξ(x) in (a, b) such that f 0 (x) − p0n (x) =
f (n+1) (ξ) ∗
n!
πn (x), where πn∗ (x) = (x − η1 )(x − η2 )...(x − ηn ).

1 46
Proof. Since f (xi ) − pn (xi ) = 0, i = 0, 1, ..., n, there exists a point ηi in (xi−1 , xi ) at which
f 0 (ηi ) − p0n (ηi ) = 0 for each i = 1, 2, ..., n. This defines the n points ηi , i = 1, 2, ..., n. Now,
when x = ηi for some i ∈ {1, 2, ..., n}, both sides of (*) are zero.
Let x be fixed. x ∈ [a, b].
0 0 (x)]
Define χ(t) = f 0 (t) − p0n (t) − [f (x)−p n
π ∗ (x)
πn∗ (t). χ(ηi ) = 0, i = 1, 2, ..., n.
n
0 0
Also, χ(x) = f 0 (x) − p0n (x) − [f (x)−p n (x)] ∗
∗ (x)
πn
πn (x) = 0. Thus χ vanishes at (n + 1) points ηi and
x, i = 1, 2, ..., n. By successively0 applying
0 (x)
Rolle’s Theorem, χ(n) vanishes at some ξ.
[f (x)−p
But χ(n) (t) = f (n+1) (t) − 0 − π∗ (x)n .n!.
n
0 f (n+1) (ξ) ∗
∴χ (n)
(ξ) = 0 ⇒ f (x)−p0n (x) = n!
πn (x), where πn∗ (x) = (x−η1 )(x−η2 )...(x−ηn ).
Mn+1
Corollary. Under the conditions of the above theorem, |f 0 (x) − p0n (x)| ≤ n!
|πn∗ (x)| ≤
(b−a)n
n!
Mn+1 for all x ∈ [a, b], where maxx∈[a,b] |f (n+1) (x)| = Mn+1 .
Proof.

f (n+1) (ξ) ∗
f 0 (x) − p0n (x) = πn (x)
n!
|f (n+1) (ξ)| ∗
∴ |f 0 (x) − p0n (x)| ≤ |πn (x)|
n!
Mn+1
= |(x − η1 )(x − η2 )...(x − ηn )|
n!
Mn+1
≤ (b − a)(b − a)...(b − a)
n!
Mn+1
⇒ |f 0 (x) − p0n (x)| ≤ (b − a)n .
n!

Deduction: If f and all its derivatives are defined and continuous on [a, b], and
limn−→∞ Mn!n+1
(b − a)n = 0, then limn−→∞ maxx∈[a,b] |f 0 (x) − p0n (x)| = 0, showing the conver-
gence of the sequence of interpolation polynomials (p0n ) to f 0 , uniformly on [a, b].

Consider (∗).
(n+1) (ξ(x))
f (x) − p0n (x) = f (n+1)!
0 0
πn+1 (x) + π(n+1)!
n+1 (x) d
dx
[f (n+1) (ξ(x))]. If this error is estimated at
the nodes xi , i = 0, 1, ..., n, we see that ∵ πn+1 (xi ) = 0, i = 0, 1, ..., n, f 0 (xi ) − p0n (xi ) =
0
πn+1 (xi ) (n+1)
(n+1)!
f (ξ(xi )).
We illustrate this by using a quadratic interpolation polynomial that interpolates f at
(x0 , f (x0 )), (x1 , f (x1 )), (x2 , f (x2 )).
f (x) ' p2 (x) = L0 (x)f (x0 ) + L1 (x)f (x1 ) + L2 (x)f (x2 ), ∵ f 0 (xi ) ' p02 (xi ) = L00 (xi )f (x0 ) +
L01 (xi )f (x1 ) + L02 (xi )f (x2 ), i = 0, 1, 2,

(x − x1 )(x − x2 )
where L0 (x) =
(x0 − x1 )(x0 − x2 )
(x − x0 )(x − x2 )
L1 (x) =
(x1 − x0 )(x1 − x2 )
(x − x0 )(x − x1 )
L2 (x) =
(x2 − x0 )(x2 − x1 )

2 47
2x − (x1 + x2 )
L00 (x) =
(x0 − x1 )(x0 − x2 )
2x0 − (x1 + x2 ) 2x0 − (x0 + h + x0 + 2h) −3h −3
∴ L00 (x0 ) = = = 2
=
(x0 − x1 )(x0 − x2 ) (−h)(−2h) 2h 2h
2x 1 − x 1 − x 2 −h −1
L00 (x1 ) = 2
= 2 =
2h 2h 2h
0 2x2 − x1 − x2 h 1
L0 (x2 ) = 2
= 2 =
2h 2h 2h
(x−x0 )(x−x2 ) 2x−(x0 +x2 )
L1 (x) = (x1 −x0 )(x1 −x2 )
; L01 (x) = h(−h)

2x0 − (x0 + x2 ) −2h 2


∴ L01 (x0 ) = 2
= 2
=
−h −h h
2x 1 − (x0 + x 2 ) 2x 0 + 2h − x0 − x0 − 2h
L01 (x1 ) = = =0
−h2 −h2
2x2 − (x0 + x2 ) 2h −2
L01 (x2 ) = 2
= 2
=
−h −h h
(x−x0 )(x−x1 ) 2x−(x0 +x1 )
L2 (x) = (x2 −x0 )(x2 −x1 )
; L02 (x) = 2h2

2x0 − (x0 + x1 ) −h −1
∴ L02 (x0 ) = = =
2h2 2h2 2h
2x 1 − (x 0 + x 1 ) h 1
L02 (x1 ) = 2
= 2 =
2h 2h 2h
2x 2 − (x 0 + x 1 ) 2x 0 + 4h − x0 − x0 − h 3h 3
L02 (x2 ) = 2
= 2
= 2 =
2h 2h 2h 2h
Now error in derivative is f 0 (xi ) − p02 (xi ) = 1 000
3!
f (ξi )π30 (xi ).

π3 (x) = (x − x0 )(x − x1 )(x − x2 )


= (x2 − x(x0 + x1 ) + x0 x1 )(x − x2 )
= x3 − x2 (x0 + x1 ) + x0 x1 x − x2 x2 + x2 (x0 + x1 )x − x0 x1 x2
= x3 − x2 (x0 + x1 + x2 ) + x(x0 x1 + x0 x2 + x1 x2 ) − x0 x1 x2
π30 (x) = 3x2 − 2x(x0 + x1 + x2 ) + x0 x1 + x0 x2 + x1 x2
= 3x2 − 2x(3x0 + 3h) + 3x20 + 6hx0 + 2h2
π30 (x0 ) = 2h2
π30 (x1 ) = −h2
π30 (x2 ) = 2h2 .

3 48
1 0
∴ f 0 (x0 ) = p02 (x0 ) + π (x0 )f 000 (ξ(x0 ))
3! 3
1
= L00 (x0 )f (x0 ) + L01 (x0 )f (x1 ) + L02 (x0 )f (x2 ) + π30 (x0 )f 000 (ξ(x0 ))
6
−3 2 1
= f (x0 ) + f (x1 ) − f (x2 ) + π30 (x0 )f 000 (ξ(x0 ))
2h h 2h
1 2h2 000
= [−3f (x0 ) + 4f (x1 ) − f (x2 )] + f (ξ(x0 ))
2h 6
1 h2
∴ f 0 (x0 ) = [−3f (x0 ) + 4f (x1 ) − f (x2 )] + f 000 (ξ(x0 ))
2h 3
1
Similarly, f 0 (x1 ) = p02 (x1 ) + π30 (x1 )f 000 (ξ(x1 ))
3!
1
= L00 (x1 )f (x0 ) + L01 (x1 )f (x1 ) + L02 (x1 )f (x2 ) + π30 (x1 )f 000 (ξ(x1 ))
6
−1 1 h2 000
= f (x0 ) + (0)f (x1 ) + f (x2 ) − f (ξ(x1 ))
2h 2h 6
1 h2 000
= [f (x2 − f (x0 ))] − f (ξ(x1 ))
2h 6
1
f 0 (x2 ) = p02 (x2 ) + π30 (x2 )f 000 (ξ(x2 ))
3!
1
= L00 (x2 )f (x0 ) + L01 (x2 )f (x1 ) + L02 (x2 )f (x2 ) + π30 (x2 )f 000 (ξ(x2 ))
6
1 2 3 2h2 000
= f (x0 ) − f (x1 ) + f (x2 ) + f (ξ(x2 ))
2h h 2h 6
1 h2
= [f (x0 ) − 4f (x1 ) + 3f (x2 )] + f 000 (ξ(x2 )).
2h 3
If p2 (x) is the quadratic interpolating polynomial that interpolates the function f at a set
of discrete points (x0 , f (x0 )), (x1 , f (x1 )) and (x2 , f (x2 )) where xi = x0 + ih, i = 0, 1, 2, then
an estimate for f 0 at the nodes are given as
2h2 000
1. f 0 (x0 ) = 1
2h
[−3f (x0 ) + 4f (x1 ) − f (x2 )] + 6
f (ξ(x0 ))
h2 000
2. f 0 (x1 ) = 1
2h
[f (x2 − f (x0 ))] − 6
f (ξ(x1 ))
h2 000
3. f 0 (x2 ) = 1
2h
[f (x0 ) − 4f (x1 ) + 3f (x2 )] + 3
f (ξ(x2 )).
Remark. Given any continuous function defined on a closed interval, there exists a polyno-
mial that is arbitrarily close to the function at every point in that inerval and this is one
of the reasons for using algebraic polynomials to approximate an arbitrary set of data. In
addition, the derivatives of polynomials can be obtained and evaluated easily.
Note. that from (3) above (3∗ ) can be obtained by simply replacing x0 + 2h as x0 :
2
(3∗ ) : f 0 (x0 ) = 2h
1
[f (x0 − 2h) − 4f (x0 − h) + 3f (x0 )] + h3 f 000 (ξ(x2 )).
Similarly, (2∗ ) can be obtained from (2) by replacing x0 + h by x0 :
2
(2∗ ) : f 0 (x0 ) = 2h
1
[f (x0 + h) − f (x0 − h))] − h6 f 000 (ξ(x1 )).
Also, (3∗ ) can be obtained from (1) by replacing h by −h. Therefore, we have the following
two formulas : (1), (2∗ ).

4 49
• Three-point End Point Formula:
h2 000
f 0 (x0 ) = 2h
1
[f (x0 − 2h) − 4f (x0 − h) + 3f (x0 )] + 3
f (ξ(x2 )).

• Three-point Midpoint Formula:


h2 000
f 0 (x0 ) = 2h
1
[f (x0 + h) − f (x0 − h))] − 6
f (ξ(x1 )).

The errors in the above two formulas are of O(h2 ). The first is useful near the ends of an
interval because one may not have information about ‘f ’ outside the interval.
Find f 0 (0.25) from the following data using the interpolating polynomial based on di-
vided differences.

x f (x) 1st order 2nd order 3rd order


0.15 0.1761
2.4350
0.21 0.3222 -5.7500
1.9750 15.6250
0.23 0.3617 -3.8750
1.7425
0.27 0.4314

f (x) ' p3 (x)


= 0.1760 + 2.4350(x − x0 ) − 5.7500(x − x0 )(x − x1 ) + 15.6250(x − x0 )(x − x1 )(x − x2 )
0
f (x) = 2.4350 − 5.7500(2x − (x0 + x1 )) + 15.6250[(x − x1 )(x − x2 ) + (x − x0 )(x − x2 ) + (x − x0 )(x − x1 )]
0
f (0.25) = 2.4350 − 5.75[0.5 − 0.36] + 15.6250[(0.25 − 0.21)(0.25 − 0.23) + (0.25 − 0.15)(0.25 − 0.21) + (0.25 −
= 2.4350 − 0.805 + 0.10625
= 1.7363.

5 50
Numerical Analysis
Lecture-12

1 Numerical Differentiation (Operator Method); Nu-


merical Integration
Differential Operator
d
D≡ dx
, differential operator.

Df (x) = f 0 (x)
D2 f (x) = f 00 (x)
..
.

E[f (x)] = f (x + h), E − shif t operator


h2
= f (x) + hf 0 (x) + f 00 (x) + ...
2
h2
= f (x) + hDf (x) + D2 f (x) + ...
2
2
(hD)
= (1 + hD + + ...)f (x)
2!
= ehD [f (x)].
∴ hD = ln E.
∴ hD = ln(1 + 4)
1 42 43 44
D = [4 − + − + ...]
h 2 3 4
hD = ln E = ln(1 + 4)
= ln(1 − 5)−1
= −ln(1 − 5)
1 52 53
D = [5 + + + ...]
h 2 3
1 42 f (x0 ) 43 f (x0 )
∴ Df (x0 ) = [4f (x0 ) − + − ...]
h 2 3
df df
= |x=x0 = (x0 ).
dx dx
Similarly, we can get

1 52 53
D2 f (x0 ) = [5 + + + ...]2 f (x0 )
h2 2 3
1 11 4 5 5
= 2 [42 − 43 + 4 − 4 −...]f (x0 )
h 12 6

1 51
Using backward difference operator 5,
1 52 53
D= [5 + + + ...]
h 2 3
1 52 53
D2 = 2 [5 + + + ...]2
h 2 3
1 11 4 5 5
= 2 [52 + 53 + 5 + 5 +...]
h 12 6
1 11 4 5
D2 f (xn ) = 2 [52 f (xn ) + 53 f (xn ) + 5 f (xn ) + 55 f (xn ) + ...]
h 12 6
00 0
Compute f (0), f (0.2) from the given data.
x f (x) 4f (x) 42 f 43 f 44 f 45 f
0.0 1.00
0.16
0.2 1.16 2.24
2.40 5.76
0.4 3.56 8.00 3.84
10.40 9.60 0
0.6 13.96 17.60 3.84
28.00 13.44
0.8 41.96 31.04
59.04
1.0 101.00

1 2 11 4 5
D2 f (x) = 2
[4 f (x) − 43 f (x) + 4 f (x) − 45 f (x)]
h 12 6
1 11 5
∴ f 00 (0) = [2.24 − 5.76 + (3.84) − (0)] = 0
(0.2)2 12 6
1 4 f (x) 4 f (x) 44 f (x)
2 3
Df (x) = [4f (x) − + − ]
h 2 2 4
1 8.00 9.60 3.84
∴ f 0 (0.2) = [2.40 − + − ] = 3.2
0.2 2 3 4

Remarks on Numerical Differentiation


Numerical differentiation methods are unstable when the function values are polluted by
rounding errors. Unless the original data is known to be very accurate, we cannot expect
to achieve good accuracy. This can be illustrated by an example.

f is a real-valued function defined and continuous and differentiable on [−h, h], h > 0.
f is reconstructed by a linear polynomial that interpolates f at (−h, f (−h)) and (h, f (h)).

f (h) − f (−h)
p1 (x) = f (−h) + [x − (−h)]
2h
f (h) − f (−h)
= f (−h) + (x + h)
2h
f (h) − f (−h)
∴ p01 (x) =
2h

2 52
p01 (x) is a polynomial of degree zero representing an approximation to f 0 (x) at any x ∈
[−h, h] and in particular to f 0 (0).
Due to round-off errors, the information that we have is not the exact values f (h), f (−h)
but f (h) + 2 , f (−h) + 1 , where 1 , 2 are unknowns.
∴ we can only calculate [f (h)+2 ]−[f
2h
(−h)+1 ]
= f (h)−f
2h
(−h) −1
+ 22h . limh−→0 f (h)−f
2h
(−h)
−→ f 0 (0),
2 −1
2 − 1 is non-zero and fixed and ∴ 2h −→ ∞ as h −→ 0. Therefore if h is too small
as compared to |2 − 1 |, then our approximation to f 0 (0) will be affected by a large error
of size |22h
−1 |
. If h is large, then f (h)−f
2h
(−h)
will be a poor approximation to f 0 (0). This
indicates the existence of optimal ‘h’ depending on the size of the rounding error, for which
the difference between f 0 (0) and its approximation is the smallest. Thus care must be taken
in using numerical differentiation methods when rounding errors are present in the available
data.

We now try to understand why it is important to take care of round-off error while
approximating derivatives.

Let us consider the three-point mid-point formula:

1 h2
f 0 (x0 ) = [f (x0 + h) − f (x0 − h)] − f 000 (ξ1 ).
2h 6
Let e(x0 + h) and e(x0 − h) be round-off errors in evaluating f (x0 + h) and f (x0 − h)
respectively. The true values f (x0 + h) and f (x0 − h) are such that

f (x0 + h) = f ∗ (x0 + h) + e(x0 + h)


f (x0 − h) = f ∗ (x0 − h) + e(x0 − h)

where f ∗ (x0 + h), f ∗ (x0 − h) are the values that we use in our computations.
Thus the total error in the approximation is
 ∗
f (x0 + h) − f ∗ (x0 − h) e(x0 + h) − e(x0 − h) h2 000

0
f (x0 ) − = − f (ξ1 )
2h 2h 6
= round − of f error + truncation error

Let |f (3) | ≤ M, hM > 0 and |e(x0i+ h)| < , |e(x0 − h)| < ,  > 0.
∗ ∗ (x −h) 2
Then |f 0 (x0 ) − f (x0 +h)−f
2h
0
| ≤ h + h6 M.....(∗).
2
In order to reduce the truncation error, h6 M , we have to reduce h. But as h is reduced, the
round-off error h grows.
In practice, therefore, if we let h to be very small, then in that case the round-off error will
dominate during the computations, thus reducing the step-size will not always improve the
approximation.
d
Example 1.1. (**) True value of cos(0.900) = 0.62161. If we use cos x = dx (sin x) =
d 0
dx
f (x) with f (x) = sin x and approximate cos(0.900) using (∗) above, then f (0.900) =
f (0.900+h)−f (0.900−h)
cos(0.900) ' 2h
where f (x) = sin x.
This example is from Burden and Faires book.
2
Let (h) = h + h6 M .
d 3 13
dh
= − h2 + 2h
6
M = 0, when h2 = 2h 6
M or h3 = M3
or h = ( M ) .

3 53
d2 
dh2
= h23 + 31 M = h23 + 13 h33 (∵ M = h33 ) = h33 > 0.
3 13
∴ (h) has a minimum at h( M ) .
000
Now maxx∈[0.80,1.00] |f (x)| = maxx∈[0.80,1.00] |cos x| = cos(0.8) ' 0.69671.
−6 1
If  = 5 × 10−6 , then the optimal choice of h is approximately h = ( 3×5×10
0.69671
) 3 ' 0.028.

Remark. In practice, we will not be able to compute an optimal h so that it can be used
in approximating the derivative. This is because, we have no knowledge of what the third
derivative of the function which is known only at a set of discrete points.

Numerical Integration
We now explore various ways for approximating the integral of a function over a given
domain. Why do we need such approximations?

• Analytical integration may be impossible or not feasible (Example : a typical function


2 y −x2
R
that can not be integrated analytically is the error function erf (y) = π 0 e dx.)

• or we may want to integrate tabulated data rather than a known function.

We outline the main approaches in numerical integration. Which of the methods is prefer-
able depends on the results required and on the function or data to be integrated.
We introduce techniques for numerical integration which are based on integrating inter-
polating polynomials and which lead to the so-called Newton-Cotes Integration Methods.

Integration by polynomial interpolation


Rb
One powerful method for evaluating the integral a f (x)dx numerically is to replace f by
another function g that approximates f well and is easily integrated.

f 'g
Z b Z b
f (x)dx ' g(x)dx
a a
R1 2 R 1 Pn x2k
Pn 1
For example, 0
ex dx ' 0 k=0 k! dx = k=0 (2k+1)k! .
Rb
We use polynomial interpolation to evaluate a f (x)dx. We select nodes x0 , x1 , ..., xn in [a, b]
and set up Lagrange interpolation polynomial. The polynomial of degree n that interpolates
f at the nodes is
n
X
f (x) ' pn (x) = Li (x)f (xi )
i=0
Z b Xn Z b
∴ f (x)dx = f (xi ) Li (x)dx
a i=0 a
Z b n
X Z b
or f (x)dx = Ai f (xi ), Ai = Li (x)dx..........(∗)
a i=0 a

4 54
(∗) is called the Newton-Cotes formula, if the nodes are equally spaced.
If n = 1, the nodes are x0 and x1 with x0 = a and x1 = b. L0 (x) = x−b
a−b
, L1 (x) = x−a
b−a
.

Z b Z b  2 b
1 1 x
A0 = L0 (x)dx = (x − b)dx = − bx
a a−b a a−b 2 a
 2 2

1 b a
= − b2 − + ab
a−b 2 2
(a − b)2
 
1 b−a
= − =
a−b 2 2
Z b
b−a
L1 (x)dx = A1 =
a 2
Z b
b−a
∴ f (x)dx ' [f (a) + f (b)]....T rapezoidal Rule.
a 2
Rb
(∗) is exact for polynomials of degree atmost 1. a f (x)dx represents the area bounded by
the curve y = f (x), the ordinates at x = a and x = b.
Rb
a
f (x)dx=Area of the shaded region.
In Trapezoidal rule, the function f (x) is approximated by the straight line joining (a, f (a))
and (b, f (b)) in the interval [a, b].
Rb
∴ a f (x)dx is approximated by the area of the trapezium P ABQ.
The error in this approximation is obtained by integrating the error in polynomial interpo-
lation.

Theorem 1.1. Let n ≥ 1. Suppose that f is a real-valued function, defined and con-
tinuous on [a, b] and let f (n + 1) be defined and continuous on [a, b]. Then |En (f )| ≤
Mn+1
Rb
(n+1)! a
|πn+1 (x)|dx, Mn+1 = maxξ∈[a,b] |f (n+1) (ξ)|. πn+1 (x) = (x − x0 )(x − x1 )...(x − xn ).
Proof.
Z b n
Z bX
En (f ) = f (x)dx − [Lk (x)f (xk )]dx
a a k=0
Z b
= [f (x) − pn (x)]dx
a
Z b
T hus, |En (f )| ≤ |f (x) − pn (x)|dx.
a

5 55
Numerical Analysis
Lecture 13

1 Numerical Integration-2
We now obtain error in Trapezoidal rule:
The error in approximating f (x) by a linear polynomial p1 (x) is
00
f (x) − p1 (x) = f (ξ(x)) 2
(x − a)(x − b), a < ξ < b
Rb Rb
∴ a [f (x) − p1 (x)]dx = 12 a f 00 (ξ(x))(x − a)(x − b)dx
Rb
= 12 f 00 (ξ) a [x2 − x(a + b) + ab]dx
(by applying MVT 1 for integrals), a < ξ < b
3 2
= 21 f 00 (ξ)[ x3 − (a + b) x2 + abx]ba
3 3
= 21 f 00 (ξ)[ b −a3
− a+b
2
(b2 − a2 ) + ab(b − a)]
3 3
= 12 f 00 (ξ)[ b −a3
− b−a
2
{b2 + 2ab + a2 − 2ab}]
1 00
= 12 f (ξ)[(b − a){2(b2 + ab + a2 ) − 3(b2 + a2 )}]
1 00
= 12 f (ξ)(b − a)[−b2 + 2ab − a2 ]
1 00
= − 12 f (ξ)(b − a)3 , a < ξ < b
The error is zero if f (x) is a linear polynomial.
∴ Trapezoidal rule is exact for polynomials of degree at most 1. Thus, we have shown that

Theorem 1.1. Let f (x) have two continuous derivatives on the interval a ≤ x ≤ b. Then,
the error in Trapezoidal rule is
Rb
a
f (x) − (b−a)
2
1
[f (a) + f (b)] = − 12 (b − a)3 f 00 (ξ)

for some ξ in the interval [a, b].

Since b − a = h,
Rb 2 (b−a)

a
f (x) − h2 [f (a) + f (b)] = − h 12
f 00 (ξ) a < ξ < b

The formula says that the error decreases in a manner that is proportional to h2 . There-
fore, doubling the number of subintervals into which [a, b] should be divided (or halving h)
should cause the error to decrease by a factor of approximately 4.
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
R 2 dx
Evaluate I = 0 1+x 2 using Trapezoidal rule. How large should n be chosen in order to

ensure that the error in Trapezoidal rule is less than 5 × 10−6 ?


2
1
f (x) = 1+x 2; f 0 (x) = − (1+x
2x
2 )2 ; f 00 (x) = − (1+x
2+6x
2 )3 ;
00
Also, max |f (x)| = 2, a = 0, b = 2, b − a = 2.
0≤x≤2
2 (b−a)
∴ Error in Trapezoidal rule =− h 12
f 00 (ξ).
1
MVT or Mean Value theorem for integrals says that if u(x) and v(x) are two continuous function on
Rb Rb
[a, b] and if v(x) ≥ 0 ∀x ∈ [a, b] then there exists ξ ∈ (a, b) such that a u(x)v(x)dx = u(ξ) a v(x)dx

1 56
h2 (2) h2
So, | error | ≤ 12
(2) = 3
.

b−a
We also want to find the number of subintervals ’n’, where n = h
, so that |error| ≤
5 × 10−6 . We choose h so small that
h2
3
≤ 5 × 10−6
or h ≤ 0.003873.

But b−a
h
= n or n = h2
∴ n = h2 ≥ 516.4
Therefore n ≥ 517 will ensure that |error| ≤ 5 × 10−6 .

2 Simpson’s rule
We take n = 2.
f (x) is approximated by a second degree polynomial p2 (x) that interpolates f (x) at
three equally spaced points.
p2 (x) = L0 (x)f (a) + L1 (x)f ( a+b 2
) + L2 (x)f (b)
We set a = x0 , a+b 2
= x 1 , b = x 2 ; xi = x0 + ih, i = 0, 1, 2.
p2 (x) = L0 (x)f (x0 ) + L1 (x)f (x1 ) + L2 (x)f (x2 )
x = x0 + ph
L0 (x) = (x(x−x 1 )(x−x2 )
0 −x1 )(x0 −x2 )
= [(x0 +ph)−(x 0 +h)][(x0 +ph)−(x0 +2h)]
[x0 −(x0 +h)][x0 −(x0 +2h)]
= h(p−1)h(p−2)
2h2
= p−2)(p−1)
2
(x−x0 )(x−x2 )
L1 (x) = (x1 −x1 )(x1 −x2 )
= [(x 0 +ph)−x0 ][(x0 +ph)−(x0 +2h)]
[(x0 +h)−x0 ][(x0 +h)−(x0 +2h)]
= hph(p−2)
−h2
= −p(p − 2)
(x−x0 )(x−x1 )
L2 (x) = (x2 −x0 )(x2 −x1 ) = [(x0 +2h)−x0 ][(x0 +2h)−(x0 +h)] = 2h2 = p(p−1)
[(x0 +ph)−x0 ][(x0 +ph)−(x0 +h)] hph(p−1)
2
R x2 Rb
∴ x0
f (x)dx = a
f (x)dx
(using x = x0 + h =⇒ dx = hdp)
R p=2 R p=2 R p=2
= h p=0 (p−1)(p−2)
2
f (x0 )dp + h p=0 −p(p − 2)f (x1 )dp + h p=0 (p−1)(p)
2
f (x2 )dp
h 4h h
= 3 f (x0 ) + 3 f (x1 ) + 3 f (x2 )
= h3 [f0 + 4f1 + f2 ] (fi = f (xi ) i = 0, 1, 2)
= h3 [f0 + f2 + 4f1 ]
= h3 [ sum of the end ordinates + 4 times the intermediate ordinate]

y = p0 (x)
y = f (x)

f (x2 )
f (x0 ) f (x1 )
a = x0 x1 x2 = b

2 57
Numerical Analysis
Lecture-14

1 Numerical Integration-3
We now compute the error in Simpson’s rule. If h = b−a 2
, the Simpson’s rule is given by
Rb R a+2h
a
f (x)dx = a
f (x)dx ≈ h3 [f (a) + 4f (a + h) + f (a + 2h)].

f (a) f (a + h) f (a + 2h)

x0 = a a+h a + 2h = b = x2
= a+b
2
= x1

Rb
∴ Error in Simpson’s rule is: f (x)dx − h3 [f (a) + 4f (a + h) + f (a + 2h)]
a
= Term 1 + Term 2

Let us consider Term 2.

Term 2 = − h3 [f (a) + 4f (a + h) + f (a + 2h)]


2 3 4 5
= − h3 [f (a) + 4{f (a) + hf 0 (a) + h2 f 00 (a) + h6 f 000 (a) + h24 f (4) (a) + h5! f (5) (a) + . . . }
2 3 4 5
+ {f (a) + 2hf 0 (a) + (2h) 2
f 00 (a) + (2h)
6
f 000 (a) + (2h)24
f (4) (a) + (2h) 5!
f (5) (a) + . . . }]
2 2 3 3 4 4
= − h3 [6f (a) + f 0 (a)(4h + 2h) + f 00 (a)( 4h2 + 4h2 ) + f 000 (a)( 4h6 + 8h6 ) + f (4) (a)( 4h 24
+ 16h24
)
(5) 4h5 32h5
+ f (a)( 120 + 120 ) + . . . ]
= −[2hf (a) + 2h2 f 0 (a) + 43 h3 f 00 (a) + 23 h4 f 000 (a) + 3(5!) 100 5 (4)
h f (a) + . . .
Rx
Let F (x) = a f (t)dt. By fundamental theorem of calculus, F 0 = f .
We now evaluate Term 1.
Rb R a+2h
∴ a f (x)dx = a f (x)dx = F (a + 2h) − F (a)
2 3 4
= F (a) + 2hF 0 (a) + (2h) 2
F 00 (a) + (2h) 6
F 000 (a) + (2h) 24
F (4) (a)
5
+ (2h)
5!
F (5) (a) + · · · − F (a)
2 3 4 5
= 2hF 0 (a) + (2h) 2
F 00 (a) + (2h) 6
F 000 (a) + (2h) 24
F (4) (a) + (2h) 5!
F (5) (a) + · · · −
∴ error in Simpson’s rule=Term 1+ Term 2
= − (100−96)
3(5!)
h5 f (4) (ξ)
5
= − h90 f (4) (ξ)
1 (b−a)5 (4)
= − 90 25
f (ξ)
Simpson’s rule is exact for polynomials of degree ≤ 3.
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

How small the width of the interval (h) should be that guarantees that the error in using
Simpson’s rule is < 10−6 in [0, 1], when f (x) = ex ?

1 58
h5 (4) b−a
| error in Simpson’s rule | = 90
f (ξ). Since h= 2
,
h4 h4
| error | = (1−0)
2 90
|f (4) (ξ)| = 180
|f (4) (ξ)|.

We want to find h such that | error | ≤ 10−6

h4
max |f (4) (ξ)|
180 ξ∈[0,1]
≤ 10−6
eh4
=⇒ 180
≤ 10−6

=⇒ h ≤ 0.09020788609.

∴ Taking h ≤ 0.09020788609 will ensure desired accuracy when using Simpson’s rule.
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

1.1 Composite Trapezoidal Rule


In elementary calculus course, we have learnt and understood integration in terms of Rie-
mann sums. This is done by approximating the shape under the curve f (x) as a union of
rectangles; the area under the curve is approximated as the sum of the areas of the rectan-
gles. We then consider a limiting process in which the width of the rectangles goes to zero
and the number of rectangles becomes infinite.
In numerical integration, we arrest this limiting process at some finite number of rect-
angles. Then, our approximation to the integral is taken as the combined area of these
rectangles. In this, we approximate f (x) on each interval by a zeroth-degree polynomial (
that is constant). Figure shows that the rectangles are a crude approximation to the area
under the curve. Therefore, we look for a better approximation. In each subinterval into
which we divide the interval [a, b] into, f can be approximated by a first degree interpolating
polynomial (that is by a straight line).

a = x0 a = x0 a = x0 a = x0 a = x0 a = x0
Rb b−a
If a f (x)dx is required, divide [a, b] into n equal subintervals of width h = 2
by points
a = x0 < x1 < x2 < · · · < xn = b, Then we have

2 59
Rb R x1 R x2 R xn =b
a
f (x)dx = a=x 0
f (x)dx + x 1
f (x)dx + · · · + xn−1
f (x)dx
= 2 [f (x0 ) + f (x1 )] + 2 [f (x1 ) + f (x2 )] + · · · + xn −x2 n−1 [f (xn−1 ) + f (xn )]
x1 −x0 x2 −x1

In the above step, we have applied T rapezoidal rule in each of the subintervals. Note
xi − xi−1 = h.
Rb
∴ a f (x)dx = h2 [(f (x0 ) + f (x1 )) + (f (x1 ) + f (x2 )) + · · · + (f (xn−1 ) + f (xn ))]
= h2 [f (x0 ) + 2(f (x1 ) + f (x2 ) + · · · + f (xn−1 )) + f (xn )]
= h2 [f (x0 ) + f (xn ) + 2(f (x1 ) + f (x2 ) + · · · + f (xn−1 ))]
= h2 [ sum of the end ordinates +2 (the sum of the intermediate ordinates)] (**)
(**) is known as Composite Trapezoidal rule.

1.2 Error in Composite Trapezoidal rule


The error in composite Trapezoidal rule can be analyzed by adding together the errors over
the subintervals [x0 , x1 ], [x1 , x2 ], . . . , [xn−1 − xn ]. We have shown that error in Trapezoidal
rule is
Rb 3
a
f (x) − h2 [f (a) + f (b)] = − h12 f 00 (ξ), a < ξ < b.
Then the error in [xi−1 − xi ] is
R xi 3
xi−1
f (x) − h2 [f (xi−1 ) + f (xi )] = − h12 f 00 (ξi ), xi−1 < ξ < xi .

∴ Error in composite Trapezoidal rule is


Rb
a
f (x) − h2 [f (x0 ) + f (xn ) + 2(f (x1 ) + f (x2 ) + · · · + f (xn−1 ))].
3 3 3
= − h12 f 00 (ξ1 ) − h12 f 00 (ξ2 ) − · · · − − h12 f 00 (ξn ) xi−1 < ξ < xi
3
= − h12 [f 00 (ξ1 ) + f 00 (ξ2 ) + · · · + f 00 (ξn )] xi−1 < ξ < xi
3 00 00 00
= − h12 .n[ f (ξ1 )+f (ξ2n)+···+f (ξn ) ] xi−1 < ξ < xi
3 00 00 00
= − h12 .n[sn ], where sn = f (ξ1 )+f (ξ2n)+···+f (ξn )
This number sn satisfies
min f 00 (x) ≤ sn ≤ max f 00 (x)
a≤x≤b a≤x≤b

Since f 00 (x) is a continuous function (by our assumption), we have that there must be some
ξ in [a, b], for which
f 00 (ξ) = sn
Then, error00 in composite
00
Trapezoidal rule is
00
3
= − h12 .n[ f (ξ1 )+f (ξ2n)+···+f (ξn ) ]
3
= − h12n .f 00 (ξ), we must know that nh = b − a.
2 (nh)
= − h 12 .f 00 (ξ)
2
= − h12 (b − a)f 00 (ξ).
or the error E in composite Trapezoidal rule is given by
3
E = − h12n .f 00 (ξ)
3
= − (b−a)n3 12
n 00
f (ξ)
(b−a)3 00
= − 12n2 f (ξ).
Thus we have shown that if f has a continuous derivative of second order on [a, b], then the

3 60
Rb
error in approximating a f (x)dx by Trapezoidal rule is
3
|E| ≤ (b−a)
12n2
[max|f 00 (x)|] a ≤ x ≤ b.
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
R3 1
Estimate the error in approximating the integral 1 (x+2)2
dx with n = 4 using the com-
posite Trapezoidal rule.

f (x) = 1
(x+2)2
, f 0 (x) = − (x+2)
2 00
3 f (x) =
6
(x+2)4
.
3
|E| ≤ (b−a)
12n2
[max|f 00 (x)|] 1 ≤ x ≤ 3
Maximum of f 00 (x) = (x+2) 6 6
4 occurs at x = 1 in [1, 3] and the maximum value is = (1+2)4 =
6 2
81
= 27
3
∴ |E| ≤ (3−1) ( 2 ) = 0.003086.
12×16 27
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
R1
Find n such that the error in the approximation of the integral 0
sin(πx)dx using Trape-
zoidal rule is less than or equal to 0.0001

f (x) = sin(πx), f 0 (x) = −π cos(πx) f 00 (x) = π 2 sin(πx).


Maximum of f 00 (x) = π 2 sin(πx) in [0, 1] ocuurs when πx = π2 or x = 1
2
and f 00 ( 21 ) = π 2 .
3
∴ |E| ≤ (1−0)
12n2
π 2 ≤ 0.0001
π 2
=⇒ 12n 2 ≤ 0.0001
2
n ≥ 8224.7
=⇒ n ≥ 90.7

∴ n=91 subintervals are required to ensure desired accuracy.

4 61
Numerical Analysis
Lecture-15

1 Numerical Integration 4
Rb
1.1 Composite Simpson’s rule for approximating a f (x)dx
Divide the interval [a, b] into even number of equal subintervals, say, 2N of width b−a 2N
= h.
The nodes are a = x0 < x1 < x2 < · · · < x2N = b, xk = x0 + kh, k = 0, 1, 2, . . . , 2N .
Rb R x2N =b
a
f (x)dx = a=x f (x)dx
R x2 0 Rx Rx R x2N =b
= x0 f (x)dx + x24 f (x)dx + x46 f (x)dx + · · · + x2N −2
f (x)dx
h h h
= 3 [f0 + 4f1 + f2 ] + 3 [f2 + 4f3 + f4 ] + · · · + 3 [f2N −2 + 4f2N −1 + f2N ]
= h3 [(f0 + f2N ) + 4(f1 + f3 + f5 + · · · + f2N −1 ) + 2(f2 + f4 + f6 + · · · + f2N −2 )]
= h3 [ sum of the end ordinates + 4(sum of the odd suffixed ordinates)
+ 2(sum of the even suffixed ordinates)]

Evaluate 0 sin xdx by taking n = 6 using
a)Trapezoidal rule; b)Simpson’s rule.
π π π 2π 5π
x 0 6 3 2 3 6
π
sin x 0 0.5 0.8660 1.0 0.8660 0.5 0
b−a π−0
we divide the interval [0, π] into 6 equal subintervals of width h = 6
= 6
= π6 . The
nodes are
π π π 2π 5π
x0 = 0 6 3 2 3 6
π = x6
↓ ↓ ↓ ↓ ↓
x1 x2 x3 x4 x5

the values of sin x at these nodes are given in the table. Here yi = f (xi ) = fi , i = 0, 1, . . . , 6.

a) by
R πTrapezoidal rule (composite)
h
0
sin xdx = 2 [y0 + y6 + 2(y1 + y2 + y3 + y4 + y5 )]
π
= 12 [0 + 0 + 2(0.5 + 0.8660 + 1.0 + 0.8660 + 1.5)]
π
= 12 [2(3.732)]
= 1.9540

b) by
R πSimpson’s rule (composite)
h
0
sin xdx = 3
[(y 0 + y6 ) + 4(y1 + y3 + y5 ) + 2(y2 + y4 )]
π
= 18 [(0 + 0) + 4(0.5 + 1.0 + 0.5) + 2(0.8660 + 0.8660)]
π
= 18 (11.464)
= 2.0008

1.1.1 Error in Composite Simpson’s rule


Simpson’s rule gives
Rb
a
f (x)dx = h3 [f0 + 4f1 + f2 ].

1 62
Error in Simpson’s rule is
Rb 5
a
f (x)dx − h3 [f0 + 4f1 + f2 ] = − h90 f (4) (ξ).

In composite Simpson’s rule, the interval [a, b] is divided into 2N equal parts and the error
is Rb
a
f (x)dx − h3 [(f0 + f2N ) + 4(f1 + f3 + f5 + · · · + f2N −1 ) + 2(f2 + f4 + f6 + · · · + f2N −2 )]
5 5 5
= − h90 f (4) (ξ1 ) − h90 f (4) (ξ2 ) − · · · − h90 f (4) (ξN ) where x2i < ξi < x2i+2 i = 0, 1, . . . , N − 1
5
= − h90 [f (4) (ξ1 ) − f (4) (ξ2 ) − · · · − f (4) (ξN )]
If f ∈ C 4 [a, b], then extreme value theorem implies that f (4) assumes its maximum and
minimum in [a, b]. Since,

min f (4) (x) ≤ f (4) (ξi ) ≤ max f (4) (x)


x∈[a,b] x∈[a,b]

we have,

N min f (4) (x) ≤ f (4) (ξ1 ) + f (4) (ξ2 ) + · · · + f (4) (ξN ) ≤ N max f (4) (x)
x∈[a,b] x∈[a,b]

or
1
min f (4) (x) ≤ N
[f (4) (ξ1 ) + f (4) (ξ2 ) + · · · + f (4) (ξN )] ≤ max f (4) (x)
x∈[a,b] x∈[a,b]

By Intermediate value theorem, there exists a ξ ∈ (a, b) such that

f (4) (ξ) = N1 [f (4) (ξ1 ) + f (4) (ξ2 ) + · · · + f (4) (ξN )]


=⇒ N f (4) (ξ) = [f (4) (ξ1 ) + f (4) (ξ2 ) + · · · + f (4) (ξN )]

∴ Error in composite Simpson’s rule is


5
= − h90 .N f (4) (ξ)
5
= − h90 b−a
2h
f (4) (ξ) (Since b−a
2N
=hN = b−a
2h
)
4 (b−a)
= − h 180 f (4) (ξ)

∴ Composite Simpson’s rule is exact for polynomials of degree upto 3. error is propor-
tional to h4 .
R5
Estimate 1 ln xdx using Simpson’s rule with 8 subintervals. Also, obtain the value of
h, so that the value of the integral will be accurate upto five decimal places.

Divide [1, 5] into 8 subintervals of width h = 5−1 2(4)


= 12 by nodes xi = x0 + ih, i =
0, 1, 2, . . . , 8. The nodes and the function values at these nodes are tabulated:

x0 x1 x2 x3 x4 x5 x6 x7 x8
x 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
ln x 0.00 0.4055 0.6931 0.9163 1.0986 1.2528 1.3863 1.504 1.6094
f0 f1 f2 f3 f4 f5 f6 f7 f8
R5
∴ 1
ln xdx = h3 [(f0 + f8 ) + 4(f1 + f3 + f5 + f7 ) + 2(f2 + f4 + f6 )]
= 0.5
3
[(0 + 1.6094) + 4(4.0787) + 2(3.178)]
= 4.0467.

2 63
4
|Error| = h 180
(b−a) (4)
|f (ξ)|
0
f (x) = ln x, f (x) = x1 , f 00 (x) = − x12 , f 000 (x) = 2
x3
. f (4) (x) = − x64
∴ max f (4) (x) = 6
x∈[1,5]
If the result has to be correct to 5 decimal places, then
5−1 4
180
≤ 10−5
h (6)
4
=⇒ h ≤ 0.000075
=⇒ h ≤ 0.09

1.2 Method of undetermined coefficients


Given the nodes xi , 0 ≤ i ≤ n, we seek the numerical integration methods of the form
R xn
f (x)dx = nk=0 λk f (xk )
P
x0

such that the method is exact for polynomials of degree as high as possible. We work out
the details for n = 2.

Rb
a
f (x)dx = λ0 f (x0 ) + λ1 f (x1 ) + λ2 f (x2 ) → (∗)

where a = x0 , b = x0 + 2h = x2 , and x − 1 = x0 + h
There are 3 unknowns λ0 , λ1 , λ2 to be determined. We find these constants by requiring
that the method (∗) is exact for polynomials of degree upto 2, i.e. for f (x) = 1, x, x2
Rb
f (x) = 1 =⇒ a dx = b − a = λ0 + λ1 + λ2 = x2 − x0
Rb 2 2 2 2
f (x) = x =⇒ a xdx = b −a 2
= λ0 x0 + λ1 x1 + λ2 x2 = x2 −x 2
0

Rb 3 3 3 3
f (x) = x2 =⇒ a x2 dx = b −a 3
= λ0 x20 + λ1 x21 + λ2 x22 = x2 −x
3
0

Use x2 = x0 + 2h, x1 = x0 + h in the above and we get


λ0 + λ1 + λ2 = 2h
λ1 + 2λ2 = 2h
λ1 + 4λ2 = 38 h
Solving these we get

λ2 = h3
λ1 = 4h
3
λ0 = h3
Rb Rx
∴ a f (x)dx = x02 f (x)dx = h3 f (x0 ) + 4h 3
f (x1 ) + h3 f (x2 )
that
R b=x2is,
a=x0
f (x)dx = h3 [f (x0 ) + 4f (x1 ) + f (x2 )]

The method has been derived by requiring that it is exact for polynomials of degree ≤ 2.
We will now examine what happens if f (x) = x3 and compute the error.

3 64
R b=x2
a=x0
x3 dx − h3 [f (x0 ) + 4f (x1 ) + f (x2 )], h = x2 −x2
0

x4 −x4
= 2 4 0 − x2 −x 6
0
[x30 + 4x31 + x32 ] x1 = x0 +x2
2
4
x −x 4
23
= 2 4 0 − x2 −x 6
0
[x30 + 4 x0 +x
2
+ x32 ]
4
x −x 4
= 2 4 0 − x2 −x 6
0
[x30 + 21 (x30 + 3x20 x2 + 3x0 x22 + x32 ) + x32 ]
4
x −x 4
−x0
= 2 4 0 − x212 [3x30 + 3x32 + 3x20 x2 + 3x0 x22 ]
x42 −x40
= 4 − x2 −x 4
0
[x30 + x32 + x20 x2 + x0 x22 ]
4
x −x 4
= 2 4 0 − x2 −x 4
0
(x0 + x2 )(x20 + x22 )
4
x −x 4 2
x −x 2
= 2 4 0 − 2 4 0 (x20 + x22 )
x4 −x4 x4 −x4
= 24 0 − 24 0
=0
∴ the derived method is also exact for polynomials of degree 3. We examine what hap-
pens when f (x) = x4 .
R b=x2
a=x0
x3 dx − h3 [x40 + 4x41 + x42 ] is the error.
x5 −x5 4
Error= 2 5 0 − x2 −x6
0
[x40 + 4 x0 +x
2
2
+ x42 ]
−x0 )5
= − (x2120 (Simplify the above steps and get this result)
6= 0
∴ the derived method is exact for polynomials of degree upto 3.
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Rh
Determine A, B and C such that the formula 0 f (x)dx = h[Af (0) + Bf ( h3 ) + Cf (h)]
is exact for polynomials of as high degree as possible.

There are 3 unknown to be determined. We require that the method to be exact for
polynomials f (x) = 1, x, x2 .
Rh
f (x) = 1 =⇒ 0 dx = h = h(A + B + C) =⇒ A + B + C = 1
Rh h2 Bh
f (x) = x =⇒ 0 xdx = 2 = h[ 3 + Ch] =⇒ B3 + C = 12
h 3 2
f (x) = x2 =⇒ 0 x2 dx = h3 = h[ Bh9 + Ch2 ] =⇒ B9 + C = 13
R

Solving them we get, A = 0, B = 43 , C = 14 .


Rh
∴ 0 f (x)dx = h[ 34 f ( h3 ) + 14 f (h)]
Rh
∴ 0 f (x)dx = h4 [3f ( h3 ) + f (h)] (∗)
The method (∗) is exact for polynomials of degree ≤ 2. If f (x) = x3 , the error is
Rh 3
0
x dx − h4 [3f ( h3 ) + f (h)]
4 3
= h4 − h4 [3 h3 + h3 ]
4 3
= h4 − h4 [ h9 + h3 ]
4
= h4 [1 − 91 − 1]
4
= − h36
4 000 4 000
∴ error term is − h36 f 3!(ξ) = − h f216(ξ) .
∴ error is of O(h4 ).
The method is exact for polynomials of degree ≤ 2.

4 65
Numerical Analysis
Lecture-16

1 Numerical Integration-5
1.1 Gaussian Quadrature (Gaussian Integration Formula)
The Trapezoidal and the Simpson’s rule are integration methods involving function values
at equally spaced nodes and give exact results for polynomials of degree 1 (two points
and two functional values for the Trapezoidal rule) and degree 2 (three points and three
function values). In general, an exact integration method can be derived if the equally
spaced ordinates are those of a polynomial of degree n or less (involving (n + 1) points and
(n + 1) function values).
Can a numerical integration method be obtain which gives an exact result for polyno-
mials of degree > n? gauss showed that this is possible if one is ready to sacrifice the
advantage provided by the equally spaced nodes.
He derived numerical integration methods by selecting 0 n0 nodes which are not equally
spaced and showed that these methods are exact for polynomials of degree ≤ 2n − 1. These
methods are called Gaussian quadrature methods and they require function evaluation of a
set of optimally chosen points in the interval of integration.
The choice of spacing of these ordinates requires the knowledge of Legendre polynomials.

1.2 Legendre Polynomials


• Legendre polynomials are solutions of the Legendre equation

(1 − x2 )y 00 − 2xy 0 + n(n + 1)y = 0, n an integer

and are given by

1 d3
y = Pn (x) = 2n n! dxn
[(x2 − 1)n ], called Rodrigue’s formula.

P0 (x) = 1
P1 (x) = x
P2 (x) = 12 (3x2 − 1)
P3 (x) = 21 (5x3 − 3x)
P4 (x) = 81 (35x4 − 30x2 + 3) and so on
• These
R 1 polynomials satisfy the following orthogonality property:
P (x)Pm (x)dx = 0, m 6= n
−1 n
2
= 2n+1 m=n
R1
• −1 Pn (x)xn dx = 0 for all non-negative n < m.

• Legendre polynomials consist of either all odd powers of x or all even powers of x

• The zeros of Pn (x) are all real and distinct and lie in the interval (−1, 1).

1 66
Gauss quadrature methods are of the form
R1
f (x)dx = ni=1 ci f (xi )
P
−1

such that the 0 n0 values of ci ’s and the 0 n0 nodes xi (totally 2n unknowns) must be determined
in such a way that the Gauss quadrature methods are exact for polynomials of degree
0, 1, 2, . . . , 2n − 1.

1.2.1 Derivation for n = 2


We derive below the Gauss quadrature two point method of the form
R1
−1
f (x)dx = c1 f (x1 ) + c2 f (x2 ) → (∗)

This method must be exact for polynomials of degree 0, 1, 2 and 3. (There are 2 constants
c1 , c2 and two nodes x1 , x2 to be determined; we determine the 4 unknowns be requiring
that the method (∗) is exact for polynomials of degree ≤ 2(20 − 1 = 3.)

R1
f (x) = 1 =⇒ −1 1dx = 2 = c1 + c2
R1
f (x) = x =⇒ −1 xdx = 0 = c1 x 1 + c2 x 2
R1 2
f (x) = x2 =⇒ −1 x2 dx = 3
= c1 x21 + c2 x22
R1
f (x) = x3 =⇒ −1 x3 dx = 0 = c1 x31 + c2 x32

We use the properties of Legendre polynomials to solve this system. We have

Z 1
c1 + c2 = dx (1)
−1
Z 1
c1 x 1 + c2 x 2 = xdx (2)
−1
Z 1
c1 x21 + c2 x22 = x2 dx (3)
−1
Z 1
c1 x31 + c2 x32 = x3 dx (4)
−1

3 1
Multiply (3) by 2
and (1) by 2
and subtract the resulting (1) from (3). This gives
Z 1
3 1 3 1 3 1
( x2 − )dx = c1 [ x21 − ] + c2 [ x22 − ] (5)
−1 2 2 2 2 2 2

But P2 (x) = 21 (3x2 − 1), P0 (x) = 1

∴ (5) implies, Z 1
P2 (x)P0 (x)dx = c1 P2 (x1 ) + c2 P2 (x2 )
−1

2 67
or,
c1 P2 (x1 ) + c2 P2 (x2 ) = 0 (6)
(by orthogonality property of Legendre polynomials)

3 1
Similarly, multiply (4) by 2
and (2) by 2
and subtracting the resulting (2) from (4), we
get
Z 1
3 1 3 1 3 1
( x3 − x)dx = c1 [ x31 − x1 ] + c2 [ x22 − x2 ]
−1 2 2 2 2 2 2
Z 1
3 1 3 1 3 1
x( x2 − )dx = c1 [x1 ( x21 − )] + c2 [x2 ( x22 − )]
−1 2 2 2 2 2 2
or, Z 1
P1 (x)P2 (x)dx = c1 x1 P2 (x1 ) + c2 x2 P2 (x2 )
−1
or
c1 x1 P2 (x1 ) + c2 x2 P2 (x2 ) = 0 (7)
Equations (6) and (7) given by

c1 P2 (x1 ) + c2 P2 (x2 ) = 0
c1 x1 P2 (x1 ) + c2 x2 P2 (x2 ) = 0

will be satisfied for any choices of c1 , c2 if we choose x1 , x2 to be the zeros of P2 (x). Since
P2 (x) = 21 (3x2 − 1), P2 (x) = 0 =⇒ x = ± √13 .

Take x1 = − √13 , x2 = √13 .


Substituting in (1) and (2),

c1 + c2 = 2
c1 (− √13 ) + c2 ( √13 ) = 0
=⇒ c1 − c2 = 0
=⇒ c1 = c2

∴ c1 = 1 = c2 .

∴ Gaussian two point quadrature rule is


R1
−1
f (x)dx = f (− √13 ) + f ( √13 )

Gaussian quadrature methods are of the form


R1
f (x)dx = ni=1 ci f (xi )
P
−1

R b limits are −1 and 1.


We note that the integral must be such that the
If we are given an integral of the form a f (x)dx and we want to evaluate it using
Rb R1
Gaussian quadrature methods, then we change a f (x)dx to −1 f (x)dx using the following
transformation:

3 68
put x = b−a
2
t+ b+a
2
b−a
dx = 2 dt
x a b
t −1 1
Find the value of the integral
R3 cos 2x
I= 2 1+sin x
dx

using Gauss two-point quadrature

x = 3−2
2
t + 3+2
2
= t
2
+ 5
2
1
dx = 2 dt
x 2 3
t −1 1
R3 cos 2x
∴I = 2 1+sin x
dx
t 5
1 1 cos 2( 2 + 2 )
R
= 2 −1 1+sin( 2t + 52 )
dt
R 1 cos(t+5)
= −1 1+sin( 2t + 52 )
dt
1
= 2
[f (− √13 ) + f ( √13 )]
1
= 2
[0.56558356 − 0.15886672]
= 0.20350842

4 69
Numerical Analysis
Lecture-17

1 Numerical Integration-6
Now, we show the important steps in the derivation of Gauss quadrature three-point
method.
We seek a numerical integration method of the form
Z 1
f (x)dx = c1 f (x1 ) + c2 f (x2 ) + c3 f (x3 )..........(∗∗)
−1

where the constants c1 , c2 , c3 and the nodes x1 , x2 , x3 are determined by requiring that the
method (∗∗) is exact for polynomials of degree less than or equal to 2(3) − 1(= 5).
Z 1
f (x) = 1 : dx = c1 + c2 + c3 = 2 (1)
−1
Z 1
f (x) = x : xdx = c1 x1 + c2 x2 + c3 x3 = 0 (2)
−1
Z 1
2 2
f (x) = x : x2 dx = c1 x21 + c2 x22 + c3 x23 = (3)
−1 3
Z 1
f (x) = x3 : x3 dx = c1 x31 + c2 x32 + c3 x33 = 0 (4)
−1
Z 1
2
f (x) = x4 : x4 dx = c1 x41 + c2 x42 + c3 x43 = (5)
−1 5
Z 1
f (x) = x5 : x5 dx = c1 x51 + c2 x52 + c3 x53 = 0 (6)
−1

5 3
Multiply (4) by 2
and (2) by 2
and subtract the resulting (2) from (4). This gives
Z 1   Z 1 Z 1
5 3 3 1 3
x − x dx = (5x − 3x).1dx = P3 (x)P0 (x)dx
−1 2 2 −1 2 −1
     
5 3 3 5 3 3 5 3 3
= c1 x 1 − x 1 + c2 x 2 − x 2 + c3 x 3 − x 3
2 2 2 2 2 2
= c1 P3 (x1 ) + c2 P3 (x2 ) + c3 P3 (x3 )....................................................(7)
R1
LHS of (7) is −1
P3 (x)P0 (x)dx − 0 by orthogonality property.

∴ c1 P3 (x1 ) + c2 P3 (x2 ) + c3 P3 (x3 ) = 0 ..........................................................................(8)

3 1
Multiply (5) by 2
and (3) by 2
and subtract the resulting (3) from (5). This gives

c1 x1 P3 (x1 ) + c2 x2 P3 (x2 ) + c3 x3 P3 (x3 ) = 0 .....................................................................(9)

1 70
3 1
Multiply (6) by 2
and (4) by 2
and subtract the resulting (4) from (6). This gives

c1 x21 P3 (x1 ) + c2 x22 P3 (x2 ) + c3 x23 P3 (x3 ) = 0 .....................................................................(10)


R1
Note (9) and (10) are derived using the orthogonality property −1 P3 (x)P1 (x)dx = 0 and
R1
P (x)P2 (x)dx = 0. Equations (8), (9), (10) will be satisfied for any choice of c1 , c2 , c3 if
−1 3
we choose x1 , x2 , x3 to be zeros of P3 (x) = 0. Since P3 (x) = 21 (5x3 − 3x), P3 (x) = 0 =⇒
q q
5x3 − 3x = 0 =⇒ x1 = − 35 , x2 = 0, x3 = 35 .
Substitute these in (1), (2) and (3).

c1 + c2 + c3 = 2
r r
3 3
−c1 + c3 = 0 =⇒ c1 = c3
5 5
3 3 2 3 2 5
c1 . + c3 . = =⇒ 2.c3 . = =⇒ c3 = = c1
5 5 3 5 3 9
10 8
∴ c2 = 2 − c1 − c3 = 2 − =
9 9
∴ We have
Z 1
r r
5 3 8 5 3
f (x)dx = f (− ) + f (0) + f ( )
−1 9 5 9 9 5
" r r #
1 3 3
= f (− ) + f (0) + f ( )
9 5 5

This quadrature three point method is exact for polynomials of degree ≤ 5.


Note. The end points −1 and 1 of the intervals are not the nodes in these methods. Nodes
lie in the interior of the interval. Gauss methods are open-type formulas. Trapezoidal and
Simpson rules are closed-type formulas as the end points of the interval are part of the
nodes where function is evaluated.R
3 cos 2x
Find the value of the integral I = 2 1+sin x
dx using Gauss quadrature three point method.

Z 3 Z 1
cos 2x 1 cos(t + 5) t+5 1
dx = t+5 dt; x = and so dx = dt
2 1 + sin x 2 −1 1 + sin( 2 2 2
" r r #
1 3 3
= f (− ) + f (0) + f ( )
9 5 5
  q   q  
3 3
1  5cos − 5 + 5   + 8cos(5)  + 1 
 5cos 5
+5 
=   √   1 + sin 5  √ 
9 − 35 +5
2
9  3
+5

1 + sin 2
1 + sin 5
2

1
= [−1.26018515 + 1.419666661 + 3.48936886]
18
= 0.20271391.

2 71
Show that
Z 5 √
(a) 1 + x2 dx = 41.71
Z1 1
dx
(b) 1 = 0.9181
0 (1 + x2 ) 3

sing Gauss quadrature


R 1 dx (i) three-point method.
Evaluate I = 0 1+x using Gauss quadrature three-point method. (Ans: I = 0.693122).
Remark. The Trapezoidal (composite) and the composite Simpson’s rules, although effec-
tive, are not suitable when it comes to evaluation of integrals of functions on intervals in
which there are both regions with large variations of the integrand and regions Rwith very
c
small variations of the integrand. From the figure here, we see that if we require 0 f (x)dx,
then the composite Trapezoidal and composite Simpson’s rules will not be suitable. This is
because, there is significant variation of the function on [0, b] whereas the variation of the
function on [b, c] is very very small. This indicates that one can use a low-order method to
Rc Rb
evaluate b f (x)dx but a higher-order method is necessary to compute 0 f (x)dx. We now
try to understand a technique referred to as adaptive quadrature that should be applied
on different parts of the interval of integration; we will also examine the accuracy of the
approximation that we get.

Adaptive quadrature
Rb
There are situations where the integrand f (x) in a f (x)dx varies too rapidly in some subin-
tervals (large fluctuations) and varies too slowly in some subintervals (small variations). In
such cases, we need many nodes where the function fluctuates but only few nodes where
the variations are slow; that is, it is desirable to take smaller step size in the subintervals
over which the function variation is large and larger step size in the subintervals in which
the function varies too slowly.

In order to approximate definite integrals with such integrands, we must develop meth-
ods in such a way that we must be able to measure or predict the amound of variation and
automatically add more points in the interval where it is required. A numerical integration
method that adapts automatically an appropriate step size to estimate an integral numeri-
cally is called Adaptive quadrature method.
Step 1 Develop a way to measure the error, that is a numerical estimate of the actual error
in the numerical intergration
Step 2 Use the error estimate to decide whether to accept the value obtained using our
method or we need to refine further by performing the computations again using smaller
step size.

We shall derive an adaptive quadrature method


Rb based on Simpson’s rule.
We wish to approximate the integral I = a f (x)dx to within an accuracy  > 0, using
Simpson’s rule, with h = b−a
2
.

3 72
b
h5
Z  
h a+b
I= f (x)dx = f (a) + 4f ( ) + f (b) − f (4) (ξ1 ), a < ξ1 < b
a 3 2 90
4
(b − a)h (4)
= I(a, b) − f (ξ1 ), a < ξ1 < b
180

where I(a, b) = h3 f (a) + 4f ( a+b


 
2
) + f (b) .
b−a h h1
Subdivide the interval, set h1 = 4 = 2 . (Note 3
= h6 ). Simpson’s rule gives
b
(b − a)h4 (4)
Z  
h 3a + b a+b a + 3b
I= f (x)dx = f (a) + 4f ( ) + 2f ( ) + 4f ( ) + f (b) − f (ξ2 ), a < ξ2 < b
a 6 4 2 4 180 × 16
(b − a)h4 (4)
   
h 3a + b a+b h a+b a + 3b
= f (a) + 4f ( ) + f( ) + f( ) + 4f ( ) + f (b) − f (ξ2 )
6 4 2 6 2 4 180 × 16
a+b a+b (b − a)h4 (4)
= I(a, ) + I( , b) − f (ξ2 ).
2 2 180 × 16

Let f (4) (ξ) = max{|f (4) (ξ1 )|, |f (4) (ξ2 )|}.

a+b a+b
∴ 0 = I(a, b) − I(a,
) − I( , b)
2 2
(b − a)h4 (4) (b − a)h4 (4)
=− f (ξ) + f (ξ)
180 180 × 16
(b − a)h4 (4)
= −15 f (ξ)
180 × 16
1 a+b a+b (b − a)h4 (4)
=⇒ [I(a, b)−I(a, ) − I( , b)] = − f (ξ).
15 2 2 180 × 16
Z b
a+b a+b 1 a+b a+b
∴| f (x)dx − I(a,
) − I( , b)| = |I(a, b) − I(a, ) − I( , b)|,
a 2 2 15 2 2
Rb
which shows that I(a, a+b 2
) + I( a+b
2
, b) approximates a f (x)dx about 15 times better than
1
Rb
it agrees with I(a, b). Thus if 15 |I(a, b) − I(a, a+b
2
) − I( a+b
2
, b)| <  and that a f (x)dx =
I(a, a+b
2
) + I( a+b
2
, b) correct to the desired degree of accuracy. If it is not satisfied, then
apply the procedure to each of the subintervals [a, a+b 2
] and [ a+b
2
, b] with the tolerance 2 .
If the inequality is satisfied in both the intervals, then the sum of the two approximations
will give an approximation to the given integral. If the test fails in any of the intervals, say,
for example [ a+b2
, b], then this interval is subdivided into two subintervals and the above
procedure is applied with the tolerance which is half of the previous tolerance ( 2 ) and the
process is continued.

The details are explained using an example below.

Step-1

4 73
Rπ π
−0
We apply Simpson’s rule to 0
2
sin xdx = 1, h = 2
2
= π4 . Here
π π π π
I(0, ) = [f (0) + 4f ( ) + f ( )]
2 12 4 2
π π π
= [sin(0) + 4sin( ) + sin( )]
12 4 2
π √
= [2 2 + 1]
12
= 1.00227987749221
π π π π π π 3π π
N ow, I(0, ) + I( , ) = [sin(0) + 4sin( ) + 2sin( ) + 4sin( ) + sin( )]
4 4 2 24 8 4 8 2
= 1.00013458497419
1
The error estimate is 15 [I(0, π2 )−{I(0, π4 )+I( π4 , π2 )}] = 15
1
[1.00227987749221−1.00013458497419] =
0.00014301950120.
We now know how to get an error estimate. How do we use this to derive an adaptive
integration method? Rb
We want to approximate I = a f (x)dx with an error less than  (a specified accuracy).

1. Compute the two approximations I(a, b) and I(a, a+b


2
) + I( a+b
2
, b).

2. Estimate the error; if the error is less than , then we are done; if not, then go to step
3.
Step-2

3. Apply steps 1 and 2 recursively to the intervals [a, a+b


2
] = [a1 , b1 ] and [ a+b
2
, b] =
[a2 , b2 ] = [b1 , b2 ] with tolerance 2 .

a b tolerance : 
a.........................(a + b)/2 tolerance : /2
(3a + b)/2.........................(a + b)/2 tolerance : /4

A possible subinterval refinement is shown above. Solid lines indicate that the desired
degree of accuracy could not be reached in that interval; dashed lines show that in that
interval, desired degree of accuracy has been achieved.
Continue Step 2 till the desired degree of accuracy is achieved in both the subintervals at
any step.

5 74
Numerical Analysis
Lecture-18

1 Numerical Solution of Ordinary Differential Equations-


1
Numerical Solution of Differential Equations
Problems in Science and Engineering are modelled by differential equations(DE).
A DE is a relation between an independent variable (x), a dependent variable (y, y depends
on x) and various order derivatives of y with respect to x (say, upto nth order derivative) of
the form F (x, y, y 0 , y 00 , ..., y (n) ) = 0. The above relation gives a DE of order n. The central
problem in this course is to learn different numerical methods to solve a first order DE
(ordinary), namely, F (x, y, y 0 ) = 0.
Most of these problems require the solution to an initial value problem (IVP), namely,
solve a first order ordinary differential equation when one point on the initial curve is
given, i.e., obtain the solution to a first order ODE that satisfies an initial condition (IC);
dy
dx
= f (x, y); y(x0 ) = y0 .

References:
1. David Kincaid and Ward Cheney. Numerical Analysis. Brooks/cole, California.

2. R. L. Burden and J. D. Faires. Numerical Analysis, Seventh Edition, Brooks/cole,


California.
We now consider a typical case where we come across such initial value problems.

Mixing problems
Suppose we have a container full of a salt solution of a certain concentration. If we pump in
a solution with a different concentration at a certain rate, mix well and extract the mixture
at the same rate, then how does the concentration change with time?
We consider the law of conservation of salt.
Rate of change of salt in the container equals the difference between the rate of inflow of
salt into the container and the rate of outflow of salt from the container.
A 100l tank of water initially contains 10kg of dissolved salt. A pipe brings a salt solution
(concentration 0.005 kg/l) into the tank at a rate of 2 l/s, and a second pipe carries away
the excess solution. What is the concentration c(t) of salt assuming that the tank is well
mixed.
Let V = 100l be the volume of water in the tank (this is constant). Let m(t) be the mass
of salt dissolved in the water. So, m(0) = 10kg. Let c(t) = m V
be the concentration of salt
in the water, assuming that the salt is well mixed. So c(0) = 0.01kg/l. Rate of inflow of
salt (in kg/s) is cin rin , where cin = 0.005kg/l is the concentration of salt in the inflow and
rin = 2l/s is the rate of inflow.
The rate of outflow of salt (in kg/s) is cout rout . cout = c(t) is the concentration of salt
in the outflow = concentration of salt in the water in this case. rout = rin = 2l/s=

1 75
rate of outflow = rate of inflow in this case. Thus from the law of concentration of salt,
we get dm dt
= cin rin − cout rout . Since m(t) = c(t)V ; rout = rin and cout = c(t), we get
dc
V dt = (cin −c)rin or dc
dt
+ rVin c = rVin cin which is a first order ODE of the linear Lagrange form
rin t rin t
dy
dx
+ P (x)y = Q(x). The integrating factor e V . Thus the solution is c(t) = cin + Ke− V ,
K is a constant. Using the IC, we find c(t) = 0.005 + 0.005e−0.002t , where cin is in kg/l and
t is in seconds.
Note. The methods that we consider do not yield continuous approximation to the solution
of the IVP. Approximations are found at certain specified, equally spaced points.
We need some definitions and results from the theory of ordinary Differential Equations
(ODE) before considering methods for approximating the solutions to IVP.
dy
Will every IVP of the form dx = f (x, y), y(x0 ) = y0 have a solution?
Answer : No.
Some assumptions must be made about f and even then we can only expect the solution
to exist in a neighbourhood of x = x0 .
dy
Example 1.1. dx = 1 + y 2 ; y(0) = 0. Since y(0) = 0, y 0 (0) = 1 + y 2 (0) = 1 + 0 = 1. Since
the slope is positive, y(x) is increasing near x = 0. Thus 1 + y 2 is increasing. Hence y 0 is
increasing. Since both y and y 0 are increasing and y 0 = 1 + y 2 , we can expect that at some
finite x, there will be no solution, i.e., y(x) = +∞. Infact, it occurs at x = π2 since the
analytical solution is y(x) = tan x.

Theorem 1.1. If f is continuous in a rectangle R centered at (x0 , y0 ), say R = {(x, y) :


dy
|x − x0 | ≤ α, |y − y0 | ≤ β}, then the IVP dx = f (x, y), y(x0 ) = y0 has a solution y(x) for
β
|x − x0 | ≤ min(α, M ), where M is the maximum of |f (x, y)| in the rectangle R.
dy
Example 1.2. dx = (x + sin y)2 ; y(0) = 3.
We will show that this IVP has a solution on the interval −1 ≤ x ≤ 1.
f (x, y) = (x + sin y)2 ; (x0 , y0 ) = (0, 3). R = {(x, y) : |x − 0| ≤ α, |y − 3| ≤ β}. On this
β
rectangle R, |f (x, y)| ≤ (α + 1)2 ≡ M . We want min(α, M ) ≥ 1. We can let α = 1. Then
2 β
M = (α + 1) = 4. min(α, 4 ) ≥ 1 is met by taking β ≥ 4.

The existence theorem given above asserts that a solution of the IVP exists on the
β
interval |x| ≤ min(α, M ) = 1.
Note that even if f is continuous, the IVP may not have a unique solution.
dy 2
Example 1.3. dx = y 3 ; y(0) = 0.
y(x) ≡ 0 is a solution.
3
y(x) = x27 is another solution.
The solution to the IVP is not unique.

In order that the IVP has a unique solution in a neighbourhood of x = x0 , it is necessary


that f satisfies some more conditions.

Theorem 1.2. If f and fy are continuous in the rectangle R = {(x, y) : |x − x0 | ≤


dy
α, |y − y0 | ≤ β}, then the IVP dx = f (x, y), y(x0 ) = y0 has a unique solution in the interval
β
|x − x0 | < min(α, M ).

Note. The interval on the x-axis in which the solution is asserted to exist (in Theorems 0.1
and 0.2) may be smaller than the base of the rectangle R in which we have defined f (x, y).

2 76
The next theorem gives the existence and uniqueness of a solution on a prescribed
interval [a, b].

Theorem 1.3. If f is continuous in the strip a ≤ x ≤ b, −∞ < y < ∞ and satisfies an


inequality
|f (x, y1 ) − f (x, y2 )| ≤ L|y1 − y2 |........(∗)
dy
then the IVP dx
= f (x, y); y(a) = y0 has a unique solution in the interval [a, b].

The inequality (∗) in Theorem 0.3 is called a Lipschitz condition in the second variable.

Example 1.4. f (x, y) = x|y|. R = {(x, y) : 1 ≤ x ≤ 2, −3 ≤ y ≤ 4}.


For each pair of points (x, y1 ) and (x, y2 ) ∈ R,

|f (x, y1 ) − f (x, y2 )| = |x|y1 | − x|y2 ||


= |x|||y1 | − |y2 || ≤ 2|y1 − y2 |

∴ f satisfies a Lipschitz condition on R in the variable y with Lipschitz constant 2.

Definition 1.1. A set D ⊂ R2 is said to be convex if whenever (x1 , y1 ) and (x2 , y2 ) belong
to D and λ in in [0, 1], the point ((1 − λ)x1 + λx2 , (1 − λ)y1 + λy2 ) also belongs to D.
i.e., whenever two points belong to the set, the entire line segment between the points also
belong to the set. (See figure below). The set that we consider, in what follows, are of the
form D = {(x, y)|a ≤ x ≤ b, −∞ < y < ∞}, for some constants a and b.

Theorem 1.4. Suppose f (x, y) is defined on a convex set D ⊂ R2 . If a constant L > 0


exists with | ∂f
∂y
(x, y)| ≤ L, for all (x, y) ∈ D, then f satisfies a Lipschitz condition on D in
the variable y with Lipschitz constant L.
dy
Example 1.5. dx = 1 + xsin(xy), 0 ≤ x ≤ 2; y(0) = 0.
Holding x constant and applying the Mean Value Theorem to the function f (x, y) = 1 +
xsin(xy), we see that when y1 < y2 , there exists a ξ ∈ (y1 , y2 ) such that

f (x, y2 ) − f (x, y1 ) ∂
= (f (x, ξ)) = x2 cos(xξ)
y2 − y1 ∂y

∴ |f (x, y2 ) − f (x, y1 )| = |y2 − y1 ||x2 cos(xξ)| ≤ 4|y2 − y1 |


and f satisfies a Lipschitz condition in the variable y with Lipschitz constant L = 4. f (x, y)
is continuous when 0 ≤ x ≤ 2 and −∞ < y < ∞. Thus the IVP has a unique solution.

Remark. We have understood, to some extent, the conditions under which IVPs have unique
solution. We now address the question of whether small changes in the statement of the
problem introduce correspondingly small changes in the solution. This is an important issue
since round off errors are introduced when numerical methods are used.

3 77
Numerical Analysis
Lecture-19

1 Numerical Solution of Ordinary Differential Equations-


2
dy
Definition 1.1. The IVP dx
= f (x, y), y(x0 ) = y0 , x0 ≤ x ≤ b is said to be well-posed if
1. a unique solution y(x) to the problem exists.
2. For any  > 0, there exists a positive constant k(), such that whenever |0 | <  and
δ(x) is continuous with |δ(x)| <  on [x0 , b], a unique solution, z(x) to
dz
= f (x, z) + δ(x), x0 ≤ x ≤ b; z(x0 ) = y0 + 0 ............(∗∗)
dx
exists with |z(x) − y(x)| < k() for all x0 ≤ x ≤ b.
(∗∗) defines a perturbed problem associated with the original problem
dy
= f (x, y), y(x0 ) = y0 , x0 ≤ x ≤ b.
dx
The perturbed problem assumed the possibility of an error δ(x) being introduced in the
DE as well as an error 0 being present in the IC.
Note. Numerical methods will be concerned with solving a perturbed problem since any
round off error introduced in the representation perturbs the original problem. Care must be
taken to ensure that the original problem is well-posed; otherwise, there is no guarantee that
the numerical solution to the perturbed problem will approximate accurately the solution
to the original problem.
The conditions that ensure that an IVP is well-posed are specified in the following
theorem.
Theorem 1.1. Suppose D = {(x, y)|a ≤ x ≤ b, −∞ < y < ∞}. If f is continuous and
dy
satisfies a Lipschitz condition in the variable y on the set D, then the IVP dx = f (x, y), a ≤
x ≤ b, y(a) = y0 is well-posed.
dy
Example 1.1. Let D = {(x, y)|0 ≤ x ≤ 1, −∞ < y < ∞}. dx = y − x2 + 1, 0 ≤ x ≤
2, y(0) = 0.5.
Since | ∂f
∂y

| = | ∂y (y − x2 + 1)| = |1| = 1, f (x, y) = y − x2 + 1 satisfies a Lipschitz condition
in y on D with Lipschitz constant L = 1. Since f is continuous on D, by Theorem 0.1, the
IVP is well-posed.
We will verify this directly.
Consider the perturbed problem
dz
= z − x2 + 1 + δ, 0 ≤ x ≤ 2, z(0) = 0.5 + 0
dx
where δ and 0 are constants.
dy
The solution to dx = y − x2 + 1, y(0) = 0.5 is y(x) = (x + 1)2 − (0.5)ex .

1 78
dz
The solution to dx = z − x2 + 1 + δ, z(0) = 0.5 + 0 is z(x) = (x + 1)2 + (δ + 0 − 0.5)ex − δ.
If |δ| <  and |0 | < , then

|y(x) − z(x)| = |(δ + 0 )ex − δ| ≤ |δ + 0 |e2 + |δ|


≤ 2e2 + 
= (2e2 + 1) f or all x
dy
Thus the problem dx
= y − x2 + 1, y(0) = 0.5 is well-posed with k() = 2e2 + 1 for all  > 0.
dy
Example 1.2. dx = ytan(x + 3); y(−3) = 1.
We want to determine y on an interval containing the initial point x0 = −3.
y(x) = sec(x + 3) is a solution. (Verify!)
Since sec x becomes infinite at x = ± π2 , our solution is valid only for − π2 < x + 3 π2 . Thus
the solution is y = sec(x + 3), − π2 < x + 3 π2 .
We have been able to obtain an analytical solution in this example. Typically, analytical
solutions are not available for problems of the type
dy
= f (x, y); y(x0 ) = y0
dx
and so numerical methods must be employed.
There are two approaches:
1. Simplify the DE to one that can be solved exactly and then use the solution of the
simplified equation to approximate the solution to the original equation.

2. Develop and use the methods for approximating the solution of the original problem.
We want to take the second approach in this course since approximate methods give
more accurate results and realistic information about the error incurred.

Taylor-Series Method
dy

dx
= f (x, y)
IV P :
y(x0 ) = y0
In the numerical solution of DE, we construct a table of function values of the form
x0 x1 x2 ... xn
y0 y1 y2 ... yn
where yi is the computed approximate value of y(xi ). Here y(xi ) denotes the exact solution
of the IVP at xi . From the above table, one can construct approximating functions.
dy
Assume that various partial derivatives of f exist. Consider dx = f (x, y); y(x0 ) = y0 . Now

h2 00 h3 h4
y(x + h) = y(x) + hy 0 (x) + y (x) + y 000 (x) + y (4) (x) + . . .
2! 3! 4!
We decide to use only terms upto and including h4 (say), then the terms that we have not
included start with a term in h5 . These neglected terms contribute to the Truncation
error that is inherent in our procedure. The resulting numerical method is said to be of
order 4.

2 79
Definition 1.2. The order of the Taylor-series method is n if terms up to and including
hn y (n) (x)
n!
are used.

At each step, the local truncation error (TE) is O(h5 ) since we have not included terms
involving h5 , h6 , ... from the Taylor series. Local error ' Ch5 as h −→ 0, C is a constant,
i.e., if h = 10−2 , then h5 = 10−10 and so the error in each step is roughly of the magnitude
10−10 . These small errors accumulate and after several steps, there is considerable amount
of error in the computed solution.

Estimation of local TE in each step of the numerical solution


The error term in the Taylor series is of the form
1
En = hn+1 y (n+1) (x + θh), 0 < θ < 1
(n + 1)!

This is the error when the last power of h included in the Taylor series method is hn .
 (n)
(x + h) − y (n) (x)

1 n+1 y
En = h
(n + 1)! h
n
h
= [y (n) (x + h) − y (n) (x)]
(n + 1)!

after estimating y (n+1) (x + θh) by a simple finite-difference approximation.


In solving a DE numerically, there are several types of errors that arise.

1. The local truncation error is the error made in one step when we replace an infinite
process by a finite process. This error is present in each step of the numerical solution.
Note that in Taylor series method we replace the infinite Taylor series for y(x + h) by
a finite sum of terms (i.e., by a partial sum). Thus the local TE is inherent in this
method and is independent of round off error. The local TE is O(hn+1 ) if we retain
terms upto and including hn in the series.

2. Round off error is caused by the limited precision of our computing machines. The
magnitude of round off error depends on the word length of the computer (or on the
number of bits in the mantissa of the floating-point machine numbers).

3. Global Truncation error. The local TE is present in each step of the numerical
solution. The accumulation of all these local TEs gives rise to the global TE. Note that
global TE will be present even if the calculations are performed using exact arithmetic.
It is an error associated with the method. It is independent of the computer on which
the calculations are performed. Let the computations be started at x0 . Number of
steps necessary to reach an arbitrary point, say, xn is xn −xh
0
. Therefore, if the local
n+1 n
TE is O(h ), then the global TE is O(h ).

4. Global round off error is the accumulation of local round off errors in the previous
steps.

5. The Total error is the sum of the global TE and the global round off error. If the
global error is O(hn ), then the numerical procedure is of order n.

3 80
Numerical Analysis
Lecture-20

1 Numerical Solution of Ordinary Differential Equations-


3
dy
Consider dx = f (x, y), a ≤ x ≤ b, y(a) = y0 . Suppose the solution y(x) to the above
IVP has (n + 1) continuous derivatives. Expand the solution y(x) in terms of its nth Taylor
polynomial about xi and evaluate y at xi+1 = xi + h.

h2 00 hn hn+1 (n+1)
y(xi+1 ) = y(xi + h) = y(xi ) + hy 0 (xi ) + y (xi ) + ... + y (n) (xi ) + y (ξi )
2! n! (n + 1)!

for ξi ∈ (xi , xi+1 ).

y 0 = f (x, y)
y 00 = f x + f fy
y 000 = fxx + fxy f + f fxy + f 2 fyy + fx fy + f fy2
= fxx + 2fxy f + f 2 fyy + fx fy + f fy2

Similarly, higher order derivatives can be obtained.

Taylor Series method of order n


y(a) = y0

h2 00 hn (n)
yi+1 = yi + hyi0 + yi + ... + yi
2! n!
local TE=O(hn+1 .
dy
Solve by Taylor’s method of orders two and four dx
= y − x2 + 1, 0 ≤ x ≤ 2, y(0) = 12 .

1 3
y 0 = y − x2 + 1; y 0 (0)y00 = y(0) − 0 + 1 = +1=
2 2
1 3
y 00 = y 0 − 2x = y − x2 + 1 − 2x; y000 = −0+1−0=
2 2
1 1
y 000 = y 0 − 2x − 2 = y − x2 + 1 − 2x − 2 = y − x2 − 2x − 1; y0000 = −1=−
2 2
(4) 1 1
y (4) = y 0 − 2x − 2 = y − x2 + 1 − 2x − 2 = y − x2 − 2x − 1; y0 = y(0) − 0 + 1 = −1=−
2 2

1 81
Let h = 0.2.
h2 00 h3 000 h4 (4)
∴ y(0.2) = y(0) + hy00 + y + y0 + y0
2! 0 3! 4!
1 3 h2 3 h3 1 h4 1
= + (0.2) + ( ) − ( ) − ( )
2 2 2 2 6 2 24 2
2 3
1 3 3(0.2) (0.2) (0.2)4
= + (0.2) + − −
2 2 4 12 48
= 0.5 + 0.3 + 0.03 − 0.00066667 − 0.00003333
= 0.83 − 0.0007
y(0.2) = 0.8293 by T aylor0 s method of order 4

By Taylor’s method of order 2

h2 00
y(0.2) = y(0) + hy00 + y
2! 0
1 3 (0.2)2 3
= + (0.2) + .
2 2 2 2
= 0.5 + 0. + 0.03
= 0.83

The exact solution is y(x) = (x + 1)2 − 12 ex . Thus y(0.2) = 0.8293(correct to 4 decimal


places).
| absolute error | (in using 2nd order method)=|0.83 − 0.8293| = 0.0007.
dy
dx
= x − y 2 ; y(0) = 1.
The Taylor series for y(x) is

x2 00 x3 x4 x5
y(x) = y(x − 0 + 0) = y(0) + xy 0 (0) + y (0) + y 000 (0) + y (4) (0) + y (5) (0) + ...
2! 3! 4! 5!
y 0 = x − y 2 ; y 0 (0) = 0 − y 2 (0) = −1
y 00 = 1 − 2yy 0 ; y 00 (0) = 1 − 2y(0)y 0 (0) = 1 − 2y(0)y 0 (0) = 1 − 2(1)(−1) = 1 + 2 = 3
y 000 = −2y 02 − 2yy 00 ; y 000 (0) = −2(−1)2 − 2(1)(3) = −8
y (4) = −4y 0 y 00 − 2y 0 y 00 − 2yy 000 ; y (4) (0) = −4(−1)(3) − 2(−1)(3) − 2(1)(−8) = 34
y (5) = −6y 002 − 8y 0 y 000 − 2yy (4) ; y (5) (0) = −6(9) − 8(−1)(−8) − 2(1)(34) = −54 − 64 − 68 = −186
3 4 17 31
∴ y(x) = 1 − x = x2 − x3 + x4 − x5 + ...
2 3 12 20
Find y(0.1) correct to four decimal places.
Use Taylor series method of order 2, then
3
y(x) = 1 − x + x2
2
3
y(0.1) = 1 − 0.1 + (0.1)2 = 1 − 0.1 + 0.015 = 0.915
2

2 82
Use Taylor series method of order 3, then
3 4
y(x) = 1 − x + x2 − x3
2 3
3 4
y(0.1) = 1 − 0.1 + (0.1)2 − (0.1)3
2 3
= 1 − 0.1 + 0.015 − 0.00133333
= 0.915 − 0.00133333
= 0.91366667

We see that y(0.1) obtained by Taylor’s method of orders two and three do not match up
to 4 decimal places. So, we consider Taylor’s method of order 4.
3 4 17
y(x) = 1 − x + x2 − x3 + x4
2 3 12
17
y(0.1) = 0.91366667 + (0.1)4
12
= 0.91366667 + 0.00014167
= 0.91380834

Again y(0.1) obtained using Taylor’s method of orders three and four do not match upto 4
decimal places. We use Taylor’s method of order 5.
3 4 17 31
y(x) = 1 − x + x2 − x3 + x4 − x5
2 3 12 20
31
y(0.1) = 0.91380834 − (0.1)5
20
= 0.91380834 − 0.00000155
= 0.91379284

y(0.1) using 4th order Taylor method is 0.9138 correct to four decimal places.
y(0.1) using 5th order Taylor method is 0.9138 correct to four decimal places.
∴ y(0.1) = 0.9138 correct to four decimal places.
We see that to obtain the value of y(0.1) correct to four decimal places, terms up to x4
should be considered.

Suppose that we want to determine the range of values of x for which the above series,
truncated after the term containing x4 , can be used to compute the values of y correct to
four decimal places, then 31
20
x5 ≤ 0.00005, which gives x ≤ 0.126.
dy
Example 1.1. Given the IVP dx = x2 + y 2 , y(0) = 0, determine the first three non-zero
terms in Taylor series for y(x) and hence obtain y(1). Also, determine x when the error in

3 83
y(x) obtained from the first two terms (non-zero) is to be less than 10−6 after rounding.

y(0) = 0
y 0 = x2 + y 2 ; y 0 (0) = 0
y 00 = 2x + 2yy 0 ; y 00 (0) = 0
y 000 = 2 + 2(yy 00 + y 02 ); y 000 (0) = 2
y (4) = 2(yy 000 + 3y 0 y 00 ); y (4) (0) = 0
y (5) = 2[yy (4) + 4y 0 y 000 + 3(y 00 )2 ]; y (5) (0) = 0
y (6) = 2[yy (5) + 5y 0 y (4) + 10y 00 y 000 ]; y (6) (0) = 0
y (7) = 2[yy (6) + 6y 0 y (5) + 15y 00 y (4) + 10(y 000 )2 ]; y (7) (0) = 80
y (8) (0) = y (9) (0) = y (10) (0) = 0
y (11) = 2[yy (10) + 10y 0 y (9) + 45y 00 y (8) + 120y 000 y (7) + 210y (4) y (6) + 126(y (5) )2 ]; y (11) (0) = 38400
x3 x7 2 00
∴ y(x) = + + y .
3 63 2079
1 1 2
y(1) = + + = 0.350168.
3 63 2079
2
If only the first two terms are used, then the local TE is 2079 x11 .
The value of x is obtained from | 2079 x11 | < 0.5 × 10−7 which implies x ≤ 0.41.
2

Euler’s Method
Consider the well-posed IVP
dy
= f (x, y); y(a) = y0 , a ≤ x ≤ b
dx
divide the interval [a, b] into n equal subintervals of width h by points xi = a+ih, i = 0, ..., n,
x0 = a and xn = b.
We use Taylor’s theorem to derive Euler’s method. Suppose that y(x), the unique solution
to the IVP has two continuous derivatives on [a, b], so that for each i = 0, 1, ..., n − 1,

(xi+1 − xi )2 00
0
y(xi+1 ) = y(xi ) + (xi+1 − xi )y (xi ) + y (ξ)
2!
for some number ξi in (xi , xi+1 ).
2
∵ h = xi+1 − xi , y(xi+1 ) = y(xi ) + hy 0 (xi ) + h2! y 00 (ξ).
2
∵ y 0 = f (x, y), we have y(xi+1 ) = y(xi ) + hf (xi , yi ) + h2! y 00 (ξ).
Euler’s method constructs yi ' y(xi ) for each i = 1, 2, ..., n, by deleting the remainder term.
Euler’s method y0 = y(a).

yi+1 = yi + hf (xi , yi ), i = 0, 1, ..., n − 1.

Error in Euler’s method = O(h2 ).


Euler’s method can also be derived as follows.
dy
= f (x, y); y(a) = y0 , a ≤ x ≤ b
dx

4 84
integrating w.r.t. x between x0 and x1 ,
Z x1 Z x1
dy
dx = y1 − y0 = f (x, y)dx.
x0 dx x0

Assuming that f (x, y) = f (x0 , y0 ) in x0 ≤ x ≤ x1 , we have y1 − y0 = f (x0 , y0 )(x1 − x0 ) or


y1 = y0 + hf (x0 , y0 ). Rx
Similarly for the range x1 ≤ x ≤ x2 , y2 = y1 + hf (x1 , y1 ), ∵ y2 = y1 + x12 f (x, y)dx.
Proceeding in this way, we have

yn+1 = yn + hf (xn , yn ), n = 0, 1, ...

Note that Euler’s method is Taylor’s method of order 1.

• The process is very slow.

• Smaller h value is required to obtain reasonable accuracy.

• Thus not suitable for practical use.

However, the method is very simple to derive; it can be used to illustrate the techniques
involved in the construction of more advanced techniques.
Solve y 0 = −y; y(0) = 1 using Euler’s method. Determine y(0.04), taking step-size h = 0.01.

y1 ' y(0.01) = y0 + hf (x0 , y0 ) = 1 + (0.01)f (0, 1)


= 1 + (0.01)(−1)
= 1 − 0.01 = 0.99
y2 ' y(0.02) = y1 + hf (x1 , y1 ) = 0.99 + (0.01)(0.99)
= 0.99 − 0.0099
= 0.9801
y3 ' y(0.03) = y2 + hf (x2 , y2 ) = 0.9801 + (0.01)f (0.02, 0.9801)
= 0.9703
y4 ' y(0.04) = y3 + hf (x3 , y3 ) = 0.966.

5 85
Numerical Analysis
Lecture-21

1 Runge-Kutta Methods
The Taylor series method has the desirable property of higher order local truncation error
(TE).

Disadvantages (Taylor series method)


• Requires computation and evaluation of the derivatives of f (x, y).

• It is a complicated and time consuming process for most problems.

• It is too expensive for most practical problems.

• It is seldom used in practice

This leads us to look for other one step methods which imitate the Taylor series method,
without the necessity to calculate higher order derivatives. Such one step methods are called
Runge-Kutta Methods.

Theorem 1.1 (Taylor’s theorem in two variables). Suppose that f (x, y) and all its partial
derivatives of order less than or equal to (n + 1) are continuous on D = {(x, y)|a ≤ x ≤
b, c ≤ y ≤ d}, and let (x0 , y0 ) ∈ D. For every (x, y) ∈ D there exists ζ between x and x0
and µ between y and y0 with

f (x, y) = Pn (x, y) + Rn (x, y),

where
 
∂f ∂f
Pn (x, y) = f (x0 , y0 ) + (x − x0 ) (x0 , y0 ) + (y − y0 ) (x0 , y0 )
∂x ∂y
2 2
∂ 2f (y − y0 )2 ∂ 2 f
 
(x − x0 ) ∂ f
+ (x0 , y0 ) + (x − x0 )(y − y0 ) (x0 , y0 ) + (x0 , y0 )
2! ∂x2 ∂x∂y 2! ∂y 2
+−−−−−−−−−−−−−
" n  
#
n
1 X n ∂ f
+ (x − x0 )n−j (y − y0 )j n−j j (x0 , y0 )
n! j=0 j ∂x ∂y

and
" n+1  #
X n + 1 n+1
1 ∂ f
Rn (x, y) = (x − x0 )n+1−j (y − y0 )j n+1−j j (ζ, µ) .
(n + 1)! j=0 j ∂x ∂y

Pn (x) is called the nth Taylor polynomial in two variables for the function f about
(x0 , y0 ).
Rn (x) is the remainder term associated with Pn (x).

1 86
Explicit Runge-Kutta method of order 2
We want to derive a method of the form,
yn+1 = yn + hF (xn , yn , h; f ) (1)
with F (xn , yn , h; f ) as a carefully chosen approximation to f (x, y) on the interval [xn , xn+1 ].
Let
F (xn , yn , h; f ) = c1 f (x, y) + c2 f (x + αh, y + βhf (x, y)) (2)
RHS is some kind of ”average” derivative. How to choose the constants c1 , c2 , α, β?
Recall that the local truncation error for the Taylor’s method of order 2 has error not greater
than O(h2 ). We determine c1 , c2 , α, β in such a way that (1) agrees with Taylor series
method of as high order as possible.
F (xn , yn , h; f ) = c1 f (x, y) + c2 [f (x, y) + αhfx + βhfy +

1 2 2

(αh) fxx + 2(αh)(βhf )fxy + (βhf ) fyy + · · ·
2!
= (c1 + c2 )f + h [c2 αfx + c2 βf fy ]
h2  2
c2 α fxx + c2 2αβf fxy + c2 β 2 f 2 fyy + · · ·

+
2
Substituting this in (1)
yn+1 = yn + h(c1 + c2 )f + h2 [c2 αfx + c2 βf fy ] +
h3  2
c2 α fxx + c2 2αβf fxy + c2 β 2 f 2 fyy + · · ·

2
in which all functions are evaluated at (xn , yn ). Now Taylor series method gives
h2 00 h3
y(xn+1 ) = y(xn ) + hy 0 (xn ) + y (xn ) + y 000 (xn ) + · · ·
2! 3!
h2
= y(xn ) + hf (xn , yn ) + [fx + f fy ] |(xn ,yn )
2!
h3 
fxx + 2f fxy + f 2 fyy + fx fy + f fy2 |(xn ,yn ) + · · ·

+
6
So,
Tn = y(xn+1 ) − yn+1
 
2 1 1
= hf [1 − c1 − c2 ] + h fx ( − αc2 ) + f fy ( − βc2 )
2 2
      
3 1 1 2 1 1 2 2 1 1 2
=h − α c2 fxx + − αβc2 f fxy + − β c2 f fyy + fx fy + f fy + · · ·
6 2 3 6 6 6
In order that method (1) agrees with Taylor’s method of order as high as possible, we set
1 − c1 − c2 = 0 (3)
1
− αc2 = 0 (4)
2
1
− βc2 = 0 (5)
2

2 87
As the terms 16 fx fy and 61 f fy2 are independent of the choice of the coefficients c1 , c2 , α, β
we can not equate the coefficient of h3 to zero. We can not take c2 = 0 as this would lead
to a contradiction in equations (4) and (5). Hence, there are 3 equations and four variables,
we let c2 be arbitrarily and solve for α, β and c1 , in terms of c2 . This gives,
1
α=β= ; c1 = 1 − c2 .
2c2
Case I : Let c2 = 21
Then α = β = 1 and c1 = 21 .
The method is
h
yn+1 = yn + [f (xn , yn ) + f (xn + h, yn + hf (xn , yn ))] ,
2
which is modified Euler’s method or R-K method.
Case II : Let c2 = 1
Then α = β = 21 and c1 = 0.
The method is

yn+1 = yn + hf (xn + h/2, yn + h/2f (xn , yn )),

which is mid point method.


Now,
      
3 1 1 2 1 1 2 2 1 1 2
Tn = h − α c2 fxx + − αβc2 f fxy + − β c2 f fyy + fx fy + f fy + · · ·
6 2 3 6 6 6
The methods derived in case I and II are of order two since these agree with the Taylor’s
method of order two. The method derived for case I is
1
yn+1 = yn + [k1 + k2 ]
2
k1 = hf (xn , yn )
k2 = hf (xn + h, yn + hf (xn , yn ))

is called R-K method of order 2 and the truncation order is O(h2 ).

Runge-Kutta method of order 4


Solve for y in [x0 , xn+1 ]
dy
= f (x, y), y(x0 ) = y0 .
dx

1
yn+1 = yn + [k1 + 2k2 + 2k3 + k4 ] ,
6
k1 = hf (xn , yn ),
k2 = hf (xn + h/2, yn + k1 /2),
k3 = hf (xn + h/2, yn + k2 /2),
k4 = hf (xn + h, yn + k3 ).

3 88
dy
Note, if dx
= f (x) then R-K method of order 4 gives

1
yn+1 = yn + [k1 + 2k2 + 2k3 + k4 ]
6
k1 = hf (xn ),
k2 = hf (xn + h/2, yn + k1 /2) = hf (xn + h/2),
k3 = hf (xn + h/2, yn + k2 /2) = hf (xn + h/2),
k4 = hf (xn + h, yn + k3 ) = hf (xn + h).

So,
1
yn+1 = yn + [hf (xn ) + 2hf (xn + h/2) + 2hf (xn + h/2) + hf (xn + h)] ,
6
1
= yn + [hf (xn ) + 4hf (xn + h/2) + hf (xn + h)] ,
6
which is the simpson’s integration method with step-size h/2, Tn (y) = C(xn )h5 + O(h6 ) for
a suitable function C(x). Local truncation error is O(h4 ).

Problem
Solve by R-K method of order 4.
dy
= x + y; y(0) = 1.
dx
Take h = 0.1 and find y(0.2).

1
yn+1 = yn + [k1 + 2k2 + 2k3 + k4 ] ,
6
k1 = hf (xn , yn ),
k2 = hf (xn + h/2, yn + k1 /2),
k3 = hf (xn + h/2, yn + k2 /2),
k4 = hf (xn + h, yn + k3 ),

where f (x, y) = x + y, x0 = 0, y0 = 1.

k1 = hf (x0 , y0 ) = 0.1f (0, 1) = 0.1(0 + 1) = 0.1


k2 = hf (x0 + h/2, y0 + k1 /2) = 0.1f (0 + 0.1/2, 1 + 0.1/2) = 0.1(0.05 + 1.05) = 0.11
k3 = hf (x0 + h/2, y0 + k2 /2) = 0.1f (0 + 0.1/2, 1 + 0.11/2) = 0.1(0.05 + 1.055) = 0.1105
k4 = hf (x0 + h, y0 + k3 ) = 0.1f (0 + 0.1, 1 + 0.1105) = 0.1(0.1 + 1.1105) = 0.12105

1
y1 = y(0.1) = y0 + [k1 + 2k2 + 2k3 + k4 ]
6
1
= 1 + [0.1 + 2(0.11) + 2(0.1105) + 0.12105]
6
= 1.11034.

4 89
Evaluation per step 2 3 4 5≤n≤7 8≤n≤9 10 ≤ n
Best possible local TE O(h2 ) O(h3 ) O(h4 ) O(hn−1 ) O(hn−2 ) O(hn−3 )

x Exact Euler (h1 = 0.025) Modified Euler (h2 = 2h1 ) RK 4th order (h = 0.1 = 4h1 = 2h2 )
0.0 0.5000000 0.5000000 0.5000000 0.5000000
0.1 0.6574145 0.6554982 0.6573085 0.6574144
0.2 0.8292986 0.8253385 0.8290778 0.8292983
0.3 1.0150706 1.0089334 1.0147254 1.0150701
0.4 1.2140877 1.2056345 1.2136079 1.2140869
0.5 1.4256394 1.4147264 1.4250141 1.4256384

Now,
1
y2 = y1 + [k1 + 2k2 + 2k3 + k4 ] ,
6
k1 = hf (x1 , y1 ) = 0.1f (0.1, 1.11034) = 0.1(0.1 + 1.11034) = 0.121034,
k2 = hf (x1 + h/2, y1 + k1 /2) = 0.1f (0.1 + 0.1/2, 1.11034 + 0.121034/2) = 0.13208,
k3 = hf (x1 + h/2, y1 + k2 /2) = 0.1f (0.1 + 0.1/2, 1.11034 + 0.13208/2) = 0.132638,
k4 = hf (x1 + h, y1 + k3 ) = 0.1f (0.1 + 0.1, 1.11034 + 0.132638) = 0.1442978,

∴ y2 = 1.11034 + 61 [k1 + 2k2 + 2k3 + k4 ] = 1.2428 = y(0.2).

Remark
The main computational effort in using R-K methods lies in the evaluation of f.
R-K method of order 2 Local truncation error is O(h2 ). The cost is two functions
evaluations per step.
R-K method of order 4 Local truncation error is O(h4 ). The cost is four functions
evaluations per step.
The table below shows the number of evaluations per step and the order of local truncation
error (John Butches).
The R K method of order less than 5 with smaller step size are used in preference to
the higher order methods using a large step size.

Example

y 0 = y − x2 + 1, 0 ≤ x ≤ 2, y(0) = 5.

results are compared at the mesh points. Each of these methods requires 20 functions
evaluations to determine the values presented in the table. The R K method of order 4 is
superior.

5 90
Numerical Analysis
Lecture-15

Numerical solution of ordinary differential equations-5


We consider an IVP and see how we can solve it using RK method of order 2.
dy y+x
= , y(0) = 1.
dx y−x

Take h = 0.2, compute y(0.4).

y(0.2)
y0 = 1

0 0.2 0
x0

y(0.4)

1
y1 = y(x1 ) = y0 + [k1 + k2 ]
2
∴ y(0.2) = 1 + 12 [k1 + k2 ]

 
1+0
k1 = hf (x0 , y0 ) = 0.2f (0, 1) = 0.2 = 0.2.
1−0
k2 = hf (x0 + h, y0 + hf (x0 , y0 ))
= hf (x0 + h, y0 + k1 )
= 0.2f (0 + 0.2, 1 + 0.2f (0, 1))
= 0.2f (0.2, 1.2)
 
1.2 + 0.2
= 0.2 = 0.28.
1.2 − 0.2

∴ y(0.2) = 1 + 12 [0.2 + 0.28] = 1 + 12 (0.48) = 1.24.


∴ y1 = y(x1 ) = y(0.2) = 1.24.

Now x1 = 0.2, y(x1 ) = 1.24. We solve

dy y+x
= , y(x1 ) = 1.24
dx y−x

1 91
and obtain y(x2 ) = y(0.4)
1
∴ y2 = y(x2 ) = y1 + [k1 + k2 ]
2
1
= 1.24 + [k1 + k2 ] .
2

k1 = hf (x1 , y1 ) = 0.2f (0.2, 1.24


 
1.24 + 0.2
= 0.2 = (0.2)(1.44)/1.04
1.24 − 0.2
= 0.288/1.04 = 0.27692308
k2 = hf (x1 + h, y1 + k1 )
= 0.2f (0.4, 1.24 + 0.27692308)
= 0.2f (0.4, 1.51692308)
 
1.51692308 + 0.4
= 0.2
1.51692308 − 0.4
(1.91692308)(0.2) 0.38338462
= = = 0.34325069.
1.11692308 1.11692308

1
∴ y2 = 1.24 + [k1 + k2 ]
2
1
= 1.24 + (0.62017377)
2
= 1.24 + 0.31008688 = 1.55008688,

∴ y(0.4) = 1.55008688.

Modified Euler’s Method


dy
= f (x, y), y(x0 ) = y0 ,
dx
we wish to determine y(xn ).
Divide [x0 , xn ] into n equal subintervals of width h by points xi , where xi = x0 + ih,
i = 0, 1, 2, · · ·

y(x0 ) = y0
y(xn ) =??

x0 x1 x2 x3 xn−1 xn

The method is based on the assumption that in each subinterval [xi−1 , xi ], the solution
curve y = y(x) is approximated by a straight line L whose slope is the average of the
slopes of two lines L1 and L2 , where the slope of L1 is f (xi−1 , yi−1 ) and the slope of L2 is
f (xi−1 + h, yi ) = f (xi , yi ).

2 92
1
∴ slope of L is 2
[f (xi−1 , yi−1 ) + f (xi , yi )].
Equation of the line is
1
y − yi−1 = (x − xi−1 ) [f (xi−1 , yi−1 ) + f (xi , yi )] .
2
The point where it meets the curve at xi is yi
1
∴ yi − yi−1 = (xi − xi−1 ) [f (xi−1 , yi−1 ) + f (xi , yi )]
2
(xi − xi−1 )
yi = yi−1 + [f (xi−1 , yi−1 ) + f (xi , yi )]
2
h
or yi = yi−1 + [f (xi−1 , yi−1 ) + f (xi , yi )] (1)
2
In evaluating yi that appears on the RHS of (1) in f (xi , yi ), we use Euler’s method so that,
yi = yi−1 + hf (xi−1 , yi−1 ),
(p)
call it yi (yi predicted). This gives,
hh (p)
i
yi = yi−1 + f (xi−1 , yi−1 ) + f (xi , yi ) , (2)
2
(p)
where yi = yi−1 + hf (xi−1 , yi−1 ) (3)
This is the modified Euler’s method. Taking i = 1, determine y1 as
hh (p)
i
y1 = y0 + f (x0 , y0 ) + f (x1 , y1 ) ,
2
(p)
where y1 = y0 + hf (x0 , y0 ),
use this y1 and take i = 2, get y2 and so on and obtain yn = y(xn ). Note: The method (2)-(3)
(p)
suggests that it can be used as a predictor-corrector pair. (3) : yi = yi−1 + hf (xi−1 , yi−1 ),
(p) (p)
use this predicted value of yi in (2) to determine f (xi , yi ). Now (2) is used to correct
(c)
this yi = yi−1 + hf (xi−1 , yi−1 ).
If you are required to determine the solution correct to a specified accuracy , then check
(c) (p)
if |yi − yi | < .
(c)
If so, stop and take y(xi ) = yi = yi . If not, recorrect this solution using (2) as
(cc) hh (c)
i
yi = yi−1 + f (xi−1 , yi−1 ) + f (xi , yi )
2
(c) (c) (cc)
Note that in this step you used yi value in evaluating f (xi , yi ) and obtained yi .
(cc) (c)
Check if |yi − yi | < . If so, stop. If not, continue to recorrect yi till the desired
degree of accuracy is attained.
Note that Modified Euler’s method is Runge-Kutta method of order 2 that we had
derived earlier. Runge-Kutta method of order 2 is
1
yn+1 = yn + [k1 + k2 ] ,
2
where
k1 = hf (xn , yn ),
k2 = hf (xn + h, yn + k1 )
= hf (xn + h, yn + hf (xn , yn ))
= hf (xn + h, yn(p) ).

3 93
Problem

dy √
= x + y = f (x, y), y(0) = 1
dx
find y(0.2) taking h as h = 0.2 by Modified Euler’s method.

(p)
y(0.2) = y1 = y1 = y0 + hf (x0 , y0 )
= 1 + 0.2f (0, 1)

= 1 + 0.2[0 + 1]
= 1.2

∴ y(0.2) = y1 by Modified Euler’s method is given by

hh (p)
i
y1 = y0 + f (x0 , y0 ) + f (x1 , y1 )
2
0.2
=1+ [f (0, 1) + f (0.2, 1.2)]
2 √ √
= 1 + 0.1[(0 + 1) + (0.2 + 1.2)]
= 1.2295.
(p)
But, we use this as a predictor-corrector pair and obtain y(0.2) = y1 . In this case y1 = 1.2,
(c)
y1 = 1.2295.
(c) (p)
|y1 − y1 | = |1.2295 − 1.2| = 1.0295.

So, we recorrect this as

(c) hh (c)
i
y1 = y0 + f (x0 , y0 ) + f (x1 , y1 )
2
0.2
=1+ [f (0, 1) + f (0.2, 1.2295)]
2 √ √
= 1 + 0.1[0 + 1 + 0.2 + 1.2295] = A.
(cc) (c)
Check if |yi − yi | <  i.e., |A − 1.2295| < . If so stop. If not, continue till the desired
degree is attained.

4 94
Numerical Analysis
Lecture-23

Numerical solution of ordinary differential equations-6


Euler’s method, Taylor series method, modified Euler’s method, Runge-Kutta methods
that we had discussed so far, for solving IVPs are called one-step methods because the
approximations for the mesh points xn+1 involves information from the previous point, xn .
dy
Given dx = f (x, y) with y(x0 ) = y0 , one uses any of the above one-step methods and
computes y1 = y(x1 ) where x1 = x0 + h; with this information one continues to find
an approximation y2 = y(x2 ) where x2 = x1 + h by solving the initial value problem
dy
dx
= f (x, y) with y(x1 ) = y1 and so on; thus marching forward each time computing
y(xi+1 ) with the information available at y(xi ) at the previous step, one has approximations
to y(x1 ), y(x2 ), · · · y(xi ), y(xi+1 ), · · · y(xn ), y(xn+1 ).
This suggests that one can develop methods that uses information at a set of previous
points to determine the approximation at xn+1 . Such methods which use information at
more than one previous mesh point to determine the approximation at the next step are
called multi-step methods.
We shall derive some multi-step methods now and see how we can use these as predictor-
dy
corrector pairs in solving IVP dx = f (x, y) with y(x0 ) = y0 .

Adam-Moulton Method
dy
Given dx = f (x, y), we use the information that the solution to the given IVP is known at
the previous equally spaced points xn , xn−1 , xn−2 and xn−3 to compute y(xn+1 ).
dy
Consider dx = f (x, y). Integrating between xn to xn+1 ,
Z xn+1 Z xn+1
dy
dx = f (x, y)dx,
xn dx xn
Z xn+1
∴ yn+1 − yn = f (x, y)dx (1)
xn

We employ Newton’s backward interpolation polynomial so that,

p(p + 1) 2 p(p + 1)(p + 2) 3


f (x, y) = f (xn ) + p∇fn + ∇ fn + ∇ fn
2 3!
p(p + 1)(p + 2)(p + 3) 4
+ ∇ fn + · · · , (2)
4!
where p = x−x
h
n
, x = xn + ph..
Substituting (2) into (1), we get
Z xn+1 
p(p + 1) 2 p(p + 1)(p + 2) 3
yn+1 − yn = f (xn ) + p∇fn + ∇ fn + ∇ fn
xn 2 3!

p(p + 1)(p + 2)(p + 3) 4
+ ∇ fn + · · · dx,
4!

1 95
when x = xn+1 , we have xn+1 = xn + ph or, xn+1 − xn = ph or, p = 1.
when x = xn , we have xn = xn + ph or, p = 0.
Also, dx = hdp.
Z 1
p(p + 1) 2 p(p + 1)(p + 2) 3
yn+1 − yn = h fn + p∇fn + ∇ fn + ∇ fn
0 2 3!

p(p + 1)(p + 2)(p + 3) 4
+ ∇ fn + · · · dp
4!

1 1
p2
Z 
1
pdp = =
0 2 0 2
Z 1 Z 1  3 1
2 p p2 5
p(p + 1)dp = (p + p)dp = + =
0 0 3 2 6
Z 1 Z 1 Z 10
p(p + 1)(p + 2)dp = (p2 + p)(p + 2)dp = (p3 + 3p2 + 2p)dp
0 0 0
2 1
 4 3

p p p 9
= +3 +2 =
4 3 2 0 4
Z 1 Z 1
p(p + 1)(p + 2)(p + 3)dp = (p3 + 3p2 + 2p)(p + 3)dp
0
Z0 1
= (p4 + 6p3 + 11p2 + 6p)dp
0
 5 1
p p4 p3 p2 251
= + 6 + 11 6 =
5 4 3 2 0 30

 
1 5 2 3 3 251 4
∴ yn+1 = yn + h fn + ∇fn + ∇ fn + ∇ fn + ∇ fn .
2 12 8 720
Now,

∇fn = fn − fn−1
∇2 fn = fn − 2fn−1 + fn−2
∇3 fn = fn − 3fn−1 − 3fn−2 − fn−3

1 5 3
∴ fn + ∇fn + ∇2 fn + ∇3 fn
2 12 8
1 5 3
= fn + [fn − fn−1 ] + [fn − 2fn−1 + fn−2 ] + [fn − 3fn−1 − 3fn−2 − fn−3 ]
 2 12  8    
1 5 3 1 5 9 5 9 3
= fn 1 + + + + fn−1 − − − + fn−2 + + fn−3 −
2 12 8 2 6 8 12 8 8
55 59 37 9
= fn − fn−1 + fn−2 − fn−3 .
24 24 24 24

h 251 4
∴ yn+1 = yn + [55fn − 59fn−1 + 37fn−2 + 9fn−3 ] + ∇ fn .
24 720

2 96
This is known as Adam’s predictor formula. Truncation error is 251720
∇4 yn0 .
dy
We now derive the corrector formula. Integrating dx = f (x, y) with respect to x from xn
to xn+1 ,
Z xn+1 Z xn+1 Z xn+1
dy
dx = f (x, y)dx, ∴ yn+1 − yn = f (x, y)dx
xn dx xn xn

Now use Newton’s backward interpolation formula given by


p(p + 1) 2 p(p + 1)(p + 2) 3
f (x, y) = fn+1 + p∇fn+1 + ∇ fn+1 + ∇ fn+1 + · · · ,
2 3!
then, we get
Z 0 
p(p + 1) 2 p(p + 1)(p + 2) 3
yn+1 = yn + h fn+1 + p∇fn+1 + ∇ fn+1 + ∇ fn+1 +
−1 2 3!

p(p + 1)(p + 2)(p + 3) 4
+ ∇ fn+1 + · · · dp,
4!
x−xn+1 xn −xn+1 xn+1 −xn+1
where p = h
; when x = xn , p = h
= −1; when x = xn , p = h
= 0,
dx = hdp.
0 0
p2
Z 
1
pdp = =−
−1 2 −1 2
Z 0 Z 0  3 0
2 p p2 1
p(p + 1)dp = (p + p)dp = + =−
−1 −1 3 2 −1 6
Z 0 Z 0 Z 1
p(p + 1)(p + 2)dp = (p2 + p)(p + 2)dp = (p3 + 3p2 + 2p)dp
−1 −1 0
 4 3 2
0
p p p 1
= +3 +2 =−
4 3 2 −1 4
Z 0 Z 0
p(p + 1)(p + 2)(p + 3)dp = (p3 + 3p2 + 2p)(p + 3)dp
−1 −1
Z 0
= (p4 + 6p3 + 11p2 + 6p)dp
−1
5
0
p4 p3 p2

p 19
= + 6 + 11 6 =−
5 4 3 2 −1 30
 
1 1 2 1 3 19 4
∴ yn+1 = yn + h fn+1 − ∇fn+1 − ∇ fn+1 − ∇ fn+1 − ∇ fn+1 .
2 12 24 720
Consider
1 1 1
fn+1 − ∇fn+1 − ∇2 fn+1 − ∇3 fn+1
2 12 24
1 1 1
= fn+1 − [fn+1 − fn ] − [fn+1 − 2fn + fn−1 ] − [fn+1 − 3fn − 3fn−1 − fn−2 ]
 2 12
   24    
1 1 1 1 1 1 1 1 1
= fn+1 1 − − − + fn + + + fn−1 − − + fn−2
2 12 24 2 6 8 12 8 24
9 19 5 1
= fn+1 + fn − fn−1 + fn−2 .
24 24 24 24

3 97
h 19 4
∴ yn+1 = yn + [9fn+1 + 19fn − 5fn−1 + fn−2 ] − ∇ fn+1 ,
24 720
h  0 19 4 0
9yn+1 + 19yn0 − 5yn−1
0 0

or, yn+1 = yn + + yn−2 − ∇ yn+1 , (3)
24 720
where yn0 = f (xn , yn ).
0
(3) is known as Adam’s corrector formula. The term yn+1 on the RHS requires the
0
knowledge of yn+1 at x = xn+1 since yn+1 = f (xn+1 , yn+1 ).
0
∴ yn+1 is computed by using yn+1 value predicted by Adam’s predictor. Thus we have the
predictor-corrector pair (Adam’s) given by

(p) h  0 0 0
 251 4
yn+1 = yn + 55yn − 59yn−1 + 37yn−2 + 9yn−3 + ∇ fn (4)
24 720
(c) h  0 19 4
0
9yn+1 + 19yn0 − 5yn−1 0

yn+1 = yn + + yn−2 − ∇ fn+1 (5)
24 720
0 (p)
where yn+1 = f (xn+1 , yn+1 ). (5) can be repeatedly used to recorrect yn+1 till the desired
(c) (p)
degree of accuracy is attained. If  is the desired error tolerance, then if |yn+1 − yn+1 | < ,
(c) 0 (c)
we stop the computations. If not, we find yn+1 from (5), where now yn+1 = f (xn+1 , yn+1 )
(cc) (c) (c)
and check if |yn+1 − yn+1 | < . If so, we stop the computation and take yn+1 = yn+1 correct
(cc)
to the desired accuracy. If not we recorrect yn+1 and continue till the desired accuracy is
obtained.

Example
Solve
dy
= y − x2 , y(0) = 1
dx
using Adam’s predictor-corrector pair and determine an approximation to y(1) taking step-
size h = 0.2, correct to two decimal places. Step-1 : We first find y4 = y(0.8) using Adam’s
predictor-corrector pair. This requires knowledge of

y(0) = 1, y(0.2) =?, y(0.4) =?, y(0.6) =?.

We employ R-K method of order 4 and determine y(0.2) as follows.


1
y(0.2) = y(0) + [k1 + 2k2 + 2k3 + k4 ]
6
k1 = hf (x0 , y0 ) = 0.2f (0, 1) = 0.2
k2 = hf (x0 + h/2, y0 + k1 /2) = 0.2f (0.2/2, 1 + 0.2/2) = 0.218
k3 = hf (x0 + h/2, y0 + k2 /2) = 0.2f (0.2/2, 1 + 0.218/2) = 0.2198
k4 = hf (x0 + h, y0 + k3 ) = 0.2f (0.2, 1 + 0.218) = 0.23596
1
∴ y(0.2) = 1 + [0.2 + 2(0.218) + 2(0.2198) + 0.23596] = 1.21859.
6
Now solve
dy
= y − x2 , y(0, 2) = 1.21859
dx

4 98
and obtain y(0.4) using RK method of order 4. You get y(0.4) = 1.46813. Similarly obtain
y(0.6) = 1.73779.

y0 = y(0) = 1
y1 = y(0.2) = 1.21859
y2 = y(0.4) = 1.46813
y3 = y(0.6) = 1.73779.

Adam’s predictor formula is

(p) h  0 0 0 0

yn+1 = yn + 55yn − 59yn−1 + 37yn−2 − 9yn−3
24
(p) h
or y4 = y3 + [55y30 − 59y20 + 37y10 − 9y00 ]
24
where

y00 = f (x0 , y0 ) = f (0, 1) = 1


y10 = f (x1 , y1 ) = f (0.2, 1.21859) = 1.17859
y20 = f (x2 , y2 ) = f (0.4, 1.46813) = 1.30813
y30 = f (x3 , y3 ) = f (0.6, 1.73779) = 1.37779.

So,

(p) 0.2
y (p) (0.8) = y4 = 1.73779 + [55(1.37779) − 59(1.30813) + 37(1.17859) − 9(1)]
24
= 2.01451.
(p)
We need y40 = f (x4 , y4 ) in the corrector formula.

∴ y40 = f (0.8, 2.01451) = 1.3745.

Adam’s corrector is
(c) h  0
9yn+1 + 19yn0 − 5yn−10 0

yn+1 = yn + + yn−2
24
(c) h
∴ y4 = y3 + [9y40 + 19y30 − 5y20 + y10 ]
24
(c) 0.2
y4 = 1.73779 + [9(1.3745) + 19(1.37779) − 5(1.30813) + 1.17859]
24
= 2.01434.
(p)
y4 = 2.01 correct to two decimal places.
(c)
y4 = 2.01 correct to two decimal places.
∴ y4 = y(0.8) = 2.01 correct to two decimal places.
Step-2
We move on to find y(1.0) using Adam’s predictor-corrector pair.

(p) h
y5 = y (p) (1.0) = y4 + [55y40 − 59y30 + 37y20 − y10 ] .
24

5 99
Now

y40 = f (x4 , y4 ) = y4 − x24 = 2.01434 − (0.8)2 = 1.3743.


(p) 0.2
∴ y5 = 2.01434 + [55(1.3743) − 59(1.37779) + 37(1.30813) − 9(1.17859)]
24
= 2.281759.

Now
(p)
y50 = f (x5 , y5 ) = f (1.0, 2.281759) = 1.281759.
(c) h
∴ y5 = y4 + [9y50 + 19y40 − 5y30 + y20 ]
24
0.2
= 2.01434 + [9(1.281759) + 19(1.3743) − 5(1.37779) + 1.30813]
24
= 2.28156259.
(p)
y5 = 2.28 correct to two decimal places.
(c)
y5 = 2.28 correct to two decimal places.
∴ y5 = y(1.0) = 2.28 correct to two decimal places.

6 100
Numerical Analysis
Lecture-24

Milne’s predictor-corrector method


Milne’s predictor-corrector methods are multistep methods. We assume that the solution
to the given IVP is known at the past four equally spaced points x0 , x1 , x2 , x3 . Consider
dy
= f (x, y); y(x0 ) = y0 .
dx
Determine y(x4 ).
Divide [x0 , x4 ] interval into 4 equal parts (subintervals) of width x4 −x
4
0
= h by points
x0 , x1 = x0 + h, x2 = x0 + 2h, x3 = x0 + 3h, x4 = x0 + 4h. As y(x0 ) = y0 is given, solve the
dy
IVP dx = f (x, y); y(x0 ) = y0 using any of the single step methods that we have studied
and obtain y(x1 ), y(x2 ), y(x3 ).
Now we have a knowledge of solutions at the previous points x0 , x1 , x2 , x3 . We make
use of this information and find an approximation to y(x4 ), using Milne’s predictor-corrector
pair, correct to the desired degree of accuracy.
dy
Consider dx = f (x, y). Integrate with respect to x between x0 to x4
Z x4 Z x4
dy
dx = f (x, y)dx,
x0 dx x0
Z x4
y(x4 ) − y(x0 ) = f (x, y)dx.
x0

We approximate f (x, y) using Newton’s forward difference formula


p(p − 1) 2 p(p − 1)(p − 2) 3
f (x, y) = f (x0 ) + p∆f0 + ∆ f0 + ∆ f0
2 3!
p(p − 1)(p − 2)(p − 3) 4
+ ∆ f0 + · · · ,
4!
x−x0
where p = h
,x = x0 + ph, dx = hdp. When x = x0 , p = 0. When x = x4 , p = 4.
Z 4
p(p − 1) 2 p(p − 1)(p − 2) 3
∴ y4 − y0 = h f0 + p∆f0 + ∆ f0 + ∆ f0
0 2 3!

p(p − 1)(p − 2)(p − 3) 4
+ ∆ f0 + · · · dp.
4!

20 2 8 28
∴ y4 = y0 + h[4f0 + 8∆f0 + ∆ f0 + ∆3 f0 + ∆4 f0 ].
3 3 90
Now

∆f0 = f1 − f0
∆2 f0 = f2 − 2f1 + f0
∆3 f0 = f3 − 3f2 + 3f1 − f0 .

1 101
We get,
4h 28 4
y4 = y0 +[2f1 − f2 + 2f3 ] + ∆ f0
3 90
4h 28 4 0
or y4 = y0 + [2y10 − y20 + 2y30 ] + ∆ y0 . (1)
3 90
(1) is called the Milne’s predictor method.
We now derive Milne’s corrector method. Consider
dy
= f (x, y)
dx
Integrate with respect to x between x0 to x2
Z x2 Z x2
dy
dx = f (x, y)dx,
x0 dx x0
Z x2
y(x2 ) − y(x0 ) = f (x, y)dx.
x0
Z x2 
p(p − 1) 2 p(p − 1)(p − 2) 3
= f0 + p∆f0 + ∆ f0 + ∆ f0
x0 2 3!

p(p − 1)(p − 2)(p − 3) 4
+ ∆ f0 + · · · dx
4!
x−x0
where p = h
, dx = hdp. When x = x0 , p = 0, when x = x2 , p = 2.
2 
p(p − 1) 2 p(p − 1)(p − 2) 3
Z
∴ y2 − y0 = h f0 + p∆f0 + ∆ f0 + ∆ f0
0 2 3!

p(p − 1)(p − 2)(p − 3) 4
+ ∆ f0 + · · · dp
4!
 
1 2 1 4
∴ y2 = y0 + h 2f0 + 2∆f0 + ∆ f0 − ∆ f0 .
3 90
R2
Note that, 0 p(p − 1)(p − 2)dp = 0.
Now ∆f0 = f1 − f0 ; ∆2 f0 = f2 − 2f1 + f0 , so we get
h 1
y2 = y0 + [f0 + 4f1 + f2 ] − h∆4 y00
3 90
h 0 1
or y2 = y0 + [y0 + 4y10 + y20 ] − h∆4 y00 .
3 90
Milne’s corrector method.
Milne’s predictor is
4h 28
y4 = y0 + [2y10 − y20 + 2y30 ] + h∆4 y00 .
3 90
Predictor formula is
(p) 4h  0 0
+ 2yn0

yn+1 = yn−3 + 2yn−2 − yn−1 (2)
3

2 102
Corrector method is
h 0 1
y2 = y0 + [y0 + 4y10 + y20 ] − h∆4 y00 .
3 90
As we want to correct y at x4 using this corrector, we rename the points as follows

x0 = x2 = xn−1 , x1 = x 3 = x n , x2 = x4 = xn+1 .

This gives

(c) h
y4 = y2 + [y20 + 4y30 + y40 ]
3
(c) h 0
or yn+1 = yn−1 + [yn−1 + 4yn0 + yn+1
0
],
3
0 (p) (p)
where, yn+1 = f (xn+1 , yn+1 ), yn+1 is obtained from (2).

Problem
dy
Solve dx = x + y; y(0) = 1 and determine y(0.4) using Milne’s predictor corrector method.
Solution

Divide [0, 0.4] into 4 equal subintervals of width 0.4/4 = 0.1 by points xi , i = 0, 1, 2, 3, 4.
We need y0 , y1 , y2 , y3 and y00 , y10 , y20 , y30 in order to predict y(0.4) = y4 using Milne’s
predictor method.

y00 = f (x0 , y0 ) = f (0, 1) = 1


y10 = f (x1 , y1 )

We determine y1 as follows (use Taylor-series method)

h2 (2) h3 (3) h4 (4) h5 (5)


y(0 + h) = y(0) + hy00 + y + y0 + hy0 + y0
2! 0 3! 4! 5!
2 3 4
0 (0.1) (2) (0.1) (3) (0.1) (4) (0.1)5 (5)
y(0.1) = y(0) + (0.1)y0 + y + y + hy0 + y
2! 0 3! 0 4! 5! 0

(neglect higher order terms)


Now

y 0 = x + y; y 0 (0) = 1
y 00 = 1 + y 0 ; y 00 (0) = 2
y 000 = y 00 ; y 000 (0) = 2
y (4) = y 000 ; y (4) (0) = 2
y (5) = y (4) ; y (5) (0) = 2.

(0.1)2 (0.1)3 (0.1)4 (0.1)5


∴ y(0.1) = 1 + (0.1)1 + 2+ 2+ 2+ 2 = 1.1103
2! 3! 4! 5!

3 103
(0.2)2 (2) (0.2)3 (3) (0.2)4 (4) (0.2)5 (5)
y(0.2) = y(0) + (0.2)y00 + y + y + hy0 + y
2! 0 3! 0 4! 5! 0
(0.2)2 (0.2)3 (0.2)4 (0.2)5
= 1 + (0.2)1 + 2+ 2+ 2+ 2
2! 3! 4! 5!
= 1.2428
(0.3)2 (2) (0.3)3 (3) (0.3)4 (4) (0.3)5 (5)
y(0.3) = y(0) + (0.3)y00 + y + y + hy0 + y
2! 0 3! 0 4! 5! 0
(0.3)2 (0.3)3 (0.3)4 (0.3)5
= 1 + (0.3)1 + 2+ 2+ 2+ 2
2! 3! 4! 5!
= 1.3997

Now

y00 = x 0 + y0 =1
y10 = x1 + y 1 = 1.2103
y20 = x2 + y 2 = 1.4428
y30 = x3 + y 3 = 1.6997

Using Milne’s predictor method

(p) 4h
y4 = y0 + [2y10 − y20 + 2y30 ]
3
4
= 1 + (0.1) [2(1.2103) − 1.4428 + 2(1.6997)]
3
= 1.583627

Milne’s corrector gives

(c) h 0 (p)
y4 = y2 + [y + 4y30 + y40 ] ; y40 = f (x4 , y4 ) = f (.4, 1.5836) = 1.9836
3 2
(c) 0.1
∴ y4 = 1.2428 + [1.4428 + 4(1.6997) + 1.9836] = 1.58364
3
∴ y(0.4) = 1.5836 correct to 4 decimal places.

4 104
Numerical Analysis
Lecture 25

Numerical Solution of ODE (Boundary Value Prob-


lems)
Boundary Value Problem(BVP)
Goal
Develop numerical techniques for approximating the solution of the two-point BVP
y 00 = f (x, y, y 0 ), a ≤ x ≤ b
α1 y(a) + α2 y 0 (a) = α3
β1 y(b) + β2 y 0 (b) = β3

Linear BVP
f (x, y, y 0 ) = p(x)y 0 + q(x)y + r(x) for some functions p, q, r.

Nonlinear
Otherwise

Robin or Mixed boundary condition


A linear combination of the unknown function and its first derivative is presented at the
endpoints a, b of [a, b].
Special case: α2 = 0, β2 = 0.

Dirichlet boundary condition


y(a) = αα13 = α
y(b) = ββ13 = β
Special case α1 = 0, β1 = 0

Neumann BC
y60(a) = αα23 = γ
y 0 (b) = ββ23 = δ

1 Linear BVP
Consider the general second order one dimensional two-point BVP
y 00 = f (x, y, y 0 ), a ≤ x ≤ b
α1 y(a) + α2 y 0 (a) = α3
β1 y(b) + β2 y 0 (b) = β3

1 105
1.1 Finite difference method
We investigate the linear BVP
y 00 = p(x)y 0 + q(x)y + r(x), a ≤ x ≤ b
with Dirichlet BC’s
y(a) = α
y(b) = β
using a finite diference method.

Objective of a finite difference method


Approximate the value of the exact solution to the BVP, y(x), at a discrete set of points
y : the value of exact solution at x = xi
x0 , x1 , . . . , xN ∈ [a, b] i Replace each deriva-
wi : the finite difference approximation to yi
tive in the BVP with appropriate finite difference formula. For example
y 00 (xi ) = yi+1 −2y
h2
i +yi−1
+ O(h2 )
y −y
y 0 (xi ) = i+12h i−1 + o(h2 )
Here, the derivatives are replaced by second-order central difference approximations.

What happens?
Converts the single continuous ODE for the unknown y(x) into a system of algebraic equa-
tions for the unknowns w0 , w1 , . . . , wN .
Step 1 Partition the interval [a, b] into subintervals by points x0 , x1 , . . . , xN such that
a = x0 < x1 < x2 < · · · < xi < · · · < xN = b
For simplicity, assume a uniform grid, i.e., the points are equally spaced: xi = x0 + ih, i =
0, 1, . . . , N. h = b−a
N
= xNN−x0 . x0 , xN are boundary grid points. x1 , x2 , . . . , xN −1 are interior
grid points. h is the step size or mesh size. h is a key parameter governing the accuracy of
the finite difference approximation
Step 2: Derivation of the Algebraic system Evaluate the DE
y 00 = p(x)y 0 + q(x)y + r(x)
at each interior grid points xi , i = 1, 2, . . . , N − 1
y 00 |xi = p(xi )y 0 |xi +q(xi )y |xi +r(xi ), i = 1, 2, . . . , N − 1
Replace the derivatives in the above ODE by second order central difference formula:
yi+1 −2yi +yi−1 −yi−1
h2
+ O(h2 ) = pi [ yi+12h ] + qi yi + ri + O(h2 )
Drop the truncation error terms. Replace all the yi (exact solution at x = xi ) by wi
(approximate solution at x = xi ) for i = 1, 2, . . . , N − 1.
wi+1 −2wi +wi−1 −wi−1
h2
+ O(h2 ) = pi [ wi+12h ] + qi wi + ri + O(h2 ), i = 1, 2, . . . , N − 1
There are N + 1 unknowns w0 , w1 , w2 , . . . , wN . Now we use the Dirichlet BCs

2 106
y(a) = α =⇒ y(x0 ) = α or y0 = w0 = α
y(b) = β =⇒ y(xN ) = β or yN = wN = β

We now have a numerical method of second order:

w0 = α
wi+1 −2wi +wi−1 −wi−1
h2
= pi [ wi+12h ] + qi wi + ri , i = 1, 2, . . . , N − 1 (∗)
wN = β

Step 3: Matrix formulation Multiply (∗) by −h2 and collect the like terms. This ensures
that the coefficient of wi is positive and also helps to avoid division by a small number.

(−1 − h2 pi )wi−1 + (2 + h2 qi )wi + (−1 + h2 pi )wi+1 = −h2 ri , i = 1, 2, . . . , N − 1

For i = 1

(−1 − h2 p1 )w0 + (2 + h2 qi )w1 + (−1 + h2 p1 )w2 = −h2 r1


But w0 = α (prescribed)
∴ (2 + h qi )w1 + (−1 + h2 p1 )w2 = −h2 r1 + (1 + h2 p1 )α
2

For i = N − 1, use wN = β (prescribed)

(2 + h2 qN −1 )wN −1 + (−1 − h2 pN −1 )wN −2 = −h2 rN −1 + (1 − h2 pN −1 )β

The system Aw = b can now be written as

    
d1 u 1 w1 b1
 l2 d2 u2  w2   b2 
    

 l3 d3 u3 
 w3  
  b3 

 − − − − − − − − −  −   − 
  = 
 − − − − − − − − −  −   − 
    

 lN −3 dN −3 uN −3 
 wN −3   bN −3
 


 lN −2 dN −2 uN −2  wN −2   bN −2 
lN −1 dN −1 wN −1 bN −1

−h2 r1 + (1 + h2 p1 )α
   
w1

 w2 


 −h2 r2 


 w3 


 −h2 r3 

 −   − 
w= ; b =  

 − 


 − 

2

 wN −3 


 −h rN −3 

 wN −2   −h2 rN −2 
wN −1 −h2 rN −1 + (1 − h2 pN −1 )β
Here

di = 2 + h2 qi
ui = −1 + h2 pi
li = −1 − h2 pi
for i = 1, 2, . . . , N − 1

3 107
−h2 r1 + (1 + h2 p1 )α
 

 −h2 r2 


 −h2 r3 

 − 
b= 

 − 

2

 −h rN −3 

 −h2 rN −2 
−h2 rN −1 + (1 − h2 pN −1 )β
Does the system Aw = b have a unique solution?
Suppose that p is continuous and that q(x) ≥ 0 on [a, b]. The continuity of p over the
closed interval [a, b] guarantees the existence of a positive constant L such that |p(x)| ≤ L
on [a, b] (by extreme value theorem).
Choose h < L2 . Then, for each i, −1 < h2 pi < 1 This implies that −1 − h2 pi and −1 + h2 pi
are always negative.

∴ |−1 + h2 pi | = 1 + h2 pi ; |−1 + h2 pi | = 1 − h2 pi

Now, consider rows 2 to N − 2 in the system Aw = b. Sum of the absolute value of the
off-diagonal entries in A is such that

|−1 − h2 pi | + |−1 + h2 pi | = 2 ≤ |2 + h2 qi |, the diagonal entry.

In row 1,

|u1 | = |−1 + h2 p1 | = 1 − h2 p1
|d1 | = |2 + h2 q1 | = 2 + h2 q1
|u1 | < |d1 | (strict inequality)

Also, row N − 1,

|uN −1 | = |−1 + h2 pN −1 | = 1 − h2 pN −1
|dN −1 | = |2 + h2 qN −1 | = 2 + h2 qN −1
and |uN −1 | < |dN −1 | (strict inequality)

∴ A is diagonally dominant.
∴ A is nonsingular and hence A−1 exists and w = A−1 b is a unique solution of the system
Aw = b.

The conditions given above are sufficient to guarantee a unique solution of Aw = b


provided the step size h is smaller than L2 .

4 108
Numerical Analysis
Lecture 26

Numerical Solution of ODE-9:Boundary Value Prob-


lems(FDM)
Example 0.1.

− y 00 + π 2 y = 2π 2 sin(πx), x ∈ [0, 1]


y(0) = 0
Dirichlet BC
y(1) = 1
p(x) = 0, q(x) = π 2 , r(x) = −2π 2 sin(πx)
∵ p(x) = 0, q(x) = π 2 > 0 on [0, 1], we are guaranteed of a unique solution to the finite-
difference equation for any value of h.
Partition [0, 1] into 4 equal subintervals of width h = 1−0
4
by points

0 = x0 < x1 < x2 < x3 < x4 = 1


1 1 3
4 2 4
0 = x0 x1 x2 x3 x4 = 1

Evaluate the Differential Equation(DE) at the interior grid points

−yi00 + π 2 yi = 2π 2 sin(πxi ), i = 1, 2, 3

Replace the second derivative by its second order central difference formula. This gives
−yi−1 +2yi −yi+1
( 14 )2
+ O(h2 ) = π 2 yi = 2π 2 sin( πi
4
) i = 1, 2, 3

Drop the truncation error term and replace each yi (exact solution at x = xi ). Multiply by
h2 = ( 14 )2 and collect the like terms:

−wi−1 + [2 + ( pi4 )2 ]wi − wi+1 = 2( pi4 )2 sin( iπ4 ) i = 1, 2, 3

Now, use

w0 = y(0) = 0
w4 = y(4) = 0

We obtain the system Aw = b can now be written as


  
w0 = 0 → 1 0 w0
pi 2
i=1  −1 2 + ( 4 ) −1 w1
  
2 + ( pi4 )2
 
i=2 
 −1 −1 
 w2 = b

i=3  −1 2 + ( pi4 )2 −1  w3 
w4 = 0 → 0 −1 w4

1 109
Here  
√ 0π 2

 2( 4 ) 

π 2
b= √ 4π) 2
2( 
 
 2( 4 ) 
0
Solving, we get  
0

 0.725371 
 y( 14 ) ≈ ( 41 ) = 0.725371
w=
 1.025830 
 y( 12 ) ≈ ( 21 ) = 1.025830
 0.725371  y( 34 ) ≈ ( 43 ) = 0.725371
0
Note that the exact solution is y(x) = sin πx
xi Approximate solution wi exact solution yi Absolute error |yi − wi
0.00 0.000000 0.000000
0.25 0.725371 0.707107 0.018264
Table
0.50 1.025830 1.000000 0.025830
0.75 0.725371 0.707107 0.018264
1.00 0.000000 0.000000
The accuracy of the approximate solution is quite reasonable inspite of h = 14 being not
very small
Maximum absolute error in the approximate solution as a function of number N of subintervals

N maximum absolute error error ratios


4 0.0258297765
8 0.0064337127 4.014754
16 0.0016068959 4.003814
32 0.0004016275 4.000961
64 0.0001004008 4.000241
128 0.0000250998 4.000060
Table present a numerical verification of the second order accuracy of the scheme used
in solving this linear BVP with Dirichlet BCs. As N is doubled, h is halved and the
approximation error is reduced roughly by a factor of 4, which is what one would expect
from a second order numerical method (note TE=O(h2 )).

Example 0.2.

y 00 = −(x + 1)y 0 + 2y + (1 − x2 )e−x , x ∈ [0, 1]


y(0) = −1
Dirichlet BC
y(1) = 0
p(x) = −(x + 1), q(x) = 2 > 0, max |p(x)| = 2 = L
x∈[0,1]
∴ we are guaranteed of a unique solution to the finite difference equation for any value
of h < L2 = 22 i.e. h < 1
Partition [0, 1] into 4 equal subintervals of width h = 1−0
4

2 110
−1 = y0 = w0 w1 w2 w3 y4 = w4 = 0
1 1 3
4 2 4

0 = x0 x1 x2 x3 x4 = 1

(−1 − h2 pi )wi−1 + (2 + h2 qi )wi + (−1 + h2 pi )wi+1 = −h2 ri0 i = 1, 2, 3


pi = −(xi + 1) = −(1 + 4i )
qi = 2
i
ri = (1 − x2i )e−xi = [1 − ( 4i )2 ]e− 4
w0 = −1
w4 = 0
We have Aw = b can now be written as

−1
    
1 0 w0
 − 27 17 − 37   w1   − 15 e− 14 
 32 8 32   256
  w2  =  − 3 e− 12
13 17 19
 

 − 16 8
− 16    64


 − 25
32
17
8
− 39
32
  w3   − 7 e− 34 
256
0 1 w4 0
 
−1
 −0.582559 
  y( 14 ) ≈ ( 41 ) = −0.582559
w=  −0.301452 
 y( 12 ) ≈ ( 21 ) = −0.301452
 −0.116906  y( 34 ) ≈ ( 43 ) = −0.116906
0
Exact solution is y(x) = (x − 1)e−x
xi Approximate solution wi exact solution yi Absolute error |yi − wi
0.00 −1.000000 −1.000000
0.25 −0.582559 −0.584101 0.001542
Table
0.50 −0.301452 −0.303265 0.001813
0.75 −0.116906 −0.118092 0.001186
1.00 0.000000 0.000000
Approximate solution compares well with the exact solution

Linear BVP with non-Dirichlet BCs


Consider the one-dimensional two-point BVP

y 00 = p(x)y 0 + q(x)y + r(x), x ∈ [a, b]



 α1 y(a) + α2 y 0 (a) = α3
Robin BC β1 y(b) + β2 y 0 (b) = β3
α2 6= 0, β2 6= 0

Partition [a, b] into N equal subintervals of width h = b−a


N
by points a = x0 < x1 < x2 <
· · · < xN −1 < xN = b; xi = x0 + ih, i = 0, 1, . . . , N . Let wi denote the approximation to
the solution y(x), at x = xi . There are N + 1 unknowns w0 , w1 , . . . , wN . We need N + 1
equations to determine them. N − 1 of these can be obtained as explained before and are
given by
(−1 − h2 pi )wi−1 + (2 + h2 qi )wi + (−1 + h2 pi )wi+1 = −h2 ri0 i = 1, 2, . . . , N − 1

3 111
wf

xf x0 x1 x2 ... ... xN

We introduce a functions node xf to the left of x0 such that x0 − xf = h and apply the
scheme at x0 . At x = x0 ,
(−1 − h2 p0 )wf + (2 + h2 q0 )w0 + (−1 + h2 p0 )w1 = −h2 r0
We now eliminate wf using the b.c

α1 y(a) + α2 y 0 (a) = α3
w −w
=⇒ α1 w0 + α2 [ 12h f ] = α3 (used second order central difference formula for approximating y 0 )
2h
=⇒ wf = w1 − α2
(α3 − α1 w0 )
∴ at x = x0 = a, we have
[2 + h2 q0 − (2 + hp0 )h αα12 ]w0 − 2w1 = −h2 r0 − (2 + hp0 )h αα23

Special case α1 = 0(Neumann B.C)


The corresponding equation is
α3
(2 + h2 q0 )w0 − 2w1 = −h2 r0 − (2 + hp0 )hα, α= α2

Similarly we can handle the Robin BC at xN = b. The corresponding finite-difference


equation is

−2wN −1 + [2 + h2 qN + (2 − hpN )h ββ12 ]wN = −h2 rN + (2 − hpN )h ββ32

Special case β1 = 0(Neumann B.C at x = b)


The corresponding equation is
β3
−2wN −1 + (2 + h2 qN )wN = −h2 rN + (2 − hpN )hβ, β= β2

We have (N + 1) equations in (N + 1) unknowns. Rewrite them in matrix form aw = b and


solve for w0 , w1 , . . . , wN .
− − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − −−
(Derivation of the above finite difference formula at xN = b)

β1 y(b) + β2 y 0 (b) = β3
−wN −1
=⇒ β1 wN + β2 [ wN +12h ] = β3 (where wN +1 = wf =value of w at the fictitious node)
∴ 2hβ1 wN + β2 wN +1 − β2 wN −1 = β3 .2h
∴ β2 wN +1 = 2hβ3 + β2 wN −1 − 2hβ1 wN
∴ wN +1 = 2hββ2
3
+ wN −1 − 2h ββ12 wN
2h
=⇒ wf = wN +1 = wN −1 + β2
(β3 − β1 wN )
wN −1 wN +1

x0 x1 x2 ... ... xN −1 xN xN +1

4 112
Example 0.3.
Solve y 00 + y = sin 3x, x ∈ [0, π2 ]
Robin BC: y(0) + y 0 (0) = −1
Neumann BC: y 0 ( π2 ) = 1
π
−0
Partition [0, π2 ] into 4 equal subintervals of width h = 2
4
by points

0 = x0 < x1 < x2 < x3 < x4 = 1; xi = 8
, i = 0, 1, 2, 3, 4

p(x) = 0, q(x) = −1, r(x) = sin 3x.


For each i,

pi = 0,
qi = −1,
π
ri = sin(3i ), (i = 0, 1, . . . , 4)
8
Here α1 − α2 = 1, α3 = −1, β = 1. The system of finite-difference equations in matrix
form is obtained from
wi−1 −2wi +wi+1
(h2
+ wi = sin 3xi i = 1, 2, 3

along with
w −w
w0 + 12h f = −1(BC at x0 = 0)
=⇒ 2hw0 + w1 − wf = −2h
wf = 2hw0 + 2h + w1
and
wN +1 −wN −1
2h
= 1 (BC at xN = π2 )
=⇒ wN +1 = 2h + wN −1

The system is Aw = b where

d − π4 −2
 
 −1 d −1 
 whered = 2 − ( π )2
 
A= −1 d −1
  8
 −1 d −1 
−2 d
π
   
w0 4
 w1   −( π )2 sin 3π 
   8 8 
w= w 2
 b =  −( π )2 sin 3π 
   8 4 
 w3   −( π )2 sin 9π 
8 8
w4 −( π8 )2 + π4
Solution is  
−1.023672

 −0.935445 

w=
 −0.560486 

 0.00995175 
0.519840

5 113
Exact solution y(x) = − cos x + 38 sin x − 18 sin 3x

xi Approximate solution wi exact solution yi Absolute error |yi − wi


0.00 −1.023672 −1.000000 0.023672
π
8
−0.935445 −0.895858 0.039587
π
4
−0.560486 −0.530330 0.030156

8
0.00995175 0.0116068 0.001655
π
2
0.519840 0.500000 0.019840
The accuracy of the approximate solution is reasonable inspite of h not being very small.
Maximum absolute error in the approximate solution as a function of number N of subintervals

N maximum absolute error error ratio


4 0.0395865088
8 0.0094846260 4.173755
16 0.0023587346 4.021065
32 0.0005899465 3.998218
64 0.0001473906 4.002605
128 0.0000368515 3.999585
256 0.0000092125 4.000162
512 0.0000023031 4.000031
A numerical verification of the second order accuracy of the scheme is presented in
the Table. As the number of intervals N is doubled, the step size h is halved and the
approximate solution is such that the error is reduced roughly by a factor of 4, (the method
is such that error=O(h2 )).

6 114
Numerical Analysis
Lecture 27

Numerical Solution of ODE-10:Shooting Method


The Shooting Method
An alternative numerical method for solving BVPs.

Basic idea
Convert the BVP into two IVPs; solve the IVPs using the techniques developed for IVPs
such as Taylor series method, Runge-Kutta method etc.
In the case of linear BVPs, combine the solutions of the IVPs to generate the solution
of the original BVP.

Shooting Method:Boundary Value Problems


The two point boundary value problem is

y 00 = f (x, y, y 0 ), x ∈ [a, b]
y(a) = α
y(b) = β

Shooting method for solving the above BVP is based on a very simple idea.
Consider the problem of hitting a stationary target at a distance. We attempt to fire
y target

attempt 1

attempt 3

α attempt 2

y(a) = α a x
b y(b) = β
an artillery shell to hit the target.
Assume that the artillery piece is pointed in the direction of the target. The elevation angle
of the barrel is a variable and this is the initial slope of the trajectory of the projectile.

1 115
• Fire the first round with some barrel angle (θ1 ); record where the projectile lands in
y target

attempt 1

attempt 3

α attempt 2

a x
this case b

• Fire the second round with barrel angle (θ2 ); record where the projectile lands in this
case

• interpolate between the two previous targets; it is then possible to come closer to the
target with third round

• we can continue this process depending upon the desired accuracy

Such a techniques applied to DEs is referred to as Shooting Method

dy
barrel angle ↔ initial slope dx at x = a
shooting the gun ↔ solving the DE
hitting the target ↔ matching the BC y(b) = β

The shooting method requires good initial guesses for the slope at x = a.

Consider the second order BVP

y 00 (x) = f (x)
y(0) = 0; y(1) = 1

Step 1 transform the BVP into an IVP


Step 2 Obtain the solution of the IVP using Taylor series method
or Euler’s method or Runge-Kutta methods etc
Step 3 solution of the given BVP
Step 1 requires the knowledge of y 0 (0) = m
We make two guesses for this slope m, say m1 and m2 , and we solve the two IVPs by any
of the methods for IVP’s
1. y 00 = f (x) 2. y 00 = f (x)
y(0) = 0 y(0) = 0
y 0 (0) = m1 y 0 (0) = m2

2 116
Let the corresponding values of y(1) be obtained by solving IVP’s 1 and 2 be y(m1 , 1) and
y(m2 , 1). We then use linear interpolation.
y(m1 , 1) m1
The interpolating polynomial is → y(m, 1) = 1 m??
y(m2 , 1) m2
[y−y(m2 ,1)] [y−y(m1 ,1)]
m = y(m1 ,1)−y(m2 ,1) m1 + y(m2 ,1)−y(m1 ,1) m2
But y(1) = 1
m = y(m[y−y(m 2 ,1)]
1 ,1)−y(m2 ,1)
m1 + y(m[y−y(m 1 ,1)]
2 ,1)−y(m1 ,1)
m2
(m1 −m2 )y(1) m2 y(m1 ,1)−m1 y(m2 ,1)
= y(m1 ,1)−y(m2 ,1) + y(m1 ,1)−y(m2 ,1)
(m2 −m1 )y(1)
= y(m 2 ,1)−y(m1 ,1)
+ m1 y(m 2 ,1)−m2 y(m1 ,1)
y(m2 ,1)−y(m1 ,1)
(m2 −m1 )y(1)
= y(m 2 ,1)−y(m1 ,1)
+ m1 y(m 2 ,1)−m2 y(m1 ,1)
y(m2 ,1)−y(m1 ,1)
(m2 −m1 )y(1)
= y(m 2 ,1)−y(m1 ,1)
+ m1 [y(m2 ,1)−y(m 1 ,1)]+m1 y(m1 ,1)−m2 y(m1 ,1)
y(m2 ,1)−y(m1 ,1)
(m2 −m1 )y(1) 2 −m1 )y(m1 ,1)
= m1 + y(m 2 ,1)−y(m1 ,1)
− (m y(m2 ,1)−y(m1 ,1)
−m1 )[y(1)−y(m1 ,1)]
∴ m = m1 + (m2y(m 2 ,1)−y(m1 ,1)
Call this m as m3 . We now solve the IVP

y 00 = f (x)
y(0) = 0
y 0 (0) = m3

and obtain y(m3 , 1). Use linear interpolation with (m2 , y(m2 , 1)) and (m3 , y(m3 , 1)) and
find m4 and so on. Repeat this process until the value of y(mk , 1) agrees with y(1) correct
to desired degree of accuracy. The speed of convergence depends on how good the initial
guesses are. The above method of solution is determined in the following example.
y 00 = y(x)
Example 0.1. Solve: y(0) = 0 by shooting method.
y(1) = 1.1752
Take the two initial guesses for the slope y 0 (0) as m1 = 0.8 and m2 = 0.9 and solve the
IVPs by Taylor series method.
1. y 00 = y(x) 2. y 00 = y(x)
y(0) = 0 y(0) = 0
y 0 (0) = m1 = 0.8 y 0 (0) = m2 = 0.9
Solving by Taylor series method, we have
x2 00 x3 000 x4 (4) x5 (5)
y(x) = y(0) + xy 0 (0) + 2!
y (0) + 3!
y (0) + 4!
y (0) + 5!
y (0) + ...

Now

y 00 = y (given DE)
y(0) = 0 =⇒ y 00 (0) = 0
y 000 = y 0 =⇒ y 000 (0) = y 0 (0)
y (4) = y 00 =⇒ y (4) (0) = y 00 (0) = 0
y (5) = y 000 =⇒ y (5) (0) = y 000 (0) = y 0 (0) = 0

3 117
and so on.

x3 0 x5
∴ y(x) = 0 + xy 0 (0) + y (0) + y 0 (0) + . . .
3! 5!
3 5 7
x x x x9
= y 0 (0)[x + + + + + ...]
3! 5! 7! 9!
1 1 1 1
y(1) = y 0 (0)[1 + + + + + . . . ]
3! 5! 7! 9!

with y 0 (0) = m1 = 0.8


1 1 1 1
y(1) = y ( 0.8)[1 + + + + + ...]
6 120 5040 362800
0.9402

∴ in our notation, y(m1 , 1) = y(0.8, 1) = 0.9402


Similarly, with y 0 (0) = m1 = 0.8
1 1 1 1
y(1) = y ( 0.9)[1 + + + + + ...]
6 120 5040 362800
1.0578

∴ in our notation, y(m2 , 1) = y(0.9, 1) = 1.0578


∴ Using linear interpolation, we get
(m2 − m1 )[y(1) − y(m1 , 1)]
m3 = m1 +
y(m2 , 1) − y(m1 , 1)
[1.1752 − 0.9402]
= 0.8 + (0.9 − 0.8)
1.0578 − 0.9402
(0.1)(0.235)
= 0.8 +
0.1176
= 0.8 + 0.19982993
= 0.99982993

We see that with m3 = 0.99982993, we solve the IVP y 00 = y; y(0) = 0, y(1) = m3 . Then
y(1) is given by
y(1) = m3 [1 + 61 + 120
1 1
+ 5040 1
+ 362800 + . . . ] = 0.99982993[1 + 16 + 120
1
+ . . . ] = 1.17500013,
which is close to the prescribed value y(1) = 1.1752

Now the solution to the BVP at any x ∈ [0, 1] is given by


x3 x5 x7 x9
y(x) = y 0 (0)[x + + + + + ...]
3! 5! 7! 9!
x3 x5 x7 x 9
y(x) = m3 [x + + + + + ...]
3! 5! 7! 9!

The following theorem gives general conditions that ensure that the solution to a linear
second order BVP has a unique solution.

4 118
Theorem 0.1. If the linear BVP

y 00 (x) = p(x)y 0 + q(x)y + r(x), a ≤ x ≤ b


y(a) = α
y(b) = β

satisfies

i) p(x), q(x) and r(x) are continuous on [a, b]

ii) q(x) > 0 on [a, b]

then, the problem has a unique solution

Our goal now is to get an approximation to this unique solution.


− − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − −−

We know that every solution to a linear, non homogeneous DE can be written as a


particular solution plus a constant times a solution to the corresponding homogeneous
problem. ∴ we work with two IVPs

1 Original non homogeneous DE with y(a) = α, y 0 (a) is prescribed. solution is y1 (x)

2 corresponding homogeneous DE with y(a) = 0 and an arbitrary nonzero value is


prescribed for y 0 (a); solution is y2 (x). Consider y1 (x) + ky2 (x) and determine k such
that the BC at x = b is satisfied.

Consider the IVPs

y 00 (x) = p(x)y 0 + q(x)y + r(x), a ≤ x ≤ b (1)


y(a) = α
y(b) = 0

and

y 00 (x) = p(x)y 0 + q(x)y, a ≤ x ≤ b (2)


y(a) = 0
y 0 (a) = 1

The existence and uniqueness theorem for IVPs ensures that under the hypothesis of the
above theorem, problems (1) and (2) have a unique solution.
Let y1 (x) denote the solution to (1)
Let y2 (x) denote the solution to (2)
β−y1 (b)
Let y(x) = y1 (x) + y2 (b)
y2 (x) → (∗)
Then
β−y1 (b) 0
y 0 (x) = y10 (x) + y2 (b)
y2 (x)
β−y1 (b) 00
y 00 (x) = y100 (x) + y2 (b)
y2 (x)

Remark. Explanation for (∗) is given later

5 119
Let
y(x) = y1 (x) + ky2 (x), k is constant
y 0 (x) = y10 (x) + ky20 (x)
y 00 (x) = y100 (x) + ky200 (x)
= p(x)y10 + q(x)y1 + r(x) + k[p(x)y20 + q(x)y2 ]
= p(x)[y10 + ky20 ] + q(x)[y1 + ky2 ] + r(x)
= p(x)y 0 (x) + q(x)y(x) + r(x)
Also y(a) = y1 (a) + ky2 (a)
= α + k(0)

0
y (a) = y10 (a) + ky20 (a)
= 0 + k(1)
=k

∴ y(x) satisfies the IVP

y 00 (x) = p(x)y 0 + q(x)y + r(x)


with y(a) = α
y 0 (a) = k

We want y to be such that y(b) = β


y(b) = y1 (b) + ky2 (b) = β
=⇒ k = β−y 1 (b)
y2 (b)

β−y1 (b)
∴ y(x) = y1 (x) + y2 (b)
y2 (x)

y 00 (x) = p(x)[y10 + β−y 1 (b) 0


y2 (b)
y2 ] + q(x)[y1 + β−y1 (b)
y2 (b)
y2 ] + r(x)
0
= p(x)y (x) + q(x)y(x) + r(x)
y(a) = y1 (a) + β−y 1 (b)
y2 (b)
y2 (a)
β−y1 (b)
= α + y2 (b) (0)

∴ y(x) defined by (∗) satisfies the second order DE for the linear BVP Also
y(b) = y1 (b) + β−y 1 (b)
y2 (b)
y2 (b)
β−y1 (b)
= β + y2 (b) y2 (b)

∴ y(x) satisfies

y 00 (x) = p(x)y 0 + q(x)y + r(x) x ∈ [a, b]


with y(a) = α
y 0 (a) = k

∴ y(x) is the unique solution to the linear BVP provided y2 (b) 6= 0

6 120
Numerical Analysis
Lecture 28

Numerical Solution of ODE-II: Shooting Method


The shooting method for linear BVPs is based on the replacement of the linear BVP by
the two IVPs (1) and (2). Solving (1) and (2) by the available methods for IVPs, one gets
y1 (x) and y2 (x). Then, the solution of the BVP is given by
β−y1 (b)
y(x) = y1 (x) + y2 (b)
y2 (x), provided y2 (b) 6= 0.

y2 (x) β−y1 (b)


y(x) = y1 (x) + y2 (b)
y2 (x)
y(a) = α
y1 (x)
y1 (a) = α
α

y2 (a) = 0
a

Figure 1: Graphical explanation of the above method

Example 0.1.  sin(ln x)


 y 00 = − x2 y 0 + 2
x2
y + x2
, 1≤x≤2
BVP y(1) = 1
y(2) = 2

Applying the above method to this BVP, we require approximations to the solutions of
the following two IVPs:

 y100 = − x2 y10 + x22 y1 + sin(ln
x2
x)
, 1≤x≤2
IVP-1 y1 (1) = 1
 0
y1 (1) = 0
 00
 y2 = − x2 y20 + x22 y2 , 1 ≤ x ≤ 2
IVP-2 y2 (1) = 0
 0
y2 (1) = 1
We solve 1 and 2 using R-K method of order 4 at xi = 1 + i(0.1), i = 0, 1, 2, . . . , 10.
Denote by u1,i the approximations of y1 (xi )
and by v1,i the approximations of y2 (xi ).
Let wi denote the approximation of y(xi ) where y(xi ) = y1 (xi ) + 2−y 1 (2)
y2 (2)
y2 (xi ).
The following table presents the details. Note that the exact solution of the BVP is

1 121
y = c1 x + xc22 − 10
3 1
sin(ln x) − 10 cos(ln x)
1
where c2 = 70 [8 − 12 sin(ln 2) − 4 cos(ln 2)] ≈ −0.03920701320
c1 = 11
10
− c2 ≈ 1.1392070132

x−i u1,i v1,i wi y(xi ) |y(xi ) − wi |


1.0 1.0 0.0 1.0 1.0
1.1 1.00896058 0.09117986 1.09262917 1.09262930 1.43 × 10−7
1.2 1.03245472 0.16851175 1.18708471 1.18708484 1.34 × 10−7
1.3 1.06674375 0.23608704 1.28338227 1.28338236 9.78 × 10−8
1.4 1.10928795 0.29659061 1.38144589 1.38144595 6.02 × 10−8
1.5 1.15830000 0.35184379 1.48115939 1.48115942 8.06 × 10−8
1.6 1.21248372 0.40311695 1.58239245 1.58239246 1.08 × 10−8
1.7 1.27087454 0.45131840 1.68501396 1.68501396 5.43 × 10−10
1.8 1.33273851 0.49711137 1.78889854 1.78889853 5.05 × 10−9
1.9 1.39750618 0.54098928 1.89392951 1.89392951 4.41 × 10−9
2.0 1.46472815 0.58332538 2.00000000 2.00000000
Example 0.2.

− y 00 + π 2 y = 2π 2 sin(πx), x ∈ [0, 1]


y2 (1) = 0
Dirichlet BC
y20 (1) = 1
Convert The BVP into two IVPs.
 00
 y = π 2 y − 2π 2 sin(πx)
IVP-1 y(0) = 0
 0
y (0) = 0
 00
 y = π2y
IVP-2 y(0) = 0
 0
y (0) = 1
Solving IVP-1 and IVP-2 by R-K method of order 4 taking h = 41 .
Let u1,i and v1,i denote the approximate solutions of IVP-1 and IVP-2 obtained by R-K
method. Let wi denote the approximate solution of the BVP.

w(x) = u1 (x) + kv1 (x) is such that


w(1) = 0 must be satisfied
i.e. w(1) = u1 (1) + kv1 (1) = 0
=⇒ k = − uv11(1)
(1)
.

Table presents the solution of the IVP at xi ’s.


xi u1 (xi ) v1 (xi )
0.00 0.00 0.00
0.25 −0.157372 0.275702
0.50 −1.290357 0.730213
0.75 −4.490694 1.657343
1.00 −11.466375 3.656793

2 122
∴ k = − uv11(1)
(1)
= − (−11.466375)
3.656793
= 3.135637
∴ w(x) = u1 (x) + 3.135637u2 (x) is guaranteed to satisfy both the BCs in the BVP.
Table presents w(xi ) for the BVP and compares this approximate solution with the ex-
act solution y(x) = sin πx

xi Approximate solution, wi Exact solution y(xi ) = yi absolute error


0.00 0.00 0.00
0.25 0.707129 0.707107 0.000022
0.50 0.999327 1.0000000 0.000673
0.75 0.706132 0.707107 0.000975
1.00 0.00000000 0.00000000
The accuracy of the solution is excellent in spite of the fact that h = 14 which is not very
small.

3 123
Numerical Analysis
Lecture 29

Root Finding Methods-I: The Bisection Method


In this unit, the problem of determining roots of equation of the form f (x) = 0 (or zeros of
functions f ) is discussed.

Example 0.1. The growth of a population can be modeled over short periods of time by
assuming that the population grows continuously with time at a rate proportional to the
number present at that time.
Let N (t) denote the number at time t and λ denote the constant birth rate of the
population. if immigration is permitted at a constant rate ν, then the population satisfies
the DE
dN (t)
= λN (t) + ν
dt
whose solution is

N (t) = N0 eλt + λν (eλt − 1)


where N0 = N (t = 0) is the initial population.

Suppose N0 = 8, 00, 000, ν = 2, 08, 700 in the first year. Then at the end of 1 year
11, 39, 512 individuals are present. We must solve for λ, the birth rate of the above popu-
lation which satisfies
208700 λ
11, 39, 512 = 8, 00, 000eλ + λ
(e − 1)

which is of the form f (λ) = 0

Example 0.2. In diffraction of light, one requires roots of x − tan x = 0, which is of the
form f (x) = 0.

Example 0.3. In planetary orbits, one looks for the roots of the Kepler’s equation x −
a sin x = b for different values of a and b, which is of the form f (x, a, b) = 0

Example 0.4. How far a spherical object of radius R sinks into a fluid such as water or
oil?
Archimedes Principle: The object will sink to the depth at which the weight of the
fluid displaced by the object equals the weight of the object.
Now weight of the object = 34 πR3 ρ0 g

1 124
y axis

x2 + (y − R)2 = R2

(0, R)
y=h

x axis

(mass=volume×density; weight=mg; con-


stant mass m due to gravity)
ρ0 : density of object.
Weight of the fluid displaced =ρf Vd g
(Vd : volume of the fluid displaced by the object;
ρf : density of fluid.)
What isR Vd ?
h Rh Rh
Vd = π 0 x2 dy = π 0 [R2 − (R − y)2 ]dy = π 0 (2yR − y 2 )dy = πh2 (R − h3 )

∴ 43 πR3 ρ0 g = πh2 (R − h3 )ρf g


ρf 3
(or) 3
h − Rρf h2 + 34 R3 ρ0 = 0 (→ ∗)

Given the values for R, ρ0 , ρf , the depth h to which the object sinks is determined by solving
(∗) for h.

Equations of various kinds arise in a range of physical applications. The numerical


methods that we discuss in this unit are used to approximate solutions of equations of the
type that we have seen in the above examples, when the exact solutions can not be obtained
by algebraic methods.
Question Given a real valued function of a real variable f : R → R, find the values of x
for which f (x) = 0

This process involves finding a root, or a solution of an equation of the form f (x) = 0,
for a given function f . A root of f (x) = 0 is also called a zero of the function f .

Linear equation: ax + b = 0, a, b are real numbers, a 6= 0


solution: x = − ab
Nonlinear equation: Ax2 + Bx√+ C = 0, A, B, C √are real, A 6= 0
b2 −4AC b2 −4AC
solutions: x1 = −B+ 2A , x2 = −B− 2A
Solutions of cubic and quartic polynomial equations are also known and these involve
radicals, namely, roots of sums of products of coefficients in the equation. However, no such
closed formula exists for a polynomial equation of degree n when n ≥ 5. for each n ≥ 5,
there exists a polynomial equation of degree n with integer coefficients which cannot be
solved in terms of radicals.

2 125
Example 0.5. x5 − 4x − 2 = 0
Since there is no general formula for the equation of polynomial equations, no general
formula will exist for the solution of an arbitrary nonlinear equation of the form f (x) = 0,
where f is a continuous real-valued function.
Question 1: How can we decide whether or not such an equation possesses
a solution in the set of real numbers?
Question 2: How can we find a solution?
This unit is devoted to providing answers to these questions.
Goal: Develop simple numerical methods for the approximate solution of the equations
f (x) = 0, where f is a real-valued function, defined and continuous on a bounded and
closed interval of the real line.
Definition 0.1. By a root of the equation f (x) = 0, we mean that real number p for which
f (p) is identically zero.
Definition 0.2. A root p of the equation f (x) = 0 is said to be a Root of multiplicity m if
f can be written as
f (x) = (x − p)m q(x)
where q(p) 6= 0
A root of multiplicity one is called a simple root.
Example 0.6. For polynomials, multiplicity can be determined by factoring the polynomial
and then examining the power on each factor. For example
x6 + x5 − 12x4 + 2x3 + 41x2 − 51x + 18 = (x − 1)3 (x + 3)2 (x − 2)
∴ the equation x6 + x5 − 12x4 + 2x3 + 41x2 − 51x + 18 = 0 has
a root of multiplicity 3 at x = 1
a root of multiplicity 2 at x = −3
and a simple root at x = 2
1−x
What about f (x) = 2x + ln( 1+x ) = 0?
Observe that f (0) = 0. ∴ it has a root at x = 0. What is the multiplicity of this root? The
following theorem is helpful.
Theorem 0.1. Let f be a continuous function with m continuous derivatives. The equation
f (x) = 0 has a root of multiplicity m at x = p if and only if
f (p) = f 0 (p) = f 00 (p) = · · · = f (m−1) (p) = 0 but f (m) (p) 6= 0
Proof. Let f be a continuous function with m continuous derivatives. Suppose that f (p) =
f 0 (p) = f 00 (p) = · · · = f (m−1) (p) = 0 but f (m) (p) 6= 0. Expand f in a Taylor polynomial of
degree m − 1 about the point x = p. Then
m−1
X f (k) (p) f (m) (ξ(x))
f (x) = (x − p)k + (x − p)m ,
k=0
k! m!

where ξ(x) lies between x and p. Since f (k) (p) = 0 for 0 ≤ k ≤ m − 1, we have
f (m) (ξ(x))
f (x) = (x − p)m
m!
∵ f (m) is continuous, lim f (m) (ξ(x)) = f (m) [lim ξ(x)] = f (m) (p) 6= 0
x→p x→p
f (m) (ξ(x))
Define q(x) = m!
. Then

3 126
f (x) = (x − p)m q(x), with q(p) 6= 0 (∵ lim q(x) 6= 0)
x→p

∴ f (x) = 0 has a root of multiplicity m at x = p.


Conversely, suppose that f (x) = 0 has a root of multiplicity m at x = p. Then, there
exists a function q with lim q(x) 6= 0 such that
x→p

f (x) = (x − p)m q(x)


Then
f (p) = 0
f 0 (x) = m(x − p)m−1 q(x) + (x − p)m q 0 (x) =⇒ f 0 (p) = 0
f 0 (x) = m(m − 1)(x − p)m−2 q(x) + 2m(x − p)m−1 q 0 (x) + (x − p)m q 00 (x) =⇒ f 00 (p) = 0
By direct calculation, f 000 (p) = f 0000 (p) = · · · = f (m−1) (p) = 0.
But f (m) (p) = lim f (m) (x) = m!lim q(x) 6= 0
x→p x→p

Example 0.7. f (x) = 2x + ln( 1−x 1+x


) = 2n + ln(1 − x) − ln(1 + x) = 0
f (0) = 0; f 0 (0) = 0; f 00 (0) = 0; f 000 (0) = −4 6= 0
∴ f (x) = 0 has a root of multiplicity 3 at x = 0.
−1
(∵ f 0 (x) = 2 + 1+x − 1−x1
=⇒ f 0 (0) = 2 − 1 − 1 = 0
f 00 (x) = − (1−x)
1 1 00
2 + (1+x)2 =⇒ f (0) = −1 + 1 = 0

f 000 (x) = − (1−x)


2 2 00
3 − (1+x)3 =⇒ f (0) = −2 − 2 = −4 6= 0)

Techniques for Root finding problem fall into two categories

1 Enclosure methods

2 Fixed point iteration schemes

Enclosure methods are based on Intermediate Value Theorem (IVT).

Theorem 0.2. Let f be a continuous function over the closed interval [a, b] and let k be
any real number that lies between f (a) and f (b) (f (a) < k < f (b)). Then there exists a real
number c with a < c < b such that f (c) = k

Consequence of IVT for continuous functions


If f is a continuous function on the interval [a, b] and if f (a)f (b) < 0, then f must have a
zero in (a, b)
Since f (a)f (b) < 0, the function changes sign on the interval [a, b] and hence it has
atleast one zero in the interval.

Example 0.8. f (x) = x3 + 2x2 − 3x − 1, x ∈ [−4, 3]


f (−4) = −21, f (−3) = −1, f (−2) = 5 , f (−1) = 3, f (0) = −1 , f (1) = −1, f (2) = 9 ,
f (3) = 35
∴ f (x) has simple zeroes, one in each of (−3, −2), (−1, 0) and (1, 2).

4 127
Bisection Method: Method of interval halving
Exploits the above idea in the following way.
Given a function f , find the values of x for which f (x) = 0. f is continuous on [a, b].
x ∈ (a, b). Suppose f is a continuous function defined on [a, b], with f (a) and f (b) of
opposite signs. By IVT, there exists a number p ∈ (a, b) with f (p) = 0. Although the
procedure will work when there is more than one root in (a, b), we assume for simplicity
that the root in (a, b) is the only root in (a, b). The method calls for a repeated halving of
sub intervals of [a, b]. At each step, locate the half containing p.
Set a1 = a, b1 = b. Let p1 = a+b 2
, the midpoint of [a, b].
If f (p1 ) = 0, then p = p1 and we are done
if f (p1 ) 6= 0; then f (p1 ) has the same sign as either f (a1 ) or f (b1 ).
When f (p1 ) and f (a1 ) have the same sign, p ∈ (p1 , b1 ) and set a2 = p1 , b2 = b1 .
When f (p1 ) and f (a1 ) have the opposite sign, p ∈ (a1 , p1 ) and set a2 = a1 , b2 = p1 .

y = f (x)

a = a1 p 2 p 3
p p1 b = b1

Reapply this process to [a2 , b2 ] and repeat this procedure.


Example 0.9. f (x) = x3 + 2x2 − 3x − 1 has a simple zero in (1, 2) ∵ f (1)f (2) < 0
(a1 , b1 ) = (1, 2); f (a1 ) < 0, f (b1 ) > 0
a1 + b1 3
p1 = = = 1.5
2 2
f (p1 ) = f (1.5) = 2.375 > 0

f (a1 ) and f (p1 ) are of opposite sign


∴ there is a root between a1 and p1 .
Set a2 = a1 = 1; b2 = p1 = 1.5

(a2 , b2 ) = (1, 1.5); f (a2 ) < 0, f (b2 ) > 0


a2 + b 2 2.5
p2 = = = 1.25
2 2
f (p1 ) = f (1.25) ≈ 0.328 > 0

∴ there is a root between a2 and p2 .


Set a3 = a2 = 1; b3 = p2 = 1.25 and the process is repeated and p is determined correct to

5 128
10 decimal places.

p = 1.1986912435

You have a sequence of successive approximations p1 , p2 , . . . , pn , . . . to the exact root p of


f (x) = 0. Does this sequence {p1 , p2 , . . . , pn , . . . } converge to p?
YES!!

6 129
Numerical Analysis
Lecture 29

Root Finding Methods-I: The Bisection Method-2


a = a1 p1 p2 b = b1 = b2 = b3
= a1 +b
2
1
= a2 +b
2
2

= a2 = a3

1 b1 − a1
b2 − a2 = b1 − (a1 + b1 ) =
2 2
1 1
b3 − a3 = (b2 − a2 ) = 2 (b1 − a1 )
2 2
−−−−−−−−−−−−−
− − − − − − − − − − − − −−
1 1
bn − an = (bn−1 − an−1 ) = n−1 (b1 − a1 )
2 2
1 1
pn − p ≤ (bn − an ) = n (b1 − a1 )
2 2
b−a
∴ |pn − p| ≤ n
2
As n → ∞, 21n → 0 =⇒ pn → p as n → ∞.
∴ sequence of approximations generated by the bisection method is always guaranteed to
converge to a root of the equation f (x) = 0.
Remark. The bisection method finds one zero but not all zeroes in the interval [a, b]
Consider the figures below. Assume that f (a) > 0 > f (b).
i) Here, the bisection method selects the left subinterval
f (a)

p1
a b
f (a)f (p1 ) < 0
f (b)

ii) Here, the bisection method selects the right subinterval


f (a)

a p1 b
f (p1 )f (b) < 0
f (b)

1 130
Stopping Condition: For any root finding technique, the following measures of conver-
gence can be used and one can construct the stopping criteria.

1 Let M gives the maximum number of steps that the user will permit. (this should
always be present to reduce the possibility of the computation going into an infinite
loop)

2 Let  be the specified convergence tolerance

a) terminate the iteration when |pn − p| <  (absolute error in the location of the
root)
b) terminate the itaration when | pnp−p
n
| <  (relative error in the location of the root)

3 terminate the iteration when |f (pn )| <  (test for a root)

There is no rule of thumb for selecting one stopping criteria over another. Note that none
of the conditions works well in all the cases. Examples can be given in which one of the
stopping criteria, say 2 or 3, is satisfied but the other is not

[ ]
an pn bn

Figure 1:
|f (pn )| <  but |bn − an | > δ. Criterion |bn − an | < δ failure.

2 131
y axis

The curve in Fig. 2 is not continuous


f

[ ] x axis
an p n b n
f

Figure 2:
|bn − an | < δ but |f (pn )| > . Criterion |f (pn )| <  failure.

1
f (x) = (x − 1) 7
y axis

x
x axis
1 1 3
2 2

Figure 3:
f has a nearly vertical portion surrounding the root. For this, |f (pn )| <  will guarantee a
reliable result rather than either of the other two conditions since x is close to 1 does not
imply that f (x) is close to zero.
Note. In Fig. 1, the graph is flat near the zero, which corresponds to a multiple root and
the bisection method may have difficulty in determining this zero to high precission
STOP THE ALGORITHM WHEN ANY ONE OF THE STOPPING CRITERIA IS SAT-
ISFIED

3 132
y axis

f (x) = (x − 1)7

x axis
0.5 1 1.5

Figure 4:
f (x) has a wide, flat plateau surrounding the root. ∴ terminate the iteration when |f (pn )| <
 will not guarantee a small error in the approximate location of the root. Either of the other
stopping criterion based on the location of the root will produce a more reliable termination

Error analysis
Let us denote successive intervals that arise in the process by [a0 , b0 ], [a1 , b1 ] and so on.
Note that

a0 ≤ a1 ≤ a2 ≤ · · · ≤ b0
b0 ≥ b1 ≥ b2 ≥ · · · ≥ a0
1
bn+1 − an+1 = (bn − an ), n ≥ 0
2
The sequence {an } is increasing and bounded above, hence {an } converges.
The sequence {bn } is decreasing and bounded below, hence {bn } converges.
bn − an = 12 (bn−1 − an−1 )
= 212 (bn−2 − an−2 )
−−−
= 21n (b0 − a0 )
Thus lim bn − lim an = lim 21n (b0 − a0 ) = 0
n→∞ n→∞ n→∞
Let p = lim an = lim bn
n→∞ n→∞
Then taking a limit in the inequality 0 ≥ f (an )f (bn ), we get 0 ≥ f (p)2 whence f (p) = 0.
Suppose that, at a certain stage in the process, the interval [an , bn ] has been defined. If
the process is stopped now, the root is certain to lie in this interval. The best estimate of
the root at this stage is pn = an +b
2
n

The error is thus bounded as follows:


|p − pn | ≤ 12 (bn − an ) = 1 1
(b
2 2n 0
− a0 ) = 1
(b
2n+1 0
− a0 )
So, we have the following Theorem on the bisection method.
Theorem 0.1. If [a0 , b0 ], [a1 , b1 ], . . . , [an , bn ], . . . denote the intervals in the bisection method,
then the limits lim an and lim bn exist, are equal, and represent a zero of f . If p = lim pn
n→∞ n→∞ n→∞
an +bn
and pn = 2
, then

4 133
1
|p − pn | ≤ (b
2n+1 0
− a0 )

Problem 1 If suppose the bisection method is started with the interval [50, 63]. How
many steps should be taken to compute a root with relative accuracy of one part in 10−12 ?
|p−pn |
We require |p|
≤ 10−12
Now we know that p ≥ 50 ∴ |p−p 50
n|
≤ 10−12
1 13
By theorem, |p − pn | ≤ 2n+1 (63 − 50) = 2n+1
∴ 13 1
50 2n+1
≤ 10−12
which gives n ≥ 37

Example 0.1. Use the bisection method to find the root of the equation ex = sin x closest
to 0
There are no positive roots of f (x) = ex − sin x. The first root to the left of 0 is in
[−4, −3]. Start with [−4, −3] interval and apply bisection method

k p f (p)
1 −3.5000 −0.321
2 −3.2500 −0.694 × 10−1
3 −3.1250 0.605 × 10−1
- - -
- - -
- - -
14 −3.1830 0.193 × 10−4
15 −3.1831 −0.124 × 10−4
16 −3.1831 0.345 × 10−5

Note. • the theorem states that the bisection method converges to a root of f (x) = 0
and not the root of f (x) = 0. The condition f (a)f (b) < 0 guarantees the existence of
a root but not the uniqueness. There may be more than one root on the interval and
there is no way to know a priori, to which root the sequence will converge, but it will
surely converge to one of them.
1
∵ |pn − p| is the absolute error in the approximation |p − pn | ≤ 2n+1 (b0 − a0 ) gives a
theoretical error bound. The error at any stage of the iterative process can never be
larger than this quantity.

• The requirement that an interval [a, b] be found such that f (a)f (b) < 0 =⇒ that the
bisection method cannot be used to locate roots of even multiplicity. For such roots,
function does not change sign on either side of the root. This remark is common to
all enclosure techniques.

Example 0.2. The equation f (x) = x3 + 4x2 − 10 = 0 has a root in [1, 2] since f (1) = −5
and f (2) = 14. Terminate when the relative error is less than 0.0001.
Bisection algorithm gives

5 134
k an bn pn f (pn )
1 1.0 2.0 1.5 2.375
2 1.0 1.5 1.25 −1.79687
3 1.25 1.5 1.37 0.16211
4 1.25 1.375 1.3125 −0.84839
5 1.3125 1.375 1.34375 −0.35098
6 1.34375 1.375 1.359375 −0.09641
7 1.359375 1.375 1.3671875 0.03236
8 1.359375 1.3671875 1.36328125 −0.03815
9 1.36328125 1.3671875 1.365234375 0.000072
10 1.36328125 1.365234375 1.364257813 −0.01605
11 1.364257813 1.365234375 1.364746094 −0.00799
12 1.364746094 1.365234375 1.364990235 −0.00396
13 1.364990235 1.365234375 1.365112305 −0.00194
|p−pn |
We stop when |p|
< 10−4 . After 13 iterations, p13 = 1.365112305 approximates the
root with an error

|p − p13 | < |b14 − a14 | = |1.365234375 − 1.365112305| = 0.000122070

Since |a14 | < |p|, |p−p


|p|
13 |
< |b14|a−a
14 |
14 |
≤ 9.0 × 10−5
∴ the approximation is correct to atleast four significant digits. The correct value of p
is p = 1.3652230013

Note. p9 is closer to p than the final approximation p13 . We may suspect this is true
∵ |f (p9 )| < |p13 |, but we cannot be sure of this unless the true answer is known.

Example 0.3. Determine the number of iterations necessary to solve f (x) = x3 +4x2 −10 =
0 with accuracy 10−3 using a0 = 1, b0 = 2.
This requires finding an integer n that satisfies
1
|p − pn | ≤ 2n+1 (b0 − a0 ) = 2−(n+1) < 10−3
log10 2−(n+1) < log10 10−3
−(n + 1) log10 2 < −3
n + 1 > log3 2 = 0.3010
3
≈ 9.966777 n > 8.96
10

∴ 9 iterations will ensure an approximations accurate within 10−3

Advantages

1 very easy and straightforward to implement

2 only one function evaluation is needed at each iteration (so inexpensive)

3 the sequence of approximations generated by the method is guaranteed to converge.

Disadvantage

1 converges slowly (the number N of iterations become very large before |p − pN | is


sufficiently small)

2 a good intermediate approximation may be missed

6 135
Numerical Analysis
Lecture 31

Root Finding Methods-3: Newton- Raphson Method-1


Newton- Raphson Method. Problem : Given a function f , determine zeroes of f numeri-
cally.
Let p be a zero of f and let x0 be an approximation to p. If f 00 exists and is continuous,
then by Taylor’s theorem

0 = f (p) = f (x0 + h)
0 h2 00
= f (x0 ) + hf (x0 ) + f (x0 )
2!
= f (x0 ) + hf 0 (x0 ) + O(h2 ), h = p − x0 .

If h is small (i.e., x0 is near p), then it is reasonable to ignore the O(h2 ) term and solve for
h.

f (x0 )
f (x0 ) + hf 0 (x0 ) = 0 =⇒ h = − .
f 0 (x0 )
If x0 is an approximation to p, then
f (x0 )
x0 −
f 0 (x0 )
is a better approximation to p. Call it x1 . Then,
f (x0 )
x1 = x0 − .
f 0 (x0 )
Define inductively,
f (xn )
xn+1 = xn − , n ≥ 0.
f 0 (xn )
Example : Use Newton-Raphson method to find the negative zero of the function

f (x) = ex − 1.5 − tan−1 x.

f (x) = ex − 1.5 − tan−1 x


1
f 0 (x) = ex −
1 + x2

f (xn )
xn+1 = xn −
f 0 (xn )
[exn − 1.5 − tan−1 xn ]
= xn − 1 , n = 0, 1, 2, · · ·
exn − 1+x 2
n

1 136
k xn f (xn )
0 −7.0000 −.702 × 10−1
1 −10.6770 −.226 × 10−1
2 −13.2791 −.437 × 10−2
3 −14.0536 −0.239 × 10−3
4 −14.1011 −0.800 × 10−6
5 −14.1012 −0.901 × 10−11
6 −14.1012 −0.114 × 10−20
7 −14.1012 0.000
8 −14.1012 0.000
Output shows that there is rapid convergence of iterates.

Example : f (x) = 2 − ex
Choose x0 = 0. The only zero of f (x) is
p = ln 2 = 0.6931471806.
We have f 0 (x0 ) = −ex and therefore,
f (x0 ) 2−1
x1 = x 0 −0
=− =1
f (x0 ) −1
f (x1 ) 2 − e x1 2−e
x2 = x 1 − 0 = x1 − x
=1− = 0.7357588823
f (x1 ) −e 1 −e
2 − ex2 2 − e0.7357588823
x3 = x 2 − = 0.7357588823 − = 0.6940422999
−ex2 −e0.7357588823
x4 = 0.693147581060
x5 = 0.693147180560
In 6 iterations, NRM has achieved 13 digit accuracy.
Let us do the same problem using bisection method.

f (x) = 2 − ex , [a0 , b0 ] = [0, 1], f (0) > 0, f (1) < 0.

0+1 1
p1 = = , f (p1 ) = 0.3513 > 0
  2 2
1
[a1 , b1 ] = , 1 ; f (a1 ) = 0.315, f (b1 ) = −0.7183
2
1
2
+1 3
p2 = = ; f (p2 ) = −0.1170 < 0
  2 4
1 3
[a2 , b2 ] = , ; f (a2 ) = 0.3513; f (b2 ) = −0.1170
2 4
1
2
+ 34 5
p3 = = ; f (p3 ) = 0.1318 > 0
2 8 
5 3
[a3 , b3 ] = , ; p4 = 0.6875
8 4

2 137
If we continue, p5 = 0.6875, p6 = 0.7031,....p20 = 0.6931. In 20 iterations, bisection
method has 6 digit accuracy. Note that the number of correct digits in NRM is doubling,
in every iteration.This suggests that NRM is very fast.

Graphical Interpretation. NRM involves linearising the function. f was replaced by a


linear function.If
(x − x0 )2 00
f (x) = f (x0 ) + (x − x0 )f 0 (x0 ) + f (x0 ) + ....
2!

then, the linearisation at x0 produces the linear function

l(x) = f (x0 ) + (x − x0 )f 0 (x0 )

This has the property

l(x0 ) = f (x0 )
l0 (x) = f 0 (x0 )

Therefore,

l0 (x0 ) = f 0 (x0 )

That is, the linear function has the same value as f at x0 and has the same slope as f at
x0 .Therefore, l is a good approximation to f in the vicinity of x0 . Therefore, in NRM, we
construct the tangent to the f curve at a point x0 near p and determine where the tangent
line intersects the x-axis.
Equation of the line L is

y − y0 = (x − x0 )f 0 (x0 ),

and where it crosses the x- axis, we have

0 − y0 = (x − x0 )f 0 (x0 )

Therefore,
f (x0 )
x1 = x0 − .
f 0 (x0 )
. Now, consider linearisation at x1 and construct the tangent to the f curve at x1 and
determine where the tangent line intersects the x- axis.

y − y1 = (x − x1 )f 0 (x1 ),

3 138
It crosses the x- axis at (x2 , 0).

0 − y1 = (x2 − x1 )f 0 (x1 )

Therefore,
f (x1 )
x2 = x1 − .
f 0 (x1 )
. Continue this inductively,

f (xn )
xn+1 = xn − , n ≥ 0.
f 0 (xn )

Figure shows the example of a function and starting points for which NRM fails.The shape
of the curve is such that for certain starting values xn diverges.Therefore, any statement
about NRM must involve an assumption that x0 is sufficiently close to a zero or that the
graph of f has a prescribed shape.

4 139
Numerical Analysis
Lecture-32

1 Root finding methods-4 : the Newton Raphson


method-2
Definition 1.1. Suppose {pn } is a sequence that converges to p with pn 6= p∀n. If positive
constants λ and α exists with ltn→∞ |p|pn+1 −p|
n −p|
α = λ, then pn converges to p of order α , with

asymptotic error constant λ.

If α = 1, the sequence is linearly convergent(i.e) the method has linear convergence (or) the
method converges linearly (or) the method has order of convergence 1.
If α = 2, the sequence is quadratically convergent i.e the method has convergent of order 2.
In the case of bisection method,

|pn+1 − p| = 21 |pn − p|
or |p|pn+1 −p|
n −p|
= 12
⇒ en+1 = 12 en i.e en+1 = cen

Where en+1 is the error at the (n + 1)-th iteration and en is the error at the n-th iteration.
Hence α = 1 , c = 21 and the bisection method has linear convergence with asymptotic error
constant c = 12 .
Error Analysis: Analysis of error in Newton-Raphson Method for solving f (x) = 0.
00
Let en = xn − p = error at the n-th step (omitting round off errors). Let f be continuous.
Let p be a simple zero of f .
f (p) = 0 and f 0 (p) 6= 0.
en+1 = xn+1 − p = xn − ff0(x n)
(xn )
− p = (xn − p) − ff0(x n)
(xn )
= en − ff0(xn)
(xn )
0
Hence en+1 = en f (x n )−f (xn )
f 0 (xn )
. . . (1)
Now by Taylor’s theorem,
e2n 00
0 = f (p) = f (xn − en ) = f (xn ) − en f 0 (xn ) + 2
f (ξn ), where ξn lies between xn and p.
00
Hence en f 0 (xn ) − f (xn ) = 21 f (ξn )e2n . . . (2)
Substituting (2) in (1),
00 00
en+1 = 12 ff 0 (x
(ξn ) 2
e ≈ 12 ff 0 (p)
n) n
(p) 2
en = ce2n
Hence en+1 = ce2n
Newton Raphson Method has quadratic convergence with asymptotic error constant c =
00 00
1 f (p) 1 f (xn )
0
2 f (p)
≈ 2 f 0 (xn )
, where xn is the approximation to p obtained at the n-th iteration correct
to the desired degree of accuracy.

1 140
Convergence of NRM

Theorem 1.1. Assume that f is defined and twice continuously differentiable for all x with
f (p) = 0 for some p. Define the ratio
00
maxx∈R |f (x)|
M= 2.minx∈R |f 0 (x)|

and assume that M < ∞, then for any x0 such that M |p − x0 | < 1 , then the successive
n
iterates generated by NRM, namely {xn } converges. Moreover |p − xn | ≤ M −1 (M |p − x0 |)2 .
00
Proof. We have shown en+1 = 21 e2n |f|f 0 (x
(ξn )|
n )|
≤ (p − xn )2 M
⇒ en+1 ≤ M e2n

e1 ≤ M e20
e2 ≤ M e21 ≤ (M e20 )M = e40 M 3 = M1 (M e0 )4
e3 ≤ e22 M ≤ (e21 M )2 M = e80 M 7 = M1 (M e0 )8

...

1 n
en ≤ M
(M e0 )2

RHS goes to zero as n → ∞ iff M e0 = M |p − x0 | < 1 and this completes the proof.

00
Remark. The above theorem says that if f is continuous and p is a simple zero of f ,
then there is a neighbourhood of p and a constant M such that if NRM is started in that
neighbourhood, then the successive points (iterates) come closer to p and satisfy |xn+1 −p| ≤
M (xn − p)2 , n ≥ 0.
Remark. There are some situations in which NRM can be guaranteed to converge from an
arbitrary starting point. The following theorem presents the details.
Theorem 1.2. If f belongs to C 2 (R), is increasing, is convex and has a zero, then the zero
is unique and the Newton iteration will converge to it from any starting point.
00
Proof. A function f is convex if f (x) > 0 for all x.
y B

A
x
x=a x=b

2 141
[line segment joining two points on the graph of f lies above the graph. As you move
from A to B, the tangent makes an obtuse angle with the X- axis. As you move along,
at the lower most point, the tangent is parallel to the X-axis. As the tangent crosses the
dy
lowest point, the tangent makes acute angle with the X-axis. i.e dx is increasing as you
0 00
move from A to B, i.e f (x) is increasing in [a, b] , i.e f (x) > 0, ∀x ∈ [a, b].]
00
00
Given f is increasing , f 0 > 0 on R. We know en+1 = 21 ff 0 (p) (p) 2
en . Hence f (p) > 0, f 0 (p) > 0
, hence en+1 > 0.Thus xn > p for n ≥ 1. Since f is increasing ,f (xn ) > f (p) but f (p) = 0.
Hence f (xn ) > 0.
0
Now consider en+1 = en f (x n )−f (xn )
f 0 (xn )
= en − ff0(xn)
(xn )
. . . (∗).
0
Since f (xn ) > 0, f > 0 , en+1 < en . Thus the sequences {en } and {xn } are decreasing and
bounded below by 0 and p respectively.
Hence e∗ = ltn→∞ en and x∗ = ltn→∞ xn exist. ∗)
From (∗), ltn→∞ en+1 = ltn→∞ en − ltn→∞ ff0(x n)
(xn )
, we have e∗ = e∗ − ff0(x(x∗ )
implies f (x∗ ) = 0
and x∗ = p.

Example 1.1. Find an efficient method for computing square roots based on the use of
NRM. √
Let R > 0 be given. Let x = R, then x is a root of the equation x2 − R = 0
f (x) = x2 − R, f 0 (x) = 2x. Hence NRM method gives
2 −R 2 +R
xn+1 = xn − ff0(x n)
(xn )
= xn − x2x
n
n
= x2x
n
n
= x2n + 12 xRn = 12 (xn + xRn ) . . . (∗)
Let R = 17 ,let x0 = 4.
(∗) generates successive iterates
x1 = 12 (x0 + x170 ) = 4.12
x2 = 12 (x1 + x171 ) = 4.123106
x3 = 12 (x2 + x172 ) = 4.1231056
x4 = 12 (x3 + x173 ) = 4.1231056

Correct to four decimal places, we have 17 = 4.1231.

3 142
Lecture 33
Secant method and Method of False Position

Secant Method
NRM is a powerful technique; but it has a major weakness; one needs to know the value of
the derivative of f at each approximation.

To overcome this, a slight variation is considered. By definition,


f (x) − f (pn−1 )
f 0 (pn−1 ) = limx−→pn−1
x − pn−1
Letting x = pn−2 ,
f (pn−2 ) − f (pn−1 ) f (pn−1 ) − f (pn−2 )
f 0 (pn−1 ) ' =
pn−2 − pn−1 pn−1 − pn−2
Use this in NRM,
pn−1 − pn−2
pn = pn−1 − f (pn−1 )
f (pn−1 ) − f (pn−2 )
This is the secant method.

Start with two initial approximations p0 , p1 .


Equation of the line L1 through (p0, f(p0)) and (p1, f(p1)):
y − f(p1) = [(f(p1) − f(p0))/(p1 − p0)] (x − p1)

[Figure: secant lines L1 through (p0, f(p0)), (p1, f(p1)) and L2 through (p1, f(p1)), (p2, f(p2)); their x-intercepts give p2 and p3.]

L1 cuts the x-axis at (p2, 0):
0 − f(p1) = [(f(p1) − f(p0))/(p1 − p0)] (p2 − p1)
This implies
p2 − p1 = −f(p1)(p1 − p0)/(f(p1) − f(p0))  =⇒  p2 = p1 − f(p1)(p1 − p0)/(f(p1) − f(p0))
p3 is the x-intercept of line L2 joining (p1 , f (p1 )) and (p2 , f (p2 )) and so on.

Example 0.1. Use the secant method to find a solution to x = cos x.
Let p0 = 0.5 and p1 = π4 .

p_n = p_{n−1} − (p_{n−1} − p_{n−2})(cos p_{n−1} − p_{n−1}) / [(cos p_{n−1} − p_{n−1}) − (cos p_{n−2} − p_{n−2})],   n ≥ 2

k pk
0 0.5
1 0.7853981635
2 0.7363841388
3 0.7390581392
4 0.7390851493
5 0.7390851332

p5 is accurate to the tenth decimal place.

NRM obtained this degree of accuracy with p3 (check!).


Secant method is slower than NRM.
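A short Python sketch (not from the lecture) of the secant iteration, applied to f(x) = cos x − x:

import math

def secant(f, p0, p1, tol=1e-10, max_iter=50):
    # p_n = p_{n-1} - f(p_{n-1}) (p_{n-1} - p_{n-2}) / (f(p_{n-1}) - f(p_{n-2}))
    for _ in range(max_iter):
        p2 = p1 - f(p1) * (p1 - p0) / (f(p1) - f(p0))
        if abs(p2 - p1) < tol:
            return p2
        p0, p1 = p1, p2
    return p1

print(secant(lambda x: math.cos(x) - x, 0.5, math.pi / 4))   # ≈ 0.7390851332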
Note. Secant method and NRM require a good initial approximation.

Get initial approximation using bisection method and refine this using secant/NRM.
They give rapid convergence.
Remark. f (x) = cos x − x.
NRM An approximate root obtained by NRM is p ' 0.7390851332.
Table gives the successive iterates using NRM.
k pk
0 0.7853981635
1 0.7395361337
2 0.7390851781
3 0.7390851332
4 0.7390851332

This root p ' 0.7390851332 is not bracketed by either p0 , p1 or p1 , p2 .


Secant Method An approximate root obtained by secant method is p ' 0.7390851332.
Table gives the successive iterates using secant method.
k pk
0 0.5
1 0.7853981635
2 0.7363841388
3 0.7390581392
4 0.7390851493
5 0.7390851332

The initial approximations p0, p1 bracket the root, but the pair of approximations p3 and p4
fails to do so.
The Method of False Position(Regula Falsi Method) generates approximations but it
includes a test to ensure that the root is bracketed between successive iterations.

Method of False Position (Regula Falsi Method)
f (x) = 0.
Choose initial approximations p0, p1 with f(p0)f(p1) < 0. The approximation p2 is chosen
in the same manner as in the secant method:
p2 is the x-intercept of the line joining (p0, f(p0)) and (p1, f(p1)).
Check whether f(p2)f(p1) < 0. If so, then p1 and p2 bracket a root, and we choose p3 as the
x-intercept of the line joining (p1, f(p1)) and (p2, f(p2)).
If not, then p0 and p2 bracket a root; we choose p3 as the x-intercept of the line joining
(p0, f(p0)) and (p2, f(p2)) and interchange the indices on p0 and p1, so that, after relabelling,
p3 is again the x-intercept of the line joining (p1, f(p1)) and (p2, f(p2)).
Note. This interchange of indices ensures that the root is bracketed between successive
iterations. Having found p3, the sign of f(p3)f(p2) determines whether we use p2 and p3, or p3
and p1, to compute p4. In the latter case, relabel p2 and p1. Continue this process till the
stopping condition is satisfied.

[Figure: Regula Falsi method applied to y = f(x), showing the successive approximations p0, p1, p2, p3, p4, p5.]
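A short Python sketch (not from the lecture) of the false position iteration, which keeps the root bracketed between successive iterates:

import math

def false_position(f, p0, p1, tol=1e-10, max_iter=100):
    # assumes f(p0) * f(p1) < 0
    for _ in range(max_iter):
        p2 = p1 - f(p1) * (p1 - p0) / (f(p1) - f(p0))
        if abs(p2 - p1) < tol:
            return p2
        if f(p2) * f(p1) < 0:    # root lies between p1 and p2
            p0 = p1              # relabel so that (p0, p1) always brackets the root
        p1 = p2
    return p1

print(false_position(lambda x: math.cos(x) - x, 0.5, math.pi / 4))   # ≈ 0.7390851332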

Example 0.2. f (x) = cos x − x.


Let p0 = 0.5 and p1 = π4 .

p_n = p_{n−1} − (p_{n−1} − p_{n−2})(cos p_{n−1} − p_{n−1}) / [(cos p_{n−1} − p_{n−1}) − (cos p_{n−2} − p_{n−2})],   n ≥ 2

k pk
0 0.5
1 0.7853981635
2 0.7363841388
3 0.7390581392
4 0.7390848638
5 0.7390851305
6 0.7390851332

Note that the approximations agree with those obtained by the secant method through p3, and that RFM
requires an additional iteration to obtain the same accuracy as the secant method.

Error Analysis : Secant Method

x_{n+1} = x_n − f(x_n)(x_n − x_{n−1})/(f(x_n) − f(x_{n−1})),   n ≥ 1
e_n = x_n − p,  e_{n+1} = x_{n+1} − p

e_{n+1} = x_n − f(x_n)(x_n − x_{n−1})/(f(x_n) − f(x_{n−1})) − p
        = [x_n f(x_n) − x_n f(x_{n−1}) − x_n f(x_n) + x_{n−1} f(x_n)]/(f(x_n) − f(x_{n−1})) − p
        = [x_{n−1} f(x_n) − x_n f(x_{n−1}) − p f(x_n) + p f(x_{n−1})]/(f(x_n) − f(x_{n−1}))
        = [f(x_n)(x_{n−1} − p) − f(x_{n−1})(x_n − p)]/(f(x_n) − f(x_{n−1}))
        = [e_{n−1} f(x_n) − e_n f(x_{n−1})]/(f(x_n) − f(x_{n−1}))
        = e_n e_{n−1} [f(x_n)/e_n − f(x_{n−1})/e_{n−1}]/(f(x_n) − f(x_{n−1}))
        = e_n e_{n−1} { [f(x_n)/e_n − f(x_{n−1})/e_{n−1}]/(x_n − x_{n−1}) } { (x_n − x_{n−1})/(f(x_n) − f(x_{n−1})) }   ....(I)

Now f(x_n) = f(p + e_n) = f(p) + e_n f'(p) + (1/2)e_n² f''(p) + O(e_n³). Since f(p) = 0,
f(x_n)/e_n = f'(p) + (1/2)e_n f''(p) + O(e_n²).
Similarly, f(x_{n−1})/e_{n−1} = f'(p) + (1/2)e_{n−1} f''(p) + O(e_{n−1}²).
∴ f(x_n)/e_n − f(x_{n−1})/e_{n−1} = (1/2)(e_n − e_{n−1}) f''(p) + O(e_{n−1}²) ...........(∗)
Since x_n − x_{n−1} = (x_n − p) − (x_{n−1} − p) = e_n − e_{n−1} .....(∗∗), we have
[f(x_n)/e_n − f(x_{n−1})/e_{n−1}]/(x_n − x_{n−1}) ≈ (1/2) f''(p)   (using (∗∗) in (∗))
Now
(x_n − x_{n−1})/(f(x_n) − f(x_{n−1})) ≈ 1/f'(p)
∴ from (I), we have
e_{n+1} ≈ (1/2)[f''(p)/f'(p)] e_n e_{n−1} = C e_n e_{n−1}

Order of convergence of the secant method

If α is the order of convergence of the secant method, then
|e_{n+1}| ∼ A|e_n|^α.
This means |e_{n+1}|/(A|e_n|^α) → 1 as n → ∞.
Hence |e_n| ∼ A|e_{n−1}|^α implies |e_{n−1}| ∼ [(1/A)|e_n|]^{1/α}.

Since e_{n+1} ≈ C e_n e_{n−1}, we get A|e_n|^α ∼ C|e_n| A^{−1/α}|e_n|^{1/α}, i.e.,
A^{1+1/α} C^{−1} ∼ |e_n|^{1−α+1/α} .......(II)
Since the LHS of this relation is a nonzero constant while e_n → 0, we conclude that
1 − α + 1/α = 0, i.e., α² − α − 1 = 0. Thus α = (1 ± √5)/2.
As (1 − √5)/2 < 0 and α is positive, α = (1 + √5)/2 ≈ 1.62.
The secant method's rate of convergence is therefore superlinear.
We now determine A. The RHS of (II) tends to 1, so
A^{1+1/α} C^{−1} = 1
A = C^{1/(1+1/α)} = C^{1/α} = C^{α−1} = C^{1.62−1} = C^{0.62} = [f''(p)/(2f'(p))]^{0.62}
With A as given above, we have for the secant method |e_{n+1}| ≈ A|e_n|^{(1+√5)/2}.
Since (1 + √5)/2 ≈ 1.62 < 2, the rapidity of convergence of the secant method is not as good as the
Newton-Raphson method but better than the Bisection method.

Lecture 34
Fixed point iteration
Definition 0.1. A point p is a fixed point for a given function g if g(p) = p.
We consider the problem of finding solutions to fixed point problems and the connections
between the fixed point problems and root-finding problems.
• find solutions to fixed point problems.
• fixed point problems ←→ root finding problems.
Example 0.1. g(x) = x2 − 2, −2 ≤ x ≤ 3.
The fixed points of g are
g(x) = x
x² − 2 = x
x² − x − 2 = 0
x² − 2x + x − 2 = 0
x(x − 2) + 1(x − 2) = 0
x = 2, x = −1
g(−1) = (−1)² − 2 = −1 and g(2) = 2² − 2 = 2.

y=x
(0, 2)

(−1, 0)
(2, 0)
(0, −1)

(−2, 0)
∴ g has fixed points at x = −1 and x = 2.
Theorem 0.1 (Sufficient conditions for the existence and uniquness of a fixed point).
1. If g ∈ C[a, b] and g(x) ∈ [a, b] for all x ∈ [a, b], then g has a fixed point in [a, b].
2. If, in addition, g 0 (x) exists on (a, b) and a positive constant k < 1 exists with
|g 0 (x)| ≤ k, ∀x ∈ (a, b),
then the fixed point in [a, b] is unique.

Proof. 1. If g(a) = a or g(b) = b, then g has a fixed point at an end point.
If not, then g(a) > a and g(b) < b.
[Figure: y = g(x) crossing the line y = x at the fixed point p in [a, b], with a ≤ g(x) ≤ b.]
Now h(x) = g(x)−x is continuous on [a, b].
h(a) = g(a) − a > 0 and h(b) = g(b) − b < 0. Thus by IVT, there exists a p ∈ (a, b)
for which h(p) = 0.
This number p is a fixed point for g since 0 = h(p) = g(p) − p =⇒ g(p) = p.
2. Suppose, in addition, that |g'(x)| ≤ k < 1 and that p and q are both fixed points
in [a, b]. If p ≠ q, then by the MVT there exists ξ between p and q, and hence in [a, b], with
[g(p) − g(q)]/(p − q) = g'(ξ).
Thus |p − q| = |g(p) − g(q)| = |g'(ξ)||p − q| ≤ k|p − q| < |p − q|, which is a contradiction.
Thus our assumption that p ≠ q is wrong.
∴ p = q and the fixed point in [a, b] is unique.

Example 0.2. g(x) = (x² − 1)/3 on [−1, 1].
g'(x) = 2x/3 = 0 =⇒ x = 0
g''(x) = 2/3 > 0.
g has a minimum at x = 0. The extreme value theorem implies that the absolute minimum of
g occurs at x = 0, with g(0) = −1/3.
The absolute maximum of g occurs at x = ±1 and has the value g(±1) = 0.
g is continuous and |g'(x)| = |2x/3| ≤ 2/3 for all x ∈ (−1, 1).
∴ g satisfies all the conditions of the previous theorem and hence has a unique fixed point
in [−1, 1].
p = g(p) = (p² − 1)/3, or p² − 3p − 1 = 0 =⇒ p = (3 ± √(9 + 4))/2 = (3 ± √13)/2.
Note that p = (3 − √13)/2 is the unique fixed point in [−1, 1].
Also note that g has a fixed point (3 + √13)/2 for the interval [3, 4].
However, g(4) = 5 and g'(4) = 8/3 > 1.
∴ g does not satisfy the conditions of the theorem on [3, 4].
∴ The hypotheses in the theorem are sufficient to guarantee a unique fixed point but are
not necessary.
Theorem 0.2 (Fixed Point Theorem). Let g ∈ C[a, b] be such that g(x) ∈ [a, b] for all
x ∈ [a, b]. Suppose in addition that g 0 (x) exists on (a, b) and that a constant 0 < k < 1
exists with |g 0 (x)| ≤ k, ∀x ∈ (a, b). Then, for any number p0 in [a, b], the sequence defined
by pn = g(pn−1 ), n ≥ 1 converges to the unique fixed point in [a, b].
Proof. The previous theorem implies that a unique fixed point exists in [a, b].
Since g : [a, b] → [a, b], the sequence {p_n}_{n=0}^∞ is defined for all n ≥ 0 and p_n ∈ [a, b] for all n.
Using the MVT and |g'(x)| ≤ k, we have for each n,

|p_n − p| = |g(p_{n−1}) − g(p)| = |g'(ξ_n)||p_{n−1} − p| ≤ k|p_{n−1} − p|,   ξ_n ∈ (a, b).

Applying this inequality inductively,

|p_n − p| ≤ k|p_{n−1} − p| ≤ k²|p_{n−2} − p| ≤ ... ≤ k^n|p_0 − p|

Since 0 < k < 1, lim_{n→∞} k^n = 0 and lim_{n→∞} |p_n − p| ≤ lim_{n→∞} k^n|p_0 − p| = 0.

∴ {p_n}_{n=0}^∞ converges to p.

Corollary. If g satisfies the hypothesis of the Fixed Point Theorem, then the bounds of the
error involved in using pn to approximate p are given by

|p_n − p| ≤ k^n max{p_0 − a, b − p_0}

and

|p_n − p| ≤ [k^n/(1 − k)] |p_1 − p_0|,   ∀ n ≥ 1.
Proof. Since p ∈ [a, b],

|pn − p| ≤ k n |p0 − p| ≤ k n max{p0 − a, b − p0 }.

For n ≥ 1, from the proof of the above theorem,

|pn+1 − pn | = |g(pn ) − g(pn−1 )| ≤ k|pn − pn−1 | ≤ ... ≤ k n |p1 − p0 |

Thus for m, n ≥ 1,

|p_m − p_n| = |p_m − p_{m−1} + p_{m−1} − p_{m−2} + ... + p_{n+1} − p_n|
            ≤ |p_m − p_{m−1}| + |p_{m−1} − p_{m−2}| + ... + |p_{n+1} − p_n|
            ≤ k^{m−1}|p_1 − p_0| + k^{m−2}|p_1 − p_0| + ... + k^n|p_1 − p_0|
            = k^n|p_1 − p_0| [1 + k + k² + ... + k^{m−n−1}]

By the fixed point theorem, lim_{m→∞} p_m = p.
∴ |p − p_n| = lim_{m→∞} |p_m − p_n| ≤ lim_{m→∞} k^n|p_1 − p_0| Σ_{i=0}^{m−n−1} k^i ≤ k^n|p_1 − p_0| Σ_{i=0}^{∞} k^i.
But Σ_{i=0}^{∞} k^i is a geometric series with common ratio k, 0 < k < 1, so it converges to 1/(1 − k).
∴ |p_n − p| ≤ [k^n/(1 − k)] |p_1 − p_0|.

Lecture 35
We shall see now how we can relate the root finding problem with the fixed point itera-
tion problem.

Consider f (x) = 0 where f (x) = x3 + 4x2 − 10. Since f (2) = 8 + 8 − 10 = 6 > 0,


f (1) = 5 − 10 = −5 < 0, there is a root of f (x) = 0 in [1, 2].
There are many ways in which f (x) = 0 can be rewritten in the form x = g(x).

1. x = g1(x) = x − x³ − 4x² + 10

2. x = g2(x) = (10/x − 4x)^{1/2}

3. x = g3(x) = (1/2)(10 − x³)^{1/2}

4. x = g4(x) = (10/(4 + x))^{1/2}

5. x = g5(x) = x − (x³ + 4x² − 10)/(3x² + 8x)

We now try to answer the question how we can find a fixed point problem that generates a
sequence of iterates that reliably and rapidly converges to a solution of a given root finding
problem. We get some hints to the answer to this question from the Fixed point theorem.
Before that, we shall discuss about the various forms x = g(x) that we have given above.

1. Consider x = g1(x), where g1(x) = x − x³ − 4x² + 10. Then g1(1) = 1 − 1 − 4 + 10 = 6
and g1(2) = 2 − 8 − 16 + 10 = −12.
∴ g1 does not map [1, 2] into itself.
Also |g1'(x)| = |1 − 3x² − 8x| > 1 for all x ∈ [1, 2].
The conditions given in the Fixed Point Theorem are sufficient conditions.
Although the Fixed Point Theorem does not guarantee that the method must fail for this
choice, namely x = g1(x), there is no reason to expect that this choice x = g1(x) will
converge.

2. Now consider g2(x) = (10/x − 4x)^{1/2}. Here g2(1) = √6, while g2(2) = (5 − 8)^{1/2}, which is not a real number.
∴ g2 does not map [1, 2] into [1, 2].
∴ {p_n}_{n=0}^∞ is not defined.
Also |g2'(x)| ≈ 3.4 near x = 1.365.
∴ there is no reason to expect that this choice x = g2(x) will converge.

3. g3(x) = (1/2)(10 − x³)^{1/2}.
g3'(x) = −(3/4) x² (10 − x³)^{−1/2} < 0 on [1, 2].
g3 is strictly decreasing on [1, 2].
However, |g3'(x)| ≈ 2.12 on [1, 2].
∴ |g3'(x)| ≤ k < 1 fails on [1, 2].

Start with p0 = 1.5. The successive iterates were

1.5
1.286953768
1.402540804
1.345458374
..
.
30th iteration 1.365230013

It suffices to consider [1, 1.5] instead of [1, 2].


On this interval, g30 (x) < 0 and g3 is strictly decreasing but, additionally,

1 < 1.28 ' g3 (1.5) ≤ g3 (x) ≤ g3 (1) = 1.5

∴ g3 maps [1, 1.5] into itself.


Also |g30 (x)| ≤ |g30 (1.5)| ' 0.66 on [1, 1.5], FPT confirms the convergence of the
method.
4. g4(x) = (10/(4 + x))^{1/2}.
g4'(x) = −(1/2)√10 (4 + x)^{−3/2} = −5/(√10 (4 + x)^{3/2}).
|g4'(x)| = 5/(√10 (4 + x)^{3/2}) ≤ 5/(√10 · 5^{3/2}) < 0.15 on [1, 2].
g4(x) maps [1, 2] into itself. FPT confirms convergence.
Convergence is more rapid than for g3(x) since the bound on the magnitude of g4'(x) is
smaller than that for g3'(x).

5. The sequence defined by g5(x) = x − (x³ + 4x² − 10)/(3x² + 8x) converges much more rapidly than the
other choices.

We shall justify the above statement in the next lecture.
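A short Python sketch (not from the lecture) comparing the fixed point iterations for g3, g4 and g5 from p0 = 1.5; the iteration counts below are illustrative:

def fixed_point(g, p0, n_iter):
    p = p0
    for _ in range(n_iter):
        p = g(p)
    return p

g3 = lambda x: 0.5 * (10 - x**3) ** 0.5
g4 = lambda x: (10 / (4 + x)) ** 0.5
g5 = lambda x: x - (x**3 + 4 * x**2 - 10) / (3 * x**2 + 8 * x)   # Newton-type form

for name, g, n in [("g3", g3, 30), ("g4", g4, 10), ("g5", g5, 5)]:
    print(name, fixed_point(g, 1.5, n))   # all approach 1.36523001...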

Lecture 36
Order of convergence of fixed point iteration method
If g(x) = x is such that g is continuous on [a, b] with continuous derivatives g 0 , g 00 , ..., g (α) on
(a, b). p is a fixed point of g on (a, b). If g 0 (p) = g 00 (p) = ... = g (α−1) (p) = 0, but g (α) (p) 6= 0,
then xn+1 = g(xn ) is said to have αth order convergence.
Theorem 0.1. Let g ∈ C[a, b] be such that g(x) ∈ [a, b], ∀x ∈ [a, b]. Suppose, in addition,
that g 0 is continuous on (a, b) and that a positive constant k < 1 exists with |g 0 (x)| ≤ k, ∀x ∈
(a, b). If g 0 (p) 6= 0, then for any number p0 in [a, b], the sequence pn = g(pn−1 ), n ≥ 1
converges only linearly to the unique fixed point p in [a, b].
Proof. From FPT, we know that the sequence generated by pn = g(pn−1 ), n ≥ 1 converges
to p. Since g 0 exists on (a, b), apply MVT to g to show that for any n,
pn+1 − p = g(pn ) − g(p) = g 0 (ξn )(pn − p)
for some ξn in between pn and p.
Since {p_n} converges to p, the sequence {ξ_n}_{n=0}^∞ converges to p.
Since g' is continuous on (a, b), lim_{n→∞} g'(ξ_n) = g'(p).

∴ lim_{n→∞} (p_{n+1} − p)/(p_n − p) = lim_{n→∞} g'(ξ_n) = g'(p).
∴ lim_{n→∞} |p_{n+1} − p|/|p_n − p| = |g'(p)|.

∴ if g'(p) ≠ 0, fixed point iteration exhibits linear convergence with asymptotic error constant
|g'(p)|.
Remark. The above theorem implies that higher order convergence for Fixed Point methods
can occur only when g 0 (p) = 0. There are additional conditions that ensure quadratic
convergence.
Theorem 0.2. Let p be a solution of the equation x = g(x). Suppose that g 0 (p) = 0, g 00
is continuous with |g 00 (x)| < M on an open interval I containing p. Then, there exists
a number δ > 0 such that for p0 ∈ [p − δ, p + δ], the sequence defined by pn = g(pn−1 ),
when n ≥ 1 converges atleast quadratically to p. Moreover, for sufficiently large values of
n, |pn+1 − p| < M2 |pn − p|2 .
Proof. Choose k in (0, 1) and δ > 0 such that on [p − δ, p + δ] ⊂ I we have |g'(x)| ≤ k and g'' is
continuous.
Since |g'(x)| ≤ k < 1, the sequence {p_n}_{n=0}^∞ is such that p_i ∈ [p − δ, p + δ], by the earlier
theorem. Now g(x) = g(p) + (x − p)g'(p) + [(x − p)²/2] g''(ξ), where ξ lies between x and p (by expanding
g(x) as a linear Taylor polynomial about p for x ∈ [p − δ, p + δ]).
Now g(p) = p and g'(p) = 0 =⇒ g(x) = p + [g''(ξ)/2](x − p)².
When x = p_n, p_{n+1} = g(p_n) = p + [g''(ξ_n)/2](p_n − p)², where ξ_n lies between p_n and p.
Thus p_{n+1} − p = [g''(ξ_n)/2](p_n − p)².
Since |g'(x)| ≤ k < 1 on [p − δ, p + δ] and g maps [p − δ, p + δ] into itself, by the FPT {p_n}_{n=0}^∞ → p.
But ξ_n lies between p and p_n for each n.
∴ {ξ_n} → p and lim_{n→∞} |p_{n+1} − p|/|p_n − p|² = |g''(p)|/2. Thus {p_n} converges quadratically if g''(p) ≠ 0,
and with higher order of convergence if g''(p) = 0. Since g'' is continuous and strictly bounded by
M on [p − δ, p + δ], this implies that, for sufficiently large values of n, |p_{n+1} − p| < (M/2)|p_n − p|².

∴ The easiest way to construct a fixed point problem associated with a root finding
problem f(x) = 0 is to subtract a multiple of f(x) from x.
Consider p_n = g(p_{n−1}) for n ≥ 1,
with g in the form g(x) = x − φ(x)f(x), where φ is a differentiable function to be chosen.
For this iterative method derived from g to be quadratically convergent, we need to have
g'(p) = 0 when f(p) = 0.
Since g'(x) = 1 − φ'(x)f(x) − f'(x)φ(x), we have

g'(p) = 1 − φ'(p)f(p) − f'(p)φ(p) = 1 − f'(p)φ(p)   (∵ f(p) = 0)

g'(p) = 0 if and only if φ(p) = 1/f'(p).

Let φ(x) = 1/f'(x). Then φ(p) = 1/f'(p) is ensured and we have the quadratically
convergent procedure
p_n = g(p_{n−1}) = p_{n−1} − f(p_{n−1})/f'(p_{n−1}),
which is NRM.

Example 0.1. f(x) = cos x − x = 0.

[Figure: graphs of y = x and y = cos x, intersecting once in [0, π/2].]

A solution to this root finding problem is also a solution to the fixed point problem x = cos x,
i.e., x = g(x) with g(x) = cos x.
The graph shows that a single fixed point p of g lies in [0, π/2].
Let p0 = π/4 ≈ 0.7853981635.
x_{n+1} = x_n − f(x_n)/f'(x_n) is the Newton-Raphson choice, i.e., g(x) = x − f(x)/f'(x); as seen in the
earlier remark, it reaches p ≈ 0.7390851332 within a few iterations.
x_n = g(x_{n−1}) with g(x) = cos x is the plain fixed point iteration; here x1 = g(x0) = cos(π/4) = 0.7071067810,
x2 = g(x1), and so on. The successive iterates p0 = x0, p1 = x1, p2 = x2, ..., pk = xk of this iteration are tabulated below.

k pk
0 0.7853981635
1 0.7071067810
2 0.7602445972
3 0.7246674808
4 0.7487198858
5 0.7325608446
6 0.7434642113
7 0.7361282565

A root of f (x) = 0 correct to two decimal places is p ' 0.74 .

Numerical Analysis
Lecture-37

Practice Problems
1. Derive the iteration formula x_{n+1} = (x_n/2)(3 − N x_n²) to obtain the square root of the
reciprocal of a number N > 0 and hence compute 1/√1000.

x = 1/√N, i.e., N − 1/x² = 0. Let f(x) = N − 1/x²; then f'(x) = 2/x³.
NRM: x_{n+1} = x_n − f(x_n)/f'(x_n) = x_n − (N − 1/x_n²)(x_n³/2) = x_n + x_n/2 − N x_n³/2 = (x_n/2)(3 − N x_n²).
Thus x_{n+1} = (x_n/2)(3 − N x_n²).
This iteration can be used to find 1/√N = 1/√1000.
1/√1600 < 1/√1000 < 1/√900
∴ 1/40 < 1/√1000 < 1/30
0.025 < 1/√1000 < 0.033.
Take x0 = 0.03 and generate the successive iterates using the above formula. Stop
the computation when |x_{n+1} − x_n| is correct to six decimal accuracy.
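A short Python sketch (not from the lecture) of this division-free iteration for 1/√N:

def inv_sqrt(N, x0, n_iter=10):
    # x_{n+1} = (x_n / 2) * (3 - N * x_n**2)
    x = x0
    for _ in range(n_iter):
        x = 0.5 * x * (3.0 - N * x * x)
    return x

print(inv_sqrt(1000.0, 0.03))    # ≈ 0.0316227766 = 1/sqrt(1000)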
2. Show that the initial approximation x0 for finding 1/N, where N is a positive integer,
by NRM must satisfy 0 < x0 < 2/N for convergence.

f(x) = N − 1/x, f'(x) = 1/x².
x_{n+1} = x_n − (N − 1/x_n)/(1/x_n²) = x_n + (1/x_n − N) x_n² = x_n + x_n − N x_n² = 2x_n − N x_n².
g(x) = 2x − N x².

e_{n+1} = x_{n+1} − 1/N = 2x_n − N x_n² − 1/N
        = (x_n − 1/N) + (x_n − N x_n²)
        = (x_n − 1/N) + N x_n (1/N − x_n)
        = (x_n − 1/N)(1 − N x_n)
        = e_n (1 − N x_n)

∴ e_{n+1} = e_n (1 − N x_n).
Thus |e_1| = |1 − N x_0||e_0|.
∴ |e_n| = |1 − N x_0|^n |e_0| → 0 if |1 − N x_0| < 1 =⇒ −1 < 1 − N x_0 < 1 =⇒ 0 < x_0 < 2/N.

3. Solve 1 − x/5 = e^{−x} by the Newton-Raphson Method, taking convergence tolerance ε = 5 × 10^{−3}.

f(x) = 1 − x/5 − e^{−x}
f(1) = 1 − 1/5 − 1/e > 0
f(5) = 1 − 5/5 − e^{−5} = −e^{−5} < 0
Thus f(1)f(5) < 0 and hence a root lies in [1, 5].
Take x0 = 5.
x_{n+1} = x_n − f(x_n)/f'(x_n)
x1 = 5 − f(5)/f'(5) = 4.965135690
x2 = 4.965114232
x3 = 4.965114232
∴ x = 4.965114232.

4. Determine p, q, r so that the order of convergence of the iterative method
x_{n+1} = p x_n + q a/x_n² + r a²/x_n⁵
for finding a^{1/3} becomes as high as possible.

Here x = a^{1/3}. Let ξ be a fixed point of x = g(x). Then ξ = a^{1/3}, i.e., ξ³ = a.

g(x) = p x + q a/x² + r a²/x⁵
g(ξ) = p ξ + q a/ξ² + r a²/ξ⁵ = p a^{1/3} + q a/a^{2/3} + r a²/a^{5/3} = (p + q + r) a^{1/3}

Now g(ξ) = ξ = a^{1/3} = (p + q + r) a^{1/3}. Thus p + q + r = 1 → (1).
g'(x) = p − 2qa/x³ − 5ra²/x⁶. Thus g'(ξ) = 0 =⇒ p − 2qa/ξ³ − 5ra²/ξ⁶ = 0, i.e.,
p − 2q − 5r = 0 → (2).
g''(x) = 6qa/x⁴ + 30ra²/x⁷.
g''(ξ) = 0 =⇒ 6qa/ξ⁴ + 30ra²/ξ⁷ = 0 =⇒ 6q + 30r = 0, i.e., q + 5r = 0 → (3).
From (3), q = −5r; from (1), p = 1 − q − r = 1 + 5r − r = 1 + 4r. Substituting in (2),
p − 2q − 5r = 0 =⇒ (1 + 4r) + 10r − 5r = 0 =⇒ 9r = −1 =⇒ r = −1/9. ∴ q = 5/9
and p = −4/9 + 1 = 5/9.
∴ g(x) = (5/9)x + (5/9)(a/x²) − (1/9)(a²/x⁵).
Now g'''(x) = −24qa/x⁵ − 210ra²/x⁸.
Then g'''(ξ) = −24qa/ξ⁵ − 210ra²/ξ⁸ = −(24q)/ξ² − (210r)/ξ² = [−24(5/9) + 210(1/9)]/ξ² ≠ 0.
Thus the order of convergence of the method is 3.

5. One wants to compute the positive root of the equation x = a − bx², a, b > 0, by using
the iterative method x_{n+1} = a − b x_n². What is the condition for convergence?

g(x) = a − bx², so g'(x) = −2bx.
x = a − bx², i.e., bx² + x − a = 0. Therefore x = [−1 ± √(1 + 4ab)]/(2b).
The positive root is α = [−1 + √(1 + 4ab)]/(2b).
α > 0 =⇒ √(1 + 4ab) > 1 =⇒ 4ab > 0, which is true since a, b are positive.
For convergence to α, we must have |g'(α)| < 1,
i.e., |−2bα| < 1 =⇒ |−1 + √(1 + 4ab)| < 1
i.e., −1 < −1 + √(1 + 4ab) < 1 =⇒ 0 < √(1 + 4ab) < 2 =⇒ 1 + 4ab < 4 =⇒ 4ab < 3 =⇒ ab < 3/4.

Numerical Analysis
Lecture-38

1 Solution of Linear Systems of Equations-1


Decomposition Methods-1
We now discuss the numerical aspects of solving a system of linear equations of the form
a11 x1 + a12 x2 + a13 x3 + ... + a1n xn = b1
a21 x1 + a22 x2 + a23 x3 + ... + a2n xn = b2
...
an1 x1 + an2 x2 + an3 x3 + ... + ann xn = bn .......................(1)
(1) is a system of n equations in n unknowns x1, x2, ..., xn; here aij, bi ∈ R.
(1) can be written in the form Ax = b, where A = (aij) is the n × n matrix of coefficients,
x = (x1, x2, ..., xn)ᵀ and b = (b1, b2, ..., bn)ᵀ.
We now try to understand the algorithms for solving the system Ax = b; examine and
analyze the errors that are associated with the computed solution; study the methods for
controlling and reducing the errors.

Efficient procedures for solving Ax = b that we will learn in this course belong to
1. Direct methods: Decomposition methods, Gauss Elimination method, Gauss Jordan
method.
2. Iterative methods : Gauss-Jacobi method, Gauss-Seidel method.

Direct Methods
Special Cases : Solve Ax = b.
(i) A is a diagonal matrix with aii = di , i = 1, 2, ..., n and aij = 0 for i 6= j.
In this case, the system of equations is given by
a11 x1 = b1
a22 x2 = b2
a33 x3 = b3
..
.
ann xn = bn

where aii = di ≠ 0, i = 1, 2, ..., n.
The solution is given by
x1 = b1/a11 = b1/d1
x2 = b2/a22 = b2/d2
...
xn = bn/ann = bn/dn
i.e., xi = bi/aii = bi/di, aii ≠ 0, i = 1, 2, ..., n.
(ii) A is a lower triangular matrix.
The system of equations is given by
a11 x1 = b1
a21 x1 + a22 x2 = b2
a31 x1 + a32 x2 + a33 x3 = b3
... ... ... ... ...
an1 x1 + an2 x2 + an3 x3 + ... + ann xn = bn
where aii 6= 0, i = 1, 2, ..., n.
We solve the above system by Forward Substitution; this gives
x1 = b1/a11
x2 = (b2 − a21 x1)/a22
x3 = (b3 − a31 x1 − a32 x2)/a33
...
xn = (bn − an1 x1 − an2 x2 − · · · − a_{n,n−1} x_{n−1})/ann = [bn − Σ_{j=1}^{n−1} anj xj]/ann,
where aii ≠ 0 for i = 1, 2, ..., n.
(iii) A is an upper triangular matrix.
In this case, the system of equations is given by
a11 x1 + a12 x2 + a13 x3 + ... + a1n xn = b1
a22 x2 + a23 x3 + ... + a2n xn = b2
a33 x3 + ... + a3n xn = b3
...
an−1,n−1 xn−1 + an−1,n xn = bn−1
ann xn = bn

where aii ≠ 0, i = 1, 2, ..., n.
We solve the last equation for xn, use this value in the last-but-one equation to solve for x_{n−1},
and so on, finally using the first equation to solve for x1. This method is called Backward
Substitution and gives the solution as
xn = bn/ann
x_{n−1} = (b_{n−1} − a_{n−1,n} xn)/a_{n−1,n−1}
...
x1 = (b1 − a12 x2 − a13 x3 − · · · − a1n xn)/a11 = [b1 − Σ_{j=2}^{n} a1j xj]/a11,
where aii ≠ 0, i = 1, 2, ..., n.
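A short Python sketch (not from the lecture) of these two substitution procedures, assuming nonzero diagonal entries:

import numpy as np

def forward_substitution(L, b):
    n = len(b)
    x = np.zeros(n)
    for i in range(n):                      # x_i = (b_i - sum_{j<i} L_ij x_j) / L_ii
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def backward_substitution(U, b):
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):          # x_i = (b_i - sum_{j>i} U_ij x_j) / U_ii
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x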
Remark. The above cases suggest that if we are given a system Ax = b, where A is not
diagonal or lower triangular or upper triangular, then if we can reduce A to a diagonal
matrix or a lower triangular matrix or an upper triangular matrix using elementary row
operations, we can solve the resulting system easily as explained above and thus the solution
of the given system Ax = b can be given.

Elementary Operations
1. interchanging two equations in the system: Ei ←→ Ej .

2. multiplying an equation by a nonzero number: λEi −→ Ei .

3. adding to an equation a multiple of some other equation: Ei + λEj −→ Ei .

Theorem 1.1. If one system of equations is obtained from another by a finite sequence of
elementary operations, then the two systems are equivalent.

Given a system Ax = b, if A−1 exists, then Ax = b has the solution x = A−1 b.


If A−1 is available, then one can obtain the solution using A−1 as x = A−1 b.
If A−1 is not available, then one can use the following efficient methods to solve the given
system Ax = b.

Decomposition Methods
These methods fall under the class of direct methods of solving Ax = b.
Suppose A can be factored or decomposed into A = LU where L is a lower triangular matrix
and U is an upper triangular matrix, then Ax = b can be written as LU x = b. Set U x = z.
Solve for x after finding z. Then L(U x) = b can be written as Lz = b which can be solved
for z by forward substitution. Now use this z in the right hand side of U x = z and solve

for x by backward substitution.

Inthe above factorization,   


l11 0 0 . . . 0 u11 u12 u13 ... u1n
 l21 l22 0 . . . 0   0 u22 u23 ... u2n 
L= . . . . . . . . . . . . . . .  ; U =  . . .
  .
... ... ... ...
ln1 ln2 ln3 . . . lnn ... ... ... ... unn
If A = LU , then A has LU decomposition.

1. Dolittle decomposition: If A = LU where L is unit lower triangular matrix and U


is any upper triangular matrix, then we have Dolittle factorization of A. Note here
lii = 1, ∀i = 1, 2, ..., n.
2. Crout decomposition: If A = LU where U is a unit upper triangular matrix and L is
any lower triangular matrix, then we have Crout’s reduction or Crout’s factorization
of A. Note here uii = 1, ∀i = 1, 2, ..., n.
3. Cholesky’s factorization: If in A = LU , U = LT , so that A = LLT , then we have
Cholesky’s decomposition of A.

Sufficient conditions for a square matrix A to have LU decomposition


Theorem 1.2. If all n leading principal minors of the n × n matrix A are non-singular,
then A has LU decomposition.
Positive definite matrices have the following properties:
1. All eigenvalues of a positive definite matrix are positive.
2. All the leading principal minors are positive.
 
Show that the matrix A = [12 4 −1; 4 7 1; −1 1 6] is positive definite.
The leading principal minors are 12 > 0, det[12 4; 4 7] = 84 − 16 = 68 > 0, and
det A = 12(42 − 1) − 4(24 + 1) + (−1)(4 + 7)
      = 12(41) − 4(25) − 1(11)
      = 492 − 100 − 11 = 381 > 0.
We see that all the leading principal minors are positive.
∴ A is a positive definite matrix.
Definition 1.1. An n×n matrix A is positive definite, if for any nonzero vector x, x∗ Ax > 0,
x∗ = x̄T .
Theorem 1.3. If A is a real, symmetric, positive-definite matrix, then it has a unique
factorization, A = LLT , in which L is a lower triangular matrix with a positive diagonal.

Numerical Analysis
Lecture-39

1 Solution of Linear Systems of Equations-2


Decomposition Methods-2
Dolittle Method
Let A be an n × n matrix with real entries. If A can be factored in the form A = LU with
L as an n × n unit lower triangular matrix and U as an n × n upper triangular matrix, then
we say that we have a Dolittle factorization of A. We consider this for n = 3.
A = LU, where L is a unit lower triangular matrix:
[a11 a12 a13; a21 a22 a23; a31 a32 a33] = [1 0 0; l21 1 0; l31 l32 1][u11 u12 u13; 0 u22 u23; 0 0 u33]
 = [u11, u12, u13; l21·u11, l21·u12 + u22, l21·u13 + u23; l31·u11, l31·u12 + l32·u22, l31·u13 + l32·u23 + u33].
Equating the corresponding entries in A and the matrix on the right hand side.

• Step 1: We first equate the entries in the first row of A to the corresponding first row
entries in the right hand side matrix.

a11 = u11
a12 = u12
a13 = u13

This determines the unknowns u11 , u12 , u13 in the first row of U .

• Step 2: We next equate the first column entries below the first row on the right hand
matrix to the corresponding entries in A.
a21
l21 u11 = a21 =⇒ l21 =
u11
a31
l31 u11 = a31 =⇒ l31 =
u11
This gives l21 , l31 (since u11 is known already) which are unknown entries in the first
column of L.
We now know the first row entries in U and first column entries in L.

• Step 3: In Step 3, we equate the entries in the second row on the right hand side
matrix to the corresponding entries in A.

l21 u12 + u22 = a22 =⇒ u22 = a22 − l21 u12


l21 u13 + u23 = a23 =⇒ u23 = a23 − l21 u13

This yields the 2nd row entries in U .


a32 −l31 u12
• Step 4:l31 u12 + l32 u22 = a32 =⇒ l32 = u22
which gives the second column entry
in L.

• Step 5:l31 u13 + l32 u23 + u33 = a33 =⇒ u33 = a33 − l31 u13 − l32 u23 which gives the third
row entry in U .
Thus, all the unknowns l21 , l31 , l32 in L and u11 , u12 , u13 , u22 , u23 , u33 in U are determined
and we have an LU decomposition of A where L is a unit lower triangular matrix.
 
Example 1.1. Let A = [2 3 1; 1 2 3; 3 1 2].
We see that all the leading principal minors are positive, namely 2 > 0, det[2 3; 1 2] = 4 − 3 = 1 > 0, and
det A = 2(4 − 3) − 3(2 − 9) + 1(1 − 6) = 2 + 21 − 5 = 18 > 0.
Thus A has an LU decomposition.
A = LU, where L is a unit lower triangular matrix:
[2 3 1; 1 2 3; 3 1 2] = [1 0 0; l21 1 0; l31 l32 1][u11 u12 u13; 0 u22 u23; 0 0 u33]
 = [u11, u12, u13; l21·u11, l21·u12 + u22, l21·u13 + u23; l31·u11, l31·u12 + l32·u22, l31·u13 + l32·u23 + u33].
• Step 1: (first row in U ).
u11 = 2; u12 = 3; u13 = 1.

• Step 2: (first column in L).


1 1
l21 u11 = 1 =⇒ l21 = =
u11 2
3 3
l31 u11 = 3 =⇒ l31 = =
u11 2

• Step 3: (second row in U ).


3 1
l21 u12 + u22 = 2 =⇒ u22 = 2 − =
2 2
1 5
l21 u13 + u23 = 3 =⇒ u23 = 3 − l21 u13 = 3 − (1) =
2 2
• Step 4: (second column in L).
1
1− 2
l31 u12 + l32 u22 = 1 =⇒ l32 = 1 =⇒ l32 = −7
2

• Step 5:(third row in U ).


3 5
l31 u13 + l32 u23 + u33 = 2 =⇒ u33 = 2 − (1) − (−7) = 2 + 16 = 18.
2 2

2 163
   
1 0 0 2 3 1
∴ L =  12 1 0 ; U = 0 12 25  .
3
2
−7 1 0 0 18
   
If now we want to solve Ax = b, where b = (9, 6, 8)ᵀ, then LUx = b. Set Ux = z. Then
Lz = b, where z = (z1, z2, z3)ᵀ:
[1 0 0; 1/2 1 0; 3/2 −7 1](z1, z2, z3)ᵀ = (9, 6, 8)ᵀ

Forward substitution gives
z1 = 9
(1/2)z1 + z2 = 6 =⇒ z2 = 6 − 9/2 = 3/2
(3/2)z1 − 7z2 + z3 = 8 =⇒ z3 = 8 − (3/2)(9) + 7(3/2) = 8 − 27/2 + 21/2 = 8 − 3 = 5

∴ z = (z1, z2, z3)ᵀ = (9, 3/2, 5)ᵀ.
Now solve Ux = z for x by back substitution:
[2 3 1; 0 1/2 5/2; 0 0 18](x1, x2, x3)ᵀ = (9, 3/2, 5)ᵀ
18x3 = 5 =⇒ x3 = 5/18
(1/2)x2 + (5/2)x3 = 3/2 =⇒ x2 = 3 − 5x3 = 3 − 25/18 = (54 − 25)/18 = 29/18
2x1 + 3x2 + x3 = 9 =⇒ 2x1 = 9 − 3x2 − x3 = 9 − 87/18 − 5/18 = (162 − 92)/18 = 70/18
∴ x1 = (70/18)(1/2) = 35/18.
∴ x = (x1, x2, x3)ᵀ = (35/18, 29/18, 5/18)ᵀ.
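A short Python sketch (not from the lecture) of Doolittle factorization followed by forward and backward substitution, applied to the example above:

import numpy as np

def doolittle_solve(A, b):
    A = np.array(A, dtype=float)
    n = len(b)
    L, U = np.eye(n), np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):                    # i-th row of U
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        for j in range(i + 1, n):                # i-th column of L
            L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
    z = np.zeros(n)
    for i in range(n):                           # forward substitution: Lz = b (unit diagonal)
        z[i] = b[i] - L[i, :i] @ z[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):               # backward substitution: Ux = z
        x[i] = (z[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

print(doolittle_solve([[2, 3, 1], [1, 2, 3], [3, 1, 2]], [9, 6, 8]))
# ≈ [1.9444, 1.6111, 0.2778] = [35/18, 29/18, 5/18]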

Crout’s Method
Let A be an n × n matrix. If A can be factored into A = LU with U as a unit upper
triangular matrix and L is any lower triangular matrix, then we say that we have a Crout’s
reduction of A.
We take n = 3 here and illustrate this method by an example, taking A = [5 −2 1; 7 1 −5; 3 7 4].
A = LU = [l11 0 0; l21 l22 0; l31 l32 l33][1 u12 u13; 0 1 u23; 0 0 1]
 = [l11, l11·u12, l11·u13; l21, l21·u12 + l22, l21·u13 + l22·u23; l31, l31·u12 + l32, l31·u13 + l32·u23 + l33].
• Step 1: Equate the entries in first coluumn on the right hand side matrix to the
corresponding entries in A.
l11 = 5
l21 = 7
l31 = 3
This gives first column entries in L.
• Step 2: Equate the first row entries to the right of column 1 in the right hand side
matrix to the corresponding entries in A.
2 2
l11 u12 = −2 =⇒ u12 = − =−
l11 5
1 1
l11 u13 = 1 =⇒ u13 = =
l11 5

This gives first row entries in U .
• Step 3: Equate the 2nd column entries below the first row on the right hand side
matrix to the corresponding entries in A.
2 14 19
l21 u12 + l22 = 1 =⇒ l22 = 1 − l21 u12 = 1 − 7(− ) = 1 + =
5 5 5
2 6 41
l31 u12 + l32 = 7 =⇒ l32 = 7 − l31 u12 = 7 − 3(− ) = 7 + =
5 5 5
This gives 2nd column unknown entries in L.
• Step 4: Equate the second row entry to the right of second column on the right hand
side matrix to the corresponding entry in A.
−5 − l21 u13
l21 u13 + l22 u23 = −5 =⇒ u23 =
l22
−5 − 7( 51
= 19
5
−25 − 7 32
= =−
19 19
This gives second row unknown entry in U .
• Step 5: Equate the third column entry below the second row in the right hand side
matrix to the corresponding in A.
3 41 32
l31 u13 + l32 u23 + l33 = 4 =⇒ l33 = 4 − + ( )
5 5 19
3 1312
=4− +
5 5 × 19
20 − 3 1312
= +
5 5 × 19
1
= [17 × 19 + 1312]
5 × 19
1
= [323 + 1312]
5 × 19
1
= [1635]
5 × 19
327
=
19
This gives the entry in the 3rd column of L.
∴ L = [5 0 0; 7 19/5 0; 3 41/5 327/19];  U = [1 −2/5 1/5; 0 1 −32/19; 0 0 1].

Now, if we want to solve the system Ax = b, where b = (4, 8, 10)ᵀ, then LUx = b. Set
Ux = z = (z1, z2, z3)ᵀ, then solve Lz = b for z by forward substitution. Use this z in Ux = z
and solve for x by backward substitution.

Lz = b
     
5 0 0 z1 4
7 19
5
0 . z2 = 8 
   
41 327
3 5 19
z3 10

4
5z1 = 4 =⇒ z1 =
5
19 4 19
7z1 + z2 = 8 =⇒ 7( ) + z2 = 8
5 5 5
19 28 12
=⇒ z2 = 8 − =
5 5 5
12
=⇒ z2 =
19
41 327 327 41
3z1 + z2 + z3 = 10 =⇒ z3 = 10 − 3z1 − z2
5 19 19 5
4 41 12
= 10 − 3( ) − ( )
5 5 19
12 492
= 10 − −
5 5 × 19
38 492 722 − 492 230
= − = =
5 5 × 19 5 × 19 5 × 19
327 46 46
z3 = =⇒ z3 =
19 19 327

Now, solve U x = z
1 − 52 1
    4 
5
x1 5
32   
0 1 − 19 x2 =  12
19

46
0 0 1 x3 327

x3 = 46/327
x2 − (32/19)x3 = 12/19 =⇒ x2 = 12/19 + (32/19)(46/327) = (3924 + 1472)/(19 × 327) = 5396/(19 × 327) = 284/327
x1 − (2/5)x2 + (1/5)x3 = 4/5 =⇒ x1 = (1/5)[4 + 2x2 − x3] = (1/5)[4 + 568/327 − 46/327]
    = (1/5)[(1308 + 568 − 46)/327] = 1830/(5 × 327) = 366/327

∴ The solution is x = (x1, x2, x3)ᵀ = (366/327, 284/327, 46/327)ᵀ.

Numerical Analysis
Lecture-40

1 Solution of Linear Systems of Equations-3


Cholesky Method, Gauss Elimination Method
Cholesky’s Method
If Ax = b is a given system of equations, where A is a real, symmetric and positive-definite
matrix, then one can solve this system using Cholesky’s method which belongs to the class
of decomposition methods. A can be decomposed uniquely in the form A = LLT , where L
is a lower triangular matrix with positive diagonal entries.
Ax = b is then LLT x = b. Set LT x = z, LT is an upper triangular matrix. Then Lz = b.
Solve this by forward substitution and obtain z. Using z, solve LT x = z by backward
substitution.

Example 1.1. Solve
[1 2 3; 2 8 22; 3 22 82](x1, x2, x3)ᵀ = (5, 6, −10)ᵀ.
A = [1 2 3; 2 8 22; 3 22 82] is a real, symmetric matrix. Also, its leading principal minors are positive:
1 > 0, det[1 2; 2 8] = 8 − 4 = 4 > 0, and
det A = 1(656 − 484) − 2(164 − 66) + 3(44 − 24) = 172 − 196 + 60 = 36 > 0.
∴ A is positive-definite, and A can therefore be factored as LLᵀ with L a lower triangular matrix with positive diagonal:
A = [l11 0 0; l21 l22 0; l31 l32 l33][l11 l21 l31; 0 l22 l32; 0 0 l33]
  = [l11², l11·l21, l11·l31; l21·l11, l21² + l22², l21·l31 + l22·l32; l31·l11, l31·l21 + l32·l22, l31² + l32² + l33²].

• Step 1: Equate the corresponding entries on either the first row or first column on the
right hand side matrix and A.
2
l11 = 1 =⇒ l11 = 1 (we take the positive square root as the diagonal entries as positive)
l11 l21 = 2 =⇒ l21 = 2
l11 l31 = 3 =⇒ l31 = 3

This step gives the unknown entries in the first column of L(in the first row of LT ).

• Step 2: Equate the corresponding entries in either the second row or second column.

  l21² + l22² = 8 =⇒ 4 + l22² = 8 =⇒ l22² = 4 =⇒ l22 = 2
  l21·l31 + l22·l32 = 2(3) + 2·l32 = 22 =⇒ 2·l32 = 16 =⇒ l32 = 8

  This step gives the unknown entries in the second column of L (second row of Lᵀ).
• Step 3: Equate the corresponding entries in the third row, third column.

  l31² + l32² + l33² = 82 =⇒ 9 + 64 + l33² = 82 =⇒ l33² = 9 =⇒ l33 = 3

  This step gives the third row entry of Lᵀ (third column entry of L).
We now solve Lz = b:
[1 0 0; 2 2 0; 3 8 3](z1, z2, z3)ᵀ = (5, 6, −10)ᵀ
Forward substitution gives

z1 = 5
2z1 + 2z2 = 6 =⇒ 2z2 = 6 − 10 = −4 =⇒ z2 = −2
3z1 + 8z2 + 3z3 = −10 =⇒ 15 − 16 + 3z3 = −10 =⇒ z3 = −3

Solve Lᵀx = z:
[1 2 3; 0 2 8; 0 0 3](x1, x2, x3)ᵀ = (5, −2, −3)ᵀ
Back substitution gives

3x3 = −3 =⇒ x3 = −1
2x2 + 8x3 = −2 =⇒ 2x2 = −2 − 8(−1) = 6 =⇒ x2 = 3
x1 + 2x2 + 3x3 = 5 =⇒ x1 = 5 − 6 + 3 = 2

The solution is x = (x1, x2, x3)ᵀ = (2, 3, −1)ᵀ.
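A short Python sketch (not from the lecture) of the Cholesky factorization A = LLᵀ and the two triangular solves, applied to the example above:

import numpy as np

def cholesky_solve(A, b):
    A = np.array(A, dtype=float)
    n = len(b)
    L = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        L[i, i] = np.sqrt(A[i, i] - L[i, :i] @ L[i, :i])    # positive diagonal
    z = np.zeros(n)
    for i in range(n):                                      # forward solve: Lz = b
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                          # backward solve: L^T x = z
        x[i] = (z[i] - L[i+1:, i] @ x[i+1:]) / L[i, i]
    return x

print(cholesky_solve([[1, 2, 3], [2, 8, 22], [3, 22, 82]], [5, 6, -10]))   # [ 2.  3. -1.]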

Gauss Elimination Method


Given Ax = b. Gauss elimination method reduces the system Ax = b to a system U x = b̂,
where U is an upper triangular matrix using elementary row operations. Once the system
U x = b̂ is obtained, the system is solved using back substitution method.

We explain the method by taking a system of 3 equations in 3 unknowns.

a11 x1 + a12 x2 + a13 x3 = b1 (R1 ) (1)


a21 x1 + a22 x2 + a23 x3 = b2 (R2 ) (2)
a31 x1 + a32 x2 + a33 x3 = b3 (R3 ) (3)

• Step 1: Take equation (1) as the pivotal equation associated with the first unknown
x1 . Here a11 is the pivot.
Eliminate x1 from all the equations below the pivotal equation (1) at this step using
elementary row operations-
To eliminate x1 from equation (2), multiply (1) by aa21
11
and subtract it from (2). i.e.,
R2 − aa11
21
R1 −→ R2 .
To eliminate x1 from equation (3), multiply (1) by aa31
11
and subtract it from (3). i.e.,
a31
R3 − a11 R1 −→ R3 .

At the end of Step 1, the system is

a11 x1 +a12 x2 + a13 x3 = b1 R1 −→ R1 (1)


(1) (1) (1) a21
a22 x2 + a23 x3 = b2 R2 −→ R2 − R1 (4)
a11
(1) (1) (1) a31
a32 x2 + a33 x3 = b3 R3 −→ R3 − R1 (5)
a11
where
(1) a21
a22 = a22 − a12
a11
(1) a21
a23 = a23 − a13
a11
(1) a31
a32 = a32 − a12
a11
(1) a31
a33 = a33 − a13
a11
(1) a21
b2 = b2 − b1
a11
(1) a31
b3 = b3 − b1
a11

• Step 2: At this step, (4) is the pivotal equation which is associated with the second
(1)
unknown x2 and a22 is the pivot.
Eliminate x2 from all the equations that lie below the piovtal equation (4) at this
step.
(1)
a32
To eliminate x2 from equation (5), multiply (4) by (1) and subtract it from (5). i.e.,
a22
(1)
a32
R1 −→ R! , R2 −→ R2 , R3 − (1) R2 −→ R3 .
a22
The system now is

a11 x1 + a12 x2 +a13 x3 = b1 R1 −→ R1


(1) (1) (1)
a22 x2 +a23 x3 = b2 R2 −→ R2
(1)
(2) (2) a32
a33 x3 = b3 R3 −→ R3 − (1)
R2
a22

Here,
(1)
(2) (1) a32 (1)
a33 = a33 − a
(1) 23
a22
(1)
(2) (1) a32 (1)
b3 = b 3 − b .
(1) 2
a22

At the end of Step 2, we have the following system

a11 x1 + a12 x2 + a13 x3 = b1


(1) (1) (1)
a22 x2 + a23 x3 = b2
(2) (2)
a33 x3 = b3

which is an upper triangular system. Using backward substitution, the solution can
be determined.
Example 1.2. Solve

10x1 − x2 + 2x3 = 4
x1 + 10x2 − x3 = 3
2x1 + 3x2 + 20x3 = 7

• Step 1:

  10x1 − x2 + 2x3 = 4                              R1 → R1 (pivotal equation; pivot = 10)
  (101/10)x2 − (12/10)x3 = 26/10                   R2 → R2 − (1/10)R1
  (32/10)x2 + (196/10)x3 = 62/10                   R3 → R3 − (2/10)R1

• Step 2:

  10x1 − x2 + 2x3 = 4                              R1 → R1
  (101/10)x2 − (12/10)x3 = 26/10                   R2 → R2
  (20180/1010)x3 = 5430/1010                       R3 → R3 − (32/101)R2

which is an upper triangular system. Back substitution gives
x3 = 5430/20180 = 0.2690
(101/10)x2 − (12/10)(0.2690) = 26/10 =⇒ x2 = 0.2894
10x1 − x2 + 2x3 = 4 =⇒ 10x1 = 4 + 0.2894 − 2(0.2690) =⇒ x1 = 0.3751
The solution is x = (x1, x2, x3)ᵀ = (0.375, 0.289, 0.269)ᵀ.

(1) (2)
Note. We note that in the above method, if any one of the pivot elements a11 , a22 , a33 is zero
or becomes very small as compared to other elements in that column, then Gauss elimination
method fails. In this case, we rearrange the remaining rows (that is perform elementary row
operation Ri ←→ Rj ) so as to have a nonzero pivot or to avoid multiplication by a large
number. This is known as pivoting.

Example 1.3. Consider the following example:

x1 + x2 + x 3 = 6
3x1 + 3x2 + 4x3 = 20
2x1 + x2 + 3x3 = 13

• Step 1:

x1 + x2 + x3 = 6; R1 −→ R1
x3 = 2; R2 − 3R1 −→ R2
−x2 + x3 = 1; R3 − 2R1 −→ R3

The above system obtained at the end of Step 1 has coefficient of x2 as 0 in the second
equation which is the pivotal equation at Step 2, i.e., the pivot (which is the coefficient
of x2 in the second equation) is zero.
Gauss elimination method fails at this step.
So, we interchange R2 ←→ R3 and consider the system

x1 + x2 + x3 = 6; R1 −→ R1
−x2 + x3 = 1; R2 ←→ R3
x3 = 2

The above system is an upper triangular system and therefore by back substitution,
the solution can be obtained.
x3 = 2
−x2 = 1 − x3 = 1 − 2 = −1 =⇒ x2 = 1
x 1 = 6 − x2 − x3 = 6 − 1 − 2 = 3
   
x1 3
The solution is x = x2 = 1.
  
x3 2
(1)
In the above example, a22 = 0, row 2 cannot be used to eliminate the elements in
(p−1)
column 2 below the diagonal. It is necessary to find row k, where ak,p 6= 0 and k > p
(p−1)
when app = 0 and then interchange row p and row k so that a nonzero pivot element
(1)
is obtained. In this example, in row 3, a3,2 6= 0 and so we interchanged row 3 and row 2.
This process is called pivoting and the criterion for deciding which row to choose is called
a pivoting strategy.

Numerical Analysis
Lecture 41
Solution of Linear System of Equations - 4 : Gauss Elimination with partial pivoting

Partial Pivoting
Given the system Ax = b, check the magnitude of all the elements in column p that lie on or below
the diagonal at each step in GEM.
Locate the row k in which the element has the largest absolute value, i.e.,
|a_{kp}| = max{|a_{p,p}|, |a_{p+1,p}|, · · ·, |a_{n−1,p}|, |a_{n,p}|},
and interchange row p with row k if k > p. Usually, the larger pivot element will result in a smaller
error being propagated.
In general, if the pivot element, say a_{jj}^{(j−1)}, is very small in absolute value, then |a_{lj}/a_{jj}| may be
very large for some l > j. When we use this as a multiplier in our computations, this may
cause round-off errors.
∴ We should avoid choosing small numbers as pivotal elements. This is done by following the partial
pivoting strategy.
At the k-th step, a_{kk}^{(k−1)}, a_{k+1,k}^{(k−1)}, · · ·, a_{nk}^{(k−1)} are all candidates for the pivotal element. We choose
a_{lk}^{(k−1)} such that
|a_{lk}^{(k−1)}| = max_p |a_{pk}^{(k−1)}|
as the pivotal element, by interchanging the k-th and l-th rows. Then |a_{pk}^{(k−1)}/a_{lk}^{(k−1)}| ≤ 1 for all p ≥ k.
Examples:

• Consider the following system:

  0.102x1 + 2.45x2 = 2.96
  20.2x1 − 11.4x2 = 89.6
  The system has the solution x1 = 5, x2 = 1. We now solve the system using the Gauss Elimination
  method, using three-digit arithmetic in our computations.
  [0.102 2.45 | 2.96; 20.2 −11.4 | 89.6]   R1 ↔ R1; R2 → R2 − 198R1
  ≈ [0.102 2.45 | 2.96; 0 497 | −496]  =⇒  x2 = −496/497 = −0.998
  0.102x1 + 2.45(−0.998) = 2.96
  0.102x1 = 2.96 − 2.45(−0.998)
  x1 = 52.9

• The above solution is obtained using Gauss Elimination without pivoting. If we now
  solve using Gauss Elimination with partial pivoting, then we look at the entries in the first column,
  which are the coefficients of the first unknown x1, and observe that a21 = 20.2 is the larger of
  the two, so we interchange the first and the second equation. The system now is
  [20.2 −11.4 | 89.6; 0.102 2.45 | 2.96]   R1 ↔ R1; R2 → R2 − 0.00505R1
  ≈ [20.2 −11.4 | 89.6; 0 2.51 | 2.51]
  =⇒ x2 = 1 and 20.2x1 − 11.4x2 = 89.6 =⇒ 20.2x1 = 89.6 + 11.4(1) = 101.0
  =⇒ x1 = 101/20.2 = 5,
  which yields the correct solution.

• Solve using the Gauss Elimination method with partial pivoting:
  [1 1 1; 3 3 4; 2 1 3](x1, x2, x3)ᵀ = (7, 24, 16)ᵀ
  A = (aij), i, j = 1, 2, 3. a21 = 3 is the numerically largest element in the first column; therefore
  interchange R1 and R2, i.e., R1 ↔ R2. The system, after partial pivoting, is
  [3 3 4; 1 1 1; 2 1 3](x1, x2, x3)ᵀ = (24, 7, 16)ᵀ.
  Now eliminate x1 from R2 and R3, taking R1 as the pivotal equation with pivot = 3 (the coefficient of x1 in R1):
  R2 → R2 − (1/3)R1; R3 → R3 − (2/3)R1; R1 → (1/3)R1
  ≈ [1 1 4/3 | 8; 0 0 −1/3 | −1; 0 −1 1/3 | 0]
  We now see that a22^(1) = 0 in R2.
  ∴ we scan all the entries in the second column on or below R2 and find the numerically largest
  element; it occurs in the third row R3. Therefore, we interchange R2 and R3, i.e., R2 ↔ R3,
  with R1 → R1.
  The system, after partial pivoting, is now [1 1 4/3 | 8; 0 −1 1/3 | 0; 0 0 −1/3 | −1]. Back substitution
  gives
  −(1/3)x3 = −1 =⇒ x3 = 3
  −x2 + (1/3)x3 = 0 =⇒ x2 = (1/3)x3 = (1/3)(3) = 1
  x1 + x2 + (4/3)x3 = 8 =⇒ x1 = 8 − x2 − (4/3)x3 = 8 − 1 − (4/3)(3) = 3.
  The solution is x = (x1, x2, x3)ᵀ = (3, 1, 3)ᵀ.
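A short Python sketch (not from the lecture) of Gauss elimination with partial pivoting and back substitution, applied to the system above:

import numpy as np

def gauss_elimination_pp(A, b):
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)
    for p in range(n - 1):
        k = p + np.argmax(np.abs(A[p:, p]))   # row with the largest pivot candidate
        if k != p:                            # partial pivoting: swap rows p and k
            A[[p, k]], b[[p, k]] = A[[k, p]], b[[k, p]]
        for i in range(p + 1, n):
            m = A[i, p] / A[p, p]
            A[i, p:] -= m * A[p, p:]
            b[i] -= m * b[p]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):            # back substitution
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

print(gauss_elimination_pp([[1, 1, 1], [3, 3, 4], [2, 1, 3]], [7, 24, 16]))   # [3. 1. 3.]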
• 1.133x1 + 5.281x2 = 6.414
  24.14x1 − 1.210x2 = 22.93
  x1 = 1, x2 = 1 is the solution of the above system. Use four-digit arithmetic and Gauss
  elimination with (i) no pivoting and (ii) partial pivoting to solve the system.
  – With no pivoting, the multiplier is 24.14/1.133 = 21.31, and
    −1.210 − 21.31(5.281) = −1.210 − 112.5 = −113.7
    22.93 − 21.31(6.414) = 22.93 − 136.7 = −113.8.
    At the end of step 1 (R1 → R1; R2 → R2 − 21.31R1) the system is
    1.133x1 + 5.281x2 = 6.414
    −113.7x2 = −113.8
    which gives, by back substitution, x2 = −113.8/−113.7 = 1.001 and
    x1 = [6.414 − 5.281(1.001)]/1.133 = (6.414 − 5.286)/1.133 = 0.9956.
    The solution using GEM with no pivoting is x1 = 0.9956, x2 = 1.001.
  – GEM with partial pivoting: interchange row 1 and row 2, R1 ↔ R2.
    The system is 24.14x1 − 1.210x2 = 22.93
    1.133x1 + 5.281x2 = 6.414.
    The multiplier is 1.133/24.14 = 0.04693, and
    5.281 − 0.04693(−1.210) = 5.281 + 0.05679 = 5.338
    6.414 − 0.04693(22.93) = 6.414 − 1.076 = 5.338.
    The system is 24.14x1 − 1.210x2 = 22.93; 5.338x2 = 5.338. Back substitution gives
    x2 = 5.338/5.338 = 1; x1 = (22.93 + 1.210(1))/24.14 = 24.14/24.14 = 1.
    ∴ GEM with partial pivoting gives the correct solution x1 = 1, x2 = 1.
Diagonally Dominant Matrices:
Sometimes a system of equations has the property that Gauss Elimination without pivoting can
be safely used.
One class of matrices for which this is true is the class of diagonally dominant matrices, for
which |a_ii| > Σ_{j=1, j≠i}^n |a_ij|, 1 ≤ i ≤ n.
THEOREM 1. Gauss elimination without pivoting preserves the diagonal dominance of a matrix.

• Solve by GEM the system Ax = b, where [A|b] is given as [5 −1 1 | 10; 2 4 0 | 12; 1 1 5 | −1].
  A is strictly diagonally dominant: 5 > |−1| + |1|, 4 > |2| + 0, 5 > |1| + |1|. ∴ GEM without pivoting can
  be safely used.
  Show that x1 = 23/9, x2 = 31/18, x3 = −19/18 is the solution of Ax = b.
Numerical Analysis
Lecture 42

Solution of Linear System of Equations - 5 : Gauss-Jordan Method

Gauss-Jordan Method belongs to the class of direct methods for solving a system of equations
Ax = b. It is a modification of the Gauss elimination method. It uses elementary row operations
on the augmented matrix [A|b] in order to transform the coefficient matrix A of the system
Ax = b into the identity matrix.
Thus, in the Gauss-Jordan method, we do the following:
At the i-th stage, we use the i-th equation to eliminate the x_i term from the equations R_k, k = i + 1, · · ·, n
below the pivot, as in GEM, and also from R_k, k = 1, 2, · · ·, i − 1 above the pivot. The
augmented matrix is then reduced to a diagonal matrix and the solution is obtained directly:
[A|b] → (Gauss-Jordan Method, GJM) → [I|d];  Ax = b → Ix = d, which yields x = d.

As this is a very expensive method from the point of view of computations involved, this method
is very often used to compute the inverse of a matrix A, when it exists.
[A_{n×n} | I_{n×n}] → (GJM) → [I_{n×n} | A⁻¹_{n×n}]

We shall illustrate how to compute A−1 using Gauss Jordan method.

Example
 
1 1 1
Problem. Compute the inverse of A, given A = 4 3 −1 using Gauss Jordan Method use
3 5 3
partial pivoting strategy.
Solution
   
1 1 1 1 0 0 4 3 −1 0 1 0
 4 3 −1 0 1 0 ≈  1 1 1 1 0 0 R1 ↔ R2 partial pivoting.

3 5 3 0 0 1 3 5 3 0 0 1
 
1 3/4 −1/4 0 1/4 0
≈ 1 1 1 1 0 0 R1 → R1 /4
3 5 3 0 0 1
 
1 3/4 −1/4 0 1/4 0
≈  0 1/4 5/4 1 −1/4 0 R1 ↔ R1 ; R2 → R2 − R1 ; R3 → R3 − 3R1
0 11/4 15/4 0 −3/4 1
Scan all the elements in the 2nd column below the first equation. They are 1/4 and 11/4 and the
largest
 is 11/4. So, we  R2 ↔ R3 and thus apply partial pivoting.
inter change
1 3/4 −1/4 0 1/4 0
≈  0 11/4 15/4 0 −3/4 1 R1 → R1 ; R2 ↔ R3
0 1/4 5/4 1 −1/4 0
 
1 3/4 −1/4 0 1/4 0
≈  0 1 15/11 0 −3/11 4/11 R1 ↔ R1 ; R2 → R2 /(11/4) ; R3 ↔ R3
0 1/4 5/4 1 −1/4 0
 
1 0 −14/11 0 5/11 −3/11
≈  0 1 15/11 0 −3/11 4/11  R1 − 3/4R2 → R1 ; R2 ↔ R2 ; R3 − 1/4R2 → R3
0 0 10/11 1 −2/11 −1/11

 
1 0 −14/11 0 5/11 −3/11
≈ 0 1 15/11 0 −3/11 4/11  R3 → R3 /(10/11)
0 0 1 11/10 −1/5 −1/10
 
1 0 0 7/5 1/5 −2/5
≈ 0 1 0 −3/2 0 1/2  R1 → R1 + 14/11R3 ; R2 − 15/11R3 → R2 ; R1 ↔ R3
0 0 1 11/10 −1/5 −1/10

 
7/5 1/5 −2/5
∵ A−1 =  −3/2 0 1/2 
11/10 −1/5 −1/10
Now as Ax =b =⇒ x = A−1 b, where b =(1 64)T 
7/5 1/5 −2/5 1 1
−1
x = A b = −3/2
 0 1/2   6 =
  1/2 
11/10 −1/5 −1/10 4 −1/2
 
1
∵ The solution is x =  1/2 .
−1/2
If A can be decomposed in the form A = LU, then A−1 = (LU )−1 = U −1 L−1 . So, A−1 can be
computed by computing U −1 , L−1 using GJM and then A−1 is given by A−1 = U −1 L−1 .
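A short Python sketch (not from the lecture) of computing A⁻¹ by Gauss-Jordan elimination with partial pivoting on [A | I], for the matrix of the example above:

import numpy as np

def gauss_jordan_inverse(A):
    A = np.array(A, dtype=float)
    n = A.shape[0]
    M = np.hstack([A, np.eye(n)])              # augmented matrix [A | I]
    for i in range(n):
        k = i + np.argmax(np.abs(M[i:, i]))    # partial pivoting
        M[[i, k]] = M[[k, i]]
        M[i] /= M[i, i]                        # scale the pivot row
        for j in range(n):                     # eliminate column i above and below the pivot
            if j != i:
                M[j] -= M[j, i] * M[i]
    return M[:, n:]                            # right half is A^{-1}

A_inv = gauss_jordan_inverse([[1, 1, 1], [4, 3, -1], [3, 5, 3]])
print(A_inv @ [1, 6, 4])                       # ≈ [1, 0.5, -0.5]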

Numerical Analysis
Lecture 43

Solution of Linear System of Equations - 6 : Error Analysis - 1

Error Analysis (Direct Methods)


During computation, it will be necessary to round or chop the numbers. This will introduce round
off errors in the computation. Due to this, the methods will produce results which will differ
considerably from the exact solution. Error is incurred at each step of computation.

• The system
2x + y = 2
2x + 1.01y = 2.01
has the solution x = 1/2, y = 1, whereas the system
2x + y = 2
2.01x + y = 2.05
has the solution x = 5 and y = −8.
The system is ill-conditional, that is, small changes in either the coefficient matrix or on the
right side vector results in large changes in the solution.
 
• The matrix A = [−73 78 24; 92 66 25; −80 37 10] is ill-conditioned since det A = 1, whereas perturbing a single entry by 0.01 gives
  det[−73 78 24; 92.01 66 25; −80 37 10] = 2.08,
  det[−73 78.01 24; 92 66 25; −80 37 10] = −28.20,
  det[−73 78 24; 92 66 25; −80 37 10.01] = −118.94.
  We see that very small changes in any one entry of A result in large changes in the determinant of A.
    
• [10 7 8 7; 7 5 6 5; 8 6 10 9; 7 5 9 10](x, y, z, w)ᵀ = (32, 23, 33, 31)ᵀ has the solution (1, 1, 1, 1)ᵀ,
  but [10 7 8 7; 7 5 6 5; 8 6 10 9; 7 5 9 10](x, y, z, w)ᵀ = (32.1, 22.9, 32.9, 31.1)ᵀ has the solution (6, −7.2, 2.9, −0.1)ᵀ.
  ∴ The system is ill-conditioned, as small changes in the entries of the right hand side vector
  result in large changes in the solution.

To discuss the errors in numerical problems involving vectors, it is useful to employ norms.
Vector norms
Let V = Rn (n-dimensional Euclidean space). On V , a norm is a function ∥ · ∥ from V to the set
of non-negative real numbers that obeys the three postulates. (The non-negative quantity ∥x∥,
x ∈ Rn is a measure of the size or length of a vector x or magnitude of x).

• ∥x∥ > 0, if x ̸= 0, x ∈ Rn

• ∥λx∥ = |λ|∥x∥ if λ ∈ R, x ∈ Rn

• ∥x + y∥ ≤ ∥x∥ + ∥y∥ if x, y ∈ Rn .

Note: A norm on Rn generalizes the notion of absolute value |r|, for a real or complex number.
The most commonly used norms on Rn are as follows.

• The Euclidean norm defined by


( )1/2

n
∥x∥2 = x2i , x = (x1 , x2 , · · · , xn )T .
i=1

This is the norm that corresponds to an intuitive concept of length.

• l∞ norm ∥x∥∞ = max1≤i≤n |xi |



• l1 norm ∥x∥1 = ni=1 |xi |.

Example:
Problem 1: Compare the length of the following three vectors in Rn .
Let x = (4, 4, −4, 4)T ; v = (0, 5, 5, 5)T ; w = (6, 0, 0, 0)T . Then

∥ · ∥1 ∥ · ∥2 ∥ · ∥∞
x 16 8 4
v 15 8.66 5
w 6 6 6

Problem 2: In R², the figures give the sketches of the set {x ∈ R² : ∥x∥ ≤ 1} for the three norms.

[Figures: the unit balls for ∥x∥₂ (a disc), ∥x∥∞ (a square), and ∥x∥₁ (a diamond).]

Numerical Analysis
Lecture 44

Solution of Linear System of Equations - 7 : Error Analysis - 2

Matrix Norm: The matrix norm ∥A∥ is a non-negative number which satisfies the properties

• (i) ∥A∥ > 0 if A ̸= 0 and ∥0∥ = 0

• (ii) ∥cA∥ = |c|∥A∥ for an arbitrary complex number c

• (iii) ∥A + B∥ ≤ ∥A∥ + ∥B∥

• (iv) ∥AB∥ ≤ ∥A∥∥B∥

The most commonly used norms are

• (i) Euclidean norm: ∥A∥ = [Σ_{i,j=1}^n (a_ij)²]^{1/2}

• (ii) Maximum norm (maximum absolute row sum): ∥A∥ = ∥A∥∞ = max_i Σ_k |a_ik|

• (iii) Maximum norm (maximum absolute column sum): ∥A∥ = ∥A∥₁ = max_k Σ_i |a_ik|

THEOREM 1. If the vector norm ∥·∥∞ is defined by ∥x∥∞ = max_{1≤i≤n} |x_i|, then the associated
matrix norm consistent with the vector norm is ∥A∥∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|, i.e., ∥Ax∥ ≤ ∥A∥∥x∥.
Problem
Determine the Euclidean and maximum absolute row sum norms of the matrix A = [1 7 −4; 4 −4 9; 12 −1 3].

Solution. Euclidean norm: ∥A∥ = (Σ_{i,j=1}^3 |a_ij|²)^{1/2}.
∥A∥² = 1 + 49 + 16 + 16 + 16 + 81 + 144 + 1 + 9 = 333
∴ ∥A∥ = √333 = 18.25.
Maximum absolute row sum norm = max_i Σ_{k=1}^3 |a_ik| = max(12, 17, 16) = 17.

Let us use these concepts to perform error analysis for direct methods.
Consider an equation Ax = b, where A is an n × n matrix; suppose that A is invertible.
Consider an equation Ax = b, A is an n × n matrix suppose that A is invertible.
Example

• If A−1 is perturbed to obtain a new matrix B, the solution x = A−1 b is perturbed to become
a new vector x̃ = Bb. How large is this latter perturbation in absolute and relative terms ?.
We have ∥x − x̃∥ = ∥x − Bb∥ = ∥x − BAx∥ = ∥(I − BA)x∥ ≤ ∥I − BA∥∥x∥
This gives the magnitude of the perturbation in x. If the relative perturbation is being
measured then
∥x − x̃∥
≤ ∥I − BA∥ −→ (∗)
∥x∥
∥x−x̃∥
(∗) gives an upper bound on ∥x∥ and this ratio is taken as a relative error between x and
x̃.

• Suppose that the vector b is perturbed to obtain a vector b̂. If x and x̂ satisfy Ax = b and
  Ax̂ = b̂, by how much do x and x̂ differ, in absolute and relative terms?
  Assuming A is invertible,

  ∥x − x̂∥ = ∥A⁻¹b − A⁻¹b̂∥ = ∥A⁻¹(b − b̂)∥ ≤ ∥A⁻¹∥∥b − b̂∥,

  which gives a measure of the perturbation in x.
  For the relative perturbation,

  ∥x − x̂∥ ≤ ∥A⁻¹∥∥b − b̂∥
          = ∥A⁻¹∥∥b∥ (∥b − b̂∥/∥b∥)
          = ∥A⁻¹∥∥Ax∥ (∥b − b̂∥/∥b∥)
          ≤ ∥A⁻¹∥∥A∥∥x∥ (∥b − b̂∥/∥b∥)

  ∴ ∥x − x̂∥/∥x∥ ≤ ∥A⁻¹∥∥A∥ ∥b − b̂∥/∥b∥, i.e.,
  ∥x − x̂∥/∥x∥ ≤ K(A) ∥b − b̂∥/∥b∥ → (∗), where K(A) = ∥A⁻¹∥∥A∥,
  which shows that the relative error in x is not greater than K(A) times the relative error in b.
  This number K(A) = ∥A⁻¹∥∥A∥ is called the condition number.
  If the condition number is small, then a small perturbation in b leads to a small perturbation in x.

• K(A) ≥ 1 is always true:
  since AA⁻¹ = I, ∥I∥ = ∥AA⁻¹∥ ≤ ∥A∥∥A⁻¹∥ = K(A), and ∥I∥ ≥ 1 (indeed ∥I∥ = 1 for a norm induced by a vector norm), hence K(A) ≥ 1.
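A short Python sketch (not from the lecture) computing the row-sum and column-sum norms and the corresponding condition number with NumPy, for the matrix of the earlier problem:

import numpy as np

A = np.array([[1.0, 7.0, -4.0],
              [4.0, -4.0, 9.0],
              [12.0, -1.0, 3.0]])

row_sum_norm = np.abs(A).sum(axis=1).max()     # ||A||_inf = 17
col_sum_norm = np.abs(A).sum(axis=0).max()     # ||A||_1  = 17
A_inv = np.linalg.inv(A)
cond_inf = row_sum_norm * np.abs(A_inv).sum(axis=1).max()   # K(A) = ||A||_inf * ||A^{-1}||_inf

print(row_sum_norm, col_sum_norm, cond_inf)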

Numerical Analysis
Lecture 45

Solution of Linear System of Equations - 8 : Iterative Improvement method, Iterative methods

Iterative Improvement method


When the solution to the system Ax = b has been computed, and because of round off error, we
obtain not the exact solution but the approximate solution x̃, then it is possible to apply iterative
improvement to correct x̃ so that it more closely agrees with x.
Define e = x − x̃ and r = b − Ax̃. Then Ae = Ax − Ax̃ = b − Ax̃ = r,
i.e., the error e satisfies the system Ae = r; r is called the residual vector.
We solve Ae = r for e by any of the direct methods and then use x̃ + e as a better
approximation to x, say x̃̃ = x̃ + e.
We check the value of ∥e∥/∥x̃∥ = ∥x − x̃∥/∥x̃∥; if it is small, it means that we are close to x.
In this case we can stop our computations; otherwise we define ẽ = x − x̃̃ and continue in the same way
as above till we get the solution correct to the desired degree of accuracy. We consider an example
and demonstrate the above method.
Example:

• Solve the following system by the iterative improvement method:
  [33 25 20; 20 17 14; 25 20 17](x1, x2, x3)ᵀ = (78, 51, 62)ᵀ.
  Perform step 1 of the computations.
  We apply GEM with partial pivoting and obtain the solution as x̃ = (1.1, 0.89, 1.0)ᵀ.
  We observe that the exact solution is x = (1, 1, 1)ᵀ. So we want to improve the solution x̃
  using the iterative improvement method.
  We compute
  Ax̃ = [33 25 20; 20 17 14; 25 20 17](1.1, 0.89, 1.0)ᵀ = (78.55, 51.13, 62.3)ᵀ
  ∴ r = b − Ax̃ = (78 − 78.55, 51 − 51.13, 62 − 62.3)ᵀ = (−0.55, −0.13, −0.3)ᵀ.
  Solve Ae = r for the error e = x − x̃. The augmented matrix is
  [A|r] = [33 25 20 | −0.55; 20 17 14 | −0.13; 25 20 17 | −0.3]. Apply GEM and
  obtain e = (e1, e2, e3)ᵀ. Then x ≈ x̃ + e; call this x̃̃ = (1.1 + e1, 0.89 + e2, 1.0 + e3)ᵀ.
  This is the required solution at the end of the first step. We now move on to iterative methods.
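A short Python sketch (not from the lecture) of one step of iterative improvement; here the correction equation Ae = r is solved with numpy.linalg.solve in place of GEM:

import numpy as np

A = np.array([[33.0, 25.0, 20.0],
              [20.0, 17.0, 14.0],
              [25.0, 20.0, 17.0]])
b = np.array([78.0, 51.0, 62.0])
x_tilde = np.array([1.1, 0.89, 1.0])     # approximate solution obtained from GEM

r = b - A @ x_tilde                      # residual vector
e = np.linalg.solve(A, r)                # correction: A e = r
x_improved = x_tilde + e

print(x_improved)                        # close to the exact solution (1, 1, 1)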

Solution of a system of equations by iterative methods.


Iterative methods produce a sequence of solution vectors x(i) that converges to the solution.

The computation is halted when an approximate solution is obtained having some specified
accuracy or after a certain number of iterations.

    
7 −6 x1 3
• = : 7x1 − 6x2 = 3; −8x1 + 9x2 = −4
−8 9 x2 −4
Solve the th th
h i equation i for the i h unknown i
(k−1) (k−1)
xk1 = 17 3 − 6x2 and xk2 = 91 − 4 + 8x1 , −→ (∗)
this is Jacobi method.
Then generate the improve values using (∗)
(k) (k)
k x1 x2
0 0.0000 0.0000
1 3/7 -4/9
10 0.148651 -0.198201
20 0.186816 -0.249088
30 0.196615 -0.262154
40 0.199131 -0.265508
50 0.199777 -0.266369

Gauss-Seidel method
x1^(k) = (1/7)[3 + 6 x2^(k−1)]   and   x2^(k) = (1/9)[−4 + 8 x1^(k)],   −→ (∗∗)

k      x1^(k)                                x2^(k)
0      0.0000                                0.0000
1      3/7                                   (1/9)[8 × (3/7) − 4] = −4/63
2      (1/7)[3 + 6 × (−4/63)] = 165/441      (1/9)[8 × (165/441) − 4]
10     0.219776                              -0.249088
20     0.201304                              -0.265308
30     0.200086                              -0.266590
40     0.200006                              -0.266662
50     0.2                                   -0.266666

Gauss-Jacobi (GJ) and Gauss-Seidel (GS) converge to the same limit, but GS converges faster.
∴ the accuracy or precision obtained in the solution depends on when the iterative process
is halted.
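
The corresponding Gauss-Seidel sketch differs only in that x2 is updated with the freshly computed x1 of the same iteration:

# Gauss-Seidel iteration for the same system, starting from (0, 0).
x1, x2 = 0.0, 0.0
for _ in range(50):
    x1 = (3 + 6 * x2) / 7     # x2 is from the previous iteration
    x2 = (-4 + 8 * x1) / 9    # x1 is the value just computed in this iteration
print(x1, x2)   # converges to 0.2, -0.266666..., faster than Jacobi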

• 10x1 + 2x2 + x3 = 9
x1 + 10x2 − x3 = −22
−2x1 + 3x2 + 10x3 = 22

GJ Method:   x1^(k) = (1/10)[9 − 2 x2^(k−1) − x3^(k−1)]
             x2^(k) = (1/10)[−22 + x3^(k−1) − x1^(k−1)]
             x3^(k) = (1/10)[22 + 2 x1^(k−1) − 3 x2^(k−1)]

x1^(0) = x2^(0) = x3^(0) = 0

x1^(1) = 0.9 ;   x2^(1) = −2.2 ;   x3^(1) = 2.2
x1^(2) = (1/10)[9 − 2 × (−2.2) − 2.2] = 1.12
x2^(2) = (1/10)[−22 + 2.2 − 0.9] = −2.07
x3^(2) = (1/10)[22 + 2 × 0.9 − 3 × (−2.2)] = 3.04

k    x1^(k)    x2^(k)    x3^(k)
0    0.0000    0.0000    0.0000
1    0.9       -2.2      2.2
2    1.12      -2.07     3.04
3    1.01      -2.008    3.045
4    0.9971    -1.9965   3.0044
5    0.9989    -1.9993   2.9984
6    1.0000    -2.0000   2.9996
7    1.0000    -2.0000   3.0000

∵ x1^(6) = x1^(7), x2^(6) = x2^(7), x3^(6) = x3^(7)


The solution is x1 = 1, x2 = −2, x3 = 3.
GS Method:   x1^(k) = (1/10)[9 − 2 x2^(k−1) − x3^(k−1)]
             x2^(k) = (1/10)[−22 + x3^(k−1) − x1^(k)]
             x3^(k) = (1/10)[22 + 2 x1^(k) − 3 x2^(k)]

x1^(0) = x2^(0) = x3^(0) = 0

x1^(1) = 0.9 ;   x2^(1) = −2.29 ;   x3^(1) = 3.067
x1^(2) = (1/10)[9 − 2 × (−2.29) − 3.067] = 1.0513
x2^(2) = (1/10)[−22 + 3.067 − 1.0513] = −1.9984
x3^(2) = (1/10)[22 + 2 × 1.0513 − 3 × (−1.9984)] = 3.0098

k    x1^(k)    x2^(k)    x3^(k)
0    0.0000    0.0000    0.0000
1    0.9       -2.29     3.067
2    1.0513    -1.9984   3.0098
3    0.9987    -1.9989   2.9994
4    0.9998    -2.0000   3.0000
5    1.0000    -2.0000   3.0000

The solution is x1 = 1, x2 = −2, x3 = 3.

We now answer the question: will GJM and GSM always converge, starting from some initial
vector? The following theorem provides us an answer.

THEOREM 1. If A is diagonally dominant, i.e., |a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|, 1 ≤ i ≤ n,
then the Gauss-Jacobi method (GJM) and the Gauss-Seidel method (GSM) converge, for any starting vector,
to the solution of the system Ax = b.
This condition is a sufficient condition but not necessary for convergence and consequently,
these iterative methods may converge for other sets of equations.
In the example that we considered, the coefficient matrix is diagonally dominant. Therefore
we expect GJM and GSM to converge. But in the following example,

3 185
• x1 − x2 − 2x3 = −5
x1 + 2x2 + x3 = 7
x1 + 3x2 − x3 = 2
the coefficient matrix A is not diagonally dominant. Therefore, there is no guarantee that
GJM and GSM will converge; nor can we guarantee that they will both diverge.
Let us start with the initial vector x1^(0) = x2^(0) = x3^(0) = 0 and perform 10 iterations.

GJ method gives   x1^(10) = 302.125 ;   x2^(10) = −106.188 ;   x3^(10) = 217.375
GS method gives   x1^(10) = 85011.3 ;   x2^(10) = −58568.9 ;   x3^(10) = −90697.4

Both the methods are diverging from the true solution.

• Now consider another example:
  4x1 − 2x2 + x3 = 12
  2x1 + 3x2 − x3 = 7
  2x1 − 2x2 + 2x3 = 8
  The coefficient matrix is not diagonally dominant. However, starting with x1^(0) = x2^(0) = x3^(0) = 0,
  and performing 18 iterations with GJM and 15 iterations with GSM, we get

  GJ method gives   x1^(18) = 3.0 ;   x2^(18) = 1.0 ;   x3^(18) = 2.0
  GS method gives   x1^(15) = 3.0 ;   x2^(15) = 1.0 ;   x3^(15) = 2.0

  Since the true solution is x1 = 3, x2 = 1, x3 = 2, we can say that GJM and GSM both
  converge to the true solution for this problem.

Thus, given a system Ax = b,

• if A is strictly diagonally dominant, then GJM and GSM converge to the true solution, starting
  from any initial vector;
• if the sequence of iterates generated by GJM and GSM diverges, then A is not
  diagonally dominant;
• there may be other systems Ax = b with A not diagonally dominant for which the iterative
  methods converge to the true solution.

Therefore, the condition given in the theorem is sufficient to ensure convergence but not necessary.
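
A small Python/NumPy helper for checking the sufficient condition of Theorem 1 (strict diagonal dominance) might look as follows; the function name is an illustrative choice, and the two matrices tested are the coefficient matrices of the examples above.

import numpy as np

def is_strictly_diagonally_dominant(A):
    """Return True if |a_ii| > sum_{j != i} |a_ij| for every row i."""
    A = np.asarray(A, dtype=float)
    off_diag = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return bool(np.all(np.abs(np.diag(A)) > off_diag))

print(is_strictly_diagonally_dominant([[10, 2, 1], [1, 10, -1], [-2, 3, 10]]))  # True
print(is_strictly_diagonally_dominant([[1, -1, -2], [1, 2, 1], [1, 3, -1]]))    # False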
Exercise
Solve by GJM and GSM the following system: [4 1 1; 1 5 2; 1 2 3] (x1, x2, x3)ᵀ = (2, −6, −4)ᵀ.
Take the initial vector as x1^(0) = x2^(0) = x3^(0) = 0.

Numerical Analysis
Lecture 46

Matrix eigenvalue problem - 1

Matrix eigenvalue problem: Let A be an n × n matrix.


Let λ be a scalar.
If the equation Ax = λx has a nontrivial solution x ≠ 0, then λ is an eigenvalue of A. A non-zero
vector x satisfying Ax = λx is an eigenvector of A corresponding to the eigenvalue λ.
    
[2 0 1; 5 −1 2; −3 2 −5/4] (1, 3, −4)ᵀ = −2 (1, 3, −4)ᵀ   (∵ Ax = λx)
∴ −2 is an eigenvalue of A and (1, 3, −4)ᵀ is a corresponding eigenvector.
Note: A non-zero multiple of an eigenvector is another eigenvector corresponding to the same
eigenvalue. 
The condition that Ax = λx has a non-trivial solution is equivalent to det(A − λI) = 0,
i.e., A − λI is singular.

| a11 − λ   a12       ···   a1n     |
| a21       a22 − λ   ···   a2n     |   = 0
| ···       ···       ···   ···     |
| an1       an2       ···   ann − λ |

gives a polynomial in λ of degree n, known as the characteristic polynomial of A.


An n × n matrix has exactly n eigenvalues, provided that these are counted with the multi-
plicities that they possess as roots of the characteristic equation.
 
A = [1 2 1; 0 1 3; 2 1 1]

det(A − λI) = 0   =⇒   −λ³ + 3λ² + 2λ + 8 = 0

i.e., −(λ − 4)(λ² + λ + 2) = 0
λ = 4,   λ = −1/2 + i√7/2,   λ = −1/2 − i√7/2
Note: Eigenvalues of a real matrix are not necessarily real numbers.
For large matrices, the above direct method is not recommended.
∵ any errors (round-off) in the coefficients of the polynomial may lead to gross inaccuracies in
the numerically determined roots.

Power Method
Power method is used to determine the largest eigenvalue (in magnitude) and the corresponding
eigenvector of Ax = λx.
Assumptions: A has a single eigenvalue of maximum modulus, and there is a linearly independent
set of n eigenvectors.
Let λ1, λ2, · · · , λn be the distinct eigenvalues such that |λ1| > |λ2| > · · · > |λn|, and let v1, v2,
· · · , vn be the corresponding eigenvectors.
The method is applicable if a complete system of n linearly independent eigenvectors exists. Then,

any vector v in the space spanned by the eigenvectors v1, v2, · · · , vn can be written as a linear combination
of these basis vectors v1, v2, · · · , vn:

v = c1 v1 + c2 v2 + · · · + cn vn

A v = A(c1 v1 + c2 v2 + · · · + cn vn)
    = c1 A v1 + c2 A v2 + · · · + cn A vn
    = c1 λ1 v1 + c2 λ2 v2 + · · · + cn λn vn
    = λ1 [ c1 v1 + c2 (λ2/λ1) v2 + · · · + cn (λn/λ1) vn ]

A^2 v = λ1^2 [ c1 v1 + c2 (λ2/λ1)^2 v2 + · · · + cn (λn/λ1)^2 vn ]

· · · · · · · · ·

A^k v = λ1^k [ c1 v1 + c2 (λ2/λ1)^k v2 + · · · + cn (λn/λ1)^k vn ]

A^(k+1) v = λ1^(k+1) [ c1 v1 + c2 (λ2/λ1)^(k+1) v2 + · · · + cn (λn/λ1)^(k+1) vn ]

As k → ∞, A^k v → λ1^k c1 v1 and A^(k+1) v → λ1^(k+1) c1 v1   (∵ |λi/λ1| < 1, i = 2, 3, · · · , n)

The vector c1 v1 + c2 (λ2/λ1)^k v2 + · · · + cn (λn/λ1)^k vn −→ c1 v1,
which is the eigenvector corresponding to λ1.
The eigenvalue λ1 is obtained as the ratio of the corresponding components of A^(k+1) v and A^k v:

λ1 = lim_{k→∞} [A^(k+1) v]_m / [A^k v]_m ,   m = 1, 2, · · · , n

The suffix m denotes the m-th component of the vector. The iteration is stopped when the magnitudes
of the differences of the ratios are less than the given error tolerance.
In order to keep the round-off error in control, we normalize (i.e., make the largest element unity)
the vector before multiplying by A.
Let v0 be a non-zero initial vector (non-orthogonal to v1).

Define y_{k+1} = A v_k and v_{k+1} = y_{k+1} / m_{k+1}, where m_{k+1} is the largest element in magnitude of y_{k+1}.

λ1 = lim_{k→∞} [A^(k+1) v]_m / [A^k v]_m ,   m = 1, 2, · · · , n

and v_{k+1} is the required eigenvector.
Note:
• As k → ∞, m_{k+1} also gives λ1.
• Choose the initial vector as a vector with all components equal to unity (non-orthogonal to
  v1), if no initial vector is given.
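
A compact Python/NumPy sketch of the normalized power iteration described above is given below; the function name and the step count of 20 are illustrative assumptions.

import numpy as np

def power_method(A, v0, steps=20):
    """Normalized power iteration: returns (approx. dominant eigenvalue, eigenvector)."""
    v = np.asarray(v0, dtype=float)
    m = 0.0
    for _ in range(steps):
        y = A @ v
        m = y[np.argmax(np.abs(y))]   # largest element in magnitude of y
        v = y / m                     # normalize so that the largest element is unity
    return m, v

A = np.array([[3., -5.], [-2., 4.]])
lam, vec = power_method(A, [1., 1.])
print(lam, vec)   # about 6.70 and (1, -0.74); cf. the worked example in the next lecture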

Numerical Analysis
Lecture 47
Matrix eigenvalue problem - 2

Example

• Obtain the dominant eigenvalue of A = [3 −5; −2 4] by the power method, starting with the initial
  vector v0 = (1, 1)ᵀ.

  y1 = A v0 = (−2, 2)ᵀ = −2 (1, −1)ᵀ = −2 v1
  v1 = y1 / (−2) = (1, −1)ᵀ
  y2 = A v1 = (8, −6)ᵀ = 8 (1, −0.75)ᵀ = 8 v2
  y3 = A v2 = (6.75, −5)ᵀ = 6.75 (1, −0.7407)ᵀ = 6.75 v3
  y4 = A v3 = (6.7035, −4.9628)ᵀ = 6.7035 (1, −0.7403)ᵀ = 6.7035 v4
  y5 = A v4 = (6.7015, −4.9612)ᵀ = 6.7015 (1, −0.7403)ᵀ = 6.7015 v5
  y6 = A v5 = (6.7015, −4.9612)ᵀ = 6.7015 (1, −0.7403)ᵀ = 6.7015 v6

  Since 6.7015 v5 = 6.7015 v6, convergence has occurred.
  We see that the eigenvalue of A which is numerically the largest is λ = 6.7015. The corresponding
  eigenvector is (1, −0.7403)ᵀ.
 
• Find the largest eigenvalue in modulus and the corresponding eigenvector of
  A = [−15 4 3; 10 −12 6; 20 −4 2],
  starting with an initial vector v0 = (1, 1, 1)ᵀ and performing 9 steps of the power method.
1
   
  y1 = A v0 = A (1, 1, 1)ᵀ = (−8, 4, 18)ᵀ
  v1 = y1 / 18 = (−8/18, 4/18, 1)ᵀ = (−4/9, 2/9, 1)ᵀ
  y2 = A v1 = A (−4/9, 2/9, 1)ᵀ = (95/9, −10/9, −70/9)ᵀ = (95/9) (1, −2/19, −14/19)ᵀ = (95/9) v2
  y3 = A v2 = · · ·
  y4 = A v3 = · · ·
  · · ·
  y7 = A v6 = (−19.8674, 9.7072, 19.8524)ᵀ = 19.8674 (−1.0, 0.4886, 0.9992)ᵀ = 19.8674 v7
  y8 = A v7 = (19.952, −9.868, −19.956)ᵀ = 19.956 (0.9998, −0.494, −1)ᵀ = 19.956 v8
  y9 = A v8 = (19.973, 9.926, 19.988)ᵀ

  As we have performed 9 steps of computations of the power method, we stop here.
  An approximation to the largest eigenvalue in modulus is

  |λ| = |[y9]_m / [v8]_m| = |19.973 / 0.9998|, |9.926 / (−0.494)|, |19.988 / (−1)|

  Taking the absolute value component-wise and rounding off, |λ| ≈ 20;
  the corresponding eigenvector is (0.9998, −0.494, −1)ᵀ.
 
• Find the dominant eigenvalue and the corresponding eigenvector of A = [1 −3 2; 4 4 −1; 6 3 5] by the
  power method, starting with an initial vector v0 = (1, 1, 1)ᵀ. Perform 16 steps of computations.
  (Ans: y16 = 6.9998 (0.3000, 0.0666, 1)ᵀ and the dominant eigenvalue is |λ1| ≈ 7)
Can we find the smallest eigenvalue of A?
Suppose the matrix A has an eigenvalue λ, i.e., Ax = λx, and assume that A is invertible. Then

A⁻¹Ax = A⁻¹λx
Ix = λA⁻¹x
(1/λ) x = A⁻¹x

Therefore, 1/λ is an eigenvalue of A⁻¹ and the corresponding eigenvector is x, which is the eigenvector
associated with the eigenvalue λ of A.
Let the dominant eigenvalue of A⁻¹ be µ. Then A⁻¹x = µx.
It is clear that 1/µ is the least eigenvalue (in magnitude) of A.
Therefore, to find the least eigenvalue of A, find A⁻¹ by the Gauss-Jordan method, compute its
dominant eigenvalue µ and the corresponding eigenvector x; then λ = 1/µ is the least eigenvalue of
A and x is the corresponding eigenvector.
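
A Python/NumPy sketch of this procedure follows; here A⁻¹ is obtained with a library call rather than by Gauss-Jordan, the function name is an illustrative choice, and the matrix used is the one from the example below.

import numpy as np

def smallest_eigenvalue(A, v0, steps=50):
    """Smallest eigenvalue of A in magnitude, via the power method applied to A^{-1}
    (whose dominant eigenvalue is mu = 1/lambda_min)."""
    B = np.linalg.inv(A)          # in the notes A^{-1} is obtained by Gauss-Jordan
    v = np.asarray(v0, dtype=float)
    mu = 0.0
    for _ in range(steps):
        y = B @ v
        mu = y[np.argmax(np.abs(y))]
        v = y / mu
    return 1.0 / mu, v

A = np.array([[-15., 4., 3.], [10., -12., 6.], [20., -4., 2.]])
lam, x = smallest_eigenvalue(A, [1., 1., 1.])
print(lam)    # approaches 5, in agreement with the example below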
Example
 
• A = [−15 4 3; 10 −12 6; 20 −4 2]. Compute the smallest eigenvalue of A.
  We get A⁻¹ by the Gauss-Jordan method and it is A⁻¹ = [0 −0.02 0.06; 0.1 −0.09 0.12; 0.2 0.02 0.14].
  Compute the dominant eigenvalue of A⁻¹ using the power method, starting with v0 = (1, 1, 1)ᵀ.
  Then y1 = A⁻¹ v0 = · · ·
  Show that the dominant eigenvalue of A⁻¹ is 0.2, i.e., µ = 1/5.
  Therefore the smallest eigenvalue of A is 5.
One can also compute the smallest eigenvalue of A by the following procedure and it does not
involve computation of A−1 .

Let λ be an eigenvalue of A, i.e., λx = Ax.
Find a matrix B given by B = A − µI. Then

Bx = (A − µI)x = Ax − µIx = λx − µIx = (λ − µ)x

=⇒ Bx = (λ − µ)x
=⇒ (λ − µ) is an eigenvalue of B,
i.e., A − µI has an eigenvalue λ − µ.
i.e., A − µI has an eigenvalue λ − µ.
Therefore, eigenvalues of B are merely the eigenvalues of A diminished by a constant µ, and the
eigenvector associated with (λ − µ) of B is the same as that associated with the eigenvalue λ of A.

Therefore, having evaluated the dominant eigenvalue λ1 of A, we can find the smallest
eigenvalue of A by modifying the matrix in the form A − λ1 I, which has eigenvalues given by

λ′_p = λ_p − λ1 ,   p = 1, 2, · · · , n

where the λ_p's are the eigenvalues of A.
We want λ_p to be the smallest.
For λ_p to be the smallest, λ′_p is the largest (in magnitude) eigenvalue of A − λ1 I.
Compute λ′_p, the dominant eigenvalue of A − λ1 I, by using the power method.
Therefore, if x_p is the corresponding eigenvector, then (A − λ1 I) x_p = (λ_p − λ1) x_p

=⇒ A x_p = λ_p x_p

=⇒ the eigenvector corresponding to the smallest eigenvalue λ_p of A is the same as the eigenvector
x_p associated with the eigenvalue (λ_p − λ1) of (A − λ1 I).
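
The shift-based procedure can be sketched in the same way; λ1 is assumed to be the dominant eigenvalue already computed by the power method, and the function name is an illustrative choice.

import numpy as np

def eigenvalue_by_shift(A, lam1, v0, steps=50):
    """Power method on B = A - lam1*I; if its dominant eigenvalue is (lam_p - lam1),
    then lam_p is recovered by adding lam1 back, with the same eigenvector."""
    B = A - lam1 * np.eye(A.shape[0])
    v = np.asarray(v0, dtype=float)
    m = 0.0
    for _ in range(steps):
        y = B @ v
        m = y[np.argmax(np.abs(y))]
        v = y / m
    return m + lam1, v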

Numerical Analysis
Lecture 48
Matrix eigenvalue problem - 3: Gershgorin, Brauer’s theorems

Can we give the locations of the eigenvalues of a given matrix A ?


The answer is, yes, and we have results that clearly specify the circular disks within which the
eigenvalues are located.
THEOREM 1 (Gershgorin Theorem). The largest eigenvalue in modulus, of a square matrix
A cannot exceed the largest sum of the moduli of the elements along any row or any column.

Proof. Let λi be an eigenvalue of A, where A = (aij) is an n × n matrix. Let xi be the corresponding
eigenvector; xi = (xi,1 , xi,2 , · · · , xi,n )ᵀ. Then A xi = λi xi , or we have

a11 xi,1 + a12 xi,2 + · · · + a1n xi,n = λi xi,1


a21 xi,1 + a22 xi,2 + · · · + a2n xi,n = λi xi,2
····································
an1 xi,1 + an2 xi,2 + · · · + ann xi,n = λi xi,n

Let

|xi,j| = max_k |xi,k| .

Select the j-th equation and divide by xi,j.

This gives

λi = aj1 (xi,1 / xi,j) + aj2 (xi,2 / xi,j) + · · · + ajj + · · · + ajn (xi,n / xi,j)

and

|λi| ≤ |aj1| + |aj2| + · · · + |ajj| + · · · + |ajn| ,

since |xi,k / xi,j| ≤ 1, k = 1, 2, . . . , n.
Since the row index j is not known in advance,

|λ| ≤ max_i [ Σ_{k=1}^{n} |aik| ] .

As the eigenvalues of AT and A are the same, the theorem is true for columns.

THEOREM 2 (Brauer’s theorem (A)). Let pj be the sum of the moduli of the elements along
the j th row excluding the diagonal element ‘ajj ’. Then, every eigenvalue of A lies inside or on the
boundary of at least one of the circles |λ − ajj | = pj , j = 1, 2, . . . , n.

Proof. From the equation obtained in the proof of the Gershgorin Theorem, we know

λi − ajj = Σ_{k=1, k≠j}^{n} ajk (xi,k / xi,j)

Therefore,

|λi − ajj| ≤ |aj1| + |aj2| + · · · + |aj,j−1| + |aj,j+1| + · · · + |ajn|
           = Σ_{i=1, i≠j}^{n} |aji| = pj

Therefore, all the eigenvalues of A lie inside or on the union of the above circles.

Note: These circles are called the Gershgorin circles and the bounds are called Gershgorin
bounds.
THEOREM 3 (Brauer’s theorem (B)). Every eigenvalue of a matrix A must lie in a Gershgorin
disk or circle corresponding to the columns of A.
The above theorems give us the Gershgorin circles or disks which correspond to the rows of A,
where A is the matrix whose eigenvalues we are trying to determine.
Taking the transpose of A, the columns of A become the rows of Aᵀ, and we know that the
eigenvalues of A and Aᵀ are the same.

Remarks
• Since A and Aᵀ have the same eigenvalues, all the eigenvalues lie in the union of the n
  circles

  |λ − ajj| ≤ Σ_{i=1, i≠j}^{n} |aji| ,   j = 1, 2, . . . , n.

• The row-based and column-based bounds are independent. Hence, all the eigenvalues of A must lie in the
  intersection of these bounds.

• If any of the Gershgorin circles is isolated, then it contains exactly one eigenvalue.

• If A is a real, symmetric matrix, i.e., aij = aji, then we obtain an interval which contains all
  the eigenvalues of A.
  A = [3 2 2; 2 5 2; 2 2 3]
  The eigenvalues lie in the interval |λ| ≤ max{7, 9, 7} (row sums or column sums), i.e., |λ| ≤ 9,
  i.e., −9 ≤ λ ≤ 9.
  The eigenvalues also lie in the union of the intervals
  |λ − 3| ≤ 4 or −1 ≤ λ ≤ 7
  |λ − 5| ≤ 4 or 1 ≤ λ ≤ 9
  |λ − 3| ≤ 4 or −1 ≤ λ ≤ 7
  The union of these intervals is −1 ≤ λ ≤ 9, so all the eigenvalues lie in −1 ≤ λ ≤ 9.
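
These disks are easy to generate programmatically; a small Python/NumPy sketch for the row-wise Gershgorin disks, applied here to the symmetric matrix above, is the following (the function name is an illustrative choice).

import numpy as np

def gershgorin_disks(A):
    """Return a list of (centre, radius) pairs, one disk per row of A."""
    A = np.asarray(A, dtype=float)
    centres = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centres)
    return list(zip(centres, radii))

A = np.array([[3., 2., 2.], [2., 5., 2.], [2., 2., 3.]])
print(gershgorin_disks(A))        # [(3, 4), (5, 4), (3, 4)] -> eigenvalues in [-1, 9]
print(np.linalg.eigvalsh(A))      # check: all eigenvalues do lie in [-1, 9]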

1 Problems
Example 1.1. A = [1 2; 1 −1]

In this case, we can compute the eigenvalues: |A − λI| = 0, i.e.,

| 1 − λ      2     |
| 1        −1 − λ  |   = 0.

This implies −(1 − λ)(1 + λ) − 2 = 0  =⇒  −1 + λ² − 2 = 0  =⇒  λ = ±√3.
From the rows of matrix A, |λ − 1| ≤ 2(say I) and |λ + 1| ≤ 1(say II).
(∗ denotes the location of the eigenvalues obtained through actual calculations).
For the matrix A2×2 , there are two disks in the complex plane, each centered on one of the diagonal
entries of the matrix A.
From the theorems, we know that every eigenvalue must lie within one of these circular disks.
However, it does not say that each disk has an eigenvalue.

Example 1.2. A = [1 −1; 2 −1]

The eigenvalues are obtained as: |A − λI| = 0, i.e.,

| 1 − λ     −1    |
| 2       −1 − λ  |   = 0.

This implies −(1 − λ)(1 + λ) + 2 = 0  =⇒  −1 + λ² + 2 = 0  =⇒  λ = ±i.
The circular disks are |λ − 1| ≤ 1 (say I) and |λ + 1| ≤ 2 (say II).
(∗ denotes the location of the eigenvalues obtained through actual calculations).
It can be noticed that all the eigenvalues lie within the circular disk |λ + 1| ≤ 2, defined by the second
row of the matrix.

 
Example 1.3. A = [5 0.6 0.1; −1 6 −0.1; 1 0 2]

Eigenvalues lie within the Gershgorin circles
|λ − 5| ≤ 0.7 (say I), |λ − 6| ≤ 1.1 (say II), |λ − 2| ≤ 1 (say III). The disk |λ − 2| ≤ 1 is disjoint from the
other two, so there is exactly one eigenvalue in |λ − 2| ≤ 1. There are two eigenvalues in I ∪ II.

All eigenvalues have positive real parts.

The eigenvalue in |λ − 2| ≤ 1 is a real eigenvalue, since complex eigenvalues must occur in
conjugate pairs.
We now apply Gershgorin theorem to AT .
Eigenvalues must lie within the Gershgorin circles. |λ − 5| ≤ 2, |λ − 6| ≤ 0.6, and |λ − 2| ≤
0.2 =⇒ 1.8 ≤ λ ≤ 2.2.
Which shows that the single real eigenvalue in |λ − 2| ≤ 0.2 is such that it satisfies 1.8 ≤ λ ≤ 2.2.
Gershgorin circle theorem says that the eigenvalues for a matrix are not far away from its
entries on the diagonal. If the matrix A is either upper or lower triangular, then the eigenvalues
are the diagonal entries.
 
Example 1.4. A = [4 2 3; −2 −5 8; 1 0 3]
Gershgorin circles
|λ − 4| ≤ 5(say I), |λ + 5| ≤ 10(say II), |λ − 3| ≤ 1(say III).
Eigenvalues lie in the union of these circles.

 
Example 1.5. A = [15 −3 1; 1 −2 1; 1 6 1]
The eigenvalues of A lie within the union of three disks |z − 15| ≤ 4(say I), |z + 2| ≤ 2(say II),
|z − 1| ≤ 7(say III).
The circle centered on z = 15 is entirely disjoint from circles II & III. Therefore, it must

contain one eigenvalue and the other two eigenvalues lie in the union of circles II and III.
 
Example 1.6. A = [4 1 0; 1 0 −1; 1 1 −4]
Gershgorin circles are |λ − 4| ≤ 1(say I), |λ| ≤ 2(say II), |λ + 4| ≤ 2(say III).

Circle I is disjoint from the circles II and III, so there must lie a single eigenvalue in circle
I. Since the coefficients of the characteristic polynomial f(λ) = |A − λI| are real, the complex
eigenvalues, if they occur at all, occur in conjugate pairs. Therefore there is a real eigenvalue in
the interval [3, 5].
II and III touch at (−2, 0). A similar argument as above shows that the eigenvalues in II
and III must be real.
λ = −2 is not an eigenvalue, since

|A − (−2)I| = | 6 1 0 ; 1 2 −1 ; 1 1 −2 | = −17 ≠ 0.

There is one eigenvalue in [−6, −2) and one in (−2, 2].
Therefore, A has one eigenvalue in each of the intervals [-6, -2), (-2, 2], [3, 5].
 
Example 1.7. Consider the n × n tridiagonal matrix

A = [4 1 0 · · · 0; 1 4 1 0 · · · 0; 0 1 4 1 0 · · · 0; · · · ; 0 · · · 0 1 4].

Show that A⁻¹ has its eigenvalues in [1/6, 1/2].

A is symmetric. Therefore all eigenvalues of A are real.
The Gershgorin circles are |λ − 4| ≤ 1 or |λ − 4| ≤ 2. Therefore the eigenvalues all lie in [2, 6].
The eigenvalues of A⁻¹ are the reciprocals of those of A; hence 1/6 ≤ µ ≤ 1/2 for every eigenvalue
µ of A⁻¹.
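
A quick numerical check of this exercise in Python/NumPy, for an assumed illustrative size n = 8 (the conclusion is independent of n):

import numpy as np

n = 8                                                    # assumed illustrative size
A = 4 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)     # tridiagonal [1, 4, 1] matrix
mu = np.linalg.eigvalsh(np.linalg.inv(A))
print(mu.min(), mu.max())    # all eigenvalues of A^{-1} fall inside [1/6, 1/2]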

Numerical Analysis
Professor R. Usha
Department of Mathematics
Indian Institute of Technology Madras
Lecture No 43
Solution of Linear Systems of Equations-6

Error Analysis-1

So we consider the error analysis for direct methods. So let us first give some examples to
show that small errors can produce large deviations in the solution; so let us consider the
following example.

(Refer Slide Time: 0:40)

Suppose say I am given a system of equations of the form 2 x plus y equal to 2; 2 x plus 1.01
y equal to 2.01; it is immediate that this has the solution x equal to half and y equal to 1. So
you can substitute x as half and it will give you 1, y is 1; so this equation is satisfied. x is half,
so it is 1 plus y is 1, so 1.01, that gives you 2.01, so half, 1 is the exact solution of the system
of two equations that we have written down. If suppose I consider the following system
namely, the first equation is as it is, I observe that there is a small change in the coefficient of
x in the second equation, there is a small change in the second component of the right-hand
side vector. I would like to see what is the solution of this system, and I observe that this
system has the solution given by x equal to 5 and y equal to minus 8.

So, small changes in either the coefficient matrix or on the right-hand side vector yield very
large deviations in the solution of the system. In that case, we say that the system is ill-
conditioned, namely when there are small deviations in the input resulting in large deviations

in the final solution, we call the system to be an ill-conditioned system; let us consider
another example.

(Refer Slide Time: 3:15)

Suppose say the matrix A is minus 73, 78, 24; 92, 66, 25; minus 80, 37, 10. And if I compute
the determinant of A, it turns out to be 1. There is a small change in this entry, say it is 92.01,
the rest of the entries are as they are, there is no change. And if I compute the determinant of
this, it turns out to be 2.08; this change is very very small, 92 has been changed to 92.01, but
the determinant value has changed from 1 to 2.08. Let us take the case when this entry 78 is
changed to 78.01, the other entries remain as they are, and if I compute the determinant, then
it turns out to be minus 28.20.

And if I consider say the determinant with entries as minus 73, 78 and 24; 92, 66, 25; minus
80, 37, say I make a small change in this entry, then you observe that the determinant turns
out to be minus 118.94. We have made only small changes in one of the entries in any one of
the 3 rows, in each of these cases that we have considered. But we observe that the value of
the determinant A is such that there are large deviations from the value namely determinant A
equal to 1 for the given matrix A. So small changes in any one of the entries in the matrix A
which is the coefficient matrix for a system has resulted in large deviations in the determinant
value, and therefore we say that this matrix is ill-conditioned.

(Refer Slide Time: 7:01)

And therefore if this matrix has been the coefficient matrix of a system of equations A x
equal to b if small changes occur due to round of errors in anyone of its entries, then it is
going to give large deviations in the solution of that system and therefore the system is going
to be an ill-conditioned system. Let us consider another example, where I solve the system
whose coefficient matrix has entries given by these and I have to solve for the unknowns x, y,
z, w; so that the right-hand side vector is 32, 23, 33, 31 and I immediately observe that my
solution is x equal to 1, y equal to 1, z equal to 1 and w equal to 1. You can simply verify, 17,
25, 32; 12, 18, 23; 14, 23, 33; 12, 21 plus 10 equal to 31; so these are the solutions of this
system of equations.

Now I would like to make a small change on the right-hand side vector; say due to some
round- off errors, the right-hand side vector is say changed; so I shall introduce those changes
here. Suppose the right-hand side vector changes to 32.1, then 23.9 or 22.9, 32.9 and 31.1
right. So the first and the last entries, some increase is incorporated and in the second and
third entries, a small decrease is introduced, so that all the entries in the right-hand side vector
are changed.

And if you solve the resulting system, you will see that x turns out to be 6, y is minus 7.2 and
z is 2.9 and w turns out to be minus 0.1 and you observe such large deviation in the solution
of the system for very small changes on the entries in the right-hand side vector and therefore
the system is again an ill-conditioned system because small changes result in a large
deviation in the output. And therefore your final solution cannot be a reliable solution, why

does it happen? There are round-off errors which are incorporated while carrying out the
steps which are involved in the methods that we have discussed namely, in the direct methods
that we have discussed. And due to these errors, the final solution turns out to be one which is
such that there are large deviations in the solution from the actual solution and therefore, the
system is an ill conditioned system.

So we must know a priori whether the system is an ill-conditioned system, alright, so we
must have a knowledge of the magnitude of the error that is being incorporated at each step
of our computation and that is why we need to perform the error analysis for direct methods
and we shall see how we can obtain the magnitude of the error that is incorporated at each
step from the discussion which follows. So to discuss the errors in the numerical methods
involving vectors, it is useful to introduce the concept of the norm of a vector, so let
us define what we mean by a norm of a vector.

(Refer Slide Time: 11:13)

The norm that we are going to introduce generalizes the notion of magnitude say of r of a real
number or a complex number, so the norm is a notion which realises the magnitude of a real
number or a complex number. When we consider norm, we consider norm of a vector, so it
gives you the length or the magnitude of that vector so let us define what we mean by a norm
of a vector. So we introduce what are called vector norms, so let us call the n dimensional
Euclidean space as V, then on this set V, we introduce a norm, so a norm is a function from V
onto the set of nonnegative reals that obeys the following postulates. So it satisfies the
postulate that norm x is greater than 0, if x is different from 0; where does x belong to? x
belongs to Rn.

So if n is 3, then when we say x belongs to R3, then x has components x 1, x 2, x 3, so x is a
vector in the 3-dimensional Euclidean space; that is what we mean. So here x belongs to Rn,
so x will have components x 1, x 2, etc…, x n, so it has n components or it is an n -tuple. So
it is a point in the n dimensional Euclidean space, what does it satisfy? It satisfies the
conditions that norm of x is positive if x is different from 0. Secondly, norm lambda x is
going to be mod lambda into norm x, if lambda is a real number and x is in Rn and thirdly,
norm(x plus y) is less than or equal to norm x plus norm y.

(Refer Slide Time: 15:19)

So as we remarked, norm x generalises the notion of absolute value for a real or a complex
number. So this nonnegative quantity norm x is positive, the nonnegative quantity norm x,
where x belongs to Rn is a measure of the length of this vector x or it is the magnitude of this
vector x. The most commonly used norms are as follows; so one can introduce the Euclidean
norm which we denote by norm x suffix 2, it is just a notation. What does that mean? It is
{[Sigma i equal to 1 to n] [x i square]} whole to the power of half, and what is x? x has
components x 1, x 2, etc…, x n.

I take a point having components x 1, x 2, x 3 and then measure this length, which I call as r,
then it is nothing but square root of [(x 1 minus 0) the whole square plus (x 2 minus 0) the
whole square plus (x 3 minus 0) the whole square], namely, root of [x 1 square plus x 2
square plus x 3 square] that represents the length of this vector or the distance of the point P
from the origin. This is extended, you consider all the components of the vector, square them,
add them, take the square root and that is denoted by norm x 2. So this represents the length
of the vector x in Rn. Another commonly used norm is the l infinity norm, so you denote that

by norm x infinity what is it? It is maximum of modulus of x i for i lying between 1 and n. So
what is norm? Norm is a function from V that is Rn on the set of nonnegative real numbers,
so you will end up with a real number, that is the norm of x.

(Refer Slide Time: 19:05)

Take an x in Rn then norm x is a real number which is a nonnegative real number. And
therefore, I must have a number here when I write down norm x infinity, what is it, earlier I
took it to be the distance of the point P from the origin where the point P is in Rn. Now I take
the norm as follows, what is it, it is maximum of the absolute value of the x i, so I look at the
vector x which is given to me and I look at all the components of vector x, take the absolute
value of each of these components, from among them, I see which is the maximum, I take
that and I call that as norm x infinity. If I do that, then I have defined l infinity norm on V.
Thirdly I have what is called l 1 norm, I denote it by norm x 1 and what is it? It is [Sigma i
equal to 1 to n] [modulus of x i].

So in this case what do I do? Given a vector x in Rn, I look at the components of that vector
and consider the absolute value of each of these components, I add them all; that is what I
have done [Sigma i equal to 1 to n][ mod x i], so I have a real number and this is what I call
as norm x 1; so given Rn and vectors x belonging to Rn, I can use any one of these
definitions of norm on Rn and find out the magnitude of that vector x. Is it clear now? So let
us try to understand the notion of norm by taking the following set and trying to sketch that
set in R2.

(Refer Slide Time: 19:58)

Set of all x such that x belongs to R2 and I want the set to be such that it should consist of all
those points x in R2, R 2 is a plane, so in R2 such that norm of x should be less than 1.
Suppose I want to sketch this set, I call this as S and I want to use the norm as the Euclidean
norm, what is the set? So if I use the Euclidean norm, then what do I want? I want all those
members x in the two-dimensional plane such that x is now having components x 1, x 2, so I
want to use norm x 2, by definition it is root of [x 1 square plus x 2 square]. So any typical x
which is given to me having components x 1, x 2, then it will belong to this set if its distance
from the origin is less than 1. So what all those points which will satisfy this criterion will all
lie within a unit disc in the 2-dimensional plane.

So all those points within this circle where suppose this is x 1 axis, this is x 2 axis, this point
will be (1, 0) and this is (minus 1, 0), this will be (0, 1) and this will be (0, minus 1). So this is
what the set describes in R2, if I make use of the Euclidean norm. Let us use the l infinity
norm for the same set, so if I use l infinity norm, then in the Euclidean plane which is a two-
dimensional plane, I would like to mark all those points x such that norm x infinity must be
less than 1. What is norm x infinity? It is the maximum of modulus of x i and that must be
less than 1, so what are those points, they are going to be points which lie within this
rectangle; so x 1, x 2, so this point is (minus 1, minus 1) and this point is going to be (1, 1);
so this will be (1, minus 1) and this will be (minus 1, 1).

So all those points within this rectangle will belong to this set S in which we use infinity
norm. What about the third case? What if you use the l 1 norm? So you must have all those

points in the two-dimensional plane such that the sum of the absolute value of the
components of each of these vectors that you take must be less than 1 and therefore, in the
two-dimensional plane, you will have a set which consists of all those points which satisfy
the condition, so this will be (1, 0) and this will be (0, 1). So if you use this norm and mark all
the points which lie in the set S, so these are the points which lie within this rhombus. I hope
you have understood how you compute norm 2 or norm infinity or norm 1; maybe we can
take some examples so that we have a clear understanding of the notion of these norms.

(Refer Slide Time: 24:05)

Suppose I give you the following vectors namely, x has components 4, 4, minus 4, 4 and
then y is another vector having components 0, 5, 5, 5 and say w is a vector 6, 0, 0, 0.
So I would like to find the norm of x, if I use each of these definitions, so I want to
use norm 1, norm 2 and finally norm infinity for the vector x which is given to me. What is
norm 1? If I use norm 1 to compute the l 1 norm for the vector x, it is, take the absolute value
of each of the components and sum it up, so it is 4 plus 4 plus modulus of minus 4 plus 4, that
is going to be 16. What about norm 2 for this vector, it is going to be square root of [4 square
plus 4 square plus 4 square plus 4 square] and therefore that is going to give me the value as
8. What about norm infinity, norm x infinity? It is the maximum of modulus of x i, so it is the
maximum of mod 4, mod 4, mod minus 4, mod 4 so that is going to be 4.

So let us do the same thing with the vector y, so norm 1 is going to be the sum of the absolute
values of its components, so it is going to be 5 plus 5 plus 5 so 15, norm 2 root of [5 square
plus 5 square plus 5 square] so root of 75, it is 25 plus 25 plus 25 and that is going to be 8.66.
What about norm infinity for y, it is going to be maximum of modulus of 0, mod 5, mod 5,

mod 5, so that is going to be 5. Let us now take the same definitions of the norms for the
vector w and compute each of these norms, so for w, it is going to be norm w1, so it is the
sum of the absolute values of the entries, so it is 6. Norm 2 that is square root of [6 square
plus 0 square, etc…,], so 6, and then norm infinity, maximum of the absolute values of the
entries which is again 6.
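
These norm computations can be checked with a few lines of Python/NumPy; this is only a small sketch using the vectors x and y above.

import numpy as np

x = np.array([4., 4., -4., 4.])
y = np.array([0., 5., 5., 5.])
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf))  # 16, 8, 4
print(np.linalg.norm(y, 1), np.linalg.norm(y, 2), np.linalg.norm(y, np.inf))  # 15, 8.66..., 5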

So now we know, given a vector x in Rn, for any value of n, we know how to compute the
norm of that vector with respect to either the Euclidean norm or l 1 norm or the l infinity
norm. So when we solve the system of equations with direct methods, we come across a
matrix also namely, the system is given by A x equal to b, where x and b are vectors and A is
a square matrix. So we should also have some knowledge about the matrix norm, so let us
introduce the notion of matrix norm that is associated with the vector norm.

Numerical Analysis
Professor R. Usha
Department of Mathematics
Indian Institute of Technology Madras
Lecture No 44
Solution of Linear Systems of Equations - 7

Error Analysis -2

(Refer Slide Time: 00:20)

So we introduce matrix norms; so the matrix norm, we denote by norm A is a nonnegative


number which satisfies the following properties. Namely, norm A is greater than 0, if A is a
nonzero matrix and norm of zero matrix is 0. Secondly, norm of c into A, where c is a scalar
is modulus of c into norm A, for any arbitrary complex number c, and norm of (A plus B) is
less than or equal to norm A plus norm B and norm AB will be less than or equal to norm A
into norm B; so these are the properties which are satisfied by norm of a matrix A. So now let
us introduce the most commonly used matrix norm, say the Euclidean norm, denoted by norm
A and that is {[Sigma i, j running from 1 to n] [a ij square]} whole to the power of half.

(Refer Slide Time: 2:57)

How do you compute this? you are given a matrix A whose entries are say, I take a 3 by 3
matrix or a 2 by 2 matrix: a 11, a 12, a 21, a 22; the Euclidean norm is computed by taking
square root of the sum of the squares of all the entries, so a 11 square, a 12 square, plus a 21
square plus a 22 square. So take all the entries in the matrix A, square them and add them,
take the square root of that number; that gives you norm A. Now secondly maximum norm is
very often used what is it? It is the maximum absolute row sum and it is defined as follows.
And that is equal to maximum over i, [Sigma over k] [modulus of a i k]; so let us try to
understand this norm.

What are we supposed to take? You compute absolute value of the entries, so that in the i-th
row, you have those entries to be a i1, a i2, etc…, a ik, etc…, a in, if A is an n cross n matrix.
Where did you take, we took these entries in the i-th row and then take the absolute values of
all these entries and then sum them up; so it is mod a i1 plus modulus of a i2 plus etc…, plus
modulus of a i n, so you sum up all those entries, this sum is for the i-th row, do that for all
the rows, namely start with the first row, compute the absolute value of the entries in the first
row and then go to the second row, compute the absolute values of the entries in the second
row, add them up and do that for all the n rows.

From among those n values that you have got, you pick that which is the maximum namely,
you have picked that which is the maximum absolute row sum and that is your norm A. I
hope it is clear, what should you do? you take typically i- th row, look at its entries, take the
absolute value of these entries, sum them up, do this for all the rows, if A is an n cross n

matrix, there are going to be n such rows; so do this for all the n rows, you have now n
values. From among these values pick that which is the maximum and that gives you the
maximum absolute row sum. One can also define the maximum norm as follows.

(Refer Slide Time: 6:52)

In this case, it is the maximum absolute column sum; so norm A is going to be norm A 1 and
that is maximum over k [Sigma over i] [mod a i k]; so we must have the matrix norm to be
such that it is consistent with the vector norm that we choose; so we have the following result
which can be used to appropriately take the matrix norm associated with a vector norm.

(Refer Slide Time: 7:49)

The results says, if the vector norm, norm infinity is defined by norm infinity of a vector x is
maximum of modulus of x i, for i lying between 1 and n, then the associated matrix norm
consistent with the vector norm is norm A infinity; so it is the maximum of j equal to 1 to n
[modulus of a i j], for i lying between 1 and n, so it is the maximum absolute row sum norm.
So if you use the vector norm as infinity norm, then the associated matrix norm consistent
with the vector norm is maximum absolute row sum norm and in this case, norm A x will be
less than or equal to norm A into norm x.

(Refer Slide Time: 9:55)

So let us now consider the following example, so let us determine the Euclidean and
maximum absolute row sum norm of the matrix A which is 1, 7, minus 4; 4, minus 4, 9; 12,
minus 1, 3. So we want Euclidean norm, so what is Euclidean norm? Given this matrix A, it
is norm A which is equal to square root of {[Sigma i, j taking values 1 to 3] of [modulus of a
ij square]}, so let us compute norm A square; so it is [1 square plus 7 square plus (modulus
of minus 4) the whole square plus 4 square plus again 4 square plus 9 square plus 12 square
plus 1 square plus 3 square], so that turns out to be 333 and therefore the Euclidean norm or
Euclidean matrix norm A is root of 333 which is 18.25.

Let us now compute the maximum absolute row sum norm. It is by definition, denoted by
norm A infinity and by definition it is [maximum over i] {[Sigma k equal to 1 to 3] [mod a i
k]}, so what should I do? I should take the first row, take the absolute values of the entries in
the first row, add them up and do that for all the three rows; from among them, pick that
which is the maximum. So this is going to be maximum of, what is the absolute row sum
here; Mod 1 plus mod 7 plus modulus of (minus 4) and so that is 12; 4 plus 4 plus 9; so 8

plus 9 that is 17; for the third row 12 plus 1 plus 3 and therefore it is 16, so I have three such
values from among them. I should choose that which is the largest, namely 17, so this gives
me maximum absolute row sum norm for the given matrix A.

So, let us use this notion of norms of a matrix and the given vector and try to perform error
analysis of the direct method; so let us work out the following example which is going to give
us the bound on the absolute error, if during our computation some amount of error has been
incorporated; so let us consider the following example.

(Refer Slide Time: 12:51)

So if suppose A inverse is perturbed to obtain a new matrix B, the solution x equal to A


inverse b is perturbed to become a new vector x tilde which is B into b, so the question is
how large is this latter perturbation in absolute and relative terms. If a small change is given
in the input, then how large is the deviation in the output is what the question is. So we are
given a system of equations A x equal to b, where A is an n cross n matrix and A is non-
singular, so that A inverse exists or in other words A is an invertible matrix. The question is,
if suppose in the matrix A inverse, you give a small change, we have already seen that.

(Refer Slide Time: 14:37)

Suppose I am given a 2 by 2 matrix 1, 2; 7, 5, this matrix is non-singular since determinant A


is going to be 5 minus 14, that is minus 9; so A is invertible. And I compute A inverse which
is (1 by determinant of A) into {adjoint of A}. And suppose I get the inverse matrix to be a,
b; c, d; now if I give a small change in anyone of these entries, so that the entries are say a
plus epsilon, b; c, d; so my A inverse which is this is perturbed, namely slightly changed
because of the change in one of the entries, so a has become say a plus epsilon, I continue
with the computation of getting the solution of this system and taking x to be equal to A
inverse b, actually my A inverse b should be such that it should take this as A inverse, but I
take this as my A inverse and work out the solution.

So the solution that I get is something different; so the question is how large is this difference
if the actual solution is x and the computed solution is say x tilde, what is the difference x
minus x tilde, what is its magnitude? Since it is a vector, what is the norm of this vector (x
minus x tilde); that is what the question is, so let us understand the question again. If A
inverse is perturbed, that is some slight change is given to any one of these entries to obtain a
new matrix B, so this is my new matrix B, the solution x is equal to A inverse b; so if I take
this as A inverse and compute my solution as A inverse b, that solution is perturbed to
become a new vector x tilde.

Why? I am going to use this as my A inverse and obtain the solution, so it is B times vector b;
so with this my solution is B b and I should have got this solution as A inverse b, so I have
got it as B b; so I have got a new vector x tilde. So how large is this change in absolute and

relative terms; so compute what is x minus x tilde; compute its magnitude, that is going to
give me in absolute terms, this is the magnitude that results or in relative terms, I am going to
get the norm of this vector (x minus x tilde by x), so this is what we are asked to work out. So
let us write down the solution of this problem.

(Refer Slide Time: 18:07)

So let us see what is norm (x minus x tilde), what is x, I write x as it is, but what is x tilde, it
is B times b and it is norm of x minus B into what is b, b is A x, because I am given the
system A x; so x minus BA into x; it is (I minus BA) into x. So I have something like a matrix
multiplied by a vector, where I is the identity matrix of the same order as that of A. So we have seen
that norm of a matrix A satisfies the property that norm A B is less than or equal to norm A
into norm B. So here I have a matrix into x and therefore this will be less than or equal to
norm of (I minus B A) multiplied by norm of x; so this is the amount of perturbation in x, that
is what we have.

So if x is disturbed and the resulting value that you get, resulting solution that you get is x
tilde, then the magnitude of that change is less than or equal to is this; namely, norm of (I
minus BA) into norm x. So this gives us the magnitude of perturbation in x; what do you
mean by perturbation in x? The change in x, the disturbance given to x; so you have in
absolute terms the magnitude of the change x, that this cannot be greater than this value; so
let us discuss the details when we want to give the result in terms of relative magnitude.

(Refer Slide Time: 20:36)

So I want norm x minus x tilde divided by x, what is it that I know? I know norm (x minus x
tilde) is less than or equal to norm of (I minus BA) into norm of x. I mean I divide throughout
by norm x; so, that will be less than or equal to norm of (I minus BA). So if the relative
perturbation is measured, then we have the relative change is such that it cannot exceed norm
(I minus B A), so we can appropriately choose the norm of a matrix that appears here and
then give the relative error due to the perturbation given in A inverse, which results in
perturbation in the actual solution.

So, these are very useful results, namely this gives you, I call this as star, star gives an upper
bound on the relative magnitude of the error and this ratio is taken as relative error between
the actual value and the computed value, namely between x and x tilde. Let us consider
another problem.

(Refer Slide Time: 22:33)

Suppose that the vector b is perturbed to obtain a vector b cap, so entries in the right-hand
side vector for a given system A x equal to b are slightly changed, so that the new right-hand
side vector is b cap; if x and x tilde satisfy A x equal to b and A x tilde equal to b cap, the
question is, by how much do x and x tilde differ and give this result in absolute and relative
terms. So the question is, get the magnitude of the absolute error and the magnitude of the
relative error, I hope it is clear; if a small change in the entries in the right-hand side matrix b
is given, so that you solve the system A x tilde equal to b cap; so this x changes to x tilde
because right-hand side vector has changed from b to b cap. So you obtain your solution as x
tilde whereas, the actual solution is denoted by x because you want to solve the system A x
equal to b.

(Refer Slide Time: 25:07)

So the question is compute what is the error x minus x tilde and get its magnitude, so norm x
minus x tilde, so that you have magnitude of the absolute error. Also compute the magnitude
of the relative error in this case, if there is a perturbation given to the right-hand side entry; so
let us compute the result. So we assume that A is invertible, so we compute the magnitude of
the absolute error which is norm x minus x tilde, what is x? x is A inverse b, what is x tilde?
It is A inverse into b cap, so it is norm A inverse into (b minus b cap); so by the property of
the matrix norms this is less than or equal to norm A inverse into norm (b minus b cap). So
this gives you the measure of perturbation in x, this gives a measure of perturbation in x or
the change in x, what is the magnitude of the deviation?

So we have computed what we want; namely in absolute terms, the result is available. Now
we want to get in relative terms, a measure of perturbation in x. So we start with computation
for relative perturbation; so we start with the result that we already have, namely, norm x
minus x tilde is less than or equal to norm A inverse into norm (b minus b cap), so this is
norm A inverse into I shall multiply by norm b and therefore I shall divide by norm, so I have
not altered anything in this step, I have multiplied and divided by the same quantity. So this
will be equal to norm A inverse, but what do I know about norm b, A x is b, so norm b is
norm A x, so into norm (b minus b cap) by norm b.

(Refer Slide Time: 30:20)

So I know norm A x is less than or equal to norm A into norm x, so I use that property and
right down the result. This multiplied by norm (b minus b cap) by norm b. So I divide
throughout by norm x; so I get norm (x minus x tilde) by norm x is less than or equal to norm
A inverse into norm A into norm (b minus b cap) by norm b. So I have a result which gives
me the magnitude of the relative change in x is less than or equal to norm A into norm A
inverse into the magnitude of the relative change in the right-hand side vector. I denote norm
A into norm A inverse as K of A, so K of A multiplied by norm (b minus b cap) by b, what is
K of A? K of A is norm A into norm A inverse, which I call as the Condition Number.

You observe from this result that if the condition number is very large, then the relative
change in x is going to be very large, in which case the system will be an ill conditioned
system. So we can immediately conclude about whether the system is an ill conditioned
system or a well conditioned system by computing the condition number and looking at its
magnitude; if the magnitude, namely of a condition number turns out to be very very large,
then this result tells you that the relative error in x, namely the output that you get as x tilde is
such that it is deviated very much from x, if the condition number is very very large, so that
the system becomes an ill conditioned system.

(Refer Slide Time: 30:31)

This number K of A which is equal to norm A into norm A inverse is called a condition
number and the result says that if the condition number is small, then a small change in b or a
small change or perturbation in b leads to a small perturbation in x. If the condition number is
small, then small change in the right-hand side vector will result in a small change in the
output, namely the solution vector for the problem. And what can you say about this K of A?
K of A has to be greater than or equal to 1 and this is always true, how do you show that that
K of A must be greater than or equal to 1 is always true. So let us take A A inverse what is it?
It is identity matrix.

So norm of I must be norm of A A inverse and that is less than or equal to norm A into norm
A inverse, but what is it and that is what we call as the condition number of the matrix A. So
condition number of the matrix A must be greater than or equal to norm of I, what is I, I is an
identity matrix so this will be 1 and therefore K of A, the condition number has to be greater
than or equal to 1. So this result is always true. If the condition number that you get here
for a matrix A is such that it is very close to 1, then a small change in the right-hand side
vector will result in a very small change in the solution vector and so the system will be well
conditioned system, so that the solution that you obtain at the end will be a reliable result.

On the other hand, if the condition number K of A is very large and it is very much deviated
from the value 1, say something like 40908 and so on, then your output will be such that it
will be very much deviated from the actual solution; so your solution cannot be a reliable
solution and the system is an ill conditioned system in this case.

(Refer Slide Time: 34:14)

Suppose say epsilon is positive and I give you a matrix A which has entries 1, 1 plus epsilon;
1 minus epsilon and 1, then compute A inverse which is (1 by epsilon square) into 1, minus 1
minus epsilon; minus 1 plus epsilon and 1; so let us compute norm A infinity. So it is the
maximum absolute row sum; so it is 1 plus 1 plus epsilon; so 2 plus epsilon and here it is 2
minus epsilon; the maximum, so I take that to be norm A infinity. Similarly we compute
norm A inverse infinity, so that is (1 by Epsilon square) into (2 plus epsilon) again, so
therefore K of A is norm A into norm A inverse.

I choose the infinity norm for computing norm A infinity and norm A inverse infinity, so it
turns out to be (2 plus epsilon) into (2 plus epsilon) by epsilon square; so it is greater than 4
divided by epsilon square and therefore if epsilon is less than or equal to 0.01, then K of A
which is greater than 4 by Epsilon square will be 4 by (0.01 the whole square); so will be
greater than 40,000. So in this case, what happens, in this case, a small relative perturbation
in b may induce a large relative perturbation in x, when you solve a system A x is equal to b and
therefore the system will be an ill conditioned system.

(Refer Slide Time: 36:43)

Suppose say matrix A is 1, 4, 9; 4, 9, 16; 9, 16, 25; use the maximum absolute row sum norm
and compute the condition number. So I require A inverse, how do you compute A inverse?
you already know, apply Gauss Jordan technique and compute A inverse and show that A
inverse is 31 by 8, minus 44 by 8, 17 by 8; minus 44 by 8, 56 by 8, minus 20 by 8; 17 by 8,
minus 20 by 8 and 7 by 8; so I first have to compute norm A infinity, which is, it is the
maximum absolute row sum norm, so I must take the maximum of the sum of the absolute
values of the entries in each of these rows; so it is going to be 14, 29 and 50, the maximum is
50.

Next I compute norm A inverse infinity, so that is going to be maximum of 31 plus 44 plus
17 by 8; so, that turns out to be 92 by 8. Similarly, I compute the absolute row sum values for
the second row and the third row and it turns out to be 15 and 44 by 8; so the maximum is 15.
So we evaluate the condition number of the given matrix A and that is norm A infinity into
norm A inverse infinity; so it is 50 into 15, that is 750 which is large as compared to a value
which is 1 or close to 1. If you are going to solve a system of equations with A as the
coefficient matrix then the corresponding system is going to be an ill conditioned system
because the condition number of A is large.
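
This condition-number computation can be checked with a short Python/NumPy sketch; the norms used are the maximum absolute row sum norms, as worked out above.

import numpy as np

A = np.array([[1., 4., 9.], [4., 9., 16.], [9., 16., 25.]])
A_inv = np.linalg.inv(A)
K = np.linalg.norm(A, np.inf) * np.linalg.norm(A_inv, np.inf)
print(K)    # 50 * 15 = 750, so a system with this coefficient matrix is ill-conditioned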

(Refer Slide Time: 39:40)

A matrix A with a large condition number is ill conditioned; so in this case; so, for an ill
conditioned matrix, there will be cases in which the solution of the system A x is equal to b
will be very sensitive to small changes in the vector b. Our second example shows this,
namely if you have a small perturbation in the right-hand side vector b and if the matrix A is
an ill conditioned matrix, because the condition number of the matrix is very large, then if
you seek the solution of the system A x is equal to b with A having condition number to be
very very large, then the solution of the system will be very sensitive to this small change in
the vector b. On the other hand, if the condition number is of moderate value, then the matrix
is said to be well conditioned.

The question now is, now that we have performed the error analysis, how are we going to
make use of this information in improving the accuracy of the solution that we have obtained
using direct methods. Is there a way to make use of the results that we have derived in the
error analysis so that we can improve the accuracy of the solution of the system of equation A
x equal b which are obtained using direct methods? Yes, it is possible and we shall try to
discuss these details in the next class.

Numerical Analysis.
Professor R. Usha.
Department of Mathematics.
Indian Institute of Technology, Madras.
Lecture-45.
Solution of Linear Systems of equations-8
Iterative Improvement Method, Iterative Methods-1.

Good morning everyone, in the last class we discussed the error analysis for the direct
methods and we wanted to know whether the ideas that we understood from the error
analysis, there is a possibility of improving the accuracy of the approximate solution that we
obtained for a system of equations Ax equal to b.

(Refer Slide Time: 0:58)

So when we solve the system of equations of the form Ax is equal to b by direct methods,
then the solution that we obtained is not the exact solution x but is an approximate solution,
which we denote by say x tilde due to the round-off errors which are incorporated at any
stage of our computation. So the question now is, can we improve this approximate solution
that we obtained using direct methods so that we obtain the solution of the system correct to
the desired degree of accuracy. So I compute A times x tilde, where x tilde is the approximate
solution that we obtained using any of the direct methods.

We expect that Ax tilde is very close to the right-hand side vector b because we are solving
the system Ax equal to b. And x tilde being an approximate solution, there will be a
difference between b and Ax tilde, so I compute this error. Recall b is a vector and A is a

matrix of order n cross n, x tilde is an n cross 1 vector, so this will be n cross 1 vector, b is an
n cross 1 vector, so b minus A x tilde is an n cross 1 vector. So I compute this vector and call
this as r, namely it is the residual. Now what is the error that has been incorporated? The
error, if I denote by e, is the difference between the exact solution and the approximate
solution.

So it is x minus x tilde and therefore A times this error is Ax minus Ax tilde. And so it is Ax
minus Ax tilde is what we have computed using our approximate solution and Ax is b, so it is
b minus Ax tilde. But what is b minus Ax tilde, that is the residual vector that we obtain. We
observe that e satisfies the equation A e is equal to r, so there is a relationship connecting the
error at one step and the residual at that particular step. So e at a particular step, which is a vector,
satisfies the equation Ae is equal to r. So we again have a system of equations in which the
unknown vector is e, why is it an unknown e, what is e, e is x minus x tilde.

x is not known to us, it is the exact solution, that is what we are trying to obtain, so e is an
unknown vector, and so we solve the system Ae is equal to r.
Do we know the right-hand side vector r at this step? Yes, what is it, it is b minus A into x
tilde. So we solve the system. How do we solve? Again by any of the direct methods that we
have learnt, and we then obtain this error vector e; so once we have e, then we use the fact that
the error is x minus x tilde. So x minus x tilde is going to be the e that we have computed and
therefore x is equal to e plus x tilde.

So we call this vector as x double tilde, we compute what is b minus Ax double tilde. If this is
such that the x double tilde that we computed is within the desired degree of accuracy that we
have demanded while computing our solution, then we can take x double tilde to be the
solution and stop our iterations. If not, what do we do, this is the residual at this step, we call
this as r tilde. And therefore we solve the new equation, namely A into e tilde is equal to r
tilde. So we compute what is e tilde, what is e tilde, e tilde is x minus x double tilde, which
was obtained as the solution at this step.

So again we have a system which satisfies the relationship A into e tilde is equal to r tilde. So
we solve for this unknown vector again by any of the direct methods that we know. And so
we obtain x minus x double tilde to be e tilde which has been obtained just now, so x is equal
to e tilde plus x double tilde. And we check whether the solution that we have obtained at this
step is correct to the desired degree of accuracy. If so, we stop our computation, if not, we
call this as x triple tilde and continue our computations till the desired degree of accuracy is
achieved.

So the method described here is known as the iterative refinement method, which helps us to
improve the accuracy of the approximate solution that we have obtained using any of the
direct methods. So this method helps us to obtain the solution of the given problem Ax equal
to b correct to the desired degree of accuracy. So let us take an example and I will indicate
the first few steps and you can complete the problem.
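
Before working through the numbers, the loop just described can be summarised in a short sketch. This is a minimal Python/NumPy version, not the lecture's own computation; np.linalg.solve simply stands in for whichever direct method produced the approximate solution.

import numpy as np

def iterative_refinement(A, b, x_approx, tol=1e-10, max_iter=10):
    # Improve an approximate solution x_approx of A x = b.
    x = np.asarray(x_approx, dtype=float).copy()
    for _ in range(max_iter):
        r = b - A @ x                   # residual r = b - A x~
        e = np.linalg.solve(A, r)       # solve A e = r for the error estimate
        x = x + e                       # corrected solution x~~ = x~ + e
        if np.linalg.norm(e) < tol:     # stop once the correction is negligible
            break
    return x

In practice the residual is computed in higher precision and the system A e = r is solved by reusing the factorisation already obtained for A, so each refinement step is cheap.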

(Refer Slide Time: 7:57)

So solve the system Ax is equal to b, given that A is a 3 cross 3 matrix having elements as
given here, multiplied by the vector x1, x2, x3 and that should be equal to 78, 51, 62, so I
have a system Ax is equal to b. Say apply Gauss elimination method with partial pivoting,
then say I obtain the solution x tilde as (1.1, 0.89 and 1.0) transpose. When I look at the
system I may immediately see that x1 equal to 1, x2 equal to 1 and x3 equal to 1 is the exact
solution, since 33 plus 25 plus 20 is 78; 20 plus 17 plus 14 is 51; 25 plus 20 plus 17 is 62 and
therefore (1, 1, 1) is the exact solution of the system.

When I apply Gauss elimination method with partial pivoting, I end up with a solution which
is this. So it is an approximate solution, so I must improve this approximate solution so that I
go closer and closer to the exact solution which is (1, 1, 1). So what should I do according to
the iterative refinement method? I must compute A into x tilde. So A is this matrix and x tilde
is this vector and so when you compute A into x tilde, that turns out to be vector 78.55, 51.13,
62.3. So I compute what is the residual vector r, what is the residual vector, it is b minus Ax
tilde. And my b is the right-hand side vector (78, 51, 62) minus Ax tilde is (78.55, 51.13,
62.3). So the residual vector is (minus 0.55, minus 0.13 and minus 0.3). So this is my residual
vector. So what should I do now?

I should solve the system A into e is equal to this residual vector r. You know what the matrix
A is, you also know the right-hand side vector r, so you solve for e, the error vector. So once
you get e, then your solution x is going to be this x tilde plus e. What is x tilde, x tilde has
components 1.1, 0.89 and 1. If suppose you compute your e to have components e1, e2, e3,
then add this vector to x tilde, so your new x which I call as x double tilde will be 1.1 plus e1,
0.89 plus e2, 1.0 plus e3. So this is your new solution, this is your new approximation to the
solution of the system Ax is equal to b.

Now check whether x minus x double tilde that you have computed is such that the
magnitude is less than the prescribed error tolerance. If so, you stop your iterations and you
say that x double tilde is your solution correct to the desired degree of accuracy. If not,
compute A into x double tilde and then compute b minus Ax double tilde, call it the residual
vector r double tilde and then continue your computations as described in the iterative
refinement method and obtain the solution and see whether that solution is correct to the
desired degree of accuracy.

So I leave the rest of the computations for you to complete, work out the details. I will give
you some more problems on this method in the assignment sheet. So that completes our
discussion on the direct methods, so we learnt some direct methods, namely the
decomposition methods and Gauss elimination method, Gauss Jordan method and also we
incorporated the pivoting strategy in these two methods. And then we performed some error
analysis in the previous class and then using those concepts we have now seen how we can
improve the accuracy of the approximate solution that we get using direct methods when we
solve a system of equations.

(Refer Slide Time: 14:15)

So now we move onto the iterative methods for solving the system of equations Ax equal to
b. So our goal now is to develop methods which are iterative in nature for solving a system of
equations Ax equal to b. So what do these methods do? These methods generate successive
iterates say x1, x2, etc..., xn and so on for the solution vector x such that this sequence say,
{xi} converges to the solution x. So where do we stop these iterations? We either perform a
prescribed number of iterations and stop our computations and take the solution obtained at
that step to be the solution of the system, or we work out the details in such a way that the
solution that we obtain at some stage satisfies the required accuracy demand.

So let us see some methods for solving a system of equations iteratively. One of the methods
for solving this system is known as Gauss-Jacobi method, the other one that we are going to
learn is called Gauss-Seidel method. So we shall describe these methods and see how one can
solve a system of equations iteratively. So we shall illustrate it by taking a simple example of
a system of two equations in two unknowns, the procedure is the same for a system of n
equations in n unknowns.

(Refer Slide Time: 15:45)

Suppose say, we are given the system of equations 7, minus 6; minus 8, 9; x1, x2 equal to 3,
minus 4. If I write out this equation, then it is 7 x1 minus 6 x2 is 3, minus 8 x1 plus 9 x2 is
minus 4. So the procedure for solving this system using Gauss-Jacobi method is as follows. I
take the first equation to be associated with the first unknown, namely x1 and I solve for x1
from the first equation. So solve for x1 from equation 1 and similarly I will associate the
second unknown x2 with the second equation and use the second equation and solve for x2
using the second equation. Then in that case I get x1 to be equal to 1 by 7, 3 plus 6 x2, then
x2 is 1 by 9 into minus 4 plus 8 x1. This is the first step that I do when I want to apply Gauss-
Jacobi method.

Now since it is an iterative method, I must start with an initial approximation. So initially I
shall take my solution, right, as say x1(0) and x2(0). And use that solution in the right-hand
side and compute what is the solution that I get for the unknown variable. So when I do that,
then I get x1 to be 1 by 7 into 3 plus 6 into x2(0) because my initial approximation for x2 is
x2 (0), this is what I guess for x2. So when I substitute for x2(0), I have a value for x1, so I
say that this is the solution that I obtain at the first step, namely x1(1). Now I go to the second
equation, then that is 1 by 9 into (minus 4) plus 8 times x1(0). So I have a number now which
gives me x2, so I call this as x2(1).

And therefore at the first step, I have my solution for the unknown vector x, namely it is
x1(1) and x2(1) and now I continue my computations. Before continuing my iterations, I
check whether the absolute value of the solution that I obtained for x1 at the first step minus
the solution that I assumed for x1 at the initial step is less than the prescribed error tolerance
and I do that also for the second variable, namely x2(1) minus x2(0), its absolute value is less
than epsilon. If both these conditions are satisfied, then I can stop my computations and take
x1(1) and x2(1) to be the solution of the system correct to the desired degree of accuracy.

(Refer Slide Time: 20:04)

If that does not happen, then what should I do? I must use the current solution that I have for
x1 and x2, namely x1(1), x2(1) and compute the new solution. What is the new solution, it is
going to be x1 but now I obtain it at the second step and that is going to be 1 by 7 into 3 plus
6 into the variable x2; x2 has the value x2(1), which we have computed at this step.
Similarly, x2 at the second step is 1 by 9 into (minus 4) plus 8 times x1(1). So I have the
solution available at the second step, namely x1(2), x2(2). So I again check whether the
absolute value of x1(2) minus x1(1) is less than prescribed error tolerance and x2(2) minus
x2(1) is less than the prescribed error tolerance. If so, I stop the computations and take x1(2),
x2(2) to give me the solution for x1 and x2 which solve the given system of equations, if not I
continue my iterations like this.

So what is it that we do in Gauss-Jacobi method? We make use of the solution obtained at the
previous step for the unknown variables, say x1, x2 and use them on the right-hand side
whenever they occur and compute the solution at the current step, that is what we do in the
Gauss-Jacobi method. Let us apply this method to the given system and arrive at the solution.
At the k th step, x1(k) is given by 1 by 7 into 3 plus 6 times x2, what is the value of x2 that I
use here, that is available to me in the previous step. So it is x2 (k minus1). What about x2(k),
it is 1 by 9 into (minus 4) plus 8 times x1 available at the (k minus 1) step. So this is valid for
k equal to 1, 2, 3, etc…,

(Refer Slide Time: 22:59)

when I start with an initial approximation for x1(0), x2(0); so this is what is known as Gauss-
Jacobi method. So let us apply this technique and solve this system. So we initially select
x1(0) and x2(0). So we simply set them to be 0. And then use that, so what is x1(1), it is
going to be 1 by 7, 3 plus 6 into x2(0), so it is 1 by 7, 3 plus 0, so 3 by 7. And what is x2(1),
it is 1 by 9 into (minus 4) plus 8 times x1(0). So it is 1 by 9 into (minus 4) plus 8 into 0 and
so it is (minus 4 by 9). And suppose say we are asked to do 50 iterations and produce the
solution, so we have listed in the table the solutions up to 50 iterations that we have
performed and we see that x1 at this step turns out to be 0.199777 and x2 that we have
obtained at the fiftieth iteration as (minus 0.266369).

So the second variable x2 has been determined correct to two
decimal accuracy and the solution is minus 0.27 correct up to 2 decimal accuracy. And the solution
for x1 is obtained in this case again correct to 2 decimal accuracy and that turns out to be
0.20. So at the end of 50 iterations, the solutions have been obtained correct to two decimal
places using Gauss-Jacobi method. So at this stage we shall see what is Gauss-Seidel method
for the same system of equations.

(Refer Slide Time: 25:25)

Let us look at the equations which are given to us, namely 7, minus 6; minus 8, 9 is the
coefficient matrix, the unknown vector is having 2 components x1, x2, the right-hand side
vector is 3, minus 4. So as before, when we write down the system of equations, it is 7 x1
minus 6 x2 is 3 and minus 8 x1 plus 9 x 2 is minus 4. As before, I associate my first equation
with the first unknown, namely x1 and I solve for x1 from the first equation and associate the
second equation with the second unknown x2 and solve for x2 from the second equation. So
what do we get, x1 is going to be 1 by 7 into 3 plus 6 x2 and x2 is 1 by 9 into (minus 4) plus
8 x 1.

Now this is an iterative procedure that I would like to describe, namely it is Gauss-Seidel
method. So as before I have to start with an initial approximation for these 2 variables x1 and
x2. So suppose you have chosen x1(0), x2(0) as the initial approximation, then the solution at
the first step, namely the first approximation for x1 is given by 1 by 7 into 3 plus 6 times
x2(0), because I have information about x2, I have guessed that the solution is x2(0). Now I
move over to the next iteration, so x2(1), which is 1 by 9 into (minus 4) plus 8 times, at this
stage I see that I have x1 appearing here, but x1 has already been obtained at this step,
namely, the first equation gives me x1.

So why not I use the currently available solution for x1. So I take x1(1) for x1 here and then
use that to compute the solution x2(1). I recall in Gauss-Jacobi method, we did not use the
currently available value for any unknown on the right-hand side. We, at any iteration, we
always used the solution that was available in the previous step in the previous iteration.
Whereas in Gauss-Seidel method, whenever it is possible for us to make use of the currently
available solution, it is made use of on the right-hand side to compute the solution at that
particular step from the left-hand side. So x2(1) is computed as 1 by 9 into (minus 4) plus 8
times, substitute for x1, the currently available value which you have computed just now.

And then now you have to check what is the difference between x1(1) minus x1(0) in
absolute value and x2(1) minus x2(0) in absolute value? If they satisfy the desired condition
for accuracy, then stop your computations and call x1(1), x2(1) as the solution of the
problem, otherwise continue your computations till the desired degree of accuracy is
obtained. So let us work out the details for this problem. So if I take the initial approximation
as (0, 0), I would like to compare my solutions with Gauss-Jacobi method. So at the first step
what will I get, I get 1 by 7 into 3 plus 6 times x2(0) is 0, so it is 3 by 7, that is what we got
for x1(1), using Gauss-Jacobi method also.

So now I work out x2(1) and I see that it is 1 by 9 into (minus 4) plus 8 times, although
initially I have approximated it to be 0, in the previous step, right, I would like to make use of
the solution that is available to me for x1 currently. Namely I already know that x1 at this
step which is 3 by 7, so I would like to make use of this and write down the solution as 1
by 9 into [(minus 4) plus 24 by 7], that is 1 by 9 into [(minus 28 plus 24) by 7], so it is minus
4 by 63. So you observe that the solution for x2 even at the first step, right, coincides with
the solution for x2 that was obtained at the second step using Gauss-Jacobi method.

And so we expect faster convergence using Gauss-Seidel method because it makes use of the
currently available solution which is computed at the previous step for an unknown on the
right-hand side to compute the solution at this particular step. So we carry out our
computations, say we perform 50 iterations and see what our solution is using Gauss-Seidel
method. So we present the details in the following table.

(Refer Slide Time: 31:40)

Let us now look at the table for the solutions that we have computed using Gauss-Seidel
method; we have performed 50 iterations. And if we still require, say, two decimal place
accuracy for our solution, we observe that this has already been obtained at the twentieth
iteration. So correct to two decimal places, the solution is 0.20 for x1 and minus 0.27 for x2,
whereas using Gauss-Jacobi method, we could achieve that accuracy only at the end of 50
iterations. So Gauss-Seidel method and Gauss-Jacobi method, both generate a sequence of
iterates for the solution of the system of equations Ax equal to b which converge to the actual
solution of the problem.

Gauss-Seidel method converges much faster than Gauss-Jacobi method, the reason being it
takes into account the currently available solution at any particular step for the unknowns
which have been already computed, as a result the method converges faster. So let us take
another example, where we have three equations in three unknowns and solve the system by
both the methods and compare the solutions.
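
The two iterations can also be written down compactly. The following is a minimal Python/NumPy sketch of both methods (assuming a square system with nonzero diagonal entries; the function names and stopping rule are choices made here, not the lecture's notation), applied to the two-by-two example above.

import numpy as np

def gauss_jacobi(A, b, x0, tol=1e-4, max_iter=200):
    x = np.asarray(x0, dtype=float).copy()
    for k in range(max_iter):
        x_new = np.empty_like(x)
        for i in range(len(b)):
            # use only the values from the previous iteration on the right-hand side
            x_new[i] = (b[i] - A[i, :] @ x + A[i, i] * x[i]) / A[i, i]
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

def gauss_seidel(A, b, x0, tol=1e-4, max_iter=200):
    x = np.asarray(x0, dtype=float).copy()
    for k in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            # values already updated in this sweep are used immediately
            x[i] = (b[i] - A[i, :] @ x + A[i, i] * x[i]) / A[i, i]
        if np.max(np.abs(x - x_old)) < tol:
            return x, k + 1
    return x, max_iter

A = np.array([[7.0, -6.0], [-8.0, 9.0]])
b = np.array([3.0, -4.0])
print(gauss_jacobi(A, b, np.zeros(2)))   # about (0.20, -0.27), together with the number of iterations used
print(gauss_seidel(A, b, np.zeros(2)))   # the same values, reached in noticeably fewer iterations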

Numerical Analysis.
Professor R. Usha.
Department of Mathematics.
Indian Institute of Technology, Madras.
Lecture-46.
Solution of Linear Systems of Equations-8.
Iterative Methods-2.
Matrix Eigenvalue Problems-1.
Power Method-1.

(Refer Slide Time: 0:23)

So we are given a system of equations, namely three equations in three unknowns. We
observe the following from this system: A is a diagonally dominant matrix. The matrix A is a
diagonally dominant matrix and it has the property that modulus of a i i is greater than
[Sigma j is equal to 1 to n, j not equal to i ] [modulus of a i j], for all values of i, which run
between 1 and n. Say for example if I take i as 1, then mod a11 is greater than [Sigma j is
equal to 1 to n], [modulus of a1 j], so j should not take the value 1. So what is it, it is modulus
of a12 plus modulus of a13. So, namely the coefficient of x1, which is a11, in absolute value
is bigger than the sum of the absolute values of the remaining coefficients which appear in the same
equation.

So that happens for i equal to1, that should happen for i equal to 2, i equal to 3 and so on. If
this property is satisfied for a matrix A, then we say that the matrix A is a diagonally
dominant matrix. If you are given a system of equations with the coefficient matrix as a
diagonally dominant matrix, then we have a result which says that if A is a diagonally
dominant matrix in the system Ax is equal to b, then Gauss- Jacobi method and Gauss -Seidel
method converge to the solution for any arbitrary starting vector for the unknowns x.

And here we are given a system such that the coefficient matrix is a diagonally dominant
matrix. So we can start with any arbitrary initial vector x0, which is x1 0, x2 0, x30 and apply
either Gauss-Jacobi method or Gauss-Seidel method, then our successive iterates are
guaranteed to converge to the solution of the system. So we first write down the result and
then work out the solution for this problem.
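
A small helper for the strict row-dominance test in the definition above (a Python/NumPy sketch in the same style as the earlier ones; the function name is a choice made here):

import numpy as np

def is_strictly_diagonally_dominant(A):
    # |a_ii| > sum over j != i of |a_ij| for every row i
    A = np.abs(np.asarray(A, dtype=float))
    return bool(np.all(A.diagonal() > A.sum(axis=1) - A.diagonal()))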

(Refer Slide Time: 3:17)

So the result states and it is given by the following theorem. And it states that if A is a
diagonally dominant matrix, then Gauss- Jacobi method and Gauss- Seidel method converge
for any arbitrary starting vector to the solution of the system A x is equal to b. So the theorem
guarantees that the sequence of iterates generated by Gauss- Jacobi method and Gauss- Seidel
method for this problem will converge to the solution of the system Ax is equal to b. So let us
first write down these methods, what should we do, I must associate the first equation with
the first unknown.

And therefore x1 should be solved from the first equation and that gives you 1 by 10 into [9
minus 2 x2 minus x3]; x2 is 1 by 10, again, [minus 22 minus x1 plus x3]. Thirdly, x3 is 1 by
10 into [22 plus 2 x1 minus 3 x2]. So having written out the steps, if I want to apply Gauss -
Jacobi method, my solution at the k -th step for x1 will be obtained by using the solution for
(k minus 1) -th step for x2 and the solution for x3 at the (k minus 1)-th step. The solution for
x1 at the (k- th) step will be obtained by using the values at the (k minus 1)-th step for x2 and
x 3. Similarly x2 (k) will be given by 1 by 10 [minus 22 minus x1 at the (k minus1)-th step
plus x3 at the (k minus1) -th step].

And x3 (k) is 1 by 10 into [22 plus 2 times x1 at the (k minus 1)-th step minus 3 times x2 at the (k
minus 1)-th step]. And that is for the Gauss-Jacobi method. What is it for Gauss-Seidel
method; so let us write down the details. This will be the iterative procedure that I will use
when I make use of Gauss-Seidel method to generate the successive iterates. So, we shall
present the solutions for both these methods and see how the solutions look like.

(Refer Slide Time: 6:57)

So the solutions using Gauss -Jacobi method and Gauss -Seidel method are presented here.
We start with an initial approximation for x1, x2, x3 as 0. And apply Gauss-Jacobi method,
every time using the previously available values for the variables, and the values
at this step are computed. When we perform our computations, we see that at the end of the
seventh iteration the solution obtained at the sixth iteration and the seventh
iteration are the same for all the three variables. So I stop my computations and take the
solution as x1 equal to 1, x2 equal to minus 2 and x3 equal to 3. Now I perform the solution
using Gauss-Seidel method.

In this case, the solution at any step is computed using the currently available values for the
variables at that particular step. So when I do that, the solutions are given in this table. I again
start with the initial vector 0 for all the three unknowns x1, x2, x3 and I perform five
iterations and I observe that the solution turns out to be 1, minus 2 and 3 at the end of five
iterations for x1, x2, x3 respectively. So the sequence of iterates generated by Gauss-Seidel
method converges faster than the sequence of iterates generated by Gauss-Jacobi method.
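For reference, the coefficient matrix and right-hand side implied by the update formulas above appear to be the ones written below; this reconstruction is an assumption, since the slide itself is not reproduced in the transcript. Feeding them to the is_strictly_diagonally_dominant, gauss_jacobi and gauss_seidel sketches given earlier reproduces the behaviour just described.

A3 = np.array([[10.0,  2.0,  1.0],
               [ 1.0, 10.0, -1.0],
               [-2.0,  3.0, 10.0]])
b3 = np.array([9.0, -22.0, 22.0])

print(is_strictly_diagonally_dominant(A3))   # True
print(gauss_jacobi(A3, b3, np.zeros(3)))     # converges to about (1, -2, 3)
print(gauss_seidel(A3, b3, np.zeros(3)))     # the same solution, in fewer sweeps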

(Refer Slide Time: 9:14)

The question now is, will the sequence of iterates generated by both these methods
always converge to the solution of the system? So let us look at some examples and give an
answer to this question. Let us consider the following example. The system is given by x1
minus x2 minus 2 x3 equal to minus 5, x1 plus 2 x2 plus x3 equal to 7 and x1 plus 3 x2 minus
x3 equal to 2. We see that the system is not diagonally dominant. And therefore there is no
guarantee that the scheme will converge. The theorem that we stated said, if A a diagonally
dominant matrix, then the sequence of iterates generated by Gauss-Seidel method and Gauss-
Jacobi method will converge to the solution of the system Ax is equal to b, for any starting
initial vector.

So the conditions are sufficient and not necessary. And therefore in this case the matrix A,
which is the coefficient matrix is not diagonally dominant and therefore there is no guarantee
that the scheme will converge. At the same time, we cannot say that both the schemes will
diverge; so let us work out the details and see what we get starting with some initial vector. So, if
I start with an initial vector x1(0), x2(0), x3(0) equal to 0, then if I apply Gauss-Jacobi method,
that gives me at the end of ten iterations the solution for x1 as 302.125, x2 is
minus 106.188 and x3 at the tenth iteration turns out to be 217.375;
these are the values of x1, x2 and x3 at the end of 10 iterations.

Let us work out the same problem using Gauss- Seidel method. And in this case at the end of
ten iterations we see that the solution for x1 is 85,011.3 and that for x2 is -58,568.9 and that
for x3 is -90,697.4. So both these methods diverge from the true solution. Let us now
consider another example.

(Refer Slide Time: 12:32)

Suppose say the system is 4 x1 minus 2 x2 plus x3 is 12, 2 x1 plus 3 x2 minus x3 is 7 and 2 x1
minus 2 x2 plus 2 x3 is 8. So let us look at the system; we observe that the system is not
strictly diagonally dominant. So let us now try to work out starting from some initial vector
and using Gauss-Jacobi method and Gauss-Seidel method. So again I start with the initial vector
x10, x2 0, x30 to be 0. Then Gauss- Jacobi method gives me at the end of eighteen iterations,
the solution to be x1(18) is 3, x2(18) is 1 and x3 at the end of 18 iterations is 2. And Gauss-
Seidel method also gives me the solution at the end of fifteen iterations for x1, x2, x3 as 3, 1
and 2.

So we say that the Gauss- Seidel method and Gauss- Jacobi method converge for this
problem, because in this case the exact solution for the problem is, x1 is equal to 3, x2 is
equal to 1 and x3 is equal to 2. So just see Gauss Jacobi method and Gauss Seidel method,
both have converged to the exact solution. So they generate sequence of iterates which
converge to the exact solution of the problem and we observed that the matrix A is not a
strictly diagonally dominant Matrix. So the conditions given in the theorem are sufficient and
not necessary. So what is it that we infer from the result of the theorem and from the
examples that we have taken.

Given a system Ax is equal to b, if the coefficient matrix A is strictly diagonally dominant, then
the sequence of iterates generated by both the methods will converge to the solution of the
system Ax is equal to b starting with any initial vector. If on the other hand we are given a matrix A
which need not be a diagonally dominant matrix, the sequence of iterates generated by
these methods may either converge or diverge. So the conditions given in the theorem are
sufficient but not necessary. So when we are asked to solve a system of equations, we check
whether the given system is such that it has its coefficient matrix to be a diagonally dominant
matrix.

If suppose some exchange of equations can be made such that the coefficient matrix can be
changed into a diagonally dominant matrix, then you can apply both the methods and
generate a sequence of iterates which will converge to the solution of the system of equations.
If the matrix A that you are given is not strictly diagonally dominant, then nothing can be said
when you apply your methods such as Gauss-Jacobi method or Gauss-Seidel method because
the sequence of iterates may either converge or diverge for any initial starting vector. So this
completes our discussion on the iterative methods for solving the system of equations.

So let us quickly summarize what we have done in this section of algebraic system of
equations. So we said that using elementary operations we can convert a given matrix A to an
upper triangular matrix and when we convert it into an upper triangular matrix, then back
substitution method will enable us to solve the system immediately. So this was used when
we described Gauss elimination method. Then we said that we may have to divide a
particular equation by a very small quantity when we apply Gauss elimination method, in
which case the round off errors will become very large. In order to avoid that, as well as to
avoid the situation where we may have to divide by a quantity which is the pivot and the
pivot turns out to be 0 at that particular step,
then in such cases the partial pivoting techniques must be used in Gauss elimination method
so that we can avoid division by zero or division by a small quantity and the
solution can be obtained. So summarising the methods for solving a system of equations, we
said that we have two classes of methods, namely the direct methods and the iterative
methods. And the direct methods comprise of decomposition methods, Gauss elimination
method and Gauss- Jordan method, whereas the iterative methods that we have learnt in this
course are Gauss- Jacobi method and Gauss- Seidel method.

So given a system of equations, we can apply any one of these techniques, depending upon the problem,
and obtain the solution of the equations. And in both the cases we have a way
to improve our solution, namely we can perform the iterative improvement method in the
case of direct methods to get our solution to be as accurate as is required. And in the case of
iterative methods, we can perform the iterations as many times as is required until the solution is
obtained correct to the desired degree of accuracy. So we close this section on the solution of
system of algebraic equations and move on to the computation of eigenvalues and
eigenvectors of a given matrix A.

(Refer Slide Time: 20:06)

We consider matrix eigenvalue problems. So we are given a matrix A which is an n cross n
matrix and let lambda be a scalar. Our goal is to solve an equation of the form Ax is equal to
lambda x. We seek a nonzero vector x and a scalar lambda which satisfies this equation Ax is
equal to lambda x. The scalar lambda is called an eigenvalue of the matrix A and the nonzero
vector x which is associated with this eigenvalue lambda is called an eigenvector of the
matrix associated with the eigenvalue lambda. So solving this equation is equivalent to
solving an equation of the form (A minus lambda I) into x is equal to 0, which is a system of
homogeneous algebraic equations.

(Refer Slide Time: 22:21)

So the condition for a non-trivial solution of this system, let me call this as star, is
equivalent to requiring that the determinant of the coefficient matrix, namely (A minus
lambda I) is equal to 0. So if you find determinant of (A minus lambda I) and equate it to 0,
you will observe that you have an n-th degree polynomial in lambda whose roots are the
eigenvalues of the given matrix A. And once you determine these eigenvalues, you substitute
each of those eigenvalues here and then solve the system (A minus lambda I) into x is equal
to 0 and you look for a nontrivial solution of this system corresponding to each of the
eigenvalues which are roots of the equation: determinant of (A minus lambda I) equal to 0.

When you have done that, then you have solved the matrix eigenvalue problem. So let us
consider a simple example by taking A to be a 3 by 3 matrix. Suppose say A is 2, 0, 1; 5,
minus 1, 2; minus 3, 2, minus 5 by 4; I observe that A into x which is 1, 3, minus 4 is minus
2 times 1, 3, minus 4, so that this vector which has components 1, 3, minus 4 satisfies the
equation Ax is equal to lambda x. So lambda equal to minus 2 is an eigenvalue of this matrix
A. If x is an eigenvector associated with an eigenvalue lambda, so that Ax is equal to lambda
x, then if I multiply this vector by say a constant c, then I have A into cx is equal to lambda
into cx.

So if I call y to be cx, then I observe that Ay equal to lambda y and therefore y is an eigen
vector. Because it satisfies the equation, Ay is equal to lambda y. So a constant multiple of an
eigenvector of a matrix A associated with an eigenvalue lambda is also an eigenvector of the
matrix A. Now one can compute the eigenvalues and the corresponding eigenvectors by
requiring that determinant of (A minus lambda I) equal to 0, but the computations will be
very tedious when the order of the matrix is very large. And therefore we require some
numerical methods for computing the eigenvalues and eigenvectors of a given matrix A.

So we develop in this course a method known as power method which computes numerically
the most dominant eigenvalue of a given matrix A. So you may ask me why only the most
dominant eigenvalue and why not all the eigenvalues of a matrix. Of course numerical
methods are available for computing all the eigenvalues of a given matrix numerically for
special classes of matrices. In this course, we focus only on the method where we are able to
compute numerically the most dominant eigenvalue. The other methods, which will help us to
compute all the eigenvalues for special classes of matrices, are beyond the scope of this
course and so we focus only on the Power method which helps us to compute the most
dominant eigenvalue.

(Refer Slide Time: 26:49)

So we describe Power method now. Power method is used to compute numerically the most
dominant eigenvalue of a given matrix. And it also helps us to obtain the corresponding eigen
vector. So the method is based on the following assumption, namely the matrix A has a single
eigenvalue of maximum modulus and that there is a linearly independent set of eigenvectors
associated with the eigenvalues of a given matrix A. So, under this assumption, we compute
the most dominant eigenvalue of the given matrix.

(Refer Slide Time: 28:37)

So let us denote by lambda 1, lambda 2, etc .., lambda n, the eigenvalues of the given matrix
A and that they are distinct. And mod lambda 1 is greater than mod lambda 2, which is
greater than mod lambda 3, and so on, greater than mod lambda n. So lambda 1 is the most
dominant eigenvalue in magnitude. So Power method helps us to obtain this dominant
eigenvalue numerically. So if suppose we denote by V1, V2, etc., Vn, the corresponding
eigenvectors, what does that mean, it means A V1 is lambda 1 V1, A V2 is lambda 2 V2 or in
general A V i is lambda i Vi, for i is equal to 1, 2, 3 up to n.

Now if V is any vector, I use an arrow above V to indicate that it is a
vector, because we are solving a system Ax is equal to lambda x, so x is a vector. If A is an n
cross n matrix, x is an n cross 1 vector. So associated with lambda equal to lambda i, if we
determine x i, that xi is denoted by Vi, which is an eigenvector associated with the eigenvalue
lambda i, so it is a vector. So any vector V can be expressed as a linear combination of the
eigenvectors associated with the eigenvalues lambda 1, lambda 2, etc. So, any vector V will
be of the form C1 V1 plus C2 V2 plus C3 V3 and so on plus Cn into Vn.

(Refer Slide Time: 31:10)

And so I compute A times vector V. So C1 is a constant, so it is C1 into A V1 plus C2 into A
V2 plus C3 into A V3 plus etc… plus Cn into AVn. But what do we know about A V1, A V1
will be lambda 1 into V1, why, V1 is an eigenvector associated with the eigenvalue lambda 1.
So I substitute for A V2 as lambda 2 V2, A V3 as lambda 3 V3 and so on, A Vn as lambda n
Vn. And therefore this will be equal to lambda 1 times [C1 into V1 plus C2 into (lambda 2 by
lambda 1) into V2, C3 into (lambda 3 by lambda 1) into V3 and so on plus Cn into (lambda n
by lambda 1) into Vn].

And so this will give me AV. Now I operate with A on AV, that is I pre-multiply both sides of
this matrix equation by A. So that will give me A square V, so that will be lambda 1 into [C1
into A V1 plus C2 into (lambda 2 by lambda 1) into A V2 plus C3 (lambda 3 by lambda 1)
into A V3 plus etc… plus Cn into (lambda n by lambda 1) into AVn. So I again make use of
the fact that V1, V2, V3 are eigenvectors of matrix A associated with lambda 1, lambda 2,
etc., lambda n. So A V1 will be lambda 1 V1, then plus C2 into (lambda 2 by lambda 1) into
A V2 is lambda 2 V2 and so on plus Cn into (lambda n by lambda 1) into A Vn is lambda n
into Vn.

And now I remove the factor lambda 1, so I get (lambda 1) square into [C1 V1 plus C2 into
(lambda 2 by lambda 1) the whole square into V2. Because I have removed a factor lambda 1
from it, so this will be lambda 2 by lambda 1, I already have that factor, so it is (lambda 2 by
lambda 1) the whole square. So I remove lambda 1 from this; that will give me (lambda n by
lambda 1) the whole square into vector Vn. Now I see some pattern here, namely if I have A
into V, that involves lambda 1 times [C1 V1 plus C2 into (lambda 2 by lambda 1) into V2
and so on, Cn into (lambda n by lambda 1) into Vn].

(Refer Slide Time: 35:13)

If I have A square times vector V, then that is lambda 1 square into [C1 V1 plus C2 into this
plus etc… plus Cn into (lambda n by lambda 1) the whole square into Vn]. So what do you
think you will have when you compute A cube V, that is going to be( lambda 1 cube) into
[C1 V1 plus C2 into (lambda 2 by lambda 1) the whole cube into V2 plus etc… plus Cn into
(lambda n by lambda 1) the whole cube into Vn]. So you can continue to obtain every time
pre-multiplying the equation at this step by A and obtaining say at the n- th step (A to the
power of n) into V. So suppose we do that, then (A power n) V is equal to (lambda 1 power
n) into [C1 into V1 plus C2 into (lambda 2 by lambda 1) power n into V 2 plus etc… plus Cn
into (lambda n by lambda 1) to the power of n into Vn].

(Refer Slide Time: 36:15)

So at the end of n steps we have (A power n) V to be given by (lambda 1 power n) into this
sum of vectors. Let us apply it once again. What do we do, we pre-multiply this by A, so we
get [A power (n plus1)] into V and we pull out a factor lambda 1, so that will be (lambda 1
power (n plus1)) into the sum of these vectors within the bracket. Now let us take the limit as
n tending to infinity. Then as n tends to infinity, we observe that the right-hand side vector
tends to (lambda 1 power n) [C1 into V1]. What happens to the rest of the terms? (lambda 2
by lambda 1) in absolute value is less than 1.

So as n tends to infinity, a quantity which is less than 1 in magnitude raised to the power n tends to 0. So all these terms tend to 0 as
n tends to infinity and the limit of (A power n) into V as n tends to infinity is lambda 1 power
n into [C1 into V1]. Similarly, when I take the limit as n tends to infinity of [A power (n
plus1)] into V, then I will have (lambda 1 power (n plus1)) into C1 into V1 and rest of the
terms which appear within the bracket will all tend to 0 because we have (lambda 2 by
lambda 1) is less than 1 and (lambda n by lambda 1) is less than one in absolute value.

Now let us look at this vector, what did we get, C1 V1 plus C2 into this plus etc... plus Cn
into this, this vector tends to the vector C1 V1. Again the reason being in absolute value,
(lambda i by lambda 1) is less than 1 for i is equal to 2, 3, etc., n. So as n tends to infinity,
the vectors here are such that these terms will tend to 0 and we will end up with the vector C1
V1. So this gives the eigenvector associated with the eigenvalue lambda 1. Then how do you
compute the eigenvalue lambda 1 which is the most dominant eigenvalue? And that is
obtained as a ratio of the corresponding components of these two vectors, namely (A power
(n plus1) V) by (A to the power of n) V.

(Refer Slide Time: 39:27)

So that is (lambda 1 power (n plus1)) by (lambda 1 power n), so that will give you lambda 1
and so lambda 1 will be equal to limit as k tending to infinity of (A power (k plus1)) into
vector V by (A power k) into vector V; but take the m -th component of each of these vectors,
where m takes values 1, 2, 3, etc… n. What does it mean, take the first component of this
vector and take the first component of this vector and obtain the ratio of the first components
of each of these vectors. And then similarly work out the details by computing the ratios of
the n-th component of each of these vectors.

Lambda 1 is given by limit as k tending to infinity of the m-th component of the vector A
power (k plus 1) V by the m-th component of the vector (A power k) V, for m is equal to 1, 2,
3 up to n. When do we stop the iterations, we stop the iterations when the magnitudes of the
differences between the ratios that we obtain turn out to be less than the given error tolerance,
so that we obtain lambda 1 correct to the desired degree of accuracy and the vector C1 V1 to
which the sum of these vectors converge to as n tends to infinity will give us the
corresponding eigenvector.

(Refer Slide Time: 41:24)

So let us take an example and understand power method which helps us to determine
numerically the largest eigenvalue in magnitude of a given matrix A. So let us try to illustrate
power method which helps us to compute numerically largest in magnitude eigenvalue of a
given matrix A. So we need to control the round -off errors in our computation. So we take
care of this by doing the following. Namely we normalise the vector before multiplying by A.
You just look back and see what we did, we, at each step, we pre-multiplied the vector V by
A, here we multiplied AV by A and so on.

In order that we have control of round- off errors, before multiplying by A, we normalise the
vector V, that is we see to it that the largest element in that vector is unity. So let V0 be a
nonzero initial vector, we have to start with an initial vector V0 and generate a sequence of
iterates for V and which tends to the eigenvector associated with a given eigenvalue. So we
do this as follows. So we take AVk and call it y (k plus1), we start with the vector V, so AV
is obtained and that is what is mentioned here. Find at each step what y (k plus1) is, so y (k
plus1) will be AVk.

(Refer Slide Time: 44:15)

And now what is V(k plus1), V (k plus1) is y(k plus1) by m(k plus1), what is m(k plus1), it is
the largest element in magnitude in the vector y(k plus1). So compute A Vk, call it as y(k
plus1), remove the largest element in magnitude in that vector and divide y(k plus1) vector
by m(k plus 1) and call that as V(k plus1). So if you pre-multiply now by A, you have a
vector A V(k plus 1), that will be y(k plus 2); from that vector you remove numerically the
largest element, call it m(k plus 2). So divide your vector y(k plus 2) by m(k plus 2) and the
resulting vector is V(k plus 2). So you continue to perform your computations this way, this
will help you to control the round- off errors.

So finally what is lambda 1, lambda 1 will be the ratio of the components, the m -th
component of vector y (k plus1) to the m- th component of vector Vk as k tends to infinity.
And what is the eigenvector, eigenvector will be V (k plus 1) which is associated with the
eigenvalue lambda 1 that you computed. In fact you can see that as k tends to infinity, m (k
plus 1) gives you the eigenvalue lambda 1. So choose your initial vector V0 with all its
components to be equal to unity, provided you are not given any initial vector to start with in
your computations.

(Refer Slide Time: 44:58)

So start with an initial vector V0 which has components 1, 1, 1 etc. and then compute A V0,
A square V0 and so on and take the limit as n tends to infinity to obtain the ratio of the m-th
component of these vectors in your computations and that will give you the eigenvalue which
is the most dominant eigenvalue of the given matrix A. And the corresponding eigenvector
is given by vector V (k plus 1) as k tends to infinity and you have the required result at the
end of your computation, namely the largest eigenvalue in magnitude and the corresponding
eigen vector. So we shall illustrate this method in the next class by taking some examples.
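
A minimal Python/NumPy sketch of the normalised power iteration just described (the function name and the stopping rule are assumptions made here, not the lecture's notation):

import numpy as np

def power_method(A, v0=None, tol=1e-8, max_iter=1000):
    # Approximate the dominant (largest in magnitude) eigenvalue of A and an
    # associated eigenvector, scaling v so that its largest entry is 1.
    A = np.asarray(A, dtype=float)
    v = np.ones(A.shape[0]) if v0 is None else np.asarray(v0, dtype=float)
    m_old = 0.0
    for _ in range(max_iter):
        y = A @ v                          # y(k+1) = A v(k)
        m = y[np.argmax(np.abs(y))]        # entry of largest magnitude, kept with its sign
        v = y / m                          # v(k+1) = y(k+1) / m(k+1)
        if abs(m - m_old) < tol:           # m(k+1) tends to lambda 1
            break
        m_old = m
    return m, v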

Numerical Analysis.
Professor R. Usha.
Department of Mathematics.
Indian Institute of Technology, Madras.
Lecture-47.
Matrix Eigenvalue problems-2.
Power method-2.
Gerschgorin’s Theorem, Brauer’s Theorem.

(Refer Slide Time: 0:38)

Good morning everyone, in the last class we discussed Power method for computing
numerically the largest eigenvalue in magnitude of a given matrix A. So let us illustrate this
method by taking some examples. So we said that, if we are given a square matrix, say an n
cross n matrix, we start with an initial vector which we call as V0 and we find the vectors say
y(k plus 1) which are AVk. And the next vector V(k plus 1) will be y(k plus 1) by m(k plus 1)
where m (k plus 1) is the largest element in magnitude in y (k plus 1) vector. Then lambda 1
will be limit as k tending to infinity of the m- th component of vector y(k plus 1) by the m- th
component of the vector Vk.

(Refer Slide Time: 2:27)

And the vector V(k plus1) that we have obtained will give us the required eigenvector. So let
us now illustrate this procedure which is Power method for computing numerically the largest
eigenvalue of a given matrix A. So let us consider the given matrix A to be a 2 cross 2 matrix
having entries 3, minus 5 and minus 2, 4. We start with an initial vector V0, say (1, 1). So we
should find first vector y1, which is A times V0, so it is 3, minus 5; minus 2, 4 into (1, 1) and
that is the column vector having components minus 2, 2. Now I remove the largest element in
magnitude and write down the vector as minus 2 times (1, minus 1). So now I should define
V1, what is V1? That is y1 by the largest value of the element that I have factored out.

So that will give you 1, minus 1, so with this V1, I compute y2 vector which is A V1. 3,
minus 5; minus 2, 4 into V1 which is 1, minus 1 and that is 8, minus 6, I remove the element
which has largest magnitude and write down the vector as 1, minus 0.75. So it is 8 times
vector V2. So we compute y3 which is A V2, 3, minus 5; minus 2, 4 into V2 which is 1,
minus 0.75. That turns out to be 6.75, minus 5, so I remove 6.75 and the vector that I get is 1,
minus 0.7407, so it is 6.75 times vector V3. We compute y4 which is A V3, so 3, minus 5;
minus 2, 4 into V3 which is 1, minus 0.7407 and that turns out to be 6.7035, minus 4.9628, so
we remove 6.7035 and the vector is 1, minus 0.7403. So we have y4 to be 6.7035 times
vector V4. We continue this process and compute what is y5 and that is A V4.

So 3, minus 5; minus 2, 4 into vector V4 is 1, minus 0.7403 and that turns out to be 6.7015,
minus 4.9612, so I remove 6.7015 and we get 1, minus 0.7403. So we have 6.7015 times
vector V5 and we see that every time we perform the computations, the vector is such that the
components in the vector are coming to be closer to each other. So let us perform another
iteration and see. So compute y6, so vector y6 will be A into vector V5, so 3, minus 5; minus
2, 4 into vector V5 which is 1, minus 0.7403 and that turns out to be 6.7015, minus 4.9612, so
it is 6.7015 into 1, minus 0.7403. And so this will be 6.7015 into vector V6.

So at this stage, we observe that 6.7015 into vector V5 is the same as 6.7015 into vector V6
and therefore convergence has occurred and therefore we can stop our computations and
write down what the eigenvalue is and the corresponding eigenvectors. So we observe that
the eigenvalue lambda 1 which is numerically the largest eigenvalue of this matrix is 6.7015.
And what is the eigenvector? The corresponding eigenvector is given by 1, minus 0.7403. So
Power method helps us to compute numerically the largest eigenvalue or the most dominant
eigenvalue of a given matrix A numerically. And the procedure is illustrated by means of this
example.
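
As a check, the power_method sketch given earlier reproduces this computation; the exact dominant eigenvalue of this matrix is (7 plus the square root of 41) by 2, about 6.7016.

A = np.array([[3.0, -5.0], [-2.0, 4.0]])
lam, v = power_method(A, v0=np.array([1.0, 1.0]))
print(lam, v)   # about 6.7016 and (1, -0.7403), matching the iterations above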

(Refer Slide Time: 9:39)

So let us now take a 3 by 3 matrix A and then work out the details, so the problem is find the
largest eigenvalue in modulus and the corresponding eigenvector of the matrix A, which has
elements minus 15, 4, 3; 10, minus 12, 6; 20, minus 4, 2 and we are asked to obtain the
dominant eigenvalue of this matrix by Power method numerically. So we start with an initial
vector V0 having components 1, 1, 1. I compute the vector y1 first. What is it, it is equal to A
into V0, so A into vector 1, 1, 1 where A is this matrix. The entries in the product of these
two matrices are given by minus 8, 4, 18. So I have to define now my vector V1, which is
vector y1 by the largest element in the vector y1.

So that will give me minus 8 by 18, 4 by 18, 1 and that is minus 4 by 9, 2 by 9, 1. Now that I
know the vector V1, I compute y2, which is AV1. So, A into (minus 4 by 9, 2 by 9, 1) and
that turns out to be the vector 95 by 9, minus 10 by 9, minus 70 by 9. And I divide by the
largest element, namely 95 by 9, so if I remove that factor, I get a new vector which is V2. So
V2 turns out to be the vector 1, minus 2 by 19, minus 14 by 19. So we compute y3, y4 and so
on, so I leave these computations for you. And when it comes to obtaining the vector y7, it
turns out to be the vector -19.8674, 9.7072 and 19.8524 and therefore I compute vector V7 by
dividing vector y7 by the numerically largest element in that vector.

That gives me V7 to be minus 1.0, 0.4886, 0.9992. So we compute y8 which is A into V7 and
that gives 19.952, minus 9.868, minus 19.956. And so I compute vector V8 which is y8
vector by numerically the largest element in y8. And that turns out to be 0.9998, minus 0.494
and minus 1. And so we work out Y9 which is A times vector V8 and this turns out to be
19.973, 9.926 and 19.988. Suppose say the problem also says perform 9 steps using Power
method and write down the numerically largest eigenvalue of matrix A.

So we have performed nine steps and at this stage we want to stop our computations because
we are asked to do only nine steps of computations of Power method. And so we determine
the eigenvalue and the eigenvector. So what is going to be an approximation to the largest
eigenvalue in modulus? So it is modulus of lambda and that will be the m- th component of
vector y9 by the m- th component of vector V8. So I have to compute the ratio of the first
component of y9 and first component of V8. Then take the second component of y9 by the
second component of V8 and then get the ratio, the third component of y9 and the third
component of V8 and obtain what the value is.

So we get, for the first component of y9 is 19.973 divided by first component of V8 is
0.9998, the second component of y9 is 9.926 and that of V8 is minus 0.494, the third
component is 19.988 divided by minus 1. I observe that if I compute these ratios and take the
absolute value, then round it off to three digits, then I end up with mod lambda to be 20. So if
I compute this ratio, it is going to be very close to 19.966 etc, similarly this one and here it is
going to be minus 19.988, I observe that that is the largest, so in absolute value correct to
three digits and I round it off to three digits, mod lambda is close to 20 and therefore that is
the largest eigenvalue in magnitude of this matrix A.

So we have to get the corresponding eigenvector, so the corresponding eigenvector is going
to be the vector having components namely V8, that will give you 0.9998,
minus 0.494 and minus 1. So the numerically largest eigenvalue and the corresponding
eigenvector of the given matrix A have been computed using power method. I will give you
some problems, you can try to work them out at home and I will include some more problems
in the assignment sheet.

So find numerically the largest eigenvalue of the matrix 1, minus 3, 2; 4, 4, minus 1; 6, 3, 5
by Power method. You start with the vector V0 having components 1, 1, 1, I will give you the
final answer, you can try to see, compute sixteen steps of computations and show that at the
sixteenth step you get y16 to be 6.9998 into (0.3000, 0.0666, 1) and therefore eigenvalue, the
numerically largest eigenvalue is close to 7. So using Power method we have been able to
compute the most dominant eigenvalue. The question now is, is it possible to obtain the
smallest eigenvalue of this matrix A using Power method.

(Refer Slide Time: 19:22)

The answer is yes and let us see how we can obtain the smallest eigenvalue. So suppose that
we assume that the matrix A has largest eigenvalue, say lambda. Then what does that mean, it
means Ax is equal to lambda x and let us assume that A is invertible. Then when I pre-
multiply this by A inverse, then I get Ix to be equal to A inverse into x multiplied by lambda.
Therefore (1 by lambda) x will be equal to A inverse x. So when we say that Ax is equal to
lambda x, we say that lambda is an eigenvalue of the matrix A and the corresponding
eigenvector is x. Now this statement tells us that 1 by lambda is an eigenvalue of A inverse and
the corresponding eigenvector is the same vector x as for the eigenvalue lambda of the matrix A.

So this tells you that 1 by lambda is an eigenvalue of matrix A inverse with the same
eigenvector x. If I call the dominant eigenvalue of A inverse as mu, so that A inverse x is
equal to mu x, then it is clear that 1 by mu is the least eigenvalue of the matrix A. Why, what
have we shown? If lambda is an eigenvalue of matrix A, then A inverse x is 1 by
lambda into x, so that if mu is the dominant eigenvalue of A inverse, such that A inverse x is
equal to (mu) x, then 1 by mu will be the least eigenvalue of the matrix A. So if I ask you to
compute the smallest eigenvalue of a given matrix A, what is it that you should do?

Given the matrix A, compute its inverse, say by Gauss Jordan method and then compute its
dominant eigenvalue using Power method, take the reciprocal, that gives you the least
eigenvalue of the given matrix A. And therefore it is possible for you to compute the smallest
eigenvalue in magnitude for the given matrix A. So let us work out an example illustrating
this.

(Refer Slide Time: 23:19)

So if I consider matrix A to be given by minus 15, 4, 3; 10, minus 12, 6; 20, minus 4, 2 and I
am asked to compute the smallest eigenvalue of this matrix A. So I immediately compute the
inverse of the matrix A and it turns out to be 0, minus 0.02, 0.06; 0.1, minus 0.09, 0.12; 0.2,
0.02, 0.14. So I have to now compute the most dominant eigenvalue of this matrix A inverse.
So use Power method to compute the most dominant eigenvalue. Start with the vector say V0
which is 1, 1, 1 and compute y1 which is A inverse into V0 because you are computing the
dominant eigenvalue of the matrix A inverse.

So your y1 is A inverse into V0. So continue your computations and show that the dominant
eigenvalue of A inverse turns out to be say 0.2, that is 1 by 5. Show that the dominant
eigenvalue of A inverse is 1 by 5 and therefore the least eigenvalue or the smallest eigenvalue
in magnitude for the given matrix A is going to be 5. So given a matrix A, you know how to
get the largest eigenvalue by Power method and how to get the least eigenvalue by computing
the dominant eigenvalue of A inverse and taking its reciprocal and that will give you the
smallest eigenvalue of A.

So you now know using Power method how to compute both the largest eigenvalue and the
least eigenvalue. There is also another way of computing the smallest eigenvalue of a given matrix
A, even without computing what A inverse is. So let us see how this can be done.

(Refer Slide Time: 26:08)

So let lambda be an eigenvalue of A, so A is a given matrix, suppose lambda is an eigenvalue
of A, so that Ax is equal to lambda x. Let us find a matrix B where B is given by A minus mu times I,
mu is some constant. Then let us compute what is Bx, so Bx is (A minus mu I) into x. So it is
Ax - mu Ix, but what is Ax, Ax is lambda x. So this tells you it is (lambda minus mu) into x.
So what have we shown, matrix B which is A - mu I is such that it satisfies this equation
namely Bx equal to (lambda minus mu) times x. What are lambdas, lambdas are eigenvalues
of given matrix A.

And therefore (A minus mu I) has eigenvalues (lambda minus mu). And therefore the
eigenvalues of B are precisely the eigenvalues of A diminished by a constant mu. And what
about the eigenvector, the eigenvector is the same, namely if lambda is an eigenvalue of A
and the corresponding eigenvector is x, then (lambda minus mu) is an eigenvalue of B
having the same eigenvector x. So that is what we want to write, the eigenvalues of B are
precisely the eigenvalues of A diminished by a constant mu with the same eigenvector.

(Refer Slide Time: 29:06)

So therefore if you evaluate the largest eigenvalue of A and call it as lambda 1, then we can
determine the smallest eigenvalue of A by modifying the matrix. We are given a matrix A,
we would now like to consider another matrix which is different from A, so we want to do
some modification in the matrix A, what is it? Take it in the form (A minus lambda 1 times
I), take the matrix A and subtract the matrix which is lambda 1 times the identity matrix. By
what we have discussed, what can you say about the eigenvalues of this matrix, this matrix
has eigenvalues given by say lambda p dash, which is equal to lambda p minus lambda 1 for p is
equal to 1, 2, 3 up to n, where, what are lambda p’s? Lambda p’s are the eigenvalues of A.

Now I want lambda p to be the smallest. So for lambda p to be the smallest, lambda p dash
must be the largest in magnitude eigenvalue of which matrix, (A minus lambda 1 into I).
What are lambda p’s? They are the eigenvalues of the given matrix A. So (lambda p - lambda
1) for p taking values 1, 2, 3 up to n and if you call that as lambda p dash, then the lambda p
dash are such that they are the eigenvalues of (A minus lambda 1 into I). So in order that your
lambda p which is an eigenvalue of A to be the smallest, you must have lambda p dash to be
the largest eigenvalue in magnitude of (A minus lambda 1 I).

Can you compute this, yes, we know we can use Power method to obtain the largest or the
dominant eigenvalue in magnitude of a given matrix. What is the matrix that is given to you
now, the given matrix is say B which is (A minus lambda 1 into I). Do you know lambda 1,
yes, it is the dominant eigenvalue of the given matrix A. So compute a new matrix, call it B,
which is (A minus lambda 1 into I) and compute its dominant eigenvalue, so you are using
again Power method to compute its dominant eigenvalue. Then once you know the
dominant eigenvalue of (A minus lambda 1 into I), that eigenvalue plus lambda 1 will
give you lambda p which is the least eigenvalue.

(Refer Slide Time: 33:05)

So let us discuss about the corresponding eigenvector. So if suppose I call xp dash as the
corresponding eigenvector, then (A minus lambda 1 into I) into xp dash must be (lambda p
minus lambda 1) into xp dash because (lambda p minus lambda 1) are eigenvalues of (A
minus lambda 1 into I) and the corresponding eigenvector is xp dash. So what does this give?
What does this give? On the left we have A into xp dash minus lambda 1 I xp dash, and on
the right we have lambda p xp dash minus lambda 1 xp dash; the terms minus lambda 1 xp dash
cancel on both sides, so we get A xp dash is lambda p xp dash.

What does that give, this gives you that the eigenvector corresponding to the eigenvalue
lambda p which is the smallest eigenvalue of A is the same as the corresponding eigenvector
associated with the eigenvalue lambda p - lambda 1 of the matrix (A minus lambda 1 into I).
So you not only get the smallest eigenvalue in magnitude of the given matrix A but you also
get the corresponding eigenvector; what is it, it is xp dash, and what is this eigenvector, this is
the eigenvector which is associated with the eigenvalue lambda p minus lambda 1 of the
matrix A minus lambda 1 into I.

So you are able to get the smallest eigenvalue as well as the largest eigenvalue by using the
power method. So let us just recall the steps for computing the smallest eigenvalue of a given
matrix A. So you start with the matrix A, get its largest eigenvalue, call that as lambda 1, so it
is the largest eigenvalue in magnitude for the matrix A, call that as lambda 1. Now form a

matrix B which is A minus lambda 1 I, for this matrix, you compute the largest eigenvalue,
then the smallest eigenvalue of the given matrix A will be lambda p such that if lambda p
dash is the dominant eigenvalue of A minus lambda 1 I, then the two are related by lambda p
dash is lambda p minus lambda 1.

So you immediately know what is the smallest eigenvalue of the given matrix, what is the
associated eigenvector. You compute the corresponding eigenvector associated with lambda
p dash for the matrix A minus lambda 1 I, that eigenvector is the eigenvector associated with
the smallest eigenvalue lambda p of the matrix A. So that completes our discussion on how
we can get the smallest eigenvalue. So the method does not involve computation of inverse of
the matrix A, so it is much easier to work out because you again apply Power method to a
new matrix and therefore the computations are very simple.
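
As a quick illustration of these steps, here is a minimal Python sketch, assuming a small
symmetric test matrix; the matrix, the function name and the tolerance are illustrative and are
not taken from the lecture.

```python
import numpy as np

def power_method(A, num_iter=500, tol=1e-12):
    """Dominant (largest-magnitude) eigenvalue and eigenvector of A by the power method."""
    x = np.ones(A.shape[0])
    lam = 0.0
    for _ in range(num_iter):
        y = A @ x
        lam_new = y[np.argmax(np.abs(y))]   # component of largest magnitude
        x = y / lam_new                     # rescale so the largest component is 1
        if abs(lam_new - lam) < tol:
            return lam_new, x
        lam = lam_new
    return lam, x

# Illustrative test matrix (not from the lecture).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])

lam1, _ = power_method(A)                   # dominant eigenvalue of A
B = A - lam1 * np.eye(A.shape[0])           # shifted matrix B = A - lam1 * I
shift, v = power_method(B)                  # dominant eigenvalue of B, i.e. lam_p - lam1
lam_smallest = shift + lam1                 # smallest eigenvalue of A, with eigenvector v
print(lam_smallest, np.linalg.eigvalsh(A).min())   # compare with the true smallest eigenvalue
```

The same power_method routine is applied twice, once to A and once to the shifted matrix B;
no inverse is ever formed, which is the point of the method.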

Can we give locations of the eigenvalues for a given matrix A? The answer is yes, there are
results which clearly specify the regions within which, circular disks within which the
eigenvalues are located. So we look into these results and therefore we will be able to know
where the eigenvalues of a given matrix are located. The following result gives us a way to
find the location of eigenvalues of a given matrix A. What does the result say, it says that the
largest eigenvalue in modulus of a square matrix A cannot exceed the largest sum of the
moduli of the elements along any row or any column.

(Refer Slide Time: 38:23)

Suppose say lambda i is an eigenvalue of an n cross n matrix A; let us take A has entries
denoted by a i j, where i and j run from 1 to n. Suppose x i is the corresponding eigenvector
and having components x i1, x i2, etc. x in, then we know that A xi is (lambda i) x i. So if I
write out these n equations, then they are of this form: a11 into x i1 plus a12 into x i2 plus etc…
plus a1n into x in will be equal to, on the right-hand side, lambda i into x i1.
Similarly I take the second row of A and multiply by this column vector and equate that to
the second entry in lambda i xi. So I continue and write down the n-th equation. Now I choose
the index j such that modulus of x ij is the maximum of modulus of x ik over k.

So what does, once I get that, I take the j -th equation and divide the j -th equation by x ij. So
what will I get, I will get lambda i, if I take the second equation and divide by x i2, then on
the right-hand side, I have just lambda i. So here I select the j -th equation and divide the j -th

equation by x ij. So that will give you lambda i is equal to a j1, that will be the first term here,
a j1 x i1, that divided by x ij. Then the next term, a j2 xi2 divided by x ij and so on. Then I
will have a term a jj xij, divided by xij, so that simply gives me a jj plus etc…, plus a jn into x
in by x ij.

(Refer Slide Time: 41:11)

So if I take the absolute value on both sides, then mod lambda i will be less than or equal to
mod a j1 plus mod a j2 plus etc… plus mod a jj plus etc… plus mod a jn. Why is modulus of
x ik by x ij less than or equal to 1? Because x ij in absolute value is the maximum of the
moduli of the components x ik, and therefore modulus of x ik by x ij is less than or equal to 1,
and therefore mod lambda i is less than or equal to mod a j1 plus mod a j2, etc..., plus mod a jn.
What are these? These are the absolute values of the entries that appear in the j-th equation.
Since we do not know which row j this is, we bound it by the largest such row sum, and
therefore mod lambda is less than or equal to the maximum over all rows of [Sigma over k
equal to 1 to n] of the moduli of the entries in that row.

And that completes the result that the largest eigenvalue in modulus of a square matrix cannot
exceed the largest sum of the moduli of the elements along any row or any column. What
about the proof for any column? The matrix A and A transpose have the same eigenvalues
and therefore the proof of the theorem holds good also for columns, and hence Gerschgorin's
theorem is proved.

(Refer Slide Time: 41:54)

So we now have another result which gives the location of the eigenvalues of the given
matrix. It is given by Brauer's theorem and it states that if pj is the sum of the moduli of the
elements along the j-th row, excluding the diagonal element a jj, then it says every
eigenvalue of A lies inside or on the boundary of at least one of these circles, which are
described as mod lambda minus a jj equal to pj for j is equal to 1, 2, 3 up to n. So these are
circles which have a jj as centre and radius as pj. So there are n such circles and the result
says that every eigenvalue of the given matrix A lies inside or on the boundary of at least one
of these circles, that is what Brauer’s theorem states.

(Refer Slide Time: 43:17)

The proof is very simple, so let us work out the details. We already have seen in
Gerschgorin’s theorem this step. Namely lambda i was written as all these terms plus a jj
after dividing by x ij. So I can bring a jj here and write lambda i minus a jj is the sum of the
rest of the terms. And therefore in absolute value, mod lambda i minus a jj will be less than or
equal to mod a j1 plus mod a j2 plus etc..., plus mod a j(j minus 1) plus mod a j(j plus 1)
plus etc... plus mod a jn. Why? The reason is modulus of x ik by x ij is less than or equal to 1
for k running from 1, 2, 3 up to n. And what is this? This is the summation over k equal to 1
to n, k not equal to j, of modulus of a jk; you see that there is no term mod a jj, and that
is what is denoted by pj in the theorem.

So this is equal to pj, so what are your circles, your circles are such that they have their
centres at a jj for j is equal to1, 2, 3 up to n and the radii are given by pj for j is equal to 1, 2,
3 up to n. So you now know the circles within which the eigenvalues are located and the
result says every eigenvalue of A lies inside or on the boundary of at least one of these circles
which have their centres at a jj and radii pj. So let us now use Gerschgorin's theorem and
Brauer’s theorem to find out the location of eigenvalues of a given matrix A.

(Refer Slide Time: 46:08)

The circles that we have obtained are called Gerschgorin circles and the bound that we have
obtained is called the Gerschgorin’s bound. So let us try to determine Gerschgorin’s bounds
so that we know where all the eigenvalues of a given matrix are located. So we have some
important observations to make before we proceed with demonstrating these results. So let us
consider these observations. Now we know that A and A transpose have the same
eigenvalues and therefore all the eigenvalues lie in the union of the n circles, namely modulus
of lambda i minus a jj is less than or equal to [Sigma k equal to1 to n], k not equal to j,
[modulus of a kj], for j is equal to1, 2, 3 up to n.

Then secondly all these bounds are independent of each other and therefore all the
eigenvalues of A must lie in the intersection of these bounds. If any of the Gerschgorin
circles is isolated, then this circle contains exactly one eigenvalue and if you are given a real
symmetric matrix A such that a ij is a ji, then in this case we obtain an interval which
contains all the eigenvalues. So in case you are given a matrix A which is a real symmetric
matrix, then you can see that, you can obtain an interval which contains all the eigenvalues of
this real symmetric matrix.

So let us take an example, suppose say A is a matrix having entries 3, 2, 2; 2, 5, 2; 2, 2, 3. So


let us apply Gerschgorin’s theorem and Brauer’s theorem and determine the bounds. So all
the eigenvalues of this matrix satisfy mod lambda less than or equal to the maximum
of 7, 9 and 7; what is it, it is the row sum, or it is also the column sum. So take the first row,
take the entries in the first row and compute the absolute value of each of the entries and add
it or take the entries in the first column and compute the absolute value of the entries in the

first column and add that, say add them, then you get value to be 7. Do that for the second
row and the third row or the second column and the third column.

(Refer Slide Time: 51:40)

Which theorem have I applied, I have used Gerschgorin’s theorem. So I obtain mod lambda
to be less than or equal to 9 and therefore lambda is such that it lies between minus 9 and 9,
this is what my Gerschgorin’s theorem says. What does that mean? I have an interval
between minus 9 and 9 in which all the eigenvalues of the given matrix lie. Let us now use
Brauer’s theorem and see what the result is. The Brauer’s theorem tells us: the eigenvalues lie
in the union of the intervals given by modulus of lambda minus a 11; that is 3 less than or
equal to p1. What is p1? The sum of the absolute value of the entries in the first row, so it is
4.

The second condition is modulus of lambda minus a 22, which is 5 and that is less than or
equal to the sum of the absolute values of the other entries; so that is again 4. And from the
third row, we get modulus of lambda minus 3 is less than or equal to 4. Let us look at the first
inequality: it gives mod lambda minus 3 less than or equal to 4, that is, minus 1 less than or
equal to lambda less than or equal to 7. The second gives mod lambda minus 5 less than or
equal to 4, so minus 4 less than or equal to lambda minus 5 less than or equal to 4, that is,
1 less than or equal to lambda less than or equal to 9. The third inequality is mod lambda
minus 3 less than or equal to 4, which is the same as the first. So Brauer's theorem tells us
that every eigenvalue lies either between minus 1 and 7 or between 1 and 9, that is, in the
union of these two intervals.

Therefore we make use of the results obtained using Gerschgorin's theorem and using
Brauer's theorem. The first Brauer inequality gives the interval minus 1 to 7, and the second
gives the interval 1 to 9, so their union is the interval minus 1 to 9. The result from
Gerschgorin's theorem tells us that lambda lies between minus 9 and 9. So now we have to
make a conclusion about the interval within which all the eigenvalues lie. Since both bounds
hold at the same time, we take the intersection of the interval minus 1 to 9 with the interval
minus 9 to 9, and that intersection is minus 1 less than or equal to lambda less than or equal
to 9. So the eigenvalues of this given matrix lie in the interval minus 1 to 9.
So if the given matrix A is a real symmetric matrix, then you can find an interval within
which all the eigenvalues of the given matrix lie. And I will include more problems on
Gerschgorin’s bounds in the assignment sheets.
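
For the symmetric matrix just discussed, a short Python check of these bounds against the
actual eigenvalues might look like the following sketch; the call to numpy's eigvalsh is there
only for verification.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 5.0, 2.0],
              [2.0, 2.0, 3.0]])

# Gerschgorin bound: |lambda| <= largest row (or column) sum of absolute values.
gersh_bound = max(np.sum(np.abs(A), axis=1))          # max(7, 9, 7) = 9

# Brauer's theorem: each disc has centre a_jj and radius p_j = sum over k != j of |a_jk|.
centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)
intervals = list(zip(centres - radii, centres + radii))   # [(-1, 7), (1, 9), (-1, 7)]

eigs = np.linalg.eigvalsh(A)                              # actual eigenvalues
print("Gerschgorin bound:", gersh_bound)
print("Disc intervals:", intervals)
print("Eigenvalues:", eigs)    # about 1.0, 2.17 and 7.83, all inside [-1, 9]
```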

Numerical Analysis.
Professor R. Usha.
Department of Mathematics.
Indian Institute of Technology, Madras.
Lecture-48.
Practice Problems.

(Refer Slide Time: 0:45)

Good morning everyone, in the last class we learned how to compute Gerschgorin bounds
which help us to specify the location of the eigenvalues of a given matrix A. So now we shall
consider some examples and see how we can obtain the Gerschgorin bounds. Let us start with
a simple example of a 2 by 2 matrix A. So here the matrix A has entries 1, 2; 1, minus 1 and
we can actually compute the eigenvalues in this case. The eigenvalues are given by
determinant of (A minus lambda I) equal to 0. So if you compute the determinant of this
matrix and equate it to 0 and solve the corresponding characteristic polynomial for lambda,
then you get lambda to be plus or minus root 3.

So you know that one of the eigenvalues of matrix A is plus root 3 and the other eigenvalue is
minus root 3. Now let us apply the Gerschgorin’s theorem results and see what information
that we get about the location of these two eigenvalues. So the two eigenvalues plus or minus
root 3 have actually been marked on the real line, they are real eigenvalues, I have marked
them here by means of stars. This is lambda equal to minus root 3 and this is lambda equal to
plus root 3. Now let us use Gerschgorin’s theorem. So from the rows of matrix A, I consider
the first row, this is a11 and Gerschgorin’s theorem tells us that modulus of lambda minus a1
1 which is 1 is less than or equal to sum of the absolute values of the other entries in that row.

So here we have only one element which is 2, so the first result tells us that absolute value of
lambda minus 1 is less than or equal to 2, this gives us one circle and some information about
an eigenvalue of the matrix A can be obtained from here. Then taking the second row, we
have lambda minus of minus 1 which is a22; that is lambda plus 1 in absolute value must be
less than or equal to the sum of the absolute values of the other entries; here, in this case it is
1, so modulus of lambda plus 1 less than or equal to 1 is going to be the second Gerschgorin
circle that we have.

So we observe that for the matrix A we have considered, there are now 2 discs in the complex
plane and the centre of the discs are a11 and a 22, which appear in the matrix A, so each
centred on one of the diagonal entries of the matrix A.

(Refer Slide Time: 3:21)

And from the theorems we know that every eigenvalue must lie within one of these circles;
however, we do not have any guarantee that each disc contains an eigenvalue. That
information is not provided by Gerschgorin's theorem. So let us see another example.
another example. So in the second example, again I consider a 2 by 2 matrix; as before we
can consider the eigenvalues of this matrix A, they can be actually computed by solving the
characteristic polynomial given by determinant of A minus lambda I equal to 0 and we
observe that the eigenvalues are given by lambda equal to plus or minus i.

So those eigenvalues are actually plotted here, this is lambda equal to i and this is lambda
equal to minus i, they are marked by star here in the complex plane. Now let us see what the
Gerschgorin’s theorems tell us. The Gerschgorin’s theorems tell us that the circles within
which the eigenvalues plus or minus i lie are given by modulus of lambda minus a11, where
a11 is 1, is less than or equal to modulus of minus 1, that is 1. And modulus of lambda minus
of minus1, that is modulus of lambda plus 1 is less than or equal to 2 and that gives us the
second circle. So 1 and 2 are Gerschgorin’s circles and the eigenvalues i and minus i lie in the
union of these two circles is what is given by Gerschgorin’s theorem.

(Refer Slide Time: 5:27)

And we observe in this case that all the eigenvalues, namely plus i and minus i lie within the
circular disk given by modulus of lambda plus1 less than or equal to 2 and that is defined by
the second row. Right, let us now consider another example. Namely consider a 3 by 3 matrix
A and here Gerschgorin’s Circle theorems tell us that all the eigenvalues lie within the
Gerschgorin circles given by mod lambda minus 5 is less than equal to the sum of the
absolute values of the other two entries in the row and that is 0.7. Then modulus of lambda
minus 6 which is this is less than or equal to the sum of the absolute values of the other two
entries that is 1.1 and finally modulus of lambda minus 2 is less than equal to 1 plus 0 and
therefore we have the Gerschgorin’s circles as given by these.

So finally we see that the circle which has its centre at 2 is disjoint from the
other two circles, which clearly tells us that there is exactly one eigenvalue in this circle, which

is disjoint from the others. And there are two more eigenvalues, and these two eigenvalues lie
in the union of the other two circles which we have given. We also observe that since the
matrix is real, complex eigenvalues, if they occur, must occur in conjugate pairs, so a disjoint
circle that contains exactly one eigenvalue must contain a real eigenvalue.

(Refer Slide Time: 7:38)

And so this circle, the disjoint circle encloses one and only one eigenvalue which is a real
eigenvalue. Now to get more information, we apply Gerschgorin’s theorems to the transpose
of the matrix A, that is A transpose. Then we observe that eigenvalues must lie within the
Gerschgorin circles which are given by modulus of lambda minus 5 less than or equal to the
sum of the absolute values in the first column, that is 2 and modulus of lambda minus 6 is
less than or equal to the sum of the absolute values in the second column which is 0.6 and
finally if you use the third column of matrix A, that gives you mod lambda minus 2 is less
than or equal to 0.2.

Right. So you observe that this is the disjoint circle with centre at the point 2, and this result,
which is obtained from the columns of A, tells us that the single eigenvalue which lies in the
disjoint circle must be such that lambda must be greater than or equal to 1.8 and less than or
equal to 2.2 and so lambda lies in the interval 1.8 to 2.2. So it clearly gives you an interval
within which this single eigenvalue lies within that disjoint circle. So we are able to get the
location of the eigenvalues of a given matrix A by applying Gerschgorin circle theorem in

terms of rows and in terms of the columns of the matrix A or the rows of the corresponding
matrix which is A transpose.

(Refer Slide Time: 9:10)

So we now consider another example in which say A is again a 3 by 3 matrix. Immediately


we can write down the Gerschgorin circles, what are they? They are given by modulus of
lambda minus 4 less than or equal to 2 plus 3, which is 5; modulus of lambda minus of minus
5, that is lambda plus 5, less than or equal to modulus of minus 2 plus modulus of 8, which is
10; and modulus of lambda minus 3 less than or equal to modulus of 1 plus 0, that is 1. So these
are the three circles and the eigenvalues of this given matrix lie within the union of these
three circles. So you just plot these circles, because you know the centres of the circles are
going to be respectively 4, minus 5 and 3.

And the radii of the circles are going to be 5, 10 and 1 respectively, so plot the circles in the
complex plane and you know the eigenvalues of the given matrix lie within the union of these
three circles in this case.

(Refer Slide Time: 10:22)

So let us now take another example. So again A is a 3 by 3 matrix and it is clear that the
eigenvalues of the matrix A lie within the union of the circles mod Z minus 15 less than or
equal to 4, mod Z plus 2 less than or equal to 2 and mod Z minus 1 less than or equal to 7.
And the one with centre at 15 and radius equal to 4 is going to be a circle which is disjoint
from the other two circles. So as before we argue and conclude that the circle which is
disjoint from the other circles must contain one eigenvalue, and the other two eigenvalues
lie in the union of the other two circles, namely mod Z plus 2 less than or equal to 2 and mod Z
minus 1 less than or equal to 7.

(Refer Slide Time: 11:35)

So Gerschgorin's theorem is very useful in understanding the location of the
eigenvalues of a given matrix A. So let us look into another example. Suppose A is given by
this 3 by 3 matrix; we can write down the Gerschgorin circles: mod lambda minus 4 is less
than or equal to 1, mod lambda minus 0, that is mod lambda, less than or equal to 2, and mod
lambda plus 4 less than or equal to 2. So these are the circles which we have plotted here. And
we observe that circle 1 is disjoint from the other two circles and therefore it must contain a
single eigenvalue within it.

And we also observe that if we write down the characteristic polynomial, the coefficients of
the characteristic polynomial given by determinant of A minus lambda I equal to 0 are real,
and hence the complex eigenvalues, if they occur at all, can only occur in conjugate pairs.
Therefore there is a single real eigenvalue within this interval, namely 3 to 5,
because this is a disjoint circle, a single eigenvalue lies within it, and the radius of the circle
is 1, so that single eigenvalue is real and lies between 3 and 5.
So we know the interval within which this real eigenvalue lies.

(Refer Slide Time: 13:20)

And then the other two circles, so the other two eigenvalues lie in the union of these two
circles is what we know. But we observe that these two circles meet each other or touch each
other at the point minus 2. So we would like to see whether minus 2 is an eigenvalue of the
matrix A. How do you check whether lambda equal to lambda 0 is an eigenvalue of the
matrix A? The just have to see whether it is a root of the characteristic polynomial,
determinant of A minus lambda I equal to 0, that is what you have to check. So let us
compute determinant of A minus lambda I taking lambda to be minus 2, so if you write
down the determinant and expand and you observe that that turns out to be minus 17 which is
different from 0 and therefore the characteristic polynomial does not have lambda equal to
minus 2 as its root.

And therefore lambda equal to minus 2 is not an eigenvalue of the matrix A, is what we
conclude. What does it mean? We said earlier that the other two eigenvalues lie in the union of
those two circles, and the point where they touch each other, namely minus 2, is not an
eigenvalue. Therefore each of the other two eigenvalues must lie in one of these circles, and
that is the conclusion that we get from our discussion. So there is one eigenvalue in the
interval minus 6 to minus 2 and another eigenvalue in the interval minus 2 to 2. Why does that
happen? Let us look at the circles: this is a circle with centre at minus 4 and radius 2.
at minus 4 and radius 2.

And therefore it passes through the points minus 6 and minus 2, so one of the eigenvalues lies
between minus 6 and minus 2. And then you have another circle whose centre is the origin
and radius is 2 units, and there is an eigenvalue within this circle and it is a real eigenvalue
and therefore it lies between minus 2 and 2. So the matrix A is such that it has one real
eigenvalue between minus 6 and minus 2, another real eigenvalue between minus 2 and 2 and
third real eigenvalue between 3 and 5. And so we conclude that A has one eigenvalue in each
of these intervals minus 6 to minus 2, minus 2 to 2 and 3 to 5. So information about the
location of the eigenvalues can be obtained by writing out the Gerschgorin’s circles within
which the eigenvalues are located and arguing out depending upon whether these circles are
disjoint or the eigenvalues lie within the union of these circles and so on.

(Refer Slide Time: 16:14)

So now we have another problem where A is a matrix, say an n cross n matrix, having
entries given by these. Namely, along the diagonal you have 4 appearing throughout, just
above the diagonal the entries have value 1, just below the diagonal the entries are again 1, and the

rest of the entries are all 0. So you have an n cross n matrix where aij are given by the entries
which are specified here. You are asked to show that the inverse of this matrix A has its
eigenvalues in the interval 1 by 6 to 1 by 2. So you need some information about the location
of the eigenvalues of the inverse of this matrix A.

We have already seen while doing the Gerschgorin theorems and other details, we have
remarked that if lambda is an eigenvalue of A, then 1 by lambda is an eigenvalue of A
inverse, we have actually shown this result. So now we want to show that A inverse has its
eigenvalues in the interval 1 by 6 to 1 by 2. So let us first determine the interval within which the
eigenvalues of A lie, and then we can use this result and make a conclusion about the location
of the eigenvalues of the matrix A inverse. First of all, we observe that A is a symmetric matrix,
aij is aji; so what do you immediately conclude? All its eigenvalues are going to be real.

(Refer Slide Time: 18:40)

Namely, all the eigenvalues of matrix A are going to be real. That is the first information that
we get by looking at the matrix A and observing that it is symmetric. Now let us apply
Gerschgorin's circle theorem to the rows of this matrix A. Every row has 4 along its
diagonal, and therefore the Gerschgorin circles will be given by modulus of lambda minus 4
less than or equal to the sum of the absolute values of the other entries in that row. And we
observe that the first row has a single entry 1 to the right of 4, so the first row will give us the
Gerschgorin circle to be modulus of lambda minus 4 less than or equal to 1.

Similarly the last row will give us mod lambda minus 4 is less than or equal to1 because the
rest of the entries are all zeros. Now apply Gerschgorin’s theorem to all the other rows,

starting from the second row to the (n minus 1) -th row. We observe that the diagonal entries
are 4 and you have on either side of the diagonal entry, value 1 appears to the right of it and
value 1 appears to the left of it. And hence we get modulus of lambda minus 4 less than or
equal to 2 to be the Gerschgorin circles. So what does this mean, if you write out this, this
tells you minus 2 less than equal to lambda minus 4 less than or equal to 2 and therefore that
immediately tells you that all the eigenvalues lie in the interval 2 to 6.

(Refer Slide Time: 20:16)

So what are these eigenvalues? They are the eigenvalues of matrix A. So Gerschgorin circle
theorems tell us that all the eigenvalues of A are real and all the eigenvalues of the matrix A
lie within the interval 2 to 6. So now we make use of the result that we have shown. What is
it, eigenvalues of A inverse are reciprocals of eigenvalues of A. So A has eigenvalues lying in
the interval 2 to 6 and therefore if mu denotes the eigenvalues of A inverse, then mu has to lie
between the reciprocal of 2 and reciprocal of 6, namely mu has to lie between 1 by 6 and 1
by2.

So mu lies between 1 by 6 and 1 by2 and therefore the conclusion is that all the eigenvalues
mu of A inverse lie in the interval 1 by 6 to 1 by2 and that is what we are asked to show.
Right, so we have made use of two results. What are the two results? First, we said that if A is a
symmetric matrix, all its eigenvalues are real; and secondly, we used the fact that the eigenvalues
of A inverse are given by the reciprocals of the eigenvalues of A. So we first located the
eigenvalues of A and then we used this result and got the information about the location of
the eigenvalues of the matrix A inverse.
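
A small numerical check of this conclusion, for one particular size n (the choice n = 8 below
is arbitrary, the argument above holds for every n), could look like the following sketch.

```python
import numpy as np

n = 8
# 4 on the diagonal, 1 on the first super- and sub-diagonals, 0 elsewhere.
A = 4 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)

eigs_A = np.linalg.eigvalsh(A)       # Gerschgorin: all eigenvalues of A lie in [2, 6]
eigs_Ainv = 1.0 / eigs_A             # eigenvalues of A inverse are the reciprocals

print(eigs_A.min(), eigs_A.max())          # within [2, 6]
print(eigs_Ainv.min(), eigs_Ainv.max())    # within [1/6, 1/2]
```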

So these are some examples in Gerschgorin theorems and location of eigenvalues of a given
matrix. Now we move onto some examples where we discuss the computation of norm of a
matrix and then some problems from the error analysis by the direct methods of solving a
system of equation Ax is equal to b.

(Refer Slide Time: 21:58)

So suppose say I give you a matrix A which is this and I make a statement, I make the
statement that the matrix A given by this is ill conditioned and I ask you to tell me whether
this statement is true or false. It is just not enough if you say that the statement is true or the
statement is false, you have to justify your statement whether it is a true statement or a false
statement, then only your answer will be correct and you have justified your conclusion. So
how are we going to conclude whether the statement is true or false, how do I say or how do I
find out whether the matrix is an ill conditioned matrix or not, I compute what is its condition
number.

If the condition number is very large, then we conclude that the matrix A is ill conditioned.
So I compute the condition number. And what is the condition number, condition number is
norm A into norm A inverse. So given a matrix A, I compute its inverse and then I compute
norm of A, then norm of A inverse and then find what K of A, the condition number of A is.
So I compute norm A using maximum column sum norm. So, I have maximum of 100 plus
modulus of minus 200, so 300, then modulus of minus 200 plus 401, that is 601, so norm A is
601.

So with respect to the maximum column sum norm, norm A is 601. Now I compute norm A
inverse with respect to the maximum column sum, so norm A inverse is going to be equal to
maximum of the sum of the absolute values of the entries in the first column which is 6.01
and the sum of the entries in the second column after taking the absolute value, right and that
gives you 3 and so it is 6.01. And therefore norm A inverse is 6.01. So now compute the
condition number of A, it is norm A into norm A inverse, so that is 601 into 6.01 and it turns
out to be 3612 which is very much greater than the value 1, right.

For a well conditioned system, the condition number must be close to 1 (it is always greater
than or equal to 1); here it turns out to be 3612 and therefore A is an ill conditioned matrix, so the
statement that has been given, namely the matrix A is ill conditioned is a true statement. So
whenever you are given some statement and you are asked to check whether the statement is
true or false, right, in order to conclude what the answer is, you must work out the details and
based on the results that you get, you must conclude whether the statement is true or false, in
which case you have justified your conclusion.
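
The transcript does not reproduce the matrix from the slide, but the column sums 300 and 601
quoted above are consistent with the matrix written below, so the following Python sketch
assumes that matrix when verifying the computation.

```python
import numpy as np

# Assumed matrix (not shown in the transcript); its absolute column sums 300 and 601
# match the values quoted in the lecture.
A = np.array([[100.0, -200.0],
              [-200.0, 401.0]])
A_inv = np.linalg.inv(A)

norm_A = np.linalg.norm(A, 1)          # maximum absolute column sum: 601
norm_A_inv = np.linalg.norm(A_inv, 1)  # 6.01
cond = norm_A * norm_A_inv             # 3612.01, much larger than 1, so A is ill conditioned
print(norm_A, norm_A_inv, cond)
```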

(Refer Slide Time: 25:23)

So let us now consider an example where we want to perform some error analysis. Suppose
you are given Epsilon greater than 0 and a matrix A having entries 1, 1 plus Epsilon in the
first row and 1 minus Epsilon and 1 in the second row. Immediately I compute A inverse and
that is given there. And this time I would like to obtain norm of A using the infinity norm. So
what is it? The condition number is norm A infinity into norm A inverse infinity, and that
turns out to be (2 plus Epsilon) the whole square by Epsilon square. So this is surely greater
than 4 by Epsilon square.

If suppose, in the matrix that is given, I take my Epsilon, which is positive, to be less than or
equal to 0.01, then I observe that in that case K of A is greater than 4 by (0.01) the whole
square, and that turns out to be 40,000. So the condition number of A will turn out to be
greater than 40,000, which is a very, very large quantity.
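
Written out, the quantities quoted above are (this is the standard computation for this matrix;
the inverse shown comes from the usual 2 by 2 formula):

```latex
A = \begin{pmatrix} 1 & 1+\varepsilon \\ 1-\varepsilon & 1 \end{pmatrix},
\qquad
A^{-1} = \frac{1}{\varepsilon^{2}}
\begin{pmatrix} 1 & -(1+\varepsilon) \\ -(1-\varepsilon) & 1 \end{pmatrix},
\qquad
\|A\|_{\infty} = 2+\varepsilon,
\qquad
\|A^{-1}\|_{\infty} = \frac{2+\varepsilon}{\varepsilon^{2}},

K(A) = \|A\|_{\infty}\,\|A^{-1}\|_{\infty}
     = \frac{(2+\varepsilon)^{2}}{\varepsilon^{2}}
     > \frac{4}{\varepsilon^{2}}
     \ge 4\times 10^{4}
\quad\text{for } 0 < \varepsilon \le 0.01 .
```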

(Refer Slide Time: 27:10)

Suppose I give a small change in one of the entries in the matrix A,
namely I add Epsilon to this entry and subtract Epsilon from that entry. So if I perturb entries
in the given matrix A, then what happens to the condition number? In that case the condition
number becomes very very large and so this says that a relative perturbation in the right-hand
side may induce a relative perturbation 40,000 times greater in the solution of the given
system Ax is equal to b. So change in some entries in A will give the corresponding changes
in the solution and that is going to be such that it is going to be 40,000 times greater in the
solution of the given system Ax is equal to b.

(Refer Slide Time: 28:01)

So let us now consider what happens if you want to compute the condition number of a
matrix A which is a 3 by 3 matrix using the maximum absolute row sum norm. So A is
given, so you can compute A inverse using Gauss- Jordan technique, now you have learnt
that. So you know how to compute the inverse of a given matrix using Gauss -Jordan
technique and now you compute norm A infinity. What is it, it is maximum absolute row sum
norm for A. So it is a maximum of the sum of the entries in each of these rows after taking
the absolute value. And that turns out to be maximum of 14, 29 and 50 and that is 50.

Now compute norm A inverse infinity, here again I want to use the maximum row sum norm
and so compute the sum of the entries in absolute values in each of the rows and from among
them choose the one which is the maximum; that turns out to be 15. So norm A inverse infinity is
15. So the condition number of the given matrix A is K of A, given by norm A infinity
into norm A inverse infinity, which is 50 from here and 15 from norm A inverse infinity, and that gives
you 750. So we can say that the matrix A is an ill conditioned matrix.

(Refer Slide Time: 29:34)

Now I consider another example, taking A to be a 2 by 2 matrix, and I want you to compute the
condition number of A with respect to the 2-norm. So when I work out the 2-norm of the
matrix A, that turns out to be 3.165. I also require the 2-norm of A inverse because I am
asked to compute the condition number of A. So norm A inverse 2 turns out to be 158.27.
Now that you know the definition of the 2-norm of a matrix A, you can compute A inverse and
work out the details and show that norm A inverse 2 is 158.27. And therefore the condition
number is norm A into norm A inverse with respect to the 2 norm and so it is the product of
these 2 values and that turns out to be this.

So you have actually computed it and you are asked to fill in the blanks here, so write out the
results here that it is 500.974. And you observe that this is a large quantity and therefore, you
are asked to make some conclusion about what the matrix is. And you know since the
condition number is large, the matrix is ill conditioned, so you fill in the blanks here by
saying that the matrix is an ill conditioned matrix.

(Refer Slide Time: 31:05)

Compute the condition number of the following matrix relative to the infinity norm. So A is a 2
by 2 matrix; norm A infinity is the maximum row sum, so compute the row sums, namely the first
row sum in absolute value, 1 by 2 plus 1 by 3, which is 5 by 6, and the second row sum, 1 by 3
plus 1 by 4, which is 7 by 12, and you conclude that norm A infinity turns out to be 5 by 6.
Now what is the determinant of A? That turns out to be 1 by 72. Get A inverse and work out
what norm A inverse infinity is. Again it is the maximum row sum norm, and so we have the
maximum of 42 and 60 to give you norm A inverse infinity, and that turns out to be
60. So what is the condition number? The condition number is norm A infinity into norm A
inverse infinity.

We have computed each of these, so the product of these gives you norm A infinity into
norm A inverse infinity, and that is 50, and you are asked to get what the condition number is.
So the condition number of A is 50. So now we know how to compute the norm of a vector or
norm of a matrix associated with the norm of a vector which is given. So let us work out
some details about the error analysis for direct method.

(Refer Slide Time: 32:44)

So the problem is: the following linear system Ax is equal to b has x as the actual solution
and x tilde as an approximate solution. You are asked to compute the magnitude of the
absolute error and also to compute the residual, namely norm of A x tilde minus b with
respect to the infinity norm. So you are given a system of two equations in two unknowns.
And you observe that the coefficients 1 by 2, 1 by 3; 1 by 3, 1 by 4 are the same as in the
example which we considered earlier. And x equal to (1 by 7, minus 1 by 6)
transpose is the actual solution and x tilde, which is this, is an approximate solution.

So solve this system say by applying Gauss elimination method with partial pivoting and get
the solution correct to 3 decimal places and that is what your x tilde. So now you want to
compute the error, what is the error, it is x minus x tilde. So x is 1 by 7, x tilde, first

component is 0.142. Second component is minus 1 by 6 and the second component of x tilde
is minus 0.166, so it is this quantity. And you want norm x minus x tilde infinity. So it is the
maximum of the first entry and the second entry. So maximum of absolute value of the first
entry and the absolute value of the second entry, so compute this.

(Refer Slide Time: 35:05)

So you get this to be maximum of this and this which turns out to be the value 0.00085714;
so that is the absolute error in the solution x of the system Ax is equal to b. So you would like
to find what is the residual, so compute A into x tilde. So A matrix is given, x tilde is given,
so work out what A into x tilde is, which turns out to be this vector. So what is the residual?
The residual is Ax tilde minus b. Why is it a residual? You want x to satisfy the equation Ax
equal to b, so Ax minus b must be 0; but x tilde is an approximate solution, so Ax tilde minus b will
not be equal to 0. It gives you the residual, and so you call r as Ax tilde minus b.

And you computed Ax tilde as this, so that vector minus you know what the right-hand side
vector is, so substitute that and then evaluate what is the difference and compute norm Ax
tilde minus b, that is the maximum of the absolute values of the entries that you have written
here and that turns out to be 0.00020635. So you will be in a position to compute the absolute
error which is norm x minus x tilde as well as the norm of the residual r which is norm Ax
tilde minus b with respect to any norm that will be asked. In this example you have been
asked to get these values using the infinity norm and you have computed them with respect to
infinity norm.
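
A short Python sketch of exactly this computation; the right-hand side b is generated from
the exact solution, which is how the residual was defined above.

```python
import numpy as np

A = np.array([[1/2, 1/3],
              [1/3, 1/4]])
x_true = np.array([1/7, -1/6])
b = A @ x_true                      # right-hand side consistent with the exact solution
x_approx = np.array([0.142, -0.166])

err = np.linalg.norm(x_true - x_approx, np.inf)        # about 0.00085714
residual = np.linalg.norm(A @ x_approx - b, np.inf)    # about 0.00020635
print(err, residual)
```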

(Refer Slide Time: 36:39)

And now you are asked to compute the condition number relative to infinity norm and also
the condition number times the residual by norm A with respect to infinity norm. So you
compute what is condition number, it is norm A infinity into norm A inverse infinity. We
already have computed the condition number earlier when we worked out the example, it
turned out to be 50. And now we have computed what is the residual, namely norm b minus
Ax tilde with respect to infinity norm and that is 0.00020635. And we also have computed
norm A with respect to infinity norm which is this; so we simply have to substitute these
values which appear on the left-hand side, namely these and then simplify and that gives you
the value of this, namely condition number multiplied by norm of the residual by norm A
infinity to be equal to this.

(Refer Slide Time: 38:05)

If we know how to compute the matrix norm given the vector norm, then we will be in a
position to compute the absolute error and also the relative error. So that is what is given in
the next example. Suppose that x tilde is an approximation to the solution of Ax is equal to b,
A is a non-singular matrix and r is the residual vector for x tilde, that is r is Ax tilde minus b.
Then you are asked to show that for any natural norm, namely infinity norm or 2 norm, you
should be able to show norm x minus x tilde is less than equal to norm r into norm A inverse.

And in addition, if x is different from 0, is a nonzero vector, and b is also a nonzero vector,
then show that the relative error in our computation, what is the relative error, it is change in
x with respect to x, so it is norm x minus x tilde by norm x, that must be less than or equal to
the condition number which is norm A into norm A inverse multiplied by the norm of the
residual by the norm of the right-hand side vector, this is what we have to show. So these two
results clearly tell us the bound on the absolute error, namely this is the right-hand side, your
absolute error cannot exceed the value on the right-hand side.

Your relative error in the computation of solution of a system Ax equal to b cannot exceed
the value on the right-hand side, what is it, it is the condition number of the matrix A times
the norm of the residual by norm of the right-hand side vector b. So let us work out the
details. Now residual is r, which is b minus A x tilde. But what do you know about b, b is Ax,
so A x minus A x tilde, that is A into x minus x tilde. So I pre-multiply both sides by A
inverse because it is given A is a non-singular matrix, so A inverse exists and therefore I
multiply by A inverse.

(Refer Slide Time: 40:50)

So we have A inverse into r to be A inverse into A into x minus x tilde, so that will be x
minus x tilde. And therefore norm x minus x tilde will be norm of A inverse into r but we
have shown already that norm of A B, where A and B are matrices, is less than or equal to
norm A into norm B, so let us make use of that result. So norm A inverse r is less than or equal to
norm A inverse into norm r. So this proves our first result, that is what we have to show,
namely norm of x minus x tilde is less than or equal to norm r into norm A inverse and we
have shown that result 1.

Now, we are, we know that b is A x, so norm b is norm Ax, that is less than or equal to norm
A into norm x and therefore 1 by norm x is less than or equal to norm A divided by norm b.
Now consider norm x minus x tilde less than or equal to norm A inverse into norm r, and
divide by norm x throughout; so you get norm x minus x tilde by norm x less than or equal to
norm A inverse into norm r into (1 by norm x). And what about 1 by norm x? We know Ax is b,
so norm b is less than or equal to norm A into norm x, and so I can replace 1 by norm x by a
quantity which is less than or equal to norm A by norm b. So I replace this 1 by norm x and
this gives me norm A inverse into norm A into norm r by norm b, and that is what we had to
show here, namely the condition number, norm A into norm A inverse, times norm r by norm b.
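
Collected in one place, the two bounds just derived are:

```latex
\|x-\tilde{x}\| \le \|A^{-1}\|\,\|r\|,
\qquad
\frac{\|x-\tilde{x}\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|r\|}{\|b\|}
= K(A)\,\frac{\|r\|}{\|b\|},
\qquad r = b - A\tilde{x}.
```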

We can immediately get the bound on the absolute error and the relative error for a given
problem when we get an approximate solution using the direct methods, namely Gauss
elimination method or any of the direct methods that we have studied. So given a system Ax
is equal to b and if you apply direct methods to solve the system, then you can immediately
specify the amount of absolute error and relative error that are involved in your computations
by using the bounds which are given by these examples. What is it that we have done all along
in this course?

We have focused our attention on five important topics which are very useful to us in
obtaining approximation of a function whose values are specified at discrete points. Namely,
we discussed polynomial approximation and then when the information is given at a set of
equally spaced points, we used Newton’s forward interpolation polynomial or backward
interpolation polynomial and obtained approximate polynomial, namely the interpolating
polynomial that interpolates the function at a set of given point. If the information is given at
a set of arbitrarily located points, then we had Lagrange interpolation method and divided
difference method to compute the interpolating polynomial that interpolates the given data
and we have discussed error in interpolation and also the bound on the error in interpolation.

(Refer Slide Time: 44:12)

Then we considered the topic of numerical differentiation, where we derived finite
difference approximations to derivatives of various orders, given in terms of the
values of the function at a certain set of points. So let us consider an example here. Suppose
I give you a formula which approximates the second derivative f double dash of x i, given by
this, and I ask you the following: show that the truncation error in the 4-point backward
difference formula given by this, which is used to approximate the second derivative, is order
of h, order of h square, order of h cube or order of h to the power of 4.

I ask you to choose the correct answer from the four choices which I have given here. Unless
you work out the details, you will not be able to give the result. So you start with the right-hand
side, expand using Taylor's theorem, and get the coefficients of f i, f dash i, f double dash i, f
triple dash i, the fourth derivative at i and so on; and that is to be compared with f double dash
i, which appears on the left-hand side. Collecting the coefficients in the difference of the two
sides, the coefficient of f i is 0, that of f dash i is also 0, that of f double dash i is 0, and that of
f triple dash i also turns out to be 0.

On the other hand, the fourth derivative at i, coefficient turns out to be different from 0 and so
your right-hand side has h power 4 into fourth derivative at i divided by h square because
your formula had 1 by h square in it. So we observe that you have the fourth derivative term
appearing, multiplied by h power 4 by h square and that is a nonzero term, so it is h square
times the fourth derivative. And so that is the first term which is a nonzero term that appears
in this formula when you approximate the second derivative by means of the formula on the
right-hand side. So you conclude that the truncation error is of order of h square.

(Refer Slide Time: 46:24)

And after numerical differentiation we also discussed some results on numerical
integration, where we derived the closed-type formulas, which are Newton-Cotes type
formulas, namely the Trapezoidal rule and Simpson's rule, and we derived certain integration
methods which are exact for polynomials of degree n when we use n nodes in the interval
[a, b], which is a closed interval.

Then we said that if we sacrifice the requirement that these n nodes are equally spaced
and look for quadrature methods which are exact for polynomials
of degree greater than n when n nodes are used within an interval of the form [a, b], in
particular the interval minus 1 to 1, then we obtain open-type
integration methods, where we have used properties of the Legendre polynomials, and these
integration methods turn out to be exact for polynomials of degree less
than or equal to (2 n) minus 1, where the function values at only n nodes are required.

And these n nodes turn out to be the zeros of the Legendre polynomial of degree n, and these
formulas were of the open type; the nodes were all within the interval minus 1 to 1. Then
we moved on to the solution of ordinary differential equations, where we derived some single-
step methods like the Taylor series method, Euler's method, the modified Euler method and
Runge-Kutta methods of order 2 and 4, and solved initial value problems using any of these
methods. Then we said that it is possible to consider iterative types of methods where we have
a predictor to predict the solution and then a corrector which corrects the solution, so that we
can obtain the solution correct to the desired degree of accuracy.

And these methods involved a number of points close to the point at which the initial
condition is specified, so that information at four points was used to compute the solution at
the fifth point. These were the multistep methods which we used as predictor-corrector
methods, and we solved an initial value problem using Milne's predictor-corrector and
Adams-Bashforth's predictor-corrector methods. Then we moved on to the solution of nonlinear
equations of the form f of x is equal to 0.

So we discussed two types of methods, namely the enclosure methods and then the fixed
point iteration methods and discussed the order of convergence of these methods and we
learnt some techniques of solving equations of the form f of x is equal to 0. And finally we
moved on to the solution of a system of algebraic equations and then discussed the direct
methods and iterative methods. And the direct methods included the decomposition methods
of Doolittle, Crout and Cholesky and the Gauss elimination method incorporating partial pivoting,
and then we discussed the iterative methods such as the Gauss-Seidel method and the Gauss-Jacobi
method.

But then we wanted to see how in the case of direct methods, we can have some information
about the error that has been committed by us. So we performed error analysis and so we
introduced norm of a vector, norm of a matrix associated with the given norm for a vector
and we have shown how to get a bound on the absolute error and the relative error when we
know the condition number of a given matrix A which appears as the coefficient matrix in the
system Ax is equal to b and when we know what the norm of the residual which is norm Ax
tilde minus b, where x tilde is an approximate solution, b is the right-hand side vector that
appears in the system Ax is equal to b.

And finally we moved to matrix eigenvalue problems and computed the most dominant
eigenvalue by power method and discussed the Gerschgorin bounds using Gerschgorin circle
theorem and Brauer’s theorem and we have illustrated all these methods which we have
learnt in this course by means of some examples, problems and you have solved a number of
assignment problems, you also have looked into a number of problems which we have given
to you for practice and finally you are ready to appear for the end semester examinations.

The type of problems that you will get in your end semester will be such that you will have
questions where you may have to choose the appropriate answer with proper justification,
you may have to choose whether a given statement is true or false, you may have to fill in
certain blanks where you may have to work out the details first and then fill in the results in

the blank, and there will also be problems where you will have to completely work out the
details and present the solution. Some questions similar to what may come in your
examination, I have included a few examples here. So let us quickly go through them.

(Refer Slide Time: 52:31)

So you have the first question to be, choose the appropriate answer. So you are given a
function f of x which is interpolated at n plus 1 points, x 0 to x n. The error in interpolation at any
point x which does not coincide with any of the x i, for i equal to 0 to n,
depends on what, is the question. Is it dependent on the method that is chosen
for interpolation? Surely not, that is clear. Does it depend on the value of the function f at x? No,
you do not even know what the value of f at x is. Thirdly, is it dependent on the location of x in
the interval, no, not at all, x can be anywhere in the interval x 0 to x n.

Is it dependent on the number of data points? Yes, of course. If you choose three points from
the data, you can give an interpolating polynomial of degree 2. If you choose 10 points from
the data, you get an interpolating polynomial of degree 9. So the error
in interpolation depends on the number of data points and not on any of the other three
statements which are given. So you conclude that the error depends only on the number of data
points, and mark D as the correct answer, giving the reason for that.

(Refer Slide Time: 54:02)

Then you have another question which comes from interpolation again. In which of the
following polynomial interpolation methods can one use all the previous computations when
new data points are included? We have already discussed this in the class and therefore I
go straight to the answer: in the divided difference method one can reuse all the
computations that have been done earlier. Suppose we are given three points and
we are asked to get the divided difference interpolation polynomial. Then we would be
getting the interpolating polynomial of degree two that interpolates the given data,
which are three in number.

But now suddenly I have another information coming about this function, so another point
and the value of that function at that point is given, now I have 4 points, so I can get an
interpolating polynomial of degree 3 which I call as p3, then what is p3, p3 is simply p2 of x,
namely the quadratic interpolation polynomial that we already have computed; so p2 of x
plus an extra term, that extra term which arises due to the new information that is given to us.
Right. And you know what the extra term should be, so if x 0, x 1, x 2 are the previous points
and x 3 is the new point, then the extra term is going to be plus A times x minus x 0, x minus
x 1, x minus x 2.

How do you determine A? so your p3 of x is p2 of x plus A times (x minus x 0)(x minus x1)
(x minus x2). How do you determine A? This p3 must be such that it should interpolate at the
new point that is given. So substitute the x value as the x coordinate of the point, new point
that is given and the y value as the function value that is specified and determine A and you
now have a cubic polynomial which interpolates the given data, where you have used the

computation of p2 of x, which you have already done. So it is in the divided difference
method that you are able to reuse the previous computations, as the short sketch below also
illustrates. So you write the result as B, that is, the divided difference method.

(Refer Slide Time: 56:30)

So you have another problem on divided difference which you can try to work out. Now if I
consider examples in say methods of solving a non-linear equation of the form f of x is equal
to 0, and I say apply the Newton-Raphson method to solve this equation. The computations
are started with an initial approximation p0 close to 2; that is an initial approximation to a
root of the equation. And you are asked to find the limit, as n tends to infinity, of modulus
of x(n plus 1) minus p divided by (modulus of x n minus p) the whole square. Immediately you
have recognised that the numerator is your error at the (n plus 1)-th step and the denominator is
the square of your error at the n-th step.

So you are asked to find the limit, as n tends to infinity, of e(n plus 1) by e n square: is it
any one of these values? Unless you work it out, you cannot give the answer, so let us work out
the details. I start with e(n plus 1) by modulus of e n square. We have already done the error
analysis, and we have shown that when we apply the Newton-Raphson method, the limit of
e(n plus 1) by e n square, where p is the root, is nothing but half of the second
derivative of f at the root p divided by the first derivative of the function at the root.
So p is known, because the iteration starts close to 2 and converges to the root there.

So p is 2, and therefore we take this value. We know the function, how do we know the
function? We are given f of x equal to x square minus 4. Compute the second derivative,
compute the first derivative, substitute these derivatives, evaluate them at 2, put the values in
here, and that gives you 1 by 4. So immediately you know that the answer C is correct,
because you have been able to show that the limit is equal to 1 by 4. The Newton-Raphson
method is such that the error at the (n plus 1)-th step is given by C into the square of the error
at the n-th step, where C, the asymptotic error constant, is given by this expression, and that
result has been used.
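
A small Python check of this asymptotic error constant for f of x equal to x square minus 4;
the starting value 1.7 is chosen arbitrarily close to the root 2.

```python
f = lambda x: x**2 - 4
fp = lambda x: 2*x
root = 2.0

x = 1.7                                  # arbitrary starting value near the root
errs = []
for _ in range(4):
    errs.append(abs(x - root))
    x = x - f(x) / fp(x)                 # Newton-Raphson step
errs.append(abs(x - root))

# e_{n+1} / e_n^2 should approach |f''(p)| / (2 |f'(p)|) = 2 / (2*4) = 1/4
for e0, e1 in zip(errs, errs[1:]):
    if e0 > 0:
        print(e1 / e0**2)                # ratios settle down to about 0.25
```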

(Refer Slide Time: 59:14)

So all the results that we have derived are very important in the sense that they
can be used to solve some examples, and that has been illustrated above.
Let us consider this example: the Newton-Raphson method for
finding the square root of a real number R, what is it, is it any one of these, is the question.
Yes, take the Newton-Raphson method; you want to get the square root of a
real number R, so what is the equation that we want to solve? x square minus R equal to 0 is the
equation. Call that f of x, find f dash of x, substitute in x(n plus 1) equal to x n minus (f
of x n) by f dash of x n, which is what appears here, and simplify; you observe that the result
is half of (x n plus R by x n), which appears here. So you mark C as the correct answer.
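
As a tiny sketch, the iteration that results (the choice R = 10 and the starting value are
arbitrary):

```python
def newton_sqrt(R, x0=1.0, num_iter=10):
    """Square root of R via Newton-Raphson on f(x) = x^2 - R: x_{n+1} = (x_n + R/x_n)/2."""
    x = x0
    for _ in range(num_iter):
        x = 0.5 * (x + R / x)
    return x

print(newton_sqrt(10.0))   # about 3.16227766
```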

(Refer Slide Time: 60:04)

Similarly, I have another example: the reciprocal of a positive real number N is to be found using the Newton-Raphson method, starting with an initial approximation. From which one of these is the sequence of iterates generated? So we apply the Newton-Raphson method: you want the reciprocal of a number, so the equation you want to solve is x equal to 1 by N, or equivalently f of x equal to N minus 1 by x equal to 0. Compute f dash of x, substitute in the Newton-Raphson formula x(n plus 1) equal to x n minus f of x n by f dash of x n, and that gives you x(n plus 1) equal to x n times (2 minus N x n), which appears as one of the options, namely C, and so you can fill in and say the answer is C.
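
A minimal Python sketch of this division-free iteration, with N chosen hypothetically:

# Newton-Raphson for 1/N: f(x) = N - 1/x, f'(x) = 1/x**2,
# which gives x_{n+1} = x_n * (2 - N * x_n). Note that the iteration itself
# uses no division, which is why it is a classical way of computing reciprocals.
N = 7.0            # hypothetical value
x = 0.1            # initial approximation (should lie between 0 and 2/N)
for _ in range(8):
    x = x * (2.0 - N * x)
print(x)           # approximately 0.142857..., i.e. 1/7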

(Refer Slide Time: 60:52)

Then we have a question which says: state, with reason, whether the following statement is true or false. The statement concerns a non-linear equation f of x equal to 0 solved by the bisection method, which belongs to the class of enclosure methods. Suppose n is the number of steps needed to reach an error of e starting from the initial interval [a, b]. Now divide the length of the initial interval by 3, and let n1 be the number of steps needed to commit the same error e.

The statement then says that n1 has to be greater than n; is the statement correct, is the question. Do not worry about what you should write; start by writing down what is given. We know that in the bisection method the error at the end of n steps is given by the length of the initial interval divided by 2 to the power of (n plus 1), that is, (b minus a) by 2 power (n plus 1). Next you divide the interval length by 3, so the length of the initial interval is now (b minus a) by 3, and the error after n1 steps is (b minus a) by 3 divided by 2 power (n1 plus 1).

In both cases the error committed is the same e. So I have written down whatever is given in the problem; now I divide one expression by the other. Taking logarithms to base 2, n turns out to be n1 plus log to the base 2 of 3, a positive quantity. So n is bigger than n1, that is, n is greater than n1. The statement says n1 is greater than n; the statement is therefore false, and this is the reason why. So you have to justify the answer by giving the details of why you are saying that the statement is false.
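
Written out as a small calculation (using the same bisection error formula as above), the comparison is:

\[
\frac{b-a}{2^{\,n+1}} \;=\; e \;=\; \frac{(b-a)/3}{2^{\,n_1+1}}
\;\;\Longrightarrow\;\; 2^{\,n+1} = 3 \cdot 2^{\,n_1+1}
\;\;\Longrightarrow\;\; n = n_1 + \log_2 3 \approx n_1 + 1.585 > n_1 .
\]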

(Refer Slide Time: 63:09)

Then suppose f is a continuous function on the interval [minus 1, 2] and f of minus 1 times f of 2 is less than 0; what does that mean? A root lies in [minus 1, 2]. The number of iterations necessary to guarantee that an approximation to a root of f of x equal to 0 using the bisection method is correct to within 7 decimal places is what; which one of the options is correct? You have the formula (b minus a) by 2 power (n plus 1), and you want it to be less than half of 10 to the power of minus 7, because correct to within 7 decimal places is the requirement. Simplify and you get n to be bigger than 24.84, and you observe that n equal to 25 is what appears here.
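
A short Python check of this bound, a sketch using the interval [minus 1, 2] from the question:

import math

# Bisection error bound: (b - a) / 2**(n + 1) < 0.5e-7
a, b = -1.0, 2.0
tol = 0.5e-7                              # "correct to within 7 decimal places"
n = math.ceil(math.log2((b - a) / tol) - 1.0)
print(n)                                  # 25, since log2(6e7) - 1 is about 24.84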

(Refer Slide Time: 64:01)

So now: what is the order of convergence of an iteration method for solving f of x equal to 0 in [a, b], if the method generates successive iterates given by x(n plus 1) equal to 1 by (n plus 1) the whole square, and x n converges to 0? You are asked to find the order of convergence of a method, and you are given the sequence of successive iterates, which converges to 0. So what is the order? The iterate p n is given; how do you compute the order? You must find alpha such that e(n plus 1) divided by e n to the power of alpha tends to a non-zero constant, the asymptotic error constant, which is C.

So I repeat again: you have to show that e(n plus 1) is equal to C multiplied by e n to the power of alpha. Alpha gives you the order of convergence of the method and C gives you the asymptotic error constant. So let us see what this is in this example. The sequence converges to 0, so p(n plus 1) minus 0 is e(n plus 1) and p n minus 0 is e n. I therefore consider mod e(n plus 1) by mod e n to the power of alpha, which is 1 by (n plus 1) the whole square divided by (1 by n square) to the power of alpha.

That gives you n to the power of 2 alpha divided by (n plus 1) the whole square. Now what does this quantity do as n grows? It converges to 0 if alpha is less than 1, it converges to 1 if alpha is equal to 1, and it diverges if alpha is greater than 1. You want the ratio to converge to a non-zero constant, so alpha must be equal to 1, and then the limit is 1. So modulus of e(n plus 1) equals C into modulus of e n to the power of alpha, with C equal to 1 and alpha equal to 1, and you conclude that the method converges and its order of convergence is 1, so it has linear convergence.
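
A quick numerical illustration of this ratio (a sketch) with the sequence x n equal to 1 by n square:

# Sequence x_n = 1/n**2 converging to 0, so e_n = 1/n**2.
# With alpha = 1 the ratio e_{n+1} / e_n**alpha tends to 1: linear convergence.
for n in (10, 100, 1000):
    e_n = 1.0 / n**2
    e_np1 = 1.0 / (n + 1)**2
    print(n, e_np1 / e_n)      # 0.826..., 0.980..., 0.998..., approaching 1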

I have illustrated some examples from the different topics that we have discussed in this course, since you can expect questions of this type in your examination as well. I hope you have enjoyed the course and have learnt a good portion of the topics of numerical analysis from it. I wish you all the best in your career and in the future. Thank you very much.
