Numerical Methods For Ordinary Differential Equations
C. Vuik
F.J. Vermolen
M.B. van Gijzen
M.J. Vuik
© VSSD
First edition 2007
Second edition 2015
Preface
In this book we discuss several numerical methods for solving ordinary differential equations.
We emphasize the aspects that play an important role in practical problems. We confine ourselves
to ordinary differential equations with the exception of the last chapter in which we discuss the
heat equation, a parabolic partial differential equation. The techniques discussed in the intro-
ductory chapters, for instance interpolation, numerical quadrature and the solution to nonlinear
equations, may also be used outside the context of differential equations. They have been in-
cluded to make the book self-contained as far as the numerical aspects are concerned. Chapters,
sections and exercises marked with a * are not part of the Delft Institutional Package.
The numerical examples in this book were implemented in Matlab, but also Python or any other
programming language could be used. A list of references to background knowledge and related
literature can be found at the end of this book. Extra information about this course can be found
at http://NMODE.ewi.tudelft.nl, among which old exams, answers to the exercises, and a link
to an online education platform. We thank Matthias Möller for his thorough reading of the draft
of this book and his helpful suggestions.
Delft, June 2016
C. Vuik
The figure on the cover shows the Erasmus bridge in Rotterdam. Shortly after the bridge became
operational, severe instabilities occurred due to wind and rain effects. In this book we study,
among other things, numerical instabilities and we will mention bridges in the corresponding
examples. Furthermore, numerical analysis can be seen as a bridge between differential equa-
tions and simulations on a computer.
Contents
1 Introduction 1
1.1 Some historical remarks 1
1.2 What is numerical mathematics? 1
1.3 Why numerical mathematics? 2
1.4 Rounding errors 2
1.5 Landau's O-symbol 6
1.6 Some important concepts and theorems from analysis 7
1.7 Summary 10
1.8 Exercises 10

2 Interpolation 11
2.1 Introduction 11
2.2 Linear interpolation 11
2.3 Higher-order Lagrange interpolation 14
2.4 Interpolation with function values and derivatives * 16
2.4.1 Taylor polynomial 16
2.4.2 Interpolation in general 18
2.4.3 Hermite interpolation 18
2.5 Interpolation with splines 20
2.6 Summary 24
2.7 Exercises 24

3 Numerical differentiation 25
3.1 Introduction 25
3.2 Simple difference formulae for the first derivative 25
3.3 Rounding errors 27
3.4 General difference formulae for the first derivative 29
3.5 Relation between difference formulae and interpolation * 31
3.6 Difference formulae of higher-order derivatives 32
3.7 Richardson's extrapolation 33
3.7.1 Introduction 33
3.7.2 Practical error estimate 34
3.7.3 Formulae of higher accuracy from Richardson's extrapolation * 35
3.8 Summary 36
3.9 Exercises 36

4 Nonlinear equations 39
4.1 Introduction 39
4.2 Definitions 39
4.3 A simple root finder: the Bisection method 41
4.4 Fixed-point iteration (Picard iteration) 42
4.5 The Newton-Raphson method 44

5 Numerical integration 51
5.1 Introduction 51
5.2 Riemann sums 52
5.3 Simple integration rules 52
5.3.1 Rectangle rule 52
5.3.2 Midpoint rule 53
5.3.3 Trapezoidal rule 54
5.3.4 Simpson's rule 54
5.4 Composite rules 55
5.5 Measurement and rounding errors 58
5.6 Interpolatory quadrature rules * 60
5.7 Gauss quadrature rules * 62
5.8 Summary 64
5.9 Exercises 64
Chapter 1
Introduction
also plays an important role in error analysis (investigating the difference between the numerical
approximation and the solution).
Calculating with only a finite subset of the rational numbers has many consequences. For exam-
ple: a computer cannot distinguish between two polynomials of sufficiently high degree. Conse-
quently, methods based on the fundamental theorem of algebra (i.e. that an nth degree polynomial has
exactly n complex zeros) cannot be trusted. Errors that follow from the use of finitely many digits
are called rounding errors (Section 1.4).
An important aspect of numerical mathematics is the emphasis on efficiency. Contrary to or-
dinary mathematics, numerical mathematics considers an increase in efficiency, i.e. a decrease
of the number of operations and/or amount of storage required, as an essential improvement.
Progress in this aspect is of great practical importance and the end of this development has not
been reached yet. Here, the creative mind will meet many challenges. On top of that, revolutions
in computer architecture will overturn much conventional wisdom.
where for simplicity x is taken positive, and we assume that dn < β − 1. Rounding x means that x will be replaced with the floating point number closest to x, which is called fl(x). The error caused by this process is called a rounding error, and fl(x) is written as

fl(x) = x(1 + ε). (1.2)
The value |x − fl(x)| = |εx| is called the absolute error and |x − fl(x)|/|x| = |ε| the relative error. The difference between the floating point numbers enclosing x is β^(e−n). Rounding yields |x − fl(x)| ≤ β^(e−n)/2, hence the absolute error is bounded by

|εx| ≤ (1/2) β^(e−n).

Because |x| ≥ β^(e−1) (since d1 > 0), the relative error is bounded by

eps = (1/2) β^(1−n). (1.4)
From β = 2 and n = 53 it follows that eps ≈ 10⁻¹⁶, thus in double precision approximately 16 decimal digits are used.
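These quantities are easy to inspect on an actual machine. The following Python fragment (a sketch; the examples in this book use Matlab, but the preface notes that any language will do) prints the double precision quantities just discussed. Note that sys.float_info.epsilon is the spacing β^(1−n) = 2⁻⁵² between 1 and the next floating point number, i.e. twice the rounding bound eps.

    import sys

    # Spacing between 1.0 and the next floating point number: beta^(1-n) = 2**-52.
    print(sys.float_info.epsilon)        # 2.220446049250313e-16
    # The rounding bound eps = (1/2) * beta^(1-n) = 2**-53 of equation (1.4).
    print(0.5 * sys.float_info.epsilon)  # 1.1102230246251565e-16
    # Rounding errors are deterministic: repeating an operation gives the same result.
    print(0.1 + 0.2)                     # 0.30000000000000004, on every run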
Figure 1.1 shows the distribution of the floating point numbers 0.1d2d3 · β^e, e = −1, 0, 1, 2, in base β = 2 (binary numbers). These floating point numbers are not uniformly distributed and there is
a gap around 0. A computational result lying within this neighborhood is called underflow. Most
machines provide a warning, replace the result with 0 and continue. A computational result with
absolute value larger than the largest floating point number that can be represented is called
overflow. The machine warns and may continue with Inf’s (infinity).
A final remark is that rounding errors are deterministic. If one repeats the algorithm, the same
results are obtained.
[Figure 1.1: the floating point numbers 0.1d2d3 · 2^e, e = −1, 0, 1, 2, on the real axis between −4 and 4.]
Next, let us investigate how computers execute arithmetic operations in floating point arithmetic.
Processors are very complex and usually the following model is used to simulate reality. Let ◦
denote an arithmetic operation (+, −, × or /) and let x and y be floating point numbers (hence,
x = f l ( x ), y = f l (y)). Then the machine result of the operation x ◦ y will be
z = f l ( x ◦ y ). (1.5)
The exact result of x ◦ y will not be a floating point number in general, hence an error results. From formula (1.2) it follows that

z = fl(x ◦ y) = (x ◦ y)(1 + ε), with |ε| ≤ eps. (1.6)
Next, we suppose that x and y are not floating point numbers: the corresponding floating point numbers fl(x) and fl(y) satisfy fl(x) = x(1 + ε1), fl(y) = y(1 + ε2). Again, x ◦ y should be calculated. Using the floating point numbers in this calculation, the absolute error in the calculated result fl(fl(x) ◦ fl(y)) satisfies:

|x ◦ y − fl(fl(x) ◦ fl(y))| ≤ |x ◦ y − fl(x) ◦ fl(y)| + |fl(x) ◦ fl(y) − fl(fl(x) ◦ fl(y))|.
From this expression it follows that the error consists of two terms. The first term is caused by
an error in the data and the second one by converting the result of an exact calculation to floating
point form (as in equation (1.6)).
A few examples are provided to show how rounding errors may affect the result of a calcula-
tion. Furthermore, general computational rules will be presented regarding the propagation of
rounding errors.
In this example, x = 5/7 and y = 1/3 are taken, and the calculations are performed on a system that uses β = 10 and a precision of 5 digits. In Table 1.1 the results of various calculations applied to fl(x) = 0.71429 × 10⁰ and fl(y) = 0.33333 × 10⁰ can be inspected.
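Such a 5-digit decimal machine is easy to imitate. In the sketch below, the helper fl is a hypothetical stand-in (not defined in this book) for the rounding operator: it rounds its argument to n significant decimal digits, and every arithmetic operation is followed by an explicit call to it.

    import math

    def fl(x, n=5):
        # Round x to n significant decimal digits: a simple model of a
        # machine with beta = 10 and precision n (hypothetical helper).
        if x == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1  # exponent with mantissa in [0.1, 1)
        return round(x / 10**e, n) * 10**e

    x, y = 5/7, 1/3
    print(fl(x), fl(y))      # 0.71429 0.33333
    s = fl(fl(x) + fl(y))    # each operation rounds its result to 5 digits
    print(s)                 # 1.0476
    print(abs((x + y) - s))  # absolute rounding error, roughly 2e-5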
In order to show how the values in the table are computed, the addition is expressed as
In this example, the same numbers x and y and the same precision as in the previous example are used. Furthermore, u = 0.714251, v = 98765.1 and w = 0.111111 × 10⁻⁴, hence fl(u) = 0.71425, fl(v) = 0.98765 × 10⁵ and fl(w) = 0.11111 × 10⁻⁴. These numbers are chosen in such a way that
rounding error problems can be clearly illustrated. In Table 1.2, x − u has a small absolute error
but a large relative error. By dividing x − u by a small number w or multiplying it with a large
number v, the absolute error increases, whereas the relative error is not affected. On the other
hand, adding a small number u to a large number v results in a large absolute error but only a small relative error.
The first row in Table 1.2 is created using the exact results u = 0.714251 and
It is interesting to note that the large relative error has nothing to do with the limitations of the
floating point system (the subtraction of f l ( x ) and f l (u) is without error in this case) but is only
due to the fact that the data are represented in no more than 5 decimal digits. The zeros that
remain after normalization in the five-digit result fl(fl(x) − fl(u)) = 0.40000 × 10⁻⁴ have no significance: only the digit 4 is significant; the zeros that are substituted are a mere formality
and represent no information. This phenomenon is called loss of significant digits. The loss of
significant digits has a large impact on the relative error, because of division by the small result.
A large relative error will sooner or later have some unpleasant consequences in later stages of
the process, also for the absolute error. For example, if x − u is multiplied by a large number, then
a large absolute error is generated, together with the already existing large relative error. This can
be seen in the third row of the table, where the exact result is ( x − u) × v = 3.42855990000460 . . ..
Calculating fl(fl(x) − fl(u)) × fl(v) yields 0.40000 × 10⁻⁴ × 0.98765 × 10⁵ = 3.95060.
After rounding, the computed result is fl(fl(fl(x) − fl(u)) × fl(v)) = 0.39506 × 10¹. This yields the absolute error

|3.42855990000460 . . . − 0.39506 × 10¹| ≈ 0.522,

and the relative error

0.522 . . . / 3.42855990 . . . ≈ 0.152.
Suppose that y² is added to the result (x − u) × v. Because y = 1/3 and therefore y² = 1/9 ≈ 0.11, this contribution is smaller than the absolute error of about 0.522 that is already present, so the result of this operation is indistinguishable from the error. In other words, for the reliability of the result it does not make a difference whether the last operation had been omitted (and the numerical process was altered). Therefore, something is fundamentally wrong in this case.
Almost all numerical processes exhibit loss of significant digits for a certain set of input data; one might call such a set ill conditioned. There are also numerical processes that exhibit these
phenomena for all possible input data. Such processes are called unstable. One of the objectives
of numerical analysis is to identify unstable processes and classify them as useless, or improve
them in such a way that they become stable.
Remark
The standard mathematical laws (such as the commutative laws, associative laws, and the dis-
tributive law) do not necessarily hold for floating point arithmetic.
Computational rules for error propagation
In the analysis of a complete numerical process, the accumulated error of all previous steps
should be interpreted as a perturbation of the original data. Moreover, in the result of the sub-
sequent step, the propagation of these perturbations should be taken into account together with
the floating point error. In general, this error source will be more important than the floating
point error after a considerable number of steps (in the previous example of ( x − u) × v even
after two steps!). In that stage the error in a numerical process will be largely determined by the
’propagation’ of the accumulated errors. The computational rules to calculate numerical error
propagation are the same as those to calculate propagation of error in measurements in physical
experiments. There are two rules: one for addition and subtraction and one for multiplication
and division.
The approximations of x and y will be denoted by x̃ and ỹ and the (absolute) perturbations by
δx = x − x̃ and δy = y − ỹ.
a) For addition, the error is ( x + y) − ( x̃ + ỹ) = ( x − x̃ ) + (y − ỹ) = δx + δy. In other words,
the absolute error in the sum of two perturbed terms is equal to the sum of the absolute
perturbations. A similar rule holds for differences: ( x − y) − ( x̃ − ỹ) = δx − δy. The rule is
often presented in the form of an inequality: |( x ± y) − ( x̃ ± ỹ)| ≤ |δx | + |δy|. We call this
inequality an error estimate.
b) This rule does not hold for multiplication and division. Efforts to derive a rule for the
absolute error will lead nowhere, but one may derive a similar rule for the relative error. The
relative perturbations ε x and ε y are defined as x̃ = x (1 + ε x ), and similarly for y. For the
relative error in a product xy it holds that
(xy − x̃ỹ)/(xy) = (xy − x(1 + εx)y(1 + εy))/(xy) = −(εx + εy + εxεy) ≈ −(εx + εy),

assuming εx and εy are negligible compared to 1. In words: the relative error in a product of two perturbed factors is approximately equal to the sum of the two relative perturbations. A similar rule can be derived for division. In the form of an error estimate:

|xy − x̃ỹ|/|xy| ≤ |εx| + |εy|.
Identification of x̃ with f l ( x ) and ỹ with f l (y) makes it possible to explain various phenomena
in floating point computations, using these two simple rules.
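As a quick sanity check of rule b), one may perturb x and y by known relative amounts and compare the relative error of the product with −(εx + εy). The perturbations below are chosen for illustration and do not come from the text.

    x, y = 5/7, 1/3
    eps_x, eps_y = 1e-5, -3e-5
    xt, yt = x * (1 + eps_x), y * (1 + eps_y)   # perturbed data
    rel_err = (x * y - xt * yt) / (x * y)
    print(rel_err)              # close to -(eps_x + eps_y)
    print(-(eps_x + eps_y))     # 2e-05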
Definition 1.5.1 (O-symbol) Let f and g be given functions. Then f(x) = O(g(x)) for x → 0, if there exist positive r and finite M such that |f(x)| ≤ M |g(x)| for all |x| < r.
Definition 1.6.1 (Well posedness) Mathematical models of physical phenomena are called well posed
if
• A solution exists;
• The solution is unique;
• The solution’s behavior changes continuously with the data.
Models that do not satisfy these conditions are called ill posed.
Definition 1.6.2 (Well conditioned) Mathematical models are called well conditioned if small errors
in the initial data lead to small errors in the results. Models that do not satisfy this condition are called ill
conditioned.
Definition 1.6.3 (Modulus of a complex number) The modulus of a complex number a + ib is equal to |a + ib| = √(a² + b²). Note that also |a − ib| = √(a² + b²).
Definition 1.6.4 (Kronecker delta) The Kronecker delta is defined as the following function of two in-
teger variables:
δij = 1 if i = j, and δij = 0 if i ≠ j.
Definition 1.6.5 (Eigenvalues) Let A be an m × m matrix, and I be the m × m identity matrix. The
eigenvalues λ1 , . . . , λm of A satisfy det(A − λI) = 0.
Theorem 1.6.1 (Triangle inequality with absolute value) For any x, y ∈ R, | x + y| ≤ | x | + |y|.
Furthermore, for any function f : (a, b) → R, |∫ₐᵇ f(x) dx| ≤ ∫ₐᵇ |f(x)| dx.
Theorem 1.6.2 (Intermediate-value theorem) Assume f ∈ C[a, b]. Let f(a) ≠ f(b) and let F be a
number between f ( a) and f (b). Then there exists a number c ∈ ( a, b) such that f (c) = F.
Theorem 1.6.3 (Rolle’s theorem) Assume f ∈ C [ a, b] and f differentiable on ( a, b). If f ( a) = f (b),
then there exists a number c ∈ ( a, b) such that f ′ (c) = 0.
Theorem 1.6.4 (Mean-value theorem) Assume f ∈ C[a, b] and f differentiable on (a, b), then there exists a number c ∈ (a, b) such that

f′(c) = (f(b) − f(a))/(b − a).
Theorem 1.6.5 (Mean-value theorem for integration) Assume G ∈ C [ a, b] and ϕ is an integrable
function that does not change sign on [ a, b]. Then there exists a number x ∈ ( a, b) such that
∫ₐᵇ G(t)ϕ(t) dt = G(x) ∫ₐᵇ ϕ(t) dt.
Theorem 1.6.6 (Weierstrass approximation theorem) Suppose f ∈ C [ a, b] is given. For every ε > 0,
there exists a polynomial function p such that for all x in [ a, b], | f ( x ) − p( x )| < ε.
Theorem 1.6.7 (Taylor polynomial) Assume f ∈ C n+1 ( a, b) is given. Then for all c, x ∈ ( a, b) there
exists a number ξ between c and x such that
f ( x ) = Pn ( x ) + Rn ( x ),
in which the Taylor polynomial Pn ( x ) is given by
Pn(x) = f(c) + (x − c)f′(c) + ((x − c)²/2!) f′′(c) + . . . + ((x − c)ⁿ/n!) f^(n)(c),

and the remainder term Rn(x) is:

Rn(x) = ((x − c)^(n+1)/(n + 1)!) f^(n+1)(ξ).
Proof
Take c, x ∈ (a, b) with c ≠ x and let K be defined as

f(x) = f(c) + (x − c)f′(c) + ((x − c)²/2!) f′′(c) + . . . + ((x − c)ⁿ/n!) f^(n)(c) + K(x − c)^(n+1). (1.8)

Consider the function

F(t) = f(t) − f(x) + (x − t)f′(t) + ((x − t)²/2!) f′′(t) + . . . + ((x − t)ⁿ/n!) f^(n)(t) + K(x − t)^(n+1).
By (1.8) it follows that F (c) = 0 and F ( x ) = 0. Hence, by Rolle’s theorem there exists a number ξ
between c and x such that F ′ (ξ ) = 0. Further elaboration yields
F′(ξ) = f′(ξ) + { f′′(ξ)(x − ξ) − f′(ξ) } + { (f′′′(ξ)/2!)(x − ξ)² − f′′(ξ)(x − ξ) } + . . .
      + { (f^(n+1)(ξ)/n!)(x − ξ)ⁿ − (f^(n)(ξ)/(n − 1)!)(x − ξ)^(n−1) } − K(n + 1)(x − ξ)ⁿ

      = (f^(n+1)(ξ)/n!)(x − ξ)ⁿ − K(n + 1)(x − ξ)ⁿ = 0.
Hence,

K = f^(n+1)(ξ)/(n + 1)!,

which proves the theorem. ⊠
Theorem 1.6.8 (Taylor polynomial of two variables) Let f : D ⊂ R² → R be continuous with continuous partial derivatives up to and including order n + 1 in a ball B ⊂ D with center c = (c1, c2) and radius ρ: B = {(x1, x2) ∈ R² : √((x1 − c1)² + (x2 − c2)²) < ρ}. Then for each x = (x1, x2) ∈ B there exists a θ ∈ (0, 1), such that

f(x) = Pn(x) + Rn(x),

in which the Taylor polynomial Pn(x) is given by

Pn(x) = f(c) + (x1 − c1) ∂f/∂x1 (c) + (x2 − c2) ∂f/∂x2 (c) + (1/2) ∑_{i=1}^{2} ∑_{j=1}^{2} (xi − ci)(xj − cj) ∂²f/∂xi∂xj (c) + . . .
      + (1/n!) ∑_{i1=1}^{2} ∑_{i2=1}^{2} . . . ∑_{in=1}^{2} (xi1 − ci1)(xi2 − ci2) . . . (xin − cin) ∂ⁿf/(∂xi1 ∂xi2 . . . ∂xin) (c),

and the remainder term Rn(x) is:

Rn(x) = (1/(n + 1)!) ∑_{i1=1}^{2} . . . ∑_{i(n+1)=1}^{2} (xi1 − ci1) . . . (xi(n+1) − ci(n+1)) ∂^(n+1)f/(∂xi1 . . . ∂xi(n+1)) (c + θ(x − c)).
Proof
Let x ∈ B be fixed, and h = x − c, hence ‖h‖ < ρ. Define the function F : (−1, 1) → R by:

F(s) = f(c + sh).

Because of the differentiability conditions satisfied by f in the ball B, F ∈ C^(n+1)(−1, 1) and F^(k)(s) is given by (verify)

F^(k)(s) = ∑_{i1=1}^{2} ∑_{i2=1}^{2} . . . ∑_{ik=1}^{2} ∂ᵏf(c + sh)/(∂xi1 ∂xi2 . . . ∂xik) · hi1 hi2 . . . hik.

Expand F into a Taylor polynomial about 0. This yields

F(s) = F(0) + sF′(0) + . . . + (sⁿ/n!) F^(n)(0) + (s^(n+1)/(n + 1)!) F^(n+1)(θs),

for some θ ∈ (0, 1). By substituting s = 1 into this expression and into the expressions for the derivatives of F, the result follows. ⊠
For n = 1, for example, the Taylor polynomial reads

P1(x) = f(c1, c2) + (x1 − c1) ∂f/∂x1 (c1, c2) + (x2 − c2) ∂f/∂x2 (c1, c2),

and for the remainder term: R1(x) is O(‖x − c‖²).
Theorem 1.6.9 (Power series of 1/(1 − x)) Let x ∈ R with |x| < 1. Then:

1/(1 − x) = 1 + x + x² + x³ + . . . = ∑_{k=0}^{∞} xᵏ.
1.7 Summary
In this chapter, the following subjects have been discussed:
- Numerical mathematics;
- Rounding errors;
- Landau’s O -symbol;
- Some important concepts and theorems from analysis.
1.8 Exercises
1. Let f ( x ) = x3 . Determine the second-order Taylor polynomial of f about the point x = 1.
Compute the value of this polynomial in x = 0.5. Give an upper bound for the error and
compare this with the actual error.
2. Let f ( x ) = e x . Give the nth order Taylor polynomial about the point x = 0, and the remain-
der term. How large should n be chosen in order to make the error less than 10−6 in the
interval [0, 0.5]?
3. We use the polynomial P2(x) = 1 − x²/2 to approximate f(x) = cos x in the interval
[−1/2, 1/2]. Give an upper bound for the error in this approximation.
4. Let x = 1/3, y = 5/7. We calculate with a precision of three (decimal) digits. Express x
and y as floating point numbers. Compute f l ( f l ( x ) ◦ f l (y)), x ◦ y and the rounding error
taking ◦ = +, −, ∗, / respectively.
5. Write a program to determine the machine precision of your system.
Hint: start with µ = 1. Divide µ repeatedly by a factor 2 until the test 1 + µ = 1 holds. The
resulting value is close to the machine precision of your system.
Chapter 2
Interpolation
2.1 Introduction
In practical applications it is frequently the case that only a limited amount of measurement data
is available, from which intermediate values should be determined (interpolation) or values out-
side the range of measurements should be predicted (extrapolation). An example is the number
of cars in The Netherlands, which is tabulated in Table 2.1 (per 1000 citizens) from 1990 every fifth
year up to 2010. How can these numbers be used to estimate the number of cars in intermediate
years, e.g. in 2008, or to predict the number in the year 2020?
Table 2.1: Number of cars (per 1000 citizens) in The Netherlands (source: Centraal Bureau voor
de Statistiek).
A better way of interpolation is a straight line between two points (see Figure 2.1). Suppose that
the function values f ( x0 ) and f ( x1 ) at the points x0 and x1 are known (x0 < x1 ). In that case,
it seems plausible to take the linear function (a straight line) through ( x0 , f ( x0 )), ( x1 , f ( x1 )), in
Figure 2.1: Linear interpolation polynomial, L1, of the function f(x) = 1/x, x0 = 0.6, x = 0.95, x1 = 1.3.
L1(x) = L01(x) f(x0) + L11(x) f(x1),
¹ Whenever the word 'approximation' occurs in this chapter, we mean that the unknown function value is approximated.
which is called the Lagrange form of the interpolation polynomial, or the Lagrange interpolation
polynomial for short. The interpolation points x0 and x1 are often called nodes.
Theorem 2.2.1 provides a statement for the interpolation error if the interpolated function is at
least twice continuously differentiable.
Theorem 2.2.1 Let x0 and x1 be distinct nodes in [ a, b], x0 < x1 , f ∈ C2 [ a, b] and let L1 be the linear
interpolation polynomial that equals f at the nodes x0 , x1 . Then for each x ∈ [ a, b] there exists a ξ ∈ ( a, b)
such that

f(x) − L1(x) = (1/2)(x − x0)(x − x1) f′′(ξ). (2.3)
Proof
If x = x0 or x = x1 , then f ( x ) − L1 ( x ) = 0 and ξ can be chosen arbitrarily. Assume x 6= x0 and
x 6= x1 . For each x there exists a constant q such that
f ( x ) − L1 ( x ) = q( x − x0 )( x − x1 ).
To find an expression for q consider the function
ϕ(t) = f (t) − L1 (t) − q(t − x0 )(t − x1 ).
ϕ satisfies ϕ( x0 ) = ϕ( x1 ) = ϕ( x ) = 0. By Rolle’s theorem, there exist at least two points y and z
in ( a, b) such that ϕ′ (y) = ϕ′ (z) = 0. Again by Rolle’s theorem there is a ξ between y and z and
hence ξ ∈ (a, b) such that ϕ′′(ξ) = 0. Because ϕ′′(t) = f′′(t) − 2q this means that q = (1/2) f′′(ξ). ⊠
If x 6∈ [ x0 , x1 ], then the linear polynomial (2.1) can be used to extrapolate. Relation (2.3) is still the
correct expression for the error.
From Theorem 2.2.1, an upper bound for the linear interpolation error follows:

|f(x) − L1(x)| ≤ (1/8)(x1 − x0)² max_{ξ∈(a,b)} |f′′(ξ)|. (2.4)
In many practical applications the values f ( x0 ) and f ( x1 ) are a result of measurements (per-
turbed) or calculations (round-off errors). Assume that the measurement error is bounded by
ε, that is, | f ( x0 ) − fˆ( x0 )| ≤ ε and | f ( x1 ) − fˆ( x1 )| ≤ ε, where the ˆ (hat) refers to the measured,
hence available data. The difference between the exact linear interpolation polynomial L1 and
the perturbed polynomial L̂1 is bounded by
|L1(x) − L̂1(x)| ≤ ((|x1 − x| + |x − x0|)/(x1 − x0)) ε. (2.5)
For x ∈ [ x0 , x1 ] (interpolation), the error resulting from uncertainties in the measurements is
bounded by ε.
However, if x ∈ / [ x0 , x1 ] (extrapolation), the error can grow to arbitrarily large values when
moving away from the interval [ x0 , x1 ]. Suppose that x ≥ x1 , then the additional inaccuracy
is bounded by
|L1(x) − L̂1(x)| ≤ (1 + 2(x − x1)/(x1 − x0)) ε. (2.6)
In Figure 2.2, the impact of the rounding or measurement errors on linear interpolation and ex-
trapolation is visualized. It graphically illustrates the danger of extrapolation, where the region
of uncertainty, and hence, the errors due to uncertainties in the data, make the extrapolation error
large. Note that the truncation error in equation (2.3) also becomes arbitrarily large outside the
interval [ x0 , x1 ].
The total error is the sum of the interpolation/extrapolation error and the measurement error.
Example 2.2.1 (Linear interpolation)
In Table 2.2, the value of the sine function is presented for 36◦ and 38◦ . The approximation for
37◦ using the linear interpolation polynomial provides a result of 0.601723. The difference with
the exact value is only 0.9 × 10⁻⁴.
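This computation is easily repeated; the following sketch evaluates the linear interpolation polynomial directly from the function values at 36° and 38° (computed here with math.sin rather than taken from Table 2.2).

    import math

    x0, x1, x = 36.0, 38.0, 37.0                  # degrees
    f0, f1 = math.sin(math.radians(x0)), math.sin(math.radians(x1))
    L1 = f0 + (x - x0) / (x1 - x0) * (f1 - f0)    # linear interpolation polynomial at x
    print(round(L1, 6))                           # 0.601723
    print(abs(math.sin(math.radians(x)) - L1))    # about 0.9e-4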
Figure 2.2: Interpolation and extrapolation errors resulting from measurement errors, using, as
an example, x0 = 0, x1 = 1, f ( x0 ) = 1, f ( x1 ) = 2, and ε = 1/2. The region of uncertainty is filled.
⎡ 1   x0   x0²  . . .  x0ⁿ ⎤ ⎡ c0 ⎤   ⎡ f(x0) ⎤
⎢ ⋮    ⋮    ⋮            ⋮ ⎥ ⎢ ⋮  ⎥ = ⎢   ⋮   ⎥
⎣ 1   xn   xn²  . . .  xnⁿ ⎦ ⎣ cn ⎦   ⎣ f(xn) ⎦
from which the unknown coefficients ci, i = 0, . . . , n, can be determined. This matrix, in which
each row consists of the terms of a geometric series, is called a Vandermonde matrix.
It is again possible to avoid the linear system of equations by using Lagrange basis polynomials.
To this end, the nth degree interpolation polynomial is written as
Ln(x) = ∑_{k=0}^{n} f(xk) Lkn(x), (2.7)
in which the nth degree Lagrange basis polynomials satisfy Lkn ( x j ) = δkj in the nodes x j (see
definition 1.6.4). The Lagrange basis polynomials Lkn ( x ) are explicitly given by
Lkn(x) = ((x − x0) · · · (x − xk−1)(x − xk+1) · · · (x − xn)) / ((xk − x0) · · · (xk − xk−1)(xk − xk+1) · · · (xk − xn)), (2.8)
and can also be written as:
Lkn(x) = ω(x)/((x − xk) ω′(xk)), with ω(x) = ∏_{i=0}^{n} (x − xi).
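A direct implementation of (2.7) and (2.8) takes only a few lines; the sketch below (with nodes and evaluation point chosen for illustration) evaluates the Lagrange interpolation polynomial without ever forming the Vandermonde system.

    def lagrange_eval(xs, fs, x):
        # Evaluate L_n(x) = sum_k f(x_k) L_kn(x), with L_kn as in (2.8).
        total = 0.0
        for k in range(len(xs)):
            Lk = 1.0
            for i in range(len(xs)):
                if i != k:
                    Lk *= (x - xs[i]) / (xs[k] - xs[i])
            total += fs[k] * Lk
        return total

    # Interpolate f(x) = 1/x at the nodes 0.5, 1.0, 1.5 (n = 2).
    xs = [0.5, 1.0, 1.5]
    fs = [1.0 / x for x in xs]
    print(lagrange_eval(xs, fs, 0.75))  # about 1.4167; the exact value is 1/0.75 = 1.3333...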
Table 2.3: Upper bounds for ∑_{k=0}^{n} |Lkn(x)|.

        x ∈ [x0, x1]   x ∈ [x1, x2]   x ∈ [x2, x3]   x ∈ [x3, x4]   x ∈ [x4, x5]
n = 1       1
n = 2       1.25           1.25
n = 3       1.63           1.25           1.63
n = 4       2.3            1.4            1.4            2.3
n = 5       3.1            1.6            1.4            1.6            3.1
Assume that the interpolation points are equidistantly distributed on an interval [ a, b]. One might
wonder whether it is possible to approximate a provided (sufficiently smooth) function arbitrar-
ily close using a Lagrange interpolation polynomial, by increasing the degree n of the interpo-
lation polynomial. Surprisingly enough this is not the case. A well known example of this was
suggested by Runge.
Consider the function 1/(1 + x2 ) on the interval [−5, 5], with nodes xk = −5 + 10k/n, where
k = 0, . . . , n. In Figure 2.3(a) the function is visualized together with the 6th and 14th degree
interpolation polynomials. Note that the 14th degree polynomial approximation is better than
that of degree 6 on the interval [-3, 3]. In the neighborhood of the end points, however, the 14th
degree polynomial exhibits large oscillations.
The oscillations of the interpolation polynomial near the boundary of the interval are known as
Runge’s phenomenon. The above example leads to the question whether Runge’s phenomenon
can be avoided. In order to answer this question, the expression for the interpolation error in
Theorem 2.3.1 can be considered. This expression contains two factors that can become large:
the unknown factor f ( n+1) (ξ ) and the polynomial factor ( x − x0 ) · · · ( x − x n ). Clearly, f ( n+1) (ξ )
cannot be influenced, but the polynomial ( x − x0 ) · · · ( x − xn ) can be made small in some sense
by choosing better interpolation points xi , i = 0, . . . , n. More specifically, the interpolation points
should be chosen such that
max_{x∈[a,b]} |(x − x0) · · · (x − xn)|
is minimized over all possible choices for the interpolation points x0 , . . . , x n . This problem was
solved in the 19th century by the Russian mathematician Chebyshev, who found that the optimal
interpolation points are given by
xk = (a + b)/2 + ((b − a)/2) cos((2k + 1)π/(2(n + 1))), k = 0, . . . , n,
which are called Chebyshev nodes. In Figure 2.3(b), it can be seen that the interpolation polynomi-
als using Chebyshev nodes provide much better approximations compared to the equidistantly
distributed ones in Figure 2.3(a). It can be shown that by using Chebyshev nodes, a continuously
differentiable function can be approximated with arbitrarily small error by increasing the degree
n of the interpolation polynomial.
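Runge's example is easily reproduced. The sketch below compares the maximum interpolation error over [−5, 5] for equidistant and Chebyshev nodes with n = 14; lagrange_eval repeats the evaluation routine of the previous sketch.

    import math

    def chebyshev_nodes(a, b, n):
        # x_k = (a+b)/2 + (b-a)/2 * cos((2k+1) pi / (2(n+1))), k = 0, ..., n.
        return [(a + b) / 2 + (b - a) / 2 * math.cos((2 * k + 1) * math.pi / (2 * (n + 1)))
                for k in range(n + 1)]

    def lagrange_eval(xs, fs, x):
        total = 0.0
        for k in range(len(xs)):
            Lk = 1.0
            for i in range(len(xs)):
                if i != k:
                    Lk *= (x - xs[i]) / (xs[k] - xs[i])
            total += fs[k] * Lk
        return total

    f = lambda t: 1.0 / (1.0 + t * t)   # Runge's function
    n = 14
    grid = [i / 100.0 for i in range(-500, 501)]
    for xs in ([-5.0 + 10.0 * k / n for k in range(n + 1)], chebyshev_nodes(-5.0, 5.0, n)):
        fs = [f(t) for t in xs]
        print(max(abs(f(t) - lagrange_eval(xs, fs, t)) for t in grid))
    # The equidistant error is large near the end points; the Chebyshev error is small.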
Large errors may also occur in extrapolation. Consider the function f ( x ) = 1/x. The nth degree
interpolation polynomial is determined with nodes xk = 0.5 + k/n, k = 0, . . . , n. In Figure 2.4,
f ( x ) and the interpolation polynomials of degree 6 and 10 are shown on the interval [0.5, 1.8].
The polynomials and the function itself are indistinguishable on the interval [0.5, 1.5]. With
extrapolation ( x ≥ 1.5), however, large errors occur. Again the largest error occurs in the highest-
degree polynomial.
Figure 2.3: Interpolation of f(x) = 1/(1 + x²) with Lagrange polynomials of degree 6 and 14: (a) equidistant nodes; (b) Chebyshev nodes.
The derivatives of f(x) = 1/x are given by f^(k)(x) = (−1)ᵏ k! x^(−(k+1)), such that the nth degree Taylor polynomial about x = 1 equals

pn(x) = ∑_{k=0}^{n} f^(k)(1)(x − 1)ᵏ/k! = ∑_{k=0}^{n} (−1)ᵏ(x − 1)ᵏ.
The values of pn (3) that should approximate f (3) = 1/3 are tabulated in Table 2.4. The approxi-
mation becomes less accurate with increasing polynomial degree.
Table 2.4: Approximation of f ( x ) = 1/x in x = 3, using an nth degree Taylor polynomial about
x = 1.
n       0    1    2    3    4     5     6     7
pn(3)   1   −1    3   −5   11   −21    43   −85
Figure 2.4: Extrapolation of f ( x ) = 1/x with Lagrange polynomials of degree 6 and 10.
⎡ 1   x0   x0²    x0³  ⎤ ⎡ c0 ⎤   ⎡ f(x0)  ⎤
⎢ 1   x1   x1²    x1³  ⎥ ⎢ c1 ⎥ = ⎢ f(x1)  ⎥    (2.9)
⎢ 0   1    2x0    3x0² ⎥ ⎢ c2 ⎥   ⎢ f′(x0) ⎥
⎣ 0   1    2x1    3x1² ⎦ ⎣ c3 ⎦   ⎣ f′(x1) ⎦
In the same spirit as for Lagrange interpolation, this polynomial can be written as

H3(x) = ∑_{k=0}^{1} ( f(xk) Hk1(x) + f′(xk) Ĥk1(x) ),

with

Hk1(xj) = δjk,   Ĥk1(xj) = 0,
H′k1(xj) = 0,    Ĥ′k1(xj) = δjk.
This avoids having to solve system (2.9). Polynomials based both on given function values and
given derivatives are called Hermite interpolation polynomials.
The general expression for Hermite interpolation polynomials containing function values and
first derivatives is
H2n+1(x) = ∑_{k=0}^{n} f(xk) Hkn(x) + ∑_{k=0}^{n} f′(xk) Ĥkn(x),

in which

Hkn(x) = (1 − 2(x − xk) L′kn(xk)) L²kn(x),

and

Ĥkn(x) = (x − xk) L²kn(x).
The functions Lkn ( x ) are the Lagrange basis polynomials, as defined in equation (2.8).
The correctness of the Hermite interpolation polynomial is shown as follows: Hkn and Ĥkn are polynomials of degree 2n + 1. Note that Ĥkn(xj) = 0 if k ≠ j, because in that case, Lkn(xj) = 0. If k = j, then, because Lkn(xk) = 1,

Ĥ′kn(x) = L²kn(x) + 2(x − xk) Lkn(x) L′kn(x).

Hence Ĥ′kn(xj) = δjk and therefore H′2n+1(xj) = f′(xj). Indeed, the Hermite interpolation polynomial satisfies the conditions H2n+1(xj) = f(xj) and H′2n+1(xj) = f′(xj), j = 0, . . . , n.
Theorem 2.4.1 Let x0 , . . . , x n be distinct nodes in [ a, b], x0 < x1 < · · · < xn . Let f ∈ C2n+2 [ a, b]
and let H2n+1 be the Hermite interpolation polynomial that equals f and f ′ at these nodes. Then for each
x ∈ [ a, b] there exists a ξ ∈ ( a, b) such that
f(x) − H2n+1(x) = ((x − x0)² · · · (x − xn)²/(2n + 2)!) f^(2n+2)(ξ).
Proof:
The proof of this theorem is analogous to that of Theorem 2.2.1. The auxiliary function is chosen
as follows:
ϕ(t) = f(t) − H2n+1(t) − ((t − x0)² · · · (t − xn)²)/((x − x0)² · · · (x − xn)²) · (f(x) − H2n+1(x)).
Seismic waves, the same type of waves used to study earthquakes, are also used to explore deep
underground for reservoirs of oil and natural gas. A simple model for wave propagation is
described by the following set of ordinary differential equations:
dx/dt = c sin θ,
dz/dt = −c cos θ,
dθ/dt = −(dc/dz) sin θ.
The position is denoted by ( x, z) while θ is the angle between wave front and x-axis. Suppose
that the propagation speed c depends on the vertical position only and that it is known at a
finite number of positions. However, in order to solve this system an approximation of c(z) in
all intermediate points is required. If piecewise linear interpolation is used, the derivative c′ (z)
does not exist in the nodes, which may lead to large errors in the solution. If both c and c′(z) are known in the nodes, then the third degree Hermite interpolation polynomial can be used in each interval, and the first derivative then also exists at the nodes.
Figure 2.5: Interpolation of f ( x ) = 1/(1 + x3 ) using two intervals: linear, Hermite (degree 3) and
Lagrange (degree 3) interpolation.
the derivative at the nodes are often unknown. One way to ensure smoothness without knowing
the derivatives is to use so-called splines.
A spline is a piecewise polynomial that is connected smoothly in the nodes. We define the nodes
a = x0 < x1 < · · · < xn = b, which partition the interval [ a, b]. A spline of degree p is defined as
the piecewise function
s(x) = s0(x),    x ∈ [x0, x1],
       s1(x),    x ∈ [x1, x2],
       ⋮
       sn−1(x),  x ∈ [xn−1, xn],
where sk(x) on [xk, xk+1] is a polynomial of degree p, and a smooth connection of sk and its derivatives s′k, . . . , sk^(p−1) is required in the nodes.
In this section, only linear (p = 1) and cubic (p = 3) interpolation splines are considered.
Definition 2.5.1 (Spline of degree 1) Let x0, . . . , xn be distinct nodes in [a, b], x0 < x1 < · · · < xn. For f ∈ C[a, b], the interpolation spline s of degree 1 is a function s ∈ C[a, b] such that s is a polynomial of degree 1 on each subinterval [xk, xk+1] and s(xk) = f(xk) for k = 0, . . . , n.
Note that an interpolation spline of degree 1 consists of piecewise linear interpolation polynomi-
als.
A spline of degree 3, also called a cubic spline, consists of polynomials of degree 3. In the nodes
the value is equal to the function value, and further, the first and second derivatives are continu-
ous. Splines were originally developed for shipbuilding in the days before computer modeling.
Naval architects needed a way to draw a smooth curve through a set of points. The solution
was to place metal weights (called knots or ducks) at the control points and bend a thin metal or
wooden beam (called a spline) through the weights. The physics of the bending spline meant that
the influence of each weight was largest at the point of contact and decreased smoothly further
along the spline. To get more control over a certain region of the spline the draftsman simply
added more weights.
Definition 2.5.2 (Natural spline of degree 3) Let the nodes x0 , . . . , x n be distinct in [ a, b] and satisfy
x0 < x1 < · · · < x n . For a function f ∈ C [ a, b], the interpolating spline s of degree 3 has the following
properties:
a. On each subinterval [ xk , xk+1 ], s is a third degree polynomial sk , k = 0, . . . , n − 1,
b. The values at the nodes satisfy s( xk ) = f ( xk ), for k = 0, . . . , n,
c. sk ( x k+1 ) = sk+1 ( xk+1 ), k = 0, . . . , n − 2,
s′k ( x k+1 ) = s′k+1 ( xk+1 ), k = 0, . . . , n − 2,
s′′k ( xk+1 ) = s′′k+1 ( xk+1 ), k = 0, . . . , n − 2.
In order to uniquely determine the unknowns, two extra conditions are required. A viable choice is to
prescribe
d. s0′′ ( x0 ) = s′′n−1 ( xn ) = 0,
which implies that the curvature vanishes at the boundary.
Note that s ∈ C2 [ a, b]. The conditions under d could be replaced with other conditions, but these
so-called natural boundary conditions belong to the shipbuilding application.
In order to demonstrate how to determine such an interpolating spline, sk is expressed as
sk(x) = ak(x − xk)³ + bk(x − xk)² + ck(x − xk) + dk, k = 0, . . . , n − 1. (2.10)
By defining the value bn as bn = s′′n−1(xn)/2, the relation 6ak hk + 2bk = 2bk+1 holds for k = n − 1 as well. This means that ak can be expressed in bk:

ak = (bk+1 − bk)/(3hk), k = 0, . . . , n − 1. (2.12)

ck = (fk+1 − fk)/hk − hk(2bk + bk+1)/3, k = 0, . . . , n − 1. (2.13)
[Figure 2.6: the cubic spline s interpolating the function f on [0, 4].]
Property d implies that s0′′ ( x0 ) = 2b0 = 0, such that b0 = 0. Using the definition bn = s′′n−1 ( xn )/2
(step 1), we also find bn = 0. Therefore, the resulting system consists of n − 1 equations for n − 1
unknowns b1 , . . . , bn−1 :
⎡ 2(h0 + h1)   h1                               ⎤ ⎡ b1   ⎤   ⎡ 3((f2 − f1)/h1 − (f1 − f0)/h0)                  ⎤
⎢ h1           2(h1 + h2)   h2                  ⎥ ⎢ b2   ⎥   ⎢                  ⋮                              ⎥
⎢      ⋱             ⋱             ⋱            ⎥ ⎢ ⋮    ⎥ = ⎢                  ⋮                              ⎥
⎣              hn−2         2(hn−2 + hn−1)      ⎦ ⎣ bn−1 ⎦   ⎣ 3((fn − fn−1)/hn−1 − (fn−1 − fn−2)/hn−2)        ⎦
From this system the values bk can be calculated, from which ak and ck can be determined using
equations (2.12) and (2.13) and b0 = bn = 0.
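The construction above translates directly into code. The following Python sketch assembles and solves the tridiagonal system for b1, . . . , bn−1 and recovers ak, ck and dk; the hill function in the example is an assumed stand-in, since the exact function of Example 2.4.3 is defined elsewhere in the book.

    import numpy as np

    def natural_cubic_spline(x, f):
        # Coefficients a_k, b_k, c_k, d_k of s_k in (2.10) for the natural
        # cubic spline (a sketch; at least two subintervals are assumed).
        x, f = np.asarray(x, dtype=float), np.asarray(f, dtype=float)
        n = len(x) - 1
        h = np.diff(x)                      # h_k = x_{k+1} - x_k
        A = np.zeros((n - 1, n - 1))        # tridiagonal matrix for b_1, ..., b_{n-1}
        rhs = np.zeros(n - 1)
        for i in range(1, n):
            if i > 1:
                A[i - 1, i - 2] = h[i - 1]
            A[i - 1, i - 1] = 2.0 * (h[i - 1] + h[i])
            if i < n - 1:
                A[i - 1, i] = h[i]
            rhs[i - 1] = 3.0 * ((f[i + 1] - f[i]) / h[i] - (f[i] - f[i - 1]) / h[i - 1])
        b = np.concatenate(([0.0], np.linalg.solve(A, rhs), [0.0]))  # b_0 = b_n = 0
        a = (b[1:] - b[:-1]) / (3.0 * h)                             # equation (2.12)
        c = np.diff(f) / h - h * (2.0 * b[:-1] + b[1:]) / 3.0        # equation (2.13)
        return a, b[:-1], c, f[:-1]                                  # d_k = f(x_k)

    x = np.linspace(0.0, 4.0, 7)            # six subintervals, as in Example 2.5.1
    f = 1.0 / (1.0 + (x - 2.0)**2)          # an assumed symmetric hill
    a, b, c, d = natural_cubic_spline(x, f)
    print(a[0], b[0], c[0], d[0])           # coefficients of s_0 on [x_0, x_1]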
Example 2.5.1 (Visualization)
In this example, a cubic spline on six subintervals is used to approximate a symmetric hill (ex-
ample 2.4.3). The result in Figure 2.6 is of higher quality than the Hermite and Lagrange inter-
polation approximation in Figure 2.5. Note that for Hermite interpolation, the derivatives at the
nodes should be known, whereas this is not required in the cubic spline approach.
Remark
Splines are used in CAD and experience a revival in modern numerical methods for partial dif-
ferential equations.
2.6 Summary
In this chapter, the following subjects have been discussed:
- Linear interpolation;
- Higher-order Lagrange interpolation;
- Interpolation with derivatives:
- Taylor polynomial;
- Interpolation in general;
- Hermite interpolation;
- Spline interpolation.
2.7 Exercises
1. (a) Prove upper bound (2.4) for linear interpolation:
|f(x) − L1(x)| ≤ (1/8)(x1 − x0)² max_{ξ∈(a,b)} |f′′(ξ)|.
(b) Show that the difference between the exact and the perturbed linear interpolation
polynomial is bounded by
|L1(x) − L̂1(x)| ≤ ((|x1 − x| + |x − x0|)/(x1 − x0)) ε,
(a) Compute the Lagrange interpolation polynomial of degree 3. Use this polynomial to
estimate the maximum speed of the car.
(b) Compute the cubic spline that interpolates the above data. Use this spline to estimate
the maximum speed of the car.
Chapter 3
Numerical differentiation
3.1 Introduction
Everyone who possesses a car and/or a driver’s licence is familiar with speeding tickets. In
The Netherlands, speeding tickets are usually processed in a fully automated fashion, and the
perpetrator will receive the tickets within a couple of weeks after the offence. The Dutch police
optimized the procedures of speed control such that this effort has become very profitable to the
Dutch government. Various strategies for speed control are carried out by police forces, which
are all based on the position of the vehicle at consecutive times. The actual velocity follows from
the first-order derivative of the position of the vehicle with respect to time. Since no explicit
formula for this position is available, the velocity can only be estimated using an approximation
of the velocity based on several discrete vehicle positions at discrete times. This motivates the use
of approximate derivatives, also called numerical derivatives. If the police want to know whether
the offender drove faster before speed detection (in other words, whether the perpetrator hit the
brakes after having seen the police patrol), or whether the driver was already accelerating, then
they are also interested in the acceleration of the ’bad guy’. This acceleration can be estimated
using numerical approximations of the second-order derivative of the car position with respect
to time.
Since the time-interval of recording is nonzero, the velocity is not determined exactly in general.
In this chapter, the resulting error, referred to as the truncation error, is estimated using Taylor se-
ries. In most cases, the truncation error increases with an increasing size of the recording interval
(Sections 3.2 and 3.4). Next to the truncation error, the measurement of the position of the vehicle
is also prone to measurement errors. Issues that influence the results are, for example, paral-
lax, the measurement equipment, and in some cases even the performance of the police officer
(in car-videoing and laser control). These measurement errors provide an additional deteriora-
tion of the approximation of the speed and acceleration. The impact of measurement errors on
approximations of derivatives is treated in Section 3.3.
Qf(h) = (f(x + h) − f(x))/h, h > 0,

in which h is called the step size. By definition,

lim_{h→0} (f(x + h) − f(x))/h = f′(x),
and hence, the forward difference converges to the derivative. The truncation error is defined as
Rf(h) = f′(x) − (f(x + h) − f(x))/h. (3.1)
Theorem 3.2.1 If f ∈ C2 [ x, x + h], then there exists a ξ ∈ ( x, x + h) such that
Rf(h) = −(h/2) f′′(ξ). (3.2)
Proof
Taylor expansion of f ( x + h) about x yields
f(x + h) = f(x) + h f′(x) + (h²/2) f′′(ξ), with ξ ∈ (x, x + h).
Substitution into (3.1) yields
Rf(h) = f′(x) − (f(x) + h f′(x) + (h²/2) f′′(ξ) − f(x))/h = −(h/2) f′′(ξ),
which completes the proof. ⊠
Example 3.2.1 (Forward difference)
In this example, the derivative of the function f ( x ) = − x3 + x2 + 2x is approximated in x = 1
using the step size h = 0.5. In Figure 3.1, it can be seen that the approximation of the derivative
is not yet very accurate.
[Figure 3.1: the function f(x) = −x³ + x² + 2x with the exact derivative at x = 1 and its forward- and central-difference approximations, h = 0.5.]
which means that the accuracy is comparable to that of the forward difference.
Note that the errors have opposite sign. Moreover, ξ and η will differ, but for small h this differ-
ence is very small. It may therefore be interesting to average these two formulae such that these
errors almost cancel. This average is known as the central difference:
Qc(h) = (f(x + h) − f(x − h))/(2h).
For this particular example, this results in a better approximation (Figure 3.1). If the third deriva-
tive of f exists and is continuous, then the truncation error in the central difference becomes
Rc(h) = f′(x) − (f(x + h) − f(x − h))/(2h) = −(h²/6) f′′′(ξ),
with ξ between x − h and x + h. To prove this, f is expanded in a Taylor polynomial about x with
remainder term, and the intermediate-value theorem is used. Note that the error in the central-
difference approximation converges much faster towards zero as a function of h, than the error
of the forward or backward formula.
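The different convergence rates are easy to observe numerically. The sketch below approximates the derivative of f(x) = e^x at x = 1 (a test function chosen for illustration, as in the next section) with both formulae for a sequence of step sizes.

    import math

    x = 1.0
    for h in [0.1, 0.05, 0.025, 0.0125]:
        Qf = (math.exp(x + h) - math.exp(x)) / h              # forward difference, O(h)
        Qc = (math.exp(x + h) - math.exp(x - h)) / (2.0 * h)  # central difference, O(h^2)
        print(h, abs(math.exp(x) - Qf), abs(math.exp(x) - Qc))
    # Halving h roughly halves the forward-difference error and divides the
    # central-difference error by four.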
Remark
In the limit h → 0, a central difference approximation approximates the derivative more accurately. However, for a given h the error can still be larger than when forward or backward differences are used.
Table 3.1: Value of e^x, six digits.

Table 3.2: Results of the central-difference formula.
In Table 3.2, it can be seen that for decreasing h, the approximation becomes better at first but
it deteriorates rapidly for very small h, which is caused by the rounding errors in the function
values.
For h ≤ 0.1, we know that m = max{|f′′′(ξ)|, ξ ∈ (1 − h, 1 + h)} = e^1.1. The function ϕ(h) and the upper bounds of |Rc(h)| (equal to h²m/6) and Sc(h) (equal to ε/h) that belong to this example
are drawn in Figure 3.2. Indeed, it can be seen that, for decreasing h, the upper bound of the
truncation error decreases, whereas the rounding-error bound increases. It is easy to verify that
ϕ(h) is minimal for hopt = 0.0171, with ϕ(hopt ) = 0.00044. Note that ϕ(h) is almost constant in
the neighborhood of hopt (in particular for h > hopt ), hence a step size close to hopt will lead to a
comparable accuracy.
Remarks
• If the approximation of the derivative is insufficiently accurate, then there are two possible
approaches to improve this:
- Decrease the rounding and/or measurement error in the function values;
- Use a higher-order difference formula. In that case the truncation error, R(h), is often
smaller than for a lower-order formula. Note that h should not be too small, in order to avoid that the loss of significant digits due to the rounding error annihilates the improvement.
• Difference formulae are often used to solve differential equations (Chapter 7). In that case
the solution to the differential equation has to be determined accurately rather than the
derivatives. The behavior with respect to rounding errors is often much better in that case.
• The effect of rounding errors is large when the error in the function values is large and if
h is small. This often happens with inherently imprecise measurements. In a computer
generated table, the rounding error is often much smaller than in a table generated by
measurements, but still noticeable for a small h.
• For h < hopt , the measurement or rounding error may even explode when h → 0.
Figure 3.2: Upper bounds of absolute truncation error, rounding error, and total error, using
central differences in approximating the derivative of f ( x ) = e x in x = 1.
f(x − h) = f(x) − h f′(x) + (h²/2) f′′(x) − (h³/3!) f′′′(x) + O(h⁴),
f(x) = f(x),
f(x + h) = f(x) + h f′(x) + (h²/2) f′′(x) + (h³/3!) f′′′(x) + O(h⁴).
Substituting these expansions in equation (3.4) yields
f(x) = f(x),
f(x + h) = f(x) + h f′(x) + (h²/2) f′′(x) + O(h³),
f(x + 2h) = f(x) + 2h f′(x) + 2h² f′′(x) + O(h³).
Solving this system yields α0 = −3/2, α1 = 2 and α2 = −1/2, such that the one-sided difference formula equals

Q(h) = (−3f(x) + 4f(x + h) − f(x + 2h))/(2h), (3.6)
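A quick numerical check (with f(x) = sin x as an assumed test function) confirms that formula (3.6) is second-order accurate:

    import math

    def one_sided(f, x, h):
        # One-sided formula (3.6): (-3 f(x) + 4 f(x+h) - f(x+2h)) / (2h).
        return (-3.0 * f(x) + 4.0 * f(x + h) - f(x + 2.0 * h)) / (2.0 * h)

    x = 1.0
    for h in [0.1, 0.05, 0.025]:
        print(h, abs(math.cos(x) - one_sided(math.sin, x, h)))
    # The error is divided by about four each time h is halved: O(h^2).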
If x0 , x1 , . . . , x n do not coincide and f ∈ C n+1 [ a, b], then the Lagrange interpolation formulation
is given by
f(x) = ∑_{k=0}^{n} f(xk) Lkn(x) + (x − x0) · · · (x − xn) f^(n+1)(ξ(x))/(n + 1)!,

f′(x) = ∑_{k=0}^{n} f(xk) L′kn(x) + d/dx ((x − x0) · · · (x − xn)) · f^(n+1)(ξ(x))/(n + 1)!
      + (x − x0) · · · (x − xn) · d/dx ( f^(n+1)(ξ(x))/(n + 1)! ). (3.7)
The last term generally cannot be evaluated. However, if x = x j , then equation (3.7) becomes
f′(xj) = ∑_{k=0}^{n} f(xk) L′kn(xj) + ( ∏_{k=0, k≠j}^{n} (xj − xk) ) f^(n+1)(ξ(xj))/(n + 1)!.
L01(x) = (x − x1)/(x0 − x1),  L′01(x) = −1/h,
L11(x) = (x − x0)/(x1 − x0),  L′11(x) = 1/h.

f′(x0) = −(1/h) f(x0) + (1/h) f(x0 + h) − (h/2) f′′(ξ),
f(x + h) = f(x) + h f′(x) + (h²/2) f′′(x) + (h³/3!) f′′′(x) + O(h⁴),
f(x) = f(x),
f(x − h) = f(x) − h f′(x) + (h²/2) f′′(x) − (h³/3!) f′′′(x) + O(h⁴).
The conditions for α−1 , α0 and α1 are
f(x):   α−1/h² + α0/h² + α1/h² = 0,
f′(x):  −α−1/h + α1/h = 0,
f′′(x): (1/2)α−1 + (1/2)α1 = 1.
Solving this system yields α−1 = 1, α0 = −2 and α1 = 1, such that

Q(h) = (f(x + h) − 2f(x) + f(x − h))/h². (3.8)
Note that the error in this formula is O(h2 ).
Rule of thumb 2
To derive a numerical method for the kth derivative with truncation error O(h p ), the Taylor
expansions should have a remainder term of order O(h p+k ).
Another way to determine the second derivative is the repeated application of the formula of
the first derivative. To approximate the second derivative numerically, at least three points are
required. Suppose that the second derivative is approximated using three equidistant function
values, f ( x − h), f ( x ), and f ( x + h).
The obvious way to approximate the second derivative is to take the difference of the first deriva-
tive at the points x + h and x − h. Using central differences, this means that the function values
in x + 2h and x − 2h are required. This can be avoided by introducing auxiliary points x ± h/2
(Figure 3.3) and taking a central difference of the derivatives at these points:
f′′(x) ≈ (f′(x + h/2) − f′(x − h/2))/h.
Applying a central-difference scheme again to the first-order derivatives yields
f′′(x) ≈ (1/h) ( (f(x + h) − f(x))/h − (f(x) − f(x − h))/h ),

which can be simplified to

Q(h) = (f(x + h) − 2f(x) + f(x − h))/h².
Figure 3.3: Nodes to approximate f ′′ ( x ) using a repeated central-difference formula.
f′′(x) − Q(h) = −(h²/12) f^(4)(ξ). (3.9)
Note that the effect of rounding errors is even more serious here. An upper bound for the error
caused by rounding errors is:
S(h) ≤ 4ε/h².
in which cp ≠ 0, p ∈ N. Note that the error has this form if it can be expressed as a Taylor series and that p is the order of the error.
In this section, Richardson’s extrapolation will be applied to difference formulae, but the method
can also be used for other types of approximations (interpolation, numerical integration etc.), as
long as the error has the form (3.10).
M − Q(h) = cp h^p. (3.11)
For a given h, Q(h) can be computed, but M, cp and p are still unknown. However, these values can be approximated by computing for example Q(h), Q(2h) and Q(4h). In that way, the following set of equations is obtained:

M − Q(4h) = cp (4h)^p, (3.12a)
M − Q(2h) = cp (2h)^p, (3.12b)
M − Q(h) = cp h^p. (3.12c)
By subtracting equation (3.12b) from (3.12a) and equation (3.12c) from (3.12b), M can be eliminated and only two unknowns are left:

Q(2h) − Q(4h) = cp h^p (4^p − 2^p), (3.13a)
Q(h) − Q(2h) = cp h^p (2^p − 1). (3.13b)
Next, c p and h can be eliminated by dividing these two expressions, which yields
(Q(2h) − Q(4h))/(Q(h) − Q(2h)) = 2^p, (3.14)
from which p may be determined. Substitution of p into (3.13b) provides an approximation for
c p h p , which is used in equation (3.12c) to find the error estimate M − Q(h).
M − Q(h) = cp h^p = (Q(h) − Q(2h))/(2^p − 1) = −0.0348 . . . .
In the above example, the value of p was computed using Richardson’s extrapolation. However,
using Theorem 3.2.1, it is clear that p = 1, and this value could have been used immediately in
equation (3.13b) in order to determine c p h p . In practice, more complex situations are found, and
the following complications may occur:
- The final result is a combination of various approximation methods. The influence of these
approximations on p is not always clear.
To reveal any of these complications it is good practice to verify whether the calculated p is close
to the p that follows from theory.
Multiplying equation (3.15a) by 2^p and subtracting equation (3.15b) from this yields

2^p (M − Q(h)) − (M − Q(2h)) = O(h^(p+1)),

such that

(2^p − 1) M − 2^p Q(h) + Q(2h) = O(h^(p+1)).

This means that

M = (2^p Q(h) − Q(2h))/(2^p − 1) + O(h^(p+1)). (3.16)
The value (2^p Q(h) − Q(2h))/(2^p − 1) is a new approximation formula for M with an accuracy that is one order higher than the order of Q(h).
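Both the practical error estimate and the extrapolated value can be demonstrated in a few lines. The sketch below applies them to the central difference approximation of the derivative of sin at x = 1 (an assumed example), for which p = 2.

    import math

    f, x, h = math.sin, 1.0, 0.1
    Q = lambda s: (f(x + s) - f(x - s)) / (2.0 * s)   # central difference, p = 2
    # Equation (3.14): estimate the order p from three step sizes.
    p_est = math.log2((Q(2 * h) - Q(4 * h)) / (Q(h) - Q(2 * h)))
    print(p_est)                                      # close to 2
    # Equation (3.13b): error estimate M - Q(h) = c_p h^p.
    print((Q(h) - Q(2 * h)) / (2**2 - 1), math.cos(x) - Q(h))
    # Equation (3.16): the extrapolated value is one order more accurate.
    M_new = (2**2 * Q(h) - Q(2 * h)) / (2**2 - 1)
    print(abs(math.cos(x) - M_new))                   # much smaller than |cos(1) - Q(h)|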
Multiplying equation (3.17) by 2 and subtracting (3.18) from this result yields
3.8 Summary
In this chapter, the following subjects have been discussed:
- Difference methods and truncation errors for the first derivative;
- The effect of measurement errors;
- Difference methods and truncation errors for higher-order derivatives;
- Richardson’s extrapolation:
- Practical error estimate;
- Increasing accuracy.
3.9 Exercises
1. Prove for f ∈ C3 [ x − h, x + h], that the truncation error in the approximation of f ′ ( x ) with
central differences is of order O(h2 ).
2. Assume that the position of a ship can be determined with a measurement error of at most
10 meters, and that the real position of the ship during start up is given by the function
S(t) = 0.5at2, in which S is expressed in meters and t in seconds. The velocity is ap-
proximated by a backward difference with step size h. Give the truncation error and the
measurement error in this formula. Take a = 0.004. Determine h such that the error in the
calculated velocity is minimal. How large is the error?
3. In this exercise, the central-difference formula
f′′(x) = (f(x − h) − 2f(x) + f(x + h))/h² − (h²/12) f^(4)(ξ), ξ ∈ (x − h, x + h)
is used to approximate the second derivative of f ( x ) = sin x in x = 1. The computer, and
hence Matlab, uses finite precision arithmetic, and for this reason it computes perturbed
function values fˆ( x ). The rounding error in the function value is bounded by
|f(x) − f̂(x)|/|f(x)| ≤ eps,

in which eps is the relative machine precision (assume that eps = 10⁻¹⁶).
(a) Give an upper bound for the total error (truncation error + rounding error) and mini-
mize this upper bound to find an optimal value for h.
(b) Take h = 1. Compute the true error in the approximation for f ′′ (1). Repeat this eight
times and divide h by 10 after every computation. Tabulate the results. Does the h for
which the error is minimal agree with the value found in question (a)? Remark: the
computations for this question have to be carried out using Matlab.
4. Let the function f ( x ) = sin x, x ∈ [0, π ] be given. Compute f ′ (1) by a central difference
with h = 0.1. Estimate the error by Richardson’s extrapolation.
5. Given are f ( x ), f ( x + h) and f ( x + 2h). Derive a formula of maximal order to approximate
f ′ ( x ).
Chapter 4
Nonlinear equations
4.1 Introduction
The pressure drop in a fluid in motion is examined. For a flow in a pipe with a circular cross
section of diameter D (meter), the Reynolds number, Re, is given by
Re = Dv/ν,
in which v (m/s) is the average flow velocity and ν (m2 /s) is the viscosity of the fluid. The flow is
called laminar if Re < 2100 (low flow velocity) and turbulent if Re > 3000. For 2100 ≤ Re ≤ 3000,
the flow is neither laminar nor turbulent.
For turbulent flows, the pressure drop between inflow and outflow is given by
P_out − P_in = ρwLv² / (2gD),
in which w is a friction coefficient, ρ (kg/m3 ) is the fluid density, L (m) is the length and g (m/s2 )
is the acceleration of gravity. If the fluid contains particles (sand, paper fibers), then the friction
coefficient w satisfies the equation
1/√w = (ln(Re√w) + 14 − 5.6/k) / k,
in which k is a parameter known from experiments.
In this chapter, numerical methods will be discussed that can be used to determine w if the values
of Re and k are known.
4.2 Definitions
In this chapter, various iterative methods will be considered to solve nonlinear equations of the
form f ( p) = 0. The point p is called a zero of the function f , or a root of the equation f ( x ) = 0.
First, some useful definitions and concepts are introduced.
Convergence
Each numerical method generates a sequence {p_n} = p₀, p₁, p₂, . . . which should converge to p:
lim_{n→∞} p_n = p. Assume that the sequence indeed converges, with p_n ≠ p for all n. If there exist
positive constants λ and α satisfying
lim_{n→∞} |p − p_{n+1}| / |p − p_n|^α = λ, (4.1)
then {p_n} is said to converge to p with order α and asymptotic constant λ.
Proof:
By induction,
lim_{n→∞} |p − p_n| ≤ lim_{n→∞} k^n |p − p₀| = 0,
because k < 1. Hence pn converges to p. ⊠
Stopping criteria
Because an iterative method will be used to approximate p, it is necessary to specify when the
method should stop. Some examples of stopping criteria are listed below.
1. If the solution, p, is known, then | p − pn | < ε is the most useful stopping criterion. Since p
is not known in general, this criterion cannot be used in practice.
2. Two successive approximations can be used in the stopping criterion | pn − pn−1 | < ε. How-
ever, for some methods, it is possible that | pn − pn−1 | < ε, whereas | p − pn | ≫ ε. In that
case the method stops at a wrong approximation of the solution.
3. Since p may be very large or very small, it is often more meaningful to consider the relative
error
|p_n − p_{n−1}| / |p_n| < ε, if p ≠ 0.
4. Finally, the accuracy of pn can be determined by the criterion | f ( p n )| < ε. If this cri-
terion is satisfied, and the first derivative of f exists and is continuous in the neighbor-
hood of p, we can bound the error of the approximation. Using the mean-value
theorem, we know that |f(p) − f(p_n)| = |f′(ξ)| · |p − p_n| for some ξ between p and
pn . Since f ( p) = 0 and | f ( p n )| < ε, we find that | f ′ (ξ )| · | p − pn | < ε. By defining
m = min{| f ′ (ξ )|, ξ between p and pn } it follows that
|p − p_n| < ε/m, if m ≠ 0. (4.2)
Uncertainty interval
In numerical simulations, the exact function values, f ( x ), are often unknown. Instead, an ap-
proximation of the function values is used, which is denoted by fˆ( x ), in which | f ( x ) − fˆ( x )| ≤ ε̄.
Using the perturbed function, fˆ, the zero of f cannot be determined exactly, as each point of the
set I = { x ∈ [ a, b] | | f ( x )| < ε̄} can be a solution.
The width of this uncertainty interval can be computed using that f ( p) = 0 (p is the zero of f ),
and assuming that x + is the zero of the maximally perturbed function fˆ( x ) = f ( x ) + ε̄, such that
fˆ( x + ) = 0.
Linearization of f(x⁺) about p yields
0 = f̂(x⁺) = f(x⁺) + ε̄ ≈ f(p) + (x⁺ − p) f′(p) + ε̄ = (x⁺ − p) f′(p) + ε̄,
such that x⁺ − p ≈ −ε̄/f′(p): the width of the uncertainty interval is of the order ε̄/|f′(p)|.
Convergence
Note that this method always converges to a solution: lim_{n→∞} p_n = p. In general, it converges
slowly, and it may happen that | p − pn−1 | ≪ | p − pn |. The unconditional convergence makes
the Bisection method a useful method for computing a starting estimate for the more efficient
methods that will be discussed later in this chapter.
Theorem 4.3.1 Assume f ∈ C [ a, b] and f ( a) · f (b) < 0. Then the Bisection method generates a sequence
{ pn } converging to a zero p of f in which
|p − p_n| ≤ (b − a) / 2^{n+1}, n ≥ 0.
Proof:
For each n ≥ 0, b_n − a_n = (b − a)/2^n and p ∈ (a_n, b_n). Since p_n = (a_n + b_n)/2 it follows that
|p − p_n| ≤ (b_n − a_n)/2 = (b − a)/2^{n+1}.
⊠
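As a hedged illustration, a minimal Python sketch of the Bisection method, using stopping criterion 2 of Section 4.2 (|p_n − p_{n−1}| < ε); the test function and interval are chosen here only as an example.

def bisection(f, a, b, eps):
    # assumes f continuous on [a, b] with f(a) * f(b) < 0 (Theorem 4.3.1)
    p_old = a
    while True:
        p = (a + b) / 2.0                 # p_n = (a_n + b_n) / 2
        if abs(p - p_old) < eps:
            return p
        if f(a) * f(p) < 0:               # zero lies in [a, p]
            b = p
        else:                             # zero lies in [p, b]
            a = p
        p_old = p

# positive zero of f(x) = x^2 - 2 on [1, 2]
print(bisection(lambda x: x * x - 2, 1.0, 2.0, 1e-10))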
(g(p) − g(q)) / (p − q) = g′(ξ).
From this it follows that
hence p is a fixed point of g. This method is called fixed-point (or Picard) iteration.
Remarks
• It is easily seen that each converging fixed-point method is at least linearly convergent.
• Close to the solution p, the error decreases as | p − pn+1| ≈ | g′ ( p)|| p − p n |. If | g′ ( p)| is small,
convergence will be fast, but if | g′ ( p)| is only marginally smaller than 1, convergence will
be slow. If | g′ ( p)| ≥ 1 there is no convergence.
Note: if g′(p) ≠ 0, then the process is linearly convergent, with asymptotic convergence
factor |g′(p)|. If g′(p) = 0, then the process is higher-order convergent.
n    p_n       |p − p_n|    |p_n − p_{n−1}|    |p − p_n| / |p − p_{n−1}|
0 0 1 - -
1 1.3333 0.3333 1.3333 0.3333
2 0.8372 0.1628 0.4961 0.4884
3 1.0808 0.0808 0.2436 0.4964
Figure 4.1: Fixed-point method, used to solve p = 4/(p² + 3) = g(p). The iteration process is
marked by a bold line.
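The iteration of Figure 4.1 can be reproduced with a few lines of Python; this sketch is illustrative and simply repeats p_n = g(p_{n−1}) for g(p) = 4/(p² + 3), whose fixed point is p = 1, reproducing the table above.

def g(p):
    return 4.0 / (p * p + 3.0)

p = 0.0                          # p_0
for n in range(1, 4):
    p = g(p)                     # fixed-point (Picard) iteration
    print(n, p, abs(1.0 - p))    # matches the columns p_n and |p - p_n| above

Close to p = 1 the error is reduced by a factor of roughly |g′(1)| = 0.5 per iteration, in agreement with the last column of the table.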
f(x) = f(x̄) + (x − x̄) f′(x̄) + ((x − x̄)²/2) f″(ξ(x)), (4.4)
in which ξ(x) lies between x and x̄. Using that f(p) = 0, equation (4.4) yields
0 = f(x̄) + (p − x̄) f′(x̄) + ((p − x̄)²/2) f″(ξ(x)).
Neglecting the second-order term yields
0 ≈ f(x̄) + (p − x̄) f′(x̄).
Note that the right-hand side is the formula for the tangent in ( x̄, f ( x̄)). Solving for p yields
p ≈ x̄ − f(x̄) / f′(x̄).
This motivates the Newton-Raphson method, that starts with an approximation p0 and generates
a sequence { pn } by
p_n = p_{n−1} − f(p_{n−1}) / f′(p_{n−1}), for n ≥ 1.
The graphical illustration is that p n is the zero of the tangent in ( pn−1, f ( p n−1)), which is depicted
in Figure 4.2.
Example 4.5.1 (Newton-Raphson method)
In this example, the positive zero of the function f(x) = x² − 2 will be determined using the
Newton-Raphson method. To six significant digits, the exact result is p = √2 ≈ 1.41421. Note that
f′(x) = 2x, such that the method generates approximations using
p_n = p_{n−1} − (p_{n−1}² − 2) / (2p_{n−1}).
Figure 4.2: The Newton-Raphson method used to solve the positive zero of f ( x ) = x2 − 2. The
zero of the tangent in ( p0 , f ( p0 )) is p1 .
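A minimal Python sketch of Example 4.5.1 (illustrative; the starting value p₀ = 1 and the tolerance are chosen for demonstration):

def newton(f, fprime, p, eps=1e-12, max_iter=20):
    for _ in range(max_iter):
        p_new = p - f(p) / fprime(p)   # Newton-Raphson step
        if abs(p_new - p) < eps:
            return p_new
        p = p_new
    return p

# positive zero of f(x) = x^2 - 2, i.e. sqrt(2)
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))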
• Continuity of g
Since f′(p) ≠ 0 and f′ is continuous, there exists a δ₁ > 0 such that f′(x) ≠ 0 for all
x ∈ [ p − δ1 , p + δ1 ]. Therefore, g is well defined and continuous for all x in [ p − δ1 , p + δ1 ].
• Derivative of g
g′(x) = 1 − ((f′(x))² − f(x) f″(x)) / (f′(x))² = f(x) f″(x) / (f′(x))².
| g( p) − g( x )| < | p − x | < δ.
Using Theorem 4.4.2, the sequence { pn } converges to p for each p0 ∈ [ p − δ, p + δ]. This proves
the theorem. ⊠
Quadratic convergence
The convergence behavior of the Newton-Raphson method can be computed by making use of
the following observation:
0 = f(p) = f(p_n) + (p − p_n) f′(p_n) + ((p − p_n)²/2) f″(ξ_n) for ξ_n between p_n and p. (4.5a)
The Newton-Raphson method is defined such that
0 = f ( p n ) + ( p n +1 − p n ) f ′ ( p n ) . (4.5b)
Subtracting equation (4.5b) from equation (4.5a) yields
(p − p_{n+1}) f′(p_n) + ((p − p_n)²/2) f″(ξ_n) = 0,
such that
|p − p_{n+1}| / |p − p_n|² = |f″(ξ_n)| / |2 f′(p_n)|.
From equation (4.1) it follows that the Newton-Raphson method converges quadratically, with
α = 2 and
λ = lim_{n→∞} |f″(ξ_n)| / |2 f′(p_n)| = |f″(p)| / |2 f′(p)|.
K = (f(p_{n−1} + h) − f(p_{n−1})) / h,
where h > 0.
Secant method
The Secant method replaces f ′ ( pn−1 ) by
K = (f(p_{n−1}) − f(p_{n−2})) / (p_{n−1} − p_{n−2}).
Regula-Falsi method
The Regula-Falsi method combines the Secant step with a bracketing interval: starting from [a_{n−1}, b_{n−1}] with f(a_{n−1}) · f(b_{n−1}) < 0, the new iterate is
p_n = a_{n−1} − f(a_{n−1}) · (b_{n−1} − a_{n−1}) / (f(b_{n−1}) − f(a_{n−1})).
If the chosen stopping criterion is satisfied, then the zero of f is found. If not, then, similar to the
Bisection approach, the new interval [a_n, b_n] is constructed by defining
a_n = a_{n−1} and b_n = p_n, if f(a_{n−1}) · f(p_n) < 0,
and
a_n = p_n and b_n = b_{n−1} otherwise.
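A Python sketch of the Secant method (illustrative; it replaces f′ by the difference quotient K above and reuses the stopping criterion on successive iterates):

def secant(f, p0, p1, eps=1e-12, max_iter=50):
    for _ in range(max_iter):
        K = (f(p1) - f(p0)) / (p1 - p0)   # secant approximation of f'(p_{n-1})
        p2 = p1 - f(p1) / K
        if abs(p2 - p1) < eps:
            return p2
        p0, p1 = p1, p2
    return p1

# positive zero of f(x) = x^2 - 2, starting from p0 = 1 and p1 = 2 (cf. exercise 3)
print(secant(lambda x: x * x - 2, 1.0, 2.0))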
In matrix-vector notation, these systems can be written as g(p) = p and f(p) = 0, respectively,
where
p = (p₁, . . . , p_m)⊤,
g(p) = (g₁(p₁, . . . , p_m), . . . , g_m(p₁, . . . , p_m))⊤,
f(p) = (f₁(p₁, . . . , p_m), . . . , f_m(p₁, . . . , p_m))⊤.
As in the scalar case, an initial estimate for the solution is used to construct a sequence of successive approximations, {p^(n)}, of the solution, until a desired tolerance is reached. A possible
stopping criterion is ‖p^(n) − p^(n−1)‖ < ε, where ‖x‖ = √(x₁² + . . . + x_m²) denotes the Euclidean
norm.
2p₁ − p₂ + p₁² = 1/9,
−p₁ + 2p₂ + p₂² = 13/9. (4.9)
It is easy to see that the solution equals (p₁, p₂)⊤ = (1/3, 2/3)⊤. The Picard iteration uses
2p₁^(n) − p₂^(n) + p₁^(n−1) p₁^(n) = 1/9,
−p₁^(n) + 2p₂^(n) + p₂^(n−1) p₂^(n) = 13/9,
which can be written as
A(p^(n−1)) p^(n) = b ⇔ p^(n) = A^{−1}(p^(n−1)) b, (4.10)
where
A(p^(n−1)) = [[2 + p₁^(n−1), −1], [−1, 2 + p₂^(n−1)]], b = [1/9, 13/9]⊤.
If the initial estimate equals p(0) = (0, 0)⊤ , then the new approximation equals p(1) = (5/9, 1)⊤.
Note that system (4.9) can be written as g(p) = p, with g(p) = A(p)−1 b.
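A short Python sketch of Picard iteration (4.10) for system (4.9) (illustrative; NumPy's linear solver plays the role of A⁻¹):

import numpy as np

b = np.array([1.0 / 9.0, 13.0 / 9.0])
p = np.zeros(2)                              # initial estimate p^(0) = (0, 0)
for n in range(25):
    A = np.array([[2.0 + p[0], -1.0],
                  [-1.0, 2.0 + p[1]]])       # A(p^(n-1)) from (4.10)
    p = np.linalg.solve(A, b)                # p^(n) = A^{-1}(p^(n-1)) b
print(p)                                     # converges to (1/3, 2/3)

The first iterate equals (5/9, 1)⊤, as computed above.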
p^(n) = p^(n−1) − J^{−1}(p^(n−1)) f(p^(n−1)). (4.13)
If the computation of the partial derivatives in the Jacobian matrix is not possible, then the deriva-
tives can be approximated using, e.g., forward divided differences:
∂f_j/∂x_i (x) ≈ (f_j(x + h e_i) − f_j(x)) / h, (4.14)
where ei is the ith unit vector. This is the Quasi-Newton method for systems. It is also possible to
use central differences to approximate the derivatives.
Example 4.6.2 (Newton-Raphson method)
In this example, the Newton-Raphson scheme is applied to the following system for p1 and p2:
18p₁ − 9p₂ + p₁² = 0,
−9p₁ + 18p₂ + p₂² = 9, (4.15)
with initial estimate p(0) = (0, 0)⊤ .
System (4.15) can be written as f(p) = 0, where
f₁(p₁, p₂) = 18p₁ − 9p₂ + p₁²,
f₂(p₁, p₂) = −9p₁ + 18p₂ + p₂² − 9.
The Jacobian of f(x) is given by
J(x) = [[18 + 2x₁, −9], [−9, 18 + 2x₂]].
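A Python sketch of the Newton-Raphson iteration (4.13) for system (4.15); as is common in practice, the linear system J s = −f is solved instead of forming J⁻¹ explicitly. The number of iterations is illustrative.

import numpy as np

def F(p):
    return np.array([18 * p[0] - 9 * p[1] + p[0]**2,
                     -9 * p[0] + 18 * p[1] + p[1]**2 - 9])

def J(p):
    return np.array([[18 + 2 * p[0], -9.0],
                     [-9.0, 18 + 2 * p[1]]])

p = np.zeros(2)                           # initial estimate p^(0) = (0, 0)
for n in range(10):
    s = np.linalg.solve(J(p), -F(p))      # Newton step: J(p) s = -f(p)
    p = p + s
print(p, F(p))                            # residual close to machine precision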
4.7 Summary
In this chapter, the following subjects have been discussed:
- Stopping criteria;
- Rounding errors;
- Convergence;
- Bisection method;
- Fixed-point iteration (Picard iteration);
- Newton-Raphson method;
- Quasi-Newton method, Secant method, Regula-Falsi method;
- Systems of nonlinear equations: Picard iteration, Newton-Raphson method.
4.8 Exercises
1. Let f ( x ) = 3( x + 1)( x − 1/2)( x − 1). Use the Bisection method to determine p2 on the
intervals [−2, 1.5] and [−1.25, 2.5].
2. In this exercise, two different fixed-point methods are considered:
p_n = (20p_{n−1} + 21/p_{n−1}²) / 21 and p_n = p_{n−1} − (p_{n−1}³ − 21) / (3p_{n−1}²).
Show that for both methods, 21^{1/3} is the fixed point. Estimate the convergence speeds,
and determine p₃ if p₀ = 1.
3. In this exercise, the Secant method is derived to approximate a zero of f .
(a) Suppose that p0 and p1 are two given values. Determine the linear interpolation poly-
nomial of f between p0 and p1 .
(b) Compute p2 as the intersection point of this interpolation polynomial with the x-axis.
Repeat this process with p1 and p2 to obtain p3 , and derive a general method for com-
puting pn from pn−2 and pn−1 (Secant method).
(c) Perform two iterations with this method using the function f ( x ) = x2 − 2 with p0 = 1
and p1 = 2.
4. Consider the function f(x) = x − cos x, x ∈ [0, π/2]. Determine an approximation of the
zero of f with an error less than 10⁻⁴ using the method of Newton-Raphson.
5. Perform two iterations with the Newton-Raphson method with starting vector (1, 1)⊤ to
solve the nonlinear system
x₁² − x₂ − 3 = 0,
−x₁ + x₂² + 1 = 0.
Chapter 5
Numerical integration
5.1 Introduction
Determining the physical quantities of a system (for example volume, mass, or length) often
involves the integral of a function. In most cases, an analytic evaluation of the integral is not
possible. In such cases, one has to resort to numerical quadrature.
As an example, we investigate the production of a spoiler that is mounted onto the cabin of
a truck (Figure 5.1). The shape of the spoiler is described by a sine function with a 2π meter
period. The aluminium spoiler is made out of a flat plate by extrusion. The manufacturer wants
to know the width of the plate such that the horizontal dimension of the spoiler will be 80 cm.
The answer to this question is provided by the arc length of the curve
x (t) = t,
y(t) = sin t, 0 ≤ t ≤ 0.8.
[Figure 5.1: sketch of the truck-mounted spoiler, with its height and length indicated.]
To determine the arc length, the integral
∫₀^{0.8} √(1 + (cos t)²) dt
is used. However, this integral cannot be evaluated in a simple way. In this chapter, numerical
integration methods will be investigated in order to determine the arc length.
Let a sequence of partitions, P₁, P₂, . . ., and corresponding T₁, T₂, . . ., be given, such that lim_{n→∞} m(P_n) = 0.
We call f Riemann integrable over [ a, b] if
R(f, P_n, T_n) converges to a limit I = ∫_a^b f(x) dx.
Riemann sums are usually used to study integrability theoretically, but they are not very useful
in practice. Numerical integration rules have a similar structure to Riemann sums:
I = ∑_{k=0}^{n} w_k f(t_k).
Here, tk are called the integration points, and wk are the corresponding weights. Numerical inte-
gration rules are also called quadrature rules.
which use the left and the right end point of the interval, respectively.
Theorem 5.3.1 Let f ∈ C¹[x_L, x_R], and m₁ = max_{x∈[x_L,x_R]} |f′(x)|. Then
|∫_{x_L}^{x_R} f(x)dx − (x_R − x_L) f(x_L)| ≤ (1/2) m₁ (x_R − x_L)²,
|∫_{x_L}^{x_R} f(x)dx − (x_R − x_L) f(x_R)| ≤ (1/2) m₁ (x_R − x_L)².
Proof:
The proofs for the left and the right Rectangle rule follow the same structure. Therefore, only the
left Rectangle rule will be considered. Using a Taylor expansion, we know that
f ( x ) = f ( x L ) + ( x − x L ) f ′ (ξ ( x )) with ξ ( x ) ∈ ( x L , x R ).
∫_{x_L}^{x_R} f(x)dx = ∫_{x_L}^{x_R} (f(x_L) + (x − x_L) f′(ξ(x))) dx = (x_R − x_L) f(x_L) + ∫_{x_L}^{x_R} (x − x_L) f′(ξ(x)) dx.
Since ( x − x L ) ≥ 0, the mean-value theorem for integration can be used, which means that there
exists an η L ∈ [ x L , x R ] such that
∫_{x_L}^{x_R} f(x)dx − (x_R − x_L) f(x_L) = f′(η_L) ∫_{x_L}^{x_R} (x − x_L)dx = f′(η_L) · (1/2)(x_R − x_L)².
Taking the absolute value on both sides and using m1 completes the proof. ⊠
Proof:
Using Taylor series,
f(x) = f(x_M) + (x − x_M) f′(x_M) + (1/2)(x − x_M)² f″(ξ(x)), with ξ(x) ∈ (x_L, x_R).
Since ( x − x M )2 ≥ 0, the mean-value theorem for integration can be used, which means that there
exists an η ∈ [ x L , x R ] such that
∫_{x_L}^{x_R} f(x)dx − (x_R − x_L) f(x_M) = (1/2) f″(η) ∫_{x_L}^{x_R} (x − x_M)² dx.
Taking the absolute value on both sides and using m2 completes the proof. ⊠
Proof:
From Theorem 2.2.1 it follows that linear interpolation through ( x L , f ( x L )) and ( x R , f ( x R )) has a
truncation error
f(x) − L₁(x) = (1/2)(x − x_L)(x − x_R) f″(ξ(x)), with ξ(x) ∈ (x_L, x_R). (5.2)
Integrating both sides of equation (5.2) yields
∫_{x_L}^{x_R} f(x)dx − ((x_R − x_L)/2)(f(x_L) + f(x_R)) = (1/2) ∫_{x_L}^{x_R} (x − x_L)(x − x_R) f″(ξ(x)) dx. (5.3)
Using equation (5.4) in equation (5.3), taking the absolute value and using m2 completes the
proof. ⊠
the integration rules of Section 5.3 can be applied on each subinterval [x_{k−1}, x_k]. Here, x_L = x_{k−1},
x_R = x_k, and x_M = (x_{k−1} + x_k)/2. If I_k is an approximation of the integral ∫_{x_{k−1}}^{x_k} f(x)dx, for
k = 1, . . . , n, then the integral over [a, b] can be approximated by
∫_a^b f(x)dx = ∑_{k=1}^{n} ∫_{x_{k−1}}^{x_k} f(x)dx ≈ ∑_{k=1}^{n} I_k = I.
In the following theorem, an upper bound for the remainder term of a composite rule will be
considered.
Theorem 5.4.1 Let f be given on [a, b], and let a = x₀ < x₁ < . . . < x_n = b, with x_k = a + kh
and h = (b − a)/n. Let I_k be an approximation of the integral ∫_{x_{k−1}}^{x_k} f(x)dx, k = 1, . . . , n. Assume
that the absolute value of the remainder term of I_k has upper bound c_k · h^{p+1}, with p ∈ N. Define
c = max{c₁, . . . , c_n}; then
|∫_a^b f(x)dx − I| ≤ c(b − a)h^p.
Proof:
Using the triangle inequality,
|∫_a^b f(x)dx − I| = |∑_{k=1}^{n} (∫_{x_{k−1}}^{x_k} f(x)dx − I_k)| ≤ ∑_{k=1}^{n} |∫_{x_{k−1}}^{x_k} f(x)dx − I_k| ≤ ∑_{k=1}^{n} c_k h^{p+1} ≤ n c h^{p+1} = c(b − a)h^p,
because nh = b − a. ⊠
The composite left and right Rectangle rules are given by
I_L = h ∑_{k=0}^{n−1} f(x_k) = h (f(a) + f(a + h) + . . . + f(b − h)),
I_R = h ∑_{k=1}^{n} f(x_k) = h (f(a + h) + f(a + 2h) + . . . + f(b)).
From Theorem 5.3.1 it follows that for each interval [x_{k−1}, x_k], c_k = (1/2) max_{x∈[x_{k−1},x_k]} |f′(x)|.
Defining M₁ = max_{x∈[a,b]} |f′(x)|, so that c = max{c₁, . . . , c_n} = M₁/2, the corresponding upper
bounds for the remainder terms are (use Theorem 5.4.1)
|∫_a^b f(x)dx − I_L| ≤ (1/2) M₁ (b − a) h,
|∫_a^b f(x)dx − I_R| ≤ (1/2) M₁ (b − a) h.
Midpoint rule
The composite Midpoint rule is given by
I_M = h ∑_{k=1}^{n} f((x_{k−1} + x_k)/2) = h (f(a + h/2) + f(a + 3h/2) + . . . + f(b − h/2)).
Defining M₂ = max_{x∈[a,b]} |f″(x)|, the upper bound for the remainder term (use Theorems 5.3.2
and 5.4.1) is
|∫_a^b f(x)dx − I_M| ≤ (1/24) M₂ (b − a) h².
Trapezoidal rule
For the Trapezoidal rule, the composite version is
I_T = (h/2) ∑_{k=1}^{n} (f(x_{k−1}) + f(x_k)) = h ((1/2) f(a) + f(a + h) + . . . + f(b − h) + (1/2) f(b)),
with remainder term
|∫_a^b f(x)dx − I_T| ≤ (1/12) M₂ (b − a) h².
Simpson's rule
For Simpson's rule, the composite version is
I_S = (h/6) ∑_{k=1}^{n} (f(x_{k−1}) + 4 f((x_{k−1} + x_k)/2) + f(x_k))
= h ((1/6) f(a) + (2/3) f(a + h/2) + (1/3) f(a + h) + (2/3) f(a + 3h/2) + . . . + (2/3) f(b − h/2) + (1/6) f(b)),
with remainder term
|∫_a^b f(x)dx − I_S| ≤ (1/2880) M₄ (b − a) h⁴,
in which M₄ = max_{x∈[a,b]} |f⁽⁴⁾(x)|.
• The remainder terms of the composite Midpoint and Trapezoidal rule are of order O(h2 ),
whereas the Rectangle rules have remainder terms of order O(h). The composite Simpson’s
rule has a remainder term of order O(h4 ). The Rectangle, Midpoint and Trapezoidal rules
all require the same amount of work. Among these, the Midpoint and Trapezoidal rules are
clearly preferred. The composite Simpson’s rule needs twice as many function evaluations,
but converges faster.
• Note that the upper bound for the error of the Trapezoidal rule is twice as large as the
bound for the Midpoint rule.
• The composite Trapezoidal rule may also be interpreted as the integral of the linear spline
approximation of f . Suppose s is the linear spline approximation of f in a = x0 , . . . , x n = b.
Then the composite Trapezoidal rule computes I_T = ∫_a^b s(x)dx.
Example 5.4.1 (Spoiler)
For the example mentioned in the introduction, the length of the flat plate should be approximated with an error of at most 1 cm. Defining f(t) = √(1 + (cos t)²), this length equals ∫₀^{0.8} f(t)dt.
Applying the composite left Rectangle rule, the upper bound for the remainder term equals
|∫₀^{0.8} f(t)dt − I_L| ≤ (1/2) M₁ · 0.8h,
Table 5.1: The error |∫₀^{0.8} f(t)dt − I| for f(t) = √(1 + (cos t)²), using different composite integration rules I.
h      Left Rectangle rule   Midpoint rule    Trapezoidal rule   Simpson's rule
0.8    5.5435 · 10⁻²         1.1698 · 10⁻²    2.2742 · 10⁻²      2.1789 · 10⁻⁴
0.4    3.3567 · 10⁻²         2.7815 · 10⁻³    5.5221 · 10⁻³      1.3618 · 10⁻⁵
0.2    1.8174 · 10⁻²         6.8643 · 10⁻⁴    1.3703 · 10⁻³      8.4852 · 10⁻⁷
0.1    9.4302 · 10⁻³         1.7105 · 10⁻⁴    3.4194 · 10⁻⁴      5.2978 · 10⁻⁸
0.05   4.8006 · 10⁻³         4.2728 · 10⁻⁵    8.5445 · 10⁻⁵      3.3102 · 10⁻⁹
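A Python sketch (illustrative) that recomputes the left Rectangle and Midpoint columns of Table 5.1; since the exact arc length is not known in closed form, a very fine Midpoint approximation is used here as reference value, which is an assumption of this sketch.

import math

def f(t):
    return math.sqrt(1.0 + math.cos(t)**2)

def left_rectangle(f, a, b, n):
    h = (b - a) / n
    return h * sum(f(a + k * h) for k in range(n))            # composite I_L

def midpoint(f, a, b, n):
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))    # composite I_M

I = midpoint(f, 0.0, 0.8, 200000)                             # reference value
for n in (1, 2, 4, 8, 16):                                    # h = 0.8, ..., 0.05
    print(0.8 / n, abs(I - left_rectangle(f, 0.0, 0.8, n)),
          abs(I - midpoint(f, 0.0, 0.8, n)))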
| f ( x ) − fˆ( x )| = ε( x ).
Next, the integral is approximated with an integration rule using the perturbed values. In this
derivation, the composite left Rectangle rule will be used. The difference between the exact integral, I = ∫_a^b f(x)dx, and the perturbed approximation, Î_L, can be bounded by combining the
truncation error of the rule with the propagated measurement errors.
Assuming ε( x ) ≤ ε max , the total error for the composite left Rectangle rule is
|∫_a^b f(x)dx − Î_L| ≤ (1/2) M₁ (b − a)h + h ∑_{k=0}^{n−1} ε_max = (1/2) M₁ (b − a)h + h n ε_max = ((1/2) M₁ h + ε_max)(b − a),
where M1 = maxx ∈[ a,b] | f ′ ( x )|. Note that it is useless to take h any smaller than 2ε max /M1 ,
because then the measurement error dominates the total error.
Note that this derivation is specific for the composite left Rectangle rule. Similar error estimates
can be derived for the Midpoint, Trapezoidal and Simpson’s rule.
The computation of the length of the flat plate can be perturbed assuming a small manufacturing
error. Consequently, the integrand contains a rounding error ε(x) = 10⁻³. In Figure 5.2, the effect
of rounding errors can be seen for different values of h (using the composite left Rectangle rule).
It turns out that the total error remains larger than 0.8 · 10⁻³. It serves no purpose to take the
step size smaller than 2ε_max/M₁ ≈ 4ε_max = 4 · 10⁻³.
Conditioning of the integral
The numerical approximation of an integral I is either a well-conditioned or an ill-conditioned
problem. If ǫmax is an upper bound for the relative error (different from ε max for the absolute
error), then the absolute error in the function values is bounded by the inequality
| f ( x ) − fˆ( x )| ≤ | f ( x )|ǫmax.
Figure 5.2: The errors |∫₀^{0.8} f(t)dt − I_L| (without rounding errors) and |∫₀^{0.8} f(t)dt − Î_L| (with
rounding errors) for f(t) = √(1 + (cos t)²), using ε(x) = 10⁻³, and the composite left Rectangle
rule.
In that case, an upper bound for the relative error in the integral is given by
|∫_a^b f̂(x)dx − ∫_a^b f(x)dx| / |∫_a^b f(x)dx| ≤ (∫_a^b |f(x)|dx / |∫_a^b f(x)dx|) · ǫ_max.
The value
K_I = ∫_a^b |f(x)|dx / |∫_a^b f(x)dx|
is called the condition number of the integral I . If KI · ǫmax = O(1), then the determination of I is
ill conditioned: the relative error in the computed integral is of order O(1).
The profits or losses per day of a car manufacturer depend on the season. The following profit
functions (in billions of dollars) are assumed:
w_spring(t) = 0.01 + sin(πt − π/2), t ∈ [0, 1], and w_fall(t) = 0.01 + sin(πt + π/2), t ∈ [0, 1].
The total profits in spring (W_spring) and fall (W_fall) equal
W_spring = ∫₀¹ w_spring(t)dt = 0.01, W_fall = ∫₀¹ w_fall(t)dt = 0.01.
The composite left Rectangle rule using h = 0.02 approximates W_spring with −0.01 and W_fall with
0.03. The composite Midpoint and Trapezoidal rules obtain the exact results for h = 0.02.
Note that
W_spring = |W_spring| = 0.01 and ∫₀¹ |w_spring(t)|dt ≈ 0.64.
Therefore, the condition number of W_spring equals K_I = 64. If the function values only have
an accuracy of two digits, then ǫ_max = 0.005. This means that K_I · ǫ_max = 0.32, such that the
determination of the integral is an ill-conditioned problem. In the worst case, no digit of the
approximation is correct.
∫_{x_L}^{x_R} f(x)dx ≈ ∫_{x_L}^{x_R} L_N(x)dx = ∑_{ℓ=0}^{N} f(x_ℓ) ∫_{x_L}^{x_R} L_{ℓN}(x)dx = ∑_{ℓ=0}^{N} w_ℓ f(x_ℓ),
Proof:
From Theorem 2.3.1 it follows that f coincides with its Lagrange interpolation polynomial on
[ x L , x R ], and hence,
∫_{x_L}^{x_R} f(x)dx = ∫_{x_L}^{x_R} L_N(x)dx.
⊠
Conversely:
Theorem 5.6.2 If x0 , . . . , x N and w0 , . . . , w N are given and if
∫_{x_L}^{x_R} p(x)dx = ∑_{ℓ=0}^{N} w_ℓ p(x_ℓ)
holds for all polynomials p of degree at most N, then ∑_{ℓ=0}^{N} w_ℓ f(x_ℓ) is the quadrature rule based on interpolation for ∫_{x_L}^{x_R} f(x)dx.
L
Proof:
Let L N be the interpolation polynomial of f on the nodes x0 , . . . , x N . Then, by definition,
∫_{x_L}^{x_R} L_N(x)dx = ∑_{ℓ=0}^{N} w_ℓ L_N(x_ℓ) = ∑_{ℓ=0}^{N} w_ℓ f(x_ℓ),
Note that the verification for L N is already enough to prove the theorem. ⊠
The following theorem can be used to compute remainder terms for interpolatory quadrature
rules. The proof of this theorem will be omitted, and can be found in [9].
Theorem 5.6.3 Let ∫_{x_L}^{x_R} f(x)dx be approximated by the integration of the Lagrange interpolation polynomial L_N. Then there exists a ξ ∈ [x_L, x_R] such that the remainder term of the approximation satisfies:
if N is even and f ∈ C^{N+2}[x_L, x_R]:
∫_{x_L}^{x_R} f(x)dx − ∫_{x_L}^{x_R} L_N(x)dx = C_N ((x_R − x_L)/N)^{N+3} f^{(N+2)}(ξ), with
C_N = (1/(N + 2)!) ∫₀^N t²(t − 1) · · · (t − N)dt;
if N is odd and f ∈ C^{N+1}[x_L, x_R]:
∫_{x_L}^{x_R} f(x)dx − ∫_{x_L}^{x_R} L_N(x)dx = D_N ((x_R − x_L)/N)^{N+2} f^{(N+1)}(ξ), with
D_N = (1/(N + 1)!) ∫₀^N t(t − 1) · · · (t − N)dt.
Remarks
• Composite interpolatory quadrature rules are found by repeating the interpolation on sub-
intervals, as has been explained in Section 5.4.
Definition 5.6.1 (Newton-Cotes quadrature rules) If x_L = x₀ < x₁ < . . . < x_N = x_R and the
nodes are equidistantly distributed, then the sum ∑_{ℓ=0}^{N} w_ℓ f(x_ℓ) is called the Newton-Cotes quadrature
rule for ∫_{x_L}^{x_R} f(x)dx.
The Trapezoidal rule (Section 5.3.3) uses linear interpolation in the nodes x L and x R . This means
that the Trapezoidal rule is a Newton-Cotes quadrature rule with N = 1. Similarly, the Simpson’s
rule (Section 5.3.4) uses quadratic interpolation in the nodes x L , ( x L + x R )/2 and x R : this is a
Newton-Cotes quadrature rule using N = 2. Indeed, the error terms of the Trapezoidal and the
Simpson’s rule correspond to Theorem 5.6.3.
Newton-Cotes quadrature rules of higher order are rarely used: negative weights will occur and
that is less desirable.
computes the exact result when f is a polynomial of degree at most 3. Using f equal to 1, x, x2
and x3 consecutively, this means that w0 , w1 , x0 and x1 must satisfy
f(x) = 1 ⇒ w₀ + w₁ = ∫_{−1}^{1} 1 dx = 2,
f(x) = x ⇒ w₀x₀ + w₁x₁ = ∫_{−1}^{1} x dx = 0,
f(x) = x² ⇒ w₀x₀² + w₁x₁² = ∫_{−1}^{1} x² dx = 2/3,
f(x) = x³ ⇒ w₀x₀³ + w₁x₁³ = ∫_{−1}^{1} x³ dx = 0.
For Gauss quadrature formulae, the integration limits are always taken equal to x L = −1 and
x R = 1. Consequently, Gauss quadrature rules have the following general form
∫_{−1}^{1} f(x)dx ≈ ∑_{ℓ=0}^{N} w_ℓ f(x_ℓ) = I_G,
Table 5.2: Weights and integration points for Gauss quadrature rules.
N    x_ℓ                          w_ℓ
0    0                            2
1    −(1/3)√3                     1
     (1/3)√3                      1
2    −(1/5)√15                    5/9
     0                            8/9
     (1/5)√15                     5/9
3    −(1/35)√(525 + 70√30)        (1/36)(18 − √30)
     −(1/35)√(525 − 70√30)        (1/36)(18 + √30)
     (1/35)√(525 − 70√30)         (1/36)(18 + √30)
     (1/35)√(525 + 70√30)         (1/36)(18 − √30)
holds. Tables with Gauss integration points and corresponding weights can be found in many
text books on numerical analysis. In Table 5.2, the values are given for the first four Gauss quadra-
ture rules. Note that the Gauss quadrature rule for N = 0 corresponds to the Midpoint rule.
In order to integrate a function numerically over an arbitrary interval [ x L , x R ] using a Gauss
quadrature rule, the interval of integration has to be transformed to [−1, 1]. This change of inter-
val is achieved by linear transformation using
∫_{x_L}^{x_R} f(y)dy = ((x_R − x_L)/2) ∫_{−1}^{1} f(((x_R − x_L)/2) x + (x_L + x_R)/2) dx. (5.6)
Replacing the latter integral by a Gauss formula yields
∫_{x_L}^{x_R} f(y)dy ≈ ((x_R − x_L)/2) ∑_{ℓ=0}^{N} w_ℓ f(((x_R − x_L)/2) x_ℓ + (x_L + x_R)/2). (5.7)
The integration points xℓ correspond to the Gauss integration points as tabulated in Table 5.2.
If f ∈ C^{2N+2}[x_L, x_R], then the (N + 1)-point Gauss quadrature rule has the remainder term
∫_{x_L}^{x_R} f(y)dy − ((x_R − x_L)/2) ∑_{ℓ=0}^{N} w_ℓ f(((x_R − x_L)/2) x_ℓ + (x_L + x_R)/2) = ((x_R − x_L)^{2N+3} ((N + 1)!)⁴ / ((2N + 3)((2N + 2)!)³)) f^{(2N+2)}(ξ),
with ξ ∈ ( x L , x R ). This formula shows that Gauss integration rules are indeed exact for polyno-
mials up to degree 2N + 1.
Remark
Composite Gauss quadrature rules are found by repeating the Gauss quadrature rule on subin-
tervals, as has been explained in Section 5.4.
Example 5.7.1 (Polynomial)
In this example, the value of ∫₁^{1.5} y⁷ dy will be approximated. The exact value of the integral
equals 3.07861328125. Using a change of interval as in equations (5.6) and (5.7), the integral is
approximated by
∫₁^{1.5} f(y)dy ≈ 0.25 ∑_{ℓ=0}^{N} w_ℓ f(0.25x_ℓ + 1.25),
In Table 5.3, the errors are shown for different values of N. As expected, the 4-point Gauss
quadrature rule (N = 3) gives the exact result. Note that also for smaller numbers of integration
points, the Gauss quadrature rule is very accurate.
Table 5.3: The error |∫₁^{1.5} y⁷ dy − I_G|, using different (N + 1)-point Gauss integration rules I_G.
N    |∫₁^{1.5} y⁷ dy − I_G|
0    6.9443 · 10⁻¹
1    1.1981 · 10⁻²
2    2.4414 · 10⁻⁵
3    0
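A Python sketch of the transformed Gauss rule (5.7) for Example 5.7.1, using the two-point rule (N = 1) from Table 5.2; the printed error should be close to the entry 1.1981 · 10⁻² above. The sketch is illustrative only.

import math

nodes = [-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)]   # N = 1 from Table 5.2
weights = [1.0, 1.0]

def gauss(f, xL, xR):
    c, d = (xR - xL) / 2.0, (xL + xR) / 2.0             # change of interval (5.6)
    return c * sum(w * f(c * x + d) for x, w in zip(nodes, weights))

exact = 3.07861328125
print(abs(exact - gauss(lambda y: y**7, 1.0, 1.5)))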
5.8 Summary
In this chapter, the following subjects have been discussed:
- Riemann sums;
- Simple integration rules (left and right Rectangle rule, Midpoint rule, Trapezoidal rule,
Simpson’s rule);
- Composite rules;
- Measurement and rounding errors;
- Interpolatory quadrature rules;
- Newton-Cotes quadrature rules;
- Gauss quadrature rules.
5.9 Exercises
1. In this exercise, the integral
∫_{−1}^{1} ((10x)³ + 0.001) dx
should be computed.
(a) The relative rounding error in the function values is less than ε. Determine the relative
error in the integral due to the rounding errors.
(b) Use the composite Midpoint rule and ε = 4 × 10⁻⁸. Give a reasonable value for the
step size h.
2. Determine ∫_{0.5}^{1} x⁴ dx with the Trapezoidal rule. Estimate the error and compare it with the
real error. Also compute the integral with the composite Trapezoidal rule using step size
h = 0.25. Estimate the error using Richardson’s extrapolation.
3. Calculate ∫₁^{1.5} x⁷ dx with the composite Trapezoidal rule using two equal intervals and with
the Simpson’s rule. Compare your results with the results for Gauss quadrature rules as
tabulated in Table 5.3.
Chapter 6
Numerical time integration of initial-value problems
6.1 Introduction
An initial-value problem usually is a mathematical description (differential equation) of a time-
dependent problem. The conditions, necessary to determine a unique solution, are given at the
starting time, t0 , and are called initial conditions.
As an example, we consider a water discharge from a storage reservoir through a pipe (Fig-
ure 6.1). The water is at rest until, at t0 = 0, the latch is instantaneously opened. The inertia of
the water causes a gradual development of the flow. If this flowing water is used to generate elec-
tricity, then it is important to know how long it will take before the turbines work at full power.
Figure 6.1: The storage reservoir, the drain pipe and the latch in closed position.
The development of the flow is modeled by the initial-value problem
q′ = p(t) − aq², t > 0, q(0) = 0, (6.1)
in which q (m³/s) is the flow rate and p is the driving force, which is equal to force/length divided
by density, and measured in (N/m)/(kg/m³) = m³/s²,
and aq² is the friction. The driving force depends, among other things, on the water level in the
reservoir. Assuming that p(t) is constant (p(t) = p₀), a solution in closed form is known:
q(t) = √(p₀/a) tanh(t√(a p₀)).
For general functions p, an analytic solution cannot be obtained. In that case, a numerical method
must be used to approximate the solution.
• The solution changes continuously when the parameters in the differential equation or the
initial conditions are slightly perturbed.
A theorem about these issues uses the concepts of a well-posed initial-value problem (see also
definition 1.6.1) and Lipschitz continuity.
is called well posed if the problem has a unique solution that depends continuously on the data (a, y_a
and f).
The following theorem yields a sufficient condition for a function to be Lipschitz continuous. The
proof uses the mean-value theorem.
Theorem 6.2.1 If the function f is differentiable with respect to the variable y, then Lipschitz continuity
is implied if
|∂f/∂y (t, y)| ≤ L, for all (t, y) ∈ D,
The following theorem presents a sufficient condition for an initial-value problem to be well
posed. The proof can be found in many text books on differential equations, for example in
Boyce and DiPrima, [4].
Theorem 6.2.2 Suppose that D = {(t, y)| a ≤ t ≤ b, −∞ < y < ∞} and that f (t, y) is continuous on
D. If f is Lipschitz continuous in the variable y on D, then the initial-value problem (6.2) is well posed.
Therefore, it is illuminating to consider the slope field generated by the differential equation. A
differential equation is a rule that provides the slope of the tangent of a solution curve at each
point in the (t, y)-plane. A specific solution curve is selected by specifying a point through which
that solution curve passes. A formal expression for the solution curve through a point (t0 , y0 ) is
generated by integrating the differential equation over t:
y(t) = y(t₀) + ∫_{t₀}^{t} f(τ, y(τ))dτ.
Because the unknown function y appears under the integral sign, this equation is called an in-
tegral equation. Although the integral equation is as difficult to solve as the original initial-value
problem, it provides a better starting point for approximating the solution.
In order to approximate the solution to the integral equation numerically, the time axis is divided
into a set of discrete points tn = t0 + n∆t, n = 0, 1, . . .. The solution at each time step is computed
using the integral equation. Assume that the solution to the differential equation at time tn is
known, and denoted by yn = y(tn ). Then, the solution at tn+1 is equal to
y_{n+1} = y_n + ∫_{t_n}^{t_{n+1}} f(t, y(t))dt. (6.3)
The numerical time-integration methods that are presented in this section can be characterized
by the approximation of the integral term ∫_{t_n}^{t_{n+1}} f(t, y(t))dt: different numerical quadrature rules
(Chapter 5) can be used. Because integral equation (6.3) only considers one time interval [t_n, t_{n+1}],
(Chapter 5) can be used. Because integral equation (6.3) only considers one time interval [tn , tn+1 ],
these numerical methods are called single-step methods. Multi-step methods, which may also use
t0 , . . . , tn−1, will be described in Section 6.11.
The numerical approximation of the solution at tn is denoted by wn , n = 0, 1, . . ., where w0 = y0 .
Forward Euler method
First, the Forward Euler method will be presented, which is the most simple, earliest, and best-
known time-integration method. In this method, an approximation of the integral in equation
(6.3) is obtained by using the left Rectangle rule (Figure 6.2(a)):
y_{n+1} ≈ y_n + (t_{n+1} − t_n) f(t_n, y_n). (6.4)
Note that f(t_n, y_n) is the slope of the tangent in (t_n, y_n). The numerical approximation of the
solution at t_{n+1}, denoted by w_{n+1}, is computed by
w_{n+1} = w_n + Δt f(t_n, w_n), (6.5)
Figure 6.2: Two different approximations of the integral ∫_{t_n}^{t_{n+1}} f(t, y(t))dt.
[Figure 6.3: the points (t_n, w_n) generated by the Forward Euler method, compared with the exact solution y(t).]
in which f (tn , wn ) is the slope of the tangent in (tn , wn ). This procedure leads to a sequence
{tn , wn }, n = 0, 1, . . . of approximation points. Geometrically this represents a piecewise-linear
approximation of the solution curve, the Euler polygon, see Figure 6.3.
Because wn+1 can be computed directly from equation (6.5), the Forward Euler method is called
an explicit method.
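A minimal Python sketch of the Forward Euler method (6.5); as an illustration it is applied to the discharge model y′ = −10y² + 20 that is analyzed later in this chapter, with a time step chosen for demonstration.

def forward_euler(f, t0, y0, dt, n_steps):
    t, w = t0, y0
    for n in range(n_steps):
        w = w + dt * f(t, w)      # w_{n+1} = w_n + dt f(t_n, w_n), equation (6.5)
        t = t + dt
    return t, w

# y' = -10 y^2 + 20, y(0) = 0
t, w = forward_euler(lambda t, y: -10 * y * y + 20, 0.0, 0.0, 0.01, 200)
print(w)                          # approaches the steady state sqrt(2)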
Backward Euler method
The Backward Euler method is obtained by using the right Rectangle rule to approximate the inte-
grals. At time tn+1 , the solution is approximated by
y_{n+1} = y_n + ∫_{t_n}^{t_{n+1}} f(t, y(t))dt ≈ y_n + Δt f(t_{n+1}, y_{n+1}).
The corresponding numerical scheme is
w_{n+1} = w_n + Δt f(t_{n+1}, w_{n+1}). (6.6)
Note that the unknown wn+1 also occurs in the right-hand side. For that reason, the Backward
Euler method is called an implicit method. If f depends linearly on y, then the solution to (6.6) can
be easily computed.
For nonlinear initial-value problems, the solution to (6.6) is non-trivial. In that case, a numerical
nonlinear solver has to be applied (Chapter 4). This means that the Backward Euler method uses
more computation time per time step than the Forward Euler method. However, the Backward
Euler method also has some advantages, as will be seen later.
Trapezoidal method
The Trapezoidal method is constructed using the Trapezoidal rule of Chapter 5 to approximate the
integral (see Figure 6.2(b)). This means that
y_{n+1} = y_n + ∫_{t_n}^{t_{n+1}} f(t, y(t))dt ≈ y_n + (Δt/2)(f(t_n, y_n) + f(t_{n+1}, y_{n+1})). (6.7)
The corresponding numerical scheme is
w_{n+1} = w_n + (Δt/2)(f(t_n, w_n) + f(t_{n+1}, w_{n+1})). (6.8)
Note that the Trapezoidal method is also an implicit method.
Modified Euler method
To derive an explicit variant of (6.8), the unknown w_{n+1} in the right-hand side of (6.8) is predicted
using the Forward Euler method. This results in the following predictor-corrector formula:
w̄_{n+1} = w_n + Δt f(t_n, w_n),
w_{n+1} = w_n + (Δt/2)(f(t_n, w_n) + f(t_{n+1}, w̄_{n+1})). (6.9)
In order to analyze the global truncation error, we need to introduce two important concepts:
stability and consistency.
6.4.1 Stability
A physical phenomenon is called stable if a small perturbation of the parameters (including the
initial condition) results in a small difference in the solution. Hence this concept is connected to
the continuous dependence of the solution on the data. Some examples of unstable phenomena
are buckling of an axially loaded bar or resonance of a bridge1 . Throughout this chapter, we
assume that the initial-value problem is stable. First, we will consider the conditions for an
initial-value problem to be stable.
Analytical stability of an initial-value problem
Let an initial-value problem be given by
y′ = f (t, y),
t > t0 ,
y ( t0 ) = y0 .
1 http://ta.twi.tudelft.nl/nw/users/vuik/information/tacoma_eng.html
Suppose next that the initial condition contains an error ε 0 . Then, the perturbed solution ỹ satis-
fies
ỹ′ = f (t, ỹ), t > t0 ,
ỹ(t0 ) = y0 + ε 0 .
The difference between the two functions is given by ε(t) = ỹ(t) − y(t). An initial-value problem
is called stable if
|ε(t)| < ∞ for all t > t0 ,
and absolutely stable if it is stable and
lim |ε(t)| = 0.
t→∞
y′ = λy + g(t), t > t0 ,
(6.11)
y ( t0 ) = y0 .
Subtracting the equations in (6.11) from those in (6.12), we obtain an initial-value problem for the
error ε in the approximation:
ε′ = λε, t > t₀, ε(t₀) = ε₀. (6.13)
This initial-value problem plays an important role in stability analysis, and is often called the
initial-value problem for the so-called test equation.
The solution to (6.13) is
ε(t) = ε₀ e^{λ(t−t₀)},
which shows that (6.13) is stable if λ ≤ 0 and absolutely stable if λ < 0.
Numerical stability of initial-value problems based on the test equation
Next, we explain how perturbations propagate if initial-value problem (6.11) is approximated by
a numerical method. For example, the Forward Euler method applied to (6.11) yields
w_{n+1} = w_n + Δt(λw_n + g(t_n)), and, for the perturbed initial condition, w̃_{n+1} = w̃_n + Δt(λw̃_n + g(t_n)).
Subtracting these equations yields the following propagation formula for the numerical approximation of the error (ε̄_n = w_n − w̃_n):
ε̄_{n+1} = (1 + λΔt) ε̄_n.
This means that the error propagation for the Forward Euler method is modeled by the test
equation.
For each of the numerical methods from Section 6.3, the application to the test equation can be
written as
ε̄_{n+1} = Q(λΔt) ε̄_n. (6.14)
The value Q(λ∆t) determines the factor by which the existing perturbation is amplified, and is
therefore called the amplification factor. By substitution it may be verified that perturbations
remain bounded if
|Q(λΔt)| ≤ 1. (6.16)
Because λ is known, we can compute the values of ∆t such that inequality (6.16) is still satisfied.
For absolute stability, we require | Q(λ∆t)| < 1. Note that the method is unstable if | Q(λ∆t)| > 1.
This leads to the following characterization of stability:
|Q(λΔt)| < 1: absolutely stable,
|Q(λΔt)| = 1: stable,
|Q(λΔt)| > 1: unstable. (6.17)
As an example, we investigate the Forward Euler method, where Q(λ∆t) = 1 + λ∆t. Here,
we assume that λ < 0, hence the initial-value problem is absolutely stable. To satisfy (6.16), ∆t
should be chosen such that
−1 ≤ 1 + λ∆t ≤ 1,
or
−2 ≤ λ∆t ≤ 0.
Since ∆t > 0 and λ ≤ 0, the right-hand side is automatically satisfied. From the left-hand side it
follows that
Δt ≤ −2/λ, (6.18)
with strict inequality for absolute stability.
For the Backward Euler method, stability is found when
−1 ≤ 1/(1 − λΔt) ≤ 1.
It is easily seen that this condition is satisfied for each ∆t ≥ 0, since λ ≤ 0. Therefore, the
Backward Euler method is unconditionally stable if λ ≤ 0.
A similar analysis shows that, for λ ≤ 0, the Modified Euler method is stable if
Δt ≤ −2/λ,
with strict inequality for absolute stability. The Trapezoidal method is again unconditionally
stable.
Stability of a general initial-value problem
For a general (nonlinear) initial-value problem, the stability of the corresponding linearized problem will be analyzed. The initial-value problem y′ = f(t, y), y(t₀) = y₀,
is approximated by a linear Taylor expansion (Theorem 1.6.8) about the point (t̂, ŷ):
f(t, y) ≈ f(t̂, ŷ) + (y − ŷ) ∂f/∂y(t̂, ŷ) + (t − t̂) ∂f/∂t(t̂, ŷ).
The differential equation is thus approximated by
y′ = f(t̂, ŷ) + (y − ŷ) ∂f/∂y(t̂, ŷ) + (t − t̂) ∂f/∂t(t̂, ŷ) = λy + g(t), (6.20)
with
λ = ∂f/∂y(t̂, ŷ), (6.21a)
g(t) = f(t̂, ŷ) − ŷ ∂f/∂y(t̂, ŷ) + (t − t̂) ∂f/∂t(t̂, ŷ). (6.21b)
The value of λ in equation (6.21a) can be used in the requirement for stability (| Q(λ∆t)| ≤ 1).
Note that the stability condition depends on t̂ and ŷ.
As an example, consider again the water-discharge model:
y′ = −10y² + 20, t > 0,
y(0) = 0.
This model can be related to initial-value problem (6.1), using a = 10 and p0 = 20, and the
solution is given by
y(t) = √2 tanh(√200 t).
This means that y(t) ∈ (0, √2) for all t > 0.
In this example, f (t, y) = −10y2 + 20. The linearized approximation of the differential equation
uses
λ = ∂f/∂y(t̂, ŷ) = −20ŷ.
Because y(t) > 0 for all t > 0, this initial-value problem is absolutely stable: λ < 0.
Furthermore, it is important to investigate numerical stability. Note that if the Forward Euler
method is stable, then the numerical approximation of the solution satisfies w_n ∈ (0, √2) for all
n ∈ N. The maximum √2 can be used as an upper bound to derive a criterion for the largest
admissible time step that ensures stability:
Δt ≤ −2/λ = 1/(10ŷ); the smallest bound occurs for ŷ = √2, hence Δt ≤ 1/(10√2).
The estimation of the local truncation error will be explained for the Forward Euler method and
the Modified Euler method. We assume that all derivatives that we use are continuous.
Forward Euler method
For the Forward Euler method, we have
τ_{n+1} = (y_{n+1} − (y_n + Δt f(t_n, y_n))) / Δt. (6.22)
Taylor expansion of yn+1 about tn yields
y_{n+1} = y(t_{n+1}) = y(t_n) + (t_{n+1} − t_n) y′(t_n) + ((t_{n+1} − t_n)²/2) y″(ξ)
= y_n + Δt y′_n + (Δt²/2) y″(ξ), (6.23)
for a ξ ∈ (t_n, t_{n+1}).
Note that y satisfies the differential equation, such that
y′ = f (t, y). (6.24)
Using equations (6.23) and (6.24) in (6.22) yields
τ_{n+1} = (y_n + Δt y′_n + (Δt²/2) y″(ξ) − (y_n + Δt y′_n)) / Δt = (Δt/2) y″(ξ),
where ξ ∈ (t_n, t_{n+1}).
where ξ ∈ (tn , tn+1 ). This means that the Forward Euler method is of order O(∆t).
Modified Euler method
For the Modified Euler method, the value of zn+1 is obtained by replacing the numerical approx-
imation wn in equation (6.9) by the exact value yn :
z̄_{n+1} = y_n + Δt f(t_n, y_n), (6.25a)
z_{n+1} = y_n + (Δt/2)(f(t_n, y_n) + f(t_{n+1}, z̄_{n+1})). (6.25b)
Using a two-dimensional Taylor expansion about (tn , yn ) (Theorem 1.6.8), the final term of (6.25b)
can be written as
f(t_{n+1}, z̄_{n+1}) = f(t_n + Δt, y_n + Δt f(t_n, y_n))
= f(t_n, y_n) + (t_n + Δt − t_n) (∂f/∂t)|_n + (y_n + Δt f(t_n, y_n) − y_n) (∂f/∂y)|_n
+ (1/2)(t_n + Δt − t_n)² (∂²f/∂t²)|_n + (1/2)(y_n + Δt f(t_n, y_n) − y_n)² (∂²f/∂y²)|_n
+ (t_n + Δt − t_n)(y_n + Δt f(t_n, y_n) − y_n) (∂²f/∂t∂y)|_n + . . . . (6.26)
Note that the last three specified terms of (6.26) are O(∆t2 ) (and that the dots represent terms of
order O(∆t3 ) and higher). Since higher-order terms are not required, we arrive at
f(t_{n+1}, z̄_{n+1}) = f_n + Δt (∂f/∂t + f ∂f/∂y)|_n + O(Δt²).
Substituting this into (6.25b) gives
z_{n+1} = y_n + Δt f_n + (Δt²/2)(∂f/∂t + f ∂f/∂y)|_n + O(Δt³). (6.27)
Differentiating the differential equation y′ = f(t, y) with respect to t and using the chain rule yields
y″ = ∂f/∂t + f ∂f/∂y, (6.28)
such that
z_{n+1} = y_n + Δt y′_n + (Δt²/2) y″_n + O(Δt³). (6.29)
Note that the first three terms of this expression are exactly the first three terms of the Taylor
expansion
y_{n+1} = y_n + Δt y′_n + (Δt²/2) y″_n + O(Δt³). (6.30)
This means that
τ_{n+1} = (y_{n+1} − z_{n+1}) / Δt = O(Δt²).
Local truncation error for initial-value problems based on the test equation
For implicit time-integration methods, the procedure to estimate the local truncation error has to
be adapted. Therefore, we only consider the error estimation for the test equation y′ = λy. The
treatment for implicit time-integration methods to the general equation y′ = f (t, y(t)) is outside
the scope of this book.
Remember that zn+1 is obtained by applying the numerical scheme to the solution y n . Using the
amplification factor, this can be computed easily: zn+1 = Q(λ∆t)yn (Section 6.4.1). Therefore, the
local truncation error equals
τ_{n+1} = (y_{n+1} − Q(λΔt) y_n) / Δt. (6.31)
Furthermore, because the solution to the test equation equals y(t) = y₀e^{λt}, we have y_{n+1} = e^{λΔt} y_n, (6.32) and it is clear that
τ_{n+1} = ((e^{λΔt} − Q(λΔt)) / Δt) y_n. (6.33)
This means that the size of the local truncation error is determined by the difference between
the amplification factor and the exponential function. To determine this difference, the Taylor
expansion of eλ∆t can be used (see Theorem 1.6.10):
e^{λΔt} = 1 + λΔt + (1/2)(λΔt)² + (1/3!)(λΔt)³ + . . . = ∑_{k=0}^{∞} (1/k!)(λΔt)^k. (6.34)
Comparing (6.34) with Q(λ∆t) = 1 + λ∆t we conclude that the Forward Euler method has a local
truncation error of order O(∆t). For the Modified Euler method,
Q(λΔt) = 1 + λΔt + (1/2)(λΔt)²,
such that the local truncation error is O(∆t2 ) and this method is more accurate in the limit ∆t → 0.
For the Backward Euler method, a Taylor expansion of the amplification factor yields (see Theo-
rem 1.6.9)
Q(λΔt) = 1/(1 − λΔt) = 1 + λΔt + (λΔt)² + . . . = ∑_{k=0}^{∞} (λΔt)^k,
hence the local truncation error of the Backward Euler method is O(∆t). The Trapezoidal method
has a local truncation error of order O(∆t2 ) (see exercise 3).
where (n + 1)∆t = T, and T is a fixed time. We call a method consistent of order p if τn+1 = O(∆t p ).
Definition 6.4.3 (Convergence) A scheme is called convergent if the global truncation error en+1 sat-
isfies
lim_{Δt→0} e_{n+1} = 0,
where (n + 1)∆t = T, and T is a fixed time. A method is convergent of order p if en+1 = O(∆t p ).
The theorem below is one of the most important theorems to prove convergence.
Theorem 6.4.1 (Lax’s equivalence theorem, [6]) If a numerical method is stable and consistent, then
the numerical approximation converges to the solution for ∆t → 0. Moreover, the global truncation error
and the local truncation error are of the same order.
Proof
This theorem will only be proved for the test equation y ′ = λy. In that case, we know that
w_{n+1} = Q(λΔt)w_n, such that the global truncation error, e_{n+1} = y_{n+1} − w_{n+1}, satisfies
e_{n+1} = Q(λΔt) e_n + Δt τ_{n+1}, and hence, with e₀ = 0, e_{n+1} = ∑_{ℓ=0}^{n} Q(λΔt)^ℓ Δt τ_{n+1−ℓ}.
Using stability (|Q(λΔt)| ≤ 1) and (n + 1)Δt = T, the global truncation error can be bounded by
|e_{n+1}| ≤ ∑_{ℓ=0}^{n} |Q(λΔt)|^ℓ Δt |τ_{n+1−ℓ}| ≤ ∑_{ℓ=0}^{n} Δt |τ_{n+1−ℓ}| ≤ T · max_{1≤ℓ≤n+1} |τ_ℓ|.
This implies that the order of convergence of the global truncation error equals that of the local
truncation error. Furthermore, the global truncation error tends to zero as ∆t → 0, since the
numerical method is consistent. ⊠
Table 6.1: Number of function evaluations (denoted by # fe) for the Forward Euler method and
the Modified Euler method using various accuracies. Integration is done from t0 = 0 to T = 1,
and ∆t is chosen such that the desired global truncation errors are found.
Example 6.5.1 shows that a higher-order method is to be preferred, if a high accuracy is required.
This observation is valid as long as the solution is sufficiently smooth.
w_{n+1} = w_n + (1/6)(k₁ + 2k₂ + 2k₃ + k₄), (6.36)
where
k₁ = Δt f(t_n, w_n), (6.37a)
k₂ = Δt f(t_n + Δt/2, w_n + k₁/2), (6.37b)
k₃ = Δt f(t_n + Δt/2, w_n + k₂/2), (6.37c)
k₄ = Δt f(t_n + Δt, w_n + k₃). (6.37d)
Note that the corrector formula is based on Simpson’s rule for numerical integration:
y(t_{n+1}) = y(t_n) + ∫_{t_n}^{t_{n+1}} f(t, y(t))dt
≈ y(t_n) + (Δt/6) (f(t_n, y(t_n)) + 4 f(t_n + Δt/2, y(t_n + Δt/2)) + f(t_n + Δt, y(t_n + Δt))).
Here, y(tn ) is approximated by wn , and the values of y(tn + ∆t/2) and y(tn + ∆t) have to be
predicted. From equations (6.37) it follows that both wn + k1 /2 and wn + k2 /2 approximate
y(tn + ∆t/2), and that y(tn + ∆t) is predicted by wn + k3 .
The amplification factor of the RK4 method can be computed using the test equation. Application
of formulae (6.36) and (6.37) to y′ = λy results in
w_{n+1} = (1 + λΔt + (1/2)(λΔt)² + (1/6)(λΔt)³ + (1/24)(λΔt)⁴) w_n.
This means that the amplification factor of the RK4 method is given by
Q(λΔt) = 1 + λΔt + (1/2)(λΔt)² + (1/6)(λΔt)³ + (1/24)(λΔt)⁴. (6.38)
Next, the local truncation error is computed for the test equation. The quantity zn+1 is defined
by replacing wn in (6.36) and (6.37) by yn . Substitution of f (t, y) = λy leads to zn+1 = Q(λ∆t)yn .
Using that yn+1 = eλ∆t yn (equation (6.32)), we obtain
y_{n+1} − z_{n+1} = (e^{λΔt} − Q(λΔt)) y_n. (6.39)
The right-hand side of (6.38) consists exactly of the first five terms of the Taylor series of eλ∆t .
Hence in the right-hand side of (6.39), only the 5th and the higher powers of ∆t remain. Division
by ∆t shows that the local truncation error is O(∆t4 ). Also for a general initial-value problem, it
can be shown that the RK4 method is of fourth order.
Finally, it is important to investigate the stability properties. From (6.38) it follows that for each
∆t > 0, Q(λ∆t) is larger than one if λ is positive, hence exponential error growth will be in-
evitable.
For λ < 0, the RK4 method is stable when
−1 ≤ 1 + x + (1/2)x² + (1/6)x³ + (1/24)x⁴ ≤ 1, where x = λΔt. (6.40)
Table 6.2: Number of function evaluations (denoted by # fe) for the Forward Euler, Modified
Euler and RK4 method using various accuracies. Integration is done from t0 = 0 to T = 1, and
∆t is chosen such that the desired global truncation errors are found.
Table 6.3: Amplification factors, stability conditions (λ ∈ R, λ < 0), and local truncation errors.
method            amplification factor Q(λΔt)                                   stab. condition   τ_{n+1}
Forward Euler     1 + λΔt                                                       Δt ≤ −2/λ         O(Δt)
Backward Euler    1/(1 − λΔt)                                                   all Δt            O(Δt)
Trapezoidal       (1 + (1/2)λΔt)/(1 − (1/2)λΔt)                                 all Δt            O(Δt²)
Modified Euler    1 + λΔt + (1/2)(λΔt)²                                         Δt ≤ −2/λ         O(Δt²)
RK4               1 + λΔt + (1/2)(λΔt)² + (1/6)(λΔt)³ + (1/24)(λΔt)⁴            Δt ≤ −2.8/λ       O(Δt⁴)
w_N^{Δt} − w_{N/2}^{2Δt} = c_p(t) Δt^p (2^p − 1), (6.45)
such that
c_p(t) = (w_N^{Δt} − w_{N/2}^{2Δt}) / (Δt^p (2^p − 1)). (6.46)
By substituting this estimate of c_p(t) into (6.44b), we obtain a global truncation-error estimate for
the approximation with time step Δt:
y(t) − w_N^{Δt} ≈ (w_N^{Δt} − w_{N/2}^{2Δt}) / (2^p − 1). (6.47)
The order of the method manifests itself in the factor 1/(2^p − 1) of the global truncation-error estimate.
For a certain value Δt, it is unknown whether relation (6.43) is sufficiently well satisfied to obtain an
accurate truncation-error estimate. Applying the same approach as in Section 3.7.2, and defining
w_{N/4}^{4Δt} as the approximation for y(t) using N/4 time steps of length 4Δt, equation (3.14) gives that
(w_{N/2}^{2Δt} − w_{N/4}^{4Δt}) / (w_N^{Δt} − w_{N/2}^{2Δt}) = 2^p. (6.48)
If for a certain ∆t the value of equation (6.48) is in the neighborhood of the expected 2 p , then ∆t
is small enough to obtain an accurate global truncation-error estimate.
As an example, consider the initial-value problem
y′ = 50 − 2y^{2.1}, t > 0,
y(0) = 0. (6.49)
An explicit solution to this problem is not known. Table 6.4 gives the numerical approximation
of the solution at t = 0.2 using the Forward Euler method for various time steps. The estimate for
y(t) − w_N^{Δt} (equation (6.47)) is also given, together with the estimated value of 2^p, using equation
(6.48). Note that indeed, the global truncation-error estimate is doubled when 2Δt is taken. The
results in the fourth column show that from ∆t = 0.00125 onwards the global truncation error
may be considered linear in ∆t.
Table 6.4: Numerical approximation and global truncation-error estimates for the solution to
problem (6.49) at t = 0.2, using the Forward Euler method and equations (6.47) and (6.48).
Δt               w_N^{Δt}         y(t) − w_N^{Δt}   2^p
0.01000000000 4.559913710927 - -
0.00500000000 4.543116291062 -0.01679742 -
0.00250000000 4.534384275072 -0.00873202 1.9236589
0.00125000000 4.529943322643 -0.00444095 1.9662485
0.00062500000 4.527705063356 -0.00223826 1.9841099
0.00031250000 4.526581601706 -0.00112346 1.9922881
0.00015625000 4.526018801777 -0.00056280 1.9962008
0.00007812500 4.525737136255 -0.00028167 1.9981144
0.00003906250 4.525596237317 -0.00014090 1.9990607
0.00001953125 4.525525771331 -0.00007047 1.9995312
In general, the global truncation error at a certain time t should satisfy |e(t, ∆t)| ≤ ε, for a certain
bound ε. Richardson’s extrapolation can be used to determine whether a certain value of ∆t is
accurate enough.
In practice, ∆t is often based on knowledge of the general behavior of the solution or an idea
about the number of points necessary to recognize the solution visually. For example, to visualize
the sine function on the interval (0, π ), about 10 to 20 points are required. To estimate the global
truncation error for this value of ∆t, the numerical integration is performed three times (for ∆t,
2∆t and 4∆t), and the value of (6.48) is computed. Note that if the estimation of 2 p is not accurate
enough, then ∆t should apparently be taken smaller. In practice, the time step is halved, such that
two earlier approximations can be reused. As soon as 2 p is approximated accurately enough,
an estimate for the global truncation error according to (6.47) is computed, using the last two
approximations. If the estimated global truncation error exceeds the threshold, we again take a
smaller time step, until the error estimate is small enough. The approximation of the solution
will become more and more accurate because (6.48) is satisfied better and better as ∆t decreases.
This characterizes the use of adaptive time-stepping in the numerical integration of initial-value
problems. More details about adaptive time-stepping can be found in [8].
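The procedure just described can be sketched in Python as follows (illustrative: the Forward Euler method, p = 1, applied to problem (6.49); the time steps are examples only).

def integrate(f, y0, T, n):
    dt, w = T / n, y0
    for k in range(n):
        w = w + dt * f(w)          # Forward Euler for an autonomous equation
    return w

f = lambda y: 50.0 - 2.0 * y**2.1
T, N = 0.2, 1600
w_dt  = integrate(f, 0.0, T, N)        # time step dt
w_2dt = integrate(f, 0.0, T, N // 2)   # time step 2 dt
w_4dt = integrate(f, 0.0, T, N // 4)   # time step 4 dt
print((w_2dt - w_4dt) / (w_dt - w_2dt))   # ratio (6.48), close to 2^p = 2
print((w_dt - w_2dt) / (2**1 - 1))        # error estimate (6.47) for w_dt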
For the Forward Euler method, it will be shown that the application to systems of equations is
similar to the scalar case. Similarly, the vector-valued generalization of the scalar case will be
presented for the Backward Euler, Trapezoidal, Modified Euler, and RK4 method.
For systems of differential equations, the Forward Euler method (equation (6.5)) should be ap-
plied to each differential equation separately. For the jth component, the solution at time tn+1 is
approximated by
w j,n+1 = w j,n + ∆t f j (tn , w1,n , . . . , wm,n ), (6.50)
j = 1, . . . , m. The corresponding vector notation reads
w_{n+1} = w_n + Δt f(t_n, w_n).
Indeed, the Forward Euler method for systems of differential equations is a straightforward
vector-valued generalization of the corresponding scalar method. Using the definitions of the
Backward Euler (equation (6.6)), Trapezoidal (equation (6.8)), Modified Euler (equation (6.9))
and RK4 method (equation (6.36)), we obtain their vector-valued generalizations:
w_{n+1} = w_n + Δt f(t_{n+1}, w_{n+1}) (Backward Euler),
w_{n+1} = w_n + (Δt/2)(f(t_n, w_n) + f(t_{n+1}, w_{n+1})) (Trapezoidal),
w̄_{n+1} = w_n + Δt f(t_n, w_n), w_{n+1} = w_n + (Δt/2)(f(t_n, w_n) + f(t_{n+1}, w̄_{n+1})) (Modified Euler),
and similarly for the RK4 stages (6.37) with vector arguments.
For the implicit methods, an approximation of the solution to a nonlinear system should be de-
termined at each time step. This can be achieved using a method from Section 4.6.
Higher-order initial-value problems
A higher-order initial-value problem relates the mth derivative of a function to its lower-order
derivatives. Using the notation
x^{(j)} = d^j x / dt^j, j = 0, . . . , m,
we consider the following initial-value problem:
x^{(m)} = f(t, x, x^{(1)}, . . . , x^{(m−1)}), t > t₀,
x(t₀) = x₀,
x^{(1)}(t₀) = x₀^{(1)},
. . .
x^{(m−1)}(t₀) = x₀^{(m−1)}, (6.52)
which consists of a single mth-order differential equation and a set of initial conditions for x and
all of its derivatives up to the (m − 1)th order.
One way to solve such a higher-order initial-value problem is to transform the single differential
equation into a system of first-order differential equations, by defining
y₁ = x,
y₂ = x^{(1)},
. . .
y_m = x^{(m−1)}.
In this way, problem (6.52) transforms into the following first-order system:
y₁′ = y₂,
. . .
y′_{m−1} = y_m,
y′_m = f(t, y₁, . . . , y_m), t > t₀,
y₁(t₀) = x₀,
. . .
y_m(t₀) = x₀^{(m−1)}.
Multiplying this equation by a leads to the equivalent equation aλ2 + bλ + c = 0, with solutions
λ_{1,2} = (−b ± √(b² − 4ac)) / (2a);
note that the eigenvalues are complex if b2 − 4ac < 0. Table 6.5 contains some possible choices
for a, b, and c, the corresponding eigenvalues and the stability result.
Table 6.5: Analytical stability investigation for initial-value problem of Example 6.8.1.
a b c λ1 λ2 stability result
1 2 0 0 −2 stable
1 2 1 −1 −1 absolutely stable
1 0 1 0+i 0−i stable
1 2 2 −1 + i −1 − i absolutely stable
1 −2 0 2 0 unstable
A = SΛS−1 . (6.55)
Here, Λ = diag(λ1 , . . . , λm ) contains the eigenvalues of A (which can be complex)2 , and S is the
corresponding eigenvector matrix having the (right) eigenvectors v1 ,. . . , vm of A as columns. For
the sake of completeness we recall the definition: Av j = λ j v j , j = 1, . . . , m.
Let η = S−1 ε, such that ε = Sη. Substituting this transformation in (6.54) and using factorization
(6.55) yields
Sη′ = ASη = SΛη,
η(t0 ) = S−1 ε0 = η0 .
(We write diag(λ_1 , . . . , λ_m ) to denote a diagonal matrix with λ_1 , . . . , λ_m on the main diagonal.)
Because Λ is diagonal, the transformed system decouples into η_j′ = λ_j η_j , with solutions

η_j = η_{j,0} e^{λ_j (t−t_0 )} ,   j = 1, . . . , m.
Note that in general, the eigenvalues can be complex. Writing each eigenvalue as

λ_j = µ_j + iν_j ,

the modulus of each component satisfies |η_j (t)| = |η_{j,0} | e^{µ_j (t−t_0 )} . From this observation it follows that the system is stable if µ_j = Re(λ_j ) ≤ 0 for all j = 1, . . . , m. Absolute stability is obtained when µ_j = Re(λ_j ) < 0 for all j = 1, . . . , m. If there exists a j for which µ_j = Re(λ_j ) > 0, then the system is unstable.
Numerical stability is characterized analogously: the integration is numerically stable if |Q(λ_j ∆t)| ≤ 1 for all eigenvalues, and unstable if there is an eigenvalue for which |Q(λ_j ∆t)| > 1 (cf. equations (6.17)). Note that in general, λ_j is complex, such that the modulus of Q(λ_j ∆t) needs to be computed (see Definition 1.6.3).
In this section, we will investigate the stability conditions for several time-integration methods.
Example 6.8.2 (Stability condition for the Forward Euler method)
Absolute stability for the Forward Euler approximation of a system of differential equations is
obtained when all eigenvalues λ j of the matrix A satisfy the strict inequality
|1 + λ_j ∆t| < 1.    (6.58)

Taking squares on both sides of the stability condition (6.58) yields the following inequality:

(1 + µ_j ∆t)² + (ν_j ∆t)² < 1.

This inequality has to be satisfied by all eigenvalues of matrix A to ensure absolutely stable numerical integration. Note that the above inequality cannot be satisfied if eigenvalue λ_j has a positive real part (µ_j > 0). Hence, if at least one eigenvalue of the matrix A has a positive real part, then the system is unstable.

The corresponding stability condition for ∆t follows from the equation (1 + ∆tµ_j )² + ν_j²∆t² = 1.
The nonzero root of this quadratic polynomial in ∆t is

∆t = −2µ_j / (µ_j² + ν_j²),

such that, for each eigenvalue, absolute stability requires

∆t < −2µ_j / (µ_j² + ν_j²),   j = 1, . . . , m.
If λ j is real-valued, then this condition can be related to the scalar case (condition (6.18)). Ensuring
that the above condition is satisfied for all eigenvalues, the integration of the system is absolutely
stable if ∆t satisfies

∆t < min_{j=1,...,m} ( −2µ_j / (µ_j² + ν_j²) ).    (6.60)
For an imaginary eigenvalue (µ j = 0), the Forward Euler method is always unstable.
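The bound (6.60) is easily evaluated numerically. The following Matlab sketch computes it for an assumed example matrix; the matrix itself is not from the book.

    A   = [-2 1; 0 -3];                     % assumed example matrix
    lam = eig(A);
    mu  = real(lam);  nu = imag(lam);
    if any(mu >= 0)
        disp('no absolutely stable time step: some Re(lambda_j) >= 0');
    else
        dtmax = min(-2*mu ./ (mu.^2 + nu.^2));   % condition (6.60)
        fprintf('absolutely stable for dt < %g\n', dtmax);
    end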
Example 6.8.3 (Unconditional stability of the Backward Euler method)
Stable integration using the Backward Euler method is obtained when
|Q(λ_j ∆t)| = 1 / |1 − λ_j ∆t| ≤ 1
for all eigenvalues λ j = µ j + iνj of A. Note that this is equivalent to |1 − λ j ∆t| ≥ 1, which holds
when
|1 − λ_j ∆t|² = |1 − µ_j ∆t − iν_j ∆t|² = (1 − µ_j ∆t)² + (ν_j ∆t)² ≥ 1.    (6.61)
If all eigenvalues have non-positive real part µ j (which is required to obtain an analytically stable
system), then equation (6.61) is automatically satisfied, independent of the time step ∆t, and
therefore, | Q(λ j ∆t)| ≤ 1. In that case, the Backward Euler method is unconditionally stable.
If all eigenvalues are imaginary (µ j = 0), then the Backward Euler method is absolutely stable,
since | Q(λ j ∆t)| < 1 for all eigenvalues.
Example 6.8.4 (Unconditional stability of the Trapezoidal method)
For the Trapezoidal method, the amplification factor for eigenvalue λ_j = µ_j + iν_j is

|Q(λ_j ∆t)| = |1 + (1/2)λ_j ∆t| / |1 − (1/2)λ_j ∆t|
            = |(1 + (1/2)µ_j ∆t) + (1/2)iν_j ∆t| / |(1 − (1/2)µ_j ∆t) − (1/2)iν_j ∆t|
            = √((1 + (1/2)µ_j ∆t)² + ((1/2)ν_j ∆t)²) / √((1 − (1/2)µ_j ∆t)² + ((1/2)ν_j ∆t)²).
For eigenvalues with non-positive real part (µ j ≤ 0) the numerator is always smaller than or
equal to the denominator, and hence, | Q(λ j ∆t)| ≤ 1. If this holds for all eigenvalues, then the
Trapezoidal method is stable for each time step, hence this method is unconditionally stable.
If all eigenvalues are imaginary (µ_j = 0), then the Trapezoidal method is stable, since |Q(λ_j ∆t)| = 1 for all eigenvalues.
Theorem 6.8.1 If A = SΛS^{−1} , then the amplification matrix satisfies Q(∆tA) = SQ(∆tΛ)S^{−1} .

Proof
First, it should be noticed that for each of the methods considered, the amplification matrix has the form

Q(∆tA) = (R(∆tA))^{−1} P(∆tA),

in which P(∆tA) and R(∆tA) are polynomials in ∆tA, that is, matrices of the form

P(x) = p_0 + p_1 x + . . . + p_{n_p} x^{n_p} ,   n_p ∈ N,

evaluated in x = ∆tA. For example, for the Trapezoidal method, we have

R(∆tA) = I − (1/2)∆tA,   and   P(∆tA) = I + (1/2)∆tA.
Using factorization (6.55), we have P(∆tA) = SP(∆tΛ)S^{−1} and (R(∆tA))^{−1} = S(R(∆tΛ))^{−1} S^{−1} , such that

Q(∆tA) = S (R(∆tΛ))^{−1} P(∆tΛ) S^{−1} .    (6.64)

Note that (R(∆tΛ))^{−1} and P(∆tΛ) are both diagonal matrices, and therefore, their product is also a diagonal matrix:

(R(∆tΛ))^{−1} P(∆tΛ) = diag(Q(λ_1 ∆t), . . . , Q(λ_m ∆t)) = Q(∆tΛ).    (6.65)

The above derivation makes use of the fact that the amplification matrix is built from the scalar amplification factors. Substituting the expression (6.65) in equation (6.64) proves that Q(∆tA) = SQ(∆tΛ)S^{−1} . ⊠
In order to investigate numerical stability of time-integration methods, the test system (6.54) is used. For each time-integration method, the approximation at time t_{n+1} equals

ε_{n+1} = Q(∆tA)ε_n .    (6.66)
Next, using that ε = Sη (as in Section 6.8.2), equation (6.66) is transformed into (using Theorem
6.8.1)
Sηn+1 = Q(∆tA)Sηn = SQ(∆tΛ)ηn .
Because S is non-singular, this means that
ηn+1 = Q(∆tΛ)ηn ,
Stability of η_n is equivalent to stability of ε_n , since S does not depend on the approximation. This means that a sufficient condition for numerical stability is
| Q(λ j ∆t)| ≤ 1, j = 1, . . . , m.
If for one of the eigenvalues | Q(λ j ∆t)| > 1, then the system is unstable.
For the Forward Euler method, absolute stability requires λ∆t to lie in the set

S_FE = { λ∆t ∈ C : |1 + λ∆t| < 1 }.

Note that this set represents a circle with midpoint (−1, 0) and radius 1. For points λ∆t ∈ S_FE the modulus of the amplification factor is less than 1, and outside the circle it is larger than 1.
The region inside the circle is therefore called the stability region of the Forward Euler method. The
stability region of the Forward Euler method lies completely to the left of the imaginary axis and
is tangent to that axis.
The stability region may be used to provide a graphical estimate of the stability condition for
∆t. As a starting point, each eigenvalue λ j should be marked in the complex plane (which cor-
responds to assuming ∆t = 1). The position of the marked point shows whether the integration
is stable for ∆t = 1 or not. If any marked point lies outside the stability region, then ∆t must
be taken smaller, such that λ j ∆t falls inside the stability region. Graphically, this corresponds
to moving from the marked point to the origin along a straight line. In many cases even the
unassisted eye can determine fairly accurately for which ∆t the line intersects the circle. This
process has to be performed for all eigenvalues; the smallest ∆t that results determines the sta-
bility bound.
Obviously, this operation will only succeed if all eigenvalues have a strictly negative real part.
Shrinking ∆t will never bring λ∆t inside the stability region when an eigenvalue has positive
or zero real part, the latter because the stability region is tangent to the imaginary axis. Imagi-
nary eigenvalues are often found when representing systems with oscillatory solutions without
damping.
In Figure 6.4, all stability regions can be seen for the different numerical time-integration meth-
ods. Note that both implicit methods (Backward Euler and Trapezoidal method) are uncondi-
tionally stable when all eigenvalues have non-positive real part. If λ j falls in the white region of
Figure 6.4(b) (unstable case of the Backward Euler method), then ∆t has to be increased to reach
the stability region.
The stability region of the Modified Euler method is very similar to that of the Forward Euler
method. As a consequence both methods have almost the same time step constraint. In partic-
ular, both methods share the property that systems containing imaginary eigenvalues cannot be integrated in a stable way, whatever the time step: unbounded error growth will ensue.
The RK4 method is conditionally stable. It is important to note that this is the only discussed
explicit method that may be stable for systems which have imaginary eigenvalues. In that case,
RK4 is stable if |λ∆t| ≤ 2.8.
Note that the observations in this section can be related to Section 6.4.1, in which numerical
stability was derived.
In practice, nonlinear initial-value problems

y′ = f(t, y),
y(t_0 ) = y_0 ,

locally have the same properties as the linear system (6.53). The stability behavior can be determined by linearization of the initial-value problem about (t̂, ŷ) (note that a similar approach was adopted in Section 6.4.1 for general scalar equations). The role of A is taken over by the Jacobian matrix of f, evaluated in (t̂, ŷ):
J|_(t̂,ŷ) = [ ∂f_1/∂y_1  . . .  ∂f_1/∂y_m ]
            [     ...            ...     ]    (6.67)
            [ ∂f_m/∂y_1  . . .  ∂f_m/∂y_m ]  evaluated in (t̂, ŷ).
Note that the eigenvalues of the Jacobian matrix, λ1 (t̂, ŷ), . . . , λm (t̂, ŷ), depend on t̂ and ŷ. The
stability properties of numerical time-integration methods therefore depend on time and approx-
imation.
[Figure 6.4: stability regions in the complex λ∆t-plane (axes Re(λ∆t) and Im(λ∆t)) of (a) the Forward Euler, (b) the Backward Euler, (c) the Modified Euler, (d) the Trapezoidal, and (e) the RK4 method.]
In this example, we investigate the population densities of two species of bacteria competing
with each other for the same supply of food. A model for this competition leads to the system of
differential equations
y_1′ = y_1 (1 − y_1 − y_2 ),
y_2′ = y_2 (0.5 − 0.75y_1 − 0.25y_2 ),
in which y1 and y2 are the population densities of the bacteria. The solution to this system of
equations is unknown, and therefore numerical time-integration methods are applied.
Assume that the approximation at time tn is given by wn = (1.5, 0)⊤. We choose t̂ = tn , ŷ = wn ,
and linearize the system about (t̂, ŷ). The Jacobian matrix of this model, evaluated at (t̂, ŷ), is
given by

J|_(t̂,ŷ) = [ 1 − 2ŷ_1 − ŷ_2            −ŷ_1         ]   =   [ −2   −1.5   ]
            [   −0.75ŷ_2      0.5 − 0.75ŷ_1 − 0.5ŷ_2 ]       [  0   −0.625 ],

with eigenvalues λ_1 = −2 and λ_2 = −0.625 (the matrix is upper triangular).
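A small Matlab sketch of this linearization step is given below; it evaluates the Jacobian at w_n and the resulting Forward Euler bound (6.60). The data follow this example, but the bound computation itself is an illustration, not a prescribed procedure.

    yhat = [1.5; 0];                       % linearization point w_n
    J = [1 - 2*yhat(1) - yhat(2),  -yhat(1);
         -0.75*yhat(2),  0.5 - 0.75*yhat(1) - 0.5*yhat(2)];
    lam = eig(J);                          % here: -2 and -0.625
    dtmax = min(-2*real(lam) ./ abs(lam).^2);
    fprintf('Forward Euler stable near w_n for dt < %g\n', dtmax);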
In this example, the mathematical pendulum of example 6.7.1 is revisited. Note that the corre-
sponding system can be written as
y_1′ = y_2 = f_1 (t, y),
y_2′ = − sin y_1 = f_2 (t, y),

with Jacobian matrix

J|_(t̂,ŷ) = [     0       1 ]
            [ − cos ŷ_1   0 ].
If −π/2 < ŷ_1 < π/2, then the eigenvalues of this matrix are imaginary, given by

λ_{1,2} (t̂, ŷ) = ± i √(cos ŷ_1 ).
In Figures 6.4(a) and 6.4(c), it can be seen that the Forward Euler method and the Modified Euler
method are unstable for every ∆t when the eigenvalues are imaginary (this has also been shown
for the Forward Euler method in example 6.8.2).
However, the Backward Euler method and the Trapezoidal method are unconditionally stable
(cf. Figures 6.4(b) and 6.4(d)). We remark that the amplification factor of the Backward Euler
method satisfies | Q(λ1,2 (t̂, ŷ))| < 1, such that a damped oscillation is found when this method is
applied (cf. example 6.8.3). However, since imaginary eigenvalues imply neither losses (dissipa-
tion or damping) nor growth of the solution, the Trapezoidal method is more useful in this case
(| Q(λ1,2 (t̂, ŷ))| = 1, see example 6.8.4).
The RK4 method is stable if |λ_{1,2} (t̂, ŷ)∆t| = ∆t √(cos ŷ_1 ) ≤ 2.8, which means that a suitable time step for the RK4 method depends on ŷ and t̂. If ∆t is strictly smaller than the stability bound, then the RK4 method gives a damped oscillation.
For the test system, the local truncation-error vector at time t_{n+1} is defined by

τ_{n+1} = (y_{n+1} − Q(∆tA)y_n ) / ∆t ,
and the global truncation-error vector by e_{n+1} = y_{n+1} − Q(∆tA)w_n . Because w_n = y_n − e_n , the global truncation-error vector equals

e_{n+1} = ∆tτ_{n+1} + Q(∆tA)e_n .    (6.69)

Next, equation (6.69) is decoupled using the transformation e_{n+1} = Sη_{n+1} , where S is the matrix having the eigenvectors, v_1 , . . . , v_m , of A as columns. This means that

η_{n+1} = ∆tS^{−1} τ_{n+1} + Q(∆tΛ)η_n .    (6.70)

The values α_{j,n+1} , defined by τ_{n+1} = Sα_{n+1} , are the components of the local truncation-error vector with respect to the basis of eigenvectors.
Transforming S−1 τ n+1 into S−1 Sαn+1 = αn+1 , the decoupled global truncation-error vector of
equation (6.70) equals
ηn+1 = ∆tαn+1 + Q(∆tΛ)ηn .
As in equation (6.35), this results in

η_{n+1} = Σ_{ℓ=0}^{n} (Q(∆tΛ))^ℓ ∆tα_{n+1−ℓ} .
Because Q(∆tΛ) = diag( Q(λ1 ∆t), . . . , Q(λm ∆t)), the components of the global truncation-error
vector are given by
η_{j,n+1} = Σ_{ℓ=0}^{n} (Q(λ_j ∆t))^ℓ ∆tα_{j,n+1−ℓ} ,   j = 1, . . . , m.    (6.72)
Note that this is a decoupled system: the scalar properties of local and global truncation errors
can be applied to each component (as in Sections 6.4.2 and 6.4.3).
This automatically entails that each component of the decoupled local truncation-error vectors
α1 , . . . , αn+1 has the same order of magnitude as the scalar local truncation error of the corre-
sponding method, for example of order O(∆t) for the Forward Euler method. If a numerical
method is stable and consistent, then the decoupled global truncation error is of the same or-
der as the decoupled local truncation error (see Theorem 6.4.1). Because the matrix S does not
depend on ∆t, the original local and global truncation error are also of the same order.
Consider the initial-value problem

y′ = λ(y − F(t)) + F′(t),   t > t_0 ,
y(t_0 ) = y_0 ,

with solution y(t) = (y_0 − F(t_0 ))e^{λ(t−t_0 )} + F(t), as may be verified directly by substitution. This initial-value problem is stiff if λ is strongly negative and if variations of F occur on a large time scale (slowly varying). Then, the transient is (y_0 − F(t_0 ))e^{λ(t−t_0 )} (rapidly decaying) and F(t) is the quasi-stationary solution.
We consider approximations using a numerical time-integration method. When we choose an
explicit method, then stability is obtained if the time step is chosen small enough. Since λ is
strongly negative, the condition | Q(λ∆t)| ≤ 1 leads to a very small bound for ∆t. Note that this
restriction is only due to the transient part of the solution, in which λ occurs. With respect to
the accuracy to approximate the quasi-stationary solution, larger time steps could be taken. This
means that the stability condition restricts the time step more than the accuracy requirement.
Therefore, implicit time-integration methods (which are unconditionally stable) are preferable
for stiff problems.
Considering the error behavior, it is important to note that local truncation errors are relatively
large in the beginning, since the transient decays very rapidly. Since the quasi-stationary solution
is slowly varying, these errors are smaller at later times. The global truncation error, that adapts
to these local truncation errors, also decreases after passing the transient phase. This can be
derived from definition (6.35):
e_{n+1} = Σ_{ℓ=0}^{n} (Q(λ∆t))^ℓ ∆tτ_{n+1−ℓ}
        = (Q(λ∆t))^n ∆tτ_1 + (Q(λ∆t))^{n−1} ∆tτ_2 + . . . + Q(λ∆t)∆tτ_n + ∆tτ_{n+1} .    (6.73)
Hence if the time-integration method is stable and the amplification factor is small enough (for ex-
ample, when | Q(λ∆t)| ≤ 0.5), then, due to exponential decay, initial local truncation errors have
much less influence than local truncation errors in the last few steps, which have not decayed
that much. The smaller | Q(λ∆t)| is, the faster the effect of past local truncation errors decays.
The global truncation error could even be unnecessarily small when the time step is chosen very
small in relation to the timescale of variation of F. This is not efficient if the approximation after
a long time is required.
Implicit methods
In the previous discussion, it has been explained that explicit time-integration methods are not
very useful for stiff systems, as the stability condition requires a very small time step. Therefore,
it is more efficient to apply an implicit time-integration method, such as the Backward Euler, or
the Trapezoidal method. These methods are unconditionally stable, and therefore, the time step
can be taken arbitrarily large, at least theoretically. In practice, it needs to be chosen such that sufficiently accurate approximations of the quasi-stationary solution are obtained. Note that for each time step, a system of algebraic equations has to be solved.
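For a linear system y′ = Ay + f(t), one Backward Euler step amounts to a single linear solve, as the following Matlab sketch shows; the matrix and data are assumed examples.

    A  = [-1 0; 0 -10000];            % assumed stiff example matrix
    f  = @(t) [0; 0];                 % assumed source term
    dt = 0.1;  N = 50;
    w  = [1; 1];  Id = eye(2);
    for n = 1:N
        t = n*dt;
        w = (Id - dt*A) \ (w + dt*f(t));   % (I - dt A) w_{n+1} = w_n + dt f(t_{n+1})
    end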
Both implicit methods have their unconditional stability in common, but in applications they exhibit significant differences in behavior, as shown by the following experiment.
Example 6.10.1 (Stiff differential equation)
In this example, we investigate the scalar stiff problem

y′ = −100(y − cos t) − sin t,   t > 0,
y(0) = 0,    (6.74)

with solution y(t) = −e^{−100t} + cos t (λ = −100). The solution is approximated using the Backward Euler and the Trapezoidal method, in both cases with time step size 0.2. The size of the transient region is of order 0.01, which means that the first time step already exceeds this region.
This means that the first local truncation error is large. The ease with which the global trunca-
tion error diminishes differs, as can be seen in Figure 6.5. Comparing the approximations to the
solution makes it clearly visible how the neglected transient influences the global truncation er-
ror. The Backward Euler method is favorable: after four time steps the solution curve is almost
reached. The Trapezoidal method needs more time steps to let the large initial local truncation
error decay enough. This can be explained using the amplification factors of these methods (re-
member equations (6.15b) and (6.15c)):
Backward Euler method:  |Q(λ∆t)| = 1/|1 − λ∆t| = 1/(1 + 100 · 0.2) = 1/21 ≈ 0.048,

Trapezoidal method:  |Q(λ∆t)| = |1 + (1/2)λ∆t| / |1 − (1/2)λ∆t| = |1 − 10| / |1 + 10| = 9/11 ≈ 0.82.
Using these amplification factors in equation (6.73) shows that the initial local truncation error is
damped out much faster when the Backward Euler method is used.
Figure 6.5: Approximation of the solution to problem (6.74) using the implicit methods (left: Backward Euler method, right: Trapezoidal method), ∆t = 0.2.
Systems
In practice, stiffness is often encountered in systems. For systems of the form y′ = Ay + f, the
solution is given by the sum of the homogeneous and the particular solution. The system is stiff if at least one of the following criteria is satisfied:
• Some of the real parts of the eigenvalues are strongly negative, whereas some of the other
eigenvalues have a real part close to zero;
• The particular solution varies much more slowly than the homogeneous solution does.
In both cases, the essential feature is that the solution contains two timescales: a fast transient
(which determines numerical stability of explicit time-integration methods) and a slowly-varying
component.
Stiffness in systems occurs mostly in the following form. Suppose, for example, that y′ = Ay,
where A is a 2 × 2 matrix, having eigenvalues λ1 = −1 and λ2 = −10000. The solution is of
the form y(t) = c1 v1 e−t + c2 v2 e−10000t , in which v1 and v2 are the eigenvectors of A and the
integration constants c1 and c2 are determined by the initial conditions of the two variables. The
term proportional to e−10000t plays the role of the transient: this term vanishes much sooner than
the term containing e−t . The latter is slowly-varying compared to the transient and plays the role
of the quasi-stationary solution. However, the transient still determines the stability condition
(Forward Euler method: ∆t ≤ 2/10000) over the whole domain of integration. This inhibits
adaptation of the time step to the relatively slowly-varying quasi-stationary part containing e−t .
Figure 6.6 shows that the Backward Euler method is superstable (|Q(λ∆t)| → 0 as λ∆t → −∞), whereas for the Trapezoidal method,

lim_{λ∆t→−∞} |Q(λ∆t)| = 1.
This means that initial perturbations in the fast components do not decay, or decay very slowly
when using the Trapezoidal method.
Figure 6.6: Amplification factors of the Backward Euler and Trapezoidal method.
This section only dealt with a simple system of differential equations, but stiffness may also occur in more complicated systems. In that case too, implicit methods are to be recommended. However, at each time step, a nonlinear system of equations has to be solved, which increases the computational cost considerably. Therefore, the choice of a method is often a matter of careful consideration, and explicit methods cannot be ruled out beforehand.
L_1 (t) = f(t_{n−1} , w_{n−1} ) + ((t − t_{n−1} )/∆t) (f(t_n , w_n ) − f(t_{n−1} , w_{n−1} )),
see equation (2.1). This means that L1 (tn+1 ) = 2 f (tn , wn ) − f (tn−1 , wn−1 ), and using this instead
of f (tn+1, wn+1 ) in (6.75), the Adams-Bashforth method is obtained:
w_{n+1} = w_n + (3/2)∆t f(t_n , w_n ) − (1/2)∆t f(t_{n−1} , w_{n−1} ).    (6.76)
Note that only one function evaluation is required per time step, since f (tn−1 , wn−1 ) has already
been computed during the previous time step.
For n = 1, only one previous value is known, which is based on the initial condition. In order
to compute a second value, a single-step method of the same order (in this case the Trapezoidal
method, or the Modified Euler method) is applied to the initial condition. Setting w0 equal to
y0 and w1 to the single-step approximation for time t1 , the multi-step method can be applied to
compute w2 . For n ≥ 2, the multi-step algorithm is applied.
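The following Matlab sketch implements this startup procedure for the Adams-Bashforth method (6.76), using the Modified Euler method for the first step; the test problem is an assumed example.

    f  = @(t, w) -w;                      % assumed test problem y' = -y
    dt = 0.1;  N = 20;  t = 0:dt:N*dt;
    w  = zeros(1, N+1);  w(1) = 1;        % w_0 = y_0
    wp = w(1) + dt*f(t(1), w(1));         % Modified Euler predictor for w_1
    w(2) = w(1) + dt/2*(f(t(1), w(1)) + f(t(2), wp));
    for n = 2:N                           % multi-step phase, cf. (6.76)
        w(n+1) = w(n) + dt*(3/2*f(t(n), w(n)) - 1/2*f(t(n-1), w(n-1)));
    end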
Stability
For the test equation y′ = λy, the time-integration method yields

w_{n+1} = (1 + (3/2)λ∆t) w_n − (1/2)λ∆t w_{n−1} .    (6.77)

Substituting w_n = Q^n gives the characteristic equation Q² − (1 + (3/2)λ∆t)Q + (1/2)λ∆t = 0, whose roots are the amplification factors

Q_{1,2} (λ∆t) = (1/2) (1 + (3/2)λ∆t ± √D),   with D = 1 + λ∆t + (9/4)(λ∆t)² .
The values of Q1,2 (λ∆t) are visualized in Figure 6.7. Note that stability bound (6.78) is satisfied
for all values of Q1 (λ∆t), but | Q2 (λ∆t)| ≤ 1 implies λ∆t ≥ −1 (see lower bound in Figure 6.7).
Therefore, the time step should be bounded by
∆t ≤ −1/λ
in order to have a stable method.
Local truncation error
In order to derive the local truncation error, we use a Taylor expansion of the amplification fac-
tors. Here, we make use of the fact that for x sufficiently close to zero,
√(1 + x) = 1 + (1/2)x − (1/8)x² + (1/16)x³ − . . . .
Figure 6.7: Amplification factors Q1,2 (λ∆t) of the Adams-Bashforth method. The stability bounds
are also visualized.
Applying this expansion to approximate √D (taking x = λ∆t + (9/4)(λ∆t)²), the amplification factors equal

Q_1 (λ∆t) = 1 + λ∆t + (1/2)(λ∆t)² − (1/4)(λ∆t)³ + O(∆t⁴),
Q_2 (λ∆t) = (1/2)λ∆t + O(∆t²).
The use of Q_1 (λ∆t) leads to a local truncation error for the test equation of (cf. equation (6.33))

τ_{n+1} = ((e^{λ∆t} − Q_1 (λ∆t))/∆t) y_n = (5/12)λ³∆t² y_n + O(∆t³) = O(∆t²),    (6.79)
whereas Q2 (λ∆t) has a local truncation error of order O(1). This is the reason why Q1 (λ∆t) is
called the principal root, whereas Q2 (λ∆t) is called the spurious root. This root does not belong to
the differential equation, but it is a consequence of the chosen numerical method.
Comparison with the Modified Euler method
Regarding stability, the time step in the Adams-Bashforth method has to be chosen twice as small as in the Modified Euler method. Since the Modified Euler method costs twice the amount of work per time step, the methods are comparable in this respect.
To obtain reasonable accuracy we choose the time step in such a way that the local truncation
error, τn+1, is less than ε. For the Adams-Bashforth method it follows that (using equation (6.79)
and neglecting higher-order terms)
∆t_AB = √( −12ε / (5λ³) ).
For the Modified Euler method, the local truncation error for the test equation is (using the am-
plification factor as given by equation (6.15d))
τ_{n+1} = (1/6)λ³∆t² + O(∆t³),

such that the same approach yields the time step

∆t_ME = √( −6ε / λ³ ).
This means that the time step of the Adams-Bashforth method is √(2/5) ≈ 0.63 times the time step of the Modified Euler method.
Because the Adams-Bashforth method reuses previous approximations, two steps of the Adams-Bashforth method require the same amount of work as one step of the Modified Euler method. This means that, with the same amount of work, the Adams-Bashforth method advances over a time interval 2∆t_AB = 2√(2/5) ∆t_ME ≈ 1.26 ∆t_ME , which is longer than the interval ∆t_ME covered by the Modified Euler method. Hence the Adams-Bashforth method requires less work than the Modified Euler method.
Discussion
Multi-step methods are less popular than Runge-Kutta methods. With multi-step methods, starting problems occur, since approximations must be known at several previous time steps. Moreover, keeping track of the 'spurious roots' is a difficult matter. Finally, adaptive time-stepping is more difficult to apply.
6.12 Summary
In this chapter, the following subjects have been discussed:
- Theory of initial-value problems;
- Elementary single-step methods (Forward Euler, Backward Euler, Trapezoidal, and Modi-
fied Euler method);
- Analysis of numerical time-integration methods:
- Test equation, amplification factor, stability;
- Local truncation error;
- Consistency, convergence, global truncation error;
- Higher-order time-integration methods (RK4 method);
- Global truncation error and Richardson error estimates;
- Numerical methods for systems:
- Test system, amplification matrix, stability;
- Local and global truncation error;
- Stiff systems, superstability;
- Multi-step method (Adams-Bashforth method).
6.13 Exercises
1. Use the Modified Euler method to approximate the solution to the initial-value problem
y′ = 1 + (t − y)2 , 2 ≤ t ≤ 3,
y(2) = 1,
with ∆t = 0.5.
3. Determine the amplification factor of the Trapezoidal method. Estimate with this amplifica-
tion factor the local truncation error of the test equation. Show that the Trapezoidal method
is stable for all ∆t > 0, if λ ≤ 0.
4. Consider the nonlinear initial-value problem: y′ = 1 + (t − y)2 . Give the stability condition
of the Modified Euler method at the point t = 2 and y = 1.
5. Consider the numerical time-integration method
(a) Show that the local truncation error is O(∆t) for each value of β. Moreover, show that
there is no β such that the truncation error is O(∆t2 ).
(b) Determine the amplification factor of this method.
(c) Consider the nonlinear differential equation y ′ = 2y − 4y2 . Determine the maximal
step size such that the method (with β = 1/2) is stable in the neighborhood of y = 1/2.
6. Perform a single step with the Forward Euler method with step size ∆t = 0.1 for the system
7. Perform a single step with the Forward Euler method for the initial-value problem
y′′ − 2y′ + y = te^t − t,   t > 0,
y(0) = 0,
y′(0) = 0,
using step size ∆t = 0.1. Determine the error with respect to the solution
y(t) = (1/6)t³e^t − te^t + 2e^t − t − 2.
(a) Perform a single step with the Forward and Backward Euler method with ∆t = 0.1
and compare the results with the exact answer.
(b) Determine for which step sizes the Forward Euler method is stable.
(c) Perform a single step with the Forward and Backward Euler method with ∆t = 0.0001
and compare the results with the exact answer.
What is your conclusion?
10. Consider the initial-value problem
y′ = y − t² + 1,   t > 0,
y(0) = 1/2.
Approximate y(0.1) = 0.6574145 with the Forward Euler method, using ∆t = 0.025, and
the RK4 method, using ∆t = 0.1. Which method is to be preferred?
11. The solution to an initial-value problem is approximated with two different methods. Both
methods have used step sizes ∆t = 0.1, 0.05, 0.025. The numerical approximations at the
point T = 1 are tabulated below. Determine the order p of the global truncation error of
both methods.
Chapter 7

The finite-difference method for boundary-value problems

7.1 Introduction
Many applications can be simulated by solving a boundary-value problem. A one-dimensional
boundary-value problem consists of a differential equation on a line segment, where the function
and/or its derivatives are given at both boundary points.
Stationary heat conduction in a bar
As an example we consider the temperature distribution in a bar with length L and cross-sectional
area A (Figure 7.1).
[Figure 7.1: a bar of length L, with a control volume between x and x + ∆x.]
We denote the temperature in the bar by T ( x ) (measured in K), and assume that the temperature
is known at both ends: T (0) = Tl and T ( L) = Tr . Further, heat is assumed to be generated inside
the bar. We denote this heat production by Q( x ) ( J/(m3 s)). In general, the temperature profile
evolves over time towards a steady state. In this example we are interested in the temperature
after a long period of time, i.e. the steady-state temperature. For the derivation of the differential
equation the energy conservation law is applied to the control volume between x and x + ∆x (see
Figure 7.1).
There is heat transport by conduction through the faces at x and x + ∆x. According to Fourier's law, this heat transport per unit area and per unit time is called the heat flow density, and equals

q(x) = −λ (dT/dx)(x),
where λ ( J/(msK )) is called the heat-conduction coefficient. For the control volume between x and
x + ∆x, the energy balance should be satisfied: the total heat outflow at x + ∆x minus the total
heat inflow at x should equal the amount of heat produced in this segment. This can be expressed
as
−λA (dT/dx)(x + ∆x) + λA (dT/dx)(x) = AQ(x)∆x.

Dividing by A∆x and taking the limit ∆x → 0 yields the differential equation

−λ (d²T/dx²)(x) = Q(x),   0 < x < L.
This example will often be used in this chapter to illustrate the finite-difference method.
In some applications the heat flux rather than the temperature is known at one of the end points
of the bar. If this is the case at x = L, then Fourier’s law leads to the boundary condition
−λ (dT/dx)(L) = q_L ,
in which q L ( J/(m2 s)) is known.
We assume that p( x ) > 0 and q( x ) ≥ 0 for all x ∈ [0, L]. Note that the boundary-value problem
does not have a unique solution when a0 = a L = 0 and q( x ) = 0, 0 < x < L (if solutions exist at
all). The following terms will be used to describe the boundary conditions (we confine ourselves to x = 0):

- Dirichlet boundary condition: the function value y(0) is prescribed;
- Neumann boundary condition: the derivative y′(0) is prescribed;
- Robin boundary condition: a combination of function value and derivative, such as y′(0) + ay(0), is prescribed.
In this chapter, we apply the finite-difference method to approximate the solution to a boundary-
value problem. The key principle of this approach is to replace all derivatives in the differential
equation by difference formulae (Chapter 3) and to neglect truncation errors. In this way, a dis-
crete approximation w for the solution y is obtained.
In this example, we describe the finite-difference method for the boundary-value problem

−y′′ + q(x)y = f(x),   0 < x < 1,
y(0) = 0,    (7.2)
y(1) = 0.
First, the interval [0, 1] is divided into n + 1 equidistant subintervals with length ∆x = 1/(n + 1).
The nodes are given by x j = j∆x for j = 0, . . . , n + 1 (see Figure 7.2). The approximation of the
solution to problem (7.2) will be represented on these nodes.
Figure 7.2: Discretization of the domain. The solution is approximated in the nodes x0 , . . . , x n+1 .
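A Matlab sketch of the resulting method for problem (7.2) is shown below; q and f are assumed example data, and the sparse matrices mirror the system (7.5) discussed next.

    n  = 15;  dx = 1/(n+1);
    x  = (1:n)'*dx;                    % interior nodes x_1, ..., x_n
    q  = @(x) 0*x;                     % assumed coefficient
    f  = @(x) 25*exp(5*x);             % assumed right-hand side
    e  = ones(n,1);
    K  = spdiags([-e 2*e -e], -1:1, n, n)/dx^2;   % discretization of -y''
    M  = spdiags(q(x), 0, n, n);                  % discretization of q(x) y
    w  = (K + M) \ f(x);               % approximation at the interior nodes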
For non-homogeneous Dirichlet conditions, the prescribed boundary values

y(0) = α,
y(1) = β

enter the right-hand side of the discrete system.
Definition 7.3.1 (Scaled Euclidean norm of a vector) The scaled Euclidean norm of a vector w ∈ R n
is defined as
||w|| = √( (1/n) Σ_{i=1}^{n} w_i² ).
Definition 7.3.2 (Subordinate matrix norm) The natural, or subordinate, matrix norm, related to the vector norm ||·||, is defined as

||A|| = max_{||w||=1} ||Aw||.
For an arbitrary vector y ≠ 0, this definition implies

||A|| = max_{||w||=1} ||Aw|| ≥ ||A (y/||y||)|| = (1/||y||) ||Ay||,

such that ||Ay|| ≤ ||A|| · ||y||.    (7.7)
Condition number
Suppose we are interested in the vector w, satisfying Aw = f. If the right-hand side f is perturbed
with an error ∆f, then as a consequence the solution w will contain an error ∆w. This means that
we solve
A(w + ∆w) = f + ∆f.
Using that Aw = f and A∆w = ∆f, inequality (7.7) yields ||f|| = ||Aw|| ≤ ||A|| · ||w||, such that

1/||w|| ≤ ||A|| / ||f||   and   ||∆w|| = ||A^{−1} ∆f|| ≤ ||A^{−1}|| · ||∆f||,
and the relative error satisfies

||∆w|| / ||w|| ≤ ||A|| · ||A^{−1}|| (||∆f|| / ||f||).    (7.8)
The quantity κ(A) = ||A|| · ||A^{−1}|| is called the condition number of the matrix A. A large condition number implies that a small relative error in the right-hand side f may result in a large relative error in the approximation w.
For a symmetric matrix A, the eigenvalues λ_1 , . . . , λ_n are real-valued, and

||A|| = |λ|_max = max_{1≤j≤n} |λ_j|   and   ||A^{−1}|| = 1/|λ|_min = 1 / min_{1≤j≤n} |λ_j| .    (7.9)
Hence κ(A) = |λ|_max / |λ|_min : it is sufficient to know the largest and smallest absolute eigenvalue (at least approximately) to compute the condition number of a symmetric matrix, or an estimate of it.
A useful theorem to estimate these extremal eigenvalues is the Gershgorin circle theorem:
Theorem 7.3.1 (Gershgorin circle theorem) The eigenvalues of a general n × n matrix A are located in the complex plane in the union of the circles

|z − a_ii | ≤ Σ_{j=1, j≠i}^{n} |a_ij | ,   z ∈ C,   i = 1, . . . , n.
Proof
Suppose Av = λv. By definition, component v i satisfies
Σ_{j=1}^{n} a_ij v_j = λv_i ,   such that   a_ii v_i + Σ_{j=1, j≠i}^{n} a_ij v_j = λv_i .
Therefore,
(λ − a_ii ) v_i = Σ_{j=1, j≠i}^{n} a_ij v_j .
Next, we assume that vi is the largest component in modulus of v. Using this assumption and
the triangle inequality (Theorem 1.6.1) it follows that
|λ − a_ii | ≤ Σ_{j=1, j≠i}^{n} |a_ij | (|v_j | / |v_i |) ≤ Σ_{j=1, j≠i}^{n} |a_ij | .
This relation holds for each eigenvalue and each possible eigenvector. Therefore, the complete
set of eigenvalues is found in the union of these circles in the complex plane. ⊠
Example 7.3.1 (Gershgorin circle theorem)
In this example, the eigenvalues of the matrix
A = [  2  −1 ]
    [ −1   2 ]
are estimated using the Gershgorin circle theorem. Following the theorem, the eigenvalues
should satisfy |λ − 2| ≤ 1. Note that the eigenvalues are λ1 = 1, λ2 = 3, which indeed sat-
isfy this condition.
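This check can also be done numerically; the following Matlab sketch (an illustration, not part of the book) compares the Gershgorin intervals with the computed eigenvalues.

    A = [2 -1; -1 2];
    c = diag(A);                         % circle centers a_ii
    r = sum(abs(A), 2) - abs(c);         % radii: row sums without the diagonal
    disp([c - r, c + r])                 % intervals [1, 3] and [1, 3]
    disp(eig(A)')                        % eigenvalues 1 and 3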
such that

ε = Ay − f,   with   ε_j = −y′′(x_j ) + q(x_j )y(x_j ) − f(x_j ) + O(∆x²),    (7.10)

and ε_j = O(∆x²), j = 1, . . . , n.
Definition 7.4.2 (Consistency) A finite-difference scheme is called consistent if
lim_{∆x→0} ||ε|| = 0.
For the boundary-value problem in example 7.2.1, ||ε|| = O(∆x²), hence system (7.5) is consistent.
Definition 7.4.3 (Stability) A finite-difference scheme is called stable if A−1 exists and if there exists a
constant C, independent of ∆x, such that
||A^{−1}|| ≤ C,   for ∆x → 0.
In this example, example 7.2.1 is revisited. Note that the derived matrix A in this example is
symmetric. This means that the eigenvalues are real and that (using equation (7.9))
||A^{−1}|| = 1 / |λ|_min .
Applying the Gershgorin circle theorem, and because the eigenvalues are real, the following union of intervals is found:

q_j ≤ z ≤ q_j + 4/∆x² ,   j = 1, . . . , n.
This means that

q_min ≤ λ_j ≤ q_max + 4/∆x²   for j = 1, . . . , n.
From this it follows that ||A^{−1}|| ≤ 1/q_min , hence the scheme is stable.

• If q(x) = 0 for all x ∈ [0, 1], then the Gershgorin circle theorem is of no use to estimate the smallest eigenvalue, since in that case the theorem gives no bound for the matrix norm ||A^{−1}||. Without proof we note that the eigenvalues of A = K are
λ_j = (2 − 2 cos(j∆xπ)) / ∆x² ,   j = 1, . . . , n.    (7.11)

This yields

λ_min = (2 − 2 cos(∆xπ)) / ∆x² ≈ π² ,
which implies that also for q ≡ 0 the scheme is stable.
Definition 7.4.4 (Global truncation error) The global truncation error of a finite-difference scheme is
defined as e = y − w.
Definition 7.4.5 (Convergence) A scheme is called convergent if the global truncation error satisfies
lim_{∆x→0} ||e|| = 0.
The next theorem provides a relation between the concepts consistency, stability and conver-
gence.
Theorem 7.4.1 If a scheme is stable and consistent, then that scheme is convergent.
Proof:
The global truncation error is related to the local truncation error (Ae = A(y − w) = ε), and hence e = A^{−1} ε. Taking norms and applying inequality (7.7) yields ||e|| ≤ ||A^{−1}|| ||ε||. From stability and consistency it then follows that lim_{∆x→0} ||e|| = 0. ⊠
Note that consistency in itself is insufficient for convergence. Again it is true, that the order of
the global truncation error equals the order of the local truncation error.
−y′′ = 25e^{5x} ,   0 < x < 1,
y(0) = 0,    (7.12)
y(1) = 0.

Note that this problem is related to system (7.1), where y denotes the temperature in the bar, and the production term Q(x) is given by 25e^{5x} (we expect positive temperatures inside the bar). The solution is given by

y(x) = −e^{5x} + (−1 + e⁵ )x + 1.
In Figure 7.3(a) the solution is plotted, together with the numerical approximations (computed
using the finite-difference approach of Section 7.2) for different values of ∆x. The numerical
approximation converges rapidly to the solution. In a practical situation, the accuracy obtained
with step size ∆x = 1/16 will probably be sufficient.
Next, heat dissipation is included in the boundary-value problem. This leads to

−y′′ + 9y = 25e^{5x} ,   0 < x < 1,
y(0) = 0,    (7.13)
y(1) = 0,
in which the term 9y describes heat dissipation to the environment, proportional to the temper-
ature. The solution and the numerical approximation are plotted in Figure 7.3(b). Note that the
maximal temperature is definitely lower than without dissipation (Figure 7.3(a)). For the rest,
there is little difference between the convergence behavior of both problems.
For this discretization matrix, equation (7.11) yields

λ_min ≈ π² ,   λ_max ≈ 4/∆x² ,   and   κ(A) ≈ 4/(π∆x)² .
If ∆x tends to zero, then the condition number of A is unbounded which is not desirable, since
this would mean that perturbations in the right-hand side f could lead to arbitrarily large errors
Figure 7.3: Exact solution and numerical approximation of the temperature distribution in a bar, (a) without dissipation (problem (7.12)) and (b) with dissipation (problem (7.13)); ∆x = 0.25, 0.125, 0.0625.
in the approximation. It turns out that the derivation of the relative error in equation (7.8) was
too pessimistic. A more realistic estimate is found by writing
||∆w|| = ||A^{−1} ∆f|| ≤ (1/|λ|_min ) ||∆f||.

The role of κ(A) in this expression is taken over by the effective condition number

κ_eff (A) = (1/λ_min ) (||f|| / ||w||).
We know that λ_min ≈ π² and in many applications ||f||/||w|| is bounded as ∆x tends to zero, hence the effective condition number is, in fact, bounded. Although this estimate is more accurate than the original condition number, it is more difficult to compute, since (an estimate for) w is required.
Here, the boundary-value problem of example 7.2.1 is slightly modified to include a Neumann
boundary condition on the right boundary:
−y′′ + q(x)y = f(x),   0 < x < 1,
y(0) = 0,    (7.14)
y′(1) = 0.
In this case, w_{n+1} is no longer known, as it was for a Dirichlet boundary condition. Therefore, the discretization should also include an equation for j = n + 1, such that a system of n + 1 equations with n + 1 unknowns, w_1 , . . . , w_{n+1} , has to be solved. For j = n + 1 we have

(−w_n + 2w_{n+1} − w_{n+2} )/∆x² + q_{n+1} w_{n+1} = f_{n+1} ,    (7.15)

in which x_{n+2} = 1 + ∆x is a virtual node outside the domain (see Figure 7.4).
Figure 7.4: Discretization of the domain, with a virtual node xn+2 . The solution is approximated
in the nodes x0 , . . . , x n+1.
The value of wn+2 follows from discretization of the Neumann boundary condition using central
differences (equation (3.5)):
(w_{n+2} − w_n ) / (2∆x) = 0.
This implies wn+2 = wn . We substitute this into (7.15) and divide by two (to preserve symmetry
of the matrix) to obtain
(−w_n + w_{n+1} )/∆x² + (1/2)q_{n+1} w_{n+1} = (1/2) f_{n+1} .    (7.16)
∆x 2 2
This means that we should solve Aw = f, where the symmetric matrix A can be written as A = K + M, with

K = (1/∆x²) ·
[  2  −1                ]
[ −1   2  −1            ]
[       .    .    .     ]
[          −1   2  −1   ]
[              −1   1   ]

and M = diag(q_1 , . . . , q_n , (1/2)q_{n+1} ).
Using that y′n+1 = 0, this means that y n+2 = yn + O(∆x3 ), such that the replacement of yn+2 by
yn in the discretization of the Neumann boundary yields a local truncation error of order O(∆x ).
This means that the local truncation error can be written as
Ay − f = ε = ∆x 2 u + ∆xv,
where ∆x2 u results from the inner local truncation error (as was already found in equation (7.10)),
and the component ∆xv belongs to the local truncation error due to the Neumann boundary:
v = (0, . . . , 0, vn+1 )⊤ , with vn+1 = O(1). Note that the order of the local truncation error is
different from the order of a boundary-value problem with pure Dirichlet boundary conditions.
It turns out that the global truncation error still remains of order O(∆x2 ). This claim will be
shown for q( x ) = 0, 0 ≤ x ≤ 1, which results in a stable system. The global truncation error is
split as follows:
e = A^{−1} ε = A^{−1} (∆x²u + ∆xv) = e^{(1)} + e^{(2)} .

By construction, ||e^{(1)}|| = O(∆x²). For e^{(2)} , it holds that e_j^{(2)} = ∆x³ v_{n+1} j, j = 1, . . . , n + 1. This can be shown by verifying that Ae^{(2)} = ∆xv. For instance, the (n + 1)th component of Ae^{(2)} equals

(Ae^{(2)} )_{n+1} = (−e_n^{(2)} + e_{n+1}^{(2)} )/∆x² = ∆x³ v_{n+1} (−n + (n + 1))/∆x² = ∆x v_{n+1} .

Because j ≤ n + 1 = 1/∆x, each component satisfies e_j^{(2)} ≤ ∆x³ v_{n+1} /∆x = O(∆x²), such that the global truncation error indeed remains of order O(∆x²).
Consider next the more general boundary-value problem

−(p(x)y′)′ + r(x)y′ + q(x)y = f(x),   0 < x < 1,
y(0) = α,
p(1)y′(1) = β,

with discretization

(−p_{j+1/2} (w_{j+1} − w_j ) + p_{j−1/2} (w_j − w_{j−1} ))/∆x² + r_j (w_{j+1} − w_{j−1} )/(2∆x) + q_j w_j = f_j ,   j = 1, . . . , n + 1.    (7.17)
Note that if r j 6= 0, then the discretization matrix is not symmetric.
The boundary condition in x = 0 yields w0 = α which may be substituted directly into (7.17) for
j = 1.
For j = n + 1, the discretization is
(−p_{n+3/2} (w_{n+2} − w_{n+1} ) + p_{n+1/2} (w_{n+1} − w_n ))/∆x² + r_{n+1} (w_{n+2} − w_n )/(2∆x) + q_{n+1} w_{n+1} = f_{n+1} .    (7.18)
As in Section 7.6, a virtual grid point, xn+2 , is added to the domain. In order to remove the term
− pn+3/2 (wn+2 − wn+1 ) in equation (7.18), we discretize boundary condition p(1)y′ (1) = β by
averaging the approximations for
p_{n+1/2} y′(1 − ∆x/2)   and   p_{n+3/2} y′(1 + ∆x/2).
This means that

(1/2) ( p_{n+1/2} (w_{n+1} − w_n )/∆x + p_{n+3/2} (w_{n+2} − w_{n+1} )/∆x ) = β,

such that

p_{n+3/2} (w_{n+2} − w_{n+1} ) = −p_{n+1/2} (w_{n+1} − w_n ) + 2∆xβ.
Replacing this term in equation (7.18) and using that (w_{n+2} − w_n )/(2∆x) = β/p(1) yields

2p_{n+1/2} (w_{n+1} − w_n )/∆x² + r_{n+1} β/p(1) + q_{n+1} w_{n+1} = f_{n+1} + 2β/∆x.
Consider the convection-diffusion problem

−y′′ + vy′ = 0,   0 < x < 1,
y(0) = 0,    (7.19)
y(1) = 1,

with v ∈ R.
by conduction (diffusion, the term −y′′ ), and by a flowing medium (convection, the term vy′ ).
It can be verified that the solution to this boundary-value problem is given by

y(x) = (e^{vx} − 1) / (e^v − 1).
The solution increases strictly over the region [0, 1] and therefore numerical approximations that
contain any oscillations will be rejected. However, a careless discretization of boundary-value
problem (7.19) can result in a numerical approximation that contains (spurious) oscillations.
Central differences
Using the same subintervals and nodes as in example 7.2.1, the central-difference discretization
at interior nodes x j is given by
−(w_{j−1} − 2w_j + w_{j+1} )/∆x² + v (w_{j+1} − w_{j−1} )/(2∆x) = 0,   j = 1, . . . , n.    (7.20)
Together with the boundary conditions, this leads to a system of equations Aw = f, in which

A = (1/∆x²) ·
[       2       −1 + v∆x/2                              ]
[ −1 − v∆x/2        2        −1 + v∆x/2                 ]
[                   .             .            .        ]
[              −1 − v∆x/2         2        −1 + v∆x/2   ]
[                            −1 − v∆x/2        2        ]
Figure 7.5: Approximation of the solution to convection-diffusion problem (7.19) for v = 30,
∆x = 0.1.
In this example, the approximation of the solution to the convection-diffusion problem

−y′′ + vy′ = 1,   0 < x < 1,
y(0) = 0,    (7.25)
y(1) = 0,
will be computed, using both a central and an upwind discretization. The solution to this convec-
tion-diffusion equation is given by
y(x) = (1/v) ( x − (1 − e^{vx} )/(1 − e^v ) ).
Although the solution to this convection-diffusion equation is not monotone as was the case in
example 7.8.1, a similar behavior is found.
Approximations of the solution are determined with stepsize ∆x = 0.1 for various values of the
velocity v (see Figure 7.6).
For v = 10, we have v∆x < 2. Therefore, the central-difference approximation does not introduce
any spurious oscillations. Since the global truncation error is of order O(∆x2 ), this results into a
more accurate approximation than upwind discretization (which is of order O(∆x )). If v = 20,
then v∆x = 2, such that the discretization at node xn is given by
−(w_{n−1} − 2w_n + w_{n+1} )/∆x² + v (w_{n+1} − w_{n−1} )/(2∆x) = −(2/∆x²)w_{n−1} + (2/∆x²)w_n = 1.
Since the term with wn+1 has been cancelled, the boundary condition in x = 1 does not have any
effect on the rest of the central-difference approximation. However, for this example the result
is still quite accurate. For v = 100 the central-discretization scheme produces oscillations in the
approximation, and the errors are large. A more accurate approximation for v = 100 is found
using the upwind-discretization scheme.
Finally, consider a nonlinear boundary-value problem of the form

−(p(x)y′)′ + g(y′, y, x) = 0,   0 < x < 1,
y(0) = 0,    (7.26)
y(1) = 0.
Figure 7.6: Approximation of the solution to convection-diffusion problem (7.25) for (a) v = 10, (b) v = 20, (c) v = 100; ∆x = 0.1.
The general approach to solve (7.26) uses an iterative process of Chapter 4 in which only linear
equations have to be solved. The derived linear boundary-value problems are solved using the
finite-difference schemes as explained in earlier sections. The iterative method generates a series
of iterates y(0) , y(1) , . . . in such a way that y ( m) → y, m → ∞ where y is the solution to (7.26).
The iterative method of Picard approximates the solution to boundary-value problem (7.26) by determining y^{(m)} from

−(p(x)y^{(m)′} )′ = −g(y^{(m−1)′} , y^{(m−1)} , x),   0 < x < 1,
y^{(m)} (0) = 0,
y^{(m)} (1) = 0.
Here, the finite-difference approach is used to approximate the derivatives of the (m − 1)th and the mth iterate.
in which
r = (∂g/∂y′)(y^{(m−1)′} , y^{(m−1)} , x)   and   q = (∂g/∂y)(y^{(m−1)′} , y^{(m−1)} , x).
The corresponding iterative method determines y( m) from
−(py^{(m)′} )′ + ry^{(m)′} + qy^{(m)} = y^{(m−1)′} r + y^{(m−1)} q − g(y^{(m−1)′} , y^{(m−1)} , x),
y^{(m)} (0) = 0,
y^{(m)} (1) = 0.
7.10 Summary
In this chapter, the following subjects have been discussed:
- Boundary-value problems;
- Dirichlet, Neumann, Robin boundary condition;
- Finite-difference schemes;
- Norms, condition numbers, Gershgorin circle theorem;
- Local truncation error, consistency;
- Stability;
- Global truncation error, convergence;
- Implementation of Neumann boundary condition;
- Convection-diffusion problem;
- Upwind discretization;
- Nonlinear boundary-value problem.
7.11 Exercises
1. Consider matrices A1 and A2 given by
A_1 = [ 2  1 ]         A_2 = [ 100   99 ]
      [ 1  2 ]   and         [  99  100 ] .
−y′′(x) = sin x,   0 < x < π,
y(0) = 0,
y(π) = 0.
−y′′ + y = 0,   0 < x < 1,
y(0) = 1,
y′(1) = 0.
Discretize this problem using the nodes x0 = 0, x1 = 2/7, x2 = 4/7, x3 = 6/7 and
x4 = 8/7. Determine the discretization matrix A. Is A symmetric or nonsymmetric? Are
the eigenvalues positive?
Chapter 8

The heat equation

8.1 Introduction
The time-dependent temperature distribution in a bar can be described by a parabolic partial
differential equation. To solve that kind of initial- and boundary-value problems methods from
Chapters 6 and 7 will be combined.
As an example, we consider the temperature distribution in a bar (Figure 8.1).
[Figure 8.1: a bar of length L, with a control volume between x and x + ∆x.]
We denote the temperature by T(x, t) (measured in K). We assume that the temperature is known at both ends, T(0, t) = T_l and T(L, t) = T_r , and that the initial temperature at t = 0 is given by T(x, 0) = T_0 (x). The heat production in the bar is denoted by Q(x, t) (J/(m³s)). To derive the
heat equation we use the energy conservation law applied to the control volume between x and
x + ∆x and between the points in time t and t + ∆t. According to Fourier’s law the heat flow
density is given by
q(x, t) = −λ (∂T/∂x)(x, t),
where λ( J/(msK )) is the heat-conduction coefficient.
The energy balance on the control volume is given by
ρcT(x, t + ∆t)A∆x = ρcT(x, t)A∆x − λ (∂T/∂x)(x, t)A∆t + λ (∂T/∂x)(x + ∆x, t)A∆t + Q(x, t)A∆x∆t,
in which ρ(kg/m3 ) is the mass density, and c( J/(kgK )) is the specific heat. After dividing by
A∆x∆t and rearranging the terms we obtain
ρc (T(x, t + ∆t) − T(x, t))/∆t = λ ((∂T/∂x)(x + ∆x, t) − (∂T/∂x)(x, t))/∆x + Q.
Taking the limits ∆x → 0 and ∆t → 0 yields the following initial-boundary-value problem:
ρc ∂T/∂t = λ ∂²T/∂x² + Q,   0 < x < L, 0 < t,
T(0, t) = T_l (t),  T(L, t) = T_r (t),   0 ≤ t,
T(x, 0) = T_0 (x),   0 ≤ x ≤ L.
8.2 Semi-discretization
In this chapter, the space and time discretizations will be exhibited for a simple heat problem.
There is a clear difference between an explicit (Forward Euler) and an implicit (Backward Euler)
method of time integration.
Example 8.2.1 (Heat-conduction problem)
We consider the following heat-conduction problem:
∂y/∂t = ∂²y/∂x² ,   0 < x < 1, 0 < t ≤ T,
y(0, t) = y_L (t),  y(1, t) = y_R (t),   0 ≤ t ≤ T,    (8.1)
y(x, 0) = y_0 (x),   0 ≤ x ≤ 1.
We discretize (8.1) in the x-direction by means of the finite-difference method (Chapter 7). The
interval [0, 1] is divided into n + 1 equal parts with length ∆x, where the nodes are given by
xi = i∆x, i = 0, . . . , n + 1. We denote the numerical approximation of y( xi , t) by ui (t). The vector
u(t) is given by u(t) = (u1 (t), . . . , un (t))⊤ . The values at the boundary points are omitted,
because they are known from the boundary conditions. Using the techniques from Chapter 7 we
obtain the following system of first-order ordinary differential equations
du/dt = Ku + r,   0 < t ≤ T,    (8.2)
u(0) = y_0 ,

in which

K = (1/∆x²) ·
[ −2   1                ]
[  1  −2   1            ]
[       .    .    .     ]
[           1  −2   1   ]
[               1  −2   ]

and r = (1/∆x²) (y_L (t), 0, . . . , 0, y_R (t))^⊤ .
This is called semi-discretization (or the method of lines), because discretization has been applied in the spatial x-direction but not in the temporal t-direction. Note that the eigenvalues λ_j of K satisfy (cf. the Gershgorin circle theorem) λ_j ≤ 0 for all j = 1, . . . , n. This implies stability of the semi-discrete system. In the boundary-value problems of Chapter 7, −y′′ was considered. Therefore, the matrix K differs in sign from the matrix in equation (7.6).
Applying the Forward Euler method to system (8.2) yields

w^{j+1} = w^j + ∆t(Kw^j + r^j ).    (8.3)

Note that w_i^{j+1} depends on the approximations in (x_{i−1} , t_j ), (x_i , t_j ) and (x_{i+1} , t_j ) (as follows from the definition of K). This idea is depicted in Figure 8.2(a).
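A Matlab sketch of the semi-discrete system (8.2) with Forward Euler time stepping (8.3) is given below; the boundary values and initial profile are assumed example data, and the step size is set at the stability bound derived later in this section.

    n  = 19;  dx = 1/(n+1);  x = (1:n)'*dx;
    e  = ones(n,1);
    K  = spdiags([e -2*e e], -1:1, n, n)/dx^2;
    yL = @(t) 0;  yR = @(t) 0;            % assumed boundary conditions
    w  = sin(pi*x);                       % assumed initial condition y_0(x)
    dt = dx^2/2;                          % stability condition dt <= dx^2/2
    for j = 0:199
        r = [yL(j*dt); zeros(n-2,1); yR(j*dt)]/dx^2;
        w = w + dt*(K*w + r);             % w^{j+1} = w^j + dt (K w^j + r^j)
    end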
The local truncation error of this method can be computed using definition (6.68) in Chapter 6:
τ^{j+1} = (y^{j+1} − z^{j+1} ) / ∆t ,
where z j+1 is computed by applying the Forward Euler method to the solution at time t j (given
by y j ), hence z j+1 = y j + ∆t(Ky j + r j ). This yields (using the Dirichlet boundary conditions)
τ^{j+1} = (y^{j+1} − (y^j + ∆t(Ky^j + r^j ))) / ∆t = (y^{j+1} − y^j )/∆t − (Ky^j + r^j ) = ∂y^j/∂t + ε^j − ∂²y^j/∂x² + µ^j ,
in which

ε_i^j = (∆t/2) (∂²y/∂t²)(x_i , ζ_j ),   ζ_j ∈ (t_j , t_{j+1} ),   and   µ_i^j = −(∆x²/12) (∂⁴y/∂x⁴)(ξ_i , t_j ),   ξ_i ∈ (x_{i−1} , x_{i+1} ),
follow from the truncation errors of the numerical differentiation, equations (3.2) and (3.9).
Using that y satisfies the heat equation, we find that τ^{j+1} = ε^j + µ^j , such that method (8.3) is of order O(∆t + ∆x²). It makes sense to choose ∆t and ∆x in such a way that both components of the truncation error have about the same size. This already suggests that the time step should be divided by four if the spatial step size is halved.
Stability
Since the Forward Euler method is only conditionally stable, it is important to determine the
size of the time step in such a way that no instabilities will occur. The matrix K from (8.2) is
symmetric, hence the eigenvalues are real. From the Gershgorin circle theorem it follows that
−4/∆x² ≤ λ_i ≤ 0,   for 1 ≤ i ≤ n.
∆x2
The matrix has no positive eigenvalues, hence the system of differential equations (8.2) is an-
alytically stable (Section 6.8.2). An explicit expression for the eigenvalues of −K was given in
equation (7.11). For real eigenvalues, the Forward Euler method is stable if the time step ∆t
satisfies (equation (6.18))
∆t ≤ −2/λ_i ,   for all i = 1, . . . , n.
Using that |λ|max = 4/∆x2 , we obtain ∆t ≤ ∆x2 /2. Note that if the spatial step size is halved,
then the time step has to be taken four times as small to satisfy the stability condition.
Note that the condition number of K is given by (Section 7.3)
κ(K) = (max_{i=1,...,n} |λ_i |) / (min_{i=1,...,n} |λ_i |) ≈ 4/(π∆x)² .
For small values of ∆x the initial-value problem (8.2) is a stiff system, since there is a large differ-
ence between |λ|max and |λ|min . Because of the stability restrictions for explicit methods, it might
be more convenient to use an implicit method in that case (Section 6.10).
Applying the Backward Euler method to system (8.2) instead yields

w^{j+1} = w^j + ∆t(Kw^{j+1} + r^{j+1} ),

that is, at each time step the linear system

(I − ∆tK)w^{j+1} = w^j + ∆tr^{j+1} ,    (8.5)

in which I is the identity matrix, has to be solved. The dependency between approximations in this case is depicted in Figure 8.2(b). Again the local truncation error is of order O(∆t + ∆x²). Note that the Backward Euler method is unconditionally stable, hence the time step may be chosen in accordance with the accuracy required (stiffness does not impair numerical stability).
Note that the solution to (8.5) is computationally more costly than determining w j+1 using the
Forward Euler method. However, since the matrix K contains many zeros, there are efficient
methods to solve problem (8.5). Careful consideration is necessary: cheap approximation with
many time steps (Forward Euler method) versus expensive approximation with few time steps
(Backward Euler method).
Finally, the Crank-Nicolson method applies the Trapezoidal method to system (8.2):

w^{j+1} = w^j + (∆t/2) (Kw^j + r^j + Kw^{j+1} + r^{j+1} ),
cf. equation (6.51c). The corresponding dependency between the approximations is depicted in
Figure 8.2(c).
[Figure 8.2: dependence of w_i^{j+1} on the approximations at times t_j and t_{j+1} for (a) the Forward Euler, (b) the Backward Euler, and (c) the Crank-Nicolson method.]
8.4 Summary
In this chapter, the following subjects have been discussed:
- Instationary heat equation;
- Semi-discretization;
- Time integration: Forward Euler, Backward Euler, and Crank-Nicolson method;
- Local truncation error;
- Stability.
References and further reading
Background knowledge:
[1] M. Abramowitz and I.A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs,
and Mathematical Tables. Dover Publications, tenth edition, 1972. The Digital Library of
Mathematical Functions can be found online: http://dlmf.nist.gov/
[2] R.A. Adams. Calculus: a complete course. Pearson Addison Wesley, eighth edition, 2013
[3] R.L. Borelli and C.S. Coleman. Differential Equations: A Modelling Perspective. John Wiley &
Sons, second edition, 2004
[4] W.E. Boyce and R.C. DiPrima. Elementary Differential Equations. John Wiley & Sons, tenth
edition, 2012
[5] D.C. Lay. Linear algebra and its applications. Pearson Addison Wesley, fourth edition, 2010
[6] R.D. Richtmyer and K.W. Morton. Difference Methods for Initial-Value Problems. Krieger Pub-
lishing Company, second edition, 1994
[7] J. Stewart. Calculus: Early Transcendentals. Brooks/Cole, Cengage Learning, seventh edition,
2011
Theoretical Numerical Mathematics:
[8] R.L. Burden and J.D. Faires. Numerical Analysis. Brooks/Cole, Cengage Learning, ninth
edition, 2011
[9] E. Isaacson and H.B. Keller. Analysis of Numerical Methods. Dover Publications, 1994
[10] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, Science+Business Media,
third edition, 2002
[11] E. Süli and D. Mayers. An Introduction to Numerical Analysis. Cambridge University Press,
second edition, 2006
Numerical Recipes:
[12] S.C. Chapra and R.P. Canale. Numerical Methods for Engineers. McGraw-Hill Education,
seventh edition, 2015
[13] G.R. Lindfield and J.E.T. Penny. Numerical Methods using Matlab. Academic Press, Elsevier,
third edition, 2012
[14] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes: The Art of
Scientific Computing. Cambridge University Press, third edition, 2007
Index
quadratically convergent, 40
quadrature rules, 52
Quasi-Newton, 47, 49
Secant method, 47
semi-discretization, 120
Simpson’s rule, 54, 56
single-step methods, 67
splines, 20
stability, 69, 83, 89, 108
absolute stability (analytical), 70, 84, 85
absolute stability (numerical), 71, 85
analytical stability, 69, 84, 85
instability (analytical), 70, 84, 85
instability (numerical), 71, 86, 88
numerical stability, 70, 85, 87, 88
stability regions, 88
superstability, 95
stiff differential equations, 93
stiff systems, 95
stopping criteria, 40, 41, 43, 48
subordinate matrix norm, 106
systems of differential equations, 81
Taylor polynomial, 8, 9, 16
test equation, 70
test system, 83
Trapezoidal method, 69, 81
Trapezoidal rule, 54, 56
triangle inequality, 8
uncertainty interval, 40
upwind difference, 114
upwind discretization, 114
Vandermonde matrix, 14
virtual point, 111
Numerical Methods for Ordinary Differential Equations
C. Vuik, F.J. Vermolen, M.B. van Gijzen & M.J. Vuik
TU Delft | Applied Mathematics
Biography:
Prof.dr.ir. C. (Kees) Vuik is a Full Professor in Numerical Analysis at the Delft Institute of Applied Mathematics of the
TU Delft in The Netherlands. He obtained his PhD degree from Utrecht University in 1988. Thereafter he joined the
TU Delft as an assistant professor in the section Numerical Analysis. His research is related to the discretization of
partial differential equations, moving boundary problems, High-Performance Computing, and iterative solvers for large
linear systems originating from incompressible fluid flow, wave equations, and energy networks. He has taught the
Prof.dr.ir. F.J. (Fred) Vermolen is a Full Professor in Computational Mathematics at the University of Hasselt in
Belgium. He obtained his PhD degree from the TU Delft in 1998. Thereafter he worked at CWI and from 2000 he
joined the TU Delft as an assistant professor in the section Numerical Analysis. His research is related to analysis,
numerical methods and uncertainty quantification for partial differential equations. He has given courses in numerical
Prof. dr. M.B. (Martin) van Gijzen is Full Professor in High-Performance Computing at the Delft Institute of Applied
Mathematics of the TU Delft in The Netherlands. He obtained his PhD degree from the same university in 1994.
Before returning to the TU Delft in 2004, he held positions at Utrecht University and at TNO, both in The
Netherlands, and at CERFACS in France. His research topics include High Performance Computing, iterative solvers
for large scale linear systems and eigenvalue problems, model order reduction, and inverse problems. He has
worked on many applications, including topology optimisation, MR imaging, numerical modelling of ocean circulation,
Thea Vuik got her PhD from the Numerical Analysis group at Delft University of Technology. In addition to her research on troubled-cell indication in DG methods, she assisted in the Numerical Analysis course for several years. Currently, she is working at VORtech as a scientific software engineer.

© 2023 TU Delft Open
ISBN 978-94-6366-665-7
DOI: https://doi.org/10.5074/t.2023.001
textbooks.open.tudelft.nl