Lecture Notes on Numerical Analysis


Virginia Tech · MATH/CS 5466 · Spring 2016

We model our world with continuous mathematics. Whether our


interest is natural science, engineering, even finance and economics,
the models we most often employ are functions of real variables. The
equations can be linear or nonlinear, involve derivatives, integrals,
combinations of these and beyond. The tricks and techniques one
learns in algebra and calculus for solving such systems exactly can-
not tackle the complexities that arise in serious applications. Exact
solution may require an intractable amount of work; worse, for many
problems, it is impossible to write an exact solution using elementary
functions like polynomials, roots, trig functions, and logarithms.
This course tells a marvelous success story. Through the use of
clever algorithms, careful analysis, and speedy computers, we can
construct approximate solutions to these otherwise intractable
problems with remarkable speed. Nick Trefethen defines numerical analysis
to be ‘the study of algorithms for the problems of continuous
mathematics’. This course takes a tour through many such algorithms,
sampling a variety of techniques suitable across many applications.
We aim to assess alternative methods based on both accuracy and
efficiency, to discern well-posed problems from ill-posed ones, and to
see these methods in action through computer implementation.

[Margin figure: Image from Johannes Kepler’s Astronomia nova, 1609
(ETH Bibliothek). In this text Kepler derives his famous equation
that solves two-body orbital motion,
\[ M = E - e \sin E, \]
where $M$ (the mean anomaly) and $e$ (the eccentricity) are known, and one
solves for $E$ (the eccentric anomaly). This vital problem spurred the
development of algorithms for solving nonlinear equations.]

Perhaps the importance of numerical analysis can be best appreciated
by realizing the impact its disappearance would have on our
world. The space program would evaporate; aircraft design would
be hobbled; weather forecasting would again become the stuff of
soothsaying and almanacs. The ultrasound technology that uncovers
cancer and illuminates the womb would vanish. Google couldn’t
rank web pages. Even the letters you are reading, whose shapes are
specified by polynomial curves, would suffer. (Several important
exceptions involve discrete, not continuous, mathematics: combinatorial
optimization, cryptography and gene sequencing.)

[Margin note: We highly recommend Trefethen’s essay, ‘The Definition of
Numerical Analysis’ (reprinted on pages 321–327 of Trefethen & Bau,
Numerical Linear Algebra), which inspires our present manifesto.]

On one hand, we are interested in complexity: we want algorithms
that minimize the number of calculations required to compute a solu-
tion. But we are also interested in the quality of approximation: since
we do not obtain exact solutions, we must understand the accuracy
of our answers. Discrepancies arise from approximating a compli-
cated function by a polynomial, a continuum by a discrete grid of
points, or the real numbers by a finite set of floating point numbers.
Different algorithms for the same problem will differ in the quality of
their answers and the labor required to obtain those answers; we will
learn how to evaluate algorithms according to these criteria.


Numerical analysis forms the heart of ‘scientific computing’ or
‘computational science and engineering,’ fields that also encompass
the high-performance computing technology that makes our algo-
rithms practical for problems with millions of variables, visualization
techniques that illuminate the data sets that emerge from these com-
putations, and the applications that motivate them.
Though numerical analysis has flourished in the past seventy
years, its roots go back centuries, where approximations were neces-
sary in celestial mechanics and, more generally, ‘natural philosophy’.
Science, commerce, and warfare magnified the need for numerical
analysis, so much so that the early twentieth century spawned the
profession of ‘computers,’ people who conducted computations with
hand-crank desk calculators. But numerical analysis has always been
more than mere number-crunching, as observed by Alston House-
holder in the introduction to his Principles of Numerical Analysis, pub-
lished in 1953, the end of the human computer era:
The material was assembled with high-speed digital computation
always in mind, though many techniques appropriate only to “hand”
computation are discussed. . . . How otherwise the continued use of
these machines will transform the computer’s art remains to be seen.
But this much can surely be said, that their effective use demands a
more profound understanding of the mathematics of the problem, and
a more detailed acquaintance with the potential sources of error, than
is ever required by a computation whose development can be watched,
step by step, as it proceeds.

Thus the analysis component of ‘numerical analysis’ is essential. We


rely on tools of classical real analysis, such as continuity, differentia-
bility, Taylor expansion, and convergence of sequences and series.
Matrix computations play a fundamental role in numerical analy-
sis. Discretization of continuous variables turns calculus into algebra.
Algorithms for the fundamental problems in linear algebra are cov-
ered in MATH/CS 5465. If you have missed this beautiful content,
your life will be poorer for it; when the methods we discuss this
semester connect to matrix techniques, we will provide pointers.
These lecture notes were developed alongside courses that were
supported by textbooks, such as An Introduction to Numerical Analysis
by Süli and Mayers, Numerical Analysis by Gautschi, and Numerical
Analysis by Kincaid and Cheney. These notes have benefited from this
pedigree, and reflect certain hallmarks of these books. We have also
been significantly influenced by G. W. Stewart’s inspiring volumes,
Afternotes on Numerical Analysis and Afternotes Goes to Graduate School.
I am grateful for comments and corrections from past students, and
welcome suggestions for further repair and amendment.
— Mark Embree
1
Interpolation

lecture 1: Polynomial Interpolation in the Monomial Basis

Among the most fundamental problems in numerical analysis
is the construction of a polynomial that approximates a continuous
real function $f : [a,b] \to \mathbb{R}$. Of the several ways we might design
such polynomials, we begin with interpolation: we will construct
polynomials that exactly match $f$ at certain fixed points in the interval
$[a,b] \subset \mathbb{R}$.

1.1 Polynomial interpolation: definitions and notation

Definition 1.1. The set of continuous functions that map $[a,b] \subset \mathbb{R}$ to
$\mathbb{R}$ is denoted by $C[a,b]$. The set of continuous functions whose first $r$
derivatives are also continuous on $[a,b]$ is denoted by $C^r[a,b]$. (Note
that $C^0[a,b] \equiv C[a,b]$.)

Definition 1.2. The set of polynomials of degree $n$ or less is denoted
by $P_n$.

Note that $C[a,b]$, $C^r[a,b]$ (for any $a < b$, $r \ge 0$) and $P_n$ are linear
spaces of functions (since linear combinations of such functions
maintain continuity and polynomial degree). Furthermore, note that $P_n$ is
an $n+1$ dimensional subspace of $C[a,b]$.

[Margin note: We freely use the concept of vector spaces. A set of
functions $V$ is a real vector space if it is closed under vector
addition and multiplication by a real number: for any $f, g \in V$,
$f + g \in V$, and for any $f \in V$ and $\alpha \in \mathbb{R}$, $\alpha f \in V$.
For more details, consult a text on linear algebra or functional analysis.]

The polynomial interpolation problem can be stated as:

Given $f \in C[a,b]$ and $n+1$ points $\{x_j\}_{j=0}^n$ satisfying

\[ a \le x_0 < x_1 < \cdots < x_n \le b, \]

determine some $p_n \in P_n$ such that

\[ p_n(x_j) = f(x_j) \quad \text{for } j = 0, \ldots, n. \]
It shall become clear why we require n + 1 points x0 , . . . , xn , and no


more, to determine a degree-n polynomial pn . (You know the n = 1
case well: two points determine a unique line.) If the number of data
points were smaller, we could construct infinitely many degree-n
interpolating polynomials. Were it larger, there would in general be
no degree-n interpolant.
As numerical analysts, we seek answers to the following questions:

• Does such a polynomial pn ∈ Pn exist?

• If so, is it unique?

• Does $p_n \in P_n$ behave like $f \in C[a,b]$ at points $x \in [a,b]$ when
$x \ne x_j$ for $j = 0, \ldots, n$?

• How can we compute pn ∈ Pn efficiently on a computer?

• How can we compute pn ∈ Pn accurately on a computer (with


floating point arithmetic)?

• If we want to add a new interpolation point xn+1 , can we easily


adjust pn to give an interpolating polynomial pn+1 of one higher
degree?

• How should the interpolation points { x j } be chosen?

Regarding this last question, we should note that, in practice, we


are not always able to choose the interpolation points as freely as
we might like. For example, our ‘continuous function f ∈ C [ a, b]’
could actually be a discrete list of previously collected experimental
data, and we are stuck with the values $\{x_j\}_{j=0}^n$ at which the data was
measured.

1.2 Constructing interpolants in the monomial basis

Of course, any polynomial pn ∈ Pn can be written in the form

\[ p_n(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n \]

for coefficients $c_0, c_1, \ldots, c_n$. We can view this formula as an
expression for $p_n$ as a linear combination of the basis functions
$1, x, x^2, \ldots, x^n$; these basis functions are called monomials.
To construct the polynomial interpolant to f , we merely need to
determine the proper values for the coefficients c0 , c1 , . . . , cn in the
above expansion. The interpolation conditions $p_n(x_j) = f(x_j)$ for
$j = 0, \ldots, n$ reduce to the equations

\[
\begin{aligned}
c_0 + c_1 x_0 + c_2 x_0^2 + \cdots + c_n x_0^n &= f(x_0) \\
c_0 + c_1 x_1 + c_2 x_1^2 + \cdots + c_n x_1^n &= f(x_1) \\
&\;\;\vdots \\
c_0 + c_1 x_n + c_2 x_n^2 + \cdots + c_n x_n^n &= f(x_n).
\end{aligned}
\]

Note that these n + 1 equations are linear in the n + 1 unknown


parameters c0 , . . . , cn . Thus, our problem of finding the coefficients
c0 , . . . , cn reduces to solving the linear system

\[
\begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n \\
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^n
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix},
\tag{1.1}
\]


which we denote as $Ac = f$. Matrices of this form, called Vandermonde
matrices, arise in a wide range of applications.[1] Provided all
the interpolation points $\{x_j\}$ are distinct, one can show that this
matrix is invertible.[2] Hence, fundamental properties of linear algebra
allow us to confirm that there is exactly one polynomial in $P_n$ that
interpolates $f$ at the given $n+1$ distinct interpolation points.

Theorem 1.1. Given $f \in C[a,b]$ and distinct points $\{x_j\}_{j=0}^n$,
$a \le x_0 < x_1 < \cdots < x_n \le b$, there exists a unique $p_n \in P_n$ such that
$p_n(x_j) = f(x_j)$ for $j = 0, 1, \ldots, n$.

To determine the coefficients $\{c_j\}$, we could solve the above linear
system with the Vandermonde matrix using some variant of Gaussian
elimination (e.g., using MATLAB’s \ command); this will take
$O(n^3)$ floating point operations. Alternatively, we could (and should)
use a specialized algorithm that exploits the Vandermonde structure
to determine the coefficients $\{c_j\}$ in only $O(n^2)$ operations, a vast
improvement.[3]

[1] Higham presents many interesting properties of Vandermonde matrices
and algorithms for solving Vandermonde systems in Chapter 21 of
Accuracy and Stability of Numerical Algorithms, 2nd ed. (SIAM, 2002).
Vandermonde matrices arise often enough that MATLAB has a built-in command
for creating them. If $x = [x_0, \ldots, x_n]^T$, then A = fliplr(vander(x)).

[2] In fact, the determinant takes the simple form
\[ \det(A) = \prod_{j=0}^{n} \prod_{k=j+1}^{n} (x_k - x_j). \]
This is evident for $n = 1$; we will not prove it for general $n$, as we will
have more elegant ways to establish existence and uniqueness of polynomial
interpolants. For a clever proof, see p. 193 of Bellman, Introduction to
Matrix Analysis, 2nd ed. (McGraw-Hill, 1970).

[3] See Higham’s book for details and stability analysis of specialized
Vandermonde algorithms.

1.2.1 Potential pitfalls of the monomial basis

Though it is straightforward to see how to construct interpolating
polynomials in the monomial basis, this procedure can give rise to

some unpleasant numerical problems when we actually attempt to


determine the coefficients {c j } on a computer. The primary difficulty
is that the monomial basis functions $1, x, x^2, \ldots, x^n$ look increasingly
alike as we take higher and higher powers. Figure 1.1 illustrates this
behavior on the interval [ a, b] = [0, 1] with n = 5 and x j = j/5.

[Figure 1.1: The six monomial basis vectors for $P_5$, based on the interval
$[a,b] = [0,1]$ with $x_j = j/5$ (red circles). Note that the basis vectors
increasingly align as the power increases: this basis becomes
ill-conditioned as the degree of the interpolant grows. The plot shows the
curves $1, x, x^2, x^3, x^4, x^5$ against $x \in [0,1]$.]

Because these basis vectors become increasingly alike, one finds


that the expansion coefficients {c j } in the monomial basis can be-
come very large in magnitude even if the function f ( x ) remains of
modest size on [ a, b].
Consider the following analogy from linear algebra. The vectors

\[
\begin{bmatrix} 1 \\ 10^{-10} \end{bmatrix}, \qquad
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
\]

form a basis for $\mathbb{R}^2$. However, both vectors point in nearly the same
direction, though of course they are linearly independent. We can write
the vector $[1, 1]^T$ as a unique linear combination of these basis vectors:

\[
\begin{bmatrix} 1 \\ 1 \end{bmatrix}
= 10{,}000{,}000{,}000 \begin{bmatrix} 1 \\ 10^{-10} \end{bmatrix}
- 9{,}999{,}999{,}999 \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\tag{1.2}
\]

Although the vector we are expanding and the basis vectors themselves
all have modest size (norm), the expansion coefficients are
enormous. Furthermore, small changes to the vector we are expanding
will lead to huge changes in the expansion coefficients. This is a
recipe for disaster when computing with finite-precision arithmetic.
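
The expansion in (1.2) and its sensitivity can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `expand_in_basis` and the perturbation size $10^{-6}$ are choices made here, not part of the notes. Exact rational arithmetic (`fractions.Fraction`) is used so that the large coefficients are not blurred by rounding.

```python
from fractions import Fraction

def expand_in_basis(v, b1, b2):
    """Return (alpha, beta) with alpha*b1 + beta*b2 == v, using Cramer's rule."""
    det = b1[0] * b2[1] - b1[1] * b2[0]
    alpha = (v[0] * b2[1] - v[1] * b2[0]) / det
    beta = (b1[0] * v[1] - b1[1] * v[0]) / det
    return alpha, beta

# The two nearly parallel basis vectors from the text.
b1 = (Fraction(1), Fraction(1, 10**10))
b2 = (Fraction(1), Fraction(0))

# Expand [1, 1]^T: the coefficients are those appearing in (1.2).
alpha, beta = expand_in_basis((Fraction(1), Fraction(1)), b1, b2)
print(alpha, beta)

# Perturb the second entry of the target vector by 1e-6 (a hypothetical
# perturbation) and observe how far the coefficients move.
v_pert = (Fraction(1), Fraction(1) + Fraction(1, 10**6))
alpha_p, beta_p = expand_in_basis(v_pert, b1, b2)
print(alpha_p - alpha, beta_p - beta)
```

In this sketch a change of $10^{-6}$ in the vector moves each coefficient by $10^{4}$ in magnitude, an amplification factor of $10^{10}$.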
This same phenomenon can occur when we express polynomials
in the monomial basis. As a simple example, consider interpolating
f ( x ) = 2x + x sin(40x ) at uniformly spaced points (x j = j/n, j =
0, . . . , n) in the interval [0, 1]. Note that f ∈ C ∞ [0, 1]: this f is a ‘nice’
function with infinitely many continuous derivatives. As seen in
Figures 1.2–1.3, $f$ oscillates modestly on the interval $[0,1]$, but it
certainly does not grow excessively large in magnitude or exhibit any
nasty singularities.

[Figure 1.2: Degree $n = 10$ interpolant $p_{10}(x)$ to $f(x) = 2x + x\sin(40x)$ at
the uniformly spaced points $x_0, \ldots, x_{10}$ for $x_j = j/10$ over
$[a,b] = [0,1]$. Even though $p_{10}(x_j) = f(x_j)$ at the eleven points
$x_0, \ldots, x_n$ (red circles), the interpolant gives a poor approximation
to $f$ at the ends of the interval.]

[Figure 1.3: Repetition of Figure 1.2, but now with the degree $n = 30$
interpolant at uniformly spaced points $x_j = j/30$ on $[0,1]$. The polynomial
still overshoots $f$ near $x = 0$ and $x = 1$, though by less than for $n = 10$;
for this example, the overshoot goes away as $n$ is increased further.]

Comparing the interpolants with n = 10 and n = 30 between the
two figures, it appears that, in some sense, pn → f as n increases.
Indeed, this is the case, in a manner we shall make precise in future
lectures.

However, we must address a crucial question:

Can we accurately compute the coefficients c0 , . . . , cn


that specify the interpolating polynomial?
Use MATLAB’s basic Gaussian elimination algorithm to solve the
Vandermonde system $Ac = f$ for $c$ via the command c = A\f, then
evaluate

\[ p_n(x) = \sum_{j=0}^{n} c_j x^j, \]

e.g., using MATLAB’s polyval command.
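
For readers without MATLAB, the experiment can be approximated in plain Python. The sketch below is not the code used to produce the figures in these notes: `solve` is a textbook Gaussian elimination with partial pivoting standing in for the backslash command, `horner` stands in for polyval, and the function names are made up here. It reports the largest residual $|f(x_j) - p_n(x_j)|$ at the interpolation points for a few values of $n$.

```python
import math

def vandermonde(xs):
    """Rows [1, x, x^2, ..., x^n], matching the system (1.1)."""
    n = len(xs) - 1
    return [[x**j for j in range(n + 1)] for x in xs]

def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy of [A | b]
    for k in range(n):
        # Bring the largest entry in column k (rows k..n-1) to the pivot row.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def horner(c, x):
    """Evaluate c[0] + c[1]*x + ... + c[n]*x^n by nested multiplication."""
    p = 0.0
    for cj in reversed(c):
        p = p * x + cj
    return p

def f(x):
    return 2 * x + x * math.sin(40 * x)

def max_node_residual(n):
    """max_j |f(x_j) - p_n(x_j)| for the computed monomial coefficients."""
    xs = [j / n for j in range(n + 1)]
    fs = [f(x) for x in xs]
    c = solve(vandermonde(xs), fs)
    return max(abs(fj - horner(c, xj)) for xj, fj in zip(xs, fs))

for n in (5, 10, 15):
    print(n, max_node_residual(n))
```

The values of $n$ are kept small here so that the elimination does not meet a numerically singular pivot; the exact residuals printed will depend on the floating point environment.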


Since pn was constructed to interpolate f at the points x0 , . . . , xn ,
we might (at the very least!) expect

f ( x j ) − pn ( x j ) = 0, j = 0, . . . , n.

Since we are dealing with numerical computations with a finite precision
floating point system, we should instead be well satisfied if our
numerical computations only achieve $|f(x_j) - p_n(x_j)| = O(\varepsilon_{\rm mach})$,
where $\varepsilon_{\rm mach}$ denotes the precision of the floating point arithmetic
system.[4]
Instead, the results of our numerical computations are remarkably
inaccurate due to the magnitude of the coefficients $c_0, \ldots, c_n$ and the
ill-conditioning of the Vandermonde matrix.
Recall from numerical linear algebra that the accuracy of solving
the system $Ac = f$ depends on the condition number $\|A\|\|A^{-1}\|$ of $A$.[5]
Figure 1.4 shows that this condition number grows exponentially as $n$
increases.[6] Thus, we should expect the computed value of $c$ to have
errors that scale like $\|A\|\|A^{-1}\|\varepsilon_{\rm mach}$. Moreover, consider the entries
in $c$. For $n = 10$ (a typical example), we have

     j    c_j
     0          0.00000
     1        363.24705
     2     −10161.84204
     3     113946.06962
     4    −679937.11016
     5    2411360.82690
     6   −5328154.95033
     7    7400914.85455
     8   −6277742.91579
     9    2968989.64443
    10    −599575.07912

[Margin note: More precisely, we might expect
$|f(x_j) - p_n(x_j)| \approx \varepsilon_{\rm mach} \|f\|_{L^\infty}$, where
$\|f\|_{L^\infty} := \max_{x \in [a,b]} |f(x)|$.]

[4] For the double-precision arithmetic used by MATLAB,
$\varepsilon_{\rm mach} \approx 2.2 \times 10^{-16}$.

[5] For information on conditioning and the accuracy of solving linear
systems, see, e.g., Lecture 12 of Trefethen and Bau, Numerical Linear
Algebra (SIAM, 1997).

[6] The curve has very regular behavior up until about $n = 20$; beyond that
point, where $\|A\|\|A^{-1}\| \approx 1/\varepsilon_{\rm mach}$, the computation is
sufficiently unstable that the condition number is no longer computed
accurately! For $n > 20$, take all the curves in Figure 1.4 with a healthy
dose of salt.

The entries in $c$ grow in magnitude and oscillate in sign, akin to the
simple $\mathbb{R}^2$ vector example in (1.2). The sign-flips and magnitude of
the coefficients would make

\[ p_n(x) = \sum_{j=0}^{n} c_j x^j \]

difficult to compute accurately for large $n$, even if all the coefficients
$c_0, \ldots, c_n$ were given exactly. Figure 1.4 shows how the largest
computed value in $c$ grows with $n$. Finally, this figure also shows the
quantity we began discussing,

\[ \max_{0 \le j \le n} |f(x_j) - p_n(x_j)|. \]

Rather than being nearly zero, this quantity grows with $n$, until the
computed ‘interpolating’ polynomial differs from $f$ at some interpolation
point by roughly 1/10 for the larger values of $n$: we must have
higher standards!
This is an example where a simple problem formulation quickly
yields an algorithm, but that algorithm gives unacceptable numerical
results.

[Figure 1.4: Illustration of some pitfalls of working with interpolants in
the monomial basis for large $n$: (a) the condition number of $A$ grows large
with $n$; (b) as a result, some coefficients $c_j$ are large in magnitude
(blue line) and inaccurately computed; (c) consequently, the computed
‘interpolant’ $p_n$ is far from $f$ at the interpolation points (red line):
this red curve should be zero! The plot shows the condition number
$\|A\|\|A^{-1}\|$, $\max_j |c_j|$, and $\max_j |f(x_j) - p_n(x_j)|$ on a logarithmic
vertical axis (from $10^{-20}$ to $10^{30}$) against $n$ from 0 to 40.]
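
The growth of the Vandermonde condition number can be observed without any floating point error by working over the rationals. The sketch below makes two assumptions not fixed by the text: it uses the $\infty$-norm (maximum absolute row sum) for $\|A\|\|A^{-1}\|$, and it uses the uniformly spaced points $x_j = j/n$ on $[0,1]$. The inverse is formed exactly with `fractions.Fraction`, so the reported numbers are not themselves polluted by the ill-conditioning being measured.

```python
from fractions import Fraction

def vandermonde(n):
    """Exact Vandermonde matrix for x_j = j/n, j = 0, ..., n."""
    xs = [Fraction(j, n) for j in range(n + 1)]
    return [[x**k for k in range(n + 1)] for x in xs]

def inverse(A):
    """Exact Gauss-Jordan inverse over the rationals."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [row[:] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        p = next(i for i in range(k, n) if M[i][k] != 0)  # any nonzero pivot
        M[k], M[p] = M[p], M[k]
        piv = M[k][k]
        M[k] = [a / piv for a in M[k]]
        for i in range(n):
            if i != k and M[i][k] != 0:
                m = M[i][k]
                M[i] = [a - m * b for a, b in zip(M[i], M[k])]
    return [row[n:] for row in M]

def norm_inf(A):
    """Matrix infinity-norm: the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def cond_inf(n):
    A = vandermonde(n)
    return float(norm_inf(A) * norm_inf(inverse(A)))

for n in (2, 4, 6, 8, 10, 12):
    print(n, f"{cond_inf(n):.3e}")
```

Exact arithmetic is far too slow for serious computation; it is used here only to isolate the conditioning of the matrix from the rounding errors of the solver.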
Perhaps you are now troubled by this entirely reasonable question:
If the computations of pn are as unstable as Figure 1.4 suggests, why
should we put any faith in the plots of interpolants for n = 10 and,
especially, n = 30 in Figures 1.2–1.3?
You should trust those plots because I computed them using a
much better approach, about which we shall next learn.
lecture 2: Superior Bases for Polynomial Interpolants

1.3 Polynomial interpolants in a general basis

The monomial basis may seem like the most natural way to write
down the interpolating polynomial, but it can lead to numerical
problems, as seen in the previous lecture. To arrive at more stable
expressions for the interpolating polynomial, we will derive several
different bases for Pn that give superior computational properties:
the expansion coefficients {c j } will typically be smaller, and it will
be simpler to determine those coefficients. This is an instance of
a general principle of applied mathematics: to promote stability,
express your problem in a well-conditioned basis.
Suppose we have some basis $\{b_j\}_{j=0}^n$ for $P_n$. We seek the polynomial
$p \in P_n$ that interpolates $f$ at $x_0, \ldots, x_n$. Write $p$ in the basis
as

\[ p(x) = c_0 b_0(x) + c_1 b_1(x) + \cdots + c_n b_n(x). \]

[Margin note: Recall that $\{b_j\}_{j=0}^n$ is a basis if the functions span $P_n$
and are linearly independent. The first requirement means that for any
polynomial $p \in P_n$ we can find constants $c_0, \ldots, c_n$ such that
$p = c_0 b_0 + \cdots + c_n b_n$, while the second requirement means that if
$0 = c_0 b_0 + \cdots + c_n b_n$ then we must have $c_0 = \cdots = c_n = 0$.]

We seek the coefficients $c_0, \ldots, c_n$ that express the interpolant $p$ in
this basis. The interpolation conditions are

\[
\begin{aligned}
p(x_0) &= c_0 b_0(x_0) + c_1 b_1(x_0) + \cdots + c_n b_n(x_0) = f(x_0) \\
p(x_1) &= c_0 b_0(x_1) + c_1 b_1(x_1) + \cdots + c_n b_n(x_1) = f(x_1) \\
&\;\;\vdots \\
p(x_n) &= c_0 b_0(x_n) + c_1 b_1(x_n) + \cdots + c_n b_n(x_n) = f(x_n).
\end{aligned}
\]

Again we have $n+1$ equations that are linear in the $n+1$ unknowns
$c_0, \ldots, c_n$, hence we can arrange these in the matrix form

\[
\begin{bmatrix}
b_0(x_0) & b_1(x_0) & \cdots & b_n(x_0) \\
b_0(x_1) & b_1(x_1) & \cdots & b_n(x_1) \\
\vdots & \vdots & \ddots & \vdots \\
b_0(x_n) & b_1(x_n) & \cdots & b_n(x_n)
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix},
\tag{1.3}
\]

which can be solved via Gaussian elimination for c0 , . . . , cn .


Notice that the linear system for the monomial basis in (1.1) is a
special case of the system in (1.3), with the choice b j ( x ) = x j . Next
we will look at two superior bases that give more stable expressions
for the interpolant. We emphasize that when the basis changes, so too
do the values of c0 , . . . , cn , but the interpolating polynomial p remains the
same, regardless of the basis we use to express it.
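
The point that the coefficients depend on the basis while the interpolant does not can be illustrated with a small exact computation. In this sketch the data values `fs` are made up for illustration, the basis functions are passed as Python callables, and the system (1.3) is solved exactly over the rationals so that the two interpolants can be compared with `==`. The second basis consists of the products $1$, $(x - x_0)$, $(x - x_0)(x - x_1)$, and so on, chosen here simply as an alternative basis for $P_3$.

```python
from fractions import Fraction

def solve(A, b):
    """Exact Gauss-Jordan elimination over the rationals."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = next(i for i in range(k, n) if M[i][k] != 0)
        M[k], M[p] = M[p], M[k]
        piv = M[k][k]
        M[k] = [a / piv for a in M[k]]
        for i in range(n):
            if i != k:
                m = M[i][k]
                M[i] = [a - m * c for a, c in zip(M[i], M[k])]
    return [row[n] for row in M]

def interpolate(basis, xs, fs):
    """Solve (1.3): row i of the matrix holds b_0(x_i), ..., b_n(x_i)."""
    A = [[b(x) for b in basis] for x in xs]
    return solve(A, fs)

def evaluate(basis, c, x):
    return sum(cj * b(x) for cj, b in zip(c, basis))

xs = [Fraction(j, 3) for j in range(4)]
fs = [Fraction(1), Fraction(-2), Fraction(5), Fraction(3)]  # hypothetical data

# Basis 1: monomials 1, x, x^2, x^3.
monomials = [lambda x, j=j: x**j for j in range(4)]

# Basis 2: products of (x - x_k) over the first j nodes.
def product_basis(j):
    def q(x):
        prod = Fraction(1)
        for xk in xs[:j]:
            prod *= x - xk
        return prod
    return q

products = [product_basis(j) for j in range(4)]

c_mon = interpolate(monomials, xs, fs)
c_prod = interpolate(products, xs, fs)
print(c_mon)
print(c_prod)

# Different coefficients, same polynomial: compare at a non-node point.
t = Fraction(1, 7)
print(evaluate(monomials, c_mon, t) == evaluate(products, c_prod, t))
```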
1.4 Constructing interpolants in the Newton basis

To derive our first new basis for Pn , we describe an alternative


method for constructing the polynomial pn ∈ Pn that interpolates
f ∈ C [ a, b] at the distinct points { x0 , . . . , xn } ⊂ [ a, b]. This approach,
called the Newton form of the interpolant, builds pn up from lower
degree polynomials that interpolate f at only some of the data points.
Begin by constructing the polynomial p0 ∈ P0 that interpolates
f at x0 : p0 ( x0 ) = f ( x0 ). Since p0 is a zero-degree polynomial (i.e., a
constant), it has the simple form

p0 ( x ) = c0 .

To satisfy the interpolation condition at x0 , set c0 = f ( x0 ). (We


emphasize again: this c0 , and the c j below, will be different from the
c j ’s obtained in Section 1.2 for the monomial basis.)
Next, use p0 to build the polynomial p1 ∈ P1 that interpolates f at
both x0 and x1 . In particular, we will require p1 to have the form

p1 ( x ) = p0 ( x ) + c1 q1 ( x )

for some constant c1 and some q1 ∈ P1 . Note that

p1 ( x0 ) = p0 ( x0 ) + c1 q1 ( x0 )
= f ( x0 ) + c1 q1 ( x0 ).

Since we require that $p_1(x_0) = f(x_0)$, the above equation implies
that $c_1 q_1(x_0) = 0$. Either $c_1 = 0$ (which can only happen in the
special case $f(x_0) = f(x_1)$, and we seek a basis that works for any $f$)
or $q_1(x_0) = 0$, i.e., $q_1$ has a root at $x_0$. Since $q_1 \in P_1$, the
simplest such choice is $q_1(x) = x - x_0$. It follows that

p1 ( x ) = c0 + c1 ( x − x0 ),

where c1 is still undetermined. To find c1 , use the interpolation con-


dition at x1 :
f ( x1 ) = p1 ( x1 ) = c0 + c1 ( x1 − x0 ).

Solving for $c_1$,
\[ c_1 = \frac{f(x_1) - c_0}{x_1 - x_0}. \]
Next, find the p2 ∈ P2 that interpolates f at x0 , x1 , and x2 , where
p2 has the form
p2 ( x ) = p1 ( x ) + c2 q2 ( x ).

Similar to before, the first term, now p1 ( x ), ‘does the right thing’ at
the first two interpolation points, p1 ( x0 ) = f ( x0 ) and p1 ( x1 ) = f ( x1 ).
We require that q2 not interfere with p1 at x0 and x1 , i.e., q2 ( x0 ) =


q2 ( x1 ) = 0. Thus, we take q2 to have the form

q2 ( x ) = ( x − x0 )( x − x1 ).

The interpolation condition at x2 gives an equation where c2 is the


only unknown,

f ( x2 ) = p2 ( x2 ) = p1 ( x2 ) + c2 q2 ( x2 ),

which we can solve for

\[
c_2 = \frac{f(x_2) - p_1(x_2)}{q_2(x_2)}
    = \frac{f(x_2) - c_0 - c_1 (x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)}.
\]

Follow the same pattern to bootstrap up to $p_n$, which takes the
form
\[ p_n(x) = p_{n-1}(x) + c_n q_n(x), \]

where
\[ q_n(x) = \prod_{j=0}^{n-1} (x - x_j), \]

and, setting $q_0(x) = 1$, we have

\[ c_n = \frac{f(x_n) - \sum_{j=0}^{n-1} c_j q_j(x_n)}{q_n(x_n)}. \]

Finally, the desired polynomial takes the form

\[ p_n(x) = \sum_{j=0}^{n} c_j q_j(x). \]

The polynomials q j for j = 0, . . . , n form a basis for Pn , called the


Newton basis. The c j we have just determined are the expansion coef-
ficients for this interpolant in the Newton basis. Figure 1.5 shows the
Newton basis functions q j for [ a, b] = [0, 1] with n = 5 and x j = j/5,
which look considerably more distinct than the monomial basis poly-
nomials illustrated in Figure 1.1.
[Figure 1.5: The six Newton basis polynomials $q_0, \ldots, q_5$ for $P_5$, based
on the interval $[a,b] = [0,1]$ with $x_j = j/5$ (black dots). Compare these
to the monomial basis polynomials in Figure 1.1: these vectors look far
more distinct from one another than the monomials.]

This entire procedure for constructing $p_n$ can be condensed into a
system of linear equations with the coefficients $\{c_j\}_{j=0}^n$ unknown:

\[
\begin{bmatrix}
1 & & & & \\
1 & (x_1 - x_0) & & & \\
1 & (x_2 - x_0) & (x_2 - x_0)(x_2 - x_1) & & \\
\vdots & \vdots & \vdots & \ddots & \\
1 & (x_n - x_0) & (x_n - x_0)(x_n - x_1) & \cdots & \prod_{j=0}^{n-1} (x_n - x_j)
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix},
\tag{1.4}
\]


again a special case of (1.3) but with b j ( x ) = q j ( x ). (The unspecified


entries above the diagonal are zero, since q j ( xk ) = 0 when k < j.) The
system (1.4) involves a triangular matrix, which is simple to solve.
Clearly $c_0 = f(x_0)$, and once we know $c_0$, we can solve for

\[ c_1 = \frac{f(x_1) - c_0}{x_1 - x_0}. \]

With $c_0$ and $c_1$, we can solve for $c_2$, and so on. This procedure,
forward substitution, requires roughly $n^2$ floating point operations once
the entries are formed.
With this Newton form of the interpolant, one can easily update
pn to pn+1 in order to incorporate a new data point ( xn+1 , f ( xn+1 )),
as such a change affects neither the previous values of c j nor q j . The
new data ( xn+1 , f ( xn+1 )) simply adds a new row to the bottom of
the matrix in (1.4), which preserves the triangular structure of the
matrix and the values of {c0 , . . . , cn }. If we have already found these
coefficients, we easily obtain cn+1 through one more step of forward
substitution.
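
The forward substitution and the one-step update can be sketched in Python as follows. The function names (`next_coeff`, `newton_coeffs`, `newton_eval`) and the extra node $x = 0.5$ are choices made for this illustration; the test function $f(x) = \sin(10x) + \cos(10x)$ with nodes $x_j = j/5$ is assumed here only to have concrete data. Each new coefficient is obtained from $c_k = (f(x_k) - p_{k-1}(x_k))/q_k(x_k)$, accumulating $p_{k-1}(x_k)$ and $q_k(x_k)$ in a single pass.

```python
import math

def next_coeff(xs, c, x_new, f_new):
    """Given c[0..k-1] for nodes xs[0..k-1], return c_k for the node x_new.

    p accumulates p_{k-1}(x_new) and q accumulates q_k(x_new).
    """
    p, q = 0.0, 1.0
    for cj, xj in zip(c, xs):
        p += cj * q          # add the term c_j * q_j(x_new)
        q *= x_new - xj      # q_{j+1}(x_new) = q_j(x_new) * (x_new - x_j)
    return (f_new - p) / q

def newton_coeffs(xs, fs):
    """Forward substitution on the lower triangular system (1.4)."""
    c = []
    for k in range(len(xs)):
        c.append(next_coeff(xs[:k], c, xs[k], fs[k]))
    return c

def newton_eval(xs, c, x):
    """Evaluate sum_j c[j] * q_j(x) by nested multiplication."""
    p = c[-1]
    for cj, xj in zip(reversed(c[:-1]), reversed(xs[:len(c) - 1])):
        p = p * (x - xj) + cj
    return p

def f(x):
    return math.sin(10 * x) + math.cos(10 * x)

xs = [j / 5 for j in range(6)]
fs = [f(x) for x in xs]
c = newton_coeffs(xs, fs)
print(c)

# Incorporate one more data point: the existing coefficients are reused
# unchanged, and only the new last coefficient is computed.
x_new = 0.5
c_ext = c + [next_coeff(xs, c, x_new, f(x_new))]
print(newton_eval(xs + [x_new], c_ext, x_new), f(x_new))
```

Computing all $n+1$ coefficients this way costs on the order of $n^2$ operations, in line with the count given above, and adding a point costs only one further pass of length $n+1$.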

1.5 Constructing interpolants in the Lagrange basis

The monomial basis gave us a linear system (1.1) of the form Ac = f


in which A was a dense matrix: in general, all of its entries are nonzero. The
Newton basis gave a simpler system (1.4) in which A was a lower
triangular matrix. Can we go one step further, and find a set of basis
functions for which the matrix in (1.3) is diagonal?
For the matrix to be diagonal, the $j$th basis function would need to
have roots at all the other interpolation points $x_k$ for $k \ne j$. Such
functions, denoted $\ell_j$ for $j = 0, \ldots, n$, are called Lagrange basis
polynomials, and they result in the Lagrange form of the interpolating
polynomial.
We seek to construct $\ell_j \in P_n$ with $\ell_j(x_k) = 0$ if $j \ne k$, but
$\ell_j(x_k) = 1$ if $j = k$. That is, $\ell_j$ takes the value one at $x_j$ and
has roots at all the other $n$ interpolation points.
What form do these basis functions $\ell_j \in P_n$ take? Since $\ell_j$ is a
degree-$n$ polynomial with the $n$ roots $\{x_k\}_{k=0, k \ne j}^{n}$, it can be
written in the form

\[ \ell_j(x) = \prod_{k=0,\, k \ne j}^{n} \gamma_k (x - x_k) \]

for appropriate constants $\gamma_k$. We can force $\ell_j(x_j) = 1$ if all the terms
in the above product are one when $x = x_j$, i.e., when
$\gamma_k = 1/(x_j - x_k)$, so that

\[ \ell_j(x) = \prod_{k=0,\, k \ne j}^{n} \frac{x - x_k}{x_j - x_k}. \]

This form makes it clear that $\ell_j(x_j) = 1$. With these new basis
functions, the constants $\{c_j\}$ can be written down immediately. The
interpolating polynomial has the form

\[ p_n(x) = \sum_{k=0}^{n} c_k \ell_k(x). \]

When $x = x_j$, all terms in this sum will be zero except for one, the
$k = j$ term (since $\ell_k(x_j) = 0$ except when $j = k$). Thus,

\[ p_n(x_j) = c_j \ell_j(x_j) = c_j, \]

so we can directly write down the coefficients, $c_j = f(x_j)$.


As desired, this approach to constructing basis polynomials leads
to a diagonal matrix $A$ in the equation $Ac = f$ for the coefficients.
Since we also insisted that $\ell_j(x_j) = 1$, the matrix $A$ is actually just the
identity matrix:

\[
\begin{bmatrix}
1 & & & \\
& 1 & & \\
& & \ddots & \\
& & & 1
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}.
\]
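
A direct transcription of the Lagrange form into Python might look like the sketch below. The function names are made up for this illustration, and the data ($f(x) = \sin(10x) + \cos(10x)$ at $x_j = j/5$) is assumed only to have something concrete to interpolate. This is the plain product formula, not the barycentric variant: each evaluation of the interpolant costs on the order of $n^2$ operations.

```python
import math

def lagrange_basis(xs, j, x):
    """Evaluate l_j(x) = prod over k != j of (x - x_k) / (x_j - x_k)."""
    val = 1.0
    for k, xk in enumerate(xs):
        if k != j:
            val *= (x - xk) / (xs[j] - xk)
    return val

def lagrange_interp(xs, fs, x):
    """p_n(x) = sum_j f(x_j) * l_j(x): the coefficients are the data values."""
    return sum(fj * lagrange_basis(xs, j, x) for j, fj in enumerate(fs))

def f(x):
    return math.sin(10 * x) + math.cos(10 * x)

xs = [j / 5 for j in range(6)]
fs = [f(x) for x in xs]

# The matrix with entries l_j(x_k) should be the identity.
for k in range(len(xs)):
    print([lagrange_basis(xs, j, xs[k]) for j in range(len(xs))])

# Evaluate the interpolant at a point between the nodes.
print(lagrange_interp(xs, fs, 0.3), f(0.3))
```

No linear system is solved at all here: the price of the trivial coefficient computation is paid instead each time the basis polynomials are evaluated.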


A forthcoming exercise will investigate an important flexible and
numerically stable method for constructing and evaluating Lagrange
interpolants known as barycentric interpolation.
Figure 1.6 shows the Lagrange basis functions for n = 5 with
[ a, b] = [0, 1] and x j = j/5, the same parameters used in the plots of
the monomial and Newton bases earlier.

[Figure 1.6: The six Lagrange basis polynomials $\ell_0, \ldots, \ell_5$ for $P_5$,
based on the interval $[a,b] = [0,1]$ with $x_j = j/5$ (black dots). Note that
each Lagrange polynomial has roots at $n$ of the interpolation points.
Compare these polynomials to the monomial and Newton basis polynomials in
Figures 1.1 and 1.5 (but note the different vertical scale, here from $-2$
to $1.5$): these basis vectors look most independent of all.]

The fact that these basis functions are not as closely aligned as the
previous ones has interesting consequences on the size of the coeffi-
cients {c j }. For example, if we have n + 1 = 6 interpolation points
for f ( x ) = sin(10x ) + cos(10x ) on [0, 1], we obtain the following
coefficients:
          monomial          Newton            Lagrange
    c0    1.0000000e+00     1.0000000e+00     1.0000000e+00
    c1    4.0861958e+01    -2.5342470e+00     4.9315059e-01
    c2   -3.8924180e+02    -1.7459341e+01    -1.4104461e+00
    c3    1.0775024e+03     1.1232385e+02     6.8075479e-01
    c4   -1.1683645e+03    -2.9464687e+02     8.4385821e-01
    c5    4.3685881e+02     4.3685881e+02    -1.3830926e+00

We emphasize that all three approaches (in exact arithmetic) must


yield the same unique polynomial, but they are expressed in different
bases. The behavior in floating point arithmetic varies significantly
with the choice of basis; the monomial basis is the clear loser.
lecture 3: Interpolation Error Bounds

1.6 Convergence theory for polynomial interpolation

Interpolation can be used to generate low-degree polynomials that


approximate a complicated function over the interval [ a, b]. One
might assume that the more data points that are interpolated (for a
fixed [ a, b]), the more accurate the resulting approximation. In this
lecture, we address the behavior of the maximum error

\[ \max_{x \in [a,b]} |f(x) - p_n(x)| \]

as the number of interpolation points—hence, the degree of the in-


terpolating polynomial—is increased. We begin with a theoretical
result.

Theorem 1.2 (Weierstrass Approximation Theorem).


Suppose $f \in C[a,b]$. For any $\varepsilon > 0$ there exists some polynomial $p_n$
of finite degree $n$ such that $\max_{x \in [a,b]} |f(x) - p_n(x)| \le \varepsilon$.

Unfortunately, we do not have time to prove this in class.7 As
stated, this theorem gives no hint about what the approximating
polynomial looks like, whether $p_n$ interpolates $f$ at $n+1$ points, or
merely approximates $f$ well throughout $[a,b]$, nor does the Weierstrass
theorem describe the accuracy of a polynomial for a specific
value of $n$ (though one could gain insight into such questions by
studying the constructive proof).

7 The typical proof is a construction based on Bernstein polynomials; see,
e.g., Kincaid and Cheney, Numerical Analysis, 3rd edition, pages 320–323.
This result can be generalized to the Stone–Weierstrass Theorem, itself a
special case of Bishop's Theorem for approximation problems in operator
algebras; see e.g., §5.6–§5.8 of Rudin, Functional Analysis, 2nd ed.
(McGraw Hill, 1991).

On the other hand, for the interpolation problem studied in the
preceding lectures, we can obtain a specific error formula that gives a
bound on $\max_{x\in[a,b]} |f(x) - p_n(x)|$. From this bound, we can deduce
if interpolating $f$ at increasingly many points will eventually yield
a polynomial approximation to $f$ that is accurate to any specified
precision.
For any $\hat{x} \in [a,b]$ that is not one of the interpolation points, we seek to
measure the error

$$ f(\hat{x}) - p_n(\hat{x}), $$

where $p_n \in \mathcal{P}_n$ is the interpolant to $f$ at the distinct points
$x_0, \ldots, x_n \in [a,b]$. We can get a grip on this error from the following perspective.
Extend $p_n$ by one degree to give a new polynomial that additionally
interpolates $f$ at $\hat{x}$. This is easy to do with the Newton form of the
interpolant; write the new polynomial as

$$ p_n(x) + \lambda \prod_{j=0}^{n} (x - x_j), $$

for constant $\lambda$ chosen so that

$$ f(\hat{x}) = p_n(\hat{x}) + \lambda \prod_{j=0}^{n} (\hat{x} - x_j). $$

For convenience, we write

$$ w(x) := \prod_{j=0}^{n} (x - x_j), $$

so we could solve for $\lambda$ as

$$ \lambda = \frac{f(\hat{x}) - p_n(\hat{x})}{w(\hat{x})}. \tag{1.5} $$

Now notice that

$$ 0 = f(\hat{x}) - \big( p_n(\hat{x}) + \lambda w(\hat{x}) \big) $$

implies

$$ f(\hat{x}) - p_n(\hat{x}) = \lambda w(\hat{x}), \tag{1.6} $$

which is an expression for the desired error $f(\hat{x}) - p_n(\hat{x})$. Unfortunately,
the formula (1.5) does not give much insight into how the
error behaves as a function of $f$ and the interpolation points.

Theorem 1.3 (Interpolation Error Formula).


Suppose $f \in C^{n+1}[a,b]$ and let $p_n \in \mathcal{P}_n$ denote the polynomial that
interpolates $\{(x_j, f(x_j))\}_{j=0}^{n}$ for distinct points $x_j \in [a,b]$, $j = 0, \ldots, n$.
Then for every $x \in [a,b]$ there exists $\xi \in [a,b]$ such that

$$ f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{j=0}^{n} (x - x_j). $$

From this formula follows a bound for the worst error over $[a,b]$:

$$ \max_{x\in[a,b]} |f(x) - p_n(x)| \le \left( \max_{\xi\in[a,b]} \frac{|f^{(n+1)}(\xi)|}{(n+1)!} \right) \left( \max_{x\in[a,b]} \prod_{j=0}^{n} |x - x_j| \right). \tag{1.7} $$

We shall carefully prove this essential result; it will repay the ef-
fort, for this theorem becomes the foundation upon which we shall
build the convergence theory for piecewise polynomial approxima-
tion and interpolatory quadrature rules for definite integrals.
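As a quick illustration of how (1.7) is used, consider the simplest case $n = 1$ (linear interpolation) with $x_0 = a$ and $x_1 = b$. The symbol $h := b - a$ is introduced here only for this worked example. The product in (1.7) is a parabola whose maximum magnitude occurs at the midpoint, which gives:

```latex
\max_{x\in[a,b]} |(x-a)(x-b)| = \frac{h^2}{4}
\quad \text{(attained at } x = \tfrac{a+b}{2}\text{)},
\qquad\Longrightarrow\qquad
\max_{x\in[a,b]} |f(x) - p_1(x)| \le \frac{h^2}{8} \max_{\xi\in[a,b]} |f''(\xi)|.
```

So halving the interval length reduces the bound on the linear interpolation error by a factor of four.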

Proof. Consider some arbitrary point $\hat{x} \in [a,b]$. We seek a descriptive
expression for the error $f(\hat{x}) - p_n(\hat{x})$. If $\hat{x} = x_j$ for some
$j \in \{0, \ldots, n\}$, then $f(\hat{x}) - p_n(\hat{x}) = 0$ and there is nothing to prove.
Thus, suppose for the rest of the proof that $\hat{x}$ is not one of the
interpolation points.

To describe $f(\hat{x}) - p_n(\hat{x})$, we shall build the polynomial of degree
$n+1$ that interpolates $f$ at $x_0, \ldots, x_n$, and also $\hat{x}$. Of course, this
polynomial will give zero error at $\hat{x}$, since it interpolates $f$ there.
From this polynomial we can extract a formula for $f(\hat{x}) - p_n(\hat{x})$ by
measuring how much the degree $n+1$ interpolant improves upon the
degree-$n$ interpolant $p_n$ at $\hat{x}$.

Since we wish to understand the relationship of the degree $n+1$
interpolant to $p_n$, we shall write that degree $n+1$ interpolant in a
manner that explicitly incorporates $p_n$. Given this setting, use of the
Newton form of the interpolant is natural; i.e., we write the degree
$n+1$ polynomial as

$$ p_n(x) + \lambda \prod_{j=0}^{n} (x - x_j) $$

for some constant $\lambda$ chosen to make the interpolant exact at $\hat{x}$. For
convenience, we write

$$ w(x) \equiv \prod_{j=0}^{n} (x - x_j) $$

and then denote the error of this degree $n+1$ interpolant by

$$ \phi(x) \equiv f(x) - \big( p_n(x) + \lambda w(x) \big). $$

To make the polynomial $p_n(x) + \lambda w(x)$ interpolate $f$ at $\hat{x}$, we shall
pick $\lambda$ such that $\phi(\hat{x}) = 0$. The fact that $\hat{x} \notin \{x_j\}_{j=0}^{n}$ ensures that
$w(\hat{x}) \ne 0$, and so we can force $\phi(\hat{x}) = 0$ by setting

$$ \lambda = \frac{f(\hat{x}) - p_n(\hat{x})}{w(\hat{x})}. $$

Furthermore, since $f(x_j) = p_n(x_j)$ and $w(x_j) = 0$ at all the $n+1$
interpolation points $x_0, \ldots, x_n$, we also have $\phi(x_j) = f(x_j) - p_n(x_j) - \lambda w(x_j) = 0$.
Thus, $\phi$ is a function with at least $n+2$ zeros in the
interval $[a,b]$. Rolle's Theorem8 tells us that between every two
consecutive zeros of $\phi$, there is some zero of $\phi'$. Since $\phi$ has at least $n+2$
zeros in $[a,b]$, $\phi'$ has at least $n+1$ zeros in this same interval. We
can repeat this argument with $\phi'$ to see that $\phi''$ must have at least $n$
zeros in $[a,b]$. Continuing in this manner with higher derivatives, we
eventually conclude that $\phi^{(n+1)}$ must have at least one zero in $[a,b]$;
we denote this zero as $\xi$, so that $\phi^{(n+1)}(\xi) = 0$.

8 Recall the Mean Value Theorem from calculus: Given $d > c$, suppose
$f \in C[c,d]$ is differentiable on $(c,d)$. Then there exists some $\eta \in (c,d)$ such
that $(f(d) - f(c))/(d - c) = f'(\eta)$. Rolle's Theorem is a special case: If
$f(d) = f(c)$, then there is some point $\eta \in (c,d)$ such that $f'(\eta) = 0$.

We now want a more concrete expression for $\phi^{(n+1)}$. Note that

$$ \phi^{(n+1)}(x) = f^{(n+1)}(x) - p_n^{(n+1)}(x) - \lambda w^{(n+1)}(x). $$

Since $p_n$ is a polynomial of degree $n$ or less, $p_n^{(n+1)} \equiv 0$. Now observe
that $w$ is a polynomial of degree $n+1$. We could write out all the
coefficients of this polynomial explicitly, but that is a bit tedious,
and we do not need all of them. Simply observe that we can write
$w(x) = x^{n+1} + q(x)$, for some $q \in \mathcal{P}_n$, and this polynomial $q$ will
vanish when we take $n+1$ derivatives:

$$ w^{(n+1)}(x) = \left( \frac{d^{n+1}}{dx^{n+1}} x^{n+1} \right) + q^{(n+1)}(x) = (n+1)! + 0. $$

Assembling the pieces, $\phi^{(n+1)}(x) = f^{(n+1)}(x) - \lambda (n+1)!$. Since
$\phi^{(n+1)}(\xi) = 0$, we conclude that

$$ \lambda = \frac{f^{(n+1)}(\xi)}{(n+1)!}. $$

Substituting this expression into $0 = \phi(\hat{x}) = f(\hat{x}) - p_n(\hat{x}) - \lambda w(\hat{x})$,
we obtain

$$ f(\hat{x}) - p_n(\hat{x}) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{j=0}^{n} (\hat{x} - x_j). $$
This error bound has strong parallels to the remainder term in
Taylor's formula. Recall that for sufficiently smooth $f$, the Taylor
expansion of $f$ about the point $x_0$ is given by

$$ f(x) = f(x_0) + (x - x_0) f'(x_0) + \cdots + \frac{(x - x_0)^k}{k!} f^{(k)}(x_0) + \text{remainder}. $$

Ignoring the remainder term at the end, note that the Taylor expansion
gives a polynomial model of $f$, but one based on local information
about $f$ and its derivatives, as opposed to the polynomial
interpolant, which is based on global information, but only about $f$,
not its derivatives.
An interesting feature of the interpolation bound is the polynomial
$w(x) = \prod_{j=0}^{n} (x - x_j)$. This quantity plays an essential role in
approximation theory, and also a closely allied subdiscipline of complex
analysis called potential theory. Naturally, one might wonder what
choice of points $\{x_j\}$ minimizes $\max_{x\in[a,b]} |w(x)|$. We will revisit this question
when we study approximation theory in the near future. For now, we
simply note that the points that minimize $\max_{x\in[a,b]} |w(x)|$ are called
Chebyshev points, which are clustered more densely at the ends of the
interval $[a,b]$.
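To get a feel for how much the choice of points matters, the sketch below estimates $\max |w(x)|$ on $[-1,1]$ by sampling on a fine grid, for uniformly spaced points and for one standard set of Chebyshev points. It rests on assumptions not taken from the notes: the Chebyshev points are taken to be the roots of the degree-$(n+1)$ Chebyshev polynomial, $\cos((2j+1)\pi/(2n+2))$, for which the classical value $\max|w| = 2^{-n}$ is known; the function name `max_abs_w` and the grid size are illustrative.

```python
import math

def max_abs_w(nodes, a, b, m=20001):
    """Estimate max over [a, b] of |prod_j (x - x_j)| by sampling m grid points."""
    best = 0.0
    for i in range(m):
        x = a + (b - a) * i / (m - 1)
        p = 1.0
        for xj in nodes:
            p *= (x - xj)
        best = max(best, abs(p))
    return best

n = 10
uniform = [-1 + 2 * j / n for j in range(n + 1)]
# Assumption: Chebyshev points taken as the roots of T_{n+1} on [-1, 1].
cheb_roots = [math.cos((2 * j + 1) * math.pi / (2 * n + 2)) for j in range(n + 1)]

w_uniform = max_abs_w(uniform, -1.0, 1.0)
w_cheb = max_abs_w(cheb_roots, -1.0, 1.0)
print(f"uniform points:   max|w| ~ {w_uniform:.3e}")
print(f"Chebyshev points: max|w| ~ {w_cheb:.3e}  (2^-n = {2.0 ** -n:.3e})")
```

For the uniform points the largest values of $|w|$ occur near the ends of the interval, which is exactly where the Chebyshev points cluster.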
Example 1.1 ($f(x) = \sin(x)$). We shall apply the interpolation
bound to $f(x) = \sin(x)$ on $x \in [-5,5]$. Since $f^{(n+1)}(x) = \pm\sin(x)$ or
$\pm\cos(x)$, we have $\max_{x\in[-5,5]} |f^{(n+1)}(x)| = 1$ for all $n$. Moreover,
for any choice of distinct interpolation points in $[-5,5]$ and any $x \in [-5,5]$,

$$ \prod_{j=0}^{n} |x - x_j| < 10^{n+1}, $$

the worst case coming if all the interpolation points are clustered at
an end of the interval $[-5,5]$. Now our theorem ensures that

$$ \max_{x\in[-5,5]} |\sin(x) - p_n(x)| \le \frac{10^{n+1}}{(n+1)!}. $$

For small values of $n$, this bound will be very large, but eventually
$(n+1)!$ grows much faster than $10^{n+1}$, so we conclude that our error
must go to zero as $n \to \infty$ regardless of where in $[-5,5]$ we place our
interpolation points! The error bound is shown in the first plot below.
Consider the following specific example: Interpolate sin( x ) at
points uniformly selected in [−1, 1]. At first glance, you might think
there is no reason that we should expect our interpolants pn to con-
verge to sin( x ) for all x ∈ [−5, 5], since we are only using data from
the subinterval [−1, 1], which is only 20% of the total interval and
does not even include one entire period of the sine function. (In fact,
sin( x ) attains neither its maximum nor minimum on [−1, 1].) Yet
the error bound we proved above ensures that the polynomial in-
terpolant must converge throughout [−5, 5]. This is illustrated in
the first plot below. The next plots show the interpolants p4 ( x ) and
p10 ( x ) generated from these interpolation points. Not surprisingly,
these interpolants are most accurate near [−1, 1], the location of the
interpolation points (shown as circles), but we still see convergence
well beyond [−1, 1], in the same way that the Taylor expansion for
sin( x ) at x = 0 will converge everywhere.
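The comparison between the true error and the bound $10^{n+1}/(n+1)!$ can be checked numerically for modest $n$. The sketch below is an illustration under stated assumptions, not the code behind Figure 1.7: it uses the Newton form in ordinary double precision, samples the error on a grid of 2001 points, and restricts itself to $n \le 12$ because evaluating the interpolant far outside $[-1,1]$ amplifies rounding errors as $n$ grows. The helper names are hypothetical.

```python
import math

def newton_coeffs(xs, fs):
    """Newton-basis coefficients via divided differences."""
    c = list(fs)
    for k in range(1, len(xs)):
        for j in range(len(xs) - 1, k - 1, -1):
            c[j] = (c[j] - c[j - 1]) / (xs[j] - xs[j - k])
    return c

def newton_eval(xs, c, x):
    """Evaluate the Newton form with a Horner-like nested product."""
    p = c[-1]
    for k in range(len(c) - 2, -1, -1):
        p = p * (x - xs[k]) + c[k]
    return p

def max_error(n, m=2001):
    """Max |sin(x) - p_n(x)| on a grid over [-5, 5]; nodes are uniform on [-1, 1]."""
    xs = [-1 + 2 * j / n for j in range(n + 1)]
    c = newton_coeffs(xs, [math.sin(x) for x in xs])
    grid = (-5 + 10 * i / (m - 1) for i in range(m))
    return max(abs(math.sin(x) - newton_eval(xs, c, x)) for x in grid)

results = {}
for n in (4, 8, 12):
    bound = 10.0 ** (n + 1) / math.factorial(n + 1)
    results[n] = (max_error(n), bound)
    print(f"n = {n:2d}   max error ~ {results[n][0]:.3e}   bound = {bound:.3e}")
```

The bound is quite pessimistic here, since it allows for interpolation points clustered anywhere in $[-5,5]$.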

Example 1.2 (Runge’s Example). The error bound (1.7) suggests those
functions for which interpolants might fail to converge as n → ∞:
beware if higher derivatives of f are large in magnitude over the
interpolation interval. The most famous example of such behavior
is due to Carl Runge, who studied convergence of interpolants for
$f(x) = 1/(1 + x^2)$ on the interval $[-5,5]$. This function looks beautiful:
it resembles a bell curve, with no singularities in sight on $\mathbb{R}$, as
Figure 1.8 shows. However, the interpolants to f at uniformly spaced
points over [−5, 5] do not seem to converge even for x ∈ [−5, 5].
Figure 1.7: Interpolation of $\sin(x)$ at points $x_0, \ldots, x_n$ uniformly distributed
on $[-1,1]$. We develop an error bound from Theorem 1.3 for the interval
$[a,b] = [-5,5]$. The bound proves that even though the interpolation points
only fall in $[-1,1]$, the interpolant will still converge throughout $[-5,5]$. The
top plot shows this convergence for $n = 0, \ldots, 40$; the bottom plots show
the polynomials $p_4$ and $p_{10}$, along with the interpolation points that determine
these polynomials (black circles).
(Top plot: the error bound $10^{n+1}/(n+1)!$ and $\max_{x\in[-5,5]} |\sin(x) - p_n(x)|$ versus $n$.
Bottom plots: $f(x)$ with $p_4(x)$, and $f(x)$ with $p_{10}(x)$, versus $x$.)

Look at successive derivatives of $f$; they expose its crucial flaw:

$$ f'(x) = -\frac{2x}{(1+x^2)^2} $$

$$ f''(x) = \frac{8x^2}{(1+x^2)^3} - \frac{2}{(1+x^2)^2} $$

$$ f'''(x) = -\frac{48x^3}{(1+x^2)^4} + \frac{24x}{(1+x^2)^3} $$

$$ f^{(iv)}(x) = \frac{384x^4}{(1+x^2)^5} - \frac{288x^2}{(1+x^2)^4} + \frac{24}{(1+x^2)^3} $$

$$ f^{(vi)}(x) = \frac{46080x^6}{(1+x^2)^7} - \frac{57600x^4}{(1+x^2)^6} + \frac{17280x^2}{(1+x^2)^5} - \frac{720}{(1+x^2)^4}. $$

At certain points on $[-5,5]$, $f^{(n+1)}$ blows up more rapidly than
$(n+1)!$, and the interpolation bound (1.7) suggests that $p_n$ will
not converge to $f$ on $[-5,5]$ as $n$ gets large. Not only does $p_n$ fail
to converge to $f$; the error between certain interpolation points gets
enormous as $n$ increases.
Figure 1.8: Interpolation of Runge's function $1/(x^2+1)$ at points $x_0, \ldots, x_n$
uniformly distributed on $[-5,5]$. The top plot shows the maximum error
$\max_{x\in[-5,5]} |1/(x^2+1) - p_n(x)|$ for $n = 0, \ldots, 25$; the bottom plots show the
interpolating polynomials $p_4$, $p_8$, $p_{16}$, and $p_{24}$, along with the interpolation
points that determine these polynomials (black circles). These interpolants
do not converge to $f$ as $n \to \infty$. This is not a numerical instability, but a fatal
flaw that arises when interpolating with large degree polynomials at uniformly
spaced points.

The following code uses MATLAB's Symbolic Toolbox to compute
higher derivatives of the Runge function. Several of the resulting
plots follow.9 Note how the scale on the vertical axis changes from plot to
plot!

9 Not all versions of MATLAB have the Symbolic Toolbox, but you should be
able to run this code on any Student Edition or on copies on the Virginia Tech
network.

% rungederiv.m
% routine to plot derivatives of Runge's example,
% f(x) = 1/(1+x^2) on [-5,5]

figure(1), clf, set(gca,'fontsize',18)
for j=0:25
    syms x                                       % (re)declare x as symbolic; it is overwritten below
    fj = vectorize(diff(1/(x^2+1),j));           % compute derivative (Symbolic Toolbox)
    x = linspace(-5,5,1000); fjx = eval(fj);     % evaluate on a grid of points
    plot(x,fjx,'b-','linewidth',2);              % plot derivative
    title(sprintf('Runge''s Example: f^{(%d)}(x)',j),'fontsize',14)
    input(' ')                                   % wait for Enter before the next plot
end

Figure 1.9: Runge's function $f(x) = 1/(1 + x^2)$ and a few of its derivatives
on $x \in [-5,5]$. Notice how large the derivatives grow in magnitude: the
vertical scale on the plot for $f^{(25)}$ (bottom-right) is $10^{25}$.
(Panels: $f(x)$, $f'(x)$, $f^{(4)}(x)$, $f^{(10)}(x)$, $f^{(20)}(x)$, $f^{(25)}(x)$.)

Some improvement can be made by a careful selection of the
interpolation points $\{x_j\}$. In fact, if one interpolates Runge's example,
$f(x) = 1/(1 + x^2)$, at the Chebyshev points for $[-5,5]$,

$$ x_j = 5 \cos\left( \frac{j\pi}{n} \right), \qquad j = 0, \ldots, n, $$

then the interpolant will converge!
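The contrast between uniform and Chebyshev points for Runge's function is easy to observe numerically. The following sketch makes choices that are not from the notes: it evaluates the interpolant with the barycentric form of the Lagrange interpolant (swapped in here for numerical robustness, in place of the Newton form used earlier), samples the error on a grid of 2001 points, and uses hypothetical helper names.

```python
import math

def runge(x):
    return 1.0 / (1.0 + x * x)

def bary_weights(xs):
    """Barycentric weights w_j = 1 / prod_{k != j} (x_j - x_k)."""
    ws = []
    for j, xj in enumerate(xs):
        p = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                p *= (xj - xk)
        ws.append(1.0 / p)
    return ws

def bary_eval(xs, fs, ws, x):
    """Evaluate the interpolating polynomial at x (second barycentric formula)."""
    num = den = 0.0
    for xj, fj, wj in zip(xs, fs, ws):
        if x == xj:               # exactly at a node: return the data value
            return fj
        t = wj / (x - xj)
        num += t * fj
        den += t
    return num / den

def max_error(xs, m=2001):
    """Max |f(x) - p_n(x)| sampled on a grid over [-5, 5]."""
    fs = [runge(x) for x in xs]
    ws = bary_weights(xs)
    grid = (-5 + 10 * i / (m - 1) for i in range(m))
    return max(abs(runge(x) - bary_eval(xs, fs, ws, x)) for x in grid)

unif_err, cheb_err = {}, {}
for n in (8, 16, 24):
    unif_err[n] = max_error([-5 + 10 * j / n for j in range(n + 1)])
    cheb_err[n] = max_error([5 * math.cos(j * math.pi / n) for j in range(n + 1)])
    print(f"n = {n:2d}   uniform: {unif_err[n]:.3e}   Chebyshev: {cheb_err[n]:.3e}")
```

The uniform-point errors grow with $n$, while the Chebyshev-point errors shrink, in line with the discussion above.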


As a general rule, interpolation at Chebyshev points is greatly
preferred over interpolation at uniformly spaced points for reasons
we shall understand in a few lectures. However, even this set is not
perfect: there exist functions for which the interpolants at Chebyshev
points do not converge. Examples to this effect were constructed by
Marcinkiewicz and Grünwald in the 1930s. We close with two results
of a more general nature.10 We require some general notation to
describe a family of interpolation points that can change as the polynomial
degree increases. Toward this end, let $\{x_j^{[n]}\}_{j=0}^{n}$ denote the set
of interpolation points used to construct the degree-$n$ interpolant. As
we are concerned here with the behavior of interpolants as $n \to \infty$,
we will speak of the system of interpolation points $\{\{x_j^{[n]}\}_{j=0}^{n}\}_{n=0}^{\infty}$.
Our first result is bad news.

10 An excellent exposition of these points is given in volume 3 of I. P.
Natanson, Constructive Function Theory (Ungar, 1965).
Theorem 1.4 (Faber’s Theorem).
Let $\{\{x_j^{[n]}\}_{j=0}^{n}\}_{n=0}^{\infty}$ be any system of interpolation points with
$x_j^{[n]} \in [a,b]$ and $x_j^{[n]} \ne x_\ell^{[n]}$ for $j \ne \ell$ (i.e., distinct interpolation points for
each polynomial degree). Then there exists some function $f \in C[a,b]$
such that the polynomials $p_n$ that interpolate $f$ at $\{x_j^{[n]}\}_{j=0}^{n}$ do not
converge uniformly to $f$ in $[a,b]$ as $n \to \infty$.

The good news is that there always exists a suitable set of interpo-
lation points for any given f ∈ C [ a, b].

Theorem 1.5 (Marcinkiewicz’s Theorem).


Given any $f \in C[a,b]$, there exists a system of interpolation points
with $x_j^{[n]} \in [a,b]$ such that the polynomials $p_n$ that interpolate $f$ at
$\{x_j^{[n]}\}_{j=0}^{n}$ converge uniformly to $f$ in $[a,b]$ as $n \to \infty$.

These results are both quite abstract; for example, the construction
of the offensive example in Faber’s Theorem is not nearly as con-
crete as Runge’s nice example for uniformly spaced points discussed
above. We will revisit the question of the convergence of interpolants
in a few weeks when we discuss Chebyshev polynomials. Then we
will be able to say something much more positive: there exists a nice
set of points that works for all but the ugliest functions in C [ a, b].

lecture 4: Constructing Finite Difference Formulas

1.7 Application: Interpolants for Finite Difference Formulas

The most obvious use of interpolants is to construct polynomial models
of more complicated functions. However, numerical analysts rely
on interpolants for many other numerical chores. For example, in a
few weeks we shall see that common techniques for approximating
definite integrals amount to exactly integrating a polynomial interpolant.
Here we turn to a different application: the use of interpolating
polynomials to derive finite difference formulas that approximate
derivatives, and then the use of those formulas to construct approximate
solutions of differential equation boundary value problems.

1.7.1 Derivatives of Interpolants


Theorem 1.3 from the last lecture showed how well the interpolant
pn ∈ Pn approximates f . Here we seek deeper connections between
pn and f .

How well do derivatives of pn approximate derivatives of f ?

Let $p \in \mathcal{P}_n$ denote the degree-$n$ polynomial that interpolates $f$ at
the distinct points $x_0, \ldots, x_n$. We want to derive a bound on the error
$f'(x) - p'(x)$. Let us take the proof of Theorem 1.3 as a template, and
adapt it to analyze the error in the derivative.

For simplicity, assume that $\hat{x} \in \{x_0, \ldots, x_n\}$, i.e., assume that $\hat{x}$
is one of the interpolation points. Suppose we extend $p(x)$ by one
degree so that the derivative of the resulting polynomial at $\hat{x}$ matches
$f'(\hat{x})$. To do so, use the Newton form of the interpolant, writing the
new polynomial as

$$ p(x) + \lambda w(x), $$

again with

$$ w(x) := \prod_{j=0}^{n} (x - x_j). $$

The derivative interpolation condition at $\hat{x}$ is

$$ f'(\hat{x}) = p'(\hat{x}) + \lambda w'(\hat{x}), \tag{1.8} $$

and since $w(x_j) = 0$ for $j = 0, \ldots, n$, the new polynomial maintains
the standard interpolation at the $n+1$ interpolation points:

$$ f(x_j) = p(x_j) + \lambda w(x_j), \qquad j = 0, \ldots, n. \tag{1.9} $$

Here we must tweak the proof of Theorem 1.3 slightly. As in that
proof, define the error function

$$ \phi(x) := f(x) - \big( p(x) + \lambda w(x) \big). $$

Because of the standard interpolation conditions (1.9) at $x_0, \ldots, x_n$,
$\phi$ must have $n+1$ zeros. Now Rolle's theorem implies that $\phi'$ has
(at least) $n$ zeros, each of which occurs strictly between every two
consecutive interpolation points. But in addition to these points, $\phi'$
must have another root at $\hat{x}$ (which we have required to be one of the
interpolation points, and thus distinct from the other $n$ roots). Thus,
$\phi'$ has $n+1$ distinct zeros on $[a,b]$.

Now, repeatedly apply Rolle's theorem to see that $\phi''$ has $n$ distinct
zeros, $\phi'''$ has $n-1$ distinct zeros, etc., to conclude that $\phi^{(n+1)}$ has a
zero: call it $\xi$. That is,

$$ 0 = \phi^{(n+1)}(\xi) = f^{(n+1)}(\xi) - \big( p^{(n+1)}(\xi) + \lambda w^{(n+1)}(\xi) \big). \tag{1.10} $$

We must analyze

$$ \phi^{(n+1)}(x) = f^{(n+1)}(x) - \big( p^{(n+1)}(x) + \lambda w^{(n+1)}(x) \big). $$

Just as in the proof of Theorem 1.3, note that $p^{(n+1)} = 0$ since $p \in \mathcal{P}_n$
and $w^{(n+1)}(x) = (n+1)!$. Thus from (1.10) conclude

$$ \lambda = \frac{f^{(n+1)}(\xi)}{(n+1)!}. $$

From (1.8) we arrive at

$$ f'(\hat{x}) - p'(\hat{x}) = \lambda w'(\hat{x}) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\, w'(\hat{x}). $$

To arrive at a concrete estimate, perhaps we should say something
more specific about $w'(\hat{x})$. Expanding $w$ and computing $w'$ explicitly
will take us far into the weeds; it suffices to invoke an interesting
result from 1889.

Lemma 1.1 (Markov brothers' inequality for first derivatives).
For any polynomial $q \in \mathcal{P}_n$,

$$ \max_{x\in[a,b]} |q'(x)| \le \frac{2n^2}{b-a} \max_{x\in[a,b]} |q(x)|. $$

Margin note: Lemma 1.1 was proved by Andrey Markov in 1889, generalizing a result
for $n = 2$ that was obtained by the famous chemist Mendeleev in his
research on specific gravity. Markov's younger brother Vladimir extended
it to higher derivatives (with a more complicated right-hand side) in 1892.
The interesting history of this inequality (and extensions into the complex plane)
is recounted in a paper by Ralph Boas, Jr. on 'Inequalities for the derivatives
of polynomials,' Math. Magazine 42 (4) 1969, 165–174. The result is called
the 'Markov brothers' inequality' to distinguish it from the more famous
'Markov's inequality' in probability theory (named, like 'Markov chains,' for
Andrey; Vladimir died of tuberculosis at the age of 25 in 1897).

We can thus summarize our discussion as the following theorem,
an analogue of Theorem 1.3.

Theorem 1.6 (Bound on the derivative of an interpolant).


Suppose $f \in C^{n+1}[a,b]$ and let $p_n \in \mathcal{P}_n$ denote the polynomial that
interpolates $\{(x_j, f(x_j))\}_{j=0}^{n}$ at distinct points $x_j \in [a,b]$, $j = 0, \ldots, n$.
Then for every $x_k \in \{x_0, \ldots, x_n\}$, there exists some $\xi \in [a,b]$ such that

$$ f'(x_k) - p_n'(x_k) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\, w'(x_k), $$

where $w(x) = \prod_{j=0}^{n} (x - x_j)$. From this formula follows the bound

$$ |f'(x_k) - p_n'(x_k)| \le \frac{2n^2}{b-a} \left( \max_{\xi\in[a,b]} \frac{|f^{(n+1)}(\xi)|}{(n+1)!} \right) \left( \max_{x\in[a,b]} \prod_{j=0}^{n} |x - x_j| \right). \tag{1.11} $$

Margin note: Why don't we simply 'take a derivative of Theorem 1.3'? The subtlety
is the $f^{(n+1)}(\xi)$ term in Theorem 1.3. Since $\xi$ depends on $x$, taking the
derivative of $f^{(n+1)}(\xi(x))$ via the chain rule would require explicit knowledge
of $\xi(x)$. We don't want to work out a formula for $\xi(x)$ for each $f$ and
interval $[a,b]$.

Contrast the bound (1.11) with (1.7) from Theorem 1.3: the bounds
are the same, aside from the leading constant $2n^2/(b-a)$ inherited
from Lemma 1.1.

For our later discussion it will help to get a rough bound for the
case where the interpolation points are uniformly distributed, i.e.,

$$ x_j = a + jh, \qquad j = 0, \ldots, n $$

with spacing equal to $h := (b-a)/n$. We seek to bound

$$ \max_{x\in[a,b]} \prod_{j=0}^{n} |x - x_j|, $$

i.e., maximize the product of the distances of $x$ from each of the
interpolation points. Consider the sketch in the margin. Think about
how you would place $x \in [x_0, x_n]$ so as to make $\prod_{j=0}^{n} |x - x_j|$ as large
as possible. Putting $x$ somewhere toward the ends, but not too near
one of the interpolation points, will maximize the product. Convince
yourself that, regardless of where $x$ is placed within $[x_0, x_n]$:

• at least one interpolation point is no more than $h/2$ away from $x$;
• a different interpolation point is no more than $h$ away from $x$;
• a different interpolation point is no more than $2h$ away from $x$;
  ⋮
• the last remaining (farthest) interpolation point is no more than
  $nh = b - a$ away from $x$.

Margin sketch: the points $x_0, x_1, \ldots, x_5$ spaced $h$ apart ($n = 5$), with a plot of
$w(x)$ for $x \in [0,1]$. Notice that for $n = 5$ uniformly spaced points on $[0,1]$,
$w(x)$ takes its maximum magnitude between the two interpolation points on
each end of the domain.

This reasoning gives the bound

$$ \max_{x\in[a,b]} \prod_{j=0}^{n} |x - x_j| \le \frac{h}{2} \cdot h \cdot 2h \cdots nh = \frac{h^{n+1}\, n!}{2}. \tag{1.12} $$

Substituting this into (1.11) and using $b - a = nh$ gives the following
result.
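The bound (1.12) and the claim about where $|w(x)|$ peaks can be checked by brute force. The sketch below is only an illustration: the function name `max_abs_w_uniform`, the choice $[a,b] = [0,1]$, and the sampling grid are assumptions made here, and the maximum is estimated by sampling rather than computed exactly.

```python
import math

def max_abs_w_uniform(n, a=0.0, b=1.0, m=20001):
    """Return (sampled max of |w|, a maximizing grid point, spacing h) for uniform nodes."""
    h = (b - a) / n
    nodes = [a + j * h for j in range(n + 1)]
    best, arg = 0.0, a
    for i in range(m):
        x = a + (b - a) * i / (m - 1)
        p = 1.0
        for xj in nodes:
            p *= (x - xj)
        if abs(p) > best:
            best, arg = abs(p), x
    return best, arg, h

checks = {}
for n in range(2, 9):
    best, arg, h = max_abs_w_uniform(n)
    bound = h ** (n + 1) * math.factorial(n) / 2      # right-hand side of (1.12)
    checks[n] = (best, arg, h, bound)
    print(f"n = {n}:  max|w| ~ {best:.3e} at x ~ {arg:.3f},  bound (1.12) = {bound:.3e}")
```

In each case the sampled maximum sits inside the first or last subinterval, and the bound (1.12) holds with room to spare.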
Corollary 1.1 (The derivative of an interpolant at equispaced points).

Suppose $f \in C^{n+1}[a,b]$ and let $p_n \in \mathcal{P}_n$ denote the polynomial
that interpolates $\{(x_j, f(x_j))\}_{j=0}^{n}$ at equispaced points $x_j = a + jh$ for
$h = (b-a)/n$. Then for every $x_k \in \{x_0, \ldots, x_n\}$,

$$ |f'(x_k) - p_n'(x_k)| \le \frac{n h^n}{n+1} \left( \max_{\xi\in[a,b]} |f^{(n+1)}(\xi)| \right). \tag{1.13} $$

1.7.2 Finite difference formulas


The preceding analysis was toward a very specific purpose: to use
interpolating polynomials to develop formulas that approximate
derivatives of f from the value of f at a few points.

Example 1.3 (First derivative). We begin with the simplest case:
formulas for the first derivative $f'(x)$. Pick some value for $x_0$ and
some spacing parameter $h > 0$.

First construct the linear interpolant to $f$ at $x_0$ and $x_1 = x_0 + h$.
Using the Newton form, we have

$$ p_1(x) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0} (x - x_0) = f(x_0) + \frac{f(x_1) - f(x_0)}{h} (x - x_0). $$

Take a derivative of the interpolant:

$$ p_1'(x) = \frac{f(x_1) - f(x_0)}{h}, \tag{1.14} $$

which is precisely the conventional definition of the derivative, if
we take the limit $h \to 0$. But how accurate an approximation is
it? Appealing to Corollary 1.1 with $n = 1$ and $[a,b] = [x_0, x_1] = x_0 + [0,h]$, we have

$$ |f'(x_k) - p_1'(x_k)| \le \frac{1}{2} \left( \max_{\xi\in[x_0,x_1]} |f''(\xi)| \right) h. \tag{1.15} $$

Does the bound (1.15) improve if we use a quadratic interpolant to
$f$ through $x_0$, $x_1 = x_0 + h$ and $x_2 = x_0 + 2h$? Again using the Newton
form, write

$$ p_2(x) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0} (x - x_0) + \frac{f(x_2) - f(x_0) - \dfrac{f(x_1) - f(x_0)}{x_1 - x_0} (x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)} (x - x_0)(x - x_1) $$

$$ \phantom{p_2(x)} = f(x_0) + \frac{f(x_1) - f(x_0)}{h} (x - x_0) + \frac{f(x_0) - 2f(x_1) + f(x_2)}{2h^2} (x - x_0)(x - x_1). \tag{1.16} $$

Taking a derivative of this interpolant with respect to $x$ gives

$$ p_2'(x) = \frac{f(x_1) - f(x_0)}{h} + \frac{f(x_0) - 2f(x_1) + f(x_2)}{2h^2} (2x - x_0 - x_1). $$

Evaluate this at $x = x_0$, $x = x_1$, and $x = x_2$ and simplify as much as
possible to get:

$$ p_2'(x_0) = \frac{-3f(x_0) + 4f(x_1) - f(x_2)}{2h} \tag{1.17} $$

$$ p_2'(x_1) = \frac{f(x_2) - f(x_0)}{2h} \tag{1.18} $$

$$ p_2'(x_2) = \frac{f(x_0) - 4f(x_1) + 3f(x_2)}{2h}. \tag{1.19} $$

These beautiful formulas are right-looking, central, and left-looking
approximations to $f'$. Though we used an interpolating polynomial
to derive these formulas, those polynomials are now nowhere in
sight: they are merely the scaffolding that led to these formulas.

Margin note: These formulas can also be derived by strategically combining Taylor
expansions for $f(x+h)$ and $f(x-h)$. That is an easier route to simple formulas
like (1.18), but is less appealing when more sophisticated approximations
like (1.17) and (1.19) (and beyond) are needed.

How accurate are these formulas? Corollary 1.1 with $n = 2$ and
$[a,b] = [x_0, x_2] = x_0 + [0, 2h]$ gives

$$ |f'(x_k) - p_2'(x_k)| \le \frac{2}{3} \left( \max_{\xi\in[x_0,x_2]} |f'''(\xi)| \right) h^2. \tag{1.20} $$

Notice that these approximations indeed scale with $h^2$, rather than $h$,
and so the quadratic interpolant leads to a much better approximation
to $f'$, at the cost of evaluating $f$ at three points (for $f'(x_0)$ and
$f'(x_2)$), rather than two.
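The different scaling of (1.14) and (1.17)–(1.19) with $h$ shows up clearly in a small experiment. The sketch below is an illustration under assumptions chosen here, not taken from the notes: the test function $f(x) = \sin(x)$, the base point $x_0 = 1$, and the step sizes are arbitrary, and `errors` is a hypothetical helper name.

```python
import math

f, df = math.sin, math.cos   # test function and its exact derivative (assumed for this demo)
x0 = 1.0

def errors(h):
    """Absolute errors of formulas (1.14), (1.17), (1.18), (1.19) with spacing h."""
    x1, x2 = x0 + h, x0 + 2 * h
    forward = (f(x1) - f(x0)) / h                        # (1.14), approximates f'(x0)
    right = (-3 * f(x0) + 4 * f(x1) - f(x2)) / (2 * h)   # (1.17), approximates f'(x0)
    central = (f(x2) - f(x0)) / (2 * h)                  # (1.18), approximates f'(x1)
    left = (f(x0) - 4 * f(x1) + 3 * f(x2)) / (2 * h)     # (1.19), approximates f'(x2)
    return (abs(forward - df(x0)), abs(right - df(x0)),
            abs(central - df(x1)), abs(left - df(x2)))

h = 1e-2
e_h, e_half = errors(h), errors(h / 2)
ratios = [a / b for a, b in zip(e_h, e_half)]
for name, r in zip(("(1.14)", "(1.17)", "(1.18)", "(1.19)"), ratios):
    print(f"{name}: error ratio when h is halved ~ {r:.2f}")
```

Halving $h$ should roughly halve the error of (1.14) and roughly quarter the errors of the three quadratic-interpolant formulas, consistent with the bounds (1.15) and (1.20).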
Example 1.4 (Second derivative). While we have only proved a
bound for the error in the first derivative, $f'(x) - p'(x)$, you can
see that similar bounds should hold when higher derivatives of $p$
are used to approximate corresponding derivatives of $f$. Here we
illustrate with the second derivative.

Since $p_1$ is linear, $p_1''(x) = 0$ for all $x$, and the linear interpolant
will not lead to any meaningful bound on $f''(x)$. Thus, we focus on
the quadratic interpolant to $f$ at the three uniformly spaced points
$x_0$, $x_1$, and $x_2$. Take two derivatives of the formula (1.16) for $p_2(x)$ to
obtain

$$ p_2''(x) = \frac{f(x_0) - 2f(x_1) + f(x_2)}{h^2}, \tag{1.21} $$

which is a famous approximation to the second derivative that is
often used in the finite difference discretization of differential equations.
One can show that, like the approximations $p_2'(x_k)$, this formula
is accurate to order $h^2$ when it is used to approximate $f''$ at the
middle point $x_1$.
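One way to see the order-$h^2$ accuracy of (1.21) at the middle point $x_1$ is to combine Taylor expansions about $x_1$. The sketch below assumes $f \in C^4[x_0, x_2]$, a smoothness assumption added for this illustration; the points $\eta_\pm$ are the intermediate points from Taylor's theorem.

```latex
f(x_1 \pm h) = f(x_1) \pm h f'(x_1) + \frac{h^2}{2} f''(x_1) \pm \frac{h^3}{6} f'''(x_1)
               + \frac{h^4}{24} f^{(4)}(\eta_\pm),
\\[1ex]
\frac{f(x_0) - 2 f(x_1) + f(x_2)}{h^2}
   = f''(x_1) + \frac{h^2}{24} \Big( f^{(4)}(\eta_-) + f^{(4)}(\eta_+) \Big).
```

The odd-order terms cancel by symmetry, so the error is at most $(h^2/12) \max |f^{(4)}|$ over $[x_0, x_2]$.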

Example 1.5 (Mathematica code for computing difference formulas).


Code to follow. . . .
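The notes promise Mathematica code here, which is not reproduced in this section. As a stand-in, the following Python sketch (not the author's code) computes finite difference weights in exact rational arithmetic by differentiating the Lagrange basis polynomials for a given set of nodes. The function name `fd_weights` and its argument names are hypothetical.

```python
from fractions import Fraction

def fd_weights(nodes, xhat, m):
    """Weights c_j with sum_j c_j f(x_j) = p^(m)(xhat), where p interpolates f at the nodes."""
    nodes = [Fraction(x) for x in nodes]
    xhat = Fraction(xhat)
    weights = []
    for j, xj in enumerate(nodes):
        # Monomial coefficients (lowest degree first) of the Lagrange basis polynomial ell_j.
        coeffs = [Fraction(1)]
        for k, xk in enumerate(nodes):
            if k == j:
                continue
            scale = xj - xk
            nxt = [Fraction(0)] * (len(coeffs) + 1)
            for i, c in enumerate(coeffs):
                nxt[i + 1] += c / scale        # contribution of x * (current term)
                nxt[i] -= xk * c / scale       # contribution of -x_k * (current term)
            coeffs = nxt
        for _ in range(m):                     # differentiate m times
            coeffs = [i * c for i, c in enumerate(coeffs)][1:]
        value = Fraction(0)                    # evaluate at xhat by Horner's rule
        for c in reversed(coeffs):
            value = value * xhat + c
        weights.append(value)
    return weights

# Nodes x0, x1, x2 with h = 1; divide first-derivative weights by h and
# second-derivative weights by h^2 for a general spacing h.
print(fd_weights([0, 1, 2], 0, 1))   # compare with (1.17)
print(fd_weights([0, 1, 2], 1, 1))   # compare with (1.18)
print(fd_weights([0, 1, 2], 2, 1))   # compare with (1.19)
print(fd_weights([0, 1, 2], 1, 2))   # compare with (1.21)
```

Using `Fraction` keeps the weights exact, so they can be compared directly with the coefficients in (1.17)–(1.19) and (1.21). Passing integer or `Fraction` nodes avoids the rounding that floating point node values would introduce.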

lecture 5: Finite Difference Methods for Differential Equations

1.7.3 Application: Boundary Value Problems


Example 1.6 (Dirichlet boundary conditions). Suppose we want to
solve the differential equation

$$ -u''(x) = g(x), \qquad x \in [0,1] $$

for the unknown function $u$, subject to the Dirichlet boundary conditions

$$ u(0) = u(1) = 0. $$

One common approach to such problems is to approximate the
solution $u$ on a uniform grid of points

$$ 0 = x_0 < x_1 < \cdots < x_n = 1 $$

with $x_j = j/n$, so that the grid spacing is $h = 1/n$.

We seek to approximate the solution $u(x)$ at each of the grid
points $x_0, \ldots, x_n$. The Dirichlet boundary conditions give the end
values immediately:

$$ u(x_0) = 0, \qquad u(x_n) = 0. $$

At each of the interior grid points, we require a local approximation
of the equation

$$ -u''(x_j) = g(x_j), \qquad j = 1, \ldots, n-1. $$

For each, we will (implicitly) construct the quadratic interpolant $p_{2,j}$
to $u(x)$ at the points $x_{j-1}$, $x_j$, and $x_{j+1}$, and then approximate

$$ -p_{2,j}''(x_j) \approx -u''(x_j) = g(x_j). $$

Aside from some index shifting, we have already constructed $p_{2,j}''$ in
equation (1.21):

$$ p_{2,j}''(x) = \frac{u(x_{j-1}) - 2u(x_j) + u(x_{j+1})}{h^2}. \tag{1.22} $$

Just one small caveat remains: we cannot construct $p_{2,j}''(x)$, because we
do not know the values of $u(x_{j-1})$, $u(x_j)$, and $u(x_{j+1})$: finding those
values is the point of our entire endeavor. Thus we define approximate
values

$$ u_j \approx u(x_j), \qquad j = 1, \ldots, n-1, $$

and will instead use the polynomial $p_{2,j}$ that interpolates $u_{j-1}$, $u_j$,
and $u_{j+1}$, giving

$$ p_{2,j}''(x) = \frac{u_{j-1} - 2u_j + u_{j+1}}{h^2}. \tag{1.23} $$

Let us accumulate all our equations for $j = 0, \ldots, n$:

$$ \begin{aligned}
u_0 &= 0 \\
u_0 - 2u_1 + u_2 &= -h^2 g(x_1) \\
u_1 - 2u_2 + u_3 &= -h^2 g(x_2) \\
&\;\;\vdots \\
u_{n-3} - 2u_{n-2} + u_{n-1} &= -h^2 g(x_{n-2}) \\
u_{n-2} - 2u_{n-1} + u_n &= -h^2 g(x_{n-1}) \\
u_n &= 0.
\end{aligned} $$

Notice that this is a system of $n+1$ linear equations in $n+1$ variables
$u_0, \ldots, u_n$. Thus we can arrange this in matrix form as

$$ \begin{bmatrix}
1 & & & & & & \\
1 & -2 & 1 & & & & \\
& 1 & -2 & 1 & & & \\
& & \ddots & \ddots & \ddots & & \\
& & & 1 & -2 & 1 & \\
& & & & 1 & -2 & 1 \\
& & & & & & 1
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{n-2} \\ u_{n-1} \\ u_n \end{bmatrix}
=
\begin{bmatrix} 0 \\ -h^2 g(x_1) \\ -h^2 g(x_2) \\ \vdots \\ -h^2 g(x_{n-2}) \\ -h^2 g(x_{n-1}) \\ 0 \end{bmatrix},
\tag{1.24} $$

where the blank entries are zero. Notice that the first and last entries
are trivial: $u_0 = u_n = 0$, and so we can trim them off to yield the
slightly simpler matrix

$$ \begin{bmatrix}
-2 & 1 & & & \\
1 & -2 & 1 & & \\
& \ddots & \ddots & \ddots & \\
& & 1 & -2 & 1 \\
& & & 1 & -2
\end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-2} \\ u_{n-1} \end{bmatrix}
=
\begin{bmatrix} -h^2 g(x_1) \\ -h^2 g(x_2) \\ \vdots \\ -h^2 g(x_{n-2}) \\ -h^2 g(x_{n-1}) \end{bmatrix}.
\tag{1.25} $$

Solve this $(n-1) \times (n-1)$ linear system of equations using Gaussian
elimination. One can show that the approximate solution to the differential
equation inherits the accuracy of the interpolant: the error $|u(x_j) - u_j|$
behaves like $O(h^2)$ as $h \to 0$.

Margin note: Ideally, use an efficient version of Gaussian elimination that exploits
the banded structure of this matrix to give a solution in $O(n)$ operations.
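The system (1.25) and the $O(h^2)$ claim can be tried out directly. The sketch below is a minimal illustration, not the course's code: it solves the tridiagonal system by banded Gaussian elimination without pivoting, and it assumes a test problem chosen here, $g(x) = \pi^2 \sin(\pi x)$, whose exact solution $u(x) = \sin(\pi x)$ satisfies the Dirichlet conditions. The names `solve_dirichlet` and `max_error` are hypothetical.

```python
import math

def solve_dirichlet(g, n):
    """Solve (1.25) for -u'' = g, u(0) = u(1) = 0; returns (grid, [u_0, ..., u_n])."""
    h = 1.0 / n
    xs = [j * h for j in range(n + 1)]
    m = n - 1                                   # number of interior unknowns
    diag = [-2.0] * m                           # main diagonal; off-diagonals are all 1
    rhs = [-h * h * g(xs[j]) for j in range(1, n)]
    for i in range(1, m):                       # forward elimination, O(n) work
        factor = 1.0 / diag[i - 1]
        diag[i] -= factor
        rhs[i] -= factor * rhs[i - 1]
    u = [0.0] * m
    u[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):              # back substitution
        u[i] = (rhs[i] - u[i + 1]) / diag[i]
    return xs, [0.0] + u + [0.0]

def max_error(n):
    """Max grid error for the assumed test problem with exact solution sin(pi x)."""
    g = lambda x: math.pi ** 2 * math.sin(math.pi * x)
    xs, u = solve_dirichlet(g, n)
    return max(abs(math.sin(math.pi * x) - uj) for x, uj in zip(xs, u))

err16, err32 = max_error(16), max_error(32)
print(f"n = 16: {err16:.3e}   n = 32: {err32:.3e}   ratio ~ {err16 / err32:.2f}")
```

Doubling $n$ (halving $h$) should reduce the error by a factor close to four, consistent with $O(h^2)$ accuracy.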

Example 1.7 (Mixed boundary conditions). Modify the last example
to keep the same differential equation

$$ -u''(x) = g(x), \qquad x \in [0,1] $$

but now use mixed boundary conditions,

$$ u'(0) = u(1) = 0. $$

The derivative condition on the left is the only modification; we must
change the first row of equation (1.24) to encode this condition. One
might be tempted to use a simple linear interpolant to approximate
the boundary condition on the left side, adapting formula (1.14) to
give:

$$ \frac{u_1 - u_0}{h} = 0. \tag{1.26} $$

This equation makes intuitive sense: it forces $u_1 = u_0$, so the
approximate solution will have zero slope on its left end. This gives the
equation

$$ \begin{bmatrix}
-1 & 1 & & & & \\
1 & -2 & 1 & & & \\
& 1 & -2 & 1 & & \\
& & \ddots & \ddots & \ddots & \\
& & & 1 & -2 & 1 \\
& & & & 1 & -2
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{n-2} \\ u_{n-1} \end{bmatrix}
=
\begin{bmatrix} 0 \\ -h^2 g(x_1) \\ -h^2 g(x_2) \\ \vdots \\ -h^2 g(x_{n-2}) \\ -h^2 g(x_{n-1}) \end{bmatrix},
\tag{1.27} $$

where we have trimmed off the elements associated with $u_n = 0$.

The approximation of the second derivative (1.23) is accurate up to
$O(h^2)$, whereas the estimate (1.26) of $u'(0) = 0$ is only $O(h)$ accurate.
Will it matter if we compromise accuracy that little bit, if only in
one of the $n$ equations in (1.27)? What if instead we approximate
$u'(0) = 0$ to second-order accuracy?

Equations (1.17)–(1.19) provide three formulas that approximate
the first derivative to second order. Which one is appropriate in this
setting? The right-looking formula (1.17) gives the approximation

$$ u'(0) \approx \frac{-3u_0 + 4u_1 - u_2}{2h}, \tag{1.28} $$

which involves the variables $u_0$, $u_1$, and $u_2$ that we are already
considering. In contrast, the centered formula (1.18) needs an estimate of
$u(-h)$, and the left-looking formula (1.19) needs $u(-h)$ and $u(-2h)$.
Since these values of $u$ fall outside the domain $[0,1]$ of $u$, the centered
and left-looking formulas would not work.

Combining the right-looking formula (1.28) with the boundary
condition $u'(0) = 0$ gives

$$ \frac{-3u_0 + 4u_1 - u_2}{2h} = 0, $$

with which we replace the first row of (1.27) to obtain


    
−3 4 −1 u0 0
 1 −2 1   u1  − h2 g ( x1 )
    
 
− h2 g ( x2 )
    
 1 −2 1   u2   
(1.29)   .  = .
    
.. .. ..   .. 
 ..

 . . .  

 . 


1 − 2 1   u n −2 
   2
− h g ( x n −2 )

  
1 −2 u n −1 − h 2 g ( x n −1 )

Is this $O(h^2)$-accurate approach at the boundary worth the (rather minimal) extra effort? Let us investigate with an example. Set the right-hand side of the differential equation to

$$g(x) = \cos(\pi x/2),$$

which corresponds to the exact solution

$$u(x) = \frac{4}{\pi^2}\cos(\pi x/2).$$

Margin note: Verify that $u$ satisfies the boundary conditions $u'(0) = 0$ and $u(1) = 0$.

Figure 1.10 compares the solutions obtained by solving (1.27) and (1.29) with $n = 4$. Clearly, the simple adjustment that gave the $O(h^2)$ approximation to $u'(0) = 0$ makes quite a difference!

Margin note: Indeed, we used this small value of $n$ because it is difficult to see the difference between the exact solution and the approximation from (1.29) for larger $n$.

Figure 1.10: Approximate solutions to $-u''(x) = \cos(\pi x/2)$ with $u'(0) = u(1) = 0$, plotted as $u(x)$ against $x \in [0,1]$. The black curve shows $u(x)$. The red approximation (labeled $u_1 - u_0 = 0$) is obtained by solving (1.27), which uses the $O(h)$ approximation of $u'(0) = 0$; the blue approximation (labeled $-3u_0 + 4u_1 - u_2 = 0$) is from (1.29) with the $O(h^2)$ approximation of $u'(0) = 0$. Both approximations use $n = 4$.

This figure shows that the solutions from (1.27) and (1.29) differ, but plots like this are not the best way to understand how the approximations compare as $n \to \infty$. Instead, compute the maximum error at the interpolation points,

$$\max_{0 \le j \le n} |u(x_j) - u_j|$$

for various values of $n$. Figure 1.11 shows the results of such experiments for $n = 2^2, 2^3, \ldots, 2^{12}$. Notice that this figure is a ‘log-log’ plot; on such a scale, the errors fall on straight lines, and from the slope of these lines one can determine the order of convergence. The slope of the red curve is $-1$, so the approximations from (1.27) are $O(n^{-1}) = O(h)$ accurate. The slope of the blue curve is $-2$, so (1.29) gives an $O(n^{-2}) = O(h^2)$ accurate approximation.

Figure 1.11: Convergence of approximate solutions to $-u''(x) = \cos(\pi x/2)$ with $u'(0) = u(1) = 0$, plotted as $\max_{0\le j\le n}|u(x_j) - u_j|$ against $n$ on log-log axes. The red line (labeled $u_1 - u_0 = 0$) shows the approximation from (1.27); it converges like $O(h)$ as $h \to 0$. The blue line (labeled $-3u_0 + 4u_1 - u_2 = 0$) shows the approximation from (1.29), which converges like $O(h^2)$.

Margin note: How large would $n$ need to be, to get the same accuracy from the $O(h)$ approximation that was produced by the $O(h^2)$ approximation with $n = 2^{12} = 4096$? Extrapolation of the red curve suggests we would need roughly $n = 10^8$.

This example illustrates a general lesson: when constructing finite difference approximations to differential equations, one must ensure that the approximations to the boundary conditions have the same order of accuracy as the approximation of the differential equation itself. These formulas can be nicely constructed from derivatives of polynomial interpolants of appropriate degree.
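The experiment of Example 1.7 can be reproduced with a short script. The following is a minimal MATLAB sketch, assuming a uniform grid $x_j = jh$ with $h = 1/n$; the variable names (A1, A2, err1, err2, p1, p2) are illustrative choices and not taken from the notes, and the range of $n$ is shortened to keep the run fast.

```matlab
% Solve -u''(x) = cos(pi*x/2), u'(0) = u(1) = 0, by finite differences,
% comparing the O(h) boundary row of (1.27) with the O(h^2) row of (1.29).
g   = @(x) cos(pi*x/2);               % right-hand side
uex = @(x) (4/pi^2)*cos(pi*x/2);      % exact solution
nvals = 2.^(2:10);
err1 = zeros(size(nvals));  err2 = zeros(size(nvals));
for m = 1:length(nvals)
    n = nvals(m);  h = 1/n;
    x = (0:n-1)'*h;                   % unknowns u_0,...,u_{n-1}; u_n = 0 is trimmed
    e = ones(n,1);
    A = spdiags([e -2*e e], -1:1, n, n);   % interior rows: u_{j-1} - 2u_j + u_{j+1}
    rhs = -h^2*g(x);  rhs(1) = 0;     % first equation encodes u'(0) = 0
    A1 = A;  A1(1,1:2) = [-1 1];      % first-order row:   u_1 - u_0 = 0
    A2 = A;  A2(1,1:3) = [-3 4 -1];   % second-order row: -3u_0 + 4u_1 - u_2 = 0
    err1(m) = max(abs(A1\rhs - uex(x)));
    err2(m) = max(abs(A2\rhs - uex(x)));
end
% observed orders of convergence: negated slopes of the log-log error curves
c1 = polyfit(log(nvals), log(err1), 1);  p1 = -c1(1);
c2 = polyfit(log(nvals), log(err2), 1);  p2 = -c2(1);
fprintf('system (1.27): observed order %.2f\n', p1);
fprintf('system (1.29): observed order %.2f\n', p2);
```

Plotting err1 and err2 against nvals with loglog should give a picture in the spirit of Figure 1.11.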

lecture 6: Interpolating Derivatives

1.8 Hermite Interpolation and Generalizations

Example 1.1 demonstrated that polynomial interpolants to $\sin(x)$ attain arbitrary accuracy for $x \in [-5,5]$ as the polynomial degree increases, even if the interpolation points are taken exclusively from $[-1,1]$. In fact, as $n \to \infty$ interpolants based on data from $[-1,1]$ will converge to $\sin(x)$ for all $x \in \mathbb{R}$. More precisely, for any $x \in \mathbb{R}$ and any $\varepsilon > 0$, there exists some positive integer $N$ such that $|\sin(x) - p_n(x)| < \varepsilon$ for all $n \ge N$, where $p_n$ interpolates $\sin(x)$ at $n+1$ uniformly-spaced interpolation points in $[-1,1]$.

This is not as surprising as it might first appear. The Taylor series expansion uses derivative information at a single point to produce a polynomial approximation of $f$ that is accurate at nearby points. Indeed, the interpolation error bound derived in the previous lecture bears close resemblance to the remainder term in the Taylor series. If $f \in C^{(n+1)}[a,b]$, then expanding $f$ at $x_0 \in (a,b)$, we have

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - x_0)^{n+1}$$

for some $\xi \in [x, x_0]$ that depends on $x$.

Margin note: This is the Lagrange form of the error.

The first sum is simply a degree $n$ polynomial in $x$; from the final term, the Taylor remainder, we obtain the bound

$$\max_{x \in [a,b]} \left| f(x) - \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k \right| \le \left( \max_{\xi \in [a,b]} \frac{|f^{(n+1)}(\xi)|}{(n+1)!} \right) \left( \max_{x \in [a,b]} |x - x_0|^{n+1} \right),$$

which should certainly remind you of the interpolation error formula in Theorem 1.3.
One can view polynomial interpolation and Taylor series as two
extreme approaches to approximating f : one uses global information,
but only about f ; the other uses only local information, but requires
extensive knowledge of the derivatives of f . In this section we shall
discuss an alternative based on the best features of each of these
ideas: use global information about both f and its derivatives.

1.8.1 Hermite interpolation


In cases where the polynomial interpolants of the previous sections incurred large errors for some $x \in [a,b]$, one typically observes that the slope of the interpolant differs markedly from that of $f$ at some of the interpolation points $\{x_j\}$. (Recall Runge’s example in Figure 1.8.) Why not then force the interpolant to match both $f$ and $f'$ at the interpolation points?

Often the underlying application provides a motivation for such derivative matching. For example, if the interpolant approximates the position of a particle moving in space, we might wish the interpolant to match not only position, but also velocity. Hermite interpolation is the general procedure for constructing such interpolants.

Margin note: Typically the position of a particle is given in terms of a second-order differential equation (in classical mechanics, arising from Newton’s second law, $F = ma$). To solve this second-order ODE, one usually writes it as a system of first-order equations whose numerical solution we will study later in the semester. One component of the system is position, the other is velocity, and so one automatically obtains values for both $f$ (position) and $f'$ (velocity) simultaneously.

Given $f \in C^1[a,b]$ and $n+1$ points $\{x_j\}_{j=0}^n$ satisfying

$$a \le x_0 < x_1 < \cdots < x_n \le b,$$

determine some $h_n \in \mathcal{P}_{2n+1}$ such that

$$h_n(x_j) = f(x_j), \qquad h_n'(x_j) = f'(x_j) \qquad \text{for } j = 0, \ldots, n.$$

Note that $h_n$ must generally be a polynomial of degree $2n+1$ to have sufficiently many degrees of freedom to satisfy the $2n+2$ constraints. We begin by addressing the existence and uniqueness of this interpolant.

Existence is best addressed by explicitly constructing the desired polynomial. We adopt a variation of the Lagrange approach used in Section 1.5. We seek degree-$(2n+1)$ polynomials $\{A_k\}_{k=0}^n$ and $\{B_k\}_{k=0}^n$ such that

$$A_k(x_j) = \begin{cases} 0, & j \ne k, \\ 1, & j = k, \end{cases} \qquad A_k'(x_j) = 0 \quad \text{for } j = 0, \ldots, n;$$

$$B_k(x_j) = 0 \quad \text{for } j = 0, \ldots, n, \qquad B_k'(x_j) = \begin{cases} 0, & j \ne k, \\ 1, & j = k. \end{cases}$$

These polynomials would yield a basis for $\mathcal{P}_{2n+1}$ in which $h_n$ has a simple expansion:

$$h_n(x) = \sum_{k=0}^{n} f(x_k) A_k(x) + \sum_{k=0}^{n} f'(x_k) B_k(x). \tag{1.30}$$

To show how we can construct the polynomials $A_k$ and $B_k$, we recall the Lagrange basis polynomials used for the standard interpolation problem,

$$\ell_k(x) = \prod_{j=0,\, j \ne k}^{n} \frac{(x - x_j)}{(x_k - x_j)}.$$

Consider the definitions

$$A_k(x) := \bigl(1 - 2(x - x_k)\,\ell_k'(x_k)\bigr)\,\ell_k^2(x),$$

$$B_k(x) := (x - x_k)\,\ell_k^2(x).$$

Note that since $\ell_k \in \mathcal{P}_n$, we have $A_k, B_k \in \mathcal{P}_{2n+1}$. Figure 1.12 shows these Hermite basis polynomials and their derivatives for $n = 5$ using uniformly spaced points on $[0,1]$. It is a straightforward exercise to verify that these $A_k$ and $B_k$, and their first derivatives, obtain the specified values at $\{x_j\}_{j=0}^n$.

Figure 1.12: The Hermite basis polynomials for $n = 5$ on the interval $[a,b] = [0,1]$ with $x_j = j/5$ (black dots). Four panels, each plotted against $x \in [0,1]$: $A_j(x)$, with $A_j(x_j) = 1$ (black circles); $B_j(x)$, with $B_j(x_k) = 0$ for all $j, k$; $A_j'(x)$, with $A_j'(x_k) = 0$ for all $j, k$; $B_j'(x)$, with $B_j'(x_j) = 1$ (black circles).

These interpolation conditions at the points $\{x_j\}$ ensure that the $2n+2$ polynomials $\{A_k, B_k\}_{k=0}^n$, each of degree $2n+1$, form a basis for $\mathcal{P}_{2n+1}$, and thus we can always write $h_n$ via the formula (1.30).
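The construction of $A_k$ and $B_k$ translates directly into code. The following is a minimal MATLAB sketch, assuming the test function $f(x) = \sin(20x) + e^{5x/2}$ with $n = 5$ uniformly spaced points; it relies on the identity $\ell_k'(x_k) = \sum_{j \ne k} 1/(x_k - x_j)$, which follows from differentiating the product formula for $\ell_k$ but is not stated in the notes. Variable names are illustrative.

```matlab
% Hermite interpolation via the basis A_k, B_k built from Lagrange polynomials.
f  = @(x) sin(20*x) + exp(5*x/2);
fp = @(x) 20*cos(20*x) + (5/2)*exp(5*x/2);    % derivative of f
n  = 5;
xj = (0:n)'/n;                        % n+1 uniformly spaced nodes on [0,1]
xx = linspace(0,1,501)';              % evaluation grid (contains the nodes)
hn = zeros(size(xx));
for k = 1:n+1
    others = xj([1:k-1, k+1:n+1]);    % nodes x_j with j ~= k
    ell = ones(size(xx));             % ell_k evaluated on xx
    for j = 1:n
        ell = ell .* (xx - others(j)) / (xj(k) - others(j));
    end
    dell = sum(1 ./ (xj(k) - others));           % ell_k'(x_k)
    Ak = (1 - 2*(xx - xj(k))*dell) .* ell.^2;
    Bk = (xx - xj(k)) .* ell.^2;
    hn = hn + f(xj(k))*Ak + fp(xj(k))*Bk;        % formula (1.30)
end
fprintf('max |f - h_n| on [0,1]: %.3e\n', max(abs(f(xx) - hn)));
```

This direct evaluation costs $O(n^2)$ work per basis sweep and is meant only to mirror the formulas; it is not tuned for large $n$.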
Figure 1.13 compares the standard polynomial interpolant $p_n \in \mathcal{P}_n$ to the Hermite interpolant $h_n \in \mathcal{P}_{2n+1}$ and the standard interpolant of the same degree, $p_{2n+1} \in \mathcal{P}_{2n+1}$, for the example $f(x) = \sin(20x) + e^{5x/2}$ using uniformly spaced points on $[0,1]$ with $n = 5$. Note the distinction between $h_n$ and $p_{2n+1}$, which are both polynomials of the same degree.

Here are a couple of basic results whose proofs follow the same techniques as the analogous proofs for the standard interpolation problem.

Theorem 1.7. The Hermite interpolant $h_n \in \mathcal{P}_{2n+1}$ is unique.

Margin note: The uniqueness result hinges on the fact that we interpolate $f$ and $f'$ both at all interpolation points. If we vary the number of derivatives interpolated at each data point, we open the possibility of non-unique interpolants.

Theorem 1.8. Suppose $f \in C^{2n+2}[x_0, x_n]$ and let $h_n \in \mathcal{P}_{2n+1}$ such that $h_n(x_j) = f(x_j)$ and $h_n'(x_j) = f'(x_j)$ for $j = 0, \ldots, n$. Then for any $x \in [x_0, x_n]$, there exists some $\xi \in [x_0, x_n]$ such that

$$f(x) - h_n(x) = \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \prod_{j=0}^{n} (x - x_j)^2.$$

The proof of this latter result is directly analogous to the standard polynomial interpolation error in Theorem 1.3. Think about how you would prove this result for yourself.

Margin note: Hint: the proof has some resemblance to our proof of Theorem 1.6. Invoke Rolle’s theorem to get $n$ roots of a certain function, then use the derivative interpolation to get another $n+1$ roots.

1.8.2 Hermite–Birkhoff interpolation

Of course, one need not stop at interpolating $f$ and $f'$. Perhaps your application has more general requirements, where you want to interpolate higher derivatives, too, or have the number of derivatives interpolated differ at each interpolation point. Such general polynomials are called Hermite–Birkhoff interpolants, and you already have the tools at your disposal to compute them. Simply formulate the problem as a linear system and find the desired coefficients, but beware that in some situations, there may be infinitely many polynomials that satisfy the interpolation conditions. For these problems, it is generally simplest to work with the monomial basis, though one could design Newton- or Lagrange-inspired bases for particular situations.

Margin note: For example, suppose you seek an interpolant that is particularly accurate in the vicinity of one of the interpolation points, and so you wish to interpolate higher derivatives at that point: a hybrid between an interpolating polynomial and a Taylor expansion.
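As one illustration of formulating a Hermite–Birkhoff problem as a linear system in the monomial basis, the sketch below builds a cubic that matches $f$, $f'$, and $f''$ at $x = 0$ and $f$ alone at $x = 1$. The choice $f(x) = e^x$, the particular set of conditions, and the names M, rhs, c, p are assumptions made for this example, not taken from the notes.

```matlab
% Hermite-Birkhoff sketch: find p(x) = c0 + c1*x + c2*x^2 + c3*x^3 with
%   p(0) = f(0), p'(0) = f'(0), p''(0) = f''(0), p(1) = f(1),  for f(x) = exp(x).
% Each row applies one condition to the monomials 1, x, x^2, x^3.
M = [ 1 0 0 0 ;      % p(0)
      0 1 0 0 ;      % p'(0)
      0 0 2 0 ;      % p''(0)
      1 1 1 1 ];     % p(1)
rhs = [ 1; 1; 1; exp(1) ];
if rank(M) < size(M,2)
    % a rank-deficient system signals non-unique (or nonexistent) interpolants
    warning('interpolation conditions do not determine p uniquely');
end
c = M \ rhs;                          % coefficients c0,...,c3
p = @(x) c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3;
fprintf('c = [%g %g %g %g], p(1) - e = %.2e\n', c, p(1) - exp(1));
```

The first three coefficients reproduce the Taylor expansion of $e^x$ at $0$, while the cubic coefficient is adjusted so that $p(1) = e$.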

Figure 1.13: Interpolation of $f(x) = \sin(20x) + e^{5x/2}$ at uniformly spaced points for $x \in [0,1]$. Top plot: the standard polynomial interpolant $p_5 \in \mathcal{P}_5$. Middle plot: the Hermite interpolant $h_5 \in \mathcal{P}_{11}$. Bottom plot: the standard interpolant $p_{11} \in \mathcal{P}_{11}$. Though the last two plots show polynomials of the same degree, notice how the interpolants differ. (At first glance it appears the Hermite interpolation condition fails at the rightmost point in the middle plot; zoom in to see that the slope of the interpolant indeed matches $f'(1)$.)

1.8.3 Hermite–Fejér interpolation


Another Hermite-like scheme initially sounds like a bad idea: make the interpolant have zero slope at all the interpolation points.

Given $f \in C^1[a,b]$ and $n+1$ points $\{x_j\}_{j=0}^n$ satisfying

$$a \le x_0 < x_1 < \cdots < x_n \le b,$$

determine some $h_n \in \mathcal{P}_{2n+1}$ such that

$$h_n(x_j) = f(x_j), \qquad h_n'(x_j) = 0 \qquad \text{for } j = 0, \ldots, n.$$

That is, explicitly construct $h_n$ such that its derivatives in general do not match those of $f$. This method, called Hermite–Fejér interpolation, turns out to be remarkably effective, even better than standard Hermite interpolation in certain circumstances. In fact, Fejér proved that if we choose the interpolation points $\{x_j\}$ in the right way, $h_n$ is guaranteed to converge to $f$ uniformly as $n \to \infty$.

Theorem 1.9. For each $n \ge 1$, let $h_n$ be the Hermite–Fejér interpolant of $f \in C[a,b]$ at the Chebyshev interpolation points

$$x_j = \frac{a+b}{2} + \frac{b-a}{2}\cos\left(\frac{(2j+1)\pi}{2n+2}\right), \qquad j = 0, \ldots, n.$$

Then $h_n(x)$ converges uniformly to $f$ on $[a,b]$.

Margin note: For a proof of Theorem 1.9, see page 57 of I. P. Natanson, Constructive Function Theory, vol. 3 (Ungar, 1965).
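Since the Hermite–Fejér conditions set every derivative value to zero, only the $A_k$ terms of the Hermite expansion $h_n(x) = \sum_k f(x_k) A_k(x) + \sum_k f'(x_k) B_k(x)$ survive. The following minimal MATLAB sketch evaluates the resulting interpolant at the Chebyshev points of Theorem 1.9; the test function $f(x) = |x|$, the value $n = 16$, and the variable names are assumptions made for illustration.

```matlab
% Hermite-Fejer interpolation of f(x) = |x| on [a,b] = [-1,1] at Chebyshev points:
% h_n(x_j) = f(x_j) and h_n'(x_j) = 0, so h_n = sum_k f(x_k) A_k(x).
f = @(x) abs(x);
a = -1;  b = 1;  n = 16;
j  = (0:n)';
xj = (a+b)/2 + (b-a)/2*cos((2*j+1)*pi/(2*n+2));   % Chebyshev points
xx = linspace(a,b,1001)';
hn = zeros(size(xx));
for k = 1:n+1
    others = xj([1:k-1, k+1:n+1]);                % nodes other than x_k
    ell = ones(size(xx));
    for m = 1:n
        ell = ell .* (xx - others(m)) / (xj(k) - others(m));
    end
    dell = sum(1 ./ (xj(k) - others));            % ell_k'(x_k)
    hn = hn + f(xj(k)) * (1 - 2*(xx - xj(k))*dell) .* ell.^2;   % f(x_k)*A_k(x)
end
err = max(abs(f(xx) - hn));
fprintf('n = %d: max |f - h_n| = %.3e\n', n, err);
```

Rerunning with larger $n$ gives a rough empirical view of the uniform convergence the theorem guarantees, though this direct evaluation is not designed for very large $n$.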

lecture 7: Trigonometric Interpolation

1.9 Trigonometric interpolation for periodic functions

Thus far all our interpolation schemes have been based on polynomials. However, if the function $f$ is periodic, one might naturally prefer to interpolate $f$ with some set of periodic functions.

To be concrete, suppose we have a continuous $2\pi$-periodic function $f$ that we wish to interpolate at the uniformly spaced points $x_k = 2\pi k/n$ for $k = 0, \ldots, n$ with $n = 5$. We shall build an interpolant as a linear combination of the $2\pi$-periodic functions

$$b_0(x) = 1, \quad b_1(x) = \sin(x), \quad b_2(x) = \cos(x), \quad b_3(x) = \sin(2x), \quad b_4(x) = \cos(2x).$$

Margin note: ‘$2\pi$-periodic’ means that $f$ is continuous throughout $\mathbb{R}$ and $f(x) = f(x + 2\pi)$ for all $x \in \mathbb{R}$. The choice of period $2\pi$ makes the notation a bit simpler, but the idea can be easily adapted for any period.

Note that we have six interpolation conditions at $x_k$ for $k = 0, \ldots, 5$, but only five basis functions. This is not a problem: since $f$ is periodic, $f(x_0) = f(x_5)$, and the same will be true of our $2\pi$-periodic interpolant: the last interpolation condition is automatically satisfied. We shall construct an interpolant of the form

$$t_5(x) = \sum_{k=0}^{4} c_k b_k(x)$$

such that

$$t_5(x_j) = f(x_j), \qquad j = 0, \ldots, 4.$$
To compute the unknown coefficients $c_0, \ldots, c_4$, set up a linear system as usual,

$$\begin{bmatrix}
b_0(x_0) & b_1(x_0) & b_2(x_0) & b_3(x_0) & b_4(x_0) \\
b_0(x_1) & b_1(x_1) & b_2(x_1) & b_3(x_1) & b_4(x_1) \\
b_0(x_2) & b_1(x_2) & b_2(x_2) & b_3(x_2) & b_4(x_2) \\
b_0(x_3) & b_1(x_3) & b_2(x_3) & b_3(x_3) & b_4(x_3) \\
b_0(x_4) & b_1(x_4) & b_2(x_4) & b_3(x_4) & b_4(x_4)
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{bmatrix},$$

which can be readily generalized to accommodate more interpolation points.

Margin note: You would add $b_5(x) = \sin(3x)$, $b_6(x) = \cos(3x)$, etc.: one function for each additional interpolation point. Generally you would use an odd value of $n$, to include pairs of sines and cosines.

We could solve this system for $c_0, \ldots, c_4$, but we prefer to express the problem in a more convenient basis for the trigonometric functions. Recall Euler’s formula,

$$e^{i\theta x} = \cos(\theta x) + i\sin(\theta x),$$

which also implies that

$$e^{-i\theta x} = \cos(\theta x) - i\sin(\theta x).$$

Margin note: To prove this, write the Taylor expansion of $e^{i\theta x}$, then separate the real and imaginary components to give Taylor expansions for $\cos(\theta x)$ and $\sin(\theta x)$.

From these formulas it follows that

$$\operatorname{span}\{e^{i\theta x}, e^{-i\theta x}\} = \operatorname{span}\{\cos(\theta x), \sin(\theta x)\}.$$

Note that we can also write $b_0(x) \equiv 1 = e^{i0x}$. Putting these pieces together, we arrive at an alternative basis for the trigonometric interpolation space:

$$\operatorname{span}\{1, \sin(x), \cos(x), \sin(2x), \cos(2x)\} = \operatorname{span}\{e^{-2ix}, e^{-ix}, e^{0ix}, e^{ix}, e^{2ix}\}.$$

The interpolant $t_5$ can thus be expressed in the form

$$t_5(x) = \sum_{k=-2}^{2} \gamma_k e^{ikx} = \sum_{k=-2}^{2} \gamma_k (e^{ix})^k.$$

This last sum is written in a manner that emphasizes that $t_5$ is a polynomial in the variable $e^{ix}$, and hence $t_n$ is a trigonometric polynomial. In this basis, the interpolation conditions give the linear system

$$\begin{bmatrix}
e^{-2ix_0} & e^{-ix_0} & e^{0ix_0} & e^{ix_0} & e^{2ix_0} \\
e^{-2ix_1} & e^{-ix_1} & e^{0ix_1} & e^{ix_1} & e^{2ix_1} \\
e^{-2ix_2} & e^{-ix_2} & e^{0ix_2} & e^{ix_2} & e^{2ix_2} \\
e^{-2ix_3} & e^{-ix_3} & e^{0ix_3} & e^{ix_3} & e^{2ix_3} \\
e^{-2ix_4} & e^{-ix_4} & e^{0ix_4} & e^{ix_4} & e^{2ix_4}
\end{bmatrix}
\begin{bmatrix} \gamma_{-2} \\ \gamma_{-1} \\ \gamma_0 \\ \gamma_1 \\ \gamma_2 \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{bmatrix},$$

again with the natural generalization to larger odd integers $n$. At first blush this matrix looks no simpler than the one we first encountered, but a fascinating structure lurks. Notice that a generic entry of this matrix has the form $e^{\ell i x_k}$ for $\ell = -(n-1)/2, \ldots, (n-1)/2$ and $k = 0, \ldots, n-1$. Since $x_k = 2\pi k/n$, rewrite this entry as

$$e^{\ell i x_k} = \bigl(e^{i x_k}\bigr)^{\ell} = \bigl(e^{2\pi i k/n}\bigr)^{\ell} = \bigl(e^{2\pi i/n}\bigr)^{k\ell} = \omega^{k\ell},$$

where $\omega = e^{2\pi i/n}$ is an $n$th root of unity.

Margin note: This name comes from the fact that $\omega^n = 1$.

In the $n = 5$ case, the linear system can thus be written as

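$$\begin{bmatrix}
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\
\omega^{-2} & \omega^{-1} & \omega^{0} & \omega^{1} & \omega^{2} \\
\omega^{-4} & \omega^{-2} & \omega^{0} & \omega^{2} & \omega^{4} \\
\omega^{-6} & \omega^{-3} & \omega^{0} & \omega^{3} & \omega^{6} \\
\omega^{-8} & \omega^{-4} & \omega^{0} & \omega^{4} & \omega^{8}
\end{bmatrix}
\begin{bmatrix} \gamma_{-2} \\ \gamma_{-1} \\ \gamma_0 \\ \gamma_1 \\ \gamma_2 \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{bmatrix}.
\tag{1.31}$$

Denote this system by $\mathbf{F}\boldsymbol{\gamma} = \mathbf{f}$. Notice that each column of $\mathbf{F}$ equals some (entrywise) power of the vector

$$\begin{bmatrix} \omega^{0} \\ \omega^{1} \\ \omega^{2} \\ \omega^{3} \\ \omega^{4} \end{bmatrix}.$$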

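In other words, the matrix $\mathbf{F}$ has Vandermonde structure. From our past experience with polynomial fitting addressed in Section 1.2.1, we might fear that this formulation is ill-suited to numerical computations, i.e., solutions $\boldsymbol{\gamma}$ to the system $\mathbf{F}\boldsymbol{\gamma} = \mathbf{f}$ could be polluted by large numerical errors.

Margin note: In the language of numerical linear algebra, we might fear that the matrix $\mathbf{F}$ is ill-conditioned, i.e., the condition number $\|\mathbf{F}\|\|\mathbf{F}^{-1}\|$ is large.

Before jumping to this conclusion, examine $\mathbf{F}^*\mathbf{F}$. To form $\mathbf{F}^*$ note that $\omega^{-\ell} = \overline{\omega^{\ell}}$, so

$$\mathbf{F}^*\mathbf{F} =
\begin{bmatrix}
\omega^{0} & \omega^{2} & \omega^{4} & \omega^{6} & \omega^{8} \\
\omega^{0} & \omega^{1} & \omega^{2} & \omega^{3} & \omega^{4} \\
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\
\omega^{0} & \omega^{-1} & \omega^{-2} & \omega^{-3} & \omega^{-4} \\
\omega^{0} & \omega^{-2} & \omega^{-4} & \omega^{-6} & \omega^{-8}
\end{bmatrix}
\begin{bmatrix}
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\
\omega^{-2} & \omega^{-1} & \omega^{0} & \omega^{1} & \omega^{2} \\
\omega^{-4} & \omega^{-2} & \omega^{0} & \omega^{2} & \omega^{4} \\
\omega^{-6} & \omega^{-3} & \omega^{0} & \omega^{3} & \omega^{6} \\
\omega^{-8} & \omega^{-4} & \omega^{0} & \omega^{4} & \omega^{8}
\end{bmatrix}.$$

Margin note: $\mathbf{F}^*$ is the conjugate-transpose of $\mathbf{F}$: $\mathbf{F}^* = \overline{\mathbf{F}}^{\,T}$, so $(\mathbf{F}^*)_{j,k} = \overline{\mathbf{F}_{k,j}}$.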

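The $(\ell, k)$ entry for $\mathbf{F}^*\mathbf{F}$ thus takes the form

$$(\mathbf{F}^*\mathbf{F})_{\ell,k} = \omega^{0} + \omega^{(k-\ell)} + \omega^{2(k-\ell)} + \omega^{3(k-\ell)} + \omega^{4(k-\ell)}.$$

On the diagonal, when $\ell = k$, we simply have

$$(\mathbf{F}^*\mathbf{F})_{k,k} = \omega^{0} + \omega^{0} + \omega^{0} + \omega^{0} + \omega^{0} = n.$$

On the off-diagonal, use $\omega^n = 1$ to see that all the off-diagonal entries simplify to

$$(\mathbf{F}^*\mathbf{F})_{\ell,k} = \omega^{0} + \omega^{1} + \omega^{2} + \omega^{3} + \omega^{4}, \qquad \ell \ne k.$$

You can think of this last entry as $n$ times the average of $\omega^0$, $\omega^1$, $\omega^2$, $\omega^3$, and $\omega^4$, which are uniformly spaced points on the unit circle, shown in the plot to the right.

Margin plot: the points $\omega^0 = \omega^5$, $\omega^1$, $\omega^2$, $\omega^3$, and $\omega^4$ on the unit circle in the complex plane.

As these points are uniformly positioned about the unit circle, their mean must be zero, and hence

$$(\mathbf{F}^*\mathbf{F})_{\ell,k} = 0, \qquad \ell \ne k.$$

We thus must conclude that

$$\mathbf{F}^*\mathbf{F} = n\mathbf{I},$$

thus giving a formula for the inverse:

$$\mathbf{F}^{-1} = \frac{1}{n}\mathbf{F}^*.$$

The system $\mathbf{F}\boldsymbol{\gamma} = \mathbf{f}$ can be immediately solved without the need for any factorization of $\mathbf{F}$:

$$\boldsymbol{\gamma} = \frac{1}{n}\mathbf{F}^*\mathbf{f}.$$

Margin note: $\mathbf{Q} \in \mathbb{C}^{n \times n}$ is unitary if and only if $\mathbf{Q}^{-1} = \mathbf{Q}^*$, or, equivalently, $\mathbf{Q}^*\mathbf{Q} = \mathbf{I}$.

The ready formula for $\mathbf{F}^{-1}$ is reminiscent of a unitary matrix. In fact, the matrices

$$\frac{1}{\sqrt{n}}\mathbf{F} \quad \text{and} \quad \frac{1}{\sqrt{n}}\mathbf{F}^*$$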

are indeed unitary, and hence $\|n^{-1/2}\mathbf{F}\|_2 = \|n^{-1/2}\mathbf{F}^*\|_2 = 1$. From this we can compute the condition number of $\mathbf{F}$:

$$\|\mathbf{F}\|_2\|\mathbf{F}^{-1}\|_2 = \frac{1}{n}\|\mathbf{F}\|_2\|\mathbf{F}^*\|_2 = \|n^{-1/2}\mathbf{F}\|_2\,\|n^{-1/2}\mathbf{F}^*\|_2 = 1.$$

Margin note: The matrix 2-norm is defined as $\|\mathbf{F}\|_2 = \max_{\mathbf{x} \ne 0} \|\mathbf{F}\mathbf{x}\|_2 / \|\mathbf{x}\|_2$, where the vector norm on the right hand side is the Euclidean norm $\|\mathbf{y}\|_2 = \bigl(\sum_k |y_k|^2\bigr)^{1/2} = (\mathbf{y}^*\mathbf{y})^{1/2}$. The 2-norm of a unitary matrix is one: if $\mathbf{Q}^*\mathbf{Q} = \mathbf{I}$, then $\|\mathbf{Q}\mathbf{x}\|_2^2 = \mathbf{x}^*\mathbf{Q}^*\mathbf{Q}\mathbf{x} = \mathbf{x}^*\mathbf{x} = \|\mathbf{x}\|_2^2$, so $\|\mathbf{Q}\|_2 = 1$.

This special Vandermonde matrix is perfectly conditioned! One can easily solve the system $\mathbf{F}\boldsymbol{\gamma} = \mathbf{f}$ to high precision. The key distinction between this case and standard polynomial interpolation is that now we have a Vandermonde matrix based on points $e^{ix_k}$ that are equally spaced about the unit circle in the complex plane, whereas before our points were distributed over an interval on the real line. This distinction makes all the difference between an unstable matrix equation and one that is not only perfectly stable, but also forms the cornerstone of modern signal processing.
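These claims are easy to check numerically. The following is a minimal MATLAB sketch, assuming $n = 5$ and samples of $f(x) = e^{\cos(x)+\sin(2x)}$ (the function used in the later examples); the names omega, F, resid, kappa, and gam are illustrative.

```matlab
% Build the n = 5 matrix F with entries omega^(k*ell) and check that
% F'*F = n*I, cond(F) = 1, and gamma = F'*f/n solves F*gamma = f.
n = 5;
omega = exp(2i*pi/n);                 % nth root of unity
k   = (0:n-1)';                       % row index: points x_k = 2*pi*k/n
ell = -(n-1)/2 : (n-1)/2;             % column index: frequencies -2,...,2
F = omega.^(k*ell);                   % F(k,ell) = omega^(k*ell)
resid = norm(F'*F - n*eye(n));        % F' is the conjugate transpose
kappa = cond(F);                      % 2-norm condition number
xk = 2*pi*k/n;
f  = exp(cos(xk) + sin(2*xk));        % data vector of samples
gam = F'*f/n;                         % coefficients gamma_{-2},...,gamma_2
fprintf('||F''*F - n*I|| = %.1e, cond(F) = %.15f\n', resid, kappa);
fprintf('||F*gam - f|| = %.1e\n', norm(F*gam - f));
```

Up to rounding error, the residuals should vanish and the condition number should equal one.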
In fact, we have just computed the ‘Discrete Fourier Transform’ (DFT) of the data vector

$$\begin{bmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_{n-1}) \end{bmatrix}.$$

The coefficients $\gamma_{-(n-1)/2}, \ldots, \gamma_{(n-1)/2}$ that make up the vector

$$\boldsymbol{\gamma} = \frac{1}{n}\mathbf{F}^*\mathbf{f}$$

are the discrete Fourier coefficients of the data in $\mathbf{f}$. From where does this name derive?

1.9.1 Connection to Fourier series


In a real analysis course, one learns that a $2\pi$-periodic function $f$ can be written as the Fourier series

$$f(x) = \sum_{k=-\infty}^{\infty} c_k e^{ikx},$$

where the Fourier coefficients $c_k$ are defined via

$$c_k := \frac{1}{2\pi}\int_0^{2\pi} f(x)e^{-ikx}\,dx.$$

Margin note: To ensure pointwise convergence of this series for all $x \in [0, 2\pi]$, $f$ must be a continuous $2\pi$-periodic function with a continuous first derivative. The functions $e_k(x) = e^{ikx}/\sqrt{2\pi}$ form an orthonormal basis for the space $L^2[0, 2\pi]$ with the inner product $(f, g) = \int_0^{2\pi} f(x)\overline{g(x)}\,dx$. The Fourier series is simply an expansion of $f$ in this basis: $f = \sum_k (f, e_k)\,e_k$.

Notice that $\gamma_\ell = ((1/n)\mathbf{F}^*\mathbf{f})_\ell$ is an approximation to the Fourier coefficient $c_\ell$:

$$\gamma_\ell = \frac{1}{n}\sum_{k=0}^{n-1} f(x_k)\,\omega^{-k\ell} = \frac{1}{n}\sum_{k=0}^{n-1} f(x_k)\,e^{-(2\pi k/n)i\ell} = \frac{1}{n}\sum_{k=0}^{n-1} f(x_k)\,e^{-i\ell x_k}.$$

Now use the fact that $f(x_0)e^{-i\ell x_0} = f(x_n)e^{-i\ell x_n}$ to view the last sum as a composite trapezoid rule approximation of an integral:

$$\begin{aligned}
2\pi\gamma_\ell &= \frac{2\pi}{n}\left(\frac{1}{2}f(x_0)e^{-i\ell x_0} + \sum_{k=1}^{n-1} f(x_k)e^{-i\ell x_k} + \frac{1}{2}f(x_n)e^{-i\ell x_n}\right) \\
&\approx \int_0^{2\pi} f(x)e^{-i\ell x}\,dx \\
&= 2\pi c_\ell.
\end{aligned}$$

Margin note: The composite trapezoid rule will be discussed in Chapter 3.

The coefficient $\gamma_\ell$ that premultiplies $e^{i\ell x}$ in the trigonometric interpolating polynomial is actually an approximation of the Fourier coefficient $c_\ell$.
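To see how good this approximation can be for a smooth periodic function, the sketch below compares $\gamma_\ell$ with exact Fourier coefficients. It assumes the test function $f(x) = e^{\cos x}$, chosen because its Fourier coefficients are known in closed form, $c_\ell = I_\ell(1)$ (a modified Bessel function of the first kind); this identity is a standard fact that is not part of the notes, and the value $n = 21$ is an arbitrary choice.

```matlab
% Compare discrete Fourier coefficients gamma_l with exact Fourier coefficients
% c_l = besseli(l,1) for f(x) = exp(cos(x)).
n  = 21;                              % number of sample points (odd)
xk = (0:n-1)'*2*pi/n;                 % uniform points in [0, 2*pi)
f  = exp(cos(xk));
for l = 0:3
    gam = sum(f .* exp(-1i*l*xk)) / n;    % (1/n) * sum_k f(x_k) e^{-i l x_k}
    fprintf('l = %d: gamma = %.12f, c = %.12f\n', l, real(gam), besseli(l,1));
end
```

For less smooth $f$, the agreement between $\gamma_\ell$ and $c_\ell$ would be expected to degrade.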

Let us go one step further. Notice that the trigonometric interpolant

$$t_n(x) = \sum_{k=-(n-1)/2}^{(n-1)/2} \gamma_k e^{ikx}$$

is an approximation to the Fourier series

$$f(x) = \sum_{k=-\infty}^{\infty} c_k e^{ikx}$$

obtained by (1) truncating the series, and (2) replacing $c_k$ with $\gamma_k$. To assess the quality of the approximation, we need to understand the magnitude of the terms dropped from the sum, as well as the accuracy of the composite trapezoid rule approximation $\gamma_k$ to $c_k$. We will thus postpone discussion of $f(x) - t_n(x)$ until we develop a few more analytical tools in the next two chapters.

1.9.2 Computing the discrete Fourier coefficients


Normally we would require $O(n^2)$ operations to compute these coefficients using matrix-vector multiplication with $\mathbf{F}^*$, but Cooley and Tukey discovered in 1965 that given the amazing structure in $\mathbf{F}^*$, one can arrange operations so as to compute $\boldsymbol{\gamma} = n^{-1}\mathbf{F}^*\mathbf{f}$ in only $O(n \log n)$ operations: a procedure that we now famously call the Fast Fourier Transform (FFT).

Margin note: Apparently the FFT was discovered earlier by Gauss, but it was forgotten, given its limited utility before the advent of automatic computation. Jack Good (Bletchley Park codebreaker and, later, a Virginia Tech statistician) published a similar idea in 1958. Good recalls: ‘John Tukey (December 1956) and Richard L. Garwin (September 1957) visited Cheltenham and I had them round to steaks and fries on separate occasions. I told Tukey briefly about my FFT (with little detail) and, in Cooley and Tukey’s well known paper of 1965, my 1958 paper is the only citation.’ See D. L. Banks, ‘A conversation with I. J. Good,’ Stat. Sci. 11 (1996) 1–19.

We can summarize this section as follows.

The FFT of a vector of uniform samples of a $2\pi$-periodic function $f$ gives the coefficients for the trigonometric interpolant to $f$ at those sample points. These coefficients approximate the function’s Fourier coefficients.

Example 1.8 (Trig interpolation of a smooth periodic function). Figure 1.14 shows the degree $n = 5, 7, 9$ and $11$ trigonometric interpolants to the $2\pi$-periodic function $f(x) = e^{\cos(x)+\sin(2x)}$. Notice that although the interpolation points are all drawn from the interval $[0, 2\pi)$ (indicated by the gray region on the plot), the interpolants are just as accurate outside this region. In contrast, a standard polynomial fit through the same points will behave very differently: (non-constant) polynomials must satisfy $|p_n(x)| \to \infty$ as $|x| \to \infty$. Figure 1.15 shows this behavior for $n = 7$: for $x \in [0, 2\pi]$, the polynomial fit to $f$ is about as accurate as the $n = 7$ trigonometric polynomial in Figure 1.14. Outside of $[0, 2\pi]$, the polynomial is much worse.

Example 1.9 (Trig interpolation of non-smooth function). Figure 1.15 shows that a standard (non-periodic) polynomial fit to a periodic function can yield a good approximation, at least over the interval from which the interpolation points are drawn. Now turn the tables: how well does a (periodic) trigonometric polynomial fit a smooth but non-periodic function? Simply take $f(x) = x$ on $[0, 2\pi]$, and construct the trigonometric interpolant as described above for $n = 11$. The top plot in Figure 1.16 shows that $t_{11}$ gives a very poor approximation to $f$, constrained by design to be periodic even though $f$ is not. The fact that $f(2\pi) \ne f(0)$ acts like a discontinuity, vastly impairing the quality of the approximation. The bottom plot in Figure 1.16 repeats this exercise for $f(x) = (x - \pi)^2$. Since in this case $f(0) = f(2\pi)$ we might expect better results; indeed, the approximation looks quite reasonable. Note, however, that $f'(0) \ne f'(2\pi)$, and this lack of smoothness severely slows convergence of $t_n$ to $f$ as $n \to \infty$. Figure 1.17 contrasts this slow rate of convergence with the much faster convergence observed for $f(x) = e^{\cos(x)+\sin(2x)}$ used in Figure 1.14. Clearly the periodic interpolant is much better suited to smooth $f$.

Figure 1.14: Trigonometric interpolant to the $2\pi$-periodic function $f(x) = e^{\cos(x)+\sin(2x)}$, using $n = 5, 7, 9$ and $11$ points uniformly spaced over $[0, 2\pi)$ ($\{x_k\}_{k=0}^n$ for $x_k = 2\pi k/n$). Four panels show $f(x)$ together with $t_5(x)$, $t_7(x)$, $t_9(x)$, and $t_{11}(x)$ for $x$ from $-\pi$ to $3\pi$. Since both $f$ and the interpolant are periodic, the function fits well throughout $\mathbb{R}$, not just on the interval for which the interpolant was designed.

Figure 1.15: Polynomial fit $p_7(x)$ of degree $n = 7$ through uniformly spaced grid points $x_0, \ldots, x_n$ for $x_j = 2\pi j/n$, for the same function $f(x) = e^{\cos(x)+\sin(2x)}$ used in Figure 1.14, plotted for $x$ from $-\pi$ to $3\pi$. In contrast to the trigonometric fits in the earlier figure, the polynomial grows very rapidly outside the interval $[0, 2\pi]$. Moral: if your function is periodic, fit it with a trigonometric polynomial.

Figure 1.16: Trigonometric polynomial fit $t_{11}(x)$ of degree $n = 11$ through uniformly spaced grid points $x_0, \ldots, x_n$ for $x_j = 2\pi j/n$, for the non-periodic function $f(x) = x$ (top) and for $f(x) = (x - \pi)^2$ (bottom). By restricting the latter function to the domain $[0, 2\pi]$, one can view it as a continuous periodic function with a jump discontinuity in the first derivative. The interpolant $t_{11}$ seems to give a good approximation to $f$, but the discontinuity in the derivative slows the convergence of $t_n$ to $f$ as $n \to \infty$.

1.9.3 Fast MATLAB implementation

MATLAB organizes its Fast Fourier Transform in a slightly different fashion than we have described above. To fit with MATLAB, reorder the unknowns in the system (1.31) to obtain

$$\begin{bmatrix}
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\
\omega^{0} & \omega^{1} & \omega^{2} & \omega^{-2} & \omega^{-1} \\
\omega^{0} & \omega^{2} & \omega^{4} & \omega^{-4} & \omega^{-2} \\
\omega^{0} & \omega^{3} & \omega^{6} & \omega^{-6} & \omega^{-3} \\
\omega^{0} & \omega^{4} & \omega^{8} & \omega^{-8} & \omega^{-4}
\end{bmatrix}
\begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \\ \gamma_{-2} \\ \gamma_{-1} \end{bmatrix}
=
\begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{bmatrix},
\tag{1.32}$$

which amounts to reordering the columns of the matrix in (1.31). You can obtain this matrix from the command ifft(eye(n)), scaled by $n$. For $n = 5$,

$$\texttt{5*ifft(eye(5))} =
\begin{bmatrix}
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\
\omega^{0} & \omega^{1} & \omega^{2} & \omega^{-2} & \omega^{-1} \\
\omega^{0} & \omega^{2} & \omega^{4} & \omega^{-4} & \omega^{-2} \\
\omega^{0} & \omega^{3} & \omega^{6} & \omega^{-6} & \omega^{-3} \\
\omega^{0} & \omega^{4} & \omega^{8} & \omega^{-8} & \omega^{-4}
\end{bmatrix}.$$

Figure 1.17: Convergence of the trigonometric polynomial interpolants to $f(x) = e^{\cos(x)+\sin(2x)}$ and $f(x) = (x - \pi)^2$, plotted as $\max_{x \in [0, 2\pi]} |f(x) - t_n(x)|$ (logarithmic scale) against $n$. For the first function, convergence is extremely rapid as $n \to \infty$. The second function, restricted to $[0, 2\pi]$, can be viewed as a continuous but not continuously differentiable function. Though the approximation in Figure 1.16 looks good over $[0, 2\pi]$, the convergence of $t_n$ to $f$ is slow as $n \to \infty$.

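Similarly, the inverse of this matrix can be computed from the fft(eye(n)) command:

$$\texttt{fft(eye(5))/5} = \frac{1}{5}
\begin{bmatrix}
\omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\
\omega^{0} & \omega^{-1} & \omega^{-2} & \omega^{-3} & \omega^{-4} \\
\omega^{0} & \omega^{-2} & \omega^{-4} & \omega^{-6} & \omega^{-8} \\
\omega^{0} & \omega^{2} & \omega^{4} & \omega^{6} & \omega^{8} \\
\omega^{0} & \omega^{1} & \omega^{2} & \omega^{3} & \omega^{4}
\end{bmatrix}.$$

We could construct this matrix and multiply it against $\mathbf{f}$ to obtain $\boldsymbol{\gamma}$, but that would require $O(n^2)$ operations. Instead, we can compute $\boldsymbol{\gamma}$ directly using the fft command:

γ = fft(f)/n.

Recall that this reordered vector gives

$$\boldsymbol{\gamma} = \begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \gamma_2 \\ \gamma_{-2} \\ \gamma_{-1} \end{bmatrix},$$

which must be taken into account when constructing $t_n$.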
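One way to account for this ordering is to shift the fft output back to the centered ordering $\gamma_{-(n-1)/2}, \ldots, \gamma_{(n-1)/2}$ before evaluating $t_n$. The following is a minimal MATLAB sketch, assuming odd $n$, a real-valued $f$, and the built-in fftshift function (which is not used in the notes); the names gam, kk, and tn are illustrative.

```matlab
% Reorder fft output to gamma_{-(n-1)/2},...,gamma_{(n-1)/2} and evaluate t_n.
f  = @(x) exp(cos(x) + sin(2*x));
n  = 7;                               % odd number of interpolation points
xk = (0:n-1)'*2*pi/n;                 % interpolation points
gam = fftshift(fft(f(xk))/n);         % for odd n: centered frequency order
kk  = -(n-1)/2 : (n-1)/2;             % frequencies matching the entries of gam
% t_n(x) = sum_k gamma_k e^{ikx}; the real part is taken because f is real,
% so any imaginary part is rounding error.
tn  = @(x) real(exp(1i*x(:)*kk) * gam);
fprintf('max interpolation residual: %.1e\n', max(abs(tn(xk) - f(xk))));
```

The matrix-vector product evaluates all $n$ terms of the sum at once, which avoids an explicit loop over frequencies.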

Example 1.10 (MATLAB code for trigonometric interpolation). We close with a sample of MATLAB code one could use to construct the interpolant $t_n$ for the function $f(x) = e^{\cos(x)+\sin(2x)}$. First we present a generic code that will work for any (real- or complex-valued) $2\pi$-periodic $f$. Take special note of the simple one line command to find the coefficient vector $\boldsymbol{\gamma}$.

f = @(x) exp(cos(x)+sin(2*x));     % define the function

n = 5;                             % # terms in trig polynomial (must be odd)
xk = [0:n-1]'*2*pi/n;              % interpolation points
xx = linspace(0,2*pi,500)';        % fine grid on which to plot f, t_n
tn = zeros(size(xx));              % initialize t_n

gamma = fft(f(xk))/n;              % solve for coefficients, gamma

for k=1:(n+1)/2
    tn = tn + gamma(k)*exp(1i*(k-1)*xx);     % gamma_0, gamma_1, ... gamma_{(n-1)/2} terms
end
for k=(n+1)/2+1:n
    tn = tn + gamma(k)*exp(1i*(-n+k-1)*xx);  % gamma_{-(n-1)/2}, ..., gamma_{-1} terms
end
plot(xx,f(xx),'b-'), hold on       % plot f
plot(xx, tn,'r-')                  % plot t_n

In the case that $f$ is real-valued (as with all the examples shown in this section), one can further show that

$$\gamma_{-k} = \overline{\gamma_k},$$

indicating that the imaginary terms will not make any contribution to $t_n$. Since for $k = 1, \ldots, (n-1)/2$,

$$\gamma_{-k}e^{-ikx} + \gamma_k e^{ikx} = 2\bigl(\operatorname{Re}(\gamma_k)\cos(kx) - \operatorname{Im}(\gamma_k)\sin(kx)\bigr),$$

the code can be simplified slightly to construct $t_n$ as follows.

gamma = fft(f(xk))/n;              % solve for coefficients, gamma

tn = gamma(1)*exp(1i*0*xx);        % initialize t_n(x) = gamma_0
for k=2:(n+1)/2
    tn = tn + 2*real(gamma(k))*cos((k-1)*xx) ...
            - 2*imag(gamma(k))*sin((k-1)*xx);   % exploit gamma_{-k} = conj(gamma_k)
end

lecture 8: Piecewise interpolation

1.10 Piecewise polynomial interpolation

We have seen, through Runge’s example, that high degree polynomial interpolation can lead to large errors when the $(n+1)$st derivative of $f$ is large in magnitude. In other cases, the interpolant converges to $f$, but the polynomial degree must be fairly high to deliver an approximation of acceptable accuracy throughout $[a,b]$. Beyond theoretical convergence questions, high-degree polynomials can be delicate to work with, even when using a stable implementation (the Lagrange basis, in its barycentric form). Many practical approximation problems are better solved by a simpler ‘piecewise’ alternative: instead of approximating $f$ with one high-degree interpolating polynomial over a large interval $[a,b]$, patch together many low-degree polynomials that each interpolate $f$ on some subinterval of $[a,b]$.

1.10.1 Piecewise linear interpolation


The simplest piecewise polynomial interpolation uses linear polynomials
to interpolate between adjacent data points. Informally, the
idea is to ‘connect the dots.’ Given $n+1$ data points $\{(x_j, f_j)\}_{j=0}^n$, we
need to construct $n$ linear polynomials $\{s_j\}_{j=1}^n$ such that

\[ s_j(x_{j-1}) = f_{j-1}, \quad\text{and}\quad s_j(x_j) = f_j \]

for each $j = 1, \ldots, n$. It is simple to write down a formula for these
polynomials,

\[ s_j(x) = f_j - \frac{x_j - x}{x_j - x_{j-1}}\,(f_j - f_{j-1}). \]

Each $s_j$ is valid on $x \in [x_{j-1}, x_j]$, and the interpolant $S(x)$ is defined
as $S(x) = s_j(x)$ for $x \in [x_{j-1}, x_j]$.

Note that all the $s_j$’s are linear polynomials. Unlike our earlier notation, the
subscript $j$ does not reflect the polynomial degree.
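The formula for $s_j$ translates almost directly into code. The sketch below is one possible implementation, not code from the notes: it takes the example function from Figure 1.18, assumes $n = 5$ subintervals on $[0,1]$, and evaluates $S$ on a fine grid. MATLAB indices are shifted by one, so xk(j) stores $x_{j-1}$.

```matlab
f  = @(x) sin(20*x) + exp(5*x/2);   % example function from Figure 1.18
n  = 5;                             % number of subintervals (assumed)
xk = linspace(0,1,n+1)';            % knots x_0,...,x_n; xk(j) stores x_{j-1}
fk = f(xk);                         % data values f_0,...,f_n
xx = linspace(0,1,501)';            % fine grid on which to evaluate S
S  = zeros(size(xx));
for j = 1:n
   idx = (xx >= xk(j)) & (xx <= xk(j+1));            % points in [x_{j-1}, x_j]
   S(idx) = fk(j+1) - (xk(j+1) - xx(idx))/(xk(j+1) - xk(j)) * (fk(j+1) - fk(j));
end
plot(xx,f(xx),'b-'), hold on        % plot f
plot(xx,S,'r-')                     % plot the piecewise linear interpolant
```

The built-in call interp1(xk,fk,xx) computes the same piecewise linear interpolant and is a convenient cross-check.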
To analyze the error, we can apply the interpolation bound developed
in the last lecture. If we let $\Delta$ denote the largest space between
interpolation points,

\[ \Delta := \max_{j=1,\ldots,n} |x_j - x_{j-1}|, \]

then the standard interpolation error bound gives

\[ \max_{x\in[x_0,x_n]} |f(x) - S(x)| \le \max_{x\in[x_0,x_n]} \frac{|f''(x)|}{2}\,\Delta^2. \]

In particular, this proves convergence as $\Delta \to 0$ provided
$f \in C^2[x_0,x_n]$.
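The bound can be checked numerically. The following sketch uses the example function from Figure 1.18 with its second derivative computed by hand, and estimates both maxima on a fine grid, so the bound is only approximated. The variable names are illustrative.

```matlab
f   = @(x) sin(20*x) + exp(5*x/2);              % example function from Figure 1.18
d2f = @(x) -400*sin(20*x) + (25/4)*exp(5*x/2);  % its second derivative
xx  = linspace(0,1,20001)';                     % fine grid used to estimate the maxima
M2  = max(abs(d2f(xx)));                        % estimate of max |f''| on [0,1]

nvals = [5 10 20 40 80 160];                    % numbers of subintervals to try
err   = zeros(size(nvals));
bound = zeros(size(nvals));
for i = 1:numel(nvals)
   n  = nvals(i);
   xk = linspace(0,1,n+1)';                     % uniformly spaced knots, Delta = 1/n
   S  = interp1(xk, f(xk), xx);                 % piecewise linear interpolant
   err(i)   = max(abs(f(xx) - S));              % observed maximum error
   bound(i) = M2/2 * (1/n)^2;                   % right-hand side of the bound
end
disp([nvals' err' bound'])                      % columns: n, observed error, bound
```

Each time $n$ doubles, the observed error should drop by roughly a factor of four, consistent with the $\Delta^2$ in the bound.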
Figure 1.18: Piecewise linear interpolant to $f(x) = \sin(20x) + e^{5x/2}$ at
$n = 5$ uniformly spaced points (top), and the derivative of this interpolant
(bottom). Notice that the interpolant is continuous, but its derivative has jump
discontinuities.

What could go wrong with this simple approach? The primary
difficulty is that the interpolant is continuous, but generally not
continuously differentiable. Still, these functions are easy to construct and
cheap to evaluate, and can be very useful despite their simplicity.

1.10.2 Piecewise cubic Hermite interpolation

To remove the discontinuities in the first derivative of the piecewise
linear interpolant, we begin by modeling our data with cubic polynomials
over each interval $[x_{j-1}, x_j]$. Each such cubic has four free
parameters (since $\mathcal{P}_3$ is a vector space of dimension 4); we require
these polynomials to interpolate both $f$ and its first derivative:

\begin{align*}
s_j(x_{j-1}) &= f(x_{j-1}), & j &= 1, \ldots, n; \\
s_j(x_j) &= f(x_j), & j &= 1, \ldots, n; \\
s_j'(x_{j-1}) &= f'(x_{j-1}), & j &= 1, \ldots, n; \\
s_j'(x_j) &= f'(x_j), & j &= 1, \ldots, n.
\end{align*}

To satisfy these conditions, take $s_j$ to be the Hermite interpolant to
the data $(x_{j-1}, f(x_{j-1}), f'(x_{j-1}))$ and $(x_j, f(x_j), f'(x_j))$. The resulting
function, $S(x) := s_j(x)$ for $x \in [x_{j-1}, x_j]$, will always have a continuous
derivative, $S \in C^1[x_0, x_n]$, but generally $S \notin C^2[x_0, x_n]$ due to
discontinuities in the second derivative at the interpolation points.
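One way to evaluate such an interpolant is sketched below. It writes each $s_j$ in the standard cubic Hermite basis on the local coordinate $t = (x - x_{j-1})/(x_j - x_{j-1})$. This is a substitute for whatever form of the Hermite interpolant the reader prefers, not the construction used elsewhere in these notes. The function is the example from Figure 1.19, the derivative is computed by hand, and $n = 5$ subintervals is an assumption.

```matlab
f  = @(x) sin(20*x) + exp(5*x/2);           % example function from Figure 1.19
df = @(x) 20*cos(20*x) + (5/2)*exp(5*x/2);  % its first derivative
n  = 5;                                     % number of subintervals (assumed)
xk = linspace(0,1,n+1)';                    % knots x_0,...,x_n; xk(j) stores x_{j-1}
fk = f(xk);  dk = df(xk);                   % function values and slopes at the knots
xx = linspace(0,1,501)';                    % fine grid on which to evaluate S
S  = zeros(size(xx));
for j = 1:n
   h   = xk(j+1) - xk(j);                   % length of [x_{j-1}, x_j]
   idx = (xx >= xk(j)) & (xx <= xk(j+1));
   t   = (xx(idx) - xk(j))/h;               % local coordinate in [0,1]
   S(idx) = fk(j)  *(2*t.^3 - 3*t.^2 + 1) + h*dk(j)  *(t.^3 - 2*t.^2 + t) ...
          + fk(j+1)*(-2*t.^3 + 3*t.^2)    + h*dk(j+1)*(t.^3 - t.^2);
end
plot(xx,f(xx),'b-'), hold on                % plot f
plot(xx,S,'r-')                             % plot the piecewise cubic Hermite interpolant
```

The four basis cubics are chosen so that exactly one of the four quantities $s_j(x_{j-1})$, $s_j'(x_{j-1})$, $s_j(x_j)$, $s_j'(x_j)$ is affected by each coefficient, which is why the data values can be used directly as coefficients.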

In many applications, we lack specific values for $S'(x_j) = f'(x_j)$;
we simply want the function $S(x)$ to be as smooth as possible. That
motivates our next topic: splines.
Figure 1.19: Piecewise cubic Hermite interpolant to $f(x) = \sin(20x) + e^{5x/2}$
at $n = 5$ uniformly spaced points (top), and the derivative of this interpolant
(middle). Now both the interpolant and its derivative are continuous, and the
derivative interpolates $f'$. However, the second derivative of the interpolant
now has jump discontinuities (bottom).

lecture 9: Introduction to Splines

1.11 Splines

Spline fitting, our next topic in interpolation theory, is an essential
tool for engineering design. As in the last lecture, we strive to interpolate
data using low-degree polynomials between consecutive grid
points. The piecewise linear functions of Section 1.10 were simple,
but suffered from unsightly kinks at each interpolation point, reflecting
a discontinuity in the first derivative. By increasing the degree of
the polynomial used to model $f$ on each subinterval, we can obtain
smoother functions.

Note: Long before numerical analysts got their hands on them, ‘splines’ were
used in the woodworking, shipbuilding, and aircraft industries. In such
work ‘splines’ refer to thin pieces of wood that are bent between physical
constraints called ducks (apparently these were also called dogs and rats in
some settings; modern versions are sometimes called whales because of their
shape). The spline, a thin beam, bends between the ducks to give
a graceful curve. For some discussion of this history, see the brief ‘History of
Splines’ note by James Epperson in the 19 July 1998 NA Digest, linked from the
class website. For a beautiful derivation of cubic splines from Euler’s beam
equation (that is, from the original physical situation), see Gilbert Strang’s
Introduction to Applied Mathematics, Wellesley Cambridge Press, 1986.

1.11.1 Cubic splines: first approach

Rather than setting $S'(x_j)$ to a particular value, suppose we simply
require $S'$ to be continuous throughout $[x_0, x_n]$. This added freedom
allows us to impose a further condition: require $S''$ to be continuous
on $[x_0, x_n]$, too. The polynomials we construct are called cubic splines.
In spline parlance, the interpolation points $\{x_j\}_{j=0}^n$ are called knots.
These cubic spline requirements can be written as:

\begin{align*}
s_j(x_{j-1}) &= f(x_{j-1}), & j &= 1, \ldots, n; \\
s_j(x_j) &= f(x_j), & j &= 1, \ldots, n; \\
s_j'(x_j) &= s_{j+1}'(x_j), & j &= 1, \ldots, n-1; \\
s_j''(x_j) &= s_{j+1}''(x_j), & j &= 1, \ldots, n-1.
\end{align*}

Compare these requirements to those imposed by piecewise cubic


Hermite interpolation. Add up all these new requirements:

n + n + (n − 1) + (n − 1) = 4n − 2 constraints

and compare to the total free variables available:

(n cubic polynomials)×(4 variables per cubic) = 4n variables.

So far, we thus have an underdetermined system, and there will be
infinitely many choices for the function $S(x)$ that satisfy the constraints.
There are several canonical ways to add two extra constraints that
uniquely define $S$:

• natural splines require $S''(x_0) = S''(x_n) = 0$;

• complete splines specify values for $S'(x_0)$ and $S'(x_n)$;

• not-a-knot splines require $S'''$ to be continuous at $x_1$ and $x_{n-1}$.

Since the third derivative of a cubic is a constant, the not-a-knot requirement
forces $s_1 = s_2$ and $s_{n-1} = s_n$. Hence, while $S(x)$ interpolates the data at $x_1$
and $x_{n-1}$, the derivative continuity requirements are automatic at those
knots; hence the name “not-a-knot”.
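Two of these end conditions are available through MATLAB's built-in spline command, which can serve as a reference when experimenting. The sketch below assumes the example function used in the figures and $n = 5$ subintervals. By default spline uses not-a-knot end conditions; when the data vector has two more entries than the knot vector, the first and last entries are taken as end slopes, giving a complete spline.

```matlab
f  = @(x) sin(20*x) + exp(5*x/2);     % example function used in the figures
n  = 5;                               % number of subintervals (assumed)
xk = linspace(0,1,n+1);               % knots x_0,...,x_n (row vector)
fk = f(xk);
xx = linspace(0,1,501);               % fine grid for plotting

S_nak  = spline(xk, fk, xx);          % not-a-knot spline (the default end conditions)
S_comp = spline(xk, [0 fk 0], xx);    % complete spline with S'(x_0) = S'(x_n) = 0

pp_nak  = spline(xk, fk);             % piecewise polynomial form; one row of coefs per cubic
pp_comp = spline(xk, [0 fk 0]);
lead    = pp_nak.coefs(:,1);          % leading (cubic) coefficient of each piece
slope0  = pp_comp.coefs(1,3);         % S'(x_0) for the complete spline

plot(xx,f(xx),'b-'), hold on
plot(xx,S_nak,'r-'), plot(xx,S_comp,'k--')
```

For the not-a-knot spline, the first two entries of lead agree (and likewise the last two), reflecting $s_1 = s_2$ and $s_{n-1} = s_n$. The command does not provide natural end conditions directly.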
Figure 1.20: Not-a-knot cubic spline interpolant to $f(x) = \sin(20x) + e^{5x/2}$
at $n = 5$ uniformly spaced knots (top), along with its first (middle) and second
(bottom) derivative. Note that $S$, $S'$, and $S''$ are all continuous. Look closely at
the plot of $S''$: its slope $S'''$ clearly jumps at the interior knots $x_2$ and $x_3$,
but the not-a-knot condition forces $S'''$ to be continuous at
the knots $x_1$ and $x_4 = x_{n-1}$.

Figure 1.21: Complete cubic spline interpolant to $f(x) = \sin(20x) + e^{5x/2}$
at $n = 5$ uniformly spaced knots (top), along with its first (middle) and second
(bottom) derivative. Note that $S$, $S'$, and $S''$ are all continuous. For a complete
cubic spline, one specifies the value of $S'(x_0)$ and $S'(x_n)$; in this case we
have set $S'(x_0) = S'(x_n) = 0$, as you can confirm in the middle plot. In the
bottom plot, see that $S'''(x)$ will have jump discontinuities at all the interior
knots $x_1, \ldots, x_{n-1}$, in contrast to the not-a-knot spline shown in Figure 1.20.

Natural cubic splines are a popular choice because they can be shown,
in a precise sense, to minimize curvature over all the other possible
splines. They also model the physical origin of splines, where beams
of wood extend straight (i.e., zero second derivative) beyond the first
and final ‘ducks.’

Continuing with the example from the last section, Figure 1.20
shows a not-a-knot spline, where $S'''$ is continuous at $x_1$ and $x_{n-1}$.
The cubic polynomials $s_1$ for $[x_0, x_1]$ and $s_2$ for $[x_1, x_2]$ must satisfy

\begin{align*}
s_1(x_1) &= s_2(x_1) \\
s_1'(x_1) &= s_2'(x_1) \\
s_1''(x_1) &= s_2''(x_1) \\
s_1'''(x_1) &= s_2'''(x_1).
\end{align*}

Two cubics that match these four conditions must be the same:
$s_1(x) = s_2(x)$, and hence $x_1$ is ‘not a knot.’ (The same goes for $x_{n-1}$.)
Notice this behavior in Figure 1.20. In contrast, Figure 1.21 shows the
complete cubic spline, where $S'(x_0) = S'(x_n) = 0$.

However we assign the two additional conditions, we get a system
of $4n$ equations (the various constraints) in $4n$ unknowns (the cubic
polynomial coefficients). These equations can be set up as a system
involving a banded coefficient matrix (zero everywhere except for a
limited number of diagonals on either side of the main diagonal).
(One can arrange Gaussian elimination to solve an $n \times n$ tridiagonal
system in $O(n)$ operations.) We
could derive this linear system by directly enforcing the continuity
conditions on the cubic polynomial that we have just described.
(Try constructing this matrix!) Instead, we will develop a more general
approach that expresses the
spline function $S(x)$ as the linear combination of special basis functions,
which themselves are splines.

lecture 10: B-Splines

1.11.2 B-Splines: a basis for splines


Throughout our discussion of standard polynomial interpolation, we
viewed Pn as a linear space of dimension n + 1, and then expressed
the unique interpolating polynomial in several different bases (mono-
mial, Newton, Lagrange). The most elegant way to develop spline
functions uses the same approach. A set of basis splines, depending
only on the location of the knots and the degree of the approximating
piecewise polynomials, can be developed in a convenient, numerically
stable manner. (Cubic splines are the most prominent special case.)
For example, each cubic basis spline, or B-spline, is a continuous
piecewise-cubic function with continuous first and second deriva-
tives. Thus any linear combination of such B-splines will inherit the
same continuity properties. The coefficients in the linear combination
are chosen to obey the specified interpolation conditions.
B-splines are built up recursively from constant B-splines. Though
we are interpolating data at $n+1$ knots $x_0, \ldots, x_n$, to derive B-splines
we need extra nodes outside $[x_0, x_n]$ as scaffolding upon which to
construct the basis. Thus, add knots on either end of $x_0$ and $x_n$:

\[ \cdots < x_{-2} < x_{-1} < x_0 < x_1 < \cdots < x_n < x_{n+1} < \cdots. \]

Given these knots, define the constant (zeroth-degree) B-splines:

\[ B_{j,0}(x) = \begin{cases} 1, & x \in [x_j, x_{j+1}); \\ 0, & \text{otherwise.} \end{cases} \]

The following plot shows the basis function $B_{0,0}$ for the knots $x_j = j$.
Note, in particular, that $B_{j,0}(x_{j+1}) = 0$. The line drawn beneath
the spline marks the support of the spline, that is, the values of $x$ for
which $B_{0,0}(x) \ne 0$.

[Plot: $B_{0,0}(x)$ for $x \in [-3, 7]$.]

From these degree-0 B-splines, manufacture B-splines of higher degree
via the recurrence

\[ B_{j,k}(x) = \left(\frac{x - x_j}{x_{j+k} - x_j}\right) B_{j,k-1}(x) + \left(\frac{x_{j+k+1} - x}{x_{j+k+1} - x_{j+1}}\right) B_{j+1,k-1}(x). \tag{1.33} \]
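The recurrence (1.33) can be applied column by column to a table of B-spline values. The sketch below is one possible arrangement, not code from the notes: it uses the knots $x_j = j$ for $j = 0, \ldots, 4$, which suffice for $B_{0,3}$, and evaluates $B_{0,0}, \ldots, B_{0,3}$ on a grid that contains the knots exactly. The cell array Bdeg and the other variable names are illustrative.

```matlab
t  = 0:4;                        % knots x_0,...,x_4 with x_j = j; t(i) stores x_{i-1}
xx = (-1:0.25:5)';               % evaluation grid (contains the knots exactly)
k  = 3;                          % highest degree to build
m  = numel(t);

B = zeros(numel(xx), m-1);       % column i holds B_{i-1,0}(xx)
for i = 1:m-1
   B(:,i) = (xx >= t(i)) & (xx < t(i+1));      % degree-0 B-splines on [x_j, x_{j+1})
end
Bdeg = cell(k+1,1);
Bdeg{1} = B(:,1);                % B_{0,0}
for d = 1:k                      % raise the degree one step at a time via (1.33)
   Bnew = zeros(numel(xx), m-1-d);
   for i = 1:m-1-d
      Bnew(:,i) = (xx - t(i))/(t(i+d) - t(i)) .* B(:,i) ...
                + (t(i+d+1) - xx)/(t(i+d+1) - t(i+1)) .* B(:,i+1);
   end
   B = Bnew;
   Bdeg{d+1} = B(:,1);           % B_{0,d}
end
plot(xx, Bdeg{4}, 'b.-')         % B_{0,3}, supported on (x_0, x_4)
```

Each pass of the outer loop shortens the table by one column, since $B_{j,d}$ needs both $B_{j,d-1}$ and $B_{j+1,d-1}$.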

While not immediately obvious from the formula, this construction
ensures that $B_{j,k}$ has one more continuous derivative than does $B_{j,k-1}$.
Thus, while $B_{j,0}$ is discontinuous (see previous plot), $B_{j,1}$ is continuous,
$B_{j,2} \in C^1(\mathbb{R})$, and $B_{j,3} \in C^2(\mathbb{R})$. One can see this in the three
plots below, where again $x_j = j$. As the degree increases, the B-spline
$B_{j,k}$ becomes increasingly smooth. Smooth is good, but it has a consequence:
the support of $B_{j,k}$ gets larger as we increase $k$. This, as we
will see, has implications for the number of nonzero entries in the
linear system we must ultimately solve to find the expansion of the
desired spline in the B-spline basis.

[Plots: $B_{0,1}(x)$ (top), $B_{0,2}(x)$ (middle), and $B_{0,3}(x)$ (bottom) for $x \in [-3, 7]$.]
From these plots and the recurrence defining $B_{j,k}$, one can deduce
several important properties:

• $B_{j,k} \in C^{k-1}(\mathbb{R})$ (continuity);

• $B_{j,k}(x) = 0$ if $x \notin (x_j, x_{j+k+1})$ (compact support);

• $B_{j,k}(x) > 0$ for $x \in (x_j, x_{j+k+1})$ (positivity).

Finally, we are prepared to write down a formula for the spline
that interpolates $\{(x_j, f_j)\}_{j=0}^n$ as a linear combination of the basis
splines we have just constructed. Let $S_k(x)$ denote the spline consisting
of piecewise polynomials in $\mathcal{P}_k$. In particular, $S_k$ must obey the
following properties:

• $S_k(x_j) = f_j$ for $j = 0, \ldots, n$;

• $S_k \in C^{k-1}[x_0, x_n]$ for $k \ge 1$.
The beauty of B-splines is that the second of these properties is
automatically inherited from the B-splines themselves. (Any linear
combination of $C^{k-1}(\mathbb{R})$ functions must itself be a $C^{k-1}(\mathbb{R})$ function.)
The interpolation conditions give $n+1$ equations that constrain the
unknown coefficients $c_{j,k}$ in the expansion of $S_k$:

\[ S_k(x) = \sum_j c_{j,k} B_{j,k}(x). \tag{1.34} \]

What limits should $j$ have in this sum? For the greatest flexibility, let
$j$ range over all values for which

\[ B_{j,k}(x) \ne 0 \quad \text{for some } x \in [x_0, x_n]. \]

(If $B_{j,k}(x) = 0$ for all $x \in [x_0, x_n]$, it cannot contribute to the
interpolation requirement $S_k(x_j) = f_j$, $j = 0, \ldots, n$.)
Figure 1.22 shows the B-splines of degree $k = 1, 2, 3$ that overlap
the interval $[x_0, x_4]$ for $x_j = j$. For $k \ge 1$, $B_{j,k}(x)$ is supported on
$(x_j, x_{j+k+1})$, and hence the limits on the sum in (1.34) take the form

\[ S_k(x) = \sum_{j=-k}^{n-1} c_{j,k} B_{j,k}(x), \quad k \ge 1. \tag{1.35} \]

The sum involves $n+k$ coefficients $c_{j,k}$, which must be determined to
satisfy the $n+1$ interpolation conditions

\[ f_\ell = S_k(x_\ell) = \sum_{j=-k}^{n-1} c_{j,k} B_{j,k}(x_\ell), \quad \ell = 0, \ldots, n. \]

Figure 1.22: B-splines of degree $k = 1$ (top), $k = 2$ (middle), and $k = 3$
(bottom) that are supported on the interval $[x_0, x_n]$ for $x_j = j$ with $n = 4$.
Note that $n+k$ B-splines are supported on $[x_0, x_n]$. The top panel shows
$B_{-1,1}, \ldots, B_{3,1}$; the middle panel shows $B_{-2,2}, \ldots, B_{3,2}$; the bottom
panel shows $B_{-3,3}, \ldots, B_{3,3}$.

Before addressing general $k \ge 1$, we pause to handle the special case
of $k = 0$, i.e., constant splines.

1.11.3 Constant splines, k = 0


The constant B-splines give $B_{n,0}(x_n) = 1$ and so, unlike the general
$k \ge 1$ case, the $j = n$ B-spline must be included in the spline sum:

\[ S_0(x) = \sum_{j=0}^{n} c_{j,0} B_{j,0}(x). \]

The interpolation conditions give, for $\ell = 0, \ldots, n$,

\[ f_\ell = S_0(x_\ell) = \sum_{j=0}^{n} c_{j,0} B_{j,0}(x_\ell) = c_{\ell,0} B_{\ell,0}(x_\ell) = c_{\ell,0}, \]

since $B_{j,0}(x_\ell) = 0$ if $j \ne \ell$, and $B_{\ell,0}(x_\ell) = 1$ (recall the plot of
$B_{0,0}(x)$ shown earlier). Thus $c_{\ell,0} = f_\ell$, and the degree $k = 0$ spline
interpolant is simply

\[ S_0(x) = \sum_{j=0}^{n} f_j B_{j,0}(x). \]

lecture 11: Matrix Determination of Splines; Energy Minimization

1.11.4 General case, k ≥ 1


Now consider the general spline interpolant of degree $k \ge 1$,

\[ S_k(x) = \sum_{j=-k}^{n-1} c_{j,k} B_{j,k}(x), \]

with constants $c_{-k,k}, \ldots, c_{n-1,k}$ determined to satisfy the interpolation
conditions $S_k(x_\ell) = f_\ell$, i.e.,

\[ \sum_{j=-k}^{n-1} c_{j,k} B_{j,k}(x_\ell) = f_\ell, \quad \ell = 0, \ldots, n. \]

By now we are accustomed to transforming constraints like this into
matrix equations. Each value $\ell = 0, \ldots, n$ gives a row of the equation

\[
\begin{bmatrix}
B_{-k,k}(x_0) & B_{-k+1,k}(x_0) & \cdots & B_{n-1,k}(x_0) \\
B_{-k,k}(x_1) & B_{-k+1,k}(x_1) & \cdots & B_{n-1,k}(x_1) \\
\vdots & \vdots & \ddots & \vdots \\
B_{-k,k}(x_n) & B_{-k+1,k}(x_n) & \cdots & B_{n-1,k}(x_n)
\end{bmatrix}
\begin{bmatrix} c_{-k,k} \\ c_{-k+1,k} \\ \vdots \\ c_{n-1,k} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix}.
\tag{1.36}
\]

Let us consider the matrix in this equation. The matrix will have
$n+1$ rows and $n+k$ columns, so when $k > 1$ the system of equations
will be underdetermined. Since B-splines have ‘small support’ (i.e.,
$B_{j,k}(x) = 0$ for most $x \in [x_0, x_n]$), this matrix will be sparse: most
entries will be zero.

Note: One could obtain an $(n+1) \times (n+1)$ matrix by arbitrarily setting $k-1$
of the coefficients $c_{j,k}$ to zero, but this would miss a great opportunity: we can
constructively include all $n+k$ B-splines and impose $k-1$ extra properties on $S_k$ to
pick out a unique spline interpolant from the infinitely many options that
satisfy the interpolation conditions.

The following subsections will describe the particular form the
system (1.36) takes for $k = 1, 2, 3$. In each case we will illustrate the
resulting spline interpolant through the following data set.

(1.37)
    j    |   0    1    2    3    4
    x_j  |   0    1    2    3    4
    f_j  |   1    3    2   -1    1

1.11.5 Linear splines, k = 1


Linear splines are simple to construct: in this case $n+k = n+1$, so
the matrix in (1.36) is square. Let us evaluate it: since

\[ B_{j,1}(x_\ell) = \begin{cases} 1, & \ell = j+1; \\ 0, & \ell \ne j+1, \end{cases} \]

the matrix is simply

\[
\begin{bmatrix}
B_{-1,1}(x_0) & B_{0,1}(x_0) & \cdots & B_{n-1,1}(x_0) \\
B_{-1,1}(x_1) & B_{0,1}(x_1) & \cdots & B_{n-1,1}(x_1) \\
\vdots & \vdots & \ddots & \vdots \\
B_{-1,1}(x_n) & B_{0,1}(x_n) & \cdots & B_{n-1,1}(x_n)
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & 1
\end{bmatrix}
= I.
\]

The system (1.36) is thus trivial to solve, reducing to

\[
\begin{bmatrix} c_{-1,1} \\ c_{0,1} \\ \vdots \\ c_{n-1,1} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix},
\]

which gives the unique linear spline

\[ S_1(x) = \sum_{j=-1}^{n-1} f_{j+1} B_{j,1}(x). \]

Note: The above discussion is a pedantic way to arrive at an obvious solution: since
the $j$th ‘hat function’ B-spline equals one at $x_{j+1}$ and zero at all other knots,
just write the unique formula for the interpolant immediately.

Figure 1.23 shows the unique piecewise linear spline interpolant to
the data in (1.37), which is a linear combination of the five linear
B-splines shown in Figure 1.22. Explicitly,

\begin{align*}
S_1(x) &= f_0 B_{-1,1}(x) + f_1 B_{0,1}(x) + f_2 B_{1,1}(x) + f_3 B_{2,1}(x) + f_4 B_{3,1}(x) \\
       &= B_{-1,1}(x) + 3 B_{0,1}(x) + 2 B_{1,1}(x) - B_{2,1}(x) + B_{3,1}(x).
\end{align*}

Figure 1.23: Linear spline $S_1$ interpolating 5 data points $\{(x_j, f_j)\}_{j=0}^4$.

Note that linear splines are simply $C^0$ functions that interpolate a
given data set; between the knots, they are identical to the piecewise
linear functions constructed in Section 1.10.1. Note that $S_1(x)$ is supported
on $(x_{-1}, x_{n+1})$ with $S_1(x) = 0$ for all $x \notin (x_{-1}, x_{n+1})$. This is
a general feature of splines: outside the range of interpolation, $S_k(x)$
goes to zero as quickly as possible for a given set of knots while still
maintaining the specified continuity.

1.11.6 Quadratic splines, k = 2

The construction of quadratic B-splines from the linear splines via
the recurrence (1.33) forces the functions $B_{j,2}$ to have a continuous
derivative, and also to be supported over three intervals per spline, as
seen in the middle plot in Figure 1.22. The interpolant takes the form

\[ S_2(x) = \sum_{j=-2}^{n-1} c_{j,2} B_{j,2}(x), \]

with coefficients specified by $n+1$ equations in $n+2$ unknowns:

\[
\begin{bmatrix}
B_{-2,2}(x_0) & B_{-1,2}(x_0) & \cdots & B_{n-1,2}(x_0) \\
B_{-2,2}(x_1) & B_{-1,2}(x_1) & \cdots & B_{n-1,2}(x_1) \\
\vdots & \vdots & \ddots & \vdots \\
B_{-2,2}(x_n) & B_{-1,2}(x_n) & \cdots & B_{n-1,2}(x_n)
\end{bmatrix}
\begin{bmatrix} c_{-2,2} \\ c_{-1,2} \\ \vdots \\ c_{n-1,2} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix}.
\tag{1.38}
\]

Since there are more variables than constraints, we expect infinitely
many quadratic splines that interpolate the data.
Evaluate the entries of the matrix in (1.38). First note that

\[ B_{j,2}(x_\ell) = 0, \quad \ell \notin \{j+1, j+2\}, \]

so the matrix is zero in all entries except the main diagonal ($B_{j,2}(x_{j+2})$)
and the first superdiagonal ($B_{j,2}(x_{j+1})$). To evaluate these nonzero entries,
recall that the recursion (1.33) for B-splines gives

\[ B_{j,2}(x) = \left(\frac{x - x_j}{x_{j+2} - x_j}\right) B_{j,1}(x) + \left(\frac{x_{j+3} - x}{x_{j+3} - x_{j+1}}\right) B_{j+1,1}(x). \]

Evaluate this function at $x_{j+1}$ and $x_{j+2}$, using our knowledge of the
$B_{j,1}$ linear B-splines (‘hat functions’):

\begin{align*}
B_{j,2}(x_{j+1}) &= \left(\frac{x_{j+1} - x_j}{x_{j+2} - x_j}\right) B_{j,1}(x_{j+1}) + \left(\frac{x_{j+3} - x_{j+1}}{x_{j+3} - x_{j+1}}\right) B_{j+1,1}(x_{j+1}) \\
&= \left(\frac{x_{j+1} - x_j}{x_{j+2} - x_j}\right) \cdot 1 + \left(\frac{x_{j+3} - x_{j+1}}{x_{j+3} - x_{j+1}}\right) \cdot 0 = \frac{x_{j+1} - x_j}{x_{j+2} - x_j}; \\[4pt]
B_{j,2}(x_{j+2}) &= \left(\frac{x_{j+2} - x_j}{x_{j+2} - x_j}\right) B_{j,1}(x_{j+2}) + \left(\frac{x_{j+3} - x_{j+2}}{x_{j+3} - x_{j+1}}\right) B_{j+1,1}(x_{j+2}) \\
&= \left(\frac{x_{j+2} - x_j}{x_{j+2} - x_j}\right) \cdot 0 + \left(\frac{x_{j+3} - x_{j+2}}{x_{j+3} - x_{j+1}}\right) \cdot 1 = \frac{x_{j+3} - x_{j+2}}{x_{j+3} - x_{j+1}}.
\end{align*}

Use these formulas to populate the main diagonal and superdiagonal
of the matrix in (1.38). In the (important) special case of uniformly
spaced knots

\[ x_j = x_0 + jh, \quad \text{for fixed } h > 0, \]

these formulas take the particularly simple form

\[ B_{j,2}(x_{j+1}) = B_{j,2}(x_{j+2}) = \frac{1}{2}, \]

hence the system (1.38) becomes

\[
\begin{bmatrix}
1/2 & 1/2 & & & \\
 & 1/2 & 1/2 & & \\
 & & \ddots & \ddots & \\
 & & & 1/2 & 1/2
\end{bmatrix}
\begin{bmatrix} c_{-2,2} \\ c_{-1,2} \\ c_{0,2} \\ \vdots \\ c_{n-1,2} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix},
\]

where the blank entries are zero. This $(n+1) \times (n+2)$ system will
have infinitely many solutions, i.e., infinitely many splines that satisfy
the interpolation conditions. How to choose among them? Impose
one extra condition, such as $S_2'(x_0) = 0$ or $S_2'(x_n) = 0$.

As an example, let us work through the condition $S_2'(x_0) = 0$; it
raises an interesting issue. Refer to the middle plot in Figure 1.22.
Due to the small support of the quadratic B-splines, $B_{j,2}'(x_0) = 0$ for
$j > 0$, so

\[ S_2'(x_0) = c_{-2,2} B_{-2,2}'(x_0) + c_{-1,2} B_{-1,2}'(x_0) + c_{0,2} B_{0,2}'(x_0). \tag{1.39} \]

The derivatives of the B-splines at knots are tricky to compute.
Differentiating the recurrence (1.33) with $k = 2$, we can formally write

\[ B_{j,2}'(x) = \left(\frac{1}{x_{j+2} - x_j}\right) B_{j,1}(x) + \left(\frac{x - x_j}{x_{j+2} - x_j}\right) B_{j,1}'(x) - \left(\frac{1}{x_{j+3} - x_{j+1}}\right) B_{j+1,1}(x) + \left(\frac{x_{j+3} - x}{x_{j+3} - x_{j+1}}\right) B_{j+1,1}'(x). \]

Try to evaluate this expression at $x_j$, $x_{j+1}$, or $x_{j+2}$: you must face the
fact that the linear B-splines $B_{j,1}$ and $B_{j+1,1}$ are not differentiable at the
knots! You must instead check that the one-sided derivatives match,
e.g.,

\[ \lim_{\substack{h \to 0 \\ h < 0}} \frac{B_{j,2}(x_{j+1} + h) - B_{j,2}(x_{j+1})}{h} = \lim_{\substack{h \to 0 \\ h > 0}} \frac{B_{j,2}(x_{j+1} + h) - B_{j,2}(x_{j+1})}{h}. \]

A mildly tedious calculation verifies that indeed these one-sided first
derivatives do match, and that is the point of splines: each time you
increase the degree $k$, you increase the smoothness, so $B_{j,2} \in C^1(\mathbb{R})$.
Now regarding formula (1.39), one can compute

\[ B_{-2,2}'(x_0) = -\frac{2}{x_1 - x_{-1}}, \quad B_{-1,2}'(x_0) = \frac{2}{x_1 - x_{-1}}, \quad B_{0,2}'(x_0) = 0, \]

and so, in the special case of a uniformly spaced grid ($x_j = x_0 + jh$),
the condition $S_2'(x_0) = 0$ becomes

\[ -\frac{1}{h} c_{-2,2} + \frac{1}{h} c_{-1,2} = 0. \]

Insert this equation as the first row in the linear system for the coefficients,

\[
\begin{bmatrix}
-1/h & 1/h & & & \\
1/2 & 1/2 & & & \\
 & 1/2 & 1/2 & & \\
 & & \ddots & \ddots & \\
 & & & 1/2 & 1/2
\end{bmatrix}
\begin{bmatrix} c_{-2,2} \\ c_{-1,2} \\ c_{0,2} \\ \vdots \\ c_{n-1,2} \end{bmatrix}
=
\begin{bmatrix} 0 \\ f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix},
\]

and solve this for $c_{-2,2}, \ldots, c_{n-1,2}$ to determine the unique interpolating
quadratic spline with $S_2'(x_0) = 0$.

Figure 1.24 shows quadratic spline interpolants to the data in (1.37).
One spline is determined with the extra condition $S_2'(x_0) = 0$
described above; the other satisfies $S_2'(x_n) = 0$. In any case, the
quadratic spline $S_2$ is supported on $(x_{-2}, x_{n+2})$.
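The quadratic spline with $S_2'(x_0) = 0$ for the data (1.37) can be computed in a few lines. The sketch below is one possible arrangement, not code from the notes: it evaluates the quadratic B-splines with the recurrence (1.33) on a grid that contains the knots, reads off the rows of (1.38) from that table, prepends the derivative condition, and solves. All variable names are illustrative.

```matlab
xk = (0:4)';  fk = [1 3 2 -1 1]';     % data set (1.37): uniform knots with h = 1
n  = numel(xk) - 1;  h = 1;  k = 2;
t  = xk(1) + h*(-k:n+k)';             % extended knots x_{-k},...,x_{n+k}
xx = (xk(1):h/8:xk(end))';            % evaluation grid; every 8th point is a knot

% build B(:,i) = B_{j,k}(xx) with j = i-k-1, using the recurrence (1.33)
m = numel(t);
B = zeros(numel(xx), m-1);
for i = 1:m-1
   B(:,i) = (xx >= t(i)) & (xx < t(i+1));
end
for d = 1:k
   Bnew = zeros(numel(xx), m-1-d);
   for i = 1:m-1-d
      Bnew(:,i) = (xx - t(i))/(t(i+d) - t(i)) .* B(:,i) ...
                + (t(i+d+1) - xx)/(t(i+d+1) - t(i+1)) .* B(:,i+1);
   end
   B = Bnew;
end

A   = [ [-1/h, 1/h, zeros(1,n)] ; B(1:8:end,:) ];  % row for S_2'(x_0) = 0, then rows of (1.38)
rhs = [0; fk];
c   = A \ rhs;                        % coefficients c_{-2,2},...,c_{n-1,2}
S2  = B*c;                            % spline values on the grid xx
plot(xx,S2,'b-'), hold on, plot(xk,fk,'ko')
```

For this small system one can also solve by hand: the first row forces $c_{-2,2} = c_{-1,2}$, and the remaining rows then determine the other coefficients one at a time.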

1.11.7 Cubic splines, k = 3


Cubic splines are the most famous of all splines. We began this section
by discussing cubic splines as an alternative to piecewise cubic
Hermite interpolation. Now we will show how to derive the same
cubic splines from the cubic B-splines.
Begin by reviewing the bottom plot in Figure 1.22. The cubic B-splines
$B_{-3,3}, \ldots, B_{n-1,3}$ take nonzero values on the interval $[x_0, x_n]$,
and hence we write the cubic spline as

\[ S_3(x) = \sum_{j=-3}^{n-1} c_{j,3} B_{j,3}(x). \tag{1.40} \]

Figure 1.24: Two choices for the quadratic spline $S_2$ that interpolates
the 5 data points $\{(x_j, f_j)\}_{j=0}^4$ in (1.37). The blue spline satisfies the
extra condition that $S_2'(x_0) = 0$, while the red spline satisfies $S_2'(x_n) = 0$.
Check to see that these conditions are consistent with the splines in the plot.

The linear system (1.36) now involves $n+1$ equations in $n+3$ unknowns:

\[
\begin{bmatrix}
B_{-3,3}(x_0) & B_{-2,3}(x_0) & \cdots & B_{n-1,3}(x_0) \\
B_{-3,3}(x_1) & B_{-2,3}(x_1) & \cdots & B_{n-1,3}(x_1) \\
\vdots & \vdots & \ddots & \vdots \\
B_{-3,3}(x_n) & B_{-2,3}(x_n) & \cdots & B_{n-1,3}(x_n)
\end{bmatrix}
\begin{bmatrix} c_{-3,3} \\ c_{-2,3} \\ \vdots \\ c_{n-1,3} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix}.
\tag{1.41}
\]

Given the support of cubic B-splines, note that

\[ B_{j,3}(x_\ell) = 0, \quad \ell \notin \{j+1, j+2, j+3\}, \]

which implies that only three diagonals of the matrix in (1.41) will
be nonzero. We shall only work out the nonzero entries in the case of
uniformly spaced knots, $x_j = x_0 + jh$ for fixed $h > 0$. In this case,

\begin{align*}
B_{j,3}(x_{j+1}) &= \left(\frac{x_{j+1} - x_j}{x_{j+3} - x_j}\right) B_{j,2}(x_{j+1}) + \left(\frac{x_{j+4} - x_{j+1}}{x_{j+4} - x_{j+1}}\right) B_{j+1,2}(x_{j+1}) = \left(\frac{h}{3h}\right) \cdot \frac{1}{2} + \left(\frac{3h}{3h}\right) \cdot 0 = \frac{1}{6} \\
B_{j,3}(x_{j+2}) &= \left(\frac{x_{j+2} - x_j}{x_{j+3} - x_j}\right) B_{j,2}(x_{j+2}) + \left(\frac{x_{j+4} - x_{j+2}}{x_{j+4} - x_{j+1}}\right) B_{j+1,2}(x_{j+2}) = \left(\frac{2h}{3h}\right) \cdot \frac{1}{2} + \left(\frac{2h}{3h}\right) \cdot \frac{1}{2} = \frac{2}{3} \\
B_{j,3}(x_{j+3}) &= \left(\frac{x_{j+3} - x_j}{x_{j+3} - x_j}\right) B_{j,2}(x_{j+3}) + \left(\frac{x_{j+4} - x_{j+3}}{x_{j+4} - x_{j+1}}\right) B_{j+1,2}(x_{j+3}) = \left(\frac{3h}{3h}\right) \cdot 0 + \left(\frac{h}{3h}\right) \cdot \frac{1}{2} = \frac{1}{6},
\end{align*}

where we have used the fact that $B_{j,2}(x_{j+1}) = B_{j,2}(x_{j+2}) = 1/2$ and
$B_{j,2}(x_j) = B_{j,2}(x_{j+3}) = 0$. Substituting these values into (1.41) gives

\[
\begin{bmatrix}
1/6 & 2/3 & 1/6 & & & \\
 & 1/6 & 2/3 & 1/6 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1/6 & 2/3 & 1/6
\end{bmatrix}
\begin{bmatrix} c_{-3,3} \\ c_{-2,3} \\ \vdots \\ c_{n-1,3} \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_n \end{bmatrix}
\tag{1.42}
\]

involving a matrix with $n+1$ rows and $n+3$ columns. Again, infinitely
many cubic splines satisfy these interpolation conditions; two
independent requirements must be imposed to determine a unique
spline. Recall the three alternatives discussed in Section 1.11.1: complete
splines (specify a value for $S_3'$ at $x_0$ and $x_n$), natural splines
(force $S_3''(x_0) = S_3''(x_n) = 0$), or not-a-knot splines. One can show that
imposing natural spline conditions on $S_3$ requires

\begin{align*}
(x_2 - x_{-1})\,c_{-3,3} - (x_2 + x_1 - x_{-1} - x_{-2})\,c_{-2,3} + (x_1 - x_{-2})\,c_{-1,3} &= 0 \\
(x_{n+2} - x_{n-1})\,c_{n-3,3} - (x_{n+2} + x_{n+1} - x_{n-1} - x_{n-2})\,c_{n-2,3} + (x_{n+1} - x_{n-2})\,c_{n-1,3} &= 0,
\end{align*}

which for equally spaced knots ($x_j = x_0 + jh$) simplify to

\begin{align*}
3h\,c_{-3,3} - 6h\,c_{-2,3} + 3h\,c_{-1,3} &= 0 \\
3h\,c_{n-3,3} - 6h\,c_{n-2,3} + 3h\,c_{n-1,3} &= 0.
\end{align*}

It is convenient to add these conditions (dividing out the $h$) as the
first and last row of (1.41) to give

\[
\begin{bmatrix}
3 & -6 & 3 & & & \\
1/6 & 2/3 & 1/6 & & & \\
 & 1/6 & 2/3 & 1/6 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1/6 & 2/3 & 1/6 \\
 & & & 3 & -6 & 3
\end{bmatrix}
\begin{bmatrix} c_{-3,3} \\ c_{-2,3} \\ c_{-1,3} \\ \vdots \\ c_{n-2,3} \\ c_{n-1,3} \end{bmatrix}
=
\begin{bmatrix} 0 \\ f_0 \\ f_1 \\ \vdots \\ f_n \\ 0 \end{bmatrix}.
\tag{1.43}
\]

This system of $n+3$ equations in $n+3$ variables has a unique solution,
the natural cubic spline interpolant. (It is a useful exercise to work out the
extra rows you would add to (1.41) to impose complete or not-a-knot boundary
conditions.)

Figure 1.25 shows the natural cubic spline interpolant to the
data (1.37). Clearly this spline satisfies the interpolation conditions,
but now there seems to be an artificial peak near $x = 5$ that you
might not have anticipated from the data values. This is a feature
of the natural boundary conditions: by forcing $S_3''$ to be zero at $x_0$
and $x_n$, we ensure that the spline $S_3$ has constant slope at $x_0$ and $x_n$.
Eventually this slope must be reversed, as our B-splines force $S_3(x)$
to be zero outside $(x_{-3}, x_{n+3})$, the support of the B-splines that
contribute to the sum (1.40).

Figure 1.25: Cubic spline $S_3$ interpolant to 5 data points $\{(x_j, f_j)\}_{j=0}^4$,
imposing the two extra natural spline conditions $S_3''(x_0) = S_3''(x_n) = 0$
to give a unique spline.

Of course, one can implement splines of higher degree, $k > 3$,
if greater continuity is required at the knots, or if there are more
than two boundary conditions to impose (e.g., if one wants both first
and second derivatives to be zero at the boundary). The procedure
in that case follows the pattern detailed above: work out the entries
in the matrix (1.36) and add in rows to encode the additional $k-1$
constraints needed to specify a unique degree-$k$ spline interpolant.
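For the cubic case, the natural spline system (1.43) for the data (1.37) can be assembled and solved as sketched below. As with any such sketch, the details are one possible arrangement rather than code from the notes: the cubic B-splines are evaluated with the recurrence (1.33) on a grid covering their whole support $[x_{-3}, x_{n+3}]$, the interpolation rows are read off at the knots, and the two natural-condition rows are added. Variable names are illustrative.

```matlab
xk = (0:4)';  fk = [1 3 2 -1 1]';     % data set (1.37): uniform knots with h = 1
n  = numel(xk) - 1;  h = 1;  k = 3;
t  = xk(1) + h*(-k:n+k)';             % extended knots x_{-3},...,x_{n+3}
xx = (t(1):h/8:t(end))';              % grid on [x_{-3}, x_{n+3}]; every 8th point is a knot

% build B(:,i) = B_{j,3}(xx) with j = i-4, using the recurrence (1.33)
m = numel(t);
B = zeros(numel(xx), m-1);
for i = 1:m-1
   B(:,i) = (xx >= t(i)) & (xx < t(i+1));
end
for d = 1:k
   Bnew = zeros(numel(xx), m-1-d);
   for i = 1:m-1-d
      Bnew(:,i) = (xx - t(i))/(t(i+d) - t(i)) .* B(:,i) ...
                + (t(i+d+1) - xx)/(t(i+d+1) - t(i+1)) .* B(:,i+1);
   end
   B = Bnew;
end

rows = 8*k + (1:8:8*n+1);             % grid indices of the data knots x_0,...,x_n
A   = [ [3 -6 3 zeros(1,n)] ; B(rows,:) ; [zeros(1,n) 3 -6 3] ];   % system (1.43)
rhs = [0; fk; 0];
c   = A \ rhs;                        % coefficients c_{-3,3},...,c_{n-1,3}
S3  = B*c;                            % natural cubic spline on the grid xx
plot(xx,S3,'b-'), hold on, plot(xk,fk,'ko')
```

Plotting over the full support makes the behavior outside $[x_0, x_n]$ visible, including the peak near $x = 5$ discussed above.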

1.11.8 Optimality properties of splines


Splines often enjoy a beautiful property: among all sufficiently
smooth interpolants, certain splines minimize ‘energy’, quantified
for a function $g \in C^2[x_0, x_n]$ as

\[ \int_{x_0}^{x_n} g''(x)^2 \, dx. \]

To give a flavor for such results, we present one example. (For a similar
result involving complete cubic splines, see Theorem 2.3.1 of
Gautschi’s Numerical Analysis (2nd ed., Birkhäuser, 2012). The proof here is an
easy adaptation of Gautschi’s.)

Theorem 1.10 (Natural cubic splines minimize energy).
Suppose $S_3$ is the natural cubic spline interpolant to $\{(x_j, f_j)\}_{j=0}^n$, and
$g$ is any $C^2[x_0, x_n]$ function that also interpolates the same data. Then

\[ \int_{x_0}^{x_n} S_3''(x)^2 \, dx \le \int_{x_0}^{x_n} g''(x)^2 \, dx. \]

Proof. The proof will actually quantify how much larger the energy of $g$ is
than that of $S_3$ by showing that

\[ \int_{x_0}^{x_n} g''(x)^2 \, dx = \int_{x_0}^{x_n} S_3''(x)^2 \, dx + \int_{x_0}^{x_n} \big( g''(x) - S_3''(x) \big)^2 \, dx. \tag{1.44} \]

Expanding the right-hand side, see that this claim is equivalent to

\[ \int_{x_0}^{x_n} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx = 0. \tag{1.45} \]

To prove this claim, break the integral on the left into segments
$[x_{j-1}, x_j]$ between the knots. (This decomposition of $[x_0, x_n]$ will
allow us to exploit the fact that $S_3$ is a standard cubic polynomial, and
hence infinitely differentiable, on these subintervals.) Write

\[ \int_{x_0}^{x_n} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx = \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx. \]

On each subinterval, integrate by parts to obtain

\[ \int_{x_{j-1}}^{x_j} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx = \Big[ \big( g'(x) - S_3'(x) \big) S_3''(x) \Big]_{x = x_{j-1}}^{x_j} - \int_{x_{j-1}}^{x_j} \big( g'(x) - S_3'(x) \big) S_3'''(x) \, dx. \tag{1.46} \]

Focus now on the integral on the right-hand side; we can show it is
zero by integrating it by parts to get

\[ \int_{x_{j-1}}^{x_j} \big( g'(x) - S_3'(x) \big) S_3'''(x) \, dx = \Big[ \big( g(x) - S_3(x) \big) S_3'''(x) \Big]_{x = x_{j-1}}^{x_j} - \int_{x_{j-1}}^{x_j} \big( g(x) - S_3(x) \big) S_3''''(x) \, dx. \tag{1.47} \]

The boundary term on the right is zero, since $g(x_\ell) - S_3(x_\ell) = 0$ for
$\ell = 0, \ldots, n$ (both $g$ and $S_3$ must interpolate the data). The integral
on the right is also zero: since $S_3$ is a cubic polynomial on $[x_{j-1}, x_j]$,
$S_3''''(x) = 0$. Thus (1.46) reduces to

\[ \int_{x_{j-1}}^{x_j} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx = \Big[ \big( g'(x) - S_3'(x) \big) S_3''(x) \Big]_{x = x_{j-1}}^{x_j}. \]

Adding up these contributions over all the subintervals,

\[ \int_{x_0}^{x_n} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx = \sum_{j=1}^{n} \Big[ \big( g'(x) - S_3'(x) \big) S_3''(x) \Big]_{x = x_{j-1}}^{x_j}. \]

Most of the boundary terms on the right cancel one another out (since $g'$,
$S_3'$, and $S_3''$ are all continuous at the interior knots), leaving only

\[ \int_{x_0}^{x_n} \big( g''(x) - S_3''(x) \big) S_3''(x) \, dx = \big( g'(x_n) - S_3'(x_n) \big) S_3''(x_n) - \big( g'(x_0) - S_3'(x_0) \big) S_3''(x_0). \]

Each of the terms on the right is zero by virtue of the natural cubic
spline condition $S_3''(x_0) = S_3''(x_n) = 0$. This confirms the formula (1.45),
and hence the equivalent (1.44) that quantifies how much
larger the energy of $g$ can be than that of $S_3$.
72

1.11.9 Some omissions


The great utility of B-splines in engineering has led to the develop-
ment of the subject far beyond these basic notes. Among the omis-
sions are: interpolation imposed at points distinct from the knots,
convergence of splines to the function they are approximating as the
number of knots increases, integration and differentiation of splines,
tension splines, etc. Splines in higher dimensions (‘thin-plate splines’)
are used, for example, to design the panels of a car body.

1.12 Handling Polynomials in MATLAB

To close this discussion of interpolating polynomials, we mention a


few notes about polynomials in matlab.

1.12.1 MATLAB’s Polynomial Format


By convention, matlab represents polynomials by their coefficients, listed by decreasing powers of $x$. Thus $c_0 + c_1 x + c_2 x^2 + c_3 x^3$ is represented by the vector

[c_3 c_2 c_1 c_0],

while $7 + 3x + 5x^3 - 2x^4$ would be represented by

[-2 5 0 3 7].

In this last example note the 0 corresponding to the $x^2$ term: all lower powers of $x$ must be accounted for in the coefficient vector.
Given a polynomial in a vector, say p = [-2 5 0 3 7], one can evaluate $p(x)$ using the command polyval, e.g.
>> polyval(p,x)

This command works if x is a scalar or a vector. Thus, for example, to


plot p( x ) for x ∈ [0, 1], one could use
>> x = linspace(0,1,500); % 500 uniform points between 0 and 1
>> plot(x,polyval(p,x)) % plot p(x) with x from 0 to 1
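As a quick check of the coefficient convention, the following sketch evaluates the example polynomial at a few points both term by term and with polyval. The variable names `direct` and `viaPolyval` are ours, chosen only for illustration.

```matlab
p = [-2 5 0 3 7];                     % 7 + 3x + 5x^3 - 2x^4, highest power first
x = [0 1 2];                          % a few sample points
direct = 7 + 3*x + 5*x.^3 - 2*x.^4;   % evaluate the formula term by term
viaPolyval = polyval(p, x);           % same values from the coefficient vector
disp([direct; viaPolyval])            % the two rows should agree
```

Forgetting the 0 for the missing $x^2$ term would shift the higher coefficients down by one power, so a comparison of this kind is a cheap safeguard.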

One can also compute the roots of polynomials very easily with the command

>> roots(p) % compute roots of p(x)=0

though one should be cautious of numerical errors when the degree of the polynomial is large. One can construct a polynomial directly from its roots, using the poly command. For example,

>> poly([1:4])
ans =
1 -10 35 -50 24

[Margin note: Type type roots to see matlab’s code for the roots command. Scan to the bottom to see the crucial lines. From the coefficients matlab constructs a companion matrix, then computes its eigenvalues using the eig command. For some (larger degree) polynomials, these eigenvalues are very sensitive to perturbations, and the roots can be very inaccurate. For a famous example due to Wilkinson, try roots(poly([1:24])), which should return the roots 1, . . . , 24.]

poly returns the coefficients of the monic polynomial with roots 1, 2, 3, 4:

$$24 - 50x + 35x^2 - 10x^3 + x^4 = (x-1)(x-2)(x-3)(x-4).$$

This gives a slick way to construct the Lagrange basis function

$$\ell_j(x) = \prod_{\substack{k=0 \\ k \ne j}}^{n} \frac{x - x_k}{x_j - x_k}$$

given the vector xx = $[x_0 \cdots x_n]$ of interpolation points:

>> ell_j = poly(xx([1:j j+2:end])); % specify roots of ell_j
>> ell_j = ell_j/polyval(ell_j,xx(j+1)); % scale so ell_j(xx(j+1)) = 1

Note that the indices of xx account for the fact that $x_j$ = xx(j+1).
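To see that this construction behaves as intended, the sketch below builds every $\ell_j$ for a small, arbitrarily chosen set of points and tabulates $\ell_j(x_k)$. The matrix name `V` and the particular points are assumptions made for the illustration.

```matlab
xx = [0 0.25 0.5 0.75 1];        % interpolation points x_0,...,x_4 (arbitrary example)
n  = length(xx) - 1;
V  = zeros(n+1, n+1);            % V(j+1,k+1) will hold ell_j(x_k)
for j = 0:n
    ell_j = poly(xx([1:j j+2:end]));           % roots are all x_k with k ~= j
    ell_j = ell_j / polyval(ell_j, xx(j+1));   % scale so ell_j(x_j) = 1
    V(j+1, :) = polyval(ell_j, xx);            % evaluate ell_j at every node
end
disp(V)                          % should be (numerically) the identity matrix
```

Each row should contain a single 1 on the diagonal and zeros elsewhere, reflecting $\ell_j(x_k) = 1$ if $j = k$ and $0$ otherwise.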

1.12.2 Constructing Polynomial Interpolants


matlab has a built-in code for constructing polynomial interpolants. In fact, it is a special case of the polynomial approximation code polyfit. When you request that polyfit produce a degree-n polynomial through n + 1 pairs of data, you obtain an interpolant. For example, the following code will interpolate $f(x) = \sqrt{x}$ at $x_j = j/4$ for $j = 0, \ldots, 4$:

[Margin note: Beware! The numerical implementation of polyfit is not ideal for polynomial interpolation: the code uses the Vandermonde basis. Thus, restrict your use of polyfit to low degree polynomials. The command type polyfit will show you matlab’s code.]

>> f = @(x) sqrt(x); % define f
>> xx = [0:4]/4; % define interpolation points
>> p = polyfit(xx,f(xx),4); % quartic polynomial interpolant
>> polyval(p,xx) % evaluate p at interpolation points
ans =
0.0000 0.5000 0.7071 0.8660 1.0000
>> f(xx) % compare to f at interpolation points
ans =
0 0.5000 0.7071 0.8660 1.0000
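The warning about the Vandermonde basis can be made concrete with a small experiment. The sketch below assumes uniformly spaced points on [0, 1] and uses the standard commands vander and cond; it is an illustration of the conditioning issue, not a reproduction of what polyfit does internally.

```matlab
degrees = [4 8 12 16];                 % polynomial degrees to try
kappa = zeros(size(degrees));
for i = 1:length(degrees)
    n  = degrees(i);
    xx = (0:n)/n;                      % n+1 uniformly spaced points on [0,1]
    kappa(i) = cond(vander(xx));       % 2-norm condition number of the Vandermonde matrix
end
disp([degrees; kappa])                 % condition number grows rapidly with the degree
```

A large condition number means that small rounding errors in the data can be greatly amplified in the computed coefficients, which is why low degrees are recommended.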

1.12.3 Piecewise Polynomial Interpolants and Splines


matlab also includes a general-purpose interp1 command that
constructs various piecewise polynomial interpolants. For example,
the ’linear’ option constructs piecewise linear interpolants.

>> f = @(x) sin(3*pi*x); % define f


>> xx = [0:10]/10; % define "knots"
>> x = linspace(0,1,500); % evaluation points
>> plot(x,interp1(xx,f(xx),x,'linear')) % plot piecewise linear interpolant

Alternatively, the 'spline' option constructs the not-a-knot cubic spline approximation.

>> plot(x,interp1(xx,f(xx),x,'spline')) % plot cubic spline interpolant
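To compare the two options quantitatively rather than by eye, one can measure the maximum error on the evaluation grid. The following sketch reuses the same f, knots, and evaluation points; the names `errLinear` and `errSpline` are ours.

```matlab
f  = @(x) sin(3*pi*x);            % function to approximate
xx = (0:10)/10;                   % knots
x  = linspace(0, 1, 500);         % evaluation points
errLinear = max(abs(f(x) - interp1(xx, f(xx), x, 'linear')));   % piecewise linear error
errSpline = max(abs(f(x) - interp1(xx, f(xx), x, 'spline')));   % not-a-knot spline error
fprintf('max error: linear %.2e, spline %.2e\n', errLinear, errSpline);
```

For a smooth function such as this one, the cubic spline error should be noticeably smaller than the piecewise linear error on the same knots.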


The spline command (which interp1 uses to construct the spline) will return matlab’s data structure that stores the cubic spline interpolant.

>> S = spline(xx,f(xx))
S =
form: 'pp'
breaks: [1x11 double]
coefs: [10x4 double]
pieces: 10
order: 4
dim: 1

For example, S.breaks contains the list of knots. One can also pass arguments to spline to specify complete boundary conditions. However, there is no easy way to impose natural boundary conditions. For more sophisticated data fitting operations, matlab offers a Curve Fitting Toolbox (which fits both curves and surfaces).

[Margin note: Another option to interp1 has a misleading name: 'pchip' constructs a particular spline-like interpolant designed to be quite smooth: it cannot match any derivative information about f, as no derivative information is even passed to the function.]
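As a sketch of the complete boundary conditions mentioned above: when the data vector passed to spline has two more entries than the vector of knots, the first and last entries are treated as end slopes. The derivative handle `fp` and the variable `slopeLeft` below are our own names for the illustration.

```matlab
f  = @(x) sin(3*pi*x);
fp = @(x) 3*pi*cos(3*pi*x);       % derivative of f, needed for the end slopes
xx = (0:10)/10;                   % knots
S  = spline(xx, [fp(xx(1)) f(xx) fp(xx(end))]);   % two extra values = end slopes
% Each row of S.coefs holds [c3 c2 c1 c0] for the local cubic
% c3*(x-xj)^3 + c2*(x-xj)^2 + c1*(x-xj) + c0 on one piece,
% so S.coefs(1,3) is the slope of the spline at the left end point.
slopeLeft = S.coefs(1, 3);
fprintf('spline slope at x0: %.6f, f''(x0): %.6f\n', slopeLeft, fp(xx(1)));
```

The returned structure can then be evaluated with ppval(S,x), just as for the not-a-knot spline.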

1.12.4 Chebfun
Chebfun is a free package of matlab routines developed by Nick
Trefethen and colleagues at Oxford University. Using sophisticated
techniques from polynomial approximation theory, Chebfun auto-
matically represents an arbitrary (piecewise smooth) function f ( x ) to
machine precision, and allows all manner of operations on this func-
tion, overloading every conceivable matlab matrix/vector operation.
There is no way to do this beautiful and powerful system justice in
a few lines of text here. Go to chebfun.org, download the software,
and start exploring. Suffice to say, Chebfun can significantly enrich one’s study and practice of numerical analysis.
[Margin note: In fact, Chebfun was used to generate a number of the plots in these notes.]
2
Approximation Theory

lecture 12: Introduction to Approximation Theory

Interpolation is an invaluable tool in numerical analysis: it provides an easy way to replace a complicated function by a polynomial (or piecewise polynomial), and, at least as importantly, it provides a mechanism for developing numerical algorithms for more sophisticated problems. Interpolation is not the only way to approximate a function, though; indeed, we have seen that the quality of the approximation can depend perilously on the choice of interpolation points.

[Margin note: We saw one example in Section 1.7: finite difference formulas for approximating derivatives and solving differential equation boundary value problems. Several other applications will follow later in the semester.]

If approximation is our goal, interpolation is only one means to
that end. In this chapter we investigate alternative approaches that
directly optimize the quality of the approximation. How do we mea-
sure this quality? That depends on the application. Perhaps the most
natural means is to minimize the maximum error of the approximation.

Given $f \in C[a,b]$, find $p_* \in \mathcal{P}_n$ such that

$$\max_{x \in [a,b]} |f(x) - p_*(x)| = \min_{p \in \mathcal{P}_n} \max_{x \in [a,b]} |f(x) - p(x)|.$$

This is called the minimax approximation problem.


Norms clarify the notation. For any $g \in C[a,b]$, define

$$\|g\|_\infty := \max_{x \in [a,b]} |g(x)|,$$

the ‘infinity norm of g’. One can show that $\|\cdot\|_\infty$ satisfies the basic norm axioms on the vector space $C[a,b]$ of continuous functions.

[Margin note: The norm axioms:
$\|g\|_\infty \ge 0$ for all $g \in C[a,b]$;
$\|g\|_\infty = 0 \iff g(x) = 0$ for all $x \in [a,b]$;
$\|\alpha g\|_\infty = |\alpha|\,\|g\|_\infty$ for all $\alpha \in \mathbb{C}$, $g \in C[a,b]$;
$\|g + h\|_\infty \le \|g\|_\infty + \|h\|_\infty$ for all $g, h \in C[a,b]$.]

Thus the minimax approximation problem seeks $p_* \in \mathcal{P}_n$ such that

$$\|f - p_*\|_\infty = \min_{p \in \mathcal{P}_n} \|f - p\|_\infty.$$

Notice that, for better or worse, this approximation will be heavily


influenced by extreme values of f ( x ), even if they occur over only a
small range of x ∈ [ a, b].
Some applications call instead for an approximation that balances
the size of the errors against the range of x values over which they
are attained. In such cases it is most common to minimize the inte-
gral of the square of the error, the least squares approximation problem.

Given $f \in C[a,b]$, find $p_* \in \mathcal{P}_n$ such that

$$\left(\int_a^b \bigl(f(x) - p_*(x)\bigr)^2\,dx\right)^{1/2} = \min_{p \in \mathcal{P}_n} \left(\int_a^b \bigl(f(x) - p(x)\bigr)^2\,dx\right)^{1/2}.$$

This problem is often associated with energy minimization in mechan-


ics, giving one motivation for its widespread appeal. As before, we
express this more compactly by introducing the two-norm of $g \in C[a,b]$:
$$\|g\|_2 = \left(\int_a^b |g(x)|^2\,dx\right)^{1/2},$$

so the least squares problem becomes

$$\|f - p_*\|_2 = \min_{p \in \mathcal{P}_n} \|f - p\|_2.$$

This chapter focuses on these two problems. Before attacking them


we mention one other possibility, minimizing the integral of the absolute value of the error: the least absolute deviations problem.

Given $f \in C[a,b]$, find $p_* \in \mathcal{P}_n$ such that

$$\int_a^b |f(x) - p_*(x)|\,dx = \min_{p \in \mathcal{P}_n} \int_a^b |f(x) - p(x)|\,dx.$$

[Margin note: This problem has become quite important in recent years. In particular, the analogous problem resulting when $f$ is replaced by its vector discretization $f \in \mathbb{C}^n$ plays a pivotal role in compressive sensing.]

With this problem we associate the one-norm of $g \in C[a,b]$,

$$\|g\|_1 = \int_a^b |g(x)|\,dx,$$

giving the least absolute deviations problem as

$$\|f - p_*\|_1 = \min_{p \in \mathcal{P}_n} \|f - p\|_1.$$
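The three norms can give quite different measurements of the same error function. As a sketch, the code below estimates all three for one hypothetical error function on [0, 1], namely $e^x$ minus the line through its end points; this particular `g`, and the use of a fine grid to estimate the maximum, are assumptions made for the illustration.

```matlab
g = @(x) exp(x) - (1 + (exp(1)-1)*x);           % example error function on [0,1]
a = 0;  b = 1;
x = linspace(a, b, 10001);                      % fine grid for the maximum
normInf = max(abs(g(x)));                       % grid estimate of the infinity norm
norm2   = sqrt(integral(@(t) g(t).^2, a, b));   % two-norm via numerical quadrature
norm1   = integral(@(t) abs(g(t)), a, b);       % one-norm via numerical quadrature
fprintf('inf-norm %.4f, 2-norm %.4f, 1-norm %.4f\n', normInf, norm2, norm1);
```

On an interval of length one the three values are ordered, with the one-norm smallest and the infinity norm largest; the infinity norm reacts only to the single worst point, while the integral norms average over the whole interval.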

2.1 Minimax Approximation: General Theory

The goal of minimizing the maximum error of a polynomial $p$ from the function $f \in C[a,b]$ is called minimax (or uniform, or $L^\infty$) approximation: Find $p_* \in \mathcal{P}_n$ such that

$$\|f - p_*\|_\infty = \min_{p \in \mathcal{P}_n} \|f - p\|_\infty.$$

Let us begin by connecting this problem to polynomial interpolation.


On Problem Set 2 you were asked to prove that

(2.1)  $$\|f - \Pi_n f\|_\infty \le \bigl(1 + \|\Pi_n\|_\infty\bigr)\,\|f - p_*\|_\infty,$$

where $\Pi_n$ is the linear interpolation operator for

$$x_0 < x_1 < \cdots < x_n$$

with $x_0, \ldots, x_n \in [a,b]$. Here $\|\Pi_n\|_\infty$ is the operator norm of $\Pi_n$:

$$\|\Pi_n\|_\infty = \max_{\substack{f \in C[a,b] \\ f \ne 0}} \frac{\|\Pi_n f\|_\infty}{\|f\|_\infty}.$$

[Margin note: That is, $p = \Pi_n f \in \mathcal{P}_n$ is the polynomial that interpolates $f$ at $x_0, \ldots, x_n$.]

You further showed that

$$\|\Pi_n\|_\infty = \max_{x \in [a,b]} \sum_{j=0}^{n} |\ell_j(x)|,$$

where $\ell_j$ denotes the jth Lagrange interpolation basis function

$$\ell_j(x) = \prod_{\substack{k=0 \\ k \ne j}}^{n} \frac{x - x_k}{x_j - x_k}.$$
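The formula for $\|\Pi_n\|_\infty$ is easy to evaluate numerically. The sketch below estimates it on a fine grid for uniformly spaced points on [-1, 1]; the choice of interval, the degree n = 10, and the grid resolution are assumptions made for the illustration.

```matlab
n  = 10;
xx = linspace(-1, 1, n+1);        % n+1 uniformly spaced interpolation points
x  = linspace(-1, 1, 2001);       % fine grid on which to take the maximum
lebesgue = zeros(size(x));        % will hold sum_j |ell_j(x)|
for j = 1:n+1
    others = xx([1:j-1 j+1:end]);                 % all nodes except xx(j)
    ell = ones(size(x));
    for k = 1:n
        ell = ell .* (x - others(k)) / (xx(j) - others(k));   % build ell_j factor by factor
    end
    lebesgue = lebesgue + abs(ell);               % accumulate |ell_j(x)|
end
normPi = max(lebesgue);           % grid estimate of ||Pi_n||_inf
fprintf('estimated ||Pi_%d||_inf = %.2f\n', n, normPi);
```

Changing `xx` to a different point distribution shows how strongly this quantity, and hence the bound (2.1), depends on where the interpolation points are placed.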

Now appreciate the utility of bound (2.1): the error of the interpolant $\Pi_n f$ (which is easy to compute) is within a factor of $1 + \|\Pi_n\|_\infty$ of the error of the optimal approximation $p_*$. Note that $\|\Pi_n\|_\infty \ge 1$: how much larger than one depends on the distribution of the interpolation points.
In the following sections we shall characterize and compute $p_*$ (indeed more difficult than computing the interpolant), then use the theory of minimax approximation to find an excellent set of almost fail-safe interpolation points.
We begin by working out a simple example by hand.

Example 2.1. Suppose we seek the constant that best approximates $f(x) = e^x$ over the interval [0, 1], shown in the margin. Before going on, sketch out a constant function (degree-0 polynomial) that approximates $f$ in a manner that minimizes the maximum error.

[Margin figure: plot of $f(x) = e^x$ for $x \in [0,1]$.]

Since $f(x)$ increases monotonically for $x \in [0,1]$, the optimal constant approximation $p_* = c_0$ must fall between $f(0) = 1$ and $f(1) = e$, i.e., $1 \le c_0 \le e$. Moreover, since $f$ is monotonic and $p_*$ is a constant, the function $f - p_*$ is also monotonic, so the maximum error $\max_{x \in [0,1]} |f(x) - p_*(x)|$ must be attained at one of the end points, $x = 0$ or $x = 1$. Thus,

$$\|f - p_*\|_\infty = \max\{|e^0 - c_0|,\ |e^1 - c_0|\}.$$

The picture to the right shows $|e^0 - c_0|$ (blue) and $|e^1 - c_0|$ (red) for $c_0 \in [1, e]$. The optimal value for $c_0$ will be the point at which the larger of these two lines is minimal. The figure clearly reveals that this happens when the errors are equal, at $c_0 = (1+e)/2$. We conclude that the optimal minimax constant polynomial approximation to $e^x$ on $x \in [0,1]$ is $p_*(x) = c_0 = (1+e)/2$.

[Margin figure: $|e^0 - c_0|$ (blue) and $|e^1 - c_0|$ (red) as functions of $c_0 \in [1, e]$; the two lines cross at $c_0 = (1+e)/2$, where both equal $(e-1)/2$.]

The plots in Figure 2.1 compare f to the optimal polynomial p∗


(top), and show the error f − p∗ (bottom). We picked c0 so that the
error f − p∗ was equal in magnitude at the end points x = 0 and
x = 1; in fact, it is equal in magnitude, but opposite in sign,

$$e^0 - c_0 = -(e^1 - c_0).$$

This property—maximal error being attained with alternating sign—


is a key feature of minimax approximation.
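The conclusion of Example 2.1 can be confirmed numerically. The sketch below minimizes the grid maximum of $|e^x - c|$ over $c \in [1, e]$ with the standard command fminbnd; the grid and the names `maxErr` and `cOpt` are our own choices for the illustration.

```matlab
x = linspace(0, 1, 1001);                         % grid containing both end points
maxErr = @(c) max(abs(exp(x) - c));               % ||f - c||_inf estimated on the grid
cOpt = fminbnd(maxErr, 1, exp(1));                % minimize over c in [1, e]
fprintf('numerical c0 = %.5f, (1+e)/2 = %.5f\n', cOpt, (1+exp(1))/2);
fprintf('errors at x=0 and x=1: %.5f, %.5f\n', exp(0)-cOpt, exp(1)-cOpt);
```

The two printed errors should have (nearly) equal magnitude and opposite sign, matching the alternation described above.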

[Figure 2.1: Minimax approximation of degree $k = 0$ to $f(x) = e^x$ on $x \in [0,1]$. The top plot compares $f$ to $p_*$; the bottom plot shows the error $f - p_*$, whose extreme magnitude is attained, with opposite sign, at two values of $x \in [0,1]$.]

lecture 13: Equioscillation, Part 1

2.2 Oscillation Theorem

The previous example hints that the points at which the error $f - p_*$ attains its maximum magnitude play a central role in the theory of minimax approximation. The Theorem of de la Vallée Poussin is a first step toward such a result. We include its proof to give a flavor of how such results are established.

[Margin note: The proof is adapted from Section 8.3 of Süli and Mayers, An Introduction to Numerical Analysis (Cambridge, 2003).]

Theorem 2.1 (de la Vallée Poussin’s Theorem).
Let $f \in C[a,b]$ and suppose $r \in \mathcal{P}_n$ is some polynomial for which there exist $n + 2$ points $\{x_j\}_{j=0}^{n+1}$ with $a \le x_0 < x_1 < \cdots < x_{n+1} \le b$ at which the error $f(x) - r(x)$ oscillates in sign, i.e.,

$$\mathrm{sgn}\bigl(f(x_j) - r(x_j)\bigr) = -\mathrm{sgn}\bigl(f(x_{j+1}) - r(x_{j+1})\bigr)$$

for $j = 0, \ldots, n$. Then

(2.2)  $$\min_{p \in \mathcal{P}_n} \|f - p\|_\infty \ge \min_{0 \le j \le n+1} |f(x_j) - r(x_j)|.$$

[Margin note: $\mathrm{sgn}(x) = 1$ if $x > 0$; $\mathrm{sgn}(x) = 0$ if $x = 0$; $\mathrm{sgn}(x) = -1$ if $x < 0$.]

Before proving this result, look at Figure 2.2 for an illustration of the theorem. Suppose we wish to approximate $f(x) = e^x$ with some quintic polynomial, $r \in \mathcal{P}_5$ (i.e., $n = 5$). This polynomial is not necessarily the minimax approximation to $f$ over the interval [0, 1]. However, in the figure it is clear that for this $r$, we can find $n + 2 = 7$ points at which the sign of the error $f(x) - r(x)$ oscillates. The red curve shows the error for the optimal minimax polynomial $p_*$ (whose computation is discussed below). This is the point of de la Vallée Poussin’s theorem: since the error $f(x) - r(x)$ oscillates in sign at $n + 2$ points, the minimax error $\|f - p_*\|_\infty$ is at least as large as $|f(x_j) - r(x_j)|$ at one of the points $x_j$ that give the oscillating sign. In other words, de la Vallée Poussin’s theorem gives a nice mechanism for developing lower bounds on $\|f - p_*\|_\infty$.

[Margin note: These $n + 2$ points are by no means unique: we have a continuum of choices available. However, taking the extrema of $f - r$ will give the best bounds in the theorem.]
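A small numerical sketch shows how the theorem brackets the minimax error. Here we take $f(x) = e^x$ on [0, 1] with $n = 1$ and a hypothetical trial line $r(x) = 0.9 + 1.7x$, chosen by us only for illustration (it is not claimed to be optimal). The three test points are the end points and the interior extremum of $f - r$.

```matlab
f = @(x) exp(x);
r = @(x) 0.9 + 1.7*x;             % hypothetical trial polynomial in P_1 (not optimal)
xj = [0, log(1.7), 1];            % n+2 = 3 points: end points and interior extremum of f - r
ej = f(xj) - r(xj);               % errors at these points
disp(sign(ej))                    % signs should alternate: 1 -1 1
lowerBound = min(abs(ej));        % de la Vallee Poussin lower bound on the minimax error
x = linspace(0, 1, 10001);
upperBound = max(abs(f(x) - r(x)));   % the error of any r in P_1 is an upper bound
fprintf('%.5f <= minimax error <= %.5f\n', lowerBound, upperBound);
```

The closer the trial polynomial is to equioscillating, the tighter this bracket becomes; Example 2.2 later finds the exact value for this $f$ and $n$.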

Proof. Suppose we have $n + 2$ ordered points, $\{x_j\}_{j=0}^{n+1} \subset [a,b]$, such that $f(x_j) - r(x_j)$ alternates sign at consecutive points, and let $p_*$ denote the minimax polynomial,

$$\|f - p_*\|_\infty = \min_{p \in \mathcal{P}_n} \|f - p\|_\infty.$$

We will prove the result by contradiction. Thus suppose

(2.3)  $$\|f - p_*\|_\infty < |f(x_j) - r(x_j)|, \quad \text{for all } j = 0, \ldots, n+1.$$


[Figure 2.2: Illustration of de la Vallée Poussin’s theorem for $f(x) = e^x$ and $n = 5$. Some polynomial $r \in \mathcal{P}_5$ gives an error $f - r$ for which we can identify $n + 2 = 7$ points $x_j$, $j = 0, \ldots, n+1$ (black dots) at which $f(x_j) - r(x_j)$ oscillates sign. The minimum value of $|f(x_j) - r(x_j)|$ gives a lower bound on the maximum error $\|f - p_*\|_\infty$ of the optimal approximation $p_* \in \mathcal{P}_5$. The plot shows $f(x) - r(x)$ and $f(x) - p_*(x)$ for $x \in [0,1]$, on a vertical scale of $10^{-6}$.]

As the left hand side is the maximum difference of f − p∗ over all


x ∈ [ a, b], that difference can be no larger at x j ∈ [ a, b], and so:

(2.4) | f ( x j ) − p∗ ( x j )| < | f ( x j ) − r ( x j )|, for all j = 0, . . . , n + 1.

Now consider

p∗ ( x ) − r ( x ) = ( f ( x ) − r ( x )) − ( f ( x ) − p∗ ( x )),

which is a degree n polynomial, since p∗ , r ∈ Pn . Equation (2.4) states


that f ( x j ) − r ( x j ) always has larger magnitude than f ( x j ) − p∗ ( x j ).
Thus, regardless of the sign of f ( x j ) − p∗ ( x j ), the magnitude | f ( x j ) −
p∗ ( x j )| will never be large enough to overcome | f ( x j ) − r ( x j )|, and
hence
sgn( p∗ ( x j ) − r ( x j )) = sgn( f ( x j ) − r ( x j )).

We know from the hypothesis that $f(x) - r(x)$ must change sign at least $n + 1$ times (at least once in each interval $(x_j, x_{j+1})$ for $j = 0, \ldots, n$), and thus the polynomial $p_* - r$ must do the same. But $n + 1$ sign changes implies $n + 1$ roots; the only polynomial of degree at most $n$ with $n + 1$ roots is the zero polynomial, i.e., $p_* = r$. However, this contradicts the strict inequality in equation (2.3). Hence, there must be at least one $j$ for which

$$\|f - p_*\|_\infty \ge |f(x_j) - r(x_j)|,$$

thus yielding the theorem.

Now suppose we can find some degree-n polynomial, call it $\tilde r \in \mathcal{P}_n$, and $n + 2$ points $x_0 < \cdots < x_{n+1}$ in $[a,b]$ such that not only does the sign of $f - \tilde r$ oscillate, but the error takes its extremal values at these points. That is,

$$|f(x_j) - \tilde r(x_j)| = \|f - \tilde r\|_\infty, \quad j = 0, \ldots, n+1,$$

and

$$f(x_j) - \tilde r(x_j) = -\bigl(f(x_{j+1}) - \tilde r(x_{j+1})\bigr), \quad j = 0, \ldots, n.$$

[Margin note: The red curve in Figure 2.2 shows an error function that satisfies these requirements.]

Now apply de la Vallée Poussin’s theorem to this special polynomial $\tilde r$. Equation (2.2) gives

$$\min_{p \in \mathcal{P}_n} \|f - p\|_\infty \ge \min_{0 \le j \le n+1} |f(x_j) - \tilde r(x_j)|.$$

On the other hand, we have presumed that

$$|f(x_j) - \tilde r(x_j)| = \|f - \tilde r\|_\infty$$

for all $j = 0, \ldots, n+1$. Thus, by de la Vallée Poussin’s theorem,

$$\min_{p \in \mathcal{P}_n} \|f - p\|_\infty \ge \min_{0 \le j \le n+1} |f(x_j) - \tilde r(x_j)| = \|f - \tilde r\|_\infty.$$

Since $\tilde r \in \mathcal{P}_n$, we also have $\min_{p \in \mathcal{P}_n} \|f - p\|_\infty \le \|f - \tilde r\|_\infty$, and so it follows that

$$\min_{p \in \mathcal{P}_n} \|f - p\|_\infty = \|f - \tilde r\|_\infty,$$

and this equioscillating $\tilde r$ must be an optimal approximation to $f$.
The question remains: Does such a polynomial with equioscillating error always exist? The following theorem ensures it does.

[Margin note: For a direct proof, see Section 8.3 of Süli and Mayers. Another excellent resource is G. W. Stewart, Afternotes Goes to Graduate School, SIAM, 1998; see Stewart’s Lecture 3.]

Theorem 2.2 (Oscillation Theorem). Suppose $f \in C[a,b]$. Then $p_* \in \mathcal{P}_n$ is a minimax approximation to $f$ from $\mathcal{P}_n$ on $[a,b]$ if and only if there exist $n + 2$ points $x_0 < x_1 < \cdots < x_{n+1}$ in $[a,b]$ such that

$$|f(x_j) - p_*(x_j)| = \|f - p_*\|_\infty, \quad j = 0, \ldots, n+1,$$

and the sign of the error oscillates at these points:

$$f(x_j) - p_*(x_j) = -\bigl(f(x_{j+1}) - p_*(x_{j+1})\bigr), \quad j = 0, \ldots, n.$$

Note that this result is if and only if: the oscillation property exactly characterizes the minimax approximation. We have proved one direction already by appeal to de la Vallée Poussin’s theorem. The proof of the other direction is rather more involved.

lecture 14: Equioscillation, Part 2


A direct proof that an optimal minimax approximation $p_* \in \mathcal{P}_n$ must give an equioscillating error is rather tedious, requiring one to chase down the oscillation points one at a time. The following approach is a bit more appealing. We begin with a technical result from which the main theorem will readily follow.

Lemma 2.1. Let $p_* \in \mathcal{P}_n$ be a minimax approximation of $f \in C[a,b]$,

$$\|f - p_*\|_\infty = \min_{p \in \mathcal{P}_n} \|f - p\|_\infty,$$

and let $X$ denote the set of all points $x \in [a,b]$ for which

$$|f(x) - p_*(x)| = \|f - p_*\|_\infty.$$

Then for all $q \in \mathcal{P}_n$,

(2.5)  $$\max_{x \in X} \bigl(f(x) - p_*(x)\bigr)\, q(x) \ge 0.$$

[Margin note: This ‘lemma’ is a diluted version of Kolmogorov’s Theorem, which is (a) an ‘if and only if’ version of this lemma that (b) appeals to approximation with much more general classes of functions, not just polynomials, and (c) handles complex-valued functions. The proof here is adapted from that more general setting given in Theorem 2.1 of DeVore and Lorentz, Constructive Approximation (Springer, 1993).]

Proof. We will prove the lemma by contradiction. Suppose $p_* \in \mathcal{P}_n$ is a minimax approximation, but that (2.5) fails to hold, i.e., there exist some $\tilde q \in \mathcal{P}_n$ and $\varepsilon > 0$ such that

$$\max_{x \in X} \bigl(f(x) - p_*(x)\bigr)\, \tilde q(x) < -2\varepsilon.$$

We first note that $\|\tilde q\|_\infty > 0$. Since $\bigl(f(x) - p_*(x)\bigr)\tilde q(x)$ is a continuous function on $[a,b]$, it must remain negative on some sufficiently small neighborhood of $X$. More concretely, we can find $\delta > 0$ such that

(2.6)  $$\max_{x \in \tilde X} \bigl(f(x) - p_*(x)\bigr)\, \tilde q(x) < -\varepsilon,$$

where

$$\tilde X := \{\xi \in [a,b] : \min_{x \in X} |\xi - x| < \delta\}.$$

To arrive at a contradiction, we will design a function $\tilde p$ that better approximates $f$ than $p_*$, i.e., $\|f - \tilde p\|_\infty < \|f - p_*\|_\infty$. This function will take the form

$$\tilde p(x) = p_*(x) - \lambda \tilde q(x)$$

for a (small) constant $\lambda > 0$ we shall soon determine. Let $E := \|f - p_*\|_\infty$ and pick $M$ such that $|\tilde q(x)| \le M$ for all $x \in \tilde X$. Then for all $x \in \tilde X$,

$$|f(x) - \tilde p(x)|^2 = \bigl(f(x) - p_*(x)\bigr)^2 + 2\lambda \bigl(f(x) - p_*(x)\bigr)\tilde q(x) + \lambda^2 \tilde q(x)^2$$
$$\le E^2 + 2\lambda \bigl(f(x) - p_*(x)\bigr)\tilde q(x) + \lambda^2 \tilde q(x)^2$$
(2.7)  $$< E^2 - 2\lambda\varepsilon + \lambda^2 M^2,$$


where this inequality follows from (2.6). To show that $\tilde p$ is a better approximation to $f$ than $p_*$, it will suffice to show that the right-hand side of (2.7) is smaller than $E^2$: note that for any $\lambda \in (0, 2\varepsilon/M^2)$,

(2.8)  $$|f(x) - \tilde p(x)|^2 < E^2 - 2\lambda\varepsilon + \lambda^2 M^2 < E^2 - \frac{4\varepsilon^2}{M^2} + \frac{4\varepsilon^2}{M^2} = E^2 = \|f - p_*\|_\infty^2$$

for all $x \in \tilde X$. Thus $\tilde p$ beats $p_*$ on $\tilde X$. Now since $X$ comprises the points where $|f(x) - p_*(x)|$ attains its maximum, away from $\tilde X$ this error must be bounded away from its maximum, i.e., there exists some $\eta > 0$ such that

$$\max_{\substack{x \in [a,b] \\ x \notin \tilde X}} |f(x) - p_*(x)| \le E - \eta.$$

Now we want to show that $|f(x) - \tilde p(x)| < E$ for these $x \notin \tilde X$ as well. In particular, for such $x$

$$|f(x) - \tilde p(x)| = |f(x) - p_*(x) + \lambda \tilde q(x)| \le |f(x) - p_*(x)| + \lambda |\tilde q(x)| \le E - \eta + \lambda \|\tilde q\|_\infty,$$

and so if $\lambda \in (0, \eta/\|\tilde q\|_\infty)$,

$$|f(x) - \tilde p(x)| < E - \eta + \frac{\eta}{\|\tilde q\|_\infty}\,\|\tilde q\|_\infty = E.$$

In conclusion, if

$$\lambda \in \bigl(0, \min(2\varepsilon/M^2,\ \eta/\|\tilde q\|_\infty)\bigr),$$

then we have constructed $\tilde p(x) := p_*(x) - \lambda \tilde q(x)$ such that

$$|f(x) - \tilde p(x)| < E \quad \text{for all } x \in [a,b],$$

i.e., $\|f - \tilde p\|_\infty < \|f - p_*\|_\infty$, contradicting the optimality of $p_*$.

With this lemma, we can readily complete the proof of the Oscilla-
tion Theorem.

Completion of the Proof of the Oscillation Theorem. We must show that if $p_*$ is a minimax approximation to $f$, then there exist $n + 2$ points in $[a,b]$ on which the error $f - p_*$ changes sign. If $p_* = f$, the result holds trivially. Suppose then that $\|f - p_*\|_\infty > 0$. In the language of Lemma 2.1, we need to show that (a) the set $X$ contains (at least) $n + 2$ points and (b) the error oscillates sign at these points. Suppose this is not the case, i.e., we cannot identify $n + 2$ consecutive points in $X$ at which the error oscillates in sign. Suppose we can only identify $m$ such points, $1 \le m < n + 2$, which we label $x_0 < \cdots < x_{m-1}$. We will show how to construct a $q$ that violates Lemma 2.1.

[Margin note: Recall that $X$ contains all the points $x \in [a,b]$ for which the maximum error is attained: $|f(x) - p_*(x)| = \|f - p_*\|_\infty$.]

If $m = 1$, $f(x) - p_*(x)$ has the same sign for all $x \in X$. Set $q(x) = -\mathrm{sgn}(f(x_0) - p_*(x_0))$ (a constant, hence in $\mathcal{P}_n$), so that $(f(x) - p_*(x))\,q(x) < 0$ for all $x \in X$, contradicting Lemma 2.1.
If $m > 1$, then between each consecutive pair of these $m$ points one can identify $\tilde x_1, \ldots, \tilde x_{m-1}$ where the error changes sign. (See the sketch in the margin.) Then define

$$q(x) = \pm (x - \tilde x_1)(x - \tilde x_2) \cdots (x - \tilde x_{m-1}).$$

[Margin sketch for $m = 5$: dots mark $f(x) - p_*(x)$ for $x \in X$ at the points $x_0 < x_1 < \cdots < x_4$, with alternating sign; these are separated by $\tilde x_1 < \cdots < \tilde x_4$.]

Since $m < n + 2$ by assumption, $m - 1 \le n$, i.e., $q \in \mathcal{P}_n$, so Lemma 2.1 should hold with this choice of $q$. Since the sign of $q(x)$ does not change between its roots, it does not change within the intervals

$$(a, \tilde x_1),\ (\tilde x_1, \tilde x_2),\ \cdots,\ (\tilde x_{m-2}, \tilde x_{m-1}),\ (\tilde x_{m-1}, b),$$

and the sign of $q$ flips between each of these intervals. Thus the sign of $\bigl(f(x) - p_*(x)\bigr)\,q(x)$ is the same for all $x \in X$. Pick the $\pm$ sign in the definition of $q$ such that

$$\bigl(f(x) - p_*(x)\bigr)\,q(x) < 0 \quad \text{for all } x \in X,$$

thus contradicting Lemma 2.1. Hence, there must be (at least) $n + 2$ consecutive points in $X$ at which the error flips sign.

Thus far we have been careful to only speak of a minimax approximation, rather than the minimax approximation. In fact, the latter terminology is more precise, for the minimax approximant is unique.

Theorem 2.3 (Uniqueness of minimax approximant).


The minimax approximant p∗ ∈ Pn of f ∈ C [ a, b] over the interval
[ a, b] is unique.

The proof is a straightforward application of the Oscillation Theorem. Suppose $p_1$ and $p_2$ are both minimax approximations from $\mathcal{P}_n$ to $f$ on $[a,b]$. Then one can show that $(p_1 + p_2)/2$ is also a minimax approximation. Apply the Oscillation Theorem to obtain $n + 2$ points at which the error for $(p_1 + p_2)/2$ oscillates sign. One can show that these points must also be oscillation points for $p_1$ and $p_2$, and that $p_1$ and $p_2$ agree at these $n + 2$ points. Polynomials of degree $n$ that agree at $n + 2$ points must be the same.

[Margin note: For the full details of this proof, see Theorem 8.5 in Süli and Mayers.]
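The first step of this argument, that the average of two minimax approximations is again a minimax approximation, follows from the triangle inequality. As a sketch, writing $E := \|f - p_1\|_\infty = \|f - p_2\|_\infty$ for the common minimal error (a symbol introduced here for convenience):

```latex
\Bigl\| f - \tfrac{1}{2}(p_1 + p_2) \Bigr\|_\infty
  = \Bigl\| \tfrac{1}{2}(f - p_1) + \tfrac{1}{2}(f - p_2) \Bigr\|_\infty
  \le \tfrac{1}{2}\,\| f - p_1 \|_\infty + \tfrac{1}{2}\,\| f - p_2 \|_\infty
  = E .
```

Since $(p_1 + p_2)/2 \in \mathcal{P}_n$ and no polynomial in $\mathcal{P}_n$ can have error smaller than $E$, equality must hold, so the average attains the minimal error as well.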

This oscillation property forms the basis of algorithms that find the minimax approximation: iteratively adjust an approximating polynomial until it satisfies the oscillation property. The most famous algorithm for computing the minimax approximation is called the Remez exchange algorithm, essentially a specialized linear programming procedure. In exact arithmetic, this algorithm is guaranteed to terminate with the correct answer in finitely many operations.

[Figure 2.3: Illustration of the equioscillating minimax error $f - p_*$ for approximations of degree $n = 2, 3, 4$, and $5$ with $f(x) = e^x$ for $x \in [0,1]$. In each case, the error attains its maximum with alternating sign at $n + 2$ points.]

The oscillation property is demonstrated in Example 2.1, where we approximated $f(x) = e^x$ with a constant. Indeed, the maximum error is attained at two points (that is, $n + 2$, since $n = 0$), and the error differs in sign at those points. Figure 2.3 shows the errors $f(x) - p_*(x)$ for minimax approximations $p_*$ of increasing degree. The oscillation property becomes increasingly apparent as the polynomial degree increases. In each case, there are $n + 2$ extreme points of the error, where $n$ is the degree of the approximating polynomial.

[Margin note: These examples were computed in matlab using the Chebfun package’s remez algorithm. For details, see www.chebfun.org.]
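To give a feel for the "iteratively adjust" idea behind the Remez exchange algorithm, here is a deliberately simplified sketch for $f(x) = e^x$ on [0, 1] with $n = 2$. It is our own illustration, not Chebfun's implementation: the exchange step works on a fixed grid, assumes the error has exactly $n + 1$ sign changes, and simply stops if that assumption fails. All variable names are ours.

```matlab
f = @(x) exp(x);  n = 2;  a = 0;  b = 1;
xg  = linspace(a, b, 5001);                       % fine grid used in the exchange step
ref = (a+b)/2 - (b-a)/2*cos(pi*(0:n+1)/(n+1));    % n+2 increasing trial points in [a,b]
for iter = 1:10
    % Solve p(ref_i) + (-1)^i h = f(ref_i), i = 0..n+1, for the coefficients of p and h.
    A = zeros(n+2, n+2);
    for k = 0:n
        A(:, k+1) = ref(:).^k;                    % monomial columns 1, x, ..., x^n
    end
    A(:, n+2) = ((-1).^(0:n+1)).';                % alternating signs multiply h
    sol = A \ f(ref(:));
    p = flipud(sol(1:n+1)).';                     % coefficients in polyval order
    h = sol(n+2);                                 % levelled error on the trial points
    err = f(xg) - polyval(p, xg);                 % error on the fine grid
    % Exchange: split the grid where err changes sign; keep the extremum of each piece.
    cuts = find(err(1:end-1).*err(2:end) < 0);
    if length(cuts) ~= n+1, break; end            % sketch only: stop if pattern is unexpected
    first = [1, cuts+1];  last = [cuts, length(xg)];
    for i = 1:n+2
        [~, idx] = max(abs(err(first(i):last(i))));
        ref(i) = xg(first(i) + idx - 1);          % new trial point for piece i
    end
end
fprintf('levelled error |h| = %.3e, max |err| on grid = %.3e\n', abs(h), max(abs(err)));
```

When the iteration has settled, the levelled error $|h|$ on the trial points and the maximum error on the grid agree, which is the equioscillation property in numerical form.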

Example 2.2 ($e^x$ revisited). Now we shall use the Oscillation Theorem to compute the optimal linear minimax approximation to $f(x) = e^x$ on [0, 1]. Assume that the minimax polynomial $p_* \in \mathcal{P}_1$ has the form $p_*(x) = \alpha + \beta x$. Since $f$ is convex, a quick sketch of the situation suggests the maximal error will be attained at the end points of the interval, $x_0 = 0$ and $x_2 = 1$. We assume this to be true, and seek some third point $x_1 \in (0, 1)$ that attains the same maximal error, $\delta$, but with opposite sign. If we can find such a point, then the Oscillation Theorem guarantees that the resulting polynomial is optimal, confirming our assumption that the maximal error was attained at the ends of the interval.

This scenario suggests the following three equations:

$$f(x_0) - p_*(x_0) = \delta$$
$$f(x_1) - p_*(x_1) = -\delta$$
$$f(x_2) - p_*(x_2) = \delta.$$

Substituting the values $x_0 = 0$, $x_2 = 1$, and $p_*(x) = \alpha + \beta x$, these equations become

$$1 - \alpha = \delta$$
$$e^{x_1} - \alpha - \beta x_1 = -\delta$$
$$e - \alpha - \beta = \delta.$$

The first and third equation together imply $\beta = e - 1$. We also deduce that $2\alpha = e^{x_1} - x_1(e - 1) + 1$. A variety of choices for $x_1$ will satisfy these conditions, but in those cases $\delta$ will not be the maximal error. We must ensure that

$$|\delta| = \max_{x \in [0,1]} |f(x) - p_*(x)|.$$

To make this happen, require that the derivative of the error be zero at $x_1$, reflecting that the error $f - p_*$ attains a local minimum/maximum at $x_1$. (The plots in Figure 2.3 confirm that this is reasonable.) Imposing the condition that $f'(x_1) - p_*'(x_1) = 0$ yields

$$e^{x_1} - \beta = 0.$$

[Margin note: This requirement need not hold at the points $x_0$ and $x_2$, since these points are on the ends of the interval $[a,b]$; it is only required at the interior points where the extreme error is attained, $x_j \in (a,b)$.]

Now we can explicitly solve the equations to obtain

$$\alpha = \tfrac{1}{2}\bigl(e - (e-1)\log(e-1)\bigr) = 0.89406\ldots$$
$$\beta = e - 1 = 1.71828\ldots$$
$$x_1 = \log(e-1) = 0.54132\ldots$$
$$\delta = \tfrac{1}{2}\bigl(2 - e + (e-1)\log(e-1)\bigr) = 0.10593\ldots.$$

[Margin note: Notice that we have a system of four nonlinear equations in four unknowns, due to the $e^{x_1}$ term. Generally such nonlinear systems might not have a solution; in this case we can compute one.]

Figure 2.4 shows the optimal approximation, along with the error $f(x) - p_*(x) = e^x - (\alpha + \beta x)$. In particular, notice the size of the maximum error ($\delta = 0.10593\ldots$) and the point $x_1 = 0.54132\ldots$ at which this error is attained.
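The closed-form values in Example 2.2 are easy to verify numerically. The sketch below evaluates the formulas, checks the error at the three oscillation points, and confirms on a fine grid that no larger error occurs; the grid resolution and variable names are our own choices.

```matlab
beta  = exp(1) - 1;                               % slope of the optimal line
x1    = log(beta);                                % interior oscillation point
alpha = (exp(1) - beta*log(beta))/2;              % intercept of the optimal line
delta = (2 - exp(1) + beta*log(beta))/2;          % claimed maximal error
err = @(x) exp(x) - (alpha + beta*x);             % error f - p_*
x = linspace(0, 1, 10001);
fprintf('alpha = %.5f, beta = %.5f, x1 = %.5f, delta = %.5f\n', alpha, beta, x1, delta);
fprintf('errors at 0, x1, 1: %.5f, %.5f, %.5f\n', err(0), err(x1), err(1));
fprintf('max |error| on grid: %.5f\n', max(abs(err(x))));
```

The three printed errors should equal $\delta$, $-\delta$, $\delta$, and the grid maximum should not exceed $\delta$, as the Oscillation Theorem requires of the optimal line.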

[Figure 2.4: The top plot shows the minimax approximation $p_* \in \mathcal{P}_1$ of degree $n = 1$ (red) to $f(x) = e^x$ (blue); the bottom plot shows the error $f(x) - p_*(x)$, equioscillating at $n + 2 = 3$ points.]

lecture 15: Chebyshev Polynomials for Optimal Interpolation

2.3 Optimal Interpolation Points via Chebyshev Polynomials

As an application of the minimax approximation procedure, we consider how best to choose interpolation points $\{x_j\}_{j=0}^{n}$ to minimize

$$\|f - p_n\|_\infty,$$

where $p_n \in \mathcal{P}_n$ is the interpolant to $f$ at the specified points.
Recall the interpolation error bound developed in Section 1.6: If $f \in C^{n+1}[a,b]$, then for any $x \in [a,b]$ there exists some $\xi \in [a,b]$ such that

$$f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{j=0}^{n} (x - x_j).$$

Taking absolute values and maximizing over $[a,b]$ yields the bound

$$\|f - p_n\|_\infty \le \left(\max_{\xi \in [a,b]} \frac{|f^{(n+1)}(\xi)|}{(n+1)!}\right) \max_{x \in [a,b]} \left|\prod_{j=0}^{n} (x - x_j)\right|.$$

For Runge’s example, $f(x) = 1/(1 + x^2)$ for $x \in [-5, 5]$, we observed that $\|f - p_n\|_\infty \to \infty$ as $n \to \infty$ if the interpolation points $\{x_j\}$ are uniformly spaced over $[-5, 5]$. However, Marcinkiewicz’s theorem (Section 1.6) guarantees there is always some scheme for assigning the interpolation points such that $\|f - p_n\|_\infty \to 0$ as $n \to \infty$. While there is no fail-safe a priori system for picking interpolation points that will yield uniform convergence for all $f \in C[a,b]$, there is a distinguished choice that works exceptionally well for just about every function you will encounter in practice. We determine this set of interpolation points by choosing those $\{x_j\}_{j=0}^{n}$ that minimize the error bound (which is distinct from, but hopefully akin to, minimizing the error itself, $\|f - p_n\|_\infty$). That is, we want to solve

(2.9)  $$\min_{x_0, \ldots, x_n} \max_{x \in [a,b]} \left|\prod_{j=0}^{n} (x - x_j)\right|.$$

Notice that

$$\prod_{j=0}^{n} (x - x_j) = x^{n+1} - x^n \sum_{j=0}^{n} x_j + x^{n-1} \sum_{0 \le j < k \le n} x_j x_k - \cdots + (-1)^{n+1} \prod_{j=0}^{n} x_j = x^{n+1} - r(x),$$

where $r \in \mathcal{P}_n$ is a degree-n polynomial depending on the interpolation nodes $\{x_j\}_{j=0}^{n}$.
For example, when $n = 1$,
$$(x - x_0)(x - x_1) = x^2 - \bigl((x_0 + x_1)x - x_0 x_1\bigr) = x^2 - r_1(x),$$
where $r_1(x) = (x_0 + x_1)x - x_0 x_1$. By varying $x_0$ and $x_1$ (allowing complex values if necessary), we can make $r_1$ any function in $\mathcal{P}_1$.
To find the optimal interpolation points according to (2.9), we should solve

$$\min_{r \in \mathcal{P}_n} \max_{x \in [a,b]} |x^{n+1} - r(x)| = \min_{r \in \mathcal{P}_n} \|x^{n+1} - r(x)\|_\infty.$$

Here the goal is to approximate an $(n+1)$-degree polynomial, $x^{n+1}$, with an n-degree polynomial. The method of solution is somewhat indirect: we will produce a class of polynomials of the form $x^{n+1} - r(x)$ that satisfy the requirements of the Oscillation Theorem, and thus $r(x)$ must be the minimax polynomial approximation to $x^{n+1}$. As we shall see, the roots of the resulting polynomial $x^{n+1} - r(x)$ will fall in the interval $[a,b]$, and can thus be regarded as ‘optimal’ interpolation points. For simplicity, we shall focus on the interval $[a,b] = [-1, 1]$.
Definition 2.1. The degree-$n$ Chebyshev polynomial is defined for
$x \in [-1,1]$ by the formula
$$ T_n(x) = \cos(n \cos^{-1} x). $$

At first glance, this formula may not appear to define a polynomial
at all, since it involves trigonometric functions. But computing the
first few examples, we find
$$ \begin{aligned}
n = 0: &\quad T_0(x) = \cos(0 \cos^{-1} x) = \cos(0) = 1 \\
n = 1: &\quad T_1(x) = \cos(\cos^{-1} x) = x \\
n = 2: &\quad T_2(x) = \cos(2 \cos^{-1} x) = 2\cos^2(\cos^{-1} x) - 1 = 2x^2 - 1.
\end{aligned} $$

(Margin note: Furthermore, the formula doesn’t apply if $|x| > 1$. For such $x$ one can define
the Chebyshev polynomials using hyperbolic trigonometric functions,
$T_n(x) = \cosh(n \cosh^{-1} x)$. Indeed, using hyperbolic trigonometric identities, one
can show that this expression generates for $x \notin [-1,1]$ the same polynomials
we get for $x \in [-1,1]$ from the standard trigonometric identities. We discuss this
point in more detail at the end of the section.)

For $n = 2$, we employed the identity $\cos 2\theta = 2\cos^2\theta - 1$, substituting
$\theta = \cos^{-1} x$. More generally, use the cosine addition formula
$$ \cos\alpha + \cos\beta = 2 \cos\Big(\frac{\alpha + \beta}{2}\Big) \cos\Big(\frac{\alpha - \beta}{2}\Big) $$
to get the identity
$$ \cos\big((n+1)\theta\big) = 2 \cos\theta \cos n\theta - \cos\big((n-1)\theta\big). $$

This formula implies, for $n \ge 1$,
$$ T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), $$
a formula related to the three term recurrence used to construct
orthogonal polynomials.

(Margin note: In fact, Chebyshev polynomials are
orthogonal polynomials on $[-1,1]$ with respect to the inner product
$$ \langle f, g \rangle = \int_{-1}^{1} \frac{f(x) g(x)}{\sqrt{1 - x^2}} \, dx, $$
a fact we will use when studying Gaussian quadrature later in the semester.)

Chebyshev polynomials exhibit a wealth of interesting properties,
of which we mention just three.

Theorem 2.4. Let $T_n$ be the degree-$n$ Chebyshev polynomial
$$ T_n(x) = \cos(n \cos^{-1} x) $$
for $x \in [-1,1]$.

• $|T_n(x)| \le 1$ for $x \in [-1,1]$.

• The roots of $T_n$ are the $n$ points $\xi_j = \cos\big(\frac{(2j-1)\pi}{2n}\big)$, $j = 1, \ldots, n$.

• For $n \ge 1$, $|T_n(x)|$ is maximized on $[-1,1]$ at the $n+1$ points
$\eta_j = \cos(j\pi/n)$, $j = 0, \ldots, n$:
$$ T_n(\eta_j) = (-1)^j. $$

Proof. These results follow from direct calculations. For $x \in [-1,1]$,
$T_n(x) = \cos(n \cos^{-1}(x))$ cannot exceed one in magnitude because
cosine cannot exceed one in magnitude. To verify the formula for the
roots, compute
$$ T_n(\xi_j) = \cos\Big( n \cos^{-1} \cos\Big( \frac{(2j-1)\pi}{2n} \Big) \Big) = \cos\Big( \frac{(2j-1)\pi}{2} \Big) = 0, $$
since cosine is zero at half-integer multiples of $\pi$. Similarly,
$$ T_n(\eta_j) = \cos\Big( n \cos^{-1} \cos\Big( \frac{j\pi}{n} \Big) \Big) = \cos(j\pi) = (-1)^j. $$
Since $T_n$ is a nonzero degree-$n$ polynomial, it cannot attain more
than $n+1$ extrema on $[-1,1]$, including the endpoints: we have thus
characterized all the maxima of $|T_n|$ on $[-1,1]$.
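The three properties in Theorem 2.4 are easy to check numerically. The following is a minimal Python sketch (NumPy assumed; the helper name `chebyshev_T` is illustrative and not part of the notes) that evaluates $T_n$ by the three-term recurrence, compares it with the trigonometric definition, and evaluates $T_n$ at the claimed roots and extreme points.

```python
import numpy as np

def chebyshev_T(n, x):
    """Evaluate T_n(x) with the recurrence T_{k+1} = 2x T_k - T_{k-1}."""
    x = np.asarray(x, dtype=float)
    t_prev, t_curr = np.ones_like(x), x.copy()    # T_0 and T_1
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

n = 7
x = np.linspace(-1.0, 1.0, 2001)
# largest gap between the recurrence and cos(n arccos x) on [-1, 1]
gap = np.max(np.abs(chebyshev_T(n, x) - np.cos(n * np.arccos(x))))

j = np.arange(1, n + 1)
roots = np.cos((2 * j - 1) * np.pi / (2 * n))     # xi_j, j = 1, ..., n
k = np.arange(0, n + 1)
extrema = np.cos(k * np.pi / n)                   # eta_j, j = 0, ..., n

print(gap)                                        # rounding-level difference
print(np.max(np.abs(chebyshev_T(n, roots))))      # T_n vanishes at the xi_j
print(chebyshev_T(n, extrema))                    # alternates between +1 and -1
```

The recurrence uses only multiplications and additions, so the same helper also evaluates $T_n$ outside $[-1,1]$, where the arccosine formula is unavailable.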

Figure 2.5 shows Chebyshev polynomials Tn for nine different


values of n.

2.3.1 Interpolation at Chebyshev Points


Finally, we are ready to solve the key minimax problem that will
reveal optimal interpolation points. Looking at the above plots of
Chebyshev polynomials, with their striking equioscillation properties,
perhaps you have already guessed the solution yourself.
We defined the Chebyshev polynomials so that
$$ T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x) $$
with $T_0(x) = 1$ and $T_1(x) = x$. Thus $T_{n+1}$ has the leading coefficient
$2^n$ for $n \ge 0$. Define
$$ \widehat{T}_{n+1} = 2^{-n} T_{n+1} $$
for $n \ge 0$, with $\widehat{T}_0(x) = 1$. These normalized Chebyshev polynomials
are monic, i.e., the leading term in $\widehat{T}_{n+1}(x)$ is $x^{n+1}$, rather than $2^n x^{n+1}$
as for $T_{n+1}(x)$. Thus, we can write
$$ \widehat{T}_{n+1}(x) = x^{n+1} - r_n(x) $$
for some polynomial $r_n(x) = x^{n+1} - \widehat{T}_{n+1}(x) \in \mathcal{P}_n$. We do not
especially care about the particular coefficients of this $r_n$; our quarry
will be the roots of $\widehat{T}_{n+1}$, the optimal interpolation points.

[Figure 2.5: Chebyshev polynomials $T_n$ of degree $n = 1, 2, 3$ (top), $n = 5, 7, 10$
(middle), and $n = 15, 20, 30$ (bottom), plotted for $x \in [-1,1]$.
Note how rapidly these polynomials grow outside the interval $[-1,1]$.]
For $n \ge 0$, the polynomials $\widehat{T}_{n+1}(x)$ oscillate between $\pm 2^{-n}$ for
$x \in [-1,1]$, with the maximal values attained at
$$ \eta_j = \cos\Big( \frac{j\pi}{n+1} \Big) $$
for $j = 0, \ldots, n+1$. In particular,
$$ \widehat{T}_{n+1}(\eta_j) = (\eta_j)^{n+1} - r_n(\eta_j) = (-1)^j 2^{-n}. $$
Thus, we have found a polynomial $r_n \in \mathcal{P}_n$, together with $n+2$
distinct points $\eta_j \in [-1,1]$, where the maximum error
$$ \max_{x \in [-1,1]} |x^{n+1} - r_n(x)| = 2^{-n} $$
is attained with alternating sign. Thus, by the Oscillation Theorem, we
have found the minimax approximation to $x^{n+1}$.

Theorem 2.5 (Optimal approximation of $x^{n+1}$).
The optimal approximation to $x^{n+1}$ from $\mathcal{P}_n$ for $x \in [-1,1]$ is
$$ r_n(x) = x^{n+1} - \widehat{T}_{n+1}(x) = x^{n+1} - 2^{-n} T_{n+1}(x) \in \mathcal{P}_n. $$

Thus, the optimal interpolation points are those $n+1$ roots of
$x^{n+1} - r_n(x)$, that is, the roots of the degree-$(n+1)$ Chebyshev
polynomial:
$$ \xi_j = \cos\Big( \frac{(2j+1)\pi}{2n+2} \Big), \qquad j = 0, \ldots, n. $$
For generic intervals $[a,b]$, a change of variable demonstrates that
the same points, appropriately shifted and scaled, will be optimal.
Similar properties hold if interpolation is performed at the $n+1$
points
$$ \eta_j = \cos\Big( \frac{j\pi}{n} \Big), \qquad j = 0, \ldots, n, $$
which are also called Chebyshev points and are perhaps more popular
due to their slightly simpler formula. (We used these points
to successfully interpolate Runge’s function, scaled to the interval
$[-5,5]$.) While these points differ from the roots of the Chebyshev
polynomial, they have the same distribution as $n \to \infty$. That is the key.
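To see the optimality claim in numbers, one can sample the node polynomial $\prod_j (x - x_j)$ on a fine grid for different node sets. The sketch below is an illustration under stated assumptions (NumPy, $n = 10$, a 20001-point grid; the helper name `max_node_poly` is made up here). It compares uniformly spaced nodes, the roots of $T_{n+1}$, and the extreme points $\cos(j\pi/n)$.

```python
import numpy as np

def max_node_poly(nodes, grid):
    """Maximum over the grid of |prod_j (x - x_j)|."""
    return np.max(np.abs(np.prod(grid[:, None] - nodes[None, :], axis=1)))

n = 10
grid = np.linspace(-1.0, 1.0, 20001)
j = np.arange(n + 1)

uniform = np.linspace(-1.0, 1.0, n + 1)
cheb_roots = np.cos((2 * j + 1) * np.pi / (2 * n + 2))   # roots of T_{n+1}
cheb_extrema = np.cos(j * np.pi / n)                     # extreme points of T_n

w_uniform = max_node_poly(uniform, grid)
w_roots = max_node_poly(cheb_roots, grid)
w_extrema = max_node_poly(cheb_extrema, grid)
print(w_uniform, w_extrema, w_roots, 2.0 ** (-n))
```

For the roots of $T_{n+1}$ the sampled maximum matches $2^{-n}$, as Theorem 2.5 predicts; the extreme points give a value within a factor of two of that, while uniform nodes do noticeably worse.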

[Figure 2.6: Repetition of Figure 1.8, interpolating Runge’s function $1/(x^2+1)$
on $x \in [-5,5]$, but now using Chebyshev points $x_j = 5\cos(j\pi/n)$.
The top plot shows the convergence of $\max_{x \in [-5,5]} |1/(x^2+1) - p_n(x)|$
for $n = 0, \ldots, 25$ (logarithmic vertical axis); the bottom plots
show the interpolating polynomials $p_4$, $p_8$, $p_{16}$, and $p_{24}$, along with the
interpolation points that determine these polynomials (black circles).
Unlike interpolation at uniformly spaced points, these interpolants do
converge to $f$ as $n \to \infty$. Notice how the interpolation points cluster toward the
ends of the domain $[-5,5]$.]

We emphasize the utility of interpolation at Chebyshev points by
quoting the following result from Trefethen’s excellent Approximation
Theory and Approximation Practice (SIAM, 2013). Trefethen emphasizes
that worst-case results like Faber’s theorem (Theorem 1.4) give a
misleadingly pessimistic view of interpolation. If the function
$f \in C[a,b]$ has just a bit of smoothness (i.e., bounded derivatives),
interpolation in Chebyshev points is ‘bulletproof’. The following
theorem consolidates aspects of Theorems 7.2 and 8.2 in Trefethen’s book.
The results are stated for $[a,b] = [-1,1]$ but can be adapted to any
real interval.

Theorem 2.6 (Convergence of Interpolants at Chebyshev Points).
For any $n > 0$, let $p_n$ denote the interpolant to $f \in C[-1,1]$ at the
Chebyshev points
$$ x_j = \cos\Big( \frac{j\pi}{n} \Big), \qquad j = 0, \ldots, n. $$

• Suppose $f \in C^\nu[-1,1]$ for some $\nu \ge 1$, with $f^{(\nu)}$ having variation
$V(\nu)$, i.e.,
$$ V(\nu) := \max_{x \in [-1,1]} f^{(\nu)}(x) - \min_{x \in [-1,1]} f^{(\nu)}(x). $$
Then for any $n > \nu$,
$$ \|f - p_n\|_\infty \le \frac{4 V(\nu)}{\pi \nu (n - \nu)^\nu}. $$

• Suppose $f$ is analytic on $[-1,1]$ and can be analytically continued
(into the complex plane) onto the region bounded by the ellipse
$$ E_\rho := \Big\{ \frac{\rho e^{i\theta} + e^{-i\theta}/\rho}{2} : \theta \in [0, 2\pi) \Big\}. $$
Suppose further that $|f(z)| \le M$ on and inside $E_\rho$. Then
$$ \|f - p_n\|_\infty \le \frac{2 M \rho^{-n}}{\rho - 1}. $$

[Margin figure: Interval $[-1,1]$ (blue), with two ellipses
$E_\rho$ for $\rho = 1.25$ and $\rho = 1.75$.]
For example, the first part of this theorem implies that if $f'$ exists
and is bounded, then $\|f - p_n\|_\infty$ must converge at least as fast as
$1/n$ as $n \to \infty$. While that is not such a fast rate, it does indeed
show convergence of the interpolant. The second part of the theorem
ensures that if $f$ is well behaved in the region of the complex plane
around $[-1,1]$, the convergence will be extremely fast: the larger the
area of $\mathbb{C}$ in which $f$ is well behaved, the faster the convergence.
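The convergence for Runge’s function can be reproduced in a few lines. The sketch below is one possible implementation, not the method used to make the figures in these notes: it evaluates the interpolant with the barycentric formula, whose weights for the points $\cos(j\pi/n)$ are $(-1)^j$, halved at the two endpoints (a standard fact that these notes do not derive). The function name `cheb_interp` and the 4001-point evaluation grid are assumptions made for illustration.

```python
import numpy as np

def cheb_interp(f, n, a, b, x_eval):
    """Evaluate the degree-n interpolant of f at x_j = cos(j*pi/n), mapped to [a, b]."""
    j = np.arange(n + 1)
    xj = 0.5 * (a + b) + 0.5 * (b - a) * np.cos(j * np.pi / n)
    fj = f(xj)
    w = (-1.0) ** j                  # barycentric weights for these points
    w[0] *= 0.5
    w[-1] *= 0.5
    diff = x_eval[:, None] - xj[None, :]
    hit = diff == 0.0                # evaluation point coincides with a node
    diff[hit] = 1.0                  # placeholder to avoid division by zero
    terms = w / diff
    p = (terms @ fj) / terms.sum(axis=1)
    rows, cols = np.nonzero(hit)
    p[rows] = fj[cols]               # the interpolant equals f at the nodes
    return p

runge = lambda x: 1.0 / (1.0 + x ** 2)
x_eval = np.linspace(-5.0, 5.0, 4001)
errors = {n: np.max(np.abs(runge(x_eval) - cheb_interp(runge, n, -5.0, 5.0, x_eval)))
          for n in (4, 8, 16, 24)}
for n, err in errors.items():
    print(n, err)
```

The printed errors shrink steadily as $n$ grows, in contrast to the divergence seen with uniformly spaced points.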

2.3.2 Chebyshev polynomials beyond [−1, 1]


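Another way of interpreting the equioscillating property of Chebyshev
polynomials is that the monic scaling $\widehat{T}_n = 2^{1-n} T_n$ (for $n \ge 1$)
solves the approximation problem
$$ \|\widehat{T}_n\|_\infty = \min_{\substack{p \in \mathcal{P}_n \\ p \ \mathrm{monic}}} \|p\|_\infty $$
over the interval $[-1,1]$, where a polynomial is monic if it has the
form $x^n + q(x)$ for $q \in \mathcal{P}_{n-1}$.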

In some applications, such as the analysis of iterative methods for
solving large-scale systems of linear equations, one needs to bound
the size of the Chebyshev polynomial outside the interval $[-1,1]$.
Figure 2.5 shows that $T_n$ grows very quickly outside $[-1,1]$, even for
modest values of $n$. How fast?

To describe Chebyshev polynomials outside $[-1,1]$, we must replace
the trigonometric functions in the definition $T_n(x) = \cos(n \cos^{-1} x)$
with hyperbolic trigonometric functions:
$$ T_n(x) = \cosh(n \cosh^{-1} x), \qquad x \notin (-1,1). \tag{2.10} $$
Is this definition consistent with
$$ T_n(x) = \cos(n \cos^{-1} x), \qquad x \in [-1,1], $$
used previously? Trivially one can see that the new definition also
gives $T_0(x) = 1$ and $T_1(x) = x$. Like standard trigonometric functions,
the hyperbolic functions also satisfy the addition formula
$$ \cosh\alpha + \cosh\beta = 2 \cosh\Big( \frac{\alpha+\beta}{2} \Big) \cosh\Big( \frac{\alpha-\beta}{2} \Big), $$
and so
$$ \cosh\big((n+1)\theta\big) = 2 \cosh\theta \cosh n\theta - \cosh\big((n-1)\theta\big), $$
leading to the same three-term recurrence as before:
$$ T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x). $$
Thus, the definitions are consistent.


We would like a more concrete formula for $T_n(x)$ for $x \notin [-1,1]$
than we could obtain from the formula (2.10). Thankfully Chebyshev
polynomials have infinitely many interesting properties to lean on.
Consider the change of variables
$$ x = \frac{w + w^{-1}}{2}, $$
which allows us to write
$$ x = \frac{e^{\log w} + e^{-\log w}}{2} = \cosh(\log w). $$
Thus work from the definition to obtain
$$ \begin{aligned}
T_n(x) &= \cosh(n \cosh^{-1}(x)) \\
&= \cosh(n \log w) \\
&= \cosh(\log w^n) = \frac{e^{\log(w^n)} + e^{-\log(w^n)}}{2} = \frac{w^n + w^{-n}}{2}.
\end{aligned} $$

We emphasize this last important formula:
$$ T_n(x) = \frac{w^n + w^{-n}}{2}, \qquad x = \frac{w + w^{-1}}{2} \notin (-1,1). \tag{2.11} $$
We have thus shown that $|T_n(x)|$ will grow exponentially in $n$ for any
$x \notin (-1,1)$ for which $|w| \ne 1$. When does $|w| = 1$? Only when
$x = \pm 1$. Hence,

$|T_n(x)|$ grows exponentially in $n$ for all $x \notin [-1,1]$.


Example 2.3. We want to evaluate $T_n(2)$ as a function of $n$. First, find
$w$ such that $2 = (w + w^{-1})/2$, i.e.,
$$ w^2 - 4w + 1 = 0. $$
Solve this quadratic for
$$ w_\pm = 2 \pm \sqrt{3}. $$
We take $w = 2 + \sqrt{3} = 3.7320\ldots$. Thus by (2.11)
$$ T_n(2) = \frac{(2 + \sqrt{3})^n + (2 - \sqrt{3})^n}{2} \approx \frac{(2 + \sqrt{3})^n}{2} $$
as $n \to \infty$, since $(2 - \sqrt{3})^n = (0.2679\ldots)^n \to 0$.

(Margin note: Which $\pm$ choice should you make? It does not matter. Notice that
$(2 - \sqrt{3})^{-1} = 2 + \sqrt{3}$, and this happens in general: $w_\pm = 1/w_\mp$.)

Take a moment to reflect on this: We have a beautifully concrete
way to write down $|T_n(x)|$ that does not involve any hyperbolic
trigonometric formulas, or require use of the Chebyshev recurrence
relation. Formulas of this type can be very helpful for analysis in
various settings. You will see one such example on Problem Set 3.
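As a quick sanity check on Example 2.3, the closed form (2.11) can be compared with the three-term recurrence. This is a small illustrative sketch in plain Python; the function names are made up for this example, and the closed-form helper assumes a real argument $x \ge 1$.

```python
import math

def cheb_recurrence(n, x):
    """T_n(x) from T_0 = 1, T_1 = x, T_{k+1} = 2 x T_k - T_{k-1}."""
    t_prev, t_curr = 1, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

def cheb_closed_form(n, x):
    """T_n(x) = (w^n + w^-n)/2 with x = (w + 1/w)/2, assuming real x >= 1."""
    w = x + math.sqrt(x * x - 1.0)      # larger root of w^2 - 2 x w + 1 = 0
    return 0.5 * (w ** n + w ** (-n))

for n in range(6):
    print(n, cheb_recurrence(n, 2), cheb_closed_form(n, 2.0))
```

With integer input the recurrence stays in exact integer arithmetic, giving 1, 2, 7, 26, 97, 362 for $n = 0, \ldots, 5$; the closed form reproduces these values up to rounding.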

lecture 16: Introduction to Least Squares Approximation

2.4 Least squares approximation

The minimax criterion is an intuitive objective for approximating
a function. However, in many cases it is more appealing (for both
computation and for the given application) to find an approximation
to $f$ that minimizes the integral of the square of the error.

Given $f \in C[a,b]$, find $P_* \in \mathcal{P}_n$ such that
$$ \Big( \int_a^b (f(x) - P_*(x))^2 \, dx \Big)^{1/2} = \min_{p \in \mathcal{P}_n} \Big( \int_a^b (f(x) - p(x))^2 \, dx \Big)^{1/2}. \tag{2.12} $$

This is an example of a least squares problem.

2.4.1 Inner products for function spaces


To facilitate the development of least squares approximation theory,
we introduce a formal structure for C [ a, b]. First, recognize that C [ a, b]
is a linear space: any linear combination of continuous functions on
[ a, b] must itself be continuous on [ a, b].
Definition 2.2. The inner product of the functions $f, g \in C[a,b]$ is
$$ \langle f, g \rangle = \int_a^b f(x) g(x) \, dx. $$

The inner product satisfies the following basic axioms:

• $\langle \alpha f + g, h \rangle = \alpha \langle f, h \rangle + \langle g, h \rangle$ for all $f, g, h \in C[a,b]$ and all $\alpha \in \mathbb{R}$;

• $\langle f, g \rangle = \langle g, f \rangle$ for all $f, g \in C[a,b]$;

• $\langle f, f \rangle \ge 0$ for all $f \in C[a,b]$.

(Margin note: For simplicity we are assuming that
$f$ and $g$ are real-valued. To handle
complex-valued functions, one generalizes the inner product to
$$ \langle f, g \rangle = \int_a^b f(x) \overline{g(x)} \, dx, $$
which then gives $\langle f, g \rangle = \overline{\langle g, f \rangle}$.)

With this inner product we associate the norm
$$ \|f\|_2 := \langle f, f \rangle^{1/2} = \Big( \int_a^b f(x)^2 \, dx \Big)^{1/2}. $$

This is often called the ‘$L^2$ norm,’ where the superscript ‘2’ in $L^2$
refers to the fact that the integrand involves the square of the function
$f$; the $L$ stands for Lebesgue, coming from the fact that this inner
product can be generalized from $C[a,b]$ to the set of all functions that
are square-integrable, in the sense of Lebesgue integration. By restricting
our attention to continuous functions, we dodge the measure-theoretic
complexities.

(Margin note: The Lebesgue theory gives a more
robust definition of the integral than
the conventional Riemann approach.
With such notions one can extend least
squares approximation beyond $C[a,b]$,
to more exotic function spaces.)

2.4.2 Least squares minimization via calculus


We are now ready to solve the least squares problem. We shall call
the optimal polynomial $P_* \in \mathcal{P}_n$, i.e.,
$$ \|f - P_*\|_2 = \min_{p \in \mathcal{P}_n} \|f - p\|_2. $$

We can solve this minimization problem using basic calculus. Consider
this example for $n = 1$, where we optimize the error over
polynomials of the form $p(x) = c_0 + c_1 x$. The polynomial that
minimizes $\|f - p\|_2$ will also minimize its square, $\|f - p\|_2^2$. For any given
$p \in \mathcal{P}_1$, define the error function
$$ \begin{aligned}
E(c_0, c_1) &:= \|f(x) - (c_0 + c_1 x)\|_{L^2}^2 = \int_a^b (f(x) - c_0 - c_1 x)^2 \, dx \\
&= \int_a^b \Big( f(x)^2 - 2 f(x)(c_0 + c_1 x) + (c_0^2 + 2 c_0 c_1 x + c_1^2 x^2) \Big) \, dx \\
&= \int_a^b f(x)^2 \, dx - 2 c_0 \int_a^b f(x) \, dx - 2 c_1 \int_a^b x f(x) \, dx \\
&\qquad + c_0^2 (b - a) + c_0 c_1 (b^2 - a^2) + \tfrac{1}{3} c_1^2 (b^3 - a^3).
\end{aligned} $$

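To find the optimal polynomial, $P_*$, optimize $E$ over $c_0$ and $c_1$, i.e.,
find the values of $c_0$ and $c_1$ for which
$$ \frac{\partial E}{\partial c_0} = \frac{\partial E}{\partial c_1} = 0. $$
First, compute
$$ \begin{aligned}
\frac{\partial E}{\partial c_0} &= -2 \int_a^b f(x) \, dx + 2 c_0 (b - a) + c_1 (b^2 - a^2) \\
\frac{\partial E}{\partial c_1} &= -2 \int_a^b x f(x) \, dx + c_0 (b^2 - a^2) + \tfrac{2}{3} c_1 (b^3 - a^3).
\end{aligned} $$

Setting these partial derivatives equal to zero yields
$$ \begin{aligned}
2 c_0 (b - a) + c_1 (b^2 - a^2) &= 2 \int_a^b f(x) \, dx \\
c_0 (b^2 - a^2) + \tfrac{2}{3} c_1 (b^3 - a^3) &= 2 \int_a^b x f(x) \, dx.
\end{aligned} $$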

These equations, linear in the unknowns $c_0$ and $c_1$, can be written in
the matrix form
$$ \begin{bmatrix} 2(b - a) & b^2 - a^2 \\ b^2 - a^2 & \tfrac{2}{3}(b^3 - a^3) \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
= \begin{bmatrix} 2 \int_a^b f(x) \, dx \\ 2 \int_a^b x f(x) \, dx \end{bmatrix}. $$

When $b \ne a$ this system always has a unique solution. The resulting
$c_0$ and $c_1$ are the coefficients for the monomial-basis expansion of the
least squares approximation $P_* \in \mathcal{P}_1$ to $f$ on $[a,b]$.

Example 2.4 ($f(x) = e^x$). Apply this result to $f(x) = e^x$ for $x \in [0,1]$.
Since
$$ \int_0^1 e^x \, dx = e - 1, \qquad \int_0^1 x e^x \, dx = \Big[ e^x (x - 1) \Big]_{x=0}^{1} = 1, $$
we must solve the system
$$ \begin{bmatrix} 2 & 1 \\ 1 & \tfrac{2}{3} \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
= \begin{bmatrix} 2e - 2 \\ 2 \end{bmatrix}. $$

The desired solution is
$$ c_0 = 4e - 10, \qquad c_1 = 18 - 6e. $$

Figure 2.7 compares $f$ to this least squares approximation $P_*$ and the
minimax approximation $p_*$ computed earlier.

[Figure 2.7: Top: Approximation of $f(x) = e^x$ (blue) over $x \in [0,1]$ via
least squares ($P_*$, shown in red) and minimax ($p_*$, shown as a gray line).
Bottom: Error curves $f(x) - p(x)$ for least squares, $f - P_*$ (red), and minimax,
$f - p_*$ (gray) approximation. While the curves have similar shape, note that the
red curve does not attain its maximum deviation from $f$ at $n + 2 = 3$ points,
while the gray one does.]

We can see from the plots in Figure 2.7 that the approximation
looks decent to the eye, but the error is not terribly small. We can
decrease that error by increasing the degree of the approximating
polynomial. Just as we used a 2-by-2 linear system to find the best
linear approximation, a general $(n+1)$-by-$(n+1)$ linear system can
be constructed to yield the degree-$n$ least squares approximation.

(Margin note: In fact, $\|f - P_*\|_2 = 0.06277\ldots$. This is
indeed smaller than the 2-norm error
of the minimax approximation $p_*$:
$\|f - p_*\|_2 = 0.07228\ldots$.)
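The numbers in Example 2.4 can be reproduced directly from the formulas above. This minimal sketch (NumPy assumed; variable names such as `M` and `rhs` are illustrative) solves the 2-by-2 system for $f(x) = e^x$ on $[0,1]$ and then evaluates the error function $E(c_0, c_1)$ using the expanded expression derived earlier, with the exact integrals $\int_0^1 e^{2x}\,dx = (e^2 - 1)/2$, $\int_0^1 e^x\,dx = e - 1$, and $\int_0^1 x e^x\,dx = 1$.

```python
import numpy as np

a, b = 0.0, 1.0
# Coefficient matrix of the two stationarity equations (factor of 2 kept, as in the text)
M = np.array([[2.0 * (b - a),  b**2 - a**2],
              [b**2 - a**2,    2.0 / 3.0 * (b**3 - a**3)]])
int_f = np.e - 1.0                     # integral of e^x over [0, 1]
int_xf = 1.0                           # integral of x e^x over [0, 1]
rhs = np.array([2.0 * int_f, 2.0 * int_xf])
c0, c1 = np.linalg.solve(M, rhs)

# E(c0, c1) from the expanded formula; its square root is ||f - P*||_2
int_f2 = 0.5 * (np.e**2 - 1.0)         # integral of e^(2x) over [0, 1]
E = (int_f2 - 2.0 * c0 * int_f - 2.0 * c1 * int_xf
     + c0**2 * (b - a) + c0 * c1 * (b**2 - a**2) + c1**2 * (b**3 - a**3) / 3.0)
err2 = np.sqrt(E)
print(c0, c1, err2)
```

The computed coefficients agree with $c_0 = 4e - 10$ and $c_1 = 18 - 6e$, and the resulting 2-norm error matches the value $0.06277\ldots$ quoted in the notes.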

2.4.3 General polynomial bases


Note that we performed the above minimization in the monomial
basis: $p(x) = c_0 + c_1 x$ is a linear combination of $1$ and $x$. Our
experience with interpolation suggests that different choices for the basis
may yield approximation algorithms with superior numerical properties.
Thus, we develop the form of the approximating polynomial in
an arbitrary basis.
Suppose $\{\phi_k\}_{k=0}^n$ is a basis for $\mathcal{P}_n$. Any $p \in \mathcal{P}_n$ can be written as
$$ p(x) = \sum_{k=0}^n c_k \phi_k(x). $$

The error expression takes the form
$$ \begin{aligned}
E(c_0, \ldots, c_n) &:= \|f(x) - p(x)\|_{L^2}^2 = \int_a^b \Big( f(x) - \sum_{k=0}^n c_k \phi_k(x) \Big)^2 dx \\
&= \langle f, f \rangle - 2 \sum_{k=0}^n c_k \langle f, \phi_k \rangle + \sum_{k=0}^n \sum_{\ell=0}^n c_k c_\ell \langle \phi_k, \phi_\ell \rangle.
\end{aligned} $$

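To minimize $E$, we seek critical values of $c = [c_0, \ldots, c_n]^T \in \mathbb{R}^{n+1}$,
i.e., we want coefficients where the gradient of $E$ with respect to $c$
is zero: $\nabla_c E = 0$. To compute this gradient, evaluate $\partial E / \partial c_j$ for
$j = 0, \ldots, n$:
$$ \begin{aligned}
\frac{\partial E}{\partial c_j} &= \frac{\partial}{\partial c_j} \langle f, f \rangle - \frac{\partial}{\partial c_j} \Big( 2 \sum_{k=0}^n c_k \langle f, \phi_k \rangle \Big) + \frac{\partial}{\partial c_j} \Big( \sum_{k=0}^n \sum_{\ell=0}^n c_k c_\ell \langle \phi_k, \phi_\ell \rangle \Big) \\
&= 0 - 2 \langle f, \phi_j \rangle + \frac{\partial}{\partial c_j} \Big( c_j^2 \langle \phi_j, \phi_j \rangle + \sum_{k \ne j} c_k c_j \langle \phi_k, \phi_j \rangle + \sum_{\ell \ne j} c_j c_\ell \langle \phi_j, \phi_\ell \rangle + \sum_{k \ne j} \sum_{\ell \ne j} c_k c_\ell \langle \phi_k, \phi_\ell \rangle \Big).
\end{aligned} $$

In this last line, we have broken the double sum on the previous line
into four parts: one that contains $c_j^2$, two that contain $c_j$ ($c_k c_j$ for $k \ne j$;
$c_j c_\ell$ for $\ell \ne j$), and one (the double sum) that does not involve $c_j$ at
all. This decomposition makes it easier to compute the derivative:
$$ \begin{aligned}
\frac{\partial}{\partial c_j} \Big( c_j^2 \langle \phi_j, \phi_j \rangle &+ \sum_{k \ne j} c_k c_j \langle \phi_k, \phi_j \rangle + \sum_{\ell \ne j} c_j c_\ell \langle \phi_j, \phi_\ell \rangle + \sum_{k \ne j} \sum_{\ell \ne j} c_k c_\ell \langle \phi_k, \phi_\ell \rangle \Big) \\
&= 2 c_j \langle \phi_j, \phi_j \rangle + \sum_{k \ne j} c_k \langle \phi_k, \phi_j \rangle + \sum_{\ell \ne j} c_\ell \langle \phi_j, \phi_\ell \rangle + 0 \\
&= 2 c_j \langle \phi_j, \phi_j \rangle + 2 \sum_{k \ne j} c_k \langle \phi_k, \phi_j \rangle.
\end{aligned} $$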

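These terms contribute to $\partial E / \partial c_j$ to give
$$ \frac{\partial E}{\partial c_j} = -2 \langle f, \phi_j \rangle + 2 \sum_{k=0}^n c_k \langle \phi_k, \phi_j \rangle. \tag{2.13} $$

To minimize $E$, set $\partial E / \partial c_j = 0$ for $j = 0, \ldots, n$, which gives the $n+1$
equations
$$ \sum_{k=0}^n c_k \langle \phi_k, \phi_j \rangle = \langle f, \phi_j \rangle, \qquad j = 0, \ldots, n, \tag{2.14} $$
in the $n+1$ unknowns $c_0, \ldots, c_n$. Since these equations are linear in
the unknowns, write them in matrix form:
$$ \begin{bmatrix}
\langle \phi_0, \phi_0 \rangle & \langle \phi_0, \phi_1 \rangle & \cdots & \langle \phi_0, \phi_n \rangle \\
\langle \phi_1, \phi_0 \rangle & \langle \phi_1, \phi_1 \rangle & \cdots & \langle \phi_1, \phi_n \rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle \phi_n, \phi_0 \rangle & \langle \phi_n, \phi_1 \rangle & \cdots & \langle \phi_n, \phi_n \rangle
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}
= \begin{bmatrix} \langle f, \phi_0 \rangle \\ \langle f, \phi_1 \rangle \\ \vdots \\ \langle f, \phi_n \rangle \end{bmatrix}, \tag{2.15} $$
which we denote $Gc = b$. The matrix $G$ is called the Gram matrix.
Using this matrix-vector notation, we can accumulate the partial
derivative formulas (2.13) for $E$ into the gradient
$$ \nabla_c E = 2 (Gc - b). $$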

Since $c$ is a critical point if and only if $\nabla_c E(c) = 0$, we must ask:

• How many critical points are there? Equivalently, how many $c$
solve $Gc = b$?

• If $c$ is a critical point, is it a (local or even global) minimum?

We will answer the first question by showing that $G$ is invertible, and
hence $E$ has a unique critical point. To answer the second question,
we must inspect the Hessian
$$ \nabla_c^2 E = \nabla_c (\nabla_c E) = 2G. $$

The critical point $c$ is a local minimum if the Hessian is
symmetric positive definite. (Margin note: A matrix $G$ is positive definite provided
$z^* G z > 0$ for all $z \ne 0$.)
The symmetry of the inner product implies $\langle \phi_j, \phi_k \rangle = \langle \phi_k, \phi_j \rangle$, and
hence $G$ is symmetric. (In this case, symmetry also follows from the
equivalence of mixed partial derivatives.) The following theorem
confirms that $G$ is indeed positive definite.

lecture 17: Fundamentals of Least Squares Approximation, Part I


lecture 18: Fundamentals of Least Squares Approximation, Part II

Theorem 2.7. If $\phi_0, \ldots, \phi_n$ are linearly independent, the Gram matrix
$G$ is positive definite.

(Margin note: This proof is very general: we are
thinking of $\phi_0, \ldots, \phi_n$ being a basis for
$\mathcal{P}_n$ (and hence linearly independent),
but the same proof applies to any
linearly independent set of vectors in a
general inner product space.)

Proof. For a generic $z \in \mathbb{R}^{n+1}$, consider the product
$$ \begin{aligned}
z^* G z &= \begin{bmatrix} z_0 & z_1 & \cdots & z_n \end{bmatrix}
\begin{bmatrix}
\langle \phi_0, \phi_0 \rangle & \langle \phi_0, \phi_1 \rangle & \cdots & \langle \phi_0, \phi_n \rangle \\
\langle \phi_1, \phi_0 \rangle & \langle \phi_1, \phi_1 \rangle & \cdots & \langle \phi_1, \phi_n \rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle \phi_n, \phi_0 \rangle & \langle \phi_n, \phi_1 \rangle & \cdots & \langle \phi_n, \phi_n \rangle
\end{bmatrix}
\begin{bmatrix} z_0 \\ z_1 \\ \vdots \\ z_n \end{bmatrix} \\
&= \begin{bmatrix} z_0 & z_1 & \cdots & z_n \end{bmatrix}
\begin{bmatrix}
\sum_{k=0}^n z_k \langle \phi_0, \phi_k \rangle \\
\sum_{k=0}^n z_k \langle \phi_1, \phi_k \rangle \\
\vdots \\
\sum_{k=0}^n z_k \langle \phi_n, \phi_k \rangle
\end{bmatrix}
= \sum_{j=0}^n \sum_{k=0}^n z_j z_k \langle \phi_j, \phi_k \rangle.
\end{aligned} $$
Now use linearity of the inner product to write
$$ z^* G z = \sum_{j=0}^n \sum_{k=0}^n z_j z_k \langle \phi_j, \phi_k \rangle = \Big\langle \sum_{j=0}^n z_j \phi_j, \sum_{k=0}^n z_k \phi_k \Big\rangle = \Big\| \sum_{j=0}^n z_j \phi_j \Big\|_2^2. $$
Thus, by nonnegativity of the norm, $z^* G z \ge 0$. This is enough to
show that $G$ is positive semidefinite. To show that $G$ is positive definite,
we must show that $z^* G z > 0$ if $z \ne 0$. Now since $\phi_0, \ldots, \phi_n$ are
linearly independent, $\sum_{j=0}^n z_j \phi_j = 0$ if and only if $z_0 = \cdots = z_n = 0$,
i.e., if and only if $z = 0$. Thus, if $z \ne 0$, $z^* G z > 0$.

This answers the second question posed above, and also makes the
answer to the first trivial.

Corollary 2.1. If $\phi_0, \ldots, \phi_n$ are linearly independent, the Gram matrix
$G$ is invertible.

Proof. The matrix $G$ is invertible if $Gz = 0$ implies $z = 0$, i.e., $G$ has
a trivial null space. If $Gz = 0$, then $z^* G z = 0$. Theorem 2.7 ensures
that $G$ is positive definite, so $z^* G z = 0$ implies $z = 0$. Hence, $G$ has a
trivial null space, and is thus invertible.

(Margin note: Eigenvalues illuminate. Two margin surface plots visualize $E(c_0, c_1)$ for best
approximation of $f(x) = e^x$ from $\mathcal{P}_1$ over
$x \in [-1,1]$ (top) and $x \in [0,1]$ (bottom).
For $[-1,1]$, the eigenvalues of $G$ are
relatively large, $\lambda_1 = 2$ and $\lambda_2 = 2/3$, and the error surface
looks very bowl-like.
For $[0,1]$, $G$ has a small eigenvalue: the
error surface is much more ‘shallow’
in one direction. (The orientation of
the trough can be found from the
corresponding eigenvector of $G$.) Here the eigenvalues of $G$ are
$\lambda_1 = (4 + \sqrt{13})/6 = 1.26759\ldots$ and $\lambda_2 = (4 - \sqrt{13})/6 = 0.06574\ldots$)

We can summarize our findings as follows.

Given any basis $\phi_0, \ldots, \phi_n$, the least squares approximation
$P_*$ to $f \in C[a,b]$ is unique and can be expressed as
$$ P_* = \sum_{j=0}^n c_j \phi_j, $$
where the coefficients $c$ are computed as the unique solution
of $Gc = b$.
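The summary above translates into a short computation. The following Python sketch is an illustration under stated assumptions, not a prescribed implementation: inner products are approximated with NumPy’s Gauss-Legendre rule (`numpy.polynomial.legendre.leggauss`), the helper names `inner` and `least_squares_coeffs` are made up here, and the right-hand side vector is called `rhs` to avoid a clash with the interval endpoint $b$.

```python
import numpy as np

def inner(u, v, a, b, npts=40):
    """Approximate <u, v> = integral of u*v over [a, b] by Gauss-Legendre quadrature."""
    t, w = np.polynomial.legendre.leggauss(npts)     # nodes and weights on [-1, 1]
    x = 0.5 * (b - a) * t + 0.5 * (a + b)            # map nodes to [a, b]
    return 0.5 * (b - a) * np.sum(w * u(x) * v(x))

def least_squares_coeffs(f, basis, a, b):
    """Assemble the Gram system G c = rhs for the given basis and solve it."""
    G = np.array([[inner(pk, pj, a, b) for pk in basis] for pj in basis])
    rhs = np.array([inner(f, pj, a, b) for pj in basis])
    return G, np.linalg.solve(G, rhs)

monomials = [lambda x: np.ones_like(x), lambda x: x]  # basis 1, x for P_1
G, c = least_squares_coeffs(np.exp, monomials, 0.0, 1.0)
print(G)                       # close to [[1, 1/2], [1/2, 1/3]]
print(c)                       # close to [4e - 10, 18 - 6e]
print(np.linalg.eigvalsh(G))   # both eigenvalues are positive
```

For $f(x) = e^x$ on $[0,1]$ this reproduces the coefficients of Example 2.4, and the eigenvalues of $G$ are $(4 \mp \sqrt{13})/6$, consistent with positive definiteness.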

As with the interpolation problem studied earlier, different choices of


basis will give different linear algebra problems, but ultimately result
in the same overall approximation P∗ . We shall study several choices
for the basis in Sections 2.4.5 and 2.4.6. Before doing so, we establish
a fundamental property of least squares approximation.

2.4.4 Orthogonality of the error


This is the fundamental theorem of linear least squares problems:

The error f − P∗ is orthogonal to the approximating subspace Pn .

Having worked hard to characterize the optimal approximation, the


formula Gc = b makes the proof of this result trivial.

Theorem 2.8. The function $P_* \in \mathcal{P}_n$ is the least squares approximation
to $f \in C[a,b]$ if and only if the error $f - P_*$ is orthogonal to the
subspace $\mathcal{P}_n$ from which the approximation was drawn:
$$ \langle f - P_*, q \rangle = 0, \qquad \text{for all } q \in \mathcal{P}_n. $$

Proof. First suppose that $P_*$ is the least squares approximation. Thus
given any basis $\phi_0, \ldots, \phi_n$ for $\mathcal{P}_n$, we can express $P_* = c_0 \phi_0 + \cdots + c_n \phi_n$,
where the coefficients solve $Gc = b$. Now for any basis function
$\phi_j$, use the linearity of the inner product to compute
$$ \begin{aligned}
\langle f - P_*, \phi_j \rangle &= \langle f, \phi_j \rangle - \langle P_*, \phi_j \rangle \\
&= \langle f, \phi_j \rangle - \Big\langle \sum_{k=0}^n c_k \phi_k, \phi_j \Big\rangle = \langle f, \phi_j \rangle - \sum_{k=0}^n c_k \langle \phi_k, \phi_j \rangle.
\end{aligned} $$
Recall that the $j$th row of the equation $Gc = b$ (see (2.14)) is precisely
$$ \sum_{k=0}^n c_k \langle \phi_k, \phi_j \rangle = \langle f, \phi_j \rangle, $$
so since the least squares approximation must satisfy $Gc = b$,
conclude that $\langle f - P_*, \phi_j \rangle = 0$. Since this orthogonality holds for
$j = 0, \ldots, n$, the error $f - P_*$ is orthogonal to the entire basis for $\mathcal{P}_n$,
and hence it is orthogonal to any vector $q = \sum_{j=0}^n d_j \phi_j \in \mathcal{P}_n$, since
$$ \langle f - P_*, q \rangle = \Big\langle f - P_*, \sum_{j=0}^n d_j \phi_j \Big\rangle = \sum_{j=0}^n d_j \langle f - P_*, \phi_j \rangle = \sum_{j=0}^n d_j \cdot 0 = 0. $$
Thus, the least squares error $f - P_*$ is orthogonal to all $q \in \mathcal{P}_n$.

On the other hand, suppose that $p \in \mathcal{P}_n$ gives an error $f - p$ that is
orthogonal to all $q \in \mathcal{P}_n$, i.e.,
$$ \langle f - p, q \rangle = 0, \qquad \text{for all } q \in \mathcal{P}_n. \tag{2.16} $$
Let $\phi_0, \ldots, \phi_n$ be a basis for $\mathcal{P}_n$. Then we can find $c_0, \ldots, c_n$ so that
$p = c_0 \phi_0 + \cdots + c_n \phi_n$. The orthogonality of $f - p$ to $\mathcal{P}_n$ in (2.16) implies
in particular that $\langle f - p, \phi_j \rangle = 0$ for all $j = 0, \ldots, n$, i.e., using linearity
of the inner product,
$$ 0 = \Big\langle f - \sum_{k=0}^n c_k \phi_k, \phi_j \Big\rangle = \langle f, \phi_j \rangle - \sum_{k=0}^n c_k \langle \phi_k, \phi_j \rangle $$
for $j = 0, \ldots, n$. Thus orthogonality of the error implies that
$$ \sum_{k=0}^n c_k \langle \phi_k, \phi_j \rangle = \langle f, \phi_j \rangle, \qquad j = 0, \ldots, n, $$
and these $n+1$ equations precisely give $Gc = b$: since the coefficients
of $p$ satisfy the linear system that characterizes the (unique) least
squares approximation, $p$ must be that least squares approximation,
$p = P_*$. Thus, orthogonality of the error $f - p$ with all $q \in \mathcal{P}_n$ implies
that $p$ is the least squares approximation.
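Theorem 2.8 can be observed numerically. In this hedged sketch (NumPy assumed; the cubic monomial basis on $[0,1]$, the 40-point Gauss-Legendre rule, and the particular test polynomial `q` are all choices made for illustration), the least squares approximation to $e^x$ is computed from $Gc = b$ and the inner products of the error with each basis function, and with an arbitrary $q \in \mathcal{P}_3$, are then evaluated.

```python
import numpy as np

t, w = np.polynomial.legendre.leggauss(40)
x, w = 0.5 * (t + 1.0), 0.5 * w                  # quadrature rule mapped to [0, 1]
inner = lambda u, v: np.sum(w * u * v)           # <u, v> for functions sampled at x

n = 3
Phi = np.vstack([x**k for k in range(n + 1)])    # row k holds phi_k(x) = x^k
f = np.exp(x)
G = np.array([[inner(Phi[k], Phi[j]) for k in range(n + 1)] for j in range(n + 1)])
rhs = np.array([inner(f, Phi[j]) for j in range(n + 1)])
c = np.linalg.solve(G, rhs)

error = f - c @ Phi                              # samples of f - P*
residuals = np.array([inner(error, Phi[j]) for j in range(n + 1)])
q = np.array([2.0, -1.0, 0.5, 3.0]) @ Phi        # an arbitrary q in P_3
print(residuals)                                 # all at rounding level
print(inner(error, q))                           # also at rounding level
```

Every inner product comes out at the level of rounding error, as the theorem requires.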

2.4.5 Monomial basis


Suppose we apply this method on the interval $[a,b] = [0,1]$ with the
monomial basis, $\phi_k(x) = x^k$. In that case,
$$ \langle \phi_k, \phi_j \rangle = \langle x^k, x^j \rangle = \int_0^1 x^{j+k} \, dx = \frac{1}{j+k+1}, $$
and the coefficient matrix has an elementary structure. In fact, this is
a form of the notorious Hilbert matrix. (Margin note: See M.-D. Choi, ‘Tricks or treats with
the Hilbert matrix,’ American Math. Monthly 90 (1983) 301–312.) It is exceptionally difficult to
obtain accurate solutions with this matrix in floating point arithmetic,
reflecting the fact that the monomials are a poor basis for $\mathcal{P}_n$ on $[0,1]$.
Let $G$ denote the $(n+1)$-dimensional Hilbert matrix, and suppose $b$
is constructed so that the exact solution to the system $Gc = b$ is
$c = [1, 1, \ldots, 1]^T$. Let $\widehat{c}$ denote the computed solution to the system in
MATLAB. Ideally the forward error $\|c - \widehat{c}\|_2$ will be nearly zero (if
the rounding errors incurred while constructing $b$ and solving the
system are small). Unfortunately, this is not the case: the condition
number of $G$ grows exponentially in the dimension $n$, and the accuracy
of the computed solution to the linear system quickly degrades
as $n$ increases.

     n     ‖G‖‖G⁻¹‖          ‖c − ĉ‖
     5     1.495 × 10^7      7.548 × 10^−11
    10     1.603 × 10^14     0.01288
    15     4.380 × 10^17     12.61
    20     1.251 × 10^18     46.9

Clearly these errors are not acceptable!

(Margin note: The last few condition numbers
$\|G\| \|G^{-1}\|$ are in fact smaller than
they ought to be: MATLAB computes the
condition number as the ratio of
the largest to smallest singular values
of $G$; the smallest singular value can
only be determined accurately if it is
larger than about $\|G\| \varepsilon_{\rm mach}$, where
$\varepsilon_{\rm mach} \approx 2.2 \times 10^{-16}$. Thus, if the true
condition number is larger than about
$1/\varepsilon_{\rm mach}$, we should not expect MATLAB
to compute it accurately.)

In summary: The monomial basis forms an ill-conditioned basis for $\mathcal{P}_n$
over the real interval $[a,b]$.
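The experiment behind the table can be repeated in other environments. The sketch below uses Python with NumPy rather than MATLAB, so the digits for the larger values of $n$ need not match the table; the function name `hilbert_experiment` is an illustrative choice.

```python
import numpy as np

def hilbert_experiment(n):
    """Solve G c = b for the (n+1)-by-(n+1) Hilbert matrix whose exact solution is all ones."""
    idx = np.arange(n + 1)
    G = 1.0 / (idx[:, None] + idx[None, :] + 1.0)   # <x^k, x^j> = 1/(j + k + 1)
    c_exact = np.ones(n + 1)
    b = G @ c_exact
    c_hat = np.linalg.solve(G, b)
    return np.linalg.cond(G), np.linalg.norm(c_exact - c_hat)

for n in (5, 10, 15, 20):
    cond, err = hilbert_experiment(n)
    print(f"n = {n:2d}   cond = {cond:.3e}   forward error = {err:.3e}")
```

As in the notes, the computed condition numbers for the largest $n$ are limited by the accuracy with which the smallest singular value can be resolved, so they should be read as lower bounds at best.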

2.4.6 Orthogonal basis


In the search for a basis for $\mathcal{P}_n$ that will avoid the numerical
difficulties, let the structure of the equation $Gc = b$ be our guide. What
choice of basis would make the matrix $G$, written out in (2.15), as
simple as possible? If the basis vectors are orthogonal, i.e.,
$$ \langle \phi_j, \phi_k \rangle \;
\begin{cases} \ne 0, & j = k; \\ = 0, & j \ne k, \end{cases} $$
then $G$ only has nonzeros on the main diagonal, giving the system
$$ \begin{bmatrix}
\langle \phi_0, \phi_0 \rangle & 0 & \cdots & 0 \\
0 & \langle \phi_1, \phi_1 \rangle & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \langle \phi_n, \phi_n \rangle
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}
= \begin{bmatrix} \langle f, \phi_0 \rangle \\ \langle f, \phi_1 \rangle \\ \vdots \\ \langle f, \phi_n \rangle \end{bmatrix}. $$

(Margin note: Section 2.5 will derive a procedure for
computing an orthogonal basis for $\mathcal{P}_n$.)

This system decouples into $n+1$ scalar equations $\langle \phi_j, \phi_j \rangle c_j = \langle f, \phi_j \rangle$
for $j = 0, \ldots, n$. Solve these scalar equations to get
$$ c_j = \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle}, \qquad j = 0, \ldots, n. $$

Thus, with respect to the orthogonal basis the least squares
approximation to $f$ is given by
$$ P_*(x) = \sum_{j=0}^n c_j \phi_j(x) = \sum_{j=0}^n \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle} \phi_j(x). \tag{2.17} $$

The formula (2.17) has an outstanding property: if we wish to
extend approximation from $\mathcal{P}_n$ one degree higher to $\mathcal{P}_{n+1}$, we simply
add in one more term. If we momentarily use the notation $P_{*,k}$ for
the least squares approximation from $\mathcal{P}_k$, then
$$ P_{*,n+1}(x) = P_{*,n}(x) + \frac{\langle f, \phi_{n+1} \rangle}{\langle \phi_{n+1}, \phi_{n+1} \rangle} \phi_{n+1}(x). $$
In contrast, to increase the degree of the least squares approximation
in the monomial basis, one would need to extend the $G$ matrix by
one row and column, and re-solve $Gc = b$: increasing the degree
changes all the old coefficients in the monomial basis.
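To illustrate formula (2.17) and the ‘add one more term’ property, the sketch below uses the Legendre polynomials $P_j$, which are orthogonal on $[-1,1]$ with $\langle P_j, P_j \rangle = 2/(2j+1)$. Choosing Legendre polynomials here anticipates the orthogonal bases constructed later in the notes and is an assumption of this example, as are the helper name `legendre_ls_coeffs` and the 60-point quadrature rule.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_ls_coeffs(f, n, npts=60):
    """c_j = <f, P_j> / <P_j, P_j> on [-1, 1], with P_j the Legendre polynomials."""
    t, w = L.leggauss(npts)                      # Gauss-Legendre nodes and weights
    c = np.empty(n + 1)
    for j in range(n + 1):
        Pj = L.legval(t, [0.0] * j + [1.0])      # values of P_j at the nodes
        c[j] = np.sum(w * f(t) * Pj) / (2.0 / (2 * j + 1))   # <P_j, P_j> = 2/(2j+1)
    return c

c3 = legendre_ls_coeffs(np.exp, 3)
c4 = legendre_ls_coeffs(np.exp, 4)
print(c3)
print(c4)     # the first four entries agree with c3; only one new coefficient appears
```

Raising the degree from 3 to 4 leaves the first four coefficients untouched, in contrast to the monomial basis, where every coefficient changes.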
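An orthogonal basis also permits a beautifully simple formula for
the norm of the error, $\|f - P_*\|_2$.

Theorem 2.9. Let $\phi_0, \ldots, \phi_n$ denote an orthogonal basis for $\mathcal{P}_n$. Then
for any $f \in C[a,b]$, the norm of the error $f - P_*$ of the least squares
approximation $P_* \in \mathcal{P}_n$ is
$$ \|f - P_*\|_2 = \sqrt{ \|f\|_2^2 - \sum_{j=0}^n \frac{\langle f, \phi_j \rangle^2}{\langle \phi_j, \phi_j \rangle} }. \tag{2.18} $$

(Margin note: This result is closely related to Parseval’s
identity, which essentially says that if
$\phi_0, \phi_1, \ldots$ forms an orthogonal basis
for the (possibly infinite dimensional)
vector space $V$, then for any $f \in V$,
$$ \|f\|^2 = \sum_j \frac{\langle f, \phi_j \rangle^2}{\langle \phi_j, \phi_j \rangle}. $$
To put the utility of the formula (2.18)
in context, think about minimax approximation.
We have various bounds,
like de la Vallée Poussin’s theorem, on
the minimax error, but no easy formula
exists to give you that error directly.)

Proof. First, use the formula (2.17) for $P_*$ to compute
$$ \begin{aligned}
\|P_*\|_2^2 &= \Big\langle \sum_{j=0}^n \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle} \phi_j, \; \sum_{k=0}^n \frac{\langle f, \phi_k \rangle}{\langle \phi_k, \phi_k \rangle} \phi_k \Big\rangle \\
&= \sum_{j=0}^n \sum_{k=0}^n \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle} \frac{\langle f, \phi_k \rangle}{\langle \phi_k, \phi_k \rangle} \langle \phi_j, \phi_k \rangle,
\end{aligned} $$
using linearity of the inner product. Since the basis polynomials are
orthogonal, $\langle \phi_j, \phi_k \rangle = 0$ for $j \ne k$, which reduces the double sum to
$$ \|P_*\|_2^2 = \sum_{j=0}^n \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle} \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle} \langle \phi_j, \phi_j \rangle = \sum_{j=0}^n \frac{\langle f, \phi_j \rangle^2}{\langle \phi_j, \phi_j \rangle}. $$

This calculation simplifies our primary concern:
$$ \begin{aligned}
\|f - P_*\|_2^2 &= \langle f - P_*, f - P_* \rangle = \langle f, f \rangle - \langle f, P_* \rangle - \langle P_*, f \rangle + \langle P_*, P_* \rangle \\
&= \langle f, f \rangle - 2 \langle f, P_* \rangle + \langle P_*, P_* \rangle \\
&= \langle f, f \rangle - 2 \Big\langle f, \sum_{j=0}^n \frac{\langle f, \phi_j \rangle}{\langle \phi_j, \phi_j \rangle} \phi_j \Big\rangle + \langle P_*, P_* \rangle \\
&= \langle f, f \rangle - 2 \sum_{j=0}^n \frac{\langle f, \phi_j \rangle^2}{\langle \phi_j, \phi_j \rangle} + \sum_{j=0}^n \frac{\langle f, \phi_j \rangle^2}{\langle \phi_j, \phi_j \rangle} \\
&= \|f\|_2^2 - \sum_{j=0}^n \frac{\langle f, \phi_j \rangle^2}{\langle \phi_j, \phi_j \rangle},
\end{aligned} $$
as required.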
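Formula (2.18) can be checked against a direct quadrature of the squared error. The sketch below again assumes the Legendre polynomials as the orthogonal basis on $[-1,1]$ and $f(x) = e^x$ with $n = 3$; these choices, like the variable names, are illustrative only.

```python
import numpy as np
from numpy.polynomial import legendre as L

t, w = L.leggauss(60)                        # quadrature rule on [-1, 1]
f = np.exp(t)
n = 3
P = np.array([L.legval(t, [0.0] * j + [1.0]) for j in range(n + 1)])  # P_j at the nodes
fp = P @ (w * f)                             # <f, P_j>
pp = P**2 @ w                                # <P_j, P_j>

best = (fp / pp) @ P                         # samples of P* from formula (2.17)
direct = np.sqrt(np.sum(w * (f - best) ** 2))               # ||f - P*||_2 by quadrature
formula = np.sqrt(np.sum(w * f**2) - np.sum(fp**2 / pp))    # right-hand side of (2.18)
print(direct, formula)
```

The two values agree to many digits. The formula subtracts nearly equal quantities when the error is small, so in floating point arithmetic the direct evaluation can be the more accurate of the two for high-degree approximations.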

2.4.7 Coda: Connection to discrete least squares (Optional Section)


Studies of numerical linear algebra inevitably address the discrete least
squares problem: Given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ with $m \ge n$, solve
$$ \min_{x \in \mathbb{R}^n} \|Ax - b\|_2, \tag{2.19} $$
using the Euclidean norm $\|v\|_2 = \sqrt{v^* v}$. One can show that the
minimizing $x$ solves the linear system
$$ A^* A x = A^* b, \tag{2.20} $$
which are called the normal equations. If $\mathrm{rank}(A) = n$ (i.e., the
columns of $A$ are linearly independent), then $A^* A \in \mathbb{R}^{n \times n}$ is
invertible, and
$$ x = (A^* A)^{-1} A^* b. \tag{2.21} $$

One learns that, for purposes of numerical stability, it is preferable to
compute the QR factorization
$$ A = QR, $$
where the columns of $Q \in \mathbb{R}^{m \times n}$ are orthonormal, $Q^* Q = I$, and
$R \in \mathbb{R}^{n \times n}$ is upper triangular ($r_{j,k} = 0$ if $j > k$) and invertible if
the columns of $A$ are linearly independent. Substituting $QR$ for $A$
reduces the solution formula (2.21) to
$$ x = R^{-1} Q^* b. \tag{2.22} $$

How does this “least squares problem” relate to the polynomial


approximation problem in this section? We consider two perspec-
tives.

2.4.8 Discrete least squares as subspace approximation


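Notice that the problem (2.19) can be viewed as
\[
\min_{\mathbf{x}\in\mathbb{R}^n} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 = \min_{\mathbf{v}\in\mathrm{Ran}(\mathbf{A})} \|\mathbf{b} - \mathbf{v}\|_2, \tag{2.23}
\]
i.e., the discrete least squares problem seeks to approximate $\mathbf{b}$ with some vector $\mathbf{v} = \mathbf{A}\mathbf{x}$ from the subspace $\mathrm{Ran}(\mathbf{A}) \subset \mathbb{R}^m$.

[Margin note: $\mathrm{Ran}(\mathbf{A}) = \{\mathbf{A}\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$ is the range (column space) of $\mathbf{A}$.]

Writing
\[
\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{bmatrix}
\]
for $\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^m$, we seek
\[
\mathbf{v} = \mathbf{A}\mathbf{x} = x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n \in \mathbb{R}^m
\]
to approximate $\mathbf{b} \in \mathbb{R}^m$.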
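Viewing $\mathbf{a}_1, \ldots, \mathbf{a}_n$ as a basis for the approximating subspace $\mathrm{Ran}(\mathbf{A})$, one can develop the least squares theory precisely as we have earlier in this section, using the inner product
\[
\langle \mathbf{a}_j, \mathbf{a}_k\rangle = \mathbf{a}_k^*\mathbf{a}_j.
\]
Minimizing the error function
\[
E(x_1, \ldots, x_n) = \|\mathbf{b} - (x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n)\|_2^2
\]
with respect to $x_1, \ldots, x_n$ just as in the previous development leads to the Gram matrix problem
\[
\begin{bmatrix}
\langle \mathbf{a}_1, \mathbf{a}_1\rangle & \langle \mathbf{a}_1, \mathbf{a}_2\rangle & \cdots & \langle \mathbf{a}_1, \mathbf{a}_n\rangle \\
\langle \mathbf{a}_2, \mathbf{a}_1\rangle & \langle \mathbf{a}_2, \mathbf{a}_2\rangle & \cdots & \langle \mathbf{a}_2, \mathbf{a}_n\rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle \mathbf{a}_n, \mathbf{a}_1\rangle & \langle \mathbf{a}_n, \mathbf{a}_2\rangle & \cdots & \langle \mathbf{a}_n, \mathbf{a}_n\rangle
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} \langle \mathbf{b}, \mathbf{a}_1\rangle \\ \langle \mathbf{b}, \mathbf{a}_2\rangle \\ \vdots \\ \langle \mathbf{b}, \mathbf{a}_n\rangle \end{bmatrix},
\tag{2.24}
\]
which is a perfect analogue of (2.15). In fact, notice that (2.24) is nothing other than
\[
\mathbf{A}^*\mathbf{A}\mathbf{x} = \mathbf{A}^*\mathbf{b},
\]
the familiar normal equations! What role does the QR factorization play? The columns of $\mathbf{Q}$ form an orthonormal basis for $\mathrm{Ran}(\mathbf{A})$:
\[
\mathrm{Ran}(\mathbf{A}) = \mathrm{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_n\} = \mathrm{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_n\} = \mathrm{Ran}(\mathbf{Q}).
\]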

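So the approximation problem (2.23) is equivalent to
\[
\min_{\mathbf{x}\in\mathbb{R}^n} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 = \min_{\mathbf{v}\in\mathrm{Ran}(\mathbf{Q})} \|\mathbf{b} - \mathbf{v}\|_2 = \min_{c_1, \ldots, c_n} \|\mathbf{b} - (c_1\mathbf{q}_1 + \cdots + c_n\mathbf{q}_n)\|_2.
\]
The Gram matrix system with respect to this basis is
\[
\begin{bmatrix}
\langle \mathbf{q}_1, \mathbf{q}_1\rangle & \langle \mathbf{q}_1, \mathbf{q}_2\rangle & \cdots & \langle \mathbf{q}_1, \mathbf{q}_n\rangle \\
\langle \mathbf{q}_2, \mathbf{q}_1\rangle & \langle \mathbf{q}_2, \mathbf{q}_2\rangle & \cdots & \langle \mathbf{q}_2, \mathbf{q}_n\rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle \mathbf{q}_n, \mathbf{q}_1\rangle & \langle \mathbf{q}_n, \mathbf{q}_2\rangle & \cdots & \langle \mathbf{q}_n, \mathbf{q}_n\rangle
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} \langle \mathbf{b}, \mathbf{q}_1\rangle \\ \langle \mathbf{b}, \mathbf{q}_2\rangle \\ \vdots \\ \langle \mathbf{b}, \mathbf{q}_n\rangle \end{bmatrix}.
\tag{2.25}
\]
The orthonormality of the vectors $\mathbf{q}_1, \ldots, \mathbf{q}_n$ means that $\mathbf{q}_j^*\mathbf{q}_k = 0$ if $j \ne k$ and $\mathbf{q}_j^*\mathbf{q}_j = \|\mathbf{q}_j\|_2^2 = 1$, and so the matrix in (2.25) is the identity. Hence
\[
c_j = \langle \mathbf{b}, \mathbf{q}_j\rangle,
\]
so we can write
\[
\mathbf{c} = \mathbf{Q}^*\mathbf{b}.
\]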

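The approximation $\mathbf{v}$ to $\mathbf{b}$ is then
\[
\mathbf{v} = c_1\mathbf{q}_1 + \cdots + c_n\mathbf{q}_n = \mathbf{Q}\mathbf{c} = \mathbf{Q}\mathbf{Q}^*\mathbf{b}.
\]
Now using the fact that $\mathbf{Q}^*\mathbf{Q} = \mathbf{I}$ (so that $\mathbf{A}^*\mathbf{A} = \mathbf{R}^*\mathbf{Q}^*\mathbf{Q}\mathbf{R} = \mathbf{R}^*\mathbf{R}$),
\[
\mathbf{Q}\mathbf{Q}^* = (\mathbf{A}\mathbf{R}^{-1})(\mathbf{A}\mathbf{R}^{-1})^* = \mathbf{A}(\mathbf{R}^*\mathbf{R})^{-1}\mathbf{A}^* = \mathbf{A}(\mathbf{A}^*\mathbf{A})^{-1}\mathbf{A}^*.
\]
Thus the least squares approximation to $\mathbf{b}$ is
\[
\mathbf{v} = \mathbf{Q}\mathbf{Q}^*\mathbf{b} = \mathbf{A}\big((\mathbf{A}^*\mathbf{A})^{-1}\mathbf{A}^*\mathbf{b}\big) = \mathbf{A}\mathbf{x},
\]
where $\mathbf{x}$ solves the original least squares problem (2.19).

Thus, the orthogonal basis for the approximating space $\mathrm{Ran}(\mathbf{A})$ leads to an easy formula for the approximation, in just the same fashion that orthogonal polynomials made quick work of the polynomial least squares problem in (2.17).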
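The relations (2.20), (2.22), and $\mathbf{v} = \mathbf{Q}\mathbf{Q}^*\mathbf{b} = \mathbf{A}\mathbf{x}$ can be checked on a tiny example. The following is only a sketch under stated assumptions: the 4-by-2 matrix and the data vector are made up for illustration, the helper names (`gram_schmidt_qr`, `back_substitute`) are hypothetical, and the QR factorization is computed with a simple modified Gram-Schmidt loop rather than the Householder approach a production library would typically use.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gram_schmidt_qr(cols):
    """Modified Gram-Schmidt: orthonormal columns q and upper triangular r with A = QR."""
    n = len(cols)
    q, r = [], [[0.0] * n for _ in range(n)]
    for k in range(n):
        v = list(cols[k])
        for j in range(k):
            r[j][k] = dot(q[j], v)
            v = [vi - r[j][k] * qi for vi, qi in zip(v, q[j])]
        r[k][k] = math.sqrt(dot(v, v))
        q.append([vi / r[k][k] for vi in v])
    return q, r

def back_substitute(r, y):
    """Solve the upper triangular system r x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(r[i][j] * x[j] for j in range(i + 1, n))) / r[i][i]
    return x

# Illustrative data: columns of A (fit c0 + c1*t at t = 0, 1, 2, 3) and vector b.
a_cols = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
b = [1.0, 2.0, 2.0, 4.0]

# Route 1: normal equations (2.20), solved by Cramer's rule since n = 2.
g = [[dot(ai, aj) for aj in a_cols] for ai in a_cols]
rhs = [dot(ai, b) for ai in a_cols]
det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
x_normal = [(rhs[0] * g[1][1] - g[0][1] * rhs[1]) / det,
            (g[0][0] * rhs[1] - g[1][0] * rhs[0]) / det]

# Route 2: x = R^{-1} Q^* b, as in (2.22).
q, r = gram_schmidt_qr(a_cols)
c = [dot(qj, b) for qj in q]
x_qr = back_substitute(r, c)

# The projection Q Q^* b should agree with A x.
v_from_q = [sum(c[j] * q[j][i] for j in range(2)) for i in range(4)]
v_from_x = [sum(x_qr[j] * a_cols[j][i] for j in range(2)) for i in range(4)]
print(x_normal, x_qr)
```

Both routes should return the same coefficients up to rounding; for badly conditioned $\mathbf{A}$ the QR route is the one expected to retain more accuracy.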

2.4.9 Discrete least squares for polynomial approximation


Now we turn the tables for another view of the connection between the polynomial approximation problem and the matrix least squares problem (2.19).
Suppose we only know how to solve discrete least squares problems like (2.19), and want to use that technology to construct some polynomial $p \in \mathcal{P}_n$ that approximates $f \in C[a,b]$ over $x \in [a,b]$.
We could sample $f$ at, say, $m+1$ discrete points $x_0, \ldots, x_m$ uniformly distributed over $[a,b]$: set $h_m := (b-a)/m$ and let
\[
x_k = a + k h_m.
\]
We then want to solve
\[
\min_{p\in\mathcal{P}_n} \sum_{j=0}^m |f(x_j) - p(x_j)|^2. \tag{2.26}
\]
This least squares error, when scaled by $h_m$, takes the form of a Riemann sum that, in the $m \to \infty$ limit, approximates an integral:
\[
\lim_{m\to\infty} h_m \sum_{k=0}^m \big(f(x_k) - p(x_k)\big)^2 = \int_a^b \big(f(x) - p(x)\big)^2\,dx.
\]
That is, as we take more and more approximation points, the error (2.26) that we are minimizing better and better approximates the integral error formulation (2.12).
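To solve (2.26), represent $p \in \mathcal{P}_n$ using the monomial basis,
\[
p(x) = c_0 + c_1 x + \cdots + c_n x^n.
\]
Then write (2.26) as
\[
\min_{c_0, \ldots, c_n} \|\mathbf{f} - \mathbf{A}\mathbf{c}\|_2^2,
\]
where
\[
\mathbf{A} = \begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n \\
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_m & x_m^2 & \cdots & x_m^n
\end{bmatrix},
\qquad
\mathbf{c} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix},
\qquad
\mathbf{f} = \begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{bmatrix}.
\]
This discrete problem can be solved via the normal equations, i.e., find $\mathbf{c} \in \mathbb{R}^{n+1}$ to solve the matrix equation
\[
\mathbf{A}^*\mathbf{A}\mathbf{c} = \mathbf{A}^*\mathbf{f}.
\]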

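Compute the right-hand side as
\[
\mathbf{A}^*\mathbf{f} = \begin{bmatrix}
\sum_{k=0}^m f(x_k) \\
\sum_{k=0}^m x_k f(x_k) \\
\sum_{k=0}^m x_k^2 f(x_k) \\
\vdots \\
\sum_{k=0}^m x_k^n f(x_k)
\end{bmatrix} \in \mathbb{R}^{n+1},
\]
where each sum runs over all $m+1$ sample points.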

Notice that if $m+1$ approximation points are uniformly spaced over $[a,b]$, $x_k = a + k h_m$ for $h_m = (b-a)/m$, then
\[
\lim_{m\to\infty} h_m \mathbf{A}^*\mathbf{f} =
\begin{bmatrix}
\int_a^b f(x)\,dx \\
\int_a^b x f(x)\,dx \\
\int_a^b x^2 f(x)\,dx \\
\vdots \\
\int_a^b x^n f(x)\,dx
\end{bmatrix}
=
\begin{bmatrix}
\langle f, 1\rangle \\ \langle f, x\rangle \\ \langle f, x^2\rangle \\ \vdots \\ \langle f, x^n\rangle
\end{bmatrix},
\]
which is precisely the right-hand side vector $\mathbf{b} \in \mathbb{R}^{n+1}$ obtained for the original least squares problem at the beginning of this section in (2.15). Similarly, the $(j+1, k+1)$ entry of $\mathbf{A}^*\mathbf{A} \in \mathbb{R}^{(n+1)\times(n+1)}$ for the discrete problem can be formed as
\[
(\mathbf{A}^*\mathbf{A})_{j+1,k+1} = \sum_{\ell=0}^m x_\ell^j x_\ell^k = \sum_{\ell=0}^m x_\ell^{j+k},
\]
and thus for uniformly spaced approximation points,
\[
\lim_{m\to\infty} h_m (\mathbf{A}^*\mathbf{A})_{j+1,k+1} = \int_a^b x^{j+k}\,dx = \langle x^j, x^k\rangle.
\]
Thus in aggregate we have
\[
\lim_{m\to\infty} h_m \mathbf{A}^*\mathbf{A} = \mathbf{G},
\]
where $\mathbf{G}$ is the same Gram matrix in (2.15).

We arrive at the following beautiful conclusion: The normal equations $\mathbf{A}^*\mathbf{A}\mathbf{c} = \mathbf{A}^*\mathbf{f}$ formed for polynomial approximation by discrete least squares, once both sides are scaled by $h_m$, converge to exactly the same $(n+1)\times(n+1)$ system $\mathbf{G}\mathbf{c} = \mathbf{b}$ that we independently derived for the polynomial approximation problem (2.12) with the integral form of the error.
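The convergence of $h_m \mathbf{A}^*\mathbf{A}$ to the Gram matrix can be observed numerically. The sketch below assumes the interval $[0,1]$ (so the monomial Gram matrix has entries $1/(j+k+1)$) and degree $n = 2$; the function names are hypothetical and chosen only for this illustration.

```python
def scaled_normal_matrix(n, m, a=0.0, b=1.0):
    """h_m * A^T A for the monomial basis sampled at m + 1 uniform points."""
    h = (b - a) / m
    xs = [a + k * h for k in range(m + 1)]
    return [[h * sum(x ** (j + k) for x in xs) for k in range(n + 1)]
            for j in range(n + 1)]

def gram_matrix(n, a=0.0, b=1.0):
    """Exact Gram matrix: entry (j, k) is the integral of x^(j+k) over [a, b]."""
    return [[(b ** (j + k + 1) - a ** (j + k + 1)) / (j + k + 1)
             for k in range(n + 1)] for j in range(n + 1)]

def max_abs_diff(p, q):
    return max(abs(pij - qij) for pr, qr in zip(p, q) for pij, qij in zip(pr, qr))

n = 2
G = gram_matrix(n)
errors = {m: max_abs_diff(scaled_normal_matrix(n, m), G) for m in (10, 100, 1000)}
for m, err in errors.items():
    print(m, err)
```

Because the Riemann sum uses $m+1$ points with spacing $h_m$, the discrepancy should shrink roughly in proportion to $h_m$, which is slow but consistent with the limit derived above.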

lecture 19: Orthogonal Polynomials, Part I


lecture 20: Orthogonal Polynomials, Part II

2.5 Systems of orthogonal polynomials

Given a basis for $\mathcal{P}_n$, we can obtain the least squares approximation to $f \in C[a,b]$ by solving the linear system $\mathbf{G}\mathbf{c} = \mathbf{b}$ as described in Section 2.4.3. In particular, we could expand polynomials in any basis $\{\phi_k\}_{k=0}^n$ for $\mathcal{P}_n$,
\[
p = \sum_{k=0}^n c_k \phi_k,
\]
and then solve the system
\[
\begin{bmatrix}
\langle \phi_0, \phi_0\rangle & \langle \phi_0, \phi_1\rangle & \cdots & \langle \phi_0, \phi_n\rangle \\
\langle \phi_1, \phi_0\rangle & \langle \phi_1, \phi_1\rangle & & \vdots \\
\vdots & & \ddots & \vdots \\
\langle \phi_n, \phi_0\rangle & \langle \phi_n, \phi_1\rangle & \cdots & \langle \phi_n, \phi_n\rangle
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} \langle f, \phi_0\rangle \\ \langle f, \phi_1\rangle \\ \vdots \\ \langle f, \phi_n\rangle \end{bmatrix}.
\]
If the basis is orthogonal, so that $\langle \phi_j, \phi_k\rangle = 0$ when $j \ne k$, the Gram matrix is diagonal, and the coefficients take the simple form
\[
c_j = \frac{\langle f, \phi_j\rangle}{\langle \phi_j, \phi_j\rangle}.
\]
The simplicity of this solution is compelling, but some work is required to construct a basis that has this orthogonality property. This section derives an efficient method to build this basis. In fact, for later use we slightly generalize the inner product to incorporate a weight function.

Definition 2.3. Given a function $w \in C[a,b]$ with $w(x) > 0$, the inner product of $f, g \in C[a,b]$ with respect to the weight $w$ is
\[
\langle f, g\rangle = \int_a^b f(x) g(x) w(x)\,dx.
\]

[Margin note: Generalizations are possible: for example, we can allow $w(x) = 0$ on a set of measure zero (e.g., finitely many points on $[a,b]$), and we can take $[a,b]$ to be the unbounded interval $[0,\infty)$ or $(-\infty,\infty)$, provided we are willing to restrict $C[a,b]$ to functions that have finite norm on these intervals.]

One can confirm that this definition is consistent with the axioms required of an inner product that were described on page 97. For any such inner product, we then have the following definitions.

Definition 2.4. The functions $f, g \in C[a,b]$ are orthogonal if $\langle f, g\rangle = 0$.

Definition 2.5. A set of functions $\{\phi_k\}_{k=0}^n$ is a system of orthogonal polynomials provided:
• $\phi_k$ is a polynomial of exact degree $k$ (with $\phi_0 \ne 0$);
• $\langle \phi_j, \phi_k\rangle = 0$ when $j \ne k$.

Be sure not to overlook the first property, that $\phi_k$ has exact degree $k$; it ensures the following result.

Lemma 2.2. The system of orthogonal polynomials $\{\phi_k\}_{k=0}^\ell$ is a basis for $\mathcal{P}_\ell$, for all $\ell = 0, \ldots, n$.

The proof follows by observing that the exact degree property ensures that $\phi_0, \ldots, \phi_\ell$ are $\ell+1$ linearly independent vectors in the $(\ell+1)$-dimensional space $\mathcal{P}_\ell$. We can apply it to derive the next lemma, one we will use repeatedly.

Lemma 2.3. Let $\{\phi_j\}_{j=0}^n$ be a system of orthogonal polynomials. Then $\langle p, \phi_n\rangle = 0$ for any $p \in \mathcal{P}_{n-1}$.

Proof. Lemma 2.2 ensures that $\{\phi_j\}_{j=0}^{n-1}$ is a basis for $\mathcal{P}_{n-1}$. Thus for any $p \in \mathcal{P}_{n-1}$, one can determine constants $c_0, \ldots, c_{n-1}$ such that
\[
p = \sum_{j=0}^{n-1} c_j \phi_j.
\]
The linearity of the inner product and orthogonality of $\{\phi_j\}_{j=0}^n$ imply
\[
\langle p, \phi_n\rangle = \Big\langle \sum_{j=0}^{n-1} c_j \phi_j, \phi_n \Big\rangle = \sum_{j=0}^{n-1} c_j \langle \phi_j, \phi_n\rangle = \sum_{j=0}^{n-1} 0 = 0,
\]
as required.

We need a mechanism for constructing orthogonal polynomials. The Gram–Schmidt process used to orthogonalize vectors in $\mathbb{R}^n$ can readily be generalized to the present setting. Suppose that we have some $(n+1)$-dimensional subspace $S$ with the basis $p_0, p_1, \ldots, p_n$. Then the classical Gram–Schmidt algorithm takes the following form.

Algorithm 2.1 (Gram–Schmidt orthogonalization, prototype).
Given a basis $\{p_0, \ldots, p_n\}$ for some subspace $S$, the following algorithm constructs an orthogonal basis $\{\phi_0, \ldots, \phi_n\}$ for $S$:
    $\phi_0 := p_0$
    for $k = 1, \ldots, n$
        $\phi_k := p_k - \big(\text{least squares approximation to } p_k \text{ from } \mathrm{span}\{\phi_0, \ldots, \phi_{k-1}\}\big)$
    end

Focus on the construction of $\phi_k$: By the orthogonality of the error in least squares approximation (Theorem 2.8, which holds for the general inner products described above), $\phi_k$ is orthogonal to $\phi_0, \ldots, \phi_{k-1}$. Moreover, $\phi_k$ is not zero, since the linear independence of $p_0, \ldots, p_n$ ensures
\[
p_k \notin \mathrm{span}\{p_0, \ldots, p_{k-1}\} = \mathrm{span}\{\phi_0, \ldots, \phi_{k-1}\}.
\]

Since the basis $\phi_0, \ldots, \phi_{k-1}$ is orthogonal, the best approximation to $p_k$ can be easily written down via Theorem 2.9 as
\[
\sum_{j=0}^{k-1} \frac{\langle p_k, \phi_j\rangle}{\langle \phi_j, \phi_j\rangle}\,\phi_j.
\]
Thus the Gram–Schmidt process takes this more precise formulation.

Algorithm 2.2 (Gram–Schmidt orthogonalization, general basis).
Given a basis $\{p_0, \ldots, p_n\}$ for some subspace $S$, the following algorithm constructs an orthogonal basis $\{\phi_0, \ldots, \phi_n\}$ for $S$:
    $\phi_0 := p_0$
    for $k = 1, \ldots, n$
        $\phi_k := p_k - \sum_{j=0}^{k-1} \dfrac{\langle p_k, \phi_j\rangle}{\langle \phi_j, \phi_j\rangle}\,\phi_j$
    end

This is a convenient process, but like the vector Gram–Schmidt process the amount of work required at each step grows as $k$ increases, as $p_k$ must be orthogonalized against more $\phi_j$ polynomials. Fortunately, there is a slick way to choose the initial basis $\{p_0, \ldots, p_n\}$ for which the Gram–Schmidt process takes a much simpler form.
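Algorithm 2.2 can be sketched in a few lines. The code below makes several assumptions that are not in the notes: polynomials are stored as lists of coefficients in ascending powers of $x$, the inner product is fixed to weight $w(x) = 1$ on $[-1,1]$, the starting basis is the monomials $1, x, x^2, x^3$, and exact rational arithmetic is used so the result can be read off without rounding. The names `poly_mul`, `inner`, and `gram_schmidt` are hypothetical.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Product of two polynomials given as ascending coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def inner(p, q):
    """<p, q> = integral of p(x) q(x) over [-1, 1]; odd powers integrate to zero."""
    prod = poly_mul(p, q)
    return sum(c * Fraction(2, i + 1) for i, c in enumerate(prod) if i % 2 == 0)

def gram_schmidt(basis):
    """Algorithm 2.2: orthogonalize each p_k against all earlier phi_j."""
    phis = []
    for p in basis:
        phi = list(p)
        for q in phis:
            coef = inner(p, q) / inner(q, q)
            for i, qi in enumerate(q):
                phi[i] -= coef * qi
        phis.append(phi)
    return phis

monomials = [[Fraction(0)] * k + [Fraction(1)] for k in range(4)]
phis = gram_schmidt(monomials)
for phi in phis:
    print([str(c) for c in phi])
```

Note that the inner loop over all earlier `phis` is exactly the growing cost the paragraph above describes: step $k$ requires $k$ inner products in the numerators alone.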
Suppose one has a set of orthogonal polynomials, $\{\phi_j\}_{j=0}^{k-1}$, and seeks the next orthogonal polynomial, $\phi_k$. Since $\phi_{k-1}$ has exact degree $k-1$, the polynomial $x\phi_{k-1}(x)$ has exact degree $k$, and hence $x\phi_{k-1}(x) \in \mathcal{P}_k$ but
\[
x\phi_{k-1}(x) \notin \mathrm{span}\{\phi_0, \ldots, \phi_{k-1}\}.
\]
Thus, we could apply a Gram–Schmidt step to orthogonalize $x\phi_{k-1}(x)$ against $\phi_0, \ldots, \phi_{k-1}$ to get a new orthogonal basis vector, $\phi_k$, giving
\[
\mathcal{P}_k = \mathrm{span}\{\phi_0, \ldots, \phi_k\}.
\]
What is special about the choice $p_k(x) = x\phi_{k-1}(x)$? It gives an essential simplification to the customary Gram–Schmidt recurrence
\[
\phi_k(x) = x\phi_{k-1}(x) - \sum_{j=0}^{k-1} \frac{\langle x\phi_{k-1}(x), \phi_j(x)\rangle}{\langle \phi_j, \phi_j\rangle}\,\phi_j(x).
\]
The trick is that the $x$ in $x\phi_{k-1}(x)$ can be flipped to the other side of the inner product,
\begin{align*}
\langle x\phi_{k-1}(x), \phi_j(x)\rangle &= \int_a^b \big(x\phi_{k-1}(x)\big)\phi_j(x) w(x)\,dx \\
&= \int_a^b \phi_{k-1}(x)\big(x\phi_j(x)\big) w(x)\,dx \\
&= \langle \phi_{k-1}(x), x\phi_j(x)\rangle.
\end{align*}

Now, recall from Lemma 2.3 that $\phi_{k-1}$ is orthogonal to all polynomials of degree less than $k-1$. Thus since $x\phi_j(x) \in \mathcal{P}_{j+1}$, if $j+1 < k-1$,
\[
\langle x\phi_{k-1}(x), \phi_j(x)\rangle = \langle \phi_{k-1}(x), x\phi_j(x)\rangle = 0,
\]
allowing us to neglect all terms in the Gram–Schmidt sum for which $j < k-2$:
\[
\sum_{j=0}^{k-1} \frac{\langle x\phi_{k-1}(x), \phi_j(x)\rangle}{\langle \phi_j, \phi_j\rangle}\,\phi_j = \sum_{j=k-2}^{k-1} \frac{\langle x\phi_{k-1}(x), \phi_j(x)\rangle}{\langle \phi_j, \phi_j\rangle}\,\phi_j.
\]

Thus we can compute orthogonal polynomials efficiently, even if the necessary polynomial degree is large. This fact has vital implications in numerical linear algebra: indeed, it is a reason that the iterative conjugate gradient method for solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ often executes with blazing speed, but that is a story for another class.

[Margin note: The Gram–Schmidt process does not simplify to a short recurrence in all settings. We used the key fact $\langle x\phi_n, \phi_k\rangle = \langle \phi_n, x\phi_k\rangle$, which does not hold in general inner product spaces, but works perfectly well in our present setting because our polynomials are real valued on $[a,b]$. The short recurrence does not hold, for example, if you compute orthogonal polynomials over a general complex domain, instead of the real interval $[a,b]$.]

Theorem 2.10 (Three-Term Recurrence for Orthogonal Polynomials).
Given a weight function $w(x)$ ($w(x) \ge 0$ for all $x \in (a,b)$, and $w(x) = 0$ only on a set of measure zero), a real interval $[a,b]$, and an associated real inner product
\[
\langle f, g\rangle = \int_a^b w(x) f(x) g(x)\,dx,
\]
a system of (monic) orthogonal polynomials $\{\phi_k\}_{k=0}^n$ can be generated as follows:
\begin{align*}
\phi_0(x) &= 1, \\
\phi_1(x) &= x - \frac{\langle x, 1\rangle}{\langle 1, 1\rangle}, \\
\phi_k(x) &= x\phi_{k-1}(x) - \frac{\langle x\phi_{k-1}(x), \phi_{k-1}(x)\rangle}{\langle \phi_{k-1}(x), \phi_{k-1}(x)\rangle}\,\phi_{k-1}(x) - \frac{\langle x\phi_{k-1}(x), \phi_{k-2}(x)\rangle}{\langle \phi_{k-2}(x), \phi_{k-2}(x)\rangle}\,\phi_{k-2}(x), \quad \text{for } k \ge 2.
\end{align*}

The above process constructs monic orthogonal polynomials, i.e., $\phi_k$ has leading term $x^k$. Other normalizations can be imposed with simple modifications to the Gram–Schmidt algorithm that preserve the three-term recurrence structure. In some settings other normalizations are more convenient, e.g., $\|\phi_k\|^2 = \langle \phi_k, \phi_k\rangle = 1$ or $\phi_k(0) = 1$.
We next illustrate these ideas for a particularly important class of orthogonal polynomials that use the constant weight $w(x) = 1$.
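The recurrence of Theorem 2.10 needs only the two most recent polynomials at each step. The following sketch assumes weight $w(x) = 1$ on the interval $[0,1]$ (chosen here purely for illustration), stores polynomials as ascending coefficient lists, and uses exact fractions; all function names are hypothetical.

```python
from fractions import Fraction

def poly_mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def inner(p, q):
    """<p, q> = integral of p(x) q(x) over [0, 1] with weight 1."""
    return sum(c / (i + 1) for i, c in enumerate(poly_mul(p, q)))

def sub_scaled(p, c, q):
    """Return p - c*q (q must not be longer than p)."""
    out = list(p)
    for i, qi in enumerate(q):
        out[i] -= c * qi
    return out

def three_term(n):
    """Monic orthogonal polynomials phi_0, ..., phi_n via Theorem 2.10."""
    phis = [[Fraction(1)]]
    if n >= 1:
        x = [Fraction(0), Fraction(1)]
        phis.append(sub_scaled(x, inner(x, phis[0]) / inner(phis[0], phis[0]), phis[0]))
    for k in range(2, n + 1):
        x_phi = [Fraction(0)] + phis[k - 1]          # multiply phi_{k-1} by x
        alpha = inner(x_phi, phis[k - 1]) / inner(phis[k - 1], phis[k - 1])
        beta = inner(x_phi, phis[k - 2]) / inner(phis[k - 2], phis[k - 2])
        phi = sub_scaled(sub_scaled(x_phi, alpha, phis[k - 1]), beta, phis[k - 2])
        phis.append(phi)
    return phis

phis = three_term(3)
for phi in phis:
    print([str(c) for c in phi])
```

Only two inner-product ratios are computed per step regardless of $k$, in contrast to the $k$ ratios required by the general Gram–Schmidt loop.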

2.5.1 Legendre polynomials


On the interval $[a,b] = [-1,1]$ with weight $w(x) = 1$ for all $x$, the orthogonal polynomials are known as Legendre polynomials. Start with $\phi_0(x) = 1$, and then orthogonalize $x\phi_0(x) = x$; since $\phi_0$ is even and $x$ is odd over $x \in [-1,1]$, $\langle x, 1\rangle = 0$, giving $\phi_1 = x$. This begins an inductive cascade: in the Gram–Schmidt process, for all $k$,
\[
\langle x\phi_{k-1}(x), \phi_{k-1}\rangle = 0,
\]
since if $\phi_{k-1}$ is even, $x\phi_{k-1}$ will be odd (or vice versa), and the inner product of even and odd functions with $w(x) = 1$ over $x \in [-1,1]$ is always zero. Thus for Legendre polynomials the conventional three-term recurrence in Theorem 2.10 reduces to
\[
\phi_k(x) = x\phi_{k-1}(x) - \frac{\langle x\phi_{k-1}(x), \phi_{k-2}(x)\rangle}{\langle \phi_{k-2}(x), \phi_{k-2}(x)\rangle}\,\phi_{k-2}(x).
\]

[Margin note: The recurrence for the monic Legendre polynomial is given, e.g., by Dahlquist and Björck, Numerical Methods in Scientific Computing, vol. 1, p. 571. In contrast to these monic polynomials, Legendre polynomials are, by longstanding tradition, usually normalized so that $\phi_k(1) = 1$.]

Legendre polynomials enjoy many nice properties and identities; with some extra work, one can simplify the coefficient multiplying $\phi_{k-2}$ to
\[
\phi_k(x) = x\phi_{k-1}(x) - \frac{(k-1)^2}{4(k-1)^2 - 1}\,\phi_{k-2}(x).
\]
The first few Legendre polynomials $\phi_0, \ldots, \phi_6$ are presented below (and plotted in the margin):
\begin{align*}
\phi_0(x) &= 1 \\
\phi_1(x) &= x \\
\phi_2(x) &= x^2 - \tfrac{1}{3} \\
\phi_3(x) &= x^3 - \tfrac{3}{5}x \\
\phi_4(x) &= x^4 - \tfrac{6}{7}x^2 + \tfrac{3}{35} \\
\phi_5(x) &= x^5 - \tfrac{10}{9}x^3 + \tfrac{5}{21}x \\
\phi_6(x) &= x^6 - \tfrac{15}{11}x^4 + \tfrac{5}{11}x^2 - \tfrac{5}{231}.
\end{align*}

[Margin figure: plots of the monic Legendre polynomials $\phi_0, \ldots, \phi_6$ for $x \in [-1,1]$.]

Orthogonal polynomials play a key role in a prominent technique for computing integrals known as Gaussian quadrature. In that context, we will see other families of orthogonal polynomials: the Chebyshev, Laguerre, and Hermite polynomials. These polynomials differ in their weight functions and the intervals of $\mathbb{R}$ on which they are posed.
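The simplified Legendre recurrence with coefficient $(k-1)^2/(4(k-1)^2-1)$ is easy to run by hand or by machine. The sketch below is an illustration only: it stores each polynomial as a list of exact rational coefficients in ascending powers of $x$ (a representation assumed here, not taken from the notes) and the function name is hypothetical.

```python
from fractions import Fraction

def monic_legendre(n):
    """phi_0, ..., phi_n from phi_k = x*phi_{k-1} - (k-1)^2/(4(k-1)^2 - 1) * phi_{k-2}."""
    phis = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for k in range(2, n + 1):
        coef = Fraction((k - 1) ** 2, 4 * (k - 1) ** 2 - 1)
        phi = [Fraction(0)] + phis[k - 1]            # x * phi_{k-1}
        for i, c in enumerate(phis[k - 2]):
            phi[i] -= coef * c
        phis.append(phi)
    return phis[: n + 1]

phis = monic_legendre(6)
for k, phi in enumerate(phis):
    print(k, [str(c) for c in phi])
```

The coefficient lists produced this way can be compared entry by entry with the table of $\phi_0, \ldots, \phi_6$ above.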
0
Example 2.5 ($f(x) = e^x$). We repeat our previous example: approximating $f(x) = e^x$ on $[0,1]$ with a linear polynomial. First, we need to construct orthogonal polynomials for this interval. Set $\phi_0(x) = 1$; then a straightforward computation gives $\phi_1(x) = x - 1/2$. We then compute
\begin{align*}
\langle e^x, \phi_0(x)\rangle &= \int_0^1 e^x\,dx = e - 1, \\
\langle e^x, \phi_1(x)\rangle &= \int_0^1 e^x (x - 1/2)\,dx = (3 - e)/2,
\end{align*}
and
\begin{align*}
\langle \phi_0, \phi_0\rangle &= \int_0^1 1^2\,dx = 1, \\
\langle \phi_1, \phi_1\rangle &= \int_0^1 (x - 1/2)^2\,dx = 1/12.
\end{align*}
Assemble these inner products to obtain a formula for $P_*$:
\begin{align*}
P_*(x) &= \frac{\langle e^x, \phi_0\rangle}{\langle \phi_0, \phi_0\rangle}\,\phi_0(x) + \frac{\langle e^x, \phi_1\rangle}{\langle \phi_1, \phi_1\rangle}\,\phi_1(x) \\
&= \frac{e - 1}{1}\cdot 1 + \frac{(3 - e)/2}{1/12}\,(x - 1/2) \\
&= (e - 1) + (18 - 6e)(x - 1/2) \\
&= 4e - 10 + x(18 - 6e).
\end{align*}
Note that this is exactly the polynomial we obtained in Example 2.4 using the monomial basis.
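The arithmetic in Example 2.5 can be double-checked numerically. In the sketch below the inner products are approximated by a composite Simpson rule with 1000 panels, which is an assumption made for convenience (any accurate quadrature would do); the helper names are hypothetical. The last two lines also compare the directly computed error norm with the formula (2.18).

```python
import math

def simpson(g, a, b, n=1000):
    """Composite Simpson approximation of the integral of g over [a, b] (n even)."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def inner(u, v):
    """Approximate <u, v> on [0, 1] with weight w(x) = 1."""
    return simpson(lambda x: u(x) * v(x), 0.0, 1.0)

f = math.exp
phi = [lambda x: 1.0, lambda x: x - 0.5]          # phi_0 and phi_1 from the example
coef = [inner(f, p) / inner(p, p) for p in phi]

def p_star(x):
    return sum(c * p(x) for c, p in zip(coef, phi))

intercept = p_star(0.0)
slope = p_star(1.0) - p_star(0.0)
print(intercept, 4 * math.e - 10)
print(slope, 18 - 6 * math.e)

resid = lambda x: f(x) - p_star(x)
err_direct = math.sqrt(inner(resid, resid))
err_formula = math.sqrt(inner(f, f) - sum(inner(f, p) ** 2 / inner(p, p) for p in phi))
print(err_direct, err_formula)
```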
With this procedure, one can easily increase the degree of the approximating polynomial. To increase the degree by one, simply add
\[
\frac{\langle f, \phi_{n+1}\rangle}{\langle \phi_{n+1}, \phi_{n+1}\rangle}\,\phi_{n+1}
\]
to the old approximation. Were we using a general (non-orthogonal) basis, to increase the degree of the approximation we would need to solve a new $(n+2)$-by-$(n+2)$ linear system to find the coefficients of the least squares approximation. Indeed, an advantage to the new method is that we express the optimal polynomial in a 'good' basis, the basis of orthogonal polynomials, rather than the monomial basis.

[Margin note: It is true, however, that both these methods for finding the least squares polynomial will generally be more expensive than simply finding a polynomial interpolant.]
3
Quadrature

lecture 21: Interpolatory Quadrature Rules


The past two chapters have developed a suite of tools for polynomial interpolation and approximation. We shall now apply these tools toward the approximation of definite integrals.
To compute the least squares approximations discussed in Section 2.4, one needs to compute integrals for the inner products
\[
\langle f, \phi_j\rangle = \int_a^b f(x)\phi_j(x)\,dx
\]
that form the right-hand side of the Gram matrix equation $\mathbf{G}\mathbf{c} = \mathbf{b}$. Of course, many other applications require the evaluation of definite integrals; integrals across (many) different variables pose additional challenges.
Many definite integrals are difficult or impossible to evaluate exactly, so our next charge is to develop algorithms that approximate such integrals quickly and accurately. This field is known as quadrature, a name that suggests the approximation of the area under a curve by the area of subtending quadrilaterals (a "Riemann sum").

[Margin note: The term quadrature is used to distinguish the numerical approximation of a definite integral from the numerical solution of an ordinary differential equation, which is often called numerical integration. Approximation of a double integral is sometimes called cubature.]
3.1 Interpolatory Quadrature
[Margin note: For more details on quadrature rules, see Süli and Mayers, Chapter 7, which has guided many aspects of our presentation here.]

Given $f \in C[a,b]$, we seek approximations to the definite integral
\[
\int_a^b f(x)\,dx.
\]
All the methods we consider in these notes are variants of interpolatory quadrature rules, meaning that they approximate the integral of $f$ by the exact integral of a polynomial interpolant to $f$:
\[
\int_a^b f(x)\,dx \approx \int_a^b p_n(x)\,dx,
\]

where $p_n \in \mathcal{P}_n$ interpolates $f$ at $n+1$ points in $[a,b]$. Will such rules produce a reasonable estimate to the integral? Of course, that depends on properties of $f$ and the interpolation points.
Our goal in this section is to develop a convenient formula for the approximation
\[
\int_a^b p_n(x)\,dx
\]
that will not require the explicit construction of $p_n$. As is often the case, the task becomes direct and simple if we express the interpolant in the correct basis. Recall the Lagrange form of the interpolant presented in Section 1.5: Given $n+1$ distinct interpolation points
\[
x_0, \ldots, x_n \in [a,b],
\]
the interpolant can be written as
\[
p_n(x) = \sum_{j=0}^n f(x_j)\ell_j(x), \tag{3.1}
\]
where the basis functions $\ell_0, \ldots, \ell_n$ take the familiar form
\[
\ell_j(x) = \prod_{\substack{k=0 \\ k\ne j}}^n \frac{x - x_k}{x_j - x_k}.
\]

The integral of $p_n$ can then be computed in terms of the integral of the basis functions:
\[
\int_a^b p_n(x)\,dx = \int_a^b \sum_{j=0}^n f(x_j)\ell_j(x)\,dx = \sum_{j=0}^n f(x_j)\int_a^b \ell_j(x)\,dx.
\]

[Margin note: Why is the Lagrange basis special? Could you not do the same kind of expansion with the monomial or Newton bases? Yes indeed: but then you would need to compute the coefficients $c_j$ that multiply these basis functions in the expansion $p_n(x) = \sum c_j \phi_j(x)$, which requires the solution of a (nontrivial) linear system. The beauty of the Lagrange approach is that these coefficients are instantly available by evaluating $f$ at the quadrature nodes: $c_j = f(x_j)$.]

In the nomenclature of quadrature rules, the integrals of the basis functions are called weights, denoted
\[
w_j := \int_a^b \ell_j(x)\,dx.
\]

The degree-$n$ interpolatory quadrature rule at distinct nodes $x_0, \ldots, x_n$ takes the form
\[
\int_a^b f(x)\,dx \approx \sum_{j=0}^n w_j f(x_j),
\]
for the weights
\[
w_j = \int_a^b \ell_j(x)\,dx.
\]

It is worth stating an obvious theorem, which we will revisit in future lectures.
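The weights $w_j = \int_a^b \ell_j(x)\,dx$ can be computed exactly for small $n$ by expanding each Lagrange basis polynomial and integrating term by term. The following sketch assumes exact rational nodes and stores polynomials as ascending coefficient lists; the function names are hypothetical, and the sample nodes $0, 1/2, 1$ on $[0,1]$ are chosen only for illustration.

```python
from fractions import Fraction

def poly_mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def lagrange_basis(nodes, j):
    """Coefficients (ascending powers) of the j-th Lagrange basis polynomial."""
    poly = [Fraction(1)]
    for k, xk in enumerate(nodes):
        if k != j:
            denom = nodes[j] - xk
            poly = poly_mul(poly, [-xk / denom, Fraction(1) / denom])
    return poly

def integrate(p, a, b):
    """Exact integral over [a, b] of a polynomial given by its coefficients."""
    return sum(c * (b ** (i + 1) - a ** (i + 1)) / (i + 1) for i, c in enumerate(p))

def quadrature_weights(nodes, a, b):
    return [integrate(lagrange_basis(nodes, j), a, b) for j in range(len(nodes))]

a, b = Fraction(0), Fraction(1)
nodes = [Fraction(0), Fraction(1, 2), Fraction(1)]
w = quadrature_weights(nodes, a, b)
print([str(wj) for wj in w])

# Applying the rule to f(x) = x^2, which lies in P_2.
approx = sum(wj * xj ** 2 for wj, xj in zip(w, nodes))
print(approx)
```

Since the weights depend only on the nodes and the interval, they can be computed once and reused for every integrand.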

Theorem 3.1 (Exactness of Interpolatory Quadrature).
The degree-$n$ interpolatory quadrature rule at distinct points $x_0, \ldots, x_n \in [a,b]$ is exact for any polynomial of degree $n$ or less: if $f \in \mathcal{P}_n$, then
\[
\int_a^b f(x)\,dx = \sum_{j=0}^n w_j f(x_j).
\]

The proof is simple: If $f \in \mathcal{P}_n$, its polynomial interpolant $p_n \in \mathcal{P}_n$ is exactly $f$, and so the exact integral of $p_n$ is the same thing as the exact integral of $f$. However, the result is not inconsequential. There are some circumstances in numerical computations where it is easier to use a quadrature rule to evaluate the integral of a polynomial, rather than computing the integral directly from the polynomial coefficients.

[Margin note: Finite element methods give one such setting, where in some cases $f$ is represented by its values $f(x_j)$ on a computational mesh, rather than by its coefficients.]

It is no surprise that a quadrature rule based on a degree-$n$ interpolant will exactly integrate $f \in \mathcal{P}_n$. However, in special cases a rule based on a degree-$n$ interpolant will exactly integrate polynomials of higher degree. This motivates the next definition.

Definition 3.1. An interpolatory quadrature rule has degree of exactness $m$ if for all $f \in \mathcal{P}_m$,
\[
\int_a^b f(x)\,dx = \sum_{j=0}^n w_j f(x_j).
\]

By Theorem 3.1, a degree-$n$ quadrature rule has degree of exactness $m \ge n$. Thus it will be particularly interesting to see circumstances in which this degree of exactness is exceeded.

lecture 22: Newton–Cotes quadrature

3.2 Newton–Cotes quadrature

You encountered the most basic method for approximating an integral when you learned calculus: the Riemann integral is motivated by approximating the area under a curve by the area of rectangles that touch that curve, which gives a rough estimate that becomes increasingly accurate as the width of those rectangles shrinks. This amounts to approximating the function $f$ by a piecewise constant interpolant, and then computing the exact integral of the interpolant. When only one rectangle is used to approximate the entire integral, we have the simplest Newton–Cotes formula; see Figure 3.1.

[Figure 3.1: Estimates of $\int_0^{10} f(x)\,dx$, shown in gray: the first approximates $f$ by a constant interpolant; the second, a composite rule, uses a piecewise constant interpolant. You probably have encountered this second approximation as a Riemann sum.]

Newton–Cotes formulas are interpolatory quadrature rules where the quadrature nodes $x_0, \ldots, x_n$ are uniformly spaced over $[a,b]$,
\[
x_j = a + j\left(\frac{b-a}{n}\right).
\]
Given the lessons we learned about polynomial interpolation at uniformly spaced points in Section 1.6, you should rightly be suspicious of applying this idea with large $n$ (i.e., high degree interpolants).
A more reliable way to increase accuracy follows the lead of basic Riemann sums: partition $[a,b]$ into smaller subintervals, and use low-degree interpolants to approximate the integral on each of these smaller domains. Such methods are called composite quadrature rules.
In some cases, the function $f$ may be fairly regular over most of the domain $[a,b]$, but then have some small region of rapid growth or oscillation. Modern adaptive quadrature rules are composite rules on which the subintervals of $[a,b]$ vary in size, depending on estimates of how rapidly $f$ is changing in a given part of the domain. Such methods seek to balance the competing goals of highly accurate approximate integrals and as few evaluations of $f$ as possible. We shall not dwell much on these sophisticated quadrature procedures here, but rather start by understanding some methods you were probably introduced to in your first calculus class.

3.2.1 The trapezoid rule

The trapezoid rule is a simple improvement over approximating the integral by the area of a single rectangle. A linear interpolant to $f$ can be constructed, requiring evaluation of $f$ at the interval end points $x_0 = a$ and $x_1 = b$. Using the interpolatory quadrature methodology described in the last section, we write
\[
p_1(x) = f(a)\left(\frac{x-b}{a-b}\right) + f(b)\left(\frac{x-a}{b-a}\right),
\]
and compute its integral as
\begin{align*}
\int_a^b p_1(x)\,dx &= \int_a^b \left[ f(a)\left(\frac{x-b}{a-b}\right) + f(b)\left(\frac{x-a}{b-a}\right) \right] dx \\
&= f(a)\int_a^b \frac{x - x_1}{x_0 - x_1}\,dx + f(b)\int_a^b \frac{x - x_0}{x_1 - x_0}\,dx \\
&= f(a)\left(\frac{b-a}{2}\right) + f(b)\left(\frac{b-a}{2}\right).
\end{align*}
In summary,

Trapezoid rule:
\[
\int_a^b f(x)\,dx \approx \frac{b-a}{2}\big(f(a) + f(b)\big).
\]

The procedure behind the trapezoid rule is illustrated in Figure 3.2 where the area approximating the integral is colored gray.

[Figure 3.2: Trapezoid rule estimate of $\int_0^{10} f(x)\,dx$, shown in gray. Values printed in the plot: 63.5714198... (trapezoid), 73.4543644... (exact).]

To derive an error bound for the trapezoid rule, simply integrate the fundamental interpolation error formula in Theorem 1.3. That gave, for each $x \in [a,b]$, some $\xi \in [a,b]$ such that
\[
f(x) - p_1(x) = \tfrac{1}{2} f''(\xi)(x-a)(x-b).
\]
Note that $\xi$ will vary with $x$, which we emphasize by writing $\xi(x)$. Integrate this formula to obtain
\begin{align*}
\int_a^b f(x)\,dx - \int_a^b p_1(x)\,dx &= \int_a^b \tfrac{1}{2} f''(\xi(x))(x-a)(x-b)\,dx \\
&= \tfrac{1}{2} f''(\eta)\int_a^b (x-a)(x-b)\,dx \\
&= \tfrac{1}{2} f''(\eta)\left(\tfrac{1}{6}a^3 - \tfrac{1}{2}a^2 b + \tfrac{1}{2}ab^2 - \tfrac{1}{6}b^3\right) \\
&= -\tfrac{1}{12} f''(\eta)(b-a)^3
\end{align*}
for some $\eta \in [a,b]$. The second step follows from the mean value theorem for integrals.

[Margin note: The mean value theorem for integrals states that if $h, g \in C[a,b]$ and $h$ does not change sign on $[a,b]$, then there exists some $\eta \in [a,b]$ such that $\int_a^b g(t)h(t)\,dt = g(\eta)\int_a^b h(t)\,dt$. The requirement that $h$ not change sign is essential. For example, if $g(t) = h(t) = t$ then $\int_{-1}^1 g(t)h(t)\,dt = \int_{-1}^1 t^2\,dt = 2/3$, yet $\int_{-1}^1 h(t)\,dt = \int_{-1}^1 t\,dt = 0$, so for all $\eta \in [-1,1]$, $g(\eta)\int_{-1}^1 h(t)\,dt = 0 \ne \int_{-1}^1 g(t)h(t)\,dt = 2/3$.]

In a forthcoming lecture we shall develop a much more general theory, based on the Peano kernel, from which we can derive this error

bound, plus bounds for more complicated schemes, too. For now, we summarize the bound in the following Theorem.

Theorem 3.2. Let $f \in C^2[a,b]$. The error in the trapezoid rule is
\[
\int_a^b f(x)\,dx - \frac{b-a}{2}\big(f(a) + f(b)\big) = -\frac{1}{12} f''(\eta)(b-a)^3
\]
for some $\eta \in [a,b]$.

This bound has an interesting feature: if we are integrating over a small interval, $b - a = h \ll 1$, then the error in the trapezoid rule approximation is $O(h^3)$ as $h \to 0$, while the error in the linear interpolant upon which this quadrature rule is based is only $O(h^2)$ (from Theorem 1.3).

Example 3.1 ($f(x) = e^x(\cos x + \sin x)$). Here we demonstrate the difference between the error for linear interpolation of a function, $f(x) = e^x(\cos x + \sin x)$, between two points, $x_0 = 0$ and $x_1 = h$, and the trapezoid rule applied to the same interval. The theory reveals that linear interpolation will have an $O(h^2)$ error as $h \to 0$, while the trapezoid rule has $O(h^3)$ error, as confirmed in Figure 3.3.

[Figure 3.3: Error of linear interpolation and trapezoid rule approximation for $f(x) = e^x(\cos x + \sin x)$ for $x \in [0,h]$ as $h \to 0$. Log-log plot of error against $h$; the interpolation error curve follows $O(h^2)$ and the trapezoid error curve follows $O(h^3)$.]
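The rates in Example 3.1 can be reproduced with a short experiment. The sketch below uses the antiderivative $e^x \sin x$ of $f$ for the exact integral, estimates the interpolation error by sampling at 201 points (an arbitrary choice made here), and halves $h$ starting from $0.1$; error ratios near 8 and 4 would correspond to $O(h^3)$ and $O(h^2)$ behavior. The function names are hypothetical.

```python
import math

def f(x):
    return math.exp(x) * (math.cos(x) + math.sin(x))

def trapezoid_error(h):
    """|integral of f over [0, h] minus the trapezoid estimate|."""
    exact = math.exp(h) * math.sin(h)    # antiderivative e^x sin x vanishes at 0
    return abs(exact - 0.5 * h * (f(0.0) + f(h)))

def interpolation_error(h, samples=201):
    """Sampled maximum of |f - p_1| on [0, h] for the linear interpolant p_1."""
    worst = 0.0
    for i in range(samples):
        x = h * i / (samples - 1)
        p1 = f(0.0) + (f(h) - f(0.0)) * x / h
        worst = max(worst, abs(f(x) - p1))
    return worst

hs = [0.1 / 2 ** k for k in range(5)]
trap_ratios = [trapezoid_error(hs[k]) / trapezoid_error(hs[k + 1]) for k in range(4)]
interp_ratios = [interpolation_error(hs[k]) / interpolation_error(hs[k + 1]) for k in range(4)]
print(trap_ratios)
print(interp_ratios)
```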

3.2.2 Simpson’s rule


To improve the accuracy of the trapezoid rule, increment the degree of the interpolating polynomial. This will increase the number of evaluations of $f$ (often very costly), but hopefully will significantly decrease the error. Indeed it does, by an even greater margin than we might expect.
Simpson's rule integrates the quadratic interpolant $p_2 \in \mathcal{P}_2$ to $f$ at the uniformly spaced points
\[
x_0 = a, \qquad x_1 = (a+b)/2, \qquad x_2 = b.
\]
Using the interpolatory quadrature formulation of the last section,
\[
\int_a^b p_2(x)\,dx = w_0 f(a) + w_1 f\big(\tfrac{1}{2}(a+b)\big) + w_2 f(b),
\]
where
\begin{align*}
w_0 &= \int_a^b \left(\frac{x - x_1}{x_0 - x_1}\right)\left(\frac{x - x_2}{x_0 - x_2}\right) dx = \frac{b-a}{6}, \\
w_1 &= \int_a^b \left(\frac{x - x_0}{x_1 - x_0}\right)\left(\frac{x - x_2}{x_1 - x_2}\right) dx = \frac{2(b-a)}{3}, \\
w_2 &= \int_a^b \left(\frac{x - x_0}{x_2 - x_0}\right)\left(\frac{x - x_1}{x_2 - x_1}\right) dx = \frac{b-a}{6}.
\end{align*}
In summary:

Simpson's rule:
\[
\int_a^b f(x)\,dx \approx \frac{b-a}{6}\Big(f(a) + 4 f\big(\tfrac{1}{2}(a+b)\big) + f(b)\Big).
\]

[Figure 3.4: Simpson's rule estimate of $\int_0^{10} f(x)\,dx$, shown in gray. Values printed in the plot: 76.9618331... (Simpson), 73.4543644... (exact).]
4
Simpson's rule enjoys a remarkable feature: though it only approximates $f$ by a quadratic, it integrates any cubic polynomial exactly! One can verify this by directly applying Simpson's rule to a generic cubic polynomial. Write $f(x) = \alpha x^3 + q(x)$, where $q \in \mathcal{P}_2$. Let $I(f) = \int_a^b f(x)\,dx$ and let $I_2(f)$ denote the Simpson's rule approximation. Then, by linearity of the integral,
\[
I(f) = \alpha I(x^3) + I(q)
\]
and, by linearity of Simpson's rule,
\[
I_2(f) = \alpha I_2(x^3) + I_2(q).
\]
Since Simpson's rule is an interpolatory quadrature rule based on quadratic polynomials, its degree of exactness must be at least 2 (Theorem 3.1), i.e., it exactly integrates $q$: $I_2(q) = I(q)$. Thus
\[
I(f) - I_2(f) = \alpha\big(I(x^3) - I_2(x^3)\big).
\]

So Simpson's rule will be exact for all cubics if it is exact for $x^3$. A simple computation gives
\begin{align*}
I_2(x^3) &= \frac{b-a}{6}\left(a^3 + 4\left(\frac{a+b}{2}\right)^3 + b^3\right) \\
&= \frac{b-a}{12}\left(3a^3 + 3a^2 b + 3ab^2 + 3b^3\right) = \frac{b^4 - a^4}{4} = I(x^3),
\end{align*}
confirming that Simpson's rule is exact for $x^3$, and hence for all cubics. For now we simply state an error bound for Simpson's rule, which we will prove in a future lecture.

[Margin note: In fact, Newton–Cotes formulas based on approximating $f$ by an even-degree polynomial always exactly integrate polynomials one degree higher.]

Theorem 3.3. Let $f \in C^4[a,b]$. The error in Simpson's rule is
\[
\int_a^b f(x)\,dx - \frac{b-a}{6}\Big(f(a) + 4f\big((a+b)/2\big) + f(b)\Big) = -\frac{1}{90}\left(\frac{b-a}{2}\right)^5 f^{(4)}(\eta) = -\frac{(b-a)^5}{2880}\,f^{(4)}(\eta)
\]
for some $\eta \in [a,b]$.
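Both claims, exactness for cubics and the size of the error for a quartic, can be tested in exact arithmetic. The sketch below is illustrative only: the cubic and the intervals are arbitrary choices made here, and `simpson` is a hypothetical helper implementing the three-point rule above. For $f(x) = x^4$ on $[0,1]$ the fourth derivative is the constant 24, so the standard Simpson error formula $-(b-a)^5 f^{(4)}(\eta)/2880$ predicts an error of exactly $-24/2880 = -1/120$.

```python
from fractions import Fraction

def simpson(f, a, b):
    """Simpson's rule on the single interval [a, b]."""
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

# An arbitrary cubic and its antiderivative, on the interval [-1, 3/2].
a, b = Fraction(-1), Fraction(3, 2)
cubic = lambda x: 2 * x**3 - x**2 + 5 * x - 7
cubic_antiderivative = lambda x: x**4 / 2 - x**3 / 3 + 5 * x**2 / 2 - 7 * x
exact_cubic = cubic_antiderivative(b) - cubic_antiderivative(a)
simpson_cubic = simpson(cubic, a, b)
print(exact_cubic, simpson_cubic)

# The quartic x^4 on [0, 1]: exact integral 1/5, fourth derivative 24.
quartic_error = Fraction(1, 5) - simpson(lambda x: x**4, Fraction(0), Fraction(1))
predicted_error = -Fraction(1) ** 5 * 24 / 2880
print(quartic_error, predicted_error)
```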

This error formula captures the fact that Simpson’s rule is exact
for cubics, since it features the fourth derivative f (4) (η ), two deriva-
tives greater than f 00 (η ) in the trapezoid rule bound, even though
the degree of the interpolant has only increased by one. Perhaps it is
helpful to visualize the exactness of Simpson’s rule for cubics. Fig-
ure 3.5 shows f ( x ) = x3 (blue) and its quadratic interpolant (red).
On the left, the area under f is colored gray: its area is the integral
we seek. On the right, the area under the interpolant is colored gray.
Accounting area below the x axis as negative, both integrals give an
identical value even though the functions are quite different. It is
remarkable that this is the case for all cubics.
Typically one does not see Newton–Cotes rules based on polynomials of degree higher than two (i.e., Simpson's rule). Because it can be fun to see numerical mayhem, we give an example to emphasize why high-degree Newton–Cotes rules can be a bad idea. Recall that Runge's function $f(x) = 1/(1+x^2)$ gave a nice example for which the polynomial interpolant at uniformly spaced points over $[-5,5]$ fails to converge uniformly to f. This fact suggests that Newton–Cotes quadrature will also fail to converge as the degree of the interpolant grows.

[Margin note: Integrating the cubic interpolant at four uniformly spaced points is called Simpson's three-eighths rule.]

The exact value of the integral we seek is
\[ \int_{-5}^{5} \frac{1}{1+x^2}\,dx = 2\tan^{-1}(5) = 2.74680153\ldots. \]
Just as the interpolant at uniformly spaced points diverges, so too does the Newton–Cotes integral. Figure 3.6 illustrates this divergence, and shows that integrating the interpolant at Chebyshev points, called Clenshaw–Curtis quadrature, does indeed converge. Section 3.3 describes this latter quadrature in more detail. Before discussing it, we describe a way to make Newton–Cotes rules more robust: integrate low-degree polynomials over subintervals of $[a,b]$.

[Figure 3.5: Simpson's rule applied to $f(x) = x^3$ on $x \in [-1, 3/2]$. Left panel: area under $f(x) = x^3$. Right panel: area under the quadratic interpolant. The areas under $f(x)$ (blue) and its quadratic interpolant (red) are the same, even though the functions are quite different.]

[Figure 3.6: Integrating interpolants $p_n$ at $n+1$ uniformly spaced points (red) and at Chebyshev points (blue) for Runge's function, $f(x) = 1/(1+x^2)$ over $x \in [-5,5]$. The plot shows the error $\bigl|\int_{-5}^{5} f(x)\,dx - \int_{-5}^{5} p_n(x)\,dx\bigr|$ on a logarithmic scale against n.]
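The divergence for uniformly spaced points can be reproduced with a short experiment. This is a sketch under stated assumptions rather than the code behind Figure 3.6: the function names are hypothetical, and the Newton–Cotes weights are obtained by integrating each Lagrange basis polynomial exactly in rational arithmetic on the integer nodes $0, 1, \ldots, n$, then scaling by $h = (b-a)/n$.

```python
from fractions import Fraction
import math

def newton_cotes_weights(n):
    """Exact weights: integral over [0, n] of the j-th Lagrange basis
    polynomial for the integer nodes 0, 1, ..., n."""
    weights = []
    for j in range(n + 1):
        coeffs = [Fraction(1)]      # polynomial coefficients, lowest degree first
        denom = Fraction(1)
        for k in range(n + 1):
            if k == j:
                continue
            # Multiply the current polynomial by (t - k).
            new = [Fraction(0)] * (len(coeffs) + 1)
            for i, c in enumerate(coeffs):
                new[i + 1] += c
                new[i] -= k * c
            coeffs = new
            denom *= (j - k)
        integral = sum(c * Fraction(n) ** (i + 1) / (i + 1)
                       for i, c in enumerate(coeffs))
        weights.append(integral / denom)
    return weights

def newton_cotes(f, a, b, n):
    """Integrate the degree-n interpolant of f at n + 1 uniform points."""
    h = (b - a) / n
    w = newton_cotes_weights(n)
    return h * sum(float(wj) * f(a + j * h) for j, wj in enumerate(w))

def runge(x):
    return 1.0 / (1.0 + x * x)

exact = 2.0 * math.atan(5.0)
for n in (2, 4, 8, 16, 24):
    err = abs(exact - newton_cotes(runge, -5.0, 5.0, n))
    print(f"n = {n:2d}   |error| = {err:.3e}")
```

For larger n some of the weights are negative and large in magnitude, which is one way to see why these rules misbehave.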

3.2.3 Composite rules

As an alternative to integrating a high-degree polynomial, one can pursue a simpler approach that is often very effective: Break the interval $[a,b]$ into subintervals, then apply a standard Newton–Cotes rule (e.g., trapezoid or Simpson) on each subinterval. Applying the trapezoid rule on n subintervals gives

\[ \int_a^b f(x)\,dx = \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} f(x)\,dx \approx \sum_{j=1}^{n} \frac{x_j - x_{j-1}}{2}\bigl( f(x_{j-1}) + f(x_j) \bigr). \]

The standard implementation assumes that f is evaluated at uniformly spaced points between a and b, $x_j = a + jh$ for $j = 0, \ldots, n$ and $h = (b-a)/n$, giving the following famous formulation:

Composite Trapezoid rule:

\[ \int_a^b f(x)\,dx \approx \frac{h}{2}\Bigl( f(a) + 2\sum_{j=1}^{n-1} f(a+jh) + f(b) \Bigr). \]

(Of course, one can readily adjust this rule by partitioning $[a,b]$ into subintervals of different sizes.) The error in the composite trapezoid rule can be derived by summing up the error in each application of the trapezoid rule:

\[ \begin{aligned}
\int_a^b f(x)\,dx - \frac{h}{2}\Bigl( f(a) + 2\sum_{j=1}^{n-1} f(a+jh) + f(b) \Bigr)
&= \sum_{j=1}^{n} \Bigl( -\frac{1}{12} f''(\eta_j)(x_j - x_{j-1})^3 \Bigr) \\
&= -\frac{h^3}{12} \sum_{j=1}^{n} f''(\eta_j)
\end{aligned} \]

for $\eta_j \in [x_{j-1}, x_j]$. We can simplify these $f''$ terms by noting that $\frac{1}{n}\sum_{j=1}^{n} f''(\eta_j)$ is the average of n values of $f''$ evaluated at points in the interval $[a,b]$. Naturally, this average can neither exceed the maximum nor fall below the minimum value that $f''$ assumes on $[a,b]$, so there exist points $\xi_1, \xi_2 \in [a,b]$ such that

\[ f''(\xi_1) \le \frac{1}{n} \sum_{j=1}^{n} f''(\eta_j) \le f''(\xi_2). \]

Since $f''$ is continuous, the intermediate value theorem guarantees the existence of some $\eta \in [a,b]$ such that

\[ f''(\eta) = \frac{1}{n} \sum_{j=1}^{n} f''(\eta_j). \]

We arrive at a formula for the error in the composite trapezoid rule.

Theorem 3.4. Let $f \in C^2[a,b]$. The error in the composite trapezoid rule over n intervals of uniform width $h = (b-a)/n$ is

\[ \int_a^b f(x)\,dx - \frac{h}{2}\Bigl( f(a) + 2\sum_{j=1}^{n-1} f(a+jh) + f(b) \Bigr) = -\frac{h^2}{12}(b-a) f''(\eta) \]

for some $\eta \in [a,b]$.

This error analysis has an important consequence: the error for the composite trapezoid rule is only $O(h^2)$, not the $O(h^3)$ we saw for the usual trapezoid rule (in which case $b - a = h$ since $n = 1$).
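The $O(h^2)$ behavior is easy to observe numerically. Below is a minimal sketch; the function name `composite_trapezoid` and the test integrand $e^x$ on $[0,1]$ are illustrative choices, not taken from the notes. Halving h should reduce the error by a factor of about four.

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoid rule with n uniform subintervals of width h."""
    h = (b - a) / n
    interior = sum(f(a + j * h) for j in range(1, n))
    return h / 2 * (f(a) + 2 * interior + f(b))

# Illustrative integrand: the integral of e^x over [0, 1] is e - 1.
exact = math.e - 1.0
errors = []
for n in (10, 20, 40, 80):
    err = exact - composite_trapezoid(math.exp, 0.0, 1.0, n)
    errors.append(err)
    print(f"n = {n:3d}   error = {err: .3e}")

# Ratios of successive errors; O(h^2) convergence predicts values near 4.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print("error ratios:", ratios)
```

The errors are negative here, consistent with the sign in the error formula when $f'' > 0$.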

A similar construction leads to the composite Simpson's rule. We now must ensure that n is even, since each interval on which we apply the standard Simpson's rule has width 2h. Simple algebra leads to the following formula.

Composite Simpson's rule:

\[ \int_a^b f(x)\,dx \approx \frac{h}{3}\Bigl( f(a) + 4\sum_{j=1}^{n/2} f(a+(2j-1)h) + 2\sum_{j=1}^{n/2-1} f(a+2jh) + f(b) \Bigr). \]

Now use Theorem 3.3 to derive an error formula for the composite Simpson's rule, using the same approach as for the composite trapezoid rule.

Theorem 3.5. Let $f \in C^4[a,b]$. The error in the composite Simpson's rule over $n/2$ intervals of uniform width $2h = 2(b-a)/n$ is

\[ \int_a^b f(x)\,dx - \frac{h}{3}\Bigl( f(a) + 4\sum_{j=1}^{n/2} f(a+(2j-1)h) + 2\sum_{j=1}^{n/2-1} f(a+2jh) + f(b) \Bigr) = -\frac{h^4}{180}(b-a) f^{(4)}(\eta) \]

for some $\eta \in [a,b]$.

The illustrations in Figure 3.7 compare the composite trapezoid and Simpson's rules for the same number of function evaluations. One can see that Simpson's rule, in this typical case, gives considerably better accuracy.
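As a companion to the formula, here is a minimal sketch of the composite Simpson's rule. The name `composite_simpson` and the integrand $e^x$ on $[0,1]$ are assumptions made for illustration. Halving h should cut the error by roughly a factor of sixteen, in line with the $h^4$ term.

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule with n uniform subintervals (n must be even)."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    odd = sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
    even = sum(f(a + 2 * j * h) for j in range(1, n // 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))

# Illustrative integrand: the integral of e^x over [0, 1] is e - 1.
exact = math.e - 1.0
errors = []
for n in (4, 8, 16, 32):
    err = exact - composite_simpson(math.exp, 0.0, 1.0, n)
    errors.append(err)
    print(f"n = {n:3d}   error = {err: .3e}")

# Ratios of successive errors; O(h^4) convergence predicts values near 16.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print("error ratios:", ratios)
```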

Reflect for a moment. Suppose you are willing to evaluate f a fixed number of times. How can you get the most bang for your buck? If f is smooth, a rule based on a high-order interpolant (such as the Clenshaw–Curtis and Gaussian quadrature rules we will present in a few lectures) is likely to give the best result. If f is not smooth (e.g., with kinks, discontinuous derivatives, etc.), then a robust composite rule would be a good option. (A famous special case: If the function f is sufficiently smooth and is periodic with period $b - a$, then the trapezoid rule converges exponentially.)
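The periodic special case can be illustrated with a small experiment. This is a sketch under assumptions: the integrand $e^{\cos x}$ on $[0, 2\pi]$ is chosen here for illustration, and its integral equals $2\pi I_0(1)$, where $I_0$ is the modified Bessel function, evaluated below from its power series.

```python
import math

def periodic_trapezoid(f, a, b, n):
    """Composite trapezoid rule for a (b - a)-periodic f. Because f(a) = f(b),
    the rule reduces to h times the sum of n equally spaced samples."""
    h = (b - a) / n
    return h * sum(f(a + j * h) for j in range(n))

def f(x):
    return math.exp(math.cos(x))

# Reference value 2*pi*I_0(1), with I_0(1) = sum_k 1 / (4^k (k!)^2).
bessel_i0_at_1 = sum(1.0 / (4 ** k * math.factorial(k) ** 2) for k in range(20))
exact = 2.0 * math.pi * bessel_i0_at_1

errors = {n: abs(exact - periodic_trapezoid(f, 0.0, 2.0 * math.pi, n))
          for n in (4, 8, 16)}
for n, err in errors.items():
    print(f"n = {n:2d}   |error| = {err:.3e}")
```

Doubling n here does far better than the factor of four that the $O(h^2)$ bound suggests for general smooth integrands.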
[Figure 3.7: Composite trapezoid rule (left) and composite Simpson's rule (right). Values shown in the panels: 73.2181469… (composite trapezoid), 73.4610862… (composite Simpson), 73.4543644… (exact).]

3.2.4 Adaptive Quadrature

If f is continuous, we can attain arbitrarily high accuracy with composite rules by taking the spacing between function evaluations, h, to be sufficiently small. This might be necessary to resolve regions of rapid growth or oscillation in f. If such regions only make up a small proportion of the domain $[a,b]$, then uniformly reducing h over the entire interval will be unnecessarily expensive. One wants to concentrate function evaluations in the region where the function is the most ornery. Robust quadrature software adjusts the value of h locally to handle such regions. To learn more about such techniques, which are not foolproof, see W. Gander and W. Gautschi, “Adaptive quadrature—revisited,” BIT 40 (2000) 84–101.

[Margin note: This paper criticizes the routines quad and quad8 that were included in MATLAB version 5. In light of this analysis MATLAB improved its software, essentially incorporating the two routines suggested in this paper starting in version 6 as the routines quad (adaptive Simpson's rule) and quadl (an adaptive Gauss–Lobatto rule).]
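To make the idea of locally adjusting h concrete, here is a minimal recursive adaptive Simpson sketch. It is a textbook-style illustration, not the Gander–Gautschi algorithm and not MATLAB's implementation; the names, the factor 15 in the acceptance test (from comparing one Simpson panel against two half panels), and the peaked test integrand are all assumptions of this example.

```python
import math

def simpson(f, a, b):
    """Basic Simpson's rule on [a, b]."""
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def adaptive_simpson(f, a, b, tol=1e-8, max_depth=40):
    """Bisect [a, b] until the one-panel and two-panel Simpson estimates
    agree to within 15 * tol, splitting the tolerance between the halves."""
    whole = simpson(f, a, b)
    mid = (a + b) / 2
    halves = simpson(f, a, mid) + simpson(f, mid, b)
    if max_depth == 0 or abs(halves - whole) <= 15 * tol:
        # Accept, with a Richardson-style correction.
        return halves + (halves - whole) / 15
    return (adaptive_simpson(f, a, mid, tol / 2, max_depth - 1)
            + adaptive_simpson(f, mid, b, tol / 2, max_depth - 1))

samples = []  # records where the integrand was evaluated

def peaked(x):
    """Illustrative integrand with a sharp peak near x = 0.3."""
    samples.append(x)
    return 1.0 / ((x - 0.3) ** 2 + 0.01)

approx = adaptive_simpson(peaked, 0.0, 1.0)
exact = 10.0 * (math.atan(7.0) + math.atan(3.0))
near_peak = sum(1 for x in samples if 0.2 <= x <= 0.4)
print(f"approx = {approx:.10f}, error = {abs(approx - exact):.2e}")
print(f"{near_peak} of {len(samples)} evaluations fall in [0.2, 0.4]")
```

This simple version re-evaluates f at shared points for brevity; production codes cache those values and use more careful termination tests.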

lecture 23: Clenshaw–Curtis quadrature

3.3 Clenshaw–Curtis quadrature

To get faster convergence for a fixed number of function evaluations, one might wish to increase the degree of the approximating polynomial further still, then integrate that high-degree polynomial. As evidenced in the discussion of polynomial interpolation in Section 1.6, the success of such an approach depends significantly on the choice of the interpolation points and the nature of the function being interpolated. For example, we would not expect the integral of a high-degree polynomial interpolant to Runge's function $f(x) = (x^2+1)^{-1}$ over uniformly spaced points on $[-5,5]$ to accurately approximate

\[ \int_{-5}^{5} \frac{1}{x^2+1}\,dx, \]

since the underlying interpolants fail to converge for this example.

However, recall that Theorem 2.6 ensures that the polynomials that interpolate f at Chebyshev points will converge, provided f is just a bit smooth. This suggests that the integrals of such interpolating polynomials will also be accurate as n increases, and indeed this is the case.

This procedure of integrating interpolants at Chebyshev points is known as Clenshaw–Curtis quadrature. If f is smooth, this method typically converges much faster than the composite trapezoid and Simpson's rules, which are only based on low-degree polynomial interpolants. In fact, the Clenshaw–Curtis method competes very well with the Gaussian quadrature schemes discussed in the next sections, although those Gaussian quadrature schemes have historically received much greater attention.

[Margin note: See L. N. Trefethen, ‘Is Gauss Quadrature Better than Clenshaw–Curtis?’, SIAM Review 50 (2008) 67–87.]

Suppose we wish to integrate a function over $[-1,1]$. Then Clenshaw–Curtis quadrature evaluates f at the $n+1$ Chebyshev points

\[ x_j = \cos\Bigl(\frac{j\pi}{n}\Bigr), \qquad j = 0, \ldots, n. \]
The Lagrange form of the polynomial interpolant at these points is

\[ p_n(x) = \sum_{j=0}^{n} f(x_j)\,\ell_j(x), \]

where

\[ \ell_j(x) = \prod_{k=0,\,k\ne j}^{n} \frac{x - x_k}{x_j - x_k} \]

are the usual Lagrange basis functions. Since the Clenshaw–Curtis quadrature rule will integrate the polynomial interpolant at the Chebyshev points, the rule will give

\[ \int_{-1}^{1} f(x)\,dx \approx \int_{-1}^{1} p_n(x)\,dx = \sum_{j=0}^{n} f(x_j) \int_{-1}^{1} \ell_j(x)\,dx. \]

Thus, defining the weights to be

\[ w_j := \int_{-1}^{1} \ell_j(x)\,dx, \]

the Clenshaw–Curtis quadrature rule takes the following compact form.

Clenshaw–Curtis rule:

\[ \int_{-1}^{1} f(x)\,dx \approx \sum_{j=0}^{n} w_j f(x_j), \]

where $x_j = \cos(j\pi/n)$ and $w_j = \int_{-1}^{1} \ell_j(x)\,dx$.

Connecting interpolation at Chebyshev points to trigonometric interpolation leads to a convenient algorithm for computing the weights $w_j$ using a fast Fourier transform, which is much more stable and convenient than integrating the Lagrange interpolating polynomials directly. We shall not go into details here. Interested readers can consult Trefethen's paper, ‘Is Gauss Quadrature Better than Clenshaw–Curtis?’ (2008) or his book Spectral Methods in MATLAB (2000).
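For small experiments the weights can also be evaluated from a closed-form cosine sum, without an FFT. The sketch below is not the FFT algorithm mentioned above. It assumes the standard explicit expression $w_j = \frac{c_j}{n}\bigl(1 - \sum_{k=1}^{\lfloor n/2 \rfloor} \frac{b_k}{4k^2-1}\cos(2kj\pi/n)\bigr)$, with $c_j = 1$ for $j \in \{0, n\}$ and 2 otherwise, and $b_k = 1$ when $2k = n$ and 2 otherwise. The function names are illustrative.

```python
import math

def clenshaw_curtis(n):
    """Nodes x_j = cos(j*pi/n) and Clenshaw-Curtis weights on [-1, 1], n >= 1."""
    nodes, weights = [], []
    for j in range(n + 1):
        theta = j * math.pi / n
        s = 0.0
        for k in range(1, n // 2 + 1):
            b = 1.0 if 2 * k == n else 2.0
            s += b / (4 * k * k - 1) * math.cos(2 * k * theta)
        c = 1.0 if j in (0, n) else 2.0
        nodes.append(math.cos(theta))
        weights.append(c / n * (1.0 - s))
    return nodes, weights

def cc_integrate(f, a, b, n):
    """Apply the rule on [a, b] through the affine map from [-1, 1]."""
    x, w = clenshaw_curtis(n)
    half, mid = (b - a) / 2, (a + b) / 2
    return half * sum(wj * f(mid + half * xj) for xj, wj in zip(x, w))

def runge(x):
    return 1.0 / (1.0 + x * x)

exact = 2.0 * math.atan(5.0)
for n in (8, 16, 32, 64):
    err = abs(exact - cc_integrate(runge, -5.0, 5.0, n))
    print(f"n = {n:2d}   |error| = {err:.3e}")
```

With $n = 2$ the nodes are $1, 0, -1$ and the weights reduce to those of Simpson's rule, which gives a quick sanity check on the formula.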

lecture 24: Gaussian quadrature rules: fundamentals

3.4 Gaussian quadrature

It is clear that the trapezoid rule,

\[ \frac{b-a}{2}\bigl( f(a) + f(b) \bigr), \]

exactly integrates linear polynomials, but not all quadratics. In fact, one can show that no quadrature rule of the form

\[ w_a f(a) + w_b f(b) \]

will exactly integrate all quadratics over $[a,b]$, regardless of the choice of constants $w_a$ and $w_b$. However, notice that a general quadrature rule with two points,

\[ w_0 f(x_0) + w_1 f(x_1), \]

has four parameters $(w_0, x_0, w_1, x_1)$. We might then hope that we could pick these four parameters in such a fashion that the quadrature rule is exact for a four-dimensional subspace of functions, $\mathcal{P}_3$. This section explores generalizations of this question.

3.4.1 A special 2-point rule


Suppose we consider a more general class of 2-point quadrature rules, where we do not initially fix the points at which the integrand f is evaluated:

\[ I(f) = w_0 f(x_0) + w_1 f(x_1) \]

for unknown nodes $x_0, x_1 \in [a,b]$ and weights $w_0$ and $w_1$. We wish to pick $x_0$, $x_1$, $w_0$, and $w_1$ so that the quadrature rule exactly integrates all polynomials of the largest degree possible. Since this quadrature rule is linear, it will suffice to check that it is exact on monomials. There are four unknowns; to get four equations, we will require $I(f)$ to exactly integrate $1, x, x^2, x^3$.

\[ \begin{aligned}
f(x) = 1 &: & \int_a^b 1\,dx = I(1) &\implies b - a = w_0 + w_1 \\
f(x) = x &: & \int_a^b x\,dx = I(x) &\implies \tfrac{1}{2}(b^2 - a^2) = w_0 x_0 + w_1 x_1 \\
f(x) = x^2 &: & \int_a^b x^2\,dx = I(x^2) &\implies \tfrac{1}{3}(b^3 - a^3) = w_0 x_0^2 + w_1 x_1^2 \\
f(x) = x^3 &: & \int_a^b x^3\,dx = I(x^3) &\implies \tfrac{1}{4}(b^4 - a^4) = w_0 x_0^3 + w_1 x_1^3
\end{aligned} \]

Three of these constraints are nonlinear equations in the unknowns $x_0$, $x_1$, $w_0$, and $w_1$: thus questions of existence and uniqueness of solutions become a bit more subtle than for the linear equations we so often encounter.

In this case, a solution does exist:

\[ w_0 = w_1 = \tfrac{1}{2}(b-a), \]

\[ x_0 = \tfrac{1}{2}(b+a) - \tfrac{\sqrt{3}}{6}(b-a), \qquad x_1 = \tfrac{1}{2}(b+a) + \tfrac{\sqrt{3}}{6}(b-a). \]

Notice that $x_0, x_1 \in [a,b]$: If this were not the case, we could not use these points as quadrature nodes, since f might not be defined outside $[a,b]$. When $[a,b] = [-1,1]$, the interpolation points are $\pm 1/\sqrt{3}$, giving the quadrature rule

\[ I(f) = f(-1/\sqrt{3}) + f(1/\sqrt{3}). \]
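The four exactness conditions can be checked numerically for the nodes and weights just derived. A minimal sketch follows; the function name `gauss_two_point` and the interval $[0,2]$ are arbitrary illustrative choices.

```python
import math

def gauss_two_point(f, a, b):
    """Two-point rule with w0 = w1 = (b - a)/2 and nodes
    (a + b)/2 -/+ (sqrt(3)/6)(b - a)."""
    w = (b - a) / 2
    mid = (a + b) / 2
    shift = math.sqrt(3) / 6 * (b - a)
    return w * (f(mid - shift) + f(mid + shift))

a, b = 0.0, 2.0  # arbitrary illustrative interval
errors = []
for k in range(5):
    exact = (b ** (k + 1) - a ** (k + 1)) / (k + 1)
    approx = gauss_two_point(lambda x: x ** k, a, b)
    errors.append(exact - approx)
    print(f"x^{k}: exact - rule = {errors[-1]: .3e}")
```

Up to rounding, the rule reproduces the integrals of $1, x, x^2, x^3$, while $x^4$ leaves a visible error, so cubics are the highest degree integrated exactly.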

3.4.2 Generalization to higher degrees


Emboldened by the success of this humble 2-point rule, we consider generalizations to higher degrees. If some two-point rule ($n+1$ integration nodes, for $n = 1$) will exactly integrate all cubics ($3 = 2n+1$), one might anticipate the existence of rules based on $n+1$ points that exactly integrate all polynomials of degree $2n+1$, for general values of n. Toward this end, consider quadrature rules of the form

\[ I_n(f) = \sum_{j=0}^{n} w_j f(x_j), \]

for which we will choose the nodes $\{x_j\}$ and weights $\{w_j\}$ (a total of $2n+2$ variables) to maximize the degree of polynomial that is integrated exactly.
The primary challenge is to find satisfactory quadrature nodes.
Once these are found, the weights follow easily: in theory, one could
obtain them by integrating the polynomial interpolant at the nodes,
though better methods are available in practice. In particular, this
procedure for assigning weights ensures, at a minimum, that In ( f )
will exactly integrate all polynomials of degree n. This assumption
will play a key role in the coming development.
Orthogonal polynomials, introduced in Section 2.5, play a central role in this exposition, and they suggest a generalization of the interpolatory quadrature procedures we have studied up to this point.

Let $\{\phi_j\}_{j=0}^{n+1}$ be a system of orthogonal polynomials with respect to the inner product

\[ \langle f, g \rangle = \int_a^b f(x)\,g(x)\,w(x)\,dx \]

for some weight function $w \in C(a,b)$ that is non-negative over $(a,b)$ and takes the value of zero only on a set of measure zero.

[Margin note: This weight function plays an essential role in the discussion: it defines the inner product, and so it dictates what it means for two functions to be orthogonal. Change the weight function, and you will change the orthogonal polynomials.]

Now we wish to construct an interpolatory quadrature rule for an integral that incorporates the weight function $w(x)$ in the integrand:

\[ I_n(f) = \sum_{j=0}^{n} w_j f(x_j) \approx \int_a^b f(x)\,w(x)\,dx. \]

[Margin note: In Section 3.4.4 we shall see some useful examples of weight functions.]

It is our aim to make $I_n(p)$ exact for all $p \in \mathcal{P}_{2n+1}$. First, we will show that any interpolatory quadrature rule $I_n$ will at least be exact for the weighted integral of degree-n polynomials. Showing this is a simple modification of the argument made in Section 3.1 for unweighted integrals.
Given a set of distinct nodes $x_0, \ldots, x_n$, construct the polynomial interpolant to f at those nodes:

\[ p_n(x) = \sum_{j=0}^{n} f(x_j)\,\ell_j(x), \]

where $\ell_j(x)$ is the usual Lagrange basis function for polynomial interpolation. The interpolatory quadrature rule will exactly integrate the weighted integral of the interpolant $p_n$:

\[ \begin{aligned}
\int_a^b f(x)\,w(x)\,dx \approx \int_a^b p_n(x)\,w(x)\,dx
&= \int_a^b \Bigl( \sum_{j=0}^{n} f(x_j)\,\ell_j(x) \Bigr) w(x)\,dx \\
&= \sum_{j=0}^{n} f(x_j) \int_a^b \ell_j(x)\,w(x)\,dx.
\end{aligned} \]

Thus we define the quadrature weights for the weighted integral to be

\[ w_j := \int_a^b \ell_j(x)\,w(x)\,dx, \]

giving the rule

\[ I_n(f) = \sum_{j=0}^{n} w_j f(x_j) \approx \int_a^b f(x)\,w(x)\,dx. \]

Apply this rule to a degree-n polynomial, p. Since $p \in \mathcal{P}_n$, it is its own degree-n polynomial interpolant, so the integral of the interpolant delivers the exact weighted integral of p:

\[ \int_a^b p(x)\,w(x)\,dx = \sum_{j=0}^{n} w_j\,p(x_j) = I_n(p). \]

[Margin note: Note that the weight function $w(x)$ can include all sorts of nastiness, all of which is absorbed in the quadrature weights $w_0, \ldots, w_n$.]

This is the case regardless of how the (distinct) nodes $x_0, \ldots, x_n$ were chosen. Now we seek a way to choose the nodes so that the quadrature rule is exact for higher-degree polynomials.

To begin, consider an arbitrary $p \in \mathcal{P}_{2n+1}$. Using polynomial division, we can always write

\[ p(x) = \phi_{n+1}(x)\,q(x) + r(x) \]

for some $q, r \in \mathcal{P}_n$ that depend on p. Integrating this p, we obtain

\[ \begin{aligned}
\int_a^b p(x)\,w(x)\,dx &= \int_a^b \phi_{n+1}(x)\,q(x)\,w(x)\,dx + \int_a^b r(x)\,w(x)\,dx \\
&= \langle \phi_{n+1}, q \rangle + \int_a^b r(x)\,w(x)\,dx \\
&= \int_a^b r(x)\,w(x)\,dx.
\end{aligned} \]

The last step is a consequence of the important basic fact, proved in Section 2.5, that the orthogonal polynomial $\phi_{n+1}$ is orthogonal to all $q \in \mathcal{P}_n$.
Now apply the quadrature rule to p, and attempt to pick the interpolation nodes $\{x_j\}$ to yield the value of the exact integral computed above. In particular,

\[ \begin{aligned}
I_n(p) = \sum_{j=0}^{n} w_j\,p(x_j) &= \sum_{j=0}^{n} w_j\,\phi_{n+1}(x_j)\,q(x_j) + \sum_{j=0}^{n} w_j\,r(x_j) \\
&= \sum_{j=0}^{n} w_j\,\phi_{n+1}(x_j)\,q(x_j) + \int_a^b r(x)\,w(x)\,dx.
\end{aligned} \]

This last statement is a consequence of the fact that $I_n(\cdot)$ will exactly integrate all $r \in \mathcal{P}_n$. This will be true regardless of our choice for the distinct nodes $\{x_j\} \subset [a,b]$. (Recall that the quadrature rule is constructed so that it exactly integrates a degree-n polynomial interpolant to the integrand, and in this case the integrand, r, is a degree-n polynomial. Hence $I_n(r)$ will be exact.)

Notice that we can force agreement between $I_n(p)$ and $\int_a^b p(x)\,w(x)\,dx$ provided

\[ \sum_{j=0}^{n} w_j\,\phi_{n+1}(x_j)\,q(x_j) = 0. \]

We cannot make assumptions about $q \in \mathcal{P}_n$, as this polynomial will vary with the choice of p, but we can exploit properties of $\phi_{n+1}$. Since $\phi_{n+1}$ has exact degree $n+1$ (recall this property of all orthogonal polynomials), it must have $n+1$ roots. If we choose the interpolation nodes $\{x_j\}$ to be the roots of $\phi_{n+1}$, then $\sum_{j=0}^{n} w_j\,\phi_{n+1}(x_j)\,q(x_j) = 0$ as required, and we have a quadrature rule that is exact for all polynomials of degree $2n+1$.

Before we can declare victory, though, we must exercise some caution. Perhaps $\phi_{n+1}$ has repeated roots (so that the nodes $\{x_j\}$ are not distinct), or perhaps these roots lie at points in the complex plane where f may not even be defined. Since we are integrating f over the interval $[a,b]$, it is crucial that $\phi_{n+1}$ has $n+1$ distinct roots in $[a,b]$. Fortunately, this is one of the many beautiful properties enjoyed by orthogonal polynomials.

Theorem 3.6 (Roots of Orthogonal Polynomials). Let $\{\phi_k\}_{k=0}^{n+1}$ be a system of orthogonal polynomials on $[a,b]$ with respect to the weight function $w(x)$. Then $\phi_k$ has k distinct real roots, $\{x_j^{(k)}\}_{j=1}^{k}$, with $x_j^{(k)} \in [a,b]$ for $j = 1, \ldots, k$.

Proof. The result is trivial for $\phi_0$. Fix any $k \in \{1, \ldots, n+1\}$. Suppose that $\phi_k$, a polynomial of exact degree k, changes sign at $j < k$ distinct roots $\{x_\ell^{(k)}\}_{\ell=1}^{j}$ in the interval $[a,b]$. Then define

\[ q(x) = (x - x_1^{(k)})(x - x_2^{(k)}) \cdots (x - x_j^{(k)}) \in \mathcal{P}_j. \]

This function changes sign at exactly the same points as $\phi_k$ does on $[a,b]$. Thus, the product of these two functions, $\phi_k(x)\,q(x)$, does not change sign on $[a,b]$. See the illustration in Figure 3.8.

[Figure 3.8: The functions $\phi_k$, $q$, and $\phi_k q$ on $[a,b]$, from the proof of Theorem 3.6.]

As the weight function $w(x)$ is nonnegative on $[a,b]$, it must also be that $\phi_k q w$ does not change sign on $[a,b]$. However, the fact that $q \in \mathcal{P}_j$ for $j < k$ implies that

\[ \int_a^b \phi_k(x)\,q(x)\,w(x)\,dx = \langle \phi_k, q \rangle = 0, \]

since $\phi_k$ is orthogonal to all polynomials of degree $k-1$ or lower (Lemma 2.3). Thus, we conclude that the integral of some continuous nonzero function $\phi_k q w$ that never changes sign on $[a,b]$ must be zero. This is a contradiction, as the integral of such a function must be nonzero. Thus, $\phi_k$ must have at least k distinct zeros in $[a,b]$. As $\phi_k$ is a polynomial of degree k, it can have no more than k zeros.
We have arrived at Gaussian quadrature rules: Integrate the polynomial that interpolates f at the roots of the orthogonal polynomial $\phi_{n+1}$. What are the weights $\{w_j\}$? Write the interpolant, $p_n$, in the Lagrange basis,

\[ p_n(x) = \sum_{j=0}^{n} f(x_j)\,\ell_j(x), \]

where the basis polynomials $\ell_j$ are defined as usual,

\[ \ell_j(x) = \prod_{k=0,\,k\ne j}^{n} \frac{x - x_k}{x_j - x_k}. \tag{3.2} \]

Integrating this interpolant gives

\[ I_n(f) = \int_a^b p_n(x)\,w(x)\,dx = \int_a^b \sum_{j=0}^{n} f(x_j)\,\ell_j(x)\,w(x)\,dx = \sum_{j=0}^{n} f(x_j) \int_a^b \ell_j(x)\,w(x)\,dx, \]

revealing a formula for the quadrature weights:

\[ w_j = \int_a^b \ell_j(x)\,w(x)\,dx. \]
This construction proves the following result.

Theorem 3.7. Suppose $I_n(f)$ is the Gaussian quadrature rule

\[ I_n(f) = \sum_{j=0}^{n} w_j f(x_j), \]

where the nodes $\{x_j\}_{j=0}^{n}$ are the $n+1$ roots of a degree-$(n+1)$ orthogonal polynomial on $[a,b]$ with weight function w, and $w_j = \int_a^b \ell_j(x)\,w(x)\,dx$. Then

\[ I_n(f) = \int_a^b f(x)\,w(x)\,dx \]

for all polynomials f of degree at most $2n+1$.

As a side-effect of this high-degree exactness, we obtain an interesting new formula for the weights in Gaussian quadrature. Since the Lagrange basis polynomial $\ell_k$ is the product of n linear factors (see (3.2)), $\ell_k \in \mathcal{P}_n$, and

\[ (\ell_k)^2 \in \mathcal{P}_{2n} \subseteq \mathcal{P}_{2n+1}. \]

Thus the Gaussian quadrature rule exactly integrates $(\ell_k)^2 w(x)$. We write

\[ \int_a^b (\ell_k(x))^2\,w(x)\,dx = \sum_{j=0}^{n} w_j\,(\ell_k(x_j))^2 = w_k\,(\ell_k(x_k))^2 = w_k, \]

where we have used the fact that $\ell_k(x_j) = 0$ if $j \ne k$, and $\ell_k(x_k) = 1$. This leads to another formula for the Gaussian quadrature weights:

\[ w_k = \int_a^b \ell_k(x)\,w(x)\,dx = \int_a^b (\ell_k(x))^2\,w(x)\,dx. \tag{3.3} \]

This latter formula is more computationally appealing than the former, because it is more numerically reliable to integrate positive-valued integrands. This is a neat fact, but, as described below, there is a still-better way to compute these weights: by computing eigenvectors of a symmetric tridiagonal matrix.

[Margin note: One avoids floating point errors that can be introduced by adding quantities that are similar in magnitude but opposite in sign, known as catastrophic cancellation.]

Of course, in many circumstances we are not simply integrating polynomials, but more complicated functions, so we want better insight about the method's performance than Theorem 3.7 provides. One can prove the following error bound.

[Margin note: See, e.g., Süli and Mayers, pp. 282–283.]

Theorem 3.8. Suppose $f \in C^{2n+2}[a,b]$ and let $I_n(f)$ be the usual $(n+1)$-point Gaussian quadrature rule on $[a,b]$ with weight function $w(x)$ and nodes $\{x_j\}_{j=0}^{n}$. Then

\[ \int_a^b f(x)\,w(x)\,dx - I_n(f) = \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \int_a^b \psi^2(x)\,w(x)\,dx \]

for some $\xi \in [a,b]$ and $\psi(x) = \prod_{j=0}^{n} (x - x_j)$.

lecture 25: Gaussian quadrature: nodes, weights; examples; extensions

3.4.3 Computing Gaussian quadrature nodes and weights


When first approaching Gaussian quadrature, the complicated characterization of the nodes and weights might seem like a significant drawback. For example, if one approximates an integral with an $(n+1)$-point Gaussian quadrature rule and finds the accuracy insufficient, one must compute an entirely new set of nodes and weights for a larger n from scratch.

Many years ago, one would need to look up pre-computed nodes and weights for a given rule in a book of mathematical tables, and one was thus limited to using values of n for which one could easily find tabulated values for the nodes and weights.

However, in a landmark paper of 1969, Gene Golub and John Welsch found a nice characterization of the nodes and weights in terms of a symmetric matrix eigenvalue problem. Given the existence of excellent algorithms for computing such eigenvalues, one can readily compute Gaussian quadrature nodes for arbitrary values of n. Some details of this derivation are left for a homework exercise, but we summarize the results here.

[Margin note: G. H. Golub and J. H. Welsch, “Calculation of Gauss Quadrature Rules,” Math. Comp. 23 (1969) 221–230.]

[Margin note: In general, such algorithms require $O(n^3)$ operations to compute all eigenvalues and eigenvectors of an $n \times n$ matrix; for the modest values of n most common in practice (n in the tens or at most low hundreds), this expense is not onerous. Exploiting the structure of $J_n$, this algorithm could be sped up to $O(n^2)$.]

One can show, via the discussion in Section 2.5, that, given values $\phi_{-1}(x) = 0$ and $\phi_0(x) = 1$, the subsequent orthogonal polynomials can be generated via the three-term recurrence relation
\[ \phi_{k+1}(x) = x\,\phi_k(x) - \alpha_k\,\phi_k(x) - \beta_k\,\phi_{k-1}(x). \tag{3.4} \]

The values of $\alpha_0, \ldots, \alpha_n$ and $\beta_1, \ldots, \beta_n$ follow from the Gram–Schmidt process used in Section 2.5. Later we will also need the definition

\[ \beta_0 := \langle 1, 1 \rangle = \int_a^b w(x)\,dx. \]

[Margin note: A good source for the values of $\{\alpha_k\}$ and $\{\beta_k\}$ is Table 1.1 in Walter Gautschi's Orthogonal Polynomials, Oxford University Press, 2004.]

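Given a fixed value of n, collect the coefficients $\{\alpha_k\}_{k=0}^{n}$ and $\{\beta_k\}_{k=1}^{n}$ and use them to populate the Jacobi matrix

\[ J_n = \begin{bmatrix}
\alpha_0 & \sqrt{\beta_1} & & & \\
\sqrt{\beta_1} & \alpha_1 & \sqrt{\beta_2} & & \\
 & \sqrt{\beta_2} & \ddots & \ddots & \\
 & & \ddots & \alpha_{n-1} & \sqrt{\beta_n} \\
 & & & \sqrt{\beta_n} & \alpha_n
\end{bmatrix}. \tag{3.5} \]

The following theorem (whose proof is left as a homework exercise) gives a ready way to compute the roots of $\phi_{n+1}$.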

Theorem 3.9. Let $\{\phi_j\}_{j=0}^{n+1}$ denote a sequence of monic orthogonal polynomials generated by the recurrence relation (3.4). Then $\lambda$ is a root of $\phi_{n+1}$ if and only if $\lambda$ is an eigenvalue of $J_n$ with corresponding eigenvector

\[ v(\lambda) = \begin{bmatrix}
\phi_0(\lambda) \\
\phi_1(\lambda)/\sqrt{\beta_1} \\
\phi_2(\lambda)/\sqrt{\beta_1 \beta_2} \\
\vdots \\
\phi_n(\lambda)/\sqrt{\beta_1 \cdots \beta_n}
\end{bmatrix}. \tag{3.6} \]

Theorem 3.9 thus gives a convenient way to compute the nodes of a Gaussian quadrature rule. Given the nodes, one could compute the weights using either of the formulas (3.3) involving Lagrange basis functions. However, Golub and Welsch proved that these same weights could be extracted from the eigenvectors of the Jacobi matrix $J_n$. Label the eigenvalues of $J_n$ as

\[ \lambda_0 < \lambda_1 < \cdots < \lambda_n \]

and the corresponding eigenvectors, given by the formula (3.6), as

\[ v_0 = v(\lambda_0), \quad v_1 = v(\lambda_1), \quad \ldots, \quad v_n = v(\lambda_n). \]

Then, with a bit of work, one can show that the weights for $(n+1)$-point Gaussian quadrature can be computed as

\[ w_j = \beta_0\,\frac{1}{\|v_j\|_2^2}, \tag{3.7} \]

where $\|\cdot\|_2$ is the vector 2-norm. (Note: this assumes $(v_j)_1 = \phi_0(\lambda_j) = 1$.)

The formula (3.7) relies on the specific form of the eigenvector in (3.6). If the eigenvector is normalized differently (e.g., MATLAB's eigs routine gives unit eigenvectors, $\|v_j\|_2 = 1$), then one should use the general formula

\[ w_j = \beta_0\,\frac{(v_j)_1^2}{\|v_j\|_2^2}, \tag{3.8} \]

where $(v_j)_1$ is the first entry in the eigenvector $v_j$.

3.4.4 Examples of Gaussian Quadrature


Let us examine four well-known Gaussian quadrature rules.

method             interval (a, b)    weight w(x)

Gauss–Legendre     (−1, 1)            1
Gauss–Chebyshev    (−1, 1)            1/√(1 − x²)
Gauss–Laguerre     (0, ∞)             e^(−x)
Gauss–Hermite      (−∞, ∞)            e^(−x²)

The weight function plays an essential role here. For example, a Gauss–Chebyshev quadrature rule with $n + 1 = 5$ points will exactly integrate polynomials of degree $2n + 1 = 9$ times the weight function. Thus this rule will exactly integrate

\[ \int_{-1}^{1} \frac{x^9}{\sqrt{1-x^2}}\,dx, \]

but it will not exactly integrate

\[ \int_{-1}^{1} x^{10}\,dx. \]

This is a subtle point that many students overlook when first learning about Gaussian quadrature.

Example 3.2. Gauss–Legendre Quadrature


When numerical analysts speak of “Gaussian quadrature” without further qualification, they typically mean Gauss–Legendre quadrature, i.e., quadrature with the weight function $w(x) = 1$ (perhaps over a transformed domain $(a,b)$; see Section 3.4.5). As discussed in Section 2.5.1, the orthogonal polynomials for this interval and weight are called Legendre polynomials. To construct a Gaussian quadrature rule with $n+1$ points, determine the roots of the degree-$(n+1)$ Legendre polynomial, then find the associated weights.

First consider $n = 1$. The quadratic Legendre polynomial is

\[ \phi_2(x) = x^2 - 1/3, \]

and from this polynomial one can derive the 2-point quadrature rule that is exact for cubic polynomials, with roots $\pm 1/\sqrt{3}$. This agrees with the special 2-point rule derived in Section 3.4.1. The values for the weights follow simply, $w_0 = w_1 = 1$, giving the 2-point Gauss–Legendre rule

\[ I_n(f) = f(-1/\sqrt{3}) + f(1/\sqrt{3}) \]

that exactly integrates polynomials of degree $2n + 1 = 3$, i.e., all cubics.

[Margin note: Recall that Simpson's rule also exactly integrates cubics, but it requires three f evaluations, rather than the two f evaluations required of this rule.]

[Figure 3.9: Nodes and weights of Gauss–Legendre quadrature, for various values of n (panels: n = 4, 8, 16, 32, 64, 128; horizontal axis $x_j$, vertical axis $w_j$). In each case, the location of the vertical line indicates $x_j$, while the height of the line shows $w_j$.]

For Gauss–Legendre quadrature rules based on larger numbers of points, we can compute the nodes and weights using the symmetric eigenvalue formulation discussed in Section 3.4.3. For this weight, one can show

\[ \alpha_k = 0, \quad k = 0, 1, \ldots; \qquad \beta_0 = 2; \qquad \beta_k = \frac{k^2}{4k^2 - 1}, \quad k = 1, 2, 3, \ldots. \]
Figure 3.9 shows the nodes and weights for six values of n, as com-
puted via the eigenvalue problem. Notice that the points are not uni-
formly spaced, but are slightly more dense at the ends of the interval.
Moreover, the weights are smaller at these ends of the interval.
The table below shows nodes and weights for n = 4, as computed
in MATLAB.

j nodes, x j weights, w j

0 −0.906179845938664 0.236926885056189
1 −0.538469310105683 0.478628670499366
2 0.000000000000000 0.568888888888889
3 0.538469310105683 0.478628670499367
4 0.906179845938664 0.236926885056189
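The table can be reproduced with a few lines of code. The sketch below assumes NumPy is available and uses `numpy.linalg.eigh`, which returns unit eigenvectors, so the weights follow the general formula (3.8) with $\|v_j\|_2 = 1$. The function names `golub_welsch` and `gauss_legendre` are illustrative, and this is not the MATLAB code used to produce the notes' table.

```python
import numpy as np

def golub_welsch(alpha, beta):
    """Nodes and weights from recurrence coefficients alpha[0..n] and
    beta[0..n], where beta[0] is the integral of the weight function."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    off = np.sqrt(beta[1:])
    # Symmetric tridiagonal Jacobi matrix J_n from (3.5).
    J = np.diag(alpha) + np.diag(off, 1) + np.diag(off, -1)
    # eigh returns ascending eigenvalues and unit eigenvectors (as columns).
    nodes, V = np.linalg.eigh(J)
    # Formula (3.8) with unit eigenvectors: w_j = beta_0 * (first entry)^2.
    weights = beta[0] * V[0, :] ** 2
    return nodes, weights

def gauss_legendre(n):
    """(n + 1)-point Gauss-Legendre rule on [-1, 1]."""
    k = np.arange(1, n + 1)
    alpha = np.zeros(n + 1)
    beta = np.concatenate(([2.0], k ** 2 / (4.0 * k ** 2 - 1.0)))
    return golub_welsch(alpha, beta)

nodes, weights = gauss_legendre(4)
for j, (x, w) in enumerate(zip(nodes, weights)):
    print(f"{j}  {x: .15f}  {w:.15f}")
```

Using only the first component of each eigenvector is what makes this approach cheap: the nodes and weights come from a single symmetric eigenvalue computation.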

Example 3.3. Gauss–Chebyshev quadrature


Another popular class of Gaussian quadrature rules use as their nodes the roots of the Chebyshev polynomials. The standard degree-k Chebyshev polynomial is defined as

\[ T_k(x) = \cos(k \cos^{-1} x), \]

which can be generated by the recurrence relation

\[ T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x), \]

with $T_0(x) = 1$ and $T_1(x) = x$. These Chebyshev polynomials are orthogonal on $(-1,1)$ with respect to the weight function

\[ w(x) = \frac{1}{\sqrt{1-x^2}}. \]

The degree-$(n+1)$ Chebyshev polynomial has the roots

\[ x_j = \cos\Bigl( \frac{(j + 1/2)\pi}{n+1} \Bigr), \qquad j = 0, \ldots, n. \]
In this case all the weights work out to be identical; one can show

\[ w_j = \frac{\pi}{n+1} \]

for all $j = 0, \ldots, n$. Figure 3.10 shows these nodes and weights.

[Margin note: See Süli and Mayers, Problem 10.4 for a sketch of a proof.]

One can also define the monic Chebyshev polynomials according to the recurrence (3.4) with

\[ \alpha_k = 0, \quad k = 0, 1, \ldots; \qquad \beta_0 = \pi; \qquad \beta_1 = 1/2; \qquad \beta_k = 1/4, \quad k = 2, 3, 4, \ldots. \]

The resulting polynomials are scaled versions of the usual Chebyshev polynomials $T_{k+1}(x)$, and thus have the same roots.
Again, we emphasize that the weight function plays a crucial role: the Gauss–Chebyshev rule based on $n+1$ interpolation nodes will exactly compute integrals of the form

\[ \int_{-1}^{1} \frac{p(x)}{\sqrt{1-x^2}}\,dx \]

for all $p \in \mathcal{P}_{2n+1}$. For a general integral

\[ \int_{-1}^{1} \frac{f(x)}{\sqrt{1-x^2}}\,dx, \]

the quadrature rule should be implemented as

\[ I_n(f) = \sum_{j=0}^{n} w_j f(x_j). \]

[Margin note: Note that the $1/\sqrt{1-x^2}$ component of the integrand is not evaluated here; its influence has already been incorporated into the weights $\{w_j\}$.]
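Because the nodes and weights are explicit, the rule takes only a few lines to implement. This minimal sketch uses the hypothetical name `gauss_chebyshev`; the test integrands rely on the known values $\int_{-1}^{1} x^8/\sqrt{1-x^2}\,dx = 35\pi/128$ and $\int_{-1}^{1} x^{10}/\sqrt{1-x^2}\,dx = 63\pi/256$.

```python
import math

def gauss_chebyshev(f, n):
    """Approximate the integral of f(x)/sqrt(1 - x^2) over (-1, 1) using the
    n + 1 roots of T_{n+1} and equal weights pi/(n + 1). Only f is evaluated;
    the weight function is absorbed into the quadrature weights."""
    w = math.pi / (n + 1)
    return w * sum(f(math.cos((j + 0.5) * math.pi / (n + 1)))
                   for j in range(n + 1))

n = 4  # five nodes: exact for polynomial f up to degree 2n + 1 = 9
err_deg8 = gauss_chebyshev(lambda x: x ** 8, n) - 35 * math.pi / 128
err_deg10 = gauss_chebyshev(lambda x: x ** 10, n) - 63 * math.pi / 256
print(f"degree 8 error:  {err_deg8: .3e}")
print(f"degree 10 error: {err_deg10: .3e}")
```

The degree-8 integrand is handled to rounding error, while the degree-10 integrand exceeds the exactness degree of the 5-point rule and leaves a visible error.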
145

0.7 0.4
n=4 n=8 0.2 n = 16
0.6
0.3
wj 0.5 wj wj 0.15

0.4
0.2 0.1
0.3
0.2 0.1 0.05
0.1
0 0 0
-1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1
xj xj xj

0.1 n = 32 0.05
n = 64 0.025
n = 128
0.08 0.04
wj wj wj 0.02

0.06 0.03 0.015

0.04 0.02 0.01

0.02 0.01 0.005

0 0 0
-1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1
xj xj xj

Figure 3.10: Nodes and weights of


Gauss–Chebyshev quadrature, for
various values of n. In each case, the
The Chebyshev weight function w( x ) = 1/(1 − x2 ) blows up location of the vertical line indicates
x j , while the height of the line shows
at ±1, so if the integrand f does not balance this growth, adaptive
w j . The nodes, roots of Chebyshev
Newton–Cotes rules will likely have to place many interpolation polynomials, cluster toward the end of
nodes near these singularities to achieve decent accuracy, while the intervals; the weights are the same
for all the nodes, w j = π/(n + 1).
Gauss–Chebyshev quadrature has no problems. Moreover, in this
important case, the nodes and weights are trivial to compute, thus
allaying the need to solve the eigenvalue problem.
It is worth pointing out that Gauss–Chebyshev quadrature is quite
different from Clenshaw–Curtis quadrature. Though both use Chebyshev
points as interpolation nodes, only Gauss–Chebyshev incorporates
the weight function $w(x) = (1-x^2)^{-1/2}$ in the weights $\{w_j\}$.
Thus Clenshaw–Curtis is more appropriately compared to
Gauss–Legendre quadrature. Since the Clenshaw–Curtis method is not a
Gaussian quadrature formula, it will generally be exact only for all
$p \in \mathcal{P}_n$, rather than all $p \in \mathcal{P}_{2n+1}$.
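Because the Gauss–Chebyshev nodes and weights are available in closed form, the rule fits in a few lines. The following is a minimal sketch, not code from these notes: the function name `gauss_chebyshev` is hypothetical, and the node formula $x_j = \cos((2j+1)\pi/(2n+2))$ is the standard expression for the roots of $T_{n+1}$, assumed here rather than derived in this section.

```python
import math

def gauss_chebyshev(f, n):
    """Approximate the integral of f(x)/sqrt(1 - x^2) over [-1, 1] with n + 1 nodes."""
    # Nodes: roots of the degree-(n+1) Chebyshev polynomial T_{n+1}
    # (standard closed form, assumed here).
    nodes = [math.cos((2 * j + 1) * math.pi / (2 * n + 2)) for j in range(n + 1)]
    # All weights are equal; the 1/sqrt(1 - x^2) factor is built into them,
    # so f alone is evaluated at the nodes.
    weight = math.pi / (n + 1)
    return weight * sum(f(x) for x in nodes)

# p(x) = x^2 has degree 2 <= 2n + 1 for n = 1, so the rule should be exact:
# the integral of x^2 / sqrt(1 - x^2) over [-1, 1] equals pi/2.
quadratic = gauss_chebyshev(lambda x: x * x, n=1)

# p(x) = x^4 with n = 2 nodes-minus-one: exact value is 3*pi/8.
quartic = gauss_chebyshev(lambda x: x ** 4, n=2)
```

Note that `f` is passed without the singular factor; evaluating $1/\sqrt{1-x^2}$ inside `f` would count the weight function twice.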

Example 3.4. Gauss–Laguerre quadrature


The Laguerre polynomials form a set of orthogonal polynomials over
$(0, \infty)$ with the weight function $w(x) = e^{-x}$. The accompanying
quadrature rule approximates integrals of the form
$$\int_0^\infty f(x)\,e^{-x}\,dx.$$

The recurrence (3.4) uses the coefficients

$$\alpha_k = 2k + 1, \quad k = 0, 1, \ldots;$$
$$\beta_0 = 1; \qquad \beta_k = k^2, \quad k = 1, 2, 3, \ldots.$$

Figure 3.11 shows the nodes and weights for several values of n.
Since the domain of integration $(0, \infty)$ is infinite, the quadrature
nodes $x_j$ get larger and larger. As the nodes get larger, the
corresponding weights decay rapidly. When $w_j < 10^{-15}$, it becomes
difficult to reliably compute the weights by solving the Jacobi matrix
eigenvalue problem. To get the small weights given here, we have
used Chebfun’s lagpts routine, which uses a more efficient
algorithm of Glaser, Liu, and Rokhlin (2007).

[Figure 3.11: Nodes and weights of Gauss–Laguerre quadrature, for
various values of n (panels n = 4, 8, 16; horizontal axis $x_j$, vertical
axis $\log_{10} w_j$). In each case, the location of the vertical line
indicates $x_j$, while the height of the line shows $\log_{10} w_j$. Note
that the horizontal axis is scaled logarithmically. As n increases the
quadrature rule includes larger and larger nodes to account for the
infinite domain of integration; however, the weights are exceptionally
small for the larger nodes. For example, for n = 16,
$w_{16} \approx 10^{-23}$.]
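For very small n the Gauss–Laguerre rule can be built by hand from the recurrence coefficients, which gives a quick sanity check on them. The sketch below assumes that recurrence (3.4) has the usual monic three-term form $\pi_{k+1}(x) = (x - \alpha_k)\pi_k(x) - \beta_k \pi_{k-1}(x)$ with $\pi_0 = 1$ (that equation is not restated in this section), and the helper name `gauss_laguerre_2pt` is hypothetical. The weights are obtained by matching the first two moments of $e^{-x}$, both of which equal 1.

```python
import math

# Assumed form of recurrence (3.4):
#   pi_{k+1}(x) = (x - alpha_k) pi_k(x) - beta_k pi_{k-1}(x),  pi_0 = 1.
# Laguerre coefficients: alpha_k = 2k + 1, beta_k = k^2.
alpha0, alpha1, beta1 = 1.0, 3.0, 1.0

# pi_2(x) = (x - alpha1)(x - alpha0) - beta1 = x^2 + b x + c
b = -(alpha0 + alpha1)
c = alpha0 * alpha1 - beta1
disc = math.sqrt(b * b - 4.0 * c)
x0, x1 = (-b - disc) / 2.0, (-b + disc) / 2.0   # nodes 2 - sqrt(2) and 2 + sqrt(2)

# Weights from the moment conditions w0 + w1 = 1 and w0*x0 + w1*x1 = 1.
w1 = (1.0 - x0) / (x1 - x0)
w0 = 1.0 - w1

def gauss_laguerre_2pt(f):
    """Two-point approximation of the integral of f(x) e^{-x} over (0, inf)."""
    return w0 * f(x0) + w1 * f(x1)

# A two-point Gauss rule is exact for cubics; the exact integral of
# x^3 e^{-x} over (0, inf) is 3! = 6.
cubic = gauss_laguerre_2pt(lambda x: x ** 3)
```

For larger n this hand construction is impractical, which is where the Jacobi matrix eigenvalue problem or the Glaser, Liu, and Rokhlin algorithm mentioned above comes in.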

Example 3.5. Gauss–Hermite quadrature


The Hermite polynomials are orthogonal polynomials over $(-\infty, \infty)$
with the weight function $w(x) = e^{-x^2}$. This quadrature rule
approximates integrals of the form
$$\int_{-\infty}^{\infty} f(x)\,e^{-x^2}\,dx.$$

The Hermite polynomials can be generated using the recurrence (3.4)
with coefficients

$$\alpha_k = 0, \quad k = 0, 1, \ldots;$$
$$\beta_0 = \sqrt{\pi}; \qquad \beta_k = k/2, \quad k = 1, 2, 3, \ldots.$$

Figure 3.12 shows nodes and weights for various values of n. Though
the interval of integration is infinite, the nodes do not grow as
rapidly as for Gauss–Laguerre quadrature, since the Hermite weight
$w(x) = e^{-x^2}$ decays more rapidly than the Laguerre weight
$w(x) = e^{-x}$. (Again, the nodes and weights in the figure were
computed with Chebfun’s implementation of the Glaser, Liu, and Rokhlin
algorithm.)
[Figure 3.12: Nodes and weights of Gauss–Hermite quadrature, for
various values of n (panels n = 4, 8, 16; horizontal axis $x_j$, vertical
axis $\log_{10} w_j$). In each case, the location of the vertical line
indicates $x_j$, while the height of the line shows $\log_{10} w_j$.
Contrast this plot to Figure 3.11. Though the domain of integration is
infinite in both cases, the weight function here, $e^{-x^2}$, decays
much more rapidly than $e^{-x}$, explaining why the largest nodes are
smaller than seen in Figure 3.11 for Gauss–Laguerre quadrature.]

3.4.5 Changing variables to transform domains

One notable drawback of Gaussian quadrature is the need to
precompute (or look up) the requisite weights and nodes. If one has
a quadrature rule for the interval [c, d], and wishes to adapt it to
the interval [a, b], there is a simple change of variables procedure to
eliminate the need to recompute the nodes and weights from scratch.
Let τ be a linear transformation taking [c, d] to [a, b],

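$$\tau(x) = a + \Big(\frac{b-a}{d-c}\Big)(x - c)$$

with inverse $\tau^{-1} : [a, b] \to [c, d]$,

$$\tau^{-1}(y) = c + \Big(\frac{d-c}{b-a}\Big)(y - a).$$

Then we have

$$\int_a^b f(x)\,w(x)\,dx
  = \int_{\tau^{-1}(a)}^{\tau^{-1}(b)} f(\tau(x))\,w(\tau(x))\,\tau'(x)\,dx
  = \Big(\frac{b-a}{d-c}\Big) \int_c^d f(\tau(x))\,w(\tau(x))\,dx.$$

The quadrature rule for [a, b] takes the form

$$\widehat{I}(f) = \sum_{j=0}^{n} \widehat{w}_j f(\widehat{x}_j),$$

for

$$\widehat{w}_j = \Big(\frac{b-a}{d-c}\Big) w_j, \qquad \widehat{x}_j = \tau(x_j),$$

where $\{x_j\}_{j=0}^n$ and $\{w_j\}_{j=0}^n$ are the nodes and weights for
the quadrature rule on [c, d]. (The nodes are mapped forward by τ, from
[c, d] to [a, b], because the transformed integrand is $f(\tau(x))$.)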
Be sure to note how this change of variables alters the weight
function. The transformed rule will now have a weight function

$$w(\tau(x)) = w(a + (b-a)(x-c)/(d-c)),$$

not simply $w(x)$. To make this concrete, consider Gauss–Chebyshev
quadrature, which uses the weight function $w(x) = (1-x^2)^{-1/2}$ on
$[-1, 1]$. If one wishes to integrate, for example,
$\int_0^1 x(1-x^2)^{-1/2}\,dx$, it is not sufficient just to use the
change of variables formula described here. To compute the desired
integral, one would have to adjust the nodes and weights to
accommodate $w(x) = (1-x^2)^{-1/2}$ on [0, 1].

Composite rules. Employing this change of variables technique, it is
simple to devise a method for decomposing the interval of integration
into smaller regions, over which Gauss quadrature rules can
be applied. (The most straightforward application is to adjust the
Gauss–Legendre quadrature rule, which avoids complications induced
by the weight function, since $w(x) = 1$ in this case.) Such
techniques can be used to develop Gaussian-based adaptive quadrature
rules.
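The change of variables and the composite idea can be sketched together for the Gauss–Legendre case, where $w(x) = 1$ and the weight function causes no complications. This is only an illustration under stated assumptions: the three-point rule on $[-1, 1]$ (nodes $0, \pm\sqrt{3/5}$, weights $8/9, 5/9$) is the standard tabulated rule, and the function names `gauss_legendre_on` and `composite_gauss_legendre` are hypothetical.

```python
import math

# Three-point Gauss-Legendre rule on the reference interval [c, d] = [-1, 1].
REF_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
REF_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def gauss_legendre_on(f, a, b):
    """Apply the reference rule to [a, b] via tau(x) = a + (b - a)(x + 1)/2."""
    scale = (b - a) / 2.0                      # (b - a)/(d - c) with [c, d] = [-1, 1]
    # Mapped weights are scale * w_j; mapped nodes are tau(x_j).
    return sum(scale * w * f(a + scale * (x + 1.0))
               for x, w in zip(REF_NODES, REF_WEIGHTS))

def composite_gauss_legendre(f, a, b, panels):
    """Split [a, b] into equal panels and sum the mapped rule over each one."""
    h = (b - a) / panels
    return sum(gauss_legendre_on(f, a + i * h, a + (i + 1) * h)
               for i in range(panels))

# A three-point Gauss rule is exact for degree 5: integral of x^5 on [0, 2] is 2^6/6.
single = gauss_legendre_on(lambda x: x ** 5, 0.0, 2.0)

# Four panels on [0, pi] for sin(x); the exact integral is 2.
composite = composite_gauss_legendre(math.sin, 0.0, math.pi, 4)
```

An adaptive variant would choose the panel boundaries from local error estimates instead of using equal widths; that refinement is not attempted here.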

3.4.6 Gauss–Radau and Gauss–Lobatto quadrature

Some applications make it necessary or convenient to force one or
both of the end points of the interval of integration to be among the
quadrature points. Such methods are known as Gauss–Radau and
Gauss–Lobatto quadrature rules, respectively; rules based on n + 1
interpolation points exactly integrate all polynomials in
$\mathcal{P}_{2n}$ (Gauss–Radau) or $\mathcal{P}_{2n-1}$ (Gauss–Lobatto):
each quadrature node that we fix decreases the optimal order by one.

lecture 26: Richardson extrapolation

3.5 Richardson extrapolation, Romberg integration

Throughout numerical analysis, one encounters procedures that
apply some simple approximation (e.g., linear interpolation) to
construct some equally simple algorithm (e.g., differentiate the
interpolant to get a finite difference formula (Section 1.7), or
integrate the interpolant to get the trapezoid rule (Section 3.2)). An
unfortunate consequence is that such approximations often converge
slowly, with errors decaying only like $h$ or $h^2$, where h is some
discretization parameter (e.g., the spacing between interpolation
points).

[Margin note: In the case of integration, you might prefer using a
higher order method, like Clenshaw–Curtis or Gaussian quadrature. What
we talk about here is an alternative to such approaches.]

In this lecture we describe a remarkable, fundamental tool of
classical numerical analysis. Like alchemists who sought to convert lead
into gold, so we will take a sequence of slowly convergent data and
extract from it a highly accurate estimate of our solution. This
procedure is Richardson extrapolation, an essential but easily overlooked
technique that should be part of every numerical analyst’s toolbox.
When applied to quadrature rules, the procedure is called Romberg
integration.
We begin in a general setting: Suppose we seek some abstract
quantity, $\xi \in \mathbb{R}$, which could be the value of a definite
integral, a derivative, the solution to a differential equation at a
certain point, or something else entirely. Further suppose we cannot
compute ξ exactly; we can only access numerical approximations to it,
generated by some function (an algorithm) Φ that depends upon a mesh
parameter h. We compute Φ(h) for several values of h, expecting that
Φ(h) → Φ(0) = ξ as h → 0. To obtain good accuracy, one naturally
seeks to evaluate Φ with increasingly smaller values of h. There are
two reasons not to do so:

• Often Φ becomes increasingly expensive to evaluate as h shrinks;

• The numerical accuracy with which we can evaluate Φ may
  deteriorate as h gets small, due to rounding errors in floating point
  arithmetic. (For an example of the latter, try computing estimates
  of $f'(\alpha)$ using the formula
  $f'(\alpha) \approx (f(\alpha+h) - f(\alpha))/h$ as h → 0.)

[Margin note: For example, computing Φ(h/2) often requires at least
twice as much work as Φ(h). In some cases, Φ(h/2) could require 4, or
even 8, times as much work as Φ(h), i.e., the expense of Φ could grow
like $1/h$ or $1/h^2$ or $1/h^3$, etc.]

Assume that Φ is infinitely continuously differentiable as a function of
h, thus allowing us to expand Φ(h) in the Taylor series

$$\Phi(h) = \Phi(0) + h\,\Phi'(0) + \tfrac{1}{2} h^2\,\Phi''(0) + \tfrac{1}{6} h^3\,\Phi'''(0) + \cdots.$$

The derivatives here may seem to complicate matters (e.g., what are
the derivatives of a quadrature rule with respect to h?), but we shall
not need to compute them: the key is that the function Φ behaves

smoothly in h. Recalling that Φ(0) = ξ, we can rewrite the Taylor
series for Φ(h) as

$$\Phi(h) = \xi + c_1 h + c_2 h^2 + c_3 h^3 + \cdots$$

for some constants $\{c_j\}_{j=1}^{\infty}$. (For example, $c_1 = \Phi'(0)$.)

[Margin note: For the sake of clarity let us discuss a concrete case,
elaborated upon in Example 3.6 below. Suppose we wish to compute
$\xi = f'(\alpha)$ using the finite difference formula
$$\Phi(h) = \frac{f(\alpha+h) - f(\alpha)}{h}.$$
The quotient rule gives
$$\Phi'(h) = \frac{f(\alpha) - f(\alpha+h)}{h^2} + \frac{f'(\alpha+h)}{h},$$
which will depend smoothly on h provided f is smooth near α. In
particular, a Taylor expansion for f gives
$$f(\alpha+h) = f(\alpha) + h f'(\alpha) + \tfrac{1}{2} h^2 f''(\alpha) + \tfrac{1}{6} h^3 f'''(\eta)$$
for some $\eta \in [\alpha, \alpha+h]$. Substitute this formula into the
equation for $\Phi'(h)$ and simplify to get
$$\Phi'(h) = \frac{f'(\alpha+h) - f'(\alpha)}{h} - \tfrac{1}{2} f''(\alpha) - \tfrac{1}{6} h f'''(\eta).$$
Now this expression leads to a clean formula for the first coefficient
of the Taylor series for Φ(h):
$$c_1 = \Phi'(0) = \lim_{h \to 0} \Phi'(h) = \tfrac{1}{2} f''(\alpha).$$
The moral of the example: while it might seem strange to take a Taylor
series of the “algorithm” Φ, the quantities involved often have a very
natural interpretation in terms of the underlying problem at hand.]

This expansion implies that taking Φ(h) as an approximation for ξ
incurs an O(h) error. Halving the parameter h should roughly halve
the error, according to the expansion

$$\Phi(h/2) = \xi + \tfrac{1}{2} c_1 h + \tfrac{1}{4} c_2 h^2 + \tfrac{1}{8} c_3 h^3 + \cdots.$$

Here comes the trick that is key to the whole lecture: Combine the
expansions for Φ(h) and Φ(h/2) in such a way that eliminates the
O(h) term. In particular, define

$$\begin{aligned}
\Psi(h) &:= 2\Phi(h/2) - \Phi(h) \\
 &= 2\big(\xi + \tfrac{1}{2} c_1 h + \tfrac{1}{4} c_2 h^2 + \tfrac{1}{8} c_3 h^3 + \cdots\big)
    - \big(\xi + c_1 h + c_2 h^2 + c_3 h^3 + \cdots\big) \\
 &= \xi - \tfrac{1}{2} c_2 h^2 - \tfrac{3}{4} c_3 h^3 + \cdots.
\end{aligned}$$

Thus, Ψ(h) also approximates ξ = Ψ(0) = Φ(0), but with an $O(h^2)$
error, rather than the O(h) error that pollutes Φ(h). For small h, this
$O(h^2)$ approximation will be considerably more accurate.
Why stop with Ψ(h)? Repeat the procedure, combining Ψ(h) and
Ψ(h/2) to eliminate the $O(h^2)$ term. Since

$$\Psi(h/2) = \xi - \tfrac{1}{8} c_2 h^2 - \tfrac{3}{32} c_3 h^3 + \cdots,$$

we have

$$\Theta(h) := \frac{4\Psi(h/2) - \Psi(h)}{3} = \xi + \tfrac{1}{8} c_3 h^3 + \cdots.$$
To compute Θ(h), we must have access to both Ψ(h) and Ψ(h/2).
These, in turn, require Φ(h), Φ(h/2), and Φ(h/4). In many cases,
Φ becomes increasingly expensive to compute as the parameter h is
reduced. Thus there is some practical limit to how small we can take
h when evaluating Φ(h).
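As a check on the coefficient in Θ(h), one can substitute the two expansions of Ψ directly. The following worked step uses only the expansions of Ψ(h) and Ψ(h/2) stated above, with higher-order terms suppressed:

```latex
\begin{aligned}
\frac{4\Psi(h/2) - \Psi(h)}{3}
  &= \frac{4\big(\xi - \tfrac{1}{8} c_2 h^2 - \tfrac{3}{32} c_3 h^3\big)
         - \big(\xi - \tfrac{1}{2} c_2 h^2 - \tfrac{3}{4} c_3 h^3\big)}{3} + \cdots \\
  &= \frac{3\xi + \big(-\tfrac{1}{2} + \tfrac{1}{2}\big) c_2 h^2
         + \big(-\tfrac{3}{8} + \tfrac{3}{4}\big) c_3 h^3}{3} + \cdots \\
  &= \xi + \tfrac{1}{8} c_3 h^3 + \cdots .
\end{aligned}
```

The weights 4 and −1 are chosen precisely so that the $h^2$ terms cancel, and the division by 3 restores the coefficient of ξ to one.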
One could continue this procedure repeatedly, each time improv-
ing the accuracy by one order, at the cost of one additional Φ com-
putation with a smaller h. To facilitate generalization and to avoid a
further tangle of Greek characters, we adopt a new notation: Define

$$R(j, 0) := \Phi(h/2^j), \qquad j \ge 0;$$

$$R(j, k) := \frac{2^k R(j, k-1) - R(j-1, k-1)}{2^k - 1}, \qquad j \ge k > 0.$$

Thus: R(0, 0) = Φ(h), R(1, 0) = Φ(h/2), and R(1, 1) = Ψ(h). This
procedure is called Richardson extrapolation after the British applied
mathematician Lewis Fry Richardson, a pioneer of the numerical
solution of partial differential equations, weather modeling, and
mathematical models in political science. The numbers R(j, k) are
arranged in a triangular extrapolation table:

$$\begin{array}{ccccc}
R(0,0) \\
R(1,0) & R(1,1) \\
R(2,0) & R(2,1) & R(2,2) \\
R(3,0) & R(3,1) & R(3,2) & R(3,3) \\
\vdots & \vdots & \vdots & \vdots & \ddots \\
\uparrow & \uparrow & \uparrow & \uparrow \\
O(h) & O(h^2) & O(h^3) & O(h^4)
\end{array}$$
To compute any given element in the table, one must first determine
entries above and to the left. Note that only the first column will
require significant work; the subsequent columns follow from easy
arithmetic. The theory suggests that the bottom-right element in
the table will be the most accurate approximation to ξ. Indeed this
bottom-right entry will generally be the most accurate, provided the
assumption that Φ is infinitely continuously differentiable holds.
When floating point roundoff errors spoil what otherwise would
have been an infinitely continuously differentiable procedure, the
bottom-right entry will suffer acutely from this pollution. Such errors
will be apparent in the forthcoming example.
Example 3.6 (Finite difference approximation of the first derivative).
We seek $\xi = f'(\alpha)$ for some continuously differentiable
function f. Recall from Section 1.7 the simple finite difference
approximation to the first derivative that follows from differentiating
the linear interpolant to f through the points x = α and x = α + h:
$$f'(\alpha) \approx \frac{f(\alpha+h) - f(\alpha)}{h}.$$
In fact, in Theorem 1.6 we quantified the error to be O(h) as h → 0:
$$f'(\alpha) = \frac{f(\alpha+h) - f(\alpha)}{h} + O(h).$$
Thus we define
$$\Phi(h) = \frac{f(\alpha+h) - f(\alpha)}{h}.$$
As a simple test problem, take $f(x) = e^x$. We will use Φ and
Richardson extrapolation to approximate $f'(1) = e = 2.7182818284\ldots$.
The simple finite difference method produces crude answers:

h        Φ(h)           error
1        4.670774270    1.95249 × 10^0
1/2      3.526814484    8.08533 × 10^-1
1/4      3.088244516    3.69963 × 10^-1
1/8      2.895480164    1.77198 × 10^-1
1/16     2.805025851    8.67440 × 10^-2
1/32     2.761200889    4.29191 × 10^-2
1/64     2.739629446    2.13476 × 10^-2
1/128    2.728927823    1.06460 × 10^-2
1/256    2.723597892    5.31606 × 10^-3
1/512    2.720938130    2.65630 × 10^-3

Even with h = 1/512 = 0.00195 . . . we fail to approximate $f'(1)$ to
even three correct digits. As we take h smaller and smaller, finite
precision arithmetic eventually causes unacceptable errors; Figure 3.13
shows the error in Φ(h) as h → 0. (The red line shows what perfect
O(h) convergence would look like.)
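The interplay between truncation error and rounding error is easy to observe numerically. The sketch below, with the hypothetical helper name `forward_difference`, evaluates Φ(h) for $f(x) = e^x$ at α = 1 over a range of h in double precision arithmetic; the particular values $h = 10^{-p}$ are an assumption made for illustration and are not the grid used in the figure.

```python
import math

def forward_difference(f, alpha, h):
    """Phi(h) = (f(alpha + h) - f(alpha)) / h."""
    return (f(alpha + h) - f(alpha)) / h

exact = math.e  # f'(1) for f(x) = e^x
errors = {}
for p in range(0, 16):
    h = 10.0 ** (-p)
    errors[p] = abs(forward_difference(math.exp, 1.0, h) - exact)
    print(f"h = 1e-{p:02d}   error = {errors[p]:.3e}")
```

The printed errors first shrink roughly in proportion to h, bottom out for h near $10^{-8}$, and then grow again as cancellation in $f(\alpha+h) - f(\alpha)$ dominates.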

[Figure 3.13: Linear convergence of the estimate Φ(h) to $f'(1)$ (blue
line); horizontal axis h, vertical axis $|f'(1) - \Phi(h)|$, both on
logarithmic scales. As h gets small, rounding errors spoil the O(h)
convergence (red line). An accuracy of about $10^{-8}$ seems to be the
best we can do for this method and this problem.]

A few steps of Richardson extrapolation on the data in the table
above reveal greatly improved solutions, with five correct digits in
R(4, 4):

j R( j, 0) R( j, 1) R( j, 2) R( j, 3) R( j, 4)
0 4.67077427047160
1 3.52681448375804 2.38285469704447
2 3.08824451601118 2.64967454826433 2.73861449867095
3 2.89548016367188 2.70271581133258 2.72039623235534 2.71779362288168
4 2.80502585140344 2.71457153913500 2.71852344840247 2.71825590783778 2.71828672683485

The good performance of this method depends on f having
sufficiently many smooth derivatives. If higher derivatives are not
smooth, then Φ(h) will not have smooth derivatives, and the accuracy
breaks down. The accuracy also eventually degrades because
of rounding errors that subtly pollute the initial column of data, as
shown in Figure 3.14.

[Figure 3.14: The convergence of Φ(h) (blue) along with its first two
Richardson refinements, Ψ(h) (green) and Θ(h) (black); horizontal axis
h, vertical axis absolute error, both on logarithmic scales. The red
lines show O(h), $O(h^2)$, and $O(h^3)$ convergence. For these values of
h, rounding errors are not apparent in the Φ(h) plot; however, they
lurk in the later digits of Φ(h), enough to interfere with the Ψ(h)
and Θ(h) approximations. Before these errors take hold, Θ(h) gives
several more orders of magnitude of accuracy than was obtained by Φ(h)
with much smaller h in Figure 3.13.]
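The extrapolation table of Example 3.6 can be reproduced with a short routine. This is a minimal sketch of the recurrence for R(j, k) given earlier in this section, for a method whose error expansion contains every power of h; the names `richardson_table` and `phi` are hypothetical.

```python
import math

def richardson_table(phi, h, levels):
    """Return the triangular table R[j][k], 0 <= k <= j <= levels, for an O(h) method."""
    R = []
    for j in range(levels + 1):
        row = [phi(h / 2 ** j)]                      # R(j, 0) = Phi(h / 2^j)
        for k in range(1, j + 1):
            # R(j, k) = (2^k R(j, k-1) - R(j-1, k-1)) / (2^k - 1)
            row.append((2 ** k * row[k - 1] - R[j - 1][k - 1]) / (2 ** k - 1))
        R.append(row)
    return R

def phi(h):
    """Forward difference estimate of f'(1) for f(x) = e^x."""
    return (math.exp(1.0 + h) - math.exp(1.0)) / h

table = richardson_table(phi, h=1.0, levels=4)
for row in table:
    print("  ".join(f"{entry:.14f}" for entry in row))
```

Only the first column calls `phi`; every other entry is cheap arithmetic on entries above and to the left, as described in the text.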

3.5.1 Extrapolation for higher order approximations


In many cases, the initial algorithm Φ(h) is better than O(h) accurate,
and in this case the formula for R(j, k) should be adjusted to take
advantage. Suppose that

$$\Phi(h) = \xi + c_1 h^r + c_2 h^{2r} + c_3 h^{3r} + \cdots$$

for some integer r ≥ 1.

[Margin note: Notice that this structure is rather special: for
example, if r = 2, then the Taylor series for Φ(h) must avoid all
odd-order terms.]

Then define

$$R(j, 0) := \Phi(h/2^j) \quad \text{for } j \ge 0,$$

$$R(j, k) := \frac{2^{rk} R(j, k-1) - R(j-1, k-1)}{2^{rk} - 1} \quad \text{for } j \ge k > 0. \tag{3.9}$$

In this case, the R(:, k) column will be $O(h^{(k+1)r})$ accurate.
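To illustrate the role of r, the sketch below applies formula (3.9) with r = 2 to the centered difference $(f(\alpha+h) - f(\alpha-h))/(2h)$. That formula is brought in here only as an assumed example of a method whose error expansion contains even powers of h alone (for $f(x) = e^x$ it equals $e \sinh(h)/h$ at α = 1); the names `richardson_table_r` and `central_difference` are hypothetical.

```python
import math

def richardson_table_r(phi, h, levels, r):
    """Triangular table from (3.9) for a method with expansion in powers of h^r."""
    R = []
    for j in range(levels + 1):
        row = [phi(h / 2 ** j)]                      # R(j, 0) = Phi(h / 2^j)
        for k in range(1, j + 1):
            factor = 2 ** (r * k)                    # 2^{rk}
            row.append((factor * row[k - 1] - R[j - 1][k - 1]) / (factor - 1))
        R.append(row)
    return R

def central_difference(h):
    """Centered estimate of f'(1) for f(x) = e^x; its error involves only even powers of h."""
    return (math.exp(1.0 + h) - math.exp(1.0 - h)) / (2.0 * h)

table = richardson_table_r(central_difference, h=1.0, levels=4, r=2)
first_column_error = abs(table[4][0] - math.e)   # plain centered difference, h = 1/16
extrapolated_error = abs(table[4][4] - math.e)   # after four extrapolation steps
```

Using r = 1 on the same data would still converge, but it would spend extrapolation steps eliminating odd-order terms that are not present.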

3.5.2 Extrapolating the composite trapezoid rule: Romberg integration

Suppose $f \in C^\infty[a, b]$, and we wish to approximate
$\int_a^b f(x)\,dx$ with the composite trapezoid rule,

$$T(h) = \frac{h}{2}\Big[ f(a) + 2 \sum_{j=1}^{n-1} f(a + jh) + f(b) \Big].$$

Notice that T(h) only makes sense (as the composite trapezoid rule)
when h = (b − a)/n for some integer n.

[Margin note: If you find this restriction on h distracting, just
define T(h) to be a sufficiently smooth interpolation between the
values of T((b − a)/n) for n = 1, 2, . . . .]

Notice that T((b − a)/n) requires n + 1 evaluations of the function f,
and so increasing n (decreasing h) increases the expense.
One can show that for any $f \in C^\infty[a, b]$,

$$T(h) = \int_a^b f(x)\,dx + c_1 h^2 + c_2 h^4 + c_3 h^6 + \cdots.$$

Now perform the generalized Richardson extrapolation (3.9) on T(h)
with r = 2:

$$R(j, 0) = T(h/2^j) \quad \text{for } j \ge 0,$$

$$R(j, k) = \frac{4^k R(j, k-1) - R(j-1, k-1)}{4^k - 1} \quad \text{for } j \ge k > 0.$$

This procedure is called Romberg integration.
In cases where f ∈ C ∞ [ a, b] (or if f has many continuous deriva-
tives), the Romberg table will converge to high accuracy, though it
may be necessary to take h to be relatively small before this is ob-
served. When f does not have many continuous derivatives, each
column of the Romberg table will still converge to the true integral,
but not at the ever-improving clip we expect for smoother functions.
This procedure’s utility is best appreciated through an example.
Example 3.7. For purposes of demonstration, we should use an
integral we know exactly, say
$$\int_0^\pi \sin(x)\,dx = 2.$$

Start the table with h = π to generate R(0, 0), requiring 2 evaluations
of f(x). To build out the table, compute the composite trapezoid
approximation based on an increasing number of function evaluations
at each step. The final entry in the first column, R(6, 0), uses
h = π/64 and hence requires 65 function evaluations, yet its error is
still about $4 \times 10^{-4}$. This may not seem particularly
impressive, but after refining these computations through a
few steps of Romberg integration, we have an approximation that is
accurate to full precision.

[Margin note: Ideally, one would exploit the fact that some grid
points used to compute T(h) are also required for T(h/2), etc., thus
limiting the number of new function evaluations required at each step.]
j R( j, 0) R( j, 1) R( j, 2) R( j, 3) R( j, 4) R( j, 5) R( j, 6)
0 0.000000000000
1 1.570796326795 2.094395102393
2 1.896118897937 2.004559754984 1.998570731824
3 1.974231601946 2.000269169948 1.999983130946 2.000005549980
4 1.993570343772 2.000016591048 1.999999752455 2.000000016288 1.999999994587
5 1.998393360970 2.000001033369 1.999999996191 2.000000000060 1.999999999996 2.000000000001
6 1.999598388640 2.000000064530 1.999999999941 2.000000000000 2.000000000000 2.000000000000 2.000000000000

Be warned that Romberg results are not always as clean as this
example, but this procedure is an important tool to have at hand when
high precision integrals are required. The general strategy of
Richardson extrapolation can be applied to great effect in a wide
variety of numerical settings.
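The Romberg table of Example 3.7 can be generated with a few lines of code. This is a minimal sketch with hypothetical function names (`composite_trapezoid`, `romberg`); for simplicity it recomputes every function value at each level instead of reusing the grid points shared between T(h) and T(h/2).

```python
import math

def composite_trapezoid(f, a, b, n):
    """T(h) with h = (b - a)/n, using n + 1 function evaluations."""
    h = (b - a) / n
    interior = sum(f(a + j * h) for j in range(1, n))
    return 0.5 * h * (f(a) + 2.0 * interior + f(b))

def romberg(f, a, b, levels):
    """Triangular Romberg table R[j][k], starting from h = b - a."""
    R = []
    for j in range(levels + 1):
        row = [composite_trapezoid(f, a, b, 2 ** j)]   # R(j, 0) = T((b - a)/2^j)
        for k in range(1, j + 1):
            # R(j, k) = (4^k R(j, k-1) - R(j-1, k-1)) / (4^k - 1)
            row.append((4 ** k * row[k - 1] - R[j - 1][k - 1]) / (4 ** k - 1))
        R.append(row)
    return R

table = romberg(math.sin, 0.0, math.pi, levels=6)
for row in table:
    print("  ".join(f"{entry:.12f}" for entry in row))
```

A production version would halve h by adding only the new midpoints to the previous trapezoid sum, which roughly halves the total number of evaluations of f.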
4 Nonlinear Equations

lecture 27: Introduction to Nonlinear Equations


lecture 28: Bracketing Algorithms

The solution of nonlinear equations has been a motivating
challenge throughout the history of numerical analysis. We opened
these notes with mention of Kepler’s equation, M = E − e sin E,
where M and e are known, and E is sought. One can view this E as
the solution of the equation f(E) = 0, where

$$f(x) = M - x + e \sin(x).$$

This is a simple equation in one variable, $x \in \mathbb{R}$, and it
turns out that it is not particularly difficult to solve. Other
examples are complicated by nastier nonlinearities, multiple solutions
(in which case one might like to find them all), ill-conditioned zeros
(where f(x) ≈ 0 for x far from the true zeros of f), solutions in the
complex plane, and expensive f evaluations.
Optimization is closely allied to the solution of nonlinear
equations, since one finds extrema of $F : \mathbb{R} \to \mathbb{R}$ by
solving

$$F'(x) = 0.$$

When $F : \mathbb{R}^n \to \mathbb{R}$, one seeks
$\mathbf{x} \in \mathbb{R}^n$ that solves

$$\nabla F(\mathbf{x}) = \mathbf{0},$$

a nonlinear system of n equations in n variables. Handling n > 1
variables leads to more sophisticated theory and algorithms.
Optimization problems become more difficult when additional constraints
are imposed upon $\mathbf{x} \in \mathbb{R}^n$,

$$\min_{\mathbf{x} \in S} F(\mathbf{x}),$$

where S is the set of feasible solutions (e.g., vectors with nonnegative
entries). When F is linear in x and x is constrained by simple
inequalities, one has the important field of linear programming. When
the set S constrains x to take discrete values (e.g., integers), the
optimization problem becomes exceptionally difficult. Such combinatorial
optimization or integer programming problems connect closely to the
study of NP-hard problems in computer science.

In this course, we only have time to look at a few of the simplest
algorithms for solving nonlinear equations. These will give a brief
introduction to some of the overarching themes in this important
field. Further study is highly recommended!

4.1 Bracketing Algorithms for Root Finding

Given a function $f : \mathbb{R} \to \mathbb{R}$, we seek a point
$x_* \in \mathbb{R}$ such that $f(x_*) = 0$. This $x_*$ is called a root
of the equation f(x) = 0, or simply a zero of f. At first, we only
require that f be continuous on an interval [a, b] of the real line,
f ∈ C[a, b], and that this interval contains the root of interest. The
function f could have many different roots; we will only look for one.
In practice, f could be quite complicated (e.g., evaluation of a
parameter-dependent integral or differential equation) and expensive
to evaluate (e.g., requiring minutes, hours, . . . ), so we seek
algorithms that will produce a solution that is accurate to
high precision while keeping evaluations of f to a minimum.

[Margin note: In some applications, like polynomial root-finding, we
know f has multiple zeros, and we might want to find all of them. This
particular case is complicated by the fact that the zeros could be
located in the complex plane, $x_* \in \mathbb{C}$.]

The first algorithms we study require the user to specify a finite
interval $[a_0, b_0]$, called a bracket, such that $f(a_0)$ and $f(b_0)$
differ in sign, $f(a_0) f(b_0) < 0$. Since f is continuous, the
intermediate value theorem guarantees that f has at least one root
$x_*$ in the bracket, $x_* \in (a_0, b_0)$.

4.1.1 Bisection
Given a bracket [a, b] for which f takes opposite sign at a and b, the
simplest technique for finding $x_*$ is the bisection algorithm:

For k = 0, 1, 2, . . .
1. Compute $f(c_k)$ for $c_k = \tfrac{1}{2}(a_k + b_k)$.
2. If $f(c_k) = 0$, exit; otherwise, repeat with
   $$[a_{k+1}, b_{k+1}] := \begin{cases} [a_k, c_k], & \text{if } f(a_k) f(c_k) < 0; \\ [c_k, b_k], & \text{if } f(c_k) f(b_k) < 0. \end{cases}$$
3. Stop when the interval width $b_{k+1} - a_{k+1}$ is sufficiently
   small, or if $f(c_k) = 0$.

How does this method converge? Not bad for such a simple idea. At
the kth stage, there must be a root in the interval $[a_k, b_k]$. Take
$c_k = \tfrac{1}{2}(a_k + b_k)$ as the next estimate to $x_*$, giving the
error $e_k = c_k - x_*$. The worst possible error would occur if $x_*$
was close to $a_k$ or $b_k$, half the bracket’s width away from $c_k$,
and hence
$$|e_k| = |c_k - x_*| \le \tfrac{1}{2}(b_k - a_k) = 2^{-k-1}(b_0 - a_0).$$

Theorem 4.1. Given a bracket $[a, b] \subset \mathbb{R}$ for which
$f(a) f(b) < 0$, there exists $x_* \in [a, b]$ such that $f(x_*) = 0$ and
the bisection point $c_k$ satisfies
$$|c_k - x_*| \le \frac{b - a}{2^{k+1}}.$$

We say this iteration converges linearly (the log of the error is bounded
by a straight line when plotted against iteration count – see the next
example) with rate ρ = 1/2. Practically, this means that the error
bound is cut in half at each iteration, independent of the behavior of
f. Reduction of the initial bracket width by ten orders of magnitude
would require roughly $\log_2 10^{10} \approx 33$ iterations. If f is
fast to evaluate, this convergence will be pretty quick; moreover,
since the algorithm only relies on our ability to compute the sign of
f(x) accurately, the algorithm is robust to strange behavior in f (such
as local minima).

Example 4.1. Kepler’s equation is among the most historically
important nonlinear equations, playing a pivotal role in two-body
celestial mechanics. Given orbital parameters e ∈ [0, 1) (the
eccentricity of the elliptical orbit) and M ∈ [0, 2π] (the mean
anomaly), find E ∈ [0, 2π] (the eccentric anomaly) such that

$$M = E - e \sin E.$$

Cast this as the root-finding problem f(E) = 0, where

$$f(E) = M - E + e \sin E.$$

In this example we set e = 0.8 and M = 3π/4, yielding the function
shown in the margin. Judging from this plot, the desired root $E_*$
falls in the interval [2, 3]. Using the initial bracket [a, b] = [2, 3],
the bisection method converges as steadily as expected, cutting the
error in half at every step. Figure 4.2 shows the convergence to the
exact root $E_* = 2.69889638445749738544\ldots$.

[Margin figure: plot of f(E) for E ∈ [0, 6].]
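The bisection algorithm applied to this instance of Kepler’s equation can be sketched as follows. The function names `bisection` and `kepler`, the tolerance on the bracket width, and the iteration cap are all assumptions chosen for illustration.

```python
import math

def bisection(f, a, b, tol=1e-12, max_iter=200):
    """Return (estimate, iterations); stops once the bracket width is below tol."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must differ in sign")
    for k in range(max_iter):
        c = 0.5 * (a + b)                 # midpoint of the current bracket
        fc = f(c)
        if fc == 0.0:                     # exact root found
            return c, k + 1
        if fa * fc < 0:                   # sign change in [a, c]
            b, fb = c, fc
        else:                             # sign change in [c, b]
            a, fa = c, fc
        if b - a < tol:
            break
    return 0.5 * (a + b), k + 1

ecc, M = 0.8, 3.0 * math.pi / 4.0         # eccentricity and mean anomaly

def kepler(E):
    """f(E) = M - E + e sin E for the parameters of Example 4.1."""
    return M - E + ecc * math.sin(E)

root, iterations = bisection(kepler, 2.0, 3.0)
print(root, iterations)
```

Only the sign of each function value is used to update the bracket, which is what makes the method so robust; the price is that roughly 40 halvings are needed to shrink a bracket of width 1 below $10^{-12}$.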

4.1.2 Regula Falsi


[Figure 4.1: The convergence of the bisection method for solving
Kepler’s equation (Example 4.1) with initial bracket [a, b] = [2, 3];
horizontal axis k, vertical axis error $|c_k - E_*|$ on a logarithmic
scale. Use the midpoint $c_k$ of the kth bracket as an estimate of the
root $E_*$. The dashed line shows the error bound
$|c_k - E_*| \le (b - a)/2^{k+1}$.]

A simple adjustment to bisection can often yield much quicker
convergence. The name of the resulting algorithm, regula falsi (literally
‘false rule’) hints at the technique. As with bisection, begin with an
interval $[a_0, b_0] \subset \mathbb{R}$ such that $f(a_0) f(b_0) < 0$.
The goal is to be more sophisticated about the choice of the root
estimate $c_k \in (a_k, b_k)$. Instead of simply choosing the middle
point of the bracket as in bisection, we approximate f with the line
$p_k \in \mathcal{P}_1$ that interpolates $(a_k, f(a_k))$ and
$(b_k, f(b_k))$, so that $p_k(a_k) = f(a_k)$ and $p_k(b_k) = f(b_k)$.
This unique polynomial is given (in the Newton form) by

$$p_k(x) = f(a_k) + \frac{f(b_k) - f(a_k)}{b_k - a_k}\,(x - a_k).$$

Now approximate the zero of f in $[a_k, b_k]$ by the zero of the linear
model $p_k$:

$$c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}.$$
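For completeness, the formula for $c_k$ follows from setting the Newton form of $p_k$ to zero and solving for x; the short derivation below uses only the definition of $p_k$ given above.

```latex
\begin{aligned}
0 = p_k(c_k) &= f(a_k) + \frac{f(b_k) - f(a_k)}{b_k - a_k}\,(c_k - a_k) \\
\Longrightarrow \quad
c_k &= a_k - f(a_k)\,\frac{b_k - a_k}{f(b_k) - f(a_k)} \\
    &= \frac{a_k \big(f(b_k) - f(a_k)\big) - f(a_k)(b_k - a_k)}{f(b_k) - f(a_k)}
     = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)} .
\end{aligned}
```

The denominator cannot vanish, since $f(a_k)$ and $f(b_k)$ have opposite signs.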
The algorithm then takes the following form:

For k = 0, 1, 2, . . .
1. Compute $f(c_k)$ for $c_k = \dfrac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}$.
2. If $f(c_k) = 0$, exit; otherwise, repeat with
   $$[a_{k+1}, b_{k+1}] := \begin{cases} [a_k, c_k], & \text{if } f(a_k) f(c_k) < 0; \\ [c_k, b_k], & \text{if } f(c_k) f(b_k) < 0. \end{cases}$$
3. Stop when $f(c_k)$ is sufficiently small, or the maximum
   number of iterations is exceeded.

Apart from the formula for $c_k$ in Step 1, only the stopping criterion
in Step 3 differs significantly from the bisection method. Bisection
forces the bracket width $b_k - a_k$ to zero as it homes in on the
root. In contrast, there is no mechanism in the regula falsi algorithm
to drive the bracket width to zero: it will still always
converge (in exact arithmetic) even though the bracket length does
not typically decrease to zero. Analysis of regula falsi is more
complicated than the trivial bisection analysis; we give a convergence
proof only for a special case. We will use the opportunity to introduce
the notion of convex functions, a fundamental idea in optimization
theory, especially for problems in higher dimensions.

Definition 4.1. Let [a, b] be a finite interval of $\mathbb{R}$. Then a
function f ∈ C[a, b] is convex provided that for all x, y ∈ [a, b] and
λ ∈ [0, 1],

$$f\big(\lambda x + (1 - \lambda) y\big) \le \lambda f(x) + (1 - \lambda) f(y).$$

At first sight convexity might seem to be an abstract concept.
However, an easy sufficient condition will be more familiar.

Lemma 4.1. If $f \in C^2[a, b]$ and $f''(x) \ge 0$ for all x ∈ [a, b],
then f is convex.

Proof. (See Rockafellar, Convex Analysis, Theorem 4.4.)


First note that $f'$ is monotonically increasing on [a, b]. To see this,
use the Fundamental Theorem of Calculus to write, for any ξ ∈ [a, b],
$$f'(\xi) - f'(a) = \int_a^{\xi} f''(s)\,ds.$$

The integrand on the right is nonnegative, so as ξ increases, the value
of the integral cannot decrease. Since $f'(a)$ is a constant,
$$f'(\xi) = f'(a) + \int_a^{\xi} f''(s)\,ds$$
is a monotonically increasing function of ξ ∈ [a, b].

Now fix λ ∈ [0, 1] and z := λx + (1 − λ)y; without loss of generality
take x ≤ y, so that x ≤ z ≤ y. [Margin note: Such a z is a convex
combination of x and y.] Again use the Fundamental Theorem of Calculus
to write
$$f(z) - f(x) = \int_x^z f'(s)\,ds.$$

Now use monotonicity of $f'$ to get the upper bound
$$f(z) = f(x) + \int_x^z f'(s)\,ds \le f(x) + f'(z)(z - x). \tag{4.1}$$

Another upper bound follows similarly. Write
$$f(y) - f(z) = \int_z^y f'(s)\,ds$$
and again use monotonicity of the derivative (with z ≤ y) to obtain
$$f(z) = f(y) - \int_z^y f'(s)\,ds \le f(y) - f'(z)(y - z). \tag{4.2}$$


Now multiply (4.1) by λ and (4.2) by 1 − λ and add to obtain:

$$f(z) = \lambda f(z) + (1 - \lambda) f(z)
  \le \lambda f(x) + (1 - \lambda) f(y) + \lambda f'(z)(z - x) - (1 - \lambda) f'(z)(y - z). \tag{4.3}$$

Notice that

$$\lambda (z - x) - (1 - \lambda)(y - z) = -\lambda x - (1 - \lambda) y + z = 0,$$

and hence (4.3) reduces to

$$f(z) \le \lambda f(x) + (1 - \lambda) f(y).$$

Thus $f''(x) \ge 0$ for all x ∈ [a, b] proves that f is convex.

Convexity implies that the linear interpolant will always be located
above the function, as sketched on the right. The linear interpolant to
f at x = a and x = b is

$$p(x) = f(a) + \frac{f(b) - f(a)}{b - a}\,(x - a).$$

[Margin figure: sketch of a convex function f on [a, b] lying below its
linear interpolant p, with a point z marked between a and b.]

Any z ∈ [a, b] can be written as z = λa + (1 − λ)b for some λ ∈ [0, 1],
so

$$\begin{aligned}
p(z) &= f(a) + \frac{f(b) - f(a)}{b - a}\,\big(\lambda a + (1 - \lambda) b - a\big) \\
     &= f(a) + \frac{f(b) - f(a)}{b - a}\,(1 - \lambda)(b - a) \\
     &= \lambda f(a) + (1 - \lambda) f(b) \\
     &\ge f\big(\lambda a + (1 - \lambda) b\big) \\
     &= f(z),
\end{aligned}$$

where the inequality follows from convexity of f.
This observation plays a key role in the next proof, which guarantees
convergence of regula falsi for convex functions.

Theorem 4.2. Suppose f is a convex function on $[a_0, b_0]$ with
$a_0 < b_0$ and $f(a_0) < 0 < f(b_0)$. Then regula falsi converges:
either $f(c_k) = 0$ for some k ≥ 0, or $c_k \to x_* \in [a_0, b_0]$ with
$f(x_*) = 0$.

Proof. (See Stoer & Bulirsch, Introduction to Numerical Analysis, 2nd
ed., §5.9.)
The argument preceding the proof ensures that the linear interpolant
$p_0$ to f at $a_0$ and $b_0$ satisfies $p_0(x) \ge f(x)$ for all
$x \in [a_0, b_0]$. Since $c_0$ is selected so that $p_0(c_0) = 0$, it
follows that $f(c_0) \le 0$, so the new bracket will be
$[a_1, b_1] = [c_0, b_0]$.

[Margin figure: sketch of f on $[a_0, b_0]$ with $f(a_0) < 0 < f(b_0)$,
showing $c_0 = a_1$ and $b_0 = b_1$.]

[Figure 4.2: The convergence of the bisection and regula falsi methods
for solving Kepler’s equation (Example 4.1) with initial bracket
[a, b] = [2, 3]; horizontal axis k, vertical axis error $|c_k - E_*|$ on
a logarithmic scale. The root $c_k$ in regula falsi converges to $E_*$
much more rapidly than the midpoint $c_k$ of the bisection brackets.]

If $f(c_0) = 0$, the method has exactly found a root, and stops;
otherwise,
$$f(a_1) = f(c_0) < 0 < f(b_0) = f(b_1).$$
Now use the convexity of f on the subinterval $[a_1, b_1]$ to iterate
this argument, first showing that $[a_2, b_2] = [c_1, b_1]$, and then, in
general,

$$[a_{k+1}, b_{k+1}] = [c_k, b_k].$$

Notice that $c_k > a_k = c_{k-1}$. If the algorithm never finds an exact
root, it forms a sequence of estimates $\{c_k\}$ that is monotonically
increasing and bounded above by $b_0$. The monotone convergence theorem
in real analysis ensures that any bounded monotone sequence of real
numbers must converge to a limit. [Margin note: For a proof, see Rudin,
Principles of Mathematical Analysis, Theorem 3.14.] Thus,
$\lim_{k \to \infty} c_k = \gamma$ with $f(\gamma) \le 0$, and this γ must
be a fixed point of the regula falsi iteration, i.e.,

$$\gamma = \frac{\gamma f(b_0) - b_0 f(\gamma)}{f(b_0) - f(\gamma)}.$$

Rearrange this expression to get $(\gamma - b_0) f(\gamma) = 0$. Since
$f(\gamma) \le 0 < f(b_0)$, conclude that $\gamma \ne b_0$, and hence
$f(\gamma) = 0$. Thus, regula falsi converges for convex functions.
Example 4.2. Now apply the regula falsi method to the same version
of Kepler’s equation we solved with bisection in Example 4.1. Figure 4.2
compares the convergence of regula falsi to bisection, both with
the same initial bracket [2, 3]. About 9 steps of regula falsi deliver the
same accuracy obtained by more than 40 steps of bisection. For nice
problems like this one, regula falsi is much more efficient.
Is regula falsi always superior to bisection? If f has a zero, one
can always rig a bracket around it so that the zero x∗ is exactly at its
midpoint, x∗ = ½(a0 + b0), giving convergence of bisection in a
single iteration. For most such functions, the first regula falsi iterate
is different, and not a root of our function. Thus bisection can beat
regula falsi, but such examples seem contrived, depending essentially
on a lucky bracket. Can one construct more compelling examples?

Example 4.3. Lest Example 4.2 suggest that regula falsi is always su-
perior to bisection, we now consider a function for which regula falsi
converges very slowly. Sketch out a few sample functions. You will
soon see how to design an f such that the root ck of the linear ap-
proximation converges slowly toward x∗ as k increases. The function
should be relatively flat and small in magnitude in a large region
near the root. One such example is

$$ f(x) = \operatorname{sign}\big(\tan^{-1}(x)\big)\,\Big|\frac{2}{\pi}\tan^{-1}(x)\Big|^{1/20} + \frac{19}{20}, $$

which has a single root at x∗ ≈ −0.6312881 . . . . This function is illus-
trated in the margin. Figure 4.3 compares convergence of bisection
and regula falsi for this f with the initial bracket [−10, 10]. The small
value of f at the left end of the bracket ensures that [a1, b1] = [c0, b]
will be almost as large as the initial bracket [a, b].
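The comparison in Example 4.3 can be tried out with a minimal Python sketch. It is not taken from the notes: the helper name `bracket_iterates` and the budget of 20 steps are illustrative assumptions, and the reference root comes from solving f(x) = 0 by hand, which gives x∗ = −tan((π/2)(19/20)²⁰).

```python
import math

def f(x):
    """Example 4.3: sign(atan x) * |(2/pi) atan x|^(1/20) + 19/20."""
    u = (2.0 / math.pi) * math.atan(x)
    return math.copysign(abs(u) ** (1.0 / 20.0), u) + 19.0 / 20.0

def bracket_iterates(func, a, b, steps, rule):
    """Run `steps` bracketing iterations and return the estimates c_k.

    rule="bisection" takes the midpoint of [a, b]; rule="regula_falsi"
    takes the root of the line through (a, f(a)) and (b, f(b)).
    """
    fa, fb = func(a), func(b)
    assert fa < 0.0 < fb, "expects f(a) < 0 < f(b)"
    history = []
    for _ in range(steps):
        if rule == "bisection":
            c = 0.5 * (a + b)
        else:
            c = (a * fb - b * fa) / (fb - fa)
        fc = func(c)
        history.append(c)
        if fc == 0.0:          # landed exactly on a root
            break
        if fc < 0.0:           # root lies in [c, b]
            a, fa = c, fc
        else:                  # root lies in [a, c]
            b, fb = c, fc
    return history

# Reference root, obtained by solving f(x) = 0 by hand.
x_star = -math.tan(0.5 * math.pi * 0.95 ** 20)

bis = bracket_iterates(f, -10.0, 10.0, 20, "bisection")
rf = bracket_iterates(f, -10.0, 10.0, 20, "regula_falsi")
print("x* =", x_star)
print("bisection error after 20 steps:   ", abs(bis[-1] - x_star))
print("regula falsi error after 20 steps:", abs(rf[-1] - x_star))
```

Because f(−10) is tiny compared with f(10), each regula falsi step moves the left endpoint only a short distance, so after the same number of steps its estimate is still far from x∗ while bisection has already reduced the bracket by a factor of 2²⁰.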

Figure 4.3: The convergence of the bisection and regula falsi methods for
the function in Example 4.3 with initial bracket [a, b] = [−10, 10]. The root ck
in regula falsi converges very slowly to the right as k increases, due to the small
value of f on the left end of the bracket. In contrast, bisection merrily cuts the
error in half at each step, affected only by the sign of f, not its magnitude.
(Plot: error |ck − x∗| versus k, logarithmic vertical axis.)

4.1.3 Accuracy
Here we have assumed that we calculate f(x) to perfect accuracy,
an unrealistic expectation on a computer. If we attempt to compute
x∗ to very high accuracy, we will eventually experience errors due
to inaccuracies in our function f(x). For example, f(x) may come
from approximating the solution to a differential equation, where there
is some approximation error we must be concerned about; more
generally, the accuracy of f will be limited by the computer’s floating
point arithmetic. One must also be cautious of subtracting one like
quantity from another (as in construction of ck in both algorithms),
which can give rise to catastrophic cancellation.

4.1.4 Conditioning
When |f′(x0)| ≫ 0, the desired root is easy to pick out. In cases
where f′(x0) ≈ 0, the root will be ill-conditioned, and it can be difficult
to locate. This is the case, for example, when x0 is a multiple root of
f. (You may find it strange that the more copies of a root you have,
the more difficult it can be to compute it!) In such cases bisection has
the advantage that it only depends on the sign of f.
[Margin figure: A well-conditioned root.]

4.1.5 Deflation
What is one to do if multiple distinct roots are required? One ap-
proach is to choose a new initial bracket that omits all known roots.
Another technique, though numerically fragile, is to work with
f̂(x) := f(x)/(x − x0), where x0 is the previously computed root.
[Margin figure: An ill-conditioned root.]
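As a small illustration of deflation, the following Python sketch finds both roots of f(x) = x² − 2. The test function, the brackets, and the helper `bisect` are assumptions made for this example, not part of the notes. The second bracket deliberately contains both roots of f; the deflated function changes sign only at the root not yet found.

```python
import math

def bisect(func, a, b, steps=60):
    """Plain bisection on a bracket with func(a) < 0 < func(b)."""
    for _ in range(steps):
        c = 0.5 * (a + b)
        if func(c) < 0.0:
            a = c
        else:
            b = c
    return 0.5 * (a + b)

f = lambda x: x * x - 2.0

# First root: the bracket [1, 2] satisfies f(1) < 0 < f(2).
root1 = bisect(f, 1.0, 2.0)

# Deflated function f_hat(x) = f(x) / (x - root1); here it behaves like x + sqrt(2).
f_hat = lambda x: f(x) / (x - root1)

# The bracket [-3, 3] contains both roots of f, but f_hat(-3) < 0 < f_hat(3)
# and f_hat vanishes only at the second root.
root2 = bisect(f_hat, -3.0, 3.0)

print(root1, root2)
```

The fragility mentioned above is visible in `f_hat`: it divides by x − root1, so an evaluation at (or extremely near) the computed root would fail or lose accuracy. In this run the midpoints never come close to root1.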

lecture 29: Newton’s Method


We have studied two bracketing methods for finding zeros of a func-
tion, bisection and regula falsi. These methods have certain virtues
(most importantly, they always converge), but some exploratory
evaluations of f might be necessary to determine an initial bracket.
Moreover, while these methods appear to converge fairly quickly,
when f is expensive to compute (e.g., requiring solution of a differen-
tial equation) or many systems must be solved (e.g., to evaluate √A
as a zero of f(x) = x² − A for many values of A), every objective
function evaluation counts. The next few sections describe algorithms
that can converge more quickly than bracketing methods, provided
we have a sufficiently good initial estimate of the root.

4.2 Newton’s method

We begin with Newton’s method, the most celebrated root-finding
algorithm. The algorithm’s motivation resembles regula falsi: model
f with a line, and estimate the root of f by the root of that line. In
regula falsi, this line interpolated the function values at either end
of the root bracket. Newton’s method is based purely on local in-
formation at the current solution estimate, xk. Whereas the bracket-
ing methods only required that f be continuous, Newton’s method
needs f ∈ C¹(ℝ); to analyze the algorithm, we will further require
f ∈ C²(ℝ). This latter condition means that f can be expanded in a
Taylor series centered at the approximate root xk:

$$ f(x_*) = f(x_k) + f'(x_k)(x_* - x_k) + \tfrac{1}{2} f''(\xi)(x_* - x_k)^2, \tag{4.4} $$

where x∗ is the exact solution, f(x∗) = 0, and ξ is between xk and
x∗. Ignore the error term in this series, and you have a linear model
for f; i.e., f′(xk) is the slope of the line tangent to f at the point xk.
Specifically,

$$ 0 = f(x_*) \approx f(x_k) + f'(x_k)(x_* - x_k), $$

which implies

$$ x_* \approx x_k - \frac{f(x_k)}{f'(x_k)}. $$

We get an iterative method by replacing x∗ in this formula with xk+1.

Newton’s method updates the approximate root xk via

$$ x_{k+1} := x_k - \frac{f(x_k)}{f'(x_k)}. \tag{4.5} $$
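The update (4.5) translates almost directly into code. The following Python sketch is one possible rendering, not the notes' own implementation; the stopping rule on |f(xk)|, the iteration cap, and the function name `newton` are assumptions made for illustration.

```python
import math

def newton(f, fprime, x0, tol=1e-14, max_iter=50):
    """Newton iteration (4.5); returns the list of iterates x_0, x_1, ...

    Stops when |f(x_k)| <= tol or after max_iter steps. Both the stopping
    rule and the parameter names are illustrative assumptions.
    """
    xs = [x0]
    for _ in range(max_iter):
        fx = f(xs[-1])
        if abs(fx) <= tol:
            break
        dfx = fprime(xs[-1])
        if dfx == 0.0:
            raise ZeroDivisionError("f'(x_k) = 0: Newton step undefined")
        xs.append(xs[-1] - fx / dfx)
    return xs

# f(x) = x^2 - 2 with a reasonable starting guess.
xs = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.25)
for k, x in enumerate(xs):
    print(k, x, abs(x - math.sqrt(2.0)))
```

Note that the routine needs both f and its derivative, and that nothing in it guards against divergence from a poor x0.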

Newton’s method is famous because it often converges very quickly,


but that excellent convergence comes at no small cost. For a bad
starting guess x0 , it can diverge entirely. When it converges, the root
it finds can, in some circumstances, depend sensitively on the initial
guess: this is a famous source of beautiful fractals. However, for a
good x0 , the convergence is usually lightning quick. Let ek = xk −
x∗ be the error at the kth step. Subtract x∗ from both sides of the
iteration (4.5) to obtain a recurrence for the error,

$$ e_{k+1} = e_k - \frac{f(x_k)}{f'(x_k)}. \tag{4.6} $$

The Taylor expansion of f(x∗) about the point xk given in (4.4) gives

$$ 0 = f(x_k) - f'(x_k) e_k + \tfrac{1}{2} f''(\xi) e_k^2. $$

Solving this equation for f(xk) and substituting that formula into the
expression (4.6) for ek+1 gives

$$ e_{k+1} = e_k - \frac{f'(x_k) e_k - \tfrac{1}{2} f''(\xi) e_k^2}{f'(x_k)} = \frac{f''(\xi)}{2 f'(x_k)}\, e_k^2. $$

[Margin figure: Newton’s method for finding zeros of f(x) = x⁷ − 1 in the
complex plane. This function has seven zeros in the complex plane, the
seventh roots of unity (shown as white dots in the plot). The color encodes
the convergence of Newton’s method: for each point x0 ∈ ℂ, iterate Newton’s
method until convergence. The color corresponds to the root to which x0
converged. In some regions, small changes to x0 get attracted to vastly
different roots.]

When xk converges to x∗, ξ ∈ [x∗, xk] also converges to x∗. Suppos-
ing that x∗ is a simple root, so that f′(x∗) ≠ 0, the above analysis
suggests that when xk is near x∗,

$$ |e_{k+1}| \le C |e_k|^2 $$

for constant

$$ C = \frac{|f''(x_*)|}{2 |f'(x_*)|} $$

independent of k. Thus we say that if f′(x∗) ≠ 0, then Newton’s
method converges quadratically, roughly meaning that each iteration
doubles the number of correct digits.
Compare this to bisection, where

$$ |e_{k+1}| \le \tfrac{1}{2} |e_k|, $$

meaning that the error is halved at each step. Significantly, New-
ton’s method will often exhibit a transient period of linear conver-
gence as xk is gradually improved; once xk gets close enough to x∗,
the behavior transitions to quadratic convergence and full machine
precision is attained in just a couple more iterations.
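The quadratic error relation can be checked numerically. The sketch below is an illustration under stated assumptions: it uses f(x) = x² − 2, the starting guess 1.25, and Python's `decimal` module with 120 digits so that the errors remain resolvable after several steps. It prints the ratios |ek+1|/|ek|², which should settle near C = |f″(x∗)|/(2|f′(x∗)|) = 1/(2√2).

```python
from decimal import Decimal, getcontext

getcontext().prec = 120          # work with about 120 significant digits

two = Decimal(2)
x_star = two.sqrt()              # reference root of f(x) = x^2 - 2

x = Decimal("1.25")              # illustrative starting guess
errors = [abs(x - x_star)]
for _ in range(5):
    x = x - (x * x - two) / (2 * x)   # Newton step (4.5)
    errors.append(abs(x - x_star))

# Ratios |e_{k+1}| / |e_k|^2 should approach C = |f''(x*)| / (2 |f'(x*)|).
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(5)]
C = 1 / (2 * x_star)             # f'' = 2 and f'(x*) = 2 sqrt(2)
for k, r in enumerate(ratios):
    print(k, f"{errors[k + 1]:.3e}", f"{r:.10f}")
print("predicted C =", f"{C:.10f}")
```

For this particular f the ratio equals 1/(2xk) exactly, so it converges to C as fast as xk converges to √2.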

Example 4.4. One way to compute √2 is to find the zero of

f(x) = x² − 2.

Figure 4.4 shows convergence in just a few steps for the realistic start-
ing guess x0 = 1.25. The plot also shows the convergence behavior
for the (entirely ridiculous) starting guess x0 = 1000, to illustrate a
linear phase of convergence as the iterate gradually approaches the
region of quadratic convergence. Once xk is sufficiently close to x∗,
convergence proceeds very quickly.

Figure 4.4: The convergence of Newton’s method for f(x) = x² − 2. If x0 is
sufficiently close to the zero x∗ = √2, the method converges very rapidly. For
a poor x0, the convergence is slow until xk is sufficiently close. (The situation
could be worse: in many cases, bad initial guesses lead to the divergence of
Newton’s method.) (Plot: error |xk − x∗| versus k.)
Table 4.1 shows the iterates for x0 = 1000, computed in exact arith-
metic in Mathematica, and displayed here to more than eighty digits.
This is a bit excessive: in the floating point arithmetic we have used
all semester, we can only expect to get 15 or 16 digits of accuracy in
the best case. It is worth looking at all these digits to get a better ap-
preciation of the quadratic convergence. Once we are in the quadratic
regime, notice the characteristic doubling of the number of correct
digits (in blue and underlined) at each iteration.

k xk

0 1000.00000000000000000000000000000000000000000000000000000000000000000000000000000000000
1 500.00100000000000000000000000000000000000000000000000000000000000000000000000000000000
2 250.00249999600000799998400003199993600012799974400051199897600204799590400819198361603
3 125.00524995800046799458406305526512856598014823595622393441695800477446685799463896484
4 62.51062464301703314888691358403320464529759944325744566631164600631017391478309761341
5 31.27130960206219455596422358771700548374565801842332086536365236578278080406153827364
6 15.66763299486836640030755527100281652065100159710324459452581543767403479921834012248
7 7.89764234785635806719051360934236238116968365174167025116461034160777628217364960111
8 4.07544124051949892088798573387067133352991149961309267159333980191548308075360961862
9 2.28309282439255383986306690358177946144339233634377781606055538481637200759555376236
10 1.57954875240601536527547001727498935127463981776389016188975791363939586265860323251
11 1.42286657957866825091209683856309818309310929428763928162890934673847036238184992693
12 1.41423987359153062319364616441120035182529489347860126716395746896392690040774558375
13 1.41421356261784851265589000359174396632207628548968908242398944391615436335625360056
14 1.41421356237309504882286807775717118221418114729423116637254804377031332440406155716
15 1.41421356237309504880168872420969807856983046705949994860439640079460765093858305190
16 1.41421356237309504880168872420969807856967187537694807317667973799073247846210704774
exact 1.41421356237309504880168872420969807856967187537694807317667973799073247846210703885038753 . . .

Table 4.1: Convergence of Newton’s method for f(x) = x² − 2 for a poor
initial guess. The first 9 iterations each roughly cut the error in half; later
iterations then start to double the correct digits (blue, underlined).

lecture 30: Convergence of Newton’s Method via Direct Iteration

4.3 Direct iteration

We have already performed a simple analysis of Newton’s method
to gain an appreciation for the quadratic convergence rate. For a
broader perspective, we shall now put Newton’s method into a more
general framework, so that the accompanying analysis would allow
us to understand simpler iterations like the ‘constant slope method’:

$$ x_{k+1} = x_k - \alpha f(x_k) $$

for some constant α (which could approximate 1/f′(x∗), for exam-
ple). We begin by formalizing the notion of the rate of convergence.
Definition 4.2. A root-finding algorithm is pth-order convergent if

$$ |e_{k+1}| \le C |e_k|^p $$

for some p ≥ 1 and positive constant C.

• If p = 1, then C < 1 is necessary for convergence, and C is called
the linear convergence rate.

• If p > 1, the constant C is called the asymptotic error constant.

In the last section, our analysis showed that, if f′(x∗) ≠ 0 and
x0 is sufficiently close to x∗, then Newton’s method is second-order
convergent with asymptotic error constant

$$ C = \frac{|f''(x_*)|}{2 |f'(x_*)|}. $$

Earlier we saw that bisection is linearly convergent for f ∈ C[a0, b0]
with rate C = 1/2.
One can analyze Newton’s method and its variants through the
following general framework. Consider direct iterations of the form

$$ x_{k+1} = \Phi(x_k), $$

for some iteration function Φ. (For further details on this standard
approach, see G. W. Stewart, Afternotes on Numerical Analysis, §§2–4;
J. Stoer & R. Bulirsch, Introduction to Numerical Analysis, 2nd ed., §5.2;
L. W. Johnson and R. D. Riess, Numerical Analysis, second ed., §4.3.)
This framework will include Newton’s method, since we can take

$$ \Phi(x) = x - \frac{f(x)}{f'(x)}. $$

If the starting guess is an exact root, x0 = x∗, the method should be
smart enough to return x1 = x∗. Thus the root x∗ is a fixed point of Φ:

$$ x_* = \Phi(x_*). $$

We seek an expression for the error ek+1 = xk+1 − x∗ in terms of ek
and properties of Φ. Assume, for example, that Φ ∈ C²(ℝ), so
that we can write the Taylor series for Φ expanded about x∗:

$$ \begin{aligned} x_{k+1} = \Phi(x_k) &= \Phi(x_*) + (x_k - x_*)\Phi'(x_*) + \tfrac{1}{2}(x_k - x_*)^2 \Phi''(\xi) \\ &= x_* + (x_k - x_*)\Phi'(x_*) + \tfrac{1}{2}(x_k - x_*)^2 \Phi''(\xi) \end{aligned} $$

for some ξ between xk and x∗. From this we obtain an expression for
the errors:

$$ e_{k+1} = e_k \Phi'(x_*) + \tfrac{1}{2} e_k^2 \Phi''(\xi). $$

Convergence analysis is reduced to the study of Φ′(x∗), Φ″(x∗), etc.
To analyze methods that converge faster than quadratic, one would
simply add extra terms in the Taylor series.
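To see the error formula at work on the constant slope method, where Φ(x) = x − αf(x) and hence Φ′(x∗) = 1 − αf′(x∗), consider the Python sketch below. The test function f(x) = x² − 2, the value α = 0.25, the starting point, and the helper name `constant_slope` are all illustrative assumptions.

```python
import math

def constant_slope(f, alpha, x0, steps):
    """Iterate x_{k+1} = Phi(x_k) with Phi(x) = x - alpha * f(x)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * f(xs[-1]))
    return xs

f = lambda x: x * x - 2.0
x_star = math.sqrt(2.0)
alpha = 0.25                       # illustrative; 1/f'(x*) is about 0.354

xs = constant_slope(f, alpha, 1.5, 15)
errs = [x - x_star for x in xs]

observed = errs[11] / errs[10]             # e_{k+1} / e_k after settling
predicted = 1.0 - alpha * 2.0 * x_star     # Phi'(x*) = 1 - alpha f'(x*)
print("observed ratio :", observed)
print("predicted ratio:", predicted)
```

Since Φ′(x∗) is nonzero here, the first term of the error formula dominates and the iteration converges only linearly; choosing α closer to 1/f′(x∗) drives the rate toward zero.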

Example 4.5. For Newton’s method

$$ \Phi(x) = x - \frac{f(x)}{f'(x)}, $$

so the quotient rule gives

$$ \Phi'(x) = 1 - \frac{f'(x)^2 - f(x) f''(x)}{f'(x)^2} = \frac{f(x) f''(x)}{f'(x)^2}. $$

Provided x∗ is a simple root so that f′(x∗) ≠ 0 (and supposing
f ∈ C²(ℝ)), we have Φ′(x∗) = 0, and thus

$$ e_{k+1} = \tfrac{1}{2} e_k^2 \Phi''(\xi), $$

and hence we again see quadratic convergence provided xk is suffi-
ciently close to x∗. The asymptotic error constant is thus

$$ C = \frac{|\Phi''(x_*)|}{2} = \frac{|f''(x_*)|}{2 |f'(x_*)|}, $$

as expected.
What happens when f′(x∗) = 0? The direct iteration framework
makes it straightforward to analyze this situation. If x∗ is a multiple
root, we might worry that Newton’s method might have trouble con-
verging, since we are dividing f(xk) by f′(xk), and both quantities
are nearing zero as xk → x∗. To study convergence, we investigate

$$ \lim_{x \to x_*} \Phi'(x) = \lim_{x \to x_*} \frac{f(x) f''(x)}{f'(x)^2}. $$

This limit has the indeterminate form 0/0. Assuming sufficient dif-
ferentiability, we can invoke l’Hôpital’s rule:

$$ \lim_{x \to x_*} \frac{f(x) f''(x)}{f'(x)^2} = \lim_{x \to x_*} \frac{f'(x) f''(x) + f(x) f'''(x)}{2 f'(x) f''(x)}, $$

but this is also of the indeterminate form 0/0 when f′(x∗) = 0.
Again using l’Hôpital’s rule and now assuming f″(x∗) ≠ 0,

$$ \lim_{x \to x_*} \frac{f(x) f''(x)}{f'(x)^2} = \lim_{x \to x_*} \frac{f''(x)^2 + 2 f'(x) f'''(x) + f(x) f^{(iv)}(x)}{2\big(f'(x) f'''(x) + f''(x)^2\big)} = \lim_{x \to x_*} \frac{f''(x)^2}{2 f''(x)^2} = \frac{1}{2}. $$

Thus, Newton’s method converges locally to a double root according
to

$$ e_{k+1} = \tfrac{1}{2} e_k + O(e_k^2). $$

This is linear convergence at the same rate as bisection! If x∗ has mul-
tiplicity exceeding two, then f″(x∗) = 0 and further analysis is
required. One would find that the rate remains linear, and gets even
slower: for a root of multiplicity m, the rate is (m − 1)/m. The slow
convergence of Newton’s method for multiple roots
is exacerbated by the chronic ill-conditioning of such roots. Let us
summarize what might seem to be a paradoxical situation: the more
‘copies’ of a root there are present, the more difficult that root is to
find!
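The linear rate 1/2 at a double root is easy to observe. The Python sketch below applies Newton’s method to f(x) = (x − 1)²(x + 2), an illustrative function chosen for this example (it is not from the notes), which has a double root at x∗ = 1. Both f and f′ are evaluated in factored form to limit cancellation near the root.

```python
def newton_step(x):
    """One Newton step for f(x) = (x - 1)^2 (x + 2), in factored form."""
    f = (x - 1.0) ** 2 * (x + 2.0)
    df = 3.0 * (x - 1.0) * (x + 1.0)     # f'(x) = 3x^2 - 3
    return x - f / df

x_star = 1.0                  # double root of f
xs = [2.0]                    # illustrative starting guess
for _ in range(20):
    xs.append(newton_step(xs[-1]))

errs = [x - x_star for x in xs]
ratio = errs[20] / errs[19]   # e_{k+1} / e_k late in the iteration
print("error after 20 steps:", errs[20])
print("error ratio         :", ratio)
```

After 20 steps the error is only around 10⁻⁶, which is what 20 halvings of an O(1) error would give; at a simple root Newton’s method would have reached machine precision long before.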

lecture 31: The Secant Method: Prototypical Quasi-Newton Method

Newton’s method is fast if one has a good initial guess x0 . Even then,
it can be inconvenient and expensive to compute the derivatives
f 0 ( xk ) at each iteration. The final root finding algorithm we consider
is the secant method, a kind of quasi-Newton method based on an ap-
proximation of f 0 . It can be thought of as a hybrid between Newton’s
method and regula falsi.

4.4 The Secant Method

Throughout this semester, we have seen how derivatives can be ap-
proximated using finite differences, for example,

$$ f'(x) \approx \frac{f(x + h) - f(x)}{h} $$

for some small h. (Recall that too-small h will give a bogus answer
due to rounding errors, so some caution is needed; see Section 3.5.)
What if we replace f′(xk) in Newton’s method with this sort of ap-
proximation? The natural algorithm that emerges is the secant method,

$$ x_{k+1} = x_k - f(x_k)\,\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})} = \frac{x_{k-1} f(x_k) - x_k f(x_{k-1})}{f(x_k) - f(x_{k-1})}. $$

Note the similarity between this formula and the regula falsi iteration:

$$ c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}. $$

Both methods approximate f by a line that joins two points on the
graph of f, but the secant method requires no initial bracket for the
root. Instead, the user simply provides two starting points x0 and x1
with no stipulation about the signs of f(x0) and f(x1). As a conse-
quence, there is no guarantee that the method will converge: as in
Newton’s method, a poor initial guess can lead to divergence.
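A minimal Python sketch of the secant iteration follows. The function name `secant`, the stopping rule on |f(xk)|, and the iteration cap are assumptions for illustration. The starting points x0 = 10 and x1 = 5.1 are the ones used in the comparison with Newton’s method later in this lecture.

```python
import math

def secant(f, x0, x1, tol=1e-14, max_iter=50):
    """Secant iteration; returns the list of iterates x_0, x_1, x_2, ...

    Only one new evaluation of f is needed per step, because f(x_{k-1})
    is kept from the previous step. The stopping rule is an assumption.
    """
    xs = [x0, x1]
    f_prev, f_curr = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f_curr) <= tol or f_curr == f_prev:
            break
        x_next = xs[-1] - f_curr * (xs[-1] - xs[-2]) / (f_curr - f_prev)
        xs.append(x_next)
        f_prev, f_curr = f_curr, f(x_next)
    return xs

xs = secant(lambda x: x * x - 2.0, 10.0, 5.1)
for k, x in enumerate(xs):
    print(k, x, abs(x - math.sqrt(2.0)))
```

The guard `f_curr == f_prev` simply avoids a division by zero when the two function values coincide; it does not make the method globally convergent.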
Do we recover the convergence behavior of Newton’s method?
Not quite, but it is still better than the linear convergence exhibited
by the bisection and regula falsi methods (provided it does converge).
The convergence theory brings in a new technique we have not seen
before, where the error ek = xk − x∗ is presumed to reduce according
to a generic convergence behavior, $e_{k+1} \approx M e_k^r$, and this ansatz is used
to derive values of M and r. (See Dahlquist and Björck, Numerical
Methods, Section 6.4.3.) Begin by writing the linear interpolant
to f at xk and xk−1 in the Newton form

$$ p(x) = f(x_k) + \frac{x - x_k}{x_{k-1} - x_k}\big(f(x_{k-1}) - f(x_k)\big). \tag{4.7} $$

Once again we can put the interpolation error formula (Theorem 1.3)
to good use. Assuming that f ∈ C²(ℝ), for any x ∈ ℝ one can write

$$ f(x) - p(x) = \frac{f''(\xi)}{2}(x - x_k)(x - x_{k-1}), $$

where ξ falls within the extremes of x, xk, and xk−1. Since f(x∗) = 0,
we can thus write

$$ 0 = p(x_*) + \frac{f''(\xi)}{2}(x_* - x_k)(x_* - x_{k-1}). $$

Defining ej := xj − x∗ as usual, this last equation is

$$ 0 = p(x_*) + \frac{f''(\xi)}{2}\, e_k e_{k-1}. $$

Substituting formula (4.7) for p gives

$$ 0 = f(x_k) + \frac{x_* - x_k}{x_{k-1} - x_k}\big(f(x_{k-1}) - f(x_k)\big) + \frac{f''(\xi)}{2}\, e_k e_{k-1}. \tag{4.8} $$

Now recall that, by design, the secant method picks xk+1 as the zero
of p, i.e.,

$$ 0 = p(x_{k+1}) = f(x_k) + \frac{x_{k+1} - x_k}{x_{k-1} - x_k}\big(f(x_{k-1}) - f(x_k)\big). \tag{4.9} $$

Subtract (4.8) from (4.9) to obtain

$$ \begin{aligned} 0 &= \frac{x_{k+1} - x_*}{x_{k-1} - x_k}\big(f(x_{k-1}) - f(x_k)\big) - \frac{f''(\xi)}{2}\, e_k e_{k-1} \\ &= \frac{f(x_{k-1}) - f(x_k)}{x_{k-1} - x_k}\, e_{k+1} - \frac{f''(\xi)}{2}\, e_k e_{k-1}. \end{aligned} \tag{4.10} $$

We can clean up this last formula by realizing that

$$ \frac{f(x_{k-1}) - f(x_k)}{x_{k-1} - x_k} $$

is the slope of the linear interpolant p. It is also a difference quotient
for f, and so, by the Mean Value Theorem, there exists η ∈ [xk, xk−1]
where the slope of f matches that of the interpolant:

$$ f'(\eta) = \frac{f(x_{k-1}) - f(x_k)}{x_{k-1} - x_k}. $$

Substituting this formula into (4.10) gives

$$ 0 = f'(\eta)\, e_{k+1} - \frac{f''(\xi)}{2}\, e_k e_{k-1}, $$

which can be solved for

$$ e_{k+1} = \frac{f''(\xi)}{2 f'(\eta)}\, e_k e_{k-1}. \tag{4.11} $$

(Contrast this formula with the error recurrence for Newton’s method
obtained from (4.6): $e_{k+1} = \frac{f''(\xi)}{2 f'(x_k)}\, e_k^2$.)
As xk and xk−1 converge to x∗, the values for ξ and η must also
converge toward x∗, justifying the approximation

$$ e_{k+1} \approx C e_k e_{k-1} \tag{4.12} $$

for asymptotic error constant

$$ C = \frac{f''(x_*)}{2 f'(x_*)}, $$

presuming, as usual, that x∗ is a simple root, so f′(x∗) ≠ 0.


It remains to convert the approximation (4.12) into some order
of convergence. As xk converges, we expect |ek−1| to be larger than
|ek|, so ek ek−1 in the secant method is probably not as small as the ek²
term that appeared in the analogous formula for Newton’s method.
Can we quantify the difference? Suppose the error obeys the
approximation

$$ e_{j+1} \approx M e_j^r \tag{4.13} $$

for some constants M and r. Then

$$ e_k \approx M e_{k-1}^r $$

implies that

$$ e_{k-1} \approx M^{-1/r} e_k^{1/r}, $$

and so the error approximation (4.12) suggests

$$ e_{k+1} \approx C e_k e_{k-1} \approx C M^{-1/r} e_k^{1 + 1/r}. \tag{4.14} $$

This equation must agree with the error form (4.13) with j = k:

$$ e_{k+1} \approx M e_k^r. \tag{4.15} $$

Equating the approximations (4.14) and (4.15) gives

$$ C M^{-1/r} e_k^{1 + 1/r} = M e_k^r. $$

Matching the constants then gives

$$ M = C^{r/(r+1)}, $$

while matching the rates gives 1 + 1/r = r, i.e.,

$$ r^2 - r - 1 = 0. $$

Solve this quadratic to get

$$ r = \frac{1 \pm \sqrt{5}}{2}. $$

The negative sign choice gives r = (1 − √5)/2 = −0.6180 . . .,
which does not correspond to a convergent process. (If |ek| < 1,
then $|e_k^{-0.6180\ldots}| > 1$.) Thus, if the process converges according to
$e_{k+1} \approx M e_k^r$, the r must correspond to the positive root,

$$ r = \frac{1 + \sqrt{5}}{2} =: \varphi = 1.6180\ldots, $$

the golden ratio. It follows that

$$ |e_{k+1}| \le M |e_k|^{\varphi} $$

for a constant M > 0. Note that ϕ < 2, so, in the region of asymptotic
convergence (xk close to x∗), one step of the secant method will make
a bit less progress to the root than one step of Newton’s method.
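The order ϕ can be estimated empirically. The sketch below is an illustration, not part of the notes: it runs the secant method on f(x) = x² − 2 in 400-digit arithmetic (Python's `decimal` module), starting from the assumed points x0 = 2 and x1 = 1.5, and estimates r from three consecutive errors via r ≈ (ln|ek+1| − ln|ek|)/(ln|ek| − ln|ek−1|), which follows from taking logarithms in (4.13).

```python
from decimal import Decimal, getcontext

getcontext().prec = 400                 # the last error below is near 1e-288

two = Decimal(2)
x_star = two.sqrt()
f = lambda x: x * x - two

x_prev, x_curr = Decimal(2), Decimal("1.5")     # illustrative starting points
log_err = [abs(x_prev - x_star).ln(), abs(x_curr - x_star).ln()]
for _ in range(11):
    # Secant step; the right-hand side uses the old x_prev and x_curr.
    x_prev, x_curr = x_curr, x_curr - f(x_curr) * (x_curr - x_prev) / (f(x_curr) - f(x_prev))
    log_err.append(abs(x_curr - x_star).ln())

# Order estimate from the last three errors.
order = float((log_err[-1] - log_err[-2]) / (log_err[-2] - log_err[-3]))
golden = (1 + 5 ** 0.5) / 2
print("estimated order:", order)
print("golden ratio   :", golden)
```

High-precision arithmetic is needed only because the errors fall below double precision after about seven steps, which leaves too few iterates in the asymptotic regime to estimate r reliably.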

Though you may regret that the secant method does not recover
the quadratic convergence of Newton’s method, take solace in the
fact that the secant method requires only one function evaluation
f(xk) at each iteration, as opposed to Newton’s method, which re-
quires f(xk) and f′(xk). (Of course, for the secant method one
stores the f(xk−1) value computed during the previous iteration.)
Typically the derivative is more expensive to
compute than the function itself. Assuming that evaluating f(xk) and
f′(xk) requires the same amount of effort, then we can compute two
steps of the secant method for roughly the same cost as one step of
Newton’s method. These two steps of the secant method combine to
give an improved convergence rate:

$$ |e_{k+2}| \le M |e_{k+1}|^{\varphi} \le M \big(M |e_k|^{\varphi}\big)^{\varphi} = M^{1+\varphi} |e_k|^{\varphi^2}, $$

where ϕ² = ½(3 + √5) ≈ 2.62 > 2. Hence, in terms of computing
time, the secant method can actually be more efficient than Newton’s
method. (This discussion is drawn from §3.3 of Kincaid and Cheney,
Numerical Analysis, 3rd ed.)
Figure 4.5 compares the convergence of the secant method to New-
ton’s method for the function f(x) = x² − 2, which we can use to
compute x∗ = √2 as in Figure 4.4. This example starts with the (bad)
initial guess x0 = 10. To ensure that the secant method is not ham-
pered by a bad value of x1, this experiment uses the same x1 value
computed using Newton’s method. After these two initial steps,
both methods steadily converge, but Newton’s method takes fewer
iterations, in agreement with the theory derived in this and the last
lecture. Table 4.2 shows the iterates xk and magnitude of the errors
|ek| for both methods.

Figure 4.5: The convergence of Newton’s method and the secant method for
f(x) = x² − 2. Both methods start with x0 = 10, and the secant method uses the
same x1 that was generated by the first step of Newton’s method.
(Plot: error |xk − x∗| versus k, logarithmic vertical axis, with one curve for
Newton’s method and one for the secant method.)
 k   xk (Newton)       |ek| (Newton)     xk (secant)       |ek| (secant)

 0   10.000000000000   8.585786437627    10.000000000000   8.585786437627
 1    5.100000000000   3.685786437627     5.100000000000   3.685786437627
 2    2.746078431373   1.331864868999     3.509933774834   2.095720212461
 3    1.737194874380   0.322981312007     2.311360664564   0.897147102191
 4    1.444238094866   0.030024532493     1.737194874380   0.322981312007
 5    1.414525655149   0.000312092776     1.485785199551   0.071571637178
 6    1.414213596802   0.000000034429     1.421385900007   0.007172337634
 7    1.414213562373   0.000000000000     1.414390138133   0.000176575760
 8                                        1.414214008974   0.000000446601
 9                                        1.414213562401   0.000000000028
10                                        1.414213562373   0.000000000000

Table 4.2: Comparison of iterates from Newton’s method and the secant
method for finding a zero of f(x) = x² − 2.
5
Ordinary Differential Equations

lecture 32: Introduction to Numerical Integration

The final segment of the course addresses techniques for ap-
proximating the solution of an ordinary differential equation of the
general form

x′(t) = f(t, x(t)).

For the most part, we will consider initial value problems, where the
solution is determined by an initial condition

x(t0) = x0.

A wide variety of methods have been proposed to solve such equa-


tions, often derived from the techniques of interpolation, approxima-
tion, and quadrature studied earlier in the course.
Differential equations play the dominant role in applied math-
ematics. Their (approximate) solution is required in nearly every
corner of physics, chemistry, biology, engineering, finance, and be-
yond. For many practical problems involving nonlinearities, one
cannot write down a closed-form solution to a differential equation in
terms of familiar functions such as polynomials, trigonometric func-
tions, and exponentials. Thus the numerical solution of differential
equations is an enormous field. In this course we shall only be able
to focus on ordinary differential equations (ODEs). Partial differen-
tial equations (PDEs) are even more challenging, requiring several
additional semester-long courses to cover the basic methods.
The numerical solution of differential equations began in earnest
with Leonhard Euler and his contemporaries in the mid 1700s, with
especially important contributions following between 1880 and 1905.
The primary motivating application was celestial mechanics, where
the approximate integration of Newton’s differential equations of
motion was needed to predict the orbits of planets and comets. Indeed,
celestial mechanics (more generally, Hamiltonian systems) continues
to motivate the field, leading to recent innovations in so-called
geometric or symplectic integrators.

5.1 Existence theory for ODEs

Before computing numerical solutions to ODE initial value problems,


we should address the variety of problems that arise, and the the-
oretical conditions under which one can guarantee that a solution
exists.

5.1.1 Scalar equations


A standard scalar initial value problem takes the form

Given: x′(t) = f(t, x(t)), with x(t0) = x0
Determine: x(t) for all t ≥ t0.

That is, we are given a formula for the derivative of some unknown
function x (t), together with a single value of the function at some
initial time, t0 . The goal is to use this information to determine x (t) at
all points t beyond the initial time.

Differential equations are an inherently graphical subject, so we


should examine a few sample problems and plots of their solutions.

Example 5.1. First consider the simple, essential model problem

x′(t) = λx(t),

with exact solution x(t) = e^{λt} c, where the constant c is determined by
the initial data (t0, x0). In the common case that t0 = 0,

x0 = x(0) = e^{λ·0} c = c,

so the exact solution is

x(t) = e^{λt} x0.

If λ > 0, the solution grows exponentially with t; λ < 0 yields
exponential decay. Because this linear equation is easy to solve, it
provides a good test case for numerical algorithms. Moreover, it is
the prototypical linear ODE; from it, we gain insight into the local
behavior of nonlinear ODEs.
(If λ = α + iβ ∈ ℂ with α, β ∈ ℝ, then e^{tλ} = e^{t(α+iβ)} = e^{tα} e^{itβ}.
Since itβ is purely imaginary, |e^{itβ}| = 1, so |e^{tλ}| = e^{tα}.
Thus |e^{tλ}| → 0 as t → ∞ if Re λ < 0, while |e^{tλ}| → ∞ as t → ∞ if Re λ > 0.)
Applications typically give equations whose solutions
cannot be expressed as simply as the solution of this linear model
problem. Among the tools that improve our understanding of more
difficult problems is the direction field of the function f(t, x), a key
technique from the sub-discipline of the qualitative analysis of ODEs.
(For an elementary introduction to the qualitative analysis of ODEs, see
Hubbard and West, Differential Equations: A Dynamical Systems Approach,
Part I, Springer-Verlag, 1991.)
Here is the critical idea: The function f(t, x(t)) reveals the slope of the
solution x(t) going through any point in the (t, x(t)) plane. Hence
one can get a good impression about the behavior of a differential
equation by plotting these slopes throughout some interesting region
of the (t, x(t)) plane.

Figure 5.1: Direction field for the equation x′(t) = x(t), for which f(t, x) = x.
(Plot: x(t) versus t, for 0 ≤ t ≤ 6 and −3 ≤ x ≤ 3.)
To plot the direction field, let the horizontal axis represent t, and
the vertical axis represent x. Then divide the (t, x) plane with reg-
ular grid points, {(tj, xk)}. Centered at each grid point, draw a line
segment whose slope is f(tj, xk). To get a rough impression of the
solution of the differential equation x′(t) = f(t, x) with x(t0) = x0,
begin at the point (t0, x0), and follow the direction of the slope lines.
Figure 5.1 shows the direction field for x′(t) = x(t), giving
f(t, x) = x. Since f does not depend directly on t, the differential
equation is autonomous. In the plot of the direction field, for a fixed
value of x, the arrows point in the same direction and have the same
magnitude for all t.
One only needs a few simple MATLAB commands to produce a
direction field like the one seen in Figure 5.1, thanks to the built-in
quiver routine.

f = @(t,x) x;                                  % x' = f(t,x) = x, as an anonymous function
x = linspace(-3,3,15); t = linspace(0,6,15);   % grid of points at which to plot the slope
[T,X] = meshgrid(t,x);                         % turn grid vectors into matrices
figure(1), clf
quiver(T,X,ones(size(T)),f(T,X)), hold on      % arrows with direction (1, f(t,x))
axis([min(t) max(t) min(x) max(x)])            % adjust the axes

Figure 5.2 repeats Figure 5.1, but now showing solution trajectories
for x(0) = 0.1 and x(0) = −0.01. Notice how these solutions follow
the arrows in the direction field.

Figure 5.2: Direction field for the equation x′(t) = x(t), now showing
solutions for x(0) = 0.1 (in blue) and x(0) = −0.01 (in red).
(Plot: x(t) versus t, for 0 ≤ t ≤ 6 and −3 ≤ x ≤ 3.)

Example 5.2. Next consider an equation that, for most x(0), lacks an
elementary solution that can be expressed in closed form,

x′(t) = sin(t x(t)).

The direction field for sin(xt) is shown in Figure 5.3. Though we don’t have
access to the exact solution, it is a simple matter to compute accu-
rate approximations. Several solutions (for x(0) = 3, x(0) = 0, and
x(0) = −2) are superimposed on the direction field. These were
computed using a one-step method of the kind we will discuss mo-
mentarily. (Those areas where up and down arrows appear to cross
are asymptotes of the solution: between the up and down arrow is a
point where the slope f(t, x) is zero.)

Figure 5.3: Direction field for the equation x′(t) = sin(t x(t)), showing
solutions for x(0) = 3 (blue), x(0) = 0 (black) and x(0) = −2 (red).
(Plot: x(t) versus t, for 0 ≤ t ≤ 16 and −4 ≤ x ≤ 4.)

5.1.2 Systems of equations


In most applications we do not have a simple scalar equation, but
rather a system of equations describing the coupled dynamics of
several variables. Such situations give rise to vector-valued functions
$\mathbf{x}(t) \in \mathbb{R}^n$. In particular, the initial value problem becomes

Given: $\mathbf{x}'(t) = \mathbf{f}(t, \mathbf{x}(t))$, with $\mathbf{x}(t_0) = \mathbf{x}_0$
Determine: $\mathbf{x}(t)$ for all t ≥ t0.

All the techniques for solving scalar initial value problems described
in this course can be applied to systems of this type.

5.1.3 Higher-order ODEs


Newton’s Second Law, $\mathbf{F}(t) = m\,\mathbf{a}(t)$, leads to many important
second-order differential equations. Noting the acceleration $\mathbf{a}(t)$ is
the second derivative of the position $\mathbf{x}(t)$, we arrive at

$$ \mathbf{x}''(t) = m^{-1}\,\mathbf{F}(t). $$

Thus, we are often interested in systems of higher-order ODEs.


To keep the notation simple, consider the scalar second-order
problem

Given: $x''(t) = f(t, x(t), x'(t))$, with $x(t_0) = x_0$, $x'(t_0) = y_0$

Determine: $x(t)$ for all $t \ge t_0$.

Note, in particular, that the initial conditions $x(t_0)$ and $x'(t_0)$ must
both be supplied.

(Margin note: In some cases, one instead knows $x(t)$ at two distinct points, $x(t_0) = x_0$ and $x(t_{\rm final}) = x_{\rm final}$, leading to an ODE boundary value problem.)

This second order equation (and higher-order ODE’s as well)
can always be written as a first order system of equations. Define
$x_1(t) = x(t)$, and let $x_2(t) = x'(t)$. Then

$$ x_1'(t) = x'(t) = x_2(t) $$

$$ x_2'(t) = x''(t) = f(t, x(t), x'(t)) = f(t, x_1(t), x_2(t)). $$

Writing this in vector form, $\mathbf{x}(t) = [x_1(t)\ \ x_2(t)]^T$, and the differential
equation becomes

$$ \mathbf{x}'(t) = \begin{bmatrix} x_1'(t) \\ x_2'(t) \end{bmatrix} = \begin{bmatrix} x_2(t) \\ f(t, x_1(t), x_2(t)) \end{bmatrix} = \mathbf{f}(t, \mathbf{x}(t)). $$

(Margin note: Fonts matter: $x(t)$ denotes a scalar quantity, while $\mathbf{x}(t)$ is a vector.)

The initial value is given by

$$ \mathbf{x}_0 = \begin{bmatrix} x_1(t_0) \\ x_2(t_0) \end{bmatrix} = \begin{bmatrix} x(t_0) \\ x'(t_0) \end{bmatrix}. $$

The most famous second-order differential equation,

$$ x''(t) = -x(t), $$

has the solution $x(t) = \alpha \cos(t) + \beta \sin(t)$, for constants $\alpha$ and $\beta$
depending upon the initial values. Write the second-order equation
as the system

$$ \mathbf{x}'(t) = \begin{bmatrix} x_1'(t) \\ x_2'(t) \end{bmatrix} = \begin{bmatrix} x_2(t) \\ -x_1(t) \end{bmatrix}. $$
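
The reduction to a first-order system can be checked numerically. The sketch below is written in Python (not the MATLAB used elsewhere in these notes) and uses illustrative names (`f`, `exact`, `derivative_of_exact`) and arbitrarily chosen constants $\alpha = 2$, $\beta = -1$. It confirms that the state $[x(t),\ x'(t)]$ built from $x(t) = \alpha\cos(t) + \beta\sin(t)$ satisfies $\mathbf{x}'(t) = \mathbf{f}(t, \mathbf{x}(t))$ up to finite-difference error.

```python
import math

def f(t, x):
    """Right-hand side of the first-order system for x''(t) = -x(t).

    x is the list [x1, x2] = [x(t), x'(t)].
    """
    x1, x2 = x
    return [x2, -x1]

def exact(t, alpha, beta):
    """State [x(t), x'(t)] for x(t) = alpha*cos(t) + beta*sin(t)."""
    return [alpha * math.cos(t) + beta * math.sin(t),
            -alpha * math.sin(t) + beta * math.cos(t)]

def derivative_of_exact(t, alpha, beta, d=1e-6):
    """Centered finite-difference estimate of d/dt [x(t), x'(t)]."""
    hi, lo = exact(t + d, alpha, beta), exact(t - d, alpha, beta)
    return [(a - b) / (2 * d) for a, b in zip(hi, lo)]

alpha, beta = 2.0, -1.0      # illustrative: x(0) = 2, x'(0) = -1
residual = max(abs(a - b)
               for t in [0.0, 0.7, 3.0]
               for a, b in zip(derivative_of_exact(t, alpha, beta),
                               f(t, exact(t, alpha, beta))))
print(residual)              # small: limited only by the finite difference
```

The same pattern, a function returning the vector of derivatives, is the form in which a second-order problem would be handed to any of the integrators discussed later.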
Combining Newton’s inverse-square description of gravitational
force with his Second Law leads to the system of second order ODEs

$$ \mathbf{x}''(t) = \frac{-\mathbf{x}(t)}{\|\mathbf{x}(t)\|_2^3}, $$

where $\mathbf{x} \in \mathbb{R}^3$ is a vector in Euclidean space, and the 2-norm denotes
the usual (Euclidean) length of a vector,

$$ \|\mathbf{x}(t)\|_2 = \big(x_1(t)^2 + x_2(t)^2 + x_3(t)^2\big)^{1/2}. $$

Since $\mathbf{x}(t) \in \mathbb{R}^3$, this second order equation reduces to a system of six
first order equations.

5.1.4 Picard’s Theorem: Existence and Uniqueness of Solutions


Before constructing numerical solutions to these differential equa-
tions, it is important to understand when solutions exist at all. Pi-
card’s theorem establishes existence and uniqueness.

Theorem 5.1 (Picard’s Theorem).
Let $f(t, x)$ be a continuous function on the rectangle

$$ D = \{(t, x) : t \in [t_0, t_{\rm final}],\ x \in [x_0 - c, x_0 + c]\} $$

for some fixed $c > 0$. Furthermore, suppose $|f(t, x_0)| \le K$ for all
$t \in [t_0, t_{\rm final}]$, and suppose there exists some Lipschitz constant $L > 0$
such that

$$ |f(t, u) - f(t, v)| \le L\,|u - v| $$

for all $u, v \in [x_0 - c, x_0 + c]$ and all $t \in [t_0, t_{\rm final}]$. Finally, suppose that

$$ c \ge \frac{K}{L}\Big(e^{L(t_{\rm final} - t_0)} - 1\Big). $$

(That is, the box $D$ must be sufficiently large to compensate for large
values of $K$ and $L$.) Then there exists a unique $x \in C^1[t_0, t_{\rm final}]$ such
that $x(t_0) = x_0$, $x'(t) = f(t, x)$ for all $t \in [t_0, t_{\rm final}]$, and $|x(t) - x_0| \le c$
for all $t \in [t_0, t_{\rm final}]$.

(Margin note: For a proof, see Süli and Mayers, Section 12.1.)

In simpler words, these hypotheses ensure the existence of a unique
$C^1$ solution to the initial value problem, and this solution stays within
the rectangle $D$ for all $t \in [t_0, t_{\rm final}]$, i.e., as $t$ increases the solution
will ‘exit’ from the right-side of the rectangle, not the top or bottom.
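
As a small worked check of these hypotheses (an illustration chosen here, not an example from the text), take the model problem $x'(t) = x(t)$ with $x(0) = 1$ on $[t_0, t_{\rm final}] = [0, 1]$:

```latex
\begin{align*}
f(t,x) = x: \quad & |f(t, x_0)| = |x_0| = 1 \ \Longrightarrow\ K = 1, \\
& |f(t,u) - f(t,v)| = |u - v| \ \Longrightarrow\ L = 1, \\
& c \ \ge\ \frac{K}{L}\Big(e^{L(t_{\rm final}-t_0)} - 1\Big) = e - 1 \approx 1.718 .
\end{align*}
% The exact solution x(t) = e^t satisfies |x(t) - x_0| = e^t - 1 <= e - 1 on [0,1],
% so for this problem the smallest admissible box half-width c = e - 1 is attained exactly.
```

For this problem the bound on $c$ is sharp: the solution reaches the top edge of the smallest admissible rectangle exactly at $t = t_{\rm final}$.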

5.2 One-step methods

We are prepared to discuss some numerical methods to approximate
the solution to all these ODEs. To simplify the notation, we present
our methods in the context of the scalar equation

$$ x'(t) = f(t, x(t)) $$

with the initial condition $x(t_0) = x_0$. All the algorithms generalize
trivially to systems: simply replace scalars with vectors.
When computing approximate solutions to the initial value problem,
we will not obtain the solution for every value of $t > t_0$, but
only on a discrete grid. In particular, we will generate approximate
solutions at some regular grid of time steps

$$ t_k = t_0 + kh $$

for some constant step-size $h$. (The methods we consider in this
subsection allow $h$ to change with each step, so one actually has
$t_k = t_{k-1} + h_k$. For simplicity of notation, we will assume for now
that the step-size is fixed.)

(Margin note: The field of asymptotic analysis delivers approximations in terms of elementary functions that can be highly accurate; these are typically derived in a non-numerical fashion, and often have the virtue of accurately identifying leading order behavior of complicated solutions. For a beautiful introduction to this important area of applied mathematics, see Carl M. Bender and Steven A. Orszag, Advanced Mathematical Methods for Scientists and Engineers; McGraw-Hill, 1978; Springer, 1999.)

The approximation to $x$ at the time $t_k$ is denoted by $x_k$, so hopefully

$$ x_k \approx x(t_k). $$

Of course, the initial data is exact:

$$ x_0 = x(t_0). $$

5.2.1 Euler’s method


We need some approximation that will advance from the exact point
on the solution curve, $(t_0, x_0)$, to time $t_1$. From basic calculus we
know that

$$ x'(t) = \lim_{h \to 0} \frac{x(t + h) - x(t)}{h}. $$

This definition of the derivative inspires our first method. Apply it at
time $t_0$ with a small but finite time step $h > 0$ to obtain

$$ x'(t_0) \approx \frac{x(t_0 + h) - x(t_0)}{h}. $$

Since $x'(t_0) = f(t_0, x(t_0)) = f(t_0, x_0)$, we know the quantity on
the left hand side of this approximation. The only quantity we don’t
know is $x(t_0 + h) = x(t_1)$. Rearrange the above to put $x(t_1)$ on the
left hand side:

$$ x(t_1) \approx x(t_0) + h\,x'(t_0) = x_0 + h f(t_0, x_0). $$

This approximation is precisely the one suggested by the direction
field discussion in Section 5.1.1. There, to progress from the starting
point $(t_0, x_0)$, we followed the line of slope $f(t_0, x_0)$ some distance,
which in the present context is our step size, $h$. To progress from
the new point, $(t_1, x_1)$, we follow a new slope, $f(t_1, x_1)$, giving the
iteration

$$ x_2 = x_1 + h f(t_1, x_1). $$

There is an important distinction here. Ideally, we would have derived
our value of $x_2 \approx x(t_2)$ from the formula

$$ x(t_2) \approx x(t_1) + h f(t_1, x(t_1)). $$

However, an error was made in the computation of $x_1 \approx x(t_1)$; we
do not know the exact value $x(t_1)$. Thus, we settle for computing
$x_2$ from $x_1$, a quantity we do know. This might seem like a minor
distinction, but in general the difference between the approximation
$x_k$ and the true solution $x(t_k)$ is vital. At each step, a local error is
made due to the approximation of the derivative by a line. These
local errors accumulate, giving a global error. Is a small local error
enough to ensure small global error? This question is the subject of
the next two lectures.
Given the approximation $x_2$, repeat the same procedure to obtain
$x_3$, and so on. Formally,

Euler’s Method: $x_{k+1} = x_k + h f(t_k, x_k)$.

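The first step of Euler’s method is illustrated in the following
schematic.

[Schematic: from the point $(t_0, x_0)$ on the exact solution curve $x(t)$, a straight line with slope $f(t_0, x_0)$ (the slope at $t_0$) is followed to the point $(t_1, x_1)$. Horizontal axis: $t$, with $t_0$ and $t_1$ marked; vertical axis: $x$, with $x_0$ and $x_1$ marked.]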
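
A minimal Python sketch of Euler’s method follows (Python rather than the MATLAB used elsewhere in these notes). The function name `euler` and its argument order are choices made here for illustration. It is applied to the model problem $x'(t) = x(t)$, $x(0) = 1$, integrated to $t = 2$ with the two step sizes $h = 0.5$ and $h = 0.1$.

```python
import math

def euler(f, t0, x0, h, n):
    """Take n forward Euler steps of size h from (t0, x0).

    Returns the lists [t_0, ..., t_n] and [x_0, ..., x_n].
    """
    ts, xs = [t0], [x0]
    for k in range(n):
        xs.append(xs[-1] + h * f(ts[-1], xs[-1]))   # x_{k+1} = x_k + h f(t_k, x_k)
        ts.append(t0 + (k + 1) * h)
    return ts, xs

f = lambda t, x: x                       # model problem x'(t) = x(t)
_, coarse = euler(f, 0.0, 1.0, 0.5, 4)   # h = 0.5, reaches t = 2
_, fine = euler(f, 0.0, 1.0, 0.1, 20)    # h = 0.1, reaches t = 2
print(coarse[-1], fine[-1], math.exp(2.0))
```

For this problem each step multiplies the iterate by $1 + h$, so both runs fall short of $e^2$, with the smaller step size landing closer at the cost of five times as many steps.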

Example 5.3. Consider the performance of Euler’s method on
the two examples in Section 5.1.1. First, we examine the equation,
$x'(t) = x(t)$, with initial condition $x(0) = 1$. We apply two step sizes:
h = 0.5 and h = 0.1. Naturally, we expect that decreasing h will
improve the local accuracy. But with h = 0.1, we require five times
as many approximations as with h = 0.5. How do the errors made
at these many steps accumulate? The plot below shows that both ap-
proximations underestimate the true solution, but as we expect, the
smaller step size yields the better approximation – but requires more
work to get to a fixed destination time.

Example 5.4. Next, consider the second example, $x'(t) = \sin(t\,x(t))$,
this time with $x(0) = 5$. Since we do not know the exact solution, we
can only compare approximate answers, here obtained with h = 0.5
and h = 0.1. For t > 4, the solutions completely differ from one
another! Again, the smaller step size is the more accurate solution. In
the plot below, the direction field is shown together with the approx-
imate solutions. Note that f (t, x ) = sin(tx ) varies with x, so when
the h = 0.5 solution diverges from the h = 0.1 solution, very different
values of f are used to generate iterates. The h = 0.5 solution ‘jumps’
over the correct asymptote, and provides a very misleading answer.

Example 5.5. For a final example of Euler’s method, consider the
equation

$$ x'(t) = 1 + x(t)^2 $$

with $x(0) = 0$. This equation looks innocuous enough; indeed, you
might notice that the exact solution is $x(t) = \tan(t)$. The true solution
blows up in finite time, $x(t) \to \infty$ as $t \to \pi/2$. (Such blow-up
behavior is common in ODEs and PDEs where the formula for the
derivative of $x$ involves higher powers of $x$.) It is reasonable to seek
an approximate solution to the differential equation for $t \in [0, \pi/2)$,
but beyond $t = \pi/2$, the equation does not have a solution, and any
answer produced by our numerical method is, essentially, garbage.

(Margin note: This example is given in Kincaid and Cheney, page 525.)

For any finite $x$, $f(t, x) = 1 + x^2$ will always be finite. Thus Euler’s
method,

$$ x_{k+1} = x_k + h f(t_k, x_k) = h + x_k(1 + h x_k), $$

will always produce some finite quantity; it will never give the infinite
answer at $t = \pi/2$. Still, as we see in the plots below, Euler’s
method captures the qualitative behavior well, with the iterates growing
very large soon after $t = \pi/2$. (Notice that the vertical axis is
logarithmic, so by $t = 2$, the approximation with time step $h = 0.05$
exceeds $10^{10}$.)
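
The behavior described in this example can be reproduced with a few lines of Python; this is a sketch with an illustrative helper `euler`, using the step size $h = 0.05$ quoted above. The iterates stay finite through the blow-up time $t = \pi/2$ and then grow explosively.

```python
import math

def euler(f, t0, x0, h, n):
    """Return the forward Euler iterates [x_0, ..., x_n] with step size h."""
    xs = [x0]
    for k in range(n):
        xs.append(xs[-1] + h * f(t0 + k * h, xs[-1]))
    return xs

f = lambda t, x: 1.0 + x * x          # exact solution x(t) = tan(t) for x(0) = 0
h = 0.05
xs = euler(f, 0.0, 0.0, h, 40)        # 40 steps reach t = 2 > pi/2

print(xs[30], math.tan(1.5))          # at t = 1.5 Euler lags behind tan(t)
print(xs[40])                         # at t = 2: finite, but enormous
```

Once $x_k$ is large, the update behaves like $x_{k+1} \approx h x_k^2$, which explains the rapid growth after $t = \pi/2$.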

lecture 33: Local Analysis of One-Step Integrators


What can be said of the error between the computed solution xk at
time tk = t0 + kh and the exact solution x (tk )? In this lecture and the
next, we analyze this error, as a function of k, h, and properties of the
ODE, for an important class of algorithms that generalize the forward
Euler method.

5.2.2 Runge–Kutta Methods


To obtain increased accuracy in Euler’s method,

$$ x_{k+1} = x_k + h f(t_k, x_k), $$

one might naturally reduce the step-size, $h$. Since Euler’s method


derives from a first-order approximation to the derivative, we might
expect the error to decay linearly in h. Before making this rigorous,
first consider some better approaches: we are rarely satisfied with
order-h accuracy! By improving upon Euler’s method, we hope to
obtain an improved solution while still maintaining a large time-step.

First consider a modification that might not look like such a big
improvement: simply replace $f(t_k, x_k)$ by $f(t_{k+1}, x_{k+1})$ to obtain

$$ x_{k+1} = x_k + h f(t_{k+1}, x_{k+1}), $$

called the backward Euler method. Because $x_{k+1}$ depends on the value
$f(t_{k+1}, x_{k+1})$, this scheme is called an implicit method; to compute
$x_{k+1}$, one needs to solve a (generally nonlinear) system of equations,
rather more involved than the simple update required for the forward
Euler method.

(Margin note: At each step, one must find a zero of the function
$$ G(x_{k+1}) = x_{k+1} - x_k - h f(t_{k+1}, x_{k+1}) $$
using, for example, Newton’s method or the secant method. If $h$ is small and $f$ is not too wild, we might hope that we could get an initial guess $x_{k+1} \approx x_k$, or $x_{k+1} \approx x_k + h f(t_k, x_k)$. Note that this nonlinear iteration could require multiple evaluations of $f$ to advance the backward Euler method by one time step.)

One can improve on both Euler methods by averaging the updates
they make to $x_k$:

$$ x_{k+1} = x_k + \tfrac{1}{2} h \big( f(t_k, x_k) + f(t_{k+1}, x_{k+1}) \big). \tag{5.1} $$

This method is the trapezoid method, for it can be derived by integrating
the equation $x'(t) = f(t, x(t))$,

$$ \int_{t_k}^{t_{k+1}} x'(t)\, dt = \int_{t_k}^{t_{k+1}} f(t, x)\, dt, $$

and approximating the integral on the right using the trapezoid rule.
The fundamental theorem of calculus gives the exact formula for the
integral on the left, $x(t_{k+1}) - x(t_k)$. Together, this gives

$$ x(t_{k+1}) - x(t_k) \approx \frac{t_{k+1} - t_k}{2} \Big( f\big(t_k, x(t_k)\big) + f\big(t_{k+1}, x(t_{k+1})\big) \Big). \tag{5.2} $$

Replacing the inaccessible exact values $x(t_k)$ and $x(t_{k+1})$ with their
approximations $x_k$ and $x_{k+1}$, and using the time-step $h = t_{k+1} - t_k$,
equation (5.2) suggests

$$ x_{k+1} - x_k = \frac{h}{2} \big( f(t_k, x_k) + f(t_{k+1}, x_{k+1}) \big). $$

Rearranging this equation gives the trapezoid method (5.1) for $x_{k+1}$.
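
Both implicit methods require a nonlinear solve at every step. The Python sketch below uses Newton’s method for this, started from the forward Euler guess. Several things here are assumptions of the sketch rather than part of the notes: the blending parameter `theta` (1 for backward Euler, 1/2 for the trapezoid method), the hand-supplied derivative `dfdx`, and the test problem $x'(t) = -x(t)^2$, $x(0) = 1$, whose exact solution is $x(t) = 1/(1+t)$.

```python
import math

def newton(G, dG, guess, tol=1e-12, max_iter=50):
    """Find a zero of G by Newton's method, starting from guess."""
    z = guess
    for _ in range(max_iter):
        step = G(z) / dG(z)
        z -= step
        if abs(step) <= tol * (1.0 + abs(z)):
            break
    return z

def implicit_step(f, dfdx, t, x, h, theta):
    """Solve z = x + h*((1 - theta)*f(t, x) + theta*f(t + h, z)) for z.

    theta = 1 gives backward Euler; theta = 1/2 gives the trapezoid method.
    """
    known = x + h * (1.0 - theta) * f(t, x)
    G = lambda z: z - known - h * theta * f(t + h, z)
    dG = lambda z: 1.0 - h * theta * dfdx(t + h, z)
    return newton(G, dG, x + h * f(t, x))   # forward Euler initial guess

def integrate(f, dfdx, t0, x0, h, n, theta):
    """Take n implicit steps of size h and return the final iterate."""
    x = x0
    for k in range(n):
        x = implicit_step(f, dfdx, t0 + k * h, x, h, theta)
    return x

f = lambda t, x: -x * x          # illustrative ODE x' = -x^2
dfdx = lambda t, x: -2.0 * x     # partial derivative of f with respect to x
exact = 1.0 / (1.0 + 1.0)        # x(1) = 1/2
err_be = abs(integrate(f, dfdx, 0.0, 1.0, 0.1, 10, 1.0) - exact)
err_tr = abs(integrate(f, dfdx, 0.0, 1.0, 0.1, 10, 0.5) - exact)
print(err_be, err_tr)
```

With the same step size, the trapezoid method is noticeably more accurate than backward Euler on this problem. Each Newton iteration costs one new evaluation of $f$ and one of its derivative.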
Like the backward Euler method, the trapezoid rule is implicit,
due to the $f(t_{k+1}, x_{k+1})$ term. To obtain a similar explicit method,
replace $x_{k+1}$ by its approximation from the explicit Euler method:

$$ f(t_k + h, x_{k+1}) \approx f\big(t_k + h,\ x_k + h f(t_k, x_k)\big). $$

The result is called Heun’s method or the improved Euler method:

$$ x_{k+1} = x_k + \tfrac{1}{2} h \Big( f(t_k, x_k) + f\big(t_k + h,\ x_k + h f(t_k, x_k)\big) \Big). $$

Note that this method can be implemented using only two evaluations
of the function $f(t, x)$.
The modified Euler method takes a similar approach to Heun’s
method:

$$ x_{k+1} = x_k + h f\big(t_k + \tfrac{1}{2} h,\ x_k + \tfrac{1}{2} h f(t_k, x_k)\big), $$

which also requires two $f$ evaluations per step.
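
Both explicit two-evaluation methods are short to implement. In this Python sketch the function names and the driver `integrate` are illustrative choices, and the methods are compared on $x'(t) = x(t)$, $x(0) = 1$ at $t = 1$ with $h = 0.1$.

```python
import math

def heun_step(f, t, x, h):
    """One step of Heun's (improved Euler) method: two f evaluations."""
    s1 = f(t, x)
    s2 = f(t + h, x + h * s1)
    return x + 0.5 * h * (s1 + s2)

def modified_euler_step(f, t, x, h):
    """One step of the modified Euler method: two f evaluations."""
    s1 = f(t, x)
    return x + h * f(t + 0.5 * h, x + 0.5 * h * s1)

def integrate(step, f, t0, x0, h, n):
    """Apply the given one-step method n times and return the final iterate."""
    x = x0
    for k in range(n):
        x = step(f, t0 + k * h, x, h)
    return x

f = lambda t, x: x
heun = integrate(heun_step, f, 0.0, 1.0, 0.1, 10)
modified = integrate(modified_euler_step, f, 0.0, 1.0, 0.1, 10)
euler = (1.0 + 0.1) ** 10        # forward Euler on x' = x, for comparison
print(abs(heun - math.e), abs(modified - math.e), abs(euler - math.e))
```

For this linear problem both methods multiply $x_k$ by $1 + h + \tfrac{1}{2}h^2$ at every step, so they agree up to rounding; on nonlinear problems they generally differ.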


Additional function evaluations can deliver increasingly accurate
explicit one-step methods, an important family of which are known
as Runge–Kutta methods. In fact, the forward Euler and Heun methods
are examples of one- and two-stage Runge–Kutta methods. The
four-stage Runge–Kutta method is among the most famous one-step
methods:

$$ x_{k+1} = x_k + \tfrac{1}{6} h \big( k_1 + 2k_2 + 2k_3 + k_4 \big), $$

where

$$ k_1 = f(t_k, x_k) $$
$$ k_2 = f(t_k + \tfrac{1}{2} h,\ x_k + \tfrac{1}{2} h k_1) $$
$$ k_3 = f(t_k + \tfrac{1}{2} h,\ x_k + \tfrac{1}{2} h k_2) $$
$$ k_4 = f(t_k + h,\ x_k + h k_3). $$

(Margin note: One often encounters this method implemented as a subroutine called RK4 in old FORTRAN codes.)
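
The four-stage method translates directly into code. This Python sketch (the name `rk4_step` is chosen here) takes ten steps of size $h = 0.1$ on $x'(t) = x(t)$, $x(0) = 1$ and records the error at $t = 1$.

```python
import math

def rk4_step(f, t, x, h):
    """One step of the four-stage Runge-Kutta method: four f evaluations."""
    k1 = f(t, x)
    k2 = f(t + 0.5 * h, x + 0.5 * h * k1)
    k3 = f(t + 0.5 * h, x + 0.5 * h * k2)
    k4 = f(t + h, x + h * k3)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

x, h = 1.0, 0.1
for k in range(10):
    x = rk4_step(lambda t, x: x, k * h, x, h)
rk4_error = abs(x - math.e)
print(rk4_error)
```

Because the update uses only arithmetic on `x` and the `k` values, the same function should work unchanged for systems if `x` is an array type supporting addition and scalar multiplication (for example a NumPy array).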

We must address an important consideration: these more sophis-


ticated methods might potentially give better approximations of the
solution x (t), but they require more evaluations of the function f per
step than the forward Euler method. Many interesting applications
give functions f that are expensive to evaluate. One must make a
trade-off: methods with greater accuracy allow for larger time-step h,
but require more function evaluations per time step. To understand
the interplay between accuracy and computational expense, we re-
quire a more nuanced understanding of the convergence behavior of
these various methods.

5.2.3 Truncation Error


All explicit one-step methods can be written in the general form

$$ x_{k+1} = x_k + h\,\Phi(t_k, x_k; h). $$

Such methods incur two types of error:

1. The error due to the fact that even if the method were exact at $t_k$,
the updated value $x_{k+1}$ at $t_{k+1}$ will not be exact. This is called
truncation error, or local error.

2. In practice, the value xk is not exact. How is this discrepancy,


the fault of previous steps, magnified by the current step? This
accumulated error is called global error.

Let us make these notions of error more precise. At every given
time $t_k$, $k = 1, 2, \ldots$, we have some approximation $x_k$ to the value
$x(t_k)$. Denote the global error by

$$ e_k := x(t_k) - x_k. $$

We seek to understand this error as a function of the step size $h$.
To analyze the global error $e_k$, first consider the approximations
made at each iteration. In the last lecture, we saw that Euler’s
method made an error by approximating the derivative $x'(t_k)$ by a
finite difference,

$$ \frac{x(t_{k+1}) - x(t_k)}{h} \approx x'(t_k) = f(t_k, x(t_k)). $$

This type of error is made at every step. Generalize this error for all
explicit one-step methods.

Definition 5.1. The truncation error of an explicit one-step ODE integrator
is defined as

$$ T_k = \frac{x(t_{k+1}) - x(t_k)}{h} - \Phi(t_k, x(t_k); h). $$

If $T_k \to 0$ as $h \to 0$, the method is consistent. If $T_k = O(h^p)$, the
method has order-$p$ truncation error.

The key to understanding truncation error is to note that Tk is essen-


tially just a rearranged version of the general one-step method, except
that the exact solutions x (tk ) and x (tk+1 ) have replaced the approximations
xk and xk+1 . Thus, the truncation error can be regarded as a measure
of the error the method would make in a single step if supplied with
perfect data, x (tk ).

Example 5.6. It is simple to compute $T_k$ for the explicit Euler method:

$$ T_k = \frac{x(t_{k+1}) - x(t_k)}{h} - \Phi(t_k, x(t_k); h) $$
$$ \phantom{T_k} = \frac{x(t_{k+1}) - x(t_k)}{h} - f(t_k, x(t_k)) $$
$$ \phantom{T_k} = \frac{x(t_{k+1}) - x(t_k)}{h} - x'(t_k). $$

This last substitution, $f(t_k, x(t_k)) = x'(t_k)$, is valid because $f$ is evaluated
at the exact solution $x(t_k)$. (Recall that in general, $f(t_k, x_k) \ne x'(t_k)$.)
Assuming that $x(t) \in C^2[t_k, t_{k+1}]$, we can expand $x(t)$ in a
Taylor series about $t = t_k$ to obtain

$$ x(t_{k+1}) = x(t_k) + h\,x'(t_k) + \tfrac{1}{2} h^2 x''(\xi) $$

for some $\xi \in [t_k, t_{k+1}]$. Rearrange this to obtain a formula for $x'(t_k)$,
and substitute it into the formula for $T_k$, yielding

$$ T_k = \frac{x(t_{k+1}) - x(t_k)}{h} - x'(t_k) $$
$$ \phantom{T_k} = \frac{x(t_{k+1}) - x(t_k)}{h} - \frac{x(t_{k+1}) - x(t_k)}{h} + \tfrac{1}{2} h\,x''(\xi) $$
$$ \phantom{T_k} = \tfrac{1}{2} h\,x''(\xi). $$

Thus, the forward Euler method has truncation error $T_k = O(h)$, so
$T_k \to 0$ as $h \to 0$.
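
The formula $T_k = \tfrac{1}{2} h\,x''(\xi)$ can be checked numerically whenever the exact solution is known. The Python sketch below (with the illustrative function name `euler_truncation_error`) does so for $x'(t) = x(t)$, whose solution $x(t) = e^t$ has $x''(t) = e^t$. It evaluates the definition of $T_k$ directly and compares against $\tfrac{1}{2} h\,x''(t)$.

```python
import math

def euler_truncation_error(t, h):
    """T_k for forward Euler on x' = x, using the exact solution x(t) = e^t."""
    x = math.exp
    return (x(t + h) - x(t)) / h - x(t)      # Phi(t, x(t); h) = f(t, x(t)) = x(t)

t = 1.0
for h in [0.1, 0.05, 0.025]:
    T = euler_truncation_error(t, h)
    # The ratio to (h/2) x''(t) approaches 1 as h shrinks, since xi -> t.
    print(h, T, T / (0.5 * h * math.exp(t)))
```

Halving $h$ roughly halves $T_k$, as expected for a truncation error of order $h$.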

Similarly, one can find that Heun’s method and the modified Euler’s
method both have $O(h^2)$ truncation error, while the error for the
four-stage Runge–Kutta method is $O(h^4)$. Extrapolating from this
data, one might expect that a method requiring $m$ evaluations of $f$
can deliver $O(h^m)$ truncation error. Unfortunately, this is not true beyond
$m = 4$, hence the fame of the four-stage Runge–Kutta method.
All Runge–Kutta methods with $O(h^5)$ truncation error require at least
six evaluations of $f$.

(Margin note: As we will discuss later, the same function evaluations for higher order methods can be strategically combined to give two methods with different orders of accuracy. Comparing the estimates from two methods of different orders, one can estimate the error in the integration. Such estimates then allow one to adjust the time-step $h$ on the fly during an integration to control the error.)

Next we must address a fundamental question: Does $T_k \to 0$ as
$h \to 0$ ensure global convergence, $e_k \to 0$, for each $k = 1, 2, \ldots$?

lecture 34: Global Analysis of One-Step Integrators

5.3 Global error analysis for explicit one-step methods

The last lecture addressed the truncation error, $T_k$, of a one-step
method. Consistency (i.e., $T_k \to 0$ as $h \to 0$) is an obvious necessary
condition for the global error

$$ e_k = x(t_k) - x_k $$

to converge as $h \to 0$. In this lecture, we wish to understand this key
question:
Is consistency sufficient for convergence of the global error as $h \to 0$?
As before, consider the general one step method

$$ x_{k+1} = x_k + h\,\Phi(t_k, x_k; h) $$

where the choice of $\Phi(t_k, x_k; h)$ defines the specific algorithm. We can
rearrange the formula for truncation error,

$$ T_k = \frac{x(t_{k+1}) - x(t_k)}{h} - \Phi(t_k, x(t_k); h), $$

to obtain an expression for $x(t_{k+1})$,

$$ x(t_{k+1}) = x(t_k) + h\,\Phi(t_k, x(t_k); h) + h T_k. $$

This formula is comparable to the one-step method itself,

$$ x_{k+1} = x_k + h\,\Phi(t_k, x_k; h). $$

Combining these expressions gives a formula for the global error,

$$ e_{k+1} = x(t_{k+1}) - x_{k+1} $$
$$ \phantom{e_{k+1}} = x(t_k) - x_k + h \big( \Phi(t_k, x(t_k); h) - \Phi(t_k, x_k; h) \big) + h T_k $$
$$ \phantom{e_{k+1}} = e_k + h \big( \Phi(t_k, x(t_k); h) - \Phi(t_k, x_k; h) \big) + h T_k. $$

Recall Example 5.5, $x'(t) = 1 + x(t)^2$, from Section 5.2.1. The solution of
that equation blew up in finite time, while the iterates of Euler’s method were
always finite. This is disappointing: for some equations, we can essentially
have infinite global error! Thus, to get a useful error bound,
we must make an assumption that the ODE is well behaved. Suppose
we are integrating our equation from $t_0$ to some fixed $t_{\rm final}$. Then
assume there exists a constant $L_\Phi$, depending on the equation, the time
interval, and the particular method (but not $h$), such that

$$ |\Phi(t, u; h) - \Phi(t, v; h)| \le L_\Phi\,|u - v| $$

for all $t \in [t_0, t_{\rm final}]$ and all $u, v \in \mathbb{R}$. This assumption is closely
related to the Lipschitz condition that plays an essential role in the
theorem of existence of solutions given in Section 5.1. For ‘nice’ ODEs
and reasonable methods $\Phi$, this condition is not difficult to satisfy.

(Margin note: For example, for the forward Euler method, $L_\Phi = L$, where $L$ is the usual Lipschitz constant for the ordinary differential equation.)

This assumption is precisely what we need to bound the difference
between $\Phi(t_k, x(t_k); h)$ and $\Phi(t_k, x_k; h)$ that appears in the formula for
$e_k$. In particular, we now have

$$ |e_{k+1}| = \big| e_k + h \big( \Phi(t_k, x(t_k); h) - \Phi(t_k, x_k; h) \big) + h T_k \big| $$
$$ \phantom{|e_{k+1}|} \le |e_k| + h\,\big| \Phi(t_k, x(t_k); h) - \Phi(t_k, x_k; h) \big| + h\,|T_k| $$
$$ \phantom{|e_{k+1}|} \le |e_k| + h L_\Phi\,|x(t_k) - x_k| + h\,|T_k| $$
$$ \phantom{|e_{k+1}|} = |e_k| + h L_\Phi\,|e_k| + h\,|T_k| $$
$$ \phantom{|e_{k+1}|} = |e_k|(1 + h L_\Phi) + h\,|T_k|. $$


Suppose we are interested in all iterates from $x_0$ up to $x_n$ for some $n$.
Then let $T$ denote the magnitude of the maximum truncation error
over all those iterates:

$$ T := \max_{0 \le k \le n} |T_k|. $$

We now build up an expression for $e_n$ iteratively:

$$ |e_0| = |x(t_0) - x_0| = 0 $$
$$ |e_1| \le h\,|T_0| \le hT $$
$$ |e_2| \le |e_1|(1 + hL_\Phi) + h\,|T_1| \le hT(1 + hL_\Phi) + hT $$
$$ |e_3| \le |e_2|(1 + hL_\Phi) + h\,|T_2| \le hT(1 + hL_\Phi)^2 + hT(1 + hL_\Phi) + hT $$
$$ \vdots $$
$$ |e_n| \le hT \sum_{k=0}^{n-1} (1 + hL_\Phi)^k. $$

Notice that this bound for $|e_n|$ is a finite geometric series, and thus
we have the convenient formula

$$ |e_n| \le hT\,\frac{(1 + hL_\Phi)^n - 1}{(1 + hL_\Phi) - 1} = \frac{T}{L_\Phi}\Big( (1 + hL_\Phi)^n - 1 \Big) < \frac{T}{L_\Phi}\Big( e^{nhL_\Phi} - 1 \Big). \tag{5.3} $$

(Margin note: Recall that $e^{\alpha x} = 1 + \alpha x + \tfrac{1}{2}(\alpha x)^2 + \tfrac{1}{3!}(\alpha x)^3 + \cdots$ and so, since $hL_\Phi > 0$, $1 + hL_\Phi < e^{hL_\Phi}$.)

(This result and proof are given as Theorem 12.2 in the text by Süli
and Mayers.)
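
The bound (5.3) can be compared with the actual error in a case where everything is computable. The Python sketch below (function name chosen here) uses forward Euler on $x'(t) = x(t)$, $x(0) = 1$, for which $L_\Phi = L = 1$ and, by Example 5.6, $|T_k| = \tfrac{1}{2} h\,e^{\xi} \le \tfrac{1}{2} h\,e^{t_n}$ for every step up to $t_n$.

```python
import math

def euler_error_and_bound(h, n):
    """Actual forward Euler error at t_n = n*h and the bound (5.3), for x' = x."""
    x = 1.0
    for _ in range(n):
        x += h * x                      # Euler step for x' = x
    t_n = n * h
    error = abs(math.exp(t_n) - x)
    L_phi = 1.0                         # Lipschitz constant of f(t, x) = x
    T = 0.5 * h * math.exp(t_n)         # upper bound on |T_k| for steps up to t_n
    bound = (T / L_phi) * (math.exp(n * h * L_phi) - 1.0)
    return error, bound

for h, n in [(0.1, 10), (0.05, 20), (0.1, 50)]:
    print(h, n, *euler_error_and_bound(h, n))
```

The bound holds in each case but is pessimistic, increasingly so over the longer interval.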
There are two key lessons to be learned from this bound on |en |.

• Fix h. The bound suggests that |en | can grow exponentially as n


increases. As you continue your numerical integration indefinitely,
the error can be quite severe, even for well-behaved differential
equations.

• Focus attention on some fixed target time $t_{\rm final}$, and consider time
steps

$$ h := \frac{t_{\rm final} - t_0}{n}, $$

so that $x_n \approx x(t_{\rm final})$. As $n \to \infty$, note that $h \to 0$, and in this case
$nh = t_{\rm final} - t_0$ is fixed. Thus in the bound

$$ |e_n| < \frac{T}{L_\Phi}\Big( e^{nhL_\Phi} - 1 \Big), $$

the term $e^{nhL_\Phi} - 1$ is a constant, since $L_\Phi$ is independent of the step
size $h$, and so $|e_n| = O(T)$. Hence if the truncation error converges,
$T \to 0$ as $h \to 0$, then the global error at $t_{\rm final}$ will also converge.
Moreover, if $T_k = O(h^p)$, then the global error at $t_{\rm final}$ will also be
$O(h^p)$. Remember this beautiful, vital fact: the global error reduces at
the same rate as the truncation error for one-step methods!

The plots in Figure 5.4 confirm these observations. Again for the
model problem $x'(t) = x(t)$ with $(t_0, x_0) = (0, 1)$, the figure shows
the error for Euler’s method ($T_k = O(h)$), Heun’s method ($T_k = O(h^2)$),
and the four-stage Runge–Kutta method ($T_k = O(h^4)$) for
$t \in [0, 10]$. Note the logarithmic scale of the vertical axes in these
plots. As $n$ increases, the error grows exponentially in all these cases.
However, as $h$ is reduced, the error gets smaller at all fixed times. The
extent of this error reduction is what one would expect from the local
truncation errors. All of the plots start with $h = 0.1$ (black dots). The
other curves show the result of repeatedly cutting $h$ in half, giving
$h = 0.05$ (blue), $h = 0.025$ (red), $h = 0.0125$ (cyan), and $h = 0.00625$
(magenta).

• If $T_k$ is proportional to $h$ (forward Euler), then cutting $h$ in half
should scale $T_k$ by 1/2.

• If $T_k$ is proportional to $h^2$ (Heun’s method), then cutting $h$ in half
should scale $T_k$ by 1/4.

• If $T_k$ is proportional to $h^4$ (four-stage Runge–Kutta), then cutting $h$
in half should scale $T_k$ by 1/16.
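
These scaling factors can be observed directly in the global error. The Python sketch below (helper names are illustrative) integrates $x'(t) = x(t)$, $x(0) = 1$ to $t = 1$ with $n$ and $2n$ steps, and reports $\log_2$ of the ratio of the two errors as an estimate of the order.

```python
import math

def euler_step(f, t, x, h):
    return x + h * f(t, x)

def heun_step(f, t, x, h):
    s1 = f(t, x)
    return x + 0.5 * h * (s1 + f(t + h, x + h * s1))

def rk4_step(f, t, x, h):
    k1 = f(t, x)
    k2 = f(t + 0.5 * h, x + 0.5 * h * k1)
    k3 = f(t + 0.5 * h, x + 0.5 * h * k2)
    k4 = f(t + h, x + h * k3)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def error_at_1(step, n):
    """Global error at t = 1 for x' = x, x(0) = 1, using n steps of size 1/n."""
    f = lambda t, x: x
    h, x = 1.0 / n, 1.0
    for k in range(n):
        x = step(f, k * h, x, h)
    return abs(math.e - x)

def observed_order(step, n):
    """log2 of the error ratio when h is cut in half (n -> 2n steps)."""
    return math.log2(error_at_1(step, n) / error_at_1(step, 2 * n))

for name, step in [("Euler", euler_step), ("Heun", heun_step), ("RK4", rk4_step)]:
    print(name, observed_order(step, 10))
```

The printed values should be close to 1, 2, and 4 respectively, matching the orders of the truncation errors.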

5.4 Adaptive Time-Step Selection

One-step methods make it simple to change the time-step $h$ at each
iteration. For complicated nonlinear problems, it is quite natural that
some regions (especially when $x'$ is large) will merit a small time-step
$h$, yet other regions, where there is less change in the solution, can
easily be handled with a large value of $h$.

[Figure 5.4: Global errors for three explicit one-step methods applied to $x'(t) = x(t)$: the forward Euler method (first order), Heun’s method (second order), and the four-stage Runge–Kutta method (fourth order). Each plot shows the evolution of the error for five values of the time-step $h$ ($h = 0.1$, $0.05$, $0.025$, $0.0125$, $0.00625$). As $k$ increases, the error grows exponentially. At a fixed $t$, the error reduces as $h$ is reduced, at the rate given by the truncation error. Three panels, one per method; horizontal axis: $t_k$ from 0 to 10; vertical axis: $|e_k|$ on a logarithmic scale from $10^{-15}$ to $10^{5}$.]
In the 1960s, Erwin Fehlberg suggested a beautiful way in which
the step-size could be automatically adjusted at each step. There
exist Runge–Kutta methods of order 4 and order 5 that can both be
generated with the same six evaluations of $f$. (Recall that any fifth-order
Runge–Kutta method requires at least six function evaluations.)
First, we define the necessary $f$ evaluations for this method:

$$ k_1 = f(t_k,\ x_k) $$
$$ k_2 = f\big(t_k + \tfrac{1}{4}h,\ x_k + \tfrac{1}{4}hk_1\big) $$
$$ k_3 = f\big(t_k + \tfrac{3}{8}h,\ x_k + \tfrac{3}{32}hk_1 + \tfrac{9}{32}hk_2\big) $$
$$ k_4 = f\big(t_k + \tfrac{12}{13}h,\ x_k + \tfrac{1932}{2197}hk_1 - \tfrac{7200}{2197}hk_2 + \tfrac{7296}{2197}hk_3\big) $$
$$ k_5 = f\big(t_k + h,\ x_k + \tfrac{439}{216}hk_1 - 8hk_2 + \tfrac{3680}{513}hk_3 - \tfrac{845}{4104}hk_4\big) $$
$$ k_6 = f\big(t_k + \tfrac{1}{2}h,\ x_k - \tfrac{8}{27}hk_1 + 2hk_2 - \tfrac{3544}{2565}hk_3 + \tfrac{1859}{4104}hk_4 - \tfrac{11}{40}hk_5\big). $$

The following method has $O(h^5)$ truncation error:

$$ x_{k+1} = x_k + h\Big( \tfrac{16}{135}k_1 + \tfrac{6656}{12825}k_3 + \tfrac{28561}{56430}k_4 - \tfrac{9}{50}k_5 + \tfrac{2}{55}k_6 \Big). $$

The $f$ evaluations used to compute these $k_j$ values can be combined
in a different manner to obtain the following approximation, which
only has $O(h^4)$ truncation error:

$$ \hat{x}_{k+1} = x_k + h\Big( \tfrac{25}{216}k_1 + \tfrac{1408}{2565}k_3 + \tfrac{2197}{4104}k_4 - \tfrac{1}{5}k_5 \Big). $$

Why would one be interested in an $O(h^4)$ method when an $O(h^5)$ approximation
is available? By inspecting $x_{k+1} - \hat{x}_{k+1}$, we can see how
much the extra order of accuracy changes the solution. A significant
difference signals that the step size $h$ may be too large; software will
react by reducing the step size before proceeding. This technology
is implemented in MATLAB’s ode45 routine. (The ode23 routine is
similar, but based on a pair of second and third order methods.)
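
The following Python sketch shows how the pair of estimates might drive step-size selection. The step function `rkf45_step` uses the coefficients listed above. The controller in `adaptive` (halve $h$ when $|x_{k+1} - \hat{x}_{k+1}|$ exceeds a tolerance, double it after a very easy step) is a deliberately crude assumption made for illustration. It is not the rule used by ode45 or any other library.

```python
import math

def rkf45_step(f, t, x, h):
    """One Fehlberg step; returns (fifth-order value, fourth-order value)."""
    k1 = f(t, x)
    k2 = f(t + h / 4, x + h * k1 / 4)
    k3 = f(t + 3 * h / 8, x + h * (3 * k1 + 9 * k2) / 32)
    k4 = f(t + 12 * h / 13, x + h * (1932 * k1 - 7200 * k2 + 7296 * k3) / 2197)
    k5 = f(t + h, x + h * (439 * k1 / 216 - 8 * k2 + 3680 * k3 / 513 - 845 * k4 / 4104))
    k6 = f(t + h / 2, x + h * (-8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565
                               + 1859 * k4 / 4104 - 11 * k5 / 40))
    x5 = x + h * (16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
                  - 9 * k5 / 50 + 2 * k6 / 55)
    x4 = x + h * (25 * k1 / 216 + 1408 * k3 / 2565 + 2197 * k4 / 4104 - k5 / 5)
    return x5, x4

def adaptive(f, t0, x0, t_end, h, tol):
    """Integrate to t_end, rejecting any step with |x5 - x4| > tol.

    Illustrative controller only: halve h on rejection, double it when the
    estimate is far below tol.  Returns (final value, accepted step count).
    """
    t, x, steps = t0, x0, 0
    while t < t_end:
        h = min(h, t_end - t)           # do not step past t_end
        x5, x4 = rkf45_step(f, t, x, h)
        if abs(x5 - x4) > tol:
            h *= 0.5                    # reject the step and retry
            continue
        t, x, steps = t + h, x5, steps + 1
        if abs(x5 - x4) < tol / 64:
            h *= 2.0
    return x, steps

x_final, steps = adaptive(lambda t, x: x, 0.0, 1.0, 1.0, 0.5, 1e-8)
print(x_final, math.e, steps)
```

Production codes typically scale the step by a factor derived from the error estimate and the order of the method rather than by fixed halving and doubling.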
Another popular fifth-order method, designed by Cash and Karp
(1990), uses six carefully chosen function evaluations that can be combined
to also provide $O(h)$, $O(h^2)$, $O(h^3)$, and $O(h^4)$ approximations.

lecture 35: Linear Multistep Methods: Truncation Error

5.5 Linear multistep methods: general form, local error

One-step methods construct an approximate solution $x_{k+1} \approx x(t_{k+1})$
using only one previous approximation, $x_k$. This approach enjoys
the virtue that the step size $h$ can be changed at every iteration, if
desired, thus providing a mechanism for error control. This flexibility
comes at a price: For each order of accuracy in the truncation error,
each step must perform at least one new evaluation of the derivative
function $f(t, x)$. This might not sound particularly onerous, but in
many practical problems, $f$ evaluations are terribly time-consuming.
A classic example is the N-body problem, which arises in models
ranging from molecular dynamics to galactic evolution. Such models
give rise to $N$ coupled second order nonlinear differential equations,
where the function $f$ measures the forces between the $N$ different
particles. An evaluation of $f$ requires $O(N^2)$ arithmetic operations to
compute, costly indeed when $N$ is in the millions. Every $f$ evaluation
counts.

(Margin note: A landmark improvement to this $N^2$ approach, the fast multipole method, was developed by Leslie Greengard and Vladimir Rokhlin in the late 1980s.)

One could potentially improve this situation by re-using $f$ evaluations
from previous steps of the ODE integrator, rather than always
requiring $f$ to be evaluated at multiple new $(t, x)$ values at each step
(as is the case, for example, with higher order Runge–Kutta methods).
Consider the method

$$ x_{k+1} = x_k + h \Big( \tfrac{3}{2} f(t_k, x_k) - \tfrac{1}{2} f(t_{k-1}, x_{k-1}) \Big), $$

where $h$ is the step size, $h = t_{k+1} - t_k = t_k - t_{k-1}$. Here $x_{k+1}$ is determined
from the two previous values, $x_k$ and $x_{k-1}$. Unlike Runge–Kutta
methods, $f$ is not evaluated at points between $t_k$ and $t_{k+1}$.
Rather, each step requires only one new $f$ evaluation, since $f(t_{k-1}, x_{k-1})$
would have been computed already at the previous step. Hence this
method has roughly the same computational requirements as Euler’s
method, though soon we will see that its truncation error is $O(h^2)$.
Heun’s method and the modified Euler method attained this same
truncation error, but required two new $f$ evaluations at each step.
Several drawbacks to this new scheme are evident: the step size
is fixed throughout the iteration, and values for both $x_0$ and $x_1$ are
needed before starting the method. The former concern can be addressed
in practice through interpolation techniques. To handle the
latter concern, initial data can be generated using a one-step method
with small step size $h$. In some applications, including some problems
in celestial mechanics, an asymptotic series expansion of the
solution, accurate near $t \approx t_0$, can provide suitable initial data.

(Margin note: Step-size adjustment is possible, e.g., interpolating $(t_{k-1}, x_{k-1})$ and $(t_k, x_k)$ to get intermediate values, but step-size adjustment still requires more care than with one-step methods.)
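
A Python sketch of this two-step method follows, including one way to handle the start-up issue. Generating $x_1$ with a single step of Heun’s method is a choice made for this sketch, and the function name `ab2` and the evaluation counter are illustrative.

```python
import math

def ab2(f, t0, x0, h, n):
    """Integrate n steps with x_{k+1} = x_k + h*(3/2 f_k - 1/2 f_{k-1}).

    The starting value x_1 comes from one step of Heun's method (an
    assumption of this sketch).  Returns (x_n, number of f evaluations).
    """
    f_prev = f(t0, x0)
    x = x0 + 0.5 * h * (f_prev + f(t0 + h, x0 + h * f_prev))   # x_1 by Heun
    evaluations = 2
    for k in range(1, n):
        f_curr = f(t0 + k * h, x)        # the only new f evaluation this step
        evaluations += 1
        x = x + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr                  # re-used at the next step
    return x, evaluations

f = lambda t, x: x
x10, evals10 = ab2(f, 0.0, 1.0, 0.1, 10)
x20, evals20 = ab2(f, 0.0, 1.0, 0.05, 20)
order = math.log2(abs(math.e - x10) / abs(math.e - x20))
print(evals10, evals20, order)
```

After the start-up step, the cost is one $f$ evaluation per step, the same as Euler’s method, while the observed order on this test problem is close to 2.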

5.5.1 General linear multistep methods


This section considers a general class of integrators known as linear
multistep methods.

(Margin note: These notes on linear multistep methods draw heavily from the excellent presentation in Süli and Mayers, An Introduction to Numerical Analysis, Cambridge University Press, 2003.)

Definition 5.2. A general $m$-step linear multistep method has the form

$$ \sum_{j=0}^{m} \alpha_j x_{k+j} = h \sum_{j=0}^{m} \beta_j f(t_{k+j}, x_{k+j}), $$

with $\alpha_m \ne 0$. If $\beta_m \ne 0$, then the formula for $x_{k+m}$ involves $x_{k+m}$ on
the right hand side, so the method is implicit; otherwise, the method
is explicit. A final convention requires $|\alpha_0| + |\beta_0| \ne 0$, for if $\alpha_0 = \beta_0 = 0$,
then we actually have an $m - 1$ step method masquerading
as an $m$-step method. As $f$ is only evaluated at $(t_j, x_j)$, we adopt the
abbreviation

$$ f_j = f(t_j, x_j). $$

By this definition most Runge–Kutta methods, though one-step
methods, are not multistep methods. Euler’s method is an example of
a one-step method that also fits this multistep template. Here are a
few examples of linear multistep methods:

Euler’s method: $x_{k+1} - x_k = h f_k$
    $\alpha_0 = -1$, $\alpha_1 = 1$; $\beta_0 = 1$, $\beta_1 = 0$.

Trapezoid rule: $x_{k+1} - x_k = \tfrac{h}{2}(f_k + f_{k+1})$
    $\alpha_0 = -1$, $\alpha_1 = 1$; $\beta_0 = \tfrac{1}{2}$, $\beta_1 = \tfrac{1}{2}$.

Adams–Bashforth: $x_{k+2} - x_{k+1} = \tfrac{h}{2}(3 f_{k+1} - f_k)$
    $\alpha_0 = 0$, $\alpha_1 = -1$, $\alpha_2 = 1$; $\beta_0 = -\tfrac{1}{2}$, $\beta_1 = \tfrac{3}{2}$, $\beta_2 = 0$.

The ‘Adams–Bashforth’ method presented above is the 2-step
example of a broader class of Adams–Bashforth formulas. The 4-step
Adams–Bashforth method takes the form

$$ x_{k+4} = x_{k+3} + \frac{h}{24}\big( 55 f_{k+3} - 59 f_{k+2} + 37 f_{k+1} - 9 f_k \big) $$

for which

$$ \alpha_0 = 0,\ \alpha_1 = 0,\ \alpha_2 = 0,\ \alpha_3 = -1,\ \alpha_4 = 1; $$
$$ \beta_0 = -\tfrac{9}{24},\ \beta_1 = \tfrac{37}{24},\ \beta_2 = -\tfrac{59}{24},\ \beta_3 = \tfrac{55}{24},\ \beta_4 = 0. $$

The Adams–Moulton methods are a parallel class of implicit formulas.
The 3-step version of this method is

$$ x_{k+3} = x_{k+2} + \frac{h}{24}\big( 9 f_{k+3} + 19 f_{k+2} - 5 f_{k+1} + f_k \big), $$

giving

$$ \alpha_0 = 0,\ \alpha_1 = 0,\ \alpha_2 = -1,\ \alpha_3 = 1; $$
$$ \beta_0 = \tfrac{1}{24},\ \beta_1 = -\tfrac{5}{24},\ \beta_2 = \tfrac{19}{24},\ \beta_3 = \tfrac{9}{24}. $$

5.5.2 Truncation error for linear multistep methods

Recall that the truncation error of one-step methods of the form
$x_{k+1} = x_k + h\,\Phi(t_k, x_k; h)$ was given by

$$ T_k = \frac{x(t_{k+1}) - x(t_k)}{h} - \Phi(t_k, x(t_k); h). $$

An analogous formula is associated with general linear multistep methods,
based on substituting the exact solution $x(t_k)$ for the approximation
$x_k$, and rearranging terms.

Definition 5.3. The truncation error for the linear multistep method

$$ \sum_{j=0}^{m} \alpha_j x_{k+j} = h \sum_{j=0}^{m} \beta_j f(t_{k+j}, x_{k+j}) $$

is given by the formula

$$ T_k = \frac{\sum_{j=0}^{m} \big[ \alpha_j x(t_{k+j}) - h \beta_j f(t_{k+j}, x(t_{k+j})) \big]}{h \sum_{j=0}^{m} \beta_j}. $$

The $\sum_{j=0}^{m} \beta_j$ term in the denominator is a normalization term; were
it absent, multiplying the entire multistep formula by a constant
would alter the truncation error, but not the iterates $x_k$.
To get a simple form of the truncation error, we turn to Taylor
series:

$$ \begin{aligned}
x(t_{k+1}) &= x(t_k + h) = x(t_k) + h\,x'(t_k) + \tfrac{h^2}{2!} x''(t_k) + \tfrac{h^3}{3!} x'''(t_k) + \tfrac{h^4}{4!} x^{(4)}(t_k) + \cdots \\
x(t_{k+2}) &= x(t_k + 2h) = x(t_k) + 2h\,x'(t_k) + \tfrac{2^2 h^2}{2!} x''(t_k) + \tfrac{2^3 h^3}{3!} x'''(t_k) + \tfrac{2^4 h^4}{4!} x^{(4)}(t_k) + \cdots \\
x(t_{k+3}) &= x(t_k + 3h) = x(t_k) + 3h\,x'(t_k) + \tfrac{3^2 h^2}{2!} x''(t_k) + \tfrac{3^3 h^3}{3!} x'''(t_k) + \tfrac{3^4 h^4}{4!} x^{(4)}(t_k) + \cdots \\
&\ \ \vdots \\
x(t_{k+m}) &= x(t_k + mh) = x(t_k) + mh\,x'(t_k) + \tfrac{m^2 h^2}{2!} x''(t_k) + \tfrac{m^3 h^3}{3!} x'''(t_k) + \tfrac{m^4 h^4}{4!} x^{(4)}(t_k) + \cdots
\end{aligned} $$

and also

$$ \begin{aligned}
f(t_{k+1}, x(t_{k+1})) &= x'(t_k + h) = x'(t_k) + h\,x''(t_k) + \tfrac{h^2}{2!} x'''(t_k) + \tfrac{h^3}{3!} x^{(4)}(t_k) + \cdots \\
f(t_{k+2}, x(t_{k+2})) &= x'(t_k + 2h) = x'(t_k) + 2h\,x''(t_k) + \tfrac{2^2 h^2}{2!} x'''(t_k) + \tfrac{2^3 h^3}{3!} x^{(4)}(t_k) + \cdots \\
f(t_{k+3}, x(t_{k+3})) &= x'(t_k + 3h) = x'(t_k) + 3h\,x''(t_k) + \tfrac{3^2 h^2}{2!} x'''(t_k) + \tfrac{3^3 h^3}{3!} x^{(4)}(t_k) + \cdots \\
&\ \ \vdots \\
f(t_{k+m}, x(t_{k+m})) &= x'(t_k + mh) = x'(t_k) + mh\,x''(t_k) + \tfrac{m^2 h^2}{2!} x'''(t_k) + \tfrac{m^3 h^3}{3!} x^{(4)}(t_k) + \cdots.
\end{aligned} $$

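Substituting these expansions into the expression for $T_k$ (eventually) yields a convenient formula:
\[
\begin{aligned}
\Big(\sum_{j=0}^m \beta_j\Big) T_k
&= \frac{1}{h} \sum_{j=0}^m \Big[\alpha_j x(t_{k+j}) - h\beta_j f(t_{k+j}, x(t_{k+j}))\Big] \\
&= h^{-1}\Big[\sum_{j=0}^m \alpha_j\Big] x(t_k) + \sum_{\ell=0}^{\infty} h^{\ell} \Big[\sum_{j=0}^m \Big(\frac{1}{(\ell+1)!} j^{\ell+1}\alpha_j - \frac{1}{\ell!} j^{\ell}\beta_j\Big)\Big] x^{(\ell+1)}(t_k) \\
&= \frac{1}{h}\Big[\sum_{j=0}^m \alpha_j\Big] x(t_k) + \Big[\sum_{j=0}^m j\alpha_j - \sum_{j=0}^m \beta_j\Big] x'(t_k) \\
&\quad + h\Big[\sum_{j=0}^m \frac{j^2}{2}\alpha_j - \sum_{j=0}^m j\beta_j\Big] x''(t_k) \\
&\quad + h^2\Big[\sum_{j=0}^m \frac{j^3}{6}\alpha_j - \sum_{j=0}^m \frac{j^2}{2}\beta_j\Big] x'''(t_k) \\
&\quad + h^3\Big[\sum_{j=0}^m \frac{j^4}{24}\alpha_j - \sum_{j=0}^m \frac{j^3}{6}\beta_j\Big] x^{(4)}(t_k) + \cdots.
\end{aligned}
\]
In particular, the coefficient of the $h^{\ell}$ term is simply
\[ \sum_{j=0}^m \frac{j^{\ell+1}}{(\ell+1)!}\alpha_j - \sum_{j=0}^m \frac{j^{\ell}}{\ell!}\beta_j \]
for all nonnegative integers $\ell$.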


Definition 5.4. A linear multistep method is consistent if $T_k \to 0$ as $h \to 0$.

The formula for $T_k$ gives an easy condition to check the consistency of a method.

Theorem 5.2. An m-step linear multistep method of the form
\[ \sum_{j=0}^m \alpha_j x_{k+j} = h \sum_{j=0}^m \beta_j f_{k+j} \]
is consistent if and only if
\[ \sum_{j=0}^m \alpha_j = 0 \quad\text{and}\quad \sum_{j=0}^m j\alpha_j = \sum_{j=0}^m \beta_j. \]

If either of the conditions in Theorem 5.2 is violated, then the formula for the truncation error contains a term that either grows like $1/h$ or remains constant as $h \to 0$. The Taylor analysis of the truncation error yields even more information, though: inspecting the coefficients multiplying $h$, $h^2$, etc. reveals easy conditions for determining the overall truncation error of a linear multistep method.

Definition 5.5. A linear multistep method is order-p accurate if $T_k = O(h^p)$ as $h \to 0$.

Theorem 5.3. An m-step linear multistep method is order-p accurate if and only if it is consistent and
\[ \sum_{j=0}^m \frac{j^{\ell+1}}{(\ell+1)!}\alpha_j = \sum_{j=0}^m \frac{j^{\ell}}{\ell!}\beta_j \]
for all $\ell = 1, \ldots, p-1$.

The next few examples show this theorem in action, applying it to some of the linear multistep methods discussed earlier.

Example 5.7 (Forward Euler method).
\[ \alpha_0 = -1,\ \alpha_1 = 1; \qquad \beta_0 = 1,\ \beta_1 = 0. \]
Clearly $\alpha_0 + \alpha_1 = -1 + 1 = 0$ and $(0\alpha_0 + 1\alpha_1) - (\beta_0 + \beta_1) = 0$. Thus the method is consistent.

When analyzed as a one-step method, the forward Euler method had truncation error $T_k = O(h)$. The same result should hold when we analyze the algorithm as a linear multistep method. Indeed,
\[ \big(\tfrac12 0^2 \alpha_0 + \tfrac12 1^2 \alpha_1\big) - \big(0\beta_0 + 1\beta_1\big) = \tfrac12 \ne 0. \]
Thus, $T_k = O(h)$.

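Example 5.8 (Trapezoid method).
\[ \alpha_0 = -1,\ \alpha_1 = 1; \qquad \beta_0 = \tfrac12,\ \beta_1 = \tfrac12. \]
Again, consistency is easy to verify: $\alpha_0 + \alpha_1 = -1 + 1 = 0$ and $(0\alpha_0 + 1\alpha_1) - (\beta_0 + \beta_1) = 1 - 1 = 0$. Furthermore,
\[ \big(\tfrac12 0^2 \alpha_0 + \tfrac12 1^2 \alpha_1\big) - \big(0\beta_0 + 1\beta_1\big) = \tfrac12 - \tfrac12 = 0, \]
so $T_k = O(h^2)$, but
\[ \big(\tfrac16 0^3 \alpha_0 + \tfrac16 1^3 \alpha_1\big) - \big(\tfrac12 0^2 \beta_0 + \tfrac12 1^2 \beta_1\big) = \tfrac16 - \tfrac14 \ne 0, \]
so the trapezoid rule is not third order accurate: $T_k = O(h^2)$.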

Example 5.9 (2-step Adams–Bashforth).
\[ \alpha_0 = 0,\ \alpha_1 = -1,\ \alpha_2 = 1; \qquad \beta_0 = -\tfrac12,\ \beta_1 = \tfrac32,\ \beta_2 = 0. \]
Does this explicit 2-step method deliver $O(h^2)$ accuracy, like the (implicit) Trapezoid method? Consistency follows easily:
\[ \alpha_0 + \alpha_1 + \alpha_2 = 0 - 1 + 1 = 0 \]
and
\[ (0\alpha_0 + 1\alpha_1 + 2\alpha_2) - (\beta_0 + \beta_1) = 1 - 1 = 0. \]
The second order condition is also satisfied,
\[ \big(\tfrac12 0^2 \alpha_0 + \tfrac12 1^2 \alpha_1 + \tfrac12 2^2 \alpha_2\big) - \big(0\beta_0 + 1\beta_1\big) = \tfrac32 - \tfrac32 = 0, \]
but not the third order,
\[ \big(\tfrac16 0^3 \alpha_0 + \tfrac16 1^3 \alpha_1 + \tfrac16 2^3 \alpha_2\big) - \big(\tfrac12 0^2 \beta_0 + \tfrac12 1^2 \beta_1\big) = \tfrac76 - \tfrac34 \ne 0. \]
Thus $T_k = O(h^2)$: the method is second order.

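Example 5.10 (4-step Adams–Bashforth).
\[ \alpha_0 = 0,\ \alpha_1 = 0,\ \alpha_2 = 0,\ \alpha_3 = -1,\ \alpha_4 = 1; \qquad \beta_0 = -\tfrac{9}{24},\ \beta_1 = \tfrac{37}{24},\ \beta_2 = -\tfrac{59}{24},\ \beta_3 = \tfrac{55}{24},\ \beta_4 = 0. \]
Consistency holds, since $\sum \alpha_j = -1 + 1 = 0$ and
\[ \sum_{j=0}^4 j\alpha_j - \sum_{j=0}^4 \beta_j = \big(3(-1) + 4(1)\big) - \big(-\tfrac{9}{24} + \tfrac{37}{24} - \tfrac{59}{24} + \tfrac{55}{24}\big) = 1 - 1 = 0. \]
The coefficients of $h$, $h^2$, and $h^3$ in the expansion for $T_k$ all vanish:
\[
\begin{aligned}
\sum_{j=0}^4 \tfrac12 j^2 \alpha_j - \sum_{j=0}^4 j\beta_j
&= \big(\tfrac{3^2}{2}(-1) + \tfrac{4^2}{2}(1)\big) - \big(0(-\tfrac{9}{24}) + 1(\tfrac{37}{24}) + 2(-\tfrac{59}{24}) + 3(\tfrac{55}{24})\big) = \tfrac72 - \tfrac{84}{24} = 0; \\
\sum_{j=0}^4 \tfrac16 j^3 \alpha_j - \sum_{j=0}^4 \tfrac12 j^2\beta_j
&= \big(\tfrac{3^3}{6}(-1) + \tfrac{4^3}{6}(1)\big) - \big(\tfrac{1^2}{2}(\tfrac{37}{24}) + \tfrac{2^2}{2}(-\tfrac{59}{24}) + \tfrac{3^2}{2}(\tfrac{55}{24})\big) = \tfrac{37}{6} - \tfrac{148}{24} = 0; \\
\sum_{j=0}^4 \tfrac1{24} j^4 \alpha_j - \sum_{j=0}^4 \tfrac16 j^3\beta_j
&= \big(\tfrac{3^4}{24}(-1) + \tfrac{4^4}{24}(1)\big) - \big(\tfrac{1^3}{6}(\tfrac{37}{24}) + \tfrac{2^3}{6}(-\tfrac{59}{24}) + \tfrac{3^3}{6}(\tfrac{55}{24})\big) = \tfrac{175}{24} - \tfrac{1050}{144} = 0.
\end{aligned}
\]
However, the $O(h^4)$ term is not eliminated:
\[ \sum_{j=0}^4 \tfrac1{120} j^5 \alpha_j - \sum_{j=0}^4 \tfrac1{24} j^4\beta_j = \big(\tfrac{3^5}{120}(-1) + \tfrac{4^5}{120}(1)\big) - \big(\tfrac{1^4}{24}(\tfrac{37}{24}) + \tfrac{2^4}{24}(-\tfrac{59}{24}) + \tfrac{3^4}{24}(\tfrac{55}{24})\big) = \tfrac{781}{120} - \tfrac{887}{144} = \tfrac{251}{720} \ne 0. \]
Thus $T_k = O(h^4)$: the method is fourth order.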
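The order conditions of Theorem 5.3 are mechanical enough to check by machine. The following Python sketch does so in exact rational arithmetic for the methods of Examples 5.7 to 5.10 and for the 3-step Adams–Moulton formula. The function name `lmm_order`, the convention of returning 0 for an inconsistent method, and the cap `max_order` are choices made for this illustration only.

```python
from fractions import Fraction as F
from math import factorial

def lmm_order(alpha, beta, max_order=10):
    """Order of accuracy of sum_j alpha[j] x_{k+j} = h sum_j beta[j] f_{k+j}.

    Returns 0 if the method is not consistent (an illustrative convention).
    """
    # Consistency (Theorem 5.2): sum alpha_j = 0 and sum j alpha_j = sum beta_j.
    if sum(alpha) != 0:
        return 0
    if sum(j * a for j, a in enumerate(alpha)) != sum(beta):
        return 0
    p = 1
    # Theorem 5.3: order p requires the conditions for l = 1, ..., p-1.
    for l in range(1, max_order):
        lhs = sum(F(j ** (l + 1), factorial(l + 1)) * a for j, a in enumerate(alpha))
        rhs = sum(F(j ** l, factorial(l)) * b for j, b in enumerate(beta))
        if lhs != rhs:
            break
        p = l + 1
    return p

EULER = ([-1, 1], [1, 0])
TRAPEZOID = ([-1, 1], [F(1, 2), F(1, 2)])
AB2 = ([0, -1, 1], [F(-1, 2), F(3, 2), 0])
AB4 = ([0, 0, 0, -1, 1], [F(-9, 24), F(37, 24), F(-59, 24), F(55, 24), 0])
AM3 = ([0, 0, -1, 1], [F(1, 24), F(-5, 24), F(19, 24), F(9, 24)])

for name, (alpha, beta) in [("Euler", EULER), ("Trapezoid", TRAPEZOID),
                            ("AB2", AB2), ("AB4", AB4), ("AM3", AM3)]:
    print(name, lmm_order(alpha, beta))
```

Exact fractions are used so that a condition that holds is detected as an exact zero rather than as a small floating point residual.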


A similar computation establishes fourth-order accuracy for the 3-step (implicit) Adams–Moulton formula
\[ x_{k+3} = x_{k+2} + \tfrac{1}{24} h \big(9 f_{k+3} + 19 f_{k+2} - 5 f_{k+1} + f_k\big). \]

In the last lecture, we proved that consistency implies convergence for one-step methods. Essentially, provided the differential equation is sufficiently well-behaved (in the sense of Picard’s Theorem), then the numerical solution produced by a consistent one-step method on the fixed interval $[t_0, t_{\rm final}]$ will converge to the true solution as $h \to 0$. Of course, this is a key property that we hope is shared by multistep methods.

Whether this is true for general linear multistep methods is the subject of the next lecture. For now, we merely present some computational evidence that, for certain methods, the global error at $t_{\rm final}$ behaves in the same manner as the truncation error.

Consider the model problem $x'(t) = x(t)$ for $t \in [0, 1]$ with $x(0) = x_0 = 1$, which has the exact solution $x(t) = e^t$. We shall approximate this solution using Euler’s method, the second-order Adams–Bashforth formula, and the fourth-order Adams–Bashforth formula. The latter two methods require data not only at $t = 0$, but also at several additional values, $t = h$, $t = 2h$, and $t = 3h$. For this simple experiment, we can use the value of the exact solution, $x_1 = x(t_1)$, $x_2 = x(t_2)$, and $x_3 = x(t_3)$.
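The experiment just described is easy to reproduce. The sketch below is one possible implementation, not the code used to generate the notes' results: the helper names `ab_error` and `observed_order` are hypothetical, starting values are taken from the exact solution as described above, and the observed order is estimated by halving $h$.

```python
import math

def ab_error(b, n):
    """Global error at t = 1 for x' = x, x(0) = 1, with n steps of size h = 1/n.

    b holds the coefficients of the explicit formula
    x_{k+m} = x_{k+m-1} + h * sum_j b[j] * f_{k+j}, with m = len(b).
    """
    h = 1.0 / n
    m = len(b)
    x = [math.exp(j * h) for j in range(m)]      # exact starting values
    for k in range(n - m + 1):
        # f(t, x) = x for this model problem
        x.append(x[-1] + h * sum(b[j] * x[k + j] for j in range(m)))
    return abs(math.e - x[n])

EULER = [1.0]
AB2 = [-1 / 2, 3 / 2]
AB4 = [-9 / 24, 37 / 24, -59 / 24, 55 / 24]

def observed_order(b, n):
    """Estimate p in error = O(h^p) by comparing step sizes 1/n and 1/(2n)."""
    return math.log2(ab_error(b, n) / ab_error(b, 2 * n))

for name, b in [("Euler", EULER), ("AB2", AB2), ("AB4", AB4)]:
    errors = [ab_error(b, n) for n in (20, 40, 80)]
    print(name, errors, observed_order(b, 40))
```

For these three methods the estimated orders should come out close to 1, 2, and 4, matching the truncation error orders derived earlier.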
We close by offering evidence that there is more to the analysis of linear multistep methods than truncation error. Here are two explicit methods that are both second order:
\[ x_{k+2} - \tfrac32 x_{k+1} + \tfrac12 x_k = h\big(\tfrac54 f_{k+1} - \tfrac34 f_k\big), \]
\[ x_{k+2} - 3x_{k+1} + 2x_k = h\big(\tfrac12 f_{k+1} - \tfrac32 f_k\big). \]
We apply these methods to the model problem $x'(t) = x(t)$ with $x(0) = 1$, with exact initial data $x_0 = 1$ and $x_1 = e^h$. The results of these two methods are shown below. The first method tracks the exact solution $x(t) = e^t$ very nicely. The second method, however, shows a disturbing property: while it matches up quite well for the initial steps, it soon starts to fall far from the solution. Why does this second-order method do so poorly for such a simple problem? Does this reveal a general problem with linear multistep methods? If not, how do we identify such ill-mannered methods?
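The contrast between these two second-order methods can be observed numerically. The following sketch applies both to $x'(t) = x(t)$ on $[0, 1]$ with exact starting data; the function `two_step` and the choice of $n = 40$ steps are assumptions made for this illustration.

```python
import math

def two_step(a1, a0, b1, b0, n):
    """Iterate x_{k+2} + a1*x_{k+1} + a0*x_k = h*(b1*f_{k+1} + b0*f_k)
    for x' = x on [0, 1] with h = 1/n and exact data x_0 = 1, x_1 = e^h."""
    h = 1.0 / n
    x = [1.0, math.exp(h)]
    for k in range(n - 1):
        f0, f1 = x[k], x[k + 1]          # f(t, x) = x for this problem
        x.append(-a1 * x[k + 1] - a0 * x[k] + h * (b1 * f1 + b0 * f0))
    return x

n = 40
first = two_step(-1.5, 0.5, 1.25, -0.75, n)    # x_{k+2} - (3/2)x_{k+1} + (1/2)x_k = ...
second = two_step(-3.0, 2.0, 0.5, -1.5, n)     # x_{k+2} - 3x_{k+1} + 2x_k = ...

err_first = abs(first[-1] - math.e)
err_second = abs(second[-1] - math.e)
print("error at t = 1, first method: ", err_first)
print("error at t = 1, second method:", err_second)
```

Both methods satisfy the same order conditions, so the difference in behavior must come from something other than truncation error.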

lecture 36: Linear Multistep Methods: Zero Stability

5.6 Linear multistep methods: zero stability

Does consistency imply convergence for linear multistep methods?


This is always the case for one-step methods, as proved in section 5.3,
but the example at the end of the last lecture suggests the issue is
less straightforward for multistep methods. By understanding the
subtleties, we will come to appreciate one of the most significant
themes in numerical analysis: stability of discretizations.
We are interested in the behavior of linear multistep methods as $h \to 0$. In this limit, the right hand side of the formula for the generic multistep method,
\[ \sum_{j=0}^m \alpha_j x_{k+j} = h \sum_{j=0}^m \beta_j f(t_{k+j}, x_{k+j}), \]
makes a negligible contribution. This motivates our consideration of the trivial model problem $x'(t) = 0$ with $x(0) = 0$. Does the linear multistep method recover the exact solution, $x(t) = 0$?

When $x'(t) = 0$, clearly we have $f_{k+j} = 0$ for all $j$. The condition $\alpha_m \ne 0$ allows us to write
\[ x_m = -\frac{\alpha_0 x_0 + \alpha_1 x_1 + \cdots + \alpha_{m-1} x_{m-1}}{\alpha_m}. \]
Hence if the method is started with exact data
\[ x_0 = x_1 = \cdots = x_{m-1} = 0, \]
then
\[ x_m = -\frac{\alpha_0 \cdot 0 + \alpha_1 \cdot 0 + \cdots + \alpha_{m-1} \cdot 0}{\alpha_m} = 0, \]
and this pattern will continue: $x_{m+1} = 0$, $x_{m+2} = 0$, .... Any linear multistep method with exact starting data produces the exact solution for this special problem, regardless of the time-step.
Of course, for more complicated problems it is unusual to have
exact starting values x1 , x2 , . . . xm−1 ; typically, these values are only
approximate, obtained from some high-order one-step ODE solver
or from an asymptotic expansion of the solution that is accurate in
a neighborhood of t0 . To discover how multistep methods behave,
we must first understand how these errors in the initial data pollute
future iterations of the linear multistep method.

Definition 5.6. Suppose the initial value problem $x'(t) = f(t, x)$, $x(t_0) = x_0$ satisfies the requirements of Picard’s Theorem over the interval $[t_0, t_{\rm final}]$. For an m-step linear multistep method, consider two sequences of starting values for a fixed time-step $h$,
\[ \{x_0, x_1, \ldots, x_{m-1}\} \quad\text{and}\quad \{\widehat{x}_0, \widehat{x}_1, \ldots, \widehat{x}_{m-1}\}, \]
that generate the approximate solutions $\{x_j\}_{j=0}^n$ and $\{\widehat{x}_j\}_{j=0}^n$, where $t_n = t_{\rm final}$. The multistep method is zero-stable for this initial value problem if for sufficiently small $h$ there exists some constant $M$ (independent of $h$) such that
\[ |x_k - \widehat{x}_k| \le M \max_{0 \le j \le m-1} |x_j - \widehat{x}_j| \]
for all $k$ with $t_0 \le t_k \le t_{\rm final}$. More plainly, a method is zero-stable for a particular problem if errors in the starting values are not magnified in an unbounded fashion.

Proving zero-stability directly from this definition would be a chore. Fortunately, there is an easy way to check zero stability all at once for all sufficiently ‘nice’ differential equations. To begin with, consider a particular example.

Example 5.11 (A novel second order method). The truncation error formulas from the last lecture can be used to derive a variety of linear multistep methods that satisfy a given order of truncation error. You can use those conditions to verify that the explicit two-step method

(5.4) $x_{k+2} = 2x_k - x_{k+1} + h\big(\tfrac12 f_k + \tfrac52 f_{k+1}\big)$

is second order accurate. Now we will test the zero-stability of this algorithm on the trivial model problem, $x'(t) = 0$ with $x(0) = 0$. Since $f(t, x) = 0$ in this case, the method reduces to
\[ x_{k+2} = 2x_k - x_{k+1}. \]
As seen above, this method produces the exact solution if given exact initial data, $x_0 = x_1 = 0$. But what if $x_0 = 0$ but $x_1 = \varepsilon$ for some small $\varepsilon > 0$? This method produces the iterates
\[
\begin{aligned}
x_2 &= 2x_0 - x_1 = 2 \cdot 0 - \varepsilon = -\varepsilon \\
x_3 &= 2x_1 - x_2 = 2(\varepsilon) - (-\varepsilon) = 3\varepsilon \\
x_4 &= 2x_2 - x_3 = 2(-\varepsilon) - 3\varepsilon = -5\varepsilon \\
x_5 &= 2x_3 - x_4 = 2(3\varepsilon) - (-5\varepsilon) = 11\varepsilon \\
x_6 &= 2x_4 - x_5 = 2(-5\varepsilon) - (11\varepsilon) = -21\varepsilon \\
x_7 &= 2x_5 - x_6 = 2(11\varepsilon) - (-21\varepsilon) = 43\varepsilon \\
x_8 &= 2x_6 - x_7 = 2(-21\varepsilon) - (43\varepsilon) = -85\varepsilon.
\end{aligned}
\]
In just seven steps, the error has been multiplied 85-fold. The error is roughly doubling at each step, and before long the approximate ‘solution’ is complete garbage. Figure 5.5 shows this instability, plotting $x_k$ for four different values of $h$ and $\varepsilon = 0.01$.

Figure 5.5 (four panels, $h = 0.2$, $0.1$, $0.05$, $0.025$; horizontal axis $t_k \in [0, 1]$, vertical axis $x_k \approx x(t_k)$): Approximate solutions to $x'(t) = 0$ using the novel second-order method (5.4) with initial data $x_0 = 0$ and $x_1 = 0.01$. As $h$ gets smaller, the method produces approximations to the true solution $x(t) = 0$ of exponentially degrading quality! (Note how the scale of the vertical axis grows from plot to plot.)

This example illustrates another quirk. When applied to this particular model problem, the linear multistep method reduces to $\sum_{j=0}^m \alpha_j x_{k+j} = 0$, and thus never incorporates the time-step, $h$. Hence the error at some fixed time $t_{\rm final} = hk$ gets worse as $h$ gets smaller and $k$ grows accordingly! Figure 5.6 puts all four solutions from Figure 5.5 together in one plot, dramatically illustrating how the solutions degrade as $h$ gets smaller!
Though this method has second-order local (truncation) error, it
blows up if fed incorrect initial data for x1 . Decreasing h can magnify
this effect, even if, for example, the error in x1 is proportional to h.
We can draw a larger lesson from this simple problem: For linear
multistep methods, consistency (i.e., Tk → 0 as h → 0) is not sufficient
to ensure convergence.
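The iterates of Example 5.11, and the way the value at a fixed time degrades as $h$ shrinks, can be reproduced in a few lines. This is a minimal sketch: the function name `iterates` is hypothetical, and integer arithmetic with $\varepsilon = 1$ is used so the multiples of $\varepsilon$ are exact.

```python
def iterates(eps, n):
    """Return [x_0, ..., x_n] for x_{k+2} = 2*x_k - x_{k+1}, x_0 = 0, x_1 = eps."""
    x = [0, eps]
    for k in range(n - 1):
        x.append(2 * x[k] - x[k + 1])
    return x

# Multiples of eps for the first eight steps.
print(iterates(1, 8))

# Value of the 'solution' at t = 1 when x_1 = 0.01, for several step sizes h = 1/steps.
for steps in (5, 10, 20, 40):
    h = 1 / steps
    print("h =", h, " x(1) approx", 0.01 * iterates(1, steps)[-1])
```

Since the recurrence never sees $h$, halving the step size simply doubles the number of times the error is (roughly) doubled before $t = 1$ is reached.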
Figure 5.6 (single plot with curves for $h = 0.2$, $0.1$, $0.05$, $0.025$; horizontal axis $t_k \in [0, 1]$, vertical axis $x_k \approx x(t_k)$): All four solutions from Figure 5.5 plotted together, to illustrate how the approximate solutions from the method (5.4) to $x'(t) = 0$ with $x_0 = 0$ and $x_1 = \varepsilon$ degrade as $h \to 0$.

Let us analyze our unfortunate method a little more carefully. Setting the starting values $x_0$ and $x_1$ aside for the moment, we want to find all sequences $\{x_j\}_{j=0}^{\infty}$ that satisfy the linear, constant-coefficient recurrence relation
\[ x_{k+2} = 2x_k - x_{k+1}. \]
Since the $x_k$ values grew exponentially in the example above, assume that this recurrence has a solution of the form $x_k = \gamma^k$ for all $k = 0, 1, \ldots$, where $\gamma$ is some number that we will try to determine. Plug this ansatz for $x_k$ into the recurrence relation to see if you can make it work as a solution:
\[ \gamma^{k+2} = 2\gamma^k - \gamma^{k+1}. \]
Divide this equation through by $\gamma^k$ to obtain the quadratic equation
\[ \gamma^2 = 2 - \gamma. \]
If $\gamma$ solves this quadratic, then the putative solution $x_k = \gamma^k$ indeed satisfies the difference equation. Since
\[ \gamma^2 + \gamma - 2 = (\gamma + 2)(\gamma - 1), \]
the roots of this quadratic are simply $\gamma = -2$ and $\gamma = 1$. Thus we expect solutions of the form $x_k = (-2)^k$ and the less interesting $x_k = 1^k = 1$.
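If $x_k = \gamma_1^k$ and $x_k = \gamma_2^k$ are both solutions of the recurrence, then $x_k = A\gamma_1^k + B\gamma_2^k$ is also a solution, for any real numbers $A$ and $B$. To see this, note that
\[ \gamma_1^2 + \gamma_1 - 2 = \gamma_2^2 + \gamma_2 - 2 = 0, \]
and so
\[ A\gamma_1^k\big(\gamma_1^2 + \gamma_1 - 2\big) = B\gamma_2^k\big(\gamma_2^2 + \gamma_2 - 2\big) = 0. \]
Adding these two expressions and rearranging,
\[ A\gamma_1^{k+2} + B\gamma_2^{k+2} = 2\big(A\gamma_1^k + B\gamma_2^k\big) - \big(A\gamma_1^{k+1} + B\gamma_2^{k+1}\big), \]
which implies that $x_k = A\gamma_1^k + B\gamma_2^k$ is a solution to the recurrence.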


In fact, this is the general form of a solution to our recurrence. For any starting values $x_0$ and $x_1$, one can determine the associated constants $A$ and $B$. For example, with $\gamma_1 = -2$ and $\gamma_2 = 1$, the initial conditions $x_0 = 0$ and $x_1 = \varepsilon$ require that
\[ A + B = 0, \qquad -2A + B = \varepsilon, \]
which implies
\[ A = -\varepsilon/3, \qquad B = \varepsilon/3. \]
Indeed, the solution

(5.5) $x_k = \dfrac{\varepsilon}{3} - \dfrac{\varepsilon}{3}(-2)^k$

generates the iterates $x_0 = 0$, $x_1 = \varepsilon$, $x_2 = -\varepsilon$, $x_3 = 3\varepsilon$, $x_4 = -5\varepsilon$, ... computed previously. Notice that (5.5) reveals exponential growth with $k$: this growth overwhelms algebraic improvements in the estimate $x_1$ that might occur as we reduce $h$. For example, if $\varepsilon = x_1 - x(t_0 + h) = ch^p$ for some constant $c$ and $p \ge 1$, then $x_k = ch^p(1 - (-2)^k)/3$ still grows exponentially in $k$.
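As a quick check of this derivation, the constants $A$ and $B$ can be computed from the starting values and the closed form compared against the recurrence. The sketch below uses exact fractions; the helper names `constants`, `closed_form`, and `by_recurrence` are hypothetical and the roots $\gamma_1 = -2$, $\gamma_2 = 1$ are hard-coded for this particular method.

```python
from fractions import Fraction as F

G1, G2 = F(-2), F(1)        # roots of gamma^2 + gamma - 2 = 0

def constants(x0, x1):
    """Solve A + B = x0 and G1*A + G2*B = x1 for (A, B)."""
    A = (x1 - G2 * x0) / (G1 - G2)
    B = x0 - A
    return A, B

def closed_form(k, x0, x1):
    """x_k = A*G1^k + B*G2^k with A, B fitted to the starting values."""
    A, B = constants(x0, x1)
    return A * G1 ** k + B * G2 ** k

def by_recurrence(n, x0, x1):
    """Return [x_0, ..., x_n] from x_{k+2} = 2*x_k - x_{k+1}."""
    x = [x0, x1]
    for k in range(n - 1):
        x.append(2 * x[k] - x[k + 1])
    return x

eps = F(1)                   # work in units of epsilon
print(constants(F(0), eps))
print([closed_form(k, F(0), eps) for k in range(9)])
```

With $x_0 = 0$ and $x_1 = \varepsilon$ the fitted constants reproduce (5.5), and the same routine handles any other pair of starting values.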

5.6.1 The Root Condition


The real trouble with the previous method was that the formula for $x_k$ involves the term $(-2)^k$. Since $|-2| > 1$, this component of $x_k$ grows exponentially in $k$. This term is simply an artifact of the finite difference equation, and has nothing to do with the underlying differential equation. As $k$ increases, this $(-2)^k$ term swamps the other term in the solution. It is called a parasitic solution.

Let us review how we determined the general form of the solution. We assumed a solution of the form $x_k = \gamma^k$, then plugged this solution into the recurrence $x_{k+2} = 2x_k - x_{k+1}$. The possible values for $\gamma$ were roots of the equation $\gamma^2 = 2 - \gamma$.
The process we just applied to one specific linear multistep method can readily be extended to the general linear multistep method

(5.6) $\displaystyle \sum_{j=0}^m \alpha_j x_{k+j} = h \sum_{j=0}^m \beta_j f(t_{k+j}, x_{k+j})$

by applying the method again to the trivial equation $x'(t) = 0$. For this special equation, the method (5.6) reduces to
\[ \sum_{j=0}^m \alpha_j x_{k+j} = 0. \]
Substituting $x_k = \gamma^k$ yields
\[ \sum_{j=0}^m \alpha_j \gamma^{k+j} = 0. \]
Canceling $\gamma^k$,
\[ \sum_{j=0}^m \alpha_j \gamma^{j} = 0. \]

Definition 5.7. The characteristic polynomial of an m-step linear multistep method is the degree-m polynomial
\[ \rho(z) = \sum_{j=0}^m \alpha_j z^j. \]

For $x_k = \gamma^k$ to be a solution to the above recurrence, $\gamma$ must be a root of the characteristic polynomial, $\rho(\gamma) = 0$. Since the characteristic polynomial has degree $m$, it will have $m$ roots. If these roots are distinct, call them $\gamma_1, \gamma_2, \ldots, \gamma_m$, the general form of the solution of
\[ \sum_{j=0}^m \alpha_j x_{k+j} = 0 \]
is
\[ x_k = c_1\gamma_1^k + c_2\gamma_2^k + \cdots + c_m\gamma_m^k \]
for constants $c_1, \ldots, c_m$ that are determined from the starting values $x_0, \ldots, x_{m-1}$.

(Margin note: If some root, say $\gamma_1$, is repeated $p$ times, then instead of contributing the term $c_1\gamma_1^k$ to the general solution, it will contribute a term of the form $c_{1,1}\gamma_1^k + c_{1,2}k\gamma_1^k + \cdots + c_{1,p}k^{p-1}\gamma_1^k$.)

To avoid parasitic solutions to a linear multistep method, all the roots of the characteristic polynomial should be located within the unit disk in the complex plane, i.e., $|\gamma_j| \le 1$ for all $j = 1, \ldots, m$. Thus, for the simple differential equation $x'(t) = 0$, we have found a way to describe zero stability: Initial errors will not be magnified if the characteristic polynomial has all its roots in the unit disk; any roots on the unit circle should be simple (i.e., not multiple).

(Margin note: Following on from the previous marginal note, we note that if some root, say $\gamma_1$, is a repeated root on the unit circle, $|\gamma_1| = 1$, then the general solution will have terms like $k\gamma_1^k$, so $|k\gamma_1^k| = k|\gamma_1|^k = k$. While this term will not grow exponentially in $k$, it does grow algebraically, and errors will still grow enough as $h \to 0$ to violate zero stability.)

This analysis might seem too trivial: after all, the problem $x'(t) = 0$ is not particularly interesting. What is remarkable is that, in a sense, if the linear multistep method is zero stable for $x'(t) = 0$, then one can prove it is zero stable for all well-behaved differential equations! This result was discovered in the late 1950s by Germund Dahlquist.

(Margin note: An excellent discussion of this theoretical material is given in Süli and Mayers, Numerical Analysis: An Introduction, Cambridge University Press, 2003; these notes follow, in part, their exposition. See also E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, 2nd ed., Springer-Verlag, 1993.)

Theorem 5.4. A linear multistep method is zero-stable for any ‘well-behaved’ initial value problem provided it satisfies the root condition:

• all roots of $\rho(\gamma) = 0$ lie in the unit disk, i.e., $|\gamma| \le 1$;

• any roots on the unit circle ($|\gamma| = 1$) are simple (i.e., not multiple).
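For two-step methods the root condition can be tested directly with the quadratic formula. The sketch below is restricted to $m = 2$ for simplicity (a general-degree version would need a polynomial root finder); the function name `satisfies_root_condition` and the tolerance `tol` are assumptions of this illustration.

```python
import cmath

def satisfies_root_condition(a0, a1, a2, tol=1e-12):
    """Root condition for rho(z) = a2*z^2 + a1*z + a0 (two-step methods, a2 != 0)."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2 * a0)
    r1 = (-a1 + disc) / (2 * a2)
    r2 = (-a1 - disc) / (2 * a2)
    # Every root must lie in the closed unit disk.
    if max(abs(r1), abs(r2)) > 1 + tol:
        return False
    # A root on the unit circle must be simple.
    repeated = abs(r1 - r2) <= tol
    on_circle = abs(abs(r1) - 1) <= tol
    return not (repeated and on_circle)

# rho(z) = z^2 + z - 2 for method (5.4): roots 1 and -2.
print(satisfies_root_condition(-2, 1, 1))
# rho(z) = z^2 - z for the 2-step Adams-Bashforth method: roots 1 and 0.
print(satisfies_root_condition(0, -1, 1))
```

The first call reports a violation because of the root at $-2$; the second method passes, since its only root on the unit circle is the simple root at $1$.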

One can see now where the term zero-stability comes from: it is necessary and sufficient for the stability definition to hold for the differential equation $x'(t) = 0$. In recognition of the discoverer of this key result, zero-stability is sometimes called Dahlquist stability. (Another synonymous term is root stability.) In addition to making this beautiful characterization, Dahlquist also answered the question about the conditions necessary for a multistep method to be convergent.

Theorem 5.5 (Dahlquist Equivalence Theorem). Suppose an m-step linear multistep method is applied to a ‘well-behaved’ initial value problem on $[t_0, t_{\rm final}]$ with consistent starting values,
\[ x_k \to x(t_k) \quad\text{for } t_k = t_0 + hk, \quad k = 0, \ldots, m-1 \]
as $h \to 0$. This method is convergent, i.e.,
\[ x_{\lceil (t - t_0)/h \rceil} \to x(t) \quad\text{for all } t \in [t_0, t_{\rm final}] \]
as $h \to 0$, if and only if the method is consistent and zero-stable.

If the exact solution is sufficiently smooth, $x(t) \in C^{p+1}[t_0, t_{\rm final}]$, and the multistep method is order-p accurate ($T_k = O(h^p)$), then
\[ x(t_k) - x_k = O(h^p) \]
for all $t_k \in [t_0, t_{\rm final}]$.

Dahlquist also characterized the maximal order of convergence for a zero-stable m-step multistep method.

Theorem 5.6 (First Dahlquist Stability Barrier). A zero-stable m-step linear multistep method has truncation error no better than

• $O(h^{m+1})$ if $m$ is odd;

• $O(h^{m+2})$ if $m$ is even.

Example 5.12 (A method on the brink of stability). We close this lecture with an example of a method that, we might figuratively say, is ‘on the brink of stability.’ That is, the method is zero-stable, but it stretches that definition to its limit. Consider the method

(5.7) $x_{k+2} = x_k + 2h f_{k+1}$,

which has $O(h^2)$ truncation error. The characteristic polynomial is $z^2 - 1 = (z + 1)(z - 1)$, which has the two roots $\gamma_1 = -1$ and $\gamma_2 = 1$. These are distinct roots on the unit circle, so the method is zero-stable.

Apply this method to the model problem $x'(t) = \lambda x$ with $x(0) = 1$. Substituting $f(t_k, x_k) = \lambda x_k$ into the method gives
\[ x_{k+2} = x_k + 2\lambda h x_{k+1}. \]
For a fixed $\lambda$ and $h$, this is just another recurrence relation like we have considered above. It has solutions of the form $\gamma^k$, where $\gamma$ is a root of the polynomial
\[ \gamma^2 - 2\lambda h \gamma - 1 = 0. \]
In fact, those roots are simply
\[ \gamma = \lambda h \pm \sqrt{\lambda^2 h^2 + 1}. \]
Since $\lambda^2 h^2 + 1 > 1$ for any $h > 0$ and real $\lambda \ne 0$, at least one of the roots $\gamma$ will always be greater than one in modulus, thus leading to a solution $x_k$ that grows exponentially with $k$. Of course, the exact solution to this equation is $x(t) = e^{\lambda t}$, so if $\lambda < 0$, then we have $x(t) \to 0$ as $t \to \infty$. The numerical approximation will generally diverge, giving the qualitatively opposite behavior!

How is this possible for a zero-stable method? The key is that here, unlike our previous zero-unstable method, the exponential growth rate depends upon the time-step $h$. Zero stability only requires that on a fixed finite time interval $t \in [t_0, t_{\rm final}]$, the amount by which errors in the initial data are magnified be bounded.

Figure 5.7 (two panels, $h = 0.05$ and $h = 0.01$; horizontal axis $t_k \in [0, 2]$, vertical axis $x_k \approx x(t_k)$): The second-order zero-stable method (5.7) applied to $x'(t) = -2x(t)$, started with initial condition $x_0 = 1$ and a 1% error in $x_1$, i.e., $x_1 = 1.01e^{-2h}$ (shown in red). The method is ‘on the brink of stability’: the systematic error in $x_1$ prevents the method from converging, but the deviation from the true solution $x(t) = e^{-2t}$ (in blue) is bounded. This would not be the case if the method violated the root condition.
Figure 5.7 shows what this means. Set $\lambda = -2$ and $[t_0, t_{\rm final}] = [0, 2]$. Start the method with $x_0 = 1$ and $x_1 = 1.01\,e^{-2h}$. That is, the second data point has an initial error of 1%. The plot on the left shows the solution for $h = 0.05$, while the plot on the right uses $h = 0.01$. In both cases, the solution oscillates wildly across the true solution, and the amplitude of these oscillations grows with $t$. As we reduce the step-size, the solution remains equally bad. (If the method were not zero-stable, we would expect the error to magnify as $h$ shrinks.)

The solution does not blow up, but nor does it converge as $h \to 0$. So does this example contradict the Dahlquist Equivalence Theorem? No! The hypotheses for that theorem require consistent starting values. In this case, that means $x_1 \to x(t_0 + h)$ as $h \to 0$. (We assume that $x_0 = x(t_0)$ is exact.) In the example shown above, we have kept $x_1$ fixed to have a 1% error as $h \to 0$, so it is not consistent.

Figure 5.8 (two panels, $h = 0.05$ and $h = 0.01$; horizontal axis $t_k \in [0, 2]$, vertical axis $x_k \approx x(t_k)$): Second-order Adams–Bashforth method applied to $x'(t) = -2x(t)$, started with initial condition $x_0 = 1$ and a 1% error in $x_1$, i.e., $x_1 = 1.01e^{-2h}$. Despite the systematic error, the method still behaves like the true solution, $x(t) = e^{-2t}$.
Not all linear multistep methods behave as badly as this one in the presence of imprecise starting data. Recall the second-order Adams–Bashforth method from the previous lecture,
\[ x_{k+2} - x_{k+1} = \frac{h}{2}\big(3 f_{k+1} - f_k\big). \]
This method is zero stable, as $\rho(z) = z^2 - z = z(z - 1)$. Figure 5.8 repeats the exercise of Figure 5.7, with the same errors in $x_1$, but with the second-order Adams–Bashforth method. Though the initial value error will throw off the solution slightly, we recover the correct qualitative behavior.
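The comparison between method (5.7) and the second-order Adams–Bashforth method can be reproduced with a short script. This is a sketch under the setup stated in the text ($\lambda = -2$, $[0, 2]$, $x_0 = 1$, $x_1 = 1.01e^{-2h}$); the function names `run_two_step`, `method_5_7`, `ab2`, and `final_error` are hypothetical.

```python
import math

def run_two_step(step, lam, h, t_final, rel_err=0.01):
    """Apply a two-step update to x' = lam*x with x_0 = 1 and a relative
    error rel_err in x_1; returns [x_0, ..., x_n] with n = t_final/h."""
    n = round(t_final / h)
    x = [1.0, (1.0 + rel_err) * math.exp(lam * h)]
    for k in range(n - 1):
        x.append(step(x[k], x[k + 1], lam * h))
    return x

def method_5_7(xk, xk1, mu):
    """x_{k+2} = x_k + 2*h*f_{k+1} with f = lam*x and mu = lam*h."""
    return xk + 2.0 * mu * xk1

def ab2(xk, xk1, mu):
    """x_{k+2} = x_{k+1} + (h/2)*(3*f_{k+1} - f_k) with f = lam*x, mu = lam*h."""
    return xk1 + mu * (1.5 * xk1 - 0.5 * xk)

def final_error(step, h, lam=-2.0, t_final=2.0):
    x = run_two_step(step, lam, h, t_final)
    return abs(x[-1] - math.exp(lam * t_final))

for h in (0.05, 0.01):
    print("h =", h,
          " error of (5.7):", final_error(method_5_7, h),
          " error of AB2:", final_error(ab2, h))
```

The error of (5.7) at $t = 2$ stays at roughly the same size for both step sizes, while the Adams–Bashforth error remains small, in line with the behavior described for Figures 5.7 and 5.8.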
Judging from the different manner in which our two second-order
methods handle this simple problem, it appears that there is still
more to understand about linear multistep methods. This is the sub-
ject of the next lecture.

lecture 37: Linear Multistep Methods: Absolute Stability, Part I


lecture 38: Linear Multistep Methods: Absolute Stability, Part II

5.7 Linear multistep methods: absolute stability

At this point, it may well seem that we have a complete theory for
linear multistep methods. With an understanding of truncation er-
ror and zero stability, the convergence of any method can be easily
understood. However, one further wrinkle remains. (Perhaps you
expected this: thus far the β j coefficients have played no role in our
stability analysis!) Up to this point, our convergence theory addresses
the case where h → 0. Methods differ significantly in how small
h must be before one observes this convergent regime. For h too
large, exponential errors that resemble those seen for zero-unstable
methods can emerge for rather benign-looking problems—and for
some ODEs and methods, the restriction imposed on h to avoid such
behavior can be severe. To understand this problem, we need to consider how the numerical method behaves on a less trivial canonical model problem.

(Margin note: For an elaboration of many details described here, see Chapter 12 of Süli and Mayers.)

Now consider the model problem $x'(t) = \lambda x(t)$, $x(0) = x_0$ for some fixed $\lambda \in \mathbb{C}$, which has the exact solution $x(t) = e^{t\lambda} x_0$. In those cases where the real part of $\lambda$ is negative (i.e., $\lambda$ is in the open left half of the complex plane), we have $|x(t)| \to 0$ as $t \to \infty$. For a fixed step size $h > 0$, will a linear multistep method mimic this behavior?
The explicit Euler method applied to this equation takes the form
\[ x_{k+1} = x_k + h f_k = x_k + h\lambda x_k = (1 + h\lambda) x_k. \]
Hence, this recursion has the general solution
\[ x_k = (1 + h\lambda)^k x_0. \]
Under what conditions will $x_k \to 0$? Clearly we need $|1 + h\lambda| < 1$; this condition is more easily interpreted by writing $|1 + h\lambda| = |-1 - h\lambda|$, where that latter expression is simply the distance of $h\lambda$ from $-1$ in the complex plane. Hence $|1 + h\lambda| < 1$ provided $h\lambda$ is located strictly in the interior of the disk of radius 1 in the complex plane, centered at $-1$. This is the stability region for the explicit Euler method, shown in Figure 5.9.
Now consider the backward (implicit) Euler method for this same model problem:
\[ x_{k+1} = x_k + h f_{k+1} = x_k + h\lambda x_{k+1}. \]
Solve this equation for $x_{k+1}$ to obtain
\[ x_{k+1} = \frac{1}{1 - h\lambda}\, x_k, \]
from which it follows that
\[ x_k = (1 - h\lambda)^{-k} x_0. \]
Thus $x_k \to 0$ provided $|1 - h\lambda| > 1$, i.e., $h\lambda$ must be more than a distance of 1 away from 1 in the complex plane. As illustrated in Figure 5.9, the backward Euler method has a much larger stability region than the explicit Euler method. In fact, the entire left half of the complex plane is contained in the stability region for the implicit method. Since $h > 0$, for any value of $\lambda$ with negative real part, the backward Euler method will produce decaying solutions that qualitatively mimic the exact solution.

If $h\lambda$ falls within the stability region for a method, we say that the method is absolutely stable for that value of $h\lambda$. Figure 5.9 shows the stability regions for the forward and backward Euler methods. The blue region shows values of $\lambda h$ in the complex plane for which the method is absolutely stable. (For the backward Euler method, this region extends throughout the complex plane, beyond the range of the plot.)

Figure 5.9 (two panels in the complex $h\lambda$ plane: Forward Euler, $x_{k+1} = x_k + h f_k$, and Backward Euler, $x_{k+1} = x_k + h f_{k+1}$): Stability regions for the forward and backward Euler method. If $h\lambda$ is contained within the blue region, then the approximate solution $\{x_k\}$ to $x'(t) = \lambda x(t)$ will converge, $|x_k| \to 0$, as $k \to \infty$.
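The two stability conditions are simple to observe in practice. The sketch below iterates both Euler methods on $x'(t) = \lambda x(t)$; the values $\lambda = -10$, $h = 0.3$ and $h = 0.05$, and the function names are choices made for this illustration.

```python
def forward_euler(lam, h, n, x0=1.0):
    """n steps of x_{k+1} = x_k + h*lam*x_k."""
    x = x0
    for _ in range(n):
        x = x + h * lam * x
    return x

def backward_euler(lam, h, n, x0=1.0):
    """n steps of x_{k+1} = x_k + h*lam*x_{k+1}, solved for x_{k+1}."""
    x = x0
    for _ in range(n):
        x = x / (1.0 - h * lam)
    return x

lam = -10.0
for h in (0.3, 0.05):
    print("h*lam =", h * lam,
          " forward:", forward_euler(lam, h, 20),
          " backward:", backward_euler(lam, h, 20))
```

With $h\lambda = -3$ the point lies outside the disk $|1 + h\lambda| < 1$, so forward Euler grows even though the true solution decays, while backward Euler decays for both step sizes.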
A general linear multistep method
\[ \sum_{j=0}^m \alpha_j x_{k+j} = h \sum_{j=0}^m \beta_j f_{k+j} \]
applied to $x'(t) = \lambda x$, $x(0) = x_0$ reduces to
\[ \sum_{j=0}^m \alpha_j x_{k+j} = h\lambda \sum_{j=0}^m \beta_j x_{k+j}, \]
which can be rearranged as
\[ \sum_{j=0}^m (\alpha_j - h\lambda\beta_j)\, x_{k+j} = 0. \]

This expression closely resembles the formula we analyzed when assessing the zero stability of linear multistep methods, except that now we have the $h\lambda\beta_j$ terms. The new equation is also a linear constant-coefficient recurrence relation, so just as before we can assume that it has solutions of the form $x_k = \gamma^k$ for constant $\gamma$. The values of $\gamma \in \mathbb{C}$ for which such $x_k$ will be solutions to the recurrence are the roots of the stability polynomial
\[ \sum_{j=0}^m (\alpha_j - h\lambda\beta_j)\, z^j, \]
i.e., the solutions of
\[ \rho(z) - h\lambda\sigma(z) = 0, \]
where $\rho$ is the characteristic polynomial,
\[ \rho(z) = \sum_{j=0}^m \alpha_j z^j, \]
and
\[ \sigma(z) = \sum_{j=0}^m \beta_j z^j. \]
Thus for a fixed $h\lambda$, there will be $m$ solutions of the form $\gamma_j^k$ for the $m$ roots $\gamma_1, \ldots, \gamma_m$ of the stability polynomial. If these roots are all distinct, then for any initial data $x_0, \ldots, x_{m-1}$ we can find constants $c_1, \ldots, c_m$ such that
\[ x_k = \sum_{j=1}^m c_j \gamma_j^k. \]
For a given value $h\lambda$, we have $x_k \to 0$ provided that $|\gamma_j| < 1$ for all $j = 1, \ldots, m$. If that condition is met, we say that the linear multistep method is absolutely stable for that value of $h\lambda$.
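For a two-step method the roots of the stability polynomial follow from the quadratic formula, so absolute stability at a given $h\lambda$ can be tested directly. The following is a minimal sketch restricted to $m = 2$; the function name `is_absolutely_stable` and the sample values of $h\lambda$ are assumptions for illustration, and the code presumes the leading coefficient $\alpha_2 - h\lambda\beta_2$ is nonzero.

```python
import cmath

def is_absolutely_stable(alpha, beta, mu):
    """Test |gamma_j| < 1 for both roots of sum_j (alpha[j] - mu*beta[j]) z^j,
    where mu = h*lambda and alpha, beta each hold three coefficients."""
    c0, c1, c2 = (a - mu * b for a, b in zip(alpha, beta))
    disc = cmath.sqrt(c1 * c1 - 4 * c2 * c0)
    roots = ((-c1 + disc) / (2 * c2), (-c1 - disc) / (2 * c2))
    return all(abs(r) < 1 for r in roots)

AB2_ALPHA, AB2_BETA = (0, -1, 1), (-0.5, 1.5, 0)

for mu in (-0.5, -2.0, 0.1, -0.5 + 0.5j):
    print(mu, is_absolutely_stable(AB2_ALPHA, AB2_BETA, mu))
```

Because `cmath` is used throughout, complex values of $h\lambda$ are handled in the same way as real ones.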
In the next section, we will describe how linear systems of differential equations, $\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t)$, can give rise, through an eigenvalue decomposition of $\mathbf{A}$, to the scalar problem $x'(t) = \lambda x(t)$ with complex values of the eigenvalue $\lambda$ (even if $\mathbf{A}$ is real). This explains our interest in values of $h\lambda \in \mathbb{C}$.

We can now add a condition to our growing list of requirements to look for when assessing the quality of a linear multistep method. We seek linear multistep methods with the following properties:

• high order truncation error;

• zero stability;

• absolute stability region that contains as much of the left half of the complex plane as possible.

Those methods for which the stability region contains the entire left half plane are distinguished, as they will produce, for any value of $h$, exponentially decaying numerical solutions to linear problems that have exponentially decaying true solutions, i.e., when $\mathrm{Re}\,\lambda < 0$.

Definition 5.8. A linear multistep method is A-stable provided that its


stability region contains the entire left half of the complex plane.

Figure 5.10 shows the stability regions for two Adams–Bashforth and Adams–Moulton methods. Notice two trends in these plots: (1) the implicit Adams–Moulton methods have a larger stability region than the explicit Adams–Bashforth methods of the same order; (2) as the order of the method increases, the stability region gets smaller.

Figure 5.11 shows the stability regions for a class of implicit integrators called backward difference methods. The 1-step backward difference method is simply the trapezoid method described earlier. All four of these methods contain the entire negative real axis within their stability region, which will make these methods very effective for important systems of differential equations we will discuss in the next lecture.
How does one draw plots of the sort shown here? We take the second order Adams–Bashforth method
\[ x_{k+2} - x_{k+1} = h\big(\tfrac32 f_{k+1} - \tfrac12 f_k\big) \]
as an example. Apply this rule to $x'(t) = f(t, x(t)) = \lambda x(t)$ to obtain
\[ x_{k+2} - x_{k+1} = \lambda h\big(\tfrac32 x_{k+1} - \tfrac12 x_k\big), \]
with which we associate the stability polynomial
\[ z^2 - \big(1 + \tfrac32 \lambda h\big) z + \tfrac12 \lambda h = 0. \]
Any point $\lambda h \in \mathbb{C}$ on the boundary of the stability region must be one for which the stability polynomial has a root $z$ with $|z| = 1$. We can rearrange the stability polynomial to give
\[ \lambda h = \frac{z^2 - z}{\tfrac32 z - \tfrac12}. \]

Figure 5.10: Stability regions for the second-order and fourth-order Adams–Bashforth (explicit) and Adams–Moulton (implicit) methods. If $h\lambda$ is contained within the blue region, then the approximate solution $\{x_k\}$ to $x'(t) = \lambda x(t)$ will converge, $|x_k| \to 0$, as $k \to \infty$. The four panels show:
• Second-Order Adams–Bashforth: $x_{k+2} - x_{k+1} = h\left(\tfrac{3}{2} f_{k+1} - \tfrac{1}{2} f_k\right)$
• Second-Order Adams–Moulton: $x_{k+1} - x_k = \tfrac{1}{2} h\left(f_k + f_{k+1}\right)$
• Fourth-Order Adams–Bashforth: $x_{k+4} - x_{k+3} = \tfrac{1}{24} h\left(55 f_{k+3} - 59 f_{k+2} + 37 f_{k+1} - 9 f_k\right)$
• Fourth-Order Adams–Moulton: $x_{k+3} - x_{k+2} = \tfrac{1}{24} h\left(9 f_{k+3} + 19 f_{k+2} - 5 f_{k+1} + f_k\right)$

For general methods, this expression takes the form

$$ \lambda h = \frac{\sum_{j=0}^m \alpha_j z^j}{\sum_{j=0}^m \beta_j z^j}. \tag{5.8} $$

To determine the boundary of the stability region, we sample this formula for all $z \in \mathbb{C}$ with $|z| = 1$, i.e., we trace out the image for $z = e^{i\theta}$, $\theta \in [0, 2\pi)$. This curve will divide the complex plane into stable and unstable regions, which can be distinguished by testing the roots of the stability polynomial for $\lambda h$ within each of those regions.
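
The boundary-tracing procedure just described can be sketched in a few lines for the second-order Adams–Bashforth method. This is only an illustration under stated assumptions: the function names (`ab2_boundary`, `ab2_max_root`, `ab2_is_stable`) and the sample count are hypothetical choices, not part of these notes, and the roots are found with the quadratic formula because this particular stability polynomial has degree two.

```python
import cmath
import math


def ab2_boundary(num_points=400):
    """Image of the unit circle z = exp(i*theta) under
    lambda*h = (z^2 - z) / (1.5*z - 0.5)."""
    points = []
    for m in range(num_points):
        theta = 2.0 * math.pi * m / num_points
        z = cmath.exp(1j * theta)
        points.append((z * z - z) / (1.5 * z - 0.5))
    return points


def ab2_max_root(lam_h):
    """Largest root magnitude of z^2 - (1 + 1.5*lam_h) z + 0.5*lam_h = 0."""
    b = -(1.0 + 1.5 * lam_h)
    c = 0.5 * lam_h
    disc = cmath.sqrt(b * b - 4.0 * c)
    return max(abs((-b + disc) / 2.0), abs((-b - disc) / 2.0))


def ab2_is_stable(lam_h):
    """True when both roots lie strictly inside the unit circle."""
    return ab2_max_root(lam_h) < 1.0


boundary = ab2_boundary()
leftmost = min(p.real for p in boundary)  # smallest real part on the curve
print("leftmost boundary point:", leftmost)
print("lambda*h = -0.5 stable?", ab2_is_stable(-0.5))
print("lambda*h = -1.5 stable?", ab2_is_stable(-1.5))
```

Setting $z = -1$ in the rearranged formula gives $\lambda h = -1$, so the curve meets the negative real axis there; the two test points on either side of that crossing fall in the stable and unstable regions, respectively.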
We illustrate this process for the fourth order Adams–Bashforth scheme. The curve described in the last paragraph is shown in Figure 5.12; it divides the complex plane into regions where the stability polynomial has an equal number of roots larger than 1 in magnitude. As denoted by the numbers on the plot: outside the curve there is one root larger than one; within the rightmost lobes of this curve, two roots are larger than one; within the leftmost region, no roots are larger than one in magnitude. The latter is the stable region, which is colored blue in the bottom-left plot in Figure 5.10.

Figure 5.11: Stability regions for four (implicit) backward difference methods. If $h\lambda$ is contained within the blue region, then the approximate solution $\{x_k\}$ to $x'(t) = \lambda x(t)$ will converge, $|x_k| \to 0$, as $k \to \infty$. The four panels show:
• 1-Step Backward Difference Method (Trapezoid): $x_{k+1} - x_k = \tfrac{1}{2} h\left(f_k + f_{k+1}\right)$
• 2-Step Backward Difference Method: $3x_{k+2} - 4x_{k+1} + x_k = 2h f_{k+2}$
• 3-Step Backward Difference Method: $11x_{k+3} - 18x_{k+2} + 9x_{k+1} - 2x_k = 6h f_{k+3}$
• 4-Step Backward Difference Method: $25x_{k+4} - 48x_{k+3} + 36x_{k+2} - 16x_{k+1} + 3x_k = 12h f_{k+4}$

Figure 5.12: The curve traced out by $\sum_{j=0}^m \alpha_j e^{ij\theta} \big/ \sum_{j=0}^m \beta_j e^{ij\theta}$ for $\theta \in [0, 2\pi)$. The numbers (0, 1, 2) reveal the number of roots of the stability polynomial that have magnitude larger than one. The stability region is the region bounded by this curve for which all the roots of the stability polynomial are less than one in magnitude.


lecture 39: Systems of ODEs, Stiff differential equations

5.8 Systems of linear differential equations

Thus far we have mainly considered scalar ODEs. Both one-step and linear multistep methods readily generalize to systems of ODEs, where the scalar $x(t) \in \mathbb{R}$ is replaced by a vector $\mathbf{x}(t) \in \mathbb{R}^n$. In these notes, we shall focus upon linear systems of ODEs.

[Margin note: Of course, many applications give rise to nonlinear ODEs; understanding the linear case is essential to understanding the behavior of nonlinear systems in the vicinity of a steady-state.]

Consider the linear system of differential equations

$$ \mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t), \qquad \mathbf{x}(0) = \mathbf{x}_0, $$

for $\mathbf{A} \in \mathbb{C}^{n\times n}$ and $\mathbf{x}(t) \in \mathbb{C}^n$. We wish to see how the scalar linear stability theory discussed in the last lecture applies to such systems. Suppose that the matrix $\mathbf{A}$ is diagonalizable, so that it can be written $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ for the diagonal matrix $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Premultiplying the differential equation by $\mathbf{V}^{-1}$ yields

$$ \mathbf{V}^{-1}\mathbf{x}'(t) = \boldsymbol{\Lambda}\mathbf{V}^{-1}\mathbf{x}(t), \qquad \mathbf{V}^{-1}\mathbf{x}(0) = \mathbf{V}^{-1}\mathbf{x}_0. \tag{5.9} $$

Now let $\mathbf{y}(t) = \mathbf{V}^{-1}\mathbf{x}(t)$, which represents $\mathbf{x}(t)$ in a transformed coordinate system in which the eigenvectors form the new coordinate axes. In these new coordinates, the matrix equation decouples into a system of $n$ independent scalar linear equations: the system (5.9) can be written as

$$ \mathbf{y}'(t) = \boldsymbol{\Lambda}\mathbf{y}(t), \qquad \mathbf{y}(0) = \mathbf{V}^{-1}\mathbf{x}_0. $$

This system is equivalent to the $n$ scalar problems

$$ \begin{aligned} y_1'(t) &= \lambda_1 y_1(t), & y_1(0) &= [\mathbf{V}^{-1}\mathbf{x}_0]_1, \\ &\;\;\vdots & &\;\;\vdots \\ y_n'(t) &= \lambda_n y_n(t), & y_n(0) &= [\mathbf{V}^{-1}\mathbf{x}_0]_n, \end{aligned} $$

each of which has a simple solution of the form

$$ y_j(t) = e^{\lambda_j t} y_j(0). $$

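Now use the relationship $\mathbf{x}(t) = \mathbf{V}\mathbf{y}(t)$ to transform back to the original coordinates. Collect the exponentials $e^{\lambda_j t}$ into a diagonal matrix,

$$ e^{\boldsymbol{\Lambda} t} := \begin{bmatrix} e^{t\lambda_1} & & \\ & \ddots & \\ & & e^{t\lambda_n} \end{bmatrix}. $$

Then write

$$ \mathbf{x}(t) = \mathbf{V}\mathbf{y}(t) = \mathbf{V}e^{\boldsymbol{\Lambda} t}\mathbf{y}(0) = \mathbf{V}e^{\boldsymbol{\Lambda} t}\mathbf{V}^{-1}\mathbf{x}_0, \tag{5.10} $$

which motivates the definition of the matrix exponential,

$$ e^{\mathbf{A}t} := \mathbf{V}e^{\boldsymbol{\Lambda} t}\mathbf{V}^{-1}, $$

so that the solution $\mathbf{x}(t)$ has the convenient form

$$ \mathbf{x}(t) = e^{\mathbf{A}t}\mathbf{x}_0. $$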

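For scalar equations, we considered when $|x(t)| \to 0$. The analogy for $\mathbf{x}(t) \in \mathbb{R}^n$ is $\|\mathbf{x}(t)\|_2 \to 0$. What properties of $\mathbf{A}$ ensure this convergence?

We can bound the solution using norm inequalities,

$$ \|\mathbf{x}(t)\|_2 \le \|\mathbf{V}\|_2 \, \|e^{\boldsymbol{\Lambda} t}\|_2 \, \|\mathbf{V}^{-1}\|_2 \, \|\mathbf{x}_0\|_2. $$

Since $e^{\boldsymbol{\Lambda} t}$ is a diagonal matrix, its 2-norm is the largest magnitude of its entries:

$$ \|e^{\boldsymbol{\Lambda} t}\|_2 = \max_{1 \le j \le n} |e^{t\lambda_j}|, $$

and hence

$$ \frac{\|\mathbf{x}(t)\|_2}{\|\mathbf{x}_0\|_2} \le \|\mathbf{V}\|_2 \|\mathbf{V}^{-1}\|_2 \max_{1 \le j \le n} |e^{t\lambda_j}|. \tag{5.11} $$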

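Thus the asymptotic decay rate of $\|\mathbf{x}(t)\|_2$ is controlled by the rightmost eigenvalue of $\mathbf{A}$ in the complex plane. If all eigenvalues of $\mathbf{A}$ have negative real part, then $\|\mathbf{x}(t)\|_2 \to 0$ as $t \to \infty$ for all initial conditions $\mathbf{x}(0)$. Such systems are called stable.

[Margin note: When $\|\mathbf{V}\|_2 \|\mathbf{V}^{-1}\|_2 > 1$, it is possible that $\|\mathbf{x}(t)\|_2 / \|\mathbf{x}_0\|_2 > 1$ for small $t > 0$, even if this ratio must eventually decay to zero as $t \to \infty$. The possibility of this transient growth complicates the analysis of dynamical systems with non-symmetric coefficient matrices, and turns out to be closely related to the sensitivity of the eigenvalues of $\mathbf{A}$ to perturbations. This behavior is both fascinating and physically important, but regrettably beyond the scope of these lectures.]

Note that the definition $e^{t\mathbf{A}} = \mathbf{V}e^{t\boldsymbol{\Lambda}}\mathbf{V}^{-1}$ is consistent with the more general definition obtained by substituting $t\mathbf{A}$ into the same Taylor series that defines the scalar exponential:

$$ e^{t\mathbf{A}} = \mathbf{I} + t\mathbf{A} + \frac{1}{2!} t^2 \mathbf{A}^2 + \frac{1}{3!} t^3 \mathbf{A}^3 + \frac{1}{4!} t^4 \mathbf{A}^4 + \cdots. $$

Setting $\mathbf{x}(t) = e^{t\mathbf{A}}\mathbf{x}_0$ with this series definition of $e^{t\mathbf{A}}$,

$$ \begin{aligned} \mathbf{x}'(t) &= \frac{d}{dt}\left( e^{t\mathbf{A}} \mathbf{x}_0 \right) \\ &= \frac{d}{dt}\left( \mathbf{I} + t\mathbf{A} + \frac{t^2}{2!}\mathbf{A}^2 + \frac{t^3}{3!}\mathbf{A}^3 + \cdots \right)\mathbf{x}_0 \\ &= \left( \mathbf{0} + \mathbf{A} + t\mathbf{A}^2 + \frac{t^2}{2!}\mathbf{A}^3 + \frac{t^3}{3!}\mathbf{A}^4 + \cdots \right)\mathbf{x}_0 \\ &= \mathbf{A}\left( \mathbf{I} + t\mathbf{A} + \frac{t^2}{2!}\mathbf{A}^2 + \frac{t^3}{3!}\mathbf{A}^3 + \cdots \right)\mathbf{x}_0 \\ &= \mathbf{A}e^{t\mathbf{A}}\mathbf{x}_0 \\ &= \mathbf{A}\mathbf{x}(t). \end{aligned} $$

Hence $\mathbf{x}(t) = e^{t\mathbf{A}}\mathbf{x}_0$ solves the equation $\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t)$, and satisfies the initial condition $\mathbf{x}(0) = \mathbf{x}_0$.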
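
The agreement between the two definitions of the matrix exponential can be checked numerically on a small case. The sketch below is illustrative only: the 2 × 2 matrix (eigenvalues −1 and −2, with hand-computed eigenvector matrix `V` and inverse `V_inv`), the helper names, and the choice of 30 series terms are all assumptions made for this example, not taken from the notes.

```python
import math


def matmul(P, Q):
    """Product of two square matrices stored as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(A, t, terms=30):
    """Partial sum I + tA + (tA)^2/2! + ... using `terms` terms."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds (tA)^m / m!
    for m in range(1, terms):
        term = matmul(term, A)
        term = [[t * entry / m for entry in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Hypothetical test matrix with eigenvalues -1 and -2.
A = [[0.0, 1.0], [-2.0, -3.0]]
V = [[1.0, 1.0], [-1.0, -2.0]]      # columns are eigenvectors of A
V_inv = [[2.0, 1.0], [-1.0, -1.0]]  # inverse of V, computed by hand
t = 1.0

exp_Lambda = [[math.exp(-1.0 * t), 0.0], [0.0, math.exp(-2.0 * t)]]
E_diag = matmul(matmul(V, exp_Lambda), V_inv)  # V e^{t Lambda} V^{-1}
E_series = expm_taylor(A, t)                   # truncated Taylor series

difference = max(abs(E_series[i][j] - E_diag[i][j])
                 for i in range(2) for j in range(2))
print("largest entrywise difference:", difference)
```

For matrices of large norm the truncated series converges slowly and suffers from cancellation, so this direct summation is meant as a check of the identity rather than as a practical algorithm.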

5.8.1 Linear multistep methods for linear systems


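What can be said of the behavior of a linear multistep method applied to $\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t)$? Euler’s method, for example, takes the form

$$ \mathbf{x}_{k+1} = \mathbf{x}_k + h\mathbf{A}\mathbf{x}_k = (\mathbf{I} + h\mathbf{A})\mathbf{x}_k. $$

Iterating from the initial condition,

$$ \begin{aligned} \mathbf{x}_1 &= (\mathbf{I} + h\mathbf{A})\mathbf{x}_0 \\ \mathbf{x}_2 &= (\mathbf{I} + h\mathbf{A})\mathbf{x}_1 = (\mathbf{I} + h\mathbf{A})^2\mathbf{x}_0 \\ \mathbf{x}_3 &= (\mathbf{I} + h\mathbf{A})\mathbf{x}_2 = (\mathbf{I} + h\mathbf{A})^3\mathbf{x}_0 \\ &\;\;\vdots \end{aligned} $$

and, in general,

$$ \mathbf{x}_k = (\mathbf{I} + h\mathbf{A})^k\mathbf{x}_0. $$

We can understand the asymptotic behavior of $(\mathbf{I} + h\mathbf{A})^k$ by examining the eigenvalues of $\mathbf{I} + h\mathbf{A}$: the quantity $(\mathbf{I} + h\mathbf{A})^k \to \mathbf{0}$ if and only if all the eigenvalues of $\mathbf{I} + h\mathbf{A}$ are less than one in magnitude. (Recall that $\mathbf{M}^k \to \mathbf{0}$ as $k \to \infty$ if and only if all eigenvalues of $\mathbf{M}$ have magnitude strictly less than one.) The spectral mapping theorem ensures that if $(\lambda_j, \mathbf{v}_j)$ is an eigenvalue-eigenvector pair for $\mathbf{A}$, then $(1 + h\lambda_j, \mathbf{v}_j)$ is an eigenpair of $\mathbf{I} + h\mathbf{A}$. This is easy to verify by a direct computation: if $\mathbf{A}\mathbf{v}_j = \lambda_j\mathbf{v}_j$, then $(\mathbf{I} + h\mathbf{A})\mathbf{v}_j = \mathbf{v}_j + h\mathbf{A}\mathbf{v}_j = (1 + h\lambda_j)\mathbf{v}_j$.

It follows that the numerical solution $\mathbf{x}_k$ computed by Euler’s method will decay to zero if $|1 + h\lambda_j| < 1$ for all eigenvalues $\lambda_j$ of $\mathbf{A}$. In the language of the last lecture, this requires that $h\lambda_j$ falls in the absolute stability region for the forward Euler method for all eigenvalues $\lambda_j$ of $\mathbf{A}$.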
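
The condition $|1 + h\lambda_j| < 1$ can be turned into an explicit step-size limit. Expanding $|1 + h\lambda|^2 = 1 + 2h\,\mathrm{Re}\,\lambda + h^2|\lambda|^2$ shows that, for $\mathrm{Re}\,\lambda < 0$, the condition holds exactly when $0 < h < -2\,\mathrm{Re}\,\lambda / |\lambda|^2$. A minimal sketch of that test follows; the function names are hypothetical, and the eigenvalues are assumed to be supplied by the caller (they are not computed here).

```python
def euler_is_stable(h, eigenvalues):
    """True if h*lam lies in the forward Euler stability region for every lam."""
    return all(abs(1.0 + h * lam) < 1.0 for lam in eigenvalues)


def euler_step_limit(eigenvalues):
    """Supremum of step sizes h with |1 + h*lam| < 1 for every eigenvalue.

    Assumes every eigenvalue has negative real part.
    """
    return min(-2.0 * complex(lam).real / abs(lam) ** 2 for lam in eigenvalues)


eigs = [-1.0, -100.0]  # illustrative spectrum
print("step limit:", euler_step_limit(eigs))
print("h = 0.019 stable?", euler_is_stable(0.019, eigs))
print("h = 0.021 stable?", euler_is_stable(0.021, eigs))
```

The eigenvalue of largest magnitude need not be the binding one when the spectrum is complex, which is why the limit is taken as a minimum over all eigenvalues.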

For a general linear multistep method, this criterion generalizes to the requirement that $h\lambda_j$ be located in the method’s absolute stability region for all eigenvalues $\lambda_j$ of $\mathbf{A}$. This phenomenon is illustrated in Figures 5.13 and 5.14. Here $\mathbf{A}$ is a 16 × 16 matrix with all its eigenvalues in the left half of the complex plane. We wish to solve $\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t)$ using the second-order Adams–Bashforth method, whose stability region was plotted in the last lecture. The plots in Figure 5.13 show $h\lambda_j$ as crosses for the eigenvalues $\lambda_1, \ldots, \lambda_{16}$ of $\mathbf{A}$. If any value of $h\lambda_j$ is outside the stability region (shown in blue), then the iteration will grow exponentially. If $h$ is sufficiently small that $h\lambda_j$ is in the stability region for all eigenvalues $\lambda_j$, then $\mathbf{x}_k \to \mathbf{0}$ as $k \to \infty$, consistent with the behavior of the exact solution, $\mathbf{x}(t) \to \mathbf{0}$ as $t \to \infty$. Figure 5.14 shows $\|\mathbf{x}_k\|_2$ as a function of $t_k$ for three values of $h$ (two unstable and one stable).

Figure 5.13: Values of $h\lambda_j$ for all eigenvalues of a 16 × 16 matrix $\mathbf{A}$, in four panels ($h = 1/4$, $h = 1/8$, $h = 1/10$, $h = 1/16$). For $h = 1/4$ many values of $h\lambda_j$ fall outside the stability region of the second-order Adams–Bashforth method. For $h = 1/8$, only one $h\lambda_j$ is outside the stability region, but that is enough to mean that $\|\mathbf{x}_k\| \to \infty$ as $k \to \infty$. For $h = 1/10$ and $h = 1/16$, all values of $h\lambda_j$ are in the stability region, so $\|\mathbf{x}_k\| \to 0$ as $k \to \infty$.

This example deserves closer scrutiny. Suppose $\mathbf{A}$ is diagonalizable, so we can write $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$. Thus,

$$ \begin{aligned} \mathbf{x}_k &= (\mathbf{I} + h\mathbf{A})^k\mathbf{x}_0 \\ &= (\mathbf{I} + h\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1})^k\mathbf{x}_0 \\ &= (\mathbf{V}\mathbf{V}^{-1} + h\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1})^k\mathbf{x}_0 \\ &= \mathbf{V}(\mathbf{I} + h\boldsymbol{\Lambda})^k\mathbf{V}^{-1}\mathbf{x}_0. \end{aligned} $$

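Compare this last expression to the formula (5.10) for the true solution $\mathbf{x}(t)$ in terms of the matrix exponential. As we did in that case, we can bound $\|\mathbf{x}_k\|_2$:

$$ \begin{aligned} \|\mathbf{x}_k\|_2 &= \|\mathbf{V}(\mathbf{I} + h\boldsymbol{\Lambda})^k\mathbf{V}^{-1}\mathbf{x}_0\|_2 \\ &\le \|\mathbf{V}(\mathbf{I} + h\boldsymbol{\Lambda})^k\mathbf{V}^{-1}\|_2 \, \|\mathbf{x}_0\|_2 \\ &\le \|\mathbf{V}\|_2 \|\mathbf{V}^{-1}\|_2 \, \|(\mathbf{I} + h\boldsymbol{\Lambda})^k\|_2 \, \|\mathbf{x}_0\|_2. \end{aligned} $$

Since $\mathbf{I} + h\boldsymbol{\Lambda}$ is a diagonal matrix,

$$ (\mathbf{I} + h\boldsymbol{\Lambda})^k = \begin{bmatrix} (1 + h\lambda_1)^k & & & \\ & (1 + h\lambda_2)^k & & \\ & & \ddots & \\ & & & (1 + h\lambda_n)^k \end{bmatrix}, $$

giving

$$ \|(\mathbf{I} + h\boldsymbol{\Lambda})^k\|_2 = \max_{1 \le j \le n} |1 + h\lambda_j|^k. $$

Thus, we arrive at the bound

$$ \frac{\|\mathbf{x}_k\|_2}{\|\mathbf{x}_0\|_2} \le \|\mathbf{V}\|_2 \|\mathbf{V}^{-1}\|_2 \max_{1 \le j \le n} |1 + h\lambda_j|^k, $$

which is analogous to the bound (5.11) for the exact solution.

Figure 5.14: The second-order Adams–Bashforth method applied to $\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t)$ for the same matrix $\mathbf{A}$ used for Figure 5.13; the plot shows $\|\mathbf{x}_k\|_2$ (logarithmic scale) against $t_k$ for $h = 1/4$, $h = 1/8$, and $h = 1/10$. As seen in that figure, for step-sizes $h = 1/4$ and $h = 1/8$ the method is unstable, and $\|\mathbf{x}_k\|_2 \to \infty$ as $k \to \infty$. When $h = 1/10$, $h\lambda_j$ is in the stability region for all eigenvalues $\lambda_j$ of $\mathbf{A}$, and hence $\|\mathbf{x}_k\|_2 \to 0$ as $k \to \infty$.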


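We can glean just a bit more from our analysis of $\mathbf{x}_k$. Since $\mathbf{A}$ is diagonalizable, its eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ form a basis for $\mathbb{C}^n$. Expand the initial condition $\mathbf{x}_0$ in this basis:

$$ \mathbf{x}_0 = \sum_{j=1}^n c_j \mathbf{v}_j = \mathbf{V}\mathbf{c}. $$

Now, our earlier expression for $\mathbf{x}_k$ gives

$$ \mathbf{x}_k = \mathbf{V}(\mathbf{I} + h\boldsymbol{\Lambda})^k\mathbf{V}^{-1}\mathbf{x}_0 = \mathbf{V}(\mathbf{I} + h\boldsymbol{\Lambda})^k\mathbf{V}^{-1}\mathbf{V}\mathbf{c} = \mathbf{V}(\mathbf{I} + h\boldsymbol{\Lambda})^k\mathbf{c}. $$

Since

$$ \begin{bmatrix} (1 + h\lambda_1)^k & & & \\ & (1 + h\lambda_2)^k & & \\ & & \ddots & \\ & & & (1 + h\lambda_n)^k \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} = \begin{bmatrix} (1 + h\lambda_1)^k c_1 \\ (1 + h\lambda_2)^k c_2 \\ \vdots \\ (1 + h\lambda_n)^k c_n \end{bmatrix}, $$

we have

$$ \mathbf{x}_k = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n \end{bmatrix} \begin{bmatrix} (1 + h\lambda_1)^k c_1 \\ (1 + h\lambda_2)^k c_2 \\ \vdots \\ (1 + h\lambda_n)^k c_n \end{bmatrix} = \sum_{j=1}^n c_j (1 + h\lambda_j)^k \mathbf{v}_j. \tag{5.12} $$

Thus as $k \to \infty$, the approximate solution $\mathbf{x}_k$ will start to look more and more like (a scaled version of) the vector $\mathbf{v}_\ell$, where $\ell$ is the index that maximizes $|1 + h\lambda_j|$:

$$ |1 + h\lambda_\ell| = \max_{1 \le j \le n} |1 + h\lambda_j|. $$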
1≤ j ≤ n

5.8.2 Stiff differential equations


In the example in Figures 5.13 and 5.14 the step-size did not need to be very small for all $h\lambda_j$ to be contained within the stability region. However, most practical examples in science and engineering yield matrices $\mathbf{A}$ whose eigenvalues span multiple orders of magnitude, and in this case the stability requirement is far more difficult to satisfy. First consider the following simple example. Let

$$ \mathbf{A} = \begin{bmatrix} -199 & -198 \\ 99 & 98 \end{bmatrix}, $$

which has the diagonalization

$$ \mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1} = \begin{bmatrix} -1 & 2 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} -1 & 0 \\ 0 & -100 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 1 & 1 \end{bmatrix}. $$

The eigenvalues are $\lambda_1 = -1$ and $\lambda_2 = -100$, and the exact solution takes the form

$$ \mathbf{x}(t) = e^{t\mathbf{A}}\mathbf{x}_0 = \mathbf{V} \begin{bmatrix} e^{-t} & 0 \\ 0 & e^{-100t} \end{bmatrix} \mathbf{V}^{-1}\mathbf{x}_0. $$

Writing the initial condition as

$$ \mathbf{x}_0 = \mathbf{V} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = c_1 \begin{bmatrix} -1 \\ 1 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ -1 \end{bmatrix} $$

(note that $\mathbf{c} = \mathbf{V}^{-1}\mathbf{x}_0$), the solution is

$$ \mathbf{x}(t) = \mathbf{V} \begin{bmatrix} e^{-t} & 0 \\ 0 & e^{-100t} \end{bmatrix} \mathbf{V}^{-1}\mathbf{x}_0 = \mathbf{V} \begin{bmatrix} e^{-t} & 0 \\ 0 & e^{-100t} \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = c_1 e^{-t} \begin{bmatrix} -1 \\ 1 \end{bmatrix} + c_2 e^{-100t} \begin{bmatrix} 2 \\ -1 \end{bmatrix}, $$

and so $\mathbf{x}(t) \to \mathbf{0}$ as $t \to \infty$. The eigenvalue $\lambda_2 = -100$ corresponds to a fast transient, a component of the solution that decays very rapidly; the eigenvalue $\lambda_1 = -1$ corresponds to a slow transient, a component of the solution that decays much more slowly. Using this insight we can describe the behavior of the system as $t \to \infty$ more precisely than merely saying $\mathbf{x}(t) \to \mathbf{0}$. Since $e^{-100t}$ decays much more quickly than $e^{-t}$, the solution will be dominated by the $\lambda_1$ term:

$$ \mathbf{x}(t) \sim c_1 e^{-t} \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \qquad t \to \infty, $$

provided $c_1 \ne 0$. This means that the solution vector $\mathbf{x}(t)$ will quickly align in the $\mathbf{v}_1$ direction as it converges toward zero.

[Margin figure: Some snapshots of the exact solution $\mathbf{x}(t)$ for the example with $\lambda_1 = -1$ and $\lambda_2 = -100$, using initial condition $\mathbf{x}(0) = [1, 1]^T$; the points $\mathbf{x}(0)$, $\mathbf{x}(.002)$, $\mathbf{x}(.006)$, $\mathbf{x}(.012)$, and $\mathbf{x}(2)$ are plotted in the $(x_1(t), x_2(t))$ plane. The solution decays, $\|\mathbf{x}(t)\|_2 \to 0$ as $t \to \infty$, and as it does so, the solution aligns in the direction of the eigenvector associated with $\lambda_1 = -1$, $\mathbf{v}_1 = [-1, 1]^T$.]

Now apply the forward Euler method to this problem. From the general expression (5.12), the iterate $\mathbf{x}_k$ can be written in the basis of eigenvectors as

$$ \begin{aligned} \mathbf{x}_k &= c_1 (1 + h\lambda_1)^k \begin{bmatrix} -1 \\ 1 \end{bmatrix} + c_2 (1 + h\lambda_2)^k \begin{bmatrix} 2 \\ -1 \end{bmatrix} \\ &= c_1 (1 - h)^k \begin{bmatrix} -1 \\ 1 \end{bmatrix} + c_2 (1 - 100h)^k \begin{bmatrix} 2 \\ -1 \end{bmatrix}. \end{aligned} $$

To obtain a numerical solution $\{\mathbf{x}_k\}$ that mimics the asymptotic behavior of the true solution, $\mathbf{x}(t) \to \mathbf{0}$, one must choose $h$ sufficiently small that $|1 + h\lambda_1| = |1 - h| < 1$ and $|1 + h\lambda_2| = |1 - 100h| < 1$. The first condition requires $h \in (0, 2)$, while the second condition is far more restrictive: $h \in (0, 1/50)$. The more restrictive condition describes the values of $h$ that will give $\mathbf{x}_k \to \mathbf{0}$ for all $\mathbf{x}_0$.

[Margin figure: Some iterates $\mathbf{x}_k$ of the forward Euler method for the example with $\lambda_1 = -1$ and $\lambda_2 = -100$, using time-step $h = 0.021$; the iterates $\mathbf{x}_0, \mathbf{x}_2, \mathbf{x}_4, \mathbf{x}_6, \mathbf{x}_8, \mathbf{x}_{10}$ are plotted in the $((\mathbf{x}_k)_1, (\mathbf{x}_k)_2)$ plane. This time-step is slightly larger than the stability limit $h < 0.02$, so $\|\mathbf{x}_k\|_2 \to \infty$ as $k$ increases. Moreover, the solution aligns in the direction of the eigenvector associated with the most unstable eigenvalue, $\mathbf{v}_2 = [2, -1]^T$.]

Take note of this phenomenon: the faster a component decays from the true solution (like $e^{-100t}$ in our example), the smaller the time step must be for the forward Euler method (and other explicit schemes).
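
The step-size threshold $h < 1/50$ for this 2 × 2 example is easy to observe numerically. The sketch below applies forward Euler with one step size just above the limit and one just below it; the helper names, the number of steps (200), and the initial vector $[1, 1]^T$ (borrowed from the margin figures) are choices made for illustration.

```python
import math


def forward_euler(A, x0, h, steps):
    """Iterate x_{k+1} = x_k + h*A*x_k and return the final iterate."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + h * Ax[i] for i in range(n)]
    return x


def norm2(v):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(c * c for c in v))


A = [[-199.0, -198.0], [99.0, 98.0]]  # eigenvalues -1 and -100
x0 = [1.0, 1.0]

x_unstable = forward_euler(A, x0, 0.021, 200)  # h just above 1/50
x_stable = forward_euler(A, x0, 0.019, 200)    # h just below 1/50

print("h = 0.021: norm after 200 steps =", norm2(x_unstable))
print("h = 0.019: norm after 200 steps =", norm2(x_stable))
print("h = 0.021: ratio of components  =", x_unstable[0] / x_unstable[1])
```

With $h = 0.021$ the factor $(1 - 100h)^k = (-1.1)^k$ grows without bound and the iterate lines up with $\mathbf{v}_2 = [2, -1]^T$, so the printed component ratio is close to −2; with $h = 0.019$ both factors have magnitude below one and the iterate decays.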

Problems for which A has eigenvalues with significantly different


magnitudes are called stiff differential equations. For such problems,
implicit methods – which generally have much larger stability re-
gions – are generally favored.
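
To see why implicit methods are attractive here, the sketch below applies the backward Euler method, $\mathbf{x}_{k+1} = \mathbf{x}_k + h\mathbf{A}\mathbf{x}_{k+1}$, to the same 2 × 2 stiff example with $h = 0.1$, five times the forward Euler limit. It is a minimal illustration: the function names are hypothetical, and the 2 × 2 solve by Cramer's rule is used only because the system is tiny.

```python
import math


def solve_2x2(M, b):
    """Solve the 2-by-2 system M x = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]


def backward_euler_2x2(A, x0, h, steps):
    """Iterate (I - h*A) x_{k+1} = x_k for a 2-by-2 matrix A."""
    M = [[1.0 - h * A[0][0], -h * A[0][1]],
         [-h * A[1][0], 1.0 - h * A[1][1]]]
    x = list(x0)
    for _ in range(steps):
        x = solve_2x2(M, x)
    return x


A = [[-199.0, -198.0], [99.0, 98.0]]  # eigenvalues -1 and -100
x0 = [1.0, 1.0]

x_be = backward_euler_2x2(A, x0, 0.1, 50)
norm_be = math.sqrt(x_be[0] ** 2 + x_be[1] ** 2)
print("backward Euler, h = 0.1, 50 steps: norm =", norm_be)
print("ratio of components =", x_be[0] / x_be[1])
```

Each backward Euler step multiplies the eigenvector components by $1/(1 + h)$ and $1/(1 + 100h)$, both smaller than one for every $h > 0$, so the iterates decay and align with $\mathbf{v}_1 = [-1, 1]^T$ just as the exact solution does. The price is a linear solve at every step.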
Thus far we have only sought xk → 0 as k → ∞. In some cases,
we merely wish for xk to be bounded. In this case, it is acceptable
to have an eigenvalue hλ j on the boundary of the absolute stability
region of a method, provided it is not a repeated eigenvalue (more
precisely, provided it is associated with 1 × 1 Jordan blocks, i.e., it is
not defective).

5.8.3 Closing example: heat equation


We can draw together many themes from this course in one elementary but vital example, the solution of the linear partial differential equation

$$ u_t(x, t) = u_{xx}(x, t) $$

posed on the domain $x \in [0, 1]$ and $t \ge 0$. This heat equation models the temperature in a long bar. The length of the bar is spanned by the $x$ variable; $t$ refers to time, starting from $t = 0$ when the state of the system is specified,

$$ u(x, 0) = U_0(x), \qquad x \in [0, 1]. $$

Quenching both ends of this bar in an ice bath equates to the homogeneous Dirichlet boundary conditions

$$ u(0, t) = u(1, t) = 0. $$

One common approach to numerically solving this equation is called the method of lines. Here is the basic idea: discretize the continuum $x \in [0, 1]$ into a set of discrete points, then approximate $u_{xx}$ by a finite difference approximation in the $x$ variable. This discretization converts the original partial differential equation into ordinary differential equations, which can be readily solved using any of the techniques discussed in this chapter.

Let us get more specific. Discretize $[0, 1]$ into the points $x_0, \ldots, x_{n+1}$, uniformly spaced a distance

$$ \Delta x = \frac{1}{n+1} $$

apart from one another,

$$ x_j = j\Delta x, \qquad j = 0, \ldots, n + 1. $$

1. Recall from Section 1.7 that one can approximate derivatives by differentiating interpolating polynomials. Indeed, in Section 1.7 we already saw how the second derivative can be approximated via the difference

$$ u_{xx}(x_j, t) \approx \frac{u(x_{j-1}, t) - 2u(x_j, t) + u(x_{j+1}, t)}{(\Delta x)^2}, \tag{5.13} $$

which is just an adaptation of equation (1.22). In making these approximations, we shall no longer have access to the values of the exact solution such as $u(x_j, t)$, so we will introduce a set of new functions

$$ u_j(t) \approx u(x_j, t), \qquad j = 0, \ldots, n + 1. $$

Then (5.13) becomes

$$ u_{xx}(x_j, t) \approx \frac{u_{j-1}(t) - 2u_j(t) + u_{j+1}(t)}{(\Delta x)^2}. \tag{5.14} $$

Notice that the boundary conditions from the differential equation,

$$ u(0, t) = u(1, t) = 0, \qquad t \ge 0, $$

directly imply that $u_0(t) = u_{n+1}(t) = 0$ for all $t \ge 0$. Now our goal is to find $u_j(t)$ for $j = 1, \ldots, n$.

[Margin note: This explains why this approach is called the ‘method of lines’. It will develop approximations $u_j(t)$ to the solution $u(x, t)$ on ‘lines’ of constant $x = x_j$ values in the $(x, t)$ plane.]

2. Now recall the partial differential equation $u_t(x, t) = u_{xx}(x, t)$. Replacing $u_{xx}(x, t)$ with the finite difference approximation, the differential equation suggests that we find $u_j(t)$ so that

$$ \frac{\partial}{\partial t} u_j(t) = \frac{u_{j-1}(t) - 2u_j(t) + u_{j+1}(t)}{(\Delta x)^2}, \qquad j = 1, \ldots, n. $$

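These equations form a coupled linear system. Perhaps it helps to write them out individually for a few values of $j$:

$$ \begin{aligned} \frac{\partial}{\partial t} u_1(t) &= \frac{u_0(t) - 2u_1(t) + u_2(t)}{(\Delta x)^2} = \frac{-2u_1(t) + u_2(t)}{(\Delta x)^2} \\ \frac{\partial}{\partial t} u_2(t) &= \frac{u_1(t) - 2u_2(t) + u_3(t)}{(\Delta x)^2} \\ &\;\;\vdots \\ \frac{\partial}{\partial t} u_{n-1}(t) &= \frac{u_{n-2}(t) - 2u_{n-1}(t) + u_n(t)}{(\Delta x)^2} \\ \frac{\partial}{\partial t} u_n(t) &= \frac{u_{n-1}(t) - 2u_n(t) + u_{n+1}(t)}{(\Delta x)^2} = \frac{u_{n-1}(t) - 2u_n(t)}{(\Delta x)^2}. \end{aligned} $$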

Compile these equations into the matrix form

$$ \frac{\partial}{\partial t} \begin{bmatrix} u_1(t) \\ u_2(t) \\ \vdots \\ u_{n-1}(t) \\ u_n(t) \end{bmatrix} = \frac{1}{(\Delta x)^2} \begin{bmatrix} -2 & 1 & & & \\ 1 & -2 & \ddots & & \\ & \ddots & \ddots & 1 & \\ & & 1 & -2 & 1 \\ & & & 1 & -2 \end{bmatrix} \begin{bmatrix} u_1(t) \\ u_2(t) \\ \vdots \\ u_{n-1}(t) \\ u_n(t) \end{bmatrix}, $$

which we summarize as

$$ \mathbf{u}'(t) = \mathbf{A}\mathbf{u}(t), $$

a system of linear ordinary differential equations (note that $\mathbf{A}$ contains the leading $1/(\Delta x)^2$ factor), with initial condition derived from initial data $U_0(x)$ given for this problem:

$$ \mathbf{u}_0 = \mathbf{u}(0) = \begin{bmatrix} u_1(0) \\ \vdots \\ u_n(0) \end{bmatrix} = \begin{bmatrix} U_0(x_1) \\ \vdots \\ U_0(x_n) \end{bmatrix}. $$
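
The semi-discretization in steps 1 and 2 can be sketched as follows. The code assembles the tridiagonal matrix (including its $1/(\Delta x)^2$ factor) and samples an initial profile at the interior grid points. The function names are hypothetical, the dense list-of-lists storage is chosen only for readability, and the particular profile $U_0(x) = 10x^2(1 - x)(1.2 + \sin(3\pi x))$ is the one used for the plots later in this section.

```python
import math


def heat_matrix(n):
    """Dense n-by-n second-difference matrix, scaled by 1/(dx)^2."""
    dx = 1.0 / (n + 1)
    scale = 1.0 / dx ** 2
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = -2.0 * scale
        if i > 0:
            A[i][i - 1] = scale
        if i < n - 1:
            A[i][i + 1] = scale
    return A


def initial_vector(U0, n):
    """Sample U0 at the interior points x_j = j*dx, j = 1, ..., n."""
    dx = 1.0 / (n + 1)
    return [U0(j * dx) for j in range(1, n + 1)]


def U0_example(x):
    """Initial temperature profile used for the plots in this section."""
    return 10.0 * x ** 2 * (1.0 - x) * (1.2 + math.sin(3.0 * math.pi * x))


n = 16
A = heat_matrix(n)
u0 = initial_vector(U0_example, n)
print("A[0][0] =", A[0][0], " A[0][1] =", A[0][1])
print("u0 has", len(u0), "entries; largest =", max(u0))
```

The boundary values $u_0(t) = u_{n+1}(t) = 0$ never appear explicitly: they are accounted for by the truncated first and last rows of the matrix.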
3. This approximation is called the semi-discretized form of the problem, since it is discretized in space, but time remains a continuous variable. Now one could quickly express the solution using the exponential of the matrix $\mathbf{A}$, as discussed earlier in this section,

$$ \mathbf{u}(t) = e^{t\mathbf{A}}\mathbf{u}(0). $$

However, for large $\mathbf{A}$ computation of $e^{t\mathbf{A}}$ (e.g., using MATLAB’s expm command) is quite expensive, and so we wish to approximate the solution of the ordinary differential equation using one of the techniques studied in this chapter.

[Margin note: Such large $\mathbf{A}$ arise when we have partial differential equations in two and three physical dimensions. The one-dimensional example here is easy by comparison.]

For example, one could fix a time-step $\Delta t > 0$ and seek an approximation

$$ \mathbf{u}_k \approx \mathbf{u}(t_k). $$

The forward Euler method gives

$$ \mathbf{u}_{k+1} = \mathbf{u}_k + (\Delta t)\mathbf{A}\mathbf{u}_k, $$

while the (implicit) backward Euler method

$$ \mathbf{u}_{k+1} = \mathbf{u}_k + (\Delta t)\mathbf{A}\mathbf{u}_{k+1} $$

leads to the linear system of equations

$$ (\mathbf{I} - (\Delta t)\mathbf{A})\mathbf{u}_{k+1} = \mathbf{u}_k $$

that must be solved (e.g., via Gaussian elimination) at each step to find $\mathbf{u}_{k+1}$.

[Margin note: If $\Delta t$ is fixed, then one would compute a Cholesky or LU factorization of $\mathbf{I} - \Delta t\mathbf{A}$, thus expediting the solution of this system at each step. If $\mathbf{A}$ is banded, as it is in this example, such factorizations are very fast.]

The main question, then, is: given a choice of numerical integrator (forward Euler, backward Euler, etc.), how large can the time step $\Delta t$ be to maintain stability?
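
One way to exploit the banded structure in the backward Euler step is tridiagonal Gaussian elimination without pivoting (often called the Thomas algorithm), sketched below. This is a substitute for the precomputed Cholesky or LU factorization mentioned in the margin, not an implementation of it; the function names are hypothetical. Skipping pivoting is safe here because $\mathbf{I} - (\Delta t)\mathbf{A}$ is strictly diagonally dominant: its diagonal entries are $1 + 2r$ and its off-diagonal entries are $-r$, with $r = \Delta t/(\Delta x)^2$.

```python
import math


def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) operations.

    lower[i] multiplies x[i-1] and upper[i] multiplies x[i+1] in row i;
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def backward_euler_heat_step(u, dt, dx):
    """One step of (I - dt*A) u_new = u for the heat-equation matrix A."""
    r = dt / dx ** 2
    n = len(u)
    return thomas_solve([-r] * n, [1.0 + 2.0 * r] * n, [-r] * n, u)


n = 16
dx = 1.0 / (n + 1)
u = [math.sin(math.pi * j * dx) for j in range(1, n + 1)]  # sample profile
u_next = backward_euler_heat_step(u, 0.01, dx)
print("max before:", max(u), " max after one step:", max(u_next))
```

Each step costs a number of operations proportional to $n$, in contrast to the $O(n^3)$ cost of dense Gaussian elimination.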

4. To answer this question, diagonalize $\mathbf{A}$ to reveal its eigenvalues. Thankfully, explicit formulas are available for these eigenvalues,

$$ \lambda_j = \frac{2\cos(j\pi/(n+1)) - 2}{(\Delta x)^2}, \qquad j = 1, \ldots, n; $$

the associated eigenvectors have a beautifully elegant formula:

$$ \mathbf{v}_j = \begin{bmatrix} \sin\left(\frac{j\pi}{n+1}\right) \\ \sin\left(\frac{2j\pi}{n+1}\right) \\ \vdots \\ \sin\left(\frac{nj\pi}{n+1}\right) \end{bmatrix} = \begin{bmatrix} \sin(j\pi x_1) \\ \sin(j\pi x_2) \\ \vdots \\ \sin(j\pi x_n) \end{bmatrix}, \qquad j = 1, \ldots, n. $$

These eigenvalues are all real. The rightmost eigenvalue is

$$ \lambda_1 = \frac{2\cos(\pi/(n+1)) - 2}{(\Delta x)^2} \approx \frac{2\left(1 - \pi^2/(2(n+1)^2)\right) - 2}{(\Delta x)^2} = -\pi^2 $$

(here we use the Taylor approximation for cosine: as $\theta \to 0$, $\cos(\theta) = 1 - \tfrac{1}{2}\theta^2 + O(\theta^4)$), while the leftmost eigenvalue is

$$ \lambda_n = \frac{2\cos(n\pi/(n+1)) - 2}{(\Delta x)^2} \approx \frac{-4}{(\Delta x)^2} = -4(n+1)^2. $$

Notice that the eigenvalues of $\mathbf{A}$ are all negative, so

$$ \mathbf{u}(t) = e^{t\mathbf{A}}\mathbf{u}(0) \to \mathbf{0} $$

as $t \to \infty$. However, as $n$ increases, the leftmost eigenvalue increases in magnitude like $O(n^2)$.

 n     λ1               λn
 16    −9.841548 . . .  −1146.158 . . .
 32    −9.862152 . . .  −4346.137 . . .
 64    −9.867683 . . .  −16890.132 . . .

It will help our later discussion to visualize the eigenvectors of $\mathbf{A}$. Figure 5.15 shows all the eigenvectors for $n = 16$.
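
The eigenvalue and eigenvector formulas in step 4 can be checked directly. The sketch below evaluates them, applies the tridiagonal matrix without forming it, and measures the residual $\mathbf{A}\mathbf{v}_j - \lambda_j\mathbf{v}_j$; the function names are hypothetical.

```python
import math


def heat_eigenvalue(j, n):
    """lambda_j = (2 cos(j*pi/(n+1)) - 2) / (dx)^2 with dx = 1/(n+1)."""
    dx = 1.0 / (n + 1)
    return (2.0 * math.cos(j * math.pi / (n + 1)) - 2.0) / dx ** 2


def heat_eigenvector(j, n):
    """Entries sin(i*j*pi/(n+1)) for i = 1, ..., n."""
    return [math.sin(i * j * math.pi / (n + 1)) for i in range(1, n + 1)]


def apply_heat_matrix(u):
    """Compute A*u for the second-difference matrix without storing A."""
    n = len(u)
    dx = 1.0 / (n + 1)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((left - 2.0 * u[i] + right) / dx ** 2)
    return out


def eigen_residual(j, n):
    """Largest entry of |A v_j - lambda_j v_j|."""
    v = heat_eigenvector(j, n)
    lam = heat_eigenvalue(j, n)
    return max(abs(a - lam * b) for a, b in zip(apply_heat_matrix(v), v))


for n in (16, 32, 64):
    print(n, heat_eigenvalue(1, n), heat_eigenvalue(n, n))
print("worst residual for n = 16:",
      max(eigen_residual(j, 16) for j in range(1, 17)))
```

The printed eigenvalues can be compared against the table above, and the residual should be at the level of rounding error.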
5. Consider the implications of these eigenvalues for the forward Euler method: since

$$ \lambda_n = \frac{2\cos(n\pi/(n+1)) - 2}{(\Delta x)^2} > -4(n+1)^2, $$

the eigenvalues of $\mathbf{A}$ are contained in the interval

$$ [-4(n+1)^2, 0]. $$

To ensure stability of the method, choose the forward Euler time-step $\Delta t$ so that for all $\lambda \in [-4(n+1)^2, 0]$, $\lambda\Delta t$ is in the stability region for the method. Since $\lambda\Delta t \in \mathbb{R}$, we need only consider the intersection of the forward Euler stability region with the real axis, i.e., $(-2, 0)$. To get

$$ \lambda\Delta t \in (-2, 0) $$

for all $\lambda \in (-4(n+1)^2, 0)$, take

$$ \Delta t \le \frac{1}{2(n+1)^2} = \frac{(\Delta x)^2}{2}. $$

This is a crucial condition:

If you halve $\Delta x$, you must quarter $\Delta t$.

Figure 5.15: Eigenvectors of the matrix $\mathbf{A}$ from the method of lines discretization of the heat equation, for $n = 16$. The plot for $\lambda_j$ shows the $n$ entries of the corresponding eigenvector $\mathbf{v}_j$. (Lines are drawn between the entries in each eigenvector to help you appreciate that these eigenvectors approximate continuous functions of $x \in [0, 1]$ as $n \to \infty$.) Notice that as the eigenvalue gets increasingly negative, the eigenvector increases in frequency (i.e., it oscillates more rapidly). The sixteen panels are labeled with the eigenvalues λ1 = −9.84155, λ2 = −39.03105, λ3 = −86.57450, λ4 = −150.85285, λ5 = −229.67718, λ6 = −320.36323, λ7 = −419.82279, λ8 = −524.66889, λ9 = −631.33111, λ10 = −736.17721, λ11 = −835.63677, λ12 = −926.32282, λ13 = −1005.14715, λ14 = −1069.42550, λ15 = −1116.96895, λ16 = −1146.15845.


Thus, improving the spatial resolution of the discretization requires extreme refinement in the time-step for forward Euler. This requirement is known as the CFL condition for this problem, named after its discoverers Richard Courant, Kurt Otto Friedrichs, and Hans Lewy (1928).

Figure 5.16: Method of lines approximation to the heat equation, discretized in space with $n = 16$ to get $\mathbf{u}'(t) = \mathbf{A}\mathbf{u}(t)$, and solving this equation exactly in time. In the top plot, each component $u_j(t)$ of the solution vector $\mathbf{u}(t)$ is shown as a continuous function of time. The bottom plots take snapshots in time ($t_k$ = 0.0000, 0.0200, 0.0400, 0.0500, 0.0700, 0.1400), connecting the dots between all the values $u_j(t_k)$ at a fixed time $t_k$. Notice as $t \to \infty$, the solution looks increasingly like a multiple of the eigenvector $\mathbf{v}_1$ associated with the rightmost eigenvalue, shown in Figure 5.15.
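
The time-step restriction from step 5 can be tabulated for a few grid sizes. The sketch below compares the sharp forward Euler limit $2/|\lambda_n|$ with the simpler sufficient bound $(\Delta x)^2/2$; the function names are hypothetical.

```python
import math


def sharp_dt_limit(n):
    """Forward Euler needs dt < 2/|lambda_n| for the n-point heat matrix."""
    dx = 1.0 / (n + 1)
    lam_n = (2.0 * math.cos(n * math.pi / (n + 1)) - 2.0) / dx ** 2
    return 2.0 / abs(lam_n)


def sufficient_dt(n):
    """The simpler bound dt <= (dx)^2 / 2, which guarantees stability."""
    dx = 1.0 / (n + 1)
    return dx ** 2 / 2.0


for n in (16, 33, 67):  # n + 1 = 17, 34, 68: dx is halved each time
    print(n, sharp_dt_limit(n), sufficient_dt(n))
```

The grid sizes are chosen so that $n + 1$ doubles from one row to the next; each halving of $\Delta x$ divides the sufficient bound by exactly four, and the sharp limit shrinks by nearly the same factor.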
6. Figure 5.16 shows the solution of the heat equation with initial condition $U_0(x) = 10x^2(1 - x)(1.2 + \sin(3\pi x))$. The eigenvalues and eigenvectors explain the behavior seen in these plots. From our earlier discussion, recall that the exact solution of the semi-discretized problem is

$$ \mathbf{u}(t) = e^{t\mathbf{A}}\mathbf{u}(0) = \sum_{j=1}^n c_j e^{t\lambda_j}\mathbf{v}_j, \tag{5.15} $$

where the coefficients $\{c_j\}$ come from $\mathbf{c} = \mathbf{V}^{-1}\mathbf{u}(0)$. As with the two-dimensional example considered in Section 5.8.2, the formula (5.15) explains how $\mathbf{u}(t)$ behaves as $t \to \infty$. As $t$ increases, the sum will be increasingly dominated by the term involving the rightmost eigenvalue, $\lambda_1 \approx -\pi^2$. Indeed, presuming $c_1 \ne 0$, we expect

$$ \mathbf{u}(t) \sim c_1 e^{t\lambda_1}\mathbf{v}_1, $$

so the solution should increasingly resemble the eigenvector $\mathbf{v}_1$ as $t$ increases. This is evident in Figure 5.16.

Now solve this same problem using the forward Euler method. In the eigenvector basis, the forward Euler approximation is

$$ \mathbf{u}_k = (\mathbf{I} + (\Delta t)\mathbf{A})^k\mathbf{u}_0 = \sum_{j=1}^n c_j (1 + (\Delta t)\lambda_j)^k\mathbf{v}_j. \tag{5.16} $$

One gains a very rich insight by combining this formula with some insight from analysis. A smooth initial condition $U_0(x)$ to the original problem, which satisfies the boundary conditions $U_0(0) = U_0(1) = 0$, can be approximated very well with a Fourier sine series. Similarly, the discretized initial condition $\mathbf{u}_0$ on the spatial grid is approximated well by the leading eigenvectors of $\mathbf{A}$ (which are samples of sine functions).

[Margin note: There is a deep tie to the eigenfunctions of the underlying differential operator $Lu = -u''$, defined on the domain of twice differentiable functions satisfying the homogeneous Dirichlet boundary conditions.]

You can intuitively appreciate this fact by comparing the initial condition in Figure 5.16 (corresponding to $t_k = 0$) with the eigenvectors shown in Figure 5.15. The initial condition much more closely resembles the first few (smooth) eigenvectors than it does the highly oscillatory eigenvectors corresponding to the most negative eigenvalues. Figure 5.17 shows the decay of the $c_j$ coefficients for $n = 16$ and $U_0(x) = 10x^2(1 - x)(1.2 + \sin(3\pi x))$.

With $n = 16$, we previously noted that $\lambda_n = -1146.158\ldots$, so to maintain stability the forward Euler method requires

$$ \Delta t < 0.001744959\ldots. $$

Figure 5.18 shows the result of running forward Euler with a time-step $\Delta t = 0.002$ that is slightly too large. The computation proceeds reasonably for the first ten time steps or so, but by $k = 20$ the forward Euler iterate $\mathbf{u}_k$ begins to bear the fingerprint of a high-frequency oscillation. By the time $k = 70$, that instability completely dominates the solution. (Notice the different vertical scales in the $k = 35$ and $k = 70$ plots: the instability is growing fast!) Compare the approximation $\mathbf{u}_{70}$ to the eigenvector $\mathbf{v}_{16}$ shown in Figure 5.15. The close resemblance is no coincidence. In the eigenvector basis expansion (5.16) for $\mathbf{u}_k$, the dominant term will be

$$ c_n (1 + (\Delta t)\lambda_n)^k\mathbf{v}_n, $$

since $|1 + (\Delta t)\lambda_n|$ is larger than any of the other $|1 + (\Delta t)\lambda_j|$ values. Why does this term not start dominating right away? Because the magnitude of $c_n$ is much smaller than other $c_j$ values, as seen in Figure 5.17. Thus it takes a number of iterations before the growth of $(1 + (\Delta t)\lambda_n)^k$ counteracts the small value of $c_n$. When $k$ gets sufficiently large, this $\lambda_n$ term completely takes over.

Lest we end the lecture on a negative note, Figure 5.19 shows the result of running the forward Euler method with time-step $\Delta t = 0.0015$, just within the stability condition. The solution $\mathbf{u}_k$, for the same values of $k$ as shown in Figure 5.18, looks much closer to the solution from the matrix exponential shown in Figure 5.16.
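
The two forward Euler experiments described above can be reproduced in outline with a few lines of code. The sketch below uses $n = 16$, the initial condition $U_0(x) = 10x^2(1 - x)(1.2 + \sin(3\pi x))$, and 70 steps with each of the two time-steps; the function names are hypothetical, and only vector norms are printed rather than the plots in the figures.

```python
import math


def heat_rhs(u, dx):
    """Apply the second-difference matrix A to u (zero boundary values)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((left - 2.0 * u[i] + right) / dx ** 2)
    return out


def forward_euler_heat(u0, dt, dx, steps):
    """Iterate u_{k+1} = u_k + dt*A*u_k and return the final iterate."""
    u = list(u0)
    for _ in range(steps):
        Au = heat_rhs(u, dx)
        u = [ui + dt * ai for ui, ai in zip(u, Au)]
    return u


def norm2(v):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(c * c for c in v))


n = 16
dx = 1.0 / (n + 1)
xs = [j * dx for j in range(1, n + 1)]
u0 = [10.0 * x ** 2 * (1.0 - x) * (1.2 + math.sin(3.0 * math.pi * x))
      for x in xs]

u_unstable = forward_euler_heat(u0, 0.002, dx, 70)   # dt above the limit
u_stable = forward_euler_heat(u0, 0.0015, dx, 70)    # dt below the limit

print("norm of u_0:                 ", norm2(u0))
print("norm of u_70, dt = 0.002:    ", norm2(u_unstable))
print("norm of u_70, dt = 0.0015:   ", norm2(u_stable))
```

Because $\mathbf{A}$ is symmetric, a stable time-step makes $\|\mathbf{u}_k\|_2$ decrease at every step, while the slightly larger time-step lets the high-frequency components grow by several orders of magnitude within 70 steps.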
An astute reader might notice one subtlety. Any $\Delta t$ below the stability limit obeys the stability condition,

$$ |1 + (\Delta t)\lambda_j| < 1, \qquad j = 1, \ldots, n, $$

but when $\Delta t$ is very close to that limit, the term corresponding to $\lambda_n$ can still be the largest of these terms,

$$ |1 + (\Delta t)\lambda_n| > |1 + (\Delta t)\lambda_j|, \qquad j = 1, \ldots, n - 1. $$

This happens when $\Delta t > 2/(|\lambda_1| + |\lambda_n|) = (\Delta x)^2/2$, which is approximately $0.00173$ for $n = 16$. (For $\Delta t = 0.0015$ itself, $|1 + (\Delta t)\lambda_n| \approx 0.719$ is smaller than $|1 + (\Delta t)\lambda_1| \approx 0.985$, so the $\lambda_1$ term decays most slowly, just as in the exact solution.) In that near-limit regime the $(1 + (\Delta t)\lambda_n)^k$ term decays most slowly in the sum (5.16). Will it not eventually dominate, causing the solution to again resemble the shape of $\mathbf{v}_n$ as it decays to zero? Yes, indeed: but the small value of $c_n$ delays this effect, often well beyond the span of time we care about. Moreover, all the terms are getting smaller as $k$ increases. However, if you take enough steps and zoom in on the small magnitude solution, you will indeed see $\mathbf{v}_n$ emerge.

Figure 5.17: The magnitude of the coefficients $c_j$ (logarithmic scale, plotted against $j = 1, \ldots, 16$) for the initial condition $\mathbf{u}_0$ expanded in the basis of eigenvectors of $\mathbf{A}$, i.e., $\mathbf{c} = \mathbf{V}^{-1}\mathbf{u}_0$. Notice that the coefficients decrease rapidly as $j$ increases: the essence of a smooth initial condition is captured by the eigenvectors associated with the rightmost eigenvalues.

Figure 5.18: Method of lines approximation to the heat equation, discretized in space with $n = 16$ and discretized in time with the forward Euler method. The time-step $\Delta t = 0.002$ is slightly larger than the stability limit. In the top plot, each component $(\mathbf{u}_k)_j$ of the solution vector $\mathbf{u}_k$ is shown, connecting values in time. The bottom plots take snapshots in time ($k = 0$, $t_k = 0.0000$; $k = 10$, $t_k = 0.0200$; $k = 20$, $t_k = 0.0400$; $k = 25$, $t_k = 0.0500$; $k = 35$, $t_k = 0.0700$; $k = 70$, $t_k = 0.1400$, the last with vertical scale $\times 10^4$), connecting the dots between all the values $(\mathbf{u}_k)_j$ at a fixed time $t_k$. Notice as $k$ increases, the solution looks increasingly like a multiple of the eigenvector $\mathbf{v}_{16}$ associated with the leftmost eigenvalue, shown in Figure 5.15.

Figure 5.19: Method of lines approximation to the heat equation, discretized in space with $n = 16$ and discretized in time with the forward Euler method. The time-step $\Delta t = 0.0015$ obeys the stability condition, ensuring that $\mathbf{u}_k \to \mathbf{0}$ as $k \to \infty$. Compare these stable computations to the solution in Figure 5.16, which solves the equation exactly in time. The top plot shows each component against time; the bottom plots take snapshots at $k = 0$, $t_k = 0.0000$; $k = 10$, $t_k = 0.0150$; $k = 20$, $t_k = 0.0300$; $k = 25$, $t_k = 0.0375$; $k = 35$, $t_k = 0.0525$; and $k = 70$, $t_k = 0.1050$.

I hope our investigations this semester have given you a taste of the beautiful
mathematics that empower numerical computations, the discrimination to
pick the right algorithm to suit your given problem, the insight to identify
those problems that are inherently ill-conditioned, and the tenacity to always
seek clever, efficient solutions.
