
Introduction to Numerical

Methods
Notes on the lectures

Karla Ferjančič, Vito Vitrih

Unrevised version (2017)


Contents

1 Introduction
  1.1 Numerical solving
  1.2 Floating point
  1.3 Numerical errors
    1.3.1 Numerical stability
  1.4 Some educational examples

2 Nonlinear Equations
  2.1 Bisection
  2.2 Simple iterative method
  2.3 Newton or tangent method
  2.4 Secant method
  2.5 Other methods
  2.6 System of nonlinear equations
    2.6.1 Generalization of a simple iterative method
    2.6.2 Newton's method
  2.7 Polynomial equations
    2.7.1 Laguerre's method

3 Linear Systems
  3.1 Introduction
  3.2 Vector and Matrix norms
  3.3 Sensitivity of the linear systems
  3.4 LU decomposition
    3.4.1 Permutation matrices
    3.4.2 LU decomposition
    3.4.3 LU decomposition with pivoting
  3.5 Symmetric positive definite matrices
  3.6 Iterative methods for solving linear systems

4 Linear least-squares problem (Overdetermined systems)
  4.1 Introduction
  4.2 System of the normal equations
  4.3 QR decomposition
    4.3.1 Classical and modified Gram-Schmidt method
    4.3.2 Givens rotations
    4.3.3 Householder transformation
    4.3.4 The comparison of the presented methods
  4.4 Singular value decomposition

5 Eigenvalue problem
  5.1 Power method
  5.2 QR method

6 Interpolation
  6.1 Polynomial interpolation
  6.2 Divided differences
    6.2.1 Convergence of the interpolation polynomial
  6.3 Spline interpolation

7 Numerical differentiation and integration
  7.1 Numerical differentiation
    7.1.1 Whole error
  7.2 Numerical integration
    7.2.1 Newton–Cotes quadrature rules
    7.2.2 The irreducible error
    7.2.3 Composite rules
    7.2.4 Romberg's method
    7.2.5 Multiple integrals
    7.2.6 Monte Carlo method
    7.2.7 Introduction to Gaussian quadrature

8 Bézier curves
  8.1 Introduction
    8.1.1 Affine transformations in R3
    8.1.2 Linear interpolation
    8.1.3 Parametric curves
  8.2 Bézier curves
    8.2.1 Bernstein polynomials
    8.2.2 De Casteljau algorithm
    8.2.3 Properties of the Bézier curves
    8.2.4 Derivative of the Bézier curve
    8.2.5 Subdivision
  8.3 Other interesting topics for further studying
1 Introduction

1.1 Numerical solving


Our aim is to find a numerical solution of a given problem. Namely, instead of √3 we obtain 1.73205...
A numerical method is a procedure by which, starting from the initial numerical data, one obtains a numerical approximation to the solution of a given mathematical problem using a finite sequence of elementary operations (+, −, ·, /, √).
• Numerical methods are used when an analytical solution cannot be obtained. Here are some examples:
  – Searching for the roots of a polynomial of degree five, e.g. x^5 + 3x − 1 = 0. It is proven that no exact formula exists for the roots of polynomials of degree ≥ 5.
  – Solving transcendental equations, e.g. x + ln x = 0.
  – Computing a definite integral, e.g. ∫_0^1 e^{x²} dx (the antiderivative of e^{x²} cannot be expressed analytically).

• Numerical methods are also used when an analytical solution exists, but a sufficiently accurate numerical solution can be obtained faster, e.g.:
  – computing the inverse of a large matrix,
  – computing the roots of a polynomial of degree 3 (Cardano's formulas exist, but they involve expressions of the form ∛z with z ∈ C, therefore we prefer to use one of the numerical algorithms).
• The main idea is to solve some easier problem (one that has the same solution, or a solution that does not differ much from the exact one) instead of the given difficult problem. Here are some examples:
  – Infinite processes are replaced by finite ones, e.g. summing only finitely many terms of an infinite series.
  – Infinite-dimensional spaces are replaced by finite-dimensional ones, e.g. instead of general functions we compute the solution in the space of polynomials of degree at most 5 (interpolation).
  – Complicated functions are replaced by simpler ones, e.g. by polynomials (integration).
  – Matrices are replaced by matrices of simpler form, e.g. when solving linear systems we transform the given matrix into an upper triangular matrix.

What must a good numerical method satisfy?

• Reliability: on simple problems it must always work correctly.

• Accuracy: the solution is computed as accurately as possible, given the accuracy of the initial data.

• Economy: in terms of time (number of operations) and space (memory usage).

• Usability: it can be used on a wide range of problems.

Numerical methods are constantly being developed, driven by:

• new approaches and new algorithms,

• the development of computers,

• the development of parallel computers.

One can define the absolute and the relative error. The difference between the approximate value x̄ and the exact value x is the approximation error:

• x̄ = x + d_a, where d_a is the absolute error,

• x̄ = x(1 + d_r), or d_r = (x̄ − x)/x, where d_r is the relative error.
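For instance, a short Python check of these two quantities for the approximation x̄ = 1.73205 of √3 (a minimal sketch; the helper name is ours):

    import math

    def errors(x_approx, x_exact):
        """Return the absolute and the relative error of an approximation."""
        d_a = x_approx - x_exact
        d_r = d_a / x_exact
        return d_a, d_r

    d_a, d_r = errors(1.73205, math.sqrt(3))
    print(d_a, d_r)   # both are of size about 1e-6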

1.2 Floating point

Numerical errors are, among other things, a consequence of the fact that we cannot represent all real numbers in a computer, but only finitely many of them. Therefore, every number that is not representable is approximated by a representable one, and each such approximation introduces an error.
In a computer, numbers are written in floating-point arithmetic as

x = ±m · b^e,   m = 0.c_1 c_2 . . . c_t,

where:
• b is the base (2, 10 (calculators), 16 (IBM)),
• m is the significand or mantissa,
• t is the length of the mantissa,
• e is the exponent: L ≤ e ≤ U,
• c_i are digits between 0 and b − 1.
For normalized numbers we assume c_1 ≠ 0. Such a system is denoted by P(b, t, L, U).

Example: All representable normalized (positive) numbers from P(2, 3, −1, 1) are:

(0.100)_2 · 2^{−1} = 0.2500,   (0.100)_2 · 2^0 = 0.500,   (0.100)_2 · 2^1 = 1.00,
(0.101)_2 · 2^{−1} = 0.3125,   (0.101)_2 · 2^0 = 0.625,   (0.101)_2 · 2^1 = 1.25,
(0.110)_2 · 2^{−1} = 0.3750,   (0.110)_2 · 2^0 = 0.750,   (0.110)_2 · 2^1 = 1.50,
(0.111)_2 · 2^{−1} = 0.4375,   (0.111)_2 · 2^0 = 0.875,   (0.111)_2 · 2^1 = 1.75.

In Figure 1.1 one can observe that these representable numbers are not uniformly distributed on the real axis. Observe also that there is a large gap between 0 and the first representable number.

Figure 1.1: The representable normalized (positive) numbers from the set P(2, 3, −1, 1).

This gap can be reduced if we allow denormalized numbers; namely, in the set P(2, 3, −1, 1) we additionally obtain the numbers

(0.011)_2 · 2^{−1} = 0.1875,   (0.010)_2 · 2^{−1} = 0.1250,   (0.001)_2 · 2^{−1} = 0.0625.

In the IEEE standard (from the Institute of Electrical and Electronics Engineers), namely the binary arithmetic standard used by most computers, there are two types of numbers:
• single precision (single): P(2, 24, −125, 128), the numbers are stored in 32 bits,
• double precision (double): P(2, 53, −1021, 1024), the numbers are stored in 64 bits.
The IEEE standard also knows:
• 0, ∞, −∞, NaN (the result of expressions such as ∞/∞ or 0/0),
• denormalized numbers, where c_1 = 0 and the exponent is fixed to the smallest possible value (in single precision it is equal to −125).

Example: If s is the sign bit, 0 ≤ e ≤ 255 is the exponent and 0 ≤ m < 1 is the mantissa, m = 0.c_2 c_3 . . . c_24, then the numbers in single precision are interpreted as:

0 < e < 255:          x = (−1)^s (1 + m) · 2^{e−127}
e = 255, m = 0:       x = (−1)^s · ∞ (overflow)
e = 255, m ≠ 0:       x = NaN
e = 0,   m = 0:       x = (−1)^s · 0
e = 0,   m ≠ 0:       x = (−1)^s · (0 + m) · 2^{−126}

Here is an example of positive numbers:

e          m                          number
10000010   01100000000000000000000    x = (1 + 2^{−2} + 2^{−3}) · 2^{130−127} = 11
11111111   00000000000000000000000    x = ∞
11111111   01011010100000000000000    x = NaN
00000000   00000000000000000000000    x = 0
00000000   00000100000000000000000    x = 2^{−6} · 2^{−126} = 2^{−132}

Let x be a number. By fl(x) we denote the nearest representable number to x:

• if x is a representable number, then fl(x) = x,

• if x is not a representable number, we write x in the infinite expansion

x = ±0.d_1 d_2 . . . d_t d_{t+1} . . . · b^e

and with respect to the digit d_{t+1} we decide whether to round up or down:

  – if d_{t+1} < b/2, then fl(x) = ±0.d_1 d_2 . . . d_t · b^e, i.e. we round down,
  – if d_{t+1} ≥ b/2, then fl(x) = (±0.d_1 d_2 . . . d_t ± b^{−t}) · b^e, i.e. we round up.
We can write

fl(x) = x(1 + δ),   |δ| ≤ u,   u = (1/2) b^{1−t},

where u is the basic rounding error:

• single precision: u = 2^{−24} ≈ 6 · 10^{−8},
• double precision: u = 2^{−53} ≈ 10^{−16}.
The IEEE standard ensures that

fl(x ⊕ y) = (x ⊕ y)(1 + δ),   |δ| ≤ u,   for ⊕ ∈ {+, −, /, ∗},

and

fl(√x) = √x (1 + δ),   |δ| ≤ u.

The exception is if we get overflow (⇒ ±∞) or underflow (⇒ 0).
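As a quick illustration (a minimal Python sketch; NumPy is used only to get a 32-bit float type), one can observe the basic rounding error of both IEEE formats directly:

    import numpy as np

    # The basic rounding error u: np.finfo reports eps = 2*u
    # (eps is the gap between 1 and the next representable number).
    u_single = np.finfo(np.float32).eps / 2   # ~5.96e-08 = 2**-24
    u_double = np.finfo(np.float64).eps / 2   # ~1.11e-16 = 2**-53
    print(u_single, u_double)

    # Rounding in action: 0.1 is not representable in base 2, so storing it
    # in single precision introduces a relative error bounded by u_single.
    x = 0.1
    rel_err = abs(float(np.float32(x)) - x) / x
    print(rel_err, rel_err <= u_single)       # ~1.5e-08, True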

1.3 Numerical errors


We would like to compute the value of a function f : X → Y at some given x ∈ X. The numerical method returns an approximate value ŷ for y = f(x). The difference d = ŷ − y is the whole error that we obtain. This error can be divided into three parts:

a) The irreducible error D_n:

It is caused by errors in the initial (approximate) data, which arise from measurement errors or from approximating numbers by representable ones. We cannot avoid this error.
Instead of computing with x, we compute with x̄. Therefore, instead of y = f(x) we obtain ȳ = f(x̄). The irreducible error is equal to

D_n = y − ȳ.

Perturbation theory deals with the analysis of D_n. The question is how the result changes if the initial data are changed slightly. We say that the problem is sensitive (poorly conditioned) if the changes are big, and insensitive (well conditioned) if the changes are small.
Example: We are looking for the intersection of two given lines.
a) The intersection of
x + y = 2,
x − y = 0,
is x = y = 1. If we slightly change the right side of the equation,
namely,
x + y = 1.9999,
x − y = 0.0002,
the solution is x = 1.00005, y = 0.99985. Therefore the system is
insensitive, see Fig. 1.2 (left).
b) The intersection of
x + 0.99y = 1.99,
0.99x + 0.98y = 1.97,
is x = y = 1. If we slightly change the right side of the equation,
namely,
x + 0.99y = 1.9899,
0.99x + 0.98y = 1.9701,
the solution is x = 2.97, y = −0.99. Therefore the system is very
sensitive, see Fig. 1.2 (right).
Figure 1.2: An example of an insensitive (well conditioned) system (left) and a sensitive (poorly conditioned) system (right).

b) The method error D_m:

In numerics, an infinite process is replaced by a finite one (we sum only finitely many terms of an infinite series, we stop an iterative method after finitely many steps, ...). So, instead of computing the value of f we compute the value of some function g, and therefore instead of ȳ = f(x̄) we obtain ỹ = g(x̄). The method error is equal to

D_m = ȳ − ỹ.

c) The rounding error Dz :


At each arithmetic operation, we obtain rounding error. Instead of ỹ, we
obtain ŷ. The rounding error is equal to

Dz = ỹ − ŷ.

The whole error is equal to: D = Dn + Dm + Dz .

Example: We would like to compute sin(π/10). We have a calculator that computes with 4 exact decimal digits (P(10, 4, −5, 5)) and has only the 4 basic operations (+, −, /, ∗). Instead of f(x) = sin(x) we compute the values of the approximation function g(x) = x − x³/6.

• The irreducible error: instead of x = π/10 we compute with x̄ = 0.3142 · 10^0,

D_n = y − ȳ = sin(π/10) − sin(0.3142) = −3.9 · 10^{−5}.
• The method error: instead of sin(x̄) we compute g(x̄) = x̄ − x̄³/6,

D_m = ȳ − ỹ = 2.5 · 10^{−5}.

• The rounding error depends on the order of the arithmetic operations, i.e. on how we compute g(x̄). Suppose that we compute x̄ − x̄³/6 as:

a_1 = fl(x̄ ∗ x̄) = fl(0.09872164) = 0.9872 · 10^{−1},
a_2 = fl(a_1 ∗ x̄) = fl(0.03101154) = 0.3101 · 10^{−1},
a_3 = fl(a_2 / 6) = fl(0.0051683) = 0.5168 · 10^{−2},
ŷ = fl(x̄ − a_3) = fl(0.309032) = 0.3090 · 10^0,

D_z = ỹ − ŷ = 3.0 · 10^{−5}.

The whole error is equal to: D = D_n + D_m + D_z = 1.6 · 10^{−5}.
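The decomposition above can be reproduced with a small Python sketch. The helper fl below (our own name) mimics rounding to 4 significant decimal digits; the intermediate values may differ in the last digit from the table, depending on the exact rounding convention, but the three error components come out with the same size and sign:

    import math

    def fl(x, t=4):
        """Round x to t significant decimal digits (mimics the P(10,4,-5,5) calculator)."""
        if x == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1      # decimal exponent of x
        return round(x, t - e)

    x     = math.pi / 10
    x_bar = fl(x)                                   # 0.3142
    y     = math.sin(x)                             # exact value
    y_bar = math.sin(x_bar)                         # value with perturbed data
    y_til = x_bar - x_bar**3 / 6                    # g(x_bar), evaluated exactly

    a1    = fl(x_bar * x_bar)
    a2    = fl(a1 * x_bar)
    a3    = fl(a2 / 6)
    y_hat = fl(x_bar - a3)                          # g(x_bar), evaluated on the calculator

    print("D_n =", y - y_bar)                       # irreducible error, about -3.9e-5
    print("D_m =", y_bar - y_til)                   # method error, about 2.5e-5
    print("D_z =", y_til - y_hat)                   # rounding error, about 3.0e-5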

1.3.1 Numerical stability

When analyzing the rounding errors we separate forward and backward (stabil-
ity) analysis.

• Forward analysis: instead of y = f(x) we compute ŷ. If the value

|y − ŷ| / |y|

is small, the process is forward stable, otherwise it is unstable.

• Backward analysis: we write ŷ = f(x̂) for some perturbed input x̂. If the value

|x − x̂| / |x|

is small, the process is backward stable, otherwise it is unstable.



1.4 Some educational examples


Let us examine some examples which show that it is not enough to just derive a formula and run it on the computer.

Example: Let us first convince ourselves that numerical computing is not the same as exact computing.
a) Let us compute 100 · (100/3 − 33) − 100/3. Exactly we obtain the result 0, while numerically (e.g. in Octave or Matlab) we obtain 2.3448 · 10^{−13}.
b) What about associativity? For

x = 0.1234567890,   y = 0.0987654321,   z = 0.9911991199,

we compute xyz − zyx. Exactly we obtain the result 0, while numerically (e.g. in Octave or Matlab) we obtain 1.7347 · 10^{−18}.

Example: One way to compute π is to use the fact that π is the limit of the perimeter of the regular n-sided polygon S_n inscribed in the circle with radius r = 1/2. Let a_n be the side length of S_n. One can derive the following relation between a_n and a_{2n}:

a_{2n} = √( (a_n/2)² + (1/2 − √(1/4 − a_n²/4))² ) = √( (1 − √(1 − a_n²)) / 2 ).

From S_n = n a_n it follows that

S_{2n} = 2n a_{2n} = 2n √( (1 − √(1 − (S_n/n)²)) / 2 ).        (1.1)

We can compute S_6 = 3. Using (1.1) we obtain S_12, S_24, S_48, . . . , and

π = lim_{n→∞} S_n.

But it turns out that the formula fails both in single and double precision, see Table 1.1.
The question is: what is wrong with formula (1.1)? In the limit S_n should go to π, but when n is very big, the value of S_n/n is very small, the difference 1 − √(1 − (S_n/n)²) is computed with a large cancellation error, and therefore S_{2n} → 0.
n      S_n              n       S_n
6      3.0000000        768     3.1417003
12     3.1058285        1536    3.1430793
24     3.1326280        3072    3.1374769
48     3.1393509        6144    3.1819811
96     3.1410384        12288   3.3541021
192    3.1414828        24576   3.0000000
384    3.1414297        49152   0.0000000

Table 1.1: The formula (1.1) for computing π fails.

We have to derive a stable formula:

S_{2n} = 2n √( (1 − √(1 − (S_n/n)²)) (1 + √(1 − (S_n/n)²)) / (2 (1 + √(1 − (S_n/n)²))) )
       = S_n √( 2 / (1 + √(1 − (S_n/n)²)) ).        (1.2)

Observing formula (1.2) we see that if n → ∞ and S_n/n → 0, the value S_{2n} is equal to S_n and remains close to π, see Table 1.2.

n Sn n Sn
6 3.0000000 768 3.1415837
12 3.1058285 1536 3.1375901
24 3.1326284 3072 3.1415918
48 3.1393499 6144 3.1415923
96 3.1410317 12288 3.1415925
192 3.1414523 24576 3.1415925
384 3.1415575 49152 3.1415925

Table 1.2: The formula (1.2) for computing π works.

So, if we have an unstable procedure, even computing with greater accuracy will not help.
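A short Python sketch of both recursions (using double precision; the values in the tables were produced in single precision, so the numbers differ, but the qualitative behaviour is the same):

    import math

    def step_unstable(S, n):
        # formula (1.1): suffers from cancellation in 1 - sqrt(1 - (S/n)**2)
        return 2 * n * math.sqrt((1 - math.sqrt(1 - (S / n) ** 2)) / 2)

    def step_stable(S, n):
        # formula (1.2): algebraically equivalent, but without the cancellation
        return S * math.sqrt(2 / (1 + math.sqrt(1 - (S / n) ** 2)))

    S_u = S_s = 3.0   # S_6 = 3
    n = 6
    for _ in range(30):
        S_u, S_s = step_unstable(S_u, n), step_stable(S_s, n)
        n *= 2

    print(S_u)   # eventually collapses (to 0 in double precision as well)
    print(S_s)   # stays at 3.141592653589..., i.e. converges to pi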

Example: Summing the Taylor series for the function e^{−x}.

We know that

e^{−x} = Σ_{n=0}^{∞} (−1)^n x^n / n!.        (1.3)

If we sum (1.3) term by term, then for x > 0 we do not obtain good results. Namely, in single precision for x = 10 we get the value −7.265709 · 10^{−5} instead of 4.539993 · 10^{−5}. The reason is that the terms in the series are alternating and, furthermore, their absolute values first grow for some time and only then start falling towards 0.
When summing the series of e^x there are no such problems, as all terms are positive, the final result is big and the relative error is small. Therefore the remedy for summing the series of e^{−x} is to compute it as e^{−x} = 1/e^x.
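A minimal Python illustration (single-precision summation is emulated with NumPy's float32; the function names are ours):

    import numpy as np

    def exp_neg_direct(x, terms=100):
        """Sum the alternating series (1.3) in single precision."""
        s, t = np.float32(0.0), np.float32(1.0)
        for n in range(1, terms + 1):
            s += t
            t = np.float32(t * np.float32(-x) / np.float32(n))
        return s

    def exp_neg_stable(x, terms=100):
        """Sum the series for e^x (all terms positive) and invert."""
        s, t = np.float32(0.0), np.float32(1.0)
        for n in range(1, terms + 1):
            s += t
            t = np.float32(t * np.float32(x) / np.float32(n))
        return np.float32(1.0) / s

    print(exp_neg_direct(10.0))   # garbage: rounding errors of size ~1e-4 swamp the result
    print(exp_neg_stable(10.0))   # close to the true value 4.539993e-5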

Example: The accident of the rocket Ariane (European Space Agency)

In 1996, during the first flight of Ariane 5, which was built to replace Ariane 4, an accident happened. After 40 seconds the rocket exploded. It turned out that the accident happened because the program did not test for overflow when converting from a 64-bit double representation to a 16-bit integer representation. The part of the program which caused the error came from the program for Ariane 4, where it had always worked without any problems. But Ariane 5 was a more powerful rocket, the measured quantities were too large, and this caused the overflow.

Example: The accident of the oil platform Sleipner A

The platform is designed in such a way that the hull rests on floating holders that are submerged to a depth of 82 m. In 1991, while the holders were being lowered, an accident happened. Because the finite element method that was used was not accurate enough, the concrete cells were not thick enough and the forces were underestimated by 47%. This caused a crack in the holder, and the sea water began filling it so rapidly that the pumps were not able to keep up anymore. The entire structure sank to the bottom of the fjord at a depth of 220 m. This also caused an earthquake of magnitude 3 on the Richter scale. The whole damage was estimated at 700 million dollars.
2 Nonlinear Equations
We are looking for solutions of the equations f (x) = 0, where f : R → R is a
given real function.
Usually we require that f is continuous because we have the following theo-
rem.

Theorem 2.0.1 Let f be a real continuous function on [a, b] and let

f (a) · f (b) < 0,

then there exists ξ ∈ (a, b) such that f (ξ) = 0.

2.1 Bisection

We are searching for an interval [a, b], as small as possible, for which f has
different signs in a and b. Let us present the bisection algorithm.

Algorithm 1 Bisection
Input: Function f, interval [a, b] ⊂ R and precision ε ∈ R.
Output: c ∈ R presenting the (approximate) zero of f.
1: while |b − a| > ε do
2:   c = a + (b − a)/2;
3:   if sign(f(a)) == sign(f(c)) then
4:     a = c;
5:   else
6:     b = c;
7:   end if
8: end while

Remark 2.1.1
• Instead of c = (a + b)/2 we use c = a + (b − a)/2, because fl((a + b)/2) may fall outside of the interval [a, b].
• Why do we not check whether f(c) = 0? Because this condition is almost never satisfied.
• Why do we check sign(f(a)) == sign(f(c)) instead of f(a) · f(c) > 0? Computing f(a) · f(c) is more expensive and can lead to overflow or underflow.
• The method fails at roots of even multiplicity.
• If there is more than one root, bisection returns only one of them.
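A direct Python transcription of Algorithm 1 might look as follows (a sketch; the function and variable names are ours), applied to the function g(x) = x cos(x) − sin(x) from the example that follows:

    import math

    def bisection(f, a, b, eps=1e-10):
        """Find an approximate zero of f on [a, b]; f(a) and f(b) must have opposite signs."""
        while abs(b - a) > eps:
            c = a + (b - a) / 2
            if math.copysign(1.0, f(a)) == math.copysign(1.0, f(c)):
                a = c
            else:
                b = c
        return a + (b - a) / 2

    g = lambda x: x * math.cos(x) - math.sin(x)
    print(bisection(g, 4.0, 4.6))   # approx. 4.493409, the smallest positive solution of x = tan(x)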

Example: Let us find the zeros of the function f(x) = x − tan(x) using bisection. Except for x = 0, it is hard to find a good initial interval, as we have problems with the poles. The solution to this problem is to find the zeros of g(x) = x cos(x) − sin(x) instead.

2.2 Simple iterative method

We rewrite f (x) = 0 into x = g(x) and then we proceed with the iteration

xr+1 = g(xr ), r = 0, 1, . . .

Function g is called iteration function.

Example:

g(x) = x − f(x),
g(x) = x − c f(x),   c ≠ 0,
g(x) = x − h(x) f(x),   h(x) ≠ 0.

Example: Let us find the roots of p(x) = x³ − 5x + 1. From the graph of the function we can see that one root is close to 0. We rewrite x = (1 + x³)/5, namely

g(x) = (1 + x³)/5,

and we form the sequence

x_0 = 0,
x_{r+1} = (1 + x_r³)/5,   r = 0, 1, . . .

This sequence converges to α = 0.201639678... This iteration works only for the root close to 0. To obtain the remaining two roots we can use, e.g., the iteration x_{r+1} = ∛(5x_r − 1).
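A few lines of Python make the behaviour of these iterations visible (a sketch, names ours):

    def fixed_point(g, x0, steps=50):
        """Iterate x_{r+1} = g(x_r) and return the last approximation."""
        x = x0
        for _ in range(steps):
            x = g(x)
        return x

    g = lambda x: (1 + x**3) / 5
    print(fixed_point(g, 0.0))          # 0.2016396757...  (the root near 0)

    h = lambda x: (5 * x - 1) ** (1 / 3)
    print(fixed_point(h, 2.0))          # 2.128419...      (the largest root)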

Convergence of the simple iterative method

Let α be a solution of x = g(x). Whether the iteration converges close to the solution depends on g′(α). If |g′(α)| < 1, α is an attractive point, otherwise it is a repulsive (unattractive) point.

Theorem 2.2.1 Let α be a solution of x = g(x), let g be continuously differentiable and let |g′(α)| < 1. Then there exists a neighborhood I of α such that for all x_0 ∈ I the sequence x_{r+1} = g(x_r), r = 0, 1, . . . , converges to α.

Proof: Let e_k := x_k − α be the error of x_k. We have to prove that lim_{k→∞} e_k = 0. We derive

e_{k+1} = x_{k+1} − α = g(x_k) − g(α) = g′(ξ_k)(x_k − α),   ξ_k ∈ (x_k, α).

If we start close enough to α, then, because g′ is continuous and |g′(α)| < 1, there exists C < 1 such that |g′(ξ_k)| ≤ C < 1. We obtain

|e_{k+1}| ≤ C|e_k| ≤ C²|e_{k−1}| ≤ · · · ≤ C^{k+1}|e_0|.

When k → ∞, C^k → 0 and therefore |e_k| → 0, or equivalently x_k → α.

Corollary 2.2.1 Let α be a solution of the equation x = g(x) and let g be continuously differentiable on I = [α − d, α + d]. If |g′(x)| ≤ m < 1 for all x ∈ I, then for every x_0 ∈ I the sequence x_{r+1} = g(x_r) converges to α.

The convergence order

Definition 2.2.1 The convergence order is equal to p if there exist constants C_1, C_2 > 0 such that, in a neighborhood of α,

C_1 |x_r − α|^p ≤ |x_{r+1} − α| ≤ C_2 |x_r − α|^p.



Theorem 2.2.2 Let the iteration function g be p-times continuously differentiable in a neighborhood I of the point α and let

g^(k)(α) = 0 for k = 1, 2, . . . , p − 1,   g^(p)(α) ≠ 0.

Then the order of convergence of the iteration x_{r+1} = g(x_r), r = 0, 1, . . . , in the neighborhood of the solution is equal to p.

Proof: From the Taylor series of the function g around α,

g(x) = g(α) + g′(α)(x − α) + · · · + (1/(p−1)!) g^(p−1)(α)(x − α)^{p−1} + (1/p!) g^(p)(ξ_x)(x − α)^p,

one can derive

g(x) = α + (1/p!) g^(p)(ξ_x)(x − α)^p.

If I denotes the neighborhood of α, then C_1 = (1/p!) min_{x∈I} |g^(p)(ξ_x)| and C_2 = (1/p!) max_{x∈I} |g^(p)(ξ_x)|.

Here are some special cases:

• p = 1: linear convergence (a constant number of steps per exact decimal digit),
• p = 2: quadratic convergence (the number of exact decimal digits is doubled at each step),
• p = 3: cubic convergence (the number of exact decimal digits triples at each step),
• 1 < p < 2: superlinear convergence.

Example: Comparison of different iteration functions for the equation

x² + ln x = 0,

which has the solution α = 0.6529186:

g1(x) = e^{−x²},                      g1′(α) = −0.853  ⇒  p = 1,
g2(x) = (x + e^{−x²})/2,              g2′(α) = 0.074   ⇒  p = 1,
g3(x) = (2x³ + e^{−x²})/(1 + 2x²),    g3′(α) = 0       ⇒  p = 2.

For different initial approximations we need a different number of iteration steps until two consecutive approximations differ by less than 10^{−12}:

x_0    0.6    0.7    1      10     100
g1     163    162    173    175    175
g2     11     11     12     15     19
g3     5      5      7      109    10012

Table 2.1: The number of steps needed until two consecutive approximations differ by less than 10^{−12}.

Remark 2.2.1
• A method with quadratic convergence order in the neighborhood of the solution may converge very slowly far away from the solution, slower than methods with linear convergence order in the neighborhood of the solution.
• Usually the methods with high convergence order converge very fast in the neighborhood of the solution, but if we do not have a good enough initial approximation, it is better to start with a method with linear convergence order and switch to a faster method only in the vicinity of the solution.

2.3 Newton or tangent method

Let us approximate the given function f by its tangent at the point (x_r, f(x_r)) and take the intersection of the tangent with the x-axis as the next approximation, namely

x_{r+1} = x_r − f(x_r)/f′(x_r),   r = 0, 1, . . .

We can derive this formula by expressing the slope of the tangent as

f′(x_r) = k = Δy/Δx = (f(x_r) − 0)/(x_r − x_{r+1}),

and we obtain

x_r − x_{r+1} = f(x_r)/f′(x_r).
Figure 2.1: The illustration of the tangent method.

The tangent method is actually just a special form of a simple iterative method with g(x) = x − f(x)/f′(x). The derivative of g is equal to

g′(x) = f(x) f″(x) / (f′(x))².

If α is a simple zero of the function f, then g′(α) = 0 and we obtain (at least) quadratic convergence order. Further, we compute

g″(α) = f″(α)/f′(α)

and observe that if f″(α) ≠ 0 then g″(α) ≠ 0 and we obtain quadratic convergence order, otherwise at least cubic convergence order.
If α is a zero of multiplicity m, we obtain lim_{x→α} g′(x) = 1 − 1/m, so we have only linear convergence.

Example: Let us consider the polynomial p(x) = (x − 1/2)(x − 1)(x − 2)². If the initial guesses are 0.6, 1.1 and 2.1, we can observe quadratic convergence at 0.6, cubic convergence at 1.1 (inflection point) and linear convergence at 2.1 (double zero). A small experiment confirming this is sketched below.
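A possible Python experiment (names ours) that prints the error after each Newton step for the three starting points:

    def newton(f, df, x0, steps):
        """Perform a fixed number of Newton (tangent method) steps."""
        xs = [x0]
        for _ in range(steps):
            x0 = x0 - f(x0) / df(x0)
            xs.append(x0)
        return xs

    p  = lambda x: (x - 0.5) * (x - 1.0) * (x - 2.0) ** 2
    dp = lambda x: ((x - 1.0) * (x - 2.0) ** 2 + (x - 0.5) * (x - 2.0) ** 2
                    + 2.0 * (x - 0.5) * (x - 1.0) * (x - 2.0))

    for x0, root in [(0.6, 0.5), (1.1, 1.0), (2.1, 2.0)]:
        print([abs(x - root) for x in newton(p, dp, x0, 6)])
    # near 0.5 the error is roughly squared in each step (quadratic convergence),
    # near 1.0 it shrinks even faster (cubic), near 2.0 it is only halved (linear).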

The next theorem says that all simple roots are attractive points for the tangent method: if we take good enough initial guesses, the tangent method will converge.
Figure 2.2: The graph of the polynomial p(x) = (x − 1/2)(x − 1)(x − 2)².

Theorem 2.3.1 Let α be a simple zero of a twice continuously differentiable function f. Then there exist a neighborhood I of the point α and a constant C > 0 such that the tangent method converges for all x_0 ∈ I and that for x_r the following holds:

|x_{r+1} − α| ≤ C (x_r − α)².

Under some assumptions one can also prove global convergence for an arbitrary initial approximation.

Theorem 2.3.2 Let the function f be twice continuously differentiable, increasing and convex on the interval I = [a, ∞) and let α ∈ I denote its zero. Then α is a simple zero and the only root of f on I, and for every x_0 ∈ I the tangent method converges to α.

The above theorems ensure the convergence of the tangent method close to the zero or for functions of a nice shape; otherwise the behaviour can be very different, and blind use of the tangent method is not recommended.

Example: Figure 2.3 shows two examples of divergence of the tangent method.

Figure 2.3: Divergence because the initial approximation was not close enough to the zero (left) and cyclic repetition of the approximations (right).

2.4 Secant method


The idea of the secant method is to use the secant line to obtain a better approximation for the zero of a given function f. The method requires two initial values x_0 and x_1, which should ideally be chosen close to the zero. The secant method is defined by the recurrence relation

x_{r+1} = x_r − f(x_r) (x_r − x_{r−1}) / (f(x_r) − f(x_{r−1})),   r = 1, 2, . . .

Figure 2.4: The illustration of the secant method.

Remark 2.4.1
• This method is not a special case of a simple iterative method, as the next approximation depends on the two previous approximations.

• The method does not require derivatives.

• The order of convergence is p = (1 + √5)/2 ≈ 1.618, i.e. we have superlinear convergence.

• The behaviour of the secant method is similar to that of the tangent method:

  a) the method will converge if the two initial approximations are good enough,

  b) the tangent method is faster, but it also requires the derivative of the function; therefore the secant method is sometimes a better choice than the tangent method, as it requires less work per step.
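A compact Python version of the recurrence (a sketch, names ours), applied to the equation x² + ln x = 0 from Section 2.2:

    import math

    def secant(f, x0, x1, tol=1e-12, max_steps=100):
        """Secant method: no derivatives needed, superlinear convergence."""
        for _ in range(max_steps):
            f0, f1 = f(x0), f(x1)
            x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
            if abs(x2 - x1) < tol:
                return x2
            x0, x1 = x1, x2
        return x1

    print(secant(lambda x: x * x + math.log(x), 0.5, 1.0))   # approx. 0.6529186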

2.5 Other methods

Let us list some other methods one can use to solve a nonlinear equation:

• Combined method: We combine bisection with the tangent or secant method (bisection is used when we fall out of a reasonable interval).

• Muller's method:

  – It uses three initial points, constructs the parabola through these three points and takes the intersection of the parabola with the x-axis as the next approximation. If the parabola has real roots, we take the one that is closer to the last approximation, otherwise we take only the real part of the solution.

  – The order of convergence is 1.84.

  – In practice it is used for finding complex roots.

• Method (f, f′, f″): It is a generalization of the tangent method, where we also require the second derivative and obtain cubic convergence,

x_{r+1} = x_r − f(x_r)/f′(x_r) − f″(x_r) f²(x_r) / (2 (f′(x_r))³),   r = 0, 1, . . .

2.6 System of nonlinear equations


We would like to solve the system of nonlinear equations,
f1 (x1 , x2 , . . . , xn ) = 0,
f2 (x1 , x2 , . . . , xn ) = 0,
..
.
fn (x1 , x2 , . . . , xn ) = 0,
or equivalently

F (x) = 0, x ∈ Rn , F : Rn → Rn .

Our task is to find α ∈ Rn such that F (α) = 0.

2.6.1 Generalization of a simple iterative method

We are looking for a solution of x = G(x), where G : R^n → R^n is chosen such that α = G(α) implies F(α) = 0.

Algorithm 2 Generalization of a simple iterative method
Input: Mapping G : R^n → R^n and the initial approximation x^(0).
Output: The (approximate) zero of F.
1: for k = 0, 1, . . . do
2:   x^(k+1) = G(x^(k));
3: end for

2.6.2 Newton's method

This is a generalization of the tangent method. It is derived by using the Taylor series.
The nonlinear system is replaced by a simpler (linear) one:

0 = f_i(α) = f_i(x) + Σ_{k=1}^{n} ∂f_i(x)/∂x_k · (α_k − x_k) + . . . ,   i = 1, 2, . . . , n.

Neglecting the higher-order terms we obtain

∂f_1/∂x_1 (x)(α_1 − x_1) + · · · + ∂f_1/∂x_n (x)(α_n − x_n) = −f_1(x),
  ...
∂f_n/∂x_1 (x)(α_1 − x_1) + · · · + ∂f_n/∂x_n (x)(α_n − x_n) = −f_n(x).

In one step of Newton's method we therefore solve the linear system

[ ∂f_1/∂x_1(x^(r))  · · ·  ∂f_1/∂x_n(x^(r)) ] [ Δx_1^(r) ]       [ f_1(x^(r)) ]
[        ...                     ...        ] [    ...   ]  =  − [     ...    ]
[ ∂f_n/∂x_1(x^(r))  · · ·  ∂f_n/∂x_n(x^(r)) ] [ Δx_n^(r) ]       [ f_n(x^(r)) ]

or equivalently

J_F(x^(r)) Δx^(r) = −F(x^(r)),

where J_F denotes the Jacobian matrix of F. Then we correct the approximation:

x^(r+1) = x^(r) + Δx^(r).

As with the tangent method, we obtain quadratic convergence in the neighborhood of a simple zero. The problem: for convergence we have to know a very good initial approximation. In the case of multiple zeros we have only linear convergence.
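A small Python sketch for a concrete 2 × 2 system (the system and all names are our own illustration, not from the notes); NumPy solves the linear system in each step:

    import numpy as np

    def F(x):
        # example system: x0^2 + x1^2 - 1 = 0,  x1 - x0^2 = 0
        return np.array([x[0]**2 + x[1]**2 - 1.0, x[1] - x[0]**2])

    def JF(x):
        # Jacobian matrix of F
        return np.array([[ 2.0 * x[0], 2.0 * x[1]],
                         [-2.0 * x[0], 1.0]])

    x = np.array([1.0, 1.0])            # initial approximation
    for _ in range(8):
        dx = np.linalg.solve(JF(x), -F(x))
        x = x + dx

    print(x)   # approx. [0.7861514, 0.6180340], the intersection of the circle and the parabola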

2.7 Polynomial equations


In this section let us see how to search for the roots of the polynomial

p(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0.

We can use any of the previously described methods (bisection, tangent method, ...), but in this way we do not exploit the fact that p is a polynomial.
In practice we usually require all roots of p. When the first one (let us denote it by x_k) is computed, we form p(x)/(x − x_k) = q(x), which is a polynomial of lower degree, and we continue by searching for the roots of q.

Another way to compute the roots is to form the associated (companion) matrix A_n, which has ones on the superdiagonal and zeros elsewhere, except for the last row, which equals

[ −a_0/a_n   −a_1/a_n   · · ·   −a_{n−1}/a_n ],

and compute its eigenvalues.
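A quick NumPy check of this idea on the polynomial p(x) = x³ − 5x + 1 from Section 2.2 (a sketch; numpy.roots uses the same companion-matrix approach internally):

    import numpy as np

    # p(x) = x^3 - 5x + 1, coefficients a3, a2, a1, a0
    a = np.array([1.0, 0.0, -5.0, 1.0])

    n = len(a) - 1
    C = np.zeros((n, n))
    C[:-1, 1:] = np.eye(n - 1)          # ones on the superdiagonal
    C[-1, :] = -a[::-1][:-1] / a[0]     # last row: -a0/an, -a1/an, ..., -a_{n-1}/an

    print(np.linalg.eigvals(C))         # the roots 0.2016, 2.1284, -2.3301 (in some order)
    print(np.roots(a))                  # the same roots, for comparison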


We can also use special methods designed for polynomial equations, such as Laguerre's method, the Bairstow-Hitchcock method, the Durand-Kerner method, the Ehrlich-Aberth method, ...

2.7.1 Laguerre's method

The idea is to repeat the iteration:

S_1 = p′(z_r)/p(z_r),
S_2 = (p′(z_r)² − p(z_r) p″(z_r)) / p(z_r)²,
z_{r+1} = z_r − n / (S_1 ± √((n − 1)(n S_2 − S_1²))),

where p is the given polynomial of degree n. The sign in the last equation is selected so that the denominator has the bigger absolute value.

Theorem 2.7.1 If the polynomial p has only real roots, then for an arbitrary initial approximation Laguerre's method converges to the nearest root to the left or to the right. In the case of a simple root the order of convergence in the vicinity of the root is cubic, otherwise it is linear.

Remark 2.7.1 The method also works for complex roots, but in that case it is not necessarily convergent for an arbitrary initial guess.
3 Linear Systems

3.1 Introduction

The system of n linear equations with n unknowns

a1,1 x1 + a1,2 x2 + . . . + a1,n xn = b1 ,


a2,1 x1 + a2,2 x2 + . . . + a2,n xn = b2 ,
..
.
an,1 x1 + an,2 x2 + . . . + an,n xn = bn ,

can be written as

Ax = b, A ∈ Rn×n (Cn×n ), b, x ∈ Rn (Cn ),

where A is a real (or a complex) matrix and x, b are real (or complex) vectors.
Here ai,j denotes the (i, j)−th element of a matrix A and xi denotes the i−th
element of the vector x.
The vector x can be written as

x = [x_1, x_2, . . . , x_n]^T,

while the matrix A can be written column-wise as

A = [a_1, a_2, . . . , a_n],   a_i ∈ R^n (C^n)   (the columns of A),

or row-wise as the matrix with rows α_1^T, α_2^T, . . . , α_n^T, where α_i ∈ R^n (C^n).

Let us denote by e_i the i-th standard basis vector, i.e. the vector whose i-th component is 1 and all other components are 0. Then Ae_k is the k-th column of the matrix A, e_i^T A is the i-th row of A and e_i^T A e_k is the (i, k)-th element of A.

By A^T we denote the transpose of the matrix A and A^H = Ā^T is the Hermitian transpose of A. Note that in some literature A^H is denoted by A^*.

The matrix A ∈ R^{n×n} (C^{n×n}) is nonsingular if one of the following equivalent conditions is satisfied:

• there exists an inverse matrix A^{−1} such that AA^{−1} = A^{−1}A = I,

• det A ≠ 0,

• rank A = n,

• there does not exist x ≠ 0 such that Ax = 0.

We say that A is symmetric if A = A^T. If A = A^H, then A is Hermitian. A symmetric matrix is positive definite if x^T A x > 0 for all x ≠ 0, or equivalently x^H A x > 0 for a Hermitian matrix.

If there exist a scalar λ and a nonzero vector x such that Ax = λx, then λ is an eigenvalue and x is an eigenvector of A.

Every n × n matrix A has n eigenvalues, namely the roots of the characteristic polynomial

p(λ) = det(A − λI).

For a symmetric matrix A the following hold:

• all eigenvalues are real (and positive if A is positive definite),

• there exist n orthonormal eigenvectors x_1, x_2, . . . , x_n: x_i^T x_j = δ_{ij}.

The dot (scalar) product of the vectors x and y can be written as

y^T x = Σ_{i=1}^{n} x_i y_i,    y^H x = Σ_{i=1}^{n} x_i ȳ_i.

The multiplication of the matrix A by the vector x, y = Ax, can be written as

• y_i = α_i^T x, i.e. y_i is the product of the i-th row of A and the vector x,
• y = Σ_{i=1}^{n} x_i a_i, i.e. y is a linear combination of the columns of A.

The multiplication C = A · B can be presented as

c_{i,j} = α_i^T b_j        (entry-wise),
c_j = A b_j                (column-wise),
C = Σ_{i=1}^{n} a_i β_i^T  (as a sum of rank-one matrices),

where b_j and β_i^T denote the columns and the rows of B, respectively.
The matrix x y^T is called a dyadic matrix and has rank equal to 1.
A real matrix Q is called orthogonal if Q^{−1} = Q^T (or QQ^T = Q^T Q = I).
A complex matrix U is called unitary if U^{−1} = U^H (or UU^H = U^H U = I).

3.2 Vector and Matrix norms


Definition 3.2.1 A vector norm is a mapping ‖·‖ : C^n → R for which the following holds:

a) ‖x‖ ≥ 0, and ‖x‖ = 0 ⇔ x = 0,
b) ‖αx‖ = |α| ‖x‖, ∀x ∈ C^n, α ∈ C,
c) ‖x + y‖ ≤ ‖x‖ + ‖y‖, ∀x, y ∈ C^n (triangle inequality).

Example: Here are some examples of vector norms:

• ‖x‖_1 = Σ_{i=1}^{n} |x_i|, the 1-norm,
• ‖x‖_2 = (Σ_{i=1}^{n} |x_i|²)^{1/2}, the 2-norm,
• ‖x‖_∞ = max_{i=1,2,...,n} |x_i|, the ∞-norm or max-norm.

We say that two vector norms ‖·‖_a and ‖·‖_b are equivalent if there exist C_1, C_2 > 0 such that for all x ∈ C^n,

C_1 ‖x‖_a ≤ ‖x‖_b ≤ C_2 ‖x‖_a.

Example: Here are some examples:

• ‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2,

• ‖x‖_∞ ≤ ‖x‖_2 ≤ √n ‖x‖_∞,

• ‖x‖_∞ ≤ ‖x‖_1 ≤ n ‖x‖_∞.

Here are two important inequalities:

• the Cauchy-Schwarz inequality: |y^H x| ≤ ‖x‖_2 ‖y‖_2,

• the Hölder inequality: |y^H x| ≤ ‖x‖_p ‖y‖_q, 1/p + 1/q = 1.

Definition 3.2.2 A matrix norm is a mapping ‖·‖ : C^{m×n} → R for which the following hold:

1) ‖A‖ ≥ 0, and ‖A‖ = 0 ⇔ A = 0,

2) ‖αA‖ = |α| ‖A‖,

3) ‖A + B‖ ≤ ‖A‖ + ‖B‖, ∀A, B ∈ C^{m×n}, α ∈ C,

4) ‖AC‖ ≤ ‖A‖ · ‖C‖ for C ∈ C^{n×p} (submultiplicativity).

Example: The Frobenius norm

‖A‖_F := (Σ_{i=1}^{m} Σ_{j=1}^{n} |a_{i,j}|²)^{1/2}    (the extended vector 2-norm).

Definition 3.2.3 The operator matrix norm based on a vector norm ‖·‖ is given by

‖A‖ = max{‖Ax‖ : x ∈ C^n, ‖x‖ = 1} = max{ ‖Ax‖/‖x‖ : x ∈ C^n, x ≠ 0 }.

We obtain:

‖A‖_1 = max_{‖x‖_1=1} ‖Ax‖_1 = max_j Σ_{i=1}^{m} |a_{i,j}|   (the maximal absolute column sum, 1-norm),

‖A‖_2 = max_{‖x‖_2=1} ‖Ax‖_2 = √(max_i λ_i(A^H A))   (the spectral or 2-norm);
  if A = A^H, then ‖A‖_2 = √(max_i λ_i(A²)) = max_{i=1,...,n} |λ_i(A)|,

‖A‖_∞ = max_{‖x‖_∞=1} ‖Ax‖_∞ = max_{i=1,...,m} Σ_{j=1}^{n} |a_{i,j}|   (the maximal absolute row sum, ∞-norm).

For these (operator) norms the following holds:

‖Ax‖ ≤ ‖A‖ ‖x‖.

Note that the Frobenius norm is not an operator (subordinate) matrix norm, since ‖I‖_F = √n.

Example: For the given matrix

A = [ 3  −1   2 ]
    [ 4   1  −8 ]
    [ 1  −5   0 ]

we compute the norms

‖A‖_1 = 10,   ‖A‖_∞ = 13,   ‖A‖_F = 11,   ‖A‖_2 ≈ 9.02316.
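These values are easy to verify with NumPy:

    import numpy as np

    A = np.array([[3.0, -1.0,  2.0],
                  [4.0,  1.0, -8.0],
                  [1.0, -5.0,  0.0]])

    print(np.linalg.norm(A, 1))       # 10.0  (maximal absolute column sum)
    print(np.linalg.norm(A, np.inf))  # 13.0  (maximal absolute row sum)
    print(np.linalg.norm(A, 'fro'))   # 11.0
    print(np.linalg.norm(A, 2))       # 9.0231...  (largest singular value)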

Theorem 3.2.1 For every matrix norm and an arbitrary eigenvalue λ of A the inequality

|λ| ≤ ‖A‖

holds.

Proof: Let Ax = λx, x ≠ 0. Then

|λ| ‖x‖ = ‖λx‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖

and therefore |λ| ≤ ‖A‖.

3.3 Sensitivity of the linear systems


We are interested in how the solution of linear system Ax = b is sensitive
according to the change of A and b. Let Ax = b and (A+δA)(x+δx) = (b+δb).
With some basic computation the following estimation can be derived,
!
kδxk K(A) kδAk kδxk
≤ + ,
kxk 1 − K(A) kδAk
kAk
kAk kxk

where
K(A) = kAkkA−1 k.

The K(A) is called the sensitivity of the matrix A or also the condition number.
So the accuracy of the obtained solution depends on the sensitivity of the given
matrix A. For the K(A) we can derive the following estimation
1 ≤ kIk = kAA−1 k ≤ kAkkA−1 k = K(A),
and therefore
K(A) ≥ 1.

Example:
a) The least sensitive matrices are the unitary (orthogonal) matrices multiplied by a scalar, since for a unitary matrix U (in the 2-norm) we derive

K(U) = ‖U‖ ‖U^{−1}‖ = ‖U‖ ‖U^H‖ = 1 · 1 = 1.

b) The canonical examples of ill-conditioned matrices are the Hilbert matrices, i.e. the square matrices with entries being the unit fractions

h_{i,j} = 1/(i + j − 1).

For example,

H_4 = [ 1    1/2  1/3  1/4 ]
      [ 1/2  1/3  1/4  1/5 ]
      [ 1/3  1/4  1/5  1/6 ]
      [ 1/4  1/5  1/6  1/7 ],        K(H_4) = 1.55 · 10^4.

For the matrices H_7 and H_10 we obtain K(H_7) = 1.8 · 10^8 and K(H_10) = 1.6 · 10^13.
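Condition numbers of this size can be checked with NumPy/SciPy (a sketch; np.linalg.cond uses the 2-norm by default, so the exact values depend on the chosen norm):

    import numpy as np
    from scipy.linalg import hilbert

    for n in (4, 7, 10):
        H = hilbert(n)                    # the n x n Hilbert matrix
        print(n, np.linalg.cond(H))       # about 1.6e4 for n = 4 and 1.6e13 for n = 10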

3.4 LU decomposition

3.4.1 Permutation matrices


 
1 2 3 ··· n
The permutation σ = σ1 σ2 σ3 ··· σn corresponds to the permutation matrix

e>
 
σ
 .1 
 ..  .
Pσ =  
e>
σn

Properties:

• Pσ−1 = Pσ> ,

• Pσ A represents the matrix with rows of the matrix A that are permuted
with σ,

• APσ> represents the matrix with columns of the matrix A that are per-
muted with σ.

Example: For the given

A = [ 1  0  3 ]          σ : 1 ↦ 2, 2 ↦ 1, 3 ↦ 3
    [ 2  2  1 ]
    [ 3  2  1 ]

we obtain

P_σ = [ 0  1  0 ]     P_σ A = [ 2  2  1 ]     A P_σ^T = [ 0  1  3 ]
      [ 1  0  0 ]             [ 1  0  3 ]               [ 2  2  1 ]
      [ 0  0  1 ]             [ 3  2  1 ]               [ 2  3  1 ]

3.4.2 LU decomposition

The LU decomposition (where 'LU' stands for 'lower-upper'), sometimes also called LU factorization, factors a matrix A ∈ R^{n×n} as the product of a lower triangular matrix L ∈ R^{n×n} and an upper triangular matrix U ∈ R^{n×n}. The decomposition looks like

[ a_11 a_12 · · · a_1n ]   [ 1                   ] [ u_11 u_12 · · · u_1n ]
[ a_21 a_22 · · · a_2n ] = [ l_21  1             ] [      u_22 · · · u_2n ]
[  ...  ...      ...   ]   [  ...  ...  ...      ] [            ...  ...  ]
[ a_n1 a_n2 · · · a_nn ]   [ l_n1 l_n2 · · ·  1  ] [                 u_nn ]

with ones on the diagonal of L.
Example: For the given matrix

A = [ 2  2  3 ]
    [ 4  5  6 ]          (3.1)
    [ 1  2  4 ]

we obtain the following LU decomposition:

L = [ 1    0  0 ]        U = [ 2  2  3   ]
    [ 2    1  0 ]            [ 0  1  0   ]
    [ 1/2  1  1 ]            [ 0  0  5/2 ]

The procedure is called the LU decomposition without pivoting and can be viewed as the matrix form of Gaussian elimination without pivoting. Let us present an algorithm for its computation.

Algorithm 3 Computation of the LU decomposition without pivoting.
Input: Matrix A ∈ R^{n×n}.
Output: The lower triangular matrix L ∈ R^{n×n} and the upper triangular matrix U ∈ R^{n×n}.
1: for j = 1, 2, . . . , n − 1 do
2:   for i = j + 1, j + 2, . . . , n do
3:     Compute l_ij = a_ij / a_jj;
4:     for k = j + 1, j + 2, . . . , n do
5:       Compute a_ik = a_ik − l_ij a_jk;
6:     end for
7:   end for
8: end for

Remark 3.4.1 We can observe that:

• The matrix U is the upper triangle of the matrix A obtained at the end of Algorithm 3.

• The elements a_11, a_22^(1), . . . , a_{n−1,n−1}^(n−2) by which we divide are called pivots. They must not be equal to zero, otherwise the factorization fails to materialize. This is a procedural problem: it can be removed by simply reordering the rows of A so that the pivot of the permuted matrix is nonzero.

• Computing the LU decomposition using Algorithm 3 requires

Σ_{j=1}^{n−1} Σ_{i=j+1}^{n} (1 + Σ_{k=j+1}^{n} 2) = (2/3) n³ − (1/2) n² − (1/6) n = (2/3) n³ + O(n²)

floating point operations.

Let us consider the following example.

Example: For the matrix (3.1) the LU factorization is computed as follows.

j = 1:  L^(1) = [ 1    0  0 ]
                [ 2    1  0 ]
                [ 1/2  0  1 ]

        A^(1) = (L^(1))^{−1} A = [  1    0  0 ] [ 2  2  3 ]   [ 2  2  3   ]
                                 [ −2    1  0 ] [ 4  5  6 ] = [ 0  1  0   ]
                                 [ −1/2  0  1 ] [ 1  2  4 ]   [ 0  1  5/2 ]

j = 2:  L^(2) = [ 1  0  0 ]
                [ 0  1  0 ]
                [ 0  1  1 ]

        A^(2) = (L^(2))^{−1} A^(1) = [ 1   0  0 ] [ 2  2  3   ]   [ 2  2  3   ]
                                     [ 0   1  0 ] [ 0  1  0   ] = [ 0  1  0   ]
                                     [ 0  −1  1 ] [ 0  1  5/2 ]   [ 0  0  5/2 ]

We obtain

L = [ 1    0  0 ]        U = [ 2  2  3   ]
    [ 2    1  0 ]            [ 0  1  0   ]
    [ 1/2  1  1 ]            [ 0  0  5/2 ]
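A direct Python/NumPy transcription of Algorithm 3 (without pivoting; a sketch with our own names), checked on the matrix (3.1):

    import numpy as np

    def lu_nopivot(A):
        """LU decomposition without pivoting; assumes all pivots are nonzero."""
        U = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        for j in range(n - 1):
            for i in range(j + 1, n):
                L[i, j] = U[i, j] / U[j, j]
                U[i, j:] = U[i, j:] - L[i, j] * U[j, j:]
        return L, np.triu(U)

    A = np.array([[2.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [1.0, 2.0, 4.0]])
    L, U = lu_nopivot(A)
    print(L)                       # [[1, 0, 0], [2, 1, 0], [0.5, 1, 1]]
    print(U)                       # [[2, 2, 3], [0, 1, 0], [0, 0, 2.5]]
    print(np.allclose(L @ U, A))   # True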

Solving the linear equations

Given a system of linear equations in matrix form Ax = b, we want to solve the equation for x given A and b. Suppose we have already obtained the LU decomposition of A, so that A = LU and LUx = b. In this case the solution is obtained in two logical steps:

1. we solve the equation Ly = b for y,
2. we solve the equation Ux = y for x.

Note that in both cases we are dealing with triangular matrices (L and U), which can be solved directly by forward and backward substitution.

1. The system Ly = b can be solved by forward substitution:

[ 1                    ] [ y_1 ]   [ b_1 ]
[ l_21  1              ] [ y_2 ] = [ b_2 ]        (3.2)
[  ...  ...  ...       ] [ ... ]   [ ... ]
[ l_n1  l_n2 · · ·  1  ] [ y_n ]   [ b_n ]

Let us present Algorithm 4 for its computation.

Algorithm 4 Computation of Ly = b by forward substitution.
Input: The lower triangular matrix L ∈ R^{n×n} and the vector b ∈ R^n.
Output: The solution y ∈ R^n.
1: for i = 1, 2, . . . , n do
2:   Compute y_i = b_i − Σ_{j=1}^{i−1} l_ij y_j;
3: end for

The cost of solving the system (3.2) is n² floating point operations:

Σ_{i=1}^{n} (1 + 2(i − 1)) = 2 Σ_{i=1}^{n} i − n = n².

2. The system Ux = y can be solved by backward substitution:

[ u_11 u_12 · · · u_1n ] [ x_1 ]   [ y_1 ]
[      u_22 · · · u_2n ] [ x_2 ] = [ y_2 ]        (3.3)
[            ...  ...  ] [ ... ]   [ ... ]
[                u_nn  ] [ x_n ]   [ y_n ]

Let us present Algorithm 5 for its computation.

Algorithm 5 Computation of Ux = y by backward substitution.
Input: The upper triangular matrix U ∈ R^{n×n} and the vector y ∈ R^n.
Output: The solution x ∈ R^n.
1: for i = n, n − 1, . . . , 1 do
2:   Compute x_i = (1/u_ii) (y_i − Σ_{j=i+1}^{n} u_ij x_j);
3: end for

The cost of solving the system (3.3) is

Σ_{j=1}^{n} (2 + 2(j − 1)) = n² + n

floating point operations.

The total cost of solving a system of linear equations using the LU decomposition is (2/3) n³ + O(n²) floating point operations if the matrix A has size n × n.
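Putting the two substitutions together in Python (a sketch reusing lu_nopivot from the listing above):

    import numpy as np

    def forward_subst(L, b):
        y = np.zeros_like(b, dtype=float)
        for i in range(len(b)):
            y[i] = b[i] - L[i, :i] @ y[:i]        # diagonal of L is 1
        return y

    def backward_subst(U, y):
        n = len(y)
        x = np.zeros_like(y, dtype=float)
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x

    A = np.array([[2.0, 2.0, 3.0], [4.0, 5.0, 6.0], [1.0, 2.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    L, U = lu_nopivot(A)                          # from the sketch above
    x = backward_subst(U, forward_subst(L, b))
    print(np.allclose(A @ x, b))                  # True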

Remark 3.4.2 In practice we do not solve the system Ax = b as x = A^{−1} b, since:

• the cost of computing A^{−1} is 2n³ operations, that is three times as much as computing the LU decomposition,
• the numerical error may become bigger.

Similarly, instead of computing A^{−1}B we solve the system AX = B (separately for each column of B).

3.4.3 LU decomposition with pivoting

The method for the LU decomposition without pivoting fails if any of the pivots is equal to 0, while numerically it fails (becomes inaccurate) even if a pivot is merely close to 0.

Example: Let us compute the following LU decomposition,

A = [ 0.0001  1 ]   →   L = [ 1      0 ],   U = [ 0.0001  1              ].
    [ 1       1 ]           [ 10000  1 ]        [ 0       fl(1 − 10000)  ]

Suppose that we compute with three exact decimal digits. In this case we obtain

U = [ 0.0001  1      ]
    [ 0       −10000 ]

and therefore

L U = [ 0.0001  1 ]  ≠  A.
      [ 1       0 ]

This problem can be removed by simply reordering the rows of A (partial pivoting) or also the columns of A (complete pivoting). The latter is not very useful in practice, as it requires too many operations.
Using partial pivoting, before we start eliminating the entries below the main diagonal in the j-th column of A, we compare the values |a_{j,j}|, |a_{j+1,j}|, |a_{j+2,j}|, . . . , |a_{n,j}| and we exchange the j-th row with the one that contains the maximal element. As a result one obtains the decomposition PA = LU, where P denotes a permutation matrix.
Because of the pivoting, the absolute values of all entries in the matrix L are ≤ 1.

Theorem 3.4.1 If the matrix A is nonsingular, then there exists a permutation matrix P such that the LU decomposition with partial pivoting

P A = LU

exists, where L is a lower triangular matrix with ones on the main diagonal and U is an upper triangular matrix.

Let us present Algorithm 6 for this decomposition.



Algorithm 6 Computation of the LU decomposition with partial pivoting.
Input: Matrix A ∈ R^{n×n}.
Output: The permutation matrix P ∈ R^{n×n}, the lower triangular matrix L ∈ R^{n×n} and the upper triangular matrix U ∈ R^{n×n}.
1: for j = 1, 2, . . . , n − 1 do
2:   Find |a_qj| = max_{j≤p≤n} |a_pj| and exchange rows q and j;
3:   for i = j + 1, j + 2, . . . , n do
4:     Compute l_ij = a_ij / a_jj;
5:     for k = j + 1, j + 2, . . . , n do
6:       Compute a_ik = a_ik − l_ij a_jk;
7:     end for
8:   end for
9: end for

In comparison with the LU decomposition without pivoting, we additionally require O(n²) comparisons.

Solving the linear equations


We are given a system of linear equations in a matrix form Ax = b and we want
to solve the equation using the LU decomposition with partial pivoting. The
solution is done in three logical steps:

1. we compute the decomposition P A = LU ,

2. we solve the equation Ly = P b for y,

3. we solve the equation U x = y for x.

Finally, let us consider an example of the LU decomposition with partial pivot-


ing.

Example: For the matrix

A = [ 0  1  2 ]
    [ 1  2  3 ]
    [ 1  0  1 ]

the LU factorization with partial pivoting is computed as follows.

j = 1:  Ã = [ 1  2  3 ]        P^(1) = [ 0  1  0 ]
            [ 0  1  2 ]                [ 1  0  0 ]
            [ 1  0  1 ]                [ 0  0  1 ]

        L^(1) = [ 1  0  0 ]
                [ 0  1  0 ]
                [ 1  0  1 ]

        A^(1) = (L^(1))^{−1} Ã = [ 1   2   3 ]
                                 [ 0   1   2 ]
                                 [ 0  −2  −2 ]

j = 2:  Ã^(1) = [ 1   2   3 ]     P^(2) = [ 0  1  0 ]
                [ 0  −2  −2 ]             [ 0  0  1 ]
                [ 0   1   2 ]             [ 1  0  0 ]

        L^(2) = [ 1  0     0 ]
                [ 0  1     0 ]
                [ 0  −1/2  1 ]

        A^(2) = (L^(2))^{−1} Ã^(1) = [ 1   2   3 ]
                                     [ 0  −2  −2 ]
                                     [ 0   0   1 ]

We obtain

P = [ 0  1  0 ]      L = [ 1  0     0 ]      U = [ 1   2   3 ]
    [ 0  0  1 ]          [ 1  1     0 ]          [ 0  −2  −2 ]
    [ 1  0  0 ]          [ 0  −1/2  1 ]          [ 0   0   1 ]

3.5 Symmetric positive definite matrices


In linear algebra, an n × n real matrix A is said to be symmetric positive definite (s.p.d.) if A = A^T and the scalar x^T A x is positive for every non-zero column vector x ∈ R^n.

Theorem 3.5.1 The following statements are equivalent:

• A is s.p.d. iff A = A^T and all eigenvalues of A are positive,
• A is s.p.d. iff a_ii > 0 for all i and max_{i,j} |a_ij| = max_i |a_ii|,
• A is s.p.d. iff the LU decomposition without pivoting does not fail and a_ii > 0 for all i,
• A is s.p.d. iff there exists a nonsingular lower triangular matrix V with real and positive diagonal entries such that A = V V^T.

The decomposition A = V V^T is called the Cholesky decomposition and V is called the Cholesky factor.

Now let us present Algorithm 7 for the Cholesky decomposition.

Algorithm 7 Computation of the Cholesky decomposition.
Input: Matrix A ∈ R^{n×n}.
Output: Nonsingular lower triangular matrix V ∈ R^{n×n} with real and positive diagonal elements.
1: for k = 1, 2, . . . , n do
2:   Compute v_kk = (a_kk − Σ_{i=1}^{k−1} v_ki²)^{1/2};
3:   for j = k + 1, k + 2, . . . , n do
4:     Compute v_jk = (1/v_kk) (a_jk − Σ_{i=1}^{k−1} v_ji v_ki);
5:   end for
6: end for

The number of required operations is equal to

Σ_{k=1}^{n} (2(k − 1) + 2 + 2(n − k)k) = (1/3) n³ + O(n²),

which is half of the number of operations required for the LU decomposition.

Example: Let us compute the Cholesky decomposition of the matrix

A = [  4  −2   4  −2   4 ]
    [ −2  10   1  −5  −5 ]
    [  4   1   9  −2   1 ]
    [ −2  −5  −2  22   7 ]
    [  4  −5   1   7  14 ]

The Cholesky factor is equal to

V = [  2                  ]
    [ −1   3              ]
    [  2   1   2          ]
    [ −1  −2   1   4      ]
    [  2  −1  −1   2   2  ]

Note that the Cholesky decomposition is the cheapest way to determine whether a given matrix is s.p.d.: if the algorithm for its computation does not fail, then the matrix is s.p.d.
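The same factor can be obtained with NumPy (np.linalg.cholesky returns the lower triangular factor; a LinAlgError would signal that the matrix is not s.p.d.):

    import numpy as np

    A = np.array([[ 4, -2,  4, -2,  4],
                  [-2, 10,  1, -5, -5],
                  [ 4,  1,  9, -2,  1],
                  [-2, -5, -2, 22,  7],
                  [ 4, -5,  1,  7, 14]], dtype=float)

    V = np.linalg.cholesky(A)        # lower triangular, V @ V.T == A
    print(V)
    print(np.allclose(V @ V.T, A))   # True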

3.6 Iterative methods for solving linear systems

The linear system of equations Ax = b can be rewritten as x = Rx + c, which can then be solved iteratively:

x^(r+1) = R x^(r) + c.

The matrix R is called the iteration matrix.

Theorem 3.6.1 The sequence x^(r+1) = R x^(r) + c, r = 0, 1, . . . , converges for an arbitrary x^(0) iff the spectral radius of the iteration matrix R satisfies

ρ(R) := max |λ(R)| < 1.

Proof: Let us denote the exact solution by x̂. Then from x̂ = R x̂ + c and x^(r+1) = R x^(r) + c it follows that

x̂ − x^(r+1) = R(x̂ − x^(r)),

and therefore

x̂ − x^(r+1) = R²(x̂ − x^(r−1)) = · · · = R^{r+1}(x̂ − x^(0)).

Obviously, the necessary and sufficient condition for convergence is lim_{k→∞} R^k = 0, which is equivalent to the condition ρ(R) < 1.

Corollary 3.6.1 The sequence x^(r+1) = R x^(r) + c, r = 0, 1, . . . , converges for an arbitrary x^(0) if ‖R‖ < 1.

How does one obtain the matrix R?

We rewrite the system Ax = b as Mx = −Nx + b, where A = M + N, and we obtain R = −M^{−1}N and c = M^{−1}b. The system is, of course, solved as

M x^(r+1) = −N x^(r) + b,   r = 0, 1, . . . ,

where the matrix M is chosen such that systems with the matrix M can be solved much faster than the initial system Ax = b.

In which cases are iterative methods used?

• When we are dealing with large systems, i.e. when A is a large matrix.
• When the matrix A is sparse, i.e. a matrix in which most of the elements are zero.

Jacobi method

If the diagonal entries of the matrix A are nonzero, i.e. a_ii ≠ 0 for all i, the system Ax = b can be written as

x_1 = (1/a_11)(b_1 − a_12 x_2 − a_13 x_3 − · · · − a_1n x_n),
  ...
x_n = (1/a_nn)(b_n − a_n1 x_1 − a_n2 x_2 − · · · − a_{n,n−1} x_{n−1}).

This yields the Jacobi method:

x_k^(r+1) = (1/a_kk) (b_k − Σ_{i=1, i≠k}^{n} a_ki x_i^(r)),   k = 1, 2, . . . , n.

How is the matrix R for the Jacobi method defined?
Let us write A = L + D + U, where L is the strictly lower triangular part of A (without the diagonal), D is the diagonal matrix containing the diagonal entries of A, and U is the strictly upper triangular part of A (without the diagonal). Then M = D and N = L + U, and therefore

R_J = −D^{−1}(L + U).

Gauss-Seidel method

We compute the values x_1^(r+1), . . . , x_n^(r+1) one after another, and when we compute the component x_k^(r+1) we can already take into account the values x_1^(r+1), . . . , x_{k−1}^(r+1) that have just been computed. This gives the Gauss-Seidel method:

x_k^(r+1) = (1/a_kk) (b_k − Σ_{i=1}^{k−1} a_ki x_i^(r+1) − Σ_{i=k+1}^{n} a_ki x_i^(r)),   k = 1, 2, . . . , n.

In this case M = L + D, N = U, and therefore

R_GS = −(L + D)^{−1} U.

Theorem 3.6.2 If the matrix A is strictly diagonally dominant by rows (or columns), i.e. |a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|, i = 1, 2, . . . , n, then both the Jacobi method and the Gauss-Seidel method converge.

Example: For the system of equations

12x_1 − 3x_2 + x_3 = 10,
−x_1 + 9x_2 + 2x_3 = 10,
x_1 − x_2 + 10x_3 = 10,

and the initial approximation x^(0) = [1, 0, 1]^T, let us determine the first two steps of the Jacobi and the Gauss-Seidel method.

The Jacobi method:

x^(1) = [0.75, 1, 0.9]^T,   x^(2) = [1.00833, 0.99444, 1.025]^T.

The Gauss-Seidel method:

x^(1) = [0.75, 0.9722, 1.022]^T,   x^(2) = [0.991203, 0.994084, 1.0002881]^T.

The exact solution is equal to x̂ = [1, 1, 1]^T. Observe that A is a strictly diagonally dominant matrix, therefore both methods converge.
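Both iterations are easy to reproduce in Python (a sketch, names ours):

    import numpy as np

    A = np.array([[12.0, -3.0,  1.0],
                  [-1.0,  9.0,  2.0],
                  [ 1.0, -1.0, 10.0]])
    b = np.array([10.0, 10.0, 10.0])

    def jacobi_step(x):
        y = np.empty_like(x)
        for k in range(len(x)):
            s = A[k, :] @ x - A[k, k] * x[k]          # sum over i != k
            y[k] = (b[k] - s) / A[k, k]
        return y

    def gauss_seidel_step(x):
        y = x.copy()
        for k in range(len(x)):
            s = A[k, :] @ y - A[k, k] * y[k]          # already uses the updated entries
            y[k] = (b[k] - s) / A[k, k]
        return y

    x = np.array([1.0, 0.0, 1.0])
    print(jacobi_step(x))                             # [0.75, 1.0, 0.9]
    print(gauss_seidel_step(gauss_seidel_step(x)))    # approx. [0.9912, 0.9941, 1.0003]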

SOR (successive over-relaxation) method

The method of successive over-relaxation (SOR) is a variant of the Gauss-Seidel method for solving a linear system of equations, resulting in faster convergence. It is given by

x_k^(r+1) = x_k^(r) + ω (x_k^{(r+1)GS} − x_k^(r)),   k = 1, 2, . . . , n,

where ω is called the relaxation factor and x_k^{(r+1)GS} is the approximation obtained by the Gauss-Seidel method. Note that for ω = 1, SOR is equal to the Gauss-Seidel method.

Theorem 3.6.3
• If A is a symmetric positive definite matrix and ω ∈ (0, 2), then SOR converges for an arbitrary initial approximation.

• SOR does not converge for an arbitrary initial approximation if ω ∉ (0, 2).

It turns out that:

• if the sequence of approximations is monotone, then for ω > 1 the approximation x_k^{(r+1)SOR} is closer to the solution than x_k^{(r+1)GS},

• if the sequence of approximations is alternating, we obtain a better approximation if we choose 0 < ω < 1,

• it is hard to determine the optimal ω.

Let us examine how we can obtain an optimal ω.



Definition 3.6.1 We say that the matrix A has "property A" if there exists a permutation matrix P such that

P A P^T = [ A_11  A_12 ]
          [ A_21  A_22 ],

where A_11 and A_22 are diagonal matrices.

Theorem 3.6.4 Let the matrix A have property A and let µ = ρ(R_J). Then:

a) ρ(R_GS) = µ²,

b) ω_opt = 2 / (1 + √(1 − µ²)),   ρ(R_SOR(ω_opt)) = ω_opt − 1,

c) ρ(R_SOR(ω)) = ω − 1 for ω_opt ≤ ω < 2, and
   ρ(R_SOR(ω)) = 1 − ω + (1/2) ω²µ² + ωµ √(1 − ω + (1/4) ω²µ²) for 0 < ω ≤ ω_opt.

Example: For the matrix

A = [  4  −1   0  −1   0   0 ]
    [ −1   4  −1   0  −1   0 ]
    [  0  −1   4   0   0  −1 ]          (3.4)
    [ −1   0   0   4  −1   0 ]
    [  0  −1   0  −1   4  −1 ]
    [  0   0  −1   0  −1   4 ]

we obtain ρ(R_J) = 0.6036, ρ(R_GS) = 0.3634 and ρ(R_SOR(ω_opt)) = 0.1128, where ω_opt = 1.1128. The graph of ρ(ω) is shown in Figure 3.1.
Figure 3.1: The graph of ρ(ω) that corresponds to the matrix (3.4).
4 Linear least-squares problem
(Overdetermined systems)

4.1 Introduction

• We are solving the linear system

Ax = b,   A ∈ R^{m×n}, m > n, x ∈ R^n, b ∈ R^m.        (4.1)

• It is called an overdetermined system, as it has more equations than unknowns.

• Instead of the exact solution, which in general does not exist, we are going to compute the x that minimizes

‖Ax − b‖.

• If we consider ‖Ax − b‖_2, then the x that minimizes this norm is called the least-squares solution.

Example: Let us find the polynomial p(x) = a0 + a1 x + · · · + an x^n that fits the points (x1, y1), . . . , (xm, ym), m > n, best. We require that p(xi) is as close to yi as possible, and we obtain the system

[ 1  x1  x1^2  · · ·  x1^n ] [ a0 ]   [ y1 ]
[ 1  x2  x2^2  · · ·  x2^n ] [ a1 ]   [ y2 ]
[ .   .    .            .  ] [  . ] = [  . ]
[ .   .    .            .  ] [  . ]   [  . ]
[ 1  xm  xm^2  · · ·  xm^n ] [ an ]   [ ym ]

i.e. A a = y.

Using the least-squares method one wants to find a such that ‖Aa − y‖_2 is minimal. In this case ‖Aa − y‖_2^2 is minimal as well, i.e.

‖Aa − y‖_2^2 = Σ_{i=1}^{m} (p(xi) − yi)^2

is minimal (hence the name least squares).

Example: In statistics one wants to estimate the parameters of the given model
with respect to the obtained measurements. Let us assume that the success of
the student b depends on
• a1 − the points achieved on the matura exam,
• a2 − the points achieved on the entrance exam,
• a3 − the high school success.
The hypothesis: the linear model b = x1 a1 + x2 a2 + x3 a3, where x1, x2, x3 are the unknowns. For m students we obtain the overdetermined system

[ a11  a12  a13 ] [ x1 ]   [ b1 ]
[  .    .    .  ] [ x2 ] = [  . ]
[  .    .    .  ] [ x3 ]   [  . ]
[ am1  am2  am3 ]          [ bm ]

4.2 System of the normal equations

The system of linear equations (4.1) can be transformed into the system of the
normal equations, given as

A> Ax = A> b,

where A^T A ∈ R^{n×n} is the Gramian matrix and A^T b ∈ R^n.

Theorem 4.2.1 The solution of the system of normal equations A> Ax = A> b
minimizes the norm kAx − bk2 .

Proof: Let B = A^T A and c = A^T b, and let A have full rank. The matrix B is an s.p.d. matrix, since

B^T = (A^T A)^T = A^T A = B,
x^T B x = x^T A^T A x = (Ax)^T (Ax) = ‖Ax‖_2^2 > 0   for all x ≠ 0.

Observe that

‖Ax − b‖_2^2 = (Ax − b)^T (Ax − b) = (x^T A^T − b^T)(Ax − b) = x^T B x − x^T c − c^T x + b^T b,
(Bx − c)^T B^{-1} (Bx − c) = (x^T B^T − c^T)(x − B^{-1} c) = x^T B x − x^T c − c^T x + c^T B^{-1} c.

Therefore

‖Ax − b‖_2^2 = (Bx − c)^T B^{-1} (Bx − c) − c^T B^{-1} c + b^T b,

where the last two terms do not depend on x, i.e. they are constant. As the matrix B is s.p.d., B^{-1} is an s.p.d. matrix as well. We derive that

(Bx − c)^T B^{-1} (Bx − c) ≥ 0,

with equality exactly in the case Bx = c, or equivalently, A^T A x = A^T b.

Let us count the number of operations:

• computing A^T A: mn^2 operations,

• solving the system A^T A x = A^T b: (1/3)n^3 + O(n^2) operations (Cholesky factorization),

which sums up to mn^2 + O(n^3); since m >> n, the first term dominates.

Remark 4.2.1 Solving the system of normal equations is the cheapest option, but also the least stable one. Therefore one usually uses one of the other methods.
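
A minimal Python sketch of the normal-equations approach (ours, not from the notes; it assumes NumPy and uses a generic triangular solve instead of a dedicated routine):

import numpy as np

def lstsq_normal_equations(A, b):
    """Solve min ||Ax - b||_2 via A^T A x = A^T b using a Cholesky factorization."""
    B = A.T @ A                    # Gramian matrix (s.p.d. if rank(A) = n)
    c = A.T @ b
    R = np.linalg.cholesky(B)      # B = R R^T with R lower triangular
    y = np.linalg.solve(R, c)      # forward substitution
    return np.linalg.solve(R.T, y) # back substitution

# fit a quadratic p(x) = a0 + a1 x + a2 x^2 to noisy data
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 1e-3 * rng.standard_normal(x.size)
A = np.vander(x, 3, increasing=True)   # columns 1, x, x^2
print(lstsq_normal_equations(A, y))    # approximately [1, 2, -3]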

4.3 QR decomposition

The QR decomposition (also called a QR factorization) of a given matrix A is a


decomposition into a product A = QR of an orthogonal matrix Q and an upper
triangular matrix R. QR decomposition is often used to solve the linear least
squares problem, and is the basis for a particular eigenvalue algorithm, the QR
algorithm.

From the system of normal equations we obtain:

A> Ax = A> b
(QR)> QRx = (QR)> b
R> Q> QRx = R> Q> b
R> Rx = R> Q> b
Rx = Q> b.

Let us show the uniqueness of the QR decomposition:

A = QR ⇒ A> A = R> Q> QR = R> R,

where A> A is s.p.d. matrix and therefore R> is the Cholesky factor. As the
Cholesky decomposition is uniquely defined, R is also unique. Thus we can
conclude that Q = AR−1 is also unique.

The solution of the least-squares method is obtained, if one solves the upper
triangular system

Rx = Q> b.

Computing the solution using QR decomposition is more stable than solving


the system of normal equations.

There are several ways how to compute the matrices Q and R :

• classical Gram-Schmidt method, or shortly CGS,

• modified Gram-Schmidt method, or shortly MGS,

• using the Givens rotations,

• using the Householder reflections.



4.3.1 Classical and modified Gram-Schmidt method

The matrices Q and R are obtained by Gram-Schmidt orthogonalization of the


columns in the matrix A. Let us present the corresponding Algorithm 8.

Algorithm 8 Computation of the QR decomposition.


Input: Matrix A ∈ Rm×n , A = [a1 , . . . , an ].
Output: Orthogonal matrix Q ∈ Rm×n , Q = [q1 , . . . , qn ], and upper triangular
matrix R ∈ Rn×n .
1: for k = 1, 2, . . . , n do
2: Compute qk = ak ;
3: for i = 1, 2, . . . , k − 1 do
4: Compute rik = qi> ak (CGS) or rik = qi> qk (MGS);
5: Compute qk = qk − rik qi ;
6: end for
7: Compute rkk = ‖qk‖_2;
8: Compute qk = qk / rkk;
9: end for

Note that in exact computing, the results obtained by CGS and MGS are the
same, but numerically MGS is more stable than CGS.

Example: Let

    [ 1+ε   1    1  ]
A = [  1   1+ε   1  ],   ε = 10^{-10}.
    [  1    1   1+ε ]

Using CGS we obtain q2^T q3 ≈ 0.5, which is a wrong result.

Using MGS we obtain q2^T q3 ≈ −3.9 · 10^{-17}.
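
The loss of orthogonality of CGS on this matrix can be reproduced with a short Python sketch (ours, assuming NumPy; it follows Algorithm 8, with a flag switching between the two variants):

import numpy as np

def gram_schmidt(A, modified=True):
    """QR by (classical or modified) Gram-Schmidt; columns of A assumed independent."""
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for k in range(n):
        for i in range(k):
            # CGS projects the original column a_k, MGS the current q_k
            R[i, k] = Q[:, i] @ (Q[:, k] if modified else A[:, k])
            Q[:, k] -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
    return Q, R

eps = 1e-10
A = np.ones((3, 3)) + eps * np.eye(3)
for modified in (False, True):
    Q, _ = gram_schmidt(A, modified)
    print("MGS" if modified else "CGS", "q2^T q3 =", Q[:, 1] @ Q[:, 2])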

Let us count the number of the operations:

Σ_{k=1}^{n} ( 3m + Σ_{i=1}^{k−1} 4m ) = 3mn + Σ_{k=1}^{n} 4m(k − 1) = 3mn + 4m Σ_{ℓ=1}^{n−1} ℓ = 2mn^2 + mn,

which is approximately twice as much as solving the system of normal equations (if m >> n).

Remark 4.3.1 There is also the extended QR decomposition A = Q̃R̃, where

Q̃ = [ Q  Q1 ] ∈ R^{m×m},

Q1 being an m×(m−n) matrix with orthonormal columns that are orthogonal also to the columns of the matrix Q, and

R̃ = [ R ]
     [ 0 ]  ∈ R^{m×n}.

4.3.2 Givens rotations

A Givens rotation is a rotation in the plane spanned by two coordinates axes.


Givens rotations are named after Wallace Givens, who introduced them to nu-
merical analysts in the 1950s.

In the plane, the vector x = [x1, x2]^T is rotated by an angle ϕ in the negative (clockwise) direction when it is multiplied by the matrix

R^T = [  c  s ],   c = cos ϕ,  s = sin ϕ.
      [ −s  c ]

In R^n the vector x = [x1, x2, . . . , xn]^T is rotated in the (i, k) plane if it is multiplied by the matrix R_ik^T, namely y = R_ik^T x, where

          [ 1                         ]
          [   .                       ]
          [     c   · · ·   s         ]   <- row i
R_ik^T =  [         .                 ],   c = cos ϕ, s = sin ϕ,
          [    −s   · · ·   c         ]   <- row k
          [                    .      ]
          [                       1   ]

i.e. c and s appear at the intersections of the i-th and k-th rows and columns, and all other entries agree with the identity matrix. The matrix R_ik^T is an orthogonal matrix called a Givens rotation. Note that

y_j = x_j,   for all j ≠ i, k,
y_i = c x_i + s x_k,
y_k = −s x_i + c x_k.

The rotation R_ik^T can be chosen such that y_k = 0 and y_i = r. This is the case when r = √(x_i^2 + x_k^2), c = x_i/r, s = x_k/r. Therefore, using appropriate Givens rotations one may zero all subdiagonal elements of a given matrix A. The number of operations is then equal to 3mn^2 − n^3.

Example: Suppose that we are given an arbitrary matrix A ∈ R^{4×3}. Let us illustrate how the Givens rotations can be used to obtain the QR decomposition. Each rotation zeroes one subdiagonal entry (× denotes a generally nonzero entry):

[ × × × ]          [ × × × ]          [ × × × ]          [ × × × ]
[ × × × ]  R12^T   [ 0 × × ]  R13^T   [ 0 × × ]  R14^T   [ 0 × × ]
[ × × × ]  ----->  [ × × × ]  ----->  [ 0 × × ]  ----->  [ 0 × × ]
[ × × × ]          [ × × × ]          [ × × × ]          [ 0 × × ]

           [ × × × ]          [ × × × ]          [ × × × ]
   R23^T   [ 0 × × ]  R24^T   [ 0 × × ]  R34^T   [ 0 × × ]
   ----->  [ 0 0 × ]  ----->  [ 0 0 × ]  ----->  [ 0 0 × ]  = R̃.
           [ 0 × × ]          [ 0 0 × ]          [ 0 0 0 ]

Let us define

Q̃ := R12 R13 R14 R23 R24 R34.

If we denote the upper 3 × 3 block of R̃ by R and the first n = 3 columns of Q̃ by Q, we obtain the QR decomposition.
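
A small Python sketch of the QR decomposition via Givens rotations (ours, assuming NumPy; it zeroes the subdiagonal entries column by column as in the example):

import numpy as np

def givens_qr(A):
    """Return Q (m x m) and R (m x n) with A = Q R, built from Givens rotations."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(j + 1, m):
            xi, xk = R[j, j], R[i, j]
            r = np.hypot(xi, xk)
            if r == 0.0:
                continue
            c, s = xi / r, xk / r
            G = np.array([[c, s], [-s, c]])     # 2x2 rotation acting on rows j and i
            R[[j, i], :] = G @ R[[j, i], :]     # zeroes R[i, j]
            Q[:, [j, i]] = Q[:, [j, i]] @ G.T   # accumulate the orthogonal factor
    return Q, R

A = np.random.default_rng(1).standard_normal((4, 3))
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))
print(np.round(R, 10))   # numerically upper triangular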

4.3.3 Householder transformation

The Householder transformation (also known as Householder reflection or el-


ementary reflector) is a linear transformation that describes a reflection about

a plane or hyperplane containing the origin. The Householder transformation


was introduced in 1958 by Alston Scott Householder.
Householder reflections can be used to calculate QR decompositions by re-
flecting first one column of a matrix onto a multiple of a standard basis vector,
calculating the transformation matrix, multiplying it with the original matrix
and then recursing down the (i, i) minors of that product.
It can be seen as a generalization of the Givens rotations, in the sense that all subdiagonal elements of a given column are zeroed at once.

The number of operations is equal to 2mn^2 − (2/3)n^3.

For an arbitrary vector w ∈ R^n, w ≠ 0, we define the matrix

P := I − (2/(w^T w)) w w^T.

Note that

• the matrix P is a symmetric orthogonal matrix, since

P P^T = ( I − (2/(w^T w)) w w^T )( I − (2/(w^T w)) w w^T ) = I − (4/(w^T w)) w w^T + (4/(w^T w)^2) w (w^T w) w^T = I,

and P^2 = I, since P^2 = P P = P P^T = I;

• P represents the reflection over the hyperplane with normal w: every vector x can be written as x = αw + u, u ⊥ w, and therefore

P x = x − (2/(w^T w)) w w^T (αw + u) = x − 2αw = −αw + u.

The idea is to find a matrix P that zeroes all components of a given vector x except the first one:

P x = ±k e1.

Theorem 4.3.1 Let ‖x‖_2 = ‖y‖_2, x ≠ y, x, y ∈ R^n, and define w = x − y. Then P x = y.

Proof:

P x = x − (2/(w^T w)) w w^T x = x − ( 2 x^T (x − y) / ((x − y)^T (x − y)) ) (x − y)
    = x − ( 2(x^T x − x^T y) / (2(x^T x − x^T y)) ) (x − y) = x − (x − y) = y.

We would like y to be ±k e1 and therefore w = x ∓ k e1. What is the value of k? The assumption of Theorem 4.3.1 is ‖x‖_2 = ‖y‖_2, therefore k = ‖x‖_2 and w = x ∓ ‖x‖_2 e1.

For numerical reasons we take w = x + sign(x1) ‖x‖_2 e1: since we divide by

w^T w = 2k(k + |x1|),

we want this value to be as large as possible (to avoid cancellation).

The Householder reflections can be applied for computing QR decomposition,


such that we zero all elements under the main diagonal in the given column at
once. One step of the algorithm requires multiplying by the matrix
" #
Ii−1 0
P̃i = , Ii−1 ∈ R(i−1)×(i−1) ,
0 Pi

where Pi is the Householder reflection that zeroes all elements under the main
diagonal in the i−th column.

The matrix R is then equal to

R = P̃n P̃n−1 · · · P̃1 A,

and
P̃n P̃n−1 · · · P̃1 = Q> .
Since P̃i = P̃i> , the matrix Q is equal to Q = P̃1 P̃2 · · · P̃n . As the matrices P̃i
are orthogonal, Q is also orthogonal,

Q> Q = (P̃1 P̃2 · · · P̃n )> (P̃1 P̃2 · · · P̃n ) = I.
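
A compact Python sketch of the Householder QR (ours, assuming NumPy; the sign in w is chosen as described above to avoid cancellation):

import numpy as np

def householder_qr(A):
    """A = Q R with Q (m x m) orthogonal and R (m x n) upper triangular."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for i in range(n):
        x = R[i:, i]
        k = np.linalg.norm(x)
        if k == 0.0:
            continue
        w = x.copy()
        w[0] += np.sign(x[0]) * k if x[0] != 0 else k   # w = x + sign(x1) ||x||_2 e1
        w /= np.linalg.norm(w)
        # apply P = I - 2 w w^T to the trailing block of R and accumulate it into Q
        R[i:, :] -= 2.0 * np.outer(w, w @ R[i:, :])
        Q[:, i:] -= 2.0 * np.outer(Q[:, i:] @ w, w)
    return Q, R

A = np.random.default_rng(2).standard_normal((5, 3))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(5)))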



4.3.4 The comparison of the presented methods

Let us compare the cost of the presented methods.

• Solving the overdetermined system Ax = b, m >> n:
  – by the normal equations we require mn^2 + (1/3)n^3 operations,
  – using the MGS we require 2mn^2 operations,
  – using the Givens rotations we require 3mn^2 − n^3 operations,
  – using the Householder reflections we require 2mn^2 − (2/3)n^3 operations.

• Solving the system Ax = b in the case m = n:
  – using the LU decomposition we require (2/3)n^3 operations,
  – using the Givens rotations we require 2n^3 operations,
  – using the Householder reflections we require (4/3)n^3 operations.

4.4 Singular value decomposition

The singular value decomposition (SVD) is a factorization of a real or complex


matrix. It is the generalization of the eigendecomposition of a positive semidefi-
nite normal matrix to any m×n matrix via an extension of polar decomposition.
It has many useful applications in signal processing and statistics.
Formally, the singular value decomposition of an m × n real or complex matrix
A is a factorization of the form A = U ΣV H , where U is an m × m real or
complex unitary matrix, Σ is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal,

Σ = diag(σ1, . . . , σn)   (padded with m − n zero rows when m > n),
and V is an n × n real or complex unitary matrix. The diagonal entries σi ,


σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0, of Σ are known as the singular values of A. The
columns of U and the columns of V are called the left-singular vectors and
right-singular vectors of A, respectively.

Applications that employ the SVD include computing the pseudoinverse, least
squares fitting of data, multivariable control, matrix approximation, and deter-
mining the rank, range and null space of a matrix.
• In case if n > m, the SVD is obtained such that we transpose the SVD of
the matrix AH .
• If σr > 0 and σr+1 = σr+2 = · · · = σn = 0, then r is a rank of a matrix
A.

Theorem 4.4.1 Let A ∈ R^{m×n}, m ≥ n, and rank(A) = n. Then the minimum of ‖Ax − b‖_2 is obtained for

x = Σ_{i=1}^{n} (u_i^T b / σ_i) v_i,

where u_i and v_i denote the i-th column of U and V, respectively.

Proof: Observe that

‖Ax − b‖_2 = ‖UΣV^T x − b‖_2 = ‖ΣV^T x − U^T b‖_2.

Let us write

U = [ U1, U2 ],   U1 ∈ R^{m×n}, U2 ∈ R^{m×(m−n)},   and   Σ = [ S ],   S = diag(σ1, . . . , σn).
                                                              [ 0 ]

Then

‖Ax − b‖_2 = ‖ [ S V^T x − U1^T b ; U2^T b ] ‖_2,

and the minimum is obtained for S V^T x = U1^T b, or equivalently x = V S^{-1} U1^T b = Σ_{i=1}^{n} (u_i^T b / σ_i) v_i.

Pseudoinverse

The pseudoinverse A^+ of a matrix A is a generalization of the inverse matrix. For a matrix A ∈ R^{m×n}, m ≥ n, rank(A) = n, the pseudoinverse A^+ ∈ R^{n×m} is defined as

A^+ = (A^T A)^{-1} A^T.

In case m < n and rank(A) = m, then A^+ = A^T (A A^T)^{-1}.
Remark 4.4.1 If m = n, then A^+ = (A^T A)^{-1} A^T = A^{-1} A^{-T} A^T = A^{-1}.

The solution of the overdetermined system can be written as x = A+ b.


How can one determine the pseudoinverse if the matrix A does not have a full
rank?
We compute the SVD A = U ΣV > , where
 
σ1
 .. 
 . 0 
Σ= .
σr
 
 
0 0

Then A+ = V Σ+ U > , where


 
1
σ1
..
 
. 0 
 
+
Σ = .

1

 
 σr 
0 0

The compression of images

If rank(A) = r, then A = Σ_{i=1}^{r} σ_i u_i v_i^T.

Theorem 4.4.2 Let A = UΣV^T, rank(A) > k, and let A_k = Σ_{i=1}^{k} σ_i u_i v_i^T, or equivalently A_k = U Σ_k V^T, where

Σ_k = [ diag(σ1, . . . , σk)   0 ]
      [ 0                     0 ].

Then

min_{rank(B)=k} ‖B − A‖_2 = ‖A_k − A‖_2 = σ_{k+1}.

Proof: Note that

‖A_k − A‖_2 = ‖U(Σ_k − Σ)V^T‖_2 ≤ ‖U‖_2 ‖Σ_k − Σ‖_2 ‖V^T‖_2 = ‖Σ_k − Σ‖_2 = σ_{k+1}.

Suppose that we are given a matrix B with rank(B) = k, so dim ker(B) = n − k. Let us define V_{k+1} = [v1, v2, . . . , v_{k+1}]. Then

dim im(V_{k+1}) + dim ker(B) = n + 1,

and there exists z ≠ 0, ‖z‖_2 = 1, such that z ∈ im(V_{k+1}) ∩ ker(B). Therefore

‖A − B‖_2^2 ≥ ‖(A − B)z‖_2^2 = ‖Az‖_2^2 = ‖UΣV^T z‖_2^2 ≥ σ_{k+1}^2.

By the latter theorem, A_k is the best approximation of the matrix A by a matrix of rank k. The value σ_{k+1} tells us how far the matrix A is from the nearest matrix of rank k.
The singular value decomposition can be used for the compression of images. The image can be represented by a matrix A whose elements are the grayscale levels (or the RGB values). Instead of the matrix A (or, in the color case, the matrices A_r, A_g, A_b), we store the best approximation by a matrix of rank k. As a consequence, instead of m·n data we require only (m + n)k data, for [u1, . . . , uk] and [σ1 v1, . . . , σk vk].

How can one compute the SVD? (A small code sketch follows the steps below.)

1. We compute A^T A.
2. We determine the eigenvalues of the matrix A^T A, e.g. as the roots of the polynomial det(A^T A − λI) = 0.
3. We sort the eigenvalues as λ1 ≥ λ2 ≥ · · · ≥ λn (as A^T A is symmetric positive semidefinite, λi ≥ 0, λi ∈ R) and set σi = √λi, i = 1, 2, . . . , n.
4. We compute the eigenvectors of A^T A from (A^T A − λi I) v_i = 0, and by normalization we obtain the matrix V.
5. We compute u_i = (1/σ_i) A v_i and finally obtain U.
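
Just to illustrate the steps above, here is a small Python sketch (ours, assuming NumPy; in practice one calls a library SVD, since forming A^T A squares the condition number):

import numpy as np

def svd_via_eig(A):
    """SVD A = U diag(sigma) V^T for A with full column rank, via eig of A^T A."""
    B = A.T @ A
    lam, V = np.linalg.eigh(B)          # eigenvalues ascending, eigenvectors orthonormal
    order = np.argsort(lam)[::-1]       # sort lambda_1 >= ... >= lambda_n
    lam, V = lam[order], V[:, order]
    sigma = np.sqrt(np.maximum(lam, 0.0))
    U = (A @ V) / sigma                 # u_i = (1/sigma_i) A v_i
    return U, sigma, V

A = np.random.default_rng(3).standard_normal((6, 3))
U, s, V = svd_via_eig(A)
print(np.allclose(U @ np.diag(s) @ V.T, A))
print(np.allclose(s, np.linalg.svd(A, compute_uv=False)))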
5 Eigenvalue problem
A non-zero vector v ∈ Cn is an eigenvector of a square n × n matrix A if it
satisfies the linear equation
Av = λv,
where λ ∈ C is a scalar, termed the eigenvalue corresponding to v. That is, the
eigenvectors are the vectors that the linear transformation A merely elongates
or shrinks, and the amount that they elongate/shrink by is the eigenvalue. The
above equation is called the eigenvalue equation or the eigenvalue problem.
The vector y ≠ 0 that satisfies y^H A = λ y^H is called a left eigenvector.
If y is left eigenvector of A with the corresponding eigenvalue λ, then y is the
right eigenvector of AH with the corresponding eigenvalue λ̄. So hereafter we
will consider only the right eigenvectors.
The eigenvalues are the roots of the polynomial

p(λ) := det(A − λI)

of degree n, where dim(A) = n. The matrix A has n eigenvalues λ1 , λ2 , . . . , λn ,


that are not necessarily different.
As there are no analytic formulas for the roots of polynomials of degree ≥ 5,
we have to use numerical computation also to find the eigenvalues of the given
matrix.
The matrix A is a diagonalizable matrix, if there exists nonsingular matrix
X = [x1 , x2 , . . . , xn ] and diagonal matrix Λ = diag(λ1 , . . . , λn ), such that
A = XΛX^{-1}. In this case A x_i = λ_i x_i, i = 1, 2, . . . , n. For example, every symmetric matrix is diagonalizable.
If det S ≠ 0, then the matrices A and S^{-1}AS have the same eigenvalues (they are similar). The vector x is an eigenvector of A iff S^{-1}x is an eigenvector of S^{-1}AS.
Note that computing the eigenvalues via the characteristic polynomial is numer-
ically not stable.

Which algorithm is suitable, depends on the needs of the user and the charac-
teristics of the given matrix. The following facts are important:
• is the given matrix symmetric (or hermitian),
• does the user require all eigenvalues or just some,
• does the user require also the eigenvectors.

5.1 Power method


The power method, also known as power iteration, is an eigenvalue algorithm:
given a matrix A, the algorithm will produce a number λ, which is the greatest
(in absolute value) eigenvalue of A, and a nonzero vector v, the corresponding
eigenvector of λ, such that Av = λv. The algorithm is also known as the Von
Mises iteration.
The power iteration is a very simple algorithm, but it may converge slowly. It
does not compute a matrix decomposition, and hence it can be used when A is
a very large sparse matrix.
Let us present the corresponding Algorithm 9. The sequence of the vectors xk

Algorithm 9 Power iteration

Input: Matrix A ∈ R^{n×n}.
Output: The vector that corresponds to the biggest eigenvalue (by the absolute value).
1: Choose x0 ≠ 0;
2: for k = 0, 1, 2, . . . do
3:   Compute z_{k+1} = A x_k;
4:   Compute x_{k+1} = z_{k+1} / ‖z_{k+1}‖_∞ (or the 2-norm);
5: end for

converges towards the vector that corresponds to the biggest (by the absolute
value) eigenvalue.

Theorem 5.1.1 Let the eigenvalues of the matrix A satisfy
|λ1| > |λ2| ≥ |λ3| ≥ · · · ≥ |λn|.
If k → ∞, then x_k converges towards the eigenvector that corresponds to the eigenvalue λ1.

Proof: Let us consider the proof only for the case A = Y ΛY^{-1}, Λ = diag(λ1, . . . , λn), Y = [y1, . . . , yn], where the y_i are the eigenvectors (a basis). Let x0 = Σ_{i=1}^{n} α_i y_i. If α1 ≠ 0 (in practice this always holds because of rounding), then

x_k = A^k x0 / ‖A^k x0‖_∞ = ± ( α1 y1 + α2 (λ2/λ1)^k y2 + · · · + αn (λn/λ1)^k yn ) / ‖ α1 y1 + α2 (λ2/λ1)^k y2 + · · · + αn (λn/λ1)^k yn ‖_∞,

and when k → ∞, x_k → ± α1 y1 / ‖α1 y1‖_∞ = ± y1 / ‖y1‖_∞.

Note that the order of convergence is linear and that in practical applications the method converges for every x0. We also have convergence in the case |λ1| = |λ2| ≥ |λ3| ≥ · · · ≥ |λn| with λ1 = λ2 (a multiple dominant eigenvalue). The method can be modified so that it works also in the cases λ1 = −λ2 and λ1 = λ̄2.
How can one obtain the eigenvalue if an approximation for the eigenvector is known? The best approximate value is the λ that minimizes ‖Ax − λx‖_2. The solution is the Rayleigh quotient

ρ(A, x) = x^T A x / (x^T x),   x ≠ 0.

Observe that ρ(A, x) = ρ(A, αx) for α ≠ 0. From Ax = λx it follows that ρ(A, x) = λ.
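
A short Python sketch of the power method with the Rayleigh quotient as the eigenvalue estimate (ours, assuming NumPy):

import numpy as np

def power_method(A, x0, iters=100):
    """Return an approximation of the dominant eigenpair of A."""
    x = x0 / np.linalg.norm(x0, np.inf)
    for _ in range(iters):
        z = A @ x
        x = z / np.linalg.norm(z, np.inf)
    lam = (x @ A @ x) / (x @ x)          # Rayleigh quotient
    return lam, x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, x = power_method(A, np.array([1.0, 1.0]))
print(lam)     # close to the largest eigenvalue (5 + sqrt(5)) / 2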

5.2 QR method

This method is nowadays mostly used in numerical packages. Here we present the corresponding Algorithm 10.

Algorithm 10 QR iteration

Input: Matrix A ∈ R^{n×n}.
Output: The Schur form of the matrix A.
1: Define A0 = A;
2: for k = 0, 1, 2, . . . do
3:   Compute the QR decomposition of A_k, namely A_k = Q_k R_k;
4:   Compute A_{k+1} = R_k Q_k;
5: end for

Note that:
• The matrices A_k, k = 0, 1, . . . , are similar to the matrix A, as A_{k+1} = Q_k^T A_k Q_k, and therefore they have the same eigenvalues.

• The matrices A_k converge towards a quasi-upper triangular matrix, i.e. the Schur form (an upper triangular matrix that can have some 2 × 2 blocks on the main diagonal). The eigenvalues of each 2 × 2 block represent two complex conjugated eigenvalues, while the real eigenvalues appear on the remaining part of the main diagonal.
• The number of operations of the QR method is in general O(n^4). There is also the QR method with shifts (single, double), which accelerates the convergence.
For the symmetric matrices we know many other methods, i.e. specially adapted
QR method, the method divide and conquer, Jacobi method,...
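
A basic Python sketch of the (unshifted) QR iteration from Algorithm 10 (ours, assuming NumPy; practical implementations first reduce A to Hessenberg form and add shifts):

import numpy as np

def qr_iteration(A, iters=200):
    """Unshifted QR iteration; for a symmetric A it converges to a diagonal matrix."""
    Ak = A.astype(float).copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                      # A_{k+1} = Q_k^T A_k Q_k, similar to A
    return Ak

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Ak = qr_iteration(A)
print(np.round(np.sort(np.diag(Ak)), 6))       # approximate eigenvalues
print(np.round(np.linalg.eigvalsh(A), 6))      # reference values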
6 Interpolation

6.1 Polynomial interpolation

In n + 1 distinct points x0 , x1 , . . . , xn we are given values y0 , y1 , . . . , yn of the


function f. We are looking for the polynomial of degree ≤ n, namely

p(x) = an xn + an−1 xn−1 + · · · + a1 x + a0 ,

that matches the value of the function f in the points xi . Such polynomial is
called the interpolation polynomial.

Why is interpolation useful?

• Historically, its main use was computing the values of tabulated functions.
• The polynomial interpolation is the basis for numerical differentiation and integration.

Why the polynomials?


• Simple design and calculation of its values.
• Simple differentiation and integration.
• It turns out, that even better choice are the splines (we compose together
polynomials of a lower degree).
• We can also use trigonometric polynomials and rational functions.

How can one compute the interpolation polynomial? The naive approach
would be:

Let us set up the linear system for the unknown coefficients of the polynomial p(x) = a0 + a1 x + · · · + a_{n−1} x^{n−1} + a_n x^n, namely

a0 + a1 x0 + · · · + a_{n−1} x0^{n−1} + a_n x0^n = y0,
  .
  .
a0 + a1 xn + · · · + a_{n−1} xn^{n−1} + a_n xn^n = yn.

This approach, i.e. solving the obtained system, is time-consuming (it is better to use some explicit method); in addition, the system is ill-conditioned.

Theorem 6.1.1 For the pairwise distinct points x0 , x1 , . . . , xn , and values


y0 , y1 , . . . , yn , there exists a unique polynomial In of degree ≤ n, for which
In (xi ) = yi , i = 0, 1, . . . , n.

Proof:

• Let us first show the existence of such a polynomial by construction. Let us define the polynomials

L_{n,i}(x) = ( (x − x0)(x − x1) · · · (x − x_{i−1})(x − x_{i+1}) · · · (x − xn) ) / ( (x_i − x0)(x_i − x1) · · · (x_i − x_{i−1})(x_i − x_{i+1}) · · · (x_i − xn) ),

i.e.

L_{n,i}(x) = Π_{k=0, k≠i}^{n} (x − x_k)/(x_i − x_k),

which are called the Lagrange polynomials of degree n. Obviously,

L_{n,i}(x_j) = δ_{i,j},

where δ_{i,j} denotes the Kronecker delta. The interpolation polynomial I_n is of the form

I_n(x) = Σ_{k=0}^{n} y_k L_{n,k}(x),

which is clearly of degree ≤ n and satisfies I_n(x_i) = y_i, i = 0, 1, . . . , n.

• Let us show the uniqueness. Assume that there exist two different polynomials p and q of degree ≤ n that match in n + 1 distinct points. Let us define r := p − q. Obviously r has n + 1 zeros x0, . . . , xn, and is a polynomial of degree ≤ n. Therefore r ≡ 0 and p = q.

The interpolation polynomial I_n matches the function f at the points x0, . . . , xn, while at the other points it need not. Let us see how one can estimate the difference, i.e. the error.

Theorem 6.1.2 Let the function f be (n + 1)-times continuously differentiable on the interval [a, b] and let x0, x1, . . . , xn ∈ [a, b]. Then for every x ∈ [a, b] there exists a ξ such that

f(x) − I_n(x) = ( f^(n+1)(ξ) / (n + 1)! ) ω(x),                       (6.1)

where ω(x) = (x − x0)(x − x1) · · · (x − xn) and min(x, x0, . . . , xn) < ξ < max(x, x0, . . . , xn).

Proof: If x = x_i, i ∈ {0, 1, . . . , n}, then f(x_i) − I_n(x_i) = 0 and (f^(n+1)(ξ)/(n+1)!) ω(x_i) = 0, so (6.1) holds trivially. Let us now assume x ≠ x_i, i = 0, 1, . . . , n. We define the function

F(z) := f(z) − I_n(z) − R ω(z).                                        (6.2)

The function F is (n + 1)-times continuously differentiable and F(x_i) = 0 for all i ∈ {0, 1, . . . , n}. For the chosen x ∈ [a, b], x ≠ x_i, the constant R can be determined such that F(x) = 0. For such R, the function F has at least n + 2 roots on [a, b], namely x0, x1, . . . , xn, x. Therefore, by Rolle's theorem, F' has at least n + 1 distinct roots on (a, b), F'' has at least n distinct roots, . . . , and F^(n+1) has at least one root on (a, b), which we denote by ξ. From (6.2) it follows that

0 = F^(n+1)(ξ) = f^(n+1)(ξ) − R (n + 1)!,

therefore

R = f^(n+1)(ξ) / (n + 1)!,

and the equation (6.2) at z = x reads as (6.1).

Remark 6.1.1 Note that

|f(x) − I_n(x)| ≤ |ω(x)| ‖f^(n+1)‖_∞ / (n + 1)!.

6.2 Divided differences

The divided differences are defined by a recursive process. They can be used to calculate the coefficients of the interpolation polynomial in the Newton form.

Definition 6.2.1 The divided difference f[x0, x1, . . . , xk] is the leading coefficient of the interpolation polynomial of degree k that matches the function f at the pairwise distinct points x0, x1, . . . , xk.

Theorem 6.2.1 For the divided differences the following holds:

a) The polynomial

P_n(x) = f[x0] + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1) + · · · + f[x0, x1, . . . , xn](x − x0)(x − x1) · · · (x − x_{n−1})      (6.3)

is the interpolation polynomial that matches f at x0, x1, . . . , xn, and is called the Newton form of the interpolation polynomial. As the interpolation polynomial is uniquely defined, I_n(x) = P_n(x).

b) f[x0, x1, . . . , xk] is a symmetric function of its arguments.

c) (αf + βg)[x0, x1, . . . , xk] = α f[x0, x1, . . . , xk] + β g[x0, x1, . . . , xk].

d) We have the recursive formula

f[x0, x1, . . . , xk] = ( f[x1, x2, . . . , xk] − f[x0, x1, . . . , x_{k−1}] ) / (xk − x0).      (6.4)

Proof:

a) We use induction.

a.1) The base case, n = 0: f[x0] is the leading coefficient of the polynomial of degree 0 that matches f at x0, namely the constant polynomial y = f(x0). Therefore f[x0] = f(x0).

a.2) The inductive step, n → n + 1: Let

P_n(x) = f[x0] + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1) + · · · + f[x0, x1, . . . , xn](x − x0)(x − x1) · · · (x − x_{n−1}).

The polynomial P_{n+1} interpolates f at x0, x1, . . . , x_{n+1} and can be written as

P_{n+1}(x) = P_n(x) + c (x − x0)(x − x1) · · · (x − xn),

where the first summand has degree n and the second degree n + 1. By the definition of the divided differences, c = f[x0, x1, . . . , x_{n+1}], and therefore

P_{n+1}(x) = f[x0] + f[x0, x1](x − x0) + · · · + f[x0, x1, . . . , x_{n+1}](x − x0)(x − x1) · · · (x − xn).

b) Obvious.

c) Obvious.

d) Let p0 be the interpolation polynomial that matches f at x0, x1, . . . , x_{k−1} and let p1 be the interpolation polynomial that matches f at x1, x2, . . . , xk. Then the interpolation polynomial p that matches f at x0, x1, . . . , xk has the form

p(x) = ( (x − xk)/(x0 − xk) ) p0(x) + ( (x − x0)/(xk − x0) ) p1(x).

If we compare the leading coefficients, we obtain (6.4).

Let us put down the following formulas:

f[x0] = f(x0),   f[x0, x1] = ( f(x1) − f(x0) ) / (x1 − x0).

What happens if x1 → x0? From

p(x) = f[x0] + f[x0, x1](x − x0) = f(x0) + ( (f(x1) − f(x0))/(x1 − x0) ) (x − x0)

and

f'(x0) = lim_{h→0} ( f(x0 + h) − f(x0) ) / h,

one derives

p(x) = f(x0) + f'(x0)(x − x0)

and therefore

f[x0, x0] = f'(x0).

So if we allow the interpolation points to coincide, the following recursive formula for the divided differences holds:

f[x0, x1, . . . , xk] = f^(k)(x0)/k!,                                                  if x0 = x1 = · · · = xk,
f[x0, x1, . . . , xk] = ( f[x1, x2, . . . , xk] − f[x0, x1, . . . , x_{k−1}] ) / (xk − x0),   otherwise.

Therefore the definition of the interpolation polynomial and of the divided differences can be extended to polynomials that, besides the values, match also the derivatives. For example: for the given interpolation points x1, x2, x2, x2, x3, x3 we are searching for the polynomial p for which

p(x1) = y1,  p(x2) = y2,  p'(x2) = y2',  p''(x2) = y2'',  p(x3) = y3,  p'(x3) = y3'.

The divided differences can be computed in a triangular scheme from which, at the end, we can quickly read off the Newton form of the polynomial or calculate its value at a given point.

Example: Find the polynomial of degree 5 for which

p(0) = 1, p'(0) = 2, p''(0) = 3, p(1) = −1, p'(1) = 3, p(2) = 4.

Let us compute the coefficients in the triangular scheme (the interpolation points are 0, 0, 0, 1, 1, 2):

xi  f[·]  f[·,·]  f[·,·,·]  f[·,·,·,·]  f[·,·,·,·,·]  f[·,·,·,·,·,·]
0    1
0    1     2
0    1     2       3/2
1   −1    −2      −4        −11/2
1   −1     3       5          9          29/2
2    4     5       2         −3/2       −21/4          −79/8

from which we obtain

p(x) = 1 + 2x + (3/2)x^2 − (11/2)x^3 + (29/2)x^3(x − 1) − (79/8)x^3(x − 1)^2.
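
For pairwise distinct interpolation points, the triangular scheme and the evaluation of the Newton form can be coded in a few lines of Python (our sketch, assuming NumPy; it does not handle the repeated points with derivative data used in the example above):

import numpy as np

def divided_differences(x, y):
    """Return f[x0], f[x0,x1], ..., f[x0,...,xn] for pairwise distinct points x."""
    n = len(x)
    coef = np.array(y, dtype=float)
    for k in range(1, n):
        # after step k, coef[i] holds f[x_{i-k}, ..., x_i] for i >= k
        coef[k:] = (coef[k:] - coef[k - 1:-1]) / (x[k:] - x[:-k])
    return coef

def newton_eval(x, coef, t):
    """Evaluate the Newton form at t by a Horner-like scheme."""
    p = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        p = p * (t - x[k]) + coef[k]
    return p

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x**3 - 2.0 * x + 1.0
coef = divided_differences(x, y)
print(newton_eval(x, coef, 1.5), 1.5**3 - 2 * 1.5 + 1)   # both 1.375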

Theorem 6.2.2 Let f be a k-times continuously differentiable function. Then

f[x0, x1, . . . , xk] = f^(k)(ξ)/k!,

where

min_{i=0,1,...,k}(x_i) < ξ < max_{i=0,1,...,k}(x_i).

Theorem 6.2.3 Let f be an (n+1)-times continuously differentiable function and let I_n be the interpolation polynomial at the points x0, x1, . . . , xn. Then

f(x) − I_n(x) = f[x0, . . . , xn, x](x − x0) · · · (x − xn).                  (6.5)

Proof: Let q be the polynomial that interpolates f at the points x0, . . . , xn, t. Then

q(x) = I_n(x) + f[x0, . . . , xn, t](x − x0) · · · (x − xn).

Evaluating at x = t gives f(t) = q(t) = I_n(t) + f[x0, . . . , xn, t](t − x0) · · · (t − xn); since this holds for every t, the equation (6.5) is satisfied.

Corollary 6.2.1 For an (n + 1)-times continuously differentiable function f and the interpolation polynomial I_n at the points x0, x1, . . . , xn, the following holds:

f(x) − I_n(x) = ( f^(n+1)(ξ) / (n + 1)! ) ω(x),

where

min(x, x0, . . . , xn) < ξ < max(x, x0, . . . , xn).

The advantage of the new estimate is that it can be used also when the interpo-
lation points match.

6.2.1 Convergence of the interpolation polynomial

One would expect that by raising the degree of the interpolation polynomial I_n (and therefore the number of interpolation points, i.e. n + 1), the interpolation polynomial would converge to the interpolated function. But it turns out that for an arbitrary selection of interpolation points there exists a function for which the interpolation polynomials diverge as n increases.

Example: Runge's counterexample: Let us consider equidistant interpolation points and the function 1/(1 + x^2) on the interval [−5, 5]. In Figure 6.1, observe what happens when the number of interpolation points is raised.

Figure 6.1: The graph of the interpolation polynomial (red) that interpolates the function 1/(1 + x^2) on the interval [−5, 5] at equidistant points, for n = 3, 7, 11, 15, 19, 23.

 
Note that if we take the Chebyshev points x_k = 5 cos( (2k + 1)π / (2n + 2) ), k = 0, 1, . . . , n, then with raising the degree n of the interpolating polynomial, the interpolation polynomials converge towards the given function f (see Figure 6.2).

Figure 6.2: The graph of the interpolation polynomial (red) that interpolates the function 1/(1 + x^2) on the interval [−5, 5] at the Chebyshev points, for n = 3, 7, 11, 15, 19, 23.
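
The difference between the two figures can also be checked numerically. The following Python sketch (ours, assuming NumPy; it evaluates the interpolant in Lagrange form and measures the maximum error on a fine grid) shows the error growing for equidistant nodes and shrinking for the Chebyshev nodes defined above.

import numpy as np

f = lambda x: 1.0 / (1.0 + x**2)

def lagrange_eval(nodes, values, t):
    """Evaluate the interpolation polynomial at the points t using the Lagrange form."""
    p = np.zeros_like(t)
    for i, (xi, yi) in enumerate(zip(nodes, values)):
        L = np.ones_like(t)
        for j, xj in enumerate(nodes):
            if j != i:
                L *= (t - xj) / (xi - xj)
        p += yi * L
    return p

t = np.linspace(-5.0, 5.0, 2001)
for n in (3, 7, 11, 15, 19, 23):
    equi = np.linspace(-5.0, 5.0, n + 1)
    cheb = 5.0 * np.cos((2.0 * np.arange(n + 1) + 1.0) * np.pi / (2.0 * n + 2.0))
    err_e = np.max(np.abs(lagrange_eval(equi, f(equi), t) - f(t)))
    err_c = np.max(np.abs(lagrange_eval(cheb, f(cheb), t) - f(t)))
    print(n, err_e, err_c)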

6.3 Spline interpolation

As polynomials of higher degree may diverge when one wants to interpolate many points, we choose a different solution to this problem. Instead of one interpolation polynomial of a higher degree, the interpolant can be composed of several polynomials of a lower degree. The obtained function is called a piecewise polynomial function or a spline.

On each interval [x_i, x_{i+1}] we construct an interpolating polynomial p_i(x) of degree k. Moreover, we require that the obtained spline is as smooth as possible and that it follows the shape of the initial data as much as possible.

In practice, the cubic splines are the ones that work best.

Example: C^1 parabolic splines

On each interval [x_i, x_{i+1}] we construct a parabola p_i with

p_i(x_i) = f(x_i),   p_i(x_{i+1}) = f(x_{i+1}).

As each parabola has three free coefficients, we add an additional requirement. The most natural one is

p_i(x_{i+1/2}) = v_i,   x_{i+1/2} = (x_i + x_{i+1})/2,

where we try to choose the v_i such that the spline is C^1 continuous. We know that

p_i(x) = f(x_i) + f[x_i, x_{i+1}](x − x_i) + f[x_i, x_{i+1}, x_{i+1/2}](x − x_i)(x − x_{i+1})
       = f(x_i) + ( (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) ) (x − x_i)
         + ( (f[x_i, x_{i+1/2}] − f[x_{i+1}, x_{i+1/2}]) / (x_i − x_{i+1}) ) (x − x_i)(x − x_{i+1})
       = f(x_i) + ( (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) ) (x − x_i)
         + ( ( 2(f(x_i) − v_i)/(x_i − x_{i+1}) − 2(v_i − f(x_{i+1}))/(x_i − x_{i+1}) ) / (x_i − x_{i+1}) ) (x − x_i)(x − x_{i+1})
       = f(x_i) + ( (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) ) (x − x_i)
         + ( (x − x_i)(x − x_{i+1}) / (x_{i+1} − x_i)^2 ) ( 2f(x_i) + 2f(x_{i+1}) − 4v_i ).

Therefore

p_i'(x) = (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) + ( (2f(x_i) + 2f(x_{i+1}) − 4v_i) / (x_{i+1} − x_i)^2 ) (2x − x_i − x_{i+1}).

The C^1 continuity condition requires

p_i'(x_{i+1}) = p_{i+1}'(x_{i+1}),   i = 0, 1, . . . , n − 2.               (6.6)

As

p_i'(x_{i+1}) = (f(x_{i+1}) − f(x_i)) / (x_{i+1} − x_i) + (2f(x_i) + 2f(x_{i+1}) − 4v_i) / (x_{i+1} − x_i),
p_{i+1}'(x_{i+1}) = (f(x_{i+2}) − f(x_{i+1})) / (x_{i+2} − x_{i+1}) + ( (2f(x_{i+1}) + 2f(x_{i+2}) − 4v_{i+1}) / (x_{i+2} − x_{i+1})^2 ) (x_{i+1} − x_{i+2}),

the equation (6.6) can equivalently be written as

(3f(x_{i+1}) + f(x_i) − 4v_i) / (x_{i+1} − x_i) = (−f(x_{i+2}) − 3f(x_{i+1}) + 4v_{i+1}) / (x_{i+2} − x_{i+1}).      (6.7)

Let us assume that the breakpoints are equidistant, i.e. x_{i+2} − x_{i+1} = x_{i+1} − x_i for all i. Then from (6.7) we obtain n − 1 equations for the n unknowns v_0, v_1, . . . , v_{n−1}:

f(x_i) + 6f(x_{i+1}) + f(x_{i+2}) = 4v_i + 4v_{i+1},   i = 0, 1, . . . , n − 2.

One of the v_i can be chosen arbitrarily. Let us set v_0 = A. For the remaining unknowns v_i, i = 1, 2, . . . , n − 1, we obtain the (n − 1) × (n − 1) lower bidiagonal linear system

[ 4 0 0 · · · 0 0 ] [ v_1     ]   [ f(x_0) + 6f(x_1) + f(x_2) − 4A    ]
[ 4 4 0 · · · 0 0 ] [ v_2     ]   [ f(x_1) + 6f(x_2) + f(x_3)         ]
[ 0 4 4 · · · 0 0 ] [ v_3     ] = [ f(x_2) + 6f(x_3) + f(x_4)         ]
[ .         .     ] [ .       ]   [ .                                 ]
[ 0 0 0 · · · 4 4 ] [ v_{n−1} ]   [ f(x_{n−2}) + 6f(x_{n−1}) + f(x_n) ]
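
A small Python sketch of the construction (ours, assuming NumPy and equidistant points): the v_i follow from the bidiagonal system by a simple forward recursion, and each parabola is evaluated in the Newton form derived above.

import numpy as np

def parabolic_spline(x, fx, v0):
    """C^1 parabolic spline on equidistant points x; returns the midpoint values v_i."""
    n = len(x) - 1
    v = np.empty(n)
    v[0] = v0
    for i in range(n - 1):
        # f(x_i) + 6 f(x_{i+1}) + f(x_{i+2}) = 4 v_i + 4 v_{i+1}
        v[i + 1] = (fx[i] + 6.0 * fx[i + 1] + fx[i + 2]) / 4.0 - v[i]
    return v

def eval_piece(x, fx, v, i, t):
    """Evaluate p_i at t in [x_i, x_{i+1}]."""
    h = x[i + 1] - x[i]
    slope = (fx[i + 1] - fx[i]) / h
    return (fx[i] + slope * (t - x[i])
            + (t - x[i]) * (t - x[i + 1]) / h**2
              * (2.0 * fx[i] + 2.0 * fx[i + 1] - 4.0 * v[i]))

x = np.linspace(0.0, 3.0, 4)
fx = np.sin(x)
v = parabolic_spline(x, fx, v0=np.sin(0.5))     # v_0 chosen as the exact midpoint value
# check the C^1 condition numerically at the interior breakpoint x_1
d = 1e-6
print((eval_piece(x, fx, v, 0, x[1]) - eval_piece(x, fx, v, 0, x[1] - d)) / d,
      (eval_piece(x, fx, v, 1, x[1] + d) - eval_piece(x, fx, v, 1, x[1])) / d)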
7 Numerical differentiation and
integration

7.1 Numerical differentiation

From function values we would like to determine an approximate value of the derivative. The idea is to use the derivative of the corresponding interpolation polynomial. By differentiating

f(x) = I_n(x) + ( ω(x) / (n + 1)! ) f^(n+1)(ξ_x)

we obtain

f'(x) = I_n'(x) + ( ω'(x) / (n + 1)! ) f^(n+1)(ξ_x) + ( ω(x) / (n + 1)! ) (d/dx) f^(n+1)(ξ_x),

where the last two terms represent the error. The error can be estimated if x = x_k for some k ∈ {0, 1, . . . , n}. In this case ω(x_k) = 0 and the second part of the error vanishes. We derive

f'(x_k) = I_n'(x_k) + ( ω'(x_k) / (n + 1)! ) f^(n+1)(ξ),

where I_n'(x_k) represents the derivative of the interpolation polynomial of degree n at the given point x_k.

Example: Let us derive the formula for f'(x0), if we know the values f(x0) and f(x1):

f'(x0) = I_1'(x0) + ( ω'(x0) / 2! ) f''(ξ0)
       = ( f[x0] + f[x0, x1](x − x0) )'|_{x=x0} + ( f''(ξ0)/2 ) ( (x − x0)(x − x1) )'|_{x=x0}
       = f[x0, x1] + ( (x0 − x1)/2 ) f''(ξ0)
       = ( f(x1) − f(x0) ) / h − ( h/2 ) f''(ξ0).

Let us assume that we have equidistant points x_i = x0 + ih, i = 0, 1, . . . , n. Here are some basic formulas for numerical differentiation:

n = 1 (points x0, x1):
f'(x0) = (1/h)( f(x1) − f(x0) ) − (1/2) h f''(ξ0),
f'(x1) = (1/h)( f(x1) − f(x0) ) + (1/2) h f''(ξ1),

n = 2 (points x0, x1, x2):
f'(x0) = (1/(2h))( −3f(x0) + 4f(x1) − f(x2) ) + (1/3) h^2 f'''(ξ0),
f'(x1) = (1/(2h))( −f(x0) + f(x2) ) − (1/6) h^2 f'''(ξ1)   (the symmetric difference),
f'(x2) = (1/(2h))( f(x0) − 4f(x1) + 3f(x2) ) + (1/3) h^2 f'''(ξ2).

7.1.1 Whole error

Example: Let us use the formulas

f'(0) = (1/(2h))( f(h) − f(−h) ) + O(h^2),   f''(0) = (1/h^2)( f(−h) − 2f(0) + f(h) ) + O(h^2)

to compute f'(0) and f''(0) for the function f(x) = e^x. We obtain the results presented in Table 7.1. Although the error of both formulas is O(h^2), we obtain worse results for smaller values of h. Why?

h        f'(0)         f''(0)
1        1.11752012    1.0861612
0.1      1.00166730    1.00083334
0.01     1.00001610    1.0000169
0.001    0.99999454    0.9983778
0.0001   0.99994244    1.4901161

Table 7.1: Numerical values of the derivatives f'(0) and f''(0) for the function f(x) = e^x.

The results of the previous example, where smaller values of the step h resulted in worse approximations for the first and the second derivative, can be explained on the formula

f''(x1) = (1/h^2)( y0 − 2y1 + y2 ) − (h^2/12) f^(4)(ξ),   y_i = f(x_i).

If we ignore the rounding error D_z (made when evaluating the formula itself), the whole error is composed of the method error and the irreducible error:

• the method error can be estimated as

|D_m| ≤ (h^2/12) ‖f^(4)‖_∞,

• the irreducible error: instead of computing with the exact values f(x_i) we compute with approximate values ỹ_i such that |f(x_i) − ỹ_i| ≤ ε. We obtain

|D_n| ≤ 4ε/h^2.

The estimate for the whole error is therefore

|D| ≤ |D_m| + |D_n| ≤ (h^2/12) ‖f^(4)‖_∞ + 4ε/h^2.

If we know an estimate for f^(4) and ε, we can, from the estimate for D, determine the optimal value of h for which the bound on the whole error is the smallest (see Figure 7.1).

Figure 7.1: The method error, the irreducible error and their sum (the whole error) as functions of h. The optimal step h can be estimated from the graph of the estimate for the whole error.
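
The behaviour from Table 7.1 and Figure 7.1 can be reproduced with a few lines of Python (our sketch, assuming NumPy): the total error first decreases like h^2 and then grows like ε/h^2 once rounding dominates.

import numpy as np

f = np.exp      # f'(0) = f''(0) = 1

for h in (1.0, 0.1, 0.01, 0.001, 0.0001):
    d1 = (f(h) - f(-h)) / (2.0 * h)            # symmetric difference, error O(h^2)
    d2 = (f(-h) - 2.0 * f(0.0) + f(h)) / h**2  # second difference, error O(h^2) + O(eps/h^2)
    print(h, d1, d2)

# the bound |D| <= h^2/12 * max|f^(4)| + 4*eps/h^2 is minimal at h = (48*eps/max|f^(4)|)^(1/4)
eps = 2**-53
print("optimal h approx", (48.0 * eps / 1.0) ** 0.25)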

7.2 Numerical integration

Numerical integration constitutes a broad family of algorithms for calculating the numerical value of a definite integral, and by extension, the term is also sometimes used to describe the numerical solution of differential equations. The basic problem in numerical integration is to compute an approximate solution to a definite integral

I(f) = ∫_a^b f(x) dx

to a given degree of accuracy. The integral is approximated using the values of the function f at the points x0, x1, . . . , xn (see Figure 7.2) as

∫_a^b f(x) dx = Σ_{i=0}^{n} A_i f(x_i) + R(f),

where the sum is called the quadrature rule and R(f) is the error of the quadrature rule. The points x_i are called the integration points or breakpoints and the A_i are the weights (coefficients). We require that the formula is exact for polynomials of the highest possible degree, and therefore we integrate the interpolation polynomial instead of f.
If the breakpoints x_i, i = 0, 1, . . . , n, are fixed, then the weights A_i are determined in one of the following ways:


Figure 7.2: The numerical integration.

a) We integrate the Lagrange polynomials:

f(x) = Σ_{i=0}^{n} f(x_i) L_{n,i}(x) + ( f^(n+1)(ξ) / (n + 1)! ) ω(x),

∫_a^b f(x) dx = Σ_{i=0}^{n} f(x_i) ∫_a^b L_{n,i}(x) dx + ∫_a^b ( f^(n+1)(ξ) / (n + 1)! ) ω(x) dx,

where the last term is R(f). We obtain

A_i = ∫_a^b L_{n,i}(x) dx.

b) By the method of undetermined coefficients: the A_i, i = 0, 1, . . . , n, are determined such that the formula is exact for polynomials of the highest possible degree.

7.2.1 Newton–Cotes quadrature rules

The Newton–Cotes quadrature rules, also called the Newton–Cotes formulae or simply Newton–Cotes rules, are a group of formulae for numerical integration (also called quadrature) based on evaluating the integrand at equally spaced (i.e. equidistant) points. They are named after Isaac Newton and Roger Cotes.
We are given the values of the integrand at equally spaced points

x0 = a,   x_i = x0 + ih  for i = 1, 2, . . . , n,   where h = (b − a)/n.

Let y_k = f(x_k). There are two types of the Newton–Cotes quadrature rules:

a) Closed Newton–Cotes quadrature rules:

∫_a^b f(x) dx = Σ_{k=0}^{n} A_k y_k + R_n(f).

b) Open Newton–Cotes quadrature rules:

∫_a^b f(x) dx = Σ_{k=1}^{n−1} A_k y_k + R_n(f).

Let us list some closed quadrature rules:

n = 1: Trapezoid rule:

∫_{x0}^{x1} f(x) dx = (h/2)(y0 + y1) − (h^3/12) f''(ξ).

How do we obtain this rule? We compute

A0 = ∫_{x0}^{x1} (x − x1)/(x0 − x1) dx = h/2,   A1 = ∫_{x0}^{x1} (x − x0)/(x1 − x0) dx = h/2,

and (using the mean value theorem)

R_1(f) = ∫_{x0}^{x1} ( f''(ξ_x)/2 )(x − x0)(x − x1) dx = ( f''(ξ)/2 ) ∫_0^h u(u − h) du = −(h^3/12) f''(ξ).

Figure 7.3: The numerical integration: Trapezoid rule.



Figure 7.4: The numerical integration: Simpson’s rule.

n = 2: Simpson's rule:

∫_{x0}^{x2} f(x) dx = (h/3)(y0 + 4y1 + y2) − (h^5/90) f^(4)(ξ).

This rule is exact for all polynomials of degree ≤ 3.

n = 3: Simpson's 3/8 rule:

∫_{x0}^{x3} f(x) dx = (3h/8)(y0 + 3y1 + 3y2 + y3) − (3h^5/80) f^(4)(ξ).

n = 4: Boole's rule:

∫_{x0}^{x4} f(x) dx = (2h/45)(7y0 + 32y1 + 12y2 + 32y3 + 7y4) − (8h^7/945) f^(6)(ξ).

Now let us list some open quadrature rules:

n = 2: Rectangle (midpoint) rule:

∫_{x0}^{x2} f(x) dx = 2h y1 + (h^3/3) f''(ξ).

n = 3:

∫_{x0}^{x3} f(x) dx = (3h/2)(y1 + y2) + (3h^3/4) f''(ξ).

n = 4: Milne's rule:

∫_{x0}^{x4} f(x) dx = (4h/3)(2y1 − y2 + 2y3) + (28h^5/90) f^(4)(ξ).

Figure 7.5: The numerical integration: Midpoint rule.

7.2.2 The irreducible error

The definite integral is numerically computed as

∫_a^b f(x) dx = Σ_{i=0}^{n} A_i f(x_i) + R(f).                         (7.1)

Let us assume that every time we compute f(x_i) the committed error is ≤ ε. The estimate for the resulting error is then |D_n| ≤ ε Σ_{i=0}^{n} |A_i|. As

Σ_{i=0}^{n} A_i = Σ_{i=0}^{n} ∫_a^b L_{n,i}(x) dx = ∫_a^b ( Σ_{i=0}^{n} L_{n,i}(x) ) dx = ∫_a^b dx = b − a = nh,

in the case A_i ≥ 0 we get |D_n| ≤ ε(b − a), i.e. the irreducible error is small. Unfortunately, the closed rules for n > 8 contain some negative weights and Σ_{i=0}^{n} |A_i| can be very big (although Σ_{i=0}^{n} A_i = b − a = const.).

Increasing the degree of the interpolation polynomial to obtain better approxi-


mation for the integral (7.1), is not a good idea. It is better to use a composite
rule.

7.2.3 Composite rules

The definite integral can be written as a sum of subintegrals,

∫_{x0}^{xn} f(x) dx = Σ_{i=0}^{n−1} ∫_{x_i}^{x_{i+1}} f(x) dx,

and for every subintegral one may use some Newton–Cotes rule of a lower degree. Here are some basic composite rules:

• Composite Trapezoid rule:

∫_{x0}^{xn} f(x) dx = (h/2)(y0 + 2y1 + 2y2 + · · · + 2y_{n−1} + yn) − ( (xn − x0) h^2 / 12 ) f''(ξ),

obtained from

∫_{x0}^{xn} f(x) dx = Σ_{i=0}^{n−1} ∫_{x_i}^{x_{i+1}} f(x) dx = Σ_{i=0}^{n−1} ( (h/2)(y_i + y_{i+1}) − (h^3/12) f''(ξ_i) )
                    = (h/2)(y0 + 2y1 + 2y2 + · · · + 2y_{n−1} + yn) − Σ_{i=0}^{n−1} (h^3/12) f''(ξ_i)
                    = (h/2)(y0 + 2y1 + 2y2 + · · · + 2y_{n−1} + yn) − (n h^3/12) f''(ξ).

• Composite Simpson's rule (n must be even):

∫_{x0}^{xn} f(x) dx = (h/3)(y0 + 4y1 + 2y2 + 4y3 + · · · + 2y_{n−2} + 4y_{n−1} + yn) − ( (xn − x0) h^4 / 180 ) f^(4)(ξ).

• Composite Midpoint rule (n must be even):

∫_{x0}^{xn} f(x) dx = 2h(y1 + y3 + · · · + y_{n−1}) + ( (xn − x0) h^2 / 6 ) f''(ξ).
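
A short Python sketch of the composite Trapezoid and Simpson rules (ours, assuming NumPy):

import numpy as np

def composite_trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h / 2.0 * (y[0] + 2.0 * np.sum(y[1:-1]) + y[-1])

def composite_simpson(f, a, b, n):
    if n % 2:
        raise ValueError("n must be even")
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h / 3.0 * (y[0] + 4.0 * np.sum(y[1:-1:2]) + 2.0 * np.sum(y[2:-1:2]) + y[-1])

# the errors behave like O(h^2) and O(h^4), respectively
exact = 0.5346062                        # integral of ln(x) on [1, 2.2] (value from the notes)
for n in (2, 4, 8, 16):
    print(n, abs(composite_trapezoid(np.log, 1.0, 2.2, n) - exact),
             abs(composite_simpson(np.log, 1.0, 2.2, n) - exact))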

7.2.4 Romberg's method

Romberg's method is used to estimate the definite integral

∫_a^b f(x) dx

by applying Richardson extrapolation repeatedly to the Trapezoid rule or the Midpoint rule. The estimates generate a triangular array. Romberg's method evaluates the integrand at equally spaced points. The integrand must have continuous derivatives, though fairly good results may be obtained if only a few derivatives exist.
The method is named after Werner Romberg (1909–2003), who published the method in 1955.

Theorem 7.2.1 Let f be an infinitely many times continuously differentiable function. Then the Euler–Maclaurin summation formula

I(f) = ∫_a^b f(x) dx = T_h(f) − Σ_{k=1}^{∞} ( B_{2k} / (2k)! ) h^{2k} ( f^(2k−1)(b) − f^(2k−1)(a) )

holds, where T_h(f) denotes the composite Trapezoid rule with step h.

In the above theorem the B_k are the Bernoulli numbers, defined by the expansion

x / (e^x − 1) = Σ_{k=0}^{∞} ( B_k / k! ) x^k,   |x| < 2π.

All Bernoulli numbers are rational numbers and all Bernoulli numbers with odd index (except the index 1) are equal to 0. Here are some Bernoulli numbers:

B0 = 1,  B1 = −1/2,  B2 = 1/6,  B4 = −1/30,  B6 = 1/42, . . .

From here we obtain that

I(f) = T_h(f) − (h^2/12)( f'(b) − f'(a) ) + (h^4/720)( f'''(b) − f'''(a) ) + · · · ,

namely

I(f) = T_h(f) + a1 h^2 + a2 h^4 + a3 h^6 + · · · ,

where the a_i, i = 1, 2, . . . , do not depend on h. By halving h one obtains more accurate results when using the composite rules, and from the approximate values that correspond to different h one may estimate the error. The following holds:

I(f) = T_h(f)     + a_{1,0} h^2     + a_{2,0} h^4     + a_{3,0} h^6     + · · · ,
I(f) = T_{h/2}(f) + a_{1,0} (h/2)^2 + a_{2,0} (h/2)^4 + a_{3,0} (h/2)^6 + · · · ,
I(f) = T_{h/4}(f) + a_{1,0} (h/4)^2 + a_{2,0} (h/4)^4 + a_{3,0} (h/4)^6 + · · · .

Let us multiply the equation corresponding to h/2 by 4 and subtract from it the equation corresponding to h, in order to eliminate the h^2 term and thus obtain a better approximation:

I(f) = T_{h/2}^(1)(f) + a_{2,1} h^4     + a_{3,1} h^6     + · · · ,
I(f) = T_{h/4}^(1)(f) + a_{2,1} (h/2)^4 + a_{3,1} (h/2)^6 + · · · ,

where

T_{h/2}^(1)(f) = ( 4 T_{h/2}(f) − T_h(f) ) / 3,   T_{h/4}^(1)(f) = ( 4 T_{h/4}(f) − T_{h/2}(f) ) / 3.

We continue the procedure with

I(f) = T_{h/4}^(2)(f) + a_{3,2} h^6 + a_{4,2} h^8 + · · · ,

where

T_{h/4}^(2)(f) = ( 16 T_{h/4}^(1)(f) − T_{h/2}^(1)(f) ) / 15.

By this we generate the triangular array presented in Table 7.2.

error:   O(h^2)          O(h^4)          O(h^6)          O(h^8)

         T_h^(0)(f)
         T_{h/2}^(0)(f)  T_{h/2}^(1)(f)
         T_{h/4}^(0)(f)  T_{h/4}^(1)(f)  T_{h/4}^(2)(f)
         T_{h/8}^(0)(f)  T_{h/8}^(1)(f)  T_{h/8}^(2)(f)  T_{h/8}^(3)(f)

Table 7.2: The estimates of the definite integral obtained by Romberg's method.

The general formula is

T_{h/2^k}^(j)(f) = ( 4^j T_{h/2^k}^(j−1)(f) − T_{h/2^{k−1}}^(j−1)(f) ) / ( 4^j − 1 ).

Example: Using Romberg's method, compute the integral

∫_1^{2.2} ln x dx = 0.5346062.

Let the initial step be h = 0.6 and compute two halvings of the step. One obtains the following results:

T_h^(0)     = 0.6 ( (1/2) ln 1.0 + ln 1.6 + (1/2) ln 2.2 ) = 0.5185394,
T_{h/2}^(0) = (1/2) T_h^(0)     + 0.3 ( ln 1.3 + ln 1.9 ) = 0.5305351,
T_{h/4}^(0) = (1/2) T_{h/2}^(0) + 0.15 ( ln 1.15 + ln 1.45 + ln 1.75 + ln 2.05 ) = 0.5335847.

It is important that T_{h/2^k} is always computed as

T_{h/2^k} = (1/2) T_{h/2^{k−1}} + (h/2^k)( y1 + y3 + · · · + y_{2^k−1} ),

where the y_i are the function values on the grid with step h/2^k. In such a way each function value is computed only once, so using the Romberg extrapolation results in a negligible additional computation in comparison with computing T_{h/2^k}^(0), while the obtained result can be much more accurate.

Using the Romberg extrapolation we obtain:

T_{h/2}^(1) = ( 4 T_{h/2}^(0) − T_h^(0) ) / 3 = 0.5345337,
T_{h/4}^(1) = ( 4 T_{h/4}^(0) − T_{h/2}^(0) ) / 3 = 0.5346013,
T_{h/4}^(2) = ( 16 T_{h/4}^(1) − T_{h/2}^(1) ) / 15 = 0.5346058.
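
The Romberg table from this example can be generated with a short Python sketch (ours, assuming NumPy):

import numpy as np

def romberg(f, a, b, n0=2, levels=3):
    """Romberg triangular array; the first column are composite Trapezoid values."""
    h = (b - a) / n0
    x = a + h * np.arange(n0 + 1)
    T = [[h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))]]
    for k in range(1, levels):
        h /= 2.0
        new_pts = a + h * np.arange(1, n0 * 2**k, 2)   # only the newly added points
        row = [0.5 * T[k - 1][0] + h * np.sum(f(new_pts))]
        for j in range(1, k + 1):
            row.append((4**j * row[j - 1] - T[k - 1][j - 1]) / (4**j - 1))
        T.append(row)
    return T

for row in romberg(np.log, 1.0, 2.2):   # reproduces the table of the example above
    print(["%.7f" % v for v in row])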

7.2.5 Multiple integrals

Suppose that one would like to compute the integral of a function of two variables over a rectangular area Ω = [a, b] × [c, d],

∫∫_Ω f(x, y) dx dy = ∫_a^b dx ∫_c^d f(x, y) dy.

If we divide the interval [a, b] with the points x_i = a + ih, i = 0, 1, . . . , n, h = (b − a)/n, and the interval [c, d] with the points y_j = c + jk, j = 0, 1, . . . , m, k = (d − c)/m, and if we use the composite Trapezoid rule, we obtain

∫_a^b dx ∫_c^d f(x, y) dy = (hk/4) Σ_{i=0}^{n} Σ_{j=0}^{m} A_{ij} f(x_i, y_j) + O(h^2 + k^2),

where the coefficients A_{ij} satisfy A_{ij} = A_1(i) · A_2(j),

A_1(0) = A_1(n) = A_2(0) = A_2(m) = 1,
A_1(i) = A_2(j) = 2   for i ∈ {1, 2, . . . , n − 1} and j ∈ {1, 2, . . . , m − 1}.

Namely, we obtain the tensor product of the composite Trapezoid rule. Similarly, one may use the other composite rules.

Example: Suppose that we have a 5 × 4 grid, see Table 7.3. Then the coefficients of the composite Trapezoid rule generalized to the double integral are presented in Table 7.4.

(x0, y3) (x1, y3) (x2, y3) (x3, y3) (x4, y3)
(x0, y2) (x1, y2) (x2, y2) (x3, y2) (x4, y2)
(x0, y1) (x1, y1) (x2, y1) (x3, y1) (x4, y1)
(x0, y0) (x1, y0) (x2, y0) (x3, y0) (x4, y0)

Table 7.3: The 5 × 4 grid on the given rectangular area.

        1 2 2 2 1
hk/4 ·  2 4 4 4 2
        2 4 4 4 2
        1 2 2 2 1

Table 7.4: The 5 × 4 grid with the coefficients that correspond to the composite Trapezoid rule.

If one would use e.g. composite Simpson’s rule on 5 × 5 grid, the obtained
coefficients are presented in Table 7.5.

This approach can be generalized also to the multidimensional integrals. The


problem of this approach is that we require a lot of points for a good approxi-
mation. It is better to use the Monte Carlo method.

        1  4  2  4  1
        4 16  8 16  4
hk/9 ·  2  8  4  8  2
        4 16  8 16  4
        1  4  2  4  1

Table 7.5: The 5 × 5 grid with the coefficients that correspond to the composite Simpson's rule.

7.2.6 Monte Carlo method

Let the integration area be Ω = [0, 1]^d.

• Let us first assume d = 1:

  – The idea of the Monte Carlo method is that I(f) = ∫_0^1 f(x) dx is equal to the average value of the function f on the interval [0, 1].

  – If X is a random variable, uniformly distributed on [0, 1], then for the expectation the following holds:

    E(f(X)) = I(f).

    The approximation for the expectation is

    I_N(f) = (1/N) Σ_{i=1}^{N} f(X_i),

    where the X_i are independent random values on [0, 1].

  – For the error ε_N(f) = I(f) − I_N(f) we have ε_N ≈ σ(f) N^{−1/2}, where σ(f) = √D(f) is the standard deviation and

    D(f) = ∫_0^1 ( f(x) − I(f) )^2 dx

    is the variance of f.

• For arbitrary d: the X_i are random vectors from [0, 1]^d, where each component is a random number on [0, 1].
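
A tiny Python sketch of the Monte Carlo method (ours, assuming NumPy), together with the σ(f) N^{−1/2} error estimate:

import numpy as np

def monte_carlo(f, d, N, seed=0):
    """Estimate the integral of f over [0,1]^d and the statistical error sigma/sqrt(N)."""
    rng = np.random.default_rng(seed)
    X = rng.random((N, d))              # N independent uniform points in [0,1]^d
    values = f(X)
    return values.mean(), values.std() / np.sqrt(N)

# example: integral of exp(-(x1^2 + ... + xd^2)) over [0,1]^d
f = lambda X: np.exp(-np.sum(X**2, axis=1))
for N in (10**3, 10**4, 10**5):
    est, err = monte_carlo(f, d=3, N=N)
    print(N, est, "+/-", err)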

7.2.7 Introduction to Gaussian quadrature

The Gaussian quadrature rule is named after Carl Friedrich Gauss. The definite integral of a function,

∫_a^b f(x) ρ(x) dx,   ρ ≥ 0,

is approximated by a weighted sum of function values at specified points within the domain of integration,

∫_a^b f(x) ρ(x) dx = Σ_{i=0}^{n} A_i f(x_i) + R(f),

where the points x_i are not known in advance. The weights are determined by the points x_i, since

A_i = ∫_a^b L_{n,i}(x) ρ(x) dx,

and the rule is, for an arbitrary choice of the x_i, exact for all polynomials of degree ≤ n. With an appropriate choice of the x_i one may achieve that the rule is exact for all polynomials of degree ≤ 2n + 1.
In the background there are orthogonal polynomials, where the scalar product of functions f, g is defined as

<f, g> := ∫_a^b f(x) g(x) ρ(x) dx,   f ⊥ g  ⇔  <f, g> = 0.

It is important that all weights of the Gaussian quadrature are positive.

The most frequently used orthogonal polynomials are:

• Legendre polynomials: ρ(x) = 1, on the interval [−1, 1] (see Gauss–Legendre quadrature),
• Chebyshev polynomials (first kind): ρ(x) = 1/√(1 − x^2), on the interval [−1, 1] (see Chebyshev–Gauss quadrature),
• Chebyshev polynomials (second kind): ρ(x) = √(1 − x^2), on the interval [−1, 1] (see Chebyshev–Gauss quadrature),
• Jacobi polynomials: ρ(x) = (1 − x)^α (1 + x)^β, α, β > −1, on the interval [−1, 1] (see Gauss–Jacobi quadrature),
• Generalized Laguerre polynomials: ρ(x) = x^σ e^{−x}, σ > −1, on the interval [0, ∞) (see Gauss–Laguerre quadrature),
• Hermite polynomials: ρ(x) = e^{−x^2}, on the interval (−∞, ∞) (see Gauss–Hermite quadrature).

Example: The Gauss–Legendre quadrature rules on two and three points are

∫_{−1}^{1} f(x) dx = f(−1/√3) + f(1/√3) + (1/135) f^(4)(ξ),

∫_{−1}^{1} f(x) dx = (5/9) f(−√(3/5)) + (8/9) f(0) + (5/9) f(√(3/5)) + (1/15750) f^(6)(ξ).

Compare these two rules with the Trapezoid and Simpson's quadrature.
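
For the comparison suggested above, a small Python sketch (ours, assuming NumPy) applies the 2- and 3-point Gauss–Legendre rules, the Trapezoid rule and Simpson's rule to the same integral:

import numpy as np

f = np.exp
exact = np.e - np.exp(-1.0)                        # integral of e^x over [-1, 1]

trapezoid = 1.0 * (f(-1.0) + f(1.0))               # h = 2, (h/2)(y0 + y1)
simpson   = (1.0 / 3.0) * (f(-1.0) + 4.0 * f(0.0) + f(1.0))
gauss2    = f(-1.0 / np.sqrt(3.0)) + f(1.0 / np.sqrt(3.0))
gauss3    = (5.0 * f(-np.sqrt(0.6)) + 8.0 * f(0.0) + 5.0 * f(np.sqrt(0.6))) / 9.0

for name, val in [("trapezoid", trapezoid), ("Simpson", simpson),
                  ("Gauss 2-point", gauss2), ("Gauss 3-point", gauss3)]:
    print(name, val, "error", abs(val - exact))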
8 Bézier curves
Bézier curves were independently discovered by P. Bézier (1910–1999, Renault) and P. de Casteljau (Citroën). In the research community Bézier's work was the most visible, which is why the curves are named after him; the most important algorithm, however, is named after de Casteljau (although it was also discovered independently within Bézier's research group).
Bézier curves are important objects in Computer Aided Geometric Design, or shortly CAGD. They are used e.g. for designing letter shapes, car, ship and plane design, motion design, object animation (e.g. cartoons), etc.

8.1 Introduction

8.1.1 Affine transformations in R3

Most of the transformations used in CAGD objects are affine transformations.

Definition 8.1.1 The mapping Φ : R^3 → R^3 is affine if

Φ( Σ_{j=1}^{n} α_j a_j ) = Σ_{j=1}^{n} α_j Φ(a_j),   Σ_{j=1}^{n} α_j = 1,   a_j, Φ(a_j) ∈ R^3.

The interpretation: Let x = Σ_{j=1}^{n} α_j a_j. The point x is the affine combination of the points a_j with the weights α_j. The point x is mapped by Φ in such a way that first the points a_j are mapped and then the affine combination with the same weights α_j is formed.

Example: The middle point of the line segment a1a2 is mapped into the middle point of the mapped line segment Φ(a1)Φ(a2):

x = (1/2)(a1 + a2),   Φ(x) = (1/2)(Φ(a1) + Φ(a2)).

Example: The centroid, namely the geometric center of a plane figure or the
arithmetic mean position of all the points in the shape, of a discrete set of points
is mapped into a centroid of a discrete set of a mapped points.

Theorem 8.1.1 If we fix the coordinate system, then each affine mapping has the form

Φ(x) = Ax + v,   A ∈ R^{3×3}, v ∈ R^3.                                  (8.1)

Proof: Let us show that (8.1) is an affine mapping. Let x = Σ_{j=1}^{n} α_j a_j, Σ_{j=1}^{n} α_j = 1. Then

Φ( Σ_{j=1}^{n} α_j a_j ) = Φ(x) = A ( Σ_{j=1}^{n} α_j a_j ) + ( Σ_{j=1}^{n} α_j ) v
= Σ_{j=1}^{n} α_j (A a_j) + Σ_{j=1}^{n} α_j v = Σ_{j=1}^{n} α_j (A a_j + v) = Σ_{j=1}^{n} α_j Φ(a_j).

Examples:
• Identity: A = I, v = 0.
• Translation: A = I, v is the translation vector.
• Scaling: v = 0, A is a diagonal matrix.
• Rotation: v = 0, A is an orthogonal matrix (A^T A = I). For example, for the rotation around the z-axis,

A = [ cos α  −sin α  0 ]
    [ sin α   cos α  0 ]
    [ 0       0      1 ].

• Shear (i.e. the stress component coplanar with a material cross section that arises from the force vector component parallel to the cross section): v = 0, A is an upper triangular matrix with ones on the main diagonal.
• Rigid motion: A is an orthogonal matrix, v is an arbitrary vector.

Remark 8.1.1 Every affine mapping can be written as a composition of translations, rotations, scalings and shears.

8.1.2 Linear Interpolation

Consider the points a and b ∈ R3 . The set of points x ∈ R3 of the form


x = x(t) = (1 − t)a + t b,   t ∈ R,

is called the line through the points a and b. This is the linear interpolation of two points in space. At t = 0 we are at the point a, while at t = 1 we are at b. If we take the whole interval t ∈ [0, 1], we obtain the line segment from a to b.

Remark 8.1.2 The x(t) is the affine combination of the points a and b.

As t can be written as
t = (1 − t) · 0 + t · 1,
it follows that we have the affine mapping that maps the line segment in R ⊂ R3
between 0 and 1 into the line segment in R3 between a and b. So the linear
interpolation is actually the affine interpolation.
The linear interpolation will be the main building block of de Casteljau algo-
rithm.

8.1.3 Parametric curves

The parametric curves are very important objects in the CAGD.

Definition 8.1.2 The mapping


p = (p1 , p2 , p3 ) : I → R3 , I⊆R
is called the parametrization of the curve p(I). The curve is continuous, contin-
uously differentiable,. . . , if this also holds for the mapping p.

So the curve is an image of the mapping p (i.e. the set p(I)), where I is called
the domain of the parametrization.
In CAGD it is important what is the type of the parametric curve, since the
goal is usually the implementation of the algorithms in practice. Therefore we
mostly require polynomial or rational parametric curves. Hereafter we will con-
sider only polynomial parametric curves, i.e. curves whose components pi are
polynomials of degree ≤ n. Such curve is called polynomial curve of degree
n.

Example:

p : [0, 1] → R3 , p(t) = (t3 − 2t2 + 1, t2 − t, t3 + 1).

8.2 Bézier curves

8.2.1 Bernstein polynomials

The i-th Bernstein basis polynomial of degree n is defined by

B_i^n(t) = (n choose i) t^i (1 − t)^{n−i},   t ∈ [0, 1].

The Bernstein basis polynomials of degree four are presented in Figure 8.1.


Figure 8.1: The Bernstein basis polynomials of degree four.

Theorem 8.2.1 The Bernstein basis polynomials satisfy the following recurrence formula:

B_i^n(t) = (1 − t) B_i^{n−1}(t) + t B_{i−1}^{n−1}(t),

where

B_0^0(t) ≡ 1                                                            (8.2)

and

B_j^n(t) ≡ 0   for all j ∉ {0, 1, . . . , n}.

Proof: We use the Pascal rule for the binomial coefficients, (n choose i) = (n−1 choose i) + (n−1 choose i−1):

B_i^n(t) = (n choose i) t^i (1 − t)^{n−i}
         = (n−1 choose i) t^i (1 − t)^{n−i} + (n−1 choose i−1) t^i (1 − t)^{n−i}
         = (1 − t) B_i^{n−1}(t) + t B_{i−1}^{n−1}(t).

Theorem 8.2.2 The Bernstein basis polynomials form a partition of unity:

Σ_{j=0}^{n} B_j^n(t) ≡ 1,   B_j^n(t) ≥ 0 on [0, 1].

Proof: We use the binomial theorem:

1 = ( t + (1 − t) )^n = Σ_{j=0}^{n} (n choose j) t^j (1 − t)^{n−j} = Σ_{j=0}^{n} B_j^n(t).

The second property (nonnegativity) is obvious.

The Bernstein polynomials were first used in a constructive proof of the Weierstrass approximation theorem.

Definition 8.2.1 Let the points bi ∈ R3 , i = 0, 1, . . . , n, be given. The curve

b^n(t) = \sum_{j=0}^{n} b_j B_j^n(t)

is called the Bézier curve. The points bj are called Bézier control points and
the polygon that connects them is called Bézier control polygon.

Figure 8.2: Cubic Bézier curve with the corresponding control polygon.

The geometric construction of a point on a Bézier curve is given by the de Casteljau algorithm. This algorithm is probably one of the most fundamental ones in curve and surface design. Its geometric construction is intuitively clear and leads to a strong theory.

8.2.2 De Casteljau algorithm

Algorithm 11 de Casteljau algorithm


Input: Points b_0, b_1, . . . , b_n ∈ E^3 and t ∈ R.
Output: b^n(t) := b_0^n(t), the point on the corresponding Bézier curve at the given parameter t.
1: Define b_i^0(t) = b_i, i = 0, 1, . . . , n;
2: for r = 1, 2, . . . , n do
3:   for i = 0, 1, . . . , n − r do
4:     Compute b_i^r(t) = (1 − t) b_i^{r−1}(t) + t b_{i+1}^{r−1}(t);
5:   end for
6: end for

The geometric interpretation can be seen in Figure 8.3: the algorithm is a repetition of linear interpolations.

Usually, the coefficients b_i^r are written in a triangular array, called the de Casteljau scheme.

Figure 8.3: De Casteljau algorithm for the cubic curve at t = 0.7. The point on the curve is obtained by repeated linear interpolation.

For the cubic case the scheme reads:


b_0
b_1   b_0^1
b_2   b_1^1   b_0^2
b_3   b_2^1   b_1^2   b_0^3

When implementing the de Casteljau algorithm, O(n^2) space is required if the intermediate points are needed as well. If only the final value is needed, each new column can overwrite the previous one, so that only O(n) space is required.
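A minimal Python sketch of Algorithm 11 (assuming NumPy; the names are illustrative), using the O(n) in-place variant described above, in which each column of the scheme overwrites the previous one:

```python
import numpy as np

def de_casteljau(control_points, t):
    """Evaluate a Bezier curve at the parameter t with the de Casteljau algorithm.

    control_points: sequence of n + 1 control points b_0, ..., b_n.
    Uses the O(n) in-place variant: each column of the scheme overwrites
    the previous one, and b_0^n(t) is returned.
    """
    b = np.array(control_points, dtype=float)    # b_i^0 = b_i
    n = len(b) - 1
    for r in range(1, n + 1):
        # b_i^r = (1 - t) b_i^{r-1} + t b_{i+1}^{r-1},  i = 0, ..., n - r
        b[:n - r + 1] = (1 - t) * b[:n - r + 1] + t * b[1:n - r + 2]
    return b[0]

B = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]   # a cubic control polygon
print(de_casteljau(B, 0.7))                            # point b^3(0.7) on the curve
```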

Theorem 8.2.3 The intermediate de Casteljau points b_i^r can be expressed in terms of Bernstein polynomials of degree r,

b_i^r(t) = \sum_{j=0}^{r} b_{i+j} B_j^r(t), \qquad r = 0, 1, . . . , n, \quad i = 0, 1, . . . , n − r.    (8.3)

The most important case of (8.3) is r = n: the corresponding de Casteljau point lies on the curve,

b^n(t) = b_0^n(t) = \sum_{j=0}^{n} b_j B_j^n(t).

Proof: We use induction on r. For r = 0, (8.3) holds by definition. Assume that (8.3) holds for r − 1 and use the recurrence relation for b_i^r

given in Algorithm 11 and the recurrence relation that holds for the Bernstein
polynomials,

b_i^r(t) = (1 − t) b_i^{r−1}(t) + t\, b_{i+1}^{r−1}(t)
         = (1 − t) \sum_{j=i}^{i+r−1} b_j B_{j−i}^{r−1}(t) + t \sum_{j=i+1}^{i+r} b_j B_{j−i−1}^{r−1}(t).

Since B_k^{r−1}(t) = 0 for k ∉ {0, 1, . . . , r − 1} (the convention following (8.2)), both sums can be extended to run from i to i + r, so

b_i^r(t) = (1 − t) \sum_{j=i}^{i+r} b_j B_{j−i}^{r−1}(t) + t \sum_{j=i}^{i+r} b_j B_{j−i−1}^{r−1}(t)
         = \sum_{j=i}^{i+r} b_j \left( (1 − t) B_{j−i}^{r−1}(t) + t B_{j−i−1}^{r−1}(t) \right)
         = \sum_{j=i}^{i+r} b_j B_{j−i}^{r}(t)
         = \sum_{j=0}^{r} b_{i+j} B_j^{r}(t).

Remark 8.2.1 With the intermediate points b_i^{n−r} the Bézier curve can be written as

b^n(t) = \sum_{i=0}^{r} b_i^{n−r}(t)\, B_i^r(t).

The interpretation is the following: first we perform n − r steps of the de Casteljau algorithm at the parameter t, then we take the computed points b_i^{n−r}(t) as the control points of a Bézier curve of degree r and evaluate this curve at the same parameter t.

8.2.3 Properties of the Bézier curves

1. Affine invariance
Bézier curves are invariant under affine mappings, i.e. the following two processes lead to the same result:
a) we compute the point b^n(t) and then apply the affine mapping to it,

b) we apply the affine mapping to the control polygon and then compute the point on the curve defined by the mapped control polygon.
In some cases the second approach is computationally far less demanding than the first one. For example, if the affine transformation has to be applied to many points of the curve, approach b) is preferable, since it requires the affine transformation of only the few control points instead of all evaluated points (the number of de Casteljau iterations remains the same).
This property follows from the fact that affine mappings preserve barycentric combinations, together with the partition of unity property of the Bernstein basis polynomials.
2. Convex hull property

Definition 8.2.2 The set K is convex iff for all x, y ∈ K and all t ∈ [0, 1] also (1 − t)x + t y ∈ K.

Definition 8.2.3 The convex hull of the points a0 , a1 , . . . , an , is the small-


est convex set containing all these points.

Theorem 8.2.4 The Bézier curve bn lies in the convex hull of its control
polygon.

Proof: For t ∈ [0, 1] all Bernstein polynomials are nonnegative, and by Theorem 8.2.2 they sum up to 1. Therefore \sum_{i=0}^{n} b_i B_i^n(t) is a convex combination of the control points for each t ∈ [0, 1]. For t ∉ [0, 1] the convex hull property does not hold.

Corollary 8.2.1 If the control polygon is planar, also the corresponding


curve is planar.
3. Interpolation of the boundary points

Theorem 8.2.5 The Bézier curve interpolates the boundary points

b^n(0) = b_0 and b^n(1) = b_n.

Proof: This follows from the equations

B_i^n(0) = δ_{i,0}, \qquad B_i^n(1) = δ_{i,n}.



In practice this property is very useful, since it gives direct control over the boundary points of the curve.

4. Affine invariance for affine transformations of the parameter domain
Usually the parameter domain of the Bézier curve is the interval [0, 1]. For a Bézier curve over an arbitrary interval a ≤ u ≤ b we introduce the local parametrization t = (u − a)/(b − a) and proceed with the de Casteljau algorithm. Line 4 of Algorithm 11 then becomes

b_i^r(u) = \frac{b − u}{b − a}\, b_i^{r−1}(u) + \frac{u − a}{b − a}\, b_{i+1}^{r−1}(u).

As the mapping u ↦ t is affine, we say that Bézier curves are invariant with respect to affine transformations of the parameter.

This local parametrization is called linear reparametrization and is often used in algorithms for manipulating Bézier curves, where the properties of the curves are preserved.
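A small sketch (assuming NumPy; illustrative names) of the de Casteljau algorithm over an arbitrary interval [a, b], using the linear reparametrization t = (u − a)/(b − a):

```python
import numpy as np

def de_casteljau_on_interval(control_points, u, a, b):
    """Evaluate a Bezier curve defined over the parameter interval [a, b] at u,
    via the local parameter t = (u - a) / (b - a)."""
    pts = np.array(control_points, dtype=float)
    n = len(pts) - 1
    t = (u - a) / (b - a)                        # linear reparametrization
    for r in range(1, n + 1):
        pts[:n - r + 1] = (1 - t) * pts[:n - r + 1] + t * pts[1:n - r + 2]
    return pts[0]

B = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
# The curve over [2, 5] evaluated at u = 4.1 equals the curve over [0, 1] at t = 0.7.
print(de_casteljau_on_interval(B, 4.1, 2.0, 5.0))
```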

5. Symmetry
The Bézier curves that correspond to the two sequences of control points b_0, b_1, . . . , b_n and b_n, b_{n−1}, . . . , b_0 are the same, the only difference is in the parametrization direction:

\sum_{j=0}^{n} b_j B_j^n(t) = \sum_{j=0}^{n} b_{n−j} B_j^n(1 − t).

Proof: This follows from the symmetry B_j^n(t) = B_{n−j}^n(1 − t) of the Bernstein basis polynomials,

\sum_{j=0}^{n} b_j B_j^n(t) = \sum_{j=0}^{n} b_j B_{n−j}^n(1 − t) = \sum_{i=0}^{n} b_{n−i} B_i^n(1 − t) = \sum_{j=0}^{n} b_{n−j} B_j^n(1 − t).
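A quick numerical check of the symmetry property (a sketch using only the standard library; names are illustrative), evaluating the curve in Bernstein form for the original and the reversed control polygon:

```python
import math

def bezier_point(ctrl, t):
    """Evaluate a Bezier curve in Bernstein form, sum_j b_j B_j^n(t)."""
    n = len(ctrl) - 1
    bern = [math.comb(n, j) * t**j * (1 - t)**(n - j) for j in range(n + 1)]
    return tuple(sum(w * p[k] for w, p in zip(bern, ctrl)) for k in range(len(ctrl[0])))

B = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
t = 0.3
print(bezier_point(B, t))              # the same point ...
print(bezier_point(B[::-1], 1 - t))    # ... obtained from the reversed control polygon
```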

6. Line preserving
If the control points are b_j = j/n, j = 0, 1, . . . , n, then the Bézier curve is a line. The following equality holds:

\sum_{j=0}^{n} \frac{j}{n} B_j^n(t) = t.

Proof: The Bernstein polynomials B_j^n can be expressed in terms of the B_k^{n−1}, and the latter form a partition of unity:

\sum_{j=0}^{n} \frac{j}{n} B_j^n(t) = \sum_{j=0}^{n} \frac{j}{n} \binom{n}{j} t^j (1 − t)^{n−j}
                               = t \sum_{j=1}^{n} \binom{n−1}{j−1} t^{j−1} (1 − t)^{(n−1)−(j−1)} = t,

where we used \frac{j}{n} \binom{n}{j} = \binom{n−1}{j−1} (the j = 0 term vanishes).

7. Pseudo-local control
Let us find the extrema of the Bernstein basis polynomial B_i^n. Its derivative is

\frac{d}{dt} B_i^n(t) = \binom{n}{i} \left( i t^{i−1} (1 − t)^{n−i} − (n − i) t^{i} (1 − t)^{n−i−1} \right)
                      = \binom{n}{i} t^{i−1} (1 − t)^{n−i−1} (i − nt).

So the solutions of the equation \frac{d}{dt} B_i^n(t) = 0 are t_1 = 0, t_2 = 1 and t_3 = i/n. As the Bernstein polynomials are nonnegative on [0, 1], the maximum of B_i^n on [0, 1] is attained at t = i/n. The practical importance of this fact is the following: if one of the control points b_i is changed, the change of the curve is largest in the neighborhood of the point corresponding to t = i/n, although the change affects the whole curve.

8.2.4 Derivative of the Bézier curve

Theorem 8.2.6 The derivatives of the Bézier curve can be expressed as

\frac{d^r}{dt^r} b^n(t) = n(n − 1) \cdots (n − r + 1) \sum_{j=0}^{n−r} \Delta^r b_j\, B_j^{n−r}(t), \qquad r = 1, 2, . . . , n,

where
\Delta b_j := b_{j+1} − b_j and \Delta^r b_j = \Delta(\Delta^{r−1} b_j).

Remark 8.2.2 We again obtain a Bézier curve.



Proof: Let us prove this by induction on r.

r = 1:

\frac{d}{dt} b^n(t) = \sum_{j=0}^{n} b_j \frac{d}{dt} B_j^n(t) = \sum_{j=0}^{n} b_j\, n \left( B_{j−1}^{n−1}(t) − B_j^{n−1}(t) \right)
                    = n \left( \sum_{j=1}^{n} b_j B_{j−1}^{n−1}(t) − \sum_{j=0}^{n−1} b_j B_j^{n−1}(t) \right) = n \sum_{j=0}^{n−1} \Delta b_j\, B_j^{n−1}(t).

r → r + 1:

\frac{d^{r+1}}{dt^{r+1}} b^n(t) = \frac{d}{dt} \left( \frac{d^r}{dt^r} b^n(t) \right)
    = n(n − 1) \cdots (n − r + 1) \frac{d}{dt} \sum_{j=0}^{n−r} \Delta^r b_j\, B_j^{n−r}(t)
    = n(n − 1) \cdots (n − r + 1) \sum_{j=0}^{n−r} \Delta^r b_j\, \frac{d}{dt} B_j^{n−r}(t)
    = n(n − 1) \cdots (n − r) \sum_{j=0}^{n−r} \Delta^r b_j \left( B_{j−1}^{n−r−1}(t) − B_j^{n−r−1}(t) \right)
    = n(n − 1) \cdots (n − r) \sum_{j=0}^{n−r−1} \Delta^{r+1} b_j\, B_j^{n−r−1}(t).
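Theorem 8.2.6 says that the r-th derivative is again a Bézier curve, of degree n − r, with control points n(n − 1) · · · (n − r + 1) Δ^r b_j. A small sketch (assuming NumPy; names are illustrative) that computes these control points with forward differences:

```python
import numpy as np

def bezier_derivative_control_points(control_points, r=1):
    """Control points of the r-th derivative of a Bezier curve.

    By Theorem 8.2.6 the r-th derivative is a Bezier curve of degree n - r
    with control points n (n - 1) ... (n - r + 1) * Delta^r b_j.
    """
    b = np.array(control_points, dtype=float)
    n = len(b) - 1
    factor = 1.0
    for k in range(r):
        b = np.diff(b, axis=0)        # forward differences: Delta b_j = b_{j+1} - b_j
        factor *= (n - k)             # accumulates n (n - 1) ... (n - r + 1)
    return factor * b

B = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
D1 = bezier_derivative_control_points(B, r=1)
print(D1[0], D1[-1])   # tangent directions at t = 0 and t = 1: n(b_1 - b_0), n(b_n - b_{n-1})
```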

8.2.5 Subdivision

Until now, the Bézier curve was defined on the interval [0, 1]. Sometimes one would like to express just one part of the Bézier curve (e.g. the part on the interval [0, c], c < 1) as an independent Bézier curve. Finding the control points of this curve is called subdivision.
Let the control points b_j, j = 0, 1, . . . , n, correspond to the basic curve, while c_j, j = 0, 1, . . . , n, correspond to the curve we are looking for. We require that
c_0 = b_0, \qquad c_n = b^n(c),
and that the part of the curve b^n on the interval [0, c] is exactly the curve c^n.
One can show the following theorem.

Figure 8.4: An example of subdivision at t = 1/2.

Theorem 8.2.7 c_i = b_0^i(c), i = 0, 1, . . . , n.

So one can read the control points of the curve c^n from the de Casteljau scheme for computing b^n(c) (they are the diagonal elements of the scheme). Because of the symmetry, the control points of the Bézier curve on the interval [c, 1] are b_i^{n−i}(c), i = 0, 1, . . . , n (the bottom row of the scheme, read from right to left).

The subdivision can be repeated: e.g. first we perform it at c = 1/2, then on both halves, and so on. After k steps we obtain 2^k control polygons (of small arcs), which together describe the whole basic Bézier curve.
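A minimal sketch of subdivision at a parameter c (assuming NumPy; names are illustrative). It runs the de Casteljau algorithm once and collects the diagonal b_0^i(c) as the control points of the left piece and the bottom row b_i^{n−i}(c) as the control points of the right piece:

```python
import numpy as np

def subdivide(control_points, c):
    """Split a Bezier curve at the parameter c into two control polygons.

    Returns (left, right): the control points of the curve restricted to [0, c]
    and to [c, 1], read off the de Casteljau scheme (the diagonal b_0^i(c) and
    the bottom row b_i^{n-i}(c), respectively).
    """
    b = np.array(control_points, dtype=float)
    n = len(b) - 1
    left, right = [b[0].copy()], [b[n].copy()]
    for r in range(1, n + 1):
        b[:n - r + 1] = (1 - c) * b[:n - r + 1] + c * b[1:n - r + 2]
        left.append(b[0].copy())          # b_0^r(c)
        right.append(b[n - r].copy())     # b_{n-r}^r(c)
    return np.array(left), np.array(right)[::-1]

B = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
L, R = subdivide(B, 0.5)
print(L[-1], R[0])   # both equal the point b^n(1/2) on the curve
```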

Example: Suppose we would like to find the intersections of a given Bézier curve with a given line. We can use min-max boxes (axis-aligned bounding boxes of the control points). If the box and the line do not intersect, then there is no intersection. If they do intersect, we subdivide the curve at t = 1/2, consider the min-max boxes of both halves, check whether either of them intersects the given line, and continue this process. Since the boxes shrink quickly, after a few steps we take the center of a box that still intersects the line as an approximation of an intersection point.
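A sketch of this intersection procedure (assuming NumPy and reusing the subdivide function from the previous sketch; the line is taken horizontal, Y = const, and all names are illustrative):

```python
import numpy as np

def intersections_with_line_y(b, y, tol=1e-6, depth=40):
    """Approximate the intersections of a planar Bezier curve with the line Y = y,
    using min-max boxes of the control points and repeated subdivision at t = 1/2.
    Relies on the subdivide() function from the previous sketch."""
    b = np.asarray(b, dtype=float)
    lo, hi = b.min(axis=0), b.max(axis=0)
    if not (lo[1] <= y <= hi[1]):
        return []                          # the min-max box misses the line
    if depth == 0 or (hi - lo).max() < tol:
        return [b.mean(axis=0)]            # the box is tiny: take its center
    left, right = subdivide(b, 0.5)
    return (intersections_with_line_y(left, y, tol, depth - 1)
            + intersections_with_line_y(right, y, tol, depth - 1))

B = [(0.0, -6.0), (2.0, 4.0), (4.0, -6.0), (5.0, 4.0)]
for pt in intersections_with_line_y(B, 1.0):
    print(pt)   # in practice, nearby hits should be merged into one intersection
```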

Now, we can prove the following property of the Bézier curve.

Theorem 8.2.8 (The number of intersections) Each line in R^2 (or a plane in R^3) intersects the Bézier curve in at most as many points as its control polygon.

Proof: We will prove the theorem for the planar curve.



Figure 8.5: An example of using min-max boxes for searching intersection of a


Bézier curve and the line y = 1.

(i) Take an arbitrary curve in the plane and choose two points on it. If the part of the curve between these two points is replaced by the line segment joining them, we obtain a curve that intersects an arbitrary line at most as many times as the initial curve.

(ii) Now consider the control polygon P of the Bézier curve b^n. We subdivide at t = 1/2 and obtain two new control polygons that together compose one piecewise linear curve L_1. Each knot of L_1 is a control point of one of the two control polygons, hence one of the points in the de Casteljau scheme for t = 1/2 (from the diagonal or the bottom row). As the de Casteljau algorithm is exactly a repetition of the procedure in (i), the piecewise linear curve L_1 intersects an arbitrary line at most as often as the control polygon P.

(iii) We repeat the subdivision. At step k we obtain a piecewise linear curve L_k composed of 2^k control polygons, which intersects an arbitrary line at most as often as the control polygon P.

(iv) Using some analysis and the fact that lim_{k→∞} L_k = b^n, we prove the theorem.

8.3 Other interesting topics for further studying
Other interesting topics for further study: degree raising of a Bézier curve, splines composed of Bézier curves (C^1 splines, G^1 splines, ...), rational Bézier curves, Bézier surfaces, ...
Bézier curves provide a simple description of a curve with little data (the control points), simple shape design and many properties that are useful for computer representation. Their generalization, rational Bézier curves, enables even better interactive design. For more complex shapes one can use composite Bézier curves built from pieces of lower degree, which brings greater numerical stability. This, besides the simple geometric interpretation, is one of the main advantages of the Bézier representation of a curve in comparison with the monomial representation.
B-spline curves are an extension of Bézier curves that allows better local control over the curve and simple conditions for continuous differentiability. However, they also bring a more involved implementation and greater computational complexity. Nowadays, besides Bézier curves, non-uniform rational basis splines (NURBS) are used as a standard. They are based on the idea of rational curves, but use B-splines instead of the Bernstein polynomials used in Bézier curves.
From the intuitive definition of a surface as the trace of a curve that moves along another curve, the construction of a surface as a tensor product of two given curves was derived. In this way de Casteljau and Bézier developed tensor product Bézier surfaces, where both curves are Bézier curves. This approach was widely used in the car industry.
Nowadays, Bézier curves and surfaces are used as the basis for the computer implementation of curves and surfaces in the fields of geometric and industrial design, scientific research and modeling, and also in the entertainment industry (movies, cartoons, 3D computer games, ...).

Figure 8.6: An example of using Bézier surfaces to present the dome.

Figure 8.7: An example of using polynomial and rational Bézier surfaces to construct the plane model.

Figure 8.8: Geri's game, 1997 computer animated short film made by Pixar. The film won an Academy Award for Best Animated Short Film in 1998. Left: The control mesh of Geri's head. Right: Geri.
List of Figures

1.1 The representable normalized (positive) numbers from the set


P (2, 3, −1, 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 An example of insensitive (good conditioned) system (left) and
sensitive (poorly conditioned) system (right). . . . . . . . . . . 13

2.1 The illustration of the tangent method. . . . . . . . . . . . . . . 24


2.2 The graph of a polynomial p(x) = (x − 1/2)(x − 1)(x − 2)^2. . . . . 25
2.3 The divergence because the initial approximation was not enough
close to the zero (left) and the repetition of the approximations
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 The illustration of the secant method. . . . . . . . . . . . . . . . 26

3.1 The graph of ρ(ω) that corresponds to the matrix (3.4). . . . . . 51

6.1 The graph of the interpolation polynomial (red) that interpolates the function 1/(1 + x^2) on the interval [−5, 5] at the given equidistant points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 The graph of the interpolation polynomial (red) that interpolates the function 1/(1 + x^2) on the interval [−5, 5] at the given Chebyshev points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

7.1 The optimal step h can be estimated from the graph of the esti-
mation for the whole error. . . . . . . . . . . . . . . . . . . . . 86
7.2 The numerical integration. . . . . . . . . . . . . . . . . . . . . 87
7.3 The numerical integration: Trapezoid rule. . . . . . . . . . . . . 88
7.4 The numerical integration: Simpson’s rule. . . . . . . . . . . . . 89
7.5 The numerical integration: Midpoint rule. . . . . . . . . . . . . 90

8.1 The Bernstein basis polynomials of degree four. . . . . . . . . . 102


8.2 Cubic Bézier curve with the corresponding control polygon. . . 104
8.3 De Casteljau algorithm for the cubic curve at t = 0.7. The point on the curve is obtained with the repetition of linear interpolations. . 105
8.4 An example of subdivision at t = 1/2. . . . . . . . . . . . . . . . . 111

8.5 An example of using min-max boxes for searching intersection


of a Bézier curve and the line y = 1. . . . . . . . . . . . . . . . 112
8.6 An example of using Bézier surfaces to present the dome. . . . . 114
8.7 An example of using polynomial and rational Bézier surfaces to
construct the plane model. . . . . . . . . . . . . . . . . . . . . 114
8.8 Geri’s game, 1997 computer animated short film made by Pixar.
The film won an Academy Award for Best Animated Short Film
in 1998. Left: The control mesh of Geri's head. Right: Geri. . . . . 114
List of Tables
1.1 The formula (1.1) for computing π fails. . . . . . . . . . . . . . 16
1.2 The formula (1.2) for computing π works. . . . . . . . . . . . . 16

2.1 The number of steps needed to get two consecutive approximations that differ by less than 10^{−12}. . . . . . . . . . . . . . . . . . . . 23

7.1 Numerical values of the derivatives f'(0) and f''(0) for the function f(x) = e^x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 The estimates of the definite integral obtained by the Romberg’s
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.3 The 5 × 4 grid on the given rectangular area. . . . . . . . . . . . 95
7.4 The 5 × 4 grid with the coefficients that correspond to the com-
posite Trapezoid rule. . . . . . . . . . . . . . . . . . . . . . . . 95
7.5 The 5 × 5 grid with the coefficients that correspond to the com-
posite Simpson’s rule. . . . . . . . . . . . . . . . . . . . . . . . 96
