Lessons in Scientific Computing
Numerical Mathematics, Computer Technology, and Scientific Discovery
By
Norbert Schörghofer
MATLAB is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks
does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion
of MATLAB software or related products does not constitute endorsement or sponsorship by The
MathWorks of a particular pedagogical approach or particular use of the MATLAB software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2019 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20180411
International Standard Book Number-13: 978-1-138-07063-9 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
To the Instructor
Behind the Scenes: Design of a Modern and Integrated Course
Bibliography
Index
Preface
from multiple disciplines, including physics, chemistry, astronomy, and for all
those who have passed through standard core quantitative classes in their
undergraduate years. Sections with a star symbol (∗) contain specialized or
advanced material. In the text, a triple >>> indicates a Python prompt, and a
single > a Unix or other type of prompt. Italics indicate emphasis or paragraph
headings. Truetype is used for program commands, program variables, and
binary numbers. For better readability, references and footnotes within the
text are omitted. Brainteasers at the end of several of the chapters should
stimulate discussion.
I am indebted to my former teachers, colleagues, and to many textbook
authors who have, knowingly or unknowingly, contributed to the substance
of this book. They include Professor Leo Kadanoff, who taught me why we
compute, Professor James Sethian, on whose lectures chapter 15 is based,
and the authors of “Numerical Recipes,” whose monumental and legendary
book impacted my student years. Several friends and colleagues have provided
valuable feedback on draft versions of this material: Mike Gowanlock, Geoff
Mathews, and Tong Zhou, and I am indebted to them.
Norbert Schörghofer
Honolulu, Hawaii
March 2018
To the Instructor
This book offers a modernized, broad, and compact introduction to scientific
computing. It is appropriate as a main textbook or as supplementary
reading in upper-level college and in graduate courses on scientific computing
or discipline-specific courses on computational physics, astrophysics, chem-
istry, engineering, and other areas where students are prepared through core
quantitative classes. Prerequisites are basic calculus, linear algebra, and in-
troductory physics. Occasionally, a Taylor expansion, an error propagation,
a Fourier transform, a matrix determinant, or a simple differential equation
appears. It is recommended that an undergraduate course omit the last two
chapters and some optional sections, whereas a graduate course would cover
all chapters and at a faster pace.
Prior knowledge of numerical analysis and a programming language is
optional, although students will have to pick up programming skills during the
course. In this day and age, it is reasonable to expect that every student
of science or engineering is, or will be, familiar with at least one programming
language. It is easy to learn programming by example, and simple examples
of code are shown from several languages, but the student needs to be able
to write code in only one language. Some books use pseudocode to display
programs, but fragments of C or Python code are as easy to understand.
Supplementary material is available at https://github.com/nschorgh/
CompSciBook/. This online material is not necessary for the instructor or the
student, but may be useful nevertheless.
The lectures can involve interactive exercises with a computer’s display
projected in class. These interactive exercises are incorporated in the text,
and the corresponding files are included in the online repository.
The book offers a grand tour through the world of scientific computing.
By the end, the student has been exposed to a wide range of methods, funda-
mental concepts, and practical material. Even some graduate-level science is
introduced in optional sections, such as almost integrable systems, diagram-
matic perturbation expansions, and electronic density functionals.
Among the homework problems, three fit with the presentation so well that
it is recommended to assign them in every version of the course. The “invisible
roots” of Exercise 2.1 could be demonstrated in class, but the surprise may be
bigger when the student discovers them. For Exercise 3.1, every member of the
class, using different computing platforms and programming languages, should
be able to exactly reproduce a highly roundoff-sensitive result. In Exercise 5.1,
the students themselves can answer the qualitative question whether a kicked
rotator can accumulate energy indefinitely. By fitting the asymptotic depen-
dence of energy with time, some will likely use polynomials of high degree, a
pitfall that will be addressed in the subsequent chapter.
If instructors would send me the result of the survey of computing back-
ground, posted as an exercise at the end of the first chapter, I could build a
more representative data set.
Behind the Scenes:
Design of a Modern and
Integrated Course
Can it be true that the iteration does not settle to a constant or into a
periodic pattern, or is this an artifact of numerical inaccuracies? Consider the
simple iteration
yn+1 = 1 − |2yn − 1|
known as “tent map.” For yn ≤ 1/2 the value is doubled, yn+1 = 2yn , and for
yn ≥ 1/2 it is subtracted from 1 and then doubled, yn+1 = 2(1 − yn ). The
equation can equivalently be written as
yn+1 = 2yn for 0 ≤ yn ≤ 1/2
yn+1 = 2(1 − yn) for 1/2 ≤ yn ≤ 1
Figure 1.2 illustrates this function, along with the iteration xn+1 = 4xn (1 −
xn ), used above. The latter is known as “logistic map.”
The behavior of the tent map is particularly easy to understand when yn
is represented in the binary number system. As for integers, floating-point
numbers can be cast in binary representation. For example, the integer binary
number 10 is 1 × 2^1 + 0 × 2^0 = 2. The binary floating-point number 0.1 is
1 × 2^−1 = 0.5 in the decimal system, and binary 0.101 is 1 × 2^−1 + 0 × 2^−2 + 1 × 2^−3 =
0.625.
For binary numbers, multiplication by two corresponds to a shift by one
digit, just as multiplication by 10 shifts any decimal number by one digit. The
tent map changes 0.011 to 0.110. When a binary sequence is subtracted from
1, zeros and ones are simply interchanged,
as can be seen from the following
argument. Given the identity Σ_{j=1}^{∞} 2^−j = 1, the number 1 is equivalent to 0.11111...
[Figure 1.2: the logistic map and the tent map, plotted as xn+1 versus xn.]
with infinitely many 1s. From that it is apparent that upon subtraction from
1, zeros and ones are swapped. For example, 1-0.01 is
0.1111111...
− 0.0100000
0.1011111... (digits of 0.0100000 swapped)
where the subtraction was done as in elementary school arithmetic. As a last
step, the identity 0.0011111... = 0.01 could be used again. So in total,
1-0.01 = 0.11, which is true, because 1 − 1/4 = 1/2 + 1/4.
Armed with this knowledge, it is easy to see how the tent map trans-
forms numbers. The iteration goes from 0.011001... (which is < 1/2) to
0.11001... (> 1/2) to 0.0110.... After each iteration, numbers advance by
one digit, because of multiplication by 2 for either half of the tent map. After
many iterations the digits from far behind dominate the result. Hence, the
leading digits take on values unrelated to the leading digits of earlier itera-
tions, making the behavior of the sequence apparently random.
This demonstrates that a simple iteration can produce effectively random
behavior. We conclude that it is plausible that a simple iteration never settles
into a periodic behavior. The tent map applied to a binary floating-point
number reveals something mathematically profound: deterministic equations
can exhibit random behavior, a phenomenon known as “deterministic chaos.”
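For concreteness, here is one way to observe this in a short Python sketch (a minimal illustration; the starting value 0.3 and the offset 10^−12 are arbitrary choices). After about forty iterations, each of which doubles the initial separation, the two trajectories are unrelated, just as the digit-shifting argument predicts.

def tent(y):
    # tent map: y -> 1 - |2y - 1|
    return 1.0 - abs(2.0*y - 1.0)

y1, y2 = 0.3, 0.3 + 1e-12    # two nearby initial values
for n in range(40):
    y1, y2 = tent(y1), tent(y2)
print(y1, y2)                # the leading digits no longer agree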
The substitution xn = sin^2(πyn) transforms the tent map into the logistic
map. This follows from trigonometry:
xn+1 = 4xn(1 − xn)
sin^2(πyn+1) = 4 sin^2(πyn)(1 − sin^2(πyn)) = 4 sin^2(πyn) cos^2(πyn) = sin^2(2πyn)
FIGURE 1.4 Asymptotic behavior for the iteration xn+1 = sin(rxn ) with
varying parameter r. This iteration also exhibits period doubling and
chaotic behavior. Compare with Figure 1.3(a).
same constant for many iterative equations. It is known as the first Feigen-
baum constant. The second Feigenbaum constant asymptotically describes
the shrinking of the width of the periodic cycles. If dn is the distance be-
tween adjacent x-values of the cycle of period 2^n, then dn/dn+1 converges to
2.5029 . . . .)
These properties of a large class of iterative equations turn out to be
difficult to prove, and, case in point, they were not predicted. Once aware of
the period doubling behavior and its ubiquity, one can set out to understand,
and eventually formally prove, why period doubling often occurs in iterative
equations. Indeed, Feigenbaum universality was eventually understood in a
profound way, but only after it was discovered numerically. This is a historical
example where numerical calculations led to an important insight. It opens
our eyes to the possibility that numerical computations can be used not only to
obtain mundane numbers and solutions, but also to make original discoveries
and gain profound insight.
(And to elevate the role numerics played in this discovery even further,
even the mathematical proof of the universality of period doubling was initially
computer-assisted; it involved an inequality that was shown to hold only by
numerical evaluation.)
methods from those that do not. As an exercise, can you judge which of the
following can be obtained analytically, in closed form?
(i) All of the roots of x^5 − 7x^4 + 2x^3 + 2x^2 − 7x + 1 = 0
(ii) The integral ∫ x^2/(2 + x^7) dx
(iii) The sum Σ_{k=1}^{N} k^4
(iv) The solution to the difference equation 3yn+1 = 7yn − 2yn−1
(v) exp(A), where A is a 2 × 2 matrix, A = ((2, −1), (0, 2)), and the expo-
nential of the matrix is defined by its power series.
EXERCISES
1.1 Survey of computing background
vi or vim editor
Microsoft Excel, Google Sheets, or similar spreadsheet software
IPython
Gnuplot
awk
Perl
Other (specify):
CHAPTER 2
A Few Concepts from Numerical Analysis
procedure produces the numbers shown in the second column of Table 2.2.
The sequence quickly approaches a constant.
Newton’s method may be extremely fast to converge, but it can easily
fail to find a root. With x0 = 2 instead of x0 = 1 the iteration diverges, as
apparent from the last column of Table 2.2.
Robustness may be preferable to speed. Is there a method that is certain
to find a root? A simple and robust method is bisection, which follows the
“divide-and-conquer” strategy. Suppose we start with two x-values where the
function f (x) has opposite signs. Any continuous function must have a root
between these two values. We then evaluate the function halfway between the
two endpoints and check whether it is positive or negative there. This restricts
the root to that half of the interval on whose ends the function has opposite
signs. Table 2.3 shows an example. With the bisection method the accuracy
only doubles at each step, but the root is found for certain.
There are more methods for finding roots than the two just mentioned,
Newton’s and bisection. Each method has its advantages and disadvantages.
Bisection is most general but is awfully slow. Newton’s method is less general
but much faster. Such a trade-off between generality and efficiency is often
inevitable. This is so because efficiency is often achieved by exploiting a spe-
cific property of a system. For example, Newton’s method makes use of the
differentiability of the function, whereas the bisection method does not and
works equally well for functions that cannot be differentiated.
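As a minimal sketch of the bisection method just described (the function x^3 − 2 and the bracketing interval [1, 2] are arbitrary choices for illustration):

def bisect(f, a, b, tol=1e-12):
    # requires a bracketing interval: f(a) and f(b) must have opposite signs
    fa = f(a)
    while b - a > tol:
        m = 0.5*(a + b)
        if f(m)*fa <= 0:      # sign change in [a, m]
            b = m
        else:                 # sign change in [m, b]
            a, fa = m, f(m)
    return 0.5*(a + b)

print(bisect(lambda x: x**3 - 2.0, 1.0, 2.0))   # approximately 1.25992, i.e. 2**(1/3)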
The bisection method is guaranteed to succeed only if it brackets a root to
begin with. There is no general method to find appropriate starting values, nor
do we generally know how many roots there are. For example, a function can
reach zero without changing sign; our criterion for bracketing a root does not
work in this case. Moreover, even a continuous function can in any interval
drop rapidly, cross zero, and then increase again, making it impossible to
exclude the existence of roots (Figure 2.1). Exercise 2.1 will illustrate that
with a dramatic example. Unless additional information is known about the
properties of the function, a search would have to explore each arbitrarily small
interval to make sure it finds all and any roots. (If the nonlinear equation is a
polynomial, then a great deal can be deduced about its roots analytically. For
example, it is easy to find out how many roots it has. Specialized root-finding
methods are available for polynomials.)
[Figure 2.1: a continuous function f(x) that dips below zero only within a narrow interval.]
The problem becomes even more severe for finding roots in more than
one variable, say under the simultaneous conditions g(x, y) = 0, f (x, y) = 0.
Figure 2.2 illustrates the situation. Even a continuous function could dip below
zero over only a small domain, and a root searching procedure may have
difficulty finding it. Not only is the space of possibilities large, but the bisection
method cannot be extended to several variables.
The situation is exacerbated with more variables, as the space of possibil-
ities is vast. The behavior of functions in ten dimensions can be difficult to
trace indeed. The conclusion is clear and cold: there is no method that is guar-
anteed to find all roots. This is not a deficiency of the numerical methods, but
it is the intrinsic nature of the problem. Unless a good, educated initial guess
can be made or enough can be deduced about the solutions analytically, find-
ing roots in more than a few variables may be fundamentally and practically
impossible. This is in stark contrast to solving a system of linear equations,
which is an easy numerical problem. Ten linear equations in ten variables can
be solved numerically with great ease, except perhaps for the most pathologi-
cal of coefficients, whereas numerical solution of ten nonlinear equations may
remain inconclusive even after a significant computational effort. (Something
analogous can be said about a purely analytical approach: a linear system can
be solved analytically, whereas a system of nonlinear equations most often can
not.)
FIGURE 2.2 Roots of two functions f, g in two variables (x, y). The roots
are where the contours intersect.
Root finding can be a numerically difficult problem, because there is no
method that always succeeds.
x−y+z = 1
−x + 3y + z = 1
y+z = 2
Suppose there is an error ǫ in one of the coefficients such that the last equation
becomes (1 + ǫ)y + z = 2. The solution to these equations is easily worked
out as x = 2/ǫ, y = 1/ǫ, and z = 1 − 1/ǫ. Hence, the result is extremely
sensitive to the error ǫ. The reason is that for ǫ = 0 the system of equations
is linearly dependent: the sum of the left-hand sides of the first two equations
is twice that of the third equation. The right-hand side does not follow the
same superposition. Consequently the unperturbed equations (ǫ = 0) have no
solution. The situation can be visualized geometrically (Figure 2.3). Each of
the equations describes an infinite plane in a three-dimensional space (x, y, z)
and the point at which they intersect represents the solution. None of the
planes are parallel to each other, but their line intersections are. For ǫ = 0,
the three planes do not intersect at a common point, but tilting a plane
slightly (small ǫ) would lead to an intersection far out. This is a property of
the problem itself, not the method used to solve it. No matter what method
is utilized to determine the solution, the uncertainty in the input data will
lead to an uncertainty in the output data. If a linear system of equations is
linearly dependent, it is an ill-conditioned problem.
The extent to which the outcome depends on the initial errors can often be
quantified with a “condition number”, which measures how the output value(s)
change in response to small changes in the input value(s). This condition
number κ can be about the absolute or relative error, and represents the
proportionality factor between input error and output error.
Computing that condition number may be computationally more expensive
than obtaining the solution. For example, for a large linear system that is
almost degenerate, one will have no problem obtaining its solution, but finding
out whether this solution is robust requires an additional and larger effort.
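For instance, with NumPy (assumed available here) one can solve the nearly degenerate 3 × 3 system from above for a small ǫ and ask for its condition number. This little check is only a sketch, but it shows both the blow-up of the solution and the extra work of assessing its robustness:

import numpy as np

eps = 1e-8
A = np.array([[ 1., -1.,       1.],
              [-1.,  3.,       1.],
              [ 0.,  1. + eps, 1.]])
b = np.array([1., 1., 2.])
print(np.linalg.solve(A, b))   # components grow like 1/eps
print(np.linalg.cond(A))       # condition number is enormous for small eps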
We can investigate the condition number of the roots of a single function
f (x; a) that depends on a parameter a. For a root f (x∗ ; a) = 0. The error
can be simply defined by absolute values |dx∗ | = κ|da|. A small change in
a will change f by (∂f /∂a)da. And a small change in x∗ will change f by
(∂f/∂x)dx = f′(x)dx. The two must cancel each other, 0 = (∂f/∂a)da + f′(x)dx.
Hence the solution shifts by dx = −(∂f/∂a)/f′(x) da and the condition number (for
the absolute error) is
FIGURE 2.3 The planes that correspond to the degenerate linear system
and their line intersections.
κ = |dx∗/da| = |(∂f/∂a)/f′|, evaluated at x = x∗
If f ′ (x∗ ) = 0, that is, if the root is degenerate, the condition number is infinite.
An infinite condition number does not mean all is lost; it only means that the
solution changes faster than proportional to the input error. For example,
dx ∝ √da has κ = ∞. The root also becomes badly-conditioned when the
numerator is large, that is, when f is highly sensitive to a. A root is well-
conditioned unless one of these two conditions applies.
A special case of this is the roots of polynomials
p(x; a0, . . . , aN) = Σ_{n=0}^{N} an x^n
themselves. These instabilities can occur not only for difference equations, but
also for differential equations (as will be described in chapter 7), and espe-
cially for partial differential equations (chapter 15), where the source of the
errors is not roundoff but discretization.
————–
EXERCISES
2.1 a. Plot the function f(x) = 3π^4 x^2 + ln((x − π)^2) in the range 0 to 4.
b. Prove that f (x) has two roots, f (x) = 0.
c. Estimate the distance between the two roots.
d. Plot the function in the vicinity of these roots. Does the graph of the
function change sign, as it must?
2.2 Show that when the Newton method is used to calculate the square root
of a number a ≥ 0, it converges for all initial conditions x0 > 0.
2.3 Degenerate roots of polynomials are numerically ill-conditioned. For ex-
ample, x^2 − 2x + 1 = 0 is a complete square (x − 1)^2 = 0 and has the sole
degenerate root x = 1. Suppose there is a small error in one of the three
coefficients, (1 − δ2)x^2 − 2(1 + δ1)x + (1 − δ0) = 0. Perturbing each of
the three coefficients independently and infinitesimally, determine how
the relative error ǫ of the root depends on the relative error of each
coefficient.
2.4 Calculate the condition numbers for addition, subtraction, multiplica-
tion, and division of two numbers of the same sign and based on relative
error. When the relative input errors are ǫ1 and ǫ2 , and the relative error
of the output is ǫ12 , then a single condition number κ can be defined as
|ǫ12 | = κ max(|ǫ1 |, |ǫ2 |).
2.5 Consider the linear system Ax = b where the 2 × 2 matrix
A = ((c, d), (e, f))
0 | 01011110 | 00111000100010110000010
sign   exponent   mantissa

+ | 1.23456 | E-6
sign   mantissa   exponent
The 7 significant decimal digits are the typical precision. There are number
ranges where the binary and decimal representations mesh well, and ranges
where they mesh poorly. The number of significant digits is at least 6 and at
most 9. (This can be figured out with a brute force loop through all floating-
point numbers; there are only about 4 billion of them.)
For a 64-bit number (8 bytes) there are 11 bits for the exponent (which
translates to decimal exponents of ±308) and 52 bits for the mantissa, which
provides around 16 decimal digits of precision. And here is why: 2^52 ≈ 10^16
and 2^(2^10) ≈ 10^+308.
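These limits can be confirmed directly; for example, NumPy (if available) reports the machine precision and the largest representable number for each floating-point type:

import numpy as np
print(np.finfo(np.float32).eps, np.finfo(np.float32).max)   # about 1.2e-07 and 3.4e+38
print(np.finfo(np.float64).eps, np.finfo(np.float64).max)   # about 2.2e-16 and 1.8e+308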
Single-precision numbers are typically 4 bytes long. Use of double-precision
variables doubles the length of the representation to 8 bytes. On some ma-
chines there is a way to extend beyond double, up to quadruple precision
(typically 128 bit). Single and double precision variables do not necessarily
correspond to the same byte-length on every machine.
The mathematical constant π up to 36 significant decimal digits (usually
enough for quadruple precision) is
3.14159265 3589793 23846264338327950288
←−− single −−→
←−−−−−−−− double −−−−−−−−→
Using double-precision numbers is usually barely slower than single-
precision, if at all. Some processors always use their highest precision even
for single-precision variables, and the extra step to convert between number
representations makes single-precision calculations actually slower. Double-
precision numbers do, however, take twice as much memory.
Several general-purpose math packages offer arbitrary-precision arith-
metic, should this ever be needed. Computationally, arbitrary-precision cal-
culations are disproportionally slow, because they have to be emulated on a
software level, whereas single- and double-precision floating-point operations
are hardwired into the computer’s central processor.
Many fractions have infinitely many digits in decimal representation, e.g.,
1
= 0.1666666 . . .
6
The same is true for binary numbers; only that the exactly represented
fractions are fewer. The decimal number 0.5 can be represented exactly as
0.100000..., but decimal 0.2 is in binary form
0.00110011001100110...
and hence not exactly representable with a finite number of digits. This causes
a truncation error. In particular, decimals like 0.1 or 10^−3 have an infinitely
long binary representation. (The fact that binary cannot represent arbitrary
decimal fractions exactly is a nuisance for accounting software, where every-
thing needs to match to the cent.) For example, if a value of 9.5 is assigned it
will be 9.5 exactly, but 9.1 carries a representation error. One can see this by
using the following Python commands, which print numbers to 17 digits after
the decimal point.
>>> print '%.17f' % 9.5
9.50000000000000000
>>> print '%.17f' % 9.1
9.09999999999999964
(The first % symbol indicates that the following is the output format. The
second % symbol separates this from the number.) For the same reason, 0.1 +
0.1 + 0.1 − 0.3 is not zero,
>>> 0.1+0.1+0.1-0.3
5.551115123125783e-17
A discrepancy of this order of magnitude, ≈ 10^−16, incidentally makes clear
that by default Python, or at least the implementation used here, represents
floating-point numbers as 8-byte variables. Any if condition for a floating-
point number needs to include a tolerance, e.g., not if (a==0.8), but if
(abs(a-0.8)<1e-12), where 10^−12 is an empirical choice that has to be safely
above a (relative) accuracy of 10^−16, for whatever range a takes on in this
program.
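A small sketch of such a tolerance-based comparison (the numbers and the threshold 10^−12 are arbitrary; math.isclose exists in Python 3.5 and later):

import math
a = 0.1 + 0.7
print(a == 0.8)                              # False; a is 0.7999999999999999
print(abs(a - 0.8) < 1e-12)                  # True
print(math.isclose(a, 0.8, rel_tol=1e-12))   # True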
In terms of more formal mathematics, in the language of algebraic struc-
tures, the set of floating-point numbers (F) does not obey the same rules as
the set of real numbers (R). For example, (a + b) + c may be different from
a+(b+c). The associative property, that the order in which operations are per-
formed does not matter, does not necessarily hold for floating-point numbers.
For example,
>>> (1.2-1)-0.2
-5.551115123125783e-17
>>> 1.2-(1+0.2)
0.0
Some programming languages assume that, unless otherwise instructed,
all numbers are double-precision numbers. Others look at the decimal point,
so that 5 is an integer, but 5. is a floating-point number. Often that makes no
difference in a calculation at all, but sometimes it makes a crucial difference.
For example, for an integer division 4/5=0, whereas for a floating-point divi-
sion 4./5.=0.8. This is one of those notorious situations where the absence of
a single period can introduce a fatal bug in a program. Even 4./5 and 4/5. will
yield 0.8. For this reason some programmers categorically add periods after
integers that are meant to be treated as floating-point numbers, even if they
are redundant; that habit helps to avoid this error.
When a number is represented with a fixed number of bits, there is neces-
sarily a maximum and minimum representable number; exceeding them means
an “overflow” or “underflow.” This applies to floating-point numbers as well
as to integers. For floating-point numbers we have already determined these
limits. Currently the most common integer length is 4 bytes. Since 1 byte is
8 bits, that provides 2^(4×8) = 2^32 ≈ 4 × 10^9 different integers. The C language
and numerous other languages also have unsigned integers, so all bits can be
used for positive integers, whereas the regular 4-byte integer goes from about
−2 × 10^9 to about +2 × 10^9. It is prudent not to use loop counters that go
beyond 2 × 10^9.
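Python's own integers have arbitrary length and never overflow, but fixed-width integers, as in C or Fortran, do. A small sketch with NumPy's 4-byte integers (assuming NumPy is available; NumPy typically issues a warning and wraps the value around):

import numpy as np
i = np.int32(2147483647)     # largest signed 4-byte integer, 2**31 - 1
print(i + np.int32(1))       # wraps around to -2147483648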
given above, divide by two, and take the tangent; the result will be finite. Or,
if π is already defined, type
>>> tan(pi/2)
1.633123935319537e+16
In fact the tangent does not overflow for any argument.
As part of the IEEE 754 standard, a few bit patterns have special meaning
and serve as “exceptions”. There is a bit pattern for numbers exceeding the
maximum representable number: Inf (infinity). There are also bit patterns
for -Inf and NaN (not a number). For example, 1./0. will produce Inf. An
overflow is also an Inf. There is a positive and a negative zero. If a zero is pro-
duced as an underflow of a tiny negative number it will be −0., and 1./(−0.)
produces -Inf. A NaN is produced by expressions like 0./0., √−2., or Inf-Inf.
Exceptions are intended to propagate through the calculation, without need
for any exceptional control, and can turn into well-defined results in subse-
quent operations, as in 1./Inf or in if (2.<Inf). If a program aborts due to
exceptions in floating-point arithmetic, which can be a nuisance, it does not
comply with the standard. IEEE 754 floating-point arithmetic is algebraically
complete; every algebraic operation produces a well-defined result.
Roundoff under the IEEE 754 standard is as good as it can be for a given
precision. The standard requires that the result must be as if it was first
computed with infinite precision, and then rounded. This is a terrific accom-
plishment; we get numbers that are literally accurate to the last bit. The error
never exceeds half the gap of the two machine-representable numbers closest
to the exact result. (There are actually several available rounding modes,
but, for good reasons, this is the default rounding mode.) This applies to the
elementary operations (+, −, /, ×) as well as to the remainder (in many pro-
gramming languages denoted by %) and the square root (√). Halfway cases are
rounded to the nearest even (0 at the end) binary number, rather than always
up or always down, because rounding in the same direction would be more
likely to introduce a statistical bias, as minuscule as it may be.
The IEEE 754 standard for floating-point arithmetic represents the as-
good-as-it-can-be case. But to what extent are programming platforms com-
pliant with the standard? The number representation, that is, the partitioning
of the bits, is nowadays essentially universally implemented on platforms one
would use for scientific computing. That means for an 8-byte number, relative
accuracy is about 10^−16 and the maximum exponent is +308 nearly always.
Roundoff behavior and exception handling are often available as options, be-
cause obeying the standard rigorously comes with a penalty on speed. Com-
pilers for most languages provide the option to enable or disable the roundoff
and exception behavior of this IEEE standard. Certainly for C and Fortran,
ideal rounding and rigorous handling of exceptions can be enforced on most
machines. Many general-purpose computing environments also comply with
the IEEE 754 standard. Pure Python stops upon division by zero—which is a
violation of the standard, but the NumPy module is IEEE compliant.
(x/y)(1 + ǫx/y) = x(1 + ǫx) / (y(1 + ǫy)) ≈ (x/y)(1 + ǫx)(1 − ǫy) ≈ (x/y)(1 + ǫx − ǫy)
and therefore the relative error of the ratio is |ǫx/y | ≤ |ǫx | + |ǫy |. For divisions,
we only need to worry about overflows or underflows, in particular division by
zero. Among the four elementary operations only subtraction of numbers of
equal sign or addition of numbers of opposite sign increase the relative error.
An instructive example is solving a quadratic equation ax^2 + bx + c = 0
numerically. In the familiar solution formula
x = (−b ± √(b^2 − 4ac)) / (2a)
a cancellation effect will occur for one of the two solutions if ac is small com-
pared to b^2. The remedy is to compute the smaller root from the larger. For
a quadratic polynomial the product of its two roots equals x1 x2 = c/a, be-
cause ax^2 + bx + c = a(x − x1)(x − x2). If b is positive then one solution is
obtained by the equation above, x1 = −q/(2a), with q = b + √(b^2 − 4ac), but
the other solution is obtained as x2 = c/(ax1) = −2c/q. This implementation
of the solution of quadratic equations requires no extra line of code; the com-
mon term q could be calculated only once and stored in a temporary variable,
and the sign of b can be accommodated by using the sign function sgn(b),
q = b + sgn(b)√(b^2 − 4ac). To perfect it, a factor of −1/2 can be absorbed into
q to save a couple of floating-point operations.
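A minimal sketch of this cancellation-free recipe in Python (restricted, for simplicity, to real roots with b^2 ≥ 4ac and a ≠ 0; the sample coefficients are arbitrary):

import math

def quadratic_roots(a, b, c):
    # q = b + sgn(b)*sqrt(b^2 - 4ac); copysign supplies the sign function
    q = b + math.copysign(math.sqrt(b*b - 4.0*a*c), b)
    x1 = -q/(2.0*a)      # root computed without cancellation
    x2 = -2.0*c/q        # other root, from x1*x2 = c/a
    return x1, x2

print(quadratic_roots(1.0, 1e8, 1.0))   # roots near -1e8 and -1e-8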
which follows from one or the other law of spherical trigonometry. Here, ∆σ
is the angular distance along a great circle, φ1 and φ2 are the latitudes of the
two points, and ∆λ = λ2 − λ1 is the difference in longitudes. This expression
is prone to rounding errors when the distance is small. This is clear from the
left-hand side alone. For ∆σ ≈ 0, the cosine is almost 1. Since the Taylor
expansion of the cosine begins with 1 − ∆σ^2/2, the result will be dominated
by the truncation error at ∆σ^2/2 ≈ 10^−16 for 8-byte floating-point numbers.
The volumetric mean radius of the Earth is 6371 km, so the absolute dis-
tance comes out to 2π × 6371 × 10^3 × √(2 × 10^−16) ≈ 0.6 m. At distances
much larger than that, the formula given above does not have significant
rounding errors. For small distances we could use the Pythagorean relation
∆σ = √(∆φ^2 + ∆λ^2 cos^2 φ). The disadvantage of this approach is that there
will be a small discontinuity when switching between the roundoff-sensitive ex-
act equation and the roundoff-insensitive approximate equation. Alternatively
the equation can be reformulated as
sin^2(∆σ/2) = sin^2(∆φ/2) + cos φ1 cos φ2 sin^2(∆λ/2)
This no longer has a problem for nearby points, because sin(∆σ/2) ≈ ∆σ/2, so
the left-hand side is small for small ∆σ, and there is no cancellation between
the two terms on the right-hand side, because they are both positive. This
expression becomes roundoff sensitive when the sine is nearly 1 though, that
is, for nearly antipodal points. Someone has found a formula that is roundoff
insensitive in either situation, which involves more terms. The lesson from
this example is that if 16 digits are not enough, the problem can often be fixed
mathematically.
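As a sketch of the difference this makes, the two formulations can be compared in Python for two nearby points (the coordinates below are arbitrary, and the first function spells out the spherical law of cosines expression described above):

from math import radians, sin, cos, acos, asin, sqrt

def dist_lawofcosines(phi1, lam1, phi2, lam2):
    # cos(dsigma) = sin(phi1)sin(phi2) + cos(phi1)cos(phi2)cos(dlambda)
    x = sin(phi1)*sin(phi2) + cos(phi1)*cos(phi2)*cos(lam2 - lam1)
    return acos(min(1.0, x))   # clamp guards against roundoff pushing x above 1

def dist_haversine(phi1, lam1, phi2, lam2):
    s = sqrt(sin((phi2 - phi1)/2)**2 + cos(phi1)*cos(phi2)*sin((lam2 - lam1)/2)**2)
    return 2*asin(s)

p = (radians(21.3), radians(-157.8))
q = (radians(21.3), radians(-157.8) + 1e-9)
print(dist_lawofcosines(*p, *q))   # ruined by roundoff at this tiny separation
print(dist_haversine(*p, *q))      # about 9.3e-10, close to the true angular distance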
————–
EXERCISES
3.1 The iteration xn+1 = 4xn (1 − xn ) is extremely sensitive to the initial
value as well as to roundoff. Yet, thanks to the IEEE 754 standard, it
is possible to reproduce the exact same sequence on many platforms.
Calculate the first 1010 iterations with initial conditions x0 = 0.8 and
x0 = 0.8 + 10^−15. Use double (8-byte) precision. Submit the program as
well as the values x1000 . . . x1009 to 9 digits after the decimal point. We
will compare the results in class.
3.2 The expression for gravitational acceleration is GM ri/r^3, where G is
the gravitational constant 6.67 × 10^−11 m^3/(kg s^2), M is the mass of the
sun 2 × 10^30 kg, r = 150 × 10^9 m is the distance between the Sun and
Earth, and i = 1, 2, 3 indexes the three spatial directions. We cannot
anticipate in which order the computer will carry out these floating-
point operations. What are the worst possible minimum and maximum
CHAPTER 4
Programming Languages & Tools
Almost any programming language one is familiar with can be used for com-
putational work. Some people are convinced that their own favorite program-
ming language is superior to all other languages, and this may be because the
versatility of a language increases with one’s familiarity with it. Figure 4.1 cat-
egorizes programming languages from the perspective of scientific computing.
They are grouped into i) high-level languages, ii) general-purpose interactive
computational environments, iii) text processing utilities and scripting lan-
guages, and iv) special-purpose languages. This classification is by no means
rigorous, but a practical grouping.
engineers and as such it continues to be particularly well suited for this pur-
pose. It is the oldest programming language that is still widely used in con-
temporary scientific research, and that for good reasons. There is extensive
heritage, that is, code for scientific modeling. Fortran was invented in the
1950s, but underwent significant revision and expansion in the 1990s. Mod-
ern Fortran (Fortran 90 and later versions) greatly extends the capabilities
of earlier Fortran standards. Fortran 77, the language standard that preceded
Fortran 90, lacks dynamic memory allocation, that is, the size of an array can-
not be changed while the program is executing. It also used a “fixed format”
for the source code, with required blank columns at the front of each line and
maximum line lengths, as opposed to the “free format” Modern Fortran and
nearly all modern languages use. A major advantage of Fortran is its parallel
computing abilities.
Java excels in portability. It typically does not reach the computational
performance of languages like C and Fortran, but comes close to it, and in
rare circumstances exceeds it.
Table 4.1 shows a program in C and in Fortran that demonstrates simi-
larities between the two languages. The analogies are mostly self-explanatory.
float/real declare single-precision floating-point variables. In C, statements
need to be ended with a semicolon; a semicolon is also used to pack two For-
tran commands into one line. In C, i++ is shorthand for i=i+1, and the way
for-loops work is: for( starting point; end condition; increment expression)
body. Array indices begin by default with 0 for C and with 1 for Fortran. C
does not have a special syntax for powers, so the pow function is used that
works for integer and non-integer powers alike. For print statements in C,
\n adds a newline, whereas in Fortran output lines automatically end with a
line break. The indents are merely for ease of reading and do not change the
functionality of the programs. Plain C has few commands, but additional func-
tionality is provided through the #include commands. For example, #include
<math.h> provides basic mathematical functions, such as pow and sin. C is
case-sensitive, whereas Fortran is not.
tures. Some language implementations come with their own editors (as part of
their “integrated development environment”), but many programmers prefer
the universally useful general-purpose text editors. The source code is then
compiled. For example, on a command prompt gcc example.c produces an
executable. Type a.out, the default name of the executable, and it will output
0.015430.
Typical for high-level programming languages, in contrast to very high-
level languages, is that all variables must be declared, instructions are rela-
tively detailed, and the code needs to be explicitly compiled or built to gen-
erate an executable.
The free GNU compilers have been a cornerstone in the world of pro-
gramming. They are available for many languages, including C (gcc), C++
(g++), Fortran (gfortran), and Java (gcj). Although they emerged from the
Unix world, they are available in some form for all major operating systems
(Windows, Mac/Unix, Android). And they are not the only free compilers
out there. In addition, commercial compilers are available. These often take
advantage of new hardware features more quickly than free compilers.
The role of language standards is to make sure all compilers understand
the same language, although individual compilers may process nonstandard
language extensions. The use of nonstandard extensions will make code less
portable. Nevertheless, some nonstandard syntax is widely used. For example,
Fortran’s real*8, which declares an 8-byte floating-point number, has been
understood by Fortran compilers since the early days of the language, but is
strictly speaking not part of the language standard.
Table 4.2 shows the example program above in two popular applications.
These programs are considerably shorter than those in Table 4.1 and do not
require variables to be declared. In Matlab an entire array of integers is created
with [0:N-1] and the sine command is automatically applied to all elements
in the array.
TABLE 4.2 Matlab and IDL program examples; compare to Table 4.1.
In Matlab every line produces output; a semicolon at the end of the
line suppresses its output, so the absence of a semicolon in the last line
implies that the result is written to the screen.
% Matlab program example          ; IDL program example
N=64;                             N=64
i=[0:N-1];                        a=FLTARR(N)
a=sin(i/2);                       FOR i=0,N-1 DO a(i)=sin(i/2)
b=max(a);                         b=MAX(a)
b^5/N                             PRINT, b^5/N
TABLE 4.3 Python version of the program in Tables 4.1 and 4.2.
# Python program example
import numpy
N=64
a=numpy.zeros(N)
for i in range(0,N):
    a[i]=numpy.sin(i/2.)
b=max(a)
print b**5/N
Plain Python does not have arrays, but NumPy (Numeric Python) does.
This module is loaded with the import command. NumPy also provides the
mathematical functions that otherwise would need to be imported into plain
Python with import math. The range command in the for loop, range(0,N),
goes only from 0, . . . , N − 1, and not from 0, . . . , N .
These Python commands can be executed interactively, line by line, or
they can be placed in a file and executed collectively like program code. We
can start Python and type in any Python command to get output immedi-
ately. Or we can write a sequence of commands in a file, called something
like example.py, and then feed that to Python (on a command prompt with
python example.py).
It is amusing that in the five language examples above (C, Fortran, Matlab,
IDL, and Python), each uses a different symbol to indicate a comment line:
Double slash // or /* ... */ in C, exclamation mark ! or c in Fortran,
percent symbol % in Matlab, semicolon ; in IDL, and hash # in Python.
General-purpose computational environments that are currently widely
used include: Mathematica began as a symbolic computation package, which
still is its comparative strength. Its syntax is called the Wolfram Language.
Matlab is particularly strong in linear algebra tasks; its name is an abbrevia-
tion of Matrix Laboratory. Octave is open-source software that mimics Matlab.
IDL (Interactive Data Language) began as a visualization and data analysis
package. Its syntax defines the Gnu Data Language (GDL). All of these soft-
ware packages offer a wide range of numerical and graphical capabilities, and
some of them are also capable of symbolic computations. Python and R can
also function as general-purpose interactive computational environments.
translation process is now split into two parts: one from the source code to
bytecode and then from bytecode to machine code. Even if one of those two
steps is a compilation, the other might not be. The bottom line is that Java
and Python are not fully-compiled languages, and they rarely achieve the
computational performance of C and Fortran.
Another, rarer possibility is source-to-source translators that convert the
language statements into those of another. An example is Cython, which trans-
lates Python-like code to C. (Cython is not the same as CPython. The latter
refers to a Python implementation written in C.)
When source code is spread over multiple files, compilation produces in-
termediate files, with extensions such as .o (“object file”) or .pyc (Python
bytecode). Unless the source code changes, these can be reused for the next
build, so this saves on compilation time.
Execution speed. Python is slow. That is an overgeneralized statement, but
it captures the practical situation. A factor often quoted is 20 compared to
Fortran, although that is undoubtedly even more of an overgeneralization.
The reasons for its inherent inefficiency have to do with how it manages mem-
ory and that it is not fully compiled. This in turn simply relates to the fact
that it is a very high-level language. There are ways to improve its compu-
tational performance, and language extensions and variants such as NumPy,
Cython, PyPy, and Numba help in this process, but ultimately Python was
not designed for computational performance; it was designed for coding, and
it excels in that. The appropriate saying is that
“Python for comfort, Fortran for speed.”
Very high-level languages often do not achieve the speed possible with
languages like Fortran or C. One reason for that is the trade-off between
universality and efficiency—a general method is not going to be the fastest.
Convenience comes at the cost of computational efficiency and extraneous
memory consumption. This is formally also known as “abstraction penalty”.
The bottom line is simple and makes perfect sense: higher level languages
are slower in execution than lower level languages, but easier and faster to
code. In a world where computing hardware is extremely fast, the former
often ends up being the better time investment.
Intrinsic functions. An arsenal of intrinsic functions is available that covers
commonplace and special-purpose needs. Available mathematical functions
include powers, logarithms, trigonometric functions, etc. They also most often
include atan2(y,x), the arctangent function with two arguments instead of
the single argument of atan(y/x), which resolves the sign ambiguity of the results for
the different quadrants. Even the Error function erf is often available. And
sometimes trigonometric functions with arguments in degree are available,
sind, cosd, and tand. In some languages an extra line of code is required to
include them, such as #include <math.h> in C or import math in Python
(import numpy will include them too; and import scipy will, more or less,
include NumPy plus other functionality).
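To illustrate the atan2 point made above, a one-line Python check shows that atan(y/x) cannot distinguish opposite quadrants, whereas atan2(y,x) can:

import math
print(math.atan(1.0/1.0), math.atan(-1.0/-1.0))      # both 0.785..., the quadrant is lost
print(math.atan2(1.0, 1.0), math.atan2(-1.0, -1.0))  # 0.785... and -2.356...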
The symbol for powers is ^ or **. (C has no dedicated symbol.) The logical
and operator often is & or &&, and the or operator | or ||. The use of == as an
equality condition is rather universal, and the very few languages that use a
single =, use another symbol, such as :=, for a variable assignment. In contrast,
the symbol for “not equal” takes, depending on the language, various forms:
!=, /=, ~=. As mentioned in section 3.1, the division symbol / can mean
something different for integers and floating-point numbers. In Python, // is
dedicated to integer division, and called “floor division”.
Sometimes the name of a function depends on the number type. For exam-
ple, in C and C++ fabs is the absolute value of a double-precision floating-
point number, fabsf is for a single-precision number, abs only for an integer,
and cabs for a complex number. Modern C compiler implementations may be
forgiving about this, and in very high-level languages it is abs for all input
types.
Another feature relevant for scientific computing is vectorized, index-free,
or slice notation. Fortran, Matlab, and Python are examples of languages
that allow it. An operation is applied to every element of an entire array,
for example A(:)=B(:)+1 adds 1 to every element. Even A=B+1 does so when
both arrays have the same shape and length. This not only saves the work
of writing a loop command, but also makes it obvious (to a compiler) that
the task can be parallelized. With explicit loops the compiler first needs to
determine whether or not the commands within the loop are independent
from one another, whereas index-free notation makes it clear that there is no
interdependence.
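A small NumPy sketch of this index-free notation (assuming NumPy is available):

import numpy as np
B = np.arange(5.0)    # array([0., 1., 2., 3., 4.])
A = B + 1.0           # applied to every element, no explicit loop
A[1:4] = 0.0          # slice notation: elements 1, 2, 3 are set to zero
print(A)              # [1. 0. 0. 0. 5.]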
Initializations. In many languages, scalar variables are initialized with zero.
Not so in Fortran, where no time is lost on potentially unnecessary initializa-
tions. Neither C nor Fortran automatically initialize arrays with zeros, so
their content is whatever happens to reside in these memory locations. Some
very high-level numerical environments require initializations when the array
is created.
High-level versus very high-level languages. This distinction has already
been dwelled upon, and there is a continuum of languages with various com-
binations of properties. The programming language Go or Golang, sort of a
C-family language, is fully compiled but does not require variables to be de-
clared, so it falls somewhere in the middle of this categorization. The very
high-level languages are also those that are suited for interactive usage.
To again contrast very high-level with high-level languages, suppose we
want to read a two-column data file such as
1990 14.7
2000 N/A
2010 16.2
where the ‘N/A’ stands for non-applicable (or not-available) and indicates a
missing data point. This is an alphabetical symbol not a number, so it poten-
tially spells trouble if the program expects only numbers. The following lines
of C code read numbers from a file:
fin = fopen("file.dat","r");
for(i=0; i<3; i++) {
fscanf(fin,"%d %f\n",&year,&T);
printf("%d %f\n",year,T);
}
The output will only be the following two lines:
1990 14.700000
2000 14.700000
It terminates on the N/A and outputs the previously read entry for T, so it is
doubly problematic. To remedy this problem, we would have to either read all
entries as character strings and convert them to numbers inside the program,
or, before even reading the data with C, replace N/A in the file with a number
outside the valid range; for example if the second column is temperature in
Celsius, the value −999 will do. On the other hand, with the very high-level
language Matlab or Octave
> [year,T]=textread('file.dat','%d %f')
year =
1990
2000
2010
T =
14.700
NaN
16.200
This single line of code executes without complaint and assigns a NaN to the
floating-point number T(2). (Chapter 3 has explained that NaN is part of the
floating-point number representation.) This illustrates two common differences
between high-level and very high-level languages: the latter are faster to code
and generally more forgiving.
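In Python with NumPy a similar one-liner is available; the sketch below assumes the same three-line file and uses numpy.genfromtxt, which can be told to treat 'N/A' as a missing value and fill it with NaN:

import numpy as np
year, T = np.genfromtxt('file.dat', missing_values='N/A',
                        filling_values=np.nan, unpack=True)
print(year)   # [1990. 2000. 2010.]
print(T)      # [14.7   nan  16.2]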
Cross language compilation. Programming languages can also be mixed.
For example C and Fortran programs can be combined. There are technicali-
ties that have to be observed to make sure the variables are passed correctly
and have the same meaning in both programs, but language implementations
are designed to make this possible. Combining Python with C provides the bet-
ter of both worlds, and Python is designed to make that easily possible. Using
Python as a wrapper for lower level functions can be an efficient approach.
Scripts are another way to integrate executables from various languages, a
topic we will turn to in chapter 14.
Choosing a programming language. We have not yet discussed groups III
and IV in Figure 4.1. Group III, text processing utilities and scripting lan-
guages, is another set of very high-level languages, but with a different pur-
pose. They are immensely useful, for example, for data reformatting and pro-
cessing and they will be introduced in chapters 13 and 14. Group IV are
special-purpose languages, and the short list shown in the figure ought to be
augmented with countless discipline-specific and task-specific software. Lab-
VIEW deserves special mention, because it is a graphical programming lan-
guage, where one selects and connects graphical icons to build an instrument
control or data acquisition program. It is designed for use with laboratory
equipment. (Its language is actually referred to as G, but LabVIEW is so far
the only implementation of G.) R began as a language for statistical comput-
ing, and is now widely used for general data analysis and beyond. It can be
used interactively or built into an executable.
Whether it is better to use an interactive computing environment, a pro-
gram written in a very high-level language, or write a program in a lower level
language like C or Fortran depends on the task to be solved. Each has its
domain of applicability. A single language or tool will be able to deal with a
wide range of tasks, if needed, but will be inefficient or cumbersome for some
of them. To be able to efficiently deal with a wide range of computational
problems, it is advantageous to know several languages or tools from different
parts of the spectrum: One high-level programming language for time-intensive
number-crunching tasks (group I), one general-purpose computational envi-
ronment for every-day calculations and data visualization (group II), and a
few tools for data manipulation and processing (group III). This enables a
scientist to choose a tool appropriate for the given task. Knowing multiple
languages from the same category provides less of an advantage. On a prac-
tical level, use what you know, or use what the people you work with use.
The endless number of special purpose software (group IV) can generally be
treated as “no need to learn unless needed for a specific research project.”
EXERCISES
4.1 If you do not know any programming language yet, learn one. The fol-
lowing chapters will involve simple programming exercises.
4.2 Gauss circle problem: As a basic programming exercise, write a program
that calculates the number of lattice points within a circle as a function
of radius. In other words, how many pairs of integers (m, n) are there
such that m^2 + n^2 ≤ r^2? A finite resolution in r is acceptable. Submit
the source code of the program and a graph of the results.
CHAPTER 5
Sample Problems; Building Conclusions
“For it is a sad fact that most of us can more easily compute than
think.”
Forman Acton
Computing is a tool for scientific research. So we better get to the essence and
start using it with a scientific goal in mind.
αn+1 = αn + ωn T
ωn+1 = ωn + K sin αn+1
The period of kicking is T and K is its strength. The iteration acts as a strobe
effect for the time-continuous system, which records (α, ω) every time interval
T , which is also the time interval between kicks.
For K = 0, without kicking, the rod will rotate with constant period; ω
stays constant and α increases proportionally with time. For finite K, will the
rod stably position itself along the direction of force (α = π) or will it rotate
41
42 Lessons in Scientific Computing
full turns forever? Will many weak kicks ultimately cause a big change in the
behavior or not? Will it accumulate unlimited amounts of kinetic energy or
not? These are nontrivial questions that can be answered with computations.
A program to iterate the above formula is only a few lines long. The angle
can be restricted to the range 0 to 2π. If mod(α, 2π) is applied to α at every
iteration, it will cause a tiny bit of roundoff after each full turn (and none
without a full turn), but it avoids loss of significant digits when α is large, as
it will be after many full turns in the same direction. After a few thousand
turns in the same direction, four significant digits are lost, because a few
thousand times 2π is about 10^4. With 16 digits precision this is not all that
bad.
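A minimal sketch of such a program in Python (the kicking period is set to T = 1 here, an arbitrary choice for illustration):

import math

def standard_map(alpha, omega, K, T=1.0, nsteps=30):
    # iterate the kicked rotator and wrap the angle into [0, 2*pi)
    history = []
    for n in range(nsteps):
        alpha = (alpha + omega*T) % (2.0*math.pi)
        omega = omega + K*math.sin(alpha)
        history.append((alpha, omega))
    return history

for alpha, omega in standard_map(0.0, 0.2, K=0.0)[:5]:
    print(round(alpha, 3), round(omega, 3))   # omega stays at 0.2 when K = 0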
To begin with, a few test cases are in order to validate the program, as short
as it may be. For K = 0 the velocity should never change. A simple test run,
ω = 0.2, 0.2, 0.2, ..., confirms this. Without kicking αn+1 = αn +2πT /Tr , where
Tr = 2π/ω is the rotation period of the rod. Between snapshots separated by
time T, the angle α changes either periodically or, when Tr/T is not a ratio of
integers, quasi-periodically, because the rod never returns exactly
to the same position.
A second test case is the initial values α0 = 0 and ω0 = 2π/T , that is, the
rod rotates initially exactly at the kicking frequency and α should return to the
same position after every kick, no matter how hard the kicking: α = 0, 0, 0, ....
The program reproduces the correct behavior in cases we understand, so we
can proceed to more complex situations.
For K = 0.2 the angular velocity changes periodically or quasi-periodically,
as a simple plot of ωn versus n, as in Figure 5.2, shows. Depending on the
initial value, ω will remain of the same sign or periodically change its sign.
FIGURE 5.2 Angular velocity ω for the standard map for kicking strengths
K = 0.2 and K = 1.4.
For stronger kicking, K = 1.4, the motion can be chaotic, for example α ≈
0, 1, 3.18, 5.31, 6.27, 0.94, 3.01, 5.27, 0.05, .... An example of the ω behavior
is included in Figure 5.2. The rod goes around a few times, then reverses
direction, and continues to move in an erratic fashion.
Plots of ω versus α reveal more systematic information about the behavior
of the solutions. Figure 5.3 shows the phase space plots for several initial values
for ω and α. Since α is periodic, the (α, ω) phase space may be thought of as
a cylinder rather than a plane. Again we first consider the trivial case without
kicking (Figure 5.3a). For K = 0, if ωT/(2π) is irrational the angle will ultimately sweep out
all values of α, which appears as a horizontal straight line in the plot; if ωT/(2π) is
a rational number there will only be a finite number of α values.
For small K, there is a small perturbation of the unforced behavior (Fig-
ure 5.3b). When the rod starts out slow and near the stable position α = π, it
perpetually remains near the stable position, while for high initial speeds the
rod takes full turns forever at (quasi)periodically changing velocities. These
numerical results suggest, but do not prove, that the motion remains perpetually
regular for the few initial values plotted here, because these solutions lie
on invariant curves. The weak but perpetual kicking changes the behav-
ior qualitatively, because now there are solutions that no longer go around,
but bounce back and forth instead. Nevertheless, judged by Figure 5.3(b) all
motions are still regular. Solutions that perpetually lie on a curve in phase
space (or, in higher dimensions, a surface) are called “integrable.” The theory
discussion below will elaborate on integrable versus chaotic solutions.
The behavior for stronger K is shown in panel (c). For some initial condi-
tions the rod bounces back and forth; others make it go around without ever
changing direction. The two regions are separated by a layer where the mo-
tion is chaotic. The rod can go around several times in the same direction, and
then reverse its sense of rotation. In this part of the phase space, the motion of
the rod is perpetually irregular. The phase plot also reveals that this chaotic
motion is bounded in energy, because ω never goes beyond a maximum value.
FIGURE 5.3 Angular velocity ω versus angle α for the standard map for
different kicking strengths, (a) K = 0, (b) K = 0.2, (c) K = 0.8, and (d)
K = 1.4. Nine initial conditions are used in each plot. Lines that appear
to be continuous consist of many discrete points.
For strong kicking there is a “sea” of chaos (d). Although the motion is
erratic for many initial conditions, the chaotic motion does not cover all pos-
sible combinations of velocities and angles. We see intricate structures in this
plot. There are regular/quasi-periodic (integrable) solutions bound in α and
ω. Then there are islands of regular behavior embedded in the sea of chaotic
behavior. Simple equations can have complicated solutions, and although com-
plicated they still have structure to them. A qualitative difference between (c)
and (d) is that in (d) the chaotic motions are no longer bounded in energy. For
some intermediate kicking strengths, this division in phase space disappears.
By the way, for strong kicking, the rod can accumulate energy without
limit. In the figure, ω is renormalized to keep it in the range −π/T to π/T .
Physicist Enrico Fermi used a similar model to show that electrically charged
particles can acquire large amounts of energy from an oscillating driving field.
Unbounded growth of kinetic energy due to periodic forcing is thus also known
as “Fermi acceleration”. (Exercise 5.1 elaborates on this.)
The next problem concerns an object arriving from far away that will be captured into an orbit around the sun due to interaction
with Jupiter, the heaviest planet in our solar system. The goal here is to
get a perspective of the overall process of scientific computing: setting up
the problem, validating the numerical calculations, and arriving at a reliable
conclusion. The problem will only be outlined, so we can move from step to
step more quickly.
We begin by writing down the governing equations. The acceleration of
the bodies due to their gravitational interaction is given by
d²r_i/dt² = −G Σ_{j≠i} m_j (r_i − r_j) / |r_i − r_j|³
where t is time, G is the gravitational constant, the sum is over all bodies, and
m and r(t) are their masses and positions. We could use polar or cartesian
coordinates. With three bodies, polar coordinates do not provide the benefit
they would have with two bodies, so cartesian coordinates are a fine choice.
The next step is to simplify the equations. The motion of the center of
mass can be subtracted from the initial conditions, so that all bodies revolve
relative to the center of mass at a reduced mass. This reduces the number of
equations, in two dimensions, from 6 to 4. In the following we will consider
the “restricted three-body problem”: The first mass (the Sun) is much heavier
than the other two, and the second mass (Jupiter) is still much heavier than
the third, so the second mass moves under the influence of the first, and the
third mass moves under the influence of the other two. Since the first body is
so heavy that it does not accelerate, subtracting the center of mass becomes
superfluous, and we are left with equations of motion for the second and third
body only.
The numerical task is to integrate the above system of ordinary differential
equations (ODEs) that describe the positions and velocities of the bodies as
a function of time. What capabilities does a numerical ODE solver need to
have for this task? As the object arrives from far away and may pass close to
the sun, the velocities can vary tremendously over time, so an adaptive time
step is a huge advantage; otherwise, the time integration would have to use
the smallest timestep throughout.
The problem of a zero denominator, ri = rj , arises only if the bodies
fall straight into each other from rest or if the initial conditions are specially
designed to lead to collisions; otherwise the centrifugal effect will avoid this
problem for pointlike objects.
The masses and radii in units of kilograms and meters are large numbers;
hence, we might want to worry about overflow in intermediate results. For
example, the mass of the sun is 2 × 10³⁰ kg, G is 6.67 × 10⁻¹¹ m³/(kg s²), and the
body starts from many times the Earth-sun distance of 1.5 × 10¹¹ m. We cannot
be sure of the sequence in which the operations are carried out. If G/r³, the
worst possible combination (Exercise 3.2), is formed first, the minimum exponent
easily becomes as low as −44. A single-precision variable, whose exponent reaches
down to only about −38, would underflow. Double-precision representation
(down to −308) is safe.
In conclusion, this problem demands an ODE solver with variable step size
and double-precision variables. Next we look for a suitable implementation.
A general-purpose computational environment is easily capable of solving a
system of ODEs of this kind, be it Python, Octave, Matlab, IDL, Mathematica,
or another. (The name of the command may be ode, odeint, lsode, ode45,
rk4, NDSolve, or whatnot). Or, in combination with a lower-level language,
pre-written routines for ODE integration can be used. Appendix B provides
a list of available repositories where implementations to common numerical
methods can be found. We enter the equations to be solved, along with initial
coordinates and initial velocity, and compute the trajectories. Or rather, we
find an ODE integrator, learn how to use it, then enter the equations to be
solved, and so forth.
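For instance, with Python and SciPy's general-purpose integrator, a sketch of the two-body setup might look as follows (the function and variable names are our own; units are chosen so that G = 1 and the solar mass is 1, as in the model calculation described below):

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    # y = [x, vx, y, vy] for the orbiting body; the sun is fixed at the origin
    x, vx, yy, vy = y
    r3 = (x**2 + yy**2)**1.5
    return [vx, -x/r3, vy, -yy/r3]

y0 = [1.0, 0.0, 0.0, 1.0]                      # r = 1 with circular-orbit speed
sol = solve_ivp(rhs, [0, 100*2*np.pi], y0, rtol=1e-9, atol=1e-9)

x, vx, yy, vy = sol.y[:, -1]
E = 0.5*(vx**2 + vy**2) - 1.0/np.hypot(x, yy)  # total energy, stays near -0.5
print(E)

The integrator uses an adaptive step size and double precision by default, which is exactly what the analysis above asked for.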
The next step is clear: validation, validation, validation. As a first test, for
the two-body problem, we can choose the initial velocity for a circular orbit.
The resulting trajectory is plotted in Figure 5.4(a) and is indeed circular. In
fact, in this plot the trajectory goes around in a circular orbit one hundred
times, but the accuracy of the calculation and the plotting is so high that it
appears to be a single circle.
Figure 5.4(b) shows the total energy (potential energy plus kinetic energy),
which must be constant with time. This plot is only good enough for the
investigator's own purposes, because it has no axis labels, so let it be said that it
shows the total energy as a function of time. For the initial values chosen, the
energy is −0.5. At this point it has to be admitted that this model calculation
was carried out with G = 1, mSun = 1, mJupiter = 1/1000, and rJupiter = 1.
Variations in energy occur in the sixth digit after the decimal point. Even
after hundreds of orbits the error in the energy is still only on the order of
10−6 . Any numerical solution of an ODE will involve discretization errors,
so this amounts to a successful test of the numerical integrator. What could
be a matter of concern is the systematic trend in energy, which turns out to
continue to decrease even beyond the time interval shown in the plot. The
numerical orbit spirals slowly inward, in violation of energy conservation. The
Earth has moved around the Sun many times (judged by its age, over four and
a half billion times), so if we were interested in such long integration periods,
we would face a problem with the accuracy of the numerical solver.
Another case that is easy to check is a parabolic orbit. We can check that
the resulting trajectory “looks like” a parabola, but without much extra work
the test can be made more rigorous. A parabola can be distinguished from a
hyperbolic or strongly elliptic trajectory by checking that its speed approaches
zero as the body travels toward infinity.
Trajectories in a 1/r potential repeat themselves. This is not the case
for trajectories with potentials slightly different from 1/r. This is how Isaac
Newton knew that the gravitational potential decays as 1/r, without ever
measuring the gravitational potential as a function of r. The mere fact that the
numerical solution is periodic, as in an elliptic orbit, is a nontrivial validation.
Once we have convinced ourselves of the validity of the numerical solutions to
the two-body problem, we can move on to three bodies. We will consider the
restricted three-body problem in a plane. The mass of Jupiter is about 1,000
times lower than that of the Sun, and the mass of the third body is negligible
compared to Jupiter’s.
After all these steps, we are prepared to address the original issue of cap-
ture of a visiting interstellar object. Figure 5.5(a) shows an example of three-
body motion, where one body from far away comes in and is deflected by
the orbiting planet such that it begins to orbit the sun. Its initial conditions
are such that without the influence of the planet the object would escape to
infinity. Jupiter captures comets in this manner.
For deeper insight, the total energy as a function of distance from the sun
is shown in Figure 5.5(b). The horizontal axis is distance from the sun, and
the third body starts from a large distance, so as time proceeds the graph has
to be followed from right to left. This plot shows that the initial total energy is
positive; in other words the body’s initial kinetic energy is larger than its initial
potential energy, so without the planet, the body would ultimately escape the
sun’s gravity. The third body loses energy at the same distance as the second
body’s orbit and transitions into a gravitationally bound orbit. During this
interaction the variable step size of the numerical integrator is again of great
benefit. Ultimately, the body ends up with negative total energy, which implies
it is gravitationally bound to the sun.
Is this numerical answer reliable? After all, there is rapid change during
the close encounter of Jupiter with the third body. To check, we can conduct
a number of sensitivity tests. For a sensitivity test, we do not have to carry
out a bigger calculation, which might strain the computer; we can instead carry
out a smaller one. A respectable ODE integrator will have some sort
of accuracy parameter or goal the user can specify. We can decrease this
parameter to see whether the answer changes. For our standard parameters
the energy is −0.9116. To test the result’s sensitivity to the accuracy goal
of the integrator, we reduce this parameter by an order of magnitude, and
obtain −0.9095. The result has changed. Now it is appropriate to increase the
accuracy goal by an order of magnitude and obtain −0.9119. There is still a
change compared to −0.9116, but it is smaller; it takes place in the fourth digit
after the decimal point instead of the second digit. It is okay for the answer
to change with the accuracy of the calculation, as long as it approaches a
fixed value, especially since the main conclusion, that an object with these
specific initial conditions gets captured, only requires that the total energy be
below zero. A “convergence test”, where we check that the answer approaches
a constant value, is a more elaborate version of a sensitivity test. Chapter 6
will treat convergence tests in detail.
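Such a sensitivity test is easy to automate. Continuing the two-body sketch from earlier (again with our own variable names), one can simply rerun the integration with different accuracy goals and watch whether the answer settles down:

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):                                 # two-body problem, G = 1, m_sun = 1
    x, vx, yy, vy = y
    r3 = (x**2 + yy**2)**1.5
    return [vx, -x/r3, vy, -yy/r3]

y0 = [1.0, 0.0, 0.0, 1.0]
for tol in [1e-6, 1e-9, 1e-12]:
    sol = solve_ivp(rhs, [0, 100*2*np.pi], y0, rtol=tol, atol=tol)
    x, vx, yy, vy = sol.y[:, -1]
    E = 0.5*(vx**2 + vy**2) - 1.0/np.hypot(x, yy)
    print(tol, E)                              # should approach a fixed value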
The single most important point to remember from this exercise is that
we proceed from simple to complex and validate our numerical computations
along the way. The pace and detail at which this should proceed depends on
the familiarity and confidence of the user in the various components of the
task. Computational scientists not only write programs, they also build con-
clusions. In this aspect the expectation from computation-based research is
no different from experiment-based research, which demands control exper-
iments, or theory-based research, where equations must be consistent with
known answers in special cases.
EXERCISES
5.1 For the kicked rotator (αn+1 = αn + ωn T , ωn+1 = ωn + K sin αn+1 ),
5.2 Solve Kepler’s equation with Newton’s method. The Kepler equation is
M = E − e sin E, where the so-called “mean anomaly” M is linear in
time, E is the eccentric anomaly, and e is the eccentricity of the orbit.
The distance from the sun is r = a(1 − e cos E), so solving Kepler’s
equation provides r for a given time.
CHAPTER 6
Approximation Theory
f′(x) = [f(x + h) − f(x − h)] / (2h) + O(h²).
The center difference is accurate to O(h2 ), not just O(h) as the one-sided
differences are, because the f ′′ terms in the Taylor expansions of f (x + h) and
f (x − h) cancel:
f(x + h) = f(x) + f′(x) h + f′′(x) h²/2 + f′′′(x + ϑ₊) h³/3!
f(x − h) = f(x) − f′(x) h + f′′(x) h²/2 − f′′′(x + ϑ₋) h³/3!
The Taylor expansion is written to one more order than may have been deemed
necessary; otherwise, the cancellation of the second-order terms would have
gone unnoticed.
The center point, f (x), is absent from the difference formula, and at first
sight this may appear awkward. A parabola fitted through the three points
f (x + h), f (x), and f (x − h) undoubtedly requires f (x). However, it is easily
shown that the slope at the center of such a parabola is independent of f (x)
(Exercise 6.1). Thus, it makes sense that the center point does not appear in
the finite difference formula for the first derivative.
The second derivative can also be approximated with a finite difference
formula, f ′′ (x) ≈ c1 f (x + h) + c2 f (x) + c3 f (x − h), where the coefficients c1 ,
c2 , and c3 can be determined with Taylor expansions. This is a general method
to derive finite difference formulae. Each order of the Taylor expansion yields
one equation that relates the coefficients to each other. After some calculation,
we find
f′′(x) = [f(x − h) − 2f(x) + f(x + h)] / h² + O(h²).
With three coefficients, c1 , c2 , and c3 , we only expect to match the first three
terms in the Taylor expansions, but the next order, involving f ′′′ (x), van-
ishes automatically. Hence, the leading error term is O(h4 )/h2 = O(h2 ). A
mnemonic for this expression is the difference between one-sided first deriva-
tives:
f′′(x) ≈ [ (f(x+h) − f(x))/h − (f(x) − f(x−h))/h ] / h
It is called a mnemonic here, because it does not reproduce the order of the
error term, which would be O(h)/h = O(1), although it really is O(h2 ).
With more points (a larger “stencil”) the accuracy of a finite-difference
approximation can be increased, at least as long as the high-order derivative
that enters the error bound is not outrageously large. The error term involves
a higher power of h, but also a higher derivative of f . Reference books provide
the coefficients for various finite-difference approximations to derivatives on
one-sided and centered stencils of various widths.
u(x) = [f(x + h) − f(x − h)] / (2h)
With a Taylor expansion we can immediately verify that u(x) = f′(x) + O(h²).
For small h, this formula therefore provides an approximation to the first
derivative of f. When the resolution is doubled, the discretization error, O(h²),
decreases by a factor of 4. Since the error decreases with the square of the
interval h, the method is said to converge with “second order.” In general, when
the discretization error is O(hp ) then p is called the “order of convergence” of
the method.
The resolution h can be expressed in terms of the number of grid points
N , which is simply inversely proportional to h. To verify the convergence of a
numerical approximation, the error can be defined as some overall difference
between the solution at resolution 2N and at resolution N that we denote
with u_2N and u_N. Ideally we would use the difference from the exact solution, but
the solution at infinite resolution is usually unavailable, because otherwise we
would not need numerics. "Norms" (denoted by ‖·‖) provide a general notion
of the magnitude of numbers, vectors, matrices, or functions. One example of
a norm is the root-mean-square
‖y‖ = [ (1/N) Σ_{j=1}^{N} (y(jh))² ]^(1/2)
If the approximation converges with order p, the errors at successive resolutions obey

lim_{N→∞} ‖u_N − u_{N/2}‖ / ‖u_{2N} − u_N‖ → 2^p
Table 6.1 shows a convergence test for the center difference formula shown
above applied to an example function. The error E(N ) = ku2N −uN k becomes
indeed smaller and smaller with a ratio closer and closer to 4.
The table is all that is needed to verify convergence. For deeper insight,
however, the errors are plotted for a wider range of resolutions in Figure 6.1.
The line shown has slope −2 on a log-log plot and the convergence is over-
whelming. The bend at the bottom is the roundoff limitation. Beyond this
resolution the leading error is not discretization but roundoff. If the resolu-
tion is increased further, the result becomes less accurate. For a method with
high-order convergence this roundoff limitation may be reached already at
modest resolution. A calculation at low resolution can hence be more accu-
rate than a calculation at high resolution!
TABLE 6.1 Convergence test for the first derivative of the function f (x) =
sin(2x −0.17) + 0.3 cos(3.4x + 0.1) in the interval 0 to 1 and for h = 1/N .
The error (second column) decreases with increasing resolution and the
method therefore converges. Doubling the resolution reduces the error by
a factor of four (third column), indicating the finite-difference expression
is accurate to second order.
N E(N ) E(N/2)/E(N )
20 0.005289
40 0.001292 4.09412
80 0.0003201 4.03556
160 7.978E-05 4.01257
(Figure 6.1 plots the error against the number of grid points N, from 10¹ to 10⁷, and against the corresponding spacing h, on logarithmic axes.)
At the bend, h is about 5 × 10⁻⁶, where the roundoff error ε/(2h) is
approximately 10⁻¹⁶/(2 × 5 × 10⁻⁶) = 10⁻¹¹. The total error is the sum of
discretization error and roundoff error, O(h²) + O(ε/h), where ε ≈ 10⁻¹⁶. If
we are sloppy and consider the big-O not merely as smaller than a constant
times its argument, but as an asymptotic limit, then the expression
can be differentiated: O(2h) − O(ε/h²). The total error is a minimum when
h = O(ε^(1/3)) = O(5 × 10⁻⁶). This agrees perfectly with what is seen in Figure 6.1.
The same convergence test can be used not only for derivatives, but also
for integration schemes, differential equations, and anything else that ought
to become more accurate as a parameter, such as resolution, is changed.
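A convergence test of the kind shown in Table 6.1 takes only a few lines. The sketch below uses the same test function as the table; the exact numbers depend on implementation details, but the error ratio should approach 4:

import numpy as np

def f(x):
    return np.sin(2*x - 0.17) + 0.3*np.cos(3.4*x + 0.1)

def deriv(N):                    # center-difference first derivative, h = 1/N
    h = 1.0/N
    x = np.linspace(0, 1, N+1)
    return (f(x+h) - f(x-h))/(2*h)

Eold = None
for N in [20, 40, 80, 160]:
    u2N, uN = deriv(2*N)[::2], deriv(N)      # compare on the common grid
    E = np.sqrt(np.mean((u2N - uN)**2))      # root-mean-square difference
    print(N, E, '' if Eold is None else Eold/E)
    Eold = E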
which is indeed the sum of the function values. The boundary points carry only
half the weight. This summation formula is called the “composite trapezoidal
rule.”
Instead of straight lines it is also possible to imagine the function values
are interpolated with quadratic polynomials. Fitting a parabola through three
points and integrating, one obtains
∫_{x₀}^{x₂} f(x) dx ≈ (h/3)(f₀ + 4f₁ + f₂)
For a parabola the approximate sign becomes an exact equality. This inte-
gration formula is well-known as “Simpson’s rule.” Repeated application of
Simpson’s rule leads to
∫_{x₀}^{x_N} f(x) dx ≈ (h/3)[f₀ + 4f₁ + 2f₂ + 4f₃ + 2f₄ + ... + 4f_{N−1} + f_N].
less accurate results than the trapezoidal rule, which is the penalty for the un-
equal coefficients. In the end, simple summation of function values is an excel-
lent way of integration in the interior of the domain. Thinking that parabolas
better approximate the area under the graph than straight lines is an illusion.
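Both composite rules amount to one line of code once the function values are tabulated. A sketch (the integrand is an arbitrary choice of ours):

import numpy as np

def f(x):
    return np.exp(-x**2)

N = 100                                      # number of intervals (even for Simpson)
x, h = np.linspace(0, 1, N+1, retstep=True)
y = f(x)

trapezoidal = h*(0.5*y[0] + y[1:-1].sum() + 0.5*y[-1])
simpson = h/3*(y[0] + 4*y[1:-1:2].sum() + 2*y[2:-1:2].sum() + y[-1])
print(trapezoidal, simpson)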
Often these norms are defined without the 1/N factors. Either way, they are
norms.
The following properties define a norm:
• ‖y‖ ≥ 0, and ‖y‖ = 0 only if y = 0
• ‖αy‖ = |α| ‖y‖ for any scalar α
• ‖y + z‖ ≤ ‖y‖ + ‖z‖ (the triangle inequality)
The 1-norm is the area under the graph. The 2-norm is the integral over
the square of the deviations. The supremum norm is sometimes also called
∞-norm, maximum norm, or uniform norm. (Those who do not remember
the definition of supremum as the least upper bound can think of it as the
maximum.) Ultimately, these function values will need to be represented by
discretely sampled values, but the difference to the vector norms above is that
the number of elements is not fixed; it will increase with the resolution at
which the function is sampled.
For the norm of functions it is no longer true that convergence in one norm
implies convergence in all norms. If y differs from zero for only a single argument,
the area under the function is zero, ‖y‖₁ = 0, but the supremum (maximum)
is not, ‖y‖∞ > 0. Hence, for functions there is more than one notion of
“approximation error”.
Norms, which reduce a vector of deviations to a single number that char-
acterizes the deviation, can also be used to define a condition number, which
is simply the proportionality factor between the norm of the input error and
the norm of the output error.
Norms can also be calculated for matrices, and there are many types of
norms for matrices, that is, functions with the three defining properties above.
All that will be pointed out here is that matrix norms can be used to calculate
the condition number of a linear system of equations.
Figure 6.2 shows a polynomial of high degree fitted through equally spaced
points of the function f(x) = 1/(1 + 25x²). The data points all lie exactly on
the polynomial, but the polynomial oscillates excessively near the boundary of
the domain. This oscillatory problem is known as the "Runge phenomenon."
Consequently, using a polynomial of high degree to approximate a function
can be a bad idea.
(Figure 6.2 shows the function f(x) = 1/(1 + 25x²) and its high-degree polynomial interpolation on the interval −1 ≤ x ≤ 1.)
(The oscillations can be avoided with interpolation points that are not equally
spaced; so-called Chebyshev polynomials are very good at this, and the roots of
Chebyshev polynomials are spaced more closely toward the two edges.)
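The Runge phenomenon is easy to reproduce. The following sketch (degree and number of sample points are our own choices) fits a polynomial through equally spaced samples of the function from Figure 6.2 and measures the largest deviation:

import numpy as np

f = lambda x: 1.0/(1.0 + 25.0*x**2)
xi = np.linspace(-1, 1, 13)                  # 13 equally spaced points
p = np.polyfit(xi, f(xi), 12)                # interpolating polynomial of degree 12

xf = np.linspace(-1, 1, 401)
err = np.max(np.abs(np.polyval(p, xf) - f(xf)))
print(err)                                   # large, due to oscillations near the edges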
An alternative to approximation with polynomials is to approximate a
function f (x) with a series of trigonometric functions, a topic that will be
dealt with in chapter 7.
EXERCISES
6.1 Show mathematically that a parabola p(x) that passes through three
equally spaced points y1 = p(−h), y2 = p(0), y3 = p(h) has slope (y3 −
y1 )/(2h) at the center point x = 0. In other words, the slope at y2 does
not depend on y2 .
6.2 Finite-difference expression for unequally spaced points.
a. Based on a stencil of three points (0, x1 , x2 ), find an approximation
of the first derivative at x = 0, that is, find the coefficients for f ′ (0) =
c0 f (0) + c1 f (x1 ) + c2 f (x2 ).
b. Determine the order of the error term.
c. Implement and verify with a numerical convergence test that the
finite-difference approximation converges.
d. Also demonstrate that it converges at the anticipated rate.
6.3 Show that in the neighborhood of a simple root, Newton’s method con-
verges quadratically.
6.4 Error in numerical integration
a. Implement a simple trapezoidal integrator.
b. Numerically evaluate ∫₀¹ cos(2π(1 − x²) + 1/2) dx, and determine the
order of convergence as a function of spatial resolution.
c. For ∫₀¹ cos(2π(1 − x²)) dx, the integrand has vanishing first derivatives
on both boundaries. Again determine the order of convergence.
6.5 Carry out Exercise 5.2 and add a convergence test for part c.
6.6 Derive a formula for the integral of a cubic polynomial in the interval 0
to b. Demonstrate that applying Simpson’s rule to it will give the exact
integral.
6.8 Learn how to carry out spline interpolation within a software tool or
language of your choice, and demonstrate that the function in Figure 6.2,
f (x) = 1/(1 + 25x2 ) on the interval −1 to +1, can be approximated
without the large oscillations.
CHAPTER 7
Other Common
Computational Methods
Here we describe methods and issues for some of the most commonplace com-
putational problems that have not already been described. Root finding was
discussed in chapter 2 and methods of linear algebra will be described in chap-
ter 10. This chapter is dedicated to a handful of other problems so important
that they cannot be omitted.
which can be explicitly solved for a and b. The results are the well-known
formulae
a = [(Σ y_i)(Σ x_i²) − (Σ x_i)(Σ x_i y_i)] / [N Σ x_i² − (Σ x_i)²]   and   b = [N Σ x_i y_i − (Σ x_i)(Σ y_i)] / [N Σ x_i² − (Σ x_i)²]
where the sums go from i = 1, ..., N .
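In a program, these formulae amount to a handful of sums. A sketch with made-up data values:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # made-up data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
N = len(x)

den = N*np.sum(x**2) - np.sum(x)**2
a = (np.sum(y)*np.sum(x**2) - np.sum(x)*np.sum(x*y)) / den   # intercept
b = (N*np.sum(x*y) - np.sum(x)*np.sum(y)) / den              # slope
print(a, b)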
The popularity of linear regression is partially due to the computational
convenience with which the fit parameters can be obtained: there is an explicit
formula for the coefficients. Minimizing with another weighting function
would mean the equations for a and b are no longer linear. To convince our-
selves of this, suppose the error or deviation d is weighted by an arbitrary
case above. But if w′ is nonlinear, then two coupled nonlinear equations need
to be solved to obtain the parameters a and b. As we learned in chapter 2,
numerical nonlinear root-finding does not always guarantee a solution and
there could be more than one local minimum for E. Alternatively, one can
directly minimize the error E using numerical optimization (minimum-finding)
methods, but conceptually this approach has limitations similar to those of
root-finding: the search could get stuck in a local minimum, so we cannot be
certain the optimum found is global rather than only local.
This little exercise provides the following lessons: 1) Minimizing the square
deviations from a straight line is a rare situation for which the fit parameters
can be calculated with explicit formulae. Without electronic computers, it had
to be that or eye-balling the best fit on a graph. 2) For regression with non-
quadratic weights the fit obtained with a numerical search algorithm is not
necessarily unique.
Minimizing the sum of the square deviations also has the following funda-
mental property: it yields the most likely fit for Gaussian distributed errors.
We will prove this property, because it also exemplifies maximum likelihood
methods. As a warm-up, we prove the following simpler statement: The sum
of (xi − x)2 over a set of data points xi is minimized when x is the sample
mean. All we have to do is
∂/∂x Σ_i (x_i − x)² = 0
which immediately yields Σ_i x_i = Σ_i x = N x,
and therefore x is the arithmetic mean.
Suppose the values are distributed around the model value with a proba-
bility distribution p. The sample value is yi and the model value is a function
y(xi ; a, b, ...), where a, b, ... are the fit parameters. A straight line fit would
be y(xi ; a, b) = a + bxi . We may abbreviate the deviation for data point i as
di = y(xi ; a, b, ...) − yi . The probability distribution is then p(di ), the same p
for each data point. The maximum likelihood fit maximizes the product of all
these probabilities P = Πi pi where pi = p(di ). An extremum with respect to a
occurs when ∂P/∂a = 0, and according to the product rule for differentiation
∂P/∂a = ∂/∂a Π_i p_i = P Σ_i (1/p_i) ∂p_i/∂a
Here the transform is written in the complex domain, but it could also be
written in terms of trigonometric functions, as we will do below.
Fourier transforms can be calculated more quickly than one may have
thought, with a method called the Fast Fourier Transform (FFT), a matter
that will be revisited in chapter 10.
Polynomial interpolation was discussed in chapter 6. An alternative is to
approximate a (real-valued) function f (x) with a series of trigonometric func-
tions,
f̃(x) = a₀/2 + Σ_{k=1}^{∞} [a_k cos(kx) + b_k sin(kx)]
We denote the Fourier representation of f (x) with f˜(x) in case the two can-
not be made to agree exactly. (After all, countably many coefficients cannot
be expected to provide a perfect match to a function everywhere, because a
function defined on a continuous interval is made up of uncountably many
points.)
Trigonometric functions obey an orthogonality relation:
∫_{−π}^{π} cos(kx) cos(mx) dx = 2π for m = k = 0,  π for m = k ≠ 0,  0 for m ≠ k
Consider a step-wise function f (x) = 1 for |x| < π/2 and f (x) = 0 for
|x| > π/2 (Figure 7.1). According to the above formula, the Fourier coefficients
are
a_k = (1/π) ∫_{−π}^{π} f(x) cos(kx) dx = (1/π) ∫_{−π/2}^{π/2} cos(kx) dx = (1/π) [sin(kx)/k]_{x=−π/2}^{π/2} = 2 sin(kπ/2)/(kπ)
and further
a₀ = 1,  a_k = 0 for k even (k ≠ 0),  a_k = (2/(kπ)) (−1)^((k−1)/2) for k odd.
And bk = 0, because f (x) is even.
Figure 7.1 shows the function and the first 50 terms of its Fourier rep-
resentation. It does not represent the function everywhere. By design, the
Fourier approximation goes through every required point, uniformly spaced,
but near the discontinuity it is a bad approximation in between grid points.
This is known as the “Gibbs phenomenon”, and occurs for approximations
of discontinuous functions by Fourier series. This problem is similar to the
“Runge phenomenon” that can occur for polynomials (Figure 6.2).
FIGURE 7.1 The Gibbs phenomenon. The Fourier series (solid line) of a
discontinuous function (dotted line) does not approximate the discontinuous
function everywhere, no matter how many terms in the series are used.
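The partial sums in Figure 7.1 can be generated with a few lines of Python (a sketch; the 50 terms match the figure, and plotting commands are omitted):

import numpy as np

x = np.linspace(-np.pi, np.pi, 1001)
fs = 0.5*np.ones_like(x)                     # a_0/2
for k in range(1, 51):                       # first 50 terms of the series
    ak = 2*np.sin(k*np.pi/2)/(k*np.pi)
    fs += ak*np.cos(k*x)
print(fs.max())                              # overshoots 1 near the jumps (Gibbs phenomenon)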
steps. Vice versa, for the same accuracy a larger time step can be taken with
the two-step method, so its computational efficiency is better than that of the
Euler method. The mid-point method is also known as 2nd-order Runge-Kutta
method.
Higher order methods can be constructed by using more function evalua-
tions from the past. And lower order steps can be used to launch the higher
order method, which would otherwise require values for the solution from
before the start time.
Accuracy is not all we desire from an ODE integrator. An estimate of
the discretization error is much desired, and a variable step size can greatly
increase efficiency, plus it allows adjusting the step size to meet an accuracy
goal. Error control is achieved by evaluating the outcome of the integration
at two different resolutions, and that difference is used as proxy for the error.
These error estimates can in turn be used to decide when it is time to increase
or decrease the time step, something called “adaptive step size.” And all this is
ideally achieved at excellent computational efficiency, with a minimum number
of function evaluations.
A most popular choice of a numerical integrator that efficiently combines
all this desired functionality is the 4th order Runge-Kutta method. We will
not study the details of these integrators, because there are plenty of excellent
implementations around. Suffice it to say that Runge-Kutta methods use finite
difference approximations for the derivative and can adapt step size based
on local error estimates. There are, of course, other types of integrators, and
some ODEs require other methods, but 4th order Runge-Kutta is an excellent
general purpose numerical integrator.
Nevertheless, the Euler method often comes in handy. For higher order
methods, more arguments have to be passed to the integrator, along with
any parameters f may depend on. Although straightforward to implement,
this may require significant additional coding effort, plus additional validation
effort. Overall, it may be less effort to use the simplest implementation and
choose a small time step. Also, if f is not sufficiently smooth, then higher
orders will not provide higher accuracy, because their error terms involve
higher derivatives.
Now that basic thoughts about the numerical integration of ODEs have
been expressed, we continue with a few comments that expand upon the nu-
merics of ODEs in several directions.
A second order differential equation can be reduced to two first order
differential equations by introducing a new variable y2 = y ′ , where y can
be thought of as y1 . The differential equation y ′′ = f (t, y, y ′ ) then becomes
y2′ = f (t, y1 , y2 ), and combined with y1′ = y2 , they make up two coupled first
order ODEs. More generally, a high order differential equation can always be
reduced to a system of coupled first order differential equations. Runge-Kutta
methods and other common methods are readily generalized to systems of
first-order equations by letting the derivative become multi-dimensional.
Numerical solution of some ODEs can be numerically unstable, in the same
way other iterations can become unstable. Consider y′ = −ay with a > 0,
discretized with a forward difference,

(y^{n+1} − y^n)/h = −a y^n
the time-stepping scheme is y^{n+1} = y^n(1 − ah). Hence the time step had better
be h < 1/a, or the numerical solution will oscillate. For h > 2/a it would
even grow instead of decay. Hence, for this situation the Euler scheme is
only “conditionally stable”; it is numerically stable for small time steps but
numerically unstable for large time steps. With a backward difference
(y^{n+1} − y^n)/h = −a y^{n+1}
the scheme becomes y^{n+1} = y^n/(1 + ah). This scheme works for any time step
h, and is hence “unconditionally stable”. For large time steps the solution will
be inaccurate, but it does not oscillate or grow.
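The difference between the two schemes is easy to observe numerically. A brief sketch, with a and h chosen (arbitrarily) so that h > 2/a:

a, h, nsteps = 1.0, 2.5, 20                  # h > 2/a: the explicit scheme is unstable
y_forward, y_backward = 1.0, 1.0
for n in range(nsteps):
    y_forward = y_forward*(1 - a*h)          # forward difference: oscillates and grows
    y_backward = y_backward/(1 + a*h)        # backward difference: decays monotonically
print(y_forward, y_backward)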
For y ′ = −ay, the solution changes on the time scale of 1/a, so for accuracy
reasons alone we would want to choose time steps smaller than that. Stiff
differential equations involve two or more time scales of very different magnitude,
and integration based on the shortest time scale may no longer be practical.
These stiff ODEs require special numerical methods that cure the instability.
Numerical integration always involves discretization errors. For differential
equations that have conserved quantities we want to avoid that these errors
accumulate steadily away from these conserved quantities. For example, for
two-body gravitational interaction the energy is conserved and an orbit that
would spiral in or out with time would betray the qualitative character of the
solution, as seen in Figure 5.4(b). Special-purpose integrators can be designed
with a discretization chosen to minimize the error of conserved quantities. In
the case of gravitational interaction these are called “symplectic integrators”
(where “symplectic” refers to the mathematical structure of the equations that
describe Hamiltonian systems). Symplectic integrators are a special purpose
method, but they illustrate a more general approach, namely that the freedom
we have in choosing among various finite-difference approximations can be
exploited to gain something more.
to sit down and work out a partial fraction decomposition—a tedious pro-
cedure. Instead, we can get answers promptly using symbolic computation
software.
A sample session with the program Mathematica:
/* simple indefinite integral */
In[1]:= Integrate[xˆ2,x]
In[2]:= Integrate[(4-xˆ2+xˆ5)/(1+xˆ2),x]
Out[2]= -x - x^2/2 + x^4/4 + 5 ArcTan[x] + Log[1 + x^2]/2
In[3]:= Solve[-48-80*x+20*xˆ3+3*xˆ4==0,x]
Out[3]= {{x -> -6}, {x -> -2}, {x -> -2/3}, {x -> 2}}
A 4th degree polynomial root and the integration of a rational function are
in general very tedious to obtain by hand; both were mentioned in the Brain-
teaser of chapter 1.
The next example is the finite-difference approximation for the second
derivative of a function, which can be verified with a Taylor expansion around
0 up to 5th order. Symbolic software also evaluates infinite sums:
In[5]:= Sum[1/kˆs, {k, 1, Infinity}]
Out[5]= Zeta[s]
In Python, a module called SymPy offers symbolic algebra capabilities. For
example,
>>> from sympy import *
>>> x = symbols('x')
>>> integrate(x**2, x)
x**3/3
>>> solve(-48-80*x+20*x**3+3*x**4, x)
[-6, -2, -2/3, 2]
>>> series(sin(x)/x,x,0,4)
1 - x**2/6 + O(x**4)
This command not only produced a Taylor expansion to 4th order, it also
realized that limx→0 sin(x)/x = 1 for the first term.
# infinite sum
>>> k = symbols('k')
>>> summation(k**(-4), (k,1,oo))
pi**4/90
That double-oh symbol indeeds stands for ∞, and the command symbolically
obtained the sum of the infinite series.
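SymPy can also carry out the kind of Taylor expansion used in chapter 6 to derive finite-difference formulae. A sketch for the centered second-derivative expression (the series call is standard SymPy usage; the order requested is our choice):

>>> from sympy import symbols, Function, series
>>> x, h = symbols('x h')
>>> f = Function('f')
>>> series((f(x-h) - 2*f(x) + f(x+h))/h**2, h, 0, 3)

The leading term of the printed expansion is f′′(x), in agreement with the finite-difference formula of chapter 6, and the next term is of order h².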
Symbolic computation software is efficient for certain tasks, but cannot
compete with the human mind for others. For example, simplifying expressions
is something humans are still often better at than computers. The human
brain recognizes simplifications that elude symbolic algebra packages, especially
when the same calculation is carried out by hand over and over again.
And if a symbolic computation program fails to integrate an expression
symbolically, that does not entirely exclude the possibility that an explicit formula exists.
Software packages can have bugs and mathematical handbooks can have
typos. Fortunately, bugs tend to get corrected over time, in books as well as in
software. By now symbolic computation software has matured in many ways
and is rather reliable.
where V is the volume and C a physical constant. (That constant contains in-
formation about how many elements the sum has within a continuous interval,
something that comes from deeper physics.) For two particles the partition
function is
Z(β, V, m) = C ∫ dp₁ dp₂ dr₁ dr₂ exp[ −β( p₁²/(2m) + p₂²/(2m) + u(r₁ − r₂) ) ].
The integrals over momentum could be carried out since they are Gaussian
integrals. However, a gas without interactions is an ideal gas and hence we
simply write
Z = Z_ideal ∫ dr₁ dr₂ e^(−βu(r₁−r₂))
which also absorbs the constant C. As we all know, the series expansion of an
exponential is ex = 1 + x + x2 /2! + x3 /3! + .... An expansion for small β, that
is, high temperature, yields
Z = Z_ideal ∫ dr₁ dr₂ [1 − βu₁₂ + (β²/2) u₁₂² + ...]
+ (β²/2!)(u₁₂² + u₁₃² + u₂₃² + 2u₁₂u₁₃ + 2u₁₂u₂₃ + 2u₁₃u₂₃) + ...]
Many of these integrals yield identical results (for V → ∞), so we arrive at
Z/Z_ideal = ∫ dr₁ dr₂ dr₃ [ 1 − 3βu₁₂ + 3(β²/2!)(u₁₂² + 2u₁₂u₁₃) + ... ]
To keep track of the terms, they can be represented by diagrams. For example,
∫ dr₁ dr₂ dr₃ u₁₂ = [diagram: three dots labeled 1, 2, 3, with a line connecting dots 1 and 2]
Since u12 , u13 , and u23 yield the same contribution, one diagram suffices to
represent all three terms. The full perturbation expansion for Z/Zideal in di-
agrammatic form is
[diagrams: the leading term, −3β times a one-line diagram, plus (3/2!)β² and 3β² times the two-line diagrams corresponding to u₁₂² and u₁₂u₁₃, and so on]
The number of dots is the number of particles. The number of lines corre-
sponds to the power of β, that is, the order of the expansion. Every diagram
has a multiplication factor corresponding to the number of distinct possibilities
it can be drawn. In addition, it has a factorial in the denominator (from the
coefficients in the series expansion of the exponential function) and a binomial
prefactor (from the power).
Finally, unconnected dots can be omitted, as all required information is al-
ready contained otherwise. The diagrammatic representation of the expansion
simplifies to
1 − 3β × [one-line diagram] + (3/2!)β² × [diagram for u₁₂²] + 3β² × [diagram for u₁₂u₁₃] + ...
[diagram: a disconnected diagram equals the product of its pieces]
EXERCISES
7.1 Derive the equations for linear regression without intercept, y = kx,
i.e., given data pairs (x_i, y_i), i = 1, ..., N, find the equation for k that
minimizes the sum of quadratic deviations, Σ_{i=1}^{N} (y(x_i) − y_i)².
7.2 Weighting proportional to distance improves the robustness of a fit.
This method minimizes Σ_i |y_i − a − bx_i|, where x_i and y_i are the data
and a and b are, respectively, the intercept and slope of a straight line.
Show that finding the parameters a and b requires nonlinear root-finding
in only one variable.
7.3 Show that the Fourier coefficients minimize the square-deviation be-
tween the function and its approximation. Use the trigonometric repre-
sentation.
7.4 a. Calculate the Fourier coefficients of the continuous but nonperiodic
function f (x) = x/π, on the interval [−π, π]. Show that their absolute
values are proportional to 1/k.
b. Any nonperiodic continuous function can be written as the sum of
such a ramp and a periodic function. From this, argue that contin-
uous nonperiodic functions have Fourier coefficients that decay as
1/k.
c. Calculate the Fourier coefficients of a tent-shaped function, f (x) =
1 − |x|/π on [−π, π], which is continuous and periodic, and show that
their absolute values are proportional to 1/k 2 .
7.5 Consider a numerical solver for the differential equation y ′ = −ay that
uses the average of a forward time step and a backward time step:
(y^{n+1} − y^n)/h = −a (y^n + y^{n+1})/2
a. Derive the criterion for which the numerical solution will not oscil-
late.
b. Show that this method is one order more accurate than either the
forward- or the backward-step method.
CHAPTER 8
Performance Basics & Computer
Architectures
This chapter marks the beginning of the portion of the book that is primarily
relevant to “large” computations, for which computational performance mat-
ters, because of the sheer number of arithmetic operations, the rate at which
data have to be transferred, or anything else that is limited by the physical
capacity of the hardware.
10⁹, a billion floating-point operations per second (FLOPS). (The "S" stands
for "per second" and is capitalized, whereas FLOPs is the plural of FLOP.)
We can verify that these execution speeds are indeed reality. A simple
interactive exercise with a program that adds a number a billion times to a
variable can be used to approximately measure absolute execution times. Such
an exercise also reveals the ratios listed in the table above. We simply write a
big loop (according to chapter 3, a 4-byte integer goes up to about 2 × 10⁹, so
a counter up to 109 does not overflow yet), compile the program, if it needs
to be compiled, and measure the runtime, here with the Unix command time,
which outputs the seconds consumed.
> gfortran speed.f90 % 10ˆ9 additions
> time a.out
2.468u 0.000s 0:02.46
One billion additions in 2.46 seconds demonstrates that our program indeed
processes at a speed on the order of one GigaFLOPS. If optimization is turned
on during compilation, execution is about three times faster in this example,
achieving a billion additions in less than a second:
> gfortran -O speed.f90 % 10ˆ9 additions with optimization
> time a.out
0.936u 0.000s 0:00.93
While we are at it, we can also check how fast Python is at this. Here we only
loop up to 108 instead of 109 ; otherwise, we would wait for a while.
> time python speed.py % 10ˆ8 additions
6.638u 0.700s 0:07.07
The result reveals that Python is far slower than Fortran, as already stated
in chapter 4.
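The Python test program is nothing more than a plain loop. The file speed.py is not listed here, but it could be as simple as the following sketch:

a = 0.0
for i in range(100000000):     # 10^8 additions
    a = a + 1.0
print(a)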
In terms of standard numerical tasks, one second is enough time to solve
a linear system with thousands of variables or to take the Fourier Transform
of a million points. In conclusion:
Arithmetic is fast!
In fact, arithmetic is so fast that the bottleneck for numerical computations
often lies elsewhere. In particular, for a processor to compute that fast, it also
needs to be fed numbers fast enough.
There are basically four limiting physical factors to computations:
• Processor speed
• Memory size
• Data transfer between memory and processor
• Input and output
A true-false flag only requires one bit, but it occupies at least one byte, so it still consumes more
memory than necessary.
According to the above rule, the total memory consumed is simply the
memory occupied by the variable type multiplied by the number of elements.
There is an exception to this rule for calculating required memory. For a
compound data type, and many languages offer compound data types, the
alignment of data can matter. If, in C notation, a data type is defined as
struct {int a; double b}, it might result in unused memory. Instead of
4+8 bytes, it may consume 8+8 bytes. Compilers choose such alignments in
compound data types to make memory accesses faster, even though it can
result in unused memory.
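In Python, the effect of alignment can be observed with the ctypes module, which follows the platform's native alignment rules (a sketch; the structure name is our own):

import ctypes

class Pair(ctypes.Structure):
    # a 4-byte int followed by an 8-byte double
    _fields_ = [("a", ctypes.c_int), ("b", ctypes.c_double)]

print(ctypes.sizeof(Pair))     # commonly 16 rather than 12, because of alignment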
Programs can access more (virtual) memory than physically exists on the
computer. If the data exceed the available memory, the hard drive is used for
temporary storage (swap space). Reading and writing from and to a hard drive
is orders of magnitude slower than reading from memory, so this slows down
the calculation dramatically. If the computer runs out of physical memory, it
will try to place those parts in swap space that are not actively used, but if
the CPU needs data that are not in physical memory, they have to be fetched
from disk, and the CPU will sit idle for many, many clock cycles until those data arrive.
Memory is organized in blocks of fixed length, called “pages.” If the data do
not exist in memory, then the page needs to be copied into memory, which
delays execution in the CPU.
Data transfer
Data transfer between memory and processor. When a processor carries out
instructions it first needs to fetch necessary data from the memory. This is
a slow process, compared to the speed with which the processor is able to
compute. This situation is known as the “processor-memory performance gap”.
A register is a memory unit located on the central processing unit that can
be accessed promptly, within one clock cycle. Registers are a very small and
very fast storage unit on which the processor operates.
To speed up the transport of data a “cache” (pronounced “cash”) is used,
which is a small unit of fast memory. Frequently used data are stored in the
cache to be quickly accessible for the processor. Data are moved from main
memory to the cache not byte by byte but in larger units of “cache lines,”
assuming that nearby memory entries are likely to be needed by the processor
soon (assumption of “spatial locality”). Similarly, “temporal locality” assumes
that if a data location is referenced then it will tend to be referenced again
soon. If the processor requires data not yet in the cache, a “cache miss” occurs,
which leads to a time delay.
A hierarchy of several levels of caches is customary. Table 8.2 provides an
overview of the memory hierarchy and the relative speed of its components.
The large storage media are slow to access. The small memory units are fast.
(Only one level of cache is included in this table. Higher levels of cache have
access times in between that of the level-1 cache and main memory.) Figure 8.1
illustrates the physical location of memory units within a multi-core CPU.
TABLE 8.2 Memory hierarchy and typical relative access times (in units
of clock cycles).
Registers 1
Level 1 Cache 4
Main memory 120
Magnetic disk 10⁷
Data transfer to and from hard drive. Currently there are two common
types of hard drives: solid state drives (SSD) and magnetic drives, also known
as hard disk drives (HDD). Magnetic drives involve spinning disks and me-
chanical arms. Non-volatile flash (solid-state) memory is faster (and more
expensive per byte) than magnetic drives, and slowly wears out over time.
The time to access the data on a hard drive consists of seek time (latency)
and data transfer time (bandwidth). Reading or writing to or from a magnetic
disk takes as long as millions of floating-point operations. The majority of this
time is for the head, which reads and writes the data, to find and move to
the location on the disk where the data are stored. Consequently data should
be read and written in big chunks rather than in small pieces. In fact the
computer will try to do so automatically. While a program is executing, data
written to a file may not appear immediately. The data are not flushed to the
disk until they exceed a certain size or until the file is closed.
Table 8.3 shows the access times (latency) for main memory and both
types of hard drives. These times are best understood in comparison to CPU
speed, already discussed above.
TABLE 8.3 Execution and access times for basic computer components.
Most of these numbers are cited from Patterson & Hennessy (2013).
CPU 1 ns
Main Memory 50–70 ns
Hard drive Solid-state 0.1 ms = 10⁵ ns
Magnetic disk 5–10 ms = 10⁷ ns
Input/output to other media. Table 8.3 reveals how critical the issue of
data transfer is, both between processor and memory and between memory
and hard disk. Input and output are relatively slow on any medium (magnetic
harddisk, display, network, etc.). Writing on the display is a particularly slow
process; excesses thereof can easily delay the program. A common beginner’s
mistake is to display vast amounts of output on the display, so that data scroll
down the screen at high speed; this slows down the calculation (Exercise 8.3).
Input/output can be a limiting factor due to the data transfer rate, but
also due to sheer size. Such problems are data-intensive instead of compute-
intensive.
Table 8.4 shows an overview of potential resource constraints. The current
chapter deals with compute-intensive problems. Data-intensive problems will
be dealt with in chapters 13 and 14. When neither is a significant constraint,
then the task of solving a computational problem is limited by the time it takes
a human to write a reliable program. Most everyday scientific computing tasks
are of this kind.
The lines of written program code are ultimately translated into hardware-
dependent “machine code”. For instance, the following simple line of code adds
two variables: a=i+j. Suppose i and j have earlier been assigned values and
are stored in memory. At a lower level we can look at the program in terms
of its “assembly language,” which is a symbolic representation of the binary
sequences the program is translated into:
lw $8, i
lw $9, j
add $10, $8, $9
sw $10, a
The values are pulled from main memory to a small memory unit on the
processor, called “register,” and then the addition takes place. In this example,
the first line loads variable i into register 8. The second line loads variable
j into register 9. The next line adds the contents of registers 8 and 9 and
stores the result in register 10. The last line copies the content of register
10 to variable a, that is, its memory address. There are typically about 32
registers; they store only a few hundred bytes. Arithmetic operations, in fact
most instructions, only operate on entries in the registers. Data are transferred
from main memory to the registers, and the results of operations written back
out to memory. A hardware-imposed universal of numerical computations is
that operations have no more than two arguments, each of fixed bit length.
At the level of the assembly language or machine code there is no dis-
tinction between data of different types. Floating-point numbers, integers,
characters, and so on are all represented as binary sequences. What number
actually corresponds to the sequence is a matter of how it is interpreted by the
instruction. There is a different addition operation for integers and floats, for
example. (This sheds light on what happens if a variable of the wrong type is
passed to a function. Its bit pattern ends up not only slightly misinterpreted,
but completely misinterpreted.)
Instructions themselves, like lw and add, are also encoded as binary se-
quences. The meaning of these sequences is hardware-encoded on the proces-
sor. These constitute the “instruction set” of the processor. (An example is the
“x86” instruction set, which is commonly found on computers today.) When
a program is started, it is first loaded into memory. Then the instructions are
executed.
During consecutive clock cycles the processor needs to fetch the instruc-
tion, read the registers, perform the operation, and write to the register. De-
pending on the actual hardware these steps may be split up into even more
substeps. The idea of “pipelining” is to execute every step on a different, ded-
icated element of the hardware. The next instruction is already fetched, while
the previous instruction is at the stage of reading registers, and so on. Effec-
tively, several instructions are processed simultaneously. Hence, even a single
processor core tries to execute tasks in parallel, an example of instruction-level
parallelism.
The program stalls when the next instruction depends on the outcome of
the previous one, as for a conditional statement. Although an if instruction
itself is no slower than other elementary operations, it can stall the program
in this way. In addition, an unexpected jump in the program can lead to
cache misses. For the sake of computational speed, programs should have a
predictable data flow.
The program in Table 8.5 illustrates one of the practical consequences of
pipelines. The dependence of a variable on the previous step leads to stalling.
Hence, additions can be accomplished at even higher speed than shown in
Table 8.1, if subsequent operations do not immediately require the result of
the addition. (The execution speed of operations that require only one or a
small number of clock cycles effectively depends on the sequence of commands
they are embedded in.)
TABLE 8.5 Although both of these Fortran loops involve two billion ad-
ditions, the version to the right is twice as fast.
do i=1,1000000000 do i=1,1000000000
a=a+12.1 a=a+12.1
a=a+7.8 b=b+7.8
end do end do
TABLE 8.6 Case study for execution times on a CPU with 3.6 GHz clock
frequency and the gcc compiler.
operation time per clock
operation (ns) cycles
double addition 0.91 3
concurrent double addition 0.26 1
float addition 2.5 9
double sqrt 6.5 23
float sqrt 4.5 16
The growth in FLOPS per processor core has slowed, causing a sea change toward multi-core
processors. This ushered in a new era for parallel computing. As someone put
it: “Computers no longer get faster, just wider.”
The ridiculously fast multi-core CPU is where essentially all scientific com-
putations take place. Hence, for tasks where performance matters, the ex-
tremely fast multi-core CPU should be what we design our programs and
algorithms for.
Computational performance is rooted in physical hardware, and the opti-
mal numerical method therefore may depend on the technology of the time.
Over the decades the bottlenecks for programs have changed with technology.
Long ago it was memory size. Memory was expensive compared to floating-
point operations, and algorithms tried to save every byte of memory possible.
Later, memory became comparatively cheap, and the bottleneck moved to
floating-point operations. For example, storing repeatedly used results in a
temporary variable rather than recalculating them each time increased speed.
This paradigm of programming is the basis of classical numerical analysis,
with algorithms designed to minimize the number of floating-point opera-
tions. Today the most severe bottleneck often is moving data from memory to
the processor. One can also argue that the bottleneck is the parallelizability
of an algorithm.
EXERCISES
8.1 Consider the dot product of two vectors
c = Σ_{i=1}^{N} x_i y_i
of points with coordinates (x, y), we can calculate the distances
√((x_i − x_j)² + (y_i − y_j)²) and then take the minimum. But because the
square root is an expensive function, it is faster to only calculate the
square of the distances and take the square root only once the minimum
is found. In a programming language of your choice:
8.4 a. One way or another, find out the following hardware specifications
of your computer: the number of CPU cores, the total amount of
memory, and the memory of each level of cache.
b. Using a programming language of your choice, determine the number
of bytes consumed by a typical floating-point number and a default
integer. In each language, a dedicated command is available that
returns this value.
CHAPTER 9
High-Performance &
Parallel Computing
FIGURE 9.1 The Fortran example to the left accesses the first index with
a stride. It also wastes memory by creating an unnecessary array for
freq2, and, worse, this array first needs to be written to memory and
then read in. All these issues are corrected in the version on the right
side, where the inner loop is over the first index and data locality is
observed.
Conditional statements, like if, should be avoided in the loop which does most of
the work. The same holds true for statements that cause a function or pro-
gram to intentionally terminate prematurely. Error, stop, or exit statements
often—not always, but often—, demand that the program exits at exactly
this point, which prevents the CPU from executing lines further down ahead
of time that are otherwise independent. Such conditional stops can cause a
performance penalty, even when they are never executed.
Obviously, large data should not be unnecessarily duplicated, and with a
bit of vigilance this can be avoided. When an array is passed to a function or
subroutine, what is passed is the address of the first element, so this does not
create a duplicate array.
“Memoization” refers to the storing of intermediate results of expensive
calculations, so they can be reused when needed. There can be a conflict be-
tween minimizing the number of arithmetic operations, total memory use,
and data motion. For example, instead of storing intermediate variables for
reuse, it may be better to re-compute them whenever needed, because arith-
metic is fast compared to unanticipated memory accesses. It is often faster to
re-compute than to re-use.
Program iteratively, not recursively. Rather than write functions or sub-
routines that call themselves recursively, it is more efficient to use iteration
loops. Every function call has an overhead, and every time the function is
called the same local variables need to be allocated.
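As a minimal illustration in Python (a made-up example, not from the text), the same sum can be written recursively or as a loop:

def total_recursive(values):
    # every call allocates a new frame and its own local variables
    if not values:
        return 0.0
    return values[0] + total_recursive(values[1:])

def total_iterative(values):
    # a single frame; the loop reuses the same local variable
    s = 0.0
    for v in values:
        s += v
    return s

For long lists the recursive version also runs into Python's recursion-depth limit, whereas the loop version does not.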
Intrinsic functions can be expected to be highly efficient. Predefined vector
reduction operations such as sum and dot product intrinsically allow for a high
degree of parallelization. For example, if a language has a dedicated command
for matrix multiplication, it may be highly optimized for this task. For the
sake of performance, intrinsics should be used when available.
Loops can be treacherously slow in very high-level languages such as
Python and Matlab. The following loop with a NumPy array
for i in range(32000):
if a[i] == 0:
a[i] = -9999
is faster and more conveniently implemented as
a[a == 0] = -9999
This is also known as “loop vectorization”.
As mentioned previously, CPUs are extremely complex and often a naive
guess about what improves or reduces computational efficiency is wrong. Only
measuring the time of actual program execution can verify a performance gain.
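For example, a quick way to compare the explicit loop above with the vectorized assignment (a sketch assuming NumPy and the standard timeit module; the array contents are arbitrary) is:

import numpy as np
import timeit

a = np.zeros(32000)

def with_loop():
    b = a.copy()
    for i in range(32000):
        if b[i] == 0:
            b[i] = -9999
    return b

def vectorized():
    b = a.copy()
    b[b == 0] = -9999
    return b

# only such measurements can confirm an actual speedup
print(timeit.timeit(with_loop, number=10))
print(timeit.timeit(vectorized, number=10))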
A profiler measures how much time is spent in each function and how often each function was called. It can even count how many times each line in the source
code was executed. If we want to know which functions consume most of the
cycles or identify the performance bottleneck, this is the way to find out.
The program may need to be compiled and linked with special options to
enable detailed profiling during program execution. And because the program
is all mangled up by the time it is machine code, those tracebacks can alter
its performance, which is why these options are off by default. Some profilers
start with the program; others can attach to and sample a process that is already running. Table 9.1 shows a sample
of the information a standard profiling tool provides.
TABLE 9.1 A fraction of the output of the profiling tool gprof. According
to these results, most of the time is spent in the subroutine conductionq
and the function that is called most often is flux_noatm.
% cumulative self
time seconds seconds calls name
51.20 7.59 7.59 8892000 conductionq_
44.66 14.22 6.62 8892000 tridag_
2.02 14.52 0.30 9781202 flux_noatm_
1.82 14.79 0.27 9336601 generalorbit_
...
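The output above comes from gprof. For a Python program, a rough counterpart (a sketch; the function main below is merely a placeholder for the actual computation) uses the built-in cProfile and pstats modules, which provide the same kind of per-function statistics:

import cProfile
import pstats

def main():
    # placeholder for the computation to be profiled
    total = 0.0
    for i in range(1, 200000):
        total += 1.0 / i
    return total

cProfile.run('main()', 'profile.out')          # collect per-function statistics
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)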
also known as Gustafson’s law. Using, as above, p = 0.99 and N = 100, a 99-times larger problem can be solved within the same runtime. This distinction is sometimes also referred to as strong versus weak scaling. Strong scaling considers a fixed problem size (Amdahl’s law) and weak scaling a fixed problem size per processor (Gustafson’s law). Weak scaling is always at least as good as strong scaling, and hence easier to accomplish.
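As a quick check on these numbers, Gustafson’s law in its standard form gives the scaled speedup on N processors with parallel fraction p as
S(N) = N − (1 − p)(N − 1)
so for p = 0.99 and N = 100, S = 100 − 0.01 × 99 = 99.01 ≈ 99, consistent with the 99-times larger problem quoted above.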
When the very same program needs to run only with different input pa-
rameters, the scalability is potentially perfect. No intercommunication be-
tween processors is required during the calculation. The input data are sent
to each processor at the beginning and the output data are collected at the
end. Computational problems of this kind are called “embarrassingly paral-
lel.” This form of parallel computing is embarrassingly easy to accomplish, at
least comparatively, and doing so is not embarrassing at all; it is an extremely
attractive and efficient form of parallel computing.
“Distributed computing” involves a large number of processors located
usually at multiple physical locations. It is a form of parallel computing, but
the communication cost is very high and the platforms are diverse. Distributed
computer systems can be realized, for example, between computers in a com-
puter lab, as a network of workstations on a university campus, or with idle
personal computers from around the world. Tremendous cumulative compu-
tational power can be achieved with the sheer number of available processors.
Judged by the total number of floating-point operations, distributed calcula-
tions rank as the largest computations ever performed. In a pioneering devel-
opment for distributed computing, SETI@home utilizes millions of personal
computers to analyze data coming from a radio telescope listening to space.
threads, but the number can also be set by the user. Commonly each core has
two threads, so a 4-core processor would output it eight times.
The example on the right performs as many iterations in the loop as pos-
sible simultaneously. Its output will consist of ten results, but not necessarily
in the same order as the loop.
To write a sequential program and then add directives to undo its sequential nature is awkward and rooted in a long history when computing hardware
was not as parallel as it is now. A more natural solution is to have language
intrinsic commands. The index-free notation is one example. If a and b are
arrays of the same shape and length, then a=sin(b) can be applied to every
element in the array, as it was in the program example of Table 4.2. In For-
tran, the right side of Table 9.2 is replaced by do concurrent (i=1:10). But
even if a language has such intrinsic commands, it does not guarantee that
the workload will be distributed over multiple cores.
For the parallelization of loops and other program blocks, a distinction
needs to be made between variables that are shared through main memory
(called shared variables) and those that are limited in scope to one thread
(called private variables). For example in
do concurrent (i=1:10)
   x = i
   do k = 1, 100000000
      x = sin(3*x)
   enddo
end do
it is essential that x and k be private variables; otherwise, these variables
could easily be overwritten by another of the ten concurrent tasks, because
they may be processed simultaneously. On the other hand, if we chose to use
x(i) instead x, the array should be shared; otherwise, there would be ten
arrays with ten entries each. If not specified in the program, the compiler
will apply default rules to designate variables within a concurrent or parallel
block as private or shared, for example, scalar variables as private and arrays
as shared.
That concludes our discussion of programming for shared-memory parallel machines. A
————–
computing hardware. However, since Moore’s law has slowed down, this approach may become fruitful.
A more mundane approach is to use common hardware in a configuration
that is customized for the computational method. If the problem is limited
by the time it takes to read large amounts of data, then a set of disks read in
parallel will help. If an analysis requires huge amounts of memory, then, well, a
machine with more than average memory will help or, if the problem is suitably
structured, a distributed memory system can be used. With remote use of
computing clusters and computational clouds that occupy not only rooms,
but entire floors, and even warehouse-size buildings, a choice in hardware
configuration is realistic.
EXERCISES
9.1 The fragment of a Fortran program below finds the distance between the
nearest pair among N points in two dimensions. This implementation
is wasteful in computer resources. Can you suggest at least 4 simple
changes that should improve its computational performance? (You are
not asked to verify that they improve performance.)
9.2 Learn how to ingest a command line argument into a program, e.g.,
myprog 7 should read an integer number or single character from the
command line. Then use this input argument to read file in.7 and write
a file named out.7. This is one approach to run the same program with
many different input parameters. Submit the source code.
9.3 Learn how to profile your code, that is, obtain the fraction of time con-
sumed by each function in your program. Take or write an arithmetically
intensive program that takes at least 30 seconds to execute, then find
out which function or command consumes most of the time. Submit an
outline of the steps taken and the output of the profiler.
9.4 In a programming language of your choice, learn how to use multiple
cores on your CPU. Some may do this automatically; others will need
explicit program and/or launch instructions. Verify that the program
runs on multiple cores simultaneously.
CHAPTER 10
10.1 INTRODUCTION
Many computations are limited simply by the sheer number of required ad-
ditions, multiplications, or function evaluations. If floating-point operations
are the dominant cost then the computation time is proportional to the num-
ber of mathematical operations. Therefore, we should practice counting. For
example, a_0 + a_1 x + a_2 x² involves two additions and three multiplications, because the square also requires a multiplication, but the equivalent formula a_0 + (a_1 + a_2 x)x involves only two multiplications and two additions.
More generally, a_N x^N + ... + a_1 x + a_0 involves N additions and N + (N − 1) + ... + 1 = N(N + 1)/2 multiplications, but (a_N x + ... + a_1)x + a_0 only N multiplications and N additions. The first takes about N²/2 FLOPs for large N, the latter 2N for large N. (Although the second form of polynomial
evaluation is superior to the first in terms of the number of floating-point
operations, in terms of roundoff it may be the other way round.)
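A minimal sketch of the nested form in Python, assuming the coefficients are stored as a list a = [a0, a1, ..., aN]:

def horner(a, x):
    # evaluates a[0] + a[1]*x + ... + a[N]*x**N with N multiplications and N additions
    result = 0.0
    for coeff in reversed(a):
        result = result * x + coeff
    return result

print(horner([1.0, 2.0, 3.0], 2.0))   # 1 + 2*2 + 3*2**2 = 17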
A precise definition of the “order of” symbol O is in order (big-O notation).
A function is of order x^p if there is a constant c such that the absolute value of the function is no larger than c x^p for sufficiently large x. For example, 2N² + 4N − log(N) + 7 is O(N²). Although the number may be larger than 2N², it is, for sufficiently large N, always smaller than, say, 3N², and therefore O(N²). More generally, a function is of order g(x) if |f(x)| ≤ c|g(x)| for x > x_0. For example, log(N²) is O(log N), because it is never larger than 2 log N. The analogous definition is also applicable for small numbers, as in chapter 6, except the inequality has to be valid for sufficiently small instead of sufficiently large values. The leading order of the operation count is the “asymptotic order count”. For example, 2N² + 4N − log(N) + 7 asymptotically takes 2N² steps.
With the definition f(N) ≤ cN^p for N > N_0, a function f that is O(N^6) is also O(N^7), but it is usually implied that the power is the lowest possible.
Some call a “tight” bound big-Θ, but we will stick to big-O. A tight bound
[Figure: execution time (seconds) versus problem size N, both axes logarithmic.]
The scaling of execution time with the number of variables is not always as ideal as expected from the
operation count, because time is required not only for arithmetic operations
but also for data movement. For this particular implementation the execution
time is larger when N is a power of two. Other wiggles in the graph arise
because the execution time is not exactly the same every time the program is
run.
( ∗ ∗           )
( ∗ ∗ ∗         )
(   ∗ ∗ ∗       )
(     .  .  .   )
(       ∗ ∗ ∗   )
(         ∗ ∗   )
methods that accomplishes that will be described in chapter 12. Per element this is only an O(log N) effort, so sorting is “fast”.
A Fourier transform is calculated as
F̂_k = Σ_{j=0}^{N−1} f_j e^{−2πikj/N}
with k = 1, ..., N. Every Fourier coefficient F̂_k is a sum over N terms, so the formula suggests that a Fourier transform of N points requires O(N²) operations. However, already over two hundred years ago it was realized that it can be accomplished in only O(N log N) steps. This classic algorithm is the Fast Fourier Transform (FFT). The FFT is a method for calculating the discrete Fourier transform. Specifically, the operation count is about 5N log_2 N. The
exact prefactor does not matter, because for a non-local equation such as this,
the time for data transfer also has to be considered.
For the FFT algorithm to work, the points need to be uniformly spaced
both in real as well as in frequency space. The original FFT algorithm worked
with a domain size that is a power of two, but it was then generalized to
other factors, combinations of factors, and even prime numbers. A domain of
any size can be FFT-ed quickly, even without having to enlarge it to the next
highest power of two (and worry what to fill in for the extra function values).
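As a brief sketch (assuming NumPy), the FFT of a sequence of arbitrary length is obtained with a single library call, which internally uses an O(N log N) algorithm even when N is not a power of two:

import numpy as np

N = 1000                                  # not a power of two
x = np.linspace(0.0, 1.0, N, endpoint=False)
f = np.sin(2*np.pi*3*x) + 0.5*np.cos(2*np.pi*7*x)

F = np.fft.fft(f)                         # discrete Fourier transform via an FFT
print(np.argsort(np.abs(F[:N//2]))[-2:])  # the two input frequencies dominate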
Brainteaser: Show that log(N ) is never large. Hence, linear, O(N ), and quasi-
linear, O(N log N ), algorithms are practically equally fast. While the mathe-
matical function log(N ) undoubtedly always increases with N and has no up-
per bound, for practical purposes, its value is never large. Consider situations
such as the largest counter in a single loop and the number of distinguishable
floating-point numbers in the interval 1 to 10.
FIGURE 10.3 Two matrices are multiplied using square tiles to improve
data locality. The tiles involved in the calculation of one of the output
tiles are labeled.
that we never have to write one on our own, but the same concept—grouping of
data to reduce traffic from and to main memory—can be successfully applied
to various problems with low arithmetic intensity.
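A minimal sketch of the tiling idea in Python (with NumPy arrays and an arbitrary tile size; an optimized library kernel would choose the tile to fit the cache):

import numpy as np

def tiled_matmul(A, B, tile=64):
    # multiply two n x n matrices tile by tile to improve data locality
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
print(np.allclose(tiled_matmul(A, B), A @ B))   # agrees with the library routine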
EXERCISES
10.1 How many times is the content of the innermost loop executed?
that is, the remainder of 16807 x_i / 2147483647. The starting value x_0 is called the “seed.” Pseudorandom number generators of the form x_{n+1} = (a x_n + b) mod m, where all numbers are integers, are known as “linear congruential generators”.
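A minimal sketch of such a generator in Python (with the multiplier and modulus quoted above, and assuming b = 0 for this particular generator):

def lcg(seed, a=16807, b=0, m=2147483647):
    # linear congruential generator: x_{n+1} = (a*x_n + b) mod m
    x = seed
    while True:
        x = (a*x + b) % m
        yield x / m                  # mapped to the interval (0, 1)

gen = lcg(seed=42)
print([next(gen) for _ in range(3)])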
Pseudorandom number generators can never ideally satisfy all desired sta-
tistical properties. For example, the above type of generator repeats after at
most m numbers. More generally, with only finitely many computer repre-
sentable numbers, the sequence will ultimately always be periodic, though the
period can be extremely long. Random number generators can be faulty. Par-
ticular choices of the parameters or seed can lead to short periods. And the
coefficients in formulae like the one above need to be chosen carefully. Some
implementations of pseudorandom number generators were deficient, but a
good implementation suffices for almost any practical purpose. If there is any
doubt about the language-intrinsic random number generator, then vetted
open-source code routines can be used. (The general code repositories listed
in Appendix B are one place to find them.)
There have been attempts to use truly random physical processes, such
as radioactive decay, to generate random numbers, but technical limitations
implied that the sequences were not perfect either. Moreover, an advantage of
deterministic generators is that the numbers are reproducible as long as one
chooses the same seed. (And if we never want them to start the same way, the
time from the computer’s clock can be used for the seed.)
|p(x)dx| = |q(y)dy|
where the absolute values are needed because y could decrease with x, while
probabilities are always positive. If q(y) is uniformly distributed between 0
and 1, then
p(x) = |dy/dx| for 0 < y < 1, and p(x) = 0 otherwise.
Integration with respect to x and inverting yields the desired transformation. For example, an exponential distribution p(x ≥ 0) = exp(−x) requires y(x) = ∫_0^x p(x′) dx′ = − exp(−x) + 1 and therefore x(y) = − ln(1 − y). This
equation transforms uniformly distributed numbers into exponentially dis-
tributed numbers. The distribution has the proper bounds x(0) = 0 and
x(1) = +∞. In general, it is necessary to invert the integral of the desired
distribution function p(x). That can be computationally expensive, particu-
larly when the inverse cannot be obtained analytically.
Alternatively the desired distribution p(x) can be enforced by rejecting
numbers with a probability 1 − p(x), using a second randomly generated num-
ber. These two methods are called “transformation method” and “rejection
method,” respectively.
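A minimal sketch of the transformation method for the exponential distribution derived above (assuming NumPy for the uniform random numbers):

import numpy as np

y = np.random.rand(100000)      # uniformly distributed in [0, 1)
x = -np.log(1.0 - y)            # x(y) = -ln(1 - y) is exponentially distributed

print(x.mean())                 # close to 1, the mean of exp(-x)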
That said, many common distributions, in one or more variables, are avail-
able from built-in functions, and if not, the reference at the end of the chapter
may contain a prescription on how to generate them.
FIGURE 11.1 Randomly distributed points are used to estimate the area
below the graph.
the random value, but otherwise discarded. Let’s quantify the error of this
numerical scheme.
Suppose we choose N randomly distributed points, requiring N function
evaluations. How fast does the integration error decrease with N ? The prob-
ability of a random point to be below the graph is proportional to the area
a under the graph. Without loss of generality, the constant of proportionality
can be set to one. The probability of having m specific points below the graph
and N − m specific points above the graph is am (1 − a)N −m , in the sense
that the first point is below, the second above, the third below, and so on,
whereas the same result would be obtained if the third was above and the
second below. The points are interchangeable, so there is a binomial factor.
The probability P of having any m points below the graph and any N − m
points above the graph is
P(m) = \binom{N}{m} a^m (1 − a)^{N−m}
An error E can be defined as the root mean square difference between the
exact area a and the estimated area m/N :
E² = Σ_{m=0}^{N} (m/N − a)² P(m)
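A minimal sketch of the scheme described above (assuming NumPy and, as a stand-in, the quarter circle y = √(1 − x²), whose area is π/4):

import numpy as np

N = 100000
x = np.random.rand(N)
y = np.random.rand(N)

below = y < np.sqrt(1.0 - x**2)      # points below the graph
a_estimate = below.sum() / N         # fraction of points below estimates the area

print(a_estimate, np.pi/4)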
more dimensions. In one dimension there are two nearest neighbors, on a two-
dimensional square lattice four nearest neighbors, and so on. There is no real
physical system that behaves exactly this way, but it is a simple model to study
the thermodynamics of an interacting system. Ferromagnetism is the closest
physical analog, and the sum of all spins corresponds to magnetization. (Like
magnetic poles repel each other, and the energy is lowest when neighboring
magnets have opposite orientations. Hence, it would appear we should choose
J < 0 in our model. However, electrons in metals interact in several ways and
in ferromagnetic materials the energies sum up to align electron spins. For
this reason we consider J > 0.)
The spins have the tendency to align with each other to minimize energy,
but this is counteracted by thermal fluctuations. At zero temperature all spins
will align in the same orientation to reach minimum energy (either all up or
all down, depending on the initial state). At nonzero temperatures will there
be relatively few spins opposite to the overall orientation of spins, or will there
be a roughly equal number of up and down spins? In the former case there
is macroscopic magnetization; in the latter case the average magnetization
vanishes.
According to the laws of statistical mechanics the probability to occupy a
state with energy E is proportional to exp(−E/kT ), where k is the Boltzmann
constant and T the temperature. This is also known as the “Boltzmann factor”.
In equilibrium the number of transitions from up to down equals the number
of transitions from down to up. Let W (+ → −) denote the probability for
a flip from spin up to spin down. In steady state the probability P of an
individual spin to be in one state is proportional to the number of sites in
that state, and hence the equilibrium condition translates into
P (+)W (+ → −) = P (−)W (− → +)
For a simulation to reproduce the correct thermodynamic behavior, we hence
need
W(+ → −) / W(− → +) = P(−) / P(+) = exp(−[E(−) − E(+)]/kT)
Here, E(−) is the energy when the spin is down, which depends on the orien-
tation of the neighbors, and likewise for E(+). For the Ising model the ratio
on the right-hand side is exp(2bJ/kT ), where b is an integer that depends on
the orientations of the nearest neighbors.
There is more than one possibility to choose the transition probabilities
W (+ → −) and W (− → +) to achieve the required ratio. Any of them will
lead to the same equilibrium properties. Denote the energy difference between
before and after a spin flip with ∆E, defined to be positive when the en-
ergy increases. One possible choice is to flip from the lower-energy state to
the higher-energy state with probability exp(−∆E/kT ) and to flip from the
higher-energy state to the lower-energy state with probability one. If ∆E = 0,
when there are equally many neighbors pointing in the up and down direc-
tion, then the transition probability is taken to be one, because this is the
limit for both of the preceding two rules. This method, which transitions to
a higher energy with probability exp(−∆E/kT ) and falls to a lower energy
with probability 1, is known as the “Metropolis algorithm.”
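A minimal sketch of one such update in Python (the lattice, its size, and the temperature are illustrative choices, not the text's):

import numpy as np

def metropolis_step(s, i, j, J, kT):
    n, m = s.shape
    # sum over the four nearest neighbors on a periodic square lattice
    neighbors = s[(i+1) % n, j] + s[(i-1) % n, j] + s[i, (j+1) % m] + s[i, (j-1) % m]
    dE = 2.0 * J * s[i, j] * neighbors          # energy change if spin (i, j) is flipped
    if dE <= 0 or np.random.rand() < np.exp(-dE / kT):
        s[i, j] = -s[i, j]                      # accept the flip

s = np.random.choice([-1, 1], size=(32, 32))
for _ in range(100000):
    i, j = np.random.randint(0, 32, size=2)
    metropolis_step(s, i, j, J=1.0, kT=2.0)
print(s.sum() / s.size)                         # magnetization per spin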
[Figure: magnetization as a function of kT/J, panels (a) and (b).]
FIGURE 11.3 Snapshot of spin configurations for the Ising model (a) be-
low, (b) close to, and (c) above the critical temperature. Black indicates
positive spins, white negative spins.
of the energies of all individual spins, E = Σ_j E_j. The probability to find the system in energy E is proportional to
Ω(E) ∏_j e^{−E_j/kT} = Ω(E) e^{−Σ_j E_j/kT} = Ω(E) e^{−E/kT}
but in two dimensions it is much, much, much harder. The dashed line in Fig-
ure 11.3(b) shows this exact solution, first obtained by Lars Onsager. In three
dimensions no one has achieved it, and hence we believe that it is impossible
to do so. The underlying reason why the magnetization can be obtained with
many fewer steps with the Metropolis algorithm than by counting states is
that a probabilistic method automatically explores the most likely configu-
rations. The historical significance of the Ising model stems largely from its
analytical solution. Hence, our numerical attempts in one and two dimensions
have had a merely illustrative nature. But even simple variations of the model
(e.g., the Ising model in three dimensions or extending the interaction be-
yond nearest neighbors) are not solved analytically. In these cases numerics
is valuable and no more difficult than the simulations we have gone through
here.
EXERCISES
11.1 Uniform distribution of points on sphere. For points that are distributed
uniformly over the surface of a sphere (the same number of points per
area), the geographic coordinates are not uniformly distributed. Here we
seek to generate point coordinates that are uniformly distributed over a
unit sphere.
Algorithms, Data Structures, and Complexity
[Figure 12.1(b): the heap after successive removals of the largest element; the roots are 9, 8, 7, 5, and so on.]
and so on, and all numbers are eventually sorted according to their size; see
Figure 12.1(b).
Denote the smallest integer power of 2 larger than N with N ′ . When
N = 1, N ′ = 2; when N is 2 or 3, then N ′ = 4, and so on. The number
of levels in the heap is log2 N ′ . The first stage of the algorithm, building
the heap, loops through the N elements, but the potential swaps at each
element addition mean we might have to go through each level, so the first
stage requires up to O(N log N ′ ) = O(N log N ) work. In the second stage,
comparing and swapping is necessary up to N times for each level of the tree.
Hence each of the two stages of the algorithm is O(N log N ), and the total is
also O(N log N ). Considering that merely going through N numbers is O(N )
and that log N is usually a small number, sorting is “fast.”
A binary tree as in Figure 12.2 can simply be stored as a one-dimensional
array. The first level occupies the first element, the two elements of the next
level the following two elements, and so on. With b levels, the length of this
array is 2^b − 1. The index in the array a_j for the i-th element in the b-th level of the tree is 2^{b−1} + i − 1, where b, i, and j all start at 1.
Heapsort is also highly memory efficient, because elements can be replaced
in-place. It is never necessary to create a copy of the entire list of elements.
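In practice a library routine is used. A minimal sketch with Python's heapq module (not the text's example; heapq implements a min-heap, so the smallest element is removed first and the result comes out in ascending order):

import heapq

def heapsort(values):
    heap = list(values)
    heapq.heapify(heap)              # first stage: arrange the elements as a heap
    # second stage: repeatedly remove the smallest remaining element
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heapsort([9, 7, 8, 5, 1, 4.1, 2]))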
Brainteaser: Sort one suit of a deck of cards with the heapsort algorithm.
a1
a2 a3
a4 a5 a6 a7
in, first out). Even when their indices are sequential, it does not imply adja-
cent elements are stored sequentially in memory. As already mentioned, arrays
use consecutive memory addresses; lists generally do not. In a nonlinear data
structure the elements can be connected to more than one other element.
Examples are trees and graphs (networks).
Another useful classification of data structures is in terms of their lifecycle
during program execution. “Static data structures” are those whose size and
associated memory location are fixed at compile time. “Dynamic data struc-
tures” can expand or shrink during program execution and the associated
memory location can move. Allocation of an array during runtime (in C with
malloc) is one example of a dynamic data type.
Hashing is used to speed up searching. Consider the problem of searching
an array of length N . If the array is not sorted, this requires O(N ) steps. If the
array is sorted, a divide-and-conquer approach can be used, which reduces the
number of required steps to O(log N ) (Exercise 12.2). The search would be
even faster if there was a function which would point to the sought value in the
array. A “hash function” is a function, which when given a value, generates
an address in a table. Understandably, such a mathematical function that
would provide a one-to-one relation between values and indices can often not
be constructed, but even an imperfect relation, where two or more different
values may hash to the same array index, is helpful. Values that collide can
be put in a linked list. A hash table then is an array of lists. Exercise 12.1 will
go through one example. Hash functions can be constructed empirically, and
even applied to character strings by converting the strings to ASCII code.
They enable efficient lookup: given a string (e.g., the name of a chemical
compound), find corresponding entries (e.g., the chemical formula and other
associated information).
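A minimal sketch of a hash table with chaining in Python (the table size and hash function are illustrative choices):

M = 11                                    # table size
table = [[] for _ in range(M)]            # a hash table as an array of lists

def insert(key, value):
    table[key % M].append((key, value))   # colliding keys share a list

def lookup(key):
    for k, v in table[key % M]:           # search only the short list at the hashed index
        if k == key:
            return v
    return None

insert(13, "a")
insert(24, "b")                           # 13 % 11 == 24 % 11 == 2: a collision
print(lookup(24))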
Concurrent data structure. Accessing data with a single process is one
thing, sharing the same data among concurrent processes is another. By defi-
nition, the order at which concurrent processes will execute is not known, so if
they share the content of a variable there can be conflicts. Consider a simple
global counter c=c+1. If one of the concurrent processes needs to increment
the counter, it has to make sure another process is not also incrementing it si-
multaneously. Even on a shared memory machine, registers, the level-1 cache,
or even higher levels of cache, are local to the CPU core (Figure 8.1), so the
content of the variable would have to traverse these levels to make sure the
value of the counter is correct, and during this time the process stalls. Here is
a way to avoid write conflicts for a (grow-only) counter: Each process has its
own local counter c[i], and when the value of the global counter is needed,
all the local counters are added. There will be a query function that calcu-
lates the sum, C = sum(c[i]), and a function that updates the local counter,
c[i]=c[i]+1. This is an example of a “conflict-free” data type, because it
avoids write conflicts. Often it is not practical to use standard data structures
for concurrent processing, and special data structures are needed to store and
organize the data.
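A minimal sketch of such a counter in Python (here the "processes" are simply indexed slots, which sidesteps the actual concurrency machinery):

class ConflictFreeCounter:
    def __init__(self, nprocs):
        self.c = [0] * nprocs        # one local counter per process

    def increment(self, i):
        self.c[i] += 1               # process i touches only its own slot

    def query(self):
        return sum(self.c)           # the global value is the sum of the local counters

counter = ConflictFreeCounter(nprocs=4)
counter.increment(0)
counter.increment(3)
counter.increment(3)
print(counter.query())               # 3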
For some problems it has been proven that it is impossible to solve them
in polynomial time; others are merely believed to be intractable since nobody
has found a way to solve them in polynomial time. (Incidentally, proving
that finding the longest common subsequence, or any of a large number of
equivalent problems, cannot be accomplished in polynomial time is one of
the major outstanding questions in present-day mathematics. Nobody has yet
proved that this group of problems indeed requires an algorithm with “NP”-
complexity rather than only polynomial or “P”-complexity.)
A small problem modification can sometimes turn a hard problem into
an easy one. (A humorous response, with a grain of truth, to the Traveling
Salesman Problem is: selling online involves only O(0) travel.) A computer
scientist or mathematician may know more about algorithms than a scien-
tist, but the scientist better understands the context of the problem and can
make approximations or modifications to the problem formulation that elude
those who look at the problem from a purely formal perspective. For this rea-
son, scientists have a role to play in research on algorithmically challenging
problems.
Another avenue of approach to intractable problems is algorithms that find
the solution only with a probability, a very high probability. In this context,
it is sobering to remember that at any time a cosmic ray particle can hit the
CPU or main memory and flip a bit and cause an error. This is rare, but does
occur. In an airplane, which is higher up and exposed to more cosmic and solar
radiation, a crucial computation has to be carried out on duplicate processors
(or threefold actually, because if two processors get different answers, it is
ambiguous which one is correct). On spacecraft, which are exposed to even
more radiation, radiation-caused bit flips are routinely observed. Even for
earthly applications, memory technology often incorporates “error-correcting
code” (ECC). An extra bit is used that stores whether the number of 1s in
a binary sequence is odd or even; in other words the “parity” of a binary
sequence. If one bit flips, the parity changes, which reveals the error. If two
bits flip, the parity does not change and the error remains undetected. But
the probability of an event to occur twice is the square of the probability for
it to occur once, and therefore extremely small. For a scientific computation,
we do not worry about absurdly unlikely sources of error. In other words, a
probabilistic algorithm that errs with a probability of, say, 10−100 is as reliable
as a deterministic algorithm.
the deterministic error of another method. Here we revisit the N -body pair-
wise interaction problem, and show how systematically stopping a calculation
as soon as it reaches a certain error requirement can dramatically reduce the
operation count.
The formula for the gravitational force of N objects on each other,
F_i = Σ_{j=1, j≠i}^{N} G m_i m_j (r_i − r_j) / |r_i − r_j|³
FIGURE 12.4 (a) Adaptive grid for the N -body problem in two dimensions
and (b) the corresponding quadtree. Tree nodes right to left correspond
to the NE, NW, SW, and SE quadrants.
a box and contains the total mass and center of mass coordinates for all the
particles it contains.
If the ratio D/r = (size of box) / (distance from particle to center of mass
of box) is small enough, then the gravitational force due to all the particles
in the box is approximated by their center of mass. To evaluate the forces, we
start at the head of the tree and calculate the distance, and move down the
tree, until the criterion D/r < Θ is satisfied, where Θ is a user-supplied thresh-
old. An interactive tool that visualizes the force-evaluation stage of the algo-
rithm can be found, e.g., at www.khanacademy.org/computer-programming/
quadtree-hut-tree/1179074380.
Now we will roughly count how many steps this algorithm requires. If the
particles are uniformly distributed, each body will end up at the same level
in the tree, and the depth of the tree is log_4 N. The cost of adding a particle
is proportional to the distance from the root to the leaf in which the particle
resides. Hence, the complexity of inserting all the particles (building the tree)
is O(N log N ).
The operation count for the force evaluation can be determined by plotting
all the boxes around an individual particle that satisfy the geometric criterion
D/r < Θ. Their size doubles with distance, and since there are at most three
undivided squares at each level (the fourth square being occupied by the
particle under consideration), the total amount of work is proportional to
the depth in the tree at which the particle is stored. Hence, this cost is also
proportional to O(log N ) for each particle. In total, the tree-based N -body
method is an O(N log N ) algorithm, at least when the bodies are distributed
homogeneously.
In conclusion, systematic truncation or grouping reduces the computa-
tional cost of the gravitational N-body problem from O(N²) to O(N log N).
For large N , the choice we are faced with is not between an approximate
and an exact answer (that only has roundoff), but between an approximate
result for N bodies and an exact result for a lot fewer than N bodies. When
100 billion stars are replaced with 1 billion stars each with a hundred times
the mass, some dynamical aspects, such as how many stars are ejected, may
be quite different. The physics is better represented with a fast approximate
evaluation than with a precise evaluation limited to a small sample.
EXERCISES
12.1 Hash Table. A commonly used method for hashing positive integers is
modular hashing: choose the array size M to be prime and for an integer
value k, compute the remainder when dividing k by M . This is effective
in dispersing the values evenly between 0 and M − 1.
Consider the hashing function (value)%7.
a. Generate the indices for the following values: 13, 96, 16, 3, 11, 112,
23, 54, 42.
b. To look up one of these 9 values, how many steps are necessary
on average? Count calculation of the remainder as one step, and
following a linked list counts as a second step.
12.2 Given a sequence of N real numbers sorted by size. Show that a search
for a number in this array takes at most O(log N ) steps.
12.3 The operation count T of a divide-and-conquer method is given by the
recursive relation
T (N ) = 2T (N/2) + O(N )
with T (1) = O(1). Find an explicit expression for T (N ). Assume N is
a power of 2.
CHAPTER 13
Data
TABLE 13.1 The classic decimal and the new binary unit prefixes for large
numbers.
Decimal                              Binary
kilo  k  1000^1 = 10^3               kibi  Ki (K)  1024^1 = 2^10
mega  M  1000^2 = 10^6               mebi  Mi      1024^2 = 2^20
giga  G  1000^3 = 10^9               gibi  Gi      1024^3 = 2^30
tera  T  1000^4 = 10^12              tebi  Ti      1024^4 = 2^40
peta  P  1000^5 = 10^15              pebi  Pi      1024^5 = 2^50
exa   E  1000^6 = 10^18              exbi  Ei      1024^6 = 2^60
zetta Z  1000^7 = 10^21              zebi  Zi      1024^7 = 2^70
yotta Y  1000^8 = 10^24              yobi  Yi      1024^8 = 2^80
machines; some computers write groups of bytes left-to-right and others right-
to-left (big-endian or little-endian). Endianness refers to the ordering of bytes
within a multi-byte value. In the big endian convention, the smallest address
has the most significant byte (MSB), and in the little endian convention, the
smallest address has the least significant byte (LSB). For a single byte repre-
sentation, as ASCII, there is no ambiguity. When a multi-byte representation
is written and read on the same computer, it also works fine. But if binary
data are written on one machine and then read on another with the opposite
endianness, the output will make no sense at all. UTF-8 is a byte-oriented encoding, so there is no endianness ambiguity.
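A small sketch with Python's struct module (an assumption; the value 1 is arbitrary) makes the two byte orderings visible:

import struct

print(struct.pack('>i', 1).hex())    # big endian:    00000001
print(struct.pack('<i', 1).hex())    # little endian: 01000000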
function beyond that.) If this file is fed into a common geographical software
(e.g., Google Earth or software that comes with handheld GPS devices), it
will know how to interpret the coordinates.
Data that are tagged with XML, JSON, or other systematic formatting
are “structured data”, as opposed to “unstructured data”. They are also ex-
amples of “hierarchical data”, because tags are nested within tags. Most data
are unstructured. Rigorously tabulated and documented data also count as
structured data.
Image formats
There are two types of digital image formats: raster graphics and vector graph-
ics. Raster graphics is pixel based and simply stores the value for each pixel; it
is used by many common formats: jpg or jpeg, tiff, png, jpg2000, and
others. Uncompressed raster images tend to be very large. There is lossy and
lossless compression. Classic jpeg is a lossy format, whereas png is lossless,
and jpg2000 offers both options. The danger with lossy formats is that re-
peated file manipulations will lead to an ever increasing degradation of the
image quality. On the other hand, even a tiny loss of information might allow
for a much smaller file size.
For vector graphics, lines and points are encoded, which means when en-
larging a figure, its elements remain perfectly sharp. It also means individual
components can be edited. Moreover, text within the graphics is searchable
because it can be stored as letters. Common vector graphics file formats are
ps (postscript), pdf (portable document format), and svg (scalable vector
graphics). (Incidentally, svg is an XML-based data format.) These formats
can also store information pixelwise, but that is not what they are meant for.
For example, a pdf created by scanning pages of text from paper, without
optical character recognition, is not searchable. Vector graphics is intended
for line art and formatted text. A diagram or a plot with points and lines for
a publication is best stored in a vector graphics format. A photograph of a
newly discovered insect species is best stored as raster graphics.
Some image formats can store several layers, e.g., tiff and svg can,
whereas png and jpg cannot. Almost all common image formats can carry
metadata, for example, the settings of the camera which took the photo or
the creator of a graphics.
ImageMagick is free software that performs many standard image process-
ing tasks, such as converting between formats or resizing images. Graphics-
Magick is a fork of ImageMagick that emphasizes computational efficiency.
An example of a special-purpose data format is GeoTIFF to store maps.
GeoTIFF is a public domain metadata standard which allows georeferencing
information to be embedded within a TIFF file. This information can in-
clude geographic coordinates, the type of map projection, coordinate systems,
reference ellipsoid, and more. Any software that can open TIFF images can
also open GeoTIFF images, and simply ignore the metadata, while geospatial
software can make use of the metadata.
- Print the first column and the sum of the third and fourth column of a
comma separated table, and replace the commas with spaces:
awk -F, '{print $1,$3+$4}' yourfile.csv
The option -F specifies that the field separator is a comma, while the
default would be a white space.
- Calculate the sum of all entries in a column:
awk '{ sum+=$1} END {print sum}' table.txt
Awk also has a few special variables, such as ‘NR’ for the number of
records (lines). A slightly shorter version for the calculation of the arith-
metic mean is
————–
These examples demonstrate that text processing utilities are versatile and
powerful tools to format, extract, clean up, and even analyze data. Again,
“text” refers to data in plain-text format, not prose, although the tools can
surely be applied to human language text as well. Some of them also work
with binary file formats.
Data transfer can only be as fast as the slowest component, and this bot-
tleneck is usually at the beginning or the end.
The probability of error during data transfer is negligible, because error detection is already applied (with checksums), and if that test fails, the transfer will be repeated.
As already described in chapter 8 there are two common types of non-
volatile and re-writable storage media (hard drives): magnetic disks and solid
state drives. The former are cheap and large, and the latter more expensive
and a bit faster. The absolute speed of transfer from hard drives was already
included in Table 8.3.
RAID (Redundant Array of Independent Disks) is a storage system that
distributes data across multiple physical drives to achieve data redundancy,
fast data access, or a bit of both. For data redundancy, it can be arranged
that a copy of one disk’s data be distributed on the other disks, such that
if any single disk fails, all its data can be recovered from the other disks.
For fast data access, sequential bytes can be stored on different drives, which
multiplies the available bandwidth. A RAID can start with just a few disks,
and some scientists have one under their office desk.
Distributed storage. When data sets are very large, they have to be dis-
tributed over many physical storage units. And since every piece of hardware
has a failure probability, failure becomes commonplace in a large cloud or data
center. Modern file systems can automatically handle this situation by main-
taining a replica of all data, and without interruption to normal operations
(“hot swappable”).
Cloud storage is a form of distributed storage. Data stored in the cloud
can be accessed through the internet from anywhere. But clouds are also a
level of abstraction or virtualization. We do not have to deal with individual
disks or size limits. Another benefit is that cloud storage functions as off-site
(remote) backup of data.
The practical situation for the cost of the components is
Bandwidth > Disk > CPU
These inequalities express that network bandwidth is expensive, so the CPU
should be co-located with the disk. In other words, bring the algorithms to the
data. Disk space is cheap, so bring everything online. CPU processing power
is even cheaper, so no need to pre-process needlessly.
The appropriate outcome given these technology-based relations is large
remote compute and storage centers. Today, there are data centers of the size
of warehouses specifically designed to hold storage media and CPUs, along
with cooling infrastructure. Currently, the largest data centers hold as much
as an Exabyte of data.
EXERCISES
13.1 a. Use wget, curl, Scrapy, or a similar tool to download monthly me-
teorological data from 1991 to 2017 from ftp://aftp.cmdl.noaa.
gov/data/meteorology/in-situ/mlo/.
b. From the data obtained, extract the temperatures for 10-m mast
height and plot them. (Hint: awk is a tool that can accomplish this.)
13.2 Use awk, Perl, sed, or another utility to write a one-line command that
replaces
14.1 PROGRAMMING
“In computer programming, the technique of choice is not necessar-
ily the most efficient, or elegant, or fastest executing one. Instead,
it may be the one that is quick to implement, general, and easy to
check.”
Numerical Recipes
users. Most programs written for everyday scientific computing are used for
a limited time only (“throw-away codes”). Under these circumstances there is
little benefit to extract the last margin of efficiency, create nice user interfaces,
write extensive documentation, or implement complex algorithms. Any of that
can even be counterproductive, as it consumes time and reduces flexibility.
Better is the enemy of good. That said, the most likely beneficiary of program
documentation is our future self.
It is easy to miss a mistake or an inconsistency within many lines of code. In
principle, a single wrong symbol in the program can invalidate the result.
And in practice, it sometimes does. Program validation is a practical necessity.
Some go so far as to add “blunders” to the list of common types of error in
numerical calculations: roundoff, approximation errors, statistical errors, and
blunders. Absence of obvious contradictions is not a sufficient standard of
checking. Catching a mistake later or not at all may lead to a huge waste of
effort, and this risk can be reduced by spending time on program validation
early on. We will want to compare with analytically known solutions, including
trivial solutions. Finding good test cases can be a considerable effort on its
own. It is not uncommon, nor unreasonable, to spend more time validating a
program than writing the program.
Placing simple print/output commands in the source code to trace exe-
cution is a straightforward and widely used debugging strategy (also called
“logging”). Alternatively, or in addition, a debugger helps programmers in
this process. With a debugger, a program can be run step-by-step or stopped
at a specific point to examine the value of variables. When a program crashes,
a debugger shows the position of the error in the source code.
Complex programming and modeling tasks involve a large number of steps,
and even if the chance of making an error at any one step is small, the risk is multiplied
through the many steps. That is the rationalization for why it is practically
necessary to program defensively and validate routinely.
Programs undergo evolution. As time passes improvements are made on
a program, bugs fixed, and the program matures. As the Roman philosopher
Seneca put it: “Time discovers truth.” For time-intensive computations, it is
thus not advisable to make long runs right away. Moreover, the experience
of analyzing the result of a short run might change which and in what form
data are output. Nothing is more in vain than to discover at the end of a
long calculation that a parameter has not been set properly or a necessary
adjustment has not been made. Simulations shorter than one minute allow for
interactive improvements, and thus a rapid development cycle. Besides the
obvious difference in cumulative computation time (a series of minute-long
calculations is much shorter than a series of hour-long calculations), the mind
has to sluggishly refocus after a lengthy gap.
For lengthy runs one may wish to know how far the program has proceeded.
While a program is executing, its output is not immediately written to disk,
because this may slow it down (chapter 8). This has the disadvantage that
Shell scripts
Shell scripts are specific to Unix-like environments, but not to the Unix op-
erating system per-se. Appendix A provides an introduction to Unix-like en-
vironments, which are available on various operating systems. A shell script
is a program that runs line-by-line (that is, it is interpreted) in a Unix shell.
There are a few flavors of shells: csh (C shell), tcsh (often pronounced “tee-
see-shell”), bash, zsh, and others. They all share a common core functionality,
although the syntax can vary.
A shell script conventionally starts with #!/bin/bash or #!/bin/tcsh.
The first two bytes ‘#!’ have a special meaning: they mark the file as a script meant for execution, and the path that follows names the interpreter that runs it.
The lines that follow are any commands that are understood by the Unix
environment. Common file extensions for shell scripts are .cmd, .sh, but none
is necessary. (The Unix file system has three types of permissions for each file:
readable, writable, and executable. A script file needs to be executable, which
distinguishes it from a plain text file which is not executable, so it will not be
accidentally interpreted as a list of commands.)
An example of a simple shell script is:
#!/bin/csh
prog1.out &
prog2.out &
which submits two programs as background jobs.
The following shell script moves two files into another directory and re-
names them to indicate they belong to model run number 4:
#!/bin/tcsh
setenv tardir "Output/" # target directory
setenv ext "dat4" # file extension
mv zprofile $tardir/zprofile.{$ext}
mv fort.24 $tardir/series.{$ext}
Should an error occur during the execution of one of those commands, the
script will nevertheless proceed to the next line. For example
gcc myprog.c -o prog.out
prog.out &
is dangerous. If the compilation fails it will run the executable nevertheless,
although it may be an old executable. It would be safer to precede these lines
with rm -f prog.out, to make sure any old executable with this name is
deleted. This is the spirit of defensive programming.
The following merges all files whose filenames begin with ‘data’ and end
with ‘.csv’ using a wildcard:
#!/bin/csh
rm -f alldata.csv
foreach i ( data*.csv )
cat $i >> alldata.csv
end
In Unix-like environments >> appends a file, so the file first ought to be deleted
to make sure we start with an empty file.
Scripts are also great for data reformatting. The following takes files with
comma-separated values (with file extension .csv), strips out all lines that
contain the %-symbol with the grep command, and then outputs two selected
columns from the comma-separated file. Section 13.2 explained grep and awk.
#!/bin/tcsh
grep -v % data814.csv | awk -F, '{print $1,$4}' > h814.dat
grep -v % data815.csv | awk -F, '{print $1,$4}' > h815.dat
Shell scripts can have loops, if statements, variables, and arbitrary length.
Shell scripts are a great tool to automate tasks, glue various executables to-
gether, run batch jobs, and endless other useful tasks.
storage.” The saying originally referred to memory, but is now applicable for
storage. It is easier and cheaper than ever to store data.
There is so much data that processing it can be a major challenge, a sit-
uation often referred to as “Big Data”. Big data is a broad and vague term
that refers to data so large or complex that traditional data processing meth-
ods or technology become inadequate. They may require distributed storage,
new numerical methods, or a even a whole ecosystem of hardware and soft-
ware (such as “Hadoop” which many cloud services use). Various problems
can arise when working with large data sets, and next we will discuss a few
scenarios.
Data do not fit in main memory: An example of a data format that is
conscious of memory consumption is JPEG2000 (or JP2). This image format
incorporates smart decoding for work with very large images. It is possible
to smoothly pan a lower-resolution version of the image and to zoom into
a portion of the image by loading (and decompressing) only a part of the
compressed data into memory.
Data take a long time to read: Even if data fit on the local hard drive,
reading them could take a while. The data transfer rate from hard drive to
memory, and ultimately to the CPU, is limited. (A current technology, typical
transfer speeds from a magnetic disk or a solid state drive are 4 Gb/s =
0.5 GB/s.) This is an appropriate point to remember that compressed files
not only save storage space, they can also be read and written faster because
they are smaller. There are tools that work with compressed files directly. For
example, as grep searches a file, zgrep searches a compressed file, or more
specifically a file compressed with gzip.
The text processing utilities (section 13.2) work with small and big data
alike, only with large data sets their computational and memory efficiency
matters. This efficiency is determined not by the text utility per se, but by its
implementation. That said, the more primitive utility grep will most likely
perform searches faster than the more sophisticated sed, Awk, and Perl. Some
text utilities can also take advantage of multiple CPU cores.
sed, for example, can do in-place replacements, with the option -i. For
example,
sed -i 's/old/new/g' gigantic_file.dat
does not output a copy of the input file with the text substitutions, as it would
without the -i option; instead, it overwrites the input file with the output.
And to be clear: the in-place option does not do anything for runtime; it
merely saves on storage size.
Data do not fit onto the local hard drive: As long as the data can be
downloaded, they can be analyzed and then deleted (streamed through). It
becomes increasingly important that the data are formatted rigorously and
the programs that process them be fault tolerant; otherwise, the data pipeline
will break too frequently.
Data are too big to be downloaded: Storage is easier than transfer, hence
analysis has to be where the data are, not where the user is. This is a game
changer, as now the responsibility for the data analysis infrastructure lies with
the data host. Cloud computing primarily deals with this issue, as was already
described in chapter 13.
To continue with the example of JPEG2000, the standard also provides
the internet streaming protocol JPIP (JPEG2000 Interactive Protocol) for
efficient transmission of JP2 files over the network. With JPIP it is possible
to download only the requested part of an image, saving bandwidth and time.
With JPIP it is possible to view large images in real time over the network,
and without ever downloading the entire image.
Data may not fit anywhere, or the data are produced in bursts and cannot
be stored fast enough (e.g., particle collider experiments). In other words,
the data move too fast. This situation has to be dealt with by on-the-fly
data analysis, also known as “stream processing.” Arithmetic is much faster
than writing to disk, so valuable analysis can be done on the fly. Hardware
accelerators, such as GPUs (section 9.4), are natural candidates for such a
situation.
Big data is not necessarily large data. Data that are complex but insuffi-
ciently structured may require extensive effort to be properly analyzed. Here
too, the cloud mantra of giving users access to the raw data by placing the CPUs close to the disks enables users to work directly with the data instead
of having to wait for sufficient curation or reduced data products by another
party.
EXERCISES
14.1 a. Create a large file containing numbers >100 MiB.
b. Write a program that reads this file and measures the execution time.
c. Compress the file with one of the many available compression tools,
such as zip or gzip. Determine the compression factor (ratio of file
sizes).
d. Write a program that can read the compressed file directly.
e. Measure the reduction in read time and compare it with the reduction
in file size.
14.2 Write a script that validates and merges a set of files. Suppose we have
files with names out.0 to out.99, i.e., 100 of them, and each has entries
of the form
0 273.15
1 260.
Write a script that checks that the first column contains the same
values in all files and that the entries of the second column are
nonnegative. Then merge the files in the order of the numerical
value of their file extension (0,...,99), not their alphanumerical value
If v is constant, then the solution is simply f (x, t) = g(x − vt), where g can
be any function in one variable. This is immediately apparent if one plugs this
expression into the above equation. The form of g is determined by the initial
condition f (x, 0). In an infinite domain or for periodic boundary conditions,
the average of f and the maximum of f never change.
(A brief explanation of the nature of the advection equation, with v(x, t)
dependent on time and space is in order. f is constant along a path x(t), if
the total derivative of f vanishes,
0 = df(x(t), t)/dt = ∂f/∂t + (dx/dt) ∂f/∂x
Hence, dx/dt = v(x, t) describes such paths, and the equation describes the
“moving around of material”, or, rather, of f -values. In contrast,
∂f/∂t + ∂(vf)/∂x = 0
with v inside the spatial derivative, is the local conservation law. When a
quantity is conserved, changes with time are due to material moving in from
one side or out the other side. The flux at any point is vf and the amount of
material in an interval of length 2h is 2hf , hence ∂(2f h)/∂t = vf (x − h) −
vf (x + h). In the limit h → 0, this leads to ∂f /∂t + ∂(vf )/∂x = 0. Since we
will take v to be a constant, the distinction between these two equations does
not matter, but it is helpful to understand this distinction nevertheless.)
A simple numerical scheme would be to replace the time derivative with
[f (x, t + k) − f (x, t)]/k and the spatial derivative with [f (x + h, t) − f (x −
h, t)]/2h, where k is a small time interval and h a short distance. The advection
equation then becomes
[f(x, t + k) − f(x, t)]/k + O(k) + v [f(x + h, t) − f(x − h, t)]/(2h) + O(h²) = 0.
This discretization is accurate to first order in time and to second order in
space. With this choice we arrive at the scheme
f(x, t + k) = f(x, t) − kv [f(x + h, t) − f(x − h, t)]/(2h)
As will soon be shown, this scheme does not work.
Instead of the forward difference for the time discretization we can use the
backward difference [f (x, t) − f (x, t − k)]/k or the center difference [f (x, t +
k) − f (x, t − k)]/2k. Or, f (x, t) in the forward difference can be eliminated by
replacing it with [f (x + h, t) + f (x − h, t)]/2. There are further possibilities,
but let us consider only these four. Table 15.1 lists the resulting difference
schemes.
Table 15.1 The four difference schemes for the advection equation derived above (the stencil diagrams of the original table are not reproduced):

first scheme (forward difference in time):
$$f_j^{n+1} = f_j^n - v\frac{k}{2h}\left(f_{j+1}^n - f_{j-1}^n\right)$$

second scheme (backward difference in time, implicit):
$$f_j^{n+1} + v\frac{k}{2h}\left(f_{j+1}^{n+1} - f_{j-1}^{n+1}\right) = f_j^n$$

third scheme (center difference in time):
$$f_j^{n+1} = f_j^{n-1} - v\frac{k}{h}\left(f_{j+1}^n - f_{j-1}^n\right)$$

fourth scheme (spatial average in place of $f_j^n$):
$$f_j^{n+1} = \tfrac{1}{2}\left(1 - v\tfrac{k}{h}\right)f_{j+1}^n + \tfrac{1}{2}\left(1 + v\tfrac{k}{h}\right)f_{j-1}^n$$

For purely historical reasons some of these schemes have names. The second scheme is called Lax-Wendroff, the third Leapfrog (its stencil skips the center point $f_j^n$, connecting levels $n-1$ and $n+1$ at the central grid point with level $n$ only at the neighboring points, which explains the name), and the last Lax-Friedrichs. But there are so many possible schemes that this nomenclature is not practical.
The first scheme does not work at all, even for constant velocity. Fig-
ure 15.1(a) shows the appearance of large, growing oscillations that cannot
be correct, since the exact solution is the initial conditions shifted. This is a
numerical instability.
The reason for the instability can be understood with a bit of mathematics.
Since the advection equation is linear in f , we can consider a single mode
f (x, t) = f (t) exp(imx), where f (t) is the amplitude and m the wave number.
The general solution is a superposition (sum) of such modes. For the first
scheme in Table 15.1 this leads to
$$f(t+k) = f(t) - vk\, f(t)\,\frac{e^{imh} - e^{-imh}}{2h}$$
and further to $f(t+k)/f(t) = 1 - ikv\sin(mh)/h$. Hence, the amplification factor
$$|A|^2 = \left|\frac{f(t+k)}{f(t)}\right|^2 = 1 + \left(\frac{kv}{h}\right)^2 \sin^2(mh),$$
which is larger than 1. Modes grow with time, no matter how fine the resolution. Modes with shorter wavelength (larger m) grow faster; hence the instability.
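To see this concretely, here is a minimal sketch (not from the book; grid size, velocity, and step count are arbitrary illustrative choices) that applies the first scheme to a smooth initial condition with periodic boundaries:

import numpy as np

N, v = 100, 1.0
h = 1.0 / N
k = 0.5 * h / v                  # even a small time step does not prevent the growth
x = np.arange(N) * h
f = np.sin(2 * np.pi * x)        # smooth initial condition

for n in range(500):
    # forward difference in time, center difference in space; np.roll gives periodic boundaries
    f = f - v * k / (2 * h) * (np.roll(f, -1) - np.roll(f, 1))

print(np.max(np.abs(f)))         # has grown enormously; the short-wavelength modes blow up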
FIGURE 15.1 Panels (a)–(d): f(x) versus x on the interval 0 to 1. (Plots not reproduced.)
The same analysis applied, for instance, to the last of the four schemes
yields
$$|A|^2 = \cos^2(mh) + \left(\frac{vk}{h}\right)^2 \sin^2(mh).$$
As long as |vk/h| ≤ 1, the amplification factor |A| ≤ 1, even for the worst
m. Hence, the time step k must be chosen such that k ≤ h/|v|. This is a
requirement for numerical stability.
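This last scheme indeed stays bounded when the time step respects the condition. A minimal sketch (not from the book; parameters are arbitrary illustrative choices):

import numpy as np

N, v = 100, 1.0
h = 1.0 / N
k = 0.9 * h / abs(v)             # satisfies the stability requirement k <= h/|v|
x = np.arange(N) * h
f = np.sin(2 * np.pi * x)

for n in range(int(1.0 / k)):    # advect for roughly one traversal of the domain
    # Lax-Friedrichs step with periodic boundaries via np.roll
    f = 0.5 * (1 - v * k / h) * np.roll(f, -1) + 0.5 * (1 + v * k / h) * np.roll(f, 1)

print(np.max(np.abs(f)))         # remains of order 1 (with some numerical damping)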
The second scheme in Table 15.1 contains $f^{n+1}$, the solution at a future time, simultaneously at several grid points and hence leads only to an implicit equation for $f^{n+1}$. It is therefore called an “implicit” scheme. For all other discretizations shown in the table, $f^{n+1}$ is given explicitly in terms of $f^n$. The system of linear equations can be represented by a matrix that multiplies the vector $f^{n+1}$, with $f^n$ on the right-hand side:
$$\begin{pmatrix} 1 & * & & & * \\ * & 1 & * & & \\ & \ddots & \ddots & \ddots & \\ & & * & 1 & * \\ * & & & * & 1 \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}^{n+1} = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}^{n}$$
Stars stand for plus or minus vk/2h and all blank entries are zeros. The
elements in the upper right and lower left corner arise from periodic boundary
conditions. (If we were to solve the advection equation in more than one
spatial dimension, this matrix would become more complicated.) The implicit
scheme leads to a tridiagonal system of equations that needs to be solved
at every time step, if the velocity depends on time. With or without corner
elements, the system can be solved in O(N ) steps and requires only O(N )
storage (chapter 10). Hence, the computational cost is not at all prohibitive,
but comparable to that of the explicit schemes. The scheme is stable for any
step size. It becomes less and less accurate as the step size increases, but is
never unstable.
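One possible way to carry out these repeated solves, sketched here with SciPy's sparse routines (matrix size, velocity, and time step are illustrative choices, not from the book):

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

N, v = 200, 1.0
h = 1.0 / N
k = 2.0 * h / v                  # deliberately larger than the explicit stability limit
a = v * k / (2 * h)

# cyclic tridiagonal matrix: 1 on the diagonal, +/- vk/2h next to it,
# and two corner entries from the periodic boundary conditions
A = diags([np.full(N - 1, -a), np.ones(N), np.full(N - 1, a)], [-1, 0, 1], format='lil')
A[0, N - 1] = -a
A[N - 1, 0] = a
A = A.tocsc()

x = np.arange(N) * h
f = np.exp(-100 * (x - 0.5)**2)  # initial condition
for n in range(100):
    f = spsolve(A, f)            # one implicit time step: solve A f^(n+1) = f^n

For constant velocity the matrix never changes, so in practice one would factorize it once and reuse the factorization (scipy.sparse.linalg.splu provides this) rather than calling spsolve at every step.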
The third, center-difference scheme is explicit and second-order accurate
in time, but requires three instead of two storage levels, because it simultane-
ously involves $f^{n+1}$, $f^n$, and $f^{n-1}$. It is just like taking half a time step and
evaluating the spatial derivative there, then using this information to take the
whole step. Starting the scheme requires a single-differenced step initially.
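A sketch of how the three storage levels and the start-up step might look in practice (constant velocity, periodic boundaries; the parameters are again only illustrative):

import numpy as np

N, v = 100, 1.0
h = 1.0 / N
k = 0.5 * h / v                          # within the Leapfrog stability limit
x = np.arange(N) * h
f_old = np.sin(2 * np.pi * x)            # level n-1
# start-up: one single-differenced (forward) step to obtain level n
f = f_old - v * k / (2 * h) * (np.roll(f_old, -1) - np.roll(f_old, 1))

for n in range(1000):
    # Leapfrog: level n+1 from levels n-1 and n
    f_new = f_old - v * k / h * (np.roll(f, -1) - np.roll(f, 1))
    f_old, f = f, f_new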
The stability of the last of the four schemes was already discussed. The
scheme is first-order accurate in time and second-order accurate in space, so
that the error is O(k) + O(h2 ). When the time step is chosen as k = O(h),
as appropriate for the stability condition k ≤ h/|v|, the time integration is
effectively the less accurate discretization. Higher orders of accuracy in both
space and time can be achieved with larger stencils. Table 15.2 summarizes
the stability properties of the four schemes.
Table 15.2 Stability properties of the four schemes (stencil diagrams not reproduced):

first scheme (forward time difference): unconditionally unstable
Lax-Wendroff (implicit): unconditionally stable
Leapfrog: conditionally stable, k/h < 1/|v|
Lax-Friedrichs: conditionally stable, k/h < 1/|v|
FIGURE 15.2 Sparse matrix with coefficients that correspond to the finite-
difference approximation of the two-dimensional Laplace equation. All
blank entries are zero. A sparse matrix of this form is also called “block-
tridiagonal”.
Each interior grid point is replaced by the average of its neighbors. When repeated again and again, the solution relaxes to the correct
one. This might not be the computationally fastest way to solve the Laplace
equation, and it is not, but it is exceedingly quick to implement. This is a
simple version of a “relaxation method.” It iteratively relaxes to the correct
solution.
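A minimal sketch of such a relaxation iteration for the two-dimensional Laplace equation (grid size, boundary values, and iteration count are arbitrary choices for illustration):

import numpy as np

N = 50
u = np.zeros((N, N))
u[0, :] = 1.0                            # fixed boundary values: 1 on one edge, 0 on the others

for it in range(5000):
    # replace every interior point by the average of its four neighbors
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                            + u[1:-1, :-2] + u[1:-1, 2:])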
When every derivative in a PDE is replaced by a finite difference, the PDE
turns into a large system of equations. This very general approach often leads
to a workable numerical scheme.
This is only one of several broad classes of methods for solving PDEs numerically:
• Finite-difference methods
• Spectral methods (and more generally, integral transform methods)
• Finite-element methods
• Particle methods
EXERCISES
15.1 The heat or diffusion equation
$$\frac{\partial f}{\partial t} = D\,\frac{\partial^2 f}{\partial x^2}$$
describes the evolution of temperature f (x, t) for a (constant) thermal
diffusivity D.
a. Consider the scheme with a forward time difference,
$$\frac{f_j^{n+1} - f_j^n}{k} = D\,\frac{f_{j+1}^n - 2 f_j^n + f_{j-1}^n}{h^2},$$
and derive the condition for numerical stability.
b. The backward time difference
$$\frac{f_j^{n+1} - f_j^n}{k} = D\,\frac{f_{j+1}^{n+1} - 2 f_j^{n+1} + f_{j-1}^{n+1}}{h^2}$$
leads to an implicit scheme. Show that this scheme is unconditionally
stable.
c. The Crank-Nicolson method for the heat equation uses the average
of the spatial derivatives evaluated at time steps n and n + 1:
$$\frac{f_j^{n+1} - f_j^n}{k} = \frac{1}{2} D\,\frac{f_{j+1}^{n} - 2 f_j^{n} + f_{j-1}^{n}}{h^2} + \frac{1}{2} D\,\frac{f_{j+1}^{n+1} - 2 f_j^{n+1} + f_{j-1}^{n+1}}{h^2}$$
Reformulated Problems
A Nobel prize was awarded for the invention of an approximation method that
can calculate the quantum mechanical ground state of many-electron systems.
This breakthrough was based, not on the invention of a new numerical method
or algorithm, nor on the implementation of one, but on an approximate refor-
mulation of the governing equation. This last chapter illustrates the power of
reformulating equations to make them amenable to numerical solution. It also places us in the realm of partial differential equations that are boundary value
problems. The chapter, and the main text of the book, ends with an outline
of the Density Functional Method, for which the prize was given.
As a first example, consider the equations of electrostatics for the electric field E:
$$\nabla \cdot \mathbf{E} = \rho/\epsilon_0, \qquad \nabla \times \mathbf{E} = 0 \qquad \text{(formulation 1)}$$
Here, ρ is the charge density and ε0 a universal physical constant. These are
four coupled partial differential equations (PDEs) for the three components
of E. Since the curl of E vanishes, it is possible to introduce the electric
potential Φ, and then obtain the electric field as the derivative of the potential
E = −∇Φ, thus
$$\nabla^2 \Phi = -\frac{\rho}{\epsilon_0} \qquad \text{(formulation 2)}$$
The potential is a scalar function, and hence easier to deal with than the vector
E. This simple reformulation leads to a single partial differential equation, known as the Poisson equation. For a collection of point charges, its solution is
$$\Phi(\mathbf{r}) = \frac{1}{4\pi\epsilon_0} \sum_j \frac{q_j}{|\mathbf{r} - \mathbf{r}_j|},$$
where the sum is over point charges with charge qj at position rj . Generalizing
further, for a continuous charge distribution the potential can be expressed
by integrating over all sources,
$$\Phi(\mathbf{r}) = \frac{1}{4\pi\epsilon_0} \int \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}' \qquad \text{(formulation 3)}$$
This was a physically guided derivation; with a bit of vector calculus it could
be verified that this integral indeed satisfies the Poisson equation, and is thus
equivalent to formulation 2. (Those familiar with vector analysis know that
the Laplacian of 1/r is zero everywhere, except at the origin.) Each of the
three formulations describes the same physics. The integral (formulation 3) is
an alternative to solving the partial differential equation (formulation 2).
Such methods can be constructed whenever a PDE can be reformulated
as an integral over sources. (Another example is the integral solution to the
diffusion equation given in section 15.2.) It would be difficult to imagine a
situation where Formulation 1 is less cumbersome to evaluate than formula-
tion 2, but whether formulation 2 or 3 is more efficient numerically depends
on the situation. If the potential is required at only a few locations in space
and if the charges occupy only a small portion of space, the integral may be
computationally the more efficient formulation. To put it in the most dramatic
contrast: The electric potential of a single point charge can be evaluated with
a simple formula, or one could solve the Laplace equation over all space to
get the same result. Alternatively, if the source is spread out, it is easier to solve the Poisson equation for the given source distribution once and for all instead of evaluating the integral anew at each location. So, depending on the
situation, formulation 2 or 3 is faster.
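For point charges, formulation 3 reduces to a plain sum, which is trivial to evaluate at a few chosen locations. A minimal sketch (the charges and positions are made up for illustration):

import numpy as np

def potential(r, charges, positions, eps0=8.854e-12):
    # formulation 3 for point charges: sum of q_j / (4 pi eps0 |r - r_j|)
    return sum(q / (4 * np.pi * eps0 * np.linalg.norm(r - rj))
               for q, rj in zip(charges, positions))

charges = [1e-9, -1e-9]                                    # a small dipole, in coulombs
positions = [np.array([0., 0., 0.1]), np.array([0., 0., -0.1])]
print(potential(np.array([0., 0., 1.]), charges, positions))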
In addition to exact reformulations, there may be approximate formula-
tions of the problem that drastically reduce its computational complexity.
For the ground state, the energy E is a minimum, a fact that is formally expressed by the variational principle.
The symbol ∇1 denotes the gradient with respect to the first argument, here
the coordinate vector r1 . The expressions are grouped into terms that give
rise to the kinetic energy T , electron-electron interaction Vee , and the exter-
nal potential due to the attraction of the nucleus—external from the electron’s
point of view, Vext . The Schrödinger equation with this potential cannot be
solved analytically, but the aforementioned method of approximation by min-
imization is still applicable.
A helium atom has only two electrons. A simple molecule like CO2 has
6 + 2 × 8 = 22 electrons. A large molecule, say a protein, can easily have tens
of thousands of electrons. Since ψ becomes a function of many variables, three
for each additional electron, an increasing number of parameters is required to
describe the solution in that many variables to a given accuracy. The number
of necessary parameters for N electrons is (a few)$^{3N}$, where “a few” is the
number of parameters desired to describe the wavefunction along a single
dimension. The computational cost increases exponentially with the number
of electrons. Calculating the ground state of a quantum system with many
electrons is computationally unfeasible with the method described.
The integrals are over all but one of the coordinate vectors, and because of
the antisymmetries of the wavefunction it does not matter which coordinate
vector is left out.
For brevity the integrals over all r’s can be denoted by
$$\langle \psi |\,(\text{anything})\,| \psi \rangle = \int\!\cdots\!\int \psi^*(\mathbf{r}_1, \mathbf{r}_2, \ldots)\,(\text{anything})\,\psi(\mathbf{r}_1, \mathbf{r}_2, \ldots)\, d\mathbf{r}_1\, d\mathbf{r}_2 \cdots d\mathbf{r}_N.$$
Note that Egs is expressed in terms of ngs , and does not need to be written
in terms of ψgs . The problem splits into two parts
$$E_{\rm gs} = \min_n \left( F[n] + \int V(\mathbf{r})\, n(\mathbf{r})\, d\mathbf{r} \right)$$
EXERCISES
16.1 The gravitational potential of a point mass and the electric potential of a point charge both have the form Φ ∝ 1/|r|.
b. $-k^2\,\hat{u}(k) = \hat{f}(k)$
c. $\int_{-\infty}^{\infty} u'v'\,dx = \int_{-\infty}^{\infty} f v\,dx$
d. $\min_u \int_{-\infty}^{\infty} \left(\dfrac{u'^2}{2} - f u\right) dx$
APPENDIX A
The Unix Environment
Unix is an operating system, and the many variants of Linux are essentially dialects of Unix. For many decades, including the current decade, Unix-like environments have been a popular choice for scientific work, as well as many other kinds of work. Moreover, large computer clusters and clouds often run on a Unix-like operating system. For readers who are not already familiar with the Unix
environment, here is a compact introduction.
All major operating systems can provide Unix-like environments. MacOS
is based on Unix, and on a Mac the command line interface to Unix can be
reached with Applications > Utilities > Terminal. For Microsoft Windows, em-
ulators of a Unix environment are available (e.g., Cygwin and MKS Toolkit).
Installing such an emulator makes Unix, and more importantly all the tools
that come with it, available also on Windows.
A large number of widely used, and incredibly practical, tools are available
within Unix. It is these tools that are of primary practical interest, irrespective
of the operating system. In fact, most people who use Unix tools work on
computers with operating systems other than Unix or Linux.
Even to this day, the most flexible and fundamental way of working with
Unix is through the traditional command line interface, which accepts text-
based commands. The most basic Unix commands are often only two or three
letters long. For example,
cp file1 file2
copies file1 to file2. Directory names end in a slash to distinguish them
from file names. For example, the command
mv file1 Data/
moves file1 into directory Data/.
And it gets fancier. For example,
scp myfile.txt username@remotecomputer:
will copy a file onto a remote computer or cloud connected through a network,
such as the internet. The colon at the end indicates that the second argument
is not a file name but a remote computer (that the user must have an account
on). Once there was a rcp command for “remote copy”, but it was superseded
by “secure remote copy” or scp.
ssh (secure shell) provides the important functionality of logging into a
remote Unix-based computer:
ssh username@hostname
where hostname could, for instance, be the public address of an instance on the Amazon cloud. As mentioned above, scp can be used
to copy files to and from remote machines, and with the recursive option -r,
entire directories can be copied.
Many Unix commands understand “Regular Expressions” (described in
chapter 13), and even more use “shell globs”. For example, cat *.txt displays
the content of all files with extension txt. The command cat is an abbreviation
for “concatenate” and can be used to display the content of files. The most
common glob rules are * for zero-or-more characters and ? for exactly one
character. A way to express a choice of one or more strings is {...,...}.
For example, ls {*.csv,*.txt} lists all files with either of these two file
extensions.
Command line options. Command options are preceded by a dash. For
example, head displays the first ten lines of a file, and head -n 20 displays
the first twenty. ls *.dat lists all file names with extension .dat, and ls -t
*.dat does so in time order. (Every file has a time stamp which marks the
last time the file was modified.) ls -tr *.dat, which is equivalent to ls -t
-r *.dat, lists the content of the directory in reverse time order. Options
that are more than one letter long are often preceded by a double dash. For
example, sort -k 2 is equivalent to sort --key=2 and sorts according to the
second column.
Processes and jobs. An ampersand symbol & after a command
a.out &
executes the command in the background instead of in the foreground. A
background job frees up the prompt, so additional commands can be entered.
It continues to execute even after the user logs out, until it is finished, killed,
or the computer goes down.
ps lists current processes, including their process id (an integer). top dis-
plays information about processes. kill terminates a process based on its id,
and pkill based on its name. That way a process that runs in the background
can be terminated.
The commands more and less also display the content of (plain-text) files.
After all, less is more. The simplest form of a print statement is echo.
Redirect and pipes. Another basic function is the redirect:
echo Hello World > tmp
writes the words “Hello World” into the file tmp. In other words, the “larger
than” symbol redirects the output to a file. If a file with that name does not
already exist, it will be created; otherwise its content will be overwritten.
a.out > tmp
sends the output of the program a.out to a file. The double >> appends.
Furthermore, < takes the input to a command from a file rather than the key-
board. Without redirects, the standard output is the display, and the standard
input is the keyboard. (To be complete, in addition to an input and an out-
put stream, there is also a standard error stream, which by default is also
the screen/display. The “larger than” > does not redirect the standard error
stream, meaning any error messages still end up on the display.)
The output of one command can be sent directly as input to another
command with a so-called pipe |. For example,
a.out | sort
sorts the output of the program.
a.out | tee tmp
feeds the output of a.out to the program tee. The program tee then displays
the output and stores it in the file tmp. More generally, tee is a utility that
redirects the output to multiple files or processes.
Redirects and pipes have a similar function; the difference is that redi-
rection sends the output to a file whereas pipes send the output to another
command.
Some symbols need to be “escaped” with a backslash. For example, since
a space is significant in Unix, a space in a file name “my file” often needs to
be written as my\ file to indicate this is a space in a single argument rather
than two separate arguments. Similarly, a Regular Expression or glob character such as * may need to be written as \*.
Unix tools consist of hundreds of built-in core utilities. A number of useful
Unix utilities, such as diff, sort, and paste, have already been mentioned
in chapter 13. The text processing utilities described in chapter 13 can be
combined in a single line. For example,
grep NaN mydata.dat | awk '{if ($1==0) print}'
finds all lines that contain NaN and then outputs those lines that also have 0
in their first column.
Environment variables. An environment variable can affect the way a running process behaves. For example, LD_LIBRARY_PATH is the directory or set of directories where shared libraries are searched for at run time. A running process can query the value of an environment variable set by the user or the system. Similarly, environment variables can hold the number of cores or threads on a CPU and the id of the core a job is running on.
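From within a Python program, for example, the environment can be queried and changed through os.environ (the variable name here is only an example):
>>> import os
>>> os.environ.get('OMP_NUM_THREADS')       # query; returns None if the variable is not set
>>> os.environ['OMP_NUM_THREADS'] = '4'     # set it for this process and its child processes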
Recommended Reading: Unix tools are free, and abundant and equally free documentation is available online. The GNU Core Utilities are documented on many websites, including www.gnu.org/software/coreutils/manual/coreutils.html. In terms of books, Robbins' Unix in a Nutshell is a comprehensive guide that covers basic Unix commands, Unix shells, Regular Expressions, common editors, awk, and other topics.
APPENDIX B
Numerical Libraries
Answers to Brainteasers
(i) In general only polynomials up to and including 4th degree can be solved
in closed form, but this 5th degree polynomial has symmetric coeffi-
cients. Divide by $x^{5/2}$ and substitute $y = x^{1/2} + x^{-1/2}$, which yields a polynomial of lower order.
(ii) Any rational function (the ratio of polynomials) can be integrated. The
result of this particular integral is not simple.
(iii) $\sum_{k=1}^{n} k^4 = n(n+1)(2n+1)(3n^2+3n-1)/30$. In fact, the sum of $k^q$ can be expressed in closed form for any positive integer power q.
(iv) An iteration of this type has a solution of the form $y_n = c_1\lambda_1^n + c_2\lambda_2^n$. The value of λ can be conveniently determined using the ansatz $y_n = \lambda^n$, which leads to a quadratic equation for λ with two solutions, $\lambda_1$ and $\lambda_2$.
(v) The exponential of any 2×2 matrix can be obtained analytically. The
exponential of a matrix can be calculated from the Taylor series. This
particular matrix can be decomposed into the sum of a diagonal matrix
D = ((2, 0), (0, 2)) and a remainder R = ((0, −1), (0, 0)) whose powers
vanish. Powers of the matrix are of the simple form $(D + R)^n = D^n + nD^{n-1}R$. The terms of the power expansion form a series that can be
summed.
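For example, carrying out the summation with this decomposition gives
$$e^{D+R} = \sum_{n=0}^{\infty}\frac{D^n + nD^{n-1}R}{n!} = e^2 I + e^2 R = e^2\begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}.$$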
formally, given ε > 0 there exists an N such that for any pair m, n > N, ‖a_m − a_n‖ < ε. We used m = 2n, but only because that is all the resolutions we evaluated. In spirit, we meant to use the preceding N. It is not sufficient for each term to become arbitrarily close to the preceding term (‖a_{n+1} − a_n‖ < ε), so our previous test criterion is actually failing in this regard. It should include pairs of n and m that neither have a fixed ratio nor a fixed difference. On the other hand, we would not expect that to matter for our specific example. Given the limitation that we can evaluate u_N for a finite number of indices only, the best we can do is to show that ‖u_M − u_N‖ < ε with an “unbiased” relation between M and N and a range of M and N.
Whereas every convergent sequence is a Cauchy sequence, the converse is
not always true. The Cauchy sequence criterion only implies convergence when
the space is “complete.” Completeness means our discrete representations of u
must be able to represent the solution u as N increases. In other words, a series
with infinitely many terms must be able to match the exact result; obviously
a series cannot converge to the correct result if it cannot even represent it.
For example, if the function u we seek to approximate goes to infinity, then no
trigonometric series will ever converge toward it. As long as uN can at least in
principle match the solution, uN in a Cauchy sequence converges to a u that
no longer depends on the resolution, uN → u.
Finally, we move to the last part of the question “where u is the exact,
correct answer.” The answer to this is clearly no. Nothing in that formalism
requires u to be the solution to an unnamed equation. It could in principle
be the solution to a discretized version of the equation only. Such spurious
convergence is possible, if rare. In plain language: just because a numerical
solution converges, it does not mean it converges to the correct answer, but if
it does not converge, then there is definitely something wrong.
Bibliography
[14] W. McKinney. Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython. O’Reilly Media, second edition, 2017.
[15] M. Metcalf, J. Reid, and M. Cohen. Modern Fortran Explained. Oxford
University Press, fourth edition, 2011.
[16] D. A. Patterson and J. L. Hennessy. Computer Organization and Design:
The Hardware/Software Interface. Morgan Kaufmann, fifth edition, 2013.
[17] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Nu-
merical Recipes: The Art of Scientific Computing. Cambridge University
Press, third edition, 2007.
[18] A. Robbins. UNIX in a Nutshell. O’Reilly, fourth edition, 2005.
[19] R. Sedgewick and K. Wayne. Algorithms. Addison-Wesley, fourth edition,
2011.
[20] S. S. Skiena. The Algorithm Design Manual. Springer, second edition,
2011.
[21] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer,
third edition, 2002.
[22] L. N. Trefethen. Approximation Theory and Approximation Practice.
Society for Industrial and Applied Mathematics, 2012.
Index