Mathematics for Computer Scientists
A Practice-Oriented Approach
Peter Hartmann
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden
GmbH, part of Springer Nature 2023
This Springer imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH, part of
Springer Nature.
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Preface
Mathematics is an essential root of computer science: In all its areas, mathematical meth-
ods are used again and again. Mathematical thinking is typical work for computer scien-
tists. Up until now, I believe too little attention has been given to the close interlocking of
the two disciplines.
This textbook contains in one volume the essential areas of mathematics needed to
understand computer science. In addition, I constantly present concrete applications of
mathematical techniques for computer science. For example, logic is used to test programs,
and methods of linear algebra are used in robotics and in graphical data processing.
With eigenvectors, the importance of nodes in networks can be assessed. The theory of
algebraic structures turns out to be useful for hashing, in cryptography and for data security.
Differential calculus is used to compute interpolation curves, and the Fourier transform
plays an important role in data compression. Queueing theory and principal component
analysis are important applications of statistics. Throughout the book, connections
between mathematics and computer science are presented.
The presentation is not only about the results, but also about practicing
mathematical thinking when solving problems. Computer scientists require the same
analytical approach to tasks as mathematicians.
The textbook is primarily intended to supplement the mathematics lectures of com-
puter science students. The presentation is a practice-oriented approach and each les-
son explains how you can apply what you have learned by giving you many real world
examples, and by constantly cross-referencing math and computer science. I put a lot of
emphasis on the motivation of the results, the derivations are detailed, and the book is
therefore well suited for self-study. Practitioners who are looking for the mathematics
underlying their applications can use it for reference. However, no matter how you use
the textbook, keep in mind that, just as you cannot learn a programming language by
reading the syntax, it is impossible to understand mathematics without working on problems
with paper and pencil.
The three parts of the book cover discrete mathematics and linear algebra, analysis
including numerical methods, and finally the basics of probability theory and statistics.
Within each chapter, the definitions and theorems are consecutively numbered, a second
Part II Analysis
12 The Real Numbers
12.1 The Axioms of Real Numbers
12.2 Topology
12.3 Comprehension Questions and Exercises
13 Sequences and Series
13.1 Sequences of Numbers
13.2 Series
13.3 Representation of Real Numbers in Numeral Systems
13.4 Comprehension Questions and Exercises
14 Continuous Functions
14.1 Continuity
14.2 Elementary Functions
14.3 Properties of Continuous Functions
14.4 Comprehension Questions and Exercises
15 Differential Calculus
15.1 Differentiable Functions
15.2 Power Series
15.3 Taylor Series
15.4 Differential Calculus of Functions of Several Variables
15.5 Comprehension Questions and Exercises
16 Integral Calculus
16.1 The Integral of Piecewise Continuous Functions
16.2 Applications of the Integral
16.3 Fourier Series
16.4 Comprehension Questions and Exercises
17 Differential Equations
17.1 What are Differential Equations?
17.2 First Order Differential Equations
17.3 nth Order Linear Differential Equations
17.4 Comprehension Questions and Exercises
18 Numerical Methods
18.1 Problems with Numerical Calculations
18.2 Nonlinear Equations
18.3 Splines
18.4 Numerical Integration
18.5 Numerical Solution of Differential Equations
18.6 Comprehension Questions and Exercises
Bibliography
Part I
Discrete Mathematics and Linear Algebra
1 Sets and Mappings
Abstract
Sets, operations on sets and mappings between sets are part of the language of
mathematics. Therefore, we begin the book with them. If you have worked through this
first chapter, you will know
What is a set actually? The following definition of a set is from Georg Cantor (1845–
1918), the founder of modern set theory:
Do you already feel quite uncomfortable with this first definition? After all, mathematics
claims exactness and precision; but this definition does not sound like precise mathemat-
ics at all: What is a “collection”, what are “well-differentiated objects”? As we will see, all
areas of mathematics need solid foundations on which to build. Finding these foundations
is usually very difficult (just as it is difficult to formulate the requirements in a software
project precisely). Cantor tried to put set theory, which had been used more or less
intuitively until then, on a solid mathematical footing and formulated the above definition for
this purpose. But this soon led to unexpected difficulties. Perhaps you know the story of the barber
in a village who says of himself that he shaves exactly those men in the village who do not shave
themselves. Does he shave himself or not? A similar problem arises when one considers the
set of all sets that do not contain themselves as an element. Does this set contain itself or
not? The guild of mathematicians has since extricated itself from this swamp of paradoxes,
but the theory has not become any simpler as a result. Fortunately, as computer scientists,
we do not have to deal with such problems; we can simply use the concept of a set and
make a wide detour around things like “the set of all sets”.
The second way of writing the set N (N = {1, 2, 3, 4, . . .}) shows that I will sometimes be lax
when it is clear what the dots mean. So in this book the natural numbers start with 1. In the
literature, the 0 is sometimes also counted among the natural numbers, so the convention
is not uniform. There is agreement, however, on the notation N for {1, 2, 3, . . .} and N0 for
{0, 1, 2, . . .}.
A few more examples of sets that we will often deal with are the integers Z, the
rational numbers Q and the real numbers R:
Here you have to take a moment to see that the definition on the right really captures all
fractions. Is 4/(−7) in the set? Yes, if you know the rules of fractions, because
4/(−7) = (−4)/7.
Here elements also occur several times: According to the rules of fractions, for example,
4/5 = 8/10.
And why don’t we just write p ∈ Z, q ∈ Z in the definition? Every time a mathema-
tician sees a fraction, an alarm goes off and he checks whether the denominator is not
equal to 0. q ∈ Z would also include the case q = 0. You can’t divide by 0! In Chap. 5 on
algebraic structures, we will see why this does not work.
Without knowing exactly what a decimal number is, we call
R := {x | x is a decimal number} the real numbers and calculate with these numbers as you
have learned in school. We postpone the more precise characterization of R to the second
part of the book (compare Chap. 12). In addition to the elements of Q, the set R also
contains, for example, all roots of positive numbers and numbers like e and π.
The set M = {x | x ∈ N and x < 0} does not contain any element. M is called the
empty set and is denoted by ∅.
Sets are again objects, that is, sets can be combined to sets again, and sets can contain
sets as elements. Here are two strange sets:
M := {N, Z, Q, R}, N := {∅, 1, {1}}.
How many elements do these sets have? M does not have infinitely many, but exactly
four elements! N has three elements: the empty set, “1” and the set which contains the 1.
If a set consists of a single element (M = {m}), one must carefully distinguish
between m and {m}. These are two different objects!
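The counting above can be tried out with a small Python sketch. It is only an illustration under assumptions that are not part of the text: the strings "N", "Z", "Q", "R" are placeholder labels for the infinite number sets, and Python's `frozenset` stands in for a set that is itself an element of another set.

```python
# Placeholder labels stand in for the infinite sets N, Z, Q, R
# (an assumption for illustration; the real sets cannot be enumerated).
M = {"N", "Z", "Q", "R"}

# N := {∅, 1, {1}}: inner sets must be frozensets so that they are hashable.
N = {frozenset(), 1, frozenset({1})}

print(len(M))  # 4
print(len(N))  # 3

# The element m and the one-element set {m} are different objects.
m = 1
print(m == frozenset({m}))  # False
```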
You don’t believe that? Tell me an element of the empty set that is not in M ! See, there is
none!
Fig. 1.1 M ⊂ N
Fig. 1.2 M ⊂ N
Definition 1.2 If M is a set, then the set of all subsets of M is called the power
set of M . It is denoted by P(M).
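For a finite set, the power set can be computed directly. The following is a minimal sketch; the helper name `power_set` is hypothetical and the subsets are represented as `frozenset` objects so that they can be collected in a set.

```python
from itertools import combinations

def power_set(m):
    """Return the set of all subsets of the finite set m (as frozensets)."""
    elements = list(m)
    return {
        frozenset(subset)
        for size in range(len(elements) + 1)
        for subset in combinations(elements, size)
    }

P = power_set({1, 2, 3})
print(len(P))             # 8, i.e. 2**3 subsets
print(frozenset() in P)   # True: the empty set is one of the subsets
```

A set with k elements has 2^k subsets, so this approach is only practical for small sets.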
Examples
The intersection of two sets M and N is the set of elements that are contained in both M
and N , see Fig. 1.3.
M ∩ N := {x | x ∈ M and x ∈ N} (1.1)
M and N are called disjoint, if M ∩ N = ∅.
Fig. 1.3 M ∩ N
1.1 Set Theory 7
Example
M = {1, 3, 5}, N = {2, 3, 5}, S = {5, 7, 8}.
M ∩ N = {3, 5}, N ∩ S = {5}, M ∩ N ∩ S = {5}. ◄
Wait! Does something strike you about the last expression? What is M ∩ N ∩ S ? We haven’t
defined that yet! Maybe we are implicitly assuming that we are forming (M ∩ N) ∩ S. That’s
possible, it is a double application of the definition in (1.1). Or are we maybe forming
M ∩ (N ∩ S) ? Is it the same thing? As long as we don’t know that, we really ought to put
parentheses around it.
The union of two sets M and N is the set of elements contained in M or N , see Fig. 1.4.
It is also allowed for an element to be contained in both sets. Mathematicians and com-
puter scientists don’t use the word “or” as exclusive or—that would be “either or”.
M ∪ N := {x | x ∈ M or x ∈ N}
In the example from above, M ∪ N = {1, 2, 3, 5}.
The difference set M \ N or M − N of two sets M , N is the set of elements contained
in M , but not in N , see Fig. 1.5.
M \ N := {x | x ∈ M and x ∉ N}
If N ⊂ M , then M \ N is also called the complement of N in M . The complement is
denoted by $\overline{N}^{M}$.
Often one calculates with subsets of a certain fixed set, the universe (for example,
with subsets of R). If the universe is clear from the context, the complement is always
formed with respect to it, without mentioning it explicitly. Then we simply write
$\overline{N}$ for the complement. (Occasionally one also reads ∁N .)
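The operations intersection, union, difference and complement correspond to built-in operators on Python sets. The sketch below reuses the sets M, N, S from the example; the universe `U` is an assumed choice made only to illustrate the complement.

```python
M = {1, 3, 5}
N = {2, 3, 5}
S = {5, 7, 8}

print(sorted(M & N))   # intersection: [3, 5]
print(sorted(M | N))   # union: [1, 2, 3, 5]
print(sorted(M - N))   # difference M \ N: [1]

# For these sets, both ways of bracketing M ∩ N ∩ S give the same result.
print((M & N) & S == M & (N & S))   # True

# Complement of N relative to an assumed universe U = {1, ..., 10}.
U = set(range(1, 11))
print(sorted(U - N))   # [1, 4, 6, 7, 8, 9, 10]
```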
Fig. 1.4 M ∪ N
Fig. 1.5 M \ N
Try to write down, for four sets, how much work is saved due to the fact that we are allowed
to omit brackets.
From school you know the distributive law for real numbers: It is
a(b + c) = ab + ac.
Surprisingly, the corresponding rule for sets also applies if the signs of arithmetic are
interchanged—with real numbers this is false!
You shouldn’t believe everything just because it’s in a book; now it’s time for our first
proof. We will prove the first distributive law in detail:
We show: S ∩ (M ∪ N) = (S ∩ M) ∪ (S ∩ N), where A denotes the left-hand side and B the
right-hand side.
“B ⊂ A”: Let x ∈ B. Then x ∈ S ∩ M or x ∈ S ∩ N .
If x ∈ S ∩ M is true, then x ∈ S and x ∈ M , thus also x ∈ M ∪ N and this
implies x ∈ S ∩ (M ∪ N).
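A proof covers all sets at once; a program can only test examples. Still, an exhaustive check over all subsets of a small universe is a useful sanity test of both distributive laws. The sketch below assumes the universe {0, 1, 2}; the helper name `subsets` is hypothetical.

```python
from itertools import combinations

def subsets(universe):
    """Yield every subset of a finite universe."""
    items = list(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

universe = {0, 1, 2}   # assumed small universe, 8 subsets
all_subsets = list(subsets(universe))

first_law = all(
    S & (M | N) == (S & M) | (S & N)
    for S in all_subsets for M in all_subsets for N in all_subsets
)
second_law = all(
    S | (M & N) == (S | M) & (S | N)
    for S in all_subsets for M in all_subsets for N in all_subsets
)
print(first_law, second_law)   # True True
```

This tests 8 · 8 · 8 = 512 combinations per law. It does not replace the proof, which has to work for arbitrary sets.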
Prove the second distributive law yourself. Make a similar case distinction and don’t be sur-
prised if it takes you some time. It is always harder to come up with something than to take
something in. But trying it out for yourself is immensely important for true understanding!
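$M \setminus N = M \cap \overline{N}$
$\overline{M \cup N} = \overline{M} \cap \overline{N}$
$\overline{M \cap N} = \overline{M} \cup \overline{N}$
$\overline{\overline{M}} = M$
$M \subset N$ implies $\overline{N} \subset \overline{M}$.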
So building the complement reverses the operation sign. You should at least partially
check these rules by yourself.
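One way to check these rules on an example is a short Python sketch. The universe `U` and the sets `M`, `N`, `A`, `B` are arbitrary choices for illustration, and `complement` is a hypothetical helper.

```python
U = set(range(10))   # assumed universe for this sketch
M = {1, 2, 3, 4}
N = {3, 4, 5, 6}

def complement(A):
    """Complement of A with respect to the universe U."""
    return U - A

print(M - N == M & complement(N))                           # True
print(complement(M | N) == complement(M) & complement(N))   # True
print(complement(M & N) == complement(M) | complement(N))   # True
print(complement(complement(M)) == M)                       # True

# A ⊂ B implies complement(B) ⊂ complement(A); checked for one subset pair.
A, B = {1, 2}, {1, 2, 3}
print(A <= B and complement(B) <= complement(A))            # True
```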
Occasionally, we will also need infinite intersections and infinite unions:
Definition 1.5 Let M be a set of indices (for example M = N). For all n ∈ M let a
set $A_n$ be given. Then:
$\bigcup_{n \in M} A_n := \{x \mid x \in A_n \text{ for at least one } n \in M\}$,
$\bigcap_{n \in M} A_n := \{x \mid x \in A_n \text{ for all } n \in M\}$.
If M is a finite set, for example M = {1, 2, 3, . . . , k}, then these definitions agree with our
previous definition of union and intersection:
$\bigcup_{n \in M} A_n = A_1 \cup A_2 \cup \dots \cup A_k$,
$\bigcap_{n \in M} A_n = A_1 \cap A_2 \cap \dots \cap A_k$.
In this case, we write $\bigcup_{i=1}^{k} A_i$ for the union and $\bigcap_{i=1}^{k} A_i$ for the intersection.
Example
M = N, $A_n := \{x \in \mathbb{R} \mid 0 < x < \tfrac{1}{n}\}$. Then $\bigcap_{n \in \mathbb{N}} A_n = \emptyset$. ◄
This is strange: Every finite intersection contains elements: $\bigcap_{n=1}^{k} A_n = A_k$, but one
cannot find a real number that is contained in all $A_n$, so the infinite intersection is empty.
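The argument can be made concrete with exact fractions. The sketch below is an illustration, not part of the original text: `in_A` tests membership in the set A_n, and the hypothetical helper `excluding_index` returns, for a given x > 0, an index n with x not in A_n.

```python
from fractions import Fraction
from math import ceil

def in_A(x, n):
    """True if x lies in A_n = {x | 0 < x < 1/n}."""
    return 0 < x < Fraction(1, n)

def excluding_index(x):
    """For x > 0, return an n with 1/n <= x, so that x is not in A_n."""
    return ceil(1 / Fraction(x))

# A point of the finite intersection A_1 ∩ ... ∩ A_k = A_k, here for k = 5.
k = 5
x = Fraction(1, k + 1)
print(all(in_A(x, n) for n in range(1, k + 1)))   # True

# The same x is no longer contained in A_n for a sufficiently large n.
n = excluding_index(x)
print(n, in_A(x, n))   # 6 False
```

Every positive x is excluded by some A_n in this way, and numbers x ≤ 0 lie in no A_n at all, which is why the infinite intersection is empty.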
M × N := {(x, y) | x ∈ M, y ∈ N}
of all ordered pairs (x, y) with x ∈ M and y ∈ N is called the Cartesian product of
M and N .
Ordered means that, for example, (5, 3) and (3, 5) are different elements (in contrast to
{5, 3} = {3, 5} !).
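For finite sets, the Cartesian product can be generated with `itertools.product` from the Python standard library. The sets in this sketch are arbitrary examples; Python tuples play the role of ordered pairs.

```python
from itertools import product

M = {1, 2}
N = {"a", "b", "c"}

M_times_N = set(product(M, N))
print(len(M_times_N))          # 6 = |M| * |N|
print((1, "a") in M_times_N)   # True
print(("a", 1) in M_times_N)   # False: the order within a pair matters

print((5, 3) == (3, 5))        # False: ordered pairs
print({5, 3} == {3, 5})        # True: sets ignore order
```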
Examples
1.2 Relations
Examples of relations
In relational databases, data sets are characterized by the relations that exist between
them.
2. R := {(x, y) ∈ R² | (x, y) lies below the main diagonal} just describes the relation
“>” (Fig. 1.9). If the main diagonal belongs to R, then R determines the relation “≥”.
3. All comparison operators <, >, ≤, ≥, =, ≠ form relations on R.
4. “⊂” is a relation between subsets of R. Thus, the relation R_⊂ is a subset of the
Cartesian product of all subsets of R:
Take a moment and try to really understand this line. It already contains a considerable
degree of abstraction. But there is nothing mysterious about it, it just uses the known defini-
tions quite mechanically, step by step. It is very important to learn how to work out results
step by step. In the end, there is a result that you know is correct, even if one can no longer
grasp the correctness at a glance. ◄
We now focus on the special case M = N , i.e. relations on a set M . Here are some
important properties of relations:
Examples
Equivalence Relations
First of all an
Example
R5 is an equivalence relation:
nR5 n, because n − n = 0 is divisible by 5. So R5 is reflexive.
If nR5 m, then n − m = 5k for some k ∈ Z. Then m − n = 5(−k), thus m − n is also
divisible by 5. So R5 is symmetric.
Let nR5 m and mR5 s, thus n − m = 5k, m − s = 5l. Then
(n − m) + (m − s) = n − s = 5k + 5l = 5(k + l), thus n − s is divisible by 5 and
therefore nR5 s. So R5 is transitive.
We will deal intensively with this equivalence relation (and others with different
divisors than 5) later. ◄
A special property of equivalence relations is that they decompose the underlying set into
disjoint subsets, the so-called equivalence classes:
[a] := {x ∈ M | xRa}
is called the equivalence class of a. The elements of [a] are said to be
equivalent to a.
Examples
1. The hairdresser is interested in the equivalence relation: “has the same hair color
as”.
2. Let R be “=” on R. Then [a] = {x ∈ R | x = a} = {a}.
3. Let R5 ⊂ Z × Z be defined as in (1.2). Then, for example:
[1] = {n ∈ Z | n − 1 is divisible by 5} = {1, 6, 11, 16, . . . , −4, −9, −14, . . .},
[0] = {0, 5, 10, 15, . . . , −5, −10, −15, . . .},
[2] = {2, 7, 12, 17, . . ., −3, −8, −13, . . .}.
We have [5] = {n ∈ Z | n − 5 is divisible by 5} = [0] and also [6] = [1], [7] = [2] and
so on. ◄
In Example 1 this is immediately clear. For the relation R5 from (1.2) this means:
Z = [0] ∪ [1] ∪ [2] ∪ [3] ∪ [4]. There are no more equivalence classes than these five
and these classes are disjoint.
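The decomposition into five classes can be observed on a finite window of the integers. The following sketch is an illustration under assumptions: the range from −15 to 15 is an arbitrary choice, and the names `related` and `equivalence_class` are hypothetical.

```python
def related(n, m, divisor=5):
    """n R5 m holds if n - m is divisible by 5."""
    return (n - m) % divisor == 0

numbers = range(-15, 16)   # finite window into Z, chosen for illustration

def equivalence_class(a):
    """The class [a], restricted to the finite window."""
    return frozenset(x for x in numbers if related(x, a))

classes = {equivalence_class(a) for a in numbers}
print(len(classes))                                   # 5
print(equivalence_class(5) == equivalence_class(0))   # True
print(sorted(equivalence_class(1)))
# [-14, -9, -4, 1, 6, 11]
```

Since the classes are collected in a set, repeated classes such as [5] and [0] appear only once, and together the five classes cover every number in the window exactly once.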
In this union, the equivalence classes will generally be repeated several times. But this does
not matter at all. It is only important that each element of M is an element of an equivalence
class and that we thus capture all elements of M in the union.
Order Relations
The natural, integer and real numbers carry an order that we have already studied in
detail as examples of relations. You can also define orders on other sets. The properties
of such an order relation are the same as for the known order on the numbers:
Examples
In this way, all Cartesian products of linearly ordered sets can be ordered. This
order is called the lexicographic order. The entries in a dictionary are
ordered in this way. ◄
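Python compares tuples and strings lexicographically, so the built-in `sorted` gives a quick way to see this order in action. The sample data below are arbitrary.

```python
# Pairs are compared by the first component, ties are broken by the second.
pairs = [(2, 1), (1, 3), (1, 2), (2, 0)]
print(sorted(pairs))   # [(1, 2), (1, 3), (2, 0), (2, 1)]

# Words are compared letter by letter, as in a dictionary.
words = ["set", "map", "sets", "relation"]
print(sorted(words))   # ['map', 'relation', 'set', 'sets']
```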
1.3 Mappings
Definition 1.13 Let M , N be sets and let each x ∈ M be associated with exactly
one y ∈ N . Through M , N and this assignment, a mapping (or map) from M to N
is defined.
Especially when M and N are subsets of the real numbers, mappings are also called
functions.
Mappings are often denoted by small Latin letters, such as f or g. The element associ-
ated with x (the image of x) is then denoted by f (x) or g(x).
The notation that has become established is: f : M → N , x ↦ f (x). The arrows are to
be distinguished:
“→” the arrow between the sets, denotes the mapping of M to N ,
“↦” the arrow between the elements, denotes the assignment of individual elements.
A mapping can be specified by listing all function values individually, for example in a
table, or by giving an explicit assignment rule by means of which f (x) can be determined
for all x ∈ M . The former is of course only possible for finite sets.
Examples
f : {a, b, c, . . . , z} → N, given by the table

x       a   b   c   ...   z
f (x)   1   2   3   ...   26

g : R → R, x ↦ x²
h : R → R, x ↦ |x|
◄
Attention: In the last example, x ↦ √x would not be a reasonable assignment. Why not?
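Both ways of specifying a mapping have a direct counterpart in a program: a table becomes a dictionary, an assignment rule becomes a function. The sketch below is only an illustration; floating point numbers stand in for the real numbers, which is an assumption of the sketch.

```python
import math
import string

# f : {a, ..., z} -> N given as a table (a dictionary lists every value).
f = {letter: index
     for index, letter in enumerate(string.ascii_lowercase, start=1)}

# g and h are given by assignment rules.
def g(x):
    return x ** 2

def h(x):
    return abs(x)

print(f["a"], f["z"])   # 1 26
print(g(-3), h(-3))     # 9 3

# x -> sqrt(x) assigns no real number to a negative x.
try:
    math.sqrt(-1.0)
except ValueError as error:
    print("not defined on all of R:", error)
```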
If U ⊂ M , then the set of images of the x ∈ U is called the image of U . The image
of U is denoted by f (U) := {f (x) | x ∈ U}.
If V ⊂ N , then the set of preimages of the y ∈ V is called the preimage or
inverse image of V . It is denoted by f⁻¹(V ) := {x ∈ M | f (x) ∈ V }.
If U ⊂ M , then the mapping f|_U : U → N , x ↦ f (x) is called the restriction of
f to U .
You have to be careful with the names for the image set and the preimage set: In the
brackets behind f or f⁻¹ there are sets, not elements. Also f (U), f⁻¹(V ) are again sets,
not elements!
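For a finite domain, image and preimage can be computed by simply following the definitions. In this sketch the function names `image` and `preimage`, the domain `M` and the mapping `square` are hypothetical choices for illustration.

```python
def image(f, U):
    """f(U) = {f(x) | x in U}."""
    return {f(x) for x in U}

def preimage(f, M, V):
    """Preimage of V: all x in the finite domain M with f(x) in V."""
    return {x for x in M if f(x) in V}

M = {-2, -1, 0, 1, 2}   # assumed finite domain

def square(x):
    return x * x

print(sorted(image(square, {-2, 1, 2})))   # [1, 4]
print(sorted(preimage(square, M, {4})))    # [-2, 2]
print(sorted(preimage(square, M, {3})))    # []
```

Both functions take sets and return sets, and the preimage of a set may be empty.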
Examples of mappings
represent 128 characters, because the target set contains exactly 128 = 2⁷ elements.
This means that 7 bits are needed to include all image elements. 8 bits = 1 byte is
exactly the size of the data type char in C++, which can thus encode all ASCII
characters and in the 8th bit also extensions of the ASCII code, such as
country-specific special characters. In Java, the type char has the size 2 bytes. This makes
it possible to encode 2¹⁶ = 65 536 symbols.
6. A simple but important example: f : M → M , x ↦ x is the identical mapping of
the set M . It is called id_M (or just id) and exists on any set.
7. In the section on logic, we will deal with predicates (propositional functions).
These are mappings into the target set {true, false} = {t, f }. The statement “x < y”
is either true or false for two real numbers x, y. This fact is described by the predi-
cate:
$P : \mathbb{R}^2 \to \{t, f\}, \quad (x, y) \mapsto P(x, y) = \begin{cases} t & \text{if } x < y \\ f & \text{if not } x < y \end{cases}$
8. When programming, you constantly have to deal with mappings; they are usually
called functions or methods there. You put something in as an input parameter
(send a message) and get an output (a response). A function in C++ that determines
the greatest common divisor of two integers (int gcd(int m, int n)) is a
mapping:
gcd : M × M → M, (m, n) ↦ gcd(m, n).
Here M is the value range of an integer, for example M = [−2³¹, 2³¹ − 1]. With this
example we also see that a mapping of “several variables” is in fact nothing other than
a mapping of one set into another. The domain is then just the Cartesian product
of the domains of the individual variables. ◄
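A Python counterpart of the gcd mapping might look as follows. This is a sketch under assumptions: it uses the Euclidean algorithm, which the text above does not prescribe, and Python integers are not limited to the range of a C++ int.

```python
def gcd(m, n):
    """Greatest common divisor, computed with the Euclidean algorithm."""
    m, n = abs(m), abs(n)
    while n != 0:
        m, n = n, m % n
    return m

# One argument from the Cartesian product M x M, one result in M.
pair = (84, 36)
print(gcd(*pair))   # 12
```

The two parameters together form one element (m, n) of the Cartesian product, which is exactly the point of view taken above.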
In this definition, not only the sign “◦” is defined, but also an assertion is made, namely
that g ◦ f is again a mapping. We have to think about this before we can write the
definition with a good conscience. Since we have for all x ∈ M the unique assignment
(g ◦ f )(x) := g(f (x)) here, the assertion is correct.
This happens often: When defining a term, you always have to make sure it is reasonable
and does not contain any implicit contradictions. The mathematicians have coined the beau-
tiful term “well-defined” for this purpose. Only then can you relax comfortably in your arm-
chair.
To be able to form g(f (x)), it is enough if the domain N of the mapping g includes the
value set of f : D(g) ⊃ f (M). The definition can therefore be extended to this case.
f : R → R, x ↦ x², g : R₀⁺ → R, x ↦ √x.
Here R₀⁺ := {x ∈ R | x ≥ 0}. W (f ) ⊂ R₀⁺, thus the composition is possible (Fig. 1.12):
g ◦ f : R → R, x ↦ √(x²) = |x|. ◄
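The composition from this example can be written down in a few lines of Python. This is a sketch with floating point numbers standing in for real numbers; the helper name `compose` is hypothetical, and the sample values are chosen so that the square roots are exact.

```python
import math

def compose(g, f):
    """Return the mapping g ∘ f, i.e. x -> g(f(x))."""
    return lambda x: g(f(x))

def f(x):
    return x ** 2    # f : R -> R

g = math.sqrt        # g is only defined for arguments >= 0

g_after_f = compose(g, f)
print(g_after_f(-3.0))   # 3.0
print(all(g_after_f(x) == abs(x) for x in [-4.0, -0.5, 0.0, 2.0]))   # True
```

The composition works for every x because all values of f are non-negative and therefore lie in the domain of g.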
The following three terms are important properties of mappings, which are needed again
and again throughout the book. You should memorize them:
Examples
1. (compare Fig. 1.10) The mapping is not surjective, because 2 has no preimage, and
not injective, because f (a) = f (b).
2. In Fig. 1.11 no mapping was shown, so the terms injective, surjective, bijective are
meaningless here.
3. Sequences can be injective and surjective. The question of whether there are sur-
jective rational or real number sequences is not trivial! More about this after Defi-
nition 1.21.
4. The hash function h(n) = n mod p is surjective, but not injective, since for example
h(0) = h(p).
In computer science this means that collisions can occur when determining the
hash addresses of different data records, which require special treatment.
5. The ASCII code is surjective, 128 characters are encoded, but not all of them are
printable. The part shown in the example is not surjective (0 has no preimage, for
example). The ASCII code is also injective; it must be, otherwise you could not
reverse the encoding.
A code is always an injective mapping of a source alphabet into a code alphabet.
The injectivity ensures the reversibility of the mapping, that is, the decodability.
The target alphabet has very specific properties the source alphabet does not have,
for example the processing ability for a computer (ASCII code, Unicode), or the
readability for the blind (Braille), or the easy transferability by radio (Morse alpha-
bet).
6. The identical mapping is surjective and injective, that is, bijective.
7. The predicate
$P : \mathbb{R}^2 \to \{t, f\}, \quad (x, y) \mapsto P(x, y) = \begin{cases} t & \text{if } x < y \\ f & \text{if not } x < y \end{cases}$
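For mappings on finite sets, injectivity and surjectivity can be tested mechanically. The sketch below applies this to the hash function h(n) = n mod p from the examples; the modulus p = 7, the finite range of keys and the helper names are assumptions made for illustration.

```python
def is_injective(f, domain):
    """No two different arguments share the same image."""
    images = [f(x) for x in domain]
    return len(images) == len(set(images))

def is_surjective(f, domain, codomain):
    """Every element of the codomain has at least one preimage."""
    return {f(x) for x in domain} == set(codomain)

p = 7                    # assumed modulus for illustration
keys = range(0, 3 * p)   # finite sample of possible keys

def h(n):
    return n % p         # hash function h(n) = n mod p

print(is_surjective(h, keys, range(p)))   # True
print(is_injective(h, keys))              # False
print(h(0) == h(p))                       # True: a collision

def identity(x):
    return x

print(is_injective(identity, keys) and is_surjective(identity, keys, keys))   # True
```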
Perhaps you have noticed in these examples that it is usually easier to disprove the property
of being injective or surjective than to prove it. That is because it is enough to find a
single counterexample to prove the property of “not injective”:
“There exist x, y with x ≠ y and f (x) = f (y).”
Again, one must realize that g really gives a unique assignment rule. For the inverse
mapping f⁻¹, both f⁻¹ ◦ f = id_M and f ◦ f⁻¹ = id_N hold.
The proof is a direct consequence of the second part of the last theorem: Since the iden-
tity map is bijective, f and g must also be bijective.
For bijectivity and thus for the existence of inverses, the statements f ◦ g = id_N and
g ◦ f = id_M are both necessary. One of the statements alone is not sufficient for this, as
the following example shows:
f : R → R², x ↦ (x, x)
g : R² → R, (x, y) ↦ ½(x + y)
Indeed g ◦ f = id_R, but f is not surjective, g is not injective and f ◦ g ≠ id_{R²}!
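The two compositions can be compared on sample values. The sketch below uses exact fractions and integer test values, which is an assumption of the illustration; it only checks individual points and does not prove the statements.

```python
from fractions import Fraction

def f(x):
    return (x, x)

def g(pair):
    x, y = pair
    return Fraction(x + y, 2)

print(g(f(3)) == 3)             # True: g ∘ f returns the tested value
print(f(g((1, 3))) == (1, 3))   # False: f ∘ g is not the identity
print(f(g((1, 3))) == (2, 2))   # True: (1, 3) is mapped to (2, 2)
print(g((1, 3)) == g((2, 2)))   # True: g is not injective
```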
We come back to the concept of sets; we want to deal with the cardinality of sets, that
is, with the number of elements of a set M . We denote this by |M|. What is particularly
interesting is the case where a set has infinitely many elements. For two finite sets, the
following statement is obvious:
If |M| = |N|, then there is a bijective map between the two sets.
You just have to take one element from each set in turn and map them to each other.
This property is the basis for the following definition, which is to apply to both finite and
infinite sets:
Definition 1.20 Two sets M , N are called equipotent (|M| = |N|) if there is a
bijective map between M and N .
In this way one can introduce the cardinal numbers. The cardinality of M is less than or
equal to the cardinality of N , if M is equipotent to a subset of N . So cardinalities can be
compared.
Of course, for example, |{0, 1, 2}| = |{a, b, c}| = |{0, {1}, R}| applies; these sets have the
same cardinality “3”. The cardinalities of the finite sets correspond exactly to the natural
numbers.
The sets N, Z, Q, R have no finite cardinality, they are infinitely large. But are the sets
equipotent? It looks as if Z has more elements than N. But look at the following mapping
(N0 := N ∪ {0}):
f : N0 → Z, 0 ↦ 0, 1 ↦ 1, 2 ↦ −1, 3 ↦ 2, 4 ↦ −2, . . .
f is obviously bijective, so |N0 | = |Z|! It will now not be a problem for you to write
down a bijective mapping between N and N0, and thus we have |N| = |Z|. Cantor has
shown that one can even find a bijective mapping between Z and Q (the procedure is
constructive and easy to follow, but it is beyond the scope of this book). So there
are just as many rational numbers as natural numbers. Seems logical, doesn't it? Infinity
should simply be equal to infinity. But now something amazing happens: Cantor was also able
to prove that there is no bijective mapping between Q and R. So there are more real
numbers than rational numbers; infinity is not equal to infinity! In fact, one can find many,
many more different infinite cardinalities.
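The mapping f : N0 → Z listed above (0 ↦ 0, 1 ↦ 1, 2 ↦ −1, ...) can be written with a closed formula and tested on an initial segment. The formula and the inverse `f_inverse` in this sketch are my own assumptions, chosen to reproduce the listed values; the test covers only finitely many numbers.

```python
def f(n):
    """0 -> 0, 1 -> 1, 2 -> -1, 3 -> 2, 4 -> -2, ..."""
    return (n + 1) // 2 if n % 2 == 1 else -(n // 2)

def f_inverse(z):
    """Candidate inverse mapping Z -> N0."""
    return 2 * z - 1 if z > 0 else -2 * z

print([f(n) for n in range(5)])   # [0, 1, -1, 2, -2]

# The numbers 0, ..., 2k are mapped exactly onto -k, ..., k.
k = 100
print({f(n) for n in range(2 * k + 1)} == set(range(-k, k + 1)))   # True
print(all(f_inverse(f(n)) == n for n in range(1000)))              # True
```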
With this we now also know that there are surjective rational number sequences, but no
surjective real number sequences.
Sets that are finite or have the same cardinality as N are called countable sets; sets of
greater cardinality, such as R, are called uncountable.
For our purposes, we do not need to deal with cardinalities any further. But I still want to tell
you a little story that drove many mathematicians almost to despair for decades: The
cardinality of N is called ℵ0 (the Hebrew letter aleph), the cardinality of R is called C (for
continuum). It can be shown that the cardinality of the set of all subsets of N, that is, of
P(N), is also equal to C. Cantor now wondered whether there might be no other cardinality
between ℵ0 and C at all. Or can you still squeeze a set between Q and R ? Around 1900,
Cantor made the conjecture that this is not possible (the continuum hypothesis). Through
the results of the mathematicians Kurt Gödel (1938) and Paul Cohen (1963), this problem
found a very surprising solution: Neither the continuum hypothesis nor its negation can be
proven. Within mathematics, this was a sensation that one might put on a level with the theory
of relativity in physics: You can prove that there are things you can neither
prove nor disprove. This solves the problem, but many mathematicians had to get used to
this solution.
The research in connection with the decidability or undecidability of mathematical asser-
tions also had significant effects on computer science: In the course of his work on this
topic, Alan Turing invented his famous Turing machine in the 1930s. In a thought experi-
ment, he wanted to solve every problem solvable in a logical way with the universal Turing
machine. Although this did not succeed, he incidentally laid the theoretical foundations of
computer architecture.
Comprehension Questions
Exercises
m has remainder r when divided by q if and only if there exists a k ∈ Z with m = qk + r.
M = {x | 4 divides x}
N = {x | 100 divides x}
T = {x | 400 divides x}
S = {x | x is a leap year}
Express the set S using the set operations ∪, ∩ and \ from the sets M , N , T . Elimi-
nate the \ sign in this representation by using the complement.
Leap years are the years divisible by 4, except for the years divisible by 100, but not
by 400. (2000 and 2400 are leap years, 1900 and 2100 are not leap years!)
Abstract
As a computer scientist you are constantly dealing with tasks from logic. In this chap-
ter you will learn
Why is logic so important in computer science? It starts with the fact that the control
flow of a program works with logical (Boolean) variables that have the value true or
false. It continues when testing a program, where certain preconditions must be met by
the input variables and then the results must meet specific output conditions.
Even when defining a system, it is important to make sure that the requirements are
not contradictory. From the design of a system architecture through to maintenance, logical
thinking is always required. Especially in safety-critical systems, more and more work is
being done with formal specification methods, and in part also verification methods, which
require a high level of knowledge of logical methods.
You have probably already heard of fuzzy logic: It takes into account that there is also warm
between cold and hot; there are truth functions in which something can only be “a little
true”. My washing machine usually works very successfully with fuzzy logic. We will not
deal with it here, but rather investigate classical logic, the knowledge of which of course
also forms the basis for fuzzy logic.
Examples
1. 5 is smaller than 3.
2. Paris is the capital of France.
3. The study of computer science is very difficult.
4. Brush your teeth after eating!
5. It’s colder at night than it is outside. ◄
1. and 2. are propositions in our sense. 3. is very subjective, the answer is certainly dif-
ferent from person to person; but you can ask about it! 4. is not a fact, it makes no sense
to talk about truth here. 5. is not even a sentence that has a reasonable meaning.
For your consolation: We will not venture into the borderline areas of theory in which we
stand helpless before a sentence and wonder whether it is a proposition or not. Let others do
that.
The words “true” and “false” are called truth values. Every proposition has one of these
two truth values. However, the presence of a proposition does not mean that one can
immediately say which truth value a proposition has; perhaps it is still unknown.
Two more
Examples
6. The World Climate Conference will succeed in stopping the climate catastrophe.
7. Every even number greater than 2 is the sum of two prime numbers (the Goldbach
conjecture). ◄
2.1 Propositions and Propositional Variables 29
The truth value of proposition 6 will turn out at some point; that of 7 is unknown, and perhaps no one will ever find out.
The Goldbach conjecture is about 300 years old. So far, the “Goldbach test” has worked well
for every even number: 6 = 3 + 3, 50 = 31 + 19, 98 = 19 + 79, and so on. Even in the age of
high-performance computers, no one has found a single counterexample. But in 300 years
no mathematician has been able to prove that the statement is really true for all even numbers. Maybe you have a chance to become famous here: Finding a single counterexample is enough!
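The "Goldbach test" described above can be tried out mechanically. The following is a minimal sketch in Java, not a proof of anything: the class name `GoldbachTest`, the method names and the search bound of 10,000 are assumptions chosen for illustration.

```java
// Sketch: runs the "Goldbach test" for all even numbers in a small range.
class GoldbachTest {

    // Trial division; sufficient for the small numbers used here.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Returns the smallest prime p such that n - p is also prime, or -1 if there is none.
    static int goldbachSplit(int n) {
        for (int p = 2; p <= n / 2; p++) {
            if (isPrime(p) && isPrime(n - p)) return p;
        }
        return -1;
    }

    public static void main(String[] args) {
        int limit = 10_000; // assumed bound, chosen only for illustration
        for (int n = 4; n <= limit; n += 2) {
            if (goldbachSplit(n) == -1) {
                System.out.println("Counterexample found: " + n);
                return;
            }
        }
        System.out.println("No counterexample up to " + limit);
    }
}
```

For 98 the method returns 19, which matches the decomposition 98 = 19 + 79 given above. A finite search like this can only ever find a counterexample; it can never show that the conjecture holds for all even numbers.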
In the future, the content of a proposition is no longer interesting to us, but only the prop-
erty of the proposition being either true or false. Therefore, from now on we will also
refer to propositions with symbols, mostly with large Latin letters, A, B, C, D, . . ., which
can stand for any proposition and can take on the truth values true or false. We call these
symbols propositional variables and we will use them as variables just like, for example,
x and y in the expression 3x² + 7y. Values can be assigned here to the variables x and y.
The truth values “true” and “false” will from now on be abbreviated as t and f, respectively.
Compound Propositions
In everyday language, propositions are combined with words like “and”, “or”, “not”, “if—
then”, “except” and others. We now want to reproduce such combinations of propositions
for our propositional variables and determine the truth value of the compound propositions
depending on the contained sub-propositions. For this purpose, we will set up tables.
In the example of numerical variables, such a table looks like this:
x y 3x² + 7y
1 1 10
1 2 17
2 1 19
⋮ ⋮ ⋮
We will proceed in the same way for propositional variables. The big advantage we
have is that we only have finitely many assignment values for a variable, namely exactly
two. We can therefore, in contrast to the above example, capture all possible values in a
finite table, which we call a truth table. How many such assignment possibilities are there for
the combination of two propositional variables? For A and B we have four possibilities
whose combination can result in true or false:
A B A∗B
t t t/f
f t t/f
t f t/f
f f t/f
30 2 Logic
If we denote the connection of A and B by ∗, you can write down 16 combinations in total in the column A ∗ B, from four times true to four times false. Some of these are particularly interesting; I will introduce them below.
1. “and”, notation: ∧ (the conjunction).
The proposition “2 is even and 5 < 3” is false, the proposition “2 is even and for all real numbers x, x² ≥ 0” is true. The complete truth table looks like this:
A B A∧B
t t t
f t f
t f f
f f f
The compound proposition is therefore only true if each individual proposition is true.
2. “or”, notation: ∨ (the disjunction).
A B A∨B
t t t
f t t
t f t
f f f
The compound proposition is only false if both individual propositions are false. Note again that the logical or is not the exclusive or; both propositions may also be true: “5 > 3 or 2 is even” is a true proposition. In everyday language, these two kinds of “or” are not always clearly distinguished.
3. “if then”, notation: → (the implication).
The proposition “If Christmas is on December 24th, then I’ll eat my hat” is false, while you can believe me that the proposition “If Christmas falls on Easter, then I’ll eat my hat” is true! What about: “If Goldbach’s conjecture is true, then Christmas is on December 24th”? I think you will accept it as true, regardless of whether Goldbach was right or not. With these examples we can fill in the table completely:
A B A→B
t t t
f t t (2.1)
t f f
f f t
The initially surprising thing about this table is line 2, which says you can also infer
something true from something false. But you probably know such situations from
everyday life: if you start from false assumptions, you can be lucky or unlucky with a
conclusion. Maybe something reasonable comes out by chance, maybe not.
Sometimes it is difficult to verify this table with propositions from everyday life. Try it
yourself! A student gave me this example in a lecture: “If today is Tuesday, then tomorrow
is Thursday.” Isn’t this proposition always false, no matter whether I announce it on Mon-
day, Tuesday or Wednesday? What do you think about it?
If this table gives you a stomach ache, I can calm you down from the side of mathemat-
ics: we simply take the table as a definition for the connection “→” and don’t care at all
whether it has anything to do with reality or not. Mathematicians are allowed to do that!
That’s our ivory tower. The downside is that we eventually want to start something with our
mathematics again. This gets us out of the ivory tower. That only works well if our original
assumptions were reasonable. In the case of logic, it turned out our truth tables are really
meaningful exactly in the form presented.
4. “if and only if”, notation: ↔ (the equivalence).
A B A↔B
t t t
f t f
t f f
f f t
The compound proposition is true here only when both individual propositions have
the same truth value.
Finally, an operation that only applies to one logical variable, the negation:
5. “not”, notation: ¬ (the negation).
This is simple: The “not” just reverses the truth value of a proposition:
A ¬A
t f
f t
What do we have from our connections now? We can use them to build up increasingly
complex propositions from given propositions by putting them together. A few
Examples
A → (B ∨ C)
¬(A ↔ (B ∨ (¬C)))
(A ∨ B) ∧ (¬(B ∧ (A ∨ C))) ◄
With and, or, if—then, and if and only if I have defined the four of the 16 possible connec-
tions of two propositional variables with which we usually work in logic. What about the
other twelve? Some are boring: for example “everything true” or “everything false”. You
might know the exclusive or, nand and nor from digital technology. Do we need them too?
In fact, all 16 possible connections can be generated as a combination of the four basic oper-
ations together with the not. Yes, it is even possible to represent all 16 operations and the
not as a combination of two variables with only one operation, the nand operator ↑. Here
A ↑ B := ¬(A ∧ B). As a result, in digital technology, every logical connection can be imple-
mented solely by nand gates.
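That nand alone is enough can be checked directly. In the following sketch every operation is built only from calls to `nand`; the class name `NandOnly` and the method names are assumptions chosen for illustration.

```java
// Sketch: all basic connectives expressed with nand as the only primitive.
class NandOnly {

    // The only primitive operation: A nand B := not (A and B).
    static boolean nand(boolean a, boolean b) {
        return !(a && b);
    }

    // not A = A nand A
    static boolean not(boolean a) {
        return nand(a, a);
    }

    // A and B = not (A nand B)
    static boolean and(boolean a, boolean b) {
        return nand(nand(a, b), nand(a, b));
    }

    // A or B = (not A) nand (not B)
    static boolean or(boolean a, boolean b) {
        return nand(nand(a, a), nand(b, b));
    }

    // A -> B = not (A and not B) = A nand (not B)
    static boolean implies(boolean a, boolean b) {
        return nand(a, nand(b, b));
    }

    // A <-> B = (A -> B) and (B -> A)
    static boolean iff(boolean a, boolean b) {
        return and(implies(a, b), implies(b, a));
    }

    public static void main(String[] args) {
        boolean[] values = { true, false };
        for (boolean a : values) {
            for (boolean b : values) {
                System.out.println(a + " " + b + " | and=" + and(a, b)
                        + " or=" + or(a, b) + " implies=" + implies(a, b)
                        + " iff=" + iff(a, b));
            }
        }
    }
}
```

The remaining connectives of the 16 can be obtained in the same way by combining these methods.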
When programming, you also work with formulas built up from propositions: The
conditions behind if, for, while contain propositional variables, in Java of type
boolean, and connections of propositional variables, which in a specific program run
are replaced by the truth values “true”, “false” and evaluated.
Just as algebraic formulas are usually built up from a, b, c, d, . . . , +, ·, −, :, . . . we get
formulas of propositional algebra in this way. And just as one can evaluate algebraic for-
mulas by substituting numbers, one can evaluate propositional formulas by substituting
truth values, that is, one can determine whether the entire formula is true or false.
What we have done here with the construction of propositional formulas and their
evaluation from elementary propositions is a first example of a formal language. For-
mal languages have an extraordinary importance in computer science. A formal language
consists, like natural languages, of three elements:
• A set of symbols from which the sentences of the language can be constructed, the
alphabet.
• Rules to specify how correct sentences can be formed from the symbols of the alpha-
bet, the syntax.
• An assignment of meaning to syntactically correctly formed sentences, the semantics.
As the first formal language, the computer scientist usually has to do with a program-
ming language: The alphabet consists of the keywords and the allowed characters of the
language, the syntax defines how correct programs can be formed from these charac-
ters. The correctness of the syntax can be checked by the compiler. But as you surely
know from bitter experience, not every syntactically correct program performs meaning-
ful actions. The task of the programmer is to ensure the semantics of the program corre-
sponds to the specification.
The formulas of propositional logic also form a formal language.
• The alphabet consists of the identifiers for propositional variables, the logical symbols
∧, ∨, →, ↔, ¬ as well as the brackets ( and ), which serve to group together parts of
the formula.
• The rules of syntax state:
– The identifiers for propositional variables are formulas,
– if F1 and F2 are formulas, then so are (F1 ∧ F2 ), (F1 ∨ F2 ), (¬F1 ), (F1 → F2 ),
(F1 ↔ F2 ).
• The semantics is finally determined by the truth tables for the elementary logical con-
nections. As a result, every syntactically correct formula can be assigned a truth value,
provided the identifiers are assigned concrete truth values.
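The two syntax rules above translate almost literally into a recursive check. The following is a minimal sketch under some assumptions that are not part of the text: identifiers are single capital letters A to Z, blanks are ignored, and the logical symbols are written as Unicode escapes so that the source file stays plain ASCII. The class name `FormulaSyntax` is hypothetical.

```java
// Sketch: recursive syntax check for fully bracketed propositional formulas.
class FormulaSyntax {
    private static final char NOT = '\u00AC';                              // negation sign
    private static final String BINARY = "\u2227\u2228\u2192\u2194";       // and, or, ->, <->

    private final String s;
    private int pos = 0;

    private FormulaSyntax(String input) {
        this.s = input.replace(" ", "");
    }

    // True if the whole input is a formula according to the two syntax rules.
    static boolean isFormula(String input) {
        FormulaSyntax p = new FormulaSyntax(input);
        return p.formula() && p.pos == p.s.length();
    }

    // Tries to read one formula starting at pos; advances pos while reading.
    private boolean formula() {
        if (pos >= s.length()) return false;
        char c = s.charAt(pos);
        if (c >= 'A' && c <= 'Z') {      // rule 1: an identifier is a formula
            pos++;
            return true;
        }
        if (c != '(') return false;
        pos++;                           // consume '('
        if (pos < s.length() && s.charAt(pos) == NOT) {
            pos++;                       // rule 2, unary case: (not F1)
            return formula() && expect(')');
        }
        // rule 2, binary case: (F1 op F2)
        return formula() && binaryOperator() && formula() && expect(')');
    }

    private boolean binaryOperator() {
        if (pos < s.length() && BINARY.indexOf(s.charAt(pos)) >= 0) {
            pos++;
            return true;
        }
        return false;
    }

    private boolean expect(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }
}
```

With this sketch, `((A ∧ B) → C)` and `(¬A)` are accepted, while `A ∧ B` is rejected because the outer brackets required by the second rule are missing.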
In order to improve the readability of formulas, some precedence rules for compound
propositions are set, similar to those in arithmetic, which allow the omission of many
brackets. These rules are:
But beware! The argument of better readability is dangerous if you do not have all the precedence rules in your head. Do you know exactly what happens in Java or C++ with the instruction if(!a&&b||c==5)? In such cases, it is better to write a few more brackets than necessary to avoid unnecessary errors.
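For Java, the question can be answered by writing out the grouping: the unary ! binds most strongly, then ==, then &&, and finally ||. The following sketch compares the terse condition with its fully bracketed form and with a plausible misreading; the class and method names are assumptions chosen for illustration.

```java
// Sketch: how Java groups the condition !a&&b||c==5.
class PrecedenceDemo {

    // The condition exactly as written in the text.
    static boolean terse(boolean a, boolean b, int c) {
        return !a && b || c == 5;
    }

    // The same condition with every grouping made explicit.
    static boolean bracketed(boolean a, boolean b, int c) {
        return ((!a) && b) || (c == 5);
    }

    // A different reading that a hasty reader might assume; this is NOT what Java does.
    static boolean misread(boolean a, boolean b, int c) {
        return !(a && b) || (c == 5);
    }

    public static void main(String[] args) {
        boolean[] values = { true, false };
        int[] cs = { 4, 5 };
        for (boolean a : values) {
            for (boolean b : values) {
                for (int c : cs) {
                    System.out.println(a + " " + b + " " + c
                            + " -> terse=" + terse(a, b, c)
                            + " bracketed=" + bracketed(a, b, c)
                            + " misread=" + misread(a, b, c));
                }
            }
        }
    }
}
```

For a = false, b = false, c = 4 the terse and the bracketed form are false, while the misreading is true. This is exactly the kind of error that a few extra brackets prevent.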
The evaluation of a composite formula can be traced back to the elementary truth tables
using connection tables again. First, the contained propositional variables are listed in the
columns with all possible combinations of truth values, then the evaluation is carried out
step by step, “from the inside out”. When evaluating (A ∧ B) → C, we get something like:
A B C A ∧ B (A ∧ B) → C
t t t t t
t t f t f
t f t f t
t f f f t
⋮ ⋮ ⋮ ⋮ ⋮
There are still four cases missing in the table. You can see that with more than two propositional variables in the formula, this evaluation method quickly becomes very extensive.
Fortunately, there is another tool that can help us determine the truth value of a formula.
Parts of a formula can be replaced by other equivalent and possibly simpler formula
parts.
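The step-by-step evaluation of (A ∧ B) → C can also be left to a program, which then produces all eight rows including the four missing ones. This is a minimal sketch; the class name `TruthTableDemo` and its methods are assumptions chosen for illustration, and the implication is expressed as !a || b because Java has no operator for it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: complete truth table for (A and B) -> C.
class TruthTableDemo {

    // A -> B, expressed with the operators Java offers.
    static boolean implies(boolean a, boolean b) {
        return !a || b;
    }

    static char tf(boolean v) {
        return v ? 't' : 'f';
    }

    // Builds all 2^3 = 8 rows in the column order: A, B, C, A and B, (A and B) -> C.
    static List<String> rows() {
        List<String> rows = new ArrayList<>();
        boolean[] values = { true, false };
        for (boolean a : values) {
            for (boolean b : values) {
                for (boolean c : values) {
                    boolean aAndB = a && b;              // inner part first
                    boolean result = implies(aAndB, c);  // then the outer connective
                    rows.add(String.format("%c %c %c %c %c",
                            tf(a), tf(b), tf(c), tf(aAndB), tf(result)));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("A B C A&B (A&B)->C");
        for (String row : rows()) {
            System.out.println(row);
        }
    }
}
```

The first four rows agree with the table above. In the four remaining rows A is false, so A ∧ B is false and the implication is true in each of them.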
What are logically equivalent formulas? Let’s evaluate B ∨ ¬A once:
A B ¬A B ∨ ¬A
t t f t
f t t t (2.2)
t f f f
f f t t
Compare this table with the Table (2.1) for A → B. You can see that A → B is true
(false) if and only if B ∨ ¬A is true (false). The two formulas are logically equivalent:
Definition 2.1 Two propositional formulas are called logically equivalent if they
have the same truth value for all possible assignments of truth values.
For the logician, those formulas provide calculation rules: Part of a formula can be
replaced by a logically equivalent formula without changing the truth value of the overall
formula.
Example
In most programming languages there are logical operators for “and” (in C++ and
Java, for example, &&), “or” (||) and “not” (!), but no operator for “if then”. But
since A → B and B ∨ ¬A are equivalent, you can define an arrow operator in C++ (or
similarly in Java):
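A minimal sketch of such an operator in Java; the class name `Arrow` and the method name `implies` are assumptions chosen for illustration:

```java
// Sketch: an "if then" operator built from the equivalent formula B or (not A).
class Arrow {

    // A -> B, implemented as B || !A.
    static boolean implies(boolean a, boolean b) {
        return b || !a;
    }

    public static void main(String[] args) {
        boolean[] values = { true, false };
        // Prints the four rows of the truth table of the implication.
        for (boolean b : values) {
            for (boolean a : values) {
                System.out.println(a + " -> " + b + " is " + implies(a, b));
            }
        }
    }
}
```

One difference from the built-in operators && and || should be kept in mind: both arguments of a method call are evaluated before the method runs, so this helper does not stop early the way the built-in operators do.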
Definition 2.2 If the formulas F1 and F2 are logically equivalent, we write for this
F1 ⇔ F2 .
Please note: The sign ⇔ is not a symbol of the language of propositional logic, it says
something about the truth value of formulas, that is, about the semantics.
Unfortunately, no uniform notation is used for logical equivalence and equivalence in the
literature. You will occasionally also find the symbol ≡ for logical equivalence, the double
arrow ⇔ is sometimes also used to designate the equivalence of propositions.
associative laws:
(A ∨ B) ∨ C ⇔ A ∨ (B ∨ C) (=: A ∨ B ∨ C)
(A ∧ B) ∧ C ⇔ A ∧ (B ∧ C) (=: A ∧ B ∧ C)
distributive laws:
A ∧ (B ∨ C) ⇔ (A ∧ B) ∨ (A ∧ C)
A ∨ (B ∧ C) ⇔ (A ∨ B) ∧ (A ∨ C)
¬(A ∧ B) ⇔ ¬A ∨ ¬B
¬(A ∨ B) ⇔ ¬A ∧ ¬B (2.4)
¬¬A ⇔ A
Do you notice anything here? Compare these formulas with Theorem 1.3 and 1.4. There
you will find the same rules for calculating with sets. ∪ and ∨ correspond to each other, ∩
and ∧ as well as “complement” and ¬.
The just discovered analogy between set operations and logical operations leads to the the-
ory of Boolean algebras, Theorem 2.7.
However, the proof is much simpler here, if also much more boring: We only have to insert a finite number of values in the truth tables. For the distributive and associative laws, we have three initial propositions and thus 2³ = 8 different combinations of truth values to check. You write for a while. Believe me, the rules are correct!
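If you would rather not take this on trust, the eight combinations can be checked by a program instead of by hand. The following sketch compares two formulas for all assignments of three propositional variables; the interface `Formula3` and the method `equivalent` are names introduced here for illustration only.

```java
// Sketch: brute-force check of logical equivalence over all 2^3 = 8 assignments.
class LawCheck {

    // A formula in three propositional variables.
    interface Formula3 {
        boolean eval(boolean a, boolean b, boolean c);
    }

    // True if f and g have the same truth value for every assignment.
    static boolean equivalent(Formula3 f, Formula3 g) {
        boolean[] values = { true, false };
        for (boolean a : values) {
            for (boolean b : values) {
                for (boolean c : values) {
                    if (f.eval(a, b, c) != g.eval(a, b, c)) return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("associative (or):  "
                + equivalent((a, b, c) -> (a || b) || c, (a, b, c) -> a || (b || c)));
        System.out.println("associative (and): "
                + equivalent((a, b, c) -> (a && b) && c, (a, b, c) -> a && (b && c)));
        System.out.println("distributive 1:    "
                + equivalent((a, b, c) -> a && (b || c), (a, b, c) -> (a && b) || (a && c)));
        System.out.println("distributive 2:    "
                + equivalent((a, b, c) -> a || (b && c), (a, b, c) -> (a || b) && (a || c)));
        System.out.println("De Morgan 1:       "
                + equivalent((a, b, c) -> !(a && b), (a, b, c) -> !a || !b));
        System.out.println("De Morgan 2:       "
                + equivalent((a, b, c) -> !(a || b), (a, b, c) -> !a && !b));
    }
}
```

Each line prints true. Formulas with only two variables, such as the last two, simply ignore the third argument.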
A B B ∨ ¬A (A → B) B ∨ ¬A ↔ (A → B)
t t t t t
f t t t t
t f f f t
f f t t t
You can see this formula is always true, no matter what values A and B take. This is
because the formulas to the right and left of the bidirectional arrow are logically equivalent:
Theorem 2.4 Two formulas F1, F2 are logically equivalent, that is F1 ⇔ F2, if and only if F1 ↔ F2 is true for all possible assignments of truth values.
The correctness of this theorem can be read directly from the corresponding truth table:
F1 F2 F1 ↔ F2
t t t
f f t
t f f
f t f
If you assign truth values to F1 and F2, they agree in the first two rows, and in exactly these rows F1 ↔ F2 is also true.
Definition 2.5 A formula is called valid, or a tautology, if it is true for all possible assignments of truth values.
By using tautologies in everyday life, one can feign competence that is actually not present: “If the rooster crows on top of the dunghill, the weather will change or it will not” (an old German folk saying). Written as a formula: A → B ∨ ¬B. Check for yourself that this is a tautology. By the way, this statement is even true if the rooster is just cold!
Here are some more tautologies, most of which will appear again in the next section:
A ∨ ¬A
A ∧ (A → B) → B (2.6)
(A → B) ∧ (B → A) ↔ (A ↔ B) (2.7)
Definition 2.6 If F1 and F2 are formulas and if, for every assignment of truth val-
ues for which F1 is true, F2 is also true, we write
F1 ⇒ F2 .
To see this, we just have to take another look at the truth table of the implication:
F1 F2 F1 → F2
t t t
(t f f)
f t t
f f t
I have put the second line in parentheses because it cannot occur: If F1 ⇒ F2 holds, then
the combination t, f for F1, F2 is not possible, and thus F1 → F2 is always true. And if
F1 → F2 is always true, then if F1 is true then also F2 is true.
Boolean Algebras
Different concrete structures often have very similar properties. Mathematicians are then
happy. They try to work out the core of these similarities and to define an abstract structure that is characterized by exactly these common properties. Then they can study this structure. All results obtained for it of course also apply to the concrete models from
which they originally started.
Something similar happens when you define a common superclass for different classes of a
program: properties of the superclass are automatically inherited by all derived classes.
This is also the case with set operations and logical operators: the underlying common
structure is called Boolean algebra. The operations are often denoted as in the case of
sets with ∪, ∩ and −:
Definition 2.8 Let a set B be given which contains at least two different elements 0 and 1 and on which two operations ∪ and ∩ between elements of the set as well as a unary operation ¯ on the elements (written x̄) are defined. (B, ∪, ∩, ¯) is called a Boolean algebra if for all elements x, y, z ∈ B the following properties apply:
x ∪ y = y ∪ x,   x ∩ y = y ∩ x
x ∪ (y ∪ z) = (x ∪ y) ∪ z,   x ∩ (y ∩ z) = (x ∩ y) ∩ z
x ∪ (y ∩ z) = (x ∪ y) ∩ (x ∪ z),   x ∩ (y ∪ z) = (x ∩ y) ∪ (x ∩ z)
x ∪ x̄ = 1,   x ∩ x̄ = 0
x ∪ 0 = x,   x ∩ 1 = x
The power set P(M) of a set M with union, intersection and complement forms a
Boolean algebra. The empty set is the 0 and the set M itself is the 1-element.
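For a small set M, the properties required by Definition 2.8 can be checked exhaustively. The sketch below assumes M = {0, 1, 2} and represents each subset as a bit pattern in an int, so that union, intersection and complement become the bit operations |, & and ~. This representation and all names in the code are assumptions made for illustration.

```java
// Sketch: the power set of a 3-element set M as a Boolean algebra on bit patterns.
class PowerSetAlgebra {
    static final int N = 3;               // assumed size of M
    static final int ONE = (1 << N) - 1;  // the set M itself (all N bits set)
    static final int ZERO = 0;            // the empty set

    static int union(int x, int y) { return x | y; }
    static int intersect(int x, int y) { return x & y; }
    static int complement(int x) { return ~x & ONE; } // complement relative to M

    // Checks every property of the definition for all subsets x, y, z of M.
    static boolean satisfiesAxioms() {
        for (int x = ZERO; x <= ONE; x++) {
            if (union(x, complement(x)) != ONE) return false;
            if (intersect(x, complement(x)) != ZERO) return false;
            if (union(x, ZERO) != x || intersect(x, ONE) != x) return false;
            for (int y = ZERO; y <= ONE; y++) {
                if (union(x, y) != union(y, x)) return false;
                if (intersect(x, y) != intersect(y, x)) return false;
                for (int z = ZERO; z <= ONE; z++) {
                    if (union(x, union(y, z)) != union(union(x, y), z)) return false;
                    if (intersect(x, intersect(y, z)) != intersect(intersect(x, y), z)) return false;
                    if (union(x, intersect(y, z)) != intersect(union(x, y), union(x, z))) return false;
                    if (intersect(x, union(y, z)) != union(intersect(x, y), intersect(x, z))) return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Axioms hold for all subsets: " + satisfiesAxioms());
    }
}
```

Such a check only covers the one set M that was chosen; the general statement for arbitrary sets follows from the rules for set operations.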
A minimal Boolean algebra is obtained by taking the set B = {0, 1} as the basic set.
This is the two-element algebra, which is used for the design of technical circuits. Try writing down the truth tables for this algebra yourself!
The set of all propositional formulas that can be formed from n propositional variables with the operations ∧, ∨, ¬ (the n-ary propositional formulas) is also a Boolean algebra, an algebra of formulas. 1 is the always true proposition, 0 is the always false proposition.
The connections “except” and “but not” can both be represented by the symbols “∧¬”.
This gives us for
D: x is divisible by 100 but not by 400:
D ↔ B ∧ ¬C
and finally for S:
S ↔ A ∧ ¬D ↔ A ∧ ¬(B ∧ ¬C). (2.9)
This allows us to formulate a query in Java, for example. The operator % gives the
remainder in an integer division, and so we write:
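One way to write this query, translating formula (2.9) term by term with A as x%4 == 0, B as x%100 == 0 and C as x%400 == 0, is sketched below. It corresponds to what the following text calls (2.10); the class name `LeapYear` and the method name `isLeapYear` are assumptions chosen for illustration.

```java
// Sketch: the leap year query in the form A and not (B and not C).
class LeapYear {

    static boolean isLeapYear(int x) {
        if ((x % 4 == 0) && !((x % 100 == 0) && !(x % 400 == 0))) {   // (2.10)
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] years = { 1900, 2000, 2023, 2024, 2100, 2400 };
        for (int year : years) {
            System.out.println(year + ": " + isLeapYear(year));
        }
    }
}
```

The years 2000 and 2400 are reported as leap years, 1900 and 2100 are not, as required by the rule.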
We can also convert the formula (2.9) using our rules: From (2.4) and (2.3) we get, for
example:
A ∧ ¬(B ∧ ¬C) ⇔ (C ∨ ¬B) ∧ A
and from that the Java instruction:
if(((x%400 == 0)||(x%100 != 0)) && (x%4 == 0)). (2.11)
Is there a difference between (2.10) and (2.11)? Of course not in the result of the evaluation, otherwise our logic would be wrong. But there can be differences in the runtime of the program: The program evaluates a logical expression from left to right and stops as soon as it can decide whether the expression is true or false. If a number x is inserted, then in 75 % of the cases the evaluation of (2.10) can be ended after the evaluation of A (x%4 == 0), because if A is false, then the whole proposition is also false. In the case of (2.11), C (x%400 == 0) must be checked first; because it is usually false, ¬B (x%100 != 0) must be checked as well; and because this is usually true, A must still be checked at the end. You can see this is much more time consuming. Therefore, in time-critical applications, it makes sense to check and possibly reformulate control conditions carefully!
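The difference can be made visible by counting how many elementary divisibility tests each form performs. The sketch below runs both forms over the years 1 to 400, one full cycle of the leap year rule. The counter, the helper `divisible` and the method names are assumptions introduced for this illustration.

```java
// Sketch: counts the elementary tests that short-circuit evaluation performs.
class ShortCircuitCount {
    static int tests = 0; // number of divisibility tests performed so far

    static boolean divisible(int x, int d) {
        tests++;
        return x % d == 0;
    }

    // Form (2.10): A and not (B and not C)
    static boolean form210(int x) {
        return divisible(x, 4) && !(divisible(x, 100) && !divisible(x, 400));
    }

    // Form (2.11): (C or not B) and A
    static boolean form211(int x) {
        return (divisible(x, 400) || !divisible(x, 100)) && divisible(x, 4);
    }

    // Total number of elementary tests for the years 1..400 with the chosen form.
    static int count(boolean useForm211) {
        tests = 0;
        for (int x = 1; x <= 400; x++) {
            if (useForm211) form211(x); else form210(x);
        }
        return tests;
    }

    public static void main(String[] args) {
        System.out.println("(2.10): " + count(false) + " tests"); // prints 504
        System.out.println("(2.11): " + count(true) + " tests");  // prints 1196
    }
}
```

Form (2.10) needs 504 tests for these 400 years, form (2.11) needs 1196, more than twice as many. Whether this matters in practice depends on how often the condition is evaluated.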
I am often asked whether proofs are actually necessary in computer science. After all,
there is the profession of mathematician, who lives to prove theorems. The users of math-
ematics should be relieved by this. It is enough to provide them with the tools, that is, the
right formulas into which the problems to be solved are fed, and which spit out the result.
There may be application areas in which this is partly true, but computer science is
certainly not one of them. Computer scientists are constantly “proving” even though
they do not call it that: They think about whether a system architecture is feasible, they
analyze whether a protocol can do what it is supposed to do, they ponder whether an
algorithm works correctly in all special cases (“what happens if, …”), they check
whether the switch statement in the program really overlooks nothing, they look for test
cases that ensure the highest possible path coverage of the program, and much more. All
of this is nothing other than “proving”.
Do not be afraid, you can learn to prove as well! There are tricks used over and over
again. In this section, I would like to introduce you to some of these tricks and try them
out with concrete examples. The contents of the theorems are rather secondary, even
though I take this opportunity to present to you two famous pearls of mathematics in Theorems 2.10 and 2.11. What matters are the techniques behind the proofs.
But often the contents of a mathematical assertion only become clear through the
proof, perhaps because one deals intensively with the assertion in the proof.
Throughout the book, I will prove again and again, not for the sake of proving,
because I hope you have enough confidence in me that I do not introduce any false con-
cepts to you (at least not intentionally). We will prove when we can learn something
from it or when it serves to understand the material.
In proofs, the validity of certain propositions, the assertions, is always concluded
from the validity of other propositions, the assumptions. We speak here of the semantics,
that is, of true and false propositions. The mapping of propositions to logical formulas
and the rules of calculation for logical formulas often help to determine the truth value.
For this purpose, the conditions and the assertions must be precisely identified and for-
mulated cleanly.
The simplest form of proof is the direct proof: From an assumption A an assertion B is
derived. An example:
Theorem 2.9 If n ∈ Z is odd (proposition A), then n² is also odd (proposition B).
Proof: Assume A is true. Then there is an integer m with n = 2m + 1. Then n² = (2m + 1)² = 4m² + 4m + 1 = 2(2m² + 2m) + 1, that is, n² is also odd.
In the direct proof A ⇒ B is shown, that is, A → B is always true. Note that one can do this without even knowing whether the condition A is fulfilled. From the Goldbach conjecture, too, one can derive many things without knowing whether it is actually true. But if at some point A is recognized as true and we already know that A → B is always true, then B is also true. (This corresponds to the tautology A ∧ (A → B) → B, compare (2.6).)
2.2 Proof Principles 41
Here an assertion A is proven to be true (or false) if and only if another assertion B is true
(or false), that is A ⇔ B.
An equivalence proof is nothing more than the consecutive execution of two direct
proofs. One shows A ⇒ B and B ⇒ A. The task is thereby divided into two easier sub-
tasks. This can also be expressed by a tautology: If A → B and B → A are always true,
then A ↔ B and vice versa: (A → B) ∧ (B → A) ↔ (A ↔ B), compare (2.7).
The last part of Theorem 1.4 is such an equivalence statement: M ⊂ N ⇔ N̄ ⊂ M̄.
Try to prove this theorem yourself by breaking it down into two direct proofs.
A variant of the equivalence proof is the circular reasoning. To show three or more
assertions are equivalent, for example the assertions A, B, C, D, it is enough to show:
A ⇒ B, B ⇒ C, C ⇒ D, D ⇒ A.
Again, actually A ⇒ B should be shown. Often it is easier to carry out the conclusion
¬B ⇒ ¬A instead. Since we already know that (A → B) ↔ (¬B → ¬A) is a tautology
(compare (2.8)), this is equivalent to A ⇒ B. In Theorem 1.11 we have carried out such a
proof by contradiction. Take a look at it again!
Proofs by contradiction are a powerful and often used tool. One reason for this is that,
in addition to the assumption A, one can also assume ¬B; one therefore has more in hand
to work with. Strictly speaking, one could also describe the proof by contradiction with
the tautology: (A → B) ↔ (A ∧ ¬B → ¬A).
I would like to introduce you to a famous example. You probably know that √2 is not a
rational number, that is, it cannot be represented as a fraction of two integers. Why is that
so? The following proof shows especially well that it depends on the precise formulation
of the assumption and the assertion.
Theorem 2.10 √2 is not a rational number.
But then m is an even number too, because according to Theorem 2.9 the square of an
odd number is always odd.
Here we have packed a small proof by contradiction into the proof by contradiction.
So m has the form m = 2l and (2l)² = 2n² applies. Dividing by 2 we get 2l² = n². Just
like before, this means n² and thus n is an even number.
m and n are therefore divisible by 2 and thus not reduced. (Proposition ¬A)
Another form of proof by contradiction is that one derives the assertion itself from
the opposite of the assertion (¬B ⇒ B). But ¬B and B cannot both be true, because
¬(B ∧ ¬B) is a tautology (see (2.5)). Then the assumption ¬B must be false, that is, B
itself must be true. One last example:
Proof:
Here again, two proofs by contradiction are nested. In the “inner” proof, the assumption
¬A has been used to derive a false assertion, so A must be true. This is yet another ver-
sion of the proof by contradiction.
This proof is due to Euclid and is thus about 2300 years old. Can you imagine that some mathematicians are as happy about a beautiful proof as many others are about a good piece of music? What makes a “beautiful” proof? It is usually short and concise, nevertheless understandable, at least for a mathematician, and often contains surprising conclusions and twists that are not easy to come up with when looking at the assertion to be proven. I think the last two
proofs definitely fall into this category. The Hungarian mathematician Paul Erdős (1913–
1996) believed God keeps a book in which he records the perfect proofs for mathematical
2.3 Predicate Logic (First-Order Logic) 43
theorems. At least parts of this book have reached Earth by unknown means and have been
compiled in the earthly book “Proofs from the BOOK”. Euclid’s proof for the infinite num-
ber of prime numbers is the first one in it. It really deserves it.
Later we will learn another proof principle, mathematical induction. But first we have to
work out a few more logical concepts:
Look again at the example for calculating leap years in Sect. 2.1 after Definition 2.8: S =
“x is a leap year”. Is the proposition true or false? Does it even make sense to ask if it is
true or false? Obviously, the truth of the statement depends on x; that is, it only becomes
a proposition when the value of x is known. However, only a certain set of values is
allowed for x, here the natural numbers.
Propositions that only become true or false by inserting certain values occur fre-
quently. They are called predicates or propositional functions. Predicates are nothing
more than mappings. For the predicate A that depends on x, we write A(x). It is also pos-
sible to plug in several variables.
Examples
Definition 2.12 Let M be a set and Mⁿ the n-fold Cartesian product of M. An n-ary predicate P is a mapping that assigns to each element from Mⁿ a truth value t or f. M is called the domain of individuals of the predicate P.
Example
Predicates can be used just like propositions. All rules for building formulas carry over.
Each formula thus formed is again a predicate. We assume now that combined predicates
have the same domain of individuals. This is not a significant restriction and saves us
writing.
Connections of predicates are of course again predicates.
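In Java, unary predicates over integers can be modelled with the standard interface `java.util.function.IntPredicate`, whose methods `and`, `or` and `negate` build new predicates from given ones. The following sketch uses two example predicates of its own choosing; the class name and the constant names are assumptions.

```java
import java.util.function.IntPredicate;

// Sketch: predicates as mappings from int to boolean, and their connections.
class PredicateDemo {
    // Two unary example predicates with the integers as domain of individuals.
    static final IntPredicate EVEN = x -> x % 2 == 0;
    static final IntPredicate GREATER_THAN_TEN = x -> x > 10;

    // Connections of predicates are again predicates.
    static final IntPredicate EVEN_AND_BIG = EVEN.and(GREATER_THAN_TEN);
    static final IntPredicate ODD_OR_BIG = EVEN.negate().or(GREATER_THAN_TEN);

    public static void main(String[] args) {
        // Only inserting a value turns a predicate into a proposition.
        System.out.println("EVEN_AND_BIG(12): " + EVEN_AND_BIG.test(12));
        System.out.println("EVEN_AND_BIG(8):  " + EVEN_AND_BIG.test(8));
        System.out.println("ODD_OR_BIG(4):    " + ODD_OR_BIG.test(4));
    }
}
```

The call `test(12)` corresponds to inserting the individual 12 into the predicate; the result is the truth value of the resulting proposition.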
So far we have not gained anything essentially new. But there is another method, in
addition to the insertion of values, to obtain new, interesting formulas from predicates.
If A(x) is a predicate, the following alternatives for the truth of A(x) are possible:
Examples
Example
x is even ⇐ is a predicate
for all x ∈ N, x is even ⇐ is a false proposition
there exists an x ∈ N that is even ⇐ is a true proposition ◄
Example
The binary predicate A(x, y) is thus quantified by ∀xA(x, y) or ∃xA(x, y) to a unary predicate. Only y is still variable; x is bound by the quantifier. In this case x is called a bound variable and y a free variable.
Free variables are still available for further quantification:
∃y∀xA(x, y): there exists a y such that x² > y holds for all x ⇐ is a proposition (true).
Attention: ∃x∀xA(x, y) is not allowed! x is already used for quantification; it is “bound”.
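Over a finite range of integers, both quantifiers can be evaluated by simply trying all values. The sketch below does this with `IntStream.allMatch` and `IntStream.anyMatch`. The finite ranges are an assumption made so that the loops terminate; they only stand in for domains such as N, over which a program cannot try all values.

```java
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

// Sketch: "for all" and "there exists" over a finite range of integers.
class Quantifiers {

    // for all x in [from, to]: p(x)
    static boolean forAll(int from, int to, IntPredicate p) {
        return IntStream.rangeClosed(from, to).allMatch(p);
    }

    // there exists an x in [from, to]: p(x)
    static boolean exists(int from, int to, IntPredicate p) {
        return IntStream.rangeClosed(from, to).anyMatch(p);
    }

    public static void main(String[] args) {
        // "for all x: x is even" and "there exists an x that is even", for x = 1..100
        System.out.println(forAll(1, 100, x -> x % 2 == 0));
        System.out.println(exists(1, 100, x -> x % 2 == 0));

        // Quantifying x first leaves a unary predicate in y,
        // which is then quantified as well (assumed range -5..5 for both).
        IntPredicate forAllX = y -> forAll(-5, 5, x -> x * x > y);
        System.out.println(exists(-5, 5, forAllX));
    }
}
```

In the last example x is bound inside `forAllX`, while y is still free there and is only bound by the outer call to `exists`.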
New propositions or predicates are thus obtained from existing predicates by linking or
by quantifying:
We need to look at this definition a little more closely. It is the first example of a so-
called recursive definition. These play an important role in computer science. Why can’t
we just define it roughly like this: “Formulas of predicate logic are all possible com-
binations and quantifications of predicates”? For us humans, this is more readable and
perhaps initially more understandable. However, Definition 2.13 provides more: It simul-
taneously provides a precise recipe for how to build formulas, and—almost even more
importantly—a precise instruction for how to check a given expression to see if it is a
valid formula or not. This works recursively, backwards, by repeatedly applying the defi-
nition until you arrive at elementary formulas.
Example
The brackets in Definition 2.13b) are necessary to exclude that, for example, P ∧ Q ∨ R
would pass as a formula. However, for reasons of readability, brackets are often omitted
in logical expressions. The following rules apply: ¬ binds most strongly, and ∧ and ∨ bind more strongly than → and ↔.
The verification using such a recursive definition follows precise rules, so it can be
carried out mechanically. Language definitions for programming languages are usually
built recursively (for example, using the Backus Naur Form, BNF). The first thing a compiler does is to parse the program, that is, to analyze whether its syntax is correct.
As we have seen, quantification turns an n-ary predicate (for n > 1) into an (n − 1)-ary
predicate and a 1-ary predicate into a proposition. Propositions are therefore also called
0-ary predicates and, according to Definition 2.13, are formulas of predicate logic.
All formulas of predicate logic, including propositions, will be referred to briefly as
predicates. Sometimes I will not be quite as precise and simply say “proposition” again
for a predicate, as I have already done in the example with the leap years in Sect. 2.1
after Definition 2.8.
is:
For all even numbers z > 2 it holds that z is the sum of two prime numbers.
is:
There is an even number z > 2 that is not the sum of two prime numbers.
The negation therefore reverses the quantifier and the statement inside. Note this is not
only true for one quantifier, but for any number of quantifiers. Let’s calculate the nega-
tion of ∃y∀xA(x, y), for example:
¬(∃y∀xA(x, y)) ⇔ ∀y¬(∀xA(x, y)) ⇔ ∀y∃x(¬A(x, y)).
All quantifiers are therefore reversed and the statement inside is negated. This also works
with 25 quantifiers. If you carefully formulate the statements contained in a contradiction
proof, the application of this trick is often a great relief!
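The rule for negating quantifiers can be tried out on a finite range, where both sides can be computed by exhaustive search. The following sketch compares ¬(∃y∀xA(x, y)) with ∀y∃x(¬A(x, y)); the interface `IntBiPredicate`, the method names and the finite range are assumptions made for illustration.

```java
import java.util.stream.IntStream;

// Sketch: both sides of the negation rule, evaluated over a finite range lo..hi.
class QuantifierNegation {

    // A binary predicate A(x, y) over the integers.
    interface IntBiPredicate {
        boolean test(int x, int y);
    }

    // Left side: not (there exists y: for all x: A(x, y))
    static boolean left(IntBiPredicate a, int lo, int hi) {
        return !IntStream.rangeClosed(lo, hi).anyMatch(
                y -> IntStream.rangeClosed(lo, hi).allMatch(x -> a.test(x, y)));
    }

    // Right side: for all y: there exists x: not A(x, y)
    static boolean right(IntBiPredicate a, int lo, int hi) {
        return IntStream.rangeClosed(lo, hi).allMatch(
                y -> IntStream.rangeClosed(lo, hi).anyMatch(x -> !a.test(x, y)));
    }

    public static void main(String[] args) {
        IntBiPredicate square = (x, y) -> x * x > y;
        IntBiPredicate less = (x, y) -> x < y;
        System.out.println(left(square, -3, 3) + " " + right(square, -3, 3));
        System.out.println(left(less, -3, 3) + " " + right(less, -3, 3));
    }
}
```

For each predicate, the two printed values agree. A test on a few predicates is of course no proof of the rule; it only illustrates it.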
For the sake of simplicity, we will assume for the moment a program runs without fur-
ther user interaction and comes to an end after it starts. At least parts of a program (meth-
ods or parts of methods) meet this requirement. Such a program is nothing more than a
set of rules for transforming variable values. Depending on the values of the variables at
the beginning of a program run, a variable assignment results at the end of the program.
Now when is a program correct? This decision can only be made if the possible input
states and the resulting output states have been specified beforehand. The program is
then correct if every allowed input state generates the specified output state.
Before and after the program runs, the variables must therefore fulfill certain condi-
tions. These conditions are nothing other than predicates: The variables are placeholders
for the individuals. During a program run, individuals are inserted and thus a proposition
results. Allowed states result in true propositions, forbidden states usually result in false
propositions (Fig. 2.1).
The program is correct if, for each variable assignment that satisfies V, the condition N holds after execution of A. We write for this: V −A→ N.
After program implementation, the program is tested. For this purpose, as many allowed variable assignments as possible are used and the result is checked. If this is done carefully, one can be confident that the program is correct. However, usually not all allowed input states can be checked. Is it still possible to prove V −A→ N?
Fig. 2.1 Program: the instruction sequence A leads from the precondition V to the postcondition N.
Example
The specification of A is: “Exchange the content of the two integer variables a and b”.
The preconditions and postconditions are:
V : Variable a has the value x, variable b has the value y. (x, y Integer)
N : Variable a has the value y, variable b has the value x. (x, y Integer) ◄
You can see here these are predicates, not propositions. We cannot say anything about the truth or falsity of V and N, but we will prove that for the following sequence of instructions A in (2.13) $V \xrightarrow{A} N$ is always true.
A: a = a+b; b = a-b; a = a-b; (2.13)
The predicate V is transformed into another predicate line by line when the instructions
are executed:
$V \xrightarrow{a=a+b} V_1$
$V_1$: a has the value x + y, b has the value y.
$V_1 \xrightarrow{b=a-b} V_2$
$V_2$: b has the value x + y − y = x, a has the value x + y.
$V_2 \xrightarrow{a=a-b} V_3$
$V_3$: a has the value x + y − x = y, b has the value x. $V_3$ is the same as N.
This proves the validity of the program.
If you test this program cleverly, you will nevertheless find an error: I have not considered overflows. The precondition must restrict the range of individuals to avoid this. Can you do that?
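One possible way to make the conditions explicit in code is sketched below. The function name `swapBySum` and the choice to check the strengthened precondition (the sum x + y must fit into an `int`) with `assert` are assumptions for illustration, not part of the proof above.

```cpp
#include <cassert>
#include <limits>

// Swaps a and b with the instruction sequence A from (2.13).
// Strengthened precondition V: a = x, b = y, and x + y is representable as int.
// Postcondition N: a = y, b = x.
void swapBySum(int& a, int& b) {
    const long long sum = static_cast<long long>(a) + b;
    assert(sum >= std::numeric_limits<int>::min() &&
           sum <= std::numeric_limits<int>::max());  // precondition
    const int x = a;
    const int y = b;
    a = a + b;  // V1: a = x + y, b = y
    b = a - b;  // V2: a = x + y, b = x
    a = a - b;  // V3: a = y,     b = x
    assert(a == y && b == x);  // postcondition
}
```

The intermediate differences x + y − y and x + y − x are x and y themselves, so once the sum fits, no further overflow can occur.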
For most of you, this will have been the first and the last program proof of your life. You
can see that it is very, very time-consuming even for small programs. But in theory you
can really prove the correctness of programs.
Practically, this has only been possible for small programs so far. But the importance of proving is increasing more and more. In particular, with safety-critical software, you cannot afford any errors. Smartcards are well suited for this: The software on them is manageable, and since a cash card is (at least from the perspective of the banks) always in potential enemy hands, freedom from errors is very important.
Since proving programs is a formal process, it can also be automated: You can write
proof programs. But then their correctness must be proven by hand. This is a current
topic in software development technology.
One drawback is a proof can only show that a specification is fulfilled. For this, on the
one hand, the specification must be strictly formalized, that is, it must contain complete
preconditions and postconditions. This is also a problem with larger software projects.
On the other hand, errors in the specification cannot be detected by a program proof.
Example
Your project manager assigns you the task of writing a module that calculates the
quotient of two integers. (That is his specification.) You implement the module and
deliver the “program”:
q = x/y;
The module is integrated into another module of your colleague and is used. If the
Airbus pilot reads the message “floating point exception” on his display, there is trou-
ble. Who is to blame? You are of the opinion your colleague should have checked the
denominator for 0 before. Your colleague thinks the same of you.
Once you work on larger software projects, you will notice that agreeing on system interfaces correctly and unambiguously is always among the most difficult tasks of the whole process.
You are out of the woods when you deliver the corresponding preconditions and postconditions together with your program. To the program q = x/y; belongs the precondition V = “Variables x, y are from Z and y ≠ 0” and the postcondition N = “q has the value: integer part of x/y”.
You could have specified differently: V = “Variables x, y are from Z”, N = “q has the value: integer part of x/y if y ≠ 0 and is otherwise undefined”. Then your implementation would have to look like this:
if(y != 0) q = x/y;◄
Get used to specifying program modules with preconditions and postconditions and
implementing these conditions. If a condition is violated, an exception can be thrown.
Even if we don’t think about program proofs, this makes testing much easier and thus
increases the quality of the program.
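A minimal sketch of how the first specification could be implemented with an exception follows. The function name `quotient` and the choice of exception types are assumptions; the additional check for the one overflowing case of integer division is not part of the specification above.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>

// Precondition V: x, y are integers and y != 0.
// Postcondition N: the result is the integer part of x/y.
int quotient(int x, int y) {
    if (y == 0) {
        throw std::invalid_argument("quotient: precondition y != 0 violated");
    }
    if (x == std::numeric_limits<int>::min() && y == -1) {
        // The mathematical result is not representable as int.
        throw std::overflow_error("quotient: result does not fit into int");
    }
    return x / y;  // C++ truncates toward zero
}
```

A caller that violates the precondition now gets a clearly attributable error instead of a crash somewhere else in the system.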
Comprehension Questions
Exercises
1. Check whether the following sentences are propositions and whether they are true
or false:
a) Either 5 < 3 or from 2 + 3 = 5 it follows that 3 · 4 = 12.
b) When I am big, I am small.
c) This sentence is not a proposition.
2. For the conjunctions “neither … nor”, “either … or” and “namely …, but not” set
up truth tables and try to express these connectives with ∧, ∨, ¬.
3. Form the negation of
a) The triangle is right-angled and isosceles.
b) Boris can speak Russian or German.
4. Prove a distributive and an associative law for the logical operations ∧, ∨.
5. Show that (¬B → B) → B is a tautology. (If one can infer the assumption from the
opposite of an assumption, then the assumption is true.) Is also (¬B → B) ↔ B a
tautology?
6. For each natural number n let $a_n$ be a real number. In the second part of the book we will see the sequence $a_n$ converges to zero if and only if the following logical proposition is true:
$$\forall t\,\exists m\,\forall n\,\left(n > m \to |a_n| < \frac{1}{t}\right), \quad m, n, t \in \mathbb{N}.$$
Form the negation of this proposition.
7. From Theorem 2.11 it does not follow that the product of the first n prime
numbers + 1 is a prime number. Find the smallest n with the property that
p1 · p2 · . . . · pn + 1 is not prime.
3 Natural Numbers, Mathematical Induction, Recursion
Abstract
• you will know what an axiomatic system is, and know the axioms of natural num-
bers,
• you can represent natural numbers in different numeral systems,
• you will master the proof principle of mathematical induction and have proved
some important theorems with mathematical induction,
• you will know what recursive functions are and know the relationship between
recursion and induction,
• you will have carried out runtime calculations for some recursive algorithms.
In mathematics, new assertions (theorems) are obtained from given assertions by means
of logical conclusions. This whole process has to start somewhere. At the beginning
there must be a set of facts that are assumed to be true without being proven themselves.
These unproven basic facts of a theory are called axioms.
In applied mathematics, one tries to imitate situations from the real world. The axi-
oms should describe basic facts as simply and plausibly as possible. When this is done,
the mathematician forgets the real world behind the axioms and starts to calculate. If he
has created a good model, then the results he obtains can be applied to real world prob-
lems again.
The formulation of axiom systems is a lengthy and difficult task, often involving
entire generations of mathematicians. The axiom system of a theory should meet the fol-
lowing requirements:
1. As few and simple (plausible) axioms should be formulated as possible, but of course
also enough to be able to completely describe the theory.
2. The axioms must be independent of each other, that is, one axiom should not follow
from the others.
3. The axioms must be free of contradiction.
The natural numbers and their basic properties are known to every human being from
childhood. We become familiar with them, for example, by counting gummi bears (here
the natural numbers denote the cardinality of a set) or by counting off (here the sequence
of numbers is more interesting). Intuitively we can deal with natural numbers. But what
are the characteristic properties we can rely on when we want to do mathematics? At the
end of the 19th century, Giuseppe Peano formulated the following axiom system, which
today is the generally accepted basis of the theory of natural numbers:
Definition 3.1: The Peano axioms The set N of natural numbers is characterized
by the following properties:
The first three axioms are self-evident, the fourth is plausible by “trying it out”:
(1 ∈ M ⇒ 2 ∈ M ⇒ 3 ∈ M ⇒ · · ·).
The successor of the natural number n is denoted by n + 1.
Incredibly, one can deduce everything from these few axioms that is known about natural
numbers. For example, one can define the addition n + m as “form m-times the successor of
n”. It follows that addition is associative and commutative. Similarly, one obtains a descrip-
tion of multiplication. Further, from N, with the help of the axioms, one can construct the
integers Z and the rational numbers Q with all their rules. The whole number theory with its
theorems can be deduced from it. We will not do this here, we simply believe from now on
all the calculation rules that we know from school are correct.
We investigate the axiom (P4) a little more closely and deduce our first theorem from it:
a) A(1) is true,
b) for all n ∈ N: if A(n) is true, then A(n + 1) is also true.
Before the proof, a simple example to illustrate what is behind this principle:
Let A(n) be the assertion: $1 + 2 + 3 + \cdots + n = \frac{n(n+1)}{2}$.
By trying it out, one finds that A(1) is true: $1 = \frac{1\cdot 2}{2}$, A(2) is true: $1 + 2 = \frac{2\cdot 3}{2}$, A(3) is true: $1 + 2 + 3 = \frac{3\cdot 4}{2}$ and so on. But what about A(1000) or A($10^{13}$)?
Since there are infinitely many natural numbers, one would actually need an infinite
amount of time to check all cases. The induction principle says: If we can generally
infer from one number to its successor, we do not have to check everything individually.
Then A(4), A(5), . . . , A(999), A(1000) is true and also the assertion for all further num-
bers. This way we can prove it in finite time. Therefore, mathematical induction is a very
important and widely used proof method in mathematics.
We only show: A(n) → A(n + 1):
Assume A(n) is true, that is $1 + 2 + 3 + \cdots + n = \frac{n(n+1)}{2}$. Then it holds:
$$1 + 2 + 3 + \cdots + n + (n + 1) = \frac{n(n+1)}{2} + (n + 1)$$
and
$$\frac{n(n+1)}{2} + (n + 1) = \frac{n(n+1) + 2(n+1)}{2} = \frac{n^2 + 3n + 2}{2} = \frac{(n+1)(n+2)}{2},$$
so A(n + 1) is also true.
The proof of the induction principle is quite simple: Let the conditions a), b) from Theorem 3.2 be satisfied for A(n). Now let M = {n ∈ N | A(n) is true}. Then 1 ∈ M, and if n ∈ M, then also n + 1 ∈ M, and by axiom (P4) it follows: M = N.
You should always carry out a mathematical induction according to the following
scheme:
Base case: Show: A(1) is correct.
$$\begin{aligned}
&(1 - q)(1 + q + q^2 + q^3 + \cdots + q^n + q^{n+1})\\
&\quad= (1 - q)(1 + q + q^2 + q^3 + \cdots + q^n) + (1 - q)\,q^{n+1}\\
&\quad= (1 - q^{n+1}) + (1 - q)\,q^{n+1} \qquad \text{(induction hypothesis)}\\
&\quad= (1 - q^{n+1}) + (q^{n+1} - q^{n+2}) = 1 - q^{n+2}.
\end{aligned}$$
In the language of logic from Sect. 2.3 A(n) is a predicate with domain of individuals N. As a proof principle, Theorem 3.2 states the formula
$$\underbrace{A(1)}_{\text{Base case}} \wedge \underbrace{\forall n\,(A(n) \to A(n + 1))}_{\text{Induction step}} \to \forall n\,A(n)$$
is always true.
You should always keep the following hints in mind when carrying out an induction:
1. Never forget the base case! If that is missing, the whole rest may be in vain.
2. The induction hypothesis “for n ∈ N it is assumed A(n) is true” may not be confused
with the assertion “ A(n) is true for all n ∈ N”.
3. If you did not use the induction hypothesis when carrying out the induction step, you did something wrong.
Always try to incorporate the hypothesis into the proof. If you do not use the hypothesis,
your calculation may still be correct, but it is no longer an induction.
For practice, we carry out two slightly more complex induction proofs. At the same time,
these are theorems we will need later.
“0! = 1” looks a bit odd. But it has been shown in all formulas with n! that it is clever to
define it this way, and so we just do it. But we will have to pay special attention to this spe-
cial case most of the time.
Theorem 3.5 Let M and N be finite sets of equal cardinality, let |M| = |N| = n.
Then there are n! bijective maps from M to N .
What is the meaning of this theorem? Let’s look at the example M = N = {1, 2, 3}. Here
we can still write down all the bijections in a table:
ϕ1 ϕ2 ϕ3 ϕ4 ϕ5 ϕ6
1 1 1 2 2 3 3
2 2 3 1 3 1 2
3 3 2 3 1 2 1
In this case (and whenever M = N), n! is exactly the number of different arrangements of the elements of the set M. So a soccer coach has 11! = 39 916 800 ways to assign his 11 players to the 11 positions.
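For small n the statement of Theorem 3.5 can be checked by brute force. The following sketch enumerates all bijections of {1, ..., n} onto itself as arrangements of the image values; the function name `countBijections` is an assumption for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Counts the bijections of {1, ..., n} onto itself by listing every
// arrangement (image[i] is the image of element i + 1).
unsigned long long countBijections(unsigned n) {
    std::vector<unsigned> image(n);
    std::iota(image.begin(), image.end(), 1u);  // start with the identity
    unsigned long long count = 0;
    do {
        ++count;
    } while (std::next_permutation(image.begin(), image.end()));
    return count;
}
```

For n = 3 the loop visits exactly the six columns of the table above. Because the running time itself grows like n!, this check is only practical for small n.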
To prove Theorem 3.5:
Base case: n = 1: There is exactly one bijection, the identity.
Induction hypothesis: For all sets of size n, the assertion is true.
Induction step: Let $M = \{x_1, x_2, x_3, \ldots, x_{n+1}\}$ and $N = \{y_1, y_2, y_3, \ldots, y_{n+1}\}$ be sets of size n + 1. Compute the number of bijections between M and N.
The number of bijections $\varphi$ with $\varphi(x_{n+1}) = y_1$ is equal to the number of bijections from $M \setminus \{x_{n+1}\}$ to $N \setminus \{y_1\}$. That is, according to the induction hypothesis, n!.
The natural numbers fall more or less from the sky, but not the way we write them down:
that has to do with the fact that we have 10 fingers. Computers work with 0 and 1, for
them the 2-finger system, the base 2 system or binary system, is suitable.
In a byte you can place exactly $2^8 = 256$ numbers. The base 256 system is difficult
for humans to handle; but we often use the base 16 system, the hexadecimal system,
because in this one a byte is represented by two digits: the first 4 bit by the first digit, the
second 4 bit by the second digit.
For some computational purposes, the precision of a standard integer in the computer
is no longer sufficient. The Integer data type usually contains 32 bit and covers a range
of $2^{32}$, that is, about 4.3 billion numbers. If you have to work with larger numbers, you
need large number arithmetic. A common implementation is to work in the base 256 sys-
tem. One byte represents a digit, a number is represented by an array of bytes.
The following theorem gives a recipe for how we can obtain the representation of a
number in different numeral systems:
The possible coefficients 0, 1, ..., b − 1 we call the digits of the numeral system with base b. For all natural numbers b it is defined $b^0 := 1$, so the last summand in (3.1) is equal to $a_0$.
Proof by induction on n:
Base case: For n = 1 it is $a_0 = 1$ and m = 0.
Induction hypothesis: The number n has the unique representation
$n = a_m b^m + a_{m-1} b^{m-1} + \cdots + a_1 b^1 + a_0 b^0$.
Induction step: It is
$$n + 1 = (a_m b^m + a_{m-1} b^{m-1} + \cdots + a_1 b^1 + a_0 b^0) + 1\cdot b^0 = a_m b^m + a_{m-1} b^{m-1} + \cdots + a_1 b^1 + (a_0 + 1)\,b^0.$$
Examples
$n = a_m b^m + a_{m-1} b^{m-1} + \cdots + a_1 b^1 + a_0 b^0$
we write $(a_m a_{m-1} a_{m-2} \cdots a_0)_b$.
So it is $256 = (256)_{10} = (100)_{16} = (1\,0000\,0000)_2$.
$255 = (15\ 15)_{16}$ is easily misunderstood. If b > 10, one therefore uses new abbreviations for the digits: A for 10, B for 11 and so on. So $255 = (\mathrm{FF})_{16}$.
But how does one find the coefficients of the number representation for a given basis?
If we divide n by b, we get
$a_m b^0 : b = 0$ remainder $a_m$.
Thus, when we continue to divide by b, the remainders are exactly the digits of the num-
ber representation in reverse order.
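The repeated division can be sketched in a few lines of C++. The function name `toBase` and the restriction to bases 2 to 16 (digits 0 to 9 and A to F) are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Returns the digits of n in the numeral system with base b (2 <= b <= 16).
std::string toBase(unsigned long long n, unsigned b) {
    assert(b >= 2 && b <= 16);
    const char digits[] = "0123456789ABCDEF";
    if (n == 0) return "0";
    std::string result;
    while (n > 0) {
        result.push_back(digits[n % b]);  // remainder = next digit, lowest first
        n /= b;                           // continue with the quotient
    }
    std::reverse(result.begin(), result.end());  // remainders come in reverse order
    return result;
}
```

With this sketch, `toBase(255, 16)` yields "FF" and `toBase(256, 2)` yields "100000000", matching the examples above.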
Two reformulations of the induction principle are sometimes very useful for calculation. They can be derived from Theorem 3.2. We omit the proofs here:
Theorem 3.7: The induction principle II (shifting the base case) Let A(n) be an
assertion about the integer n and let n0 ∈ Z be given. If it holds:
a) A(n0 ) is true,
b) for all n ≥ n0: If A(n) is true, then A(n + 1) is true as well.
Theorem 3.8: The induction principle III (generalized induction hypothesis) Let
A(n) be an assertion about the integer n and let n0 ∈ Z be given. If it holds:
a) A(n0 ) is true,
b) for all n ≥ n0: From the validity of A(k) for all n0 ≤ k ≤ n follows A(n + 1).
Often functions with the domain N are not given explicitly, but only a rule for how to cal-
culate later function values from earlier ones. For example, one can define n! by the rule:
$$n! = \begin{cases} 1 & \text{for } n = 0 \\ n\,(n - 1)! & \text{for } n > 0 \end{cases} \qquad (3.2)$$
a) a1,
b) for all n ∈ N, n > 1 a rule Fn, with which an can be computed from an−1:
an = Fn (an−1 ).
Just like the induction principle, the recursion theorem can also be easily generalized:
one does not have to start with 1, and an can be calculated from a1 , a2 , . . . , an−1 or from
a part of these values. The latter are the recursions of higher order, see Definition 3.10.
The recursion principle allows for exact definitions of fuzzy formulations, for exam-
ple, those that use dots. We have already seen this with the example n! The rule for the
sequence an = n! is:
0! = 1, n! = Fn ((n − 1)!) := n · (n − 1)!.
Although this is not as readable for us humans as the one from Definition 3.4:
n! := 1 · 2 · 3 · . . . · n,
a computer can not do anything with the dots from the second definition, but it is able to
interpret the recursive definition. Another
Example
$$\sum_{i=1}^{0} a_i = 0, \qquad \sum_{i=1}^{n} a_i = F_n\left(\sum_{i=1}^{n-1} a_i\right) := \sum_{i=1}^{n-1} a_i + a_n.$$
◄
Most higher programming languages support recursive ways of writing. In C++, for
example, a program for the recursive calculation of n! looks like this:
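A minimal sketch of such a program, following the recursive rule (3.2), could look as follows. The function name `factorial` and the use of unsigned 64-bit arithmetic are assumptions for illustration; with this type the result is only correct up to n = 20.

```cpp
#include <cassert>
#include <cstdint>

// Recursive calculation of n! according to (3.2).
std::uint64_t factorial(unsigned n) {
    if (n == 0) {
        return 1;                 // base case: 0! = 1
    }
    return n * factorial(n - 1);  // recursive case: n! = n * (n - 1)!
}
```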
efficiency than traditional iterative algorithms. Look at the two following recursive defi-
nitions of the power function:
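$$x^n = \begin{cases} 1 & \text{if } n = 0 \\ x^{n-1}\,x & \text{if } n > 0 \end{cases} \qquad (3.3)$$
or
$$x^n = \begin{cases} 1 & \text{if } n = 0 \\ x^{n/2}\,x^{n/2} & \text{if } n \text{ is even} \\ x^{n-1}\,x & \text{if } n \text{ is odd} \end{cases} \qquad (3.4)$$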
Check for a few numbers n (for example n = 15, 16, 17), how many multiplications you need to calculate $x^n$. (3.3) corresponds in runtime to the iterative calculation, (3.4) is a huge improvement.
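The comparison can also be left to the machine. The sketch below implements both definitions and counts multiplications in a reference parameter; the names `powerLinear`, `powerFast` and `mults` are assumptions, and in the even case $x^{n/2}$ is computed only once and then squared.

```cpp
#include <cassert>

// Definition (3.3): one multiplication per recursion level.
double powerLinear(double x, unsigned n, unsigned& mults) {
    if (n == 0) return 1.0;
    ++mults;
    return powerLinear(x, n - 1, mults) * x;
}

// Definition (3.4): for even n the half power is computed once and squared.
double powerFast(double x, unsigned n, unsigned& mults) {
    if (n == 0) return 1.0;
    ++mults;
    if (n % 2 == 0) {
        const double half = powerFast(x, n / 2, mults);
        return half * half;
    }
    return powerFast(x, n - 1, mults) * x;
}
```

Counted this way, (3.3) needs n multiplications, while (3.4) needs 7, 5 and 6 multiplications for n = 15, 16 and 17.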
With the following small checklist you should check every recursive program to avoid
unpleasant surprises:
If the function value of n depends not only on the immediately preceding function value,
but on further function values, one speaks of a recursion of higher order. A well-known
example of this is the Fibonacci sequence:
$a_n = a_{n-1} + a_{n-2}$,
$a_0 = 0, \quad a_1 = 1$.
The first function values are: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.
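As a small sketch, the definition can be translated directly into a recursive function and, for comparison, into a loop that keeps only the last two values. The names `fibRecursive` and `fibIterative` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Direct translation of the recursion of second order.
std::uint64_t fibRecursive(unsigned n) {
    if (n < 2) return n;  // a_0 = 0, a_1 = 1
    return fibRecursive(n - 1) + fibRecursive(n - 2);
}

// Iterative variant: stores only the two preceding function values.
std::uint64_t fibIterative(unsigned n) {
    if (n == 0) return 0;
    std::uint64_t previous = 0, current = 1;
    for (unsigned i = 2; i <= n; ++i) {
        const std::uint64_t next = previous + current;
        previous = current;
        current = next;
    }
    return current;
}
```

The recursive variant recomputes the same values again and again, so its running time grows very quickly with n, while the loop needs only n − 1 additions.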
Recursive functions, especially those of higher order, are also called difference equa-
tions. To uniquely determine such a function, more than just one initial value is neces-
sary. In general it holds
Theorem 3.12 A recursion of k-th order has a unique solution if the k consecutive values $a_1, a_2, \ldots, a_k$ are given.
Is it possible to give a closed form for this solution in addition to the recursive description, that is, a calculation rule that directly calculates $a_n$ from the initial values $a_1, a_2, \ldots, a_k$? Sometimes that works. For example, the recursive function of first order
$a_n = x \cdot a_{n-1}$
$a_1 = x$
has the solution $a_n = x^n$, we have already seen this in the last section. In general, however, this is not possible.
A famous recursive function that looks very simple but for which there is no explicit solution is the logistic equation. It is $a_n = \alpha a_{n-1}(1 - a_{n-1})$. This equation plays an important role in chaos theory. Depending on the value α, the function shows chaotic behavior, it jumps wildly back and forth. Calculate the function values for a few values α between 2 and 4 in a small program!
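Such a small program could be sketched as follows. The function name `logisticOrbit`, the start value as a parameter and the returned vector are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Iterates a_n = alpha * a_{n-1} * (1 - a_{n-1}) and returns the start
// value followed by the next `steps` function values.
std::vector<double> logisticOrbit(double alpha, double start, unsigned steps) {
    std::vector<double> values{start};
    for (unsigned i = 0; i < steps; ++i) {
        const double previous = values.back();
        values.push_back(alpha * previous * (1.0 - previous));
    }
    return values;
}
```

Printing, for instance, the values of `logisticOrbit(alpha, 0.3, 50)` for several α between 2 and 4 shows how the behavior changes with α.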
However, an explicit solution can always be calculated for linear recursions with constant coefficients. I would like to do this for linear recursions of first order with a constant coefficient and a constant inhomogeneous term.
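$a_n = \lambda \cdot a_{n-1} + c, \quad \lambda, c, a_1 \in \mathbb{R},$
has the solution
$$a_n = \begin{cases} a_1 \lambda^{n-1} + \dfrac{1 - \lambda^{n-1}}{1 - \lambda}\,c & \text{if } \lambda \neq 1 \\ a_1 + (n - 1)c & \text{if } \lambda = 1 \end{cases}.$$
$$\begin{aligned}
a_n &= \lambda a_{n-1} + c\\
&= \lambda(\lambda a_{n-2} + c) + c\\
&= \lambda^2 a_{n-2} + \lambda c + c \qquad\qquad (3.5)\\
&= \lambda^2(\lambda a_{n-3} + c) + \lambda c + c = \ldots\\
&= \lambda^{n-1} a_1 + (\lambda^{n-2} + \lambda^{n-3} + \cdots + 1)\,c.
\end{aligned}$$
Now we have to distinguish whether λ = 1 or λ ≠ 1. In the first case, the sum is $\lambda^{n-2} + \lambda^{n-3} + \cdots + 1 = n - 1$ and thus $a_n = a_1 + (n - 1)c$, in the second case, according to Theorem 3.3, $\lambda^{n-2} + \lambda^{n-3} + \cdots + 1 = (1 - \lambda^{n-1})/(1 - \lambda)$ and thus
$$a_n = a_1 \lambda^{n-1} + \frac{1 - \lambda^{n-1}}{1 - \lambda}\,c.$$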
In this calculation, “…” appears again; I said that with the help of recursion, such inaccura-
cies can be avoided. If that bothers you, then “guess” first, as in (3.5), how the sequence an
looks, and then you can use induction to prove that exactly this sequence satisfies the condi-
tions of the theorem.
Example
On a savings account you have an initial balance of $K_a$ and save S Euros per month. The interest rate is Z. How much is your balance after n months? Let $k_n$ be the balance at the beginning of the n-th month. Then:
$k_n = (1 + Z)k_{n-1} + S, \quad k_1 = K_a$.
So this is a recursion to which we can apply Theorem 3.13. Let q = 1 + Z. At the end of month n, the balance is $K_e = k_{n+1}$:
$$K_e = K_a q^n + \frac{1 - q^n}{1 - q}\,S.$$
◄
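The closed form can be compared with a step-by-step evaluation of the recursion. In the following sketch the function names `balanceRecursive` and `balanceClosed` are assumptions, and Z is treated as the interest rate per month, as in the recursion above.

```cpp
#include <cassert>
#include <cmath>

// Applies k_n = (1 + Z) k_{n-1} + S exactly n times, starting with k_1 = Ka.
double balanceRecursive(double Ka, double S, double Z, unsigned n) {
    double k = Ka;
    for (unsigned month = 0; month < n; ++month) {
        k = (1.0 + Z) * k + S;
    }
    return k;  // this is k_{n+1}
}

// Closed form from Theorem 3.13 with q = 1 + Z.
double balanceClosed(double Ka, double S, double Z, unsigned n) {
    const double q = 1.0 + Z;
    if (q == 1.0) {
        return Ka + n * S;  // case q = 1 of the theorem
    }
    const double qn = std::pow(q, static_cast<double>(n));
    return Ka * qn + (1.0 - qn) / (1.0 - q) * S;
}
```

Up to rounding errors of floating point arithmetic, both functions deliver the same balance.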
For higher-order linear recursions, a solution can always be given, but the calculation is
much more complex and shows an amazing similarity to the theory of differential equa-
tions, which we will deal with in Chap. 17. Difference equations can be seen as the dis-
crete brother of differential equations. I don’t go into that any further.
Try it out!
Recursive programming of algorithms does not always lead to the goal. You have to be careful there and estimate the computational effort and the memory requirements.
Often, recursive functions for runtime calculation can be easily derived from recur-
sive algorithms. With what we have just learned, we will calculate some such functions
explicitly.
The runtime of an algorithm is essentially dependent on the performance of the
machine and on the size n of input variables of the algorithm. The machine provides a
constant factor. The more powerful the machine is the smaller the factor is. The size n of
input values (numbers/words/length of numbers/…) makes itself felt in a function f (n)
for the runtime. This function f (n) we want to determine more closely at some examples.
First for a non-recursive algorithm:
1. Calculation of n!
                                    runtime:
fac = 1;                            ← α = const
for(int i = 1; i <= n; ++i)
    fac *= i;                       ← β = const, executed n times: n·β
runtime:
sort list with n entries ←f (n)
find smallest element in list ←αn
put it on front of the list ←β
sort list with n − 1 entries ←f (n − 1)
Here you get for the runtime the recursive function specification
f (n) = f (n − 1) + αn + β
f (1) = γ
with machine-dependent constants α, β and γ .
This is a linear recursion of first order, unfortunately with a non-constant inhomogeneous part, so that Theorem 3.13 cannot be applied. Please calculate yourself, similar to the proof of this theorem, that a linear inhomogeneous recursion of first order of the form $a_n = a_{n-1} + g(n)$ has the solution:
$$a_n = a_1 + \sum_{i=2}^{n} g(i).$$
This gives for g(n) = αn + β using the summation formula, which I have derived after Theorem 3.2:
$$\sum_{i=2}^{n} g(i) = \sum_{i=2}^{n} (\alpha i + \beta) = \alpha \sum_{i=2}^{n} i + \beta(n - 1) = \alpha\left(\frac{n(n + 1)}{2} - 1\right) + \beta(n - 1),$$
The runtime increases quadratically, one also says f(n) is of order $n^2$. The orders of magnitude of runtimes we will look at a little more closely in Part 2 of the book, see the Big O-notation in Sect. 13.1.
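The recursive sorting procedure analyzed above can be sketched in C++ as follows. The function name `selectionSort`, the use of a `std::vector<int>` and the returned number of comparisons are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts list[first..end) recursively and returns the number of comparisons.
std::size_t selectionSort(std::vector<int>& list, std::size_t first = 0) {
    const std::size_t entries = list.size() - first;
    if (entries <= 1) return 0;  // nothing left to sort
    auto front = list.begin() + static_cast<std::ptrdiff_t>(first);
    auto smallest = std::min_element(front, list.end());  // find smallest element
    std::iter_swap(front, smallest);                      // put it on front
    // min_element needs entries - 1 comparisons; then sort the remaining list.
    return (entries - 1) + selectionSort(list, first + 1);
}
```

For a list with n entries the function reports n(n − 1)/2 comparisons, in line with the quadratic growth derived above.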
3. You are certainly familiar with the Towers of Hanoi: n wooden discs with decreasing
diameters are stacked on top of each other. You can move one disc at a time and the
task is to rebuild the pyramid in the same order at another location. For this purpose,
a third storage space is available, but a larger disc may never be placed on a smaller
one.
If you don’t know the solution, try it out (for example, with different sized books). You
will see it always works and the algorithm can be described recursively as follows:
runtime:
move the tower of height n: ←f (n)
move the tower of height n − 1 to auxiliary position ←f (n − 1)
move bottom disc to target position ←γ
move the tower of height n − 1 onto this disc ←f (n − 1)
Amazing: There is no mention here of how a tower can be moved. Nevertheless, this is a
precisely programmable prescription!
$$f(n) = f(1)\cdot 2^{n-1} + \frac{1 - 2^{n-1}}{1 - 2}\,\gamma = \gamma\cdot 2^{n-1} + (2^{n-1} - 1)\gamma = 2^n\gamma - \gamma.$$
f(n) is of order $2^n$; the runtime grows exponentially. Exponential algorithms are considered unusable for computation by machines.
According to legend, a monastic order in Hanoi believed that the world would end if the
problem was solved for 64 discs. Do the monks have a point?
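The recursive prescription for the Towers of Hanoi translates almost literally into code. In this sketch the function name `moveTower` and the labeling of the three positions by characters are assumptions; instead of printing each move, the function counts them.

```cpp
#include <cassert>
#include <cstdint>

// Moves a tower of the given height from `from` to `to`, using `via` as
// auxiliary position, and returns the number of single disc moves.
std::uint64_t moveTower(unsigned height, char from, char to, char via) {
    if (height == 0) return 0;  // an empty tower needs no moves
    std::uint64_t moves = moveTower(height - 1, from, via, to);  // to auxiliary position
    ++moves;                                                     // bottom disc to target
    moves += moveTower(height - 1, via, to, from);               // tower onto this disc
    return moves;
}
```

The count is $2^n - 1$ for a tower of height n, which agrees with the formula for f(n) when every move costs γ. For 64 discs do not call the function; insert n = 64 into the formula instead.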
runtime:
sort a list with n entries ←f (n)
sort first half of the list with n/2 entries ←f (n/2)
sort second half of the list with n/2 entries ←f (n/2)
merge both lists ←γ n
$f(2^m) = 2f(2^{m-1}) + \gamma 2^m$.
Now let’s set $g(m) := f(2^m)$, so we at least get a linear recursion of first order for g:
For intermediate values between the powers of two, this formula also applies approximately, and so we get the statement that Merge Sort is of order $n \log_2 n$.
We will deal with the logarithm in the second part of the book. Just so much: The inverse function to the function $x \mapsto a^x$ is called the logarithm to the base a, that is $a^x = y \Rightarrow x = \log_a y$ and here: $2^m = n \Rightarrow m = \log_2 n$. The logarithm to the base 2 plays a particularly important role in computer science.
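A compact sketch of Merge Sort along the lines of the runtime scheme above is given below. The function name `mergeSort`, the half-open index range and the use of `std::inplace_merge` for the merging step are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts the entries list[first..last) recursively.
void mergeSort(std::vector<int>& list, std::size_t first, std::size_t last) {
    if (last - first < 2) return;  // lists with 0 or 1 entries are sorted
    const std::size_t middle = first + (last - first) / 2;
    mergeSort(list, first, middle);  // sort first half of the list
    mergeSort(list, middle, last);   // sort second half of the list
    const auto begin = list.begin();
    std::inplace_merge(begin + static_cast<std::ptrdiff_t>(first),
                       begin + static_cast<std::ptrdiff_t>(middle),
                       begin + static_cast<std::ptrdiff_t>(last));  // merge both lists
}
```

A complete list is sorted with the call `mergeSort(list, 0, list.size())`.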
Comprehension Questions
Exercises
This is due to the fact that when representing numbers in binary form, it holds:
$$\underbrace{111\ldots111}_{n} + 1 = 1\underbrace{000\ldots000}_{n}.$$
Formulate corresponding statements for other bases.
6. Show by mathematical induction: Every non-empty subset of the natural numbers
has a smallest element.
This property plays a big role in number theory. It is also called: The natural numbers are
“well-ordered”. The rational numbers or the real numbers do not have this property!
Abstract
4.1 Combinatorics
We begin this chapter with the compilation of some theorems on combinatorics. Com-
binatorics describes ways to select, combine, or permute elements from given sets. The
first result we have already formulated in Theorem 3.5 as an example for mathematical
induction. Here it is again for the sake of remembrance:
Theorem 4.1 The elements of a set with n elements can be arranged in exactly n!
different ways.
Theorem 4.2 Let M be a finite set with n > 0 elements. Then there are $2^n$ different mappings from M to {0, 1}.
Example
At first sight, it is not at all clear what this result should mean, but it has quite practical
applications:
Interpretation 1: Through one data element of n bit length, exactly $2^n$ symbols can be represented (encoded).
You can see this in the example: Each mapping determines a unique combination of 0 and 1, so each mapping can be associated with a symbol. In a byte (8 bits), you can encode 256 symbols, in 2 bytes 65 536, in 4 bytes $2^{32}$, that is about 4.3 billion. 4 bytes is the size of an integer, for example, in Java.
is a bijective mapping, so the two sets are equipotent (compare Definition 1.20). In the
example above, ϕ1 and ∅ correspond to each other, ϕ2 and {d}, …, ϕ15 and {a, b, c}, ϕ16
and M .
It is typical for mathematicians that they formulate theorems as generally and abstractly
as possible. But by simple specializations one is often able to achieve different concrete
results.
This can be proven directly: For k = 0 and k = n we have already seen this is right, so
we can assume 0 < k < n.
Let $M = \{x_1, x_2, \ldots, x_n\}$. The k-element subsets N of M fall into two types:
1st type of sets: Those with $x_n \notin N$. These correspond exactly to the k-element subsets of $M \setminus \{x_n\}$ and there are $\binom{n-1}{k}$ of them.
2nd type of sets: Those with $x_n \in N$. These correspond exactly to the (k − 1)-element subsets of $M \setminus \{x_n\}$, so we have $\binom{n-1}{k-1}$ of them.
In total, there are $\binom{n-1}{k} + \binom{n-1}{k-1}$ subsets.
Behind this recursive rule lies the possibility of calculating the binomial coefficient with
the help of Pascal’s triangle. Each coefficient results as sum of the two coefficients diag-
onally above:
(4.2)
It takes a while to calculate 69 choose 5 in this way; fortunately, we can take our com-
puter’s help and write a small recursive program very quickly that implements the calcu-
lation rule (4.1). In the end, we finally get the number 11 238 513. So you really have to
wait a long time for the five in Powerball.
If you implement this program, you will notice that the calculation of 69 choose 5 takes
an amazingly long time. Remember the warnings about using recursive functions from
Sect. 3.3? Try to determine the runtime of the algorithm as a function of n and k. You will
find more about this in the exercises for this chapter.
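A sketch of both approaches follows. It assumes that the calculation rule (4.1) is the recursion $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$ with $\binom{n}{0} = \binom{n}{n} = 1$, as derived in the proof above; the function names `binomialRecursive` and `binomialPascal` are chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Direct recursive implementation of the calculation rule.
std::uint64_t binomialRecursive(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k == 0 || k == n) return 1;
    return binomialRecursive(n - 1, k) + binomialRecursive(n - 1, k - 1);
}

// Pascal's triangle, row by row; only the entries 0..k of a row are kept.
std::uint64_t binomialPascal(unsigned n, unsigned k) {
    if (k > n) return 0;
    std::vector<std::uint64_t> row(k + 1, 0);
    row[0] = 1;  // row 0 of the triangle
    for (unsigned i = 1; i <= n; ++i) {
        // Update from right to left so that row[j - 1] is still the old value.
        for (unsigned j = std::min(i, k); j >= 1; --j) {
            row[j] += row[j - 1];
        }
    }
    return row[k];
}
```

The recursive version ends every branch in a 1 and adds these ones up, so the number of calls grows with the size of the result, whereas the triangle needs only about n · k additions.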
Those of you who already know the binomial coefficient from before may have seen
another definition for it. Of course, this is equivalent to our Definition 4.3; we formulate
it here as a theorem:
Theorem 4.6 From a set with n different elements, k elements (order doesn’t matter) can be selected in $\binom{n}{k}$ ways.
Theorem 4.7 From a set with n different elements, k elements (order does matter) can be selected in
$$n(n - 1)(n - 2) \cdots (n - k + 1) = \frac{n!}{(n - k)!}$$
ways.
Order matters means that, for example, the selection (1, 2, 3, 4, 5, 6) and the selection
(2, 1, 3, 4, 5, 6) are different.
Example
Determine the number of different passwords with 5 characters in length that can
be formed from 26 letters and 10 digits, where no character occurs more than once:
According to Theorem 4.7, this results in 36 · 35 · 34 · 33 · 32 = 45 239 040. It is certainly clear to you that no password cracker program breaks a sweat here. ◄
You are probably already familiar with an application of the binomial coefficient in
algebra: the binomial theorem. Here is probably where the name “binomial coefficient”
comes from; it is related to the binomial “x + y” and is not intended to remind us of the
mathematician Giuseppe Binomi:
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k. \qquad (4.4)$$
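1. Combining sums:
$$\sum_{k=m}^{n} A(k) + \sum_{k=m}^{n} B(k) = \sum_{k=m}^{n} (A(k) + B(k)).$$
3. Index shift:
$$\sum_{k=m}^{n} A(k) = \sum_{k=m+1}^{n+1} A(k - 1).$$
4. Changing the summation limits:
$$\sum_{k=m}^{n} A(k) = A(m) + \sum_{k=m+1}^{n} A(k) = \sum_{k=m}^{n-1} A(k) + A(n).$$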
Rule 1 contains the commutative and associative laws of addition, rule 2 the distributive
law. Rule 3 also applies analogously to other index shifts than 1, and the summation lim-
its in rule 4 can be changed by more than 1.
These rules can also often be used when programming. A sum is usually implemented by a
for loop, where the summation limits represent the initialization or the termination condi-
tion. Clever index shifting or summarizing of different loops are useful tools. For this rea-
son, you should look at the following proof; here, these tricks are used very intensively.
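$$\begin{aligned}
(x + y)^{n+1} &= (x + y)^n (x + y)\\
&= \left(\sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k\right)(x + y)\\
&= \sum_{k=0}^{n} \binom{n}{k} x^{n-k+1} y^k + \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^{k+1} && \text{(rules 1 and 2)}\\
&= \sum_{k=0}^{n} \binom{n}{k} x^{n-k+1} y^k + \sum_{k=1}^{n+1} \binom{n}{k-1} x^{n-k+1} y^k && \text{(index shift, second term)}\\
&= \binom{n}{0} x^{n+1} y^0 + \sum_{k=1}^{n} \binom{n}{k} x^{n-k+1} y^k + \sum_{k=1}^{n} \binom{n}{k-1} x^{n-k+1} y^k + \binom{n}{n} x^0 y^{n+1} && \text{(change summation limits)}\\
&= \binom{n}{0} x^{n+1} y^0 + \sum_{k=1}^{n} \left(\binom{n}{k-1} + \binom{n}{k}\right) x^{n-k+1} y^k + \binom{n}{n} x^0 y^{n+1} && \text{(rule 1)}\\
&= \binom{n+1}{0} x^{n+1} y^0 + \sum_{k=1}^{n} \binom{n+1}{k} x^{n-k+1} y^k + \binom{n+1}{n+1} x^0 y^{n+1} && \text{(recursion formula)}\\
&= \sum_{k=0}^{n+1} \binom{n+1}{k} x^{n-k+1} y^k.
\end{aligned}$$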
If you insert 1 for x and y in the binomial theorem, you get: $\sum_{k=0}^{n} \binom{n}{k} = 2^n$. According to our definition of the binomial coefficient as the number of k-element subsets of an n-element set, this means: The number of all subsets of an n-element set is $2^n$. We have already proven this in a completely different way in Theorem 4.2.
The task of division with remainder has already arisen in the representation of integers in
different numeral systems. Now we want to take a closer look at division, more precisely
at the divisibility of integers.
a) If c | b and b | a, then c | a.
(Example: 3 | 6 and 6 | 18 ⇒ 3 | 18)
b) If b1 | a1 and b2 | a2, then b1 b2 | a1 a2.
(3 | 6 and 4 | 8 ⇒ 12 | 48)
c) If b | a1 and b | a2, then for α, β ∈ Z: b | αa1 + βa2.
(3 | 6 and 3 | 9 ⇒ 3 | (7 · 6 + 4 · 9))
d) If b | a and a | b, then a = b or a = −b.
The rules are all easily checked by reducing them to the definition. I only want to show
the first one as an example: c | b and b | a imply b = q1 c and a = q2 b and therefore
a = q2 q1 c, thus c | a.
Theorem and Definition 4.11: The Division with remainder For two integers a, b with b ≠ 0 there is exactly one representation a = bq + r with q, r ∈ Z and 0 ≤ r < |b|. a is called dividend, b divisor, q quotient and r remainder of the division of a by b. We denote q with a/b and r with a mod b (read: a modulo b).
In the proof, we first assume a, b > 0. Now let q be the largest integer with bq ≤ a. Then there is an r ≥ 0 with bq + r = a and it holds that r < b, otherwise q would not have been maximal. If a or b or both are negative, the considerations are quite similar.
In most programming languages, there are special operators for division and remainder. In
C++ and Java, for example, q = a/b and r = a%b.
To my regret, no mathematician was involved in the definition of the division and modulo
operators in C++ and Java. For positive numbers a and b, the effect corresponds to our
definition. But we sometimes have to divide negative numbers. For this, too, the above
definition yields a unique solution for q and r. The division of −7 by 2, for example,
results in q = −4 and r = 1, since −7 = −4 · 2 + 1. Java, on the other hand,
calculates (−7)/2 = −3 (according to the rule: first calculate |a|/|b|; if exactly one of a and b
is negative, the result is multiplied by −1) and from this, using r = a − bq, the remainder −1.
Mathematical remainders are always greater than or equal to 0! Therefore, do not use “/”
and “%” on negative numbers. Fortunately, the (in the true sense of the word) negative cases
can be reduced to positive ones.
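The difference between the two conventions can be made concrete with a short sketch. I use Python here; both function names are hypothetical and chosen only for this illustration. Note that Python's own operators follow yet another rule (the quotient is rounded down, so the remainder takes the sign of b), which is why the first function still needs a correction for negative b.

```python
def euclidean_divmod(a, b):
    """Return (q, r) with a == b*q + r and 0 <= r < |b| (Definition 4.11)."""
    if b == 0:
        raise ZeroDivisionError("b must not be 0")
    q, r = divmod(a, b)        # Python rounds q down, so r has the sign of b
    if r < 0:                  # can only happen for b < 0
        q, r = q + 1, r - b    # shift r into the range 0 <= r < |b|
    return q, r


def truncating_divmod(a, b):
    """Mimic the rule described for Java: divide |a| by |b|, then fix the sign."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):     # exactly one of a, b is negative
        q = -q
    return q, a - b * q        # remainder from r = a - b*q


print(euclidean_divmod(-7, 2))    # mathematical convention
print(truncating_divmod(-7, 2))   # convention described for Java
```

For −7 and 2 the first call yields the pair (−4, 1) from the text, the second the pair (−3, −1).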
4.2 Divisibility and Euclidean Algorithm 77
Definition 4.12: The greatest common divisor (gcd) If a, b, d ∈ Z and d | a and
d | b, then d is called a common divisor of a and b. The largest positive common
divisor of a and b is called the greatest common divisor of a and b and is denoted by
gcd(a, b).
The calculation of the greatest common divisor of two numbers is carried out using the
famous Euclidean algorithm. Quite unusually for a mathematician, Euclid was not satis-
fied with the existence of the greatest common divisor, but he gave a concrete calculation
procedure giving the greatest common divisor as a result. The concept of algorithm rep-
resents a central concept in computer science and the Euclidean algorithm is a prototype
for this: a calculation procedure with a beginning and an end, input values and results.
One could thus call Euclid one of the fathers of computer science.
In order to understand the effect of the algorithm, we put a lemma first:
For the first part: b is of course a common divisor and b cannot have a divisor with a
larger absolute value, so |b| = gcd(a, b) applies. In this case, a mod b = 0.
For the second part, it is enough to show that the sets of common divisors of a, b resp.
b, r coincide.
To do this, let d be a divisor of a and b. Then, according to Theorem 4.10c), d | a − bq
follows. So d is a common divisor of b and r = a − bq. If d is now a common divisor of
b and r, then d is also a divisor of bq and, according to 4.10c) again, of r + bq = a, so d is a
common divisor of a and b.
From Lemma 4.13 it follows first of all that both lines are correct representations of
gcd(a, b). When carrying out the algorithm, we repeatedly apply the second line: a is
replaced by b and b by the remainder when dividing a by b. The gcd does not change.
If finally the remainder is 0, then according to the first line gcd(a, b) is the last
remainder different from 0. Does the termination condition actually occur at some point?
Yes, because in each step the remainder gets strictly smaller. But since it is never
negative, it must eventually reach 0.
Example
gcd(−42, 133):
   a =   b · q    +  r
 −42 = 133 · (−1) + 91
 133 =  91 · 1    + 42
  91 =  42 · 2    +  7
  42 =   7 · 6    +  0
so gcd(−42, 133) = 7. ◄
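The example can be reproduced with a few lines of Python. This is only a sketch; the function name `gcd_steps` and the decision to return the intermediate rows are my own assumptions for illustration. Replacing b by |b| at the start does not change the gcd and keeps all remainders non-negative.

```python
def gcd_steps(a, b):
    """Euclidean algorithm; returns the gcd and the rows (a, b, q, r)."""
    b = abs(b)                 # gcd(a, b) = gcd(a, |b|); keeps remainders >= 0
    steps = []
    while b != 0:
        q, r = divmod(a, b)    # a = b*q + r with 0 <= r < b, since b > 0
        steps.append((a, b, q, r))
        a, b = b, r            # second line of the algorithm
    return abs(a), steps


g, rows = gcd_steps(-42, 133)
for a, b, q, r in rows:
    print(f"{a} = {b} * ({q}) + {r}")
print("gcd =", g)
```

The printed rows correspond line by line to the table in the example above.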
In addition to the greatest common divisor, another result is obtained in the algorithm,
which we will need later:
We go through the Euclidean algorithm line by line and cleverly rearrange each line to
be able to represent the remainder ri as a combination ri = αi a + βi b of a and b. If rn is
the last remainder different from 0, then we finally get rn = αa + βb:
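\[
\begin{aligned}
a = b q_0 + r_0 \;&\Rightarrow\; r_0 = 1a - q_0 b = \alpha_0 a + \beta_0 b, \\
b = r_0 q_1 + r_1 \;&\Rightarrow\; r_1 = 1b - r_0 q_1 = 1b - (\alpha_0 a + \beta_0 b) q_1 \\
&\qquad\;\; = -\alpha_0 q_1 a + (1 - \beta_0 q_1) b && \text{(sort by $a$ and $b$)} \\
&\qquad\;\; = \alpha_1 a + \beta_1 b, \\
r_0 = r_1 q_2 + r_2 \;&\Rightarrow\; r_2 = 1 r_0 - r_1 q_2 = 1(\alpha_0 a + \beta_0 b) - (\alpha_1 a + \beta_1 b) q_2 \\
&\qquad\;\; = \alpha_2 a + \beta_2 b && \text{(sort by $a$ and $b$)} \\
&\;\;\vdots \\
r_{n-2} = r_{n-1} q_n + r_n \;&\Rightarrow\; r_n = 1 r_{n-2} - r_{n-1} q_n
   = 1(\alpha_{n-2} a + \beta_{n-2} b) - (\alpha_{n-1} a + \beta_{n-1} b) q_n \\
&\qquad\;\; = \alpha_n a + \beta_n b.
\end{aligned}
\]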
This proof also contains a recursive way of calculating α and β: From the last
line you can see that αₙ, βₙ can be determined as the n-th elements of the sequences
αᵢ = αᵢ₋₂ − αᵢ₋₁qᵢ, βᵢ = βᵢ₋₂ − βᵢ₋₁qᵢ. The initial values are obtained from the first two
lines: α₀ = 1, β₀ = −q₀, α₁ = −α₀q₁ = −q₁, β₁ = 1 − β₀q₁ = 1 + q₀q₁.
Calculate gcd(168, 133) and at the same time α, β with gcd = α · 168 + β · 133:
The α, β would not have been found so easily by trial and error. The existence of these
numbers is an amazing fact. It always works, no matter how big a, b may be.
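A minimal sketch of this calculation in Python follows. It uses the same recursion for the coefficients as in the proof, but, as an assumption of mine that simplifies the bookkeeping, it starts the sequences two steps earlier with a = 1 · a + 0 · b and b = 0 · a + 1 · b. The name `extended_gcd` is hypothetical, and the sketch assumes a, b ≥ 0.

```python
def extended_gcd(a, b):
    """Return (g, alpha, beta) with g = gcd(a, b) = alpha*a + beta*b (a, b >= 0)."""
    r_prev, r = a, b
    alpha_prev, alpha = 1, 0       # a = 1*a + 0*b
    beta_prev, beta = 0, 1         # b = 0*a + 1*b
    while r != 0:
        q = r_prev // r
        r_prev, r = r, r_prev - q * r
        # same recursion as in the proof: new = second-last - last * q
        alpha_prev, alpha = alpha, alpha_prev - q * alpha
        beta_prev, beta = beta, beta_prev - q * beta
    return r_prev, alpha_prev, beta_prev


g, alpha, beta = extended_gcd(168, 133)
print(g, alpha, beta, alpha * 168 + beta * 133)
```

For 168 and 133 this gives gcd = 7 with α = 4 and β = −5, and indeed 4 · 168 − 5 · 133 = 7.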
As an immediate consequence, we can calculate that every common divisor of two
numbers a and b is also a divisor of gcd(a, b):
Statements about prime numbers are closely related to the question of divisibility. In
Theorem 2.11 we have already proved that there are infinitely many prime numbers. At
this point I would like to introduce some important properties of prime numbers. Let’s
start with the precise definition:
By convention, 1 is not counted among the prime numbers; 2 is obviously the only
even prime number. The first prime numbers are:
2, 3, 5, 7, 11, 13, 17, 19, 23, . . .
From the uniqueness of the representation it follows that all factors bi must already occur on
the left side of the equation. Therefore one gets the
Corollary 4.19 All divisors of a number a are obtained as the possible products
of its prime factors.
For example, 315 = 3 · 3 · 5 · 7 has the divisors 3, 5, 7, then 9, 15, 21, 35 (all products
of 2 factors), then 45, 63, 105 (all products of 3 factors) and finally 315 itself.
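The corollary translates directly into a small program. The following Python sketch (the function name is my own, and the prime factorization is assumed to be given as a list with repetitions) forms all products of sub-collections of the prime factors; the empty product contributes the trivial divisor 1.

```python
from itertools import combinations
from math import prod


def divisors_from_prime_factors(prime_factors):
    """All divisors as products of sub-collections of the prime factors."""
    divisors = set()
    for size in range(len(prime_factors) + 1):
        for combo in combinations(prime_factors, size):
            divisors.add(prod(combo))   # prod(()) == 1, the trivial divisor
    return sorted(divisors)


print(divisors_from_prime_factors([3, 3, 5, 7]))
```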
Similarly, the following statement can be derived from the prime factorization:
Corollary 4.20 Let p be a prime number and let a, b be natural numbers with
p | ab. Then p | a or p | b.
A prime number cannot be split between two factors: it is a divisor of one
or the other, perhaps of both numbers, if it occurs multiple times in the prime
factorization. From 3 | 42 and 42 = 6 · 7 it follows, for example, that 3 | 6 or 3 | 7.
In Sect. 1.2 after Definition 1.9 we have called equivalent those integers for which the
difference is divisible by 5. This way, the set of integers was divided into the disjoint
equivalence classes [0], [1], . . . , [4]. I will now investigate such relations in more detail
and present some applications in computer science.
4.3 Modular Arithmetic 81
Please do not confuse the statement “a ≡ b mod n” with the number “a mod n”, which
denotes the remainder!
We have carried out the proof for the case n = 5 in Sect. 1.2; in general, it goes exactly
the same way.
The elements congruent to a number a form the equivalence class of a. The equivalence
classes with respect to the relation “≡” are called residue classes modulo n. As a
reminder: [a] = {z | z ≡ a mod n}, and b ≡ a mod n holds if and only if
[a] = [b]. If the modulus n is not clear from the context, I write [a]ₙ for the
equivalence class.
The numbers 0, 1, 2, . . . , n − 1 are all possible remainders modulo n. There are thus
exactly the residue classes [0], [1], . . . , [n − 1]. A residue class [r] is represented by the
remainder r. Often, we will not even distinguish between remainders and residue classes.
This is a bit unusual at first, but there is nothing mysterious about it. Perhaps it is compa-
rable to the fact that you often work with references to objects in a program, rather than
with the object itself.
The set of residues resp. residue classes with respect to the number n is denoted by Z/nZ
(read: “Z modulo nZ”) or simply by Zn. The symbol nZ should symbolize that all the
elements are equivalent that differ by a multiple of n, that is, by nz for a z ∈ Z.
If a and b are remainders modulo n, we can add or multiply them. The result is usually
not a remainder anymore, but we can take it modulo n and get a remainder again. In this
way, we can define an addition and a multiplication on the set of residues:
The addition and multiplication on Z/nZ and on Z are compatible with the modulo oper-
ation. It doesn’t matter if you take two elements modulo first and then connect them or
vice versa:
ϕ : Z → Z/nZ
a ↦ a mod n
it holds that
ϕ(a + b) = ϕ(a) ⊕ ϕ(b),
ϕ(a · b) = ϕ(a) ⊗ ϕ(b).
Let’s check this for the addition: If a = nq1 + r1 and b = nq2 + r2, then
ϕ(a + b) = (nq1 + nq2 + r1 + r2 ) mod n = (r1 + r2 ) mod n
and
ϕ(a) ⊕ ϕ(b) = r1 ⊕ r2 = (r1 + r2 ) mod n.
are the same. We conclude this for multiplication in a similar way.
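The compatibility can also be tested by brute force. In the following Python sketch the names `phi`, `add_mod` and `mul_mod` are my own stand-ins for ϕ, ⊕ and ⊗; for a positive modulus n, Python's `%` returns exactly the non-negative remainder of Definition 4.11.

```python
def phi(a, n):
    """The map a -> a mod n (non-negative remainder for n > 0)."""
    return a % n


def add_mod(r1, r2, n):
    """Addition of two residues modulo n."""
    return (r1 + r2) % n


def mul_mod(r1, r2, n):
    """Multiplication of two residues modulo n."""
    return (r1 * r2) % n


n = 7
compatible = all(
    phi(a + b, n) == add_mod(phi(a, n), phi(b, n), n)
    and phi(a * b, n) == mul_mod(phi(a, n), phi(b, n), n)
    for a in range(-50, 51)
    for b in range(-50, 51)
)
print(compatible)
```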
The set Z/nZ with these two operations plays an important role in discrete mathematics
and also in computer science. We will work with it a lot. We will also encounter the
compatibility of the operations with the modulo operation again and again.
In anticipation of Sect. 5.6: The mapping from Theorem 4.25 is a homomorphism between
Z and Zn.
Now we are able to set up addition and multiplication tables for the operations, the so-
called Cayley tables, after the mathematician Arthur Cayley who invented them in 1854.
Two examples of this:
Examples
n = 3:
⊕ | 0 1 2        ⊗ | 0 1 2
0 | 0 1 2        0 | 0 0 0
1 | 1 2 0        1 | 0 1 2
2 | 2 0 1        2 | 0 2 1

n = 4:
⊕ | 0 1 2 3      ⊗ | 0 1 2 3
0 | 0 1 2 3      0 | 0 0 0 0
1 | 1 2 3 0      1 | 0 1 2 3
2 | 2 3 0 1      2 | 0 2 0 2
3 | 3 0 1 2      3 | 0 3 2 1
◄
An example of the application of modulo calculation:
Example
Not only for people, but also for computers, the second way of calculation is much more
suitable. If, for example, you want to build an eternal calendar and calculate somewhat
clumsily on which weekday in 4 billion years the sun will set forever, you will quickly get
a numeric overflow. If you always calculate modulo 7, you can even get by with one byte
as the integer data type.
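A tiny sketch of the calendar calculation in Python may illustrate the point. It rests on a loudly simplifying assumption of mine: every year has 365 days, leap years are ignored. The function name is hypothetical. Python integers do not overflow, so the direct calculation serves here only as a comparison; the point is that the reduced calculation never leaves the range of a single byte.

```python
DAYS_PER_YEAR = 365   # assumption: leap years are ignored in this sketch


def weekday_shift(years):
    """Number of weekdays by which the calendar has shifted after `years` years."""
    # Reduce every factor modulo 7 first; all intermediate values stay below 49.
    return ((years % 7) * (DAYS_PER_YEAR % 7)) % 7


years = 4_000_000_000
print(weekday_shift(years) == (years * DAYS_PER_YEAR) % 7)
```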
In the next section on hashing, I would like to show you an important application of
the modulo calculation in computer science. You will learn about other areas of applica-
tion in Sect. 5.3 on fields and Sect. 5.7 on cryptography.
4.4 Hashing
Hash Functions
In data processing, the problem often arises that data records, which are identified by
a key, have to be stored or found quickly. Take a company with 5000 employees. Each
employee has a 10-digit personnel number with which his data can be identified in the
computer.
You could store the data in an array of 5000 elements, sorted linearly by personnel
number. In the basic principles of computer science you learn that the binary search for an
array element then requires an average of 11.3 accesses (log₂(n + 1) − 1).
An impossible approach is the storage in an array with 10¹⁰ elements, in which the
index is just the personnel number. The search for a data set would require exactly one
step here.
Hashing represents a compromise between these two extremes: a relatively small
array with fast access. For this purpose, the long key is mapped to a short index using a
so-called hash function.
Let K be the key space, K ⊂ Z. The keys of the data to be stored come from K. Let
H = {0, 1, . . . , n − 1} be a set of indices. These identify the storage addresses of the
records in the storage area, the hash table. |H| is usually much smaller than |K|. A hash
function is a mapping h : K → H, k ↦ h(k). The data element with key k is then stored
at index h(k).
A function that has proven itself suitable for many purposes is our modulo mapping:
h(k) = k mod n.
The basic problem with hashing is that such a mapping h can never be injective. There
are keys k ≠ k′ with h(k) = h(k′). This is called a collision. The records for k and k′
would have to be stored at the same address. In this case, special treatment is necessary,
the collision resolution.
When choosing the hash function, one must make sure that such a collision occurs
as rarely as possible. For example, if the last four digits of the personnel number from
the example represent the year of birth and the hash function h(k) = k mod 10 000 is
chosen, then h(k) is just the year of birth. Collisions occur constantly and large areas of
the available array of 10,000 elements are never addressed directly. Prime numbers often
turn out to be a good choice as a modulus, and they are actually used frequently:
regularities in the key space are usually destroyed in the address space. From now on we
assume that the modulus is a prime and denote it by p.
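The effect can be tried out with invented data. In the following Python sketch the 5000 personnel numbers are purely hypothetical: I assume that the last four digits are a year of birth between 1960 and 1999. The prime 9967 is taken from the examples of primes mentioned later in this section.

```python
def h(k, n):
    """The modulo hash function h(k) = k mod n."""
    return k % n


# Hypothetical 10-digit personnel numbers: running number + year of birth.
keys = [(100_000 + i) * 10_000 + 1960 + i % 40 for i in range(5000)]

used_with_10000 = {h(k, 10_000) for k in keys}   # only the birth years survive
used_with_prime = {h(k, 9967) for k in keys}     # prime modulus

print(len(used_with_10000), len(used_with_prime))
```

With the modulus 10 000 only 40 different addresses are ever used, however many employees there are; the prime modulus spreads the same keys over far more addresses. The exact numbers depend, of course, on the invented keys.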
Collision Resolution
The closer the number of positions in the hash table and the number of records are to
each other, the more likely collisions will occur. One approach to solving this is to search
for another free storage location in the table according to a reproducible rule when a
collision occurs. If the address h(k) is already occupied, a probing sequence (sᵢ(k)), i = 1, …, p − 1,
is formed and the addresses s₁(k), s₂(k), …, sₚ₋₁(k) are successively visited until a
free address is found. The numbers s₁(k), s₂(k), …, sₚ₋₁(k) have to go through all hash
addresses, with the exception of h(k) itself. Only then can it be guaranteed that an
existing free space will also be found.
p = 5:
i            | 1      2
i²           | 1      4
±i² mod 5    | 1, 4   4, 1

p = 7:
i            | 1      2      3
i²           | 1      4      9
±i² mod 7    | 1, 6   4, 3   2, 5

p = 11:
i            | 1      2      3      4      5
i²           | 1      4      9      16     25
±i² mod 11   | 1, 10  4, 7   9, 2   5, 6   3, 8

p = 13:
i            | 1      2      3      4      5      6
i²           | 1      4      9      16     25     36
±i² mod 13   | 1, 12  4, 9   9, 4   3, 10  12, 1  10, 3
If we start with h(k) = 0 for simplicity, the last line of the table always gives the probing
sequence. With 7 and 11 it works well, we get a permutation of the possible addresses,
with 5 and 13 it does not work. What is the common property of 7 and 11? They both
leave 3 as the remainder when divided by 4. In fact, for all prime numbers p with
p ≡ 3 mod 4:
\[
\{\, i^2 \bmod p \mid i = 1, \dots, (p-1)/2 \,\} \;\cup\; \{\, -(i^2) \bmod p \mid i = 1, \dots, (p-1)/2 \,\}
= \{1, 2, \dots, p-1\}
\]
There are infinitely many such prime numbers. Try, for example, 9967, 33 487, 99 991. I
have to postpone the proof of this theorem to the next chapter (Theorem 5.22), we have
to learn more about the properties of fields first.
Since we calculate modulo p, the probing sequence always reaches all addresses, even
if one does not start with h(k) = 0, but with any other value.
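The tables and the statement about primes p ≡ 3 mod 4 can be checked experimentally. The following Python sketch uses function names of my own choosing and starts, as in the tables, with h(k) = 0, so that the offsets ±i² mod p are the probing sequence itself.

```python
def probing_offsets(p):
    """The sequence +1, -1, +4, -4, ... (+-i^2 mod p) for i = 1..(p-1)/2."""
    offsets = []
    for i in range(1, (p - 1) // 2 + 1):
        offsets.append(i * i % p)
        offsets.append(-(i * i) % p)   # Python's % yields a value in 0..p-1
    return offsets


def reaches_all_addresses(p):
    """True if the offsets are a permutation of 1, ..., p-1."""
    return sorted(probing_offsets(p)) == list(range(1, p))


for p in (5, 7, 11, 13):
    print(p, p % 4, probing_offsets(p), reaches_all_addresses(p))
```

For 7 and 11 the check succeeds, for 5 and 13 it fails, in agreement with the tables.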
A third common method for resolving collisions is the so-called double hashing. In
this case, a second hash function h′(k) is used for the probing sequence when a collision
occurs, where h′(k) ≠ 0 must always hold. The sequence is then:
Of course, the array can only be filled up to exactly 100%. However, there are also meth-
ods of dynamic hashing in which the size of the hash table can be automatically adjusted
to the demand during runtime.
Comprehension Questions
1. If you want to calculate the binomial coefficient using Theorem 4.4 in a program,
you will quickly run into a problem. What is it? Do you have a solution for it?
2. Why does the Euclidean algorithm always lead to a result?
4.5 Comprehension Questions and Exercises 87
Exercises
12. When working with remainders, you can interchange the operations + and ·
with the modulo operation. The same applies to exponentiation: To calculate
a² mod n, it is easier to calculate [(a mod n)(a mod n)] mod n. (Why actually? Try a
few examples.) Follow a similar procedure to calculate aᵐ mod n. With this knowledge
you can, with the help of the algorithm given in Exercise 8 of Chap. 3 for the
calculation of xⁿ, formulate a recursive algorithm for the calculation of aᵐ mod n.
13. Implement the Euclidean algorithm; once iteratively and once recursively.
5 Algebraic Structures
Abstract
In mathematics and computer science, we often deal with sets on which certain opera-
tions are defined. It happens again and again that such operations have similar proper-
ties on quite different sets, so one can also do similar things with them. In Chap. 2 we
have already encountered something like this: For the operations ∪, ∩ and – on sets, the
same laws of arithmetic apply as for ∨, ∧ and ¬ on propositions. Mathematical strategy
is to find and describe prototypes of such operations and properties that occur again and
again, and then to form theorems about sets with these operations. These theorems are
then valid in every concrete example of such a set (a model).
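[Fig. 5.1: classification of the algebraic structures investigated; only the box labels “Groups” and “Field” survive from the diagram]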
In this and the next chapter we will get to know the most important such sets with
operations, we call them algebraic structures. We formulate axioms for these structures
and at the same time introduce concrete examples of the structures. The structures inves-
tigated can be classified as in the diagram in Fig. 5.1.
Maybe you know the symbols from UML (Unified Modelling Language); they fit
exactly. We read the inheritance arrow as “is a”: A ring is
a group, a field is a ring. Only a few more properties are added to the properties of the
“superclass”. The other arrow denotes aggregation; we can read it as “has a” or “knows a”.
A vector space is a group that additionally knows and uses a field. Just as objects are
concrete instances of a class, there are concrete realizations of these prototypes. Among
the fields we already know from the first chapter are, for example, Q and R.
All structures investigated have in common that one or two operations are defined on
them:
Most of the time we denote the operations with +, ·, ⊕, ⊗ or similar symbols and then
write, for example, n + m or n · m for v(n, m).
The operations ∪, ∩ on sets or ∨, ∧ on propositions are such operations, as well as
multiplication and addition on the real numbers or the addition and multiplication ⊕ and
⊗ between the elements of Z/nZ, which we explained in Definition 4.24:
⊕ : Z/nZ × Z/nZ → Z/nZ        ⊗ : Z/nZ × Z/nZ → Z/nZ
    (a, b) ↦ a ⊕ b                 (a, b) ↦ a ⊗ b
Now we can describe the structures one after the other, we start at the top of the hierar-
chy with groups; these have relatively weak structural properties:
5.1 Groups
Definition 5.2: The group axioms A group (G, ∗) consists of a set G and a binary
operation ∗ on G with the following properties:
The abelian groups are named after the brilliant Norwegian mathematician Niels Henrik
Abel, who lived from 1802 to 1829. He revolutionized essential areas of modern algebra.
Far away from the world centers of mathematics at that time in France and Germany, his
work remained largely unnoticed or simply ignored during his lifetime. The Norwegian gov-
ernment established the Abel Prize in Mathematics in his honor. This was first awarded in
2003. The prize is similarly endowed as the Nobel Prizes and it is quite justified to consider
it as a kind of “Nobel Prize for Mathematics”.
Why is there actually no Nobel Prize for Mathematics? Mathematicians tell the story that
Alfred Nobel was not well disposed towards our profession after the prominent mathematician
Mittag-Leffler had stolen his girlfriend Sonja Kowalewski, also a mathematician, from him.
Even if the historical content of this story is rather meager, I like it better than to assume Nobel
would have underestimated the importance of mathematics for the modern sciences.
From the axioms, further calculation rules for groups can be derived. I would like to
mention two simple rules here that we will need more often later. The first states that certain
equations can be solved; with the second one you can calculate the inverse of a product. I
write down the proofs for this in detail here; later we will omit the brackets again due to
associativity.
Examples of groups
a, b ∈ U ⇒ a ∗ b ∈ U, a⁻¹ ∈ U. (5.1)
Proof: The condition (5.1) states that ∗ is really an operation on U and that (G2) is fulfilled.
Then it follows that a ∗ a⁻¹ = e ∈ U and thus (G1) is valid. Associativity or commutativity
is automatically fulfilled in U, since the elements of U are also elements of G and
therefore behave associatively or commutatively.
Examples
5. (Z, +) is obviously a subgroup of (R, +), (N, +) on the other hand not.
6. (mZ, +) is a subgroup of (Z, +), where mZ := {mz | z ∈ Z} represents all multiples
of m. For if a = mz₁ and b = mz₂, then a + b = m(z₁ + z₂) ∈ mZ and
−a = m(−z₁) ∈ mZ.
7. Groups of bijective mappings: Let M be a set. The set F of bijective mappings
of M → M forms a group with the composition f ◦ g as a binary operation: Here
e = id_M, and f⁻¹ is the inverse map of f. The operation is associative, because for all
x ∈ M the following holds:
Permutation groups
How can one characterize the group elements as simply as possible? First, we write the
elements of the set {1,2,…,n} next to each other and below them the respective images.
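The group S₄, for example, contains among others the elements:
\[
a = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 4 & 2 & 1 \end{pmatrix}, \quad
b = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 1 & 2 & 4 \end{pmatrix}, \quad
c = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 4 & 1 & 2 \end{pmatrix}.
\]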
Let’s look at these three elements and see what happens to 1 when we execute each of
these permutations several times in succession: With a, 1 is mapped to 3, then 3 to 2, 2 to
Now you also know why one can sort any unordered data field using a series of element
permutations.
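You have to pay close attention here. According to the convention when performing
mappings one after the other, we carry them out starting from the right, and it is not
possible to swap the order! Sₙ is the first non-commutative group we get to know. For
example (in our first notation):
\[
(3\;2) = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 3 & 2 & 4 \end{pmatrix}, \qquad
(1\;3) = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 2 & 1 & 4 \end{pmatrix}
\]
and
\[
(1\;3) \circ (3\;2) = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 1 & 2 & 4 \end{pmatrix}, \qquad
(3\;2) \circ (1\;3) = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 1 & 4 \end{pmatrix}.
\]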
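If you want to experiment with such products, a permutation can be stored as the tuple of its images, that is, as the second row of the notation used above. The following Python sketch rests on this representation, which is my own choice for illustration, as are the names `compose`, `t32` and `t13`.

```python
def compose(f, g):
    """Return f o g: first apply g, then f (executed from the right)."""
    # Position i holds the image of i + 1, hence the shifts by 1.
    return tuple(f[g[i] - 1] for i in range(len(g)))


t32 = (1, 3, 2, 4)   # the transposition (3 2): swaps 2 and 3
t13 = (3, 2, 1, 4)   # the transposition (1 3): swaps 1 and 3

print(compose(t13, t32))   # (1 3) o (3 2)
print(compose(t32, t13))   # (3 2) o (1 3)
```

The two results are different tuples, so the order of the factors matters.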
The operations in finite groups can be written down in Cayley tables. If we denote the six
elements of the S3 with a, b, c, d, e, f , we get the table:
◦ | a b c d e f
a | a b c d e f
b | b c a e f d
c | c a b f d e        (5.3)
d | d f e a c b
e | e d f b a c
f | f e d c b a
Wait, something is missing: I still have to tell you that in this table a is the identity
permutation, b = (1 2 3) and d = (2 3). From this you can deduce the other elements and
check the correctness of the table.
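Such a check is tedious by hand, so here is a sketch of how a program could do it. It is written in Python and makes two assumptions of mine: permutations are stored as tuples of images, and the entry in row x and column y of the table stands for x ◦ y, executed from the right. The remaining elements are then read off from the row of b.

```python
from itertools import product


def compose(f, g):
    """Return f o g: first apply g, then f."""
    return tuple(f[g[i] - 1] for i in range(len(g)))


perm = {"a": (1, 2, 3), "b": (2, 3, 1), "d": (1, 3, 2)}   # given in the text
perm["c"] = compose(perm["b"], perm["b"])   # table: b o b = c
perm["e"] = compose(perm["b"], perm["d"])   # table: b o d = e
perm["f"] = compose(perm["b"], perm["e"])   # table: b o e = f

name_of = {p: name for name, p in perm.items()}
table = {
    (x, y): name_of[compose(perm[x], perm[y])]
    for x, y in product("abcdef", repeat=2)
}

for x in "abcdef":
    print(x, "|", " ".join(table[(x, y)] for y in "abcdef"))
```

Under these assumptions the printed rows agree with table (5.3).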
Definition 5.5 The finite group (G, ∗) with n elements is called cyclic, if there is
an element g ∈ G with the property
G = {g, g², g³, . . . , gⁿ}. (5.4)
g is then called generator of the group.
The powers of the element g therefore generate the whole group. Then gⁿ = e, because e
must be a power of g, and if gᵐ = e held already for some m < n, then gᵐ⁺¹ = g, and then
not all n elements of the group would be present on the right side of (5.4).
Take another look at the Cayley table (5.3): This group does not have a generator,
because b³ = a, c³ = a, d² = a, e² = a, f² = a, where here a is the identity element.
However, the sets {b, b², b³}, {c, c², c³}, {d, d²} and so on are cyclic subgroups of S₃.
You can easily check this.
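The argument about the missing generator amounts to computing, for every element, the smallest power that gives the identity. A short Python sketch (permutations again stored as tuples of images; the function names are my own) confirms that no element of S₃ needs six steps.

```python
from itertools import permutations


def compose(f, g):
    """Return f o g: first apply g, then f."""
    return tuple(f[g[i] - 1] for i in range(len(g)))


def order(p):
    """Smallest k >= 1 such that the k-th power of p is the identity."""
    identity = tuple(range(1, len(p) + 1))
    power, k = p, 1
    while power != identity:
        power = compose(p, power)
        k += 1
    return k


orders = sorted(order(p) for p in permutations((1, 2, 3)))
print(orders)
```

A generator of a group with six elements would have to appear in this list with the value 6.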
5.2 Rings
Definition 5.6: The ring axioms A ring (R, ⊕, ⊗) consists of a set R with two
binary operations ⊕, ⊗ on R, for which the following properties hold:
Soon, we will run out of symbols for our operations. For this reason, we will proceed as
you know it from object-oriented programming languages with polymorphism: We will
almost exclusively use two symbols for our operations from now on, namely “+” and “·”.
These are polymorphic, which means depending on which objects are standing to the left
and right of them, they have a different meaning. We will get used to this very quickly.
Groups whose operation is denoted by “+” we call additive groups, groups with the
operation “·” are called multiplicative groups. To complete the analogy to the operations of
numbers we are familiar with, we will from now on call the identity element of an additive
group “0” (the 0-element), the identity element of a multiplicative group “1” (the
1-element). The inverse of a in an additive group we will denote by −a instead of a⁻¹, and instead
of a + (−b) we will write a − b. For a · b we will write ab, and for a · b⁻¹ in commutative
groups we will often write a/b. We also introduce the convention “multiplication before
addition” to save us many brackets.
The operations in a ring we will denote by + and by ·. So a ring is a commutative
additive group with an additional associative and distributive multiplication. A ring
always has a 0, but not always a 1. All following examples are commutative rings. Just as
with groups, there is a simple way to test if a subset of a ring is a ring itself:
Theorem 5.7 If S is a subset of the ring (R, +, · ) then (S, +, · ) is a ring (a subring
of R) if and only if the following conditions are satisfied:
Examples of rings
1. (Z, +, · ) is a ring.
2. (mZ, +, · ) is a subring of Z, because (mZ, +) is a subgroup of Z, and if a = mz1,
b = mz2 are elements of mZ, then ab = m(mz1 z2 ) = mz3 ∈ mZ.
3. (Q, +, · ), (R, +, · ) are rings.
4. Let R be any ring and (F , ⊕, ⊗) the set of all mappings from R to R with the opera-
tions f ⊕ g and f ⊗ g, which are defined for all x ∈ R by:
(f ⊕ g)(x) := f (x) + g(x)
(f ⊗ g)(x) := f (x) · g(x)
(F , ⊕, ⊗) is a ring. ⊕ and ⊗ are called pointwise addition and pointwise multipli-
cation of mappings, respectively. Here I use the symbols ⊕, ⊗ again for a moment
to avoid confusion: Two maps are added (new addition) by adding the function val-
ues at each point (old addition). The 0-element is the 0-map, that is, the map that
maps each x ∈ R to 0. The axioms can be checked by reducing them to the rules of
the underlying ring. The calculation for the distributive law, for example, looks like
this:
Polynomial Rings
A particularly important type of rings are the polynomial rings, the sets of all polynomials
with coefficients from a ring R. I formulate the following definition in somewhat greater
generality. If it still causes you problems in this form, think of R as the real numbers R at
first, and you have the polynomials in front of you, which you have known from school
for a long time.
Definition 5.9 Let R be a commutative ring. The set of all polynomials with
coefficients from R is denoted by R[x]. With the operations p + q and p · q,
which are defined for t ∈ R by
(p + q)(t) := p(t) + q(t)
(p · q)(t) := p(t) · q(t),        (5.5)
R[x] is a ring, the polynomial ring over R.
This ring is a subring of the ring of all mappings from R to R (Example 4 from before).
The operations are defined pointwise, that is, for each element t ∈ R individually. Now
I have again denoted the polynomial operations with +, · instead of with ⊕, ⊗. In (5.5)
the “+” and the “·” on the left and right have different meanings! The zero polynomial of
R[x] is the zero map, that is, the polynomial for which all coefficients are equal to 0.
In order to see that R[x] is really a ring, and to make the definition meaningful at all,
we have to apply the subring criterion from Theorem 5.7. Are the sum and product
of two polynomial functions really again a polynomial function? We will look
at this with an example. Let p, q ∈ R[x], p = x³ + 3x² + 1, q = x² + 2. Then for
all t ∈ R: (p + q)(t) = (t³ + 3t² + 1) + (t² + 2) = t³ + 4t² + 3, and so we obtain
as sum p + q = x³ + 4x² + 3. Without going into detail (there is nothing behind it
but paperwork), we have for the two polynomials p(x) = aₙxⁿ + · · · + a₁x + a₀ and
q(x) = bₙxⁿ + · · · + b₁x + b₀:
I have written both polynomials up to the coefficient n here. If p and q have different
degrees, one can simply add a few zeros for the polynomial of smaller degree and then
have them in this form. The formulas look complicated, but they mean nothing more
than that you can add and multiply two polynomials as if x were some number (or a ring
element).
If you ever want to implement a polynomial class, you have to use these definitions. The
statement “multiply as if x were a number” will not be of any use to you then.
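To give you an idea of what such an implementation might look like, here is a minimal sketch in Python. It is not a full polynomial class: as an assumption of mine, a polynomial is simply a list of coefficients with the constant term first, the coefficients of the sum are aₖ + bₖ, and the coefficient of xᵏ in the product is the sum of all aᵢbⱼ with i + j = k. The function names are hypothetical.

```python
def poly_add(p, q):
    """Coefficient-wise sum; the shorter list is padded with zeros."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


def poly_mul(p, q):
    """Product: the coefficient of x^k collects all a_i * b_j with i + j = k."""
    result = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            result[i + j] += a * b
    return result


p = [1, 0, 3, 1]   # 1 + 3x^2 + x^3
q = [2, 0, 1]      # 2 + x^2
print(poly_add(p, q))
print(poly_mul(p, q))
```

For the two polynomials of the example, the sum comes out as x³ + 4x² + 3 and the product as x⁵ + 3x⁴ + 2x³ + 7x² + 2.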
So now we know the sum and product of polynomial functions are again polynomial
functions. To finish the subring test, it still has to be shown that the polynomials form a
subgroup with addition. I leave that to you. Apply Theorem 5.4 and remember that the
element a⁻¹ mentioned there is now called −a.
5.3 Fields
Definition 5.10: The field axioms A field (K, +, · ) consists of a set K and two
binary operations +, · on K with the following properties:
a) a · 0 = 0 for all a ∈ K.
b) If a, b ∈ K and a, b ≠ 0, then a · b ≠ 0 also holds.
From a) it follows that one may not divide by 0 in fields: a/0 is defined as a · 0⁻¹, so there
would have to be an inverse 0⁻¹ of 0. Then 0⁻¹ · 0 = 1, in contradiction to a). And b)
states that multiplication is an operation on K \ {0}. Together with the axioms it follows that
(K \ {0}, · ) is a group.
Examples of fields that you know are Q and R. The integers Z do not form a field,
because apart from 1 and −1 no element has a multiplicative inverse in Z. We will now get
to know more fields.
The integers Z arose from N by adding the negative numbers, because one wanted to
carry out calculations like 5 − 7. In Z the problem 3/4 is not solvable. If one adds all
fractions to Z, one obtains the rational numbers Q. We have seen that, for example, √2 is not
an element of Q. The real numbers arise from Q by filling in the last gaps on the number
line. In Chap. 12 we will deal with the real numbers in more detail. Unfortunately, there
are still problems in R: For example, there is no real number r with r² = −1, because
the square of a real number is never negative.
Now one can extend R again to a field in which √−1 exists. Since there is no
more space for this on the number line, one has to go into the second dimension:
C := R² = {(a, b) | a, b ∈ R}. The real numbers should be the x-axis in it, that is
R = {(a, 0) | a ∈ R}. Of course, this is not a real equality sign, we identify a with
(a, 0), we pretend it is the same. This also makes it clear how the operations in the set
{(a, 0) | a ∈ R} look: It is (a, 0) + (b, 0) = (a + b, 0) and (a, 0) · (b, 0) = (a · b, 0).
In this set C we now need an addition and multiplication so that the set becomes a
field. Restricted to R (more precisely: restricted to {(a, 0) | a ∈ R}), they should of course
give the operations already present there. And of course √−1 should be there, that is,
there must be (a, b) ∈ C with (a, b) · (a, b) = (−1, 0).
I write down a definition for these operations:
(a, 0) + (c, 0) = (a + c, 0)
(a, 0) · (c, 0) = (a · c, 0)
(1, 0) · (c, d) = (c, d)
(0, 1) · (0, 1) = (−1, 0).
The first two lines show that on R really nothing new happens through these operations.
In the third line we see that (1, 0) (the 1 from R) also represents an identity element for
complex numbers, and from the last line we actually get our root of −1: it is (0, 1).
We will not calculate the properties in detail; everything is elementary to carry out. The
only exciting property is the existence of the multiplicative inverse. I just give the inverse
and show it is correct. So let (a, b) ≠ (0, 0); then
\[
(a, b) \cdot \Bigl(\frac{a}{a^2+b^2}, \frac{-b}{a^2+b^2}\Bigr)
= \Bigl(\frac{a^2}{a^2+b^2} - \frac{b(-b)}{a^2+b^2},\; \frac{ba}{a^2+b^2} + \frac{a(-b)}{a^2+b^2}\Bigr)
= (1, 0)
\]
and thus (a/(a² + b²), −b/(a² + b²)) is the multiplicative inverse of (a, b). Note that
a² + b² is always different from 0!
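The pair arithmetic is easy to try out. The following Python sketch is my own illustration: it assumes the componentwise addition and the product (a, b) · (c, d) = (ac − bd, bc + ad), which is also the form used in the calculation above, and it uses exact fractions from the standard library so that no rounding disturbs the result. The function names are hypothetical.

```python
from fractions import Fraction


def c_add(z, w):
    """Componentwise addition of pairs."""
    return (z[0] + w[0], z[1] + w[1])


def c_mul(z, w):
    """(a, b) * (c, d) = (ac - bd, bc + ad)."""
    a, b = z
    c, d = w
    return (a * c - b * d, b * c + a * d)


def c_inv(z):
    """Multiplicative inverse (a/(a^2+b^2), -b/(a^2+b^2)) for z != (0, 0)."""
    a, b = z
    norm = Fraction(a * a + b * b)
    if norm == 0:
        raise ZeroDivisionError("(0, 0) has no inverse")
    return (a / norm, -b / norm)


print(c_mul((0, 1), (0, 1)))          # the root of -1, squared
print(c_mul((3, 4), c_inv((3, 4))))   # an element times its inverse
```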
Now we introduce an abbreviated notation for the elements in C: For (a, 0) we simply
write a again, for (0, 1) we write i . This i is a new symbol, nothing more than an abbre-
viation. Because of (0, b) = (b, 0)(0, 1) = b · i and because of (a, b) = (a, 0) + (0, b) we
can now also write a + bi for (a, b); so any complex number (a, b) can be represented in
the form a + bi.
Let’s form the product of the two numbers a + bi and c + id, without using the Defi-
nition 5.12, but only taking into account the laws of arithmetic that must apply in a field,
then we get:
(a + bi)(c + id) = (ac + i(bc + ad) + i² bd) = (ac − bd) + (bc + ad)i.
Compare this with (5.7); see the match? In a field in which there is a root of −1 (which
we denote by i), multiplication cannot look any different from what is written in (5.7)!
For many, complex numbers have something mysterious about them; this may be because i
stands for the “imaginary unit”, for something that is not real, in contrast to the real num-
bers, which somehow seem “tangible” to us (and “complex” also sounds so difficult!). How-
ever, complex numbers are just as real or unreal as real numbers, we are just more used to R.
Isn’t it actually very suspicious that real numbers can never be represented in a calculator?
We can only ever work with finite decimal numbers, but real numbers are almost all infi-
nitely long.
It has been shown in mathematics and in the applications of mathematics that it is
incredibly practical to work with R and the rules that apply to it. Similarly, it has been
shown that many very real problems can only be reasonably solved using complex num-
bers. An electrical engineer would despair today without complex numbers! You will get to
know some of the useful and beautiful properties and applications of complex numbers in
this book.
Although this theorem is called the “fundamental theorem of algebra”, it is actually a theo-
rem which is proved using methods of complex analysis. As simple as the theorem is to
formulate, the proof is just as difficult. It takes a few semesters of mathematics studies to be
able to follow it, and we just have to accept it as true here.
To conclude this introduction to complex numbers, I would like to show you two impor-
tant mappings of complex numbers, which we will use again and again later.
Both mappings have an intuitive meaning: |z| is, according to the Pythagorean theorem,
just the length of the line segment from the point (0, 0) to the point (x, y); $\overline{z}$ is obtained by
reflecting z across the x-axis.
$$\overline{z_1 + z_2} = \overline{z_1} + \overline{z_2},$$
$$\overline{z_1 \cdot z_2} = \overline{z_1} \cdot \overline{z_2},$$
$$z_1 \cdot \overline{z_1} = |z_1|^2.$$
Only the first part of (5.8) is a little tricky, the other properties are easy to calculate.
Equation (5.8) is, by the way, also valid for the absolute value in the real numbers, so we
don’t have to learn anything new.
Now let’s look at the rings Z/nZ. The field axioms (K1) and (K2) are fulfilled. The ques-
tion is whether each element which is different from 0 has a multiplicative inverse. For
Z/4Z and Z/5Z we write down the multiplication tables:
n = 4 | 0 1 2 3        n = 5 | 0 1 2 3 4
------+--------        ------+----------
    0 | 0 0 0 0            0 | 0 0 0 0 0
    1 | 0 1 2 3            1 | 0 1 2 3 4
    2 | 0 2 0 2            2 | 0 2 4 1 3
    3 | 0 3 2 1            3 | 0 3 1 4 2
                           4 | 0 4 3 2 1
In the case of n = 4, the element 2 has no inverse, so Z/4Z cannot be a field. In
the case of n = 5, we see that in each row (except the first) the 1 occurs, and that means
nothing else than that each nonzero element has an inverse: $1^{-1} = 1$, $2^{-1} = 3$, $3^{-1} = 2$, $4^{-1} = 4$.
Z/5Z is therefore a field. If you look closely at the table for n = 5, you will notice: in each
row and in each column (except the first) each element occurs exactly once. This is not a
coincidence, it has to be like this, because according to Theorem 5.3 the equations ax = b
for a, b ≠ 0 are uniquely solvable.
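The two tables can be reproduced and searched mechanically. This is a small sketch with helper names (`mul_table`, `inverses`) invented for illustration; it finds inverses by brute force, which is fine for small n.

```python
def mul_table(n):
    """Multiplication table of Z/nZ as a list of rows."""
    return [[(a * b) % n for b in range(n)] for a in range(n)]

def inverses(n):
    """Map every invertible a in Z/nZ to its inverse (brute-force search)."""
    table = mul_table(n)
    # a is invertible exactly if the value 1 occurs in row a
    return {a: row.index(1) for a, row in enumerate(table) if 1 in row}

for n in (4, 5):
    print(n, inverses(n))
```

For n = 4 only 1 and 3 appear in the result, for n = 5 every nonzero element does.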
5 is a prime number and actually Z/nZ is a field if and only if n is a prime number.
We can prove this surprising statement with our elementary number theoretical knowl-
edge. The following theorem serves as a preparation:
Let gcd(a, n) = 1. By the extended Euclidean algorithm (Theorem 4.15) there are
α, β ∈ Z with αa + βn = 1. Then b := α mod n is the inverse of a, because according
to Theorem 4.25 we have
$$a \otimes b = (a \bmod n) \otimes (\alpha \bmod n) = (a \cdot \alpha) \bmod n = (1 - \beta n) \bmod n = 1,$$
thus ab = 1 in Z/nZ.
Conversely, if ab = 1 in Z/nZ, then 1 = ab + γn for an integer γ. A common divisor
of a and n would then also have to be a divisor of 1, thus gcd(a, n) = 1.
The Euclidean algorithm therefore also provides a method for determining the inverse
elements, so to speak a division algorithm for the invertible elements of Z/n Z. We can
derive another important property of Z/nZ from the theorem just proved:
Theorem and Definition 5.19 Z/pZ is a field if and only if p is a prime number.
Z/pZ is denoted by GF(p) .
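The division algorithm for the invertible elements of Z/nZ mentioned above can be sketched as follows. This is one common iterative form of the extended Euclidean algorithm, not necessarily the formulation of Theorem 4.15; the names `extended_gcd` and `mod_inverse` are chosen for illustration.

```python
def extended_gcd(a, n):
    """Return (g, alpha, beta) with alpha*a + beta*n == g == gcd(a, n)."""
    old_r, r = a, n
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t

def mod_inverse(a, n):
    """Inverse of a in Z/nZ; exists only if gcd(a, n) == 1."""
    g, alpha, _ = extended_gcd(a, n)
    if g != 1:
        raise ValueError(f"{a} is not invertible modulo {n}")
    return alpha % n          # b := alpha mod n

print(mod_inverse(2, 5))      # 3
print(mod_inverse(1234, 9967) * 1234 % 9967)   # 1
```

In a field GF(p) the call succeeds for every a that is not a multiple of p, while for example `mod_inverse(2, 4)` raises an error.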
The finite fields GF(p) are called Galois fields. The name was chosen in honor of the French
mathematician Évariste Galois (1811–1832), who achieved groundbreaking results in the study
of these fields. He was a mathematical genius and a political hothead who, because of his
youth and his ideas, which were far ahead of his time, was not recognized by the
mathematical establishment. In the end, he died, just 20 years old, in a duel. His written legacy,
which he only sketched in the night before his death, still occupies mathematicians
today.
You can now also set up a multiplication table for p = 9967 (and for any other prime
number) and you will notice that in each row and column (except the first) each remain-
der occurs exactly once. Here is a first application example:
Example
Books are identified by the ISBN, the International Standard Book Number. The
ISBN10 consists of a code of 10 digits, with the last digit being a check digit. We
ignore the hyphens between the individual digit blocks. The check digit is calculated
as follows: If $a_i$ is the i-th digit of the ISBN, i = 1, . . . , 9, then the check digit p is
equal to
$$p = (1a_1 + 2a_2 + 3a_3 + 4a_4 + 5a_5 + 6a_6 + 7a_7 + 8a_8 + 9a_9) \bmod 11.$$
The first edition of this book had the ISBN 3-528-03181-6; here
$$p = (1 \cdot 3 + 2 \cdot 5 + 3 \cdot 2 + 4 \cdot 8 + 5 \cdot 0 + 6 \cdot 3 + 7 \cdot 1 + 8 \cdot 8 + 9 \cdot 1) \bmod 11 = 6.$$
If the remainder modulo 11 is 10, the letter X is used as the check digit. Wouldn’t
it be easier to calculate modulo 10? The use of the prime number 11 as a modulus
has the consequence that in the ISBN every digit permutation and every single wrong
digit can be detected. Let us consider the swap of two different digits $a_i$ and $a_j$,
$i \neq j$. We calculate this in the field GF(11).
If we received the same check digit after this swap, the following would have to
hold in the field:
$$(\dots + i a_i + \dots + j a_j + \dots) - (\dots + i a_j + \dots + j a_i + \dots) = (i a_i + j a_j) - (i a_j + j a_i) = 0,$$
and thus
$$0 = i a_i + j a_j - i a_j - j a_i = (i - j)(a_i - a_j).$$
Here, $i - j$ and $a_i - a_j$ are not equal to 0 in the field GF(11). But then this product cannot
be 0 according to Theorem 5.11b)!
You can conclude in a similar way that every single digit error can be detected.
With modulo 10 not all errors and permutations could be detected. Even a smaller
prime number is not enough. Look for number examples! ◄
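The check digit rule of the example can be written down in a few lines. This is a sketch under the assumption that the input contains the first nine digits, with or without hyphens; the function name `isbn10_check_digit` is made up for this illustration.

```python
def isbn10_check_digit(first_nine):
    """Check digit of an ISBN10 from its first nine digits (hyphens are ignored)."""
    digits = [int(ch) for ch in first_nine if ch.isdigit()]
    if len(digits) != 9:
        raise ValueError("expected exactly nine digits")
    # p = (1*a1 + 2*a2 + ... + 9*a9) mod 11
    p = sum(i * a for i, a in enumerate(digits, start=1)) % 11
    return "X" if p == 10 else str(p)

print(isbn10_check_digit("3-528-03181"))   # 6
print(isbn10_check_digit("5-328-03181"))   # 4: swapping the first two digits is noticed
```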
Since 2006, the 13-digit ISBN13 has been used more and more. It corresponds in structure to
the global trade item number, which is printed as a barcode on many commercial products.
The check digit $a_{13}$ of this article number is chosen such that
$$\big((a_1 + a_3 + \dots + a_{13}) + 3(a_2 + a_4 + \dots + a_{12})\big) \bmod 10 = 0.$$
There is no finite field involved here anymore, and the error detection is not as good
as with ISBN10: Not all swaps of two digits can be detected.
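A short sketch shows such an undetected swap. The 13-digit strings below are sample inputs chosen for this illustration, and the function name `gtin13_is_valid` is hypothetical. Swapping two neighbouring digits changes the weighted sum by 2·(difference of the digits), so neighbours that differ by 5 slip through.

```python
def gtin13_is_valid(code):
    """Check the weighted digit sum (weights 1, 3, 1, 3, ...) modulo 10."""
    digits = [int(ch) for ch in code if ch.isdigit()]
    if len(digits) != 13:
        raise ValueError("expected exactly 13 digits")
    # index k = 0 corresponds to a1 (weight 1), k = 1 to a2 (weight 3), ...
    total = sum(d * (1 if k % 2 == 0 else 3) for k, d in enumerate(digits))
    return total % 10 == 0

print(gtin13_is_valid("9783161484100"))   # True
print(gtin13_is_valid("9783611484100"))   # True: digits 1 and 6 swapped, not noticed
print(gtin13_is_valid("7983161484100"))   # False: this swap is noticed
```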
The IBAN, the International Bank Account Number, has a two-digit check number
modulo 97. The number 97 is the largest prime less than 100 and therefore particularly
suitable for such a check number.
Now an important theorem of number theory, which is due to the famous mathemati-
cian Fermat. It can be easily proved using the field properties of GF(p). We use it imme-
diately to justify the quadratic probing in hashing (see Sect. 4.4). Later we will need it in
the derivation of the RSA encryption algorithm, which we examine in Sect. 5.6.
Theorem 5.20: Fermat’s little theorem If p is a prime number, then in GF(p) for
each a ≠ 0:
$$a^{p-1} = 1.$$
Most of the time, the theorem is formulated without the explicit use of the structure
GF(p) as follows:
Theorem 5.20a: Fermat’s little theorem If p is a prime number, then for each
a ∈ Z which is not a multiple of p:
$$a^{p-1} \equiv 1 \bmod p.$$
Numerical examples
p = 5, a = 2: $2^4 = 16 = 3 \cdot 5 + 1$
p = 5, a = 7: $7^4 = 2401 = 480 \cdot 5 + 1$
p = 3, a = 8: $8^2 = 64 = 21 \cdot 3 + 1$
p = 3, a = 9: $9^2 = 81 = 27 \cdot 3 + 0$ (a is a multiple of p!). ◄
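The numerical examples can be checked with Python's built-in three-argument `pow`, which computes powers modulo a number. The helper `fermat_holds` is a name invented for this sketch.

```python
def fermat_holds(n):
    """True if a^(n-1) = 1 in Z/nZ for every a in 1, ..., n-1."""
    return all(pow(a, n - 1, n) == 1 for a in range(1, n))

print(pow(2, 4, 5), pow(7, 4, 5))   # 1 1
print(pow(8, 2, 3), pow(9, 2, 3))   # 1 0  (9 is a multiple of 3)
print(fermat_holds(9967))           # True
print(fermat_holds(9))              # False: 9 is not a prime number
```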
To prove Fermat’s little theorem, we now calculate in GF(p) and show that for a ≠ 0
always $a^{p-1} = 1$ holds: First, we note that the numbers 1a, 2a, 3a, . . . , (p − 1)a
in GF(p) are all different and different from 0. If, for example, ma = na, then
(m − n)a = 0 would have to hold, but according to Theorem 5.11b) this is only possible
if m = n. So the sets {1a, 2a, 3a, . . . , (p − 1)a} and {1, 2, 3, . . . , (p − 1)} contain the
same elements, and therefore the products of all elements of these two sets are also equal. So
1a · 2a · 3a · . . . · (p − 1)a = 1 · 2 · 3 · . . . · (p − 1). In other words,
$$a^{p-1} \cdot \big(1 \cdot 2 \cdot 3 \cdots (p-1)\big) = 1 \cdot 2 \cdot 3 \cdots (p-1),$$
and since the product 1 · 2 · 3 · . . . · (p − 1) is different from 0 in the field GF(p), we can
cancel it and obtain $a^{p-1} = 1$.
Unfortunately, I no longer remember where I saw this beautiful proof. I think it also
deserved an entry in the “Proofs from the BOOK” that I mentioned after Theorem 2.11.
Fermat, who knew nothing about Galois fields, had a lot more work to do with it. Fermat
(1607–1665) was one of the first great number theorists, he left many important theorems.
With one of his problems, he came back into the spotlight at the end of the 20th century:
Every construction worker knows that the equation $x^2 + y^2 = z^2$ has integer solutions (for
example, $3^2 + 4^2 = 5^2$). With a string that is marked in the ratio 3 : 4 : 5, you can precisely
determine a right angle. Fermat examined the question of whether the equations $x^n + y^n = z^n$
also have integer solutions for n > 2. Despite many attempts, it had not been possible to find
such solutions until then. Around 1637, Fermat wrote in the margin of a book something
like: “I possess a truly wonderful proof that $x^n + y^n = z^n$ for n > 2 is not solvable in Z. The
margin is just too small to write it down.”
Fermat had the habit of never writing down proofs, but all his theorems turned out to be
correct. However, for 350 years, the efforts of the world’s smartest mathematicians were
fruitless. They were unable to prove this claim; it was a very hard nut to crack. Only in
1993 did Andrew Wiles present a proof of “Fermat’s last theorem”, which he completed in 1994.
As simple as the problem is to formulate, so difficult is the mathematics that is contained in
the proof. Did Fermat really possess a wonderful proof?
If you want to learn more about the nature of mathematics, I can recommend the book
“Fermat’s last theorem” by Simon Singh. It is also very exciting to read for non-mathemati-
cians.
As the first application of the theorem, we can now prove the assertion that, with
quadratic probing, every address is reached if the prime number p satisfies p ≡ 3 mod 4. See
Sect. 4.4:
5.4 Polynomial Division 107
If a ≠ 0 is a square, that is $a = i^2$, then ±i are the only roots of a: Assume $i^2 = j^2$. Then
$0 = i^2 - j^2 = (i + j)(i - j)$, that is i = j or i = −j.
Since p is an odd prime number, (p − 1)/2 is a natural number. Further, according to
Fermat’s little theorem, for all a ≠ 0:
$$a^{p-1} = \left(a^{\frac{p-1}{2}}\right)^2 = 1,$$
and therefore $a^{\frac{p-1}{2}} = \pm 1$. Since p ≡ 3 mod 4, it follows that $\frac{p+1}{4} \in \mathbb{N}$ and for $i = a^{\frac{p+1}{4}}$,
$$i^2 = \left(a^{\frac{p+1}{4}}\right)^2 = a^{\frac{p+1}{2}} = a^{\frac{p-1}{2} + 1} = a^{\frac{p-1}{2}} \cdot a = \pm a.$$
Therefore, $a = i^2$ or $a = -(i^2)$. If i is greater than (p − 1)/2, we replace i with
−i = p − i and the statement remains true. So every element different from 0 is a square
or the negative of a square of a number between 1 and (p − 1)/2.
In the course of the proof, we also found a simple algorithm for extracting roots in the
fields GF(p) with p ≡ 3 mod 4. For all squares a,
$$\sqrt{a} = \pm a^{\frac{p+1}{4}}.$$
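The root formula translates directly into code. The following is a sketch with a hypothetical function name `sqrt_mod`; it returns `None` when a is not a square, which by the argument above happens exactly when −a is one.

```python
def sqrt_mod(a, p):
    """Square roots of a in GF(p) for a prime p with p % 4 == 3, or None."""
    if p % 4 != 3:
        raise ValueError("this formula requires p = 3 mod 4")
    a %= p
    if a == 0:
        return (0,)
    i = pow(a, (p + 1) // 4, p)       # candidate root a^((p+1)/4)
    if (i * i) % p != a:
        return None                   # i*i equals -a, so a itself is no square
    return tuple(sorted((i, p - i)))  # the two roots +i and -i

print(sqrt_mod(5, 11))   # (4, 7), since 4*4 = 16 = 5 and 7*7 = 49 = 5 in GF(11)
print(sqrt_mod(2, 11))   # None: 2 is not a square in GF(11)
```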
In the ring of integers, we were able to carry out division with remainder, which we used
to determine the greatest common divisor of two numbers using the Euclidean algorithm.
In polynomial rings over a field K , there is also such a division algorithm with important
applications:
Theorem and Definition 5.22: Polynomial division Let K be a field, K[X] the
polynomial ring over K . Then division with remainder can be carried out in K[X],
that is:
For f, g ∈ K[X] with g ≠ 0 there are q, r ∈ K[X] with f = g · q + r and
deg r < deg g.
The remainder r(x) is then denoted by f (x) mod g(x): r(x) = f (x) mod g(x). If
r(x) = 0, we say g(x) is a divisor of f (x) or f (x) is divisible by g(x).
Theorem 5.23 If f ∈ K[X] has the root x0, (that is, f (x0 ) = 0), then f is divisible
by (x − x0 ) without remainder. “The root can be split out.”
Proof: For $g = x - x_0$, there are according to Theorem 5.22 polynomials q(x), r(x) with
$f(x) = (x - x_0)q(x) + r(x)$ and deg r < 1 = deg g. So $r(x) = a_0 x^0 = a_0$ is constant. If
one puts in $x_0$, then $0 = f(x_0) = 0 \cdot q(x_0) + a_0$ and thus $a_0 = 0$.
Proof: Assume f has more than n roots. By splitting out the first n roots, one obtains
f (x) = (x − x1 )(x − x2 ) · · · (x − xn )q(x) and q(x) must be constant, otherwise it would
be deg f > n. If now f (xn+1 ) = 0, then one of the factors on the right must be 0 (Theo-
rem 5.11, we are in a field!), so xn+1 already occurs among the x1 , . . . , xn.
A first important consequence of Theorem 5.14, the fundamental theorem of algebra, is:
f (x) = an (x − x0 )(x − x1 ) · · · (x − xn ).
Because the roots can be successively split out. Each quotient of degree greater than 0
has another root according to the fundamental theorem.
Now to execute the polynomial division, we work out the algorithm. First, we remember
how we learned to divide natural numbers in school:
remainder:
This means:
$$(2x^5 + 5x^3 + x^2 + 3x + 1) = (2x^2 + 1)\left(x^3 + 2x + \tfrac{1}{2}\right) + \left(x + \tfrac{1}{2}\right).$$
With this example, you can also see that K really has to be a field; in Z[X] this
polynomial division does not work, because $\tfrac{1}{2} \notin \mathbb{Z}$!
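The division above can be reproduced with exact rational arithmetic. This is a minimal sketch of schoolbook polynomial division in Q[X]; the function name `poly_divmod` and the convention of coefficient lists with the highest power first are assumptions made for this illustration.

```python
from fractions import Fraction

def poly_divmod(f, g):
    """Divide f by g in Q[X]; coefficient lists, highest power first. Returns (q, r)."""
    r = [Fraction(c) for c in f]
    g = [Fraction(c) for c in g]
    if not g or g[0] == 0:
        raise ValueError("leading coefficient of g must be nonzero")
    q = []
    while len(r) >= len(g):
        factor = r[0] / g[0]          # this division needs a field
        q.append(factor)
        for k in range(len(g)):
            r[k] -= factor * g[k]
        r.pop(0)                      # the leading coefficient is now 0
    return q, r

# (2x^5 + 5x^3 + x^2 + 3x + 1) : (2x^2 + 1)
q, r = poly_divmod([2, 0, 5, 1, 3, 1], [2, 0, 1])
print(q)   # coefficients of x^3 + 2x + 1/2
print(r)   # coefficients of x + 1/2
```

The step `r[0] / g[0]` is where the field property is used: with integer coefficients only, the factor 1/2 would not be available.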
Why does this work? The calculation shows that
 + | 0 1 2        · | 0 1 2
---+------       ---+------
 0 | 0 1 2        0 | 0 0 0
 1 | 1 2 0        1 | 0 1 2
 2 | 2 0 1        2 | 0 2 1
(5.9)
③
$0 \cdot x^3 - 1 \cdot x^3 = 0 \cdot x^3 + 2x^3 = 2x^3$ (because −1 = 2!).
$x - 2x = x + x = 2x$.
Horner’s Method
$$f(x) = 8x^7 + 3x^6 + 2x^5 - 5x^4 + 4x^3 - 3x^2 + 2x - 7.$$
If we want to evaluate this polynomial at a point x = b and just start calculating, we need
7 + 6 + 5 + 4 + 3 + 2 + 1 = 28 multiplications and 7 additions.
Horner’s method reduces this effort considerably. We calculate one after the other:
$$\begin{aligned}
c_0 &:= 8\\
c_1 &:= c_0 \cdot b + 3 = 8b + 3\\
c_2 &:= c_1 \cdot b + 2 = 8b^2 + 3b + 2\\
c_3 &:= c_2 \cdot b - 5 = 8b^3 + 3b^2 + 2b - 5\\
&\;\;\vdots\\
c_7 &:= c_6 \cdot b - 7 = 8b^7 + 3b^6 + 2b^5 - 5b^4 + 4b^3 - 3b^2 + 2b - 7 = f(b)
\end{aligned} \tag{5.10}$$
c7 is therefore the desired function value. For its calculation we need 7 multiplications
and 7 additions. This method can be used not only for the calculation of function values,
but also for factoring:
The example (5.10) shows how the statement f (b) = cn comes about. Of course, one
proves this in the general case by mathematical induction. To see the correctness of the
second statement, we only have to test it. For this we use that f (b) = cn = 0:
= f (x)
The following scheme can be used to calculate the coefficients ci simply: First, we write
the coefficients of the polynomial in the first line of a three-line table. Starting from the
left, we fill in the other two lines column by column: the second line contains 0 in the
first column, in the column i for i > 1 the intermediate result ci−1 · b, so that in the third
line the sum of the first two lines is the element ci:
Let’s try it with the example (5.9): The polynomial $2x^4 + 2x^2 + x + 1 \in GF(3)[X]$ has the
root 1; it is therefore divisible by (x + (−1)) = x + 2. From Horner’s method we get the
coefficients $c_i$:
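The scheme can be sketched in a few lines. The function name `horner` and the optional parameter `p` for calculating in GF(p) are assumptions of this illustration; the interpretation of c_0, ..., c_{n−1} as the coefficients of the quotient f(x) : (x − b) is the usual form of the factoring statement and is assumed here.

```python
def horner(coeffs, b, p=None):
    """Return [c_0, ..., c_n] of Horner's scheme; coeffs with highest power first.

    If p is given, all calculations are carried out in GF(p).
    """
    cs = []
    c = 0
    for a in coeffs:
        c = c * b + a          # one multiplication and one addition per step
        if p is not None:
            c %= p
        cs.append(c)
    return cs

# f(x) = 8x^7 + 3x^6 + 2x^5 - 5x^4 + 4x^3 - 3x^2 + 2x - 7 at b = 2
print(horner([8, 3, 2, -5, 4, -3, 2, -7], 2)[-1])   # 1217

# 2x^4 + 2x^2 + x + 1 in GF(3)[X] at b = 1
print(horner([2, 0, 2, 1, 1], 1, p=3))              # [2, 2, 1, 2, 0]
```

In the second call the last entry is 0, so 1 is a root, and the remaining entries 2, 2, 1, 2 are the coefficients of the quotient 2x³ + 2x² + x + 2.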
The fact that division with remainder is possible in the polynomial ring K[X] allows us
to carry out similar calculations in K[X] as we did in Chap. 4 with integers. So you can
form residue classes modulo a polynomial and carry out the Euclidean algorithm. Cal-
culating with remainders is analogous to Z and so we will discover new interesting rings
and fields.
First, I would like to list a few definitions and theorems that are almost literal transla-
tions of our results with the integers:
If f(x) is divisible by g(x) and k ≠ 0 is a field element, then f(x) is also divisible by k · g(x).
But except for this multiplicative factor, an irreducible polynomial f(x) only has itself
and the polynomial $1 \cdot x^0 = 1$ as divisors. Do you see the analogy to the prime numbers
in Z?
Definition 5.28 Let f (x), g(x) ∈ K[X]. The polynomial d(x) ∈ K[X] is called
greatest common divisor of f and g, if d is a common divisor of f and g and if d
has the maximum degree with this property. We write d(x) = gcd(f (x), g(x)).
Compare this to the Definition 4.12 in Sect. 4.2. Please note: In contrast to the greatest
common divisor in Z, the greatest common divisor of two polynomials is not uniquely
determined, but only up to a multiplicative factor from K : If d(x) is a greatest common
divisor and k ∈ K , then k · d(x) is also a greatest common divisor. Often, in the polyno-
mial, the highest coefficient is normalized to 1 and then called the greatest common divi-
sor.
Theorem 5.29: The Euclidean algorithm for K[X] Let f(x), g(x) ∈ K[X],
f(x), g(x) ≠ 0. Then a gcd(f(x), g(x)) can be determined recursively by
continued division with remainder according to the following rule:
$$\gcd(f(x), g(x)) = \begin{cases} g(x) & \text{if } f(x) \bmod g(x) = 0,\\ \gcd(g(x), f(x) \bmod g(x)) & \text{else.} \end{cases}$$
This is a literal translation of Theorem 4.14, and the proof proceeds analogously: at each
step, the degree of the remainder becomes at least 1 smaller, so at some point the remain-
der must be 0. The last non-zero remainder is a greatest common divisor.
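The rule of Theorem 5.29 can be sketched for K = GF(p) as follows. The helper names (`trim`, `poly_mod`, `poly_gcd`) are invented here, polynomials are coefficient lists with the highest power first, and the recursion is written as a loop. The result is normalized to leading coefficient 1. `pow(c, -1, p)` computes the inverse in GF(p) and needs Python 3.8 or later.

```python
def trim(f):
    """Drop leading zero coefficients (highest power first)."""
    k = 0
    while k < len(f) and f[k] == 0:
        k += 1
    return f[k:]

def poly_mod(f, g, p):
    """Remainder of f divided by g in GF(p)[X], p prime, g nonzero."""
    r = trim([c % p for c in f])
    g = trim([c % p for c in g])
    inv_lead = pow(g[0], -1, p)       # inverse of the leading coefficient
    while len(r) >= len(g):
        factor = (r[0] * inv_lead) % p
        for k in range(len(g)):
            r[k] = (r[k] - factor * g[k]) % p
        r = trim(r)                   # the degree drops by at least 1
    return r

def poly_gcd(f, g, p):
    """A greatest common divisor of two nonzero f, g in GF(p)[X], made monic."""
    f, g = trim([c % p for c in f]), trim([c % p for c in g])
    while g:                          # an empty list is the zero polynomial
        f, g = g, poly_mod(f, g, p)
    inv = pow(f[0], -1, p)
    return [(c * inv) % p for c in f]

# gcd of 2x^4 + 2x^2 + x + 1 and x^2 + 2 in GF(3)[X]
print(poly_gcd([2, 0, 2, 1, 1], [1, 0, 2], 3))   # [1, 2], that is x + 2
```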
For the following calculations with residue classes, the extended Euclidean algorithm
is of particular importance. It can also be transferred together with its proof from
Theorem 4.15:
In the polynomial ring K[X] we will now form residue classes modulo a polynomial and
calculate with them. We proceed in the same way as in Sec. 4.3:
Definition 5.31 Let f (x), g(x), p(x) ∈ K[X]. The polynomials f (x) and
g(x) are called congruent modulo p if f − g is divisible by p. In symbols:
f (x) ≡ g(x) mod p(x).
As in the integers, one finds that two polynomials are equivalent if they leave the same
remainder modulo p(x). The residue class of a polynomial can be represented by the cor-
responding residue. Which residues are possible when dividing by the polynomial p(x)
with degree n? These are exactly all polynomials with degree less than n. All these poly-
nomials with degree less than n therefore form a natural system of representatives of the
residue classes, just as the numbers {0, 1, . . . , n − 1} represent the residue classes modulo
n in Z.
We denote the set of residue classes or representatives by K[X]/p(x) (read: K[X] mod-
ulo p(x)). Similar to Definition 4.24 we can now define addition and multiplication on
the set of residues modulo p(x):
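Definition 5.33 Let f(x), g(x) ∈ K[X] be residues modulo p(x). Then
$$f(x) \oplus g(x) := (f(x) + g(x)) \bmod p(x), \qquad f(x) \otimes g(x) := (f(x) \cdot g(x)) \bmod p(x).$$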
Examples
1. In Q[X] the polynomial $p(x) = x^3 + 1$ is given. The set of residues consists of the
polynomials with degree less than 3: $Q[X]/p(x) = \{ax^2 + bx + c \mid a, b, c \in Q\}$.
The addition in Q[X]/p(x) is simple: The sum of two polynomials with degree less
than 3 again has a degree less than 3, so it remains in the set. When multiplying,
the degree can become greater than 2; then we have to take the remainder. So for
example $(2x^2 + 1) \otimes (x + 2) = (2x^3 + 4x^2 + x + 2) \bmod (x^3 + 1)$. Division with
remainder gives $2x^3 + 4x^2 + x + 2 = 2(x^3 + 1) + (4x^2 + x)$, so the remainder is
$4x^2 + x$:
$$(2x^2 + 1) \otimes (x + 2) = 4x^2 + x.$$
2. In GF(2)[X] the polynomial $x^2 + 1$ is given. The possible residues are the polynomials
0, 1, x, x + 1. For example:
$$\begin{aligned}
(x + 1)x &= x^2 + x = (x^2 + 1) + x + 1 \equiv (x + 1) \bmod (x^2 + 1)\\
(x + 1)(x + 1) &= x^2 + 1 \equiv 0 \bmod (x^2 + 1).
\end{aligned} \tag{5.11}$$
Since we only have four residues, we can give a complete multiplication table:
  ⊗   | 0   1     x     x+1
 -----+----------------------
  0   | 0   0     0     0
  1   | 0   1     x     x+1
  x   | 0   x     1     x+1
  x+1 | 0   x+1   x+1   0
The polynomials of degree less than n in GF(p)[X] are often abbreviated simply as
the n-tuple of their coefficients from GF(p), here 0 = (0, 0), 1 = (0, 1), x = (1, 0)
and x + 1 = (1, 1). In this notation, the Cayley tables in $GF(2)[X]/(x^2 + 1)$ look as
follows:
It is not difficult to check that the structure (K[X]/p(x), ⊕ , ⊗ ) forms a ring, just
like (Z/nZ, ⊕ , ⊗ ). The 0-element is the zero polynomial and there is also a
1-element, which is the polynomial $1 = 1 \cdot x^0$. Can a field sometimes arise from this
construction? At least in the second example from above this is not the case. You
can see, for example, that the product of two elements different from 0 is again 0.
Z/nZ is a field if n is a prime number. Let’s try it here with an irreducible polyno-
mial p(x)! Another example:
3. In GF(2)[X] the polynomial $x^2 + x + 1$ is irreducible, because every product of
two polynomials of degree 1 in GF(2)[X] is different from $x^2 + x + 1$. As sets,
$GF(2)[X]/(x^2 + 1)$ and $GF(2)[X]/(x^2 + x + 1)$ are equal. But how do the Cayley
tables look now? Nothing changes with the addition, but the multiplication gives
different results. So now, for example (compare with (5.11))
$$\begin{aligned}
(x + 1)x &= x^2 + x = (x^2 + x + 1) + 1 \equiv 1 \bmod (x^2 + x + 1)\\
(x + 1)(x + 1) &= x^2 + 1 = (x^2 + x + 1) + x \equiv x \bmod (x^2 + x + 1)
\end{aligned}$$
The 1-element is (0, 1). See that here again every element different from 0 has a
multiplicative inverse? This is the property that was still missing for the field. ◄
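Examples 2 and 3 can be compared by computing both multiplication tables in the tuple notation (1, 0) = x, (1, 1) = x + 1. This is a sketch for residues of degree less than 2 over GF(2) only; the names `mul_mod` and `has_all_inverses` are made up for this illustration.

```python
from itertools import product

def mul_mod(u, v, m):
    """Product of u = (u1, u0) and v = (v1, v0), meaning u1*x + u0 and v1*x + v0,
    in GF(2)[X]/(x^2 + m1*x + m0) with m = (m1, m0)."""
    (u1, u0), (v1, v0), (m1, m0) = u, v, m
    c2 = u1 * v1                  # coefficient of x^2
    c1 = u1 * v0 + u0 * v1        # coefficient of x
    c0 = u0 * v0                  # constant term
    # replace x^2 by m1*x + m0 (signs do not matter in GF(2))
    return ((c1 + c2 * m1) % 2, (c0 + c2 * m0) % 2)

ELEMENTS = list(product((0, 1), repeat=2))    # (0,0), (0,1), (1,0), (1,1)

def has_all_inverses(m):
    """True if every element different from (0, 0) has a multiplicative inverse."""
    one = (0, 1)
    return all(any(mul_mod(u, v, m) == one for v in ELEMENTS)
               for u in ELEMENTS if u != (0, 0))

for m in ((0, 1), (1, 1)):        # x^2 + 1 and x^2 + x + 1
    print("modulus", m)
    for u in ELEMENTS:
        print(u, [mul_mod(u, v, m) for v in ELEMENTS])
    print("field:", has_all_inverses(m))
```

For the modulus x² + 1 the row of (1, 1) contains no (0, 1), for x² + x + 1 every nonzero row does.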
Theorem 5.34 Let K be a field and p(x) ∈ K[X] a polynomial. Then K[X]/p(x) is
a field if and only if p(x) is irreducible.
The proof is exactly the same as in the case of integers with the help of the extended
Euclidean algorithm: Let p(x) be irreducible of degree n and f(x) ≠ 0 a remainder modulo
p(x). Then p(x) and f(x) have no common divisor with a degree greater than 0: such a
divisor would have to be a divisor of p(x) with a degree less than n, which is not possible
because of the irreducibility of p(x). In particular, 1 = gcd(p, f), and according to Theorem 5.31 there are
polynomials α(x), β(x) ∈ K[X] with 1 = α · f + β · p, that is α · f = 1 − β · p. So we
have: α · f ≡ 1 mod p(x) and α mod p(x), the representative of the residue class α, is the
multiplicative inverse of f(x).
The field K is actually contained in the new field K[X]/p(x) as a subfield: The elements
k ∈ K correspond exactly to the residues $k \cdot x^0$, the constant polynomials.
Example
Please compare this with the addition and multiplication rule of the field C from Defi-
nition 5.12. It is the same rule.
In fact, $R[X]/(x^2 + 1) = C$! The polynomial x has taken on the role of i and really
is $x^2 \bmod (x^2 + 1) = (x^2 + 1) - 1 \equiv -1 \bmod (x^2 + 1)$, so $x^2 = -1$. ◄
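The identification can be checked numerically. In this sketch a residue a + bx is stored as the pair (a, b), and the product modulo x² + 1 is compared with Python's built-in complex multiplication; the function name is invented for illustration.

```python
def mul_mod_x2_plus_1(u, v):
    """(a + b*x)(c + d*x) mod (x^2 + 1); residues as pairs (a, b) meaning a + b*x."""
    a, b = u
    c, d = v
    const, lin, quad = a * c, a * d + b * c, b * d   # coefficients of 1, x, x^2
    # subtract quad * (x^2 + 1): the x^2 term vanishes, the constant decreases by quad
    return (const - quad, lin)

u, v = (2, 3), (-1, 4)
print(mul_mod_x2_plus_1(u, v))            # (-14, 5)
print(complex(*u) * complex(*v))          # (-14+5j)
print(mul_mod_x2_plus_1((0, 1), (0, 1)))  # (-1, 0): the residue x squares to -1
```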
I think, when you see this for the first time, your head has to spin. Mathematicians always like
to look for roots of polynomials. So they have made from Q the field R, to find a root of
$x^2 - 2$, and from R the field C, to be able to solve $x^2 + 1 = 0$. And now we have found a way
to give every irreducible polynomial a root: We form the associated polynomial ring modulo
this irreducible polynomial, the result is again a field, and in it just this irreducible
polynomial has a root! Surprisingly, this works not only in the example just seen, but always: If
f in K[X] is irreducible, then the residue x is a root of the polynomial f in K[X]/f. Every field
can be extended so that an irreducible polynomial has roots in the field extension. A great
piece from the mathematical magic box. A large mathematical theory is based on this, the
theory of fields.
In computer science, the finite fields are particularly interesting: If one starts from
a field GF(p) and finds an irreducible polynomial f(x) of degree n in GF(p)[X],
then GF(p)[X]/f(x) is again a field. The elements are just all polynomials of the form
$a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \dots + a_1 x + a_0$ with $a_i \in GF(p)$. If you write the polynomials as
a sequence of coefficients again, then $GF(p)[X]/f(x) = \{(a_{n-1}, \dots, a_1, a_0) \mid a_i \in GF(p)\}$. So
this is a field with $p^n$ elements. We call this $GF(p^n)$:
One would assume that for different irreducible polynomials of degree n, the corresponding
residue class fields are different, just as different structures arose in the previous
examples 2 and 3. To precisely identify the field $GF(p^n)$, one would therefore have to specify
the irreducible polynomial f. If one wants to perform concrete operations on the field
elements, that is correct. But surprisingly, for different irreducible polynomials f and g, the
corresponding fields are practically the same. In anticipation of Sect. 5.6: There is an
isomorphism between the fields. In this respect, the designation $GF(p^n)$ for the field with $p^n$
elements is justified.
By the way, one can prove that there are no more finite fields: Every finite field is of the
form $K = GF(p^n)$ for a prime number p and a natural number n.
When data is transported over a line, physical influences can always lead to errors, for
example due to “thermal noise”, “crosstalk” between different lines and many other
things. Therefore, it is necessary to check the data for correctness at the receiver. The
simplest form of such an error detection is the attachment of a so-called “check bit ”
to a transmitted data block. For example, you can append the sum of the bits modulo 2
(more precisely, we can say: the sum in Z/2Z) to each byte. At the receiver, the check bit
is recalculated, in the case of a transmission error the check bit does not match anymore
and the data has to be requested again:
10011011|1   −→   11011011|1
        ↑          ↑        ↑
  check bit      error    check bit no longer correct
This error detection is not very effective: You have to transmit 1/8, i.e. 12.5%, more data,
which reduces the effective transmission rate and, if 2 bits are flipped in the byte, the
error is already overlooked.
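Both effects, the detected single error and the overlooked double error, can be seen in a short sketch. Bytes are written as strings of '0' and '1'; the function names are chosen for illustration.

```python
def parity_bit(bits):
    """Sum of the bits in Z/2Z; bits is a string of '0' and '1'."""
    return sum(int(b) for b in bits) % 2

def accepted(bits, check_bit):
    """Receiver side: recompute the check bit and compare."""
    return parity_bit(bits) == check_bit

sent = "10011011"
check = parity_bit(sent)
print(check)                         # 1
print(accepted("11011011", check))   # False: one flipped bit is noticed
print(accepted("11111011", check))   # True: two flipped bits are overlooked
```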
The polynomial division in GF(2)[X] that we have just learned provides a much more
efficient method. We proceed as follows:
1. The code word to be transmitted is the polynomial f(x). Usually, much longer data
words are transmitted.
2. A fixed polynomial g(x) ∈ GF(2)[X] is used as divisor. g(x) is called the generator
polynomial. It is deg g = n.
3. Determine the remainder when dividing f (x) by g(x): r(x) = f (x) mod g(x). It is
deg r(x) < n.
4. Append the remainder r(x) to the polynomial f (x) and transmit f (x), r(x). The
remainder r(x) is used for error detection.
5. The receiver divides the received polynomial again by g(x) and obtains a remainder
r′(x). If an error has occurred during the transmission, usually the difference
r(x) − r′(x) ≠ 0.
An error is only not recognized if the erroneous polynomial when divided by g(x) gives
the same remainder as f (x). From the type of the difference of the remainders (how
“large” or “small” they are), it is often possible to infer the error and correct it. The ques-
tion of what the words “large”, “small” mean in this context is a topic of coding theory.
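The steps listed above can be sketched with Python integers as bit vectors, where bit k stands for the coefficient of x^k and subtraction in GF(2) is the XOR operation. The generator polynomial x³ + x + 1 and the data word are small sample values chosen for this illustration, and the names `gf2_mod` and `accepted` are hypothetical. The sketch follows the steps as described in the text; CRC variants used in practice differ in details.

```python
def gf2_mod(f, g):
    """Remainder of f divided by g in GF(2)[X]; bit k of an int = coefficient of x^k."""
    if g == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    deg_g = g.bit_length() - 1
    while f.bit_length() - 1 >= deg_g:
        shift = f.bit_length() - 1 - deg_g
        f ^= g << shift               # subtract x^shift * g(x); in GF(2) this is XOR
    return f

G = 0b1011            # sample generator polynomial x^3 + x + 1 (assumption)
f = 0b10011011        # data word x^7 + x^4 + x^3 + x + 1
r = gf2_mod(f, G)     # sender: remainder that is transmitted together with f

def accepted(f_received, r_received, g=G):
    """Receiver: divide again and compare the remainders."""
    return gf2_mod(f_received, g) == r_received

print(bin(r))                         # 0b111
print(accepted(f, r))                 # True: no error
print(accepted(f ^ (1 << 5), r))      # False: a single flipped bit is noticed
print(accepted(f ^ (G << 2), r))      # True: the error is a multiple of g(x)
```

The last line shows the only kind of error that goes unnoticed in f(x): the erroneous polynomial leaves the same remainder as f(x).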
This type of error detection is called “Cyclic Redundancy Check” (CRC); it is used
in many data transmission protocols. For example, in the Ethernet protocol, frames up
to 1514 bytes in length are transmitted; f(x) therefore has a degree less than or equal to
12 112 here.
The generator polynomial
$$x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$$
is used for error checking, so 4 bytes are used for error checking; that is just 2.6 per mille!
With the help of this polynomial, all 1 and 2 bit errors, any odd number of errors and
all error bursts up to 32 bit length can be safely detected.
A lot of brainpower goes into the clever choice of the generator polynomial, which must
have “good” properties. Algorithms for polynomial division modulo 2 can be implemented
very efficiently in software as well as in hardware, so that even with large data rates this
type of protection is not a problem.
5.5 Elliptic Curves 119
[Fig. 5.5: addition of two points on the elliptic curve; labels P, Q, R and P + Q]
infinity O” and we define P + R = O. It now follows that −P = R must be. The inverse
of an element is therefore obtained by reflection at the x-axis, see also Fig. 5.5.
This way we have defined a sum for each of two different points of E. There is still
the addition of a point to itself. What is P + P? How do we proceed here?
If we move the point Q closer and closer to P in Fig. 5.5, the line through P and Q will
eventually become the tangent to the curve through P. This also intersects E in exactly
one more point, which we reflect again across the x-axis, and obtain P + P, see Fig. 5.6.
Again, there is one exception: if the tangent is perpendicular, then R + R = O and thus
−R = R.
Now we need to convert the geometric representation into algebraic formulas.
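[Fig. 5.6: addition of a point to itself on the elliptic curve; labels R and P + P]
So let E be given by $y^2 = x^3 + ax + b$. Every line, except the perpendiculars, has the
form y = mx + t, where m is the slope of the line. The connecting line g of two points
$P = (x_P, y_P)$, $Q = (x_Q, y_Q)$ with $x_P \neq x_Q$ has the slope
$$m = \frac{y_Q - y_P}{x_Q - x_P}$$
and therefore the equation
$$y = m(x - x_P) + y_P. \tag{5.12}$$
Inserting this into the equation of the curve gives
$$(m(x - x_P) + y_P)^2 = x^3 + ax + b. \tag{5.13}$$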
If you multiply this out and bring everything to one side, you will get a third-degree
polynomial in x:
$$h(x) = x^3 - m^2 x^2 + \alpha x + \beta \tag{5.14}$$
The values of the coefficients α and β are irrelevant, so I didn’t even calculate them.
There are formulas for solving a third-degree polynomial equation, but they look pretty ugly.
Fortunately, however, we already know two roots of the polynomial, namely $x_P$ and $x_Q$,
because P and Q are intersections of E and g! So we can split two roots off the
polynomial h(x), and the rest is a linear polynomial:
$$h(x) = (x - x_P)(x - x_Q)(x - x_R) \tag{5.15}$$
So there is exactly one more root $x_R$. How do we get its value? Expand (5.15) again;
you get $(-x_P - x_Q - x_R)$ as the coefficient of $x^2$. The comparison with (5.14) yields
$-m^2 = -x_P - x_Q - x_R$, that is
$$x_R = m^2 - x_P - x_Q = \left(\frac{y_Q - y_P}{x_Q - x_P}\right)^2 - x_P - x_Q. \tag{5.16}$$
The y-value belonging to $x_R$ is most easily obtained by insertion into the line (5.12). But we
must not forget to reflect across the x-axis:
$$y_R = -m(x_R - x_P) - y_P$$
Now to the special case of addition P + P. We can assume $y_P \neq 0$ here, otherwise we
would have the perpendicular tangent and P + P = O. To calculate the tangent equation,
one needs the derivative of the curve at the point $x_P$. For this I have to anticipate the
second part of the book, see Sect. 15.1: If $y = f(x) = \sqrt{x^3 + ax + b}$, then the tangent to f
at P has the slope $f'(x_P)$. To calculate the derivative, we need the chain rule from
Theorem 15.5 and the rules for exponentiation from Definition 14.13 and Theorem 14.14:
$$f'(x_P) = \frac{1}{2\sqrt{x_P^3 + a x_P + b}} \, (3x_P^2 + a) = \frac{3x_P^2 + a}{2y_P} = m'$$
The tangent equation is as before $y = m'(x - x_P) + y_P$. The determination of the further
intersection point of g with E also takes place as in the case of P ≠ Q: In (5.13) one has
to calculate with the slope m′. In (5.15) $x_P$ is now a double root and we get
$$x_R = m'^2 - 2x_P = \left(\frac{3x_P^2 + a}{2y_P}\right)^2 - 2x_P \tag{5.17}$$
$$y_R = -m'(x_R - x_P) - y_P.$$
Now we can calculate the sum for all points of the curve. The sum P + O should of
course be P. Does this make E a group? O is the identity element and each element has
an inverse. The construction also automatically results in the commutativity of the
operation. The associativity of the operation is still missing. Verifying it is a long and
tedious calculation, which I would like to skip.
In the construction of the operation on an elliptic curve, we have made very intensive use
of the properties of the real numbers: we have taken roots, formed derivatives and cal-
culated tangents. Something like this only works to a limited extent in other fields. But
if you look at the results of the calculations in (5.16) and (5.17), you will see that only
the operations possible in every field are used here. So we can try to extend the defini-
tion to general fields. There is one exception: if in the field K it is true that 1 + 1 = 0 or
1 + 1 + 1 = 0, then the formula (5.17) breaks down, because then either 2 = 0 or 3 = 0.
In all other cases our attempt works.
Elliptic curves over finite fields are used in cryptography. Here it is very important to have
efficient algorithms for calculating the sum of two points. There are very fast
implementations of the field operations for the Galois fields $GF(2^n)$. Unfortunately, we cannot
use them directly, because in these fields 1 + 1 = 0. If you look at the equation of degree 3
which defines an elliptic curve in a different way, the concept of groups can also be extended to
elliptic curves over the fields $GF(2^n)$. Such curves are also used in applications. I don’t want
to go into that here.
$$E = \{(x, y) \in K^2 \mid y^2 = x^3 + ax + b\} \cup \{O\}$$
is called an elliptic curve over K. With the addition of the points $P = (x_P, y_P)$,
$Q = (x_Q, y_Q)$, which is defined as follows
• For $P \neq Q$, $x_P \neq x_Q$:
$$P + Q := (x_R, -m(x_R - x_P) - y_P), \qquad x_R = m^2 - x_P - x_Q, \qquad m = \frac{y_Q - y_P}{x_Q - x_P},$$
• For $P = Q$, $y_P \neq 0$:
$$P + P := (x_R, -m(x_R - x_P) - y_P), \qquad x_R = m^2 - 2x_P, \qquad m = \frac{3x_P^2 + a}{2y_P},$$
• Otherwise:
$$P + Q := O,$$
E is a commutative group with the identity element O.
These elliptic curves and the addition on them are not as intuitive as in the case of real
numbers. In Fig. 5.7 you can see the elliptic curve to the equation y2 = x 3 + 6x + 9 over
the field GF(13). It has 17 points including the O: For 8 elements of GF(13), x 3 + 6x + 9
is a square. If z is a root of it, then −z = 13 − z is too. Therefore, the “curve” is sym-
metrical like in the real case: Above the horizontal axis 13/2 = 6.5, the additive inverses
of the lower half are located. The connecting line of two points can also be interpreted
intuitively, see exercise 14 in this chapter. This line intersects the curve in one other
point. There is no geometric interpretation for the tangent anymore, but we can still cal-
culate P + P = 2P like in the real case. For example, (1, 9) + (1, 9) = 2 · (1, 9) = (8, 7)
and 3 · (1, 9) = (7, 2). Check this using Definition 5.36.
In Sect. 5.7 on cryptography, we will see how elliptic curves are used in encryption
algorithms.
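The numerical claims about the curve over GF(13) can be checked with a few lines of code. The following is only a minimal sketch in Python: the names `ec_add`, `P_MOD`, `A`, `B` and the representation of O as `None` are assumptions of this example, and the rule P + O = P is added explicitly as required in the text.

```python
# Toy elliptic curve y^2 = x^3 + 6x + 9 over GF(13), as in the example of the text.
P_MOD, A, B = 13, 6, 9
O = None  # the identity element O, represented as None (assumption of this sketch)


def ec_add(p1, p2):
    """Add two curve points following the three cases of the definition."""
    if p1 is O:                       # O + Q = Q
        return p2
    if p2 is O:                       # P + O = P
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 != x2:                      # case P != Q, xP != xQ
        m = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    elif y1 == y2 and y1 != 0:        # case P == Q, yP != 0
        m = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:                             # otherwise the sum is O
        return O
    m %= P_MOD
    x3 = (m * m - x1 - x2) % P_MOD
    y3 = (-m * (x3 - x1) - y1) % P_MOD
    return (x3, y3)


# All affine points: pairs (x, y) that satisfy the curve equation mod 13.
points = [(x, y) for x in range(P_MOD) for y in range(P_MOD)
          if (y * y - (x**3 + A * x + B)) % P_MOD == 0]

P = (1, 9)
two_P = ec_add(P, P)
three_P = ec_add(two_P, P)
print(len(points) + 1)   # number of points including O
print(two_P, three_P)
```

The modular inverse is computed with the built-in `pow(x, -1, m)`, which is available from Python 3.8 on. The count and the two multiples agree with the values given in the text (17 points, 2P = (8, 7), 3P = (7, 2)).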
[Fig. 5.7: plot of the points of the elliptic curve over GF(13), with the points 2P, 3P and −P marked]
5.6 Homomorphisms
We have already seen several times that mappings are an important concept in mathemat-
ics. When comparing different things, one usually tries to establish a mapping between
these things in order to find similarities and differences. In the context of algebraic struc-
tures, mappings are interesting which preserve the structure. Such operation-compatible
mappings are called homomorphisms. If there is a bijective homomorphism between two
structures, they are practically indistinguishable; it does not matter whether one investi-
gates the one or the other.
Attention: Even if I do not mark the symbols differently anymore: “+” and “ · ” on the left
and on the right side of the equation mean different operations! Left in G (or in R), right in
H (or in S).
Examples
Proof:
a) ϕ(0) = ϕ(0 + 0) = ϕ(0) + ϕ(0) ⇒ (−ϕ(0) on both sides) 0 = ϕ(0).
b) ϕ(a) + ϕ(−a) = ϕ(a + (−a)) = 0 ⇒ ϕ(−a) = −ϕ(a).
c) ϕ(a − b) = ϕ(a + (−b)) = ϕ(a) + ϕ(−b) = ϕ(a) + (−ϕ(b)) = ϕ(a) − ϕ(b).
Proof: “⇒” Let ϕ be injective. Because of ϕ(0) = 0, 0 ∈ ker ϕ, and because the mapping
is injective, this is the only preimage of 0.
“⇐” Let ker ϕ = {0}. Assume there is a ≠ b with ϕ(a) = ϕ(b). Then
ϕ(a) − ϕ(b) = 0 = ϕ(a − b) ⇒ a − b ∈ ker ϕ with a − b ≠ 0, in contradiction to the assumption.
Here is an application of Theorem 5.40 right away. You know the rule for solving
quadratic equations. A formula for it is: For real numbers p, q, the equation
x^2 + px + q = 0 has the roots
$$x_{1/2} = -\frac{p}{2} \pm \sqrt{\left(\frac{p}{2}\right)^2 - q}. \tag{5.18}$$
We call the expression under the square root the discriminant D. You know that there are
0, 1, or 2 real solutions, depending on whether the discriminant is less than, equal
to, or greater than 0. With our knowledge of complex numbers, we can now say that this
equation always has two complex solutions in the case D < 0, namely −p/2 ± i√(−D). A
similar rule applies not only to quadratic equations, but to the roots of all real
polynomials:
= a_n ϕ(z)^n + · · · + a_0 = f(ϕ(z))   (using a_i ∈ R),
and because of f (z) = 0 we have ϕ(f (z)) = 0, thus also f (ϕ(z)) = f (a − bi) = 0.
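The statement can be tried out numerically. The sketch below, in Python, evaluates a polynomial with real coefficients at a complex root and at its conjugate. The helper name `f`, the coefficient lists and the two example polynomials are assumptions chosen only for illustration.

```python
import cmath
import math


def f(z, coeffs):
    """Evaluate a polynomial (real coefficients, highest degree first) by Horner's scheme."""
    result = 0
    for c in coeffs:
        result = result * z + c
    return result


# Example 1: x^2 + 2x + 5 with p = 2, q = 5, so D = (p/2)^2 - q = -4 < 0.
p, q = 2.0, 5.0
D = (p / 2) ** 2 - q
z1 = complex(-p / 2, math.sqrt(-D))      # -p/2 + i*sqrt(-D)
quadratic = [1.0, p, q]

# Example 2: x^3 - 1 with the non-real root exp(2*pi*i/3).
z2 = cmath.exp(2j * math.pi / 3)
cubic = [1.0, 0.0, 0.0, -1.0]

for z, coeffs in ((z1, quadratic), (z2, cubic)):
    # Both values should vanish up to rounding errors.
    print(abs(f(z, coeffs)), abs(f(z.conjugate(), coeffs)))
```

Floating-point arithmetic only confirms the result up to rounding errors, so the printed absolute values are very small rather than exactly 0.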
This proof is based on a brilliant discovery that revolutionized algebra at the time: There is a
close connection between roots of polynomials and field isomorphisms. Here we have used
it to find a new root. For a long time it was an unsolved problem which polynomials can be
solved by formulas similar to (5.18). In the formulary you will find formulas for solving
equations of the third and fourth degree. It could be proven that it is impossible for polyno-
mials of the 5th degree to give a solution formula (even though we know that the solutions
exist). This proof only succeeded because it was possible to translate the question into the
investigation of properties of certain groups of field homomorphisms.
5.7 Cryptography
Cryptography is the science of encrypting and decrypting data. Especially in the age
of the information society and e-commerce, cryptography is gaining more and more
importance. Results of number theory play a major role in the construction and analysis
of cryptographic algorithms and in particular the rings Z/nZ and the fields GF(p) and
GF(p^n) occur again and again. I would like to explain some basic concepts of
cryptography here and present some important algorithms as examples.
Most of you have already encrypted messages as children and passed them along the
school benches. A simple method was already used by Caesar. For this you write the
alphabet twice one below the other, once opposite the other shifted:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
DEFGHIJKLMNOPQRSTUVWXYZABC
The encryption rule is: Replace A with D, B with E and so on. We see from this: the
encryption is a function that is applied to a plaintext and that is reversible. Here the func-
tion is: “Shift the plaintext 3 characters to the right in the alphabet”.
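As a small illustration, the Caesar shift can be written as a function of the text and a shift value. This is only a sketch in Python: the function name `caesar` and the restriction to the uppercase letters A to Z are assumptions of this example.

```python
import string

ALPHABET = string.ascii_uppercase  # "ABC...XYZ"


def caesar(text, shift):
    """Shift every uppercase letter by `shift` places, wrapping around after Z."""
    k = shift % 26
    shifted = ALPHABET[k:] + ALPHABET[:k]
    return text.translate(str.maketrans(ALPHABET, shifted))


cipher = caesar("MAUS", 3)      # encryption: shift 3 to the right
plain = caesar(cipher, -3)      # decryption: shift 3 back to the left
print(cipher, plain)
```

Reversibility is visible directly: shifting by 3 and then by −3 returns the plaintext.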
In computer science, messages are always encoded as bit strings. For example, the
message MAUS in ASCII code: 01001101, 01000001, 01010101, 01010011. Combined
into an integer results in the 4 byte long number
01001101 01000001 01010101 01010011,
as a decimal number this is 1 296 127 315. Encrypting the MAUS therefore means
encrypting the number 1 296 127 315. This takes us over into the field of mathematics:
Encrypting means applying a reversible function to a number.
$$f_K : \{\text{plaintexts}\} \to \{\text{ciphertexts}\}, \qquad N \mapsto f_K(N)$$
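The conversion of the message MAUS into a number can be reproduced as follows. This is a minimal Python sketch; the variable names are assumptions, and the byte order (most significant byte first) is chosen so that the result matches the bit string given above.

```python
message = "MAUS"
data = message.encode("ascii")                       # 4 bytes: 77, 65, 85, 83

# Bit pattern of each byte, as written in the text.
bits = " ".join(f"{byte:08b}" for byte in data)

# Read the 4 bytes as one integer, most significant byte first.
number = int.from_bytes(data, byteorder="big")

# The mapping is reversible: the number gives back the text.
recovered = number.to_bytes(4, byteorder="big").decode("ascii")

print(bits)
print(number)
print(recovered)
```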
In real applications, the function is always dependent on a key K . It is practically impos-
sible to keep an algorithm secret, much easier the key. The secret that ensures that no
unauthorized person can decipher the message is hidden in this key. In the Caesar code,
the algorithm is “shift to the right”, the key is “3”. Of course, the algorithm is very sim-
ple and there are so few keys that you can try them all. This must not be possible.
Therefore, common algorithms use keys that are random numbers of 128 to 256 bits in
length. The data is also not encrypted byte by byte or “integer”-wise, as in the
example with the MAUS; larger data blocks are processed at once. Today, 64-bit blocks are
often common; AES (Advanced Encryption Standard), which has replaced the old DES
algorithm, encrypts blocks of 128 bits and offers key lengths of 128, 192 or 256 bits.
No computer in the world can try all of these keys to crack a code.
There are two basic encryption principles that cryptographers work with today:
encryption with secret keys and encryption with public keys.
Two communication partners (in cryptographic literature these are always Alice
and Bob) share a secret key K. If Alice wants to send the message M to Bob, she forms:
$$C := f_K(M).$$
The function is designed such that from the knowledge of K the inverse function f_K^(−1) can
be derived. Then Bob (and no one else) can decrypt the message:
$$M = f_K^{-1}(C).$$
This is called symmetric encryption. There are two main problems with this
method:
1. The keys must be kept secret between the partners themselves. At first it looks as if
you are chasing your own tail: To exchange secret messages, you have to exchange
secret messages.
2. With n communication partners, everyone has to share a key with everyone. For this
you need n(n − 1)/2 keys. How are these keys managed and distributed?
Different keys are used here to encrypt and decrypt. Encryption can be done by any-
one, so this key does not have to be kept secret. Only the recipient of the message can
decrypt, so the decryption key must remain secret.
Each communication partner needs a key pair in this system, a public and a matching
private key. Bob, for example, has the keys KEB and KDB. Here E stands for “encryption
key”, D stands for “decryption key”. The corresponding functions I denote with fEB and
fDB. Now comes the essential difference to symmetric encryption: The functions fEB and
fDB are indeed inverse to each other, but it must be ensured that under no circumstances
is it possible to derive fDB from fEB.
Now if Alice sends Bob a message M , she must first obtain his public key for encryp-
tion: she either has him send it to her beforehand or she retrieves it from a key database.
She then sends to Bob:
C := fEB (M).
Only Bob can reverse this encryption by computing:
M = fDB (C).
Encryption using public and private keys is called asymmetric encryption or public-key-
encryption. Even if it initially looks as if all the problems of symmetric methods are
thereby eliminated, here too there is a fly in the ointment: one must be certain that the
public keys cannot be manipulated during transport. Otherwise Charly could replace
Bob’s public key with his own and pass it off to Alice as Bob’s. She thinks she is encrypting for Bob,
but the message can be read by Charly. This problem can be handled with the help of so-
called certificates: the authenticity of public keys and the assignment of a key to a person
are confirmed by an independent third party, the certification authority, in a certificate;
this too using cryptographic methods. However, one must expend some effort here.
In addition, asymmetric methods are so computationally complex that real-time
encryption of large data sets is impossible. Symmetric methods are orders of magnitude
faster.
In practice, the advantages of both methods are combined. It is typical to use public
keys to encrypt small data sets, for example symmetric keys, which can then be securely
exchanged between two partners as session keys. The user data is then encrypted with
these. This happens, for example, every time you request a protected document with your
internet browser.
The concept of public-key encryption is easy to understand. It is much more difficult to
find mathematical algorithms that realize it. In fact, such algorithms were not published until
the 1970s. The main problem that must always be solved is the construction of secure
and fast functions that are reversible but for which the inverse function cannot be
calculated in practice without the private key. There are essentially two classes of difficult
mathematical problems on which today’s algorithms are based. For both of these problem
classes, I will present an algorithm to you.
The first such problem: it is very easy to multiply two large prime numbers, but it
is very difficult to decompose a large number into its prime factors in a reasonable
amount of time (the factorization problem).
The second problem: for a given natural number a and a prime number p one can
very easily calculate a^x mod p. But there is no known method for finding, for a given y,
an x with y ≡ a^x mod p that is significantly faster than calculating all powers (the
problem of the discrete logarithm).
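The asymmetry between the two directions can be made concrete with a toy example. In the following Python sketch, the prime 1019, the base 2 and the exponent 777 are arbitrary small values chosen for illustration, and `discrete_log_bruteforce` is a hypothetical helper that simply tries all powers.

```python
def discrete_log_bruteforce(a, y, p):
    """Return the smallest x >= 0 with a**x % p == y by trying all powers, or None."""
    power = 1
    for x in range(p - 1):
        if power == y:
            return x
        power = power * a % p
    return None


p, a = 1019, 2          # small prime and base, for illustration only
x_secret = 777

# Easy direction: the built-in pow uses fast modular exponentiation.
y = pow(a, x_secret, p)

# Hard direction: up to p - 1 multiplications, hopeless for a prime with hundreds of digits.
x_found = discrete_log_bruteforce(a, y, p)
print(y, x_found)
```

The exponent found need not be 777 itself if a smaller exponent yields the same power. With numbers of cryptographic size, `pow` still finishes almost instantly, while the loop would never terminate in practice.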
Interestingly, for neither of the two classes there is a proof that the underlying problem is
not solvable quickly after all. Only no one has found a way yet. If tomorrow someone dis-
covers a fast method for factoring large numbers, half of the algorithms become unusable.
There is some consolation in the fact that at least one can fall back on the other half.
Probably the best-known public-key method is the RSA algorithm, named after its
inventors (or discoverers?) Rivest, Shamir and Adleman, who published it in 1978.
RSA is based on the factorization problem.
To generate a key pair, you need two large prime numbers p and q. Currently,
primes of about 1024 bits are usually chosen. Then n = p · q is formed.
This n has a size of 2048 bits, which is a decimal number with more than 600 digits. Now
one looks for an invertible number e in (Z/(p − 1)(q − 1)Z)*, that is, an e with
gcd(e, (p − 1)(q − 1)) = 1 (see Theorem 5.18). Finally, d is the multiplicative inverse of
e in (Z/(p − 1)(q − 1)Z)*. This way we have everything we need for encryption:
Now we can define the keys, with e standing for encrypt, d for decrypt:
So let C = M^e mod n. Because taking the remainder modulo n and exponentiating are
interchangeable, C^d mod n = (M^e)^d mod n = M^(ed) mod n. So it must be shown that
M = M^(ed) mod n is true.
First we show M mod p = M^(ed) mod p:
There is an integer α so that ed = 1 + (p − 1)α is true. Then according to Fermat’s
little Theorem (5.20a):
$$M^{ed} \bmod p = M \cdot (M^{p-1})^{\alpha} \bmod p = M \cdot 1^{\alpha} \bmod p = M \bmod p.$$
This also holds for the case gcd(M, p) ≠ 1, which is excluded in Theorem 5.20a, because
then p is a divisor of M, and therefore also a divisor of M^(ed), and therefore
M mod p = M^(ed) mod p = 0.
We show M mod q = M^(ed) mod q in the same way.
M − M^(ed) is therefore divisible by both p and q. Since p and q are distinct primes,
M − M^(ed) is also divisible by pq = n, which means M − M^(ed) ≡ 0 mod n. Since M is a
remainder mod n, according to Theorem 4.23 M = M^(ed) mod n must be true.
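The key generation and the statement just proved can be tried out with toy numbers. The following Python sketch uses the tiny primes 61 and 53 and the exponent 17, which are assumptions for illustration only and offer no security; the function names `encrypt` and `decrypt` are likewise hypothetical.

```python
from math import gcd

p, q = 61, 53                  # toy primes; real keys use primes of about 1024 bits
n = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)        # (p - 1)(q - 1)

e = 17                         # encryption exponent, must be invertible mod phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # decryption exponent: e * d = 1 mod phi


def encrypt(m):
    """C = M^e mod n (public operation)."""
    return pow(m, e, n)


def decrypt(c):
    """M = C^d mod n (private operation)."""
    return pow(c, d, n)


M = 65
C = encrypt(M)
print(n, d, C, decrypt(C))

# The proof covers every remainder M mod n, including multiples of p or q.
all_ok = all(decrypt(encrypt(m)) == m for m in range(n))
print(all_ok)
```

The exhaustive check at the end also covers the messages with gcd(M, p) ≠ 1, the special case treated separately in the proof.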
The first public-key method to become known was the Diffie-Hellman algorithm, which
was published in 1976. Diffie-Hellman (DH) is an algorithm used to exchange a secret
key between Alice and Bob. In this respect, its operation is somewhat different from the
public-key methods I described at the beginning of this section. The underlying math-
ematics uses the field GF(p) where p is a very large prime number. Its security is based
on the discrete logarithm problem. Without proof, I would like to mention that the
multiplicative group of a finite field is always cyclic (see Definition 5.5). A generator
a ∈ GF(p) \ {0} is needed for the algorithm. There are usually many generators, but it is
difficult to check whether a given element has this property when p is large. Fortunately,
the numbers p and a generator a only need to be determined once, they remain the same
for all time and for all users. If Alice and Bob now want to exchange a key, they each
choose a random number qA and qB less than p − 1. These are their private key parts,
which they keep to themselves. They then exchange messages:
$$\text{Alice to Bob: } r_A = a^{q_A} \bmod p,$$
$$\text{Bob to Alice: } r_B = a^{q_B} \bmod p.$$
rA and rB are the public key parts of Alice and Bob. There is no known method to calcu-
late qA from rA or qB from rB that is much faster than trial and error. If p is large enough,
this is impossible. Now Alice and Bob can calculate a common number K from the mes-
sages received:
$$\text{Alice: } r_B^{\,q_A} \bmod p = a^{q_B q_A} \bmod p =: K,$$
$$\text{Bob: } r_A^{\,q_B} \bmod p = a^{q_A q_B} \bmod p =: K.$$
Do you see that we used the field properties of GF(p) here? Without the respective pri-
vate parts, K is not computable, only Alice and Bob therefore share this information. K
can now be used as a key or as a starting value for a key generator, so that the following
communication can be encrypted.
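The exchange can be simulated in a few lines. This Python sketch uses the toy parameters p = 23 and a = 5, which are assumptions for illustration; with such a small prime the generator property can still be checked by brute force, which is not possible for realistic sizes.

```python
import secrets

p, a = 23, 5   # toy system parameters; real primes have hundreds of digits

# Brute-force check that a generates the multiplicative group of GF(p).
is_generator = len({pow(a, k, p) for k in range(1, p)}) == p - 1

# Private key parts: random numbers between 1 and p - 2.
q_A = secrets.randbelow(p - 2) + 1
q_B = secrets.randbelow(p - 2) + 1

# Public key parts, exchanged over the open channel.
r_A = pow(a, q_A, p)
r_B = pow(a, q_B, p)

# Each side combines its own private part with the other's public part.
K_alice = pow(r_B, q_A, p)
K_bob = pow(r_A, q_B, p)

print(is_generator, K_alice, K_bob)
```

Both sides arrive at the same K because (a^qB)^qA = (a^qA)^qB in GF(p). The private parts are drawn with the `secrets` module, since the ordinary `random` module is not intended for cryptographic use.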
The ElGamal algorithm is an encryption method that builds directly on the DH key
exchange. Alice wants to encrypt the message M , which is a number less than p. After
the DH key exchange, Alice sends Bob the number
$$C = K M \bmod p.$$
Without the knowledge of K, no one can calculate M. However, if Bob reads
Theorem 5.19, he can determine the inverse element K^(−1) in GF(p) and obtains
$$M = K^{-1} C \bmod p.$$
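A toy run of this scheme might look as follows. The Python sketch reuses the illustrative parameters p = 23, a = 5 and fixes the private key parts 6 and 15 so that the run is reproducible; all of these values and names are assumptions of the example. The last lines indicate what an attacker could do with one known pair of plaintext and ciphertext if K were reused.

```python
p, a = 23, 5            # toy system parameters (illustration only)
q_A, q_B = 6, 15        # private key parts, fixed here for reproducibility

r_A = pow(a, q_A, p)    # Alice's public key part
r_B = pow(a, q_B, p)    # Bob's public key part
K = pow(r_B, q_A, p)    # shared key, as Alice computes it

M = 13                  # message, a number less than p
C = K * M % p           # Alice encrypts

K_bob = pow(r_A, q_B, p)            # Bob derives the same K
K_inv = pow(K_bob, -1, p)           # inverse of K in GF(p)
M_recovered = K_inv * C % p         # Bob decrypts

# Known-plaintext attack: one matching pair (M, C) reveals K.
K_attacker = C * pow(M, -1, p) % p

print(C, M_recovered, K_attacker == K)
```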
Diffie-Hellman and ElGamal look very attractive at first glance, but I have to pour some
water into the wine right away: Alice and Bob must be sure that the exchange of public keys
is not intercepted by a third party who substitutes their own key parts. This is not so easy
to guarantee; although the Diffie-Hellman algorithm is widespread, it must be embedded in a
corresponding protocol. ElGamal has the disadvantage that after each message M, the key K
must be changed, so that the transmitted text is effectively twice as long as the plaintext.
If this did not happen, an attacker who knows a single matching pair M, C of plaintext and
ciphertext could crack the encryption. And cryptographic algorithms must be prepared for
this attack. However, the ElGamal algorithm also has its areas of application.
The security of the Diffie-Hellman algorithm is based on the problem of the discrete log-
arithm in the multiplicative group of a finite field. This logarithm problem also exists in
other groups. The currently most important are the elliptic curves. First, it is a bit confus-
ing when we talk about exponentiation and logarithms in an additive group. We have to
translate the terms into the language of addition: exponentiation is a continued multipli-
cation, which in an additive group corresponds to a continued addition. In elliptic curves
over a finite field, it is very easy to add a point P n times to itself: nP = P + P + . . . + P.
This n-fold addition can be implemented recursively, similar to exponentiation in GF(p)
or in R. But there is no known efficient algorithm that, given P and Q = nP, finds the number n.
This would correspond to taking the logarithm.
Now the Diffie-Hellman algorithm can be formulated for elliptic curves. As system
parameters one needs a finite field K , an elliptic curve E over K and a so called base
point P on E. The construction of the elliptic curve is very demanding. As a field one
usually chooses GF(p) or GF(2^m) for a large prime number p or for a large m. The curve
itself must contain many points, and the base point P must have the property that the
order of P, that is, the smallest n > 0 for which nP = O, has only very large factors. And last
but not least, there are some elliptic curves in which the logarithm is fast anyway. Curves
that are used in cryptography must be analyzed very carefully. Such curves are specified
in security standards. These can then be used in encryption protocols.
In 1999, the National Institute of Standards and Technology (NIST) in the USA standard-
ized elliptic curves that had been developed by the National Security Agency (NSA). The
design criteria were not disclosed and in the following years, especially after the revela-
tions of Edward Snowden in 2013, the suspicion arose in the community of cryptologists
that curves were generated here in which there is a back door that would allow the NSA to
decrypt. In many protocols, these NIST curves are therefore now avoided.
For key exchange, Alice and Bob choose as private key a number qA or qB which is
smaller than the order of P.
The public keys are then qA P and qB P. Both Alice and Bob can then calculate the
shared secret qB (qA P) = qA (qB P) from their private key and the public key of the partner.
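To see the mechanics, the key exchange can be played through on the small curve y^2 = x^3 + 6x + 9 over GF(13) from Sect. 5.5 with base point P = (1, 9). The Python sketch below is for illustration only: the names `ec_add` and `ec_mul`, the representation of O as `None` and the fixed private keys are assumptions, and a curve with 17 points is of course useless for real security. The n-fold addition is implemented recursively by doubling and adding.

```python
P_MOD, A = 13, 6          # curve y^2 = x^3 + 6x + 9 over GF(13)
O = None                  # identity element, represented as None


def ec_add(p1, p2):
    """Point addition with the three cases of the definition, plus P + O = P."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 != x2:
        m = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    elif y1 == y2 and y1 != 0:
        m = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        return O
    m %= P_MOD
    x3 = (m * m - x1 - x2) % P_MOD
    return (x3, (-m * (x3 - x1) - y1) % P_MOD)


def ec_mul(n, point):
    """Compute nP recursively: nP = 2 * ((n // 2) P), plus P if n is odd."""
    if n == 0 or point is O:
        return O
    half = ec_mul(n // 2, point)
    doubled = ec_add(half, half)
    return ec_add(doubled, point) if n % 2 else doubled


P = (1, 9)                # base point; its order is 17 on this curve
q_A, q_B = 5, 11          # private keys, smaller than the order of P

pub_A = ec_mul(q_A, P)    # Alice's public key q_A * P
pub_B = ec_mul(q_B, P)    # Bob's public key q_B * P

secret_A = ec_mul(q_A, pub_B)   # Alice: q_A * (q_B * P)
secret_B = ec_mul(q_B, pub_A)   # Bob:   q_B * (q_A * P)
print(pub_A, pub_B, secret_A, secret_B)
```

The recursion needs only about log2(n) doublings, which is what makes the computation of nP fast even for very large n.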
Even if it is difficult to find suitable elliptic curves for the Diffie-Hellman method,
they are still very interesting for applications. The reason lies in the performance and
in the key lengths of the algorithms. In the age of the Internet of Things, in which soon
every light switch will want to communicate with its lamp in encrypted form, suddenly
storage space sizes and computing power become interesting again. Algorithms over
elliptic curves have significantly shorter key lengths and faster run times than standard
algorithms with comparable security.
Key Generation
Until around the turn of the century, numbers of about 512 bits were chosen as the
modulus n for the RSA algorithm. In 1999 it was possible for the first time to break such
a code, that is, to factor such a number n. It took about half a year and a total of
35.7 CPU years of computing time. For cryptologists, the algorithm was no longer viable
in this form, and they switched to moduli of at least 1024 bits in length. Since 2011,
the BSI (German Federal Office for Information Security) has recommended a modulus
length of 2048 bits.
When generating the key, one then needs prime numbers of about 1024 bits. Are
there enough of them? Yes, in this number range there are more than 10^300 useful prime
numbers, more than enough for every atom in the universe (see Exercise 9 in Chap. 14).
But how do you find them? A simple prime number test consists in factoring the
number. But we can’t invest 35 computer years in that. There are indeed faster test
algorithms; in 2002 even an algorithm with polynomial running time was discovered (the AKS
algorithm). But in practice this is still much too slow.
The discovery of a polynomial prime number test caused great excitement in mathematics.
And at first it almost looks as if this algorithm could nibble at the security of cryptographic
algorithms that are based on the factorization problem. Fortunately, the factorization of a
number and the test for the primality seem to be independent problems. And there are no
So what do you do if you want to find a large prime number in a few seconds? There are
various prime number tests that can determine that a number is at least probably prime.
I will show you the Miller-Rabin Prime Number Test, which is widely used in applica-
tions.
Fermat’s little Theorem 5.20 first gives us a negative criterion: If p is the
candidate to be tested for its primality and a is a number with a ≢ 0 mod p, then at least
a^(p−1) mod p = 1 must hold. If this property is violated for any a, then p is composite.
Conversely, if a^(p−1) mod p = 1 holds for many such a, then we can hope that p is
prime with a certain probability. Any number a for which a^(p−1) mod p = 1 is
fulfilled, we call a witness for the primality of p.
We assume that p is prime and try to confirm it by as many witnesses as possible. As
potential witnesses we take numbers a ∈ {2, 3, . . . , p − 1}. We calculate in the field Z/pZ
and of course we want to calculate as efficiently as possible.
In order to have a^(p−1) = 1 (i.e. a^(p−1) mod p = 1), it is sufficient to know that a^((p−1)/2) is equal
to +1 or equal to −1 (= p − 1), because (±1)^2 = 1. If (p − 1)/2 is an even number, we can
compute a^((p−1)/4), and it follows from a^((p−1)/4) = ±1 that a^(p−1) = 1. Continuing in this way,
we can save some powers. The algorithm in detail:
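The following Python sketch shows one common formulation of the Miller-Rabin test; it is not necessarily identical to the author's listing. It works from the bottom up: p − 1 is written as 2^s · d with d odd, a^d is computed once and then squared repeatedly, which visits the same powers a^((p−1)/2), a^((p−1)/4), … as described above. The function name `is_probable_prime`, the default of 30 rounds and the restriction of the candidates to 2, …, p − 2 are assumptions of this example.

```python
import random


def is_probable_prime(p, rounds=30):
    """Miller-Rabin test: False means certainly composite, True means probably prime."""
    if p < 2:
        return False
    if p in (2, 3):
        return True
    if p % 2 == 0:
        return False

    # Write p - 1 = 2^s * d with d odd.
    d, s = p - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1

    for _ in range(rounds):
        a = random.randrange(2, p - 1)     # potential witness in {2, ..., p - 2}
        x = pow(a, d, p)
        if x in (1, p - 1):
            continue                       # a is a witness for primality
        for _ in range(s - 1):
            x = x * x % p                  # a^(2d), a^(4d), ..., a^((p-1)/2)
            if x == p - 1:
                break                      # reached -1: a is a witness
        else:
            return False                   # no -1 found: p is composite
    return True


print([n for n in range(2, 50) if is_probable_prime(n)])
print(is_probable_prime(2**61 - 1), is_probable_prime(561))
```

A prime number always passes. A composite number survives a single round with probability less than 1/4, so a wrong answer after 30 rounds is extremely unlikely.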
There are only a few composite numbers q for which it holds true for all a with gcd(a, q) = 1
that a^(q−1) mod q = 1. These numbers are called Carmichael numbers, the first five of which
are 561, 1105, 1729, 2465, 2821. From my argumentation one could conclude that these
numbers would falsely be recognized as prime during testing. Deep results from
number theory, however, show that the Miller-Rabin test surprisingly also detects
Carmichael numbers as composite: The Miller-Rabin test is stronger than the check whether
a^(q−1) mod q = 1 holds for all a < q. If q is composite, then for example there can be numbers
a < q with the property a^((q−1)/2) mod q ≠ ±1, even though a^(q−1) mod q = 1! Look for number
examples!
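Such number examples can be found by a short search. The Python sketch below looks at the smallest Carmichael number q = 561; the variable names are assumptions of this example.

```python
from math import gcd

q = 561   # smallest Carmichael number, 561 = 3 * 11 * 17

# Carmichael property: a^(q-1) mod q = 1 for every a coprime to q.
fermat_fooled = all(pow(a, q - 1, q) == 1
                    for a in range(1, q) if gcd(a, q) == 1)

# Numbers a with a^(q-1) mod q = 1 but a^((q-1)/2) mod q different from +1 and -1.
examples = [a for a in range(2, q)
            if pow(a, q - 1, q) == 1 and pow(a, (q - 1) // 2, q) not in (1, q - 1)]

print(fermat_fooled, examples[:5])
print(pow(5, 280, q), pow(5, 560, q))
```

For a = 5, for instance, 5^280 mod 561 = 67, which is neither +1 nor −1, although 5^560 mod 561 = 1. This cannot happen modulo a prime number, so 561 is exposed as composite.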
Let us now carry out a certain number of such prime number tests, for example 30. If p
is not a prime number, then it holds true:
The probability that the 1st test provides a witness is smaller than 1/4.
The probability that the 2nd test provides a witness is smaller than 1/4.
The probability that the 1st and 2nd test provide witnesses is smaller than
(1/4) · (1/4).
The probability that all 30 tests provide witnesses is smaller than 1/4^30 = 1/2^60.
In Part III of the book we will learn why one can multiply the probabilities in this way and
how one can determine the probability with the help of such a test that a randomly chosen
number is prime if it has passed 30 tests. This is different from 1 − 1/2^60! See the second
example after Theorem 19.8 in Sect. 19.3.
If one has found two numbers p and q, for example, which have passed 30 tests, then it is
assumed that they are prime and thus the RSA key can be generated. Is one taking a risk
with this? The probability that I will have a five twice in a row in Powerball is greater
than the probability of not catching a prime number. It just doesn’t happen!
I have heard that computer scientists refer to such numbers as “prime numbers of industrial
quality”. A real mathematician’s hair stands on end! But it works, and thus such a prime
number search is justified. Of course, such “prime numbers”—just like industrial dia-
monds—are not as valuable as the real thing.
Random Numbers
The generation of a cryptographic key always starts with a random number. In RSA you
need a whole series of candidates to test them for their primality. These candidates must
be chosen randomly. Random numbers also play an important role in other areas of com-
puter science, for example for simulations. In this book I used them in a Monte Carlo
method for integration, see the example after Calculation Rule 22.12 in Sec. 22.3. How
can you find such numbers?
Even if it does not always look like it in everyday use of the computer: Computers
work deterministically, that is, their outputs are predetermined by the inputs and the pro-
grams and reproducible. This would be fatal in key generation: A deterministic process
would always generate the same keys.
Real random numbers are difficult to implement in the computer. Implementations use, for
example, physical processes such as thermal noise in a semiconductor or data entered by the
user at random, such as mouse movements or time gaps between keyboard inputs.
However, for many applications so-called pseudorandom numbers are used. A pseudorandom
number generator outputs a reproducible sequence of numbers that look random after
entering a starting value. It is not trivial to assess the quality of such a sequence of num-
bers. It must at least pass a whole series of statistical tests before the corresponding ran-
dom number generator can be used. In Sec. 22.4 we will deal a little with such tests.
Generators used in cryptography must meet another requirement in addition to these sta-
tistical tests: Even if all the numbers in the sequence are known up to a certain point, it
must not be possible to predict the next number.
The modulo operation is suitable for generating random numbers. A common method
is to specify numbers a, b, m and after entering the starting value z0 to calculate the
(i + 1)-th random number from the i-th as follows:
$$z_{i+1} = (a z_i + b) \bmod m.$$
This algorithm is called a linear congruential generator. Of course, not all numbers a, b and
m are suitable. Try it out to see what happens if b = 0 and a is a divisor of m. However,
if the numbers are chosen wisely, such a generator will generate pseudorandom
numbers that are well distributed statistically. The numbers are between 0 and m − 1; with
a good choice the generator has the maximum period m. There are long lists of suitable
values in the literature. I will give you two to try out: a = 106, b = 1283, m = 6075 and
a = 2416, b = 374 441, m = 1 771 875.
When implementing the second selection, you must make sure that you do not pro-
duce an overflow. How big does your data type have to be at least?
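The generator and the questions raised here can be explored with the following Python sketch. The names `lcg` and `period` are assumptions of this example, and the degenerate parameter set a = 3, b = 0, m = 81 is an arbitrary choice to show what can go wrong. Python integers cannot overflow, so the size of the largest intermediate value a · z + b is computed explicitly instead.

```python
from itertools import islice


def lcg(a, b, m, z0):
    """Linear congruential generator: yields z1, z2, ... with z_{i+1} = (a*z_i + b) mod m."""
    z = z0
    while True:
        z = (a * z + b) % m
        yield z


def period(a, b, m, z0=0):
    """Length of the cycle the sequence eventually enters (brute force)."""
    seen = {}
    z, i = z0, 0
    while z not in seen:
        seen[z] = i
        z = (a * z + b) % m
        i += 1
    return i - seen[z]


first_values = list(islice(lcg(106, 1283, 6075, 0), 5))
full_period = period(106, 1283, 6075)

# A bad choice: b = 0 and a divides m, so the sequence collapses to 0.
degenerate = list(islice(lcg(3, 0, 81, 1), 5))

# Largest intermediate value for the second parameter set, before taking mod m.
largest = 2416 * (1771875 - 1) + 374441

print(first_values, full_period, degenerate)
print(largest, largest.bit_length())
```

For the first parameter set the period is indeed the maximum m = 6075. The largest intermediate value of the second set needs 32 bits, so an unsigned 32-bit type is just sufficient, while a signed 32-bit type would overflow.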
Linear congruential generators are not to be used in cryptography because they are
predictable. However, the brainpower that has already been invested in the development
of secure algorithms can be used a second time in the field of cryptography:
Random numbers arise from continued encryption. There is a whole range of methods for
this. For example, the RSA generator is based on the security of the RSA algorithm. As
there, large prime numbers p and q are chosen, n = p · q and e, d are calculated with
e · d ≡ 1 mod (p − 1)(q − 1). Then
$$x_{i+1} = x_i^{\,e} \bmod n$$
is calculated. x0 is the starting value here. Anyone who can calculate the number xi+1
from xi without knowing e can also crack RSA. The generator is supposed to generate a
sequence of zeros and ones, so only the last bit of xi+1 is chosen as the random number.
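A toy version of the RSA generator might look like this. The Python sketch uses the tiny illustrative values p = 61, q = 53, e = 17 and the starting value 1234, all of which are assumptions of this example and far too small for real use; `rsa_bit_generator` is a hypothetical name.

```python
from itertools import islice
from math import gcd


def rsa_bit_generator(p, q, e, x0):
    """Yield the last bit of x_{i+1} = x_i^e mod n for i = 0, 1, 2, ..."""
    n = p * q
    assert gcd(e, (p - 1) * (q - 1)) == 1   # e must be a valid RSA exponent
    x = x0
    while True:
        x = pow(x, e, n)
        yield x & 1                         # only the last bit is output


bits = list(islice(rsa_bit_generator(61, 53, 17, 1234), 16))
same_again = list(islice(rsa_bit_generator(61, 53, 17, 1234), 16))
print(bits)
print(bits == same_again)
```

The second run reproduces the first one exactly: the sequence is completely determined by the starting value.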
The generated random number sequence depends on the starting value. If I want to
generate a cryptographic key and choose a four-byte long value as the starting
value for the generator, I may have the best generator and choose the longest
conceivable keys, but I still only get 2^32 different keys. No problem for a hacker who knows how
I generate my keys. A good encryption system therefore needs a good random number
generator in addition to the algorithm and long, truly random starting values. The latter
5.8 Comprehension Questions and Exercises
Comprehension questions
1. A ring does not necessarily have to have a 1-element. Take another look at the
examples in Sect. 5.2. Can you find a ring without 1?
2. What is √(−25) in C?
3. Is it true in C that x · x = 0 ⇔ x = 0?
4. Why is it always true in the field Z/pZ that (p − 1)(p − 1) = 1?
5. If you have a ring homomorphism from R to S, is it automatically also a homomor-
phism of the additive groups of R and S?
6. If you want to generate keys for a cryptographic algorithm, you need a random
number generator. Why are the random number generators that are standard in pro-
gramming languages or operating systems not suitable for this purpose?
7. What is the great importance of elliptic curves in cryptography?
8. Which two unsolved mathematical problems are the basis for most public-key
algorithms?
Exercises
1. Set up an addition and multiplication table for Z/6Z. Show that Z/6Z \ {0} does
not form a group with multiplication.
2. Assign the elements of the table (5.3) to the elements of the S3.
3. Show that (R+ , · ) forms a group.
4. Show that in the field C it holds: z · z̄ = |z|^2, z^(−1) = z̄/|z|^2.
5. Let z = 4 + 3i, w = 6 + 5i. Express z^(−1) and w/z in the form a + bi. (For this, use
exercise 4.)
6. Express (2 − i)/(1 + i) in the form a + bi.
7. In the field C of complex numbers let z = √2/2 + (√2/2) i. Calculate and draw in the
complex plane z, z^2, z^3, z^4, z^5.
8. Carry out division with remainder for the following polynomials and then make
the test in each case:
a) in Z/2Z[x]: (x^8 + x^6 + x^2 + x)/(x^2 + x),
b) in Z/2Z[x]: (x^5 + x^4 + x^3 + x + 1)/(x^3 + x^2 + 1),
c) in Z/5Z[x]: (4x^3 + 2x^2 + 1)/(2x^2 + 3x).
9. Show in Z/nZ[x], n not a prime, there are polynomials of degree two with more
than two distinct roots.
10. Show that for n ∈ N the set (R^n, +) forms a group with the following addition:
(a1, a2, . . . , an) + (b1, b2, . . . , bn) := (a1 + b1, a2 + b2, . . . , an + bn).
11. Show that the following mappings are homomorphisms. R^2 and R^3 are the groups
from the last exercise. Calculate ker f and ker g.
a) f : R^3 → R^2, (x, y, z) ↦ (x + y, y + z)
b) g : R^2 → R^3, (x, y) ↦ (x, x + y, y)
12. Show that in Definition 5.2, Axiom (G2), the requirement of uniqueness can
be dispensed with: In a group, the inverse a^(−1) of an element a is uniquely
determined.
13. Show that in an ISBN number, each individual digit error and also the transposi-
tion of the check digit with one of the preceding digits can be detected.
14. In R^2, the set of (x, y) with y = mx + t forms a line. m is the slope and t is the
y-intercept. Now investigate lines over the field GF(p), that is, in GF(p)^2:
a) Show that each line in GF(p)^2 contains exactly p points.
b) Use a mathematical tool to plot the points of the line for some prime numbers
and for some m, t. For example, take p = 31, m = 1, 2, 13, 16 (= 1/2), 29 (= −2)
and t = 0.5. Compare the drawings with the corresponding lines in R^2.
6 Vector Spaces
Abstract
Vector spaces are an algebraic structure that is particularly important to us. This is
partly because the space we live in can be regarded as a vector space. For example,
we constantly work with the points (the vectors) of space and compute movements in
it in graphical data processing and robotics. To represent spatial objects in a plane, for
example on a screen, we have to carry out mappings from the three-dimensional to the
two-dimensional space and also examine movements in the plane.
It will turn out that vector spaces are also a powerful tool in other areas of math-
ematics. So we will solve linear equations with their help, later we will also get to know
applications in analysis.
For this reason, I dedicate a chapter to the structure of the vector space. In the follow-
ing chapters we will deal with vector space applications.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_6
[Figure: the coordinate axes x, y, z of R3 with the origin (0,0,0) and the points (1,0,0), (0,1,0), (0,0,1)]
If we consider sets of vectors rather than individual vectors, these sets describe
shapes in R2, R3, …, Rn. Lines and planes will be particularly important to us:
If u, v, w are vectors in Rn, then the set g := {u + λv | λ ∈ R} is the line in
the direction of the vector v through the endpoint of the vector u, and the set
E := {u + λv + µw | λ, µ ∈ R} is the plane that passes through u and contains
the arrows belonging to v and w. Make these two constructions clear
to yourself by trying out some values for λ and µ (Fig. 6.5).
In my explanation of the plane I cheated a little: What happens if v and w have the same
direction? Try to draw the set E for this special case.
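Trying out values for λ and µ can also be done mechanically. The following Python sketch evaluates u + λv and u + λv + µw for a few parameters; the helper names (`add`, `scale`, `line_point`, `plane_point`) and the sample vectors are assumptions chosen for illustration. The last lines look at the special case in which w has the same direction as v.

```python
def add(u, v):
    """Componentwise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))


def scale(lam, v):
    """Scalar multiple lam * v."""
    return tuple(lam * a for a in v)


def line_point(u, v, lam):
    """Point u + lam*v on the line through u in the direction v."""
    return add(u, scale(lam, v))


def plane_point(u, v, w, lam, mu):
    """Point u + lam*v + mu*w of the set E."""
    return add(add(u, scale(lam, v)), scale(mu, w))


u, v, w = (1, 0, 0), (0, 1, 0), (0, 0, 1)
for lam in (-1, 0, 1, 2):
    print(lam, line_point(u, v, lam))

# Special case: w2 = 2*v has the same direction as v.
w2 = scale(2, v)
degenerate = {plane_point(u, v, w2, lam, mu)
              for lam in range(-2, 3) for mu in range(-2, 3)}
# Every sampled point still has the form (1, y, 0), i.e. lies on the line g.
print(all(x == 1 and z == 0 for x, y, z in degenerate))
```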
The surface of more complex objects in R3 can be approximated by polygons, flat sur-
faces bounded by line segments.
Similar to groups, rings and fields in Chap. 5, I now want to define an abstract algebraic
structure by putting together typical properties of R2 and R3. We thus obtain the axioms of
the vector space and from now on we will only use these axioms. The resulting theorems
then apply to all vector spaces, in particular of course also to the already known examples.
Definition 6.1: The vector space axioms Let K be a field. A vector space V with
scalars from K consists of a commutative group (V, +) and a scalar multiplication
· : K × V → V, (λ, v) ↦ λ · v, so that for all v, w ∈ V and for all λ, µ ∈ K the following holds:
We call a vector space with scalar field K a K -vector space or vector space over K . If
K = R or K = C, we speak of a real or complex vector space. This will almost always
be the case with us.
For Rn, these properties are quite elementary to check. The surprising thing is that these few
rules are enough to characterize vector spaces. We will find that the vector spaces Rn are
important prototypes of vector spaces.
So the definition of a vector space always involves a field. In order not to confuse the
elements of the two structures, one usually denotes the scalars with small Greek letters, for
example λ, µ, ν, and the vectors with small Latin letters, such as u, v, w. Physicists
often draw arrows over the vectors (v⃗) to be quite clear. I will only do this if there is a
risk of confusion, such as between the 0 ∈ K and the 0⃗ ∈ V, the 0-element of the additive
group V, the zero vector. In Rn it is 0⃗ = (0, 0, . . . , 0).
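Proof of (V5):
\[ \lambda \cdot \vec{0} = \lambda \cdot (\vec{0} + \vec{0}) = \lambda \cdot \vec{0} + \lambda \cdot \vec{0}. \]
The first equality uses 0⃗ = 0⃗ + 0⃗, the second uses (V3).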
The zero vector is the zero polynomial, the polynomial whose coefficients are
all 0. ◄
The theory of vector spaces is a powerful tool for such function spaces. Everything we learn
about vector spaces can later be used for functions, polynomials, and sequences.
(U1) u + v ∈ U for all u, v ∈ U,
(U2) λu ∈ U for all λ ∈ K, u ∈ U.
Proof: First, (U, + ) is a subgroup of V by Theorem 5.4 from Sect. 5.1, because for all
u, v ∈ U it holds:
u + v ∈ U, −u = (−1)u ∈ U.
Here −u = (−1)u holds by (V7), and (−1)u ∈ U holds by (U2).
A simple negative criterion consists in testing whether the zero vector is contained in U:
Every subspace must contain 0⃗. Therefore, a line in R3 that does not go through the origin
cannot be a subspace.
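The negative criterion is easy to try out on a computer. The sketch below works in R2 instead of R3 to keep it short, and uses the lines y = 3x and y = 3x + 4 as examples; the function names `on_line` and `passes_zero_test` are assumptions made for illustration. Note that passing the test is only a necessary condition, not a proof that a set is a subspace.

```python
def on_line(m, c):
    """Membership test for the line y = m*x + c in R^2."""
    return lambda point: point[1] == m * point[0] + c


def passes_zero_test(contains, dim=2):
    """Necessary condition only: a subspace must contain the zero vector."""
    return contains((0,) * dim)


through_origin = on_line(3, 0)
shifted = on_line(3, 4)

print(passes_zero_test(through_origin))  # True
print(passes_zero_test(shifted))         # False: cannot be a subspace

# (U1) fails as well on the shifted line: (0, 4) and (1, 7) lie on it,
# but their sum (1, 11) does not.
print(shifted((0, 4)), shifted((1, 7)), shifted((1, 11)))
```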
Examples of subspaces
Theorem 6.6 Let V be a vector space and let M ⊂ V . Then span M is a subspace of V .
Lines and planes through the origin in R3 are special cases of this theorem. The proof
proceeds analogously, as in Example 4.
In Sect. 5.6 we dealt with homomorphisms of groups and rings. The structure-preserving
mappings (the homomorphisms) between vector spaces are called linear mappings. Linear
mappings of a vector space to itself will be particularly important to
us. With such mappings, for example, we can describe movements in space. Bijective
linear mappings (the isomorphisms) are a means of determining when vector spaces can
be considered essentially equal. For example, there is no essential difference between R2
and U := {(x1 , x2 , 0) | x1 , x2 ∈ R} ⊂ R3. In this example, we will specify a bijective linear
mapping between R2 and U and thus consider the two vector spaces as “equal”. First the
definition, which differs somewhat from that of the group or ring homomorphism, since
we now have to take into account the scalar multiplication:
The vector spaces U and V are called isomorphic if there is a bijective linear
mapping f : U → V. We write U ≅ V for this.
The first of the conditions states that a linear mapping is in particular also a homomorphism
of the underlying additive groups of U and V. Note in the second condition that
on the left the scalar multiplication is carried out in U, on the right in V, both times with
the same element λ ∈ K. K remains fixed under the linear mapping. There are no linear
mappings between vector spaces with different scalar fields, so there is, for example, no
linear mapping between C7 and R5.
At this example you can see where the name “linear mapping” comes from. At the end of
this chapter (in Theorem 6.23) we will see that the images of linear mappings can always
only be linear combinations of the original coordinates. Squares, roots and similar things
have no place in this theory. ◄
Proof: We have to check (6.1) for g: Let v1 = f(u1), v2 = f(u2). Since g is inverse to f,
u1 = g(v1), u2 = g(v2) and thus
g(v1 + v2) = g(f(u1) + f(u2)) = g(f(u1 + u2)) = u1 + u2 = g(v1) + g(v2),
g(λv1) = g(λf(u1)) = g(f(λu1)) = λu1 = λg(v1).
Theorem 6.11 The linear mapping f : U → V is injective if and only if ker f = {0}.
Examples
1. $f : \mathbb{R}^2 \to \mathbb{R}^2,\ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1 - 2x_2 \\ 0 \end{pmatrix}$. The kernel consists of all (x1, x2) with the
property x1 − 2x2 = 0, that is, with x1 = 2x2. The image consists of all (x1, x2) with
x2 = 0. Look at the drawing in Fig. 6.6.
2. $f : \mathbb{R}^3 \to \mathbb{R}^2,\ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \mapsto \begin{pmatrix} x_1 - x_2 + x_3 \\ 2x_1 - 2x_2 + 2x_3 \end{pmatrix}$. The kernel consists of the set of
(x1, x2, x3) for which the following holds:
\[ \begin{aligned} x_1 - x_2 + x_3 &= 0 \\ 2x_1 - 2x_2 + 2x_3 &= 0 \end{aligned} \]
For example, (1, 1, 0) and (0, 1, 1) are solutions of these equations, but by no means
all of them: (0, 0, 0) and (1, 2, 1) are further solutions. How can we specify the
solutions completely? ◄
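Whether a given vector belongs to the kernel in Example 2 can be checked by simply evaluating the mapping. The following Python sketch does this for the vectors mentioned above and, as a contrast, for (1, 1, 1); the function name `in_kernel` is an assumption for illustration.

```python
def f(x):
    """The mapping from Example 2: R^3 -> R^2."""
    x1, x2, x3 = x
    return (x1 - x2 + x3, 2 * x1 - 2 * x2 + 2 * x3)


def in_kernel(x):
    """True if x is mapped to the zero vector of R^2."""
    return f(x) == (0, 0)


for x in [(1, 1, 0), (0, 1, 1), (0, 0, 0), (1, 2, 1), (1, 1, 1)]:
    print(x, f(x), in_kernel(x))
```

Such a check confirms individual solutions, but it does not answer the question of how to describe all of them.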
A simple, but very important theorem helps us to determine the structure of such solu-
tions in the future:
I show that the kernel is a subspace and use the subspace criterion from Theorem 6.4 for this:
If you look at the two equations from Example 2 again, you will now see an interesting
relation for the first time, which we will examine in more detail below: The solution
set of the equation system is a vector space, namely a subspace of R3. Our next big
goal, which we will attack in the rest of this chapter, will be to describe such subspaces
exactly. If we can do that, we are able to completely specify the solutions of linear equation
systems, such as the equation system from Example 2.
Theorem 6.12 can be generalized:
We want to describe subspaces of vector spaces in more detail. First, we will take a
closer look at the generating sets. Let’s look at Example 2 after Theorem 6.11 again: We
had seen that the solution set of the two equations
x1 − x2 + x3 =0
2x1 −2x2 +2x3 =0
is a subspace U of R3. u1 = (1, 1, 0), u2 = (0, 1, 1), u3 = (0, 0, 0), u4 = (1, 2, 1) were ele-
ments of this solution space. With these elements, the sums and multiples of the ele-
ments are also in U and in general every linear combination of these 4 vectors, since U
is a vector space. Do we get all elements of U by such linear combinations? Is maybe
U = span(u1 , u2 , u3 , u4 )? Then the space U would be completely determined by the speci-
fication of the 4 solutions. We will see that this is correct; for each subspace of R3 we can
specify vectors that generate (span) this space by linear combinations. In our example, u1
to u4 are such generating vectors.
Further, u4 = u1 + u2, so u4 is already a linear combination of u1 and u2 and thus
span(u1 , u2 , u3 , u4 ) = span(u1 , u2 , u3 ). Also u3 can be left out, it does not contribute to the
span.
u2 cannot be written as a linear combination of u1: u2 ≠ λu1 for all λ ∈ K, and so
span(u1, u2) ⊋ span(u1). Similarly, for all λ ∈ K we also have u1 ≠ λu2. Neither u1 nor u2
can therefore be removed from the generating set.
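The reduction of the generating set described above can be checked mechanically. The following Python sketch counts how many vectors of a list survive an elimination with exact fractions, which is the number of vectors that cannot be expressed by the others. The function name `rank` and the use of Gaussian elimination are assumptions made for this illustration, not part of the text.

```python
from fractions import Fraction


def rank(vectors):
    """Number of linearly independent vectors in the list (Gaussian elimination)."""
    rows = [[Fraction(a) for a in v] for v in vectors]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivot rows found so far
    for col in range(n_cols):
        # look for a row at or below position r with a nonzero entry in this column
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # clear this column in all other rows
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


u1, u2, u3, u4 = (1, 1, 0), (0, 1, 1), (0, 0, 0), (1, 2, 1)
print(rank([u1, u2, u3, u4]))  # 2: u3 and u4 contribute nothing new
print(rank([u1, u2]))          # 2: neither u1 nor u2 can be removed
print(rank([u1]))              # 1
```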
A vector u is called linearly dependent on a set of other vectors if it can be represented
as a linear combination of these vectors:
A set of vectors is called linearly dependent if one of them is linearly dependent on the
others. This statement can be formulated in the following way, which initially looks a bit
strange:
λ1 v1 + λ2 v2 + · · · + λn vn = 0⃗ ⇒ λ1, λ2, . . . , λn = 0. (6.2)
The vectors are called linearly dependent, if they are not linearly independent.
A (possibly infinite) system B of vectors is called linearly independent, if every
finite selection of vectors from B is linearly independent.
Does (6.2) really mean that none of the vectors is linearly dependent on the others?
If, for example, v1 is dependent on the other vectors, then there are scalars with
v1 = λ2 v2 + . . . + λn vn, and thus 0⃗ = (−1)v1 + λ2 v2 + . . . + λn vn is a linear combination
in which not all coefficients are 0. Conversely, if there is a linear combination
λ1 v1 + λ2 v2 + · · · + λn vn = 0⃗ in which not all λi = 0, for example λ1 ≠ 0, then
\[ v_1 = \frac{\lambda_2}{-\lambda_1} v_2 + \frac{\lambda_3}{-\lambda_1} v_3 + \cdots + \frac{\lambda_n}{-\lambda_1} v_n \]
follows, so v1 is dependent on the other vectors.
You often read the abbreviation w.l.o.g. or WLOG in connection with proofs in mathematics
books. This means “without loss of generality”. The above consideration makes it clear what
is meant by this: It does not have to be v1 that is linearly dependent on the other vectors; it can also
be v2, v3 or any other vi. But this does not change the proof, which proceeds analogously, since v1 is not
distinguished from the other vectors by anything. The problem is completely symmetrical.
We can simply assume “w.l.o.g.” in the proof that v1 is our special element. This trick often
makes proofs much easier, as it avoids many case distinctions. But this tool
should be used carefully: you have to be absolutely sure that there is really no restriction.
Caution: If a set of vectors is linearly dependent, this does not mean that each of the vectors
can be represented as a linear combination of the others: (1, 1, 0), (2, 2, 0) and (0, 1, 1)
are linearly dependent, since (2, 2, 0) = 2 · (1, 1, 0) + 0 · (0, 1, 1), but (0, 1, 1) cannot be
written as a linear combination of the two other vectors.
At first sight, the formula (6.2) looks abstract and inconvenient compared to the intuitive
explanation “one of the vectors is linearly dependent on the others”. But we will
see that the opposite is the case. Equation (6.2) is very well suited to test a set of vectors
for linear independence. We will often use this rule.
Are there vector spaces with infinitely many linearly independent vectors? For this an
Example
Let V = R[X] be the R-vector space of polynomials with coefficients from R (compare
Example 6 in Sect. 6.2 after Theorem 6.2). Then the system B = {1, x, x², x³, . . .}
is linearly independent.
Assume this is not the case. Then there is a finite subset of these vectors that can
be linearly combined, with coefficients that are not all 0, so that the zero polynomial results:
\[ \lambda_1 x^{m_1} + \lambda_2 x^{m_2} + \lambda_3 x^{m_3} + \cdots + \lambda_n x^{m_n} = 0. \]
On the left side there is a real polynomial, on the right side the zero polynomial. Now
let’s look at the polynomials again as functions for a moment. Then we know from
Theorem 5.25 that the polynomial on the left side has only finitely many roots. But the
polynomial on the right has the value 0 for all x ∈ R, it is the zero polynomial after all.
Obviously this is not possible, so the polynomials in B are linearly independent. ◄
The span of a set B of vectors forms a vector space V . If the set B is linearly independent,
it is called a basis of V . We will see in this section that every vector space has a basis and
derive properties of such bases.
(B1) span B = V (B generates V).
(B2) B is a linearly independent set of vectors.
Examples
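1. $\left\{ \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\}$ forms a basis of R2:
To (B1): $\begin{pmatrix} x \\ y \end{pmatrix} = x \begin{pmatrix} 1 \\ 0 \end{pmatrix} + y \begin{pmatrix} 0 \\ 1 \end{pmatrix}$, so the two vectors generate R2.
To (B2): From $\lambda \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \mu \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} \lambda \\ \mu \end{pmatrix} = \vec{0} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ follows λ, µ = 0, so
according to (6.2) they are linearly independent.
2. Similarly, it follows:
\[ \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \ldots, \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} \right\} \subset \mathbb{R}^n \]
is a basis of Rn.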
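(I) λ2 + µ3 = x,
(II) λ3 + µ4 = y.
Solve for λ and µ, and you will get
λ = −4x + 3y,
µ = 3x − 2y.
To (B2): From $\lambda \begin{pmatrix} 2 \\ 3 \end{pmatrix} + \mu \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ the equations
λ2 + µ3 = 0
λ3 + µ4 = 0
arise and you can calculate that they only have the solution λ = 0, µ = 0.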
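The solution formulas for λ and µ can be verified by substituting them back into the equations (I) and (II). The following Python sketch does this for a few points; the function names `coordinates` and `combine` are assumptions for illustration.

```python
def coordinates(x, y):
    """Coefficients (lam, mu) with lam*(2, 3) + mu*(3, 4) = (x, y)."""
    lam = -4 * x + 3 * y
    mu = 3 * x - 2 * y
    return lam, mu


def combine(lam, mu):
    """The linear combination lam*(2, 3) + mu*(3, 4)."""
    return (2 * lam + 3 * mu, 3 * lam + 4 * mu)


for point in [(1, 0), (0, 1), (5, -7)]:
    lam, mu = coordinates(*point)
    print(point, (lam, mu), combine(lam, mu) == point)
```

For (x, y) = (0, 0) the formulas return λ = 0 and µ = 0, which is the statement used for (B2).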
4. {1, x, x², x³, . . .} forms a basis of the vector space R[X]: We already know that the
vectors are linearly independent, but are they also generating? Each p(x) ∈ R[X]
has the form
We have already found bases for many of the vector spaces we know. But does every
vector space have a basis? Rn has a basis with n elements. Can we perhaps generate Rn
with fewer than n elements, if we only choose the vectors cleverly enough? Or could
it be that we find more than n linearly independent vectors in Rn? We will see: This is not the
case. Every vector space has a basis, and different bases of a vector space always have
the same number of elements.
The ideas behind the proof of these assertions are easy to understand and constructive.
The precise implementation of the proof is quite tedious, though. I would like to forego
this and only sketch the way.
The proof for this consists of an induction, which we could carry out in the case of vec-
tor spaces with finite bases: We construct a basis B by starting with a vector different
from 0. As long as the set B does not yet generate the vector space V , we can always add
a vector v from V , so that B remains linearly independent. In the case of vector spaces
that have infinite bases, the proof exceeds the knowledge you have acquired so far.
Next, we investigate the question of whether any two bases of a vector space have the
same number of elements. Our goal is the assertion: If a vector space has a finite basis,
then every other basis has the same number of elements.
It follows from this, of course: If a vector space has an infinite basis, then every other
basis of the space has infinitely many elements.
So let’s now restrict ourselves to vector spaces that have a finite basis. The core of the
proof is in the technical theorem:
Since the proof of this statement is simple and very typical for the calculations with lin-
early independent vectors, I would like to present it here:
Since B is a basis,
x = λ1 b1 + λ2 b2 + · · · + λn bn, (6.3)
holds, where not all coefficients λi are 0. For example (w.l.o.g.!), let λ1 ≠ 0. I claim that
then {x, b2, b3, . . . , bn} forms a basis of V. To prove this, (B1) and (B2) from Definition
6.16 have to be checked:
(B1): Let v ∈ V. There are µ1, µ2, . . . , µn with the property
v = µ1 b1 + µ2 b2 + · · · + µn bn. (6.4)
From the representation (6.3) of x we get
\[ b_1 = \frac{1}{\lambda_1} x - \frac{\lambda_2}{\lambda_1} b_2 - \frac{\lambda_3}{\lambda_1} b_3 - \cdots - \frac{\lambda_n}{\lambda_1} b_n. \]
As a result, we get the announced assertion. Its proof also consists of an induction, which
is quite tricky, however. I will only present the idea.
Theorem 6.19 If the vector space V has a finite basis B with n elements, then
every other basis of V also has n elements.
With the help of the exchange theorem, we can first derive the following assertion: If
B = {b1, b2, . . . , bn} is a basis of V and B′ = {b1′, b2′, . . . , bm′} is another basis of V with m elements,
then m ≤ n holds. For if m were greater than n, the elements of the basis B could
be successively exchanged for n elements of B′, while the basis property is retained. But
then the remaining m − n elements of B′ could no longer be linearly independent of the n
vectors that have already been exchanged.
If we now exchange the roles of B and B′, the same argumentation results in n ≤ m.
The number of elements in a basis of a vector space is therefore an important char-
acteristic of the space. We call this number the dimension. This is the number of vectors
that are needed to generate the vector space.
It is only after Theorem 6.19 that we are allowed to write this definition, now the concept of
dimension is “well-defined”.
Because otherwise one could extend it like in the proof of Theorem 6.17 to a basis. But a
basis cannot have more than n elements.
Since we already know bases of R2, R3, Rn, we also know the dimensions of these
spaces: Rn has the dimension n and we will never be able to find n + 1 linearly independ-
ent vectors in it, just as we will never be able to generate it from n − 1 elements.
We can also make precise assertions about possible subspaces with our new knowledge.
Let’s take R3 as an example: We already know some subspaces: Lines through the origin
g = {λu | λ ∈ R}, u ∈ R3 \ {0⃗}, have dimension 1. The vector u forms a basis of the subspace
g. Planes through the origin are given by E = {λu + µv | λ, µ ∈ R}, u, v ∈ R3 \ {0⃗}.
E only represents a plane if u and v are linearly independent, and then u and v also form a
basis of the subspace E. Are there any other subspaces? The zero space {0⃗} is the extreme
case. Let U be any subspace of R3. If U contains at least one vector different from 0⃗,
then U already contains a line through the origin. If U is not a line through the origin,
then U contains at least two linearly independent vectors and thus includes an entire plane
through the origin. If U is not a plane through the origin itself, then a further vector must
be contained outside the plane. But this vector is then linearly independent of the other two, so U contains three
linearly independent vectors and U = R3 holds. The subspaces of R3 are therefore
exactly {0⃗}, all lines and all planes through the origin, and R3 itself.
All subspaces of Rn can be classified in the same way.
Take a look at Example 2 after Theorem 6.11 again: There we found that the solution
set of the two linear equations
x1 − x2 + x3 = 0
2x1 − 2x2 + 2x3 = 0
is a subspace of R3. We guessed some solutions: (1, 1, 0), (0, 1, 1), (1, 2, 1). But we
didn’t know the structure of the space and the complete solution set yet. Now we can
find out: (1, 1, 0) and (0, 1, 1) are linearly independent, so the solution space has at least
dimension 2. (1, 1, 1) is, for example, not a solution, so the dimension cannot be 3 (otherwise
it would be R3). This means that the set of solutions consists of the plane spanned
by (1, 1, 0) and (0, 1, 1):
\[ L = \left\{ \lambda \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} + \mu \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \;\middle|\; \lambda, \mu \in \mathbb{R} \right\}. \]
Of course you can also enter other basis vectors here, the plane always remains the same.
In Chap. 8 we will deal extensively with the connection between solutions of linear equa-
tions and the subspaces of Rn.
With the development of the concepts of basis and dimension, we have made a great leap
forward in our further work with vector spaces. This is mainly due to the fact that we
can now describe vectors by their coefficients (the coordinates) in a basis representation.
At least in finite-dimensional vector spaces, these are only finitely many data and the calculation
with them is simple.
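In Rn we have already met coordinates: For
\[ v = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \in \mathbb{R}^3 \]
the x1, x2, x3 are the coordinates of v. What do coordinates look like in arbitrary vector spaces
with respect to a basis B? R3 with its standard basis helps us with the construction.
There we have
\[ v = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = x_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + x_3 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}. \]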
That is: If v is represented as a linear combination of the basis, then the coefficients of
this linear combination are just the coordinates of v. We now carry out this construction
in general:
Proof: Of course there is such a representation, since B is a basis. So it’s just the unique-
ness that has to be shown. Assume there are different representations of v:
v = x1 b1 + x2 b2 + · · · + xn bn ,
v = y1 b1 + y2 b2 + · · · + yn bn .
Then by forming the difference we get
These coordinates are of course basis-dependent, indeed they are even dependent on
the order of the basis vectors. If you change the basis or the order of the basis vectors,
you will get other coordinates.
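Let’s calculate the coordinates of the basis vectors themselves: It is
b1 = 1b1 + 0b2 + 0b3 + · · · + 0bn,
b2 = 0b1 + 1b2 + 0b3 + · · · + 0bn,
⋮
bn = 0b1 + 0b2 + 0b3 + · · · + 1bn,
and from that we get
\[ b_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}_B, \quad b_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}_B, \quad \ldots, \quad b_n = \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}_B. \]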
Oops, that’s certainly familiar to you. That’s exactly what the coordinates of the basis of
Rn looked like, which we once called the standard basis. Now we have made a very
surprising discovery: In the calculation with coordinates, one basis is as good as any other.
The basis vectors always have the coordinates of the standard basis.
There is also no basis that is in any way distinguished in the Rn: If, for example, we
work with the vectors of the R3 to describe the space we live in, we first look for a basis,
that is, an origin and three vectors of certain length that do not lie in one plane. Then
we calculate with coordinates relative to this basis. This randomly chosen basis is our
“standard basis”.
It is also important that this basis choice can be made so freely. For example, if one
describes the movements of a robot arm, one usually places a coordinate system in each
joint, with which one describes the movements of this joint exactly. In Sect. 10.3 we will
examine this application case in more detail.
However, it is quite possible and often necessary to establish the connection between
different bases computationally. If you calculate with coordinates of a basis C and the
coordinates of another basis B are given with respect to this basis, the coordinates of a
vector v can be given with respect to both bases. For this purpose, an
Example
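In R2 the basis B = {b1, b2} is given. The vectors $c_1 = \begin{pmatrix} 2 \\ 2 \end{pmatrix}_B$, $c_2 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}_B$ are linearly
independent. If $v = \begin{pmatrix} x \\ y \end{pmatrix}_B$ is given, we now want to calculate the coordinates
$v = \begin{pmatrix} \lambda \\ \mu \end{pmatrix}_C$ with respect to the basis C = {c1, c2}: If v = λc1 + µc2, then λ and µ are
these coordinates. But
\[ \begin{pmatrix} x \\ y \end{pmatrix}_B = v = \lambda c_1 + \mu c_2 = \lambda \begin{pmatrix} 2 \\ 2 \end{pmatrix}_B + \mu \begin{pmatrix} 1 \\ 2 \end{pmatrix}_B. \]
From this we get the determining equations for λ and µ:
x = 2λ + µ,
y = 2λ + 2µ.
Solved for λ, µ we get:
λ = x − (1/2) y,
µ = −x + y.
Now we can convert the coordinates. For example, we get
\[ \begin{pmatrix} 2 \\ 1 \end{pmatrix}_B = \begin{pmatrix} \tfrac{3}{2} \\ -1 \end{pmatrix}_C, \quad \begin{pmatrix} 2 \\ 2 \end{pmatrix}_B = \begin{pmatrix} 1 \\ 0 \end{pmatrix}_C, \quad \begin{pmatrix} 1 \\ 0 \end{pmatrix}_B = \begin{pmatrix} 1 \\ -1 \end{pmatrix}_C, \quad \begin{pmatrix} 0 \\ 1 \end{pmatrix}_B = \begin{pmatrix} -\tfrac{1}{2} \\ 1 \end{pmatrix}_C. \]
The last two calculated points are just the coordinates of the basis B with respect to
the basis C. ◄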
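The conversion between the two coordinate systems in this example is easy to automate. The following Python sketch applies the solved equations for λ and µ with exact fractions and also implements the opposite direction; the function names `b_to_c` and `c_to_b` are assumptions for illustration.

```python
from fractions import Fraction


def b_to_c(x, y):
    """Convert (x, y)_B into (lam, mu)_C for c1 = (2, 2)_B and c2 = (1, 2)_B."""
    x, y = Fraction(x), Fraction(y)
    return (x - y / 2, -x + y)


def c_to_b(lam, mu):
    """Opposite direction: lam*c1 + mu*c2 written in coordinates with respect to B."""
    return (2 * lam + mu, 2 * lam + 2 * mu)


for v in [(2, 1), (2, 2), (1, 0), (0, 1)]:
    lam, mu = b_to_c(*v)
    print(v, "->", (str(lam), str(mu)))
```

Applying `c_to_b` to the result of `b_to_c` returns the original coordinates, which is a quick plausibility check for the formulas.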
In Fig. 6.7 I have entered the two bases and the coordinates of the calculated points with
respect to both bases. Determine the coordinates for other points, graphically and math-
ematically.
Our new ability to calculate in vector spaces with coordinates can now be applied
to linear mappings. The hard work we did in connection with the concepts of basis and
dimension pays off: Some important and beautiful results are now becoming apparent.
First, we note that every linear mapping of a finite-dimensional vector space can
already be completely described by finitely many data, namely the images of a basis.
This will be very useful in the next chapter.
u = x1 b1 + x2 b2 + · · · + xn bn
v = y1 b1 + y2 b2 + · · · + yn bn
u + v = (x1 + y1 )b1 + (x2 + y2 )b2 + · · · + (xn + yn )bn
we have:
f (u) = x1 v1 + x2 v2 + · · · + xn vn
f (v) = y1 v1 + y2 v2 + · · · + yn vn
f (u) + f (v) = (x1 + y1 )v1 + (x2 + y2 )v2 + · · · + (xn + yn )vn = f (u + v).
Similarly,
λf(u) = λ(x1 v1 + x2 v2 + · · · + xn vn) = (λx1)v1 + (λx2)v2 + · · · + (λxn)vn = f(λu).
Now for uniqueness: Let g be another mapping with this property. Then, because of the
linearity of g
g(u) = g(x1 b1 + x2 b2 + · · · + xn bn ) = x1 g(b1 ) + x2 g(b2 ) + · · · + xn g(bn )
= x1 v1 + x2 v2 + · · · + xn vn = f (u)
and thus f and g agree everywhere.
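The construction used in this proof, defining f(u) = x1 v1 + · · · + xn vn from the coordinates of u, translates directly into code. In the following Python sketch the images v1 = (1, 2, 0) and v2 = (0, 1, 1) are hypothetical values chosen only for illustration, and `linear_map` is an assumed helper name. The input of f is the coordinate tuple of u with respect to the basis B.

```python
def linear_map(images):
    """Return f with f(b_i) = images[i]; f takes the coordinates of u in the basis B."""
    dim = len(images[0])

    def f(coords):
        # f(u) = x1*v1 + x2*v2 + ... + xn*vn, computed componentwise
        return tuple(sum(x * v[k] for x, v in zip(coords, images)) for k in range(dim))

    return f


f = linear_map([(1, 2, 0), (0, 1, 1)])   # hypothetical images v1, v2

u, v, lam = (3, -1), (2, 5), 4
u_plus_v = tuple(a + b for a, b in zip(u, v))
lam_u = tuple(lam * a for a in u)

print(f(u_plus_v) == tuple(a + b for a, b in zip(f(u), f(v))))   # additivity
print(f(lam_u) == tuple(lam * a for a in f(u)))                  # compatibility with scalars
```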
U ≅ V ⇔ dim U = dim V.
Why is this theorem so important? It immediately follows from this: Every
n-dimensional vector space is isomorphic to K n, in particular, every finite-dimensional
real vector space is isomorphic to a Rn.
Thus we now know the structure of all finite-dimensional vector spaces in the uni-
verse; there are (up to isomorphism) only the vector spaces K n.
This is a moment when the true mathematician first leans back in his chair and enjoys life. It
is rare that it is possible to classify all possible instances of a given algebraic structure. This
is a highlight of the theory, then one knows the structure exactly. Here we have at least suc-
ceeded in classifying all finite-dimensional vector spaces.
Flip back to the beginning of this chapter: We started with R2, R3 and Rn. From this we
crystallized out the vector space properties and generally defined the structure of a vector
space. Now we find (at least for real vector spaces of finite dimension) there is nothing but
the Rn. So why the whole thing, are we not just going around in circles?
We have gained several things: On the one hand, if we come across a structure in some
application that we see has the properties of a real vector space, then we now know: This
structure “is” the Rn, we can use everything we know about Rn. Of course, there is still the
great theory of infinite-dimensional vector spaces, which builds on what we have developed
in this chapter, but which I have only briefly mentioned here. And finally, the process that
led to Theorem 6.24 gave us a whole toolbox we can use in the future when working with
vectors. Our knowledge of bases, dimensions, linear dependence, coordinates and other
things will be immensely important in the next chapters. So even here the journey was the
goal.
For the proof of the Theorem 6.24: As an equivalence proof it consists of two parts. We
start with the direction from left to right:
For this purpose, let f : U → V be an isomorphism, b1 , b2 , . . . , bn a basis of U . We
show that then the images v1 , v2 , . . . , vn of b1 , b2 , . . . , bn form a basis of V . Thus the
dimensions of the spaces are equal.
The vi are linearly independent, as from
0⃗ = λ1 v1 + λ2 v2 + · · · + λn vn = f(λ1 b1 + λ2 b2 + · · · + λn bn)
it follows because of the injectivity (only f(0⃗) = 0⃗!) that also λ1 b1 + λ2 b2 + · · · + λn bn = 0⃗.
But since B is a basis, all λi = 0, and thus the vi are linearly independent.
The vi generate V: Since f is surjective, there is for every v ∈ V a u ∈ U with
f(u) = v, and thus
v = f(u) = f(x1 b1 + x2 b2 + · · · + xn bn) = x1 f(b1) + x2 f(b2) + · · · + xn f(bn)
= x1 v1 + x2 v2 + · · · + xn vn.
Now to the other direction: Let dim U = dim V and let u1 , u2 , . . . , un be a basis of U and
v1 , v2 , . . . , vn be a basis of V . According to Theorem 6.23 there are then two linear map-
pings which map the basis vectors to each other:
f : U → V , f (ui ) = vi ,
g : V → U, g(vi ) = ui .
These are inverse to each other and thus they are bijective linear mappings, that is iso-
morphisms, between U and V .
This proof contains two sub-statements, which are worth writing down:
Theorem 6.26 A linear mapping between two vector spaces of equal dimension,
which maps a basis to a basis, is an isomorphism.
In Chap. 8 I will go into the determination of solutions of linear equation systems. For
this, the following theorem is important, with which I would like to conclude this chapter:
In short: The more is mapped onto 0 in a linear map, the smaller the image becomes. The
limiting cases are still easy to follow: If the kernel is equal to {0⃗}, then f is injective, the
image is isomorphic to U and has dimension n. Conversely, if the image = {0⃗}, then the
kernel = U, so it has dimension n. But the theorem goes even further: For example, for
linear mappings f : R3 → R3: If the kernel has dimension 1, then the image has dimension
2 and vice versa.
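The balance between kernel and image described above (dim ker f + dim im f = dim U) can be observed by brute force if the scalar field is finite. The following Python sketch is an illustration under that assumption: it uses the field GF(2) instead of R, so that all vectors of GF(2)³ can be enumerated, and the matrix A as well as the function names are chosen freely for the example.

```python
from itertools import product


def apply(matrix, x, p=2):
    """Matrix-vector product over GF(p); the matrix is a list of rows."""
    return tuple(sum(a * b for a, b in zip(row, x)) % p for row in matrix)


def kernel_and_image_dims(matrix, n, p=2):
    """Enumerate GF(p)^n and return (dim ker, dim im) of the map x -> matrix * x."""
    vectors = list(product(range(p), repeat=n))
    kernel = [x for x in vectors if not any(apply(matrix, x, p))]
    image = {apply(matrix, x, p) for x in vectors}

    def dim(size):
        # a subspace of dimension d over GF(p) has exactly p**d elements
        return next(d for d in range(n + 1) if p ** d == size)

    return dim(len(kernel)), dim(len(image))


A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]   # over GF(2) the third row is the sum of the first two

print(kernel_and_image_dims(A, 3))   # (1, 2): the dimensions add up to 3
```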
Comprehension questions
1. Is a line in R3, which does not go through the origin, a vector space?
2. Are there any vector spaces that are a proper subspace of R2 and a proper superset
of the x-axis?
3. Is Q3 a subspace of R3?
4. Are R4 and C2 isomorphic as vector spaces? Can there be a linear mapping
between R4 and C2?
5. Are there vector spaces with bases that have different numbers of elements?
Exercises
g = {(x, y) | y = 3x + 4}, a, b ∈ R2 .
Find vectors a, b in R2 with g = {a + λb | λ ∈ R}.
3. Every graph of a line y = mx + c has a representation in the form
g = {a + λb | λ ∈ R}. Determine vectors a, b for this representation. Is there a representation
of every line g in R2 as a graph of a line y = mx + c?
4. Check the vector space conditions (V1) to (V4) for R3.
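5. Check whether R2 with the usual addition and with the scalar multiplication
$\lambda \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \lambda x \\ y \end{pmatrix}$ is a vector space.
6. Find a linear mapping f : R2 → R2 for which ker f = im f applies.
7. What do you say about exercise 6 if I replace R2 each time by R5?
8. The vectors $\begin{pmatrix} 3 \\ 5 \end{pmatrix}$ and $\begin{pmatrix} 2 \\ 4 \end{pmatrix}$ form a basis B of R2. Let $\begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2$.
Calculate the coordinates of $\begin{pmatrix} x \\ y \end{pmatrix}$ in the basis B.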
9. Show: If u, v are linearly independent vectors in V , then so are u + v and u − v
(make a sketch in R2!).
Matrices
7
Abstract
The use of coordinates and matrices in linear algebra lays the foundation for algo-
rithms in many areas of computer science. By the end of this chapter, you will know
In the last chapter we saw that every linear mapping is already completely determined
by the images of the basis vectors of a vector space. Matrices are suitable for describing
such a mapping. I do not want to cause illegibility by the abundance of coordinates right
from the start, so I will develop the basic concepts for matrices first in the vector space
R2. We calculate with coordinates with respect to a basis and investigate linear mappings
from R2 to R2.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_7
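From Theorem 6.23 it follows that every linear mapping f : R2 → R2 has the form:
\[ f\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \end{pmatrix}. \]
To see this, we first determine the images of the basis for the linear mapping f. If these
are
\[ f\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix} \quad \text{and} \quad f\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix}, \tag{7.1} \]
then for every vector x ∈ R2:
\[ f(x) = f\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = f\left( x_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = x_1 \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix} + x_2 \begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix} = \begin{pmatrix} a_{11} x_1 + a_{12} x_2 \\ a_{21} x_1 + a_{22} x_2 \end{pmatrix}. \]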
In the matrix $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ the $\begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix}$, $\begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix}$ are called columns or column
vectors and (a11, a12), (a21, a22) are called rows or row vectors. The first index is
called row index, it remains constant in a row, the second index is called column
index, it remains constant in a column. aij is thus the element in the i-th row and in
the j-th column. Occasionally the matrix A is also denoted by (aij).
Our previous knowledge about the connection between matrices and linear mappings can
now be formulated in the
Theorem 7.2 For every linear mapping f : R2 → R2 there is exactly one matrix
A ∈ R2×2 with the property f (x) = Ax. Conversely, every matrix A ∈ R2×2
defines a linear mapping by the rule f : R2 → R2, x → Ax.
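We have already seen that every linear mapping determines such a matrix. Conversely,
for a matrix A the mapping f : R2 → R2, x ↦ Ax is also linear: It is exactly the linear
mapping according to Theorem 6.23, which maps $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ to $\begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ to $\begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix}$.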
There is thus a bijective relationship between the set of linear mappings of R2 and the
set of matrices R2×2. We will make intensive use of this relationship and not distinguish
between matrices and mappings anymore. A matrix “is” a linear mapping and a linear
mapping “is” a matrix. So I will often speak of the linear mapping A : R2 → R2, where
A is a matrix from R2×2. A(x) is then the image of x under A, which is the same as Ax.
However, when making this identification, one should note that the matrix depends on
the basis. With respect to another basis, the corresponding matrix usually looks quite dif-
ferent. But first we fix a basis, so no problems arise from this.
If you look again at (7.1), you will notice that the columns of the matrix are just the
images of the basis vectors. You should remember this.
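The rule that the columns of the matrix are the images of the basis vectors can be tried out directly. The following Python lines are only a minimal sketch: the helper names `matrix_from_images` and `apply` and the concrete numbers are chosen here for illustration and do not come from the text.

```python
def matrix_from_images(image_e1, image_e2):
    """Build the 2x2 matrix whose columns are f(e1) and f(e2)."""
    (a11, a21), (a12, a22) = image_e1, image_e2
    return [[a11, a12],
            [a21, a22]]


def apply(A, x):
    """Compute the matrix-vector product Ax for a 2x2 matrix A."""
    return (A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1])


# Assumed example: f maps (1, 0) to (3, 5) and (0, 1) to (2, 4).
A = matrix_from_images((3, 5), (2, 4))
print(A)                 # [[3, 2], [5, 4]]
print(apply(A, (1, 0)))  # (3, 5), the first column
print(apply(A, (0, 1)))  # (2, 4), the second column
print(apply(A, (2, 1)))  # (8, 14) = 2 * (3, 5) + 1 * (2, 4)
```

The last line shows the linearity at work: once the two columns are fixed, the image of every other vector is determined.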
Now we finally want to look at some concrete linear mappings of R2 to itself. Imagine
a section of the R2 as a computer screen. If I work with a drawing program and want to
draw a rectangle, I can first drag a prototype of the rectangle onto the screen. Usually
it is marked in some places by small rectangles that show that I can grab and edit the
rectangle with the mouse. By various operations, I can generate the shapes 1 to 6 in
Fig. 7.1 from the rectangle. The prototype is transformed into figure 1, figure 1 into
figure 2, and so on.
A change of the figure means a mapping of the object points in R2. The computer has
to calculate the new coordinates of the object from the old ones and draw them on the
screen. With one exception, all these mappings are linear. We will now look for the linear
mappings that can generate these figures. The origin is always marked, the x1-axis goes
to the right, the x2-axis goes up.
[Fig. 7.1: the prototype and the shapes 1 to 6 generated from it]
interpret the points of the unit circle in the other quadrants in this way, we can now also
define sine and cosine for angles greater than 90◦: If β is the angle between the x1-axis
and the vector (x1 , x2 ) on the unit circle, we set cos β := x1 and sin β := x2. Cosine and
sine can also become negative.
Let us now look at what happens when we rotate at the origin by the angle α:
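More examples
6. From Fig. 7.3 one can read that the vector $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ is rotated into the vector $\begin{pmatrix} \cos\alpha \\ \sin\alpha \end{pmatrix}$, and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ becomes $\begin{pmatrix} -\sin\alpha \\ \cos\alpha \end{pmatrix}$.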
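The matrix of the rotation by the angle α is thus:
\[
R_\alpha = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}.
\]
For this mapping we have:
\[
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} \cos\alpha \, x_1 - \sin\alpha \, x_2 \\ \sin\alpha \, x_1 + \cos\alpha \, x_2 \end{pmatrix}.
\]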
Note that we always perform rotations counterclockwise.
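A drawing program could compute the rotated coordinates roughly as follows. This is only a sketch under the assumption that angles are given in radians; the function names `rotation_matrix` and `rotate` are illustrative.

```python
import math


def rotation_matrix(alpha):
    """Return the matrix of the counterclockwise rotation by alpha (radians)."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s],
            [s, c]]


def rotate(x, alpha):
    """Rotate the point x = (x1, x2) about the origin by alpha (radians)."""
    R = rotation_matrix(alpha)
    return (R[0][0] * x[0] + R[0][1] * x[1],
            R[1][0] * x[0] + R[1][1] * x[1])


# Rotating (1, 0) by 90 degrees should give (0, 1), up to rounding errors.
x1, x2 = rotate((1.0, 0.0), math.pi / 2)
print(round(x1, 12), round(x2, 12))
```

Because sine and cosine are computed in floating point, the result is only exact up to rounding, which is why the output is rounded before printing.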
But what is the exception I mentioned at the beginning of the examples? What change of
the figures on the screen cannot be represented by a linear mapping? I cheated a bit with
the origin in Fig. 7.1: It remains fixed for all mappings. But in order not to have to draw
the figures on top of each other, I moved the figure a bit further to the right from step to
step. This shift, the translation, is a very essential operation:
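\[
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}.
\]
A translation moves the origin, and we know that for linear mappings $f(\vec{0}) = \vec{0}$ always holds. Unfortunately, the translation therefore requires special treatment: it is not a linear mapping.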
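That a translation violates the rules of a linear mapping can be checked with a few lines of Python. The shift vector (2, 1) and the test vectors are arbitrary assumptions made for this sketch.

```python
def translate(x, a=(2, 1)):
    """Shift the point x by the fixed vector a (a is an arbitrary choice)."""
    return (x[0] + a[0], x[1] + a[1])


def add(u, v):
    """Add two vectors of R^2."""
    return (u[0] + v[0], u[1] + v[1])


zero = (0, 0)
u, v = (1, 0), (0, 1)

print(translate(zero))                   # (2, 1), not the zero vector
print(translate(add(u, v)))              # (3, 2)
print(add(translate(u), translate(v)))   # (5, 3), so additivity fails
```

Both tests fail for every shift vector other than the zero vector: the origin is moved, and the shift is applied twice on the right-hand side of the additivity condition.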
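If A, B are two linear mappings, A, B ∈ R2×2, then the composition is possible and B ◦ A
is again a linear mapping. Which matrix C belongs to it? If we calculate the images of
the basis vectors, we obtain the columns of the matrix C. For this, let
\[
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \quad
C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}.
\]
It follows that
\[
\begin{aligned}
\begin{pmatrix} c_{11} \\ c_{21} \end{pmatrix}
&= (B \circ A)\begin{pmatrix} 1 \\ 0 \end{pmatrix}
= B\left( A \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right)
= B \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix}
= \begin{pmatrix} b_{11} a_{11} + b_{12} a_{21} \\ b_{21} a_{11} + b_{22} a_{21} \end{pmatrix}, \\
\begin{pmatrix} c_{12} \\ c_{22} \end{pmatrix}
&= (B \circ A)\begin{pmatrix} 0 \\ 1 \end{pmatrix}
= B\left( A \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right)
= B \begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix}
= \begin{pmatrix} b_{11} a_{12} + b_{12} a_{22} \\ b_{21} a_{12} + b_{22} a_{22} \end{pmatrix}.
\end{aligned}
\]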
No one can remember this. But there is a trick with which one can easily determine the
matrix C from A and B. For this, you do not write the matrix A next to B but, as shown in
(7.2), above and to the right of it. The matrix C is then created in the field to the right of B,
directly below A. For the calculation of an entry in the matrix C, exactly the elements that
are in the same row of B and in the same column of A are needed. To calculate the element
$c_{ij}$, the elements of the i-th row of B have to be multiplied with those of the j-th column
of A and then added up:
\[
\begin{array}{cc|cc}
 & & a_{11} & a_{12} \\
 & & a_{21} & a_{22} \\ \hline
b_{11} & b_{12} & c_{11} & c_{12} \\
b_{21} & b_{22} & c_{21} & c_{22}
\end{array}
\tag{7.2}
\]
\[
c_{ij} = \sum_{k=1}^{2} b_{ik} a_{kj}.
\]
From now on, we write for the composition B ◦ A, shortly BA, and call this operation
matrix multiplication.
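The multiplication rule translates almost literally into a program. The following Python function is a minimal sketch; the name `matmul`, the representation of matrices as lists of rows and the size check are choices made here for illustration. The size check already allows matrices that are not 2×2.

```python
def matmul(B, A):
    """Return the product BA, i.e. the matrix of the composition B after A.

    Matrices are lists of rows. The number of columns of B must equal
    the number of rows of A.
    """
    if len(B[0]) != len(A):
        raise ValueError("sizes do not fit together")
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))]
            for i in range(len(B))]


# Two 2x2 matrices multiplied in both orders.
P = [[1, 2], [2, 4]]
Q = [[4, 3], [2, 1]]
print(matmul(P, Q))  # [[8, 5], [16, 10]]
print(matmul(Q, P))  # [[10, 20], [4, 8]]
```

The entry in row i and column j is the sum over the i-th row of the left factor times the j-th column of the right factor, exactly as in the scheme.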
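Examples
1.
\[
\begin{array}{cc|cc}
 & & 4 & 3 \\
 & & 2 & 1 \\ \hline
1 & 2 & 8 & 5 \\
2 & 4 & 16 & 10
\end{array}
\qquad\qquad
\begin{array}{cc|cc}
 & & 1 & 2 \\
 & & 2 & 4 \\ \hline
4 & 3 & 10 & 20 \\
2 & 1 & 4 & 8
\end{array}
\]
You can see from this that in general AB ≠ BA. The order matters for the composition
of linear mappings!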
2. For a rotation by 45°, cos α = sin α (Fig. 7.5). From the Pythagorean theorem, we
know $(\cos\alpha)^2 + (\sin\alpha)^2 = 1$, and from this we obtain here $\cos\alpha = \sin\alpha = 1/\sqrt{2}$.
Thus we get the following rotation matrix:
\[
R_{45^\circ} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}.
\]
After our preparations in the R2, we now examine general matrices and general linear
mappings. K stands for any field, usually it will be the field of real numbers for us. We
start with the definition of an m × n matrix.
7.2 Matrices and Linear Mappings … 173
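\[
\begin{pmatrix}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n \\
\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n
\end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}.
\tag{7.3}
\]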
We will first study the connection between matrices and linear mappings as in the two-
dimensional case. But before that, I would like to point out an interesting fact: In (7.3)
there are m linear equations of the form
\[
a_{i1} x_1 + a_{i2} x_2 + \dots + a_{in} x_n = b_i
\]
with the n unknowns x1 , x2 , . . . , xn. The notation Ax = b also stands as a shorthand for a
system of linear equations. In the next chapter, we will systematically solve such systems
of equations using this matrix notation.
But now to the linear mappings: The proof of the following theorem goes exactly as
in Theorem 7.2:
Theorem 7.4 For every linear mapping $f : K^n \to K^m$ there is exactly one matrix
$A \in K^{m \times n}$ with the property f(x) = Ax for all $x \in K^n$. Conversely, every matrix
$A \in K^{m \times n}$ defines a linear mapping $f : K^n \to K^m$ by the rule
\[
f : K^n \to K^m, \quad
x \mapsto Ax =
\begin{pmatrix}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n \\
\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n
\end{pmatrix}.
\]
Again, the columns of the matrix contain the images of the basis vectors, which, as we
know, completely determine the mapping. For example,
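\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
=
\begin{pmatrix}
a_{11} \cdot 1 + a_{12} \cdot 0 + \dots + a_{1n} \cdot 0 \\
a_{21} \cdot 1 + a_{22} \cdot 0 + \dots + a_{2n} \cdot 0 \\
\vdots \\
a_{m1} \cdot 1 + a_{m2} \cdot 0 + \dots + a_{mn} \cdot 0
\end{pmatrix}
=
\begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{pmatrix}.
\]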
One has to be careful not to confuse m and n: For a mapping from K n → K m one needs a
matrix from K m×n. The number of rows of the matrix, m, determines the number of elements
in the image vector, i.e. the dimension of the image space. But I can console you, we will
mostly deal with the case m = n.
As in the two-dimensional case, we will identify the matrices and their corresponding
linear mappings in the future.
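The identity mapping $\mathrm{id} : K^n \to K^n$ corresponds to the identity matrix
\[
I = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & 1
\end{pmatrix}.
\]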
Note that the identity matrix is always square! There is no identity mapping from K m to
K n, if n and m are different.
rotated by A around the x1-axis. Find the matrices yourself that describe rotations
around the x2- or around the x3-axis.
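2. We want to determine the matrix for the mapping
\[
f : \mathbb{R}^3 \to \mathbb{R}^2, \quad
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \mapsto
\begin{pmatrix} x_1 + \tfrac{1}{2} x_3 \\ x_2 + \tfrac{1}{2} x_3 \end{pmatrix}.
\]
The matrix must therefore be from R2×3. We look for the images of the basis vectors:
\[
f\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
f\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad
f\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \end{pmatrix},
\]
from this follows
\[
A = \begin{pmatrix} 1 & 0 & \tfrac{1}{2} \\ 0 & 1 & \tfrac{1}{2} \end{pmatrix}.
\]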
You see: If you write the images of the basis next to each other, a matrix of the
right size automatically comes out. You don’t have to pay attention to the number
of rows and columns anymore.
What does this mapping do? Let us imagine the x1-x2-plane as our drawing plane.
All points of this plane remain unchanged. Everything that lies in front of or
behind it is projected onto this plane and shifted a bit. From a three-dimensional
cube of edge length 1, which has a corner at the origin, the shape in Fig. 7.6
(which I can draw in R2) is obtained. ◄
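To see what this mapping does to the cube, one can simply push the eight corners through the matrix. The following Python sketch assumes the matrix A from the example; the helper name `project` is illustrative.

```python
from itertools import product

# Columns are the images of the three basis vectors, as in the example.
A = [[1, 0, 0.5],
     [0, 1, 0.5]]


def project(A, x):
    """Apply the 2x3 matrix A to a point x of R^3 and return a point of R^2."""
    return tuple(sum(row[k] * x[k] for k in range(3)) for row in A)


# The eight corners of the unit cube with one corner at the origin.
for corner in product((0, 1), repeat=3):
    print(corner, "->", project(A, corner))
```

The four corners with x3 = 0 keep their first two coordinates, while the four corners with x3 = 1 are shifted by (0.5, 0.5). Connecting the images of neighbouring corners gives the familiar oblique picture of a cube.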
Since the screens of our computers will probably only produce two-dimensional repre-
sentations for some more years, we have to map all three-dimensional objects that we
want to view on the screen in some way into the R2. Here you have learned a simple
mapping. A common projection is the one shown in Fig. 7.7, which gives a somewhat
more realistic view. This is also generated by a linear mapping. Try to find the matrix for
this yourself.
In graphical data processing, many other projections are common, for example central pro-
jections. Not all of them can be represented by linear mappings. You will find an exercise on
this at the end of the chapter.
When you multiply two matrices, you don’t have to think long about row and column
numbers, you will see: If the matrices don’t fit together, the procedure fails and, if it
works, the right size comes out automatically.
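Example
\[
\begin{array}{rrr|rrrr}
 & & & 0 & 3 & 4 & -1 \\
 & & & 2 & 0 & 1 & 1 \\
 & & & 0 & 1 & 0 & -2 \\ \hline
0 & 3 & 4 & 6 & 4 & 3 & -5 \\
2 & 0 & 1 & 0 & 7 & 8 & -4 \\
0 & 1 & 0 & 2 & 0 & 1 & 1
\end{array}
\]
◄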
But note that this only works if we consistently use column vectors and multiply the vector
from the right.
I would like to define a few more operations for matrices that we will need later:
Definition 7.5 For two matrices $A = (a_{ij})$, $B = (b_{ij}) \in K^{m \times n}$ and $\lambda \in K$, let
\[
A + B = \begin{pmatrix}
a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn}
\end{pmatrix},
\qquad
\lambda A = \begin{pmatrix}
\lambda a_{11} & \cdots & \lambda a_{1n} \\
\vdots & \ddots & \vdots \\
\lambda a_{m1} & \cdots & \lambda a_{mn}
\end{pmatrix}.
\]
So the components are simply added or multiplied by the factor λ. With these definitions,
$K^{m \times n}$ obviously becomes an m · n-dimensional K-vector space. If we think of the
components not arranged in a rectangle but written one after the other, we have just $K^{mn}$. For the
operations just defined, all the rules that we derived for vector spaces apply. There are a
few more that are related to the multiplication:
Theorem 7.6: Rules for matrices For matrices that fit together by their size, the
following rules hold:
\[
\begin{aligned}
A + B &= B + A, \\
(A + B)C &= AC + BC, \\
A(B + C) &= AB + AC, \\
(AB)C &= A(BC), \\
AB &\neq BA \quad \text{(generally)}.
\end{aligned}
\]
The first rule follows already from the vector space property that we just noticed. The
distributive laws can be easily verified with the help of the multiplication rule (7.5). If
you want to check the associativity law in this way, you can get terribly tangled up with
the indices. But there is another way! Let us remember the double nature of matrices:
They are also linear mappings, and mappings are always associative, as we saw in (5.2)
in Sect. 5.1 after Theorem 5.4. So there is nothing to do here. The last rule or rather non-
rule you already know. Here it is again as a reminder.
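A numerical experiment does not replace the proof, but it is a quick way to convince yourself of the rules. The sketch below checks them for randomly chosen integer matrices; the sizes, the range of the entries and the seed are arbitrary assumptions, and the helper names are illustrative.

```python
import random


def matmul(A, B):
    """Product AB of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def add(A, B):
    """Componentwise sum of two matrices of the same size."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def random_matrix(m, n, rng):
    """An m x n matrix with small random integer entries."""
    return [[rng.randint(-5, 5) for _ in range(n)] for _ in range(m)]


rng = random.Random(0)           # fixed seed, chosen arbitrarily
A = random_matrix(2, 3, rng)
B = random_matrix(3, 3, rng)
B2 = random_matrix(3, 3, rng)
C = random_matrix(3, 4, rng)

# Associativity and one distributive law hold exactly for integer entries.
print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))
print(matmul(A, add(B, B2)) == add(matmul(A, B), matmul(A, B2)))
# Commutativity usually fails, even for square matrices.
print(matmul(B, B2) == matmul(B2, B))
```

Integer entries are used on purpose: with floating-point numbers the two sides of the associative law can differ by rounding errors.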
We now focus for a moment on square matrices, i.e. on matrices that describe linear
mappings of a space into itself. We already know that some of these matrices are invert-
ible. In the following theorems, I would like to summarize what we can say so far about
invertible matrices. In doing so, we will harvest some fruits from Chap. 6:
(I1) There is a matrix A−1 ∈ K n×n with the property A−1 A = AA−1 = I .
(I2) The mapping A : K n → K n is bijective.
(I3) The columns of the matrix A form a basis of the K n.
(I4) The columns of the matrix A are linearly independent.
(I1) says nothing else than that there is an inverse mapping for A. See for this Theo-
rem 1.19 in Sect. 1.3. Thus, A is bijective. And since every bijective mapping f of a set
onto itself has an inverse mapping g for which f ◦ g = g ◦ f = id holds, (I1) and (I2) are
equivalent.
Theorem 6.25 states that a bijective linear mapping maps a basis onto a basis. The
columns of the matrix are precisely the images of the basis. Conversely, if the columns
form a basis, this means that A maps a basis onto a basis and from Theorem 6.26 it fol-
lows that A is an isomorphism, i.e. bijective. Thus we have the equivalence of (I2) and
(I3).
From (I4) of course (I3) follows and vice versa, n linearly independent vectors in an
n-dimensional vector space always form a basis (Theorem 6.21), so (I3) and (I4) are also
equivalent.
Theorem 7.8 The set of invertible matrices in K n×n forms a group with respect
to multiplication.
If you look up the group axioms in Definition 5.2 in Sect. 5.1, you will find that we have
already verified (G1) to (G3). The only thing still missing is that the set is closed under
the operation. Is the product of two invertible matrices again invertible? Yes, and we can
also specify the inverse matrix for AB. It follows from Theorem 5.3b that B−1 A−1 is the
inverse matrix for AB.
7.3 The Rank of a Matrix 179
The order is very crucial here! A−1B−1 is in general different from B−1A−1 and then is not
inverse to AB.
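Computing the inverse of a matrix is usually very difficult. But we will soon learn an
algorithm for it. For a 2×2 matrix we can still do the calculation completely by hand:
Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ be given. If an inverse matrix $\begin{pmatrix} e & f \\ g & h \end{pmatrix}$ exists, then it must hold:
\[
\begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} e & f \\ g & h \end{pmatrix}
= \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
We thus obtain four equations for the 4 unknowns e, f, g, h. These can be solved and we
obtain:
\[
\begin{pmatrix} e & f \\ g & h \end{pmatrix}
= \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}. \tag{7.7}
\]
This works of course only if ad − bc ≠ 0. And indeed: Exactly when ad − bc ≠ 0, the
inverse matrix exists and has the form given in (7.7).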
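Formula (7.7) and the rule for the inverse of a product can be checked with exact fractions. The following Python lines are a sketch; the function names and the two example matrices are assumptions made for illustration.

```python
from fractions import Fraction


def inverse_2x2(A):
    """Invert a 2x2 integer matrix with formula (7.7); None if ad - bc = 0."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        return None
    f = Fraction(1, det)
    return [[f * d, -f * b],
            [-f * c, f * a]]


def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[3, 2], [5, 4]]            # assumed example, ad - bc = 2
B = [[1, 2], [0, 4]]            # assumed example, ad - bc = 4
I = [[1, 0], [0, 1]]

print(matmul(A, inverse_2x2(A)) == I)                    # True
print(inverse_2x2(matmul(A, B))
      == matmul(inverse_2x2(B), inverse_2x2(A)))          # True
print(inverse_2x2([[1, 2], [2, 4]]))                     # None
```

The second line confirms for this pair that the inverse of AB is the product of the inverses in reversed order. The third matrix has ad − bc = 0 and therefore no inverse.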
Let us return to the not necessarily square matrices of the K m×n. We have already seen
several times that the column vectors of the matrices play a special role. We now want to
deal with the space that these vectors span. First, an important term:
Definition 7.9 The rank of a matrix A is the maximum number of linearly inde-
pendent column vectors in the matrix.
The rank is thus the dimension of the space spanned by the column vectors. We get a first
insight into the meaning of the term in the following theorem. Recall that Ax = b not
only means the image of x under the linear mapping A, but also the shorthand for a linear
system of equations (compare (7.3) and (7.4) after Definition 7.3). From the rank of the
matrix we can then read how large the solution space of the system of equations Ax = 0
is. Later we will also conclude from this the solutions of the system Ax = b.
a) im f = span{s1 , s2 , . . . , sn }.
b) ker f = {x | Ax = 0} is the set of solutions of the system of equations Ax = 0.
c) dim im f = rank A.
d) dim ker f = n − rank A.
For a): Since the si are all in the image and the image is a vector space, we have of
course im f ⊃ span{s1 , s2 , . . . , sn }. On the other hand, we have seen in Theorem 7.4 that
\[
f(x) = Ax =
\begin{pmatrix}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n \\
\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n
\end{pmatrix}
= s_1 x_1 + s_2 x_2 + \dots + s_n x_n
\]
holds, so every image element f (x) can be written as a linear combination of the column
vectors and thus we also have im f ⊂ span{s1 , s2 , . . . , sn }.
Point b) is exactly the definition of the kernel. Interesting here is the interpretation:
The kernel of the linear mapping is the solution set of the corresponding linear system
of equations. Point c) is an immediate consequence of a), since the rank is precisely the
dimension of span{s1 , s2 , . . . , sn }.
For d) we use Theorem 6.27: The sum of the dimensions of kernel and image gives
the dimension of the domain of definition.
Examples
0 ··· 0 1
� �� �
n columns
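4. $\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 2 & 0 & 2 \end{pmatrix}$ has rank 2: $s_1 + s_2 = s_3$.
5. $\begin{pmatrix} 1 & 2 & 4 \\ 2 & 4 & 8 \\ 3 & 6 & 12 \end{pmatrix}$ has rank 1, because the second and third columns are multiples of the first column.
6. $\begin{pmatrix} 1 & 3 & 0 & 4 \\ 2 & 5 & 1 & 0 \\ 3 & 8 & 1 & 4 \end{pmatrix}$. By looking at it, we don’t see anything at first. ◄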
Let us swap for a moment the role of columns and rows and calculate the maximum
number of linearly independent rows of the matrices, the “row rank”: for example 1 we
can’t say anything yet. In example 2 and 3 we get with the same argument again 3 and n
respectively. In example 4 the third row is twice the first, so the row rank is 2. In exam-
ple 5 the rows are also multiples of each other, so row rank 1. And in the last example we
now see that the third row is the sum of the first two: so row rank 2.
Of course you notice that in cases 2 to 5 always row rank = column rank. And of
course this is not a coincidence, but it is always the case. That’s why in example 1 the
rows also form a basis, and that’s why in example 6 the “true” rank is also 2.
I find this fact so astonishing that I think one can only believe it when one has
checked it a few times. Why don’t you try to combine the 3rd and 4th column vector of
example 6 linearly from the first two. Somehow it has to work!
Of course we need a theorem here, and the proof is not very difficult, but a rather
tricky index fiddling:
Theorem 7.11 For any m × n-matrix A the rank is equal to the maximum number
of linearly independent rows of the matrix: “column rank = row rank”.
This statement is often very useful, as we have already seen in example 6: One can
always determine the rank that is easier to calculate.
In the following proof, I will write field elements in lowercase and all vectors in
uppercase. Let
\[
A = \begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix},
\]
R1 , R2 , . . . , Rm be the rows and C1 , C2 , . . . Cn the columns of A. Let the row rank be r and
B1 = (b11 , . . . , b1n ), B2 = (b21 , . . . , b2n ), . . ., Br = (br1 , . . . , brn ) be a basis of the space
spanned by the rows. The row vectors can therefore be linearly combined from the vec-
tors B1 , . . . , Br:
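\[
\begin{aligned}
R_1 &= k_{11} B_1 + k_{12} B_2 + \dots + k_{1r} B_r \\
R_2 &= k_{21} B_1 + k_{22} B_2 + \dots + k_{2r} B_r \\
&\;\;\vdots \\
R_m &= k_{m1} B_1 + k_{m2} B_2 + \dots + k_{mr} B_r
\end{aligned}
\tag{7.8}
\]
Written out in components, the equation for the row $R_1$ reads
\[
(a_{11}, \dots, a_{1i}, \dots, a_{1n}) = k_{11} (b_{11}, \dots, b_{1i}, \dots, b_{1n}) + \dots + k_{1r} (b_{r1}, \dots, b_{ri}, \dots, b_{rn}). \tag{7.9}
\]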
From each of the rows of (7.8) we now pick out the i -th component (for the row R1 I
have marked this in (7.9)), and so we get a new set of m equations:
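\[
\begin{aligned}
a_{1i} &= k_{11} b_{1i} + k_{12} b_{2i} + \dots + k_{1r} b_{ri} \\
a_{2i} &= k_{21} b_{1i} + k_{22} b_{2i} + \dots + k_{2r} b_{ri} \\
&\;\;\vdots \\
a_{mi} &= k_{m1} b_{1i} + k_{m2} b_{2i} + \dots + k_{mr} b_{ri}.
\end{aligned}
\tag{7.10}
\]
On the left side of the equals signs in (7.10) stands just the i-th column vector $C_i$ and we can
write (7.10) as a new vector equation:
\[
C_i = \begin{pmatrix} a_{1i} \\ a_{2i} \\ \vdots \\ a_{mi} \end{pmatrix}
= \underbrace{\begin{pmatrix} k_{11} \\ k_{21} \\ \vdots \\ k_{m1} \end{pmatrix}}_{K_1} b_{1i}
+ \underbrace{\begin{pmatrix} k_{12} \\ k_{22} \\ \vdots \\ k_{m2} \end{pmatrix}}_{K_2} b_{2i}
+ \dots
+ \underbrace{\begin{pmatrix} k_{1r} \\ k_{2r} \\ \vdots \\ k_{mr} \end{pmatrix}}_{K_r} b_{ri}.
\tag{7.11}
\]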
Now it is done: We have in (7.11) the column Ci linearly combined from the newly
defined vectors K1 , K2 , . . . , Kr. What we have done in (7.9) with the index i , we can also
do with all other indices from 1 to n, and so we get that every column vector Ci can be
linearly combined from the vectors K1 , K2 , . . . , Kr. This means that the dimension of the
column space is in any case ≤ r and thus column rank ≤ row rank.
7.4 Comprehension Questions and Exercises 183
The problem is symmetric, however: If we swap columns and rows in the proof, we
get just as well: row rank ≤ column rank. This finally means that column rank = row
rank, and from now on there is only one rank of a matrix.
This is an immediate consequence of Theorem 7.7: If rank A = n, then the columns form
a basis, A is invertible. And if A has an inverse, then again the columns form a basis, so
rank A = n.
Comprehension questions
1. True or false: If A and B are arbitrary matrices, then the multiplication is not com-
mutative, but the products A · B and B · A can always be calculated.
2. If the n basis vectors in Rn are stretched by different factors, is the resulting map-
ping a linear mapping?
3. Why can a simple translation by a fixed vector in the R3 not be described by a lin-
ear mapping?
4. Can two non-square matrices be multiplied so that a (square) identity matrix results?
If so: What does this mean for the linear mappings associated with the matrices?
5. Is there a difference between the multiplication ⟨matrix⟩ · ⟨vector⟩ and
⟨matrix⟩ · ⟨single-column matrix⟩?
6. Why is matrix multiplication associative?
7. Is the set of equal-sized square matrices with the multiplication a group? Is the set
of equal-sized matrices with the addition a group?
Exercises
1. In Fig. 7.8 you see a central projection sketched, with the help of which objects
of the space can be mapped into the R2. The points of the R3, which are to be
projected, are connected with the projection center, which is located at the point
(0, 0, 1).
The projection plane is the x-y-plane. The intersection point of the connecting line
with the projection plane gives the point to be represented. Calculate where the
point (x, y, z) is mapped to by this projection. Is this mapping defined for all points
of the R3? Is it a linear mapping?
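2. The vectors $\begin{pmatrix} 3 \\ 5 \end{pmatrix}$ and $\begin{pmatrix} 2 \\ 4 \end{pmatrix}$ form a basis B of $\mathbb{R}^2$. Let $\begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2$. Calculate the coordinates of $\begin{pmatrix} x \\ y \end{pmatrix}$ with respect to the basis B.
3. Determine the matrix of the mapping that maps $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ to $\begin{pmatrix} 3 \\ 5 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ to $\begin{pmatrix} 2 \\ 4 \end{pmatrix}$ and
calculate the inverse of this matrix.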
4. Determine the matrices of the mappings
a) f : R3 → R2, (x, y, z) → (x + y, y + z)
b) g : R2 → R3, (x, y) → (x, x + y, y).
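5. Perform the following matrix multiplications:
a) $\begin{pmatrix} -2 & 3 & 1 \\ 6 & -9 & -3 \\ 4 & -6 & -2 \end{pmatrix} \cdot \begin{pmatrix} 3 & 1 & 1 \\ 2 & 0 & 1 \\ 0 & 2 & -1 \end{pmatrix}$
b) $\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 4 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 4 & 0 & 0 \end{pmatrix}$
c) $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & 2 \\ 0 & 4 \end{pmatrix}$
d) $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 3 \\ 1 & 3 \end{pmatrix}$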
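6. Determine the matrices of the following linear mappings and their ranks.
a) $f : \mathbb{R}^2 \to \mathbb{R}^2$, $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1 - 2 x_2 \\ 0 \end{pmatrix}$.
b) $f : \mathbb{R}^3 \to \mathbb{R}^2$, $\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \mapsto \begin{pmatrix} x_1 - x_2 + x_3 \\ 2 x_1 - 2 x_2 + 2 x_3 \end{pmatrix}$.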
7. Show that for matrices A, B ∈ K n×m and C ∈ K m×r holds:
(A + B)C = AC + BC.
8 Gaussian Algorithm and Linear Equations
Abstract
• you know what a system of linear equations is and can interpret the solutions geo-
metrically,
• you can write a system of linear equations in matrix notation,
• you master the Gaussian algorithm and can apply it to solve systems of linear equations and to determine the inverse of matrices.
Systems of linear equations occur in many engineering and economic problems. In the
last chapters we came across systems of linear equations in connection with matrices
and linear mappings. It now turns out that the methods we have learned there are suita-
ble for developing systematic solution methods for such systems of equations. The focus
is on the Gaussian algorithm. This is a method with which a matrix is transformed in
such a way that the solutions of the associated system of linear equations can be read off
directly.
Our most important application of this algorithm will be solving equations. But we will also
compute other things with it. If we apply the algorithm to a matrix, we can read at the end:
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_8
The transformation of the matrices is carried out with the help of elementary row trans-
formations:
Let’s start with determining the rank of a matrix. For this it holds:
Theorem 8.2 Elementary row transformations do not change the rank of a matrix.
In the proof it is again shown how useful it is that the rank of a matrix can also be deter-
mined by the rows:
a) is clear, because the row space does not change.
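To b): Show that $\operatorname{span}(r_1, r_2, \dots, r_i, \dots, r_m) = \operatorname{span}(r_1, r_2, \dots, r_i + r_j, \dots, r_m)$:
“⊂”: for $u \in \operatorname{span}(r_1, r_2, \dots, r_i, \dots, r_m)$ it holds:
\[
\begin{aligned}
u &= \lambda_1 r_1 + \dots + \lambda_i r_i + \dots + \lambda_j r_j + \dots + \lambda_m r_m \\
&= \lambda_1 r_1 + \dots + \lambda_i (r_i + r_j) + \dots + (\lambda_j - \lambda_i) r_j + \dots + \lambda_m r_m \\
&\in \operatorname{span}(r_1, \dots, r_i + r_j, \dots, r_j, \dots, r_m).
\end{aligned}
\]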
For the time being, we only need the transformations a) and b) to determine the rank.
Now we come to the Gaussian algorithm. Do you remember that you could immediately
read the rank from an upper triangular matrix with ones on the diagonal? We will try to
transform the given matrix A into something similar.
While you read the description of this algorithm, you should solve it on paper at the
same time. Use the following matrix as an exercise:
8.1 The Gaussian Algorithm 187
\[
\begin{pmatrix}
1 & 2 & 0 & 1 \\
1 & 2 & 2 & 3 \\
4 & 8 & 2 & 6 \\
3 & 6 & 4 & 8
\end{pmatrix}
\tag{8.1}
\]
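We begin the algorithm with the element $a_{11}$ of the m × n matrix A:
\[
\begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix}.
\]
If $a_{11} \neq 0$, we subtract $(a_{i1}/a_{11})$ times the first row from the i-th row for all i > 1.
This makes the first element of row i a 0:
\[
a_{i1} \mapsto a_{i1} - (a_{i1}/a_{11}) \, a_{11} = 0, \tag{8.2}
\]
and then the matrix A looks as follows:
\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
0 & a'_{22} & \cdots & a'_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & a'_{m2} & \cdots & a'_{mn}
\end{pmatrix}.
\tag{8.3}
\]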
If a11 = 0, we first look for a row whose first element is not 0. If there is such a row, we
swap it with the first row and then carry out operation (8.2). After that, A also has the
form (8.3), only with a different first row.
If all elements of the first column are 0, we examine the second column and begin the
algorithm in this column with the element $a_{12}$, that is, we try again to make the elements
below $a_{12}$ zero, possibly after a row swap. If this fails as well, we take the third
column, and so on. If the matrix is not the zero matrix, the attempt will succeed at
some point.
In any case, the matrix then also has the form (8.3), possibly with some zero columns
on the left.
Now the first step is over. We go one column to the right and one row down. We call
the element at this point b and start the process again, now with the element b.
It may be the case again that b = 0 and we have to swap the current row with another
row or go one column further to the right. Here you have to be careful: we are only
allowed to swap rows that are below b, otherwise something would get mixed up again in
the columns to the left of b, which we have just nicely transformed.
The process ends when we arrive at the last line or when we would leave the matrix
when progressing to the next element (to the right or to the right and down), that is, when
there are no more columns.
Finally, the matrix has a shape that looks similar to that in (8.4):
(8.4)
The numbers a, b, c, d, . . . are field elements different from 0. The entries designated by ∗
have any arbitrary values.
In each row, the first element different from 0 is called the leading coefficient of the
row. Lines without leading coefficients can only be at the bottom of the matrix.
If you have carried out the process with the exercise matrix (8.1), you have obtained the
matrix
\[
\begin{pmatrix}
1 & 2 & 0 & 1 \\
0 & 0 & 2 & 2 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0
\end{pmatrix}
\tag{8.5}
\]
as a result and have run through each alternative of the algorithm at least once. The algorithm
ends in this case because there is no further element to the right of 1.
A matrix in the form (8.4) is called a matrix in row-echelon form. You see the staircase
that I have drawn. The number of steps in this staircase is the rank of the matrix or,
put more formally:
Theorem 8.3 The rank of a matrix in row-echelon form is exactly the number of
leading coefficients.
We already know the argument from the upper triangular matrices in Example 3 in
Sect. 7.3. We examine the columns in which leading coefficients are located. The i -th
such column is always linearly independent of the previous i−1 columns with leading
coefficients, since the i -th coordinate was 0 in all previous column vectors. So the col-
umns with leading coefficients are linearly independent.
However, the dimension of the column space cannot be greater than the number of
leading coefficients: If there are k leading coefficients, there are not more than k linearly
independent rows, and as we know the dimension of the row space is equal to that of the
column space.
In the literature, the leading coefficients are often normalized to 1 during the execution of
the Gaussian algorithm. This may make the calculation by hand easier, but it costs a whole
series of additional multiplications. Since nobody solves systems of linear equations manually
anymore, numerical efficiency matters. Therefore, the algorithm is implemented here without
this normalization.
You may have noticed while reading the algorithm that it contains a recursion. It was
quite tedious to describe the process with all possible special cases. In the following
recursive formulation you will see again what a powerful tool we have in our hands. It
goes very quickly:
I call the recursive function gauss(i, j) and write it down in pseudocode. The function
has as parameters the row and column index. It starts with i = j = 1, in each step either j,
or i and j is increased. The procedure ends when i is equal to the number of rows or j is
greater than the number of columns:
gauss(i, j):
    if i = number of rows or j > number of columns:
        end.
    if aij = 0:
        look for akj ≠ 0, k > i; if there is none: gauss(i, j + 1), end.
        swap row k and row i
    subtract for all k > i from the k-th row (akj / aij) times the i-th row.
    gauss(i + 1, j + 1), end.
Example
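\[
\begin{pmatrix}
2 & 2 & 0 & 2 \\
4 & 6 & 4 & 7 \\
5 & 6 & 2 & 7 \\
2 & 3 & 2 & 4
\end{pmatrix}
\begin{matrix}
\\ \mathrm{II} - 2 \cdot \mathrm{I} \\ \mathrm{III} - 5/2 \cdot \mathrm{I} \\ \mathrm{IV} - \mathrm{I}
\end{matrix}
\;\mapsto\;
\begin{pmatrix}
2 & 2 & 0 & 2 \\
0 & 2 & 4 & 3 \\
0 & 1 & 2 & 2 \\
0 & 1 & 2 & 2
\end{pmatrix}
\begin{matrix}
\\ \\ \mathrm{III} - 1/2 \cdot \mathrm{II} \\ \mathrm{IV} - 1/2 \cdot \mathrm{II}
\end{matrix}
\]
\[
\mapsto\;
\begin{pmatrix}
2 & 2 & 0 & 2 \\
0 & 2 & 4 & 3 \\
0 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 1/2
\end{pmatrix}
\begin{matrix}
\\ \\ \\ \mathrm{IV} - \mathrm{III}
\end{matrix}
\;\mapsto\;
\begin{pmatrix}
2 & 2 & 0 & 2 \\
0 & 2 & 4 & 3 \\
0 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 0
\end{pmatrix}.
\]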
This is the row-echelon form; the matrix has three leading coefficients and therefore
rank 3. ◄
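The procedure can be turned into a short program. The following Python function is a sketch rather than a definitive implementation: it replaces the recursion by a loop, counts indices from 0, also visits the last row so that the number of leading coefficients can be returned as the rank, and uses exact fractions to avoid rounding errors. The name `row_echelon` is chosen here for illustration.

```python
from fractions import Fraction


def row_echelon(rows):
    """Return a row-echelon form of the matrix and its rank."""
    a = [[Fraction(x) for x in row] for row in rows]
    m, n = len(a), len(a[0])
    i = j = 0
    while i < m and j < n:
        # Look for a row k >= i with a nonzero entry in column j.
        pivot = next((k for k in range(i, m) if a[k][j] != 0), None)
        if pivot is None:
            j += 1                       # nothing to do in this column
            continue
        a[i], a[pivot] = a[pivot], a[i]  # swap only with rows below row i
        for k in range(i + 1, m):
            factor = a[k][j] / a[i][j]
            a[k] = [x - factor * y for x, y in zip(a[k], a[i])]
        i += 1
        j += 1
    return a, i                          # i counts the leading coefficients


example = [[2, 2, 0, 2],
           [4, 6, 4, 7],
           [5, 6, 2, 7],
           [2, 3, 2, 4]]
echelon, rank = row_echelon(example)
for row in echelon:
    print([str(x) for x in row])
print("rank:", rank)   # rank: 3
```

As in the text, the leading coefficients are not normalized to 1. With floating-point numbers instead of fractions, the test for zero would have to be replaced by a comparison with a small tolerance.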
If you carry out the algorithm on paper, it is very useful to write down the row transforma-
tions carried out in each case.
I simply give the matrices that correspond to the row transformations, it is then easy to
calculate that they really do what they are supposed to do:
Type a): If you multiply A from the left with the matrix I′, which you obtain when
you swap the i-th and j-th rows in the m × m identity matrix I, the rows i and j are
exchanged in the matrix A.
Type b): I′′ arises from the identity matrix by adding λ times the j-th row to the i-th
row. Only the element $i_{ij}$ (a 0) is replaced by λ. If you multiply A from the left with I′′,
then λ times the j-th row of A is added to the i-th row.
Type c): Finally, if you multiply the i-th row of the identity matrix by λ and multiply A
from the left with the resulting matrix I′′′, the i-th row of A is multiplied by λ.
The m × m-matrices I ′, I ′′ and I ′′′ are themselves derived from I by elementary row
transformations. So their rank does not change, it is m, and thus the matrices are invert-
ible according to Theorem 7.12.
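It is easy to calculate by machine that these matrices really do what they are supposed to do. The following Python sketch builds the three types and multiplies a small matrix from the left; all function names and the example matrix are assumptions, and rows are counted from 0.

```python
def identity(m):
    """The m x m identity matrix as a list of rows."""
    return [[1 if r == c else 0 for c in range(m)] for r in range(m)]


def swap_matrix(m, i, j):
    """Type a): identity matrix with rows i and j swapped."""
    E = identity(m)
    E[i], E[j] = E[j], E[i]
    return E


def add_multiple_matrix(m, i, j, lam):
    """Type b): identity matrix with lam in position (i, j), i != j."""
    E = identity(m)
    E[i][j] = lam
    return E


def scale_matrix(m, i, lam):
    """Type c): identity matrix with row i multiplied by lam != 0."""
    E = identity(m)
    E[i][i] = lam
    return E


def matmul(B, A):
    """Product BA of two matrices given as lists of rows."""
    return [[sum(B[r][k] * A[k][c] for k in range(len(A)))
             for c in range(len(A[0]))]
            for r in range(len(B))]


A = [[1, 2], [3, 4], [5, 6]]                        # assumed 3x2 example
print(matmul(swap_matrix(3, 0, 2), A))              # [[5, 6], [3, 4], [1, 2]]
print(matmul(add_multiple_matrix(3, 1, 0, -3), A))  # [[1, 2], [0, -2], [5, 6]]
print(matmul(scale_matrix(3, 2, 2), A))             # [[1, 2], [3, 4], [10, 12]]
```

In practice one would of course apply the row operation directly instead of forming the product; the matrices are useful as a tool for proofs.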
This statement is not of importance for concrete calculations; elementary row transfor-
mations can be carried out faster than with matrix multiplications. However, the theorem
is an important proof tool. With its help we will see in a moment how we can determine
the inverse of a matrix with elementary transformations.
A square n × n-matrix can only be invertible if it has rank n. The row-echelon form of
such a matrix therefore has the following shape:
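\[
\begin{pmatrix}
a_{11} & * & \cdots & * \\
0 & a_{22} & \ddots & \vdots \\
\vdots & \ddots & \ddots & * \\
0 & \cdots & 0 & a_{nn}
\end{pmatrix}
\]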
where all diagonal elements are different from 0. If we multiply each row i with 1/aii,
then the diagonal will have 1s everywhere.
Starting from the right, we can now, similarly to the Gaussian algorithm, make the
elements in the upper right half to 0:
8.2 Calculating the Inverse of a Matrix 191
We subtract for all i = 1, . . . , n − 1 from the i-th row ain times the last row. Now
the last column has a row of zeros above the 1. Now we go to the second to last col-
umn and subtract from the i-th row for i = 1, . . . , n − 2 now ai,n−1 times the second to
last row. Everything above the element an−1,n−1 will become 0. The last column remains
unchanged, since an−1,n is already 0.
We continue in this form until we reach the left edge of the matrix, and thus we have
finally converted the original matrix into the identity matrix by elementary row operations.
This description of the algorithm can also be formulated very briefly recursively.
It is important in this part of the transformation that you start from the right. Try once
what happens if you want to make the elements above the diagonal to 0, starting from the left.
Theorem 8.4 helps us with the proof: Let D1 , . . . , Dk be the matrices that correspond to
the row operations performed on A, then we have:
I = Dk Dk−1 · · · D1 A = (Dk Dk−1 · · · D1 )A.
So
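\[
\begin{array}{rrr|rrr|l}
\multicolumn{3}{c|}{A} & \multicolumn{3}{c|}{I} & \\ \hline
1 & 0 & 2 & 1 & 0 & 0 & \\
2 & -1 & 3 & 0 & 1 & 0 & \mathrm{II} - 2 \cdot \mathrm{I} \\
4 & 1 & 8 & 0 & 0 & 1 & \mathrm{III} - 4 \cdot \mathrm{I} \\ \hline
1 & 0 & 2 & 1 & 0 & 0 & \\
0 & -1 & -1 & -2 & 1 & 0 & \\
0 & 1 & 0 & -4 & 0 & 1 & \mathrm{III} + \mathrm{II} \\ \hline
1 & 0 & 2 & 1 & 0 & 0 & \\
0 & -1 & -1 & -2 & 1 & 0 & \mathrm{II} \cdot (-1) \\
0 & 0 & -1 & -6 & 1 & 1 & \mathrm{III} \cdot (-1) \\ \hline
1 & 0 & 2 & 1 & 0 & 0 & \mathrm{I} - 2 \cdot \mathrm{III} \\
0 & 1 & 1 & 2 & -1 & 0 & \mathrm{II} - \mathrm{III} \\
0 & 0 & 1 & 6 & -1 & -1 & \\ \hline
1 & 0 & 0 & -11 & 2 & 2 & \\
0 & 1 & 0 & -4 & 0 & 1 & \\
0 & 0 & 1 & 6 & -1 & -1 &
\end{array}
\]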
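The scheme of transforming A and the identity matrix side by side can be sketched in Python as follows. The function name `inverse` is an assumption, and as a simplification the rows are already normalized to a leading 1 during the forward phase, which differs slightly from the order of steps described in the text. Exact fractions avoid rounding errors.

```python
from fractions import Fraction


def inverse(rows):
    """Invert a square matrix by row operations on (A | I).

    Returns None if the matrix does not have full rank.
    """
    n = len(rows)
    # Augmented matrix (A | I) with exact rational entries.
    a = [[Fraction(x) for x in row] + [Fraction(int(r == c)) for c in range(n)]
         for r, row in enumerate(rows)]
    # Forward phase: row-echelon form with ones on the diagonal.
    for j in range(n):
        pivot = next((k for k in range(j, n) if a[k][j] != 0), None)
        if pivot is None:
            return None
        a[j], a[pivot] = a[pivot], a[j]
        a[j] = [x / a[j][j] for x in a[j]]
        for k in range(j + 1, n):
            a[k] = [x - a[k][j] * y for x, y in zip(a[k], a[j])]
    # Backward phase: clear the entries above the diagonal, starting on the right.
    for j in range(n - 1, 0, -1):
        for k in range(j):
            a[k] = [x - a[k][j] * y for x, y in zip(a[k], a[j])]
    return [row[n:] for row in a]


A = [[1, 0, 2],
     [2, -1, 3],
     [4, 1, 8]]
for row in inverse(A):
    print([str(x) for x in row])
```

For the matrix A of the example, the rows printed are the right half of the last block of the scheme.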
Ax = b.
If b ≠ 0, such a system of linear equations is called an inhomogeneous system; Ax = 0 is
called the corresponding homogeneous system.
We call
sol(A, b) = {x ∈ K n | Ax = b} (8.7)
the solution set of the linear equation system Ax = b.
The matrix (A, b), which arises when we append the column b to A, is called the
augmented matrix of the system.
In the course of the solution process, it will prove to be useful to interpret the matrix
A as a linear mapping. In the language of linear mappings, sol(A, b) is precisely the set of
x ∈ K n, which are mapped by the linear mapping A to b.
Theorem 8.6
The first part is known: sol(A, 0) is precisely the set of x, which are mapped to 0, that is,
the kernel of A.
8.3 Systems of Linear Equations 193
For the second part: If Ax = b has a solution w, then Aw = b, thus b ∈ im A, and con-
versely, if b ∈ im A, then every preimage of b is a solution of the equation system.
In Theorem 7.10 we have seen that the image is generated by the columns of the
matrix. Therefore, if b ∈ im A, then the rank of (A, b) cannot be greater than that of A,
because the vector b can be linearly combined from the columns of A. Conversely, if
rank A = rank(A, b), then b can again be combined from the columns of A, and since the
columns generate the image, b is also in the image.
For part c) of the theorem: We first show that w + ker A ⊂ sol(A, b). For this purpose,
let y ∈ ker A. Since A is a linear mapping, A(w + y) = Aw + Ay = Aw + 0 = b, thus
w + y ∈ sol(A, b).
For the other direction, let v ∈ sol(A, b). Then A(v − w) = Av − Aw = b − b = 0. So
v − w ∈ ker A and v = w + (v − w) ∈ w + ker A.
By reversing part 2 of the theorem, we can now see that a system of linear equations is
not solvable if rank(A, b) is greater than rank A. However, a homogeneous system always
has at least one solution: the zero vector.
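The rank criterion can be tried out numerically. The following Python sketch is an illustration under assumptions, not the book's code: the helper names rank and is_solvable are hypothetical, the rank is computed by bringing the matrix into row echelon form, and matrices are given as lists of rows.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix, computed by elimination to row echelon form."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0  # number of leading coefficients found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # column without a leading coefficient
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][col] / rows[r][col]
            rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def is_solvable(A, b):
    """Ax = b is solvable exactly when rank A = rank (A, b)."""
    augmented = [row + [bi] for row, bi in zip(A, b)]
    return rank(A) == rank(augmented)
```

For A = [[1, 2], [2, 4]] the call is_solvable(A, [3, 6]) is true, is_solvable(A, [3, 7]) is false, and the homogeneous system with b = [0, 0] is always solvable.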
Above all, the last part of theorem 8.6 helps us to compute all solutions of the system
of linear equations Ax = b: It is enough to solve the homogeneous system Ax = 0 com-
pletely and then calculate one single particular solution of the inhomogeneous system.
The solutions of homogeneous systems of linear equations are vector spaces: the zero
vector, lines and planes through the origin, and so on. The solutions of inhomogene-
ous systems of linear equations are, in general, not subspaces. Nevertheless, they can be
interpreted geometrically: They are lines, planes or spaces of higher dimension shifted
from the origin.
The dimension of the solution space can be read from the rank of the matrix:
Theorem 8.7
and according to Theorem 6.27 we have dim im A + dim ker A = dim (domain of definition).
Now plug in the other interpretations of these numbers and you will get equation (8.8).
Examples
[Figure: ker A and the solution set]
projected onto the projection plane by a central projection, in the center of which the
observer’s eye is located. In ray tracing, rays are now shot from the center in the direc-
tion of the object and the first intersection point with the object is determined. The color
and brightness of the object determine the representation of the object on the projection
plane (Fig. 8.3). The problem that occurs again and again is the calculation of the
intersection point of a line with an object. Complex objects are often approximated by flat
surface pieces; in this case it is necessary to determine as quickly as possible the
intersection point of a line with the plane that contains such a surface piece.
In the last example we saw that the solution set of a linear equation of the form
\[ a x_1 + b x_2 + c x_3 = d \tag{8.9} \]
represents a plane. In Sect. 10.1 after Theorem 10.5 we will learn how to find such an
equation for a given plane. Lines could also be described as solution sets of systems of
linear equations, but for the above problem there is a better possibility: We choose the
parameter representation of a line, as we have learned in Sect. 6.1:
\[
g = \left\{ \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} + \lambda \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} \,\middle|\, \lambda \in \mathbb{R} \right\}
  = \left\{ \begin{pmatrix} u_1 + \lambda v_1 \\ u_2 + \lambda v_2 \\ u_3 + \lambda v_3 \end{pmatrix} \,\middle|\, \lambda \in \mathbb{R} \right\}.
\]
Here (u1 , u2 , u3 ) is the projection center and (v1 , v2 , v3 ) is the direction of the projection
beam. We insert the components of the line into the linear equation (8.9) and obtain
\[ a(u_1 + \lambda v_1) + b(u_2 + \lambda v_2) + c(u_3 + \lambda v_3) = d. \]
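This is a single linear equation for the parameter of the line. If a v1 + b v2 + c v3 is not 0, it can be solved for the parameter, and inserting the result into the line gives the intersection point. A minimal Python sketch of this calculation follows; the function name intersect_line_plane, the tuple layout of its arguments and the tolerance eps are assumptions made for the illustration.

```python
def intersect_line_plane(u, v, plane, eps=1e-12):
    """Intersect the line u + lam * v with the plane a*x1 + b*x2 + c*x3 = d.

    u, v  : 3-tuples (projection center and direction of the beam)
    plane : 4-tuple (a, b, c, d)
    Returns (lam, point), or None if the line is parallel to the plane.
    """
    a, b, c, d = plane
    denom = a * v[0] + b * v[1] + c * v[2]
    if abs(denom) < eps:
        return None  # no unique intersection point
    lam = (d - (a * u[0] + b * u[1] + c * u[2])) / denom
    point = tuple(ui + lam * vi for ui, vi in zip(u, v))
    return lam, point
```

For example, the beam from the origin in direction (1, 1, 1) meets the plane x1 + x2 + x3 = 3 at the parameter value 1, that is, in the point (1, 1, 1).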
Now, however, finally to the specific method for solving linear equations. Theorem 8.9
shows us that the Gaussian algorithm can also be used here. As preparation:
Theorem 8.9 If the matrix A′ is obtained from A by elementary row operations and
the vector b′ is obtained from b by the same row operations, then sol(A′, b′) = sol(A, b).
If we want to solve a linear equation system Ax = b, we first use the Gaussian algorithm
to bring the augmented matrix (A, b) into row-echelon form, as we have learned when
computing the rank of a matrix.
There are different, quite similar methods on how to proceed. Below I will introduce
the recipe which seems simplest to me.
First, as described for the calculation of the inverse of a matrix in Sect. 8.2, we can
normalize all leading coefficients to 1 by elementary row operations and then, starting
from the right, turn the elements above the leading coefficients into 0. The columns
without leading coefficients are not of interest to us at the moment. This transforms our
augmented matrix (A, b) into a matrix that looks approximately like this:
This form of the matrix is called reduced echelon form: In this form, all leading
coefficients are equal to 1 and all entries above each leading coefficient are 0. The reduced
echelon form is ideally suited to reading off the solutions of the equation system Ax = b
immediately. The complete procedure is carried out in 4 steps:
Step 1: Bring the augmented matrix into reduced echelon form. Now you can already
        check whether the rank of the augmented matrix is equal to the rank of the
        matrix, that is, whether there is a solution at all.
Step 2: Set the unknowns that belong to the columns without leading coefficients as
        free variables (for example λ1, λ2, λ3, etc.).
Step 3: Each row of the matrix now represents an equation for one of the other
        unknowns. These unknowns can be expressed by the λi and by a bj, one of the
        elements from the last column of the matrix.
Step 4: Now the solution set can be written in the form:
\[
\mathrm{sol}(A, b) = \left\{ \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix}
  + \lambda_1 \begin{pmatrix} u_{11} \\ u_{21} \\ \vdots \\ u_{m1} \end{pmatrix}
  + \cdots
  + \lambda_k \begin{pmatrix} u_{1k} \\ u_{2k} \\ \vdots \\ u_{mk} \end{pmatrix}
  \,\middle|\, \lambda_i \in \mathbb{R} \right\}.
\]
Think about why the vectors resulting from this process (u1i , u2i , . . . , umi ) are always linearly
independent!
The best way to understand this is with an example. I will give the matrix A in reduced
echelon form, and we will carry out steps 2 through 4:
(A, b) =
    1   0   3   0   0   8  |  2
    0   1   2   0   0   1  |  4
    0   0   0   1   0   5  |  6
    0   0   0   0   1   4  |  0
   x1  x2  x3  x4  x5  x6
Under the matrix, I’ve written the unknowns that belong to each column. When you get
to this point, you should first think about the existence and number of solutions. The rank
of the matrix A is 4, because A has 4 leading coefficients. The rank of (A, b) is also 4, the
extension doesn’t add any new leading coefficients. So by Theorem 8.6, the system of
equations has at least one solution, and we need to continue. From Theorem 8.7, we can
read off the dimension of the solution space: it’s equal to the number of unknowns minus
the rank, so 2.
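Now for step 2: The columns without leading coefficients are columns 3 and 6, so we
set x3 = λ, x6 = µ.
In step 3, we get from the four equations by solving for the remaining unknowns the
results:
x1 = −3λ − 8µ + 2, x2 = −2λ − µ + 4, x4 = −5µ + 6, x5 = −4µ.
We can summarize this result in one vector equation in step 4. Here you can see how
two 0’s enter the result vector at positions 3 and 6:
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{pmatrix}
= \begin{pmatrix} -3\lambda - 8\mu + 2 \\ -2\lambda - \mu + 4 \\ \lambda \\ -5\mu + 6 \\ -4\mu \\ \mu \end{pmatrix}
= \begin{pmatrix} 2 \\ 4 \\ 0 \\ 6 \\ 0 \\ 0 \end{pmatrix}
+ \lambda \begin{pmatrix} -3 \\ -2 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}
+ \mu \begin{pmatrix} -8 \\ -1 \\ 0 \\ -5 \\ -4 \\ 1 \end{pmatrix},
\]
and then we already have our solution set in the desired form:
\[
\mathrm{sol}(A, b) = \left\{
\begin{pmatrix} 2 \\ 4 \\ 0 \\ 6 \\ 0 \\ 0 \end{pmatrix}
+ \lambda \begin{pmatrix} -3 \\ -2 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}
+ \mu \begin{pmatrix} -8 \\ -1 \\ 0 \\ -5 \\ -4 \\ 1 \end{pmatrix}
\,\middle|\, \lambda, \mu \in \mathbb{R} \right\}.
\]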
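Steps 2 to 4 are mechanical enough to be written as a small program. The following Python sketch is an illustration under assumptions: the function name solve_from_rref is hypothetical, and the input is assumed to be an augmented matrix that is already in reduced echelon form, given as a list of rows.

```python
from fractions import Fraction


def solve_from_rref(aug):
    """Read off the solution set from a reduced echelon form (A | b).

    Returns (particular, basis): one particular solution and the vectors that
    belong to the free variables, or None if the system has no solution.
    """
    n = len(aug[0]) - 1              # number of unknowns
    pivots = {}                      # column of a leading coefficient -> its row
    for r, row in enumerate(aug):
        lead = next((c for c, x in enumerate(row) if x != 0), None)
        if lead is None:
            continue                 # zero row
        if lead == n:
            return None              # leading coefficient in column b: no solution
        pivots[lead] = r
    # Step 2: columns without leading coefficient belong to the free variables.
    free = [c for c in range(n) if c not in pivots]
    # Steps 3 and 4: particular solution (all free variables set to 0) ...
    particular = [Fraction(0)] * n
    for c, r in pivots.items():
        particular[c] = Fraction(aug[r][n])
    # ... and one direction vector per free variable.
    basis = []
    for f in free:
        vec = [Fraction(0)] * n
        vec[f] = Fraction(1)
        for c, r in pivots.items():
            vec[c] = -Fraction(aug[r][f])
        basis.append(vec)
    return particular, basis
```

For the matrix (A, b) of the example this yields the particular solution (2, 4, 0, 6, 0, 0) and the two direction vectors (−3, −2, 1, 0, 0, 0) and (−8, −1, 0, −5, −4, 1).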
As an alternative solution method, you can directly calculate all unknowns from the
row echelon form by starting with the last row and working backwards, plugging in the
unknowns. Here too, you must set as variables the unknowns that belong to columns
without leading coefficients. This saves you the step of transforming the matrix into
reduced echelon form. Even if this method looks simpler, you usually don’t need fewer
operations than in the method presented above.
8.4 Comprehension Questions and Exercises 199
Comprehension Questions
Exercises
x1 + x3 = 0
2x1 + x2 + 2x3 = 0
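3. Determine the inverses of the following matrices, if they exist:
\[
\begin{pmatrix} 3 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
\begin{pmatrix} 3 & 0 & 1 \\ 6 & 0 & 2 \\ 0 & 1 & 0 \end{pmatrix}, \quad
\begin{pmatrix} 5 & 0 & 0 \\ -1 & 2 & 0 \\ 4 & 1 & 3 \end{pmatrix}.
\]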
4. Determine whether the following system of linear equations is solvable, and if so,
determine the solution set:
x1 + 2x2 + 3x3 = 1
4x1 + 5x2 + 6x3 = 2
7x1 + 8x2 + 9x3 = 3
5x1 + 7x2 + 9x3 = 4
Show that this system is always uniquely solvable if |a_{ii}| > |a_{i−1,i}| + |a_{i+1,i}|
holds for all i. Such a matrix is called “strictly diagonally dominant.”
9 Eigenvalues, Eigenvectors and Change of Basis
Abstract
• you will know what the determinant of a matrix is and can compute it,
• you have learned the terms eigenvalue and eigenvector of matrices,
• and can calculate eigenvalues and eigenvectors with the help of the characteristic
polynomial of a matrix,
• you have calculated eigenvalues and eigenvectors of some important linear map-
pings in the R2 and interpreted the results geometrically,
• you can carry out a change of basis,
• and know what the orientation of vector spaces means.
9.1 Determinants
Determinants are a characteristic of matrices that is often useful for investigating certain
properties of matrices. We need determinants primarily in connection with eigenvalues
and eigenvectors, which we will discuss in the second part of the chapter. But they also
form an important tool for solving systems of linear equations. Determinants are only
defined for square matrices, so all matrices in this chapter are square. K is supposed to be
some field, but you can usually just think of the real numbers.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, 201
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_9
Let’s look at a system of linear equations with two equations and two unknowns again:
ax + by = e,
cx + dy = f ,
and solve for x and y. Two equations are just about manageable. We get:
\[ x = \frac{ed - bf}{ad - bc}, \qquad y = \frac{af - ce}{ad - bc}. \]
These fractions only make sense if the denominator ad − bc is different from zero. In
this case, the system has the specified unique solution, and if ad − bc = 0, then there is
no unique solution. Check for yourself that in the case ad − bc = 0, the rank of the coef-
ficient matrix is less than 2.
ad − bc is the determinant of the matrix \( \begin{pmatrix} a & b \\ c & d \end{pmatrix} \), and we have already
learned two assertions that can be read from it: If it is not equal to 0, then the matrix
has rank 2 and the system of equations Ax = y is uniquely solvable for all y ∈ K^2.
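The two fractions from above translate directly into code. The following is a minimal Python sketch under assumptions: the function name solve_2x2 is hypothetical, and integer coefficients are assumed so that the result can be returned as exact fractions.

```python
from fractions import Fraction


def solve_2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e, c*x + d*y = f for integer coefficients.

    Returns (x, y) as exact fractions, or None if ad - bc = 0,
    in which case there is no unique solution.
    """
    det = a * d - b * c
    if det == 0:
        return None
    x = Fraction(e * d - b * f, det)
    y = Fraction(a * f - c * e, det)
    return x, y
```

For the system x + 2y = 5, 3x + 4y = 6 the determinant is −2 and the sketch returns x = −4, y = 9/2.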
Every square n × n-matrix has a determinant. The determinant is a mapping K^{n×n} → K.
The definition is quite technical, but we can’t get around it.
For matrices from K^{2×2} and from K^{3×3}, up to the sign, the determinants have an
intuitive geometric interpretation: In K^{2×2}, they give the area of the parallelogram spanned
by the row vectors, in K^{3×3}, the volume of the body spanned by the row vectors. This body
is called a parallelepiped.
So far, I have always investigated column vectors in connection with matrices. Since I
want to calculate determinants using the elementary row transformations from Definition
8.1, the use of row vectors is now appropriate. We will see later that the same results also
apply to column vectors: The determinant also gives the volume of the body spanned by
the column vectors.
You can also define an n-dimensional volume in K^n, and in principle the determinant
does nothing else: It represents a generalized volume function for the
“hyper”-parallelogram spanned by n vectors in K^n. I will also speak of the area as a volume,
namely a two-dimensional volume.
In three steps we want to come to such a determinant function and above all to
an algorithm for the calculation. In the matrix A we denote the row vectors with
r1 , r2 , . . . , rn and write A = (r1 , r2 , . . . , rn ). In the first step I will collect some properties
that one expects from such a volume function.
(D1) det I = 1.
(D2) If the rows are linearly dependent, the determinant is 0.
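(D3) For all λ ∈ K, all v ∈ K^n and for i = 1, . . . , n it holds:
\[ \det(r_1, \ldots, \lambda r_i, \ldots, r_n) = \lambda \det(r_1, \ldots, r_i, \ldots, r_n), \]
\[ \det(r_1, \ldots, r_i + v, \ldots, r_n) = \det(r_1, \ldots, r_i, \ldots, r_n) + \det(r_1, \ldots, v, \ldots, r_n). \]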
The last property can also be interpreted in this way: If we keep all rows except the i -th
fixed and let only this i -th row vary, we get a mapping that is linear in the i -th
component, just like a linear mapping.
Are these reasonable volume properties? (D1) represents the normalization so to
speak: the unit square or the unit cube has the volume 1. Of course it is also reasonable
to demand this from the “n-dimensional unit cube”.
To (D2): In K^2, if two vectors are collinear, the parallelogram degenerates to a line
and has no area, just as in K^3 three linearly dependent vectors do not span a real space:
it is flattened and has no volume. We also want this property for a volume in K^n.
(D3) is not quite as easy to understand, but it is certainly correct in K^2 and in K^3. In
the two-dimensional case it can still be drawn (Fig. 9.1).
In the left image you can see that the grey area is larger than the dotted one by the
factor λ; in the right image the dotted and the grey area are equal: what has been added to
the grey area above has just been taken away below. The same relations apply in K^3. If you
want, you can calculate them in coordinates. For the volume function in K^n we simply
demand that the two properties from (D3) apply.
Of course there are other characteristic properties of a volume. But it turns out that the
three mentioned conditions are already sufficient to carry out the second step, which
consists of the following theorem:
Theorem 9.2 There is exactly one determinant function det : K^{n×n} → K with the
properties from Definition 9.1.
But I would like to postpone the proof of this theorem for a while and come directly to
the third step. This represents a specific calculation procedure under the assumption that
there is a determinant function. Once again, the Gaussian algorithm will help us here.
First we investigate what effect elementary row operations have on the determinant:
= det(r1 , . . . , ri , . . . , rn ) + 0, where the second summand is 0 by (D2).
a) can be reduced to b) and c) again, because interchanging rows can be composed of
the following series of operations of type b) and c):
\[
\begin{aligned}
\det(\cdots, r_i, \cdots, r_j, \cdots)
 &= \det(\cdots, r_i - r_j, \cdots, r_j, \cdots) && \text{($i$-th $-$ $j$-th row: determinant doesn't change)} \\
 &= \det(\cdots, r_i - r_j, \cdots, r_i, \cdots) && \text{($j$-th $+$ $i$-th row: determinant doesn't change)} \\
 &= \det(\cdots, -r_j, \cdots, r_i, \cdots) && \text{($i$-th $-$ $j$-th row: determinant doesn't change)} \\
 &= -\det(\cdots, r_j, \cdots, r_i, \cdots) && \text{($i$-th row $\cdot\,(-1)$: determinant changes sign)}
\end{aligned}
\]
You might think that something is wrong with the sign in this Theorem. Isn’t a volume
always positive? Actually, the determinant is something like an oriented, signed volume. We
will investigate what this orientation is all about at the end of this chapter.
Now we can apply the Gaussian algorithm to a matrix A again. For most operations, the
determinant does not change, but if it does, we have to keep track of it.
The procedure ends immediately if we encounter a column without a leading coefficient
during our conversions, because this can only happen in an n × n-matrix if the rank
is < n. That just means that the rows are linearly dependent and thus the determinant is
zero.
If the rank of A is equal to n, we finally get, using the elementary row operations of
type a) (interchange) and b) (addition of λ times one row to another):
\[
\det A = \alpha \det \begin{pmatrix}
a'_{11} & * & \cdots & * \\
0 & a'_{22} & * & \vdots \\
\vdots & 0 & \ddots & * \\
0 & \cdots & 0 & a'_{nn}
\end{pmatrix}. \tag{9.1}
\]
Here α = ±1, depending on how many interchanges we have made, and the leading
coefficients are all different from 0. By a series of further operations of type c)
(multiplication of a row by the factor λ) we get
\[
\det A = \alpha a'_{11} a'_{22} \cdots a'_{nn} \cdot \det \begin{pmatrix}
1 & * & \cdots & * \\
0 & 1 & * & \vdots \\
\vdots & 0 & \ddots & * \\
0 & \cdots & 0 & 1
\end{pmatrix}.
\]
This matrix can be transformed into the identity matrix by operations of type b). The
determinant does not change anymore, and since the determinant of the identity matrix is
equal to 1, we can read off the determinant from (9.1): It holds
\[ \det A = \alpha a'_{11} a'_{22} \cdots a'_{nn}. \tag{9.2} \]
In this case, by the way, the determinant is not equal to zero, because all factors are
different from 0. We already know that in the case rank A < n the determinant is 0, and so
we get:
Theorem 9.4 Let A ∈ K n×n. Then the following statements are equivalent:
or:
det A = 0
⇔ rank A < n
⇔ dim ker A > 0
⇔ columns (rows) are linearly dependent.
The relation between rank and dimension of the kernel, which is stated here again, has
already been shown in Theorem 7.10. We also already know the relation between the
rank and the linear dependence of columns or rows.
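Finally, some examples. I will introduce a shortened notation:
\[
\begin{vmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{vmatrix}
:= \det \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}.
\]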
Examples
And a 3 × 3-determinant:
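\[
\begin{vmatrix} 1 & 0 & 2 \\ 2 & -2 & 3 \\ 4 & 1 & 8 \end{vmatrix}
\overset{\mathrm{II} - 2 \cdot \mathrm{I},\ \mathrm{III} - 4 \cdot \mathrm{I}}{=}
\begin{vmatrix} 1 & 0 & 2 \\ 0 & -2 & -1 \\ 0 & 1 & 0 \end{vmatrix}
\overset{\mathrm{III} + \frac{1}{2} \cdot \mathrm{II}}{=}
\begin{vmatrix} 1 & 0 & 2 \\ 0 & -2 & -1 \\ 0 & 0 & -\frac{1}{2} \end{vmatrix}
= 1 \cdot (-2) \cdot \left(-\tfrac{1}{2}\right) = 1.
\]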
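The procedure used in this example can be sketched in Python. This is an illustration under assumptions, not the book's code: the function name det_gauss is hypothetical, exact fractions are used, and only operations of type a) and b) are applied, so that apart from the sign the determinant is the product of the diagonal elements as in (9.2).

```python
from fractions import Fraction


def det_gauss(matrix):
    """Determinant via Gaussian elimination, keeping track of row interchanges."""
    n = len(matrix)
    a = [[Fraction(x) for x in row] for row in matrix]
    sign = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)       # column without leading coefficient: rank < n
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign             # type a): interchange changes the sign
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            # type b): subtracting a multiple of a row leaves the determinant unchanged
            a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    result = Fraction(sign)
    for i in range(n):
        result *= a[i][i]            # product of the leading coefficients
    return result
```

For the matrix of the example, det_gauss([[1, 0, 2], [2, -2, 3], [4, 1, 8]]) gives 1.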
For 3 × 3-matrices there is an easier way to calculate, the Rule of Sarrus: It is best to
remember this by writing the two first columns of the matrix A next to it to the right and
then proceeding according to the following rule:
Sum the products of the main diagonals (from top left to bottom right) and subtract
the products of the secondary diagonals (from top right to bottom left). That is:
det A = a11 a22 a33 + a12 a23 a31 + a13 a21 a32 − a13 a22 a31 − a11 a23 a32 − a12 a21 a33 .
(9.3)
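Formula (9.3) can be written down directly. The following Python sketch only restates the formula; the function name det_sarrus is an assumption, and the indices are shifted by one because Python counts from 0.

```python
def det_sarrus(a):
    """Rule of Sarrus for a 3 x 3 matrix given as a list of three rows."""
    return (a[0][0] * a[1][1] * a[2][2]
            + a[0][1] * a[1][2] * a[2][0]
            + a[0][2] * a[1][0] * a[2][1]
            - a[0][2] * a[1][1] * a[2][0]
            - a[0][0] * a[1][2] * a[2][1]
            - a[0][1] * a[1][0] * a[2][2])
```

For the 3 × 3-matrix from the previous example the six products are −16, 0, 4, −16, 3 and 0, so the result is −16 + 0 + 4 + 16 − 3 − 0 = 1, in agreement with the Gaussian algorithm.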
You already know from the beginning of the chapter the rule for 2 × 2-matrices
\[ \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc, \tag{9.4} \]
so it also applies here: “Product of main diagonal minus product of secondary diagonal”.
But please: never use it for 4 × 4 or other larger matrices! The base case cannot be con-
tinued with an induction step; it is simply wrong.
Now I want to go back to step 2 of our program to construct a determinant function.
We still don’t even know if there is a determinant at all. We calculate something, but it
is not at all clear whether with other transformations maybe something else comes out,
whether we have a function here at all and, if so, whether it fulfills our three require-
ments (D1), (D2), (D3). I would spare you the existence part of Theorem 9.2 if it were
not constructive at the same time: It contains another method for calculating determi-
nants, which often leads to the goal faster than the Gaussian algorithm and which one
should therefore know. It is the calculation method of the expansion along the first row:
With this fixed we can now recursively define the determinant function:
Definition 9.6: Expansion along the first row For a 1 × 1-matrix A = (a) we set
det A := a.
For an n × n-matrix A = (aij ) we define
det A := a11 det A11 − a12 det A12 + a13 det A13 − · · · ± a1n det A1n .
Try it out right away for 2 × 2- and 3 × 3-matrices. The same result comes out, as we have
calculated in (9.3) and (9.4).
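The recursive definition can be turned into a recursive function almost literally. The sketch below is an illustration under assumptions: the names minor and det_expand are hypothetical, and the matrix A_{1j} is taken to be the matrix that remains when the first row and the j-th column of A are deleted.

```python
def minor(a, i, j):
    """Matrix that remains after deleting row i and column j (counted from 0)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det_expand(a):
    """Determinant by expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    # Alternating signs +, -, +, ... along the first row.
    return sum((-1) ** j * a[0][j] * det_expand(minor(a, 0, j))
               for j in range(len(a)))
```

For [[1, 2], [3, 4]] this gives 1 · 4 − 2 · 3 = −2, as in rule (9.4). Note that the number of products grows like n!, so this sketch is only suitable for small matrices.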
In order for us to be able to write this definition at all, we have to check that this is
really a determinant function, that is, the properties (D1), (D2), (D3) are fulfilled. This
is a mathematical induction, which is not very difficult, but also not very exciting. I will
therefore omit it. To finish the proof of Theorem 9.2, only the uniqueness of the function
is missing. But this can be seen easily with the help of the Gaussian algorithm: If det
and det′ are determinant functions, then we calculate det A and det′ A by transforming
the matrix into an upper triangular matrix, using the same transformations both times. As
seen in (9.1) and (9.2), the value of the determinant is thus uniquely determined, and
det A = det′ A.
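We apply the procedure from Definition 9.6 to the 3×3 determinant that we have already
calculated twice:
\[
\begin{vmatrix} 1 & 0 & 2 \\ 2 & -2 & 3 \\ 4 & 1 & 8 \end{vmatrix}
= 1 \cdot \begin{vmatrix} -2 & 3 \\ 1 & 8 \end{vmatrix} + 2 \cdot \begin{vmatrix} 2 & -2 \\ 4 & 1 \end{vmatrix}
= 1 \cdot (-19) + 2 \cdot 10 = 1.
\]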
I would like to introduce two more important theorems about determinants, the proofs of
which are rather lengthy without giving us any essential new insights. I will therefore
also omit these proofs.
reflecting A on the main diagonal: the i -th column of A is the i -th row of AT.
Written out as a formula, the expansion along the i -th row is:
\[ \det A = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det A_{ij} \]
9.2 Eigenvalues and Eigenvectors 209
The expansion along the i -th row can be traced back to the expansion along the first row
by previously exchanging the rows so that the i -th row comes to the first position and the
rows in front of the row i are each pushed down one position. Check that this sign pat-
tern results.
If we now carry out a transposition of the matrix before the expansion along the i -th
row, we obtain the expansion along the i -th column.
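Let’s take the 3 × 3-matrix, which has already been used three times, as an example.
This time we expand along the 2nd column, and again the same comes out:
\[
\begin{vmatrix} 1 & 0 & 2 \\ 2 & -2 & 3 \\ 4 & 1 & 8 \end{vmatrix}
= (-2) \cdot \begin{vmatrix} 1 & 2 \\ 4 & 8 \end{vmatrix} - 1 \cdot \begin{vmatrix} 1 & 2 \\ 2 & 3 \end{vmatrix}
= (-2) \cdot 0 - 1 \cdot (-1) = 1.
\]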
Expansion of a determinant along a row or column is only interesting if there are many
zeros in this row or column, so that only a few sub-determinants have to be calculated. If
this is not the case, it is better to use Gauss’ method.
The determinant is a volume function: it gives the volume of the body spanned by the
rows. If we transpose the matrix, we see that the determinant also gives the volume of
the body generated by the columns.
Now we interpret the matrix as a linear mapping again. We know that the columns
of the matrix are exactly the images of the standard basis under this mapping. The map-
ping transforms the n-dimensional unit cube into the body determined by the columns of
the matrix. Since the volume of the unit cube is 1, the determinant thus gives something
like a distortion factor of space by the mapping. A large determinant means: the space
is inflated, a small determinant means: the space shrinks. In many applications, linear
mappings that preserve angles and lengths are particularly important. These also leave
volumes unchanged, so they have determinant ±1. We will investigate such mappings in
Sect. 10.2.
In Sect. 7.1 we have learned about some linear mappings that are important in graphi-
cal data processing, for example stretching, shearing, rotation and reflection. Many of
these mappings can be described very easily: the reflection, for example, has an axis that
remains fixed, the vector perpendicular to it just changes its direction. With the shear,
there is also a fixed vector. The rotation is characterized by an angle, usually no vector
remains fixed.
If the coordinate system of the vector space is not chosen very cleverly, it can be dif-
ficult to see which mapping is behind a matrix.
We now want to deal with the problem of finding the vectors that remain fixed or only
change their length for a given matrix.
The inverse problem is also important for data processing: If, for example, we want
to move an object in space or determine the coordinates of a robot gripper, we need to
know which types of movements occur at all: With how many different types of matrices
do we have to deal if we want to exhaust all possibilities?
These are some applications of the theory of eigenvalues and eigenvectors. In
mathematics, especially in numerics, there are many more. In Chap. 11 we will use
eigenvalues to calculate the centrality of nodes in graphs, and they will appear again in
Part III of the book when we talk about Markov chains and principal component analysis.
A useful tool for calculating these matrix characteristics is the determinant, which we
got to know in the first part of this chapter.
Eigenvectors of a linear mapping are therefore vectors that retain their direction under
the mapping while only changing their length, where λ is the stretching factor.
Well, one direction change is possible: If λ < 0, the eigenvector changes its sign and
points exactly in the opposite direction. Please allow me to consider this reversal of
direction as a stretch and continue to say that eigenvectors retain their direction.
The requirement v ≠ 0 in the definition is important for the existence of an eigenvalue:
If one were to allow v = 0, then every λ would be an eigenvalue, because A0 = λ0
always holds, of course. But once an eigenvalue λ has been found, v = 0 is an
eigenvector of λ by our definition.
Examples in R2
1. First of all, the matrix that describes a rotation around the origin. We met it in
Example 6 from Sect. 7.1:
\[ D_\alpha = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}. \]
Except for the origin, every point is rotated, so no vector v ≠ 0 remains fixed and
therefore there are no eigenvalues and no eigenvectors.
Wait, there are a few exceptions: With a rotation by 0°, 360°, and multiples
thereof, every vector is mapped to itself: Every vector is an eigenvector with eigen-
value 1. With a rotation by 180°, every vector is exactly reversed and is therefore
an eigenvector with eigenvalue −1.
2. I introduced the reflection matrix to you in Example 7 from Sect. 7.1. There I
claimed that this is a reflection without verifying it. Now we want to do this by
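determining the eigenvalues and eigenvectors. So we are looking for (x, y) ∈ R^2
and λ ∈ R with the property
\[
\begin{pmatrix} \cos\alpha & \sin\alpha \\ \sin\alpha & -\cos\alpha \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \end{pmatrix}.
\]
We reformulate this matrix equation a little:
\[
\begin{aligned}
\cos\alpha \, x + \sin\alpha \, y &= \lambda x \\
\sin\alpha \, x - \cos\alpha \, y &= \lambda y
\end{aligned}
\quad\Rightarrow\quad
\begin{aligned}
(\cos\alpha - \lambda) x + \sin\alpha \, y &= 0 \\
\sin\alpha \, x + (-\cos\alpha - \lambda) y &= 0
\end{aligned}
\tag{9.5}
\]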
Now we have a system of linear equations for x and y. Unfortunately, a third
unknown has crept in, namely λ. And for these three unknowns the system is no
longer linear. So we have to come up with something other than the usual Gaussian
algorithm.
When can the system of equations (9.5) have a solution (x, y) different from (0, 0)?
We can read this from the coefficient matrix:
\[ A = \begin{pmatrix} \cos\alpha - \lambda & \sin\alpha \\ \sin\alpha & -\cos\alpha - \lambda \end{pmatrix}. \]
Because according to Theorem 9.4 the existence of a non-trivial solution is equivalent
to the determinant of A being equal to zero. So there can only be solutions if
det A = (cos α − λ)(− cos α − λ) − sin² α = λ² − 1 = 0, that is, if λ = ±1.
We will derive a general method for computing eigenvalues, similar to the one used for
the reflection matrix. But first we have to make a few preparations:
Theorem 9.12 The eigenspace T_λ for the eigenvalue λ is the kernel of the
mapping (A − λI).
In the example of the reflection matrix from above, we just examined this matrix A − λI.
A number λ is an eigenvalue if and only if there is an eigenvector ≠ 0, that is, if the
kernel of (A − λI) contains more than 0; it must then have dimension greater than or equal to 1.
Theorem 9.4 then says that det(A − λI) = 0:
Theorems 9.12 and 9.14 give us the recipe for calculating eigenvalues and eigenvec-
tors: We determine the eigenvalues according to Theorem 9.14 and with the help of
Theorem 9.12 we find the corresponding eigenspaces.
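First of all, let us see what det(A − λI) looks like for real 2 × 2- and 3 × 3-matrices:
\[
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad
A - \lambda I = \begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix}.
\]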
As you know, there are sometimes 2, sometimes 1, and sometimes no solutions for this
in R: Linear mappings in R2 have 0, 1, or 2 real eigenvalues.
We’ve seen examples of all these cases already. Can you match them up?
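For a 2 × 2-matrix, rule (9.4) gives det(A − λI) = (a − λ)(d − λ) − bc = λ² − (a + d)λ + (ad − bc), so the eigenvalues are the real roots of a quadratic polynomial. The three cases can be sketched in Python; the function name real_eigenvalues_2x2 is hypothetical, and the sketch works with floating point numbers, so the comparison with 0 is only reliable for small integer entries.

```python
import math


def real_eigenvalues_2x2(a, b, c, d):
    """Real roots of lam^2 - (a + d) * lam + (a*d - b*c), in ascending order."""
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det   # discriminant of the quadratic
    if disc < 0:
        return []                    # no real eigenvalue
    if disc == 0:
        return [trace / 2]           # exactly one real eigenvalue
    root = math.sqrt(disc)
    return [(trace - root) / 2, (trace + root) / 2]
```

For instance, the entries (3, −1, 1, 1) give the single eigenvalue 2, the entries (1, −1, 2, −1) give no real eigenvalue, and (2, 2, 0, 1) gives the two eigenvalues 1 and 2.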
I don’t want to hide that it is a bit laborious to write this proof down cleanly. It is much
easier if one uses an alternative (and of course equivalent) definition of the determinant, which
you can find in almost every mathematics book: The determinant of an n × n-matrix is always
a sum of n-fold products of permutations of the matrix elements. In the case of 2×2- and
3×3-matrices you have seen this already. But since the proof of Theorem 9.15 is the only place
in this book where this definition would be useful, I have dispensed with the quite technical
introduction. But if you read about this form of the determinant elsewhere: it is the same.
This theorem has an immediate, very surprising consequence for real 3 × 3-matrices,
even for all real n × n-matrices, provided that n is odd:
Theorem 9.16 For odd n every matrix from Rn×n has at least one real eigenvalue.
For the proof you need to briefly flip to the second part of the book: From Bolzano's
theorem it follows that every real polynomial of odd degree has at least one root (see
Theorem 14.23 in Sect. 14.3).
Think about what this means: You cannot find a linear mapping in R^3 which does
not have at least one vector that keeps its direction. If a soccer ball flies across the field
during the game, it goes through linear mappings and translations. If the referee puts the
ball back on the kickoff spot after the break, the translations cancel out, only a linear
mapping remains. That means there is an axis in the soccer ball that has exactly the same
position as at the beginning of the game. The ball has only rotated a bit around this axis.
This circumstance makes it much easier to follow the movement of bodies on a com-
puter.
For complex numbers the fundamental theorem of algebra (Theorem 5.14) results in:
Theorem 9.17 Every matrix from Cn×n has at least one eigenvalue.
We know that every complex polynomial of degree n decomposes into linear factors, so it
has n roots. But beware, that does not mean that every complex n × n-matrix has n
eigenvalues: roots can also occur multiple times, and only the distinct roots provide distinct
eigenvalues.
And now another finding follows with our knowledge of polynomials: According to The-
orem 5.25 in Sect. 5.4 a polynomial of degree n has at most n roots, so it holds:
Examples
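1. \( A = \begin{pmatrix} 3 & -1 \\ 1 & 1 \end{pmatrix} \), det(A − λE) = λ² − 4λ + 4 = (λ − 2)²,
has one eigenvalue, namely λ = 2. To determine the corresponding eigenvectors,
we now have to solve the system of linear equations (A − 2E)x = 0:
\[
\begin{pmatrix} 3 - 2 & -1 \\ 1 & 1 - 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]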
A good way to check whether you have determined the eigenvalues correctly is to
determine the rank of the corresponding matrix. In any case, this must be smaller than
the number of rows, so in this case at most 1. This is true.
In general, this system of equations will now be solved again using the Gaussian
algorithm. But here we immediately see that (1, 1) is a basis of the solution
space. The matrix A therefore has the eigenvector (1, 1) (and of course all multiples
thereof) for the eigenvalue +2.
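2. \( A = \begin{pmatrix} 1 & -1 \\ 2 & -1 \end{pmatrix} \), det(A − λE) = λ² + 1.
This matrix has no real eigenvalues, but complex ones.
3. \( A = \begin{pmatrix} 2 & 2 \\ 0 & 1 \end{pmatrix} \), det(A − λE) = (2 − λ)(1 − λ),
has the eigenvalues 2 and 1. The following two systems of equations have to be
solved:
\[
\begin{pmatrix} 0 & 2 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} 1 & 2 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]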
Basis of the eigenspace to 2 is the vector (1, 0), basis of the eigenspace to 1 is for
example (2, −1).
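The eigenvectors found in these examples can be checked by simply multiplying them with the matrix. A minimal Python sketch follows; the function name is_eigenvector is hypothetical, and as a simplification it rejects the zero vector, because the zero vector cannot be used to establish an eigenvalue.

```python
def is_eigenvector(A, v, lam):
    """Check whether v is a nonzero vector with A v = lam * v.

    A is a list of rows, v a list of coordinates, lam a number.
    """
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    return any(x != 0 for x in v) and Av == [lam * x for x in v]
```

With the matrices from above, is_eigenvector([[3, -1], [1, 1]], [1, 1], 2), is_eigenvector([[2, 2], [0, 1]], [1, 0], 2) and is_eigenvector([[2, 2], [0, 1]], [2, -1], 1) are all true.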
Draw, similar to the examples in Sect. 7.1, what these mappings do with a square.
In the last example, we found one eigenspace of dimension 2 and one of dimension 1.
Could the second eigenspace have also had dimension 2? No, because more than three
linearly independent eigenvectors do not fit into the R3, and the following result applies:
The eigenvectors of different eigenspaces therefore have nothing to do with each other.
I think this theorem is obvious: If an eigenvector were in two different eigenspaces,
it would not even know where to be mapped. I will spare us the precise induction proof.
We can draw a conclusion from the theorem that will be interesting in the next section:
As you have seen in the last example, a matrix with fewer eigenvalues can also have a
basis of eigenvectors: (−1,1,0), (2,0,1) and (1,1,−2) form a basis of the R3.
The matrix of a reflection at the x1-axis in R^2 has a very simple shape: The basis
vector e1 remains fixed, e2 is turned over. As a matrix, this results in:
\[ \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \]
The matrix of the reflection at another axis looks much more complicated.
Of course, this is because in this special case the basis vectors have not been bent:
they are eigenvectors of the reflection.
A key goal of the calculation of eigenvectors is to find bases with respect to which the
matrix of a linear mapping looks as simple as possible. If you can choose eigenvectors
as basis vectors, you will achieve this goal. If, for example, we choose in the last exam-
ple before Theorem 9.19 as a basis the three found eigenvectors (−1, 1, 0), (2, 0, 1) and
(1,1,−2), then the matrix of the corresponding mapping with respect to this basis has the
shape
$$\begin{pmatrix} 6 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 0 \end{pmatrix},$$
because the columns of the matrix are the images of the basis and the basis vectors are
only stretched.
In Sect. 6.6 we have already seen that vectors have different coordinates with respect
to different bases. In a two-dimensional example, we have calculated in Theorem 6.22
how the coordinates of a vector can be calculated in different bases. We now have to do
this a little more systematically: If a vector is given with respect to the basis B1, we want
to determine its coordinates with respect to the basis B2. Furthermore, if A is the matrix
of a linear mapping with respect to B1, we want to know the matrix of this mapping with
respect to the basis B2.
Recall that if $B = \{b_1, b_2, \dots, b_n\}$ is a basis of $K^n$, then every $v \in K^n$ can be
written as a linear combination of the basis vectors: $v = \lambda_1 b_1 + \lambda_2 b_2 + \dots + \lambda_n b_n$.
We then write
$$v = \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{pmatrix}_{B}.$$
What is the relationship between the coordinates of a vector v with respect to the bases
B1 and B2? Let B2 = {b1 , b2 , . . . , bn }, and let the coordinates of the bi with respect to B1
and v with respect to B1 and B2 be given:
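$$b_1 = \begin{pmatrix} b_{11} \\ \vdots \\ b_{n1} \end{pmatrix}_{B_1}, \; \dots, \; b_n = \begin{pmatrix} b_{1n} \\ \vdots \\ b_{nn} \end{pmatrix}_{B_1}, \quad v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}_{B_1}, \quad v = \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{pmatrix}_{B_2}.$$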
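Then
$$\begin{aligned} v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}_{B_1} &= \lambda_1 b_1 + \lambda_2 b_2 + \dots + \lambda_n b_n \\ &= \lambda_1 \begin{pmatrix} b_{11} \\ \vdots \\ b_{n1} \end{pmatrix}_{B_1} + \dots + \lambda_n \begin{pmatrix} b_{1n} \\ \vdots \\ b_{nn} \end{pmatrix}_{B_1} = \begin{pmatrix} b_{11}\lambda_1 + \dots + b_{1n}\lambda_n \\ \vdots \\ b_{n1}\lambda_1 + \dots + b_{nn}\lambda_n \end{pmatrix}_{B_1}. \end{aligned} \tag{9.7}$$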
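The matrix
$$T = \begin{pmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & & \vdots \\ b_{n1} & \cdots & b_{nn} \end{pmatrix}$$
is an invertible matrix, because the columns are linearly independent, and from (9.7) it
follows:
$$\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = T \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{pmatrix}, \quad\text{and thus also}\quad \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{pmatrix} = T^{-1} \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}. \tag{9.8}$$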
In (9.8) I have omitted the basis indices. It is a calculation in the basis B1, but here we
are only concerned with the arithmetic relationship between the numbers
$(v_1, v_2, \dots, v_n)$ and $(\lambda_1, \lambda_2, \dots, \lambda_n)$.
We can therefore formulate the following theorem:
Theorem 9.21 Let f be the linear mapping that maps the basis B1 to the basis B2.
Let T be the matrix of this mapping with respect to the basis B1; the columns of T
therefore contain the basis B2 = {b1 , b2 , . . . , bn } in the coordinates with respect
to B1. Then the coordinates of a vector v with respect to B2 are obtained by
multiplying the coordinates of v with respect to B1 from the left by $T^{-1}$. The
coordinates of v with respect to B1 are obtained from those with respect to B2 by
multiplication from the left by T.
Now turn back to the example after Theorem 6.22, where we did exactly that without
knowing at the time what a matrix is.
This follows from (9.8), if you also write this line for the matrix S. I will leave the proof
to you as an exercise.
Occasionally, the order of matrix multiplication causes confusion, because it is exactly the
reverse of the order in which linear mappings are performed: If you first perform the linear
mapping A and then the mapping B, you get the linear mapping BA. But if you first perform the
transition T and then the transition S, you get the transition TS.
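Let’s look at Example 4 again before Theorem 9.19 and transform into the basis of
eigenvectors: The matrix T is:
$$T = \begin{pmatrix} -1 & 2 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & -2 \end{pmatrix},$$
the inverse of which is
$$T^{-1} = \frac{1}{6} \begin{pmatrix} -1 & 5 & 2 \\ 2 & 2 & 2 \\ 1 & 1 & -2 \end{pmatrix}.$$
Then, for example,
$$\frac{1}{6} \begin{pmatrix} -1 & 5 & 2 \\ 2 & 2 & 2 \\ 1 & 1 & -2 \end{pmatrix} \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix},$$
because $(-1, 1, 0)_{B_1}$ is the first basis vector in the basis B2:
$$\begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix}_{B_1} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}_{B_2}.$$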
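The coordinate change of Theorem 9.21 can be tried out with the matrices T and T⁻¹ of this example. The following sketch is an illustration only; the helper names `mat_vec`, `mat_mul`, `to_b2` and `to_b1` are assumptions, not notation from the book. Exact fractions are used so that the factor 1/6 causes no rounding errors.

```python
from fractions import Fraction

# Columns of T: the eigenvector basis B2 in coordinates with respect to B1.
T = [[-1, 2, 1],
     [1, 0, 1],
     [0, 1, -2]]

# Inverse as stated in the text: 1/6 times an integer matrix.
T_INV = [[Fraction(x, 6) for x in row]
         for row in [[-1, 5, 2], [2, 2, 2], [1, 1, -2]]]


def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def to_b2(coords_b1):
    """Coordinates with respect to B2: multiply from the left by T^-1."""
    return mat_vec(T_INV, coords_b1)


def to_b1(coords_b2):
    """Coordinates with respect to B1: multiply from the left by T."""
    return mat_vec(T, coords_b2)


# Sanity check that the stated inverse really is the inverse of T.
assert mat_mul(T, T_INV) == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(to_b2([-1, 1, 0]))         # first eigenvector in B2 coordinates
print(to_b1(to_b2([2, 1, -1])))  # round trip returns the original vector
```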
How do we determine the matrix of a linear mapping with respect to the basis B2? Let
f : K n → K n be a linear mapping, A ∈ K n×n the corresponding matrix with respect to B1
and S ∈ K n×n the matrix of f with respect to B2. Let v, w ∈ K n with f (v) = w. Further,
the coordinates of v and w with respect to the bases B1 and B2 are given:
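$$v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}_{B_1}, \; w = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}_{B_1} \quad\text{and}\quad v = \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{pmatrix}_{B_2}, \; w = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}_{B_2}.$$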
The comparison of (9.11) with the right half of (9.9) finally yields T −1 AT = S, because
the matrix of a linear mapping is uniquely determined. So we finally have the desired
relationship:
Theorem 9.23 If T is the transition matrix from the basis B1 to the basis B2 and
f is a linear mapping to which the matrix A belongs with respect to B1, then the
matrix T −1 AT belongs to f with respect to B2.
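Theorem 9.23 can be checked on the eigenvector basis used above. The sketch below is an illustration under one stated assumption: Example 4 itself is not reproduced in this passage, so the matrix A is taken to be the matrix that has the eigenvectors (−1, 1, 0) and (2, 0, 1) for the eigenvalue 6 and (1, 1, −2) for the eigenvalue 0. The helper names are hypothetical.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Assumption: matrix with eigenvalues 6, 6, 0 and the eigenvectors from the text.
A = [[5, -1, 2],
     [-1, 5, 2],
     [2, 2, 2]]

# Transition matrix: the eigenvectors as columns, and its inverse from the text.
T = [[-1, 2, 1],
     [1, 0, 1],
     [0, 1, -2]]
T_INV = [[Fraction(x, 6) for x in row]
         for row in [[-1, 5, 2], [2, 2, 2], [1, 1, -2]]]


def change_basis(a, t, t_inv):
    """Matrix of the same mapping with respect to the new basis: T^-1 A T."""
    return mat_mul(mat_mul(t_inv, a), t)


S = change_basis(A, T, T_INV)
for row in S:
    print([int(x) for x in row])
```

With respect to the basis of eigenvectors the mapping is described by a diagonal matrix whose diagonal holds the eigenvalues, which is exactly the simple shape the basis change is meant to produce.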
This means nothing other than A and B describe the same linear mapping with respect to
two different bases. In particular, therefore, A and B have the same eigenvalues and the
same determinant.
We know this already: The columns are the images of the eigenvectors and because of
$f(b_i) = \lambda_i b_i$ the assertion is true.
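$$T = \begin{pmatrix} \cos(\alpha/2) & -\sin(\alpha/2) \\ \sin(\alpha/2) & \cos(\alpha/2) \end{pmatrix}, \quad T^{-1} = \begin{pmatrix} \cos(-\alpha/2) & -\sin(-\alpha/2) \\ \sin(-\alpha/2) & \cos(-\alpha/2) \end{pmatrix}.$$
With our knowledge of the transition matrix from Theorem 9.23 it must therefore hold:
$$\begin{pmatrix} \cos(\alpha/2) & -\sin(\alpha/2) \\ \sin(\alpha/2) & \cos(\alpha/2) \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \cos(-\alpha/2) & -\sin(-\alpha/2) \\ \sin(-\alpha/2) & \cos(-\alpha/2) \end{pmatrix} = \begin{pmatrix} \cos\alpha & \sin\alpha \\ \sin\alpha & -\cos\alpha \end{pmatrix}.$$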
Calculate it using the addition rules for cosine and sine from the formulary or from The-
orem 14.21 in Sect. 14.2. It actually works out!
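Before doing the calculation with the addition rules, the identity can be tested numerically. The following sketch is not a proof, only a plausibility check for a few angles; the function names are chosen for illustration.

```python
import math


def rotation(angle):
    """Rotation matrix for the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def mat_mul(a, b):
    """Matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def reflection_via_basis_change(alpha):
    """T * diag(1, -1) * T^-1 with T the rotation by alpha / 2."""
    t = rotation(alpha / 2)
    t_inv = rotation(-alpha / 2)   # the inverse of a rotation rotates back
    mirror_x1 = [[1, 0], [0, -1]]  # reflection at the first basis vector
    return mat_mul(mat_mul(t, mirror_x1), t_inv)


def reflection_direct(alpha):
    """The reflection matrix on the right-hand side of the identity."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, s], [s, -c]]


def close(a, b, tol=1e-12):
    """Entry-wise comparison of two matrices up to rounding errors."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


for alpha in (0.0, 0.3, math.pi / 2, 2.5):
    print(alpha, close(reflection_via_basis_change(alpha), reflection_direct(alpha)))
```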
Since you first saw a coordinate system, you are used to drawing the x1-axis to the
right and the x2-axis up and not down. At first it caused me quite a bit of trouble that
some drawing programs place the origin in the upper left corner of the screen; probably
because the image is built up line by line from top to bottom. What is it about this top,
bottom, right and left?
Humans are not rotationally symmetric and can therefore distinguish between left and
right. We can sit on the tail of a vector in a plane, look in the direction of the vector and
say what is to the left and what is to the right of it. It is a common convention to set up
our two-dimensional coordinate system so that the basis vector e2 is to the left of the
basis vector e1. Such a coordinate system is called positively oriented, any other one
negatively oriented. If you exchange e1 and e2, the orientation is reversed.
In three dimensions it is a bit more difficult: If we sit in the origin and look in the
direction of a vector, we cannot speak of left or right: We need a reference plane. This is
the plane that is formed by the two basis vectors e1 and e2.
Can we say from this plane in the R3 whether its coordinate system is positive or negative?
No, it depends on which side of the plane we sit on!
The convention here is: If you rotate the basis vector e1 in the direction of e2, e3 should
be on the side of the plane into which a screw would be screwed in during this rotation.
Perhaps you also know the “right-hand rule”: If you stretch your thumb, index finger and
middle finger of your right hand out linearly independently, you get (in this order) a
positively oriented system.
As in the R2 there are therefore two types of coordinate systems here: positively and
negatively oriented.
How can we describe these orientations mathematically? The determinant gives us the
means to do so.
Definition and Theorem 9.27 Two bases of the vector space K n are called
equally oriented if the determinant of the transition matrix is greater than 0. The
orientation is an equivalence relation on the set of bases of K n and divides them
into two classes.
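The definition can be turned into a small test. The sketch below assumes that both bases are given in coordinates of one common standard basis. If B1 and B2 denote the matrices with the basis vectors as columns, the transition matrix is T = B1⁻¹B2, so det T = det B2 / det B1, and det T > 0 holds exactly when both determinants have the same sign. The function names are hypothetical.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def columns_to_matrix(basis):
    """Write the basis vectors as the columns of a matrix."""
    return [list(row) for row in zip(*basis)]


def equally_oriented(basis1, basis2):
    """True if the transition matrix between the two bases has positive determinant."""
    d1 = det(columns_to_matrix(basis1))
    d2 = det(columns_to_matrix(basis2))
    if d1 == 0 or d2 == 0:
        raise ValueError("vectors are linearly dependent, so this is not a basis")
    return d1 * d2 > 0


standard = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(equally_oriented(standard, [(0, 1, 0), (0, 0, 1), (1, 0, 0)]))  # cyclic shift
print(equally_oriented(standard, [(0, 1, 0), (1, 0, 0), (0, 0, 1)]))  # swap of two vectors
```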
The tool of the determinant does not relieve us of the task of saying which coordinate
systems are positively oriented and which are not. We have to specify that. If not almost
every human being had his right hand with him all the time, the prototype of a positively
oriented coordinate system would have to be kept in Paris beside the prototype metre and
the prototype kilogram.
But if we have fixed our prototype in the R2 or R3, does the set of positively oriented
bases agree with the convention we made at the beginning of this section?
Let’s think about that for the plane. We start from a positively oriented coordinate
system (e2 is to the left of e1) and carry out a change of basis with a positive determinant.
Then e′2 should be to the left of e′1 again. Let’s split the transition into two parts: first e2 is
kept and e1 is transformed into e′1, then e′1 is kept and e2 is transformed. The basis
transition matrices T and S then look as follows:
$$T = \begin{pmatrix} a & 0 \\ b & 1 \end{pmatrix}, \quad S = \begin{pmatrix} 1 & c \\ 0 & d \end{pmatrix}.$$
We have det T = a and det S = d. Since det TS = det T · det S > 0, both a and d are greater
than 0 or both are less than 0.
Let’s assume that a and d are both greater than 0. (a, b) are the coordinates of e′1 with
respect to the original basis and a > 0 means that e1 and e′1 are on the same side of e2, so
e2 is to the left of e′1. Similarly, (c, d) are the coordinates of e′2 in the coordinate system
(e′1 , e2 ) and d > 0 means e2 and e′2 are on the same side of e′1, so e′2 is also to the left of e′1
(Fig. 9.4).
Similarly, one argues if a and d are both less than 0.
Our convention is also compatible with the determinant rule in R3. Is it possible to
formulate something like the right-hand rule in higher-dimensional spaces? I don’t think
so; we can start with a triad, but in which direction should the four-dimensional screw
then turn? In general, we simply work with a basis that is our standard basis and com-
pare the orientations of other bases with this.
Now we also see what the sign of the volume that the determinant provides us with
means: the volume of an n-dimensional parallelogram is positive if the vectors describing
the edges have the same orientation as the basis of the vector space. Otherwise the
volume is negative.
If you want to move or deform an object in computer graphics, you have to make sure
that you apply transformations that preserve orientation, unless you want to look at the
object in the mirror. Check the transformations I introduced in Sect. 7.1 from this point
of view.
In the first part of the ray tracing, at the end of Chap. 8, we dealt with the problem
of determining the intersection of a line with an object. We assume that the object is
bounded by a series of irregular polygons. In the first step, we determined the intersec-
tion of the line with the plane in which the polygon lies. But how do we find out whether
the point is inside or outside the polygon?
First of all, we can move the problem into the R2: A point in the plane of the polygon is
inside the polygon if this is true for the projections of the point and the polygon onto one
of the coordinate planes. So we project onto a coordinate plane in which the polygon does
not degenerate into a line. This is simply done by setting one of the three coordinates to zero.
Let’s test now whether the point P is inside the polygon. First of all, we assume the
polygon is convex, that is, the line segment between two points is also inside the poly-
gon. If a reference point R is given inside the polygon, we have to check whether the line
PR intersects the border of the polygon or not. The border of the polygon consists of line
segments itself, and so the problem reduces to the repeated investigation of the question
whether two line segments PR and QS intersect. For this we need a fast algorithm. The
sign of the determinant and its relationship with the orientation of linearly independent
vectors are a suitable tool:
PR and QS intersect exactly when Q and S are on different sides of the vector $\overrightarrow{PR}$, and
P and R are on different sides of $\overrightarrow{QS}$ (Fig. 9.5). According to what we have just learned
about the orientation of coordinate systems, this means:
$$\det\left(\overrightarrow{PR}, \overrightarrow{PQ}\right) \cdot \det\left(\overrightarrow{PR}, \overrightarrow{PS}\right) \le 0 \quad\text{and}\quad \det\left(\overrightarrow{QS}, \overrightarrow{QP}\right) \cdot \det\left(\overrightarrow{QS}, \overrightarrow{QR}\right) \le 0.$$
The vectors are the column vectors of the respective matrices. The special case = 0
occurs when one of the endpoints lies on the other line segment. To perform this check,
we therefore need 8 multiplications.
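The two determinant conditions translate almost literally into code. The following is a minimal sketch with hypothetical function names; points are given as pairs of coordinates in the projection plane. One caveat that the text does not discuss: if all four points lie on one line, all determinants are 0 and the test reports an intersection even when the segments do not overlap, so this case would need separate treatment.

```python
def det2(a, b):
    """Determinant of the 2x2 matrix whose columns are the vectors a and b."""
    return a[0] * b[1] - a[1] * b[0]


def vec(start, end):
    """Vector from the point start to the point end."""
    return (end[0] - start[0], end[1] - start[1])


def segments_intersect(p, r, q, s):
    """Sign test for the segments PR and QS (collinear input is not handled)."""
    pr = vec(p, r)
    qs = vec(q, s)
    # Q and S must not lie strictly on the same side of PR ...
    q_s_separated = det2(pr, vec(p, q)) * det2(pr, vec(p, s)) <= 0
    # ... and P and R must not lie strictly on the same side of QS.
    p_r_separated = det2(qs, vec(q, p)) * det2(qs, vec(q, r)) <= 0
    return q_s_separated and p_r_separated


print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # diagonals of a square
print(segments_intersect((0, 0), (1, 1), (2, 0), (3, 1)))  # parallel segments
```

Instead of multiplying the two determinants, one can also compare their signs, which is how the count of 8 multiplications in the text comes about.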
The procedure can easily be extended to non-convex polygons: P lies in the polygon
if the number of intersections of PR with the edge of the polygon is even, otherwise it
lies outside.
Comprehension Questions
Exercises
1. Show: The determinant of a 2 × 2 matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is exactly the area of the
parallelogram determined by the two row vectors.
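2. Calculate the determinants of the following matrices:
$$\begin{pmatrix} 2 & -6 & 4 & 0 \\ 4 & -12 & -1 & 2 \\ 1 & 7 & 2 & 1 \\ 0 & 10 & 3 & 9 \end{pmatrix}, \quad \begin{pmatrix} 3 & -4 & 0 & 2 \\ 0 & 7 & 6 & 3 \\ 2 & -6 & 0 & 1 \\ 5 & 3 & 1 & -2 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & -1 & 2 \\ 2 & 1 & 0 & 1 \\ -3 & 1 & 0 & 1 \\ 2 & 2 & 0 & -1 \end{pmatrix}.$$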
3. a) Calculate the eigenvalues of the matrices
$$\begin{pmatrix} 5 & -1 & 2 \\ -1 & 5 & 2 \\ 2 & 2 & 2 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 & -3 & 3 \\ 3 & -5 & 3 \\ 6 & -6 & 4 \end{pmatrix}.$$
b) For each eigenvalue, give a basis of the corresponding eigenspace. Are the
matrices diagonalizable? If so, give the corresponding transition matrix.
4. Show that when two changes of basis are carried out in succession, the transition
matrices are multiplied: If T is the transition matrix from B1 to B2 and S is from B2
to B3, then TS is the transition matrix from B1 to B3.
5. Show that the equivalence relation “oriented equally” partitions the set of bases of
Rn into exactly two equivalence classes.
6. Why does a mirror swap left and right, but not top and bottom?
10 Dot Product and Orthogonal Mappings
Abstract
In real vector spaces, often the task arises of measuring lengths or distances, or of
describing something like angles between vectors. For this, the dot product can be used.
A dot product is a mapping that assigns a scalar to two vectors, in our case a real number.
Dot products are usually studied in mathematics for real or complex vector spaces.
Here I would like to limit myself to real vector spaces, as these are the spaces from
which our most important applications arise.
You already know how I proceed. I first collect the essential requirements for a dot
product and then define a specific such product for the Rn.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_10
In (S4) lies the reason for our restriction to real spaces: We need to know what “≥ 0” means.
For example, in C or GF(p) this is not possible. However, there is an extension of the notion
to complex vector spaces.
From (S3) it follows that $\langle u, v + w\rangle = \langle u, v\rangle + \langle u, w\rangle$ and $\langle u, \lambda v\rangle = \lambda\langle u, v\rangle$ also hold. But
you now know that mathematicians only include the essential requirements in axioms,
nothing that can be derived from them.
The concept of norm is closely related to the concept of dot product. With the norm,
we will measure the lengths of vectors:
(N1) $\|\lambda v\| = |\lambda| \cdot \|v\|$,
(N2) $\|u + v\| \le \|u\| + \|v\|$ (the triangle inequality),
(N3) $\|v\| = 0 \Leftrightarrow v = \vec{0}$.
�v�
In Theorem 5.16 in Sect. 5.3 we have already encountered such a norm: The absolute
value of a complex number is a norm if we interpret C as the vector space R2.
All properties (S1) to (S4) can be easily calculated. (S4) follows from the fact that sums
of squares in R are always greater than or equal to 0.
If in the future I speak of norm or dot product on Rn, I always mean the mappings from
Definition 10.4.
First of all, a few remarks about this theorem:
In this and the following theorem, it is essential for the first time that we use a Cartesian
coordinate system: The Pythagorean theorem only applies to right angles, and sine and
cosine are only defined in right-angled triangles. Arguments that use cosine, sine or
Pythagoras become wrong if you choose an arbitrary basis.
3. A little trick: If you consider u and v as one-column matrices, then the dot product
$\langle u, v\rangle$ is nothing other than the matrix product $u^T v$, where $u^T$ is the transposed matrix,
that is, u is written as a row vector:
$$\langle u, v\rangle = u^T v. \tag{10.2}$$
You can derive the Cauchy-Schwarz inequality (10.1), the proof of which I have withheld
from you, for R2 and R3 from this formula.
For the proof: First, it is enough to examine vectors u,v with an angle α < 90◦. If the
angle is more than 90°, we calculate with the vectors u and −v, which then form the
angle 180◦ − α with each other. Since cos(180◦ − α) = − cos α, the assertion then fol-
lows from (S2).
We can further restrict ourselves in the calculation to vectors u, v of length 1 and show
for such vectors: $\langle u, v\rangle = \cos\alpha$. For then, for any vectors u, v:
$$\langle u, v\rangle = \|u\|\,\|v\| \left\langle \frac{u}{\|u\|}, \frac{v}{\|v\|} \right\rangle = \|u\|\,\|v\| \cos\alpha.$$
For this situation, we can now make a drawing of the plane in which u and v lie (Fig.
10.2).
From this we can read:
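$$\begin{aligned} 1 + 1 - 2\cos\alpha &= u_1^2 - 2u_1v_1 + v_1^2 + u_2^2 - 2u_2v_2 + v_2^2 + u_3^2 - 2u_3v_3 + v_3^2 \\ 2 - 2\cos\alpha &= \|u\|^2 + \|v\|^2 - 2\langle u, v\rangle \\ \Rightarrow \cos\alpha &= \langle u, v\rangle. \end{aligned}$$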
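The special case α = 90° is still missing. But then $\|u\|^2 + \|v\|^2 = \|u - v\|^2$ (make a
sketch!), and if you insert the coordinates here, you get
$$\sum_{i=1}^{3} u_i^2 + \sum_{i=1}^{3} v_i^2 = \sum_{i=1}^{3} u_i^2 - 2\sum_{i=1}^{3} u_i v_i + \sum_{i=1}^{3} v_i^2,$$
and therefore $\langle u, v\rangle = \sum_{i=1}^{3} u_i v_i = 0 = \cos 90^\circ$.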
The dot product of two vectors that are not 0 in R2 and R3 is therefore only 0 if they are
perpendicular to each other. Such vectors are called orthogonal.
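This property can be used to describe planes in R3 using the dot product. Let a vector
$u = (u_1, u_2, u_3) \ne 0$ be given. Then the set E of all vectors that are perpendicular to u
forms a plane through the origin, for $E = \{x \in \mathbb{R}^3 : \langle u, x\rangle = 0\}$ is the solution set of the
single linear equation $u_1 x_1 + u_2 x_2 + u_3 x_3 = 0$.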
Definition 10.6 In a vector space with dot product, two vectors u and v are called
orthogonal if $\langle u, v\rangle = 0$ holds. We then write $u \perp v$.
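The relationship ⟨u, v⟩ = ‖u‖‖v‖ cos α and the notion of orthogonality can be tried out directly. The sketch below assumes the standard dot product on Rn, that is, the sum of the products of the coordinates, and the norm derived from it; the function names are chosen for illustration.

```python
import math


def dot(u, v):
    """Standard dot product on R^n: sum of the coordinate products."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length of a vector, derived from the dot product."""
    return math.sqrt(dot(v, v))


def angle_degrees(u, v):
    """Angle between two nonzero vectors, from <u, v> = |u| |v| cos(alpha)."""
    cosine = dot(u, v) / (norm(u) * norm(v))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding errors
    return math.degrees(math.acos(cosine))


def is_orthogonal(u, v):
    """Orthogonal means that the dot product is 0 (exact for integer input)."""
    return dot(u, v) == 0


print(dot((1, 2, 3), (4, 5, 6)))
print(is_orthogonal((1, 1, -2), (-1, 1, 0)))
print(angle_degrees((1, 0), (1, 1)))
```

For floating-point coordinates the exact comparison with 0 in `is_orthogonal` should be replaced by a comparison with a small tolerance.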
Examples
I would like to show you a small, but clever application of the dot product in a graph-
ics program: If you have drawn a line on the screen and want to edit it later, you have
to mark it. This is usually done by clicking on the line. Well, not quite exactly on the
line, you can’t aim that precisely, but somewhere near the line. There is a region defined
around the line that represents the active area (Fig. 10.3). If you click in the (in reality
invisible) gray area, the line is marked, otherwise not.
How can you find out with the least possible effort whether the point is in the gray
area? We proceed in two steps. The first one is shown in Fig. 10.4. The endpoints of the
line form the vectors u,v, the mouse click takes place at the point m.
First we check whether m is in the marked strip. This is the case if the two angles α
and β are both acute angles, that is, if cos α > 0 and cos β > 0.
Since we know that
$$\langle v - u, m - u\rangle = \|v - u\|\,\|m - u\| \cos\alpha, \quad \langle u - v, m - v\rangle = \|u - v\|\,\|m - v\| \cos\beta,$$
it reduces to checking the two relations
$$\langle v - u, m - u\rangle > 0, \quad \langle u - v, m - v\rangle > 0. \tag{10.4}$$
In the second step we test whether m is additionally in the rectangle shaded gray in Fig.
10.5. This is the case if the area of the dotted parallelogram is less than half the gray
area, that is, less than l · ε.
But how do we get the dotted area? Here the determinant helps us again, whose absolute
value indicates the size of the area:
$$\left|\det\left(v - u, \; m - u\right)\right| < l \cdot \varepsilon. \tag{10.5}$$
You can count that (including the calculation of l · ε ) in (10.4) and (10.5) just 7 multipli-
cations are needed to carry out the test.
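The two steps (10.4) and (10.5) fit into one short function. The following is a sketch under stated assumptions: the function name `click_hits_line` is hypothetical, l is taken to be the length of the line, that is, ‖v − u‖, ε is half the width of the active area, and the absolute value of the determinant is used as the area of the parallelogram.

```python
import math


def click_hits_line(u, v, m, eps):
    """True if the click m lies in the active area around the line from u to v."""
    d = (v[0] - u[0], v[1] - u[1])    # v - u
    mu = (m[0] - u[0], m[1] - u[1])   # m - u
    mv = (m[0] - v[0], m[1] - v[1])   # m - v

    # Step 1, relations (10.4): both angles alpha and beta must be acute.
    if d[0] * mu[0] + d[1] * mu[1] <= 0:
        return False
    if -d[0] * mv[0] - d[1] * mv[1] <= 0:   # <u - v, m - v>
        return False

    # Step 2, relation (10.5): area of the parallelogram spanned by v - u and m - u.
    area = abs(d[0] * mu[1] - d[1] * mu[0])
    length = math.hypot(d[0], d[1])          # l, assumed to be the length of the line
    return area < length * eps


print(click_hits_line((0, 0), (10, 0), (5, 0.5), 1.0))  # close to the line
print(click_hits_line((0, 0), (10, 0), (5, 2.0), 1.0))  # too far away
```

In an interactive program the length of the line would typically be computed once when the line is created, so that the square root does not have to be evaluated on every mouse click.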
Because of (10.6) it follows that cos α = cos β. Since the angles are between 0 and 180◦,
it must be α = β, the orthogonal mapping is also angle-preserving.
Examples
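$$D_\alpha = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}, \quad M_\alpha = \begin{pmatrix} \cos\alpha & \sin\alpha \\ \sin\alpha & -\cos\alpha \end{pmatrix}$$
are orthogonal mappings (rotation and reflection). Please calculate for yourself what
$$\left\langle D_\alpha \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, D_\alpha \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right\rangle$$
results in. ◄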
This calculation is tedious. Surprisingly, however, orthogonal matrices can also be
characterized very easily:
Theorem 10.11 The following three assertions for a matrix A ∈ Rn×n are
equivalent:
a) A is orthogonal.
b) The columns of the matrix form an orthonormal basis.
c) It holds that $A^{-1} = A^T$.
I show the cycle of implications a) ⇒ b) ⇒ c) ⇒ a). This makes the three assertions
equivalent.
a) ⇒ b): For an orthogonal mapping A it holds that ker A = {0}, because for $u \ne 0$
it holds that $\|u\| \ne 0$ and thus also $\|f(u)\| \ne 0$. Therefore, A is bijective and the columns
$s_1, s_2, \dots, s_n$ of A, which represent the images of the orthonormal basis $b_1, b_2, \dots, b_n$, are
again a basis. Furthermore,
$$\langle s_i, s_j\rangle = \langle f(b_i), f(b_j)\rangle = \langle b_i, b_j\rangle = \begin{cases} 1 & i = j \\ 0 & i \ne j, \end{cases} \tag{10.7}$$
and that means precisely that the column vectors have length 1 and are pairwise orthogonal.
b) ⇒ c): We calculate $C = A^T A$. The rows of $A^T$ contain, like the columns of A, the
orthonormal basis. If we calculate the element $c_{ij}$ of the product matrix, then this is just
the dot product of the i-th row of $A^T$ with the j-th column of A. So it is $c_{ij} = \langle s_i, s_j\rangle$ and
according to (10.7) thus C = I, the identity matrix.
You should calculate the product $A^T A$ on paper, then you will see that there is nothing
mysterious about it.
c) ⇒ a): For this we need the trick from (10.2) and a little lemma about transposed
matrices: It is $(AB)^T = B^T A^T$. I will just use it without proof. From $A^T A = I$ it then
follows for all vectors u, v:
$$\langle Au, Av\rangle = (Au)^T (Av) = u^T A^T A v = u^T v = \langle u, v\rangle.$$
I find statement c) in particular very exciting. Can you still remember how hard it was to
determine the inverse of a matrix? Of all things, for the important orthogonal matrices, we
only need to transpose and we have it. Sometimes life is good to us.
From c) you can see that with A also the matrix AT is orthogonal, because the inverse of
an orthogonal mapping is again orthogonal. This has the consequence that in an orthogo-
nal matrix not only the columns, but also the rows form an orthonormal basis.
Determinants and eigenvalues of orthogonal matrices can only take certain values:
Theorem 10.12 For an orthogonal matrix A it holds that det A = ±1, for eigenvalues λ
of A it holds that λ = ±1.
For a length- and angle-preserving mapping one would certainly not have expected
anything else, but we can also write it down very quickly:
It is $1 = \det I = \det(A^T A) = \det A^T \cdot \det A = \det A \cdot \det A$, so det A = ±1; and for an
eigenvector v to the eigenvalue λ, on the one hand $\|Av\| = \|\lambda v\|$, on the other hand, because
of the orthogonality, also $\|Av\| = \|v\|$ and thus $\|\lambda v\| = \|v\|$, from which $|\lambda| \cdot \|v\| = \|v\|$
and thus, because of $v \ne \vec{0}$, also λ = ±1 follows.
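Both the characterization c) of Theorem 10.11 and the determinant statement of Theorem 10.12 are easy to test. The sketch below uses exact fractions and two example matrices chosen here for illustration, built from the Pythagorean triple 3, 4, 5 so that no rounding occurs; the function names are hypothetical.

```python
from fractions import Fraction


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def mat_mul(a, b):
    """Matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def is_orthogonal_matrix(m):
    """Criterion c): A is orthogonal exactly when A^T A is the identity matrix."""
    n = len(m)
    identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return mat_mul(transpose(m), m) == identity


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# cos(alpha) = 3/5 and sin(alpha) = 4/5 give exact entries.
ROTATION = [[Fraction(3, 5), Fraction(-4, 5)],
            [Fraction(4, 5), Fraction(3, 5)]]
REFLECTION = [[Fraction(3, 5), Fraction(4, 5)],
              [Fraction(4, 5), Fraction(-3, 5)]]

for name, matrix in (("rotation", ROTATION), ("reflection", REFLECTION)):
    print(name, is_orthogonal_matrix(matrix), det2(matrix))
```

Since the inverse of an orthogonal matrix is simply its transpose, no elimination procedure is needed to undo such a mapping.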
In Chap. 9 we treated eigenvalues and eigenvectors. I now add one more important theo-
rem from eigenvalue theory, which we can only formulate now.
Theorem 10.14 Let A ∈ Rn×n be a symmetric matrix, that is A = AT. Then the
matrix has n eigenvectors that form an orthonormal basis.
The matrix A is therefore diagonalizable and the basis transition matrix is orthogonal.
It is easy to calculate that eigenvectors to different eigenvalues are orthogonal, see
Exercise 10 in this chapter. It is just as easy to check that the characteristic polynomial
of the matrix A decomposes into linear factors, that is, it has n real roots. But these roots
can be multiple roots, and it is quite tricky to prove that a root of multiplicity k in this
case also has an eigenspace of dimension k. I omit the proof of the theorem.
We will need the statement twice later when applying mathematical methods: first
when calculating the centrality of nodes in a graph in Sect. 11.1, and second when doing
principal component analysis in Sect. 22.2.
Now we know a lot about the shape of orthogonal matrices, and we will use this knowl-
edge immediately to determine all orthogonal mappings in R2 and R3.
Let’s start with the R2. For the matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ of such a mapping, it must hold:
$a^2 + c^2 = 1$, because the column vectors have length 1. Let’s take a and c as the legs of a
right-angled triangle with hypotenuse 1 and call α the angle between the hypotenuse and
a, then we see that
$$a = \cos\alpha, \quad c = \sin\alpha.$$
Furthermore, $a^2 + b^2 = 1$ is also true, because the row vectors also have length 1, so
$(\cos\alpha)^2 + b^2 = 1$ applies. This gives $b^2 = (\sin\alpha)^2$ (because $(\cos\alpha)^2 + (\sin\alpha)^2 = 1$),
and we get
$$b = \pm\sin\alpha.$$
From $c^2 + d^2 = 1$ we conclude in the same way:
$$d = \pm\cos\alpha.$$
Finally, we use the orthogonality of the columns: ab + cd = 0, so there is still a
restriction on the signs. It turns out:
$$\begin{pmatrix} b \\ d \end{pmatrix} = \begin{pmatrix} \sin\alpha \\ -\cos\alpha \end{pmatrix} \quad\text{or}\quad \begin{pmatrix} b \\ d \end{pmatrix} = \begin{pmatrix} -\sin\alpha \\ \cos\alpha \end{pmatrix}.$$
The situation in R3 is a bit more complex, but here too we can classify all orthogonal
mappings. I would at least like to sketch the procedure. We know that each matrix A in
$\mathbb{R}^{3 \times 3}$ has a real eigenvalue, in this case +1 or −1. First we carry out an orthogonal basis
transition so that the corresponding eigenvector is the first basis vector. This gives us our
matrix in the form:
$$A' = \begin{pmatrix} \pm 1 & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{pmatrix}.$$
This is also an orthogonal matrix; in particular, for the columns $s_1$, $s_2$ it holds that
$\langle s_1, s_2\rangle = 0 = \pm 1 \cdot a_{12} + 0 \cdot a_{22} + 0 \cdot a_{32}$. This gives $a_{12} = 0$. Similarly it follows that $a_{13} = 0$. So
A′ has the form
$$A' = \begin{pmatrix} \pm 1 & 0 & 0 \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{pmatrix}.$$
Now let’s take a look at what the mapping does to the elements of the two-dimensional
subspace W, which is spanned by the columns $s_2$ and $s_3$:
$$\begin{pmatrix} \pm 1 & 0 & 0 \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} 0 \\ x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ a_{22} x_1 + a_{23} x_2 \\ a_{32} x_1 + a_{33} x_2 \end{pmatrix}.$$
We see that the image lands again in the subspace W, that is, the restriction of A′ to W is
a linear mapping on W and this is of course orthogonal again. So the 2 × 2 submatrix of
A′ has the form $D_\alpha$ or $M_\alpha$ from Theorem 10.15.
In the first case, A′ therefore has the form
$$A' = \begin{pmatrix} \pm 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix},$$
in the second case we can simplify the matrix even further by another orthogonal basis
transition: we rotate the plane W around the first coordinate axis until the second basis
vector becomes the axis of reflection. The third basis vector, which is perpendicular to it,
becomes the second eigenvector of the reflection, and the matrix takes the form:
$$A'' = \begin{pmatrix} \pm 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}.$$
Now we have gathered all the possibilities and we can formulate the following result as a
final highlight of the section:
Type 1 is a rotation about an axis, type 2 is a rotation about an axis followed by a
reflection in the plane through the origin perpendicular to this axis, the rotational reflection.
The two matrices on the right are a bit out of the ordinary, but they also fall under
these types, we shouldn’t have to write them down at all: With B it is a rotation around
the x2-axis by 180°, with D it is a reflection at the x1-x2-plane, with no rotation taking
place beforehand.
There are no other orthogonal mappings in R3! The orientation-preserving orthogonal
mappings are of type 1, as you can see from the determinant.
So if you move an object somehow in space, this results in exactly one translation and
one rotation around an axis. Nothing more. Would you have believed that?
We have already come across this several times in our investigation of linear mappings:
one important type of mapping is not covered by them, namely translation. Of course, we are
constantly picking up objects with the mouse and dragging them across the screen, robots
are moving objects from one place to another, but our theory cannot be used for this.
Translations move the origin, but every linear mapping f maps the origin to itself,
because f(0) = 0.
That’s a shame, because linear mappings can be described well using matrices, and
the matrix product can also be used to calculate consecutive linear mappings well. It
would be nice if we could treat translations in the same way.
We use a trick for this. If the origin bothers us, we simply take it out. Let’s first look
at this in the plane, in space (and also in higher dimensions) it works exactly the same.
Of course, we cannot simply tear a hole in the plane, but we go one dimension higher
and move our x1-x2-plane a bit in x3-direction, usually exactly by the value 1 (Fig. 10.6).
The origin is gone already. Now the points have other coordinates: (x1 , x2 ) becomes
(x1 , x2 , 1). We now identify this point with (x1 , x2 ), and so that it is clear that I am
referring to the point of the plane I write [x1 , x2 , 1] for it from now on and call [x1 , x2 , 1] the
homogeneous coordinates of the point (x1 , x2 ).
What have we gained? Now we can also specify linear mappings that shift the “ori-
gin” [0, 0, 1]. We have to be careful though, because we can only use such linear map-
pings which do not change the x3 component, which always has to remain 1. All
mappings of the following form are possible:
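$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & 0 & 1 \end{bmatrix}, \tag{10.8}$$
because
$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ 1 \end{bmatrix}.$$
You see: Now I have also written the matrices in square brackets. For linear mappings on
homogeneous coordinates we introduce this notation. In these matrices, the third row is
always equal to [0, 0, 1].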
But now to specific mappings in our plane: Let’s look at the following matrix:
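$$\begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{10.9}$$
As a mapping in the R3, this matrix, as we know, represents a rotation around the x3-axis.
What happens to the points in our plane? Let’s try it out:
$$\begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\alpha \, x_1 - \sin\alpha \, x_2 \\ \sin\alpha \, x_1 + \cos\alpha \, x_2 \\ 1 \end{bmatrix}.$$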
The result is the same as if we had performed a rotation in the R2, so we have found
our homogeneous rotation matrix here. This rotation is carried out around the point
[0, 0, 1] = (0, 0). Similarly, we obtain with the matrix
$$\begin{bmatrix} \cos\alpha & \sin\alpha & 0 \\ \sin\alpha & -\cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
a reflection, and you can immediately check that the matrix
$$\begin{bmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
does nothing else with the homogeneous coordinates but the linear mapping
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$
in the usual x1-x2-plane. So we can still use all the linear mappings we have learned so
far. But now to something new.
Look at the following mapping:
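$$\begin{bmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix} = \begin{bmatrix} x_1 + a \\ x_2 + b \\ 1 \end{bmatrix}.$$
Here we finally have the translation: The point [x1 , x2 , 1] = (x1 , x2 ) is shifted by
[a, b, 1] = (a, b). Even the “origin” [0, 0, 1] = (0, 0) of our plane is not spared.
Of course we can still perform mappings one after the other as before, we just have to
multiply the matrices. It is
$$\begin{bmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a \\ a_{21} & a_{22} & b \\ 0 & 0 & 1 \end{bmatrix}.$$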
Any mapping that is generated by a matrix of the form (10.8) is thus a combination of a
linear mapping and a translation. These maps are called affine mappings.
For example, let’s carry out a rotation and then a translation:
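$$\begin{bmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \cos\alpha & -\sin\alpha & a \\ \sin\alpha & \cos\alpha & b \\ 0 & 0 & 1 \end{bmatrix}.$$
Applied to a point in our plane, we get:
$$\begin{bmatrix} \cos\alpha & -\sin\alpha & a \\ \sin\alpha & \cos\alpha & b \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\alpha \, x_1 - \sin\alpha \, x_2 + a \\ \sin\alpha \, x_1 + \cos\alpha \, x_2 + b \\ 1 \end{bmatrix}. \tag{10.10}$$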
Example
An object is to be rotated on the screen, but not about the origin, which is probably in
a corner of the screen, but about the center (a, b) of the object. However, the rotation
matrix from (10.9) rotates about the origin. What can we do? Move the object to the
origin, rotate it by the angle α and then move it back (Fig. 10.8). The matrix of the
mapping is:
$$\begin{bmatrix} 1 & 0 & a \\ 0 & 1 & b \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & -a \\ 0 & 1 & -b \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \cos\alpha & -\sin\alpha & -\cos\alpha \, a + \sin\alpha \, b + a \\ \sin\alpha & \cos\alpha & -\sin\alpha \, a - \cos\alpha \, b + b \\ 0 & 0 & 1 \end{bmatrix}$$
Check that the point [a, b, 1] is a fixed point of this mapping. ◄
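The construction "move to the origin, rotate, move back" can be reproduced with 3 × 3 matrices in homogeneous coordinates. The following sketch uses hypothetical helper names (`translation`, `rotation`, `rotation_about`, `apply`) and floating-point arithmetic, so results are compared up to rounding errors.

```python
import math


def translation(a, b):
    """Homogeneous matrix of the translation by (a, b)."""
    return [[1, 0, a], [0, 1, b], [0, 0, 1]]


def rotation(alpha):
    """Homogeneous matrix of the rotation by alpha about the origin, as in (10.9)."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def mat_mul(a, b):
    """Matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rotation_about(center, alpha):
    """Move the center to the origin, rotate, move back (rightmost factor acts first)."""
    a, b = center
    return mat_mul(mat_mul(translation(a, b), rotation(alpha)), translation(-a, -b))


def apply(matrix, point):
    """Apply a homogeneous matrix to the point (x1, x2) = [x1, x2, 1]."""
    h = (point[0], point[1], 1)
    image = [sum(m * x for m, x in zip(row, h)) for row in matrix]
    return (image[0], image[1])


m = rotation_about((3, 2), math.pi / 2)
print(apply(m, (3, 2)))  # the center stays where it is
print(apply(m, (4, 2)))  # a point to the right of the center moves above it
```

Because the order of the factors matters, it helps to read the product from right to left: the rightmost matrix is the first mapping that is applied to the point.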
of the mappings are the 4 × 4 matrices, in which the last row is equal to [0, 0, 0, 1]. For
example, the following matrix defines a rotation around the x1-axis:
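$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha & 0 \\ 0 & \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
and as translation matrix we get:
$$\begin{bmatrix} 1 & 0 & 0 & a \\ 0 & 1 & 0 & b \\ 0 & 0 & 1 & c \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$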
A robot consists of several arms that are connected to each other by joints. The joints can
either perform rotations or translations, and thus influence the position of the gripper (the
effector) that sits at the end of this so-called kinematic chain and performs the tasks for
which the robot is intended. Not only the absolute position, but also the orientation of the
gripper in space is important. This depends on the position of the individual joints.
Each arm of the robot is rigidly connected to an orthogonal coordinate system that
moves along with the arm during a movement. The coordinate system K0 with the axes
x0, y0, z0 is fixed: It represents the position of the robot base, where its origin is placed.
The last coordinate system Kn with the axes xn, yn, zn is rigidly connected to the gripper.
We are looking for the origin and the basis vectors of the gripper system in coordinates of
the base system K0, that is, in world coordinates. For this, too, the description of the
systems in homogeneous coordinates has proved to be suitable.
There is still some freedom in the choice of coordinate systems, which can be used
to make the basis transitions as simple as possible. Here, the arm i is the connection
between the i -th and (i + 1)-th joint, the coordinate system Ki is assigned to the i -th arm.
The Denavit-Hartenberg convention now provides the following rules for the position of
the coordinate systems:
1. The zi-axis is placed in the direction of the movement axis of the (i + 1)-th joint.
2. For i > 1, the xi-axis is placed perpendicular to zi−1 (and of course to zi ).
3. The origin of Ki lies in the plane determined by zi−1 and xi.
4. And finally, yi is added so that Ki becomes a positively oriented Cartesian coordinate
system.
Some special cases, such as parallel z-axes, I have not considered in this somewhat
abbreviated description.
Now we want to express the coordinate system Kn by the coordinates of the system K0.
For this we carry out basis transitions.
Unlike before, our coordinate system is now defined by a basis and an origin, so we
have one more data element to consider. We will use homogeneous coordinates, and then
the basis transition works quite analogously to the one we carried out in standard
coordinates in Sec. 9.3. Theorems 9.21 and 9.22 can be transferred almost literally:
Theorem 10.17 Let f be the affine mapping which maps the coordinate system
K1 to the coordinate system K2. Let T be the matrix of this mapping with respect
to the coordinate system K1. Then the coordinates of a vector v with respect to K2
are obtained by multiplying the coordinates of v with respect to K1 from the left
with T −1. The coordinates of v with respect to K1 are obtained from those with
respect to K2 by multiplication from the left with T .
If T is the transition matrix from K1 to K2 and S is the transition matrix from K2 to K3,
then TS is the transition matrix from K1 to K3.
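I do not want to prove this theorem, but at least give the transition matrix T. Let
K2 = {b1, b2, b3, u}, where u is to be the origin of K2. The homogeneous coordinates of
K2 with respect to K1 are to be:
$$
b_1 = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \\ 1 \end{pmatrix}_{K_1}, \quad
b_2 = \begin{pmatrix} b_{12} \\ b_{22} \\ b_{32} \\ 1 \end{pmatrix}_{K_1}, \quad
b_3 = \begin{pmatrix} b_{13} \\ b_{23} \\ b_{33} \\ 1 \end{pmatrix}_{K_1}, \quad
u = \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ 1 \end{pmatrix}_{K_1}.
$$
Then the transition matrix is
$$
T = \begin{pmatrix}
b_{11} - u_1 & b_{12} - u_1 & b_{13} - u_1 & u_1 \\
b_{21} - u_2 & b_{22} - u_2 & b_{23} - u_2 & u_2 \\
b_{31} - u_3 & b_{32} - u_3 & b_{33} - u_3 & u_3 \\
0 & 0 & 0 & 1
\end{pmatrix}.
\tag{10.11}
$$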
Aren’t the images of the basis vectors in the columns of this matrix any more? Strictly
speaking they are: [b11 , b21 , b31 , 1] are the coordinates of the “tip” of the basis vector b1 with
respect to K1, length and direction of the first basis vector are then given by the first column
of the matrix, see Fig. 10.9.
We will now specify a number of basis transitions for the robot that transform K0 into Kn
(Fig. 10.10). First, we can generate Ki+1 from Ki by four consecutive transitions as fol-
lows.
First, Ki is rotated around the zi-axis by the angle θi until xi and xi+1 have the same
direction. This is possible because of 2). This movement has the following transition matrix:
$$
D_{\theta_i} = \begin{pmatrix}
\cos\theta_i & -\sin\theta_i & 0 & 0 \\
\sin\theta_i & \cos\theta_i & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.
$$
The system now obtained is moved in the direction of zi until its origin lies on the line
determined by xi+1. This is possible because of 3). As a third step, the two origins can be
brought into coincidence by another translation along this line. These movements have the
translation matrices
$$
T_{s_i} = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & s_i \\
0 & 0 & 0 & 1
\end{pmatrix}, \quad
T_{a_i} = \begin{pmatrix}
1 & 0 & 0 & a_i \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.
$$
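Finally, only a further rotation around the xi-axis is necessary to bring the vector yi into
coincidence with yi+1:
$$
D_{\alpha_i} = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & \cos\alpha_i & -\sin\alpha_i & 0 \\
0 & \sin\alpha_i & \cos\alpha_i & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.
$$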
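This gives the transition matrix from Ki to Ki+1 the following form:
$$
A_{i,i+1} = D_{\theta_i} T_{s_i} T_{a_i} D_{\alpha_i} = \begin{pmatrix}
\cos\theta_i & -\cos\alpha_i \sin\theta_i & \sin\alpha_i \sin\theta_i & a_i \cos\theta_i \\
\sin\theta_i & \cos\alpha_i \cos\theta_i & -\sin\alpha_i \cos\theta_i & a_i \sin\theta_i \\
0 & \sin\alpha_i & \cos\alpha_i & s_i \\
0 & 0 & 0 & 1
\end{pmatrix}.
\tag{10.12}
$$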
Since the zi-axis always runs along the movement axis of the (i + 1)-th joint, a movement
of this joint results either in a rotation in the xi-xi+1-plane (around the zi-axis) or in a
translation in zi-direction. This means that, depending on the type of joint, either θi or si
is variable in the matrix (10.12); the other elements are robot constants.
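With this transition, we therefore obtain the origin and basis vectors of Ki+1 in the
coordinates of Ki. Let’s test it with the origin and two basis vectors of Ki+1:
$$
A_{i,i+1} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}_{i+1}
= \begin{pmatrix} a_i \cos\theta_i \\ a_i \sin\theta_i \\ s_i \\ 1 \end{pmatrix}_{i}, \quad
A_{i,i+1} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 1 \end{pmatrix}_{i+1}
= \begin{pmatrix} \cos\theta_i + a_i \cos\theta_i \\ \sin\theta_i + a_i \sin\theta_i \\ s_i \\ 1 \end{pmatrix}_{i},
$$
$$
A_{i,i+1} \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}_{i+1}
= \begin{pmatrix}
\sin\alpha_i \sin\theta_i + a_i \cos\theta_i \\
-\sin\alpha_i \cos\theta_i + a_i \sin\theta_i \\
\cos\alpha_i + s_i \\
1
\end{pmatrix}_{i}.
$$
Can you follow the results by looking at the drawing in Fig. 10.10? The origin and basis
vectors are shifted in direction xi+1 and zi. In addition to this shift, xi is rotated in the x1-y1-
plane by θi, and zi is rotated in the y1-z1-plane by αi.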
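If you do not want to multiply the four matrices by hand, you can let the computer compare the product with the entries of (10.12). The following Python sketch does this for one arbitrarily chosen set of parameters; the function names and the sample values θ = 0.7, s = 1.5, a = 2.0, α = −0.4 are assumptions of mine, not part of the convention.

```python
import math

def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def rot_x(alpha):
    c, s = math.cos(alpha), math.sin(alpha)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def trans(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def dh_product(theta, s, a, alpha):
    """D_theta * T_s * T_a * D_alpha, multiplied out numerically."""
    return matmul(matmul(rot_z(theta), trans(0, 0, s)),
                  matmul(trans(a, 0, 0), rot_x(alpha)))

def dh_closed_form(theta, s, a, alpha):
    """The matrix (10.12), written out entry by entry."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [[ct, -ca * st, sa * st, a * ct],
            [st, ca * ct, -sa * ct, a * st],
            [0, sa, ca, s],
            [0, 0, 0, 1]]

P = dh_product(0.7, 1.5, 2.0, -0.4)
C = dh_closed_form(0.7, 1.5, 2.0, -0.4)
max_diff = max(abs(P[i][j] - C[i][j]) for i in range(4) for j in range(4))
origin = [row[3] for row in P]  # image of (0, 0, 0, 1) is the last column
```

The value `max_diff` should only differ from 0 by rounding errors, and `origin` contains the coordinates of the origin of Ki+1 in the system Ki.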
Now we can set up a whole chain of these transitions: Ai+1,i+2 expresses Ki+2 in coor-
dinates of Ki+1, so Ai,i+1 ◦ Ai+1,i+2 gives us the coordinates of basis Ki+2 in the Ki-system,
and so on. Finally, the transition
$$
B = A_{0,1} \circ A_{1,2} \circ \cdots \circ A_{n-1,n}
$$
provides the origin and orientation of the effector in world coordinates. From the shape
of the transition matrix (10.11) you can see that the first three columns give the direction
of the basis vectors and the last column gives the origin of the effector system.
Of course, the individual transitions depend on the current position of the robot joint
axes, and so the transition B changes with every robot movement.
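To see how B reacts to the joint positions, here is a small Python sketch of the direct kinematic problem. The robot is a made-up planar arm with two rotary joints and two arms of length 1 (all si and αi are 0); it is not an example from the text, and the names `dh`, `forward` and `position` are hypothetical.

```python
import math
from functools import reduce

def matmul(A, B):
    """Multiply two 4 x 4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def dh(theta, s, a, alpha):
    """Transition matrix of the form (10.12) for one joint."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [[ct, -ca * st, sa * st, a * ct],
            [st, ca * ct, -sa * ct, a * st],
            [0, sa, ca, s],
            [0, 0, 0, 1]]

def forward(joints):
    """Multiply the transitions of all joints, starting at the robot base."""
    identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    return reduce(matmul, (dh(*joint) for joint in joints), identity)

# Joint 1 turned by +90 degrees, joint 2 by -90 degrees, arm lengths 1 and 1.
B = forward([(math.pi / 2, 0, 1, 0), (-math.pi / 2, 0, 1, 0)])
position = [B[0][3], B[1][3], B[2][3]]  # origin of the effector system
```

The first arm points upwards and the second one to the right, so the effector ends up at approximately (1, 1, 0). If you change one of the angles, B has to be recomputed.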
This poses two problems for the programmer of a robot controller:
• the direct kinematic problem, which I have briefly treated here: to determine the position
and orientation of the effector from the position of the individual joints,
• and the inverse kinematic problem: to find appropriate joint values for a given position
of the effector.
The second is the more important, but unfortunately also by far the more difficult prob-
lem.
Comprehension Questions
1. Does a vector space with addition and with a dot product as multiplication form a
ring?
2. Why can a dot product only be defined meaningfully for real vector spaces?
3. Explain the relationship between dot product and norm.
4. If a linear mapping in R3 preserves the angles between all vectors, then it also pre-
serves the length of the vectors and is therefore an orthogonal mapping.
5. What types of orthogonal mappings are there in R3 ?
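6. The rotation in R2 has the matrix
$\begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$,
the reflection the matrix
$\begin{pmatrix} \cos\alpha & \sin\alpha \\ \sin\alpha & -\cos\alpha \end{pmatrix}$.
And which mappings are described by
$\begin{pmatrix} -\cos\alpha & \sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$
or by
$\begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}$?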
7. Translation is not a linear mapping. But is it an “orthogonal” mapping in
the sense that it preserves the length of vectors and the angle between vectors?
8. Why is it sometimes useful to calculate with homogeneous coordinates?
Exercises
1. Determine a linear equation whose solution set describes the plane determined by
the points (1, 0, 1), (2, 1, 2),and (1, 1, 3) in R3.
2. Determine the angle between the vectors (2, 1, −1) and (1, 2, 1).
3. Determine the angle between the diagonal of a cube and an edge adjacent to the
diagonal.
4. Complete the vector $(1/2, 1/2, 1/\sqrt{2})$ to an orthonormal basis of R3.
5. Prove: If the vectors v1 , v2 , . . . , vn are pairwise orthogonal and all different from 0,
then they are also linearly independent.
6. What type of mapping do you get in the R3, if you successively carry out a rota-
tion, 2 reflections, 2 rotations and 4 more reflections?
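7. Which of the following matrices are orthogonal?
$$
\text{a)}\ \begin{pmatrix}
\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\
\frac{1}{2} & \frac{1}{2} & -\frac{1}{\sqrt{2}} \\
\frac{1}{2} & \frac{1}{2} & \frac{1}{\sqrt{2}}
\end{pmatrix} \qquad
\text{b)}\ \begin{pmatrix}
\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}
\end{pmatrix} \qquad
\text{c)}\ \begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\
\frac{1}{2} & -\frac{5}{6} & \frac{1}{6} & \frac{1}{6} \\
\frac{1}{2} & \frac{1}{6} & -\frac{5}{6} & \frac{1}{6} \\
\frac{1}{2} & \frac{1}{6} & \frac{1}{6} & -\frac{5}{6}
\end{pmatrix}.
$$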
8. Determine the homogeneous matrix of a mapping which maps the square 1 from
Fig. 10.11 with edge length 1 and origin in the lower left to the square 2.
9. Derive the triangle inequality $\|u + v\| \le \|u\| + \|v\|$ from the Cauchy-Schwarz
inequality $|\langle u, v \rangle| \le \|u\| \cdot \|v\|$ for the norm $\|u\| := \sqrt{\langle u, u \rangle}$.
10. Show: If A is a symmetric matrix, that is $A = A^T$, then eigenvectors to different real
eigenvalues are orthogonal.
Hint: If $\lambda_1, \lambda_2$ are the eigenvalues belonging to $v_1, v_2$, then show that
$\lambda_1 \langle v_1, v_2 \rangle = \lambda_2 \langle v_1, v_2 \rangle$. Use the
tricks for calculation: $\langle v_1, v_2 \rangle = v_1^T v_2$ and $(AB)^T = B^T A^T$.
Abstract
This chapter contains many algorithms and is particularly close to computer science.
If you have worked through it
• you know the basic concepts of graph theory: nodes, edges, degree, paths, cycles,
isomorphisms, networks and directed graphs,
• you know what trees and rooted trees are,
• you have constructed search trees as an application and can use trees to build the
Huffman code,
• you know what breadth-first search and depth-first search mean and can find short-
est paths in networks
• and you can carry out a topological sorting in directed acyclic graphs.
If you want to drive from Flensburg to Freiburg by car and do not know the way, you use
a navigation system that calculates the shortest or fastest route between the two towns
and leads you along this route. The device knows a list of places and connecting roads
between these places. The length of the routes between the places or the approximate
travel time is known.
This task is a typical problem of graph theory: Graphs are structures consisting of
a number of nodes (here the places) and edges (the roads), where the edges sometimes
carry directions (one-way streets) or weights (length, travel time). The route planner
searches for the shortest path between two nodes.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_11
Graphs occur in many areas of technology, computer science and everyday life. In
addition to road networks, all kinds of networks are graphs: power distribution networks,
electrical circuits, communication networks, molecule structures and much more. At the
beginning of Chap. 5 you will find a graph in which the nodes are algebraic structures
and the edges are relations between these. For example, class diagrams of an object-ori-
ented design are also graphs.
In this chapter I would like to introduce you to the basic concepts of graph theory.
The focus will be on trees, which play a particularly important role in computer science.
At the beginning of a theory there is always a whole lot of new terms. First I would like
to put them together:
We will only deal with finite graphs, that is, with graphs in which the set V —and thus
of course the set E—is finite. Furthermore, I do not want to allow two nodes to be con-
nected with more than one edge.
If k = [x, y] is an edge, then x, y are the endpoints of the edge, the nodes x and y are
called incident to k. x and y are then adjacent nodes.
The edge k = [x, y] is called a loop, if x = y. The graph G is called complete, if every
two nodes are adjacent. G′ = (V ′ , E ′ ) is called a subgraph of G, if V ′ ⊂ V and E ′ ⊂ E.
Figure 11.1 shows an example. Here V = {x1 , x2 , x3 , x4 }, E = {k1 , k2 , k3 , k4 } with
k1 = [x1 , x2 ], k2 = [x2 , x2 ], k3 = [x2 , x3 ], k4 = [x3 , x4 ].
x1 and x2 are adjacent, x3, x4 are endpoints of k4, k2 is a loop. The graph is not com-
plete. In Fig. 11.2 you can see a complete graph, Fig. 11.3 does not represent a graph. If
you are confronted with problems in which such objects occur, you can easily turn them
into a graph by, for example, adding more nodes to the multiple edges.
How can one represent graphs? First of all, drawings are an option. In Fig. 11.2, how-
ever, it is disturbing that the two diagonals intersect, even though the intersection point
is not a node. Figure 11.4 shows we can also draw the edges without intersections in this
graph. Is it always possible to represent graphs in such a way that edges only intersect in
nodes? In the R3 this is possible: If V is a finite subset of the R3 (the nodes) and the ele-
ments of V are connected by a finite set of line segments (the edges), then we can always
choose them so that they only intersect in the nodes.
In the R2 this is not always possible, however. Figure 11.5 shows a graph you cannot
map into the R2 without edges intersecting. Graphs that allow a non-intersecting repre-
sentation in the plane are called planar graphs.
A big problem in the design of circuit boards or integrated circuits is to plan the electrical con-
nections in such a way that they can be placed with as little overlap as possible. As we have
seen, this is not always possible. That is why circuit boards often consist of several layers.
Drawings are an important tool. But you have to be careful with them and not read more
into them than is actually there. For example, Fig. 11.6 shows the same graph twice.
Two graphs are to be considered equal if they only differ in the labeling of their
nodes:
In Sect. 5.7 I have presented two hard mathematical problems that are used in cryptography:
the factorization problem and the discrete logarithm problem. Here we have found another
problem of this kind. There are cryptographic protocols that are based on it, but I do not
know any widespread product for it. Maybe graph isomorphisms will become even more
important if someone cracks the factorization problem?
The representation of graphs in the computer can be done using adjacency matrices. The
nodes are numbered from 1 to n, the element aij of the adjacency matrix is 1 if the nodes
i , j are adjacent, and has the value 0 otherwise. The adjacency matrix of the graph from
Fig. 11.6 has the form:
$$
\begin{pmatrix}
0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0
\end{pmatrix}.
$$
The matrix of a graph is always symmetric, and on the diagonal there is a 1 if the respective
node has a loop.
Adjacency matrices are only suitable for the implementation of graphs if they contain
many entries different from 0, that is, if many nodes are connected to each other. In many
graphs, nodes are only connected to a few other nodes; think, for example, of the
connections in the route planner. For such graphs, one chooses other representations, such as
adjacency lists, in which a list of adjacent nodes is attached to each node.
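The two representations can be converted into each other in a few lines. The following Python sketch uses the adjacency matrix of the graph from Fig. 11.6; the function names `to_adjacency_lists` and `is_symmetric` are my own choice, and a dictionary of lists is only one of several possible ways to store adjacency lists.

```python
# Adjacency matrix of the graph from Fig. 11.6, nodes numbered 1 to 5.
A = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]

def to_adjacency_lists(matrix):
    """Return a dictionary that maps each node to the list of its neighbours."""
    return {i + 1: [j + 1 for j, entry in enumerate(row) if entry == 1]
            for i, row in enumerate(matrix)}

def is_symmetric(matrix):
    """Check a[i][j] == a[j][i] for all pairs of indices."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(n) for j in range(n))

adj = to_adjacency_lists(A)
```

Here the matrix stores 25 entries, while the lists only contain the 10 entries that are really needed. The difference grows quickly with the number of nodes.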
Definition 11.3 If x is a node of the graph G, then the number of edges incident to
x is called the degree of x. Loops are counted twice. The degree of x is denoted by
d(x).
In Fig. 11.1 the nodes x1 and x4 have degree 1, x3 has degree 2 and x2 has degree 4.
Let A be the adjacency matrix of a graph with nodes x1, x2, ..., xn. The i-th row is
(ai1, ai2, ..., ain), and aij is 1 if there is an edge from node i to node j. If the graph has no
loops, the degree of node i is therefore exactly the number of 1s in the i-th row of the matrix
(a loop would appear only once on the diagonal, although it counts twice in the degree). If you
multiply the adjacency matrix with the n-dimensional vector (1, 1, 1, 1, ..., 1), you get in the
i-th row the value (ai1 + ai2 + ... + ain), that is,
$$
A \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
= \begin{pmatrix} d(x_1) \\ d(x_2) \\ \vdots \\ d(x_n) \end{pmatrix}.
$$
Theorem 11.4 In every graph G = (V, E) it holds that $\sum_{x \in V} d(x) = 2 \cdot |E|$.
Proof: Each endpoint of an edge provides exactly the contribution 1 to the degree of this
point. Since each edge has exactly 2 endpoints, it contributes exactly twice the value 1 to
the left side.
Corollary 11.5 In every graph the number of nodes of odd degree is even.
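Both statements can be checked on the graph from Fig. 11.1. In the following Python sketch the edges k1 to k4 are stored as pairs of node names; the function name `degrees` is an assumption of mine.

```python
from collections import Counter

# The edges k1, k2, k3, k4 of the graph from Fig. 11.1; k2 is a loop.
edges = [("x1", "x2"), ("x2", "x2"), ("x2", "x3"), ("x3", "x4")]

def degrees(edge_list):
    """Count for every node the endpoints of edges that meet it."""
    d = Counter()
    for x, y in edge_list:
        d[x] += 1
        d[y] += 1  # for a loop x == y, so the loop is counted twice
    return d

d = degrees(edges)
degree_sum = sum(d.values())
odd_nodes = sorted(x for x, deg in d.items() if deg % 2 == 1)
```

The sum of the degrees is 8, which is twice the number of edges, and there are exactly two nodes of odd degree, namely x1 and x4.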
Paths in Graphs
You can therefore omit detours, no node has to be visited multiple times.
Without explicitly carrying out the proof, I give you the very intuitive method: Let
x1 x2 x3 ... xn be the walk. If you start at x1 and encounter a node for the second time,
you remove all nodes in between, as well as one of the two occurrences of the repeated node.
In the graph from Fig. 11.7 you can in this way obtain the path x1 x2 x3 x4 x5 x7 from the
walk x1 x2 x3 x4 x5 x6 x2 x3 x4 x5 x7. This is repeated as often as necessary until all multiple
nodes are removed.
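The method of removing detours can be sketched in Python as follows. The function name `walk_to_path` is hypothetical, and the sketch assumes that the walk is given as a list of node names.

```python
def walk_to_path(walk):
    """Remove all detours from a walk, so that no node occurs twice."""
    path = []       # the path built so far
    position = {}   # node -> index of the node in path
    for node in walk:
        if node in position:
            # Second visit: cut out everything after the first visit.
            cut = position[node]
            for removed in path[cut + 1:]:
                del position[removed]
            del path[cut + 1:]
        else:
            position[node] = len(path)
            path.append(node)
    return path

walk = ["x1", "x2", "x3", "x4", "x5", "x6", "x2", "x3", "x4", "x5", "x7"]
path = walk_to_path(walk)
```

For the walk from Fig. 11.7 the detour x3 x4 x5 x6 x2 is removed, and what remains is the path x1 x2 x3 x4 x5 x7.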
Definition 11.8 The number of edges of a walk or a path is called the length of the
walk or the path.
In many cases, certain numbers are associated with the edges of a graph, for example
distances, line capacities, costs or duration. If you assign such numbers to the edges, you
will get networks:
Definition 11.9 A graph is called a network if each edge [x, y] is assigned a weight
w(x, y) ∈ R. In networks, the length of a path from one node to another is the sum
of the weights of all the edges of the path.
The search for cost-effective paths in graphs is an important task of computer science.
We will deal with this in Sect. 11.3.
In a social network, people form the nodes and relationships between people form the
edges of a graph. Which people in such a network are probably the most interesting?
Who are the important people in the network that you should contact yourself? I would
like to assign a number to each node to denote its importance, which I call the centrality
of the node.
A first approach is certainly to check who has the most contacts. The centrality is then
the degree of the corresponding node. But if you want to look a little deeper, not only
the number of contacts is interesting, but also the quality of the contacts: A person is not
important just because he knows many other people, but because he knows many impor-
tant people, that is, those who in turn know many people. Let’s look at a small example
(Fig. 11.8).
The adjacency matrix of the graph is
$$
\begin{pmatrix}
0 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}.
$$
The nodes x1, x3 and x4 each have degree 3, x2 has degree 2 and the node x5 is some-
what remote, it only has degree 1. If only the degrees are considered, the three nodes x1,
x3 and x4 would be equally central. But if the importance of the respective neighbors is
taken into account, x4 should have less weight. How can this be reproduced? Let’s try the
approach that the still unknown centrality wi of the node xi should be proportional to the
sum of the centralities of all nodes xj adjacent to xi. We call the proportionality factor µ:
$$
w_i = \mu \sum_{j=1}^{n} a_{ij} w_j .
$$
The sum is exactly the i-th component of the product of the matrix A with the vector
w = (w1, w2, ..., wn), see (7.6). So it is
$$
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}
= \mu A \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix},
$$
or, with λ = 1/µ,
$$
A w = \lambda w .
$$
This matrix equation we have seen before: If there is such a vector w and such a factor λ,
then w is just an eigenvector to the eigenvalue λ of the adjacency matrix A of the graph,
see Definition 9.10 at the beginning of Sect. 9.2.
The mathematician now asks: Are there any meaningful solutions for this equation
at all? Not every matrix has eigenvalues, and here all wi should also be greater than 0
if a meaningful measure for centrality is to result. A theorem from the theory of eigen-
vectors helps us further: The adjacency matrix of a connected graph has at least one
positive real eigenvalue. The largest of these eigenvalues has an eigenvector with only
positive components. If you would like to read more about this: This result is part of the
Perron-Frobenius Theorem. The matrix is symmetric, and therefore eigenvectors belonging to
different eigenvalues are orthogonal according to Theorem 10.14. If a vector is orthogonal to
the eigenvector with positive components, it must have negative components, otherwise the
scalar product could not be 0. So all other eigenvectors have mixed signs in the components.
Therefore, the positive eigenvector is the only reasonable candidate for the weights of the
nodes of the graph. And this eigenvector always exists! Its components are called the
eigenvector centrality of the nodes of the graph. Internet search engines use variants of the
eigenvector centrality, among other things, to determine the ranking of documents in the search
results: Documents referenced by many links are placed at the top, especially if the linking
pages are themselves referenced by many other pages.
Let’s work through the example from above. With a mathematics tool, the eigenvalues
are quickly calculated. The largest of them has the value 2.64, and an associated eigen-
vector is
$$
\begin{pmatrix} 1.0 \\ 0.76 \\ 1.0 \\ 0.88 \\ 0.33 \end{pmatrix}.
$$
You can see that the nodes x1 and x3 are now more important than x4. This was down-
graded because it is in contact with the insignificant x5.
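If you have no mathematics tool at hand, a few lines of Python are sufficient for this example. The sketch below uses the power iteration: the matrix is applied again and again to a start vector, and the result is rescaled after every step. This is a method I have swapped in for the calculation, it is not described in the text; the function name and the number of 200 steps are arbitrary assumptions. The iteration converges here because the graph is connected and contains a triangle.

```python
# Adjacency matrix of the graph from Fig. 11.8.
A = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

def power_iteration(matrix, steps=200):
    """Approximate the largest eigenvalue and a positive eigenvector."""
    n = len(matrix)
    w = [1.0] * n
    lam = 0.0
    for _ in range(steps):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = max(v)               # estimate of the eigenvalue
        w = [x / lam for x in v]   # rescale: largest component becomes 1
    return lam, w

eigenvalue, centrality = power_iteration(A)
```

Rounded to two decimal places, `eigenvalue` and `centrality` agree with the values given above.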
11.2 Trees
Definition 11.10 A graph in which each two nodes are connected by exactly one
path is called a tree.
Fig. 11.9 shows four examples of trees. Trees can be characterized by many different
properties. The following theorem gives some equivalent conditions to the definition.
Theorem 11.11 Let G be a graph with n nodes and m edges. Then the following
statements are equivalent:
a) G is a tree.
b) G is a connected graph without cycles.
c) G is connected, but if you remove any edge, G splits into two connected compo-
nents.
d) G is connected and n = m + 1 (G has one more node than edges).
b) ⇒ c): If we remove the edge E = [x1, x2], we get the graph G′, in which x1 and x2
are no longer connected (otherwise G would contain a cycle), so G′ is no longer connected.
Let y be any node. If y is not connected to x1 in G′, then the original path from y to x1
contained the edge E and thus also x2. The path from y to x2 therefore still exists, so y is
connected to x2. G′ therefore splits into two components: The nodes of one component are all
connected to x1, the nodes of the other component are all connected to x2.
c) ⇒ a): Since G is connected, any two nodes are connected by a path. If there were
two paths from x1 to x2, you could remove one of these paths without destroying the con-
nection.
To show the equivalence of d) with a), b) and c), we first reason that after removing an
edge from a tree, the resulting connected components again form trees: If, for example,
y and z are nodes in such a component, then there cannot be multiple paths from y to z,
since these would also be paths in the original graph G.
Now we show a) ⇒ d) by induction on the number of nodes n. The base case (n = 1
and n = 2) can be read off from the first two trees in Fig. 11.9. Now let the assertion be
satisfied for all trees with less than n + 1 nodes. Let G be a tree with n + 1 nodes. If we
remove an edge we get 2 subtrees with node numbers n1 and n2, where n1 + n2 = n + 1
applies. For the subtrees, the number of edges is n1 − 1 and n2 − 1 according to the
assumption. This results in the total number of edges of G: n1 − 1 + n2 − 1 + 1 = n.
We also prove d) ⇒ a) by induction, where the base case can again be read off from
Fig. 11.9. Let G be a graph with n + 1 nodes and n edges given. The degree of each node
is at least 1, since G is connected. Then there must be nodes of degree 1, i.e. nodes that
are incident with exactly one edge, because if it were for all nodes d(x) ≥ 2, then accord-
ing to Theorem 11.4 it would follow:
$$
2 \cdot \underbrace{(n + 1)}_{\text{number of nodes}} \le \sum_{x \in V} d(x) = 2 \cdot \underbrace{n}_{\text{number of edges}} .
$$
We now remove such a node x of degree 1 with its edge from G, so a smaller graph results,
which again has exactly one node more than edges. This is therefore a tree according to
the induction hypothesis. If we add x with its edge to this tree again, then x is also con-
nected to every other node of the graph by exactly one path and thus G is a tree.
The most surprising here is probably the property d), a very simple test criterion to deter-
mine whether a given graph is a tree or not.
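Criterion d) translates directly into a short test. In the following Python sketch the connectedness is checked with a breadth-first search; the function names and the three small test graphs are assumptions of mine, and the graphs are assumed to have no multiple edges.

```python
from collections import deque

def is_connected(nodes, edges):
    """Breadth-first search from the first node; True if it reaches all nodes."""
    nodes = list(nodes)
    if not nodes:
        return False
    neighbours = {x: set() for x in nodes}
    for x, y in edges:
        neighbours[x].add(y)
        neighbours[y].add(x)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        x = queue.popleft()
        for y in neighbours[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == len(nodes)

def is_tree(nodes, edges):
    """Criterion d): connected, and one more node than edges."""
    return is_connected(nodes, edges) and len(nodes) == len(edges) + 1

star = is_tree([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)])
cycle = is_tree([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
triangle_plus_point = is_tree([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1)])
```

The third graph shows that counting alone is not enough: it has four nodes and three edges, but it is not connected and therefore not a tree.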
Rooted Trees
So far, in a tree, all nodes and edges have equal status. Often, however, one node in the
tree is specially distinguished: it is called the root, and the corresponding tree is called a
rooted tree. In order not to overdo the analogy to botany, the root of a tree is usually drawn
at the top,
the edges all point downwards. The third tree in Fig. 11.9 is already shown in this way. If
you imagine the nodes of a tree as electrically charged balls that are movably connected
to each other by rods, you can touch any node of the tree and lift it up and get a rooted
tree. So every node in a tree can be a root.
If x and y are connected by an edge and x is closer to the root than y, then x is the
parent of y and y is a child of x. If there is a path from x to y, starting from x and going
to the child, grandchild, great-grandchild, and so on, then x is an ancestor of y and y is a
descendant of x. Nodes with descendants are called inner nodes, nodes without descendants
are called leaves of the tree. The leaves of a tree are exactly the nodes different from
the root with degree 1. With the relation “descendant” between the nodes of a tree, the
nodes are ordered, see Definition 1.12 in Sect. 1.2.
In Fig. 11.10 x is the parent of y, and z is a descendant of x. However, z is not a
descendant of y.
If the order from top to bottom is not the only one of importance, but the order of the
children is also important, one speaks of the first, second, third child, and so on, and one
obtains an ordered rooted tree. In the drawing, the children are ordered from left to right:
y is the first child of x.
Theorem 11.12 In a rooted tree, each node together with all its descendants and
the corresponding edges forms a rooted tree again.
We check the definition of the tree. If x is the new root of the tree and y and z are
descendants of x, then there is a path from y to x and a path from x to z. By combining,
you get a walk and thus also a path from y to z. This path is unique, because otherwise
there would already be several paths from y to z in the original tree.
This property means that trees are excellently suited to be treated recursively: The sub-
trees have the same properties as the original tree, but they always become smaller.
Many tree algorithms are recursive.
Example
The first task of a compiler during the translation of a program is the syntax analysis.
In doing so, a syntax tree is generated from an expression. In such a tree, the inner
nodes represent the operators, and the operands of an operator are formed from the
subtrees that hang on the children of this operator. The syntax tree from Fig. 11.11
belongs to the expression (a+b)·c, the tree from Fig. 11.12 belongs to
(x%4 == 0) && !((x%100 == 0) && (x%400 != 0)). ◄
If the syntax tree is successfully built, the evaluation of the expression can be done recur-
sively: To evaluate the tree, all subtrees that belong to the children of the root have to be
evaluated. Then, the operation of the root can be performed. The process ends when the
root itself is an operand.
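The recursive evaluation can be sketched in Python in a few lines. Here a syntax tree is written as a nested tuple (operator, child, child, ...), and operands are numbers or variable names. This representation, the table `OPERATORS` and the function name `evaluate` are assumptions of mine; a real compiler builds the tree from the source text first.

```python
import operator

# The operators that occur in the expression from Fig. 11.12.
OPERATORS = {
    "%": operator.mod,
    "==": operator.eq,
    "!=": operator.ne,
    "&&": lambda p, q: p and q,
    "!": operator.not_,
}

def evaluate(node, variables):
    """Evaluate a syntax tree given as nested tuples (operator, child, ...)."""
    if isinstance(node, str):        # operand: a variable
        return variables[node]
    if not isinstance(node, tuple):  # operand: a constant
        return node
    op, *children = node
    # First evaluate all subtrees, then perform the operation of the root.
    values = [evaluate(child, variables) for child in children]
    return OPERATORS[op](*values)

# (x%4 == 0) && !((x%100 == 0) && (x%400 != 0))
leap_year = ("&&",
             ("==", ("%", "x", 4), 0),
             ("!", ("&&",
                    ("==", ("%", "x", 100), 0),
                    ("!=", ("%", "x", 400), 0))))

results = {x: evaluate(leap_year, {"x": x}) for x in (1900, 2000, 2023, 2024)}
```

The expression is the well-known leap year rule: it is true for 2000 and 2024 and false for 1900 and 2023.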
If in a rooted tree each node has at most n children, the tree is called an n-ary rooted
tree; if each node has exactly 0 or n children, it is called a regular n-ary rooted tree. In
the case n = 2 we speak of a binary rooted tree, and the children are then also called left
and right children. Now we want to take a closer look at these trees.
a) B has exactly one node of degree 2, all other nodes have degree 1 or 3.
b) If x is a node of B, then x with all its descendants is a binary regular rooted tree
again.
c) The number of nodes is odd.
d) If B has n nodes, it has (n + 1)/2 leaves and (n − 1)/2 inner nodes. There is
therefore exactly one leaf more than inner nodes.
The properties a) and b) are clear: The root has exactly two children, so degree 2, every
other node is a child and has either two children (degree 3) or no descendants (degree 1).
The characterizing properties of the binary regular tree remain in all subtrees, of course.
For the third property, we can use Theorem 11.12 for induction for the first time: A min-
imal binary regular rooted tree is a single node, for which the statement is true. The num-
ber of nodes of the tree B is 1 + number of nodes of the left subtree + number of nodes of
the right subtree, thus odd, since the two smaller subtrees have an odd number of nodes.
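d): Let p be the number of leaves of B. If B has n nodes, then B has n − 1 edges, and
according to a) and Theorem 11.4 it holds that
$$
\sum_{x \in V} d(x) = p \cdot 1 + (n - p - 1) \cdot 3 + 1 \cdot 2 = 2|E| = 2(n - 1).
$$
Solving for p gives p = (n + 1)/2, and therefore there are n − p = (n − 1)/2 inner nodes.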
You can draw the games of a tennis tournament in the form of a binary regular rooted
tree. The players are the leaves, the games are the inner nodes, the final represents the
root. Question: If p players participate in the tournament, how many games will take
place until the tournament winner is determined? Since there is exactly one leaf more
than inner nodes, the number of games must be p − 1.
But you can answer this question quite simply without deep graph-theoretical theorems:
Each game eliminates exactly one player. The winner remains: so with p players you need
exactly p− 1 games.
Definition 11.14 The level of a node in a rooted tree is the number of nodes of the
path from the root to this node. The height of a rooted tree is the maximum level of
the nodes.
So the root has level 1, the tree from Fig. 11.10 has height 4.
Theorem 11.15 For the height H of a binary rooted tree with n nodes it holds that
$H \ge \log_2 (n + 1)$.
At level k of the tree, a maximum of $2^{k-1}$ nodes can be located (you can check this by
induction if you want). So at the bottom level H, a maximum of $2^{H-1}$ nodes are located,
which are then all leaves. If you add the inner nodes, you get $n \le 2^{H-1} + (2^{H-1} - 1)$,
that is $n \le 2 \cdot 2^{H-1} - 1 = 2^H - 1$, and thus $\log_2 (n + 1) \le H$.
Search Trees
Read the assertion of Theorem 11.15 the other way around: in a tree of height H, up to
$2^H - 1$ nodes can be placed. For H = 18, for example, $2^H$ is about 262,000. For each of
these nodes, there is a path from the root that visits at most 18 nodes. This property can
be used to store data in the nodes of trees that can be accessed very quickly. These are
the search trees. In the example with H = 18, you can store a larger dictionary in it.
Data records are identified by keys. We assume each key occurs only once. Each node of
the search tree is assigned a record in the following way: all keys of the left subtree of p are
smaller than the one of node p, all keys of the right subtree of p are greater than the one of p.
First we want to create a search tree. When entering data, we always start at the root,
for each entry a new node is created as a leaf and at the same time as the root of a new
subtree:
As you can see, this is a very simple recursive rule. I want to carry it out with a concrete
example. I will assign words to the nodes of the tree, which at the same time serve as keys
with their alphabetical order. Let us enter a sentence’s words into my tree (Fig. 11.13).
The search for a record in such a tree is also carried out recursively: we find the data
that belongs to a given key s with the following algorithm:
Search s starting at x:
If s is equal to the key of x, the search is finished.
If s is less than the key of x, then search s starting at the left child of x.
If s is greater than the key of x, then search s starting at the right child of x.
This will find every entered record. To cover the case that a non-existing key is searched
for, the algorithm must be extended slightly. Please do this yourself.
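One possible way to model such a search tree in Python is sketched below. It is only an illustration under my own assumptions: the class `Node` and the functions `insert` and `search` are hypothetical names, the keys are compared without regard to upper and lower case, and the search already returns `None` for a key that does not exist, so it anticipates the small extension asked for above.

```python
class Node:
    """A node of a binary search tree; the key is a word."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key below root and return the root of the resulting subtree."""
    if root is None:
        return Node(key)  # the new node is a leaf
    if key.lower() < root.key.lower():
        root.left = insert(root.left, key)
    elif key.lower() > root.key.lower():
        root.right = insert(root.right, key)
    return root           # equal keys are not entered twice

def search(root, key):
    """Return the node with the given key, or None if there is none."""
    if root is None:
        return None
    if key.lower() == root.key.lower():
        return root
    if key.lower() < root.key.lower():
        return search(root.left, key)
    return search(root.right, key)

root = None
for word in "Let us enter a sentence's words into my tree".split():
    root = insert(root, word)

found = search(root, "my")
missing = search(root, "graph")
```

After the nine words have been entered, "Let" is the root, "enter" is its left child and "us" its right child.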
The creation of a tree using the presented method does not always result in a tree that
is as evenly balanced as shown in Fig. 11.13. If you want to convert the phone book into a
tree structure in this way, the tree degenerates into one very long list: the entries arrive
in sorted order, so every new node becomes the right child of the previous one, and the entire
list must be searched sequentially. This does not provide any benefit. Trees that are
particularly well suited for searching are those for which the left and right subtrees
in each node differ in height by at most 1. These trees are called height-balanced trees.
Fig. 11.13 shows such a tree. There are algorithms that create exactly such trees.
You can output all the data contained in the tree alphabetically using the rule:
If you output the tree from Fig. 11.13 according to preorder, you get:
Let enter a into us sentence’s my tree words
and with the postorder method you get:
a into enter my tree sentence’s words us Let
Try to implement such a tree. If you have taken the hurdle of modeling the tree data struc-
ture correctly, you will see that the insert, search, and output functions really only require a
few lines of code.
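As a hedged illustration of how short the output functions can be, here is a sketch of the three traversals. The list representation `[left, key, right]` and all function names are assumptions made for this example; keys are compared case-insensitively so that "Let" sorts among the lowercase words.

```python
def insert(tree, key):
    """A tree is None or a list [left, key, right]; returns the updated tree."""
    if tree is None:
        return [None, key, None]
    if key.lower() < tree[1].lower():
        tree[0] = insert(tree[0], key)
    else:
        tree[2] = insert(tree[2], key)
    return tree


def inorder(tree):
    """Left subtree, node, right subtree: yields the keys in sorted order."""
    if tree is not None:
        yield from inorder(tree[0])
        yield tree[1]
        yield from inorder(tree[2])


def preorder(tree):
    """Node first, then left subtree, then right subtree."""
    if tree is not None:
        yield tree[1]
        yield from preorder(tree[0])
        yield from preorder(tree[2])


def postorder(tree):
    """Left subtree, right subtree, node last."""
    if tree is not None:
        yield from postorder(tree[0])
        yield from postorder(tree[2])
        yield tree[1]


tree = None
for word in "Let us enter a sentence's words into my tree".split():
    tree = insert(tree, word)

print(" ".join(preorder(tree)))   # Let enter a into us sentence's my tree words
print(" ".join(postorder(tree)))  # a into enter my tree sentence's words us Let
```

The two printed lines agree with the preorder and postorder outputs given above for the tree of Fig. 11.13.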
As an application example of binary trees, I would like to introduce you to the Huffman
code, an important algorithm for data compression. If you encode a text of the English
language with the ASCII code, you need exactly one byte for each character, for the fre-
quent “E” as well as for the rare “X”. A long time ago it was discovered that one could
save capacity during data transmission by replacing frequently occurring characters with
short code words and rarely occurring characters with longer code words. An example of
such a code is the Morse alphabet. In this, for example:
avoids this problem by using not only two characters, but also the pause in addition to “.”
and “-”, a separator. So “· -” is equal to “ET” and “·-” is equal to “A”.
Is it possible to construct a code in such a way that the separator between two code
words can be dispensed with? This is possible with the help of the so-called prefix codes:
A prefix code is a code in which there is no whole code word that is a prefix (initial
segment) of any other code word. You all know such a code, probably without having
consciously thought about this property: the telephone number system. There is no tel-
ephone number that is a prefix of another number. So for example there is no telephone
number that starts with 911. This property has the consequence that the end of a code
word can be recognized without further separator: If you dial 911, the telephone switch
knows that the phone number is over, because no other phone number has the same
beginning, and therefore connects you to the emergency services. The Morse alphabet is
not a prefix code: The code of “E” is a prefix of the code of “A”.
With the help of rooted trees, such prefix codes can be constructed. I want to gener-
ate binary codes, that is, codes whose code words consist only of the characters “0” and
“1”. For this purpose, we draw a binary regular rooted tree on which we write a 0 on the
edges leading to the left and a 1 on the edges leading to the right.
We assign the characters to be encoded to the leaves. Such a character is then mapped
to the sequence of 0s and 1s leading from the root to the corresponding leaf (Fig. 11.14).
No codeword is a prefix of another, otherwise the corresponding character would have
to lie on the way to this other word. But we only coded leaves. Now a bitstream made up
of these codewords can be decoded uniquely without using a pause character: For example,
01001100101110101011001000 can be decoded as 010 011 00 10 111 010 10 110 010 00.
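The decoding without separators can be tried out with a short sketch. The set `CODE` below is an assumption: it contains the six code words that occur in the decoded example above, and the assignment of characters to these code words in Fig. 11.14 is not reproduced here. The function name `split_codewords` is likewise hypothetical.

```python
def split_codewords(bits, codewords):
    """Cut a bit string into code words of a prefix code, reading left to right."""
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in codewords:      # prefix property: the first match is the only possible one
            words.append(current)
            current = ""
    if current:
        raise ValueError("bit stream ends in the middle of a code word")
    return words


# Assumption: the code words read off the decoded example in the text.
CODE = {"00", "010", "011", "10", "110", "111"}

print(split_codewords("01001100101110101011001000", CODE))
```

Because no code word is a prefix of another, the loop never has to look ahead or backtrack.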
Conversely, every prefix code can be represented by such a tree. The construction
method is obvious: Start at the root and draw the walk for each codeword that leads to
the left child for 0 and to the right child for 1. Because of the prefix property, the code-
words are exactly the walks that end in the leaves of the resulting tree.
Our goal was to generate prefix codes in which frequently occurring characters get short
code words and rare characters get longer ones. The Huffman algorithm constructs
such a code, and it can be proven that the result is an optimal prefix code (see Theorem
20.24).
Given is a source alphabet, from which the frequency of occurrence of the different
characters is known. I would like to explain the algorithm using a concrete example. Our
source alphabet consists of the characters a, b, c, d, e, f with the probabilities of occurrence:
a : 4 % b : 15 % c : 10 % d : 15 % e : 36 % f : 20 %.
The algorithm proceeds as follows:
1. The characters of the alphabet are the leaves of a tree. We assign the frequency of the
characters to the leaves.
If you want to draw the tree, it is best to write the characters next to each other in the
order of probability.
2. Find the two smallest nodes without a parent. If there are several, choose two arbi-
trary ones. Add a new node and connect it to these two. The new node should be the
parent of the two initial nodes. Assign the sum of the two frequencies to this node.
At the beginning, no node has a parent. When step 2 is carried out, the number of nodes
without a parent is reduced by 1.
3. Repeat step 2 until only one node without a parent is left. This is the root of the tree.
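The three steps can be sketched with a priority queue that always delivers the two lightest nodes without a parent. The function name `huffman_code`, the tuple representation of inner nodes, and the tie-breaking counter are assumptions of this sketch; since ties may be resolved arbitrarily, the code words can differ from those in the book's figure, while the code word lengths for this example come out the same.

```python
import heapq
from itertools import count


def huffman_code(freq):
    """Return {symbol: bit string} for the given frequencies."""
    tiebreak = count()                # makes heap entries comparable when weights are equal
    # A tree is either a symbol (leaf) or a pair (left, right).
    heap = [(w, next(tiebreak), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:              # step 3: repeat until one parentless node is left
        w1, _, t1 = heapq.heappop(heap)   # step 2: the two smallest parentless nodes
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))

    code = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # 0 on edges to the left
            walk(tree[1], prefix + "1")   # 1 on edges to the right
        else:
            code[tree] = prefix or "0"    # a one-symbol alphabet still needs one bit

    walk(heap[0][2], "")
    return code


freq = {"a": 4, "b": 15, "c": 10, "d": 15, "e": 36, "f": 20}
code = huffman_code(freq)
average = sum(freq[s] * len(code[s]) for s in freq) / 100
```

For the example alphabet, the frequent "e" receives a code word of length 1 and the rare "a" one of length 4; the average code word length is 2.42 bits per character.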
The Huffman code is used, for example, in fax coding, where one knows the probabilities
quite well for the number of consecutive white or black pixels. But it is still often
used in modern compression methods, usually as part of multi-stage algorithms, for
example as part of the JPEG algorithm for image compression.
Prefix codes encode characters of constant length in code words of different lengths. An
alternative compression approach searches for words of different lengths in the source
text and translates them into code words of constant length. Such a method is, for example,
the Lempel-Ziv-Welch (LZW) coding; related dictionary methods are used in the widely used
ZIP format. During coding, a dictionary of frequently occurring character strings is created
and, also using trees, these character strings are assigned a code word of fixed length.
Often the code words have a length of 9 bits, so that in addition to the 256 possible byte
values, 256 character strings can be provided with their own code.
When we looked at search trees, we already saw that it is important to know methods
that can find all the nodes of the tree. We now want to examine procedures that can reli-
ably visit all the nodes of any graph, if possible exactly once. Most such visitation algo-
rithms are based on the two prototypes depth-first search or breadth-first search. Let’s
first look at the depth-first search. We start at an arbitrary node x of the graph G and pro-
ceed according to the following recursive algorithm:
Starting from x, follow a path until you reach a node to which no unvisited nodes are
adjacent anymore. Then turn back to the next node that still has unvisited adjacent nodes.
Visit such a node next and start from there again, as far as you can get, then turn back
again and so on.
If you apply the depth-first search to a binary rooted tree, the depth-first search cor-
responds to the preorder traversal of the tree.
In Fig. 11.16 I have marked the order in which the nodes of a graph are visited. When
implementing the search, you still need to specify what exactly is meant by “for all unvisited
adjacent nodes”: a sequence must be fixed. In the example, I started arbitrarily from
node 1 and then always took the path as far to the right as possible. Any other sequence
would have been just as possible.
The number of unvisited nodes decreases by 1 at each step. Every node that is con-
nected to x by a path is eventually caught by this search, and exactly once. If G is con-
nected, this thus visits every node of G. If G has several connected components, we have
to run the algorithm for each of these components.
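A recursive depth-first search might look as follows. The adjacency-list dictionary `graph` is a hypothetical example (it is not the graph of Fig. 11.16), and the function names are assumptions of this sketch. The order of the neighbors in each list fixes the visiting sequence.

```python
def depth_first(graph, x, visited=None):
    """Return the nodes reachable from x in the order in which they are visited."""
    if visited is None:
        visited = []
    visited.append(x)
    for y in graph[x]:                # the list order decides which path is followed first
        if y not in visited:
            depth_first(graph, y, visited)
    return visited


def components(graph):
    """Run the search once per connected component."""
    seen, result = set(), []
    for x in graph:
        if x not in seen:
            component = depth_first(graph, x)
            seen.update(component)
            result.append(component)
    return result


# Hypothetical example graph with two connected components.
graph = {
    1: [2, 3],
    2: [1, 4],
    3: [1, 4],
    4: [2, 3, 5],
    5: [4],
    6: [7],
    7: [6],
}

print(depth_first(graph, 1))   # [1, 2, 4, 3, 5]
print(components(graph))       # [[1, 2, 4, 3, 5], [6, 7]]
```

Starting at node 1 reaches only the five nodes of its own component, which is why `components` has to restart the search at node 6.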
Each traversed edge leads to an unvisited node. If we process a connected graph with
n nodes with this algorithm, we therefore need n − 1 edges to visit the n − 1 nodes that
are different from the starting point. If we mark the used edges, we get a subgraph with
all n nodes and n − 1 edges. This subgraph is connected and therefore, according to
Theorem 11.11, a tree. All nodes of a connected graph can therefore be connected by the edges
of a tree. Such a tree is called a spanning tree of the graph G. In Fig. 11.16 the edges that I
used when traversing the graph are shown in bold. You can see that these edges, together with
all the nodes, form a tree, namely a spanning tree.
The depth-first search is suitable if you want to find the way out of a labyrinth. It is a
single-player algorithm: each path is first followed to the end.
If you get lost in a corn maze the next time and use the depth-first search, you can find out
what is meant by the runtime of an algorithm.
The breadth-first search is more suitable for teamwork: starting from a node, it swarms
in all directions, only gradually does it get further and further away from the starting
node. The breadth-first search algorithm for a connected graph with starting node x is
(for once not recursive):
1. Make the node x the current node and give it the number 1.
2. Visit all nodes adjacent to the current node and number them consecutively, starting
with the next free number.
3. If not all nodes have been visited, make the node with the next number the current
node and continue with step 2.
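The three steps translate almost literally into a loop with a queue. In this sketch the queue holds the nodes that are waiting to become the current node; the example graph and the name `breadth_first` are assumptions, not taken from Fig. 11.17.

```python
from collections import deque


def breadth_first(graph, x):
    """Number the nodes reachable from x in breadth-first order; returns {node: number}."""
    number = {x: 1}                   # step 1: the start node gets the number 1
    queue = deque([x])
    while queue:
        current = queue.popleft()     # step 3: the node with the next number becomes current
        for y in graph[current]:      # step 2: visit and number all unvisited neighbors
            if y not in number:
                number[y] = len(number) + 1
                queue.append(y)
    return number


# Hypothetical connected example graph as adjacency lists.
graph = {
    "x": ["a", "b"],
    "a": ["x", "c"],
    "b": ["x", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

print(breadth_first(graph, "x"))   # {'x': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5}
```

A depth-first search on the same graph would reach c before b, because it follows the path x, a, c to its end first.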
In Fig. 11.17 I have recorded the order of the visited nodes in a breadth-first search.
Again a spanning tree is created, but it looks completely different from the one in Fig.
11.16.
If we carry out a breadth-first search on a rooted tree, the levels are visited one after
the other until we reach the leaves. This is where the difference to a depth-first search, in
which all the branches are successively traversed to the leaves, is particularly noticeable.
Breadth-first search and depth-first search are methods that are applied to graphs
in this or a similar form over and over again in order to gain information about these
graphs. We have already seen that these algorithms can be used to determine whether a
graph is connected. Depth-first search can also be used to determine whether a graph is
multiply connected, a property that is important, for example, in power or computer
networks: even after a line section has failed, these should still be connected.
Shortest Paths
Finding the shortest paths in networks is a task occurring very frequently in computer
science. Graph traversal algorithms are also used for this. I would like to introduce you to
the algorithm of Dijkstra, which he formulated in 1959. It is related to breadth-first search:
Starting from one point, the shortest paths to all other points are determined. In doing so,
one gradually moves further and further away from the starting point. The algorithm always
finds the shortest path to the target point z if such a path exists at all. At the same time, all
the shortest paths are found that lead to nodes that are closer to the starting point than z.
Let G be a network whose weight function is always positive. This condition ensures that a
path becomes longer with every additional edge. We seek the shortest path from node x to
node z.
We divide the nodes of the graph into three sets: the set V of points that have already
been visited and for which shortest paths are known, the set B of points that are adjacent
to points from V , one could call B the boundary of V , and finally U , the unvisited points.
At the beginning, V only consists of the starting node x, to which the shortest path of
length 0 is known. B consists of the nodes that are adjacent to x.
In each step of the algorithm, a node is selected from the boundary B and added to V ,
the points adjacent to this node are added to the boundary. The algorithm ends when
the node z has been added to V or when no node is left in the boundary. This is the case
when all nodes have been visited that are connected to the starting point x.
In Fig. 11.18 the iteration step is circled next to each node in which it is added to V ,
the sets V , B and U are drawn after four iterations of the algorithm.
Which node is added to V ? Each node y of B is connected to at least one node of V by
an edge. Since the shortest paths are known within V , we can use them to determine the
shortest path to y, which only contains nodes of V . We assign this path length lB (y) to all
points y of B. This does not necessarily give us the absolutely shortest path from x to y,
because this could also contain other nodes from B or U . You can see this, for example, at
point y in Fig. 11.18: The shortest path from x to y has the length 8, while the shortest path
that uses only nodes from V has the length 9: lB(y) = 9.
Now we are looking for a minimum of the distances in B. Let w be a node for
which lB(w) is minimal. The path from x to w cannot be shortened any further, because every
shorter path from x to w would have to pass through another boundary point (for example y), but
the path from x to this boundary point would already be at least as long as the one from x
to w. The node w can therefore be removed from the boundary and added to V. At the same
time, all nodes adjacent to w that are not yet in V are added to the boundary. In the example,
z is the last node reached, so all shortest paths have to be calculated before the shortest
path from x to z is found.
You can see that the algorithm is time-consuming: in the form described here, the runtime
increases quadratically with the number of nodes.
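The following sketch follows the description with the sets V and B rather literally. The dictionary `dist` plays the role of lB, the example network is hypothetical (it is not the one from Fig. 11.18), and the name `dijkstra` as well as the nested-dictionary representation are assumptions. Scanning the whole boundary for the minimum in every step is what makes this simple variant quadratic.

```python
import math


def dijkstra(graph, x, z=None):
    """Shortest path lengths from x; graph[u][v] is the positive weight of the edge u-v.

    If z is given, the search stops as soon as z has been added to V.
    """
    dist = {x: 0}                     # the values l_B, final once a node has moved to V
    visited = set()                   # the set V
    boundary = {x}                    # the set B; x passes through it in the first step
    while boundary:
        w = min(boundary, key=dist.get)   # boundary node with minimal distance
        boundary.remove(w)
        visited.add(w)
        if w == z:
            break
        for y, weight in graph[w].items():
            if y in visited:
                continue
            if dist[w] + weight < dist.get(y, math.inf):
                dist[y] = dist[w] + weight    # shorter path to y found via w
            boundary.add(y)
    return {v: dist[v] for v in visited}


# Hypothetical example network with positive weights.
graph = {
    "x": {"a": 2, "b": 5},
    "a": {"x": 2, "b": 1, "y": 7},
    "b": {"x": 5, "a": 1, "y": 3},
    "y": {"a": 7, "b": 3, "z": 4},
    "z": {"y": 4},
}

print(dijkstra(graph, "x"))
```

In this example the direct edge from x to b (length 5) is beaten by the detour via a (length 3), which the algorithm notices when a is moved into V.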
Nevertheless, this algorithm is still computable in a reasonable time, in contrast to the
related problem of the traveling salesman: A sales representative wants to visit n cities
and tries to find the shortest path that touches all n cities once and leads him back to his
starting point. The task of the traveling salesman is probably one of the best-researched
problems in computer science. There is no known algorithm solving it in polynomial
time, that is, in a time bounded by a constant multiple of $n^k$ for some fixed k, where n is
the number of cities. It is believed that no such algorithm exists. Already for about 1000
cities the problem is no longer computationally feasible in an acceptable time.
A mathematician could be content with a proof that there is no fast algorithm, but not the
traveling salesman, who actually has to make his journey, and certainly not the computer
scientist, who is given the task of finding a fast way. The typical approach of the computer
scientist is to deviate from the pure doctrine and to look for solutions that are easily cal-
culated, but also not too far from the optimal solution. He is looking for a compromise
between the two goals “best possible way” and “lowest possible computing time”. On the
Internet you will find competitions in which good algorithms for the traveling salesman
compete against each other.
11.4 Directed Graphs
In many applications, directions are assigned to the edges of a graph: one-way streets,
flowcharts, project plans, or finite automata are examples of such graphs. In Fig. 11.19
you can see a finite state machine that can recognize comments in the source code of a
program that begin with /* and end with */. The nodes are states, a transition from a
state to another occurs when reading a certain character.
You can see in Fig. 11.19 that there can now be two edges between different nodes, but
in different directions. In the adjacency matrix of a digraph, the element aij has the value
1 if there is an arc from i to j. The adjacency matrix is therefore generally no longer
symmetrical. For the above example it is:
$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}.$$
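The automaton and its adjacency matrix can be reproduced in a short sketch. The transition table below is an assumption: it is reconstructed from the matrix above, with the four states taken in the order plaintext, into, comment, out of. The function names are hypothetical.

```python
STATES = ["plaintext", "into", "comment", "out of"]


def step(state, ch):
    """One transition of the comment recognizer (assumed transition table)."""
    if state == "plaintext":
        return "into" if ch == "/" else "plaintext"
    if state == "into":
        return "comment" if ch == "*" else "plaintext"
    if state == "comment":
        return "out of" if ch == "*" else "comment"
    # state == "out of"
    return "plaintext" if ch == "/" else "comment"


def trace(text):
    """Return the state reached after each character, starting in plaintext."""
    state, states = "plaintext", []
    for ch in text:
        state = step(state, ch)
        states.append(state)
    return states


def adjacency_matrix(alphabet="/*a"):
    """a_ij = 1 if some character of the alphabet leads from state i to state j."""
    matrix = [[0] * len(STATES) for _ in STATES]
    for i, state in enumerate(STATES):
        for ch in alphabet:
            matrix[i][STATES.index(step(state, ch))] = 1
    return matrix


print(trace("a/*c*/b"))
print(adjacency_matrix())
```

The three characters "/", "*" and "a" are enough to exercise every transition, since the machine only distinguishes a slash, an asterisk, and everything else.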
A directed graph is nothing more than a relation on the set of nodes: x R y holds if and only
if [x, y] is an arc of the graph. See Definition 1.7 in Sect. 1.2. Conversely, one can also
say: every relation on a finite set represents a directed graph.
Definition 11.17 If x is a node of the directed graph G, then the number of arcs
ending in x is called the in-degree of x and is denoted by $d^-(x)$. The number of arcs
starting in x is called the out-degree of x and is denoted by $d^+(x)$.
Since each arc has exactly one head and one tail and each of these points contributes to
the in-degree or out-degree of a node, it follows immediately:
In Definition 11.6 we have defined the terms walk, path, cycle and connectivity for undi-
rected graphs. These transfer to directed graphs:
After many definitions finally a first, surprising result. Behind it is the fact that in such a
graph every path has a beginning and an end:
Theorem 11.22 In every directed acyclic graph there is at least one source and one
sink.
Proof: Since the graph G is finite, there are only finitely many paths in it, and among them
there is a longest path. Let x be the endpoint of this path. If $d^+(x) \neq 0$, then an arc
[x, y] leads out of x. But since the path has maximal length, it cannot be extended to y. The
node y must therefore already occur in the path. But then we would have found a cycle, a
contradiction. So x is a sink. Similarly, the starting point of this longest path is a source.
The directed acyclic graphs have an interesting property that is particularly important
when processing the nodes. The order of processing can be such that when visiting the
node x, all the nodes have already been visited from which a path leads to x. So all the
“predecessors” of x are already processed. If you number the nodes of the graph accord-
ing to this visit algorithm, you speak of a topological sorting of the graph.
Definition 11.23 The directed graph G with node set $\{x_1, x_2, \dots, x_n\}$ is called
topologically sorted if for all nodes $x_i$ the following holds: If $[x_k, x_i]$ is an arc, then k < i.
This algorithm also represents a method for deciding whether a directed graph is acyclic:
If you can no longer find a source although there are still unnumbered nodes, you have
found a cycle.
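A common way to carry out such a numbering is to pick a source repeatedly, give it the next number, and remove it together with its outgoing arcs. The sketch below follows this idea; it is a standard variant chosen for illustration and not necessarily identical to the formulation in the book. The function name and the arc-list representation are assumptions.

```python
def topological_sort(nodes, arcs):
    """Return the nodes in a topological order, or None if the graph contains a cycle."""
    indegree = {x: 0 for x in nodes}
    successors = {x: [] for x in nodes}
    for x, y in arcs:
        successors[x].append(y)
        indegree[y] += 1

    sources = [x for x in nodes if indegree[x] == 0]
    order = []
    while sources:
        x = sources.pop()
        order.append(x)               # x receives the next number
        for y in successors[x]:       # "remove" x: its outgoing arcs no longer count
            indegree[y] -= 1
            if indegree[y] == 0:
                sources.append(y)

    # No source left although nodes remain unnumbered: there must be a cycle.
    return order if len(order) == len(nodes) else None


# Hypothetical example: a small acyclic graph and the same graph with one arc closing a cycle.
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

print(topological_sort(nodes, arcs))
print(topological_sort(nodes, arcs + [("d", "a")]))   # None
```

In the returned order every arc leads from an earlier node to a later one, which is exactly the condition of Definition 11.23.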
As in the undirected graphs, the directed edges are often equipped with weights that
represent, for example, transport capacities of pipelines, distances, duration or similar.
In a network, the flow of material from a producer to a consumer can be recorded via a
number of intermediaries. The producer is the source, the consumer the sink of the net-
work, each arc is a transport route with a capacity (Fig. 11.20).
A flow network is a directed, positively weighted network that has a source and a sink.
The weight function is also called the capacity function. An important task of graph theory
is to determine the maximum capacity of a flow network, that is, the maximum amount of the
examined good that can be transported from the source to the sink. The capacity of no arc
may be exceeded, and at each node except the source and the sink, as much must
flow in as flows out. I do not want to deal with this question here, but at the end of the
chapter I would like to discuss a problem from project planning.
The plan for the implementation of a project represents a directed graph: Starting
from a source, for example the order placement, the project is divided into various mod-
ules that can be developed independently of each other. These are often divided into fur-
ther sub-modules. There are dependencies between the individual modules: There are
modules that can only be started when others are finished or have reached a defined sta-
tus. In particular, during integration, everything comes together again: In different stages,
the modules are assembled and tested until hopefully everything has grown together to
form the big picture at the end of the handover.
A project plan defines durations for the processing of individual modules and milestones
by which certain work must be done. Such a project plan can be regarded as a
directed network. The arcs are the activities in the project, the weight of an arc represents
the duration for the execution of this activity. The nodes denote the milestones of the
project. A schedule assigns a time to each milestone at which it is to be achieved. How
do you find an optimal schedule?
If there is a schedule at all, then the graph must be without cycles, otherwise some-
thing has gone completely wrong. The network in Fig. 11.20 is therefore not a project
plan. We further assume that there is exactly one source and one sink in the graph. In
Fig. 11.21 you can see an example of such a network plan. We assign the time 0 to the
source, now we want to find the earliest time T at which the sink can be reached. Since
each individual activity must be carried out, that is, really every path of the graph must
be traversed, we are faced here with the problem of finding the longest path in the graph.
For this I would like to introduce you to an algorithm.
Since the graph to be examined is acyclic, it can be topologically sorted. So we assume
that the nodes $\{x_1, x_2, \dots, x_n\}$ of the graph are numbered according to a topological sort.
[Fig. 11.21: network plan with the nodes $x_1, \dots, x_{15}$; the arc weights are the activity durations]
For each node y we want to determine the number L(y), the length of a longest
directed path from the source to y. First, we set $L(x_i) = 0$ for all i. Now the longest
paths to all nodes y are determined one after the other. The value L(y) can be increased
during the process until all paths to y are considered:
1. Start with i = 1.
2. Visit the node $x_i$ and examine all nodes y to which an arc $[x_i, y]$ exists. If
$L(x_i) + w(x_i, y) > L(y)$, set $L(y) = L(x_i) + w(x_i, y)$, otherwise leave L(y) unchanged.
3. As long as there are still nodes left: Increase i by 1 and go to step 2.
This algorithm works because when visiting the node $x_i$, the length $L(x_i)$ of the longest
path to it is already final: Because of the topological sort, no arc from later nodes leads
back to $x_i$, so that $L(x_i)$ can no longer change. Thus $L(x_i) + w(x_i, y)$ is the length of
the longest path to y that ends with the arc $[x_i, y]$. Since all predecessors of y are
considered in the course of the algorithm, at some point the longest path to y is also found.
It is not difficult to record such a longest path from the source to the sink while
calculating the $L(x_i)$.
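The three steps, including the bookkeeping for one longest path, can be sketched as follows. The network used here is a small hypothetical example with five milestones, not the plan from Fig. 11.21; the names `earliest_times`, `path_to` and `pred` are assumptions of this sketch. Nodes are represented by their index in the topological sort.

```python
def earliest_times(n, arcs):
    """L[i] = length of a longest path from node 1 to node i.

    The nodes 1..n are assumed to be topologically sorted;
    arcs maps (i, k) with i < k to the duration w(x_i, x_k).
    """
    L = {i: 0 for i in range(1, n + 1)}
    pred = {}                         # predecessor on one longest path
    for i in range(1, n + 1):         # steps 1 and 3: visit the nodes in sorted order
        for (a, k), w in arcs.items():
            if a == i and L[i] + w > L[k]:    # step 2
                L[k] = L[i] + w
                pred[k] = i
    return L, pred


def path_to(pred, k):
    """Follow the recorded predecessors back from node k to the source."""
    path = [k]
    while k in pred:
        k = pred[k]
        path.append(k)
    return path[::-1]


# Hypothetical network plan: (from, to) -> duration.
arcs = {(1, 2): 3, (1, 3): 2, (2, 4): 4, (3, 4): 6, (2, 5): 1, (4, 5): 2}
L, pred = earliest_times(5, arcs)

print(L)                  # {1: 0, 2: 3, 3: 2, 4: 8, 5: 10}
print(path_to(pred, 5))   # [1, 3, 4, 5]
```

Note that L(4) is first set to 7 via node 2 and later raised to 8 via node 3, which illustrates why a value may still grow until all predecessors have been visited.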
The algorithm can be formulated very elegantly recursively:
$$L(x_1) = 0, \qquad L(x_i) = \max\{\, L(x_k) + w(x_k, x_i) \mid [x_k, x_i] \text{ is an arc} \,\} \quad \text{for } i > 1. \tag{11.1}$$
For the sink we thus obtain a value $L(x_n) = T$, where T is the desired minimum project
duration. For every other node, $L(x_i)$ denotes the earliest time at which this milestone can
be reached. In the example, $L(x_{15}) = 29$; a longest path through the graph is shown in
bold.
Calculate the values $L(x_i)$ for all nodes in the example yourself. Can you find any
more longest paths?
How can one further analyze the project plan? For each node there is also a latest time
at which it must be reached if the entire project duration T is not to be extended. I call
this $S(x_i)$. To calculate it, the topological sorting now helps us in the other direction: We
have to start at the sink and visit the nodes backwards. When I examine $x_i$, I already know
all $S(x_k)$ for k > i. If $[x_i, x_k]$ is an arc, I have to subtract from $S(x_k)$ the necessary
activity duration $w(x_i, x_k)$ to $x_k$, and from all these times $S(x_k) - w(x_i, x_k)$ choose the
earliest. Of course $S(x_n) = T$, and we obtain analogously to (11.1):
$$S(x_n) = T, \qquad S(x_i) = \min\{\, S(x_k) - w(x_i, x_k) \mid [x_i, x_k] \text{ is an arc} \,\} \quad \text{for } i < n.$$
In the example $L(x_{14}) = 26$, while $S(x_{15}) = 29$. The activity $[x_{14}, x_{15}]$ is planned
with 2 units, so that a margin of 1 unit remains.
We now know for each activity [x, y] the earliest time L(x) at which
x can be reached, the time S(y) at which y must occur at the latest, as well as the planned
activity duration w(x, y). The difference S(y) − L(x) − w(x, y) is the time buffer: The
activity [x, y] can be extended by this period of time without endangering the entire
project duration. If the buffer is 0, we speak of a critical activity. In every project there is
at least one path that consists only of critical activities, for example the path that we found
when determining $L(x_n)$. Such a path is called a critical path.
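Earliest times, latest times, and buffers fit into one short function. As before, this is a sketch on a hypothetical five-node plan rather than the network of Fig. 11.21; the name `schedule` and the dictionary representation are assumptions, and the nodes 1..n are assumed to be topologically sorted with node 1 as the only source and node n as the only sink.

```python
def schedule(n, arcs):
    """Return (L, S, buffer) for a topologically sorted network plan.

    arcs maps (i, k) with i < k to the duration w(x_i, x_k).
    """
    L = {i: 0 for i in range(1, n + 1)}
    for i in range(1, n + 1):                 # forward pass: earliest times
        for (a, k), w in arcs.items():
            if a == i:
                L[k] = max(L[k], L[i] + w)

    T = L[n]                                  # minimum project duration
    S = {i: T for i in range(1, n + 1)}
    for i in range(n, 0, -1):                 # backward pass: latest times
        for (a, k), w in arcs.items():
            if a == i:
                S[i] = min(S[i], S[k] - w)

    buffer = {(a, k): S[k] - L[a] - w for (a, k), w in arcs.items()}
    return L, S, buffer


# Hypothetical network plan: (from, to) -> duration.
arcs = {(1, 2): 3, (1, 3): 2, (2, 4): 4, (3, 4): 6, (2, 5): 1, (4, 5): 2}
L, S, buffer = schedule(5, arcs)
critical = sorted(arc for arc, b in buffer.items() if b == 0)

print(S)          # {1: 0, 2: 4, 3: 2, 4: 8, 5: 10}
print(critical)   # [(1, 3), (3, 4), (4, 5)]
```

The critical activities form the path 1, 3, 4, 5, which is a longest path of this plan; the activity (2, 5), by contrast, has a buffer of 6 units.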
Comprehension Questions
1. A connected graph is a tree if it has one node more than edges. Why is it important
in this theorem that the graph is connected?
2. If you declare another node to be the root in a rooted tree, you get a rooted tree
again. Can it be that the two rooted trees have a different number of leaves?
3. Every walk k from x to y in a graph contains a path w from x to y. Of course, w is
shorter or at most as long as k. Can there be a path from x to y that is longer than w?
4. Is a directed graph automatically turned into a graph if you remove the arrows
from the edges?
5. Is a graph automatically turned into a directed graph if you assign a direction to
each edge?
6. Can you represent a partial order on a finite set by a graph or by a directed graph?
7. In a directed graph, there can be two directed edges between the nodes x and y.
Can there also be two directed edges between x and x?
Exercises
1. Write down the degree of each node in graph G from Fig. 11.22. How many even
and odd nodes (= nodes of even and odd degree) does G contain?
Set up the adjacency matrix for G.
2. Is it possible that there are nine people at a party, each of whom knows exactly
five others?
3. G is a graph with n nodes and n − 1 edges. Show that G contains at least one end-
point or an isolated node (a node of degree 0).
4. Draw the syntax tree for the following expression:
c = (a+3)*b - 4*x + z*x/7;
Visit the nodes of the tree using the three methods Inorder, Preorder, and Postorder.
5. Check whether the codes that consist of the following words are prefix codes, and
if so, draw the corresponding binary tree:
a) 11, 1010, 1011, 100, 01, 001, 000
b) 00, 010, 011, 10, 110, 111, 0110, 0111
6. Construct a prefix code for the alphabet {a, b, c, d, e, f , g, h} with the following
frequency distribution {4, 6, 7, 8, 10, 15, 20, 30} (a = 4 %, b = 6 %, …, h = 30 %).
7. Dijkstra’s algorithm for finding shortest paths in networks assumes the distance
between two nodes is not negative. Why is that so? Give an example of a network
in which this condition is not met and in which therefore the algorithm fails.
8. Given the following adjacency matrix of a network:
0 0 0 4 1 2 0 0 0
0 0 4 2 6 0 4 2 0
0 4 0 2 0 0 2 0 4
4 2 2 0 3 2 0 0 2
1 6 0 3 0 0 0 8 0
2 0 0 2 0 0 0 0 1
0 4 2 0 0 0 0 3 0
0 2 0 0 8 0 3 0 0
0 0 4 2 0 1 0 0 0
The matrix entry $a_{ij}$ contains the length of the edge from node i to node j. Try
to draw a non-overlapping picture of the graph. Determine the shortest paths from
the first node to the other nodes. Draw spanning trees in the graph that arise when
starting from the first node when performing breadth-first search and depth-first
search respectively.
9. The finite state machine from Fig. 11.19 is not quite perfect yet. If someone
comes up with the idea of writing the following text:
a/ /* this is a division */b
then it fails. Complete the finite state machine so that it can also cope with this!
10. Show: An acyclic network that has exactly one source and one sink is weakly
connected.
Part II Analysis
12 The Real Numbers
Abstract
This chapter lays the foundation for investigations of convergent sequences and con-
tinuous functions. By the end of this chapter, you will
We have already used the real numbers quite often in the first half. Now we finally want
to characterize them more precisely, as the properties of the real numbers are
essential for the second half of the book. In Theorem 2.10 we have seen that there are
numbers with which we would like to calculate, but which are not contained in Q, for
example $\sqrt{2}$. The rational numbers still have gaps. Just as Z is constructed from N and
Q from Z, one can also construct the real numbers R from Q formally by filling in all these
gaps. However, it is not so easy to describe this filling of gaps mathematically precisely,
and I will therefore refrain from this construction. We will take another approach that has
already proved to be useful several times: We will collect the characteristic properties of R,
the axioms of the real numbers. The real numbers are then a set for us that satisfies these
axioms. A little later we will find that the real numbers can be identified with the set of
all decimal fractions. We can then also use the axioms in the following chapters to derive
theorems about the real numbers.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists, https://doi.org/10.1007/978-3-658-40423-9_12
First of all, the real numbers form a field. Please take a look at the field axioms from
Definition 5.10 in Sect. 5.3 again. Another essential property of R, which we have
already used several times, is the ordering: Two elements can be compared with each
other with respect to their size. R is a linearly ordered set with respect to the relation ≤
according to Definition 1.12.
An ordered field is linearly ordered as a set, and the order additionally has some
compatibility properties with the two operations. The properties of an ordered field can be
described by the order axioms. At first these do not mention the order at all: they state
that there are positive and negative elements in ordered fields. The order and its
properties can be derived from this.
Definition 12.1: The order axioms. A field K is called ordered if there is a subset P
(the prepositive cone) in it with the following properties:
The elements from P are called positive, those from −P are called negative. P ∪ {0} is
called the positive cone of K.
Each element different from 0 is thus positive or negative. Sum and product of positive
elements are positive again.
From these axioms, we can now immediately derive our order relations <, ≤, >, ≥
already known in R:
x<y :⇔ y − x ∈ P,
x≤y :⇔ x < y or x = y,
x>y :⇔ y < x,
x≥y :⇔ y ≤ x.
You can see that y > 0 holds if and only if y ∈ P. Thus, as expected, the prepositive cone P
contains exactly the numbers greater than zero.
I would like to put together some basic properties of these relations now, which of
course all follow from the order axioms. The properties a), b) and c) just say that ≤ is a
linear order relation on the field K according to Definition 1.12.
The calculations are elementary, although one sometimes has to think about it. I would
like to derive the first and second property as examples:
to a): Let’s first assume x ≠ y. Then x ≤ y means that y − x ∈ P, and y ≤ x means that
x − y ∈ P, that is, y − x = −(x − y) ∈ −P. Since P and −P are disjoint, y − x cannot be in
P and −P at the same time. The assumption x ≠ y must therefore be false.
to b): If x = y or y = z, the assertion is clear. So let’s assume x ≠ y and y ≠ z. Then
y − x ∈ P and z − y ∈ P, so also (z − y) + (y − x) = z − x ∈ P and thus x < z.
The behavior with multiplications gives the most trouble when calculating with inequali-
ties. If you multiply with negative numbers, the inequality sign turns around, otherwise
not. This often leads to ugly case distinctions.
In an ordered field, according to the axiom (A3), x ≠ 0 implies $x^2 > 0$. This is because
if x > 0, then $x^2 > 0$, and for x < 0 it follows from −x > 0 that $(-x)^2 = x^2 > 0$ is again
true. Therefore, $1 = 1^2 > 0$, so −1 < 0. Hence −1 can never be the square of another
number. You can see from this that the complex numbers cannot carry the structure of an
ordered field, no matter how much effort is put into the definition of the order.
Because of 0 < 1, it follows from the rules that
$$1 < 1 + 1 < 1 + 1 + 1 < \cdots < \underbrace{1 + 1 + \cdots + 1}_{n\text{ times}}.$$
So these numbers are all different. If you remember the introduction of N in Sect.
3.1, then the set {1, 1 + 1, 1 + 1 + 1, . . . } just forms the natural numbers. The
natural numbers N are therefore contained in any ordered field K. Since in a field the
additive inverses and the quotients of two numbers must also be contained, K also contains
the integers Z and the rational numbers Q. Q itself is an ordered field and thus the
smallest ordered field that exists. The prepositive cone of Q consists of the numbers
$Q^+ = \{\, \tfrac{m}{n} \mid m, n \in N \,\}$.
We have met other fields in Sect. 5.3: the finite fields GF(p) and GF($p^n$). But these
fields cannot carry an order, since N is not contained in them.
We are on the way to characterizing the real numbers, and have seen that the order alone
is not enough: the rational numbers are also ordered.
In contrast to the rational numbers, the real numbers are complete in a certain sense:
there are no more gaps. We now want to formulate the notion of completeness precisely.
If a set M has a maximum x, then this is uniquely determined: For if x′ is another
maximum, then from Definition 12.4 we get x′ ≥ x and x ≥ x′, and it follows that x = x′.
Similarly, the minimum is uniquely determined. We denote the maximum and minimum of M by
max(M) and min(M), respectively.
Examples
Let K = Q.
Similarly, the terms bounded from below, the lower bound and the greatest lower bound
are defined. The greatest lower bound of a set is called the infimum of M and is denoted
by inf(M).
The set M is called bounded if it is bounded from below and from above. Let’s look at
the examples from before again:
1. M1 has, for example, 10^6, 4 and 2 as upper bounds. M1 is not bounded from below.
2 is the smallest upper bound: sup(M1) = 2.
2. M2 has the same upper bounds as M1, and although M2 has no maximum, it has a
smallest upper bound, namely 2. Any smaller number belongs to M2 and is therefore
no longer an upper bound: sup(M2) = 2.
3. M3 has, for example, 10^6, 4, 1.5, 1.42, 1.425 as upper bounds, but there is no smallest
upper bound. We can always get closer to √2 with rational numbers; none of
these numbers is the smallest rational number above √2. If √2 were a rational
number, it would be the supremum. ◄
Theorem 12.7 There is exactly one complete ordered field. This is called the
field of real numbers R.
In particular, to show the uniqueness of this field is a difficult mathematical task. But for
us this is not so interesting. We only need the properties of R, which can be concluded
from the axioms.
For example, we see that √2 is a real number: for the supremum y of the set
{x ∈ R | x² ≤ 2} it is indeed y² = 2, thus y = √2.
Now let’s look at the set of integers:
Z has no upper bound in R, otherwise there would be a supremum s ∈ R of Z.
Then s − 1 is not an upper bound of Z, so there is a n ∈ Z with n > s − 1. This means
n + 1 > s, thus s is not an upper bound of Z.
Similarly, it follows that every non-empty subset M of Z which is bounded
from above has a maximum: Let n0 ∈ M be given. If M has no maximum, then
with each number n also n + 1 is in M . By the induction principle it then follows
M = {n ∈ Z | n ≥ n0 }, and this set is, just like Z, not bounded from above.
These properties of Z are contained in the following theorem:
For a) we can choose an n with n > y/x. b) is a consequence: If n · x > 1, then 1/n < x also
applies.
To c): The set {z ∈ Z | z ≤ x} is bounded from above and therefore as a subset of the
integers has a maximum.
Above all, property b) will often be useful to us: In the next chapters we will use again
and again that we can get as close to 0 as we like with the number sequence 1/n.
a) |x| ≥ 0, |x| = 0 ⇔ x = 0.
b) |xy| = |x| · |y|.
c) |x + y| ≤ |x| + |y|.
d) |x/y| = |x| / |y|.
e) | |x| − |y| | ≤ |x − y|.
I will leave it to you as an exercise to check the rules a) to c). Property d) follows from
b) if one applies this rule to (x/y) · y, and e) results from c) by replacing once x with x − y
and once y with y − x.
If you turn back to Theorem 5.16 in Sect. 5.3, you will find the same rules there for the
absolute value of a complex number.
12.2 Topology
Think of d(x, y) as the distance between the points x and y. (M3) is called the triangle
inequality, and in Fig. 12.1 you can see where the name comes from: The sum of two
sides of a triangle is never smaller than the third side. Equality can only hold in degenerate
triangles, that is, if the three points are on one line.
1. The absolute value of real and complex numbers gives us a metric on R and C: for
x, y ∈ R respectively for x, y ∈ C let
d(x, y) := |x − y|.
(M1) and (M2) are immediately clear, (M3) applies, since
|x − y| = |x − z + z − y| ≤ |x − z| + |z − y|. In R and C we will always work with
this metric.
2. In Definition 10.4 we have defined a norm on the real vector space R^n. This norm
$\|u\| := \sqrt{\langle u, u \rangle} = \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2}$ has, according to Definition 10.2, the same
properties as an absolute value and therefore
\[ d(u, v) := \|u - v\| \]
is a metric on R^n. We can interpret d(u, v) in R² and R³ as the distance between the
two vectors u and v.
3. A central concept in coding theory is that of the Hamming distance between two
code words: A code word is an n-tuple of 0s and 1s, so the underlying set is
X = {0, 1}^n. If b = (b1, b2, …, bn) and c = (c1, c2, …, cn) are two code words,
then the Hamming distance between b and c is defined by
d(b, c) = number of different digits of b and c.
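The definition of the Hamming distance translates directly into a few lines of code. The following Python sketch is only an illustration: the function name `hamming_distance` and the sample code words `b`, `c`, `z` are assumptions chosen for this example. It counts the differing positions and spot-checks the metric properties (M1) to (M3) on the samples, which of course does not replace a proof.

```python
def hamming_distance(b, c):
    """Number of positions in which two code words of equal length differ."""
    if len(b) != len(c):
        raise ValueError("code words must have the same length")
    return sum(1 for bi, ci in zip(b, c) if bi != ci)


# Hypothetical sample code words from {0, 1}^5.
b = (1, 0, 1, 1, 0)
c = (1, 1, 1, 0, 0)
z = (0, 1, 1, 0, 1)

print(hamming_distance(b, c))   # b and c differ in positions 2 and 4
print(hamming_distance(b, b))   # a code word has distance 0 to itself
# Triangle inequality on this sample: d(b, c) <= d(b, z) + d(z, c)
print(hamming_distance(b, c) <= hamming_distance(b, z) + hamming_distance(z, c))
```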
Definition 12.12 Let (X, d) be a metric space and ε > 0 a real number. Then for
x ∈ X the set
Uε (x) := {y ∈ X | d(x, y) < ε}
is called ε-neighborhood of x.
1. In R we have Uε(x) = {y | |x − y| < ε} = {y | x − ε < y < x + ε}, which is the set of elements
that differ from x by less than ε. In C, the distance between y and x must be
less than ε. This is the case for all points y that are located inside a circle around x with
radius ε.
The boundary of the sphere, that is the set of y with d(x, y) = ε, does not belong to the
neighborhood. This has an important consequence: If y ∈ Uε (x), then there is always a
δ > 0 with Uδ (y) ⊂ Uε (x): No matter how close we get to the boundary with y, there is
always a whole neighborhood around y that is also contained in Uε (x) (Fig. 12.4). Sets
with this property are called open sets in topology. The notions of closed sets, boundary
points and the contact points of a set (Fig. 12.5) are closely related to this concept.
Definition 12.13 Let (X, d) be a metric space and M ⊂ X. M is called open if for
all x ∈ M there is an ε > 0 with Uε(x) ⊂ M.
The element x ∈ X is called a contact point of M if for every neighborhood Uε(x)
it holds: Uε(x) ∩ M ≠ ∅.
x ∈ X is called a boundary point of M if x is a contact point of M and of X \ M.
x ∈ M is called an interior point of M if there is a neighborhood of x with
Uε(x) ⊂ M.
x ∈ M is called an isolated point of M if there is a neighborhood of x with
Uε(x) ∩ M = {x}.
The set M is called closed if all contact points of M already belong to M.
The boundary of M is the set of boundary points of M. The set of contact points
of M is called the closure of M and is denoted by M̄; the set of interior points is
called the interior of M.
The set X itself is both open and closed, as is the empty set. For each subset M ⊂ X, the
closure M̄ is closed and the interior of M is open.
For the intervals in R, I would like to introduce the following notations: Let a, b ∈ R
and a < b. Then
[a, b] := {x ∈ R | a ≤ x ≤ b},
]a, b[ := {x ∈ R | a < x < b},
[a, b[ := {x ∈ R | a ≤ x < b},
]a, b] := {x ∈ R | a < x ≤ b},
[a, ∞[ := {x ∈ R | a ≤ x},
]−∞, b] := {x ∈ R | x ≤ b},
]a, ∞[ := {x ∈ R | a < x},
]−∞, b[ := {x ∈ R | x < b}.
With the last four notations, you should pay attention that ∞ is not a real number,
but only a useful symbol. We also want to allow the limiting case a = b: [a, a] = {a},
]a, a[ = ]a, a] = [a, a[ = ∅.
In the sense of the Definition 12.13, [a, b] is closed and ]a, b[ is open. The boundary of
the first four sets is {a, b}. The closure of these sets is [a, b] and their interior is ]a, b[.
Theorem 12.14 In the metric space X , a subset U ⊂ X is open if and only if its
complement X \ U is closed.
The proof of this theorem is typical of the arguments used in connection with neighborhoods.
I would like to present it in detail here, and you should try to carry it out yourself. We
have to show two directions. Let V = X \ U:
“⇒” Let U be open. Assume V is not closed. Then there is a contact point x of V that
is not in V. Then x ∈ U, and the property of being a contact point says that for each neighborhood
of x it holds: Uε(x) ∩ V ≠ ∅. So no neighborhood of x is contained in U, which contradicts
the fact that U is open.
“⇐” Let V be closed. Assume U is not open. Then there is an x ∈ U so that for all
neighborhoods of x it holds Uε(x) ∩ V ≠ ∅. But this just means that x is a contact point
of V. Since V is closed, it follows that x ∈ V, a contradiction.
Theorem 12.15 In the metric space X , any unions of open sets are open again,
just as finite intersections of open sets are open. Similarly, any intersections and
finite unions of closed sets are closed.
Above all, one must be careful that infinite intersections of open sets need not be open
again. For example, the following intersection of open sets results in a closed interval:
\[ \bigcap_{n \in \mathbb{N}} \left] a - \frac{1}{n},\; b + \frac{1}{n} \right[ \;=\; [a, b]. \]
Find an example yourself that an infinite union of closed sets need not be closed anymore.
Now we want to use our topological methods for the first time to learn something new
about the structure of the real and rational numbers:
The proof uses Theorem 12.8: Let x ∈ R and ε > 0. Then there is an n ∈ N with the property
1/n < ε and a largest integer z that does not exceed n · x: It is n · x = z + r with 0 ≤ r < 1. Then
x = z/n + r/n with r/n < 1/n < ε, and that means
\[ \frac{z}{n} = x - \frac{r}{n} \in U_\varepsilon(x). \]
If x ∈ R, then there is always a rational number in any neighborhood of x, no matter how
small it may be. Q has gaps, as we have known for a long time, but these gaps are tiny,
we can never find an interval in which no rational numbers are located. We also say of
this property: “Q is dense in R”.
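The construction in the proof is completely explicit, so it can be tried out directly. The following Python sketch is a minimal illustration under stated assumptions: the function name `rational_in_neighborhood` is made up for this example, and the real number x is represented by a floating point value, so the result is only as exact as that representation. It chooses n with 1/n < ε, takes z as the largest integer not exceeding n · x, and returns the fraction z/n.

```python
from fractions import Fraction
import math


def rational_in_neighborhood(x, eps):
    """Return a rational z/n with |x - z/n| < eps, following the proof above."""
    n = math.floor(1 / eps) + 1   # guarantees 1/n < eps
    z = math.floor(n * x)         # largest integer z with z <= n*x
    return Fraction(z, n)


x = math.sqrt(2)                  # floating point stand-in for the real number sqrt(2)
for eps in (0.1, 0.001, 0.00001):
    q = rational_in_neighborhood(x, eps)
    print(eps, q, abs(x - float(q)))
```

The denominators produced this way are not the smallest possible ones; the point is only that some rational number always lands in the neighborhood.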
Comprehension Questions
1. Why can the field GF(p) not carry an order in which it becomes an ordered field?
2. Where does the term triangle inequality come from in a metric?
3. What is the difference between a boundary point and a contact point of a set?
4. What are the interior points, contact points and boundary points of the interval
[0, π[ in R?
5. What are the boundary points of the interval [0, π[ ∩ Q in Q?
6. In a metric space, let Ūε(x) be the closure of an ε-neighborhood Uε(x) of
x. Is there a neighborhood Uδ(y) of y with Uδ(y) ⊂ Ūε(x) for every
y ∈ Ūε(x)?
7. True or false: By specifying x < y :⇔ d(x, 0) < d(y, 0) for complex numbers x, y,
an order is defined on the set C and C thus becomes an ordered field.
Exercises
You can make many case distinctions here, or you use that |x| = max{x, −x} and apply
Exercise 2.
b) Let G be a network (see Definition 11.9). The weight w(x, y) is always positive
and w(x, x) = 0 for all x. Is w a metric on the set of nodes of G?
7. Show that for the Hamming distance of codewords the triangle inequality holds:
d(x, y) ≤ d(x, z) + d(z, y).
8. Formulate with the help of the concept of the ε-neighborhood the statement “Z is
dense in R” and prove or disprove this statement.
13 Sequences and Series
Abstract
• you will know what convergent sequences are, and be able to calculate limits of
sequences,
• you will be able to deal with the ε of mathematicians,
• you will know the big O notation and be able to calculate and express the runtime
of algorithms in the big O notation,
• you will know what series and convergent series are, and have learned important
calculation rules to determine the limit of a convergent series,
• you will be able to represent real numbers in the decimal system and in other
numeral systems and know the relationship of these representations to the theory
of convergent series,
• you will have learned the number e and the exponential function as the first func-
tion defined by a convergent series.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_13
For computer scientists, non-convergent sequences are also interesting, namely those
that describe the runtime of an algorithm in dependence on the input variables. Here, it is
especially important to know how fast such a sequence grows.
Sequences of numbers and limits of such sequences are needed in many areas of
mathematics. We will need them to investigate continuous functions and to introduce dif-
ferential and integral calculus.
Series are nothing but special sequences of numbers. Function series, such as Taylor
series and Fourier series play a special role here and are used in many technical applica-
tions and also in applications of computer science, for example in data compression.
Sequences do not always have to start with n = 1, sometimes they start with 0 or any
other integer. We will mostly work with sequences of real or complex numbers, that is,
with sequences in which M = R or M = C.
Convergent Sequences
capture this fuzzy formulation more precisely. The resulting expression is not quite simple,
but the meaning is vivid:
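Definition 13.2 Let (X, d) be a metric space, $(a_n)_{n\in\mathbb{N}}$ a sequence of elements from
X. The sequence $(a_n)_{n\in\mathbb{N}}$ converges to a, and we write $\lim_{n\to\infty} a_n = a$, if it holds:
\[ \forall \varepsilon > 0 \;\; \exists n_0 \in \mathbb{N} \;\; \forall n > n_0 : \; a_n \in U_\varepsilon(a). \]
In R and C we have $U_\varepsilon(a) = \{x \mid |x - a| < \varepsilon\}$, so there we can formulate the rule in this way:
\[ \forall \varepsilon > 0 \;\; \exists n_0 \in \mathbb{N} \;\; \forall n > n_0 : \; |a_n - a| < \varepsilon. \]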
The sequence is called divergent, if it does not converge.
What is behind it? Every neighborhood Uε(a), no matter how small it is, contains all
further sequence elements from a certain index n0 on. This means that almost all sequence
elements, namely all except finitely many at the beginning of the sequence, are arbitrarily
close to the limit. There are no more outliers: The whole tail of the sequence remains
close to a. In Fig. 13.1 you can see what such sequences can look like in R or in C.
There are unwritten naming conventions in mathematics that have almost the status of laws. For
example, natural numbers are often denoted with m, n, real numbers with x, y, and similar. A
particularly sacred cow for the mathematician is his ε. This can be 1, 1000, 10^6 or 10^−6 and the
assertions about convergence remain correct. But every mathematician always has a small posi-
tive real number in his mind (whatever small may mean) and thinks of neighborhoods that can
become arbitrarily small. And when a mathematician works with small neighborhoods, then
these have a radius ε, not µ or q. You drive every corrector of an exercise or exam to despair
when you answer the question about the definition of a convergent sequence, for example with:
∀x < 0 ∃ε ∈ N ∀β > ε : x < −|aβ − a|.
This is correct, but if you are unlucky, the line will simply be crossed out: To err is human.
Such conventions of course make sense: They make what is written easier to read because you
can associate an unwritten meaning with a symbol. I stick to many such conventions in this
book, even if I don’t always mention them explicitly. If you read material in other textbooks
on mathematics, you will find it helpful that similar terms are used there most of the time.
You also know such conventions from programming: programming guidelines contain-
ing, for example, formatting and naming rules, serve primarily not to restrict the freedom of
the programmer, but to facilitate the reading of programs by other people than the developer.
Theorem 13.3 If the sequence (an )n∈N converges to a, then a is uniquely determined.
Proof: Assume $a_n$ converges to a and to a′, with a ≠ a′ (Fig. 13.2). If you choose
$\varepsilon < \frac{d(a, a')}{2}$, then there is an n0 so that for n > n0 holds an ∈ Uε(a), and similarly there is an n1
with an ∈ Uε(a′) for n > n1. But Uε(a) ∩ Uε(a′) = ∅ (see Exercise 8 for this chapter), so for
n > max(n0, n1) the element an would have to lie in two disjoint sets, a contradiction.
Most of the sequences we will be dealing with are sequences of real numbers. The real
numbers are ordered, unlike, for example, C or Rn. In this case, we can formulate the fol-
lowing concepts:
Definition 13.4 Let $(a_n)_{n\in\mathbb{N}}$ be a sequence of real numbers. We say “$a_n$ tends to ∞”
and write $\lim_{n\to\infty} a_n = \infty$ if, for all r > 0, there exists an n0 ∈ N such that $a_n > r$ for
all n > n0. Similarly, we say “$a_n$ tends to −∞” ($\lim_{n\to\infty} a_n = -\infty$) if, for all r < 0,
there exists an n0 ∈ N such that $a_n < r$ for all n > n0.
The sequence (an )n∈N is called bounded or bounded from above or bounded
from below if the corresponding statement holds for the set {an | n ∈ N}.
The sequence (an )n∈N is called monotonically increasing (monotonically
decreasing) if, for all n ∈ N, an+1 ≥ an (an+1 ≤ an) holds.
Remember that a sequence that tends to infinity is not “infinite” at some point, there is no
number “infinite”. The sequence is just not bounded from above.
However, when I talk about convergent sequences of real numbers below, I always mean
that the limit a exists in R, that is, a ≠ ±∞. If the case ∞ is allowed, I will mention
this explicitly.
For if a ∈ R is the limit of the sequence, then from an index n0 on, all sequence members
are in U1 (a). The set {an | n ∈ N} is therefore contained in the set {a1 , a2 , . . . , an0 } ∪ U1 (a),
and this set is bounded.
Examples
1. The constant sequence $a_n = a$ converges, $\lim_{n\to\infty} a_n = a$.
2. $(i^n)_{n\in\mathbb{N}}$ diverges! There are, for example, infinitely many sequence members in
each neighborhood of 1, but the sequence also jumps out of each such neighbor-
hood of 1 again and again. There is no index from which really all sequence mem-
bers are close to 1. The same applies to the points −1 , i , −i.
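3. $(\frac{1}{n})_{n\in\mathbb{N}}$ converges to 0: Let ε > 0 be given. Then, according to Theorem 12.8, there
is an $n_0$ with $\frac{1}{n_0} < \varepsilon$, and for $n > n_0$ it holds: $|\frac{1}{n} - 0| < \frac{1}{n_0} < \varepsilon$. We call $(\frac{1}{n})_{n\in\mathbb{N}}$ a null
sequence.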
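4. $\lim_{n\to\infty} \frac{n}{n+1} = 1$. For $\left|\frac{n}{n+1} - 1\right| = \left|\frac{n-(n+1)}{n+1}\right| = \frac{1}{n+1}$. If $\frac{1}{n_0} < \varepsilon$ is as above, then for
$n > n_0$ follows: $\frac{1}{n+1} < \frac{1}{n} < \varepsilon$.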
5. For the sequence $(\frac{n}{2^n})_{n\in\mathbb{N}}$, the conjecture that it converges to 0 is obvious. If we
use that $2^n \ge n^2$ always holds for n > 3, we get $|\frac{n}{2^n} - 0| \le \frac{n}{n^2} = \frac{1}{n}$. We already
know that 1/n is less than ε if n is large enough.
6. In $(a^n)_{n\in\mathbb{N}}$ it all depends on the value of the number a: For a = 1/2 we get a convergent
sequence, for a = 2 a divergent sequence. a = 1 results in convergence,
a = −1 on the other hand again in divergence. In general:
a) $\lim_{n\to\infty} a^n = 0$ for |a| < 1.
b) $a^n$ is divergent for |a| ≥ 1, a ≠ 1.
c) $\lim_{n\to\infty} a^n = 1$ for a = 1.
Let’s calculate 6.a) for example. For a = 0 there is nothing to show. Otherwise |a| < 1 ⇒ $\frac{1}{|a|} > 1$, that is $\frac{1}{|a|} = 1 + x$ or $\frac{1}{1+x} = |a|$
for a real number x > 0. In the exercises for mathematical induction in Chap. 3
you could calculate that in this case $(1 + x)^n \ge 1 + nx$ applies. This gives us:
\[ |a^n| = |a|^n = \frac{1}{(1 + x)^n} \le \frac{1}{1 + nx} < \varepsilon, \]
if $1 + nx > \frac{1}{\varepsilon}$, that is, if $n > \frac{1}{x}\left(\frac{1}{\varepsilon} - 1\right)$. It doesn’t matter at all that $\frac{1}{x}\left(\frac{1}{\varepsilon} - 1\right)$ is not
an integer, we can simply choose the next larger integer for $n_0$. Please note that this
calculation is also correct for complex numbers a!
7. The runtime of Selection Sort is not convergent, the sequence is monotonically
increasing and not bounded from above: $\lim_{n\to\infty} a_n = \infty$.
8. In the sequence $\left((\frac{1}{n}, \frac{1}{n^2})\right)_{n\in\mathbb{N}}$ both components converge to 0. Then
$\left\|(\frac{1}{n}, \frac{1}{n^2}) - (0, 0)\right\| = \sqrt{(\frac{1}{n})^2 + (\frac{1}{n^2})^2}$ also tends to 0, that is, the sequence converges to
(0, 0).
In R^n it holds: A sequence converges to $(a_1, a_2, \ldots, a_n)$ if and only if for all i the
i-th component of the sequence converges to $a_i$. ◄
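The estimate from Example 6.a) can be turned into a small program that computes an admissible index for a given ε. The following Python sketch is only an illustration: the function name `n0_for_power_sequence` is an assumption made for this example, and floating point arithmetic stands in for real and complex numbers. It returns the next integer at or above (1/x)(1/ε − 1) with x = 1/|a| − 1.

```python
import math


def n0_for_power_sequence(a, eps):
    """Index n0 with |a**n| < eps for all n > n0, from the Bernoulli estimate.

    Assumes 0 < |a| < 1 and eps > 0; a may be real or complex.
    """
    x = 1 / abs(a) - 1                 # 1/|a| = 1 + x with x > 0
    bound = (1 / x) * (1 / eps - 1)    # every n > bound satisfies |a|**n < eps
    return max(0, math.ceil(bound))


n0 = n0_for_power_sequence(0.5, 1e-3)
print(n0, abs(0.5 ** (n0 + 1)))

n0_complex = n0_for_power_sequence(0.6 + 0.6j, 1e-2)
print(n0_complex, abs((0.6 + 0.6j) ** (n0_complex + 1)))
```

The index obtained this way is sufficient but far from sharp: for a = 1/2 and ε = 10^−3 the tenth power is already smaller than ε. The definition of convergence only asks for some n0, not for the smallest one.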
You can see from these examples that it is sometimes not quite easy to calculate the
limit, if it exists at all, even for simple sequences. Fortunately, there are a number of
tricks with which we can reduce the convergence of sequences to those of already known
sequences. We will hardly carry out any more such complex calculations as in Example
6. The following theorems provide us with some important tools. They use the order
relation and are then only applicable in R, but this is also by far the most important case.
Theorem 13.6: The comparison test for sequences Let $a_n$, $b_n$, $c_n$ be real number
sequences and let $b_n \le c_n \le a_n$ for all n.
Further let $\lim_{n\to\infty} a_n = \lim_{n\to\infty} b_n = c$. Then $\lim_{n\to\infty} c_n = c$ also applies.
With the phrase “for sufficiently large n” I mean: “there is a n0 so that for all n > n0 applies”.
The limit of a sequence can therefore be determined if the sequence can be clamped
between two other sequences, the common limit of which is already known. In Examples
4 and 5, we have already used this trick implicitly: We have clamped the sequences
$\left|\frac{n}{n+1} - 1\right|$ and $\frac{n}{2^n}$ between the constant sequence 0 and the sequence $\frac{1}{n}$.
Closely related to this comparison test is the following theorem, the simple proof of
which I would like to omit:
Theorem 13.7 Let $a_n$, $b_n$ be real number sequences with $\lim_{n\to\infty} a_n = a$ and
$\lim_{n\to\infty} b_n = b$. Let $a_n \le b_n$ for all n ∈ N. Then a ≤ b also applies.
But beware: From an < bn for all n it does not necessarily follow that a < b: In the limit,
the “<”-sign can become a “≤”-sign. See Examples 1 and 5 for this.
In Theorems 13.6 and 13.7 you can replace the words “for all n ∈ N” each time with
“for all n > n0” or with “for all sufficiently large n”. In investigations of the conver-
gence of sequences, it is always the ends of the sequences that matter. For small n, the
sequences can do whatever they want, it has no influence on the limit.
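Theorem 13.8: Calculation rules for sequences Let $a_n$, $b_n$ be real or complex
number sequences with $\lim_{n\to\infty} a_n = a$ and $\lim_{n\to\infty} b_n = b$. Then the following holds:
d) If b ≠ 0, then there is an $n_0$ so that $b_n \ne 0$ for all $n > n_0$. Then
$\left(\frac{1}{b_n}\right)_{n>n_0}$ and $\left(\frac{a_n}{b_n}\right)_{n>n_0}$ are also convergent with
\[ \lim_{n\to\infty} \frac{1}{b_n} = \frac{1}{b} \quad\text{and}\quad \lim_{n\to\infty} \frac{a_n}{b_n} = \frac{a}{b}. \]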
I would like to prove part a) as an example. In doing so, a trick occurs that is used again
and again in connection with the ε calculations: If a statement is true for all ε > 0, then
you can replace a given ε with ε/2, ε² or other positive expressions depending on ε and
the statement remains true:
So let ε > 0 be given. Then there is an n0 so that for n > n0 holds |an − a| < ε/2 and
|bn − b| < ε/2. It follows:
\[ |(a_n \pm b_n) - (a \pm b)| = |(a_n - a) \pm (b_n - b)| \le |a_n - a| + |b_n - b| < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \]
The other parts of the theorem are derived in principle similarly, although the individual
conclusions are sometimes somewhat more complicated.
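\[ \lim_{n\to\infty} \frac{n}{n+1} = \lim_{n\to\infty} \frac{1}{1 + \frac{1}{n}} = \frac{\lim\limits_{n\to\infty} 1}{\lim\limits_{n\to\infty} 1 + \lim\limits_{n\to\infty} \frac{1}{n}} = \frac{1}{1 + 0} = 1. \]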
3. Similarly, one can always calculate the limit of a quotient of two polynomials by factoring
out the highest power of n in the numerator and denominator and cancelling:
\[ \frac{6n^4 + 3n^2 + 2}{7n^4 + 12n^3 + n} = \frac{6 + \frac{3}{n^2} + \frac{2}{n^4}}{7 + \frac{12}{n} + \frac{1}{n^3}} \;\xrightarrow[n\to\infty]{}\; \frac{6}{7}. \]
◄
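A quick numerical experiment makes the last limit plausible. The following Python sketch is an illustration only; the function name `quotient` is an assumption for this example. It evaluates the quotient of the two polynomials exactly with rational arithmetic and prints the distance to 6/7 for growing n.

```python
from fractions import Fraction


def quotient(n):
    """Exact value of (6n^4 + 3n^2 + 2) / (7n^4 + 12n^3 + n) as a fraction."""
    n = Fraction(n)
    return (6 * n**4 + 3 * n**2 + 2) / (7 * n**4 + 12 * n**3 + n)


limit = Fraction(6, 7)
for n in (10, 100, 1000, 10**6):
    value = quotient(n)
    print(n, float(value), float(abs(value - limit)))
```

The printed distances shrink roughly like a constant times 1/n, which matches the terms 12/n and 3/n² that were dropped in the limit. Such an experiment suggests the limit, but only the calculation rules prove it.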
In Sect. 3.3 we determined the runtime of some recursive algorithms. The runtime of an
algorithm is, in addition to the performance of the machine, dependent on a number n,
which is determined by the input. This number n, the size of the instance, can have very
different meanings. It can be the size of a number, the number of elements of a list to be
sorted, the length of a cryptographic key in bits, the dimension of a matrix, and much
more. In each case, the runtime results as a mapping f : N → R+, where f (n) = an
Definition 13.9 For sequences $a_n$, $b_n$ with positive real sequence members, one
says $a_n$ is $O(b_n)$ if and only if the sequence $a_n/b_n$ is bounded. That is,
\[ \exists c \in \mathbb{R}^+ \;\; \forall n \in \mathbb{N} : \; a_n < c \cdot b_n. \]
This is read as “$a_n$ is of order $b_n$”. For $a_n$ is $O(b_n)$ we also write $a_n = O(b_n)$.
In particular, according to Theorem 13.5 an = O(bn ), if the sequence an /bn is conver-
gent. But this convergence does not always have to be the case.
If an and bn only differ by a constant multiplicative factor c, then an = O(bn ). This is
useful for our purposes, because such a factor reflects the performance of the computer,
which we do not want to take into account when assessing an algorithm.
With the help of this definition, the sequence $a_n$ can at first only be roughly estimated
from above, because of course, for example, for $a_n = a \cdot n + b$ it follows that $a_n$ is $O(n^{1000})$. But we
are not satisfied with that; we are looking for a few prototypes of sequences for $b_n$.
The comparability of sequences is facilitated by the following relation between the
orders:
If you formally negate “an is O(bn)”, you get for “an is not O(bn)” the rule:
∀c ∈ R+ ∃n ∈ N : an ≥ c · bn .
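Caution: Not all orders are comparable using this relation. For example, for the
sequences
\[ a_n = \begin{cases} 1 & n \text{ even} \\ n & n \text{ odd,} \end{cases} \qquad b_n = \begin{cases} n & n \text{ even} \\ 1 & n \text{ odd} \end{cases} \]
none of the three relations <, = or > applies, because $a_n$ is not $O(b_n)$ and $b_n$ is not $O(a_n)$.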
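That neither quotient is bounded can be observed numerically. The following Python sketch is an illustration; the helper names `a`, `b` and `largest_ratio` are assumptions for this example. It computes the largest value of each quotient up to a given index and shows that both maxima keep growing with the index range.

```python
def a(n):
    """a_n = 1 for even n, n for odd n."""
    return 1 if n % 2 == 0 else n


def b(n):
    """b_n = n for even n, 1 for odd n."""
    return n if n % 2 == 0 else 1


def largest_ratio(f, g, limit):
    """Largest value of f(n)/g(n) for 1 <= n <= limit."""
    return max(f(n) / g(n) for n in range(1, limit + 1))


for limit in (10, 100, 1000):
    print(limit, largest_ratio(a, b, limit), largest_ratio(b, a, limit))
```

On odd indices the quotient a_n/b_n equals n, on even indices b_n/a_n equals n, so no constant c can bound either of them.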
Examples
1. The runtime for the iterative calculation of the factorial of a number n we have
already determined in Example 1 in Sect. 3.3: It was an = n · b + a.
With our rules for the calculation of limits, we get (look again at Example 3 after
Theorem 13.8):
\[ \lim_{n\to\infty} \frac{nb + a}{n} = b, \]
and thus $O(bn + a) = O(n)$.
2. When multiplying two n × n matrices A and B, the element $c_{ij}$ of the product
matrix results as
\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}. \]
\[ \lim_{n\to\infty} \frac{(a + b)n^3 - bn^2}{n^3} = a + b, \]
thus $O((a + b)n^3 - bn^2) = O(n^3)$. ◄
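The operation count behind the term (a + b)n³ − bn² can be checked by instrumenting the textbook matrix product. The following Python sketch is an illustration under an assumption: it reads a as the time for one multiplication and b as the time for one addition, which is consistent with the formula above but is an interpretation made for this example. The function name `multiply_and_count` is likewise hypothetical.

```python
def multiply_and_count(A, B):
    """Product of two n x n matrices (lists of rows) with operation counts.

    Returns (C, multiplications, additions) for the schoolbook algorithm.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            s = A[i][0] * B[0][j]        # first summand: one multiplication
            mults += 1
            for k in range(1, n):
                s += A[i][k] * B[k][j]   # one multiplication and one addition
                mults += 1
                adds += 1
            C[i][j] = s
    return C, mults, adds


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, mults, adds = multiply_and_count(A, B)
print(C, mults, adds)
```

Each of the n² entries needs n multiplications and n − 1 additions, so the counts are n³ and n³ − n². Weighted with a and b this gives a·n³ + b·(n³ − n²) = (a + b)n³ − bn².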
The sequences n, n², n³, … are prototypes of runtimes with which one likes to compare
algorithms. Since for k ∈ N the sequence $n^{k-1}/n^k$ is bounded and $n^k/n^{k-1}$ is unbounded,
$O(n^{k-1}) < O(n^k)$ applies.
Similarly to the two examples, one shows:
In addition to the polynomial orders, there are some other important orders. It is:
\[ O(1) < O(\log n) < O(n) < O(n \cdot \log n) < O(n^2) < O(n^3) < \ldots < O(n^k) < \ldots < O(2^n) < O(3^n) < \ldots \]
An O(1)-algorithm has constant runtime, O(log n) is called logarithmic, O(n) linear and
O(n²) quadratic runtime. Algorithms of the order $a^n$, a > 1, are called exponential algorithms.
These are usually unusable for calculation purposes. In the following table I have
listed some values of such runtimes for you.
n        log n    n        n · log n   n^2      n^3      2^n
10       3.32     10       33.22       100      1000     1024
100      6.64     100      664.4       10 000   10^6     1.27 · 10^30
1000     9.97     1000     9966        10^6     10^9     10^301
10 000   13.29    10 000   132 877     10^8     10^12    10^3010
In this table I have given the logarithm to the base 2. Later we will see that the base does not
play a role here. For different bases a, b it always holds that $O(\log_a n) = O(\log_b n)$.
If you now go back to the examples at the end of Sect. 3.3 you will see that the Selection
Sort was of order n², the Merge Sort of order n · log n. From the table you can see that
for larger numbers n this is a huge improvement.
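The table entries are easy to recompute. The following Python sketch is an illustration; the function name `runtime_row` and the dictionary keys are assumptions for this example. Since 2^n quickly exceeds what is convenient to print, the last entry is given as the exponent of the corresponding power of 10, that is, 2^n is roughly 10 to that exponent.

```python
import math


def runtime_row(n):
    """Prototype runtimes for instance size n, logarithm to base 2."""
    log_n = math.log2(n)
    return {
        "n": n,
        "log n": round(log_n, 2),
        "n log n": round(n * log_n, 2),
        "n^2": n**2,
        "n^3": n**3,
        "log10(2^n)": round(n * math.log10(2), 2),  # 2^n is about 10^(this value)
    }


for n in (10, 100, 1000, 10000):
    print(runtime_row(n))
```

Comparing the columns for n² and n · log n at n = 10 000 shows the gap between Selection Sort and Merge Sort mentioned above: about 10^8 versus roughly 1.3 · 10^5.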
Monotonic Sequences
The proof is essentially based on the completeness of the real numbers. In the rational
numbers, the assertion is false. The limit of the monotonically increasing and bounded
sequence an is precisely the supremum a of the set {an | n ∈ N}. We have to show that in
every neighborhood of a almost all, that is all except finitely many, sequence members lie:
Let ε > 0. Then $a_n \le a < a + \varepsilon$ for all n ∈ N, since a is an upper bound, and there
is an $n_0$ so that $a - \varepsilon < a_{n_0}$, otherwise a would not be the least upper bound. But then,
because of the monotonicity, for all $n > n_0$ it holds: $a - \varepsilon < a_{n_0} \le a_n \le a < a + \varepsilon$, thus
$a_n \in U_\varepsilon(a)$.
The conclusion for monotonically decreasing sequences proceeds analogously.
Examples
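\[ a_n = 1 + \frac{1}{2} + \underbrace{\frac{1}{3} + \frac{1}{4}}_{> 2\cdot\frac{1}{4} = \frac{1}{2}} + \underbrace{\frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8}}_{> 4\cdot\frac{1}{8} = \frac{1}{2}} + \underbrace{\frac{1}{9} + \cdots + \frac{1}{16}}_{> 8\cdot\frac{1}{16} = \frac{1}{2}} + \underbrace{\frac{1}{17} + \cdots + \frac{1}{32}}_{> 16\cdot\frac{1}{32} = \frac{1}{2}} + \cdots + \frac{1}{n}. \]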
Try to determine the limits of some sequences with a computer. Calculate 10, 100,
1 000 000 sequence members or let the calculation run until the sequence members no
longer change. You will notice that quite different behaviour is shown: Sometimes you get
to the limit very quickly, sometimes very slowly, and even divergent sequences seem to con-
verge on the computer occasionally. What all sequences have in common is that if we have
determined a “limit” with the computer, we do not know how far it deviates from the actual
limit. This can only be determined with mathematical methods.
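Such an experiment could look like the following Python sketch. It is only an illustration: the function names `harmonic` and `basel` are assumptions for this example, and the two sequences chosen are the partial sums of 1/k and of 1/k². The first one grows without bound, but so slowly that a careless observer might take it for convergent; the second one approaches π²/6.

```python
import math


def harmonic(n):
    """Partial sum 1 + 1/2 + ... + 1/n, summed with math.fsum to limit rounding."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def basel(n):
    """Partial sum 1 + 1/4 + ... + 1/n^2."""
    return math.fsum(1.0 / (k * k) for k in range(1, n + 1))


for n in (10, 100, 10**6):
    print(n, harmonic(n), basel(n), math.pi**2 / 6 - basel(n))
```

Even after a million terms the first partial sum is still below 15, although the grouping argument shows that the sum of the first 2^k terms is at least 1 + k/2 and therefore exceeds every bound. The numbers alone cannot distinguish slow divergence from convergence.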
13.2 Series
The last examples of sequences have one common property: Each sequence member is
already a sum, and with each further sequence member something is added to this sum.
Such sequences are called series. It would be a huge understatement to say that series
play an important role in mathematics, many areas of mathematics are unthinkable with-
out the theory of series, and we have to deal with it intensively.
Examples of series
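1. We already know $\sum_{n=1}^{\infty} \frac{1}{n}$ and $\sum_{n=1}^{\infty} \frac{1}{n^2}$: The series $\sum_{n=1}^{\infty} \frac{1}{n}$ is divergent and $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$.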
2. An important series is the geometric series, which is defined for q ∈ R or q ∈ C:
$\sum_{k=0}^{\infty} q^k$. For all q with |q| < 1, $\sum_{k=0}^{\infty} q^k = \frac{1}{1-q}$.
Proof: For the n-th partial sum, it holds (compare Theorem 3.3 in Sect. 3.2):
$(1 - q) \cdot \sum_{k=0}^{n} q^k = 1 - q^{n+1}$. From Example 6 after Theorem 13.5 we know that
$q^{n+1}$ is a null sequence and so we get:
\[ (1 - q) \cdot \sum_{k=0}^{\infty} q^k = (1 - q) \cdot \lim_{n\to\infty} \sum_{k=0}^{n} q^k = \lim_{n\to\infty} (1 - q) \cdot \sum_{k=0}^{n} q^k = \lim_{n\to\infty} (1 - q^{n+1}) = 1 - \lim_{n\to\infty} q^{n+1} = 1. \]
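3. The series $\sum_{n=0}^{\infty} \frac{1}{n!}$ is monotonically increasing. We try to find an upper bound again:
For k ≥ 2,
\[ \frac{1}{k!} = \frac{1}{1 \cdot 2 \cdot 3 \cdot 4 \cdots k} \le \frac{1}{1 \cdot \underbrace{2 \cdot 2 \cdot 2 \cdots 2}_{k-1 \text{ factors}}} = \frac{1}{2^{k-1}}, \]
and therefore
\[ s_n = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots + \frac{1}{n!} \le 1 + \frac{1}{2^0} + \frac{1}{2^1} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}} \le 1 + \sum_{k=0}^{\infty} \left(\frac{1}{2}\right)^k = 1 + \frac{1}{1 - \frac{1}{2}} = 3. \]
We used the geometric series from Example 2 immediately. The series is therefore
bounded from above and thus convergent. ◄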
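The partial sum formula of the geometric series from Example 2 can be verified exactly with rational arithmetic. The following Python sketch is an illustration; the function name `geometric_partial_sum` and the sample value q = 1/3 are assumptions for this example. It checks the identity (1 − q) · Σ q^k = 1 − q^(n+1) and prints the remaining distance to the limit 1/(1 − q).

```python
from fractions import Fraction


def geometric_partial_sum(q, n):
    """Exact n-th partial sum q^0 + q^1 + ... + q^n."""
    return sum(q**k for k in range(n + 1))


q = Fraction(1, 3)            # any rational q with |q| < 1 would do
limit = 1 / (1 - q)
for n in (1, 5, 10, 20):
    s_n = geometric_partial_sum(q, n)
    identity_holds = (1 - q) * s_n == 1 - q ** (n + 1)
    print(n, s_n, identity_holds, float(limit - s_n))
```

Because fractions are used, the identity is confirmed without rounding errors. The distance to the limit is exactly q^(n+1)/(1 − q), so it shrinks by the factor q with every additional term.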
Definition 13.14 The limit $e := \sum_{n=0}^{\infty} \frac{1}{n!}$ is called Euler's number. The first digits in
the decimal expansion are:
e = 2.718281828…
The number e is an infinite decimal fraction that never has a period. e is therefore not a
rational number. e is, however, unlike √2, not a root of any polynomial with coefficients
from Q. Such numbers are called transcendental numbers.
For if a is the limit of the series, then for each ε > 0 there is an n0 ∈ N such that for all
n > n0 it holds:
\[ |a_n| = |s_n - s_{n-1}| = |(s_n - a) - (s_{n-1} - a)| \le |s_n - a| + |s_{n-1} - a| < \varepsilon. \]
Definition 13.16 The series $\sum_{n=0}^{\infty} a_n$ is called absolutely convergent, if the series
$\sum_{n=0}^{\infty} |a_n|$ converges.
Theorem 13.17 If the series $\sum_{n=0}^{\infty} a_n$ is absolutely convergent, then the series
itself is convergent.
This theorem can be directly derived from the calculation rules for the limits of
sequences (Theorem 13.8). Series are nothing but special sequences.
Some series can also be multiplied with each other. How does one carry that out? We
try to catch all terms of the product of two series. We use a trick for that: We sort the
terms according to the sum of the indices:
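\[ (a_0 + a_1 + a_2 + a_3 + \cdots + a_n + \cdots)(b_0 + b_1 + b_2 + b_3 + \cdots + b_n + \cdots) \]
\[ \text{„=“} \;\; \underbrace{a_0 b_0}_{\text{sum of indices } 0} + \underbrace{(a_0 b_1 + a_1 b_0)}_{\text{sum of indices } 1} + \underbrace{(a_0 b_2 + a_1 b_1 + a_2 b_0)}_{\text{sum of indices } 2} + \cdots + \underbrace{\sum_{k=0}^{n} a_k b_{n-k}}_{\text{sum of indices } n} + \cdots. \]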
I wrote the equality sign in quotation marks here because of course one cannot actually
calculate with infinite sums. There are no infinite sums after all, only series and their
limits. Nevertheless, this “calculation” should serve as motivation for the following dif-
ficult theorem, which I cannot prove here. It does not always work, but one can multiply
two convergent series if they are absolutely convergent:
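Theorem 13.19 If $\sum_{n=0}^{\infty} a_n = a$ and $\sum_{n=0}^{\infty} b_n = b$ are absolutely convergent series,
then the series $\sum_{n=0}^{\infty} \left( \sum_{k=0}^{n} a_k b_{n-k} \right)$ is also absolutely convergent and it holds that
\[ \sum_{n=0}^{\infty} \sum_{k=0}^{n} a_k b_{n-k} = \left( \sum_{n=0}^{\infty} a_n \right) \left( \sum_{n=0}^{\infty} b_n \right) = ab. \]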
Theorem 13.20: The comparison test for series If $\sum_{n=0}^{\infty} b_n$ is an absolutely
convergent series and $|a_n| \le |b_n|$ holds for all n ∈ N, then the series $\sum_{n=0}^{\infty} a_n$ is also
absolutely convergent and it holds that $\sum_{n=0}^{\infty} |a_n| \le \sum_{n=0}^{\infty} |b_n|$.
For if c ∈ R+ is the limit of the series $\sum_{n=0}^{\infty} |b_n|$, then the sequence $s_n = \sum_{k=0}^{n} |a_k|$ is
monotonic and bounded from above:
\[ s_n = \sum_{k=0}^{n} |a_k| \le \sum_{k=0}^{n} |b_k| \le c \]
and thus, according to Theorem 13.12, convergent. This means nothing else than that
$\sum_{n=0}^{\infty} a_n$ is absolutely convergent.
If you calculate the series $\sum_{k=0}^{\infty} \frac{1}{k!}$ on your computer, you will find that you are already
close to e after only a few terms. The series converges very quickly. But how can one
determine that the series converges well if the limit is unknown? It is of course very
important for numerical calculations to know when one can stop. Unfortunately, there
are no off-the-shelf tests, but in this special case one can estimate the residual error. For
this we use the calculation rule from Theorem 13.18b) and Theorem 13.20. Find out
where! It is
\[ e = \sum_{k=0}^{n} \frac{1}{k!} + r_{n+1}, \quad\text{with}\quad r_{n+1} = \sum_{k=n+1}^{\infty} \frac{1}{k!}. \tag{13.1} \]
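$r_{n+1}$ is therefore the error we make if we stop the calculation after the $n$-th term. Now we
can give an upper bound for $r_{n+1}$, similar to the one we found before for the whole series:
\begin{align*}
r_{n+1} &= \frac{1}{(n+1)!} \left( 1 + \frac{1}{n+2} + \frac{1}{(n+2)(n+3)} + \frac{1}{(n+2)(n+3)(n+4)} + \cdots \right) \\
&\le \frac{1}{(n+1)!} \left( 1 + \frac{1}{n+2} + \frac{1}{(n+2)^2} + \frac{1}{(n+2)^3} + \cdots \right) \\
&= \frac{1}{(n+1)!} \cdot \sum_{k=0}^{\infty} \left( \frac{1}{n+2} \right)^k = \frac{1}{(n+1)!} \cdot \frac{1}{1 - \frac{1}{n+2}} \\
&\le \frac{1}{(n+1)!} \cdot 2 = \frac{2}{(n+1)!}.
\end{align*}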
It is $2/13! \approx 3.21 \cdot 10^{-10}$, so the error occurs after 12 additions in the tenth place after
the decimal point. Thus, the accuracy from Definition 13.14 is already given.
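The estimate can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `partial_sum` and `residual_bound` are chosen here for illustration, and exact fractions are used so that rounding does not blur the comparison between the true error and the bound $2/(n+1)!$.

```python
from fractions import Fraction
from math import e, factorial


def partial_sum(n):
    """Return s_n = sum_{k=0}^{n} 1/k! as an exact fraction."""
    return sum(Fraction(1, factorial(k)) for k in range(n + 1))


def residual_bound(n):
    """Return the upper bound 2/(n+1)! for the residual r_{n+1}."""
    return Fraction(2, factorial(n + 1))


# Compare the actual error e - s_n with the bound for a few values of n.
for n in (4, 8, 12):
    error = e - float(partial_sum(n))
    print(n, error, float(residual_bound(n)))
```

The printed errors should stay below the corresponding bounds; the value `e` from the standard library is itself only a floating point approximation, so the comparison is meaningful only while the error is well above machine precision.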
Theorem 13.21: The ratio test Let the series $\sum_{n=0}^{\infty} a_n$ be given. If there is a number
$0 < q < 1$ such that from an index $n_0$ on always holds $\left| \frac{a_{n+1}}{a_n} \right| \le q$, then $\sum_{n=0}^{\infty} a_n$ is
absolutely convergent. If from an index on always holds $\left| \frac{a_{n+1}}{a_n} \right| > 1$, then the series
is divergent.
In particular, this theorem implies that $\sum_{n=0}^{\infty} a_n$ is convergent if $\lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right|$ exists and
is less than 1. The ratio test is often applied in this form.
closer look at this series. Since we are allowed to move out the constant $|a_{n_0}|$ from the
sum and because $q < 1$ it follows:
\[
\sum_{n=n_0}^{\infty} |a_{n_0}| q^{n-n_0} = |a_{n_0}| \cdot \sum_{n=n_0}^{\infty} q^{n-n_0} = |a_{n_0}| \cdot \sum_{n=0}^{\infty} q^n = |a_{n_0}| \cdot \frac{1}{1-q}, \tag{13.2}
\]
I didn’t quite argue correctly in the formula (13.2): In the first step I used Theorem 13.18.
But this requires the convergence of the series. So we have to read the formula from right
to left: Since the series converges, we are allowed to move in the constant $|a_{n_0}|$ and then
the equation is correct. I will occasionally do similar calculations in the future. Always pay
attention to the fact that such operations with series are only allowed if it turns out in the
end that they are convergent and if we are then allowed to read the equations backwards.
Something else has to be considered about this test: Why is it not enough to demand
$|a_{n+1}|/|a_n| < 1$? It could then be that the quotient gets closer and closer to 1 and that we can
no longer bound it from above by an element q < 1. But then the whole argumentation with
the geometric series doesn’t work anymore. In fact, the assertion of the theorem can be
false in such a case.
We want to apply the last theorem immediately to the investigation of a very important
function, the exponential function:
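Theorem 13.22 For all $z \in \mathbb{C}$ the series $\sum_{n=0}^{\infty} \frac{z^n}{n!}$ is absolutely convergent.
Because it is
\[
\left| \frac{a_{n+1}}{a_n} \right| = \left| \frac{z^{n+1}/(n+1)!}{z^n/n!} \right| = \left| \frac{z}{n+1} \right| = \frac{|z|}{n+1}.
\]
Since $\lim_{n \to \infty} |z|/(n+1) = 0$, from the ratio test follows the absolute convergence of
the series.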
\[
\exp : \mathbb{C} \to \mathbb{C}, \quad z \mapsto \sum_{n=0}^{\infty} \frac{z^n}{n!}
\]
For the proof of a) we need the product rule for series, Theorem 13.19:
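\begin{align*}
\exp(z) \exp(w) &= \sum_{n=0}^{\infty} \frac{z^n}{n!} \cdot \sum_{n=0}^{\infty} \frac{w^n}{n!} = \sum_{n=0}^{\infty} \sum_{k=0}^{n} \frac{z^k}{k!} \frac{w^{n-k}}{(n-k)!} \\
&= \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{k=0}^{n} \frac{n!}{k!(n-k)!} z^k w^{n-k} \\
&= \sum_{n=0}^{\infty} \frac{1}{n!} (z + w)^n = \exp(z + w).
\end{align*}
The last step holds according to the binomial theorem.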
The part b) can be reduced to Theorem 13.8f), which I do not want to carry out.
The exponential function can be found on any calculator, and the calculator does nothing
but evaluate the first terms of this series. What about the speed of convergence? Similar
to the series $\sum_{n=0}^{\infty} \frac{1}{n!}$ in (13.1), which is nothing but exp(1), we can carry out an error
Theorem 13.25 For $\exp(z) = \sum_{k=0}^{n} \frac{z^k}{k!} + r_{n+1}(z)$ it holds for the residual error after
summation of the terms up to the index $n$:
\[
|r_{n+1}(z)| \le \frac{2|z|^{n+1}}{(n+1)!}, \quad \text{if } |z| \le 1 + \frac{n}{2}.
\]
The residual error here depends on $z$. The unpleasant thing about this result is, on
the one hand, that the error can become larger the greater the absolute value of $z$
is, and, on the other hand, that the estimate is only valid for small $|z|$ anyway. What do
we do now if we want to calculate exp(100)? To solve this problem, I have to put you off
until we get to Sect. 14.2. There we will have a closer look at the properties of the
exponential function.
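The bound of Theorem 13.25 can be tried out for a complex argument. This is a minimal sketch under stated assumptions: the helper names `exp_partial` and `exp_bound` are illustrative and not from the text, and the library function `cmath.exp` serves as the reference value.

```python
import cmath
from math import factorial


def exp_partial(z, n):
    """Sum the exponential series for z up to and including index n."""
    total = 0j
    term = 1 + 0j              # z^0 / 0!
    for k in range(n + 1):
        total += term
        term = term * z / (k + 1)   # next term z^(k+1) / (k+1)!
    return total


def exp_bound(z, n):
    """Bound 2|z|^(n+1)/(n+1)! of Theorem 13.25, valid for |z| <= 1 + n/2."""
    if abs(z) > 1 + n / 2:
        raise ValueError("bound only holds for |z| <= 1 + n/2")
    return 2 * abs(z) ** (n + 1) / factorial(n + 1)


z = 1 + 1j
for n in (2, 5, 10):
    error = abs(cmath.exp(z) - exp_partial(z, n))
    print(n, error, exp_bound(z, n))
```

The function deliberately refuses arguments outside the range of the theorem, which makes the restriction to small $|z|$ visible: for `z = 100` one would need at least `n = 198` before the bound applies at all.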
Natural numbers can be represented uniquely in different bases. We will now derive a
similar procedure for the real numbers. As a result, we obtain that every positive real
number can be written as a decimal fraction. The positive rational numbers are repre-
sented by finite or periodic decimal fractions. Such a decimal fraction is nothing else but
a convergent series. But as with the representation of natural numbers, other numbers
than 10 can also be used as base:
13.3 Representation of Real Numbers in Numeral Systems 313
Definition 13.26 Let $b > 1$ be a fixed natural number. $b$ is called the base and the
numbers $0, 1, \ldots, b-1$ are called the digits of the base $b$ system. Let $z \in \mathbb{Z}$ and for
all $n \ge z$ let $a_n \in \{0, 1, \ldots, b-1\}$. For $n < 0$ let $1/b^n := b^{-n}$. A series of the form
\[
\sum_{n=z}^{\infty} \frac{a_n}{b^n},
\]
This is the first time we encounter series that can start with negative indices. For z < 0,
the base b fraction has the form
\[
\sum_{n=z}^{\infty} \frac{a_n}{b^n} = a_z b^{-z} + a_{z+1} b^{-z-1} + \cdots + a_{-1} b^1 + a_0 b^0 + \frac{a_1}{b} + \frac{a_2}{b^2} + \frac{a_3}{b^3} + \cdots .
\]
Often, the base $b$ fraction is also written in the form $(a_z a_{z+1} \ldots a_0 . a_1 a_2 \ldots)_b$ for $z \le 0$
and $(0.\underbrace{0 \cdots 0}_{z-1 \text{ zeros}} a_z a_{z+1} \ldots)_b$ for $z > 0$. The part before the dot corresponds exactly to the
negative indices up to and including the index 0. This represents a natural number in the
form we learned in Theorem 3.6 in Sect. 3.2. The part of the number after the dot starts
with the index 1.
For b = 10, decimal fractions are usually written in one of these two forms, with the
index designation omitted: The first notation is the exponential notation, which is how a
computer usually outputs the numbers; the second notation is the familiar representation
with the decimal point after the integer part.
Examples
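1. $(1.1378\mathrm{E}2)_{10} = (113.78)_{10} = 1 \cdot 10^2 + 1 \cdot 10^1 + 3 \cdot 10^0 + \frac{7}{10} + \frac{8}{10^2} + \frac{0}{10^3} + \cdots = 113 + \frac{78}{100}$.
2. $(4.711\mathrm{E}{-3})_{10} = (0.004711)_{10} = \frac{4}{10^3} + \frac{7}{10^4} + \frac{1}{10^5} + \frac{1}{10^6}$.
3. $(4.625\mathrm{E}1)_7 = (46.25)_7 = 4 \cdot 7^1 + 6 \cdot 7^0 + \frac{2}{7} + \frac{5}{49} = \frac{1685}{49}$.
4. $(3.3333\ldots)_{10} = 3 + \frac{3}{10} + \frac{3}{10^2} + \frac{3}{10^3} + \cdots = \sum_{n=0}^{\infty} \frac{3}{10^n} = 3 \cdot \sum_{n=0}^{\infty} \frac{1}{10^n} = 3 \cdot \frac{1}{1 - \frac{1}{10}} = 3 \cdot \frac{10}{9} = \frac{10}{3}$.
5. $x = (0.12\overline{34})_{10}$ also represents a rational number: To determine this, we use the
following trick: $x \cdot 10\,000 - x \cdot 100 = 1234.\overline{34} - 12.\overline{34} = 1222$ and thus
$x \cdot 9900 = 1222$, that is $x = 1222/9900$. ◄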
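The trick from the last example can be written down for any base. The following is a small sketch and not the author's code: the function name `periodic_to_fraction` and the representation of digits as Python lists are assumptions made here, and the period is assumed to be non-empty.

```python
from fractions import Fraction


def periodic_to_fraction(pre, period, base=10):
    """Value of (0.pre period period ...) in the given base.

    pre and period are lists of digits; period must not be empty.
    With n = len(pre) and k = len(period), the trick reads
    base^(n+k) * x - base^n * x = m, so x = m / (base^(n+k) - base^n).
    """
    def to_int(digits):
        value = 0
        for d in digits:
            value = value * base + d
        return value

    n, k = len(pre), len(period)
    m = to_int(pre + period) - to_int(pre)
    return Fraction(m, base ** (n + k) - base ** n)


print(periodic_to_fraction([1, 2], [3, 4]))   # the fraction from Example 5, reduced
print(3 + periodic_to_fraction([], [3]))      # the number from Example 4
```

`Fraction` reduces automatically, so the first result is printed in lowest terms rather than as 1222/9900.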
These examples of base b fractions are convergent series. But does a base b fraction
always converge? The answer is given to us by the next theorem:
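Theorem 13.27 Every base $b$ fraction is convergent. For the sum from index 1, the
part after the dot of such a fraction, it holds $0 \le \sum_{n=1}^{\infty} \frac{a_n}{b^n} \le 1$.
We use the geometric series again and get:
\begin{align*}
\sum_{n=1}^{\infty} \frac{a_n}{b^n} &\le \sum_{n=1}^{\infty} \frac{b-1}{b^n} = (b-1) \sum_{n=1}^{\infty} \frac{1}{b^n} \\
&= (b-1) \left( \sum_{n=0}^{\infty} \frac{1}{b^n} - 1 \right) = (b-1) \left( \frac{1}{1 - \frac{1}{b}} - 1 \right) \\
&= (b-1) \frac{1}{b-1} = 1.
\end{align*}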
The sequence of partial sums of the part after the dot of the base $b$ fraction is monotonically
increasing and bounded from above by 1. Therefore it is convergent. The complete base $b$ fraction is
then of course also convergent, because only a natural number is added.
Now we can deduce that the real numbers, which we have only described so far by their
axioms, are exactly the base b fractions. As with the representation of the natural num-
bers, any base b greater than 1 is possible.
Theorem 13.28 The b-adic expansion Let b > 1 be a natural number. Then holds:
To b): We can construct the base b fraction converging to x by induction. I only want
to sketch the way here. To carry out the proof, you have to do the calculations on paper!
In the base case, we first find for the number $x$ a minimal index $N$ so that $0 \le x < b^{N+1}$
is true and then also the first coefficient of the representation: This is the maximum
$a_{-N} \in \{0, 1, \ldots, b-1\}$ with the property $a_{-N} b^N \le x$. If we have found the series
$s_n = \sum_{k=-N}^{n} \frac{a_k}{b^k}$ up to the index $n$ so that $s_n \le x < s_n + \frac{1}{b^n}$ is true, we look for the next
term of the series: There must be a coefficient $a_{n+1} \in \{0, 1, \ldots, b-1\}$ with
\[
s_n \le s_n + \frac{a_{n+1}}{b^{n+1}} \le x < s_n + \underbrace{\frac{a_{n+1}}{b^{n+1}} + \frac{1}{b^{n+1}}}_{\le \frac{1}{b^n}} \le s_n + \frac{1}{b^n}.
\]
Then $s_{n+1} := \sum_{k=-N}^{n} \frac{a_k}{b^k} + \frac{a_{n+1}}{b^{n+1}}$. In this way, we can successively nest the number $x$. The
series really converges to $x$.
Test the procedure with the number $\pi$ and the base 10: It is $0 \le \pi < 10^1$,
so $N = 0$ and 3 (with $3 \le \pi$) is the first coefficient. If we have found the series up to 3.141
(3.141 ≤ π < 3.141 + 0.001), we look for the next digit so that we stay just below π,
which is 5 here. Then 3.1415 ≤ π < 3.1415 + 0.0001.
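The nesting procedure can be sketched in a few lines. This is an illustration under assumptions, not the author's implementation: the name `greedy_digits` is made up here, the input is restricted to $0 \le x < b$ (so that $N = 0$), and because $\pi$ cannot be stored exactly, the rational number 355/113 stands in for it.

```python
from fractions import Fraction


def greedy_digits(x, base, count):
    """Construct a_0, a_1, ..., a_count for 0 <= x < base.

    In every step the largest digit is chosen that keeps the partial
    sum s_n at or below x, so that s_n <= x < s_n + base^(-n).
    Returns the list of digits and the final partial sum.
    """
    if not 0 <= x < base:
        raise ValueError("this sketch only handles 0 <= x < base")
    s = Fraction(0)
    digits = []
    for n in range(count + 1):
        place = Fraction(1, base ** n)
        a = max(d for d in range(base) if s + d * place <= x)
        digits.append(a)
        s += a * place
    return digits, s


x = Fraction(355, 113)          # rational stand-in for pi (an assumption)
digits, s = greedy_digits(x, 10, 4)
print(digits, s)
```

Exact fractions are used so that the comparison `s + d * place <= x` is never distorted by rounding; with floating point numbers the last digits could come out wrong.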
What is behind part c) of the theorem, we have just seen at the 5th example after
Definition 13.26. In general terms: If x is a periodic base b fraction, whose period begins
at the index $n$ and has the period length $k$, then $b^{n+k} \cdot x - b^n \cdot x = m$ is a natural number
and therefore $x = m/(b^{n+k} - b^n) \in \mathbb{Q}$. Try this with more examples!
There is still the statement d) of the theorem missing, and for this I give an algo-
rithm, the division algorithm for natural numbers p and q. If you look closely, then this
procedure is exactly the same you have learned in school. But very likely you have not
had the questions of the computer scientist in mind there: Why does the algorithm actu-
ally work? Does it have a termination condition that is always reached? This is what we
want to deal with now. It is enough if we divide natural numbers p and q with p/q < 1.
Because if p/q ≥ 1, then we can represent p/q as n + p′ /q, where n, p′ ∈ N and
p′ /q < 1. We already know the representation of n as base b number. So our task is to
find coefficients a1 , a2 , a3 , . . . with
\[
\frac{p}{q} = \frac{a_1}{b} + \frac{a_2}{b^2} + \frac{a_3}{b^3} + \cdots = \sum_{n=1}^{\infty} \frac{a_n}{b^n}. \tag{13.3}
\]
We carry out a continued division with remainder by q, as we know it from Theorem 4.11
in Sect. 4.2. We start with the division of b · p, multiply the remainder each time by b and
divide further:
\begin{align*}
b \cdot p &= a_1 q + r_1, & 0 &\le r_1 < q, \\
b \cdot r_1 &= a_2 q + r_2, & 0 &\le r_2 < q, \\
b \cdot r_2 &= a_3 q + r_3, & 0 &\le r_3 < q, \tag{13.4} \\
&\ \ \vdots \\
b \cdot r_n &= a_{n+1} q + r_{n+1}, & 0 &\le r_{n+1} < q.
\end{align*}
The quotients ai calculated in this way are all less than b, because from b ≤ a1 and p < q
would follow bp < bq ≤ a1 q in contradiction to bp = a1 q + r1 ≥ a1 q. Since for all
remainders holds ri < q, the same argumentation applies to all ai.
These ai are exactly the coefficients from (13.3) we are looking for. Why? If we solve
the lines of (13.4) successively for p, r1, r2 and so on and substitute, we get:
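\begin{align*}
p &= \frac{a_1}{b} \cdot q + \frac{r_1}{b} \\
r_1 &= \frac{a_2}{b} \cdot q + \frac{r_2}{b} \\
\Rightarrow\ p &= \frac{a_1}{b} \cdot q + \frac{a_2}{b^2} \cdot q + \frac{r_2}{b^2} = \left( \frac{a_1}{b} + \frac{a_2}{b^2} \right) \cdot q + \frac{r_2}{b^2} && \text{(insert } r_1 \text{)} \\
r_2 &= \frac{a_3}{b} \cdot q + \frac{r_3}{b} \\
\Rightarrow\ p &= \frac{a_1}{b} \cdot q + \frac{a_2}{b^2} \cdot q + \frac{a_3}{b^3} \cdot q + \frac{r_3}{b^3} = \left( \frac{a_1}{b} + \frac{a_2}{b^2} + \frac{a_3}{b^3} \right) \cdot q + \frac{r_3}{b^3} && \text{(insert } r_2 \text{)} \\
&\ \ \vdots \\
r_n &= \frac{a_{n+1}}{b} \cdot q + \frac{r_{n+1}}{b} \\
\Rightarrow\ p &= \left( \sum_{k=1}^{n+1} \frac{a_k}{b^k} \right) \cdot q + \frac{r_{n+1}}{b^{n+1}},
\end{align*}
in total, therefore,
\[
\frac{p}{q} = \sum_{k=1}^{n+1} \frac{a_k}{b^k} + \frac{r_{n+1}}{b^{n+1} q}.
\]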
Examples
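1. $b = 10$: $5 \cdot 10 = 8 \cdot 6 + 2$
   $2 \cdot 10 = 3 \cdot 6 + 2$
   $5/6 = (0.8\overline{3})_{10}$.
2. $b = 2$: $5 \cdot 2 = 1 \cdot 6 + 4$
   $4 \cdot 2 = 1 \cdot 6 + 2$
   $2 \cdot 2 = 0 \cdot 6 + 4$
   $5/6 = (0.1\overline{10})_2$.
3. $b = 12$: $5 \cdot 12 = 10 \cdot 6 + 0$
   $5/6 = (0.\mathrm{A})_{12}$, where A is the digit 10. ◄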
You can see that a number can be infinitely periodic in some number systems and finite
in others.
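The continued division (13.4) translates almost literally into a program. The sketch below is one possible reading, with the function name `base_b_digits` and the bookkeeping dictionary `seen` chosen for illustration. It also hints at why the procedure terminates: every remainder lies in $\{0, 1, \ldots, q-1\}$, so after at most $q$ steps a remainder is either 0 or has occurred before, and from there the digits repeat.

```python
def base_b_digits(p, q, b):
    """Digits a_1, a_2, ... of p/q in base b, assuming 0 <= p < q.

    Returns (digits, start): digits up to the end of the first period,
    and the position in digits where the period begins, or None if the
    expansion terminates.
    """
    digits = []
    seen = {}            # remainder -> position at which it first occurred
    r = p
    while r != 0 and r not in seen:
        seen[r] = len(digits)
        a, r = divmod(b * r, q)   # one line of (13.4): b*r = a*q + r_new
        digits.append(a)
    return digits, (seen[r] if r != 0 else None)


print(base_b_digits(5, 6, 10))
print(base_b_digits(5, 6, 2))
print(base_b_digits(5, 6, 12))
```

For the three examples above the function returns the digit lists `[8, 3]`, `[1, 1, 0]` and `[10]`; in the first two the period starts at position 1 of the list, in the third the expansion terminates.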
Fractions are also stored in the computer in exactly the same form as presented in Def-
inition 13.26: Such a number is determined by the sign, the exponent and the mantissa.
The mantissa is the sequence of an, which of course is always finite in the computer. The
whole thing happens, as you can imagine, with the base 2. This is the floating point repre-
sentation of the number. A float in Java, for example, includes 32 bits: 1 bit for the sign,
8 bits for the exponent and 23 bits for the mantissa. The mantissa is actually 24 bits long:
The first bit is always 1 and therefore does not have to be stored. For the exponent, the
numbers from − 126 to + 127 are possible. This is a total of 254 values, so there is still
space for two more values in the 8 bits of the exponent: This also allows you to encode 0
and number overflows. So a typical float looks like $-1.\underbrace{0010 \cdots 101}_{23 \text{ bit}} \cdot 2^{98}$ in the computer.
When outputting to the screen, the number is then converted into a decimal number and
output approximately as −3.56527E+29. Here E+29 is to be read as “$\cdot\, 10^{29}$”.
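The three parts of such a number can be made visible with a few bit operations. This sketch is written in Python rather than Java and relies on assumptions not stated in the text: the 32-bit layout is taken to be the IEEE 754 single precision format, in which the exponent is stored with an offset of 127, and the function name `float32_fields` is made up for this example. It only interprets normalized numbers correctly.

```python
import struct


def float32_fields(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, mantissa).

    sign is the sign bit, exponent the stored 8-bit value minus the
    assumed offset 127, mantissa the 23 stored bits as an integer.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


# -1.001 (binary) * 2^98: only the third stored mantissa bit is set.
sign, exponent, mantissa = float32_fields(-1.125 * 2.0 ** 98)
print(sign, exponent, format(mantissa, "023b"))
```

Calling the function with `0.1` shows a mantissa pattern that is cut off after 23 bits, which is one way to see that this decimal number has no finite binary representation.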
Rounding errors can therefore already occur when inputting and outputting, for exam-
ple when a decimal number entered cannot be converted into a finite binary fraction. Fur-
ther errors can occur through mathematical operations. In Sect. 18.1 we will deal with
problems in numerical calculations in more detail.
Comprehension Questions
Exercises
1. Give an example for sequences $(a_n)_{n \in \mathbb{N}}$ and $(b_n)_{n \in \mathbb{N}}$ with $a_n < b_n$ for all $n \in \mathbb{N}$ and
   $\lim_{n \to \infty} a_n = \lim_{n \to \infty} b_n$.
2. Find the limits of the sequences (if they exist):
   a) $\dfrac{3n^2 + 2n + 1}{n^3 - 3n^2 - 1}$
   b) $\dfrac{n^2 + 5}{n^2 - 5}$
   c) $\left( 1 + \dfrac{3}{n!} \right)^7$
3. Prove or disprove the assertions:
   a) If $a_n$ and $b_n$ diverge, then $a_n + b_n$ diverges and $a_n \cdot b_n$ diverges.
   b) If $a_n$ is bounded and $b_n$ is a null sequence, then $a_n \cdot b_n$ is a null sequence.
4. You probably know the story of Achilles and the tortoise: In a race, the tortoise
   gets a head start on Achilles. Achilles can never catch up to the tortoise, because
   he always has to reach the point where the tortoise was just a moment ago. But by
   then, the tortoise has already moved on. Can you help Achilles?
5. Check the following series for convergence:
   a) $\sum_{n=0}^{\infty} \dfrac{n^2}{2^n}$
   b) $\sum_{n=1}^{\infty} \dfrac{n!}{1 \cdot 3 \cdot 5 \cdots (2n-1)}$
   c) $\sum_{n=0}^{\infty} \dfrac{2^{n+1}}{n!}$
6. Determine the order of the algorithm
   \[
   x^n = \begin{cases} 1 & \text{if } n = 0 \\ x^{n/2} \cdot x^{n/2} & \text{if } n \text{ even} \\ x^{n-1} \cdot x & \text{else} \end{cases}
   \]
Abstract
The study of continuous functions is at the center of analysis. If you have worked
through this chapter
• you know what a continuous real or complex function is and can check if a given
function is continuous,
• you have learned about functions of several variables and the concept of continuity
for such functions,
• you master important properties of continuous functions,
• you know many elementary continuous functions: power function and root, expo-
nential function, the logarithm and trigonometric functions,
• and have learned a whole new definition of the number π.
In this chapter we deal with mappings between subsets of real or complex numbers: It is
K = R or C, D ⊂ K and f : D → K . Depending on whether the underlying field is R or
C, we speak of real or complex functions. In a short section we will also deal with func-
tions of several variables, that is, with mappings whose domain is a subset of the Rn.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_14
322 14 Continuous Functions
14.1 Continuity
In a first approach, continuous real functions without gaps in the domain are those func-
tions whose graph can be drawn with a pencil without taking it off. The essential prop-
erty of continuous functions is that the graph must not make any jumps. In Fig. 14.1
you can see the images of a continuous and discontinuous function in this sense. This
description is not the whole truth, only the good continuous functions behave like this.
Most of the time, however, we will have to deal with such. How can this property be for-
mulated mathematically precisely?
The concept of convergence of sequences helps us to describe jumps: Let’s look at the
point of discontinuity a in Fig. 14.1 and argue first intuitively: If we approach the point a
on the x-axis with a sequence whose elements sometimes lie to the left and
sometimes to the right of a, then the function values of the sequence elements diverge.
The sequence of function values is not convergent. This property is characteristic for
points of discontinuity. At the point b we have a similar situation: If a sequence con-
verges to b and there are infinitely many sequence elements different from b, then the
sequence of function values converges, but not to the function value f (b).
First, I would like to formulate what the limit of a function at a point is. The defini-
tion includes not only limits for points in the domain of the function, but also for all
boundary points of the set (see Definition 12.13). This formulation will allow us later to
extend the domain of a function sensibly. If we add the boundary points to the domain D
of the function, we just get the set of contact points of D. For these points we explain the
concept of the limit of a function.
then one says: $f$ has in $x_0$ the limit $y_0$ and writes for this
\[
\lim_{x \to x_0} f(x) = y_0 .
\]
If $K = \mathbb{R}$ and the domain is not bounded from above (from below), then we say
$\lim_{x \to \infty} f(x) = y_0$ ($\lim_{x \to -\infty} f(x) = y_0$), if for all sequences $(x_n)_{n \in \mathbb{N}}$ with $\lim_{n \to \infty} x_n = \infty$
($-\infty$) it holds $\lim_{n \to \infty} f(x_n) = y_0$. For $y_0$, the value $\pm\infty$ is also allowed in this
definition.
If x0 is a point of the domain D, then there exists the function value f (x0 ). If the limit
limx→x0 f (x) exists, then it is equal to f (x0 ). Because the condition in Definition 14.1
must hold for all sequences, in particular for example for the constant sequence, which
for all n has the value xn = x0. Then, of course, limn→∞ f (xn ) = f (x0 ) for this sequence
and thus the limit is fixed. You can see in Fig. 14.1 that the limit exists neither at the point a
nor at the point b. Here are a few
Examples
From the examples you can see that it is often easier to disprove the existence of the
limit than to prove it: To disprove it, you only have to find two sequences whose function
values have different limits, or even one sequence whose function values do not
converge. In Example 2, this would have been the sequence $1 + (-1)^n/n$. If you want to
prove the existence of a limit, you have to check the convergence of every conceivable
sequence. Fortunately, the computational rules for convergent sequences from Theorem
13.8 make our work easier: They can be transferred to limits of functions almost literally:
Here, Re f and Im f are the functions from D to R that map the element z with
f (z) = u + iv to the real part u respectively to the imaginary part v.
With the help of this concept of limit, we can now formulate what we understand by a
continuous function:
Examples
Theorems 14.2 and 14.3 can now be directly transferred to continuous functions:
a) f ± g is continuous in x0.
b) f · g is continuous in x0.
c) For $\lambda \in K$, $\lambda \cdot f$ is continuous in $x_0$.
d) If $g(x_0) \ne 0$, then $f/g$ is continuous in $x_0$.
e) |f | is continuous in x0.
f) If K = C, then f is continuous in z if and only if Re f and Im f are con-
tinuous in z.
With these theorems, we can reduce the continuity of composite functions to the indi-
vidual parts. Thus, for example, all polynomials f ∈ R[X] or f ∈ C[X] are continuous on
the entire domain. Let us look at the polynomial $x \mapsto ax^2 + b$, for example:
First, the mapping $g : x \mapsto x$ is continuous. Then $g \cdot g : x \mapsto x \cdot x$ is continuous and so
are $x \mapsto ax^2$, the constant mapping $x \mapsto b$, and finally the sum $x \mapsto ax^2 + b$. For general
polynomials, induction is once again required.
I would now like to go into functions f : U → R, where U is a subset of Rn, that is, real-
valued functions of several variables. Examples of this are
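\[
f : \mathbb{R}^2 \to \mathbb{R}, \quad (x, y) \mapsto x^2 + y^2,
\]
\[
g : \mathbb{R}^3 \setminus \{(x, y, z) \mid x = \pm y\} \to \mathbb{R}, \quad (x, y, z) \mapsto \frac{z}{x^2 - y^2}.
\]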
These mappings represent an important special case of the functions $f : U \to \mathbb{R}^m$,
$U \subset \mathbb{R}^n$. Each such “vector-valued” function has the form
\[
f : \mathbb{R}^n \supset U \to \mathbb{R}^m, \quad (x_1, \ldots, x_n) \mapsto (f_1(x_1, \ldots, x_n), f_2(x_1, \ldots, x_n), \ldots, f_m(x_1, \ldots, x_n)),
\]
that is, it consists of an $m$-tuple of real-valued functions of several variables. We will
restrict ourselves to examining these component functions $f : \mathbb{R}^n \supset U \to \mathbb{R}$.
You have already learned that mathematicians like to rely on theories they already
know when they encounter new problems, and here too we will try to use as much of our
knowledge about functions of one variable as possible. A very important concept in this
context is the notion of the partial function. If you keep all the variables fixed except for
one, you get a function of one variable again:
A function with $n$ variables not only has $n$ partial functions, but infinitely many: The
variables other than the $i$-th component are all kept fixed, but any value is
allowed for them! If, for example, in the function $x^2 + y^2$ the second variable is kept fixed,
one obtains the partial functions $x^2$, $x^2 + 1$, $x^2 + 2$, $x^2 + 3$ and so on.
Can functions of several variables be drawn?
[Figure: graph of a function $f(x_1, x_2)$ drawn over the $(x_1, x_2)$-plane.]
Many questions concerning functions of several variables can be reduced to the investi-
gation of partial functions. How is it with continuity? First of all it holds:
and f (0, y) to be investigated. But these are everywhere equal to 0, thus also continuous.
The function f itself is not continuous at the point (0, 0). See Fig. 14.5 for this. My
drawing program naturally has difficulty drawing the function near the point of discon-
tinuity, but you can see: If you walk along the ridge towards the point (0, 0), you always
stay at the same height. But if you approach this point on the x-axis, you always stay at
height 0. So we can find sequences that converge to (0, 0), but whose function values
have different limits. I give two such sequences: The ridge is a parabola, the sequence
$x_k = (1/k^2, 1/k)$ runs on it. It is for all $k$
In Fig. 14.6 you can see the graphs of an even, an odd, a monotonic, a strictly monotonic
and a periodic function. You can see that an even function is symmetrical to the y-axis,
an odd function is point-symmetrical to the origin.
The terms monotonically increasing and decreasing can also be formulated for sub-
sets of R, as well as the terms even and odd, provided that with x always −x lies in the
domain. For complex functions, the terms do not make sense, because in the complex
numbers there is no order. Note that a strictly monotonic function is always injective, because
from $x \ne y$ it follows of course in this case always $f(x) \ne f(y)$.
Examples of functions
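\[
x^n \cdot x^m = x^{n+m},
\]
because $x^n x^m = \underbrace{(x \cdots x)}_{n\text{-times}} \cdot \underbrace{(x \cdots x)}_{m\text{-times}} = \underbrace{x \cdot x \cdots x}_{n+m\text{-times}}$.
\[
(x^n)^m = x^{n \cdot m} = (x^m)^n,
\]
because $(x^n)^m = \underbrace{\underbrace{(x \cdots x)}_{n\text{-times}} \cdot \underbrace{(x \cdots x)}_{n\text{-times}} \cdots \underbrace{(x \cdots x)}_{n\text{-times}}}_{m\text{-times}} = \underbrace{x \cdot x \cdots x}_{m \cdot n\text{-times}}$.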
The convention $x^0 := 1$ seemed to fall from the sky at first. Now it becomes clear that
this is the only sensible possibility: If we want the above rules to also apply to
$n = 0$, then $x^0 \cdot x^1 = x^{0+1} = x^1$ must be. This is only true for $x^0 = 1$.
All of these functions are polynomial functions and therefore continuous. With the
power function, we now look at the restriction to $\mathbb{R}_0^+$, the positive real numbers
including 0: $f : \mathbb{R}_0^+ \to \mathbb{R}_0^+, x \mapsto x^n$. This function is strictly monotonically
increasing for $n > 0$, because from $0 \le x < y$ it follows from the order laws of the real
numbers $0 \le x^n < y^n$. This makes the function injective. Is it also surjective? Since
$\mathbb{R}$ is complete, it is likely that every positive real number has an $n$th root. We can
only prove this at the end of this chapter. But I want to use it already now: The
function $f : \mathbb{R}_0^+ \to \mathbb{R}_0^+, x \mapsto x^n$ is bijective and therefore has an inverse function,
$f : \mathbb{R} \to \mathbb{R}_0^+, x \mapsto x^2$. ◄
Theorem 14.14 For all $x, y \in \mathbb{R}^+$ and for all rational numbers $p, q \in \mathbb{Q}$ it holds:
a) $(xy)^p = x^p \cdot y^p$,
b) $x^p \cdot x^q = x^{p+q}$,
c) $(x^p)^q = x^{p \cdot q}$.
Using the bijectivity of the root, these rules can be reduced to the corresponding rules of
the power function. Let us calculate rule b) first, assuming positive exponents: It is
\begin{align*}
(x^{n/m} \cdot x^{p/q})^{qm} &= (x^{n/m})^{qm} \cdot (x^{p/q})^{qm} = (\sqrt[m]{x^n})^{qm} \cdot (\sqrt[q]{x^p})^{qm} \\
&= ((\sqrt[m]{x^n})^m)^q \cdot ((\sqrt[q]{x^p})^q)^m = (x^n)^q \cdot (x^p)^m = x^{nq+pm}.
\end{align*}
We will use this to show that the exponential function “exp” is the same as the exponen-
tial function to the base e:
14.2 Elementary Functions 333
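If $n, m \in \mathbb{N}$, then
\[
\exp\left( \frac{n}{m} \right)^m = \exp\Big( \underbrace{\frac{n}{m} + \frac{n}{m} + \cdots + \frac{n}{m}}_{m\text{-times}} \Big) = \exp(n) = e^n,
\]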
This extends the exponential function $f_e$ from Example 5 from $\mathbb{Q}$ to $\mathbb{C}$ and eliminates the
confusion. Now we can calculate $e^{\pi}$, but still not $3^{\sqrt{2}}$. Other bases than $e$ are currently
forbidden for non-rational exponents.
As an exercise, I leave it to you to check the following assertion, it is similar to the
proof of Theorem 14.15:
for $q \in \mathbb{Q}$ and $x \in \mathbb{R}$ is $e^{xq} = (e^x)^q$. (14.3)
Theorem 14.17 The exponential function exp : C → C, z → exp(z) is
continuous.
In Fig. 14.8 you can see the graph of the real exponential function $e^x$: The function
increases very slowly up to the origin and then grows almost explosively. Remember?
Algorithms with exponential growth are considered intractable.
To prove continuity: We first show that exp is continuous at 0. Let $z_n$ be a null
sequence in $\mathbb{C}$, we have to show that then $\lim_{n \to \infty} \exp(z_n) = \exp(0) = 1$. We use
Theorem 13.25, in which we estimated the residual error after summation up to the
index $m$:
\[
|r_{m+1}(z)| \le \frac{2|z|^{m+1}}{(m+1)!}, \quad \text{if } |z| \le 1 + \frac{m}{2}. \tag{14.4}
\]
Here it is enough for us that $\exp(z_n) = 1 + r_1(z_n)$. The sequence $|z_n|$ will be less than 1 at
some point, and then
\[
|\exp(z_n) - 1| = |r_1(z_n)| \le \frac{2|z_n|}{1!} = 2|z_n|.
\]
$2 \cdot |z_n|$ converges to 0 and so $\lim_{n \to \infty} \exp(z_n) = 1$. If now $z \ne 0$, any sequence $z_n$ that
converges to $z$ can be written in the form $z_n = z + h_n$, where $h_n$ is a null sequence. Because of
$\exp(z + h_n) = \exp(z) \exp(h_n)$ it follows
\[
\lim_{n \to \infty} \exp(z_n) = \lim_{n \to \infty} \exp(z) \cdot \exp(h_n) = \exp(z) \cdot \lim_{n \to \infty} \exp(h_n) = \exp(z) \cdot 1.
\]
How is $e^x$ calculated for real exponents on the computer? We first assume $x > 0$. In
Theorem 13.25 we have seen that the exponential series $\sum_{n=0}^{\infty} \frac{x^n}{n!}$ converges very quickly for
small $x$, for example for $x < 1$. The error can be estimated according to formula (14.4).
If now $x$ is large, we look for the largest natural number $n \le x$. Then $x = n + h$ where
$h < 1$ and we can calculate:
\[
e^x = e^{n+h} = e^n \cdot e^h .
\]
To calculate $e^n$ we already know a fast recursive algorithm (see (3.4) in Sect. 3.3), and
we can determine $e^h$ with the help of the quickly converging series. For negative
exponents $x$ we simply calculate $1/e^{-x}$.
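The procedure just described might look as follows. This is a sketch under assumptions rather than the way any particular library computes the exponential function: the names `exp_series`, `power` and `exp_real` are invented here, the number of series terms is fixed at 20, and the recursive power function is assumed to correspond to the fast algorithm referred to in (3.4).

```python
import math


def exp_series(h, terms=20):
    """Sum the first `terms` terms of the exponential series for small h."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= h / (k + 1)
    return total


def power(base, n):
    """Recursive fast power for a natural number n (squaring)."""
    if n == 0:
        return 1.0
    half = power(base, n // 2)
    return half * half if n % 2 == 0 else half * half * base


E = exp_series(1.0)   # e itself, taken from the series


def exp_real(x):
    """e^x via the split x = n + h with natural n and 0 <= h < 1."""
    if x < 0:
        return 1.0 / exp_real(-x)
    n = math.floor(x)
    h = x - n
    return power(E, n) * exp_series(h)


for x in (0.3, 7.25, 100.0, -2.5):
    print(x, exp_real(x), math.exp(x))
```

Since $h < 1$, formula (14.4) bounds the series error after 20 terms by $2/20!$, far below double precision; the remaining deviation from `math.exp` comes from rounding in the repeated multiplications.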
How the exponential function behaves for complex exponents, we will see from the
next theorems, which reveal a surprising connection to the trigonometric functions sine
and cosine:
a) $e^z = e^x \cdot e^{iy}$.
b) $e^x > 0$.
c) $|e^{iy}| = 1$.
d) $|e^z| = e^x$.
The most important conclusion we can draw at the moment is from property c): If $y$ is a
real number, then $e^{iy}$ always has absolute value 1, that is, it lies on the unit circle.
Flip back briefly to the introduction of sine and cosine in Theorem 7.2: The coordi-
nates of a point (a, b) on the unit circle are given by sine and cosine of the angle between
(a, b) and the x-axis. On the other hand, these coordinates are just the real part and the
imaginary part of the complex number z = a + ib. See Figure 14.9 for this.
This establishes a relation between the exponent $y$ and the angle $\alpha$. Each exponent
has an associated angle. You probably already know that mathematicians are so
distinguished that they do not measure angles like ordinary people from 0° to 360°, but in
so-called radians. Behind it is exactly the discovery we have just made: we simply denote
the angle that belongs to the point $e^{iy}$ with $y$ (Fig. 14.10).
We still do not know that there is really such a $y$ for each angle. The angle 0° belongs
to $y = 0$, because $e^{i0} = 1 + i \cdot 0 = (1, 0)$. We will see that with increasing angle, the
value of $y$ also increases continuously. With the help of integral calculus (in the
exercises of Chap. 16), we will soon be able to calculate that the length of the circular arc
from $(1, 0)$ to the point $(\cos y, \sin y) = e^{iy}$ is exactly $y$. The conversion between degrees
and radians is just as easy as between euros and dollars (but much more stable). For all
calculations with the functions cosine and sine, radians have proven to be much
simpler and more practical than degrees. It could be called a natural angle measurement. So
don’t be afraid of it!
Definition and Theorem 14.19: The functions cosine and sine Let y ∈ R. Then
The number $y$ is called the angle between the $x$-axis and the direction of $e^{iy}$. The
functions cosine and sine are continuous functions with domain R.
Theorem 14.20 For all y ∈ R it holds that (cos y)2 + (sin y)2 = 1.
Proof: According to Theorem 14.18c), $|e^{iy}| = 1$ and thus $|e^{iy}|^2 = (\cos y)^2 + (\sin y)^2 = 1$.
You could object that this is nothing really new, but we have not yet proven it.
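The relation between $e^{iy}$, cosine and sine can be observed numerically. The following lines are only an illustration: the function name `unit_circle_point` is made up, the library functions `cmath.exp`, `math.cos` and `math.sin` are taken as given, and `math.radians` performs the conversion from degrees.

```python
import cmath
import math


def unit_circle_point(y):
    """Return real and imaginary part of e^(iy) for a real number y."""
    w = cmath.exp(1j * y)
    return w.real, w.imag


for degrees in (0, 30, 90, 180, 270):
    y = math.radians(degrees)          # conversion degrees -> radians
    c, s = unit_circle_point(y)
    # c and s should agree with cos y and sin y, and c^2 + s^2 should be 1
    print(degrees, c, s, c * c + s * s)
```

Up to rounding in the last digits, the real and imaginary parts coincide with `math.cos(y)` and `math.sin(y)`, and the sum of the squares is 1 for every angle.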
In the future, I will write $\cos^2 x$ and $\sin^2 x$ for $(\cos x)^2$ and $(\sin x)^2$, respectively.
Proof: Compare the real and imaginary parts in the following calculation:
Do you remember the rotations in R2? They were described by rotation matrices. The com-
position of two rotations means the multiplication of the two rotation matrices. If we first
rotate by x and then by y, we get the rotation by x + y:
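\begin{align*}
\begin{pmatrix} \cos(x+y) & -\sin(x+y) \\ \sin(x+y) & \cos(x+y) \end{pmatrix}
&= \begin{pmatrix} \cos y & -\sin y \\ \sin y & \cos y \end{pmatrix} \begin{pmatrix} \cos x & -\sin x \\ \sin x & \cos x \end{pmatrix} \\
&= \begin{pmatrix} \cos y \cos x - \sin y \sin x & -\cos y \sin x - \sin y \cos x \\ \sin y \cos x + \cos y \sin x & -\sin y \sin x + \cos y \cos x \end{pmatrix}.
\end{align*}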
Compare the entries with Theorem 14.21: Isn’t that a beautiful piece of mathematical
magic? Two completely different theories produce the same result.
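The matrix identity can also be tested with concrete angles. This is a minimal sketch with self-written helpers (`rotation` and `matmul` are names chosen here, matrices are plain nested lists), not a reference to any matrix library.

```python
import math


def rotation(t):
    """2x2 rotation matrix for the angle t (in radians)."""
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t), math.cos(t)]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


x, y = 0.3, 1.1
left = rotation(x + y)                       # rotation by x + y
right = matmul(rotation(y), rotation(x))     # first rotate by x, then by y
print(left)
print(right)
```

Both matrices should agree up to rounding errors, entry by entry.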
The question of zeros of functions occurs again and again in all technical and economic
applications of mathematics, and so the efficient calculation of the zeros of continuous
functions is one of the standard problems of numerical analysis. The following important
theorem provides us with a first method for calculating zeros of continuous functions.
With the idea that continuous functions can be drawn without detaching, this theorem is
clear: If f is smaller than 0 at a, then the graph must cross the x-axis at some point on the
way to b, whether it wants to or not. Of course, we must not rely on the idea. We carry
out a proof that simultaneously provides us with a constructive calculation method, the
interval nesting, which is also called bisection. We assume that f (a) < 0 and f (b) > 0. If
f (a) > 0 and f (b) < 0 applies, we only have to exchange the inequality signs in point a).
We inductively construct a sequence [an , bn ] of intervals with the properties
a) f (an ) ≤ 0, f (bn ) ≥ 0,
b) a ≤ an−1 ≤ an < bn ≤ bn−1 ≤ b, that is [an , bn ] ⊂ [an−1 , bn−1 ] for n > 0,
c) $b_n - a_n = \frac{1}{2^n} (b - a)$.
[Figure: the nested intervals $[a_0, b_0] = [a, b]$, $[a_1, b_1]$, $[a_2, b_2]$, $[a_3, b_3]$.]
14.3 Properties of Continuous Functions 339
The interval nesting can be implemented to numerically find zeros of functions: If you
have found two points a, b whose function values have different signs, you can localize a
zero with an error of $|b - a|/2^n$ in $n$ steps. This method always works. There are indeed
faster methods that we will also get to know, but these can sometimes go wrong.
Caution: Even if a “zero” x0 has been found that only deviates by a value ε from the
real zero x, one cannot make any statement about how far the function value f (x0 ) devi-
ates from 0: If the function is very steep at this point, f (x0 ) can be far from 0. So you
need to know more about the function to assess the quality of the found zero.
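A possible implementation of the interval nesting is sketched below. The function name `bisect` and the way the sign is handled are choices made for this example; the code keeps the invariant from the proof, namely that the function values at the two interval ends never have the same strict sign.

```python
def bisect(f, a, b, steps):
    """Interval nesting for a continuous f with a sign change on [a, b].

    Returns the interval [lo, hi] after the given number of halvings;
    it contains a zero of f and has length (b - a) / 2**steps.
    """
    # Work with sign * f so that sign*f(lo) <= 0 <= sign*f(hi) always holds.
    sign = 1.0 if f(a) <= 0 else -1.0
    if sign * f(b) < 0:
        raise ValueError("f(a) and f(b) must have different signs")
    lo, hi = a, b
    for _ in range(steps):
        mid = (lo + hi) / 2
        if sign * f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo, hi


lo, hi = bisect(lambda x: x * x - 2, 1.0, 2.0, 40)
print(lo, hi, hi - lo)
```

In this usage example the zero is $\sqrt{2}$. The caution above applies here as well: the width of the interval says how close `lo` is to the zero, not how close `f(lo)` is to 0.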
Bolzano’s theorem has a whole range of interesting consequences. When calculating
eigenvalues of matrices in Chap. 9 we have already used the following assertion:
Theorem 14.24 Every real polynomial of odd degree has at least one zero.
The sign of the polynomial $a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ is determined for large
values of $|x|$ solely by the first term, since this grows the strongest. If $n$ is odd, then for $x < 0$
it is also $x^n < 0$ and $x > 0$ implies $x^n > 0$. So you will always find two values for $x$, so
that the function values have different signs and thus Bolzano’s theorem is applicable.
Expressed a little less cryptically, this means: Every number that lies between two func-
tion values m and M of a continuous function has a preimage. The image has no gaps
between m and M (Fig. 14.12).
[Fig. 14.12: a value $d$ between the function values $m$ and $M$; no gaps in the image.]
Even though it looks like more: the intermediate value theorem is only a reformula-
tion of Bolzano’s theorem: Apply Bolzano’s theorem to the function g(x) := f (x) − d.
Then a zero of g is a “d” point of f .
You can even describe the image of such a real function more precisely: The image is
a closed interval, the function values have a minimum and a maximum. The proof of this
is much more elaborate, I do not want to carry it out (Fig. 14.13).
Theorem 14.26 Let f : [a, b] → R be a continuous function. Then the image set
f ([a, b]) of f is bounded and has a minimum and a maximum, that is, there are
x, y ∈ [a, b] so that f (x) = min f ([a, b]) and f (y) = max f ([a, b]).
I would like to add one last important result to the theorems about continuous functions,
also without proof:
The interval I in this theorem can be open, closed, or half-open, and ∞ is also allowed as
an interval boundary.
In all the theorems 14.23 to 14.27, the requirements placed on the domain are very
essential. Let’s take the function f : R \ {0} → R, x → 1/x as an example (Figure
14.14): It is f (−1) = −1, f (1) = 1, but it has no zero and between f (−1) and f (1), not
every value is taken either. If we look at the restriction f : ]0, 1[ → R, x → 1/x, we see
that the image has no maximum and no minimum: f (]0, 1[) = ]1, ∞[.
We can now further investigate the elementary functions that we learned in Sect. 14.2.
In Example 4 after Definition 14.12 I introduced the root. There I have already used
that the power $f : \mathbb{R}_0^+ \to \mathbb{R}_0^+,\ x \mapsto x^n$ is surjective. Now we check that with the help of
the intermediate value theorem.
Since the intermediate value theorem is based on a closed interval, we first tinker with
the function f : We are looking for a $b \in \mathbb{R}_0^+$ so that $f(b) = b^n > y$. There is always one:
$f : \mathbb{R}_0^+ \to \mathbb{R}_0^+,\ x \mapsto x^n$ is therefore a bijective function.
According to Theorem 14.27, the inverse function to the power function is also con-
tinuous:
Theorem 14.28 The nth root $\sqrt[n]{\phantom{x}} : \mathbb{R}_0^+ \to \mathbb{R}_0^+,\ x \mapsto \sqrt[n]{x}$ is bijective and continuous.
In the exercises for this chapter you can compute that the real exponential function
$$\exp : \mathbb{R} \to \mathbb{R}^+, \quad x \mapsto \sum_{n=0}^{\infty} \frac{x^n}{n!}$$
is an injective function (this does not apply to the complex exponential function!). The
same argument as for the power function shows us that exp is also surjective. This means
that there must be a continuous inverse function (Fig. 14.15).
Theorem and Definition 14.29: Natural logarithm The inverse function of the
exponential function exp is called the natural logarithm and is denoted by $\log_e$ or
ln:
$$\ln : \mathbb{R}^+ \to \mathbb{R}, \quad x \mapsto \ln x.$$
a) $\ln(a \cdot b) = \ln a + \ln b$
b) $\ln(a^p) = p \cdot \ln a$ for all $p \in \mathbb{Q}$.
The rules a) and b) can be traced back to the corresponding rules of the exponential
function: If $a = e^x$, $b = e^y$, then we get:
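For rule a), the calculation can be written out as follows. This is a sketch that uses only the functional equation $e^{x+y} = e^x e^y$ and the fact that ln is the inverse function of exp:

$$\ln(a \cdot b) = \ln(e^x \cdot e^y) = \ln(e^{x+y}) = x + y = \ln a + \ln b.$$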
The rule a) is based on the slide rule principle: The multiplication of two numbers can be
traced back to the addition of their logarithms. If you mark on a ruler not the numbers on a
linear scale, but the logarithms of the numbers, then by laying the distances that correspond
to ln a and ln b end to end you get the distance ln(a · b), so you can read off the product of a and b.
Remember that we could not yet say what $a^x$ is for $a \in \mathbb{R}^+$ and non-rational exponents x?
The equation (14.5) now gives us a way to extend the exponentiation sensibly from Q
to R:
Theorem and Definition 14.30: The General Exponential Function Let $a \in \mathbb{R}^+$.
Then the function
$$f_a : \mathbb{R} \to \mathbb{R}^+, \quad x \mapsto e^{x \cdot \ln a} =: a^x$$
is called the exponential function to the base a. It is bijective and continuous and it
holds:
$$a^{x+y} = a^x a^y. \qquad (14.6)$$
$$g : \mathbb{R} \to \mathbb{R},\ x \mapsto x \cdot \ln a, \qquad h : \mathbb{R} \to \mathbb{R}^+,\ x \mapsto e^x,$$
and thus itself also bijective and continuous. On the rational numbers, $f_a$ coincides with
our previous definition $a^{m/n} = \sqrt[n]{a^m}$ according to (14.5). The equation (14.6) follows from
the corresponding rule for the exponential function with base e:
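Spelled out, the rule for base e gives $a^{x+y} = e^{(x+y)\ln a} = e^{x \ln a} \cdot e^{y \ln a} = a^x a^y$. The definition can also be tried out numerically. The following is a minimal sketch; the helper name `power` and the sample values are my own choices, not part of the text.

```python
import math

def power(a, x):
    """a**x for a > 0, computed as exp(x * ln a) as in Theorem 14.30."""
    if a <= 0:
        raise ValueError("the base must be positive")
    return math.exp(x * math.log(a))

a, x, y = 2.0, math.sqrt(2), math.pi   # irrational exponents on purpose
lhs = power(a, x + y)                  # a^(x+y)
rhs = power(a, x) * power(a, y)        # a^x * a^y
print(lhs, rhs)

# On rational exponents the definition agrees with roots: 8^(2/3) = 4.
print(power(8.0, 2 / 3))
```

Up to floating-point rounding, both sides of (14.6) agree.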
Just as in Theorem 14.15 one can first calculate that for all rational numbers q ∈ Q it
holds: $f(q) = a^q$. Now let r be any real number. Then there is a sequence of rational
numbers $(q_n)_{n \in \mathbb{N}}$ that converges to r. For example, one can use the sequence of base b
fractions that converges to r (see Theorem 13.28b). Since both f (x) and $a^x$ are continuous
functions, then it follows:
In Definition 14.19 we defined the continuous functions cosine and sine as the real part
or imaginary part of the function $e^{ix}$:
$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \frac{x^8}{8!} - \cdots, \qquad \sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots.$$
By inserting we get cos 0 = 1. From the series expansion we can see quickly that
cos 2 < 0 (with the help of an error estimate we can also prove this). Bolzano’s theorem
now says that there is at least one zero between 0 and 2. With a little more effort, one can
see from the series expansion that the cosine is strictly monotonically decreasing in the
interval between 0 and 2. The zero between 0 and 2 is therefore unique and can be calculated
arbitrarily accurately, for example by means of a bisection. The first decimal digits
of the zero are 1.570796327.
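A small sketch of this calculation: the cosine is evaluated only through a partial sum of its series, and the zero in [0, 2] is enclosed by bisection. The number of summands (20) and the number of bisection steps (40) are my own assumptions, chosen generously for double precision.

```python
def cos_series(x, terms=20):
    """Partial sum of 1 - x^2/2! + x^4/4! - ... with `terms` summands."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # next summand: multiply by -x^2 / ((2k+1)(2k+2))
        term *= -x * x / ((2 * k + 1) * (2 * k + 2))
    return total

a, b = 0.0, 2.0                # cos 0 = 1 > 0 and cos 2 < 0
for _ in range(40):
    m = (a + b) / 2
    if cos_series(m) > 0:      # cosine is decreasing here: the zero is right of m
        a = m
    else:
        b = m
zero = (a + b) / 2
print(f"{zero:.9f}")
```

Because the cosine is strictly decreasing on [0, 2], comparing the sign at the midpoint is enough to decide which half contains the zero.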
You know π, for example, from the formulas $r^2 \pi$ for the area and $2r\pi$ for the circumference
of a circle. Of course, our π from Definition 14.33 is exactly the same number. To
check this, we just have to be a little patient. Like Euler’s number e, the number π is an
infinite non-periodic decimal number, so in particular irrational.
Because of cos(π/2) = 0 and $\cos^2 x + \sin^2 x = 1$ we have sin(π/2) = +1 or −1. From
the series expansion of the sine we can see that only +1 is possible. This gives us:
and the comparison between the real part and the imaginary part results in
sin x = − cos(π/2 + x), cos x = sin(π/2 + x).
You can see from this that cosine and sine are quite similar, they are only shifted a little
against each other. Now we can finally plot the graphs of the functions, see Fig. 14.16.
Sine and cosine are not injective and therefore cannot be globally inverted. However,
if we look at restrictions on intervals, the bijectivity is given (Fig. 14.17).
[Fig. 14.16: graphs of cos x and sin x]
[Fig. 14.17: graphs of arccos and arcsin]
Theorem and Definition 14.34 cos : [0, π] → [−1, 1] and sin : [−π/2, π/2] → [−1, 1]
are bijective and continuous. The continuous inverse functions are called arcco-
sine and arcsine:
arccos : [−1, 1] → [0, π], arcsin : [−1, 1] → [−π/2, π/2].
Other important trigonometric functions, which I do not want to go into in more detail,
are
$$\tan x = \frac{\sin x}{\cos x}, \qquad \cot x = \frac{\cos x}{\sin x},$$
whose domains have gaps at the zeros of cosine or sine. They have the inverse functions
arctan and arccot.
The power series expansion of sine and cosine can be used to approximate function val-
ues very efficiently. How far do you have to add? From Theorem 13.25 we know that for
the residual error of the exponential function it holds:
$$\exp(ix) = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + r_{n+1}(ix), \qquad |r_{n+1}(ix)| \le \frac{2|ix|^{n+1}}{(n+1)!}, \quad \text{if } |ix| \le 1 + \frac{n}{2}.$$
If the size of the error is less than ε, then of course also the size of the real part and
the imaginary part of the error, the estimate can therefore also be used for the calcula-
tion of sine and cosine. Now we can restrict ourselves to arguments x ∈ [−π/2, π/2].
It is namely sin(π/2 + x) = sin(π/2 − x), so we can map the arguments between π/2
and 3/2 · π back to this range and thus we have hit a whole period of length 2π. For
|x| ≤ π/2 it holds:
$$|r_{n+1}(ix)| \le \frac{2(\pi/2)^{n+1}}{(n+1)!}.$$
If we add the first 17 terms of the series, we get an error that is less than $1.06 \cdot 10^{-12}$,
which is sufficient in most cases. The polynomial
$$\sin(x) \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots - \frac{x^{15}}{15!} + \frac{x^{17}}{17!}$$
can be evaluated using Horner’s method. The coefficients are stored in a table, only 16
multiplications have to be carried out.
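A possible sketch of such an evaluation in Python: the coefficients are stored in a table and the polynomial is evaluated with Horner's method. Note one deviation that is my own choice: since only odd powers occur, this variant applies Horner's method in x² and multiplies by x at the end, so its multiplication count differs from the one stated above. The names `COEFFS` and `sin_poly` are hypothetical.

```python
import math

# Table of the nonzero coefficients 1/1!, -1/3!, 1/5!, ..., 1/17!.
COEFFS = [(-1) ** j / math.factorial(2 * j + 1) for j in range(9)]

def sin_poly(x):
    """x - x^3/3! + ... + x^17/17!, evaluated with Horner's method in x^2."""
    x2 = x * x
    acc = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        acc = acc * x2 + c
    return acc * x

# Largest deviation from math.sin on a grid over [-pi/2, pi/2].
worst = max(abs(sin_poly(t) - math.sin(t))
            for t in (i * math.pi / 200 - math.pi / 2 for i in range(201)))
print(worst)
```

On the grid used here, the deviation stays below the error bound derived above.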
Please take another look at Fig. 14.10 after Theorem 14.18. We want to investigate how
the point $e^{iy} = (\cos y, \sin y)$ on the unit circle moves when y changes. Starting at 0, cos y
first decreases and sin y increases; the point moves up on the unit circle. At the point π/2,
cosine is equal to 0 and sine is equal to 1; this corresponds to the angle 90°. Then sine
becomes smaller again and cosine negative, and $e^{iy}$ moves further around to the left on the
circle. The value y = π corresponds to 180° and at y = 2π the circle is finally closed
again: $e^{i2\pi} = (1, 0)$. The intermediate value theorem guarantees that for each point w
on the unit circle there is really a value y ∈ [0, 2π[ with w = (cos y, sin y). This value is
unique; it is the angle between w and the x-axis, measured in radians.
With this knowledge, we can now identify each complex number unequal to 0 (that is,
each element of $\mathbb{R}^2$ different from 0) by polar coordinates: Let $z \in \mathbb{C}$, $z \neq 0$ and $r := |z|$.
Then z/r has length 1, so it is a point on the unit circle, and there is exactly one angle
ϕ ∈ [0, 2π[ with $e^{i\varphi} = z/r$, that is, with $z = r \cdot e^{i\varphi}$, where r and ϕ are uniquely
determined:
Theorem and Definition 14.35 Each complex number $z \neq 0$ has a unique
representation $z = r \cdot e^{i\varphi}$ with $r \in \mathbb{R}^+$, ϕ ∈ [0, 2π[.
For each vector $(x, y) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, there are uniquely determined numbers $r \in \mathbb{R}^+$ and
ϕ ∈ [0, 2π[ with (x, y) = (r · cos ϕ, r · sin ϕ).
14.4 Comprehension Questions and Exercises 347
(r, ϕ) are called the polar coordinates of z or of (x, y). r is the distance of the element
from the origin, ϕ the angle between the element and the x-axis.
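As an illustration, the polar coordinates of a complex number can be computed with Python's standard `cmath` module. This is only a sketch: the function name `to_polar` is my own, and since `cmath.phase` returns angles in [−π, π], the result is shifted into the range [0, 2π[ used in the theorem.

```python
import cmath
import math

def to_polar(z):
    """Return (r, phi) with z = r * e^(i*phi), r > 0 and 0 <= phi < 2*pi."""
    if z == 0:
        raise ValueError("0 has no polar coordinates")
    r = abs(z)
    phi = cmath.phase(z) % (2 * math.pi)   # shift from [-pi, pi] to [0, 2*pi]
    if phi >= 2 * math.pi:                 # guard against rounding up to 2*pi
        phi = 0.0
    return r, phi

r, phi = to_polar(complex(-1.0, -1.0))
back = r * cmath.exp(1j * phi)             # reconstruct z = r * e^(i*phi)
print(r, phi, back)
```

For z = −1 − i this gives r = √2 and the angle 5π/4, and multiplying out r · e^{iϕ} returns the original number up to rounding.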
The representation of complex numbers by polar coordinates allows for a simple geo-
metric interpretation of the multiplication of two complex numbers: Let z1, z2 ∈ C. Then
Comprehension Questions
Exercises
Abstract
• you know the definition of the derivative for functions of one or more variables,
• you can apply rules for the calculation of derivatives,
• you have calculated the derivatives of many elementary functions,
• you can determine extrema and inflection points of real functions,
• you know power series and the radius of convergence of power series,
• you can calculate Taylor polynomials and Taylor series for differentiable functions
and estimate approximation errors,
• you have learned the basics of differential calculus of functions of several vari-
ables.
If you want to connect a series of points in a drawing with a line, you would like to have
a “nice” curve through the points. There are certainly different ideas about what is nice.
The graphic editor with which I drew most of the pictures in this book, for example, does
not connect 5 given points with straight line segments, but as shown in the right half of
Fig. 15.1. Both pictures show graphs of continuous functions, but the right curve is not
only continuous, it also has no corners, it is smooth.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_15
350 15 Differential Calculus
Differential calculus provides us with the mathematical means to describe and display
such smooth curves. It investigates whether functions have corners, how steep they are or
how curved they are, and also extreme values and changes in curvature behavior can be
analyzed in this way.
With few exceptions, we will work with real functions in this chapter. But occasion-
ally we will also investigate complex-valued functions, in particular the function eix,
which we learned in the last chapter. The following theorems also apply to these and
therefore I will formulate them accordingly.
So let D ⊂ R, K = R or C, x0 ∈ D and let f : D → K be a function. We investigate the
mapping to the difference quotient
$$D \setminus \{x_0\} \to K, \quad x \mapsto \frac{f(x) - f(x_0)}{x - x_0}.$$
In R this function has a simple interpretation: $\frac{f(x) - f(x_0)}{x - x_0}$ specifies the slope of the line
through the points (x, f (x)) and (x0 , f (x0 )): If the quotient is 0, the line is horizontal, the
greater it is, the steeper it runs. If it is positive, the line rises to the right, if it is negative,
it falls to the right.
The slope of a line of the form g(x) = mx + c is just the factor m: For two points on
the line, $\frac{g(x) - g(x_0)}{x - x_0} = \frac{m(x - x_0)}{x - x_0} = m$ applies.
Now what happens if the point x gets closer and closer to x0 ? In Fig. 15.2 you can see
that the line then approaches the tangent at the point x0 more and more. Is there a limit
for x → x0 ? If so, this limit denotes the slope of the tangent at the point x0.
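The behavior of the difference quotient can be observed numerically. The following minimal sketch uses f(x) = x² at x0 = 1 as an example of my own choosing; the slopes of the lines through (x0, f(x0)) and (x, f(x)) approach the tangent slope 2 as x approaches x0.

```python
def difference_quotient(f, x0, x):
    """Slope of the line through (x0, f(x0)) and (x, f(x)); requires x != x0."""
    return (f(x) - f(x0)) / (x - x0)

f = lambda x: x * x          # example function; the tangent slope at x0 = 1 is 2
x0 = 1.0
slopes = [difference_quotient(f, x0, x0 + h) for h in (1.0, 0.1, 0.01, 0.001)]
print(slopes)
```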
[Fig. 15.2: the line through (x0, f(x0)) and (x, f(x)) with the differences x − x0 and f(x) − f(x0)]
15.1 Differentiable Functions 351
f ′ : D → K, x ↦ f ′ (x)
is a function, the derivative of f .
1. The statement “the limit exists” means in particular that there must be sequences xn in
D \ {x0 } that converge to x0. So it makes no sense to investigate the differentiability of
functions in isolated points of the domain, that is, in points around which there is an
entire neighborhood that does not belong to D.
2. If the derivative of a function f exists at the point x0, then the tangent at x0 has the
slope f ′ (x0 ) and is therefore a line of the form f ′ (x0 ) · x + c. This can be written as
f ′ (x0 )(x − x0 ) + f (x0 ).
3. For h := x − x0, it is $\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$. In this form, the
quotient is often easier to calculate.
The use of the letter h is based on a similar convention as for the ε. Under h you can
always imagine a small number, but in contrast to ε it does not have to be greater than 0,
x0 + h can approach x0 from the left and from the right or jump back and forth wildly.
acceleration. For time derivatives, physicists often write ṗ(t) or p̈(t) instead of p′ (t) or
p′′ (t).
5. Derivatives also play a major role in mathematical economics. An example: If f (x) is
a cost curve, that is, a function that describes the cost of producing x parts of a prod-
uct, then the derivative f ′ (x) is the marginal cost. It represents the cost of producing
the next part.
We therefore have two descriptions of the derivatives. The definition as the limit of the
difference quotient gives us a concrete calculation possibility at hand. From the second
representation f (x) = f (x0 ) + f ′ (x0 ) · (x − x0 ) + ϕ(x) · (x − x0 ) we can read an intuitive
geometric interpretation:
The closer x approaches x0, the better f (x) agrees with f (x0 ) + f ′ (x0 ) · (x − x0 ). The
remainder ϕ(x) · (x − x0 ) goes to 0.
The line g(x) := f (x0 ) + f ′ (x0 ) · (x − x0 ) is the tangent to f at x0. This
tangent represents an approximation of the function f near x0, while the error
f (x) − g(x) = ϕ(x)(x − x0 ) goes to 0 faster than x − x0 (Fig. 15.3).
As an immediate consequence of Theorem 15.2, it follows that every differentiable
function is continuous:
[Fig. 15.3: the tangent to f at the point x0]
Conversely, the differentiability of a function does not follow from its continuity. For
example, the absolute value f : R → R, x → |x| (Fig. 15.4) is not differentiable at 0,
because
$$\frac{f(x) - f(0)}{x - 0} = \begin{cases} +1 & \text{for } x > 0 \\ -1 & \text{for } x < 0 \end{cases}$$
4. Now a complex-valued function comes into play for the first time. Let a ∈ C and
$f : \mathbb{R} \to \mathbb{C},\ x \mapsto e^{ax}$ the exponential function. Of particular importance to us is the
case a = i, but of course a real value of a is also possible, then we get the real
exponential function.
First a little preparation: Remember that for small values of |ax| it holds (see Theorem 13.25):
$$e^{ax} = 1 + ax + r_2(ax) \quad \text{with} \quad |r_2(ax)| \le \frac{2|ax|^2}{2!} = |ax|^2.$$
So $\left| \frac{e^{ah} - 1}{h} - a \right| = \left| \frac{e^{ah} - 1 - ah}{h} \right| = \left| \frac{r_2(ah)}{h} \right| \le |a|^2 |h|$. Since for h → 0 also $|a|^2 |h|$ goes to
0, we get $\lim_{h \to 0} \left| \frac{e^{ah} - 1}{h} - a \right| = 0$, that is $\lim_{h \to 0} \frac{e^{ah} - 1}{h} = a$. With this intermediate
Differentiation Rules
As an example, I would like to derive the product rule, the other rules can be shown
similarly, partly with more, partly with less effort:
More examples
thus
g(f (x)) = g(f (x0 )) + g′ (f (x0 ))f ′ (x0 ) · (x − x0 ) + δ(x) · (x − x0 ).
x → x0 implies y = f (x) → f (x0 ) = y0, δ(x) is a function with $\lim_{x \to x_0} \delta(x) = 0$, and
according to Theorem 15.2b) that just means $(g \circ f)'(x_0) = g'(f(x_0)) \cdot f'(x_0)$.
$$(a^x)' = \underbrace{e^{x \ln a}}_{g'(f(x))} \cdot \underbrace{(\ln a)}_{f'(x)} = a^x \cdot \ln a.$$
◄
Examples
11. ln′ (x). The logarithm is the inverse function of the exponential function exp(x)
and therefore it holds:
$$\ln'(x) = \frac{1}{\exp'(\ln(x))} = \frac{1}{\exp(\ln(x))} = \frac{1}{x}.$$
Since the logarithm is only defined for positive arguments, there can be no prob-
lems with the zero of the denominator.
12. $\left( \sqrt[n]{x} \right)' = (x^{1/n})'$. The nth root is the inverse function of $x^n$. With $y = x^{1/n}$ we get:
$$(x^{1/n})' = \frac{1}{(y^n)'} = \frac{1}{n y^{n-1}} = \frac{1}{n (x^{1/n})^{n-1}} = \frac{1}{n \cdot x^{1 - 1/n}} = \frac{1}{n} x^{(1/n) - 1}.$$
13. Now you can calculate the derivative of $x^{m/n} = (x^m)^{1/n}$ yourself using the chain
rule. Here is the result:
$$(x^{m/n})' = \frac{m}{n} x^{(m/n) - 1}.$$
14. For all real exponents α the derivative rule applies, which we derived for natural
numbers in Example 3 after Theorem 15.3: $(x^\alpha)' = \alpha \cdot x^{\alpha - 1}$. Because for α ∈ R it
is $x^\alpha := e^{\alpha \ln x}$ and we can apply the chain rule to the inner function $f : x \mapsto \alpha \ln x$
and the outer function $g : y \mapsto e^y$:
$$(x^\alpha)' = \underbrace{e^{\alpha \ln x}}_{g'(f(x))} \cdot \underbrace{(\alpha / x)}_{f'(x)} = \alpha \cdot x^\alpha / x = \alpha x^{\alpha - 1}.$$
Please note the difference to the calculation of $(a^x)'$. We determined this derivative
earlier.
15. For x ∈ ]−1, 1[ it is $\arcsin'(x) = \frac{1}{\sin'(\arcsin x)} = \frac{1}{\cos(\arcsin x)}$. The function values of
arcsin lie in the interval ]−π/2, π/2[.
In this range, the cosine is always positive and from $(\sin y)^2 + (\cos y)^2 = 1$ we
get $\cos y = \sqrt{1 - (\sin y)^2}$, so that
$$\arcsin'(x) = \frac{1}{\sqrt{1 - (\sin(\arcsin x))^2}} = \frac{1}{\sqrt{1 - x^2}}.$$
In the points ±1 the derivative does not exist. Can you see why? ◄
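The derivative formulas from Examples 11 to 15 can be checked numerically. The sketch below compares each formula with a symmetric difference quotient at the sample point x = 0.5. The step size h, the sample point and the concrete exponents (third root, exponent π) are assumptions made for illustration only.

```python
import math

def central_difference(f, x, h=1e-6):
    """Symmetric difference quotient as a numerical stand-in for f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.5
checks = {
    # name: (function, value of the derivative formula at x)
    "ln":       (math.log,                1 / x),
    "3rd root": (lambda t: t ** (1 / 3),  (1 / 3) * x ** (1 / 3 - 1)),
    "x^pi":     (lambda t: t ** math.pi,  math.pi * x ** (math.pi - 1)),
    "arcsin":   (math.asin,               1 / math.sqrt(1 - x * x)),
}
errors = {name: abs(central_difference(f, x) - exact)
          for name, (f, exact) in checks.items()}
print(errors)
```

The deviations are of the size one expects from the difference quotient itself, not from the formulas.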
The derivative of a differentiable function is again a function and therefore possibly dif-
ferentiable again. In this way, one can form higher-order derivatives, which are needed in
many applications of differential calculus. I give you a recursive definition of the notion
of the n-th derivative:
Derivatives of lower order are often denoted with primes like the first derivative:
f ′′ (x), f ′′′ (x).
In the definition of the nth derivative, it is not sufficient to only demand that f is
(n − 1) times differentiable at the point x0. After all, $f^{(n-1)}$ must be a function that is to
be differentiated in x0. To determine the limit of the difference quotient, there must be
sequences in the domain of $f^{(n-1)}$ that converge to x0, but are different from x0. For
simplicity, we require that $f^{(n-1)}$ must exist in a whole neighborhood of x0.
Examples
1. ex, sin x, cos x as well as all polynomials are infinitely often differentiable on
R.
2. ln is infinitely often differentiable on R+.
f ′ : R → R, x ↦ 2|x|.
This function is, like the absolute value function, continuous on R, but not differ-
entiable at point 0, it has a kink there. So f is once continuously differentiable on
R.
4. Even if it is hard to imagine: There are functions that are differentiable, but whose
derivative is not continuous anymore. This seems to contradict common sense,
which says that the slope of the tangent of a differentiable function cannot make
jumps. Here we see that imagination and mathematical reality unfortunately do not
always match.
However, the examples for such functions are a bit wild, at least they are not the
kind that you can draw with a pencil. I would like to give you such a function
(Fig. 15.6):
$$f(x) := \begin{cases} x^2 \sin(1/x) & \text{if } x \neq 0 \\ 0 & \text{if } x = 0 \end{cases}$$
For x ≠ 0 we can calculate the derivative with the product and chain rule. Check
that for x = 0 it holds:
$$\lim_{x \to 0} \frac{x^2 \sin(1/x) - 0}{x - 0} = \lim_{x \to 0} x \cdot \sin(1/x) = 0,$$
because sin(1/x) is bounded by ±1. The function f ′ (x) is therefore defined on all of
R, but it is not continuous in point 0, because the limit limx→0 f ′ (x) does not exist.
For the sequence $x_k = 1/(\pi k)$, for example, it holds:
$$f'(x_k) = \frac{2}{\pi k} \sin(\pi k) - \cos(\pi k) = \begin{cases} +1 & \text{if } k \text{ is odd} \\ -1 & \text{if } k \text{ is even.} \end{cases}$$
The sequence of function values does not converge. Near 0, the derivative constantly
jumps back and forth between +1 and −1. The drawing in Fig. 15.6 can no
longer represent the graph near the origin. ◄
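The jumping of the derivative can be made visible with a few lines of Python. This sketch evaluates f ′(x) = 2x sin(1/x) − cos(1/x), obtained from the product and chain rule for x ≠ 0, along the sequence 1/(πk); the function name `f_prime` and the range of k are my own choices.

```python
import math

def f_prime(x):
    """Derivative of f(x) = x^2 sin(1/x) for x != 0, with f'(0) = 0."""
    if x == 0:
        return 0.0
    return 2 * x * math.sin(1 / x) - math.cos(1 / x)

values = [f_prime(1 / (math.pi * k)) for k in range(1, 9)]
print([round(v, 6) for v in values])
```

Although the points 1/(πk) approach 0, the computed values alternate in sign and do not approach f ′(0) = 0.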
Calculation of Extrema
Intuitively, this means: In a local extremum, the tangent to the function is horizontal.
Proof: We assume that the extremum in x0 is a maximum. Then for all x ∈ ]x0 − ε, x0 [
the difference quotient $\frac{f(x) - f(x_0)}{x - x_0} \ge 0$, since the numerator is at most 0 and the denominator is less than 0.
You can see from this proof that the theorem is really only valid for points in the interior
of the domain: From the right and from the left, one must be able to approach x0. If one
wants to determine all the extrema of a function f : I → R, one must consider the fol-
lowing candidates:
In Fig. 15.8 you can see that extrema may or may not occur at points of every type. In
particular, it also does not follow from f ′ (x) = 0 that x is an extremum!
For further characterization of the extrema, we need an important theorem, the mean
value theorem of differential calculus. The following theorem serves as preparation:
If f is constant, then the statement is clear. Otherwise, the image of f has a minimum
and a maximum according to Theorem 14.26. f must therefore have an extremum in
the open interval ]a, b[ (Fig. 15.9), and for this, according to Theorem 15.10, f ′ (x) = 0
holds.
$$f'(x_0) = \frac{f(b) - f(a)}{b - a}.$$
$\frac{f(b) - f(a)}{b - a}$ is exactly the slope of the connecting line between f (a) and f (b). The mean
value theorem states that there is a point on the graph at which the slope of the function
takes on the “average slope” between a and b.
The trick in the proof consists in distorting the function f so that the conditions of
Theorem 15.11 are fulfilled (Fig. 15.10): We subtract from f the connecting line s
between the endpoints of the graph. For the function
$$g(x) = f(x) - \frac{f(b) - f(a)}{b - a}(x - a)$$
we have g(a) = g(b) and therefore, according to Theorem 15.11, there is an x0 with
$g'(x_0) = 0$. The derivative of g is $g'(x) = f'(x) - \frac{f(b) - f(a)}{b - a}$, so that $f'(x_0) = \frac{f(b) - f(a)}{b - a}$ is
really true.
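For a concrete function, such a point x0 can be located numerically. The following sketch uses f(x) = x³ on [0, 2] as an example of my own choosing. In this example g′ = f ′ − (mean slope) has different signs at a and b, so a bisection is applicable; in general the theorem only guarantees that x0 exists, not that this particular search finds it.

```python
import math

f = lambda x: x ** 3             # example function (an assumption)
f_prime = lambda x: 3 * x ** 2   # its derivative, entered by hand
a, b = 0.0, 2.0
mean_slope = (f(b) - f(a)) / (b - a)

# g'(x) = f'(x) - mean_slope changes sign on [a, b]; locate its zero by bisection.
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if (f_prime(lo) - mean_slope) * (f_prime(mid) - mean_slope) <= 0:
        hi = mid
    else:
        lo = mid
x0 = (lo + hi) / 2
print(x0, f_prime(x0), mean_slope)
```

Here the mean slope is 4, and the point found agrees with the exact solution of 3x² = 4, namely x0 = 2/√3.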
The mean value theorem makes the local shape of a function accessible to us. The inter-
val I , from which the following result speaks, can of course only be a small part of the
actual domain, but in this part we can now describe its shape:
I only show part a), the other points can be derived similarly: Let x1 , x2 ∈ I and x1 < x2.
Then there is an x0 between them with $\frac{f(x_2) - f(x_1)}{x_2 - x_1} = f'(x_0) > 0$. That can only be if
f (x2 ) > f (x1 ). So f is strictly monotonically increasing.
As an immediate consequence of e) we get that two functions with the same derivative
only differ by a constant:
We come to the famous curve sketching, which you have probably been plagued with
extensively in school. In essence, it is based on Theorem 15.13. But we will now also
apply the assertions of this theorem to derivatives of higher order.
For part 1: If the derivative is positive to the left, f grows there, if it is negative to the
right it falls again to the right of x0, and at the point x0 itself there is a maximum. For the
minimum, one proceeds analogously.
For part 2: Since f ′′ is still continuous, f ′′ (x) < 0 applies in a whole interval around
x0. According to Theorem 15.13 f ′ is therefore strictly monotonically decreasing in this
area. Because of f ′ (x0 ) = 0, a sign change from + to − takes place at the point x0. The
first part of the theorem then says that a maximum is present at the point x0. Assertion b)
follows again analogously.
If the first and second derivatives are 0 at the point x0, no statement can be made
about the point x0: It can be an extremum, or a so-called terrace point, a point with a
horizontal tangent, but not an extremum. A simple example of this situation is the func-
tion x → x 3 at the point 0.
A differentiable function f is called concave in the interval I if the derivative (i.e.
the slope) decreases in the interval I and convex if the derivative increases. An inflection
point is a point at which the curvature changes, i.e. a point at which the derivative has an
extreme value.
In Fig. 15.11 I have sketched a function f with its first and second derivative. You can
see from this that the extrema of the first derivative are zeros of the second derivative.
Without proof, I would like to cite the following theorem, which describes the curva-
ture of a function. There are no new ideas behind it, you just go down one derivative:
Please note that Theorems 15.15 and 15.16 only apply inside the domain of differentia-
ble functions. Boundary points, discontinuities and non-differentiable points of the func-
tion must be examined separately.
In curve sketching one checks from a given function:
• the domain,
• the zeros,
• the extrema,
• the inflection points,
• the curvature.
With this knowledge, a useful sketch of the function can usually be made. Let’s go
through an example:
Example
Let $f(x) = \frac{x^2 - 1}{x^3}$. The domain is the set $\{x \in \mathbb{R} \mid x \neq 0\}$. The zeros are the zeros of the
The problem of approximating a given function by other, usually simpler to handle func-
tions, occurs again and again in mathematics. Polynomials are often chosen for this pur-
pose. Calculus provides us with a powerful tool for constructing such approximation
polynomials. To prepare, we first examine power series. These can be thought of as poly-
nomials of “infinite degree”. Of course, this does not exist, it is an infinite series of func-
tions. When evaluating such a polynomial, one will break off after a finite number of
summands.
So far, this formula represents neither a function nor a series of numbers. We only know:
If you insert a real number x into (15.2), you get a series, which is either convergent or
not. But we can make a function out of it:
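$$f : \left\{ x \in \mathbb{R} \;\middle|\; \sum_{k=0}^{\infty} a_k x^k \text{ is convergent} \right\} \to \mathbb{R}, \quad x \mapsto \sum_{k=0}^{\infty} a_k x^k.$$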
This is the function I always mean when I talk about a power series. The starting index
of a power series does not have to be 0, but at least a natural number greater than or
equal to 0.
We have already met some of these series:
Examples
1. $\sum_{k=0}^{\infty} \frac{1}{k!} x^k$. It is $a_k = \frac{1}{k!}$ for all k, the series is convergent for all x ∈ R and
represents the exponential function.
2. $\sum_{k=0}^{\infty} x^k = \sum_{k=0}^{\infty} 1 \cdot x^k$. Here $a_k = 1$ for all k. We have calculated earlier that for all x
with |x| < 1 we have $\sum_{k=0}^{\infty} x^k = \frac{1}{1 - x}$. If |x| ≥ 1, then the series is divergent.
Even for x = −1 we have divergence! Flip back to the definition of the convergence
of a series: The partial sums of the series are 1, 0, 1, 0, 1, . . ., so they do not form a
convergent sequence.
3. $\sum_{k=1}^{\infty} \frac{1}{k} x^k$ is also convergent for |x| < 1: For this we use the ratio test from Theorem
13.19: For the quotient of two consecutive series elements we have
$$\lim_{k \to \infty} \left| \frac{(1/(k+1))\, x^{k+1}}{(1/k)\, x^k} \right| = \lim_{k \to \infty} \left| \frac{k}{k+1}\, x \right| = |x| < 1$$
and thus the series is convergent. For x = 1 the series is divergent, then it just
represents the harmonic series, see Example 5 at the end of Sec. 13.1. For |x| > 1 this
quotient will become greater than 1 at some point, so there is divergence. The case
x = −1 is still interesting: In this case the series is convergent, but I do not want to
prove it here.
4. The trigonometric series $\sum_{k=0}^{\infty} \frac{(-1)^k}{(2k)!} x^{2k}$ and $\sum_{k=0}^{\infty} \frac{(-1)^k}{(2k+1)!} x^{2k+1}$ for cosine and sine
also represent power series, which are convergent for all real numbers.
5. A polynomial $b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n$ can also be interpreted as a power
series that is always convergent, where:
$$a_k = \begin{cases} b_k & \text{for } k \le n \\ 0 & \text{for } k > n. \end{cases}$$
◄
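Definition 15.18 Let $\sum_{k=0}^{\infty} a_k x^k$ be a power series. Then
$$R := \begin{cases} \sup \left\{ x \in \mathbb{R} \;\middle|\; \sum_{k=0}^{\infty} a_k x^k \text{ is convergent} \right\}, & \text{if the supremum exists,} \\ \infty & \text{else} \end{cases}$$
is called the radius of convergence of the power series $\sum_{k=0}^{\infty} a_k x^k$.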
The radius of convergence is ∞ or it is R ≥ 0, because for x = 0 there is always conver-
gence.
15.2 Power Series 369
Theorem 15.19 Let $\sum_{k=0}^{\infty} a_k x^k$ be a power series with radius of convergence R > 0.
Then the series converges absolutely for all x ∈ R with |x| < R.
We can reduce this theorem to the comparison test in Theorem 13.20: Let x be given
with |x| < R. Since R is the supremum of the elements for which the series converges,
there must be an x0 with |x| < x0 < R, so that $\sum_{k=0}^{\infty} a_k x_0^k$ converges. Then, according to
Theorem 13.15, the sequence of numbers $a_k x_0^k$ is a null sequence and therefore also
bounded.
For example, let $|a_k x_0^k| < M$. It follows:
$$|a_k x^k| = |a_k x_0^k| \cdot \frac{|x^k|}{|x_0^k|} \le M \cdot \left| \frac{x}{x_0} \right|^k = M \cdot q^k, \qquad q := \left| \frac{x}{x_0} \right| < 1.$$
The series $\sum_{k=0}^{\infty} M q^k = M \sum_{k=0}^{\infty} q^k$ converges as a geometric series, and its elements are
always larger than the elements of the investigated series, which is therefore absolutely
convergent.
By the way, one can also prove that for all x with |x| > R divergence occurs. For the
boundary points of the interval [−R, R] no statement is possible: In the points x = ±R
both convergence and divergence can occur, as you could see in Example 3 before. For
the range of convergence of a power series with radius of convergence R > 0, all of the
following sets are possible: [−R, R],]−R, R[,[−R, R[,]−R, R].
Examples 1, 4 and 5 have an infinite radius of convergence, examples 2 and 3 have a
radius of convergence of 1. Sometimes the radius of convergence of a power series can
be easily calculated using the ratio test:
Theorem 15.20 Let $\sum_{k=0}^{\infty} a_k x^k$ be a power series and let at least from a certain
index on all $a_k$ be different from 0. If the following limit exists in R ∪ {∞}, then
the radius of convergence is:
$$R = \lim_{k \to \infty} \left| \frac{a_k}{a_{k+1}} \right|.$$
You can try out this rule with examples 1, 2 and 3, but the procedure is not directly appli-
cable to trigonometric series. Can you see why?
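Trying out the rule on Examples 1, 2 and 3 could look like the following sketch. It computes the quotient |a_k / a_{k+1}| exactly with fractions for a few growing indices; the function names and the chosen indices are my own, and a handful of values can of course only suggest the limit, not prove it.

```python
from fractions import Fraction
from math import factorial

def a_exp(k):       # Example 1: a_k = 1/k!
    return Fraction(1, factorial(k))

def a_geo(k):       # Example 2: a_k = 1
    return Fraction(1)

def a_log(k):       # Example 3: a_k = 1/k (defined for k >= 1)
    return Fraction(1, k)

def ratio(a, k):
    """|a_k / a_(k+1)|, computed exactly with fractions."""
    return abs(a(k) / a(k + 1))

for name, a in (("1/k!", a_exp), ("1", a_geo), ("1/k", a_log)):
    print(name, [float(ratio(a, k)) for k in (10, 100, 1000)])
```

For Example 1 the quotients grow without bound, which matches an infinite radius of convergence; for Examples 2 and 3 they stay at or approach 1.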
This means that you can differentiate a power series term by term, just like a polynomial.
The proof of this theorem is difficult, it uses properties of series and functions that I
have not introduced to you. I can therefore only quote the theorem. Fortunately, it is easy
to apply. I will show you two
Examples
1. The first result we already know, it would be bad if something else came out now:
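$$(e^x)' = \left( \sum_{k=0}^{\infty} \frac{x^k}{k!} \right)' = \sum_{k=1}^{\infty} \frac{k x^{k-1}}{k!} = \sum_{k=1}^{\infty} \frac{x^{k-1}}{(k-1)!} = \sum_{k=0}^{\infty} \frac{x^k}{k!} = e^x.$$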
2. The second example is tricky, but here we get something really new:
Let |x| < 1. Then
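$$\left( \sum_{k=1}^{\infty} \frac{1}{k} x^k \right)' = \sum_{k=1}^{\infty} \frac{k}{k} x^{k-1} = \sum_{k=1}^{\infty} x^{k-1} = \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x}.$$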
0 on the left and right for x, we get 0 = c and now we have a series representation
for the logarithm:
$$\ln(1 - x) = \sum_{k=1}^{\infty} \frac{-1}{k} x^k \quad \text{for } |x| < 1.$$
◄
15.3 Taylor Series 371
We have seen earlier that power series are important in order to calculate function values
concretely. Unfortunately, the series found now for the logarithm is very badly conver-
gent and unusable for efficient calculations. But we will improve the representation soon,
see Example 2 after Theorem 15.24.
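How slowly the series converges near the boundary of the convergence range can be seen in a small experiment. The sketch below sums the first n terms of the series for ln(1 − x) and compares them with `math.log`; the sample points and the numbers of terms are assumptions chosen for illustration.

```python
import math

def ln_one_minus(x, terms):
    """Partial sum of ln(1 - x) = -(x + x^2/2 + x^3/3 + ...) for |x| < 1."""
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= x              # power now holds x^k
        total -= power / k
    return total

for x in (0.1, 0.5, 0.9):
    errs = [abs(ln_one_minus(x, n) - math.log(1 - x)) for n in (10, 50)]
    print(x, errs)
```

For x = 0.1 ten terms already give many correct digits, while for x = 0.9 ten terms are not even enough for two.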
The goal of this section is to approximate as many functions as possible by power series,
as we have just succeeded for the logarithm. We only examine the functions at one point
x0, so we only want to find a local approximation.
Remember that the tangent of a function at the point x0 gave an approximation to the
function (compare Theorem 15.2):
We wish that F2 goes to 0 even faster than F1, F3 faster than F2 and so on. The question
is: Can we determine such Ai and Fi?
Let’s first assume that x0 = 0, that f is infinitely often differentiable in a neighbor-
hood of 0, and that there is a power series that represents f in a neighborhood of 0:
$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots = \sum_{k=0}^{\infty} a_k x^k.$$
In this case, we have already given our desired approximation. As a power series, f can
be differentiated an arbitrary number of times in this neighborhood, and we can deter-
mine the derivatives of f by differentiating the summands:
$$a_k = \frac{f^{(k)}(0)}{k!}.$$
Under certain conditions, a function f can therefore be “expanded” into a power series:
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!} x^k = f(0) + f'(0)\, x + \frac{f''(0)}{2!} x^2 + \cdots.$$
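As a small illustration of this expansion, the following sketch builds the partial sum from a list of derivative values at 0. The sine serves as an example of my own choosing, since its derivatives at 0 simply cycle through 0, 1, 0, −1; the helper name `taylor_at_zero` and the cut-off after 12 coefficients are likewise assumptions.

```python
import math

def taylor_at_zero(derivs_at_zero, x):
    """Evaluate the sum of f^(k)(0)/k! * x^k for the given derivative values."""
    return sum(d / math.factorial(k) * x ** k
               for k, d in enumerate(derivs_at_zero))

# sin has the derivatives sin, cos, -sin, -cos, ... with values 0, 1, 0, -1, ... at 0
sin_derivs = [(0, 1, 0, -1)[k % 4] for k in range(12)]
approx = taylor_at_zero(sin_derivs, 0.5)
print(approx, math.sin(0.5))
```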
The Taylor polynomial and the remainder are dependent on the point of expansion
x0: If you shift this point, you get a different polynomial. The remainder $R_n(x - x_0)$
just indicates the error that arises when you replace f (x) by $j^n_{x_0}(f)(x - x_0)$, because it is
$f(x) = j^n_{x_0}(f)(x - x_0) + R_n(x - x_0)$.
We obtain an equivalent, alternative notation in which a n-jet looks a little more like a
polynomial, if we set, as we have sometimes done before, x = x0 + h, that is h = x − x0.
Then
and thus $r = \frac{f''(\vartheta)}{2!}$, so that $R_1 = r \cdot (x - x_0)^2 = \frac{f''(\vartheta)}{2!}(x - x_0)^2$ results.
It is particularly nice that the remainder converges very well to 0, provided that the (n + 1)th derivative is also continuous:
\[ R_n(x - x_0) = \varphi_n(x - x_0) \cdot (x - x_0)^n \quad \text{with} \quad \lim_{x \to x_0} \varphi_n(x - x_0) = 0. \]
If x goes to $x_0$, then the remainder $R_n$ converges to 0 faster than the nth power of $(x - x_0)$. One also says: “f and its Taylor polynomial agree at the point $x_0$ to nth order”.
For infinitely often differentiable functions, we can not only set up a Taylor polyno-
mial, but also write down a whole series formally:
Replace x − x0 again by h, then you will see that the Taylor series is nothing but a power
series.
This result is not very profound according to our current knowledge. Of course, one must
investigate how the remainder behaves for ever larger n in order to assess the conver-
gence of the series. Unfortunately, now all conceivable cases can occur here, for exam-
ple:
Note the subtle but essential difference between the two expressions
The first expression in (15.5) always converges to 0 according to Theorem 15.24. However, if one keeps the number $x \neq x_0$ fixed, then the sequence $R_n$ of remainders does not have to go to 0 as n becomes larger. In such cases, the Taylor series of the function f does not represent the function f.
But finally to applications of the theory.
Examples
\[ \ln'(x) = x^{-1}, \quad \ln''(x) = -x^{-2}, \quad \ln^{(3)}(x) = 2x^{-3}, \quad \ln^{(4)}(x) = -2 \cdot 3 \cdot x^{-4}, \quad \dots, \quad \ln^{(k)}(x) = (-1)^{k+1}(k-1)!\,x^{-k}, \]
so that for $k > 0$: $\ln^{(k)}(1) = (-1)^{k+1}(k-1)!$. Because of $\ln(1) = 0$ we finally get:
\[ \ln(1 + h) = \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k} h^k + R_n(h) = h - \frac{1}{2}h^2 + \frac{1}{3}h^3 - \frac{1}{4}h^4 + \cdots \pm \frac{1}{n}h^n + R_n(h). \]
for an element $\vartheta$ between 1 and 1 + h. At least for 0 < h < 1 one can see that $1/\vartheta^{n+1}$ and $h^{n+1}$ are always less than 1 and therefore the sequence $R_n(h)$ converges to 0, although very slowly. With a little more effort, one can show that convergence also holds for −1 < h ≤ 0. So the Taylor series converges to the logarithm and we have found a new power series representation: For |h| < 1 we have
\[ \ln(1 + h) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} h^k. \]
In Example 2 after Theorem 15.21 we had found a very similar power series: For |h| < 1 we had
\[ \ln(1 - h) = \sum_{k=1}^{\infty} \frac{-1}{k} h^k. \]
If you replace h by −h, you will see that it is actually the same series expansion we
obtained there in a completely different way.
To calculate the logarithm in practice, we conjure up from these two poorly convergent series a better convergent one, with the help of which we can also determine the logarithm for all x > 0: For all h with |h| < 1 we have
\[ \ln\frac{1+h}{1-h} = \ln(1+h) - \ln(1-h) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} h^k - \sum_{k=1}^{\infty} \frac{-1}{k} h^k = 2\left(h + \frac{h^3}{3} + \frac{h^5}{5} + \cdots\right) = 2\sum_{k=0}^{\infty} \frac{h^{2k+1}}{2k+1}. \]
However, for |h| near 1 the convergence rate decreases again. Why can we determine the logarithm for all x > 0 with this series? This follows from the fact that the mapping
\[ ]-1, 1[ \;\to\; \mathbb{R}^+, \quad h \mapsto \frac{1+h}{1-h} = x \]
is bijective (compare Exercise 5 at the end of this chapter).
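The series for ln((1 + h)/(1 − h)) can be tried out directly. The following is only a minimal sketch: the function name `ln_series` and the fixed number of 50 terms are assumptions made for illustration, and h is obtained by solving x = (1 + h)/(1 − h) for h, which gives h = (x − 1)/(x + 1).

```python
import math

def ln_series(x, terms=50):
    """Approximate ln(x) for x > 0 by 2 * sum of h^(2k+1) / (2k+1)."""
    if x <= 0:
        raise ValueError("x must be positive")
    h = (x - 1) / (x + 1)      # solves x = (1 + h) / (1 - h), so |h| < 1
    total = 0.0
    power = h                  # holds h^(2k+1) for the current k
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= h * h
    return 2 * total

# Compare with the library logarithm for a few sample values.
for x in (0.5, 2.0, 10.0):
    print(x, ln_series(x), math.log(x))
```

For x far away from 1 the value |h| moves towards 1, so more terms are needed, in line with the remark about the decreasing convergence rate.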
2. Let us consider the function $f(x) = x^{\alpha}$ for x > 0 and $\alpha \in \mathbb{R}$. The derivatives are
\[ f'(x) = \alpha x^{\alpha-1}, \quad f''(x) = \alpha(\alpha-1)x^{\alpha-2}, \quad \dots, \quad f^{(k)}(x) = \alpha(\alpha-1)\cdots(\alpha-k+1)x^{\alpha-k}. \]
For the fraction behind the summation sign we introduce the abbreviation
\[ \binom{\alpha}{k} := \frac{\alpha(\alpha-1)\cdots(\alpha-k+1)}{k!}. \]
We call this expression the generalized binomial coefficient. If we set $\binom{\alpha}{0} := 1$, then
\[ (1 + h)^{\alpha} = \sum_{k=0}^{n} \binom{\alpha}{k} h^k + R_n(h). \]
It can be shown with some effort that for |h| < 1 the sequence of the remainders converges to 0; for these h we therefore have
\[ (1 + h)^{\alpha} = \sum_{k=0}^{\infty} \binom{\alpha}{k} h^k. \tag{15.6} \]
Does this sound familiar to you? For $\alpha \in \mathbb{N}$ you get exactly the binomial theorem 4.8, which we derived in Sect. 4.1: For $k > \alpha$ the binomial coefficient then always becomes 0 and (15.6) becomes a finite sum that ends with $k = \alpha$. The series from (15.6) is therefore called the binomial series. It can be used, for example, to approximate square roots:
\[ \sqrt{1 + h} = \sum_{k=0}^{2} \binom{1/2}{k} h^k + R_2(h) = 1 + \frac{h}{2} - \frac{h^2}{8} + \binom{1/2}{3} \vartheta^{1/2-3} h^3 \]
for a $\vartheta$ between 1 and 1 + h. Let’s say |h| ≤ 1/2. Then the error is maximal when $\vartheta = 1/2$ (as small as possible) and |h| = 1/2 (as large as possible). $\binom{1/2}{3}$ has the value 1/16, and thus $|R_2(h)| < (1/16) \cdot 5.66 \cdot (1/8) \approx 0.044$. The sum $1 + h/2 - h^2/8$ therefore represents a quite reasonable approximation to $\sqrt{1 + h}$ in this range.
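The error bound of about 0.044 can be compared with the actual error on a grid of sample points. This is a small sketch only; the name `sqrt_taylor2` and the grid of 1001 points in [−1/2, 1/2] are illustrative assumptions.

```python
import math

def sqrt_taylor2(h):
    """Second-degree Taylor polynomial of sqrt(1 + h) at h = 0."""
    return 1 + h / 2 - h * h / 8

# Sample h in [-0.5, 0.5] and record the largest deviation from math.sqrt.
samples = [i / 1000 - 0.5 for i in range(1001)]
worst = max(abs(math.sqrt(1 + h) - sqrt_taylor2(h)) for h in samples)
print(worst)
```

The observed maximum stays well below the bound, which is to be expected because the estimate takes the worst case for $\vartheta$ and |h| at the same time.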
3. You might know the following theorem from physics: The position of a particle
that is uniformly accelerated is fixed for all times, if at a certain time t0 the posi-
tion, velocity and acceleration of the particle are known. Why is that so? Let’s
denote the position at time t with p(t), then the velocity is v(t) = p′ (t) and the
acceleration is a(t) = v′ (t) = p′′ (t). “Uniformly accelerated” means that all higher
derivatives vanish: The acceleration does not change. If we now set up the Taylor polynomial for the function p, we get:
\[ p(t_0 + h) = p(t_0) + v(t_0)h + \frac{1}{2}a(t_0)h^2 + 0. \]
The Taylor polynomial ends after the third term and represents the function p for
all h ∈ R. ◄
If you look at this example closely, you will notice that p, v and a are not real-valued func-
tions at all; after all, the position of a particle consists of three spatial components, and v and
a have certain directions: The functions have the R3 as their codomain. But that doesn’t mat-
ter, the x-,y- and z-components of the functions are real-valued functions, to which we can
apply all our theorems.
exists at the point $x_i = a_i$, then $f_i'(x_i)$ is called the partial derivative of f with respect to $x_i$ at the point a and is denoted by
\[ \frac{\partial f}{\partial x_i}(a) \quad \text{or} \quad \frac{\partial}{\partial x_i} f(a). \]
f is called partially differentiable if the partial derivatives of f exist for all a ∈ U ,
and continuously partially differentiable if these partial derivatives are all
continuous.
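If the function f is partially differentiable with respect to the variable $x_i$ in each point $x \in U$, then the derivative $\frac{\partial f}{\partial x_i} : x \mapsto \frac{\partial f}{\partial x_i}(x)$ is itself a function from U to $\mathbb{R}$, and can therefore be partially differentiated again, if necessary even with respect to other variables than $x_i$. The second partial derivatives of a function f of n variables are denoted by
\[ \frac{\partial^2 f}{\partial x_i \partial x_j}(x) := \frac{\partial}{\partial x_i}\frac{\partial f}{\partial x_j}(x), \qquad \frac{\partial^2 f}{\partial x_i^2}(x) := \frac{\partial}{\partial x_i}\frac{\partial f}{\partial x_i}(x), \]
and the higher derivatives by
\[ \frac{\partial^k}{\partial x_{i_1} \partial x_{i_2} \dots \partial x_{i_k}} f(x) := \frac{\partial}{\partial x_{i_1}} \frac{\partial}{\partial x_{i_2}} \cdots \frac{\partial}{\partial x_{i_k}} f(x). \]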
The order of the indices is not quite uniform in the literature: In my notation, $\frac{\partial^2 f}{\partial x_i \partial x_j}$ means that the derivative is taken first with respect to $x_j$, then with respect to $x_i$. Sometimes it is interpreted the other way around. In a moment you will see that you usually don’t have to worry about it.
Examples
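1. Let $f(x, y) = x^2 + y^2$. Then $\frac{\partial f}{\partial x}(x, y) = 2x$, $\frac{\partial f}{\partial y}(x, y) = 2y$. The second partial derivatives are $\frac{\partial^2 f}{\partial x^2} = 2 = \frac{\partial^2 f}{\partial y^2}$, $\frac{\partial^2 f}{\partial x \partial y} = 0 = \frac{\partial^2 f}{\partial y \partial x}$.
2. Let $f(x, y) = x^2 y^3 + y \ln x$. Then
\[ \frac{\partial f}{\partial x} = 2xy^3 + \frac{y}{x}, \qquad \frac{\partial f}{\partial y} = 3x^2y^2 + \ln x, \]
\[ \frac{\partial^2 f}{\partial x^2} = 2y^3 - \frac{y}{x^2}, \qquad \frac{\partial^2 f}{\partial x \partial y} = 6xy^2 + \frac{1}{x} = \frac{\partial^2 f}{\partial y \partial x}, \qquad \frac{\partial^2 f}{\partial y^2} = 6x^2y. \] ◄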
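The mixed derivatives of Example 2 can also be checked numerically. The sketch below uses nested central difference quotients; the helper names `d_dx`, `d_dy`, the step size 1e-4 and the test point (2, 1.5) are assumptions chosen for illustration only.

```python
import math

def f(x, y):
    return x**2 * y**3 + y * math.log(x)

def d_dx(g, x, y, eps=1e-4):
    """Central difference quotient of g in the x-direction."""
    return (g(x + eps, y) - g(x - eps, y)) / (2 * eps)

def d_dy(g, x, y, eps=1e-4):
    """Central difference quotient of g in the y-direction."""
    return (g(x, y + eps) - g(x, y - eps)) / (2 * eps)

def f_xy(x, y):
    # first differentiate with respect to y, then with respect to x
    return d_dx(lambda u, v: d_dy(f, u, v), x, y)

def f_yx(x, y):
    # first differentiate with respect to x, then with respect to y
    return d_dy(lambda u, v: d_dx(f, u, v), x, y)

x0, y0 = 2.0, 1.5
exact = 6 * x0 * y0**2 + 1 / x0     # formula from Example 2
print(f_xy(x0, y0), f_yx(x0, y0), exact)
```

Both orders of differentiation agree with the closed formula up to the discretization error of the difference quotients.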
The mixed second partial derivatives agree in each case. If you calculate a few examples,
you will find that this is no coincidence. There is the amazing statement:
Extrema
With the help of partial derivatives, local extrema can be determined. First the definition,
which can be formulated analogously to the one-dimensional case:
For the partial functions $f_i(a_1, a_2, \dots, x_i, \dots, a_n)$ obviously have a local extremum at the point $x_i = a_i$ and therefore $\frac{\partial f}{\partial x_i}(a) = 0$ must be true for all i.
15.4 Differential Calculus of Functions of Several Variables 381
A point $a \in \mathbb{R}^n$ is called a stationary point if grad f(a) = 0. Such a point does not necessarily have to be a maximum or minimum, as you can see in Fig. 15.14, which shows the function $x^2 - y^2$. In this function, the partial functions through (0,0) have a minimum in x-direction and a maximum in y-direction. Therefore, both partial derivatives are equal to 0. Such a point is called a saddle point.
In order to decide whether a minimum or a maximum is present, the second derivatives have to be consulted again. But now there is more than one of them:
In order to analyze extrema, the Hessian matrix has to be examined. In general, this is
difficult, but for the case of two variables one can still write down a simple criterion:
If the function f has more than two variables, one must also investigate subdeterminants
of the Hessian matrix.
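For two variables, the decision can be sketched in a few lines. Since the criterion itself is not reproduced in this text, the code assumes the usual textbook rule: a positive determinant of the Hessian matrix means an extremum (minimum if the second derivative with respect to x is positive, maximum if it is negative), a negative determinant means a saddle point. The function name `classify_stationary_point` is hypothetical.

```python
def classify_stationary_point(fxx, fxy, fyy):
    """Classify a stationary point of a function of two variables from its
    second partial derivatives there (assumed standard criterion)."""
    det = fxx * fyy - fxy * fxy      # determinant of the 2x2 Hessian matrix
    if det > 0:
        return "minimum" if fxx > 0 else "maximum"
    if det < 0:
        return "saddle point"
    return "undecided"               # det == 0: the criterion gives no answer

print(classify_stationary_point(2, 0, -2))   # x^2 - y^2 at (0, 0)
print(classify_stationary_point(2, 0, 2))    # x^2 + y^2 at (0, 0)
```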
Examples
[Figure: regression line with the points $(x_i, ax_i + b)$]
15.5 Comprehension Questions and Exercises 383
\[ F(a, b) = \sum_{i=1}^{n} (ax_i + b - y_i)^2 \]
These are two linear equations for a and b, which usually have exactly one solution. The Hessian matrix is
\[ \begin{pmatrix} \sum_{i=1}^{n} 2x_i^2 & \sum_{i=1}^{n} 2x_i \\ \sum_{i=1}^{n} 2x_i & 2n \end{pmatrix}, \]
one can calculate that $\det(H_F(a, b)) > 0$ is always true. Furthermore, $\frac{\partial^2 F}{\partial a^2} = \sum_{i=1}^{n} 2x_i^2 > 0$, so there is actually a minimum. To calculate the regression line, the sums of the $x_i$, the $y_i$, the $x_i^2$ and the $x_i y_i$ must be calculated; with these coefficients a two-dimensional system of linear equations is solved.
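As a sketch of this computation: setting both partial derivatives of F(a, b) to zero gives the system a·Σx_i² + b·Σx_i = Σx_i y_i and a·Σx_i + b·n = Σy_i, which the code below solves with Cramer's rule. The function name `regression_line` and the sample points are assumptions for illustration.

```python
def regression_line(points):
    """Return (a, b) of the line y = a*x + b minimizing the sum of squares."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    det = n * sxx - sx * sx          # determinant of the 2x2 system
    if det == 0:
        raise ValueError("all x values are equal, no unique line exists")
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

# Points lying exactly on y = 2x + 1 must reproduce that line.
print(regression_line([(0, 1), (1, 3), (2, 5), (3, 7)]))
```

The determinant check covers the degenerate case in which all x_i coincide and the two equations no longer have exactly one solution.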
Comprehension Questions
5. What condition must a function f fulfill if one wants to set up a Taylor series for
it?
6. Let Rn (x − x0 ) be the remainder when expanding a function f into a Taylor series.
Which of the two following limits is always 0?
Exercises
1. Calculate the derivatives of $\sqrt{2x - 1}$, $\cot x$, $\cos^n(x/n)$.
2. Determine the derivative of each of the following functions in two ways:
a) $(3x + 5x^2 - 1)^2$ by product rule and chain rule,
b) $10/(x^3 - 2x + 5)$ by quotient rule and chain rule.
3. At which points is the function $x \mapsto x|\sin x|$ differentiable? Explain why the function is not differentiable everywhere.
4. Is the function $x \mapsto x \sin|x|$ differentiable at 0?
5. Let $f : \,]-1, +1[\, \to \mathbb{R}^+$, $x \mapsto (1 + x)/(1 - x)$.
a) Show that f is bijective.
b) Calculate the first and second derivative of the function f.
6. Show that the function $f : \mathbb{R} \setminus \{0\} \to \mathbb{R}$, $x \mapsto \ln|x|$ is differentiable throughout its domain and calculate the derivative. (Use that $\ln'(x) = 1/x$ holds for x > 0.)
7. Calculate the derivative of $\log_a(x)$ using the statement that $\log_a(x)$ is the inverse function of $a^x$.
8. Confirm the result from Exercise 7 using the statement $\log_a(x) = \ln(x)/\ln(a)$.
9. Carry out a curve discussion for the following two functions and sketch their graphs:
a) $x^3 + 4x^2 - 3x$.
b) $x^2 \cdot e^{-2x}$.
10. Sketch the graph of a non-differentiable function for which the mean value theo-
rem of differential calculus does not apply.
11. Let
\[ f : [0, 4[ \,\to \mathbb{R}, \quad x \mapsto \begin{cases} -x + 2 & \text{for } 0 \le x < 1 \\ -x^2 + 4x - 2 & \text{for } 1 \le x < 4 \end{cases} \]
Examine at which points the function is continuous and differentiable, and calculate all maxima and minima of the function.
12. Calculate the radii of convergence of the following power series:
a) $\sum_{n=0}^{\infty} \frac{x^n}{2^n}$
b) $\sum_{n=0}^{\infty} \frac{n^2 x^n}{2n + 1}$
c) $\sum_{n=0}^{\infty} \frac{x^n}{(2n + 1)!}$
13. Determine for the function $\sqrt{x}$ the Taylor polynomial of 3rd degree at the point x = 1. Use this to calculate for h = 0, 0.5, 1 an approximation for $\sqrt{x + h}$. Determine the resulting error in comparison to the value from the calculator.
14. Write a program to determine zeros with bisection. Take the derivative of the function $f : x \mapsto (1/x)\sin(x)$ ($x \neq 0$) and determine some zeros of the derivative numerically. Then sketch the graph of the function f.
15. Determine all partial derivatives of first order of the functions
a) $3x_1^4 x_2^3 x_3^2 x_4^1 + 5x_1 x_2 + 8x_3 x_4 + 1$.
b) $xe^{yz} + \frac{\sqrt{xz}}{\ln y}$, $x, y, z > 0$.
16. Determine all stationary points of the following function:
$x^3 - 3x^2y + 3xy^2 + y^3 - 3x - 21y$.
17. A rectangular box (length x, width y, height z ), which is open at the top, should
have a capacity of 32 liters. Determine x, y and z so that the material consumption
for the box is minimal.
16 Integral Calculus
Abstract
Determining the areas of shapes that are not bounded by line segments is an old math-
ematical problem, think for example of determining the area of a circle. With the help of
integral calculus we can tackle this task. The integral of a real function determines the
area enclosed between the graph of the function and the x-axis. We obtain this area with
the help of a limit: it is approximated better and better by a sequence of rectangles.
Integration will prove to be useful for many other tasks as well: we can for example
also calculate arc lengths of curves or the volume of a shape. An important and surprising result of the theory establishes a connection between differential calculus and integral calculus: differentiation and integration are inverse operations to each other. This opens up new areas of application for integral calculus.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_16
Fig. 16.1 shows some areas that we want to calculate with the help of the integral. You
will see that we not only determine areas that are bounded all around by the graph and
the x-axis, like the left area. For a function that is defined between a and b, we bound the
areas on the left and right with the help of the vertical lines through a and b. We can also
determine areas under the graph in this way for functions with discontinuities.
Let us restrict the set of functions that we want to integrate more precisely:
Such a function can therefore be composed of finitely many continuous pieces. The sec-
ond condition of the definition states that f does not vanish into infinity in the disconti-
nuities, like for example 1/x in the point 0. The function only has jump discontinuities,
the image of the function is bounded.
Now we want to approximate the area by rectangles. We take bars that become nar-
rower and narrower and whose upper boundary just cuts the graph, see Fig. 16.2.
[Fig. 16.2: partition of the interval with $x_0 = a$, $x_1$, $x_2$, ..., $b = x_k$]
\[ S := \sum_{i=1}^{k} f(y_i) \cdot \Delta x_i \tag{16.1} \]
There are various concepts of integration that are applicable to different classes of functions.
The definition of the integral in the form presented by me is called the Riemann integral
after the German mathematician Bernhard Riemann (1826–1866). For piecewise continuous
functions, different definitions of the integral yield the same results.
Note that according to this definition, areas below the x-axis are counted as negative.
Theorem and Definition 16.3 Let $f : [a, b] \to \mathbb{R}$ be piecewise continuous. Let $(Z_n)_{n \in \mathbb{N}}$ be a sequence of partitions of [a, b] with the property $\lim_{n \to \infty} Z_n = 0$ and $S_n$ a Riemann sum to the partition $Z_n$. Then the limit $\lim_{n \to \infty} S_n$ exists and does not depend on the specific sequence of partitions.
This limit is called the definite integral of f over [a, b] and is denoted by
\[ \int_a^b f(x)\,dx. \]
I only want to sketch the procedure for the proof of the theorem: f can be decomposed
into continuous parts, we can therefore assume the function to be continuous. The
sequence Zn of decompositions arises by subdividing the interval [a, b] over and over
again. For the decomposition Zn it follows from Theorem 14.26 that f in the interval
[xi−1 , xi ] has a minimum mi and a maximum Mi. Now we can form the lower sum Ln and
the upper sum Un from the Riemann sum Sn. These are just the bars that bound the graph
from below and above in the interval [xi−1 , xi ]:
\[ L_n := \sum_{i=1}^{k} m_i \cdot \Delta x_i, \qquad U_n := \sum_{i=1}^{k} M_i \cdot \Delta x_i. \]
In Fig. 16.3 the upper sum is shown in grey and the lower sum is dotted. The actual
Riemann sum Sn meanders somewhere in between, for all n it holds:
Ln ≤ Sn ≤ Un .
What happens if we switch to a finer decomposition, for example Zn+1? Then the upper
sum can only become smaller and the lower sum can only become larger. You can see
this in Fig. 16.4 for the lower sum. The sequence of upper sums is therefore monotoni-
cally decreasing, that of the lower sums monotonically increasing.
The lower sums are bounded from above, for example by U1, the upper sums are
bounded from below, for example by L1. According to Theorem 13.12 monotonic and
bounded sequences are always convergent. There are therefore L := limn→∞ Ln and
U := limn→∞ Un, and of course L ≤ U . In fact, lower sums and upper sums converge to
the same limit. Although this appears intuitive, there is still a lot of work involved here,
which requires more knowledge of continuous functions than I have presented. So let’s
just believe L = U . Since Sn is trapped between Ln and Un, the sequence Sn has no choice
but to also converge to the common limit.
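The squeeze between lower and upper sums can be watched numerically. The sketch below is restricted to an increasing function, so that the minimum on each subinterval sits at the left end and the maximum at the right end; this restriction, the equally wide subintervals, the midpoints chosen as intermediate points and the name `sums` are all assumptions made to keep the example short.

```python
def sums(f, a, b, k):
    """Lower sum, a Riemann sum (midpoints) and upper sum of an increasing
    function f on [a, b] for k subintervals of equal width."""
    dx = (b - a) / k
    lower = riemann = upper = 0.0
    for i in range(1, k + 1):
        left = a + (i - 1) * dx
        right = a + i * dx
        lower += f(left) * dx                   # minimum of an increasing f
        riemann += f((left + right) / 2) * dx   # some point y_i in between
        upper += f(right) * dx                  # maximum of an increasing f
    return lower, riemann, upper

# f(x) = x^2 on [0, 1]; all three values approach 1/3.
for k in (4, 16, 64, 256):
    print(k, sums(lambda x: x * x, 0.0, 1.0, k))
```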
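[Fig. 16.3 and Fig. 16.4: upper and lower sums over the subintervals $\Delta x_1, \Delta x_2, \dots, \Delta x_n$ between a and b]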
16.1 The Integral of Piecewise Continuous Functions 391
The integral sign represents a stylized S; after all, the integral is the limit of a sum. What stands behind the integral sign is not a product “f(x) times a small piece of dx”, but rather a notation that is intended to remind us of the origin from the sum (16.1). The definite integral itself is no longer a function, but a real number.
The naming of the integration variable is arbitrary, we have
\[ \int_a^b f(x)\,dx = \int_a^b f(y)\,dy = \int_a^b f(t)\,dt = \int_a^b f(b)\,db. \]
Usually the integration variable is denoted with a different character than the limits of
integration.
A computer scientist who is used to working with local and global variables should have
no problems with this.
So far in our integrals a < b must always be true. In order to avoid case distinctions and
to include boundary cases, we define:
These specifications are also compatible with the following theorem, which assembles
simple calculation rules for integrable functions. They easily follow from the definition
of the integral and the known arithmetic rules for limits. I omit the proofs:
Integration examples
1. $\int_a^b c\,dx = c(b - a)$. This is simply the area of the rectangle determined by the con-
Since $x^2 + y^2 = 1$ holds for the points of the circle, the function is $y = \sqrt{1 - x^2}$. Is it really true that $\int_{-1}^{1} \sqrt{1 - x^2}\,dx = \pi/2$? To check this, we need to learn more about the calculation of integrals. ◄
Theorem 16.6: The mean value theorem for integrals Let f be continuous on [a, b]. Then there is an element $y \in [a, b]$ with the property
\[ \int_a^b f(x)\,dx = (b - a)f(y). \]
\[ \Rightarrow\; m \le \mu \le M. \]
According to the Intermediate Value Theorem 14.25, there is now a y between α and β with the property $f(y) = \mu$, that is
\[ \int_a^b f(x)\,dx = (b - a)f(y). \]
The following is the lifeline for calculating integrals: The fundamental theorem of calculus establishes the connection between differentiation and integration and allows us to use the knowledge we have collected in the last chapter to our advantage again.
This is because, according to Theorem 15.14, functions with the same derivative only
differ by a constant.
Look closely at the integral (16.2): To make the number a function again, we have now
made the upper limit of integration variable. The derivative of this integral function is
just the function itself.
The proof is surprisingly easy now; we have already done the preliminary work. We simply calculate the derivative of $F_a$, that is the limit of the difference quotient. Let us first look at the numerator and apply the mean value Theorem 16.6:
\[ F_a(x + h) - F_a(x) = \int_a^{x+h} f(t)\,dt - \int_a^x f(t)\,dt = \int_x^{x+h} f(t)\,dt = f(y_h) \cdot h. \]
For each h there is, according to the mean value theorem, such an element $y_h$ between x and x + h. Now we get for the derivative:
\[ F_a'(x) = \lim_{h \to 0} \frac{F_a(x + h) - F_a(x)}{h} = \lim_{h \to 0} \frac{1}{h} \cdot f(y_h) \cdot h = \lim_{h \to 0} f(y_h), \]
and as $h \to 0$, also $y_h \to x$. Since f is continuous, $\lim_{h \to 0} f(y_h) = f(x)$, thus $F_a'(x) = f(x)$. The second part of the theorem follows from Theorem 16.8.
For there is a $c \in \mathbb{R}$ with $F(x) = F_a(x) + c$. Then $F(b) = F_a(b) + c$, and $F_a(a) = \int_a^a f(t)\,dt = 0$ implies $F(a) = F_a(a) + c = c$. This gives us:
\[ \int_a^b f(t)\,dt = F_a(b) = F(b) - c = F(b) - F(a). \]
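Both statements can be illustrated numerically with the cosine, whose antiderivative is the sine. This is a sketch under assumptions: the helper `riemann_integral` uses midpoints and 100,000 subintervals, and the derivative of the integral function is approximated by a central difference quotient with step 1e-3.

```python
import math

def riemann_integral(f, a, b, k=100_000):
    """Riemann sum of f on [a, b] with k equal subintervals and midpoints."""
    dx = (b - a) / k
    return sum(f(a + (i + 0.5) * dx) for i in range(k)) * dx

a, b = 0.0, 2.0

# Integral of f over [a, b] compared with F(b) - F(a) for F = sin.
numeric = riemann_integral(math.cos, a, b)
exact = math.sin(b) - math.sin(a)
print(numeric, exact)

# The integral function F_a(x) has derivative f(x), checked at x = 1.
def F_a(x):
    return riemann_integral(math.cos, a, x)

h = 1e-3
slope = (F_a(1.0 + h) - F_a(1.0 - h)) / (2 * h)
print(slope, math.cos(1.0))
```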
In the last chapter we have determined the derivatives of many functions. If we now read
all these results from right to left, we suddenly get the antiderivatives of many functions.
Here is a small table:
Integration Rules
We don’t know the antiderivative for the logarithm yet. Also, the antiderivative of $\sqrt{1 - x^2}$, which we need to calculate the area of a circle, is unfortunately not included here. We need additional methods of integration, which we will derive from the corresponding differentiation rules.
Definition 16.11 An arbitrary antiderivative of the integrable function f is denoted by $\int f(x)\,dx$ and is called the indefinite integral of f.
This notation is not quite clean, since $\int f(x)\,dx$ is only determined up to a constant. In general, this does not cause any problems, since when forming a definite integral this constant disappears again.
In the following theorem, I would like to summarize the most important integration rules.
b) The linearity:
For integrable functions f, g and $\alpha, \beta \in \mathbb{R}$ it holds:
\[ \int (\alpha f(x) + \beta g(x))\,dx = \alpha \int f(x)\,dx + \beta \int g(x)\,dx. \]
or:
\[ \int f(g(x))g'(x)\,dx = F(g(x)). \tag{16.6} \]
The rules a) and b) are included here for completeness; they follow from the definition of piecewise continuity or from Theorem 16.5. Partial integration and substitution are new.
Partial integration can be directly traced back to the product rule: It is
\[ (f(x)g(x))' = f'(x)g(x) + f(x)g'(x) \]
and thus
\[ f(x)g(x) = \int f'(x)g(x)\,dx + \int f(x)g'(x)\,dx, \]
from which we obtain (16.4) by rearranging. Inserting the limits of integration results in (16.3).
The substitution rule is the counterpart to the chain rule of differential calculus. It states (compare Theorem 15.5): $f(g(x))' = f'(g(x))g'(x)$.
If we replace f in it with its antiderivative F and f′ correspondingly with f, we get $F(g(x))' = f(g(x))g'(x)$. So $F(g(x))$ is the antiderivative of $f(g(x))g'(x)$, which is precisely the assertion (16.6). If we calculate the definite integral, we get (16.5):
\[ \int_a^b f(g(x))g'(x)\,dx = F(g(x))\Big|_a^b = F(y)\Big|_{g(a)}^{g(b)} = \int_{g(a)}^{g(b)} f(y)\,dy. \]
Examples
First of all, for the application of partial integration. With this rule, an integral of a product $\int u(x)v(x)\,dx$ can be computed by setting $u(x) = f'(x)$ and $v(x) = g(x)$. This results in the integral $\int f(x)g'(x)\,dx$, which is hopefully easier to solve.
1. \[ \int \underbrace{x}_{g} \cdot \underbrace{e^x}_{f'}\,dx = \underbrace{x}_{g} \cdot \underbrace{e^x}_{f} - \int \underbrace{1}_{g'} \cdot \underbrace{e^x}_{f}\,dx = x \cdot e^x - e^x = (x - 1)e^x. \]
2. \[ \int x \sin x\,dx = -x \cdot \cos x + \sin x. \]
Try it yourself!
As you can see, you can eliminate the factor x in a product by partial integration. If the derivative of a function u is known, but not the integral, the approach g = u, f′ = 1 sometimes helps:
3. \[ \int_a^b \ln x\,dx = \int_a^b \underbrace{1}_{f'} \cdot \underbrace{\ln x}_{g}\,dx = \underbrace{x}_{f} \cdot \underbrace{\ln x}_{g}\Big|_a^b - \int_a^b \underbrace{x}_{f} \cdot \underbrace{\frac{1}{x}}_{g'}\,dx = x \cdot \ln x\Big|_a^b - x\Big|_a^b = (x \ln x - x)\Big|_a^b. \]
Here I used the rule in the form (16.3); the antiderivative of ln x is $x \cdot \ln x - x$.
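4. \[ \underbrace{\int \cos^2 x\,dx}_{f' \cdot g} = \underbrace{\sin x \cos x}_{f \cdot g} - \int \underbrace{-\sin^2 x}_{f \cdot g'}\,dx \quad \text{and} \]
\[ \int \sin^2 x\,dx = \int (1 - \cos^2 x)\,dx = \int 1\,dx - \int \cos^2 x\,dx. \]
This reduced the integration of $\cos^2 x$ to the integration of $\cos^2 x$. Are we chasing our own tail? No, because if we insert, we get:
\[ \int \cos^2 x\,dx = \sin x \cos x + \underbrace{\int 1\,dx}_{=x} - \int \cos^2 x\,dx \;\Rightarrow\; 2\int \cos^2 x\,dx = \sin x \cos x + x, \]
that is $\int \cos^2 x\,dx = \frac{1}{2}(\sin x \cos x + x)$.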
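An antiderivative found by partial integration can always be checked by differentiating it again. The sketch below does this numerically for the results of Examples 1 to 4; the helper `derivative` (a central difference quotient with step 1e-5) and the sample points are assumptions for illustration.

```python
import math

def derivative(F, x, h=1e-5):
    """Central difference quotient as an approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

# Pairs (antiderivative F, integrand f) from Examples 1 to 4.
pairs = [
    (lambda x: (x - 1) * math.exp(x), lambda x: x * math.exp(x)),
    (lambda x: -x * math.cos(x) + math.sin(x), lambda x: x * math.sin(x)),
    (lambda x: x * math.log(x) - x, math.log),
    (lambda x: 0.5 * (math.sin(x) * math.cos(x) + x),
     lambda x: math.cos(x) ** 2),
]

max_error = max(abs(derivative(F, x) - f(x))
                for F, f in pairs
                for x in (0.5, 1.0, 2.0))
print(max_error)
```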
Now for the substitution. The rule “$y = g(x)$, $dy = g'(x)\,dx$” is helpful when performing the integration. First of all, the functions f and g must always be identified:
5. $\int_a^b e^{\sin x} \cos x\,dx$.
\[ \int_a^b e^{\sin x} \cos x\,dx \underset{\substack{y = \sin x \\ dy = \cos x\,dx}}{=} \int_{\sin a}^{\sin b} e^y\,dy = e^y\Big|_{\sin a}^{\sin b} = e^{\sin x}\Big|_a^b. \]
At the last equality sign you see that the antiderivative is $e^{\sin x}$.
6. For any continuously differentiable function g that has no zeros in the interval under consideration, the integral $\int \frac{g'(x)}{g(x)}\,dx$ can be calculated. It is $\int \frac{g'(x)}{g(x)}\,dx = \int \frac{1}{g(x)} g'(x)\,dx$. For $f(y) = 1/y$, the antiderivative F is $F(y) = \ln|y|$. With (16.6) we obtain:
\[ \int \frac{g'(x)}{g(x)}\,dx = \ln|g(x)|. \]
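7. We determine $\int_0^{2\pi} \cos kx\,dx$ for $k \in \mathbb{N}$. We set $f(y) = \cos y$ and $g(x) = kx$. Then $g'(x) = k$, $g(0) = 0$, $g(2\pi) = 2\pi k$. Our function does not quite fit yet, but we can always insert a constant factor:
\[ \int_0^{2\pi} \cos kx\,dx = \frac{1}{k} \int_0^{2\pi} \underbrace{\cos kx}_{f(g(x))} \cdot \underbrace{k}_{g'(x)}\,dx \underset{\substack{y = kx \\ dy = k\,dx}}{=} \frac{1}{k} \int_0^{2\pi k} \cos y\,dy = \frac{1}{k} \sin y\Big|_0^{2\pi k} = 0 - 0 = 0. \]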
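The results of Examples 5 and 7 can be cross-checked against Riemann sums. This is only a sketch; the helper `midpoint_integral`, the number of subintervals and the limits a = 0.3, b = 1.7 are arbitrary assumptions.

```python
import math

def midpoint_integral(f, a, b, k=20_000):
    """Riemann sum of f on [a, b] with k equal subintervals and midpoints."""
    dx = (b - a) / k
    return sum(f(a + (i + 0.5) * dx) for i in range(k)) * dx

# Example 5: the integral of e^(sin x) cos x equals e^(sin b) - e^(sin a).
a, b = 0.3, 1.7
lhs = midpoint_integral(lambda x: math.exp(math.sin(x)) * math.cos(x), a, b)
rhs = math.exp(math.sin(b)) - math.exp(math.sin(a))
print(lhs, rhs)

# Example 7: the integral of cos(kx) over [0, 2*pi] vanishes for k = 1, 2, 3.
cos_integrals = [
    midpoint_integral(lambda x, k=k: math.cos(k * x), 0.0, 2 * math.pi)
    for k in (1, 2, 3)
]
print(cos_integrals)
```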
In the last three examples, we interpreted the integrand as $f(g(x))g'(x)$ by careful observation or small rearrangements. Unfortunately, this rarely works out except for nicely constructed exercise examples. I would like to introduce you to a second type of use of the substitution rule. We now read the equation (16.5) from right to left:
\[ \int_{g(a)}^{g(b)} f(y)\,dy = \int_a^b f(g(x))g'(x)\,dx. \]
8. Now we are finally able to calculate the area of a circle (see Example 4 after Theorem 16.5): $\int_{-1}^{1} \sqrt{1 - y^2}\,dy$ is the area of the semicircle with radius 1. We choose as substitution function $g(x) = \sin x$. The sine between $-\pi/2$ and $\pi/2$ is bijective, that is, invertible, and therefore:
\[ \int_{-1}^{1} \sqrt{1 - y^2}\,dy \underset{\substack{y = \sin x \\ dy = \cos x\,dx}}{=} \int_{\sin^{-1}(-1)}^{\sin^{-1}(1)} \sqrt{1 - (\sin x)^2} \cos x\,dx = \int_{-\pi/2}^{\pi/2} \sqrt{1 - (\sin x)^2} \cos x\,dx. \]
Between $-\pi/2$ and $\pi/2$, it holds $\sqrt{1 - (\sin x)^2} = \cos x$. We have already calculated the antiderivative of $\cos^2 x$ in Example 4, and so we get:
\[ \int_{-1}^{1} \sqrt{1 - y^2}\,dy = \int_{-\pi/2}^{\pi/2} \cos^2 x\,dx = \frac{1}{2}(\cos x \sin x + x)\Big|_{-\pi/2}^{\pi/2} \]
\[ = \frac{1}{2}\left(\cos\frac{\pi}{2} \sin\frac{\pi}{2} + \frac{\pi}{2}\right) - \frac{1}{2}\left(\cos\left(-\frac{\pi}{2}\right) \sin\left(-\frac{\pi}{2}\right) + \left(-\frac{\pi}{2}\right)\right) = \frac{\pi}{2}. \]
Now we finally know that our number π, defined as abstractly as “twice the first positive
zero of the power series of the cosine”, really corresponds to the well-known mathematical
constant π. ◄
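The value π/2 for the semicircle can also be approached directly from the definition of the integral. The following sketch sums up rectangles under $\sqrt{1 - y^2}$; the function name `semicircle_area` and the number of 100,000 rectangles are assumptions, and the midpoints are used as intermediate points.

```python
import math

def semicircle_area(k=100_000):
    """Riemann sum of sqrt(1 - y^2) on [-1, 1] with k midpoint rectangles."""
    dy = 2.0 / k
    total = 0.0
    for i in range(k):
        y = -1.0 + (i + 0.5) * dy     # midpoint of the i-th subinterval
        total += math.sqrt(1.0 - y * y) * dy
    return total

area = semicircle_area()
print(area, math.pi / 2)
```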
For the set of rational functions, that is, functions that can be represented as a quotient of
polynomials, one can always specify an antiderivative:
Theorem 16.13 If f (x), g(x) are real polynomials, then a rational function is given
by h(x) = f (x)/g(x), which is defined outside the zeros of the denominator. There
is always an antiderivative to h(x).
With the help of the rather laborious method of partial fraction decomposition, one can decompose h(x) into a sum of fractions of the form
\[ \frac{a}{(x - b)^k} \quad \text{or} \quad \frac{a + bx}{(x^2 - cx + d)^k} \]
with $a, b, c, d \in \mathbb{R}$ and $k \in \mathbb{N}$. The integrals of these fractions can be found in the integral table or with a computer algebra system. I don’t want to go into that any further.
\[ S_k := \sum_{i=1}^{k} f(y_i) \cdot \Delta x_i, \qquad \lim_{k \to \infty} S_k = \int_a^b f(x)\,dx. \]
Volumes of Shapes
Let a shape be given in the three-dimensional space R3. As a function of x, let the cross-
sectional area F(x) be known (Fig. 16.9).
Examples
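1. The cylinder:
\[ F(x) = \begin{cases} 0 & x < a,\ x > b \\ r^2\pi & a \le x \le b. \end{cases} \]
2. The sphere: For |x| < R the radius at the point x is $r(x) = \sqrt{R^2 - x^2}$ and thus
\[ F(x) = \begin{cases} 0 & x < -R,\ x > R \\ (R^2 - x^2)\pi & -R \le x \le R. \end{cases} \]
3. The cone:
\[ F(x) = \begin{cases} 0 & x < 0,\ x > H \\ \dfrac{R^2\pi}{H^2} x^2 & 0 \le x \le H. \end{cases} \] ◄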
Now if we want to calculate a volume, we put slices of thickness $\Delta x_i$ next to each other. If $y_i \in [x_{i-1}, x_i]$, then the volume of the i-th slice is $V_i = F(y_i)\Delta x_i$ and for the volume of the entire shape we get the approximation:
\[ V \approx \sum_{i=1}^{k} F(y_i)\Delta x_i. \]
[Figure: cylinder (radius r, between a and b), sphere (radius R) and cone (radius R, height H) with a slice at $x_i$]
Example
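\[ \int_0^H \frac{R^2\pi}{H^2} x^2\,dx = \frac{R^2\pi}{H^2} \cdot \frac{x^3}{3}\Big|_0^H = \frac{R^2\pi H^3}{H^2 \cdot 3} = \frac{1}{3} R^2\pi H = \frac{1}{3} \cdot \text{base area} \cdot \text{height}. \] ◄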
I leave the calculation of the volume of a sphere as an exercise. Compare your result with
the value from the formulary!
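The slicing idea translates directly into a small program. The sketch below adds up slices F(y_i)·Δx_i for the cone and, as a numerical cross-check only, for the sphere; the function name `volume`, the slice count and the sample values R = 2, H = 5 are assumptions.

```python
import math

def volume(cross_section, a, b, k=10_000):
    """Sum of k slices of equal thickness, evaluated at the slice midpoints."""
    dx = (b - a) / k
    return sum(cross_section(a + (i + 0.5) * dx) for i in range(k)) * dx

R, H = 2.0, 5.0

cone = volume(lambda x: R**2 * math.pi * x**2 / H**2, 0.0, H)
print(cone, R**2 * math.pi * H / 3)           # one third base area times height

sphere = volume(lambda x: (R**2 - x**2) * math.pi, -R, R)
print(sphere, 4 / 3 * math.pi * R**3)         # value from the formulary
```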
[Figure: a curve that is a function graph and a curve that is no function graph]
Examples
You see that the parametrization of a curve is by no means unique, which corresponds
to the fact that paths can be traversed at different speeds. Curves can also intersect, in
the example of the circle the initial and final point are the same, that is also possible. Of
course, one can also investigate curves in R3 and in Rn, which describe the flight of a fly
or a hyperspaceship.
With the help of integral calculus, we can determine the arc length of such a curve. I restrict myself to the two-dimensional case. We want to reduce the task to Riemann sums again and divide the curve into ever shorter pieces, which we approximate by line segments.
We investigate the curve $s : [a, b] \to \mathbb{R}^2$, $t \mapsto (x(t), y(t))$ with the partition $a = t_0 < t_1 < \cdots < t_n = b$ and use the notation as in Fig. 16.11:
\[ \Delta t_i = t_i - t_{i-1}, \quad \Delta x_i = x(t_i) - x(t_{i-1}), \quad \Delta y_i = y(t_i) - y(t_{i-1}), \quad \Delta s_i = \sqrt{\Delta x_i^2 + \Delta y_i^2}. \]
[Fig. 16.11: curve with the partition points $t_0 = a$, $t_1$, $t_2$, ..., $t_n = b$ and the segments $\Delta x_2$, $\Delta y_2$, $\Delta s_2$]
We refine the partition so that the $\Delta t_i$ go to 0. For the approximate arc length $L_n$ we then have:
\[ L_n = \sum_{i=1}^{n} \Delta s_i = \sum_{i=1}^{n} \sqrt{\Delta x_i^2 + \Delta y_i^2}. \tag{16.7} \]
From the mean value theorem of differential calculus 15.12, applied to x(t) and y(t), we learn that there are $u_i$ and $v_i$ in the interval $[t_{i-1}, t_i]$ with the property
\[ x'(u_i) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} = \frac{\Delta x_i}{\Delta t_i}, \qquad y'(v_i) = \frac{y(t_i) - y(t_{i-1})}{t_i - t_{i-1}} = \frac{\Delta y_i}{\Delta t_i}, \]
i.e. $\Delta x_i = x'(u_i)\Delta t_i$ and $\Delta y_i = y'(v_i)\Delta t_i$. Inserted into (16.7) we get:
\[ L_n = \sum_{i=1}^{n} \sqrt{x'(u_i)^2 + y'(v_i)^2} \cdot \Delta t_i, \qquad u_i, v_i \in [t_{i-1}, t_i]. \]
This is not quite, but almost a Riemann sum of the function $\sqrt{x'(t)^2 + y'(t)^2}$. The only problem is that the elements $u_i$, $v_i$ in the interval $[t_{i-1}, t_i]$ can be different.
At this point I would like to abbreviate the derivation. It can be shown that for finer and finer partitions, the arc length L can really be calculated as the limit of Riemann sums of this function, it is
\[ L = \int_a^b \sqrt{x'(t)^2 + y'(t)^2}\,dt. \]
And since (t, f (t)) is the parametrization of the function f , we immediately get as a con-
sequence:
Example
The circumference of a circle with radius r is, according to Example 2 from before:
\[ L = \int_0^{2\pi} \sqrt{r^2 \sin^2 t + r^2 \cos^2 t}\,dt = \int_0^{2\pi} r\,dt = rt\Big|_0^{2\pi} = 2\pi r. \] ◄
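The approximation (16.7) by line segments can be computed directly for the circle. In this sketch the parametrization t ↦ (r cos t, r sin t) on [0, 2π] is assumed for the circle, and the function name `polygon_length` as well as the radius r = 3 are chosen only for illustration.

```python
import math

def polygon_length(x, y, a, b, n):
    """Length of the polygon through n + 1 equally spaced curve points."""
    ts = [a + (b - a) * i / n for i in range(n + 1)]
    return sum(math.hypot(x(t1) - x(t0), y(t1) - y(t0))
               for t0, t1 in zip(ts, ts[1:]))

r = 3.0

def circle_x(t):
    return r * math.cos(t)

def circle_y(t):
    return r * math.sin(t)

# The polygon lengths increase towards the circumference 2 * pi * r.
for n in (4, 16, 64, 1024):
    print(n, polygon_length(circle_x, circle_y, 0.0, 2 * math.pi, n))
print(2 * math.pi * r)
```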
0 0
16.2 Applications of the Integral 405
Improper Integrals
The integral $\int_a^b \frac{1}{x}\,dx$ exists for all a, b > 0 and for all a, b < 0. But what happens if a approaches 0, or b approaches infinity (Fig. 16.12)? Does there exist for a < 0 and b > 0 a finite area between a and b? To answer such questions, we need to carry out limit considerations for the integrals.
Definition 16.18 Let the function f be defined on [a, b[ and let the integral $\int_a^c f(x)\,dx$ exist for all c between a and b. Then
\[ \int_a^b f(x)\,dx := \lim_{c \to b} \int_a^c f(x)\,dx \]
is called the improper integral of f if this limit exists. For b, the value ∞ is also allowed; for the limit, the values ±∞ are allowed.
Examples
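1. \[ \int_0^1 \frac{1}{x}\,dx = \lim_{c \to 0} \int_c^1 \frac{1}{x}\,dx = \lim_{c \to 0} \left(\ln x\Big|_c^1\right) = \lim_{c \to 0} (-\ln c) = \infty. \]
2. \[ \int_1^{\infty} \frac{1}{x}\,dx = \lim_{c \to \infty} \int_1^c \frac{1}{x}\,dx = \lim_{c \to \infty} \left(\ln x\Big|_1^c\right) = \lim_{c \to \infty} (\ln c) = \infty. \]
The function 1/x therefore has no finite area. It looks different with $1/x^2$: The area between 1 and ∞ remains finite:
3. \[ \int_1^{\infty} \frac{1}{x^2}\,dx = \lim_{c \to \infty} \int_1^c \frac{1}{x^2}\,dx = \lim_{c \to \infty} \left(-\frac{1}{x}\Big|_1^c\right) = \lim_{c \to \infty} \left(-\frac{1}{c} + \frac{1}{1}\right) = 1. \]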
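4. \[ \int_{-1}^{1} \frac{1}{\sqrt{|x|}}\,dx = \int_{-1}^{0} \frac{1}{\sqrt{|x|}}\,dx + \int_0^1 \frac{1}{\sqrt{|x|}}\,dx \quad \text{(Fig. 16.13):} \]
The antiderivative of $1/\sqrt{x}$ is $2\sqrt{x}$. It follows that
\[ \int_0^1 \frac{1}{\sqrt{|x|}}\,dx = \lim_{c \to 0} 2\sqrt{x}\Big|_c^1 = 2. \]
It is clear that the left half is just as big, but we can also check it again using the substitution rule:
\[ \int_{-1}^{0} \frac{1}{\sqrt{|x|}}\,dx = \int_{-1}^{0} \frac{1}{\sqrt{-x}}\,dx \underset{y = -x,\ dy = -dx}{=} -\int_{+1}^{0} \frac{1}{\sqrt{y}}\,dy = \int_0^1 \frac{1}{\sqrt{y}}\,dy = 2. \] ◄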
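The limit considerations of the examples can be followed with a few numbers. The sketch below evaluates the proper integrals for limits c that approach 0 or grow large, using the antiderivatives from the examples; the three function names are hypothetical.

```python
import math

def area_1_over_x_near_zero(c):
    """Integral of 1/x from c to 1, that is ln(1) - ln(c)."""
    return -math.log(c)

def area_1_over_x_squared(c):
    """Integral of 1/x^2 from 1 to c, that is -1/c + 1."""
    return 1.0 - 1.0 / c

def area_1_over_sqrt_x(c):
    """Integral of 1/sqrt(x) from c to 1, that is 2 - 2*sqrt(c)."""
    return 2.0 - 2.0 * math.sqrt(c)

for c in (1e-2, 1e-4, 1e-8):
    print(c, area_1_over_x_near_zero(c), area_1_over_sqrt_x(c))
for c in (1e2, 1e4, 1e8):
    print(c, area_1_over_x_squared(c))
```

The first column of values keeps growing, while the other two settle near 2 and 1 respectively.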
[Figure: the functions cos x, cos 2x and cos 3x between 0 and 2π]
In this section we will only investigate periodic functions. If f is periodic with period 2π, then with the help of the transformation $x \mapsto x \cdot (2\pi/T)$ one can easily create a function with any period T: $f(x \cdot (2\pi/T))$ is then periodic with period T, because
\[ f\left((x + T) \cdot \frac{2\pi}{T}\right) = f\left(x \cdot \frac{2\pi}{T} + T \cdot \frac{2\pi}{T}\right) = f\left(x \cdot \frac{2\pi}{T} + 2\pi\right) = f\left(x \cdot \frac{2\pi}{T}\right). \]
We restrict ourselves in our following investigations to the period 2π; all statements can be transferred to functions of arbitrary period T by replacing x everywhere by $x \cdot (2\pi/T)$.
The idea behind the Fourier expansion is that many functions can be composed of such individual oscillations: The sound of a violin consists of a fundamental frequency with some overtones, which are also oscillations. Even the most complex piece of music is only made up of an overlay of such oscillations. The same applies, for example, to the running noise of a turbine, the data transmitted over a channel, or the color information of an image that is scanned point by point. Which class of functions can be decomposed in this way?
The following relations between cosine and sine are an essential basis for the Fourier
expansion of a function:
I calculate the first of these relations as an example: In the formulary we find the rule
cos α cos β = 1/2(cos(α − β) + cos(α + β)). If m = n, then we get from it, according
to Example 7 at the end of Sect. 16.1,
ˆ 2π
1 2π
ˆ
cos(mx) cos(nx)dx = cos((m − n)x) − cos((m + n)x)dx = 0,
0 2 0
408 16 Integral Calculus
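and for $m = n$:
$$\int_0^{2\pi}\cos^2(mx)\,dx = \frac{1}{m}\int_0^{2\pi}\cos^2(mx)\,m\,dx \underset{y=mx,\ dy=m\,dx}{=} \frac{1}{m}\int_0^{2\pi m}\cos^2 y\,dy = \frac{1}{m}\cdot\frac{1}{2}\Big[\sin(y)\cos(y) + y\Big]_0^{2\pi m} = \pi.$$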
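The orthogonality relations can also be checked numerically. The following is only a minimal sketch: the helper names `integrate`, `cos_cos`, `sin_sin` and `sin_cos` are my own, and the integral is approximated with a simple midpoint rule, so the printed values are approximations. The values for the sine integrals are the standard ones (0 for m ≠ n, π for m = n, and 0 for every mixed product of sine and cosine).

```python
import math

def integrate(g, a, b, steps=2000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def cos_cos(m, n):
    """Integral of cos(mx) * cos(nx) over [0, 2*pi]."""
    return integrate(lambda x: math.cos(m * x) * math.cos(n * x), 0.0, 2 * math.pi)

def sin_sin(m, n):
    """Integral of sin(mx) * sin(nx) over [0, 2*pi]."""
    return integrate(lambda x: math.sin(m * x) * math.sin(n * x), 0.0, 2 * math.pi)

def sin_cos(m, n):
    """Integral of sin(mx) * cos(nx) over [0, 2*pi]."""
    return integrate(lambda x: math.sin(m * x) * math.cos(n * x), 0.0, 2 * math.pi)

if __name__ == "__main__":
    # Table of all three integrals for small m and n.
    for m in range(1, 4):
        for n in range(1, 4):
            print(m, n, round(cos_cos(m, n), 6), round(sin_sin(m, n), 6), round(sin_cos(m, n), 6))
```

In the printed table, the entries with m = n should be close to π for the first two integrals, and everything else should be close to 0.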
Now let real numbers an, n ∈ N0 and bn, n ∈ N be given, and first of all assume that the
following series converges for all x ∈ R and thus represents a function f :
$$f(x) = \frac{1}{2}a_0 + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big). \tag{16.9}$$
This series is then called the Fourier series of f . Since all the individual functions
involved are periodic, the function f is also periodic with period 2π.
Let’s try to calculate the coefficients ak, bk for a function f which can be represented
in the form (16.9). We use Theorem 16.19: The trick is to multiply from the right with
cos(kx) or sin(kx) and then to integrate from 0 to 2π:
$$\int_0^{2\pi} f(x)\cos(kx)\,dx = \int_0^{2\pi}\left(\frac{1}{2}a_0 + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big)\right)\cos(kx)\,dx.$$
Under certain conditions, which I will not go into here, summation and integration can
be interchanged (for example, if $\sum_{n=0}^{\infty} a_n$ and $\sum_{n=0}^{\infty} b_n$ are absolutely convergent). Since
the trigonometric functions are orthogonal to each other, almost everything cancels out
in this rearrangement:
$$\begin{aligned}
\int_0^{2\pi} f(x)\cos kx\,dx &= \underbrace{\frac{1}{2}a_0\int_0^{2\pi}\cos kx\,dx}_{=0} + \sum_{n=1}^{\infty}\underbrace{a_n\int_0^{2\pi}\cos(nx)\cos(kx)\,dx}_{=0 \text{ for } n\neq k} + \sum_{n=1}^{\infty}\underbrace{b_n\int_0^{2\pi}\sin(nx)\cos(kx)\,dx}_{=0}\\
&= a_k\int_0^{2\pi}\cos(kx)\cos(kx)\,dx = a_k\pi.
\end{aligned}$$
16.3 Fourier Series 409
This is how we get the coefficients ak. If we multiply (16.9) by sin(kx) and integrate, we
get the bk, and finally an integration without any multiplication gives us the coefficient a0:
$$a_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\cos kx\,dx,\qquad b_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\sin kx\,dx,\qquad a_0 = \frac{1}{\pi}\int_0^{2\pi} f(x)\,dx. \tag{16.10}$$
There is also linear algebra behind this calculation: The $i$-th coordinate of the vector $v$ in an
orthonormal basis is obtained by scalar multiplication of the vector with the basis vector $b_i$:
$v_i = \langle v, b_i\rangle$. Try it out! This is exactly the scalar product we have carried out here.
The first coefficient $a_0/2$ can be interpreted quite simply: It is $2\pi\cdot\frac{a_0}{2} = \int_0^{2\pi} f(x)\,dx$, so
$a_0/2$ is the “average” function value between 0 and 2π: The area of the rectangle with the
height $a_0/2$ is equal to the area under $f$.
If a function f has a Fourier series representation and summation and integration can
be interchanged, then the coefficients have the form calculated in (16.10). More interesting
is the question for which functions f such a Fourier series actually exists. The next
theorem provides information about this. It is very difficult to prove, but the derivation
above should provide some motivation for its content:
Theorem 16.20 Let $f$ be piecewise continuous and periodic with period $2\pi$. Let
the coefficients $a_k$, $b_k$ be determined as in (16.10). Further let
$$S_n(x) := \frac{1}{2}a_0 + \sum_{k=1}^{n}\big(a_k\cos(kx) + b_k\sin(kx)\big).$$
Then
$$\lim_{n\to\infty}\int_0^{2\pi}\big(S_n(x) - f(x)\big)^2\,dx = 0. \tag{16.11}$$
What does (16.11) mean? $\int_0^{2\pi}(S_n(x) - f(x))^2\,dx$ measures the difference between the
areas of $S_n$ and $f$, with all contributions being positive because of the square. This
means that the area between $f$ and $S_n$ gets smaller and smaller; it goes to 0. The relation
(16.11) is expressed as: $S_n$ converges in mean square to $f$. The coefficients $a_k$, $b_k$ form null
sequences, and if $f$ is well behaved, then the convergence is also quite fast. However, for
each individual point $x$ it does not have to be the case that $\lim_{n\to\infty} S_n(x) = f(x)$; in general
really only the area between the functions goes to 0.
The formula (16.11) can be expressed again in the language of linear algebra:
$\frac{1}{\pi}\int_0^{2\pi}(S_n(x) - f(x))^2\,dx$ is nothing other than the scalar product $\langle S_n(x) - f(x), S_n(x) - f(x)\rangle$.
In the norm that belongs to this scalar product, it therefore applies that
$\lim_{n\to\infty}\|S_n(x) - f(x)\| = 0$. This means that with respect to this norm $S_n$ converges to $f$.
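Consider the rectangle function $f$ with $f(x) = 1$ for $0 \le x < \pi$ and $f(x) = 0$ for $\pi \le x < 2\pi$,
which is continued periodically on R. Imagine a data line in which the bits 1 and 0 are
continuously sent. Now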
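$$a_0 = \frac{1}{\pi}\int_0^{2\pi} f(x)\,dx = \frac{1}{\pi}\int_0^{\pi} 1\,dx = \frac{1}{\pi}\,\pi = 1,$$
$$a_k = \frac{1}{\pi}\int_0^{\pi} 1\cdot\cos kx\,dx = \frac{1}{k\pi}\int_0^{\pi} k\cos kx\,dx = \frac{1}{k\pi}\int_0^{k\pi}\cos y\,dy = \frac{1}{k\pi}\Big[\sin y\Big]_0^{k\pi} = 0,$$
$$b_k = \frac{1}{\pi}\int_0^{\pi} 1\cdot\sin kx\,dx = \dots = \frac{1}{k\pi}\int_0^{k\pi}\sin y\,dy = -\frac{1}{k\pi}\Big[\cos y\Big]_0^{k\pi} = -\frac{1}{k\pi}\cdot\begin{cases}1 - 1 & k \text{ even}\\ -1 - 1 & k \text{ odd,}\end{cases}$$
so
$$b_k = \begin{cases}0 & k \text{ even,}\\ \dfrac{2}{k\pi} & k \text{ odd}\end{cases}$$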
[Figure: partial sums of the Fourier series with 1, 2, 4 and 8 terms]
iterations, so the approximation always differs a whole piece from the given function at
individual points.
However, in this particular example, $\lim_{n\to\infty} S_n(x) = f(x)$ holds for every fixed $x$ at which
the function is continuous. Try to visualize this with a drawing!
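If you prefer a computer to a drawing, the partial sums can be evaluated directly. This is a minimal sketch under the assumption that f is the rectangle function with value 1 on [0, π) and 0 on [π, 2π), so that a_0 = 1, a_k = 0 and b_k = 2/(kπ) for odd k, as calculated above. The function names `b`, `partial_sum` and `rectangle` are my own.

```python
import math

def b(k):
    """Coefficient b_k of the rectangle function: 2/(k*pi) for odd k, 0 for even k."""
    return 2.0 / (k * math.pi) if k % 2 == 1 else 0.0

def partial_sum(x, n):
    """S_n(x) = a_0/2 + sum of b_k * sin(kx) for k = 1..n, with a_0 = 1 and all a_k = 0."""
    return 0.5 + sum(b(k) * math.sin(k * x) for k in range(1, n + 1))

def rectangle(x):
    """The assumed rectangle function: 1 on [0, pi), 0 on [pi, 2*pi), continued periodically."""
    return 1.0 if (x % (2 * math.pi)) < math.pi else 0.0

if __name__ == "__main__":
    x = math.pi / 2  # a point at which f is continuous
    for n in (1, 2, 4, 8, 100, 1000):
        print(n, partial_sum(x, n), rectangle(x))
```

At the point π/2 the partial sums approach the function value 1, although quite slowly. At the jump x = π every partial sum stays at about 1/2, the mean of the two neighboring function values.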
Every data channel has attenuation properties that limit its capacity. This attenuation is
frequency-dependent: Different frequencies are filtered out to different degrees for tech-
nical reasons. If you know these parameters, you can adjust the corresponding frequen-
cies in the above Fourier expansion and see how the data that you initially fed into the
line as nice rectangles comes out at the other end. Can you still identify the 0s and 1s
there? In this way, you can determine the capacity of a data line.
Similarly, for example, by analyzing the running noise of a turbine, you can identify
changes that are very likely to indicate a defect.
The human ear, by the way, works on the same principle, it performs a frequency
analysis: Different frequencies stimulate different parts of the cochlea in the inner ear.
Fourier series were developed by Joseph Fourier (1768–1830). Fourier was a contemporary
and confidant of Napoleon and is considered one of the fathers of mathematical physics. He
came across his series while analyzing heat conduction problems. His results were certainly
recognized by the mathematical establishment of his time, but he was severely criticized for
the poor mathematical representation and the incomplete proof. Fourier was not deterred
by this criticism, nor was the wide ranging applicability of his theory affected by it. In fact,
some of his theorems were not fully proven until the 20th century. In 2006, the mathematician
Lennart Carleson received the Abel Prize (see the note after Definition 5.2) mainly for
a work from 1966 in which he showed that the Fourier series of every continuous function
converges to the function almost everywhere. But Fourier simply had the right instinct. The
dispute between the nagging mathematicians who cling to the pure doctrine and the users who
recklessly calculate is still present today. In defense of the mathematicians, it should be said
that the toolbox should be in good order, even if the user sometimes grabs the wrong wrench.
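$$a_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\cos(kx)\,dx \approx \frac{1}{\pi}\sum_{i=0}^{N-1} f(x_i)\cos(kx_i)\cdot\frac{2\pi}{N} = \frac{2}{N}\cdot\sum_{i=0}^{N-1} f(x_i)\cos(kx_i),$$
just as $b_k \approx \frac{2}{N}\sum_{i=0}^{N-1} f(x_i)\sin(kx_i)$ and $a_0 \approx \frac{2}{N}\cdot\sum_{i=0}^{N-1} f(x_i)$. The coefficient $a_0$ is again
twice the average of the N function values.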
Of course, this numerical integration makes errors that become larger the larger k
becomes, because then the function wobbles back and forth more and more wildly. With
our few interpolation points, we can hardly hope to get a reasonably accurate integral.
One first effect of this error is that the sequences ak, bk are no longer null sequences. In
fact, they are periodic sequences, because from
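$$a_{k+N} = \frac{2}{N}\cdot\sum_{i=0}^{N-1} f(x_i)\cos((k+N)x_i),$$
$$\cos((k+N)x_i) = \cos\left((k+N)\cdot\frac{2\pi}{N}\,i\right) = \cos\left(k\cdot\frac{2\pi}{N}\,i + i\,2\pi\right) = \cos\left(k\cdot\frac{2\pi}{N}\,i\right) = \cos(kx_i)$$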
[Figure: the sample points x_0, x_1, x_2, …, x_{N−1} between 0 and 2π]
$$a_0 = \frac{2}{N}\cdot\sum_{i=0}^{N-1} f(x_i),\qquad a_k = \frac{2}{N}\sum_{i=0}^{N-1} f(x_i)\cos(kx_i),\qquad b_k = \frac{2}{N}\sum_{i=0}^{N-1} f(x_i)\sin(kx_i),$$
If N is even, then bN/2 = 0, so that in any case only N “real” coefficients have to be cal-
culated. From the N given points of the function f (x), exactly N Fourier coefficients are
determined.
The absolutely amazing thing about the deliberately error-prone construction is that
the function can still be represented as a “Fourier series” using these coefficients. Even
a finite sum using the N calculated coefficients is sufficient, and it is not just an approxi-
mation, the representation is exact:
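For odd N:
$$f(x_i) = \frac{a_0}{2} + \sum_{k=1}^{\lfloor N/2\rfloor}\big(a_k\cos(kx_i) + b_k\sin(kx_i)\big)$$
For even N:
$$f(x_i) = \frac{a_0}{2} + \sum_{k=1}^{N/2-1}\big(a_k\cos(kx_i) + b_k\sin(kx_i)\big) + \frac{a_{N/2}}{2}\cos((N/2)\cdot x_i).$$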
I do not want to prove this theorem, but it is interesting that the proof does not require
any analytical tools. It is a purely algebraic calculation in which, using elementary prop-
erties of sine and cosine, a linear system of equations for the N coefficients is solved.
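Since I am not proving the theorem, here is at least a way to try it out. This is a minimal sketch of the direct O(N²) calculation, not a fast transform. The function names `dft_coefficients`, `reconstruct` and `max_error` are my own, and the sample points are assumed to be x_i = 2πi/N.

```python
import math

def dft_coefficients(values):
    """Discrete coefficients a_k, b_k for k = 0..N//2 from the N samples f(x_i)."""
    n = len(values)
    xs = [2 * math.pi * i / n for i in range(n)]
    ks = range(n // 2 + 1)
    a = [2 / n * sum(v * math.cos(k * x) for v, x in zip(values, xs)) for k in ks]
    b = [2 / n * sum(v * math.sin(k * x) for v, x in zip(values, xs)) for k in ks]
    return a, b

def reconstruct(a, b, n, i):
    """Evaluate the finite sum at x_i; odd and even n use the two different formulas."""
    x = 2 * math.pi * i / n
    top = n // 2 if n % 2 == 1 else n // 2 - 1
    s = a[0] / 2 + sum(a[k] * math.cos(k * x) + b[k] * math.sin(k * x)
                       for k in range(1, top + 1))
    if n % 2 == 0:
        # For even n the last cosine coefficient enters with the factor 1/2.
        s += a[n // 2] / 2 * math.cos((n // 2) * x)
    return s

def max_error(values):
    """Largest deviation between the samples and their reconstruction."""
    a, b = dft_coefficients(values)
    n = len(values)
    return max(abs(reconstruct(a, b, n, i) - values[i]) for i in range(n))

if __name__ == "__main__":
    print(max_error([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]))  # even N
    print(max_error([3.0, 1.0, 4.0, 1.0, 5.0]))                  # odd N
```

Both printed deviations should be of the size of the floating-point rounding error, so the representation is exact apart from rounding.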
The runtime for calculating the Fourier coefficients is of the order $O(N^2)$ in both directions.
In the 1960s, implementations were developed that reduce this effort to $O(N\cdot\log N)$,
a tremendous improvement. Such implementations are summarized under the name Fast
Fourier Transform (FFT). The possible uses for the Fourier transform increased enor-
mously. For example, real-time transforms are now also possible, as they are, for example,
required for the turbine already mentioned. I would like to sketch one application example
for you: the use of the DFT for image compression:
Let’s start with a black and white image. Already when the image is read in by a scan-
ner, data loss occurs: on the one hand through the size of the pixels, on the other hand
through the continuous course of the gray values being pressed into a finite scale, typi-
cally 256 gray levels, which occupy one byte of memory space per pixel. Nevertheless,
the memory requirements for such naked image data are enormous.
If the image is further compressed, another data loss usually results. But this is
designed so that no visible deterioration of the image quality takes place for the eye. For
example, another coarsening of the gray levels can be carried out, because 256 gray val-
ues are difficult to distinguish for the human eye.
How do you proceed? In a first approach, the gray values could be rounded even fur-
ther. This would have to be done in the same way for all values, since all gray levels are
of equal importance. No account can be taken of the content of the image. So for soft,
low-contrast images, perhaps a completely different type of rounding would produce
good results than for high-contrast images.
At this point, the Fourier transformation can be used: First we decompose the image
into manageable parts, for example into sub-images with a size of 8 × 8 pixels. The func-
tion values f (xi ) are the gray levels of the pixels.
Now we carry out the DFT line by line with these points and get the coefficients $a_k$
and $b_k$. For the function values $f(x_i)$ it then applies:
$$f(x_i) = \frac{a_0}{2} + \sum_{k=1}^{15}\big(a_k\cos(kx_i) + b_k\sin(kx_i)\big) + \frac{a_{16}}{2}\cos(16x_i).$$
The data that we store compressed are no longer the function values themselves, but the
Fourier coefficients. It turns out that these play a completely different role for the image
information. $a_0/2$, for example, represents the mean gray value of the image, an important
piece of information that should be stored very precisely. Various experiments and experience
show that the higher-frequency components (larger k) contribute much less to the
image information than the low-frequency components (small k). Therefore, the coeffi-
cients with a larger index can be rounded much coarser or even completely thrown away
without causing a visible image loss. With the same amount of data, this results in much
less information loss than without the Fourier transformation.
In the well-known jpeg compression, images are treated with the discrete cosine trans-
formation (DCT), which is a close relative of the Fourier transformation. This is carried
out on sub-images of the size 8 × 8, but not line by line, but in a two-dimensional form.
16.4 Comprehension Questions and Exercises 415
The resulting coefficients are then rounded according to fixed rules laid down in tables.
These rounded coefficients are then Huffman-coded. The Huffman coding is a very effec-
tive supplement to the DCT, as the probability of occurrence of different coefficients can
vary greatly.
The approach of approximating functions by linear combinations of other, linearly
independent functions and then calculating with the coefficients of these approxima-
tions is fruitful in many areas of mathematics over and over again. The transformation of
functions using wavelets is very current: Starting from a wavelet prototype, a family of
orthogonal functions is formed, with the help of which it is possible to approximate the
original function well in different scales, on a large and small scale. There are also dis-
crete, fast variants of the wavelet transformation, which can be used, for example, in data
compression with even better results than the Fourier transform. The latest jpeg standard
also includes the use of wavelets.
Comprehension questions
1. Let f : [a, b] → R be continuous. Does the definite integral $\int_a^b f(x)\,dx$ then exist?
2. Let f : [−a, a] → R be an integrable, even function. What is $\int_{-a}^{a} f(x)\,dx$?
3. Explain the fundamental theorem of calculus.
4. If the antiderivative F(x) of the function f (x) exists, is F : [a, b] → R then continu-
ous and differentiable?
5. f : [a, b] → R is a non-periodic continuous function. Can you still set up a Fourier
series for f ?
6. In the note to Theorem 16.19 I said that the functions cos(nx), sin(nx) on the vec-
tor space of functions integrable on [0, 2π] almost form a basis. Why only almost?
Take a look again at the definitions 6.16 and 6.5.
7. Can you explain the term “convergence in the mean square”?
Exercises
1. Calculate the following integrals:
a) $\int_0^{2\pi}\sin(x)\cos(x)\,dx$,
b) $\int x\ln x\,dx$,
c) $\int_a^b x^2 e^x\,dx$,
d) $\int_0^1 (3x-2)^2\,dx$,
e) $\int_{-2}^{2}\frac{1}{2x-8}\,dx$.
2. Calculate $\int_0^{2\pi}\cos^2(nx)\,dx$, $n \in \mathbb{N}$. Use Example 4 after Theorem 16.12.
3. Calculate an antiderivative for tan(x) in the range −π/2 < x < π/2. Use Example
6 after Theorem 16.12.
4. One of the two following integrals can be calculated (as a definite integral), one
cannot. Calculate the integral or explain why it does not work:
a) $\int_{-3}^{3}\frac{7}{3x+4}\,dx$,
b) $\int_{-1}^{5}\frac{2}{6x+9}\,dx$.
5. Calculate the arc length of the unit circle between (1, 0) and (cos α, sin α).
As a result, you find that to the angle α in radians belongs exactly the circular arc of
length α.
Abstract
Late risers have the following serious problem to solve if they do not want to come to the
lecture too late: The coffee from the machine is too hot to drink, it has to be brought to
drinking temperature as quickly as possible. Does the coffee cool down faster if you add
the sugar immediately and then wait, or is it smarter to wait for a while and then add the
sugar?
Let’s do a proper problem analysis: We are given TK (t), the coffee temperature at time
t , the temperature of the surrounding air TL and the maximum drinking temperature Tm.
Let’s first neglect the sugar. We assume that the cooling takes place more quickly the
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, 417
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_17
418 17 Differential Equations
greater the temperature difference between coffee and air is, the cooling is proportional
to this difference.
This cooling is nothing other than the change in temperature over time, that is, its
derivative $T_K'(t)$. We get the equation:
$$T_K'(t) = c\,(T_K(t) - T_L) \tag{17.1}$$
with a proportionality factor c < 0: Since the temperature decreases, the derivative is
negative.
We are looking for a function TK (t) which satisfies (17.1). Let’s guess first. The
desired function looks similar to its derivative, and so the exponential function is a good
choice. We try
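$$T_K(t) = \alpha e^{\beta t} + \gamma.$$
We have to take a few variables into the approach, because $T_K$ is not exactly the exponential
function. Let’s take the derivative: It is $T_K'(t) = \alpha\beta e^{\beta t}$, and inserted in (17.1) we get
$$\beta\alpha e^{\beta t} = c\,(\alpha e^{\beta t} + \gamma - T_L).$$
For $\gamma = T_L$ and $\beta = c$ this equation is fulfilled, so the function $T_K(t) = \alpha e^{ct} + T_L$ satisfies (17.1).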
[Figure: coffee temperature over time, with the air temperature T_L marked]
17.1 What are Differential Equations? 419
[Figure: coffee temperature over time, with T_m and T_L marked]
If you wait for a while and then dissolve the sugar, the cooling takes place faster. You
can see that in the solid curve.
Unfortunately, the computational determination of the optimal sugar injection time takes
longer than breakfast.
An equation of the form TK ′ (t) = c(TK (t) − TL ) is called a differential equation. In dif-
ferential equations, in addition to an unknown function, derivatives of this function also
occur. In our first example, we guessed a solution. However, this was only completely
determined by another condition: the initial condition TK (0) = T0.
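The guessed solution can be compared with a purely numerical one. The sketch below assumes the closed form T_K(t) = (T_0 − T_L)·e^{ct} + T_L, which results from the approach with γ = T_L, β = c and from the initial condition T_K(0) = T_0. The temperatures 90 and 20 and the factor c = −0.1 are made-up values, and the names `closed_form` and `euler` are my own. Euler's method simply follows the derivative in small steps.

```python
import math

def closed_form(t, T0, TL, c):
    """T_K(t) = (T0 - TL) * exp(c*t) + TL, the guessed solution with T_K(0) = T0."""
    return (T0 - TL) * math.exp(c * t) + TL

def euler(t_end, T0, TL, c, steps=10000):
    """Integrate T' = c * (T - TL) from 0 to t_end with the explicit Euler method."""
    h = t_end / steps
    T = T0
    for _ in range(steps):
        T += h * c * (T - TL)
    return T

if __name__ == "__main__":
    # Hypothetical values: coffee at 90 degrees, air at 20 degrees, c = -0.1 per minute.
    for t in (0.0, 5.0, 10.0, 30.0):
        print(t, closed_form(t, 90.0, 20.0, -0.1), euler(t, 90.0, 20.0, -0.1))
```

The two columns should agree up to the discretization error of the Euler method, which shrinks when `steps` is increased.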
In the next example, we examine a pendulum. In order to have to calculate only in
one dimension, we use a spring pendulum: A weight hangs on a spring and can move
in y-direction, in the rest position the pendulum should be at the point y = 0. We set
the pendulum in oscillation and want to determine the displacement y(t) as a function of
time t .
Hooke’s law is known from physics: With the displacement y, the restoring force F(t)
is proportional to y. At any time t therefore F(t) = a · y(t) applies. Here a < 0 is the
spring constant.
Newton’s law states that the acceleration that the weight experiences is proportional
to the force acting. The acceleration is the second derivative of the function y(t), and so
we get a second equation F(t) = b · y′′ (t), where b > 0. Combined, this results in a dif-
ferential equation for y:
$$(\beta^2 - c)\,\alpha e^{\beta t} = c\gamma. \tag{17.4}$$
A pretty boring solution would be α = γ = 0. Also β = 0 is not interesting. Let’s ignore
these solutions, so (17.4) must hold for different values of t. This can only be the case
if $\beta^2 - c = \gamma = 0$, so $\beta^2 = c$. Because of c < 0 there is no real solution for β, the first
attempt has failed.
Later we will see that the complex roots can also help us find solutions to the real differen-
tial equation.
We know more functions whose second derivative resembles the function: sine and
cosine. Let’s try it: Let
If there are ordinary differential equations, then there must of course also be extraordi-
nary ones: These are differential equations that contain more than one variable, for example
location and time, or even three-dimensional spatial coordinates. Such equations are called
partial differential equations. We cannot deal with these here.
Examples
1. The coffee problem $T_K'(t) = c(T_K(t) - T_L)$ can be written with $y = T_K(x)$ as
$F(x, y, y') = y' - cy + cT_L = 0$.
2. In the pendulum example we get $F(x, y, y', y'') = y'' - cy = 0$. Here x and y′ don’t
appear explicitly in the equation, but that shouldn’t bother us.
3. $F(x, y, y') = y' - ay + by^2 = 0$ and $F(x, y, y') = y' + 2xy - 2x = 0$ are further
differential equations, for which we cannot make any progress with guessing. For
these equations we will develop solution methods. Can you find them? ◄
As we have seen from our first examples, the solutions of differential equations can contain
parameters: For example, y′′ = 0 has the solutions y = ax + b with the independent
parameters a, b ∈ R. The function y = 3x + 7 is a particular solution of y′′ = 0.
Example
is the general solution of the differential equation y′′ = f (x) with the parameters c and
d. ◄
y′ = f (x)g(y) (17.5)
where f : I → R, g : J → R are continuous functions on intervals I, J .
If for $x_0 \in I$, $y_0 \in J$ the initial condition $y(x_0) = y_0$ is given, and if $g(y) \neq 0$ in the interval
J, then (17.5) is solvable. So that I don’t have to work with too many different names,
I will use the same letter for the integration variable as for the upper limit of integration.
Let
$$G(y) := \int_{y_0}^{y}\frac{1}{g(y)}\,dy,\qquad F(x) = \int_{x_0}^{x} f(x)\,dx$$
We can just try it: It is G(y(x)) = F(x). If we differentiate this equation to the
right and left (to the left using the chain rule), we get G′ (y(x)) · y′ (x) = F ′ (x), so
1/g(y(x)) · y′ (x) = f (x), that is y′ (x) = f (x)g(y).
It remains to check the initial condition: Because of $G(y_0) = 0$ and $F(x_0) = 0$ we get:
$y(x_0) = G^{-1}(F(x_0)) = G^{-1}(0) = y_0$. I summarize the result:
There is a mnemonic which sends a shiver down the spine of the true mathematician, but
it works: Write the differential equation in the form dy/dx = f (x)g(y) and then pretend that
dy/dx is a regular fraction. Bring all the y in the equation to the left and all the x to the right
and you get (1/g(y))dy = f (x)dx. Now put integral hooks in front of it and you have (17.7).
17.2 First Order Differential Equations 423
Examples
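1. Let’s take another look at the coffee problem $T_K'(x) = c(T_K(x) - T_L)$, $T_K(0) = T_0$.
We write it now in the form:
$$y' = \underbrace{c}_{f(x)}\ \underbrace{(y - T_L)}_{g(y)}$$
and get:
$$\begin{aligned}
\int_{T_0}^{y}\frac{1}{y - T_L}\,dy = \int_0^{x} c\,dt &\;\Rightarrow\; \ln(y - T_L)\Big|_{T_0}^{y} = cx\Big|_0^{x}\\
&\;\Rightarrow\; \ln(y - T_L) - \ln(T_0 - T_L) = cx - 0\\
&\;\Rightarrow\; \ln\frac{y - T_L}{T_0 - T_L} = cx.
\end{aligned}$$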
The term 1 − βy(x) represents the braking factor: as soon as y(x) comes close to
1/β, growth becomes smaller. For y(x) = 1/β, we would get zero growth. Let’s try
the solution with the initial condition y(0) = y0:
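$$\int_{y_0}^{y}\frac{1}{y - \beta y^2}\,dy = \int_0^{x}\alpha\,dx = \alpha x.$$
You will find the left integral in the formulary: $\int\frac{1}{y - \beta y^2}\,dy = \ln\left(\frac{\beta y}{1 - \beta y}\right)$. Try it! This
gives:
$$\ln\frac{\beta y}{1 - \beta y} - c = \alpha x,\qquad c = \ln\frac{\beta y_0}{1 - \beta y_0},$$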
and from there by applying the exponential function $\frac{\beta y}{1 - \beta y} = e^{\alpha x + c}$.
If we solve this equation for y, we get $y = \frac{e^{\alpha x + c}}{\beta(e^{\alpha x + c} + 1)}$.
Fig. 17.3 shows the graph of the function, the rabbit curve. You can see that
limx→∞ y(x) = 1/β is the limit of growth. If the initial value y0 is greater than 1/β,
we get no growth, but a shrinkage to 1/β.
3. Of course, differential equations can also be solved in which the function f (x) is
not constant. Let’s take
$$y' = \frac{x}{y - 1}.$$
Since x and y must be defined in an interval, it must be y > 1 or y < 1. By the ini-
tial value y(1/2) = 1/2, it is already y < 1 predetermined. Now we can integrate:
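$$\begin{aligned}
\int_{1/2}^{y}(y - 1)\,dy = \int_{1/2}^{x} x\,dx &\;\Rightarrow\; \left[\frac{y^2}{2} - y\right]_{1/2}^{y} = \left[\frac{x^2}{2}\right]_{1/2}^{x}\\
&\;\Rightarrow\; \frac{y^2}{2} - y - \frac{1}{8} + \frac{1}{2} = \frac{x^2}{2} - \frac{1}{8}\\
&\;\Rightarrow\; \frac{y^2}{2} - y + \frac{1}{2} - \frac{x^2}{2} = 0\\
&\;\Rightarrow\; y^2 - 2y + 1 - x^2 = 0\\
&\;\Rightarrow\; y = 1 \pm \sqrt{1 - 1 + x^2} = 1 \pm x.
\end{aligned}$$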
y = 1±x? Are there two solutions? No, from the initial condition it follows that
only y = 1 − x can be a solution. ◄
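The solutions of Examples 2 and 3 can be tested by inserting them into their differential equations. The following minimal sketch does this numerically with a central difference quotient. It assumes that the growth equation of Example 2 is y′ = αy(1 − βy) with 0 < y_0 < 1/β, and all names and the parameter values in the demo are my own choices.

```python
import math

def logistic(x, alpha, beta, y0):
    """Solution of y' = alpha*y*(1 - beta*y), y(0) = y0, assuming 0 < y0 < 1/beta."""
    c = math.log(beta * y0 / (1 - beta * y0))
    e = math.exp(alpha * x + c)
    return e / (beta * (e + 1))

def derivative(func, x, h=1e-5):
    """Central difference quotient as an approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

def logistic_residual(x, alpha, beta, y0):
    """y'(x) - alpha*(y - beta*y^2); should be close to 0 for a solution."""
    y = logistic(x, alpha, beta, y0)
    dy = derivative(lambda t: logistic(t, alpha, beta, y0), x)
    return dy - alpha * (y - beta * y * y)

def example3_residual(x):
    """y'(x) - x/(y - 1) for the candidate y = 1 - x (x must not be 0)."""
    y = lambda t: 1 - t
    return derivative(y, x) - x / (y(x) - 1)

if __name__ == "__main__":
    # Hypothetical parameters: growth rate 0.5, limit of growth 1/beta = 100, start value 5.
    for x in (0.0, 3.0, 10.0, 60.0):
        print(x, logistic(x, 0.5, 0.01, 5.0), logistic_residual(x, 0.5, 0.01, 5.0))
    print(example3_residual(0.3))
```

The residuals should be close to 0, and the function values approach the limit of growth 1/β = 100 for large x.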
Differential equations in which the function y and its derivatives only occur linearly are
called linear differential equations. In the first order, they have the form
y′ + a(x)y = f (x).
If the function f (x) is equal to 0 on the right side, the equation is called homogeneous,
otherwise inhomogeneous. For homogeneous first order linear equations, the general
solution can be given:
Theorem 17.5 If a(x) is continuous on the interval I, the general solution of the
differential equation y′ + a(x)y = 0 is:
$$y(x) = c\cdot e^{-A(x)},$$
where c ∈ R and A(x) is an antiderivative of a(x).
This results in $c'(x)e^{-A(x)} = f(x)$ or $c'(x) = f(x)e^{A(x)}$, i.e. a differential equation for $c(x)$.
Any antiderivative
$$c(x) = \int f(x)e^{A(x)}\,dx$$
Example
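$$y(x) = \left(\int_0^{x} 2t\cdot e^{t^2}\,dt + c\right)e^{-x^2} \underset{y=t^2,\ dy=2t\,dt}{=} \left(\int_0^{x^2} e^{y}\,dy + c\right)e^{-x^2} = \big(e^{x^2}\underbrace{-\,1 + c}_{=d}\big)e^{-x^2} = 1 + d\,e^{-x^2}$$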
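The result can be checked by inserting it into the differential equation. The sketch below assumes that this example refers to the equation y′ + 2xy = 2x from the examples in Sect. 17.1 (this matches a(x) = 2x, A(x) = x² and f(x) = 2x in the calculation above). The names `y` and `residual` are my own.

```python
import math

def y(x, d):
    """Candidate general solution y(x) = 1 + d * exp(-x^2)."""
    return 1.0 + d * math.exp(-x * x)

def residual(x, d, h=1e-5):
    """y'(x) + 2x*y(x) - 2x, with y' approximated by a central difference quotient."""
    dy = (y(x + h, d) - y(x - h, d)) / (2 * h)
    return dy + 2 * x * y(x, d) - 2 * x

if __name__ == "__main__":
    for d in (-3.0, 0.0, 2.5):
        print(d, max(abs(residual(k / 10, d)) for k in range(-20, 21)))
```

For every value of the parameter d, the largest residual on the interval from −2 to 2 should be close to 0.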
17.3 nth Order Linear Differential Equations
The following existence and uniqueness theorem holds for these differential equations,
which I would like to quote without proof:
Theorem 17.8 Let $y^{(n)} + a_1(x)y^{(n-1)} + \dots + a_{n-1}(x)y' + a_n(x)y = f(x)$ be an nth
order linear differential equation, $a_i, f : I \to \mathbb{R}$ and $x_0 \in I$. Then there is a unique
solution y of this initial value problem for the initial values $y(x_0) = b_0$, $y'(x_0) = b_1$,
…, $y^{(n-1)}(x_0) = b_{n-1}$. This solution exists on the whole interval I.
A nice result, but once again a typical mathematician’s theorem: It doesn’t help us at all
to find solutions, it can only serve to calm us down when we get stuck on a specific task.
More interesting are the two following theorems, which give us information about the
structure of the solutions and help us to sort the solutions and check them for complete-
ness. And to prove Theorem 17.10 we need Theorem 17.8, so that one is not completely
superfluous either.
Similarly to the solution of systems of linear equations, we therefore only have to deter-
mine one solution of the inhomogeneous equation in addition to all solutions of the
homogeneous equation to solve the system completely.
The set H is a subset of the vector space of all real functions on I. We therefore only
have to check the subspace criterion (see Theorem 6.4 in Section 6.2). According to
Theorem 17.8, H is not empty, and if $y_1, y_2 \in H$ and $\lambda \in \mathbb{R}$, then:
Theorem and Definition 17.10 The solution space of a homogeneous nth order lin-
ear differential equation has dimension n. A basis of this solution space is called a
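fundamental system.
$$\begin{aligned}
\lambda_0 y_0 + \lambda_1 y_1 + \dots + \lambda_{n-1} y_{n-1} &= 0,\\
\lambda_0 y_0' + \lambda_1 y_1' + \dots + \lambda_{n-1} y_{n-1}' &= 0,\\
&\ \,\vdots\\
\lambda_0 y_0^{(n-1)} + \lambda_1 y_1^{(n-1)} + \dots + \lambda_{n-1} y_{n-1}^{(n-1)} &= 0.
\end{aligned}$$
This means nothing else than that for all x ∈ I the vector $(\lambda_0, \lambda_1, \dots, \lambda_{n-1})$, which is different
from 0, is a solution to the homogeneous linear equation system
$$\begin{pmatrix}
y_0(x) & y_1(x) & \cdots & y_{n-1}(x)\\
y_0'(x) & y_1'(x) & \cdots & y_{n-1}'(x)\\
\vdots & \vdots & \ddots & \vdots\\
y_0^{(n-1)}(x) & y_1^{(n-1)}(x) & \cdots & y_{n-1}^{(n-1)}(x)
\end{pmatrix}
\begin{pmatrix} x_0\\ x_1\\ \vdots\\ x_{n-1}\end{pmatrix}
= \begin{pmatrix} 0\\ 0\\ \vdots\\ 0\end{pmatrix}.$$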
As we have learned in Theorem 9.4 in Sect. 9.1, this is only possible if the determinant
of the coefficient matrix is 0.
Now back to the proof of linear independence of the solution functions (17.10): By
contraposition to Theorem 17.11 it holds: If the Wronskian is nonzero at any point,
then the functions contained therein are linearly independent. At the point $x_0$, however,
the solutions found yield as Wronskian just the determinant of the identity matrix. Thus
Theorem 17.10 is proven.
Let’s look at the example of the pendulum at the beginning of the chapter again: We
had guessed for the linear differential equation y′′ (t) + 0 · y′ (t) − c · y(t) = 0 the two
solutions:
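$$y_1(t) = \cos\beta t,\qquad y_2(t) = \sin\beta t,$$
with $\beta = \sqrt{-c}$. Are these solutions linearly independent? We calculate $W(x)$:
$$W(x) = \det\begin{pmatrix}\cos(\beta x) & \sin(\beta x)\\ -\beta\sin(\beta x) & \beta\cos(\beta x)\end{pmatrix} = \beta\cos^2(\beta x) + \beta\sin^2(\beta x) = \beta \neq 0.$$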
This means that y1 and y2 form a basis. Now we also know that the initial conditions
y(0) = 1, y′ (0) = v, which we had set in this example, were reasonable and complete.
The unique solution of the initial value problem is y(t) = 1 · cos βt + (v/β) · sin βt.
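Both statements can be tried out numerically. The following minimal sketch evaluates the Wronskian of cos βt and sin βt and checks that y(t) = cos βt + (v/β)·sin βt satisfies y′′ = c·y with y(0) = 1. The values c = −4 and v = 3 are arbitrary choices, and the function names are my own.

```python
import math

def solution(t, c, v):
    """y(t) = cos(beta*t) + (v/beta)*sin(beta*t) with beta = sqrt(-c), for c < 0."""
    beta = math.sqrt(-c)
    return math.cos(beta * t) + (v / beta) * math.sin(beta * t)

def wronskian(t, c):
    """Determinant of [[y1, y2], [y1', y2']] for y1 = cos(beta*t), y2 = sin(beta*t)."""
    beta = math.sqrt(-c)
    y1, y2 = math.cos(beta * t), math.sin(beta * t)
    dy1, dy2 = -beta * math.sin(beta * t), beta * math.cos(beta * t)
    return y1 * dy2 - y2 * dy1

def residual(t, c, v, h=1e-4):
    """y''(t) - c*y(t), with y'' approximated by a second difference quotient."""
    d2 = (solution(t + h, c, v) - 2 * solution(t, c, v) + solution(t - h, c, v)) / (h * h)
    return d2 - c * solution(t, c, v)

if __name__ == "__main__":
    c, v = -4.0, 3.0  # arbitrary example values, so beta = 2
    for t in (0.0, 0.5, 1.0, 2.0):
        print(t, wronskian(t, c), solution(t, c, v), residual(t, c, v))
```

The Wronskian should be equal to β = 2 at every point, and the residual should be close to 0.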
For differential equations in the form (17.9) there is no general solution method. If the
coefficient functions however are constant real numbers, one can specify a basis of the
solution space. We first solve the homogeneous system
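$$y(x) = e^{\lambda x},\quad y'(x) = \lambda e^{\lambda x},\quad y''(x) = \lambda^2 e^{\lambda x},\quad \dots,\quad y^{(n)}(x) = \lambda^n e^{\lambda x}.$$
Inserted in (17.11) we obtain for all x:
$$\lambda^n e^{\lambda x} + a_1\lambda^{n-1}e^{\lambda x} + \dots + a_{n-1}\lambda e^{\lambda x} + a_n e^{\lambda x} = (\lambda^n + a_1\lambda^{n-1} + \dots + a_{n-1}\lambda + a_n)\,e^{\lambda x} = 0.$$
If λ is a root of $\lambda^n + a_1\lambda^{n-1} + \dots + a_{n-1}\lambda + a_n$, then $e^{\lambda x}$ is a solution of (17.11). So we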
From the roots of the characteristic polynomial we can construct a fundamental system
for (17.11). We have to distinguish the following cases:
Case 1: λ is a simple real root. Then $e^{\lambda x}$ is a solution of the differential equation.
Case 3: Multiple roots occur. I would like to give the corresponding solutions without
further calculation: If λ is a k-fold real root, then the k functions $x^i e^{\lambda x}$, $i = 0, \dots, k-1$ are
linearly independent solutions, and if $\lambda = \alpha + i\beta$ (and thus also $\alpha - i\beta$) is a k-fold complex
root, then the 2k functions $x^i e^{\alpha x}\cos\beta x$, $x^i e^{\alpha x}\sin\beta x$, $i = 0, \dots, k-1$ are the corresponding
solutions.
In each case, we have therefore found exactly n solutions from the polynomial $p(\lambda)$.
With the help of the Wronskian, one can determine that they are all linearly independent.
They form a fundamental system of the differential equation (17.11).
Examples
$y = e^x$: $e^x - 2e^x + e^x = 0$,
$y = xe^x$: $\underbrace{(xe^x + e^x + e^x)}_{y''} - 2\underbrace{(xe^x + e^x)}_{y'} + xe^x = 0$.
$$y'' + 2\alpha y' + \omega_0^2\,y = 0.$$
The characteristic polynomial is $\lambda^2 + 2\alpha\lambda + \omega_0^2 = 0$ and has the roots
$$\lambda_{1/2} = -\alpha \pm \sqrt{\alpha^2 - \omega_0^2} = -\alpha \pm \beta.$$
Here two cases have to be distinguished: If $\alpha^2 - \omega_0^2 > 0$, then we get two real
roots, both of which are less than 0. Here α is relatively large, there is a strong
damping. Imagine that the pendulum is dipped in honey. The general solution is
$$y = c_1 e^{\lambda_1 x} + c_2 e^{\lambda_2 x}.$$
In this case, no proper oscillation takes place at all, the function has at most one
maximum and at most one root. Fig. 17.4 shows some possible shapes of the curve.
In the second case $\alpha^2 - \omega_0^2 < 0$ we get two complex roots $-\alpha \pm i\omega_1$ with
$\omega_1 = \sqrt{\omega_0^2 - \alpha^2}$. Here α is relatively small, so there is a weak damping, for example,
the pendulum swings in the air. The general solution here is
$$y = c_1 e^{-\alpha x}\sin\omega_1 x + c_2 e^{-\alpha x}\cos\omega_1 x.$$
The curve represents an oscillation with the frequency $\omega_1 < \omega_0$ whose amplitude
decreases. The oscillation therefore proceeds somewhat slower than undamped.
Fig. 17.5 shows examples of this solution.
The limiting case $\alpha^2 = \omega_0^2$ results in the solution
$$y = (c_1 + c_2 x)e^{-\alpha x}.$$
The course of the curve corresponds here to that of the strong damping. ◄
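The three cases can be collected in a small function. This is a minimal sketch with names of my own choosing. It selects the form of the general solution from the sign of α² − ω_0² and checks by finite differences that y′′ + 2αy′ + ω_0²y is close to 0. The comparison with exactly 0 for the limiting case only works for parameter values where α² and ω_0² are equal as floating-point numbers.

```python
import math

def general_solution(x, alpha, omega0, c1, c2):
    """General solution of y'' + 2*alpha*y' + omega0^2*y = 0 for the three damping cases."""
    disc = alpha * alpha - omega0 * omega0
    if disc > 0:  # strong damping: two real roots -alpha +/- beta
        beta = math.sqrt(disc)
        return c1 * math.exp((-alpha + beta) * x) + c2 * math.exp((-alpha - beta) * x)
    if disc < 0:  # weak damping: complex roots -alpha +/- i*omega1
        omega1 = math.sqrt(-disc)
        return math.exp(-alpha * x) * (c1 * math.sin(omega1 * x) + c2 * math.cos(omega1 * x))
    return (c1 + c2 * x) * math.exp(-alpha * x)  # limiting case: double root -alpha

def residual(x, alpha, omega0, c1, c2, h=1e-4):
    """y'' + 2*alpha*y' + omega0^2*y, with derivatives from central difference quotients."""
    y = lambda t: general_solution(t, alpha, omega0, c1, c2)
    d1 = (y(x + h) - y(x - h)) / (2 * h)
    d2 = (y(x + h) - 2 * y(x) + y(x - h)) / (h * h)
    return d2 + 2 * alpha * d1 + omega0 ** 2 * y(x)

if __name__ == "__main__":
    # Arbitrary example parameters for strong damping, weak damping and the limiting case.
    for alpha, omega0 in ((3.0, 1.0), (0.2, 2.0), (1.0, 1.0)):
        print(alpha, omega0, residual(0.7, alpha, omega0, 1.0, -0.5))
```

In all three cases the printed residual should be close to 0.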
If ω0 is close to ω, and the damping (that is, α) is small, the denominator of the amplitude
will be very small, and the amplitude can be very large. The further ω is from ω0, the
smaller the amplitude of the oscillation will be.
The differential equation (17.12) not only describes the spring pendulum, but in principle
any oscillating system. The negative effect of the presented solution is the resonance
disaster: An opera singer can make glasses shatter, and storms can set
bridges into oscillation so that they tear apart. Look on the internet for the Tacoma Narrows
Bridge! Positive effects are experienced in the rocking chair and when listening to the
radio or making a phone call: The electrical oscillating circuit in the receiver consists
of a capacitor and an inductor. The natural frequency of the system can be adjusted to
the transmission frequency by changing the capacitance of the capacitor, so that the
oscillating circuit is driven by exactly this and no other frequency.
Comprehension Questions
Exercises
Abstract
The final chapter from the second part of the book familiarizes you with applications
of theoretical mathematics to specific computational tasks. At the end of this chapter
• you know that calculation errors are inevitable and can estimate their size and their
propagation during calculation operations,
• you can calculate zeros and fixed points of nonlinear equations using different
methods,
• you can determine smooth interpolation curves between given points in $\mathbb{R}^2$,
• you can solve integrals numerically,
• and determine solutions to first order differential equations.
I don’t need to explain to you the importance of computers for mathematical calcula-
tions. The largest civilian concentrations of computers are at weather services and large
film studios in America. Everywhere, equations are solved, roots are determined, func-
tions and differential equations are integrated and curves, shapes and shadows are calcu-
lated like mad.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, 435
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_18
436 18 Numerical Methods
Leaving aside the ever more powerful computer algebra systems, however, computers do
not calculate with formulas, but with numbers, and in doing so they make errors in calcula-
tion; not because they would miscalculate, but because they simply cannot represent real num-
bers accurately enough. It can happen that such errors amplify during long calculations and
lead to unusable results. Before we therefore deal with some important numerical algorithms,
I would like to explain the problem of uncertainty of calculation to you in more detail. I will
do this using the example of solving systems of linear equations. In Chap. 8.8 we dealt exten-
sively with the Gaussian algorithm. This also represents the first important numerical method.
If you carry out a calculation on the computer and have the result displayed on the screen,
it usually appears in the form
$$\pm a_1.a_2a_3\dots a_n\,\mathrm{E}{\pm}m. \tag{18.1}$$
This is the floating point representation of the number. The sequence of digits
$a_1a_2a_3\dots a_n$ is called the mantissa of the number, the coefficient $a_1$ is always unequal to
0. E±m stands for “$\cdot 10^{\pm m}$”.
This is exactly how real numbers are stored in the computer, but not in decimal form,
but to the base 2 (see also the end of Sec. 13.3).
To store a real number, just like for an integer, only a finite memory space is available,
which has to hold the sign, mantissa and exponent. In Java, for example, this is 8 bytes
for a double, i.e. 64 bits. Numbers like $\sqrt{2}$ or $\pi$ cannot be represented precisely in
this way and, of course, only finitely many numbers can be stored in total.
Every real number that we enter into the computer, or that arises during a calculation, is
rounded to the nearest number representable in this form. This results in errors. We
distinguish between the absolute error $a_x$ and the relative error $r_x$. If $x'$ is the
representation of the real number $x$ in the computer, then:
\[ a_x = x - x', \qquad r_x = \frac{x - x'}{x}. \tag{18.2} \]
Of particular interest is of course the relative error, which sets the absolute error in rela-
tion to the number under consideration: A large absolute error is much less important for
a very large number than for a small number.
The floating-point representation (18.1) now has the advantage that it limits the relative
error for numbers of all orders of magnitude. For the following, let us consider a
beginner’s computer with a mantissa length of 3 that works in the decimal system. We
want to represent the numbers $x = 1\,234\,567$ and $y = 0.0007654321$ in it. In the
floating-point representation we have
\[ x' = 1.23 \cdot 10^{6}, \qquad y' = 7.65 \cdot 10^{-4}, \]
and thus
\[ a_x = 4567, \quad r_x \approx 0.0037, \qquad a_y = 0.0000004321, \quad r_y \approx 0.00056. \]
How large is the maximum relative error? The worst case occurs when we enter a number
$x = a_1.a_2a_35 \cdot 10^{m}$, which is rounded to $a_1.a_2(a_3+1) \cdot 10^{m}$. This makes the
relative error
\[ \frac{-0.005 \cdot 10^{m}}{a_1.a_2a_35 \cdot 10^{m}} = \frac{-0.005}{a_1.a_2a_35}. \]
The magnitude of this fraction is largest when the denominator is smallest, that is, when
$x = 1.005 \cdot 10^{m}$, and we get a relative error of about $4.98 \cdot 10^{-3}$ in magnitude,
regardless of the size of the exponent.
You can now easily derive the rule that for a mantissa length of $n$ the maximum relative
error is:
\[ |r_{\max}| \le 5 \cdot 10^{-n}. \]
Similarly, in the binary system, with a mantissa length of $n$, the maximum relative error
is $|r_{\max}| \le 2^{-n}$.
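The beginner’s computer with a 3-digit decimal mantissa can be imitated in a few lines of Python. The following is only a sketch under my own assumptions: the helper names `to_machine` and `relative_error` are invented for this illustration, numbers are passed as decimal strings so that no binary rounding interferes, and ties are rounded up as in the worst-case consideration above.

```python
from decimal import Context, Decimal, ROUND_HALF_UP
from fractions import Fraction


def to_machine(x: str, mantissa_len: int = 3) -> Decimal:
    """Round the decimal string x to mantissa_len significant digits."""
    ctx = Context(prec=mantissa_len, rounding=ROUND_HALF_UP)
    return ctx.create_decimal(x)


def relative_error(x: str, mantissa_len: int = 3) -> Fraction:
    """Exact relative error r_x = (x - x') / x, as in (18.2)."""
    exact = Fraction(x)
    stored = Fraction(to_machine(x, mantissa_len))
    return (exact - stored) / exact


# The two numbers from the text and the worst case 1.005.
for value in ("1234567", "0.0007654321", "1.005"):
    r = relative_error(value)
    print(f"{value} -> {to_machine(value)}, r = {float(r):.5f}")
```

Because the relative error is computed with exact fractions, the bound $5 \cdot 10^{-n}$ can be checked directly for any input you like to try.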
Propagation of Errors
If you carry out mathematical operations with the already rounded numbers, the errors
will affect the result, and it can also happen that error-free numbers can no longer be rep-
resented precisely after an operation and therefore have to be rounded.
Let’s think for a moment about what happens to the relative error when adding and
multiplying: We use $r_x x = x - x'$ and $x' = x(1 - r_x)$. This follows directly from (18.2).
For simplicity, we assume the sum and product of $x'$ and $y'$ are again numbers that can be
represented in the computer. Then we get:
\[ r_{x+y} = \frac{(x+y) - (x'+y')}{x+y} = \frac{(x-x') + (y-y')}{x+y} = \frac{r_x x + r_y y}{x+y} = \frac{x}{x+y}\,r_x + \frac{y}{x+y}\,r_y, \tag{18.3} \]
\[ r_{xy} = \frac{xy - x'y'}{xy} = \frac{xy - x(1-r_x)\,y(1-r_y)}{xy} = 1 - (1-r_x)(1-r_y) = r_x + r_y - r_x r_y. \tag{18.4} \]
Since the relative errors are small numbers, $r_x r_y$ is very small compared to $r_x$ or $r_y$ and
therefore $r_{xy} \approx r_x + r_y$.
In multiplication the relative error behaves quite well: the two relative errors add up. In
addition nothing particularly bad happens at first, when x and y have the same sign: Then
the fractions x/(x + y) and y/(x + y) are each less than 1 and here too the worst that can
happen is an addition of the relative errors.
A catastrophe can occur, however, if x and y are about the same size and have differ-
ent signs: Then the values x/(x + y) and y/(x + y) are very large because the denomina-
tors are small, and the relative error can explode! This behavior is called catastrophic
cancellation.
Never try to check the difference of two real numbers against 0 in a logical condition,
it can have fatal consequences!
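A small Python experiment illustrates both warnings. It is a sketch with values chosen by me: the first part shows why a test of a difference against 0 fails even for harmless-looking numbers, the second part evaluates the two weights $x/(x+y)$ and $y/(x+y)$ from (18.3). The function name `amplification_factors` is made up for this example.

```python
import math

x = 0.1 + 0.2
y = 0.3

print(x - y == 0.0)        # naive test of the difference against 0
print(math.isclose(x, y))  # comparison with a relative tolerance instead


def amplification_factors(x: float, y: float) -> tuple[float, float]:
    """Weights x/(x+y) and y/(x+y) that multiply r_x and r_y in (18.3)."""
    s = x + y
    return x / s, y / s


print(amplification_factors(1.0, 2.0))     # same sign: both weights below 1
print(amplification_factors(1.001, -1.0))  # nearly cancelling: huge weights
```

In the second call the existing relative errors of the inputs are multiplied by roughly a thousand, which is exactly the explosion described above.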
The numerical examples that I presented to you in Chap. 8 for the Gaussian algorithm
consisted of integers and were designed so the transformations usually resulted in inte-
gers again; I chose them so that the methods are easy to follow. Unfortunately, reality
is different. In one and the same system of equations, very large, very small and very
crooked non-integers can occur. Let’s take a look at what can happen to systems of lin-
ear equations due to the described errors. We solve a few systems of equations with our
computer, which works with 3-digit mantissas. If more than three significant digits occur
in an operation, we round.
First, I want to calculate the solution of the following system of equations $Ax = b$:
\[ (A, b) = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 1.001 & 2.001 \end{pmatrix}. \]
The rank of $A$ and that of $(A, b)$ is 2, so there is a unique solution, which you can easily
determine to be $(1, 1)$.
Unfortunately, when entering this system, the numbers have to be rounded, because
1.001 and 2.001 have 4 significant digits. So in our computer only the following system
of equations arrives:
\[ (A, b) = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 1 & 2 \end{pmatrix}. \]
The rank of this matrix is 1, the solution space is the set
$\{(1, 1) + \lambda(1, -1) \mid \lambda \in \mathbb{R}\}$.
Already at the input, a system of linear equations can thus get completely different
properties due to rounding errors.
Now let’s look at a system of equations that does not undergo any rounding on entry:
\[ (A, b) = \begin{pmatrix} 203 & 202 & 406\,000 \\ 1 & 1 & 2010 \end{pmatrix}. \tag{18.5} \]
18.1 Problems with Numerical Calculations 439
The rank of the matrix is 2, so there is a unique solution again. Calculating by hand gives
x1 = −20, x2 = 2030.
Now let’s solve the equations with the computer using the Gaussian algorithm,
see Sect. 8.1. While doing so, I always round to three significant digits and cheekily
still write “=”, just like our computer does. First we have to subtract the
$1/203 = 0.00493$-fold of row 1 from row 2 and get:
\[
\begin{pmatrix}
203 & 202 & 406\,000 \\
1 - \underbrace{0.00493 \cdot 203}_{1.00} & 1 - \underbrace{0.00493 \cdot 202}_{0.996} & 2010 - \underbrace{0.00493 \cdot 406\,000}_{2000}
\end{pmatrix}
=
\begin{pmatrix}
203 & 202 & 406\,000 \\
0 & 0.00400 & 10
\end{pmatrix}.
\]
Now we determine x2 from row 2: 0.004 · x2 = 10, so x2 = 2500. We set this in row 1,
and get 203 · x1 + 202 · 2500 = 406 000, so 203 · x1 = 406 000 − 505 000 = −99 000,
and finally x1 = −488, so a solution that is completely off. You shouldn’t sell that to any-
one!
The disaster is completed by the test:
203 · (−488) + 202 · 2500 = −99 100 + 505 000 = 406 000,
1 · (−488) + 1 · 2500 = 2010,
apparently everything is right.
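You can replay this calculation with Python’s `decimal` module, which rounds the result of every arithmetic operation to a chosen number of significant digits. This is only a sketch of the hand calculation for system (18.5); the variable names are mine, and the module’s default rounding mode is assumed to behave like ordinary rounding here because no exact ties occur.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 3  # every arithmetic result is rounded to 3 significant digits

    a11, a12, b1 = Decimal(203), Decimal(202), Decimal(406000)
    a21, a22, b2 = Decimal(1), Decimal(1), Decimal(2010)

    q = a21 / a11                # the factor 1/203, rounded to 0.00493
    a22_new = a22 - q * a12      # second row after elimination
    b2_new = b2 - q * b1

    x2 = b2_new / a22_new        # back substitution
    x1 = (b1 - a12 * x2) / a11

print("3-digit computer:", x1, x2)
print("exact solution:  ", -20, 2030)
```

Raising `ctx.prec` step by step shows how many digits this small system needs before the result becomes usable.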
There is a simple trick with which one can often increase the computational accuracy
when solving systems of linear equations numerically:
Look back again at the Gaussian algorithm. In the key operation, the $(a_{kj}/a_{ij})$-fold
of row $i$ is subtracted from row $k$ (cf. (8.6) after Theorem 8.3):
\[ a_{kl} - (a_{kj}/a_{ij}) \cdot a_{il}, \quad l = 1, \ldots, n. \tag{18.6} \]
These are operations in which cancellation can occur. We do not have the values of the
matrix elements in our hands, but we can control the size of the quotient $q = a_{kj}/a_{ij}$ by
choosing the pivot element $a_{ij}$. According to (18.3) and (18.4), the relative error of an
operation $a - q \cdot b$ as in (18.6) is approximately
\[ \frac{a}{a - qb}\,r_a - \frac{qb}{a - qb}\,(r_q + r_b). \]
In any case, it has a positive effect on the error size if q is as small as possible, because
then the influence of the second term of the sum on the error becomes smaller.
In implementations of the Gaussian algorithm a pivoting is therefore carried out: If
we want to make all elements below $a_{ij}$ (the pivot element) zero, we not only check
whether $a_{ij} = 0$ (we know by now that this test is problematic anyway), but
we also search row $i$ and the rows below it for the one whose $j$-th element is largest
in absolute value. Then this row is swapped with the $i$-th row. The quotient $a_{kj}/a_{ij}$,
by which row $i$ is subsequently multiplied, is therefore always less than or equal to 1
in absolute value for all $k > i$.
In addition to the described partial pivoting, a complete pivoting can also be car-
ried out, that is, columns can be exchanged to make the pivot element as large as pos-
sible. However, one must be careful here: When swapping columns, the indices of the
unknowns must also be exchanged in order to obtain the correct results at the end.
Unfortunately, pivoting is not a panacea, as the equation system (18.5) shows. Here
the pivot element is already maximum. Developing good strategies for solving systems
of linear equations is not trivial!
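Partial pivoting as described above can be sketched in Python as follows. This is a minimal illustration for square systems with ordinary floating-point numbers, not a production solver; the function name `solve_with_partial_pivoting` and the list-of-lists matrix format are assumptions of this sketch.

```python
def solve_with_partial_pivoting(A, b):
    """Solve Ax = b by Gaussian elimination with row swaps (partial pivoting)."""
    n = len(A)
    # Augmented matrix (A, b) as a list of float rows.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    for j in range(n):
        # Choose the row at or below j whose j-th entry is largest in magnitude.
        pivot_row = max(range(j, n), key=lambda r: abs(M[r][j]))
        if M[pivot_row][j] == 0.0:
            # Exact test for brevity only; real code would use a tolerance.
            raise ValueError("matrix is singular")
        M[j], M[pivot_row] = M[pivot_row], M[j]

        for k in range(j + 1, n):
            q = M[k][j] / M[j][j]  # |q| <= 1 thanks to the pivot choice
            for l in range(j, n + 1):
                M[k][l] -= q * M[j][l]

    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][l] * x[l] for l in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


print(solve_with_partial_pivoting([[203, 202], [1, 1]], [406000, 2010]))
print(solve_with_partial_pivoting([[0, 1], [1, 1]], [1, 2]))
```

With the 53-bit mantissa of a double, system (18.5) poses no difficulty; the second call shows a case in which the row swap is indispensable because the first pivot candidate is zero.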
In the examples you have seen: Small rounding errors in the input and in the calcula-
tions can have a huge impact on the results. This can go so far that the results are com-
pletely worthless. Numerical problems with this property are called ill-conditioned. A
major task of numerical analysis is to formulate problems in such a way that they are
well-conditioned.
What else can you do to get problems with computational errors in your computer
programs under control? Some people think we don’t need much mathematics here, since
we solve our problems with the computer. But this is wrong, I would like to emphasize
the importance of mathematics: The further you treat a problem analytically and the later
you start doing numerical calculations, the fewer errors you will make.
It is also important to choose good algorithms: If you need fewer operations, you usu-
ally not only increase the speed, but also the accuracy of the calculation.
In any case, you must be aware of the problem that rounding errors can occur and not
rely on exact calculations.
To solve systems of linear equations, we used the Gaussian algorithm in the last section.
Nonlinear equations often cannot be solved analytically. Other numerical methods
must be found for them. I would like to focus here on the case of an equation with one
unknown. Let’s start with an example. The solutions of the equation
\[ e^{-x} + 1 = x \]
are sought. This equation can also be written in another form:
\[ e^{-x} + 1 - x = 0. \]
The first way of writing is of the form $F(x) = x$. Here we call $x$ a fixed point of the
function $F$. The second way of writing has the form $G(x) = 0$, and $x$ is a zero point of $G$.
I would like to introduce you to methods for determining fixed points and determining
zero points. You can see from the example that equations can often be brought into one
form or the other as required. Unfortunately, not all methods always lead to the goal and the
convergence speed of the methods is also different. First a definition that helps with the
assessment of convergence:
18.2 Nonlinear Equations 441
Definition 18.1 Let $(x_n)_{n \in \mathbb{N}}$ be a convergent sequence of real numbers with limit
$x$. Then $x_n$ is called linearly convergent with rate of convergence $c$, if there is a
number $0 < c < 1$ such that
\[ |x_{n+1} - x| \le c \cdot |x_n - x|. \]
$x_n$ is called convergent with order of convergence $q$, if there is a $c > 0$ with
\[ |x_{n+1} - x| \le c \cdot |x_n - x|^{q}. \]
In the case $q = 2$, $x_n$ is called quadratically convergent.
Note that in the case of order of convergence q > 1, the factor c does not have to be < 1
anymore.
Let us first try to determine the fixed points. One approach could be to begin with a start-
ing value x0 and to determine x1 = F(x0 ). Then we evaluate F(x1 ): x2 = F(x1 ) and so on:
\[ x_n = F(x_{n-1}), \quad n = 1, 2, 3, \ldots \tag{18.7} \]
Can we hope the sequence xn converges? Maybe even to a fixed point? If F : R → R is
a function of one variable, then a fixed point x̂ with F(x̂) = x̂ is an intersection of the
graph of F with the main diagonal, and we can visualize the sequence xn.
As you can see, the sequence converges to the fixed point in Fig. 18.1, in Fig. 18.2 it
does not. Draw a few graphs yourself and try to find out when convergence occurs and
when it does not. Do you have an idea? It seems to depend on the slope of the function:
If it is too steep, it does not work anymore. If it is flat, whether rising or falling, then the
sequence xn approaches the fixed point more and more closely. In fact, it can be shown:
If the function values of F are always closer together than the arguments, then conver-
gence occurs. Behind this is a famous theorem that not only applies to real functions,
but to all functions in “complete metric spaces”. I formulate it for functions in Rn. It is
named after the Polish mathematician Stefan Banach, who proved it in 1922. It is there-
fore one of the newer mathematical results in this book.
Theorem 18.3: The Banach fixed point theorem Let D ⊂ Rn be a closed subset
and f : D → D a contraction mapping of D in itself. Then holds:
Imagine a city map of the town you live in lying on the floor in front of you. The mapping
of the city onto the map is obviously a contraction mapping. The fixed point theorem
now says there is exactly one point on the map that lies precisely at the point it
represents. If you put a city map of Munich on the floor in Hamburg, however, you are
out of luck: A necessary condition is the property of self-mapping: The image of the
mapping must be part of the domain. But then the theorem is even constructive, which
is especially important for computer scientists. It provides a method of how to find the
fixed point and also says something about the rate of convergence in 3.
Because of $f(x_n) = x_{n+1}$ and $f(\hat{x}) = \hat{x}$, we can immediately read off the order of
convergence from Definition 18.2: It is $\|x_{n+1} - \hat{x}\| \le c \cdot \|x_n - \hat{x}\|$ and therefore
linear convergence is given.
It is often difficult to determine whether a given function is a contraction mapping.
If $f : [a, b] \to [a, b]$ is a differentiable function of one variable, it is sufficient to check
whether $\max\{|f'(x)| \mid x \in [a, b]\} = c < 1$, because from the mean value theorem of
differential calculus (Theorem 15.12 in Section 15.1) it follows that for all $x, y \in [a, b]$
\[ \frac{|f(x) - f(y)|}{|x - y|} = |f'(x_0)| \le c \quad \text{for an } x_0 \in [x, y], \]
and thus the contraction property is fulfilled. This rule also agrees with our intuitive con-
siderations in Figs. 18.1 and 18.2.
If you want to apply the fixed point theorem, it is sufficient to check the conditions in
the vicinity of the sought-after fixed point, they do not have to be fulfilled globally for
the whole function.
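For the example equation $e^{-x} + 1 = x$ the iteration (18.7) can be tried out directly. The following Python sketch is only an illustration; the function name `fixed_point`, the stopping tolerance and the start value 1 are my assumptions. On the interval $[1, 2]$ the function $F(x) = e^{-x} + 1$ maps into $[1, 2]$ and $|F'(x)| = e^{-x} \le e^{-1} < 1$ holds there, so the contraction condition is fulfilled.

```python
import math


def fixed_point(F, x0, tol=1e-12, max_iter=200):
    """Iterate x_n = F(x_{n-1}) until two successive values differ by less than tol."""
    x = x0
    for n in range(1, max_iter + 1):
        x_next = F(x)
        if abs(x_next - x) < tol:
            return x_next, n
        x = x_next
    raise RuntimeError("no convergence within max_iter steps")


def F(x):
    return math.exp(-x) + 1.0


x_hat, steps = fixed_point(F, x0=1.0)
print(x_hat, steps, F(x_hat) - x_hat)
```

The number of steps needed grows only slowly when the tolerance is tightened, which is the practical face of linear convergence.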
Calculation of Zeros
We will now examine methods for determining zeros of functions of one variable. We
have already learned one algorithm: The proof of Bolzano’s theorem, Theorem 14.23 in
Section 14.3, was constructive. With the help of bisection, we can approximate a zero
arbitrarily accurately. Unfortunately, this method converges very slowly. I would like to
introduce two more methods that are often better suited to find zeros.
The regula falsi is based on the same assumptions as bisection: The function
f : [a, b] → R is continuous, f (a) < 0 and f (b) > 0 (or vice versa). Now we hope to
get close to the zero between a and b more quickly if we do not simply halve the inter-
val between them, but determine the intersection point c of the line g from (a, f (a)) to
(b, f (b)) with the x-axis as the new approximation value (Figure 18.3).
The equation of g is:
f (b) − f (a)
y = g(x) = (x − a) + f (a).
b−a
By inserting a and b into this linear equation, we immediately find g(a) = f (a) and
g(b) = f (b), so this is the line connecting (a, f (a)) and (b, f (b)). We get the intersection c
of g with the x-axis, if we insert the value c for x, set y = 0 and solve for c:
\[ c = a - f(a)\,\frac{b - a}{f(b) - f(a)}. \]
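A possible Python sketch of the regula falsi looks like this. The name `regula_falsi`, the stopping rule on $|f(c)|$ and the choice of which interval end is replaced (the one where $f$ has the same sign as $f(c)$, as in bisection) are assumptions of this illustration.

```python
import math


def regula_falsi(f, a, b, tol=1e-12, max_iter=200):
    """Approximate a zero of f in [a, b]; f(a) and f(b) must differ in sign."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    c = a
    for _ in range(max_iter):
        # Zero of the line through (a, f(a)) and (b, f(b)).
        c = a - fa * (b - a) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol:
            return c
        # Keep the sub-interval on which the sign change occurs.
        if fa * fc < 0:
            b, fb = c, fc
        else:
            a, fa = c, fc
    return c


def G(x):
    return math.exp(-x) + 1.0 - x


root = regula_falsi(G, 1.0, 2.0)
print(root, G(root))
```

The example uses the function $G(x) = e^{-x} + 1 - x$ from the beginning of the section, whose values at 1 and 2 have opposite signs.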
18.3 Splines
Let’s return to the problem I posed at the beginning of Chap. 15.15: How can one draw
a beautiful, smooth curve through n given data points? If a series of points is given that
should lie on the graph of a function, then one interpolates between the points. The
resulting curve is called interpolation curve or spline.
Spline means curve ruler in English. When interpolation curves could not yet be cal-
culated numerically, one drew curves with the help of a spline, a flexible ruler that could
be exactly laid out at the given points.
Of course, the curve we want to generate now must be easy to calculate. Polynomi-
als offer themselves as candidates. You will remember that we were able to approximate
given functions very well with polynomials several times. It applies:
and
\[ L(x) = L_0(x)\,y_0 + L_1(x)\,y_1 + \cdots + L_n(x)\,y_n. \]
$L(x)$ is called the $n$-th Lagrange interpolating polynomial to $(x_i, y_i)$, $i = 0, \ldots, n$.
from which it follows that L is a polynomial of degree less than or equal to n with
L(xi ) = yi, so just a polynomial through the given points.
Mostly $\deg L = n$; $\deg L < n$ is possible if coefficients cancel out when the individual
$L_i$ are added.
Why can there not be several polynomials of degree ≤ n through the n + 1 points? If
f (x) and g(x) were such polynomials, then f (x) − g(x) would be a polynomial of degree
≤ n with n + 1 zeros. But that cannot be according to Theorem 5.24. Thus Theorem 18.5
is completely proved.
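The Lagrange polynomial can be evaluated without ever computing its coefficients. The following Python sketch assumes the usual basis polynomials $L_i(x) = \prod_{j \ne i} (x - x_j)/(x_i - x_j)$, which are 1 at $x_i$ and 0 at all other interpolation points; the function names are chosen for this illustration only.

```python
def lagrange_basis(xs, i, x):
    """Value at x of the i-th basis polynomial L_i (assumed standard form)."""
    result = 1.0
    for j, xj in enumerate(xs):
        if j != i:
            result *= (x - xj) / (xs[i] - xj)
    return result


def lagrange(xs, ys, x):
    """Value L(x) = sum of L_i(x) * y_i of the interpolating polynomial."""
    return sum(lagrange_basis(xs, i, x) * yi for i, yi in enumerate(ys))


xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]          # three points on the parabola y = x^2
print([lagrange(xs, ys, x) for x in xs])
print(lagrange(xs, ys, 1.5))  # the parabola itself is reproduced
```

Since the polynomial through the points is unique, three points taken from a parabola must give back exactly this parabola.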
This type of interpolation is unfortunately mostly not well suited to solve our task of the
“beautiful” connection of points. The problem is that a polynomial of degree n can have
up to n − 1 extrema. In the worst case, the connection of 5 points can look like in Fig.
18.6. This curve does not have any corners, but it certainly did not turn out the way we
wanted it to.
The Lagrange polynomials have rather theoretical meaning. For us, the following
error formula will be important in Sect. 18.4. It provides information about how much
the Lagrange polynomial can differ at most from a good and smooth function that runs
through the points (xi , yi ):
Cubic Splines
I would like to present you a practical solution of the problem of the smooth connec-
tion of points. There are many approaches to this. In general, the interpolant function
is defined piecewise and it is ensured that the curve parts fit together well at the seams.
I would like to introduce you to such a method. With the widely used cubic splines the
curve segments are generated by third degree polynomials.
So we are looking for a “nice” curve s : [x0 , xn ] → R through (xi , yi ), where it should
be x0 < x1 < · · · < xn (Fig. 18.7). Between these points we interpolate by polynomials,
that means we define s piecewise: Between xi and xi+1 the following should apply
\[ s|_{[x_i, x_{i+1}]} = s_i : [x_i, x_{i+1}] \to \mathbb{R}, \quad x \mapsto a_i(x - x_i)^3 + b_i(x - x_i)^2 + c_i(x - x_i) + d_i. \tag{18.8} \]
Now we have 3n − 1 linear equations for our 4n unknowns. So we can set up more con-
ditions.
Let us simply require that the curve should not only be smooth, but especially
smooth: The first derivative should also have no kinks, that is, it should be differenti-
able. This means that the second derivatives should also agree at the inner points. Thus s
is twice differentiable. This gives us n − 1 more equations:
\[ s_i''(x_{i+1}) = s_{i+1}''(x_{i+1}), \quad i = 0, \ldots, n-2, \]
or after insertion into (18.10):
\[ 6a_i\,\Delta x_i + 2b_i = 2b_{i+1}, \quad i = 0, \ldots, n-2, \tag{18.14} \]
so that we now have a system of linear equations with 4n − 2 equations and 4n
unknowns. The solution space of this system has dimension at least 2.
There are different approaches to choosing a suitable function. Often, the requirement
\[ s_0''(x_0) = 0, \qquad s_{n-1}''(x_n) = 0, \tag{18.15} \]
is made at the edge points, which means the function is not curved at the edges; one
could continue it linearly to the left and right. In this way, the natural splines are
obtained. The clamped splines are obtained if the slope is given at the edges:
\[ s_0'(x_0) = m_0, \qquad s_{n-1}'(x_n) = m_1. \]
Finally, the periodic splines play an important role, which assume that the function is
periodic. Here one demands
\[ s_0(x_0) = s_{n-1}(x_n), \qquad s_0'(x_0) = s_{n-1}'(x_n), \qquad s_0''(x_0) = s_{n-1}''(x_n). \]
From all these boundary conditions, two more linear equations for the coefficients can
be derived, so that finally the interpolation curve is uniquely determined. For the natural
splines, we get from (18.10):
\[ 2b_0 = 0, \qquad 6a_{n-1}(x_n - x_{n-1}) + 2b_{n-1} = 0. \tag{18.16} \]
How do you calculate the coefficients ai, bi, ci, di concretely? If you move the equa-
tions back and forth a bit more, you can reduce the problem of the 4n equations to the
solution of n other linear equations with other unknowns, which also have a very simple
form. I would like to sketch the procedure, you can follow the calculations on paper. I
restrict myself to the natural splines.
The new unknowns are exactly the second derivatives at the interpolation points,
which I denote with zi := si′′ (xi ), i = 0, . . . , n − 1. The following formulas become some-
what simpler if I formally introduce the (n + 1)-th unknown zn = 0.
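For $i = 0, \ldots, n-1$ we get, with $\Delta x_i = x_{i+1} - x_i$:
\[ d_i = y_i, \qquad b_i = \frac{z_i}{2}, \qquad a_i = \frac{z_{i+1} - z_i}{6\,\Delta x_i}. \]
If we finally insert the just calculated $d_i$, $b_i$, $a_i$ in (18.12) and solve for $c_i$, we get:
\[ c_i = \frac{y_{i+1} - y_i}{\Delta x_i} - \frac{1}{6}\,\Delta x_i\,(z_{i+1} + 2z_i), \quad i = 0, \ldots, n-1. \]
$x_i$, $y_i$ are known in these four determination equations for $a_i, b_i, c_i, d_i$, so that only the
unknowns $z_i$ have to be determined.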
As complicated as the equations now look: If we insert the calculated coefficients
$a_i, b_i, c_i, d_i$ in (18.13), much will dissolve into nothing. For $i = 0, \ldots, n-2$ we get
equations of the form
\[ \alpha_i z_i + \beta_i z_{i+1} + \gamma_i z_{i+2} = \delta_i, \]
where
\[ \alpha_i = \Delta x_i, \quad \beta_i = 2(\Delta x_i + \Delta x_{i+1}), \quad \gamma_i = \Delta x_{i+1}, \quad \delta_i = 6\left(\frac{y_{i+2} - y_{i+1}}{\Delta x_{i+1}} - \frac{y_{i+1} - y_i}{\Delta x_i}\right). \]
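In the natural splines, according to (18.15), we further have $z_0 = 0$, and the complete
system of linear equations looks as follows:
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 \\
\alpha_0 & \beta_0 & \gamma_0 & 0 & \cdots & 0 \\
0 & \alpha_1 & \beta_1 & \gamma_1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & \alpha_{n-2} & \beta_{n-2} & \gamma_{n-2} \\
0 & \cdots & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} z_0 \\ z_1 \\ z_2 \\ \vdots \\ z_{n-1} \\ z_n \end{pmatrix}
=
\begin{pmatrix} 0 \\ \delta_0 \\ \delta_1 \\ \vdots \\ \delta_{n-2} \\ 0 \end{pmatrix}.
\]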
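The whole procedure for natural splines fits into a short Python sketch. It is an illustration under my own assumptions: the names `natural_spline` and `evaluate_spline` are invented, the tridiagonal system for $z_1, \ldots, z_{n-1}$ is solved by a simple forward elimination and back substitution without pivoting, and $z_0 = z_n = 0$ is used directly instead of keeping the first and last row of the matrix.

```python
from bisect import bisect_right


def natural_spline(xs, ys):
    """Coefficients (a_i, b_i, c_i, d_i) of the natural cubic spline segments."""
    n = len(xs) - 1                                   # number of segments
    dx = [xs[i + 1] - xs[i] for i in range(n)]
    slope = [(ys[i + 1] - ys[i]) / dx[i] for i in range(n)]

    z = [0.0] * (n + 1)                               # z_0 = z_n = 0
    m = n - 1                                         # unknowns z_1 .. z_{n-1}
    if m > 0:
        alpha = [dx[i] for i in range(m)]
        beta = [2.0 * (dx[i] + dx[i + 1]) for i in range(m)]
        gamma = [dx[i + 1] for i in range(m)]
        delta = [6.0 * (slope[i + 1] - slope[i]) for i in range(m)]
        for i in range(1, m):                         # forward elimination
            q = alpha[i] / beta[i - 1]
            beta[i] -= q * gamma[i - 1]
            delta[i] -= q * delta[i - 1]
        for i in range(m - 1, -1, -1):                # row i yields z_{i+1}
            z[i + 1] = (delta[i] - gamma[i] * z[i + 2]) / beta[i]

    coeffs = []
    for i in range(n):
        a = (z[i + 1] - z[i]) / (6.0 * dx[i])
        b = z[i] / 2.0
        c = slope[i] - dx[i] * (z[i + 1] + 2.0 * z[i]) / 6.0
        coeffs.append((a, b, c, ys[i]))
    return coeffs


def evaluate_spline(xs, coeffs, x):
    """Value s(x) using the segment that contains x."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(coeffs) - 1)
    a, b, c, d = coeffs[i]
    t = x - xs[i]
    return ((a * t + b) * t + c) * t + d


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, 1.0]
coeffs = natural_spline(xs, ys)
print([evaluate_spline(xs, coeffs, x) for x in xs])
```

If all given points lie on a straight line, all $z_i$ vanish and the spline is this line, which is a quick plausibility test for an implementation.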
Parametric Splines
Often, points cannot be connected by the graph of a function, the given points do not
have to follow each other monotonically. If you want to connect a series of points (xi , yi )
in the plane by a curve smoothly, you have to look for a parameter representation of the
curve. See Definition 16.15 in Sect. 16.2. On the searched curve (x(t), y(t)) for certain
parameter values ti it always should be x(ti ) = xi and y(ti ) = yi. Now we have to solve
two interpolation problems: The points (ti , xi ) and (ti , yi ) are each connected by cubic
splines. The result are the smooth parameter functions x(t), y(t).
How to choose the parameter points ti? Of course, we could simply take the values
0, 1, 2, 3, . . . and interpolate the points (i, xi ) and (i, yi ). However, it is advisable to include
the distance between the points (xi , yi ) in the parameterization. For example, you can use:
\[ t_0 = 0, \qquad t_i = t_{i-1} + \sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2}, \quad i = 1, \ldots, n. \]
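Computing such distance-based parameter values takes only a few lines of Python. This is a small sketch; the function name `chord_length_parameters` and the representation of the points as coordinate pairs are assumptions of the example.

```python
import math


def chord_length_parameters(points):
    """Parameter values t_i accumulated from the distances of successive points."""
    t = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        t.append(t[-1] + math.hypot(x1 - x0, y1 - y0))
    return t


points = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
print(chord_length_parameters(points))
```

The resulting values $t_i$ then serve as interpolation points for the two spline problems $(t_i, x_i)$ and $(t_i, y_i)$.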
In addition to the cubic splines calculated here, there are many other methods for curve
interpolation: polynomials of different degrees can be taken, the boundary conditions are
not unique, and completely different approaches are possible, such as interpolation with
Bezier curves. Every CAD system offers many different types of splines. This flexibility
can cause gray hair to a constructor: For example, in CAD, a car body part is defined
by only a finite number of points, the shape between these points is interpolated. If the
constructor’s and manufacturer’s CAD systems are not absolutely identically configured,
one might wonder about a dent in the door.
Integral calculation is a tricky business and requires imagination. In the libraries you find
large integral tables with integrals calculated by someone at some time. With the help of
computer algebra systems, many integrals can be solved analytically. Nevertheless, there
are very simple functions that have been proven not to have an elementary function as
antiderivative. For example, the function $e^{-x^2/2}$, which describes the bell curve.
Statisticians help themselves with tables. Also the so-called elliptic integral
$\int \sqrt{1 - k^2 \sin^2 t}\,dt$, $0 < k < 1$, which occurs in the calculation of the circumference
of an ellipse, cannot be expressed in closed form. Numerical integration therefore plays an
important role in applied mathematics.
We divide the interval [a, b] into n equal parts of length hn = (b − a)/n with the inter-
polation points a = x0 , x1 , . . . , xn = b. As a first attempt, we can simply use the Riemann
sums, our definition of the integral in Theorem 16.3 was constructive after all: Then we
get for the area
\[ \int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(x_i)\,h_n. \]
Unfortunately, this method converges very badly (Fig. 18.8). We get a significant
improvement by the trapezoidal rule: Between two interpolation points, not a bar, but a
trapezoid is placed, whose upper endpoints represent the function values of the interpola-
tion points (Fig. 18.9).
The area of the trapezoid between $x_i$ and $x_{i+1}$ is $h_n \cdot \frac{f(x_i) + f(x_{i+1})}{2}$, and thus we get
for the total area:
\[ F_n = h_n\left(\frac{f(x_0) + f(x_1)}{2} + \frac{f(x_1) + f(x_2)}{2} + \cdots + \frac{f(x_{n-1}) + f(x_n)}{2}\right) = h_n\left(\frac{f(x_0)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2}\right). \]
What is the error we make with this approximation? Now the Lagrange interpolating
polynomials from Definition 18.6 come into play: We assume that $f$ is twice continuously
differentiable and first examine the section between $x_i$ and $x_{i+1}$. The Lagrange polynomial
$L(x)$, which connects the two points $(x_i, f(x_i))$ and $(x_{i+1}, f(x_{i+1}))$, has degree 1 and
therefore represents the line between these two points and thus the boundary line of the
trapezoid. If we also use the error calculation from Theorem 18.7, we get the error $\Delta F_i$
for the area calculation of the $i$-th partial section as
\[ \Delta F_i = \int_{x_i}^{x_{i+1}} \bigl(f(x) - L(x)\bigr)\,dx = \int_{x_i}^{x_{i+1}} \frac{f''(\vartheta)}{2!}\,(x - x_i)(x - x_{i+1})\,dx. \]
Here $\vartheta$ is a value between $x_i$ and $x_{i+1}$.
18.4 Numerical Integration 453
If you’ve gotten this far, then integrating a quadratic polynomial is no longer art.
I will give you the result:
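\[ \Delta F_i = \frac{f''(\vartheta)}{12}\,(x_i - x_{i+1})^3. \]
Now we just have to sum up all the $\Delta F_i$. Let $M = \max\{|f''(x)| \mid x \in [a, b]\}$, then
\[ \left|\sum_{i=0}^{n-1} \Delta F_i\right| \le \sum_{i=0}^{n-1} \frac{M}{12}\,h_n^3 = \frac{n h_n}{12}\,h_n^2 M = \frac{b - a}{12}\,h_n^2 M. \]
So in total
\[ \left|\int_a^b f(x)\,dx - F_n\right| \le \frac{b - a}{12}\,h_n^2 \cdot M. \]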
The error therefore goes to zero with the square of the step size.
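The trapezoidal rule and its error estimate can be checked with a short Python sketch. The function name `trapezoid` and the test integral of the sine function over $[0, \pi]$ (exact value 2, with $M = 1$) are my choices for this illustration.

```python
import math


def trapezoid(f, a, b, n):
    """Trapezoidal rule with n sub-intervals of equal length h = (b - a) / n."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total


a, b, M = 0.0, math.pi, 1.0   # |sin''(x)| <= 1 on [0, pi]
for n in (10, 20, 40):
    h = (b - a) / n
    error = abs(2.0 - trapezoid(math.sin, a, b, n))
    bound = (b - a) / 12 * h**2 * M
    print(n, error, bound)
```

Each doubling of $n$ should reduce the printed error to about a quarter, in line with the factor $h_n^2$ in the estimate.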
Even better approximations can be achieved by connecting the function values of
the support points with spline curves. In Simpson’s rule, quadratic splines are used: The
function values at the three points $a$, $(a + b)/2$ and $b$ are connected by the uniquely
determined parabola segment that runs through them.
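If $g(x) = cx^2 + dx + e$ is the parabola through $f(a)$, $f((a + b)/2)$ and $f(b)$ (Fig.
18.10), then the area $F_2$ under this parabola is equal to
\[ F_2 = \int_a^b g(x)\,dx = \frac{c}{3}(b^3 - a^3) + \frac{d}{2}(b^2 - a^2) + e(b - a). \]
You can easily check the following transformations:
\[
\begin{aligned}
F_2 &= \frac{b - a}{6}\bigl[2c(b^2 + ab + a^2) + 3d(b + a) + 6e\bigr] \\
&= \frac{b - a}{6}\left[(cb^2 + db + e) + 4\left(c\left(\frac{a + b}{2}\right)^2 + d\,\frac{a + b}{2} + e\right) + (ca^2 + da + e)\right] \\
&= \frac{b - a}{6}\left[g(b) + 4g\left(\frac{a + b}{2}\right) + g(a)\right].
\end{aligned}
\]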
$b$, $(a + b)/2$ and $a$ are precisely the points at which $f$ and $g$ match. So we get
\[ F_2 = \frac{b - a}{6}\left[f(b) + 4f\left(\frac{a + b}{2}\right) + f(a)\right], \tag{18.17} \]
a formula that is easy to evaluate. The parabola parameters c, d, e have disappeared as if
by magic. The parabola doesn’t have to be calculated explicitly.
Now we subdivide the interval $[a, b]$ again into $n$ parts of equal length $h_n$, where $n$ must
be even this time, and apply Simpson’s rule to each pair of adjacent sub-intervals (shown in
Fig. 18.11 for $n = 4$). Then, by combining the areas, we get the composite Simpson’s rule:
\[ F_n = \frac{h_n}{3}\Bigl[f(x_0) + 4\bigl(f(x_1) + f(x_3) + \cdots + f(x_{n-1})\bigr) + 2\bigl(f(x_2) + f(x_4) + \cdots + f(x_{n-2})\bigr) + f(x_n)\Bigr]. \]
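A Python sketch of the composite Simpson rule might look as follows. The function names `simpson` and `bell` are assumptions of this example; as a reference value for the bell curve integral I use the error function `math.erf` from the standard library, which is not part of the text.

```python
import math


def simpson(f, a, b, n):
    """Composite Simpson rule with an even number n of sub-intervals."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))    # x_1, x_3, ..., x_{n-1}
    even = sum(f(a + i * h) for i in range(2, n, 2))   # x_2, x_4, ..., x_{n-2}
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


def bell(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


approx = simpson(bell, 0.0, 1.0, 10)
reference = 0.5 * math.erf(1 / math.sqrt(2))
print(approx, reference)
print(simpson(lambda x: x**3, 0.0, 2.0, 2))   # cubic polynomial, exact value 4
```

Already ten sub-intervals give several correct decimals for the bell curve, and the cubic polynomial is integrated exactly up to rounding.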
Since differential equations are often very difficult to solve, but play an immense role in
technology and economy, numerical methods for solving differential equations are very
important and widespread. For example, the weather forecast is essentially based on the
18.5 Numerical Solution of Differential Equations 455
\[ y_1 = y_0 + \frac{f(x_0, y_0) + f\bigl(x_1,\, y_0 + f(x_0, y_0) \cdot h\bigr)}{2} \cdot h, \]
or as the $k$-th step:
\[ y_{k+1} = y_k + \frac{f(x_k, y_k) + f\bigl(x_{k+1},\, y_k + f(x_k, y_k) \cdot h\bigr)}{2} \cdot h. \]
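The step formula of Heun’s method translates almost literally into Python. The sketch below also contains Euler’s method, which uses only the initial slope $f(x_k, y_k)$, for comparison. The function names and the test equation $y' = y$, $y(0) = 1$ with the exact value $y(1) = e$ are assumptions of this illustration.

```python
import math


def euler(f, x0, y0, h, steps):
    """Euler's method: follow the slope at the start of each step."""
    x, y = x0, y0
    for _ in range(steps):
        y += f(x, y) * h
        x += h
    return y


def heun(f, x0, y0, h, steps):
    """Heun's method: average the slope at the start and at the Euler end point."""
    x, y = x0, y0
    for _ in range(steps):
        m_start = f(x, y)
        m_end = f(x + h, y + m_start * h)
        y += 0.5 * (m_start + m_end) * h
        x += h
    return y


def rhs(x, y):
    return y


print(abs(euler(rhs, 0.0, 1.0, 0.01, 100) - math.e))
print(abs(heun(rhs, 0.0, 1.0, 0.01, 100) - math.e))
```

With the same step size the error of Heun’s method is smaller by several orders of magnitude, at the price of two function evaluations per step.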
Finally, I would like to introduce you to the Runge-Kutta method. Like the previous
methods, it is based on the idea of guessing the correct slope to the next curve point
(xk+1 , yk+1 ) from the point (xk , yk ). In Euler’s method, the initial slope was used, in
Heun’s method the average of two slopes was formed. In the Runge-Kutta method, we
shoot four arrows towards the next value (xk+1 , yk+1 ) and form a weighted average of
these four slopes. In Fig. 18.14 you can see the procedure outlined.
Starting from the point P = (xk , yk ), we first proceed as in Euler’s method, but only
up to half of the interval to be bridged. The slope in the point P is m1 = f (xk , yk ). We get
the point (xk + h/2, yk + m1 · h/2) and there the slope m2 = f (xk + h/2, yk + m1 · h/2).
Now we start from P with the new slope m2 to the middle of the interval and get there
the point (xk + h/2, yk + m2 · h/2) with the slope m3 = f (xk + h/2, yk + m2 · h/2). With
this third slope, we finally go from P to the right edge of the interval and reach the point
(xk + h, yk + m3 · h) with the slope m4 = f (xk + h, yk + m3 · h). From these four slopes
we form an average, whereby the slopes m2 and m3, which were formed in the middle
of the interval, are double-weighted. This average then gives us the final direction with
which we calculate the point (xk+1 , yk+1 ). In summary, the following steps result:
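\[
\begin{aligned}
m_1 &= f(x_k, y_k) \\
m_2 &= f(x_k + h/2,\; y_k + m_1 \cdot h/2) \\
m_3 &= f(x_k + h/2,\; y_k + m_2 \cdot h/2) \\
m_4 &= f(x_k + h,\; y_k + m_3 \cdot h)
\end{aligned}
\qquad \text{(calculation of 4 slopes)}
\]
\[ m = \frac{1}{6}(m_1 + 2m_2 + 2m_3 + m_4) \qquad \text{(calculation of weighted average)} \]
\[ y_{k+1} = y_k + m \cdot h \qquad \text{(calculation of next function value)} \]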
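The three parts of a Runge-Kutta step can be written down one to one in Python. This is a sketch with function names of my own choosing; the test equation $y' = y$, $y(0) = 1$ is again just an assumption for trying it out.

```python
import math


def runge_kutta_step(f, x, y, h):
    """One Runge-Kutta step from (x, y) to x + h."""
    m1 = f(x, y)
    m2 = f(x + h / 2, y + m1 * h / 2)
    m3 = f(x + h / 2, y + m2 * h / 2)
    m4 = f(x + h, y + m3 * h)
    m = (m1 + 2 * m2 + 2 * m3 + m4) / 6   # weighted average of the 4 slopes
    return y + m * h


def runge_kutta(f, x0, y0, h, steps):
    """Repeat the step to approximate y(x0 + steps * h)."""
    x, y = x0, y0
    for _ in range(steps):
        y = runge_kutta_step(f, x, y, h)
        x += h
    return y


approx = runge_kutta(lambda x, y: y, 0.0, 1.0, 0.1, 10)
print(approx, math.e, abs(approx - math.e))
```

Even with the coarse step size $h = 0.1$ the value of $e$ is reproduced to about five decimals.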
This approach looks somewhat arbitrary, but it is not: If you choose a differential equa-
tion y′ = f (x) that does not depend explicitly on y, then you get for the mean slope m:
\[ m = \frac{1}{6}\bigl(f(x_k) + 4f(x_k + h/2) + f(x_k + h)\bigr) \]
and thus
\[ y(x_{k+1}) = y(x_k) + \frac{h}{6}\bigl(f(x_k) + 4f(x_k + h/2) + f(x_k + h)\bigr). \tag{18.18} \]
We can also integrate $f(x)$ directly, which gives us:
\[ \int_{x_k}^{x_{k+1}} f(x)\,dx = y(x_{k+1}) - y(x_k). \]
If you evaluate this integral numerically using Simpson’s rule (see Sec. 18.4), you will
get the result (18.18) we obtained by solving the differential equation using the Runge-
Kutta method.
Try it yourself to see that for differential equations of the form y′ = f (x), Heun’s
method corresponds to the trapezoidal rule and Euler’s method to integration using Rie-
mann sums.
In the solution methods for differential equations, one speaks of methods of order $n$ if
the difference between the calculated slope and the actual slope of the function $y(x)$ is of
order $O(h^n)$, where $h$ represents the step size of the method. The order of Euler’s method
is 1, the order of Heun’s method is 2, and that of the Runge-Kutta method even 4: The
error has order $O(h^4)$.
Comprehension questions
1. If a and b are numbers with numerical errors, in which operations can the error
cause greater problems: addition or multiplication?
2. You want to transform an equation for computing zeros into a fixed point equation.
Does that always work?
3. If the Banach fixed point theorem cannot be applied to a nonlinear equation
because it is not a contraction mapping, can the equation still have a fixed point?
4. You know that a function has a zero. Which method always returns the zero?
5. In cubic splines you have the freedom to specify the first or second derivative at
the edges. Can such a specification also have an effect on the inner curve sections
of a spline?
6. Simpson’s rule can be used to integrate polynomials of degree 3 exactly. Can you
make a similar statement for integration with the trapezoidal rule? Take a look at
the given error estimates again.
7. In the numerical solution of differential equations, the solution can be made arbi-
trarily accurate in theory if only the step size is made small enough. What are the
limits of this accuracy in practice?
Exercises
6. Calculate the Lagrange polynomial and the natural spline functions for n given
points in R2. Use a graphics program to plot your results.
7. Implement the trapezoidal rule and Simpson’s composite rule for numerical
integration. Use this to calculate the next decimal for the table in the appendix.
Estimate beforehand how fine you need to choose the subintervals in order to be as
efficient as possible. Use $\int_{-\infty}^{0} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt = 0.5$.
8. Show that for differential equations of the form y′ = f(x), Euler’s method is
equivalent to integration with Riemann sums and Heun’s method is equivalent to the
trapezoidal rule.
9. Implement Euler′s method, Heun′s method and the Runge-Kutta method for the
numerical solution of first-order differential equations. Use this to determine the
rabbit curve
Abstract
Probability theory is the basis for statistics. This is what we deal with in this chapter.
At the end of it
It may initially look like a tightrope walk to describe processes with the exact language
of mathematics whose results are unpredictable. Probability has something to do with
uncertainty, but mathematics only knows true and false. Nevertheless, precise statements
can also be made about probabilities. However, these can sometimes sound a bit weird,
and you have to look at them closely to see what’s really behind them.
In public, statistics are occasionally referred to as a refined form of lies. You know the say-
ing which is attributed to Churchill: “The only statistics you can trust are those you falsi-
fied yourself.” Maybe this reputation is also due to the fact that statistical statements are
often presented in a shortened form in the media and thus the precision is lost. However,
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, 463
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_19
464 19 Probability Spaces
the exact statements are often difficult to digest for the uninitiated reader and – I’m afraid –
sometimes also for the publicist.
Statistics permeate our lives, and computers are of course increasingly the tools with
which they are generated. The mathematics behind it is complex. There are many text-
books for statistics users, which are essentially extensive formula collections with many
examples of applications. I am taking a different approach: In the following chapters I
will introduce the mathematical basics that are necessary in order to work sensibly with
statistical tools and to interpret the results. I will describe a few statistical methods by
way of example. This should help you to be able to use other methods if necessary, with-
out having to fully understand their mathematical background.
I would like to introduce you to some typical problems that statisticians deal with. Based
on these questions, we can see which toolbox we need to be able to give the right answers.
In the elections to the German Bundestag, about 65 million citizens are entitled to vote.
For the forecast on election night, an exit poll is carried out, that is, after leaving the
polling station, a certain number of voters are asked about their vote. Let’s assume that
all respondents are happy to tell the truth openly.
For example, 1000 voters could be asked. Of these, 400 voted for party A, 350 for
party B and 60 for party C. From this, the polling company creates the following forecast:
party A: 40 % party B: 35 % party C: 6 %
This is announced on television. Most of the time, the reporter then says something like
“There can still be fluctuations of plus/minus x percent.”
Where does this number x come from? It is clear that the estimate will be the bet-
ter the more voters are asked. If only 100 voters are asked, no reliable prediction can
be expected. The statistician will therefore determine a number x which depends on the
sample size and which characterizes the possible deviations.
But does this give us a precise statement? Even if 10 million citizens are asked,
there is, albeit naturally vanishingly small, the possibility that only voters of party A are
caught and that the forecast is therefore completely wrong.
You know that the first election forecasts are usually very good nowadays, but some-
times they do change. These are then the more exciting election nights, but afterwards
the mistake is sought among the mathematicians.
However, the clever statistician, to hedge his bets, has formulated his statement some-
thing like this:
“With a probability of 95%, my forecast deviates from the actual final result by less
than 1%.”
The 95% is left out by the newsreader, he speaks of a maximum of 1% deviation and is
usually right. The polling company delivers the figures as ordered: It can also guarantee
a deviation of ± 1% with a probability of 99%, but then it wants more money from the
client because it has to question more voters.
Quality Checking
In industrial production of a part, for example a ball bearing, often not all balls are
checked because it is too expensive. Samples are taken at regular intervals and the
number of defective parts in this sample is recorded in a control chart. If this number
exceeds a certain threshold, the production process must be corrected. The problem to be solved is the same
as with the election forecast. One can make statements like “with a probability of 99%,
no more than 0.01% of the balls are defective”. The 99% and the 0.01% are threshold
values set by the manufacturer. Based on these values, the sample size and the allowed
number of defective parts in the sample must be determined.
Determining Estimates
Carp ponds for fish farming are very common in Germany. How many carp swim in the
pond? If it is not possible to fish the pond completely, you can catch and mark 100 fish
for example. After a few days you catch 100 fish again. If 10 marked fish are contained
in it, you can guess that a total of about 1000 carp live in the pond. Again the question
arises: What does “about” mean? With the help of the first two examples you can already
find out what form the statement of the statistician will have.
We also encounter a problem here that lies outside of mathematics and which we cannot
consider further: How good is the method with which the sample is obtained? Are there sys-
tematic procedural errors that distort the results? Could it be, for example, that the fish form
shoals and that there is no mixing at all in the pond? Then the experiment is completely
wrongly designed. In the case of election polls too, a large part of the know-how of the com-
panies lies not in the mathematical evaluation, but in the clever construction of the selected
sample. But that is another matter.
In the examples given so far, an object is selected randomly from a population whose
elements have certain characteristics and the characteristics of this object are checked.
The mathematicians have constructed a prototype for such tasks, the urn model: In an
urn there are n balls with different colors, from which m balls are selected randomly. We
will carry out this experiment several times and always assume that there are no proce-
dural errors: The balls are well mixed and the selection is really random.
Testing a Hypothesis
A few years ago, a British physicist proved Murphy’s law: He computed that when a slice
of toast falls, it falls on the buttered side more often than on the unbuttered side. This
happens for physical reasons that are related to our gravitational constant, table height,
air pressure, specific weight of butter and toast, and other factors. We want to approach
the question with statistical methods. I don’t believe in Murphy and I put forward the
hypothesis: The probability for “butter on the floor” is exactly 50%.
Now we drop the toast 100 times: 54 times it falls on the butter side, 46 times on the
other. What can we conclude from this? Probably only: “The result does not contradict
the hypothesis.”
If the toast falls on the buttered side 80 times out of 100 attempts, we will most likely
say: “The deviation is significant, the hypothesis must be rejected, Murphy is right”.
What does “significant” mean in this context? Somewhere there is a threshold, at which
our confidence in the hypothesis is broken. What would you say? 65, 60, 70 slices?
Even with the rejection of the hypothesis in the second case, we can not be quite sure:
Maybe the toast falls on the side without butter the next 100 times. With what probability
do we make a mistake with the rejection? In test theory, statements of this kind are made.
If the hypothesis can not be rejected, can it be accepted? No, because even if the toast
falls on the buttered side exactly 50 times, the actual probability could be, for example,
52%. In this case, no statement is possible! Here, mistakes are often made, be careful.
More serious applications of the hypothesis test can be found, for example, in the
approval process for drugs or chemicals.
Queueing theory plays an important role in computer science: Requests arrive at a web
server at random intervals, the average number per day is known. Each request needs a
certain amount of time, which also varies randomly, until it has been served. How must
the server’s capacity be designed to avoid long waiting times?
Probabilistic Algorithms
There are many problems in computer science that cannot be solved exactly in a reasonable
time, for example the NP-complete problems. For many such tasks (for example, the
traveling salesman problem), there are probabilistic approaches: we can find solutions that
are correct only with a certain probability, or that are very likely to deviate from the optimal
solution by at most a certain percentage.
We have developed such a probabilistic algorithm in Sect. 5.7: To find large prime
numbers needed in cryptography, one performs prime number tests. The result is then
only “probably a prime number”, but for practical applications that is sufficient.
Can you determine the number π by dropping a needle on a piece of paper? Perform
the following experiment (Buffon’s needle problem): Draw a series of parallel lines on a
piece of paper, whose distance is just as large as the length of a needle. Focus firmly on
the number π, drop the needle often on the paper, and check each time whether the nee-
dle hits one of the lines or not. Count the number N of attempts as well as the number T
of hits. Then the fraction N/T will get closer and closer to π/2.
Before you believe that I am drifting into esotericism, some mathematics to this
experiment: For simplicity, let’s assume that the needle is 2 cm long, so the distance
between the lines is also 2 cm. The needle can only hit the closest line, by which I
mean the line that has the shortest distance from the center of the needle. I call this dis-
tance d (0 ≤ d ≤ 1) (Fig. 19.1). α is the acute angle that the needle forms with this line
(0 ≤ α ≤ π). The needle hits the line exactly when d < sin α. So for each throw there is
a value pair (α, d) in the rectangle [0, π] × [0, 1]. Then d < sin α if and only if (α, d) lies
below the graph of sin x, in Fig. 19.2 in the grey area. If you carry out the experiment
randomly, the points (α, d) are evenly distributed in the rectangle, and the following holds:
$$\frac{\text{number of hits}}{\text{number of attempts}} = \frac{T}{N} \approx \frac{\text{grey area}}{\text{total area}}.$$
The grey area is $F = \int_0^{\pi} \sin x\, dx = 2$, the total area is $G = \pi$, so $T/N \approx 2/\pi$, from
which $N/T \approx \pi/2$ results.
What we have done here is nothing other than a numerical integration: The integral to
be calculated is enclosed in a rectangle and from this rectangle a large number of points
are randomly selected. Then the ratio “integral to rectangle” is equal to the ratio “points
in the integral area to total number of points”.
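The experiment can be imitated on a computer. The following is a minimal sketch, assuming the pairs (α, d) are drawn uniformly from the rectangle [0, π] × [0, 1] as described above; the function name `buffon_estimate` and the fixed seed are illustrative choices. Note that the simulation itself uses the value of π to draw α, so it illustrates the principle rather than offering a practical way to compute π.

```python
import math
import random


def buffon_estimate(attempts, seed=0):
    """Return N/T for `attempts` simulated needle throws.

    Each throw is a point (alpha, d) in [0, pi] x [0, 1];
    the needle hits a line exactly when d < sin(alpha).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(attempts):
        alpha = rng.uniform(0.0, math.pi)
        d = rng.uniform(0.0, 1.0)
        if d < math.sin(alpha):
            hits += 1
    return attempts / hits


for n in (1_000, 10_000, 100_000):
    print(n, buffon_estimate(n), "target:", math.pi / 2)
```

With more throws the quotient N/T typically settles near π/2, although individual runs can still deviate noticeably.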
Quicksort has an average runtime of O(n · log n), so it is a very good sorting algorithm.
Distributions
In Fig. 19.3 you can see in a very abbreviated form the first chapters of this book. I com-
pressed them and then read them in as a sequence of 32-bit integers. For each of the first
100 000 integers, I counted the number of ones in the binary representation.
All values, with the exception of 0 times one and 32 times one, occurred. The most
frequent was 16 times one: exactly 13 346 times. I entered the individual results between
0 and 32 in the diagram in Fig. 19.3. You can see that the distribution of frequencies fits
very well with the dashed curve. This is the bell curve, which occurs again and again in
the analysis of many data sets and plays an important role in statistical applications.
Implement such a “one counter” yourself and apply it to text files, binary files, or sequences
of random numbers. What do your results look like?
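One possible starting point for such a “one counter” is sketched below. It assumes that the data is read as little-endian unsigned 32-bit integers (the byte order is not specified in the text) and that trailing bytes which do not fill a complete integer are ignored; the function name `ones_histogram` is illustrative.

```python
import os
import struct
from collections import Counter


def ones_histogram(data, limit=100_000):
    """Count how often each number of one-bits occurs.

    `data` is a bytes object that is interpreted as a sequence of
    little-endian 32-bit integers; at most `limit` integers are used.
    """
    usable = min(len(data) // 4, limit) * 4   # whole integers only
    histogram = Counter()
    for (value,) in struct.iter_unpack("<I", data[:usable]):
        histogram[bin(value).count("1")] += 1
    return histogram


# Demonstration with random bytes; replace this by the contents of a file,
# for example open(path, "rb").read(), to analyse text or binary files.
sample = os.urandom(4 * 10_000)
for ones, frequency in sorted(ones_histogram(sample).items()):
    print(ones, frequency)
```

For random bytes the frequencies cluster around 16 ones; for uncompressed text files the picture can look quite different.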
In the German board game “Man, don’t get angry”, you can start with your peg only
after rolling a 6. In a dice experiment, I tested how long one has to wait for the six. The
result of “1000 times rolling until you get a 6” can be seen in Fig. 19.4. Twice the 6 only
came on the 43rd throw! Also here we can put a curve over the bars: It is an exponential
function.
The probabilities of events can be described using such distributions. Some of these
distributions occur again and again in statistics. One task of the next chapters will be to
characterise important such distributions and to determine for which experiments or for
which data sets these distributions are applicable.
The term probability plays a central role in all the examples of statistical problems
presented here. The foundation of all statistics is probability theory: Probabilities are
assigned to random events, which can then be combined and interpreted in certain ways.
We will examine such random events and their probabilities below.
Random Events
The statistician conducts experiments. With these experiments, the set of all possible out-
comes is usually known, but it cannot be predicted which specific result will occur. Such
an experiment is called a random experiment.
We have already encountered some random experiments: rolling dice, an election
poll, dropping a slice of toast, randomly selecting a point in a rectangle. Other examples
include playing the lottery, flipping a coin, or measuring a temperature.
The set of all possible outcomes of an experiment is called the outcome space, and is
often denoted by the Greek letter Ω.
Examples
Below, we will assign probabilities to events. It is important to distinguish between the out-
come of an experiment ω and the event {ω} which states that ω has occurred.
Which subsets of an outcome space Ω should be events? Of course, it makes sense to use
the sets ∅ and Ω as events, with A also the complement $\overline{A}$, and with several sets also their union
and their intersection. A system of sets $\mathcal{A}$ that satisfies these properties is called an algebra
of events:
(A1) $\Omega \in \mathcal{A}$.
(A2) If $A \in \mathcal{A}$, then $\overline{A} \in \mathcal{A}$ ($\overline{A}$ is the complement of $A$).
(A3) If $A_n \in \mathcal{A}$ for $n \in \mathbb{N}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{A}$.
The subsets that belong to the algebra of events A are called events.
Because of the possibility of building the complement, ∅ is included in A and all inter-
sections of sets from A also belong to A.
The concept of algebra of events plays, except in the next definition, no further practical
role in our considerations. For all finite and countably infinite sets Ω you can
confidently think of events as elements of the power set. Every subset of Ω is then an event. With
uncountable sets, such as subsets of real numbers, however, one can construct wild subsets
to which no reasonable probability can be assigned anymore. The restriction to certain
algebras of events therefore has mathematical reasons; all “good” subsets A of Ω can also be
referred to as events.
The probability of events, or whether an event can occur at all, is not yet stated.
For example, when measuring temperature, there is a physical upper limit somewhere,
while the real numbers continue far beyond it. But it does not matter if we include
these large numbers in the outcome space. For example, in the lottery we could choose
$\Omega = \{1, 2, \ldots, 69\}^5$ as the outcome space. Some of the results would then have probability 0.
Probability Spaces
How can we assign probabilities to specific events? The key here is the concept of
probability space. As we have done many times in this book, we choose an axiomatic
approach: Some basic and intuitive properties of probabilities are collected and writ-
ten down as axioms. The theorems of probability theory are derived from these axioms.
If the axiom system is good, then one can prove many theorems which show good agree-
ment with the real-world results.
The following axiom system fulfills these requirements, even though it is almost
unbelievable how simple it is. It was formulated in the 1930s by the Russian mathemati-
cian Kolmogoroff:
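For reference, the axioms can be written as follows. This is the usual textbook formulation; the assignment of the labels (W1) to (W3) is inferred from the discussion that follows, and $\mathcal{A}$ denotes the algebra of events on the outcome space $\Omega$.

```latex
A function $p \colon \mathcal{A} \to \mathbb{R}$ is called a probability if:
\begin{itemize}
  \item[(W1)] $0 \le p(A) \le 1$ for every event $A \in \mathcal{A}$.
  \item[(W2)] $p(\Omega) = 1$.
  \item[(W3)] For pairwise disjoint events $A_1, A_2, \ldots \in \mathcal{A}$:
        $p\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} p(A_i)$.
\end{itemize}
```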
The axioms (W1) and (W2) are immediately clear: We measure probability with num-
bers between 0 and 1, where 1 means the certain event. About (W3) one can stumble at
first. But for finite spaces we can replace (W3) by the rule:
p(A ∪ B) = p(A) + p(B), if A ∩ B = ∅. (19.1)
In an example, this means if the probability of rolling a 1 is p and the probability of roll-
ing a 2 is q, then the probability of rolling 1 or 2 is just p + q. This corresponds to com-
mon sense.
From (19.1) follows by mathematical induction:
$$p(A_1 \cup A_2 \cup \cdots \cup A_n) = p(A_1) + p(A_2) + \cdots + p(A_n), \quad \text{if } A_i \cap A_j = \emptyset \text{ for } i \neq j.$$
This is almost the axiom (W3), for infinite spaces it has been shown that one must also
allow countable unions of sets in the axiom.
Theorem 19.3: First conclusions from the axioms Let $(\Omega, p)$ be a probability
space and $A, B, A_1, A_2, \ldots, A_n$ events. Then the following holds:
a) $p(\overline{A}) = 1 - p(A)$.
b) $p(\emptyset) = 0$.
c) From $A \subset B$ it follows that $p(A) \le p(B)$.
d) If $A_i \cap A_j = \emptyset$ for all $i \neq j$, then:
$p(A_1 \cup A_2 \cup \cdots \cup A_n) = p(A_1) + p(A_2) + \cdots + p(A_n)$.
e) $p(A \cup B) = p(A) + p(B) - p(A \cap B)$.
As an example, I would like to derive property e). The points a) to d) can be calculated in
a very similar way:
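I remind you of the distributive laws for the set operations. From these we first get
$$A \cup B = (A \cup \overline{A}) \cap (A \cup B) = A \cup (\overline{A} \cap B),$$
$$B = (A \cup \overline{A}) \cap B = (A \cap B) \cup (\overline{A} \cap B),$$
where on the right side there is always the union of two disjoint events. This results in
$$p(A \cup B) = p(A) + p(\overline{A} \cap B), \qquad p(B) = p(A \cap B) + p(\overline{A} \cap B),$$
and subtracting the second equation from the first gives property e).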
Example
We now want to construct further concrete models for the abstract concept of the prob-
ability space:
Let us assume in the following that an experiment can only have finitely many results,
and that all these results have the same probability. Examples for this are
the coin toss: p({head}) = p({tail}),
the dice: p({1}) = p({2}) = · · · = p({6}),
the lottery: p({(1, 2, 3, 4, 5)}) = · · · = p({(7, 23, 34, 36, 45)}) = . . .
This does not include, for example, an opinion poll, a temperature measurement or
tossing a biased coin.
The mathematician speaks of a fair coin or a fair dice, if the probabilities are exactly
equally distributed. Of course, this does not exist in reality. But the lottery companies make
great efforts to make their experiments as fair as possible.
This notation implies of course that all $\{\omega_i\}$ (and thus, by Definition 19.1, all subsets of Ω)
are events. I will no longer mention this explicitly.
With the help of the axiom (W3) we can now determine the probability for each {ω} ⊂ Ω
and then also for each subset A of Ω: Namely,
$$1 = p(\Omega) = p(\{\omega_1\}) + p(\{\omega_2\}) + \cdots + p(\{\omega_n\}) = n \cdot p(\{\omega_i\}) \;\Rightarrow\; p(\{\omega_i\}) = 1/n,$$
and if A is a subset with k elements, then $p(A) = \sum_{\omega_i \in A} p(\{\omega_i\}) = k/n$, so for every
A ⊂ Ω:
$$p(A) = \frac{\text{number of elements of } A}{\text{number of elements of } \Omega}.$$
The elements of A are called the favorable outcomes for the event, so that we can also
say
$$p(A) = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}.$$
Many real experiments can be mapped to uniform probability spaces, even if the prob-
abilities are not initially evenly distributed.
Examples
1. In the calculation of probabilities in the lottery game, it makes sense to work with
the set of 5-tuples of different elements from {1, 2, . . . , 69}, in which the order of
the numbers plays no role, and not with the set of all 5-tuples. There are
$\binom{69}{5}$ = 11 238 513 such selections and therefore the probability
for a five is 1/11 238 513.
2. We roll two dice and want to calculate the probabilities with which the sums of
pips 2, 3, 4, . . . , 12 are thrown. The space Ω = {2, 3, . . . , 12} is unsuitable, because
for example p({2}) ≠ p({3}). We construct another space in which the results are
the possible pairs of throws:
Ω := {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
      (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
      ⋮
      (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
Each of these 36 elementary events has the same probability 1/36. Now we can
simply count and, for example, get:
3. You may be familiar with the birthday problem: What is the probability that k peo-
ple all have birthdays on different days?
Let’s construct a uniform probability space Ω. We neglect leap years and any
seasonal birthday clustering. The elements of Ω (the possible outcomes) should be all
possible birthday distributions, the event $A_k$ (the favorable outcomes) should be all
the distributions in which the birthdays fall on k different days.
How many elements does Ω have? If we number the days from 1 to 365, the
possible distributions correspond to all k-tuples of the set {1, 2, . . . , 365}, where at the
i-th position of such a k-tuple is the birthday of person i. So $\Omega = \{1, 2, \ldots, 365\}^k$.
This set has $365^k$ elements, which we all assume to be equally likely.
How many elements does Ak have? For this, we need to answer the question: In
how many ways can k different days be selected, namely the k different birthdays,
out of 365 days? With this selection, the order must be taken into account: For
example, two people can be born on the days i and j or on the days j and i , these
are two cases. Theorem 4.7 in Sect. 4.1 gives us this number: There
are 365 · 364 · . . . · (365 − k + 1) possibilities. Now we can calculate the quotient
of the favorable by the possible outcomes:
$$p(A_k) = \frac{365 \cdot 364 \cdots (365 - k + 1)}{365^k}.$$
A very surprising result: Already from k = 23, the probability for a double birth-
day is greater than 0.5. If I am standing in front of 40 students in a lecture, I can
bet on it quite well.
The birthday problem also occurs in computer science: In a hash table, the prob-
ability of a collision between two data sets is much greater than initially assumed.
Now you can calculate this. ◄
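The counting in examples 2 and 3 can be reproduced with a few lines of code. The following is a minimal sketch using exact fractions; the function names `dice_sum_probabilities` and `all_different_birthdays` are illustrative and not part of the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dice_sum_probabilities():
    """Probability of each sum when rolling two fair dice.

    The uniform space consists of the 36 ordered pairs; the probability
    of a sum is (number of favorable pairs) / 36.
    """
    pairs = list(product(range(1, 7), repeat=2))
    counts = Counter(a + b for a, b in pairs)
    return {s: Fraction(c, len(pairs)) for s, c in sorted(counts.items())}


def all_different_birthdays(k, days=365):
    """Probability that k people all have birthdays on different days."""
    favorable = 1
    for i in range(k):
        favorable *= days - i          # days * (days - 1) * ... * (days - k + 1)
    return Fraction(favorable, days ** k)


for total, probability in dice_sum_probabilities().items():
    print(total, probability)

for k in (10, 22, 23, 40):
    print(k, float(1 - all_different_birthdays(k)))   # at least one shared birthday
```

Using `Fraction` keeps the quotient “favorable by possible” exact; the conversion to `float` is only for display.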
Geometric Probabilities
Remember Buffon’s needle problem in Sect. 19.1: In the numerical integration carried
out there, we assumed that a randomly selected point in the selected rectangle lies at any
position with the same probability.
Definition 19.5 If the set Ω consists of curves, surfaces or volumes and the
probability for an event (a subset of the curve, the surface or the volume) is
proportional to the size of this subset, then Ω is called a space with geometric
probability.
The rectangle Ω = [0, π] × [0, 1] in the needle experiment contains an infinite number
of points that can all be selected with the same probability. So the probability of a single
point is strictly speaking 0, and Ω is not a finite uniform probability space. We can only
specify positive probabilities for subsets of Ω that have an area different from 0. Since the
probability p(Ω) = 1 and the total area is π, we get for each subarea A the probability
p(A) = (area of A)/π. In this way we were able to determine the integral of the sine.
Reality is somewhat different: If we work with a random number generator that, for
example, can generate $2^{32}$ different pairs of points, we still have a finite uniform space in
which the probability of each point is $1/2^{32}$. (e, π/4) is, for example, not an element of Ω.
Of course, it makes sense to still calculate with the geometric probability in this case.
A similar example is the position of the hand of a clock. It makes little sense to ask with
what probability the minute hand is at 1.4, but one can give the probability with which
the hand is between 2 and 3.
Different events of an experiment are often not independent of each other, the probability
of an event can be different depending on whether another event has occurred or not.
Let’s take the distribution of cards in the German card game Skat as an example: Alex,
Bob and Charly each get 10 cards, 2 cards remain, these form the so-called Skat. With
what probability is the ace of spades with Alex? Of course with the probability 10/32.
With probability 2/32 the Ace of Spades is in the Skat. But if Bob already knows that he
did not receive the Ace of Spades and is asking this question, then the probability that
Alex has the card is now 10/22, the probability that the card is in the Skat becomes 2/22.
If A and Y are events, then we denote by p(A|Y) the probability of A under the condition
that Y has already occurred. If A is the event “Alex has the Ace of Spades”, and Y is the event
“Bob does not have the Ace of Spades”, then p(A) = 10/32 and p(A|Y) = 10/22.
Let us try to find the conditional probability of events in a uniform probability space
Ω. For a set M, I denote by |M| the number of its elements. Let A and Y be events in Ω; we
want to calculate p(A|Y). Y has thus occurred, which means that the result of the
experiment is an element $\omega_0 \in Y$. First, let us determine the probability p({ω}|Y) for all ω ∈ Ω.
For ω ∈ Y the probability that $\omega = \omega_0$ is equal to 1/|Y|, because there are |Y| equally
likely possibilities for $\omega_0$. The event {ω} occurs precisely when $\omega = \omega_0$. In the case ω ∉ Y
we have $\omega \neq \omega_0$ in any case, thus {ω} cannot occur. We therefore have:
$$p(\{\omega\}|Y) = \begin{cases} 1/|Y| & \text{if } \omega \in Y, \\ 0 & \text{if } \omega \notin Y. \end{cases}$$
With the addition rule from Theorem 19.3 we find that p(A|Y) is the sum of the probabilities
p({ω}|Y) of all elements ω of A. Only the elements of A ∩ Y contribute to this, each with 1/|Y|, so
$$p(A|Y) = \frac{|A \cap Y|}{|Y|}.$$
We can rewrite this a bit: For each subset M of Ω we have p(M) = |M|/|Ω|, and we
obtain for the conditional probability in the uniform probability space:
$$p(A|Y) = \frac{|A \cap Y|}{|Y|} = \frac{|A \cap Y|/|\Omega|}{|Y|/|\Omega|} = \frac{p(A \cap Y)}{p(Y)}.$$
This calculation serves as motivation for the following definition of conditional probabil-
ity, which has also proven to be the right concept for non-uniform spaces:
Definition 19.6 Let $(\Omega, p)$ be a probability space, let A and B be events and
p(B) > 0. Then
$$p(A|B) := \frac{p(A \cap B)}{p(B)}$$
is called the conditional probability of A under the condition B.
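The Skat example can be checked directly against this definition. The sketch below models the experiment as a uniform space of 32 card positions (10 for Alex, 10 for Bob, 10 for Charly, 2 for the Skat) in which the Ace of Spades can lie; this modelling and the helper names `prob` and `conditional` are assumptions made for illustration.

```python
from fractions import Fraction


def prob(event, omega):
    """Probability of an event in a finite uniform space omega."""
    return Fraction(len(event & omega), len(omega))


def conditional(a, b, omega):
    """Conditional probability p(A|B) = p(A and B) / p(B); requires p(B) > 0."""
    return prob(a & b, omega) / prob(b, omega)


# Possible positions of the Ace of Spades.
omega = set(range(32))
alex = set(range(0, 10))
bob = set(range(10, 20))
skat = set(range(30, 32))
not_bob = omega - bob                    # "Bob does not have the Ace of Spades"

print(prob(alex, omega))                 # unconditional probability for Alex
print(conditional(alex, not_bob, omega)) # probability for Alex, given Bob's knowledge
print(conditional(skat, not_bob, omega)) # probability for the Skat, given Bob's knowledge
```

The fractions are printed in lowest terms, so 10/32 appears as 5/16 and 10/22 as 5/11.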
The following two theorems state important calculation rules for conditional probabilities.
The probability space is partitioned into disjoint subsets $B_i$, i = 1, . . . , n (Fig. 19.5).
$$p(A) = \sum_{i=1}^{n} p(B_i)\, p(A|B_i).$$
With the help of Bayes’ law, the calculation of the probability of Bi under the condition
A can be traced back to the calculation of the inverse probabilities of A under the condi-
tions Bi.
Examples
1. An automobile manufacturer gets a part for its production delivered by three differ-
ent subcontractors, in different proportions and in different quality:
2. In Sect. 5.7 we dealt with cryptography. To generate keys for the RSA algo-
rithm, large prime numbers are needed. These can be found with the help of prime
number tests.
If the number q is prime, then every prime number test is successful. If q is not
prime, then the probability that n prime number tests are successful is less than
$1/4^n$. In the checked range of numbers of 512 bits in length, the probability that a
randomly chosen number is prime is approximately 0.0028 (see Exercise 9 in Sect.
14.4). What is the probability that the randomly chosen number q is not prime,
although it has survived n prime number tests?
Let $B_1$ be the event “q is prime,” $B_2$ be the event “q is not prime,” and A be the
event “n prime tests were successful.” We obtain $p(A|B_1) = 1$ and $p(A|B_2) < 1/4^n$.
This gives the desired probability
$$p(B_2|A) = \frac{p(B_2)\,p(A|B_2)}{p(B_1)\,p(A|B_1) + p(B_2)\,p(A|B_2)} < \frac{0.9972 \cdot 1/4^n}{0.0028 \cdot 1 + 0.9972 \cdot 1/4^n} = \frac{0.9972}{4^n \cdot 0.0028 + 0.9972}. \tag{19.4}$$
For n = 25 the result is, for example, $p(B_2|A) < 3.2 \cdot 10^{-13}$, for n = 30 it is
$p(B_2|A) < 3.1 \cdot 10^{-16}$.
In (19.4) you must still think for a moment about the “<” sign: Check that for a > 0 and
0 ≤ x < y we always have $\frac{x}{a+x} < \frac{y}{a+y}$. Or calculate with $p(A|B_2) = 1/4^n$, that would be the
worst case. ◄
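The bound for the prime number test is easy to evaluate numerically. The following sketch assumes the worst case $p(A|B_2) = 1/4^n$ and the prime density 0.0028 quoted above; the function name `composite_bound` is illustrative.

```python
from fractions import Fraction


def composite_bound(n, p_prime=Fraction(28, 10_000)):
    """Upper bound for p(B2|A): q is composite although n tests succeeded.

    Uses Bayes' law with p(A|B1) = 1 and the worst case p(A|B2) = 1/4^n.
    """
    p_composite = 1 - p_prime
    worst_case = Fraction(1, 4 ** n)
    return (p_composite * worst_case) / (p_prime * 1 + p_composite * worst_case)


for n in (1, 5, 10, 25, 30):
    print(n, float(composite_bound(n)))
```

Each additional test divides the bound by roughly 4 once n is large enough that the term 0.9972 in the denominator is negligible.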
Independent Events
Two events A, B are said to be independent of each other if the probability of A does not
depend on the occurrence of B, that is, if p(A|B) = p(A).
Rolling two dice apparently results in two independent events: If A is the
event “First dice results in 6” and B is the event “Second dice results in 6”, then
p(A) = 1/6 = p(A|B). The two dice have nothing to do with each other.
From Definition 19.6 it follows for independent events, if p(B) > 0, after multiplica-
tion with the denominator: p(A)p(B) = p(A ∩ B). Conversely, this relation, after division
by p(B) implies p(A|B) = p(A). The following definition is therefore appropriate:
Examples
1. In the example after Theorem 19.3 I defined the concept of a source of
information. If Q = (A, p) is a source of information with the alphabet $A = \{x_1, x_2, \ldots, x_n\}$
and associated probabilities of occurrence $p_1, p_2, \ldots, p_n$, then Q is called a
memoryless source of information if in a message the probability of the occurrence of
the character $x_i$ is always $p_i$, independent of the characters already sent. In such a
source of information, the probability that in a message the characters $x_i$ and $x_j$
follow each other is equal to $p_i \cdot p_j$.
Write down which probability space and which events describe this assertion.
and further using the distributive law and the addition rule:
Examples of such independent events are those that arise from repeating an experiment
under the same initial conditions, such as dice, lottery or roulette.
The random drawing of numbered or colored balls from an urn is an important proto-
type for idealized experiments for the mathematician. The drawing of several balls from
an urn with the replacement of the drawn ball before the next draw, represents independ-
ent events. If the drawn balls are not replaced, however, the result of the second draw
depends on the outcome of the first, no independent events are to be expected here.
It can be shown that the events “prime number test”, which I described in Sect. 5.7
in the part on key generation, are mutually independent events for different numbers a.
If the probability that a composite number q passes a prime number test is < 1/4, then the
probability that q passes all n prime number tests is < $1/4^n$.
In a Bernoulli process of length n, we denote by $h_n(A)$ the number of trials in which A
occurs, that is, the absolute frequency of the occurrence of A, and by $r_n(A)$ the quotient
$h_n(A)/n$, the relative frequency of the occurrence of A.
If we carry out this experiment often, we expect that the relative frequency $r_n(A)$ eventually
approaches the probability p. For rolling dice in the long run, the ratio “number of sixes/number
of throws” will be close to 1/6, while for coin tossing the ratio “heads/tosses”
will be close to 1/2. We are tempted to make a statement like “$\lim_{n\to\infty} r_n(A) = p$”.
Strictly speaking, the limit of the sequence rn (A) however does not exist in the form
in which we defined it in Definition 13.2 in Sect. 13.1: For any given ε-neighborhood of
p, the sequence rn (A) every now and then jumps out. This is the nature of chance. How-
ever, the following law of large numbers can be derived from the axioms of probability,
which I would like to cite here without proof. In words, the somewhat cryptic formula
states that at least the probability for such jumps from rn (A) becomes arbitrarily small:
For each ε > 0 the probability that |rn (A) − p| ≤ ε tends to 1.
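A small simulation can make this statement tangible. The sketch below repeats a Bernoulli process of length n many times and reports the share of repetitions in which the relative frequency stays within ε of p; the function names, the number of repetitions and the seed are illustrative assumptions.

```python
import random


def relative_frequency(p, n, rng):
    """Relative frequency of the event A in a Bernoulli process of length n."""
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n


def share_within(p, n, eps, runs=200, seed=0):
    """Share of `runs` repetitions with |r_n(A) - p| <= eps."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(runs)
        if abs(relative_frequency(p, n, rng) - p) <= eps
    )
    return inside / runs


# Rolling a six: p = 1/6, tolerance eps = 0.02.
for n in (10, 100, 1_000, 5_000):
    print(n, share_within(1 / 6, n, 0.02))
```

The printed shares estimate the probability in question; they typically grow towards 1 as n increases, but a single run can still leave the ε-neighborhood.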
Let’s look at some probabilities in a Bernoulli process of length n. What does the prob-
ability space look like in which our experiment takes place? One possible result of the
Bernoulli process has, for example, the form:
(A, A, A, A, A, A, A, A, . . . , A).
n elements
The outcome space consists of all such n-tuples of elements from {A, A}, so = {A, A}n.
The elements ω ∈ � have the form ω = (ω1 , ω2 , . . . , ωn ) with ωi ∈ {A, A}.
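We calculate the probability of such an ω. Since the events in the individual trials of
the process are independent of each other, we get, for example, in a Bernoulli process of length 2:
\[ p(\{(A, A)\}) = p^2, \quad p(\{(A, \bar{A})\}) = p(\{(\bar{A}, A)\}) = p(1 - p), \quad p(\{(\bar{A}, \bar{A})\}) = (1 - p)^2 \]
and correspondingly for a process of length n, if A occurs at exactly k of the n positions of ω:
\[ p(\{\omega\}) = p^k (1 - p)^{n-k}. \quad (19.7) \]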
In the following table I give the values of $b_{10,1/6}(k)$ and $b_{10,1/2}(k)$, rounded to three decimal places.
These are, for example, the probability of rolling exactly k sixes in ten rolls of a die and the
probability of getting heads exactly k times in ten coin flips. Fig. 19.6 shows bar charts for
this.
k               0     1     2     3     4     5     6     7     8     9     10
b_{10,1/6}(k)   0.162 0.323 0.291 0.155 0.054 0.013 0.002 0.000 0.000 0.000 0.000
b_{10,1/2}(k)   0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001
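The table can be recomputed in a few lines. The sketch below assumes that $b_{n,p}(k)$ denotes the binomial probability $\binom{n}{k} p^k (1-p)^{n-k}$, which is how Theorem 19.14 is applied in the urn problem later in this section; the function name `binomial_probability` is a name chosen here for illustration.

```python
from math import comb

def binomial_probability(n, p, k):
    """Probability of exactly k occurrences of A in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    for label, p in (("p = 1/6", 1 / 6), ("p = 1/2", 1 / 2)):
        row = " ".join(f"{binomial_probability(10, p, k):.3f}" for k in range(11))
        print(f"{label}: {row}")
```

Each printed row should reproduce the corresponding row of the table, and the eleven values of a row add up to 1.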
In Sect. 19.1 I carried out the rolling experiment “waiting for the 6”, a typical example of
a Bernoulli process, under the heading “distributions”. Now we can calculate the prob-
ability that an event occurs for the first time on the kth trial:
Theorem 19.15 In a Bernoulli process, let p(A) = p. Then the probability that A
occurs for the first time on the kth trial is equal to
\[ p(1 - p)^{k-1}. \]
For this we only have to carry out the experiment k times and determine the probability
of the event $\{(\bar{A}, \bar{A}, \bar{A}, \ldots, \bar{A}, A)\}$, where A is at the kth position. According to (19.7)
this is just $p(1 - p)^{k-1}$.
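As a quick worked instance (an illustration added here, with the numbers chosen as an example): when waiting for the 6 with a fair die, p = 1/6, and the probability that the first six appears exactly on the third roll is

```latex
\[
  p(1-p)^{k-1} = \frac{1}{6}\left(\frac{5}{6}\right)^{2} = \frac{25}{216} \approx 0.116 .
\]
% Sanity check: for 0 < p <= 1 the probabilities over all k form a geometric series summing to 1.
\[
  \sum_{k=1}^{\infty} p(1-p)^{k-1} = p \cdot \frac{1}{1-(1-p)} = 1 .
\]
```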
I have already mentioned the urn experiments as an ideal thought experiment of the
mathematician. With their help we will derive important results for later applications.
Theorem 19.16: The urn problem “drawing with replacement” There are N balls in
an urn, S black and W white, where S + W = N. From the urn, n balls are drawn;
after each draw the ball is put back. There are $n_s$ black and $n_w$ white balls drawn.
Then the probability of drawing exactly $n_s$ black balls is equal to
\[ p(\text{number of black balls} = n_s) = \binom{n}{n_s} \cdot \left(\frac{S}{N}\right)^{n_s} \cdot \left(\frac{W}{N}\right)^{n_w}. \quad (19.8) \]
Since the balls are replaced, there is the same starting situation at each trial, so it is a Bernoulli
process of length n. If there are S black balls in the urn, the probability p for the event A,
to draw a black ball, is just S/N. Thus the probability for the event A to occur exactly $n_s$
times is given by Theorem 19.14. It is equal to
\[ \binom{n}{n_s} \cdot p^{n_s} \cdot (1 - p)^{n - n_s} = \binom{n}{n_s} \cdot \left(\frac{S}{N}\right)^{n_s} \cdot \left(\frac{W}{N}\right)^{n_w}. \]
Theorem 19.17: The urn problem “drawing without replacement” There are N
balls in an urn, S black and W white, where S + W = N. From the urn, n balls are
drawn in succession, of which $n_s$ balls are black and $n_w$ balls are white. Then the
probability of drawing exactly $n_s$ black balls is equal to
\[ p(\text{number of black balls} = n_s) = \binom{S}{n_s} \cdot \binom{W}{n_w} \Big/ \binom{N}{n}. \quad (19.9) \]
Here the different trials of the process are not independent of each other: the number
of balls and the ratio of black to white balls change after each draw. So we don’t have
a Bernoulli process. Instead, we examine a uniform probability space that contains all
possible draws of n balls. In total, there are $\binom{N}{n}$ possibilities to draw the n balls; these
are the possible outcomes. The $n_s$ black balls can of course only be taken from the S
black balls of the urn; these are $\binom{S}{n_s}$ different possible choices. With each such choice
of black balls, we can add $n_w$ white balls in exactly $\binom{W}{n_w}$ ways. This gives us the number
of favorable outcomes as $\binom{S}{n_s} \cdot \binom{W}{n_w}$. The quotient “favorable outcomes by possible outcomes”
gives the probability we are looking for.
Examples
1. In a box there are 20 chocolates, 8 of which are filled with marzipan. You take 5
chocolates from the box. What is the probability of getting 0, 1, 2, 3, 4 or 5 marzi-
pan chocolates?
Here we have the experiment “drawing without replacement”. The marzipan
chocolates are the black balls, and so we get with N = 20, S = 8, W = 12,
$n_s$ = 0, 1, 2, 3, 4, 5, $n_w$ = 5, 4, 3, 2, 1, 0:
\[ p(n_s = 0) = \binom{8}{0}\binom{12}{5} \Big/ \binom{20}{5} \approx 0.051, \]
\[ p(n_s = 1) = \binom{8}{1}\binom{12}{4} \Big/ \binom{20}{5} \approx 0.255, \]
and so on. In the following table I have written down all the results:
Marzipan chocolates   0     1     2     3     4     5
p                     0.051 0.255 0.397 0.238 0.054 0.004
2. Now “drawing with replacement”: To allow for a comparison with the experiment
“drawing without replacement”, we take the same numerical values as above: The
urn contains 20 balls (chocolates are not put back), 8 of them are black, 5 balls are
drawn. What is the probability of getting 0, 1, 2, 3, 4 or 5 black balls?
\[ p(n_s = 0) = \binom{5}{0} \cdot 0.4^0 \cdot 0.6^5 \approx 0.078, \]
\[ p(n_s = 1) = \binom{5}{1} \cdot 0.4^1 \cdot 0.6^4 \approx 0.259 \]
Black balls   0     1     2     3     4     5
p             0.078 0.259 0.346 0.230 0.077 0.010
◄
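Both examples can be checked numerically. The following Python sketch evaluates formulas (19.8) and (19.9) for the numbers used above (N = 20, S = 8, n = 5); the function names `with_replacement` and `without_replacement` are names chosen here for illustration.

```python
from math import comb

def without_replacement(N, S, n, n_s):
    """Formula (19.9): exactly n_s black balls among n drawn, no replacement."""
    W = N - S
    return comb(S, n_s) * comb(W, n - n_s) / comb(N, n)

def with_replacement(N, S, n, n_s):
    """Formula (19.8): exactly n_s black balls among n drawn, with replacement."""
    p = S / N
    return comb(n, n_s) * p**n_s * (1 - p)**(n - n_s)

if __name__ == "__main__":
    for n_s in range(6):
        a = without_replacement(20, 8, 5, n_s)
        b = with_replacement(20, 8, 5, n_s)
        print(f"n_s = {n_s}: without = {a:.3f}, with = {b:.3f}")
```

The two columns should match the two tables of the examples. The extreme results (0 or 5 black balls) are somewhat more likely with replacement, because the composition of the urn does not change between draws.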
Comprehension Questions
Exercises
1. Let your computer perform Buffon’s needle experiment and determine the number
π to 4 decimals. How many trials do you need? Perform the experiment several
times and compare the number of trials each time.
2. In a box with 100 lottery tickets there are 40 duds. You buy 6 tickets. What is the
probability that you will get at least 3 wins?
3. Assume the hypothesis to be true that a slice of toast falls on the buttered side just
as often as on the unbuttered side when it is dropped. What is the probability then
that out of 100 trials, the toast falls on the buttered side more than 52 times?
4. A disease occurs in 0.5% of the population. A test finds 99% of the sick people
(test positive) but also responds in 2% of the healthy population. What is the probability
that a tested person is sick if the test is positive?
5. A multiple-choice test is conducted in an exam. For one question, there are n pos-
sible answers, exactly one is correct. Well-prepared students circle the correct
answer, poorly prepared students circle randomly. ( p · 100) % of the participants
are well prepared. With which (conditional) probability does a correct result come
from a well-prepared student?
6. In 4000 draws of the German 6 out of 49 lottery, the number sequence 15, 25, 27,
30, 42, 48 was drawn twice: on 12-20-1986 and on 6-21-1995. This caused quite a
stir among lottery players. Calculate whether this event was really unlikely.
7. Calculate the probability of getting a three, four or five in the 6 out of 49 lottery.
8. The following game was carried out in a similar form on an American TV quiz
show: The contestant stands in front of three doors. Behind one of them is a car,
behind the other two are sheep. The contestant is allowed to choose one of the
doors, but not to open it yet. The quizmaster opens one of the other two doors,
namely one behind which a sheep is standing. Now the contestant may choose
between the two closed doors again. He receives what is behind this door. Calcu-
late the probabilities for a (car) win with the following procedures for the second
decision:
a) The contestant tosses a coin.
b) The contestant always sticks to his original decision.
c) The contestant always changes his original choice.
For a while, a bizarre discussion went on in the American media, in which serious people
(including respected mathematicians) tore each other to pieces in writing, trying to solve
the question of which was the optimal strategy for the contestant. After the quizmaster
has opened the door, the prize is behind one of the other two doors with a probability of
0.5. Can the contestant increase the probability of winning?
Random Variables
20
Abstract
• you will know what discrete and continuous random variables are,
• you will know the connection between random variables, probability distributions
and cumulative distribution functions,
• you will have understood the meaning of the terms expected value and variance of
random variables, can calculate and deal with them,
• and you will have gained a brief insight into information theory.
Take a look at example 2 after Definition 19.4 again, in which we calculated the prob-
abilities for the sum of eyes when rolling the dice twice. In it we constructed a uniform
probability space and the sum of eyes was assigned to the elements of this space. The
elements of the space itself were actually unimportant. This is often the case: In statis-
tical surveys, it is not the surveyed persons who are of interest, but perhaps the weight,
age or voting behavior. In a Bernoulli process, it is not the sequence of results that is in
the foreground, but for example the number of times an event has occurred.
In all these cases, a characteristic is assigned to the elements of Ω; we have a mapping
from Ω into the set of characteristics. As the codomain of the mapping, we only allow the
set of real numbers, because we know real mappings well. Often this is possible directly
(sum of eyes, age, weight), and if not, one can often map the characteristics to numbers,
for example the parties in an election can be assigned numbers.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_20
In the case of the dice experiment, we therefore get a function “sum of eyes”:
$SE: \Omega \to \mathbb{R},\ (a, b) \mapsto a + b$. Such a function is called a random variable.
In Fig. 20.1 we can recognize yet another function: To each function value n of the
function SE a probability is assigned. The function
\[ V: \{2, 3, 4, \ldots, 12\} \to [0, 1], \quad x \mapsto p(SE = x) \]
is called the probability distribution of the random variable.
I have written down the results of the experiment in the form p(SE = n). This is a
somewhat sloppy way of writing, because p is defined on the elements of Ω. More precisely,
$p(SE = n) := p(\{\omega \in \Omega \mid \text{sum of eyes in } \omega = n\})$.
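The construction can be traced step by step in code. The sketch below, with names chosen here for illustration, builds the uniform probability space of the 36 pairs, applies the random variable “sum of eyes” to each outcome, and collects the probability distribution as exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_of_eyes_distribution():
    """Return {x: p(SE = x)} for the sum of eyes of two fair dice."""
    omega = list(product(range(1, 7), repeat=2))  # uniform space with 36 outcomes
    counts = Counter(a + b for a, b in omega)     # apply the random variable SE
    return {x: Fraction(c, len(omega)) for x, c in sorted(counts.items())}

if __name__ == "__main__":
    for x, p in sum_of_eyes_distribution().items():
        print(f"p(SE = {x:2d}) = {p}")
```

Once the dictionary is built, the underlying pairs (a, b) are no longer needed, which mirrors the remark that the elements of the space itself are unimportant.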
It will turn out that working with random variables makes it much easier to deal with
probabilities and their properties. The often very complicated probability spaces recede
more and more into the background and we can work with real functions that are now
very familiar to us.
The fact that the image M is finite or has at most as many elements as N, means that it
can be written in the form M = {mi | i ∈ N}. This is sometimes useful for calculations.
If we conduct an experiment with the result ω, then the function value X(ω) belongs to
this result. This function value is also called a realization of the random variable X. The
sum of eyes 12 in a specific throw with two dice is a realization of the random variable
“sum of eyes”.
As with the sum of eyes of the dice, we introduce abbreviated notations for events. For
example
\[ X = x := \{\omega \in \Omega \mid X(\omega) = x\}, \]
\[ X \le x := \{\omega \in \Omega \mid X(\omega) \le x\}, \]
\[ a < X < b := \{\omega \in \Omega \mid a < X(\omega) < b\}. \]
For all x < 1 we have F(x) = 0, for 1 ≤ x < 2 we have F(x) = 1/6, and so on. The cumulative
distribution function of a discrete random variable is always a monotonically increasing
step function that starts somewhere to the left at 0 and ends somewhere to the right at the
function value 1. Jumps of the function can only occur at the function values of the random
variable X. Without proof, I summarize this in the obvious theorem:
In order for the sum in a) to exist at all, we use for the first time the fact that the random
variable is discrete, that is, W has only finitely or at most countably infinitely many ele-
ments. If W is infinite, we have to form an infinite sum here. Remember the analysis,
Theorem 13.12 in Sect. 13.2: The sequence of partial sums is monotonic and bounded,
so the series is convergent.
The following theorem shows how we can easily calculate the probability of intervals
using the cumulative distribution function:
F(b) = p(X ≤ b)
= p({X ≤ a} ∪ {a < X ≤ b})
= p(X ≤ a) + p(a < X ≤ b)
= F(a) + p(a < X ≤ b)
and b) from:
1 = p(Ω)
= p({a < X} ∪ {X ≤ a})
= p(a < X) + p(X ≤ a)
= p(a < X) + F(a)
The graph of the probability distribution of a discrete random variable consists of a dis-
crete number of points, as you have seen in Fig. 20.2. Usually, the form of histograms
is chosen for such a graphical representation: a column is drawn over each argument.
This is particularly practical if the arguments, as in the case of dice, all have the same
distance from each other. Then we draw the columns all the same width, so that they just
touch each other. We calculate the height of the columns so that the area of the column
represents the probability of the event occurring. You can see the histogram for the prob-
ability distribution of rolling a die in Fig. 20.3.
Let’s look at coin tossing as another example. The random variable X should count
the number of heads in 20 throws. According to Theorem 19.14, $p(X = k) = b_{20,1/2}(k)$.
The area of the column over the value k in Fig. 20.4 gives the probability with which, in
20 trials, heads occurred k times.
A very nice property of these area-proportional histograms is that you can determine
the probability of a range, for example p(10 ≤ X ≤ 14) exactly as the area of the col-
umns over the values from 10 to 14.
Do you see where I am going with this? We will approximate some common prob-
ability distributions by continuous functions; then we can calculate the probability that X
lies between a and b as the integral of the distribution from a to b. To carry out this pro-
gram, we first need to deal with continuous random variables.
Let’s look at the random variable “position of the hour hand of the clock”. Ω consists
of the set of possible hand positions, and a value between 0 and 12 is assigned to each hand
position. The random variable X: Ω → ]0, 12] no longer has a discrete image. We can’t
give probabilities for individual elements ω anymore, but we can, for example, determine:
p(X ≤ 12) = 1, p(X ≤ 6) = 1/2, p(2 < X ≤ 3) = 1/12.
Similarly to the case of discrete random variables (see Definition 20.3), we now look
for the cumulative distribution function F with the property F(x) = p(X ≤ x). Of course,
F(x) = 0 for x ≤ 0 and F(x) = 1 for x ≥ 12. In between: F(x) = x/12. In Fig. 20.5 you
can see the graph of the function F .
Just as in Theorem 20.5, we have, for example, p(4 < X ≤ 7) = F(7) − F(4).
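Written out with F(x) = x/12 for the clock example (a short worked check added here):

```latex
\[
  p(4 < X \le 7) = F(7) - F(4) = \frac{7}{12} - \frac{4}{12} = \frac{1}{4},
\]
% The same rule reproduces the value stated earlier for the interval from 2 to 3:
\[
  p(2 < X \le 3) = F(3) - F(2) = \frac{3}{12} - \frac{2}{12} = \frac{1}{12}.
\]
```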
Now we are still missing the analogue to the probability distribution of a discrete random
variable. Let us recall: In the probability distribution V the area between a and b is
the probability that the result lies between a and b, which is just F(b) − F(a). We already
know this property from integral calculus:
The function F is an antiderivative of the probability distribution V:
\[ \int_a^b V(t)\,dt = F(b) - F(a), \]
thus V is the derivative of F. In our example we get for V the function (Fig. 20.6):
\[ V: \mathbb{R} \to [0, 1], \quad x \mapsto \begin{cases} 0 & x \le 0 \text{ or } x > 12, \\ 1/12 & 0 < x \le 12. \end{cases} \]
We now have a cumulative distribution function and a kind of probability distribution for
the given non-discrete random variable. Mathematically, we’re putting the cart before the
horse: We call a function a continuous random variable if it has a nice cumulative distri-
bution function:
The position of the hour hand of a clock is a continuous random variable in this sense.
I state the following two theorems without proof, which correspond exactly to Theo-
rems 20.4 and 20.5 for discrete random variables:
Thus, in the transition from the discrete to the continuous random variable, only the sum
$\sum_{x_i \le x} p(X = x_i)$ is replaced by the integral $\int_{-\infty}^{x} w(t)\,dt$.
Often one has to deal with more than one random variable: In a probability space
several characteristics are observed at the same time, for example in opinion polls. An
experiment can be carried out several times in succession and each single experiment is
described by a random variable.
A finite sequence of random variables $(X_1, X_2, \ldots, X_n)$ is called a random vector or a
multivariate random variable. A random vector of length n is a function from Ω to $\mathbb{R}^n$.
Examples
1. The random selection of a point from a rectangle [a, b] × [c, d] can be described by
two random variables: X is the random variable “choose a point from [a, b]”, Y is
the random variable “choose a point from [c, d]”. A realization of the random vec-
tor (X, Y ) consists of a specific point (x, y) of the rectangle.
2. The two random variables H and W for height and weight (in m and kg) form the
random vector (H, W). The result of an experiment is a tuple, for example (1.86, 82).
3. If you roll several dice at the same time, for example 5 dice, you get the random
variables (X1 , X2 , X3 , X4 , X5 ), where Xi is the random variable “rolling the i -th die”.
The result of an experiment is an element (x1 , x2 , x3 , x4 , x5 ) of R5, where xi is the
number of eyes of the i -th die. We also say (x1 , x2 , x3 , x4 , x5 ) is a realization of the
random vector (X1 , X2 , X3 , X4 , X5 ).
4. The drawing of n balls from an urn with black and white balls can be described by
the random vector (X1 , X2 , . . . , Xn ), Xi is the i -th draw, which has the possible out-
comes black or white, respectively, the results 0 or 1, if we work with real random
variables. The result of the experiment, the realization, is then a vector of the form
(1, 0, 0, 1, 0, . . . , 0) ∈ Rn, where the 1 means a white and the 0 a black ball. ◄
Infinitely many random variables can also occur. Sequences of random variables, such
as $(X_n)_{n \in \mathbb{N}}$ or $(X_t)_{t \in \mathbb{R}}$, are called stochastic processes. We will deal with such processes in
Sect. 21.3.
Different real random variables can be linked together with mathematical operations,
just as we already know from real functions.
Examples
5. Currently, the body mass index is popular as a measure for a reasonable ratio
between height and weight: It is “weight in kg divided by the square of the height
in meters” and should be in the range of 20 to 25. With the random variables H
for the height and W for the weight this gives a random variable $W/H^2$, where
$(W/H^2)(\omega) := W(\omega)/H(\omega)^2$.
6. When rolling 5 dice, the random variable X = X1 + X2 + X3 + X4 + X5 is the
sum of eyes of the 5 dice.
Just as with different events, you can also ask about values of random variables whether
they have something to do with each other or are independent of each other. Weight and
height of a person are certainly not independent of each other, but maybe height and
income? With the random vector (X, Y ), which describes the random selection of a point
from a rectangle, the results of X and Y should be independent of each other. If we draw
balls from an urn, the 2nd draw is independent of the first, provided we carry out the
experiment “drawing with replacement”. If we draw without replacement, the later draws
are dependent on the earlier ones! When rolling dice, the result of the first die certainly
has nothing to do with the result of the second die, the random variables Xi = “result of
the i -th throw” are in this case independent.
We mathematically describe this situation by reducing it to the already known concept
of independent events: We call two discrete random variables X , Y independent of each
other if the events X = x and Y = y are independent of each other for all possible values
of x and y:
I use the intuitively clear notation for probabilities of several random variables, such as
p(X = x, Y = y) := p({X = x} ∩ {Y = y}),
p(X1 ≤ a, X2 ≤ b) := p({X1 ≤ a} ∩ {X2 ≤ b}).
I already anticipated this notation in Definition 20.9.
The concept of independence can also be formulated for continuous random varia-
bles. However, since the probability of an element p(X = x) is not a meaningful quantity,
we have to require independence of events belonging to intervals of R. I write down the
definition for two variables:
Let’s take a look at the game of roulette. The ball can land in one of the pockets from 0
to 36 and we assume that Ω = {0, 1, 2, . . . , 36} is a uniform probability space, so we can
give the probability for each subset of Ω.
First, let’s bet 1 € on a single number, for example 7. If the ball lands in the pocket with
the number 7, you will receive a 36 € pay out (from which the stake is deducted). You usu-
ally have to wait for this, because the probability of winning is 1/37. To shorten this waiting
time, we could also bet 1 € on the odd numbers at the same time. The pay out for this is 2
€. Finally, we bet on the street (1,2,3), for which we still have a chance to win 12 €. Now
we are interested in what the chances of winning look like if we play this combination for a
longer period of time. Can we calculate how big the average payout is in one game?
First, let’s write down the payout for the possible results.
Numbers                                      Payout
5, 9, 11, 13, 15, . . . , 35 (15 values)     2
1, 3                                         12 + 2
2                                            12
7                                            36 + 2
the rest (18 values)                         0
If X is the random variable describing the payout in such a game, then its image is
\[ W(X) = \{0, 2, 12, 14, 38\}. \]
Now let’s play n times. Let $f_k$ be the number of games in which the payout k occurred.
Then our total payout P is
\[ P = 2 \cdot f_2 + 14 \cdot f_{14} + 12 \cdot f_{12} + 38 \cdot f_{38} + 0 \cdot f_0, \]
and the average payout per game is therefore
\[ \bar{P} = \frac{P}{n} = \frac{2 \cdot f_2 + 14 \cdot f_{14} + 12 \cdot f_{12} + 38 \cdot f_{38} + 0 \cdot f_0}{n} = 2 \cdot r_2 + 14 \cdot r_{14} + 12 \cdot r_{12} + 38 \cdot r_{38} + 0 \cdot r_0, \]
if $r_k = f_k/n$ is the relative frequency of the occurrence of the payout k. Very likely, for
large n, the relative frequency of the payout k will be of the order of the probability of
k occurring, $r_k \approx p(X = k)$, so that
\[ \bar{P} \approx 2 \cdot p(X = 2) + 14 \cdot p(X = 14) + 12 \cdot p(X = 12) + 38 \cdot p(X = 38) + 0 \cdot p(X = 0). \]
Surprisingly, the right side of this equation no longer depends on the specific course of
the game, it is independent of the experiment and only determined by the random vari-
able X and its probability distribution. We call this right side of the equation the expected
value E(X) of the random variable X . The value E(X) will usually differ from the average
pay out, but the longer we play, the more it will approach it. Let’s calculate the expected
value in the concrete example:
\[ E(X) = 2 \cdot \frac{15}{37} + 14 \cdot \frac{2}{37} + 12 \cdot \frac{1}{37} + 38 \cdot \frac{1}{37} = \frac{108}{37}. \]
That looks good at first, but unfortunately the bank deducts the stake of 3 € in each
game, so that in total an average loss of 108/37 − 3 = −3/37 € remains, that is about
2.7% of the stake.
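The calculation can be verified by enumerating all 37 pockets. The following Python sketch is an illustration under the payout rules stated above (36 € for the number 7, 2 € for odd, 12 € for the street 1, 2, 3); the function names are chosen here and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def payout(pocket):
    """Total payout in euros for 1 euro each on the number 7, on odd, and on the street (1, 2, 3)."""
    total = 0
    if pocket == 7:
        total += 36
    if pocket % 2 == 1:      # the 0 counts as neither odd nor part of the street
        total += 2
    if pocket in (1, 2, 3):
        total += 12
    return total

def expected_payout():
    """E(X) over the uniform probability space {0, 1, ..., 36}."""
    return sum(Fraction(payout(w), 37) for w in range(37))

if __name__ == "__main__":
    e = expected_payout()
    print("E(X) =", e)
    print("average net result per game =", e - 3)
```

Changing the conditions in `payout` lets you try other combinations of bets and observe that the net result per euro staked stays the same.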
Is there a better strategy? No, you will find that with all possible combinations of bets,
on average, for every 1 euro bet, 1/37 euros go to the bank. The reason for this is some-
thing we will learn about in Theorem 20.18.
A game is called fair if the expected value for the winning is 0, if in other words, winnings
and losses cancel each other out. In this sense, roulette is not fair; but if you consider that
from this 1/37 all of the casino’s expenses are covered, you can’t really complain.
If X is a discrete random variable and the value
\[ E(X) := \sum_{x \in W(X)} x \cdot p(X = x), \]
in which the sum is taken over all x from the image W(X) of X, uniquely exists, it
is called the expected value of X. The expected value E(X) of a random variable X
is also often denoted by µ.
A trap lies in the inconspicuous remark, “if the value exists uniquely”. If the image of X is
finite, then E(X) always exists. However, if it is infinite, the series does not have to converge,
and even if it converges, the limit may depend on the order of summation. Then there is no
expected value. If I write E(X) in the future, I always assume that E(X) really exists.
When forming the expected value, all possible products “function value times probability
of function value” must be added up. If the random variable X describes the numbers
rolled with a die, then:
\[ E(X) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5. \quad (20.3) \]
The expected value does not necessarily have to be in the image of the random variable:
On average, you just roll 3.5.
I give a similar definition for continuous random variables. This can be motivated, for
example, by the fact that the continuous random variable is approximated by a sequence
of discrete random variables with existing expected values. I don’t want to do this, but
I just want to point out the analogy that has already emerged in the comparison of the
Theorems 20.4 and 20.7. The sum over the probabilities of the function values of the dis-
crete random variable X just corresponds to the integral of the density of the continuous
random variable:
Let’s look back at the roulette example: If I multiply my bet by five, all winnings and
losses are also multiplied by five. We get a new random variable Y = 5 · X , which
describes the winnings. What is the expected value of Y ? If I am interested in the random
variable “net profit”, I have to construct the new random variable Z = X − 3 from X .
What is the expected value of Z ?
The following result answers these questions, it should not be very surprising:
I used that the sum over all possible probabilities p(X = x) just results in 1.
In the case of continuous random variables, one can first calculate the density func-
tion of the new random variable Y and then the expected value from this.
In the game of roulette, let XA be the random variable that describes the “win when bet-
ting 1 € on 7”, XB the random variable “win when betting 1 € on the odd numbers” and
XC is finally “win when betting 1 € on 1,2,3”. The expected values of all three random
variables are equal:
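Worked out with the payouts stated earlier in this section (36 € on a single number, 2 € on odd, 12 € on the street), and reading “win” as the payout before the stake is deducted, the three expected values are:

```latex
\[
  E(X_A) = 36 \cdot \frac{1}{37} = \frac{36}{37}, \qquad
  E(X_B) = 2 \cdot \frac{18}{37} = \frac{36}{37}, \qquad
  E(X_C) = 12 \cdot \frac{3}{37} = \frac{36}{37}.
\]
```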
The thrill is much greater when betting on a number than, say, betting on odd. How do
these two strategies differ? In the first case, the possible win is much greater, but so is the
risk of loss. Betting on 1, 2, 3 is somewhere in between in terms of risk.
It would be interesting to know how the individual results are scattered around the
expected value for different strategies, that is, how far possible wins and losses deviate
from the expected value on average. Let’s apply what we’ve already learned about ran-
dom variables and expected values: If E(X) = µ, then the random variable Y = X − µ
describes the deviation of the win from the expected value µ. So we would need to
obtain the average deviation from the expected value of the random variable Y . Let’s cal-
culate it: It is
E(Y) = E(X − µ) = E(X) − µ = 0.
Too bad: The positive and negative deviations cancel each other out. But on closer
inspection this is also reasonable, the expected value is in the middle of the values that
occur.
To avoid this cancellation effect, we have to count all the deviations positive, so we
could examine Y = |X − µ|, for example. If you have gotten this far in mathematics, you
also know that a mathematician only calculates with absolute values when there is no
avoiding it: There are always such ugly case distinctions. For this reason we take the ran-
dom variable Y = (X − µ)2 here; this describes the squares of the deviations, and they
are of course all positive.
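Definition 20.14: The variance If E(X) = µ is the expected value of the random
variable X, then the value
\[ \mathrm{Var}(X) := E((X - \mu)^2), \]
if it exists, is called the variance of X. Expanding the square and using E(X) = µ gives
\[ \mathrm{Var}(X) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2. \]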
Theorem 20.16 If the variance of the random variable X exists, then for a, b ∈ R:
Var(aX + b) = a2 Var(X).
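The rule can be checked on a small example. The following sketch represents a discrete random variable by its probability distribution as a dictionary {value: probability}; this representation and the helper names `expectation`, `variance` and `affine` are assumptions made for illustration, not notation from the text.

```python
from fractions import Fraction

def expectation(dist):
    """E(X) for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E((X - mu)^2)."""
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def affine(dist, a, b):
    """Distribution of a*X + b, given the distribution of X."""
    out = {}
    for x, p in dist.items():
        out[a * x + b] = out.get(a * x + b, 0) + p
    return out

die = {x: Fraction(1, 6) for x in range(1, 7)}  # rolling one fair die

if __name__ == "__main__":
    print("Var(X)      =", variance(die))
    print("Var(5X - 3) =", variance(affine(die, 5, -3)))
```

For the die, the second value is 25 times the first: the shift by −3 has no effect, while the factor 5 enters squared.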
The multiplicative factor a thus enters the variance quadratically, while a simple shift of the
values of the random variable does not change the variance. That is obvious, since with
such a shift all deviations from the expected value remain unchanged.
In the game of roulette, the expected value of $X_B$ “bet on odd” and of $X_C$ “bet on 1, 2, 3”
was 36/37 each. What happens if you combine both options in one game?
Let $S = X_B + X_C$ be the game in which we bet on odd and on 1, 2, 3. If an odd number
comes, you win 2 € at $X_B$. If 1, 2 or 3 comes, the win in $X_C$ is 12 €. The possible function
values of S are therefore 2 (for the 16 odd numbers without 1 and 3), 12 (for 2), 14 (for 1
and 3) and 0 for the rest. What is the expected value of S?
\[ E(S) = 2 \cdot \frac{16}{37} + 12 \cdot \frac{1}{37} + 14 \cdot \frac{2}{37} = \frac{72}{37}, \]
thus just the sum of the two individual expected values.
At the beginning of this section, we calculated the expected value of
X = XA + XB + XC: It was 108/37 = 3 · (36/37), that is also the sum of the individual
expected values. This always applies:
Theorem 20.18 If X , Y are random variables with the expected values E(X) and
E(Y ), then
E(X + Y ) = E(X) + E(Y ).
Definition 20.21: The covariance If X and Y are random variables with the expected
values $E(X) = \mu_X$, $E(Y) = \mu_Y$, then the number
\[ \mathrm{Cov}(X, Y) := E((X - \mu_X)(Y - \mu_Y)), \]
if it exists, is called the covariance of X and Y. The random variables X and Y are
called uncorrelated if Cov(X, Y) = 0.
\begin{align*}
\mathrm{Cov}(X, Y) &= E(XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
&= E(XY) - \mu_X \mu_Y
\end{align*}
This expression is 0 if X and Y are independent. Independent random variables are there-
fore uncorrelated. The reverse is not true.
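A standard counterexample for the reverse direction, added here as an illustration and not taken from the text: let X take the values −1, 0, 1 with probability 1/3 each and let Y = X². Then Y is completely determined by X, yet the covariance vanishes. The helper names in the sketch are chosen for this example.

```python
from fractions import Fraction

# Joint distribution of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expectation(f):
    """Expected value of f(X, Y) under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

def covariance():
    mu_x = expectation(lambda x, y: x)
    mu_y = expectation(lambda x, y: y)
    return expectation(lambda x, y: (x - mu_x) * (y - mu_y))

def probability(event):
    """Probability of the set of outcomes (x, y) for which event(x, y) is true."""
    return sum(p for (x, y), p in joint.items() if event(x, y))

if __name__ == "__main__":
    print("Cov(X, Y)       =", covariance())
    print("p(X=0, Y=0)     =", probability(lambda x, y: x == 0 and y == 0))
    print("p(X=0) * p(Y=0) =", probability(lambda x, y: x == 0) * probability(lambda x, y: y == 0))
```

The covariance is 0, but p(X = 0, Y = 0) differs from p(X = 0) · p(Y = 0), so X and Y are uncorrelated without being independent.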
1. The information content of a sign does not depend on the sign itself, but only on the
probability of the occurrence of this sign: different signs with the same probability
of occurrence carry the same amount of information. If two probabilities differ only
slightly, the information content should not be very different.
2. Signs with a small probability of occurrence carry more information than signs with a
large probability of occurrence.
In a political speech, the important information is not contained in the recurring clichés,
but hidden in between at large intervals. An information source that always tells the
same thing has no real information content, the content is in the rare and therefore sur-
prising signs. Shannon called the information content of a message also the “surprisal”
of the message.
Two bytes of equal probability of occurrence can therefore, for example, transmit twice
as much information as one byte.
Information content that is associated with the character xi must have the following prop-
erties:
Since the source is memoryless, the two events $x_i$ and $x_j$ are independent of each other.
The probability of $x_i x_j$ is therefore equal to $p_i p_j$, and from 3. it follows in particular that
\[ f(p_i p_j) = f(p_i) + f(p_j). \quad (20.5) \]
We know from analysis a function that satisfies this condition, it is the logarithm. In
fact, one can show that the only continuous functions that fulfill (20.5) are the functions
a · log x, where a ∈ R and the logarithm can be formed to any base. Often one chooses
the logarithm to the base 2.
After Theorem 14.31 in Section 14.3 we saw that logarithms to different bases only
differ by a constant factor. This means that a function that fulfills (20.5) always has the form
$a \log_2 x$ for some a ∈ R.
The information content is a random variable on the probability space Q. The expected
value of this random variable, that is, the average amount of information transmitted, is
the entropy of the source:
If in the alphabet all characters $x_1$ to $x_n$ have the same probability of occurrence 1/n, then
the entropy is
\[ H(Q) = \sum_{i=1}^{n} \frac{1}{n} \cdot \log_2(n) = \log_2(n). \quad (20.6) \]
It can be shown that this value is the maximum entropy of an information source with n
characters. As soon as the probabilities of individual characters deviate from the value
1/n the entropy decreases.
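A short computation illustrates this. The sketch below assumes that the information content of a character with probability $p_i$ is $\log_2(1/p_i)$, the normalization used in the ASCII example later in this section, so that the entropy is the sum of $p_i \log_2(1/p_i)$; the function name `entropy` and the sample distributions are chosen here for illustration.

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits: sum of p_i * log2(1/p_i); characters with p_i = 0 contribute nothing."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

if __name__ == "__main__":
    uniform = [1 / 4] * 4
    skewed = [0.5, 0.25, 0.125, 0.125]
    print("uniform source:", entropy(uniform), "bits per character")
    print("skewed source: ", entropy(skewed), "bits per character")
```

With four characters, the uniform source reaches $\log_2 4 = 2$ bits per character, while the skewed source stays below that value.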
Do not confuse the information content thus defined with the meaning of a message
interpreted by humans. Otherwise you would prefer a book filled with randomly distrib-
uted letters to the present book. That would be a shame. But you can also look at it this
way: The book you are reading now has a size of about 13 MB as an electronic docu-
ment. If I compress it with zip, the size decreases to about 4 MB. The amount of infor-
mation is the same in both documents, so the information content per character must
be greater in the second document. What is behind it? When compressing with zip, the
source text alphabet is encoded in another alphabet in which, in contrast to the English
clear text, each character occurs approximately equally often. The entropy of the com-
pressed text is therefore greater than that of the uncompressed text. The better the com-
pression, the more evenly distributed the individual characters are and the greater the
entropy.
In data processing, a source is usually binary coded, that is, translated into a code
word of 0 and 1. We have learned such a coding procedure, for example, with the Huff-
man code (at the end of Sect. 11.2). In the source Q = (A, p) the character xi is encoded
in a code word of length m bits. The function l(xi ) = length of the code word of the char-
acter xi also represents a random variable on the probability space Q, and the expected
value of this random variable
$$L(Q) = \sum_{i=1}^{n} p_i \cdot l(x_i)$$
is the average codeword length in the transmission. Of course, one would like to keep
the average codeword length as short as possible in an encoding of the source alphabet.
Is there a lower limit for the codeword length? First two
Examples
1. Let’s take the ASCII code first and assume that all characters are equally likely.
Then pi = 1/256 and log2 (1/pi ) = 8. According to (20.6) this is the entropy. It
corresponds exactly to the constant word length of 8 bits. In this case, entropy and
average word length match. This is the reason for the initially arbitrary normaliza-
tion of the information content.
2. When introducing the Huffman code at the end of Sect. 11.2 we defined
an alphabet with the characters {a, b, c, d, e, f } and associated probabilities
{0.04, 0.15, 0.1, 0.15, 0.36, 0.2}. The entropy of this code is:
The theorem is usually stated more generally, without the restriction to a binary coding.
We can infer two statements from this theorem: First, the entropy is a lower bound for the
average word length of a prefix code; the average word length cannot be shorter. Second, one
can always find a prefix code whose average word length exceeds the entropy by at most one bit.
This is an upper bound; in the above example we were even closer.
The Huffman code was very close to the entropy in the example. This is no coinci-
dence: It can be shown that the Huffman code is the best prefix code. There is no other
prefix code with shorter average word length.
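For the alphabet from Example 2, entropy and average word length can be compared directly. The sketch below builds a Huffman code with a priority queue; the helper name huffman_lengths and the tie-breaking order are my own choices, so the individual code words may differ from those in Sect. 11.2, but the average word length of a Huffman code does not depend on how ties are broken.

```python
import heapq
import math

probs = {"a": 0.04, "b": 0.15, "c": 0.1, "d": 0.15, "e": 0.36, "f": 0.2}

def huffman_lengths(probs):
    """Return the code word length of each character in a Huffman code."""
    # Heap entries: (probability, tie-breaker, characters in this subtree)
    heap = [(p, i, [ch]) for i, (ch, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, chars1 = heapq.heappop(heap)
        p2, _, chars2 = heapq.heappop(heap)
        for ch in chars1 + chars2:
            lengths[ch] += 1  # every merge adds one bit to all characters below it
        heapq.heappush(heap, (p1 + p2, counter, chars1 + chars2))
        counter += 1
    return lengths

lengths = huffman_lengths(probs)
H = sum(p * math.log2(1 / p) for p in probs.values())  # entropy of the source
L = sum(p * lengths[ch] for ch, p in probs.items())    # average word length
print(lengths)
print(round(H, 3), round(L, 3))
```

The entropy is about 2.33 bits and the average word length 2.42 bits, so the two values differ by far less than one bit.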
Comprehension Questions
3. When rolling 2 dice, let the random variable Xi be the result of the i-th die.
Are the random variables X1 and X2 independent of each other? Are the random
variables X1 and X1 + X2 independent of each other?
4. Does each discrete random variable have an expected value?
5. What type of deviation from the expected value does the variance of a random var-
iable describe?
6. Let X and Y be independent random variables. Is then Var(X · Y ) = Var(X) · Var(Y )?
Exercises
1. Calculate the expected value, the variance and the standard deviation of the sum of
eyes when rolling 5 dice.
2. The random variable P describes the product of eyes when rolling dice twice.
Draw the probability distribution of P. Calculate the expected value and the vari-
ance of P.
3. The random variable X describes the number of jacks in skat after the deal (32
cards with four jacks, 3 players each receive 10 cards, 2 cards are in the skat).
Determine the probability distribution and the expected value of X ,
a) without information about the card distribution among the 3 players,
b) with the knowledge that the first player did not receive any jacks.
4. In the game of roulette, a player bets 10 Euros on red. If he wins (18 of 37 fields
are red), he will receive a pay out of 20 Euros (net profit = 10 Euros). If he loses,
he will double his bet until he wins or until the bank’s betting limit of 1000 Euros
is reached. Describe the random variable “net profit” in this strategy and calculate
the expected value and variance of this game.
5. Use a spreadsheet program or a computer algebra system to plot histograms of the
function bn,p (k) for different values of n and p.
6. The following ad appeared in the Süddeutsche Zeitung a few years ago:1
Let us assume that the scientific prediction is a coin toss. The probability of a girl’s
birth is 0.465. Calculate the approximate annual income of the institute if about
100 requests per month arrive. How can the institute improve its income (at the
same rate)? (For this purpose, set up a random variable “profit” and calculate the
expected value.)
7. Calculate the entropy and the average word length of the Huffman code from Exer-
cise 6 in Chap. 11.
21 Important Distributions, Stochastic Processes
Abstract
The probabilities of events can be described using probability distributions. At the end
of this chapter you will know
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_21
experiment, you can use it to determine probabilities, expected values, variances, and
other parameters.
We will now take a closer look at some of these distributions. The data sets we are
dealing with are always finite, but sometimes it is more clever to work with continuous
distributions and thus approximate the discrete distributions. We will also get to know
such distributions.
We have already encountered some discrete distributions in Sect. 19.3. For these and
for other distributions we will now calculate, in particular, the corresponding expected
values and variances.
that is the arithmetic mean of the function values. The variance is, according to Theorem
20.15
$$\mathrm{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - E(X)^2.$$
For the uniformly distributed random variable “rolling a die”, we calculated the
expected value in (20.3) after Definition 20.11; it is 3.5. The variance when rolling
the die is
$$\mathrm{Var}(X) = \sigma^2 = \frac{1}{6}\,(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) - 3.5^2 \approx 2.917,$$
so the standard deviation σ is approximately 1.7.
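The calculation for the die can be reproduced exactly with rational arithmetic. This is a small sketch; the variable names are chosen for illustration only.

```python
from fractions import Fraction
import math

values = range(1, 7)   # the six equally likely results of a die
n = len(values)

mean = Fraction(sum(values), n)
var = Fraction(sum(x * x for x in values), n) - mean ** 2  # mean of the squares minus E(X)^2
sigma = math.sqrt(var)

print(mean, var, float(var), sigma)
```

The exact variance is 35/12, which rounds to the value 2.917 given above.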
21.1 Discrete Distributions 513
The binomial distribution describes in a Bernoulli process the probability that in n tri-
als the event A occurs k times. The prototype for this is the urn model “drawing with
replacement”: If there are N balls in the urn, of which B are black, then the probability
for the event A to draw a black ball is just p = B/N .
For the random variable X, which is given by
$$X = k \iff A \text{ occurs exactly } k \text{ times in } n \text{ trials},$$
we have, according to Theorem 19.15: $p(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$.
How can we compute this sum? A little trick helps us with that. I formulate it as a
lemma because we will use it often, especially in the next chapter.
A binomially distributed random variable X can be composed additively from n indi-
vidual random variables Xi:
The Xi from the lemma are independent and therefore, with the calculation rules for
expected values and variances, which we derived in Theorems 20.18 and 20.20:
$$E(X) = \sum_{i=1}^{n} E(X_i), \qquad \mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i).$$
The expected value and variance of the Xi are easy to calculate, because Xi only takes
the values 0 and 1. First the expected value:
E(Xi ) = 1 · p(Xi = 1) + 0 · p(Xi = 0) = 1 · p,
and thus E(X) = n · p.
For the variances applies:
The odd-looking numbers are probably due to the fact that the number of particles per
cubic foot was still specified in the last standard.
k          0        1        2        3        4        5
p(Z = k)   0.02955  0.10411  0.18337  0.21524  0.18944  0.13335      (21.2)
◄
The binomial distribution bn,p (k) can be calculated recursively from bn,p (k − 1). We will
need this formula again later:
$$b_{n,p}(k+1) = \frac{n-k}{k+1} \cdot \frac{p}{1-p} \cdot b_{n,p}(k), \qquad b_{n,p}(0) = (1-p)^n.$$
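If we use
$$\binom{n}{k+1} = \frac{n \cdot (n-1) \cdots (n-k)}{1 \cdot 2 \cdot 3 \cdots (k+1)} = \binom{n}{k} \cdot \frac{n-k}{k+1},$$
we get with q = 1 − p:
$$b_{n,p}(k+1) = \binom{n}{k+1} p^{k+1} q^{n-k-1} = \binom{n}{k} \cdot \frac{n-k}{k+1} \cdot p^k p \cdot q^{n-k} \cdot q^{-1} = \frac{n-k}{k+1} \cdot \frac{p}{q} \cdot b_{n,p}(k).$$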
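The recursion translates directly into a loop. The following sketch compares it with the direct formula; the function names and the parameters n = 20, p = 0.3 are assumptions for illustration.

```python
import math

def binomial_pmf_recursive(n, p):
    """Return [b(0), ..., b(n)] for the binomial distribution via the recursion (needs 0 < p < 1)."""
    probs = [(1 - p) ** n]                       # start value b(0) = (1 - p)^n
    for k in range(n):
        probs.append(probs[-1] * (n - k) / (k + 1) * p / (1 - p))
    return probs

def binomial_pmf_direct(n, p, k):
    """Direct formula with the binomial coefficient."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3   # hypothetical parameters
recursive = binomial_pmf_recursive(n, p)
direct = [binomial_pmf_direct(n, p, k) for k in range(n + 1)]

mean = sum(k * b for k, b in enumerate(recursive))
print(max(abs(r - d) for r, d in zip(recursive, direct)))  # only rounding errors
print(sum(recursive), mean)                                # total probability 1, mean n * p
```

The recursion avoids large binomial coefficients, which is why it is convenient for hand calculation and for the approximation arguments later in the chapter.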
The distribution belonging to the urn model “drawing without replacement” is called the
hypergeometric distribution. We derived this distribution in Theorem 19.17:
Definition and Theorem 21.5 An urn contains N balls, B of which are black. A
discrete random variable Y , which after n draws without replacement from the
urn gives the number k of the black balls drawn, is called hypergeometrically
distributed with parameters N , B and n. It is
$$h_{N,B,n}(k) := p(Y = k) = \frac{\binom{B}{k} \binom{N-B}{n-k}}{\binom{N}{n}}. \qquad (21.3)$$
Without proof, I give the expected value and variance for the hypergeometric distribu-
tion. They can be calculated similarly to the binomial distribution:
Example
In a raffle, 10 sellers sell tickets. Each has 50 tickets in his box, of which 20 are win-
ners. You want to buy 10 tickets. Is your chance of winning greater if you buy all the
tickets from one seller, or if you buy only one ticket from each seller and thus always
draw from a full box?
The purchase of tickets from one seller is hypergeometrically distributed, from
several sellers it is binomially distributed, since the same initial situation prevails
again with each purchase. Calculate the probabilities for 0, 1, 2, . . . , 10 wins in the
two cases. You can see that these always differ slightly, sometimes one is larger,
sometimes the other. But the profit expectation, that is, the expected value, is the same
in both cases, it is n · p = 10 · 0.4 = 4. The variance differs slightly: For one seller it
is about 1.96, for several sellers 2.4. Where should you buy if you are risk-averse? ◄
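The probabilities asked for in the raffle example can be tabulated with a short program. This is a sketch: the function names are my own, and the expected value and variance are computed directly from the probability tables rather than from closed formulas.

```python
import math

def hypergeom_pmf(N, B, n, k):
    """Probability of k black balls in n draws without replacement, as in (21.3)."""
    return math.comb(B, k) * math.comb(N - B, n - k) / math.comb(N, n)

def binom_pmf(n, p, k):
    """Probability of k successes in n draws with replacement."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def mean_and_variance(pmf):
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum(k * k * q for k, q in enumerate(pmf)) - mean ** 2
    return mean, var

N, B, n = 50, 20, 10
one_seller = [hypergeom_pmf(N, B, n, k) for k in range(n + 1)]
ten_sellers = [binom_pmf(n, B / N, k) for k in range(n + 1)]

for k in range(n + 1):
    print(k, round(one_seller[k], 4), round(ten_sellers[k], 4))
print(mean_and_variance(one_seller))   # expected value 4, variance about 1.96
print(mean_and_variance(ten_sellers))  # expected value 4, variance 2.4
```

Both strategies have the same expected number of wins; only the spread differs.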
Example
There are 1000 ball bearing balls in a box, 100 of them faulty. You take 10 balls, once
without putting them back, the second time with putting them back.
Let’s calculate the probability of k faulty balls among the 10 balls taken out first
with the help of the hypergeometric distribution and then with the help of the bino-
mial distribution. The results are entered in the following table:
k             0        1        2        3        4        5
hN,B,n (k)    0.3469   0.3894   0.1945   0.0569   0.0108   0.00139
bn,p (k)      0.3487   0.3874   0.1937   0.0574   0.0112   0.00149

k             6              7              8              9              10
hN,B,n (k)    1.2 · 10^{-4}  7.4 · 10^{-6}  2.9 · 10^{-7}  6.5 · 10^{-9}  6.6 · 10^{-11}
bn,p (k)      1.4 · 10^{-4}  8.7 · 10^{-6}  3.6 · 10^{-7}  9.0 · 10^{-9}  1.0 · 10^{-10}
The expected value is in both cases 10 · 0.1 = 1, the variance in the first case is about
0.89, in the second case 0.9. ◄
You can see that the results of the formulas in this case are almost indistinguishable. For
a large population N and a relatively small number n of selected elements, the probabili-
ties of drawing without replacement and drawing with replacement are almost the same.
We can write this result in a theorem that I would like to cite without proof:
Theorem 21.7 If B < N , n < N and n and p = B/N are constant, then for
k = 0, 1, . . . , n:
If you compare the formulas (21.1) and (21.3) for calculating the distributions, you will
find that for large populations, the binomial distribution (21.1) is much easier to evaluate
than the hypergeometric distribution (21.3): with large numbers N and B, calculating the
binomial coefficients quickly becomes laborious. Later we will see that the binomial
distribution can also often be approximated by other, simpler distributions.
This property is often used: If it is possible, statisticians work with the model
“drawing with replacement”. In an election poll, for example, the sample consists of a
selection of people, each of whom is only interviewed once, so strictly speaking it is a
drawing without replacement. However, with 65 million eligible voters and a few thou-
sand respondents, the probability of catching a voter twice when he is “put back”, is
practically zero, so it is quite reasonable to assume that the drawing is with replacement.
The probability that the event A occurs for the first time in the k-th trial in a Bernoulli
process is described by the geometric distribution. We know it from Theorem 19.15:
and therefore we have to calculate the limit of an infinite series. Our knowledge of series,
which we acquired in the second part of the book, helps us further: First of all, we know
that for real numbers x with |x| < 1 applies:
$$\sum_{k=0}^{\infty} x^k = \frac{1}{1-x} \qquad (21.5)$$
This is the geometric series, see Example 2 in Sect. 13.2. In Theorem 15.21 we learned
that the function determined by this series can be differentiated term by term. If we
differentiate both sides of (21.5), the left side term by term and the right side according
to the chain rule, we obtain for all x with |x| < 1 the identity:
$$\sum_{k=1}^{\infty} k \cdot x^{k-1} = \frac{1}{(1-x)^2}$$
Let’s check (21.6): The probability p(X > s) for s futile attempts is $(1-p)^s$. With the
help of the Definition 19.6 of the conditional probability we get:
$$p(X = s+k \mid X > s) = \frac{p(\{X = s+k\} \cap \{X > s\})}{p(X > s)} = \frac{p(X = s+k)}{p(X > s)} = \frac{p(1-p)^{s+k-1}}{(1-p)^s} = p(1-p)^{k-1} = p(X = k)$$
Because of the memorylessness of the geometric distribution, it is pointless to bet, for
example, in roulette on a number that has not been drawn for a long time. The probabil-
ity of drawing a number is always 1/37, even though it is long overdue.
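The memorylessness can be checked numerically for the roulette case. This is only a sketch: the waiting time s = 100 spins is a made-up value, and the expected value is approximated by a truncated sum, which by the differentiated geometric series above should be close to 1/p.

```python
def geometric_pmf(p, k):
    """p(X = k): the event occurs for the first time in trial k (k = 1, 2, ...)."""
    return p * (1 - p) ** (k - 1)

def tail(p, s):
    """p(X > s): the first s trials are all failures."""
    return (1 - p) ** s

p = 1 / 37   # a single number in roulette
s = 100      # hypothetical: the number has not come up for 100 spins

for k in (1, 5, 20):
    conditional = geometric_pmf(p, s + k) / tail(p, s)
    print(k, conditional, geometric_pmf(p, k))   # both columns agree

# Truncated series for the expected value; the remaining terms are negligible
mean = sum(k * geometric_pmf(p, k) for k in range(1, 5001))
print(mean)
```

The conditional probabilities after 100 futile spins are the same as at the very first spin.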
For a large number of trials in a Bernoulli process, the binomial distribution quickly
becomes unwieldy. If the probability for the observed event is small, the Poisson distri-
bution is a good and easy to calculate approximation to the binomial distribution.
If in a Bernoulli process n is large compared to k and p is small, then 1 − p ≈ 1 and
n − k ≈ n. Now we use the recursive formula from Theorem 21.4 to calculate the proba-
bilities bn,p (k). We can simplify this formula a bit by replacing n − k by n and 1 − p by 1:
$$b_{n,p}(k+1) = \frac{(n-k)\,p}{(k+1)(1-p)} \cdot b_{n,p}(k) \approx \frac{np}{k+1} \cdot b_{n,p}(k). \qquad (21.7)$$
To calculate $b_{n,p}(0) = (1-p)^n$ we use the limit of a sequence, which is known from
analysis. For all real numbers x holds:
$$\lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n = e^x.$$
The number e is often defined as the limit of the sequence $(1 + \frac{1}{n})^n$.
Compared to Table (21.2) in the clean room example, which we calculated with the bino-
mial distribution, you can see that the results match well.
The approximation gets better and better the larger n and the smaller p is. It is com-
mon to refer to the product n · p with the letter λ.
Theorem 21.9 If n · p = λ is constant, then
$$\lim_{n \to \infty} b_{n,p}(k) = \frac{\lambda^k}{k!}\, e^{-\lambda}.$$
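I want to sketch the proof, it uses the known rules of limits. For each fixed number k
holds:
$$\lim_{n \to \infty} b_{n,p}(k) = \lim_{n \to \infty} \underbrace{\frac{n \cdot (n-1) \cdots (n-k+1)}{1 \cdot 2 \cdot 3 \cdots k}}_{\binom{n}{k}} \cdot \underbrace{\left(\frac{\lambda}{n}\right)^{k}}_{p^k} \cdot \underbrace{\left(1 - \frac{\lambda}{n}\right)^{n-k}}_{(1-p)^{n-k}}$$
$$= \frac{\lambda^k}{k!} \cdot \lim_{n \to \infty} \underbrace{\frac{n \cdot (n-1) \cdots (n-k+1)}{n^k}}_{\to 1} \cdot \underbrace{\left(1 - \frac{\lambda}{n}\right)^{n-k}}_{\to e^{-\lambda}} = \frac{\lambda^k}{k!}\, e^{-\lambda}.$$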
$$p(X = k) = \frac{\lambda^k}{k!}\, e^{-\lambda}$$
is called Poisson distributed with parameter λ. A Poisson distributed random
variable has the expected value E(X) = λ and the variance Var(X) = λ.
I do not want to derive the expected value and variance of the Poisson distribution, but
in comparison with the binomial distribution you can see that for λ = n · p the expected
values match. The variance of the binomial distribution is n · p · (1 − p); for small p this
value is also close to λ.
As an immediate consequence of Theorem 21.9 we obtain the
Rule of thumb 21.11 For large n and small p the binomial distribution can be
replaced by the Poisson distribution with parameter λ = n · p. This replacement
can be made for λ ≤ 10 and for n ≥ 1500 · p.
Example
Let’s calculate the probability for k particles in one cubic centimeter for a Class 4
cleanroom. Here, a maximum of 352 particles per cubic meter is allowed. In
the worst case, n is equal to 352, and p is now $1/10^6$. For the parameter λ holds:
λ = n · p = 0.000352 < 10 and n > 1500 · p = 0.0015.
The conditions of rule 21.11 are therefore met, so the probability for k particles in
one cubic centimeter is
$$\frac{0.000352^k}{k!}\, e^{-0.000352}$$
In the following table I have compiled the first results:
k          0        1         2
Class 4    0.9996   0.00035   6.2 · 10^{-8}
Try to calculate these probabilities on the calculator using the binomial distribu-
tion and you will see why the approximation by the Poisson distribution is so useful
here. ◄
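On a computer, both distributions can be evaluated for the cleanroom example, which makes the quality of the approximation visible. This is a minimal sketch with function names of my own choosing.

```python
import math

def poisson_pmf(lam, k):
    """Poisson probability lam^k / k! * exp(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

def binomial_pmf(n, p, k):
    """Binomial probability, evaluated directly."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 352, 1e-6   # Class 4 cleanroom, one cubic centimeter
lam = n * p

for k in range(3):
    print(k, poisson_pmf(lam, k), binomial_pmf(n, p, k))
```

The Poisson column reproduces the table above, and the binomial values differ from it by well under one percent.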
The simplest continuous distribution is, as in the discrete case, the uniform distribution.
Take a look again at the example before Definition 20.6 in Sect. 20.1. If the possible
results of an experiment are evenly distributed in an interval [a, b] ⊂ R, we can no longer
give the probability for a single number to occur, but only for subsets of [a, b].
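The density of the uniform distribution in the interval [a, b] ⊂ R has the form (Fig. 21.2):
$$w: \mathbb{R} \to \mathbb{R}, \quad x \mapsto \begin{cases} 1/(b-a) & a \le x \le b \\ 0 & \text{else} \end{cases}$$
The corresponding cumulative distribution function is $F(x) = \int_{-\infty}^{x} w(t)\,dt$ (Fig. 21.3):
$$F: \mathbb{R} \to [0, 1], \quad x \mapsto \begin{cases} 0 & x < a \\ (x-a)/(b-a) & a \le x \le b \\ 1 & x > b \end{cases}$$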
The expected value of a uniformly distributed random variable is, by Definition 20.12,
equal to
$$E(X) = \int_{-\infty}^{+\infty} x \cdot w(x)\,dx = \int_{a}^{b} \frac{x}{b-a}\,dx = \frac{a+b}{2},$$
which is exactly the mean of a and b.
The results of a random number generator that generates numbers between 0
and 1 are ideally uniformly distributed. The expected value is 0.5 and, for example,
p(0.1 < X < 0.2) = F(0.2) − F(0.1) = 0.2 − 0.1 = 0.1. Is your random number gener-
ator working properly? You can generate as many numbers as you want and analyze the
21.2 Continuous Distributions, The Normal Distribution 523
resulting data set. The task of statistics is to check whether this data set has the correct
probability distribution. More on this in the next chapter.
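A first, informal look at a random number generator could be sketched as follows. The generator from Python's random module, the seed and the sample size are assumptions; this is not a statistical test, only a comparison of two sample statistics with the values of the uniform distribution.

```python
import random

rng = random.Random(42)   # fixed seed so that the run is reproducible
samples = [rng.random() for _ in range(100_000)]

mean = sum(samples) / len(samples)
share = sum(0.1 < x < 0.2 for x in samples) / len(samples)

print(mean)    # should be close to the expected value 0.5
print(share)   # should be close to F(0.2) - F(0.1) = 0.1
```

How close is close enough is exactly the kind of question that requires the statistical tests of the next chapter.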
The uniform distribution serves me mainly as a mental warm-up at the beginning of a
difficult task: We come to the most important probability distribution of all, the normal
distribution. We will have to deal with it extensively. I hope that in the following section
I can give you an understanding of how this distribution arises and why it is so important.
As in the derivation of the Poisson distribution, we start from the binomial distri-
bution. We had seen there that for large n and small p the binomial distribution can be
replaced by the Poisson distribution: If λ = np is constant, then
$$\lim_{n \to \infty} b_{n,p}(k) = \frac{\lambda^k}{k!}\, e^{-\lambda}.$$
If n gets bigger, p must get smaller. But what happens if p remains constant and n gets
bigger and bigger? Think of a Bernoulli process that is carried out very often. In Fig.
21.4 I have plotted for p = 0.5 the binomial distributions for different values of n as his-
tograms.
You can see that the distribution becomes flatter and flatter for larger n, the maxi-
mum moves further and further to the right. This is not very surprising, because we know
that the expected value is E(bn,p ) = n · p, so it moves to the right with increasing n. For
the variance applies Var(bn,p ) = np(1 − p), so the dispersion of the values around the
expected value becomes greater and greater, that is, the histograms always become wider
and wider. This sequence of distributions therefore certainly does not converge to a rea-
sonable limit function, it dissipates with increasing n.
[Fig. 21.4: histograms of the binomial distributions b10,1/2(k), b20,1/2(k), b30,1/2(k), b40,1/2(k) and b50,1/2(k)]
In order to find a limit distribution nevertheless, we use a trick. Do you remember
that you can normalize the expected value and variance of a random variable? For a
random variable X with expected value µ and variance σ² we formed in Definition
20.17 the standardized variable X* = (X − µ)/σ. This has the expected value 0 and the
variance 1. We now carry this out for the binomial distributions. The effect of this
standardization is that all distributions are shifted with their peak, the expected value,
to 0, and that they are squeezed together until they all have the same variance 1.
Let us now examine the histograms of the standardizations: If we set k* := (k − µ)/σ
we get:
$$X = k \iff \underbrace{(X - \mu)/\sigma}_{X^*} = \underbrace{(k - \mu)/\sigma}_{k^*} \iff X^* = k^*.$$
The values of k*, for k = 0, 1, . . . , n, are the function values of the random variable X*,
that is, the points at which the bars are to be drawn.
In the case of a binomially distributed random variable, $k^* = (k - np)/\sqrt{np(1-p)}$.
Let’s take b16,1/2: It is µ = np = 8 and σ² = np(1 − p) = 4, so σ = 2. This results in the
values:
k    0    1     2    3     4    5     6    7     8   9    10   11   12   13   14   15   16
k*   −4   −3.5  −3   −2.5  −2   −1.5  −1   −0.5  0   0.5  1    1.5  2    2.5  3    3.5  4
We now want to draw area-proportional histograms. This means that we plot over k* the
bar with the area p(X* = k*). The bars should all be the same width and abut each other.
The distance between two values of k* in the example is just 0.5, and in the general case
the bar width for X* is:
$$(k+1)^* - k^* = \frac{(k+1) - \mu}{\sigma} - \frac{k - \mu}{\sigma} = \frac{1}{\sigma}.$$
In the random variable X the width of the bar was exactly 1, and since
p(X* = k*) = p(X = k), the bar over k* must be stretched by the factor σ in order to
obtain the same area. In Fig. 21.5a you can see the histogram of $b_{16,1/2}$, in Fig. 21.5b the
histogram of $b^*_{16,1/2}$.
The standardizations of the distributions from Fig. 21.4 result in the histograms in
Fig. 21.6.
You see that these distributions are approaching a bell-shaped function more and
more closely, namely the famous Gaussian bell curve. Let us formulate this result math-
ematically:
For each number n the histogram of the standardized binomial distribution $b^*_{n,p}$ can be
represented by a step function $\varphi_n(x)$. The width of a step is 1/σ, so this function has for x
between k* − (0.5/σ) and k* + (0.5/σ) the value σ · bn,p (k). The mathematical formula
for $\varphi_n(x)$ is:
$$\varphi_n(x) = \begin{cases} \sigma\, b_{n,p}(k) & \text{for } \dfrac{k - \mu - 0.5}{\sigma} \le x < \dfrac{k - \mu + 0.5}{\sigma} \\ 0 & \text{else,} \end{cases} \qquad \mu = np, \quad \sigma = \sqrt{np(1-p)}. \qquad (21.8)$$
[Fig. 21.5: a) histogram of b16,1/2, b) histogram of the standardized distribution b*16,1/2]
The following, very difficult theorem, a version of the Moivre-Laplace theorem, states
that this sequence of functions actually converges:
Theorem 21.12: The Moivre-Laplace theorem For all p between 0 and 1 holds:
$$\lim_{n \to \infty} \varphi_n(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^2}{2}}.$$
Fig. 21.7 shows the bell curve, which is defined by the rule $\varphi(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^2}{2}}$. The
graph is not drawn to scale; the maximum value is approximately 0.4, so the curve is very
flat. It is interesting that the convergence does not only take place for p = 0.5 but also for
all other probabilities p: Then the binomial distributions are not symmetrical from the
beginning, the convergence takes a little longer, but eventually the same form is reached.
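The convergence can be observed numerically by comparing the bar heights σ · bn,p (k) of the standardized histogram with the bell curve at the bar centres k*. The sketch below is only an illustration; the function name max_gap and the chosen values of n and p are assumptions.

```python
import math

def phi(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def max_gap(n, p):
    """Largest distance between the bar heights sigma * b(k) and phi at the bar centres k*."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    gap = 0.0
    for k in range(n + 1):
        b = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        gap = max(gap, abs(sigma * b - phi((k - mu) / sigma)))
    return gap

for n in (16, 64, 256):
    print(n, max_gap(n, 0.5), max_gap(n, 0.2))
```

For both values of p the gap shrinks as n grows, and it shrinks more slowly for the asymmetric case p = 0.2.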
Theorem 21.13 The function $\varphi(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^2}{2}}$ is the density of a continuous
random variable N with expected value 0 and variance 1. N has the cumulative
distribution function
$$\Phi(x) = p(N \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt.$$
Definition 21.14 The random variable N from Theorem 21.13 is called standard
normally distributed or N(0, 1)-distributed.
The cumulative distribution function can only be evaluated numerically. In the exer-
cises for integral calculus, you could calculate a table; the result of this numerical inte-
gration can be found in the appendix. Since ϕ(x) is symmetrical, a simple consideration
shows that for x > 0 always holds Φ(−x) = 1 − Φ(x), therefore usually only the posi-
tive values are tabulated.
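The standardized binomial distributions are approximately standard normally distrib-
uted for large n, therefore
$$p(X \le k) = p(X^* \le k^*) \approx \Phi(k^*) = \Phi\!\left(\frac{k-\mu}{\sigma}\right),$$
$$p(k_1 < X \le k_2) \approx \Phi(k_2^*) - \Phi(k_1^*) = \Phi\!\left(\frac{k_2-\mu}{\sigma}\right) - \Phi\!\left(\frac{k_1-\mu}{\sigma}\right). \qquad (21.9)$$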
The convergence in Theorem 21.12 works best if p is close to 0.5. The more p differs
from 0.5, the larger n must be chosen. There are several rules of thumb for this in the
literature. The approximation of the binomial distribution by the normal distribution is
already very good for np > 5 and n(1 − p) > 5. Another rule requires that it should be
np(1 − p) > 9. In both rules you can see that the further p deviates from 0.5, the larger n
must be chosen. I summarize our considerations in the following rule of thumb:
Rule of thumb 21.15 Let X be a bn,p-distributed random variable. Then, for np > 5
and n(1 − p) > 5 the following calculation is possible:
a) $p(k_1 \le X \le k_2) = \Phi\!\left(\dfrac{k_2 - np + 0.5}{\sqrt{np(1-p)}}\right) - \Phi\!\left(\dfrac{k_1 - np - 0.5}{\sqrt{np(1-p)}}\right)$, 0 ≤ k1 ≤ k2 ≤ n,
b) $p(X \le k) = \Phi\!\left(\dfrac{k - np + 0.5}{\sqrt{np(1-p)}}\right)$, $p(X < k) = \Phi\!\left(\dfrac{k - np - 0.5}{\sqrt{np(1-p)}}\right)$, 0 ≤ k ≤ n.
The ±0.5 in the two formulas arises because in the binomial distribution one always has
to go to the left or to the right edge of the bar over k*, see (21.8). This correction only
plays a role for small values of n.
Example 1
In an election, 10% of voters voted for party A. In the election poll, 1000 voters are
surveyed and an election forecast is created. What is the probability that this fore-
cast deviates by no more than ±2% from the actual election result?
The survey is a Bernoulli process, the random variable X = “number of voters for
party A” is binomially distributed with the parameters n = 1000 and p = 0.1. The
probability sought is p(80 ≤ X ≤ 120), because then the prediction lies between 8%
and 12%. It is np = 100, n(1 − p) = 900, so we can apply rule 21.15a):
$$p(80 \le X \le 120) = \Phi\!\left(\frac{120 - 100 + 0.5}{\sqrt{90}}\right) - \Phi\!\left(\frac{80 - 100 - 0.5}{\sqrt{90}}\right) = \Phi(2.16) - \Phi(-2.16) = 2 \cdot \Phi(2.16) - 1 = 2 \cdot 0.9846 - 1 = 0.9692.$$
The probability we are looking for is therefore about 96.9%. Calculate for yourself
what the result is if we leave out the ±0.5 in the formula: we get about 96.5%.
Of course, we could also solve this task directly with the binomial distribution: But
then the probabilities p(X = 80), p(X = 81) and so on until p(X = 120) all have to be
calculated individually, a rather cumbersome procedure. ◄
From the example you can see that we are slowly approaching the statistical questions
that I introduced in Sect. 19.1. In reality, the actual result, 10%, is unfortunately not
given. We have to guess a percentage from the survey and give a so-called confidence
interval to it.
Example 2
Assume the hypothesis to be true that a slice of toast falls on the buttered side as often
as on the unbuttered side when it falls down. What is the probability the toast will fall
on the buttered side more than 52 times out of 100 trials?
The associated random variable X is b100,1/2-distributed, we can use rule 21.15b):
$$p(X > 52) = 1 - p(X \le 52) = 1 - \Phi\!\left(\frac{52 - 50 + 0.5}{\sqrt{25}}\right) = 1 - \Phi(0.5) = 1 - 0.6915 = 0.3085.$$
In the exercises to Chap. 19 you were asked to solve the problem using the binomial
distribution. The result of this calculation was (rounded) 0.3087. ◄
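With a computer, the "cumbersome" direct calculation is no longer an obstacle, so both examples can be cross-checked. The following sketch evaluates Φ through the error function math.erf instead of a table; the function names are my own.

```python
import math

def Phi(x):
    """Cumulative distribution function of N(0, 1), expressed through the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def binom_range(n, p, k1, k2):
    """Exact probability p(k1 <= X <= k2) for a binomially distributed X."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k1, k2 + 1))

def normal_range(n, p, k1, k2, correction=0.5):
    """Rule 21.15a; correction=0 leaves out the +-0.5 term."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    return Phi((k2 - mu + correction) / sigma) - Phi((k1 - mu - correction) / sigma)

# Example 1: election poll
poll_exact = binom_range(1000, 0.1, 80, 120)
poll_approx = normal_range(1000, 0.1, 80, 120)
poll_plain = normal_range(1000, 0.1, 80, 120, correction=0)
print(poll_exact, poll_approx, poll_plain)

# Example 2: toast, p(X > 52) = 1 - p(X <= 52), rule 21.15b
toast_exact = 1 - binom_range(100, 0.5, 0, 52)
toast_approx = 1 - Phi((52 - 50 + 0.5) / 5)
print(toast_exact, toast_approx)
```

Small differences in the last digit compared with the text come from rounding the argument of Φ to two decimal places when using the table.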
Just as we were able to normalize the binomial distribution by shifting and pressing it
into a distribution with expected value 0 and variance 1, we can transform the stand-
ard normal distribution into a similar distribution with expected value µ and variance σ 2:
If X is a N(0,1)-distributed random variable, then according to Theorems 20.13 and
20.16 the random variable Y = σ X + µ has expected value µ and variance σ 2. Then
X = (Y − µ)/σ.
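Theorem 21.17 Let X be a N(µ, σ²)-distributed random variable. Then for the
associated cumulative distribution function F holds:
$$F(x) = p(X \le x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right),$$
$$F(y) - F(x) = p(x < X \le y) = \Phi\!\left(\frac{y-\mu}{\sigma}\right) - \Phi\!\left(\frac{x-\mu}{\sigma}\right). \qquad (21.10)$$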
Rule of thumb 21.18 For np > 5 and n(1 − p) > 5 a bn,p-distributed random
variable is approximately N(np, np(1 − p))-distributed.
It is therefore not surprising that the formulas (21.9) and (21.10) agree with each other
except for the “ ≈ ”.
Example
It is known that the diameter of ball bearing balls from a special production is
N(45, 0.01²)-distributed. A ball is unusable if it deviates from the nominal 45 mm by
more than 0.03 mm. What is the probability of such a deviation?
p(|X − µ| ≤ σ ) ≈ 0.6826,
p(|X − µ| ≤ 2σ ) ≈ 0.9546,
p(|X − µ| ≤ 3σ ) ≈ 0.9973.
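$$p(|X - \mu| \le k\sigma) = p(\mu - k\sigma \le X \le \mu + k\sigma) = F(\mu + k\sigma) - F(\mu - k\sigma)$$
$$= \Phi\!\left(\frac{\mu + k\sigma - \mu}{\sigma}\right) - \Phi\!\left(\frac{\mu - k\sigma - \mu}{\sigma}\right) = \Phi(k) - \Phi(-k) = 2\Phi(k) - 1.$$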
The given values for k = 1, 2, 3 can be determined using the table in the appendix.
The interesting thing about this statement is that these numerical values apply regardless
of µ and σ (Fig. 21.9): 68.3% of the results of X differ from the expected value by less
than σ , 95.4% by less than 2σ and 99.7% less than 3σ . These multiples of the standard
deviation are therefore often used as simple indicators to determine how “normal” a par-
ticular result is.
In the example of the ball bearings, the permissible deviation of 0.03 mm happened to
be exactly three times the standard deviation. We could therefore have directly taken the
result from Theorem 21.19.
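Instead of the table in the appendix, the values 2Φ(k) − 1 can also be computed with the error function. This is a sketch; the function names are assumptions, and the results may differ from the tabulated values in the last digit.

```python
import math

def Phi(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def within(k):
    """p(|X - mu| <= k * sigma) = 2 * Phi(k) - 1, independent of mu and sigma."""
    return 2 * Phi(k) - 1

for k in (1, 2, 3):
    print(k, round(within(k), 4))

# Ball bearing example: tolerance 0.03 mm with sigma = 0.01 mm, that is k = 3
p_unusable = 1 - within(0.03 / 0.01)
print(p_unusable)
```

For the ball bearings this gives a probability of about 0.27% that a ball is unusable.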
[Fig. 21.9: density of the normal distribution; 68.3% of the area lies between µ − σ and µ + σ]
The normal distribution occurs again and again with many random variables in prac-
tice. Why is it so common? We know it as the limit distribution of the binomial distribu-
tion, but there are other ways to get the normal distribution. A very important one results
from the central limit theorem of probability theory, which I would like to quote in a
special form:
A random variable that is composed additively of many individual influences, where the
individual influences are identically distributed random variables, is thus approximately
normally distributed. The conditions on the variables Xi in this limit theorem can still be
significantly weakened, but then the theorem becomes very hard to read. The proof is very involved.
Many random variables meet the conditions of the theorem. This is where the great
importance of the normal distribution in statistics lies.
The convergence takes place quite quickly, already for n ≥ 30 we can apply the rule:
Examples
1. The random variable “rolling a die” is uniformly distributed with expected value
3.5 and variance 2.92. If we roll the die 1000 times and Xi is the result of the i-th
roll, then the sum of eyes S1000 is approximately a N(3500, 2920)-distributed random
variable. What is the probability that the sum of eyes deviates by more than 100 from
3500? With σ = √2920 ≈ 54 we get
$$p(3400 \le S_{1000} \le 3600) = \Phi\!\left(\frac{3600 - 3500}{54}\right) - \Phi\!\left(\frac{3400 - 3500}{54}\right) = \Phi(1.85) - \Phi(-1.85) = 2 \cdot \Phi(1.85) - 1 = 2 \cdot 0.9678 - 1 = 0.9356,$$
so the probability of a deviation of more than 100 is about 1 − 0.9356 = 0.0644.
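The normal approximation for the sum of eyes can be compared with a simulation. This is a sketch under assumptions: the seed and the number of 2000 simulated experiments are arbitrary choices, and the simulated share is itself subject to random fluctuation.

```python
import math
import random

def Phi(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n = 1000
sigma = math.sqrt(35 / 12 * n)         # the exact variance of one die is 35/12
p_normal = 2 * Phi(100 / sigma) - 1    # p(3400 <= S <= 3600) by the normal approximation

rng = random.Random(1)                 # fixed seed, hypothetical choice
runs = 2000
hits = 0
for _ in range(runs):
    total = sum(rng.choices(range(1, 7), k=n))   # one experiment: 1000 rolls
    hits += 3400 <= total <= 3600
p_simulated = hits / runs

print(p_normal, p_simulated, 1 - p_normal)
```

The simulated share of experiments with a sum between 3400 and 3600 lies close to the value obtained from the normal distribution.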
is the density of a continuous distribution that has this property, as we will see shortly:
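The antiderivative of $\lambda e^{-\lambda x}$ is $-e^{-\lambda x}$ and so we can calculate the cumulative distribution
function F; for t ≥ 0:
$$F: \mathbb{R} \to [0, 1], \quad t \mapsto p(X \le t) = \int_{-\infty}^{t} w(x)\,dx = \int_{0}^{t} \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda t}.$$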
A similar calculation for the variance results in the value 1/λ². I summarize the results:
is the density of a continuous random variable X with expected value 1/λ and
variance 1/λ². The random variable X is called exponentially distributed with
expected value 1/λ. It has the distribution function $F(t) = p(X \le t) = 1 - e^{-\lambda t}$.
In practice, the exponential distribution is often used as a distribution for lifetime cal-
culations. Here, the memorylessness states that the expected lifetime of a component is
independent of how long it has been in operation. Such components are called fatigue-
free. The expected value in the exponential distribution then corresponds to the expected
lifetime of the component.
Example
The manufacturer of a hard disk specifies a value of 70 years as the mean time to fail-
ure (MTTF). What is the probability that the hard disk in a server will fail next year
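or in the next two years? The MTTF is the expected value 1/λ. So we are looking for
p(X ≤ 1) or p(X ≤ 2) with λ = 1/70 ≈ 0.014286. It is
$$p(X \le 1) = 1 - e^{-\lambda} \approx 0.0142, \qquad p(X \le 2) = 1 - e^{-2\lambda} \approx 0.0282.$$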
What is the probability that all hard disks running in 50 computers in continuous
operation will survive one year or two years without failure?
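It is p(X > 1) = 0.9858 and p(X > 2) = 0.9718. Since the hard disk failures are
independent of each other, the probability that all 50 hard disks survive is
$0.9858^{50} \approx 0.49$ for one year and $0.9718^{50} \approx 0.24$ for two years.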
In fatigue-free systems, the failure rate does not change over the course of its life.
Reality looks different. None of the hard disks in operation today will still be running
in 70 years, even if one were to try. So why can the MTTF still be a reasonable meas-
ure?
The failure rate of a component often follows a bathtub curve: at the beginning of
its use, there is a higher number of failures that can be caused by production or mate-
rial errors. Such failures are hopefully caught by the warranty. This is followed by
a longer period of regular operation, during which the probability of failure remains
almost unchanged. At some point, then, the aging processes set in, and the probability
of failure becomes greater again. The specified failure rate, 70 years in the example,
only applies at the bottom of the bathtub, where the exponential distribution can be
applied well. The number “70 years” given in the example must under no circum-
stances be interpreted as the mean service life of the hard disk. ◄
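The numbers of the hard disk example follow from the distribution function of the exponential distribution. A minimal sketch, assuming the parameter is the reciprocal of the MTTF as stated in the example; the names are chosen for illustration.

```python
import math

mttf = 70.0       # years, from the example
lam = 1 / mttf    # parameter of the exponential distribution

def cdf(t):
    """p(X <= t) = 1 - exp(-lam * t) for t >= 0."""
    return 1 - math.exp(-lam * t)

for years in (1, 2):
    p_fail = cdf(years)
    p_all_50_survive = (1 - p_fail) ** 50   # 50 independent hard disks
    print(years, round(p_fail, 4), round(1 - p_fail, 4), round(p_all_50_survive, 4))
```

Although a single disk fails within a year with a probability of less than 1.5%, the chance that all 50 disks survive the year is only about one half.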
As the last continuous distribution, I would like to introduce you to the chi-square distri-
bution, which plays an important role in test theory. In the next chapter we will need this
for a statistical test.
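The χ²-distribution is built from several independent standard normal distributions: if
X1 , X2 , . . . , Xn are independent and standard normally distributed, then the sum of squares
X1² + X2² + . . . + Xn² is χ²-distributed with n degrees of freedom. For x > 0 the first five
density functions are:
\[ g_1(x) = \frac{1}{\sqrt{2\pi x}} \cdot e^{-x/2}, \qquad g_2(x) = \frac{1}{2} \cdot e^{-x/2}, \]
\[ g_3(x) = \sqrt{\frac{x}{2\pi}} \cdot e^{-x/2}, \qquad g_4(x) = \frac{x}{4} \cdot e^{-x/2}, \]
\[ g_5(x) = \sqrt{\frac{x^3}{18\pi}} \cdot e^{-x/2}. \]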
In Fig. 21.11 you see the graphs of the first five chi-square density functions.
536 21 Important Distributions, Stochastic Processes
[Fig. 21.11: graphs of the chi-square densities g1 to g5; horizontal axis from 0 to 10, vertical axis from 0 to 0.6]
From g3 on the densities have a maximum. The expected value of $\chi_n^2$ is n, the variance
is 2n. For large n, $\chi_n^2$ is approximately N(n, 2n)-distributed. This can also be concluded
from the central limit theorem.
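The five densities can be compared with the general chi-square density. The general formula used below, x^(n/2 − 1) · e^(−x/2) / (2^(n/2) · Γ(n/2)), is the standard one and is an assumption in the sense that it is not stated in this section; the function names are illustrative only.

```python
import math

def chi2_density(n: int, x: float) -> float:
    """General chi-square density with n degrees of freedom, for x > 0."""
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

# The closed forms g1 to g5 for the first five degrees of freedom.
closed_forms = {
    1: lambda x: math.exp(-x / 2) / math.sqrt(2 * math.pi * x),
    2: lambda x: math.exp(-x / 2) / 2,
    3: lambda x: math.sqrt(x / (2 * math.pi)) * math.exp(-x / 2),
    4: lambda x: x / 4 * math.exp(-x / 2),
    5: lambda x: math.sqrt(x ** 3 / (18 * math.pi)) * math.exp(-x / 2),
}

# Largest difference between general formula and closed form on a few test points.
max_gap = max(abs(chi2_density(n, x) - g(x))
              for n, g in closed_forms.items()
              for x in (0.5, 1.0, 2.0, 5.0, 10.0))
print(max_gap)
```

The printed gap should be at the level of floating point rounding, which supports the reconstruction of g1 to g5.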
Sequences of random variables, such as (Xn )n∈N or (Xt )t∈R are called stochastic pro-
cesses. The index can take continuous values (t ∈ R) or take discrete values (n ∈ N).
Such processes often describe the temporal development of a random variable. The index
t then denotes the current observation time of the variable.
1. The Brownian motion of a particle is a stochastic process: In this case the random variable Xt
takes values in R³, it gives the position of a particle in space at time t .
2. The random occurrence of an event over time is a stochastic process: Xt is the number
of events that have occurred from time 0 to time t . If the events occur randomly and
independently of each other, we obtain the Poisson process.
3. Do you remember the long queues at enrollment? The number of students who have
arrived at the enrollment office by the time t , the number of students who have been
enrolled by the time t and the number of students in the queue at the time t are sto-
chastic processes that are studied in queueing theory.
Carrying out an experiment means observing the values of Xt for all t , i.e. for exam-
ple, observing a specific queue between t0 and t1. In this interval, each Xt is assigned
the observed value xt, that is, the number of students waiting at time t . The result of
the experiment, the realization, thus represents a function here: [t0 , t1 ] → R, t → xt. This
function is called the trajectory of the stochastic process. The trajectory is the result of
the specific experiment.
21.3 Stochastic Processes 537
We first want to investigate events that occur randomly and independently of each other
over time. Examples of this are:
• Nuclear decay,
• The arrival of service requests at a server,
• Calls in a call center,
• The occurrence of software errors in a program system.
Let’s start with a classic random event: the nuclear decay of a particle, for example plu-
tonium. The half-life is known from physics, so for a given amount of plutonium one
knows the decay rate, that is, one knows how many decays are to be expected in a given
time interval. Unknown, however, is when a single particle decays and how the decays
are distributed in this time interval. Since the decay is a random process, only a probabil-
ity can be given that in an interval T exactly k particles decay. We now want to calculate
this probability.
Assume that in a sample of material there are about 7200 decays per hour. This decay
rate can be calculated from the known half-life. What is the probability that in a period
of 1 second 0, 1, 2 or more decays will occur?
To do this, we choose a fixed second T from the hour of observation. Physics tells us
that the particles decay independently of each other and so we can regard the process as
a Bernoulli process: The i -th trial of the process is the fate of the i -th decaying particle.
The observed event A = “Particle decays in T ” has the probability p = 1/3600. Then
the random variable Z , which indicates the number of particles that decay in second T ,
is binomially distributed with the parameters n = 7200, p = 1/3600. We can apply rule
21.11 and obtain with λ = 7200 · 1/3600 = 2:
\[ p(Z = k) = \frac{\lambda^k}{k!} e^{-\lambda} = \frac{2^k}{k!} e^{-2} \tag{21.13} \]
λ plays the role of an expected value for the examined time interval: We expect an average
of 7200 decays per hour, which corresponds to 2 decays per second.
In the calculation of the number of decays in one second, I made a mistake: I assumed
that exactly 7200 particles decay, which was the number n of trials. However, the num-
ber 7200 is only an average, there will always be deviations from it. Now let’s look at
a longer period of time, say 1000 hours, then the relative deviation from the average
and thus the calculation error is certainly much smaller. In 1000 hours we expect
7 200 000 decays. Then p = 1/3 600 000 and the same calculation as in (21.13) results with
λ = 7 200 000 · 1/3 600 000 = 2 surprisingly again in:
\[ p(Z = k) = \frac{2^k}{k!} e^{-2}. \]
The result is therefore independent of the length of the observed time period. If we con-
sider Theorem 21.9, we can make another interesting observation: I chose the longer
time period of 1000 hours to reduce the error caused by the deviation from the mean
decay rate. But at the same time, the longer the period under investigation, the better the
approximation of the binomial distribution by the Poisson distribution. In fact, the Pois-
son distribution not only approximately describes the distribution of nuclear decays over
time, but exactly, provided that the decay rate is known exactly.
Let us now consider the stochastic process Zt , where Zt counts the decays in t seconds.
For n = 7200 the probability p has the value t/3600 and we get
λ = 7200 · t/3600 = 2t. Of course we get the same value for λ with n = 7 200 000 and
p = t/3 600 000. The result is:
\[ p(Z_t = k) = \frac{(2t)^k}{k!} e^{-2t}. \]
The random variable Zt is therefore Poisson distributed for each t with parameter 2t.
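How close the binomial distribution already is to the Poisson distribution can be seen numerically. The following sketch compares the two binomial models from the text (n = 7200 and n = 7 200 000) with the Poisson distribution with parameter 2; the helper names `binomial_pmf` and `poisson_pmf` are illustrative.

```python
import math

def binomial_pmf(n: int, p: float, k: int) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam: float, k: int) -> float:
    """Poisson probability lam^k / k! * e^(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

for k in range(5):
    print(k,
          round(binomial_pmf(7200, 1 / 3600, k), 6),
          round(binomial_pmf(7_200_000, 1 / 3_600_000, k), 6),
          round(poisson_pmf(2.0, k), 6))
```

With the larger n the binomial values agree with the Poisson values to more decimal places, in line with the remark on Theorem 21.9.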
Nuclear decay is a prototype for a class of stochastic processes that count independent
and random events. These are called Poisson processes and they all have the following
properties:
Definition 21.24: The Poisson process For t ∈ R+ let Xt be the random variable
that describes the number of occurrences of an event in the period from 0 to t . For
these events it should apply:
The conclusions we have drawn for the atomic decay process can be generalized:
Theorem 21.25 In a Poisson process (Xt )t∈R+ the random variable Xt is Poisson
distributed for all t with parameter λt, that is,
\[ p(X_t = k) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}. \]
In this theorem, λt takes on the role of λ from the Poisson distribution: λt is the expected
value for the number of events in the period from 0 to t , and λ is something like the
“expected value per unit of time”. λ is called the rate of the events. In the example of the
nuclear decay, on average, 7200 decays per hour are expected, which corresponds to 2
decays per second, that is λ = 2/sec. The rate λ for the events in a Poisson process is often
known.
Example
\[ p(X_1 = 0) = \frac{1}{1} e^{-2} = 0.135, \qquad p(X_1 = 1) = \frac{2}{1} e^{-2} = 0.271, \]
\[ p(X_1 = 2) = \frac{4}{2} e^{-2} = 0.271, \qquad p(X_1 = 3) = \frac{8}{6} e^{-2} = 0.180, \]
\[ p(X_1 = 4) = \frac{16}{24} e^{-2} = 0.090, \]
and thus the sum p(X1 < 5) ≈ 0.947. This results in an overload in about 5.3 % of the
minute intervals. ◄
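The five probabilities and their sum can be reproduced as follows. The sketch assumes, as the numbers above suggest, a rate of 2 events per minute and an overload as soon as 5 or more events fall into one minute; both the threshold and the variable names are assumptions made for this illustration.

```python
import math

lam = 2.0  # expected number of events per minute (assumed from the example)

def poisson_pmf(k: int, rate: float) -> float:
    """p(X_1 = k) for a Poisson distributed count with parameter `rate`."""
    return rate ** k / math.factorial(k) * math.exp(-rate)

probs = [poisson_pmf(k, lam) for k in range(5)]
below_five = sum(probs)        # p(X_1 < 5)
overload = 1.0 - below_five    # p(X_1 >= 5)

print([round(p, 3) for p in probs])
print(round(below_five, 3), round(overload, 3))
```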
If Xt is a Poisson process, then the waiting times between the events of the process are
exponentially distributed:
Theorem 21.26 Let (Xt )t>0 be a Poisson process and Xt be Poisson distributed
with parameter λt. Let Wi be the time between the i -th and the (i + 1)-th occurrence
of the event. Then Wi is exponentially distributed with expected value 1/λ.
According to Theorem 21.25,
\[ p(X_t = 0) = \frac{(\lambda t)^0}{0!} e^{-\lambda t} = e^{-\lambda t}. \]
The probability that the waiting time W0 for the first event is greater than t is equal to
the probability that Xt = 0. We thus obtain for W0:
\[ p(W_0 \le t) = 1 - e^{-\lambda t}. \]
W0 therefore has the distribution function of the exponential distribution and is thus
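exponentially distributed. For i > 0 we can conclude similarly. If s is the time at which
the event occurs for the i -th time, then by Definition 21.24a) we have for Wi:
\[ p(W_i \le t) = 1 - p(\text{no event between } s \text{ and } s + t) = 1 - p(X_t = 0) = 1 - e^{-\lambda t}. \]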
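The connection between exponential waiting times and Poisson counts can also be tried out in a simulation. Note that the sketch below goes in the converse direction of Theorem 21.26: it generates exponentially distributed gaps with mean 1/λ and checks that the number of events in [0, t] behaves like a Poisson distributed variable with parameter λt. The seed, the number of runs and the function name `count_events` are arbitrary choices for this illustration.

```python
import math
import random

def count_events(rate: float, t: float, rng: random.Random) -> int:
    """Count events in [0, t] when the gaps between events are exponential with mean 1/rate."""
    clock, events = 0.0, 0
    while True:
        clock += rng.expovariate(rate)  # next waiting time
        if clock > t:
            return events
        events += 1

rng = random.Random(12345)
rate, t, runs = 2.0, 1.0, 20_000
counts = [count_events(rate, t, rng) for _ in range(runs)]

mean_count = sum(counts) / runs        # should be close to rate * t = 2
share_zero = counts.count(0) / runs    # should be close to e^(-2)
print(round(mean_count, 3), round(share_zero, 3), round(math.exp(-rate * t), 3))
```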
Markov Chains
We will now investigate stochastic processes with a discrete index set N0. Again, the
index can be interpreted as a time parameter, but in contrast to the Poisson processes,
we only look at the value of the random variable Xt at certain points in time, for example
every minute or every hour. Such a process (Xt )t∈N0 is called a Markov process or Markov
chain, if the probability that Xt = k depends only on the distribution of the random var-
iable Xt−1, not on other random variables of the process. The image of the random vari-
ables should be a subset of N or N0. The elements of the image of Xt are called states.
Definition 21.27 For all t ∈ N0 let Xt be a discrete random variable with the
same image W ⊂ N. The stochastic process (Xt )t∈N0 is called a Markov chain, if
p(Xt+1 = k) only depends on Xt. Let
pik (t) := p(Xt+1 = k|Xt = i).
pik (t) is called the transition probability from i to k. If pik (t) =: pik is constant
for all t ∈ N0, then (Xt )t∈N0 is called a homogeneous Markov chain. The vector or
sequence (a0 , a1 , a2 , . . .) with ai = p(X0 = i) is called initial distribution of the
chain. If the image W is finite, the matrix P = (pij ) is called transition matrix of
the Markov chain.
We will only deal with homogeneous Markov chains. With probability pik the system
transitions from state i to state k in an observed time period.
Such a Markov chain can be represented in the form of a directed network (compare
Definition 11.25). For this purpose, an
Example
An admittedly somewhat crude description of the weather could consist of the three
states: sunny all day, cloudy but dry, rain. I number the states with 0, 1, 2. The times t
should be the days. If the weather tomorrow only depends on the weather today, then
there are transition probabilities. I have entered these in the graph in Fig. 21.12.
[Fig. 21.12: transition graph of the weather example with the states 0 (sun), 1 (clouds) and 2 (rain); the edges carry the transition probabilities]
Starting from the state Xt = i always exactly one of the states Xt+1 = k, k = 0, . . . , n
occurs with the probability pik. Therefore, the sum must be $\sum_{k=0}^{n} p_{ik} = 1$. This is just
the sum of the elements of the i -th row of the matrix. In a transition matrix, therefore, all
row sums are equal to 1.
How can we calculate the distribution at a later time from a given initial distribu-
tion a(0) := (a0 , a1 , . . . , an )? Let us put in the law of total probability (Theorem 19.7)
Bi = {Xt = i} and A = {Xt+1 = k}, so we get
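\[ p(X_{t+1} = k) = \sum_{i=0}^{n} p(X_t = i) \cdot p(X_{t+1} = k \mid X_t = i) = \sum_{i=0}^{n} p(X_t = i) \cdot p_{ik}. \]
For t = 0 the right-hand side is a0 p0k + a1 p1k + . . . + an pnk, that is the k-th component of
the product of the row vector a(0) with the matrix P. With
a(1) := (p(X1 = 0), p(X1 = 1), . . . , p(X1 = n)) this means
\[ a(1) = a(0)P. \]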
You have to be careful here: in linear algebra, we usually multiply matrices with column
vectors from the right, here a row vector is multiplied from the left with the matrix.
It is the same for the states at later times. If a(t) := (p(Xt = 0), p(Xt = 1), . . . , p(Xt = n)),
then:
\[ a(t + 1) = a(t)P, \quad \text{and therefore} \quad a(t) = a(0)P^t. \]
Example
If the initial distribution for day 0 is given, for example (a0,a1,a2) = (0.2, 0.5, 0.3) for
sun, clouds or rain, then the probabilities of the three states are now calculable for all
times. Let’s start with day 1:
\[ a(1) = a(0)P = (0.2, 0.5, 0.3) \cdot \begin{pmatrix} 0.3 & 0.5 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{pmatrix} = (0.27, 0.42, 0.31) \]
It follows
a(2) = a(1)P = (0.269, 0.427, 0.304),
a(3) = (0.2696, 0.4269, 0.3035).
Do you see the sum of the vector elements is always 1 again? This must be so,
because there are only the three states sun, clouds or rain.
What is after 100 days? With a computer algebra system, you can immediately cal-
culate that the values of a(100) only differ from a(2) after the fourth decimal place.
We will see in a moment that this is no coincidence. Let’s look at the (rounded)
powers of the matrix P:
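\[ P^2 = \begin{pmatrix} 0.28 & 0.43 & 0.29 \\ 0.27 & 0.43 & 0.30 \\ 0.26 & 0.42 & 0.32 \end{pmatrix}, \qquad
P^3 = \begin{pmatrix} 0.271 & 0.428 & 0.301 \\ 0.270 & 0.427 & 0.303 \\ 0.268 & 0.426 & 0.306 \end{pmatrix}, \qquad \dots, \]
\[ P^{10} = \begin{pmatrix} 0.270 & 0.427 & 0.303 \\ 0.270 & 0.427 & 0.303 \\ 0.270 & 0.427 & 0.303 \end{pmatrix}. \]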
For higher powers, almost nothing happens anymore. The powers converge
to a matrix P∞ and surprisingly, in the limit all rows are equal. This has an interesting
consequence: For each initial distribution a = (a0 , a1 , a2 ) we have
aP∞ = (0.270, 0.427, 0.303). Please try this yourself. This means that the state probabilities
of the Markov chain become stationary. Already after a few days we get a
probability of 0.270 for sun, 0.427 for clouds and 0.303 for rain. ◄
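The weather example is easy to recompute. The following sketch multiplies the row vector a(t) repeatedly with the transition matrix P from the example; the function names `step` and `distribution_after` are illustrative and not taken from the text.

```python
# Transition matrix of the weather example: states 0 = sun, 1 = clouds, 2 = rain.
P = [[0.3, 0.5, 0.2],
     [0.3, 0.4, 0.3],
     [0.2, 0.4, 0.4]]

def step(a, P):
    """One day: row vector a multiplied from the left with the matrix P."""
    return [sum(a[i] * P[i][k] for i in range(len(a))) for k in range(len(P[0]))]

def distribution_after(a, P, days):
    """Distribution a(days) = a(0) P^days."""
    for _ in range(days):
        a = step(a, P)
    return a

a0 = [0.2, 0.5, 0.3]
for days in (1, 2, 3, 100):
    print(days, [round(x, 4) for x in distribution_after(a0, P, days)])
```

Starting with another initial distribution, for example `[1.0, 0.0, 0.0]`, leads to the same values after a few days.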
a) There exists the limit P∞ = limn→∞ Pn. All rows of the matrix P∞ are equal.
The sum of the elements of this row is 1.
The row vector p is also called the stationary distribution of the Markov chain.
Remember that for the transposition of matrices (AB)^T = B^T A^T holds. Then pP = p is the
same as P^T p^T = p^T. In the language of eigenvalue theory, Theorem 21.28c) says: up to scalar
multiples, p^T is the only eigenvector of the matrix P^T for the eigenvalue 1.
The proof of part a) of the theorem requires only elementary mathematics, but is some-
what tricky. I will not carry it out here.
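To b): Since the sum of the components of a is $\sum_{i=0}^{n} a_i = 1$, we get aP∞ = p simply by
substitution:
\[ (a_0, a_1, \dots, a_n) \begin{pmatrix} p_0 & p_1 & \dots & p_n \\ p_0 & p_1 & \dots & p_n \\ \vdots & \vdots & & \vdots \\ p_0 & p_1 & \dots & p_n \end{pmatrix}
= \left( \sum_{i=0}^{n} a_i p_0, \ \sum_{i=0}^{n} a_i p_1, \ \dots, \ \sum_{i=0}^{n} a_i p_n \right) = (p_0, p_1, \dots, p_n). \]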
To c): From P^(n+1) = P^n · P we obtain in the limit P∞ = P∞ P. If we multiply this
equation from the left by p, we get pP∞ = pP∞ P. Since pP∞ = p , we get pP = p.
Is there another stationary distribution q, which is different from p? From qP = q
would then for all n follow qPn = q, thus also qP∞ = q. But we already know that
qP∞ = p is true. Therefore p = q.
Theorem 21.28c) provides us with an elegant way to determine the stationary distribu-
tion of a homogeneous Markov chain. We do not have to laboriously calculate the limit
of a matrix power, it is enough to solve the system of linear equations pP = p for the
unknowns p = (p0 , p1 , . . . , pn ).
Don't be confused by the notation here. In the language of Chap. 8 we have to solve the
system of equations P^T x = x or (P^T − E)x = 0 for the column vector x = p^T.
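For the weather example, the system (P^T − E)x = 0 can be solved exactly. The sketch below is one possible way to do it, not the method of the text: it replaces the last equation by the normalization x0 + x1 + . . . + xn = 1 and applies Gauss-Jordan elimination with exact fractions. It assumes that the stationary distribution is unique, otherwise the pivot search fails. The function name is hypothetical.

```python
from fractions import Fraction

def stationary_distribution(P):
    """Solve pP = p together with sum(p) = 1 by Gauss-Jordan elimination."""
    n = len(P)
    # Rows of (P^T - E) x = 0, each extended by the right-hand side 0.
    A = [[Fraction(P[j][i]) - (1 if i == j else 0) for j in range(n)] + [Fraction(0)]
         for i in range(n)]
    # The equations are linearly dependent, so the last one is replaced
    # by the normalization x_0 + ... + x_(n-1) = 1.
    A[n - 1] = [Fraction(1)] * n + [Fraction(1)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

P = [[Fraction(3, 10), Fraction(5, 10), Fraction(2, 10)],
     [Fraction(3, 10), Fraction(4, 10), Fraction(3, 10)],
     [Fraction(2, 10), Fraction(4, 10), Fraction(4, 10)]]
p = stationary_distribution(P)
print(p, [round(float(v), 4) for v in p])
```

The exact solution (24/89, 38/89, 27/89) agrees with the rounded limit (0.270, 0.427, 0.303) of the matrix powers.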
Queues
Queues occur again and again in many areas of everyday life: at the checkout in the
supermarket, in the doctor’s waiting room, when calling a call center, when requesting a
server. In queueing theory, one usually speaks of customers who arrive and are served at
a service station. In a common type of queue, arrival and service can each be described
by Poisson processes. In between is a waiting room that can usually accommodate a limited
number of waiting customers. We will now investigate such queues.
The customers are to arrive in the queue with an arrival rate of λ. At the head of the
queue they are served, this is to be done with a service rate of µ. λ and µ are the rates of
the corresponding Poisson processes, they give the expected number of events in a certain
period of time. The maximum length of the queue is n. Furthermore, let ρ = λ/µ.
The number ρ is called the utilization of the system. We want to calculate how the queue
develops as a function of λ and µ. The queue can be described by a stochastic process
(Xt )t∈R, where the random variable Xt is the number of customers in the system at time t ,
including the customer currently being served.
If we consider the queue at fixed times t ∈ N0, for example every minute, we get a
stochastic process (Xt )t∈N0, and since both entry and exit from the queue are random and
independent, this process is a homogeneous Markov chain: the number of customers in
the system at time t + 1 depends only on the number at time t , and at different times the
transition probability from t to t + 1 is always the same. For this Markov chain, we now
want to calculate the transition matrix and the stationary distribution.
For simplicity, we choose for the period of time an interval T , which should be so
small that in this interval at most one customer arrives or is served. The probability that
two or more of these events will occur in the interval T should be negligibly small.
Example
A doctor can treat 10 patients per hour in his practice, about 9 patients come per
hour. The system should have 11 seats, 10 in the waiting room, one in the treatment
room. If all seats are occupied and another patient arrives, he leaves again. The
arrival rate in this example is λ = 9, the service rate µ = 10. The period of time of
one hour is too large for our calculations, so we choose, for example, the interval
T = 1 minute = 1/60 hour. Then we calculate with the arrival rate λT = 9/60 and
with the service rate µT = 10/60 per minute. We now assume that never more than
one patient enters or leaves the waiting room per minute.
λT and µT are the arrival and service rates in the small interval T . We will later use that
the quotients are equal: λT /µT = λ/µ = ρ. The rate λT is the expected value for the
arrival of a customer in this interval, and since we assume that only 0 or 1 customer can
come, λT is also the probability for the arrival of a customer in T . The same is true for
µT . The smaller T is, the smaller λT and µT become. For the following calculation, this
has the pleasant side effect that the product λT · µT is tiny and can also be neglected.
Now for the calculation of the transition matrix. It is pik = p(Xt+1 = k|Xt = i). The
probability that between t and t + 1 a customer arrives is λT , the probability that a
customer has been served and leaves the system is µT . Then the probabilities that between t
and t + 1 no customer arrives or no customer leaves the system are 1 − λT and
1 − µT , respectively. Between t and t + 1 the following cases can now occur:
Event                              Probability
One customer comes, one goes       λT · µT
Nothing happens                    (1 − λT ) · (1 − µT )
One customer comes, no one goes    λT · (1 − µT )
No customer comes, one goes        (1 − λT ) · µT
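The first two cases leave the number of customers unchanged, the third increases it by 1,
the fourth decreases it by 1. For 0 < i < n this gives:
\[ p_{ii} = \lambda_T \mu_T + (1 - \lambda_T)(1 - \mu_T) = 1 - \lambda_T - \mu_T + 2\lambda_T \mu_T \]
\[ p_{i,i+1} = \lambda_T (1 - \mu_T) = \lambda_T - \lambda_T \mu_T \]
\[ p_{i,i-1} = \mu_T (1 - \lambda_T) = \mu_T - \lambda_T \mu_T \]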
All other values in the row i are 0, the number of customers in the queue can only
change by 1. If we leave out the small products λT · µT , we get for row i of the transition
matrix just
\[ (0, 0, \dots, 0, \underbrace{\mu_T}_{i-1}, \underbrace{1 - \lambda_T - \mu_T}_{i}, \underbrace{\lambda_T}_{i+1}, 0, \dots, 0) \]
Although we have done some rounding, in the matrix all row sums are equal to 1, so it is
a correct transition matrix of a Markov chain. And the smaller the interval T is chosen,
the better this matrix describes our queue.
Does this process have a stationary state? If you calculate a few powers of the matrix,
you will notice that more and more zeros disappear, eventually all elements are positive.
Then we can apply Theorem 21.28: there is a stationary state, which we can calculate
with the help of Theorem 21.28c). For this, the system of equations pP = p has to be
solved for p = (p0 , p1 , . . . , pn ).
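The n + 1 equations of the system are:
\[ \begin{aligned}
p_0 (1 - \lambda_T) + p_1 \mu_T &= p_0 \\
\Rightarrow \quad -p_0 \lambda_T + p_1 \mu_T &= 0 \\
p_0 \lambda_T + p_1 (1 - \lambda_T - \mu_T) + p_2 \mu_T &= p_1 \\
\Rightarrow \quad p_0 \lambda_T - p_1 (\lambda_T + \mu_T) + p_2 \mu_T &= 0 \\
&\ \ \vdots \\
p_{i-1} \lambda_T + p_i (1 - \lambda_T - \mu_T) + p_{i+1} \mu_T &= p_i \\
\Rightarrow \quad p_{i-1} \lambda_T - p_i (\lambda_T + \mu_T) + p_{i+1} \mu_T &= 0 \\
&\ \ \vdots \\
p_{n-1} \lambda_T + p_n (1 - \mu_T) &= p_n \\
\Rightarrow \quad -p_{n-1} \lambda_T + p_n \mu_T &= 0
\end{aligned} \]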
From the first equation we get p0 = (µT /λT )p1. Substituted into equation 2 we get
\[ p_1 \mu_T - p_1 (\lambda_T + \mu_T) + p_2 \mu_T = -p_1 \lambda_T + p_2 \mu_T = 0 \ \Rightarrow \ p_1 = (\mu_T / \lambda_T) p_2, \]
and so on. For all i , also for i = n − 1, we have pi = (µT /λT )pi+1, or pi+1 = (λT /µT )pi. If we
now set ρ = λT /µT we get for the result vector:
\[ p = p_0 (1, \rho, \rho^2, \rho^3, \dots, \rho^n) \]
The solution set of the system of equations is a one-dimensional vector space. We now have
to determine p0 so that the sum of the vector elements is 1:
\[ p_0 (1 + \rho + \rho^2 + \rho^3 + \dots + \rho^n) = 1 \]
According to Theorem 3.3, for ρ ≠ 1 we have
$1 + \rho + \rho^2 + \rho^3 + \dots + \rho^n = (1 - \rho^{n+1})/(1 - \rho)$ and
therefore $p_0 = (1 - \rho)/(1 - \rho^{n+1})$. The stationary state of the queue is therefore
\[ p = \frac{1 - \rho}{1 - \rho^{n+1}} \, (1, \rho, \rho^2, \rho^3, \dots, \rho^n). \tag{21.14} \]
Example
In the waiting room of the doctor with λT = 9/60 and µT = 10/60 we have ρ = 0.9. That
is also just λ/µ with the original rates λ = 9 and µ = 10. If the system has 11 places,
that is n = 11, the factor $(1 - \rho)/(1 - \rho^{n+1})$ has the value 0.14 and the stationary state is:
p = (0.14, 0.13, 0.11, 0.10, 0.091, 0.082, 0.074, 0.067, 0.060, 0.054, 0.049, 0.044)
Position i indicates the probability that i people are in the system. The probability of
an empty system (position 0) is therefore 0.14. In this case, the doctor has nothing to
do. With a probability of 0.13, he is currently treating a patient and the waiting room
is empty. With a probability of 0.11, there is 1 patient in the waiting room and so on.
The expected value for the number of people in the practice, calculated according
to the formula in Definition 20.11, is 4.28. The average length of stay in the prac-
tice is 5.28/10 hours, that is about 32 minutes. Calculate for yourself how the queue
develops if the waiting room gets bigger or if the arrival rate is 10 or 11 patients per
hour. ◄
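The suggested experiments with the doctor's practice can be carried out with a short program. The sketch below evaluates the stationary state with p0 = (1 − ρ)/(1 − ρ^(n+1)) and also builds the tridiagonal transition matrix, including the boundary rows (1 − λT , λT , 0, . . .) and (. . . , 0, µT , 1 − µT ) that follow from the first and last equation of the system. The function names are illustrative, and the formula requires λ ≠ µ.

```python
def finite_queue_distribution(lam: float, mu: float, n: int) -> list:
    """Stationary distribution p_0..p_n of a queue with n places; requires lam != mu."""
    rho = lam / mu
    p0 = (1 - rho) / (1 - rho ** (n + 1))
    return [p0 * rho ** i for i in range(n + 1)]

def transition_matrix(lam_T: float, mu_T: float, n: int) -> list:
    """Tridiagonal transition matrix with the small products lam_T * mu_T neglected."""
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        if i > 0:
            P[i][i - 1] = mu_T       # one customer leaves
        if i < n:
            P[i][i + 1] = lam_T      # one customer arrives
        P[i][i] = 1.0 - sum(P[i])    # row sum must be 1
    return P

p = finite_queue_distribution(9, 10, 11)
P = transition_matrix(9 / 60, 10 / 60, 11)

# Residual of pP = p; it should vanish up to rounding errors.
residual = max(abs(sum(p[i] * P[i][k] for i in range(12)) - p[k]) for k in range(12))
expected_customers = sum(i * pi for i, pi in enumerate(p))

print([round(x, 3) for x in p])
print(round(expected_customers, 2), residual)
```

Calling `finite_queue_distribution(11, 10, 11)` or increasing `n` shows how the distribution shifts when more patients arrive or the waiting room grows.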
Often it is assumed in queueing theory that the length of the queue is unlim-
ited, that n can therefore be arbitrarily large. The transition matrix then becomes
arbitrarily large and we finally obtain for the stationary state p the “infinite”
vector p = p0 (1, ρ, ρ², ρ³, . . . ). If ρ < 1 then for the geometric series
$\sum_{i=0}^{\infty} \rho^i = 1/(1 - \rho)$ holds, so p0 = (1 − ρ). Then formula (21.14) becomes a little bit easier:
\[ p = (1 - \rho)(1, \rho, \rho^2, \rho^3, \dots). \]
The random variable X with $p(X = k) = (1 - \rho)\rho^k$ describes the number of customers
in the system. We can also give a simple formula for the calculation of the expected
value of X : For the random variable Y = X + 1 we have $p(Y = k) = (1 - \rho)\rho^{k-1}$, so Y
is geometrically distributed with parameter 1 − ρ and has the expected value 1/(1 − ρ). It follows
that E(X) = 1/(1 − ρ) − 1 = ρ/(1 − ρ) = λ/(µ − λ).
Theorem 21.29 If the arrival and service of customers in a queue follow Poisson
processes with arrival rate λ and service rate µ, if the length of the queue is
unlimited and λ < µ, then the random variable X with
\[ p(X = k) = \left(1 - \frac{\lambda}{\mu}\right) \left(\frac{\lambda}{\mu}\right)^k, \quad k \in \mathbb{N}_0 \]
describes the number of customers in the system in the stationary state. Further, it
is:
λ/(µ − λ): The expected value for the number of customers in the system,
λ/(µ − λ) − λ/µ: The expected value for the length of the queue,
1/(µ − λ): The average length of stay in the system,
1/(µ − λ) − 1/µ: The average waiting time in the queue.
21.4 Comprehension Questions and Exercise Problems 549
Example
Let’s look again in the waiting room of the doctor and assume it is unlimited. For the
arrival rate 9 and the service rate 10 we then get
p = (0.10, 0.090, 0.081, 0.073, 0.066, 0.059,
0.053, 0.048, 0.043, 0.039, 0.035, 0.031, . . . ).
The expected value for the number of patients is 9/(10 − 9) = 9 and the average
waiting time until leaving the practice is 1/(10 − 9) = 1 hour. The expected value
for the length of the queue is 8.1, the waiting time in the waiting room on average
1 − 1/10 hours, that is 54 minutes.
Look at the difference to the waiting room with 10 seats. The situation was much
more relaxed there. However, about 5% of the patients were also sent home without
having received medical attention, because the waiting room was full. ◄
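The four quantities of Theorem 21.29 are simple enough to put into a small helper. The following is a sketch with hypothetical names (`unlimited_queue`, `state_probability`); it only evaluates the formulas of the theorem and refuses rates with λ ≥ µ, for which no stationary state exists.

```python
def unlimited_queue(lam: float, mu: float) -> dict:
    """Characteristic values of a queue of unlimited length in the stationary state."""
    if not lam < mu:
        raise ValueError("a stationary state requires lam < mu")
    in_system = lam / (mu - lam)   # expected number of customers in the system
    stay = 1 / (mu - lam)          # average length of stay in the system
    return {
        "customers_in_system": in_system,
        "queue_length": in_system - lam / mu,
        "time_in_system": stay,
        "waiting_time": stay - 1 / mu,
    }

def state_probability(lam: float, mu: float, k: int) -> float:
    """p(X = k) = (1 - lam/mu) * (lam/mu)^k."""
    rho = lam / mu
    return (1 - rho) * rho ** k

values = unlimited_queue(9, 10)
print(values)
print([round(state_probability(9, 10, k), 3) for k in range(6)])
```

With rates per hour, `waiting_time` is given in hours; 0.9 hours correspond to the 54 minutes of the example.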
While in the case of a finite waiting room solutions exist for all values of λ and µ, in
the infinite case necessarily λ < µ must hold. It is obvious that the arrival rate cannot be
greater than the service rate in the long run, otherwise the queue will always grow, there
can be no stationary state. But even if λ = µ, no stationary state is achieved. You can see
very well what happens here if you play around with the example of the waiting room:
If λ = µ then finally all lengths are equally likely. Of course, this is no longer possible if
the waiting room has an infinite number of seats.
In queueing theory, many other models are considered. For example, arrival and ser-
vice can follow other distributions, customers can be scheduled at certain times, instead
of one service station there can be several stations with possibly different distributions,
customers can be served according to different priorities. Also combinations of different
queues can be examined.
Comprehension questions
Exercises
1. In a light drizzle, 100 000 water droplets fall on a grid on the ground. The grid is
1 m² in size, the holes in the grid 1 cm².
Assume that the numbers of drops falling on different areas of the grid are inde-
pendent of each other.
For each sub-task, describe in detail which “experiment” is carried out and for
which events probabilities are calculated. Describe the random variables that occur
and justify which distribution you use for which sub-task.
a) Determine the probability that exactly 10 raindrops fall into a certain grid hole
and the probability that at least 1 drop falls into each hole.
b) The grid is divided into 100 parts. Calculate the probability that between 900
and 1100 drops fall into such a part, as well as the probability that there is a part
on which more than 1050 or more than 1100 drops fall.
c) On 10 adjacent holes, 100 drops fall randomly. Calculate the probability that
exactly 10 drops fall into the first of these 10 holes.
2. A baker bakes 100 olive breads. He throws 400 olives into the dough. Determine
the probability that there are two to six olives in a bread and that you catch a bread
without olives.
3. An airline knows that on average 10% of the booked flight seats are canceled and
overbooks by 5%. For an airplane with 100 seats, it sells 105 tickets. What is the
probability that more passengers will come to the departure than there are seats
available? Solve this task with the exact distribution as well as with a possible
approximation to the exact distribution.
You will find that the rules of thumb are sometimes to be taken with a grain of salt.
4. At a highway tollbooth, 1000 cars arrive randomly distributed between 7:00 and
17:00. If more than 5 arrive in one minute, the tollbooth is overloaded. In how
many percent of the intervals is this the case?
5. A company receives an average of 500 orders between 8 a.m. and 1 p.m. in the
morning, 800 orders between 2 p.m. and 6 p.m. in the afternoon, evenly distributed
over the 5 or 4 hours. Waiting times occur when more than 3 orders arrive in one
minute. Calculate the probability of this occurring in a one-minute interval for the
morning and the afternoon.
Note: If both engines have failed, there is still an emergency engine that should keep
the ship maneuverable so that it does not run aground in the Elbe off Hamburg.
Statistical Methods
22
Abstract
Now you can reap the fruits of the last chapters. If you have worked through this
chapter
• then you know what a sample is, and can calculate the sample mean, the mean
squared error and the sample variance,
• you can examine data using principal component analysis,
• you are familiar with the concept of the estimator and have criteria to determine
whether an estimator is unbiased or consistent,
• you know estimators for the probability in a Bernoulli process and for the expected
value and variance of a random variable,
• you understand the concepts of confidence interval and confidence level and can
calculate confidence intervals for the probability in a Bernoulli process and deter-
mine necessary sample sizes,
• you can carry out one-sided and two-sided hypothesis tests for Bernoulli processes
and know the errors of type I and type II in hypothesis tests,
• you have carried out Pearson’s chi-squared test,
• and can put the finished book aside. Congratulations!
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_22
selection of data – a sample – from the population and analyzing it. In this chapter we
will learn some such analysis methods.
In the example of the election forecast, the task arose of guessing the election result from
a sample. This election result is an unknown parameter. We now want to develop meth-
ods with which one can derive estimates for such unknown parameters, the estimands,
from a sample.
First I would like to clarify the concept of the sample:
Samples
I will only investigate random samples in what follows, I simply call them samples.
If you receive a sample value by a random experiment, it is the result of a random
variable X . If the experiment is repeated n times independently, we get n sample val-
ues x1 , x2 , . . . , xn. This n-tuple can also be regarded as the result of a random vector
(X1 , X2 , . . . , Xn ), where Xi describes the result of the i -th experiment. A sample of size n
is then a realization of this random vector.
Examples
1. Let’s roll a die 10 times. The random variable Xi denotes the result of the i -
th roll. A sample is a function value of the random vector (X1 , X2 , . . . , X10 ),
different function values result in different samples. So, for example,
s1 = (2, 5, 2, 6, 2, 5, 5, 6, 4, 3) and s2 = (4, 6, 2, 2, 6, 1, 6, 4, 4, 1) are possible sam-
ples of this random experiment.
2. There are 10 000 screws in a box, some of which are defective. For a sample, 100
screws are taken. The random variable Xi describes the condition of the i -th screw
taken:
\[ X_i(\omega) = \begin{cases} 0 & \text{if the } i\text{-th screw is ok} \\ 1 & \text{if the } i\text{-th screw is defective.} \end{cases} \]
A concrete sample is a realization of the random vector (X1 , X2 , . . . , X100 ) and, for
example, has the value (0, 1, 0, 0, 0, 1, 1, 0, 0, . . . , 1).
New random variables can be obtained from random variables by combination. For
example, one could consider:
\[ H_{100}(\omega) = X_1(\omega) + X_2(\omega) + \dots + X_{100}(\omega), \]
\[ R_{100}(\omega) = \frac{1}{100} \left( X_1(\omega) + X_2(\omega) + \dots + X_{100}(\omega) \right). \]
H100 describes the number of defective screws in the sample, R100 the relative fre-
quency of defective screws. These are also random variables that can take different
values in each sample. We already know from Theorem 21.2 that H100 is binomi-
ally distributed, but the parameter p of the binomial distribution is still unknown.
3. In an election poll, the sample consists of n randomly selected voters. The random
variable Xi describes the voting behavior of the i -th person surveyed. The poll
is a realization of the random vector (X1 , X2 , . . . , Xn ). The opinion research com-
pany conducts a sample and estimates the election result from it. Another institute
receives a different sample and is therefore very likely to have a different election
forecast. ◄
Just as we have assigned parameters to random variables, such as expected value and
variance, we now want to define characteristics for samples:
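For a sample (x1 , x2 , . . . , xn ), the number
\[ \bar{x} := \frac{1}{n} (x_1 + x_2 + \dots + x_n) \]
is the sample mean. The number
\[ m := \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
is called the mean squared error, it represents the mean of the squared deviations
from x̄.
\[ s^2 := \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
is called the sample variance and $s = \sqrt{s^2}$ the sample standard deviation. If
(y1 , y2 , . . . , yn ) is another sample with mean ȳ, then
\[ s_{xy}^2 := \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
is called the covariance of the two samples.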
The expected value of a random variable is something like a mean function value, and
the variance of a random variable is the mean squared deviation from it. You see the
analogy to the concepts of sample mean and mean squared error of a sample. The reason
why the sample variance is still introduced, in which instead of dividing by n, we divide
by n − 1, is not yet clear, we will come to it later. We will use these numbers as estimates
for corresponding parameters of a random variable.
\[ s^2 = \frac{1}{n - 1} \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right). \]
Proof:
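\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)
= \sum_{i=1}^{n} x_i^2 - 2 \bar{x} \underbrace{\sum_{i=1}^{n} x_i}_{= n \bar{x}} + n \bar{x}^2
= \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \]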
Let’s look at the dice example again. The expected value when rolling dice is 3.5, the
variance 2.917. If we calculate the sample mean and sample variance of the two samples
s1 = (2, 5, 2, 6, 2, 5, 5, 6, 4, 3) and s2 = (4, 6, 2, 2, 6, 1, 6, 4, 4, 1), we get:
Sample mean of s1:
\[ \bar{s}_1 = \frac{1}{10} (2 + 5 + 2 + 6 + 2 + 5 + 5 + 6 + 4 + 3) = \frac{40}{10} = 4 \]
Sample variance of s1:
\[ s^2 = \frac{1}{9} (2^2 + 5^2 + 2^2 + 6^2 + 2^2 + 5^2 + 5^2 + 6^2 + 4^2 + 3^2 - 10 \cdot 4^2) = \frac{24}{9} \approx 2.67 \]
similarly for the sample mean of s2: $\bar{s}_2 = 3.6$, and the sample variance of s2: $s^2 = 4.04$.
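These values can be checked with the `statistics` module of the Python standard library, whose `variance` function divides by n − 1 and whose `pvariance` function divides by n (the latter corresponds to the mean squared error m). The function `sample_variance_shortcut` is a name chosen here; it implements the formula with the sum of squares minus n times the squared mean.

```python
import statistics

def sample_variance_shortcut(sample):
    """Sample variance computed as (sum of squares - n * mean^2) / (n - 1)."""
    n = len(sample)
    mean = sum(sample) / n
    return (sum(x * x for x in sample) - n * mean ** 2) / (n - 1)

s1 = (2, 5, 2, 6, 2, 5, 5, 6, 4, 3)
s2 = (4, 6, 2, 2, 6, 1, 6, 4, 4, 1)

for s in (s1, s2):
    print(statistics.mean(s),                       # sample mean
          round(statistics.pvariance(s), 2),        # mean squared error m
          round(statistics.variance(s), 2),         # sample variance s^2
          round(sample_variance_shortcut(s), 2))    # same value via the shortcut
```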
We must carefully distinguish: The expected value, variance and standard deviation
of a random variable are fixed numbers that are independent of a specific experiment.
The sample mean, mean square error, sample variance and sample standard deviation
of a sample, on the other hand, are themselves realizations of a random variable, that
is, random, trial-dependent values. If $(x_1, x_2, \dots, x_n)$ arises as the realization of the
random vector $(X_1, X_2, \dots, X_n)$, then the associated sample mean is the realization of the
random variable $\bar{X} := \frac{1}{n}(X_1 + X_2 + \dots + X_n)$, and the sample variance is the realization of
$S^2 := \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$.
Estimators
Definition 22.4 Let the random variable X describe the outcome of a random
experiment. This experiment is to be repeated n times. Let Xi be the random
variable that describes the outcome of the i -th experiment. If p is a parameter
that is associated with the random variable X and f is a function that can be used
to determine an estimate p̃ = f (x1 , x2 , . . . , xn ) for the parameter p from a sample
(x1 , x2 , . . . , xn ), then the random variable
P = f (X1 , X2 , . . . , Xn )
formed from the random variables Xi is called an estimator for the parameter p. A
realization of the estimator
P(ω) = f (X1 (ω), X2 (ω), . . . , Xn (ω)) = f (x1 , x2 , . . . , xn )
is called an estimate for the parameter p.
You will have to read this definition several times to understand what is behind it. But it
becomes clearer when we look at the examples from the beginning of the chapter again:
Examples
1. The random variable X describes the result of rolling a die. If we do not know the
expected value of X yet, we can guess after 10 rolls with the result (x1 , x2 , . . . , x10 ):
\[ E(X) \approx \frac{1}{10}(x_1 + x_2 + \dots + x_{10}). \]
In this case, for all n ∈ N, the random variable $\bar{X} = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$ is an
estimator for the unknown parameter E(X).
How can we estimate the variance of the random variable X? It will turn out that
$S^2 := \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$, the function whose result is the sample variance of
the sample, is the right estimator for the parameter Var(X).
2. The random experiment consists of taking a screw out of the box with 10 000
screws. X describes the state of the screw with 0 (okay) or 1 (not okay). The exper-
iment is repeated 100 times, Xi represents the state of the i -th screw. The unknown
parameter of X is the probability p with which X = 1. In a sample (x1 , x2 , . . . , x100 ),
the number of defective screws is just the sum of the xi. It is therefore natural to
guess:
\[ p(X = 1) \approx \frac{x_1 + x_2 + \dots + x_{100}}{100}. \]
This is a realization of the estimator $R_{100} = \frac{1}{100}(X_1 + X_2 + \dots + X_{100})$. ◄
In these examples we have guessed the estimators for a parameter with common sense;
we do not yet know any criteria for when an estimator is good or bad. For example, with
the screws in the box, the estimate for p can take any value between 0 and 1 depending
on the sample. Is R100 really a good estimator?
We describe the quality of the estimators with the means of probability theory that we
have already worked out:
The function values of the estimator can of course not always exactly match the param-
eter p, but the estimates should be scattered around this parameter, the expected value of
the estimator should be p.
For a good estimator, this is not enough: The values of the estimator must not jump
too wildly around p:
I only included the definition in this form for the sake of mathematical correctness. For us,
the following, much more intuitive theorem, which can be derived with some effort from
Definition 22.6 is sufficient:
This means that with increasing n, the variance of the estimator becomes smaller and
smaller.
In the screw example, we chose the relative frequency as an estimator for the probability
of the event “screw defective”. The following theorem shows that this was a good choice:
Theorem 22.8 In a Bernoulli process, let p be the probability of the event A. Let
Xi = 1 if A occurs on the i -th trial, otherwise let Xi = 0. Then
\[ R_n := \frac{1}{n}\sum_{i=1}^{n} X_i \]
is an unbiased and consistent estimator for p.
Let us check Definition 22.5 and Theorem 22.7. We know from Theorem
21.2 that $H_n = \sum_{i=1}^{n} X_i$ is binomially distributed. Therefore, $E(H_n) = np$ and
$\mathrm{Var}(H_n) = np(1 - p)$. Because $R_n = \frac{1}{n}H_n$, $R_n$ is, just like $H_n$, approximately normally
distributed for large n, and
\[ E(R_n) = \frac{1}{n}E(H_n) = p, \qquad \mathrm{Var}(R_n) = \frac{1}{n^2}\mathrm{Var}(H_n) = \frac{p(1-p)}{n}. \]
So $R_n$ is unbiased and, because $\lim_{n\to\infty}\mathrm{Var}(R_n) = 0$, also consistent.
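The two moments of the relative frequency can also be verified numerically without any simulation, by summing over the binomial distribution of Hn. The following is a minimal sketch; the function name and the value p = 0.3 are assumptions chosen only for illustration.

```python
from math import comb


def relative_frequency_moments(n, p):
    """Exact E(R_n) and Var(R_n) for R_n = H_n / n with H_n binomially distributed."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * w for k, w in enumerate(pmf))
    var = sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var


p = 0.3  # hypothetical probability of the event A
for n in (10, 100, 500):
    mean, var = relative_frequency_moments(n, p)
    # the last two columns should agree: computed variance and p(1 - p)/n
    print(n, round(mean, 6), var, p * (1 - p) / n)
```

The expected value stays at p for every n, while the variance shrinks like 1/n, which is exactly what unbiasedness and consistency require here.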
Let us finally examine the expected value and variance of a random variable:
Theorem 22.9 Let the random variable X describe an experiment that is repeated
n times independently, let Xi be the outcome of the i -th experiment. Let X have
the expected value µ and the variance σ 2. Then
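\[ \bar{X} := \frac{1}{n}(X_1 + X_2 + \dots + X_n) \]
is an unbiased and consistent estimator for the expected value of X and
\[ S^2 := \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right) \]
is an unbiased estimator for the variance of X.
For the statement about $S^2$ we use $\mathrm{Var}(Y) = E(Y^2) - E(Y)^2$, applied to $X_i$ and to $\bar{X}$, where
$E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \sigma^2/n$ because the $X_i$ are independent:
\[ E(X_i^2) = \sigma^2 + \mu^2, \qquad E(\bar{X}^2) = \frac{\sigma^2}{n} + \mu^2 \]
and thus:
\[ E(S^2) = \frac{1}{n-1}E\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)
          = \frac{1}{n-1}\left(\sum_{i=1}^{n} E(X_i^2) - n \cdot E(\bar{X}^2)\right)
          = \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right)
          = \frac{1}{n-1}\left((n-1)\sigma^2\right) = \sigma^2. \]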
This calculation reveals why, in the definition of the sample variance, we divided by
n − 1 and not by n: Otherwise, $E(S^2) = \sigma^2(n-1)/n$, and $S^2$ would not be an
unbiased estimator.
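For a small case the effect of the divisor can be checked exactly. The sketch below (function name and the choice n = 3 are assumptions for illustration) enumerates all 216 equally likely samples of three dice rolls and averages both variants of the estimate with exact fractions.

```python
from fractions import Fraction
from itertools import product


def expected_variance_estimates(n):
    """Exact expected values of the estimates with divisor n - 1 and with divisor n."""
    total_unbiased = Fraction(0)
    total_biased = Fraction(0)
    count = 0
    for sample in product(range(1, 7), repeat=n):  # all outcomes of n dice rolls
        mean = Fraction(sum(sample), n)
        squared_dev = sum((x - mean) ** 2 for x in sample)
        total_unbiased += squared_dev / (n - 1)
        total_biased += squared_dev / n
        count += 1
    return total_unbiased / count, total_biased / count


unbiased, biased = expected_variance_estimates(3)
print(unbiased, biased)  # variance of a die roll is 35/12
```

Dividing by n − 1 reproduces the variance 35/12 ≈ 2.917 of the die exactly, while dividing by n gives the smaller value σ²(n − 1)/n.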
In big data analysis, large, often unstructured data sets are examined. They can hardly
be processed manually anymore, so mathematical methods are used to evaluate them. One
of these methods is principal component analysis (PCA). In this section we will do
linear algebra again. Let’s start with an example: A high school graduate takes an online
test to find out recommendations for a suitable field of study or vocational training. The
test collects 75 features from the candidate, including school grades, personal prefer-
ences and hobbies, social and political interest and much more. Let’s assume that each
feature can be represented by a real number, then the test result is a point in $\mathbb{R}^{75}$. Of the
features, some are certainly more important than others, some are more or less redun-
dant, others are of great importance for the purpose of the evaluation. How can a sugges-
tion for a study direction be read out of this?
If the test is taken not by one but by 1000 subjects, we get a point cloud in $\mathbb{R}^{75}$. In the
language of statistics, a sample of size 1000 is taken for each of the 75 features. Based
on these samples, the relevance of features and the relation between different features
should now be worked out. The following idea is behind it: First calculate the covariances
of sample pairs. The smaller these covariances are, the less the corresponding features
are correlated. Now try to generate new features through a change of basis in $\mathbb{R}^{75}$,
with the goal of minimizing the covariances between the different features. With the
principal component analysis we will even be able to make them 0! The new features are
then linear combinations of the original features. Now you have generated 75 largely
uncorrelated features. Often you will then find that many of these new features have a
very small variance: Nearly the same value results for all subjects here. This is a strong
indication that these features do not contain much information that can be used to interpret the
result. Such features are ignored. If you restrict yourself to the features with the largest
variance, the so-called principal components, you get a point cloud in a low-dimensional
space, for example, with two or three principal components in $\mathbb{R}^2$ or $\mathbb{R}^3$. At this point,
human expertise is required. The two- or three-dimensional cloud can be visualized,
and if the questions of the online test were well formulated, you can see areas in which
data accumulate. You will now try to identify areas in $\mathbb{R}^2$ or $\mathbb{R}^3$ in which, for example,
respondents with mathematical and scientific interests, economic, social, craft or other
interests can be found.
Once this first manual interpretation of the reduced data has been done, the 1001st
participant in the survey can be automatically assigned to one of these areas and receives
a corresponding recommendation.
The algorithmic core of this program consists in finding the suitable change of basis
to make the correlation of the features zero. Subsequently, the principal components are
analyzed. I would now like to introduce this process. I start with a concrete example
with three features: From a set of n = 13 people, the height, weight and body mass index
(BMI) are recorded, so we have three samples:
Subject          1      2      3      4      5      6      7      8      9     10     11     12     13
Height (cm)    161    181    174    170    165    191    184    162    183    171    159    166    193
Weight (kg)     56     73     57     70     59     71     74     58     72     64     54     60     80
BMI          21.60  22.28  18.50  24.91  20.94  19.46  21.86  23.62  20.90  21.89  21.36  24.68  21.48
You may remember that BMI is just determined by the person’s height and weight. So it’s
a redundant feature in this data collection. From the raw data perspective, you can’t see this
immediately. But if our data analysis is any good, it should be able to tell us that. We will see.
The following calculations are hardly possible on paper anymore; you need computer
support for that. I recommend that you follow the examples using a computer algebra
system.
To analyze the data, it makes sense to first normalize the samples. For this purpose,
we subtract the mean in the three samples S1, S2, S3. If the sample variances are very
different, it also makes sense to divide by the sample standard deviation to make the
features quantitatively comparable. I did this in the present example. Now all samples
scatter around zero and have a sample variance of 1. We enter these normalized sample
values in a matrix A. The columns are assigned to the subjects and the rows to the fea-
tures. The matrix of normalized features is:
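\[ A = \begin{pmatrix}
-1.12 & 0.62 & 0.013 & -0.33 & -0.77 & 1.50 & 0.88 & -1.03 & 0.80 & -0.25 & -1.29 & -0.68 & 1.67 \\
-1.09 & 0.92 & -0.97 & 0.57 & -0.74 & 0.68 & 1.04 & -0.86 & 0.80 & -0.15 & -1.33 & -0.62 & 1.75 \\
0.05 & 0.57 & -2.10 & 2.07 & 0.10 & -1.61 & 0.25 & 0.43 & -0.03 & 0.27 & -0.14 & 0.18 & -0.05
\end{pmatrix}. \]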
The matrix A therefore contains at the position $a_{ij}$ the i-th normalized feature of the j-th
subject. Let us multiply the matrix A from the right by its transpose $A^T$, and let $B = AA^T$. At
the position $b_{ij}$ of the result matrix we now have:
\[ b_{ij} = \sum_{k=1}^{n} a_{ik}a_{jk}. \]
See (7.5) and note that because of the transpose in the second factor of the product, the
indices are just swapped.
Since the means of the normalized features are 0, this is exactly the sample covari-
ance of the samples i and j, except for the factor 1/(n − 1), see Definition 22.2. We thus
obtain a matrix containing all covariances:
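\[ C = \begin{pmatrix}
\mathrm{Cov}(S_1, S_1) & \mathrm{Cov}(S_1, S_2) & \mathrm{Cov}(S_1, S_3) \\
\mathrm{Cov}(S_2, S_1) & \mathrm{Cov}(S_2, S_2) & \mathrm{Cov}(S_2, S_3) \\
\mathrm{Cov}(S_3, S_1) & \mathrm{Cov}(S_3, S_2) & \mathrm{Cov}(S_3, S_3)
\end{pmatrix} = \frac{1}{n-1}AA^T \]
In the concrete example,
\[ C = \begin{pmatrix}
1.00 & 0.89 & -0.27 \\
0.89 & 1.00 & 0.19 \\
-0.27 & 0.19 & 1.00
\end{pmatrix}. \]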
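The normalization and the covariance matrix can be reproduced with plain Python. The following is a minimal sketch under one loudly stated assumption: the BMI is recomputed here as weight / height² (height in metres) instead of being typed in from the table. The helper name normalize is chosen for illustration.

```python
from statistics import mean, stdev

height = [161, 181, 174, 170, 165, 191, 184, 162, 183, 171, 159, 166, 193]  # cm
weight = [56, 73, 57, 70, 59, 71, 74, 58, 72, 64, 54, 60, 80]               # kg
# Assumption of this sketch: BMI recomputed from height and weight
bmi = [w / (h / 100) ** 2 for h, w in zip(height, weight)]


def normalize(xs):
    """Subtract the sample mean and divide by the sample standard deviation."""
    m, s = mean(xs), stdev(xs)  # stdev divides by n - 1
    return [(x - m) / s for x in xs]


# rows = features, columns = subjects
A = [normalize(row) for row in (height, weight, bmi)]
n = len(height)

# C = 1/(n-1) * A * A^T, written out entry by entry
C = [[sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in A] for ri in A]

for row in C:
    print([round(c, 2) for c in row])
```

Up to rounding, the printed entries should agree with the matrix C given above; the diagonal is 1 because every row was divided by its own standard deviation.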
In the diagonal are the sample variances themselves; after normalization these are of
course 1. Now we want to perform a change of basis in $\mathbb{R}^3$ which makes the covariances
as small as possible. Theorem 10.14 from eigenvalue theory helps us here: C is a
real symmetric matrix, so it has an orthonormal basis of eigenvectors. The eigenvalues of
C are 1.90, 1.10 and 0.001. I sorted them by size. The transition matrix T, which changes
the standard basis into this basis of eigenvectors, has in its columns exactly the corresponding
eigenvectors, see Sect. 9.3. It is:
\[ T = \begin{pmatrix}
-0.71 & -0.18 & 0.68 \\
-0.70 & 0.27 & -0.66 \\
0.06 & 0.95 & 0.31
\end{pmatrix} \]
I normalized the eigenvectors to length 1; T is therefore an orthogonal matrix. We now
carry out this change of basis in our feature space. The new coordinates of the samples
are obtained according to Theorem 9.21 if we multiply the old coordinates from
the left with $T^{-1}$. The samples form the rows of the matrix A, so each column holds the
coordinates of one subject, and the matrix of new features results as $A' = T^{-1}A$. Because of
the orthogonality of T, the inverse matrix is, according to Theorem 10.11, just the transposed
matrix, so it holds that $A' = T^{T}A$. In the example:
\[ A' = \begin{pmatrix}
1.5 & -1.0 & 0.56 & -0.02 & 1.1 & -1.6 & -1.4 & 1.4 & -1.1 & 0.28 & 1.9 & 0.94 & -2.5 \\
-0.05 & 0.69 & -2.2 & 2.2 & 0.04 & -1.6 & 0.34 & 0.38 & 0.04 & 0.25 & -0.25 & 0.12 & 0.12 \\
-0.015 & -0.011 & 0.002 & 0.043 & 0.001 & 0.055 & -0.016 & 0.006 & -0.003 & 0.013 & -0.035 & 0.006 & -0.051
\end{pmatrix}. \]
What are the covariances of the new features now? Just like before, we have to multiply
the new feature matrix from the right with its transpose:
\[ C' = \frac{1}{n-1}A'A'^{T} = \frac{1}{n-1}(T^{-1}A)(T^{T}A)^{T} = \frac{1}{n-1}T^{-1}AA^{T}T = T^{-1}CT \]
$T^{-1}CT$ is just the matrix of the linear mapping C in the new basis of eigenvectors. With
respect to this basis, C has according to Theorem 9.25 the diagonal matrix D, which contains
the eigenvalues in the diagonal:
\[ C' = \begin{pmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{pmatrix} \]
Great magic: The covariances are all 0 now, in the diagonal are the variances of the new
features, and these are just the calculated eigenvalues of C. By this coordinate transfor-
mation, we have therefore achieved our goal! In the concrete example,
\[ C' = \begin{pmatrix}
1.90 & 0 & 0 \\
0 & 1.10 & 0 \\
0 & 0 & 0.001
\end{pmatrix}. \]
You can see that $\lambda_3$ is almost 0. The variance of the third feature is very small, and therefore
the values of the third row in the transformed matrix A′ are also very small. This is
an irrelevant feature that we can ignore. We can restrict ourselves to the first two principal
components.
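The diagonalization can be checked with the rounded matrices C and T printed above. The sketch below only multiplies matrices; the helper names matmul and transpose are chosen for illustration. Because the entries are rounded to two decimals, the result is diagonal only up to roughly 0.01.

```python
C = [[1.00, 0.89, -0.27],
     [0.89, 1.00, 0.19],
     [-0.27, 0.19, 1.00]]

# columns of T are the normalized eigenvectors, as printed in the text
T = [[-0.71, -0.18, 0.68],
     [-0.70, 0.27, -0.66],
     [0.06, 0.95, 0.31]]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


# T is orthogonal, so T^-1 = T^T and C' = T^T C T
C_new = matmul(matmul(transpose(T), C), T)

for row in C_new:
    print([round(v, 2) for v in row])
```

The diagonal entries approximate the eigenvalues, and the off-diagonal entries, the covariances of the new features, vanish within rounding accuracy.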
So the method has recognized that in the raw data the BMI is dependent on the other two
features.
If we ignore the third row in A′, we can draw a two-dimensional image of the new fea-
tures. Compared to the original data, this is just a rotation of the normalized first two
components of the raw data, see Fig. 22.1.
How do we interpret the two components? The first principal component (the x-axis)
shows the rather tall and heavy subjects on the left and the short and light ones on the right.
Height and weight are not independent, but overweight is not correlated with height. In the
second principal component, one actually finds deviations from the normal weight. The
overweight people are at the top, the underweight ones at the bottom.
[Fig. 22.1: the 13 subjects plotted by their first two principal components]
I would like to summarize the procedure for the principal component analysis:
1. Set up a matrix M of the features: The columns are assigned to the objects to be
examined, the rows contain the different features of the objects.
2. Normalize the features: The mean of each row of the matrix M, that is, of each sample,
is subtracted from that row. If the variances of the rows differ greatly from each other,
each row is divided by its sample standard deviation. The matrix A results.
3. Calculate the matrix $C = \frac{1}{n-1}AA^T$. This contains all sample covariances of the
examined features.
4. Calculate eigenvalues and eigenvectors of the covariance matrix. The eigenvalues are
sorted by size and the eigenvectors are normalized to length 1. The orthogonal transition
matrix T is the matrix that contains the normalized and sorted eigenvectors in its
columns.
5. Perform the transition T with the matrix A. The matrix of new features $A' = T^{T}A$
results.
6. Based on the size of the eigenvalues, it is decided how many principal components
are to be examined. The principal components are the first rows of A′.
7. Now the principal components have to be interpreted.
I would like to present another example with several features, not quite as detailed. The
raw data consists of six final grades of ten students (the best grade is 1):
           Alex  Bianca  Claudia  Daniel  Eva  Felix  Georg  Hanna  Ina  Kai
German        1       3        1       2    2      4      2      2    2    3
History       2       1        2       1    2      3      1      1    3    3
English       2       3        1       1    1      3      2      4    4    3
French        2       4        1       2    2      3      1      3    4    4
Math          4       2        3       1    3      1      4      1    2    4
Physics       4       2        2       1    4      1      3      2    1    3
First, the six samples are normalized again by subtracting the sample means. The sample
variances do not differ greatly, so this time I will forego the division by the sample stand-
ard deviation. We obtain the feature matrix:
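\[ A = \begin{pmatrix}
-1.2 & 0.8 & -1.2 & -0.2 & -0.2 & 1.8 & -0.2 & -0.2 & -0.2 & 0.8 \\
0.1 & -0.9 & 0.1 & -0.9 & 0.1 & 1.1 & -0.9 & -0.9 & 1.1 & 1.1 \\
-0.4 & 0.6 & -1.4 & -1.4 & -1.4 & 0.6 & -0.4 & 1.6 & 1.6 & 0.6 \\
-0.6 & 1.4 & -1.6 & -0.6 & -0.6 & 0.4 & -1.6 & 0.4 & 1.4 & 1.4 \\
1.5 & -0.5 & 0.5 & -1.5 & 0.5 & -1.5 & 1.5 & -1.5 & -0.5 & 1.5 \\
1.7 & -0.3 & -0.3 & -1.3 & 1.7 & -1.3 & 0.7 & -0.3 & -1.3 & 0.7
\end{pmatrix} \]
From this we can calculate the sample covariance matrix:
\[ C = (\mathrm{Cov}(S_i, S_j)) = \frac{1}{n-1}AA^T \]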
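The summarized procedure can be sketched for the grade data in plain Python. This is a minimal illustration, not the computation used for the figures in the text: the eigenvalues are found with the classical Jacobi rotation method (a technique chosen here because it needs no external library), and all function names are assumptions. The sign of each eigenvector is arbitrary, so a plot of the result may be mirrored.

```python
import math

grades = {
    "German":  [1, 3, 1, 2, 2, 4, 2, 2, 2, 3],
    "History": [2, 1, 2, 1, 2, 3, 1, 1, 3, 3],
    "English": [2, 3, 1, 1, 1, 3, 2, 4, 4, 3],
    "French":  [2, 4, 1, 2, 2, 3, 1, 3, 4, 4],
    "Math":    [4, 2, 3, 1, 3, 1, 4, 1, 2, 4],
    "Physics": [4, 2, 2, 1, 4, 1, 3, 2, 1, 3],
}


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def jacobi_eigen(S, sweeps=100, tol=1e-20):
    """Eigenvalues and orthonormal eigenvectors (columns of V) of a symmetric matrix."""
    n = len(S)
    A = [row[:] for row in S]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # rotation angle that makes the entry A[p][q] zero
                if A[q][q] == A[p][p]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    A[k][p], A[k][q] = c * A[k][p] - s * A[k][q], s * A[k][p] + c * A[k][q]
                    V[k][p], V[k][q] = c * V[k][p] - s * V[k][q], s * V[k][p] + c * V[k][q]
                for k in range(n):  # rotate rows p and q of A
                    A[p][k], A[q][k] = c * A[p][k] - s * A[q][k], s * A[p][k] + c * A[q][k]
    return [A[i][i] for i in range(n)], V


# Steps 1-2: feature matrix, each row centered (no division by the standard deviation)
M = list(grades.values())
A = [[x - sum(row) / len(row) for x in row] for row in M]
n = len(A[0])

# Step 3: sample covariance matrix
C = [[v / (n - 1) for v in row] for row in matmul(A, transpose(A))]

# Step 4: eigenvalues sorted by size, eigenvectors as columns of T
eigvals, V = jacobi_eigen(C)
order = sorted(range(len(eigvals)), key=lambda i: -eigvals[i])
eigvals = [eigvals[i] for i in order]
T = [[V[r][i] for i in order] for r in range(len(V))]

# Step 5: new features; Step 6: the first rows are the principal components
A_new = matmul(transpose(T), A)
pc1, pc2 = A_new[0], A_new[1]

print([round(l, 3) for l in eigvals])
```

The share of the total variance carried by the first two eigenvalues, `sum(eigvals[:2]) / sum(eigvals)`, gives a rough indication of how much information a two-dimensional plot retains.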
We can now plot the two principal components, that is, the first two rows of the matrix A′,
in $\mathbb{R}^2$. In Fig. 22.2 we assign the points to the ten people again.
[Fig. 22.2: the ten students plotted by their first two principal components]
For the interpretation of the principal components: The further to the right a point lies,
the greater the inclination for mathematical and natural science subjects, on the left are
the people with a tendency for languages. The further down a point is located, the greater
the performance of the graduate, at the top are placed the rather weaker students.
One could derive this interpretation more or less well from the view of the raw data.
The real power of principal component analysis only shows itself with data
sets with very many features. It is used, for example, in pattern recognition, in cancer
diagnosis, in material or food analysis and in many other fields in which data sets with
many features have to be analyzed.
Mathematical algorithms in themselves are neither good nor evil. But there are also very
controversial applications in the use of methods. For example, facial recognition is used in
China for comprehensive population surveillance. In Germany, Schufa checks the creditwor-
thiness of applicants on the basis of secret criteria. There are well-founded suspicions that
this assessment also uses the name and address. In the USA, software is used to determine
from over 100 features whether a prisoner can be released on parole or not. People of color
are systematically disadvantaged here. If a decision about admission to a degree program
is made on the basis of a grade analysis similar to the one I carried out above, this is also
highly questionable. Be aware of your responsibility as an IT specialist! The mathematician
Hannah Fry has written an interesting book on this topic: Hello World: Being Human in the Age
of Algorithms.
With the help of parameter estimation, it is possible, for example, to give an estimate
for the proportion p of voters of a party in an opinion poll. This estimate is the relative
frequency of voters in the sample. We now know that the expected value of this estimate
results in the exact right probability and that the variance of the estimate decreases with
the size of the sample. But in practice we will never hit p exactly, and we cannot say how
far we are off. To get out of this dilemma, we use a trick: We do not estimate a single
number, but a whole interval $[p_u, p_o]$, which we assume contains the true parameter.
If we can then give a probability that p is in $[p_u, p_o]$, we are satisfied.
Such an interval is called a confidence interval. We find it by evaluating estimators for
the endpoints of the interval:
If (X1 , X2 , . . . , Xn ) is a random vector and p is the parameter to be estimated, we now
need two estimators Fu = fu (X1 , X2 , . . . , Xn ) and Fo = fo (X1 , X2 , . . . , Xn ) instead of the
estimator F = f (X1 , X2 , . . . , Xn ) for p. We try to determine Fu and Fo so that the prob-
ability p(Fu ≤ p ≤ Fo ) assumes a certain value γ . When realizing the random variables
Fu and Fo, we get two values pu and po and can then say that with probability γ the value
p is between pu and po. Here the unknown parameter p is fixed. The confidence interval
is variable, it depends on the concrete sample.
1. Let $X^* = \frac{X - np}{\sqrt{np(1-p)}}$ be the standardization of X. We determine a number c with the
property $p(-c \le X^* \le c) = \gamma$.
2. If p is the unknown parameter to be estimated, we search for functions fu (X), fo (X)
with the property
fu (X) ≤ p ≤ fo (X) ⇔ −c ≤ X ∗ ≤ c.
3. If this is successful, then p(fu (X) ≤ p ≤ fo (X)) = p(−c ≤ X ∗ ≤ c) = γ and thus
Fu = fu (X) and Fo = fo (X) are the sought estimators.
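Point 2: It is
\[ \begin{aligned}
-c \le X^* \le c &\iff -c \le \frac{X - np}{\sqrt{np(1-p)}} \le c \\
&\iff \frac{(X - np)^2}{np(1-p)} \le c^2 \\
&\iff (X - np)^2 \le c^2 np(1-p) \\
&\iff (nc^2 + n^2)p^2 - (2nX + nc^2)p + X^2 \le 0.
\end{aligned} \]
The left-hand side is a quadratic function f(p) whose graph is a parabola opening upward;
let p1 ≤ p2 be its two zeros, which depend on the outcome ω through X.
Now we know: p1(ω) ≤ p ≤ p2(ω) if and only if f(p) ≤ 0 and thus $-c \le X^* \le c$. This
solves point 2 of the program; p1 and p2 are the estimators we are looking for: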
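\[ f_u(X) := p_1 = \frac{1}{c^2 + n}\left(X + \frac{c^2}{2} - c\sqrt{\frac{X(n - X)}{n} + \frac{c^2}{4}}\right), \]
\[ f_o(X) := p_2 = \frac{1}{c^2 + n}\left(X + \frac{c^2}{2} + c\sqrt{\frac{X(n - X)}{n} + \frac{c^2}{4}}\right). \]
If we take a sample, we receive a concrete value k for X and obtain a confidence interval
as a realization of $f_u(X)$ and $f_o(X)$:
\[ p_{u/o} = \frac{1}{c^2 + n}\left(k + \frac{c^2}{2} \pm c\sqrt{\frac{k(n - k)}{n} + \frac{c^2}{4}}\right),
   \quad\text{with } c = \Phi^{-1}\left(\frac{1 + \gamma}{2}\right). \qquad (22.2) \]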
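Formula (22.2) translates directly into code. The following is a minimal sketch; the function name is an assumption, and the quantile Φ⁻¹ is taken from Python's standard library class NormalDist. The values k = 400, n = 1000 and γ = 0.95 serve as an illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(k, n, gamma):
    """Confidence interval (22.2) for p after observing k successes in n trials."""
    c = NormalDist().inv_cdf((1 + gamma) / 2)  # c = Phi^-1((1 + gamma) / 2)
    center = k + c * c / 2
    spread = c * sqrt(k * (n - k) / n + c * c / 4)
    return (center - spread) / (c * c + n), (center + spread) / (c * c + n)


p_u, p_o = confidence_interval(k=400, n=1000, gamma=0.95)
print(round(p_u, 4), round(p_o, 4))
```

Note that the interval is not symmetric around k/n = 0.4: the terms containing c² shift it slightly towards 1/2.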
Now we can finally deal with the example of the election forecast from Section 19.1:
Example
If you look at these results, you can assess the quality of the so-called “election day ques-
tion”, which in Germany is answered monthly in the media and from which we are to infer
how the federal election would turn out, if elections were held today. As a rule, a survey of
around 1000 voters is conducted for this purpose, who must also form a representative
cross-section of the population, i.e. age groups, education, regions and so on. You can
see that with a confidence level of 95%, deviations of ±3% are quite possible for each party.
Judge for yourself the informative value of this statistic. Larger samples are usually only
taken on election day itself, in the exit poll or by evaluating partial results of the election.
If in (22.2) the numbers k, n and n − k are large, then in comparison the summands
containing the number $c^2$ are negligibly small. We drop them, and after this
really hard work we get the very simple to use rule:
\[ p_{u/o} \approx \frac{k}{n} \pm \frac{c}{n}\sqrt{\frac{k(n - k)}{n}}. \]
The width of the interval shrinks with increasing n, the prediction will therefore always
be better if the sample is larger. If you increase the confidence level γ at a fixed sample
size, c will also increase, making the interval wider. The higher certainty can only be
bought with larger allowed deviations.
According to rule 21.18, the approximation of the binomial distribution by the
normal distribution requires np > 5 and n(1 − p) > 5. We cannot check this con-
dition because p is unknown. But it should at least be fulfilled for the estimate of p.
This is the case: If k and n − k are greater than 30, then n · (k/n) = k > 30 > 5 and
n(1 − k/n) = n − k > 30 > 5.
What error do we make by canceling the summands with c2? Let’s calculate the exam-
ple of the election poll again with rule 22.11: From
\[ p_{u/o} = \frac{400}{1000} \pm \frac{1.96}{1000} \cdot \sqrt{\frac{400(1000 - 400)}{1000}} \]
we get [pu , po ] = [0.3696, 0.4304], with the precise calculation (22.3) we get
[pu , po ] = [0.3701, 0.4307], an error we can live with.
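The simplified calculation is short enough to try out directly. In this minimal sketch the function name is an assumption; c = 1.96 is the value used in the calculation above. The second call illustrates how the width of the interval reacts to a four times larger sample with the same relative frequency.

```python
from math import sqrt


def simple_interval(k, n, c):
    """Approximate interval k/n +- (c/n) * sqrt(k(n - k)/n), without the c^2 terms."""
    half_width = c / n * sqrt(k * (n - k) / n)
    return k / n - half_width, k / n + half_width


p_u, p_o = simple_interval(400, 1000, 1.96)
print(round(p_u, 4), round(p_o, 4))

# four times as many respondents, same relative frequency 0.4
q_u, q_o = simple_interval(1600, 4000, 1.96)
print(round(q_o - q_u, 4), round(p_o - p_u, 4))
```

Quadrupling the sample size only halves the width of the interval, because the width is proportional to 1/√n.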
Example
Let’s try a numerical integration with a Monte Carlo method. For this, please also
take another look at Buffon’s needle problem in Sect. 19.1 on Monte Carlo methods.
Now I want to calculate $\Phi(1)$. It is
\[ \Phi(1) = 0.5 + \int_0^1 \varphi(x)\,dx \quad\text{with}\quad \varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}. \]
It is therefore sufficient to determine the integral over ϕ(x) from 0 to 1. In numeri-
cal integration, the function ϕ(x) is enclosed in a rectangle, here in the rectangle
[0, 1] × [0, 1/2] with the area R = 0.5 (Fig. 22.3). The Bernoulli experiment that we
carry out is “choose a random point (x, y) in the rectangle”. The event A with the
unknown probability p is: “y < ϕ(x)”, because exactly then the point lies in the inte-
gral area. If F is the integral, then p = F/R.
We carry out the experiment n times. If the event occurs k times, then p̃ = k/n is
an estimate for p and thus R · (k/n) is an estimate for F .
How often must the experiment be carried out in order to determine with 99.9%
confidence the integral with an error of less than $10^{-3}$?
For γ = 0.999 we get c = 3.29. The area R is 0.5. If F̃ is the estimate for F,
then it should be $|\tilde{F} - F| = |0.5 \cdot \tilde{p} - 0.5 \cdot p| < 10^{-3}$, that is $|\tilde{p} - p| < 2 \cdot 10^{-3}$.
The deviation from p is allowed in both directions, so we can specify the width of
the confidence interval with $B = 2 \cdot 2 \cdot 10^{-3}$. By rule 22.11 the width of the interval is
$2c\sqrt{\tilde{p}(1 - \tilde{p})/n} \le c/\sqrt{n}$, so the required sample size is
$c^2/B^2 = 676\,506$. The estimate for F with this number of trials gave me 0.341235,
that is $\Phi(1) = 0.841235$. From the table in the appendix we can read the result
0.8413. ◄
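The experiment of this example can be sketched as follows. The function names, the seed and the number of trials are assumptions chosen for illustration; with fewer trials than calculated above, the result is correspondingly less accurate.

```python
import math
import random


def phi(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def monte_carlo_integral(n, seed=0):
    """Estimate the integral of phi over [0, 1] by random points in [0, 1] x [0, 1/2]."""
    rng = random.Random(seed)   # fixed seed only to make the run repeatable
    hits = 0
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        y = rng.uniform(0.0, 0.5)
        if y < phi(x):          # event A: the point lies below the curve
            hits += 1
    return 0.5 * hits / n       # R * (k / n) with R = 0.5


estimate = monte_carlo_integral(200_000)
print(0.5 + estimate)           # estimate for Phi(1)
```

Different seeds give slightly different estimates, which is exactly the sample dependence that the confidence interval quantifies.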
You can see that this method requires very many trials to achieve reasonably good
results; it shows a very poor convergence behavior. With this you can hardly generate
the table of the normal distribution from the appendix. For simple integrals, the method
is practically not usable. But it shows its strengths in multi-dimensional integrals, which
in comparison to the one-dimensional case become much more complex for the usual
numerical methods, while the effort of the Monte Carlo integration remains almost
unchanged.
[Fig. 22.3: a random point (x, y) in the rectangle and the area F under ϕ(x)]
Parameter Testing
Let’s test the coin of a player. We observe the event A = “coin shows head”. p0 = p(A)
is unknown, but we take the hypothesis p0 = 1/2. Now we carry out the experiment
100 times. Let Xi be the random variable that describes the outcome of the i -th trial, and
X = X1 + X2 + · · · + X100. We know that R = X/100 is an estimator for the parameter
p0. How far may the estimate p, obtained by a sampling, deviate from p0 = 1/2, without
shaking our confidence in the hypothesis?
First let’s try an intuitive approach: We want to allow a deviation δ of ±0.1 of the estimate p from
p0 = 1/2; otherwise we reject the hypothesis. Of course, a sample can also deviate more
even though the hypothesis is correct. Then we commit an error by rejecting the hypoth-
esis. How often does that happen?
R is a N(p0 , p0 (1 − p0 )/n) = N(0.5, 0.0025)-distributed estimator for p0 according to
Theorem 22.8. We are looking for the probability p(0.4 < R < 0.6):
\[ p(0.4 \le R \le 0.6) = \Phi\left(\frac{0.6 - 0.5}{\sqrt{0.0025}}\right) - \Phi\left(\frac{0.4 - 0.5}{\sqrt{0.0025}}\right)
   = \Phi(2) - \Phi(-2) = 2 \cdot \Phi(2) - 1 = 0.9546. \]
So $p(R \notin [0.4, 0.6]) = 1 - 0.9546 = 0.0454$. The probability for a larger deviation than
δ = 0.1 is therefore about 4.5 %. If p0 is correct and such a deviation occurs, we reject
the hypothesis and commit an error. The probability of error α is 4.5 %. The areas under
the bell curve to the left of p0 − δ and to the right of p0 + δ are α/2 each, see Fig. 22.4.
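The probability of error of this intuitive test can be recomputed without a table. The sketch below uses Python's standard library class NormalDist; the variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

n, p0, delta = 100, 0.5, 0.1

# distribution of the relative frequency R under the hypothesis p = p0
R = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))

inside = R.cdf(p0 + delta) - R.cdf(p0 - delta)   # p(0.4 <= R <= 0.6)
alpha = 1 - inside                               # probability of a type I error

print(round(inside, 4), round(alpha, 4))
```

The last digit can differ from the value in the text, which was read from a table of Φ rounded to four decimals.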
The problem is usually stated the other way around: Given is a probability of error
α and an estimator T for the parameter p. In test theory, this random variable T is often
called the test statistic. For the parameter p, we make a hypothesis. This hypothesis is
called the null hypothesis and is denoted by H0. The alternative to this hypothesis is
called H1. The probability of error α is usually small, for example 1% or 5%. The num-
ber 1 − α is called the significance level. Depending on α, a region of rejection for H0 is
determined.
Now a sample is taken and thus an estimate value p for p0 is determined. If p is in the
region of rejection, the hypothesis H0 is rejected. Then the alternative hypothesis H1 is
automatically accepted.
Errors can be made here:
• The type I error is the error that the hypothesis is rejected although it is correct.
• The type II error is the error that the hypothesis is not rejected although it is false.
\[ p(T < p_0 - \delta_1) = \frac{\alpha}{2} \quad\text{and}\quad p(T > p_0 + \delta_2) = \frac{\alpha}{2}. \]
The distribution does not have to be symmetrical in general, but there are two equally
large regions of rejection to the right and left of p0, see Fig. 22.5.
Let’s carry out this test in the specific case of a Bernoulli process. The probability
p of the event A is unknown, we set up the hypothesis H0: “ p = p0”. The random vari-
able R, which describes the relative frequency of the event A, is the test statistic for the
parameter p. It is N(p0 , p0 (1 − p0 )/n)-distributed under the assumption of H0. R is sym-
metrical, so here the two rejection regions are equally far from p0, as shown in Fig. 22.4.
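So we are looking for the number δ for which $p(p_0 - \delta \le R \le p_0 + \delta) = 1 - \alpha$.
δ can be calculated using Theorem 21.19:
\[ 1 - \alpha = p(p_0 - \delta \le R \le p_0 + \delta) = 2\,\Phi\left(\frac{\delta}{\sqrt{p_0(1 - p_0)/n}}\right) - 1, \qquad (22.5) \]
that is
\[ 1 - \frac{\alpha}{2} = \Phi\left(\frac{\delta}{\sqrt{p_0(1 - p_0)/n}}\right)
   \;\Rightarrow\; \delta = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\sqrt{p_0(1 - p_0)/n}. \]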
We can write the result in a calculation rule:
Example
It is very important to note that if the hypothesis is not rejected, it cannot be accepted!
Let’s take the coin toss example again: Suppose that in 100 tosses the coin shows heads 57
times. Neither with a probability of error of 1% nor with a probability of error of 10%
is the result in contradiction to the hypothesis p0 = 1/2. However, the hypothesis cannot
be accepted either, because it would be quite possible that the parameter, for example, is
0.57, 0.58 or 0.55. We can determine a confidence interval: With rule 22.11 we get that
with a confidence level of 90%, the true parameter lies between 0.49 and 0.65. Unfortu-
nately, this statement does not help us much in our question.
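The coin example can be recalculated in a few lines. This is a minimal sketch: the function names are assumptions, the threshold follows the formula for δ derived above for the two-sided test, and the interval is the simple approximation k/n ± (c/n)·√(k(n − k)/n).

```python
from math import sqrt
from statistics import NormalDist


def two_sided_delta(p0, n, alpha):
    """Threshold delta of the two-sided test for the hypothesis p = p0."""
    return NormalDist().inv_cdf(1 - alpha / 2) * sqrt(p0 * (1 - p0) / n)


def simple_interval(k, n, gamma):
    """Approximate confidence interval k/n +- (c/n) * sqrt(k(n - k)/n)."""
    c = NormalDist().inv_cdf((1 + gamma) / 2)
    half_width = c / n * sqrt(k * (n - k) / n)
    return k / n - half_width, k / n + half_width


n, k, p0 = 100, 57, 0.5
rejected = {}
for alpha in (0.01, 0.10):
    delta = two_sided_delta(p0, n, alpha)
    rejected[alpha] = abs(k / n - p0) > delta
    print(alpha, round(delta, 3), rejected[alpha])

p_u, p_o = simple_interval(k, n, gamma=0.90)
print(round(p_u, 2), round(p_o, 2))
```

In both cases the deviation 0.07 stays below the threshold, so the hypothesis is not rejected, while the interval shows how wide the range of plausible parameters still is.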
You can see it again here: It is easier to be destructive than constructive. Hypotheses
can be rejected more easily than accepted. If you want to substantiate a hypothesis, you
should therefore formulate the test in such a way that the exact opposite is assumed as
the null hypothesis.
This is usually not possible with a two-sided hypothesis test, but with a one-sided
hypothesis test the situation looks different: The hypothesis H0 here is p ≤ p0 or p ≥ p0,
we can choose which of the two statements we want to refute, and again determine the
Type I error. A complication compared to the two-sided hypothesis test is that the exact
distribution of the test statistic T is not known, it depends on the true value p of the
parameter, but we only know p ≤ p0 or p ≥ p0.
Let’s calculate an example for the case H0 = “ p ≤ p0”. The alternative hypothesis H1
is then “ p > p0”.
At least in continuous distributions we can assume the value 0 for the probability p = p0, so
that there is no difference between p > p0 and p ≥ p0.
This is exactly the test we can use to confirm Murphy’s Law (see Sect. 19.1 “Testing a
Hypothesis”): We conduct a Bernoulli process and observe the event B = “Toast falls
on the buttered side”. p = p(B) is unknown, but I would like to confirm the hypothesis
“ p > 1/2”. So we take the opposite as the null hypothesis: H0 = “ p ≤ 1/2”. If we can
reject H0, Murphy is right.
If Xi is the result of the i -th trial, then R = 1/n · (X1 + X2 + . . . + Xn ) is again the test
statistic for the probability p after n trials. R is normally distributed with expected value
p, but in contrast to the two-sided case, this value p is unknown and thus the correct nor-
mal distribution is unknown. How should we determine the region of rejection?
We will try to keep it simple at first with p = p0 = 1/2 as in the two-sided case: Then
R is distributed as before N(p0 , p0 (1 − p0 )/n) and for the given probability of error α we
can determine the threshold value δ with the property
p(R > p0 + δ|p0 ) = α (22.6)
(Figure 22.6). In the one-sided test we only need a one-sided region of rejection, because
we have nothing against small values of p, they do not contradict the hypothesis. I write
p(A|p0 ) here to indicate that the probability of the event A has been calculated under the
assumption p = p0.
But what if the true parameter p1 is less than p0? Then R is a N(p1 , p1 (1 − p1 )/n)-dis-
tributed random variable. But since p1 < p0, R has a smaller expected value compared to
the first case: R|p1 is shifted to the left compared to R|p0, see Fig. 22.7.
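You can calculate that
\[ p(R > p_0 + \delta \mid p_1) \le p(R > p_0 + \delta \mid p_0) = \alpha, \qquad (22.7) \]
so the probability of a type I error is at most α for every p ≤ p0. From (22.6) we obtain
$1 - \alpha = \Phi\left(\delta / \sqrt{p_0(1 - p_0)/n}\right)$ and thus
\[ \delta = \Phi^{-1}(1 - \alpha)\sqrt{p_0(1 - p_0)/n}. \]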
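The threshold of the one-sided test is again a one-liner. In the sketch below the function names and the values n = 100 and p0 = 1/2 are assumptions for illustration; the quantile is computed exactly, so the results differ slightly from values obtained with a rounded table entry for c.

```python
from math import floor, sqrt
from statistics import NormalDist


def one_sided_delta(p0, n, alpha):
    """Threshold delta with p(R > p0 + delta | p0) = alpha."""
    return NormalDist().inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)


def smallest_rejecting_count(p0, n, alpha):
    """Smallest number k of occurrences with k/n > p0 + delta."""
    return floor(n * (p0 + one_sided_delta(p0, n, alpha))) + 1


n, p0 = 100, 0.5
for alpha in (0.05, 0.01):
    print(alpha, round(one_sided_delta(p0, n, alpha), 4),
          smallest_rejecting_count(p0, n, alpha))
```

A smaller probability of error pushes the threshold to the right, so more occurrences are needed before the null hypothesis p ≤ p0 can be rejected.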
Before we carry out the example with concrete numerical values, we can again formulate
a calculation rule for this test in the Bernoulli process:
(22.7) states that the function ϕ(p) = p(R > c|p) is monotonically increasing. That’s why our
calculation worked. ϕ(p) = p(T > c|p) has this property for many test statistics T , in these
cases the threshold value δ can be determined similarly in a one-sided test as in the example
of the Bernoulli process.
Examples
1. We want to confirm the hypothesis that Murphy’s Law applies, that is, that
p(B) = p(butter on the floor) > 1/2 is true. To do this, we have to assume the
opposite as the null hypothesis: H0: p(B) ≤ 1/2. Now let’s sacrifice some slices
of toast and a big piece of butter and let 100 pieces of buttered toast fall on the
floor. In this test, the butter side lands on the floor 61 times.
First, the probability of error α = 5 % is given. Then c = 1.64 and
$\delta = 1.64 \cdot \sqrt{0.25/100} = 0.082$. For k > 50 + 8.2 = 58.2, the hypothesis can be
rejected with this probability of error. This is the case here. With a probability of
error of no more than 5%, Murphy is right.
Can we increase our significance level? Let’s take α = 1 %: Now c = 2.33 and
δ = 0.1165, so the region of rejection is [62, 100]. With 61 hits, the result does not
contradict the null hypothesis.
Let’s look at the type II error in this case: The hypothesis is not rejected, although
it is false. With the normally distributed test statistic R and p = p(B) > 1/2, we
have to calculate the probability that R has a value in the region of non-rejection,
that is
p(R < 1/2 + δ|p),
where p can take all values greater than 1/2. This is a similar problem to the one in the
derivation of Rule 22.14: here, too, the limiting case p = 1/2 is the worst case, as
you can see for yourself in a small drawing. So let’s calculate p(R < 1/2 + δ|1/2).
Since R is continuous, we have
p(R < 1/2 + δ|1/2) = 1 − p(R > 1/2 + δ|1/2) = 1 − α,
because R > 1/2 + δ is just the region of rejection for α! This gives us the sad
result that the type II error can be arbitrarily close to 1 − α = 99 %.
Calculate the confidence interval for p yourself at the confidence level 99%
(Rule 22.11): With a probability greater than 99%, p is between 0.48 and 0.74.
2. The manufacturer of a biometric authentication tool specifies the following error
rates in its sales prospectus: The FRR, the False Rejection Rate, is less than 0.01,
the FAR, the False Acceptance Rate, is less than 0.0001. With a probability of less
than 0.01, the authentication of a person is therefore wrongly rejected, with a prob-
ability of less than 0.0001 a false authentication is performed. In a test center, the
product is tested with 10 000 authentication attempts. In doing so, 81 authentica-
tion attempts are wrongly rejected, twice a false authentication takes place. How
are the advertising claims of the manufacturer to be assessed?
Let’s look at the wrongly rejected authentications first. For p = 0.01 the expected
value is 100. With 81 actual rejections, it makes sense to check the null hypothesis
H0: “p ≥ 0.01”, in the hope of being able to reject it. We have
$$\delta = c \cdot \sqrt{0.01 \cdot 0.99/10\,000} = c \cdot 9.95 \cdot 10^{-4}, \qquad c = \Phi^{-1}(1 - \alpha).$$
For α = 1 % (c = 2.33) we get $\delta = 2.32 \cdot 10^{-3}$, so the hypothesis can be rejected
for k < 100 − 23.2 = 76.8; the result does not contradict the hypothesis H0.
For α = 5 % (c = 1.64) we get $\delta = 1.63 \cdot 10^{-3}$, the region of rejection is
k < 100 − 16.3 = 83.7. With a probability of error of 5%, H0 can be rejected and
the manufacturer’s statement confirmed.
Now for the false authentications: We suspect $p > 10^{-4}$, and therefore check the
null hypothesis $p \le 10^{-4}$. Now
$$\delta = c \cdot \sqrt{0.0001 \cdot 0.9999/10\,000} = c \cdot 9.9995 \cdot 10^{-5}, \qquad c = \Phi^{-1}(1 - \alpha).$$
For α = 5 % we get $\delta = 1.64 \cdot 10^{-4}$, the region of rejection is k > 1 + 1.64 = 2.64,
no contradiction to the hypothesis. Only with a probability of error of about 16%
would the region of rejection include k = 2, so that the rejection of the hypothesis
would be justified.
If our test had not revealed a single false authentication, we would still not be able
to confirm the manufacturer’s statement. ◄
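The decisions in both examples can be retraced with a few lines of code. This is only a sketch under the normal approximation used in the text; the helper names `critical_offset`, `reject_if_large` and `reject_if_small` are hypothetical, and because the quantiles are computed to full precision instead of being read from the table, the thresholds differ slightly in the last digit from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def critical_offset(p0: float, n: int, alpha: float) -> float:
    """n * delta: distance of the rejection threshold from n * p0, measured in counts."""
    c = NormalDist().inv_cdf(1 - alpha)
    return n * c * sqrt(p0 * (1 - p0) / n)


def reject_if_large(k: int, p0: float, n: int, alpha: float) -> bool:
    """Null hypothesis p <= p0: reject when k > n*p0 + n*delta."""
    return k > n * p0 + critical_offset(p0, n, alpha)


def reject_if_small(k: int, p0: float, n: int, alpha: float) -> bool:
    """Null hypothesis p >= p0: reject when k < n*p0 - n*delta."""
    return k < n * p0 - critical_offset(p0, n, alpha)


# Example 1: butter side down 61 times in 100 trials, H0: p <= 1/2
print(reject_if_large(61, 0.5, 100, 0.05))       # threshold about 58.2
print(reject_if_large(61, 0.5, 100, 0.01))       # threshold about 61.6

# Example 2: 81 wrongly rejected authentications in 10 000 attempts, H0: p >= 0.01
print(reject_if_small(81, 0.01, 10_000, 0.05))   # threshold about 83.6
print(reject_if_small(81, 0.01, 10_000, 0.01))   # threshold about 76.9

# Example 2: 2 false authentications in 10 000 attempts, H0: p <= 0.0001
print(reject_if_large(2, 0.0001, 10_000, 0.05))  # threshold about 2.64
```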
In the last example you can see that it is very difficult to make statements with high sig-
nificance when there are only a few cases. The probability of error is always very high
here. This problem occurs in many studies, especially when the sample size is limited.
For the Japanese government, the necessary sample size is an argument for why several
hundred whales have to be caught for “scientific purposes” every year. Small case numbers
also lead to groups with different interests creating statistics that appear to contradict each
other, but which unfortunately all have the same low predictive value. Think of such contro-
versial issues as the increased incidence of leukemia in the vicinity of nuclear power plants
or in the vicinity of strong radio transmitters.
In many cases, the statistician has to examine data sets in which he already has a suspi-
cion about the distribution present. Very often he will expect a normal distribution, for
example, when many individual characteristics accumulate additively. In such cases,
Pearson’s chi-squared test is carried out: The hypothesis does not contain a parameter,
but the assumption that a certain distribution is present. I would like to present such a
test as an example of this test form.
Let a sample be given as a realization of the random variable X . We have the suspi-
cion that X has a certain distribution and want to test this hypothesis. First, we divide the
image of X into various disjoint sub-ranges I1 , I2 , . . . , Im. For small images, each value
can be its own range, for larger images or for continuous random variables, values are
combined or a series of intervals is formed. If Ak is the event that the value of X lies in Ik,
then we can calculate the probabilities pk = p(Ak ) from the hypothesis about the distri-
bution. The hypothesis H0, which we want to check, is then formulated as follows:
H0 : p(A1 ) = p1 , p(A2 ) = p2 , . . . , p(Am ) = pm
Of course, p1 + p2 + . . . + pm = 1.
Let Xk be the random variable that counts the number of times Ak occurs in n trials. Xk
is $b_{n,p_k}$-distributed and therefore has expected value npk and variance npk (1 − pk ). If we
set
$$Y_k^2 = \frac{(X_k - np_k)^2}{np_k},$$
then $Y_k^2$ represents the squared deviation of Xk from the expected value divided by the
expected value. The sum over all $Y_k^2$,
$$\chi^2 := \sum_{k=1}^{m} Y_k^2 = \sum_{k=1}^{m} \frac{(X_k - np_k)^2}{np_k},$$
is the test statistic for our null hypothesis. It measures the sum of all squared deviations
from the respective expected value, each in relation to the expected value. The exact
distribution of this test statistic is known but difficult to calculate; it is approximately
χ²-distributed with m − 1 degrees of freedom. The approximation is already good if
n · pk > 5 for all k.
Beware of the trap: If you look at the definition of the χ²-distribution at the end of Sect.
21.2 you might be tempted to simply take $\sum_{k=1}^{m} X_k^{*2}$ as a test statistic, the sum of the squared
standardized Xk. Isn’t this χ²-distributed? No, because the Xk are not independent of each
other! If, for example, X1 is large, then X2 to Xm must be correspondingly smaller, since
X1 + X2 + . . . + Xm = n. This is the reason for the number of degrees of freedom of the test
function: m − 1 and not m.
Please read the properties of the chi-squared distribution again at the end of Sect. 21.2. The
expected value of χ² is m − 1, so if the null hypothesis is true, the sample values will be
distributed around m − 1. What does the region of rejection look like? Is it two-sided or
one-sided? The Yk all have the expected value 0, and even if it is not very likely that all Yk
and thus all $Y_k^2$ have a small value at the same time, this does not contradict the hypothesis.
Deviations of the Yk from the expected value are, on the other hand, squared and summed
up in the test statistic. Large deviations result in large values of χ². The region of rejection
is therefore one-sided. We specify a probability of error α and determine the δ with
p(χ² > δ) = α.
The hypothesis can be rejected with a probability of error of α if χ²(ω) > δ holds for a
sample of size n.
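The test statistic itself is easy to compute. Here is a minimal sketch of how it could look; the function names and the coin counts in the usage lines are hypothetical and serve only as an illustration. The second helper checks the rule of thumb n · pk > 5 mentioned above.

```python
def chi_squared_statistic(counts, probabilities):
    """Sum of (x_k - n*p_k)^2 / (n*p_k) over all ranges I_1, ..., I_m."""
    if len(counts) != len(probabilities):
        raise ValueError("need one probability per range")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("the probabilities p_k must sum to 1")
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probabilities))


def approximation_ok(n, probabilities, minimum=5):
    """Rule of thumb from the text: n * p_k > 5 for all k."""
    return all(n * p > minimum for p in probabilities)


# Hypothetical data: a coin shows heads 60 times and tails 40 times.
counts = [60, 40]
probabilities = [0.5, 0.5]
print(approximation_ok(sum(counts), probabilities))
print(chi_squared_statistic(counts, probabilities))  # (10^2 + 10^2) / 50 = 4.0
```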
Let’s take a die as an example. We want to test whether it is fair. The ranges I1 , I2 , . . . , I6
are the possible results 1 to 6 here. The hypothesis says that the results of the die
are uniformly distributed, that is
H0 : p(A1 ) = 1/6, p(A2 ) = 1/6, ..., p(A6 ) = 1/6.
I rolled the dice 100 times and noted the results. They were:
k     1   2   3   4   5   6
xk   17  21  17  18  18   9
This yields the test value
$$\chi^2 = \sum_{k=1}^{6} \left( \frac{x_k - 100/6}{\sqrt{100/6}} \right)^2 \approx 4.88.$$
With 5 degrees of freedom and a probability of error of 10%, the region of rejection is
]9.24, ∞[; for α = 1 % the region of rejection is ]15.09, ∞[. The test value therefore does not
contradict the hypothesis, neither with a probability of error of 10% nor with a probability
of error of 1%.
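If you do not have a table of the χ²-distribution at hand, you can compute the thresholds yourself. The sketch below is one possible way, not the method used for the book: it evaluates the distribution function through the series of the regularized incomplete gamma function and inverts it by bisection. The function names `chi2_cdf` and `chi2_quantile` are my own.

```python
from math import exp, lgamma, log


def chi2_cdf(x: float, df: int) -> float:
    """p(chi^2_df <= x), computed with the series of the regularized incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1.0
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * exp(-z + a * log(z) - lgamma(a + 1))


def chi2_quantile(q: float, df: int) -> float:
    """Smallest delta with p(chi^2_df <= delta) >= q, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < q:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


counts = [17, 21, 17, 18, 18, 9]          # results of the 100 rolls from the text
expected = sum(counts) / 6
test_value = sum((x - expected) ** 2 / expected for x in counts)

print(f"test value: {test_value:.2f}")
for alpha in (0.10, 0.01):
    delta = chi2_quantile(1 - alpha, df=5)
    print(f"alpha = {alpha:.2f}: delta = {delta:.2f}, reject: {test_value > delta}")
```

The computed thresholds agree with the values 9.24 and 15.09 given above, and the test value 4.88 lies well below both.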
More interesting than the die is a random number generator. For example, it can
generate random integers between 0 and $2^{32} − 1$, and these should also be uniformly
distributed. As a division into disjoint ranges, I first chose the 1000 remainders modulo
1000 here. If Ak is the event that the value of X lies in Ik, then pk = 1/1000. To achieve
n · pk > 5, I carried out n = 10 000 trials. The test function is $\chi^2_{999}$-distributed, so
approximately N(999, 1998)-distributed. We have to find the δ with
$$p(\chi^2_{999} \le \delta) = \Phi\!\left( \frac{\delta - 999}{\sqrt{1998}} \right) = 1 - \alpha.$$
For α = 1 % we get δ = 1103, for α = 5 % we get δ = 1072. The test yielded the value
1037.6, so there is no contradiction to the hypothesis at either the 95% or the 99%
significance level.
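The two thresholds follow directly from the normal approximation N(999, 1998). A short sketch, assuming only this approximation; the function name `chi2_threshold_normal` is hypothetical. Since the quantiles are not rounded to two decimals here, the results can differ from the hand calculation by a fraction of a unit.

```python
from math import sqrt
from statistics import NormalDist


def chi2_threshold_normal(df: int, alpha: float) -> float:
    """delta with p(chi^2_df <= delta) = 1 - alpha, using chi^2_df ~ N(df, 2*df)."""
    return NormalDist(mu=df, sigma=sqrt(2 * df)).inv_cdf(1 - alpha)


test_value = 1037.6                       # value reported in the text
for alpha in (0.01, 0.05):
    delta = chi2_threshold_normal(999, alpha)
    print(f"alpha = {alpha:.2f}: delta = {delta:.1f}, reject: {test_value > delta}")
```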
But this is only true for the special choice of Ik as remainders modulo 1000. Are small
or large numbers perhaps preferred? In a second attempt, I chose 1000 equally large
intervals as Ik. Here the test yielded 1017.4, also no contradiction to the hypothesis.
But beware: The fact that the hypothesis “uniform distribution” cannot be rejected
does not mean that the random number generator is really good: It could still show
clustering under other decompositions. Furthermore, not only must the number of values in
the individual ranges be uniformly distributed, the order of appearance must also be
random. For the first edition of this book, I carried out these tests for the random
number generator of a common C++ compiler; the hypothesis of uniform distribution
could not be rejected. But when I evaluated only every second random number, the first
test yielded 10 938. The hypothesis of uniform distribution could thus be rejected with a
probability of error of 0.1%. On closer inspection, it turned out that the random number
generator had the strange habit of always producing even and odd numbers alternately,
quite a disaster. In the meantime, this error has been eliminated.
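The effect is easy to reproduce with a toy generator. The following sketch is not the generator of that compiler; it is a plain linear congruential generator modulo 2^32 whose multiplier and increment I chose for illustration. Because both are odd, even and odd numbers alternate, just as described above. The names `lcg` and `chi2_mod` are hypothetical.

```python
from itertools import islice


def lcg(seed: int, a: int = 1664525, c: int = 1013904223, m: int = 2**32):
    """Toy linear congruential generator x -> (a*x + c) mod m (illustrative parameters)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def chi2_mod(values, classes: int = 1000) -> float:
    """Chi-squared test value for the remainders modulo `classes`."""
    counts = [0] * classes
    for v in values:
        counts[v % classes] += 1
    expected = len(values) / classes
    return sum((x - expected) ** 2 / expected for x in counts)


numbers = list(islice(lcg(seed=12345), 20_000))

# First 10 000 numbers in sequence, then every second number of the 20 000 (again 10 000 values).
print(chi2_mod(numbers[:10_000]))
print(chi2_mod(numbers[::2]))
```

If only every second number is used, all values have the same parity, so 500 of the 1000 remainder classes stay empty. With n = 10 000 the test value then cannot fall below 10 000, far beyond any reasonable threshold.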
If you have to prove results experimentally in a research project and are perhaps
under pressure from the client who wants to see results, you can easily get into the situ-
ation of wanting to sweep results under the table that do not fit into the desired picture.
What does it matter if you repeat the test under the same conditions?
Don’t do that if you should find yourself in that situation: The results of such a test
are worthless! See the last experiment for this:
My computer is patient, I have not only simulated the dice test I described earlier
once, but 10 000 times and written down the test results each time. In Fig. 22.9 you can
see the frequency of the individual test results as a histogram.
You can clearly see the shape of the $\chi^2_5$ distribution. But you can also see that, of
course, very unlikely results will occur at some point: 108 times a test result was greater
than 15. If I only test often enough, almost every desired result will occur at some point,
and thus every hypothesis will eventually be rejected. Therefore, when testing, the sample
size must be determined before the test is carried out, and then the test is carried out and
evaluated exactly once.
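You can repeat this experiment yourself. The sketch below simulates the die test 10 000 times with Python's built-in generator; the seed 2023 and the name `dice_test_value` are arbitrary choices of mine, and your counts will differ somewhat from the 108 reported above.

```python
import random


def dice_test_value(rng: random.Random, rolls: int = 100) -> float:
    """Roll a fair die `rolls` times and return the chi-squared test value."""
    counts = [0] * 6
    for _ in range(rolls):
        counts[rng.randrange(6)] += 1
    expected = rolls / 6
    return sum((x - expected) ** 2 / expected for x in counts)


rng = random.Random(2023)                 # fixed seed so that the run is repeatable
values = [dice_test_value(rng) for _ in range(10_000)]

extreme = sum(v > 15 for v in values)
print(f"mean test value: {sum(values) / len(values):.2f}")   # should be close to m - 1 = 5
print(f"test values above 15: {extreme} of {len(values)}")
```

Although the simulated die is fair, roughly one test in a hundred exceeds the threshold for α = 1 %. Anyone who repeats the test until the desired result appears will eventually get it.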
Comprehension questions
Exercises
1. Rolling 2 dice 10 times results in the pairs (2, 4),(5, 6),(2, 2),(6, 2),(2, 6),
(5, 1),(5, 6),(6, 4),(4, 4),(3, 1).
Calculate the sample mean and sample variance for the sum of the pips of these
samples. Give an estimate for the probability of a double and for the expected value of
the sum of the pips.
Why can’t you calculate a confidence interval for p(double) using our standard
formula?
Show that with fair dice, the probability that 0, 1, 2 or 3 doubles will occur in 10
throws is more than 90%.
2. Try to follow the two numerical examples in the principal component analysis
(Sec. 22.2) using a computer algebra system.
3. Determine the defect rate of a bulk article: A sample of 1000 elements yields 94
defective parts. Determine a confidence interval for the unknown probability p of
the defects at the confidence level 95%.
4. The probability of a girl’s birth:
In 2017, 784 901 children were born in Germany, 382 374 of them girls. Deter-
mine a confidence interval for the probability of a girl’s birth at the confidence
level 99%.
5. In a survey, the percentage of the population that does not believe in statistics is to
be determined. For the specifications
a) Confidence level 98%, confidence interval width 4%,
b) Confidence level 96%, confidence interval width 4%
the necessary sample size is to be calculated.
The survey is conducted with 3000 people. Of these, 1386 do not believe in sta-
tistics. Determine the corresponding confidence intervals at the confidence levels
96% and 98%.
6. In roulette, the probability of a red number is p(red) = 18/37. You suspect that this
is not true for a particular wheel. The last 5000 results are posted in lists; they contain
2355 red numbers. Formulate a suitable hypothesis to check your suspicion. Use the
probabilities of error 2% and 5% and state the calculated result precisely.
7. 4964 draws of the German lottery "6 out of 49" since 1955 resulted in the follow-
ing draw frequencies for the numbers 1 to 49 in sequence:
606 617 628 615 602 633 607 583 630 591 608 592 551
600 576 612 608 606 602 577 585 616 600 602 642 649
635 569 581 597 640 616 628 603 609 608 600 646 610
605 629 615 651 604 564 585 607 606 638
Does this result, with a probability of error of 1%, contradict the hypothesis of the
uniform distribution of the 49 numbers? Write a test program.
Appendix 23
Traditionally, mathematicians use many Greek symbols in their formulas. Here is the
Greek alphabet with the names for the letters.
© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH,
part of Springer Nature 2023
P. Hartmann, Mathematics for Computer Scientists,
https://doi.org/10.1007/978-3-658-40423-9_23
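$$\Phi(x) = p(N(0, 1) \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt, \qquad \Phi(-x) = 1 - \Phi(x), \qquad \Phi^{-1}(y) = -\Phi^{-1}(1 - y)$$
x Φ(x) x Φ(x) x Φ(x) x Φ(x) x Φ(x) x Φ(x) x Φ(x) x Φ(x)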
0.01 0.5040 0.47 0.6808 0.93 0.8238 1.39 0.9177 1.85 0.9678 2.31 0.9896 2.77 0.99720 3.23 0.99938
0.02 0.5080 0.48 0.6844 0.94 0.8264 1.40 0.9192 1.86 0.9686 2.32 0.9898 2.78 0.99728 3.24 0.99940
0.03 0.5120 0.49 0.6879 0.95 0.8289 1.41 0.9207 1.87 0.9693 2.33 0.9901 2.79 0.99736 3.25 0.99942
0.04 0.5160 0.50 0.6915 0.96 0.8315 1.42 0.9222 1.88 0.9700 2.34 0.9904 2.80 0.99744 3.26 0.99944
0.05 0.5199 0.51 0.6950 0.97 0.8340 1.43 0.9236 1.89 0.9706 2.35 0.9906 2.81 0.99752 3.27 0.99946
0.06 0.5239 0.52 0.6985 0.98 0.8365 1.44 0.9251 1.90 0.9713 2.36 0.9909 2.82 0.99760 3.28 0.99948
0.07 0.5279 0.53 0.7019 0.99 0.8389 1.45 0.9265 1.91 0.9719 2.37 0.9911 2.83 0.99767 3.29 0.99950
0.08 0.5319 0.54 0.7054 1.00 0.8413 1.46 0.9279 1.92 0.9726 2.38 0.9913 2.84 0.99774 3.30 0.99952
0.09 0.5359 0.55 0.7088 1.01 0.8438 1.47 0.9292 1.93 0.9732 2.39 0.9916 2.85 0.99781 3.31 0.99953
0.10 0.5398 0.56 0.7123 1.02 0.8461 1.48 0.9306 1.94 0.9738 2.40 0.9918 2.86 0.99788 3.32 0.99955
0.11 0.5438 0.57 0.7157 1.03 0.8485 1.49 0.9319 1.95 0.9744 2.41 0.9920 2.87 0.99795 3.33 0.99957
0.12 0.5478 0.58 0.7190 1.04 0.8508 1.50 0.9332 1.96 0.9750 2.42 0.9922 2.88 0.99801 3.34 0.99958
0.13 0.5517 0.59 0.7224 1.05 0.8531 1.51 0.9345 1.97 0.9756 2.43 0.9925 2.89 0.99807 3.35 0.99960
0.14 0.5557 0.60 0.7258 1.06 0.8554 1.52 0.9357 1.98 0.9762 2.44 0.9927 2.90 0.99813 3.36 0.99961
0.15 0.5596 0.61 0.7291 1.07 0.8577 1.53 0.9370 1.99 0.9767 2.45 0.9929 2.91 0.99819 3.37 0.99962
0.16 0.5636 0.62 0.7324 1.08 0.8599 1.54 0.9382 2.00 0.9773 2.46 0.9931 2.92 0.99825 3.38 0.99964
0.17 0.5675 0.63 0.7357 1.09 0.8621 1.55 0.9394 2.01 0.9778 2.47 0.9932 2.93 0.99831 3.39 0.99965
0.18 0.5714 0.64 0.7389 1.10 0.8643 1.56 0.9406 2.02 0.9783 2.48 0.9934 2.94 0.99836 3.40 0.99966
0.19 0.5754 0.65 0.7422 1.11 0.8665 1.57 0.9418 2.03 0.9788 2.49 0.9936 2.95 0.99841 3.41 0.99968
0.20 0.5793 0.66 0.7454 1.12 0.8686 1.58 0.9430 2.04 0.9793 2.50 0.9938 2.96 0.99846 3.42 0.99969
0.21 0.5832 0.67 0.7486 1.13 0.8708 1.59 0.9441 2.05 0.9798 2.51 0.9940 2.97 0.99851 3.43 0.99970
0.22 0.5871 0.68 0.7518 1.14 0.8729 1.60 0.9452 2.06 0.9803 2.52 0.9941 2.98 0.99856 3.44 0.99971
0.23 0.5910 0.69 0.7549 1.15 0.8749 1.61 0.9463 2.07 0.9808 2.53 0.9943 2.99 0.99861 3.45 0.99972
0.24 0.5948 0.70 0.7580 1.16 0.8770 1.62 0.9474 2.08 0.9812 2.54 0.9945 3.00 0.99865 3.46 0.99973
0.25 0.5987 0.71 0.7612 1.17 0.8790 1.63 0.9485 2.09 0.9817 2.55 0.9946 3.01 0.99869 3.47 0.99974
0.26 0.6026 0.72 0.7642 1.18 0.8810 1.64 0.9495 2.10 0.9821 2.56 0.9948 3.02 0.99874 3.48 0.99975
0.27 0.6064 0.73 0.7673 1.19 0.8830 1.65 0.9505 2.11 0.9826 2.57 0.9949 3.03 0.99878 3.49 0.99976
0.28 0.6103 0.74 0.7704 1.20 0.8849 1.66 0.9515 2.12 0.9830 2.58 0.9951 3.04 0.99882 3.50 0.99977
0.29 0.6141 0.75 0.7734 1.21 0.8869 1.67 0.9525 2.13 0.9834 2.59 0.9952 3.05 0.99886 3.51 0.99978
0.30 0.6179 0.76 0.7764 1.22 0.8888 1.68 0.9535 2.14 0.9838 2.60 0.9953 3.06 0.99889 3.52 0.99978
0.32 0.6255 0.78 0.7823 1.24 0.8925 1.70 0.9554 2.16 0.9846 2.62 0.9956 3.08 0.99896 3.54 0.99980
0.33 0.6293 0.79 0.7852 1.25 0.8944 1.71 0.9564 2.17 0.9850 2.63 0.9957 3.09 0.99900 3.55 0.99981
0.34 0.6331 0.80 0.7881 1.26 0.8962 1.72 0.9573 2.18 0.9854 2.64 0.9959 3.10 0.99903 3.56 0.99981
0.35 0.6368 0.81 0.7910 1.27 0.8980 1.73 0.9582 2.19 0.9857 2.65 0.9960 3.11 0.99906 3.57 0.99982
0.36 0.6406 0.82 0.7939 1.28 0.8997 1.74 0.9591 2.20 0.9861 2.66 0.9961 3.12 0.99910 3.58 0.99983
0.37 0.6443 0.83 0.7967 1.29 0.9015 1.75 0.9599 2.21 0.9865 2.67 0.9962 3.13 0.99913 3.59 0.99983
0.38 0.6480 0.84 0.7996 1.30 0.9032 1.76 0.9608 2.22 0.9868 2.68 0.9963 3.14 0.99916 3.60 0.99984
0.39 0.6517 0.85 0.8023 1.31 0.9049 1.77 0.9616 2.23 0.9871 2.69 0.9964 3.15 0.99918 3.61 0.99985
0.40 0.6554 0.86 0.8051 1.32 0.9066 1.78 0.9625 2.24 0.9875 2.70 0.9965 3.16 0.99921 3.62 0.99985
0.41 0.6591 0.87 0.8079 1.33 0.9082 1.79 0.9633 2.25 0.9878 2.71 0.9966 3.17 0.99924 3.63 0.99986
0.42 0.6628 0.88 0.8106 1.34 0.9099 1.80 0.9641 2.26 0.9881 2.72 0.9967 3.18 0.99926 3.64 0.99986
0.43 0.6664 0.89 0.8133 1.35 0.9115 1.81 0.9649 2.27 0.9884 2.73 0.9968 3.19 0.99929 3.65 0.99987
0.44 0.6700 0.90 0.8159 1.36 0.9131 1.82 0.9656 2.28 0.9887 2.74 0.9969 3.20 0.99931 3.66 0.99987
0.45 0.6736 0.91 0.8186 1.37 0.9147 1.83 0.9664 2.29 0.9890 2.75 0.9970 3.21 0.99934 3.67 0.99988
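The tabulated values can also be computed when needed. A small sketch, assuming only the standard library: Φ is expressed through the error function, and `NormalDist` supplies the inverse. The function name `phi` is my own choice.

```python
from math import erf, sqrt
from statistics import NormalDist


def phi(x: float) -> float:
    """Distribution function of N(0, 1), written with the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))


# A few entries of the table for comparison.
for x in (1.00, 1.64, 1.96, 2.33, 3.00):
    print(f"{x:.2f}  {phi(x):.5f}")

# Inverse direction: the value c with phi(c) = 1 - alpha.
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha:.2f}: c = {NormalDist().inv_cdf(1 - alpha):.3f}")
```

The symmetry Φ(−x) = 1 − Φ(x) explains why the table only needs to list positive arguments.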
Bibliography
Mathematics …
[1] Aigner, M., Ziegler, G.: Proofs from the BOOK, Springer, Berlin, 2018.
[2] Artmann, B.: Lineare Algebra, Birkhäuser, Basel, 1986.
[3] Barner, M., Flohr, F.: Analysis I, DeGruyter, Berlin, 1974.
[4] Bauer, H.: Wahrscheinlichkeitstheorie und Grundzüge der Maßtheorie, DeGruyter, Berlin,
1974.
[5] Beutelspacher, A.: Lineare Algebra, Vieweg, Wiesbaden, 1994.
[6] Beutelspacher, A., Zschiegner, M.: Diskrete Mathematik für Einsteiger, Vieweg+Teubner,
Wiesbaden, 2011.
[7] Biggs, N.L.: Discrete Mathematics, Oxford University Press, Oxford, 2002.
[8] Blatter, C.: Analysis I+II, Springer, Berlin, 1974.
[9] Bosch, K.: Elementare Einführung in die Statistik, Vieweg, Wiesbaden, 1994.
[10] Bosch, K.: Elementare Einführung in die Wahrscheinlichkeitstheorie, Vieweg+Teubner,
Wiesbaden, 2011.
[11] Bosch, K.: Statistik für Nichtstatistiker, Oldenbourg, Munich, 1994.
[12] Brill, M.: Mathematik für Informatiker, Hanser, Munich, 2001.
[13] Domschke, W., Drexl, A.: Einführung in Operations Research, Springer, Berlin, 2011.
[14] Dörfler, W., Peschek, W.: Mathematik für Informatiker, Hanser, Munich, 1988.
[15] Engel, A.: Wahrscheinlichkeitsrechnung und Statistik, Vol. 2, Klett, Stuttgart, 1976.
[16] Forster, O.: Algorithmische Zahlentheorie, Vieweg, Wiesbaden, 1996.
[17] Forster, O.: Analysis I+II, Vieweg+Teubner, Wiesbaden, 2011.
[18] Gathen, J. von zur: CryptoSchool, Springer, Berlin, 2015.
[19] Greiner, M., Tinhofer, G.: Stochastik für Studienanfänger der Informatik, Hanser, Munich,
1996.
[20] Handl, A., Kuhlenkasper, T.: Multivariate Analysemethoden, Springer Spektrum, Berlin,
2017.
[21] Huppert, B., Willems, W.: Lineare Algebra, Vieweg+Teubner, Wiesbaden, 2010.
[22] Knorrenschild, M.: Numerische Mathematik, Eine beispielorientierte Einführung, Hanser,
Munich, 2010.
[23] Rosen, Kenneth H.: Elementary Number Theory and Its Applications, Pearson, Boston, 2005.
[24] Rießinger, T.: Mathematik für Ingenieure, Springer, Berlin, 1996.
[25] Schöning, U.: Logik für Informatiker, Spektrum Akademischer Verlag, Heidelberg, 2000.
[26] Teschl, G., Teschl, S.: Mathematik für Informatiker Band I, Springer, Berlin, 2013.
[27] Waerden, B. L. van der: Algebra I, Springer, Berlin, 1971.
[28] Walter, W.: Gewöhnliche Differentialgleichungen, Springer, Berlin, 1976.
[29] Weller, F.: Numerische Mathematik für Ingenieure und Naturwissenschaftler, Vieweg,
Wiesbaden, 1996.