Labs for Foundations of Applied Mathematics
Volume 1: Mathematical Analysis
List of Contributors

B. Barker, E. Evans, R. Evans, J. Grout (Drake University), J. Humpherys, T. Jarvis, J. Whitehead,
J. Adams, K. Baldwin, J. Bejarano, J. Bennett, A. Berry, Z. Boyd, M. Brown, A. Carr, C. Carter,
S. Carter, T. Christensen, M. Cook, M. Cutler, R. Dorff, B. Ehlert, M. Fabiano, K. Finlinson,
J. Fisher, R. Flores, R. Fowers, A. Frandsen, R. Fuhriman, T. Gledhill, S. Giddens, C. Gigena,
M. Graham, F. Glines, C. Glover, M. Goodwin, R. Grout, D. Grundvig, S. Halverson, E. Hannesson,
K. Harmer, J. Henderson, J. Hendricks, A. Henriksen, I. Henriksen, B. Hepner, C. Hettinger,
S. Horst, R. Howell, E. Ibarra-Campos, K. Jacobson, R. Jenkins, J. Larsen, J. Leete, Q. Leishman,
J. Lytle, E. Manner, M. Matsushita, R. McMurray, S. McQuarrie, E. Mercer, D. Miller, J. Morrise,
M. Morrise, A. Morrow, R. Murray, J. Nelson, C. Noorda, A. Oldroyd, A. Oveson, E. Parkinson,
M. Probst, M. Proudfoot, D. Reber, H. Ringer, C. Robertson, M. Russell, R. Sandberg, C. Sawyer,
N. Sill, D. Smith, J. Smith, P. Smith, M. Stauffer, E. Steadman, J. Stewart, S. Suggs, A. Tate,
T. Thompson, B. Trendler, M. Victors, E. Walker, J. Webb, R. Webb, J. West, R. Wonnacott,
A. Zaitzeff

All contributors are affiliated with Brigham Young University unless otherwise noted.
Preface
This lab manual is designed to accompany the textbook Foundations of Applied Mathematics,
Volume 1: Mathematical Analysis by Humpherys, Jarvis, and Evans. The labs focus mainly on
important numerical linear algebra algorithms, with applications to images, networks, and data
science. The reader should be familiar with Python [VD10] and its NumPy [Oli06, ADH+01, Oli07]
and Matplotlib [Hun07] packages before attempting these labs. See the Python Essentials manual
for introductions to these topics.
© This work is licensed under the Creative Commons Attribution 3.0 United States License.
You may copy, distribute, and display this copyrighted work only if you give credit to Dr. J. Humpherys.
All derivative works must include an attribution to Dr. J. Humpherys as the owner of this work as
well as the web address to
https://github.com/Foundations-of-Applied-Mathematics/Labs
as the original source of this work.
To view a copy of the Creative Commons Attribution 3.0 License, visit
http://creativecommons.org/licenses/by/3.0/us/
or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105,
USA.
Contents
Preface v
I Labs 1
1 Introduction to GitHub 3
2 Linear Transformations 7
3 Linear Systems 19
4 The QR Decomposition 33
6 Image Segmentation 59
8 Facial Recognition 81
9 Differentiation 89
10 Newton’s Method 99
18 GMRES 173
II Appendices 179
Bibliography 201
Part I
Labs

1 Introduction to GitHub
Lab Objective: Git is a version control system that helps you manage changes to your code over
time. It allows you to keep track of different versions of your code, collaborate with others, and revert
changes if necessary. In ACME, Git will allow the Lab Assistants to see and grade your code. In
this mini-lab you will learn how to successfully save your code to a GitHub repository.
Before you begin this lab, you should have already gone through the Getting Started tutorials.
Specifically:
• The course materials should be downloaded and stored in an accessible place on your computer
• VSCode (or another code editor) should be installed and set up on your computer
• You should have created a GitHub account with repositories for Volume 1 and Volume 2
If you have missed any of these steps, stop here and refer back to the Getting Started pdf and
the accompanying tutorial videos.
Each week there will be an assigned lab in both Volume 1 and Volume 2 that will supplement
the material presented in class. These labs will involve editing, running, and saving code frequently.
To assist in this process, GitHub will be used to save your code and allow the instructors to grade
each week. The following example problem will help outline this process.
Problem 1. In the lab folder titled GitHubIntro, you will find the file github_intro.py.
Open github_intro.py with VSCode (or your favorite code editor). In the function labeled
prob1(), remove the line that says raise NotImplementedError() and replace it with return
"Student Name", putting your actual name inside the quotation marks. Make sure not to change
the indentation; otherwise, your code will not run correctly. Be sure to save the file after you’ve
finished editing.
Now, run the file with Python. This can be done either in your code editor or via the
terminal (instructions are given later in this tutorial). If everything worked correctly, you should
see your name appear in the console.
We will now demonstrate how to run your file with Python and upload your changed file to GitHub.
To start, open a terminal window. The following coding examples will teach you how to navigate
to the GitHubIntro/ directory via the terminal. Instead of clicking icons to open folders, you will
need to use typed commands within your terminal to move from folder to folder. To see your current
location, type the command pwd which stands for "print working directory."
~$ pwd
/home/username
Use the command cd to change directory. For example, if I wanted to navigate into the
Documents folder, I would use the following command:
~$ cd Documents # Change to the Documents folder
~$ pwd # Check that the current directory changed
/home/username/Documents
If you wish to go back a directory, use cd followed by a space and two periods “..”
~$ cd ..
~$ pwd
/home/username
Using these commands, navigate to the GitHubIntro directory. Once you’re there, use pwd to
check that you’re in the right folder. It should look like the following example.
~$ pwd
/home/username/.../GitHubIntro
Now that you are in the correct directory, you can run the following command to see that your
code compiles correctly:
~$ python github_intro.py
Student Name
Now, upload your changed file to GitHub using the following commands:
~$ git add github_intro.py
# Commit saves the changes you made to your file, acting like a time stamp
~$ git commit -m "Finished GitHub lab"
~$ git push origin master
Now, open a browser and log in to your GitHub account. Open your Volume 1 repository and
open your github_intro.py file. If everything worked correctly, you should be able to see the edits
you made to your file. Additionally, you can click on Commits in the menu on the left and see your
most recent commit statements. Congratulations, you have completed your GitHub introduction!
In the future, you will submit your labs in a similar way. The process is to first navigate to the
correct directory within your terminal, then input the following commands:
~$ git pull origin master
~$ git add <changed files>
~$ git commit -m "<descriptive message>"
~$ git push origin master
Note: following this ordering will help you to avoid merge conflicts with GitHub!
2 Linear Transformations
Lab Objective: Linear transformations are the most basic and essential operators in vector space
theory. In this lab we visually explore how linear transformations alter points in the Cartesian plane.
We also empirically explore the computational cost of applying linear transformations via matrix
multiplication.
Linear Transformations
A linear transformation is a mapping between vector spaces that preserves addition and scalar
multiplication. More precisely, let V and W be vector spaces over a common field F. A map
L : V → W is a linear transformation from V into W if
L(ax1 + bx2 ) = aLx1 + bLx2
for all vectors x1 , x2 ∈ V and scalars a, b ∈ F.
Every linear transformation L from an n-dimensional vector space into an m-dimensional vector
space can be represented by an m × n matrix A, called the matrix representation of L. To apply L
to a vector x, left multiply by its matrix representation. This results in a new vector x′ , where each
component is some linear combination of the elements of x. For linear transformations from R2 to
R2 , this process has the form
$$A\mathbf{x} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix} = \mathbf{x}'.$$
Linear transformations can be interpreted geometrically. To demonstrate this, consider the
array of points H that collectively form a picture of a horse, stored in the file horse.npy. The
coordinate pairs xi are organized by column, so the array has two rows: one for x-coordinates, and
one for y-coordinates. Matrix multiplication on the left transforms each coordinate pair, resulting in
another matrix H ′ whose columns are the transformed coordinate pairs:
$$AH = A\begin{bmatrix} x_1 & x_2 & x_3 & \cdots \\ y_1 & y_2 & y_3 & \cdots \end{bmatrix} = A\begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots \end{bmatrix} = \begin{bmatrix} A\mathbf{x}_1 & A\mathbf{x}_2 & A\mathbf{x}_3 & \cdots \end{bmatrix} = \begin{bmatrix} x_1' & x_2' & x_3' & \cdots \\ y_1' & y_2' & y_3' & \cdots \end{bmatrix} = H'.$$
To begin, use np.load() to extract the array from the npy file, then plot the unaltered points
as individual pixels. See Figure 2.1 for the result.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.load("horse.npy")             # Load the 2 x N array of coordinate pairs.
>>> plt.plot(data[0], data[1], 'k,')        # Plot each point as a single pixel.
# Set the window limits to [-1, 1] by [-1, 1] and make the window square.
>>> plt.axis([-1,1,-1,1])
>>> plt.gca().set_aspect("equal")
>>> plt.show()
Problem 1. Write a function for each type of linear transformation. Each function should
accept an array to transform and the scalars that define the transformation (a and b for stretch,
shear, and reflection, and θ for rotation). Construct the matrix representation, left multiply it
with the input array, and return a transformation of the data.
To test these functions, write a function to plot the original points in horse.npy together
with the transformed points in subplots for a side-by-side comparison. Compare your results
to Figure 2.1.
Affine Transformations
All linear transformations map the origin to itself. An affine transformation is a mapping between
vector spaces that preserves the relationships between points and lines, but that may not preserve
the origin. Every affine transformation T can be represented by a matrix A and a vector b. To apply
T to a vector x, calculate Ax + b. If b = 0 then the transformation is linear, and if A = I but b ≠ 0
then it is called a translation.
For example, if T is the translation with b = [3/4, 1/2]^T, then applying T to an image shifts it
right by 3/4 and up by 1/2.

[Figure: the original image (left) and its translation by b (right).]
Affine transformations include all compositions of stretches, shears, rotations, reflections, and
translations. For example, if S represents a shear and R a rotation, and if b is a vector, then RSx+b
shears, then rotates, then translates x.
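As a rough sketch of composing these operations in NumPy (the shear, angle, and shift below are
illustrative values, and data stands for the 2 × N array loaded from horse.npy):

>>> theta = np.pi / 4
>>> R = np.array([[np.cos(theta), -np.sin(theta)],    # Rotation by theta radians.
...               [np.sin(theta),  np.cos(theta)]])
>>> S = np.array([[1, 0.5],                           # Shear in the x direction.
...               [0, 1. ]])
>>> b = np.array([[0.75], [0.5]])                     # Translation vector as a column.
>>> new_data = R @ S @ data + b                       # Shear, then rotate, then translate.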
[Figure: a point p(0) rotated about the origin by tω radians to p(t).]
Composing the rotation with a translation shifts the center of rotation away from the origin,
yielding more complicated motion.
Problem 2. The moon orbits the earth while the earth orbits the sun. Assuming circular
orbits, we can compute the trajectories of both the earth and the moon using only linear and
affine transformations.
Assume an orientation where both the earth and moon travel counterclockwise, with the
sun at the origin. Let pe (t) and pm (t) be the positions of the earth and the moon at time t,
respectively, and let ωe and ωm be each celestial body’s angular velocity. For a particular time
t, we calculate pe (t) and pm (t) with the following steps.
1. Compute pe (t) by rotating the initial vector pe (0) counterclockwise about the origin by
tωe radians.
2. Calculate the position of the moon relative to the earth at time t by rotating the vector
pm (0) − pe (0) counterclockwise about the origin by tωm radians.
3. To compute pm (t), translate the vector resulting from the previous step by pe (t).
Write a function that accepts a final time T, initial positions xe and xm, and the angular
velocities ωe and ωm. Assuming initial positions pe(0) = (xe, 0) and pm(0) = (xm, 0), plot
pe (t) and pm (t) over the time interval t ∈ [0, T ].
Setting T = 3π/2, xe = 10, xm = 11, ωe = 1, and ωm = 13, your plot should resemble
the following figure (fix the aspect ratio with ax.set_aspect("equal")). Note that a more
celestially accurate figure would use xe = 400, xm = 401 (the interested reader should see
http://www.math.nus.edu.sg/aslaksen/teaching/convex.html).
[Figure: the trajectories of the earth ("Earth") and moon ("Moon") over t ∈ [0, T], on equal axes from −10 to 10.]
Timing Code
Recall that the time module’s time() function measures the number of seconds since the Epoch.
To measure how long it takes for code to run, record the time just before and just after the code in
question, then subtract the first measurement from the second to get the number of seconds that have
passed. Additionally, in IPython, the quick command %timeit uses the timeit module to quickly
time a single line of code.
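For instance, a minimal sketch of this pattern (the body of time_for_loop() below is an assumption
chosen to match the call that follows; only the record-before/record-after structure matters):

In [2]: from time import time

In [3]: def time_for_loop():
   ...:     """Time how long a simple for loop takes to execute."""
   ...:     start = time()              # Clock the starting time.
   ...:     total = 0
   ...:     for i in range(10**7):      # The code being timed.
   ...:         total += i
   ...:     end = time()                # Clock the ending time.
   ...:     return end - start          # Report the elapsed seconds.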
In [4]: time_for_loop()
0.24458789825439453
Timing an Algorithm
Most algorithms have at least one input that dictates the size of the problem to be solved. For
example, the following functions take in a single integer n and produce a random vector of length n
as a list or a random n × n matrix as a list of lists.
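A minimal sketch of such functions, using the standard library's random module (the exact bodies
are assumptions consistent with the call counts described next):

from random import random

def random_vector(n):
    """Generate a random vector of length n as a list."""
    return [random() for _ in range(n)]

def random_matrix(n):
    """Generate a random n x n matrix as a list of lists."""
    return [[random() for _ in range(n)] for _ in range(n)]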
Executing random_vector(n) calls random() n times, so doubling n should about double the
amount of time random_vector(n) takes to execute. By contrast, executing random_matrix(n) calls
random() n2 times (n times per row with n rows). Therefore doubling n will likely more than double
the amount of time random_matrix(n) takes to execute, especially if n is large.
To visualize this phenomenon, we time random_matrix() for n = 2^1, 2^2, . . . , 2^12 and plot n
against the execution time. The result is displayed below on the left.
[Figure: execution time in seconds of random_matrix(n) plotted against n (up to about 4000); the right panel repeats the timings with the parabola y = an² overlaid.]
The figure on the left shows that the execution time for random_matrix(n) increases quadratically
in n. In fact, the blue dotted line in the figure on the right is the parabola y = an², which
fits nicely over the timed observations. Here a is a small constant, but it is much less significant
than the exponent on the n. To represent this algorithm's growth, we ignore a altogether and write
random_matrix(n) ∼ n².
Note
An algorithm like random_matrix(n) whose execution time increases quadratically with n is
called O(n²), notated by random_matrix(n) ∈ O(n²). Big-oh notation is common for indicating
both the temporal complexity of an algorithm (how the execution time grows with n) and the
spatial complexity (how the memory usage grows with n).
These formulas are implemented below without using NumPy arrays or operations.
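A sketch consistent with that description (the names match the later references to
matrix_vector_product() and matrix_matrix_product(); the exact loop style is an assumption):

def matrix_vector_product(A, x):
    """Compute the product Ax, where A is a list of lists and x is a list."""
    m, n = len(A), len(x)
    return [sum([A[i][k] * x[k] for k in range(n)]) for i in range(m)]

def matrix_matrix_product(A, B):
    """Compute the product AB, where A and B are lists of lists."""
    m, n, p = len(A), len(B), len(B[0])
    return [[sum([A[i][k] * B[k][j] for k in range(n)])
             for j in range(p)]
            for i in range(m)]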
Time each of these functions with increasingly large inputs. Generate the inputs A, x,
and B with random_matrix() and random_vector() (so each input will be n × n or n × 1).
Only time the multiplication functions, not the generating functions.
Report your findings in a single figure with two subplots: one with matrix-vector times,
and one with matrix-matrix times. Choose a domain for n so that your figure accurately
describes the growth, but avoid values of n that lead to execution times of more than 1 minute.
Your figure should resemble the following plots.
[Figure: execution times in seconds versus n (up to about 250) for the matrix-vector products (left subplot) and the matrix-matrix products (right subplot).]
Logarithmic Plots
Though the two plots from Problem 3 look similar, the scales on the y-axes show that the actual
execution times differ greatly. To be compared correctly, the results need to be viewed differently.
A logarithmic plot uses a logarithmic scale—with values that increase exponentially, such as
10¹, 10², 10³, . . .—on one or both of its axes. The three kinds of log plots are listed below.
• log-lin: the x-axis uses a logarithmic scale but the y-axis uses a linear scale.
Use plt.semilogx() instead of plt.plot().
• lin-log: the x-axis uses a linear scale but the y-axis uses a logarithmic scale.
Use plt.semilogy() instead of plt.plot().
• log-log: both the x-axis and the y-axis use logarithmic scales.
Use plt.loglog() instead of plt.plot().
Since the domain n = 2¹, 2², . . . is a logarithmic scale and the execution times increase
quadratically, we visualize the results of the previous problem with a log-log plot. The default base
for the logarithmic scales on logarithmic plots in Matplotlib is 10. To change the base to 2 on each
axis, specify the keyword argument base=2.
Suppose the domain of n values are stored in domain and the corresponding execution times
for matrix_vector_product() and matrix_matrix_product() are stored in vector_times and
matrix_times, respectively. Then the following code produces Figure 2.5.
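A sketch of such code (the line styles, labels, and subplot layout are illustrative assumptions):

>>> ax1 = plt.subplot(121)              # Left subplot: linear scales.
>>> ax1.plot(domain, vector_times, '.-', lw=2, label="Matrix-Vector")
>>> ax1.plot(domain, matrix_times, '.-', lw=2, label="Matrix-Matrix")
>>> ax1.legend(loc="upper left")

>>> ax2 = plt.subplot(122)              # Right subplot: log-log scales with base 2.
>>> ax2.loglog(domain, vector_times, '.-', base=2, lw=2)
>>> ax2.loglog(domain, matrix_times, '.-', base=2, lw=2)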
>>> plt.show()
Figure 2.5: Execution times for matrix_vector_product() and matrix_matrix_product(): the
left subplot uses linear scales and the right subplot uses base-2 log-log scales.
In the log-log plot, the slope of the matrix_matrix_product() line is about 3 and the slope of
the matrix_vector_product() line is about 2. This reflects the fact that matrix-matrix multiplication
(which uses 3 loops) is O(n³), while matrix-vector multiplication (which only has 2 loops) is
only O(n²).
Problem 4. NumPy is built specifically for fast numerical computations. Repeat the experi-
ment of Problem 3, this time timing matrix-vector and matrix-matrix multiplication performed
with NumPy arrays and NumPy operations (for example, the @ operator).
Create a single figure with two subplots: one with all four sets of execution times on a
regular linear scale, and one with all four sets of execution times on a log-log scale. Compare
your results to Figure 2.5.
Note
Problem 4 shows that matrix operations are significantly faster in NumPy than in
plain Python. Matrix-matrix multiplication grows cubically regardless of the implementation;
however, with lists the times grow at a rate of an³ while with NumPy the times grow at a rate
of bn³, where a is much larger than b. NumPy is more efficient for several reasons:
1. Iterating through loops is very expensive. Many of NumPy’s operations are implemented
in C, which are much faster than Python loops.
2. Arrays are designed specifically for matrix operations, while Python lists are general
purpose.
3. NumPy carefully takes advantage of computer hardware, efficiently using different levels
of computer memory.
However, in Problem 4, the execution times for matrix multiplication with NumPy seem
to increase somewhat inconsistently. This is because the fastest layer of computer memory can
only handle so much information before the computer has to begin using a larger, slower layer
of memory.
Additional Material
Image Transformation as a Class
Consider organizing the functions from Problem 1 into a class. The constructor might accept an
array or the name of a file containing an array. This structure would make it easy to do several
linear or affine transformations in sequence.
Animating Parametrizations
The plot in Problem 2 fails to fully convey the system’s evolution over time because time itself is not
part of the plot. The following function creates an animation for the earth and moon trajectories.
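The function assumes setup along the following lines (a sketch; the window limits, colors, and marker
sizes are illustrative, and earth and moon are the 2 × N trajectory arrays from Problem 2):

from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
ax.set_xlim(-15, 15)                           # Window large enough for both orbits.
ax.set_ylim(-15, 15)
ax.set_aspect("equal")
earth_dot,  = ax.plot([], [], 'C0o', ms=10)    # Current position of the earth.
earth_path, = ax.plot([], [], 'C0-')           # Path traced so far by the earth.
moon_dot,   = ax.plot([], [], 'C1o', ms=5)     # Current position of the moon.
moon_path,  = ax.plot([], [], 'C1-')           # Path traced so far by the moon.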
def animate(index):
earth_dot.set_data(earth[0,index], earth[1,index])
earth_path.set_data(earth[0,:index], earth[1,:index])
moon_dot.set_data(moon[0,index], moon[1,index])
moon_path.set_data(moon[0,:index], moon[1,:index])
return earth_dot, earth_path, moon_dot, moon_path,
a = FuncAnimation(fig, animate, frames=earth.shape[1], interval=25)
plt.show()
3 Linear Systems
Lab Objective: The fundamental problem of linear algebra is solving the linear system Ax = b,
given that a solution exists. There are many approaches to solving this problem, each with different
pros and cons. In this lab we implement the LU decomposition and use it to solve square linear
systems. We also introduce SciPy, together with its libraries for linear algebra and working with
sparse matrices.
Gaussian Elimination
The standard approach for solving the linear system Ax = b on paper is reducing the augmented
matrix [A | b] to row-echelon form (REF) via Gaussian elimination, then using back substitution.
The matrix is in REF when the leading non-zero term in each row is the diagonal term, so the matrix
is upper triangular.
At each step of Gaussian elimination, there are three possible operations: swapping two rows,
multiplying one row by a scalar value, or adding a scalar multiple of one row to another. Many
systems, like the one displayed below, can be reduced to REF using only the third type of operation.
First, use multiples of the first row to get zeros below the diagonal in the first column, then use a
multiple of the second row to get zeros below the diagonal in the second column.
$$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 4 & 2 & 3 \\ 4 & 7 & 8 & 9 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 3 & 1 & 2 \\ 4 & 7 & 8 & 9 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 3 & 1 & 2 \\ 0 & 3 & 4 & 5 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 3 & 1 & 2 \\ 0 & 0 & 3 & 3 \end{bmatrix}$$
Each of these operations is equivalent to left-multiplying by a type III elementary matrix, the
identity with a single non-zero non-diagonal term. If row operation k corresponds to matrix Ek , the
following equation is E3 E2 E1 A = U .
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 4 & 2 & 3 \\ 4 & 7 & 8 & 9 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 3 & 1 & 2 \\ 0 & 0 & 3 & 3 \end{bmatrix}$$
Note that the final row operation modifies only part of the third row to avoid spending the
computation time of adding 0 to 0.
If a 0 appears on the main diagonal during any part of row reduction, the approach given above
tries to divide by 0. Swapping the current row with one below it that does not have a 0 in the same
column solves this problem. This is equivalent to left-multiplying by a type II elementary matrix,
also called a permutation matrix.
Achtung!
Gaussian elimination is not always numerically stable. In other words, it is susceptible to
rounding error that may result in an incorrect final matrix. Suppose that, due to roundoff
error, the matrix A has a very small entry on the diagonal.
$$A = \begin{bmatrix} 10^{-15} & 1 \\ -1 & 0 \end{bmatrix}$$
Though 10^{-15} is essentially zero, instead of swapping the first and second rows to put A in
REF, a computer might multiply the first row by 10^{15} and add it to the second row to eliminate
the −1. The resulting matrix is far from what it would be if the 10^{-15} were actually 0.
$$\begin{bmatrix} 10^{-15} & 1 \\ -1 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 10^{-15} & 1 \\ 0 & 10^{15} \end{bmatrix}$$
Round-off error can propagate through many steps in a calculation. The NumPy routines
that employ row reduction use several tricks to minimize the impact of round-off error, but
these tricks cannot fix every matrix.
Problem 1. Write a function that reduces an arbitrary square matrix A to REF. You may
assume that A is invertible and that a 0 will never appear on the main diagonal (so only use
type III row reductions, not type II). Avoid operating on entries that you know will be 0 before
and after a row operation. Use at most two nested loops.
Test your function with small test cases that you can check by hand. Consider using
np.random.randint() to generate a few manageable test cases.
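For instance, one quick way to build such a test case (the size and entry range are arbitrary, and
ref() stands for whatever you name your Problem 1 function):

>>> A = np.random.randint(1, 10, (4, 4)).astype(np.float64)   # Random 4x4 integer matrix.
>>> print(ref(A))                                             # Compare against row reduction by hand.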
The LU Decomposition
The LU decomposition of a square matrix A is a factorization A = LU where U is the upper
triangular REF of A and L is the lower triangular product of the type III elementary matrices
whose inverses reduce A to U . The LU decomposition of A exists when A can be reduced to REF
using only type III elementary matrices (without any row swaps). However, the rows of A can always
be permuted in a way such that the decomposition exists. If P is a permutation matrix encoding the
appropriate row swaps, then the decomposition P A = LU always exists.
Suppose A has an LU decomposition (not requiring row swaps). Then A can be reduced
to REF with k row operations, corresponding to left-multiplying the type III elementary matrices
E1, . . . , Ek. Because there were no row swaps, each Ei is lower triangular, so each inverse Ei⁻¹ is also
lower triangular. Furthermore, since the product of lower triangular matrices is lower triangular, L
is lower triangular:
$$E_k \cdots E_2 E_1 A = U \;\longrightarrow\; A = (E_k \cdots E_2 E_1)^{-1} U = E_1^{-1} E_2^{-1} \cdots E_k^{-1} U = LU.$$
Thus, L can be computed by right-multiplying the identity by the matrices used to reduce U .
However, in this special situation, each right-multiplication only changes one entry of L, so matrix
multiplication can be avoided altogether. The entire process, only slightly different from row reduction,
is summarized below.
Algorithm 1
1: procedure LU Decomposition(A)
2: m, n ← shape(A) ▷ Store the dimensions of A.
3: U ← copy(A) ▷ Make a copy of A with np.copy().
4: L ← Im ▷ The m × m identity matrix.
5: for j = 0 . . . n − 1 do
6: for i = j + 1 . . . m − 1 do
7: Li,j ← Ui,j /Uj,j
8: Ui,j: ← Ui,j: − Li,j Uj,j:
9: return L, U
Problem 2. Write a function that finds the LU decomposition of a square matrix. You may
assume that the decomposition exists and requires no row swaps.
Once L and U are computed, the system Ax = b can be solved with two triangular solves: since
LU x = b, first solve Ly = b for y by forward substitution, then solve Ux = y for x by back
substitution. The lower triangular system Ly = b is
$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ l_{31} & l_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & l_{n3} & \cdots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{bmatrix},$$
which is solved from the top down:
$$\begin{alignedat}{2} y_1 &= b_1, &\qquad y_1 &= b_1, \\ l_{21} y_1 + y_2 &= b_2, &\qquad y_2 &= b_2 - l_{21} y_1, \\ &\;\;\vdots &\qquad &\;\;\vdots \\ \sum_{j=1}^{k-1} l_{kj} y_j + y_k &= b_k, &\qquad y_k &= b_k - \sum_{j=1}^{k-1} l_{kj} y_j. \end{alignedat} \qquad (3.1)$$
The upper triangular system Ux = y, whose last equation is $u_{nn} x_n = y_n$, is then solved from the
bottom up:
$$\begin{alignedat}{2} u_{nn} x_n &= y_n, &\qquad x_n &= \frac{1}{u_{nn}} y_n, \\ u_{n-1,n-1} x_{n-1} + u_{n-1,n} x_n &= y_{n-1}, &\qquad x_{n-1} &= \frac{1}{u_{n-1,n-1}}\left( y_{n-1} - u_{n-1,n} x_n \right), \\ &\;\;\vdots &\qquad &\;\;\vdots \\ \sum_{j=k}^{n} u_{kj} x_j &= y_k, &\qquad x_k &= \frac{1}{u_{kk}}\left( y_k - \sum_{j=k+1}^{n} u_{kj} x_j \right). \end{alignedat} \qquad (3.2)$$
Problem 3. Write a function that, given A and b, solves the square linear system Ax = b.
Use the function from Problem 2 to compute L and U , then use (3.1) and (3.2) to solve for y,
then x. You may again assume that no row swaps are required (P = I in this case).
Unit Test
Write a unit test for Problem 3, your solve function. It can be found in the file
test_linear_systems.py, and the unit test is named test_solve.
There are example unit tests for Problems 1 and 2 to help you structure your unit test.
SciPy
SciPy [JOP+ ] is a powerful scientific computing library built upon NumPy. It includes high-level
tools for linear algebra, statistics, signal processing, integration, optimization, machine learning, and
more.
SciPy is typically imported with the convention import scipy as sp. However, SciPy is set
up in a way that requires its submodules to be imported individually.1
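For example, the linear algebra submodule is typically imported like this:

>>> from scipy import linalg as la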
Linear Algebra
NumPy and SciPy both have a linear algebra module, each called linalg, but SciPy’s module is the
larger of the two. Some of SciPy’s common linalg functions are listed below.
Function Returns
det() The determinant of a square matrix.
eig() The eigenvalues and eigenvectors of a square matrix.
inv() The inverse of an invertible matrix.
norm() The norm of a vector or matrix norm of a matrix.
solve() The solution to Ax = b (the system need not be square).
As with NumPy, SciPy’s routines are all highly optimized. However, some algorithms are, by
nature, faster than others.
Problem 4. Write a function that times different scipy.linalg functions for solving square
linear systems.
For various values of n, generate a random n × n matrix A and a random n-vector b using
np.random.random(). Time how long it takes to solve the system Ax = b with each of the
following approaches:
1. Invert A with la.inv() and compute A⁻¹b.
2. Use la.solve().
3. Use la.lu_factor() and la.lu_solve() to solve the system with the LU decomposition.
4. Use la.lu_factor() and la.lu_solve(), but only time la.lu_solve() (not the time
it takes to do the factorization with la.lu_factor()).
Plot the system size n versus the execution times. Use log scales if needed.
Achtung!
Problem 4 demonstrates that computing a matrix inverse is computationally expensive. In fact,
numerically inverting matrices is so costly that there is hardly ever a good reason to do it. Use
a specific solver like la.lu_solve() whenever possible instead of using la.inv().
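As a sketch of the factor-then-solve pattern used in approaches 3 and 4 (the variable names are
illustrative):

>>> lu, piv = la.lu_factor(A)            # Factor once; this is the expensive step.
>>> x = la.lu_solve((lu, piv), b)        # Reuse the stored factorization to solve.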
Sparse Matrices
Large linear systems can have tens of thousands of entries. Storing the corresponding matrices in
memory can be difficult: a 10^5 × 10^5 system requires around 40 GB to store in a NumPy array (4
bytes per entry × 10^10 entries). This is well beyond the amount of RAM in a normal laptop.
In applications where systems of this size arise, it is often the case that the system is sparse,
meaning that most of the entries of the matrix are 0. SciPy’s sparse module provides tools for
efficiently constructing and manipulating 1- and 2-D sparse matrices. A sparse matrix only stores
the nonzero values and the positions of these values. For sufficiently sparse matrices, storing the
matrix as a sparse matrix may only take megabytes, rather than gigabytes.
For example, diagonal matrices are sparse. Storing an n × n diagonal matrix in the naïve way
means storing n2 values in memory. It is more efficient to instead store the diagonal entries in a
1-D array of n values. In addition to using less storage space, this allows for much faster matrix
operations: the standard algorithm to multiply a matrix by a diagonal matrix involves n3 steps, but
most of these are multiplying by or adding 0. A smarter algorithm can accomplish the same task
much faster.
SciPy has seven sparse matrix types. Each type is optimized either for storing sparse matrices
whose nonzero entries follow certain patterns, or for performing certain computations.
A regular, non-sparse matrix is called full or dense. Full matrices can be converted to each of the
sparse matrix formats listed above. However, it is more memory efficient to never create the full
matrix in the first place. There are three main approaches for creating sparse matrices from scratch.
• Coordinate Format: When all of the nonzero values and their positions are known, create
the entire sparse matrix at once as a coo_matrix. All nonzero values are stored as a coordinate
and a value. This format also converts quickly to other sparse matrix types.
• DOK and LIL Formats: If the matrix values and their locations are not known beforehand,
construct the matrix incrementally with a dok_matrix or a lil_matrix. Indicate the size of
the matrix, then change individual values with regular slicing syntax.
>>> B = sparse.lil_matrix((2,6))
>>> B[0,2] = 4
>>> B[1,3:] = 9
>>> print(B.toarray())
[[ 0. 0. 4. 0. 0. 0.]
[ 0. 0. 0. 9. 9. 9.]]
• DIA Format: Use a dia_matrix to store matrices that have nonzero entries on only certain
diagonals. The function sparse.diags() is one convenient way to create a dia_matrix from
scratch. Additionally, every sparse matrix has a setdiag() method for modifying specified
diagonals.
# If all of the diagonals have the same entry, specify the entry alone.
>>> offsets = [-1, 0, 3]    # First subdiagonal, main diagonal, and third superdiagonal.
>>> A = sparse.diags([1,3,6], offsets, shape=(3,4))
>>> print(A.toarray())
[[ 3. 0. 0. 6.]
[ 1. 3. 0. 0.]
[ 0. 1. 3. 0.]]
• BSR Format: Many sparse matrices can be formulated as block matrices, and a block matrix
can be stored efficiently as a bsr_matrix. Use sparse.bmat() or sparse.block_diag() to
create a block matrix quickly.
# Use sparse.bmat() to create a block matrix. Use 'None' for zero blocks.
>>> A = sparse.coo_matrix(np.ones((2,2)))
>>> B = sparse.coo_matrix(np.full((2,2), 2.))
>>> print(sparse.bmat([[ A , None, A ],
[None, B , None]], format='bsr').toarray())
[[ 1. 1. 0. 0. 1. 1.]
[ 1. 1. 0. 0. 1. 1.]
[ 0. 0. 2. 2. 0. 0.]
[ 0. 0. 2. 2. 0. 0.]]
Note
If a sparse matrix is too large to fit in memory as an array, it can still be visualized with
Matplotlib’s plt.spy(), which colors in the locations of the non-zero entries of the matrix.
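A minimal usage sketch, assuming A is a SciPy sparse matrix:

>>> plt.spy(A, markersize=1)
>>> plt.show()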
Problem 5. Let I be the n × n identity matrix, and define
$$A = \begin{bmatrix} B & I & & & \\ I & B & I & & \\ & I & B & \ddots & \\ & & \ddots & \ddots & I \\ & & & I & B \end{bmatrix}, \qquad B = \begin{bmatrix} -4 & 1 & & & \\ 1 & -4 & 1 & & \\ & 1 & -4 & \ddots & \\ & & \ddots & \ddots & 1 \\ & & & 1 & -4 \end{bmatrix},$$
where A is n² × n² and each block B is n × n. The large matrix A is used in finite difference
methods for solving Laplace's equation in two dimensions, ∂²u/∂x² + ∂²u/∂y² = 0.
Write a function that accepts an integer n and constructs and returns A as a sparse matrix.
Use plt.spy() to check that your matrix has nonzero values in the correct places.
Once a sparse matrix has been constructed, it should be converted to a csr_matrix or a csc_matrix
with the matrix’s tocsr() or tocsc() method. The CSR and CSC formats are optimized for row or
column operations, respectively. To choose the correct format to use, determine what direction the
matrix will be traversed.
For example, in the matrix-matrix multiplication AB, A is traversed row-wise, but B is tra-
versed column-wise. Thus A should be converted to a csr_matrix and B should be converted to a
csc_matrix.
# Convert A to CSR and CSC formats to compute the matrix product AA.
>>> Acsr = A.tocsr()
>>> Acsc = A.tocsc()
>>> Acsr.dot(Acsc)
<10000x10000 sparse matrix of type '<type 'numpy.float64'>'
with 10142 stored elements in Compressed Sparse Row format>
Beware that row-based operations on a csc_matrix are very slow, and similarly, column-based
operations on a csr_matrix are very slow.
Achtung!
Many familiar NumPy operations have analogous routines in the sparse module. These meth-
ods take advantage of the sparse structure of the matrices and are, therefore, usually significantly
faster. However, SciPy’s sparse matrices behave a little differently than NumPy arrays.
Note in particular the difference between A * B for NumPy arrays and SciPy sparse
matrices. Do not use np.dot() to try to multiply sparse matrices, as it may treat the inputs
incorrectly. The syntax A.dot(B) is safest in most cases.
SciPy’s sparse module has its own linear algebra library, scipy.sparse.linalg, designed for
operating on sparse matrices. Like other SciPy modules, it must be imported explicitly.
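For example, a sketch of solving a sparse system (assuming A has already been converted to CSR
format as Acsr):

>>> from scipy.sparse import linalg as spla
>>> x = spla.spsolve(Acsr, b)            # Solve Ax = b with a sparse-aware solver.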
Problem 6. Write a function that times regular and sparse linear system solvers.
For various values of n, generate the n² × n² matrix A described in Problem 5 and a
random vector b with n² entries. Time how long it takes to solve the system Ax = b with each
of the following approaches:
1. Convert A to CSR format and use scipy.sparse.linalg.spsolve().
2. Convert A to a dense NumPy array and use scipy.linalg.solve().
In each experiment, only time how long it takes to solve the system (not how long it takes to
convert A to the appropriate format).
Plot the system size n² versus the execution times. As always, use log scales where
appropriate and use a legend to label each line.
Achtung!
Even though there are fast algorithms for solving certain sparse linear systems, it is still very
computationally difficult to invert sparse matrices. In fact, the inverse of a sparse matrix is
usually not sparse. There is rarely a good reason to invert a matrix, sparse or dense.
Additional Material
Improvements on the LU Decomposition
Vectorization
Algorithm 1 uses two loops to compute the LU decomposition. With a little vectorization, the process
can be reduced to a single loop.
Algorithm 2
1: procedure Fast LU Decomposition(A)
2: m, n ← shape(A)
3: U ← copy(A)
4: L ← Im
5: for k = 0 . . . n − 1 do
6: Lk+1:,k ← Uk+1:,k /Uk,k
7: Uk+1:,k: ← Uk+1:,k: − Lk+1:,k Uk,k:
8: return L, U
Note that step 7 is an outer product, not the regular dot product (xyᵀ instead of the usual
xᵀy). Use np.outer() instead of np.dot() or @ to get the desired result.
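A quick illustration with arbitrary vectors:

>>> u = np.array([1., 2.])
>>> v = np.array([3., 4., 5.])
>>> np.outer(u, v)                       # The outer product u v^T.
array([[ 3.,  4.,  5.],
       [ 6.,  8., 10.]])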
Pivoting
Gaussian elimination iterates through the rows of a matrix, using the diagonal entry xk,k of the
matrix at the kth iteration to zero out all of the entries in the column below xk,k (xi,k for i ≥ k).
This diagonal entry is called the pivot. Unfortunately, Gaussian elimination, and hence the LU
decomposition, can be very numerically unstable if at any step the pivot is a very small number.
Most professional row reduction algorithms avoid this problem via partial pivoting.
The idea is to choose the largest number (in magnitude) possible to be the pivot by swapping
the pivot row2 with another row before operating on the matrix. For example, the second and fourth
rows of the following matrix are exchanged so that the pivot is −6 instead of 2.
$$\begin{bmatrix} \times & \times & \times & \times \\ 0 & 2 & \times & \times \\ 0 & 4 & \times & \times \\ 0 & -6 & \times & \times \end{bmatrix} \longrightarrow \begin{bmatrix} \times & \times & \times & \times \\ 0 & -6 & \times & \times \\ 0 & 4 & \times & \times \\ 0 & 2 & \times & \times \end{bmatrix} \longrightarrow \begin{bmatrix} \times & \times & \times & \times \\ 0 & -6 & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{bmatrix}$$
Algorithm 3
1: procedure LU Decomposition with Partial Pivoting(A)
2: m, n ← shape(A)
3: U ← copy(A)
4: L ← Im
5: P ← [0, 1, . . . , n − 1] ▷ See tip 2 below.
6: for k = 0 . . . n − 1 do
7: Select i ≥ k that maximizes |Ui,k |
8: Uk,k: ↔ Ui,k: ▷ Swap the two rows.
9: Lk,:k ↔ Li,:k ▷ Swap the two rows.
10: Pk ↔ Pi ▷ Swap the two entries.
11: Lk+1:,k ← Uk+1:,k /Uk,k
12: Uk+1:,k: ← Uk+1:,k: − Lk+1:,k Uk,k:
13: return L, U, P
There are potential cases where even partial pivoting does not eliminate catastrophic numerical
errors in Gaussian elimination, but the odds of having such an amazingly poor matrix are essentially
zero. The numerical analyst J.H. Wilkinson captured the likelihood of encountering such a matrix
in a natural application when he said, “Anyone that unlucky has already been run over by a bus!”
In Place
The LU decomposition can be performed in place (overwriting the original matrix A) by storing U
on and above the main diagonal of the array and storing L below it. The main diagonal of L does
not need to be stored since all of its entries are 1. This format saves an entire array of memory, and
is how scipy.linalg.lu_factor() returns the factorization.
The LU decomposition also gives an efficient way to compute the inverse of a matrix: writing
A⁻¹ = [a1 a2 · · · an] column by column, each column aj satisfies Aaj = ej, so a single factorization
of A followed by n triangular solves yields the entire inverse.
Algorithm 4
1: procedure Cholesky Decomposition(A)
2: n, n ← shape(A)
3: U ← np.triu(A) ▷ Get the upper-triangular part of A.
4: for i = 0 . . . n − 1 do
5: for j = i + 1 . . . n − 1 do
6: Uj,j: ← Uj,j: − Ui,j: Uij /Uii
7: Ui,i: ← Ui,i: /√Uii
8: return U
4 The QR Decomposition

• Full QR: Q is m × m and R is m × n. In this case, the columns {q_j}_{j=1}^m of Q form an
orthonormal basis for all of F^m, and the last m − n rows of R only contain zeros. If m = n,
this is the same as the reduced factorization.

[Diagram: in the reduced decomposition, A (m × n) = Q̂ (m × n) R̂ (n × n), where R̂ is upper
triangular with entries r_{11}, . . . , r_{nn}; in the full decomposition, Q (m × m) appends the columns
q_{n+1}, . . . , q_m to Q̂, and R (m × n) appends m − n rows of zeros below R̂.]
QR via Gram-Schmidt
The classical Gram-Schmidt algorithm takes a linearly independent set of vectors and constructs an
orthonormal set of vectors with the same span. Applying Gram-Schmidt to the columns of A, which
are linearly independent since A has rank n, results in the columns of Q.
The vectors are orthonormalized one at a time: q_1 = x_1/∥x_1∥, and each subsequent q_k is obtained
by removing from x_k its projection onto the span of the previous vectors,
$$\mathbf{p}_0 = \mathbf{0}, \qquad \mathbf{p}_{k-1} = \sum_{j=1}^{k-1} \langle \mathbf{q}_j, \mathbf{x}_k \rangle \mathbf{q}_j, \qquad k = 2, \ldots, n.$$
Since p_{k−1} is the projection of x_k onto the span of {q_j}_{j=1}^{k−1}, the difference q′_k = x_k − p_{k−1} is the residual
vector of the projection. Thus q′_k is orthogonal to each of the vectors in {q_j}_{j=1}^{k−1}. Therefore,
normalizing each q′_k produces an orthonormal set {q_j}_{j=1}^n.
To construct the reduced QR decomposition, let Q̂ be the matrix with columns {q_j}_{j=1}^n, and
let R̂ be the upper triangular matrix whose entries record the quantities computed along the way:
r_{kk} = ∥x_k − p_{k−1}∥ and r_{jk} = ⟨q_j, x_k⟩ for j < k, so that A = Q̂R̂.
Modified Gram-Schmidt
If the columns of A are close to being linearly dependent, the classical Gram-Schmidt algorithm
often produces a set of vectors {qj }nj=1 that are not even close to orthonormal due to rounding
errors. The modified Gram-Schmidt algorithm is a slight variant of the classical algorithm which
more consistently produces a set of vectors that are “very close” to orthonormal.
Let q1 be the normalization of x1 as before. Instead of making just x2 orthogonal to q1 , make
each of the vectors {xj }nj=2 orthogonal to q1 :
xk = xk − ⟨q1 , xk ⟩q1 , k = 2, . . . , n.
Next, define q_2 = x_2/∥x_2∥.
Proceed by making each of {xj }nj=3 orthogonal to q2 :
xk = xk − ⟨q2 , xk ⟩q2 , k = 3, . . . , n.
Since each of these new vectors is a linear combination of vectors orthogonal to q1 , they are orthogonal
to q1 as well. Continuing this process results in the desired orthonormal set {qj }nj=1 . The entire
modified Gram-Schmidt algorithm is described below.
Algorithm 1
1: procedure Modified Gram-Schmidt(A)
2: m, n ← shape(A) ▷ Store the dimensions of A.
3: Q ← copy(A) ▷ Make a copy of A with np.copy().
4: R ← zeros(n, n) ▷ An n × n array of all zeros.
5: for i = 0 . . . n − 1 do
6: Ri,i ← ∥Q:,i ∥
7: Q:,i ← Q:,i /Ri,i ▷ Normalize the ith column of Q.
8: for j = i + 1 . . . n − 1 do
9: Ri,j ← QT :,j Q:,i
10: Q:,j ← Q:,j − Ri,j Q:,i ▷ Orthogonalize the jth column of Q.
11: return Q, R
• Note that steps 7 and 10 employ scalar multiplication or division, while step 9 uses vector
multiplication.
To test your function, generate test cases with NumPy’s np.random module. Verify that
R is upper triangular, Q is orthonormal, and QR = A. You may also want to compare your
results to SciPy’s QR factorization routine, scipy.linalg.qr().
# Generate a random matrix and get its reduced QR decomposition via SciPy.
>>> A = np.random.random((6,4))
>>> Q,R = la.qr(A, mode="economic") # Use mode="economic" for reduced QR.
>>> print(A.shape, Q.shape, R.shape)
(6,4) (6,4) (4,4)
Determinants
Let A be n × n. Then Q and R are both n × n as well.¹ Since Q is orthonormal and R is upper
triangular,
$$\det(Q) = \pm 1 \qquad\text{and}\qquad \det(R) = \prod_{i=1}^{n} r_{i,i}.$$
Since det(AB) = det(A) det(B), it follows that
$$|\det(A)| = |\det(Q)\det(R)| = \prod_{i=1}^{n} |r_{i,i}|. \qquad (4.1)$$
Problem 2. Write a function that accepts an invertible matrix A. Use the QR decomposition
of A and (4.1) to calculate |det(A)|. You may use your QR decomposition algorithm from
Problem 1 or SciPy’s QR routine. Can you implement this function in a single line?
(Hint: np.diag() and np.prod() may be useful.)
Check your answer against la.det(), which calculates the determinant.
Linear Systems
The LU decomposition is usually the matrix factorization of choice to solve the linear system Ax = b
because the triangular structures of L and U facilitate forward and backward substitution. However,
the QR decomposition avoids the potential numerical issues that come with Gaussian elimination.
Since Q is orthonormal, Q−1 = QT . Therefore, solving Ax = b is equivalent to solving the
system Rx = QT b. Since R is upper-triangular, Rx = QT b can be solved quickly with back
substitution.2
1. Compute Q and R.

2. Calculate y = Qᵀb.

3. Use back substitution to solve Rx = y for x.
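A compact sketch of these steps with SciPy (solve_triangular() performs the back substitution):

>>> Q, R = la.qr(A, mode="economic")     # Step 1: reduced QR decomposition.
>>> y = Q.T @ b                          # Step 2: y = Q^T b.
>>> x = la.solve_triangular(R, y)        # Step 3: back substitution solves Rx = y.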
QR via Householder
The Gram-Schmidt algorithm orthonormalizes A using a series of transformations that are stored
in an upper triangular matrix. Another way to compute the QR decomposition is to take the
opposite approach: triangularize A through a series of orthonormal transformations. Orthonormal
transformations are numerically stable, meaning that they are less susceptible to rounding errors. In
fact, this approach is usually faster and more accurate than Gram-Schmidt methods.
The idea is for the kth orthonormal transformation Qk to map the kth column of A to the span
of {ej }kj=1 , where the ej are the standard basis vectors in Rm . In addition, to preserve the work of
the previous transformations, Qk should not modify any entries of A that are above or to the left of
the kth diagonal term of A. For a 4 × 3 matrix A, the process can be visualized as follows.
$$Q_3 Q_2 Q_1 \begin{bmatrix} \ast & \ast & \ast \\ \ast & \ast & \ast \\ \ast & \ast & \ast \\ \ast & \ast & \ast \end{bmatrix} = Q_3 Q_2 \begin{bmatrix} \ast & \ast & \ast \\ 0 & \ast & \ast \\ 0 & \ast & \ast \\ 0 & \ast & \ast \end{bmatrix} = Q_3 \begin{bmatrix} \ast & \ast & \ast \\ 0 & \ast & \ast \\ 0 & 0 & \ast \\ 0 & 0 & \ast \end{bmatrix} = \begin{bmatrix} \ast & \ast & \ast \\ 0 & \ast & \ast \\ 0 & 0 & \ast \\ 0 & 0 & 0 \end{bmatrix}$$
Thus Q_3 Q_2 Q_1 A = R, so A = Q_1^T Q_2^T Q_3^T R gives the full QR
decomposition.
How to correctly construct each Qk isn’t immediately obvious. The ingenious solution lies in
one of the basic types of linear transformations: reflections.
2 See the Linear Systems lab for details on back substitution.
Householder Transformations
The orthogonal complement of a nonzero vector v ∈ Rn is the set of all vectors x ∈ Rn that are
orthogonal to v, denoted v⊥ = {x ∈ Rn | ⟨x, v⟩ = 0}. A Householder transformation is a linear
transformation that reflects a vector x across the orthogonal complement v⊥ for some specified v.
The matrix representation of the Householder transformation corresponding to v is given by
$$H_{\mathbf{v}} = I - 2\,\frac{\mathbf{v}\mathbf{v}^T}{\mathbf{v}^T\mathbf{v}}.$$
Since H_v^T H_v = I, Householder transformations are orthonormal.
Figure 4.1: The vector v defines the orthogonal complement v⊥ , which in this case is a line. Applying
the Householder transformation Hv to x reflects x across v⊥ .
Householder Triangularization
The Householder algorithm uses Householder transformations for the orthonormal transformations
in the QR decomposition process described on the previous page. The goal in choosing Qk is to send
xk , the kth column of A, to the span of {ej }kj=1 . In other words, if Qk xk = yk , the last m − k entries
of y_k should be 0, i.e.,
$$Q_k \mathbf{x}_k = Q_k \begin{bmatrix} z_1 \\ \vdots \\ z_k \\ z_{k+1} \\ \vdots \\ z_m \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \mathbf{y}_k.$$
To begin, decompose x_k into x_k = x′_k + x′′_k, where x′_k and x′′_k are of the form
$$\mathbf{x}_k' = \begin{bmatrix} z_1 & \cdots & z_{k-1} & 0 & \cdots & 0 \end{bmatrix}^T, \qquad \mathbf{x}_k'' = \begin{bmatrix} 0 & \cdots & 0 & z_k & \cdots & z_m \end{bmatrix}^T.$$
Because x′k represents elements of A that lie above the diagonal, only x′′k needs to be altered by the
reflection.
The two vectors x′′_k ± ∥x′′_k∥e_k both yield Householder transformations that send x′′_k to the
span of e_k (see Figure 4.2). Between the two, the one that reflects x′′_k further is more numerically
stable. This reflection corresponds to
$$\mathbf{v}_k = \mathbf{x}_k'' + \operatorname{sign}(z_k)\|\mathbf{x}_k''\|\mathbf{e}_k,$$
where z_k is the kth entry of x_k.
Figure 4.2: There are two reflections that map x into the span of e1 , defined by the vectors v1 and
v2 . In this illustration, Hv2 is the more stable transformation since it reflects x further than Hv1 .
After choosing v_k, set u_k = v_k/∥v_k∥. Then H_{v_k} = I − 2 v_k v_kᵀ/∥v_k∥² = I − 2u_k u_kᵀ, and hence Q_k is given
by the block matrix
$$Q_k = \begin{bmatrix} I_{k-1} & 0 \\ 0 & H_{\mathbf{v}_k} \end{bmatrix} = \begin{bmatrix} I_{k-1} & 0 \\ 0 & I_{m-k+1} - 2\mathbf{u}_k\mathbf{u}_k^T \end{bmatrix}.$$
Algorithm 2
1: procedure Householder(A)
2: m, n ← shape(A)
3: R ← copy(A)
4: Q ← Im ▷ The m × m identity matrix.
5: for k = 0 . . . n − 1 do
6: u ← copy(Rk:,k )
7: u0 ← u0 + sign(u0 )∥u∥ ▷ u0 is the first entry of u.
8: u ← u/∥u∥ ▷ Normalize u.
▷ Apply the reflection to R.
9: Rk:,k: ← Rk:,k: − 2u uT Rk:,k:
▷ Apply the reflection to Q.
10: Qk:,: ← Qk:,: − 2u uT Qk:,:
11: return QT , R
Problem 4. Write a function that accepts as input an m × n matrix A of rank n. Use Algorithm
2 to compute the full QR decomposition of A.
Consider the following implementation details.
• NumPy’s np.sign() is an easy way to implement the sign() operation in step 7. However,
np.sign(0) returns 0, which will cause a problem in the rare case that u0 = 0 (which is
possible if the top left entry of A is 0 to begin with). The following code defines a function
that returns the sign of a single number, counting 0 as positive.
• In steps 9 and 10, the multiplication of u and (uᵀX) is an outer product (xyᵀ instead of
the usual xᵀy). Use np.outer() instead of np.dot() to handle this correctly.
Use NumPy and SciPy to generate test cases and validate your function.
The Upper Hessenberg Form
A matrix is in upper Hessenberg form if all of its entries below the first subdiagonal are zero. A
square matrix A can be reduced to upper Hessenberg form by applying Householder reflections to
both sides of A, so that each reflection's work is preserved:
$$Q_3 Q_2 Q_1 A Q_1^T Q_2^T Q_3^T = \begin{bmatrix} \ast & \ast & \ast & \ast & \ast \\ \ast & \ast & \ast & \ast & \ast \\ 0 & \ast & \ast & \ast & \ast \\ 0 & 0 & \ast & \ast & \ast \\ 0 & 0 & 0 & \ast & \ast \end{bmatrix},$$
yielding the factorization A = QHQᵀ with H upper Hessenberg. The reflections are chosen as in the
QR case: split the kth column of the working matrix into y_k = y′_k + y′′_k, where y′_k holds the
entries above the first subdiagonal and y′′_k holds the rest.
Because yk′ represents elements of A that lie above the first subdiagonal, only yk′′ needs to be altered.
This suggests using the reflection
$$Q_k = \begin{bmatrix} I_k & 0 \\ 0 & H_{\mathbf{v}_k} \end{bmatrix} = \begin{bmatrix} I_k & 0 \\ 0 & I_{m-k} - 2\mathbf{u}_k\mathbf{u}_k^T \end{bmatrix}, \qquad\text{where}\qquad \mathbf{v}_k = \mathbf{y}_k'' + \operatorname{sign}(z_k)\|\mathbf{y}_k''\|\mathbf{e}_k, \quad \mathbf{u}_k = \frac{\mathbf{v}_k}{\|\mathbf{v}_k\|}.$$
Algorithm 3
1: procedure Hessenberg(A)
2: m, n ← shape(A)
3: H ← copy(A)
4: Q ← Im
5: for k = 0 . . . n − 3 do
6: u ← copy(Hk+1:,k )
7: u0 ← u0 + sign(u0 )∥u∥
8: u ← u/∥u∥
9: Hk+1:,k: ← Hk+1:,k: − 2u(uT Hk+1:,k: ) ▷ Apply Qk to H.
10: H:,k+1: ← H:,k+1: − 2(H:,k+1: u)uT ▷ Apply QT
k to H.
11: Qk+1:,: ← Qk+1:,: − 2u(uT Qk+1:,: ) ▷ Apply Qk to Q.
12: return H, QT
# Generate a random matrix and get its upper Hessenberg form via SciPy.
>>> A = np.random.random((8,8))
>>> H, Q = la.hessenberg(A, calc_q=True)
# Verify that H has all zeros below the first subdiagonal and QHQ^T = A.
>>> np.allclose(np.triu(H, -1), H)
True
>>> np.allclose(Q @ H @ Q.T, A)
True
Additional Material
Complex QR Decomposition
The QR decomposition also exists for matrices with complex entries. The standard inner product in
Rm is ⟨x, y⟩ = xT y, but the (more general) standard inner product in Cm is ⟨x, y⟩ = xH y. The H
stands for the Hermitian conjugate, the conjugate of the transpose. Making a few small adjustments
in the implementations of Algorithms 1 and 2 accounts for using the complex inner product.
2. Conjugate the first entry of vector or matrix multiplication before multiplying with np.dot().
3. In the complex plane, there are infinitely many reflections that map a vector x into the span
of ek , not just the two displayed in Figure 4.2. Using sign(zk ) to choose one is still a valid
method, but it requires updating the sign() function so that it can handle complex numbers.
QR with Pivoting
The LU decomposition can be improved by employing Gaussian elimination with partial pivoting,
where the rows of A are strategically permuted at each iteration. The QR factorization can be
similarly improved by permuting the columns of A at each iteration. The result is the factorization
AP = QR, where P is a permutation matrix that encodes the column swaps. To compute the pivoted
QR decomposition with scipy.linalg.qr(), set the keyword pivoting to True.
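For example (a sketch; P is returned as an array of column indices):

>>> Q, R, P = la.qr(A, pivoting=True)
>>> np.allclose(Q @ R, A[:,P])           # The permuted columns of A satisfy AP = QR.
True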
QR via Givens
The Householder algorithm uses reflections to triangularize A. However, A can also be made upper
triangular using rotations. To illustrate the idea, recall that the matrix for a counterclockwise rotation
of θ radians is given by
$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.$$
This transformation is orthonormal. Given x = [a, b]ᵀ, if θ is the angle between x and e_1, then
R_{−θ} maps x to the span of e_1.
Figure 4.3: Rotating clockwise by θ sends the vector [a, b]ᵀ to the span of e_1.
The matrix Rθ above is an example of a 2 × 2 Givens rotation matrix. In general, the Givens
matrix G(i, j, θ) represents the orthonormal transformation that rotates the 2-dimensional span of
ei and ej by θ radians. The matrix representation of this transformation is a generalization of Rθ .
$$G(i, j, \theta) = \begin{bmatrix} I & 0 & 0 & 0 & 0 \\ 0 & c & 0 & -s & 0 \\ 0 & 0 & I & 0 & 0 \\ 0 & s & 0 & c & 0 \\ 0 & 0 & 0 & 0 & I \end{bmatrix}$$
Here I represents the identity matrix, c = cos θ, and s = sin θ. The c’s appear on the ith and
jth diagonal entries.
Givens Triangularization
As demonstrated, θ can be chosen such that G(i, j, θ) rotates a vector so that its jth-component is
0. Such a transformation will only affect the ith and jth entries of any vector it acts on (and thus
the ith and jth rows of any matrix it acts on).
Figure 4.4: The order in which to zero out subdiagonal entries in the Givens triangularization
algorithm. The heavy black line is the main diagonal of the matrix. Entries should be zeroed out
from bottom to top in each column, beginning with the leftmost column.
$$\begin{bmatrix} \ast & \ast \\ \ast & \ast \\ \ast & \ast \end{bmatrix} \xrightarrow{G(2,3,\theta_1)} \begin{bmatrix} \ast & \ast \\ \ast & \ast \\ 0 & \ast \end{bmatrix} \xrightarrow{G(1,2,\theta_2)} \begin{bmatrix} \ast & \ast \\ 0 & \ast \\ 0 & \ast \end{bmatrix} \xrightarrow{G(2,3,\theta_3)} \begin{bmatrix} \ast & \ast \\ 0 & \ast \\ 0 & 0 \end{bmatrix}$$
At each stage, the boxed entries are those modified by the previous transformation. The final
transformation G(2, 3, θ3 ) operates on the bottom two rows, but since the first two entries are zero,
they are unaffected.
Assuming that at the ijth stage of the algorithm aij is nonzero, Algorithm 4 computes the
Givens triangularization of a matrix. Notice that the algorithm does not actually form the entire
matrices G(i, j, θ); instead, it modifies only those entries of the matrix that are affected by the
transformation.
Algorithm 4
1: procedure Givens Triangularization(A)
2: m, n ← shape(A)
3: R ← copy(A)
4: Q ← Im
5: for j = 0 . . . n − 1 do
6: for i = m − 1 . . . j + 1 do
7: a, b ← Ri−1,j , Ri,j
8: G ← [[a, b], [−b, a]]/√(a² + b²)
9: Ri−1:i+1,j: ← GRi−1:i+1,j:
10: Qi−1:i+1,: ← GQi−1:i+1,:
11: return QT , R
The Givens algorithm is particularly efficient for computing the QR decomposition of a matrix that is
already in upper Hessenberg form, since only the first subdiagonal needs to be zeroed out. Algorithm
5 details this process.
Algorithm 5
1: procedure Givens Triangularization of Hessenberg(H)
2: m, n ← shape(H)
3: R ← copy(H)
4: Q ← Im
5: for j = 0 . . . min{n − 1, m − 1} do
6: i=j+1
7: a, b ← Ri−1,j , Ri,j
8: G ← [[a, b], [−b, a]]/√(a² + b²)
9: Ri−1:i+1,j: ← GRi−1:i+1,j:
10: Qi−1:i+1,:i+1 ← GQi−1:i+1,:i+1
11: return QT , R
Note
When A is symmetric, its upper Hessenberg form is a tridiagonal matrix, meaning its only
nonzero entries are on the main diagonal, the first subdiagonal, and the first superdiagonal.
This is because the Qk ’s zero out everything below the first subdiagonal of A and the QT
k ’s zero
out everything to the right of the first superdiagonal. Tridiagonal matrices make computations
fast, so computing the Hessenberg form of a symmetric matrix is very useful.
5 Least Squares and Computing Eigenvalues
Lab Objective: Because of its numerical stability and convenient structure, the QR decomposition
is the basis of many important and practical algorithms. In this lab we introduce linear least squares
problems, tools in Python for computing least squares solutions, and two fundamental algorithms for
computing eigenvalues. The QR decomposition makes solving several of these problems quick and
numerically stable.
Least Squares
A linear system Ax = b is overdetermined if it has more equations than unknowns. In this situation,
there is no true solution, and x can only be approximated.
The least squares solution of Ax = b, denoted x̂, is the “closest” vector to a solution, meaning
it minimizes the quantity ∥Ax̂ − b∥₂. In other words, x̂ is the vector such that Ax̂ is the projection
of b onto the range of A, and can be calculated by solving the normal equations,¹
$$A^T A\hat{\mathbf{x}} = A^T \mathbf{b}.$$
If A = QR is the (reduced) QR decomposition of A, the normal equations reduce to a triangular
system:
$$\begin{aligned} A^T A\hat{\mathbf{x}} &= A^T\mathbf{b} \\ (QR)^T QR\hat{\mathbf{x}} &= (QR)^T\mathbf{b} \\ R^T Q^T QR\hat{\mathbf{x}} &= R^T Q^T\mathbf{b} \\ R^T R\hat{\mathbf{x}} &= R^T Q^T\mathbf{b} \\ R\hat{\mathbf{x}} &= Q^T\mathbf{b} \end{aligned} \qquad (5.1)$$
Fitting a Line
The least squares solution can be used to find the best fit curve of a chosen type to a set of points.
Consider the problem of finding the line y = ax + b that best fits a set of m points {(x_k, y_k)}_{k=1}^m.
Ideally, we seek a and b such that y_k = ax_k + b for all k. These equations can be simultaneously
represented by the linear system
$$A\mathbf{x} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ \vdots & \vdots \\ x_m & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix} = \mathbf{b}. \qquad (5.2)$$
Note that A has full column rank as long as not all of the xk values are the same.
Because this system has two unknowns, it is guaranteed to have a solution if it has two or fewer
equations. However, if there are more than two data points, the system is overdetermined if any set
of three points is not collinear. We therefore seek a least squares solution, which in this case means
finding the slope â and y-intercept b̂ such that the line y = âx + b̂ best fits the data.
Figure 5.1 is a typical example of this idea, where â ≈ 1/2 and b̂ ≈ −3.
Figure 5.1: A linear least squares fit.
Problem 2. The file housing.npy contains the purchase-only housing price index, a measure
of how housing prices are changing, for the United States from 2000 to 2016.a Each row in the
array is a separate measurement; the columns are the year and the price index, in that order.
To avoid large numerical computations, the year measurements start at 0 instead of 2000.
Find the least squares line that relates the year to the housing price index (i.e., let year
be the x-axis and index the y-axis).
1. Construct the matrix A and the vector b described by (5.2) using the data from housing.npy.

2. Use your function from Problem 1 to find the least squares solution.

3. Plot the data points and the resulting least squares line on the same axes.
a See http://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index.aspx.
Note
The least squares problem of fitting a line to a set of points is often called linear regression,
and the resulting line is called the linear regression line. SciPy’s specialized tool for linear
regression is scipy.stats.linregress(). This function takes in an array of x-coordinates and
a corresponding array of y-coordinates, and returns the slope and intercept of the regression
line, along with a few other statistical measurements.
For example, the following code produces Figure 5.1.
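A sketch of code in that spirit, assuming x and y are arrays holding the data points in the figure:

>>> from scipy.stats import linregress
>>> slope, intercept = linregress(x, y)[:2]
>>> plt.plot(x, y, 'k*', label="Data Points")
>>> plt.plot(x, slope*x + intercept, label="Least Squares Fit")
>>> plt.legend(loc="upper left")
>>> plt.show()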
Fitting a Polynomial
Least squares can also be used to fit a set of data to the best fit polynomial of a specified degree.
Let {(x_k, y_k)}_{k=1}^m be the set of m data points in question. The general form for a polynomial of degree n is
$$p_n(x) = c_n x^n + c_{n-1} x^{n-1} + \cdots + c_2 x^2 + c_1 x + c_0 = \sum_{i=0}^{n} c_i x^i.$$
Note that the polynomial is uniquely determined by its n + 1 coefficients {ci }ni=0 . Ideally, then, we
seek the set of coefficients {ci }ni=0 such that
for all values of k. These m linear equations yield the linear system
\[
Ax =
\begin{bmatrix}
x_1^n & x_1^{n-1} & \cdots & x_1^2 & x_1 & 1 \\
x_2^n & x_2^{n-1} & \cdots & x_2^2 & x_2 & 1 \\
x_3^n & x_3^{n-1} & \cdots & x_3^2 & x_3 & 1 \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
x_m^n & x_m^{n-1} & \cdots & x_m^2 & x_m & 1
\end{bmatrix}
\begin{bmatrix} c_n \\ c_{n-1} \\ \vdots \\ c_2 \\ c_1 \\ c_0 \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix}
= b.
\tag{5.3}
\]
The matrix A in (5.3) is a Vandermonde matrix.²
NumPy also has powerful tools for working efficiently with polynomials. The class np.poly1d
represents a 1-dimensional polynomial. Instances of this class are callable like a function.3 The
constructor accepts the polynomial’s coefficients, from largest degree to smallest.
Table 5.1 lists some attributes and methods of the np.poly1d class.
Attribute Description
coeffs The n + 1 coefficients, from greatest degree to least.
order The polynomial degree (n).
roots The n roots of the polynomial.
Method Returns
deriv() The coefficients of the polynomial after being differentiated.
integ() The coefficients of the polynomial after being integrated (with c0 = 0).
Table 5.1: Some attributes and methods of the np.poly1d class.

2 Vandermonde matrices have many special properties and are useful for many applications, including polynomial interpolation.
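For instance, a brief sketch of np.poly1d in action (the coefficients below are chosen arbitrarily for illustration):

import numpy as np

# f(x) = 2x^2 + 3x + 1, coefficients from largest degree to smallest.
f = np.poly1d([2, 3, 1])
print(f(4))             # Evaluate f(4) = 45.
print(f.coeffs)         # [2 3 1]
print(f.deriv())        # The derivative 4x + 3.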
Problem 3. The data in housing.npy is nonlinear, and might be better fit by a polynomial
than a line.
Write a function that uses (5.3) to calculate the polynomials of degree 3, 6, 9, and 12 that
best fit the data. Plot the original data points and each least squares polynomial together in
individual subplots.
(Hint: define a separate, refined domain with np.linspace() and use this domain to smoothly
plot the polynomials.)
Instead of using Problem 1 to solve the normal equations, you may use SciPy’s least
squares routine, scipy.linalg.lstsq().
Achtung!
Having more parameters in a least squares model is not always better. For a set of m points, the
best fit polynomial of degree m − 1 interpolates the data set, meaning that p(xk ) = yk exactly
for each k. In this case there are enough unknowns that the system is no longer overdetermined.
However, such polynomials are highly subject to numerical errors and are unlikely to accurately
represent true patterns in the data.
Choosing to have too many unknowns in a fitting problem is (fittingly) called overfitting,
and is an important issue to avoid in any statistical model.
Fitting a Circle
Suppose the set of m points {(x_k, y_k)}_{k=1}^{m} is arranged in a nearly circular pattern. The general equation of a circle with radius r and center (c1, c2) is
\[ (x - c_1)^2 + (y - c_2)^2 = r^2. \tag{5.4} \]
The circle is uniquely determined by r, c1 , and c2 , so these are the parameters that should be
solved for in a least squares formulation of the problem. However, (5.4) is not linear in any of these
variables.
\begin{align*}
(x - c_1)^2 + (y - c_2)^2 &= r^2 \\
x^2 - 2c_1x + c_1^2 + y^2 - 2c_2y + c_2^2 &= r^2 \\
x^2 + y^2 &= 2c_1x + 2c_2y + r^2 - c_1^2 - c_2^2 \tag{5.5}
\end{align*}
The quadratic terms x² and y² are acceptable because the points {(x_k, y_k)}_{k=1}^{m} are given.
To eliminate the nonlinear terms in the unknown parameters r, c1, and c2, define a new variable c3 = r² − c1² − c2². Then for each point (x_k, y_k), (5.5) becomes
\[ 2c_1x_k + 2c_2y_k + c_3 = x_k^2 + y_k^2. \]
These m equations are linear in c1 , c2 , and c3 , and can be written as the linear system
\[
\begin{bmatrix}
2x_1 & 2y_1 & 1 \\
2x_2 & 2y_2 & 1 \\
\vdots & \vdots & \vdots \\
2x_m & 2y_m & 1
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix}
x_1^2 + y_1^2 \\
x_2^2 + y_2^2 \\
\vdots \\
x_m^2 + y_m^2
\end{bmatrix}.
\tag{5.6}
\]
After solving for the least squares solution, r can be recovered with the relation r = \sqrt{c_1^2 + c_2^2 + c_3}.
Finally, plotting a circle is best done with polar coordinates. Using the same variables as before, the circle can be represented in polar coordinates by setting
\[ x = r\cos\theta + c_1, \qquad y = r\sin\theta + c_2, \qquad \theta \in [0, 2\pi]. \tag{5.7} \]
To plot the circle, solve the least squares system for c1, c2, and r, define an array for θ, then use (5.7) to calculate the coordinates of the points on the circle.
# Load some data and construct the matrix A and the vector b.
>>> xk, yk = np.load("circle.npy").T
>>> A = np.column_stack((2*xk, 2*yk, np.ones_like(xk)))
>>> b = xk**2 + yk**2
# Calculate the least squares solution and solve for the radius.
>>> c1, c2, c3 = la.lstsq(A, b)[0]
>>> r = np.sqrt(c1**2 + c2**2 + c3)
Problem 4. The general equation for an ellipse is
\[ ax^2 + bx + cxy + dy + ey^2 = 1. \]
Write a function that calculates the parameters for the ellipse that best fits the data in the file ellipse.npy. Plot the original data points and the ellipse together, using the following function to plot the ellipse.
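The plotting helper itself was not preserved here; the following is a minimal sketch of such a function, obtained by substituting x = r cos θ, y = r sin θ into the ellipse equation and solving the resulting quadratic for r.

import numpy as np
from matplotlib import pyplot as plt

def plot_ellipse(a, b, c, d, e):
    """Plot an ellipse of the form ax^2 + bx + cxy + dy + ey^2 = 1."""
    theta = np.linspace(0, 2*np.pi, 200)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # For each angle, solve A*r^2 + B*r - 1 = 0 for the positive root r.
    A = a*(cos_t**2) + c*cos_t*sin_t + e*(sin_t**2)
    B = b*cos_t + d*sin_t
    r = (-B + np.sqrt(B**2 + 4*A)) / (2*A)
    plt.plot(r*cos_t, r*sin_t, lw=2)
    plt.gca().set_aspect("equal", "datalim")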
Computing Eigenvalues
The eigenvalues of an n×n matrix A are the roots of its characteristic polynomial det(A−λI). Thus,
finding the eigenvalues of A amounts to computing the roots of a polynomial of degree n. However,
for n ≥ 5, it is provably impossible to find an algebraic closed-form solution to this problem.4 In
addition, numerically computing the roots of a polynomial is a famously ill-conditioned problem,
meaning that small changes in the coefficients of the polynomial (brought about by small changes
in the entries of A) may yield wildly different results. Instead, eigenvalues must be computed with
iterative methods.
The Power Method
The power method approximates the dominant eigenvalue of a matrix, the unique eigenvalue of largest magnitude (when one exists). Beginning with a normalized vector x0, it iteratively computes
\[ x_{k+1} = \frac{Ax_k}{\|Ax_k\|_2}. \]
If A has a dominant eigenvalue λ, and if the projection of x0 onto the subspace spanned by the eigenvectors corresponding to λ is nonzero, then the sequence of vectors (x_k)_{k=0}^{\infty} converges to an eigenvector x of A corresponding to λ.
Since x is an eigenvector of A, Ax = λx. Left multiplying by x^T on each side results in x^T Ax = λx^T x, and hence λ = \frac{x^T Ax}{x^T x}. This ratio is called the Rayleigh quotient. However, since each x_k is normalized, x^T x = ∥x∥2² = 1, so λ = x^T Ax.
The entire algorithm is summarized below.
Algorithm 1
1: procedure PowerMethod(A)
2: m, n ← shape(A) ▷ A is square so m = n.
3: x0 ← random(n) ▷ A random vector of length n
4: x0 ← x0 /∥x0 ∥2 ▷ Normalize x0
5: for k = 0, 1, . . . , N − 1 do
6: xk+1 ← Axk
7: xk+1 ← xk+1 /∥xk+1 ∥2
8: return x_N^T A x_N , x_N
4 This result, called Abel’s impossibility theorem, was first proven by Niels Henrik Abel in 1824.
The power method is limited by a few assumptions. First, not all square matrices A have
a dominant eigenvalue. However, the Perron-Frobenius theorem guarantees that if all entries of
A are positive, then A has a dominant eigenvalue. Second, there is no way to choose an x0 that is
guaranteed to have a nonzero projection onto the span of the eigenvectors corresponding to λ, though
a random x0 will almost surely satisfy this condition. Even with these assumptions, a rigorous proof
that the power method converges is most convenient with tools from spectral calculus, and as such
will not be pursued here.
Problem 5. Write a function that accepts an n×n matrix A, a maximum number of iterations
N , and a stopping tolerance tol. Use Algorithm 1 to compute the dominant eigenvalue of A
and a corresponding eigenvector. Continue the loop in step 5 until either ∥xk+1 − xk ∥2 is less
than the tolerance tol, or until iterating the maximum number of times N .
Test your function on square matrices with all positive entries, verifying that Ax = λx.
Use SciPy’s eigenvalue solver, scipy.linalg.eig(), to compute all of the eigenvalues and
corresponding eigenvectors of A and check that λ is the dominant eigenvalue of A. There is
also a file called test_lstsq_eigs.py that has prewritten unit tests for this problem that you
can use to check your code.
The QR Algorithm
An obvious shortcoming of the power method is that it only computes one eigenvalue and eigenvector.
The QR algorithm, on the other hand, attempts to find all eigenvalues of A.
Let A0 = A, and for arbitrary k let Qk Rk = Ak be the QR decomposition of Ak . Since A is
square, so are Qk and Rk , so they can be recombined in reverse order:
Ak+1 = Rk Qk .
This recursive definition establishes an important relation between the Ak :
\[ Q_k^{-1}A_kQ_k = Q_k^{-1}(Q_kR_k)Q_k = (Q_k^{-1}Q_k)(R_kQ_k) = A_{k+1}. \]
Thus, Ak is orthonormally similar to Ak+1 , and similar matrices have the same eigenvalues. The
series of matrices (A_k)_{k=0}^{\infty} converges to the block matrix
\[
S = \begin{bmatrix}
S_1 & * & \cdots & * \\
0 & S_2 & \cdots & * \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & S_m
\end{bmatrix},
\qquad \text{for example,} \qquad
S = \begin{bmatrix}
s_1 & * & * & \cdots & * \\
0 & s_{2,1} & s_{2,2} & \cdots & * \\
0 & s_{2,3} & s_{2,4} & \cdots & * \\
\vdots & \vdots & \vdots & \ddots & * \\
0 & 0 & 0 & \cdots & s_m
\end{bmatrix}.
\]
Each Si is either a 1×1 or 2×2 matrix.5 In the example above on the right, since the first subdiagonal
entry is zero, S1 is the 1 × 1 matrix with a single entry, s1 . But as s2,3 is not zero, S2 is 2 × 2.
Since S is block upper triangular, its eigenvalues are the eigenvalues of its diagonal Si blocks.
Then because A is similar to each Ak , those eigenvalues of S are the eigenvalues of A.
When A has real entries but complex eigenvalues, 2 × 2 Si blocks appear in S. Finding eigen-
values of a 2 × 2 matrix is equivalent to finding the roots of a second-degree polynomial,
\[ \det(S_i - \lambda I) = \det\begin{bmatrix} a-\lambda & b \\ c & d-\lambda \end{bmatrix} = (a-\lambda)(d-\lambda) - bc = \lambda^2 - (a+d)\lambda + (ad - bc), \tag{5.8} \]
which has a closed form solution via the quadratic equation. This also demonstrates that complex
eigenvalues come in conjugate pairs.
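For instance, a small sketch of applying the quadratic formula to (5.8); the block entries below are made up, and cmath.sqrt() handles a negative discriminant gracefully.

import cmath

# Hypothetical entries of a 2x2 block S_i = [[a, b], [c, d]].
a, b, c, d = 0., -1., 1., 0.                  # eigenvalues should be +i and -i
trace, det = a + d, a*d - b*c
disc = cmath.sqrt(trace**2 - 4*det)           # complex square root of the discriminant
eig1, eig2 = (trace + disc) / 2, (trace - disc) / 2
print(eig1, eig2)                             # 1j and -1j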
Hessenberg Preconditioning
A matrix in upper Hessenberg form is one that has all entries below the first subdiagonal equal to
zero. This is similar to an upper triangular matrix, except that the entries directly below the diagonal
are also allowed to be nonzero. The QR algorithm works more accurately and efficiently on matrices
that are in upper Hessenberg form, as upper Hessenberg matrices are already close to triangular.
Furthermore, if H = QR is the QR decomposition of upper Hessenberg H then RQ is also upper
Hessenberg, so the almost-triangular form is preserved at each iteration. Putting a matrix in upper
Hessenberg form before applying the QR algorithm is called Hessenberg preconditioning.
5 If all of the Si are 1 × 1 matrices, then the upper triangular S is called the Schur form of A. If some of the Si are 2 × 2 matrices, then S is called the real Schur form of A.
Algorithm 2
1: procedure QR_Algorithm(A, N )
2: m, n ← shape(A)
3: S ← hessenberg(A) ▷ Put A in upper Hessenberg form.
4: for k = 0, 1, . . . , N − 1 do
5: Q, R ← S ▷ Get the QR decomposition of Ak .
6: S ← RQ ▷ Recombine Rk and Qk into Ak+1 .
7: eigs ← [] ▷ Initialize an empty list of eigenvalues.
8: i←0
9: while i < n do
10: if Si is 1 × 1 then
11: Append the only entry si of Si to eigs
12: else if Si is 2 × 2 then
13: Calculate the eigenvalues of Si
14: Append the eigenvalues of Si to eigs
15: i←i+1
16: i←i+1 ▷ Move to the next Si .
17: return eigs
Problem 6. Write a function that accepts an n × n matrix A and a number of iterations N. Use Algorithm 2 to compute the eigenvalues of A.
• If Si is 2 × 2, use the quadratic formula and (5.8) to compute its eigenvalues. Use the function cmath.sqrt() to correctly compute the square root of a negative number.
Test your function on small random symmetric matrices, comparing your results to SciPy’s
scipy.linalg.eig(). While the QR algorithm works on arbitrary matrices, it has better con-
vergence properties for symmetric matrices, which makes them better for testing. To construct
a random symmetric matrix, note that A + AT is always symmetric.
Unit Test
There is a file called test_lstsq_eigs.py that contains some prewritten unit tests for Problem
5. There is a place for you to add your own unit tests for Problem 6 called test_qr_algorithm.
You are required to include at least one unit test which will be graded.
Note
Algorithm 2 is theoretically sound, but can still be greatly improved. Most modern computer
packages instead use the implicit QR algorithm, an improved version of the QR algorithm, to
compute eigenvalues.
For large matrices, there are other iterative methods besides the power method and the
QR algorithm for efficiently computing eigenvalues. They include the Arnoldi iteration, the
Jacobi method, the Rayleigh quotient method, and others.
Additional Material
Variations on the Linear Least Squares Problem
If W is an n × n symmetric positive-definite matrix, then the function ∥ · ∥W2 : Rⁿ → R given by
\[ \|x\|_{W2} = \|Wx\|_2 = \sqrt{x^TW^TWx} \]
defines a norm and is called a weighted 2-norm. Given the overdetermined system Ax = b, the
problem of choosing x̂ to minimize ∥Ax̂ − b∥W2 is called a weighted least squares (WLS) problem.
This problem has a slightly different set of normal equations,
\[ A^TW^TWA\hat{x} = A^TW^TWb. \]
However, letting C = W A and z = W b, this equation reduces to the usual normal equations,
\[ C^TC\hat{x} = C^Tz, \]
so a WLS problem can be solved in the same way as an ordinary least squares (OLS) problem.
Weighted least squares is useful when some points in a data set are more important than others.
Typically W is chosen to be a diagonal matrix, and each positive diagonal entry Wi,i indicates how much weight should be given to the ith data point. For example, Figure 5.2a shows OLS and WLS
fits of an exponential curve y = aekx to data that gets more sparse as x increases, where the matrix
W is chosen to give more weight to the data with larger x values.
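A minimal sketch of this reduction is below; the data and the weights are invented for illustration, and the WLS fit is obtained by weighting the rows of A and b and then calling an ordinary least squares solver.

import numpy as np
from scipy import linalg as la

# Hypothetical data: fit a line, weighting later points more heavily.
x = np.linspace(0, 1, 50)
y = 2*x + 1 + np.random.normal(scale=0.1, size=50)

A = np.column_stack((x, np.ones_like(x)))
W = np.diag(1 + 10*x)                 # diagonal weights W_{i,i}, heavier for large x
C, z = W @ A, W @ y                   # reduce WLS to OLS with C = WA, z = Wb
slope, intercept = la.lstsq(C, z)[0]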
Alternatively, the least squares problem can be formulated with other common vector norms,
but such problems cannot be solved via the normal equations. For example, minimizing ∥Ax − b∥1 or
∥Ax−b∥∞ is usually done by solving an equivalent linear program, a type of constrained optimization
problem. These norms may be better suited to a particular application than the regular 2-norm.
Figure 5.2b illustrates how different norms give slightly different results in the context of Problem 4.
Figure 5.2: (a) Ordinary and weighted least squares fits for exponential data. (b) Best fits for elliptical data with respect to different vector norms (the 2-norm, ∞-norm, and 1-norm).
The Inverse Power Method
Given a scalar μ that approximates an eigenvalue of A, the inverse power method applies the power method to (A − μI)⁻¹ and converges to the eigenvector whose eigenvalue is closest to μ. The inverse power method is more expensive than the regular power method because at each iteration, instead of a matrix-vector multiplication (step 6 of Algorithm 1), a system of the form (A − µI)x = b must be solved. To speed this step up, start by taking the LU or QR factorization of A − µI before the loop, then use the factorization and back substitution to solve the system quickly within the loop. For instance, if QR = A − µI, then since Q⁻¹ = Qᵀ,
\[ b = (A - \mu I)x = QRx \iff Rx = Q^Tb, \]
which is a triangular system that can be solved quickly by back substitution.
Algorithm 3
1: procedure InversePowerMethod(A, µ)
2: m, n ← shape(A)
3: x0 ← random(n)
4: x0 ← x0 /∥x0 ∥
5: Q, R ← A − µI ▷ Factor A − µI with la.qr().
6: for k = 0, 1, 2, . . . , N − 1 do
7: Solve Rxk+1 = QT xk ▷ Use la.solve_triangular().
8: xk+1 ← xk+1 /∥xk+1 ∥
9: return x_N^T A x_N , x_N
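A hedged sketch of this factor-once, solve-many pattern follows; the matrix A and the shift mu below are placeholders.

import numpy as np
from scipy import linalg as la

n, mu = 50, 1.5                               # placeholder size and shift
A = np.random.random((n, n))
x = np.random.random(n)
x /= la.norm(x)

Q, R = la.qr(A - mu*np.eye(n))                # factor A - mu*I once, before the loop
for _ in range(100):
    x = la.solve_triangular(R, Q.T @ x)       # solve (A - mu*I) x_{k+1} = x_k
    x /= la.norm(x)
eigenvalue = x @ A @ x                        # Rayleigh quotient of the limiting vector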
It is worth noting that the QR algorithm can be improved with a similar technique: instead of
computing the QR factorization of Ak , factor the shifted matrix Ak − µk I, where µk is a guess for
an eigenvalue of A, and unshift the recombined factorization accordingly. That is, compute
Qk Rk = Ak − µk I,
Ak+1 = Rk Qk + µk I.
This technique yields the single-shift QR algorithm. Another variant, the practical QR algorithm, uses
intelligent shifts and recursively operates on smaller blocks of Ak+1 where possible. See [QSS10, TB97]
for further discussion.
6  Image Segmentation
Lab Objective: Graph theory has a variety of applications. A graph (or network) can be represented
in many ways on a computer. In this lab we study a common matrix representation for graphs and
show how certain properties of the matrix representation correspond to inherent properties of the
original graph. We also introduce tools for working with images in Python, and conclude with an
application of using graphs and linear algebra to segment images.
Graphs as Matrices
A graph is a mathematical structure that represents relationships between objects. Graphs are
defined by G = (V, E), where V is a set of vertices (or nodes) and E is a set of edges, each of which
connects one node to another. A graph can be classified in several ways.
• The edges of an undirected graph are bidirectional: if an edge goes from node A to node B,
then that same edge also goes from B to A. For example, the graphs G1 and G2 in Figure 6.1
are both undirected. In a directed graph, edges only go one way, usually indicated by an arrow
pointing from one node to another. In this lab, we focus on undirected graphs.
• The edges of a weighted graph have a weight assigned to them, such as G2 . A weighted graph
could represent a collection of cities with roads connecting them: each vertex would represent
a city, and the edges would represent roads between the cities. The length of each road could
be the weight of the corresponding edge. An unweighted graph like G1 does not have weights
assigned to its edges, but any unweighted graph can be thought of as a weighted graph by
assigning a weight of 1 to every edge.
(a) G1 , an unweighted undirected graph. (b) G2 , a weighted undirected graph.
Figure 6.1
The Laplacian matrix L of a graph G is defined by
\[ L = D - A, \tag{6.2} \]
where D is the degree matrix of G and A is the adjacency matrix of G. For G1 and G2, the Laplacian matrices L1 and L2 are
\[
L_1 = \begin{bmatrix}
3 & -1 & 0 & 0 & -1 & -1 \\
-1 & 3 & -1 & 0 & -1 & 0 \\
0 & -1 & 2 & -1 & 0 & 0 \\
0 & 0 & -1 & 3 & -1 & -1 \\
-1 & -1 & 0 & -1 & 3 & 0 \\
-1 & 0 & 0 & -1 & 0 & 2
\end{bmatrix},
\qquad
L_2 = \begin{bmatrix}
3 & -3 & 0 & 0 & 0 & 0 \\
-3 & 3 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & -1 & 3.5 & -2 & -0.5 \\
0 & 0 & 0 & -2 & 3 & -1 \\
0 & 0 & 0 & -0.5 & -1 & 1.5
\end{bmatrix}.
\]
Problem 1. Write a function that accepts the adjacency matrix A of a graph G. Use (6.1)
and (6.2) to compute the Laplacian matrix L of G.
(Hint: The diagonal entries of D can be computed in one line by summing A over an axis.)
Test your function on the graphs G1 and G2 from Figure 6.1 and validate your results
with scipy.sparse.csgraph.laplacian().
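For instance, a quick hedged check against SciPy, using the adjacency matrix of the weighted graph G2 implied by the Laplacian L2 shown above:

import numpy as np
from scipy.sparse import csgraph

# Adjacency matrix of G2: edge weights read off of the Laplacian L2 above.
A2 = np.array([[0., 3., 0., 0.,  0., 0. ],
               [3., 0., 0., 0.,  0., 0. ],
               [0., 0., 0., 1.,  0., 0. ],
               [0., 0., 1., 0.,  2., 0.5],
               [0., 0., 0., 2.,  0., 1. ],
               [0., 0., 0., 0.5, 1., 0. ]])

# csgraph.laplacian() computes D - A, which should reproduce L2.
print(csgraph.laplacian(A2))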
Connectivity
A connected graph is a graph where every vertex is connected to every other vertex by at least one
path. For example, G1 is connected, whereas G2 is not because there is no path from node 1 (or
node 2) to node 3 (or nodes 4, 5, or 6). The naïve brute-force algorithm for determining if a graph
is connected is to check that there is a path from each edge to every other edge. While this may
work for very small graphs, most interesting graphs have thousands of vertices, and for such graphs
this approach is prohibitively expensive. Luckily, an interesting result from algebraic graph theory
relates the connectivity of a graph to its Laplacian matrix.
If L is the Laplacian matrix of a graph, then the definition of D and the construction L = D −A
guarantees that the rows (and columns) of L must each sum to 0. Therefore L cannot have full rank,
so λ = 0 must be an eigenvalue of L. Furthermore, if L represents a graph that is not connected,
more than one of the eigenvalues of L must be zero. To see this, let J ⊂ {1, 2, . . . , N } such that the
vertices {vj }j∈J form a connected component of the graph, meaning that there is a path between
each pair of vertices in the set. Next, let x be the vector with entries
\[ x_k = \begin{cases} 1, & k \in J \\ 0, & k \notin J. \end{cases} \]
Then x is an eigenvector of L corresponding to the eigenvalue λ = 0.
For example, the example graph G2 has two connected components.
In fact, it can be shown that the number of zero eigenvalues of the Laplacian exactly equals
the number of connected components. This makes calculating how many connected components are
in a graph only as hard as calculating the eigenvalues of its Laplacian.
A Laplacian matrix L is always a positive semi-definite matrix when all weights in the graph
are positive, meaning that its eigenvalues are each nonnegative. The second smallest eigenvalue of
L is called the algebraic connectivity of the graph. It is clearly 0 for non-connected graphs, but
for a connected graph, the algebraic connectivity provides useful information about its sparsity or
“connectedness.” A higher algebraic connectivity indicates that the graph is more strongly connected.
Problem 2. Write a function that accepts the adjacency matrix A of a graph G and a small
tolerance value tol. Compute the number of connected components in G and its algebraic
connectivity. Consider all eigenvalues that are less than the given tol to be zero.
Use scipy.linalg.eig() or scipy.linalg.eigvals() to compute the eigenvalues of
the Laplacian matrix. These functions return complex eigenvalues (with negligible imaginary
parts); use np.real() to extract the real parts.
Unit Test
Write unit tests for Problem 2 in test_image_segmentation.py. There are example unit tests
for Problem 1 to help check the Laplacian.
Images as Matrices
Computer images are stored as arrays of integers that indicate pixel values. Most m × n grayscale
(black and white) images are stored in Python as m × n NumPy arrays, while most m × n color
images are stored as 3-dimensional m × n × 3 arrays. Color image arrays can be thought of as a stack
of three m × n arrays, one each for red, green, and blue values. The datatype for an image array is
np.uint8, unsigned 8-bit integers that range from 0 to 255. A 0 indicates a black pixel while a 255
indicates a white pixel.
Use imageio.imread() to read an image from a file and imageio.imwrite() to save an image.
Matplotlib’s plt.imshow() displays an image array, but it displays arrays of floats between 0 and 1
more cleanly than arrays of 8-bit integers. Therefore it is customary to scale the array by dividing
each entry by 255 before processing or showing the image. In this case, a 0 still indicates a black
pixel, but now a 1 indicates pure white.
A color image can be converted to grayscale by averaging the RGB values of each pixel, resulting
in a 2-D array called the brightness of the image. To properly display a grayscale image, specify the
keyword argument cmap="gray" in plt.imshow().
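A short hedged example of this workflow, assuming the color image file dream.png (which is used later in this lab) is available:

import numpy as np
from imageio import imread
from matplotlib import pyplot as plt

# Read the image and scale its entries to floats in [0, 1].
image = imread("dream.png") / 255.

# Average the RGB values to get the brightness matrix, then display it.
brightness = image.mean(axis=2)
plt.imshow(brightness, cmap="gray")
plt.show()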
Finally, it is often important in applications to flatten an image matrix into a large 1-D array.
Use np.ravel() to convert a m × n array into a 1-D array with mn entries.
# Unravel a grayscale image into a 1-D array and check its size.
>>> M,N = brightness.shape
>>> flat_brightness = np.ravel(brightness)
>>> M*N == flat_brightness.size
True
>>> print(flat_brightness.shape)
(2304,)
Problem 3. Define a class called ImageSegmenter.
1. Write the constructor so that it accepts the name of an image file. Read the image, scale
it so that it contains floats between 0 and 1, then store it as an attribute. If the image is
in color, compute its brightness matrix by averaging the RGB values at each pixel (if it is
a grayscale image, the image array itself is the brightness matrix). Flatten the brightness
matrix into a 1-D array and store it as an attribute.
2. Write a method called show_original() that displays the original image. If the original
image is grayscale, remember to use cmap="gray" as part of plt.imshow().
Achtung!
Matplotlib’s plt.imread() also reads image files. However, this function automatically scales
PNG image entries to floats between 0 and 1, but it still reads non-PNG image entries as 8-bit
integers. To avoid this inconsistent behavior, always use imageio.imread() to read images
and divide by 255 when scaling is desired.
There are many ways to approach image segmentation. The following algorithm, developed by
Jianbo Shi and Jitendra Malik in 2000 [SM00], converts the image to a graph and “cuts” it into two
connected components.
Each pixel of the image is a vertex of the graph, and the weight of the edge between pixels i and j is
\[ w_{ij} = \begin{cases} \exp\left(-\dfrac{|B(i)-B(j)|}{\sigma_B^2} - \dfrac{\|X(i)-X(j)\|}{\sigma_X^2}\right) & \text{if } \|X(i)-X(j)\| < r, \\ 0 & \text{otherwise,} \end{cases} \tag{6.3} \]
where B(i) is the brightness of pixel i, X(i) is its coordinate vector, and r, σB², and σX² are constants for tuning the algorithm. In this context, ∥ · ∥ is the standard euclidean norm, meaning that ∥X(i) − X(j)∥ is the physical distance between vertices i and j, measured in pixels.
With this definition for wij , pixels that are farther apart than the radius r are not connected at
all in G. Pixels within r of each other are more strongly connected if they are similar in brightness
and close together (the value in the exponential is negative but close to zero). On the other hand,
highly contrasting pixels where |B(i) − B(j)| is large have weaker connections (the value in the
exponential is highly negative).
Figure 6.3: The grid on the left represents a 4 × 4 (m × n) image with 16 pixels. On the right is the
corresponding 16 × 16 (mn × mn) adjacency matrix with all nonzero entries shaded. For example, in
row 5, entries 1, 4, 5, 6, and 9 are nonzero because those pixels are within radius r = 1.2 of pixel 5.
Since there are mn total pixels, the adjacency matrix A of G with entries wij is mn × mn. With
a relatively small radius r, A is relatively sparse, and should therefore be constructed and stored as
a sparse matrix. The degree matrix D is diagonal, so it can be stored as a regular 1-dimensional
NumPy array. The procedure for constructing these matrices can be summarized in just a few steps.
1. Initialize A as a sparse mn × mn matrix and D as a vector with mn entries.
2. For each vertex i (i = 0, 1, . . . , mn − 1),
(a) Find the set of all vertices Ji such that ∥X(i) − X(j)∥ < r for each j ∈ Ji . For example,
in Figure 6.3 i = 5 and Ji = {1, 4, 5, 6, 9}.
(b) Calculate the weights wij for each j ∈ Ji according to (6.3) and store them in A.
(c) Set the ith element of D to be the sum of the weights, d_i = \sum_{j \in J_i} w_{ij}.
The most difficult part to implement efficiently is step 2a, computing the neighborhood Ji of
the current pixel i. However, the computation only requires knowing the current index i, the radius
r, and the height and width m and n of the original image. The following function takes advantage
of this fact and returns (as NumPy arrays) both Ji and the distances ∥X(i) − X(j)∥ for each j ∈ Ji .
def get_neighbors(index, radius, height, width):
    """Calculate the indices and distances of the pixels that lie within the
    given radius of a central pixel in a flattened image array.

    Parameters:
        index (int): The index of a central pixel in a flattened image array
            with original shape (height, width).
        radius (float): Radius of the neighborhood around the central pixel.
        height (int): The height of the original image in pixels.
        width (int): The width of the original image in pixels.

    Returns:
        (1-D ndarray): the indices of the pixels that are within the specified
            radius of the central pixel, with respect to the flattened image.
        (1-D ndarray): the euclidean distances from the neighborhood pixels to
            the central pixel.
    """
    # Calculate the original 2-D coordinates of the central pixel.
    row, col = index // width, index % width

    # Get a grid of possible candidates that are close to the central pixel.
    r = int(radius)
    x = np.arange(max(col - r, 0), min(col + r + 1, width))
    y = np.arange(max(row - r, 0), min(row + r + 1, height))
    X, Y = np.meshgrid(x, y)

    # Determine which candidates are within the given radius of the pixel.
    R = np.sqrt(((X - col)**2 + (Y - row)**2))
    mask = R < radius
    return (X[mask] + Y[mask]*width).astype(int), R[mask]
To see how this works, consider Figure 6.3 where the original image is 4 × 4 and the goal is to
compute the neighborhood of the pixel i = 5.
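For example, a quick check of the neighborhood shown in Figure 6.3:

# Pixel 5 of a 4x4 image, with radius 1.2 (as in Figure 6.3).
>>> neighbors, distances = get_neighbors(5, 1.2, 4, 4)
>>> print(neighbors)
[1 4 5 6 9]
>>> print(distances)
[1. 1. 0. 1. 1.]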
Problem 4. Write a method for the ImageSegmenter class that accepts floats r defaulting to 5, σB² defaulting to .02, and σX² defaulting to 3. Compute the adjacency matrix A and the degree matrix D according to the weights specified in (6.3).
Initialize A as a scipy.sparse.lil_matrix, which is optimized for incremental construc-
tion. Fill in the nonzero elements of A one row at a time. Use get_neighbors() at each step
to help compute the weights.
(Hint: Try to compute and store an entire row of weights at a time. What does the command
A[5, np.array([1, 4, 5, 6, 9])] = weights do?)
Finally, convert A to a scipy.sparse.csc_matrix, which is faster for computations.
Then return A and D.
Use blue_heart.png to test A and D; the correct matrices are saved in the data files HeartMatrixA.npz and HeartMatrixD.npy.
>>> x = np.arange(-5,5).reshape((5,2)).T
>>> print(x)
[[-5 -3 -1 1 3]
[-4 -2 0 2 4]]
Problem 5. Write a method for the ImageSegmenter class that accepts an adjacency matrix
A as a scipy.sparse.csc_matrix and a degree matrix D as a 1-D NumPy array. Construct
an m × n boolean mask describing the segments of the image.
4. Reshape the eigenvector as a m × n matrix and use this matrix to construct the desired
boolean mask. Return the mask.
Multiplying the boolean mask component-wise by the original image array produces the positive
segment, a copy of the original image where the entries that aren’t in the segment are set to 0.
Computing the negative segment requires inverting the boolean mask, then multiplying the inverted
mask with the original image array. Finally, if the original image is a m × n × 3 color image, the
mask must be stacked into a m × n × 3 array to facilitate entry-wise multiplication.
Problem 6. Write a method for the ImageSegmenter class that accepts floats r, σB², and σX²,
with the same defaults as in Problem 4. Call your methods from Problems 4 and 5 to obtain the
segmentation mask. Plot the original image, the positive segment, and the negative segment
side-by-side in subplots. Your method should work for grayscale or color images.
Use dream.png as a test file and compare your results to Figure 6.2.
7  The SVD and Image Compression
Lab Objective: The Singular Value Decomposition (SVD) is an incredibly useful matrix factor-
ization that is widely used in both theoretical and applied mathematics. The SVD is structured in
a way that makes it easy to construct low-rank approximations of matrices, and it is therefore the
basis of several data compression algorithms. In this lab we learn to compute the SVD and use it to
implement a simple image compression routine.
[Diagram: the full SVD A = UΣV^H with U (m × m), Σ (m × n), and V^H (n × n). The compact SVD keeps only U1 (m × r), the first r columns u1, . . . , ur of U; Σ1 (r × r), the nonzero singular values σ1, . . . , σr; and V1^H (r × n), the first r rows v1^H, . . . , vr^H of V^H. The remaining columns ur+1, . . . , um, the zero singular values, and the rows vr+1^H, . . . , vn^H appear only in the full SVD.]
Finally, the SVD yields an outer product expansion of A in terms of the singular values and the
columns of U and V ,
\[ A = \sum_{i=1}^{r}\sigma_iu_iv_i^H. \tag{7.1} \]
Note that only terms from the compact SVD are needed for this expansion.
Algorithm 1
1: procedure compact_SVD(A)
Problem 1. Write a function that accepts a matrix A and a small error tolerance tol. Use
Algorithm 1 to compute the compact SVD of A. In step 6, compute r by counting the number
of singular values that are greater than tol.
Consider the following tips for implementing the algorithm.
• In step 4, the way that σ is sorted needs to be stored so that the columns of V can be
sorted the same way. Consider using np.argsort() and fancy indexing to do this, but
remember that by default it sorts from least to greatest (not greatest to least).
• Step 9 can be done by looping over the columns of V , but it can be done more easily and
efficiently with array broadcasting.
Test your function by calculating the compact SVD for random matrices. Verify that U
and V are orthonormal, that U ΣV H = A, and that the number of nonzero singular values is
the rank of A. You may also want to compare your results to SciPy’s SVD algorithm.
# Generate a random matrix and get its compact SVD via SciPy.
>>> A = np.random.random((10,5))
>>> U,s,Vh = la.svd(A, full_matrices=False)
>>> print(U.shape, s.shape, Vh.shape)
(10, 5) (5,) (5, 5)
An m × n matrix A defines a linear transformation that sends points from Rn to Rm . The SVD
decomposes a matrix into two rotations and a scaling, so that any linear transformation can be easily
described geometrically. Specifically, V H represents a rotation, Σ a rescaling along the principal axes,
and U another rotation.
Problem 2. Write a function that accepts a 2 × 2 matrix A. Generate a 2 × 200 matrix S representing 200 points on the unit circle, with the x-coordinates in the first row and the y-coordinates in the second row, and a matrix E containing the standard basis vectors of R² (with a column of zeros between them), so that plotting the first row of S against the second row of S displays the unit circle, and plotting the first row of E against its second row displays the standard basis vectors in R².
Compute the full SVD A = U ΣV H using scipy.linalg.svd(). Plot four subplots to
demonstrate each step of the transformation, plotting S and E, V H S and V H E, ΣV H S and
ΣV H E, then U ΣV H S and U ΣV H E.
For the matrix
\[ A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}, \]
your function should produce Figure 7.1.
(Hint: Use plt.axis("equal") to fix the aspect ratio so that the circles don’t appear elliptical.)
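A minimal sketch of constructing S and E as described above (assuming 200 points on the circle):

import numpy as np

theta = np.linspace(0, 2*np.pi, 200)
S = np.vstack((np.cos(theta), np.sin(theta)))     # 2 x 200: points on the unit circle
E = np.array([[1., 0., 0.],                       # columns e1, 0, e2
              [0., 0., 1.]])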
(a) S        (b) V^H S        (c) ΣV^H S        (d) UΣV^H S
Figure 7.1: Each step in transforming the unit circle and two unit vectors using the matrix A.
If A is a m × n matrix of rank r < min{m, n}, then the compact SVD offers a way to store A with
less memory. Instead of storing all mn values of A, storing the matrices U1 , Σ1 and V1 only requires
saving a total of mr + r + nr values. For example, if A is 100 × 200 and has rank 20, then A has
20, 000 values, but its compact SVD only has total 6, 020 entries, a significant decrease.
The truncated SVD is an approximation to the compact SVD that allows even greater efficiency
at the cost of a little accuracy. Instead of keeping all of the nonzero singular values, the truncated
SVD only keeps the first s < r singular values, plus the corresponding columns of U and V . In this
case, (7.1) becomes
\[ A_s = \sum_{i=1}^{s}\sigma_iu_iv_i^H. \]
[Diagram: the truncated SVD keeps Û (m × s), the first s columns u1, . . . , us of U1; Σ̂ (s × s), the largest s singular values σ1, . . . , σs; and V̂^H (s × n), the first s rows v1^H, . . . , vs^H of V1^H. The remaining columns us+1, . . . , ur, singular values σs+1, . . . , σr, and rows vs+1^H, . . . , vr^H of the compact SVD U1 (m × r), Σ1 (r × r), V1^H (r × n) are discarded.]
The beauty of the SVD is that it makes it easy to select the information that is most important.
Larger singular values correspond to columns of U and V that contain more information, so dropping
the smallest singular values retains as much information as possible. In fact, given a matrix A, its
rank-s truncated SVD approximation As is the best rank s approximation of A with respect to both
the induced 2-norm and the Frobenius norm. This result is called the Schmidt, Mirsky, Eckart-Young
theorem, a very significant concept that appears in signal processing, statistics, machine learning,
semantic indexing (search engines), and control theory.
Problem 3. Write a function that accepts a matrix A and a positive integer s.
1. Use your function from Problem 1 or scipy.linalg.svd() to compute the compact SVD of A, then form the truncated SVD by stripping off the appropriate columns and entries from U1, Σ1, and V1. Return the best rank s approximation As of A (with respect to the induced 2-norm and Frobenius norm).
2. Also return the number of entries required to store the truncated form ÛΣ̂V̂^H (where Σ̂ is stored as a one-dimensional array, not the full diagonal matrix). The number of entries stored in a NumPy array can be accessed by its size attribute.
3. If s is greater than the number of nonzero singular values of A (meaning s > rank(A)),
raise a ValueError.
Unit Test
Write a unit test for Problem 3 to check your low rank SVD approximation. The unit test
can be found in the file test_svd_image_compression.py, and you will edit the test function test_svd_approx.
There is an example unit test for Problem 1 to help you make your unit test.
Another result of the Schmidt, Mirsky, Eckart-Young theorem is that the exact 2-norm error of the
best rank-s approximation As for the matrix A is the (s + 1)th singular value of A:
∥A − As ∥2 = σs+1 . (7.2)
This offers a way to approximate A within a desired error tolerance ε: choose s such that σs+1 is the
largest singular value that is less than ε, then compute As . This As throws away as much information
as possible without violating the property ∥A − As ∥2 < ε.
Problem 4. Write a function that accepts a matrix A and an error tolerance ε.
1. Compute the compact SVD of A, then use (7.2) to compute the lowest rank approximation
As of A with 2-norm error less than ε. Avoid calculating the SVD more than once.
(Hint: np.argmax(), np.where(), and/or fancy indexing may be useful.)
2. As in the previous problem, also return the number of entries needed to store the resulting
approximation As via the truncated SVD.
3. If ε is less than or equal to the smallest singular value of A, raise a ValueError; in this
case, A cannot be approximated within the tolerance by a matrix of lesser rank.
This function should be close to identical to the function from Problem 3, but with the extra
step of identifying the appropriate s. Construct test cases to validate that ∥A − As ∥2 < ε.
Image Compression
Images are stored on a computer as matrices of pixel values. Sending an image over the internet or
a text message can be expensive, but computing and sending a low-rank SVD approximation of the
image can considerably reduce the amount of data sent while retaining a high level of image detail.
Successive levels of detail can be sent after the initial low-rank approximation by sending additional
singular values and the corresponding columns of V and U.
Examining the singular values of an image gives us an idea of how low-rank the approximation
can be. Figure 7.2 shows the image in hubble_gray.jpg and a log plot of its singular values. The
plot in 7.2b is typical for a photograph—the singular values start out large but drop off rapidly.
In this rank 1041 image, 913 of the singular values are 100 or more times smaller than the largest
singular value. By discarding these relatively small singular values, we can retain all but the finest
image details, while storing only a rank 128 image. This is a huge reduction in data size.
(a) NGC 3603 (Hubble Space Telescope). (b) Singular values on a log scale.
Figure 7.2
Figure 7.3 shows several low-rank approximations of the image in Figure 7.2a. Even at a low
rank the image is recognizable. By rank 120, the approximation differs very little from the original.
Figure 7.3
Grayscale images are stored on a computer as 2-dimensional arrays, while color images are
stored as 3-dimensional arrays—one layer each for red, green, and blue arrays. To read and display
images, use imageio.imread() and plt.imshow(). Images are read in as integer arrays with entries
between 0 and 255 (dtype=np.uint8), but plt.imshow() works better if the image is an array of
floats in the interval [0, 1]. Scale the image properly by dividing the array by 255.
# The final axis has 3 layers for red, green, and blue values.
>>> red_layer = image_color[:,:,0]
>>> red_layer.shape
(1158, 1041)
Problem 5. Write a function that accepts the name of an image file and an integer s. Use
your function from Problem 3 to compute the best rank-s approximation of the image. Plot
the original image and the approximation in separate subplots. In the figure title, report the
difference in number of entries required to store the original image and the approximation (use
plt.suptitle()).
Your function should be able to handle both grayscale and color images. Read the image
in and check its dimensions to see if it is color or not. Grayscale images can be approximated
directly since they are represented by 2-dimensional arrays. For color images, let R, G, and B
be the matrices for the red, green, and blue layers of the image, respectively. Calculate the low-
rank approximations Rs , Gs , and Bs separately, then put them together in a new 3-dimensional
array of the same shape as the original image.
(Hint: np.dstack() may be useful for putting the color layers back together.)
Finally, it is possible for the low-rank approximations to have values slightly outside the
valid range of RGB values. Set any values outside of the interval [0, 1] to the closer of the two
boundary values.
(Hint: fancy indexing and/or np.clip() may be useful here.)
To check, compressing hubble_gray.jpg with a rank 20 approximation should appear
similar to Figure 7.3b and save 1, 161, 478 matrix entries.
Additional Material
More on Computing the SVD
For an m × n matrix A of rank r < min{m, n}, the compact SVD of A neglects the last m − r columns of U and the last n − r columns of V. The remaining columns of each matrix can be calculated by using Gram-Schmidt orthonormalization. If r = m < n or r = n < m, only one of U1 and V1 will need to be filled in to construct the full U or V. Computing these extra columns is one way to obtain
a basis for N (AH ) or N (A).
Algorithm 1 begins with the assumption that we have a way to compute the eigenvalues and
eigenvectors of AH A. Computing eigenvalues is a notoriously difficult problem, and computing the
SVD from scratch without an eigenvalue solver is much more difficult than the routine described by
Algorithm 1. The procedure involves two phases:
1. Factor A into A = Ua BVaH where B is bidiagonal (only nonzero on the diagonal and the first
superdiagonal) and Ua and Va are orthonormal. This is usually done via Golub-Kahan Bidi-
agonalization, which uses Householder reflections, or Lawson-Hanson-Chan bidiagonalization,
which relies on the QR decomposition.
2. Factor B into B = Ub ΣVbH by the QR algorithm or a divide-and-conquer algorithm. Then the
SVD of A is given by A = (Ua Ub )Σ(Va Vb )H .
For more details, see Lecture 31 of [TB97] or Section 5.4 of Applied Numerical Linear Algebra by
James W. Demmel.
from matplotlib.animation import FuncAnimation

def animate_images(images):
    """Animate a sequence of images. The input is a list where each
    entry is an array that will be one frame of the animation.
    """
    fig = plt.figure()
    plt.axis("off")
    im = plt.imshow(images[0], animated=True)

    def update(index):
        plt.title("Rank {} Approximation".format(index))
        im.set_array(images[index])
        return im,              # Note the comma!

    # Create the animation, drawing one frame per image, then display it.
    a = FuncAnimation(fig, update, frames=len(images), blit=False)
    plt.show()
8  Facial Recognition
Lab Objective: Facial recognition algorithms attempt to match a person’s portrait to a database
of many portraits. Facial recognition is becoming increasingly important in security, law enforcement,
artificial intelligence, and other areas. Though humans can easily match pictures to people, computers
are beginning to surpass humans at facial recognition. In this lab, we implement a basic facial
recognition system that relies on eigenvectors and the SVD to efficiently determine the difference
between faces.
\[ F = \begin{bmatrix} f_1 & f_2 & \cdots & f_k \end{bmatrix}, \]
import os
import numpy as np
from imageio import imread

def get_faces(path="./faces94"):
    # Traverse the directory and get one image per subdirectory.
    # See http://cswww.essex.ac.uk/mv/allfaces/faces94.html for the dataset.
    faces = []
    for (dirpath, dirnames, filenames) in os.walk(path):
        for fname in filenames:
            if fname[-3:] == "jpg":         # Only get jpg images.
                # Load the image, convert it to grayscale,
                # and flatten it into a vector.
                faces.append(np.ravel(imread(dirpath+"/"+fname, as_gray=True)))
                break
    # Put all the face vectors column-wise into a matrix.
    return np.transpose(faces)
Problem 1. Write a function that accepts an image as a flattened mn-vector, along with its
original dimensions m and n. Use np.reshape() to convert the flattened image into its original
m × n shape and display the result with plt.imshow().
(Hint: use cmap="gray" in plt.imshow() to display images in grayscale.)
Unzip the faces94.zip archive and use get_faces() to construct F . Each faces94
image is 200 × 180, and there are 153 people in the dataset, so F should be 36000 × 153. Use
your function to display one of the images stored in F .
The mean face µ is the average of the columns of F, µ = \frac{1}{k}\sum_{i=1}^{k}f_i, and each face is mean-shifted by subtracting µ:
\[ \bar{f}_i = f_i - \mu. \]
Next, define F̄ as the mn × k matrix whose columns are given by the mean-shifted face vectors,
\[ \bar{F} = \begin{bmatrix} \bar{f}_1 & \bar{f}_2 & \cdots & \bar{f}_k \end{bmatrix}. \]
(a) The mean face. (b) An original face. (c) A mean-shifted face.
Figure 8.1
Problem 2. Write a class called FacialRec whose constructor accepts a path to a directory
of images. In the constructor, use get_faces() to construct F , then compute the mean face
µ and the shifted faces F̄ . Store each array as an attribute.
(Hint: Both µ and F̄ can be computed in a single line of code by using NumPy functions and/or
array broadcasting.)
Use your function from Problem 1 to visualize the mean face, and compare it to Figure
8.1a. Also display an original face and its corresponding mean-shifted face. Compare your
results with Figures 8.1b and 8.1c.
To increase computational efficiency and minimize storage, the face vectors can be represented
with fewer values by projecting F̄ onto a lower-dimensional subspace. Let s be a natural number
such that s < r, where r is the rank of F̄ . By projecting F̄ onto an s-dimensional subspace, each
face can be stored with only s values.
Specifically, let U ΣV H be the compact SVD of F̄ with rank r, which can also be represented by
\[ \bar{F} = \sum_{i=1}^{r}\sigma_iu_iv_i^H. \]
The first r columns of U form a basis for the range of F̄ . Recall that the Schmidt, Mirsky, Eckart-
Young Theorem states that the matrix
\[ \bar{F}_s = \sum_{i=1}^{s}\sigma_iu_iv_i^H \]
is the best rank-s approximation of F̄ for each s < r. This means that ∥F̄ − F̄s∥ is minimized against
all other ∥F̄ − B∥ where B has rank s. As a consequence of this theorem, the first s columns of U
form a basis that provides the “best” s-dimensional subspace for approximating F̄ .
The s basis vectors u1, . . . , us are commonly called the eigenfaces because they are eigenvectors of F̄F̄^T and because they resemble face images. Each original face image can be efficiently represented in terms of these eigenfaces. See Figure 8.2 for visualizations of some of the eigenfaces for the faces94 data set.
In general, the lower-order eigenfaces capture more general information about a face, and the higher-order eigenfaces provide the details necessary to distinguish particular faces [MMH04]. These eigenfaces
will be used to construct the face images in the dataset. The more eigenfaces used, the more detailed
the resulting image will be.
Next, let Us be the matrix with the first s eigenfaces as columns. Since the eigenfaces {u_i}_{i=1}^{s} form an orthonormal set, the columns of Us are orthonormal for any choice of s, and hence U_s^T U_s = I. The matrix P_s = U_sU_s^T projects vectors in R^{mn} to the subspace spanned by the orthonormal basis {u_i}_{i=1}^{s}, and the change of basis matrix U_s^T puts the projection in terms of the basis of eigenfaces. Thus the projection f̂_i of f̄_i in terms of the basis of eigenfaces is given by
\[ \hat{f}_i = U_s^TP_s\bar{f}_i = U_s^TU_sU_s^T\bar{f}_i = U_s^T\bar{f}_i. \tag{8.1} \]
Note carefully that though the shifted image f̄_i has mn entries, the projection f̂_i has only s entries since Us is mn × s. Likewise, the matrix F̂ that has the projections f̂_i as columns is s × k, and
\[ \hat{F} = U_s^T\bar{F}. \tag{8.2} \]
Problem 3. In the constructor of FacialRec, calculate the compact SVD of F̄ and save the
matrix U as an attribute. Compare the computed eigenfaces (the columns of U ) to Figure 8.2.
Also write a method that accepts a vector of length mn or an mn × ℓ matrix, as well as
an integer s < mn. Construct Us by taking the first s columns of U , then use (8.1) or (8.2) to
calculate the projection of the input vector or matrix onto the span of the first s eigenfaces.
(Hint: this method should be implemented with a single line of code.)
Reducing the mean-shifted face image f̄_i to the lower-dimensional projection f̂_i drastically re-
duces the computational cost of the facial recognition algorithm, but this efficiency gain comes at
a price. A projection image only approximates the corresponding original image, but as long as s
isn’t too small, the approximation is usually good enough for the algorithm to work well. Before
completing the facial recognition system, we reconstruct some of these projections to visualize the
amount of information lost.
From (8.1), since U_s^T projects f̄_i and performs a change of basis to get f̂_i, its transpose Us puts f̂_i back into the original basis with as little error as possible. That is,
\[ U_s\hat{f}_i \approx \bar{f}_i = f_i - \mu, \]
so adding back the mean face gives the reconstruction
\[ \tilde{f}_i = U_s\hat{f}_i + \mu \approx f_i. \tag{8.3} \]
(a) A reconstruction with s = 5. (b) A reconstruction with s = 19. (c) A reconstruction with s = 75.
Figure 8.3: An image rebuilt with various numbers of eigenfaces. The image is already recognizable
when it is reconstructed with only 19 eigenfaces, less than an eighth of the 153 eigenfaces corresponding to nonzero eigenvalues of F̄F̄^T. Note the similarities between this method and regular image
compression via the truncated SVD.
Problem 4. Instantiate a FacialRec object that draws from the faces94 dataset. Select one
of the shifted images f̄_i. For at least 4 values of s, use your method from Problem 3 to compute the corresponding s-projection f̂_i, then use (8.3) to compute the reconstruction f̃_i. Display the various reconstructions and the original image. Compare your results to Figure 8.3.
Matching Faces
Let g be a vector representing an unknown face that is not part of the database. We determine which
image in the database is most like g by comparing ĝ to each of the f̂_i. First, shift g by the mean to obtain ḡ, then project ḡ using a given number of eigenfaces:
\[ \hat{g} = U_s^T\bar{g} = U_s^T(g - \mu). \tag{8.4} \]
Setting
\[ j = \underset{i}{\operatorname{argmin}}\ \|\hat{f}_i - \hat{g}\|_2, \tag{8.5} \]
we have that the jth face image f_j is the best match for g. Again, since f̂_i and ĝ only have s entries, the computation in (8.5) is much cheaper than comparing the raw f_i to g.
Problem 5. Write a method for the FacialRec class that accepts an image vector g and an
integer s. Use your method from Problem 3 to compute F̂ and ĝ for the given s, then use (8.5)
to determine the best matching face in the database. Return the index of the matching face.
(Hint: scipy.linalg.norm() and np.argmin() may be useful.)
Note
This facial recognition system works by solving a nearest neighbor search, since the goal is to
find the fi that is “nearest” to the input image g. Nearest neighbor searches can be performed
more efficiently with the use of a k-d tree, a binary search tree for storing vectors. The system
could also be called a k-neighbors classifier with k = 1.
Problem 6. Write a method for the FacialRec class that accepts a flat image vector g, an
integer s, and the original dimensions of g. Use your method from Problem 5 to find the index
j of the best matching face, then display the original face g alongside the best match fj .
The following generator yields random faces from faces94 that can be used as test cases.
def sample_faces(num_faces, files):
    """Yield num_faces random face vectors. Here 'files' is assumed to be a
    list of paths to the faces94 images (gathered, e.g., as in get_faces())."""
    # Get a subset of the image names and yield the images one at a time.
    test_files = np.random.choice(files, num_faces, replace=False)
    for fname in test_files:
        yield np.ravel(imread(fname, as_gray=True))
The yield keyword is like a return statement, but the next time the generator is called, it will
resume immediately after the last yield statement.a
Use sample_faces() to get at least 5 random faces from faces94, and match each random
face to the database with s = 38. Iterate through the random faces with the following syntax.
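A hedged sketch of that loop; the files list and the matching method name below are placeholders for illustration, not part of the original lab.

recognizer = FacialRec("./faces94")          # the class from Problems 2-6
for g in sample_faces(5, files):             # files: a list of faces94 image paths (assumption)
    # Match each flattened test face to the database with s = 38.
    j = recognizer.match(g, s=38)            # hypothetical method name
    print("Best match index:", j)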
Although there are other approaches to facial recognition that utilize more complex techniques,
the method of eigenfaces remains a wonderfully simple and effective solution.
Additional Material
Improvements on the Facial Recognition System with Eigenfaces
The FacialRec class does its job well, but it could be improved in several ways. Here are a few ideas.
• The most computationally intensive part of the algorithm is computing F̂. Instead of recomputing F̂ every time the method from Problem 5 is called, store F̂ and s as attributes the first time the method is called. In subsequent calls, only recompute F̂ if the user specifies a different
value for s.
• Load a scipy.spatial.KDTree object with F̂ and use its query() method to compute (8.5).
Building a kd-tree is expensive, so be sure to only build a new tree when necessary (i.e., the
user specifies a new value for s).
• Include an error tolerance ε in the method for Problem 5. If ∥fj − g∥ > ε, print a message or
raise an exception to indicate that there is no suitable match for g in the database. In this
case, add g to the database for future reference.
• Generalize the system by turning it into a k-neighbors classifier. In the constructor, add several
faces per person to the database (this requires modifying get_faces()). Assign each individual
a unique ID so that the system knows which faces correspond to the same person. Modify the
method from Problem 5 so that it also accepts an integer k, then use scipy.spatial.KDTree
to find the k nearest images to g. Choose the ID that belongs to the most nearest neighbors,
then return an index that corresponds to an individual with that ID.
In other words, choose the k faces f_i that give the smallest values of ∥f̂_i − ĝ∥2. These faces then
get to vote on which person g belongs to.
• Improve the user interface of the class by modifying the method from Problem 6 so that it
accepts a file name to read from instead of an array. A few lines of code from get_faces() or
sample_faces() might be helpful for this.
9  Differentiation
Lab Objective: Derivatives are central in many applications. Depending on the application and
on the available information, the derivative may be calculated symbolically, numerically, or with
differentiation software. In this lab we explore these three ways to take a derivative, discuss what
settings they are each appropriate for, and demonstrate their strengths and weaknesses.
Symbolic Differentiation
The derivative of a known mathematical function can be calculated symbolically with SymPy. This
method is the most precise way to take a derivative, but it is computationally expensive and requires
knowing the closed form formula of the function. Use sy.diff() to take a symbolic derivative.
>>> x = sy.symbols('x')
>>> sy.diff(x**3 + x, x) # Differentiate x^3 + x with respect to x.
3*x**2 + 1
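Problem 1 below also requires converting a SymPy expression into a function that accepts NumPy arrays; a brief sketch of sy.lambdify(), applied to the derivative computed above:

>>> import numpy as np
>>> f = sy.lambdify(x, sy.diff(x**3 + x, x), "numpy")   # NumPy-aware callable
>>> f(np.array([0., 1., 2.]))
array([ 1.,  4., 13.])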
Problem 1. Write a function that defines f(x) = (sin(x) + 1)^(sin(cos(x))) and takes its symbolic
derivative with respect to x using SymPy. Lambdify the resulting function so that it can accept
NumPy arrays and return the resulting function handle.
Hint: You can test your function by plotting f and its derivative f ′ over the domain [−π, π]. It
may be helpful to move the bottom spine to 0 so you can see where the derivative crosses the
x-axis. Note: Do NOT include this in the final code for this problem.
>>> ax = plt.gca()
>>> ax.spines["bottom"].set_position("zero")
Numerical Differentiation
One definition for the derivative of a function f : R → R at a point x0 is
\[ f'(x_0) = \lim_{h\to 0}\frac{f(x_0 + h) - f(x_0)}{h}. \]
Since this definition relies on h approaching 0, choosing a small, fixed value for h approximates f ′ (x0 ):
\[ f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h}. \tag{9.1} \]
This approximation is called the first order forward difference quotient. Using the points x0 and
x0 − h in place of x0 + h and x0 , respectively, results in the first order backward difference quotient,
\[ f'(x_0) \approx \frac{f(x_0) - f(x_0 - h)}{h}. \tag{9.2} \]
Forward difference quotients use values of f at x0 and points greater than x0 , while backward
difference quotients use the values of f at x0 and points less than x0 . A centered difference quotient
uses points on either side of x0 , and typically results in a better approximation than the one-sided
quotients. Combining (9.1) and (9.2) yields the second order centered difference quotient,
Figure 9.1: Sample grid points for the difference quotients: the forward quotient uses x̄ and x̄ + h, the centered quotient uses x̂ − h, x̂, and x̂ + h, and the backward quotient uses x̃ − h and x̃.
Note
The finite difference quotients in this section all approximate the first derivative of a function.
The terms first order and second order refers to how quickly the approximation converges on
the actual value of f ′ (x0 ) as h approaches 0, not to how many derivatives are being taken.
There are finite difference quotients for approximating higher order derivatives, such as
f'' or f'''. For example, the centered difference quotient
\[ f''(x_0) \approx \frac{f(x_0 - h) - 2f(x_0) + f(x_0 + h)}{h^2} \]
approximates the second derivative.
While we do not derive them here, there are other finite difference quotients that use more points to approximate the derivative, some of which are listed in Table 9.1. Using more points generally results in better convergence properties.

Type | Order | Formula
Forward | 1 | (f(x0 + h) − f(x0)) / h
Forward | 2 | (−3f(x0) + 4f(x0 + h) − f(x0 + 2h)) / 2h
Backward | 1 | (f(x0) − f(x0 − h)) / h
Backward | 2 | (3f(x0) − 4f(x0 − h) + f(x0 − 2h)) / 2h
Centered | 2 | (f(x0 + h) − f(x0 − h)) / 2h
Centered | 4 | (f(x0 − 2h) − 8f(x0 − h) + 8f(x0 + h) − f(x0 + 2h)) / 12h

Table 9.1: Common difference quotients for approximating f′(x0).
Problem 2. Write a function for each of the finite difference quotients listed in Table 9.1. Each
function should accept a function handle f , an array of points x, and a float h; each should
return an array of the difference quotients evaluated at each point in x.
To test your functions, approximate the derivative of f(x) = (sin(x) + 1)^(sin(cos(x))) at each
point of a domain over [−π, π]. Plot the results and compare them to the results of Problem 1.
If f'' is continuous, then for any δ > 0, setting M = sup_{x∈(x0−δ, x0+δ)} |f''(x)| guarantees that
\[ \left|\frac{R_2(h)}{h}\right| \le |h|\int_0^1 M\,dt = M|h| \in O(h) \]
whenever |h| < δ. That is, the error decreases at the same rate as h. If h gets twice as small, the error
does as well. This is what is meant by a first order approximation. In a second order approximation,
the absolute error is O(h2 ), meaning that if h gets twice as small, the error gets four times smaller.
Note
The notation O(f (n)) is commonly used to describe the temporal or spatial complexity of an
algorithm. In that context, a O(n2 ) algorithm is much worse than a O(n) algorithm. However,
when referring to error, a O(h2 ) algorithm is better than a O(h) algorithm because it means
that the accuracy improves faster as h decreases.
Problem 3. Write a function that accepts a point x0 at which to compute the derivative of
f(x) = (sin(x) + 1)^(sin(cos(x))). Use your function from Problem 1 to compute the exact value of f′(x0). Then use each of your functions from Problem 2 to get an approximate derivative f̃′(x0)
for h = 10−8 , 10−7 , . . . , 10−1 , 1. Track the absolute error |f ′ (x0 ) − f˜′ (x0 )| for each trial, then
plot the absolute error against h on a log-log scale (use plt.loglog()).
Instead of using np.linspace() to create an array of h values, use np.logspace(). This
function generates logarithmically spaced values between two powers of 10.
[Figure: the absolute error of each difference quotient from Problem 2, including the order 4 centered quotient, plotted against h on a log-log scale.]
93
Achtung!
Mathematically, choosing smaller h values results in tighter approximations of f ′ (x0 ). However,
Problem 3 shows that when h gets too small, the error stops decreasing. This numerical error comes from floating point round-off: the subtraction in the numerator suffers catastrophic cancellation as f(x0 + h) and f(x0) become nearly equal, and dividing by the very small denominator amplifies that error. The optimal value of h is usually one that is small, but not too small.
Problem 4. The radar stations A and B, separated by the distance a = 500 m, track a plane
C by recording the angles α and β at one-second intervals. Your goal, back at air traffic control,
is to determine the speed of the plane.a
Let the position of the plane at time t be given by (x(t), y(t)). The speed at time t is the magnitude of the velocity vector, ‖(d/dt)(x(t), y(t))‖ = √(x′(t)² + y′(t)²). The closed forms of the
functions x(t) and y(t) are unknown (and may not exist at all), but we can still use numerical
methods to estimate x′ (t) and y ′ (t). For example, at t = 3, the second order centered difference
quotient for x′ (t) is
x′(3) ≈ (x(3 + h) − x(3 − h)) / (2h) = (x(4) − x(2)) / 2.
In this case h = 1 since data comes in from the radar stations at 1 second intervals.
Successive readings for α and β at integer times t = 7, 8, . . . , 14 are stored in the file
plane.npy. Each row in the array represents a different reading; the columns are the observation
time t, the angle α (in degrees), and the angle β (also in degrees), in that order. The Cartesian
coordinates of the plane can be calculated from the angles α and β as follows.
Load the data, convert α and β to radians, then compute the coordinates x(t) and y(t) at
each given t using (9.4). Approximate x′(t) and y′(t) using a first order forward difference
quotient for t = 7, a first order backward difference quotient for t = 14, and a second order
centered difference quotient for t = 8, 9, . . . , 13 (see Figure 9.1). Return the values of the speed √(x′(t)² + y′(t)²) at each t.
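A sketch of the differencing step, assuming arrays x and y already hold the coordinates at t = 7, 8, . . . , 14 (so h = 1); the variable names are illustrative.

import numpy as np

xp, yp = np.empty_like(x), np.empty_like(y)
xp[0], yp[0] = x[1] - x[0], y[1] - y[0]          # First order forward difference at t = 7.
xp[-1], yp[-1] = x[-1] - x[-2], y[-1] - y[-2]    # First order backward difference at t = 14.
xp[1:-1] = (x[2:] - x[:-2]) / 2                  # Centered differences at t = 8, ..., 13.
yp[1:-1] = (y[2:] - y[:-2]) / 2
speed = np.sqrt(xp**2 + yp**2)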
Finite difference quotients can also be used to approximate derivatives in higher dimensions. The
Jacobian matrix of a function f : Rn → Rm at a point x0 ∈ Rn is the m × n matrix J whose entries
are given by
J_ij = ∂f_i/∂x_j (x0).
The difference quotients in this case resemble directional derivatives. The first order forward
difference quotient for approximating a partial derivative is
∂f/∂x_j (x0) ≈ (f(x0 + h e_j) − f(x0)) / h,
where e_j is the jth standard basis vector. The second order centered difference approximation is
∂f/∂x_j (x0) ≈ (f(x0 + h e_j) − f(x0 − h e_j)) / (2h).
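These quotients can be assembled column by column into an approximate Jacobian. The following is a minimal sketch with assumed names, not a prescribed implementation.

import numpy as np

def jacobian_fd(f, x0, h=1e-5):
    # Approximate the Jacobian of f: R^n -> R^m at x0 with first order forward differences.
    x0 = np.asarray(x0, dtype=float)
    fx = f(x0)
    J = np.empty((fx.size, x0.size))
    for j in range(x0.size):
        ej = np.zeros_like(x0)
        ej[j] = 1.0                      # The jth standard basis vector.
        J[:, j] = (f(x0 + h * ej) - fx) / h
    return J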
Differentiation Software
Many machine learning algorithms and structures, especially neural networks, rely on the gradient of a
cost or objective function. To facilitate their research, several organizations have recently developed
Python packages for numerical differentiation. For example, the Harvard Intelligent Probabilistic
Systems Group (HIPS) started developing autograd in 2014 (https://github.com/HIPS/autograd)
and Google created JAX (https://github.com/google/jax) as a successor to autograd. Popular
deep learning libraries also contain automatic differentiation libraries. These tools use an algorithm
known as automatic differentiation that is incredibly robust: they can differentiate functions with
NumPy routines, if statements, while loops, and even recursion.
We conclude with a brief introduction to JAX. It can be installed as follows on Mac and Linux:
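The original install command is not reproduced here; a typical CPU-only installation is the pip command below, though this is an assumption and the JAX repository should be consulted for current instructions.

$ pip install jax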
Installation directly via pip is not currently supported on Windows, however. Some unofficial builds
for Windows are available at https://github.com/cloudhan/jax-windows-builder. JAX also has
additional installation options that allow it to do computations on a GPU using the CUDA library.
See https://github.com/google/jax#installation for additional options for these cases.
JAX’s grad() accepts a scalar-valued function and returns its gradient as a function that
accepts the same parameters as the original. To support most of the NumPy features, JAX comes
with its own thinly-wrapped version of NumPy, jax.numpy. Import this version of NumPy as jnp to
avoid confusion.
>>> from jax import numpy as jnp # Use JAX's version of NumPy.
>>> from jax import grad
Functions that grad() produces do not support array broadcasting, meaning they do not accept arrays as input. The easiest way to evaluate the derivative at many points is to wrap the result of grad() with jnp.vectorize().
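For example, a sketch using the imports above and the lab's running function f(x) = (sin(x) + 1)^sin(cos(x)):

>>> f = lambda x: (jnp.sin(x) + 1) ** jnp.sin(jnp.cos(x))
>>> df = grad(f)                 # df accepts a single float, not an array.
>>> df_vec = jnp.vectorize(df)   # Wrap it so it can be evaluated on arrays.
>>> x = jnp.linspace(-jnp.pi, jnp.pi, 100)
>>> derivs = df_vec(x)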
SymPy would have no trouble differentiating g(x) in these examples. However, JAX can also
differentiate Python functions that look nothing like traditional mathematical functions. For exam-
ple, the following code computes the Taylor series of e^x with a loop.
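The original snippet is not reproduced here; the following sketch of such a loop-based Taylor approximation (assumed names, fixed number of terms) illustrates the point that JAX can differentiate ordinary Python control flow.

from jax import grad

def taylor_exp(x, n_terms=20):
    # Partial sum of the Taylor series of e^x, built with an ordinary Python loop.
    result, term = 1.0, 1.0
    for n in range(1, n_terms):
        term = term * x / n
        result = result + term
    return result

print(taylor_exp(1.0))           # About e = 2.71828...
print(grad(taylor_exp)(1.0))     # The derivative of e^x at x = 1 is also about e.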
Problem 6. Write a function that accepts an array x and an integer n and recursively computes Tn(x). Use
JAX and your first function to create a function for Tn′ (x). Use this last function to plot each
Tn′ (x) over the domain [−1, 1] for n = 0, 1, 2, 3, 4.
(Hint: Use jnp.ones_like(x) to handle the case when n = 0.)
Problem 7. Let f (x) = (sin(x) + 1)sin(cos(x)) as in Problems 1 and 3. Write a function that
accepts an integer N and performs the following experiment N times.
1. Choose a random value x0.
2. Use your function from Problem 1 to calculate the “exact” value of f′(x0). Time how long
the entire process takes, including calling your function (each iteration).
3. Time how long it takes to get an approximation f˜′ (x0 ) of f ′ (x0 ) using the fourth-order
centered difference quotient from Problem 3. Record the absolute error |f ′ (x0 ) − f˜′ (x0 )|
of the approximation.
4. Time how long it takes to get an approximation f¯′ (x0 ) of f ′ (x0 ) using JAX (calling grad()
every time). Record the absolute error |f ′ (x0 ) − f¯′ (x0 )| of the approximation.
Plot the computation times versus the absolute errors on a log-log plot with different
colors for SymPy, the difference quotient, and JAX. For SymPy, assume an absolute error of
1e-18 (since only positive values can be shown on a log plot).
For N = 200, your plot should resemble the following figure. Note that SymPy has the
least error but longer computation time, and that the difference quotient takes the least amount
of time but has the most error. JAX, on the other hand, does not appear to be as well-suited
to this particular problem. However, for more complicated functions and functions of multiple
variables, it tends to be a “happy medium” between the two, with faster runtime than SymPy.
[Figure: log-log plot of absolute error versus computation time, with separate colors for SymPy, the difference quotients, and JAX.]
Figure 9.2: Solution with N = 200.
Additional Material
More JAX
For scalar-valued functions with multiple inputs, the parameter argnums specifies the variable that
the derivative is computed with respect to. Providing a list for argnums gives several outputs.
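A sketch of both uses of argnums, on an assumed example function f(x, y) = sin(x)·y²:

>>> from jax import grad
>>> from jax import numpy as jnp
>>> f = lambda x, y: jnp.sin(x) * y**2
>>> dfdy = grad(f, argnums=1)(1.0, 2.0)          # Derivative with respect to y.
>>> both = grad(f, argnums=[0, 1])(1.0, 2.0)     # Tuple of both partial derivatives.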
10
Newton's Method
Lab Objective: Newton's method, the classical method for finding the zeros of a function, is
one of the most important algorithms of all time. In this lab we implement Newton’s method in
arbitrary dimensions and use it to solve a few interesting problems. We also explore in some detail
the convergence (or lack of convergence) of the method under various circumstances.
Iterative Methods
An iterative method is an algorithm that must be applied repeatedly to obtain a result. The general
idea behind any iterative method is to make an initial guess at the solution to a problem, apply a
few easy computations to better approximate the solution, use that approximation as the new initial
guess, and repeat until done. More precisely, let F be some function used to approximate the solution
to a problem. Starting with an initial guess x0, compute x_{k+1} = F(x_k) for k = 0, 1, 2, . . ., stopping when the approximations stop changing by more than a small tolerance ε or after a maximum of N iterations.
The choices for ε and N are significant: a “large” ε (such as 10−6 ) produces a less accurate
result than a “small” ε (such as 10⁻¹⁶), but demands fewer computations; a small N (10) also potentially
lowers accuracy, but detects and halts nonconvergent iterations sooner than a large N (10,000). In
code, ε and N are often named tol and maxiter, respectively (or similar).
While there are many ways to structure the code for an iterative method, probably the cleanest
way is to combine a for loop with a break statement. As a very simple example, let F(x) = x/2. This
method converges to x = 0 independent of starting point.
>>> F = lambda x: x / 2
>>> x0, tol, maxiter = 10, 1e-9, 8
>>> for k in range(maxiter): # Iterate at most N times.
... print(x0, end=' ')
... x1 = F(x0) # Compute the next iteration.
... if abs(x1 - x0) < tol: # Check for convergence.
... break # Upon convergence, stop iterating.
... x0 = x1 # Otherwise, continue iterating.
...
10 5.0 2.5 1.25 0.625 0.3125 0.15625 0.078125
In this example, the algorithm terminates after N = 8 iterations (the maximum number of
allowed iterations) because the tolerance condition |xk − xk−1 | < 10−9 is not met fast enough. If N
had been larger (say 40), the iteration would have quit early due to the tolerance condition.
x_{k+1} = x_k − f(x_k) / f′(x_k). (10.3)
2. f ′ (x̄) ̸= 0, and
In applications, the first two conditions usually hold. If x̄ and x0 are not “sufficiently close,” Newton’s
method may converge very slowly, or it may not converge at all. However, when all three conditions
hold, Newton’s method converges quadratically, meaning that the maximum error is squared at every
iteration. This is very quick convergence, making Newton’s method as powerful as it is simple.
Problem 1. Write a function that accepts a function f , an initial guess x0 , the derivative f ′ ,
a stopping tolerance defaulting to 10−5 , and a maximum number of iterations defaulting to 15.
Use Newton’s method as described in (10.3) to compute a zero x̄ of f . Terminate the algorithm
when |xk − xk−1 | is less than the stopping tolerance or after iterating the maximum number
of allowed times. Return the last computed approximation to x̄, a boolean value indicating
whether or not the algorithm converged, and the number of iterations completed.
Test your function against functions like f(x) = e^x − 2 (see Figure 10.1) or f(x) = x⁴ − 3.
Check that the computed zero x̄ satisfies f (x̄) ≈ 0. Also consider comparing your function to
scipy.optimize.newton(), which accepts similar arguments.
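A minimal sketch of the suggested comparison with SciPy (the test function and tolerances shown are only an example):

>>> import numpy as np
>>> from scipy import optimize
>>> f = lambda x: np.exp(x) - 2
>>> Df = lambda x: np.exp(x)
>>> zero = optimize.newton(f, x0=2.0, fprime=Df, tol=1e-5, maxiter=15)   # Should be close to np.log(2).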
Figure 10.1: Newton’s method approximates the zero of a function (blue) by choosing as the next
approximation the x-intercept of the tangent line (red) that goes through the point (xk , f (xk )). In
this example, f(x) = e^x − 2, which has a zero at x̄ = log(2). Setting x0 = 2 and using (10.3) to iterate, we have x1 = x0 − f(x0)/f′(x0) = 2 − (e² − 2)/e² ≈ 1.2707. Similarly, x2 ≈ 0.8320, x3 ≈ 0.7024, and
x4 ≈ 0.6932. After only a few iterations, the zero log(2) ≈ 0.6931 is already computed to several
digits of accuracy.
Note
Newton’s method can be used to find zeros of functions that are hard to solve for analytically.
For example, the function f(x) = sin(x)/x − x is not continuous on any interval containing 0, but
it can be made continuous by defining f (0) = 1. Newton’s method can then be used to compute
the zeros of this function.
Problem 2. Suppose that an amount of P1 dollars is put into an account at the beginning of
years 1, 2, ..., N1 and that the account accumulates interest at a fractional rate r (so r = .05
corresponds to 5% interest). In addition, at the beginning of years N1 + 1, N1 + 2, ..., N1 + N2 ,
an amount of P2 dollars is withdrawn from the account and the account balance is exactly
zero after the withdrawal at year N1 + N2 . Then the variables satisfy
There is a file called test_newtons_method.py that contains a unit test for this problem
that you can use to check your code.
Backtracking
Newton’s method may not converge for a variety of reasons. One potential problem occurs when the
step from xk to xk+1 is so large that the zero is stepped over completely. Backtracking is a strategy
that combats the problem of overstepping by moving only a fraction of the full step from xk to xk+1 .
This suggests a slight modification to (10.3),
x_{k+1} = x_k − α f(x_k) / f′(x_k),    α ∈ (0, 1]. (10.4)
Note that setting α = 1 results in the exact same method defined in (10.3), but for α ∈ (0, 1), only
a fraction of the step is taken at each iteration.
Problem 3. Modify your function from Problem 1 so that it accepts a parameter α that
defaults to 1. Incorporate (10.4) to allow for backtracking.
To test your modified function, consider f (x) = x1/3 . The command x**(1/3.) fails
when x is negative, so the function and its derivative can be defined with NumPy as follows.
import numpy as np
f = lambda x: np.sign(x) * np.power(np.abs(x), 1./3)
Df = lambda x: np.power(np.abs(x), -2./3) / 3.
With x0 = .01 and α = 1, the iteration should not converge. However, setting α = .4, the
iteration should converge to a zero that is close to 0.
Figure 10.2: Starting at the same initial value but using different backtracking constants can result
in convergence to two different solutions. The blue line converges to x̃ = (0, −1) with α = 1 in 5
iterations of Newton’s method while the orange line converges to x̂ = (3.75, .25) with α = 0.4 in 15
iterations. Note that the points in this example are 2-dimensional, which is discussed in the next
section.
Problem 4. Write a function that accepts the same arguments as your function from Problem
3 except for α. Use Newton’s method to find a zero of f using various values of α in the interval
(0, 1]. Plot the values of α against the number of iterations performed by Newton’s method.
Return a value for α that results in the lowest number of iterations.
A good test case for this problem is the function f (x) = x1/3 discussed in Problem 3. In
this case, your plot should show that the optimal value for α is actually closer to .3 than to .4.
Df = [ ∂f1/∂x1  · · ·  ∂f1/∂xn ]
     [    ⋮       ⋱       ⋮    ]
     [ ∂fn/∂x1  · · ·  ∂fn/∂xn ]
In this setting, Newton’s method seeks a vector x̄ such that f (x̄) = 0, the vector of n zeros.
With backtracking incorporated, (10.4) becomes
x_{k+1} = x_k − α Df(x_k)⁻¹ f(x_k). (10.5)
Note that if n = 1, (10.5) is exactly (10.4) because in that case, Df (x)−1 = 1/f ′ (x).
This vector version of Newton’s method terminates when the maximum number of iterations is
reached or the difference between successive approximations is less than a predetermined tolerance ε
with respect to a vector norm, that is, ||xk − xk−1 || < ε.
Problem 5. Modify your function from Problems 1 and 3 so that it can compute a zero of a
function f : Rn → Rn for any n ∈ N. Take the following tips into consideration.
• If n > 1, f should be a function that accepts a 1-D NumPy array with n entries and
returns another NumPy array with n entries. Similarly, Df should be a function that
accepts a 1-D array with n entries and returns a n × n array. In other words, f and Df
are callable functions, but f (x) is a vector and Df (x) is a matrix.
• Instead of computing Df (xk )−1 directly at each step, solve the system Df (xk )yk = f (xk )
and set xk+1 = xk − αyk . In other words, use la.solve() instead of la.inv().
• The stopping criterion now requires using a norm function instead of abs().
After your modifications, carefully verify that your function still works in the case that
n = 1, and that your functions from Problems 2 and 4 also still work correctly. In addition,
your function from Problem 4 should also work for any n ∈ N.
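A sketch of a single vector-valued Newton step following the second tip; the names are illustrative and la refers to scipy.linalg.

import numpy as np
from scipy import linalg as la

def newton_step(f, Df, xk, alpha=1.0):
    # Solve Df(xk) yk = f(xk) instead of forming Df(xk)^{-1} explicitly.
    yk = la.solve(Df(xk), f(xk))
    return xk - alpha * yk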
Unit Test
The file test_newtons_method.py contains unit tests for Problem 2. There is a place to add
your own unit tests to test your function for Problem 5, which will be graded.
Problem 6. Bioremediation involves the use of bacteria to consume toxic wastes. At a steady
state, the bacterial density x and the nutrient concentration y satisfy the system of nonlinear
equations
γxy − x(1 + y) = 0
−xy + (δ − y)(1 + y) = 0,
where γ and δ are parameters that depend on various physical features of the system.a
For this problem, assume the typical values γ = 5 and δ = 1, for which the system has
solutions at (x, y) = (0, 1), (0, −1), and (3.75, .25). Write a function that finds an initial point
x0 = (x0 , y0 ) such that Newton’s method converges to either (0, 1) or (0, −1) with α = 1, and
to (3.75, .25) with α = 0.55. As soon as a valid x0 is found, return it (stop searching).
(Hint: search within the rectangle [−1/4, 0] × [0, 1/4].)
a This problem is adapted from exercise 5.19 of [Hea02] and the notes of Homer Walker.
Basins of Attraction
When a function f has many zeros, the zero that Newton’s method converges to depends on the
initial guess x0 . For example, the function f (x) = x2 − 1 has zeros at −1 and 1. If x0 < 0, then
Newton’s method converges to −1; if x0 > 0 then it converges to 1 (see Figure 10.3a). The regions
(−∞, 0) and (0, ∞) are called the basins of attraction of f . Starting in one basin of attraction leads
to finding one zero, while starting in another basin yields a different zero.
When f is a polynomial of degree greater than 2, the basins of attraction are much more
interesting. For example, the basins of attraction for f(x) = x³ − x are shown in Figure 10.3b. The
basin for the zero at the origin is connected, but the other two basins are disconnected and share a
kind of symmetry.
(a) Basins of attraction for f (x) = x2 − 1. (b) Basins of attraction for f (x) = x3 − x.
Figure 10.3: Basins of attraction with α = 1. Since choosing a different value for α can change which
zero Newton’s method converges to, the basins of attraction may change for other values of α.
It can be shown that Newton’s method converges in any Banach space with only slightly stronger
hypotheses than those discussed previously. In particular, Newton’s method can be performed over
the complex plane C to find imaginary zeros of functions. Plotting the basins of attraction over C
yields some interesting results.
The zeros of f(x) = x³ − 1 are 1 and −1/2 ± (√3/2)i. To plot the basins of attraction for f(x) = x³ − 1 on the square complex domain X = {a + bi | a ∈ [−3/2, 3/2], b ∈ [−3/2, 3/2]}, create an initial grid of complex points in this domain using np.meshgrid().
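The original grid-construction code is not shown; a sketch of one way to build it (names assumed) follows.

import numpy as np

x_real = np.linspace(-1.5, 1.5, 500)        # Real parts.
x_imag = np.linspace(-1.5, 1.5, 500)        # Imaginary parts.
X_real, X_imag = np.meshgrid(x_real, x_imag)
X0 = X_real + 1j * X_imag                   # 500 x 500 grid of complex starting points.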
The grid X0 is a 500 × 500 array of complex values to use as initial points for Newton’s method.
Array broadcasting makes it easy to compute an iteration of Newton’s method at every grid point.
After enough iterations, the (i, j)th element of the grid Xk corresponds to the zero of f that results from using the (i, j)th element of X0 as the initial point. For example, with f(x) = x³ − 1, each entry of Xk should be close to 1, −1/2 + (√3/2)i, or −1/2 − (√3/2)i. Each entry of Xk can then be assigned
a value indicating which zero it corresponds to. Some results of this process are displayed below.
(a) Basins of attraction for f (x) = x3 − 1. (b) Basins of attraction for f (x) = x3 − x.
Figure 10.4
Note
Notice that in some portions of Figure 10.4a, whenever red and blue try to come together, a
patch of green appears in between. This behavior repeats on an infinitely small scale, producing
a fractal. Because it arises from Newton’s method, this kind of fractal is called a Newton fractal.
Newton fractals show that the long-term behavior of Newton’s method is extremely
sensitive to the initial guess x0 . Changing x0 by a small amount can change the output of
Newton’s method in a seemingly random way. This phenomenon is called chaos in mathematics.
1. Construct a res×res grid X0 over the domain {a + bi | a ∈ [rmin , rmax ], b ∈ [imin , imax ]}.
2. Run Newton's method (without backtracking) on X0 iters times, obtaining the res×res array Xk. To avoid the additional computation of checking for convergence at each step,
do not use your function from Problem 5.
3. Xk cannot be visualized directly because its values are complex. Solve this issue
by creating another res×res array Y . To compute the (i, j)th entry Yi,j , determine
which zero of f is closest to the (i, j)th entry of Xk . Set Yi,j to the index of this zero in
the array zeros. If there are R distinct zeros, each Yi,j should be one of 0, 1, . . . , R − 1.
(Hint: np.argmin() may be useful.)
4. Use plt.pcolormesh() to visualize the basins. Recall that this function accepts three
array arguments: the x-coordinates (in this case, the real components of the initial grid),
the y-coordinates (the imaginary components of the grid), and an array indicating color
values (Y ). Set cmap="brg" to get the same color scheme as in Figure 10.4.
Test your function using f (x) = x3 − 1 and f (x) = x3 − x. The resulting plots should
resemble Figures 10.4a and 10.4b, respectively (perhaps with the colors permuted).
11
Conditioning and
Stability
Lab Objective: The condition number of a function measures how sensitive that function is to
changes in the input. On the other hand, the stability of an algorithm measures how accurately that
algorithm computes the value of a function from exact input. Both of these concepts are important
for answering the crucial question, “is my computer telling the truth?” In this lab we examine the
conditioning of common linear algebra problems, including computing polynomial roots and matrix
eigenvalues. We also present an example to demonstrate how two different algorithms for the same
problem may not have the same level of stability.
Note: There may be some variation in the solutions to problems in this lab between the different
updates of NumPy, SciPy, and SymPy. Consider updating these packages if you are currently using
older versions.
Conditioning
The absolute condition number of a function f : Rm → Rn at a point x ∈ Rm is defined by
κ̂(x) = lim_{δ→0⁺} sup_{‖h‖<δ} ‖f(x + h) − f(x)‖ / ‖h‖. (11.1)
In other words, the absolute condition number of f is the limit of the change in output over
the change of input. Similarly, the relative condition number of f is the limit of the relative change
in output over the relative change in input,
κ(x) = lim_{δ→0⁺} sup_{‖h‖<δ} ( ‖f(x + h) − f(x)‖ / ‖f(x)‖ ) / ( ‖h‖ / ‖x‖ ) = ( ‖x‖ / ‖f(x)‖ ) κ̂(x). (11.2)
A function with a large condition number is called ill-conditioned. Small changes to the input
of an ill-conditioned function may produce large changes in output. It is important to know if a
function is ill-conditioned because floating point representation almost always introduces some input
error, and therefore the outputs of ill-conditioned functions cannot be trusted.
The condition number of a matrix A, κ(A) = ∥A∥∥A−1 ∥, is an upper bound on the condition
number for many of the common problems associated with the matrix, such as solving the system
Ax = b. If A is square but not invertible, then κ(A) = ∞ by convention. To compute κ(A), we often
use the matrix 2-norm, which is the largest singular value σmax of A. Recall that if σ is a singular value of A, then 1/σ is a singular value of A⁻¹. Thus, we have that
κ(A) = σmax / σmin, (11.3)
which is also a valid equation for non-square matrices.
Achtung!
Ill-conditioned matrices can wreak havoc in even simple applications. For example, the matrix
A = [ 1   1            ]
    [ 1   1.0000000001 ]
is nearly singular and has a condition number of roughly 4 × 10¹⁰, so solving a linear system with it can amplify tiny input errors enormously.
If you find yourself working with matrices that have large condition numbers, check your
math carefully or try to reformulate the problem entirely.
Problem 1. Write a function that accepts a matrix A and computes its condition number
using (11.3). Use scipy.linalg.svd(), or scipy.linalg.svdvals() to compute the singular
values of A. Avoid computing A−1 . If the smallest singular value is 0, return ∞ (np.inf).
Validate your function by comparing it to np.linalg.cond(). Check that orthonormal
matrices have a condition number of 1 (use scipy.linalg.qr() to generate an orthonormal
matrix) and that singular matrices have a condition number of ∞ according to your function.
Unit Test
The file test_conditioning_stability.py contains unit tests to test your function from Prob-
lem 1 with orthonormal matrices. There is a place to add your own unit tests to test your
function with other kinds of matrices, which will be graded.
Let f : Cn+1 → Cn be the function that maps a collection of n + 1 coefficients (cn , cn−1 , . . . , c0 ) to
the n roots of the polynomial cn xn + cn−1 xn−1 + . . . + c2 x2 + c1 x + c0 . Finding polynomial roots is
an extremely ill-conditioned problem in general, so the condition number of f is likely very large. To
see this, consider the Wilkinson polynomial, made famous by James H. Wilkinson in 1963:
w(x) = ∏_{r=1}^{20} (x − r) = x²⁰ − 210x¹⁹ + 20615x¹⁸ − 1256850x¹⁷ + · · · .
Let w̃(x) be w(x) where the coefficient on x19 is very slightly perturbed from −210 to −209.9999999.
The following code computes and compares the roots of w̃(x) and w(x) using NumPy and SymPy.
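The original NumPy/SymPy snippet is not reproduced here; the following NumPy-only sketch (with assumed variable names) shows one way to set up the comparison.

import numpy as np

w_roots = np.arange(1, 21)          # The true roots 1, 2, ..., 20.
w_coeffs = np.poly(w_roots)         # Coefficients [1, -210, 20615, -1256850, ...].
new_coeffs = w_coeffs.copy()
new_coeffs[1] += 1e-7               # The coefficient on x^19 becomes -209.9999999.
new_roots = np.roots(new_coeffs)    # Roots of the perturbed polynomial.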
Figure 11.1a plots w(x) and w̃(x) together, and Figure 11.1b compares their roots in the
complex plane.
(a) The original and perturbed Wilkinson polynomials. They match for about half of the domain, then differ drastically.
(b) Roots of the original and perturbed Wilkinson polynomials. About half of the perturbed roots are complex.
Figure 11.1
Figure 11.1 clearly indicates that a very small change in just a single coefficient drastically
changes the nature of the polynomial and its roots. To quantify the difference, estimate the condition
numbers (this example uses the ∞ norm to compute κ̂ and κ).
# Sort the roots to ensure that they are in the same order.
>>> w_roots = np.sort(w_roots)
>>> new_roots = np.sort(new_roots)
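Continuing the sketch above with the ∞ norm (variable names assumed from the preceding code), the estimates of (11.1) and (11.2) become:

>>> h = new_coeffs - w_coeffs
>>> k_hat = np.linalg.norm(new_roots - w_roots, np.inf) / np.linalg.norm(h, np.inf)
>>> k = k_hat * np.linalg.norm(w_coeffs, np.inf) / np.linalg.norm(w_roots, np.inf)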
1. Computing the quotients in (11.1) and (11.2) for a fixed perturbation h only approximates the
condition number. The true condition number is the limit of such quotients. We hope that
when ∥h∥ is small, a random quotient is at least the same order of magnitude as the limit, but
there is no way to be sure.
2. This example assumes that NumPy’s root-finding algorithm, np.roots(), is stable, so that the
difference between w_roots and new_roots is due to the difference in coefficients, and not to
problems with np.roots(). We will return to this issue in the next section.
Even with these caveats, it is apparent that root finding is a difficult problem to solve correctly.
Always check your math carefully when dealing with polynomial roots.
Problem 2. Write a function that carries out the following experiment 100 times.
1. Randomly perturb the true coefficients of the Wilkinson polynomial by replacing each
coefficient ci with ci ∗ ri , where ri is drawn from a normal distribution centered at 1 with
standard deviation 1e-10 (use np.random.normal()).
2. Plot the perturbed roots as small points in the complex plane. That is, plot the real part
of the coefficients on the x-axis and the imaginary part on the y-axis. Plot on the same
figure in each experiment.
(Hint: use a pixel marker, marker=',', to avoid overcrowding the figure.)
3. Compute the absolute and relative condition numbers with the ∞ norm.
Plot the roots of the unperturbed Wilkinson polynomial with the perturbed roots. Your final
plot should resemble Figure 11.2. Finally, return the average computed absolute and relative
condition numbers.
Figure 11.2: This figure replicates Figure 12.1 on p. 93 of [TB97].
Calculating Eigenvalues
Let f : Mn (C) → Cn be the function that maps an n × n matrix with complex entries to its n
eigenvalues. This problem is well-conditioned for symmetric matrices, but it can be extremely ill-
conditioned for non-symmetric matrices. Let A be an n × n matrix and let λ be the vector of the n
eigenvalues of A. If à = A + H is a pertubation of A and λ̃ are its eigenvalues, then the condition
numbers of f can be estimated by
∥λ − λ̃∥ ∥A∥
κ̂(A) = , κ(A) = κ̂(A). (11.4)
∥H∥ ∥λ∥
Problem 3. Write a function that accepts a matrix A and estimates the condition number of
the eigenvalue problem using (11.4). For the perturbation H, construct a matrix with complex
entries where the real and imaginary parts are drawn from normal distributions centered at 0
with standard deviation σ = 10−10 .
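One way to build such a perturbation (a sketch; A is assumed to be an n × n NumPy array):

import numpy as np

n = A.shape[0]
reals = np.random.normal(0, 1e-10, (n, n))
imags = np.random.normal(0, 1e-10, (n, n))
H = reals + 1j * imags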
Problem 4. Write a function that accepts bounds [xmin , xmax , ymin , ymax ] and an integer res.
Use your function from Problem 3 to compute the relative condition number of the eigenvalue
problem for the 2 × 2 matrix
[ 1   x ]
[ y   1 ]
at every point of an evenly spaced res×res grid over the domain [xmin , xmax ] × [ymin , ymax ].
Plot these estimated relative condition numbers using plt.pcolormesh() and the colormap
cmap='gray_r' (you can use plt.colorbar() to create the colorbar). With res=200, your
plot should look similar to the following figure.
Problem 4 shows that the conditioning of the eigenvalue problem depends heavily on the matrix,
and that it is difficult to know a priori how bad the problem will be. Luckily, most real-world problems
requiring eigenvalues are symmetric. In their book on Numerical Linear Algebra, L. Trefethen and
D. Bau III summed up the issue of conditioning and eigenvalues when they stated, “if the answer is
highly sensitive to perturbations, you have probably asked the wrong question.”
Stability
The stability of an algorithm is measured by the error in its output. Let f : Rm → Rn be a problem
to be solved, as in the previous section, and let f˜ be an actual algorithm for solving the problem.
The forward error of f at x is ‖f(x) − f̃(x)‖, and the relative forward error of f at x is ‖f(x) − f̃(x)‖ / ‖f(x)‖. As an example, consider again the Wilkinson polynomial and NumPy's root-finding algorithm.
# w_coeffs holds the coefficients and w_roots holds the true roots.
>>> computed_roots = np.sort(np.roots(np.poly1d(w_coeffs)))
>>> print(computed_roots[:6]) # The computed roots are close to integers.
[ 1. 2. 3. 3.99999999 5.00000076 5.99998749]
1 See the Additional Material section for alternative (and more rigorous) definitions of algorithmic stability.
This analysis suggests that np.roots() is a stable algorithm, so the large condition numbers of
Problem 2 really are due to the poor conditioning of the problem, not the way in which the problem
was solved.
Note
Conditioning is a property of a problem to be solved, such as finding the roots of a polynomial
or calculating eigenvalues. Stability is a property of an algorithm to solve a problem, such
as np.roots() or scipy.linalg.eig(). If a problem is ill-conditioned, any algorithm used to
solve that problem may result in suspicious solutions, even if that algorithm is stable.
Least Squares
The ordinary least squares (OLS) problem is to find the x that minimizes ∥Ax − b∥2 for fixed A and
b. It can be shown that an equivalent problem is finding the solution of AH Ax = AH b, called the
normal equations. A common application of least squares is polynomial approximation. Given a set
of m data points {(x_k, y_k)}_{k=1}^m, the goal is to find the set of coefficients {c_i}_{i=0}^n such that the degree-n polynomial with those coefficients fits the data points as closely as possible, which leads to the linear system (11.5).
Problem 5. Write a function that accepts an integer n. Solve for the coefficients of the poly-
nomial of degree n that best fits the data found in stability_data.npy. Use two approaches
to get the least squares solution:
1. Use la.inv() to solve the normal equations: x = (AT A)−1 AT b. Although this approach
seems intuitive, it is actually highly unstable and can return an answer with a very large
forward error.
Load the data and set up the system (11.5) with the following code.
xk, yk = np.load("stability_data.npy").T
A = np.vander(xk, n+1)
Plot the resulting polynomials together with the raw data points. Return the forward
error ∥Ax − b∥2 of both approximations.
(Hint: The function np.polyval() will be helpful for plotting the resulting polynomials.)
Test your function using various values of n, taking special note of what happens for values
of n near 14.
Catastrophic Cancellation
When a computer takes the difference of two very similar numbers, the result is often stored with
a small number of significant digits and the tiniest bit of information is lost. However, these small
errors can propagate into large errors later down the line. This phenomenon is called catastrophic
cancellation, and is a common cause for numerical instability.
Catastrophic cancellation is a potential problem whenever floats or large integers that are very
close to one another are subtracted. This problem can be avoided by either rewriting the program
to not use subtraction, or by increasing the number of significant digits that the computer tracks.
For example, consider the simple problem of computing √a − √b. The computation can be done directly with subtraction, or by performing the equivalent division
√a − √b = (√a − √b)(√a + √b) / (√a + √b) = (a − b) / (√a + √b).
>>> from math import sqrt # np.sqrt() fails for very large numbers.
>>> a = 10**20 + 1
>>> b = 10**20
>>> sqrt(a) - sqrt(b) # Do the subtraction directly.
0.0 # a != b, so information has been lost.
In this example, a and b are distinct enough that the computer can still tell that a − b = 1, but √a and √b are so close to each other that √a − √b is computed as 0.
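Rewriting the computation as the equivalent quotient avoids the cancellation (continuing the example above):

>>> (a - b) / (sqrt(a) + sqrt(b))   # Subtract the large integers, not the nearly equal floats.
5e-11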
Problem 6. Let I(n) = ∫₀¹ xⁿ e^{x−1} dx. It can be shown that for a positive integer n,
I(n) = (−1)ⁿ ( !n − n!/e ), (11.6)
where !n = n! Σ_{k=0}^{n} (−1)^k / k! is the subfactorial of n. Write a function to do the following.
1. Use SymPy’s sy.integrate() to evaluate the integral form of I(n) for n = 5, 10, . . . , 50.
Convert the symbolic results of each integration to a float. Since this is done symbolically,
these values can be accepted as the true values of I(n). For this problem, use sy.exp()
in the integrand.
(Hint: be careful that the values of n in the integrand are of type int.)
2. Use (11.6) to compute I(n) for the same values of n. Use sy.subfactorial() to compute
!n and sy.factorial() to compute n!. The function used for e in this equation changes
the returned error value. For this problem, use np.e instead of sy.exp().
(Hint: be careful to only pass Python integers to these functions.)
3. Plot the relative forward error of the results computed in step 2 at each of the given
values of n. When computing the relative forward error use absolute values instead of
la.norm(). Use a log scale on the y-axis.
The examples presented in this lab are just a few of the ways that a mathematical problem can
turn into a computational train wreck. Always use stable algorithms when possible, and remember
to check if problems are well conditioned or not.
Additional Material
Other Notions of Stability
The definition of stability can be made more rigorous in the following way. Let f be a problem to
solve and f̃ an algorithm to solve it. If for every x in the domain there exists an x̃ such that ‖x̃ − x‖/‖x‖ and ‖f̃(x) − f(x̃)‖/‖f(x̃)‖ are small (close to εmachine ≈ 10⁻¹⁶), then f̃ is called stable. In other words, “A stable algorithm gives nearly the right answer to nearly the right question” (Trefethen, Bau, 104). Note carefully that
the quantity on the right is slightly different from the plain forward error introduced earlier.
Stability is desirable, but plain stability isn't the best possible condition. For example, if for every input x there exists an x̃ such that ‖x̃ − x‖/‖x‖ is small and f̃(x) = f(x̃) exactly, then f̃ is called
backward stable. Thus “A backward stable algorithm gives exactly the right answer to nearly the right
question” (Trefethen, Bau, 104). Backward stable algorithms are generally more trustworthy than
stable algorithms, but they are also less common.
12
Monte Carlo Integration
Lab Objective: Many important integrals cannot be evaluated symbolically because the integrand
has no antiderivative. Traditional numerical integration techniques like Newton-Cotes formulas and
Gaussian quadrature usually work well for one-dimensional integrals, but rapidly become inefficient
in higher dimensions. Monte Carlo integration is an integration strategy that has relatively slow
convergence, but that does extremely well in high-dimensional settings compared to other techniques.
In this lab we implement Monte Carlo integration and apply it to a classic problem in statistics.
Volume Estimation
Since the area of a circle of radius r is A = πr2 , one way to numerically estimate π is to compute
the area of the unit circle. Empirically, we can estimate the area by randomly choosing points in a
domain that encompasses the unit circle. The percentage of points that land within the unit circle
approximates the percentage of the area of the domain that the unit circle occupies. Multiplying this
percentage by the total area of the sample domain gives an estimate for the area of the circle.
Since the unit circle has radius r = 1, consider the square domain Ω = [−1, 1] × [−1, 1]. The
following code samples 2000 uniformly distributed random points in Ω, determines what percentage
of those points are within the unit circle, then multiplies that percentage by 4 (the area of Ω) to get
an estimate for π.
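The original snippet is not reproduced here; the following sketch (variable names assumed) carries out the experiment just described.

import numpy as np

points = np.random.uniform(-1, 1, (2, 2000))     # 2000 random points in the square, one per column.
lengths = np.linalg.norm(points, axis=0)         # Distance of each point from the origin.
num_within = np.count_nonzero(lengths < 1)       # Points that landed inside the unit circle.
estimate = 4 * num_within / 2000                 # Fraction inside times the area of the square.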
The estimate π ≈ 3.198 isn’t perfect, but it only differs from the true value of π by about
0.0564. On average, increasing the number of sample points decreases the estimate error.
Figure 12.1: Estimating the area of the unit circle using random points.
Problem 1. The n-dimensional open unit ball is the set Un = {x ∈ Rn | ∥x∥2 < 1}. Write a
function that accepts an integer n and a keyword argument N defaulting to 104 . Estimate the
volume of Un by drawing N points over the n-dimensional domain [−1, 1]×[−1, 1]×· · ·×[−1, 1].
(Hint: the volume of [−1, 1] × [−1, 1] × · · · × [−1, 1] is 2n .)
When n = 2, this is the same experiment outlined above, so your function should return an approximation of π. The volume of U3 is (4/3)π ≈ 4.18879, and the volume of U4 is π²/2 ≈ 4.9348. Try increasing the number of sample points N to see if your estimates improve.
Integral Estimation
The strategy for estimating π can be formulated as an integral problem. Define f : R2 → R by
f(x) = 1 if ‖x‖₂ < 1 (x is within the unit circle), and f(x) = 0 otherwise, so that the area of the unit circle is π = ∫_Ω f(x) dV.
To estimate the integral we choose N random points {x_i}_{i=1}^N in Ω. Since f indicates whether or not a
point lies within the unit circle, the total number of random points that lie in the circle is the sum of
the f (xi ). Then the average of these values, multiplied by the volume V (Ω), is the desired estimate:
∫_Ω f(x) dV ≈ V(Ω) (1/N) Σ_{i=1}^{N} f(x_i). (12.1)
This remarkably simple equation can be used to estimate the integral of any integrable function
f : Rn → R over any domain Ω ⊂ Rn and is called the general formula for Monte Carlo integration.
The intuition behind (12.1) is that (1/N) Σ_{i=1}^{N} f(x_i) approximates the average value of f on Ω,
and multiplying the approximate average value by the volume of Ω yields the approximate integral
of f over Ω. This is a little easier to see in one dimension: for a single-variable function f : R → R,
the Average Value Theorem states that the average value of f over an interval [a, b] is given by
f_avg = ( 1/(b − a) ) ∫_a^b f(x) dx.
Then using the approximation f_avg ≈ (1/N) Σ_{i=1}^{N} f(x_i), the previous equation becomes
∫_a^b f(x) dx = (b − a) f_avg ≈ V(Ω) (1/N) Σ_{i=1}^{N} f(x_i), (12.2)
which is (12.1) in one dimension. In this setting Ω = [a, b] and hence V (Ω) = b − a.
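As an illustration only, a minimal sketch of (12.2) in one dimension (assumed names; f must accept NumPy arrays):

import numpy as np

def mc_integrate_1d(f, a, b, N=10000):
    points = np.random.uniform(a, b, N)     # N uniform samples in [a, b].
    return (b - a) * np.mean(f(points))     # V(Omega) times the approximate average value of f.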
Test your function on the following integrals, or on other integrals that you can check by hand.
∫_{−4}^{2} x² dx = 24        ∫_{−2π}^{2π} sin(x) dx = 0        ∫_{1}^{10} (1/x) dx = log(10) ≈ 2.30259
∫_{1}^{5} sin(10x) cos(10x) + √x sin(3x) dx ≈ 4.502
Achtung!
Be careful not to use Monte Carlo integration to estimate integrals that do not converge. For
example, since 1/x approaches ∞ as x approaches 0 from the right, the integral
∫_0^1 (1/x) dx
does not converge. Even so, attempts at Monte Carlo integration still return a finite value. Use
various numbers of sample points to see whether or not the integral estimate is converging.
2. It is easy to sample uniformly over an interval [a, b] with np.random.uniform(), or even over
the n-dimensional cube [a, b] × [a, b] × · · · × [a, b] (such as in Problem 1). However, if ai ̸= aj
or bi ̸= bj for any i ̸= j, the samples need to be constructed in a slightly different way.
The interval [0, 1] can be transformed to the interval [a, b] by scaling it so that it is the same
length as [a, b], then shifting it to the appropriate location.
[0, 1] → [0, b − a] → [a, b]    (scale by b − a, then shift by a)
This suggests a strategy for sampling over [a1 , b1 ] × [a2 , b2 ] × · · · × [an , bn ]: sample uniformly
from the n-dimensional box [0, 1]×[0, 1]×· · ·×[0, 1], multiply the ith component of each sample
by bi − ai , then add ai to that component.
[0, 1] × · · · × [0, 1] → [0, b_1 − a_1] × · · · × [0, b_n − a_n] → [a_1, b_1] × · · · × [a_n, b_n]    (scale, then shift) (12.4)
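A sketch of this sampling strategy (assumed names; mins and maxs hold the a_i and b_i):

import numpy as np

def sample_box(mins, maxs, N):
    mins, maxs = np.asarray(mins, float), np.asarray(maxs, float)
    samples = np.random.uniform(0, 1, (len(mins), N))           # Sample from the unit box.
    return (maxs - mins)[:, None] * samples + mins[:, None]     # Scale each row, then shift.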
Test your function on the following integrals below, as well as the single dimensional exam-
ples from Problem 2.
(Hint: make sure bounds are inputted as lists, even in the single dimensional case).
∫_0^1 ∫_0^1 x² + y² dx dy = 2/3        ∫_{−2}^{1} ∫_{1}^{3} 3x − 4y + y² dx dy = 54
∫_{−4}^{4} ∫_{−3}^{3} ∫_{−2}^{2} ∫_{−1}^{1} x + y − wz² dx dy dz dw = 0
Note carefully how the order of integration defines the domain; in the last example, the x-y-z-w
domain is [−1, 1] × [−2, 2] × [−3, 3] × [−4, 4], so the lower and upper bounds passed to your
function should be [−1, −2, −3, −4] and [1, 2, 3, 4], respectively.
Convergence
Monte Carlo integration has some obvious pros and cons. On the one hand, it is difficult to get highly precise estimates. In fact, the error of the Monte Carlo method is proportional to 1/√N, where N
is the number of points used in the estimation. This means that dividing the error by 10 requires
using 100 times more sample points.
On the other hand, the convergence rate is independent of the number of dimensions of the
problem. That is, the error converges at the same rate whether integrating a 2-dimensional function or
a 20-dimensional function. This gives Monte Carlo integration a huge advantage over other methods,
and makes it especially useful for estimating integrals in high dimensions where other methods become
computationally infeasible.
Problem 4. The probability density function of the joint distribution of n independent normal
random variables, each with mean 0 and variance 1, is the function f : Rn → R defined by
f(x) = ( 1/(2π)^{n/2} ) e^{−xᵀx/2}.
Though this is a critical distribution in statistics, f does not have a symbolic antiderivative.
Integrate f several times to study the convergence properties of Monte Carlo integration.
1. Let n = 4 and Ω = [−3/2, 3/4] × [0, 1] × [0, 1/2] × [0, 1] ⊂ R⁴. Define f and Ω so that you can
integrate f over Ω using your function from Problem 3.
example, the following code computes the integral over [−1, 1] × [−1, 3] × [−2, 1] ⊂ R3 .
3. Use np.logspace() to get 20 integer values of N that are roughly logarithmically spaced
from 101 to 105 . For each value of N , use your function from Problem 3 to compute an
estimate F̃(N) of the integral with N samples. Compute the relative error |F − F̃(N)| / |F| for each value of N.
4. Plot the relative error against the sample size N on a log-log scale. Also plot the line 1/√N for comparison. Your results should be similar to Figure 12.2.
Figure 12.2: Monte Carlo integration converges at the same rate as 1/√N, where N is the number of
samples used in the estimate. However, the convergence is independent of dimension, which is why
this strategy is so commonly used for high-dimensional integration.
13
Visualizing
Complex-valued
Functions
Lab Objective: Functions that map from the complex plane into the complex plane are difficult
to fully visualize because the domain and range are both 2-dimensional. However, such functions
can be visualized at the expense of partial information. In this lab we present methods for analyzing
complex-valued functions visually, including locating their zeros and poles in the complex plane. We
recommend completing the exercises in a Jupyter Notebook.
Conversely, Euler's formula is the relation re^{iθ} = r cos(θ) + ir sin(θ). Then setting re^{iθ} = x + iy and
equating real and imaginary parts yields the equations x = r cos(θ) and y = r sin(θ).
Figure 13.1: The complex number z can be represented in Cartesian coordinates as z = x + iy and in polar coordinates as z = re^{iθ}, where θ is in radians.
NumPy makes it easy to work with complex numbers and convert between coordinate systems.
The function np.angle() returns the argument θ of a complex number (between −π and π) and
np.abs() (or np.absolute()) returns the magnitude r. These functions also operate element-wise
on NumPy arrays.
Complex Functions
A function f : C → C is called a complex-valued function. Visualizing f is difficult because C has 2
real dimensions, so the graph of f should be 4-dimensional. However, since it is possible to visualize
3-dimensional objects, f can be visualized by ignoring one dimension. There are two main strategies
for doing this: assign a color to each point z ∈ C corresponding to either the argument θ of f (z), or
to the magnitude r of f (z). The graph that uses the argument is called a complex color wheel graph.
Figure 13.2 displays the identity function f (z) = z using these two methods.
Figure 13.2: The identity function f : C → C defined by f (z) = z. On the left, the color at each
point z represents the angle θ = arg(f (z)). As θ goes from −π to π, the colors cycle smoothly
counterclockwise from white to blue to red and back to white (this colormap is called "twilight").
On the right, the color represents the magnitude r = |f (z)|. The further a point is from the origin,
the greater its magnitude (the colormap is the default, "viridis").
The plots in Figure 13.2 use Cartesian coordinates in the domain and polar coordinates in
the codomain. The procedure for plotting in this way is fairly simple. Begin by creating a grid of
complex numbers: create the real and imaginary parts separately, then use np.meshgrid() to turn
them into a single array of complex numbers. Pass this array to the function f, compute the angle and magnitude of the resulting array, and plot them using plt.pcolormesh(). The following code sets up the complex domain grid.
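The original snippet is not shown; the following sketch (assumed names and resolution) builds such a grid over {x + iy | x, y ∈ [−3, 3]}.

import numpy as np

x = np.linspace(-3, 3, 500)
y = np.linspace(-3, 3, 500)
X, Y = np.meshgrid(x, y)
Z = X + 1j * Y                 # The complex domain grid.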
Visualizing the argument and the magnitude separately provides different perspectives of the
function f . The angle plot is generally more useful for visualizing function behavior, though the
magnitude plot often makes it easy to spot important points such as zeros and poles.
Figure 13.3: Plots of f(z) = √(z² + 1) on {x + iy | x, y ∈ [−3, 3]}. Notice how a discontinuity is clearly visible in the angle plot on the left, but disappears from the magnitude plot on the right.
Problem 1. Write a function that accepts a function f : C → C, bounds [rmin , rmax , imin , imax ]
for the domain, an integer res that determines the resolution of the plot, and a string to set
the figure title. Plot arg(f (z)) and |f (z)| on an equally-spaced res×res grid over the domain
{x + iy | x ∈ [rmin , rmax ], y ∈ [imin , imax ]} in separate subplots.
1. For arg(f (z)), set the plt.pcolormesh() keyword arguments vmin and vmax to −π and
π, respectively. This forces the color spectrum to work well with np.angle(). Use the
colormap "twilight", which starts and ends white, so that the color is the same for −π
and π.
3. Set the aspect ratio to "equal" in each plot. Give each subplot a title, and set the overall
figure title with the given input string.
Use your function to visualize f(z) = √z on {x + iy | x, y ∈ [−1, 1]} and f(z) = √(z² + 1) on {x + iy | x, y ∈ [−3, 3]}. Compare the resulting plots to Figures 13.2 and 13.3, respectively.
Zeros
A complex number z0 is called a zero of the complex-valued function f if f(z0) = 0. The multiplicity or order of z0 is the largest integer n such that f can be written as f(z) = (z − z0)ⁿ g(z) where g(z0) ≠ 0. In other words, f has a zero of order n at z0 if the Taylor series of f centered at z0 can be written as
f(z) = Σ_{k=n}^{∞} a_k (z − z0)^k,    a_n ≠ 0.
Angle and magnitude plots make it easy to locate a function’s zeros and to determine their
multiplicities.
Problem 2. Use your function from Problem 1 to plot the following functions on the domain
{x + iy | x, y ∈ [−1, 1]}.
• f(z) = zⁿ for n = 2, 3, 4.
• f(z) = z³ − iz⁴ − 3z⁶.
Use a Markdown cell to write a sentence or two about how the zeros of a function and their
multiplicity appear in angle and magnitude plots.
Problem 2 shows that in an angle plot of f(z) = zⁿ, the colors cycle n times counterclockwise around 0. This is explained by looking at zⁿ in polar coordinates,
zⁿ = (re^{iθ})ⁿ = rⁿ e^{i(nθ)}.
Multiplying θ by a number greater than 1 compresses the graph along the “θ-axis” by a factor of n.
In other words, the output angle repeats itself n times in one cycle of θ. This is similar to taking a
scalar-valued function f : R → R and replacing f (x) with f (nx).
Problem 2 also shows that the plot of f(z) = z³ − iz⁴ − 3z⁶ looks very similar to the plot of f(z) = z³ near the origin. This is because when z is close to the origin, z⁴ and z⁶ are much smaller in magnitude than z³, and so the behavior of z³ dominates the function. In terms of the Taylor series centered at z0 = 0, the quantity |z − z0|^{n+k} is much smaller than |z − z0|ⁿ for z close to z0, and so the function behaves similarly to a_n (z − z0)ⁿ.
Figure 13.4: The angle plot of f(z) = z³ − iz⁴ − 3z⁶ on {x + iy | x, y ∈ [−1, 1]}. The angle plot
shows that f (z) has a zero of order 3 at the origin and 3 distinct zeros of order 1 scattered around
the origin. The magnitude plot makes it easier to pinpoint the location of the zeros.
Poles
A complex number z0 is called a pole of the complex-valued function f if f can be written as
f (z) = g(z)/(z − z0 ) where g(z0 ) ̸= 0. From this definition it is easy to see that limz→z0 |f (z)| = ∞,
but knowing that limz→z1 |f (z)| = ∞ is not enough information to conclude that z1 is a pole of f .
The order of z0 is the largest integer n such that f can be written as f (z) = g(z)/(z − z0 )n
with g(z0 ) ̸= 0. In other words, f has a pole of order n at z0 if its Laurent series on a punctured
neighborhood of z0 can be written as
f(z) = Σ_{k=−n}^{∞} a_k (z − z0)^k,    a_{−n} ≠ 0.
Problem 3. Plot the following functions on domains that show all of its zeros and/or poles.
• f(z) = z⁻ⁿ for n = 1, 2, 3.
• f(z) = z² + iz⁻¹ + z⁻³.
Use a Markdown cell to write a sentence or two about how the poles of a function appear in
angle and magnitude plots. How can you tell the multiplicity of the poles from the plot?
Problem 3 shows that in an angle plot of z⁻ⁿ, the colors cycle n times clockwise around 0, as opposed to the counter-clockwise rotations seen around roots. Again, this can be explained by looking at the polar representation,
z⁻ⁿ = (re^{iθ})⁻ⁿ = r⁻ⁿ e^{i(−nθ)}.
The minus sign on the θ reverses the direction of the colors, and the n makes them cycle n times.
From Problem 3 it is also clear that f(z) = z² + iz⁻¹ + z⁻³ behaves similarly to z⁻³ for z near the pole at z0 = 0. Since |z − z0|^{−n+k} is much smaller than |z − z0|^{−n} when |z − z0| is small, near z0 the function behaves like a_{−n} (z − z0)^{−n}. This is why the order of a pole can be estimated by
counting the number of times the colors circle a point in the clockwise direction.
Problem 4. Plot the following functions and count the number and order of their zeros and
poles. Adjust the bounds of each plot until you have found all zeros and poles.
• f(z) = −4z⁵ + 2z⁴ − 2z³ − 4z² + 4z − 4
It is usually fairly easy to see how many zeros or poles a polynomial or quotient of polynomials
has. However, it can be much more difficult to know how many zeros or poles a different function
may or may not have without visualizing it.
Problem 5. Plot the following functions on the domain {x + iy | x, y ∈ [−8, 8]}. Explain
carefully in a Markdown cell what each graph reveals about the function and why the function
behaves that way.
• f(z) = e^z
• f (z) = tan(z)
(Hint: use the polar coordinate representation to mathematically examine the magnitude and
angle of each function.)
Essential Poles
A complex-valued function f has an essential pole at z0 if its Laurent series in a punctured neigh-
borhood of z0 requires infinitely many terms with negative exponents. For example,
e^{1/z} = Σ_{n=0}^{∞} 1/(n! zⁿ) = 1 + 1/z + 1/(2z²) + 1/(6z³) + · · · .
An essential pole can be thought of as a pole of order ∞. Therefore, in an angle plot the colors cycle
infinitely many times around an essential pole.
Figure 13.5: Angle plot of f(z) = e^{1/z} on the domain {x + iy | x, y ∈ [−1, 1]}. The colors circle
clockwise around the origin because it is a pole, not a zero. Because the pole is essential, the colors
repeat infinitely many times.
Achtung!
Often, color plots like the ones presented in this lab can be deceptive because of a bad choice
of domain. Be careful to validate your observations mathematically.
Problem 6. For each of the following functions, plot the function on {x + iy | x, y ∈ [−1, 1]}
and describe what this view of the plot seems to imply about the function. Then plot the
function on a domain that allows you to see the true nature of the roots and poles and describe
how it is different from what the original plot implied. Use Markdown cells to write your
answers.
• f(z) = 100z² + z
14
The PageRank Algorithm
Lab Objective: Many real-world systems—the internet, transportation grids, social media, and
so on—can be represented as graphs (networks). The PageRank algorithm is one way of ranking the
nodes in a graph by importance. Though it is a relatively simple algorithm, the idea gave birth to the
Google search engine in 1998 and has shaped much of the information age since then. In this lab we
implement the PageRank algorithm with a few different approaches, then use it to rank the nodes of
a few different networks.
        a  b  c  d
    a [ 0  0  0  0 ]
A = b [ 1  0  1  0 ]
    c [ 1  0  0  1 ]
    d [ 1  0  1  0 ]
Figure 14.1: A directed unweighted graph with four nodes, together with its adjacency matrix. Note
that the column for node b is all zeros, indicating that b is a sink —a node that doesn’t point to any
other node.
If n users start on random pages in the network and click on a link every 5 minutes, which page
in the network will have the most views after an hour? Which will have the fewest? The goal of the
PageRank algorithm is to solve this problem in general, therefore determining how “important” each
webpage is.
Before diving into the mathematics, there is a potential problem with the model. What happens
if a webpage doesn’t have any outgoing links, like node b in Figure 14.1? Eventually, all of the users
will end up on page b and be stuck there forever. To obtain a more realistic model, modify each sink
in the graph by adding edges from the sink to every node in the graph. This means users on a page
with no links can start over by selecting a random webpage.
        a  b  c  d
    a [ 0  1  0  0 ]
Ã = b [ 1  1  1  0 ]
    c [ 1  1  0  1 ]
    d [ 1  1  1  0 ]
Figure 14.2: Here the graph in Figure 14.1 has been modified to guarantee that node b is no longer
a sink (the added links are blue). We denote the modified adjacency matrix by Ã.
Now let pk (t) be the likelihood that a particular internet user is surfing webpage k at time t.
Suppose at time t + 1, the user clicks on a link to page i. Then pi (t + 1) can be computed by counting
the number of links pointing to page i, weighted by the total number of outgoing links for each node.
As an example, consider the graph in Figure 14.2. To get to page a at time t + 1, the user had
to be on page b at time t. Since there are four outgoing links from page b, assuming links are chosen
with equal likelihood,
p_a(t + 1) = (1/4) p_b(t).
Similarly, to get to page b at time t + 1, the user had to have been on page a, b, or c at time t. Since
a has 3 outgoing edges, b has 4 outgoing edges, and c has 2 outgoing edges,
p_b(t + 1) = (1/3) p_a(t) + (1/4) p_b(t) + (1/2) p_c(t).
The previous equations can be written in a way that hints at a more general linear form:
p_a(t + 1) = 0·p_a(t) + (1/4) p_b(t) + 0·p_c(t) + 0·p_d(t),
p_b(t + 1) = (1/3) p_a(t) + (1/4) p_b(t) + (1/2) p_c(t) + 0·p_d(t).
The coefficients of the terms on the right hand side are precisely the entries of the ith row of the modified adjacency matrix Ã, divided by the jth column sum. In general, p_i(t + 1) satisfies
p_i(t + 1) = Σ_{j=1}^{n} Ã_ij p_j(t) / ( Σ_{k=1}^{n} Ã_kj ). (14.1)
Note that the column sum Σ_{k=1}^{n} Ã_kj in the denominator can never be zero since, after the fix in Figure 14.2, none of the nodes in the graph are sinks.
Â_ij = Ã_ij / ( Σ_{k=1}^{n} Ã_kj ). (14.4)
In other words, Â is Ã normalized so that the columns each sum to 1. For the graph in Figure 14.2, the matrix Â is given by
        a    b    c    d
    a [ 0   1/4   0    0 ]
Â = b [ 1/3 1/4  1/2   0 ]   (14.5)
    c [ 1/3 1/4   0    1 ]
    d [ 1/3 1/4  1/2   0 ]
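The column normalization in (14.4) is a natural fit for array broadcasting; a minimal sketch (A is assumed to be the raw n × n adjacency matrix as a NumPy array):

import numpy as np

A = A.astype(float)
A[:, A.sum(axis=0) == 0] = 1          # Make every sink point to all nodes (columns of ones).
A_hat = A / A.sum(axis=0)             # Divide each column by its column sum.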
Problem 1. Write a class for representing directed graphs via their adjacency matrices. The
constructor should accept an n × n adjacency matrix A and a list of node labels (such as
[a, b, c, d]) defaulting to None. Modify A as in Figure 14.2 so that there are no sinks in
the corresponding graph, then calculate Â from (14.4). Save Â and the list of labels as
attributes. Use [0, 1, . . . , n − 1] as the labels if none are provided. Finally, raise a ValueError
if the number of labels is not equal to the number of nodes in the graph.
(Hint: use array broadcasting to compute Â efficiently.)
For the graph in Figure 14.1, check that your Â matches (14.5).
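One way the constructor might handle the sink fix and the normalization in (14.4) is sketched below; the class and attribute names are illustrative, and broadcasting performs the column normalization in a single step.

import numpy as np

class DiGraph:
    """A directed graph stored via a normalized adjacency matrix (a sketch)."""
    def __init__(self, A, labels=None):
        n = A.shape[0]
        if labels is None:
            labels = list(range(n))
        if len(labels) != n:
            raise ValueError("number of labels must equal number of nodes")
        A = np.array(A, dtype=float)
        A[:, A.sum(axis=0) == 0] = 1          # replace each sink column with ones
        self.A_hat = A / A.sum(axis=0)        # divide each column by its sum
        self.labels = labels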
    p_i = lim_{t→∞} p_i(t).
Linear System
If p exists, then taking the limit as t → ∞ of both sides of (14.3) gives the following.

    lim_{t→∞} p(t + 1) = lim_{t→∞} [ εÂ p(t) + ((1 − ε)/n) 1 ]
                     p = εÂ p + ((1 − ε)/n) 1
          (I − εÂ) p = ((1 − ε)/n) 1        (14.6)
This linear system is easy to solve as long as the number of nodes in the graph isn’t too large.
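For example, a linsolve()-style computation might assemble and solve (14.6) directly (a sketch; the function name and default damping factor are illustrative):

import numpy as np

def linsolve_sketch(A_hat, eps=0.85):
    """Solve (I - eps*A_hat) p = (1 - eps)/n * 1 for the PageRank vector p."""
    n = A_hat.shape[0]
    return np.linalg.solve(np.eye(n) - eps * A_hat, np.full(n, (1 - eps) / n))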
Eigenvalue Problem
Let E be an n × n matrix of ones. Then Ep(t) = 1, since Σ_{i=1}^{n} p_i(t) = 1. Substituting into (14.3),

    p(t + 1) = εÂ p(t) + ((1 − ε)/n) E p(t) = ( εÂ + ((1 − ε)/n) E ) p(t) = B p(t),        (14.7)

where B = εÂ + ((1 − ε)/n) E. Now taking the limit as t → ∞ of both sides of (14.7),
Bp = p.
That is, p is an eigenvector of B corresponding to the eigenvalue λ = 1. In fact, since the columns
of B sum to 1, and because the entries of B are strictly positive (because the entries of E are all
positive), Perron’s theorem guarantees that λ = 1 is the unique eigenvalue of B of largest magnitude,
and that the corresponding eigenvector p is unique up to scaling. Furthermore, p can be scaled so
that each of its entries is positive, meaning p/∥p∥1 is the desired PageRank vector.
Note
A Markov chain is a weighted directed graph where each node represents a state of a discrete
system. The weight of the edge from node j to node i is the probability of transitioning from
state j to state i, and the adjacency matrix of a Markov chain is called a transition matrix.
Since B from (14.7) contains nonnegative entries and its columns all sum to 1, it can be
viewed as the transition matrix of a Markov chain. In that context, the limit vector p is called
the steady state of the Markov chain.
Iterative Method
Solving (14.6) or (14.7) is feasible for small networks, but they are not efficient strategies for very
large systems. The remaining option is to use an iterative technique. Starting with an initial guess
p(0), use (14.3) to compute p(1), p(2), . . . until ∥p(t) − p(t − 1)∥ is sufficiently small. From (14.7),
we can see that this is just the power method1 for finding the eigenvector corresponding to the
dominant eigenvalue of B.
1 See the Least Squares and Computing Eigenvalues lab for details on the power method.
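A sketch of such an iteration follows; it is essentially the power method applied to B, written so that B is never formed explicitly (the function name and default values are illustrative).

import numpy as np

def itersolve_sketch(A_hat, eps=0.85, maxiter=100, tol=1e-12):
    """Iterate p(t+1) = eps*A_hat p(t) + (1-eps)/n until p stops changing."""
    n = A_hat.shape[0]
    p = np.full(n, 1.0 / n)                   # start from the uniform distribution
    for _ in range(maxiter):
        p_new = eps * (A_hat @ p) + (1 - eps) / n
        if np.linalg.norm(p_new - p, ord=1) < tol:
            break
        p = p_new
    return p_new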
Problem 2. Add the following methods to your class from Problem 1. Each should accept a
damping factor ε (defaulting to 0.85), compute the PageRank vector p, and return a dictionary
mapping label i to its PageRank value p_i.
1. linsolve(): solve for p in the linear system (14.6).
2. eigensolve(): solve for p using (14.7). Normalize the resulting eigenvector so its entries
sum to 1. The computed eigenvector may be returned as a complex array whose imaginary
part is zero; use .real to cast it to a real array.
3. itersolve(): compute p iteratively using (14.3), starting from a uniform initial guess and
iterating until ∥p(t) − p(t − 1)∥ is less than a stopping tolerance tol or a maximum number
of iterations maxiter has been reached.
Check that each method yields the same results. For the graph in Figure 14.1 with ε = 0.85,
you should get the following dictionary mapping labels to PageRank values.
Unit Test
The file test_pagerank.py contains a prewritten unit test to test your linsolve() method
for Problem 2. There is a place for you to add unit tests to test your other two methods for
Problem 2, eigensolve() and itersolve(), which will be graded.
Problem 3. Write a function that accepts a dictionary mapping labels to PageRank values,
like the outputs in Problem 2. Return a list of labels sorted from highest to lowest rank.
(Hint: if d is a dictionary, use list(d.keys()) and list(d.values()) to get the list of keys
and values in the dictionary, respectively.)
For the graph in Figure 14.1 with ε = 0.85, this is the list [c, b, d, a] (or [c, d, b, a], since
b and d have the same PageRank value).
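Following the hint, such a function might look like this sketch:

import numpy as np

def rank_labels(d):
    """Return the labels of d sorted from highest to lowest PageRank value."""
    labels, values = list(d.keys()), list(d.values())
    order = np.argsort(values)[::-1]          # indices of values, largest first
    return [labels[i] for i in order]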
Problem 4. Write a function that accepts a damping factor ε defaulting to 0.85. Read the data and
get a list of the n unique page IDs in the file (the labels). Construct the n × n adjacency matrix
of the graph where node j points to node i if webpage j has a hyperlink to webpage i. Use your
class from Problem 1 and its itersolve() method from Problem 2 to compute the PageRank
values of the webpages, then rank them with your function from Problem 3. In the case where
two webpages have the same rank, resolve ties by first listing the webpage whose ID comes first
alphabetically. Note that even though the IDs are numbers, we can sort them alphabetically
because they are defined as strings. (Hint: Sorting the list of unique webpage IDs by string
before ranking them will place the site IDs in the desired order; there is no need to convert the
IDs to integers.) Return the ranked list of webpage IDs.
(Hint: After constructing the list of webpage IDs, make a dictionary that maps a webpage ID
to its index in the list. For Figure 14.1, this would be {'a': 0, 'b': 1, 'c': 2, 'd': 3}.
The values are the row/column indices in the adjacency matrix for each label.)
With ε = 0.85, the top three ranked webpage IDs are 98595, 32791, and 28392.
a http://www.stanford.edu/
b See http://snap.stanford.edu/data/web-Stanford.html for the original (larger) dataset.
         a  b  c  d
     a [ 0  0  0  0 ]
A =  b [ 2  0  1  0 ]
     c [ 1  0  0  2 ]
     d [ 1  0  2  0 ]

              a    b    c    d
        a [   0   1/4   0    0 ]
    Â = b [  1/2  1/4  1/3   0 ]
        c [  1/4  1/4   0    1 ]
        d [  1/4  1/4  2/3   0 ]
Figure 14.3: A directed weighted graph with four nodes, together with its adjacency matrix and the
corresponding PageRank transition matrix. Edges that are added to fix sinks have weight 1, so the
computations of Ã and Â are exactly the same as in Figure 14.2 and (14.4), respectively.
Problem 5. The files ncaa2010.csv, ncaa2011.csv, . . ., ncaa2017.csv each contain data for
men’s college basketball for a given school year.a Each line (except the very first line, which is
a header) represents a different basketball game, formatted winning_team,losing_team.
Write a function that accepts a filename and a damping factor ε defaulting to 0.85. Read
the specified file (skipping the first line) and get a list of the n unique teams in the file. Construct
the n × n adjacency matrix of the graph where node j points to node i with weight w if team
j was defeated by team i in w games. That is, edges point from losers to winners. For
instance, the graph in Figure 14.3 would indicate that team c lost to team b once and to team
d twice, team b was undefeated, and team a never won a game. Use your class from Problem
1 and its itersolve() method from Problem 2 to compute the PageRank values of the teams,
then rank them with your function from Problem 3. Return the ranked list of team names.
Using ncaa2010.csv with ε = 0.85, the top three ranked teams (of the 607 total teams)
should be UConn, Kentucky, and Louisville, in that order. That season, UConn won the
championship, Kentucky was a semifinalist, and Louisville lost in the first tournament round
(a surprising upset).
a ncaa2010.csv has data for the 2010–2011 season, ncaa2011.csv for the 2011–2012 season, and so on.
Note
In Problem 5, the damping factor ε acts as an “upset” factor: a larger ε puts more emphasis on
win history; a smaller ε allows more randomness in the system, giving underdog teams a higher
probability of defeating a team with a better record.
It is also worth noting that the sink-fixing procedure is still reasonable for this model
because it gives every other team equal likelihood of beating an undefeated team. That is, the
additional edges don’t provide an extra advantage to any one team.
Method Description
add_node() Add a single node.
add_nodes_from() Add a list of nodes.
add_edge() Add an edge between two nodes, adding the nodes if needed.
add_edges_from() Add multiple edges (and corresponding nodes as needed).
remove_edge() Remove a single edge (no nodes are removed).
remove_edges_from() Remove multiple edges (no nodes are removed).
remove_node() Remove a single node and all adjacent edges.
remove_nodes_from() Remove multiple nodes and all adjacent edges.
Table 14.1: Methods of the nx.DiGraph class for inserting or removing nodes and edges.
For example, the weighted graph in Figure 14.3 can be constructed with the following code.
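One possible construction, reading the edge weights off the adjacency matrix above (an entry A_ij is the weight of the edge from node j to node i), is sketched here:

import networkx as nx

# Build the weighted directed graph of Figure 14.3 edge by edge.
DG = nx.DiGraph()
DG.add_edge('a', 'b', weight=2)
DG.add_edge('a', 'c', weight=1)
DG.add_edge('a', 'd', weight=1)
DG.add_edge('c', 'b', weight=1)
DG.add_edge('c', 'd', weight=2)
DG.add_edge('d', 'c', weight=2)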
Once constructed, an nx.DiGraph object can be queried for information about the nodes and
edges. It also supports dictionary-like indexing to access node and edge attributes, such as the weight
of an edge.
Method Description
has_node(A) Return True if A is a node in the graph.
has_edge(A,B) Return True if there is an edge from A to B.
edges() Iterate through the edges.
nodes() Iterate through the nodes.
number_of_nodes() Return the number of nodes.
number_of_edges() Return the number of edges.
Table 14.2: Methods of the nx.DiGraph class for accessing nodes and edges.
NetworkX efficiently implements several graph algorithms. The function nx.pagerank() com-
putes the PageRank values of each node iteratively with sparse matrix operations. This function
returns a dictionary mapping nodes to PageRank values, like the methods in Problem 2.
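A minimal usage sketch, assuming DG is the nx.DiGraph built above (alpha is NetworkX's name for the damping factor):

>>> import networkx as nx
>>> ranks = nx.pagerank(DG, alpha=0.85)     # {node label: PageRank value}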
Achtung!
NetworkX also has a class, nx.Graph, for undirected graphs. The edges in an undirected graph
are bidirectional, so the corresponding adjacency matrix is symmetric.
The PageRank algorithm is not very useful for undirected graphs. In fact, the PageRank
value for a node is close to its degree—the number of edges it connects to—divided by the total
number of edges. In Problem 5, that would mean the team who simply played the most games
would be ranked the highest. Always use nx.DiGraph, not nx.Graph, for PageRank and other
algorithms that rely on directed edges.
Problem 6. The file top250movies.txt contains data from the 250 top-rated movies accord-
ing to IMDb.a Each line in the file lists a movie title and its cast as title/actor1/actor2/...,
with the actors listed mostly in billing order (stars first), though some casts are listed alpha-
betically or in order of appearance.
Create an nx.DiGraph object with a node for each actor in the file. The weight of the edge from actor
a to actor b should be the number of times that actors a and b were in a movie together with actor
b listed first. That is, edges point to higher-billed actors (see Figure 14.4). Compute
the PageRank values of the actors and use your function from Problem 3 to rank them. Return
the list of ranked actors.
(Hint: Consider using itertools.combinations() while constructing the graph. Also, use
encoding="utf-8" as an argument to open() to read the file, since several actors and actresses
have nonstandard characters in their names such as ø and æ.)
With ε = 0.7, the top three actors should be Leonardo DiCaprio, Robert De Niro, and
Tom Hanks, in that order.
a https://www.imdb.com/search/title?groups=top_250&sort=user_rating
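A sketch of how the graph might be assembled with itertools.combinations() (the file name and encoding come from the problem statement; everything else is illustrative):

import networkx as nx
from itertools import combinations

DG = nx.DiGraph()
with open("top250movies.txt", encoding="utf-8") as infile:
    for line in infile:
        title, *cast = line.strip().split('/')
        # combinations() preserves billing order, so `earlier` is billed first;
        # the edge points from the later-billed actor to the earlier-billed one.
        for earlier, later in combinations(cast, 2):
            if DG.has_edge(later, earlier):
                DG[later][earlier]["weight"] += 1
            else:
                DG.add_edge(later, earlier, weight=1)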
[Figure: a weighted directed graph on the actors Anne Hathaway, Scarlett Johansson, Hugh Jackman, Christian Bale, and Michael Caine.]
Figure 14.4: A portion of the graph from Problem 6. Michael Caine was in four movies with Christian
Bale where Christian Bale was listed first in the cast.
Additional Material
Sparsity
On very large networks, the PageRank algorithm becomes computationally difficult because of the
size of the adjacency matrix A. Fortunately, most adjacency matrices are highly sparse, meaning
the number of edges is much lower than the number of entries in the matrix. Consider adding
functionality to your class from Problem 1 so that it stores Â as a sparse matrix and performs sparse
linear algebra operations in the methods from Problem 2 (use scipy.sparse.linalg).
• Degree centrality uses the degree of a node, meaning the number of edges adjacent to it (inde-
pendent of edge direction), for ranking. An academic paper that has been cited many times
has a high degree and is considered more important than a paper that has only been cited once.
• Eigenvector centrality is an extension of degree centrality. Instead of each neighbor contributing
equally to the centrality, nodes that are important are given a higher weight. Thus a node
connected to lots of unimportant nodes can have the same measure as a node connected to a
few, important nodes. Eigenvector centrality is measured by the eigenvector associated with
the largest eigenvalue of the adjacency matrix of the network.
• Katz centrality is a modification to eigenvector centrality for directed networks. Outgoing nodes
contribute centrality to their neighbors, so an important node makes its neighbors more im-
portant.
• PageRank adapts Katz centrality by averaging out the centrality that a node can pass to its
neighbors. For example, if Google—a website that should have high centrality—points to a
million websites, then it shouldn’t pass on that high centrality to all one million of its neighbors,
so each neighbor gets one millionth of Google’s centrality.
For more information on these centralities, as well as other ways to measure node importance,
see [New10].
15
Iterative Solvers
Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of
parameters. Solving such systems with Gaussian elimination or matrix factorizations could require
trillions of floating point operations (FLOPs), which is of course infeasible. Solutions of large systems
must therefore be approximated iteratively. In this lab we implement three popular iterative methods
for solving large systems: Jacobi, Gauss-Seidel, and Successive Over-Relaxation.
Iterative methods are often useful for solving large systems of equations. In this lab, let x^(k) denote
the kth iterate of an iterative method for solving the problem Ax = b for x. Furthermore, let x_i
be the ith component of x, so that x_i^(k) is the ith component of x^(k), the kth iterate. Like other
iterative methods, there are two stopping parameters: a very small ε > 0 and an integer N ∈ N.
Iterations continue until either
∥x(k−1) − x(k) ∥ < ε or k > N. (15.1)
The process is repeated until at least one of the two stopping criteria in (15.1) is met. For this
particular problem, convergence to 8 decimal places (ε = 10−8 ) is reached in 29 iterations.
Matrix Representation
The iterative steps performed above can be expressed in matrix form. First, decompose A into its
diagonal entries, its entries below the diagonal, and its entries above the diagonal, as A = D + L + U .
    D = diag(a_11, a_22, . . . , a_nn)               (the diagonal entries of A),
    L = the strictly lower triangular part of A      (the entries a_ij with i > j),
    U = the strictly upper triangular part of A      (the entries a_ij with i < j).
Ax = b
(D + L + U )x = b
Dx = −(L + U )x + b
x = D−1 (−(L + U )x + b)
Now using x^(k) as the variables on the right side of the equation to produce x^(k+1) on the left,
and noting that L + U = A − D, we have the following.

    x^(k+1) = D^{-1}( b − (A − D) x^(k) ) = x^(k) + D^{-1}( b − A x^(k) ).        (15.2)
There is a potential problem with (15.2): calculating a matrix inverse is the cardinal sin of
numerical linear algebra, yet the equation contains D−1 . However, since D is a diagonal matrix,
D−1 is also diagonal, and is easy to compute.
    D^{-1} = diag( 1/a_11, 1/a_22, . . . , 1/a_nn ).
Because of this, the Jacobi method requires that A have nonzero diagonal entries.
The diagonal D can be represented by the 1-dimensional array d of the diagonal entries. Then
the matrix multiplication Dx is equivalent to the component-wise vector multiplication d ∗ x = x ∗ d.
Likewise, the matrix multiplication D−1 x is equivalent to the component-wise “vector division” x/d.
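For example (a small illustration):

>>> import numpy as np
>>> A = np.array([[2., 0., 1.],
...               [0., 4., 2.],
...               [1., 2., 8.]])
>>> x = np.array([1., 2., 3.])
>>> d = np.diag(A)                              # 1-D array of the diagonal entries
>>> np.allclose(d * x, np.diag(d) @ x)          # d*x is the same as Dx
True
>>> np.allclose(x / d, np.linalg.inv(np.diag(d)) @ x)   # x/d is the same as D^{-1}x
True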
Problem 1. Write a function that accepts a matrix A, a vector b, a convergence tolerance tol
defaulting to 10−8 , and a maximum number of iterations maxiter defaulting to 100. Implement
the Jacobi method using (15.2), returning the approximate solution to the equation Ax = b.
Run the iteration until ∥x(k−1) − x(k) ∥∞ < tol, and only iterate at most maxiter times.
Avoid using la.inv() to calculate D−1 , but use la.norm() to calculate the vector ∞-norm.
Your function should be robust enough to accept systems of any size. To test your
function, generate a random b with np.random.random() and use the following function to
generate an n × n matrix A for which the Jacobi method is guaranteed to converge. Run the
iteration, then check that Ax(k) and b are close using np.allclose(). There is a file called
test_iterative_solvers.py that contains prewritten unit tests to test your function for this
problem.
Also test your function on random n × n matrices. If the iteration is non-convergent, the
successive approximations will have increasingly large entries.
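A generator of the kind described above might look like the following sketch (the provided diag_dom() may differ in its details):

import numpy as np

def diag_dom_sketch(n, num_entries=None):
    """Generate a strictly diagonally dominant n x n matrix."""
    if num_entries is None:
        num_entries = n**2 // 4
    A = np.zeros((n, n))
    rows = np.random.choice(n, size=num_entries)
    cols = np.random.choice(n, size=num_entries)
    A[rows, cols] = np.random.randint(-4, 4, size=num_entries)
    # Make each diagonal entry strictly dominate the rest of its row.
    A[np.arange(n), np.arange(n)] = np.abs(A).sum(axis=1) + 1
    return A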
Convergence
Most iterative methods only converge under certain conditions. For the Jacobi method, convergence
mostly depends on the nature of the matrix A. If the entries aij of A satisfy the property
    |a_ii| > Σ_{j≠i} |a_ij|   for all i = 1, 2, . . . , n,
then A is called strictly diagonally dominant (diag_dom() in Problem 1 generates a strictly diagonally
dominant n × n matrix). If this is the case,1 then the Jacobi method always converges, regardless of
the initial guess x0 . This is a very different convergence result than many other iterative methods
such as Newton’s method where convergence is highly sensitive to the initial guess.
There are a few ways to determine whether or not an iterative method is converging. For
example, since the approximation x(k) should satisfy Ax(k) ≈ b, the normed difference ∥Ax(k) − b∥∞
should be small. This value is called the absolute error of the approximation. If the iterative method
converges, the absolute error should decrease to ε.
2. Keep track of the absolute error ∥Ax(k) − b∥∞ of the approximation at each iteration.
3. If plot is True, produce a lin-log plot (use plt.semilogy()) of the error against iteration
count. Remember to still return the approximate solution x.
If the iteration converges, your plot should resemble the following figure.
1 Although this seems like a strong requirement, most real-world linear systems can be represented by strictly diagonally dominant matrices.
[Figure: a semilog plot of the absolute error ∥Ax^(k) − b∥∞ versus iteration count, decreasing steadily over roughly 30 iterations.]
    2x_1        −  x_3 =  3,
    −x_1 + 3x_2 + 2x_3 =  3,
           x_2 + 3x_3 = −1.
As with the Jacobi method, solve for x1 in the first equation, x2 in the second equation, and
x3 in the third equation:
    x_1 = (1/2)(3 + x_3),
    x_2 = (1/3)(3 + x_1 − 2x_3),
    x_3 = (1/3)(−1 − x_2).
Using x^(0) to compute x_1^(1) in the first equation as before,

    x_1^(1) = (1/2)(3 + x_3^(0)) = (1/2)(3 + 0) = 3/2.

Now, however, use the updated value x_1^(1) in the calculation of x_2^(1):

    x_2^(1) = (1/3)(3 + x_1^(1) − 2x_3^(0)) = (1/3)(3 + 3/2 − 0) = 3/2.

Likewise, use the updated value x_2^(1) when computing x_3^(1):

    x_3^(1) = (1/3)(−1 − x_2^(1)) = (1/3)(−1 − 3/2) = −5/6.
This process of using calculated information immediately is called forward substitution, and causes
the algorithm to (generally) converge much faster.
Notice that Gauss-Seidel converges in less than half as many iterations as Jacobi does for this system.
Implementation
Because Gauss-Seidel updates only one element of the solution vector at a time, the iteration cannot
be summarized by a single matrix equation. Instead, the process is most generally described by the
equation
    x_i^(k+1) = (1/a_ii) ( b_i − Σ_{j&lt;i} a_ij x_j^(k) − Σ_{j&gt;i} a_ij x_j^(k) ).        (15.3)
Let a_i be the ith row of A. The two sums closely resemble the regular vector product of a_i
and x^(k) without the ith term a_ii x_i^(k). This suggests the simplification
    x_i^(k+1) = (1/a_ii) ( b_i − a_i^T x^(k) + a_ii x_i^(k) )
              = x_i^(k) + (1/a_ii) ( b_i − a_i^T x^(k) ).        (15.4)
One sweep through all the entries of x completes one iteration.
Achtung!
Since the Gauss-Seidel algorithm operates on the approximation vector in place (modifying
it one entry at a time), the previous approximation x(k−1) must be stored at the beginning
of the kth iteration in order to calculate ∥x(k−1) − x(k) ∥∞ . Additionally, since NumPy
arrays are mutable, the past iteration must be stored as a copy.
Convergence
Whether or not the Gauss-Seidel method converges depends on the nature of A. If A is symmetric and all of its
eigenvalues are positive, A is called positive definite. If A is positive definite or if it is strictly diagonally
dominant, then the Gauss-Seidel method converges regardless of the initial guess x(0) .
Problem 4. Write a new function that accepts a sparse matrix A, a vector b, a convergence
tolerance tol, and a maximum number of iterations maxiter (plotting the convergence is not
required for this problem). Implement the Gauss-Seidel method using (15.4), returning the
approximate solution to the equation Ax = b. Use the usual default stopping criterion.
The Gauss-Seidel method requires extracting the rows Ai from the matrix A and com-
puting ATi x. There are many ways to do this that cause some fairly serious runtime issues, so
we provide the code for this specific portion of the algorithm.
# Get the indices of where the i-th row of A starts and ends if the
# nonzero entries of A were flattened.
rowstart = A.indptr[i]
rowend = A.indptr[i+1]
# Multiply only the nonzero elements of the i-th row of A with the
# corresponding elements of x.
Aix = A.data[rowstart:rowend] @ x[A.indices[rowstart:rowend]]

2 See the lab on Linear Systems for a review of scipy.sparse matrices and syntax.
Unit Test
There is a file called test_iterative_solvers.py that contains prewritten unit tests for Prob-
lem 1. There is a place for you to add your own unit tests to test your function for Problem 4,
which will be graded.
Successive Over-Relaxation
There are many systems that meet the requirements for convergence with the Gauss-Seidel method,
but for which convergence is still relatively slow. A slightly altered version of the Gauss-Seidel
method, called Successive Over-Relaxation (SOR), can result in faster convergence. This is achieved
by introducing a relaxation factor ω ≥ 1 and modifying (15.3) as
    x_i^(k+1) = (1 − ω) x_i^(k) + (ω/a_ii) ( b_i − Σ_{j&lt;i} a_ij x_j^(k) − Σ_{j&gt;i} a_ij x_j^(k) ),

which, as in (15.4), simplifies to

    x_i^(k+1) = x_i^(k) + (ω/a_ii) ( b_i − a_i^T x^(k) ).        (15.5)
Note that when ω = 1, SOR reduces to Gauss-Seidel. The relaxation factor ω weights the new
iteration between the current best approximation and the next approximation in a way that can
sometimes dramatically improve convergence.
Problem 5. Write a function that accepts a sparse matrix A, a vector b, a relaxation factor
ω, a convergence tolerance tol, and a maximum number of iterations maxiter. Implement
SOR using (15.5) to compute the approximate solution to the equation Ax = b. Use the usual
stopping criterion. Return the approximate solution x as well as a boolean indicating whether
the function converged and the number of iterations computed.
(Hint: this requires changing only one line of code from the sparse Gauss-Seidel function.)
Figure 15.1: On the left, an example of a 6 × 6 grid (n = 4) where the red dots are hot boundary
zones and the blue dots are cold boundary zones. On the right, the green dots are the neighbors of
the interior black dot that are used to approximate the heat at the black dot.
    ∂²u/∂x² + ∂²u/∂y² = 0        (15.6)
Laplace’s equation can be used to model heat flow. Consider a square metal plate where the
top and bottom borders are fixed at 0◦ Celsius and the left and right sides are fixed at 100◦ Celsius.
Given these boundary conditions, we want to describe how heat diffuses through the rest of the plate.
The solution to Laplace’s equation describes the plate when it is in a steady state, meaning that the
heat at a given part of the plate no longer changes with time.
It is possible to solve (15.6) analytically. However, the problem can also be solved numerically
using a finite difference method. To begin, we impose a discrete, square grid on the plate with uniform
spacing. Denote the points on the grid by (xi , yj ) and the value of u at these points (the heat) as
u(xi , yj ) = Ui,j . Using the centered difference quotient for second derivatives to approximate the
partial derivatives,
    0 = ∂²u/∂x² + ∂²u/∂y²
      ≈ (U_{i+1,j} − 2U_{i,j} + U_{i−1,j})/h² + (U_{i,j+1} − 2U_{i,j} + U_{i,j−1})/h²
      = (1/h²)(−4U_{i,j} + U_{i+1,j} + U_{i−1,j} + U_{i,j+1} + U_{i,j−1}),        (15.7)
where h = xi+1 − xi = yj+1 − yj is the distance between the grid points in either direction. This
problem can be formulated as a linear system. Suppose the grid has exactly (n + 2) × (n + 2) entries.
Then the interior of the grid (where u(x, y) is unknown) is n × n, and can be flattened into an n2 × 1
vector u. The entire first row goes first, then the second row, proceeding to the nth row.
    u = [ U_{1,1}  U_{1,2}  ···  U_{1,n}  U_{2,1}  U_{2,2}  ···  U_{2,n}  ···  U_{n,n} ]^T

Multiplying (15.7) through by h² gives one equation for each interior point:

    −4U_{i,j} + U_{i+1,j} + U_{i−1,j} + U_{i,j+1} + U_{i,j−1} = 0.        (15.8)
If any of the neighbors of U_{i,j} is a boundary point on the grid, its value is already determined by the
boundary conditions and moves to the right-hand side of (15.8). For example, the neighbor U_{3,0} of the
gridpoint for U_{3,1} is fixed at U_{3,0} = 100. In this case, (15.8) becomes

    −4U_{3,1} + U_{4,1} + U_{2,1} + U_{3,2} = −100.
The constants on the right side of (15.8) become the n2 × 1 vector b. All nonzero entries of b
correspond to interior points that touch the left or right boundaries.
As an example, writing (15.8) for the 16 interior points of the grid in Figure 15.1 results in the
following 16 × 16 system Au = b. Note the block structure (empty blocks are all zeros).
    -4  1  0  0    1  0  0  0                                   U1,1     -100
     1 -4  1  0    0  1  0  0                                   U1,2        0
     0  1 -4  1    0  0  1  0                                   U1,3        0
     0  0  1 -4    0  0  0  1                                   U1,4     -100
     1  0  0  0   -4  1  0  0    1  0  0  0                     U2,1     -100
     0  1  0  0    1 -4  1  0    0  1  0  0                     U2,2        0
     0  0  1  0    0  1 -4  1    0  0  1  0                     U2,3        0
     0  0  0  1    0  0  1 -4    0  0  0  1                     U2,4     -100
                   1  0  0  0   -4  1  0  0    1  0  0  0       U3,1  =  -100
                   0  1  0  0    1 -4  1  0    0  1  0  0       U3,2        0
                   0  0  1  0    0  1 -4  1    0  0  1  0       U3,3        0
                   0  0  0  1    0  0  1 -4    0  0  0  1       U3,4     -100
                                 1  0  0  0   -4  1  0  0       U4,1     -100
                                 0  1  0  0    1 -4  1  0       U4,2        0
                                 0  0  1  0    0  1 -4  1       U4,3        0
                                 0  0  0  1    0  0  1 -4       U4,4     -100
More concisely, for any positive integer n, the matrix A can be written as

    A = [ B  I             ]                [ -4   1              ]
        [ I  B  I          ]                [  1  -4   1          ]
        [    I  B  .       ] ,  where  B =  [      .    .    .    ]  is n × n.
        [        .  .   I  ]                [           1  -4   1 ]
        [            I  B  ]                [               1  -4 ]
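For instance, A could be assembled as a sparse matrix along the following lines (a sketch; the function name is illustrative):

import numpy as np
from scipy import sparse

def laplace_system_matrix(n):
    """Assemble the n^2 x n^2 block tridiagonal matrix A described above."""
    # B is tridiagonal with -4 on the diagonal and 1 on the off-diagonals.
    B = sparse.diags([1, -4, 1], offsets=[-1, 0, 1], shape=(n, n))
    # Repeat B down the block diagonal, then add the identity blocks offset by n.
    A = sparse.block_diag([B] * n) + sparse.diags([1, 1], offsets=[-n, n],
                                                  shape=(n**2, n**2))
    return A.tocsr()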
Problem 7. To demonstrate how convergence is affected by the value of the relaxation factor
ω in SOR, run your function from Problem 6 with ω = 1, 1.05, 1.1, . . . , 1.9, 1.95 and n = 20.
Plot the number of computed iterations as a function of ω. Return the value of ω that results
in the least number of iterations.
Note that the matrix A from Problem 6 is not strictly diagonally dominant. However,
A is positive definite, so the algorithm will converge. Unfortunately, convergence for these
kinds of systems usually requires more iterations than for strictly diagonally dominant systems.
Therefore, set tol=1e-2 and maxiter=1000.
Recall that ω = 1 corresponds to the Gauss-Seidel method. Choosing a more optimal
relaxation factor saves a large number of iterations. This could translate to saving days or
weeks of computation time while solving extremely large linear systems on a supercomputer.
16
The Drazin Inverse
Lab Objective: The Drazin inverse of a matrix is a pseudoinverse which preserves certain spectral
properties of the matrix. In this lab we compute the Drazin inverse using the Schur decomposition,
then use it to compute the effective resistance of a graph and perform link prediction.
• AA^D = A^D A
• A^(k+1) A^D = A^k
• A^D A A^D = A^D
Note that if A is invertible, in which case k = 0, then AD = A−1 . On the other hand, if A is nilpotent,
meaning Aj = 0 for some nonnegative integer j, then AD is the zero matrix.
where S is a change of basis matrix, M is nonsingular, and N is nilpotent. Then the Drazin inverse
can be calculated as

    A^D = S^{-1} [ M^{-1}  0 ] S.        (16.2)
                 [   0     0 ]
To put A into the form in (16.1), we can use the Schur decomposition of A, given by
A = QT Q−1 , (16.3)
where Q is orthonormal and T is upper triangular. Since T is similar to A, the eigenvalues of A are
listed along the diagonal of T . If A is singular, at least one diagonal entry of T must be 0.
In general, Schur decompositions are not unique; the eigenvalues along the diagonal of T can
be reordered. To find M , N , and S, we compute the Schur decomposition of A twice, ordering the
eigenvalues differently in each decomposition.
First, we sort so that the nonzero eigenvalues are listed first along the diagonal of T . Then, if
k is the number of nonzero eigenvalues, the upper left k × k block of T forms the nonsingular matrix
M , and the first k columns of Q form the first k columns of the change of basis matrix S.
Computing the decomposition a second time, we reorder so that the 0 eigenvalues are listed
first along the diagonal of T . Then the upper left (n − k) × (n − k) block forms the nilpotent matrix
N , and the first n − k columns of Q form the last n − k columns of S. This completes a change of
basis matrix that will put A into the desired block diagonal form. Lastly, we use (16.2) to compute
AD .
SciPy’s la.schur() is a routine for computing the Schur decomposition of a matrix, but it
does not automatically sort it by eigenvalue. However, sorting can be accomplished by specifying the
sort keyword argument. Given an eigenvalue, the sorting function should return a boolean indicating
whether to sort that eigenvalue to the top left of the diagonal of T .
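A call of roughly the following form produces sorted output like that shown below (a sketch; the example matrix here is a small singular matrix with eigenvalues 2, 1, and 0, and the exact off-diagonal entries of T depend on the decomposition):

>>> import numpy as np
>>> from scipy import linalg as la
>>> A = np.array([[ 0., 0., 2.],
...               [-3., 2., 6.],
...               [ 0., 0., 1.]])
>>> f = lambda x: abs(x) > 1e-8       # send the nonzero eigenvalues to the top left
>>> T, Q, k = la.schur(A, sort=f)
>>> T                                 # the sorted upper-triangular factor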
array([[ 2. , 0. , 6.70820393],
[ 0. , 1. , 2. ],
[ 0. , 0. , 0. ]])
>>> k # k is the number of columns satisfying the sort,
2 # which is the number of nonzero eigenvalues.
The procedure for finding the Drazin inverse using the Schur decomposition is given in Algo-
rithm 1. Due to possible floating point arithmetic errors, consider all eigenvalues smaller than a
certain tolerance to be 0.
Algorithm 1
1: procedure Drazin(A, tol)
2: (n, n) ← shape(A)
3: T1 , Q1 , k1 ← schur(A, |x| > tol) ▷ Sort the Schur decomposition with 0 eigenvalues last.
4: T2 , Q2 , k2 ← schur(A, |x| ≤ tol) ▷ Sort the Schur decomposition with 0 eigenvalues first.
5: U ← [Q1:,:k1 | Q2:,:n−k1 ] ▷ Create change of basis matrix.
6: U −1 ← inverse(U)
7: V ← U −1 AU ▷ Find block diagonal matrix in (16.1)
8: Z ← 0n×n
9: if k1 ̸= 0 then
10: M −1 ← inverse(V:k1 ,:k1 )
11: Z:k1 ,:k1 ← M −1
12: return U ZU −1
Problem 2. Write a function that accepts an n × n matrix A and a tolerance for rounding
eigenvalues to zero. Use Algorithm 1 to compute the Drazin inverse AD . Use your function
from Problem 1 to verify your implementation.
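A verification of the kind Problem 1 asks for might test the three defining properties directly with np.allclose() (a sketch; the function name, and treating k as the index of A, are assumptions):

import numpy as np

def is_drazin(A, Ad, k):
    """Return True if Ad satisfies the three defining properties of the
    Drazin inverse of A, where k is the index of A."""
    return (np.allclose(A @ Ad, Ad @ A)
            and np.allclose(np.linalg.matrix_power(A, k + 1) @ Ad,
                            np.linalg.matrix_power(A, k))
            and np.allclose(Ad @ A @ Ad, Ad))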
Achtung!
Because the algorithm for the Drazin inverse requires calculation of the inverse of a matrix, it
is unstable when that matrix has a high condition number. If the algorithm does not find the
correct Drazin inverse, check the condition number of V from Algorithm 1.
Note
The Drazin inverse is called a pseudoinverse because AD = A−1 for invertible A, and for
noninvertible A, AD always exists and acts similarly to an inverse. There are other matrix
pseudoinverses that preserve different qualities of A, including the Moore-Penrose pseudoinverse
A† , which can be thought of as the least squares approximation to A−1 .
Figure 16.1: A graph with a resistor on each edge.
In electromagnetism, there are rules for manually calculating the effective resistance between
two nodes for relatively simple graphs. However, this is infeasible for large or complicated graphs.
Instead, we can use the Drazin inverse to calculate effective resistance for any graph.
First, create the adjacency matrix 1 of the graph, the matrix where the (ij)th entry is the
number of connections from node i to node j. Next, calculate the Laplacian L of the adjacency
matrix. Then if Rij is the effective resistance from node i to node j,
    R_ij = ( (L̃_j)^D )_ii   if i ≠ j,
    R_ij = 0                 if i = j,        (16.4)

where L̃_j is the Laplacian with the jth row of the Laplacian replaced by the jth row of the identity
matrix, and (L̃_j)^D is its Drazin inverse.
Problem 3. Write a function that accepts the n × n adjacency matrix of an undirected graph.
Use (16.4) to compute the effective resistance from each node to every other node. Return an
n × n matrix where the (ij)th entry is the effective resistance from node i to node j. Keep the
following in mind:
1 See Problem 1 of Image Segmentation for a refresher on adjacency matrices and the Laplacian.
• Consider creating the matrix column by column instead of entry by entry. Every time
you compute the Drazin inverse, the whole diagonal of the matrix can be used.
Test your function using the graphs and values from Figure 16.2.
[Figure 16.2: several small example graphs on nodes a and b (and c), each labeled with the effective resistance between its nodes.]
Figure 16.2: The effective resistance between two points for several simple graphs. Nodes that are
farther apart have a larger effective resistance, while nodes that are nearer or better connected have
a smaller effective resistance.
Link Prediction
Link prediction is the problem of predicting the likelihood of a future association between two uncon-
nected nodes in a graph. Link prediction has application in many fields, but the canonical example
is friend suggestions on Facebook. The Facebook network can be represented by a large graph where
each user is a node, and two nodes have an edge connecting them if they are “friends.” Facebook
aims to predict who you would like to become friends with in the future, based on who you are
friends with now, as well as discover which friends you may have in real life that you have not yet
connected with online. To do this, Facebook must have some way to measure how closely two users
are connected.
We will compute link prediction using effective resistance as a metric. Effective resistance
measures how closely two nodes are connected, and nodes that are closely connected at present are
more likely to be connected in the future. Given an undirected graph, the next link should connect
the two unconnected nodes with the least effective resistance between them.
Problem 4. Write a class called LinkPredictor for performing link prediction. Implement
the __init__() method so that it accepts the name of a csv file containing information about a
social network. Each row of the file should contain the names of two nodes which are connected
by an (undirected) edge.
Store each of the names of the nodes of the graph as an ordered list. Next, create the
adjacency matrix for the network where the ith row and column of the matrix correspond to the
ith member of the list of node names. Finally, use your function from Problem 3 to compute
the effective resistance matrix. Save the list of names, the adjacency matrix, and the effective
resistance matrix as attributes.
(a) You want to find the two nodes which have the smallest effective resistance between
them which are not yet connected. Use information from the adjacency matrix to
zero out all entries of the effective resistance matrix that represent connected nodes.
The "*" operator multiplies arrays component-wise, which may be helpful.
(b) Find the next link by finding the minimum value of the array that is nonzero. Your
array may be the whole matrix or just a column if you are only considering links for
a certain node. This can be accomplished by passing np.min() a masked version of
your matrix to exclude entries that are 0.
(c) NumPy’s np.where() is useful for finding the minimum value in an array:
>>> A = np.random.randint(-9,9,(3,3))
>>> A
array([[ 6, -8, -9],
[-2, 1, -1],
[ 4, 0, -3]])
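The example might continue along these lines (a sketch):

>>> np.where(A == A.min())        # indices of the minimum entry: row 0, column 2
(array([0]), array([2]))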
2. add_link(): Take as input two names of nodes, and add a link between them. If either
name is not in the network, raise a ValueError. Add the link by updating the adjacency
matrix and the effective resistance matrix.
Figure 16.3 visualizes the data in social_network.csv. Use this graph to verify that
your class is suggesting plausible new links. You should observe the following:
• In the entire network, Emily and Oliver are most likely to become friends next.
• Alan is expected to become friends with Sonia, then with Piers, and then with Abigail.
Figure 16.3: The social network contained in social_network.csv. Adapted from data by Wayne W.
Zachary (see https://en.wikipedia.org/wiki/Zachary%27s_karate_club).
17
The Arnoldi Iteration
Lab Objective: The Arnoldi Iteration is an efficient method for finding the eigenvalues of extremely
large matrices. Instead of using standard methods, the iteration uses Krylov subspaces to approximate
how a linear operator acts on vectors. With this approach, the Arnoldi Iteration facilitates the
computation of eigenvalues for enormous matrices without needing to physically create the matrix
in memory. We will explore this subject by implementing the Arnoldi iteration algorithm, using our
implementation for eigenvalue computation, and then graphically representing the accuracy of our
approximated eigenvalues.
Krylov Subspaces
One of the biggest difficulties in numerical linear algebra is the amount of memory needed to store
a large matrix and the amount of time needed to read its entries. Methods using Krylov subspaces
avoid this difficulty by studying how a matrix acts on vectors, making it unnecessary in many cases
to create the matrix itself.
The Arnoldi Iteration is an algorithm for finding an orthonormal basis of a Krylov subspace.
One of its strengths is that it can run on any linear operator without knowing the operator’s under-
lying matrix representation. The outputs of the Arnoldi algorithm can then be used to approximate
the eigenvalues of the matrix of the linear operator.
The order-n Krylov subspace of A generated by x is

    K_n(A, x) = span{x, Ax, A²x, . . . , A^{n−1}x}.

If the vectors {x, Ax, A²x, . . . , A^{n−1}x} are linearly independent, then they form a basis for K_n(A, x).
However, An x frequently converges to a dominant eigenvector of A as n gets large, which fills the
basis with many almost parallel vectors. This yields a basis prone to ill-conditioned computations
and numerical instability.
The algorithm begins by initializing a matrix H which will be an upper Hessenberg matrix and
a matrix Q which will be filled with the basis vectors of our Krylov subspace. It also requires an
initial vector b ̸= 0 which is normalized to get q1 = b/ ∥b∥. This represents the basis for the initial
Krylov subspace, K1 (A, b).
For the kth iteration, compute the next basis vector q_{k+1} by using the modified Gram-Schmidt
process to make Aq_k orthogonal to each of q_1, . . . , q_k. This entails orthogonalizing Aq_k against each
column of Q before proceeding to the next iteration. The vectors {q_i}_{i=1}^k are then a basis for K_k(A, b). If ∥q_{k+1}∥
is below a certain tolerance, stop and return H and Q. Otherwise, normalize the new basis vector
q_{k+1} and continue to the next iteration.
Algorithm 1 The Arnoldi iteration. This algorithm accepts a square matrix A and a starting vector
b. It iterates k times or until the norm of the next vector in the iteration is less than tol. The
algorithm returns an upper Hessenberg H and an orthonormal Q such that H = QH AQ.
1: procedure Arnoldi(b, A, k, tol)
2: Q ← empty(size(b), k + 1) ▷ Some initialization steps
3: H ← zeros(k + 1, k)
4: Q:,0 ← b/ ∥b∥2
5: for j = 0 . . . k − 1 do ▷ Perform the actual iteration.
6: Q:,j+1 ← A(Q:,j )
7: for i = 0 . . . j do ▷ Modified Gram-Schmidt.
8: H_{i,j} ← Q_{:,i}^H Q_{:,j+1}
9: Q:,j+1 ← Q:,j+1 − Hi,j Q:,i
10: Hj+1,j ← ∥Q:,j+1 ∥2 ▷ Set subdiagonal element of H.
11: if |Hj+1,j | < tol then ▷ Stop if ∥Q:,j+1 ∥2 is small enough.
12: return H:j+1,:j+1 , Q:,:j+1
13: Q:,j+1 ← Q:,j+1 /Hj+1,j ▷ Normalize qj+1 .
14: return H:−1,: , Q ▷ Return Hk and Q.
Achtung!
If the starting vector x is an eigenvector of A with corresponding eigenvalue λ, then by definition
Kk (A, x) = span{x, λx, λ2 x, . . . , λk x}, which is equal to the span of x. So, when x is normalized
with q1 = x/∥x∥, q2 = Aq1 = λq1 .
The vector q2 is supposed to be the next vector in the orthonormal basis for Kk (A, x),
but it is not linearly independent of q1 . In fact, q1 already spans Kk (A, x). Hence, the Gram-
Schmidt process fails and results in a ZeroDivisionError or an extremely early termination
of the algorithm. A similar phenomenon may occur if the starting vector x is contained in a
proper invariant subspace of A.
Problem 1. Write a function that accepts a starting vector b for the Arnoldi Iteration, a
function handle L that describes a linear operator, the number of times n to perform the
iteration, and a tolerance tol that defaults to 10−8 . Use Algorithm 1 to implement the Arnoldi
Iteration with these parameters. Return the upper Hessenberg matrix H and the orthonormal
matrix Q from the iteration.
Consider the following implementation details.
1. Since H and Q will eventually hold complex numbers, initialize them as complex arrays
(e.g., A = np.empty((3,3), dtype=np.complex128)).
2. This function can be tested on a matrix A by passing in A.dot for a linear operator.
b = A.conj() @ B
    H_k = Q_k^H A Q_k.        (17.1)
Problem 2. Write a function that accepts a function handle L that describes a linear operator,
the dimension of the space dim that the linear operator works on, the number of times k to
perform the Arnoldi Iteration, and the number of Ritz values n to return. Use the previous
implementation of the Arnoldi Iteration and an eigenvalue function such as scipy.linalg.eig()
to compute the largest Ritz values of the given operator. Return the n largest Ritz
values.
One application of the Arnoldi iteration is to find the eigenvalues of linear operators that are
too large to store in memory. For example, if an operator acts on a vector x ∈ C^(2^20), then its matrix
representation contains 2^40 complex values. Storing such a matrix would require 64 terabytes of
memory!
An example of such an operator is the Fast Fourier Transform, cited by SIAM as one of the
top algorithms of the century [Cip00]. The Fast Fourier Transform is used very commonly in signal
processing.
Problem 3. The four largest eigenvalues of the Fast Fourier Transform are known to be
{−√n, √n, −i√n, i√n}, where n is the dimension of the space on which the transform acts.
Use your function from Problem 2 to approximate the eigenvalues of the Fast Fourier
Transform. Set k = 10 and dim = 2^20. For the argument L, use scipy.fftpack.fft().
The Arnoldi iteration for finding eigenvalues is implemented in a Fortran library called ARPACK.
SciPy interfaces with the Arnoldi iteration in this library via the function scipy.sparse.linalg.
eigs(). This function has many more options than the implementation we wrote in Problem 2. In
the example below, the keyword argument k=5 specifies that we want five Ritz values. Note that even
though this function comes from the sparse library in SciPy, we can still call it on regular NumPy
arrays.
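A sketch of such a call on an ordinary (dense) NumPy array:

>>> import numpy as np
>>> from scipy.sparse import linalg as spla
>>> A = np.random.random((500, 500))
>>> ritz_vals = spla.eigs(A, k=5, return_eigenvectors=False)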
Convergence
As more iterations of the Arnoldi method are performed, our approximations are of higher rank.
Consequently, the Ritz values become more accurate approximations to the eigenvalues of the linear
operator.
This technique converges quickly to eigenvalues whose magnitude is distinctly larger than the
rest. For example, matrices with random entries tend to have one eigenvalue of distinctly greatest
magnitude. Convergence of the Ritz values for such a matrix is plotted in Figure 17.1a.
However, Ritz values converge more slowly for matrices with random eigenvalues. Figure 17.1b
plots convergence of the Ritz values for a matrix with eigenvalues uniformly distributed in [0, 1).
Problem 4. Write a function that accepts a linear operator A, the number of Ritz values
to plot n, and the number of times to perform the Arnoldi iteration iters. Use these
parameters to create a plot of the absolute error between the largest Ritz values of A and the
largest eigenvalues of A.
2. Create an empty array to store the relative errors for every k = 0, 1, . . . , iters.
(a) Use your Ritz function to find the n largest Ritz values of the operator. Note that
for small k, the matrix Hk may not have this many eigenvalues. Due to this, the
graphs of some eigenvalues have to begin after a few iterations.
(b) Store the absolute error between the eigenvalues of A and the Ritz values of H. Make
sure that the errors are stored in the correct order.
3. Plot the errors for each eigenvalue against the iteration count.
Figure 17.1: These plots show the relative error of the Ritz values as approximations to the eigenvalues
of a matrix. The figure on the left plots the largest 15 Ritz values for a 500 × 500 matrix with random
entries and demonstrates that the largest eigenvalue (the blue line) converges after 20 iterations.
The figure at right plots the largest 15 Ritz values for a 500 × 500 matrix with uniformly distributed
eigenvalues in [0, 1) and demonstrates that all the eigenvalues take from 150 to 250 iterations to
converge.
Hints: If x̃ is an approximation to x, then the absolute error in the approximation is ∥x − x̃∥.
Sort your eigenvalues from greatest to least. An example of how to do this is included:
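The included example is along these lines (a sketch; here the sort is by magnitude, largest first):

>>> import numpy as np
>>> A = np.random.random((500, 500))
>>> eigvalues = np.linalg.eigvals(A)
>>> order = np.argsort(np.abs(eigvalues))[::-1]     # largest magnitude first
>>> eigvalues = eigvalues[order]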
In addition, remember that certain eigenvalues of H will not appear until we are computing
enough iterations in the Arnoldi algorithm. As a result, we will have to begin the graphs of
several eigenvalues after we are computing sufficient iterations of the algorithm.
Run your function on these examples. The plots should be fairly similar to Figures 17.1b
and 17.1a.
Additional Material
The Lanczos Iteration
The Lanczos iteration is a version of the Arnoldi iteration that is optimized to operate on symmetric
matrices. If A is symmetric, then (17.1) shows that Hk is symmetric and hence tridiagonal. This
leads to two simplifications of the Arnoldi algorithm.
First, we have 0 = Hk,n = ⟨qk , Aqn ⟩ for k ≤ n − 2; i.e., Aqn is orthogonal to q1 , . . . , qn−2 .
Thus, if the goal is only to compute Hk (say to find the Ritz values), then we only need to store the
two most recently computed columns of Q. Second, the data of Hk can also be stored in two vectors,
one containing the main diagonal and one containing the first subdiagonal of Hk (by symmetry, the
first superdiagonal equals the first subdiagonal of Hk ).
Algorithm 2 The Lanczos Iteration. This algorithm operates on a vector b of length n and an
n × n symmetric matrix A. It iterates k times or until the norm of the next vector in the iteration
is less than tol. It returns two vectors x and y that respectively contain the main diagonal and first
subdiagonal of the current Hessenberg approximation.
1: procedure Lanczos(b, A, k, tol)
2: q0 ← zeros(size(b)) ▷ Some initialization
3: q1 ← b/ ∥b∥2
4: x ← empty(k)
5: y ← empty(k)
6: for i = 0 . . . k − 1 do ▷ Perform the iteration.
7: z ← Aq1 ▷ z is a temporary vector to store qi+1 .
8: x[i] ← qT 1z ▷ q1 is used to store the previous qi .
9: z ← z − x[i]q1 − y[i − 1]q0 ▷ q0 is used to store qi−1 .
10: y[i] = ∥z∥2 ▷ Initialize y[i].
11: if y[i] < tol then ▷ Stop if ∥qi+1 ∥2 is too small.
12: return x[: i + 1], y[: i]
13: z = z/y[i]
14: q0 , q1 = q1 , z ▷ Store new qi+1 and qi on top of q1 and q0 .
15: return x, y[: −1]
As it is described in Algorithm 2, the Lanczos iteration is not stable. Roundoff error may cause
the qi to be far from orthogonal. In fact, it is possible for the qi to be so adulterated by roundoff
error that they are no longer linearly independent.
There are modified versions of the Lanczos iteration that are numerically stable. One of these,
the Implicitly Restarted Lanczos Method, is found in SciPy as scipy.sparse.linalg.eigsh().
18
GMRES
Lab Objective: The Generalized Minimal Residuals (GMRES) algorithm is an iterative Krylov
subspace method for efficiently solving large linear systems. In this lab we implement the basic GM-
RES algorithm, then make an improvement by using restarts. We then discuss the convergence of the
algorithm and its relationship with the eigenvalues of a linear system. Finally, we introduce SciPy’s
version of GMRES.
The GMRES algorithm uses the Arnoldi iteration for numerical stability. The Arnoldi iteration
produces Hn , an (n + 1) × n upper Hessenberg matrix, and Qn , a matrix whose columns make up an
orthonormal basis of Kn (A, b), such that AQn = Qn+1 Hn . The GMRES algorithm finds the vector
xn which minimizes the norm ∥b − Axn ∥2 , where xn = Qn yn + x0 for some yn ∈ Rn . Since the
columns of Q_n are orthonormal, the residual can be equivalently computed as

    ∥b − Ax_n∥₂ = ∥H_n y_n − βe_1∥₂.        (18.1)

Here e_1 is the vector [1, 0, . . . , 0]^T of length n + 1 and β = ∥b − Ax_0∥₂, where x_0 is an initial
guess of the solution. Thus, to minimize ∥b − Axn ∥2 , the right side of (18.1) can be minimized, and
xn can be computed as xn = Qn yn + x0 .
Algorithm 1 The GMRES algorithm. This algorithm operates on a vector b and a linear operator
A. It iterates k times or until the residual is less than tol, returning an approximate solution to
Ax = b and the error in this approximation.
1: procedure GMRES(A, b, x0 , k, tol)
2: Q ← empty(size(b), k + 1) ▷ Initialization.
3: H ← zeros(k + 1, k)
4: r0 ← b − A(x0 )
5: Q:,0 = r0 / ∥r0 ∥2
6: for j = 0 . . . k − 1 do ▷ Perform the Arnoldi iteration.
7: Q:,j+1 ← A(Q:,j )
8: for i = 0 . . . j do
9: H_{i,j} ← Q_{:,i}^T Q_{:,j+1}
10: Q:,j+1 ← Q:,j+1 − Hi,j Q:,i
11: Hj+1,j ← ∥Q:,j+1 ∥2
12: if |Hj+1,j | > tol then ▷ Avoid dividing by zero.
13: Q:,j+1 ← Q:,j+1 /Hj+1,j
14: y ← least squares solution to ∥H:j+2,:j+1 x − βe1 ∥2 ▷ β and e1 as in (18.1).
15: res ← ∥H:j+2,:j+1 y − βe1 ∥2
16: if res < tol then
17: return Q:,:j+1 y + x0 , res
18: return Q:,:j+1 y + x0 , res
Problem 1. Write a function that accepts a matrix A, a vector b, and an initial guess x0 , a
maximum number of iterations k defaulting to 100, and a stopping tolerance tol that defaults
to 10−8 . Use Algorithm 1 to approximate the solution to Ax = b using the GMRES algorithm.
Return the approximate solution and the residual at the approximate solution.
You may assume that A and b only have real entries. Use scipy.linalg.lstsq() to
solve the least squares problem. Be sure to read the documentation so that you understand
what the function returns.
Compare your function to the following code.
>>> A = np.array([[1,0,0],[0,2,0],[0,0,3]])
>>> b = np.array([1, 4, 6])
>>> x0 = np.zeros(b.size)
>>> gmres(A, b, x0, k=100, tol=1e-8)
(array([ 1., 2., 2.]), 7.174555448775421e-16)
Convergence of GMRES
One of the most important characteristics of GMRES is that it will always arrive at an exact solution
(if one exists). At the n-th iteration, GMRES computes the best approximate solution to Ax = b for
xn ∈ Kn . If A is full rank, then Km = Fm , so the mth iteration will always return an exact answer.
Sometimes the exact solution x lies in Kn for some n &lt; m; in this case xn is an exact solution. In either
case, the algorithm is convergent after n steps if the nth residual is sufficiently small.
Problem 2. Add a keyword argument plot defaulting to False to your function from Problem
1. If plot=True, keep track of the residuals at each step of the algorithm. At the end of the
iteration, before returning the approximate solution and its residual error, create a figure with
two subplots.
2. Plot the residuals versus the iteration counts using a log scale on the y-axis
(use ax.semilogy()).
Problem 3. Use your function from Problem 2 to investigate how the convergence of GMRES
relates to the eigenvalues of a matrix as follows. Define an m × m matrix
An = nI + P,
where I is the identity matrix and P is an m × m matrix with entries taken from a random
normal distribution with mean 0 and standard deviation 1/(2√m). Call your function from
Problem 2 on An for n = −4, −2, 0, 2, 4. Use m = 200, let b be an array of all ones, and let
x0 = 0.
Use np.random.normal() to create the matrix P . When analyzing your results, pay
special attention to the clustering of the eigenvalues in relation to the origin. Compare your
results with n = 2, m = 200 to Figure 18.1.
Ideas for this problem were taken from Example 35.1 on p. 271 of [TB97].
Figure 18.1: On the left, the eigenvalues of the matrix A2 defined in Problem 3. On the right,
the rapid convergence of the GMRES algorithm on A2 with starting vector b = (1, 1, . . . , 1).
This issue is addressed by using GMRES(k), or GMRES with restarts. When k becomes large,
this algorithm restarts GMRES with an improved initial guess. The new initial guess is taken to
be the vector that was found upon termination of the last GMRES iteration run. The algorithm
GMRES(k) will always have manageable spatial and temporal complexity, but it is less reliable than
GMRES. If the true solution x to Ax = b is nearly orthogonal to the Krylov subspaces Kn (A, b) for
n ≤ k, then GMRES(k) could converge very slowly or not at all.
2. If the desired tolerance was reached, terminate the algorithm. If not, repeat step 1 using
xk from the previous GMRES algorithm as a new initial guess x0 .
3. Repeat step 2 until the desired tolerance has been obtained or until a given maximum
number of restarts has been reached.
Your function should accept all of the same inputs as the function you wrote in Problem 1 with
the exception of k, which will now denote the number of iterations before restart (defaults to 5),
and an additional parameter restarts which denotes the maximum number of restarts before
termination (defaults to 50).
GMRES in SciPy
The GMRES algorithm is implemented in SciPy as the function scipy.sparse.linalg.gmres().
Here we use this function to solve Ax = b where A is a random 300 × 300 matrix and b is a random
vector.
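A minimal sketch of such a call (default keyword arguments; the original run may have set additional options such as a maximum number of iterations):

>>> import numpy as np
>>> from scipy import linalg as la
>>> from scipy.sparse import linalg as spla
>>> A = np.random.random((300, 300))
>>> b = np.random.random(300)
>>> x, info = spla.gmres(A, b)      # returns the approximation and an info flag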
The function outputs two objects: the approximate solution x and an integer info which gives
information about the convergence of the algorithm. If info=0 then convergence occurred; if info
is positive then it equals the number of iterations performed. In the previous case, the function
performed 3000 iterations of GMRES before returning the approximate solution x. The following
code verifies how close the computed value was to the exact solution.
>>> la.norm((A @ x) - b)
4.744196381683801
This time, the returned approximation x is about as close to a true solution as can be expected.
Problem 5. Plot the runtimes of your implementations of GMRES from Problems 1 and 4
and of scipy.sparse.linalg.gmres() (use the default tolerance and restart=1000) on different
matrices. Use the m × m matrix P with m = 25, 50, . . . , 200 and with entries taken from a random
normal distribution with mean 0 and standard deviation 1/(2√m). Use a vector of ones for b
and a vector of zeros for x0. Use a single figure for all plots, plotting the runtime on the y-axis and
m on the x-axis.
Part II
Appendices
A
NumPy Visual Guide
Lab Objective: NumPy operations can be difficult to visualize, but the concepts are straightforward.
This appendix provides visual demonstrations of how NumPy arrays are used with slicing syntax,
stacking, broadcasting, and axis-specific operations. Though these visualizations are for 1- or 2-
dimensional arrays, the concepts can be extended to n-dimensional arrays.
Data Access
The entries of a 2-D array are the rows of the matrix (as 1-D arrays). To access a single entry, enter
the row index, a comma, and the column index. Remember that indexing begins with 0.
    A[0]      →  the entire first row of A (returned as a 1-D array)
    A[2,1]    →  the single entry in row 2, column 1
Slicing
A lone colon extracts an entire row or column from a 2-D array. The syntax [a:b] can be read as
“the ath entry up to (but not including) the bth entry.” Similarly, [a:] means “the ath entry to the
end” and [:b] means “everything up to (but not including) the bth entry.”
    A[1] = A[1,:]   →  the entire second row of A
    A[:,2]          →  the entire third column of A
    A[1:,:2]        →  all rows from index 1 to the end, columns 0 and 1
    A[1:-1,1:-1]    →  the interior of A (every row and column except the first and last)
Stacking
np.hstack() stacks a sequence of arrays horizontally and np.vstack() stacks a sequence of arrays
vertically.
    A = × × ×        B = ∗ ∗ ∗
        × × ×            ∗ ∗ ∗
        × × ×            ∗ ∗ ∗

    np.hstack((A,B,A)) = × × × ∗ ∗ ∗ × × ×
                         × × × ∗ ∗ ∗ × × ×
                         × × × ∗ ∗ ∗ × × ×

    np.vstack((A,B,A)) = × × ×
                         × × ×
                         × × ×
                         ∗ ∗ ∗
                         ∗ ∗ ∗
                         ∗ ∗ ∗
                         × × ×
                         × × ×
                         × × ×
Because 1-D arrays are flat, np.hstack() concatenates 1-D arrays and np.vstack() stacks them
vertically. To make several 1-D arrays into the columns of a 2-D array, use np.column_stack().
    x = × × × ×        y = ∗ ∗ ∗ ∗

    np.hstack((x,y,x)) = × × × × ∗ ∗ ∗ ∗ × × × ×

    np.vstack((x,y,x)) = × × × ×        np.column_stack((x,y,x)) = × ∗ ×
                         ∗ ∗ ∗ ∗                                   × ∗ ×
                         × × × ×                                   × ∗ ×
                                                                   × ∗ ×
The functions np.concatenate() and np.stack() are more general versions of np.hstack() and
np.vstack(), and np.row_stack() is an alias for np.vstack().
Broadcasting
NumPy automatically aligns arrays for component-wise operations whenever possible. See
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html for more in-depth examples and
broadcasting rules.
    A = 1 2 3        x = 10 20 30
        1 2 3
        1 2 3

    A + x = 1 2 3   +   10 20 30   =   11 22 33
            1 2 3                      11 22 33
            1 2 3                      11 22 33

    A + x.reshape((-1,1)) = 1 2 3   +   10   =   11 12 13
                            1 2 3       20       21 22 23
                            1 2 3       30       31 32 33
1 2 3 4
1 2 3 4
A=
1
2 3 4
1 2 3 4
1 2 3 4
1 2 3 4
A.sum(axis=0) =
= 4 8 12 16
1 2 3 4
1 2 3 4
1 2 3 4
1 2 3 4
A.sum(axis=1) =
= 10 10 10 10
1 2 3 4
1 2 3 4
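The same behavior can be reproduced in an interpreter; the arrays below are a small sketch:
>>> A = np.array([[1,2,3],[1,2,3],[1,2,3]])
>>> x = np.array([10,20,30])
>>> A + x                           # x is broadcast across each row
array([[11, 22, 33],
       [11, 22, 33],
       [11, 22, 33]])
>>> A + x.reshape((-1,1))           # the column version is broadcast down each column
array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])
>>> A.sum(axis=0)                   # sum down each column of this 3x3 array
array([3, 6, 9])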
Appendix B: Matplotlib Syntax and Customization Guide
Lab Objective: The documentation for Matplotlib can be a little difficult to maneuver and basic
information is sometimes difficult to find. This appendix condenses and demonstrates some of the
more applicable and useful information on plot customizations. It is not intended to be read all at
once, but rather to be used as a reference when needed. For an interactive introduction to Matplotlib,
see the Introduction to Matplotlib lab in Python Essentials. For more details on any specific function,
refer to the Matplotlib documentation at https://matplotlib.org/.
Matplotlib Interface
Matplotlib plots are made in a Figure object that contains one or more Axes, which themselves
contain the graphical plotting data. Matplotlib provides two ways to create plots:
1. Call plotting functions directly from the module, such as plt.plot(). This will create the plot
on whichever Axes is currently active.
2. Call plotting functions from an Axes object, such as ax.plot(). This is particularly useful for
complicated plots and for animations.
Table B.1 contains a summary of functions that are used for managing Figure and Axes objects.
Function Description
add_subplot() Add a single subplot to the current figure
axes() Add an axes to the current figure
clf() Clear the current figure
figure() Create a new figure or grab an existing figure
gca() Get the current axes
gcf() Get the current figure
subplot() Add a single subplot to the current figure
subplots() Create a figure and add several subplots to it
Axes objects are usually managed through the functions plt.subplot() and plt.subplots().
The function subplot() is used as plt.subplot(nrows, ncols, plot_number). Note that if the
inputs for plt.subplot() are all integers, the commas between the entries can be omitted. For
example, plt.subplot(3,2,2) can be shortened to plt.subplot(322).
The function subplots() is used as plt.subplots(nrows, ncols), and returns a Figure
object and an array of Axes. This array has the shape (nrows, ncols), and can be accessed as any
other array. Figure B.1 demonstrates the layout and indexing of subplots.
1 2 3
4 5 6
Figure B.1: The layout of subplots with plt.subplot(2,3,i) (2 rows, 3 columns), where i is the
index pictured above. The outer border is the figure that the axes belong to.
The following example demonstrates three equivalent ways of producing a figure with two
subplots, arranged next to each other in one row:
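A minimal sketch of three such constructions (the specific calls here are illustrative):
# 1. Add the subplots one at a time.
>>> plt.subplot(1,2,1)
>>> plt.subplot(1,2,2)
# 2. The same, using the shortened integer syntax.
>>> plt.subplot(121)
>>> plt.subplot(122)
# 3. Create the figure and both Axes objects at once.
>>> fig, axes = plt.subplots(1, 2)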
Achtung!
Be careful not to mix up the following similarly-named functions:
1. plt.axes() creates a new place to draw on the figure, while plt.axis() or ax.axis()
sets properties of the x- and y-axis in the current axes, such as the x and y limits.
2. plt.subplot() (singular) returns a single subplot belonging to the current figure, while
plt.subplots() (plural) creates a new figure and adds a collection of subplots to it.
Plot Customization
Styles
Matplotlib has a number of built-in styles that can be used to set the default appearance of plots.
These can be used via the function plt.style.use(); for instance, plt.style.use("seaborn")
will have Matplotlib use the "seaborn" style for all plots created afterwards. A list of built-in
styles can be found at https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html.
The style can also be changed only temporarily using plt.style.context() along with a with
block:
with plt.style.context('dark_background'):
    # Any plots created here use the new style
    plt.subplot(1,2,1)
    plt.plot(x, y)
    # ...
# Plots created here are unaffected
plt.subplot(1,2,2)
plt.plot(x, y)
Plot layout
Axis properties
Table B.2 gives an overview of some of the functions that may be used to configure the axes of a
plot.
The functions xlim(), ylim(), and axis() are used to set one or both of the x and y ranges
of the plot. xlim() and ylim() each accept two arguments, the lower and upper bounds, or a single
list of those two numbers. axis() accepts a single list consisting, in order, of xmin, xmax, ymin,
ymax. Passing None in place of one of the numbers leaves the corresponding bound unchanged.
Each of these functions can also be called without any
arguments, in which case it will return the current bounds. Note that axis() can also be called
directly on an Axes object, while xlim() and ylim() cannot.
axis() can also be called with a string as its argument, which has several options. The most
common is axis('equal'), which makes the scales of the x- and y-axes equal (i.e., makes circles
circular).
Function Description
axis() set the x- and y-limits of the plot
grid() add gridlines
xlim() set the limits of the x-axis
ylim() set the limits of the y-axis
xticks() set the location of the tick marks on the x-axis
yticks() set the location of the tick marks on the y-axis
xscale() set the scale type to use on the x-axis
yscale() set the scale type to use on the y-axis
ax.spines[side].set_position() set the location of the given spine
ax.spines[side].set_color() set the color of the given spine
ax.spines[side].set_visible() set whether a spine is visible
Table B.2: Some functions for changing axis properties. ax is an Axes object.
To use a logarithmic scale on an axis, the functions xscale("log") and yscale("log") can
be used.
The functions xticks() and yticks() accept a list of tick positions, which the ticks on the
corresponding axis are set to. Generally, this works best when used with np.linspace(). These
functions also optionally accept a second argument, a list of labels for the ticks. If called with no
arguments, they return the current tick positions and labels instead.
The spines of a Matplotlib plot are the black border lines around the plot, with the left and
bottom ones also being used as the axis lines. To access the spines of a plot, call ax.spines[side],
where ax is an Axes object and side is 'top', 'bottom', 'left', or 'right'. Then, functions can
be called on the Spine object to configure it.
The function spine.set_position() has several ways to specify the position. The two simplest
are with the arguments 'center' and 'zero', which place the spine in the center of the subplot or
at an x- or y-coordinate of zero, respectively. The others are passed as a tuple (position_type,
amount):
• 'axes': place the spine at the specified Axes coordinate, where 0 corresponds to the bottom
or left of the subplot, and 1 corresponds to the top or right edge of the subplot.
• 'outward': places the spine amount pixels outward from the edge of the plot area. A negative
value can be used to move it inwards instead.
spine.set_color() accepts any of the color formats Matplotlib supports. Alternately,
set_color('none') makes the spine invisible; spine.set_visible() can also be used for
this purpose.
The following example adjusts the ticks and spine positions to improve the readability of a plot
of sin(x). The result is shown in Figure B.2.
>>> x = np.linspace(0,2*np.pi,150)
>>> plt.plot(x, np.sin(x))
>>> plt.title(r"$y=\sin(x)$")
#Move the bottom spine to zero, remove the top and right ones
>>> ax = plt.gca()
>>> ax.spines['bottom'].set_position('zero')
>>> ax.spines['right'].set_color('none')
>>> ax.spines['top'].set_color('none')
>>> plt.show()
Figure B.2: Plot of y = sin(x) with axes modified for clarity
Plot Layout
The position and spacing of all subplots within a figure can be modified using the function plt
.subplots_adjust(). This function accepts up to six keyword arguments that change different
aspects of the spacing. left, right, top, and bottom are used to adjust the rectangle around all of
the subplots. In the coordinates used, 0 corresponds to the bottom or left edge of the figure, and 1
corresponds to the top or right edge of the figure. hspace and wspace set the vertical and horizontal
spacing, respectively, between subplots. The units for these are in fractions of the average height
and width of all subplots in the figure. If finer control is desired, the position of individual Axes
objects can also be changed using ax.get_position() and ax.set_position().
The size of the figure can be configured using the figsize argument when creating a figure:
>>> plt.figure(figsize=(12,8))
Note that many environments will scale the figure to fill the available space. Even so, changing the
figure size can still be used to change the aspect ratio as well as the relative size of plot elements.
The following example uses subplots_adjust() to create space for a legend outside of the
plotting space. The result is shown in Figure B.3.
#Generate data
>>> x1 = np.random.normal(-1, 1.0, size=60)
>>> y1 = np.random.normal(-1, 1.5, size=60)
>>> x2 = np.random.normal(2.0, 1.0, size=60)
>>> y2 = np.random.normal(-1.5, 1.5, size=60)
>>> x3 = np.random.normal(0.5, 1.5, size=60)
>>> y3 = np.random.normal(2.5, 1.5, size=60)
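One way the example might continue (the plotting calls below are a sketch, not prescribed code):
>>> plt.scatter(x1, y1, label="Dataset 1")
>>> plt.scatter(x2, y2, label="Dataset 2")
>>> plt.scatter(x3, y3, label="Dataset 3")
# Shrink the plotting area so there is room for the legend on the left.
>>> plt.subplots_adjust(left=0.3)
# Place the legend's bottom-left corner just outside the subplot.
>>> plt.legend(loc=(-0.4, 0.5))
>>> plt.show()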
Figure B.3: Example of repositioning axes.
Colors
The color that a plotting function uses is specified with either the c or color keyword argument; for
most functions, these can be used interchangeably. There are many ways to specify colors. The
simplest is to use one of the basic color codes listed in Table B.3. Colors can also be specified using an
RGB tuple such as (0.0, 0.4, 1.0), a hex string such as "#0000FF", or a CSS color name like
"DarkOliveGreen" or "FireBrick". A full list of named colors that Matplotlib supports can be found
at https://matplotlib.org/stable/gallery/color/named_colors.html. If no color is specified
for a plot, Matplotlib automatically assigns it one from the default color cycle.
Code        Color
'b'         blue
'g'         green
'r'         red
'c'         cyan
'm'         magenta
'y'         yellow
'k'         black
'w'         white
'C0' - 'C9' default colors

Table B.3: Basic color codes.
Plotting functions also accept an alpha keyword argument, which can be used to set the
transparency. A value of 1.0 corresponds to fully opaque, and 0.0 corresponds to fully transparent.
The following example demonstrates different ways of specifying colors:
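A brief sketch, assuming x and y are arrays of plotting data:
>>> plt.plot(x, y, color='r')                          # basic color code
>>> plt.plot(x, y, color='C2')                         # third color of the default cycle
>>> plt.plot(x, y, color=(0.0, 0.4, 1.0))              # RGB tuple
>>> plt.plot(x, y, color='#0000FF')                    # hex string
>>> plt.plot(x, y, color='DarkOliveGreen', alpha=0.5)  # named color, half transparent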
Colormaps
Certain plotting functions, such as heatmaps and contour plots, accept a colormap rather than a
single color. A full list of colormaps available in Matplotlib can be found at
https://matplotlib.org/stable/gallery/color/colormap_reference.html. Some of the more commonly used ones
are "viridis", "magma", and "coolwarm". A colorbar can be added by calling plt.colorbar()
after creating the plot.
Sometimes, using a logarithmic scale for the coloring is more informative. To do this, pass a
matplotlib.colors.LogNorm object as the norm keyword argument:
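A small sketch of a logarithmically colored heatmap (the function plotted here is only illustrative;
note that LogNorm requires positive values):
>>> from matplotlib.colors import LogNorm
>>> x = np.linspace(-3, 3, 200)
>>> X, Y = np.meshgrid(x, x)
>>> Z = np.abs(0.5*np.sin(X**2)) + 1e-3
>>> plt.pcolormesh(X, Y, Z, cmap='viridis', norm=LogNorm(), shading='auto')
>>> plt.colorbar()
>>> plt.title(r"$\frac{1}{2}\sin(x^2)$")
>>> plt.show()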
Legends
The function legend() can be used to add a legend to a plot. Its optional loc keyword
argument specifies where to place the legend within the subplot. It defaults to 'best', which will
cause Matplotlib to place it in whichever location overlaps with the fewest drawn objects. The other
locations this function accepts are 'upper right', 'upper left', 'lower left', 'lower right',
'center left', 'center right', 'lower center', 'upper center', and 'center'. Alternately,
a tuple of (x,y) can be passed as this argument, and the bottom-left corner of the legend will be
placed at that location. The point (0,0) corresponds to the bottom-left of the current subplot, and
(1,1) corresponds to the top-right. This can be used to place the legend outside of the subplot,
although care should be taken that it does not go outside the figure, which may require manually
repositioning the subplots.
The labels the legend uses for each curve or scatterplot are specified with the label keyword
argument when plotting the object. Note that legend() can also be called with non-keyword argu-
ments to set the labels, although it is less confusing to set them when plotting.
The following example demonstrates creating a legend:
>>> x = np.linspace(0,2*np.pi,250)
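One way the example might continue (a sketch, not prescribed code):
>>> plt.plot(x, np.sin(x), label=r"$\sin(x)$")
>>> plt.plot(x, np.cos(x), label=r"$\cos(x)$")
>>> plt.legend(loc='upper right')
>>> plt.show()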
Line and marker styles
The line style and marker style of a plot are specified with short format codes, such as '-' for a solid
line, '--' for a dashed line, '.' for point markers, and 'o' for circle markers. The function plot()
has several ways to specify these; the simplest is to pass the format string as the third positional
argument. The marker and linestyle keyword arguments can also be used, and their sizes can be
modified using markersize and linewidth. Note that by specifying a marker style but no line style,
plot() can be used to make a scatter plot; it is also possible to use both a marker style and a line
style. To set the marker when using scatter(), use the marker keyword argument, with s controlling
the size.
The following code demonstrates specifying marker and line styles. The results are shown in
Figure B.4.
#With plot(), the color to use can also be specified in the same string.
#Order usually doesn't matter.
#Use red dots:
>>> plt.plot(x, y, '.r')
#Equivalent:
>>> plt.plot(x, y, 'r.')
Plot Types
Matplotlib has functions for creating many different types of plots, many of which are listed in Table
B.6. This section gives details on using certain groups of these functions.
Line plots
Line plots, the most basic type of plot, are created with the plot() function. It accepts two lists of
x- and y-values to plot, and optionally a third argument of a string of any combination of the color,
line style, and marker style. Note that this method only works with the single-character color codes;
to use other colors, use the color argument. By specifying only a marker style, this function can
also be used to create scatterplots.
There are a number of functions that do essentially the same thing as plot() but also change
the axis scaling, including loglog(), semilogx(), semilogy(), and polar(). Each of these functions
is used in the same manner as plot() and has identical syntax.
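A minimal sketch comparing plot() with one of its log-scale variants (the data are made up):
>>> x = np.linspace(1, 10, 200)
>>> plt.subplot(1,2,1)
>>> plt.plot(x, np.exp(x), 'g--')        # green dashed line on linear axes
>>> plt.subplot(1,2,2)
>>> plt.semilogy(x, np.exp(x), 'g--')    # the same data with a logarithmic y-axis
>>> plt.show()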
Bar Plots
Bar plots are a way to graph categorical data in an effective way. They are made using the bar()
function. The most important arguments are the first two that provide the data, x and height. The
first argument is a list of values for each bar, either categorical or numerical; the second argument is
a list of numerical values corresponding to the height of each bar. There are other parameters that
may be included as well. The width argument adjusts the bar widths; this can be done by choosing
a single value for all of the bars, or an array to give each bar a unique width. Further, the argument
bottom allows one to specify where each bar begins on the y-axis. Lastly, the align argument can
be set to 'center' or 'edge' to align the bars as desired on the x-axis. As with all plots, the color
keyword can be used to specify any color of your choice. To make a horizontal bar graph, the syntax
follows similarly using the function barh(), but with argument names y, width, height, and align.
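A small sketch of a bar chart (the categories and counts are made up):
>>> labels = ["spam", "eggs", "ham"]
>>> counts = [10, 4, 7]
>>> plt.bar(labels, counts, width=0.6, color='C0', align='center')
>>> plt.show()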
Box Plots
A box plot is a way to visualize some simple statistics of a dataset. It plots the minimum, maximum,
and median along with the first and third quartiles of the data. This is done by using boxplot()
with an array of data as the argument. Matplotlib accepts either a one-dimensional array for a single
box plot, or a two-dimensional array, in which case one box plot is drawn for each column of the
data. Box plots default to a vertical orientation but can easily be laid out
horizontally by setting vert=False.
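A brief sketch with random data, drawing one box per column of a two-dimensional array:
>>> data = np.random.normal(size=(100, 3))
>>> plt.boxplot(data, vert=False)        # horizontal orientation
>>> plt.show()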
Heatmaps and Contour Plots
Heatmaps and contour plots visualize functions of two variables, using functions such as
pcolormesh(), contour(), and contourf(). To create them, first generate meshes of x- and
y-coordinates with np.meshgrid():
>>> x = np.linspace(0,1,100)
>>> y = np.linspace(0,1,80)
>>> X, Y = np.meshgrid(x, y)
The z-coordinate can then be computed using the x and y mesh grids.
Note that each of these functions can accept a colormap, using the cmap parameter. These
plots are sometimes more informative with a logarithmic color scale, which can be used by passing a
matplotlib.colors.LogNorm object in the norm parameter of these functions.
With pcolormesh(), it is also necessary to pass shading='auto' or shading='nearest' to
avoid a deprecation error.
The following example demonstrates creating heatmaps and contour plots, using a graph of
z = (x² + y) sin(y). The result is shown in Figure B.5.
>>> x = np.linspace(-3,3,100)
>>> y = np.linspace(-3,3,100)
>>> X, Y = np.meshgrid(x, y)
>>> Z = (X**2+Y)*np.sin(Y)
#Heatmap
>>> plt.subplot(1,3,1)
>>> plt.pcolormesh(X, Y, Z, cmap='viridis', shading='auto')   # colormap choice is arbitrary
>>> plt.title("Heatmap")
#Contour
>>> plt.subplot(1,3,2)
>>> plt.contour(X, Y, Z, cmap='magma')
>>> plt.title("Contour plot")
#Filled contour
>>> plt.subplot(1,3,3)
>>> plt.contourf(X, Y, Z, cmap='coolwarm')
>>> plt.title("Filled contour plot")
>>> plt.colorbar()
>>> plt.show()
Showing images
The function imshow() is used for showing an image in a plot, and can be used on either grayscale
or color images. This function accepts a 2-D n × m array for a grayscale image, or a 3-D n × m × 3
array for a color image. If using a grayscale image, you also need to specify cmap='gray', or it will
be colored incorrectly.
It is best to also use axis('equal') alongside imshow(), or the image will most likely be
stretched. This function also works best if the image's values are in the range [0, 1]. Some ways to
load images will format their values as integers from 0 to 255, in which case the values in the image
array should be scaled before using imshow().
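A minimal sketch using a random grayscale "image", so no file is needed:
>>> image = np.random.random((64, 64))   # values already lie in [0, 1]
>>> plt.imshow(image, cmap='gray')       # cmap='gray' so the image is not colored incorrectly
>>> plt.axis('equal')
>>> plt.show()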
3-D Plotting
Matplotlib can be used to plot curves and surfaces in 3-D space. In order to use 3-D plotting, you
need to run the following line:
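>>> from mpl_toolkits.mplot3d import Axes3D   # standard 3-D toolkit import; not needed in recent Matplotlib versions, but harmless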
The argument projection='3d' also must be specified when creating the subplot for the 3-D object:
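One common way to do this (a sketch):
>>> fig = plt.figure()
>>> ax = fig.add_subplot(1, 1, 1, projection='3d')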
Curves can be plotted in 3-D space using plot(), by passing in three lists of x-, y-, and z-
coordinates. Surfaces can be plotted using ax.plot_surface(). This function can be used similarly
to creating contour plots and heatmaps, by obtaining meshes of x- and y- coordinates from np.
meshgrid() and using those to produce the z-axis. More generally, any three 2-D arrays of meshes
corresponding to x-, y-, and z-coordinates can be used. Note that it is necessary to call this function
from an Axes object.
The following example demonstrates creating 3-D plots. The results are shown in Figure B.6.
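A sketch of one curve and one surface (the particular functions plotted are illustrative):
>>> fig = plt.figure()
>>> ax1 = fig.add_subplot(1, 2, 1, projection='3d')
>>> t = np.linspace(0, 4*np.pi, 200)
>>> ax1.plot(np.cos(t), np.sin(t), t)                       # a helix in 3-D space
>>> ax2 = fig.add_subplot(1, 2, 2, projection='3d')
>>> x = np.linspace(-1, 1, 80)
>>> X, Y = np.meshgrid(x, x)
>>> ax2.plot_surface(X, Y, np.sin(X*Y), cmap='viridis')     # a surface from coordinate meshes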
>>> plt.show()
Figure B.6: Examples of 3-D plotting.
Additional Resources
rcParams
The default plotting parameters of Matplotlib can be set individually and with finer control than
styles by using rcParams. rcParams is a dictionary that can be accessed as either plt.rcParams or
matplotlib.rcParams.
For instance, the resolution of plots can be changed via the "figure.dpi" parameter:
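>>> plt.rcParams["figure.dpi"] = 150    # for example; raises the resolution of subsequent figures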
A list of parameters that can be set via rcParams can be found at
https://matplotlib.org/stable/api/matplotlib_configuration_api.html#matplotlib.RcParams.
Animations
Matplotlib has capabilities for creating animated plots. The Animations lab in Volume 4 has detailed
instructions on how to do so.