[Cover: diagram of the four fundamental subspaces (row space, column space, null space, and left null space) of an m × n matrix A with rank r.]
LA4CS
Essential Mathematics
Linear Algebra for Computer Science
Manoj Thulasidas
ASIAN BOOKS
Singapore
Manoj Thulasidas
Associate Professor of Computer Science (Education)
80 Stamford Road, School of Computing and Information Systems,
Singapore Management University, Singapore 178902
https://www.thulasidas.com
https://smu.sg/manoj
https://LA4CS.com
Published in Singapore
“You can’t learn too much linear algebra.”
Acknowledgments 1
Preface 2
Introduction 4
I.1 Why Learn Linear Algebra? 5
I.2 Learning Objectives and Competencies 6
I.3 Organization 7
4. Gaussian Elimination 75
4.1 Solvability of System of Linear Equations 75
4.2 Gaussian Elimination 80
4.3 Applications of Gaussian Elimination 85
4.4 More Examples 96
4.5 Beyond Gaussian Elimination 99
Chapter Summary, Exercises and Resources 99
Glossary 317
Credits 319
List of Figures
Preface
MANOJ THULASIDAS
Singapore
August 21, 2023
Introduction
I.3 Organization
Numerical Computations
1
Functions, Equations and Linearity
1.1 Linearity
Linearity
Definition: We call a transformation (or a function) of a single, real
value (x ∈ R) linear if it satisfies the following two conditions:
• Homogeneity: When the value of x is multiplied by a real
number, the value of the function also gets multiplied by the
same number.
f(sx) = s f(x)  ∀ s, x ∈ R
• Additivity: When we add two input values, the value of the function at the sum is the sum of the values of the function at each of them.
f(x + x′) = f(x) + f(x′)  ∀ x, x′ ∈ R
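As a quick numerical illustration (my own sketch in plain Python, not from the text, which uses SageMath for its computational examples), we can spot-check the two conditions on a few sample values:

    def looks_linear(f, samples):
        """Spot-check homogeneity f(s*x) == s*f(x) and additivity
        f(x + y) == f(x) + f(y) on sample values (evidence, not a proof)."""
        tol = 1e-9
        for x in samples:
            for s in samples:
                if abs(f(s * x) - s * f(x)) > tol:       # homogeneity
                    return False
            for y in samples:
                if abs(f(x + y) - (f(x) + f(y))) > tol:  # additivity
                    return False
        return True

    samples = [-2.0, -0.5, 0.0, 1.0, 3.0]
    print(looks_linear(lambda x: 3 * x, samples))    # True:  f(x) = 3x is linear
    print(looks_linear(lambda x: x + 3, samples))    # False: f(x) = x + 3 fails homogeneity
    print(looks_linear(lambda x: x * x, samples))    # False: f(x) = x^2 fails both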
¹ Programming languages such as Python are procedural: we specify variables, assignments, and the operations (or steps) to be performed on them. There is another class of programming languages, called functional, in which we list statements of truth and specify mathematical operations on the variables. Haskell is one of them.
We stated earlier that the most general linear equation of one variable
was ax = b. We start with the linear equation −mx + y = c. What
is the most general linear equation (or system of linear equations) in
two or n dimensions?
Let’s start by defining a multiplication operation between two vec-
tors: The product of two vectors is the sum of the products of the
corresponding elements of each of them. With this definition, and a
notational trick² of writing the first of the two vectors horizontally, we write:

$$\begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = a_1 x + a_2 y \;\Longrightarrow\; \begin{bmatrix} -m & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = -mx + y$$

Our linear equation −mx + y = c and a more general a1 x + a2 y = b then become:

$$\begin{bmatrix} -m & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = c \qquad\qquad \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = b$$

² The reason for this trick is to have a generalized definition of the product of two matrices, of which this vector multiplication will become a special case. We will go through matrix multiplication in much more detail in the next chapter.
If we had one more equation, a21 x + a22 y = b2 , we could add another
row in the compact notation above. In fact, the notation is capable of
handling as many equations as we like. For instance, if we had three
equations:
1. a11 x + a12 y = b1
2. a21 x + a22 y = b2
3. a31 x + a32 y = b3
We could write the system of three equations more compactly as

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \qquad \text{or} \qquad A\mathbf{x} = \mathbf{b} \tag{1.1}$$
Here, the table of numbers we called A is a matrix. What we have
written down as Ax = b is a system of three linear equations (in
2-dimensions, but readily extended to n dimensions).
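To see Eqn (1.1) in code, here is a small NumPy sketch (NumPy is my choice of tool here; the coefficient values are made up for illustration):

    import numpy as np

    # Three equations in two unknowns: row i of A holds the coefficients (a_i1, a_i2).
    A = np.array([[1.0, 2.0],
                  [3.0, -1.0],
                  [2.0, 5.0]])
    x = np.array([2.0, 1.0])     # a candidate pair (x, y)
    print(A @ x)                 # [4. 5. 9.]: the three left-hand sides a_i1*x + a_i2*y

    # The system A x = b asks for the x that reproduces a given right-hand side b.
    b = np.array([4.0, 5.0, 9.0])
    print(np.allclose(A @ x, b)) # True: this x solves the system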
³ For our purposes, there is very little difference between functions, transformations, and mappings.
⁴ As we get more sophisticated later on, we will qualify this statement and draw a distinction between coordinate spaces where we live and vector spaces where vectors live.
What We Learned
f (x + x′ ) = f (x) + f (x′ )
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
(a) Scale the input variables by k, the output of the function scales by k
(b) Scale the output of the function by k, the input gets scaled by k
(c) The input variables are homogeneous
(d) The output of the function is uniform and differentiable
3. What is a Matrix?
(a) ax + by = e, cx + dy = f (b) ax + cy = f, bx + dy = e
(c) ax + dy = e, cx + by = f (d) All of the above
(a) x (b) 3x
(c) x + 3 (d) x + y
(e) xy (f ) x + y + 3
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
2.1 Vectors
we will see that vectors (and matrices) can be over the ring (see the
box titled Groups, Rings and Fields) of integers (Z, called ZZ in
SageMath), or the field of rationals (Q, QQ), real (R, RR) or complex
(C, CC) numbers¹.
In order to build a realistic example of a vector that we might come
across in computer science, let’s think of a data set where we have
multiple observations with three quantities:
¹ See https://www.mathsisfun.com/sets/number-types.html for common sets of numbers in mathematics.
1. w: Weight in kg
2. h: Height in cm
3. a: Age in years
Scalar Multiplication
Definition: For any vector x ∈ Rⁿ and any scalar s ∈ R, the result of scalar multiplication (sx) is the vector whose elements are the corresponding elements of x, each multiplied by s:

$$s\mathbf{x} \stackrel{\text{def}}{=} \begin{bmatrix} s x_1 \\ s x_2 \\ \vdots \\ s x_n \end{bmatrix} \in \mathbb{R}^n$$
² In computer science, the difference between using the rational ring instead of the real field is too subtle to worry about. It has to do with the definition of the norm (or the size) of a vector.
³ In fact, the whole machinery of Linear Algebra is a big syntactical engine, yielding us important insights into the underlying structure of the numbers under study. But it has little semantic content or correspondence to the physical world, other than the insights themselves.
[Figure 2.1 plots x = (1, 2) together with its scalar multiples 2x = (2, 4), −x = (−1, −2), and x/2 = (1/2, 1).]
Fig. 2.1 Example of scalar multiplication of a vector x ∈ R2 . Note that all the scaled
versions lie on the same line defined by the original vector.
In Figure 2.1, the vector x has the elements 1 and 2: we go one unit along the x axis, and two units along the y axis. The vector then is
an arrow from the origin to the point (1, 2), shown in red. Note how
the scaled versions (in blue, green and purple) of the vector are all
collinear with the original one. Note also that the line defined by the
original vector and its scaled versions all go through the origin.
We define the addition of two vectors such that the result of the
addition is another vector whose elements are the sum of the cor-
responding elements in the two vectors. Note that the sum of two vectors is also a vector. In other words, the set of vectors is closed under addition as well.
Vector Addition
Definition: For any two vectors x1 and x2, the result of addition (x1 + x2) is defined as follows:

$$\forall\; \mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} \text{ and } \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix} \in \mathbb{R}^n, \qquad \mathbf{x}_1 + \mathbf{x}_2 \stackrel{\text{def}}{=} \begin{bmatrix} x_{11} + x_{12} \\ x_{21} + x_{22} \\ \vdots \\ x_{n1} + x_{n2} \end{bmatrix} \in \mathbb{R}^n \tag{2.3}$$
Vector addition also is commutative (x1 + x2 = x2 + x1 ), as a
consequence of its definition. Moreover, we can only add vectors
of the same number of elements (which is called the dimension of
the vector). Although it is not critical to our use of Linear Algebra,
vectors over different fields also should not be added. We will not see
the latter restriction because our vectors are all members of Rn . Even
if we come across vectors over other fields, we will not be impacted
because we have the hierarchy: integers (Z) ⊂ rationals (Q) ⊂ real (R) ⊂ complex (C) numbers. In case we happen to add a vector over the integers to another one over the complex numbers, we will get a sum vector over the field of complex numbers; we may not realize that we are committing a Linear Algebra felony.
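Scalar multiplication and vector addition translate directly into element-wise NumPy operations; a brief sketch (the values are invented for illustration):

    import numpy as np

    x1 = np.array([1.0, 2.0])
    x2 = np.array([3.0, 1.0])

    print(2 * x1)                          # scalar multiplication: [2. 4.]
    print(x1 + x2)                         # vector addition:       [4. 3.]
    print(np.allclose(x1 + x2, x2 + x1))   # commutativity: True
    # Attempting x1 + np.array([1.0, 2.0, 3.0]) raises an error, matching the
    # requirement that only vectors of the same dimension be added.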
Fig. 2.2 Adding two vectors x1, x2 ∈ R². We add the elements of x2 to the corresponding elements of x1, which we can think of as moving x2 (in blue) to the tip of x1 (in red),
so that we get the arrow drawn with dashed blue line. The sum of the two vectors is in green,
drawn from the origin to the tip of the dashed arrow.
⁴ In Linear Algebra as taught in this book, our vectors are always drawn from the origin, as a general rule. This description of the addition of vectors is the only exception to the rule, when we think of a vector transported to the tip of another one.
[Figure 2.3 construction: draw a line (blue dotted) through the tip of y, parallel to x2, intersecting the line of x1 and constructing x1′; construct x2′ similarly. Then s1 = (length of x1′) / (length of x1), s2 = (length of x2′) / (length of x2), and y = s1 x1 + s2 x2.]
Fig. 2.3 Given any green vector y and the two vectors x1 and x2 , how to find the scaling
factors s1 and s2 in y = s1 x1 + s2 x2 ? A geometric proof that it is always possible for
the given x1 and x2 .
vector x2 . Let it intersect the line of the red vector, giving us the light
red vector x′1 . The (signed) length of this vector x′1 divided by the
length of x1 will be s1 . The sign is positive if x1 and x′1 are in the
same direction, and negative otherwise. By a similar construction,
we can get s2 as well. From the definition of the addition of two
vectors (as the diagonal from 0 of the parallelogram of which the two
vectors are sides), we can see that y = s1 x1 + s2 x2 , as shown in
Figure 2.3. Since two lines (that are not parallel to each other) can
intersect only in one point (in R2 ), the lengths and the scaling factors
are unique.
The second question is whether we can get the zero vector

$$\mathbf{y} = \mathbf{0} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = s_1 \mathbf{x}_1 + s_2 \mathbf{x}_2$$
without having s1 = s2 = 0 (when x1 and x2 are what we will
call linearly independent in a minute). The answer is again no,
as a corollary to the geometric “proof” for the uniqueness of the
scaling factors. But we will formally learn the real reasons (à la
Linear Algebra) in a later chapter dedicated to vector spaces and the
associated goodness.
In Figure 2.3, we saw that we can get any general vector y as a linear
combination of x1 and x2 . In other words, we can always find s1 and
s2 such that y = s1 x1 + s2 x2 . Later on, we will say that x1 and x2
span R2 , which is another way of saying that all vectors in R2 can be
written as a unique linear combination of x1 and x2 .
The fact that x1 and x2 span R2 brings us to another pivotal concept
in Linear Algebra: Linear Independence, which we will get back to,
in much more detail in Chapter 6. x1 and x2 are indeed two linearly
independent vectors in R2 .
A set of vectors are linearly independent of each other if none
of them can be expressed as a linear combination of the rest. For
R2 and two vectors x1 and x2 , it means x1 is not a scalar multiple
of x2 . Another equivalent statement is that x1 and x2 are linearly
independent if the only s1 and s2 we can find such that 0 = s1 x1 +
s2 x2 is s1 = 0 and s2 = 0.
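One practical way to check this condition numerically is to stack the vectors as columns and look at the rank of the resulting matrix (a NumPy sketch of my own, ahead of the formal treatment in Chapter 6):

    import numpy as np

    x1 = np.array([1.0, 2.0])
    x2 = np.array([3.0, 1.0])
    x3 = 2 * x1                    # deliberately a scalar multiple of x1

    # Rank equal to the number of vectors means the only way to get the zero
    # vector as a combination is with all coefficients zero: linear independence.
    print(np.linalg.matrix_rank(np.column_stack([x1, x2])))   # 2: independent
    print(np.linalg.matrix_rank(np.column_stack([x1, x3])))   # 1: dependent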
[Figure 2.4 construction: the line through the tip of y, parallel to x2, never intersects the line of x1. We cannot construct s1 or s2 (unless y is already on the line of x1 and x2, in which case we get an infinite number of s1 and s2).]
Fig. 2.4 Two linearly dependent vectors x1 and x2 , which cannot form a linear combi-
nation such that the green y = s1 x1 + s2 x2 .
Dot Product
Definition: For any two vectors x1 and x2, the dot product (x1 · x2) is defined as follows:

$$\forall\; \mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} \text{ and } \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix} \in \mathbb{R}^n, \qquad \mathbf{x}_1 \cdot \mathbf{x}_2 \stackrel{\text{def}}{=} x_{11}x_{12} + x_{21}x_{22} + \cdots + x_{n1}x_{n2} = \sum_{i=1}^{n} x_{i1} x_{i2} \in \mathbb{R} \tag{2.4}$$
Norm of a Vector
Definition: For a vector x, its norm, ∥x∥, is defined as follows:

$$\forall\; \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n, \qquad \|\mathbf{x}\| \stackrel{\text{def}}{=} \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\sum_{i=1}^{n} x_i^2} \in \mathbb{R} \tag{2.5}$$

By the definition of the dot product, we can see that ∥x∥ = √(x · x).
Related to the dot product, the norm of a vector is a measure of its size.
Also note that if we scale a vector by a factor s, the norm scales by the magnitude of that factor: ∥sx∥ = |s| ∥x∥. If we divide a vector by its norm
(which means we perform scalar multiplication by the reciprocal of
the norm), the resulting vector has unit length, or is normalized.
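Eqns (2.4) and (2.5) map directly onto NumPy routines; a small sketch of the dot product, the norm, and normalization (my own example values):

    import numpy as np

    x1 = np.array([3.0, 4.0])
    x2 = np.array([4.96, -0.60])

    print(np.dot(x1, x2))               # dot product: about 12.48
    print(np.linalg.norm(x1))           # norm: sqrt(3^2 + 4^2) = 5.0
    print(np.sqrt(np.dot(x1, x1)))      # same value, via ||x|| = sqrt(x . x)

    x1_hat = x1 / np.linalg.norm(x1)    # normalized (unit-length) version of x1
    print(np.linalg.norm(x1_hat))       # 1.0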
Cosine Similarity
In text analytics, in what they call Vector-Space Model, documents are represented as
vectors. We would take the terms in all documents and call each a direction in some
space. A document then would be a vector having components equal to the frequency
of each term. Before creating such term-frequency vectors, we may want to clean up
the documents by normalizing different forms of words (by stemming or lemmatization), removing common words like articles (the, a, etc.) and prepositions (in, on, etc.), which
are considered stop words. We may also want to remove or assign lower weight to words
that are common to all documents, using the so-called inverse document frequency (IDF)
instead of raw term frequency (TF). If we treat the chapters in this book as documents,
words like linear, vector, matrix, etc. may be of little distinguishing value. Consequently,
they should get lower weights.
Once such document vectors are created (either using TF or TF-IDF), one common
task is to quantify similarity, for instance, for plagiarism detection or document retrieval
(searching). How would we do it? We could use the norm of the difference vector, but
given that the documents are likely to be of different length, and document length is not
a metric by which we want to quantify similarity, we may need another measure. The
cosine of the angle between the document vectors is a good metric, and it is called the
Cosine Similarity, computed exactly as we described in this section.
$$\cos\theta = \frac{\sum_{i=1}^{m} x_{i1} x_{i2}}{\|\mathbf{x}_1\| \, \|\mathbf{x}_2\|}$$
How accurate would the cosine similarity measure be? It turns out that it would
be very good. If, for instance, we compare one chapter against another one in this
book as opposed to one from another book on Linear Algebra, the former is likely to
have a higher cosine similarity. Why? Because we tend to use slightly flowery (albeit
totally appropriate) language because we believe it makes for a nuanced treatment of
the intricacies of this elegant branch of mathematics. How many other Linear Algebra
textbooks are likely to contain words like albeit, flowery, nuance, intricacy etc.?
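A hedged sketch of the cosine similarity computation on two toy term-frequency vectors (the counts are invented; a real pipeline would build them from documents, possibly with TF-IDF weighting):

    import numpy as np

    # Term-frequency vectors for two 'documents' over the same five-term vocabulary.
    doc1 = np.array([4.0, 2.0, 0.0, 1.0, 3.0])
    doc2 = np.array([3.0, 1.0, 1.0, 0.0, 2.0])

    def cosine_similarity(x1, x2):
        # cos(theta) = (x1 . x2) / (||x1|| ||x2||)
        return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

    print(cosine_similarity(doc1, doc2))        # high value: similar term profiles
    print(cosine_similarity(doc1, 10 * doc1))   # 1.0: a longer copy is still 'identical'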
There are other norms as well; one of them is the maximum absolute value of the elements of the vector (max_i |x_i|), known as the infinity norm, or simply the maximum norm.
[Figure 2.5 worked example: x1 = (3, 4) and x2 = (4.96, −0.60), both of norm 5, with θ = 60° between them. Using the angle: x1 · x2 = ∥x1∥ ∥x2∥ cos θ = 5 × 5 × 1/2 = 12.5. Using element-by-element multiplication: 3 × 4.96 + 4 × (−0.60) ≈ 12.5. The projection of x2 on to x1 is ∥x2∥ cos θ = 5 × 1/2 = 2.5.]
Fig. 2.5 Dot product between two vectors x1 , x2 ∈ R2 . We compute the dot product
using the elements of the vectors as well as the angle between them and show that we get
the same result.
Dot Product as Projection: The dot product can also be defined using
the angle between the two vectors. Let’s consider two vectors x1 and
x2 with an angle θ between them. Then,
$$\mathbf{x}_1 \cdot \mathbf{x}_2 \stackrel{\text{def}}{=} \|\mathbf{x}_1\| \, \|\mathbf{x}_2\| \cos\theta$$
$$\|\mathbf{x}_2\| \cos\theta = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{\|\mathbf{x}_1\|} = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} \cdot \mathbf{x}_2 = \hat{\mathbf{x}}_1 \cdot \mathbf{x}_2$$
x̂1 is the unit vector in the direction of the first vector. ∥x2 ∥ cos θ is the
projection of the second vector on to this direction. As a consequence,
if the angle between the two vectors θ = π2 , the projection is zero. If
the angle is zero, the projected length is the same as the length of the
second vector.
Quantum Computing
The backbone of Quantum Mechanics is, in fact, Linear Algebra. By the time we finish
Chapter 11 (Eigenvalue Decomposition), we will have learned everything we need to
deal with the mathematics of QM. However, we may not be able to understand the
lingo because physicists use a completely different and alien-looking notation. Let’s go
through this so-called Dirac (AKA "bra-ket") notation for two good reasons. Firstly, we
may come across it in our online searches on Linear Algebra topics. Secondly, closer
to our field, Quantum Computing is gathering momentum. And the lingo used in the
description of its technical aspects is likely to be the one from physics.
A vector in QM is a ket vector. What we usually write as x would appear as |x⟩ in this notation. The transpose of a vector is the bra vector, written as yᵀ ≡ ⟨y|. Therefore, a dot product x · y = xᵀy ≡ ⟨x|y⟩.
Now that we got started with QM, let’s go ahead and complete the physics story. The
QM vectors are typically the wave functions of the probability amplitudes. So, if we have an electron with a wave function |ψ⟩ = ψ(x) (x being its location in a one-dimensional problem), what it is describing is the probability amplitude of finding the electron at x. The corresponding probability is the norm of this vector, which is ⟨ψ|ψ⟩.
There are a couple of complications here: Since |ψ⟩ is actually a function ψ(x), it has a value at each point x, and if we are going to think of it as a vector, it is an infinite-dimensional vector. Secondly, the values of |ψ⟩ can be (and typically are) complex
numbers. So when we take the norm, since we like the norm to be positive, we cannot
merely take the transpose, we have to take the complex-conjugate transpose (called the
Hermitian transpose, or simply conjugate transpose). Lastly, the analog of summation
of elements, when we have an infinity of them, is going to be an integral. The space in
which such wave functions live is called the Hilbert Space.
Finally, toward the end of this book, we will come across expressions like xᵀAx, which will look like ⟨ψ|H|ψ⟩. The matrix A has become an operator H corresponding
to a physical observable (in this case, the energy, if H is the so-called Hamiltonian), and
the values we can get are, in fact, its eigenvalues, which can be thought of as the reason
why the observable can take only discrete, quantized values. That, in a nutshell, is how
we get the various allowed energy levels in a Hydrogen atom.
$$\sum_{i=1}^{n} x_{i1} x_{i2} = \|\mathbf{x}_1\| \, \|\mathbf{x}_2\| \cos\theta$$
It can be proven using the Law of Cosines. Figure 2.5 shows the
equivalence of the two definitions using an example of two vectors in
R2 , one projecting on to the other.
2.4 Matrices
The first two operations on matrices that we will define are identical
to the ones for vectors.
Definition: For any matrix A and any scalar s, the result of scalar multiplication (sA) is defined as follows:

$$\forall\; A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & a_{ij} & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n} \text{ and } s \in \mathbb{R}, \qquad sA \stackrel{\text{def}}{=} \begin{bmatrix} sa_{11} & \cdots & sa_{1n} \\ \vdots & sa_{ij} & \vdots \\ sa_{m1} & \cdots & sa_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n} \tag{2.7}$$
Definition: For any two matrices A and B, their sum (the result of their addition, A + B) is defined as follows:

$$\forall\; A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & a_{ij} & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \text{ and } B = \begin{bmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & b_{ij} & \vdots \\ b_{m1} & \cdots & b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n},$$
$$A + B \stackrel{\text{def}}{=} \begin{bmatrix} a_{11}+b_{11} & \cdots & a_{1n}+b_{1n} \\ \vdots & a_{ij}+b_{ij} & \vdots \\ a_{m1}+b_{m1} & \cdots & a_{mn}+b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n} \tag{2.8}$$
Commutativity
The order in which the matrices (or vectors) appear in the operations
does not matter.
sx = xs; sA = As
x1 + x2 = x2 + x1 ; A + B = B + A
Associativity
Distributivity
We will now define and describe how matrices multiply. Two ma-
trices can be multiplied to get a new matrix, but only under certain
conditions. Matrix multiplication, in fact, forms the backbone of
much of the subject matter to follow. For that reason, we will look at it
carefully, and from different perspectives.
Element-wise Multiplication
Definition: For conformant matrices, we define matrix multiplication as follows: The element in the ith row and the jth column of the product is the sum of the products of the elements in the ith row of the first matrix and the jth column of the second matrix.
More formally, for any two conformant matrices A ∈ R^{m×k} and B ∈ R^{k×n}, their product (the result of their multiplication, C = AB) is defined as follows:

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & a_{il} & \vdots \\ a_{m1} & \cdots & a_{mk} \end{bmatrix} \text{ and } B = \begin{bmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & b_{lj} & \vdots \\ b_{k1} & \cdots & b_{kn} \end{bmatrix},$$
$$AB = C = \begin{bmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & c_{ij} & \vdots \\ c_{m1} & \cdots & c_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}, \quad \text{where } c_{ij} \stackrel{\text{def}}{=} a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{ik}b_{kj} = \sum_{l=1}^{k} a_{il} b_{lj} \tag{2.9}$$
Fig. 2.6 Illustration of matrix multiplication AB = C . In the top panel, the element
c11 (in red) of the product C is obtained by taking the elements in the first row of A
and multiplying them with the elements in the first column of B (shown in red letters and
arrows) and summing them up. c22 (blue) is the sum-product of the second row of A and
the second column of B (blue). In the bottom panel, we see a general element, cij (green)
as the sum-product of the ith row of A and the j th column of B (in green).
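The sum-product recipe of Eqn (2.9) and Figure 2.6 can be spelled out as an explicit triple loop and checked against NumPy's built-in product; a minimal sketch:

    import numpy as np

    def matmul_loops(A, B):
        # c_ij = sum over l of a_il * b_lj, exactly as in Eqn (2.9)
        m, k = A.shape
        k2, n = B.shape
        assert k == k2, "matrices are not conformant"
        C = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                for l in range(k):
                    C[i, j] += A[i, l] * B[l, j]
        return C

    A = np.array([[1.0, 2.0, 0.0],
                  [3.0, -1.0, 4.0]])     # 2 x 3
    B = np.array([[2.0, 1.0],
                  [0.0, 5.0],
                  [1.0, -2.0]])          # 3 x 2
    print(np.allclose(matmul_loops(A, B), A @ B))   # True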
In this case, C has become a matrix of one row and one column,
which is the same as a scalar. We can, therefore, use the symbol s
to represent it. B is a column matrix with k rows, which is what
we earlier called a vector; remember, our vectors are all column
vectors. Let’s use the symbol b instead of B to stay consistent in our
⁵ We are using D (instead of C) as the product in order to avoid possible confusion between the symbols for column vectors cj and the elements of the product matrix.
provided that the block Ail is conformant for multiplication with Blj .
Compare this equation to Eqn (2.9) and we can immediately see that
the latter is a special case of partitioning A and B into blocks of
single elements.
In general, segmenting matrices into conformant blocks may not be trivial, but there are advanced topics in Linear Algebra where it comes in handy. For our purposes, we can think of a few cases of simple segmentation of a matrix and perform block-wise multiplication.
$$A = \begin{bmatrix} | & | & & | \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n \tag{2.12}$$
$$A\mathbf{x} = \mathbf{b} = x_1 \mathbf{c}_1 + x_2 \mathbf{c}_2 + \cdots + x_n \mathbf{c}_n \in \mathbb{R}^m$$
Similarly, multiplication on the left gives us the row picture. Take xᵀ to be a matrix of a single row (xᵀ ∈ R^{1×m}), which means x is a column vector x ∈ Rᵐ, and A ∈ R^{m×n}. Now the product xᵀA is a row matrix bᵀ ∈ R^{1×n}. The row picture says that xᵀA is a linear combination of the rows of the matrix A.
Fig. 2.7 Illustration of matrix multiplication as column and row pictures. On the left,
we have Ax, where the product (which is a column vector we might call b) is the linear
combination of the columns of A. On the right, we have xT A, where the product (say bT )
is a linear combination of the rows of A.
Since the column and row pictures are hard to grasp as concepts, we
illustrate them in Figure 2.7 using example matrices with color-coded
rows and columns. We can use the basic element-wise multiplication
(Eqn (2.9)) of matrices and satisfy ourselves that the column and row
pictures indeed give the same numeric answers as element-by-element
multiplication. To put it as a mnemonic, matrix multiplication is the
linear combination of the columns of the matrix on the left and of the
rows of the matrix on the right.
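A small NumPy check of the two pictures (my own sketch): A x should equal the combination of the columns of A with coefficients from x, and xᵀA the combination of the rows.

    import numpy as np

    A = np.array([[1.0, 2.0, 0.0],
                  [3.0, -1.0, 4.0],
                  [0.0, 5.0, 2.0]])
    x = np.array([2.0, 1.0, -1.0])

    # Column picture: Ax = x1*c1 + x2*c2 + x3*c3
    col_combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
    print(np.allclose(A @ x, col_combo))    # True

    # Row picture: x^T A = x1*r1 + x2*r2 + x3*r3
    row_combo = x[0] * A[0, :] + x[1] * A[1, :] + x[2] * A[2, :]
    print(np.allclose(x @ A, row_combo))    # True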
The column and row pictures are, in fact, special cases of the block-
wise multiplication we discussed earlier. In the column picture, we
are segmenting the first matrix into blocks that are columns, which are
conformant for multiplication by single-element blocks (or scalars)
of the second matrix. The sum of the individual products is then
the linear combinations of the columns of the first matrix. We can
visualize the row picture also as a similar block-wise multiplication.
If, instead of a vector x, we had a multi-column second matrix,
then the product also would have multiple columns. Each column of
the product would then be the linear combinations of the columns of
A, scaled by the corresponding column of the second matrix. In other
words, in AB, A ∈ Rm×k , B ∈ Rk×n , the product has n columns,
each of which is a linear combination of the columns of A.
Considering this a teachable moment, let's look at it from the perspective of block-wise multiplication once more. Let's think of B as composed of n column matrices stacked side-by-side as B = [b1 | b2 | · · · | bn]. Then, by block-wise multiplication, we have:

$$AB = A \begin{bmatrix} \mathbf{b}_1 \,|\, \mathbf{b}_2 \,|\, \cdots \,|\, \mathbf{b}_n \end{bmatrix} = \begin{bmatrix} A\mathbf{b}_1 \,|\, A\mathbf{b}_2 \,|\, \cdots \,|\, A\mathbf{b}_n \end{bmatrix}$$

which says that each column of the product is Abi, which, by the column picture of matrix multiplication, is a linear combination of the columns of A using the coefficients from bi.
Other Operations
Although we have covered the mathematical operations in traditional Linear Algebra, there are a couple more that are commonly used in Computer/Data Science.
Hadamard Product: The Hadamard product, also known as the element-wise prod-
uct or entry-wise product, is a mathematical operation that takes two matrices or vectors
of the same dimensions and produces a new matrix or vector by multiplying correspond-
ing elements together. In other words, each element in the resulting matrix/vector is
obtained by multiplying the elements at the same position in the input matrices/vectors.
This operation is often denoted by » and is commonly used in various fields, includ-
ing linear algebra, signal processing, and machine learning, for tasks such as pointwise
multiplication of data or combining features.
For A, B ∈ Rm×n , the Hadamard product C = A » B has elements cij = aij bij .
As we can see, C ∈ Rm×n as well.
Scalar Addition: Scalar addition involves adding a scalar to each element of a matrix.
This operation is performed by simply adding the scalar value to every element of the
matrix, resulting in a new matrix with the same dimensions as the original. Scalar
addition is used to shift or adjust the values in the matrix uniformly in some algorithms
in CS.
For A ∈ R^{m×n} and s ∈ R, scalar addition can be formally written as B = A + s, which has elements b_{ij} = a_{ij} + s. Again, B ∈ R^{m×n}. Note that this operation is not really allowed in pure Linear Algebra: We cannot add entities belonging to R^{m×n} and R together. But we can think of the operation as B = A + sO, where O ∈ R^{m×n} is a matrix with all ones as its elements: o_{ij} = 1.
We can ask whether these operations are linear. When we ask such a question, we
have to carefully define what we mean. For the Hadamard product, let’s think of it
as a transformation. Rewriting it using our usual notations, we have A » X = B,
which is a transformation of X to B. A : X 7→ B or A : Rm×n 7→ Rm×n . Is this
operation linear? Similarly, for scalar addition, think of it as a + X = B, which is a
transformation a : Rm×n 7→ Rm×n . Is it linear?
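In NumPy, the Hadamard product and scalar addition are simply A * X and X + s (the latter via broadcasting). A sketch that also spot-checks the linearity questions just raised (my own example values):

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    X = np.array([[5.0, 6.0], [7.0, 8.0]])
    Y = np.array([[1.0, 0.0], [2.0, -1.0]])
    s = 3.0

    print(A * X)     # Hadamard (element-wise) product
    print(X + s)     # scalar addition: s added to every element

    # The map X -> A * X (A fixed) passes homogeneity and additivity checks:
    print(np.allclose(A * (2 * X), 2 * (A * X)))      # True
    print(np.allclose(A * (X + Y), A * X + A * Y))    # True

    # The map X -> X + s fails homogeneity, so it is not linear (an affine shift):
    print(np.allclose((2 * X) + s, 2 * (X + s)))      # False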
Convolution
Convolution in image processing involves sliding a small matrix (kernel) over an image.
At each position, the kernel’s values are multiplied with the underlying image pixels, and
the results are summed to form a new pixel value in the output image. This operation is
used for tasks like blurring, edge detection, and feature extraction.
Here’s how convolution works in the context of image processing:
• Input Image: A two-dimensional matrix representing an image’s pixel values.
• Kernel/Filter: A smaller matrix with numerical values defining the convolution
operation.
• Sliding: The kernel is systematically moved over the image in small steps.
• Element-Wise Multiplication: Values in the kernel and underlying pixels are
multiplied.
• Summation: The products are added up at each kernel position.
• Output: The sums are placed in the output matrix, which is also known as the
feature map or convolved image.
Convolution is used for various image processing tasks:
• Blurring/Smoothing: By using a kernel with equal values, convolution can
smooth an image, reducing noise and sharp transitions.
• Edge Detection: Specific kernels can detect edges in an image by highlighting
areas with rapid intensity changes.
• Feature Extraction: Convolution with various filters can extract specific fea-
tures from images, such as texture or pattern information.
• Sharpening: Convolution with a sharpening filter enhances edges and details in
an image.
It is left as an exercise to the student to look up how exactly the convolution is performed,
and whether it is linear.
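A minimal 2-D convolution sketch in plain NumPy (loop-based for clarity; the image and kernel values are made up, and a real pipeline would use an optimized library routine):

    import numpy as np

    def convolve2d_valid(image, kernel):
        # Slide the (flipped) kernel over the image, keeping only 'valid' positions.
        kernel = np.flipud(np.fliplr(kernel))   # flip for true convolution
        kh, kw = kernel.shape
        ih, iw = image.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kh, j:j + kw]
                out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    blur = np.ones((3, 3)) / 9.0                     # equal-valued (blurring) kernel
    print(convolve2d_valid(image, blur))             # 3 x 3 smoothed output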
What We Learned
Exercises
A. Review Questions
(a) t = 5
(b) Any t ∈ R
(c) x3 can never be a linear combination of x1 and x2
(d) Not enough information in the question to answer
(a) 1 (b) 2
(c) 3 (d) Any number > 1
4. True or False: The Dot Product of two vectors is the same as their Scalar
Product.
(a) A parallelogram
(b) Another vector
(c) A vector or a scalar in special cases
(d) we cannot add vectors, only multiply
7. A column vector is always a matrix with one column, in our context of Linear
Algebra. True or False?
9. We have an n × n matrix A and an n × 3n matrix B, which is written as three blocks [B1 | B2 | B3], each of size n × n. Select the right answer for AB:
(c) AB = AB1 B2 B3
(d) AB = A(B T B)−1 AT
(a) If A + B = B + A, then AB = BA
(b) If AB = BA then AB is a symmetric matrix
(c) AB can never be equal to BA
(d) If AB = BA then both A and B are square matrices
13. In the column picture of C = AB, which of the following statements is most
accurate?
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS Chapter 2: Vectors, Matrices and Their
Operations
• 3Blue1Brown: “Essence of Linear Algebra.”
– Video 2: Linear combinations, span, and basis vectors | Chapter 2,
Essence of linear algebra
– Video 3: Linear transformations and matrices | Chapter 3, Essence of
linear algebra
• Imperial College: “Mathematics for Machine Learning - Linear Algebra.”
– Watch Videos 6 to 8: From “M4ML - Linear Algebra - 2.1 Introduction
to module 2: Vectors” to “M4ML - Linear Algebra - 2.2 - Part 2: Cosine
& Dot Product”
Now that we have defined matrices and mastered the basic oper-
ations on them, let’s look at another operation that comes up very
often in Linear Algebra, namely taking the transpose of a matrix. We
will also introduce the concept of the determinant of a matrix, which is a single number with a lot of information about the matrix, and with
a nice geometrical interpretation. Also in this chapter, we will go
over the nomenclature of various special matrices and entities related
to matrices with a view to familiarizing ourselves with the lingo of
Linear Algebra. This familiarity will come in handy in later chapters.
Matrix Transpose
Definition: For any matrix A = [a_{ij}] ∈ R^{m×n}, its transpose is defined as Aᵀ = [a_{ji}] ∈ R^{n×m}.
(AB)T = B T AT
This product rule also can be proven by looking at the (i, j) element
of the product matrices on the left and right hand sides, although it is
a bit tedious to do so. Before proving the product rule, let’s look at
the dimensions of the matrices involved in the product rule and easily
see why the rule makes sense.
Let’s consider A ∈ Rm×k and B ∈ Rk×n so that AB ∈ Rm×n .
The dimensions have to be this way by the conformance requirement
of matrix multiplication, which says that the number of columns of
the first matrix has to be the same as the number of rows of the
second one. Otherwise, we cannot define the product. In particular,
for m ̸= n, BA is not defined.
By the definition of transpose, we have
We have postponed the proof for the product rule of transposes for
as long as possible. Now, we will present two proofs.
(1) Element-wise matrix multiplication: $c_{ij} = \sum_{p=1}^{k} a_{ip} b_{pj}$
In (5), we used the fact that the ith row of a matrix is the same as
the ith column of its transpose. In both proofs, we have shown that
C T = (AB)T = B T AT .
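A quick numerical sanity check of the product rule (a sketch with random matrices; it complements, rather than replaces, the proofs above):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 4))
    B = rng.standard_normal((4, 2))

    print(np.allclose((A @ B).T, B.T @ A.T))   # (AB)^T == B^T A^T: True
    # Dimensions also line up: (AB)^T is 2 x 3, as is B^T (2 x 4) times A^T (4 x 3).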
Main Diagonal: The line of elements in a matrix with the same row and column indexes is known as the main diagonal. It may be referred to as the leading diagonal as well. In other words, it is the line formed by the elements a_{ii} in A = [a_{ij}] ∈ R^{m×n}. Note that the numbers of rows and columns of the matrix A do not have to be equal.
Morphisms
Earlier (in §1.1.5, page 16), we talked about a matrix A ∈ R^{m×n} as encoding a linear transformation between Rⁿ and Rᵐ: A : Rⁿ ↦ Rᵐ. Other names for a linear transformation include linear map, linear mapping, homomorphism, etc. If n = m, then
A maps Rn to itself: Every vector (x) in Rn , when multiplied by A, gives us another
vector b ∈ Rn . The name used for it is a linear endomorphism. Note that two different
vectors x1 , x2 do not necessarily have to give us two different vectors. If they do, then
the transformation is called an automorphism. Such a transformation can be inverted:
We can find another transformation that will reverse the operation.
To complete the story of morphisms, a transformation Rn 7→ Rm (with n and m not
necessarily equal) is called an isomorphism if it can be reversed, which means that it is
a one-to-one and onto mapping, also known as bijective. If two vectors in Rn (x1 and
x2 ) map to the same vector in Rm (b), then given b, we have no way of knowing which
vector (x1 or x2 ) it came from, and we cannot invert the operation.
Of course, these morphisms are more general: They are not necessarily between the
so-called Euclidean spaces Rn , but between any mathematical structure. We, for our
purposes in computer science, are interested only in Rn though.
Let’s illustrate the use of the idea of isomorphisms with an (admittedly academic)
example. We know that the number of points in a line segment between 0 and 1 is
infinite. So is the number of points in a square of side 1. Are these two infinities the
same?
If we can find an isomorphism from the square (in R2 ) to the line (R), then we can
argue that they are. Here is such an isomorphism: For any point in the square, take its
coordinates, 0 < x, y < 1. Express them as decimal numbers. Create a new number x′
by taking the first digit of x, then the first digit of y, followed by the second digit of x,
second digit of y and so on, thereby interleaving the coordinates into a new number. As
we can see, 0 < x′ < 1, and this transformation T is a one-to-one mapping and onto,
or an isomorphism: T : (x, y) 7→ x′ : R2 7→ R. It is always possible to reverse the
operation and find x and y given any x′ , T −1 : x′ 7→ (x, y) : R 7→ R2 . Therefore the
infinities have to be equal.
Another transformation that is definitely not an isomorphism is the projection op-
eration: Take any point (x, y) and project it to the x axis, so that the new x′ = x.
P : (x, y) 7→ x : R2 7→ R maps multiple points to the same number, and it cannot be
reversed. P −1 does not exist.
Interestingly, P is a linear transformation, while T is not.
the matrix and the linear transformation (see §1.1.5, page 16) that
it represents. We use the symbol det(A), ∆A or |A| to denote the
determinant of the square matrix A.
Before actually defining the determinant, let’s formally state what
we said in the previous paragraph.
$$|A| = f(a_{ij}) \quad \text{where} \quad A = [a_{ij}] \in \mathbb{R}^{n \times n}$$
3.3.1 2 × 2 Matrices
If we have a matrix A ∈ R2×2 as in the equation below, its determinant
is defined as |A| = ad − bc. To restate it formally,
$$A = \begin{bmatrix} a & c \\ b & d \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \qquad |A| = \begin{vmatrix} a & c \\ b & d \end{vmatrix} \stackrel{\text{def}}{=} ad - bc \in \mathbb{R} \tag{3.2}$$
We can think of A as having two column vectors (c1 and c2 ) standing
side-by-side.
$$A = \begin{bmatrix} a & c \\ b & d \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \qquad \mathbf{c}_1 = \begin{bmatrix} a \\ b \end{bmatrix}, \; \mathbf{c}_2 = \begin{bmatrix} c \\ d \end{bmatrix} \in \mathbb{R}^2$$
[Figure 3.1 panel (TRANSFORMATION: MEANING): A maps the unit vectors (1, 0) and (0, 1) to its columns (a, b) and (c, d), so the unit square maps to a parallelogram.]
Fig. 3.1 The matrix A transforms the unit vectors to its columns, a1 and a2 , thus
transforming the unit square to a parallelogram.
How does A transform the unit vectors? (The ith unit vector is the ith
column of the identity matrix I). The transformed versions are just
c1 and c2 , the columns of A.
$$A \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} a & c \\ b & d \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} = \mathbf{c}_1, \qquad A \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} a & c \\ b & d \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} c \\ d \end{bmatrix} = \mathbf{c}_2$$
¹ The proof is recreated from Mathematics StackExchange, attributed to Solomon W. Golomb. If the geometrical version of the proof is hard to digest, there is an algebraic version as well.
[Figure 3.2: the unit square is mapped to the parallelogram with vertices (0, 0), (a, b), (c, d), and (a + c, b + d); dissecting the picture shows that its area equals ad − bc = |A|.]
O M
Fig. 3.2 The parallelogram that results from the transformation of the unit square by the
action of A. The picture proves, without words, that its area is the same as |A|.
the area in Figure 3.2; we will learn more about the sign part in the
following examples.
In Figure 3.3, we have three different examples of A in R2 and
their determinants. In the left panel, we see how A transforms the
red and blue unit vectors (which have unit elements in the first and
second dimension respectively). The red unit vector gets transformed
to the first column of A (shown in a lighter shade of red), and the
blue one to the second column (light blue vector). If we complete the
parallelogram, its area is 2, which is the determinant, |A|.
In the middle panel of Figure 3.3, we have a different A, which
does something strange to the red and blue unit vectors: They get
transformed to a line, which is a collapsed parallelogram with zero
area. And by the definition of |A|, it is indeed zero. Looking at the
column vectors in A, we can see that they are scalar multiples of each
other; they are not linearly independent.
In the right panel of the same figure, we have shuffled the columns
of A, so that parallelogram is the same as in the left panel, but the
determinant now is negative. We, therefore, call the determinant the
signed area of the parallelogram formed with the column vectors as
adjacent sides. Why is the area negative in this case? It is because A
has flipped the order of the transformed vectors: Our blue unit vector
is to the left of the red one. In the left panel, the transformed blue
[Figure 3.3 annotations, left to right: the blue unit vector is to the left of the red one in all three panels. Left panel: the blue transformed vector is also to the left of the red one; the determinant is positive (|A| = 2). Middle panel: the blue transformed vector is on top of the red one; the determinant is zero. Right panel: the blue transformed vector is to the right of the red one; the determinant is negative (|A| = −2).]
Fig. 3.3 Determinants as areas: The matrix A transforms the unit square (shaded grey)
into the amber parallelogram. The determinant |A| is the signed area of this parallelogram.
The sign is negative when the transformed unit vectors “flip.”
vector is still to the left. But in the right panel, the blue one has gone
to the right of the red one after transformation, thereby attributing a
negative sign to the determinant.
To use a more formal language, to go from the first (red) unit vector
to the second (blue) one, we go in the counterclockwise direction.
The direction is the same for the transformed versions in the left panel
of Figure 3.3, in which case the determinant is positive. So is the
area. In the right panel, the direction for the transformed vectors is
clockwise, opposite of the unit vectors. In this case, the determinant
and the signed area are negative.
3.3.2 3 × 3 Matrices
We have defined the determinant of a 2 × 2 matrix in Eqn (3.2). We now extend it to higher dimensions recursively. For A = [a_{ij}] ∈ R^{3×3}, we have:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \in \mathbb{R}^{3 \times 3}$$
$$|A| \stackrel{\text{def}}{=} a_{11} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12} \begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13} \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} \in \mathbb{R} \tag{3.3}$$
Notice that the first term has a positive sign, the second one
a negative sign and the third a positive sign again. This pattern of
alternating signs extends to higher dimensions as well.
3.3.3 n × n Matrices
We can extend the notion of volume to Rn . If we think of A as being
composed of n column vectors,
$$A = \begin{bmatrix} | & | & & | \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad \mathbf{c}_i \in \mathbb{R}^n$$
the determinant, |A|, is the signed volume of the n-dimensional par-
allelepiped formed with edges ci .
$$|A| = \begin{vmatrix} 7 & 1 & 1 & 4 \\ 5 & 8 & 0 & 7 \\ 6 & 9 & 2 & 5 \\ 3 & 5 & 2 & 7 \end{vmatrix} = 7 \begin{vmatrix} 8 & 0 & 7 \\ 9 & 2 & 5 \\ 5 & 2 & 7 \end{vmatrix} - 1 \begin{vmatrix} 5 & 0 & 7 \\ 6 & 2 & 5 \\ 3 & 2 & 7 \end{vmatrix} + 1 \begin{vmatrix} 5 & 8 & 7 \\ 6 & 9 & 5 \\ 3 & 5 & 7 \end{vmatrix} - 4 \begin{vmatrix} 5 & 8 & 0 \\ 6 & 9 & 2 \\ 3 & 5 & 2 \end{vmatrix}$$
Fig. 3.4 Illustration of minors and cofactors using an example 4 × 4 matrix. The minor is
the determinant of the submatrix obtained by removing the row and column corresponding
to each element, as shown. Notice the sign in the summation. The minor with the associated
sign is the cofactor.
Laplace Formula: What we wrote down in Eqn (3.4) is, in fact, the
general version of the recursive formula for computing the determi-
nant, expanding over the ith row.
$$|A| = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} M_{ij} = \sum_{j=1}^{n} a_{ij} C_{ij} \tag{3.5}$$
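The Laplace formula translates into a short recursive function; a sketch (cofactor expansion is exponentially expensive, so this is for illustration only, with NumPy's det as the practical alternative):

    import numpy as np

    def det_laplace(A):
        # Cofactor expansion along the first row: Eqn (3.5) with i = 1.
        n = A.shape[0]
        if n == 1:
            return A[0, 0]
        total = 0.0
        for j in range(n):
            minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)  # drop row 1, column j+1
            total += (-1) ** j * A[0, j] * det_laplace(minor)      # (-1)^(i+j) with i = 1
        return total

    A = np.array([[7.0, 1.0, 1.0, 4.0],
                  [5.0, 8.0, 0.0, 7.0],
                  [6.0, 9.0, 2.0, 5.0],
                  [3.0, 5.0, 2.0, 7.0]])       # the matrix of Figure 3.4
    print(det_laplace(A), np.linalg.det(A))    # the two values agree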
Deriving |A|
It is possible to start from the properties (listed in §3.3.4, page 66) and derive the formula
for the determinant for a 2 × 2 matrix, which is an interesting exercise.
• Let’s start with the identity matrix in R2 . By Property (1), we have:
1 0
=1
0 1
• Property (2) says if we flip the rows, the determinant changes sign, which means:
0 c b 0
=− = −bc
b 0 0 c
• By Property (4), if we can express the elements of the first row as sums, we can
write:
a c a+0 0+c a 0 0 c
= = +
b d b d b d b d
• We can express the second term in the sum above, again by using Property (4)
as:
a 0 a 0 a 0 a 0 a 0
= = + =
b d 0+b d+0 0 d b 0 0 d
The second determinant is zero by Property (5) because it contains a column of
zeros. Similarly,
0 c 0 c
=
b d b 0
• Therefore we get, using Equations (3.6) and (3.7),
a c a 0 0 c
= + = ad − bc
b d 0 d b 0
10. The determinant of the product of two matrices (of the same
size) is the product of their determinants: |AB| = |A| |B|.
This first part of the book (comprising Chapters 1, 2 and 3) was meant
to be about numerical computations involving vectors and matrices.
The moment we start speaking of vectors, however, we are already
thinking in geometrical terms. In this chapter, we also saw how
determinants had a geometric meaning as well.
In the first chapter, in order to provide a motivation for the idea
of matrices, we introduced linear equations, which is what we will
expand on, in the next part on the algebraic view of Linear Algebra.
We will go deeper into systems of linear equations. While discussing
the properties of determinants, we hinted at their connection with
linear equations again.
As we can see, although we might want to keep the algebra and
geometry separated, it may not be possible, nor perhaps advisable to
do so because we are dealing with different views of the same subject
of Linear Algebra.
What We Learned
• Transposes:
– Transpose of A = aij , denoted by AT is a matrix with the elements
flipped about the main (AKA leading) diagonal. The leading diagonal is the
imaginary line joining the elements aii .
– Product rule: (AB)T = B T AT .
• Determinant: Some Properties:
– Defined only for square matrices, A ∈ Rn×n . A number with a lot of
information about the properties of the matrix.
– |A| is the signed volume of the parallelepiped into which A transforms the
unit cube.
– |A| = 0 ⇐⇒ A is singular.
– Switch any two rows or columns, the determinant flips sign.
– Product rule: $|AB| = |A|\,|B| \implies |A^{-1}| = \dfrac{1}{|A|}$
– Minor: The Minor of an element a_{ij} of A is the determinant of the matrix with the ith row and the jth column removed. It is denoted by M_{ij}.
– Cofactor: Minor with the sign (−1)i+j is the cofactor (denoted by Cij ) of
the element aij .
– Laplace formula, in Eqn (3.5), expresses the determinant of a matrix in
terms of its minors or cofactors. It is computationally inefficient.
• Some matrices with special properties and their names:
– Square: A ∈ Rn×n
– Symmetric: AT = A =⇒ A is square.
– Skew or Antisymmetric: AT = −A =⇒ A is square.
– Diagonal: All non-diagonal elements are zero: a_{ij} = 0 for i ≠ j. Only the elements along the main diagonal a_{ii} may be non-zero.
– Upper Triangular: All the elements below the leading diagonal zero.
– Lower Triangular: All the elements above the leading diagonal zero.
– Identity: A square diagonal matrix with aii = 1. Denoted by I or In ∈
Rn×n .
– Unit Vectors: The columns of I are unit vectors; they are of length (norm)
one, and orthogonal to each other.
– Gram Matrix. For A, AT A is called the Gram matrix. It is square and
symmetric.
– Inverse: A−1 A = AA−1 = I.
– Singular: A is singular if A−1 does not exist. If |A| = 0, A is singular.
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
6. Choose wisely
(a) The identity matrix is always symmetric
(b) The identity matrix is diagonal
(c) The identity matrix is upper and lower triangular
(d) All these statements are true
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 3: Transposes and Determi-
nants
• 3Blue1Brown: “Essence of Linear Algebra.”
– Video 6: The determinant | Chapter 6, Essence of linear algebra
• Imperial College: “Mathematics for Machine Learning - Linear Algebra.”
– Video 20: M4ML - Linear Algebra - 3.4 Determinants and inverses
• MIT: Linear Algebra
– Video 39: 18. Properties of Determinants
– Video 40: Properties of Determinants
– Video 41: 19. Determinant Formulas and Cofactors
– Video 42: Determinants
Algebraic View
4
Gaussian Elimination
Let’s start with a simple system of two linear equations, and see what
the problem really is. Here, we will have only two variables, x and y.
In the fourth row, we have two equations, but they are not consistent
with each other. They both cannot be true at the same time for any
pair of values of x and y.
Things get complicated in the fifth row, where we seem to have
three equations. But the third one can be derived from the first two. It
is, in fact, Eq.1 + 2×Eq.2, which means, in reality, we only have two
good equations for two unknowns. We therefore get a good solution,
much like the first row of Table 4.1.
The sixth row looks similar to the fifth, but the third equation there
is different on the right hand side. It cannot be derived from the other
two, and is inconsistent with them. Therefore, we have no solutions.
In light of these results, we can state the solvability condition, in
a general case as follows: If we have a system of n independent
and consistent linear equations on n unknowns, we can find a unique
solution for them. We are yet to define the concepts of independence
and consistency though.
Independence
Definition: An equation in a system of linear equations is considered
independent if it cannot be derived from the rest using algebraic
manipulations.
If we multiply an equation with a scalar, or add two equations, the
new equation we get is not independent. Again, notice the similarity
of the dependence of equations with our requirements for linearity
(§1.1.3, page 15).
Consistency
Definition: An inconsistent system of linear equations is the one with
no solutions.
The concept of consistency is harder to pin down. For now, we are
defining it rather cyclically as in the statement above. It is possible,
however, to visualize why some equations are inconsistent with others. In the fourth row of Table 4.1, for instance, the lines described by the two equations are parallel to each other.
In Table 4.1, our equations represent lines because we have only two variables (x and y) and we are dealing with R², which is a plane.
[Figure 4.1 panel titles: Row 1: consistent equations; Row 4: inconsistent equations, parallel lines; Row 5: consistent, single point of intersection; Row 6: inconsistent, no single point of intersection.]
Fig. 4.1 Visualizing the equations listed in Table 4.1. Clockwise from top-left: Rows 1, 4, 5 and 6 in the table.
4.1.2 Generalizing to Rn
Now that we are naming matrices, let's also call A (either as part of [A | b] or by itself) the coefficient matrix, and the b part the constant matrix or constant vector.
Before describing the algorithm of Gaussian elimination, let’s look
at the endpoint of the algorithm,
which is the form in which we would
like to have our matrix A or A | b .
Row-Echelon form
Definition: A matrix is considered to be in its row-echelon form
(REF) if it satisfies the following two conditions:
1. All rows with zero elements are at the bottom of the matrix.
2. The first non-zero element (pivot) in any row is strictly to the right of the first non-zero element in the row above it.
Pivots
Definition: The leading non-zero element in a row of a matrix in its
row-echelon form is called a pivot. The corresponding column is
called the pivot column. Pivots are also called the leading coefficient.
The largest number of pivots a matrix can have is the smaller of
its dimensions (numbers of rows and columns). In other words, for
A ∈ Rm×n , the largest number of pivots would be min(m, n).
Rank
Definition: The number of pivots of a matrix (in its REF) is its rank.
We will have a better definition of rank later on. If a matrix has its largest possible rank (which is the largest possible number of pivots, min(m, n)), it is called a full-rank matrix. If a matrix is not full rank, we call it rank deficient, and its rank deficiency is min(m, n) − rank(A).
We have a few examples of matrices in their REF in Eqn (4.3)
below, where the pivots are shown in bold. The first matrix shows a
square matrix of size 4 × 4, and it has four pivots, and is therefore
full rank. The second matrix is 2 × 3, and has two pivots–the largest
possible number. It is also full-rank. The third one is a 4 × 4 matrix,
but has only three pivots. It is rank deficient by one. The fourth
matrix also has a rank deficiency of one.
[Eqn (4.3): four example matrices in row-echelon form, with their pivots in bold: a 4 × 4 matrix with four pivots (full rank), a 2 × 3 matrix with two pivots (full rank), a 4 × 4 matrix with only three pivots (rank deficient by one), and a fourth matrix that is also rank deficient by one.]
1. If a11 = 0, loop over the rows of A to find a row that has a non-zero element in the first column.
2. If found, swap it with the first row. If not, ignore the first row and column and move on to the second row (calling it the first).
3. Subtract suitable multiples of the first row from the rows below it so that all the elements below the pivot a11 become zero.
4. Now consider the submatrix from the second row, second column to the last row, last element (i.e., from a22 to amn) as the new matrix: A ← A[a22 : amn], m ← m − 1, n ← n − 1.
5. Loop back to step 1 and perform all the steps until all rows or columns are exhausted.
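A sketch of this procedure in plain NumPy (my own simplified implementation of the steps above; it is not tuned for numerical robustness):

    import numpy as np

    def row_echelon(M, tol=1e-12):
        A = M.astype(float).copy()
        m, n = A.shape
        row = 0
        for col in range(n):
            if row >= m:
                break
            # Steps 1-2: find a row with a non-zero entry in this column and swap it up.
            candidates = [r for r in range(row, m) if abs(A[r, col]) > tol]
            if not candidates:
                continue                          # no pivot here; move to the next column
            A[[row, candidates[0]]] = A[[candidates[0], row]]
            # Step 3: eliminate the entries below the pivot.
            for r in range(row + 1, m):
                A[r, :] -= (A[r, col] / A[row, col]) * A[row, :]
            # Steps 4-5: continue with the submatrix below and to the right.
            row += 1
        return A

    aug = np.array([[1.0, 1.0, 5.0],
                    [1.0, -1.0, 1.0]])            # [A | b] for x + y = 5, x - y = 1
    print(row_echelon(aug))                       # [[1, 1, 5], [0, -2, -4]]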
[Figure: Gaussian elimination on the system x + y = 5, x − y = 1. The augmented matrix [A | b] = [1 1 5; 1 −1 1]; subtracting Row 1 from Row 2 gives [1 1 5; 0 −2 −4]. The pivots are the first non-zero elements in each row.]
• Since the coefficient matrix (A) is square and full rank, we can
infer that the system has a unique solution.
Back Substitution
Definition: The process of solving a system of linear equations from the REF of the augmented matrix of the system is known as back substitution. The equation corresponding to the last non-zero row is solved first, and the solution is substituted in the one for the row above, and so on, until all the variables are solved.

Table 4.3 Augmented matrices, their REFs, and observations on solvability for the systems of Table 4.1:
1. x + y = 5, x − y = 1: [1 1 | 5; 1 −1 | 1] → REF [1 1 | 5; 0 −2 | −4]. REF has no row 0 = bi ⟹ solvable; rank = number of variables ⟹ unique solution.
2. x + y = 5: [1 1 | 5] → REF [1 1 | 5]. REF has no row 0 = bi ⟹ solvable; rank < number of variables ⟹ infinity of solutions.
3. x + y = 5, 2x + 2y = 10: [1 1 | 5; 2 2 | 10] → REF [1 1 | 5; 0 0 | 0]. REF has no row 0 = bi ⟹ solvable; rank < number of variables ⟹ infinity of solutions.
4. x + y = 5, x + y = 6: [1 1 | 5; 1 1 | 6] → REF [1 1 | 5; 0 0 | 1]. REF has a row 0 = bi ⟹ inconsistency; not solvable.
5. x + y = 5, x − y = 1, 3x − y = 7: [1 1 | 5; 1 −1 | 1; 3 −1 | 7] → REF [1 1 | 5; 0 −2 | −4; 0 0 | 0]. REF has no row 0 = bi ⟹ solvable; rank = number of variables ⟹ unique solution.
6. x + y = 5, x − y = 1, and a third equation inconsistent with them (Row 6 of Table 4.1): the REF has a row 0 = bi ⟹ not solvable.
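A minimal back-substitution sketch in NumPy, assuming a square, full-rank coefficient part (free variables and inconsistent rows, as in some of the table rows above, are not handled here):

    import numpy as np

    def back_substitute(ref_aug):
        # Solve [U | c] where U is upper triangular with non-zero pivots.
        U, c = ref_aug[:, :-1], ref_aug[:, -1]
        n = U.shape[1]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):                        # last non-zero row first
            x[i] = (c[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x

    ref = np.array([[1.0, 1.0, 5.0],
                    [0.0, -2.0, -4.0]])          # REF of x + y = 5, x - y = 1
    print(back_substitute(ref))                  # [3. 2.]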
With Gaussian elimination and back substitution, we can solve
a system of linear equations as fully as possible. Moreover, we
can say a lot about the solvability of the system by looking at the
pivots, as we shall illustrate using the examples in Table 4.1. We
have the augmented matrices for these equations, their REF, and our
observations on solvability of the system based on the properties
of the REF in Table 4.3. These observations are, in fact, general
statements about the solvability of systems of linear equations, as
listed below.
Solvability Conditions of a system of linear equations based on the
properties of the REF of its augmented matrix:
1. If we have a row in the REF (of the augmented matrix A | b )
with all zeros in the coefficient (A) part and a non-zero element
in the constant (b) part, the system is not solvable.
Notes:
• The largest value the rank of any matrix can have is the smaller
of its dimensions.
• The rank r of the coefficient matrix A ∈ R^{m×n} can never be larger than the rank r′ of [A | b] ∈ R^{m×(n+1)}.
• As a corollary, if n > m, we have too few equations. The
system, if solvable, will always have infinitely many solutions.
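The solvability conditions can be probed numerically by comparing rank(A) with rank([A | b]); a NumPy sketch (my own) using systems from Table 4.1:

    import numpy as np

    def classify(A, b):
        Ab = np.column_stack([A, b])
        r = np.linalg.matrix_rank(A)
        r_aug = np.linalg.matrix_rank(Ab)
        if r < r_aug:
            return "inconsistent: no solution"
        return "unique solution" if r == A.shape[1] else "infinitely many solutions"

    print(classify(np.array([[1.0, 1.0], [1.0, -1.0]]), np.array([5.0, 1.0])))
    # unique solution
    print(classify(np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([5.0, 6.0])))
    # inconsistent: no solution
    print(classify(np.array([[1.0, 1.0], [2.0, 2.0]]), np.array([5.0, 10.0])))
    # infinitely many solutions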
We will look at some more examples to illustrate the solvability conditions when we discuss the elementary matrices (the matrices that implement the elementary row operations).
$$\begin{aligned} x + y + z &= 6 \\ 2x + 2y + z &= 9 \\ x + y &= 3 \end{aligned} \implies [A \,|\, \mathbf{b}] = \left[\begin{array}{ccc|c} 1 & 1 & 1 & 6 \\ 2 & 2 & 1 & 9 \\ 1 & 1 & 0 & 3 \end{array}\right] \xrightarrow{\text{REF}} \left[\begin{array}{ccc|c} 1 & 1 & 1 & 6 \\ 0 & 0 & -1 & -3 \\ 0 & 0 & 0 & 0 \end{array}\right]$$
From the REF above, we can start back substituting: Row 3 does
not say anything. Row 2 says:
−z = −3 =⇒ z = 3
Substituting it in the row above, we get:
x+y+3=6 or x+y =3
By convention, we take the variable corresponding to the non-pivot
column (which is column 2 in this case, corresponding to the variable
y) as a free variable. It can take any value y = t. Once y takes a
value, x is fixed: x = 3 − t. So the complete solution to this system
of equations is:
$$\mathbf{x} = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 - t \\ t \\ 3 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ 3 \end{bmatrix} + t \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}$$
Note that a linear equation (such as z = 3) in R3 defines a plane.
x + y = 3 also defines a plane. (The pair x = 3 − t and y = t is a parametric way of writing x + y = 3.)
vectors xs1 and xs2 . These vectors are called the special solutions
of the system. We can see that they are, in fact, the solutions to the
equations when the right hand side, the constants part, is zero, which
is to say b = 0.
When a system of linear equations is of the form Ax = 0, it is
called a homogeneous system because all the terms in the system are
of the same order one in the variables (as opposed to some with order
zero if we had a non-zero b). Therefore, the special solutions are also
called homogeneous solutions.
Lastly, the linear combinations of the special solutions in our ex-
ample, t1 xs1 +t2 xs2 , define a plane in R4 , as two linearly independent
vectors always form a plane going through the origin 0. What the ad-
dition of xp does is to shift the plane to its tip, namely the coordinate
point (3, 0, 3, 0), if we allow ourselves to visualize it in a coordinate
space. In other words, the complete solution is any vector whose
tip is on this plane defined by the special solutions xs1 and xs2 , and
shifted by the particular solution xp .
For any A ∈ R3×n , here are the so-called elementary matrices (or
operators) that would implement the examples listed above:
$$E_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \qquad E_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad E_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -3 & 0 & 1 \end{bmatrix}$$
4.3.4 LU Decomposition
[Figure 4.3 content: A = [1 1; 1 −1], b = (5, 1); the augmented matrix [1 1 5; 1 −1 1] reduces to [1 1 5; 0 −2 −4] by subtracting Row 1 from Row 2.]
Fig. 4.3 Gaussian elimination on a simple augmented matrix, showing the elementary
operator that implements the row reduction.
operations listed earlier (in §4.2.2, page 82). Table 4.4 shows an
example of this process of getting a U out of a matrix by finding its
REF.
We have more examples coming up (in Tables 4.4 to 4.10), where
we will write down the elementary matrix that implemented each row
operation. We will call these elementary matrices Ei . Referring to
Table 4.4, we can therefore write the REF form as:
U = E2 E1 A = EA
Let’s focus on one elementary row operation, E2 (second row, under the column Ei ). This operation replaces
row 3 with the sum of twice row 2 and row 3. The inverse of this operation would be to replace the row 3 with
row 3 − twice row 2, as shown in the column Inv. Op. The elementary operation for this inverse operation is, in fact, the inverse of E2, which we can find in the last column, under Ei⁻¹.
Note the order in which the elementary matrices appear in the product.
We perform the first row operation E1 on A, get the product E1 A
and apply the second operation E2 on this product, and so on.
We have not yet fully discussed the inverse of matrices, but the
idea of the inverse undoing what the original matrix does is probably
clear enough at this point. Now, starting from the upper triangular
matrix U that is the REF of A, we can write the following:
U = EA = E2 E1 A =⇒ A = E1−1 E2−1 U = LU
where we have called the matrix E1−1 E2−1 , which is a lower triangular
matrix (because each of the Ei−1 is a lower triangular matrix) L.
Again notice the order in which the inverses Ei appear in the equation:
We undo the last operation first before moving on to the previous one.
This statement is the famous LU decomposition, which states
that any matrix can be written as the product of a lower triangular
matrix L and an upper triangular matrix U . The algorithm we use
to perform the decomposition is Gaussian elimination. Notice that
all the elementary matrices and their inverses have unit determinants,
and the row operations we carry out do not change the determinant
of A, such that |A| = |U |.
5
In principle, we can have permutation matrices that perform multiple row exchanges, but
for the purposes of this book, we focus on one row exchange at a time.
r1 ↔ r2: [0 1 0; 1 0 0; 0 0 1]      r2 ↔ r3: [1 0 0; 0 0 1; 0 1 0]      r3 ↔ r1: [0 0 1; 0 1 0; 1 0 0]      r1 ↔ r2, then r2 ↔ r3: [0 1 0; 0 0 1; 1 0 0]
Note that the last one does multiple row exchanges and conse-
quently P 2 ̸= I: It is not involutory.
4.3.5 P LU Decomposition
Since the elementary operation of row exchanges breaks our decom-
position of A = LU , we keep the permutation part separate, and
come up with the general, unbreakable, universally applicable decom-
position A = P LU . We can see an example of this decomposition
in Table 4.5.
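For readers who want to check a decomposition numerically, here is a minimal sketch (not from the book's text) using SciPy's scipy.linalg.lu, which returns the factors of A = P L U directly:

```python
import numpy as np
from scipy.linalg import lu

# A matrix that needs a row exchange before elimination can proceed
A = np.array([[0., 1., 2.],
              [1., 1., 1.],
              [2., 3., 5.]])

P, L, U = lu(A)                     # SciPy factors A as P @ L @ U

print(np.allclose(A, P @ L @ U))    # True
print(L)                            # lower triangular with unit diagonal
print(U)                            # upper triangular, the REF-like factor
```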
Back Substitution:
The last row of the REF(A) says −z = −3 =⇒ z = 3.
Substituting it in the row above, 2y − 3 = 1 =⇒ y = 2.
Substituting z and y in the first row, x + 2 + 3 = 6 =⇒ x = 1.
The complete and unique solution is: (x, y, z) = (1, 2, 3)
In the third example, in Table 4.8, we have added one more equation
to the system in Table 4.6. Thus, we have four equations for three
unknowns. But the augmented matrix reduces with the last row
reading 0 = 0, which means the fourth equation is consistent with
the rest. And rank( A | b ) = rank(A) = 3, same as the number of
unknowns. Hence unique solution.
The fourth example is similar to the third one; we have added a
fourth equation in Table 4.9 as well. However, the new equation is
not consistent with the rest.
In the last example in Table 4.10 (which we used to illustrate the
complete solution with free variables), we have as many equations as
unknowns.
After Gaussian elimination, we get the last row reading [0 0 0 | 0]; there is no "zero = non-zero" row, indicating that the equations are consistent. However, rank([A | b]) = rank(A) = 2 with three unknowns, which means we have infinitely many combinations of (x, y, z) that can satisfy these equations.
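The rank conditions illustrated in these examples can be checked mechanically. Here is a small, illustrative helper; the function name and the use of NumPy are my own, not the book's:

```python
import numpy as np

def solvability(A, b):
    """Classify Ax = b using the rank conditions discussed above.
    Illustrative helper only."""
    Ab = np.column_stack([A, b])
    rA, rAb, n = np.linalg.matrix_rank(A), np.linalg.matrix_rank(Ab), A.shape[1]
    if rA < rAb:
        return "inconsistent: no solution"
    if rA == n:
        return "consistent: unique solution"
    return "consistent: infinitely many solutions"

A = np.array([[1., 1., 1.], [2., 2., 1.], [1., 1., 0.]])
b = np.array([6., 9., 3.])
print(solvability(A, b))   # infinitely many solutions: rank 2, three unknowns
```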
What We Learned
1. All rows with zero elements are at the bottom of the matrix.
2. The first non-zero element (pivot) in any row is strictly to the right of the
first non-zero element in the row above it.
A = P LU =⇒ |A| = |P | |L| |U |
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
10. The determinant of a triangular matrix is the product of the diagonal elements.
True or False?
15. Starting from a matrix A (in Ax = b), we can always get an upper triangular
matrix. This process is called:
(a) Upper Triangulation (b) Gaussian Elimination
(c) Back Substitution (d) Linear Algebra
19. True or False: If we have two equations in R2 , we have either a unique solution,
or an infinite number of solutions.
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 4: Gaussian Elimination
• Imperial College: “Mathematics for Machine Learning - Linear Algebra.”
– Video 18: M4ML - Linear Algebra - 3.3 Part 1: Solving the apples and
bananas problem: Gaussian elimination
• MIT: Linear Algebra
– Video 7: 2. Elimination with Matrices
– Video 8: Elimination with Matrices
– Video 11: 4. Factorization into A = LU
– Video 12: LU Decomposition
5
Ranks and Inverses of
Matrices
Earlier, we stated that the rank of a matrix is the number of pivots in its
row echelon form (REF), which we get through Gaussian elimination.
This algorithm is merely a series of elementary row operations, each
of which is about taking a linear combination of the rows of the
matrix. And, right from the opening chapters where we introduced
vectors and matrices (in §2.2.3, page 29), we talked about linear
combinations.
If we can combine a bunch of rows in a matrix so as to get a different
row, then the rows are not linearly independent. To state it without
ambiguity using mathematical lingo, for a matrix A ∈ Rm×n , we say
that a non-zero ith row (riT ) is a linear combination of the other rows
if we can write:
r_i^T = Σ_{j=1, j≠i}^{m} s_j r_j^T,    r_j ∈ R^n, s_j ∈ R
When we can write one row as the linear combination of the rest,
we say that that row is not linearly independent. Note that any row
that is not linearly independent can be reduced to zero using the
elementary row operations. Once a row is reduced to zero, it does
not have a pivot. Conversely, all non-zero rows do have pivots, and
they are linearly independent. Therefore, we can state that the rank
of a matrix is the number of linearly independent rows.
A ∈ R^{n×n}  =⇒  rank(A) ≤ n
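As a quick numerical sanity check (a sketch, assuming NumPy is available), the rank reported by numpy.linalg.matrix_rank drops when one row is a linear combination of the others:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [5., 7., 9.]])   # row 3 = row 1 + row 2

print(np.linalg.matrix_rank(A))   # 2: only two linearly independent rows
```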
[A | b] = [1 1 | 5; 1 −1 | 1].  Subtract Row 1 from Row 2 (kill the leading non-zeros): [1 1 | 5; 0 −2 | −4].  Divide Row 2 by −2 (make the pivots 1): [1 1 | 5; 0 1 | 2].  Subtract Row 2 from Row 1 (make the non-pivot entries in the pivot columns 0): [1 0 | 3; 0 1 | 2].  Read the solutions off the augmented column: x = 3, y = 2.  The upper triangular matrix of the row-echelon form (REF) in Gaussian elimination becomes the identity matrix in the Reduced REF (RREF) of Gauss-Jordan elimination.
Fig. 5.1 Example of Gauss-Jordan elimination to solve a simple system of linear equations
In this case, the REF of the augmented matrix, A | b , becomes
the identity matrix in the coefficient (A) side, at which point the
constants column contains the solution. Remember, each row of
2. Every non-zero row is scaled such that the pivot value is one.
As we can see, the Reduced Row Echelon Form (RREF) is the Row
Echelon Form (REF) with some special properties, as its name indi-
cates. The actual definition of REF and RREF are a bit fluid: Different
textbooks may define them differently. For instance, some may sug-
gest that REF should have all pivots equalling unity, through scaling.
Such differences, however, do not affect the conceptual significance
of the forms and perhaps not even their applications.
Note that while the REF (which is the result of Gaussian elimina-
tion) can have different shapes depending on the order in which the
row operations are performed, RREF (the result of Gauss-Jordan) is
immutable: A matrix has a unique RREF. In this sense RREF is more
fundamental than REF, and points to the underlying characteristics
of the matrix.
With the insight from the example in Figure 5.1, we can state the steps in the Gauss-Jordan algorithm (for a matrix A = [a_ij] ∈ R^{m×n}) as follows (a short code sketch follows the steps):
1. Run the Gaussian elimination algorithm (§4.2.4, page 83) to get REF(A).
2. Loop (with index i) over the rows of REF(A) and scale the i-th row by 1/a_ik (where k is the column where Pivot_i appears) so that Pivot_i = 1.
3. Loop (with index j) over all the elements of the pivot column from 1 to i − 1, multiply the i-th row by −a_jk and add it to the j-th row to get zeros in the k-th column for all rows above the i-th row.
4. Loop back to step 2 (with i ← i + 1) and iterate until all rows or columns are exhausted.
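Here is a minimal Python sketch of the idea. Note that it folds the scaling and the clearing of the pivot column into a single pass and adds partial pivoting for numerical robustness, so it is an illustration rather than a literal transcription of the steps above (compare Exercise 13):

```python
import numpy as np

def rref(M, tol=1e-12):
    """Reduced row echelon form of M by Gauss-Jordan elimination.
    A sketch only: partial pivoting is added for numerical safety."""
    R = M.astype(float).copy()
    m, n = R.shape
    pivot_row = 0
    for k in range(n):                      # walk over the columns
        if pivot_row >= m:
            break
        # choose the largest entry in column k at or below pivot_row
        i_max = pivot_row + np.argmax(np.abs(R[pivot_row:, k]))
        if abs(R[i_max, k]) < tol:
            continue                        # no pivot here: a free-variable column
        R[[pivot_row, i_max]] = R[[i_max, pivot_row]]   # row exchange
        R[pivot_row] /= R[pivot_row, k]     # make the pivot 1
        for j in range(m):                  # make all other entries in column k zero
            if j != pivot_row:
                R[j] -= R[j, k] * R[pivot_row]
        pivot_row += 1
    return R

A_b = np.array([[1., 1., 5.],
                [1., -1., 1.]])
print(rref(A_b))   # [[1, 0, 3], [0, 1, 2]]  =>  x = 3, y = 2
```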
Table 5.1 Gauss-Jordan on a full-rank, square coefficient matrix, giving us the unique
solution
Equations: x + y + z = 6,   2x + 2y + z = 9,   x + 3y = 7
[A | b] = [1 1 1 | 6; 2 2 1 | 9; 1 3 0 | 7]      REF: [1 1 1 | 6; 0 2 −1 | 1; 0 0 −1 | −3]      RREF: [1 0 0 | 1; 0 1 0 | 2; 0 0 1 | 3]
Here, we start from the REF form of A | b (as in Table 4.7), and do further row reduction to get
I in the coefficient part. In other words, in total, we apply the Gauss-Jordan algorithm. We can then
read out the solution (x, y, z) = (1, 2, 3) from the augmented part of the RREF.
If we started from the REF of A | b in Table 4.9, where we have
inconsistent equations, the RREF would be an uninteresting 4 × 4
identity matrix. What it says is that we have a rank of four in A | b ,
with only three unknowns, indicating that we have no solutions.
Notice that we did not write down the elementary matrices for the
Gauss-Jordan algorithm, as we diligently did for Gaussian elimina-
tion? The reason for doing that in Gaussian elimination was to arrive
at the A = P LU decomposition so as to compute its determinant.
We do not use Gauss-Jordan to compute determinants, although we
could, if we wanted to.
Tall Matrices For a “tall” matrix of full (column) rank, we will get
pivots for the first n rows, and the rest m − n rows will be zeros. The
n pivots can be normalized so that the top part of the canonical form
becomes In , and we will have:
This third example is similar to the one in Table 5.2, but we removed the third equation. The
Gauss-Jordan
algorithm gets us as close to the identity matrix as possible in the coefficient part of
A | b . Notice how the columns of the identity matrix and the pivot-less columns are interspersed.
" #
m×n RREF In
A∈R , rank(A) = n < m =⇒ A −−−→ ∈ Rm×n
0(m−n)×n
RREF
A ∈ Rm×n , rank(A) = m < n =⇒ A −−−→ Im · Fm×(n−m) ∈ Rm×n
variable columns are interspersed among it. The pivot columns are 1
and 3, and the free variable is in column 2. The rest m − r rows are
zeros. We represent this as:
" #
m×n RREF Ir · Fr×(n−r)
A∈R , rank(A) = r < m < n =⇒ A −−−→ ∈ Rm×n
0(m−r)×n
5.3.1 Invertibility
b′1 is the solution to the first system, and b′2 that of the second one.
In other words:
[A | b1 b2]  →(RREF)  [I | b′1 b′2]   =⇒   Ab′1 = b1 and Ab′2 = b2
More generally, we could write this complicated super system as
(again, limiting ourselves to A, B, B ′ ∈ Rn×n ):
AX = B   =⇒   [A | B]  →(RREF)  [I | B′]
Noting that X is a symbolic matrix (xij are symbols, not numbers)
and B ′ is the solution that satisfies the equations, we can confidently
write:
AB ′ = B
We know that there is nothing special about this B. In other words,
if we have a A ∈ Rn×n , we can build any number of systems of
linear equations by an arbitrary choice of B. Let’s therefore choose
B = I, and investigate what it says. With this choice, we get:
AB ′ = I
What does this equation mean? It says that B ′ is the matrix that,
when multiplied with A results in the identity matrix I, which is the
definition of the inverse of A. In other words, B′ = A^{−1}.
Remembering again that B′ is merely the part of the result of
Gauss-Jordan running on A | B , and that B now is I, we get the
same cryptic recipe for A^{−1}:
[A | I]  →(RREF)  [I | A^{−1}]
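Here is a short sketch of this recipe using SymPy (assumed available); the rref() call does the Gauss-Jordan work on the augmented matrix [A | I]:

```python
import numpy as np
from sympy import Matrix, eye

A = Matrix([[2, 1],
            [1, 1]])

# Augment A with the identity and row-reduce: [A | I] -> [I | A^-1]
augmented = A.row_join(eye(2))
R, pivots = augmented.rref()
A_inv = R[:, 2:]               # the right half is the inverse

print(A_inv)                   # Matrix([[1, -1], [-1, 2]])
print(np.allclose(np.array(A.inv(), dtype=float),
                  np.array(A_inv, dtype=float)))   # True
```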
A^{−1} = (1/|A|) C^T        (5.1)
Related to the matrix of cofactors and the analytic formulas for de-
terminant and inverse of matrices (Equations (3.5) and (5.1)) is a
beautifully compact and elegant (albeit computationally useless) for-
mula for solving a system of linear equations. This formula is known
as Cramer’s Rule. It states that for a system of linear equations
Ax = b with a square, invertible A and x = [x_j], the solution is
given by:
x_j = |A_j| / |A|        (5.3)
where A_j is the matrix A with its j-th column replaced by b.
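A small sketch of Eqn (5.3), with A_j built by replacing a column of A with b; it is checked against numpy.linalg.solve. The function name cramer is mine, not the book's:

```python
import numpy as np

def cramer(A, b):
    """Solve Ax = b by Cramer's rule (Eqn 5.3). Elegant, but wasteful
    compared to elimination -- for illustration only."""
    detA = np.linalg.det(A)
    x = np.empty(len(b))
    for j in range(len(b)):
        Aj = A.copy()
        Aj[:, j] = b              # A_j: replace column j with b
        x[j] = np.linalg.det(Aj) / detA
    return x

A = np.array([[1., 1., 1.], [2., 2., 1.], [1., 3., 0.]])
b = np.array([6., 9., 7.])
print(cramer(A, b))               # [1. 2. 3.]
print(np.linalg.solve(A, b))      # same answer
```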
2. Perform
Gauss-Jordan elimination on the augmented matrix
A | b (which is not really different from the previous method).
What We Learned
• Gauss-Jordan Elimination:
– Adjoint matrix (the transpose of the cofactor matrix) is another way to get
the inverse, albeit computationally inefficient:
A^{−1} = (1/|A|) C^T
• We saw the various possible shapes of a matrix: Tall, Wide and Square and
their RREF. Here is the most general one:
" #
m×n RREF Ir · Fr×(n−r)
A∈R , rank(A) = r f min(m, n) =⇒ A −−−→
0(m−r)×n
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
5. We have a full-row-rank matrix. Its rank is two. How many rows does it have?
(a) Two or more (b) Two
(c) More rows than columns (d) Depends on the columns
6. We have a full-column-rank matrix. Its rank is two. How many rows does it
have?
(a) Two or more (b) Two
(c) More columns than rows (d) Depends on the columns
= A−1 I = A−1
Under what conditions will this expansion and simplification work?
Explore the web, other books or resources and attempt these questions.
13. Restate the Gauss-Jordan algorithm to have only a single pass over the matrix.
i.e., not Gaussian elimination followed by another loop for scaling and making
the pivot column an identity column. Implement both in SageMath or Python
and compare the performance.
14. Use Gauss-Jordan elimination to compute the inverse of the general matrix
A ∈ R2×2 given below:
a c
A=
b d
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
Geometric View
6
Vector Spaces, Basis and
Dimensions
Fig. 6.1 Linear combinations of two vectors in R2 . We can create all possible vectors in
R2 starting from our x1 and x2 .
Fig. 6.2 Linear combinations of two vectors in R2 . Starting from our x1 and x2 , we
cannot get out of the line defined by the vectors, no matter what scaling factors we try. The
vectors are not linearly independent.
"
LINEAR COMBINATIONS OF TWO
6
VECTORS ON CD PLANE IN =& #"
'# !' '#
1 3 # "" 5 $
# !"
+ +
'$ = 2 '# = 1 !
'$ !
'$
#" 4 #!
0 0 = =
4 (# ($
3
($ = #!! '$ + #!" '# = 3
'$
0 2
,!! = 1, ,!" = 1
(# = #"! '$ + #"" '# 1 '#
(& = #%! '$ + #%" '# !
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
# %"' #
#""' # 22
Fig. 6.3 Linear combinations of two vectors in R3 . All the vectors we create starting
from our x1 and x2 live on the xy plane. We do not have enough vectors to span the whole
of R3 .
Fig. 6.4 Linear combinations of two vectors in R3 . Here again, we do not have enough
vectors to span R3 . The linear combinations of x1 and x2 live in a plane defined by the
two vectors. (Not drawn to scale.)
In Figure 6.1, the span of the red and blue vectors is all possible
vectors in R2 . In Figure 6.2, the span of the two vectors is a much
smaller subset: Only those vectors that are in the same direction (the
dotted line in the figure) as the original two, collinear vectors.
6.1.2 Cardinality
Clearly, with all si = 0, the sum is always the zero vector. What we
are looking for is a set of scaling factors with at least some non-zero
numbers such that the sum is 0. If we can find such a set, the vectors
are not linearly independent. They are linearly dependent.
Looking at the example in Figure 6.2, we have x2 = 2x1 =⇒
x2 − 2x1 = 0, the zero vector as a linear combination with non-
zero scalars, which means x1 and x2 are not linearly independent.
Similarly, for the sixth example in the list of examples above, x3 = 3x1 + 2x2, or 3x1 + 2x2 − x3 = 0, again the zero vector as a linear combination with non-zero scalars, implying linear dependence.
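To see such a dependence computationally, here is a sketch with made-up vectors that follow the same pattern x3 = 3x1 + 2x2 (the actual vectors of the sixth example are not reproduced here):

```python
import numpy as np

# Hypothetical vectors chosen so that x3 = 3*x1 + 2*x2
x1 = np.array([1., 0., 2.])
x2 = np.array([0., 1., 1.])
x3 = 3 * x1 + 2 * x2

X = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(X))             # 2 < 3  =>  linearly dependent

# Recover the scalars s1, s2 with x3 = s1*x1 + s2*x2 (least squares)
s, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), x3, rcond=None)
print(s)                                    # [3. 2.]
```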
For our purpose in this book, we will define a vector space as a set of
all possible vectors. In fact, while defining vectors, their operations
and properties, we were indirectly defining vector spaces as well. For
instance, we said when we scale a vector or add two vectors, we get
another vector, which is the same as saying that the set of vectors
is closed under scalar multiplication and vector addition. We also
listed the associativity, commutativity and distributivity properties
6.2.2 Subspace
2. The dot product between any vector on the xy plane and any
vector along the z axis is therefore:
x · z = x^T z = 0   =⇒   x ⊥ z
6.3.1 Basis
Notational Abuse
For our purposes in Linear Algebra, R2 , R3 , Rn etc. are vector spaces, which means they
are collections or sets of all possible vectors of the right dimensions. R2 , for instance, is
a collection of all vectors of the kind
x = (x_1, x_2)^T with x_1, x_2 ∈ R
and nothing else.
However, as we saw, R2 also is a coordinate space—the normal and familiar two-
dimensional plane where we have points with (x, y) coordinates. Because of such
familiarity, we may switch lightly between the vector space that is R2 and the coordinate
space that is R2 . We did it, for instance, when speaking of the span of a single vector
which is a line. In a vector space, there is no line, there are no points, only vectors.
Similarly when we spoke of vector subspaces, we spoke of the xy plane in R3 without
really distinguishing it from the vector space R3 .
Since the vectors in R2 or R3 , as we described them so far, all have components (or
elements) that are identical to the coordinates of the points in the coordinate space, this
abuse of notation goes unnoticed. Soon, we will see that the coordinates are artifacts of
the basis that we choose. It just so happens that the most natural and convenient basis
vectors are indeed the ones that will give coordinates as components.
Ultimately, however, it may be a difference without a distinction, but it is still good to
know when we are guilty of notational abuse so that we may be careful to avoid mistakes
arising from it.
• The reason why the two vectors did not form a basis for the
subspace in the previous example was that we had too many
vectors: for a subspace formed as a span of one vector (which
is a line of vectors), we need only one vector in the basis. Simi-
larly, for a subspace that looks like a plane (span of two vectors)
as in Figure 6.4, we need exactly two vectors in the basis. If
we had a third vector, it is necessarily a linear combination of
the other two that are already in the basis, and is therefore not
linearly independent.
6.3.2 Dimension
to each other if all the basis vectors of one are orthogonal to all the
basis vectors of the other.
In R3 , we spoke of the xy plane as a subspace being orthogonal
to the z axis (taking the liberty to mix vector and coordinate spaces).
The dimension of the first space is 2 and the second one is 1, adding
up to the dimension of the containing space R3 . We saw that the xy
plane and the yz plane do not form orthogonal subspaces, and one
reason is that their dimensions add up to 4, which is more than the
dimension of R3 .
In R4 , however, we will be able to find two-dimensional subspaces
that are orthogonal to each other. Remember that subspaces cannot
have non-trivial (non-zero) intersections. Therefore, we can have two
planes intersecting at a point in four dimensions!
S^⊥ = { x | x^T y = 0 ∀ y ∈ S }
" Direction 2
6 ·
!+" =5
!2" =1 5
8' = 9 ó ;$ ;# ' = 9
1 1 ! 5 4
= !"! = 3
1
1 21 " 1 1
Column Picture of Equations: 3
1 1 5
! +" =
1 21 1 2 !! =
1
!;$ + ";# = 9 1 5
(=
1 1
Direction 1
·!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
1
!# =
21
22
Draw the blue dotted ± Length of ;'$ 1
!= &$" = 2
line parallel to ;# , Length of ;$ 21
23
through the tip of 9, ± Length of ;'#
intersecting the line of "=
Length of ;# 24
;$ and constructing ;'$ .
Construct ;#' similarly. 9 = !;$ + ";#
25
26
Fig. 6.5 Two linear equations on two unknowns with a unique solution. The scalars for
the column vectors of A to produce b as linear combinations are the solution to the system
of equations. Note that the length of a′i is signed (as indicated by ±): It is positive if a′i is
in the same direction ai , negative otherwise.
these directions, once again highlighting the need for care described
in the box above on notational abuse.
To convince ourselves that this system does have a solution in this
case, let’s outline, as a series of steps, or an algorithm of sorts, how
we can get to the right values for the scalars1 referring to Figure 6.5:
1. Draw a line parallel to the second blue vector (the second
column of A, a2 ), going through the tip of the green b vector.
It is shown in blue as a thin dashed line.
2. Scale the first red vector (the first column of A, a1 ) to reach this
line. The scaling required tells us what the scalar should be.
For our equations, the scalar for a1 is x = 3 in b = xa1 + ya2 .
1
In listing these steps, we break our own rule about notational abuse: We talk about drawing
lines in the vector space, which we cannot do. A vector space contains only vectors; it does
not contain lines. It is coordinate spaces that contain lines. This predicament of ours shows
the difficulty in staying absolutely rigorous about concepts. Perhaps pure mathematical rigor
for its own sake is not essential, especially for an applied field like computer science.
Fig. 6.6 Two inconsistent linear equations on two unknowns with no solution. All linear
combination of the column vectors of A fall on the purple line, and the right hand side
does not. Therefore, b is not in the span of a1 and a2 . Hence no solution. Note that we
have drawn the red and blue vectors slightly offset from each other for visibility; they are
supposed to be on top of each other.
Let’s now move on to the case where we have two linear equations
that are not consistent with each other: x + y = 5 and x + y = 1. In
Figure 4.1, we saw that they were parallel lines, which would never
meet. What do they look like in our advanced geometric view in the
vector space? The column vectors of the coefficient matrix are now
identical. As we now know, all possible linear combinations of the
two vectors a1 and a2 fall in a subspace that is a line defined by the
direction of either of them. Their span, to use the technical term, is
only a subspace. The green vector b that we would like to create is
not along this line, which means no matter what scalars we try, we
will never be able to get b out of a1 and a2 . The system has no
solutions.
If we try the steps of our little construction algorithm above, we see that while we can draw a line parallel to a2 going through the tip of b, there is no scaling factor (other than ∞, to be absolutely rigorous) that will take a1 to this line.
" Direction 2
6 ·
!+" =5
!2" =1 5
!+" =6
0+0=1 4
!"! = ?
1 1 ! 5
8' = 9 ó 1 21 " = 1 3
0 0 1 5
Column Picture: 1
2 1
!! = 1
1 1 5 0 5
0
! 1 + " 21 = 1 1 (= 1
1
Direction 1
0 0 1 ·!
28 27 26!;$ +
25";# =
249 23 22 21 0 1 2 3 4 5 6 7 8
21
1
! = 21
22 #
The blue dotted line ó Cannot construct ;'$ . 0
&$" = ?
parallel to ;# , through Cannot construct ;#' .
23
the tip of 9 does not Cannot find ! and "
intersect the line of ;$ such that
24
because they are on !;$ + ";# = 9
different planes ó No solution
25
26
Fig. 6.7 Three linear equations on two unknowns with no solution. Here, a1 and a2 are
not collinear, but span the plane of the page. The right hand side, b, has a component in the
third direction, perpendicular to the page coming toward us, indicated only by the shadow
that b casts.
In Figure 6.7, we have three equations, with the third one incon-
sistent with the first two. The geometric view is in three dimensions,
as opposed to the algebraic visualization of the equations, which still
stays in two dimensions because of the number of variables. In other
words, the geometric view is based on the column vectors of A,
which have as many elements as equations, or number of rows of A.
The algebraic view is based on the number of unknowns, which is
the same as the number of columns of A.
The fact that the vector space now has three directions makes it
harder for us to visualize it. We have simplified it: First, we reduced
the third equation to a simpler form. Secondly, we indicate the third
direction (assumed to be roughly perpendicular to the page, coming
toward us) only by the shadows that the vector and the dashed lines
would cast if we were to shine light on them from above the page.
Here are the equations, the columns of A and the RHS:
x + y = 5
x − y = 1                            a1 = (1, 1, 0)^T,   a2 = (1, −1, 0)^T,   b = (5, 1, 1)^T
x + y = 6 (reduced to 0 = 1)
Running our construction algorithm on this system:
1. Drawing a line parallel to a2 (blue vector), going through the
tip of the green b, we get the thin dashed line in blue. This
line is one unit above the plane of the page (because the third
component of b is one).
2. Trying to scale the first red vector (a1 ) to reach this blue dashed
line, we fail because the scaled versions go under the blue line.
3. Similarly, we fail trying to scale the blue a2 as well, for the
same reason. We cannot find x and y such that b = xa1 + ya2 ,
because of the pesky, non-zero third component in b.
4. However, we can see that the shadows of these red and blue
dashed lines (shown in grey) on the plane of the page do meet
at the tip of the shadow of b. Think of these shadows as
projections and we have a teaser for a future topic.
In Figure 6.7, we took a coefficient matrix such that the third
components of its column vectors were zero so that we could visualize
the system relatively easily: Most of the action was taking place on the
xy-plane. In a general case of three equations on two unknowns, even
when we have non-zero third components, the two column vectors
still make a plane as their span. If the RHS vector is in the span of
the two vectors, we get a unique solution. If not, the equations are
inconsistent and we get no solution.
The geometric view of Linear Algebra, much like the algebraic
view, concerns itself with the solvability conditions, the characteris-
tics of the solutions etc., but from the geometry of the vector spaces
associated with the coefficient matrix (as opposed to row-reduction
type of operations in the algebraic view). As we saw in this chapter,
and as we will appreciate even more in later chapters, the backdrop
of the geometric view is the column picture of matrix multiplication.
What We Learned
• Review: Linear Combinations: Doing both basic operations at the same time.
For x1 , x2 ∈ Rn and s1 , s2 ∈ R, z = s1 x1 +s2 x2 ∈ Rn is a linear combination
of x1 and x2 , which is easily extended to any number of vectors. The concept of
linear combinations is at the core of most definitions in this chapter, and indeed
most of Linear Algebra.
• Span: Given a set of k vectors {xi }, their span is the set of all possible linear
combinations of them, defined as the set:
{ Σ_{i=1}^{k} s_i x_i   for x_i ∈ R^n and s_i ∈ R }
• Vector Space: Set of all possible vectors of given number of elements, such that
it is closed under the defined basic operations.
• Vector Subspace: Any subset of a vector space that is closed under the defined
basic operations.
• The span of a given set of vectors belonging to a vector space is a vector subspace.
• Basis: The minimal set of vectors that span a space or a subspace is its basis.
• Dimension: The dimension of space or subspace is the number of vectors in any
of its basis (AKA the cardinality of the basis set).
• System of Linear Equations Ax = b:
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
(a) True
(b) False
(c) Trick question; it says b is already in a linear combination of the columns
of A
(d) Trick question; b can never be in the space spanned by the columns of A
because of dimensions
(a) No, they always intersect in a line (b) No, need three planes for a point
(c) Never (d) Yes
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 6: Vector Spaces, Basis and
Dimensions
• Imperial College: “Mathematics for Machine Learning - Linear Algebra.”
– Video 11: M4ML - Linear Algebra - 2.4 Part 1: Basis, vector space, and
linear independence
Right from the start of this book, we wrote vectors as column ma-
trices, with numbers arranged in a column. These numbers are the
components of the vector. Where do the components come from?
How do we get them? They are, in fact, the byproduct of the under-
lying basis that we did not hitherto talk much about. In this chapter,
we will expand on our understanding of bases, learn how the compo-
nents change when we change bases and explore some of the desirable
properties of basis vectors.
x = Σ_{i=1}^{n} x_{i|A} a_i        (7.1)
1
From this chapter onward, when we say space or subspace, we mean a vector space or a
vector subspace.
x ∈ R2 .
x = (2, 3)^T = I x = [1 0; 0 1] (2, 3)^T = 2 (1, 0)^T + 3 (0, 1)^T = 2 q1 + 3 q2
Trace
Definition: For any A = [a_ij] ∈ R^{n×n}, its trace is defined as
trace(A) = Σ_{i=1}^{n} a_ii
When we use the identity matrix as the basis (as we almost always
do), what we get as the components of vectors are, in fact, the coor-
dinates of the points where the tips of the vectors lie. For this reason,
the identity basis may also be referred to as the coordinate basis. The
components of a vector may be called coordinates. And a vector (as
we define and use it in Linear Algebra, as starting from the origin)
may be called a position vector to distinguish it from other vectors
(such as the electric or magnetic field strength, which may have a
specific value and direction at any point in the coordinate space).
∥Qx∥² = (Qx)^T Qx = x^T Q^T Q x = x^T Q^{−1} Q x = x^T I x = x^T x = ∥x∥²   =⇒   ∥Qx∥ = ∥x∥
6 "
CHANGE OF BASIS 5
#
7
!= 4
5
1 0
2! = 2 = 3
0 # 1 5!%
3.5 & 0
3&! = 3" = 2
0 2.5 5%%
!
2 1
3&&
! = 3&& = P# 1
5%%
1 " 1 "
!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
P" 5%"
21
22 7
! = 72! + 52# ó ! ' =
5
23 2
!= 23&! + 23&" ó ! ($ =
2
24 2
!= 23&&
! + 33&&
" ó ! ($$ =
3
25
26
Fig. 7.1 Visualizing the change-of-basis examples listed in Table 7.1. The green vector
is represented in three different bases, giving vastly different values for its components.
only two in each direction to get to the tip of our vector x. When
the basis vectors become bigger, the components get smaller, which
should indicate to us that the change of basis probably has the inverse
of the basis matrix involved in some fashion.
In the third row of Table 7.1, things get really complicated. Now
we have the basis vectors in some random direction (not orthogonal
to each other) with some random size (pretty far from unity). The
first component now is 2 and the second 3, as described in Table 7.1
and illustrated in Figure 7.1.
Let’s now verify Eqn (7.5) using the most complicated example we
did, namely the third row of Table 7.1. Remembering the prescription
for the inverse of a 2 × 2 matrix (swap the diagonal elements, switch
the sign of the off-diagonal elements and divide by the determinant),
we have:
The original vector in the coordinate basis: x = (7, 5)^T
The new basis: A = [2 1; 1 1]   =⇒   |A| = 1 and A^{−1} = [1 −1; −1 2]
The vector in the new basis: [x]_A = A^{−1} x = [1 −1; −1 2] (7, 5)^T = (2, 3)^T
Note that we have used the symbol A for the new basis rather than
A′′ as in Eqn (7.5). Comparing our [x]A with [x]A′′ in the Table 7.1,
we can satisfy ourselves that the matrix equation in Eqn (7.5) does
work.
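The same verification in NumPy (a sketch; solving the system is generally preferred over explicitly forming A^{−1}):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 1.]])      # the new basis vectors as columns
x = np.array([7., 5.])        # the vector in the coordinate basis

# [x]_A = A^{-1} x, computed by solving A [x]_A = x
x_A = np.linalg.solve(A, x)
print(x_A)                    # [2. 3.]
print(A @ x_A)                # back to [7. 5.]
```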
we can write:
x · y = x^T y = (A[x]_A)^T (A[y]_A) = [x]_A^T A^T A [y]_A
Whatever we said about bases and components for spaces also applies
to subspaces, but with one important and interesting difference. As
we know, subspaces live inside a bigger space. For example, we
can have a subspace of dimension r inside Rn with r < n. For
this subspace, we will need r basis vectors, each of which is a n-
dimensional vector: ai ∈ Rn . If we were to place these r vectors in
a matrix, we would get A ∈ Rn×r , not a square matrix, but a “tall”
one.
Remember, this subspace of dimension r is not Rr . In particular, a
two-dimensional subspace (a plane going through the origin) inside
R3 is not the same as R2 . Let’s take an example, built on the third row
of Table 7.1 again, to illustrate it. Let’s take the two vectors in the
example, make them three-dimensional by adding a third component.
The subspace we are considering is the span of these two vectors,
which is a plane in the coordinate space R3 : All linear combinations
of the two vectors lie on this plane. We will use the same two vectors
as the basis A and write our vector x (old, coordinate basis) as [x]A
(new basis for the subspace). We have:
A = [2 1; 1 1; 1 0]        x = 2 a1 + 3 a2 = (7, 5, 2)^T   =⇒   [x]_A = (2, 3)^T
Note that our [x]A has only two components because the subspace
has a dimension of two. Why is that? Although all the vectors in
the subspace are in R3 , they are all linear combinations of the two
column vectors of A. The two scaling factors required in taking the
linear combination of the basis vectors are the two components of the
vectors in this basis.
In the case of full spaces Rn , we had a formula in Eqn (7.5) to
compute [x]A , which had A−1 in it. For a subspace, however, what
we have is a “tall” matrix with more rows (m) than columns (r < m).
What is the equivalent of Eqn (7.5) in this case? Here’s where we
will use the left inverse as defined in Eqn (5.2). Note that A is a
tall matrix with full column rank because its r columns (being basis
vectors for a subspace) are linearly independent. AT A is a full-rank,
square matrix of size r × r, whose inverse figures in the left inverse.
As a reminder, here is how we defined it:
(A^T A)^{−1} (A^T A) = I   =⇒   A^{−1}_left = (A^T A)^{−1} A^T
Once again, let’s verify the veracity of this prescription using our
example.
x = (7, 5, 2)^T        A = [2 1; 1 1; 1 0]   =⇒   A^T = [2 1 1; 1 1 0]   and   A^T A = [6 3; 3 2]
|A^T A| = 3   and   (A^T A)^{−1} = (1/3) [2 −3; −3 6] = [2/3 −1; −1 2]
(A^T A)^{−1} A^T = A^{−1}_left = [2/3 −1; −1 2] [2 1 1; 1 1 0] = [1/3 −1/3 2/3; 0 1 −1]
[x]_A = A^{−1}_left x = [1/3 −1/3 2/3; 0 1 −1] (7, 5, 2)^T = (2, 3)^T
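The same computation in NumPy (a sketch): for a full-column-rank A, the left inverse, and equivalently numpy.linalg.pinv, recovers the components in the subspace basis:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 1.],
              [1., 0.]])      # basis of a 2-D subspace of R^3 (as columns)
x = np.array([7., 5., 2.])    # a vector known to lie in that subspace

left_inv = np.linalg.inv(A.T @ A) @ A.T
print(left_inv @ x)                   # [2. 3.]

# For full column rank, the pseudo-inverse coincides with the left inverse
print(np.linalg.pinv(A) @ x)          # [2. 3.]
```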
7.4 Orthogonality
Orthogonal Vectors
Definition: Two vectors, x, y ∈ Rn are orthogonal to each other if
and only if
xT y = y T x = 0
Note that the vectors in the inner product can commute. Note also
that the less sophisticated definition of the inner product (namely
x · y = ∥x∥∥y∥ cos θ) shows that the inner (or dot) product is zero
when x and y are perpendicular to each other because the angle θ
between them is π/2 and cos θ = 0.
We still stay away from the definition of the inner product using the angle because, by now, we know that the machinery of Linear Algebra may be applied to vector-like objects where we may not be able to talk about directions and angles. For example, in Fourier transforms or the wave functions in quantum mechanics, vectors are functions with the inner product defined with no reference to any kind of angles.
We can still have orthogonal “vectors” in such vector spaces when the
inner product is zero. Clearly, we cannot have perpendicular vectors
without abusing the notion and notation a bit too much for our (or at
least, the author’s) liking.
Earlier, in Eqn (2.5), we defined the Euclidean norm of a vector
∥a∥:
∥a∥2 = aT a
Using this definition, we can prove that the inner product of orthog-
onal vectors has to be zero, albeit not completely devoid of the lack
of sophistication associated with perpendicularity.
If we have a ⊥ b, then we know that ∥a∥ and ∥b∥ make the sides of a right-angled triangle (which is where the pesky perpendicularity comes in), with ∥a + b∥ as the hypotenuse. By Pythagoras, ∥a + b∥² = ∥a∥² + ∥b∥² = a^T a + b^T b, while expanding the norm gives
∥a + b∥² = (a + b)^T (a + b) = a^T a + b^T b + b^T a + a^T b
=⇒ 0 = b^T a + a^T b
=⇒ b^T a = a^T b = 0    since b^T a = a^T b
7.4.2 Orthogonalization
Given that the orthonormal (we may as well call it orthogonal because
everybody does it) basis is the best possible basis we can ever hope
to have, we may want to have an algorithm to make any matrix
A ∈ Rn×n orthogonal. An orthogonal matrix is the one in which
the columns are normalized and orthogonal to one another. In other
words, it is a matrix that could be a basis matrix as described in §7.1.4
with the associated properties. And orthogonalization is the process
or algorithm that can make a square matrix orthogonal.
The first question we may want to ask ourselves is why we would
want to do this; why orthogonalize? We know that a perfect basis
for Rn×n is In , the identity matrix. Why not just use it? We have
two reasons for doing it. The first one is pedagogical: We get
to see how projection works in a general way, which we will use
later. The second reason is a practical one from a computer science
perspective: Certain numerical algorithms use the decomposition that
results from the orthogonalization process. Another reason, again
from our neck of the woods, is that when we know that a matrix
Q is orthonormal, we know that the transformation it performs on
a vector is numerically stable: Qn x does not suffer from overflow
or underflow errors because the norm of x is not modified by the
multiplication with Q.
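A small numerical illustration of this stability claim (a sketch, using a rotation matrix as the orthogonal Q):

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation: Q^T Q = I

x = np.array([3., 4.])
y = x.copy()
for _ in range(1000):        # apply Q a thousand times
    y = Q @ y

print(np.linalg.norm(x), np.linalg.norm(y))   # both ~5.0: the norm is preserved
```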
7.4.3 Projection
Projection of vectors: for a1, a2 ∈ R^n with angle θ between them,
Dot product using the angle:   a1^T a2 = ∥a1∥ ∥a2∥ cos θ   =⇒   cos θ = a1^T a2 / (∥a1∥ ∥a2∥)
Projection length (of a2 onto a1):   ℓ = ∥a2∥ cos θ = a1^T a2 / ∥a1∥
Projection vector (of a2 onto a1):   a2′ = (a1 / ∥a1∥) ℓ = a1 (a1^T a2) / (a1^T a1)
Projection vector (of a1 onto a2):   a1′ = a2 (a2^T a1) / (a2^T a2)
Fig. 7.2 Dot product between two vectors a1 , a2 ∈ Rn , shown on the plane defined by
the two vectors.
s a1^T a2 = a1^T (s a2) = (a1^T a2) s
The Gram-Schmidt process on A = [a1 a2 a3 ⋯ an]:
1. Normalize a1 as q1:   q1 = a1 / ∥a1∥
2. Project a2 onto q1:   a2′ = (a2^T q1) q1.  Take the part perpendicular to q1, a2⊥ = a2 − a2′ = a2 − (a2^T q1) q1, and normalize it:   q2 = a2⊥ / ∥a2⊥∥
3. Project a3 onto q1 and q2:   a3′ = (a3^T q1) q1 + (a3^T q2) q2.  Take the perpendicular part a3⊥ = a3 − a3′ and normalize it:   q3 = a3⊥ / ∥a3⊥∥
Fig. 7.3 Illustration of the Gram-Schmidt process running on a matrix A. The first
column (red) is normalized to get the unit vector q1 , which is then used to create q2 from
the second (blue) column. Both q1 and q2 are used in projecting the third (green) column
and computing q3 .
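Here is a compact sketch of the classical Gram-Schmidt steps of Figure 7.3 (my own implementation, not the book's); the resulting Q also gives the QR factors discussed next:

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt on the columns of A (assumed independent).
    Returns Q with orthonormal columns spanning the same column space."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for i in range(n):
        v = A[:, i].copy()
        for j in range(i):                       # subtract projections on earlier q_j
            v -= (A[:, i] @ Q[:, j]) * Q[:, j]
        Q[:, i] = v / np.linalg.norm(v)          # normalize the perpendicular part
    return Q

A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)))           # True: columns are orthonormal
R = Q.T @ A                                      # upper triangular, so A = QR
print(np.allclose(A, Q @ R))                     # True
```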
7.5.3 QR Decomposition
Since the Gram-Schmidt algorithm is about taking linear combina-
tions of the columns of the matrix, it should be possible to write it as
1.25 "
0
P O# =
in 1
2s os P 1
sP
=
c co n P
% i
O# 0.75 % = s
O"
0.5
Y
0.25
Y
Z !
22 ROTATION
21.75 21.5 21.25MATRIX
21 IN =20.5
20.75 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
1
O" =
0
The unit vectors transform as: 20.25
1 cos ;
2! = § 2&! = 20.5
0 sin ;
0 2sin ;
2# = § 2&# = 20.75
1 cos ;
ó The Rotation Matrix
21
cos ; 2sin ;
?+ =
sin ; cos ;
21.25
Fig. 7.4 The rotation matrix in R2 can be written down by looking at where the unit
vectors go under a rotation through a specified angle.
What We Learned
– From the coordinate basis, which uses the identity matrix as the basis.
– The identity (AKA coordinate) basis is an ideal basis because the basis
vectors are orthonormal.
• Change of Basis:
(A^T A)^{−1} (A^T A) = I   =⇒   A^{−1}_left = (A^T A)^{−1} A^T
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
Explore the web, other books or resources and attempt these questions.
13. Prove that the inverse of an upper triangular matrix is an upper triangular
matrix.
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 7: Change of Basis, Orthogo-
nality and Gram-Schmidt
• 3Blue1Brown: “Essence of Linear Algebra.”
– Video 13: Change of basis | Chapter 13, Essence of linear algebra
• Imperial College: “Mathematics for Machine Learning - Linear Algebra.”
– Video 10: M4ML - Linear Algebra - 2.3 Changing basis
– Video 12: M4ML - Linear Algebra - 2.4 Part 2: Applications of changing
basis
– Video 13: M4ML - Linear Algebra - 2.5 Summary
– Video 25: M4ML - Linear Algebra - 4.3 Orthogonal Matrices
– Video 26: M4ML - Linear Algebra - 4.4 The Gram-Schmidt process
• MIT: Linear Algebra
– Video 21: 9. Independence, Basis, and Dimension
– Video 22: Basis and Dimension
– Video 37: 17. Orthogonal Matrices and Gram-Schmidt
– Video 38: Gram-Schmidt Orthogonalization
8
Review and Recap
8.1 A Generalization
Vector Space
Definition: A vector space over a field of K is a set of elements
that have two operations defined on them. We will call the elements
“vectors” and use the symbol S for the set.
1. Addition (denoted by +): For any two vectors, x, y ∈ S,
addition assigns a third (not necessarily distinct) vector (called
the sum) in z ∈ S. We will write z = x + y.
2. Scalar Multiplication: For an element s ∈ K (which we will
call a scalar) and a vector x ∈ S, scalar multiplication assigns
a new (not necessarily distinct) vector z ∈ S such that z = sx.
These two operations have to satisfy the properties listed below:
Commutativity: Addition should respect commutativity, which means the order in which the vectors appear in the operation does not matter. For any two vectors x1 , x2 ∈ S,
x1 + x2 = x2 + x1
We can make scalar multiplication also commutative by defin-
ing sx = xs.
Associativity: Both operations should respect associativity, which
means we can group and perform the operations in any order
we want. For any two scalars s1 , s2 ∈ K and a vector x ∈ S,
s1 s2 x = s1 (s2 x) = (s1 s2 )x
and for any three vectors x1 , x2 , x3 ∈ S,
x1 + x2 + x3 = (x1 + x2 ) + x3 = x1 + (x2 + x3 )
Distributivity: Scalar multiplication distributes over vector addition.
For any scalar s ∈ K and any two vectors x1 , x2 ∈ S,
s(x1 + x2 ) = sx1 + sx2
(s1 + s2 )x = s1 x + s2 x
(A^T A)^T = A^T (A^T)^T = A^T A    and    (A A^T)^T = (A^T)^T A^T = A A^T
Earlier, we defined the span of vectors in Eqn (6.2). For our own
nefarious ulterior motives, let’s recast the definition using other no-
tations and symbols as in the following equation:
Given n vectors a_i ∈ R^m,   span({a_i}) = { b | b = Σ_{i=1}^{n} x_i a_i for any n scalars x_i ∈ R }        (8.1)
becomes:
Σ_{i=1}^{n} x_i a_i = 0   =⇒   Ax = 0        (8.2)
A = QR =⇒ R = Q−1 A = QT A
1
To be very pedantic about it, it is not a permutation but an element of the Cartesian product
of the sets of matrix shapes and rank statuses.
Square, Full-rank:   A ∈ R^{n×n}, rank(A) = n   =⇒   RREF = I_n   =⇒   Unique Solution
Wide, Full-rank:   A ∈ R^{m×n}, m < n, rank(A) = m   =⇒   RREF = [I_m · F_{m×(n−m)}]   =⇒   Infinity of Solutions
Fig. 8.1 Properties and solvability of the linear equations Ax = b based on the RREF of the coefficient matrix A.
What We Learned
• This chapter is already a recap of the first seven chapters. A further summary,
therefore, is probably an overkill. Nevertheless. . .
• We took another look at the definition of vector spaces, in a more formal and
general fashion.
• We reiterated that the column picture of matrix multiplication is key to under-
standing most of the advanced topics of Linear Algebra.
• We restated span and linear independence of vectors, linking them to Ax = b
and Ax = 0.
• We also summarized the algorithms we learned, the shapes of matrices, properties
and interconnections among pivots, ranks and determinants etc.
Exercises
A. Review Questions
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• 3Blue1Brown: “Essence of Linear Algebra.”
– Video 16: Abstract vector spaces | Chapter 16, Essence of linear algebra
• Imperial College: “Mathematics for Machine Learning - Linear Algebra.”
– Video 14: M4ML - Linear Algebra - 3.1 Matrices, vectors, and solving
simultaneous equation problems
– Video 15: M4ML - Linear Algebra - 3.2 Part 1: How matrices transform
space
– Video 17: M4ML - Linear Algebra - 3.3 Part 1: Solving the apples and
bananas problem: Gaussian elimination
9
The Four Fundamental
Spaces
Our geometric view started with the notion that we can think of
the solution to the system of linear equations Ax = b as a quest for
that special x whose components become the coefficients in taking
the linear combination of the columns of A to give us b. Since this
opening statement was a bit too long and tortured, let’s break it down.
Here is what it means:
Ax = b   ⟺   x1 a1 + x2 a2 + ⋯ + xn an = b
Solving this system means finding the right x_i so that the linear combination satisfies the condition.
We also learned that the collection of all possible linear combina-
tions of a set of vectors is called the span of those vectors, and it is
a vector subspace, which is a subset of a vector space. For example,
R^m is a space¹ and n vectors a_i ∈ R^m span a subspace contained within R^m. Let’s call this subspace C ⊆ R^m. If, among the n vectors that span C, only r ≤ n of them are linearly independent, then those r vectors form a basis for C and the dimension of this subspace C is indeed r (which is the cardinality of the basis). In fact, even if n > m, the n vectors span only a subspace C ⊂ R^m if the number of independent vectors r < m. Note that r can never be greater than m, the number of components of each of our vectors a_i ∈ R^m.
Column Space
Definition: The column space C of a matrix is the span of its columns.
For a matrix A = [a_i] ∈ R^{m×n},   C(A) = { z | z = Σ_{i=1}^{n} x_i a_i }
1
Once again, we will be dropping the ubiquitous “vector” prefix from spaces and subspaces.
Null Space
Definition: The null space N of a matrix A is the complete set of all
vectors that form the solutions to the homogeneous system of linear
equations Ax = 0.
N(A) = { x | Ax = 0 }
Note that if we have two vectors in the null space, then all their
linear combinations are also in the null space. x1 , x2 ∈ N (A) =⇒
x = s1 x1 + s2 x2 ∈ N (A). If Ax1 = 0 and Ax2 = 0, then
A(s1 x1 + s2 x2 ) = 0 because of the basic linearity condition we
learned way back in the first chapter. What all this means is that
the null space is indeed a vector subspace. Furthermore, N (A) is
complete, by definition. In other words, we will not find a vector x
that is not in N (A) such that Ax = 0. In the cold hard language of
mathematics, x ∈ N (A) ⇐⇒ Ax = 0.
For any and all vectors x ∈ N(A), in the null space of A, the dot products with the rows of A are zero. If r_i^T is a row of A, we have:
Ax = 0   =⇒   (⋯, r_i^T x, ⋯)^T = 0   =⇒   r_i^T x = 0  ∀ i        (9.1)
The dot product being zero means that the vectors in N (A) and the
rows of A are orthogonal. It then follows that any linear combinations
of the rows of A are also orthogonal to the vectors in N (A). And,
the collection of the linear combinations of the rows of A is indeed
their span, which is a subspace we will soon call the row space of A.
are all column vectors, and the rows of a matrix are definitely not
columns, we think of the transpose of the rows as the vectors whose
linear combinations make the row space. Equivalently, we may
transpose the matrix first and call the column space of the transpose
the row space of the original. That is to say, the row space of A is
C(AT ), which is the notation we will use for row space. Since the row
rank is the same as column rank, the number of linearly independent
rows in A is its rank, which is the same as the dimension of C(AT ).
Orthogonal Complementarity
Before attempting to prove that C(AT ) and N (A) are orthogonal complements, it is
perhaps best to spell out what it means. It means that all the vectors in C(AT ) are
orthogonal to all the vectors in N (A), which is the orthogonal part. It also means
if there is a vector orthogonal to all vector in C(AT ), it is in N (A), which is the
complement part. Here are the two parts in formal lingo:
1. y ∈ C(AT ) and x ∈ N (A) =⇒ xT y = 0
2. (a) y ̸= 0 ∈ C(AT ) and
(b) xT y = 0 ∀ y ∈ C(AT ) =⇒ x ∈ N (A)
3. Or, conversely,
(a) x ∈ N (A) and
(b) xT y = 0 ∀ x ∈ N (A) =⇒ y ∈ C(AT )
The first part is easy to prove, and we did it in Eqn (9.1), which just says that each
element of Ax has to be zero for Ax = 0 to be true.
To prove the second part, let’s think of the matrix AT as being composed of columns
ci . If y ∈ C(AT ), as in condition 2(a), we can write y as a linear combination of ci :
y = Σ_i s_i c_i = A^T s
therefore write:
N(A) ⊂ R³, of one dimension, with the basis { (0, 0, 1)^T }
Thinking in terms of the coordinate space, we see that the row space
is actually the xy-plane and the null space is the z-axis. They are
indeed orthogonal complements.
Let’s go back to talking about the general case, A ∈ Rm×n , rank(A) =
r. We know that by definition, all vectors x ∈ N (A) get transformed
to 0 by Ax = 0. Because of the orthogonal complementarity, we
also know that all orthogonal vectors are in C(AT ), the row space of
A, as illustrated above with our toy example. All the vectors in the
row space get transformed to non-zero vectors in Rm . What about
the rest of the vectors? After all, most vectors in Rn are in neither
C(AT ) nor N (A): They are linear combinations of these two sets of
vectors. Let’s take such a vector x = x_∥ + x_⊥ where x_∥ ∈ C(A^T) and x_⊥ ∈ N(A).
What this equation tells is that all vectors not in N (A) also end up in
C(A) by Ax. Moreover, multiple vectors x ̸∈ N (A) get transformed
to the same b ∈ C(A): It is a many-to-one (surjective) mapping.
What is special about the row space is that the mapping from C(AT )
to C(A) is a one-to-one (injective) mapping. Since it is an important
point, let’s state it mathematically:
To complete the picture and bring out the beautiful symmetry of the
whole system, we will define one more null space, which is N (AT ),
which is also called the left null space. It lives on the same side as
the column space, C(A), and no vector in the input space Rn (other
than the zero vector) can reach this space.
general. But when b = 0, the constants part does not get affected by
row operations. Since the solution of a system of linear equations is
not affected by row operations, the solution sets for A and R are the
same.
Next, we can see that in R, the pivot columns are linearly indepen-
dent of each other by design. They are the only ones with a non-zero
element (actually 1) in the pivot positions, and there is no way we
can create a 1 by taking finite combinations of 0. Therefore, a linear
combination of the pivot columns in R will be 0 if and only if the coefficients multiplying them are all zero. Since the solution sets
for A and R are the same, the same statement applies to A as well.
Therefore, the columns in A that correspond to the pivot columns in
R are linearly independent. And indeed, they form a basis for the
column space of A.
For A ∈ R^{m×n} with rank(A) = r, the map Ax = b takes x ∈ R^n (all possible x) to b ∈ R^m (all possible and impossible b): A : R^n → R^m. The four subspaces: the Row Space C(A^T) ⊆ R^n (the coimage), dim = r; the Column Space C(A) ⊆ R^m (the image, or range), dim = r; the Null Space N(A) ⊆ R^n (the kernel), dim = n − r; and the Left Null Space N(A^T) ⊆ R^m (the left or cokernel), dim = m − r. Vectors x_∥ in the row space map to b in the column space, and vectors in the null space map to 0.
Fig. 9.1 A pictorial representation of the four fundamental subspaces defined by a matrix.
equations. Keep in mind that while the REF (which is the result of
Gaussian elimination) can have different shapes depending on the
order in which the row operations are performed, RREF (the result
of Gauss-Jordan) is immutable: A matrix has a unique RREF.
2
Wikipedia describes it as: “The rank-nullity theorem is a theorem in linear algebra, which
asserts that the dimension of the domain of a linear map is the sum of its rank (the dimension
of its image) and its nullity (the dimension of its kernel).”
The column space lives in R3 because the columns have three compo-
nents. The dimension of the column space = the rank = the number of
pivots = 2. The basis (which is a set) consists of the column vectors
in A (not in R) corresponding to the pivot columns, namely 1 and 3.
So here is our answer:
C(A) ⊂ R³;   dim(C(A)) = 2;   Basis = { (1, 2, 1)^T, (1, 1, 0)^T }
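The same computation can be reproduced with SymPy (assumed available); the matrix below is reconstructed from the RREF in Eqn (9.2) and the basis vectors quoted above:

```python
from sympy import Matrix

# The matrix of this example, reconstructed from its RREF in Eqn (9.2)
A = Matrix([[1, 1, 1, 6],
            [2, 2, 1, 9],
            [1, 1, 0, 3]])

R, pivots = A.rref()
print(pivots)            # (0, 2): pivot columns are 1 and 3 (1-based)
print(A.columnspace())   # [Matrix([[1],[2],[1]]), Matrix([[1],[1],[0]])]
print(A.rowspace())      # the pivot rows of R: a basis for the row space
```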
For the row space, we can take the pivot rows of R as our basis. Note
that the row space lives in R4 because the number of columns of A
is 4. It also has a dimension of 2, same as the rank.
C(A^T) ⊂ R⁴;   dim(C(A^T)) = 2;   Basis = { (1, 1, 0, 3)^T, (0, 0, 1, 3)^T }
Notice how we have been careful to write the basis as a set of column
vectors, although we are talking about the row space. Our vectors are
always columns.
For the null space, we will first solve the underlying Ax = 0 equa-
tions completely, highlight a pattern, and present it as a possible
shortcut, to be used with care.
We saw earlier that Ax = 0 and Rx = 0 have the same solution
set, which is the null space. Writing down the equations explicitly,
we have
Rx = [1 1 0 3; 0 0 1 3; 0 0 0 0] (x1, x2, x3, x4)^T = 0        (9.2)
=⇒ x1 + x2 + 3x4 = 0   and   x3 + 3x4 = 0
Noting that x2 and x4 are free variables (because the corresponding
columns have no pivots), we solve for x1 and x3 as x3 = −3x4 and
x1 = −x2 − 3x4 . Therefore the complete solution becomes:
(x1, x2, x3, x4)^T = (−x2 − 3x4, x2, −3x4, x4)^T = x2 (−1, 1, 0, 0)^T + x4 (−3, 0, −3, 1)^T
The complete solution is a linear combination of the two vectors above because x2 and x4, being free variables, can take any value in R. In other words, the complete solution is the span of these
two vectors, which is what the null space is. We give our computation
of the null space as follows:
N(A) ⊂ R⁴;   dim(N(A)) = 2;   Basis = { (−1, 1, 0, 0)^T, (−3, 0, −3, 1)^T }
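SymPy's nullspace() produces exactly these special solutions (a sketch, with the same reconstructed matrix as before):

```python
from sympy import Matrix

A = Matrix([[1, 1, 1, 6],
            [2, 2, 1, 9],
            [1, 1, 0, 3]])

for v in A.nullspace():      # one basis vector per free variable
    print(v.T)               # [-1, 1, 0, 0] and [-3, 0, -3, 1]
```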
The steps to compute N (AT ) are identical to the ones for N (A),
except that we start with AT instead of A, naturally. We will not go
through them here, but, as promised earlier, we will share a shortcut
for computing null spaces in the box on “Null Spaces: A Shortcut,”
and use the left-null-space computation of this A as an example.
From the examples worked out, we can see that the null-space
computations and the complete solutions of the underlying system of
linear equations have a lot in common. It is now time to put these
two topics on an equal footing, which also gives us an opportunity to
review the process of completely solving a system of linear equations
and present it as a step-by-step algorithm.
3. In the blank position, indicated by ⃝, of the first basis vector created in (2) we
type in 1.
4. The final answer is:
N(A^T) ⊂ R³;   dim(N(A^T)) = 1;   Basis = { (1, −1, 1)^T }
A bit of thinking should convince us that this shortcut is, in fact, the same as what we
did in computing the null space in the text, by writing down the complete solution. We
can easily verify that we get the same answer by applying this shortcut on the matrix R
in Eqn (9.2).
The computation of the null space of a matrix is, in fact, the same
as finding the special solutions of the underlying system of linear
equations. When we found the complete solution earlier in §4.3.2
(page 89) and when we computed the null space above, the procedures
may have looked ad-hoc. Now that we know all there is to know about
the fundamental spaces of a matrix, it is time to finally put to rest the
tentativeness in the solving procedure and present it like an algorithm
with unambiguous steps. Let’s start by defining the terms.
In Ax = b, the complete solution is the sum of the particular so-
lution and the special solutions. We can see an example in Eqn (4.4),
where the first vector is the particular solution and the second and
third terms making a linear combination of two vectors is the special
solution. The linear combination is, in fact, the null space of A. As
we saw earlier, if x_∥ is a solution to Ax = b, then x_∥ + x_⊥ also is a solution for any x_⊥ ∈ N(A) because, as we saw earlier, A(x_∥ + x_⊥) = Ax_∥ + Ax_⊥ = b + 0 = b.
9.8.1 An Example
We will reuse the example in Eqn (4.4) (on page 90), where we started
with these equations:
x1 + x2 + x3 + 2x4 = 6
2x1 + 2x2 + x3 + 7x4 = 9
from which we got the augmented matrix:
[A | b] = [1 1 1 2 | 6; 2 2 1 7 | 9]   →(REF)   [1 1 1 2 | 6; 0 0 −1 3 | −3]
and ended up with the complete solution:
x = (3, 0, 3, 0)^T + t1 (−1, 1, 0, 0)^T + t2 (−5, 0, 3, 1)^T        (9.3)
The null space of the coefficient matrix has the basis vectors appearing
as the linear combination in the last two terms above. N (A) is a plane
in R4 , and we can use any two linearly independent vectors on it as
its basis. The exact basis vectors we wind up with depend on the
actual elimination steps we use, but they all specify the same plane
of vectors, and indeed the same subspace. As we can see, in solving
the system of equations above, we started with the row-echelon form of the augmented matrix [A | b]. The REF (the output of Gaussian elimination) of a matrix is not unique; it is the RREF (from Gauss-Jordan elimination) that is unique.
Let’s solve the system again. This time, we will start by finding
the RREF (the output of Gauss-Jordan) because it is unique for any
given matrix.
$$\left[\,A \mid b\,\right] = \left[\begin{array}{cccc|c} 1 & 1 & 1 & 2 & 6 \\ 2 & 2 & 1 & 7 & 9 \end{array}\right] \xrightarrow{\text{RREF}} \left[\begin{array}{cccc|c} 1 & 1 & 0 & 5 & 3 \\ 0 & 0 & 1 & -3 & 3 \end{array}\right]$$
In order to find a particular solution, we first set values of the
free variables to zero, knowing that they are free to take any values.
This step, in effect, ignores the free variables for the moment. In
the example above, the free variables are x2 and x4 , corresponding
to the columns with no pivots. Ignoring these pivot-less columns, what we see is an augmented matrix [A | b] = [I | b′], giving us the particular solution x∥ = [3  0  3  0]ᵀ directly³.
³ Note that the particular solution obtained using this prescription does not have to be in the row space of the coefficient matrix because we are taking zero values for the free variables.
Setting x2 = 1 and x4 = 0:
$$x_1 + x_2 + 5x_4 = 0,\;\; x_3 - 3x_4 = 0 \implies x_1 + 1 = 0,\;\; x_3 = 0 \implies x_{\perp 1} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \\ 0 \\ 0 \end{bmatrix}$$
Now, setting x2 = 0 and x4 = 1:
$$x_1 + x_2 + 5x_4 = 0,\;\; x_3 - 3x_4 = 0 \implies x_1 + 5 = 0,\;\; x_3 - 3 = 0 \implies x_{\perp 2} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} -5 \\ 0 \\ 3 \\ 1 \end{bmatrix}$$
Putting it all together, we can write down the complete solution as:
$$x = \begin{bmatrix} 3 \\ 0 \\ 3 \\ 0 \end{bmatrix} + t_1 \begin{bmatrix} -1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + t_2 \begin{bmatrix} -5 \\ 0 \\ 3 \\ 1 \end{bmatrix} \qquad (9.4)$$
In general, for a matrix A ∈ Rm×n of rank r, Gauss-Jordan elimination takes it to the following form, RREF:
$$A \xrightarrow{\text{RREF}} R = \begin{bmatrix} I_r \;\cdot\; F_{r\times(n-r)} \\ 0_{(m-r)\times n} \end{bmatrix}$$
(In the augmented setting, [A | b] becomes [R | b′].)
Note that we use the symbol · to indicate that the columns of Ir
and Fr×(n−r) may be shuffled in; we may not have the columns of F
neatly to the right of I. With this picture in mind, let’s describe the
algorithm for completely solving the system of linear equations:
1. Find the RREF through Gauss-Jordan on the augmented matrix: [A | b] → [R | b′]
2. Ignore the free variables by setting them to zero, which is the same as deleting the pivot-less columns in R and the zero rows, giving us I_r
3. Get the particular solution, x∥, with the r values in b′, and zeros for the n − r free variables
4. For each free variable, with the RREF of the homogeneous augmented matrix [A | 0] (which is the same R as above, but with 0 instead of b′ in the augmenting column):
   • Set its value to one, and the values of all others to zero
   • Solve the resulting equations to get one special solution, x⊥i
   • Iterate over all free variables
5. Write down the complete solution:
$$x = x_{\parallel} + \sum_{i=1}^{n-r} t_i\, x_{\perp i}$$
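To make the steps concrete, here is a minimal Python sketch of the same algorithm, run on the example system above. It uses the sympy library for the RREF step; the library choice and the helper names are illustrative assumptions, not anything prescribed in this book.

from sympy import Matrix

# The system from Eqn (9.4): x1 + x2 + x3 + 2x4 = 6, 2x1 + 2x2 + x3 + 7x4 = 9
A = Matrix([[1, 1, 1, 2],
            [2, 2, 1, 7]])
b = Matrix([6, 9])

# Step 1: RREF of the augmented matrix [A | b] (Gauss-Jordan)
R, pivot_cols = Matrix.hstack(A, b).rref()

n = A.cols
free_cols = [j for j in range(n) if j not in pivot_cols]

# Steps 2 and 3: particular solution -- free variables at zero,
# pivot variables read off from the augmented column b'
x_par = [0] * n
for row, col in enumerate(pivot_cols):
    x_par[col] = R[row, n]

# Step 4: one special solution per free variable (homogeneous system)
specials = []
for free in free_cols:
    x = [0] * n
    x[free] = 1
    for row, col in enumerate(pivot_cols):
        x[col] = -R[row, free]   # pivot variable balances the free variable
    specials.append(x)

# Step 5: complete solution = particular + any combination of the specials
print(x_par)      # [3, 0, 3, 0]
print(specials)   # [[-1, 1, 0, 0], [-5, 0, 3, 1]]

The printed vectors are exactly the particular and special solutions in Eqn (9.4).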
What We Learned
• Note that these four fundamental subspaces, being vector subspaces, contain the
zero vector.
• Computing the four fundamental subspaces is the process of specifying the space
(either Rm or Rn ) containing them, the dimension (the rank or the nullity of the
matrix) and the basis vectors.
• Review Table 9.2 to appreciate and internalize the connection between the four
fundamental subspaces and the shapes of matrices. While at it, also stare at
Figure 9.1 to commit it to memory.
Exercises
A. Review Questions
2. A matrix A ∈ Rm×n has all of Rn as its column space. What can we say about it?
(a) It is a square matrix (b) It is a symmetric matrix
(c) It is an orthonormal matrix (d) It is a singular matrix
7. We have a matrix with m rows and n columns, with m > n (more rows than
columns). Its rank is n. Which of the following statements is true?
(a) The Column Space of the matrix is an n dimensional subspace in Rm
8. True or False: All invertible 4-by-4 matrices have the same Column Space.
(a) True (b) False
12. Given a matrix A of size m × n and of rank r, what is the dimension of its null
space?
(a) r (b) m − r
(c) n − r (d) n − m
13. Given a matrix A of size m × n, and the fact that its row space is all of Rn ,
what can we say about its properties?
(a) Its column space is all of Rm as well
(b) The matrix A has full column rank
(c) The matrix A has full row rank
(d) n > m
14. True or False: Given a matrix A of size n × m, and m ̸= n, the pivot rows of
the reduced row echelon form would be a possible basis for its row space.
(a) True (b) False
(c) Depends on the rank of A (d) Impossible to say
16. We have a matrix A that is full-column-rank and full-row-rank at the same time.
What can we say about A?
(a) It is a square matrix
(b) It is an invertible matrix
(c) It is an orthogonal matrix
(d) It is a symmetric matrix
(e) Its column space is the same as its row space
(f ) It is full rank matrix
(g) Its inverse is the same as its transpose
(h) Its null space contains only the zero vector
17. We have a real matrix A with 3 rows and 4 columns, and it has full row-rank.
Select all the true statements in the list below.
(a) Its rank is 3
(b) Ax = b always has solutions, regardless of what b is
(c) Its rank is 4
(d) All its columns are linearly independent
(e) Its column space is all of R3
(f ) AT Ax = b has unique solutions
(g) AAT is invertible
(h) AAT is symmetric
19. We have a matrix A ∈ R3×4 with rank(A) = 2. The following statements are
about the four fundamental spaces defined by A. Select the ones that are true.
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 9: The Four Fundamental
Spaces
• 3Blue1Brown: “Essence of Linear Algebra.”
– Video 7: Inverse matrices, column space and null space | Chapter 7,
Essence of linear algebra
• MIT: Linear Algebra
– Video 15: 6. Column Space and Nullspace
– Video 17: 7. Solving Ax = 0: Pivot Variables, Special Solutions
– Video 23: 10. The Four Fundamental Subspaces
– Video 24: Computing the Four Fundamental Subspaces
here, but the much superior Linear Algebra. The right approach to
be taken, as we shall see here, is much more elegant.
6 "
5 6 "
) Several vectors
4 5 give same
)
'
c
projection 3.
3 4 Projection is not
invertible.
2 3 c = d3, d is a
In 3
&
3
(' Singular matrix
1 2
! 5
24 23 22 21 0 1 2 3 4 5 1 c3
ce &
21
spa (+) 24 23 22 21 0 1 2 3 4 5
Sub 22
'='
21
( ='2)
'
23 = +& = &+ 22
24 23
25 24
Fig. 10.1 Projection of one vector (b in blue) on to another (a in red). The projection
b̂ is shown in light blue, and the error vector in green. The right panel shows that many
different vectors (blue ones) can all have the same projection. The projection operation is
many-to-one, and cannot be inverted.
Since the error vector e = b − b̂ is orthogonal to a, we can set aᵀe = 0 and use that condition for computing x̂:
$$\begin{aligned} a^T e &= a^T(b - \hat{b}) = a^T(b - a\hat{x}) = 0 \\ a^T a\,\hat{x} &= a^T b \implies \hat{x} = \frac{a^T b}{a^T a} \\ \hat{b} &= a\hat{x} = a\,\frac{a^T b}{a^T a} = a\,(a^T a)^{-1} a^T b \\ P &= a\,(a^T a)^{-1} a^T \end{aligned} \qquad (10.1)$$
Here, in addition to computing the projection value x̂ and the projection vector b̂, we have also defined a projection matrix P, which we can multiply with any vector to get its projection on to a. It is indeed a matrix because aaᵀ is a column matrix (n × 1) multiplying a row matrix (1 × n), giving us an n × n matrix. The factor (aᵀa)⁻¹ in between is just a scalar, which does not change the shape.
Comparing the derivation of the projection matrix above in Eqn (10.1)
to the one we did earlier in Eqn (7.9), we can appreciate that they
are identical. It is just that the Linear Algebra way is much more
elegant. In this derivation, we took certain facts to be self-evident,
such as when two vectors are orthogonal, their dot product is zero.
We can indeed prove it, as shown in the box on “Orthogonality and
Dot Product.”
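As a quick numerical check of Eqn (10.1), here is a small numpy sketch; the specific vectors a and b are made up purely for illustration.

import numpy as np

a = np.array([2.0, 1.0])      # the vector we project on to
b = np.array([1.0, 3.0])      # the vector being projected

x_hat = (a @ b) / (a @ a)     # x̂ = aᵀb / aᵀa
b_hat = x_hat * a             # b̂ = a x̂
P = np.outer(a, a) / (a @ a)  # P = a (aᵀa)⁻¹ aᵀ

print(b_hat)              # [2. 1.]
print(P @ b)              # the same projection, via the matrix
print(a @ (b - b_hat))    # ≈ 0: the error is orthogonal to a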
"
4
3 3 5)
f 2
c)
3
e 1
c
3 !
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
5! c!
3 21
22
23
-
24
25
26
Fig. 10.2 Projecting the blue b on to the subspace that is the span of a1 and a2 , the two
red vectors. The subspace is a plane, shown in light red. In order to find the projection, b̂
(in bright blue), we need the projections of b to the basis vectors of the subspace.
Using the formula for x̂ from Eqn (10.2), the projection matrix is:
$$P = A\,(A^T A)^{-1} A^T \qquad (10.3)$$
$$P^2 = A\,(A^T A)^{-1} A^T\, A\,(A^T A)^{-1} A^T = A\,(A^T A)^{-1} (A^T A)\,(A^T A)^{-1} A^T = A\,I\,(A^T A)^{-1} A^T = A\,(A^T A)^{-1} A^T = P$$
$$a^T b = a^T \hat{b} = a^T P\, b \qquad\qquad a^T b = (Pa)^T b = a^T P^T b$$
With the two expressions for aᵀb, we can equate them and write,
$$a^T P\, b = a^T P^T b \implies P = P^T$$
where we used (1) the product rule of transposes, (2) the fact that
the inverse of a transpose is the transpose of the inverse and (3) the
symmetry of AT A.
We should not expand (AᵀA)⁻¹ as A⁻¹(Aᵀ)⁻¹: the product rule of inverses does not apply here because A⁻¹ is not defined, which was the whole point of embarking on this projection trip to begin with.
What happens if we take the full space and try to project on to
it? In other words, we try to project a vector on to a “subspace”
of dimension n in Rn . In this case, we get a full-rank projection
matrix, and the expansion of the inverse above is indeed valid, and
the projection matrix really is I, which is an invertible matrix because
every vector gets “projected” on to itself. Note that the two properties
we were looking for in P are satisfied by I: It is idempotent because
I 2 = I and of course I T = I.
For rank-deficient matrices, P is AA−1 Left , the left inverse multi-
plying on the right, almost like an attempt to get as close to I as
possible.
x y
1 1
2 2
3 2
4 5
Fig. 10.3 An example of simple linear regression. The data points in the table on the left
are plotted in the chart on the right, and a “trendline” is estimated and drawn.
y = mx + c
$$\begin{aligned} 1 &= m + c \\ 2 &= 2m + c \\ 2 &= 3m + c \\ 5 &= 4m + c \end{aligned} \qquad Ax = b \quad\text{with}\quad A = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \end{bmatrix},\quad x = \begin{bmatrix} m \\ c \end{bmatrix},\quad b = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 5 \end{bmatrix}$$
Since we are modeling our data as y = mx + c and we have four (x, y) pairs, we get four equations as shown above, which we massage into the Ax = b form. Notice how A has two columns, the first for m and another one for c, full of ones. If we had written our model as y = c + mx, the column for the intercept c would have been the first one.
As we can see, we have four equations, and if we were to do Gauss-Jordan elimination on [A | b], the third row would indicate an inconsistent equation (0 = 1), and the system is not solvable, which is fine by us at this point in our Linear Algebra journey. We will get the best possible solution x̂, which will give us our best estimates for the slope and the intercept as m̂ and ĉ. The steps are shown below.
$$A^T = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 1 & 1 & 1 & 1 \end{bmatrix} \qquad A^T A = \begin{bmatrix} 30 & 10 \\ 10 & 4 \end{bmatrix} \qquad |A^T A| = 20$$
$$(A^T A)^{-1} = \begin{bmatrix} \tfrac{1}{5} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{3}{2} \end{bmatrix} \qquad (A^T A)^{-1} A^T = \begin{bmatrix} -\tfrac{3}{10} & -\tfrac{1}{10} & \tfrac{1}{10} & \tfrac{3}{10} \\ 1 & \tfrac{1}{2} & 0 & -\tfrac{1}{2} \end{bmatrix}$$
$$\hat{x} = (A^T A)^{-1} A^T b = \begin{bmatrix} \tfrac{6}{5} \\ -\tfrac{1}{2} \end{bmatrix} \implies \hat{m} = \frac{6}{5} \;\text{ and }\; \hat{c} = -\frac{1}{2}$$
As we can see, our linear regression model becomes y = m̂x + ĉ =
1.2x − 0.5, same as the trendline that the spreadsheet application
computed in Figure 10.3.
We also have the error vector e = b − b̂. b is what we project
on to the column space, C(A) and b̂ = P b is the projection. Let’s
go ahead and compute P and b̂ as well. Using the formula for the
projection matrix from Eqn (10.3), we get:
$$P = A\,(A^T A)^{-1} A^T = \begin{bmatrix} \tfrac{7}{10} & \tfrac{2}{5} & \tfrac{1}{10} & -\tfrac{1}{5} \\ \tfrac{2}{5} & \tfrac{3}{10} & \tfrac{1}{5} & \tfrac{1}{10} \\ \tfrac{1}{10} & \tfrac{1}{5} & \tfrac{3}{10} & \tfrac{2}{5} \\ -\tfrac{1}{5} & \tfrac{1}{10} & \tfrac{2}{5} & \tfrac{7}{10} \end{bmatrix}$$
$$\hat{b} = Pb = \begin{bmatrix} \tfrac{7}{10} \\ \tfrac{19}{10} \\ \tfrac{31}{10} \\ \tfrac{43}{10} \end{bmatrix} \qquad e = b - \hat{b} = \begin{bmatrix} \tfrac{3}{10} \\ \tfrac{1}{10} \\ -\tfrac{11}{10} \\ \tfrac{7}{10} \end{bmatrix} \qquad \|e\|^2 = \frac{9}{5}$$
The squared norm of the error vector, ‖e‖² = 9/5 = 1.8, is the part of the variation in the data that is not explained by the model. The total variation is computed independently (using its formula from statistics) as the total sum of squares, 9, which corresponds to a variance of σ² = 2.25 over the four points. The coefficient of determination R² is the fraction of the variance in the data that is explained by our model: R² = 1 − 1.8/9 = 0.8, just as the trendline from the spreadsheet application reports it in Figure 10.3.
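The whole worked example can be checked in a few lines of numpy. This is only a verification sketch of the numbers above, not code from the book.

import numpy as np

A = np.array([[1, 1], [2, 1], [3, 1], [4, 1]], dtype=float)
b = np.array([1, 2, 2, 5], dtype=float)

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
m_hat, c_hat = x_hat                        # 1.2 and -0.5

P = A @ np.linalg.inv(A.T @ A) @ A.T        # projection matrix, Eqn (10.3)
b_hat = P @ b                               # [0.7, 1.9, 3.1, 4.3]
e = b - b_hat

r_squared = 1 - (e @ e) / np.sum((b - b.mean()) ** 2)
print(m_hat, c_hat, r_squared)              # ≈ 1.2, -0.5, 0.8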
Fig. 10.4 Example of a data matrix and its visualization in multiple linear regression.
Fig. 10.5 The standard notations used in multiple linear regression. Weight is the depen-
dent (or target or output) variable. Height and Hair Len. are the independent (or predictor
or input) variables.
Following the same matrix equations, now with the new notations
as in Figure 10.5, we get the best estimate for the parameter vector
β̂, so that our model (coming from the 26 data points shown in
Figures 10.4 and 10.5) becomes:
Weight = β̂0 + β̂1 Height + β̂2 Hair Len.
= −74.06 + 0.814 Height − 0.151 Hair Len.
Although it is not easy to visualize the model and the points, even in
the simple intuitive three-dimensional space, we have attempted to
show this model in Figure 10.5. What is perhaps more important is
to understand the model in terms of its parameters: We can see that
Fig. 10.6 Attempt to visualize a three dimensional MLR model for Weight. All three
panels show the model (which is a plane) and the associated data points, but from different
perspectives. The middle one shows the dependency of Weight on Height, and the last one
shows that on Hair Len.
What We Learned
• Projection:
– Already looked at projection of one vector on to another, using the cosine
formula for dot product.
– Projecting on to a vector is the same as projecting to the one-dimensional
subspace defined by it.
– To project to higher dimensional subspaces, we project to any of their bases.
– The projection matrix is identical, and is derived using orthogonality. P =
−1 T
A AT A A projects to the subspace defined by the columns of A
(which is its column space, C(A)).
• In the system of linear equations Ax = b defined by a tall, full-column-rank matrix A ∈ Rm×n, with m ≥ n and rank(A) = r = n ≤ m, as is common in data science, the system has no solutions because, in general, b ∉ C(A). The best possible solution is when b is projected to C(A) as b̂.
$$X^T X \beta = X^T y \implies \hat{\beta} = (X^T X)^{-1} X^T y$$
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
3. We have a real data set with four variables (height, weight, age and hair length)
and 127 observations (students). Which of the following statements is most
likely true of this data matrix?
(a) It is a full-column-rank matrix
(b) Its column space is all of R5
(c) It has no row space
(d) Written as Ax = b, it can never be solved
6. We have a full-column-rank matrix A, and we are told that the system of linear
equations Ax = b has no solutions. What is the best possible solution?
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 10: Projection, Least Squares
and Linear Regression
• MIT: Linear Algebra
– Video 33: 15. Projections onto Subspaces
– Video 34: Projection into Subspaces
– Video 35: 16. Projection Matrices and Least Squares
– Video 36: Least Squares Approximation
Advanced Topics
11
Eigenvalue Decomposition
and Diagonalization
As we can see above, we have two eigenvectors for this matrix, with
eigenvalues 1 and −1.
P x∥ = x∥ = 1 × x∥
P x⊥ = 0 = 0 × x⊥
1.25 "
SHEAR MATRIX 0 0.5
#" = ##" =
1 1
1
The unit vectors transform as:
1 1
!! = § !"! = 0.75
0 0
0 0.5
!# = § !"# = 0.5
1 1
ó The Shear Matrix
0.25
1
1 0.5 ##! =
)= 0
2 21.75 21.5 21.25
021 120.75 20.5 20.25 0 0.25 0.5 0.75 1
11.25 1.5 1.
#! =
20.25 0
20.5
Fig. 11.1 An example shear matrix, showing a square being transformed into a parallelo-
gram.
1
Although we state it like this, we should note that squares and parallelograms do not exist
in a vector space. They live in coordinate spaces, and this statement is an example of the
Notational Abuse, about which we complained in a box earlier. What we mean is that the
two vectors forming the sides of a unit square get transformed such that they form sides of
a parallelogram. We should perhaps eschew our adherence to this pedantic exactitude, now
that we are in the advanced section.
For any non-trivial θ (which means θ ̸= 2kπ for integer k), we can see
that Qθ changes every single vector in R2 . We have no eigenvectors
for this matrix in R2 .
11.2.5 Differentiation
We can think of the set of all functions (of one variable, for instance)
as a vector space. It satisfies all the requisite properties. The calculus
operation of differentiation is a linear transformation in this space; it
satisfies both the homogeneity and additivity properties of linearity.
$$\frac{d}{dx}\, e^{ax} = a\,e^{ax} \implies e^{ax} \text{ is an eigenvector with eigenvalue } a$$
$$\frac{d^2}{dx^2}\, \sin x = -\sin x \implies \sin x \text{ is an eigenvector with eigenvalue } -1$$
Permutation Matrix
Projection Matrix
Shear Matrix
Rotation Matrix
11.4 Properties
The eigenvalues and eigenvectors provide deep insights into the struc-
ture of the matrix, and have properties related to the properties of the
matrix itself. Here are some of them with proofs, where possible.
It is worth our time to verify these properties on the examples we
worked out above.
11.4.1 Eigenvalues
Since the LHS and RHS coefficients have to match, we see that
$$\sum_{i=1}^{n} \lambda_i = \sum_{i=1}^{n} a_{ii} = \operatorname{trace}(A)$$
$$(A + \alpha I)s = As + \alpha s = \lambda s + \alpha s = (\lambda + \alpha)s$$
11.4.2 Eigenvectors
2
We use R in the mathematical statement and the proof of this property for convenience and
because of its relevance to computer science, but the property applies to C as well.
(6) Reordering: λⱼ sᵢᵀsⱼ = λᵢ sᵢᵀsⱼ
(8) Since λⱼ ≠ λᵢ: sᵢᵀsⱼ = 0
(9) sᵢᵀsⱼ = 0 ⟹ sᵢ ⊥ sⱼ
(A + αI)s = As + αs = λs + αs = (λ + α)s
One fair question we may have at this point is why we are doing all this. Is it all just an academic exercise in intellectual acrobatics? We may not be able to answer this question completely yet, but we can look
at a linear transformation and see what the eigenvalue analysis tells
us about it. In the last chapter, we will see how these insights are
harnessed in statistical analyses.
Let’s start with an example A ∈ R2×2 , find its eigenvalues and
eigenvectors, and look at them in the coordinate space R2 .
$$A = \frac{1}{4}\begin{bmatrix} 5 & -\sqrt{3} \\ -\sqrt{3} & 3 \end{bmatrix} \qquad \lambda_1 = \frac{3}{2},\;\; s_1 = \frac{1}{2}\begin{bmatrix} \sqrt{3} \\ -1 \end{bmatrix} \qquad \lambda_2 = \frac{1}{2},\;\; s_2 = \frac{1}{2}\begin{bmatrix} 1 \\ \sqrt{3} \end{bmatrix}$$
As we can see from Figure 11.2, A takes the first basis vector
(q1 , shown in red, dashed arrow) to its first column vector (a1 shown
bright red arrow): q1 7→ a1 . Similarly for the second one as well,
q2 7→ a2 , shown in various shades of blue. What happens to the
basis vectors happens to all vectors, and therefore, the unit circle in
the figure gets mapped to the ellipse, as shown.
Although we know about this unit-circle-to-ellipse business, from
the matrix A itself, we know very little else. Note that the vectors
to which the unit vectors transform (in qi 7→ ai ) are nothing special;
1.25 "
-
+$ =
,
3 $"
1 = 1
1 2 3 2 1
= 4 3
*#
0.75
0.5
(!
= 3
2
0.25
,
+# =
- !
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
= 1
2
20.25
("
EIGENVALUES & VECTORS
20.5 ,!
1 5 $! = 1
2 3 = 1 4 2 5
3= 2 3
4 2 3 3 21 3
20.75
/ .
*. = *0 =
0 0
21
. 3 . 1
+! = +# =
0 21 0 3
21.25
Fig. 11.2 Visualization of eigenvalues and eigenvectors: A transforms the unit circle into a rotated ellipse. The eigenvalues specify the lengths of its major and minor axes, and the eigenvectors specify the orientation of the axes.
they are on the ellipse somewhere. What we would like to know are
the details of the ellipse, like its size and orientation, which is exactly
what the eigenvalues and eigenvectors tell us. The eigenvalues λi
are the lengths of the major and minor axes of the ellipse and the
eigenvectors are the unit vectors along the axes. In Figure 11.2, the
eigenvectors (s1 and s2 ) are shown in darker shades of red and blue,
while the corresponding eigenvalues are marked as the lengths of the
axes.
When we move on to higher dimensions, ellipses become ellip-
soids or hyper-ellipsoids, and the axes are their principal axes. The
mathematics of eigen-analysis still stays the same: We get the direc-
tions and lengths of the principal axes. And, if the matrix on which
we are performing the eigen-analysis happens to contain the covari-
ance of the variables in a dataset, then what the eigen-analysis gives
us are insights about the directions along which we can decompose
the covariance. If we sort the directions by the eigenvalues, we can
extract the direction for the highest variance, second highest variance
and so on. We will revisit this idea in more detail in one of our last
topics, the Principal Component Analysis, which is the mainstay of
dimensionality reduction in data science.
11.6 Diagonalization
11.6.1 S and Λ
Suppose A ∈ Rn×n has its eigenvalues λi and the corresponding
eigenvectors si . Let’s construct two matrices, arranging the eigen-
vectors as columns, and the eigenvalues as diagonal elements:
$$S = \begin{bmatrix} | & | & \cdots & | \\ s_1 & s_2 & \cdots & s_n \\ | & | & \cdots & | \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}$$
For simplicity, we may write S = [s] and Λ = [λ]. With these new matrices, we arrive at the most important result from this chapter:
$$AS = S\Lambda \qquad\qquad \text{If } S^{-1} \text{ exists, } AS = S\Lambda \implies A = S\Lambda S^{-1}$$
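As a numerical sanity check, we can diagonalize the 2×2 example from the previous section with numpy; this is only an illustrative sketch using numpy.linalg.eig, not anything the book prescribes.

import numpy as np

A = 0.25 * np.array([[5.0, -np.sqrt(3)],
                     [-np.sqrt(3), 3.0]])

lam, S = np.linalg.eig(A)        # eigenvalues, and eigenvectors as columns of S
Lam = np.diag(lam)

print(np.allclose(A @ S, S @ Lam))                   # AS = SΛ
print(np.allclose(A, S @ Lam @ np.linalg.inv(S)))    # A = SΛS⁻¹
print(np.allclose(np.linalg.matrix_power(A, 5),
                  S @ np.linalg.matrix_power(Lam, 5) @ np.linalg.inv(S)))  # Aᵏ = SΛᵏS⁻¹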
11.6.3 Powers of A
A = SΛS −1
In fact, Λ is already a diagonal matrix: It has non-zero elements only along its diagonal.
We know the condition for A to be invertible, or for A−1 to exist. Let’s state it
several different ways, as a means to remind ourselves. A is invertible if:
• |A| ≠ 0. Otherwise, as Eqn (5.1) clearly shows, we cannot compute A⁻¹ because |A| appears in the denominator.
• N (A) = 0, its null space contains only the zero vector. Otherwise, for some x,
we have Ax = 0, and there is no way we can invert it to go from 0 to x.
• λi ̸= 0, all its eigenvalues are non-zero. Otherwise, |A|, being the product of
eigenvalues, would be zero.
• λi ̸= 0, all its eigenvalues are non-zero. Another reason, otherwise, for the zero
λ, we have Ax = 0, which implies the existence of a non-trivial null space.
The diagonalizability of A is tested using the invertibility of its eigenvector matrix S.
Although this point is probably not critical for our view of Linear Algebra as it applies to
computer science, we might as well state it here. For a matrix to be non-diagonalizable,
the algebraic multiplicity of one of its eigenvalues (the number of times it is repeated)
has to be greater than its geometric multiplicity (the number of associated eigenvectors),
which means the characteristic polynomial needs to have repeated roots to begin with.
The roots are repeated if the discriminant of the polynomial (similar to b2 − 4ac in the
quadratic case) is zero. The discriminant being a continuous function of the coefficients
of the polynomial, which are the elements of the matrix, it being zero happens with
a frequency of the order of zero. But the roots being complex happens half the time
because the discriminant is less than zero half the time.
11.6.4 Inverse of A
Since we have a product for A, we can take its inverse using the
product rule of inverses.
$$A^{-1} = \left(S\Lambda S^{-1}\right)^{-1} = \left(S^{-1}\right)^{-1} \Lambda^{-1} S^{-1} = S\Lambda^{-1} S^{-1}$$
$$A s_i = \lambda_i s_i \implies s_i = \lambda_i A^{-1} s_i \implies A^{-1} s_i = \frac{1}{\lambda_i}\, s_i$$
Enough said.
Since A−1 = SΛ−1 S −1 , we can see that Ak = SΛk S −1 holds
for k < 0 as well. Extrapolating even further, through the Taylor
series expansion, we can compute entities like eA (matrix exponenti-
ation), which are essential in solving differential equations—a topic
we consider beyond the scope of this book.
xk = Axk−1 = A2 xk−2 = · · · = Ak x0
11.6.6 Eigenbasis
$$x_0 = s_1 c_1 + s_2 c_2 + \cdots + s_n c_n = \sum_{i=1}^{n} s_i c_i = Sc$$
$$A^k x_0 = \sum_{i=1}^{n} \lambda_i^k\, s_i c_i = S\Lambda^k c$$
Why does this matter? Why use the eigenbasis? Let’s think of A
as the transformation encoding the time evolution of a system with
xk its state at a given step (or iteration, or a point in time). Given
the state of the system at one step, we evolve it to the next step by
multiplying with A to get xk+1 = Axk .
Knowing the initial state x0 and, more importantly, the transition
matrix A, what can we say about the stability of the system? We can
say the following:
$$\lim_{k\to\infty} x_k = \lim_{k\to\infty} A^k x_0 = \sum_{i=1,\,|\lambda_i|>1}^{n} \lambda_i^k\, s_i c_i$$
Table 11.1 Fibonacci numbers (f_k) vs. their approximation (f_k^(approx))
What We Learned
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
(a) A vector that does not change when multiplied by the matrix
(b) A vector that is definitely not in the column space of the matrix
(c) A vector that gets scaled by the multiplication of the matrix
(d) A vector that transforms orthogonal to itself
2. True or False: For any two square, invertible matrices A and B, AB and
BA have the same eigenvalues
3. If we take the ratio between the 101st Fibonacci number to the 100th one, what
do we get?
(a) The Golden Ratio = 0.618 (b) The Golden Ratio = 1.618
(c) Golden Ratio to the power 100 (d) Golden Ratio to the power 101
5. If we have a real, symmetric matrix, what can we say about its eigenvalues?
(a) The eigenvalues are all real (b) The eigenvalues are all positive
(c) The eigenvalues are all distinct (d) All of the above
(a) All eigenvalues are real (b) All eigenvalues are positive
(c) All eigenvalues are non-zero (d) The eigenvalues are all distinct
7. If we have the largest eigenvalue |λ| > 1, the matrix represents an exponentially
growing difference equation. What does it represent the largest eigenvalue
|λ| = 1?
11. True or False: For any matrix of any rank A, either AT A or AAT is full
rank.
13. The following statements are about a matrix A ∈ R4×4 with rank(A) = 2.
Select the ones that are true.
(a) If the column space of A is the same as its row space, then A is symmetric
(b) A has two zero eigenvalues
(c) A has two real, distinct eigenvalues
(d) AT A is full rank
(e) If we run the Gram-Schmidt process on A, we will get exactly two
orthonormal vectors
(f ) A good basis for the null space of A would be the columns of the 2 × 2
identity matrix
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 11: Eigenvalue Decomposition
and Diagonalization
• 3Blue1Brown: “Essence of Linear Algebra.”
Now that we are listing theorems, we have another one that goes by
the physics-inspired name, the Law of Inertia (attributed to James
Joseph Sylvester, not the brawny movie star). We stated it earlier:
For real, symmetric matrices, the number of positive eigenvalues is
the same as the number of positive pivots. Similarly, the negative
Note that the REF in this law is the result of Gaussian elimination,
done with no scaling of the rows. It is not the RREF from Gauss-
Jordan elimination, which makes all pivots one by scaling rows.
1
Some people write x∗ to mean both conjugation on top of transposition. For this reason,
a less confusing notation for conjugate by itself may be an overline a + ib = a − ib, but it
may lead to another contextual confusion: Are we underlining the line above or overlining
the variable below? Good or bad, we are going to stick with ∗ for complex conjugate and
for conjugate transpose.
Step (7) says λ is real. Since all eigenvalues are real, the Λ
matrix (with eigenvalues in the diagonal) is Hermitian as well.
In fact, the proof we gave in §3 (page 237) holds for Hermitian
matrices as well, with minor changes. We, however, provided
a brand-new proof, now that we are in the happy position of
being able to do the same thing in multiple ways.
A ∈ Cⁿˣⁿ, Aᴴ = A, A sᵢ = λᵢ sᵢ, i ≠ j ⟹ sᵢ ⊥ sⱼ
A square matrix is called a Markov matrix if all its entries are non-
negative and the sum of each column vector is equal to one. It is
also known as a left stochastic matrix2 . “Stochastic” by the way is a
fancy word meaning probabilistic. Markov matrices are also known
as probability/transition/stochastic matrices.
Markov Matrix
Definition: A = [a_ij] ∈ Rn×n is a Markov matrix if
$$0 \le a_{ij} \le 1 \quad\text{and}\quad \sum_{i=1}^{n} a_{ij} = 1$$
All the following properties of Markov matrices follow from the fact
that the columns add up to one. In other words, if we were to add up
all the rows of a Markov matrix, we would get a row of ones because
each column adds up to one.
1. Markov matrices have one eigenvalue equal to one.
2. The product of two Markov Matrices is another Markov matrix.
3. All eigenvalues of a Markov matrix are less than or equal to one, in absolute value: |λᵢ| ≤ 1.
Let’s try proving these properties one by one. Since its columns add
up to one, a Markov matrix always has one eigenvalue equal to one.
2
The right stochastic matrix, on the other hand, would be one in which the rows add up to
one. Since our vectors are all column vectors, it is the left stochastic matrix that we will
focus on. But we should keep in mind that the rows of a matrix are, at times, considered
“row vectors.” For such a row vector, a matrix would multiply it on the right and we can
think of the so-called right-eigenvectors.
Proof:
(1) Since A is a Markov matrix: $\sum_{i=1}^{n} a_{ij} = 1$
(2) Therefore: $a_{jj} = 1 - \sum_{i=1,\,i\ne j}^{n} a_{ij}$
(3) The diagonal element in A − I: $(A - I)_{jj} = a_{jj} - 1 = -\sum_{i=1,\,i\ne j}^{n} a_{ij}$
In order to make this cryptic proof of the third property a bit more
accessible, let’s work out an example. Let’s say we are studying
the human migration patterns across the globe, and know the yearly
migration probabilities as in Table 12.1 through some unspecified
demographic studies. We know nothing else, except perhaps that the
birth and death rate are close enough to each other for us to assume
that they add up to zero everywhere. One reasonable question to
ask would be about the steady state: If we wait long enough, do the
populations stabilize?
Note that the numbers in each column add up to 100% because
people either stay or leave. The numbers in each row, on the other
hand, do not. Asia-Pacific and Africa lose people to the Americas
and Europe.
Once we have probabilities like Table 12.1, the first thing to do
would be to put the values in matrices, now that we know enough
Linear Algebra.
$$A = \begin{bmatrix} 0.80 & 0.04 & 0.05 & 0.05 \\ 0.10 & 0.90 & 0.07 & 0.08 \\ 0.03 & 0.01 & 0.75 & 0.02 \\ 0.07 & 0.05 & 0.13 & 0.85 \end{bmatrix} \qquad x_0 = \begin{bmatrix} 4.68 \\ 1.20 \\ 1.34 \\ 0.75 \end{bmatrix} \qquad x_{k+1} = A x_k$$
where we put the initial populations in a vector x0 . As we can see,
A is a Markov matrix. It describes how the population evolves over
time. The populations for year k evolve to that of year k + 1 as Axk ,
which is identical to what we did in the case of Fibonacci numbers
in §11.7 (page 246). As we learned there, the long-term evolution
of the populations is fully described by the eigenvalues λi of A. If
|λi| > 1, we will have a growing system; if |λi| < 1, we will have a system tending to zero. If we have an eigenvalue |λi| = 1, we will have a steady state. And we know that A does have an eigenvalue equal to one.
Knowing that x = xk will stabilize and reach an equilibrium value, we can implement an iterative method to compute it: first, initialize it, x ← x0; then iterate until convergence: x ← Ax. Doing all this is the same as expanding x0 in the eigenbasis of A, x0 = Σᵢ cᵢsᵢ, and then saying that (with λ1 = 1 and all other |λi| < 1):
$$x_k = A^k \sum_{i=1}^{n} c_i s_i = \sum_{i=1}^{n} c_i A^k s_i = \sum_{i=1}^{n} c_i \lambda_i^k s_i \;\longrightarrow\; c_1 \lambda_1^k s_1 = c_1 s_1 \quad\text{as } k \to \infty$$
2. All the pivots > 0? The second test is essentially the same as
the first, by Sylvester’s law connecting the signs of pivots and
eigenvalues.
(3) Rearranging: $\lambda = \dfrac{s^T A s}{s^T s}$
(4) Since xᵀAx > 0 for any x, and sᵀs > 0: λ > 0
Since B is positive definite, its eigenvalues are positive, and Λ^{1/2} is invertible. So is Q, because Q⁻¹ = Qᵀ. So A is invertible.
Let’s see how we can apply these five tests to various matrices to
determine whether they are positive definite.
Let’s take an example and see how a > 0 and |A| > 0 implies that
T
x Ax > 0 for all x.
2 6
A= and |A| = 2c − 36 > 0 if c > 18.
6 c
Let’s set c = 20 =⇒
We can see the pivots and row multipliers in the expression for
(square-completed) xT Ax in the last line above, can’t we? Fur-
thermore, xT Ax > 0 if:
b2 2 b2
cx22 > x or when c > =⇒ ac − b2 = |A| > 0
a 2 a
³ The author is not quite sure if AAᵀ is also considered a Gram matrix, although it should be, by symmetry.
AT Ax = 0 =⇒ xT AT Ax = 0, (Ax)T (Ax) = 0 =⇒ Ax = 0
It is also easy enough to prove that A and AT A have the same row
space, looking at the product AT A as the linear combinations of the
rows of A and as the linear combinations of the columns of AT . At
this stage in our Linear Algebra journey, we are in the happy position
of being able to prove a lemma in multiple ways, and we can afford
to pick and choose.
Summarizing, A and AT A have the same row and null spaces.
AT and AAT have the same column space and left-null space. All
four of them have the same rank.
For a full-column-rank matrix (A ∈ Rm×n , rank(A) = n), the
Gram matrix AT A is a full-rank, square matrix with the same rank. It
is also much smaller. We can, therefore test the linear independence
of the n column vectors in A by looking at the invertibility (or,
equivalently, the determinant) of the Gram matrix. In data science,
as we shall see in the last chapter of this book, the Gram matrix is the
covariance matrix of a zero-centered data set.
Step (6) above shows that both A and the similar matrix B have
the same characteristic polynomial. Note that, as a consequence
of the characteristic polynomials being the same, the algebraic
multiplicities of the eigenvalues (how many times each one is
repeated) are also the same for A and B.
4. Combining the previous property with the first one, we can see
that AB and BA have the same eigenvalues.
The properties listed above are the ones that are useful for us in Linear
Algebra. However, more basic than them, similarity as a relation has
some fundamental properties that make it an equivalence relation.
Here is a formal statement of what it means.
For A, B, C ∈ Rn×n , we can easily show that similarity as a
relation is:
Reflexive: A ∼ A
Proof : A = IAI −1
Symmetric: A ∼ B =⇒ B ∼ A
Proof : A = M BM −1 =⇒ B = M −1 AM = M ′ AM ′−1
Transitive: A ∼ B and B ∼ C ⟹ A ∼ C
Proof: A = M B M⁻¹, B = N C N⁻¹
⟹ A = M N C N⁻¹ M⁻¹ = (MN) C (MN)⁻¹
⟹ A = M′ C M′⁻¹
12.7.2 Diagonalizability
Similarity
Definition: A matrix is similar to another matrix if they diagonalize
to the same diagonal matrix.
As we can see, the similarity relation puts diagonalizable matrices
in families. In Rn×n , we have an infinity of such mutually exclusive
families. All matrices with the same set of eigenvalues belong to the
same family.
When we said “the same diagonal matrix” in the definition of
similarity above, we were being slightly imprecise: We should have
specified that shuffling the eigenvalues is okay. We are really looking
for the same set of eigenvalues, regardless of the order.
However, this definition leaves something unspecified: What hap-
pens if a matrix is not diagonalizable? Does it belong to no family?
Is it similar to none? Is it an orphan? It is in this context that the
Jordan Normal Forms come in to help. Since it is an important topic,
we will promote it to a section of its own.
Jordan Block
Definition:A Jordan block of size k and value λ, Jk (λ) is a square
matrix with the value λ repeated along its main diagonal and ones
along the superdiagonal with zeros everywhere else. Here are some
examples of Jordan blocks:
7 1 0
λ 1
J2 (λ1 ) = 1
J1 (λ) = λ J3 (7) = 0 7 1 (12.5)
0 λ1
0 0 7
As we can see, superdiagonal means the diagonal above the main
diagonal, so to speak. In the matrix J , each eigenvector has a Jordan
block of its own, and one block has only one eigenvalue. If we have
repeated eigenvalues, but linearly independent eigenvectors, we get
multiple Jordan blocks. For instance, for the identity matrix in Rn×n ,
4
The characteristic equation is |A − λ′ I| = (λ − λ′ )2 = 0, with λ′ as the dummy variable
in the polynomial because we already used λ as the diagonal elements of the shear matrix.
274 Special Matrices, Similarity and Algorithms
2. λ₂ = 0, s₂ ∈ N(A), rank(A) = 1 — algebraic multiplicities 1, 1; geometric multiplicities 1, 1; J = [λ₁ 0; 0 0]; a similar matrix: [λ₁ − t₁, t₂; (λ₁ − t₁)t₁/t₂, t₁]
3. Repeated λ and s, rank(A) = 2 — algebraic multiplicity 2; geometric multiplicity 2; J = [λ 0; 0 λ]; only J = A itself
4. λ₁ = λ₂ = 0, s₁, s₂ ∈ N(A), rank(A) = 0 — algebraic multiplicity 2; geometric multiplicity 2; J = [0 0; 0 0]; only J = A itself
5. Repeated λ, one s, rank(A) = 2 — algebraic multiplicity 2; geometric multiplicity 1; J = [λ 1; 0 λ]; a similar matrix: [2λ − t₁, t₂; −(λ − t₁)²/t₂, t₁]
In the last column, t1 , t2 ∈ R are any numbers that will generate an example of a similar
matrix for the corresponding row. They are constructed such that the sum and product of
the eigenvalues come out right. Note that the Jordan normal forms in all rows except the
fifth one have two Jordan blocks each.
Jordan’s Theorem
Every square matrix A ∈ Rⁿˣⁿ with k ≤ n linearly independent eigenvectors sᵢ, 1 ≤ i ≤ k, and the associated eigenvalues λᵢ, 1 ≤ i ≤ k, which are not necessarily distinct, is similar to a Jordan matrix J made up of Jordan blocks along its diagonal.
To start with something simple before generalizing and compli-
cating life, let’s look at A ∈ R2×2 with eigenvalues λ1 and λ2 and
the corresponding eigenvectors s1 and s2 . Table 12.2 tabulates the
various possibilities. Let’s go over each of the rows, and generalize
it from R²ˣ² to Rⁿˣⁿ.
1. The first row is the good case, where we have distinct eigenval-
ues and linearly independent eigenvectors. The Jordan Normal
Form (JNF) is the same as Λ. Each λi is in a Jordan block
J1 (λi ) of its own.
2. When one of the two eigenvalues is zero, the matrix is singular,
and one of the eigenvectors is the basis of its null space. Qual-
itatively though, this case is not different from the first row. In
Rn×n , we will have JNF = Λ.
3. The eigenvalues are repeated (meaning the algebraic multiplicity is two), but we have two linearly independent eigenvectors. Again, JNF = Λ. However, we have no other similar matrices: M J M⁻¹ = λ M I M⁻¹ = λI = J.
4. This row shows a rank-zero matrix. Although troublesome, it
is also qualitatively similar to the previous row.
5. In the last row, we have the geometric multiplicity smaller than
the algebraic one, and we have a J2 (λ).
In order to show the Jordan form in all its gory detail, we have a general Jordan matrix made up of a large number of Jordan blocks, J₁(λ₁), J₁(λ₁), J₁(λ₁), J₃(λ₂), J₁(λ₃), J₂(λ₄), …, along its diagonal in Eqn (12.6) below (all unmarked entries are zeros):
$$J = \begin{bmatrix}
\lambda_1 & & & & & & & & & \\
 & \lambda_1 & & & & & & & & \\
 & & \lambda_1 & & & & & & & \\
 & & & \lambda_2 & 1 & & & & & \\
 & & & & \lambda_2 & 1 & & & & \\
 & & & & & \lambda_2 & & & & \\
 & & & & & & \lambda_3 & & & \\
 & & & & & & & \lambda_4 & 1 & \\
 & & & & & & & & \lambda_4 & \\
 & & & & & & & & & \ddots
\end{bmatrix} \qquad (12.6)$$
Each Jordan block is highlighted in a colored box with its label Jk (λ).
Here is what this Jordan matrix J tells us about the original matrix
A (of which J is the Jordan Normal Form):
• The eigenvalues of J and A are the same: λi .
• The algebraic multiplicity of any eigenvalue is the total size (the sum of the sizes) of the Jordan blocks associated with it.
• Its geometric multiplicity is the number of Jordan blocks asso-
ciated with it.
Much like the other topics in this chapter, Jordan canonical form
also is a portal, this time to advanced theoretical explorations in Linear
Algebra, perhaps more relevant to mathematicians than computer
scientists.
12.9 Algorithms
They are neatly summarized in Table 8.1 and recapped in the as-
sociated text. We came across one more algorithm earlier in this
chapter (see §12.4.2, page 261), where we computed one eigenvector
corresponding to λ = 1. The method is called the Power Iteration
Algorithm, and can, in fact, find the largest eigenvalue/eigenvector
pair.
12.9.2 QR Algorithm
A general numerical method to compute all eigenvalues and eigen-
vectors of a matrix is the QR algorithm, based on the Gram-Schmidt
process.
Input: A ∈ Rⁿˣⁿ
Output: λᵢ ∈ C
1: repeat
2:   Perform the Gram-Schmidt process to get Q and R
3:   We get: A = QR
4:   Consider: RQ = Q⁻¹QRQ = Q⁻¹AQ
5:   ⟹ RQ and A are similar and have the same eigenvalues
6:   Therefore, set A ← RQ
7: until Convergence
8: return The diagonal elements of R as the eigenvalues
⁵ From Wiki University.
Convergence is obtained when A becomes close enough to a trian-
gular matrix. At that point, the eigenvalues are its diagonal elements.
Once we have the eigenvalues, we can compute the eigenvectors as
the null space of A − λI using elimination algorithms.
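A direct transcription of the QR algorithm into Python might look like the sketch below. It leans on numpy.linalg.qr for the factorization and uses a fixed iteration count instead of a proper convergence test; both are simplifying assumptions for illustration.

import numpy as np

def qr_eigenvalues(A, iterations=200):
    # Unshifted QR iteration: factor A = QR, replace A with the similar matrix RQ.
    A = np.array(A, dtype=float)
    for _ in range(iterations):
        Q, R = np.linalg.qr(A)
        A = R @ Q               # RQ = Q⁻¹AQ has the same eigenvalues as A
    return np.diag(A)           # ≈ eigenvalues once A is (nearly) triangular

A = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 1.0]]
print(np.sort(qr_eigenvalues(A)))
print(np.sort(np.linalg.eigvalsh(np.array(A))))   # compare with numpy's own answer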
A = LLT
Note that we used the fact that A is symmetric in dividing up the top
row of A as [a11 aT21 ]. Similarly, we could write the first row of L as
[l11 0T ] because it is lower triangular. And, we do not have to worry
about the second factor LT at all because it is just the transpose of L.
" T
# " T
#" T
#
a11 a 21 l 11 0 l 11 l 21
A = LLT =⇒ =
a21 A22 l21 L22 0 LT22
The second and third last lines in Eqn (12.8) tell us the elements of
L. Note that we decide to go with the positive square root for l11 .
The very last line tells us that once we got l11 and l21 , the problem
reduces to computing the Cholesky decomposition of a smaller matrix
A′ = A22 − l21 l21 T
.
Before we write it down as a formal algorithm, the only thing left
to do is to ensure that A′ is positive definite. Otherwise, we are
not allowed to assume that we can find A′ = L′ L′T . In the fourth
test for positive definiteness (on 264), we proved that the upper-left
submatrices of a positive definite matrix were also positive definite. If
we look closely at the proof, we can see that we did not need to confine
ourselves to “upper-left”: Any submatrix sharing the main diagonal
is positive definite. Therefore, if A in our Cholesky factorization is
positive definite, so is A22 . We also saw that sums of positive definite
matrices are positive definite. Extending it to non-trivial (meaning, non-zero) differences, we can see that A′ = A₂₂ − l₂₁l₂₁ᵀ is positive definite⁶.
Looking at Eqn (12.8), we can translate it to an algorithm⁷ as below:
Input: A ∈ Rⁿˣⁿ
Output: L ∈ Rⁿˣⁿ
1: Initialize L with zeros
2: repeat
3:   Divide A into block matrices as in Eqn (12.7)
4:   l₁₁ ← √a₁₁
5:   l₂₁ ← a₂₁ / l₁₁
6:   A ← A₂₂ − l₂₁l₂₁ᵀ
7: until A₂₂ becomes a scalar
8: return The matrix L
⁶ To be very precise, l₂₁l₂₁ᵀ is a rank-one matrix, and is only positive semidefinite. But the difference is still positive definite.
⁷ This description and the algorithm listed are based on the discussion in this video.
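The same recursion, written as a small Python function and compared against numpy's built-in routine; a sketch that assumes A is symmetric positive definite and does no error checking.

import numpy as np

def cholesky(A):
    # Return lower-triangular L with A = L Lᵀ, following Eqn (12.8).
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for k in range(n):
        L[k, k] = np.sqrt(A[0, 0])                  # l11 ← √a11
        L[k + 1:, k] = A[1:, 0] / L[k, k]           # l21 ← a21 / l11
        A = A[1:, 1:] - np.outer(L[k + 1:, k], L[k + 1:, k])   # A ← A22 − l21 l21ᵀ
    return L

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
L = cholesky(A)
print(np.allclose(L @ L.T, A))                 # True
print(np.allclose(L, np.linalg.cholesky(A)))   # True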
What We Learned
• Special Matrices:
Real, symmetric matrices: A ∈ Rn×n , AT = A. Real eigenvalues, orthogo-
nal eigenvectors
Hermitian: Generalization of real, symmetric to the complex field. A ∈ Cⁿˣⁿ, Aᴴ = A ⟹ real eigenvalues and orthogonal eigenvectors.
Markov: Columns add up to one, |λᵢ| ≤ 1 with at least one λ = 1. Steady state. Used in Google Page Rank algorithm.
Positive definite: All eigenvalues positive λi > 0 (or non-negative for positive
semi-definite). xT Ax > 0 for any x ̸= 0. Several tests for positive
definiteness.
Gram matrix: AT A, used for testing linear independence, covariance (after
zero-centering it by subtracting the mean of each column) etc. It is positive
definite if A has full column rank.
Jordan form: The closest we can get to a diagonal matrix for a non-diagonalizable
matrix. Made up of Jordan blocks, Jk (λ).
– Jk (λ) has a size of k × k, λ repeated along the diagonal k times, and
ones along the super diagonal.
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
3. We have a matrix A with the property in the choices below. Which one is
guaranteed to be diagonalizable?
4. For a tall, full-rank matrix A ∈ Rm×n , rank(A) = n, what can we say about
its eigenvalues?
(a) Because a Markov matrix has all elements less than one
(b) Because a Markov matrix has at least one eigenvalue equal to one
(c) Because a Markov matrix has the dominant eigenvalue equal to one
(d) Because a Markov matrix is a stochastic matrix
11. True or False: If two matrices are similar, they have the same eigenvectors
12. For two square, invertible matrices A and B of the same size, when do AB
and BA have the same eigenvalues?
(a) Always
(b) Only when they have the same eigenvectors
(c) Only when they are both symmetric
(d) Never
13. For a Jordan block J4 (7), which of the following statements is true?
(a) The algebraic multiplicity of the eigenvalue 7 is 1
(b) The geometric multiplicity of the eigenvalue 7 is 1
(c) The algebraic multiplicity of the eigenvalue 4 is 1
(d) The geometric multiplicity of the eigenvalue 4 is 1
15. For a matrix A, the eigenvalues are arranged in a matrix Λ, and the eigenvectors
as columns in a matrix S. Go through the statements below, and the mark the
ones that are always true.
(a) A and S are similar matrices
(b) AS = SΛ
(c) A and Λ are similar matrices
(d) A = SΛS −1
(e) The column space of AS − ΛS is the null space of the identity matrix
(f ) If A is singular, Λ cannot be inverted
(g) A is a symmetric matrix with real elements
(h) A, S and Λ are all square matrices of the same size
16. For a real, symmetric, full-rank matrix A, which of the following statements
are true?
(a) It has a full set of orthogonal eigenvectors
(b) It has real eigenvalues
(c) It is positive definite
(d) A = SΛS −1
(e) It has all positive pivots
17. Identify the requirements and consequences of a matrix A being positive defi-
nite.
19. Which of the following statements are true of a real, symmetric matrix A ∈
Rn×n ?
20. For a matrix A, the eigenvalues are all real. Furthermore, they are the same as
the diagonal elements of A, namely aii . What can we say about A?
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 12: Special Matrices, Similar-
ity and Algorithms
• MIT: Linear Algebra
– Video 51: 24. Markov Matrices; Fourier Series
– Video 52: Markov Matrices
– Video 55: 25. Symmetric Matrices and Positive Definiteness
1.25 "
0
#" =
1
1
0
####
" = 34 0.75
2
!: SHEAR MATRIX
0.5
1 //
!! = § !"""
! =
0
0 21 0.25
0 0 !
22
! = § !""" = 20.75
21.75 # 21.5 1 21.25 # 21 // 20.5 20.25 0 0.25 0.5 0.75 1 2
0 11.25 1.5 1.75
#! =
// 20.25 0
0 0
)=
//
21 0 20.5
20.75
21 34
####
! = 2
21
21.25
Before learning how SVD does its magic, let’s take a look at an
example. For ease of visualization, we will work with a square matrix
so that we can draw the vectors and their transformations in R2 .
" √3 #
2
0 SVD
A= √ −−→ A = U ΣV T
−1 23
" √ # "3 # " √3 #
1 3 1
2 2 2
0 2 2
U= √ Σ= 1
V = √
− 3 1 0 2 − 21 3
2 2 2
1.25 "
1 0
2 #" =
2 1
1
3
2
0.75
3
2
0.5 1
2
0.25 #
1
!= $ #! =
0 !
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
1 $ 3
#! = § ##! = 20.5
0 % 1
0 $ 21
#" = § ##" = % 20.75
1 3
$ 3 21
6& = %
21
1 3
Anticlockwise rotation of 30° 21.25
Figures 13.2 to 13.5 show the actions of these three matrices, and
a summary. The transformation a vector x by Ax is broken down
into U ΣV T x. The first multiplication by V T is a rotation, shown in
Figure 13.2. It does not change the size of any vector x, nor of the
basis vectors q1 and q2 shown in red and blue. In their new, rotated
positions, q1 and q2 are shown in lighter red and blue. The unit
vectors all have their tips on the unit circle, as shown in Figure 13.2.
Because of the rotation (of 30°) by V T , all the vectors in the first
quadrant are now between the bright red and blue vectors q1′ and q2′ .
We then apply the scaling Σ on the product V T x, which scales
along the original (not the rotated) unit vectors q1 and q2 . In Fig-
ure 13.3, we can see how the rotated unit vectors (now shown in
translucent red and blue) get transformed into their new versions.
What Σ does to the unit vectors qi , it does to all vectors. Therefore,
[Fig. 13.3: The second transformation, the scaling Σ = diag(3/2, 1/2), applied to the rotated unit vectors; the unit circle becomes an axis-aligned ellipse.]
the unit circle gets elongated along the x direction, and squashed
along the y direction. What we mean by this statement is that all
vectors whose tips are on the unit circle get transformed such that
their tips end up on the said ellipse. As a part of this transformation,
the rotated unit vectors, the translucent red and blue vectors q1′ and
q2′ , get transformed to q1′′ and q2′′ (in brighter colors) on the ellipse. In
other words, the effect of the two transformations, the product ΣV T ,
is to move all the vectors in the first quadrant of the unit circle to the
arc of the ellipse between the bright red and blue vectors q1′′ and q2′′ .
Notice that the transformed ellipse in Figure 13.3 has its axes along
the x and y directions. The last step, shown in Figure 13.4, is the
rotation by U. It is a clockwise rotation of 60°. It rotates the unit vectors through that angle. Remembering that the axes of the
ellipse after the scaling (in Figure 13.3) were along the directions of
x and y unit vectors, we can see that how the ellipse gets rotated. Of
course, the rotation happens to all the vectors. The ΣV T -transformed
versions of the original unit vectors (from Figure 13.2), now shown
in translucent red and blue in Figure 13.4 as q1′′ and q√2′′ , for instance,
get rotated to the bright red vector,√with its tip at ( 23 , −1) and the
bright blue vector with its tip at (0, 23 ). This indeed is exactly what
the shear matrix A does, as illustrated in Figure 13.1.
1.25 "
2 (4
#" = § ####
! =
%
1 21
1
0 0
§ #### = (
3 " 4%
0.75 2
(4
% 0
986& = =3
(4
0.5 21 %
0.25
!
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1
1.25 1.5 1.75 2
#! =
!= 2
E: ROTATION 20.25 2 #
*
1 $ 1
#! = § 20.5
0 % 2 3
0
#" = § 2 3 20.75
1 1
$ 1 3 21 3
9=
% 2 3 1 2
21
Clockwise rotation of 60° 21.25
[Fig. 13.5: Summary of the decomposition: Ax = UΣVᵀx — a rotation by Vᵀ, a scaling by Σ, and a rotation by U.]
We may be tempted to think, from the figures, that one rotation and
one scaling might be enough to do everything that A does. We should,
however, note that the first quadrant of the unit circle in Figure 13.2
is getting mapped to the arc of the ellipse between the light red and
blue vectors in Figure 13.3. One rotation and one scaling could give
is this mapping, but the ellipse would have its axes along the unit
Avi = σi ui =⇒ AV̂ = Û Σ̂
Fig. 13.6 The shapes of U , Σ and V in SVD when A is a full-column-rank matrix.
• V is a basis for the domain, which is the union of the bases for the row and null spaces of A.
Fig. 13.7 The shapes of U , Σ and V in SVD when A is a general, rank-deficient matrix.
Let's write down the shapes of these matrices as equations.
$$U = \begin{bmatrix} | & | & \cdots & | \\ u_1 & u_2 & \cdots & u_m \\ | & | & \cdots & | \end{bmatrix} \in \mathbb{R}^{m\times m} = [u_i]
\qquad
V = \begin{bmatrix} | & | & \cdots & | \\ v_1 & v_2 & \cdots & v_n \\ | & | & \cdots & | \end{bmatrix} \in \mathbb{R}^{n\times n} = [v_i]$$
$$\Sigma = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & \sigma_r & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \in \mathbb{R}^{m\times n} = [\sigma_i] \qquad (13.2)$$
The Σ matrix is arranged such that σ₁ ≥ σ₂ ≥ ··· ≥ σᵣ ≥ 0, so that the first singular value is the most important one. The singular vectors are eigenvectors of AᵀA and AAᵀ:
1
To remember whether U or V is left or right, note that the left singular matrix U appears
on the left in U ΣV T and V on the right.
σ₁ ≥ σ₂ ≥ ··· ≥ σᵣ
• For A ∈ R²ˣ², σ₁ ≥ λ₁ ≥ λ₂ ≥ σ₂
where we used the fact from Eqn (13.2) that the diagonal matrix Σ
has only the first r = rank(A) non-zero elements, and therefore, the
sum runs from 1 only to r, not to m or n, the matrix dimensions.
It is perhaps important enough to reiterate that Eqn (13.4), by
itself, is not an approximation just because we are summing only up
to r, which is to say, we are using the “economical” (hatted) SVD
matrices. Even if we were to use the full matrices, the multiplication
ΣV T would have resulted in a matrix (∈ Rm×n ) with the last n − r
columns zero because only the first r singular values (σi , 0 < i f r)
are non-zero.
Each term in the summation in Eqn (13.4) is σᵢAᵢ, where Aᵢ = uᵢvᵢᵀ is a rank-one matrix (because it is a linear combination of one row matrix vᵢᵀ, by the row-picture of matrix multiplication). The first of them, A₁ = u₁v₁ᵀ, is the most important one because its weight σ₁ is the largest among the singular values. We can therefore see that σ₁A₁ = σ₁u₁v₁ᵀ is the best rank-one approximation of the original matrix A.
Let’s say A is a megapixel image of size 1000 × 1000. It takes a
million bytes to store it. A1 , on the other hand, takes up only 2001
bytes 1(σ1 ) + 1000(ui ) + 1000(ui ). If A1 is not good enough, we
may include up to k such rank one matrix at a storage cost of 2001k,
which is smaller than a million for k < 499. Typically, the first few
tens of σi would be enough to keep most of the information in A.
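A rank-k approximation is a one-liner with numpy's SVD. The sketch below uses a synthetic low-rank-plus-noise matrix as a stand-in for the image; the sizes and the choice of k are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20)) @ rng.standard_normal((20, 1000))
A += 0.01 * rng.standard_normal((1000, 1000))   # a rank-20 "picture" plus a little noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 20
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # sum of the first k terms σᵢuᵢvᵢᵀ

print(A.size)                                   # 1,000,000 numbers for the full matrix
print(k * (A.shape[0] + A.shape[1] + 1))        # 40,020 numbers for the approximation
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # tiny relative error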
In general, in order to store up to k rank-one approximations of A ∈ Rm×n, we need k(m + n + 1) units of memory, which could be much smaller than the mn units needed for the full matrix.
2
A quick search revealed this article from 2017.
which is the covariance between the ith and jth variables. For example, if we set i = j to look at the diagonal elements of C, we get
$$\sum_{k=1}^{m} (a_{ki} - \mu_i)(a_{ki} - \mu_i) = \sum_{k=1}^{m} (a_{ki} - \mu_i)^2 = m\,\operatorname{Var}(x_i)$$
A Simulated Example
As a well-established and widely used technique, PCA is available
in all modern statistical applications. In order to make its discussion
clear, we will use R and create a toy example of it using simulation.
We are going to simulate a multivariate normal distribution, centered
at µ with a covariance matrix S, generating 1000 tuples (x, y).
$$\mu = \begin{bmatrix} 2.5 \\ 2.5 \end{bmatrix} \qquad S = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$
Since we set the parameters for the simulation, we already know what to expect: We started with the covariance matrix S, and we expect the bivariate normal distribution to show up as an ellipse, centered at (2.5, 2.5) and with the major and minor axes along the eigenvectors s₁ and s₂ of S. The eigen-analysis of S gives:
$values
[1] 3 1

$vectors
          [,1]       [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068  0.7071068
In this output, $values are our eigenvalues λi , which says the vari-
ances along the major and minor axes of the generated data in Fig-
ure 13.8, left panel, are 3 and 1. Therefore the lengths of the axes
(σi ) are the square roots, namely about 1.73 and 1, which is what
we see in Figure 13.8, on the left (except that we scaled the standard
deviations by a factor of two, so that 95% of the points are within the
ellipse).
Fig. 13.8 Left: Example of simulated (x, y) pairs, showing the elliptical shape, the
directions and sizes of the major and minor axes. Right: The “biplot” from the PCA
function in R.
Standard deviations (1, .., p=2):
[1] 1.718 1.018

Rotation (n x k) = (2 x 2):
            PC1        PC2
[1,] -0.7043863 -0.7098168
[2,] -0.7098168  0.7043863
From the PCA output, we see the standard deviations σ₁ and σ₂, close to what we specified in our covariance matrix S, as revealed by the eigen-analysis on it. Ideally, the values should have been √3 and √1, but we got 1.718 and 1.018, close enough. The first principal component is a linear combination of x and y with coefficients −0.704 and −0.710. Note that SVD vectors are unique only up to sign; their signs are not fixed.
The right panel in Figure 13.8 is the so-called “biplot” of the
analysis, which shows a scatter plot between the first and second
principal components, as well as the loadings of the original variables.
The first thing to note is that PC1 and PC2 are now uncorrelated standard
normal distributions, as indicated by the roughly circular cloud in the
biplot rather than the ellipse in the data.
The directions shown in red and blue are to be understood as
follows: The first variable x “loads” PC1 with a weight of −0.704
and PC2 with a weight of −0.710 (which form the first right singular
vector, v1). It is shown as the red arrow in the biplot, but with some
scaling so that its length corresponds to the weight. The second variable
y loads PC1 and PC2 at −0.710 and 0.704 (the second right singular
vector, v2), shown as the blue arrow, again with some scaling. Since both
the principal components load each variable with similar weights, the
lengths of the red and blue arrows are similar.
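Continuing the sketch above (and assuming the simulated matrix xy from it), the PCA output and the biplot in Figure 13.8 could be produced along these lines:

pca <- prcomp(xy)     # PCA; zero-centers the data by default
pca                   # prints the standard deviations and the rotation (loadings)
summary(pca)          # proportion of variance explained by PC1 and PC2
biplot(pca)           # scatter plot of PC1 vs PC2, with the variable loadings as arrows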
13.3.3 Pseudo-Inverse
We talked about the left and right inverses in §5.4 (page 120).
Eqn (5.2), for instance, shows how we define the left inverse of a
“tall,” full-rank matrix. The SVD method gives us another way to
define an inverse of any matrix, called the pseudo-inverse, A+ .
Let’s first define what we are looking for.
Pseudo-Inverse
Definition: A matrix A ∈ Rm×n has an associated pseudo-inverse
A+ if the following four criteria are met:
1. AA+A = A: Note that AA+ does not have to be I.
2. A+AA+ = A+: The product A+A does not have to be I either.
3. (A+A)T = A+A: Like the Gram matrix, A+A needs to be symmetric.
4. (AA+)T = AA+: The other product, AA+, should be symmetric too.
With the SVD of A, we can come up with A+ that satisfies the
four criteria.
A = UΣV T =⇒ A+ = V Σ+U T    (13.6)
where Σ+ is a diagonal matrix with the reciprocals of σi when σi ≠ 0
and zero when σi = 0. In practice, since comparing floating-point
values against exact zero is always troublesome in computing, we will
use a lower bound (a tolerance) for σi. Note that Σ+ has the same
size as AT, while Σ has the same size as A. In other words, for
A ∈ Rm×n, Σ+ ∈ Rn×m.
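As an illustration, here is a minimal R sketch of Eqn (13.6); the function name, the test matrix and the tolerance are arbitrary choices, and MASS::ginv() offers a ready-made alternative.

pseudo_inverse <- function(A, tol = 1e-10) {
  s      <- svd(A)                          # A = s$u %*% diag(s$d) %*% t(s$v)
  d_plus <- ifelse(s$d > tol, 1 / s$d, 0)   # reciprocals of the significant singular values only
  s$v %*% diag(d_plus, nrow = length(d_plus)) %*% t(s$u)
}

A      <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)   # a "tall" 3x2 matrix
A_plus <- pseudo_inverse(A)
max(abs(A %*% A_plus %*% A - A))                  # criterion 1: practically zero
max(abs(A_plus %*% A %*% A_plus - A_plus))        # criterion 2: practically zero
round(A_plus %*% A, 10)                           # the 2x2 identity, since A has full column rank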
Fig. 13.9 The elegant symmetry of the four fundamental spaces, completed by the pseudo-
inverse A+ .
What We Learned
• Any matrix A ∈ Rm×n has a full SVD, A = UΣV T, and a “thin” SVD, A = ÛΣ̂V̂ T, where:
– V is an orthonormal basis for Rn, the domain, which is the union of the bases
for C(AT) and N(A). The column vectors {vi} are the right singular vectors.
– U is an orthonormal basis for Rm, the co-domain, which is the union of the
bases for C(A) and N(AT). The column vectors {ui} are the left singular vectors.
– Σ is a diagonal matrix, with the r positive singular values [σi] along the main
diagonal, sorted in descending order.
– V̂ has r orthonormal vectors that form a basis for the row space C(AT).
– Û holds an orthonormal basis for C(A), the column space of A.
– Σ̂ is a square matrix, with the r singular values along its diagonal.
• Like the Spectral Theorem in EVD (which applies only to real, symmetric matrices),
the SVD of any matrix leads to rank-one approximations, which are used in data
compression (see the short code sketch after this list).
• Principal Component Analysis (PCA):
– PCA is the EVD of the Gram matrix (after taking out the mean in each
column, AKA zero-centering).
– Often the first step in machine-learning and data-science projects, leading to
dimensionality reduction with a controllable amount of variance retained.
– The principal components are the linear combinations of the columns of the
data matrix A using the right singular vectors vi, namely Avi.
– The right singular vectors, vi, are therefore called the loadings of the variables.
• Pseudo-Inverse:
A+ = V̂ Σ̂−1 Û T
where the hatted matrices retain only the r significant entries of the general,
unhatted matrices in the SVD.
– The transformation A+ : Rm → Rn is a one-to-one, onto mapping from the
column space to the row space of A.
– It completes the elegant symmetry of Linear Algebra, as beautifully portrayed
in Figure 13.9, which we should stare at for as long as necessary to fall in
love with mathematics.
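As promised in the summary above, here is a minimal R sketch of the rank-k approximation obtained from the SVD; the random test matrix and the choice k = 2 are arbitrary.

rank_k_approx <- function(A, k) {
  s <- svd(A)
  s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
}

A  <- matrix(rnorm(100 * 80), nrow = 100)   # a 100x80 test matrix
A2 <- rank_k_approx(A, 2)                   # best rank-2 approximation in the Frobenius norm
norm(A - A2, type = "F")                    # size of the approximation error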
Exercises
A. Review Questions
If unsure of these review questions, go through the chapter again.
(a) U (b) Σ
(c) V (d) The first r columns of V
(a) A basis for its column space (b) A basis for its row space
(c) A basis for its null space (d) A basis for its left null space
6. For a tall, full-column-rank matrix A with the “thin” SVD A = Û Σ̂V̂ T , which
of the following matrices is a square matrix?
(a) SVD of A
(b) SVD of the zero-centered A
(c) SVD of the covariance matrix of A
(d) SVD of A′T A′ where A′ is the zero-centered version of A
10. The pseudo-inverse A+ of a matrix A can be defined using its full SVD or its
“thin” SVD. What is the difference?
(a) No difference
(b) Full SVD gives a more accurate A+
(c) “Thin” SVD gives a more accurate A+
(d) Depends on the matrix A
12. Select the properties of AT A for a tall, full-rank matrix A ∈ Rm×n , rank(A) =
n.
Resources
Review the topic and get alternative viewpoints using these curated video links
and links to problem sets.
• Lecture Videos by the Author: LA4CS, Chapter 13: Singular Value Decompo-
sition
• MIT: Linear Algebra
– Video 63: 29. Singular Value Decomposition
– Video 64: Computing the Singular Value Decomposition
– Video 69: 33. Left and Right Inverses; Pseudoinverse
– Video 70: Pseudoinverses
• Aaron Greiner: SVD How To
• The Complete Guide to Everything: How to calculate the singular values of a
matrix. Contains a fully worked out numeric example.
• Sundog Education with Frank Kane: Using Singular Value Decomposition
(SVD) for Movie Recommendations. Using PCA really, rather than SVD.
• Luis Serrano
– Singular Value Decomposition (SVD) and Image Compression
– Principal Component Analysis (PCA)
Advanced Discussions
The last topic in this book, SVD is also the starting point of countless applications,
not only in computer science, but in other branches of engineering and the physical
sciences. It is also the opening chapter in some advanced books, such as Steve
Brunton’s “Data-Driven Science and Engineering: Machine Learning, Dynamical
Systems, and Control.” For such reasons, we will wrap up our foray into this field
with a list of advanced videos. Although they may use notations different from ours,
these videos should pique our interest and inspire us to explore further.
• Toby Driscoll
– Eigenvalues and their properties
– SVD compared to EVD
– Singular value decomposition
– QR algorithm for eigenvalues
• Steve Brunton: Singular Value Decomposition
– Watch Videos 1 to 4: From “Singular Value Decomposition (SVD):
Overview” to “Singular Value Decomposition (SVD): Dominant Corre-
lations”
Resources
Gilbert Strang “Linear Algebra and its Applications”: This is the book version
of the lectures in 18.06SC, and may be useful for sample problems and as
lecture notes. However, lecture notes, problem sets and lecture transcripts
from the course are all available online at MIT OpenCourseWare.
Philip Klein “Coding the Matrix: Linear Algebra through Computer Science
Applications”: A very comprehensive and well-known work, this book
teaches Linear Algebra from a computer science perspective. Some of the
labs in our course are inspired by or based on the topics in this book, which
has an associated website with a lot of information.
Books
In addition to this textbook, “Linear Algebra for Computer Science”, here are some
other books that we can freely download and learn from:
Jim Hefferon “Linear Algebra”: This book uses SageMath and has tutorials and
labs that can be downloaded for more practice from the author’s website.
Robert Beezer “A First Course in Linear Algebra”: Another downloadable book
with associated web resources on SageMath. In particular, it has an on-line
tutorial that can be used as a reference for SageMath.
Stephen Boyd “Introduction to Applied Linear Algebra”: This well-known book
takes a pragmatic approach to teaching Linear Algebra. Commonly referred
to by the acronym (VMLS) of its subtitle (“Vectors, Matrices, and Least
Squares”), this book is also freely downloadable and recommended for
computer science students.
Copyright
A legal disclaimer: The websites and resources listed above are governed by their
own copyright and other policies. By listing and referring to them, we do not imply
any affiliation with or endorsement from them or their authors.