Linear Algebra Handout
Kevin P. Murphy
Last updated January 25, 2008
1 Introduction
Linear algebra is the study of matrices and vectors. Matrices can be used to represent linear functions1 as well as
systems of linear equations. For example,
\[
\begin{aligned}
4x_1 - 5x_2 &= -13 \\
-2x_1 + 3x_2 &= 9
\end{aligned} \tag{2}
\]
can be represented more compactly by
\[
Ax = b \tag{3}
\]
where
\[
A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \quad
b = \begin{bmatrix} -13 \\ 9 \end{bmatrix} \tag{4}
\]
Linear algebra also forms the basis of many machine learning methods, such as linear regression, PCA, Kalman
filtering, etc.
Note: Much of this chapter is based on notes written by Zico Kolter, and is used with his permission.
2 Basic notation
We use the following notation:
• By A ∈ Rm×n we denote a matrix with m rows and n columns, where the entries of A are real numbers.
• By x ∈ Rn , we denote a vector with n entries. Usually a vector x will denote a column vector — i.e., a matrix
with n rows and 1 column. If we want to explicitly represent a row vector — a matrix with 1 row and n columns
— we typically write xT (here xT denotes the transpose of x, which we will define shortly).
• The ith element of a vector x is denoted x_i:
\[
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.
\]
• We use the notation aij (or Aij , Ai,j , etc) to denote the entry of A in the ith row and jth column:
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.
\]
1 A function f : R^{m} → R^{n} is called linear if
\[
f(c_1 x + c_2 y) = c_1 f(x) + c_2 f(y) \tag{1}
\]
for all scalars c_1, c_2 and vectors x, y. Hence we can predict the output of a linear function in terms of its response to simple inputs.
• We denote the jth column of A by aj or A:,j :
\[
A = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}.
\]
• A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted D = diag(d1 , d2 , . . . , dn ),
with
\[
D = \begin{bmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{bmatrix} \tag{5}
\]
• The identity matrix, denoted I ∈ Rn×n , is a square matrix with ones on the diagonal and zeros everywhere
else, I = diag(1, 1, . . . , 1). It has the property that for all A ∈ Rm×n ,
AI = A = IA (6)
where the size of I is determined by the dimensions of A so that matrix multiplication is possible. (We define
matrix multiplication below.)
• A block diagonal matrix is one which contains matrices on its main diagonal, and is 0 everywhere else, e.g.,
\[
\begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} \tag{7}
\]
• The unit vector e_i is a vector of all 0's, except entry i, which has value 1:
\[
e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T \tag{8}
\]
The vector of all ones is denoted 1. The vector of all zeros is denoted 0.
3 Matrix Multiplication
The product of two matrices A ∈ R^{m×n} and B ∈ R^{n×p} is the matrix
\[
C = AB \in \mathbb{R}^{m\times p}, \tag{9}
\]
where
\[
C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}. \tag{10}
\]
Note that in order for the matrix product to exist, the number of columns in A must equal the number of rows in B.
There are many ways of looking at matrix multiplication, and we’ll start by examining a few special cases.
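As a quick sanity check (an added illustration, not part of the original notes), we can verify the defining sum against Matlab's built-in matrix product:

m = 3; n = 4; p = 2;
A = randn(m, n); B = randn(n, p);
C = A * B;                          % built-in matrix product
C2 = zeros(m, p);
for i = 1:m
  for j = 1:p
    C2(i,j) = sum(A(i,:) .* B(:,j)');   % explicit sum over k
  end
end
max(max(abs(C - C2)))               % ~0 up to roundoff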
3.1 Vector-Vector Products
Given two vectors x, y ∈ Rn , the quantity xT y, called the inner product, dot product or scalar product of the
vectors, is a real number given by
\[
x^T y = \langle x, y \rangle = \sum_{i=1}^{n} x_i y_i \;\in\; \mathbb{R}. \tag{11}
\]
Note that it is always the case that x^T y = y^T x.
Given vectors x ∈ Rm , y ∈ Rn (they no longer have to be the same size), xyT is called the outer product of the
vectors. It is a matrix whose entries are given by (xyT )ij = xi yj , i.e.,
\[
xy^T \in \mathbb{R}^{m\times n} = \begin{bmatrix}
x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
\vdots & \vdots & \ddots & \vdots \\
x_m y_1 & x_m y_2 & \cdots & x_m y_n
\end{bmatrix}. \tag{12}
\]
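In Matlab these two products are one-liners (a small added example with made-up vectors):

x = [1; 2; 3]; y = [4; 5; 6];   % column vectors
ip = x' * y                      % inner product, a scalar (= 32)
op = x * y'                      % outer product, a 3 x 3 matrix with op(i,j) = x(i)*y(j)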
3.2 Matrix-Vector Products

Given a matrix A ∈ R^{m×n} and a vector x ∈ R^n, their product is a vector y = Ax ∈ R^m. If we write A in terms of its rows, then
\[
y = Ax = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} x
= \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}. \tag{13}
\]
In other words, the ith entry of y is equal to the inner product of the ith row of A and x, y_i = a_i^T x. In Matlab notation, we have
y = [A(1,:)*x; ...; A(m,:)*x]
Alternatively, let's write A in column form. In this case we see that,
\[
y = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
= a_1 x_1 + a_2 x_2 + \cdots + a_n x_n. \tag{14}
\]
In other words, y is a linear combination of the columns of A, where the coefficients of the linear combination are
given by the entries of x. In Matlab notation, we have
y = A(:,1)*x1 + ...+ A(:,n)*xn
So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a
row vector. This is written yT = xT A for A ∈ R^{m×n}, x ∈ R^m, and y ∈ R^n. As before, we can express yT in two obvious ways, depending on whether we express A in terms of its rows or columns. In the first case we express A in terms of its columns, which gives
\[
y^T = x^T \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix} \tag{15}
\]
which demonstrates that the ith entry of yT is equal to the inner product of x and the ith column of A.
Finally, expressing A in terms of rows we get the final representation of the vector-matrix product,
\[
y^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix}
\begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} \tag{16}
\]
\[
= x_1 \left[\text{---}\; a_1^T \;\text{---}\right] + x_2 \left[\text{---}\; a_2^T \;\text{---}\right] + \cdots + x_m \left[\text{---}\; a_m^T \;\text{---}\right] \tag{17}
\]
so we see that yT is a linear combination of the rows of A, where the coefficients for the linear combination are given
by the entries of x.
3.3 Matrix-Matrix Products
Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the
matrix-matrix multiplication C = AB as defined at the beginning of this section. First we can view matrix-matrix
multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the
definition, is that the (i, j) entry of C is equal to the inner product of the ith row of A and the jth column of B. Symbolically, this looks like the following,
\[
C = AB = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix}
\begin{bmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{bmatrix}
= \begin{bmatrix}
a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\
a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\
\vdots & \vdots & \ddots & \vdots \\
a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p
\end{bmatrix}. \tag{19}
\]
Remember that since A ∈ Rm×n and B ∈ Rn×p , ai ∈ Rn and bj ∈ Rn , so these inner products all make sense. This
is the most “natural” representation when we represent A by rows and B by columns.
A special case of this result arises in statistical applications where A = X and B = XT , where X is the n × d
design matrix, whose rows are the data cases. In this case, XXT is an n × n matrix of inner products called the
Gram matrix:
\[
XX^T = \begin{bmatrix}
x_1^T x_1 & \cdots & x_1^T x_n \\
\vdots & \ddots & \vdots \\
x_n^T x_1 & \cdots & x_n^T x_n
\end{bmatrix} \tag{20}
\]
Alternatively, we can represent A by columns, and B by rows, which leads to the interpretation of AB as a sum
of outer products. Symbolically,
\[
C = AB = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \text{---}\; b_1^T \;\text{---} \\ \text{---}\; b_2^T \;\text{---} \\ \vdots \\ \text{---}\; b_n^T \;\text{---} \end{bmatrix}
= \sum_{i=1}^{n} a_i b_i^T. \tag{21}
\]
Put another way, AB is equal to the sum, over all i, of the outer product of the ith column of A and the ith row of B.
Since, in this case, ai ∈ Rm and bi ∈ Rp , the dimension of the outer product ai bTi is m × p, which coincides with
the dimension of C.
If A = XT and B = X, where X is the n × d design matrix, we get a d × d matrix which is proportional to the
empirical covariance matrix of the data (assuming it has been centered):
\[
X^T X = \sum_{i=1}^{n} x_i x_i^T = \sum_i \begin{bmatrix}
x_{i,1}^2 & x_{i,1}x_{i,2} & \cdots & x_{i,1}x_{i,d} \\
\vdots & & \ddots & \vdots \\
x_{i,d}x_{i,1} & x_{i,d}x_{i,2} & \cdots & x_{i,d}^2
\end{bmatrix} \tag{22}
\]
Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we
represent B by columns, we can view the columns of C as matrix-vector products between A and the columns of B.
Symbolically,
\[
C = AB = A \begin{bmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} | & | & & | \\ Ab_1 & Ab_2 & \cdots & Ab_p \\ | & | & & | \end{bmatrix}. \tag{23}
\]
Here the ith column of C is given by the matrix-vector product with the vector on the right, ci = Abi . These matrix-
vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent A by rows, and view the rows of C as matrix-vector products between the rows of A and B. Symbolically,
\[
C = AB = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} B
= \begin{bmatrix} \text{---}\; a_1^T B \;\text{---} \\ \text{---}\; a_2^T B \;\text{---} \\ \vdots \\ \text{---}\; a_m^T B \;\text{---} \end{bmatrix}. \tag{24}
\]
Here the ith row of C is given by the matrix-vector product with the vector on the left, cTi = aTi B.
It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these view-
points follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section.
However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend
some time trying to develop an intuitive understanding of the viewpoints presented here.
In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:
• Matrix multiplication is associative: (AB)C = A(BC).
• Matrix multiplication is distributive: A(B + C) = AB + AC.
• Matrix multiplication is, in general, not commutative; that is, it can be the case that AB ≠ BA. (A quick numerical check of these properties is sketched below.)
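The following small added snippet (not from the original notes) illustrates these properties numerically on random matrices:

A = randn(3); B = randn(3); C = randn(3);
norm((A*B)*C - A*(B*C))       % associativity: ~0 up to roundoff
norm(A*(B+C) - (A*B + A*C))   % distributivity: ~0 up to roundoff
norm(A*B - B*A)               % generally nonzero: multiplication is not commutative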
4 Operations on matrices
4.1 The transpose
The transpose of a matrix results from "flipping" the rows and columns. Given a matrix A ∈ R^{m×n}, its transpose,
written AT , is defined as
AT ∈ Rn×m , (AT )ij = Aji . (25)
We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector
is naturally a row vector.
The following properties of transposes are easily verified:
• (AT )T = A
• (AB)T = BT AT
• (A + B)T = AT + BT
4.2 The trace of a square matrix
The trace of a square matrix A ∈ Rn×n , denoted tr(A) (or just trA if the parentheses are obviously implied), is the
sum of diagonal elements in the matrix:
\[
\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}. \tag{26}
\]
The trace has the cyclic permutation property,
\[
\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB) \tag{27}
\]
and so on for the product of more matrices. From the cyclic permutation property, we can derive the trace trick, which rewrites the scalar inner product x^T Ax as follows:
\[
x^T A x = \mathrm{tr}(x^T A x) = \mathrm{tr}(x x^T A) \tag{28}
\]
We shall use this result later in the book.
The trace of a product of two matrices, tr(AB), is the sum of the inner products between the rows of A and the corresponding columns of B. In Matlab notation we have the following.
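The original code line is not shown in this extract; one way to compute it (an assumed sketch, not the original listing) is

trAB = sum(sum(A .* B'));   % equals trace(A*B) when A is m x n and B is n x m, without forming A*B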
4.3 Adjugate (classical adjoint) of a square matrix
Below we introduce some rather abstract material that will be necessary for defining matrix determinants and inverse.
However, it can be skipped without loss of continuity. This subsection is based on the wikipedia entry.2
Let A be an n × n matrix. Let M̃_ij be the matrix obtained by deleting row i and column j from A. Define M_ij = det(M̃_ij) to be the (i, j)'th minor of A. Define C_ij = (−1)^{i+j} M_ij to be the (i, j)'th cofactor of A. Finally, define the adjugate or classical adjoint of A to be adj(A) = C^T. Thus the adjugate stores the cofactors, but in transposed order.
Let us consider a 2 × 2 example:
\[
\mathrm{adj}\begin{bmatrix} a & b \\ c & d \end{bmatrix}
= \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
= \begin{bmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{bmatrix} \tag{29}
\]
4.4 The determinant of a square matrix

We can define the determinant in terms of an expansion of its cofactors (see Section 4.3) along any row i or column j as follows:
\[
\det A = \sum_{i=1}^{n} a_{ij} C_{ij} = \sum_{j=1}^{n} a_{ij} C_{ij} \tag{37}
\]
This is called Laplace's expansion of the determinant. We see that the determinant is given by the inner product of any row i of A with the corresponding row i of C. Hence
\[
A\,\mathrm{adj}(A) = \mathrm{adj}(A)\,A = \det(A)\, I \tag{38}
\]
We will return to this important formula below.
We now give some examples. Consider a 2 × 2 matrix. From Equation 29, and summing along the rows for column
1, we have,
\[
\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = a C_{11} + c C_{21} = ad - cb \tag{39}
\]
Alternatively, summing along the columns for row 2 we have
\[
\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = c C_{21} + d C_{22} = -cb + da \tag{40}
\]
Now consider a 3 × 3 matrix. Expanding along the first row, we have
\[
\det(A) = a_{11}\begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix}
- a_{12}\begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix}
+ a_{13}\begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix} \tag{41}
\]
\[
= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32}
- a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31} \tag{42}
\]
Note that the definition of the cofactors involves a determinant. We can make this explicit by giving the general
(recursive) formula:
\[
|A| = \sum_{i=1}^{n} (-1)^{i+j}\, a_{ij}\, |\tilde{M}_{i,j}| \quad \text{(for any } j \in \{1,\ldots,n\}\text{)}
\]
\[
\phantom{|A|} = \sum_{j=1}^{n} (-1)^{i+j}\, a_{ij}\, |\tilde{M}_{i,j}| \quad \text{(for any } i \in \{1,\ldots,n\}\text{)}
\]
where M̃i,j is the minor matrix obtained by deleting row i and column j from A. The base case is that |A| = a11
for A ∈ R^{1×1}. If we were to expand this formula completely for A ∈ R^{n×n}, there would be a total of n! (n factorial) different terms. For this reason, we hardly ever explicitly write out the complete equation of the determinant for matrices bigger than 3 × 3. Instead, we will see below that there are various numerical methods for computing |A|.
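To make the recursion concrete, here is a small and deliberately naive Matlab implementation of the cofactor expansion along the first column (an added illustration; in practice use det(A)):

function d = cofactorDet(A)
% Determinant by cofactor expansion along the first column.
% O(n!) cost, so this is for illustration only.
n = size(A, 1);
if n == 1
    d = A(1,1);
else
    d = 0;
    for i = 1:n
        M = A([1:i-1, i+1:n], 2:n);           % minor: delete row i and column 1
        d = d + (-1)^(i+1) * A(i,1) * cofactorDet(M);
    end
end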
4.5 The inverse of a square matrix
The inverse of a square matrix A ∈ R^{n×n} is denoted A^{-1}, and is the unique matrix such that
\[
A^{-1}A = I = AA^{-1}.
\]
Note that A^{-1} only exists if det(A) ≠ 0. If det(A) = 0, A is called a singular matrix.
The following are properties of the inverse; all assume that A, B ∈ Rn×n are non-singular:
• (A−1 )−1 = A
• If Ax = b, we can multiply by A^{-1} on both sides to obtain x = A^{-1}b. This shows how the inverse can be used to solve the system of linear equations we began this review with.
• (AB)−1 = B−1 A−1
• (A−1 )T = (AT )−1 . For this reason this matrix is often denoted A−T .
From Equation 38, we have that if A is non-singular, then
\[
A^{-1} = \frac{1}{|A|}\,\mathrm{adj}(A) \tag{44}
\]
While this is a nice "explicit" formula for the inverse of a matrix, we should note that, numerically, there are in fact much more efficient ways of computing the inverse. In Matlab, we just type inv(A).
For the case of a 2 × 2 matrix, the expression for A^{-1} is simple enough to give explicitly. We have
\[
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad
A^{-1} = \frac{1}{|A|}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{45}
\]
For a block diagonal matrix, the inverse is obtained by simply inverting each block separately, e.g.,
\[
\begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix}^{-1}
= \begin{bmatrix} A^{-1} & 0 \\ 0 & B^{-1} \end{bmatrix} \tag{46}
\]
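A quick numerical check of the block diagonal case (an added example; blkdiag is a standard Matlab function):

A = [2 1; 1 3]; B = [4 0 1; 0 5 2; 1 2 6];    % two invertible blocks
M = blkdiag(A, B);                             % block diagonal matrix
norm(inv(M) - blkdiag(inv(A), inv(B)))         % ~0 up to roundoff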
4.6 Schur complements and the matrix inversion lemma
(This section is based on [Jor07, ch. 13].) We will use these results in Section ??, when deriving equations for inferring marginals and conditionals of multivariate Gaussians. We will also use special cases of these results (such as the matrix inversion lemma) in many other places throughout the book.
Consider a general partitioned matrix
\[
M = \begin{bmatrix} E & F \\ G & H \end{bmatrix} \tag{47}
\]
where we assume E and H are invertible. The goal is to derive an expression for M−1 . If we could block diagonalize
M, it would be easier to invert. To zero out the top right block of M we can pre-multiply as follows
\[
\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}
\begin{bmatrix} E & F \\ G & H \end{bmatrix}
= \begin{bmatrix} E - FH^{-1}G & 0 \\ G & H \end{bmatrix} \tag{48}
\]
Similarly, to zero out the bottom left block we can post-multiply as follows
\[
\begin{bmatrix} E - FH^{-1}G & 0 \\ G & H \end{bmatrix}
\begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix}
= \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix} \tag{49}
\]
Putting it all together we get
\[
\underbrace{\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}}_{X}
\underbrace{\begin{bmatrix} E & F \\ G & H \end{bmatrix}}_{M}
\underbrace{\begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix}}_{Z}
= \underbrace{\begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}}_{W} \tag{50}
\]
Taking determinants we get
\[
|X|\,|M|\,|Z| = |M| = |W| = |E - FH^{-1}G|\,|H| \tag{51}
\]
Let us define the Schur complement of M with respect to H as
\[
M/H = E - FH^{-1}G \tag{52}
\]
Then Equation 51 becomes
|M| = |M/H||H| (53)
So we can see that M/H acts somewhat like a division operator. We can derive the inverse of M as follows
\[
Z^{-1} M^{-1} X^{-1} = W^{-1} \tag{54}
\]
\[
M^{-1} = Z W^{-1} X \tag{55}
\]
hence
\[
\begin{bmatrix} E & F \\ G & H \end{bmatrix}^{-1}
= \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix}
\begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix}
\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \tag{56}
\]
\[
= \begin{bmatrix} (M/H)^{-1} & 0 \\ -H^{-1}G(M/H)^{-1} & H^{-1} \end{bmatrix}
\begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \tag{57}
\]
\[
= \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{bmatrix} \tag{58}
\]
Alternatively, we could have decomposed the matrix M in terms of E and M/E = (H − GE^{-1}F), yielding
\[
\begin{bmatrix} E & F \\ G & H \end{bmatrix}^{-1}
= \begin{bmatrix} E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1} & -E^{-1}F(M/E)^{-1} \\ -(M/E)^{-1}GE^{-1} & (M/E)^{-1} \end{bmatrix} \tag{59}
\]
Equating the top left block of these two expressions yields the widely used matrix inversion lemma, also known as the Sherman-Morrison-Woodbury formula:
\[
(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}
\]
In the special case that H = −1 (a scalar), F = u (a column vector), and G = v^T (a row vector), we get the following formula for the rank one update of an inverse matrix:
\[
(E + uv^T)^{-1} = E^{-1} - \frac{E^{-1}u\,v^T E^{-1}}{1 + v^T E^{-1}u}
\]
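The rank one update formula is easy to check numerically (an added sketch, using the form stated above):

E = randn(4); E = E*E' + eye(4);          % a well-conditioned invertible matrix
u = randn(4,1); v = randn(4,1);
lhs = inv(E + u*v');
rhs = inv(E) - (inv(E)*u*v'*inv(E)) / (1 + v'*inv(E)*u);
norm(lhs - rhs)                            % ~0 up to roundoff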
A square matrix U ∈ R^{n×n} is called orthogonal if all of its columns are orthonormal, i.e., mutually orthogonal and of unit length. In this case,
\[
U^T U = I = U U^T. \tag{68}
\]
In other words, the inverse of an orthogonal matrix is its transpose. Note that if U is not square — i.e., U ∈ R^{m×n}, n < m — but its columns are still orthonormal, then U^T U = I, but UU^T ≠ I. We generally only use the term orthogonal to describe the case where U is square.
Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not
change its Euclidean norm, i.e.,
\[
\|Ux\|_2 = \|x\|_2 \tag{69}
\]
for any x ∈ R^n and orthogonal U ∈ R^{n×n}. Similarly, one can show that the angle between two vectors is preserved after
they are transformed by an orthogonal matrix. In summary, transformations by orthogonal matrices are generalizations
of rotations and reflections, since they preserve lengths and angles. Computationally, orthogonal matrices are very
desirable because they do not magnify roundoff or other kinds of errors.
An example of an orthogonal matrix is a rotation matrix. A rotation in 3d by angle α about the z axis is given by
\[
R(\alpha) = \begin{bmatrix}
\cos(\alpha) & -\sin(\alpha) & 0 \\
\sin(\alpha) & \cos(\alpha) & 0 \\
0 & 0 & 1
\end{bmatrix} \tag{70}
\]
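We can verify numerically that a rotation matrix is orthogonal and norm-preserving (an added check, not part of the original notes):

alpha = pi/6;
R = [cos(alpha) -sin(alpha) 0; sin(alpha) cos(alpha) 0; 0 0 1];
norm(R'*R - eye(3))          % ~0: R is orthogonal
x = randn(3,1);
norm(R*x) - norm(x)          % ~0: the Euclidean norm is preserved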
Given a square matrix A ∈ R^{n×n} and a vector x ∈ R^n, the scalar value x^T Ax is called a quadratic form. Note that
\[
x^T A x = (x^T A x)^T = x^T A^T x = x^T\left(\tfrac{1}{2}A + \tfrac{1}{2}A^T\right)x, \tag{73}
\]
i.e., only the symmetric part of A contributes to the quadratic form. For this reason, we often implicitly assume that
the matrices appearing in a quadratic form are symmetric.
We give the following definitions:
• A symmetric matrix A ∈ S^n is positive definite (PD) iff for all non-zero vectors x ∈ R^n, x^T Ax > 0. This is usually denoted A ≻ 0 (or just A > 0), and oftentimes the set of all positive definite matrices is denoted S^n_{++}.
• A symmetric matrix A ∈ S^n is positive semidefinite (PSD) iff for all vectors x ∈ R^n, x^T Ax ≥ 0. This is written A ⪰ 0 (or just A ≥ 0), and the set of all positive semidefinite matrices is often denoted S^n_+.
• Likewise, a symmetric matrix A ∈ S^n is negative definite (ND), denoted A ≺ 0 (or just A < 0), iff for all non-zero x ∈ R^n, x^T Ax < 0.
• Similarly, a symmetric matrix A ∈ S^n is negative semidefinite (NSD), denoted A ⪯ 0 (or just A ≤ 0), iff for all x ∈ R^n, x^T Ax ≤ 0.
• Finally, a symmetric matrix A ∈ S^n is indefinite if it is neither positive semidefinite nor negative semidefinite — i.e., if there exist x_1, x_2 ∈ R^n such that x_1^T Ax_1 > 0 and x_2^T Ax_2 < 0.
It should be obvious that if A is positive definite, then −A is negative definite and vice versa. Likewise, if A is
positive semidefinite then −A is negative semidefinite and vice versa. If A is indefinite, then so is −A. It can also be
shown that positive definite and negative definite matrices are always invertible.
Finally, there is one type of positive definite matrix that comes up frequently, and so deserves some special mention.
Given any matrix A ∈ Rm×n (not necessarily symmetric or even square), the Gram matrix G = AT A is always
positive semidefinite. Further, if m ≥ n (and we assume for convenience that A is full rank), then G = AT A is
positive definite.
Note that if all elements of A are positive, it does not mean A is necessarily pd. For example,
\[
A = \begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix} \tag{74}
\]
is not positive definite, since its determinant is 4 · 2 − 3 · 3 = −1 < 0, so it must have a negative eigenvalue.
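We can confirm this numerically (an added check): one of the eigenvalues is negative.

eig([4 3; 3 2])    % approximately -0.16 and 6.16, so the matrix is indefinite, not pd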
A sufficient condition for a (real, symmetric) matrix to be pd is that it is diagonally dominant, i.e., if in every row
of the matrix, the magnitude of the diagonal entry in that row is larger than the sum of the magnitudes of all the other
(non-diagonal) entries in that row. More precisely,
\[
|a_{ii}| > \sum_{j \neq i} |a_{ij}| \quad \text{for all } i \tag{76}
\]
6 Properties of matrices
6.1 Linear independence and rank
A set of vectors {x1 , x2 , . . . xn } is said to be (linearly) independent if no vector can be represented as a linear
combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the
remaining vectors is said to be (linearly) dependent. For example, if
\[
x_n = \sum_{i=1}^{n-1} \alpha_i x_i \tag{77}
\]
for some {α_1, ..., α_{n−1}}, then x_n is dependent on {x_1, ..., x_{n−1}}; otherwise, it is independent of {x_1, ..., x_{n−1}}.
The column rank of a matrix A is the largest number of columns of A that constitute a linearly independent set. In the same way, the row rank is the largest number of rows of A that constitute a linearly independent set. It is a basic fact of linear algebra that for any matrix A, columnrank(A) = rowrank(A), and so this quantity is simply referred to as the rank of A, denoted rank(A). The following are some basic properties of the rank:
• For A ∈ Rm×n , rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank, otherwise it
is called rank deficient.
• For A ∈ Rm×n , rank(A) = rank(AT ).
• For A ∈ Rm×n , B ∈ Rn×p , rank(AB) ≤ min(rank(A), rank(B)).
• For A, B ∈ Rm×n , rank(A + B) ≤ rank(A) + rank(B).
6.2 Range and nullspace of a matrix
The span of a set of vectors {x1 , x2 , . . . xn } is the set of all vectors that can be expressed as a linear combination of
{x_1, ..., x_n}. That is,
\[
\mathrm{span}(\{x_1, \ldots, x_n\}) = \left\{ v : v = \sum_{i=1}^{n} \alpha_i x_i,\; \alpha_i \in \mathbb{R} \right\}. \tag{78}
\]
It can be shown that if {x1 , . . . , xn } is a set of n linearly independent vectors, where each xi ∈ Rn , then span({x1 , . . . xn }) =
Rn . In other words, any vector v ∈ Rn can be written as a linear combination of x1 through xn .
The dimension of some subspace S, dim(S), is defined to be d if there exists some spanning set {b1 , . . . , bd } that
is linearly independent.
Figure 1: Visualization of the nullspace and range of an m × n matrix A. Here y1 = Ax1 , so y1 is in the range (is reachable);
similarly for y2 . Also Ax3 = 0, so x3 is in the nullspace (gets mapped to 0); similarly for x4 .
The range (sometimes also called the column space) of a matrix A ∈ R^{m×n} is the span of the columns of A.
In other words,
range(A) = {v ∈ Rm : v = Ax, x ∈ Rn }. (79)
This can be thought of as the set of vectors that can be “reached” or “generated” by A. The nullspace of a matrix
A ∈ Rm×n is the set of all vectors that get mapped to the null vector when multiplied by A, i.e.,
nullspace(A) = {x ∈ Rn : Ax = 0}. (80)
See Figure 1 for an illustration of the range and nullspace of a matrix. We shall discuss how to compute the range and
nullspace of a matrix numerically in Section 12.1 below.
6.3 Norms of vectors
A norm of a vector, ‖x‖, is informally a measure of the "length" of the vector. For example, we have the commonly-used Euclidean or ℓ2 norm,
\[
\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}. \tag{81}
\]
6.4 Matrix norms and condition numbers
(The following section is based on [Van06, p93].) Assume A is a non-singular matrix, and Ax = b, so x = A−1 b is
a unique solution. If we change b to b + ∆b, then
A(x + ∆x) = b + ∆b (85)
so the new solution is x + ∆x where
∆x = A−1 ∆b (86)
We say that A is well-conditioned if a small ∆b results in a small ∆x; otherwise we say that A is ill-conditioned.
For example, suppose
\[
A = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 + 10^{-10} & 1 - 10^{-10} \end{bmatrix}, \quad
A^{-1} = \begin{bmatrix} 1 - 10^{10} & 10^{10} \\ 1 + 10^{10} & -10^{10} \end{bmatrix} \tag{87}
\]
The solution for b = (1, 1) is x = (1, 1). If we change b by ∆b, the solution changes to
\[
\Delta x = A^{-1}\Delta b = \begin{bmatrix} \Delta b_1 - 10^{10}(\Delta b_1 - \Delta b_2) \\ \Delta b_1 + 10^{10}(\Delta b_1 - \Delta b_2) \end{bmatrix} \tag{88}
\]
So a small change in b can lead to an extremely large change in x.
In general, we can think of a matrix A as defining a linear function f(x) = Ax. The expression ||Ax||/||x|| gives the amplification factor or gain of the system in the direction x. We define the norm of A as the maximum gain (over all directions):
\[
\|A\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \max_{\|x\|=1} \|Ax\| \tag{89}
\]
This is sometimes called the 2-norm of a matrix. For example, if A = diag(a_1, ..., a_n), then ||A|| = max_i |a_i|. In
general, we have to use numerical methods to compute ||A||; in Matlab, we can just type norm(A).
Other matrix norms exist. For example, the Frobenius norm of a matrix is defined as
\[
\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)} \tag{90}
\]
Note that this is the same as the 2-norm of A(:), the vector obtained by stacking all the columns of A together. In Matlab, just type norm(A,'fro').
We now show how the norm of a matrix can be used to quantify how well-conditioned a linear system is. Since
∆x = A^{-1}∆b, one can show the following upper bound on the absolute error:
\[
\|\Delta x\| \leq \|A^{-1}\|\,\|\Delta b\| \tag{91}
\]
So a large ||A^{-1}|| means ||∆x|| can be large even when ||∆b|| is small. Also, since ||b|| ≤ ||A|| ||x||, we have ||x|| ≥ ||b||/||A||. Combining these two bounds gives the relative error bound
\[
\frac{\|\Delta x\|}{\|x\|} \leq \|A\|\,\|A^{-1}\|\,\frac{\|\Delta b\|}{\|b\|}
\]
The quantity ||A|| ||A^{-1}|| is known as the condition number of A; when it is large, the system is ill-conditioned. In Matlab, it can be computed with cond(A).
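The ill-conditioned example above can be reproduced numerically (an added sketch, using the matrix from Equation 87):

A = 0.5 * [1, 1; 1 + 1e-10, 1 - 1e-10];
cond(A)                          % huge condition number (~1e10)
b  = [1; 1];        x  = A \ b;  % x is approximately (1, 1)
db = [1e-8; 0];     dx = A \ (b + db) - x;
norm(dx) / norm(x)               % large relative change despite the tiny change in b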
7 Eigenvalues and eigenvectors
7.1 Square matrices
Given a square matrix A ∈ R^{n×n}, we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector if
\[
Ax = \lambda x, \quad x \neq 0. \tag{94}
\]
Intuitively, this definition means that multiplying A by the vector x results in a new vector that points in the same
direction as x, but scaled by a factor λ. Also note that for any eigenvector x ∈ C^n and scalar c ∈ C, A(cx) = cAx = cλx = λ(cx), so cx is also an eigenvector. For this reason when we talk about "the" eigenvector associated with λ,
we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since x and
−x will both be eigenvectors, but we will have to live with this).
We can rewrite the equation above to state that (λ, x) is an eigenvalue-eigenvector pair of A if,
\[
(\lambda I - A)x = 0, \quad x \neq 0. \tag{95}
\]
But (λI − A)x = 0 has a non-zero solution x if and only if (λI − A) has a non-trivial nullspace, which is only the
case if (λI − A) is singular, i.e.,
|(λI − A)| = 0 . (96)
We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial
in λ, where λ will have maximum degree n; this is called the characteristic polynomial. We then find the n (possibly
complex) roots of this polynomial to find the n eigenvalues λ1 , . . . , λn . To find the eigenvector corresponding to
the eigenvalue λi , we simply solve the linear equation (λi I − A)x = 0. It should be noted that this is not the
method which is actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that
the complete expansion of the determinant has n! terms); it is rather a mathematical argument.
The following are properties of eigenvalues and eigenvectors (in all cases assume A ∈ R^{n×n} has eigenvalues λ_1, ..., λ_n and associated eigenvectors x_1, ..., x_n):
• The trace of A is equal to the sum of its eigenvalues,
\[
\mathrm{tr}A = \sum_{i=1}^{n} \lambda_i. \tag{97}
\]
We can write all the eigenvector equations simultaneously as AX = XΛ, where the columns of X ∈ R^{n×n} are the eigenvectors of A and Λ = diag(λ_1, ..., λ_n) contains the eigenvalues. If the eigenvectors of A are linearly independent, then the matrix X will be invertible, so
\[
A = X\Lambda X^{-1}. \tag{101}
\]
A matrix that can be written in this form is called diagonalizable.
7.2 Symmetric matrices
Two remarkable properties come about when we look at the eigenvalues and eigenvectors of a symmetric matrix A ∈
Sn . First, it can be shown that all the eigenvalues of A are real. Secondly, the eigenvectors of A are orthonormal, i.e.,
u_i^T u_j = 0 if i ≠ j, and u_i^T u_i = 1, where the u_i are the eigenvectors. In matrix form, this becomes U^T U = UU^T = I (and hence |U| = ±1), i.e., the matrix of eigenvectors is orthogonal. We can therefore represent A as
\[
A = U\Lambda U^T = \sum_{i=1}^{n} \lambda_i u_i u_i^T \tag{102}
\]
remembering from above that the inverse of an orthogonal matrix is just its transpose. We can write this symbolically
as
\[
A = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}
\begin{bmatrix} \text{---}\; u_1^T \;\text{---} \\ \vdots \\ \text{---}\; u_n^T \;\text{---} \end{bmatrix} \tag{103}
\]
\[
= \lambda_1 u_1 u_1^T + \cdots + \lambda_n u_n u_n^T \tag{104}
\]
As an example of diagonalization, suppose we are given the following real, symmetric matrix
\[
A = \begin{bmatrix} 1.5 & -0.5 & 0 \\ -0.5 & 1.5 & 0 \\ 0 & 0 & 3 \end{bmatrix} \tag{105}
\]
We can diagonalize this in Matlab as follows.
Listing 1: Part of rotationDemo
[U,D]=eig(A)
U =
-0.7071 -0.7071 0
-0.7071 0.7071 0
0 0 1.0000
D =
1 0 0
0 2 0
0 0 3
We recognize that U corresponds to a rotation by 45 degrees (compare the form of the rotation matrix in Equation 70), and D corresponds to scaling by diag(1, 2, 3). So the whole matrix is equivalent to a rotation by 45°, followed by scaling, followed by a rotation by −45°. Note that there is a sign ambiguity for each column of U, but the resulting answer is always correct:
Listing 2: Output of rotationDemo
>> U*D*U’
ans =
1.5000 -0.5000 0
-0.5000 1.5000 0
0 0 3.0000
We can also use the diagonalization property to show that the definiteness of a matrix depends entirely on the sign
of its eigenvalues. Suppose A ∈ S^n has eigendecomposition A = UΛU^T. Then
\[
x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2 \tag{106}
\]
where y = U^T x (and since U is full rank, any vector y ∈ R^n can be represented in this form). Because y_i^2 is always nonnegative, the sign of this expression depends entirely on the λ_i's. If all λ_i > 0, then the matrix is positive definite;
if all λi ≥ 0, it is positive semidefinite. Likewise, if all λi < 0 or λi ≤ 0, then A is negative definite or negative
semidefinite respectively. Finally, if A has both positive and negative eigenvalues, it is indefinite.
7.3 Covariance matrices
An important application of eigenvectors is to visualizing the probability density function of a multivariate Gaus-
sian, also called a multivariate Normal (MVN). We study this in more detail in Section ??, but we introduce it here,
since it serves as a good illustration of many ideas in linear algebra. The density is defined as follows:
\[
\mathcal{N}(x|\mu, \Sigma) \;\stackrel{\mathrm{def}}{=}\; \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\, \exp\!\left[-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right] \tag{107}
\]
where d is the dimensionality of x, µ = E[X] is the d × 1 mean vector, and Σ is a d × d symmetric and positive
definite covariance matrix, defined by
\[
\mathrm{Cov}(X) \;\stackrel{\mathrm{def}}{=}\; \Sigma = E\!\left[(X - E[X])(X - E[X])^T\right] \tag{108}
\]
\[
= \begin{bmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d)
\end{bmatrix} \tag{109}
\]
In the special case d = 1, this reduces to the familiar univariate Gaussian density,
\[
\mathcal{N}(x|\mu, \sigma^2) \;\stackrel{\mathrm{def}}{=}\; \frac{1}{(2\pi)^{1/2}\sigma}\, \exp\!\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right] \tag{110}
\]
The expression inside the exponent of the MVN is a quadratic form called the Mahalanobis distance, and is a scalar
value equal to
(x − µ)T Σ−1 (x − µ) = xT Σ−1 x + µT Σ−1 µ − 2µT Σ−1 x (111)
Intuitively, this just measures the distance of point x from µ, where each dimension gets weighted differently. In the
case of a diagonal covariance, it becomes a weighted Euclidean distance:
\[
\Delta = (x-\mu)^T \Sigma^{-1} (x-\mu) = \sum_{i=1}^{d} (x_i - \mu_i)^2\, \Sigma_{ii}^{-1} \tag{112}
\]
So if Σii is very large, meaning dimension i has high variance, then it will be downweighted (since we use 1/Σii )
when comparing x to µ. This is a form of feature selection.
Figure 2 plots some MVN densities in 2d for three different kinds of covariance matrices.4 A full covariance matrix
has d(d + 1)/2 parameters (we divide by 2 since it is symmetric). A diagonal covariance matrix has d parameters. A
spherical or isotropic covariance, Σ = σ 2 Id , has one free parameter.
The equation ∆ = const defines an ellipsoid, which are the level sets of constant probability density. We see that
a general covariance matrix has elliptical contours; for a diagonal covariance, the ellipse is axis aligned; and for a
spherical covariance, the ellipse is a circle (same “spread” in all directions). We now explain why the contours have
this form, using eigenanalysis.
Letting Σ = UΛU^T we find
\[
\Sigma^{-1} = U^{-T}\Lambda^{-1}U^{-1} = U\Lambda^{-1}U^T = \sum_{i=1}^{d} \frac{1}{\lambda_i}\, u_i u_i^T \tag{113}
\]
4 These plots were made by creating a dense grid of points using meshgrid, evaluating the pdf on each grid cell using mvnpdf, and then using the surf and contour commands.
[Figure 2 panels: surface plots of p(x, y) (top row) and the corresponding contour plots over x and y (bottom row) for three covariance matrices: Full, S=[2.0 1.5; 1.5 4.0], ρ=0.53; Diagonal, S=[1.2 0; 0 4.8], ρ=0; Spherical, S=[1.0 0; 0 1.0], ρ=0.]
Figure 2: Visualization of 2 dimensional Gaussian densities. Left: a full covariance matrix Σ. Middle: Σ has been decorrelated
(see Section ??). Right: Σ has been whitened (see Section 7.3.1). This figure was produced by gaussPlot2dDemo.
Hence
\[
(x-\mu)^T \Sigma^{-1} (x-\mu) = (x-\mu)^T \left(\sum_{i=1}^{d} \frac{1}{\lambda_i}\, u_i u_i^T\right)(x-\mu) \tag{114}
\]
\[
= \sum_{i=1}^{d} \frac{1}{\lambda_i}\,(x-\mu)^T u_i\, u_i^T (x-\mu) = \sum_{i=1}^{d} \frac{y_i^2}{\lambda_i} \tag{115}
\]
where we define y_i = u_i^T(x − µ). The y variables define a new coordinate system that is shifted (by µ) and rotated (by U) with respect to the original x coordinates: y = U^T(x − µ).
Recall that the equation for an ellipse in 2D is
\[
\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = 1 \tag{116}
\]
Hence we see that the contours of equal probability density of a Gaussian lie along ellipses: see Figure 3.
This gives us a fast way to plot a 2d Gaussian, without having to evaluate the density on a grid of points and use the contour command. If X is a matrix of points on the unit circle, then Y = UΛ^{1/2}X is a matrix of points on the ellipse represented by Σ = UΛU^T. To see this, we use the result Cov[Ax] = A Cov[x] A^T to conclude
\[
\mathrm{Cov}[y] = U\Lambda^{\frac12}\,\mathrm{Cov}[x]\,\Lambda^{\frac12}U^T = U\Lambda^{\frac12}\Lambda^{\frac12}U^T = \Sigma \tag{118}
\]
where Λ^{1/2} = diag(√Λ_ii). We can implement this in Matlab as follows.5
5 We can explain the k = √6 term as follows. We use the fact that the Mahalanobis distance is a sum of squares of d Gaussian random variables to conclude that ∆ ∼ χ²_d, which is the chi-squared distribution with d degrees of freedom (see Section ??). Hence we can find the value of ∆ that corresponds to a 95% confidence interval by using chi2inv(0.95, 2), which is approximately 6. So by setting the radius to √6, we will enclose 95% of the probability mass.
[Figure 3 labels: the ellipse is centered at µ, its principal axes are aligned with the eigenvectors u_1, u_2, its semi-axis lengths are λ_1^{1/2} and λ_2^{1/2}, and (y_1, y_2) are the coordinates in the rotated frame; the original axes are x_1, x_2.]
Figure 3: Visualization of a 2 dimensional Gaussian density. The major and minor axes of the ellipse are defined by the first two
eigenvectors of the covariance matrix, namely u1 and u2 . Source: [Bis06] Figure 2.7.
Listing 3: gaussPlot2d
function h=gaussPlot2d(mu, Sigma, color)
[U, D] = eig(Sigma);
n = 100;
t = linspace(0, 2*pi, n);
xy = [cos(t); sin(t)];
k = sqrt(6); %k = sqrt(chi2inv(0.95, 2));
w = (k * U * sqrt(D)) * xy;
z = repmat(mu, [1 n]) + w;
h = plot(z(1, :), z(2, :), color);
axis(’equal’);
Conversely, we can use U and Λ to decorrelate and whiten the data. Define y = Λ^{-1/2}U^T x. Then
\[
\mathrm{Cov}[y] = \Lambda^{-\frac12}U^T\, \Sigma\, U\Lambda^{-\frac12}
= \Lambda^{-\frac12}U^T (U\Lambda U^T) U\Lambda^{-\frac12}
= \Lambda^{-\frac12}\Lambda\Lambda^{-\frac12} = I \tag{121}
\]
Finally, we can whiten the data and center it, by subtracting off the mean:
\[
y = \Lambda^{-\frac12}U^T(x - \mu) \tag{122}
\]
\[
E[y] = \Lambda^{-\frac12}U^T(\mu - \mu) = 0 \tag{123}
\]
We can do this in Matlab using the MLAPA function standardize, or the Statistics Toolbox function zscore.
Listing 5: standardize
function [Z, mu, sigma] = standardize(X, mu, sigma)
X = double(X);
if nargin < 2
mu = mean(X);
sigma = std(X);
ndx = find(sigma < eps);
sigma(ndx) = 1;
end
[n d] = size(X);
Z = X - repmat(mu, [n 1]);
Z = Z ./ repmat(sigma, [n 1]);
Of course, in a supervised learning setting, we must apply the same transformation to the test data. Thus it is very
common to see the following idiom (we will see examples in Chapter ??):
Listing 6: :
[Xtrain, mu, sigma] = standardize(Xtrain);
Xtrain = [ones(size(Xtrain,1), 1) Xtrain]; % prepend a column of 1s
w = fitModel(Xtrain); % plug in appropriate learning procedure
Xtest = [ones(size(Xtest,1),1) standardize(Xtest, mu, sigma)];
ypred = testModel(Xtest, w); % plug in appropriate test procedure
Alternatively, we can rescale the data, e.g., to 0:1 or -1:1, using the rescaleData function. We must rescale the
test data in the same way, as in the example below.
Listing 7: :
xTrain = -10:1:10;
[zTrain, minx, rangex] = rescaleData(xTrain, -1, 1);% zTrain= -1:1
xTest = [-11, 8];
[zTest] = rescaleData(xTest, -1, 1, minx, rangex) % [-1.1, 0.8]
8 Matrix calculus
While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that
does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector
setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look
much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a
few examples. We concentrate on functions that map vectors / matrices to scalars, i.e., their output is scalar valued.
8.1 The gradient of a function wrt a matrix
Suppose that f : Rm×n → R is a function that takes as input a matrix A of size m × n and returns a real value. Then
the gradient of f (with respect to A ∈ Rm×n ) is the matrix of partial derivatives, defined as:
\[
\nabla_A f(A) \in \mathbb{R}^{m\times n} = \begin{bmatrix}
\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\
\frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}}
\end{bmatrix} \tag{125}
\]
For example, consider f(A) = tr(AB) for a fixed matrix B. Since tr(AB) = Σ_k Σ_ℓ a_{kℓ} b_{ℓk}, we have
\[
\frac{\partial}{\partial a_{ij}}\,\mathrm{tr}(AB) = \frac{\partial}{\partial a_{ij}} \sum_{k}\sum_{\ell} a_{k\ell}\, b_{\ell k} = b_{ji} \tag{129}
\]
so
\[
\nabla_A\, \mathrm{tr}(AB) = B^T \tag{130}
\]
We can use this, plus the trace trick, to compute the derivative of a quadratic form wrt a matrix:
\[
\nabla_A\, x^T A x = \nabla_A\, \mathrm{tr}(x x^T A) = (x x^T)^T = x x^T
\]
Similarly, one can show that the derivative of the determinant is
\[
\frac{\partial}{\partial a_{ij}}\, |A| = C_{ij} \tag{132}
\]
since, by the Laplace expansion along row i, |A| = Σ_j a_{ij} C_{ij} and C_{ij} does not contain any terms involving a_{ij} (because we deleted row i and column j when forming it). Hence
\[
\nabla_A |A| = C = \mathrm{adj}(A)^T = |A|\, A^{-T}
\]
and therefore
\[
\nabla_A \log|A| = \frac{1}{|A|}\,\nabla_A |A| = A^{-T} \tag{135}
\]
Note the similarity to the scalar case, where ∂/(∂x) log x = 1/x.
8.2 The gradient of a function wrt a vector
Note that the size of ∇A f (A) is always the same as the size of A. So if, in particular, A is just a vector x ∈ Rn , we
have
\[
\nabla_x f(x) = \begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_n}
\end{bmatrix}. \tag{136}
\]
It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if
it returns a scalar value. We can not, for example, take the gradient of Ax, A ∈ Rn×n with respect to x, since this
quantity is vector-valued.
It follows directly from the equivalent properties of partial derivatives that:
• ∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x).
• For t ∈ R, ∇x (t f (x)) = t∇x f (x).
We give some examples below, which we will use when computing the least squares estimate (see Section ??) as
well as when computing the maximum likelihood estimate of the mean of a multivariate Gaussian in Section ??.
8.2.1 Derivative of a scalar product
For x ∈ Rn , let f (x) = aT x for some known vector a ∈ Rn . Then
\[
\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} a_i x_i = a_k. \tag{137}
\]
so
\[
\nabla_x\, a^T x = a.
\]

8.2.2 Derivative of a quadratic form

Now consider the quadratic form f(x) = x^T A x for some square matrix A ∈ R^{n×n}. Since f(x) = Σ_i Σ_j A_{ij} x_i x_j, we have
\[
\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} x_i x_j = \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j \tag{140}
\]
Hence
∇x xT Ax = (A + AT )x (141)
If A is symmetric, this becomes ∇x xT Ax = 2Ax. Again, this should remind you of the analogous fact in single-
variable calculus, that ∂/(∂x) ax2 = 2ax.
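A simple finite-difference check of this gradient formula (an added illustration, not from the original notes):

n = 4; A = randn(n); x = randn(n,1);
f = @(x) x' * A * x;
g = (A + A') * x;                         % analytic gradient
gfd = zeros(n,1); h = 1e-6;
for k = 1:n
  e = zeros(n,1); e(k) = 1;
  gfd(k) = (f(x + h*e) - f(x - h*e)) / (2*h);   % central difference
end
norm(g - gfd)                             % small (finite-difference error only)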
8.3 The Jacobian
Let f be a function that maps Rn to Rm , and let y = f (x). Define the Jacobian matrix as
\[
J_{x\to y} \;\stackrel{\mathrm{def}}{=}\; \frac{\partial(y_1,\ldots,y_m)}{\partial(x_1,\ldots,x_n)} \;\stackrel{\mathrm{def}}{=}\;
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix} \tag{142}
\]
If we denote the i'th component of the output by f_i(x), then the i'th row of J is (∇_x f_i(x))^T.
If f is a linear function, so y = Ax, where A is m × n, we can compute the gradient of each component of the output separately. We have y_i = A_{i,:} x, so from the previous result the i'th row of J is A_{i,:}, and hence Jacobian(Ax) = A.
8.4 The Hessian of a function wrt a vector
Suppose that f : Rn → R is a function that takes a vector in Rn and returns a real number. Then the Hessian matrix
with respect to x, written ∇²_x f(x) or simply H, is the n × n matrix of partial derivatives,
\[
\nabla_x^2 f(x) \in \mathbb{R}^{n\times n} = \begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1\partial x_n} \\
\frac{\partial^2 f(x)}{\partial x_2\partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n\partial x_1} & \frac{\partial^2 f(x)}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}
\end{bmatrix}. \tag{143}
\]
Figure 4: QR decomposition of a non-square matrix A = QR, where QT Q = I and R is upper triangular. The shaded parts of
R, and all the below-diagonal terms, are zero. The shaded entries in Q are not computed in the economy-sized version, since they
are not needed.
9 Matrix factorizations
There are various ways to decompose a matrix into a product of other matrices. In Section 7, we discussed diago-
nalizing a square matrix A = XΛX−1 . In Section ?? below, we will discuss a way to generalize this to non-square
matrices using a technique called SVD. In this section, we present some other useful matrix factorizations.
9.1 Cholesky factorization of pd matrices
Any symmetric positive definite matrix can be factorized as A = RT R, where R is upper triangular with real, positive
diagonal elements. This is called a Cholesky factorization: In Matlab, type R = chol(A). In fact, we can use this
function to determine if a matrix is positive definite, as follows:
Listing 8: isposdef
function b = isposdef(a)
% Written by Tom Minka
[R,p] = chol(a);
b = (p == 0);
(This method is faster than checking that all the eigenvalues are positive.) We can create a random pd matrix using
the following function.
Listing 9: randpd
function M = randpd(d)
A = rand(d);
M = A*A’;
The Cholesky decomposition of a covariance matrix can be used to sample from a multivariate Gaussian. Let y ∼ N(µ, Σ) and Σ = LL^T, where L = R^T is lower triangular. We first sample x ∼ N(0, I), which is easy because it just requires sampling from d separate 1d Gaussians. We then set y = Lx + µ. This is valid since
\[
\mathrm{Cov}[y] = L\,\mathrm{Cov}[x]\,L^T = L I L^T = LL^T = \Sigma
\]
In Matlab we can write

Listing 10: :
X = chol(Sigma)'*randn(d,n) + repmat(mu,1,n); % mu is a d x 1 column vector; each column of X is a sample
Of course, if you have the Matlab statistics toolbox, you can just call mvnrnd, but it is useful to know this other
method, too.
9.2 QR decomposition
The QR decomposition of an n × d matrix X is defined by
X = QR (153)
Figure 5: SVD decomposition of a non-square matrix A = UΣVT . The shaded parts of Σ, and all the off-diagonal terms, are
zero. The shaded entries in U and Σ are not computed in the economy-sized version, since they are not needed.
where Q is an n × n orthogonal matrix (hence Q^T Q = I and QQ^T = I), and R is an n × d matrix, which is non-zero only in its upper triangle (see Figure 4). In Matlab, type [Q,R] = qr(X). It is also possible to compute an economy-sized QR decomposition, using [Q,R] = qr(X,0). In this case, Q is just n × d and R is d × d, where the lower triangle is zero (see Figure 4). We will use this when we study least squares in Section ??.
10 Singular value decomposition (SVD)
Any (real) m × n matrix A can be decomposed as
\[
A = U\Sigma V^T = \sigma_1 u_1 v_1^T + \cdots + \sigma_r u_r v_r^T \tag{154}
\]
where U is an m × m matrix whose columns are orthonormal (so U^T U = I), V is an n × n matrix whose rows and columns are orthonormal (so V^T V = VV^T = I), and Σ is an m × n matrix containing the r = min(m, n) singular values σ_i ≥ 0 on the main diagonal, with 0s filling the rest of the matrix. The columns of U are the left singular vectors, and
the columns of V are the right singular vectors. See Figure 5 for an example. In Matlab, you can compute the SVD
using [U,Sigma,V] = svd(A).
As is apparent from Figure 5, if m < n, there are some zero columns in Σ, and hence the last rows of VT are
not used. The economy sized SVD (in matlab, svd(A,’econ’)) in this case leaves U as m × m, but makes Σ be
m × m and only computes the first m columns of V . If m > n, we take the first n columns of U, make Σ be n × n,
and leave V as is (this is also called a thin SVD).
A square diagonal matrix is nonsingular iff its diagonal elements are nonzero. The SVD implies that any square
matrix is nonsingular if, and only if, its singular values are nonzero. The most numerically reliable way to determine
whether matrices are singular is to test their singular values. This is far better than trying to compute determinants,
which are numerically unstable.
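For example (an added illustration), a nearly singular matrix is much easier to diagnose from its singular values than from its determinant:

A = [1 2; 2 4 + 1e-12];   % nearly rank 1
det(A)                     % tiny (~1e-12), but hard to interpret on its own
svd(A)                     % second singular value is ~1e-13, tiny relative to the first (~5)
rank(A)                    % Matlab's rank function applies a tolerance to the singular values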
10.1 Connection between SVD and eigenvalue decomposition
If A is real, symmetric and positive definite, then the singular values are equal to the eigenvalues, and the left and
right singular vectors are equal to the eigenvectors (up to a sign change). However, Matlab always returns the singular
values in decreasing order, whereas the eigenvalues need not necessarily be sorted. This is illustrated below.
Listing 11: :
>> A=randpd(3)
A =
0.9302 0.4036 0.7065
0.4036 0.8049 0.4521
0.7065 0.4521 0.5941
>> [U,S,V]=svd(A)
U =
-0.6597 0.5148 -0.5476
-0.5030 -0.8437 -0.1872
-0.5584 0.1520 0.8155
S =
1.8361 0 0
0 0.4772 0
0 0 0.0159
V =
-0.6597 0.5148 -0.5476
-0.5030 -0.8437 -0.1872
-0.5584 0.1520 0.8155
>> [U,Lam]=eig(A)
U =
0.5476 0.5148 0.6597
0.1872 -0.8437 0.5030
-0.8155 0.1520 0.5584
Lam =
0.0159 0 0
0 0.4772 0
0 0 1.8361
To understand this connection, note that
\[
A^T A = V\Sigma^T U^T U\Sigma V^T = V(\Sigma^T\Sigma)V^T
\]
so the eigenvectors of A^T A are equal to V, the right singular vectors of A. Also, the eigenvalues of A^T A are equal to the elements of Σ², the squared singular values. Similarly,
\[
A A^T = U\Sigma V^T V\Sigma^T U^T = U(\Sigma\Sigma^T)U^T
\]
so the eigenvectors of AA^T are equal to U, the left singular vectors of A. Also, the eigenvalues of AA^T are equal to the squared singular values.
We will use these results when we study PCA (see Section ??), which is very closely related to SVD.
10.2 Pseudo inverse
The pseudo-inverse of an m × n matrix A, denoted A† , is defined as the unique matrix that satisfies the following 4
properties:
AA† A = A (159)
A† AA† = A† (160)
(AA† )T = AA† (161)
(A† A)T = A† A (162)
If A is square and non-singular, then A† = A^{-1}. If m > n and the columns of A are linearly independent, then
\[
A^\dagger = (A^T A)^{-1} A^T \tag{163}
\]
which is the same expression as arises in the normal equations (see Section ??). In this case, A† is a left inverse
because
A† A = (AT A)−1 AT A = I (164)
but is not a right inverse because
AA† = A(AT A)−1 AT (165)
only has rank n, and so cannot be the m × m identity matrix.
In general, we can compute the pseudo inverse using the SVD decomposition A = UΣV^T. If A is square and non-singular, we can compute A^{-1} easily:
\[
A^{-1} = (U\Sigma V^T)^{-1} = V\Sigma^{-1}U^T
\]
[Figure 6: truncated SVD of an m × n matrix: A ≈ U (m × k) Σ (k × k) V^T (k × n).]
Figure 7: Low rank approximations to a 200 × 320 image. Ranks 1, 2, 5, 10, 20 and 200 (original is bottom right). Produced by
svdImageDemo.
since V^T V = I and U^T U = I. When computing the pseudo inverse, if any σ_j = 0, we replace 1/σ_j with 0. Suppose the first r = rank(A) singular values are nonzero, and the rest are zero. Then
\[
\Sigma^\dagger = \mathrm{diag}(1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0)
\]
Hence
\[
A^\dagger = V\Sigma^\dagger U^T \tag{168}
\]
amounts to only using the first r columns of V and U and the first r singular values, since the remaining elements of Σ† will be 0. In Matlab, we can just type pinv(A). The (abbreviated) source code for this built-in function is shown below. (Use
type(which(’pinv’)) to see the full source.)
Listing 12: pinv
function B = pinv(A)
[U,S,V] = svd(A,0); % if m>n, only compute first n cols of U
s = diag(S);
r = sum(s > tol); % rank (tol is a small tolerance defined earlier in the full source)
w = diag(ones(r,1) ./ s(1:r));
B = V(:,1:r) * w * U(:,1:r)’;
We will see how to use pseudo inverse for solving under-determined linear systems in Section 11.3.
10.3 Low rank approximation of matrices
(The following section is based on [Mol04, sec10.11].) The economy-sized SVD
A = UΣVT (169)
Figure 8: Singular values (left) and log singular values (right) for the clown image.
can be rewritten as
A = E1 + E2 + · · · + Er (170)
where r = min(m, n). The component matrices Ei are rank one outer products:
Ei = σi ui vTi (171)
The norm of each component matrix is the corresponding singular value, ||E_i|| = σ_i. If we truncate this sum to the first k terms, we get a rank k approximation to the matrix: see Figure 6. This is called a truncated SVD. The total number of parameters needed to represent an m × n matrix using a rank k approximation is
\[
mk + k + kn = k(m + n + 1) \tag{173}
\]
As an example, consider the 200 × 320 pixel image in Figure 7(top left). This has 64,000 numbers in it. We see that a
rank 20 approximation, with only (200 + 320 + 1) × 20 = 10, 420 numbers is a very good approximation.
The error in this approximation (in the 2-norm) is given by the first omitted singular value:
\[
\Big\|A - \sum_{i=1}^{k} E_i\Big\| = \sigma_{k+1}
\]
Since the singular values often die off very quickly (see Figure 8), a low-rank approximation is often quite accurate.
In Section ??, we will see that the truncated SVD is the same thing as principal components analysis (PCA).
We can construct a low rank approximation to a matrix (here, an image) as follows.
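The listing itself is not included in this extract; a minimal sketch of the idea, using Matlab's built-in clown test image as a stand-in for the original svdImageDemo, might look like the following:

load clown                                % demo image: X (200 x 320 matrix) and colormap map
[U, S, V] = svd(X);
k = 20;
Xk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % rank k approximation
figure; colormap(map);
subplot(1,2,1); image(X);  axis image; title('original');
subplot(1,2,2); image(Xk); axis image; title('rank 20 approximation');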
11.2 m > n: Over-determined case
If m > n, the system is over-determined: there are more equations than unknowns. In general, there is no exact
solution. Typically one seeks the least squares solution, which minimizes the residual norm:
\[
\hat{x} = \arg\min_x \|Ax - b\|_2^2
\]
In Matlab, the backslash operator, x=A\b, gives the least squares solution; internally Matlab uses a QR decomposition:
see Section 9.2.
11.3 m < n: Under-determined case
If m < n, the system is under-determined: there is either no solution or there are infinitely many. In either case, the
“optimal” x is ambiguous. In Matlab, there are two main approaches one can use: the backslash operator, x=A\b,
which gives a least squares solution, or pseudoinverse, x = pinv(A)*b, which gives the minimum norm solution.
In the full rank case, these two solutions are the same, although pinv does more work than necessary to obtain the
answer. But in degenerate situations, these two methods can give different results: see Section ?? for details.
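For example (an added sketch), on a small underdetermined system:

A = [1 2 3; 4 5 6];  b = [1; 1];    % 2 equations, 3 unknowns
x1 = A \ b                           % a basic solution (at most rank(A) nonzero entries)
x2 = pinv(A) * b                     % the minimum norm solution
[norm(x1), norm(x2)]                 % x2 has the smaller (or equal) norm
norm(A*x1 - b), norm(A*x2 - b)       % both satisfy the equations up to roundoff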
12 Existence/ uniqueness of solutions to linear systems: SVD analysis
Consider the problem of solving Ax = b for an n × n matrix A. Two natural questions are: when does some solution
x ∈ Rn exist? and if it does exist, when is it unique? The SVD can be used to answer these questions, as we now
describe. More interestingly, when the solution is not unique, one can use the singular vectors to determine what extra
data would need to be obtained in order to make the solution unique, as we show in Section ??. (The following section is based on [Bee07, p29, p143].)
12.1 Left and right singular vectors form an orthonormal basis for the range and null space
Let us start by writing the SVD symbolically as
\[
A = U\Sigma V^T = U \begin{bmatrix}
\sigma_1 & & & \\
& \sigma_2 & & \\
& & \ddots & \\
& & & \sigma_n
\end{bmatrix}
\begin{bmatrix}
\text{---}\,(v^1)^T\,\text{---} \\
\text{---}\,(v^2)^T\,\text{---} \\
\vdots \\
\text{---}\,(v^n)^T\,\text{---}
\end{bmatrix} \tag{178}
\]
Hence
\[
Ax = U \begin{bmatrix}
\text{---}\,\sigma_1 (v^1)^T\,\text{---} \\
\vdots \\
\text{---}\,\sigma_n (v^n)^T\,\text{---}
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}
= U \begin{bmatrix} \sigma_1 \langle v^1, x\rangle \\ \vdots \\ \sigma_n \langle v^n, x\rangle \end{bmatrix} \tag{179}
\]
\[
= \begin{bmatrix} | & | & & | \\ u^1 & u^2 & \cdots & u^n \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \sigma_1 \langle v^1, x\rangle \\ \vdots \\ \sigma_n \langle v^n, x\rangle \end{bmatrix} \tag{180}
\]
\[
= \begin{bmatrix}
u_{11}\sigma_1\langle v^1, x\rangle + \cdots + u_{1n}\sigma_n\langle v^n, x\rangle \\
\vdots \\
u_{n1}\sigma_1\langle v^1, x\rangle + \cdots + u_{nn}\sigma_n\langle v^n, x\rangle
\end{bmatrix}
= \sum_{j=1}^{n} \sigma_j \langle v^j, x\rangle\, u^j \tag{181}
\]
Suppose that the first r singular values of A are nonzero and that the rest are zero.6 Hence
\[
Ax = \sum_{j:\sigma_j > 0} \sigma_j \langle v^j, x\rangle\, u^j = \sum_{j=1}^{r} \sigma_j \langle v^j, x\rangle\, u^j \tag{182}
\]
Thus any Ax can be written as a linear combination of the left singular vectors u1 , . . . , ur , so the range of A is given
by
range(A) = span{uj : σj > 0} (183)
with dimension r.
To find a basis for the null space, let us now define a second vector y ∈ Rn that is a linear combination solely of
the right singular vectors for the zero singular values,
\[
y = \sum_{j:\sigma_j = 0} c_j v^j = \sum_{j=r+1}^{n} c_j v^j \tag{184}
\]
Then Ay = Σ_{j:σ_j=0} c_j A v^j = 0, since Av^j = σ_j u^j = 0 for these j. Hence the right singular vectors for the zero singular values form an orthonormal basis for the null space:
\[
\mathrm{nullspace}(A) = \mathrm{span}\{v^j : \sigma_j = 0\} \tag{185}
\]
with dimension n − r.
Now suppose A is full rank, so r = n. Then the range of A is all of R^n, so a solution x exists for every b. Furthermore, this solution must be unique. To see why, suppose y is some other solution, so Ay = b. Define v = y − x. Then
\[
Ay = A(x + v) = Ax + Av = b + Av = b \tag{190}
\]
so Av = 0 and hence v ∈ nullspace(A). Since A is full rank, nullspace(A) = {0} (the nullity is 0), so we must have v = 0, and hence the solution x is unique.
6 This is the opposite convention of [Bee07].
Now suppose A is not full rank (r < n), so A is singular and A^{-1} does not exist. If b ∉ range(A), then there is no solution, since A can only generate values in range(A). If b ∈ range(A), then there are an infinite number of solutions of the form
\[
x = z + \sum_{j:\sigma_j = 0} c_j v^j \tag{191}
\]
for some constants c_j ∈ R, where z is a particular solution satisfying Az = b. To see this, let y = Σ_{j:σ_j=0} c_j v^j be
a vector in the null space. Then from Equation 185,
Ax = Az + Ay = b + 0 = b (192)
The last step follows since b ∈ range(A) (by assumption), and hence b must lie in span{uj : σj > 0}.
It is natural to ask: what happens if we apply the pseudo inverse to a b that is not in the range of A? This happens when A is over-determined (more rows than columns in A). In this case, z = A†b is not a solution, Az ≠ b. However, it is the "closest thing to a solution", in the sense that it minimizes the residual norm:
\[
A^\dagger b = \arg\min_{x} \|Ax - b\|_2
\]
\[
2x_1 = b_2 \;\Rightarrow\; 2\,\frac{b_1}{6} = b_2 \;\Rightarrow\; b_1 = 3b_2 \tag{198}
\]
So the system is only solvable if b1 = 3b2 , and in that case, we can choose x2 arbitrarily. Finally, the third equation
can be satisfied by taking
\[
x_3 = -\tfrac{1}{2}(b_3 + 4x_1 - x_2) = -\tfrac{1}{3}b_1 - \tfrac{1}{2}b_3 + \tfrac{1}{2}x_2 \tag{199}
\]
So if b_1 ≠ 3b_2, the equation has no solution. If b_1 = 3b_2, there are infinitely many solutions: x = (b_1/6, α, −b_1/3 − b_3/2 + α/2) is a solution for all α.
Let us now see how we can derive these results using SVD. The following code fragment verifies that b = (3, 1, 10)
(which satisfies b1 = 3b2 ) is in the range of A, by showing that there are coefficients c that can multiply the left singular
vectors to generate b.
We can also create other solutions by adding random multiples of vectors in the null space.
Similarly, it is easy to verify that b = (1, 1, 10) (which violates b_1 = 3b_2) does not yield a solution, and is not in the range of A (it cannot be perfectly generated by Az for any z):
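The code fragments referred to above are not included in this extract. A minimal sketch of the idea, for a generic rank deficient matrix A and a right hand side b (the specific 3 × 3 matrix used in the example is not reproduced here), might look like:

[U, S, V] = svd(A);
r = rank(A);                           % number of nonzero singular values
c = U(:,1:r)' * b;                     % coefficients of b in the basis of left singular vectors
inRange = norm(U(:,1:r) * c - b) < 1e-10     % true iff b lies in range(A)
z = pinv(A) * b;                       % a particular solution (valid when inRange is true)
xother = z + V(:, r+1:end) * randn(size(A,2) - r, 1);  % add a random null space vector
norm(A * xother - b)                   % still ~0 when b is in range(A)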
References
[Bee07] K. Beers. Numerical Methods for Chemical Engineering: Applications in MATLAB. Cambridge, 2007.