6 Gram-Schmidt Procedure, QR-factorization, Orthogonal Projections, Least Squares
Class notes with a few problems and their solutions, by Eric Carlen, are found for Gram-Schmidt and QR at
http://www.math.gatech.edu/~carlen/1502/html/pdf/gram.pdf
http://www.math.gatech.edu/~carlen/1502/html/pdf/qr.pdf
6.1 The Gram-Schmidt procedure
This is an algorithm to find an orthonormal basis (ONB, for brevity) $\{u_1, u_2, \dots, u_\ell\}$ in the span of a given set of vectors $\{v_1, v_2, \dots, v_k\}$. The algorithm is successive; first it finds an ONB in $V_1 = \mathrm{Span}\{v_1\}$, then in $V_2 = \mathrm{Span}\{v_1, v_2\}$, then in $V_3 = \mathrm{Span}\{v_1, v_2, v_3\}$, etc.
It could happen that some vector $v_j$ is already in $V_{j-1}$, i.e. in the span of $\{v_1, v_2, \dots, v_{j-1}\}$ for some $j$. In this case, we do not have to add any new vector to the ONB at that stage. For this reason, there could be a jump in the indices of the ONB. Suppose that $j_1 < j_2 < j_3 < \dots < j_\ell$ are those indices where the sequence of $V_j$-subspaces truly increases, i.e.
$$\{0\} = V_1 = V_2 = \dots = V_{j_1-1} \subsetneq V_{j_1} = V_{j_1+1} = \dots = V_{j_2-1} \subsetneq V_{j_2} = V_{j_2+1} = \dots = V_{j_3-1} \subsetneq \dots \subsetneq V_{j_\ell} = V_{j_\ell+1} = \dots = V_k$$
(with the convention that $V_0 = \{0\}$, which is needed if $j_1 = 1$). Of course in most cases $j_1 = 1$, $j_2 = 2$, $j_3 = 3$, etc. But if there is a jump, say $j_1 = 1$, $j_2 = 2$ but $j_3 = 4$, then it means that $\{v_1, v_2, v_3\}$ span the same space as $\{v_1, v_2\}$, i.e. $v_3$ is not linearly independent from the previous vectors. In this case $V_3$ is also spanned by only two orthonormal vectors, $u_1, u_2$.
The algorithm gives orthonormal vectors $u_1, u_2, \dots$ such that
$$\mathrm{span}\{u_1\} = V_{j_1}, \qquad \mathrm{span}\{u_1, u_2\} = V_{j_2}, \qquad \dots, \qquad \mathrm{span}\{u_1, u_2, \dots, u_\ell\} = V_{j_\ell}.$$
The algorithm will give the indices $j_1, j_2, \dots, j_\ell$ as well. Notice that the number of $v$ vectors may not be the same as the number of $u$ vectors; in fact the latter, $\ell$, is the dimension of the span of $\{v_1, \dots, v_k\}$.
In the description below the main text refers to the standard case $j_m = m$ (when the set $\{v_1, \dots, v_k\}$ is linearly independent), and in square brackets we remark on the general case.
Step (1.) Normalize the first nonzero vector, i.e. define
$$u_1 = \frac{v_1}{\|v_1\|}.$$
[General case: let $j_1$ be the index of the first nonzero vector among $v_1, v_2, \dots$ and define $u_1 = v_{j_1}/\|v_{j_1}\|$.]
Step (2.) Consider the vector $w_2 = v_2 - (v_2^t u_1)u_1$. If $w_2 \neq 0$, then $j_2 = 2$ and we define
$$u_2 = \frac{w_2}{\|w_2\|}.$$
[General case: compute $w_m = v_m - (v_m^t u_1)u_1$ for $m = j_1+1, j_1+2, \dots$ and continue until you find the first nonzero vector, say $w_{j_1+m}$. Then $j_2 = j_1 + m$ and $u_2$ will be $w_{j_1+m} = w_{j_2}$ normalized.]
Step (i.) Suppose that so far we have found orthonormal vectors $u_1, u_2, \dots, u_i$ such that
$$\mathrm{Span}\{u_1, u_2, \dots, u_i\} = \mathrm{Span}\{v_1, v_2, \dots, v_i\} = V_i.$$
Consider the vector
$$w_{i+1} = v_{i+1} - (v_{i+1}^t u_1)u_1 - (v_{i+1}^t u_2)u_2 - \dots - (v_{i+1}^t u_i)u_i,$$
i.e. $v_{i+1}$ minus its projection onto the subspace spanned by $\{u_1, \dots, u_i\}$ (at this stage you do not have to know that it is a projection, since we have not defined it yet, but it is good to have an idea what's going on). If this vector is nonzero, $w_{i+1} \neq 0$, then let
$$u_{i+1} = \frac{w_{i+1}}{\|w_{i+1}\|}.$$
If it is zero, then we do not create a new $u$ vector and we go on to the next untouched $v$ vector.
[General case: The algorithm in the general case goes as follows. Suppose so far we have found orthonormal vectors $u_1, u_2, \dots, u_i$ such that
$$\mathrm{Span}\{u_1, u_2, \dots, u_i\} = \mathrm{Span}\{v_1, v_2, \dots, v_{j_i}\} = V_{j_i}.$$
Consider the vectors
$$w_m = v_m - (v_m^t u_1)u_1 - (v_m^t u_2)u_2 - \dots - (v_m^t u_i)u_i$$
for $m = j_i+1, j_i+2, \dots$. Let $j_{i+1}$ be the index of the first nonzero among these vectors. This means that $v_{j_{i+1}}$ is the first vector not in the span of $\{u_1, u_2, \dots, u_i\}$. Then we define
$$u_{i+1} := \frac{w_{j_{i+1}}}{\|w_{j_{i+1}}\|}.$$
]
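For concreteness, here is a short sketch of the procedure in Python/NumPy. This is an illustration, not part of the original notes; the function name gram_schmidt, the tolerance tol used to decide when a $w$ vector counts as zero, and the example vectors are all choices made for this sketch.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize v_1, ..., v_k; return the u vectors and the indices j_1 < ... < j_l."""
    us = []                      # orthonormal vectors found so far
    js = []                      # indices where the spans V_j truly increase (1-based)
    for m, v in enumerate(vectors, start=1):
        w = v.astype(float)
        for u in us:             # subtract the components along u_1, ..., u_i
            w = w - (w @ u) * u
        if np.linalg.norm(w) > tol:          # first nonzero w -> a new u vector
            us.append(w / np.linalg.norm(w))
            js.append(m)
        # otherwise v_m is already in the span of the previous vectors; skip it
    return us, js

# example: v3 = v1 + v2 is linearly dependent, so the indices jump over 3
v1 = np.array([1.0, 0.0, 0.0, 1.0])
v2 = np.array([1.0, 1.0, 0.0, 0.0])
v3 = v1 + v2
v4 = np.array([0.0, 0.0, 1.0, 0.0])
us, js = gram_schmidt([v1, v2, v3, v4])
print(js)                                          # [1, 2, 4]
print(np.round(np.array(us) @ np.array(us).T, 6))  # 3 x 3 identity: orthonormality
```

Using $w$ rather than $v$ in the inner products (the "modified" Gram-Schmidt variant) gives the same vectors in exact arithmetic but behaves better numerically.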
6.2 Orthogonal projections
Let $V \subset \mathbb{R}^n$ be a subspace with an orthonormal basis $\{u_1, \dots, u_k\}$, and for any $w \in \mathbb{R}^n$ consider
$$Pw = \sum_{i=1}^{k} (u_i^t w)\, u_i.$$
If $Q$ denotes the $n \times k$ matrix whose columns are $u_1, \dots, u_k$, then $Pw = QQ^t w$, and notice that $Q^t Q = I_k$, the $k \times k$ identity matrix. The following is the key theorem.
Theorem 6.1 The matrix $P = QQ^t$ defined above is independent of the orthonormal basis chosen; it depends only on the subspace $V$. It has the properties
$$P^2 = P, \qquad P^t = P. \qquad (6.1)$$
(iii.) Every vector $w \in \mathbb{R}^n$ can be written uniquely as $w = v + v^\perp$ with $v \in V$, $v^\perp \in V^\perp$.
(iv.) If $P$ is the orthogonal projection onto $V$, and $P^\perp$ is the orthogonal projection onto $V^\perp$, then $P + P^\perp = I$ (identity). The range and the kernel of $P$ are given as
$$R(P) = V \qquad \text{and} \qquad \mathrm{Ker}(P) = V^\perp,$$
or, in the special case when $V = R(A)$ is the column space of a matrix $A$,
$$N(A^t) = R(A)^\perp. \qquad (6.2)$$
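These properties are easy to verify numerically. The sketch below is my own illustration (the matrix $A$ is an arbitrary example): it builds $Q$ from an orthonormal basis of a two-dimensional subspace of $\mathbb{R}^4$ and checks (6.1) and the statements in (iv.).

```python
import numpy as np

# an arbitrary 2-dimensional subspace V = R(A) of R^4
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [2.0, 1.0]])
Q, _ = np.linalg.qr(A)          # columns of Q: an orthonormal basis of R(A)
P = Q @ Q.T                     # orthogonal projection onto R(A)

print(np.allclose(P @ P, P))    # P^2 = P
print(np.allclose(P.T, P))      # P^t = P
P_perp = np.eye(4) - P          # orthogonal projection onto R(A)^perp
print(np.allclose(P + P_perp, np.eye(4)))    # P + P^perp = I

b = np.array([1.0, 2.0, 3.0, 4.0])
print(np.allclose(A.T @ (P_perp @ b), 0.0))  # P^perp b lies in N(A^t) = R(A)^perp
```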
Finally, recall that any subspace $V$ can be described in two different ways: in parametric form or with constraints. Either you give a basis $\{u_1, \dots, u_k\}$ in $V$ (and orthonormal bases are usually even better), and use that any element $v$ of $V$ can be uniquely written as
$$v = \sum_{j=1}^{k} a_j u_j,$$
or you describe $V$ by constraints, i.e. as the set of vectors $v$ satisfying $u_i^t v = 0$ for a collection of vectors $u_i$ spanning $V^\perp$. For example, if $V = R(A)$ is the column space of a matrix $A$, run elimination on the augmented matrix $[A \mid b]$ with a general right hand side $b$: each fully zero row of the eliminated $A$ leaves behind a constraint on $b$, and the coefficient vector $u$ of this constraint spans $N(A^t)$, the nullspace of the transpose matrix (CHECK!). If there is only one fully zero row, then $N(A^t)$ is one dimensional, but you could have ended up with more than one fully zero row after the elimination with a general right hand side $b$. In that case you have as many constraints as fully zero rows, and this is also the dimension of $N(A^t)$. Also these constraints, written as $u_i^t b = 0$, give immediately an orthogonal basis in $N(A^t)$ (which you can normalize if you wish).
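As a numerical illustration of the two descriptions (again my own sketch, using SciPy's null_space routine instead of hand elimination, and an arbitrary matrix $A$): the columns of U below are the constraint vectors, and a vector $b$ lies in $R(A)$ exactly when $U^t b = 0$.

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [2.0, 1.0]])
U = null_space(A.T)             # columns: an orthonormal basis of N(A^t) = R(A)^perp
print(U.shape[1])               # dim N(A^t) = 4 - rank(A) = 2

b_in  = A @ np.array([3.0, -1.0])    # a vector in R(A)
b_out = b_in + U[:, 0]               # push it out of R(A)
print(np.allclose(U.T @ b_in, 0.0))  # constraints u_i^t b = 0 hold -> True
print(np.allclose(U.T @ b_out, 0.0)) # ... and fail here            -> False
```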
For more details and examples, see Eric Carlen's notes:
http://www.math.gatech.edu/~carlen/1502/html/pdf/proj.pdf
and with Maple solutions to the problems:
http://www.math.gatech.edu/~carlen/1502/html/pdf/prj.pdf
6.3 QR decomposition
Once you have understood the Gram-Schmidt procedure, the QR decomposition is easy. The key point is that the QR decomposition runs a Gram-Schmidt algorithm on the column vectors of a matrix $A$, starting from the leftmost vector. It ignores those columns which are linearly dependent on the previous ones (hence, in the pivoting language, it picks only the pivot columns). The columns of the matrix $Q$ are therefore the Gram-Schmidt output of the pivot column vectors. The pivot columns can in turn be expressed in terms of the orthonormalized columns in an upper triangular form:
$$\begin{aligned}
a_1 &= r_{11} q_1 \\
a_2 &= r_{12} q_1 + r_{22} q_2 \\
a_3 &= r_{13} q_1 + r_{23} q_2 + r_{33} q_3 \\
&\ \ \vdots
\end{aligned}$$
where $a_1, a_2, \dots$ are the pivot columns of $A$. (Note: here the $a_i$, the columns of $A$, are the vectors to be orthonormalized, i.e. they play the role of the vectors $v_i$ in Section 6.1. The vectors $q_i$ are the resulting orthonormal vectors; they play the role of the $u_i$ vectors in Section 6.1.)
If all columns are pivot columns, then one immediately has $A = QR$ with
$$A = \begin{bmatrix} a_1 & a_2 & \dots & a_k \end{bmatrix}, \qquad Q = \begin{bmatrix} q_1 & q_2 & \dots & q_k \end{bmatrix}$$
and
$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} & \dots \\ 0 & r_{22} & r_{23} & \dots \\ 0 & 0 & r_{33} & \dots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}.$$
If $A$ has nonpivot columns, then $R$ contains columns which express these nonpivot columns in terms of those columns of $Q$ which were obtained from the preceding pivot columns of $A$. In general, the decomposition looks like
$$A = QR = Q \begin{bmatrix} \underline{*} & * & * & * & * \\ 0 & 0 & \underline{*} & * & * \end{bmatrix}$$
(where $*$ denotes elements not necessarily zero). The first nonzero (pivot) elements are underlined in each row of $R$; these determine the location of the linearly independent columns in $A$. In this example the column space of $A$ is two dimensional. The two columns of $Q$ form an orthonormal basis in $R(A)$. These two columns are obtained by applying Gram-Schmidt to the first and third columns of $A$. The coefficients in this Gram-Schmidt procedure are in the first and third columns of $R$. Finally, the coefficients in the second, fourth and fifth columns of $R$ express the remaining (not linearly independent) column vectors of $A$ as linear combinations of the columns of $Q$. One can easily express these coefficients (i.e. the matrix elements of $R$) as
$$r_{ij} = q_i^t a_j.$$
These properties are summarized in the following theorem.
Theorem 6.3 Let $A$ be an $n \times k$ matrix and let $r = \mathrm{rank}(A)$. Then there exist a matrix $Q$ of dimensions $n \times r$ consisting of orthonormal columns and an upper triangular matrix $R$ of dimension $r \times k$ such that the following properties hold:
$A = QR$;
$R(A) = R(Q)$. In particular, the columns of $Q$ form an orthonormal basis in $R(A)$, hence $Q^t Q = I$. The matrix $P = QQ^t$ is the orthogonal projection from $\mathbb{R}^n$ onto $R(A)$;
$\mathrm{Ker}\, A = \mathrm{Ker}\, R$, in particular $\mathrm{rank}(R) = r$.
Finally, it is always possible to ensure that the first nonzero element in each row of $R$ be nonnegative. With this extra requirement, such a decomposition is unique.
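For a matrix whose columns are all pivot columns, NumPy's reduced QR returns exactly such a factorization, and the relation $r_{ij} = q_i^t a_j$, i.e. $R = Q^t A$, can be checked directly. This is my own illustration with an arbitrary full-column-rank matrix; note that np.linalg.qr may return $R$ with negative diagonal entries, so it need not follow the sign convention that makes the factorization unique.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])          # 4 x 3, full column rank
Q, R = np.linalg.qr(A)                   # reduced QR: Q is 4 x 3, R is 3 x 3

print(np.allclose(A, Q @ R))             # A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))   # orthonormal columns: Q^t Q = I
print(np.allclose(R, Q.T @ A))           # r_ij = q_i^t a_j
print(np.allclose(R, np.triu(R)))        # R is upper triangular
```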
6.4 Least squares
The method of least squares aims at finding a vector $x$, for a given $n \times k$ matrix $A$ and $n$-vector $b$, such that $\|Ax - b\|^2$ is the smallest possible. Why is this interesting? Of course if $b \in R(A)$, then just choose $x$ to be (one of) the solutions to $Ax = b$, and this reaches the smallest possible value (namely zero). What if $b \notin R(A)$? This happens especially often if you have many equations and few unknowns ($n \gg k$). For example, this is the typical problem with curve fitting.
In this case you cannot solve $Ax = b$ exactly, but you can aim for a solution $x$ such that $Ax$ is as close as possible to $b$. This is given by the QR factorization as well:
Theorem 6.4 (Least squares for overdetermined systems) Let $A = QR$ be the QR-factorization of the matrix $A$. Then the equation
$$Rx = Q^t b$$
has a solution for any $b \in \mathbb{R}^n$. Every such solution minimizes the expression $\|Ax - b\|^2$, i.e.
$$\|Ax - b\|^2 \le \|Ay - b\|^2$$
for any vector $y \in \mathbb{R}^k$.
For the proof, in a nutshell, recall that $QQ^t$ is the orthogonal projection onto $R(A)$ in $\mathbb{R}^n$. Hence the point closest to $b$ in $R(A)$ is $QQ^t b$, i.e. we have to solve $Ax = QQ^t b$ instead of $Ax = b$. Since $A = QR$, we have $QRx = QQ^t b$. Multiplying by $Q^t$ from the left and using $Q^t Q = I$ we get exactly $Rx = Q^t b$.
Finally we show that $Rx = v$ has a solution for any $v \in \mathbb{R}^r$ (in particular for $v = Q^t b$). But this is clear since $\mathrm{rank}(R) = r$, hence the column space of $R$ is all of $\mathbb{R}^r$.
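In code this recipe reads: factor $A = QR$ and solve the triangular system $Rx = Q^t b$. Below is a minimal sketch with made-up data for fitting a line $y \approx c_0 + c_1 t$ (the data values are invented for illustration), compared against NumPy's built-in least squares solver.

```python
import numpy as np
from scipy.linalg import solve_triangular

# overdetermined system: 6 equations, 2 unknowns (fit y = c0 + c1*t)
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])
A = np.column_stack([np.ones_like(t), t])       # 6 x 2 design matrix

Q, R = np.linalg.qr(A)                          # reduced QR of A
x = solve_triangular(R, Q.T @ b)                # solve R x = Q^t b
print(x)                                        # the least squares coefficients

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)   # reference solution
print(np.allclose(x, x_ref))                    # same minimizer
```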
There is a least squares method for underdetermined systems as well, but it is less frequently used. It selects the solution to $Ax = b$ with the smallest norm, assuming that there is a solution at all.
Theorem 6.5 (Minimal solution for the underdetermined case) For any $b \in \mathbb{R}^n$, if $Ax = b$, then $A(P_r x) = b$ as well, where $P_r$ is the orthogonal projection onto the row space of $A$. In other words, $x^* = P_r x$ is also a solution, and in fact
$$\|x^*\| < \|x\|$$
for any other solution $x$ of $Ax = b$.
For the proof, just recall that
$$Ax = AP_r x + A(I - P_r)x,$$
but $I - P_r$ is the projection onto the orthogonal complement of the row space of $A$, hence $A(I - P_r) = 0$. This gives $Ax = Ax^*$ for $x^* = P_r x$. The minimality of $\|x^*\|$ follows from the basic property of orthogonal projections.
Recall that to get $P_r$ you have to find the QR-factorization of $A^t$: if $A^t = QR$, then $P_r = QQ^t$.
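A minimal sketch of this recipe (my own example; the system and the particular solution are made up): project any particular solution onto the row space with $P_r = QQ^t$, where $A^t = QR$, and compare with the pseudoinverse solution, which is also the minimal-norm one.

```python
import numpy as np

# underdetermined system: 2 equations, 4 unknowns
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0]])
b = np.array([3.0, 2.0])
x0 = np.array([1.0, 1.0, 1.0, 0.0])     # one particular solution: A @ x0 == b

Q, R = np.linalg.qr(A.T)                # A^t = QR; columns of Q span the row space of A
P_r = Q @ Q.T                           # orthogonal projection onto the row space
x_star = P_r @ x0                       # the minimal-norm solution

print(np.allclose(A @ x_star, b))                  # still solves the system
print(np.allclose(x_star, np.linalg.pinv(A) @ b))  # agrees with the pseudoinverse solution
print(np.linalg.norm(x_star) < np.linalg.norm(x0)) # and it is shorter than x0
```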
The least squares method has numerous applications; some of them are found on the web page mentioned above:
http://www.math.gatech.edu/~carlen/1502/html/pdf/least.pdf
Here is another application, which goes back to Gauss, the inventor of the method. (This part is taken from Applied Numerical Linear Algebra by J. W. Demmel.)
6.4.1 An application: geodetic surveying
How do you measure distances on a real landscape? How do you measure the height of the mountains? In other words, how do you figure out the coordinates of a given geographic point with respect to the standard coordinate system of the Earth (latitude, longitude, height with respect to sea level)? The modern Global Positioning System (GPS) uses satellites, but let us go back to times when everything had to be done on the Earth...
The way to do it was to put reference points (so-called landmarks) all over the terrain at well visible points (when you hike, you can still see them on the tops of big mountains). The US geodetic database consisted of about 700,000 landmarks (as of 1974), more or less uniformly spaced points at visible distances (a few miles) from each other. The goal is to find the coordinates of these points very accurately. The number of unknowns is about 2,100,000. In fact Gauss, in the nineteenth century, was asked to solve a similar problem in Germany (of course with much smaller numbers). And he invented the method for this purpose...
The quantity which can be measured very accurately is the angle. At any landmark $P$ they measured the angle between the lines $PQ_i$ and $PQ_j$ for a few nearby (visible) landmarks $Q_i$, $Q_j$. In this way one obtained a few numbers for each landmark; for definiteness, let's say ten angles for each $P$. Hence altogether they obtained 7,000,000 numbers. Of course these numbers are not independent of each other; from elementary geometry we know lots of relations between them (most notably, the sum of the angles of a triangle...). In any case, using the cosine theorem, one has 7,000,000 equations between the unknown coordinates and the measured angles. These equations are actually not exactly linear, but one can linearize
them in a consistent way. Hence the problem is to solve a huge linear system $Ax = b$ of 7,000,000 equations with 2,100,000 unknowns. If everything were measured perfectly, there would be an exact solution to this overdetermined system. But the measured data are not exact, and since we have many more equations than unknowns, the errors most likely drive $b$ out of the column space of $A$. But one can search for the least squares solution.
In 1978 such a system was solved for updating the US geodetic database, with about 2.5 million equations and 400,000 unknowns, which was the biggest least squares problem ever solved at the time (some elementary geometric relations made it possible to reduce the number of equations and unknowns a bit). The actual computation heavily used a further special structure of the matrix $A$, namely that it is a very sparse matrix (most elements are zero). This is because all relations expressed by the measured angles are actually relations between neighboring landmarks: each equation involves only a few of the 400,000 variables. This is very typical in many applications, and numerical algorithms for sparse matrices are a well-developed separate branch of mathematics.