Geological Data Analysis - Matrix Algebra - Paul Wessel
SECTION 3
LINEAR (MATRIX) ALGEBRA
OVERVIEW OF MATRIX ALGEBRA
(or All you ever wanted to know about Linear Algebra but were afraid to ask...)
Consider a matrix A of order 4 x 3 in which, say, the element a23 = 11 and a13 = 10. The notation for matrices is
not always consistent but is usually one of the following schemes:
matrix - designated by a bold letter (most common); a capital letter; or a letter with an underscore, brackets, or hat (^). The order is also sometimes given: A(4,3) means A is 4 x 3.
order - always given as row x column, but the letters n, m, p are used differently: n (rows) x m (columns) or m (rows) x n (columns)
element - most commonly aij, with i = row and j = column (sometimes k, l, p)
The advantages of matrix algebra lie mainly in the fact that it provides a concise and simple method for manipulating large sets of numbers or computations, making it ideal for computers. Also, (1) the compact form of matrices allows convenient notation for describing large tables of data; (2) the operations allow complex relationships to be seen which would otherwise be obscured by the sheer size of the data (i.e., it aids clarification); and (3) most matrix manipulation involves just a few standard operations for which standard subroutines are readily available.
As a convention with data matrices (i.e., the elements represent data values), the columns usually represent the different variables (e.g., one column contains temperatures, another salinity, etc.) while the rows contain the samples (e.g., the values of each variable at each depth or time). Since there are usually more samples than variables, such data matrices are usually rectangular, having more rows (m) than columns (n) - order m x n where m > n.
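A minimal sketch of this convention (using NumPy; the values and names are hypothetical, not from the text):

```python
import numpy as np

# Hypothetical data matrix: rows = samples (e.g., depths), columns = variables
# (e.g., temperature, salinity). Order is m x n with m > n.
data = np.array([
    [4.1, 34.9],   # sample 1
    [3.8, 35.0],   # sample 2
    [3.5, 35.1],   # sample 3
    [3.2, 35.2],   # sample 4
])
m, n = data.shape
print(m, n)   # 4 2: more samples (rows) than variables (columns)
```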
Column vector is a matrix containing only a single column of elements:
a = [a_1; a_2; ...; a_n]

A Diagonal matrix is a square matrix whose only nonzero elements lie along the principal diagonal:

d_ij = 0 for i ≠ j;  d_ij non-zero for i = j
This type of matrix is important for scaling rows or columns of other matrices. The Identity
matrix (I) is a diagonal matrix with all of the nonzero elements equal to one. Written as I or In, it plays the role of 1 in matrix algebra (A·I = I·A = A). A Lower triangular matrix (L) is a square
matrix with all elements equal to zero above the principal diagonal:
L = [1 0 0; 3 7 0; 8 2 6], often written compactly showing only the nonzero part as [1; 3 7; 8 2 6]
or
l_ij = 0 for i < j;  l_ij non-zero for i ≥ j
An Upper triangular matrix is a square matrix with all elements equal to zero below the principal
diagonal:
u_ij = 0 for i > j;  u_ij non-zero for i ≤ j
(If one multiplies 2 triangular matrices of the same form, the result is a third matrix of the same
form).
We also have the Fully populated matrix, which has all of its elements nonzero; the Sparse matrix, which has only a small proportion of its elements nonzero; and the Scalar, which is simply a number (i.e., a matrix with a single element).
Matrix transpose (or transpose of a matrix) is obtained by interchanging the rows and columns
of a matrix. So row i becomes column i and column j becomes row j (also the order of the matrix
is reversed).
A = [1 14; 6 7; 8 2];  A^T = [1 6 8; 14 7 2]
Any square matrix can be split into the sum of a symmetric and a skew-symmetric matrix:

A = (1/2)(A + A^T) + (1/2)(A - A^T)
Basic Matrix Operations:
Matrix addition and subtraction requires matrices of the same order since this operation
simply involves addition or subtraction of corresponding elements. So, if A + B = C,
A = [a_11 a_12; a_21 a_22; a_31 a_32];  B = [b_11 b_12; b_21 b_22; b_31 b_32];  C = [a_11+b_11  a_12+b_12; a_21+b_21  a_22+b_22; a_31+b_31  a_32+b_32]
and
(1) A + B = B + A
Scalar multiplication of a matrix scales every element:

c A = c [a_11 a_12; a_21 a_22; a_31 a_32] = [c a_11  c a_12; c a_21  c a_22; c a_31  c a_32]

where c is a scalar. The Scalar product (or dot product or inner product) is the product of 2 vectors of the same size:
a · b = s

where a is a row vector (or the transpose of a column vector) of length n, b is a column vector (or the transpose of a row vector), also of length n, and s is the scalar product of a · b. Then, for n = 3,

a = [a_1 a_2 a_3];  b = [b_1; b_2; b_3]

and

s = a_1 b_1 + a_2 b_2 + a_3 b_3
Some people like to visualize this multiplication as:
a = [2 1 4 5];  b = [1; 3; 4; 2];  a · b = 2(1) + 1(3) + 4(4) + 5(2) = 31
Fig. 3-1. Dot product of two vectors.
Conceptually, this product can be thought of as multiplying the length of one vector by the
component of the other vector which is parallel to the first:
|a| · |b| · cos(θ)
Fig. 3-2. Graphical meaning of the dot product of two vectors.
Think of b as a force and a as a displacement; the scalar product is then the work done, the magnitude of the displacement times the component of the force parallel to it. Thus:

a · b = |a| |b| cos(θ)

where θ is the angle between the two vectors and the length (norm) of a vector x is

|x| = √(x_1^2 + x_2^2 + ... + x_n^2)
The maximum principle says that the unit vector n making a · n a maximum is that unit vector pointing in the same direction as a: if n ∥ a then cos(θ) = cos(0°) = 1 and a · n = |a| |n| cos(θ) = |a| |n| = |a|. This is equally true where d is any vector of a given magnitude - that vector n which parallels d will give the largest scalar product.
Parallel vectors thus have cos(θ) = 1, so a · b = |a| |b| and a = λ b (i.e., 2 vectors are parallel if one is simply a scalar multiple of the other - this property comes from equating direction cosines), where

λ = |a| / |b|
Perpendicular vectors have cos(θ) = cos(90°) = 0, so a · b = 0 when a ⊥ b. Squaring vectors is simply:

a^2 = a · a^T for row vectors
a^2 = a^T · a for column vectors
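A quick numerical check of these relations (a sketch in NumPy, using the vectors of Fig. 3-1):

```python
import numpy as np

a = np.array([2.0, 1.0, 4.0, 5.0])
b = np.array([1.0, 3.0, 4.0, 2.0])

s = a @ b                     # scalar (dot) product: 31.0
cos_theta = s / (np.linalg.norm(a) * np.linalg.norm(b))
print(s, cos_theta)           # consistent with a.b = |a||b|cos(theta)

# Perpendicular vectors give a zero scalar product:
print(np.array([1.0, 0.0]) @ np.array([0.0, 1.0]))   # 0.0
```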
Matrix multiplication requires "conformable" matrices. Conformable matrices are ones in which
there are as many columns in the first as there are rows in the second:
C (m ,n) = A (m,p ) ⋅ B (p, n)
So, the product matrix C is of order m x n and has elements cij:
c_ij = Σ_{k=1}^{p} a_ik b_kj
This is an extension of the scalar product - in this case, each element of C is the scalar product of a row vector in A and a column vector in B.
[c_11 c_12; c_21 c_22] = [a_11 a_12 a_13; a_21 a_22 a_23] [b_11 b_12; b_21 b_22; b_31 b_32]

Fig. 3-3. The matrix product of two matrices.
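The element-by-element rule c_ij = Σ a_ik b_kj translates directly into code. A sketch with made-up matrices, checked against NumPy's built-in product:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # order 2 x 3  (m x p)
B = np.array([[1, 0],
              [2, 1],
              [0, 3]])         # order 3 x 2  (p x n)

m, p = A.shape
p2, n = B.shape
assert p == p2                 # conformable: columns of A = rows of B

C = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(p):     # c_ij = sum over k of a_ik * b_kj
            C[i, j] += A[i, k] * B[k, j]

print(np.allclose(C, A @ B))   # True
```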
A is pre-multiplied by B (for B · A)
A is post-multiplied by B (for A · B)
Multiple products:
D = A · B · C = (A · B) · C = A · (B · C)

(The order in which the pairs are multiplied is not important mathematically.)
Computational considerations:
C(m,n) = A(m,p) · B(p,n)

involves m × n × p multiplications and m × n × (p - 1) additions, so in

E(m,n) = [A(m,p) · B(p,q)] · C(q,n)

forming D = A · B costs m × p × q multiplications and [D(m,q)] · C(q,n) costs m × q × n more, while in

E(m,n) = A(m,p) · [B(p,q) · C(q,n)]

forming D = B · C costs p × q × n multiplications and A(m,p) · [D(p,n)] costs m × p × n more.
Therefore:
1) (A · B) · C ⇒ mq(p + n) total multiplications
2) A · (B · C) ⇒ pn(m + q) total multiplications

If both A and B are 100 x 100 matrices and C is 100 x 1, then m = 100, p = 100, q = 100, and n = 1. Multiplying using form 1) involves ~1 x 10^6 multiplications, whereas form 2) involves 2 x 10^4; so computing B · C first, then pre-multiplying by A, saves almost a million multiplications
and almost an equal number of additions in this example. Therefore order is extremely important
computationally for both speed and accuracy (more operations lead to a greater accumulation of
round-off errors).
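A sketch (with hypothetical random matrices) that prints the predicted multiplication counts and times the two groupings; exact timings will vary by machine:

```python
import time
import numpy as np

m = p = q = 100
n = 1
A = np.random.rand(m, p)
B = np.random.rand(p, q)
c = np.random.rand(q, n)

print(m * q * (p + n))   # (A.B).C : 1,010,000 multiplications
print(p * n * (m + q))   # A.(B.C) :    20,000 multiplications

t0 = time.perf_counter()
_ = (A @ B) @ c
t1 = time.perf_counter()
_ = A @ (B @ c)
t2 = time.perf_counter()
print(t1 - t0, t2 - t1)  # the second grouping is markedly faster
```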
Multiplying any matrix by the identity matrix leaves it unchanged; in box form:

A = [3 6 9; 2 8 7];  A · I = [3 6 9; 2 8 7]
C = D · A, where D is a diagonal matrix. C is the A matrix with each row scaled by a diagonal
element of D:
D = [d_11 0 0; 0 d_22 0; 0 0 d_33];  A = [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33]

so that row i of C = D · A is row i of A multiplied by d_ii.
Post-multiplication by a diagonal matrix produces a matrix in which each column has been scaled by a diagonal element:

C = A · D

where each column in A has been scaled by the corresponding diagonal matrix element, d_ii.

The determinant of a square matrix is a single number computed from its elements; its meaning is dependent upon what the matrix represents. The main use here is for finding the inverse of a matrix or solving simultaneous equations. Symbolically, the determinant is usually given as det A, |A|, or ||A|| (to differentiate from magnitude). Calculation of a 2 x 2 determinant is given by:
|A| = |a_11 a_12; a_21 a_22| = a_11 a_22 - a_12 a_21
This is the difference of the cross products. The calculation of an n x n determinant is given by
|A| = a_11 m_11 - a_12 m_12 + a_13 m_13 - ··· - (-1)^n a_1n m_1n
where m11 is the determinant with the first row and column missing; m12 is the determinant with
the first row and second column missing; etc. (The determinant of a 1 × 1 matrix is just the
particular element.) An example of a 3 x 3 determinant follows.
A = [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33]

m_11 = |a_22 a_23; a_32 a_33| = a_22 a_33 - a_23 a_32

m_12 = |a_21 a_23; a_31 a_33| = a_21 a_33 - a_23 a_31

m_13 = |a_21 a_22; a_31 a_32| = a_21 a_32 - a_22 a_31
So
|A| = a_11 m_11 - a_12 m_12 + a_13 m_13
    = a_11(a_22 a_33 - a_23 a_32) - a_12(a_21 a_33 - a_23 a_31) + a_13(a_21 a_32 - a_22 a_31)
For a 4 x 4 determinant, each m_1i would be an entire expansion like the 3 x 3 one given above - one quickly needs a computer. For example, with A = [1 6 4; 2 1 0; 5 -3 -4]:

|A| = a_11(a_22 a_33 - a_23 a_32) - a_12(a_21 a_33 - a_23 a_31) + a_13(a_21 a_32 - a_22 a_31)
    = 1[1(-4) - 0(-3)] - 6[2(-4) - 0(5)] + 4[2(-3) - 1(5)] = -4 + 48 - 44 = 0
The degree to which the elements cluster symmetrically about the principal diagonal is another (of many) properties reflected in the determinant: the more the clustering, the higher the value of the determinant.
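The expansion in minors translates into a short recursive routine. This sketch implements the first-row cofactor expansion and checks it on the (singular) matrix of the worked example:

```python
import numpy as np

def det(A):
    """Determinant by cofactor expansion along the first row."""
    A = np.asarray(A, dtype=float)
    if A.shape == (1, 1):
        return A[0, 0]          # 1 x 1: the determinant is the element itself
    total = 0.0
    for j in range(A.shape[1]):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det(minor)
    return total

A = [[1, 6, 4],
     [2, 1, 0],
     [5, -3, -4]]
print(det(A))             # 0.0, i.e., -4 + 48 - 44 as in the example
print(np.linalg.det(A))   # ~0: the matrix is singular
```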
Matrix division can be thought of as multiplying by the inverse. Consider scalar division:
x / b = x (1/b) = x b^{-1}

which works because

b b^{-1} = 1
Matrices can be effectively divided by multiplying by the inverse matrix. Nonsingular square matrices may have an inverse, symbolized as A^{-1}, with A A^{-1} = I. The calculation of a matrix inverse is usually done using elimination methods on the computer. For a simple 2 x 2 matrix, the inverse is given by:
A^{-1} = (1/|A|) [a_22 -a_12; -a_21 a_11]
An example follows.
A = [7 2; 10 3]

A^{-1} = 1/(21 - 20) [3 -2; -10 7] = [3 -2; -10 7]

A A^{-1} = [7 2; 10 3] [3 -2; -10 7] = [1 0; 0 1] = I
Consider now a set of simultaneous equations A · x = b in four unknowns, with

x = [x_1; x_2; x_3; x_4];  b = [b_1; b_2; b_3; b_4]
Then
A^{-1} A x = A^{-1} b   (pre-multiplying both sides by A^{-1})
so
I x = x = A^{-1} b

gives the solution for the values of x_1, x_2, x_3, x_4 which solve the system. The following example
solves for 2 simultaneous equations. Consider 2 equations in 2 unknowns (e.g., equations of lines
in the x-y plane):
5 x_1 + 7 x_2 = 19
3 x_1 - 2 x_2 = -1
In matrix form this translates to:
[5 7; 3 -2] [x_1; x_2] = [19; -1]
A ⋅ x = b
To solve this system, we need the inverse of A:

A^{-1} = 1/(-10 - 21) [-2 -7; -3 5] = [2/31 7/31; 3/31 -5/31]
Then x = A^{-1} · b, where

A^{-1} b = [2/31 7/31; 3/31 -5/31] [19; -1] = [38/31 - 7/31; 57/31 + 5/31] = [1; 2]
or, in box form, the same multiplication again yields

x = [x_1; x_2] = [1; 2]
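In practice one rarely forms the inverse explicitly; a sketch solving the same 2 x 2 system both ways in NumPy:

```python
import numpy as np

A = np.array([[5.0, 7.0],
              [3.0, -2.0]])
b = np.array([19.0, -1.0])

print(np.linalg.inv(A) @ b)    # [1. 2.] via the explicit inverse
print(np.linalg.solve(A, b))   # [1. 2.] via elimination (usually preferred)
```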
Computational considerations

While this approach may seem burdensome, it is good because it is extremely general and allows simple handling and a straightforward solution of very large systems. However, it is true that direct (elimination) methods of solution are in fact quicker for fully populated matrices:

1) The inverse-matrix approach involves n^3 multiplications for the inversion and n^2 m more multiplications to finish the solution, where n is the number of equations per set, and m is the number of sets of equations (each of the same form but with a different b matrix). The total number of multiplications is n^3 + n^2 m.

2) Direct elimination, by contrast, requires roughly n^3/3 multiplications per factorization, and so involves fewer operations for modest m.
So, while the matrix form is easy to handle, one should not necessarily always use it blindly. We
will consider many situations for which matrix solutions are ideal. For sparse or symmetrical
matrices, the above relationships may not hold. The rank of a matrix is the number of linearly
independent vectors it contains (either row or column vectors):
A = [1 4 0 2; 1 0 1 -1; -3 -4 -2 0]
Since row 3 = -(row 1) - 2(row 2) or col 3 = col 1 - 1/4(col 2) and col 4 = -(col 1) + 3/4(col 2),
the matrix A has rank 2 (i.e., it has only 2 linearly independent vectors, independent of whether
viewed by rows or columns). The rank of a matrix product must be less than or equal to the smallest rank of the matrices being multiplied:

A (rank 2) · B (rank 1) = C (rank 1)

Therefore (from another angle), if a matrix has rank r, then any matrix factor of it must have rank of at least r. Since the rank cannot be greater than the smaller of m or n in an m x n matrix, this definition also limits the size (order) of factor matrices. (That is, one cannot factor a matrix of rank 2 into 2 matrices of which either is of less than rank 2, so m and n of each factor must also be ≥ 2.)
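A sketch verifying the rank and the stated row and column dependencies for this matrix:

```python
import numpy as np

A = np.array([[ 1,  4,  0,  2],
              [ 1,  0,  1, -1],
              [-3, -4, -2,  0]])

print(np.linalg.matrix_rank(A))                    # 2
print(np.allclose(A[2], -A[0] - 2 * A[1]))         # row 3 = -(row 1) - 2(row 2)
print(np.allclose(A[:, 2], A[:, 0] - A[:, 1]/4))   # col 3 = col 1 - 1/4(col 2)
```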
The trace of a square matrix is simply the sum of the elements along the principal diagonal. It
is symbolized as tr A. This property is useful in calculating various quantities from matrices.
Submatrices are smaller matrix partitions of a larger supermatrix, e.g.,

F = [A B; C D]
Some useful properties of transposes and inverses:

1. (A^T)^T = A
2. (A^{-1})^{-1} = A
3. (A^{-1})^T = (A^T)^{-1} = A^{-T}
4. If D = ABC, then D^{-1} = C^{-1} B^{-1} A^{-1}; recall that for D = ABC, D^T = C^T B^T A^T
This "reversal rule" for inverse products may be useful for eliminating or minimizing the number
of matrix inverses requiring calculation.
We will look at a few examples of matrix manipulations. For data matrix A:
A = [1 2 3; 4 5 6; 7 8 9]
and unit row vector j:
j = [1 1 1]
(1) Compute the mean of each column vector in A (each column has length n = 3):

x̄_c = (1/n) j A = j_n A, where j_n = [1/3 1/3 1/3]

Then

x̄_c = [1/3 1/3 1/3] [1 2 3; 4 5 6; 7 8 9] = [4 5 6]
(2) Similarly, compute the mean of each row vector in A by post-multiplying with the column vector j_n^T = [1/3; 1/3; 1/3]:

x̄_r = A j_n^T = [1 2 3; 4 5 6; 7 8 9] [1/3; 1/3; 1/3] = [2; 5; 8]
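Both mean computations as a short sketch:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
j = np.ones((1, 3))      # unit row vector

print((j @ A) / 3)       # column means: [[4. 5. 6.]]
print((A @ j.T) / 3)     # row means:    [[2.] [5.] [8.]]
```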
Consider again a linear system A · x = b with solution x = A^{-1} b. For a moment, let us just consider the left-hand side A · x. For any x, this product gives a new vector y. We can say that x is transformed to give y. This is a linear transformation since there are only linear terms in the matrix multiplication, i.e., each element of the vector y is a linear combination of the elements of x. Thus we call the operation T(x) = A · x a linear transformation. If we stick to three or fewer dimensions, it is possible to graphically visualize vectors and operations on them. Figure 3-4 shows an arbitrary vector x and the result y of the linear transformation y = A · x.
Fig. 3-4. The vector x is transformed into another vector y using a linear transformation.
Obviously, as we pick another x, we get another y. We might want to know if there are certain vectors that, when operated on, return a vector in the same direction, possibly longer or shorter than the original x. In other words, is there an x that satisfies
A x = λ x     (3.1)

We call λ the eigenvalue and x the eigenvector. We can rewrite this as

A x - λ x = 0
A x - λ I x = 0
(A - λ I) x = 0

or

B · x = 0 = n

where B = A - λ I and n is the null vector. In general, the solution to such an equation would be written

x = B^{-1} · n

which, whenever B^{-1} exists, gives only the trivial solution x = [0 0 0]^T. So apart from the trivial solution, an answer exists only when B^{-1} does not, that is, when

|B| = 0
We know the determinant of B is

|B| = |a_11 - λ   a_12   a_13;  a_21   a_22 - λ   a_23;  a_31   a_32   a_33 - λ| = 0
Writing out the determinant and setting it to zero gives a polynomial in λ of order n. For n = 3 this will in general give a cubic equation; for n = 2 a quadratic equation must be solved. The solutions λ_1, λ_2, etc. are called the eigenvalues of A, and the equation |B| = 0 is called the characteristic equation. For example, given
A = [17 -6; 45 -16]

|A - λ I| = |17 - λ   -6;  45   -16 - λ| = 0

or

-272 - 17λ + 16λ + λ^2 + 270 = λ^2 - λ - 2 = 0

We easily solve for λ:

λ = [1 ± √(1^2 - 4(-2))]/2 = (1 ± 3)/2 = 2, -1
So the eigenvalues are λ_1 = 2, λ_2 = -1. We now know what λ must be for (3.1) to be satisfied, but what about the vectors x? We still haven't found what they must be, but we can substitute the values for λ into (3.1). Using λ = 2 first, we find
A x = 2 x
(A - 2 I) x = 0

We find

[15 -6; 45 -18] [x_1; x_2] = [0; 0]

or

15 x_1 - 6 x_2 = 0
45 x_1 - 18 x_2 = 0

which both give

x_1 = (2/5) x_2

So

x = t [2; 5]
where t is any scalar. Similarly, for λ = -1, we find (A + I) · x = 0:

[18 -6; 45 -15] [x_1; x_2] = [0; 0]

which reduces to

3 x_1 - x_2 = 0

which gives

x = t [1; 3]
It may happen that the characteristic equation gives solutions that are imaginary. However, if the matrix is symmetric it will always yield real eigenvalues, and as long as the matrix A is not singular, all the λ_i will be non-zero and the corresponding eigenvectors will be orthogonal. The technique we've used applies to matrices of any size n × n, but finding the roots of large polynomials is painful. Usually, the λ_i are found by matrix manipulations that involve successive approximations to the x. This is of course only practical on a computer. If we restrict our attention to 2-D geometry, certain properties of eigenvalues and eigenvectors become clearer.
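On a computer the eigenvalues and eigenvectors come from a library routine; a sketch applying NumPy to the example above:

```python
import numpy as np

A = np.array([[17.0, -6.0],
              [45.0, -16.0]])
lam, X = np.linalg.eig(A)     # columns of X are the eigenvectors
print(lam)                    # [ 2. -1.]  (order may differ)

for k in range(2):            # check A x = lambda x for each eigenpair
    print(np.allclose(A @ X[:, k], lam[k] * X[:, k]))   # True

# Scaling each eigenvector by its second component recovers the hand
# solutions: [2/5, 1] ~ t[2, 5] and [1/3, 1] ~ t[1, 3].
print(X[:, 0] / X[1, 0], X[:, 1] / X[1, 1])
```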
Consider the matrix A
A = [4 8; 8 4]
We can regard the matrix as two row vectors [4 8] and [8 4]. Let us find the eigenvalues and
eigenvectors of A :
|4 - λ   8;  8   4 - λ| = 0

16 - 8λ + λ^2 - 64 = 0

λ^2 - 8λ - 48 = 0

so λ_1 = 12 and λ_2 = -4. For λ_1 = 12, (A - 12 I) x = 0 gives

-8 x_1 + 8 x_2 = 0 ⇒ x_1 = x_2, so e_1^T = [1 1]

For λ_2 = -4, (A + 4 I) x = 0 gives

[4+4   8;  8   4+4] [x_1; x_2] = [8 8; 8 8] [x_1; x_2] = [0; 0] ⇒ x_1 = -x_2, so e_2^T = [-1 1]

We find that the eigenvectors define the major and minor axes of the ellipse which goes through the two points defined by (4,8) and (8,4). The lengths of these axes are given by the absolute values of the eigenvalues, λ_1 = 12 and |λ_2| = 4.
Fig. 3-6. The eigenvectors, scaled by the eigenvalues, can be seen to represent the major and minor axes of the
ellipse that goes through the two data vectors (8, 4) and (4, 8).
It is customary to normalize the eigenvectors so that their length is unity. In our case we find
e_1^T = [√2/2   √2/2]  and  e_2^T = [-√2/2   √2/2]
The axes of the ellipse are then simply

v_1 = λ_1 e_1
v_2 = λ_2 e_2
Since the sign of an eigenvector is indeterminate, we choose to make all eigenvalues positive and thus place the minus sign from λ_2 inside e_2. You'll notice that v_1 · v_2 = 0, i.e., they are orthogonal. The eigenvectors make up the columns in a new matrix V:

V = [e_1 e_2] = (√2/2) [1 -1; 1 1]
Let us expand the eigenvalue equation (3.1), A · x = λ x, to a full matrix equation. We have

A e_1 = λ_1 e_1
A e_2 = λ_2 e_2

These two relations can be combined into one equation:

1) A · V = V · Λ, where Λ = [λ_1 0; 0 λ_2]
2) Pre-multiplying by V^{-1}:

Λ = V^{-1} A V
This operation transforms the A matrix into a diagonal matrix Λ. It corresponds to a rotation of the coordinate axes in which the eigenvectors in V become the new coordinate axes. Relative to the new coordinates, Λ conveys the same information as A does in the old coordinates. Because Λ is a simple diagonal matrix, the rotation (transformation) makes the relationships between rows and columns in A much clearer.
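A sketch of this diagonalization for the 2 x 2 example:

```python
import numpy as np

A = np.array([[4.0, 8.0],
              [8.0, 4.0]])
lam, V = np.linalg.eig(A)        # symmetric A: real eigenvalues
print(lam)                       # [12. -4.]
print(np.linalg.inv(V) @ A @ V)  # Lambda: diagonal, up to round-off
print(V.T @ V)                   # ~identity: the eigenvectors are orthonormal
```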
Whereas an interpolant fits each data point exactly, it is frequently advantageous to produce a smoothed fit to the data - not exactly fitting each point, but producing a "best" fit. A popular (and convenient) method for producing such fits is known as the method of least squares.
The method of least squares produces a fit of a specified (usually continuous) basis to a set of data points which minimizes the sum of the squared mismatch (error) between the fitted curve and the data. The error can be measured as in Figure 3-7. This regression of y on x is the most common method. Less common methods (involving more work) are regression of x on y and orthogonal regression (which we will return to later).
Fig. 3-7. Graphical representation of the regression errors used in least-squares procedures.
Fig. 3-8. Two other regression methods: regressing x on y and orthogonal regression.
Consider fitting a single "best" linear slope to n data points. This can be a scatter plot of y(t),
x(t) plotted at similar values of t; or a simple f(x) relationship. At any rate, y is considered a
function of x. We wish to fit a line of the form
y = a_1 + a_2 (x - x_0)     (3.2)
and must therefore determine a value for a1 and a2 which produces a line that minimizes the sum
of the squared errors (x0 is specified beforehand). So
minimize Σ_{i=1}^{n} (y_computed - y_observed)^2

Each observation provides one equation:

a_1 + a_2 (x_1 - x_0) = y_1
a_1 + a_2 (x_2 - x_0) = y_2
a_1 + a_2 (x_3 - x_0) = y_3
⋮
a_1 + a_2 (x_n - x_0) = y_n
There are many more equations (n - one for each observed value of y) than unknowns (2 - a1 and
a2 ). Such a system is over-determined and there exists no unique solution (unless all the yi 's do
lie exactly on a single line, in which case any two equations will uniquely determine a1 and a2 ).
In matrix form (i.e., Ax = b):
[1 (x_1 - x_0); 1 (x_2 - x_0); ⋮ ; 1 (x_n - x_0)] [a_1; a_2] = [y_1; y_2; ⋮ ; y_n]     (3.3)
Since A is a non-square matrix it has no inverse, so the equation cannot be inverted and solved.
Consider instead the error in the fit at each point:
a_1 + a_2 (x_1 - x_0) - y_1 = e_1
a_1 + a_2 (x_2 - x_0) - y_2 = e_2
⋮
a_1 + a_2 (x_n - x_0) - y_n = e_n
We wish to determine the values for a1 and a2 that minimize
Σ_{i=1}^{n} e_i^2
This will minimize the variance of the residuals about the regression line and give the least-
squares fit. Notation:
S = E(a_1, a_2) = Σ_{i=1}^{n} e_i^2   ( = e^T e in matrix form)     (3.4)
So E(a_1, a_2) is a function of the two unknown coefficients, and its minimum can be determined using simple differential calculus, where

∂E(a_1,a_2)/∂a_1 = ∂E(a_1,a_2)/∂a_2 = 0   (at the minimum)
∂E/∂a_1 = ∂/∂a_1 Σ_{i=1}^{n} e_i^2 = ∂/∂a_1 Σ_{i=1}^{n} [a_1 + a_2 (x_i - x_0) - y_i]^2
        = 2 Σ_{i=1}^{n} [a_1 + a_2 (x_i - x_0) - y_i] = 0

∂E/∂a_2 = ∂/∂a_2 Σ_{i=1}^{n} e_i^2 = ∂/∂a_2 Σ_{i=1}^{n} [a_1 + a_2 (x_i - x_0) - y_i]^2
        = 2 Σ_{i=1}^{n} [a_1 + a_2 (x_i - x_0) - y_i] (x_i - x_0) = 0
These two equations can now be expanded into their individual terms, forming what are known as the normal equations:

n a_1 + a_2 Σ_{i=1}^{n} (x_i - x_0) = Σ_{i=1}^{n} y_i

a_1 Σ_{i=1}^{n} (x_i - x_0) + a_2 Σ_{i=1}^{n} (x_i - x_0)^2 = Σ_{i=1}^{n} y_i (x_i - x_0)

The normal equations thus provide a system of 2 equations in 2 unknowns which can be uniquely solved.
Notice that all sums are sums of known values that reduce to simple constants (all sums below run from i = 1 to n). Solving the first equation for a_1:

a_1 = (1/n) Σ y_i - (a_2/n) Σ (x_i - x_0)

Substituting this into the second equation:

(1/n) Σ y_i Σ (x_i - x_0) - (a_2/n) Σ (x_i - x_0) Σ (x_i - x_0) + a_2 Σ (x_i - x_0)^2 = Σ y_i (x_i - x_0)

a_2 [ Σ (x_i - x_0)^2 - (1/n) ( Σ (x_i - x_0) )^2 ] = Σ y_i (x_i - x_0) - (1/n) Σ y_i Σ (x_i - x_0)

Finally,

a_2 = [ Σ y_i (x_i - x_0) - (1/n) Σ y_i Σ (x_i - x_0) ] / [ Σ (x_i - x_0)^2 - (1/n) ( Σ (x_i - x_0) )^2 ]

Substitute a_2 into the first equation to solve for a_1. So, we compute the sums

Σ y_i,  Σ y_i (x_i - x_0),  Σ (x_i - x_0)^2,  and  Σ (x_i - x_0)
and substitute into the above equations to give the a_1 and a_2 producing the best fit. In matrix form the normal equations are:

| n               Σ (x_i - x_0)   | | a_1 |   | Σ y_i             |
| Σ (x_i - x_0)   Σ (x_i - x_0)^2 | | a_2 | = | Σ y_i (x_i - x_0) |     (3.5)
So, N x = B, which is of the familiar form A x = b. Since N is square and of full rank, this equation is solved in the standard manner:

N^{-1} N x = N^{-1} B

or

x = N^{-1} B
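A sketch (with made-up data) that forms the four sums, builds N and B as in (3.5), and solves the normal equations:

```python
import numpy as np

# Hypothetical data; x0 is specified beforehand as in (3.2)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
x0 = 0.0
dx = x - x0

N = np.array([[len(x),       dx.sum()],
              [dx.sum(), (dx**2).sum()]])
B = np.array([y.sum(), (y * dx).sum()])

a1, a2 = np.linalg.solve(N, B)
print(a1, a2)    # intercept a1 and slope a2 of the best-fitting line
```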
This problem was simple enough (2 x 2) to solve by brute force for a_1 and a_2. For larger systems this becomes impractical, and a matrix solution to the non-square A · x = b equation must be sought. We will next look at the general linear least-squares problem and find a solution in matrix notation.
We have looked at a few special cases where we have sought to fit a "model" to "data" in a least-squares sense. Fitting a straight line to the x-y points was a very simple example of this technique. We will now look at the more general problem of finding the coefficients for any linear combination of basis functions that fits some data in a least-squares sense. There are numerous situations where this is needed, e.g., fitting polynomials in x or fitting sums of sinusoids. While the basis functions in these cases are all vastly different, they are all used in a linear combination to fit the observed data. We will therefore take time to investigate how such a problem is set up, and how it can all be simplified with matrix algebra.
Consider the least-squares fitting of any continuous basis of the form

x_1, x_2, x_3, ..., x_m

For example, the power basis

x_1 = x^0,  x_2 = x^1,  x_3 = x^2,  ...,  x_m = x^{m-1}

or a basis of sinusoids

x_1 = sin(2π x/T),  x_2 = sin(4π x/T),  x_3 = sin(6π x/T),  ...,  x_m = sin(2mπ x/T)
The i'th error is the misfit of the linear combination at the i'th observation,

e_i = a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i

and the quantity to minimize is

E = Σ_{i=1}^{n} (e_i)^2 = Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i)^2     (3.6)

where x_ij is the j'th basis function evaluated at the value x_i. To minimize E, we set
∂E(a_j)/∂a_j = 0     (3.7)
So

∂E/∂a_1 = ∂/∂a_1 Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i)^2
        = 2 Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i) x_i1 = 0

∂E/∂a_2 = ∂/∂a_2 Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i)^2
        = 2 Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i) x_i2 = 0

⋮

∂E/∂a_m = ∂/∂a_m Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i)^2
        = 2 Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i) x_im = 0

or, in general,

∂E(a_j)/∂a_j = 2 Σ_{i=1}^{n} (a_1 x_i1 + a_2 x_i2 + ... + a_m x_im - y_i) x_ij = 0
Expanding these into individual sums gives, for each j,

a_1 Σ_{i=1}^{n} x_i1 x_ij + a_2 Σ_{i=1}^{n} x_i2 x_ij + ... + a_m Σ_{i=1}^{n} x_im x_ij = Σ_{i=1}^{n} y_i x_ij
This provides a closed system of m normal equations, one for each coefficient, i.e.,

∂E(a_j)/∂a_j = 0 for j = 1, 2, ..., m
In matrix form

| Σ x_i1^2      Σ x_i2 x_i1   ...   Σ x_im x_i1 | | a_1 |   | Σ y_i x_i1 |
| Σ x_i1 x_i2   Σ x_i2^2      ...   Σ x_im x_i2 | | a_2 | = | Σ y_i x_i2 |
|     ⋮             ⋮                   ⋮       | |  ⋮  |   |     ⋮      |
| Σ x_i1 x_im   Σ x_i2 x_im   ...   Σ x_im^2    | | a_m |   | Σ y_i x_im |

(all sums running from i = 1 to n)
or simply

N · x = B     (3.8)

where N is the (known) coefficient matrix, x the vector with the unknowns (a_j), and B contains weighted sums of known (observed) quantities. Solve for the a_j in the x vector (N is square and of full rank):

N^{-1} · N · x = N^{-1} · B

x = N^{-1} · B     (3.9)
where x is the solution for the aj . The resulting aj values are the ones which satisfy (3.7) and
therefore the same ones which, in combination with the chosen basis, produce the "best" fit to the
n data points such that (3.6) is minimized.
Recall that the solution of the general linear least-squares problem led us to the matrix form N · x = B (3.8), with solution x = N^{-1} · B (3.9).
We will now look at a simpler approach to the same problem using matrix algebra. We have
| x_11 x_12 ... x_1m | | a_1 |   | y_1 |   | e_1 |
| x_21 x_22 ... x_2m | | a_2 | - | y_2 | = | e_2 |
|   ⋮                | |  ⋮  |   |  ⋮  |   |  ⋮  |
| x_n1 x_n2 ... x_nm | | a_m |   | y_n |   | e_n |

A · x - b = e

Since m < n, the A matrix is rectangular. This can be written

| x_11 x_12 ... x_1m -y_1 | | a_1 |   | e_1 |
| x_21 x_22 ... x_2m -y_2 | | a_2 |   | e_2 |
|   ⋮                     | |  ⋮  | = |  ⋮  |
| x_n1 x_n2 ... x_nm -y_n | | a_m |   | e_n |
                            |  1  |

C · X = e
These matrices describe the system listed earlier. We wish to find the a_j values which minimize E = e^T e. The matrix form can be partitioned as

[A  -b] [x; 1] = A · x + (-b · 1) = A · x - b = e

We will, for now, refer to the [A -b] matrix as the C matrix and the [x 1]^T column vector as the X vector. So, we have

C · X = e

(recall C is an n x (m+1) rectangular matrix). The i'th error, e_i, is the dot product:
C_i · X = e_i = [x_i1 x_i2 ... x_im -y_i] [a_1; a_2; ...; a_m; 1] = [a_1 a_2 ... a_m 1] [x_i1; x_i2; ...; x_im; -y_i]
where C_i is the i'th row vector in C. The squared i'th error, e_i^2, is then

e_i^T · e_i = (C_i · X)^T · (C_i · X) = X^T · C_i^T · C_i · X

or

e_i^T · e_i = [a_1 a_2 ... a_m 1] [x_i1; x_i2; ...; x_im; -y_i] [x_i1 x_i2 ... x_im -y_i] [a_1; a_2; ...; a_m; 1]
where we have used the reversal rule for transposed products. The sum of the e_i^2 over all the i's is thus

e^T · e = X^T · C^T · C · X = Σ_{i=1}^{n} X^T C_i^T C_i X
The product C^T C can be computed to form a new matrix R. Since C^T_{(m+1) x n} C_{n x (m+1)} = R_{(m+1) x (m+1)}, the resulting R matrix is square and symmetric. So,

E = e^T · e = X^T · C^T · C · X = X^T · R · X
To minimize E, we find

∂E(a_j)/∂a_j = ∂/∂a_j (X^T · R · X) = (∂X^T/∂a_j) R · X + X^T · R (∂X/∂a_j)

For the 2nd coefficient (as an example), we get

∂X^T/∂a_2 = [0 1 0 ... 0 0] = (∂X/∂a_2)^T
Thus, the partial derivative of the error with respect to a_2 is

∂E/∂a_2 = [0 1 0 ... 0 0] R [a_1; a_2; ...; a_m; 1] + [a_1 a_2 ... a_m 1] R [0; 1; 0; ...; 0]

Stacking the row vectors ∂X^T/∂a_j for j = 1, ..., m gives

| 1 0 ... 0 0 |
| 0 1 ... 0 0 |  = [I_m  O_m]
|   ⋮         |
| 0 0 ... 1 0 |
where I_m is the m x m identity matrix and O_m the null column vector of length m. Consider first R (= C^T C):
C = | x_11 x_12 ... x_1m -y_1 |
    | x_21 x_22 ... x_2m -y_2 |
    |   ⋮                     |
    | x_n1 x_n2 ... x_nm -y_n |

R = C^T C = | Σ x_i1^2      Σ x_i2 x_i1   ...   Σ x_im x_i1   -Σ y_i x_i1 |
            | Σ x_i1 x_i2   Σ x_i2^2      ...   Σ x_im x_i2   -Σ y_i x_i2 |
            |     ⋮                                                       |
            | Σ x_i1 x_im   Σ x_i2 x_im   ...   Σ x_im^2      -Σ y_i x_im |
            | -Σ y_i x_i1   -Σ y_i x_i2   ...   -Σ y_i x_im    Σ y_i^2    |
The R matrix should look familiar. Consider the partitioned matrix multiplication:
R = C^T C = | A^T  | [A  -b] = | A^T A    -A^T b |
            | -b^T |           | -b^T A    b^T b |

where A^T A is the m x m upper-left block, so that R is (m+1) x (m+1). Notice that A^T A is the matrix N of the normal equations and that A^T b = (b^T A)^T equals the matrix B.
Because R is symmetrical (so R = R^T), the two terms above are transposes of each other; since each is a scalar, they are equal. So

∂E(a_j)/∂a_j = (∂X^T/∂a) R X + X^T R (∂X/∂a) = 2 (∂X^T/∂a) R X = 0  ⇒  (∂X^T/∂a) R X = 0
Consider (∂X^T/∂a) · R = [I_m  O_m] · R: the identity block picks out the first m rows of R, while the last row of R (containing -b^T A and b^T b) is always multiplied by zero, leaving

[I_m  O_m] · R = [A^T A   -A^T b] = [N   -B]

Post-multiplying by X = [x; 1] then gives

(∂X^T/∂a) · R · X = A^T A x - A^T b = N x - B
Finally, since

(∂X^T/∂a) R X = A^T A x - A^T b = 0

we have

A^T A x = A^T b

(A^T A)^{-1} (A^T A) x = (A^T A)^{-1} A^T b

x = (A^T A)^{-1} A^T b     (3.10)

or

x = N^{-1} B
as before. Therefore, the unknown values of the a_j (in the x vector) can be solved for directly from the system:

a_1 x_11 + a_2 x_12 + ... + a_m x_1m = y_1
a_1 x_21 + a_2 x_22 + ... + a_m x_2m = y_2
⋮
a_1 x_n1 + a_2 x_n2 + ... + a_m x_nm = y_n

or simply

A · x = b

where A is of order n x m with n > m. The least-squares solution then becomes

x = (A^T · A)^{-1} A^T · b

where A^T A is a square matrix of full rank, with order r = m, and thus invertible.
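A sketch of the general solution (3.10), here with a hypothetical quadratic (power) basis; np.linalg.lstsq solves the same over-determined system directly and is the numerically safer route:

```python
import numpy as np

x = np.linspace(0.0, 5.0, 20)
y = 1.0 + 0.5 * x - 0.2 * x**2 + 0.05 * np.random.randn(20)

A = np.column_stack([x**0, x**1, x**2])      # n x m design matrix, n > m

# x = (A^T A)^{-1} A^T b, solved by elimination rather than explicit inverse
coef = np.linalg.solve(A.T @ A, A.T @ y)
print(coef)

print(np.linalg.lstsq(A, y, rcond=None)[0])  # same coefficients
```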
We will again consider the best-fitting line problem, this time with errors σ_i in the y-values. We want to measure how well the model agrees with the data, and for this purpose will use the χ^2 function, i.e.,

χ^2(a,b) = Σ_{i=1}^{n} [(y_i - a - b x_i) / σ_i]^2     (3.11)
Minimizing χ^2 will give the best weighted least-squares solution. Again we set the partial derivatives to 0:

∂χ^2/∂a = 0 = -2 Σ_{i=1}^{n} (y_i - a - b x_i) / σ_i^2
∂χ^2/∂b = 0 = -2 Σ_{i=1}^{n} x_i (y_i - a - b x_i) / σ_i^2     (3.12)
Defining the sums

S = Σ 1/σ_i^2,  S_x = Σ x_i/σ_i^2,  S_y = Σ y_i/σ_i^2,  S_xx = Σ x_i^2/σ_i^2,  S_xy = Σ x_i y_i/σ_i^2

and Δ = S S_xx - S_x^2, the solution is

a = (S_xx S_y - S_x S_xy) / Δ
b = (S S_xy - S_x S_y) / Δ     (3.13)
All this is swell, but we must also estimate the uncertainties in a and b. For the same σ_i we may get large differences in the errors in a and b. Although not shown here, consideration of propagation of errors shows that the variance σ_f^2 in the value of any function f is

σ_f^2 = Σ σ_i^2 (∂f/∂y_i)^2     (3.14)
For our line we can directly find the derivatives of a and b with respect to yi from (3.13):
Figure 3-9. The uncertainty in the line fit depends to a large extent on the distribution of the x-positions.
∂a/∂y_i = (S_xx - S_x x_i) / (σ_i^2 Δ)

∂b/∂y_i = (S x_i - S_x) / (σ_i^2 Δ)

so

σ_a^2 = Σ σ_i^2 (∂a/∂y_i)^2 = (1/Δ^2) [ S_xx^2 Σ 1/σ_i^2 - 2 S_xx S_x Σ x_i/σ_i^2 + S_x^2 Σ x_i^2/σ_i^2 ]
      = (S_xx^2 S - 2 S_xx S_x^2 + S_x^2 S_xx) / Δ^2 = S_xx (S S_xx - S_x^2) / Δ^2 = S_xx / Δ     (3.15)
and

σ_b^2 = Σ σ_i^2 (∂b/∂y_i)^2 = (1/Δ^2) [ S^2 Σ x_i^2/σ_i^2 - 2 S S_x Σ x_i/σ_i^2 + S_x^2 Σ 1/σ_i^2 ]
      = (S^2 S_xx - 2 S S_x^2 + S_x^2 S) / Δ^2 = S (S S_xx - S_x^2) / Δ^2 = S / Δ     (3.16)
The covariance between a and b is

σ_ab = Σ σ_i^2 (∂a/∂y_i)(∂b/∂y_i) = -S_x / Δ

and the correlation coefficient of the uncertainties in a and b is

r = -S_x / √(S S_xx)     (3.17)
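A sketch (hypothetical data and one-sigma errors) implementing (3.13) together with the uncertainty estimates (3.15)-(3.17):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.0, 7.2, 8.9])
sig = np.array([0.1, 0.2, 0.1, 0.3, 0.2])   # one-sigma errors in y

w = 1.0 / sig**2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()
Delta = S * Sxx - Sx**2

a = (Sxx * Sy - Sx * Sxy) / Delta   # intercept, eq. (3.13)
b = (S * Sxy - Sx * Sy) / Delta     # slope, eq. (3.13)
sig_a = np.sqrt(Sxx / Delta)        # eq. (3.15)
sig_b = np.sqrt(S / Delta)          # eq. (3.16)
r = -Sx / np.sqrt(S * Sxx)          # eq. (3.17)
print(a, b, sig_a, sig_b, r)
```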
What if some data constraints are more reliable than others? We may give those residuals more weight than the others, e.g., doubling the weight of the second residual:

e' = [e_1; 2 e_2; ...; e_n]

In general, we can use weights w_i for each error so that the new error is e_i' = e_i · w_i. We do this by introducing a weight matrix w, which is a diagonal matrix:

w = [w_1 0 ... 0; 0 w_2 ... 0; ...; 0 0 ... w_n]