2016 STATS 302 Course Book
This book also provides an introduction to some techniques not covered in the course.
Past students have found it a useful reference source later in life.
Material covered in the course: Chapters 1-7 and 10.
If you are doing STATS 302 you do not have to read chapters 8 and 9!
Contents
Chapter 1. Introduction to Matrix Algebra.
Chapter 2. Data exploration and checking.
Chapter 3. Data Preparation.
Chapter 4. Distances and similarities.
Chapter 5. Principal Components Analysis.
Chapter 6. Multidimensional Scaling.
Chapter 7. Cluster Analysis.
Chapter 8. Multiple Regression.
Chapter 9. Multivariate regression and constrained ordinations.
Chapter 10. Canonical Discriminant Analysis and MANOVA.
R tips for STATS 302
Chapter 1. Introduction to Matrix Algebra.
Since, no matter how hard one tries to avoid it, a familiarity with some of the jargon of matrix algebra is necessary for even a superficial understanding of multivariate statistics, an introductory section on the subject is inescapable. That was the bad news. The good news is that I will try to use a simple, intuitive, non-rigorous approach. The main aim of the section is to provide a highly selective, very superficial introduction to the subject.
As might be guessed from the context (a course on multivariate statistics), matrices
are commonly used for handling multivariable systems. Simple systems (say of 2 or 3
variables) can also be described and manipulated by the more traditional graph paper.
It should therefore come as no surprise that the two, matrices and graph paper, are
linked and our entry into the arcane world of matrices is via a concept of
multidimensional graph paper - Euclidean space.
WARNING: attempts to visualise spaces of more than 3 dimensions can damage your
mental health.
Vectors.
A vector is a set of coordinates that define a point in a space (e.g. graph paper)
relative to a set of axes (a basis). More usefully it defines a line from the origin (the
zero point, the intersection of the axes) to that point. This line is a vector in the strict
meaning of the word having magnitude and direction. Such a set of coordinates can be
referred to by a variable name e.g. x (note: bold lower case type usually refers to a
vector). So, a set of 5 measurements on the nose of an anteater would define a point in
a 5-dimensional space. This concept naturally generalises to any number of
dimensions. Such a space is called an n-dimensional hyperspace (they do exist outside science fiction), or an n-space (the insider's jargon term).
We have talked of multi-dimensional hyperspaces, but in fact the most used space of
all is one-dimensional! All real numbers can be thought of as points lying on a line
running from plus infinity to minus infinity - the real line. Any number also defines a
line from zero. It has length and direction (positive or negative) - it is a vector. Thus
the number 5 defines a vector running from the origin in the positive direction having
length 5. The number (-3) has length 3 in the negative direction. Ordinary real numbers can thus be thought of as vectors in a one-dimensional space. This will allow the
operations and algebra of multidimensional spaces to be explained with reference to
the arithmetic and algebra of real numbers, which everyone understands (I hope).
Properties of Vectors.
Vectors, like numbers on the real line, have both length (distance from zero - the
origin) and direction. Like numbers you can add (or subtract) two vectors to get a
third, just add (or subtract) corresponding elements of the two vectors.
Later on in the course we shall be concerned with the length of vectors so it may need
some explanation now. Given the vector, how do we calculate its length? In one
dimension it is trivial, the vector (-3) has length 3. In two dimensions it becomes easy
as soon as you draw it. The length of the vector (x1, x2), usually written ||x||, is clearly
given by Pythagoras's theorem. Thus
||x|| = (x_1^2 + x_2^2)^{1/2}.
The length of a three dimensional vector (x_1, x_2, x_3) can be similarly calculated,
||x|| = (x_1^2 + x_2^2 + x_3^2)^{1/2},
or more generally for a space of n dimensions
||x|| = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}.
The concept of the length of a vector now lets us measure the distance between two
vectors. First we calculate the difference between two vectors. The operation of
subtraction in 1 dimensional space defines the vector that joins the end points of the
two vectors (think about it, better still draw it and look).
The same operation can be described for two dimensions. Subtracting one vector a
from another b gives a third which defines the position vector c (fig.2b). This gives
the direction and distance between the two vector end points.
The actual calculation is as one would expect:
a - b = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} a_1 - b_1 \\ a_2 - b_2 \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = c
The difference vector has direction and length. Its length is the distance between the two points (vectors). Notice the vectors a and b are written as column vectors (i.e. the elements are aligned vertically). This is the usual default: if you are not told otherwise, assume that vectors have 1 column and many rows. As you will see this can become important later on.
The difference between two data points (e.g. animals or other sampling unit) is a most
useful thing to know, and is called the Euclidean distance, after Euclid (lived around
300 B.C.) the father of classical geometry. We showed how to get the length of a
vector earlier; so the distance between two vectors a and b is the length of the
difference vector:
||a - b|| = \left( \sum_{i=1}^{n} (a_i - b_i)^2 \right)^{1/2}
So the Euclidean distance depends on Pythagoras's theorem- the ancient Greeks knew
a thing or two.
We can therefore say how dissimilar two vectors (data points, animals) are by
calculating the Euclidean distance between them. This will be useful later on when we
talk of dissimilarity (and similarity) matrices in the context of cluster analysis and
multidimensional scaling.
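In R this distance is easy to compute directly from the formula above; a minimal sketch (the two vectors here are just illustrative values, not data from the text):
a <- c(0.5, 1)
b <- c(1.5, 0.5)
sqrt(sum((a - b)^2))   # length of the difference vector, ||a - b||
dist(rbind(a, b))      # the same Euclidean distance using R's dist() function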
a + b = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} a_1 + b_1 \\ a_2 + b_2 \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = c
a - b = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} a_1 - b_1 \\ a_2 - b_2 \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = c
Figure 1.1. a) addition of vectors to produce the resultant vector; b) subtraction of vectors.
Vectors have two different multiplication operations associated with them:
1) Like numbers you can multiply the vector by a single number. Every element of the
vector is then multiplied by this same number. Because it rescales every element of the vector, such numbers are called scalars. Multiplication by a scalar has a particular geometric interpretation that will be important later: it doesn't change the direction of the vector (except to reverse it if the scalar is negative), but it does change its length - see Figure 1.2.
2) You can multiply two vectors together (provided they have the same number of
coordinates). The usual method produces a single number, the inner product. The
process is simple, but since it is a special case of matrix multiplication - the vectors
are treated as little matrices- we will consider it later.
Matrices.
What is a matrix? The most obvious use of the word is to refer to a table of numbers.
The starting point for most multivariate analyses is the data matrix, whose rows
contain the values for each sampling unit, and whose columns contain the values for
each variable. Such a table may be referred to by a name e.g. X (note the bold type to
show it's a matrix not a simple variable, and that it is a capital letter to show it is not a
vector). Each individual number in the table, an element of the matrix, can be referred
to using subscripts: x15 refers to the element in the 1st row (sampling unit) and the 5th
column (variable). The convention is the row subscript first, then the column. (Please
repeat 20 times: rows then columns, rows then columns... ).
Sometimes the data is needed with the variables as rows and the sampling units as columns. This new matrix is the transpose of X, and is written X' (or sometimes X^T or X^t). The value of the 1st variable on the 5th sampling unit (x_51) would now be in the 1st row, 5th column, i.e. x'_15. The transpose is simply obtained by reversing the subscripts of the elements.
Figure 1.2. Multiplication of a vector by a scalar: a) multiplying (1, 0.5) by the positive scalar 2 gives (2, 1); b) multiplying by the negative scalar -2 gives (-2, -1).
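A quick sketch in R of the indexing and transpose conventions described above (X here is a small made-up data matrix, not one from the text):
X <- matrix(1:15, nrow = 5, ncol = 3)  # 5 sampling units (rows) by 3 variables (columns)
X[1, 3]      # the element x_13: row subscript first, then column
t(X)         # the transpose X': the variables now form the rows
t(X)[3, 1]   # equals X[1, 3] - the subscripts have been reversed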
The data matrix, X, could also be thought of as a table of vectors. Each row could be
considered a vector - each sampling unit thus defining a vector in variable space. So,
the (nxp) data matrix defines a cloud of n points (observation vectors) in a p
dimensional space. Most multivariate statistical methods can be imagined as
performing various operations on this data cloud, looking or testing for trends or
structure.
In an introduction to matrix algebra perhaps the most useful way of viewing matrices
is as operators on vectors. Just as ordinary algebra allows the manipulation of
variables representing real numbers (i.e. vectors on the real number line), so the use of
matrices allows a matrix algebra to manipulate vectors and vector spaces.
Matrix multiplication.
We can for example look at the matrix as an operator which moves points (vectors)
around in a space. If you multiply a point on the real line by a number it moves it to a
new position. If you multiply a vector by a matrix it moves it to a new vector. For
example, if you multiply
a = \begin{pmatrix} 1 \\ -1 \end{pmatrix} by the matrix M = \begin{pmatrix} 4 & 1 \\ 6 & 1 \end{pmatrix} we get the vector b = \begin{pmatrix} 3 \\ 5 \end{pmatrix}.
Multiplying b by the matrix N = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix} gives the vector c = \begin{pmatrix} -3 \\ -5 \end{pmatrix}.
Thus M.a = b, N.b = c, just like ordinary algebra. Like numbers, this sequence of
operations can be condensed into one equation NMa = Pa = c. Clearly matrices can
be multiplied together, so M multiplied by N gives a new matrix P. This matrix does
the same thing as multiplying a vector first by M and then multiplying the resulting
vector by N; multiplying a vector by P just does it in one go.
IMPORTANT: Unlike multiplying numbers, NM is not usually equal to MN; the order of multiplication matters. In Figure 1.3, using the matrices A = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, which reflects vectors in the x axis, and B = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, which rotates them 90° anticlockwise, we can see that a reflection followed by a rotation does not give the same result as a rotation followed by a reflection.
[Figure 1.3. The vector x reflected by A and rotated by B in the two possible orders: ABx and BAx end up at different points.]
[Figure 1.4. A matrix A moves a vector x to y; its inverse A^{-1} moves y back to x.]
As we shall see later this can be a short hand way of representing an important
statistical calculation.
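The same calculations are easily checked in R; a sketch using the matrices from the example above (%*% is R's matrix multiplication operator):
M <- matrix(c(4, 1, 6, 1), nrow = 2, byrow = TRUE)
a <- c(1, -1)
M %*% a                                               # gives b = (3, 5)'
A <- matrix(c(1, 0, 0, -1), nrow = 2, byrow = TRUE)   # the reflection matrix of Figure 1.3
B <- matrix(c(0, -1, 1, 0), nrow = 2, byrow = TRUE)   # the 90 degree anticlockwise rotation
A %*% B                                               # not the same as ...
B %*% A                                               # ... so the order of multiplication matters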
Matrix division.
In comparing ordinary algebra on the real line with matrix algebra we have defined:
addition of real numbers (cf addition of vectors (and matrices)), and multiplication by
a number (cf multiplication by a matrix); how about division by a number? To see if
there could be a matrix equivalent it is sensible to have a closer look at what division
means on the real line. Division by a number a is equivalent to multiplication by 1/a, or a^{-1}. Division has now become multiplication by the reciprocal. How might we define the reciprocal? If multiplication by a moves a point x to ax then we can define a^{-1} as that operation that takes ax back to x, so that a^{-1}a equals one. We can therefore call a^{-1} the inverse of a, since on the real line it does the opposite to multiplication by a.
We can now define the equivalent matrix operation to division by a number as
multiplication by the inverse of a matrix. The inverse of a matrix, by analogy, is going
to be that matrix, if it exists, that acts to negate the action of the matrix of which it is
the inverse. So if M takes a to b, then M^{-1} takes b to a (Figure 1.4), and M^{-1}M = I, the identity matrix (the matrix equivalent of the number 1).
Are there any situations when this is not going to work, an inverse does not exist?
Even on the real line there is one such case - the inverse of zero does not exist. If we
look at why, we may be able to predict situations where the inverse of a matrix will
not exist.
The problem with multiplication by zero is that it maps all points on the real line into
the same spot - zero. This makes an inverse impossible. Given that we are currently at
zero which of the infinity of points should we map back to? There is no unique
inverse. If we look at it in vector space terms, multiplication by zero moves any
vector in the one dimensional space (the real line) into a zero dimensional space (the
point zero). It then becomes apparent that any matrix that moves vectors into a space
of lower dimension is not going to have an inverse. Many points in the original space
will map to a single point in the resulting space, trying to jam a quart (1136ml) into a
pint (568ml) pot, making an inverse operation impossible. Similarly matrices that map
into higher dimensional spaces also lack inverses. Once in the higher dimensional
space there is no unique way back. Rectangular matrices therefore do not have
inverses. As we shall see below not all square ones have inverses, such matrices are
called singular, because they map to a singularity - like a black hole. Well behaved
square matrices that have inverses are called non-singular.
One advantage of the inverse is that it allows us to do algebraic manipulations just as
with real numbers. We can therefore rewrite the equation
Mx = y as x = M^{-1}y, just as though this was univariate algebra.
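In R the inverse of a matrix is given by solve(); a small sketch using the matrix M from the multiplication example (solve(M, y) solves Mx = y directly, without forming the inverse explicitly):
M <- matrix(c(4, 1, 6, 1), nrow = 2, byrow = TRUE)
y <- c(3, 5)
Minv <- solve(M)   # the inverse M^{-1}
Minv %*% M         # the identity matrix I (apart from rounding error)
solve(M, y)        # x = M^{-1}y, here recovering the vector (1, -1)'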
Addition of matrices.
If Ax is the vector x after it has been moved, then it is also a vector and all the
operations between vectors can be applied (e.g. vector addition and subtraction).
Suppose we have two vectors Ax and Bx then the vector that is their sum will be
Ax+Bx which should (and does) equal (A+B)x = Cx, where C=(A+B). How does one
add matrices? The two matrices must be the same shape, same number of rows and
columns, then we can simply add the elements of one to those in the corresponding
position in the other matrix.
Subtraction of matrices is similarly defined.
Matrix properties.
The determinant.
Like vectors and numbers on the real line, matrices have unique properties. Vectors
have length and direction. Numbers have size and direction. Matrices also have a
concept of "size" . The size of a number X is mathematically defined by the absolute
value or modulus, usually written |X|. The analogous property of a matrix is given by
the determinant, |X|, which measures the "size" of the matrix. It is a single number and as such is often used as a summary statistic for statistical matrices of various kinds. However its most common use is to find out whether a matrix is non-singular (i.e. has an inverse).
Figure 1.5. Multiplying by a singular matrix: the matrix B = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} maps the corners of the unit square, (0, 0), (1, 0), (0, 1) and (1, 1), onto a single line through (1, 2), (2, 4) and (3, 6). No inverse is possible. The matrix does not have two dimensions' worth of information. Its determinant is zero.
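A short R sketch contrasting the singular matrix of Figure 1.5 with a non-singular one (the second matrix is the one used in the invariant-line example that follows):
B <- matrix(c(1, 2, 2, 4), nrow = 2, byrow = TRUE)           # the singular matrix of Figure 1.5
det(B)                                                       # 0: no inverse exists, solve(B) would fail
M <- matrix(c(1.3, 0.4, 0.4, 0.7), nrow = 2, byrow = TRUE)
det(M)                                                       # 0.75: non-zero, so M has an inverse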
Consider, for example, the matrix M = \begin{pmatrix} 1.3 & 0.4 \\ 0.4 & 0.7 \end{pmatrix} and the vector x = (1, 1)'. Repeated multiplication by M gives
Mx = \begin{pmatrix} 1.3 & 0.4 \\ 0.4 & 0.7 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1.7 \\ 1.1 \end{pmatrix}, \qquad M^2 x = M \begin{pmatrix} 1.7 \\ 1.1 \end{pmatrix} = \begin{pmatrix} 2.65 \\ 1.45 \end{pmatrix},
and so on (see Figure 1.6).
Once the vectors are close to the line, further multiplication by the matrix only shifts them further out - it stretches the current vector. This is equivalent to multiplication by a scalar, in this example 1.5. In fact multiplication of any vector u that actually lies on the line is equivalent to multiplication by a scalar λ, so Mu = λu. For points on this
line, and only on this line the matrix is equivalent to a single number! This number
labours under a variety of names : characteristic root, latent root, but probably the
most common, and the one we shall use in this course is eigenvalue. All vectors that
lie on the invariant line are multiples of the characteristic vector, latent vector, or
eigenvector.
For example, Mu = 1.5 × u:
\begin{pmatrix} 1.3 & 0.4 \\ 0.4 & 0.7 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 1.5 \end{pmatrix}, \qquad M \begin{pmatrix} 10 \\ 5 \end{pmatrix} = \begin{pmatrix} 15 \\ 7.5 \end{pmatrix}
Figure 1.6. The invariant line. Repeated multiplication of the vector x by the
matrix M brings the resulting vectors closer and closer to the invariant line; where
multiplication by the matrix is equivalent to multiplication by a scalar - the eigenvalue.
(c.f. figure 1.2a)
You could try any vector where the elements are in the proportion 2:1, in other words
lie on the invariant line for matrix M; multiplication by M is equivalent to multiplying
by 1.5. In fact most matrices have more than one such invariant line, matrix M has
another defined by the vector (-1,2) or any multiple thereof. The corresponding scalar
is 0.5 (try it). Thus matrix M has two eigenvectors (u1 and u2) each with its own
eigenvalue (λ1 and λ2).
Intuitively, eigenvalues can be thought of as a measure of the stretching power or size
of the matrix. In particular the largest or dominant eigenvalue is often used in this
way. Since the determinant is sometimes used similarly, you would expect a
relationship between the two. The product of the eigenvalues ( ∏ λi ) equals the
determinant |M|. So if any of the eigenvalues are zero then so is the determinant - no
inverse - singular matrix. In fact the number of non-zero eigenvalues gives the number of linearly independent rows there are in the matrix - its rank. Eigenvalues have many useful properties, but the only other one to which I shall refer is that the sum of the eigenvalues is equal to the sum of the diagonal elements of the matrix, which is called the trace: Σλ_i = Σm_ii.
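These relationships are easy to check in R with eigen(); a sketch using the matrix M from the invariant-line example:
M <- matrix(c(1.3, 0.4, 0.4, 0.7), nrow = 2, byrow = TRUE)
e <- eigen(M)
e$values         # 1.5 and 0.5, the eigenvalues
e$vectors        # unit-length eigenvectors, proportional to (2, 1) and (-1, 2) (up to sign)
prod(e$values)   # 0.75, equal to the determinant det(M)
sum(e$values)    # 2.0, equal to the trace sum(diag(M))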
We have discussed the meaning of the eigenvalues, what about the eigenvectors, the
vectors that define the invariant lines? The full set can be used to define a new set of
axes (a basis) that is implicit or latent in that matrix - the basic structure of the
matrix. The directions defined by these vectors are somehow important to that matrix,
are characteristic of that matrix, and in some way summarise the information in that
matrix. These new axes are sometimes very useful for re-expressing the data points,
especially in principal component analysis, and canonical discriminant analysis
(sections 6 and 12). If the matrix M is symmetric then the eigenvectors are orthogonal
to each other (these new axes are at right angles) - most useful.
One way in which the eigenvalues and eigenvectors express the underlying structure
of a square matrix is in the eigenvalue (or spectral) decomposition. Any diagonalisable square matrix M can be broken down into 3 matrices: M = UΛU^{-1}; for the symmetric matrices that dominate this course the eigenvectors are orthogonal, so U^{-1} = U' and M = UΛU'. Here U consists of all the eigenvectors of M stacked like books on a bookshelf, U' is its transpose (the books are now piled on top of each other), and Λ is a diagonal matrix (i.e. only the diagonal elements can be non-zero) with the eigenvalues as the diagonal elements, in the same order as the eigenvectors. It is worth pointing out that these eigenvalues can under
certain circumstances be imaginary numbers, but DON'T PANIC, we generally throw
those away if we meet them. All this seems like just some wonderfully useless
mathematical fantasy. This decomposition will however be an extremely useful way
of summarising certain data based matrices. We will find that eigenvalues and
eigenvectors have quite simple statistical interpretations when used properly, and this
decomposition of a matrix into its basic structure underlies most of the techniques
found in this course.
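The decomposition is easy to verify numerically in R; a minimal sketch, again using the symmetric matrix M from the invariant-line example:
M <- matrix(c(1.3, 0.4, 0.4, 0.7), nrow = 2, byrow = TRUE)
e <- eigen(M)
U <- e$vectors            # the eigenvectors, stacked as columns
Lambda <- diag(e$values)  # diagonal matrix of eigenvalues
U %*% Lambda %*% t(U)     # reconstructs M: the spectral decomposition M = U Lambda U'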
Let us take a data vector that has had the mean of each variable (column) subtracted
from every value in that column, e.g.
\begin{pmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ x_3 - \bar{x} \end{pmatrix}
So we are now dealing with deviations from the mean (such data are called mean-
centred - see section 4.1.1). If we now multiply this vector by its transpose we get a
number that should be instantly familiar to anyone who has done 1st year Stats.
x'x = \begin{pmatrix} x_1 - \bar{x} & x_2 - \bar{x} & x_3 - \bar{x} \end{pmatrix} \begin{pmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ x_3 - \bar{x} \end{pmatrix} = \sum_{i=1}^{n} (x_i - \bar{x})^2
In other words, when the data have been centred by subtracting the mean, x'x is equal
to the corrected sum of squares; a most useful number.
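A quick check of this in R (x here is a made-up numeric vector, purely for illustration):
x <- c(2, 5, 7, 4, 9)
xc <- x - mean(x)       # the mean-centred values
crossprod(xc)           # x'x for the centred vector (a 1 by 1 matrix) ...
sum((x - mean(x))^2)    # ... equals the corrected sum of squares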
We can generalise this to the bivariate case:
X = \begin{pmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 \\ x_{31} - \bar{x}_1 & x_{32} - \bar{x}_2 \end{pmatrix}
If we now multiply X by its transpose X' we will get a matrix whose elements should look familiar. Thus:
X'X = \begin{pmatrix} x_{11} - \bar{x}_1 & x_{21} - \bar{x}_1 & x_{31} - \bar{x}_1 \\ x_{12} - \bar{x}_2 & x_{22} - \bar{x}_2 & x_{32} - \bar{x}_2 \end{pmatrix} \begin{pmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 \\ x_{31} - \bar{x}_1 & x_{32} - \bar{x}_2 \end{pmatrix} = \begin{pmatrix} \sum (x_{i1} - \bar{x}_1)^2 & \sum (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) \\ \sum (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) & \sum (x_{i2} - \bar{x}_2)^2 \end{pmatrix} = D
If you examine the elements of D you should recognise d11 as the corrected sum of
squares for variable X1, d12 as the corrected sum of crossproducts between variables
X1 and X2 (you may have used this to calculate regressions and correlations) and d22 as
the sum of squares of variable X2. It should be obvious that d21 is the same as d12 - the
matrix D is symmetric. If we think of the sums of squares dii as lying on the diagonal
of the matrix, then the sums of crossproducts lie in the upper (d12) and lower (d21)
triangle. The matrix D is usually known as the sums of squares and crossproducts
or SSCP matrix. The elements of D immediately suggest (or should do) that if we
were to divide every element by (the number of sampling units-1), i.e. multiply by
1/(n-1), we would convert the sums of squares to variances and the crossproducts to
covariances. Thus 1/(n-1)D = S, the variance-covariance, or just covariance, matrix.
The elements s11 and s22 (i.e. sii the diagonal elements) are the estimated variances of
X1 and X2, while s12 and s21 estimate the covariance between them. This matrix is the
starting point for many of the classical multivariate methods and we will come across
it repeatedly. More generally, the ijth element is the covariance of the ith and the jth
variable. Note that since this matrix is symmetric about the diagonal, i.e. sij=sji, then
S'=S. For example, in Table 1.1 the variance of sepal length is 12.4 and its covariance with petal length is 1.6.
The covariance matrix contains information about the spread of the data points over
the variables (the variances), and also about whether the variables are correlated, how
elliptical the data cloud is. (Data from a heavily correlated pair of variables typically
lie along a line, forming a narrow ellipse). Thus for many types of data the covariance
matrix summarises the size and shape of the data cloud.
There is another crossproduct matrix that is important statistically. If we divide the
mean corrected values in data matrix X by the appropriate standard deviations we get
a matrix of z-scores, standardised deviations from the mean (see section 4.1.2a). The
resulting variance covariance matrix actually contains correlation coefficients as the
off diagonal terms. This correlation matrix (usually called R) has all the diagonal
elements equal to 1 (z-scores have unit variance), and the ijth element, rij, is the
correlation between variables Xi and Xj. For example, in Table 1.1 the correlation
between sepal width and petal width is 0.23, the reverse correlation, between petal
width and sepal width is of course the same so the matrix is symmetric. To point out
the obvious, the correlation between petal length and itself is of course 1. It is of
course quite simple to calculate the correlation matrix directly from the covariance
matrix (you might like to work out how) but thinking of the correlation matrix as the
covariance matrix of z scores, standardised data, will help you to understand methods
that use the correlation matrix as their starting point.
Programming Notes
In R/S Plus the variance covariance matrix can be got by:
var(data)
The correlation matrix by:
cor(data)
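If you want to see the chain described above worked explicitly (centred data, then the SSCP matrix D, then S, then R), a sketch along these lines should agree with var() and cor() ('data' again stands for a numeric data matrix):
X <- as.matrix(data)
Xc <- scale(X, scale = FALSE)   # mean-centre each column
D <- crossprod(Xc)              # the SSCP matrix X'X
S <- D / (nrow(X) - 1)          # the covariance matrix, identical to var(X)
R <- cov2cor(S)                 # the correlation matrix, identical to cor(X)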
Table 1.1. Covariance and correlation matrices for measurements (mm × 10) on the plant Iris setosa.

Covariance matrix
               Sepal Length  Sepal Width  Petal Length  Petal Width
Sepal Length       12.4          9.9          1.6          1.0
Sepal Width         9.9         14.3          1.2          0.93
Petal Length        1.6          1.2          3.01         0.61
Petal Width         1.0          0.93         0.61         1.1

Correlation matrix
               Sepal Length  Sepal Width  Petal Length  Petal Width
Sepal Length       1.00          0.74         0.27         0.28
Sepal Width        0.74          1.00         0.17         0.23
Petal Length       0.27          0.17         1.00         0.33
Petal Width        0.28          0.23         0.33         1.00
suggests. Some of the off diagonal elements of the covariance matrix, the covariances,
will be large - there are large correlations present. If on the other hand all the variables
are independent - all the covariances near zero - then the determinant will be larger.
Also the larger the diagonal terms (variances) the larger the determinant. Thus the
determinant of a covariance matrix is sometimes used as a single number that
summarises the overall independent variance of a system of variables - the
generalised variance. We mentioned earlier that the determinant was sometimes used
as a measure of the 'size' or 'magnitude' of the matrix. Graphically, we can think about
the determinant as measuring the (hyper)volume of the data cloud in space, the larger
the variance the larger the volume; the more correlated the variables, the smaller the
volume.
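In R the generalised variance is therefore just the determinant of the covariance matrix; a one-line sketch ('data' stands for a numeric data matrix, as in the Programming Notes above):
det(var(data))        # the generalised variance
log(det(var(data)))   # often quoted on the log scale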
Chapter 2. Data exploration and checking.
When the data is first entered into the computer you must resist the temptation to dive straight into a
sophisticated multivariate analysis. No matter how close the deadline, time spent checking and
preparing the data now will save time and embarrassment later. For a start the data almost certainly
contains copying mistakes and outliers, almost certainly violates the assumptions of the analysis, and,
if you are lucky, may not need a sophisticated analysis anyway.
The first step of any analysis is to look for the copying errors that have probably crept into the data
while it was being entered into the computer. The most thorough method is to have someone else read
out what ought to be in the computer while you check what is actually there. Regrettably, very large
data sets can strain friendships and eyesight alike; so this method is often impractical. Generally the
most disastrous errors are those that generate outliers, a slipped decimal point for example. These can
often be detected by preparing the univariate data displays (histograms, stem-and-leaf or box plots)
that are standard in 1st year statistics courses. At the very least every variable should be displayed and
outliers investigated to check they are genuine. A word of warning: if you fail to discover any mistakes
- worry. I have never had a data set of any size that did not have some errors in it. Typing mistakes,
observations missed out, observations repeated, they all happen somewhere. Only when you are sure
the data set is as correct as possible is it worth investing time in its analysis.
The univariate displays are not only for detecting errors and outliers, they also give you your first
insight into the data. Is there any evidence of structure in the data - polymodality for example? Do any
patterns emerge when you compare means between variables, or between groups of observations? In
environmental statistics the sampling units can often be plotted on a map. If the values of each variable
are mapped do any patterns emerge? It may be that the patterns or results are so obvious that no
multivariate analysis is necessary. Though before you make that decision have second thoughts; what
is obvious to you, a true believer, may not be obvious to a sceptical client, referee or examiner. If you
decide to go ahead with an appropriate multivariate analysis, the insights you have gained from the
univariate displays will help you interpret and check the results of the multivariate analysis. If the
analysis flatly contradicts what is clear in the univariate displays, some further investigation is needed.
Scatterplots.
Exploratory data analysis (EDA to its advocates) is more of a philosophy than an analysis. Its chief
proponent was John W. Tukey (Bell Labs and Princeton University): "Exploratory data analysis is an
attitude, a flexibility, and a reliance on display, NOT a bundle of techniques" (Tukey 1980:23). The
approach is based on a "recognition that the picture-examining eye is the best finder we have of the
wholly unanticipated". Clearly, by relying on such a (over)sensitive pattern recognition machine, EDA
is aimed at the generation of hypotheses, not their testing; to the identification of the unexpected, not
the confirmation of the already known. To a large extent the rest of this course is concerned with
methods that make multivariate data available to that "picture-examining eye"; techniques that allow
the data to be examined, patterns or relationships detected and ideas generated. They are seldom used
for tests of a priori hypotheses.
The techniques in the later sections are usually sophisticated and often difficult to interpret, their
advantage is that they are truly multivariate. The spirit of EDA, on the other hand, is simplicity. As I
pointed out above, simple univariate methods, histograms and the like, can often be useful in gaining
insight into the data. Such simple graphical methods are certainly more useful than tables of summary statistics, and are a lot easier to read.
Figure 2.1. All these scatterplots have correlations of 0.75, yet all tell very different stories. Do not rely on summary statistics; always plot your data before analysis.
The chief problem with multivariate data is that if there are more than three variables it is usually
impossible to visualise the data cloud properly. We can produce all the summary statistics: mean
vectors, covariance matrices, correlations etc. but we cannot see what the data cloud actually looks
like. Figure 2.1 shows how important this can be - even with just two variables.
Before we apply a sophisticated multivariate analysis (with its often restrictive assumptions) we
ought to try simpler, more direct, approaches to the data. If we cannot plot the data cloud in all its
multidimensional glory, then we can at least plot it on each pair of variables in turn. Subtle, truly
multivariate patterns will probably not emerge, but the grosser patterns and relationships should be
apparent. The simplest way of organising all these plots, p(p - 1)/2 of them (if there are p variables) is
in a scatterplot matrix (Figure 2.2). This is like the lower triangle of the correlation matrix (section
1.4.1) with plots instead of correlation coefficients.
EXAMPLE 2.1.
In 1936 the great geneticist and biometrician R. A. Fisher introduced a data set into the multivariate
literature that now appears in nearly every textbook on the subject - it is almost a tradition. The data
consists of measurements of sepal length and width and petal length and width on three Iris species: I.
setosa, I. versicolor, and I. virginica. He used them to demonstrate the application of Linear
Discriminant Function Analysis (LDFA - not covered in this course), and since then they have become
the classic data set for the purpose. LDFA tries to establish rules (functions) to classify vectors
(individual flowers) into classes (the three species). For this to work, the species must be relatively
separate. By doing all the bivariate plots we can see if they are.
Figure 2.2 shows the scatterplot matrix for the four variables with each of the species having a
different symbol. Clearly I. setosa is separate from the others, while I. virginica and I. versicolor,
though less obviously separable, do not appear to overlap too much. Indeed, in the full four dimensions
they could be clearly distinct. A further point worth noting is that these plots clearly suggest that the
species have very different shaped data clouds, i.e. that their covariance matrices are different. This
could have consequences for LDFA, which, to be optimal, requires the classes (species) to have
identical population covariance matrices - i.e. their data clouds must have similar shapes. I show how to check this assumption more formally later in this chapter.
Figure 2.2. Scatterplot for Fisher's Iris data. The species seem well separated.
Programming note.
In R the basic instruction is: pairs(variable list).
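For example, with the copy of Fisher's Iris data that is built into R (the iris data frame), something along these lines should give a plot in the spirit of Figure 2.2, with a different symbol and colour for each species:
pairs(iris[, 1:4], pch = as.numeric(iris$Species), col = as.numeric(iris$Species))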
Figure 2.3. Rotated 3-D plot of Fisher's Iris data (petal width, sepal width and
length). I rotated it using the tool bar at the left to find a useful picture. Note 3 observations
have been selected as possible outliers. They are highlighted and their observation
numbers (row in the data matrix) displayed so they can be easily identified.
Unfortunately most packages that do this type of plot try to be too helpful. If there is little variation on
one axis, they expand the plot in that direction so that the range of each variable has the same size on
its axis. As a result the shape of the cloud is distorted, sometimes badly. Unless the package has the
facility to turn off this clumsy assistance, the only recourse is to put into the data file eight dummy data
points that define a cube that will completely enclose the cloud. Because the range on each variable is
now the same (the dummy points now define the range) the cloud will be undistorted, hanging in the
cube which protects it from getting rescaled.
Programming Note.
At present there is no convenient 3-D rotation in R.
Many techniques for the analysis of continuous data assume a multivariate normal distribution
(MVN). Unfortunately, though univariate normality of each of the variables is a necessary condition of
MVN; it is not sufficient. So, though checking for univariate normality is a useful first step, if you fail
to find non-normality it does not mean that the data are MVN.
Techniques are available for testing the null hypothesis of MVN, but generally the relatively informal
graphical methods are best. They can kill two birds with one stone: not only do they check the shape of
the distribution, they also highlight multivariate outliers. Though univariate and bivariate plots (see
[Figure 2.4. Chi-squared probability plots of the observed squared Mahalanobis distances (D²) against the expected χ² quantiles: a) approximately multivariate normal data, b) skewed (multivariate lognormal) data, c) data containing outliers.]
Probability plotting.
The simplest check for multivariate normality is based on a simple property of the MVN: the squared
distance (suitably standardised) from a point to the centroid (the mean vector - the centre of the data
cloud) is approximately χ2 distributed. This is understandable if we look at the definition of the χ2
statistic: χ2 with 1 df is defined as ((x - µ)/σ)^2, i.e. the squared standardised distance from x to the mean (univariate centroid). To show the link with the multivariate case, this can be rewritten as (x - µ)(σ^2)^{-1}(x - µ). The multivariate equivalent is (x - µ)'Σ^{-1}(x - µ), where Σ^{-1} is the inverse of the
covariance matrix. If the data are MVN these squared distances will have a χ2 distribution with p
degrees of freedom - where p is the number of variables. This standardised distance is known as the
squared Mahalanobis distance D2. We shall meet it again in a number of places later in the course.
So, if the data are MVN the distances should be χ2. If the distances are not χ2 then the data are not
MVN. How can the χ2 distribution of these distances be checked? Most people taking an
undergraduate course in statistics are shown how to check the normality of a univariate distribution
using probability or normal quantile-quantile (qq) plots. Such a plot is possible for other distributions -
in the present case by using the inverse χ2 transformation in place of the inverse normal. In this case
the squared distance values D_i^2 would be used in place of x_i.
This technique will work well if S is close to Σ, i.e. if it is based on a large sample. I suggest n > 10p
(p is the number of variables); though for practical purposes it can work well for smaller numbers.
The plot will be approximately straight if the data conform to the multivariate normal distribution
(Figure 2.4a). If they are curved or stepped, the data are not normal, e.g. Figure 2.4b where the data
come from a multivariate lognormal distribution (i.e. heavily skewed). Stepping may suggest groups in
the data, but as a general rule polymodality is extremely difficult to detect using these plots.
Since the distances are ordered from smallest to largest on the transformed cumulative % axis, the
extreme outliers will be at the right hand end, and it should be obvious if they are further away from
the centroid than expected from an MVN distribution (Figure 2.4c).
Most multivariate techniques that assume the MVN distribution are more sensitive to skewness than to
kurtosis or polymodality. So, even if the probability plots detect non-normality, the sample sizes are
small, and the analysis is not considered robust, don't panic. Provided the distribution is still symmetric
the analysis will usually be able to go on as planned. Check the symmetry using the univariate
displays; if they are all symmetric then the multivariate distribution must be symmetric.
Programming notes
In R use:
xx <- dataset                                        # your data set: quantitative variables only
mah <- mahalanobis(xx, apply(xx, 2, mean), var(xx))  # squared Mahalanobis distances D^2
qqplot(qchisq(ppoints(mah), ncol(xx)), mah)          # chi-squared quantile-quantile plot
Some methods, like multivariate analysis of variance, and canonical discriminant analysis (Chapter
11), make comparisons between groups of observations. One assumption that they all share, and to
which they can be very sensitive, is that the observations within each group are MVN distributed with
equal covariance matrices (section 1.4.1). It is a sensible precaution to check the assumptions before
employing any of these techniques.
While it might be possible to compare covariance matrices element by element, this is usually
impractical. There are a number of more sophisticated techniques described in the literature, but I shall
only discuss two:
i) The maximum likelihood test for homogeneity of covariance matrices.
ii) The multivariate extension of Levene's test.
One approach would be to condense each covariance matrix down to a single measure of 'size' and
compare those. The most commonly used value is the determinant |S| - the generalised variance
(section 1.4.2). Sometimes loge|S| is used instead.
Unfortunately, just looking at the values of the generalised variance conveys little to the inexperienced
eye (and not much to the experienced one), so some more formal approach is usually needed.
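A sketch of how the (log) generalised variances might be computed for each group in R, using the built-in iris data as an illustration (your own data and grouping factor would go in their place):
sapply(split(iris[, 1:4], iris$Species),
       function(y) log(det(var(y))))   # log generalised variance for each species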
Maximum likelihood test for the homogeneity of covariance matrices.
This is the most widely available test. It is discussed in all major multivariate texts and is present in all
major statistical packages; yet for most purposes it is of little use. It is a multivariate generalisation of Bartlett's test that is described in many introductory statistics courses.
It is calculated as
M = \frac{\prod_{i=1}^{g} |S_i|^{(n_i - 1)/2}}{|S|^{(n - g)/2}}
where g is the number of groups, S_i and n_i are the sample covariance matrix and sample size of the ith group, S is the pooled covariance matrix and n is the total sample size.
Table 1. Sample covariance matrices for the Iris species (measurements mm × 10).

                 Iris setosa                            Iris versicolor
Sepal length   12.42    9.921   1.635   1.033       26.64    8.518   18.28    5.577
Sepal width     9.921  14.36    1.169   0.9298       8.518   9.846    8.265   4.120
Petal length    1.635   1.169   3.015   0.6069      18.28    8.265   22.08    7.310
Petal width     1.033   0.9298  0.6069  1.110        5.577   4.120    7.310   3.910
Iris virginica
The multivariate extension of Levene's test.
In Levene's univariate test each data value is replaced with the absolute value of its deviation from the
median (or, sometimes, the mean) of its group. If one sample is more variable than the others then the
average deviation for that sample will be larger than for the others; this can be detected by a simple
analysis of variance (ANOVA) on the deviations.
With multivariate data each variable is treated as above (except we nearly always use the mean rather
than median). Each value is replaced by the absolute value of its deviation from the sample mean for
that variable in that group. However a multivariate ANOVA (section 11.1) is used to detect any
differences, rather than a lot of simple ANOVAs. The multivariate ANOVA combines (sort of) all the
ANOVAs into one significance test with one p-value. A significant result does not necessarily mean
that the differences are large enough to be statistically (or biologically) important. But this technique is
more robust than the maximum likelihood method described above, and the results are correspondingly
more reliable indications of potential problems.
EXAMPLE 2.2
Using the Iris data again, we want to test the null hypothesis that the population covariance matrices of
the three species are equal. The sample covariance matrices are shown in Table 1. At first sight they
appear different, particularly the variances. Our multivariate Levene's test rejects the null hypothesis
with p < 0.0001. A preliminary interpretation based on univariate Levene's tests suggests that the major differences are in sepal length and petal length, which from Table 1 are clearly less variable in the smaller
species. This relationship between the variance and the mean is a common phenomenon with
morphometric data. The situation can often be improved by using a logarithmic transformation (but not
in this case!)
This function will do a multivariate Levene's test:
MVlevene <- function(y, x)
{
  # fit the group means to the (multivariate) response
  fit <- lm(as.matrix(y) ~ as.factor(x))
  # absolute deviations of each value from its group mean, variable by variable
  dev <- abs(fit$residuals)
  # a multivariate ANOVA on the deviations tests for differences in spread
  summary(manova(dev ~ as.factor(x)))
}
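For example, applied to the built-in iris data (assuming it matches Fisher's measurements) the call would be:
MVlevene(iris[, 1:4], iris$Species)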
Many of the multivariate techniques described in this course assume that the covariance or correlation
matrix adequately describes the relationship amongst the variables, i.e. that the relationships are linear.
Some, like multiple regression and canonical correlation (Sections 9 & 10) have their own techniques
for detecting non-linearities. Even so, it is often useful to check for linearity at an early stage before
deciding on a final analysis. The checks for multivariate normality in section 2.3 are to some extent
also checks for linearity - an MVN distribution has linear relationships among the variables. But non-
linearity cannot usually be distinguished from other types of non-normality. It is therefore often useful
to make a simple and direct check by plotting a matrix of scatter plots (see above - section 2.2.1). By
examining each plot non-linear relationships can often be spotted. Dynamic 3D graphics (section
2.2.2) can also help. If there are a lot of variables there will be lots of plots; still, time spent in front of
the computer screen examining them is seldom wasted. Once the shape of the bivariate relationships
has been seen you can decide on the potential usefulness of transformations; whether or not to drop
some variables; which analysis is going to be the most appropriate; or even whether or not you need a
multivariate analysis at all.
Chapter 3. Data Preparation.
Missing values.
Many real data sets have missing values, i.e. some of the observations have one or more of their
variables unmeasured. This leaves gaps in the data matrix. There are a number of ways of handling
this situation, though there must be sufficient data remaining to do the job - information cannot be
created from nothing. Problems can also arise if the missing values are non-randomly distributed
(as they often are). The simplest methods described below assume that values are missing
completely at random (MCAR). If all the large values for a variable were missing, the sample
would be biased and this assumption violated. However, the more sophisticated methods only
assume that missing values in Y1 cannot depend on the value of Y1 but can depend on values of the
other variables e.g. Y2. In other words, the probability of Y1 being missing can depend on the other
variables, but if there were a number of observations with the same value of Y2 then the pattern of
missing values of Y1 within them must be completely random. This weaker assumption ( missing at
random - MAR) means that the missing values of Y1 can be validly predicted from the other
variables.
Table 3.1a shows a complete, highly artificial, data set. Table 3.1b shows what it might look like if
values of Y1 were missing completely at random, i.e. there is no tendency for the missing values to
correlate with Y1 or Y2. Table 3.1c shows what it might look like if the probability of a missing
value were correlated with Y2 but is random within Y1 for any value of Y2 (MAR). This will lead to
a biased estimate of, say, the mean if you averaged over the whole data set, but an unbiased one if you broke the data up into subsamples based on Y2 and estimated separate means. Table 3.1d shows
what might happen if the probability of a missing value is correlated with both Y1 and Y2 - the large
values of Y1 are missing within Y2 = 1. This last situation is irretrievably biased and cannot be
handled by any currently available technique.
Table 3.1. (a) complete data; (b) values of Y1 missing completely at random (MCAR); (c) missing at random given Y2 (MAR); (d) missingness depending on Y1 itself.
       (a)          (b)          (c)          (d)
    Y1    Y2     Y1    Y2     Y1    Y2     Y1    Y2
    17    1      17    1      .     1      .     1
    16    1      .     1      16    1      .     1
    15    1      15    1      .     1      15    1
    14    2      14    2      14    2      14    2
    13    2      .     2      13    2      13    2
    12    2      12    2      12    2      12    2
For MCAR data the usual strategy is to remove missing values from the data matrix.
If the data set has many variables and the missing values are confined to a few, you could consider
dropping the affected variables. However, it is more usual to drop all the observations that have
missing values - the complete-observation method. This is the standard solution used by the main
statistical packages. Clearly, depending on the distribution of the missing values, a combination of
the two could be a sensible option. This method has several drawbacks:
If too many values are missing this pruning of the data set may shrink the data set too far for it to
yield useful results.
Potentially useful information is being thrown away, some of the discarded observations may be of
particular interest.
It assumes the values are missing completely at random. If the missing values are non-randomly
distributed over the observations you will be left with a biased sample.
This crude method has one major advantage over the alternatives: it is simple. This, and the fact that the more complicated alternatives (which we do not cover in this course) sometimes do not work well, makes it attractive if the data set is not too small, the values are MCAR, and the number of missing values is not too large.
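In R the complete-observation method is a one-liner; a minimal sketch ('data' stands for your data frame, with missing values coded as NA):
sum(!complete.cases(data))      # how many observations contain at least one missing value
data.cc <- na.omit(data)        # keep only the complete observations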
Many data sets require modification before analysis. At first sight this might appear a questionable
activity - fudging the data, fiddling the results. In fact there are several perfectly respectable reasons
for transforming the data, and if it is done properly the results should be just as valid (you hope
even more valid) and more useful than if the analysis had been performed on the original values.
The choice of transformation or standardisation is, as we shall see, possibly the most important
decision to be made when planning a multivariate analysis.
The main reasons for using transformations are:
to put a variable on a more interpretable, appropriate, scale;
to remove size effects;
to put variables on similar scales;
to make the data conform to the assumptions of the chosen analysis, e.g. multivariate normality,
homogeneity of covariance matrices, linearity.
I will follow convention by distinguishing between four different types of univariate
transformation: centering, standardising, recoding, and transforming by mathematical function. I
don't discuss truly multivariate transformations here because to a large extent that is what the rest of
the course is about.
Centering.
This involves shifting the centroid of the data cloud, i.e. shifting the multivariate centre of the data.
It is simply done by subtracting a number from each of the values. Though there can be a number of
reasons for doing this, the most common is to remove what I shall call size effects.
a) Centering by column means.
Very often differences between the absolute values of variables are not of interest. The differences
between the averages for each variable obscure the interesting patterns. For example in
morphometrics: that the average length of the humerus (the upper bone in the foreleg) of a mammal
is longer than that of the ulna - the lower leg bone of the foreleg - (or perhaps vice versa) is usually
of little interest, it is how these variables (co-)vary that is interesting. The effect of the averages is
removed by subtracting the mean of each variable from all its values:
i.e. x*_ij = x_ij - x̄_j, where x_ij is the value from the ith row (observation) and jth column (variable), and x̄_j is the mean of the jth column.
This can therefore be thought of as removing simple size differences between the variables and
focusing the analysis on variability of the variables - the data has been transformed to deviations
from the variable means.
Geometrically this transformation is a shift of the centroid (the vector of means) to the origin (the
zero vector). All vectors are now deviations from the centroid (mean). In fact column centering is
performed whenever the covariance or correlation matrix is calculated, i.e. when interest is focused
on how the variables vary together, so it is easily the most commonly used transformation.
Programming notes.
In R we can use the apply() function, which applies a function separately to each row (put 1 as the second argument) or to each column (as here, with 2 as the second argument):
xx<-data
col.mean.stdised<-apply(xx, 2, function(z){z-mean(z)})
or more simply
col.mean.stdised=scale(xx,scale=F)
Standardising.
i) Standardising the columns by a measure of dispersion.
The commonest such standardisation is to divide by a sample measure of the dispersion, most commonly the standard deviation (though some workers recommend the range): i.e. x*_ij = x_ij / s_j, where s_j is the sample standard deviation of the jth variable (column). This converts all the values
to scores in standard deviation units. If the xij have already been centred by columns this
standardisation will yield scores with zero mean and unit variance. Geometrically this has the effect
of equalising the variation on each axis (the variables now all have unit variance) so the data cloud
usually tends to become more spherical. Also the covariance matrix of such standardised variables
is now the correlation matrix (section 1.4.1), an advantage if the main interest is in the relationships
amongst the variables.
In practice this standardisation means that all variables will have roughly equal weight in the
analysis.
Programming notes.
In R
xx<-data
col.sd.stdised<-apply(xx,2,function(z){z/sqrt(var(z))})
Or if you don't mind it being centred (subtracting the column mean) then this is easier:
col.sd.stdised<-scale(xx)
This is often an acceptable thing to do as the overall column means are seldom considered useful.
ii) No standardisation.
Despite the fact that most people do not explicitly standardise their data, genuinely unstandardised analyses are fairly rare. Many multivariate techniques use the correlation matrix, and many distance and similarity
metrics also standardise implicitly (section 4). A principal component analysis on the covariance
matrix (section 5) or any techniques performed on a simple Euclidean distance matrix would be
effectively unstandardised though centered. These would be appropriate if the variables are on
similar scales and you want the observations to contribute in proportion to their variability.
iii) Standardising the columns by a measure of size.
One way of putting variables on comparable scales is to convert them to proportions. For example
by dividing each value by the column total: i.e. x*_ij = x_ij / Σ_i x_ij, where Σ_i x_ij is the total for each variable. (Dividing by the column mean, x_ij / x̄_j, is equivalent apart from a constant factor.)
Proportions are often easier to interpret than the original data, and can be an effective way of
removing size differences between the variables. They also ensure that all the variables will have
equal weight in the analysis - with the problems that this can entail (see above, section 3.1.2.i).
In R
xx<-data
x<-apply(xx,2,function(z){z/sum(z)})
iv) Standardising the rows.
Here the values are expressed as proportions (or percentages) of the row, i.e. observation, totals. It also ensures that
each observation is given equal weight. However, it can lead to problems with techniques that
require non-singular matrices - any technique that calculates an inverse covariance matrix for
example (e.g. canonical correlation, MANOVA and related techniques). Because the proportions for
each observation are forced to add up to one, converting say a two dimensional data set to
proportions of the observation total collapses the data to one dimension. If the data are originally 3
dimensional, converting them to proportions of the observation (row) totals collapses the space to 2
dimensions. This means that there may be p columns to a data matrix, but if they are proportions
that add to one within each observation, the data points will all fall in a p-1 dimensional space so
the covariance matrix will be singular and have no inverse (section 1.2.2). Such data are often
referred to as compositional.
In R:
xx<-data
x<-t(apply(xx,1,function(z){z/sum(z)}))
or more simply
x=xx/rowSums(xx)
Row proportions are routinely standardised further, by dividing by the square root of the column total, in any procedure based on Correspondence Analysis, a popular technique in ecology and market research. This has the effect of turning the Euclidean distances between the row vectors into chi-squared distances. It also does the same to the transposed data matrix (i.e. after flipping the matrix so rows become columns and columns become rows). This transformation has the effect of evening out the contributions made by the variables.
proportional abundance (log transformation). Changing the data to improve its interpretability is
probably the most obvious and defensible use of transformations.
Like standardisations, transformations are also used to ensure that all the variables in a multivariate
data set are on similar scales. For example, if all the variables are linear measurements except one
which is an area (or volume), it might be appropriate to square root the area (or cube root the
volume) to bring it on to a similar linear scale to the others. Alternatively, a log transform will
ensure that they are all on the same scale, this time a unit free scale of proportional change.
To improve the normality of the frequency distribution.
As discussed earlier many multivariate methods make the assumption of multivariate normality
(MVN). Most techniques are fairly robust to this assumption, but when they are not it is usually
skewness and non-linearity that cause the trouble. It is therefore sometimes useful to transform to
improve the normality of the data. There are methods to transform to MVN but they are not widely
available, and I shall concentrate on univariate methods. This section is therefore concerned with
the normality of the marginal, univariate, distributions; linearity of the relationships between the
variables (another requirement of MVN) is discussed separately below.
There are two major approaches to find the right transformation for a given data set: using rules of
thumb based on the type of data being collected, or by using the data itself to suggest the
appropriate function.
a) Rules of thumb.
Most statisticians are aware of the tradition that the square root transformation is used for Poisson distributed data, logarithms for log-normal data, log(X + 3/8) or log(X + 1) for negative binomial counts, and arcsin(√p) or the logit (i.e. log(p/(1-p))) for proportional (binomial) frequencies and other proportions. These transformations will often be adequate to make the distribution more symmetric. However, the data often do not fall unambiguously into one of the above classes.
One thing to remember is that even the optimum transformation will not be able to normalise some
data sets. Distributions of counts with many zeros, truncated distributions, continuous distributions
of positive values with the mode at zero all may be irredeemably non-normal. And though it may be
worth trying to make them more normal, it may be more sensible to try to find a method of analysis
that does not care what shape the distribution is.
I included the arcsin(p^{1/2}) more out of respect for the tradition than because it is useful. If the
sample sizes for the proportions are unequal then it will not perform as advertised; if the
proportions are between 0.25 and 0.75, no transformation is usually necessary; and when the
proportions are close to 0 or 1 it doesn't normalise the data particularly well anyway (nor does any
other transformation). As a general rule I prefer the logit. Not because it performs better, but at least
it is interpretable as a log(odds).
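For reference, the transformations mentioned above are all one-liners in R (x and p here are illustrative vectors of counts and proportions respectively):
sqrt(x)            # square root, for Poisson-like counts
log(x + 1)         # log(X + 1), useful for skewed counts containing zeros
asin(sqrt(p))      # arcsin square root, for proportions
log(p / (1 - p))   # the logit, interpretable as a log(odds)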
To improve linearity.
Many multivariate methods assume that the relationships amongst the variables are linear. This
assumption is frequently violated by biological data and the consequences are often, even usually,
serious. Having detected non-linearities (perhaps using the techniques of section 2.5) how can you
correct the situation? The simple answer is: you may not be able to; particularly if there are many
variables. There may be no set of simple transformations that can remove the non-linearities, which
could be very complex. For example, linearising a multidimensional spiral is not a simple
proposition. Indeed there are situations, particularly with ordinations like principal component
analysis (section 5) where displaying the non-linearities is more interesting biologically than
removing them before the display. However in some data sets, and with most of the modelling
techniques (multiple regression, canonical correlation and related methods) the usefulness of the
results can often be improved by a sensible choice of transformations.
A major problem is that while the non-linearity may be truly multivariate (like the spiral) most
available methods for choosing transformations progress by attempting to improve the bivariate
relationships. This is at best inefficient and at worst pointless. Improvements on one bivariate
relationship may ruin the linearity of another. Be that as it may, it is sometimes worth trying
different transformations to see if any overall improvements can be made.
In some situations experience has shown that a single transformation applied to all the variables
will help to linearise the multivariate relationships in a data set. For example, relationships amongst
morphometric variables (e.g. length of body parts) are often improved by the log(x) transformation.
Linear relationships amongst organism abundances in ecological data sets are often improved by a
log(x+1) transformation. These are rules of thumb that have emerged from experience; there are no
doubt others.
Problems with transformations.
The dominant problem is that transformations to improve any one of: interpretability, normality,
homoscedasticity or linearity may make one or more of the others worse.
Since MVN data are by definition normal, they have no relationship between the mean and the
variance (and so are more likely to be homoscedastic between samples); and have linear
relationships. So transformations that achieve multivariate normality often (usually) improve
homoscedasticity and linearity. Even so, if you are using the techniques described in this section to
improve any of the distributional properties of your data, always check the other properties after
transforming; you may have made the overall situation worse.
Perhaps more important, and less recognised, is that transforming the data to improve the
distributional properties will inevitably change the interpretability of the results. When the aim of
the analysis is simply to display or describe the data this may not matter very much (particularly if
the data is back transformed when appropriate). However, if a model is being fitted and coefficients
are being examined then the interpretation can become very complicated.
Let me take a univariate example. If we do a one way analysis of variance on log transformed data,
there are no problems.
The model is (leaving out the error term):
log(Yij) = µ + ti
where µ is the grand mean and ti is the effect due to the ith level of the treatment. This is equivalent
to:
Yij = e^µ e^ti
By testing the null hypothesis: all ti are 0, we are testing for differences between the geometric
means of the original data rather than the arithmetic. But if the geometric means are different then it
is extremely unlikely that the arithmetic ones will be the same (and vice versa). So the test on
transformed data is telling us about the raw data. Indeed, if the log transform improved normality
(or at least symmetry) then the analysis is closer to an analysis on medians and it is often useful to
consider it as such (the arithmetic mean of a skewed distribution is no longer a good description of
the typical or central value, a median is).
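A minimal R sketch of this point (the data here are simulated, purely for illustration): the F test on the log scale compares log-scale means, and back-transforming those means gives geometric means of the raw data.
set.seed(1)
grp <- gl(3, 10)                                          # three hypothetical treatment groups
y   <- rlnorm(30, meanlog = rep(c(1, 1.5, 2), each = 10)) # log-normally distributed responses
fit <- aov(log(y) ~ grp)         # one-way ANOVA on the log scale
summary(fit)                     # tests for differences between the log-scale (geometric) means
exp(tapply(log(y), grp, mean))   # back-transformed group means = geometric means of the raw data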
However, the situation changes if the analysis is more complicated. For example let us look at a two
way analysis of variance with interaction. The model is now (again without the error term):
log(Yijk) = µ + ai + bj + (ab)ij.
which implies that:
Yijk = e^µ e^ai e^bj e^(ab)ij. In other words the model is now multiplicative. If all the (ab)ijs are 0, then the
interaction term is said to be zero. However on the original scale the two factors (a and b) are still
interacting in a biological sense; the effect of ai on Y depends on the value of bj (they are
multiplicative on the original scale). Equally, if the effect of the treatment a are truly independent
of those of treatment b on the original scale then analysing the data on the log scale would very
likely produce a misleadingly large interaction term. If the log scale is a natural, interpretable one,
then this presents no problem: the analysis is interpreted on the log scale and the results taken at
face value. If however the transformation was made solely for convenience, to improve some
distributional property, then the results of the analysis must be interpreted with care. The situation is
even worse when the scale is less intuitive than logs, e.g. arcsin(√Y) or Y^-1/2.
Attempts to interpret the "relative importance" of variables by examining the coefficients in
principal component analysis, multiple regression, canonical correlation and related techniques will
only be meaningful on the transformed scales. If these are not natural and interpretable then it may
be difficult or impossible to infer the role of the original, untransformed, variables.
The conclusion to be drawn from the above is that the advantages gained from transforming to
improve distributional properties may be dearly bought. Interpreting the final results must be done
in the context of the transformation.
MORAL. Do not transform unless you really have to or unless it enhances the relevance and
interpretability of your results. It is often better to rely on the robustness of the analysis or use a
non-parametric method than use arbitrary, uninterpretable, scales.
[Figure: the sizes of two populations plotted against time, once as log(population size) and once as population size on the original scale.]
Unnecessary variables should ideally be dropped before the analysis. Whatever the reason for
wanting to drop variables, the most informative ones must be retained.
Chapter 4. Distances and similarities.
Many of the techniques described in this course operate not on the matrix of observation vectors - the
data matrix - but on a matrix of distances between the vectors. Such a matrix will contain all the
information about the relative positions of the observations in multivariate space. However, deciding
how 'far apart' two observations are may not be straightforward. We have already met the standard
measure of the distance between two vectors in section 1 - the Pythagorean or Euclidean distance; but
sometimes this measure will not convey the interesting differences between the vectors.
In fact there are an enormous number of ways of measuring the 'distance' or dissimilarity between two
observation vectors. How then can one possibly decide which to use? It is difficult. The most
important criterion is: does the measure agree with your intuition? When it says two observations are
far apart (or close), after comparing the two data points does your intuition agree? If it does not, the
results of the analysis will be uninterpretable and irrelevant. Thus the criterion for choosing a distance
measure is the same as for choosing a transformation or standardisation: will it give interpretable,
relevant results? Indeed, in many cases, the behaviour of a distance measure is largely determined by
the standardisation it imposes on the data.
The choice of distance measure is not a trivial one; for many techniques (e.g. clustering and
multidimensional scaling (MDS) - both covered later in the course) the results of the analysis may
depend more on the distance measure used than on the particular method of clustering or MDS used.
So if there is no obvious a priori choice of measure on grounds of relevance, it would be sensible to
try out the analysis with more than one measure to show that the results do not depend too strictly on
the choice.
While relevance to the current problem must be the dominant consideration in choosing a measure
there are other factors that can be important. For example, some distance measures may behave
counter intuitively with some techniques. The problem is that we live in a universe where, locally at
least, distances obey certain rules, and our interpretation of distances is based on the tacit assumption
that these hold. However some of the measures commonly used do not obey the rules, which can lead
to problems of interpretation and also representation: how do you plot a set of points whose distances
apart do not obey the rules of Euclidean distance?
The rules are:
1) If vector A equals vector B then they are zero distance apart.
2) If vector A does not equal vector B then there is a positive distance between them (you are not
allowed a negative distance).
3) The distance from A to B equals that from B to A. (Distances should be symmetric).
4) The route from A directly to B must be shorter than or equal to the route that goes via any other
point. (In other words the shortest distance between two points is a straight line). This is called the
triangle inequality as it can be rephrased to state that any side of a triangle can never be longer
than the sum of the other two sides.
If a distance measure obeys all these rules then it is called a metric; if it only obeys the first 3 then it is
a semimetric; and if it violates any of the first three it is nonmetric. Our Euclidean distance, the natural measure
in our part of the universe, is metric. However, as most travellers have experienced, travelling time is
not. The direct route between two places is often longer than one that goes via somewhere else: the
"we're-in-a-hurry-so-we'll-just-try-this-short-cut" corollary of Murphy's Law.
The ideal distance measure (one that we can interpret as we do geometric distance) is however not
merely metric, it should be a Euclidean metric. This does not mean that only the Euclidean distance
fits the bill; it means that a distance matrix made up of such a distance measure can be treated as
Euclidean distances and can be used to plot the relative positions of the vectors in a Euclidean space.
Their geometric distances apart will exactly reflect the values of our distance measure. To avoid
confusion some workers would like to call Euclidean distance (the geometric distance), Pythagorean
distance, and reserve the adjective Euclidean for Euclidean metric - any distance measure that can be
exactly reproduced in a Euclidean space. However, the name Euclidean distance is so well entrenched
that it seems futile to try to change it now; we will simply have to live with the potential for confusion.
The advantage of Euclidean metrics is that given a matrix of distances it is possible to exactly
reconstruct the relative position of the points - they may lie in a high dimensional space but there will
be no distortion. This may not be possible with non-Euclidean metrics and is especially unlikely with
semi- and nonmetrics: if the distance from A to B is larger than that from A to B via C, how can you
plot the relative positions of the three points? It can't be done without distortion, so semimetrics and
nonmetrics can be a problem with ordination techniques that try to preserve all the information in the
distance matrix while reconstructing the relative positions of the observations for plotting (e.g. see
section 6.1.5: principal coordinates analysis). Non-metrics are even worse. However for most purposes
if the measure conveys the important differences between observations you should not worry too much
whether it is a Euclidean metric or not. If you insist on worrying you can always use a nonmetric
technique (e.g. nonmetric multidimensional scaling, single linkage clustering - both covered later in
the course).
In the following sections I will outline a few of the commonest measures. It would take an entire
book to examine the whole range of measures that have been suggested at different times; but the vast
majority of workers use a fairly restricted subset.
In fact some of the measures I discuss are more often calculated as similarities rather than distances
(most analyses that accept distances also take similarities). But it is so simple to convert similarities to
distances that it is easier to be consistent and refer to the distance form throughout. The commonest
conversion is dij = (1 - sij), the one complement of the similarity, though sometimes you have to use
dij = (1 - sij)^1/2.
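In R the conversion takes one line; a small sketch (the similarity matrix s is made up):
s <- matrix(c(1.0, 0.8, 0.3,
              0.8, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = 3)   # a hypothetical similarity matrix
d1 <- as.dist(1 - s)                      # the one complement
d2 <- as.dist(sqrt(1 - s))                # the square root form, needed for some measures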
Continuous variables.
i) Euclidean distance.
dij = (∑k (xik − xjk)^2)^1/2
This is the same geometric, Pythagorean distance between two vectors that we met in Section 1
(section 1.1.1). Some statisticians apply the name to the squared distance as well, but this
can lead to confusion (and different results), so I will reserve the term for the unsquared distance.
As one would expect this is a Euclidean metric, though the squared distance is not. Though it is very
popular (it is the default distance in most computer packages), it has a number of characteristics that
can make it dangerous to the unwary:
a) It is scale dependent: any change in the units of one of the variables could completely change the
pattern of distances. For example: suppose three individuals are being compared on two variables:
length (in cm) and weight (in gms).
Length Weight
A 10 50
B 15 100
C 20 75
A is closer to C (26.9) than to B (50.2). If we now express the length variable in millimetres instead
of centimetres, the distance matrix becomes:

        A      B      C
A     0.0   70.7  103.1
B    70.7    0.0   55.9
C   103.1   55.9    0.0

Now A is closer to B than to C (a short R sketch reproducing this appears after the programming
notes below). This is particularly likely to occur if the variables are in different
units. The usual solution is to transform or standardise the data first so as to make them
dimensionless or their units comparable (section 2).
b) Because it squares the differences during the calculations it is very sensitive to outliers, or to
variables where the size of the differences depends on their average value, e.g. weight of animals:
large animals tend to have larger differences from each other than small ones. Such problems can
often be avoided by using an appropriate transformation (section 3.3).
c) For some data, e.g. species abundances, Euclidean distance will seldom be appropriate. The
problem lies in the way it handles double zeros. If two sites are missing the same species, they will
be regarded as being just as similar as if they had the species present in the same numbers at each site. Imagine
two ends of a gradient, perhaps a transect down the shore in an intertidal community. Sites at the
top and bottom of the shore will be missing the species from the mid-tide zone, should they be
regarded as more similar as a result? Ecologically, of course not. The community above the high
tide mark is ecologically more similar to the mid tide zone than to the subtidal; using Euclidean
distance could obscure that fact. For this reason, few ecologists knowingly use the Euclidean
distance (without some appropriate transformation) - though anyone who uses principal component
analysis (section 5) does so by default.
If the mutual absence of a variable or attribute conveys no information on the difference between
the observations then the Euclidean distance is not appropriate. There are a number of
standardisations that remove the effect of double zeros on the Euclidean distance, but we will not
discuss them in this course.
Programming notes:
In R.
x<-data
d<-dist(x,method="euclidean")
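To see the scale dependence described in (a) above, here is a small sketch using the three individuals from the table (the object name abc is mine):
abc <- matrix(c(10,  50,
                15, 100,
                20,  75),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("length.cm", "weight.g")))
dist(abc)                  # A is closer to C (26.9) than to B (50.2)
abc.mm <- abc
abc.mm[, "length.cm"] <- 10 * abc.mm[, "length.cm"]   # lengths now in millimetres
dist(abc.mm)               # now A is closer to B (70.7) than to C (103.1)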
ii) Manhattan (city block) distance.
dij = ∑k |xik − xjk|
It is sometimes called the mean character difference or the unrelativised Czekanowski coefficient.
It is metric but not Euclidean (though if the variables are standardised by the range - Gower's
quantitative distance - it is). It is one of the most widely used measures in all branches of biology,
social sciences or market research. It is quite frequently used in ecology, because it has the attractive
feature that it will change the same amount if two sites differ by 2 individuals in 1 species as if they
differ by 1 individual in 2 species. Thus individuals in the different species are all treated as
equivalent. Since common species tend to vary over sites more than do rare ones this measure is
sensitive to abundant species; though since the differences are not being squared it is less sensitive
than the Euclidean distance. Like the Euclidean distance it is scale sensitive so it should only be used
on variables with comparable units. Also, like the Euclidean distance, the Manhattan distance
incorporates double zeros, so it should not be used where joint absences are uninformative.
Programming notes:
In R.
x<-data
city.matrix<-dist(x,method="manhattan")
Binary Variables.
Most measures comparing binary observation vectors are similarities, but all those I present here can
be converted to distances by (1-sij).
A comparison of two observations A and B on a binary variable leads to one of four outcomes: 1 in
both, 0 in both, 1 in A but 0 in B, and vice-versa. Accumulating these comparisons over all the
variables leads to a two way table. To simplify the formulae I shall follow convention and use the
following terminology:

                    Observation A
                      1     0
  Observation B  1    a     b
                 0    c     d
sij = a / (a + b + c),
i.e. the probability of a randomly chosen variable being present in both observations, ignoring double
absences. The distance (1-sij) is equivalent to Gower's distance (ignoring double zeros - section 4.1.1 c)
applied to the binary data. It is clearly appropriate when all the variables are presence/absence and
mutual absences are uninformative. The one complement, (1-sij), is metric but not Euclidean. Because
it is insensitive to double zeros it has been widely used in ecology and animal behaviour. However this
problem - uninformative double-zeros - is actually very common and so perhaps this measure should
be more commonly used than it is.
Programming notes:
In R.
x<-as.matrix(data)
jacc.matrix<-dist(x,method="binary")
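As a check on the formula, a small made-up example (the vectors A and B are hypothetical); note that dist(..., method="binary") returns the distance (1 - sij) directly:
A <- c(1, 1, 0, 1, 0, 0, 1, 0)   # hypothetical presence/absence vectors
B <- c(1, 0, 0, 1, 1, 0, 1, 0)
a      <- sum(A == 1 & B == 1)   # present in both
b      <- sum(A == 0 & B == 1)   # present in B only
c.only <- sum(A == 1 & B == 0)   # present in A only ('c' avoided as a name)
s <- a / (a + b + c.only)        # the similarity, ignoring double absences
1 - s                            # its one complement ...
dist(rbind(A, B), method = "binary")   # ... which is what dist() returns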
Chapter 5. Principal Components Analysis
Principal Component Analysis (PCA) is a technique for the analysis of an unstructured sample of
multivariate data. Its aim is to display the relative positions of the observations in the data cloud in
fewer dimensions (while losing as little information as possible) and to help give insight into the
way the observations vary. It is not a hypothesis testing technique (like t-test or Analysis of
Variance); it is an exploratory, hypothesis generating tool that describes patterns of variation, and
suggests relationships that should be investigated further.
PCA is a member of a family of techniques for dimension reduction (ordination). I have chosen to
give PCA a chapter on its own because, while relatively easy to understand, it provides a good
introduction to many other more complex methods. Anyway, its wide use is sufficient justification
on its own. Other common ordination techniques are described in Chapter 6.
The word ordination was applied to dimension reduction techniques by botanical ecologists whose
aim was to identify gradients in species composition in the field. For this reason they wanted to
reduce the quadrat × species (observations × variables) data matrix to a single ordering (hence
ordination) of the quadrats which they hoped would reflect the underlying ecological gradient.
An Intuitive Explanation.
The aim of PCA is to reduce the dimensionality of the data, and to help us to say something about
the patterns of variation. How can this be done? Consider figure 5.1. Here we have an imaginary
data set of 10 observations on 2 variables that conform, roughly, to a multivariate normal
distribution. How can we re-express these data in fewer dimensions (i.e. one), while at the same
time keeping their relationships to each other as undistorted as possible? (See Figure 5.1.)

[Figure 5.1. The geometry of PCA. a) the data cloud. b) the new axes are shifted to
the centroid. c) the axes are rotated to lie along the major axes of the data. d) The data
are projected onto the first component axis, giving a reduced space (1 dimensional) plot.]

It is actually very easy,
even by hand. First we shift the origin of our present axes to the centroid (X̄1, X̄2) of the data cloud
(fig.5.1b) - centering the data. Now all the observations are in terms of deviations from the mean,
and have means of zero. Now we trace these axes on a transparent sheet, and rotate the tracing till
one axis lies along the main axis of the ellipse (fig 5.1c). Then we mark off the position of every
observation onto this new axis. If we simply look at this new axis on its own (fig 5.1d) we see that
the relative positions of the observations on it are quite close to those of the original data. In other
words we have effectively reduced the data from 2D to 1D while leaving their relative positions
largely unchanged.
How does this help us interpret the patterns of variation present in the data? Which variable
contributes most to the trend? Figure 5.2a shows a situation where variable X1 is responsible for the
major variability in the data. In figure 5.2b X1 and X2 contribute roughly equally. Both variables
also contribute equally in figure 5.2c, though here they are negatively related. So by looking at
where the new axis lies relative to the old ones we can say something about how our variables vary
together. We can describe the major trend in the data.
All this is rather trivial in two dimensions, but the same process can be done in spaces of higher
dimension. Though you won't use transparent sheets to rotate the axes, the idea remains the same -
it is just rather harder to visualise. To help, let us extend the idea to three dimensions. Imagine a
data cloud that looks like a loaf of French bread: long, thin, and slightly flattened. We now relocate
and rotate our axes till one lies along the main line of the data, as though we pushed a skewer down
the length of the loaf. This identifies the dominant trend in the data. We then use that line as our
axle, and rotate the two remaining axes till one lies along the next longest line of the data, i.e. the
width of the loaf. The line running down the length of the data we call the first component and it
identifies and describes the major trend in variation of the data, in the same way that the component
did in Figure 5.1.

[Figure 5.2. Three data clouds: (a) X1 responsible for most of the variation; (b) X1 and X2
contributing roughly equally; (c) X1 and X2 contributing equally but negatively related.]

The line that runs across the width of the data describes a second trend in the
data. This is the second component. The third axis, running at right angles to the first two, is called,
unsurprisingly, the third component. If we wished to draw the data in two dimensions, while
maintaining their relative positions, we could plot them on the first two of the new axes and get
quite a reliable picture. In fact, provided the loaf was long and thin enough we would probably get
quite an adequate representation if we only used the first component. Of course if the loaf was
round like a ball, i.e. no correlation between the variables and equal variances, then there would be
no way of getting a good picture in fewer dimensions.
Of course, actually calculating the positions of each observation on these new axes is not a trivial
exercise. We need to first identify the position of the new axes and then get the score of each
observation on each axis. If you worked through the matrix algebra in chapter 1, I can use the
intuitive approach used there to show how it's done. The shape of the data cloud is summarised by
the variance covariance matrix (section 1.4.1). What we want is a set of vectors that identify the
major axes of this data cloud. These vectors are clearly latent in that matrix and characterise the
structure of the cloud. If you look back at section 1.3.2 I am clearly hinting fairly broadly that the
vectors we want are the latent, characteristic or eigenvectors. These vectors identify the new axes
and allow us to calculate the positions of each of the observations.
The position of an observation x on the first, most important new axis is simply given as ∑i ui1 Xi,
a linear combination of the X variables. Multiply the value of each variable by the corresponding
element (ui1) of the first eigenvector and add them all up. The resulting number is the principal
component score of that observation for component 1. By using the eigenvectors for the second
axis, third etc., we can get the position of the observation on all the new axes (of course we will
only be interested in the more important axes).
If the eigenvectors identify the directions of the major axes of the data cloud what do the
eigenvalues do? This is easily understood if we remember that the sum of the eigenvalues Σλi =
trace of the matrix, i.e. Σmii. What is the trace of a variance covariance matrix? It is the sum of all
the variances, the total variance. So the sum of the eigenvalues adds up to the sum of the variances.
The eigenvalues are actually giving us the variance of the scores of the observations on each of the
new axes. The first component lies along the axis of the data cloud that will have the largest amount
of associated variance. The variance of the scores of each observation on the line in fig. 5.1d is
clearly going to be large; and is measured by the eigenvalue associated with that eigenvector. The
total variance is being partitioned into bits associated with the new axes. The largest bit with the
first new axis, the first principal component; the next largest bit with the next most important axis,
and so on.
A small worked example (n = 10 observations on 2 variables).

Data matrix X:
     X1    X2
    1.87  1.59
    3.62  3.28
    3.20  3.87
    4.77  4.99
    3.80  4.17
    4.57  4.07
    4.25  3.57
    2.03  2.70
    1.85  1.60
    2.68  1.59

Vector of means: (3.26, 3.14)

Covariance matrix, X'X/(n-1) after centring:
    1.241  1.196
    1.196  1.495

Extract eigenvectors and eigenvalues:
Eigenvectors U:           Eigenvalues Λ:
    0.669   0.743             2.571  0
    0.743  -0.669             0      0.165

PC scores (centred data times U, i.e. XU):
   -2.086   0.002
    0.340   0.173
    0.498  -0.534
    2.380  -0.115
    1.122  -0.288
    1.562   0.351
    0.977   0.447
   -1.155  -0.621
   -2.093  -0.019
   -1.545   0.604
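The same calculation can be reproduced in R (a sketch, using the numbers above; eigen() may return an eigenvector with its sign flipped, which does not change the interpretation):
X <- matrix(c(1.87, 1.59,  3.62, 3.28,  3.20, 3.87,  4.77, 4.99,  3.80, 4.17,
              4.57, 4.07,  4.25, 3.57,  2.03, 2.70,  1.85, 1.60,  2.68, 1.59),
            ncol = 2, byrow = TRUE)
Xc     <- scale(X, center = TRUE, scale = FALSE)   # centre the columns
S      <- cov(X)                                   # the covariance matrix
e      <- eigen(S)                                 # eigenvectors U and eigenvalues
scores <- Xc %*% e$vectors                         # PC scores = centred data times U
round(e$values, 3)                                 # should be close to 2.571 and 0.165
round(scores, 3)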
Standardisations and transformations.
A major problem with PCA is that the components are not scale invariant. That means if we change
the units in which our variables are expressed, we change the components; and not in any simple
way either. So, every scaling or adjustment of the variables in preparation for the analysis could
(and usually does) produce a separate component structure. As I showed in section 4, sensible pre-
treatment of the data by standardisation or transformation can often increase the interpretability and
biological relevance of the results. It is therefore important to choose a standardisation or
transformation carefully. In particular PCA will give different results depending on whether we
analyse the covariance matrix, where the data have merely been centred (corrected for the column,
variable, mean), or the correlation matrix, where the data have been standardised to z-scores
(centred and converted into standard deviation units). This is particularly important, as many
computer programs that do PCA automatically analyse the correlation matrix. If you do not want that
standardisation, you may have to ask explicitly for the covariance matrix. As you would expect, the
results from the two analyses will usually be very different.
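A minimal sketch of the two options in R ('dataset' below is a placeholder for your own data):
x <- dataset                        # your data, a matrix or data frame of numeric variables
pca.cov <- princomp(x)              # PCA of the covariance matrix (data merely centred)
pca.cor <- princomp(x, cor = TRUE)  # PCA of the correlation matrix (data standardised to z-scores)
summary(pca.cov)                    # the two will usually give quite different components
summary(pca.cor)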
Since the interpretation of a full PCA can take a lot of effort, it is often worth using the techniques
of section 3 to check for trends and correlations in the data before investing time in doing a PCA.
If you decide to do a PCA one of the first decisions you are faced with is how many components to
use. PCA attempts to reduce dimensionality and/or identify trends; so you will want to look at
fewer dimensions than there are variables (p). But PCA extracts as many components as there are
variables (unless p, the number of variables, is more than n, the number of sampling units; in which
case you should probably use another technique, e.g. principal coordinates, see section 6.1).
However, the components are selected in sequence as the best descriptors of the data (subject to the
constraints mentioned earlier). The last few components usually won't account for much of the
variance (information), so, if you ignore them you should lose little. But how many should you
drop? What amount of variance is small enough to ignore? We need a cut-off point.
An example may illustrate the problem. Suppose we have 4 variables but only 2 trends (linear and
orthogonal - wishful thinking but it's only an example). Once the first 2 components have identified
the trends, the remaining variation is amorphous, random - in geometric terms, spherical. There are,
therefore, no major axes to be identified, no trends left, but PCA still identifies 2 more components.
Their direction will be arbitrary, determined by sampling variation, and the variances that they
explain will be small and about equal. They can tell us nothing about the major trends of variation
in the data, and we would usually want to forget about them. These 'residual' components can
sometimes be useful in a regression context but most of the time we will want to ignore them.
Unfortunately the problem of how many components to use doesn't have a standard, rigorous
solution. We will meet similar situations in other analyses. There is a general absence of appropriate
significance tests. This is probably a good thing; too slavish an adherence to significance tests can
blind an investigator to what the data is actually saying. However it makes interpreting the results
of an analysis inevitably appear somewhat subjective. To a large extent, therefore, we resort to rules
of thumb and various approximate devices, and accept that most multivariate techniques suggest
rather than test hypotheses. It is still more unfortunate that there is no general agreement on which
rules of thumb to use.
Probably the most obvious method of selecting which components to use or interpret is to look at
the first component - the one with the largest eigenvalue. It describes some amount (λ1) of the total
variation (∑λi =∑si2). Do we consider this amount a sufficient percentage of the total? If not, we
consider the second component. In combination with the first it will explain some larger percentage
- is this sufficient yet? And so on till we have explained a sufficient percentage of the variation. The
obvious problem is - what is sufficient? In the literature this ranges from 75% to 90%.
Morrison (1976) suggests 75% and not more than 4 or 5 components. Timm (1975) suggests that
experimenters should be satisfied if no more than 5 or 6 components explain 70 to 80%.
Pimentel (1979) suggests 80-90% (but keep any others that make sense), while Mardia et al. (1979)
suggest 90%. As you can guess, these will usually give very different sized sets of components,
particularly if the number of variables (p) is large.

[Figure 5.3. Scree graph for the plankton data: eigenvalues plotted against component number.]
The other suggested methods also give a wide variety of results. Some suggest using only those
components that have eigenvalues greater than the average eigenvalue, ∑iλi / p. If you analysed the correlation
matrix this average will always be 1 (confirm this yourselves - hint: ∑λi = trace of the correlation
matrix). This cut off point (Kaiser's criterion) tends to retain too few components when p <= 20;
but in most cases that is no bad thing and it is a widely used technique.
Another potentially useful way of identifying a good cutoff is using a "scree graph". If we plot λi
against i (figure 5.3) we get a declining curve, which may allow us to spot where "large"
eigenvalues end and "small" ones begin. This technique can apparently lead to including too many
components for most purposes. So a compromise between this and Kaiser's criterion is sometimes
used by analysts.
In a study that compared these three methods (Kaiser's criterion, the scree graph and significance
tests) on data with a large number of variables (p >= 36), Kaiser's criterion retained too many
components but the scree graph performed well (provided the user was experienced).
Finally there is the pragmatic attitude: use as many components as give an interesting result,
interpret as many as give a reasonable story. PCA is generally a hypothesis generating technique, so
there is no harm in exploring the data space as much as you like to suggest ideas; and speculate all
you wish on the meaning of the components.
AUTHOR'S CHOICE: For data exploration a scree graph is often useful. If it includes too many
components it doesn't matter. Look at them anyway.
EXAMPLE 5.1
I performed a PCA on the covariance matrix of the log(X+1) transformed microplankton data. The
eigenvalues and the cumulative percentage variance explained are shown in table 5.1 and the scree
graph is in figure 5.3. Let us apply some of the rules of thumb I outlined above. If we use the
cumulative percentage variance explained then we find 70% would give us 4 components, 80%
gives us 5 and 90% 7. The scree graph suggests 2 or 5, Kaiser's criterion 3. This disagreement is
quite standard, and should not worry you. I personally would use 2, I put more reliance on the scree
graph than the other methods and always tend towards fewer rather than more. I might still look at
the other components but would be careful to keep a very sceptical attitude to them. I would still be
fairly sceptical about the two I retain - they will be suggesting patterns not stating truths.
Programming notes.
In R when using multivariate methods you may first want to enter library(MASS). It may not be
necessary, it depends on the installation.
x <- dataset           # your data (a matrix or data frame)
prin <- princomp(x)    # analyses the covariance matrix; add cor=TRUE for the correlation matrix
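A sketch of how the eigenvalues, the scree graph and Kaiser's criterion can be obtained from the princomp object above:
eig <- prin$sdev^2                 # eigenvalues = variances of the component scores
plot(eig, type = "b",
     xlab = "Component", ylab = "Eigenvalue")   # the scree graph
cumsum(eig) / sum(eig)             # cumulative proportion of variance explained
which(eig > mean(eig))             # components retained by Kaiser's criterion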
Figure 5.4. Plot of the zooplankton samples in the space defined by the first two
principal components. The numbers are the sample number. Pretty uninformative isn't it?
One of the main uses of PCA is to reduce the dimensionality of the data so that the relative
positions of the observations can be examined and patterns and trends identified. If you have
decided that the first k components are likely to preserve the important information (together they
account for ∑j=1..k λj of the total variance), then use the scores on those components for plotting. If the
first 2 components are the only interesting ones then a simple scatter plot of the observations'
positions on the components (the component scores) is all that is necessary; the relative positions of
the observations in the plot will show their relative positions in the full space.
If the dimensionality of the data is greater than 2 then it may be necessary to do more than one
scatterplot, say by plotting component 1 against component 2, 1 against 3, and 2 against 3 and so
on. A very powerful alternative approach is to use the Brush-and-spin techniques described in
section 2.2.2. These allow you to look at 3 dimensional plots of the data from all directions and
explore the trends that (you hope) will emerge.
Programming notes
In R:
eqscplot(prin$scores[,1:2])
For bubble plots we want the area of the symbol to be proportional to the variable. So we must plot
bubbles whose radius is proportional to the square root of the value.
In R we can get a bubble plot using just:
var.scaled<-apply(dataset,2,function(z){z<-5*(z-min(z))/diff(range(z))+.5})
eqscplot(prin$scores[,1:2],cex=var.scaled[,3])
var.scaled contains the variables scaled to produce sensible sized bubbles. The cex= option
requests that the chosen variable (in this case the 3rd column) be used to size the plotting symbols, so they appear as bubbles.
EXAMPLE 5.2
Figure 5.4 shows the reduced space plot for the first two principal components from the log(X+1)
transformed plankton data. At first sight there are no great biological insights to be gained from the
plot, the only apparent trend is that the bulk of the observations have high values of component 1
and low values of component 2. It must also be remembered that this plot only summarises 52% of
the variation, so the picture we get may not be very reliable; this is typical of real ecological data
(and many other kinds) - we seldom do much better. Real insight will usually be gained (if at all)
when we investigate the relationship between the plot and the data.
The presence and nature of trends in the data can often be simply displayed by adding information
to the basic reduced space scatter plot. Which method you use will depend in part on how crowded
the plot is; how many plots you are prepared to do; and how much extra information you have. In
any case the ease of producing these plots on computers means that even a large number of plots
can be scanned in a relatively short time. So, if you can, try all the following methods and see
which bring out the trend(s) most obviously.
i) Label the observations.
Sometimes the observations have names or identities that are meaningful. Plotting these labels
instead of the plotting symbol in your reduced space scatter plot can make trends more obvious.
Identifying which observations are close to which other ones is an obvious way to spot pattern.
Unfortunately a crowded plot can be made almost indecipherable if labels are added. This can be
avoided by using Brush-and-spin programs where individual observations or groups of
observations can be highlighted interactively and their labels shown. The labels can then be hidden
again to avoid overcrowding the plot.
ii) Include the original variables.
Bubble plots: A simple way of identifying trends in the original variables is to make the plotting
symbol of the reduced space scatter plot reflect the values of one of the variables. For example
make the radius of the plotted circle proportional to the value of the variable – so they look like
bubbles. If observations in one part of the plot have large values (large bubbles) and in another
have small, a trend is obvious. Of course this means that there will be one plot for each variable,
which could be a lot of plots. But the speed of modern computers and the effectiveness of the
human eye at detecting pattern makes this a relatively simple job even for large data sets.
EXAMPLE 5.3
If we plot the plankton species as bubbles in the reduced space plot some interesting trends emerge.
The best six are presented in Figure 5.5. Clearly there are strong trends in the observations which
are clearly visible from the plots. I tentatively identify 3 basic trends: One running from left to right
- Favella and Oikopleura, one from top to bottom - Gladioferens and Euterpina, and finally one
running from top left to bottom right - Harpacticoids and Temora. It will require more information
to determine what these trends mean, but I would suggest that these diagrams make a good start.
They are certainly more useful than the plot using just observation number (Figure 5.4)
[Figure 5.5. Bubble plots of six species (Oikopleura, Favella, Euterpina, Gladioferens, Temora
and Acartia) in the space of the first two principal components.]
iii) External variables.
In many situations there is more information available than went into the PCA. As a general rule
your analysis should be chosen to take advantage of all the information you have available, but
sometimes for one reason or another some is held back. You may be able to identify or interpret
trends in the observations by superimposing this information on the plots.
Programming Notes.
To name the sample units in a plot (i.e. plot a label on each point) in R
use:
eqscplot(prin$scores[,1:2],type="n")
text(prin$scores[,1:2],labels=as.vector(names))   # 'names' is a vector of sample labels
EXAMPLE 5.5
Figure 5.7 shows the plot when we identify the site from which each sample came. Clearly there is
a split, albeit not a clean one, between the observations from Whau Creek (W) and those from
Mangere (M). The Mangere observations seem to be largely concentrated in the lower right corner
of the plot. These differences seem to coincide with the trend in the numbers of Gladioferens and
Euterpina.
Interpreting the axes as trends.
Though PCA is primarily concerned with displaying trends among the observations, your
understanding of these trends can sometimes be enhanced by attempting to interpret these new
axes, the components, as new variables each with a biological interpretation. After all they do lie
along the axes of major variation that are most likely to be due to real trends in the data.
Interpreting the components in this way, sometimes known as reification (turning the component
into a thing, res in Latin), is a virtually automatic part of any PCA, and can often be useful; though
the resulting 'entities' should not be taken too seriously. To be useful, these new, synthetic variables
should have a plausible biological identity; at the very least finding out which variables vary most
along a particular axis can help interpret trends apparent in the plots.
Figure 5.5. Using external variables to aid the interpretation of trends. The main
cluster of observations in the bottom right are largely from Mangere, Whau Creek
observations are far more variable. The major trends are probably related, at least
partly, to organic and nutrient enrichment from the Mangere sewage oxidation ponds.
There are three main ways of trying to interpret the new axes.
i) Interpreting the coefficients.
Having decided which components to use, it is common to try to interpret each of them (the process
of reification). You hope they will correspond to some easily identified phenomenon. If these new,
synthetic, variables have simple and useful interpretations they can be used on their own in further
analyses and data displays. This is probably best done by attempting to identify the variables that
are most influential in determining the position of the component. We do this by looking at the
coefficients (uij) of the eigenvectors determining the components. These coefficients are usually
normalised to one; i.e. presented scaled to unit length (that is the sum of their squared values is
one). They give the relative separate contribution of each variable to the component score.
When looking at the uijs how do we select those variables that have 'significant' influence on the
position of the component? There are no formal tests; just more rules of thumb. If we are lucky, one
or more variables will have 'large' (in absolute value) coefficients on a component that are clearly
distinct from the 'smaller' values for the remaining variables; in which case it's easy. However, often
there will be a graded series of values (no group of variables clearly dominating); so either you
come up with an interpretation that includes all the variables, or you impose an arbitrary cutoff
value. There are few guidelines on where this cutoff should go. It's usually a matter of experience;
but there are a couple of potentially useful values that can be used.
The first is the coefficient each variable would have had if all the variables had contributed equally to
the component. Called the equilibrium contribution, it takes the value 1/√p, where p
is the number of variables. It has little theoretical justification; it just happens to provide a cut off
value in roughly the right place, essentially identifying "above average" coefficients. This
method will not work when the component is a "growth" component, i.e. where all the
eigenvector coefficients are of much the same size. In that case all those below the average will be
dropped even though they are large. An alternative method that works better in that situation is a
rough cut off point: say 0.7 times the largest coefficient in the eigenvector. Though totally ad hoc,
this value (I call it Mardia's criterion) seems to work quite well.
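Both cut off values are easy to compute; a sketch for the first component (assuming prin from the earlier programming notes):
u1 <- prin$loadings[, 1]            # coefficients of the first component
equilib <- 1 / sqrt(length(u1))     # equilibrium contribution, 1/sqrt(p)
mardia  <- 0.7 * max(abs(u1))       # 0.7 times the largest coefficient
names(u1)[abs(u1) >= equilib]       # variables above the equilibrium contribution
names(u1)[abs(u1) >= mardia]        # variables passing the 0.7 x maximum rule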
EXAMPLE 5.6.
Table 5.2 gives the normalised eigenvector coefficients (the uijs) for the first three components of
the plankton data. (Please remember that there are in fact 9 components in all).
Let us try to interpret the first component. The equilibrium contribution will be 1/√p = 1/√9 = .333.
Mardia's criterion will give us 0.7 x 0.774 = 0.54. The equilibrium contribution suggests that the
major trend in the data is determined by the abundances of Favella, Oikopleura, and Acartia.
Mardia's criterion identifies Favella alone as the important species for separating the observations.
In fact, as we have seen in Figure 5.5 the Favella-Oikopleura trend is very clear; but it is not
obvious that Acartia is part of this - it seems more closely related to the Harpacticoid trend. The
interpretation of a principal component can often be better understood if we remember that the
score an observation has on the jth PC = Σiuijxi (section 5.3 ii). So, if we consider only the largest
coefficients we get:
PC score = 0.77xlog(Favella density+1) - 0.45xlog(Oikopleura density+1) + 0.37xlog(Acartia
density +1).
All the other coefficients are considered to be effectively zero. Clearly a large value for Favella
will give a large value for the PC. A large value for Oikopleura will tend to give a large negative PC
score. We can also reverse the logic: a large PC score implies large amounts of Favella (and/or
Acartia) but a small value for Oikopleura. Similarly large negative PC scores are only possible if
there are many Oikopleura and few Favella. We might therefore conclude that component one is
identifying a trend between sites containing large numbers of Favella and small numbers of
Oikopleura (giving positive values on the PC axis) and sites with few Favella but lots (relatively)
of Oikopleura (sites with negative values on the PC axis). Such a component is called a bipolar
component because it identifies a contrast: where there are Favella you do not find Oikopleura and
vice-versa. A bipolar component is easily identified by looking at the eigenvector coefficients. If
some of the important coefficients are positive and some negative then a contrast is involved - think
about it.
Table 5.2. The eigenvectors associated with the first 3 principal components of the
zooplankton data.
Observations with positive scores on the 2nd PC axis will tend to have lots of Gladioferens and few Euterpina. Those
with large negative values will tend to have many Euterpina and few Gladioferens.
Clearly, interpreting the components from their eigenvector coefficients gives us no more than was
apparent from the bubble plots of figure 5.5. Indeed, arguably it gives less. This will usually be
true. However, when there are many variables, scanning the coefficients may be an efficient way of
suggesting which variables to plot.
NOTE: Some computer packages when they perform PCA do not label the component
(eigenvector) coefficients as such in the output. They call them factor or score coefficients, a relic
of the bad old days when PCA was seen as a form of Factor Analysis (see later in this chapter).
princomp() calls them loadings.
ii) Interpreting the component-variable correlations.
There is another, less direct, approach to the problem of interpreting the components. We can
characterise the component by identifying the variables that vary closely with it. In essence we are
not identifying the variables that determine the value of the component, we are identifying variables
that are well described by the component. We can then try to use them to infer an identity for the
component. For this interpretation we simply get the correlation coefficient (cij) between variable xi
and the scores on the jth component (the component correlation). I refer to them as cij to avoid
confusion with the elements rik of the correlation matrix. The set of component correlation
coefficients is sometimes called the factor structure, a singularly uninformative designation derived
from psychology. They can be calculated from the eigenvector coefficients as cij = uij √λj / si, where si
is the standard deviation of the ith variable. When we are using coefficients from a PCA on the
correlation matrix, the variables have already been standardised into standard deviation units (each
si = 1), so the division leaves uij √λj unchanged. In
this case the interpretation will be the same as for contribution to the component, as described
above. However, when we have left the variables unstandardised, (we did the PCA on the
covariance matrix S), the cijs and the uijs will often (usually) produce different interpretations.
Those variables that correlate most closely with the component may have small variances, and thus
exert little influence on the position of the component, and therefore on the trend that it describes. If
the analysis was performed on the covariance matrix, a variable with large variance is regarded in
some sense as important, otherwise the correlation matrix would have been used. Thus, those
authors who say that the cijs (correlations between Xi and jth component) are a measure of the
contribution of Xi to the jth component are correct only when they refer to analyses of the
correlation matrix, where interpretation of cij and uij produce the same results. When analysing the
covariance matrix, the contributions of the variables to the component values, the uijs, are
generally to be preferred to the correlations (cijs).
This is not to say that the cijs cannot be useful. They identify which variables are well described by
the components. Indeed, cij2 is the proportion of the variance of a variable explained by a
component. It may be interesting to know if 99% of a variable's variance is explained by a
component; particularly if it was a component you were considering dropping. However, the
importance of a variable to the component is not measured by cij or cij2, (except almost
coincidentally when you've analysed the correlation matrix). The importance of a component to a
variable is usually not the same as the importance of a variable to the component. It may be an
overall unimportant variable, i.e. it may have a small variance.
The difference between the uij and the cij can most simply be understood by the useful fact that if
you regress the values the observations have on the PC (the PC scores) against their values on the
original variables the uijs emerge as the partial regression coefficients (section 8.****). That is, the
influence of the ith variable on the PC score keeping all the other variables constant. The cijs are of
course the simple correlations between PC score and the variable.
Programming note.
In R if the PCA object is called prin then the coefficients (uij) are in prin$loadings. Print them
using print(prin$loadings,digits=3,cutoff=0) or something similar. The PC-variable correlations
(cij) are easily got by cor(original variables, prin$scores)
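A concrete version of that note (a sketch; x is assumed to be the data matrix used in the PCA):
print(prin$loadings, digits = 3, cutoff = 0)   # the coefficients, uij
cij <- cor(x, prin$scores)                     # the component-variable correlations, cij
round(cij, 3)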
EXAMPLE 5.7
The component-variable correlations (cijs) for components 1 and 2 of the microplankton data are
presented in table 5.3. Their interpretation is substantially similar to the coefficients (uijs - Table
5.2) - Favella, Oikopleura and Acartia on the first component and Gladioferens and Euterpina on
the second - though this would not always be the case.
Table 5.3. The component correlations associated with the first 3 principal components of
the log(X+1) transformed zooplankton data.
PC1 PC2 PC3
ACARTIA -0.585 -0.258 0.576
EUTERP -0.017 -0.836 0.077
GLADIO -0.107 0.877 0.347
HARPACT -0.119 -0.353 0.538
OITHONA -0.336 0.158 -0.267
PARACAL 0.106 -0.082 0.029
TEMORA 0.357 0.388 0.386
FAVELLA -0.913 0.131 -0.132
OIKOPL 0.717 0.086 0.148
[Figure 5.6 panels include a) Water clarity and b) Faecal coliforms.]
Figure 5.6. Bubble plots of external variables displayed in the reduced space from
a PCA on the Zooplankton data (log(X+1) transformed). Note the trends that clearly
relate the physical variables to the patterns in the animal data. By identifying groups of
observations (areas of the space) with particular physical properties we can go back to
the species bubble plots (Figure 5.5) and see which species are
found there.
EXAMPLE 5.8
We saw in Figure 5.5 that there were differences between the two harbours on the first component
(apparently attributable to the Mangere sewage oxidation ponds). The bottom right corner of the
reduced space plot contained nearly all the Mangere samples. The Whau samples were spread over
the arc from bottom left to upper right. The bubble plots of some other external variables are shown
in Figure 5.6. These plots clearly suggest that components one and two both relate to the physical
characteristics associated with pond effluent (sewage bacteria, nutrient enrichment etc.). It is clear
from the plots that if the individual components are to be given separate meaning they should be
interpreted simultaneously. Clearly, given the concentration of the Mangere observations in the
lower right quadrant, component 1 is a contrast between a subset of Whau stations and the Mangere
sites, while component 2 contrasts a different subset of Whau stations with the Mangere sites.
The Whau stations low on component 1 seem to be more offshore with relatively clean water.
Those Whau stations high on component 2 are more inshore and though cleaner than Mangere are
dirtier than the other Whau stations. Therefore we might tentatively identify component 1 as
relating to a comparison between offshore and inshore water while component 2 could be mainly
describing the difference between the dirty inshore water of Mangere and the (somewhat) cleaner,
less nutrient enriched inshore water in Whau creek. This is entirely consistent with the plankton
associated with these trends (see the species bubble plots in Figure 5.5). Oikopleura is an offshore
species, coming in on the high tide or at more offshore stations.
tends to be an inhabitant of mangrove swamps and inshore brackish water which, though
contaminated with sewage, have not had the excessive fertilising of the oxidation ponds.
Assumptions of PCA.
Since PCA is normally an exploratory, hypothesis generating technique it doesn't really make any
assumptions (providing you are not using any of the tests). However, PCA is more useful and the
results easier to interpret (and biologically more relevant) if certain assumptions can be made: in
particular, random or at least representative sampling, and linearity of the relationships between
variables.
PCA assumes that the covariances (or correlations) adequately describe the relationships between
variables. This is only true when the relationships are linear, or at least monotonic. In real data the
relationship between two variables may be markedly non-linear; or the covariances may fail to
describe the situation for other reasons, so the resulting PCA may not give good results.
Outliers.
Another feature of the data that may distort the components is the presence of outliers. There are
two major ways in which outliers can affect the results: by distorting the covariances between the
variables; and by adding spurious dimensions or obscuring the cutoff point for choosing
components. We can detect points with excessive influence on the covariance matrix by searching
the plots of sampling units on the first few components for extreme values. We can detect the other
kind of outlier by considering the minor components. These last few components (the discarded
ones) can be thought of as being residual variation left after a model (the higher components) has
been fitted to the data. Looking for outliers among these residuals can identify observations that are
adding unimportant dimensions to the data. One of the less important components may exist
solely because of a single outlying data point. These outliers, by inflating the variance of
these residual components, may obscure the discontinuity between the 'large' components and the
'small' ones. This will make it more difficult to decide which components to discard. These outliers
may be picked up by looking at the scatter plots of the sampling units scores on these 'minor'
components.
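A quick way to do this in R (a sketch, assuming prin from earlier; here the last three components are inspected):
k <- ncol(prin$scores)                 # total number of components extracted
pairs(prin$scores[, (k - 2):k])        # scatter plots of the scores on the last three components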
If you are using PCA to generate hypotheses, there is no reason not to drop outliers and reanalyse.
It could also be interesting to look more closely at those observations; their deviations may be a
clue to something important.
Factor Analysis
The versatility of PCA actually creates problems. There is a family of techniques closely related to,
and partly based on PCA-like mathematics, called Factor Analysis (FA). The similarities between
Factor Analysis and PCA are so great that the boundaries have become blurred. This blurring is not
helped by the varying interpretations that workers in different disciplines put on the words Factor
Analysis. For my purposes I shall maintain that PCA is primarily concerned with the variation
between the observations, summarising, describing and possibly explaining it. The new variables
we extract from the data are defined by the pattern of observations. Even when we are looking at
the relationships between the original variables we do so to identify which combinations of
variables are related to the major variation among the observations. Factor Analysis on the other
hand is concerned with the pattern of covariation among the variables, the observations are only
relevant in so far as they define the pattern of correlations among the variables.
There are two main streams of thought on what factor analysis is trying to do, so when reading
anything about factor analysis it is important to identify which is being discussed. Most statisticians
and a lot of social scientists see factor analysis as a way of identifying underlying latent, causal variables that are responsible for the observed values of the measured variables. In other words factor analysis is attempting to fit a particular type of causal model to the data. Other practitioners
see it more as a way of identifying groups of covarying variables which may allow the efficient
summary of information. For the purposes of the book, I shall be pedantic and restrict factor
analysis to the fitting of latent variable models - because in this form I don't like it and it makes an
easier target.
Factor Analysis has been widely used in psychology and the other social sciences, but it has tended
to receive a bad press from statisticians outside the social sciences. Unfortunately I tend to agree
with them, though not necessarily for the same reasons, so I cannot recommend its use, particularly
by beginners. For this reason I am not giving it a chapter to itself. However, since it does appear in the literature, and to show the similarities with, and more importantly the differences from, PCA, I will give a brief outline of the core methodology.
Factor Analysis, like PCA, is designed to identify underlying patterns in the data (the variables you have measured). However it has a more ambitious goal, and is consequently based on much more restrictive assumptions. It assumes that the pattern of covariation observed in the set of variables you have measured is due to a set of underlying factors - some unique to each variable, some shared between variables. The goal of factor analysis is to identify and interpret all the factors that influence the variables. This is probably best illustrated by a path diagram that shows a possible set of causal links between the variables in your data set (V1, V2, V3) and the factors that you cannot directly measure but which you think affect the observed values of the variables:
Path diagram: the observed variables V1, V2 and V3 each have a unique factor (U1, U2, U3), while the common factors F1 and F2 each influence more than one of the variables.
F1 and F2 are the common factors; they each affect more than one variable. U1, U2 and U3 are the unique factors. The fundamental idea is that the pattern of variation and covariation that we observe in the covariance or correlation matrix for the variables can be explained in terms of the underlying causal factors. The common factors are described in terms of their contribution to each variable (factor score coefficients) or by their correlation with each variable (factor correlations). The components in PCA are described analogously: by the contribution of the variable to the component (the component coefficient, equivalent to the factor score coefficient), or by the correlation of the component with each variable (the component correlations). The importance of the unique factors can be inferred from the size of their contribution to the variables. However the real problem lies in identifying the way in which the common factors combine to produce the pattern of covariance observed among the variables. The final set of factor score coefficients that describes this is called the factor pattern. The corresponding set of factor correlations is called the factor structure.
There are two major forms of Factor Analysis: exploratory Factor Analysis when little or nothing is
known about the factor structure - this is the commonest use; and confirmatory Factor Analysis
where the number of factors is known a priori and a certain amount is known about the structure of
a hypothesised factor pattern. This last form, though apparently the most robust use of Factor
Analysis, is, by its nature, rare outside the social sciences. It is generally performed using
Maximum Likelihood Factor Analysis or Latent Variable Modelling which allows the appropriate
hypothesis tests to be performed. However due to its sophistication I shall not describe it further. I
shall instead concentrate on the exploratory form.
There are three parts to most exploratory Factor Analyses: i) to identify what proportion of the
variation in each variable is shared with other variables and what is unique; ii) to identify how many common factors are involved; and iii) to try to pin down and interpret these factors.
The analysis usually starts with the correlation matrix of the variables. Clearly the off-diagonal
elements (the correlations) will reflect the activity of the common factors, and will be unaffected by
the unique ones. The diagonal elements (the variances, standardised to one) however, can in theory
be split into two parts, one unique to that variable and another caused by the action of one or more
common factors. So, when we analyse the matrix to detect the common factors, it would be sensible
to remove the variation due to the unique factors before we start. This would leave us with the
communalities, estimates of the variance due to the common factors, as the elements along the
diagonal of the reduced correlation matrix that results. There are a number of methods for
estimating the communalities, some simple but approximate, others requiring successive
approximations (iterative methods), and yet others that provide exact results derived, somewhat laboriously, from theory. I will not go into detail.
Now that we have a correlation structure that reflects the pattern of the common factors, we want to know how many factors there could be. The most commonly used Factor Analysis technique, Principal Factor Analysis, performs a PCA on the reduced correlation matrix. If the factor model holds, there should be as many major components as underlying factors; the remaining components are assumed to be trivial, representing random error. The number of 'significant' eigenvalues should therefore identify the number of factors. Of course this step, guessing how many eigenvalues are 'significant', is fraught with problems (section 5.3). Most computer packages
use Kaiser's criterion. If we accepted the eigenvectors of the reduced correlation matrix as
describing the factors, and some workers do, our interpretation would in fact probably not differ
markedly from the solution derived from an orthodox PCA on the correlation matrix. However
these are more usually regarded as preliminary values that require more manipulation before an
interpretation is attempted. PCA is being used to identify the subspace in which the factors must lie.
It may not identify the factors themselves.
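A minimal sketch of these first steps in R, using the simplest (approximate) communality estimate - the squared multiple correlation of each variable with all the others. This is purely illustrative; mydata is assumed to be your data matrix and two factors are assumed for the final line:
R <- cor(mydata)
h2 <- 1 - 1 / diag(solve(R))       # squared multiple correlations as rough communality estimates
Rr <- R
diag(Rr) <- h2                     # the reduced correlation matrix
e <- eigen(Rr)
plot(e$values, type = "b")         # scree-type plot to guess the number of common factors
L <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))   # preliminary factor loadings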
Figure 5.8. Two possible path diagrams for four variables (V1-V4), each with its own unique factor (U1-U4), and two common factors (F1, F2): a) a 'simple' structure in which each common factor influences only a subset of the variables; b) a more complex structure in which the common factors each influence most of the variables.
Let us assume that we have guessed the correct number of factors. The corresponding preliminary
factors, like the principal components that they so resemble, define a reduced space. This is the
space spanned by the factors. Let us now envisage these factors as axes that define the space. There
are an infinite number of ways of orienting them in the space. How should we choose just one
orientation? Of course, we have one quite sensible set, the principal components of the reduced
correlation matrix, the preliminary factors. These describe the maximum amount of shared
variance. But most practitioners of Factor Analysis do not regard these as adequate. They are led to
a choice of orientation by a desire for what they call 'simple structure'. Some argue that the number
of factors that contribute to an individual variable is likely to be small. They believe that the
structure a) in figure 5.8 is more likely than b).
The problem is that the preliminary factors, the component axes of the reduced correlation space,
are liable to have a complex interpretation, like the second path diagram above. So, after
identifying the factor space, it is usual to rotate the axes to a new orientation that makes their factor
structure and therefore their interpretation simpler. Unfortunately there are a number of definitions
of "simple" as applied to factor structure- so there are a number of rotation techniques. The
commonest of these is probably the varimax rotation. This attempts to line up a factor axis so that
a variable will have either a high correlation with it or none at all. This will avoid the coefficients
of intermediate size that can make PCA derived components so tricky to interpret (section 5.5.2.i).
It will also usually make the pattern of the derived factors more like that of the "simple" path
diagram above. It achieves all this by rotating each axis to maximise the variance of the factor
loadings (hence "varimax"), while keeping the axes orthogonal. There are other rotation methods.
Some are designed around alternative views on simplicity, others around a priori conceptions of
how a factor structure should look. This raises the spectre, mentioned by many critics of Factor
Analysis, that an expert can get virtually any result he wants by appropriate choice of a rotation
technique. Indeed, one of my teachers used to call principal factor analysis, rather unkindly, "PCA
with fudge factors".
Some of the rotations, like varimax, keep the axes orthogonal. They therefore assume that the
factors are independent of each other. Other techniques allow for the possibility that the underlying
factors may themselves be correlated. The factors produced by these oblique rotations appear to be
somewhat fragile and are often difficult to interpret.
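For what it is worth, R's built-in factanal() function fits a maximum likelihood factor analysis and applies a varimax rotation by default, and varimax() can be applied directly to any matrix of preliminary loadings; a minimal sketch (mydata is assumed to be your data matrix, and two factors are assumed purely for illustration):
fa <- factanal(mydata, factors = 2, rotation = "varimax")
fa$loadings        # the rotated factor structure
fa$uniquenesses    # the unique variances (1 - communality)
# varimax(L)       # or rotate an existing loadings matrix L directly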
The common factors, whether oblique or orthogonal, are interpreted in much the same way as PCA
derived components - by examining the strength of the relationship between the variables and each
factor in turn. Most workers use the correlation between the variable and the factor.
The major problems with Factor Analysis can be divided into two groups: those associated with the
assumptions and those resulting from its performance.
As explained above, Factor Analysis assumes that the variance of a variable can be partitioned into
a unique component and one due to one or more factors that also affect other variables, as in the
path diagram above. This assumption explicitly excludes the possibility of one of the variables
causing some of the variation in one of the others.
This might considerably limit Factor Analysis's usefulness in some disciplines - in ecology, for example, where it is reasonable to suppose that there are at least some direct interactions between species, e.g. predation, mutualism or competition. The assumption of 'simple structure' is also of concern. It is true many scientists have recourse to the principle of parsimony; they look for the simple, elegant solution. But their relationship with this principle tends to be ambivalent. Whitehead said "seek simplicity, and distrust it". My personal prejudice is that the form of simplicity sought by the common rotations in Factor Analysis has little relevance to many, if not most, situations.
In performance, exploratory Factor Analysis appears to be rather fragile. Seber (1984) outlines in some detail a test, performed by Francis (1973), of the effectiveness of Factor Analysis. Francis investigated the ability of Principal Factor Analysis, with and without rotations (both orthogonal and oblique), to identify the factor structure from covariance matrices derived from known underlying factor models. Thus, he knew the right answer, and he wanted to see how well the technique did in finding it. He concluded that if the correct number of factors were known, then Principal Factor Analysis may provide a reasonable factor structure, which may or may not improve with rotation. The big problem is, of course, that there is no satisfactory method for estimating the number of factors, and fictitious factors are all too easily generated. Seber further concludes that
unless the underlying factor structure really is simple in an appropriate way, orthogonal rotations
may be worse than useless if the wrong number of factors is chosen. He also suggests that oblique
rotations will seldom be useful. J. Scott Armstrong in a satirical paper in the American Statistician
(1967) also demonstrated the weaknesses of exploratory factor analysis.
The practical limitations of exploratory factor analysis are therefore largely in the appropriateness
of simple structure and the choice of number of factors. If the number of factors is known a priori,
and the approximate structure of the causal relationships is known to be "simple" - no direct
influence between observed variables, no loops in the path diagram, then factor analysis may work
quite well. In other words, confirmatory factor analysis is fine, but exploratory analysis is at best
weak, at worst plain misleading. Furthermore, some workers have concluded that there is no
practical difference between principal factor analysis where the communalities were set to 1
(essentially a PCA with factor rotation), image factor analysis (yet another method) and maximum
likelihood factor analysis - provided the number of factors was known in advance and they used the
same rotation methods. The results can be expected to diverge if the number of factors is wrongly
estimated.
If Factor analysis is so fragile and unreliable how is it that so many workers (particularly in the
social sciences) swear by it? It may be impolite to suggest it, but one reason that factor analysts
claim it to be a useful technique may be because, even if the factors generated are spurious, they
can nearly always be given plausible interpretations. The interpretations can often then be
rationalised as supporting some particular viewpoint. The fact that many factor analysts are
working in complex subject areas (like the social sciences) where it is very difficult to confirm the
factor interpretations, may explain why these workers stay satisfied with their results. That it is
depressingly easy to fall into this trap of finding satisfying interpretations of spurious factors is
shown by an embarrassing episode in my own past.
I was once involved in performing a PCA for a graduate student - in those days computing was so
difficult students often did not do it themselves. I got the print out and together we pored over it as I
gave an inspired, eminently plausible and authoritative interpretation of the component coefficients
(the factor score coefficients). The student went away happy. Imagine how I felt when I discovered
the next day that during the reading of the data into the computer there had been a mix up in the
input formats and I had been effectively interpreting a PCA on random data! I had had no difficulty
in providing meaningful interpretations to the components, yet they were complete rubbish.
There were two main reasons why I was able to fall into the trap. The first was that there were no
standard errors available to show me that the coefficients were not reliable, the second was that the
student was exploring the data with an open mind. He had no idea of what to expect, and the range
of possible, biologically plausible relationships among the variables was enormous. When little is
known about a system there is nothing to contradict any interpretation of its structure. The more
complex the system being studied and the less that is known about it, the easier it is to be tricked by
spurious, non-existent factors (perhaps why factor analysis has proved so popular in the social
sciences).
I mentioned above that there is a more exploratory, less formal approach to factor analysis. Some
people use it simply to identify groups of covarying variables, to suggest patterns among the
variables. They do not claim to be identifying underlying, hidden factors or to be modelling causal
systems. Their factors are simply new variables that best summarise the pattern of correlations
among the variables after a certain amount of variation has been dropped as irrelevant. While in
many, particularly exploratory situations this may be more reasonable than attempting to fit causal
models, it still suffers from some of the problems outlined above. In particular the choice of how
much information to discard may still be crucial. As an ad hoc check on the robustness of the choice, do not rely on any single method for choosing the number of factors; try a range of values. If, after rotation, they all give substantially the same interpretation to a common subset of factors, then you might have some faith that you are identifying a stable and real pattern. If, however, the patterns detected are closely dependent on the number of factors extracted, then I would abandon factor analysis altogether and rely on a simple PCA performed on the correlation matrix.
I feel it is a regrettable fact that the use of axis rotations has crept from Factor Analysis to PCA.
With the increased availability of computer packages that allow a rotation by the addition of a
single command, people have begun doing them as a matter of course. Some say they rotate to
increase the interpretability of the components. They are in fact not interpreting the principal
components but the subspace spanned by the subset of components they have retained. Whatever
the rotated axes may be, they are no longer principal components. These people are performing a principal factor analysis where all the variance is assumed shared - i.e. that the communalities equal the total
variance. Their results are therefore subject to the same uncertainties as those from a more orthodox
Factor Analysis, and will probably not be much different.
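What such a rotation amounts to can be sketched in a few lines of R (mydata is assumed to be your data matrix, and retaining three components is assumed purely for illustration):
pca <- prcomp(mydata, scale. = TRUE)
L <- pca$rotation[, 1:3] %*% diag(pca$sdev[1:3])   # loadings (component correlations) of the retained components
rot <- varimax(L)                                  # rotate within the retained subspace
rot$loadings                                       # the rotated axes - no longer principal components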
In conclusion, PCA and Factor Analysis are for two distinct uses. PCA is exploratory and is
primarily concerned about patterns in the observations; apart from the assumption of linearity, it
makes no a priori assertions about the patterns of covariation. It can be used to suggest descriptions
of trends in the data, but it should never be concluded that all the important trends can be identified
(sections 5.3 and 5.). It can only be suggested that if a trend can be clearly identified then it is
probably important; there are exceptions here - the "horseshoe effect" (section 6.8.4).
Factor Analysis on the other hand is primarily concerned with modelling the patterns among the
variables. It assumes that the observed pattern of covariation is the result of a particular
arrangement of shared and unique factors and further, if rotations are used, that the common factors
conform to some particular 'simple' structure. If the number of factors is known a priori then
confirmatory factor analysis can be performed which when the model structure is appropriate can
be very powerful. However when the number of factors is not known the factor analysis will be
exploratory and subject to so many problems that it is doubtful if the results should carry much
credence. (You should be fairly sceptical about the results of an unrotated PCA; how much more so
for an exploratory factor analysis.) Even if exploratory factor analysis is being used simply to
identify groups of covarying variables (possibly by using a principal components analysis then
rotating components within a subspace), the assumption is still being made that the appropriate
number of factors can be correctly identified, and as we have seen this is unlikely to be true.
AUTHOR'S CHOICE.
Reserve factor analysis for the fitting of models when the number of factors is known a priori and
when the underlying structure is known to conform to one of the forms of "simplicity".
Chapter 6. Multidimensional Scaling.
Multidimensional scaling (MDS) is an extended family of techniques that try to reproduce the
relative positions of a set of points in a reduced space given, not the points themselves, but only a
matrix of interpoint distances (dissimilarities) - see section 3. This sounds easier than it is. These distances might be measured with error, or even be non-Euclidean. With PCA there would be at least one Euclidean configuration of points that would exactly reproduce the dissimilarity matrix - the original data matrix. Indeed there would be an infinity of them, since any rotation or reflection of the original data would leave the interpoint distances unchanged. This configuration might lie in a high dimensional hyperspace but at least it would be exact. The problem of approximating it in fewer
dimensions is not too difficult. However, if the matrix of dissimilarities is not Euclidean, then there
is no guarantee that there is any exact configuration, let alone an adequate approximation. It is this
kind of problem the scaling techniques are designed to solve.
The origins of many of these techniques are in psychology (psychometrics to be precise) so the
extensive literature is littered with terms like stimulus, subject, attribute and preference, which
makes reading it rather stressful.
Principal Coordinates.
Principal coordinates (PCO) is closely related to PCA. It finds a configuration of points that
optimally reproduces the distance matrix in a Euclidean space of few dimensions. Though it works
best on a Euclidean dissimilarity matrix, it will nearly always produce useful results from non-
Euclidean ones.
Intuitive explanation
Assume for the moment that the dissimilarity matrix is Euclidean. We can therefore try to imagine a
cloud of points hanging in a space of unknown dimensionality. These points have no coordinates
since as yet there are no axes (only distances were given). The problem is to impose a set of axes on
the space and then locate the points on them. By a cunning transformation of the dissimilarity matrix, it can be made equivalent to a matrix YY', the cross-product matrix of a set of coordinates Y. An eigenanalysis of this cross-product matrix (as in PCA and CA) will give a diagonal matrix of eigenvalues Λ and a matrix of eigenvectors V, so that YY' = VΛV' (see chapter 1). The coordinates are given by Y = VΛ^(1/2), the elements of the scaled eigenvectors (the Λ^(1/2) rescales each eigenvector to have a sum of squared coefficients equal to its eigenvalue). As in PCA the eigenvectors identify orthogonal axes that run down the major axes of the cloud of data points; though they will not be the old axes rotated - there are no old axes. As in PCA the size of the eigenvalues gives the variation of the data points along the associated axis (eigenvector). It is therefore comparatively easy to select a reduced space to display the relationships among the data points. If the original distance matrix was Pythagorean then the results will be identical to a PCA on the data matrix from which the distances were calculated.
If the dissimilarity matrix is non-Euclidean, then there may be problems with the interpretation of the eigenvalues and, in extreme cases, the plots themselves; this is discussed under 'How many axes?' below.
Relationships with PCA.
The final stage of PCO, an eigenanalysis of a cross products matrix YY', seems very reminiscent of
PCA - an eigenanalysis of a crossproducts matrix Y'Y. If the original dissimilarity matrix actually
contains Euclidean (Pythagorean) distances, then PCA and PCO are doing exactly the same thing,
reproducing a cloud of points in fewer dimensions on orthogonal axes while optimally preserving
their relative positions. It is reassuring that in such a case, a PCA on the covariance matrix is identical to a PCO on the dissimilarity matrix. The major conceptual difference between the two
techniques is that PCA uses the eigenvectors to project the original data points (vectors) into the reduced space, whereas in PCO the scaled eigenvectors themselves are the vectors of observation scores in the reduced space; nevertheless the positions of the points in the reduced space, and the positions of the corresponding axes, will be identical.
By this stage you should be grasping the extraordinarily incestuous relationships amongst
multivariate methods. Every so often it is demonstrated, for example, that all the methods of
ordination are special cases of each other. This serves to remind us that the main difference between
them is the transformation and/or standardisation they impose, not the fact that they have different
names.
Programming notes:
In R.
If d is your distance matrix then:
pco <- cmdscale(d, k = number, eig = TRUE)
where number is the number of dimensions you want to extract (usually the number of observations minus one, so you can look for negative eigenvalues). Your scores on the principal axes will be in pco$points, and your eigenvalues in pco$eig.
A scree plot can be done using the plot() function:
plot(pco$eig, type = "b")
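For the curious, here is a minimal sketch of the 'cunning transformation' that cmdscale() performs internally (d is assumed to be your distance matrix; this reproduces the first two principal axes up to reflection):
D2 <- as.matrix(d)^2                                   # squared dissimilarities
n <- nrow(D2)
J <- diag(n) - matrix(1/n, n, n)                       # centring matrix
B <- -0.5 * J %*% D2 %*% J                             # double-centred matrix, playing the role of YY'
e <- eigen(B)
Y <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))    # coordinates Y = V Lambda^(1/2) on the first two axes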
EXAMPLE 6.1
Using the zooplankton data I have performed a PCO on a chi-squared distance matrix. This is of
course identical to doing a PCA on chi-square transformed data. The results of an analysis are
primarily determined by the transformation of the data, not the technique.
Figure 6.1 Principal coordinates reduced space plot using chi-square distance (Principal Axis 1 against Principal Axis 2; sampling units labelled M and W). Because chi-square distance is a special case of the Euclidean distance this is identical to a PCA on the appropriately standardised data.
Interpretation
i) How many axes?
Since PCO is an eigenanalysis of a cross-product matrix, like PCA, the eigenvalues give the
amount of variation described by the associated eigenvectors. A scree graph is therefore the most
direct way of selecting a reduced space. The criteria are the same as for a PCA (see section 6.3).
However there is a complication: if the dissimilarities are non-Euclidean then there can be negative
eigenvalues. These indicate that there is no exact configuration possible in Euclidean space.
What you do in this situation depends on whether the negative eigenvalues are referring to
potentially relevant information. These eigenvalues are summarising those components of the
distances that prevent the observations being represented in a Euclidean space. If these bits can be
assumed to be uninformative, then you could argue the negative eigenvalues are irrelevant and can
be discarded. So the measure of goodness of fit is then the proportion of the positive eigenvalues
represented by the retained ones.
Mardia suggested two measures of the goodness of fit of a k dimensional reduced space:
a1k = ∑i=1..k λi / ∑i |λi|,
a2k = ∑i=1..k λi² / ∑i λi².
I prefer a1k. When there are no negative eigenvalues it is the same as the measures used in PCA. Also, if you are used to the usual measure, then a2k appears to exaggerate the goodness of fit; it might mislead.
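A minimal sketch of computing a1k from the cmdscale() output (d is assumed to be your distance matrix, and k = 2 retained axes are assumed for illustration):
pco <- cmdscale(d, k = nrow(as.matrix(d)) - 1, eig = TRUE)
lambda <- pco$eig
k <- 2
a1k <- sum(lambda[1:k]) / sum(abs(lambda))   # proportion of the absolute variation in the first k axes
a1k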
Figure 6.2 Scree diagram (eigenvalue against axis number) for the PCO on Gower's unstandardised distance of the log(X+1) transformed zooplankton data. Note the negative eigenvalues: 40 of the 66 non-zero eigenvalues are negative, though they only represent 13.1% of the absolute variation. The first 2 axes summarise 39.6% of the absolute variation.
ii) Interpreting the principal axes.
If you still have the original data matrix from which you calculated the distances then you can use
the axis variable correlations to help interpret which variables are associated with which axis.
Bubble plots can also be used.
iii) Problems.
a) Adequacy of the reduced space.
Just as with PCA this is typically summarised by the eigenvalues.
b) Stability of space.
As in any other method based on eigenanalysis, if two or more of the eigenvalues are equal then
there will be problems of instability (sphericity).
EXAMPLE 6.2
Using the zooplankton data set again I have performed a PCO on a site × site dissimilarity matrix
using Gower's distance, ignoring double zeros, but without the range standardisation (not a distance
measure you have met in this course, don’t worry about it) and using a log(X+1) transformation.
The distance was chosen to remove the double zero problem, and the transformation was chosen to
reduce the effect of dominant species. The scree plot of the eigenvalues is shown in Figure 6.2 . 40
out of the 66 non-zero eigenvalues are negative, but they only contain 13.1% of the total variation
(when we disregard the sign), so the positive eigenvalues contain most of the information. (I
discuss negative eigenvalues below). The main problem is that the first two eigenvalues only
explain 39.6% of the (absolute) variation. The first 3 explain 50.3% and the first 4, 57.9%. There is no obvious cutoff - though the features found by our earlier analyses are apparent on the first two axes. The plot looks particularly like the PCA on the log(X+1) transformed data. As usual, what we see is largely determined by the transformation. That we are using a city-block metric rather than the Euclidean, and are excluding double zeros (and 51% of the data set are zeros), has little effect; it is the log(X+1) transform that is important.
Metric Scaling.
Metric scaling tries to produce a set of coordinates (a configuration of points) in a reduced number
of dimensions whose matrix of interpoint Euclidean distances approximates the original
dissimilarity matrix as closely as possible. The eigenanalysis technique Principal Coordinates
(PCO) does this directly. PCO is a metric scaling technique (it is sometimes called classical or
Torgerson scaling). However, the term metric scaling is more commonly applied when computer
intensive iterative algorithms are used to do the job rather than eigenanalysis. The results will
seldom be very different from doing a PCO on the same dissimilarity matrix.
Intuitive explanation
The simplest approach to metric scaling is by repeated approximation, and is always done on a
computer. The computer can be imagined as guessing an initial configuration in a high dimension
space (less than p-1). It then calculates the distance matrix for these initial points. The elements of
this matrix are then regressed against the elements of the given dissimilarity matrix. If by some
extraordinary stroke of luck (or fudging) the fit is extremely good then this configuration will do.
However, it is far, far, more likely that the fit will not be adequate. The computer then shifts each
point in the configuration slightly so that its interpoint distances will fit the given dissimilarity
matrix better, calculates the distances, measures the fit and calculates the improvement. If there has
been little or no improvement, or the fit is adequate, then a solution has been found and the process
stops. If not, then it goes through more iterations till it finds the best fitting set of coordinates (or
gives up in disgust). This process is repeated for spaces of fewer dimensions, and the goodness of
fit plotted against the number of dimensions (like a scree graph - section 5.3). The appropriate number of dimensions for the reduced space is chosen, and the corresponding configuration plotted and (hopefully) interpreted. Notice it does not calculate one high dimensional solution and extract all lower dimensional solutions from it as PCA or PCO does; the configuration for each reduced space is calculated anew each time. In fact, if the dissimilarities are exactly Euclidean, then a p-1 dimension solution, if subjected to a PCA, will give similar lower dimensional configurations to separate scalings for each solution. In other words a metric scaling on a Euclidean distance matrix will give the same results as a PCA.
Figure 6.3. Plot from a PCO on Gower's unstandardised distance using log(X+1) transformed data, ignoring double zeros (Dimension 1 against Dimension 2; sampling units labelled M and W). Note the similarity with the PCA plot on the same transformation of the data.
The goodness-of-fit statistic.
To assess the fit, the distances between the points in the current configuration are regressed (a linear regression through the origin) against the original dissimilarities (δij) in the distance matrix. The fitted values (the d̂ij - the disparities, forgive the jargon) are compared with the current distances between the points in the reduced space (the dij). The usual goodness of fit statistic is based on the residual sum of squares from this regression, playing a role analogous to that of r², the coefficient of determination, in ordinary regression:
STRESS1 = √( ∑(dij − d̂ij)² / ∑dij² )
This is called STRESS, STRESS formula 1, or STRESS1, and it is the statistic being minimised by the iterative procedure.
Because each configuration for a dimensionality is calculated separately, it is possible to request a
solution for a particular dimensionality without calculating any others. This is nearly always a bad
idea; not only is it possible that another (possibly lower dimensioned one) might be better, but as
we shall see below, local minima can be detected by comparing the stress values from different solutions - see the section on local minima below.
Provided STRESS1 is used as the measure of goodness of fit the results of a metric scaling will be
nearly always equivalent to a PCO on the same dissimilarity matrix. The reason I described it is to
introduce a more widely used technique.
Non-metric scaling.
Under certain circumstances trying to preserve the actual dissimilarities might be too restrictive or even pointless. For example, if there is large error in the dissimilarity estimates, or if the dissimilarities (or the data they were based on) are ranks (ordinal), then the magnitudes of the distances are too crude to be worth preserving. A method that preserves only the rank order of the dissimilarities would be more appropriate.
The algorithm to do this is virtually the same as the one given above for metric scaling. The sole
difference is that the linear regression that fitted the estimated distances for the solution to the
dissimilarities is now replaced with an order preserving regression - Kruskal's least squares
monotonic transformation (Kruskal 1964), sometimes known as optimal scaling.
Monotonic Regression.
The fitted line in this form of regression is not smooth; there is only one constraint on its shape - it must only move in one direction, always upwards or always downwards; it must be monotonic. In the algorithm described here the direction will be upwards. The process is simple, if tedious. The optimal scaling procedure starts at the lower left corner of the plot and moves towards the right (Figure 6.4). Each point is examined in turn: if it is higher than the previous one it is left undisturbed; if lower, the mean of it and the previous value is calculated and both points are replaced with this value. The next point is then compared with this value; if it is lower then the mean of the three points is calculated and all three replaced. Then the next point, and so on until one is found that is higher than the current value. There is now a block of points all with the same value; the algorithm now returns to the point prior to the block to see if the fitted line still moves upwards, and if not, the points are amalgamated again. The result of this tedious process is a line that progresses upwards in a series of steps like a staircase. The tread (the horizontal bit) of each step (the predicted value d̂ij - the disparity) is defined by the mean of all the points that occur in that stretch of the X axis (δij - elements of the original dissimilarity matrix). It can be shown (Kruskal 1964) that this line minimises the sum of the squared deviations of the current configuration distances from the estimated distances predicted from the line, STRESS1; it is truly a least squares method. STRESS1 measures the extent to which the rank order of the estimated distances differs from the rank order of the dissimilarities.
Figure 6.4. Monotone regression: fitted values (empty markers) of the Euclidean distances plotted against the dissimilarities from the data set. Starting at the origin the first four observed values increase monotonically, so the fitted and observed values coincide. However, the 6th and 7th are lower than the 5th, so the fitted values are the average of the three. Similarly, the 8th and 9th also get smoothed. The resulting line increases monotonically.
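Base R has a least squares monotone (isotonic) regression built in, isoreg(); a minimal sketch with made-up numbers, purely to show the staircase-shaped fit (delta plays the role of the dissimilarities and dcur the current configuration distances):
delta <- c(0.2, 0.5, 0.7, 0.9, 1.1, 1.3, 1.6, 1.8, 2.0)   # dissimilarities (x axis)
dcur <- c(0.1, 0.3, 0.6, 0.8, 1.2, 0.9, 1.0, 1.7, 1.5)    # current configuration distances (y axis)
fit <- isoreg(delta, dcur)   # least squares monotone fit
plot(fit)                    # staircase-shaped fitted line
dhat <- fit$yf               # the fitted values (the disparities)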
Programming notes:
In R (the isoMDS() function is in the MASS package), if dist is the name of your distance matrix, then:
library(MASS)
dimen <- 3
sol <- isoMDS(dist, y = cmdscale(dist, k = dimen), k = dimen)
will get you a 3 dimensional solution. The points will be in sol$points, and the final STRESS1
value in sol$stress (expressed as a percentage rather than a proportion). Plot the points with the
same functions as PCA.
To do a STRESS plot:
STRESS <- NULL                          # prepare an empty vector
mds2 <- isoMDS(dist, k = 1)
STRESS <- append(STRESS, mds2$stress)   # adds the stress value to the end of the vector
mds3 <- isoMDS(dist, k = 2)
STRESS <- append(STRESS, mds3$stress)
Keep going up to the desired maximum number of dimensions. When you have finished:
plot(1:length(STRESS), STRESS, type = "b")   # produces the STRESS plot
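The same plot can be produced with a short loop; a sketch, assuming dist is your distance matrix and maxdim is the largest dimensionality you want to try:
library(MASS)
maxdim <- 6
STRESS <- numeric(maxdim)
for (k in 1:maxdim) {
  STRESS[k] <- isoMDS(dist, k = k, trace = FALSE)$stress   # STRESS1, as a percentage
}
plot(1:maxdim, STRESS, type = "b", xlab = "Dimensions", ylab = "STRESS1")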
Interpretation
a) How many dimensions?
There are three main techniques for identifying the appropriate number of dimensions for the
reduced space.
i) If the program has been allowed to calculate solutions for high dimensionality down to one, then
a plot of STRESS against number of dimensions is extremely useful. This STRESS plot is
analogous to a scree graph of other ordination techniques but is interpreted slightly differently. It is
examined for an "elbow" where the STRESS reduces rapidly and the addition of further
dimensions improves the STRESS only slightly. The number of dimensions at the elbow identifies
the solution that is used for the reduced space plots.
Of course if the ideal space is unidimensional (for example in many morphometric data sets) then
there will be no elbow. Kruskal and Wish in their excellent little book, hereafter referred to as K &
W, suggest that, in this situation, if stress for the unidimensional solution is less than .15 then it will
probably give the most useful plot - provided there are more than ten sampling units so that the
dissimilarity matrix is larger than 10 X 10.
K & W also offer advice on when to accept an elbow as useful. If the STRESS value at the elbow is
greater than 0.1 then it should not be used, indeed an elbow at high STRESS may be suggesting a
local minimum. Generally, a genuine elbow at high STRESS(near 0.1) ought to have the left hand
section very steep, the right hand (higher dimensional) bit need not be very shallow. For a genuine
elbow at low STRESS (e.g. 0.02) the left hand section need not be very steep but the right hand bit
must be very shallow, nearly horizontal. Unfortunately if the elbow appears at two dimensions,
where most workers would like it, there could be a problem. A large STRESS for unidimensional
solutions does not mean much. Most existing algorithms have difficulty in fitting data onto a line
(see Heiser 1987 for references), so an elbow at two dimensions could be an artefact and should be
regarded with caution - a pity.
It is also unfortunate that many, possibly most, real data do not have clear elbows; in which case K
& W offer the following rules of thumb: if possible do not use a solution with a STRESS greater
than 0.1 or less than 0.05 (unless an extra dimension reduces the STRESS considerably. Seber
(1984) presents a table (from Kruskal 1964) that gives guide-lines to the meaning of different
values of STRESS.
STRESS    Goodness of fit
0.2       Poor
0.1       Fair
0.05      Good
0.025     Excellent
0         Perfect and therefore suspicious
Figure 6.5. STRESS diagram (STRESS1 against number of dimensions, 1 to 8) for the non-metric MDS on the chi-squared distance matrix. There is no obvious elbow (not unusual). Dimensionality was chosen by checking that there was no effective difference between the 2-D solution and the first 2 dimensions of the 3-D solution.
A STRESS value close to zero (say < 0.01) is a possible indicator of a degenerate solution - see the section on degeneracy below.
EXAMPLE 6.3
Figure 6.6 shows a reduced space plot of the two dimensional solution from a non-metric
multidimensional scaling minimising STRESS1. The distance measure used was the chi-squared
distance to compare the solution with the PCO (classical metric scaling). The main difference is
that the horseshoe has been removed. This remarkable removal of the horseshoe is by no means
inevitable. It is commoner in my experience for any horseshoe in a metric technique to also appear
(though sometimes a bit reduced) in the non-metric solution.
Why did I choose a 2 dimensional solution? The STRESS plot is in Figure 6.5. There is no obvious
elbow so the choice of dimensionality is somewhat arbitrary. I chose 2 because that gave a
reasonable STRESS level (0.075); but checked that the first 2 dimensions of the 3-D solution
(STRESS 0.053) gave essentially the same picture. This analysis illustrated a typical problem. Even
though it was performed with a modern algorithm (PROC MDS, in SAS version 6.12), the best 2 -
D solution was not found using a PCO starting configuration (the default). After 20 random starts, 3
better solutions (over 10% improvement in STRESS) had been found. Clearly there are problems with local minima that only repeated random starts can overcome. Even now I cannot be sure there is not a better solution, but I am fairly sure that there would be little substantial change to the
reduced space plot.
b) Interpreting dimensions and trends.
The directions of the axes on which the MDS configuration is defined are usually arbitrary. Even if the final solution of a given dimensionality has been rotated by PCA, the
component axes need have no meaning. As a general rule therefore there is little point in trying to
interpret the axes.
If external variables are available, trends in the plots can sometimes be interpreted if it is possible
to associate the plotted points (the sampling units) with their values on the external variables.
Bubble plots can be useful.
Problems
Besides the problems associated with all ordination techniques (outliers, useful information associated with rejected dimensions, the horseshoe effect, etc.), multidimensional scaling has three that are peculiar to it.
i) Incomplete convergence.
The iterative process may not have found the minimum before it gives up. It is inconceivable that
any program would not notify you of the fact, but most, perhaps all, present the configuration at the
last iteration as a final solution - which it isn't. It is therefore important to check the output to find
out why the program terminated. The only acceptable reasons are that the change in STRESS is too
small, so a minimum has been reached; or that the value of STRESS is too small. Any other
message spells trouble. In fact as we shall see below (section 6.3.3.iii), if the STRESS is too small,
then the solution must be checked for degeneracy.
If the program has terminated before a minimum STRESS solution has been found, then either the
program can be run again requesting a larger number of iterations, or, if this is not possible then the
final configuration of the run can be fed back in as the initial configuration for another run, so the
program starts where it left off.
ii) Local minimum.
Since the program is an iterative one, it is sometimes possible, especially with non-Euclidean dissimilarity matrices and non-metric scaling, for the process to find more than one minimum STRESS configuration, depending on which starting configuration is used.
To use an analogy: with well behaved data, finding the solution with minimum STRESS is like
rolling down the sides of a volcanic crater, you may bounce around a bit, but sooner or later you
will end up at the bottom. However if the data are not well behaved - the structure is not clear -
there may be secondary craters in the side of the main one. So where on the edge you start your trip
will determine whether you end up in the bottom or in a secondary crater (a local minimum). Of
course if the volcano is active it doesn't much matter either way, but in MDS it does; a local
minimum can be misleading.
Even with well behaved data there will be local minima, but the configurations will usually be so similar (reflections, rotations, close data points reversed in position, etc.) that it will not matter. Local minima that are significantly different from the global one arise when the structure of the data is not
clear, in a way analogous to the equal eigenvalue problem in PCA. Large numbers of missing
values can accentuate the problem. Local minima are particularly likely on unidimensional
solutions.
In the event there is a local minimum configuration that is very different from the global, then
either both will have high STRESS, in which case both are useless; or the local one will have a
much higher STRESS than the global so it should stand out in the STRESS diagram as anomalous
relative to the solutions of higher and lower dimensionality. The STRESS diagram must always be concave upwards (e.g. Figure 6.5); any kink upwards in that shape will nearly always mark a local minimum.
If a local minimum is suspected, or no STRESS diagram is being produced, then the program
should be run with more than one starting configuration. Some programs can generate random
starting configurations, otherwise the user may have to provide them, which is very tedious. Note:
do not give one where all the points lie on one axis, it can lead to problems. Make sure the starting
configuration spreads the points throughout the space. If all the starting configurations lead to the
same final solution then it is very unlikely that this is a local minimum. If some locate a different
solution with a lower STRESS then this should be used.
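A minimal sketch of repeated random starts with isoMDS() (dist is assumed to be a 'dist' object; 20 random starts and a 2-dimensional solution are assumed purely for illustration):
library(MASS)
set.seed(1)
n <- attr(dist, "Size")                      # number of sampling units
k <- 2
best <- isoMDS(dist, k = k, trace = FALSE)   # default start: the PCO (cmdscale) configuration
for (i in 1:20) {
  start <- matrix(rnorm(n * k), n, k)        # random start spread through the space
  trial <- isoMDS(dist, y = start, k = k, trace = FALSE)
  if (trial$stress < best$stress) best <- trial   # keep the lowest-STRESS solution
}
best$stress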
iii) Degeneracy.
A degenerate solution is a configuration where, for no obvious reason, the program decides to assert its artistic independence and presents what is usually an attractive, regular, but meaningless plot with all the points coalescing into a few, often equally spaced, clusters. This solution is often associated
with a very small STRESS value (e.g. less than 0.01).
Degenerate solutions are largely a feature of non-metric methods. There appear to be three main
causes of degenerate solutions.
a) There are a large number of equal values in the dissimilarity matrix, perhaps a lot of zeros,
or a matrix where the dissimilarities take only a few values, for example where it is based
on a few binary variables.
b) The data cloud is really a few (4 or less) clusters, where the intercluster distance is much
larger than the intra cluster distances.
c) There are a lot of missing values in the dissimilarity matrix.
These situations are basically the same: they all represent a situation where there is insufficient information for the program to come up with a meaningful configuration, so it produces a pretty one instead.
There are two main ways to recognise a degenerate solution: a STRESS value of less than 0.01,
even zero, and the scatter plot of estimated distance against dissimilarities has a characteristic
shape. The plot will usually consist of a few "steps", where a number of different dissimilarities
(δij) have the same distance (dij). There will nearly always be a cluster of zero distances, sometimes
associated with medium to large dissimilarities.
A degenerate solution cannot be interpreted, so the simplest thing to do is re-analyse using a metric
method.
iv) Outliers.
Metric scaling is sensitive to outliers; though Gower (1987) suggests it is more robust than PCO,
and therefore PCA. If outliers are found there are two main solutions.
a) Drop them and repeat the analysis.
b) Use a non-metric method. These are more robust as they do not try to preserve distance,
only the rank order of distances.
v) Adequacy of the reduced space.
The measure of goodness of fit, in this case STRESS, provides a crude measure.
vi) Stability of the reduced space.
The major determinant of the stability of a solution under sampling error is the sample size relative
to the number of dimensions. Kruskal and Wish (1978) suggest that if the number of dimensions is
three or less then the sample size should always be greater than four times the number of
dimensions - plus one. If the sample size is less than twice the number of dimensions - plus one -
then the solution should not be used.
Of course using dissimilarity matrices with little information in them can also lead to instability in
the space, even if the solution is not degenerate.
Which to use: metric or non-metric?
Both metric and non-metric methods have their strengths. Non-metric methods can handle ordinal
data or other lower quality dissimilarities, and are robust to outliers. On the other hand they are
more prone to local minima and degenerate solutions. As Gower (1987) points out: when the
number of sampling units is large, preserving the rank order is usually essentially the same as
preserving distances; in which case it doesn't much matter which is used. A metric scaling will
always have a higher stress than the corresponding global non-metric solution, but will often be
more accurate. Sometimes, with non-Euclidean distances, the relationship between the fitted
Euclidean distances and the dissimilarities is non-linear. In which case a linear metric scaling may
not be adequate and a non-metric method will usually be appropriate. Such a situation can be
recognised from the shape of the scatter diagram of the fitted distances against the dissimilarities.
Chapter 7. Cluster Analysis
Cluster analysis, or classification as it is known in the botanical literature, has the apparently simple
aim of finding clusters in a data cloud of sampling units in the absence of any a priori information
about which point belongs in which cluster. This apparently unambitious aim is unfortunately
fraught with problems.
The major difficulty is that no one seems to agree on precisely what a cluster is, for a very good reason: the human eye is unexcelled as a pattern recognition device, but we recognise clusters of points in a variety of different ways. For example, it is extremely difficult to think of a single
definition that would adequately describe all the clusters in fig 7.1, even though they are quite
obvious (I hope). Some workers have stressed the importance of cohesiveness (like fig 7.1a); others
contiguity of points (7.1b); yet others have concentrated on distances such that all or most of the
distances within a cluster are less than those to any point outside the cluster; and finally others have
tried to make the definition so vague that it can include most of the possibilities without the
necessity of actually defining anything. Everitt's definition (Everitt 1980) seems to come as close to
being useful as any:
"Clusters may be described as continuous regions of (a) space containing a relatively high density
of points, separated from other such regions by regions containing a relatively low density of
points."
Unfortunately it does not provide a rationale for a single comprehensive technique that can handle
all the data structures shown and satisfy all the requirements of workers. Indeed it is extremely
unlikely that any such method could ever be found, for there lies another problem, workers want the
technique(s) for a number of different purposes:
i) to find groups for classification;
ii) to reduce the number of sampling units in an analysis by using a single representative from each
cluster of similar individuals;
iii) for data exploration and hypothesis generation;
iv) for fitting distribution models and estimating their parameters;
v) dissecting a continuous data cloud into relatively homogeneous zones;
and many more.
The only thing the large number of existing techniques have in common is that, unlike canonical discriminant analysis (section 10) and discriminant function analysis (not covered in this course), there is no prior information about which sampling unit is in which group. Like the ordination methods of the earlier chapters, cluster analysis techniques operate on an unpartitioned data matrix to find, or impose, structure in the data cloud.
Figure 7.1. The human eye can detect clusters in all three of these diagrams (a, b and c). Statistical methods find it more difficult.
One consequence of this variation in definition and use is that cluster analysis as such does not
exist. The title refers to an enormous and extraordinarily diverse family of techniques. For someone
to say that they used cluster analysis is about as informative as their saying they studied an insect.
To cover all the techniques would take a whole (large) book. So for this course I shall content
myself with covering some of the common ones and ones I think are potentially most useful.
Given the diversity of techniques it is very important to choose the technique with a clear idea of
what it is required to do. Like selecting a similarity or distance metric (Section 4), the choice must
be made with care after consideration of the nature of the data, your objectives, and the available
alternatives. However the most important thing to remember when using a clustering technique is:-
you must not believe the result. The pattern you get is at most a plausible way of viewing the data.
By using an appropriate method and by employing validation techniques the plausibility can be
enhanced, but no cluster analysis can be relied on to produce truth. With real data, different methods
will nearly always produce different results. If the structure in the data is fairly obvious then these
answers may not differ much, but if there is any ambiguity in the data then the methods may well
give contradictory results.
Partitioning methods.
Though the hierarchical methods have been historically more important, the partitioning methods
are becoming increasingly popular, and it is easy to see why. The hierarchical methods are restricted
to an often inappropriate nested structure so that at each level (number of clusters) the solution is
constrained by the previous one. In the partitioning or segmentation methods the solution at any
level is independent of the others and can therefore be globally optimal - if you're lucky.
For a single run of a partitioning method, the desired number of clusters (k) is usually fixed - some
techniques do allow some small adjustment in this number during the process. Of course since the
correct number of clusters is usually not known, the program is normally run with different values
of k and the optimum number of clusters chosen (covered later).
There are two major phases to a partitioning method:
i) an initial allocation (usually rather arbitrary) into k preliminary clusters;
ii) reallocation of each point either to the closest centroid, or so as to optimise some property of the
clusters. This is repeated until there is no further improvement, then the program stops.
The initial allocation is usually started by choosing k sampling units to use as "seeds" to
"crystallise" the clusters. There are a number of ways to choose these seeds; it depends on the
program. As we shall see it is a tremendous advantage if you can put in your own set. These seeds
are used as the initial centres of the clusters, points are allocated to the nearest cluster centre, and in
most programs the cluster centroid is adjusted as they are added.
The methods we consider here (there are others), the k-means methods, run through the sampling units, reallocating them to the cluster with the closest centroid; they pass and repass through the data
till no further reallocation of points is possible. Some programs then try swapping pairs of points
between clusters, to further improve the solution, and to protect against local optima.
K-means partitioning methods
The k-means methods are generally the fastest clustering methods, but they are inclined to be
trapped by local optima and tend to produce equal volume spherical solutions. They are also very
sensitive to starting strategy. Some workers suggest that random starting values should not be used.
Seber reports a study as having located the global optimum only 3 times from 24 random starts!
However their performance in the few Monte Carlo simulation studies that have incorporated them
has been good relative to alternative methods, particularly when the solution from a hierarchical
method was used as the starting configuration. In fact, it has tended to be better than the best
hierarchical methods considered (Ward's and average linkage).
If the data set is particularly large, a sub-sample of the points could be clustered and the estimated
centroids of the resulting clusters used as seeds for the analysis of the full data set. Some programs
allow you to vary how the distance to the centroid is measured. Most use the squared distance by default, which means that the algorithm is minimising trace(W), where W is the within-cluster variance-covariance matrix pooled over all the clusters, i.e. the total within-cluster variance. This is an appealingly statistical thing to optimise.
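As a concrete illustration of this seeding strategy, here is a hedged R sketch of starting kmeans() from the centroids of a preliminary Ward's clustering (the approach used in Example 7.1 below). The object names, the choice of k = 4, and the matrix called data are illustrative assumptions, not part of the course functions.
hc <- hclust(dist(data), method = "ward")       # "ward.D" or "ward.D2" in newer versions of R
groups <- cutree(hc, k = 4)                     # preliminary 4-group solution
seeds <- apply(data, 2, function(x) tapply(x, groups, mean))   # one centroid per group (k x p matrix)
cl <- kmeans(data, centers = seeds)             # k-means started from those seeds
# For a very large data set the same trick works with the centroids of a clustered random sub-sample of the rows.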
EXAMPLE 7.1.
To display the use of a k-means partitioning technique, I will use the microzooplankton data again.
However, because I want to use the same data set again for the other clustering techniques which
produce dendrograms, I will cut the data set down a bit first. All the examples in this chapter will
only consider the 35 samples that were taken near to high tide. Since the PCA on log(X+1)
transformed abundances seemed to give the most informative ordination I have used log(X+1)
transformed data in this partitioning. A different transformation or standardisation would give a
different result.
In reality I would normally not use a k-means technique on a data set this small. Because of problems with local minima, the technique is best used with large data sets. We have already used a variety of ordination techniques on the complete data set and there were no obvious clusters, so a k-means method may well have problems converging to a global optimum. From the ordination plots
we might suspect that a partitioning technique would really only dissect the data cloud (in
marketing terms: produce a segmentation). After trying partitions with 7, 6, 5, 4, 3, and 2 clusters
(allowing a couple of outliers) there was no reason to change that opinion. Using Callinski and
Harabasz's Index (see later), a measure of cluster separation, none of them was clearly superior to
the others (values of 13.04, 13.03, 13.12, 13.04, 11.23, 11.32 respectively). Given that the value
seems to drop slightly between 4 and 3 clusters, and because I felt that 4 was a nice number of
clusters - not too many, not too few - I chose to present four clusters. (Since we are dissecting the
space rather than recovering true clusters, we can afford to please ourselves a bit). I used seeds from
a Ward's hierarchical clustering program (see later) as starting points for the partitions.
[Figure 7.2. The four-cluster k-means solution displayed in the space of the first two principal components of the log(X+1) transformed data; points are labelled by cluster number (axes: Principal component 1, Principal component 2).]
The results are displayed on a PCA reduced space plot (section 5) on the log(X+1) transformed data
in Figure 7.2. Clearly the partitioning has identified quite consistent groups. This analysis therefore
conforms to the central axiom for clustering: "I do not believe it unless I can see it." This of course
has a complement: "But I do not believe it just because I can see it" - the human eye and brain are
hard-wired to see pattern even when none exists. The plot allows the major differences between the
clusters to be seen.
As we shall see in Example 7.2 the results of this analysis are very similar to those from the Ward's
method applied to the same data. In essence it identifies the split between the Whau Creek (groups
3, and 4) and the Mangere samples (group 1), with group 2 containing 4 Mangere samples taken all
on the same day, and 4 Whau samples taken on 2 different days.
It is worth pointing out that my suspicions about the possibility of local minima with this data set
were well founded. Out of 20 random starts for the 5 group partition, not one found a solution as
good as that found when the seed came from a preliminary Ward's clustering. Clearly when the
number of observations is so small there are a large number of local minima.
Programming notes.
In R. First load my function PseudoF.
data <- dataset                                # your (transformed) data matrix or data frame
k <- 4                                         # desired number of clusters
cl <- kmeans(data, centers = k, nstart = 25)   # e.g. 25 random starts
pf <- PseudoF(data, cl$cluster)
pf
The nstart= option tells kmeans how many random starts you want it to do. It then picks the best of these. The value pf is the pseudo-F statistic for this cluster solution.
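Since the correct number of clusters is usually unknown, a small loop can collect the pseudo-F for a range of k. This is only a hedged sketch: it assumes PseudoF() returns a single value for a given clustering, and the range 2 to 7 is arbitrary.
ks <- 2:7
pfs <- sapply(ks, function(k) PseudoF(data, kmeans(data, centers = k, nstart = 25)$cluster))
cbind(k = ks, pseudoF = pfs)                   # compare the pseudo-F across candidate k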
When you have chosen your final solution you may want to plot it in a reduced space plot. Do your PCA and then:
library(MASS)                                    # eqscplot() comes from MASS
eqscplot(pcs$scores[, 1:2], type = "n")          # pcs is the object from your PCA
text(pcs$scores[, 1:2], labels = as.vector(cl$cluster))
Hierarchical methods.
These methods assume that the groupings in the data cloud have a hierarchical structure. The
smaller groups form larger groups which form larger groups and so on - a nested classification. If
this assumption is untrue then the techniques can be expected to distort the true structure of the
data.
Most of the commonly used techniques are members of this group. They are widely available, all
the major packages have a selection, and they are relatively easy to use, though often less so to
interpret.
Hierarchical organisation is often difficult to justify for real data sets. Though there may be more
than one level of grouping there may be no reason to assume that they are nested. For example, it
has been shown that the clusterings defined by the optimum sum of squares at various levels of k
may not be nested for all data sets; so a hierarchical method may be unsuitable for any given data
set.
There are two approaches to hierarchical clustering, agglomerative and divisive. Agglomerative
methods start from the individual sampling units forming them into groups and fusing the groups
till there is only one that includes all the points. If we can describe this as working from the bottom
up, then the divisive techniques work from the top down. The groups are formed by splitting the
data set successively until there are as many groups as points.
Hierarchical agglomerative clustering.
All of the commonly used hierarchical methods are agglomerative. Most of them operate in the
same way: first all sampling units that are zero distance apart are fused into clusters. The threshold
for fusion is then raised from zero until two clusters (they may be individual points) are found that
are close enough to fuse. The threshold is raised, fusing the clusters as their distance apart is
reached until all the clusters have been fused into one big one. Thus the close clusters are fused
first, then those further apart, till all have been fused. This process allows the history of the fusions,
the hierarchy, to be displayed as a dendrogram. This is an advantage of the agglomerative methods,
if the data have a nested structure these techniques lead to a useful way of displaying it. Other
advantages are the ready availability of programs and their ability to handle quite large data sets - at
reasonable expense. Unlike the optimisation or k-means methods, most of the agglomerative
techniques can use a broad range of similarity or distance measures. This of course means that
considerable care must be taken to choose the appropriate one; different measures often lead to
different results.
Inevitably, given the variety of definitions of a cluster, there are a large number of different
hierarchical agglomerative techniques. They mainly differ in the details of the fusion rule. For most
of them the rule is simply stated: two clusters should be fused if the distance between them has been
reached by the threshold. The problem is to estimate that distance. It can be done in a variety of
ways and will usually affect the results. As we shall see, different types of clusters need different
ways of estimating intercluster distance.
We shall consider the four most commonly used methods.
i) Single linkage (nearest neighbour) clustering.
The distance between two clusters is the distance between their nearest points (Figure 7.3a). The
simplicity of this method makes it easy to program and extremely efficient. It was one of the most
popular techniques in the early days of clustering; but since then, despite support from the
theoreticians, it has been used less frequently. In general it has not performed well. It identifies
clusters on the basis of isolation, how far apart they are at their closest points. This means that if
there are any intermediate points then single linkage will fuse the groups without leaving any trace
of their separate identities. This is called "chaining", which leads to characteristic and rather
uninformative dendrograms. It is the chief weakness of the method. Its strength is that if the clusters
are well separated in the data, then single linkage can handle groups of different shapes and sizes,
even long thin straggly ones (e.g. Figure 7.1c) that other methods often cannot recover. It has other
advantages: it will give the same clustering after any monotonic transformation of the distance measure, which means that it is fairly robust to the choice of measure; and it is insensitive to tied distances - some methods suffer from indeterminacy if there are too many ties, a bit like degenerate solutions in non-metric MDS (section 6.3.3.iii), and though the results are seldom as pretty, they can be just as meaningless.
As a cluster analysis single linkage is usually not very useful (unless the data is of the right type).
Many investigations have found it performs badly with even slightly messy data.
ii) Complete linkage (farthest neighbour) clustering.
In many respects complete linkage clustering is the opposite of single linkage. Instead of measuring
the distance between two clusters as that between their two nearest members; it uses that between
the two farthest members (Figure 7.3b). In consequence the resulting clusters are compact, spherical
and well defined. Unlike single linkage it can be sensitive to tied distances. There are similarities,
[Figure 7.3. The three intercluster distances: a) single linkage (nearest points), b) complete linkage (farthest points), c) group average (average of all between-group distances).]
the clustering it gives is also invariant under monotonic transformation of the distances; it is robust
to a certain amount of measurement error and choice of distance. Unfortunately it is sensitive to
even a single change in the rank order of the distances in the dissimilarity matrix (Seber 1984), and
does not cope well with outliers. However, in Monte Carlo simulations, it nearly always performed
better than single linkage; though usually not quite as well as Ward's or group average.
iii) Group average linkage (UPGMA)
This is probably the most popular hierarchical clustering method - for a very good reason - it
usually works well. It could be thought of as an attempt to avoid the extremes of the single and
complete linkage methods. The distance between two clusters is the average of the distances
between the members of the two groups (Figure 7.3c). If the distances are Euclidean this is the
distance between the centroids plus the within group scatter. As a result this method tends to
produce compact spherical clusters.
Like its main rival Ward's method, average linkage has generally performed well in Monte Carlo
simulations, and its continued popularity is because it consistently, though not inevitably, gives
adequate results. However, Ward's generally performed better, particularly when there was some
overlap between the groups. When intermediate points and outliers were removed ("trimming" or
"incomplete coverage"), group average's performance was considerably improved. It performed
poorly with mixtures of multivariate normal distributions probably because of the overlap between
clusters.
iv) Ward's method (incremental sums of squares, minimum variance, agglomerative sums of
squares).
Ward's method is the hierarchical version of the k-means partitioning method. At each fusion it
attempts to minimise the increase in total sum of squared distances within the clusters. This is
equivalent to minimising the sum of squared within cluster deviations from the centroids - i.e.
trace(W). Since at any one stage it can only fuse those clusters already in existence - it is not
allowed to reallocate points - it can only be stepwise optimal. It cannot find the true minimum
configuration at each level, so it would not be expected to recover natural clusters as well as the
non-hierarchical methods that also minimise trace(W). A bad start to the agglomeration process can
place the algorithm on a path from which it can never reach the global optimum for a given number
of clusters. Despite this, Ward's method has performed well in simulations; one of the two best
hierarchical methods overall. Its chief flaw is a tendency to form clusters of equal size, regardless of their true sizes. So when the numbers of points in the clusters differ, group average and complete linkage may give better results. Like the complete linkage and group average methods it is
also biased towards forming spherical clusters; though perhaps not as strongly as they are. It may
also be rather sensitive to outliers. However it appears to perform well when there is a lot of
overlap, when many of the other techniques have difficulties. It has been found in simulations that
Ward's performed best of the hierarchical methods at recovering natural clusters, but that the k-
means and optimising methods were better.
EXAMPLE 7.2.
To show the differences between the major agglomerative methods I analysed the high-tide micro-
zooplankton data. I used Ward's, average, complete, and single linkage, on six different distance measures: Euclidean distance with log(X+1) transformed data, chi-squared distance, chi-squared distance with log(X+1) transformed data, Bray-Curtis distance, Bray-Curtis distance with double square-root (√√) transformed data, and
Jaccard's distance. These impose a range of standardisations on the data that try to reduce the
influence of some of the very large numbers lurking in the data set. Plankton data typically consists
of some very large numbers in a sea of zeros. In Figure 7.4 I have presented the results for the
Euclidean log(X+1) data.
One thing that is apparent at a glance from Figure 7.4 is that the dendrograms look different, even though each of the clustering methods was given the same data. It is also clear that there is a large core
group of Mangere samples that form a fairly tight cluster, there is also another group of four
Mangere samples that reappear as a group (12, 9, 21, 15) in 3 of the four dendrograms. Though at
first sight there does not appear to be much more structure, close examination shows that other,
Whau Creek, samples consistently reappear as groups or together inside groups ((24, 28, 34), (29,
31, 33), (23, 32, 35), (22, 26), (30, 27)). Much of this cluster structure is due to samples that were
taken on the same day. So, even though ordinations do not suggest much cluster structure, the
dendrograms do seem to be recovering some informative groupings.
One typical, pleasing feature of Ward's method that emerges from this diagram is that its dendrogram tends to be clearer to read than those of the other methods. Like complete linkage, it tends to
slightly exaggerate the differences between clusters, so making them more apparent.
To conserve space, I will not present all the dendrograms that were produced for all the other
transformations and standardisations, but Figure 7.5 shows the dendrogram relating them to each
other. It was produced using Ward's clustering algorithm, and indicates how similar the
dendrograms produced by the various methods were. The main groups are not defined by the
clustering method used, but by the standardisation or transformation. This reinforces the message of
the course, that the choice of standardisation is often more important than the choice of method.
It is worth noting that while the dendrograms in Figure 7.4 were fairly successful at identifying
plausible groups, most of the others were not. In particular, Bray-Curtis and chi-squared distance on
untransformed data were too affected by extreme values to be useful. This is shown by their outlier
position in Figure 7.5.
Figure 7.4. Dendrograms for hierarchical cluster analysis on zooplankton data.
Whau creek observations are in the stippled boxes, Mangere in the clear. The
observations have been sorted within the dendrogram so neighbouring observations
are more similar to each other.
[Figure 7.4 (continued): one dendrogram panel per clustering method, e.g. panel c) complete linkage clustering applied to Euclidean distances with log(X+1) data.]
[Figure 7.5. Ward's clustering of the dendrograms produced by the four methods and six distance measures; the leaves are labelled by method and measure (e.g. wardschi, averlogeucl, compjacc, singlertrtbray).]
Programming notes
In R we can use "ward", "average", "single", and "complete" as the method. The hang=-1 option makes all the leaves hang down to the same level, so the labels line up.
cl <- hclust(d, method = "ward")   # d is a distance object, e.g. from dist(); newer versions of R use "ward.D" or "ward.D2"
plot(cl, hang = -1)
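To compare the four agglomerative methods on the same distance matrix, as in Example 7.2, something like the following sketch could be used; the distance object d and the abundance matrix zoo are assumptions, not objects from the course files.
d <- dist(log(zoo + 1))                          # hypothetical log(X+1) Euclidean distances
par(mfrow = c(2, 2))                             # 2 x 2 grid of dendrograms
for (m in c("ward", "average", "complete", "single"))
  plot(hclust(d, method = m), hang = -1, main = m)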
AUTHOR'S CHOICE.
When a hierarchical method is needed, Ward's and group average linkage methods (UPGMA) are
sensible choices. If the clusters are well separated it will not much matter which method you
choose. Remember your choice of transformation or standardisation will usually be more important
than your choice of algorithm. If the data consists mainly of continuous or ordered variables, and
the observations do not necessarily have a nested structure then one of the non-hierarchical methods
for minimising the within cluster variation would usually be more appropriate - e.g. k-means. If you
are dissecting a space rather than trying to recover real clustering then a k-means technique is
probably best.
Interpretation.
Display.
i) Displaying a partition
a) Ordination.
If the main aim of the study was to dissect a more or less homogeneous data cloud then you would
usually want to show that the partition is reasonable and at the same time display the relative
positions of the sampling units. Perform an ordination on the sampling units and then show their
cluster membership in the reduced space plot (e.g. Figure 7.2).
Even if the aim was the recovery of natural clusters, this approach could be used to try to confirm
the adequacy of the cluster solution. If the points cluster in the reduced space then you have your
confirmation and a good display; if they do not, it means either that there are no clusters in the data
or the clusters are not spherical and such splits as exist are on the minor component axes.
b) Canonical Discriminant Analysis.
If the main aim of the study was to recover natural clusters then your main concern would usually
be their positions relative to each other. If the recovered clusters have approximately equal variance
covariance matrices (and most clustering methods tend to force this anyway), then the relative
positions of the cluster centroids can be shown in a reduced space using Canonical Discriminant
Analysis (chapter 11).
[Figure 7.6. Two equivalent dendrograms of points A-F: a) with leaf order A B C D E F, and b) with leaf order D E F A B C, obtained by twisting the tree at levels I and II.]
ii) Displaying a hierarchy.
The dendrogram.
This is easily the commonest way of presenting the results of a hierarchical clustering. It provides a
two dimensional record of the clustering history. Clusters (or points) are joined at the distance at
which they were fused (their fusion threshold). Large vertical distances between successive fusions
imply that a cluster is a long way from other clusters, and is therefore discrete. Isolated clusters or
solitary points join the major clusters only at the higher levels.
It is not always appreciated that the order of the sampling units along the "crown" of the tree does
not necessarily reflect similarity. This means that a dendrogram, as usually presented, can seriously
mislead. Two adjacent points on the dendrogram can in fact be very far apart in the full space; while
points that are far apart on the dendrogram may in fact be similar to each other. A dendrogram is
like a hanging mobile sculpture, the horizontal lines joining clusters are the bars, the vertical lines
are the cotton. Any part of the tree can be twisted to reverse the order of its sampling units, thus
Figure 7.6 a and b are equivalent, having been twisted at levels I and II.
Number of clusters.
When the purpose of the analysis is the recovery of natural clusters the number of clusters is not
usually known in advance. With a hierarchical method there are as many possible clustering levels
(partitions) as there are sampling units, and it is normal with partitioning methods to try a number
of runs with different k. The problem is to choose the best partition. One study lists no fewer than
30 ways of making that decision - and they deliberately left out several more. They compared the
performances of the 30 methods using the inevitable Monte Carlo simulations. Their test data was
exceptionally well behaved, non-overlapping compact clusters, all roughly the same shape and
volume - and unrealistic. They varied the number of units per cluster, number of clusters and
dimensionality of the data. Their argument was that if a technique did not perform well on these
data, it was not going to be much use with real data. Though it is worth considering that the reverse
is not necessarily true.
As you will see, both the methods I present from their investigation will work best with compact
spherical data which may not be present in real data. This returns the responsibility for choosing the
number of clusters to the worker. There is probably no foolproof way of choosing the optimum
number. Since cluster analysis, like most multivariate methods, is mainly a hypothesis generating
tool, the choice that suggests the most interesting hypotheses should be used - look at more than one
partition and choose the most interesting.
i) Callinski and Harabasz's index, a.k.a. the pseudo-F statistic (SAS Institute 1987).
(trace(B)/(k-1)) / (trace(W)/(n-k)).
Trace(B) is the total between-cluster sum of squares (summed over all variables), so the expression
is a sort of total between-cluster mean square over a within-cluster mean square - i.e. a pseudo-F
statistic. It is calculated for each partition (level of the hierarchy) and the k with the largest value of
the index is chosen. If it decreases monotonically as cluster number increases then the data possess
a hierarchical structure. If it increases monotonically as cluster number increases then there is probably no cluster structure in the data. Because it ignores the off-diagonal terms in the two matrices, it can be expected to work best with spherical clusters with equal variance-covariance matrices - i.e. exactly the kind of data used in the simulations. In fact, in the simulations it recovered the true number of clusters
390 times out of a possible 412.
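For readers who want to see the arithmetic, here is a minimal sketch of how such a pseudo-F could be computed from a data matrix and a vector of cluster labels. It is not the course's PseudoF function, which may be implemented differently.
pseudo.F <- function(X, labels) {
  X <- as.matrix(X)
  n <- nrow(X); k <- length(unique(labels))
  tss <- sum(scale(X, scale = FALSE)^2)                         # total sum of squares, trace(T)
  wss <- sum(sapply(split(as.data.frame(X), labels),
                    function(g) sum(scale(as.matrix(g), scale = FALSE)^2)))  # within-cluster SS, trace(W)
  bss <- tss - wss                                              # between-cluster SS, trace(B)
  (bss / (k - 1)) / (wss / (n - k))
}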
ii) Duda and Hart's index - only suitable for hierarchical solutions.
(WK+WL)/WM.
Where WK and WL are the sums of squares (over all variables) within the two clusters K and L that
are about to be fused, and WM is the value in the resulting cluster, M. At each fusion a test is
performed of the null hypothesis that there is only one cluster, not two. When this is rejected, the
current level is chosen so that clusters K and L are kept separate.
In simulations it performed better than the Callinski and Harabasz index when the true number of
clusters was greater than two. Most of the methods tested performed worse when there were only
two true clusters in the data. Callinski and Harabasz's statistic was an exception; it was very
consistent over all cluster numbers. Like Callinski and Harabasz's index, the Duda and Hart index
ignores covariation among the variables, so it will also be at its best with spherical clusters, though
it might be expected to be more robust to moderate variation among the within cluster variance-
covariance matrices.
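As an illustration of the arithmetic only (not of any packaged function), the index for a single fusion could be computed like this, given the row indices of the two clusters K and L that are about to be merged:
duda.hart <- function(X, K, L) {
  ss <- function(rows) {                                 # within-cluster sum of squares
    Xc <- scale(X[rows, , drop = FALSE], scale = FALSE)  # centre on the cluster mean
    sum(Xc^2)
  }
  (ss(K) + ss(L)) / ss(c(K, L))                          # (WK + WL) / WM
}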
It is important to realise that this statistic is measuring something very different to Callinski and
Harabasz's index. That attempts to assess the variability between all the extant clusters at each level;
it is examining the global cluster structure. The Duda and Hart index is just looking at the two
clusters currently being fused; it is examining local cluster structure. While global structure would
be, ideally, the most useful, local structure is easier to detect - and sometimes easier to interpret. In
my experience, with real data, Duda and Hart's index is much more sensitive and informative than
Callinski and Harabasz's.
A slight variant of this statistic is in SAS, modified to become a pseudo-t2 analogous to Hotelling's
T2 statistic (the multivariate equivalent of the t-test) with a scalar variance-covariance matrix. This
is distributed as F with p and p(nK+nL-2) degrees of freedom. This could imply that we could use
critical values from tables to "test" whether to fuse or not. In practice it is safer not to rely on any
particular cut-off but to look for peaks in the measure to identify levels of the dendrogram which
suggest clustering. Certainly it should not be used for formal significance tests except with
randomisation techniques.
Programming notes
R. As yet I have no function for the Duda and Hart Pseudo-t2. However if you load my function
pseudofh it will produce a Pseudo-F for a range of cluster numbers. So after doing your hclust the
function is used:
psfs<-pseudofh(datasetname,cl,2,8)
This asks for a pseudoF for 2 through 8 clusters.
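Assuming psfs holds one pseudo-F value for each of k = 2, ..., 8 (the exact structure of the returned object may differ), a quick plot makes any peak easier to spot:
plot(2:8, psfs, type = "b", xlab = "Number of clusters", ylab = "Pseudo-F")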
EXAMPLE 7.3
In Table 7.1 I present 3 of the cluster statistics calculated for the Ward's clustering on the log(X+1)
transformed high-tide zooplankton data.
Callinski and Harabasz's index fails to find any clear clustering at all: it increases monotonically with the number of clusters and does not even show any sharp changes that might indicate solutions to examine.
Duda and Hart's statistic (in the form of the pseudo-t2) suggests that 13 is an interesting number: the two clusters that were fused to form 12 clusters are very different. However, examining the
dendrogram for this analysis (Figure 7.4a), the two clusters concerned are small and relatively
uninteresting, merely {21, 15} and {12, 9}. We can afford to disregard the 13 cluster solution. There
is clearly another peak around 6-8 clusters. The 5 cluster solution is formed by fusing most of the
Mangere samples with a block of Whau Creek ones ({22, 26, 27, 30}), clearly not a good idea. One
of these solutions would be reasonable, but which to choose would depend on how much detail you
were hoping for from your clustering.
[Table 7.1. Clustering statistics for the Ward's clustering of the log(X+1) transformed high-tide zooplankton data: Pseudo-t2 (Duda and Hart's index) and Pseudo-F (Callinski and Harabasz's index). A large Pseudo-t2 refers to the previous number of clusters: it measures the similarity of the last two clusters to be fused, and if they are different you should not have fused them, so you must backtrack one fusion to get the right number of clusters. The pseudo-F finds no clustering at all; the pseudo-t2 suggests 13 (rather too many for convenience), 8-6, or 2 clusters.]
By looking at the dendrogram again we see that the peak at 2 clusters is simply due to the outlier group {24, 28, 34}.
After considering the statistics we should now examine the dendrogram (Figure 7.4a), and the
contoured ordination to see which solution seems the most informative. The pseudo-t2 statistics
suggest the 6 cluster solution and I agree, as this keeps the Whau Creek and Mangere samples
largely separate and identifies structure among the Whau Creek samples.
AUTHOR'S CHOICE.
Ultimately the final choice of the number of clusters is nearly always subjective; does any particular
partition suggest biologically interesting groupings? Remember that there might be more than one
interesting level of partitioning, even in a non-nested data set. However the methods given here can
help. For k-means partitioning methods that minimise trace(W) I suggest you try Callinski and
Harabasz's index. It assumes multivariate normality and spherical clusters; so do these partitioning
methods. It ought to be good with k-means methods that use absolute distance. For hierarchical
methods with spherical data, if there are more than two clusters, use Duda and Hart's index, but
examine the dendrogram for large changes in distance between successive fusions. If the clusters
are extremely non-spherical and you have used single linkage or one of its family, Duda and Hart's
method might be inappropriate.
How valid is the partition?
Having produced a partition of the data into k groups either directly, or by sectioning a dendrogram
at the appropriate level, it is important that you should demonstrate the validity of the clusters.
Otherwise you cannot expect people to accept your solutions as worth looking at. Even if the data
have structure, and the clustering technique has produced a partition, you cannot be sure the
resulting clusters are any different from a random partition; the limitations of your method may
prevent it recapturing the real structure in the data.
Replicating methods.
The simplest, if not the most direct, method is to use two different clustering algorithms, if possible
with two different dissimilarity measures for each. If all four solutions are essentially the same, then
the chances are the partition is a valid description of the data. It might be considered a bit
underhand if the methods used were too similar. For example it would not be acceptable to use a k-
means method and one that directly minimised a trace(W) criterion; they both optimise the same
criterion and might be expected to give much the same results. Ward's and the group average
methods, or a k-means instead of Ward's, might be sensible combinations. Of course in the
extremely unlikely event that you got the same result from a single linkage and a complete linkage,
even the most sceptical referee would have to admit the existence of the clusters in the data - the
more different the analyses, the more plausible the solution. Of course, the transformations or
standardisations may make agreement between the methods impossible - the clustering structure of
the data cloud can be changed dramatically by transformation or standardisation (see Figure 7.5, the
dendrogram of dendrograms).
Interpreting clusters.
There are three major features of a clustering solution that usually demand interpretation: which
sampling units are in which groups; which groups are similar to which other groups; and which
variables are most important in separating out the groups.
The interpretation of group membership depends on having extra information about the sampling
units. For example in the zooplankton clusterings we can interpret the groups on the basis of what
site the sampling units were taken from. We could also look at the physical variables that might be
correlated with group membership. This can be legitimately done using simple univariate
significance tests and descriptive statistics. These tests are not circular as the variables being tested
did not contribute to the clustering.
The relationship between groups can usually be seen directly from the data displays discussed in
section 7.3; though looking at the group means (centroids) for the variables can tell you a lot.
When looking at the cluster means it is important to realise that for most clustering methods they
are biased estimates of the population means of the underlying distributions. The reason is not hard
to see. Imagine two overlapping normal distributions. If the points are allocated to clusters solely on
the basis of which mean is closer, then extreme points of one distribution will be allocated into the
other cluster and vice versa. This will shift the means of the resulting clusters away from each other;
they will be further apart than the true means.
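A tiny simulation sketch of this bias, with two overlapping univariate normal populations and each point allocated to the nearer population mean (all numbers are illustrative):
set.seed(1)
x <- c(rnorm(1000, mean = 0), rnorm(1000, mean = 1))   # true means 0 and 1
cluster <- ifelse(abs(x - 0) < abs(x - 1), 1, 2)       # allocate to the nearer mean
tapply(x, cluster, mean)                               # cluster means end up well outside 0 and 1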
One of the more important questions that can arise after a cluster analysis is: which variables are
responsible for the difference between the clusters. Some of the variables may only show random
differences over the clusters; the clustering may be only apparent on a subset of the variables -
perhaps even only one. A simple approach is to perform a separate 1-way ANOVA for each variable,
and by ranking the resulting F-values identify which variables are most effective at separating the
groups. It is important to remember that these F values cannot be used for significance tests in the
normal way. The standard ANOVA significance tests are totally invalid and can tell you nothing
about the clusters. The data have been split to maximise some measure of difference between the
clusters, or some measure of similarity within them; of course the null hypothesis of no difference
between clusters can be rejected - the test is totally circular. If I split a group of people into "greater
than 5ft 10 ins" and "less than 5ft 10 ins", it would be rather pointless to then test to see if their
mean heights are the same.
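A hedged sketch of that ranking, assuming a data matrix called data and a vector of cluster labels called clusmem (as created in the programming notes below); the F values are used purely descriptively, never as formal tests.
fvals <- apply(data, 2, function(x) anova(lm(x ~ factor(clusmem)))[1, "F value"])
sort(round(fvals, 2), decreasing = TRUE)               # largest F = best separators (descriptive only)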
If there are many variables, a quick way of identifying the potentially important ones is to use a
Canonical Discriminant Analysis and interpret the structure coefficients (we do this later in the
course: section 11.3.2). Since a reduced plot from a CDA is a good way of aiding the interpretation
of the relationships among the clusters (7.3.1.i.b), this further step is a natural one.
In example 7.1 I used a PCA reduced space plot (Figure 7.2) to indicate the cluster membership.
Programming notes
Plotting labels in a reduced space plot has already been covered in the PCA section. However we
have to create a variable with the cluster membership information.
In R:
clusmem <- cutree(cl, k = 4)       # k = your chosen number of clusters
EXAMPLE 7.5.
If we return to the k-means partition of example 7.1, we can now attempt to interpret the groups.
The most obvious feature that has already been commented on is that group membership is related
to the sites from which the samples were taken. Since this pattern was expected even before we
looked at the data (i.e. a priori), a simple contingency table test is appropriate. Because of the small
numbers a χ2 test is not possible but a Fisher's Exact Test rejects the null hypothesis that cluster
membership is independent of site (p<.0001).
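A one-line sketch of that test, assuming a factor called site recording where each sample was taken, and the labels from the k-means object of Example 7.1:
fisher.test(table(site, cl$cluster))   # site is a hypothetical factor; cl is the kmeans() object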
Figure 7.2, the PC plot, suggests that group 1 is associated with larger amounts of Favella, group 2
with Harpacticoids, group 3 with Oikopleura, and it is unclear what is related to group 4. We can
expand on that interpretation by presenting the means of the log(X+1) transformed abundances for
each group in Table 7.2.
As suggested by the PC plot, group 1 is characterised by high densities of Favella, the organic
sewage loving ciliate. Group 1 contains nearly all the Mangere observations, close to the sewage
oxidation ponds. Group 2 has the highest levels of Harpacticoids, though there does not seem to be
anything else particularly characteristic. Group 3 is quite clearly the offshore water in the Whau
Creek as it contains characteristically high numbers of Oikopleura, an offshore species. Group 4 has
large amounts of Gladioferens, Temora, and Favella, an anomalous mix since Temora tends to like
cleaner, more offshore water, Favella likes high nutrient, organically polluted water, and
Gladioferens likes relatively unpolluted estuarine waters. Since the 3 members of this group were
observations on the same day taken well up the Creek at high tide, I am going to suggest that they
represent a mix of offshore Temora-bearing water, with bodies of highly polluted estuarine water
carrying Favella and less polluted estuarine water carrying Gladioferens.
If we rank the F values from one-way ANOVA between the clusters on each of the variables we find
that Gladioferens, Oikopleura, Temora, and Favella have the largest values: Oikopleura and Temora mark the cleaner, offshore waters; Favella, organically polluted inshore waters; and Gladioferens, relatively unpolluted estuarine water.
Table 7.2 Means of Log(X+1) transformed high tide zooplankton data for each
cluster from Example 7.1 (k-means method).
Species            Cluster 1   Cluster 2   Cluster 3   Cluster 4   F-value from ANOVA
Acartia 5.44 5.91 1.72 6.06 7.9
Euterpina 6.38 6.16 6.16 4.03 2.28
Gladioferens 0.00 0.00 0.93 5.20 37.12
Harpacticoids 0.00 2.57 0.85 1.38 6.04
Oithona 8.26 7.72 6.49 7.91 6.85
Paracalanus 0.00 0.93 1.54 0.00 2.93
Temora 0.00 0.54 2.19 4.91 20.21
Favella 6.53 1.58 1.88 5.47 18.45
Oikopleura 0.51 0.41 6.12 0.00 31.57
Chapter 8. Multiple Regression
Though formally not multivariate techniques, multiple regression and partial correlation are most
commonly used in biology to investigate relationships within large groups of variables. They also
provide a relatively simple introduction to some of the problems that affect other truly multivariate
techniques.
Multiple regression and partial correlation are closely related techniques and their uses should be to
a large extent complementary. Partial correlation describes the direction and intensity of a linear
relation between two variables after the influence of other variables has been corrected for. Multiple
regression on the other hand can be used, under certain restrictive circumstances, to describe the
linear relationship between two variables, after correcting for the influence of other variables.
Ideally such a description provides insight into the nature and size of the influence of one variable
on another. When the data are observational rather than experimental, conclusions about causality
are hypotheses at best, and rather than use multiple regression it is better (for a number of reasons
outlined later) to use partial correlation. Regression is more appropriately, but in biology more
rarely, used to provide a linear equation that allows the prediction of a particular variable (Y) given
the values of a group of other (X) variables. Such predictive equations are "black boxes"; it would
not usually be possible to interpret their elements (coefficients) as descriptions of the system or as
providing insight into the relative influence of the different X variables on the Y variable.
Multivariate techniques usually require interaction with the data, none more than multiple
regression. There are no magic boxes. Data cannot be simply thrown into a computer and useful
results retrieved at the other end. Multiple regression is probably the most misused technique in
biological statistics, which is a pity because it is also one of the most widely used. In particular, it is
used to analyse observational or field data when partial correlation would lead to fewer misleading
conclusions.
These two techniques are very closely related, just as are bivariate regression and correlation. The
tests are the same; if a β coefficient (a partial regression coefficient) achieves some form of significance, then so will the equivalent partial correlation. The obvious difference is that regression attempts to describe the shape of the relationship, correlation its closeness. If the data are observational then, as we shall see below in section 8.1.4, the β coefficients are nearly always biased. If the primary aim is descriptive and investigatory, rather than prediction, then the βs can at best be used to
generate hypotheses, not to test them. Certainly they can provide no evidence of causality, they are
describing and detecting correlations not causes. So the only real difference between presenting the
results as a regression equation rather than partial correlations is that the regression equation has, in the eyes of many biologists, a sort of spurious rigour and seems to imply an (often spurious) causal relationship. This exploratory mode is more appropriately associated with correlation than with regression, so presenting the results as partial correlations rather than partial regression coefficients will tend to encourage a suitably tentative interpretation.
Intuitive introduction.
It is simplest to start with the bivariate equivalents of partial correlation and multiple regression.
Most biologists have at least a nodding acquaintance with bivariate correlation and regression.
The most usual correlation coefficient, and there are a number to choose from, is the Pearson's
product moment correlation coefficient, r. This measures the intensity and direction of the linear
relationship between two variables. It is also a measure of the amount of independent information in
the two variables: if r equals 1, the two variables are perfectly correlated and there is effectively only
one variable, not two. This measure is more usually expressed as r2, the coefficient of determination
or the proportion of the variance of one variable explained by the other. Correlation is most
appropriately used with observational data.
Regression on the other hand can be used with either experimental or observational data, though in
different ways. It estimates the equation of the line that in some sense fits the data best. The most
commonly used technique, Ordinary Least Squares Regression, finds the best line by minimising
the sum of the squared differences between the observed Y values and those predicted by the line -
hence its name.
With experimental data, such an equation can be used to describe the causal relationship between
the response (dependent) variable and the experimentally controlled stimulus (independent)
variable. The resulting equation gives the amount the response variable will change for a unit
change in the stimulus variable. With observational data on the other hand this is generally not
possible. Such a description is possible only if there is a direct causal relation between the Y variable
and the X; there are no other unmeasured variables that can influence both; and the X variable has
been measured without error. This would be a comparatively rare situation. Not that that has
deterred workers in every discipline.
With observational data regression can be used to provide a simple predictive linear equation. This
would predict the value of the Y variable given a value for X, but the equation cannot be assumed to
provide an adequate description of the relationship. Such a description would identify the line
(equation) of the most likely pairs of X and Y values, the typical combinations of the two variables.
The usual regression method does not give this line, except under very rare circumstances. For this
reason it is usually better when using observational data to use correlation techniques rather than
regression. If you insist on using regression methods then there are a number of methods appropriate
to this "error-in-variables" problem (i.e. when there is error in the Xs), but for multivariate data they
are complicated, often fragile, and not widely available.
Partialling.
Correlation in a multivariable world, i.e. outside the laboratory, is notoriously difficult to interpret,
since an apparent relationship between two variables can be the result of the action of a third
perhaps unmeasured variable. For example, there is a data set that records the number of births per
year in a city of central Europe, and the number of nesting stork pairs; there is a large, statistically
significant, correlation. While they do not provide the number of gooseberry bushes per city as well,
I have no doubt that there would also be a significant relationship there too. Before we rush out to
rewrite the obstetrics textbooks we would be well advised to check for the influence of other
variables. In this case it is clear that number of births, number of storks and probably number of
gooseberry bushes are proportional to the size of the city, and that the true relationship between
numbers of births and storks can only be appreciated after correcting the data for the different sizes
of city. Disappointingly, if this is done, the correlation dwindles to a boring insignificance. By
correcting for the influence of a third variable we have materially assisted our interpretation of the
data. Of course we may still be ignoring yet another variable which will change our interpretation
yet again, but the line has to be drawn somewhere, and only our experience of the system under
study can suggest where and prevent foolish mistakes from being made. This corrected coefficient is
the partial correlation. It is the correlation between 2 variables that would be observed if the third
variable were held constant. If the ordinary correlation coefficient were to be written r12 then the
same with the effect of variable 3 corrected for (partialled out) would be written r12.3. It is possible,
though not always useful, to simultaneously correct for a number of variables.
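The idea of partialling can be made concrete with residuals: regress each variable of interest on the nuisance variable(s) and correlate what is left over. A hedged sketch, with hypothetical column names (births, storks, size) in a data frame df:
r.births <- resid(lm(births ~ size, data = df))   # births with city size partialled out
r.storks <- resid(lm(storks ~ size, data = df))   # storks with city size partialled out
cor(r.births, r.storks)                           # the partial correlation r12.3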
Multiple regression can be regarded as the regression equivalent of partial correlation. It provides an
equation that expresses the dependent variable Y in terms of the X variables:
Ŷ = β0 + β1X1 + β2X2 + ... + βnXn.
The variable Y is estimated by a linear combination of the Xs.
If we start by only considering 2 X variables this equation will define the plane that will fit the Y
values best (see figure 8.1). It minimises the sum of squared differences between the observed Y
values and those predicted by the equation. i.e. the same method, least squares, as in the bivariate
technique; or equivalently it maximises the correlation coefficient (the multiple correlation
coefficient) between Y and Ŷ, i.e. between Y and the linear combination of X variables. If there are
more X variables it will define a hyperplane.
For observational data this is usually a purely predictive equation. However, when the data are from
a controlled situation, the coefficients βi may be interpreted as the amount of change in the response
variable produced by a unit change in the ith variable after correcting for the influence of the other
variables. In other words, it is a partial regression coefficient, analogous to the partial correlation
coefficient. It is a measure of the slope that would exist between Y and Xi if all the other Xs were
kept fixed. Visually it defines the slope of the regression hyperplane in the plane of the Y and Xi axes
(see Figure 8.1).
EXAMPLE 8.1
Partial Correlation
To show the usefulness of partial correlation, let us consider the simple correlation between the
estuarine species Gladioferens (log(X+1) transformed to improve linearity) and the salinity of the
water, -0.29 (p=0.024). This is, unsurprisingly, negative but not as obvious as we might expect for
an exclusively brackish water species. If we think about it, we might consider that the presence of
sewage and organic pollution in many of the inshore samples could be obscuring the relationship. If
we partial out the concentration of faecal coliform bacteria (log(X+1) transformed), an indicator of
sewage, and the Biological oxygen demand (BOD), a measure of the organic loading in the water,
[Figure 8.1. The plane of best fit defined by the regression equation y = β0 + β1X1 + β2X2. The parameters β1 and β2 are the slopes of the fitted plane in the planes of the Y-X1 and Y-X2 axes respectively; β0 is the point where the fitted plane intersects the Y axis.]
the (now partial) correlation between Gladioferens and salinity jumps to -0.478 (p<.0001). Indeed
Gladioferens is a low salinity species, but its distribution is influenced by the degree of organic
pollution. Here, correcting for confounding variables has caused an interesting relationship to
emerge more clearly. While we are at it, we might as well also calculate the relationship between
Gladioferens and faecal coliforms keeping BOD level and salinity fixed: -0.35 (p = 0.006), and also
the correlation with BOD keeping faecal coliform and salinity fixed: -0.34 (p = 0.008). Clearly
Gladioferens tends not to be found in areas with high organic loading and high sewage
concentration.
The reverse effect, the disappearance of an interesting correlation when sensible confounding
variables are partialled out is also common. The correlation between Gladioferens and Chlorophyll
"a", a measure of phytoplankton abundance, is negative: -0.32 (p=0.01). Since Gladioferens eats
phytoplankton this seems a bit surprising. However, on second thoughts we might expect more
phytoplankton in the richer polluted waters near the sewage outfalls. The nutrient enrichment of
such areas may encourage algal growth, though we know that Gladioferens seems to avoid such
areas. We can check this by controlling for some measure of nutrient level; phosphate concentration
seems the most sensible since that has been shown to be involved in phytoplankton growth. When it
is partialled out the new partial correlation is -0.03 (p=0.82, NS). The apparent relationship has
disappeared. That is not to say that -0.03 measures the true relationship between Gladioferens and
Chlorophyll "a", there may be (will be) other confounding variables whose influence has to be
corrected for before the true relationship will emerge. Our partial correlations are almost inevitably
biased, but for exploratory work they offer a useful tool to investigate, at a fairly detailed level, the
relationships among variables.
Programming notes:
In R there are no built-in functions for this. You will therefore have to use two functions I have set up. They
are available in CECIL.
To get the variables Y after partialling out variables X use:
res<-parcor.res(Y,X)
To get the matrix of partial correlations use:
parcor<-cor(res)
To do the test on a given correlation between say res[,2] and res[,4] use:
q=ncol(X)
parcor.test(res[,2],res[,4],q)
Multiple regression.
The same relationships could have been analysed using multiple regression. The X variables are not
measured without error so the β estimates will be biased, and therefore any interpretation must be
tentative.
There are two approaches to the analysis we just performed with partial correlation.
i) The "Shotgun approach".
We are investigating the relationship between the log(X+1) transformed abundance of Gladioferens
(Y) and certain physical factors (the Xs). So far we have mentioned salinity, faecal coliform
concentration, Biological Oxygen Demand, Chlorophyll "a" and Phosphate concentration. We
might therefore investigate what happens when we fit all the Xs, i.e. partial all the Xs out of each
other. This lacks the finesse of an approach which corrected only for those variables that were a
priori important, but has the merit that it might throw up something unexpected.
If we fit all the Xs at once we get the results of Table 8.1.
The relationship with salinity that we found with partial correlations is quite clear. The negative one
with chlorophyll "a" has disappeared and there is the suggestion of a negative one with BOD. There
is however little to suggest a negative relationship with organic pollution, except the coefficient for
BOD. Our, very tentative interpretation is:
a) for a given level of algal abundance, sewage pollution and nutrient enrichment, Gladioferens
abundance decreases with salinity (they prefer brackish water);
b) for a given level of salinity, algal abundance, sewage pollution and nutrient enrichment,
Gladioferens abundance decreases with the organic loading.
The results might suggest that the organic loading is not the algae, not the sewage, but some other
source, possibly a freezing works that despite restrictions used to sometimes discharge into the
Mangere Inlet. Or it might simply be another of those correlations that come from leaving out a
confounding variable.
Surprisingly it has failed to detect the negative relationship with sewage organic loading, or with
high nutrient levels that was implied by the partial correlations. Does this mean that such a
relationship does not exist? No, and the reason why will appear below.
Programming notes:
In R
reg <- lm(yvar ~ xvar1 + xvar2 + xvar3, data = mydata)   # add further x variables with +; mydata is your data frame
summary(reg)
Be careful about the case of the variable names (upper or lower case) when you are specifying the
regression model (equation). Submit this command first to look at them:
names(mydata)                     # check the exact variable names first
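A hedged, concrete instance of that call for the shotgun fit discussed above; the column names are assumptions, not those of the course data set.
reg <- lm(glad ~ sal + chla + coli + bod + phos, data = zoodata)
summary(reg)                        # coefficient table comparable to Table 8.1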
Partial correlation.
A test is possible for the null hypothesis that the partial correlation between two variables - holding
the other q variables constant - is zero. critical values can be taken from the table for the ordinary
correlation coefficient with (n-2-q) degrees of freedom. Since the partial correlation coefficient
describes the intensity of the linear relationship it is often useful to calculate confidence intervals:
use those for the ordinary correlation coefficient with (n-2-q) degrees of freedom. The partial
correlation coefficient and its associated tests and statistics are subject to all the problems associated
with multiple regression (see below), though because the interpretation is different they are less of a problem for the correlation.
Multiple regression.
The significance tests for multiple regression are all subject to the assumptions about the errors
mentioned above, that all the important variables have been included and that the Xs have been
measured without 'error'. These, and the effects of excessive collinearity among the Xs, will be
discussed in 8.1.4.iv & v.
The first null hypothesis for testing is usually that all the βs are 0. Since multiple regression and
ANOVA are both in the family of General Linear Models, this is usually given as part of an ANOVA
table - an F test of the regression M.S./residual M.S.. A significant model may be no use for
prediction. The F ratio should be at least 4 to 5 times the critical value, preferably 10 times before
the tested model can be regarded as even potentially useful for prediction. Equally, a significant
model does not imply that the model will provide any insight into the system.
A potentially more useful test is for the null hypothesis βi=0. If all the assumptions are met, a
standard error can be calculated and this null hypothesis tested with a t-test with degrees of freedom
(n-q-1), where q is the number of X variables. This test is identical to that for the partial correlation for Y and Xi with all the other Xs partialled out. This means, and it will be reiterated below, that the
interpretation of the significance of β estimates derived from observational data suffer from the
same problems as do the equivalent correlation coefficients, and as such can only be regarded as
generating hypotheses rather than proving them. If the model is to be used for prediction only, these
tests of hypothesis will seldom be of much interest.
If the model is to be used as a description of the system then confidence intervals on the β estimates
will be necessary. Of course these intervals do not take bias into consideration, so they only give
information about the true β values when they are unbiased. The sources of bias are discussed
below.
Whether the model is to be used for prediction or system description, most workers would like to
know how well it fits the data. Clearly the estimate of σ2 would be one measure, but this would usually have little intuitive appeal except to compare different models fitted to the same Y values.
One simple measure is the correlation between the values of Y predicted by the model (Ŷ) for each
data point and the corresponding observed Y value - the multiple correlation coefficient. Just as the
square of the ordinary correlation coefficient is the proportion of the total variance in one of the
variables explained by the other, so the square of multiple correlation coefficient R2 is the proportion
of the total variation in Y that is explained (predicted) by that linear combination of Xis. This has
obvious appeal. Unfortunately it has one major drawback: as you fit more X variables, even ones
unrelated to Y, the R2 value will always increase. Indeed, as the number of X variables approaches
the number of data points R2 inevitably approaches 1. For comparing the goodness of fit of models
with different numbers of variables, we need a measure that is stable with respect to the number of
variables included in the model - the adjusted R2:
R2adj=1-(1-R2)*(n-1)/(n-q-1).
When there are no relationships the true, population, value of R2 should be 0. However, in the absence of any relationships the expected sample value of R2 is q/(n-1), while that for R2adj is 0. One minor drawback is that sample values of the adjusted R2 may go negative, which for a proportion of variance explained is embarrassing; by convention these are reported as 0. A more significant problem is that if n<60 and q>20 then the adjustment is inadequate and may be out by as much as 0.1.
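The arithmetic behind both statistics can be checked by hand from a fitted lm object; summary() reports them anyway, so this sketch (using the reg object from the programming notes) is purely illustrative.
y  <- model.response(model.frame(reg))              # the observed Y values
n  <- length(y)
q  <- length(coef(reg)) - 1                         # number of X variables
r2 <- 1 - sum(resid(reg)^2) / sum((y - mean(y))^2)
r2adj <- 1 - (1 - r2) * (n - 1) / (n - q - 1)
c(R2 = r2, R2adj = r2adj)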
EXAMPLE 8.2.
If we return to the shotgun regression of Example 8.1 we can calculate the goodness of fit statistics
R2 and R2adj; comparing models that only include a subset of the X variables will have to wait. The
amount of the variance of the log(X+1) transformed Gladioferens abundances explained by the
model including: salinity, chlorophyll "a", faecal coliforms, BOD, and Phosphate concentration is
0.34; the adjusted R2 is 0.28 - rather small values. This suggests that the density of Gladioferens is
influenced by more factors than we have measured or there are non-linear relationships present
(both quite inevitable in real observational data).
Interestingly, the 3 variable model with only salinity, faecal coliforms, and BOD in it fits nearly as
well: R2 = 0.31, R2adj = 0.27 - confirming that some of the variables are redundant. But this leaves
the question open as to which, since the two models disagree about which variables are important.
Model Selection.
Whether the variables are to be partialled out (partial correlation) or fitted in the model (multiple
regression), the problem of selecting a useful subset remains. No-one wants to fit more parameters
than necessary; each extra X variable considered costs another degree of freedom, and as a result the
β estimates and predicted values of Y will generally have larger standard errors than necessary.
Equally, leaving out an important variable would produce biased and spurious results. How this
problem is approached depends on the aim of the analysis. We can identify 3 main alternative goals
for these analyses:
i) To test a priori hypotheses about the roles of the X variables in the system.
ii) To produce an equation that will allow prediction of the Y variable.
iii) To generate hypotheses about the role of the X variables in the system.
There are 5 main ways to select variables for inclusion into a model. I will discuss them with respect
to multiple regression but the same arguments hold equally well for partial correlation.
i) Theoretical.
In experimental or well studied situations it may be possible to identify the best subset of measured
variables a priori. It may be known in advance which are likely to be important. In this case a single
analysis can be performed and all relevant null hypotheses tested. Of course, there is always the
possibility of throwing away an important variable by mistake, and this would usually produce spurious results; leave out the size of the town from a regression of number of babies against number of
storks and you may be misled. There is therefore a balance: increased precision from throwing
variables out to be weighed against an increasing risk of spurious, biased results.
EXAMPLE 8.3
Returning again to Gladioferens we can a priori suggest a subset of variables to test: salinity
certainly (Xsal); a pollution indicator, say BOD (XBOD); perhaps a dummy variable for the state of the
tide (High=1, low=0), as estuarine animals are often tidal (Xtide); and finally, since it is often found in
mangrove swamps where the water is muddy, we might include a measure of suspended solids
(Xss).
Since all the null hypotheses are a priori, the significance tests are (at least to that extent) valid.
When we fit the model we get
                    β estimate   Standard Error   t-value (H0: β=0)   p-value (H0: β=0)
Intercept              15.51          4.02              3.85               0.0003
Suspended Solids       -0.01          0.01             -1.14               0.2594
Salinity               -0.36          0.12             -3.05               0.0035
Tide                   -1.69          0.52             -3.23               0.0021
BOD                    -0.40          0.14             -2.82               0.0066
R2 = 0.34, R2adj = 0.29.
The results clearly support all our expectations, except that about inorganic suspended solids - there
is no evidence of a relationship there.
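In R the fit might look something like this (a sketch only; the data frame dat and the column names glad, ss, sal, tide and bod are hypothetical stand-ins for the transformed Gladioferens abundance and the four a priori predictors):
fit <- lm(glad ~ ss + sal + tide + bod, data = dat)   # the a priori model
summary(fit)    # gives the β estimates, standard errors, t and p values tabled above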
ii) Hierarchical.
This is in essence the structured, investigatory, approach described in Example 8.1. In certain
situations there are a priori reasons for fitting X variables sequentially. Thus the relationship
between Y and X1 might be tested for first, then that between Y and X2, corrected for X1; then Y and
X3, corrected for X1 and X2 and so on. The most likely such situation would be a multiple regression
analogue of the analysis of covariance. In such a situation there would be a set of variables, which
have to be corrected for before the effect of the interesting factors can be assessed. A similar
situation might arise if the Xs were causally linked to the Y variable, but in a hierarchy of proximity.
In which case the most proximal factor would be fitted first, the second factor would then be fitted,
and its influence assessed after the effect of the first had been corrected for, and so on. If the prime
aim of the analysis is the generation of hypotheses about the X Y relationships, then it might be
sensible to fit the interesting variables first, then fit the less interesting subsequently. In these
examples the decision to include a variable in the final model can be determined by the results of
simple tests of null hypothesis, for in each case the null hypothesis, which depends on the order in
which the variables are to be fitted, exists a priori. It is important to note that we are not trying to
get the 'best' model in any sense, we are attempting to gain insight into the relationships. The final
model may not be the 'best' according to any statistical criterion, but it may reflect reality better than
alternatives since it has incorporated intuition and experience into its formation. Of course the tests
and parameters are only valid if the assumptions are correct.
An alternative, though related, approach for handling this situation is path analysis or causal
modelling, but they are too complex for this course.
We can see that the approach we used for the partial correlations in Examples 8.1 and 8.2ii is
essentially a hierarchical one. By observing how the regression parameters (and their p values )
change as we fit the extra variables we can, we hope, gain insight into the correlations among the Xs
as well as between the Xs and the Y.
iii) All possible subsets.
The all possible subsets technique is primarily used to produce an optimum predictive equation,
though it can be used for generating hypotheses. If there are q X variables then there are 2^q possible
models to predict Y. If all the usual assumptions are correct, then the major problem is introducing
bias by dropping the wrong variable.
If you look at the few best models for each m (number of variables in a subset model), it is often
possible to see how some X variables can substitute for others giving nearly equivalent fits. In this
way we can gain insight into how the X interact in their relationships with the Y variable. There is of
course a problem, if there are 12 X variables there are 4096 possible models, a large number, but not
impractical using an appropriate algorithm on modern computers. If there are 15 then the number of possible models is 32768 - unreasonable by virtually any standards. In such a situation, it is still
possible to identify the 'best' equations. The optimal subsets technique, now becoming available in
most statistical packages will identify the k best subsets for m variables, using whatever criterion
you may choose, e.g. R2adj. In this case all that is necessary to achieve an equivalent analysis to the
all possible subsets technique is to run the analysis setting m to 2 through to q. This way you can get
the advantages of the all possible regressions technique without the excessive cost or output.
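In R, one way to get the k best subsets for each size is the leaps package (a sketch; x and y are hypothetical objects, a matrix of the X variables and a vector of the Y values):
library(leaps)                                    # best-subsets search
subs <- regsubsets(x, y, nvmax = 13, nbest = 3)   # up to 13 variables, 3 best models per size
summary(subs)$adjr2                               # compare the candidate models by adjusted R2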
If the aim of the analysis is system description, investigating the correlations between the Y and the
Xs, there is a major drawback to these two techniques and those that follow. Even if the
assumptions are all strictly true, because the Xs to be partialled out for any given β are decided by
looking at the data, the hypothesis tests are invalid. It is tautological to test a hypothesis on the same
data as generated it. These a posteriori techniques can therefore only be used to generate hypotheses
that can then be investigated further. Of course as soon as we treat multiple regression as an
exploratory multivariate technique, then we would ignore the tests anyway; so perhaps it's not such
a major drawback.
If the model is to be used for prediction then there is another consideration. To some extent the
emphasis is less on unbiasedness and more on minimising the residuals. It is sometimes more useful
to accept a biased but more precise result than to use an unbiased prediction with a large variance.
Frequently the user of a predictive equation does not need the very best equation, particularly if it is
going to be expensive to discover, an adequate one more cheaply arrived at will be preferred. In
such a case the stepwise techniques that follow might be used (though I would seldom recommend
them if alternatives exist).
EXAMPLE 8.4
Let us now try to find the best model to predict Gladioferens by considering all the physical
variables in the data set (13 of them). Purely to see what we get, let's first find the model that gives
the best R2, then the best R2adj.
a) Maximising R2.
Since the value of R2 trivially depends on the number of variables fitted in the model, we have to
find the best R2 for models with 1 variable, 2 variables etc. The results are shown in Table 8.2. The
most obvious feature is that because of the correlations among the Xs, some variables can substitute
for each other. For example, though phosphate concentration (PO4) is the best single predictor,
nitrate concentration (NO3) substitutes for it in the 3 variable model. But nitrate does not last long either. Similarly, though dissolved oxygen (DO) appears early on, it is replaced by BOD and only re-enters towards the end.
Table 8.2 The best variable subset, using R2, for models of increasing size. The
variables for each model are ordered by the magnitude of their partial correlations, the sign
of the coefficient is also included. Entries in bold type had p values < 0.1 (an arbitrary rule
of thumb to help us identify interesting variables when we attempt to interpret the
coefficients). The most interesting feature of the table is how phosphate concentration is
the best single predictor but when other variables are fitted it is redundant and only enters
the model again after all the other variables have been fitted (when m=12). The same thing
happens to Nitrate concentration.
m    R2      R2adj   variables
1 0.153 0.306 -PO4
2 0.329 0.306 -NO3 -DO
3 0.407 0.376 -NO3 -DO -TIDE
4 0.449 0.408 -TIDE -CHLA -SAL -BOD
5 0.484 0.436 -CHLA -TIDE -BOD -SAL -SEC
6 0.511 0.456 -CHLA -TIDE -SAL -SEC -BOD +PH
7 0.525 0.461 -CHLA -TIDE -BOD -SAL +PH -SEC -DO
8 0.548 0.477 -CHLA -SAL -BOD -TIDE +PH -SEC -DO -FCOLI
9 0.551 0.471 -CHLA -TIDE +PH -SAL -SEC -DO -BOD -FCOLI -NO3
10 0.555 0.464 -CHLA -TIDE +PH -SAL -DO -SEC -FCOLI -BOD -NO3 +NH3
11 0.556 0.455 -CHLA -TIDE +PH -DO -SEC -SAL -FCOLI -BOD -NO3 +NH3 -TEMP
12 0.557 0.444 -CHLA -TIDE +PH -SEC -DO -SAL -FCOLI -BOD -NO3 +NH3 -TEMP +PO4
13 0.557 0.432 -CHLA +PH -TIDE -DO SAL -SEC -FCOLI -BOD -NO3 +NH3 -TEMP +PO4 -SS
b) Adjusted R2.
Unlike R2, R2adj can be compared between models with different numbers of variables. For any given
sized subset of variables, comparing R2s will give the same best subset as comparing R2adj.
Examining the R2adj values in Table 8.2 we find that the 8 variable model is the best by this criterion.
a) Forward selection.
Variables are added to the model one at a time; at each step the contribution of the candidate variable is assessed by comparing an F statistic for its β estimate with a critical value. The critical value is chosen for a particular probability value. Though this may sound like a simple hypothesis test, this probability level is not a measure of Type I error as in a true hypothesis test. It, and the critical F value it defines, is an arbitrary criterion whose value is chosen on purely pragmatic grounds. A probability value of 0.5 (no, it's not a misprint for 0.05) has been found useful. It ensures that plenty of variables get fitted. Thus the program will generally produce a good equation for 1 variable, for 2 and so on. These can then be assessed and the one to be used selected by comparing the R2adj values.
b) Backward elimination.
This technique starts with all the variables present in the model. Variables are selectively removed
until further removals would adversely affect the fit. The first step is to calculate the partial
correlations of Y with the X variables, all the other Xs being partialled out. The X variable with the smallest partial correlation is selected and its β estimate (or partial correlation) tested; though, as with the forward
selection procedure, this is not a formal test of significance. If the β value is considered too small,
i.e. the X variable appears to contribute little to the fit in this model, it is dropped. The partial
correlations of the remaining variables are calculated and the smallest tested for removal, if it
appears to contribute little to that model then it is dropped, and so on; until the test forbids the
dropping of a variable from the model, or all the variables have been dropped. The commonest
strategy is to set the criterion for removal very lax, making it difficult to retain the variables. This
ensures that all the variables eventually get dropped and there will be a choice of models of different sizes, the final choice being made on the basis of their R2adj values. A probability value of 0.05 is often used to achieve this. The similarity to a significance level is entirely coincidental - it is not a measure of Type I error. If expediency and speed are the important criteria, and you cannot examine the model at every stage of the selection process, then use a p-value of 0.15; it keeps a reasonable number of variables in the model.
c) Full stepwise.
This technique starts off as a forward selection method but as each new variable is added, the
contributions of all the variables already in the model are checked, as in backwards selection; any
variables that fail to meet the criterion for retention are removed. Then the X variable with the next
largest partial correlation given the currently included variables is found, tested and if it passes,
added to the model; then all the included variables are tested again. This technique therefore uses
two criteria, one for the addition of variables, and one for their removal. Once again the choice of
criteria will determine how many models are presented for consideration, i.e. how soon the program
stops adding variables. The choice can be difficult; for example, the rejection criterion (probability value) cannot be smaller than the inclusion one, otherwise variables may be rejected as soon as they have been entered. A rejection criterion larger than the inclusion one will 'protect' entered variables,
making them difficult to remove; if it is set too large it will effectively reduce the technique to a
forward selection method. A compromise with the two criteria set equal at a probability value of
0.15 might be appropriate, but there are no hard and fast rules, try a number of settings; though it is
probably a good idea to look at as many models as possible and make the final decision using R2adj.
Though this method is computationally more expensive than the other two stepwise methods it is
probably to be preferred to them since it will usually present more models for consideration.
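For completeness, base R's step() function implements this family of selection methods, though it adds and drops variables by AIC rather than by the p-value criteria described above (a sketch; dat is a hypothetical data frame with the response y and the candidate X variables):
full <- lm(y ~ ., data = dat)        # model containing all the candidate X variables
null <- lm(y ~ 1, data = dat)        # intercept-only model
step(null, scope = formula(full), direction = "forward")    # forward selection
step(full, direction = "backward")   # backward elimination
step(full, direction = "both")       # full stepwise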
The stepwise family of techniques has been very widely used in biology; indeed in the eyes of many
they are synonymous with multiple regression. Their ease of use is seductive. Put the numbers into
the computer, press a few buttons, and "Hey Presto!", the 'best' model is pumped out. In fact these
techniques suffer from several major drawbacks that render them unsuitable if alternatives are
available.
Because these techniques do not look at all possible subsets, and because when the Xs are correlated,
groups of variables may be able to substitute for each other to some extent, there will nearly always
be other subsets of X variables that will be as good or better than those found by these methods.
Hocking (1976) cites a study where the best subset, as found by the all possible subsets method, had
a 37% improvement in the proportion of variance explained (R2), over that found by the full
stepwise method. However, when the X variables are not heavily correlated, there is a reasonable
chance of these techniques giving adequate predictive models but they will seldom be reliable
enough for investigative studies.
Very often users are tempted to interpret the order in which the variables are added (forward selection) or deleted (backward elimination) as reflecting the relative biological importance of the variables. This can be totally misleading: the order is strongly affected by intercorrelations among the X variables. It is entirely possible for the variable added first in a forward selection (as having the largest simple correlation) to be the first abandoned in a backward elimination (as having the smallest partial correlation) - see PO4 in Table 8.2.
Even if the model is to be used solely for prediction there can be further disadvantages to not
surveying all the possible models. Very often, the X variables cost different amounts to measure.
Clearly if there are two models nearly equally good for prediction, one containing cheap variables
the other expensive ones, the choice is clear. Unfortunately the stepwise techniques seldom give
enough models for this situation to be apparent. Only by surveying all the adequate models can such
factors as cost be taken into consideration.
Of course, at the extreme, there may be too many variables to possibly look at all the models or even
the best for all subset sizes. In which case, you are going to have to use a stepwise technique and
rely on luck to give you what you need.
EXAMPLE 8.5.
a) Forward selection.
Because the best 1 variable model includes phosphate concentration, but the better larger models do
not, we can predict that the forward selection method is going to have problems in locating the best
model. Once phosphate is in the model, this stepwise method cannot let it out.
The order in which the variables entered the model was:
PO4
DO
NO3
TIDE
PH
CHLA
SEC
SAL
BOD
FCOLI
The selection stopped when the next variable failed to produce an F statistic with a p value <0.5.
This model had an R2 =0.553, and a R2adj = 0.461. Though not the best model, it has come up with a
sensible one - I would say we got lucky; this technique can sometimes come up with some fairly
silly models.
Some workers use the order in which the variables enter the model as representing the importance of
that variable. Given that two of the first 3 variables into the model do not even appear in the best
model (the 8 variable one), it should be clear that this is an unreliable thing to do.
b) Backward elimination.
I used a p value of 0.1 for the criterion to keep a variable in the model. The order in which the
variables left the model was:
SS
PO4
TEMP
NH3
NO3
FCOLI
DO
leaving a 6 variable model containing: CHLA, TIDE, SAL, SEC, BOD, and PH. This is the best of
the 6 variable models (see Table 8.2), but not the best overall model as found by R2adj. Still, by
examining the models at each stage of the selection process, we do find that the best model was
considered. By examining the R2adj we can identify it and choose it instead of the final model of the
backward stepwise procedure. Clearly, a change to the p-value used as the retention criterion would
have led to the technique stopping at the best model - it is just difficult to know in advance what
value will work. It is better to examine the models and find the best yourself. Once again, though the
model found is sensible, you should not interpret the order of the variables leaving the model as
telling you about their relative biological importance; nor rely on the technique even passing
through the best model with every data set.
If you insist on presenting the partial regression coefficients then problems of interpretation can
arise if the X variables have different units. The interpretation of a β coefficient is the amount the
predicted Y value will change for a unit change in X. To be comparable between the Xs, say to
identify which X has the steepest association, the Xs must have the same units. The solution is a
traditional one, standardise the Xs into standard deviation units and then perform the regression. The
resulting partial regression coefficients will be comparable. A large value will imply a steep
relationship. In fact there is no need to do the standardisation yourself and then the regression, it is
simple to calculate these standardised regression coefficients direct from the unstandardised ones.
Most regression programs offer them as output.
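Both routes can be sketched in R (fit, dat and the X names x1, x2, x3 are hypothetical):
# route 1: refit with each X scaled to zero mean and unit standard deviation
fit.std <- lm(y ~ scale(x1) + scale(x2) + scale(x3), data = dat)
# route 2: convert the unstandardised slopes directly (assuming the coefficients and
# columns are in the same order): multiply each slope by the standard deviation of its X
coef(fit)[-1] * sapply(dat[c("x1", "x2", "x3")], sd)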
a) Number of observations.
You must always have more observations than variables, otherwise the regression solution will be
trivially perfect - "with enough parameters one can fit an elephant". Ideally there should be more
than 20 times as many cases as variables. It has been suggested that if a stepwise technique is to be
used, 40 times would be appropriate; and that there should be at minimum 4 or 5 times as many
observations as X variables. Our example therefore is close to minimal.
b) Range of values of X vars.
If the relationship between the Y variable and an X is to be detected, then the X variable must have
been measured over a sufficiently large range of values for its effects to be manifest in the Y values.
This is sometimes a problem if the system being studied is under the control of man, agricultural
systems for example. The important variables are often the very ones that the farmer cannot allow to
vary without risking loss of his crop. These important variables, if measured by an observer, will not
contain enough information for their true relationship to the response variable to be detected. If the
model is to be used for prediction, its accuracy will depend on those important variables being
controlled to the same values wherever the equation is applied. This kind of situation can occur in
natural systems as well, particularly in ecology or behaviour. The study may be too short to measure
the full range of values for a variable such as temperature, or may not include a sufficiently large
range of habitats so that edaphic or vegetational factors are adequately characterised. It is important
to remember that the validity of the conclusions, such as they may be in a multiple regression, is
limited to the range of values measured in X. Extrapolation is a very problematic procedure. Note
that in ecology in particular there is a balance to be achieved here. Over relatively small ranges of
environmental variables relationships may be effectively linear. However, the larger the range of
environmental conditions observed, the more likely non-linearity is going to rear its ugly head.
The measuring of an insufficient range of values for an X variable, also raises the possibility of a
subtle form of the inadequate model problem described above. If insufficient information is
available about an X variable, then its effects cannot be corrected for. This is a similar situation to
not measuring it at all, the result will usually be bias in the β estimates for the other X variables.
c) Outliers.
Multiple regression minimises the sum of squared distances from the surface (predicted values)
defined by the model to the observed values of Y. So any point with a Y value way out of line, an
outlier, can contribute disproportionately to the position of the best fitting surface, i.e. drag it out of
line; and change the β estimates.
d) Distribution of errors.
One of the assumptions of multiple regression concerns the error term ε. These are assumed to be
N(0,σ2) and pairwise uncorrelated. There are therefore three ways in which this assumption can be
violated: the errors may be correlated with each other or the X values; they may not be normally
distributed; they may not have a constant variance.
Serial correlation of the errors.
There are a number of ways in which the errors can be correlated. The commonest is serial
correlation. If the residuals are plotted against the order in which the observations were taken, then
values from neighbouring observations may tend to be correlated. The same problem can emerge in
spatial data where neighbouring data points have correlated errors.
The presence of serial correlation, also called autocorrelation, has several potentially serious effects on an analysis. The parameter estimates remain unbiased, but they are no longer the most precise available.
The estimate of σ2 and the standard errors of the regression coefficients may be seriously
underestimated. The tests will suggest spurious significance and the standard errors will imply a
greater accuracy than actually exists; so the confidence limits will also be invalid.
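A quick graphical check in R (a sketch; fit is a hypothetical fitted lm model and the observations are assumed to be in time order):
plot(residuals(fit), type = "b")   # residuals in observation order; look for runs
acf(residuals(fit))                # autocorrelation function of the residuals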
Non-normality of the errors.
The assumption of normality strictly only affects the validity of the hypothesis tests and related
descriptive statistics ( standard errors, confidence limits etc.). However a distribution that has extra
long tails can result in an increase in high influence points. This could be a nuisance. In general a
moderate deviation from normality is unlikely to cause major difficulties. However major skewness
in particular can lead to problems. The usual way to check the normality of the residuals is to use a
normal probability plot.
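In R (again with a hypothetical fitted model fit) this is simply:
qqnorm(residuals(fit))   # normal probability (Q-Q) plot of the residuals
qqline(residuals(fit))   # reference line; marked curvature suggests non-normality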
Heterogeneity of variance.
It is an assumption of multiple regression that the residuals come from a single normal distribution.
The confidence intervals and hypothesis tests are constructed using an estimate of its variance, the
error mean square. If however this variance of the residuals is not constant then these statistics and
tests will be inaccurate. This phenomenon is called heterogeneity of variance, or sometimes
heteroscedasticity, and a frequent consequence will be that the confidence intervals will be
underestimated and the tests will give spuriously significant results. It comes most commonly when
the variance of the residuals is correlated to some degree with the true value of Y.
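The usual check is a plot of the residuals against the fitted values (a sketch, hypothetical fit):
plot(fitted(fit), residuals(fit))   # a fan or wedge shape suggests heterogeneity of variance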
e) Collinearity.
Probably the best known and one of the most important problems is the existence of strong
correlations and linear dependence among the Xs. If two or more of the Xs are heavily correlated or
one of the Xs is a linear combination of some of the others then we have a problem of too many
variables with too little information to estimate their parameters properly. If we have 4 X variables
and X4 is virtually identical to X3 then, if there is a real relationship between Y and 3 X variables, we
are trying to fit 4 Xs when there is only information for 3. This is going to result in indeterminacy of
the parameter estimates; they will be unstable, very sensitive to sample error.
To explain it in another way, when used appropriately, multiple regression tries to present the effect
of one X on Y all other Xs kept constant. If two Xs covary almost exactly, then there is little or no
information about how one varies independently of the other; the β̂ values are bound to be
unreliable.
There are three main consequences of excessive collinearity:
The standard errors of the β̂ estimates become inflated; as a result the hypothesis tests become less powerful and will tend to fail to reject the null hypothesis of βi=0. This makes the assessment of the relative strengths of the X-Y relationships virtually impossible. A corollary of this phenomenon is that the parameter estimates become much more sensitive to outliers.
The selection of an adequate subset of variables becomes more difficult. Not only
may there be many adequate subsets, but because of linear dependencies there may be models where substituting a group of variables for a single variable gives a major improvement in the fit.
Such models would not be detected by any fitting technique that fits and changes only one variable
at a time, e.g. the stepwise methods. Furthermore, the value of the coefficients can change
dramatically, even changing sign depending on the other variables in the model. Interpreting subset
models in the presence of collinearity is a very chancy business.
The consequences of model misspecification become more marked. The bias that
results from non-linearities and latent, unmeasured, variables can be magnified considerably by
collinearity amongst the Xs. This can result in spuriously significant partial regression or correlation
coefficients.
Generally, therefore, collinearity can introduce major problems for the selection and interpretation
of the regression model.
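One rough way to screen for collinearity in R is the variance inflation factor, obtained by regressing each X on the others (a sketch; x is a hypothetical matrix of the predictors):
vif <- sapply(seq_len(ncol(x)), function(j) {
  r2 <- summary(lm(x[, j] ~ x[, -j]))$r.squared   # how well the other Xs predict this one
  1 / (1 - r2)                                    # large values signal collinearity problems
})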
f) Errors on the Xs.
One of the most often ignored assumptions of multiple regression is that the X variables must be
measured without error. If the data are collected as part of an observational study then this
assumption will not be valid. The error on the Xs need not be strict measurement error (inaccuracies
of the measuring equipment), it might be due to your being unable to measure the real X variable
directly. Sometimes choice of the Y variable as the response is somewhat arbitrary. It might be
equally plausible to use one of the Xs instead; the direction of causality is not clear cut. In such a
case fitting a model to describe the relationship by minimising the deviations of the Y values from
the hyperplane would be arbitrary. If you had minimised the deviations of one of the X variables
from the plane you would have got a different surface. This then is the problem: if you should be
taking variation in the Xs into account when fitting the model, the ordinary least squares multiple
regression will give biased estimates of the βs. These biases can be very large. As a general rule the
bias will be towards zero - the estimates will tend to be "attenuated". However this is not to be relied
on, large significant coefficients in OLS can be the result of error effects. These biases are only a
problem if the regression is being used for system description, to gain insight into the system under
study. If the model is to be used solely for prediction then minimising the error on the variable to be
predicted is still a sensible thing to do, and the ordinary least squares model will still be
appropriate.
Chapter 9. Multivariate regression and constrained
ordinations.
While multiple regression and partial correlation have their uses, this book is about multivariate
techniques. Instead of only one response variable the multivariate situation has many. We have
measured a number of variables so that by investigating them at the same time we can detect some
feature of the system that would not emerge (or not emerge so efficiently) by univariate techniques.
So let us consider the technique that relates a set of Y variables with a set of Xs. In parallel with
univariate regression and correlation, there are two techniques: redundancy analysis and canonical
correlation. MANOVA (Multivariate Analysis of Variance) and Canonical Discriminant Analysis are
special cases of Canonical Correlation, in the same way that ANOVA (Analysis of Variance) is a
special case of multiple regression. However, since all these methods are based on multivariate
regression we shall start there.
Multivariate regression.
In multiple regression we estimated the regression coefficients so that the predicted values of Y are as close as possible (usually least squares) to the observed. To put it into matrix form, we want to estimate the vector β in the model:
ŷ = Xβ
where X is the matrix of X values and y is the vector of y values.
The multivariate case has a matrix of Y values, Y. Since the relations between the separate Y
variables and the Xs are unlikely to be the same, there will have to be a separate vector β for each Y
variable. In other words we want the matrix of β values (B) in the model:
Ŷ = XB.
Though there are elegant matrix solutions to this problem they all boil down to performing a separate multiple regression for each Y variable and pulling the β vectors together as the columns of the matrix (B). These separate univariate multiple regressions also give us the predicted values from the model, the elements of the matrix Ŷ. The multivariate bit starts when you want to calculate tests,
standard errors and confidence intervals for the parameters and fitted values. We shall look at this
problem later in the chapter. For the present we shall concern ourselves primarily with the fitted
values and the parameters. Our problem, one we have met so often in this book, is that there are
simply too many of them, we have measured the relationship between every variable in one set and
every variable in the other set. Clearly one objective of a truly multivariate technique would be to
reduce that number of relationships to a more manageable and more informative scale. This is the
purpose of constrained ordinations and the simplest of these is redundancy analysis (or RA).
Redundancy analysis.
[Flow diagram: centre Y; regress Y on the Xs; get the proportion of variance explained.]
There are a number of ways of making sense of redundancy analysis. The simplest is to consider it as a PCA (Chapter 5) on the fitted values (the Ŷs) from a multivariate regression. As an approach this has
the advantage that it allows us to do an RA without a dedicated program for it. Few general purpose
statistics packages have special programs to do RA. The approach I am about to describe allows you
to do redundancy analysis using just regression and PCA which are universally available. These
fitted values from the regression represent the part of the Y values that are described by the estimated
relationships with the Xs. Since each Y was predicted using its own univariate multiple regression the
maximum proportion of variance (R2) has been explained in each Y. By doing a PCA on them we are
attempting to reduce the totality of the relationships down to a few independent ones. We are
analysing the correlations among the relationships and trying to summarise them. Suppose that the
relationships between most of the Ys and the Xs were essentially similar, then this could be more
efficiently summarised by one relationship. This is the aim of RA to summarise the univariate
relationships into fewer multivariate ones.
The scores on the first PC are called the first redundancy variate and the associated eigenvalue tells
us how much of the total relationship is summarised by this first synthetic relationship. The second
PC is called (surprise, surprise) the second redundancy variate and so on. It is important to note that
unlike the orthodox PCA, this PCA is not analysing the covariance matrix of the Y variables but that of the fitted values Ŷ. The total variation of the Y variables was first split into fitted values and
residuals, and then the variance (and covariance) of the fitted values (the variance explained by the
regression) is decomposed into PCs. This reminds us that there is also residual variation that ought to be investigated, something that is seldom, if ever, done. However, to return to the redundancy analysis: how do we discover which variables the new relationship actually involves?
In fact statistics packages that do redundancy analysis tend to do it in one hit as an eigenanalysis of a
large compound matrix (it all works out algebraically equivalent) so the terminology can be full of
eigenvalues and eigenvectors etc. but they all boil down to amounts of variance explained and
regression coefficients.
Programming Notes
R: First do the multivariate regression.
# ydata and xdata are hypothetical data frames holding the Y (response) and X (predictor) variables
y <- as.matrix(ydata)
x <- as.matrix(xdata)
# lm() with a matrix response fits a separate multiple regression to each column of y
redund <- lm(y ~ x)
summary(redund)
To get the proportion of the total of the Y variable variance explained by the regression simply sum
the variances of the fitted values (the Ŷs), then divide by the sum of the variances of the Y variables.
This will give the same results as taking the sums of squares of the fitted values (summed over
variables), and expressing it as a proportion of the total sums of squares of the Y values (summed
over variables) - i.e. a sort of overall R2.
totalredundancy <- sum(diag(var(redund$fitted.values))) / sum(diag(var(y)))
Now do the principal component on the fitted values
rv<-princomp(redund$fitted.values)
The proportion of the fitted value variation explained by each of the PCs is given by the eigenvalues. Their total (= the trace of the SSCP matrix of Ŷ - the sum of the sums of squares of the Ŷs, again easily calculated) is, in the technical literature (especially the psychometric), called the total redundancy. This is the variation that is shared between the Xs and the Ys, and is therefore referred to as redundancy, hence redundancy analysis. Of the total redundancy a certain proportion is explained
by the first redundancy variate (given by the first eigenvalue in the PC), a certain proportion by the
second and so on. It is therefore important and informative to know:
a) How much of the total variance of the Ys is explained by the Xs - the total redundancy? -Do
the Xs tell us anything useful about the Ys?
b) What percentage of this shared variation is attributable to the first, second, etc. redundancy
variate? - Is RV1 important relative to the other RVs?
c) What percentage of the total variation in the Ys is explained by the first, second etc. redundancy
variate? Is RV1 important relative to the Y variation?
These values let us know if these relationships are worth interpreting.
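Continuing the R sketch above (with y, redund, totalredundancy and rv as defined in the Programming Notes), these quantities can be approximated as:
# (a) is the object totalredundancy computed earlier
# (b) proportion of the shared (fitted) variation attributable to each redundancy variate
prop.of.fit <- rv$sdev^2 / sum(rv$sdev^2)
# (c) proportion of the total Y variation explained by each redundancy variate
prop.of.total <- prop.of.fit * totalredundancy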
Interpreting the redundancy variates
If we take the first redundancy variate we are suggesting that there will be some combination of the Y
variables that are associated (linearly, always remember - linearly) with some group of X variables.
So, which variables? The Y variables are simple, look at the elements of the eigenvectors or the
correlations between the PC scores (called redundancy variates for the X variables, or just
redundancy variates or RVs) and the fitted Y variables both of which will usually be given by the
PCA program. It is worth reminding ourselves what these numbers represent. The coefficients of the
first eigenvector are like partial regression coefficients. They therefore represent the relationship of
each Y variable to the PC - keeping all of the other Y variables constant. The other Y variables have
been partialled out. When we are dealing with clear response variables then there seems little point in
trying to partial out their influence from each other. They are all simply "responding" to the Xs (I
know they're not really, not in observational studies anyway, but that's where the interesting
hypotheses usually come from don't they?). So, for the Y variables it seems to make sense to interpret
the correlations (called redundancy correlations for the Y variables for redundancy variate 1)
between the Y values and the PC score (redundancy variate). If a response variable has a high
correlation with the redundancy variate then the relationship between the two sets of variables (the
Xs and the Ys) clearly involves this one. Note that the correlations and coefficients are between the
fitted values of Y and the redundancy variate; the relationship between the observed values and the
redundancy variate we'll look at later (it's just as important - possibly more).
We have now discovered (we hope) which response variables are involved in this summary
relationship. Which predictor variables are? This is once again a simple matter, just regress (using
multiple regression) the redundancy variate onto the Xs. The partial regression coefficients
(redundancy coefficients for the X variables from redundancy variate 1) are simply telling you the
influence of each X on the relationship - keeping the other Xs constant. We have now identified
which X variables are involved in this relationship with the set of Ys we identified earlier. You can of
course calculate the redundancy correlations for the X variables from redundancy variate 1 if you like
but I feel that we usually want the partial coefficients so we can dissect apart the influence of the
predictor variables.
There is however one more consideration. We have the relationship and we know which variables are
involved but how important is the relationship to the Ys overall? and to which Ys in particular? Just
because we have found a relationship (even the best relationship) does not mean that it is important
in explaining why the response variables have the values they do. To do this we must relate the
observed values of Y to the redundancy variate. Simply calculating the correlation coefficient
between the individual Ys and the redundancy variate will tell us which of the response variables are
well described by the relationship; indeed, squaring the correlation coefficients tells us what proportion of the variance of each Y variable is explained by this relationship (in the same way the R2s from the original Y on X regressions told us how much of each Y variable could be predicted (explained) by all the Xs).
So we have extracted a linear relationship between the Y variables and the Xs and identified which Ys
are associated with it and which Xs are responsible for it (the (highly informal) implication is usually
of causality). We have also established whether this relationship actually has any importance in
explaining the variation in Y.
Having done this for the first of the relationships we would now move on to the second PC
(redundancy variate) and if it seemed important enough we would do the same there.
[Scatter plot of the observations on the first two redundancy variates, RV1 (horizontal axis) and RV2 (vertical axis), with the points labelled W and M by site.]
Figure 9.2. Redundancy Analysis on log(X+1) transformed zooplankton.
As one might guess, since we are doing a PCA on the fitted Y values we can have a reduced space
plot that displays the major trends in the Y-X relationships (in so far as our linear model has been
able to describe them). This can be interpreted, and the relationships clarified, by adding variables
(the Xs or the Ys or even other ones you didn't use) into the reduced space plot e.g. as bubble plots.
Of course the down side is that you could end up with a lot of plots, but given that computers can
produce such plots so easily it is worth looking at them.
EXAMPLE 9.1.
If we use the log(X +1) transformed zooplankton data again we can try to relate them to the physical
variables used in the regression section. The regressions explain 42% of the total variation in the
species. The first 2 RVs explain 71% of the variation in the fitted values (i.e. of the 9 fitted relationships - the RVs - the first 2 explained 71% of the total - that looks pretty good!). We are brought
down to earth when we realise that this only represents 30% of the variation in the species! We may
well have identified some important relationships but there is clearly much more going on than we
can describe using linear models. (Hardly a surprise really, but you'd be surprised how many workers
think that straight lines should be adequate.)
Programming notes.
Each PC is identifying a linear relationship that is shared across a number of Y variables. Now we have to identify the relationship in terms of which X variables are important. So now regress the PC on the Xs, e.g. with PC1 (redundancy variate 1):
rvcoeff1 <- lm(rv$scores[, 1] ~ x)
The coefficients are the redundancy coefficients. Standardised coefficients can be obtained by standardising the X variables first. Structure (correlation) coefficients between the Y variables and the RV can be got from their simple correlations with RV1.
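For example (a sketch, continuing with the x, y and rv objects defined above):
rvcoeff1.std <- lm(rv$scores[, 1] ~ scale(x))   # redundancy coefficients with standardised Xs
struct1 <- cor(y, rv$scores[, 1])               # structure (correlation) coefficients for RV1
struct1^2        # proportion of the variance of each Y explained by RV1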
Canonical Correlation.
Canonical correlation is a further generalisation of multivariate regression that detects, estimates and
displays linear relationships between two sets of variables. It bears the same relationship to
redundancy analysis as correlation does to regression. In regression there is a clear Y variable to be
predicted by the X variable, in redundancy analysis there is a Y variable set that is to be predicted
from the X set. In correlation there is no such distinction into independent and dependent variable
sets, there is simply a Y1 set and Y2 set. The aim is to measure the intensity and direction of the
symmetric relationship. So it is with Canonical Correlation, the aim is to describe the pattern of
symmetric relationships between two sets of variables. There is no question of one set being used to
explain the other, neither set should be thought of as dependent or independent.
The relation to redundancy analysis can be further clarified (I hope) by pointing out that estimating
the regression relationship can be done by standardising the X variable to unit variance and then
calculating the covariance between the X and Y variables. The resulting number is the regression
coefficient with X in standard deviation units. The correlation coefficient can be estimated by the
covariance when both X and Y have been standardised. Thus both regression and correlation can be
thought of as starting from a covariance but in regression only the independent variable has been
standardised, while in correlation both have. Thus if you do a regression when both variables have been
standardised the regression coefficient is the correlation coefficient.
Similarly, if we do a redundancy analysis after standardising (a genuinely multivariate standardisation using inverse variance-covariance matrices) both sets of variables, we are actually doing a canonical correlation. The univariate identity underlying this is:
β̂ for the standardised X = Σ[(X - X̄)/s_x](Y - Ȳ) / Σ[(X - X̄)/s_x]² = Σ[(X - X̄)/s_x](Y - Ȳ)/(n - 1) = cov((X - X̄)/s_x, Y) = s_x β̂Y|X
This standardisation is not simply giving the variables unit variance, you must give them
(the Xs and the Ys) identity variance covariance matrices. This is actually not the sensible way to do a
canonical correlation, that standardisation is not usually simple to do in standard statistics packages.
However, that is not a problem since most such packages have procedures to do canonical
correlation. In summary therefore both redundancy analysis and canonical correlation differ only in
the standardisation of the Y variables. They are both PCAs on the fitted values from a multivariate
regression. They both share essentially the same problems (those of regression in general). Noting
that they are essentially the same technique offers the potential for other useful techniques based on
different standardisations and/or weighting or even different forms of the multivariate regression or a
different dimension reducing technique instead of PCA. The main practical distinction between the
methods is that RA is a regression technique while canonical correlation is (obviously) a correlative
one.
One consequence which can be important is that in order to standardise the Y variables in Canonical
correlation we have to estimate the inverse of their variance covariance matrix. We standardise by
"dividing" by a square root of the variance covariance matrix - analogous to the univariate case
where we divide by the standard deviation. This in turn means that, if there are more Y variables than
there are observations, this matrix will be singular (no inverse) and this standardisation is impossible.
This restriction of course does not apply to redundancy analysis.
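In R the base function cancor() will do the calculations; a minimal sketch (x and y as before, the two variable sets as matrices):
cc <- cancor(x, y)      # canonical correlation analysis of the two sets
cc$cor                  # the canonical correlation coefficients
# scores on the first pair of canonical variates
cvx1 <- scale(x, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1]
cvy1 <- scale(y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1]
cor(cvx1, cvy1)         # should match cc$cor[1] (up to sign)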
An alternative informal explanation.
Canonical correlation identifies the linear combination of one set of p variables (say the Ys) that has
the highest correlation with a linear combination of the other set of q variables (the Xs). These linear
combinations are called canonical variates (CVs). The correlation coefficient is called the first
canonical correlation coefficient. Because there is more than one variable in each set, there can be
more than one relationship – more than one canonical correlation coefficient. At the extreme, there
could be as many as there are variables in the smaller set, min(p,q).
Imagine the data as two clouds of points in separate spaces (e.g. one defined by biological variables
and one by physical). Both sets of variables have been standardised to zero mean and unit variance.
Like PCA, Canonical correlation rotates the axes within the spaces but does it so that scores on the
new axes (canonical variates) in one space are maximally correlated with those on the axes in the
other space. The rotation need not be rigid; the new axes within a space need not be orthogonal. The
scores on the axes must be uncorrelated, but the axes need not be at right angles in the old space. It is
a subtle distinction, and not one to worry about. They are only at right angles in the space formed
after the full multivariate transformation to identity variance covariance matrices.
[Flow diagram, Step 1 (the Ys): regress X on the Ys; get the proportion of variance explained; get the X-hats (predicted values); do a PCA on them; the first PC, a linear combination Σδj Yj, is CV1 of the Ys.]
Interpretation.
Since the standard statistics packages all offer procedures to do canonical correlation it will come as
no surprise that they tend to be generous in their output. Since the jargon associated with this output
tends to be rather dense (I blame the psychometricians) I shall treat it in some detail. Also many of
these statistics can be calculated to assist in the interpretation of redundancy and canonical
correspondence analysis.
i) Goodness-of-fit.
Before any attempt is made to interpret the results of a canonical correlation, it is important to check
that it is going to be worth the effort. In principal components and the other ordinations there is
always some measure of adequacy or goodness-of-fit; in multiple regression there is R2, the
coefficient of determination. But in canonical correlation there are a number of measures that must
all be considered.
The canonical correlations.
Despite their prominence in any description of the technique, and in the output from statistical
packages, the canonical correlations (ris) are not the most important measure of the adequacy of the
canonical variates. At first sight they seem to be analogous to the coefficient of multiple correlation,
R, in multiple regression. However, though they describe the correlation between a linear
combination of the X variables and a linear combination of the Ys, importantly they say nothing
about the amount of variation of the Xs, or the Ys, described by the relationship. It is quite possible
to have a very large, significant, canonical correlation whose associated canonical variates explain
only trivial amounts of the variation in X or Y; the relationship is real but of no biological
importance. Clearly the analogy with R is not exact. R not only describes the strength of the
relationship between Y and the linear combination of the Xs, but also gives (R2) how much of the
variation in Y is summarised by the Xs (i.e. how important or interesting the relationship might be).
Significant, large, canonical correlations are a necessary but not sufficient condition for the
interpretation of the canonical variates. Most statistical texts and packages give significance tests to
identify the number of statistically significant canonical variates. These are usually based on the
likelihood ratio test (Wilks' Λ) described later. This test is performed in a slightly unusual way and
care should be taken in its interpretation. The H0 for the test on r1 is not that ρ1=0; it is that
ρ1=ρ2=ρ3=...ρmin(p,q)=0. That associated with r2 is ρ2=ρ3=...ρmin(p,q)=0. This test cannot test directly for
the statistical significance of the ris; it is really designed to determine the dimensionality of the space
spanned by the significant canonical variates.
EXAMPLE. 9.2.
Let us return to the zooplankton example and become pedantic over the distinction between
regression and correlation. In an observational study such as this we might very well argue that we
cannot strictly claim that the biological variables are dependent and the physicals are independent.
We might feel happier looking for symmetric correlations rather than the asymmetric: biological
predicted from physical. My personal feeling is that in this situation that would be excessively picky,
however it allows me to use the data set yet again, and see if it gives anything new. So we will use
Canonical Correlation on the log(X+1) transformed species data to identify correlational linear
relationships between the organisms and the physical variables. It is clearly asking a lot of the data,
there are 22 variables but only 60 observations. To do the calculations we need the variance
covariance matrices for the Xs and the Ys: SXX, SYY; and the matrix of covariances between the two
sets: SXY. These matrices contain no less than 253 different variances and covariances - to be
estimated by only 60 observations! Even so, I expect to get useful results, but the interpretation will
probably be fairly tentative.
The first result we might look at is the significance test for H0: no linear relationships between the
two sets of variables. Pillai's trace (supposedly the most robust to non-normality -section 12.1.1) had
a value of 3.074 and p<0.0001 (from the F approximation). Having established that there may be
relationships between the physical and biological sets of variables, we now look at the table of
canonical correlations. If there are no significant correlations then it is probably not worth investing
much time in going further. It would not mean that there were no relationships, not even that there
were no important or interesting ones. It might mean that there were no linear ones, but strong non-
linear relationships could exist in the data and not be detected. (It could of course just mean that the
data set is too small). Table 9.1 gives the canonical correlations, and the Wilks' Λ and p values for
the test that the current correlation and all the subsequent ones are equal to zero.
Table 9.1 Canonical correlations and test statistics.
Canonical correlation    Wilks' Λ    p value
0.912 0.008 0.0001
0.824 0.045 0.0009
0.715 0.142 0.1148
0.605 0.290 0.5321
0.537 0.458 0.7965
0.492 0.644 0.9341
0.342 0.849 0.9961
0.174 0.962 0.9996
0.090 0.992 0.9957
The results clearly suggest that there are at least 2 possible relationships worth further investigation.
Of course, as I explained earlier, they could be biologically trivial, but it is a useful start to the
analysis.
The next move is to look at the total redundancy of the biological variables (accumulated over all the
canonical variates) to see if the physical variables explain much of the biological variation: 40% of
the total variance of the biological variables is associated with the physical variables (compare that
with the 42% with RA). If we now use Miller's test of the H0: redundancy = 0, we get an approximate
F value of 3.70 (d.f. 117, 650), p<0.0001, so we can reject the H0. The physical variables seem to
explain more variance of the biological ones than would be expected due to chance - thank goodness.
ii) Adequacy of individual canonical variates.
Like PCA and the other ordination methods the final choice on which canonical variates to use or
interpret is the responsibility of the worker. There are tests and measures to help but the final
criterion should probably be: Is this canonical variable interpretable and interesting?
Redundancies for individual canonical variates.
While the canonical correlations describe how tight the relationship is between the two linear
combinations (canonical variates), the more important information is how biologically significant it might be: how much variation in a variable set is explained by this particular linear combination of
the other variables? There are two separate measures that address different aspects of this problem:
the first is an aid to the direct interpretation of the canonical variates; the second measures how
useful the new axes are for representing the data cloud in the space defined by their own variable set.
The descriptive power of the ith canonical variate from one variable set is measured by the
proportion of variance of the other set it describes, a redundancy. The redundancy (Di2) in the Ys
associated with the ith canonical variate from the Xs is commonly part of the output. If necessary
they can be calculated by regressing the Y variables separately onto the ith canonical variate and
getting the average r2. Personally, being a purist when it costs nothing I would do it by adding up all
the regression sums of squares, then all the total sums of squares, and then seeing what proportion of
the total was explained by the regressions (the ratio). It will give much the same result but will be
more appropriate. The corresponding redundancy of the X variables given the Ys (Ei2) is calculated
in the same ways.
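A sketch of the pooled sums of squares calculation in R (assuming y is the matrix of Y variables and cvx is a hypothetical vector holding the scores on the ith canonical variate of the Xs):
fit.cv <- lm(y ~ cvx)                          # a simple regression for each Y column
ss.res <- colSums(residuals(fit.cv)^2)         # residual sums of squares
ss.tot <- colSums(scale(y, scale = FALSE)^2)   # total (centred) sums of squares
1 - sum(ss.res) / sum(ss.tot)                  # pooled proportion explained: the redundancy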
These are the most important statistics to examine when choosing which relationships (canonical
variates) to interpret. However it would be silly to ignore any potential asymmetry in the
relationships being described. For example if a set of biological variables is well described by a
linear combination of physical variables, that is interesting. The reverse will usually be less so.
Generally, only canonical variates that explain a relatively large proportion of the variation in the
other variable set should be retained for interpretation; though if one of the rejected ones looks
interesting it might be put on one side for consideration. Nearly always, a large redundancy is
associated with a large canonical correlation, though a large canonical correlation is no guarantee of
a large redundancy. It is quite possible to have a tight relationship between the canonical variates of
the two variable sets which explains little variation in either.
The proportion of variance (redundancy) of one variable set explained by canonical variates of the
same variable set can also be calculated. This redundancy in the Y set due to the ith canonical variate
from the Ys is calculated as above but using the ith CV score from the Y set. The redundancy (Gi2) of
the Xs due to the ith canonical variate of the same set is calculated in the same way.
These redundancies are important because the canonical variates of a variable set are the old axes
rotated and rescaled; so they, like the components in PCA, can be regarded as partitioning the
variance of the original variables. This is particularly useful when plotting the data points in a
reduced space. The proportion of the variance (redundancy) summarised by the reduced set of axes
(canonical variates) is a measure of the adequacy of the reduced space for reproducing the data
cloud. Such reduced space plots can be useful in interpreting the relationships between the variable
sets.
EXAMPLE 9.3.
Since in this example we are primarily interested in how much of the variation in the animal
abundances is associated with the environmental variables, we go direct to the proportions of the
biological variables associated with the environmental canonical variates (Table 9.2). As we saw in
Example 9.1, 40% of the variance of the biological variables is associated with the environmental
ones. However, the first 2 canonical variates explain in total only 25.8% of the biological variance
(compared with 30% with redundancy analysis on the same data). Our relationships may be real, they
are even statistically significant, but they are not explaining a very large proportion of the total
variance. This of course does not mean that the animals are not responding much to the physical
environment. The true relationships may be non-linear or there may be other important factors
besides the ones we have measured, and sampling error is always going to be large in studies like
this. A reassuring feature of this table is the drop-off between CV2 and CV3. While the CV2 of the
environmental variables explains 11% of the biological variation, the trend identified by CV3
explains only 4.8%. This clearly supports our interpreting only the first two CVs.
Table 9.2. Proportions of the variance of the biological variables (redundancy) explained
by the environmental CVs.
Canonical variate    Proportion    Cumulative proportion
1                    0.149         0.149
2                    0.110         0.258
3                    0.048         0.309
4                    0.029         0.338
5                    0.026         0.362
6                    0.023         0.385
7                    0.012         0.396
8                    0.003         0.399
9                    0.001         0.400
Figure 9.3 is a reduced space plot of the observations in environment space (i.e. using the
environmental canonical variates). The plot on the biological canonical variates would be very
similar; after all, the canonical variates are chosen so that the observations' scores on the biological
CVs correlate maximally with those on the environmental ones.
Table 9.3 gives the proportion of the environmental variables explained by the environmental
canonical variates.
Table 9.3. Proportions of the variance of the environmental variables (redundancy)
explained by their own CVs.
Canonical variate    Proportion    Cumulative proportion
1                    0.203         0.203
2                    0.210         0.413
3                    0.052         0.466
4                    0.045         0.511
5                    0.057         0.568
6                    0.103         0.671
7                    0.044         0.718
8                    0.071         0.787
9                    0.044         0.830
The most important feature of this table is that the first 2 canonical variates summarise 41.3% of the
variation in the environmental variables. We can use this figure as a measure of goodness of fit for
the reduced space plot (Figure 9.3) - the first 2 CVs display 41.3% of the total variation in the
environmental variable set. It also implies that the two gradients the method has picked up are fairly
major features of the data.
Figure 9.3 shows that canonical variate 1 appears to be associated with a gradient over the Whau
creek observations. CV2, however, shows a clear split of the Mangere observations from the Whau
creek ones (though if you look closely there is a Whau creek interloper lurking in the Mangere
crowd). Obviously the next question is: What are these gradients and which species respond to them?
[Character plot: sites on CV1 (horizontal, about -2 to 4) against CV2 (vertical, about -3 to 1.5);
Mangere sites (M) lie at high CV2 values, Whau Creek sites (W) at lower CV2 values, spread along CV1.]
Figure 9.3. The sites in the space defined by the canonical variates of the physical
variables. It is clearly extremely similar to the corresponding ordination from redundancy
analysis. The separation of the Mangere and Whau samples is nearly complete, and the
gradient in the Whau samples between the inshore Gladioferens containing samples and the
offshore Oikopleura containing ones is clear.
When interpreting the role of a variable, the most important decision is whether you should partial out
the influence of the other variables in its set. If the variable can be regarded as a response then there
may be little point in partialling out the effects of the other response variables. Thus in this data set the
animal abundances could be regarded as responses to the physical environment, so we are likely to
be primarily interested in how the species vary together in response; the change in one species with
the other species held constant (partialling out their influence) is unlikely to be of interest. On the
other hand, we might well be interested in the role of one physical variable with the others held
constant, in which case we must partial the others out. The decision on whether to use partial or
simple relationships is an important one, and it partially resolves a long-standing controversy over
whether to use the canonical coefficients (partialled) or the structure coefficients (simple intraset
correlations).
The structure coefficients (the intraset correlations between each variable and the canonical variates
of its own set) provide different information to the canonical coefficients; they do not remove the
influence of the other variables. They are to the canonical coefficients what bivariate correlations are
to partial regression coefficients. They are therefore appropriate when you do not wish to partial out
the influence of the other variables. For example, if you wanted to interpret the canonical variates as
entities, then knowing which variables are associated with large or small values of a canonical variate
would often be more informative than the strict 'contribution' of a variable to the canonical variate
(holding the other variables constant).
If you are interpreting them because the canonical coefficients might be unstable, then you must
remember that a variable can appear important purely because of correlations among the variables.
Such correlations can often be identified by comparing the structure coefficients with the
corresponding canonical coefficients (where the influence of the other variables has been partialled
out). For example, any conflict in sign may suggest a spurious correlation. Unfortunately, such a
conflict might equally be the result of a spurious canonical coefficient due to one of the problems
mentioned earlier: missing variables, collinearity etc. If the SE on the canonical coefficient is large, or
absent (and the data have not been thoroughly checked for collinearity etc.), then the safest policy is
to avoid interpreting that variable unless there is a biological reason for including it - e.g. it enhances
the interpretation of the canonical variate as a whole - in which case the conflict should be reported if
the analysis is written up.
c) Interset coefficients.
While the relationship between a CV and its own variables is undoubtedly useful, sometimes it may
be just as useful to look at the relationship between a CV for one set and the variables of the other.
For example a response CV (say a combination of animal abundances) might be interpreted by
looking at its relationship with the individual physical variables. In this case we would be interested
in the influence of a physical variable in isolation, after partialling out the influence of the others; we
would therefore use the interset canonical coefficients. These are the partial regression coefficients
from a multiple regression of the CV for the animals on the physical variables.
On the other hand, if one of the variable sets is a response set, we might wish to see how the response
variables relate to one of the CVs of the other set. In this case we do not want to isolate the response
variables from each other, so there is no point in using partialled coefficients.
would therefore use the interset correlations. These are the correlation coefficients between the
variables of one set and the canonical variates of the other. Thompson (1984) for no obvious reason
calls them index coefficients. They show which variables in one set are associated with the canonical
variates of the other set. In the animal abundance example they would identify which species have
changes associated with changes in the CV for the physical variables.
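As a minimal sketch (assuming Y is the matrix of animal abundances and cvx.scores holds the observations' scores on the physical canonical variates - hypothetical names), the interset correlations are simply:
round(cor(Y, cvx.scores[, 1:2]), 3)    # one row per species, one column per physical CV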
d) Use of external information.
Very often a worker will have extra information about the sampling units that can be used to help
the interpretation. It could be position in space or time, state of the tide, sex, membership of a group,
or any other variable that was not included in the analysis. If the canonical variate scores for each
observation are plotted (or mapped) against the available information, useful patterns often emerge.
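For example (a sketch, assuming cvx.scores holds the canonical variate scores and tide is a factor recorded for each sampling unit but not used in the analysis):
boxplot(cvx.scores[, 1] ~ tide, xlab = "State of tide", ylab = "Score on CV1")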
v) Reduced space plots.
The interpretation of the relationships can be enhanced if you examine the patterns the sampling
units make when their scores are plotted on a reduced set of canonical variate axes for a variable set.
These plots differ from PCA plots in two main ways:
a) The new axes within a data cloud are not chosen to maximise an internal criterion -
variance explained; they maximise an external one - the correlation with an axis in the other data set.
The canonical variate axes and the component axes may sometimes be close but will never coincide.
b) The distances are not Euclidean, they are Mahalanobis D; the canonical variates are
scaled to have unit variance and be uncorrelated, so the data cloud is more spherical.
AUTHOR'S CHOICE.
The reduced space plots (perhaps using brush-and-spin) can nearly always help you to interpret the
relationships. Use the canonical coefficients and interset correlations only after deciding whether you
need simple or partial coefficients. When in doubt use the correlations, and interpret them as
associations with the trends.
EXAMPLE 9.4
Because the analysis of the plankton data is clearly asymmetric, we would normally use the interset
correlations to interpret the patterns in the plankton that are associated with the physical variables,
and the canonical coefficients for the environmental variables (since we want to partial out the
influences of the other environmental variables). The interset correlations for the biological variables
on the first 2 environmental CVs are in Table 9.4.
Table 9.4. Interset correlations between the physical CVs identifying the physical
gradients and the biological variables.
Species CV1 CV2
ACARTIA -0.1545 0.1377
EUTERP 0.3164 0.4078
GLADIO -0.2516 -0.6164
HARPACT 0.0751 0.2796
OITHONA -0.3983 -0.0593
PARACAL 0.0038 -0.3722
TEMORA 0.3677 -0.2945
FAVELLA -0.4776 -0.0357
OIKOPL 0.7900 -0.3384
The dominant environmentally associated trend among the plankton (CV1) is a contrast between
Oikopleura and Favella (though Temora may be associated with Oikopleura, and Oithona with
Favella).
The corresponding canonical coefficients for the environmental variables are in Table 9.5. They show
that, all else being equal, the plankton positively associated with CV1 seem to be responding to clear
water (Secchi disk) and low sewage concentrations (FCOLI). High densities of Oikopleura (and
Temora) seem to be associated with clean water, while Favella (and Oithona) are associated with sewage.
Table 9.5. Standardised canonical coefficients relating the physical variables to their
CVs. These are partial regression coefficients.
Variable    CV1       CV2
DO 0.153 -0.112
BOD 0.277 -0.013
CHLA 0.070 0.343
TIDE 0.101 0.292
PO4 0.014 0.323
NO3 -0.097 0.040
NH3 0.160 0.182
FCOLI -0.340 0.243
SS 0.036 -0.173
SEC 0.553 0.252
TEMP -0.139 -0.151
SAL 0.180 0.386
PH 0.248 -0.292
If we were to attempt to interpret the environmental gradient represented by the first CV, we might
look at which variables are associated with it - the intraset correlations. These are presented in Table
9.6.
Table 9.6. Intraset correlations for the physical variables. These are simple bivariate
correlations between the variable and the CV scores.
Variable    CV1       CV2
DO 0.710 -0.083
BOD -0.033 0.373
CHLA -0.298 0.591
TIDE 0.223 0.429
PO4 -0.253 0.746
NO3 -0.241 0.608
NH3 -0.223 0.788
FCOLI -0.618 0.519
SS -0.306 0.153
SEC 0.834 -0.133
TEMP -0.172 -0.278
SAL 0.637 0.158
PH 0.466 -0.305
It is clear that one end of the gradient is associated with clear (SEC), high-salinity (SAL), high-pH
water, with relatively high dissolved oxygen (DO), but low concentrations of fresh sewage. This is
the water that Oikopleura seems to inhabit. The other end has high values of sewage but low values
of salinity etc. It seems to be a clear contrast between clean offshore and polluted inshore water.
From Figure 9.3 we saw that this gradient is only really apparent at Whau Creek, a typical
urban estuary.
The second gradient seems to be largely associated with Gladioferens. Using the partialled canonical
coefficients we can say that, all else being equal, high chlorophyll 'a' (CHLA) is associated with low
numbers of Gladioferens, as are high tide, phosphate concentration and salinity, and low pH. Using the
intraset correlations suggests that the gradient is associated with high nutrient enrichment (high
phosphate, nitrate and ammoniacal nitrogen), possibly associated with sewage (FCOLI), high algal
concentrations (CHLA), and high tide. Reassuringly, these are all things that our investigations
with multiple regression in Chapter 8 also identified. Interestingly, there is some suggestion that
Euterpina likes the conditions that Gladioferens doesn't.
The constrained ordination techniques described in this chapter are simply regressions
combined with a PCA. As a result they are all subject to the problems listed in Chapter 8.
Figure 9.4. Canonical correlation. Bubble plots of some physical variables (Secchi disk water clarity,
log(sewage bacteria), chlorophyll a) in the space of the physical CVs (CV1 against CV2).
Figure 9.5. Canonical correlation. Bubble plots for the zooplankton species (Oikopleura, Favella,
Gladioferens) in the space of the physical CVs (CV1 against CV2).
Collinearity is not the only cause of inflated sampling errors on the canonical coefficients. If two of
the eigenvalues (ri^2) are equal, the direction of the eigenvectors is indeterminate. This is similar
to the sphericity problem in PCA. If this occurs, the simplest solution is to abandon both
eigenvectors and attempt an interpretation of the joint space spanned by the associated canonical
variate axes. These axes are attempting to describe a single relationship: not between two axes (one
in each data cloud) but between two planar subspaces (one in each cloud).
Outliers.
Like multiple regression, constrained ordinations are very sensitive to outliers. For example, it is
possible to have a statistically significant canonical correlation determined almost entirely by a single
outlying observation (a high influence point).
In canonical correspondence analysis and redundancy analysis outliers are best detected by looking
at the residuals (see below).
If outliers are found, go back to the original records (field book, lab. notes) and check that they have
been correctly entered in the analysis. If they have been, take note of their values and anything else
that is known about them; this information may contain the seed of a new project.
If you drop one or more points, it should be clearly stated in any report of the analysis. You should
also describe the effects the removal had on the results of the analysis, so that readers can make up
their own minds whether it was justified.
Non-linearity.
One of the chief criticisms made of canonical correlation and redundancy analysis is that they only
look for linear relationships. Nature, it is pointed out, abhors a straight line. This is largely true, but
it is often still possible to make canonical correlation useful.
If there is non-linearity in the data it can be difficult to detect. A useful, if tedious, first step is to
examine all the possible bivariate scatter plots within and between the variable sets. This checks that
the correlation coefficients adequately describe the bivariate relationships.
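A sketch of this check, assuming X and Y are data frames holding the two variable sets (hypothetical names); with many variables the plot matrix will be large, which is rather the point:
pairs(cbind(X, Y), gap = 0, cex = 0.5)    # all within- and between-set scatter plots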
Checking the residuals.
The simplest check is to ordinate the residuals from the multivariate regression and look for pattern.
Another is to use the multivariate normal probability plot described in section 3.3. Outliers should
stand out clearly. Alternatively you can look for non-linearity and other lack of fit by plotting the
dominant PCs of the residuals against the PCs of the fitted values (i.e. the redundancy or canonical
variates) and looking for systematic variation. Of course plotting the residual variates against the
original response variables could also be useful, but I wouldn't do this (too many plots) unless I had
already established that there was something wrong and had a shrewd idea where to look.
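A sketch of these checks, assuming Y is the response matrix and X a data frame of predictors (hypothetical names):
fit = lm(as.matrix(Y) ~ ., data = X)       # the multivariate regression
res = residuals(fit)                       # one column of residuals per Y variable
pc.res = prcomp(res)                       # ordination (PCA) of the residuals
pc.fit = prcomp(fitted(fit))               # PCA of the fitted values
plot(pc.res$x[, 1:2])                      # look for outliers and pattern
plot(pc.fit$x[, 1], pc.res$x[, 1],         # lack-of-fit check: fitted PC1 vs residual PC1
     xlab = "PC1 of fitted values", ylab = "PC1 of residuals")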
EXAMPLE 9.5.
The probability plot is shown in Figure 9.6. There are clearly two outliers, sufficiently extreme that
they may be having some effect.
The simplest way to find out is of course to remove them and see if anything does change. First we
examine them and see if they might be the result of a mistake or are genuine data points. The reason
for the most extreme point (unit 2) is easily found: this sample is the only one with no Oithona. This
is rare in samples at this time of year from either of the sites, but it is possible. We might suspect that
the poor devil counting these animals (a tedious and thankless task) had simply failed to record the
number - but we cannot assume it. The second outlier (unit 34) is more difficult; the largest residuals
seem to indicate that the model does not expect to find so many Harpacticoids or Paracalanus.
Just to see if these outliers made any difference I repeated the redundancy analysis without them. The
main trends were still present but the remarkable separation of Whau and Mangere sites in the earlier
figures is no longer perfect, with a few of the units swapping between the two groups.
[Plot of observed D2 values (vertical axis, 0-300) against expected values (horizontal axis, 0-40);
two points stand well clear of the rest.]
Figure 9.6. Multivariate normal probability plot (section 3.3) of the residuals from the redundancy
analysis of the plankton data. Note the two obvious outliers.
Chapter 10. Canonical Discriminant Analysis and MANOVA
Multivariate Analysis of Variance - MANOVA.
Few multivariate techniques labour under so many names as Canonical Discriminant Analysis
(CDA). It is called Discriminant Analysis, Multiple Discriminant Analysis, Canonical Coordinates or
Canonical Variates Analysis. The common elements of the names betray its close relationships with
Canonical Correlation on the one hand and Discriminant Function Analysis (a different technique
not covered in this course) on the other.
In fact, though it is usually explained as a separate technique, it is simply canonical correlation
relating the response variables to a matrix of dummy variables (the design matrix) that specifies the
ANOVA design, i.e. the parameters to be estimated in that particular design: simple treatment effects,
block effects, interaction terms etc. CDA is most often used with simple one-way designs (more
complex designs are discussed later). In this case the fitted values from the model are the sample
(treatment) means, so the PCA on them displays the differences between the means. It is however
important to realise that the space in which the display is done is standardised (this is canonical
correlation). This has important consequences for interpretation, as we shall see later.
In ANOVA regression is used to separate the total variation into hypothesis (treatment) and error
sums of squares. In MANOVA the decomposition is of the total sums of squares and crossproduct
matrix (S) into hypothesis (H) and error (E) matrices. The hypothesis SSCP matrix is simply the
SSCP matrix of the fitted values and since those fitted values are group or treatment means then it
summarises the between centroid variation. We could therefore just do a PCA on the matrix H, a
simple redundancy analysis with dummy variables (I will describe its advantages later), but this
could be misleading. This is clearest in a univariate example (Figure 10.1 a and b). These two variables
have the same distance between their means, but obviously the means of variable A are better
separated than those of variable B. The simple distance does not reflect this; it should be judged
relative to the within-sample variation. The simplest solution is to rescale the variables by their
within-group variation, e.g. divide each value by the within-sample standard deviation. The within-group
SDs are then one on the rescaled axes, and the between-group distances now reflect the real
differences between the means (Figure 10.1 c and d).
Figure 10.1. The means of populations a) and b) are equal distances apart, but clearly
those in a) are more distinct. When we rescale their axes so that they have the same
(unit) within-population variance (c and d) the true distinctness of the populations is
apparent.
Figure 10.2. a) The distance between the centroids suggests that the centroids are not
equidistant. b) Adjusting the space to standardise by the within-sample error shows that they are.
This process generalises to the multivariate situation. The between-centroid distance in Figure 10.2 a
does not reflect the real difference between the centroids. Rescaling the H matrix by E^-1 (the inverse
of the within-group variance-covariance matrix) changes the axes so that the within-group variances
are unity and the data clouds for the groups are now spherical. The true differences between the
centroids are now apparent. Basically this is a PCA on the centroids in this rescaled space (though in
fact it is weighted by the sample size for each centroid). The rescaled old axes are rotated in the new
space so that one lies along the major axis between the centroids, then the second major axis,
orthogonal to the first in that space, is found, and so on. The number of these axes, canonical
variates, that can be found is limited by the number of variables p or the number of groups (g) minus
one, whichever is the smaller. Clearly you can't discover more axes than you started with, hence p is
a limit; but any g points can be perfectly represented in g-1 dimensions (try it), so g-1 is also a limit.
The result of the analysis is a reduced space display of the group centroids that reflect how separate
(distinct) they are, and also a set of new axes (canonical variates) that summarise the between group
variation. The contribution of the variables to the new axes can be calculated, and those that are
responsible for the differences between the groups identified.
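Programming note. For a one-way design the whole procedure is conveniently available in R; a minimal sketch, assuming Y is a data frame of the response variables and group is the grouping factor (hypothetical names). Note that lda()'s 'linear discriminants' are the canonical variates, up to the package's scaling conventions:
library(MASS)
dat = data.frame(group = group, Y)
cda = lda(group ~ ., data = dat)
cv  = predict(cda)$x                      # canonical variate scores for the observations
cda$svd^2 / sum(cda$svd^2)                # proportion of between-group variation per CV
eqscplot(cv[, 1:2], type = "n",
         xlab = "Canonical variate 1", ylab = "Canonical variate 2")
text(cv[, 1:2], labels = as.character(dat$group), cex = 0.8)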
Figure 10.3. The observations on the 3 dates plotted on the two canonical variates. Notice the
excellent separation.
[Character plot: canonical variate 1 (horizontal, about -3 to 5) against canonical variate 2
(vertical, about -3 to 2), points labelled by date (1, 2, 3).]
Even though CDA is really a canonical correlation analysis, to some extent the terminology and
methods of interpretation have diverged. Canonical correlation coefficients are often still reported,
but eigenvalues are usually given greater prominence. Canonical variate scores are still calculated,
but only for the response variables (CVs for the dummy variables are probably not very interesting),
and, as we shall see, the canonical and structure coefficients for the response variables are still used -
though they are sometimes renamed eigenvector coefficients and canonical factor structure
respectively.
i) Goodness of fit.
Clearly a significant MANOVA test is a reasonable condition for a useful CDA. As in canonical
correlation, the eigenvalues λi describe the separating power of the associated eigenvectors that
define the canonical variates; they give the between-group sums of squares of the observations'
scores on that canonical variate axis.
A standard, intuitive approach is to use a scree diagram to identify the number of axes required
to summarise the between-group variation. Certainly, the adequacy of the reduced space plot is best
described by the sum of the eigenvalues associated with the axes in the plot as a proportion of the
sum of all the eigenvalues, e.g. (λ1 + λ2) / Σλi.
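For example (a sketch, assuming lambda is the vector of eigenvalues, e.g. lambda = cda$svd^2 from the lda() sketch earlier):
sum(lambda[1:2]) / sum(lambda)    # adequacy of a 2-D reduced space plot
cumsum(lambda) / sum(lambda)      # cumulative proportions, scree-style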
ii) Canonical coefficients.
As in most metric ordinations the new axes given by the analysis provide potentially useful
information about linear trends that underlie the variation in the data set. In the case of CDA, the
canonical variate axes identify the major axes of the variation between the groups. It is therefore
normal to attempt to interpret the relationship between these new axes and the original variables.
However it is important to realise that the space being described by the ordination has been rescaled
by the error, so that the distance between centroids reflects how distinct the samples are (Figure
10.2 b), not simply how different they are (Figure 10.2 a).
Since CDA is usually being used to analyse the results of an experiment or to describe the
differences between groups, the treatment or group identifiers (the dummy variables mentioned
earlier) are the predictors, and the Y variables the responses. Consider the first canonical
variate: the interpretation should tell us which variables would be most distinct for two centroids
separated along this line. In this case you should use the structure coefficients (rij), the correlations
between the ith Y variable and the jth canonical variate.
iii) Display.
The CDA reduced plot can be used to display three different types of information:
a) Centroids and confidence intervals.
The chief aim of CDA is to display the relationships among the centroids. For this reason, the plot of
the centroids on the first two canonical variates is incomplete without confidence ellipses around the
centroids (intervals are around univariate means, ellipsoids are around multidimensional centroids).
No plot of univariate means would get past a referee or examiner without confidence intervals or
standard errors. The same rigour should be demanded if CDA is to be used for anything other than
data exploration.
The standard confidence ellipses are based on the same assumptions as the MANOVA tests and
CDA: random sampling, multivariate normality and homogeneous variance covariance matrices.
They make use of the fact that if these assumptions are met the canonical variate scores have unit
within group variance, and the data clouds for each group are spherical (the scores are uncorrelated
within the groups). Thus simple confidence intervals can be drawn in the 2-D reduced plot as circles
of radius √(χ²(2, 0.05)/ni), i.e. 2.45/√ni, where χ²(2, 0.05) = 5.99 is the upper 5% point of the χ²
distribution on 2 degrees of freedom.
Figure 10.4. The centroids for each date, with approximate confidence regions, plotted on the first
two canonical variates.
b) Overlap. This can be simply shown by plotting the canonical variate scores for the individual
sampling units. Give each group a different plotting symbol so group membership is easily
recognised. While the overall level of overlap will usually be well described (provided the reduced
space plot is itself adequate), the overlap between particular pairs of groups may be badly
overestimated; the difference between any two centroids may lie on one or more of the minor axes.
This will be most likely when there are more than two important axes, in which case you could do
multiple plots (CV1 vs CV3, CV2 vs CV3 etc) to see if such separation exists.
Programming Notes.
If you don't want the scores, just the centroids and/or confidence circles: replace the last command
with: eqscplot(cvscores[,1:2], type="n")
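A fuller sketch for plotting the centroids with their approximate 95% confidence circles, assuming cvscores holds the canonical variate scores (e.g. predict()$x from lda()) and group is the grouping factor (hypothetical names):
library(MASS)                                      # for eqscplot
cent = aggregate(cvscores[, 1:2], list(group = group), mean)
n.i  = as.vector(table(group))
r    = sqrt(qchisq(0.95, 2) / n.i)                 # approx. 2.45/sqrt(n_i) per group
eqscplot(cvscores[, 1:2], type = "n",
         xlab = "Canonical variate 1", ylab = "Canonical variate 2")
text(cent[, 2], cent[, 3], labels = cent$group)    # centroid labels
theta = seq(0, 2 * pi, length.out = 100)
for (i in seq_along(r))
  lines(cent[i, 2] + r[i] * cos(theta), cent[i, 3] + r[i] * sin(theta))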
EXAMPLE 10.1.
It would be logical to display the differences between the dates, to find out which dates are most
similar to each other. I therefore performed a CDA treating sites as replicates (though since no
significance tests are involved it doesn't much matter whether they are independent or not). Since
there are only three dates, their centroids can be perfectly represented in two dimensions. The CV
scores for the 21 observations are displayed in Figure 10.3. The centroids with their approximate
confidence regions (circles of radius √(χ²(2, 0.05)/ni)) are in Figure 10.4. The first canonical variate (CV)
explains 69% of the scaled between-centroid variation; the second therefore explains 31%.
The canonical coefficients (raw, because the variables are all in the same units) and the structure
coefficients are in Table 10.1. Since we are primarily interested in which species vary between dates we
will look at the structure coefficients: the simple correlations between the CVs and the species. The
structure coefficients suggest that on the first axis the differences are associated with changes in
Favella, Acartia, and the Harpacticoid group. So we can infer that March (Date 3) had more of the
last two species than January or February, while it had less Favella. Similarly, the differences
between January and February (CV 2) seem to be associated with a decrease in Euterpina and
Acartia.
Table 10.1. Coefficients for the first two canonical variates of the difference between dates for the
log(X+1) transformed plankton data.
Canonical variate 1    Canonical variate 2
Robustness of CDA.
CDA is a descriptive technique, none of the tests mentioned later are essential to its validity. It is
therefore generally less sensitive to heterogeneity of variance and non-normality than they are.
Moderate heterogeneity does not seem to be a major problem, especially if the groups are well
separated. Large differences between the variance covariance matrices will distort the inter-centroid
distances but a useful picture will usually be produced. Non-normality on its own is usually not such
a problem, but skewness can strongly amplify the distortion due to heterogeneity of variance
covariance matrices.
The simplest way to check for heterogeneity of variance covariance matrices is to plot the
observations as well as the centroids in the reduced space. You can then examine the similarity of the
within group distributions. More formal methods were given earlier in the course.
Figure 10.5. Two populations that are not separate on either marginal axis, but are clearly
separate in multivariate space.
Informal Explanation.
Figure 10.6. a) The between-centroid distance does not reflect distinctness. b) In Mahalanobis
space it does: the groups are all equally distinct.
Consider Figure 10.5 above. If we look at the variables separately, there is considerable overlap
between the two populations. However, looking at the variables in multivariate space, we see the
populations are quite distinct. Similarly, if we took samples from these two populations on those
variables, we might very well fail to get significant results from separate ANOVAs for each variable;
but a MANOVA, which treats the populations as truly two-dimensional, would nearly always detect
the difference.
ANOVA starts off by partitioning the total sums of squares into an error (within groups) and
treatment (between groups) sums of squares. Analogously, MANOVA partitions the total sums of
squares and cross products matrix to give an error (within group) matrix E and a treatment (between
group) matrix H (for hypothesis).
Now consider the points in Figure 10.6. This space is first rescaled (stretched) so that the data clouds
for each group are spherical (Figure 10.6 b). This is done by rescaling the data by the inverse of the
error matrix, i.e. E^-1. This converts all the distances in the space to Mahalanobis' D. I discussed this
rescaling in more detail earlier, under canonical discriminant analysis. CDA then rotates the axes in
this new space till they are oriented along the major axes of the variation between the centroids - just
like a PCA on the centroids. The data points are now given by scores on these new axes (the CVs).
These new variates are used for the tests. As in PCA, they are uncorrelated, and the within-group
variance is one.
Now the test for the null hypothesis of equal population centroid vectors can be performed.
Formally, the null hypothesis can be stated H0: µ1= µ2 = ... = µg, where g is the number of groups.
Unfortunately though there is one null hypothesis, there are four test statistics. These reflect different
ways of relating the between, within and total sums of squares on these new axes. Because the new
axes are uncorrelated these sums of squares can be validly combined in ways that would not be
possible if we used the original (correlated) variables. This is how scalar multivariate test statistics
can be built up that incorporate the variation in the whole set of variables.
There are four main test statistics, all relating the matrix H to either the total variation (H+E) or the
error variation E:
i) Wilks' Lambda (Λ) is the value of ssW / ssT on each of the new axes, multiplied together. If the
groups are well separated then ssW / ssT will be small for each new axis, and their product smaller
still. It is |E| / |H + E|.
ii) Pillai's Trace (V) is the sum of the values of ssB / ssT over all the new axes. If the groups are
well separated this will be large. It is trace(H(H + E)^-1).
iii) Hotelling-Lawley Trace (U) is the sum of the values of ssB / ssW over all the new axes.
Because the within-group variances have been rescaled to 1, U is proportional to the sum of the
ANOVA F values on each of the new axes. It is trace(HE^-1).
iv) Roy's greatest characteristic root (gcr) is simply ssB / ssW on the first, major, axis. It is
therefore proportional to the ANOVA F statistic on this axis. This axis is chosen because it
maximally separates the centroids, and so the F value is the largest possible. It is the first eigenvalue
of HE^-1.
Though all these methods are testing the same null hypothesis, they do not always give the same
results. I shall discuss the appropriate choice of statistic below.
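As a sketch of how the four statistics relate to one another, assuming H and E are the hypothesis and error SSCP matrices (they can usually be extracted from summary(manova(...))$SS, though that detail may vary between versions):
lam = Re(eigen(H %*% solve(E))$values)    # eigenvalues of H E^-1
prod(1 / (1 + lam))                       # Wilks' Lambda = |E| / |H + E|
sum(lam / (1 + lam))                      # Pillai's Trace = trace(H (H + E)^-1)
sum(lam)                                  # Hotelling-Lawley Trace = trace(H E^-1)
max(lam)                                  # Roy's greatest root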
Choice of statistic.
Many, perhaps most, MANOVA programs produce all four statistics; sometimes these will
contradict each other. Which to believe? First, we can generally disregard the p value for Roy's
greatest root: it is usually based on an approximation that in practice nearly always leads to a p value
that is much too small. Choosing between the other three is not an easy decision and has caused
a great deal of argument. One way to choose is on the basis of the robustness of the test statistics.
None of the test statistics given above is robust when the sample sizes are unequal. Just as in
ANOVA, violations of the assumptions combined with unequal sample sizes can result in spuriously
significant results; at other times they can lead to a failure to detect differences that are really there.
The assumption of equal variance-covariance matrices can be checked with Levene-type tests on each
variable (or the more formal methods given earlier in the course); normality can be checked with the
probability plots described earlier.
Programming Notes:
# Yvars is the matrix of response variables; ClassVar is the grouping (class) variable
man = manova(as.matrix(Yvars) ~ as.factor(ClassVar))
summary(man, test="Pillai")
You can get the other test statistics by using "Wilks", "Hotelling-Lawley" or "Roy" instead of
"Pillai".
EXAMPLE 10.2
Though the plankton data were not collected with strict hypothesis tests in mind, we can still use
tests like MANOVA as exploratory tools to help us check for patterns in the data that we can expect
a priori. For example some workers might test to see if there was a consistent difference between the
harbours. They might, I wouldn't; the difference is so obvious even in the raw data that there is no
point in testing for it. A less trivial example might be to look for differences between sampling dates
or between stations within a site. Within the Mangere Inlet, samples were taken at 7 different
stations on 3 different occasions (January, February and March 1979). Since some of the sites are on
mudflats some samples were only possible at high tide, so I have excluded the low tide data. Though
there are no replicates on each occasion we could treat the problem as a two way unreplicated design.
This lumps any interaction term into the error, thus making it potentially insensitive, but since I am
looking for general averaged trends over these particular stations, I am content to ignore the
interaction, and consider it to be the error term.
I chose to use the Mangere Inlet rather than Whau Creek data because there are fewer species in
Mangere. This is actually very important. It is impossible to do a MANOVA at Whau with all 9
species because there are not enough error degrees of freedom. The reduced diversity at Mangere (5
species) makes it possible.
The results of the analysis on the log(X+1) transformed data are in Tables 10.2a and 10.2b. It is worth
noting that the test for Roy's gcr given by R strongly suggested a difference between stations
(p-value = 0.0023), but using the more precise values from tables of the gcr distribution failed to
detect any station differences - a result much more in keeping with the other statistics.
Table 10.2a. Tests for differences between stations.
Statistic                 Value    p-value
Wilks' Lambda             0.052    0.2708
Pillai's Trace            1.863    0.2802
Hotelling-Lawley Trace    5.563    0.3168
Roy's Greatest Root       3.485    >.05

Table 10.2b. Tests for differences between months.
Statistic                 Value    p-value
Wilks' Lambda             0.063    0.0029
Pillai's Trace            1.463    0.0017
Hotelling-Lawley Trace    0.486    0.0053
Roy's Greatest Root       4.728    <.05
There do seem to be some consistent differences between months, but the stations do not seem to
be detectably different on average. That is not to say that the stations might not be different on any
given date, nor that, had we sampled another phase of the tide, we might not have found more obvious
differences. However it does suggest that it may be more productive to carry on to examine differences
between months rather than stations.
Univariate ANOVAs.
The simplest way to identify which variables have responded to the experimental regime is to
perform separate ANOVAs for each of the response variables. This will allow you to identify which
variables are most different between the groups. The p-values should not be taken too literally
because you have done a number of tests and so are subject to Type I error rate inflation.
One simple approach that I have found useful is to ignore the significance tests altogether. Which
variables have responded most can be identified by ranking them on their ANOVA F statistics;
remember the F value is the between variation scaled by the within. Which means are different from
which others is perhaps best investigated using Canonical Discriminant Analysis described earlier.
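A sketch of the ranking, assuming Y is the matrix of (transformed) responses and month is the grouping factor (hypothetical names):
Fval = apply(Y, 2, function(y) anova(lm(y ~ month))[1, "F value"])
sort(Fval, decreasing = TRUE)             # biggest responders first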
EXAMPLE 10.3
Having demonstrated that differences exist, we should now try to find out which variables are
responsible. As mentioned above, one of the simplest approaches is to examine the univariate
ANOVAs. The tests for differences between months are shown in Table 10.3.
R tips for STAT302
1) How to read in a .csv file
Comma delimited files (.csv) are simple text files with every data value separated from the next by a
comma. They can be read in and created by virtually every statistics package/language. CSV files
can be created in Excel (Save as and choose CSV Comma delimited (.csv) from the Save as
Type pull-down menu).
dataset.name=read.csv(file.choose(), header=TRUE)    # a file chooser window opens; pick your .csv file
fix(dataset.name)
This opens a spreadsheet window so you can look at the data set and check what you got. You must
close the window before you can execute any more R commands.
2) Some functions are from libraries that need to be loaded first. e.g. eqscplot() is often used
but you must enter library(MASS) first, because eqscplot is from that library
3) Transformations. For example, to log(X+1)-transform the variable in column 3 of a data set:
var.log=log(dataset[,3]+1)
Note: a matrix or dataset has a row index and a column index. Leaving an index blank says use all its
possible values. So, in the example above, the row index is missing which says use all the rows in the
matrix.
vars.log=log(dataset[,3:10]+1)           # transform columns 3 to 10 at once
vars.sel.log=log(dataset[,c(3,5,9)]+1)   # transform a selection of columns
d) Sometimes it is easier to do the transformation in Excel, then save the file as .csv and
read it into R again.
4) To harvest a plot and paste it into an MSWord document (or PowerPoint if you want to edit it).
Have the plot window active (its title bar is highlighted), then use Ctrl-W to copy it in a
suitable format to the clipboard.
5) To copy a table from R and paste it neatly into MSWord, go by way of Excel.
First select and copy the R table you want - an ANOVA table, for example.
Now paste the table into Excel and press Alt-d, then Alt-e, to open the Text to Columns wizard.
Press Next in the wizard.
Check that the splits into separate columns are in the right place. Double-click on an arrow to remove
a split; click in a gap to put one in.
Now click Finish.
Now copy the table again and use Paste Special > Formatted text (RTF) to paste it into MSWord.
Pretty, isn't it?
Useful resources on the net:
http://maths.anu.edu.au/~johnm/courseweb/r-courseprep.html
http://www.youtube.com/off2themovies2
http://www.ms.unimelb.edu.au/~andrewpr/r-users/icebreakeR.pdf