Introduction To Ecological Multivariate Analysis

2014
Contents

1 Data
   1.1 Matrices
       1.1.1 Matrix algebra
       1.1.2 Data matrix
   1.2 Species data manipulation
       1.2.1 Transformation
       1.2.2 Standardization

2 Association
   2.1 What is an association coefficient?
       2.1.1 Correlation
       2.1.2 Partial correlation
       2.1.3 Visualization
   2.2 Distance
       2.2.1 Double-zero problem
   2.3 Measuring ecological distance
       2.3.1 Chord distance
       2.3.2 χ² distance
       2.3.3 Hellinger distance
       2.3.4 Bray-Curtis distance, aka Odum's index, aka Renkonen index, aka percentage difference dissimilarity
   2.4 Metric and semimetric distances
   2.5 Similarity indices
   2.6 Implementation

3 Cluster Analysis
   3.1 Overview
   3.2 Hierarchical clustering
       3.2.1 Single-linkage clustering
       3.2.2 Complete-linkage clustering
       3.2.3 Average-linkage clustering
       3.2.4 Ward's clustering method
       3.2.5 Comparison
   3.3 Interpretation of hierarchical clustering results
       3.3.1 Cophenetic correlation
       3.3.2 Finding interpretable clusters
       3.3.3 Graphical presentation of the final clustering result
   3.4 Non-hierarchical clustering
       3.4.1 Partitioning by k-means
       3.4.2 Fuzzy clustering
   3.5 Validation with external data
       3.5.1 Continuous predictors
       3.5.2 Categorical predictors

4 Ordination
   4.1 Overview
   4.2 Principal Component Analysis (PCA)
   4.3 Correspondence Analysis (CA)
1 Data

1.1 Matrices

1.1.1 Matrix algebra
Consider the two matrices

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix}. \tag{1.1}$$

Summation of these matrices is not possible as such, because A is a 3-by-2 matrix and B is a 2-by-3 matrix: when dealing with matrices, the dimensions must be the same for a summation to be carried out. The two matrices can, however, be multiplied, because the number of columns of B equals the number of rows of A. The product BA is then a 2-by-2 matrix:

$$\mathbf{BA} = \begin{pmatrix} a_{11}b_{11} + a_{21}b_{12} + a_{31}b_{13} & a_{12}b_{11} + a_{22}b_{12} + a_{32}b_{13} \\ a_{11}b_{21} + a_{21}b_{22} + a_{31}b_{23} & a_{12}b_{21} + a_{22}b_{22} + a_{32}b_{23} \end{pmatrix}. \tag{1.3}$$
When a matrix with only one row and another with only one column (i.e., two vectors) are multiplied, the result is a scalar (a single value), and thus this operation is called a scalar product. A useful piece of information related to the scalar product is that if two vectors (variables) are both standardized (zero mean and unit variance) and normalized (scaled in such a way that their sum of squares equals 1), the scalar product between them equals the correlation between the original variables.

It is also possible to multiply the example matrices the other way around (that is, AB), which results in a 3-by-3 matrix. In matrix algebra one cannot divide matrices, so A/B is impossible. Instead, this is written as $\mathbf{A}\mathbf{B}^{-1}$, where $\mathbf{B}^{-1}$ is the inverse of matrix B.
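These operations map directly onto R: %*% performs matrix multiplication and solve() returns the inverse of a square matrix. A minimal sketch (the numeric values are purely illustrative, not from the text):

> A = matrix(1:6, nrow=3, ncol=2)   # a 3-by-2 matrix
> B = matrix(1:6, nrow=2, ncol=3)   # a 2-by-3 matrix
> # A + B fails: the dimensions differ
> B %*% A                           # the 2-by-2 product BA
> A %*% B                           # the 3-by-3 product AB
> S = B %*% A
> solve(S)                          # inverse of the square matrix S
> round(S %*% solve(S))             # recovers the identity matrix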
A square matrix is a special class of matrix that has an equal number of rows and columns. For a 3-by-3 square matrix A, the diagonal elements form the matrix diagonal:

$$\mathrm{diag}(\mathbf{A}) = \begin{pmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & 0 \\ 0 & 0 & a_{33} \end{pmatrix}. \tag{1.5}$$

The off-diagonal elements form the lower and upper triangle around the diagonal (use functions lower.tri and upper.tri to get the triangles):

$$\mathrm{upper}(\mathbf{A}) = \begin{pmatrix} 0 & a_{12} & a_{13} \\ 0 & 0 & a_{23} \\ 0 & 0 & 0 \end{pmatrix}. \tag{1.6}$$
For any square matrix, one can apply the following equation:

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{u} = \mathbf{0}, \tag{1.7}$$

where I is an identity matrix (ones on the diagonal and zeros on the off-diagonal). This so-called characteristic equation is used to derive the eigenvalues λ and eigenvectors u of a matrix. An n-by-n square matrix has n eigenvalues, and each eigenvalue is associated with an eigenvector. For a symmetric matrix (such as a covariance matrix), the eigenvectors are orthogonal and thus represent independent directions of variation in the matrix. This is a very useful property, which will become evident later.
Eigenvalues and eigenvectors can be calculated with function eigen in R. For example, the following matrix A:

$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \tag{1.8}$$

has eigenvalues

$$\lambda_1 = 5.37, \qquad \lambda_2 = -0.37 \tag{1.9}$$

and eigenvectors

$$\mathbf{u}_1 \approx \begin{pmatrix} 0.42 \\ 0.91 \end{pmatrix}, \qquad \mathbf{u}_2 \approx \begin{pmatrix} -0.82 \\ 0.57 \end{pmatrix}. \tag{1.10}$$

The eigenvectors can be interpreted as unit vectors that define the dimensions of the matrix.
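In R, the same example can be reproduced with eigen (a minimal sketch of the call):

> A = matrix(c(1, 3, 2, 4), nrow=2)  # the matrix of equation (1.8), filled column-wise
> e = eigen(A)
> e$values                           #  5.3723 -0.3723
> e$vectors                          # the eigenvectors u1 and u2 as columns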
1.1.2 Data matrix

Ecologists usually deal with data matrices, which are generally non-square, and with covariance/correlation matrices, which are square. Let us look at an example data matrix:
> library(vegan) # load the vegan package to the workspace
> data(dune)           # load a data set to the workspace
> head(dune[,1:10])
   Belper Empnig Junbuf Junart Airpra Elepal Rumace Viclat Brarut Ranfla
2       3      0      0      0      0      0      0      0      0      0
13      0      0      3      0      0      0      0      0      0      2
4       2      0      0      0      0      0      0      0      2      0
16      0      0      0      3      0      8      0      0      4      2
6       0      0      0      0      0      0      6      0      6      0
1       0      0      0      0      0      0      0      0      0      0
This data matrix contains the abundance of vascular plants on a dune meadow.
The entire matrix has 30 species sampled in 20 sites (sample plots). The dimensions of the matrix can be examined with the following commands:
> dim(dune)
> nrow(dune)
> ncol(dune)
If one wishes to consider the table not as sites-by-species but as species-by-sites, the data matrix can be transposed:
> t(dune[1:6,1:10])
       2 13 4 16 6 1
Belper 3  0 2  0 0 0
Empnig 0  0 0  0 0 0
Junbuf 0  3 0  0 0 0
Junart 0  0 0  3 0 0
Airpra 0  0 0  0 0 0
Elepal 0  0 0  8 0 0
Rumace 0  0 0  0 6 0
Viclat 0  0 0  0 0 0
Brarut 0  0 2  4 6 0
Ranfla 0  2 0  2 0 0
[Figure: histograms summarizing the dune data (one panel shows the species richness per site; y-axes: Frequency).]
1.2 Species data manipulation

1.2.1 Transformation
The distribution of variables within a data matrix can be examined using histograms. If the distribution is undesirable for some purposes (e.g., non-symmetric), the data can be transformed, e.g., using square root (sqrt) or logarithm (log, log10, log1p) functions (Figure 1.3):

> data(varechem)
> attach(varechem)
> par(mfrow=c(1,3),mar=c(6,6,2,1),mgp=c(4,1,0),xpd=NA)
> hist(Al,15,col='lightblue',cex.axis=2,cex.lab=3,cex.main=3)
> hist(sqrt(Al),10,col='lightblue',cex.axis=2,cex.lab=3,cex.main=3)
> hist(log(Al),10,col='lightblue',cex.axis=2,cex.lab=3,cex.main=3)
> detach(varechem)
[Figure 1.3: histograms of Al, sqrt(Al), and log(Al); y-axis: Frequency.]
Transformations are also used because many analyses assume that variables (or model residuals) are approximately normally distributed (e.g., linear regression). Normality can be tested, e.g., using the Shapiro-Wilk test:

> shapiro.test(Al)

	Shapiro-Wilk normality test

data:  Al
W = 0.8877, p-value = 0.01193

which indicates that the null hypothesis (the distribution is normal) can be rejected for the distribution of aluminium concentration in the soil. Doing the same test for the square-root transformed data gives p = 0.18, indicating that this transformed variable is consistent with a normal distribution.
Instead of applying a specific function to the data to alter the scale of observations, species abundances can also be transformed to presence-absence (i.e.,
changing the scale to 0/1). This can be done either as:
> dune.pa = matrix(as.numeric(dune>0),dim(dune),
+            dimnames=list(rownames(dune),colnames(dune)))
or:
> dune.pa = ifelse(dune > 0, 1, 0)
1.2.2 Standardization

Instead of transforming each observation with a fixed function, species abundances can also be standardized relative to some property of the data, for example by dividing each abundance by the species maximum (1), by the site total (2), or by the site norm (3):

$$y'_{ij} = \frac{y_{ij}}{\max_k (y_{kj})}, \tag{1.11}$$

$$y'_{ij} = \frac{y_{ij}}{\sum_{k=1}^{p} y_{ik}}, \tag{1.12}$$

$$y'_{ij} = \frac{y_{ij}}{\sqrt{\sum_{k=1}^{p} y_{ik}^2}}. \tag{1.13}$$

This scaling (3) is called the chord transformation; when calculating Euclidean distances after this transformation, chord distances are returned. This can be useful when performing analyses such as principal component analysis (PCA) and k-means partitioning, because it is then the chord distance, not the Euclidean distance, that is preserved in the analysis. Other similar standardizations are considered later in Chapter 2.
The Wisconsin double standardization (function wisconsin in vegan) first ranges abundances by species maxima (1) and then standardizes by site totals (2). The function metaMDS in vegan (which is used to perform non-metric multidimensional scaling) automatically performs a Wisconsin transformation if there are abundance values greater than 9. This function also automatically applies a square root transformation if the maximum count in the data is greater than 50. The reasoning behind this transformation is that, theoretically, the variance of a square-root transformed Poisson-distributed variable (a distribution followed by, e.g., species count data) tends to 1/4.
All these standardizations (like the transformations considered previously) affect the distribution of species abundances. The differences between them can be compared using boxplots (Figure 1.4).
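The code that produced Figure 1.4 did not survive extraction; a minimal sketch of how such a comparison could be assembled with decostand and wisconsin is given below (log1p is used instead of log to handle the zeros; the object names are illustrative):

> # Collect the dune abundances under different transformations/standardizations
> # and compare their distributions with boxplots (a sketch, not the original code):
> library(vegan)
> data(dune)
> trans = list(Raw       = unlist(dune),
+              Sqrt      = unlist(sqrt(dune)),
+              log       = unlist(log1p(dune)),
+              Max       = unlist(decostand(dune, 'max')),
+              Total     = unlist(decostand(dune, 'total')),
+              Norm      = unlist(decostand(dune, 'normalize')),
+              Chi       = unlist(decostand(dune, 'chi.square')),
+              Wisconsin = unlist(wisconsin(dune)))
> par(mfrow=c(2,4), mar=c(2,2,2,1))
> for(i in seq_along(trans)) boxplot(trans[[i]], main=names(trans)[i])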
[Figure 1.4: boxplots of the dune abundances after different transformations and standardizations (Raw, Sqrt, log, Max, Total, Norm, Chi, Wisconsin).]
2 Association

2.1 What is an association coefficient?

2.1.1 Correlation
When considering Pearson's correlation, it is useful to start by defining covariance. The (sample) variance of a variable x (with n observations) is defined as:

$$V[x] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \tag{2.1}$$

The (sample) covariance between two variables x and y is defined analogously:

$$\mathrm{Cov}[x, y] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{2.2}$$
While the variance measures the amount of dispersion around the mean, the covariance measures the amount of joint dispersion between two variables (how much overlap is there in the variation of the two variables?).

Pearson's correlation is calculated by scaling the covariance by the product of the standard deviations, such that:

$$r_{x,y} = \frac{\mathrm{Cov}[x, y]}{\sqrt{V[x]\,V[y]}}. \tag{2.3}$$
From here it can be seen that if the covariance is calculated for standardized variables (with zero mean and unit variance), the result is Pearson's correlation. If correlations are calculated among several variables, the results are usually combined into a correlation matrix. This is a symmetric matrix ($r_{ij} = r_{ji}$) with ones on the diagonal ($r_{ii} = 1$).

As mentioned above, Pearson's correlation is a parametric method that assumes the variables are normally distributed. If the values of the variables are replaced by their rank orders, the formula for Pearson's correlation returns the Spearman correlation. If there are no duplicate values (i.e., all ranks are unique), the Spearman correlation can be calculated as follows:
$$\rho = 1 - \frac{6\sum_{i=1}^{n}(x_i - y_i)^2}{n(n^2 - 1)}, \tag{2.4}$$

where $x_i$ and $y_i$ are the ranks of the two variables for observation i. The Kendall correlation is instead based on the numbers of concordant (a) and discordant (b) pairs of observations:

$$\tau = \frac{2(a - b)}{n(n - 1)}. \tag{2.5}$$
The Spearman and Kendall correlations are non-parametric, and thus assume nothing about the distribution of the variables considered. Also, while Pearson's correlation models a linear relationship between variables, rank-order correlations are more flexible about the shape of the association. For example, consider how different relationships are captured by linear and rank-order correlations (Figure 2.1):
> # Simulate correlations:
> # ----------------------
> m = character(6)
> x1 = seq(-2,2,length.out=100)
> y1 = x1 + rnorm(100)*.5
> y2 = x1^3 + rnorm(100)*.5
> x2 = c(seq(0,2,length.out=90),seq(3,3.5,length.out=10))
> y3 = c(2*x2[1:90]+rnorm(90)*.5,6-x2[91:100]
+        +rnorm(10)*.5)
> m[1] = as.character(signif(cor(x1,y1),2))
> m[2] = as.character(signif(cor(x1,y1,method='spearman'),2))
> m[3] = as.character(signif(cor(x1,y2),2))
> m[4] = as.character(signif(cor(x1,y2,method='spearman'),2))
> m[5] = as.character(signif(cor(x2,y3),2))
> m[6] = as.character(signif(cor(x2,y3,method='spearman'),2))
2.1.2 Partial correlation

When considering the association between several variables, one might be interested in the unique dependence between two variables when the effect of the other variables has been accounted for. This is essentially what partial correlation is about. For example, when the focal data consist of three variables x, y, and z, the partial correlation between x and y is calculated as follows:
[Figure 2.1: scatter plots of the three simulated relationships (y1 vs x1, y2 vs x1, y3 vs x2), with the Pearson and Spearman correlations printed above each panel.]

$$r_{x,y|z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}.$$
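This can be checked quickly in R; the sketch below implements the formula directly and notes the equivalent residual-based computation (the helper name is illustrative):

> # Partial correlation of x and y given z, from pairwise correlations:
> pcor.xy.z = function(x, y, z){
+   rxy = cor(x, y); rxz = cor(x, z); ryz = cor(y, z)
+   (rxy - rxz*ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
+ }
> # Equivalently: cor(resid(lm(x ~ z)), resid(lm(y ~ z)))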
2.1.3 Visualization
The correlation structure among many variables can be inspected visually, for example with a pairwise scatter-plot matrix of the raw variables (Figure 2.2) or by plotting the correlation matrix itself as a color matrix (Figure 2.3):

[Figure 2.2: pairwise scatter plots of the varechem soil variables (e.g., Ca, Fe, Mn, Al, pH).]

> # Figure 2.3:
> require(gclus)         # order.single, dmat.color, plotcolors
> require(RColorBrewer)  # brewer.pal
> par(mfrow=c(1,2),mar=c(1,1,1,1),mgp=c(2,.2,0))
> n = 11
> cors = cor(varechem)
> o = order.single(cors)
> cmat = dmat.color(cors,breaks=seq(-1,1,length=n),colors=brewer.pal(n,'RdBu'))
> plotcolors(cmat,rlabels=T,clabels=T,dlabels=names(varechem))
> cmat = dmat.color(cors[o,o],breaks=seq(-1,1,length=n),
+                   colors=brewer.pal(n,'RdBu'))
> plotcolors(cmat,rlabels=o,clabels=o,dlabels=names(varechem)[o])
2.2 Distance

In many community ecology studies, the aim is to compare the species composition between two or several samples, and possibly relate that to external explanatory variables. Many applications, such as ordination and classification (clustering), are based on some measure of resemblance between samples, rather than on the raw presence/absence or abundance of species. In this section we consider what is meant by ecological resemblance and how it can be calculated.

For example, consider a simple data set with 2 species whose relative abundance has been observed in 5 sites. The relationships between the sites, with respect to their species composition, can easily be examined in a scatter plot (Figure 2.4).
Figure 2.3: Color plots of the correlation matrix. On the left the matrix is plotted in its original form, while on the right the matrix is reordered to reflect patterns of correlation among several variables (the variables in the middle are all negatively correlated with each other).
> # Figure 2.4
> # Data matrix:
> X = matrix(c(0.78,0.90,0.29,0.73,0.19,0.28,0.93,0.63,0.51,0.87),5,2);
> n = nrow(X)
> # all possible combinations of the elements of [1:n] taken 2
> # at a time
> pairs = combn(1:n, 2)
> par(mar=c(5,5,3,3))
> plot(X[pairs,1], X[pairs,2],type='b',xlab='Species 1',ylab='Species 2',
+      pch=21,cex=2,bg='gray70',cex.lab=1.5,xlim=c(0.1,1),ylim=c(0.2,1))
> text(X[,1], X[,2],1:n,pos=2,col=2,cex=2)
> text(.5,.92,expression('D'[25])); text(.6,.74,expression('D'[23]));
> text(.75,.7,expression('D'[24])); text(.89,.6,expression('D'[12]))
Figure 2.4: The dispersion of 5 samples in the space of 2 species. The lines connecting the sites represent the shortest (Euclidean) distances between them (D12, D23, D24, D25).

The Euclidean distance between two sites, $D_{Eucl} = \sqrt{\sum_{j=1}^{p}(y_{1j}-y_{2j})^2}$, increases with the number of variables and with increasing differences in the value of each variable between the sites (this means that the scale of each variable has an effect on the distance). This can be avoided by standardizing each variable (i.e., subtracting the mean and dividing by the standard deviation: $(x - \bar{x})/s_x$). As a result, all variables will be on the same scale and have equal (zero) means.
As with correlations, all distances between objects are collected into a symmetric distance matrix, with zeros on the diagonal (Dii = 0):
> print(signif(as.matrix(dist(X)),2))
     1    2    3    4    5
1 0.00 0.66 0.60 0.24 0.83
2 0.66 0.00 0.68 0.45 0.71
3 0.60 0.68 0.00 0.46 0.26
4 0.24 0.45 0.46 0.00 0.65
5 0.83 0.71 0.26 0.65 0.00

2.2.1 Double-zero problem
The Euclidean distance is a useful and logical measure to characterize differences in, e.g., spatial location or the physical properties of sampling locations.
In these cases the value zero has the same meaning as any other value on the
scale of the variable. For example, the absence of nitrogen in the soil or the
fact that two samples have been acquired from the same spot are ecologically
meaningful pieces of information.
In contrast, when it comes to species presence/absence, the interpretation
of double-zeros becomes more tricky. The presence of a species at a given site
generally implies that this site provides a set of minimal conditions allowing the
species to survive (the dimensions of its ecological niche). Note that a species
might be found in a site because it appeared there by accident and not because
the local conditions are suitable for it; many species can be transiently found
in sites where they cannot survive in the long run. However, the absence of a
species from a sample can be due to a variety of causes: the species niche may
be occupied by a replacement species, or the absence of the species is due to
adverse conditions on any of the important dimensions of its ecological niche,
or the species has been missed because of a purely stochastic component of its
spatial distribution, or the species does not show a regular distribution on the
site under study, or simply due to observation error (the species was there but
the observer missed it).
The crucial point here is that the absence of a species from two sites cannot readily be counted as an indication of resemblance between the two sites, because this double absence may be due to completely different reasons in the two samples. Luckily, many alternative measures of ecological resemblance that account for this problem are available.
2.3 Measuring ecological distance

2.3.1 Chord distance

If the raw data are first normalized (each site vector scaled to unit length), calculating the Euclidean distance between two objects results in the so-called chord distance (here referred to as D_chord). The transformation required is:
$$\hat{X}_{ij} = \frac{X_{ij}}{\sqrt{\sum_{j=1}^{p} X_{ij}^2}}, \tag{2.8}$$

that is, the abundance of species j in site i is divided by the square root of the sum of squared abundances at that site. The clear advantage of D_chord over D_Eucl is that the chord distance is insensitive to double-zeros, making it suitable for species abundance data.
While D_Eucl is the shortest distance between points in variable space, the chord distance is equivalent to the length of a chord joining two points on a section of a hypersphere of radius 1 (Figure 2.5).
Figure 2.5: Graphical representation of the chord distance. The red line between points 1 and 2 illustrates the Euclidean distance, whereas the black arc defines the chord distance.

The chord distance is maximal when the species at the two sites are completely different (no common species). In this case the normalized site vectors are at 90° from each other, and the distance between the two sites is $\sqrt{2}$ ($D = \sqrt{1^2 + 1^2}$).
2.3.2 χ² distance

The χ² distance is obtained as the Euclidean distance calculated after the following transformation:

$$\hat{X}_{ij} = \sqrt{X_{++}}\,\frac{X_{ij}}{X_{i+}\sqrt{X_{+j}}}, \tag{2.9}$$

where $X_{++}$ is the grand sum of the sites-by-species data table, $X_{i+}$ is the total abundance of site i, and $X_{+j}$ is the total abundance of species j in the data. This distance has no upper bound, similarly to D_Eucl.

The χ² distance (here D_χ2) is considered here because Correspondence Analysis (CA) preserves this distance between sites. As with D_chord, D_χ2 does not suffer from the double-zero problem.
2.3.3 Hellinger distance

The Hellinger distance (D_Hell) is the Euclidean distance calculated after the Hellinger transformation of the abundances, $\hat{X}_{ij} = \sqrt{X_{ij}/X_{i+}}$. Like the chord and χ² distances, it is insensitive to double-zeros and is well suited to species abundance data.

2.3.4 Bray-Curtis distance, aka Odum's index, aka Renkonen index, aka percentage difference dissimilarity

This beloved child has many names, which reflects the fact that the percentage difference dissimilarity is one of the most widely applied measures of ecological resemblance. It is the complement of the Steinhaus similarity index, $D_{BC} = 1 - S_{Stein}$. The Steinhaus similarity is calculated as follows:

$$S_{Stein} = \frac{2W}{A + B}, \tag{2.11}$$

where W is the sum of the minimum abundances of the species shared by sites i and j ($W = \sum \min(X_i, X_j)$), A is the total abundance of site i and B is the total abundance of site j. That is, this measure gives the proportion of shared abundance between two sites, hence the name percentage difference (dis)similarity.
2.4 Metric and semimetric distances

A distance measure is metric if it satisfies four axioms: (1) the distance of an object to itself is zero, (2) the distance between two different objects is positive, (3) the distance is symmetric, i.e., the distance from x to y is the same as the distance from y to x. The fourth axiom is often called the triangle inequality: it requires that the distance from x to z via y is at least as great as the distance from x to z directly. D_Eucl, D_χ2, D_chord, and D_Hell are metric distances.

A semimetric distance satisfies the first three axioms, but not necessarily the triangle inequality. These measures cannot directly be used to order points in a metric or Euclidean space because, for three points (x, y and z), the sum of the distances from x to y and from y to z may be smaller than the distance between x and z. This is the case, for example, with the Bray-Curtis dissimilarity (D_BC), which is semimetric. However, note that the square-rooted Bray-Curtis dissimilarity ($\sqrt{D_{BC}}$) is metric.
2.5 Similarity indices

In many cases similarity and distance are just complements of each other. Thus, all similarity coefficients can be converted into distances by one of the following formulas:

$$D = 1 - S, \qquad D = \sqrt{1 - S}, \qquad D = \sqrt{1 - S^2}, \qquad D = 1 - S/S_{max}.$$

A similarity that has been converted to a distance is usually referred to as a dissimilarity. The Steinhaus similarity presented above is commonly used for ecological data; it is a quantitative index that takes into account the differences in species abundances. Other commonly used similarity indices are binary, meaning that they only consider presence/absence data.
When considering binary data, there are four different quantities related to the similarity of two samples (objects):

a: number of species present in both samples
b: number of species unique to site i
c: number of species unique to site j
d: number of co-absences

Different similarity indices can be generated by combining these quantities with different weights. As in the case of distances, double absence is not meaningful when considering species data; it can, however, be useful when analysing other categorical data. The best-known indices utilizing these quantities are the Jaccard index and the Sørensen index:
$$S_{Jac} = \frac{a}{a + b + c}, \tag{2.12}$$

$$S_{Sor} = \frac{2a}{2a + b + c}. \tag{2.13}$$

The Jaccard index can also be calculated from the Sørensen index as $S_{Jac} = S_{Sor}/(2 - S_{Sor})$. S_Sor is the binary equivalent of S_Stein. The Sørensen dissimilarity ($D_{Sor} = 1 - S_{Sor} = (b + c)/(2a + b + c)$) between two sites equals Whittaker's species turnover between those sites. If the formula for the Jaccard index is applied to quantitative data (such that the letters stand for summed abundances rather than presences), the resulting similarity is called the Ruzicka index (according to J. Oksanen).

The binary similarity between sites can also be calculated using a correlation coefficient. Using the above quantities, this becomes:

$$r = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}. \tag{2.14}$$
2.6 Implementation

There are many R packages with functions dedicated to calculating ecological distances/similarities, such as vegan (vegdist), ade4 (dist.binary, dist.quant), cluster (daisy), and labdsv (dsvdis). The properties of each function can be examined through their help files.

A particularly interesting function is designdist in the vegan package. Note that this function generates only distances (or dissimilarities), even when the applied index is actually a similarity. When calculating the distances considered above, one can use the function decostand in vegan to do the data transformations. The following example utilizes these two functions to generate a range of distances and dissimilarities.
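The code block that generated these objects was lost in extraction. A minimal sketch that produces distance objects with the names used in the plot below (using vegdist and decostand rather than designdist, and assuming the dune data as elsewhere in this chapter):

> library(vegan)
> data(dune)
> D.Eucl  = dist(dune)                                    # Euclidean on raw data
> D.BC    = vegdist(dune, method='bray')                  # Bray-Curtis
> D.Jac   = vegdist(dune, method='jaccard', binary=TRUE)  # Jaccard (binary)
> D.Sor   = vegdist(dune, method='bray', binary=TRUE)     # Sorensen (binary Bray-Curtis)
> D.chord = dist(decostand(dune, 'normalize'))            # chord distance
> D.Hel   = dist(decostand(dune, 'hellinger'))            # Hellinger distance
> D.Chi   = dist(decostand(dune, 'chi.square'))           # chi-square distance
> D = data.frame(D.Eucl = as.vector(D.Eucl), D.BC = as.vector(D.BC),
+                D.Jac = as.vector(D.Jac), D.Sor = as.vector(D.Sor),
+                D.chord = as.vector(D.chord), D.Hel = as.vector(D.Hel),
+                D.Chi = as.vector(D.Chi))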
> plot(D,lower.panel=NULL,panel=panel.smooth,cex=.5,pch=16,lwd=2)
[Figure: pairwise scatter plots comparing the distance measures D.Eucl, D.BC, D.Jac, D.Sor, D.chord, D.Hel, and D.Chi.]
The first thing to notice here is that the Euclidean distances differ quite considerably from all the other measures. One reason for this is that the Euclidean distance is the only one that considers double-zeros. Next, one can see that the Bray-Curtis dissimilarity is most similar to the Hellinger distance. This is useful to keep in mind when using methods such as PCA, k-means partitioning, and redundancy analysis (RDA). It is not very surprising that the Jaccard and Sørensen dissimilarities are almost identical. However, there is a rather large quantitative deviation between them and the other resemblance measures (remember that these are binary indices).
3 Cluster Analysis

3.1 Overview
Clustering is a family of methods that are used to classify objects into discrete categories, using specific rules and assumptions. The majority of clustering algorithms fall into the category of hierarchical agglomerative clustering. Non-hierarchical methods can also be used if one is interested in finding an optimal grouping of observations rather than their hierarchical relationships. Most clustering methods are based on an association matrix calculated between the objects of interest.

Hierarchical, agglomerative procedures begin with a discontinuous collection of objects (i.e., each object forms its own group) that are successively grouped into larger and larger clusters until a single, all-encompassing cluster is obtained (all objects are combined in the same group). That is, the members of lower-ranking clusters become members of larger, higher-ranking clusters. In contrast, non-hierarchical methods (such as k-means clustering) produce one single partition, without any hierarchy among the groups.
In the 80s and 90s a popular method, especially among vegetation ecologists, was a divisive clustering method called TWINSPAN (Two-Way INdicator SPecies ANalysis). This method proceeds by first calculating the first axis of Correspondence Analysis (CA) and splitting the data at the centre of this axis. The process is repeated within each part of the division until all objects form their own group. Divisive clustering methods are not considered here, as they are rather problematic (objects are assigned to different groups based on a rather arbitrary partitioning) and are no longer widely used in ecology.
In the stats package in R, function hclust can be used to produce a variety
of clustering methods. Other functions are available, e.g., in package cluster.
3.2 Hierarchical clustering

3.2.1 Single-linkage clustering
> library(vegan)
> data(dune)
> dat = dune
> D.Hel = dist(decostand(dat,'hellinger'))
> C.SL = hclust(D.Hel,method='single')
> plot(C.SL)
[Figure: cluster dendrogram from single-linkage clustering of D.Hel (hclust(*, "single")); the y-axis gives the fusion height.]
3.2.2 Complete-linkage clustering
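The code for this dendrogram did not survive extraction; it presumably mirrored the single-linkage call, e.g.:

> C.CL = hclust(D.Hel,method='complete')
> plot(C.CL)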
[Figure: cluster dendrogram from complete-linkage clustering of D.Hel (hclust(*, "complete")).]
Figure 3.3: Illustrating the conceptual difference between single- and complete-linkage clustering. The two groups (red and blue points) are joined by single-linkage at the distance corresponding to the green line between the most similar objects of the two groups. In contrast, complete-linkage joins the two groups at the distance shown by the cyan line, between the most distant points of the groups.
3.2.3 Average-linkage clustering

Average-linkage methods come in four variants:

- Arithmetic average: unweighted (UPGMA) and weighted (WPGMA)
- Centroid clustering: unweighted (UPGMC) and weighted (WPGMC)

Figure 3.4: In UPGMA an object joins a cluster at the average distance between the object and all members of the cluster. In UPGMC the joining distance is that between the object and the cluster centroid.
UPGMA must be applied with caution because it gives equal weights to the original similarities. It assumes that the objects in each group form a representative sample of the corresponding larger groups of objects in the reference population under study. For this reason, UPGMA clustering should only be used when the objects can be considered a representative (e.g., random or systematic) sample of that reference population.
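The UPGMA dendrogram shown below was presumably produced with the corresponding hclust call (the original code was lost):

> C.upgma = hclust(D.Hel,method='average')
> plot(C.upgma)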
[Figure: cluster dendrogram from UPGMA clustering of D.Hel (hclust(*, "average")).]
3.2.4 Ward's clustering method

This method is related to the centroid clustering methods described above (UPGMC and WPGMC); that is, cluster centroids play an important role. What the method does is minimize the squared error of ANOVA. At the beginning, each object forms its own cluster; for this starting point, the sum of squared distances between objects and centroids is 0. As clusters form, the centroids move away from the actual object coordinates and the sum of the squared distances from the objects to the centroids increases.

At each clustering step, Ward's method finds the pair of objects or clusters whose fusion increases as little as possible the sum, over all objects, of the squared distances between objects and cluster centroids. As the mean squared deviation can be calculated for both raw data and distances, Ward's
method is very flexible about the type of input data. However, if using raw species data, it is best to pre-transform it before the analysis (so that distances other than the Euclidean distance are preserved between objects). Let's see how this method performs using the Hellinger-transformed data, as above.
> C.Ward = hclust(D.Hel^2,method='ward')
> C.Ward$height = sqrt(C.Ward$height)
> plot(C.Ward)
Figure 3.6: Cluster dendrogram for the Hellinger-transformed data, using Ward's method. Here the height refers to the sum of squared distances to the cluster centroid. To obtain the correct solution the distances need to be squared when using function hclust; in this case it is useful to take the square root of the height element of the clustering object.
3.2.5 Comparison
The four different methods of hierarchical clustering all produced somewhat different results. Before going into detail about selecting the best method, we can do a simple visual comparison by considering the grouping of objects between the methods. Let's assume that we are interested in defining three groups in the data (Figure 3.7). One can easily see from Figure 3.7 that the methods differ somewhat in their final grouping; consider, for example, how sites 17 and 19 are classified by the four methods.
> # Figure 3.7:
> # -----------
> par(mfrow=c(2,2),mar=c(2,4,2,1))
> plot(C.SL,main='Single-linkage',xlab='')
> rect.hclust(C.SL,3,border=2:4)
> plot(C.CL,main='Complete-linkage',xlab='')
> rect.hclust(C.CL,3,border=2:4)
> plot(C.upgma,main='UPGMA',xlab='')
> rect.hclust(C.upgma,3,border=2:4)
> plot(C.Ward,main='Ward',xlab='')
> rect.hclust(C.Ward,3,border=2:4)
Figure 3.7: Visual comparison of the clustering patterns of four hierarchical clustering methods. The function rect.hclust is used to visualize three groups in each dendrogram. This function uses another function, cutree, which can be used to find the hierarchy level for a desired number of groups, or the number of groups at a desired height in the dendrogram.
3.3 Interpretation of hierarchical clustering results

The differences between methods underline the importance of choosing a method that is consistent with the aims of the analysis. If one uses a distance measure with an interpretable interval (such as Bray-Curtis or Jaccard), a meaningful cutting level is 0.5, since objects within a cluster are then more similar to each other than to the other objects in the dendrogram. There are also several methods for assessing the suitability of a given approach.
3.3.1 Cophenetic correlation

A clustering algorithm maps the original distances between objects into cophenetic distances. The cophenetic distance between two objects in a dendrogram is the distance at which the objects become members of the same group; that is, it is the distance to the node that is the common ancestor of both objects in the dendrogram. As the original distances between objects form a distance matrix, the cophenetic distances form a cophenetic matrix. To evaluate the correspondence between the original and the cophenetic matrix (i.e., to assess how well the original distances have been mapped), one can calculate a correlation between these matrices, called the cophenetic correlation. Note that since the two matrices are not independent, this correlation cannot be tested for significance.

Let's see how the above methods perform. The cophenetic matrix is found with function cophenetic in package stats, and a correlation coefficient is then calculated. It is preferable to use a rank-order correlation, since the relationship between the original and cophenetic distances is likely to be non-linear.
> # Cophenetic correlations:
> # ------------------------
> cph.SL = cophenetic(C.SL)
> cph.CL = cophenetic(C.CL)
> cph.upgma = cophenetic(C.upgma)
> cph.Ward = cophenetic(C.Ward)
> cors = matrix(0,1,4,dimnames=list('COR',c('Single','Complete','UPGMA','Ward')))
> cors[1] = cor(D.Hel,cph.SL,method='spearman')
> cors[2] = cor(D.Hel,cph.CL,method='spearman')
> cors[3] = cor(D.Hel,cph.upgma,method='spearman')
> cors[4] = cor(D.Hel,cph.Ward,method='spearman')
> print(cors)
       Single  Complete     UPGMA     Ward
COR 0.5300351 0.5329543 0.7858786 0.629469
This analysis suggests that UPGMA would be the optimal method, given the
Hellinger distances used to model between-object association.
3.3.2 Finding interpretable clusters

Above we briefly considered the way the four clustering methods divided the objects into three groups. Next we will put such a cutting of the dendrogram to a test. There are several approaches to validating a given partitioning; here we consider one based on silhouette widths.
> # Silhouette widths:
> # ------------------
> require(cluster)
> sil.wid = numeric(nrow(dat))
> # Calculate silhouette widths for each number of clusters,
> # disregarding the trivial k = 1:
> for(k in 2:(nrow(dat)-1)){
+   tmp = silhouette(cutree(C.Ward,k=k),D.Hel)
+   sil.wid[k] = summary(tmp)$avg.width
+ }
> # Best width
> k.best = which.max(sil.wid)
> # Plotting:
> par(xpd=NA)
> plot(1:(nrow(dat)),sil.wid,type='h',main='Silhouette: optimal number
+      of clusters, Ward',xlab='k number of clusters',
+      ylab='Average silhouette width',cex.lab=1.25)
> lines(rep(k.best,2),c(0,max(sil.wid)),col=2,cex=1.5,lwd=3)
Figure 3.8: Bar plot of silhouette widths for k = 2-20 groups, for Ward's clustering method. The optimal partition, having the highest average silhouette width, is marked with the red bar.
[Figure 3.9 (silhouette plot), summary: n = 20; cluster 1: 10 objects, average silhouette width 0.12; cluster 2: 6 objects, 0.27; cluster 3: 4 objects, 0.29; overall average silhouette width 0.2.]
Figure 3.9: Silhouette plot for a three-group partition, based on Ward's clustering of the Hellinger distances.

In the silhouette plot of Figure 3.9, the number of objects (n) and the average silhouette width of each group are given on the right, and the average silhouette width for the entire partition is given below the figure. Notice that objects 1 and 2 seem to be misclassified. Going through this procedure for several alternative methods might be needed to find the best approach.
An approach similar to silhouette widths is to calculate a Mantel correlation between the original distance matrix and binary matrices computed from the dendrogram cut at various levels (representing group allocations), as sketched below.
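A minimal sketch of that idea, assuming the objects created above (the loop bounds and object names are illustrative):

> # Mantel correlation between the original distances and binary
> # same-group/different-group matrices from cutting the Ward dendrogram:
> mantel.k = sapply(2:10, function(k){
+   gr = cutree(C.Ward, k = k)
+   Dbin = as.dist(outer(gr, gr, '!=') * 1)  # 0 = same group, 1 = different group
+   mantel(D.Hel, Dbin)$statistic
+ })
> names(mantel.k) = 2:10
> mantel.k   # a high statistic suggests an interpretable partition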
3.3.3 Graphical presentation of the final clustering result

After selecting the best algorithm and finding the optimal number of clusters, it is time to consider how the result should be presented. First, produce a dendrogram with the final grouping (Figure 3.10):
> # Figure 3.10:
> # ------------
> require(gclus)
> # Reorder the dendrogram such that the ordering in the dissimilarity
> # matrix is respected as much as possible (this does not affect
> # dendrogram topology).
> C.Ward.ord = reorder.hclust(C.Ward,D.Hel)
> plot(C.Ward.ord,hang=-1,xlab='5 groups',sub='',main='Reordered, Ward (Hellinger)')
> # the hang = -1 argument draws the branches of the dendrogram down to 0.
> rect.hclust(C.Ward.ord,k=k.best,border=2:6)
[Figure 3.10: reordered Ward (Hellinger) dendrogram with the final groups marked (x-axis label: 5 groups).]
> # Figure 3.11:
> # ------------
> dend = as.dendrogram(C.Ward.ord)
> heatmap(as.matrix(D.Hel),Rowv=dend,symm=T)
Using a similar approach one can explore the species composition of each cluster. Here it is useful to rescale the species abundances, which can be done, e.g., with the vegemite function in the vegan package.
Figure 3.11: Heat map of the Hellinger dissimilarity matrix, reordered according to the Ward clustering dendrogram. Darker colors indicate higher similarity.
Using a heat map, the rescaled abundance of each species can be displayed in association with the cluster dendrogram, which allows one to inspect how the species content varies between groups. By default, vegemite orders species by their weighted averages on the site scores (Hill's method). This is illustrated in Figure 3.12, which indicates that there is a gradient in the data, associated with considerable turnover in species composition.
> # Figure 3.12:
> # ------------
> require(RColorBrewer)
> or = vegemite(dat,C.Ward.ord,'Hill',zero='-')
[vegemite output (abridged): a compact community table of 20 sites (columns, in the order of the reordered Ward dendrogram) by 30 species (rows, ordered by Hill's method, from Airpra, Empnig, Hyprad, ... down to Elepal, Calcus, Potpal); abundances are printed on a one-character scale ('Hill'), with zeros shown as '-'.]
> heatmap(t(dat[rev(or$species)]),Rowv=NA,Colv=dend,
+         col=c('white',brewer.pal(5,'Blues')),xlab='Sites',
+         margin=c(4,4),ylab='Species')
3.4 Non-hierarchical clustering

As explained in the Overview, non-hierarchical clustering methods do not produce a tree-like hierarchy of objects, but generate only a single partitioning. Here we will consider two methods, k-means and fuzzy clustering.

3.4.1 Partitioning by k-means
[Figure 3.12: heat map of the rescaled species abundances (Species by Sites), with sites ordered according to the Ward dendrogram and species ordered by Hill's method.]
> # k-means partitioning:
> # ---------------------
> X.Hel = decostand(dat,'hellinger')
> KM.ssi = cascadeKM(X.Hel,sup.gr=10,inf.gr=2,criterion='ssi')
> KM.calinski = cascadeKM(X.Hel,sup.gr=10,inf.gr=2,criterion='calinski')
> plot(KM.calinski,sortg=T)
> plot(KM.ssi,sortg=T)
Figure 3.13: k-means cascade plots for two different evaluation criteria, showing the grouping of each object for each partition. The optimal solution, given the criterion, is marked with red in the right-hand panel.

The same analysis could be done step by step with the function kmeans, which provides a partitioning into k groups that can in turn be evaluated with the function cIndexKM.
3.4.2 Fuzzy clustering

> # Fuzzy clustering:
> # -----------------
> k = 3
> C.fuz = fanny(X.Hel,k=k,memb.exp=1.5)
> s.C.fuz = summary(C.fuz)
> C.fuz.g = C.fuz$clustering
> # Silhouette plot:
> # ----------------
> plot(silhouette(C.fuz),main='Silhouette plot, fuzzy (Hellinger)',
+      cex.names=0.8,col=C.fuz$silinfo$widths+1)
[Silhouette plot, fuzzy (Hellinger), summary: n = 20; cluster 1: 8 objects, average silhouette width 0.19; cluster 2: 4 objects, 0.33; cluster 3: 8 objects, 0.19; overall average silhouette width 0.22.]
[Figure: ordination diagram (Dim 1 vs Dim 2) of the fuzzy clustering result, with sites labeled and assigned to Clusters 1-3.]
3.5 Validation with external data

Clustering of species data classifies objects (sites, patches, etc.) based on the differences in species composition. Validation of a given partition based on silhouettes or other means only considers how well the original distances are mapped onto the classification. However, what should also be of interest is the ecological interpretability of the grouping of objects: how can the clusters be explained? This is where data external to the species abundances come into the picture. A simple way of contrasting the clustering result with, e.g., environmental data is to use the grouping as a factor in an ANOVA (or a Kruskal-Wallis test).
3.5.1 Continuous predictors

The dune data set is not optimal for an external validation of clustering results (the associated environmental data contain only a single continuous variable). Instead, we will use here another data set from vegan, the varespec species data and varechem environmental data. Using the calinski criterion, a k-means partition on the Hellinger-transformed data gives an optimum of 3 groups, which is used here. The varechem data set contains 14 variables, but we will use only six of them: the content of nitrogen, phosphorus, calcium, aluminium, iron, and manganese in the soil (Figure 3.16).
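The code block behind Figure 3.16 was lost in extraction; a sketch of the steps it describes (a k-means partition of the Hellinger-transformed varespec data, followed by boxplots of six soil variables; the object names groups and env match those used in the tests below):

> data(varespec); data(varechem)
> X.vare = decostand(varespec, 'hellinger')
> KM = cascadeKM(X.vare, inf.gr=2, sup.gr=10, criterion='calinski')
> groups = factor(KM$partition[, "3 groups"])
> env = varechem[, c('N','P','Ca','Al','Fe','Mn')]
> par(mfrow=c(2,3))
> for(v in names(env)) boxplot(env[[v]] ~ groups, main=v, xlab='group')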
A global test of the log-transformed environmental variables (MANOVA) indicates that the groups do differ in their environmental conditions (log-transformed data are used for two reasons: (1) the variance tends to increase with the mean, and (2) a unit change in concentration is likely to be more important under low concentrations than it is under high concentrations):

> m = manova(as.matrix(log(env))~groups)
> summary(m)
          Df Pillai approx F num Df den Df    Pr(>F)    
groups     2 1.3075   5.3495     12     34 5.415e-05 ***
Residuals 21                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary statistics for the univariate tests can be accessed via summary.aov of the model object (here m). Doing so shows that (as should be clear from Figure 3.16) the best descriptors of the group differences are Al and Fe.
Note that we did not evaluate the assumptions of parametric testing here; with linear models, the assumption of normality is usually not the main concern, whereas heterogeneity of variance can be (see below).
[Figure 3.16: boxplots of the six soil variables (N, P, Ca, Al, Fe, Mn) for each of the three k-means groups.]
A plot of the standardized residuals against the fitted values (Figure 3.17) indicates heterogeneous variance between groups, so statistical tests of between-group differences are likely to be biased. The problem can be addressed, e.g., by explicitly accounting for this heterogeneity. In the function gls (package nlme) this can be done by specifying weights that describe the within-group variance structure; in this case we let the error variance differ between groups.
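The constant-variance model m.Fe used in the comparison below was defined in a code block lost in extraction; it was presumably the plain gls fit:

> library(nlme)
> m.Fe = gls(log(Fe)~groups,data=env,method='ML')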
> m.Fe.2 = gls(log(Fe)~groups,data=env,weights=varIdent(form=~1|groups),method='ML')
The two models can be compared with the anova function to evaluate the importance of heteroscedasticity for inference:

> anova(m.Fe,m.Fe.2)
       Model df      AIC      BIC    logLik   Test  L.Ratio p-value
m.Fe       1  4 64.36398 69.07620 -28.18199                        
m.Fe.2     2  6 62.59623 69.66455 -25.29811 1 vs 2 5.767754  0.0559
This test indicates that the model is not significantly improved by accounting for heterogeneous residual variance (the fit improves only marginally).
Figure 3.17: Standardized model residuals versus fitted values. Notice that the residual variance differs between groups even after log-transforming the response variable.
One way to visualize the relationship between a grouping and a continuous predictor is to overlay the predictor on an ordination plot, using either bubbles or smoothed surfaces (see, e.g., Oksanen, 2013). For example, let's consider how the concentrations of iron, aluminium, and manganese are distributed across the sites (Figure 3.18); a sketch of such code is given below.
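The plotting code for Figure 3.18 was lost; a sketch of one way to build such a figure (a PCoA of Bray-Curtis distances with smooth surfaces of the log concentrations overlaid; the choice of distance is an assumption):

> pcoa.pts = cmdscale(vegdist(varespec), k=2)
> par(mfrow=c(1,3))
> for(v in c('Fe','Al','Mn')){
+   plot(pcoa.pts, xlab='Dim 1', ylab='Dim 2', main=v, pch=19)
+   ordisurf(pcoa.pts, log(varechem[[v]]), add=TRUE)
+ }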
3.5.2 Categorical predictors

When both the response and the predictor variables are categorical, one is concerned with the analysis of contingency tables; that is, the response data are the counts of observations belonging to each combination of categories. The table of counts can then be analyzed using Poisson regression (a generalized linear model with Poisson-distributed errors).
Figure 3.18: The (log) concentration of iron, aluminium, and manganese illustrated in PCoA ordination diagrams
Here we can again use the dune data set as an example.
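The code that built the contingency table was lost in extraction; a sketch that produces a model of the form reported by drop1 below (counts ~ f1 * f2, with f1 the k-means grouping and f2 the management type; the column is called Manage here to match the chisq.test call further down, Management in some vegan versions):

> data(dune); data(dune.env)
> X.dune = decostand(dune, 'hellinger')
> groups = factor(kmeans(X.dune, centers=3, nstart=25)$cluster)
> tab = table(groups, dune.env$Manage)
> counts = as.vector(tab)
> f1 = factor(rep(rownames(tab), times=ncol(tab)))  # grouping
> f2 = factor(rep(colnames(tab), each=nrow(tab)))   # management
> m = glm(counts ~ f1 * f2, family=poisson)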
The analysis of a contingency table should start from the saturated model, proceeding by sequentially dropping non-significant terms. This can be done using the drop1 function with test = 'Chi', which effectively performs a likelihood ratio test between the models that do and do not contain a specific term.

> drop1(m,test='Chi')
Single term deletions

Model:
counts ~ f1 * f2
       Df Deviance    AIC    LRT Pr(>Chi)   
<none>       0.000 45.847                   
f1:f2   6   18.555 52.401 18.555 0.004986 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This test indicates that there is a significant interaction between the grouping (acquired from k-means clustering) and the management of the sites. A somewhat simpler approach is to use the chi-square test (chisq.test) to compare the groups with a categorical variable:

> chisq.test(groups,dune.env$Manage,simulate.p.value=T)

	Pearson's Chi-squared test with simulated p-value (based on 2000
	replicates)

data:  groups and dune.env$Manage
X-squared = 12.8222, df = NA, p-value = 0.04298
4 Ordination

4.1 Overview
Figure 4.1: Ordination of seven samples in the space of two variables
The first ever application of an ordination in an ecological context was pre-
4.2 Principal Component Analysis (PCA)

Let's imagine a data matrix where n sampling units are characterized by p variables. Graphically, the sampling units can be represented by a cloud of points in a p-dimensional space. These points are generally not distributed in a perfect sphere across all dimensions; the cloud of points can be elongated in one or a few directions, and flattened in others. In addition, the directions in which the points are spread are not necessarily aligned with a single dimension (i.e., with a single variable) of the multidimensional space. The direction in which the set of points is most elongated represents the direction with the largest variance of the set of points.
[Figure: the same sample points with the first and second PCA axes drawn in the space of the two original variables.]
The loadings of the variables show which variables contribute most to the first few principal components. It is also possible to represent the variables on the PCA diagram together with the sample points. However, it is important to note that when one is interested in the relationships among variables, another type of projection is preferable (this will be discussed later).

A principal component is constructed on an eigenvector with which an eigenvalue λ_i is associated. This eigenvalue defines the amount of variance represented by the principal component. The eigenvalues are always presented in decreasing order; that is, the first axis presents the most important part of the variance in the data, the second axis captures less information than the first axis but more than the following ones, and so on. The number of principal components in a PCA equals the number of variables in the original data set.
When performing a PCA on a matrix Y, it is usual to present the total variance of the data as a reference for evaluating the proportion of variance represented by each principal component. The total variance of a matrix can be calculated as follows:

$$\mathrm{Var}(\mathbf{Y}) = \frac{1}{n-1}\sum_{i=1}^{n}\sum_{j=1}^{p}\left(Y_{ij} - \bar{Y}_j\right)^2. \tag{4.1}$$

Within the PCA framework, the total variance of Y can also be calculated as the sum of all the eigenvalues:

$$\mathrm{Var}(\mathbf{Y}) = \sum_{i=1}^{p}\lambda_i, \tag{4.2}$$

so that the proportion of the variance represented by principal component i is

$$\frac{\lambda_i}{\sum_{k=1}^{p}\lambda_k}. \tag{4.3}$$
4.2.1 Correlation or covariance?

In a PCA, the association measure used to compare all pairs of variables is either the covariance or the correlation; both of these association measures are linear. It is important to decide which of the two should be used when computing a PCA, because of the Euclidean property of PCA: the Euclidean distance is very sensitive to the scales of the variables. For this reason, performing a PCA on the raw (that is, only centred) variables (yielding a PCA on a covariance matrix) is only valid when the variables are expressed in the same units and on comparable scales; otherwise the variables should be standardized, which amounts to a PCA on the correlation matrix.
4.2.2 Scaling

When performing a PCA, both the samples and the variables can be represented on the same diagram, called a biplot. There are two types of biplots that can be used to represent the result of a PCA, each with particular properties. With this in mind, if the main interest of the analysis is to interpret the relationships among samples, a distance biplot (scaling 1) should be used; if the interest is to study the relationships among variables, a correlation biplot (scaling 2) should be used.
Table 4.1: Properties of the distance biplot (scaling 1) and the correlation biplot (scaling 2) in PCA.

Distances among samples in the biplot
  - Scaling 1: approximation of the Euclidean distance in multidimensional space.
  - Scaling 2: meaningless.

Projection of a sample on a variable
  - Both scalings: projecting a sample at right angle on a variable approximates its position along that variable.

Length of a variable (arrow)
  - Scaling 1: 1 in full-dimensional space; in the reduced space it indicates the contribution of the variable.
  - Scaling 2: for a covariance-matrix PCA, the standard deviation of the variable in full-dimensional space (its length in the reduced space approximates its standard deviation); for a correlation-matrix PCA, 1 in full-dimensional space, indicating the contribution of the variable in the reduced space.

Angles between variables
  - Scaling 1: meaningless.
  - Scaling 2: reflect the covariance or correlation among the variables.
4.2.3 Equilibrium contribution circle

In all but one of the options discussed above (i.e., all except a PCA performed on a covariance matrix and plotted as a correlation biplot [scaling 2]), a circle representing the equilibrium contribution of the variables can be drawn on the plane defined by two principal components. The equilibrium contribution is the length a variable (a vector in the biplot) would have if it contributed equally to all the axes of the PCA. Variables whose vectors lie within the equilibrium circle contribute little to the given reduced space (i.e., the plane described by the first and second axes). Conversely, variables whose vectors extend beyond the radius of the equilibrium circle contribute more to the reduced space. Figure 4.4 shows an example of a PCA calculated on the dune data where an equilibrium circle is drawn. In a distance biplot (scaling 1), the radius of the equilibrium circle is calculated as:

$$\sqrt{d/p}, \tag{4.4}$$

where d is the number of dimensions of the reduced space (here 2) and p is the number of variables.
> # Figure 4.4:
> # -----------
> library(vegan)
> ### Load the dune data
> data(dune)
> ### Hellinger transformation of dune
> duneHell<-decostand(dune,method="hellinger")
> ### Perform PCA on the correlation matrix of the dune data
> PCABase<-rda(dune,scale=TRUE)
> ### Extract species information
> PCAsp<-scores(PCABase,choices=1:2,display="species",scaling=2)
> #==============
> ### Plot graphs
> #==============
> ### Plot basis
> par(mar=c(3,3,0.5,0.5),pty='s',mgp=c(2,.8,0))
> labels<-paste("PCA Axis",1:2," - ",round(eigenvals(PCABase)[1:2]
+               /sum(eigenvals(PCABase)),4)*100,"%",sep="")
> plot(PCAsp, asp=1,xlim=c(-1,1),ylim=c(-1,1),type="n",
+      xlab=labels[1],ylab=labels[2])
> abline(h=0,lty=2)
> abline(v=0,lty=2)
> ### Equilibrium circle
> symbols(0,0,circles=1,inches=FALSE,fg="darkgreen",add=TRUE,lwd=2)
> symbols(0,0,circles=sqrt(2/20),inches=FALSE,fg="blue",add=TRUE,lwd=2)
> arrows(0,0,PCAsp[,1],PCAsp[,2],length=0.1,lwd=2,angle=30,col="red")
> pos1<-which(PCAsp[,2] < 0)
> pos3<-which(PCAsp[,2] > 0)
> text(PCAsp[pos1,1],PCAsp[pos1,2],labels=rownames(PCAsp)[pos1],pos=1,cex=0.65,col="red")
> text(PCAsp[pos3,1],PCAsp[pos3,2],labels=rownames(PCAsp)[pos3],pos=3,cex=0.65,col="red")

[Figure 4.4: PCA of the dune data with the species shown as arrows and the equilibrium contribution circle drawn; PCA axis 1 accounts for 23.44% of the variance.]
4.2.4 How many axes should be interpreted?

PCA is not a statistical test. The goal of PCA is to represent the major features of a data matrix on a reduced number of axes, which is why the expression "ordination in reduced space" is often used to describe it. Generally, one studies the eigenvalues (i.e., the amount of variance represented by each axis) and decides how many axes are worth presenting. The decision can be arbitrary (e.g., only the axes that together represent 75% of the variance are considered). However, procedures have been proposed to distinguish the axes that represent interesting and valuable features of the data from the axes that display random variance. One can calculate the average of the eigenvalues and interpret only the axes associated with eigenvalues larger than that average. Another idea is to compute a so-called broken stick model, which divides a stick of unit length into as many pieces as there are axes in the PCA. The pieces are then ordered from longest to shortest and compared to the eigenvalues; one interprets only the axes whose eigenvalues are larger than the length of the corresponding piece. Figure 4.5 illustrates these two techniques for assessing the number of axes to interpret in a PCA.
> # Figure 4.5:
> # -----------
> ### Extract eigenvalues
> eigPCA<-as.vector(eigenvals(PCABase))
> eigPCAmean<-mean(eigPCA)
> ### Calculate the percentage of variation represented by each eigenvalue
> eigPCAVar<-eigPCA/sum(eigPCA)*100
> ### Construct broken stick model
> brokenStick<-bstick(length(eigPCA),tot.var=100)
> ### Combine eigenvalues and broken-stick model result
> PCAVar<-rbind(eigPCAVar,brokenStick)
> #==============
> ### Plot graphs
> #==============
> ### Plot basis
> par(mfrow=c(2,1),mar=c(4,5,0.5,0.5))
> barplot(eigPCA,ylab="Eigenvalue",cex.axis=0.9)
> abline(h=eigPCAmean,col="red",lwd=3)
> legend("topright","Average eigenvalue",lwd=3,col="red",cex=0.8,bty="n")
> barplot(PCAVar,beside=TRUE,col=c("lightblue","orange"),
+         names.arg=paste("Axis",1:19),las=3,ylab="Variance (%)",cex.axis=0.9)
> legend("topright",c("PCA Axis","Broken-stick model"),
+        fill=c("lightblue","orange"),cex=0.8,bty="n")
4.2.5 Pre-transformations for species data

Traditionally, PCA has been very useful for the ordination of matrices of environmental data. However, because it is a linear method with the Euclidean distance as the underlying distance, it is not suited to ordinating raw species abundance data, mainly because of the double-zero problem (i.e., zeros are treated like any other value in the data). Legendre & Gallagher (2001) found a way to overcome this problem: they propose to pre-transform the species data so that, after carrying out a PCA, the distance preserved among objects is no longer the Euclidean distance but an ecologically meaningful one, a distance that does not treat double-zeros as indications of resemblance between objects. The transformations proposed by Legendre & Gallagher (2001) and their associated distance coefficients are presented in
Figure 4.5: Bar plots to help decide how many PCA axes to interpret. The two results presented here were constructed using the dune data.
Table 4.2. Note that these pre-transformations can be used with many linear analytical methods, such as PCA, RDA, and k-means clustering.
Table 4.2: Pre-transformations of Legendre & Gallagher (2001) and their associated distances. In each case, the Euclidean distance computed on the transformed values $y'_{ij}$ equals the distance listed.

Chord distance:
  $D(x_1, x_2) = \sqrt{\sum_{j=1}^{p}\left(\frac{y_{1j}}{\sqrt{\sum_j y_{1j}^2}} - \frac{y_{2j}}{\sqrt{\sum_j y_{2j}^2}}\right)^2}$; transformation $y'_{ij} = \frac{y_{ij}}{\sqrt{\sum_{j=1}^{p} y_{ij}^2}}$

χ² distance:
  $D(x_1, x_2) = \sqrt{y_{++}}\sqrt{\sum_{j=1}^{p}\frac{1}{y_{+j}}\left(\frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}}\right)^2}$; transformation $y'_{ij} = \sqrt{y_{++}}\,\frac{y_{ij}}{y_{i+}\sqrt{y_{+j}}}$

Distance between species profiles:
  $D(x_1, x_2) = \sqrt{\sum_{j=1}^{p}\left(\frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}}\right)^2}$; transformation $y'_{ij} = \frac{y_{ij}}{y_{i+}}$

Hellinger distance:
  $D(x_1, x_2) = \sqrt{\sum_{j=1}^{p}\left(\sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}}\right)^2}$; transformation $y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}}$

4.3 Correspondence Analysis (CA)
4.3.1 Scaling

Unlike in PCA, in CA both the samples and the variables are usually displayed as points on the same biplot (also known as a joint plot). In CA there are three scalings, two of which, the distance scaling (scaling 1) and the correlation scaling (scaling 2), are commonly used in ecology. Scaling 3 (also known as Hill scaling) is less commonly used in ecology, but it has its merits. Table 4.3 describes the properties of each scaling and how results from each scaling should be interpreted.

An example of a CA is presented in Figure 4.6. The data used for this illustration are given in Table 4.4.
# Figure 4.6:
# -----------
library(vegan)
### Three fictitious species sampled at 7 sites
sp1<-c(1,2,3,5,3,6,2)
sp2<-c(4,3,5,1,4,3,4)
sp3<-c(1,0,2,1,0,3,5)
sp<-cbind(sp1,sp2,sp3)
rownames(sp)<-paste("SU",1:7)
### Perform a CA of the three fictitious species
CABase<-cca(sp)
Table 4.3: Properties of the three types of scaling in CA and how results from each scaling should be interpreted.

When to use. Scaling 1: when the interest lies in the ordination of the samples. Scaling 2: when the interest lies in the ordination of the variables (species). Scaling 3: when both samples and variables are important.

Properties. Scaling 1: rows are the centroids of the columns. Scaling 2: columns are the centroids of the rows. Scaling 3: rows and columns are both centred.

Distances among sample points*. Scaling 1: approximation of the $\chi^2$ distance in multidimensional space. Scaling 2: meaningless. Scaling 3: approximation of the $\chi^2$ distance in multidimensional space.

Distances among variable points†. Scaling 1: meaningless. Scaling 2: approximation of the $\chi^2$ distance in multidimensional space. Scaling 3: approximation of the $\chi^2$ distance in multidimensional space.

* For scalings 1 and 3: (1) sample points close to one another are generally similar in their species frequencies; (2) for abundance data, a sample point close to a species point is likely to have a high contribution of that species; (3) for presence-absence data, the probability of an occurrence is higher for sample points closer to a species point.
† For scalings 2 and 3: (1) a variable point close to a sample point is likely to have a higher frequency at that sample than in samples further away; (2) variable points close to one another are likely to have relatively similar relative frequencies across the samples.
Table 4.4: The three fictitious species sampled at seven sampling units (SU), used for the CA example of Figure 4.6.

       Sp 1   Sp 2   Sp 3
SU 1     1      4      1
SU 2     2      3      0
SU 3     3      5      2
SU 4     5      1      1
SU 5     3      4      0
SU 6     6      3      3
SU 7     2      4      5
Figure 4.6: First and second axes of the CA of the data shown in Table 4.4. (a)
Scaling 1. (b) Scaling 2. (c) Scaling 3.
### Axis labels showing the percentage of variation and the eigenvalues
eig<-eigenvals(CABase)
explAxis<-round(eig/sum(eig),4)*100
labels<-paste("CA Axis ",1:2," - ",explAxis[1:2],"% (",round(eig[1:2],3),")",sep="")
par(mar=c(4,5,0.5,0.5))
layout(matrix(c(1,1,2,2,0,3,3,0),byrow=TRUE,nrow=2,ncol=4))
plot(CABase,scaling=1,xlab=labels[1],ylab=labels[2])
leg<-legend("topright","(a)",bty="n")
plot(CABase,scaling=2,xlab=labels[1],ylab=labels[2])
leg<-legend("topright","(b)",bty="n")
plot(CABase,scaling=3,xlab=labels[1],ylab=labels[2])
leg<-legend("topright","(c)",bty="n")
4.3.2 Word of caution
Figure 4.7: Succession of species along an ideal gradient (species packing model).
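The code that generated the community matrix comm used in the next block is not reproduced here; a minimal sketch that simulates such a data set, assuming unimodal (Gaussian) species responses along a single gradient as in Figure 4.7, could look like the following (all names and parameter values are illustrative):

library(vegan)
### Hypothetical species packing model: 15 species with Gaussian responses
### along a gradient of 100 sampling units
set.seed(42)
gradient<-1:100
optima<-seq(5,95,length.out=15)
comm<-sapply(optima,function(opt)
  rpois(length(gradient),lambda=20*exp(-(gradient-opt)^2/(2*10^2))))
colnames(comm)<-paste0("sp",seq_along(optima))
matplot(gradient,comm,type="l",lty=1,
        xlab="Sampling units",ylab="Abundance")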
# Figure 4.8:
# -----------
### 'comm' is the community matrix simulated along the gradient of Figure 4.7
par(mar=c(4,5,0.5,0.5),mfrow=c(1,2))
CA<-cca(comm)
PCA<-rda(comm)
CAsites<-scores(CA,display="sites",choices=1:2,scaling=1)
PCAsites<-scores(PCA,display="sites",choices=1:2,scaling=1)
plot(CAsites,xlab="CA Axis 1",ylab="CA Axis 2",cex.lab=1.5,pch=19,col="blue")
abline(h=0,lty=2)
abline(v=0,lty=2)
legend("top","(a)",cex=2,bty="n")
plot(PCAsites,xlab="PCA Axis 1",ylab="PCA Axis 2",cex.lab=1.5,pch=19,col="blue")
abline(h=0,lty=2)
abline(v=0,lty=2)
legend("top","(b)",cex=2,bty="n")
Figure 4.8: (a) CA showing the arch effect and (b) PCA showing the horseshoe effect on the data of Figure 4.7. Scaling 1 for both CA and PCA.
4.4 Principal coordinate analysis (PCoA)
PCA and CA (at least in their classic forms) impose the distance preserved
among samples: the Euclidean distance for PCA and the $\chi^2$ distance for CA (remember,
however, that one can modify this to some extent by pre-transforming the
data (see Table 4.2) before carrying out a PCA). If one would like to
ordinate samples on the basis of yet another distance measure, more appropriate
to the problem at hand, then PCoA is the method to apply. It allows one to obtain
a Euclidean representation of a set of samples whose relationships are measured
by any similarity or distance coefficient chosen by the user (Figure 4.9).
# Figure 4.9:
# -----------
par(mar=c(3,3,0.5,0.5),pty='s',mgp=c(2,.8,0))
### PCoA of the Bray-Curtis distances among the 7 sampling units
PCoA<-cmdscale(vegdist(sp,"bray"),eig=TRUE)
PCoAEig<-PCoA$eig
### Percentage of variation per axis, corrected for negative eigenvalues (eq. 4.6)
PCoAAxis<-(PCoAEig[1:2]+abs(min(PCoAEig)))/
  (sum(PCoAEig)+((nrow(sp)-1)*abs(min(PCoAEig))))
PCoAAxisPres<-round(PCoAAxis,4)*100
labels<-paste("PCoA Axis ",1:2," - ",PCoAAxisPres,"% (",round(PCoAEig[1:2],3),")",sep="")
plot(PCoA$points[,1:2],xlab=labels[1],ylab=labels[2],
     ylim=range(c(PCoA$points[,2],0.24)),cex.lab=1.5,pch=19,cex=2)
text(PCoA$points[,1:2],labels=1:7,cex=1.25,pos=3)
abline(h=0,lty=2)
abline(v=0,lty=2)
Like PCA and CA, PCoA produces a set of orthogonal axes whose importance
is measured by eigenvalues. When negative eigenvalues are obtained in a PCoA,
one needs to apply the following calculation to compute the amount of variance
explained by a particular axis:
$$\frac{\lambda_k + \left|\min(\boldsymbol{\lambda})\right|}{\left(\sum_{k=1}^{n} \lambda_k\right) + (n-1)\left|\min(\boldsymbol{\lambda})\right|} \qquad (4.6)$$
### PCoA plot with species added a posteriori as weighted averages
par(mar=c(4,5,0.5,0.5))
plot(PCoA$points[,1:2],xlab=labels[1],ylab=labels[2],cex.lab=1.5,type="n")
abline(h=0,lty=2)
abline(v=0,lty=2)
### Plot samples
text(PCoA$points[,1:2],labels=1:7,cex=1.25)
### Plot species
spWa<-wascores(PCoA$points,sp)
text(spWa[,1:2],labels=colnames(sp),cex=1.25,col="red")
In the case of Euclidean association measures, PCoA will behave in a Euclidean manner. For instance, computing the Euclidean distance among sites and
running a PCoA will yield the same result as running a PCA on a covariance
matrix with scaling 1 on the same data. But if the association coefficient used
is non-metric, semimetric, or otherwise non-Euclidean, then
PCoA will react by producing several negative eigenvalues in addition to the
positive ones (and a null one in between). The negative eigenvalues can be seen
as representing the non-Euclidean part of the structure of the association matrix, which is, of course, not representable on real ordination axes. In
most cases this does not affect the representation of the samples on the first few
principal axes, but in some applications it can lead to problems. There are
technical solutions to this problem (e.g. the Lingoes and Cailliez corrections), but
they are not always recommendable and go beyond the scope of this lecture.
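A quick numerical check of the Euclidean case (a sketch, assuming the vegan dune data): for Euclidean distances, each PCoA eigenvalue equals the corresponding PCA eigenvalue of the covariance matrix multiplied by n - 1.

library(vegan)
data(dune)
n<-nrow(dune)
pcaEig<-as.vector(eigenvals(rda(dune)))              # PCA on the covariance matrix
pcoaEig<-cmdscale(dist(dune),k=n-1,eig=TRUE)$eig     # PCoA of Euclidean distances
round(pcoaEig[1:5]-(n-1)*pcaEig[1:5],6)              # differences are essentially zero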
4.5 Non-metric multidimensional scaling (NMDS)
If the user's priority is not to preserve the exact distances among samples, but
rather to represent as well as possible the ordering relationships among samples
in a small and specified number of axes, then NMDS may be the solution. Like
PCoA, NMDS is not limited to Euclidean distance matrices. It can produce
ordinations of samples from any distance matrix. The method can also proceed
with missing distance estimates, as long as there are enough measures left to
position an object with respect to a few others.
NMDS is not an eigenvalue technique, and it does not maximise the variability associated with individual axes of the ordination. As a result, plots may
arbitrarily be rotated, centred, or inverted. The procedure goes as follows (very
schematically; for details see Borcard et al. 2011, section 5.6):
Step 1 Specify the number m of axes (dimensions) desired.
Step 2 Construct an initial configuration of the objects in the m dimensions,
to be used as a starting point of an iterative adjustment process. This is a
tricky step, since the end-result may depend on the starting configuration.
Step 3 An iterative procedure seeks to position the objects in the desired number of dimensions in such a way as to minimize a stress function (scaled
from 0 to 1), which measures how far the reduced-space configuration is
from being monotonic to the original distances in the association matrix.
Step 4 The adjustment goes on until the stress value can no more be diminished,
or it attains a predefined value (tolerated lack-of-fit).
Step 5 Most NMDS programs rotate the final solution using PCA for easier
interpretation.
For a given and small number of axes (e.g. 2 or 3), NMDS often achieves
a less deformed representation of the relationships among objects than a PCoA
can show on the same number of axes. But NMDS remains a computer-intensive
solution, exposed to the risk of suboptimal solutions in the iterative process
(because the objective function to minimize has reached a local minimum).
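No NMDS code is shown above, so here is a minimal sketch using vegan's metaMDS() on the dune data (the data set and settings are illustrative assumptions). metaMDS() runs the iterative procedure from several random starts, uses the Bray-Curtis distance by default, and rotates the final configuration with a PCA.

library(vegan)
data(dune)
### Two-dimensional NMDS from Bray-Curtis distances, several random starts
NMDSdune<-metaMDS(dune,distance="bray",k=2,trymax=50)
NMDSdune$stress                  # final stress (lack of fit)
stressplot(NMDSdune)             # Shepard diagram: ordination vs. original distances
ordiplot(NMDSdune,type="t")      # plot of the final configuration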
5 Canonical ordination
5.1
Response data     Explanatory variables   Analysis
1 variable        1 variable              Simple regression
1 variable        Many variables          Multiple regression
Many variables    No variable             Simple ordination
Many variables    Many variables          Canonical ordination
The variation of the response data that cannot be explained by the explanatory variables is expressed on a series of unconstrained axes following the canonical ones.
Because in many cases the explanatory variables are not dimensionally homogeneous, canonical ordinations are usually carried out with standardized explanatory variables. In RDA, this does not affect the choice
between running the analysis on a covariance or a correlation matrix, however,
since this choice concerns the response (Y) variables.
Depending on the algorithm used, the search for the optimal linear combinations of explanatory variables that represent the orthogonal canonical axes
is done sequentially (axis by axis, using an iterative algorithm) or in one step
(direct algorithm). Figure 5.1, which is Figure 11.2 of Legendre & Legendre
(2012, p. 631), summarises the steps of a redundancy analysis (RDA) using the
direct algorithm:
Step 1 Regress each dependent variable separately on the explanatory variables
and compute the fitted and residual values of the regressions.
Step 2 Run a PCA of the matrix of fitted values of these regressions.
Step 3 Use the matrix of canonical eigenvectors to compute two sorts of ordinations:
(a) An ordination in the space of the response variables (species space);
the ordination axes are not orthogonal in this ordination;
(b) An ordination in the space of the explanatory variables; this yields
the fitted site scores; the canonical axes obtained here are orthogonal
to one another;
Step 4 Use the matrix of residuals from the multiple regressions to compute an
unconstrained ordination (PCA in the case of an RDA).
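The steps above can be written out explicitly. The following is a bare-bones sketch (assuming the vegan varespec and varechem data, Hellinger-transformed responses and standardized predictors); in practice vegan's rda(Y ~ ., data = X) performs all of these steps internally.

library(vegan)
data(varespec); data(varechem)
Y<-scale(decostand(varespec,"hellinger"),scale=FALSE)   # centred response matrix
X<-scale(varechem)                                      # standardized explanatory matrix
### Step 1: regress each response on X -> fitted and residual values
B<-solve(t(X)%*%X)%*%t(X)%*%Y
Yfit<-X%*%B
Yres<-Y-Yfit
### Step 2: PCA of the fitted values -> matrix of canonical eigenvectors U
U<-prcomp(Yfit,center=FALSE)$rotation
### Step 3: ordination in the space of Y (F = YU) and in the space of X (Z = Yfit U)
Fscores<-Y%*%U
Zscores<-Yfit%*%U
### Step 4: PCA of the residuals -> unconstrained (non-canonical) axes
Ures<-prcomp(Yres,center=FALSE)$rotation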
Redundancy analysis (RDA) is the canonical version of principal component
analysis (PCA). Canonical correspondence analysis (CCA) is the canonical version of correspondence analysis (CA).
Due to various technical constraints, the maximum numbers of canonical and
non-canonical axes differ (Table 5.2).
Graphically, the results of RDA and CCA are presented in the form of biplots or triplots, i.e. scattergrams showing the samples, response variables
(usually species) and explanatory variables on the same diagram. In canonical
ordinations, explanatory variables can be qualitative (the multiclass ones are
coded as a series of binary variables) or quantitative. A qualitative explanatory
variable is represented on the bi- or triplot as the centroid of the sites that have
the description 1 for that variable, and the quantitative ones are represented as
vectors. The analytical choices are the same as for PCA and CA with respect to
the analysis on a covariance or correlation matrix (RDA) and the scaling types
(RDA and CCA). Table 5.3 presents how an RDA triplot should be interpreted.
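As a sketch of how such triplots are produced in practice (assuming vegan and its varespec/varechem data, with Hellinger-transformed species), the two scaling types can be drawn as follows:

library(vegan)
data(varespec); data(varechem)
RDAex<-rda(decostand(varespec,"hellinger")~.,data=varechem)
par(mfrow=c(1,2),mar=c(4,4,1,1))
plot(RDAex,scaling=1)   # distance triplot
plot(RDAex,scaling=2)   # correlation triplot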
Table 5.2: Maximum number of non-zero eigenvalues and corresponding eigenvectors that may be obtained from canonical analysis of a matrix of response
variables Y (n × p) and a matrix of explanatory variables X (n × m) using redundancy analysis (RDA) or canonical correspondence analysis (CCA). This is
Table 11.1 from Legendre & Legendre (2012).

       Canonical eigenvalues and eigenvectors   Non-canonical eigenvalues and eigenvectors
RDA    min(p, m, n - 1)                         min(p, n - 1)
CCA    min(p - 1, m, n - 1)                     min(p - 1, n - 1)
Table 5.3: Properties of the distance triplot (scaling 1) and the correlation triplot (scaling 2) in RDA.

Distances among sample points. Scaling 1: approximate the Euclidean distance. Scaling 2: meaningless.
[Figure 5.1 here. The flowchart shows the fitted values of the multiple regressions, $\hat{Y} = X(X^{t}X)^{-1}X^{t}Y$; a PCA of $\hat{Y}$ giving the matrix of canonical eigenvectors $U$; the ordination in the space of Y, $F = YU$, and the ordination in the space of X, $Z = \hat{Y}U$; and a PCA of the residuals $Y_{res} = Y - \hat{Y}$ giving the residual eigenvectors $U_{res}$ for the ordination in the space of the residuals.]
Figure 5.1: The steps to perform a redundancy analysis (RDA) using a direct
algorithm. This is a modification of Figure 11.2 of Legendre & Legendre (2012).
In CCA, one can use the same types of scaling as in CA. Samples and response
variables are plotted as points on the triplot. For the response variables (species)
and samples, the interpretation is the same as in CA. The interpretation of the
explanatory variables in CCA is made as follows:
# Figure 5.3:
# -----------
### CCA using varespec (community matrix) and varechem (explanatory variables)
### varechemSc: the standardized varechem table created earlier,
### e.g. varechemSc<-as.data.frame(scale(varechem))
CCA<-cca(varespec,varechemSc)
par(mar=c(4,5,0.5,0.5))
ordiplot(CCA,scaling=1,type="t")
5.2
5.2.1 Variation partitioning
In the same way as one can do partial regression, it is possible to run partial
canonical ordinations.
Figure 5.3: CCA example using the varespec and varechem data.
It is thus possible to run, for instance, a CCA of a species data matrix (Y), explained by a matrix of climatic variables (X), controlling for the edaphic variables (W). Such an analysis allows the user to
assess how much species variation can be uniquely attributed to climate once
the effect of the soil factors has been removed. This possibility led Borcard
et al. (1992) to devise a procedure called variation partitioning in a context of
spatial analysis. One explanatory matrix (X) contains the environmental variables, and the other (W) contains the x-y geographical coordinates of the sites,
augmented to their third-order polynomial function:

$$b_0 + b_1 x + b_2 y + b_3 x^2 + b_4 xy + b_5 y^2 + b_6 x^3 + b_7 x^2 y + b_8 xy^2 + b_9 y^3 \qquad (5.1)$$
The procedure aims at partitioning the variation of a Y matrix of species
data into the following fractions (Figure 5.4):
[a] Variation explained solely by matrix X
[b] Variation explained by matrix X and W
[c] Variation explained solely by matrix W
[d] Unexplained variation
If run with RDA, the partitioning is done under a linear model, the total SS
of the Y matrix is partitioned, and it corresponds strictly to what is obtained
by multiple regression if the Y matrix contains only one response variable. If
run with CCA, the partitioning is done on the total inertia of the Y matrix.
More recently, Borcard & Legendre (2002), Borcard et al. (2004), Dray et al.
(2006) and Blanchet et al. (2008b) have proposed to replace the spatial polynomial
by a much more powerful representation of space defined using various types of
spatial eigenfunctions. See Chapter 7 of Borcard et al. (2011) for more details.
# Figure 5.4:
# -----------
par(mar=c(0.5,0.5,0.5,0.5))
showvarparts(2,cex=3,lwd=3)
Figure 5.4: The fractions of variation obtained by partitioning a response data
set Y (large rectangle) with two explanatory data matrices X (Fractions [a]+[b])
and W (Fractions [b] + [c]).
Fractions [a]+[b], [b]+[c], [a] alone and [c] alone can be obtained by canonical
or partial canonical analyses. Fraction [b] does not correspond to a fitted fraction
of variation and can only be obtained by subtraction of some of the fractions
obtained by the ordinations.
The procedure must be run as follows if one is interested in the R2 values of
the four fractions:
Step 1 Perform an RDA (or CCA) of Y explained by X. This yields fraction
[a] + [b].
Step 2 Perform an RDA (or CCA) of Y explained by W. This yields fraction
[b] + [c].
Step 3 Perform an RDA (or CCA) of Y explained by X and W together. This
yields fraction [a] + [b] + [c].
The R² values obtained above are unadjusted, i.e. they do not take into
account the numbers of explanatory variables used in matrices X and W. In
canonical ordination as in regression analysis, R² always increases when an explanatory variable $x_i$ is added to the model, regardless of the real meaning of this
variable. In the case of regression, to obtain a better estimate of the population
coefficient of determination ($\rho^2$), Zar (1999, p. 423), among others, proposes to
use an adjusted coefficient of multiple determination:

$$R^2_{adj} = 1 - \frac{n-1}{n-m-1}\,(1 - R^2) \qquad (5.2)$$
As Peres-Neto et al. (2006) have shown using extensive simulations, this formula can be applied to the fractions obtained above in the case of RDA
(but not CCA), yielding adjusted fractions: ([a]+[b])adj, ([b]+[c])adj and
([a]+[b]+[c])adj. These adjusted fractions can then be used to obtain the individual adjusted fractions:
Step 4 Fraction [a]adj is obtained by subtracting ([b]+[c])adj from ([a]+[b]+[c])adj.
Step 5 Fraction [b]adj is obtained by subtracting [a]adj from ([a]+[b])adj.
Step 6 Fraction [c]adj is obtained by subtracting ([a]+[b])adj from ([a]+[b]+[c])adj.
Step 7 Fraction [d]adj is obtained by subtracting ([a]+[b]+[c])adj from 1 (i.e.
the total variance of Y).
Alternatively, if one is interested in the fitted site scores for fractions [a]adj
and [c]adj, the partitioning can be run using partial canonical ordinations. Note,
however, that it is not possible to obtain the R²adj values on this basis:
Step 1 Perform an RDA (or CCA) of Y explained by X. This yields fraction
[a]+[b].
Step 2 Perform a partial RDA (or CCA) of Y explained by X, controlling for
W. This yields fraction [a].
Step 3 Perform a partial RDA (or CCA) of Y explained by W, controlling
for X. This yields fraction [c].
Step 4 Fraction [b] is obtained by subtracting [a] from [a]+[b].
Step 5 Fraction [d] is obtained by subtracting [a]+[b]+[c] from 1 (RDA) or from
the total inertia of Y (CCA).
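In vegan, the whole partitioning, including the adjusted fractions, is obtained in one call with varpart(), and any individual testable fraction can be fitted as a (partial) RDA. A sketch, assuming the varespec/varechem data and a purely illustrative split of the chemistry variables into two explanatory tables X and W:

library(vegan)
data(varespec); data(varechem)
Y<-decostand(varespec,"hellinger")
X<-varechem[,c("N","P","K")]        # hypothetical first explanatory table
W<-varechem[,c("Al","Fe","pH")]     # hypothetical second explanatory table
### Adjusted fractions [a], [b], [c] and residuals [d]
vp<-varpart(Y,X,W)
vp
plot(vp)
### Fraction [a] (X alone) as a testable partial RDA
aFrac<-rda(Y~N+P+K+Condition(Al+Fe+pH),data=varechem)
anova(aFrac,permutations=999)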
It must be emphasised here that fraction [b] has nothing to do with
the interaction term of an ANOVA! In ANOVA, an interaction measures the
effect that an explanatory variable (a factor) has on the influence
of the other explanatory variable(s) on the dependent variable. An
interaction can have a non-zero value when the two explanatory variables are
orthogonal, which is precisely the situation where fraction [b] is equal to zero. Fraction
[b] arises because there is some correlation between matrices X and W. Note
that in some cases fraction [b] can even take negative values. This happens, for
instance, if matrices X and W have strong opposite effects on matrix Y while
being positively correlated to one another.
This variation partitioning procedure can be extended to more than two
explanatory matrices, and can be applied outside the spatial context.
5.2.2 Forward selection of explanatory variables
There are situations where one wants to reduce the number of explanatory
variables in a regression or canonical ordination model. An approach commonly
used for this purpose is forward selection. This is how it works:
Step 1 Compute the independent contribution of each of the m explanatory variables to the explanation of the variation of the response data table. This
is done by running m separate canonical analyses.
Step 2 Test the significance of the contribution of the best variable (i.e. the
one with the highest R²) and calculate its R²adj.
Step 3 If the contribution is significant and its R²adj does not exceed the R²adj calculated on
the full model (i.e. a model constructed with all the explanatory variables
of interest), include the variable in the model as the first explanatory variable.
Remarks
(a) The tests are run by random permutations.
(b) Like all variable selection procedures (forward, backward or stepwise), this
one does not guarantee that the best model is found. From the second step
on, the inclusion of variables is conditioned by the nature of the variables
that are already in the model.
(c) As in all regression models, the presence of strongly intercorrelated explanatory variables renders the regression/canonical coefficients unstable.
Forward selection does not necessarily eliminate this problem since even
strongly correlated variables may be admitted into a model.
(d) Forward selection can help when several candidate explanatory variables
are strongly correlated, but the choice has no a priori ecological validity.
In this case it is often advisable to eliminate one of the intercorrelated
variables on an ecological rather than a statistical basis.
(e) The classic forward selection is a rather conservative procedure when compared to backward elimination (see below): it tends to admit a smaller set
of explanatory variables. In absolute terms, however, it is relatively liberal. The forward selection procedure proposed by Blanchet et al. (2008a)
is much better at avoiding the selection of spurious variables.
(f ) If one wants to select an even larger subset of variables, another choice is
backward elimination, where one starts with all the variables included
and removes, one by one, the variables whose partial contributions are not
significant. The partial contributions must be recomputed at each
step. Backward elimination is not offered in packfor; it is, however, offered
in vegan through the ordistep function.
(g) In cases where several correlated explanatory variables are present, without
clear a priori reasons to eliminate one or the other, one can examine the
variance inflation factors (VIF).
(h) The variance inflation factors (VIF) measure how much the variance of the
canonical coefficients is inflated by the presence of correlations among explanatory variables. This measures in fact the instability of the regression
model. As a rule of thumb, ter Braak recommends that variables that
have a VIF larger than 20 be removed from the analysis. Beware: always
remove the variables one at a time and recompute the analysis, since the
VIF of every variable depends on all the others!
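As a sketch (assuming vegan and the varespec/varechem data), the forward selection with the double stopping criterion of Blanchet et al. (2008a) is available in vegan as ordiR2step(), and vif.cca() returns the variance inflation factors of a fitted model:

library(vegan)
data(varespec); data(varechem)
Y<-decostand(varespec,"hellinger")
mod0<-rda(Y~1,data=varechem)      # intercept-only model
modFull<-rda(Y~.,data=varechem)   # full model: upper limit for R2adj
sel<-ordiR2step(mod0,scope=formula(modFull),permutations=999)
sel$anova                         # variables retained, in order of inclusion
vif.cca(sel)                      # variance inflation factors of the selected model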
5.3 Distance-based redundancy analysis (db-RDA)
For cases where the user does not want to base the comparisons among objects on the distances that are preserved in CCA or RDA (including through the species
pre-transformations), another approach to canonical ordination is possible: db-RDA (Legendre & Anderson, 1999). Described in the framework of multivariate
ANOVA testing, the steps of a db-RDA are as follows (Figure 5.5):
Step 1 Compute a distance matrix from the raw data using the most appropriate association coefficient.
Step 2 Compute a PCoA of the matrix obtained in Step 1. If necessary, correct
for negative eigenvalues (Lingoes or Cailliez correction), because the aim
here is to conserve all the data variation.
Step 3 Compute an RDA, using the objects' principal coordinates as the response
(Y) matrix and the matrix of explanatory variables as the X matrix.
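In vegan these steps are wrapped in capscale() (and the closely related dbrda()). A minimal sketch, assuming the varespec/varechem data and a Bray-Curtis distance; the add argument requests the Lingoes correction in recent vegan versions:

library(vegan)
data(varespec); data(varechem)
### db-RDA: Bray-Curtis distances among sites, explained by selected chemistry
dbRDAex<-capscale(vegdist(varespec,"bray")~N+P+K+pH,
                  data=varechem,add="lingoes")
anova(dbRDAex,permutations=999)   # overall permutation test of the constrained model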
Figure 5.5: The steps to perform a db-RDA. Modified from Legendre and Anderson (1999).
5.4 Consensus RDA
Figure 5.7: Schematic representation of consensus RDA. (a) The first step of
the procedure is to perform a series of RDAs (tb-RDA or db-RDA) to model
the community data Y using the explanatory variables X. Each RDA is computed
with a different dissimilarity coefficient using scaling type 1 (distance triplot,
Z matrices). In the figure, K different dissimilarity coefficients are used. (b)
For each of the K dissimilarity coefficients, the significant axes within each Z
matrix are grouped in a large matrix. (c) An RDA is then performed on this large
matrix using X as the explanatory variables. (d) This RDA yields the consensus site-score
matrix Z*, a diagonal matrix of eigenvalues Λ*, and the consensus
canonical coefficients C*. (e) Equation 5.3 is then used to obtain the consensus
species scores U*. (f) Z*, U*, and C* can be used to draw a consensus RDA
triplot; the eigenvalues in Λ* show the importance of each axis in the consensus
triplot. This figure was modified from Blanchet et al. (in press).
References
Blanchet, F.G., Legendre, P., Bergeron, J.A.C. & He, F. (in press). Consensus RDA across dissimilarity coefficients for canonical ordination of community composition data. Ecological Monographs.
Blanchet, F.G., Legendre, P. & Borcard, D. (2008a). Forward selection of explanatory spatial variables. Ecology, 89, 2623–2632.
Blanchet, F.G., Legendre, P. & Borcard, D. (2008b). Modelling directional spatial processes in ecological data. Ecological Modelling, 215, 325–336.
Borcard, D., Gillet, F. & Legendre, P. (2011). Numerical Ecology with R. Use R! Springer, New York.
Borcard, D. & Legendre, P. (2002). All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling, 153, 51–68.
Borcard, D., Legendre, P., Avois-Jacquet, C. & Tuomisto, H. (2004). Dissecting the spatial structure of ecological data at multiple scales. Ecology, 85, 1826–1832.
Borcard, D., Legendre, P. & Drapeau, P. (1992). Partialling out the spatial component of ecological variation. Ecology, 73, 1045–1055.
Dray, S., Legendre, P. & Peres-Neto, P.R. (2006). Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modelling, 196, 483–493.
Goodall, D.W. (1954). Objective methods for the classification of vegetation. III. An essay in the use of factor analysis. Australian Journal of Botany, 2, 304–324.
Greenacre, M. & Primicerio, R. (2013). Multivariate Analysis of Ecological Data. Fundación BBVA.
Legendre, P. & Anderson, M.J. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69, 1–24.
Legendre, P. & De Cáceres, M. (2013). Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters, 16, 951–963.
Legendre, P. & Gallagher, E. (2001). Ecologically meaningful transformations for ordination of species data. Oecologia, 129, 271–280.
Legendre, P. & Legendre, L. (2012). Numerical Ecology. 3rd edn. Vol. 24 of Developments in Environmental Modelling. Elsevier.