Applying Multivariate Methods
Peter Henderson
& Richard Seaby Pisces Conservation Ltd
Applying Multivariate Methods using R, CAP and Ecom
This book, based upon A Practical Handbook for Multivariate Methods (2008), is invaluable for anyone interested in multivariate statistics, and has been
extensively revised to reflect the ever-growing popularity of R in statistical analysis. All the main multivariate techniques used in Advertising, Archaeology,
Anthropology, Botany, Geology, Palaeontology, Sociology and Zoology are clearly presented in a manner which does not assume a mathematical
background. The authors’ aim is to give you confidence in the use of multivariate methods so that you can clearly explain why you have chosen a particular
method. The book is richly illustrated, showing how your results can be displayed in both publications and presentations. For each method, an introductory
section describes the method of calculation and the type of data to which it should be applied. This is followed by a series of carefully selected examples
from published papers covering a wide range of disciplines. You will find examples from fields you are familiar with. The data sets and R code used are
available from the Pisces Conservation website, allowing you to repeat the analyses and explore the various approaches.
This book has been written by the team that created the software titles Community Analysis Package (CAP) and ECOM. The two programs have themselves
been updated to run R code natively within them, and tips are given on how to use these programs to get the best presentation for your ideas.
Peter Henderson is a director of Pisces Conservation and a Senior Research Associate of the
Department of Zoology, University of Oxford, England. His personal research area is community and
population ecology, and he is author of “Ecological Methods”, “Marine Ecology”, “Practical Methods in Ecology”
and “Ecological Effects of Electricity Generation, Storage and Use”. He teaches multivariate statistical methods. He
was part of the team that wrote the Community Analysis Package and ECOM software.
Richard Seaby is a director of Pisces Conservation where he develops software and was the designer of
the Community Analysis Package and ECOM user interface. His personal research area is aquatic ecology
and the population dynamics of fish.
© Pisces Conservation Ltd, 2019
www.pisces-conservation.com
Pisces Conservation Ltd
Dr Peter A. Henderson
Dr Richard M. H. Seaby
© Peter A. Henderson and Richard M. H. Seaby, Pisces Conservation Ltd, 2019
IRC House, The Square,
Pennington, Lymington, Hants
SO41 8GN, UK
Phone +44 (0)1590 674000
[email protected]
www.pisces-conservation.com
ISBN 978-1-904690-67-2
Extensively revised and based upon “A Practical Handbook for Multivariate Methods”,
by the same authors, published in Great Britain in 2008 by Pisces Conservation Ltd,
ISBN 978-1-904690-57-3
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, without the prior permission in writing of the publisher, nor be circulated in any form of
binding or cover other than that in which it is published.
This is the electronic version of our printed work, “Applying Multivariate Methods”,
(ISBN 978-1-904690-66-5)
Applying Multivariate Methods
Preface
Multivariate methods such as Principal Component Analysis, Correspondence Analysis and Discriminant Analysis are
essential tools for researchers in all scientific disciplines and are important in commercial activities such as marketing
and market analysis. Not everyone who needs to use these methods has either the skills or the time to acquire a deep
understanding of the mathematical procedures and theory underlying these methods. What all practitioners do need is
an appreciation of the strengths and weaknesses of the available methods, so that they can confidently explain why they
chose a particular methodology. We also need to appreciate the pitfalls of these methods, so that we
correctly interpret the features within our data. This book is aimed at giving you these skills. It describes, as far as possible
in non-mathematical terms, how all the key techniques are undertaken. The strengths and weaknesses of each method
are discussed, so you can confidently explain why you chose to use a particular approach. Most importantly, this book
presents numerous examples taken from published work. These examples have been carefully chosen to reflect the range
of disciplines that use multivariate techniques. Thus, archaeologists, anthropologists, botanists, geologists, palaeontologists
and zoologists and many others will find examples that relate to their subject. This has been done to ensure that the text
is relevant to your needs, and also to help in the learning process. I have found it easier to teach these complex techniques
if the examples are drawn from subjects with which the researcher is familiar.
The writing of this book follows on from the development of CAP, the Community Analysis Package, with Dr Richard
Seaby and other members of Pisces. This program was developed following discussions with undergraduate and postgraduate students at the University of Oxford who needed a simple, intuitive, Windows-based suite of programs
for multivariate analysis. Subsequently, we developed Ecom to undertake constrained ordination when environmental
variables potentially able to explain the observed pattern have also been collected. Together, these packages have been
under continuous development and improvement for more than 15 years, gradually becoming easier to use and more
versatile in terms of their presentation of the results. Most recently we added the ability to run R packages within the CAP
and Ecom environment. This allowed those who found R difficult and did not have the time to invest in a training course
to have access to R packages. In this revised book I introduce the R packages available to undertake multivariate analyses
and present working R code. This code is generally a simple working version that the reader can modify and enhance as
they become proficient in R. While this book should prove useful irrespective of the software that will be used to analyse
the data, care has been taken to produce information boxes separate from the main run of the text that explain how to
produce the output displayed using CAP and Ecom. Instructions on how to use these programs were written by Richard
Seaby, who was the developer responsible for the design of the Windows user interface and the graphics.
Finally, the best way to learn is by experimenting with data. This book is supported by a website1 from which you can
download for yourself all of the data sets used to undertake the analyses. If you are using CAP and Ecom, you will be able
to produce all of the graphs printed in the book as they were generated using these programs.
Peter Henderson
Director, Pisces Conservation Ltd
Senior Research Associate, University of Oxford.
1 www.pisces-conservation.com/phmm-data.html
Contents
Preface........................................................................................................................................................................................iii
1: Introduction.............................................................................................................................................................................1
Why use multivariate methods?.................................................................................................................................................... 2
Organising your data - CAP and Ecom........................................................................................................................................... 3
Organising your data for R............................................................................................................................................................ 4
Opening a .csv file in R................................................................................................................................................................. 5
The appropriate types of variable.................................................................................................................................................. 6
Data transformation..................................................................................................................................................................... 7
Which method to use?.................................................................................................................................................................. 7
Where to obtain the example data sets and R code........................................................................................................................ 9
Software featured in the book....................................................................................................................................................... 9
2: Installing R and getting started.............................................................................................................................................10
Getting R running on your computer............................................................................................................................................10
Installing R.................................................................................................................................................................................11
Installing RStudio........................................................................................................................................................................11
Introducing RStudio....................................................................................................................................................................11
R code conventions used in the book...........................................................................................................................................16
Final advisory note on R code......................................................................................................................................................16
3: Principal Component Analysis................................................................................................................................................17
Uses..........................................................................................................................................................................................17
Summary of the method..............................................................................................................................................................18
The use of the correlation or covariance matrix.............................................................................................................................20
How many dimensions are meaningful?........................................................................................................................................20
Do my variables need to be normally distributed?..........................................................................................................................21
Data transformations...................................................................................................................................................................21
The horseshoe effect...................................................................................................................................................................22
PCA functions in R......................................................................................................................................................................24
Uses........................................................................................................................................................................................162
Summary of the method............................................................................................................................................................162
Example: archaeology: The classification of Jomon pottery sherds................................................................................................163
11: Tips on CAP and Ecom........................................................................................................................................................165
When to use CAP and Ecom......................................................................................................................................................165
Data organisation.....................................................................................................................................................................166
What type of data do you have?................................................................................................................................................166
Transforming your data.............................................................................................................................................................167
Dealing with outliers ................................................................................................................................................................168
Organising samples into groups.................................................................................................................................................169
Explore your data.....................................................................................................................................................................170
Dealing with multicollinearity in Ecom.........................................................................................................................................171
Getting the most out of your charts............................................................................................................................................171
Editing dendrograms.................................................................................................................................................................174
Suggestions for how to present your methods.............................................................................................................................175
Glossary of multivariate terms.................................................................................................................................................176
Index........................................................................................................................................................................................181
Pisces software titles & training..............................................................................................................................................185
Chapter
1
1: Introduction
If you record a number of variables from each sample, you are building a multivariate data set, and you will probably
want to use multivariate methods to analyse and present your findings. This handbook is a practical guide to the methods
available for exploring multivariate data and identifying relationships between samples and variables. I have whenever
possible used actual data from published studies, so that the reader can understand how the authors generated their
presentation and conclusions. The focus is on the use of mathematical methods to explore and present relationships in
field data, and not the statistical testing of hypotheses using experimental data. The data sets discussed arise from field
observations in geology, archaeology, biology and palaeontology.
Typically, the sampling program has been designed to gain insight into what is present at different localities or times,
and to compare the results to see if they fall into some sort of pattern or classification. In many studies, the variables
that can explain this pattern are not measured, or may not even be known. Possibly, after a pattern has been detected, an
explanation might be inferred. For example, the distribution of pottery shards observed might lead to an idea about the
past human activity in the area. Methods discussed here, which do not explicitly consider the explanatory variables, include
Principal Component Analysis (PCA), Correspondence Analysis (CA), Non-metric MultiDimensional Scaling (NMDS),
Cluster Analysis and TWINSPAN.
If the objects or samples under study can be placed into groups, we often need to test whether these groups are statistically
significant. Analysis of Similarities (ANOSIM) is often the method of choice to test if members of a group are more
similar to each other than they are to members of other groups. This randomisation test is particularly useful as it is
generally applicable. With previously-defined groups, Discriminant Analysis (DA) can be used to create a discriminant
function to predict group membership. These predicted memberships can also be used to validate groups formed by
Agglomerative Cluster Analysis (ACA).
There is also a group of methods for the analysis of situations where possible explanatory variables have been measured
together with the descriptive variables. One of the most familiar is multiple regression, where a model is constructed in
which a number of explanatory variables are used to predict the value of a dependent variable. This handbook does not
discuss standard regression methods, which are well-covered in standard statistical textbooks. It does cover Canonical
Correspondence Analysis (CCA), a constrained ordination method based on correspondence and regression analysis. This
method is presently by far the most popular constrained ordination method. CCA is termed a constrained method because
the sample ordination scores are constrained to be a linear combination of the explanatory variables. Unlike unconstrained
methods, CCA allows significance testing for the possible explanatory variables via randomisation tests.
Our second example is derived from archaeology (Table 2). This data set comprises counts of different-shaped pottery
shards from different localities in the Mississippi valley. In both of these examples the data comprise integer counts.
Most methods are equally applicable to continuous real variables, such as the height of people or the speed of swimming
of individual dolphins. It is also possible in some circumstances to mix different types of data such as continuous real
numbers and classificatory variables that are only represented by a few discrete numbers. The type of data applicable to
each method is discussed under that method.
Table 2: Counts of pottery shards from different localities in the Mississippi valley. From Pierce & Christopher, 1998¹.

Shape            Caney Mounds  Claiborne  Copes  Hearns  Jaketown  Linsley  Poverty Point  Shoe Bayou  Teoc Creek  Lewis
Biconical                  29       3259     67      57       485       58           3122          23         228    104
Cylindrical                 5       1230     78       0      1411        7           4718           3          12      0
Ellipsoidal                 1       3476    130       5         3       33           5103           1           2    108
Spheroidal                  7        824      2       6        29        4            355           1          16      5
Grooved Sphere              1       2014     17       0       410        8           3434           0           2     93
Biscuit                    58         22     11       8         0        4            138          12           2     12
Amorphous                  55        476     56      90         0        7            866         116           4     65
Other                       1        143      1      12         0        6            187           1           1      4
Jomon Hall R.csv. It is possible to open Excel files in R; however, this is best avoided because it is more difficult to control
which data will be opened, given that Excel can have multiple worksheets. It is generally recommended that you place your
data in a simple comma-delimited (*.csv) file which can then be easily opened in R.
The first 4 rows of the Jomon Hall R.csv file, opened in Excel, are shown below:
Sample Location Ti Mn Fe Ni Cu Zn Ga Pb Th Rb Sr Y Zr Ba
s:ma:001 Shouninzuka 11035 1065 40145 43 131 114 33 68 16 54 143 26 212 892
s:ma:002 Shouninzuka 9958 574 37170 21 70 70 23 50 10 44 88 18 177 748
s:mb:001 Shouninzuka 10147 851 68136 40 108 73 23 32 13 59 76 13 185 1069
R also offers a read.csv command to enter data in .csv format, as shown below:
# The data set has variable names in row 1; column 1 holds sample identifiers, column 2 the group allocation
x <- read.csv(file="D:\\Demo Data\\Jomon Hall R.csv", header=TRUE, row.names=1)
NOTE: Throughout the book, we have used the generic file location of ‘D:\\Demo Data’ in the R code; obviously you
will need to amend this to reflect your own file locations if you use the code yourself.
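The import step above can be rehearsed without the book's data files. The sketch below, using an invented three-sample data set (not the Jomon data), writes a small .csv to a temporary file, reads it back with read.csv using column 1 as row names, and inspects the result:

```r
# Create a small stand-in data set and write it to a temporary .csv file
df <- data.frame(Sample = c("s1", "s2", "s3"),
                 Ti = c(11035, 9958, 10147),
                 Mn = c(1065, 574, 851))
tmp <- tempfile(fileext = ".csv")
write.csv(df, file = tmp, row.names = FALSE)

# Read it back, using column 1 (Sample) as the row names
x <- read.csv(file = tmp, header = TRUE, row.names = 1)

# Always inspect what R has imported before analysing it
str(x)        # structure: variable names and types
head(x)       # first rows
dim(x)        # 3 rows x 2 columns
```

Checking the imported object with str() and dim() before any analysis catches the most common import mistakes, such as numbers read as text.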
The units of measurement are arbitrary, so you can change them to produce variables of similar magnitude. For example, if the
concentration of sodium is 5 mg/l and calcium is 5000 mg/l the calcium concentration can be expressed as grams to
become 5 g/l. This will result in both variables having a similar magnitude and variance.
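Rather than changing units by hand, variables can be brought to a common scale with R's built-in scale() function, which centres each column and divides it by its standard deviation. A minimal sketch, using invented concentrations:

```r
# Two variables of very different magnitude (invented values, in mg/l)
water <- data.frame(sodium  = c(5.1, 4.8, 5.3, 4.9, 5.6),
                    calcium = c(5000, 4800, 5200, 4700, 5400))

# scale() centres each column to mean 0 and rescales it to standard deviation 1
z <- scale(water)

round(colMeans(z), 10)   # both column means are now 0
apply(z, 2, sd)          # both standard deviations are now 1
```

Standardising in this way is equivalent to undertaking a PCA on the correlation rather than the covariance matrix, a choice discussed in Chapter 3.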
The choice is therefore, in part, determined by the method that best displays the main features of the data. If you have good reasons for
believing your data were collected from along a simple environmental gradient, which is reflected in the magnitudes of the
sample variables, then Correspondence Analysis or Canonical Correspondence Analysis may be most appropriate. If the
samples are likely to form distinct groups then a cluster analysis may be useful. Alternatively, if the samples are unlikely
or not necessarily likely to split into a few clear groups, Principal Component Analysis or MultiDimensional Scaling may
be most appropriate. MultiDimensional Scaling is a highly flexible method that allows a wide range of different similarity
measures between samples to be used. However, it does not allow the most important variables determining the ordination
of samples to be easily identified and presented graphically. Authors almost never explain why they chose a particular
method. Probably the two main reasons were that it was available on their computer system, or they tried several, and the
one chosen gave the clearest presentation or the result they hoped or expected to see. The other common reason is that a
particularly opinionated person has asserted the superiority of one approach. Treat such know-alls with the polite disdain
you show their type in other walks of life. Be aware that, because there is no mathematical argument to demonstrate the
clear superiority of a single method for all types of data, multivariate analysis is prone to methods moving in and out of
fashion.
I would advise that all appropriate methods are tried, and if any lead to clearly different conclusions, care is taken to
find out why this has come about. If, as is often the case, Principal Component Analysis, Correspondence Analysis and
MultiDimensional Scaling all give broadly similar ordinations, then the method most clearly showing the main features can
be chosen, confident that your conclusions are robust. If you have to struggle hard to find any interpretable pattern, and
different methods lead to different conclusions, you should accept that there is no clear structure within the data. However,
before conceding that your data do not tell a story, make sure no single variable is overwhelming a pattern displayed by
other, possibly lower-magnitude, variables. Such structure may be discovered by transformation of some or all variables.

NOTE: In CAP, rare species are removed from the Working Data tab: select Handling zeros and choose Remove sparse rows. The proportion of zero values in the data set is given in the Summary tab.

It may even be useful to remove some variables. Frequently, many variables represent rare objects and are in very low abundance. Biological data sets, for example, frequently comprise a large number of zero observations, because most species are only found once or twice. It is not uncommon for 80% of the elements of a data matrix to be zero values. Such matrices are termed sparse, and methods such as PCA generally should not be applied to sparse arrays. Rare observations can be
removed, as they contribute nothing to the general pattern. Remember, species or objects that are only observed once
cannot be correlated with other features so they cannot contribute to an ordination such as PCA.
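The same clean-up can be done directly in R. A minimal sketch, using an invented species-by-sample count matrix, that drops any species (column) recorded in fewer than two samples:

```r
# Invented count matrix: 4 samples (rows) x 5 species (columns)
counts <- matrix(c(3, 0, 0, 0, 2,
                   1, 0, 5, 0, 0,
                   4, 1, 2, 0, 1,
                   2, 0, 3, 0, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:4),
                                 paste0("sp", 1:5)))

# Number of samples in which each species occurs
occurrences <- colSums(counts > 0)

# Keep only species recorded in at least 2 samples; singletons cannot
# correlate with anything, so they contribute nothing to an ordination
trimmed <- counts[, occurrences >= 2, drop = FALSE]
colnames(trimmed)   # sp2 (1 sample) and sp4 (0 samples) have been dropped
```

The occurrence threshold (here 2) is a judgement call; raising it removes more of the sparse tail of the data.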
Chapter
2
2: Installing R and getting started
Installing R
To download R, choose your preferred CRAN mirror at https://cran.r-project.org/mirrors.html
You will find a range of options for different operating systems and computers. For example, download and install the
current Windows version of R from https://cran.r-project.org/bin/windows/base/
When installing, you can usually just accept the default settings.
Installing RStudio
RStudio is an open-source front-end for R which makes R a little simpler to use. It offers a feature-rich source-code editor
which includes syntax highlighting, parentheses completion, spell-checking, etc., a system for examining objects saved in R,
an interface to R help, and extended features to examine and save plots. RStudio is easy to learn.
○ Go to the RStudio download page at https://www.rstudio.com/products/rstudio/#Desktop. Select the DOWNLOAD RSTUDIO DESKTOP button if you wish to use the free version.
○ Select the link from the “Installers for Supported Platforms” list that corresponds to the operating system on your computer.
○ Once installed, you can personalise RStudio by opening the program and selecting Global Options… from the Tools menu. It is usually best not to change the defaults until you have gained some experience with the program.
Introducing RStudio
RStudio design
RStudio is organized around a four-panel layout, seen in Fig. 2, page 12. The upper-left panel, R Script Editor, may
not be visible when you open the program for the first time; if this is the case, click File: New File: R Script to display
it, or press Ctrl+Shift+N on your keyboard.
Data sets listed in the Environment tab can be double-clicked to open them for viewing as a tab in the Script Editor. The History tab simply shows all of the
commands that you have submitted to the Console during the current session.
The lower-right panel contains at least five tabs - Files, Plots, Packages, Help, and Viewer. The Plots tab will show the
plots produced by commands submitted to the Console. One can cycle through the history of constructed plots with the
arrows on the left side of the plot toolbar, and plots can be saved to external files using the Export tab on the plot toolbar
(see figure above). A list of all installed packages is seen by selecting the Packages tab (packages can also be installed
through this tab, as described below). Help for each package can be obtained by clicking on the name of the package. The
help will then appear in the Help tab.
Basic usage
Your primary interaction with RStudio will be through developing R scripts in the Script Editor, submitting those scripts
to the Console, and viewing textual or tabular results in the Console and graphical results in the Plot panel. In this section,
we briefly introduce how to construct and run R scripts in RStudio.
To open a blank file, select the New icon and then R Script. Alternatively, you can select the File menu, New
submenu, and R Script option, or use <CTRL>+<Shift>+N. In the newly-created Script Editor panel, type the three
lines exactly as shown below (for the moment, don’t worry about what the lines do).
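The three example lines shown in the printed book did not survive extraction here; as a hypothetical stand-in, any three simple R commands will do, for example:

```r
# Three simple lines to practise submitting a script to the Console
x <- c(2, 4, 6, 8)   # create a vector of four numbers
m <- mean(x)         # compute its mean and store it
m                    # print the result (5)
```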
These commands must be submitted to the Console to perform the requested calculations. Commands may be submitted
to the Console in a variety of ways:
○ Put the cursor on a line in the Script Editor and press the Run icon, or press <CTRL>+<Enter>.
This will submit that line to the Console and move the cursor to the next line in the Script Editor. Pressing Run
(or <CTRL>+<Enter>) again will submit this next line. And so on.
○ Select all lines in the Script Editor that you wish to submit and press Run (or <CTRL>+<Enter>).
The RStudio layout after using the first method is shown in Fig. 2, page 12.
The R Script in the Script Editor should now be saved by selecting the File menu and the Save option (alternatively,
pressing <CTRL>+S). Give a name to the script, choose the preferred directory, and click Save; it will be saved with a
.R file extension. RStudio can now be closed (do NOT save the workspace). When RStudio is restarted later, the script
can be reopened (choose the File menu and the Open file submenu, if the file is not already in the Script Editor) and
resubmitted to the Console to exactly repeat the analyses. (Note that the results of commands are not saved in R or
RStudio; rather the commands are saved and resubmitted to re-perform the analysis).
R packages to install
Each chapter with a section on R will list the packages required to carry out the analyses featured in that chapter. The full
list required is as follows:
devtools (needed to install some other packages, such as ggbiplot)
stats (already installed as standard)
ade4
ca
calibrate
flipMultivariates
ggbiplot (refer to the section Packages not available from the CRAN repository, below)
ggplot2
graphics
grDevices
RColorBrewer
vegan

NOTE: If you use the R functionality within CAP and Ecom, the programs will automatically check and install the required packages.

Other packages which may be useful, but are not featured in analyses in this book, include FactoMineR, MASS and pcaMethods.
Fig. 3: (left) The RStudio window featuring the Packages tab (lower right).
Fig. 4: (right) The Install Packages dialog.
Installing R packages
The first time you try to install a package, the installation routine may ask if you wish to create a library folder to store the
installed packages, for instance D:\R\win-library\3.5. If this option fails, it may be necessary to create the R folder and
its sub-folders by hand; the rest of the installation should then proceed.
R packages may be installed in RStudio using the Tools: Install Packages... option on the top toolbar. Alternatively, look
for the Packages tab on the lower-right pane of the main program window, and click the Install button - see Fig. 3. You
will then see the Install Packages dialog (Fig. 4). Start typing the name of the package you want into the Packages box,
and select it from the drop-down list of available packages; the installation will then proceed.
Chapter
3
3: Principal Component Analysis
A standard method to summarise multivariate data in a 2-dimensional plot
Principal Component Analysis (PCA) is one of the oldest and most important multivariate methods. When used
appropriately it can display the main features of a multivariate data set and may reveal hidden features within your data.
Uses
Use PCA to show the relationship between objects or samples in a simple 1-, 2- or 3-dimensional plot. The method also
identifies those variables which best define the similarity between the objects. In a PCA plot, the most similar objects
will be placed closest together. It can also be used to create composite variables that capture the general magnitude of a
feature that can only be described using a combination of variables. These new composite variables created by PCA can
then be used as input for other forms of analysis. For example, PCA scores can be used to create a dendrogram, or used
in a regression analysis.
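This workflow can be sketched with R's built-in prcomp() function and the standard USArrests data set (our stand-in example, not data from the book):

```r
# PCA on the correlation matrix of the USArrests data
pc <- prcomp(USArrests, scale. = TRUE)

# The scores on the first two components act as composite variables
scores <- pc$x[, 1:2]

# Feed them into another analysis, e.g. a cluster dendrogram...
hc <- hclust(dist(scores))
plot(hc)

# ...or a regression on some response y (hypothetical variable):
# fit <- lm(y ~ scores[, 1] + scores[, 2])
```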
correlated. If the variables are uncorrelated, the method is powerless to help you in your analysis, so try MultiDimensional
Scaling instead (see Chapter 5).
Fig. 5: (left) A simple example of the plot of two highly positively-correlated variables.
Fig. 6: (right) The final ordination produced by PCA for the data set plotted in Fig. 5.
1 The dispersion matrix is either a matrix giving the variances and covariances of the variables, or a matrix of the correlations between the
variables.
and these are then used to solve for the eigenvectors. The eigenvectors are the principal axes of the dispersion matrix S.
The eigenvectors are scaled to a unit length and then used to compute the positions of the samples on each principal axis.
The dispersion matrix, S, can be either the variance-covariance or the correlation matrix for the variables. The choice will
depend on your data and is discussed below.
The eigenvalues of S give the amount of variance explained by each principal axis. PCA is therefore a partitioning of
the total variability in the original data set between a new set of variables. When successful, PCA can place most of the
variability in the dispersion matrix into only 2 or 3 dimensions.
A useful rule of thumb is to consider a principal component meaningful only if its eigenvalue is larger than the mean of all the eigenvalues. Software normally gives the sum of the eigenvalues; divide this
sum by the total number of variables in your data set to get the mean.
If you undertake a PCA on the correlation matrix then the mean eigenvalue is 1, so the rule is simply to only consider
components with an eigenvalue > 1.
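With prcomp() the eigenvalues are the squared standard deviations of the components, so the mean-eigenvalue rule takes one line; USArrests is used below purely as an illustration:

```r
pc <- prcomp(USArrests, scale. = TRUE)  # PCA on the correlation matrix
ev <- pc$sdev^2                         # the eigenvalues
sum(ev)               # equals the number of variables when the correlation matrix is used
which(ev > mean(ev))  # the components worth interpreting
```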
Data transformations
It is often essential to transform some or all of the variables prior to their use in a PCA. Remember that if one variable
has a much larger mean than the rest it should be rescaled. For example, if frequency is measured in hertz and ranges from
6000 to 8000 Hz, while another variable, the pulse rate, ranges from 5 to 20 per second, you should express the frequency
in kHz so that it ranges from 6 to 8.
We transform data to:
1. Reduce the influence of the high-magnitude variables (e.g. the highly-abundant species) and, conversely, increase that
of low-magnitude variables. This is not always relevant; for example, when PCA is applied to the correlation matrix, all
variables have equal weight.
2. Normalise the data. Transformations are often useful when some variables have a highly-skewed distribution, e.g.
when the bulk of the observations are low, but there are a few much higher values. However, do not be too worried
if your variables are not normally distributed; they rarely are, and PCA can still give useful results.
Common transformations that can be useful are logarithmic and square root. The square root transformation has the
advantage that it can be used in data sets with zero values. There was a fashion for the 4th root transformation; this
should be avoided as it excessively distorts your data. Generally, transformations make it more difficult to interpret your
results and should only be used when necessary.

CAP has a range of transformations on the Working Data tab. You can experiment with different transformations,
without altering your original data. To revert to the raw data, use Reload Raw Data.
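In R, such transformations are applied before the PCA call; the sketch below uses a small invented count table x (not data from the book):

```r
# A tiny invented table of counts, including zeros
x <- data.frame(sp1 = c(0, 10, 100), sp2 = c(1, 4, 9))

x_log  <- log10(x + 1)  # log transform; adding 1 copes with zero counts
x_sqrt <- sqrt(x)       # square root; handles zeros without any offset

# Either transformed table can then be passed to prcomp()
```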
Fig. 10: (left) An ordination produced by DECORANA of the demonstration data given in Table 3, page 22. These data
produce the horseshoe effect when analysed using PCA.
Fig. 11: (right) An ordination produced by Non-metric MultiDimensional Scaling of the demonstration data given in
Table 3, page 22. These data produce the horseshoe effect when analysed using PCA.

In CAP, to change the font and size of the label for the points, click on the Edit button above the graph and then choose
Series – Unassigned – Marks – Text – Font.
PCA functions in R
A variety of R packages can be used to undertake Principal Component Analysis. The widely-used function prcomp()
is in the stats package. It has been argued that prcomp() is generally superior to princomp() because it uses singular
value decomposition of the data matrix to obtain the eigenvalues and eigenvectors, rather than spectral decomposition of
either the correlation or variance-covariance matrix. In practice, most users will find no difference in the results obtained.
Ecologists may favour the rda() function in the vegan package as this package offers a wide range of analytical techniques
used by ecologists for community analysis. Another function undertaking a PCA is dudi.pca() in the ade4 package which
has also been developed to analyse ecological data. The use of each of these functions is demonstrated below in the
example applications.
Other PCA functions not discussed further here are PCA(X) in the FactoMineR package, and pca(X) in the
Bioconductor pcaMethods package. The function pca(X) is a wrapper for the prcomp() function which offers further
visualisation functions and methods for handling missing values. The PCA(X) function has many plotting options and
allows observations or variables to be given different weights.
Table 4: The eigenvalues for the first 5 axes of the PCA undertaken on Jomon pottery sherds. As there were
14 variables and the correlation matrix was analysed, the sum of the eigenvalues, which is the total inertia or
variance of the data set, was 14.
Axis  Eigenvalue  Cumulative percentage of the total variance
1     4.713       33.66
2     1.954       47.62
3     1.437       57.89
4     1.228       66.66
5     0.9771      73.64

In CAP, the eigenvalues are shown in the Variance tabbed output window. In R, using the prcomp() function to
undertake a PCA, the variances are shown using the summary() function.
These results suggest that much of the variability in elemental composition can be expressed in 3
dimensions. The first 4 dimensions are probably meaningful (eigenvalues > 1).
An examination of the 3 2D plots possible (axis 1 and 2; axis 1 and 3; axis 2 and 3) for the 3 largest components showed
that the position of the sherds in the 2-dimensional space defined by the 1st and 3rd principal components separated the
sherds into the 4 localities (Fig. 12, page 28). Four sample outliers in the PCA were a:mb:002, k:uk2:008, n:uk:137 and
s:mb:007. For example, the sherd s:mb:007 is represented by the X on the far lower left of the plot. By repeating the
analysis with the outliers removed we can see more clearly the grouping of the sherds between the 4 sites (Fig. 13, page
28). In this figure, for which the first 2 principal components are plotted, each of the sites is coded as a different symbol
and colour. You will see, for example, that the crosses representing the Narita 60 site are clustered in a single discrete area.

In CAP, you can obtain the averages for the variables for each group by using the Summary tab. Click on the Group
radio button and you will be presented with the averages for each group. Remember that this will only work if you have
previously defined group membership.
The eigenvector plot (Fig. 14, page 29) shows that Principal axis 1 is a measure of the concentration of the elements
Zn, Ba, Mn, Zr, Ga, Cu, Ni, Fe, Y and Ti present, with sherds to the right (positive direction) of the axis having the largest
concentrations. Axis 2 is a measure of Sr, Rb, Th and Pb concentration, with the greatest concentrations at the bottom
(negative direction) of the axis.
Fig. 12: (left) The ordination plot of the Jomon pottery sherd data, based on the first and third principal axes.
PCA was undertaken using the correlation matrix.
Fig. 13: (right) PCA ordination of Jomon potsherds with outliers removed. PCA was undertaken using the
correlation matrix.
The samples were also classified by pottery style – Red is Moroiso A, Blue Moroiso B and Green Ukishima (Fig. 15)1. By
comparing the plot showing the sites and the styles, it is apparent that the Ukishima pottery is found at all the sites. Note
that there is a difference in the elemental composition in the pottery styles, with Moroiso sherds having generally higher
concentrations of all the elements measured except lead (Pb).
1 Note that changing the group membership does not affect a PCA ordination.
Fig. 14: Eigenvectors for the chemical composition variables produced by PCA for the Jomon potsherds with outliers
removed. PCA was undertaken using the correlation matrix.
Fig. 15: PCA ordination of Jomon potsherds grouped by pottery style with outliers removed. PCA was undertaken using
the correlation matrix. The samples have been allocated to groups based on pottery style.

In CAP, to deselect the outliers, click on the Working Data tab. Find the first sherd to remove and select any cell in this
column. Now select Handling zeros in the Type of Adjustment radio box, choose Deselect Column in the list, and click
on Submit.

Code and results using R and the prcomp() function
For this example the prcomp() function is used to undertake the PCA. Note that scale = TRUE results in PCA based
on the correlation matrix. To use the variance-covariance matrix specify scale = FALSE. To identify how many of the
principal components are important, the summary() function is used together with plot() to make a scree plot (Fig. 16,
page 30).
# Open data set, print first lines to check data loaded correctly and find names of variables
pottery.csv <- read.table("D:\\Demo Data\\Jomon Hall R.csv", header = TRUE, sep = ",")
head(pottery.csv, 4)
# Log transform the numerical data
log.sherds <- log(pottery.csv[, 3:16])
head(log.sherds, 4) # Some output to check values have been logged
location.sherds <- pottery.csv[, 2] # Put locations in a vector
# Run the analysis
sherds.pca <- prcomp(log.sherds, center = TRUE, scale = TRUE)
print(sherds.pca) # Print the output
# Investigate the importance of each principal component
summary(sherds.pca) # Proportion of variance for each PC
plot(sherds.pca, type = "barplot") # Makes a scree bar plot; otherwise use type = "lines"
The following R code produces the plot shown in Fig. 18 which will aid understanding of how the variables are associated
with each PCA axis. Note that the plot shows that negative values on PC2 are associated with elevated Pb (lead), Rb
(rubidium) and Th (thorium) and positive values with high Ni (nickel).
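A simple way to inspect how the variables load on an axis, assuming the sherds.pca object created in the earlier listing, is to plot the eigenvector coefficients directly (a sketch, not necessarily the exact code behind Fig. 18):

```r
# Eigenvector (loading) coefficients for the first two components
loadings <- sherds.pca$rotation[, 1:2]

# Bar plot of the PC2 loadings: strongly negative bars (Pb, Rb, Th) pull
# samples downwards on the axis; positive ones (e.g. Ni) pull them upwards
barplot(loadings[, 2], las = 2, main = "PC2 loadings")
```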
Conclusions
Hall (2001) concluded that Principal Component Analysis
indicated that there are four major groups in the data set, which
correspond to site location. This led him to the conclusion that
the majority of Early Jomon pottery found at four sites in Chiba
Prefecture, Japan, was made from locally-available raw materials.
While the Kamikaizuka and Shouninzuka groups overlap, both
sites are less than 10 km apart and their potters could have shared
raw material sources. The ordination is improved following the
removal of outliers.
For sites having both Moroiso- and Ukishima-style pottery, both
styles of pottery were made from the same or geochemically
similar raw materials. This suggests that both styles were probably made at the same site, and indicates that if the different
pottery styles are reflecting ethnic identity, then intermarriage between ethnic groups was occurring. Alternatively, the
pottery styles could be reflecting some sort of social interaction between groups.
Hall (2001) did not test for the statistical significance of the 4 groups. Without such tests the results are not convincing.
The statistical significance of the grouping of the pottery into four groups relating to their location is tested below in the
chapter on ANOSIM.
Alternative approaches
Hall (2001) used a correlation matrix
for the PCA, which had the effect
of giving equal weighting to every
element. The plot in Fig. 19 shows
the ordination of the sites using the
variance-covariance matrix, which
fully uses the quantitative data. It is
interesting that the same conclusion is
reached, and a slightly clearer division
of the four sites is shown using a plot
of the first and second principal axes.
determine if other points such as s.ma.001, s.mb.006 and s.mb.008 are also significant outliers. Be aware that if your data
encompass nonlinear relationships, Mahalanobis distance can be misleading, as it assumes linear relationships between
variables.
library(ClassDiscovery)
# Open data set
Shouninzuka <- read.table("D:\\Demo Data\\Jomon Hall Shouninzuka R.csv", header = TRUE, sep = ",")
head(Shouninzuka, 4) # Print first lines to check data
spc <- SamplePCA(Shouninzuka, usecor = TRUE) # Undertake a PCA on the correlation matrix
round(cumsum(spc@variances) / sum(spc@variances), digits = 2) # Cumulative proportion of variance
plot(spc) # Plot the first 2 principal components
text(spc, cex = 0.7, pos = 2) # Put a label by each sample point
maha2 <- mahalanobisQC(spc, 2) # Mahalanobis distances based on the first 2 components
maha2
variables (Zr, Sr and Mg) with the largest variances because of their high relative magnitudes. This is the correct choice if
it is believed that all elements can potentially contribute equally to the study of the relationships between the rocks.
Results
As shown in Table 6, the first 2 axes explained about 72.9% of the total variability in the data set. The sum of all the
eigenvalues, a measure of the total variability, is 14, which is simply the sum of the number of variables used in the
analysis, because the correlation matrix was used. Therefore, the percentage variability explained by the largest eigenvalue
is 7.28/14 x 100 = 52.01%. The first 3 dimensions are probably meaningful (eigenvalues > 1).
Table 6: The eigenvalues for the first 5 axes of the PCA undertaken on chemical variables for the Martinsville
igneous complex. As there were 14 variables, and the correlation matrix was analysed, the sum of the eigenvalues,
which is the total inertia or variance of the data set, was 14.
Axis  Eigenvalue  Cumulative percentage of the total variance
1     7.282       52.01
2     2.918       72.86
3     1.27        81.93
4     0.8209      87.79
5     0.5165      91.48
The authors reported a higher percentage of the variability explained by the first two axes, probably because they combined
the percentage composition for the two iron oxides into a single variable. However, this makes little difference to the
ordination produced.
These results show that much of the variability in chemical composition can be expressed in 2 dimensions.
The plot in Fig. 21, page 38 of the eigenvectors shows that Principal axis 1 arranges the samples so that those with the
highest concentrations of Ca, Fe, Al, Mn, Sr, P and Ti are towards the left (negative side) and those with highest concentrations
of Si, Rb and K to the right (positive side). When deciding which variables are making the greatest contribution to an axis,
examine both the direction and the length of the eigenvectors. To make a strong contribution to an axis, an eigenvector
Fig. 21: (left) Plot of eigenvectors of chemical composition for rock samples in the Martinsville igneous complex.
PCA was undertaken on the correlation matrix using CAP.
Fig. 22: (right) Ordination of rocks within the Martinsville igneous complex. PCA was undertaken on the
correlation matrix using CAP.
should point approximately along an axis and be relatively long. Axis 2 is a measure of Zr and Na concentration, with
the greatest concentrations at the top (positive direction) of the axis. The authors recognised 5 groups of eigenvectors:
1) Si, Rb and K; 2) Ca and Mg; 3) Fe, Al, Sr and Mn; 4) P and Ti; and 5) Na and Zr. When grouping eigenvectors, you
must consider the angle between them, not the length of the vector. The present results would suggest that P and Ti
make a poor group, but they can be viewed as intermediate between {Zr, Na} and {Fe, Al, Sr, Mn}.

In CAP, you can edit all aspects of the legend. First select the Edit tool button above the chart. Then choose the tabs
Chart: Legend. To remove the tick boxes, select the tabs Chart: Legend: Style and select No check boxes from the
drop-down menu.

An examination of the 2D PCA ordination plots (Fig. 22) shows a clear clustering of the rock samples. When the
samples are grouped according to their mineralogy it is clear that the PCA
ordination based on chemistry produces a similar classification. For example, the syenodiorite can best be distinguished
by its relatively high Na and Zr contents. The granites are characterised by relatively high Si, Rb, and K, and the Rich
Acres gabbros by being relatively rich in Mg, Ca, and Fe. The authors do not consider these findings “particularly a surprise”
and if the PCA “only confirmed the mineralogical groupings and chemical differences easily apparent, they would be of limited value.” The
particular value of the PCA is in showing relationships and hybrids. For instance, the hybrid Leatherwood rocks (blue
squares) are intermediate in composition between the granites (yellow squares) and the diorites (red squares).
The dudi.pca() function in ade4 is used for the PCA. Note that scale = TRUE results in PCA based on the correlation
matrix. The scannf=FALSE option suppresses the scree plot. The plots are shown in Fig. 23 and Fig. 24, page 40.
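A minimal call of this form, assuming the chemical measurements are in a data frame named rocks (a hypothetical name), would be:

```r
library(ade4)

# PCA on the correlation matrix; scannf = FALSE suppresses the interactive
# scree plot, and nf sets how many axes are kept
rocks.pca <- dudi.pca(rocks, center = TRUE, scale = TRUE,
                      scannf = FALSE, nf = 2)

rocks.pca$eig        # the eigenvalues
scatter(rocks.pca)   # joint plot of samples and eigenvectors, as in Fig. 23
```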
Conclusions
Ragland et al. (1997) concluded that PCA was a useful tool. “... PCA is an insightful tool in petrology and geochemistry and is
recommended as a first-step, exploratory technique for a dataset of chemical analyses. It allows the researcher to determine which of the original
variables may be the most useful in characterizing the dataset.” Further, it is capable of identifying possible relationships and
hybrids, and thus can be used as an aid when generating hypotheses about relationships between and origins of rocks.
Fig. 23: The PCA scatter plot for the Martinsville rocks, produced by the function scatter() in the ade4 R package.
The numbers refer to each rock sample and the eigenvectors are the chemical components.
Fig. 24: The PCA plot for the Martinsville rocks, showing the different classifications. The ordination produced
a clear separation of the different rock types.
Alternative approaches
Ragland et al. (1997) used the correlation matrix for the PCA, which had the effect of giving equal weighting to every
element. Fig. 25 shows the ordination of the sites using the variance-covariance matrix calculated with all variables log
transformed. It is interesting to note that essentially the same clusters are formed, but the eigenvectors show a number
of tight pairs {Zr, Na}, {K, Rb}, {Mn, Fe} and {Ca, Mg}. This plot also shows that it is possible to place the samples
and the variable eigenvectors on the same plot. Some authors, including Ragland et al. (1997), plot only the apex of the
eigenvectors. For clarity, this should be avoided; the relationship between eigenvectors, given by their angular difference,
is more easily studied if they are plotted as vectors (arrows).

In CAP, to show only the largest eigenvectors, use the slider below the ordination plot.

Fig. 25: Ordination biplot of the rocks within the Martinsville igneous complex. The plot also shows the largest 9
eigenvectors for the chemical variables used to produce the ordination of the rocks. The PCA was undertaken using the
variance-covariance matrix using CAP.
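In base R, a comparable sample-plus-eigenvector biplot comes from passing a prcomp() object to biplot(); the variable arrows are drawn automatically. USArrests is used below as a stand-in data set:

```r
pc <- prcomp(USArrests)  # PCA on the variance-covariance matrix
biplot(pc)               # samples as labels, variables as arrows
```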
Results
As shown in Table 7, the first 2 axes explained about 98% of the total variability in the data set, demonstrating that a 2D
graph can show the relationship between the cicada species. The sum of all the eigenvalues, which is a measure of the total
variability, is 3, which is simply the sum of the number of variables used in the analysis, because the correlation matrix was
used with 3 variables. Only the first dimension has an eigenvalue > 1, but the second is required to distinguish between
T. japonicus and T. flammatus.
Table 7: The eigenvalues for the first 2 axes of the PCA undertaken on 3 cicada song variables. As PCA was
undertaken on the correlation matrix, the sum of the eigenvalues, which is the total inertia in the data set,
equals 3.
Fig. 26, page 44 is a biplot of the eigenvectors and the sample scores. They can be effectively placed on the same graph
because of the small number of variables and samples included in this study. The 3 samples for known species have been
marked as large circles and labelled with the species name.
The 2D plot shows a clear separation between the species, and the clustering of the unknown samples around the T.
bihamatus standard indicates that all samples except for S1 can be assigned to this species. S1 has a song very similar to that
of T. japonicus.
Conclusions
As the author stated, “The cluster analysis of the PCA scores clearly separated T. japonicus, T. flammatus and T. bihamatus from each
other and allocated the samples as expected.” He did include a warning: “However, one should collect real specimens with each sound
recording in order to check the result of this method.”
Alternative approaches
For data of this type, there are no other methods that work as well as a PCA applied to the correlation matrix.
Fig. 26: The relationship between the songs of 3 species of cicada. The PCA ordination is a biplot of the
eigenvectors and the sample scores using the correlation matrix.
Results
As shown in Table 8, page 46, using a log10 transformation, the first 2 axes explained about 46% of the total variability
in the data set. For an ecological data set this is quite a large proportion of the total variability, and community analyses
are often presented in which the first two axes explain less than 30%. Henderson (2007) reported that, when using a
square root transformation, the first 2 axes explained about 62% of the variability. If a PCA is undertaken again on the
square root-transformed data used here, 67.5% of the variance is explained by the first two axes. However, the ordination
produced does not give such a clear distinction between the communities in the early 1980s and after 1987. The point to
note is that when using the variance-covariance matrix, the best transformation is not necessarily the one which explains
the largest amount of the variance in the first two axes.
Table 8: The eigenvalues for the first 3 axes of the PCA undertaken on the fish community data collected at
Hinkley Point.
The plot of the fish species eigenvectors shows that Principal axis 1 arranges the years so that those with the highest
abundance of Sprattus lie towards the left (Fig. 27). Axis 2 is a measure of the abundance of most of the other fish species,
with warm-water species, Solea, Dicentrarchus and Trisopterus increasing in the positive direction, and cold-water species,
Liparis and Limanda, in the negative direction. Axis 2 can therefore be thought of as a temperature axis.
An examination of the 2D ordination of the years clearly shows that the fish community in the early 1980s was different
from that in later years (Fig. 28).
Fig. 27: (left) The 6 largest eigenvectors for the fish species variables calculated by PCA for the Hinkley Point
data. PCA was undertaken on the variance-covariance matrix derived from square root-transformed data.
Fig. 28: (right) The ordination of the fish community of the Severn estuary, showing the change between the
1980s and latter years. As above, PCA was undertaken on the variance-covariance matrix derived from square
root-transformed data.
In CAP, you can select the number of eigenvectors to present on the chart using the slider at the bottom of the PCA plot tab.
It is often worth only showing the major eigenvectors, which are making the greatest contribution to the principal axes.
To produce the large plotting symbols and labels for years in the 1980s, we activated Chart edit by clicking on the Tools button on the left above
the graph. The pre- and post-1986 years had already been defined as groups. It was therefore possible to double-click on the pre-1986 series in
the Series list. This opened the Series dialog where the symbol and the label text could be changed.
Using this ordination result, Henderson (2007) concluded that there was an abrupt change in the fish community around
1986, which was related to a change in climatic conditions and a switch in the North Atlantic Oscillation. Note that the
PCA analysis can tell us nothing about the causality behind the change in fish community. The explanation came from
an investigation of changes in the physical environment between the early 1980s and after 1987. Further, biological
knowledge about which fish favoured warmer conditions allowed axis 2 to be interpreted as a water temperature axis,
indicating a climatic effect.
A weakness of the rda() function is the difficulty of producing good graphical output. As is shown below in Fig. 30, the
objects, in this case years, are always called sites, so in the biplot sit1 is 1981, sit2 is 1982, and so on.
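The calls behind this output, assuming a year-by-species abundance matrix named fish (a hypothetical name), follow this pattern:

```r
library(vegan)

# rda() with no constraining variables performs an ordinary PCA
fish.pca <- rda(fish)
summary(fish.pca)  # eigenvalues and the 'site' (here, year) scores

# Default biplot; rows are labelled sit1, sit2, ... in their original order
biplot(fish.pca, scaling = 2)
```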
Fig. 29: (left) Scree plot output generated by the R function barplot() using the Hinkley Point fish data. The plot
shows that over 25% of the total variance is included in PC1 and that PCs 1, 2 and 3 can together summarise the
relationships within the data set.
Fig. 30: (right) Output generated by the R function rda() and plotted using the biplot() function. It shows that
the fish community in the years 1981 to 1985 (labelled Sit1 - Sit5) forms a separate group reflecting the different
colder-water community present in those years.
Conclusions
Essentially similar conclusions would be reached if the PCA were undertaken on the correlation matrix, or with square
root-transformed data and the variance-covariance matrix. However, a log transformation and PCA applied to the
variance-covariance matrix gave the clearest results. The log was better than a square root transformation because it
reduced the dominant role of Sprattus. It might be argued that the dominance of Sprattus is a real feature and therefore
should not be reduced by transformation. However, this dominance is exceptionally great because abundance has been
measured in terms of the number of individuals caught. The dominance of Sprattus would have been much lower if
abundance had been measured as the weight caught each year, because Sprattus is one of the smaller fish caught. As our
units of measurement change the pattern of dominance, it seems fair to use a transformation to increase the weighting
given to the less numerically-abundant species.

In CAP, the rare species can be removed from the data set on the Working Data tab. Select Handling zeros and choose
Remove sparse rows. The proportion of zero values in the data set is given in the Summary tab.
Alternative approaches
Detrended Correspondence Analysis and MultiDimensional Scaling (see page 52 and page 73, respectively)
also show the distinct change in fish community. PCA is to be favoured in this case because the eigenvectors give insight
into which species are structuring the community and show the role of climate.
samples. In such cases you should remove all the rare species from the analysis. Such species make a negligible contribution
to the ordination.
If you have quantitative data, PCA should generally be undertaken using the variance-covariance matrix. However, if your
variables differ greatly in magnitude or variability, then there is a risk that the ordination will simply reflect the abundance
of the most variable, highest-magnitude variables. To avoid this, either rescale your variables to similar magnitudes, or
undertake a log or square root transformation. Use a square root transformation if your data set includes zero values.
Avoid 4th root transformations and other exotica, as you will not generate easily-interpretable results.

Remember, if a PCA is undertaken on the correlation matrix all your variables will acquire equal importance. This may
be quite appropriate when using physical variables; a low mercury concentration may be just as important as a calcium
concentration a thousand-fold larger. But do you consider that the rarer species are as important as the common forms
in defining a community? Only the first 3 or so principal axes are likely to hold interpretable variability, so do not waste
time studying the plot of the 5th and 10th principal axes!

In CAP, a data transformation is easily undertaken. First click on the Working Data tab. Second, select the Transform
radio button and finally select Log10 etc. from the list of possible transformations. Remember your data set is not
permanently altered, so you can always try a range of different transformations.
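Because nothing is permanently altered, the competing choices can simply be run side by side and their explained variance compared; a sketch using USArrests as stand-in data:

```r
p_cov <- prcomp(USArrests)                 # variance-covariance matrix
p_cor <- prcomp(USArrests, scale. = TRUE)  # correlation matrix

# Cumulative proportion of variance explained by the first two axes
summary(p_cov)$importance[3, 2]
summary(p_cor)$importance[3, 2]
```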
Chapter
4
4: Correspondence Analysis
This method, also called Reciprocal Averaging, produces an ordination of both the samples and the variables, and is
particularly effective when samples are derived from along a gradient.
Uses
Correspondence Analysis was developed by statisticians in the 1930s, but only came into use in ecology in the 1960s. It
is particularly favoured by plant ecologists and is used within the TWINSPAN method. Use Correspondence Analysis to
show the relationship between objects and variables within a single plot. The method can be used with categorical data,
so is particularly appropriate when it is impossible or too costly to collect quantitative data. It produces a particularly clear
and useful ordination when the objects have been sampled along a single dominant gradient; unlike Principal Component
Analysis and MultiDimensional Scaling, it does not produce a powerful horseshoe effect (see p 2) in which samples from
opposite ends of the gradient are placed close together. However, as is discussed below, the ordination can form an arch.
Chi-squared tests are frequently applied to contingency table data of this type to test if the two variables, handedness and
hair colour, are independently distributed. The data sets we are concerned with in multivariate analysis of field data are
also often arranged as a two-dimensional table with the elements comprising frequency data. For example, Table 10 shows
the counts for pottery shards at different localities.
Table 10: Pottery shard counts from different localities in the Mississippi valley. From Pierce & Christopher (1998)1.

                Caney    Claiborne  Copes  Hearns  Jaketown  Linsley  Poverty  Shoe   Teoc   Terral
                Mounds                                                Point    Bayou  Creek  Lewis
Biconical          29       3259      67      57       485       58     3122     23    228    104
Cylindrical         5       1230      78       0      1411        7     4718      3     12      0
Ellipsoidal         1       3476     130       5         3       33     5103      1      2    108
Spheroidal          7        824       2       6        29        4      355      1     16      5
Grooved sphere      1       2014      17       0       410        8     3434      0      2     93
Biscuit            58         22      11       8         0        4      138     12      2     12
Amorphous          55        476      56      90         0        7      866    116      4     65
Other               1        143       1      12         0        6      187      1      1      4
1 Pierce and Christopher 1998. Theory, measurement, and explanation: variable shapes in Poverty Point Objects. In Unit Issues in Archaeology,
edited by Ann F. Ramenofsky and Anastasia Steffen, pp.163-190. Utah University Press, Salt Lake City.
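Such a test is a one-liner in R; here it is applied to a small 3 x 3 slice of Table 10 (three shapes at Caney Mounds, Copes and Hearns):

```r
counts <- matrix(c(29,  67, 57,
                    5,  78,  0,
                    1, 130,  5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Biconical", "Cylindrical", "Ellipsoidal"),
                                 c("Caney Mounds", "Copes", "Hearns")))

# Tests whether object shape and locality are independently distributed;
# expect a warning here, as some expected counts are small
chisq.test(counts)
```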
In biology, this table would usually comprise samples as the columns and species as the rows. Correspondence Analysis
measures the degree of interdependence between the columns (sites or samples) and the rows (species, types of pottery,
etc.) using the standard Chi-squared statistic, which compares the observed frequencies with those expected if the rows
and columns were independent. Generally, we hope that our data do not show independence, as we expect certain species
or types of pottery to be associated with particular samples or sites. Looked at from the sample perspective, we expect
samples or sites to differ in their species or pottery composition.
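The test itself takes one line in base R; a minimal sketch (the handedness/hair-colour counts below are invented for illustration):

```r
# Invented handedness x hair-colour counts for three hair colours
tab <- matrix(c(40, 10,
                50, 20,
                15, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(Hair = c("Dark", "Fair", "Red"),
                              Hand = c("Right", "Left")))
res <- chisq.test(tab)
res$expected    # counts expected if the two variables were independent
res$statistic   # the chi-squared statistic; res$p.value tests independence
```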
Generally, linear algebra is used to calculate the ordination. Using the Chi-squared statistic to depict the degree to which
any associations in the data matrix depart from independence, the variance in these Chi-squared “distances” is evaluated
by eigenanalysis. The scores for the samples or sites are derived from a metric of species associations, and the more these
associations depart from independence, the further separated the final scores will be. Similarly, the scores for the sites
are used to find final scores for the row variables. The final result is an ordination in which samples most similar in their
species assemblage are closest together. Similarly, species which have a similar distribution across the samples will be
closest together.
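The eigenanalysis described above can be written out in a few lines of base R (an illustration of the algebra, not the code used elsewhere in this chapter; the small matrix is invented). The matrix of chi-squared standardized residuals is decomposed, and the row and column scores are recovered from the singular vectors:

```r
X <- matrix(c(10, 2, 0,
              3, 8, 1,
              0, 4, 9), nrow = 3, byrow = TRUE)
P  <- X / sum(X)                       # correspondence matrix
r  <- rowSums(P)                       # row masses
cm <- colSums(P)                       # column masses
S  <- diag(1/sqrt(r)) %*% (P - r %*% t(cm)) %*% diag(1/sqrt(cm))  # chi-squared residuals
dec <- svd(S)                          # the eigenanalysis step
# Principal coordinates of rows (species) and columns (samples); the last,
# trivial dimension has a singular value of ~0 and is discarded.
row.scores <- diag(1/sqrt(r))  %*% dec$u %*% diag(dec$d)
col.scores <- diag(1/sqrt(cm)) %*% dec$v %*% diag(dec$d)
sum(dec$d^2)   # total inertia = chi-squared statistic / table total
```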
CA and PCA ordinations are both usually derived by eigenanalysis and are related methods, which differ in their measure
of the distance between samples. In PCA, the matrix of species abundances is transformed into a matrix of covariances or
correlations, each abundance value being replaced by a measure of correlation or covariance with other species. CA uses
Chi-squared as a measure of association, rather than the correlation coefficient or covariance.
CA is also called Reciprocal Averaging (RA). RA is an alternative method of calculating the ordination, developed by
M.O. Hill in 1973; at the time, Hill did not realise that his solution was in fact another form of Correspondence Analysis.
The method involves the repeated averaging of column (sample) scores and row (species) scores until the correspondence
between row and column scores is maximised and convergence is reached. This approach is easier to understand and can
be applied by hand to small data sets.
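For a small invented matrix, the reciprocal averaging iteration can be sketched in base R (the mean/sd rescaling used here is one of several possible standardizations; Hill's original used the range):

```r
X <- matrix(c(10, 2, 0,
              3, 8, 1,
              0, 4, 9), nrow = 3, byrow = TRUE)   # species (rows) x samples (columns)
site <- seq_len(ncol(X))                 # arbitrary starting sample scores
for (i in 1:100) {
  spp  <- as.vector(X %*% site) / rowSums(X)    # species score = weighted average of sample scores
  site <- as.vector(t(X) %*% spp) / colSums(X)  # sample score = weighted average of species scores
  site <- (site - mean(site)) / sd(site)        # rescale so the scores do not shrink
}
site   # converged first-axis sample scores (up to sign and scale)
```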
The 2nd and higher axes of a CA ordination, like those of PCA, can be distorted by non-linear relationships between the
row variables. This is termed the arch effect, and is most likely to occur in ecological data sets with high β diversity and
a strong gradation of species (β diversity is that between locations). As well as the arch, the axis extremes of CA can be
compressed. In other words, the spacing of samples along an axis may not reflect true differences in species composition.
Results
Fig. 31 shows a joint plot of the individuals (green squares) and the drinks (labelled, red, triangles). Note that diet and
non-diet Coke and Pepsi are separated at opposite ends of axis 1. Axis 2 distinguishes between Sprite and 7 Up drinkers
(Sprite+ and 7 up+) and non-drinkers (Sprite- and 7 up-). The lack of individuals in the upper right-hand region of the
ordination indicates that no one drinks only diet non-colas.
Conclusions
It is possible to classify individuals by their consumption of diet and non-diet, and cola and non-cola, soft drinks.
Alternative approaches
The doubling of the variables by including their opposite does not greatly add to the analysis, and a simpler plot can be
obtained with only the positive variables included (see R code below). In this case, similar results are produced by
Correspondence Analysis and Detrended Correspondence Analysis (DECORANA).
Code and results using the vegan package in R on the beverages data
For this example, the cca() function in the vegan package is used to undertake the Correspondence Analysis. The
summary() function defaults to the scaling = 2 option, in which the columns are placed at the centroids of the rows.
This scaling is most appropriate if our primary interest is the variables rather than the individuals. As an example of the
alternative approach, the data set beverages R2.csv comprises only the positive values (see Alternative approaches above).
As before, diet and non-diet Coke and Pepsi are separated at opposite ends of axis 1. Axis 2 distinguishes between Sprite
and 7 Up drinkers and non-drinkers (Fig. 32).
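The analysis described can be run along the following lines (a sketch: the file name follows the text above, but the working directory and any path are assumptions):

```r
library(vegan)
# beverages R2.csv: individuals as rows, the positive drink variables as columns
x <- read.csv("beverages R2.csv", header = TRUE, row.names = 1)
m <- cca(x)    # Correspondence Analysis of the positive variables only
summary(m)     # scaling = 2 by default: columns at the centroids of the rows
plot(m)        # joint plot of individuals and drinks
```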
Fig. 32: The choice of carbonated drink in the USA: an example of Correspondence Analysis applied to
categorical data using the cca() function in the R vegan package. The biplot was produced by Correspondence
Analysis. The blue dots are the individuals.
The data set analysed here was taken from the seriation sequence diagram in Berg and Blieden (2000). The width of the
lines gave some measure of relative abundance in each layer, and this was converted into a simple abundance scale from 0
to 8. Because of the difficulty of reading data from this figure, the data will not be identical to the original; however, the
analysis gives very similar results.
Results
Fig. 33 shows a joint plot of the pottery types (green triangles) and the trench layers (dark green squares). The squares
make up a roughly horseshoe-shaped curve. The authors state that “By moving from top left to top right along this ‘horseshoe’ we
seem to go from older layers to younger layers. The layers that are marked near the top left of the Figure (such as those of PiDI/PiE) appear
near the bottom of the seriation diagram, and we believe these to be the most ancient. PiA 69 and PiA 67 and various layers from KKd are
on the most extreme right of the Figure and are listed as the youngest layers by seriation.”
They also consider the structure of an individual trench. PLa samples are clumped at the base of the horseshoe, and on
either side of this cluster there are clusters of PiS samples. The authors suggest that “deposits of trench PiS fall into distinct
phases” ... “We believe that this trench is located in a part of Phylakopi that was occupied by the local shrine and this fits in well with the
statistical pattern we have just observed. Once a shrine is built, we would expect the area around it to remain unchanged for a long time.” This
stability is reflected in the discontinuity in PiS layers.
Fig. 33: Correspondence Analysis joint plot
of the pottery types (green squares) and the
trench layers (smaller symbols) for the Melos
study.
# The following code produces better output, with the groups colour-coded.
# It assumes the Melos table is in x, and that p.lab and Group hold the
# plotting symbol and group (colour) for each sample.
library(ca)
r.c <- ca(x)$rowcoord
c.c <- ca(x)$colcoord
xrange <- range(r.c[,1]*1.5, c.c[,1]*1.5)
yrange <- range(r.c[,2]*1.5, c.c[,2]*1.5)
plot(xrange, yrange, type="n", xlab="Dimension 1", ylab="Dimension 2", main="Correspondence Plot")
points(r.c[,1], r.c[,2], pch=p.lab, col=Group, cex=0.75)
points(c.c[,1], c.c[,2], pch=4)
Fig. 34: Correspondence Analysis undertaken on the Melos data using the ca R package. Biplot of the first 2
axes. The blue dots are the individual samples and the red triangles the variables.
Fig. 35: Correspondence Analysis undertaken on the Melos data using the ca R package. Biplot of the first 2
axes. The individual samples are coded by the Group variable in the dataset: A, PiA; B, KKd; C, PiS; D, PK; E,
PiC; F, PLa; G, PiD. The black crosses mark the position of the variables.
Conclusions
CA successfully ordered the samples along a temporal gradient and clearly showed the discontinuity in the PiS trench.
Alternative approaches
The use of the word horseshoe by the authors in their description of the ordination plot immediately raises the question as
to whether a clearer ordination of the temporal sequence could have been presented using Detrended Correspondence
Analysis (DECORANA). As shown below in Fig. 36, there is actually little improvement.
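A detrended ordination such as Fig. 36 can be produced with vegan's decorana() function (a sketch, assuming the Melos abundance matrix has already been read into x):

```r
library(vegan)
dca <- decorana(x)   # Detrended Correspondence Analysis (DECORANA)
plot(dca)            # joint plot of trench layers and pottery types
```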
Fig. 36: Detrended Correspondence Analysis joint plot or biplot of the pottery types (large green triangles) and
the trench layers (smaller symbols) for the Melos study.
Results
As with the PCA results, there is a clear difference in the fish community between the early 1980s and later years. Because
of the number of variables (fish species) included in the analysis, it was decided to plot the samples (years) and variables
(species) on different graphs (Fig. 37, page 66 and Fig. 38, page 67). The 1980s samples were dominated by colder-water
species: Limanda limanda, Liparis liparis, Trisopterus minutus and Anguilla anguilla (Fig. 38). The author concluded that
climate change was having an impact on the fish community.
Conclusions
PCA and CA lead to the same conclusion: the fish community in the early 1980s was different from that in later years. It is
normal for the different methods to give consistent results, suggesting that when there are clear differences and similarities
between samples, the choice of method is not as critical as is often implied in critiques of ordination methods. If this is
not the case, consider carefully whether you can draw any conclusions from your data; you may just be seeing random patterns.
Fig. 37: DECORANA plot of the years
for the Hinkley fish data, showing the
clear difference in species composition
pre- and post-1986.
Fig. 38: DECORANA plot of the fish variables for the Hinkley fish data. By comparison with Fig. 37 it is possible
to identify the species change pre- and post-1986. The fish species Limanda, Liparis, Anguilla and Trisopterus M are
all associated with the pre-1986 period.
library(ca)
# The data set has names of fish in row 1; column 1 is the year of sampling
x <- read.csv(file="D:\\Demo Data\\Hinkley fish R.csv", header=TRUE, row.names=1)
ca(x)          # undertakes correspondence analysis
summary(ca(x)) # summary output
plot(ca(x))    # simple plot
In Fig. 40 a square root transformation was used to give a clearer ordination; using R this is easily accomplished using the
function sqrt(y).
library(ca)
# The data has names of fish in row 1; column 1 is the year of sampling
y <- read.csv(file="D:\\Demo Data\\Hinkley fish R.csv", header=TRUE, row.names=1)
x <- sqrt(y)
ca(x)          # undertakes correspondence analysis
summary(ca(x)) # summary output
plot(ca(x))    # simple plot
Fig. 39: (left) Correspondence Analysis undertaken on the untransformed Hinkley fish data using the ca package
in R. Biplot of the first 2 axes. The blue dots are the individual years and the red triangles the fish. The early
1980s are clearly different and characterised by the presence of the fish species Anguilla anguilla, Liparis liparis,
Limanda limanda and Trisopterus minutus.
Fig. 40: (right) Correspondence Analysis undertaken on the square root-transformed Hinkley fish data using
the ca package in R. Biplot of the first 2 axes. The blue dots are the individual years and the red triangles the
fish. The early 1980s are clearly different and characterised by the presence of the fish species Anguilla anguilla,
Liparis liparis, Limanda limanda and Trisopterus minutus. Square root transforming gives a clear grouping of the 1980s
samples.
Chapter 5: MultiDimensional Scaling
This method produces an ordination of only the samples in an n-dimensional space so that the most similar samples are
placed closest together. The measure of similarity used is at the discretion of the researcher.
Uses
Use MultiDimensional Scaling to ordinate samples when you do not wish to be constrained by a particular measure of
similarity or distance between objects.
A key advantage of the method over PCA or CA is the wide choice of similarity/distance or association measures that can be used.
The key idea behind MDS is to find the best arrangement within a reduced space of 2 or 3 dimensions that places the
most similar samples closest together and the least similar further apart. Because of the multidimensional nature of the
data, it is usually not possible to find a perfect arrangement in a small number of dimensions. The degree to which the
ordination produces a good arrangement, in which the samples are positioned so that their distances apart reflect their
similarities, is measured by a parameter termed the stress. The greater the stress, the poorer is the representation of the
sample similarity in the reduced dimensional space.
In CAP, to test the adequacy of the final MDS ordination, you must run the procedure a number of times while selecting
Random as the initial starting position.
Once you have selected the number of dimensions in which to display your results, there is no direct way to calculate the
best positions for the samples, so MDS uses an iterative procedure to find a good solution. It is possible for this iterative
procedure not to find the best, or even a good, solution. For this reason, it is normal to test a number of different
starting positions and compare the ordinations produced. The best ordination will be the one with the lowest stress. With
typical data sets, very similar ordinations will be produced irrespective of the starting positions. If this is not the case, try
increasing the maximum number of iterations in the set-up options.
It is important to consider if the relationships are well expressed with the reduced number of dimensions chosen. This
can be done by looking at the change in stress with increasing number of dimensions. Ideally, a 2- or 3-dimensional plot
should have a stress level little greater than the arrangement of samples in a 4-, 5- or 6-dimensional space.
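Using vegan (described later in this chapter), this check can be sketched as follows, assuming the data are already in a community matrix x:

```r
library(vegan)
stress <- sapply(1:6, function(k)
  metaMDS(x, k = k, trace = FALSE)$stress)  # best stress found in 1-6 dimensions
plot(1:6, stress, type = "b",
     xlab = "Number of dimensions", ylab = "Stress")
```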
There exists a multitude of variants of MDS, with slightly different functions and algorithms to find the best solution.
In practice, the exact mathematical method is of little importance compared with the choice of similarity measure. It
is difficult to give simple advice on this topic. Obviously, if you have gone to great effort to collect quantitative data,
it would be foolish to use a measure such as the Jaccard index, which only uses presence/absence information. In marine
benthic ecology, where animals vary greatly in abundance but quantitative data are collected, a good compromise is the
Bray-Curtis similarity index. In pollen record studies, Gavin et al. (2003)1 have shown that the squared-chord distance
“outperforms most other metrics”.
1 Daniel G. Gavin, W. Wyatt Oswald, Eugene R. Wahl, and John W. Williams. (2003). A statistical approach to evaluating distance metrics and
analog assignments for pollen records. Quaternary Research 60, 356–367
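The Bray-Curtis index is simple enough to state directly; a base-R sketch with invented counts (vegan's vegdist(x, method = "bray") computes the same dissimilarity for a whole matrix):

```r
# Bray-Curtis dissimilarity: sum of absolute differences / total abundance
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)

a <- c(12, 0, 3, 5)   # abundances in sample A (invented)
b <- c( 4, 1, 3, 0)   # abundances in sample B (invented)
bray(a, b)            # 14/28 = 0.5; 0 = identical, 1 = nothing in common
```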
Rotate output
If selected, a PCA is performed on the final site coordinates. The default is usually to select this option.
Fig. 42: The option window for MultiDimensional Scaling within the
CAP software.
Similarity measure
The measure to use must be selected. MDS can use a wide range of distance
or similarity measures, and this needs careful consideration. Some measures,
such as Sørensen’s, use only presence/absence data, and therefore should
not be used when you have spent effort and money in quantifying your
variables. In this case a Bray-Curtis, Euclidean or other measure might be
suitable. Generally, there is no theory to guide this choice, so try a few
different measures to check that your conclusions are robust. While different
measures produce different ordinations, if your conclusions are robust,
quantitative measures should each show the same groupings of points.
Number of dimensions
The default value is 2, as you need to display your ordination on a flat
surface. For higher values, the program calculates the configuration of the
points from this number of dimensions down to 1 dimension, listing the
final stress for each number of dimensions. More than 3 dimensions are only
used to find out how the stress changes with the number of dimensions.
Maximum iterations
You can change the number of iterations used by the stress minimisation algorithm; in CAP the default is 200. While
there is a relationship between the number of iterations and the magnitude of the stress level achieved, in practice there is
often little advantage in selecting a higher iteration number. You may like to vary this number to become satisfied that the
minimum stress level possible has been achieved.
Preliminary considerations
The author used the Euclidean distance measure. This measure is not really appropriate for percentage frequency data
so here we have used the Bray-Curtis percentage similarity measure. The general conclusions on the seriation of the
potsherds are not sensitive to the distance measure, although the Bray-Curtis similarity measure gives a clearer ordination.
Further, the MDS ordination gives similar results for almost all random starting positions. Using Euclidean distance, the
ordination is more sensitive to the starting position.
Results
Fig. 43, page 78 shows the results of a multidimensional scaling in 2 dimensions. The results are similar to those
presented by Usman (2003), and were generally consistent irrespective of starting position, indicating that this is close to
the ordination that minimises stress. Site GIP-5b is a clear outlier. The other sites fit into an approximately linear sequence,
with the earliest (GIP-21a and GIP-21b) at the top and the latest (GLR-13 and GLR-15) at the bottom.
For comparison, the plot presented by Usman (2003) computed using the Euclidean distance is also shown, in Fig. 44,
page 79. Note that the temporal sequence is now in the form of a horseshoe curve.
The author did not consider the stress level of the 2D plot, or the change in stress with the number of dimensions; this
is shown in Fig. 45. Note that at 4 dimensions and above, stress is very low and hardly declines further as the number
of dimensions is increased. There is, however, a notable decline in stress between 2 and 3 dimensions, suggesting that
these data might be more usefully displayed as a 3-dimensional plot (Fig. 46, page 80). On this 3-dimensional plot, the
clusters identified by Usman using k-means clustering have also been marked in different colours (Fig. 47, page 81).
It is clear that the k-means clustering and MDS plots do not completely agree. In this plot the light blue and red groups
are not clearly separated, and GLR-12 does not form a clear solitary group. Both the light blue group and GLR-12 are
characterised by a large score on axis 2. This is made clearer in Fig. 47, where the greater height of the light blue group
is more visible, demonstrating the importance of the choice of viewing angle in a 3D presentation.
Fig. 44: (left) The seriation of Nigerian pottery using MultiDimensional Scaling as published by Usman (2003).
The author used Euclidean distance.
Fig. 45: (right) The change in stress with the number of dimensions for the seriation of Nigerian pottery using
MultiDimensional Scaling. The Bray-Curtis similarity measure was used.
In CAP, to produce a 6D stress graph for an MDS plot, first select 6 dimensions in the initial MDS Setup screen. Then
click on the MDS Plots tab, which will show a 2D graph. Select the Stress vs Dimension radio button, and a stress plot
will be produced.
Fig. 46: A 3-dimensional plot of the seriation of Nigerian pottery using MultiDimensional Scaling. The Bray-
Curtis similarity measure was used.
In CAP, to produce a 3D graph for an MDS plot, first select 3 dimensions in the initial MDS Setup screen. Then click on the MDS Plots tab,
which will show a 2D graph. Select the Samples (3D) radio button and a 3D plot will be produced. To put stalks on the points, click on the
Drop Lines button (second from the right above the graph).
Fig. 47: Alternative view of a 3-dimensional plot of the seriation of Nigerian pottery using MultiDimensional
Scaling. The Bray-Curtis similarity measure was used.
Conclusions
NMDS using the Bray-Curtis similarity measure was able to produce a reliable ordination, which was stable irrespective
of starting position. This ordination arranged the sites in a logical temporal order in both 2D and 3D space. The stress
vs. dimension plot indicates that a 3D plot is most suitable for these data. This is confirmed by other analyses undertaken
in Usman (2003), which suggest the presence of 5 groups that could be distinguished in the 3D MDS plot. However,
this plot needed to be carefully rotated to show the relationship between the samples. The author had used a Euclidean
distance measure. This resulted in a horseshoe shape for the temporal sequence that was less satisfactory than the almost-
linear sequence produced here. A Euclidean distance measure is unsatisfactory for percentage frequency data and should
not have been used. However, it did lead to the same general conclusion as the Bray-Curtis similarity measure.
A final consideration is the exclusion of the outlier GIP-5b. If this site is removed from the analysis a clearer ordination
showing the temporal sequence of the sites is produced.
Preliminary considerations
Henderson (2007) used Sørensen’s similarity index, which only uses presence/absence data. This was deliberate as he
wished to give every species, irrespective of abundance, an equal weighting in the analysis. The ordination is therefore a
search for an abrupt change in species presence.
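Sørensen's index uses only shared presences; a base-R sketch with invented presence/absence vectors (in vegan the equivalent distance is vegdist(x, method = "bray", binary = TRUE)):

```r
# Sørensen similarity: 2a / (2a + b + c), where a = species present in both
# samples, and b, c = species present in only one of them
sorensen <- function(p, q) {
  a  <- sum(p & q)
  bc <- sum(xor(p, q))
  2 * a / (2 * a + bc)
}
p <- c(1, 1, 0, 1, 0) > 0   # presences in sample 1 (invented)
q <- c(1, 0, 0, 1, 1) > 0   # presences in sample 2 (invented)
sorensen(p, q)              # 2*2 / (2*2 + 2) = 0.667
```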
Results
Fig. 48 shows the results of a MultiDimensional Scaling in 2 dimensions. The results support the view that there was a
change in community structure around 1993, and that years pre- and post-1993 form 2 distinct groups.
Fig. 48: The results of an MDS ordination for the Hinkley Point fish data set, showing the difference in the fish
community pre- and post-1993. The Sørensen similarity index was used.
Code and results using the vegan package in R on the Hinkley data
For this example, the metaMDS function in the vegan package is used to undertake multidimensional scaling. The data
set is organised as a standard vegan community matrix, with species (variables) forming the columns and sites (samples) the
rows. The first row and column hold the species and sample names. In this Hinkley example the sites are years. For biological
data it is generally appropriate to accept the default options of the metaMDS function, as was the case in the box below. The
default dissimilarity measure is the Bray-Curtis, which is a standard measure for quantitative ecological data. Other measures
are available in vegan; for example, distance="euclidean" applies the Euclidean distance measure.
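The run described here can be sketched as follows (an illustration rather than the authors' exact script; the file path repeats the one used earlier in the chapter):

```r
library(vegan)
x   <- read.csv(file = "D:\\Demo Data\\Hinkley fish R.csv",
                header = TRUE, row.names = 1)
mds <- metaMDS(x)       # Bray-Curtis by default, with random restarts
mds$stress              # stress of the best solution found
plot(mds, type = "t")   # label years and species on the ordination
```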
Wisconsin double standardization: the abundance values are first standardized by species maximum standardization, and
then by sample total standardization, and by convention multiplied by 100.
As a default, the metaMDS function performs a Wisconsin double standardization if the data values are larger than
common class scales. After finding a solution, metaMDS runs postMDS for the final result. Function postMDS moves
the origin to the average of the axes and undertakes a Principal Component Analysis to rotate the configuration so that
the variance of points is maximized on the first dimension (with function metaMDSrotate you can alternatively rotate the
configuration so that the first axis is parallel to an environmental variable).
NMDS is easily trapped by local optima, and you must start NMDS several times from random starting positions to be
confident that you have found the global solution. The default in isoMDS is to start from a metric scaling which typically
is close to a local optimum. metaMDS first runs a default isoMDS, or uses the previous.best solution if supplied (see
box below), and takes this solution as the standard (Run 0). Then metaMDS starts isoMDS from several random starts
(maximum given by trymax). If a solution has a lower stress than the previous standard, it is taken as the new standard.
If the solution is better than, or close to, the standard, metaMDS compares the two solutions using Procrustes analysis
(page 70). If the two solutions have very similar Procrustes rmse and the largest residual is very small, the solutions are
regarded as convergent and the result is output.
# For congested plots, the following will display a plot with symbols
# for both samples and variables. Click on the individual points you
# would like identified, then press "escape" to visualize the labels.
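The code these comments refer to is not reproduced here; one way to achieve the same effect in vegan (a sketch, assuming mds holds a metaMDS result) is:

```r
op <- plot(mds, type = "p")    # plot() returns an ordiplot object invisibly
identify(op, what = "sites")   # click the points to label; press Escape to finish
```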
The ordination plotted in Fig. 49, page 86 shows that the fish community has changed over time. This is illustrated by
using the year label to mark the position of each sample. Note that the fish communities observed between 1981 and 1985
are all placed within the lower left-hand corner of the ordination. The ordination gives no clue as to the reason for this
temporal change. An examination of the changes in individual species' abundances showed the pattern to be driven by a
change in the relative abundances of cold- and warm-water species, linked to an increase in water temperature from the
late 1980s.
The ability of the ordination to summarise the dissimilarity between the samples can be examined by plotting the observed
dissimilarity against the distance apart of each sample in the generated ordination using the function stressplot(mds)
(Fig. 50, page 86). A high correlation suggests a successful ordination.
Fig. 49: MultiDimensional Scaling undertaken on the Hinkley fish data using the metaMDS function in the
vegan package. The early 1980s are clearly different.
Fig. 50: A comparison of the observed dissimilarity between samples and their distance apart in a 2D
ordination plot. The high correlation suggests that the ordination gives a successful summary of relationships.
MultiDimensional Scaling undertaken on the Hinkley fish data using the metaMDS function in the vegan
package. Plot produced using the function stressplot(mds).
Arch (Upper Ordovician). 27 bulk samples were collected from ten localities; these samples represent eight formations
and five of the six stratigraphic sequences composing the type Cincinnatian. The authors wished to study the pattern of
occurrence of bivalves and gastropods to test if they each favoured different palaeoenvironments.
Because the interest is on the pattern of occurrence of the bivalve and gastropod genera, this is an example of an R-mode
analysis. The similarity of the genera, in terms of the samples they were recorded from, is analysed. In a more typical
Q-mode analysis, we examine the similarity of the samples in terms of their genera or other attributes. (See page 143 for
a fuller explanation of R- and Q-mode analyses).
The authors reached the conclusion that “Non-metric multidimensional scaling and several statistical analyses show that the taxonomic
richness and abundance of these classes (bivalve and gastropod) within samples were significantly negatively correlated, such that bivalve-rich
settings were only sparsely inhabited by gastropods and vice versa.”
Preliminary considerations
The authors used the Bray-Curtis similarity measure, which employs quantitative data. This was presumably chosen because
they had good counts of the abundance of each genus of bivalve and gastropod in the samples.
Results
Fig. 51, page 88 shows the results of a Non-metric MultiDimensional Scaling in 3 dimensions, which clearly show
that the bivalves and gastropods occupy different regions of the ordination space. The authors were therefore able to
demonstrate that bivalves and gastropods were most abundant in different strata during the Ordovician, presumably
because of differences in habitat preference.
Fig. 51: Gastropod and bivalve Ordovician fossil assemblages, as shown by MDS. The Bray-Curtis similarity
measure is applied.
Chapter 6: Linear Discriminant Analysis
Discriminant Analysis (DA) is also called Canonical Variate Analysis (CVA).
Uses
Discriminant Analysis is a standard method for testing the significance of previously-defined groups, identifying and
describing which variables distinguish between the groups, and producing a model to allocate new samples to a group.
DA allows the relationship between groups of samples to be displayed graphically. A goal of a Discriminant Analysis is to
produce a simple function that, given the available parameters, will classify the samples or objects.
explain. Fig. 52 shows a plot of two groups, 1 and 2, with respect to two variables. Note that if only one variable is used,
there is considerable overlap, and the groups are poorly differentiated. As shown in Fig. 53, using a linear combination of
both variables we can produce an axis that combines both variables, giving a far better discrimination between the groups
than either variable alone. Fisher’s method was to find the linear combination of variables that maximised the ratio of the
between-group sums of squares to the within-group sums of squares.
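For the two-group case, Fisher's criterion can be written out directly: the discriminant direction is w = Sw⁻¹(m1 − m2), where Sw is the pooled within-group covariance matrix and m1, m2 are the group mean vectors. A base-R sketch on invented data:

```r
set.seed(1)
g1 <- cbind(rnorm(50, mean = 0), rnorm(50, mean = 0))   # group 1 (invented)
g2 <- cbind(rnorm(50, mean = 2), rnorm(50, mean = 1))   # group 2 (invented)
# Pooled within-group covariance matrix
Sw <- ((nrow(g1) - 1) * cov(g1) + (nrow(g2) - 1) * cov(g2)) /
      (nrow(g1) + nrow(g2) - 2)
w  <- drop(solve(Sw) %*% (colMeans(g1) - colMeans(g2)))  # discriminant direction
s1 <- as.vector(g1 %*% w)   # scores of group 1 on the combined axis
s2 <- as.vector(g2 %*% w)   # scores of group 2
```

By construction, the separation of the two groups along this axis is at least as good as along either variable alone.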
Normality
The method assumes that the values for the predictors (row variables) are independently and randomly sampled from the
population and that the sample distribution of any linear combination of predictors is normally distributed. Deviations
from normality caused by skew are unlikely to invalidate the result. However, the method is sensitive to outliers, which
should be removed, or the data transformed to reduce their influence.
Fig. 52: (left) A sketch of the plot of two hypothetical groups defined by two variables. Below the plot is shown
the distribution of the two groups along variable 1. Note that they are very poorly separated and it would be
difficult to assign many samples to groups.
Fig. 53: (right) The same two hypothetical groups defined by two variables; to one side of the plot is shown
the distribution of the two groups along an axis combining variables 1 and 2. Note that compared with using a
single variable, the linear combination allows most samples to be clearly allocated to one group or another.
Results
Each of the 150 observations was allocated to the setosa, versicolor or virginica group, and a discriminant plot produced (Fig.
54). Iris setosa is clearly separated, and it is simple to draw a line on the chart to distinguish this species from the other two.
While versicolor and virginica form separate clusters, there are 3 points which would be misclassified if a straight line were
drawn between the groups. This was confirmed when the discriminant functions were used to allocate samples to species
(Table 11), which shows that two specimens of versicolor and one of virginica were misclassified.
Table 11: A table comparing the original iris classifications with those produced by Discriminant Analysis.

                          Original    Original       Original      No.      %
                          I. setosa   I. versicolor  I. virginica  correct  correct
Predicted I. setosa             50              0              0       50      100
Predicted I. versicolor          0             48              1       48       98
Predicted I. virginica           0              2             49       49       96
Total                           50             50             50      147       98
1 Fisher, R.A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7: 179–188
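Table 11 can be reproduced with the lda() function in the MASS package (a sketch; as in the table, the fitted functions are used to re-predict the training samples):

```r
library(MASS)                               # lda() lives in MASS
fit  <- lda(Species ~ ., data = iris)       # Fisher's iris data ships with R
pred <- predict(fit)$class                  # allocate each flower to a species
table(Predicted = pred, Original = iris$Species)
sum(pred == iris$Species)                   # 147 of the 150 are correct
```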
In CAP, Discriminant Analysis is selected from within the Groups drop-down menu. The method is only available if
each sample or object has been assigned to a group. This is undertaken using the Edit Groups menu.
Fig. 54: The results of a Linear Discriminant Analysis applied to the well-known Fisher iris data. The centroid
(centre of mass) of each species is shown as a square.
Results
An initial Discriminant Analysis with the different
periods as the groups gave a confusing plot with the
different periods not forming tight clusters (Fig. 55).
However, examination of the plot shows that more
recent skulls lie more to the left. This tendency for
the skulls to be arranged in a time sequence from
right to left is shown clearly if only the cluster
centroids are plotted (Fig. 56). Examination of the
positions of the centroids suggests that it might be
possible to discriminate between early skulls (from
4000 and 3300 BC) and late skulls (from 200 BC and
150 AD). The plot was therefore changed to only
show the 4000 BC and 150 AD skulls (Fig. 57, page
96). While there are skulls that do not conform to their group and create overlapping clusters, a clear separation is apparent.

Fig. 55: The plot of the first and second discriminant functions for the Egyptian skull data. The groups are defined by age.
94
Fig. 56: The position of the centroids with the space defined by the first and second discriminant functions for
the Egyptian skull data. Note that the centroids represent groups of increasing age in a left-right direction.
95
Fig. 57: The plot of the first and second discriminant functions for the Egyptian skull data. Only the 4000 BC
and 150 AD groups are plotted.
In CAP, to select a reduced number of groups in the plot, first display it by opening the Plot tab. Next click on the Edit button above the plot
(it has a set square symbol). Open the Series tab (if not already open) and remove the tick from groups/series you do not wish to see.
96
Results
The DA plot (Fig. 58, page 99) strongly suggests that pottery from Llanederyn and Caldicot is different in chemical
composition from that collected from Ashley Rails and Island Thorns. Further, there seems to be no clear separation
between the two New Forest sites, Ashley Rails and Island Thorns, suggesting they form a single group. This view is
reinforced following examination of the observed and predicted allocation to group (Table 12, page 100). DA allocated
one Ashley Rails sample to Island Thorns, and one Island Thorns sample to Ashley Rails, indicating a lack of discrimination
between these sites. The analysis was therefore repeated with the Ashley Rails and Island Thorns samples combined into
a New Forest group, producing the clearer plot shown in Fig. 59, page 100.
To assign a piece of pottery to one of our 3 groups we use the classification equations generated by DA to give a score for each group:

Cj = cj0 + cj1 × X1 + cj2 × X2 + … + cjp × Xp

97

where cji is a classification function coefficient, j is the group and p the number of variables. In our case the equations are:
Caldicot:
Ccald = −76.217 + 3.73 × Al + 11.17 × Fe + 0.84 × Mg + 155.68 × Ca − 17.22 × Na

Llanederyn:
Clla = −80.81 + 3.75 × Al + 11.75 × Fe + 4.17 × Mg + 85.62 × Ca + 8.73 × Na

New Forest:
CNF = −75.29 + 8.75 × Al − 2.52 × Fe + 0.7 × Mg − 1.49 × Ca − 19.67 × Na

In CAP, the classification coefficients used here are listed under the tab labelled Fisher's Disc. Func.
A sample is allocated to the group for which it has the highest classification score. For example, a sample with 14% Al, 7%
Fe, 4% Mg, 0.1% Ca and 0.5% Na gives:
Ccald = 64.511
and in similar fashion:
Clla = 83.547
CNF = 22.286
This sample is therefore allocated to the Llanederyn group as it has the highest score.
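The arithmetic can be checked with a few lines of Python. This is a sketch using the coefficients as printed in the equations, so small rounding differences from the book's quoted values are possible:

```python
# Classification function coefficients (constant, Al, Fe, Mg, Ca, Na)
# as printed in the equations above.
coef = {
    "Caldicot":   [-76.217, 3.73, 11.17, 0.84, 155.68, -17.22],
    "Llanederyn": [-80.81,  3.75, 11.75, 4.17,  85.62,   8.73],
    "New Forest": [-75.29,  8.75, -2.52, 0.70,  -1.49,  -19.67],
}

# The worked example: 14% Al, 7% Fe, 4% Mg, 0.1% Ca and 0.5% Na
sample = [14, 7, 4, 0.1, 0.5]

scores = {g: c[0] + sum(ci * xi for ci, xi in zip(c[1:], sample))
          for g, c in coef.items()}
best = max(scores, key=scores.get)   # allocate to the highest score

print(scores)   # Caldicot ≈ 64.511, Llanederyn ≈ 83.547
print(best)     # Llanederyn
```

The Caldicot and Llanederyn scores reproduce the text exactly; the New Forest score comes out very slightly different from the printed 22.286, presumably because of rounding in the published coefficients, but the allocation to Llanederyn is unchanged.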
98
Fig. 58: The plot of the first two discriminant scores to separate different types of Romano-British pottery.
99
Table 12: A comparison of the actual group allocation with that generated by Discriminant Analysis for the
Romano-British pottery data set.
                                  Original                                        Totals
Predicted        Ashley Rails   Caldicot   Island Thorns   Llanederyn   No. correct   % correct
Ashley Rails          4             0            1              0            4            80
Caldicot              0             2            0              0            2           100
Island Thorns         1             0            4              0            4            80
Llanederyn            0             0            0             14           14           100

In CAP, the observed and predicted group memberships are found under the Predictive Validation tab.
Fig. 59: The plot of the first two discriminant scores to separate different types of Romano-British pottery. The
Ashley Rails and Island Thorns samples have been combined into a single group.
100
The default Linear Discriminant Analysis produced the simplified output shown in Fig. 60, page 102. This shows that
the LDA gave 100% correct predictions for Caldicot and Llanederyn. The scatter plot output option (lda_scat) gave the
scatterplot shown in Fig. 61, page 102. The results show that Island Thorns and Ashley Rails are similar and could be
combined into a single group.
The Prediction - Accuracy Table option (lda_pred) gives the output shown in Fig. 62, page 103. LDA offers further
output options: see Help for the full range of outputs, which includes “Discriminant Functions”.
101
102
103
Chapter 7: Canonical Correspondence Analysis (CCA)
Uses
CCA is the favoured method for producing an ordination which includes the possible causal factors within the analysis.
The result is a plot that shows the relationship between the samples, the dependent variables within each sample and the
explanatory variables.
104
CCA is called a constrained analysis method, as the ordination is constrained by the environmental variables.

Values of the coefficient of determination (R²) close to 1, or variance inflation factors (VIFs) well above 1, are indicative of multicollinearity. When this occurs, you should consider removing one of the highly-correlated variables from the analysis.
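The VIF for an explanatory variable is 1/(1 − R²), where R² comes from regressing that variable on the others. A minimal sketch in Python for the two-variable case, using simulated stand-ins for altitude, precipitation and latitude (illustrative data, not the book's):

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two(x, y):
    """VIF of one explanatory variable regressed on one other: 1 / (1 - r^2)."""
    r2 = pearson_r(x, y) ** 2
    return 1 / (1 - r2)

random.seed(1)
alt = [random.gauss(0, 1) for _ in range(200)]
precip = [a + 0.3 * random.gauss(0, 1) for a in alt]   # strongly tied to altitude
lat = [random.gauss(0, 1) for _ in range(200)]          # independent

print(vif_two(alt, precip))  # well above 1 -> multicollinearity
print(vif_two(alt, lat))     # close to 1 -> effectively independent
```

With more than two explanatory variables the R² comes from a multiple regression of each variable on all the others, but the interpretation of the resulting VIF is the same.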
106
Fig. 63: An example of a CCA triplot with site at species centroid scaling. This emphasises the difference
between the variables, which are plotted here as red squares.
107
Fig. 64: An example of a CCA triplot with species at site centroid scaling. This emphasises the difference
between the samples or objects, which are plotted here as green triangles.
108
109
as the raw data do not produce the ordination presented in the paper. The dependent variables are given as percentage
compositions in each sample. These numbers are used untransformed.
Results
The ordination plot of the samples and the environmental vectors is shown in Fig. 65. The general arrangement of the
samples is similar to that presented by Pokorný (2002) and the samples are grouped as per his analysis. In producing this
result, multicollinearity between explanatory variables was ignored, but will be discussed below.
One lowland site (dark blue cross, sample 7), situated in the floodplain of the river Labe, forms a single outlier shown in
blue. This site has a high percentage of Pinus pollen and the unusual community was related to soil conditions.
Group 1 (green crosses, samples 6, 8, 15, 16) were lowland sites. Low percentages of Picea pollen and a somewhat higher
percentage of Pinus and other forest tree pollen characterised these samples.
Group 2 (pink squares, samples 1 to 5 and 9 to 12). This group consists of upland and mountain sites, with the exception of site 1, which was believed to have deposited a record of mountain vegetation. Common features of these samples were low percentages of Quercus and Carpinus, and high percentages of Abies, Fagus and Picea pollen.
Group 3 (light blue stars, samples 13 and 14). This small group consists of two sites in the Třeboňská pánev basin with
similar qualitative characteristics to Group 2, but with a higher percentage of Pinus pollen present.
110
Fig. 65: A Canonical Correspondence Analysis (CCA) ordination biplot of 16 pollen samples. The samples are
placed into the groups defined by Pokorný (2002).
111
Table 13: Results of a check for multicollinearity between the variables used in the pollen study of Pokorný (2002). The notably high VIF for the variable precipitation is highlighted by bold underline.

Explanatory variable        R²         VIF
Altitude                 0.804149    5.10592
Temperature              0.896613    9.67242
Precipitation            0.933205   14.9712
Latitude                 0.37461     1.599
Longitude                0.72948     3.69659
An examination of simple scatter plots indicates that precipitation is correlated positively with altitude and longitude, and negatively with temperature (Fig. 66). With precipitation removed from the analysis, the result is a clearer ordination with the same grouping as before (Fig. 67).

In Ecom, the Scatter plot tab allows you to quickly examine scatter plots and the correlation between variables.
Fig. 66: Simple scatter plots showing the correlation between some of the environmental variables used by
Pokorný (2002)
112
Fig. 67: A CCA ordination biplot of 16 pollen samples. The samples are placed into the groups defined by
Pokorný (2002). Because of correlation with other variables, precipitation has been removed from the analysis.
113
The row titled “Cumulative % variance” indicates that the first axis explains about 39.39% of the total variation (inertia)
in the data set. Taken together, the first two axes explain about 44% of the variation.
Overall, how well do our measured variables explain species composition? As CCA is in part a regression method, an
114
obvious measure would be to create an analogue to the coefficient of determination, R2, and to divide the explained
variance by the total variance. However, there are problems with this approach and unfortunately, there are as yet no good
solutions.
The relative amount of variability explained by each axis of the ordination is given by the eigenvalues presented in a
standard CCA output (see Table 14). The eigenvalue for axis 1 is far larger than the other two axes presented in the table,
indicating that there is a relatively strong gradient acting along axis 1. The amount of the total variation that we can explain
by the environmental variation is the sum of all canonical (or constrained) eigenvalues, which in this case equals 48% of
the total. However, axis 1 alone explains 39%, which highlights the importance of this axis. The analysis also shows that
the pollen species-environment correlations are quite high, but in constrained ordinations this is normal, even for random
data.
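The percentages quoted are simply eigenvalues divided by the total inertia. A quick check in Python; note that the total inertia of ≈0.525 is back-calculated here from the ~39% figure in the text, as an assumed value for illustration, and is not quoted directly in the source:

```python
# Canonical eigenvalues for axes 1-3 (as reported in Table 15)
eig = [0.206817, 0.0231581, 0.0186781]
total_inertia = 0.525  # ASSUMED value, inferred from the ~39% quoted for axis 1

pct = [100 * e / total_inertia for e in eig]
print(round(pct[0], 1))                        # axis 1: ~39.4% of total variation
print(round(100 * sum(eig) / total_inertia))   # all canonical axes: ~47-48%
```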
The eigenvalues presented give no indication as to whether the amount of variability explained by the environmental or
explanatory variables is larger than would be expected by random chance. We can test for significance using a Monte Carlo
test. Shown in Table 15, page 116 is the probability that an eigenvalue as large as that observed could have occurred by
chance. The probability for axis 1, marked by bold underline, shows that this axis is highly significant, as the value is much
lower than 0.05, the 5% probability level conventionally used in statistical tests. Axes 2 and 3 are not significant at the 5%
level, suggesting that analysis should be restricted to a consideration of the ordination along axis 1 alone. From Fig. 67,
page 113 it can be seen that axis 1 is essentially a temperature - altitude axis and that axis 2 is a latitude - longitude axis.
We conclude that the primary variables determining the forest tree community are altitude and temperature, which are
negatively correlated.
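The Monte Carlo probability is estimated by counting how many simulated eigenvalues equal or exceed the observed one. A sketch in Python of the usual estimator, (count + 1)/(N + 1); with 1000 simulations and no exceedances this gives 1/1001 ≈ 0.000999, the axis 1 value reported in Table 15:

```python
def perm_p(observed, simulated):
    """Permutation-test p-value: (# simulated values >= observed, + 1) / (N + 1)."""
    count = sum(1 for s in simulated if s >= observed)
    return (count + 1) / (len(simulated) + 1)

# Axis 1 of the pollen CCA: observed eigenvalue 0.206817. Table 15 reports a
# simulated maximum of 0.206023, so none of the 1000 simulations reached it.
# The list below is a stand-in; only the fact that no value exceeds the
# observed eigenvalue matters here.
simulated = [0.206023] * 1000
print(perm_p(0.206817, simulated))   # 1/1001 = 0.000999...
```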
115
Table 15: The results of a Monte Carlo simulation of the pollen data to determine if the explanatory variables
account for a significant amount of the variation. The results are for 1000 random simulations. These calculations
were undertaken after the removal of precipitation from the environmental data set. A log10 transformation was
applied to the environmental data. The highly significant probability for axis 1 is highlighted in bold underline.
Axis                                 1              2              3
Actual eigenvalue                0.206817       0.0231581      0.0186781
Eigenvalue results from simulation:
  Mean                           0.0861691      0.0333423      0.0137693
  Maximum                        0.206023       0.0822763      0.0470048
  Minimum                        0.0209588      0.0084877      0.00138418
Probability                      0.000999001    0.771229       0.211788

In Ecom, the precipitation variable can be removed under the Working Environmental data tab. First, click on a cell in the Precipitation row. Second, select the Handling zeros radio button. Finally, choose Deselect row from the drop-down menu and click on Submit.

Conclusions

Pokorný (2002) concluded that it was possible to reconstruct past forest vegetation from pollen data. The altitudinal zonation of forests was similar to that observed today. The data indicated that Man had affected lowland forests in 2000 BC. All of the significant gradation between pollen communities is presented along axis 1, which is primarily an altitude - temperature axis. As was concluded by Pokorný (2002), the two main groups in the data comprise lowland and upland forest communities.
116
117
Almost exactly the same conclusions as those presented by Pokorný (2002) could have been arrived at using a simpler ordination method that did not explicitly consider the environmental variables. Fig. 71 shows the result of a Correspondence Analysis, which shows the same grouping as CCA. For example, Pokorný's Group 1 (samples 6, 8, 15, 16) is distinct, and the lowland status of these sites and their associated species could have been identified by simply looking at the associated environmental data. That the ordinations from CCA and CA are similar is expected, as they both use Reciprocal Averaging.
Somewhat different results are obtained if Non-metric MultiDimensional Scaling is used (Fig. 72). In this case Group 1 is
less distinct, with samples 6, 8, 16 forming a clear group and 15 taking a more intermediate position between the lowland
and highland groups. The unique nature of sample 7 is still apparent. Redundancy Analysis, which is the extension of
multiple linear regression to include multivariate response data, produces an ordination of sites with similarities to that
produced by NMDS (Fig. 73). For example, site 15 is separated from Group 1 and closer to sites 13 and 14.
Fig. 71: Results of a Correspondence Analysis of the 16 pollen samples (red dots), and pollen species (green
squares). Data from Pokorný (2002).
118
Fig. 72: (left) Results of a Non-metric MultiDimensional Scaling ordination of the 16 pollen samples. The Bray-
Curtis distance measure was used. Data from Pokorný (2002).
Fig. 73: (right) Results of a Redundancy Analysis ordination of the 16 pollen samples (in blue), environmental
vectors (in blue) and pollen species (in red). Data from Pokorný (2002).
119
Results
A test for multicollinearity between these three variables gave variance inflation factors of between 1.03 and 1.1, indicating
that each was varying independently of the others. A CCA using these three variables was then subjected to a Monte Carlo
test to ensure that the observed relationships could not have been generated by random chance. The probability that the
eigenvalues for axes 1, 2 and 3 were generated by random chance was estimated as 0.0529, 0.1778 and 0.011 respectively.
1 Used here for analyses where variables are added to the analysis one at a time.
120
This indicated that axes 1 and 3 were explaining more of the total variability than would be expected by random chance, although for axis 1 only marginally, at just over the 5% significance level. Axis 2 was clearly not significant.
The CCA biplots for species and years in relation to the environmental variables are plotted in Fig. 74 and Fig. 75,
respectively. Grey mullet and pout are associated with years of higher than average salinity. Sea snail, dab, poor cod,
transparent goby and eel were most abundant in years with lower than average seawater temperatures and a higher than
Fig. 74: (left) The Canonical Correspondence Analysis (CCA) biplot of fish species and environmental variables
for the Hinkley Point fish data analysed in Henderson (2007). Fish abundances were natural log-transformed
(loge) prior to analysis.
Fig. 75: (right) The Canonical Correspondence Analysis (CCA) biplot of the annual samples and environmental
variables for the Hinkley Point fish data analysed in Henderson (2007). Fish abundances were natural log-
transformed (loge) prior to analysis.
121
average NAOI. Bass were highly responsive to increased seawater temperature. Fig. 75, page 121 shows that the years
from 1981 to 1987 formed a group characterised by lower seawater temperatures and high NAOI.
The ordination of the species along two of the environmental axes is shown in Fig. 76 A & B. The position of species
along the vector representing each environmental variable was found by projecting orthogonal lines from the species
positions on to the vector. These points of intersection show the relative response of each species to the environmental
variables.
Fig. 76: The inferred ranking of common fish species to the environmental variables, NAOI (A, left) and seawater
temperature (B, right) at Hinkley Point. Results obtained by Canonical Correspondence Analysis.
122
Conclusions
The results of Henderson (2007) show that CCA can be an effective method to identify key environmental variables, and
to show how both a community and the individual species respond to a change in the physical environment.
Table 17, page 124 shows the layout for the explanatory variables, salinity, temperature and NAOI. Remember that both the observation and the explanatory data sets must have the same number of rows.
123
The following R code opens two data sets and runs a CCA. It uses the Hinkley Point fish and environmental datasets. The code produces a series of plots to visualise the results, which are similar to those produced by CAP and discussed previously. The results do not look identical to those presented previously because in this case the data were analysed untransformed; the previous analysis used log-transformed data. Using untransformed data will place a greater emphasis on the most abundant species.
library(vegan) # We will use the vegan community analysis package
# Load Hinkley data - species first
species <- read.csv("D:\\Demo Data\\Hinkley annual fish R CCA.csv", header = TRUE, row.names = 1)
print(species) # Print data to check it is OK
# Load Hinkley data - explanatory variables
enviro <- read.csv("D:\\Demo Data\\Hinkley annual env var R CCA.csv", header = TRUE, row.names = 1)
print(enviro) # Print data to check it is OK
# Run a CCA
CCA_output <- cca(species, enviro)
CCA_output # Print results
summary(CCA_output)
# Correlation between species (WA) and constraint (LC) scores
spenvcor(CCA_output)
# Default plot: species & WA scores, environmental variables
plot(CCA_output)
# The components plotted can be varied
plot(CCA_output, display = c("lc", "bp"))
plot(CCA_output, display = c("sp", "bp"))
124
Fig. 77: (left) Demonstrating that low temperatures were associated with the 1980s.
Fig. 78: (right) Showing the association between some fish species (such as bass) and warmer years, and other
species (such as sea snail), with colder years.
125
126
Chapter 8: TWINSPAN (Two-Way Indicator Species Analysis)
Uses
TWINSPAN produces dendrograms showing the relationship between the samples, and between the variables that make
up the samples. Further, the method identifies key variables for each bifurcation in the sample dendrogram. The method
was originally developed for the analysis of botanical data, but is now far more widely used.
127
Examining the score of each sample in terms of the indicator species present then refines the membership of the negative
and positive groups previously defined by their Correspondence Analysis score. Generally, for most samples, a positive
indicator species score will correspond to a high ordination score and these samples would be defined as the positive
group. If the indicator score and the ordination score are contradictory these samples are termed misclassified. Samples
that lie close to the centroid are termed borderline.
Once the data have been split into 2 groups the procedure is repeated for each group to form 4, 8, or more groups, until a
minimum group size is reached. The results can then be presented in the form of a dendrogram with the indicator species
marked at each dichotomy.
Classification of species is undertaken in a similar fashion to the samples. However, the ordination is based on the degree
to which species are confined to groups of samples, termed the fidelity, rather than the raw data. Unlike the sample
ordination, the species are always split into positive and negative groups with no recognition of borderline, difficult to
classify, species.
An odd aspect of TWINSPAN is the use of pseudospecies. TWINSPAN only works on presence/absence data. The
indicator species are defined by their presence in a group of samples, and no account is taken of their abundance. As
typical TWINSPAN data are quantitative or semi-quantitative measurements, the pseudospecies concept was introduced
to allow quantitative data to be reflected in the indicator species. Each species is divided into a number of abundance levels,
each of which is considered computationally as a different species during the identification of indicator species. Within
TWINSPAN, the user defines these abundance levels by selecting cut levels. Thus the same species might be an indicator
species for a number of splits. At one split the presence/absence at a low density might be an indicator, while at another
split presence/absence at high density might be an indicator. A suitable choice of cut level can only be made following
analysis of the data and trial and error. However, a robust conclusion will only be obtained if the final classifications are
not particularly sensitive to the cut levels chosen.
In CAP, pseudospecies are indicated by a number after the variable. For example, Acer 2 represents pseudospecies 2 for this species.

Pseudospecies are scored as 0 or 1, with 0 representing absent and 1 representing present. A species present at a specified abundance level is also scored as present at all lower abundance levels. For example, species C in Table 18 has an abundance of 49. With cut levels set at 5, 10, 50, 100 and 500, species C is marked as present for pseudospecies 3 (abundance between 11 and 50) and all
128
pseudospecies up to 3 (i.e. pseudospecies 1 and 2). So in this example TWINSPAN would use 3 separate pseudospecies to represent species C.
Table 18: An example of how pseudospecies are calculated for a specific set of cut levels. Four species and their abundances are presented, and the presence/absence of each pseudospecies for each species is shown.

Species name                                   A     B     C     D
Actual abundance                              75   620    49     1

Cut level   Range of values that score
    5            1-5       Pseudospecies 1    1     1     1     1
   10            6-10      Pseudospecies 2    1     1     1     0
   50           11-50      Pseudospecies 3    1     1     1     0
  100           51-100     Pseudospecies 4    1     1     0     0
  500          101-500     Pseudospecies 5    0     1     0     0
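The cut-level logic of Table 18 is easy to express in code. A minimal sketch in Python, using the cut levels and abundances from the table (the function name is ours, not TWINSPAN's):

```python
def pseudospecies(abundance, cuts):
    """Return 0/1 presence for each pseudospecies, given an abundance and cut levels.
    A species scored present at one level is present at all lower levels, so
    pseudospecies k is present whenever abundance exceeds the (k-1)-th cut."""
    lower_bounds = [0] + cuts[:-1]          # 0, 5, 10, 50, 100 for the Table 18 cuts
    return [1 if abundance > lo else 0 for lo in lower_bounds]

cuts = [5, 10, 50, 100, 500]
abund = {"A": 75, "B": 620, "C": 49, "D": 1}
for sp, n in abund.items():
    print(sp, pseudospecies(n, cuts))
# A [1, 1, 1, 1, 0]
# B [1, 1, 1, 1, 1]
# C [1, 1, 1, 0, 0]
# D [1, 0, 0, 0, 0]
```

The output reproduces the four presence/absence columns of Table 18.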
Generally, the results produced by TWINSPAN make sense and can be an aid to those seeking to summarise and classify.
However, when there are two or more gradients moulding the community, the method may fail. Using synthetic data,
van Groenewoud (1992)1 found TWINSPAN to be unsatisfactory: “Thus, there appear to be two reasons why a TWINSPAN
analysis fails to produce acceptable and ecologically meaningful results: (1) The displacement of the sample points along a first CA axis may
be considerable, resulting in the failure of both CA and TWINSPAN analysis. (2) In TWINSPAN, the division of the first CA axis
in the middle, followed by separate CA analyses of each of the two halves of the original data matrix, creates conditions under which the
second CA analysis will result in a greater displacement of the sample points, thus producing a spurious classification. The erratic behavior of
TWINSPAN beyond the first division makes the results of this analysis of real vegetation data suspect”.
1 van Groenewoud, H. (1992). The Robustness of Correspondence, Detrended Correspondence, and TWINSPAN Analysis. Journal of
Vegetation Science, 3, 239-246.
129
130
Results
Fig. 80, page 133 shows the sample dendrogram produced with cut levels of 0, 10 and 100. The lowest level dichotomy
has Acer saccharum as the only indicator species. Examination of the data shows this species to be present only at sites in
the lower half of the dendrogram (Cucumber Creek to Forked Lake). The authors identified the Cypress Creek, Crooked
131
Creek and Forked Lake samples as representing the Taxodium distichum community. They form a discrete group in the
dendrogram, but an equally-legitimate indicator species would be Betula nigra, as selected by TWINSPAN, which was also
only found at these three sites. The authors' Carpinus caroliniana community was found at Cucumber Creek 1 & 2 and
Refuge Creek 1 & 2. TWINSPAN certainly shows these sites as forming a discrete group, but suggests that they might
be described as an A. saccharum community without B. nigra. Finally, the authors classified all the other sites as holding a
Quercus phellos community. TWINSPAN indicates that these may not form a single group with the Caney and Refuge North
sites characterised by the presence of Bumelia lanuginosa. Following a Detrended Correspondence Analysis (DECORANA),
the authors noted the difference in community at these two sites.
“The high IV (Importance Value) for Liquidambar styraciflua separated the Caney and Refuge-North sites from others in the Quercus
phellos community type. Overall, importance values for water-tolerant Quercus spp. were low in the Carpinus caroliniana community type.
Second axis DCA scores were high for sites with Quercus alba, regardless of community type. The high axis 1 and low axis 2 scores for the
Taxodium distichum community type are most likely due to the singular presence of Taxodium distichum, Betula nigra, and Acer
saccharum at those sites”.
The authors did not refer to the species dendrogram also produced by TWINSPAN (see Fig. 81, page 134). This shows
that the species divide into two main groups. Quercus phellos is in the upper group and Carpinus caroliniana in the lower group.
132
Fig. 80: Sample dendrogram produced by TWINSPAN for the Oklahoma forest study. Cut levels of 0, 10 and
100 were used.
133
134
Results
First, using the total data set (Hinkley annual fish.csv) the dendrogram of the classification of the years between 1981 and
2004 is shown in Fig. 83, page 136. The results do not show a clear discontinuity between the 1980s and later years.
Using the Hinkley fish.csv data set of resident fish, the dendrogram now shows the 1980s as forming a distinct group, (Fig.
135
84, page 137), as was found with PCA. Some insight into why the two data sets produce such different results is obtained
by examining the indicator species. When the full data set is used, TWINSPAN bases the classification on occasionally-present species. As these are to some extent random in occurrence, the classification shows little pattern. The point to note is that TWINSPAN will tend to classify using uncommon species; this may not be helpful when it is used either to identify
a classification scheme or to produce a classification of samples.
Fig. 83: The dendrogram of relationships between the fish communities present in the Bristol Channel for the
years 1981 to 2004; obtained using the total data set. The indicator species are placed on the dendrogram at the
branch to which they refer. Predefined year groups are indicated by different colours.
136
Fig. 84: The dendrogram of relationships between the fish communities present in the Bristol Channel for the
years 1981 to 2004; obtained using data for the resident species only. The indicator species are placed on the
dendrogram at the branch to which they refer. Predefined year groups are indicated by different colours.
137
138
Chapter 9: Hierarchical Agglomerative Cluster Analysis
Uses
Agglomerative Cluster Analysis (ACA) methods are used to show the relationships between objects or samples in a dendrogram, tree or branching diagram. The approach is useful when samples clearly fall into discrete groups. Dendrograms are a powerful and easily-understood presentation method.
139
140
the most successful of which are described below. We will discuss similarity measures; some authors discuss dissimilarity,
which is generally just 1 - similarity.
Similarity indices are simple measures of either the extent to which two samples have species or other attributes in common (Q analysis), or the extent to which variables (e.g. species) have samples in common (R analysis). For a fuller explanation of Q and R analyses, see page 143. Binary similarity coefficients use presence/absence data, but more complex quantitative coefficients can be used if you have data on species abundance. When comparing the species at two localities, indices can be divided into those that take account of the absence of an attribute or species from both communities (double-zero methods), and those that do not. In most applications it is unwise to use double-zero methods, as they assign a high level of similarity to localities which both lack many variables or species. We would not normally consider two sites highly similar because their only common feature was the joint lack of a group of variables; this might occur because of sampling errors or chance. For that reason, only binary indices that exclude double zeros are described here.
In the following, a is the number of species (or attributes) common to both samples, and b and c are the numbers found only in the first and second sample respectively.

Jaccard: CJ = a / (a + b + c)

Sørensen: CS = 2a / (2a + b + c)

Mountford: CM = 2a / (2bc − (b + c)a)
CM was designed to be less sensitive to sample size than CJ or CS; however, it assumes that variable abundance fits a log-series model, which may be inappropriate.
141
Following evaluation of similarity measures using Rothamsted insect data, Smith (1986)1 concluded that no index based
on presence/absence data was entirely satisfactory, but the Sørensen was the best of those considered.
Quantitative coefficients
Because they are based purely on presence/absence, binary coefficients give equal weight to all objects, and hence tend
to place too much significance on rare features whose discovery or capture will depend heavily on chance. Bray & Curtis
(1957)2 brought abundance into consideration in a modified Sørensen coefficient, and this approach is widely used in plant
ecology; the coefficient, as modified, essentially reflects the similarity in individuals between the habitats:

CN = 2jN / (aN + bN)

where aN = the total number of individuals sampled in habitat a, bN = the same in habitat b, and jN = the sum of the lesser values of abundance in both samples for the objects common to both samples (often termed W). For example, if there are 10 of a species in sample 1 and only 3 in sample 2, the lesser value is 3.
However, for quantitative data, Wolda (1981)3 found that the only index not strongly influenced by sample size and species richness was the Morisita-Horn index:

CMH = 2 Σ (Pa,i × Pb,i) / (Σ Pa,i² + Σ Pb,i²), summed over i = 1 … S
1 Smith, B. (1986). Evaluation of different similarity indices applied to data from the Rothamsted insect survey. Unpublished M.Sc. Thesis,
University of York.
2 Bray, J. R. & Curtis, C. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecol. Monogr. 27, 325-349.
3 Wolda, H. (1981). Similarity indices, sample size and diversity. Oecologia 50, 296-302.
where Pa,i and Pb,i are the percentage abundances of object i in samples a and b respectively, and S is the total number of objects. This index takes little account of rare objects, and thus will give a good indication of the similarity in dominant forms between the sites.
Results
Fig. 85 shows the dendrogram produced using Ward’s method with the Euclidean distance measure. It produces results
markedly similar to the TWINSPAN dendrogram shown in Fig. 80, page 133. As an example of a poor choice of
method, Fig. 86, page 146 shows the dendrogram generated using single-linkage clustering and the Renkonen similarity
measure for the same data set. Note that chaining has occurred and the clear split into two groups shown in Fig. 85 is no
longer apparent.
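In R, an equivalent analysis can be run with the base function hclust(). The sketch below uses simulated data in place of the Oklahoma data set; "ward.D2" is the hclust option that applies Ward's method to untransformed Euclidean distances.

```r
# Agglomerative cluster analysis with Ward's method and Euclidean distance
# (simulated stand-in for the Oklahoma floodplain forest data).
set.seed(1)
dat <- matrix(rnorm(40), nrow = 8,
              dimnames = list(paste0("site", 1:8), paste0("sp", 1:5)))

d    <- dist(dat, method = "euclidean")   # Euclidean distance matrix
tree <- hclust(d, method = "ward.D2")     # Ward's minimum-variance joining

plot(tree, main = "Ward's method, Euclidean distance")
```

Swapping method = "ward.D2" for "single" reproduces the single-linkage joining that caused the chaining seen in Fig. 86.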
Fig. 85: Sample dendrogram for the Oklahoma floodplain forest data. Agglomerative cluster analysis undertaken
with Ward’s method and Euclidean distance.
Fig. 86: An example of chaining using single-linkage clustering and the Renkonen similarity measure on the
Oklahoma floodplain forest data. Note that this dendrogram also shows a familiar problem, in that at one point
in the process (arrowed) the combined group resulted in an increase in similarity at the next cluster aggregation.
Conclusions
Agglomerative clustering methods can produce dendrograms which clearly show the main clusters of objects within the
data set. However, the results can seem arbitrary and highly dependent upon the method used to join together the clusters
and the similarity or distance measure used. The significance of group membership could be tested using the Analysis of Similarities (ANOSIM) randomisation test. This approach is used in later examples.
Results
A typical example of the type of dendrogram produced for the Hinkley data is shown in Fig. 87, page 148. It is notable
that, as in the case of TWINSPAN, when the Hinkley annual fish.csv data set is used the years do not form clear pre- and
post- early 1990s groups. This data set comprises all the fish species. However, while TWINSPAN produced a dendrogram
which separated the 1980s when only the resident species were considered (Fig. 84, page 137), this was not the case for
Agglomerative Clustering using Euclidean distance (Fig. 88, page 149). By contrast, Ward’s method together with the
Bray-Curtis similarity measure did produce a clustering of years with features in common with that observed with PCA
and TWINSPAN (Fig. 89, page 150).
Fig. 87: The Agglomerative Cluster Analysis dendrogram of relationships between the fish communities present
in the Bristol Channel for the years 1981 to 2004, obtained using data for all fish species in the Hinkley annual fish.
csv data set. Clustering used Ward’s method and Euclidean distance.
Fig. 88: The Agglomerative Cluster Analysis dendrogram of relationships between the fish communities present
in the Bristol Channel for the years 1981 to 2004, obtained using data for resident fish species in the Hinkley fish.
csv data set. Clustering used Ward’s method and Euclidean distance.
Fig. 89: The Agglomerative Cluster Analysis dendrogram of relationships between the fish communities present
in the Bristol Channel for the years 1981 to 2004, obtained using data for resident fish species in the Hinkley fish.
csv data set. Clustering used Ward’s method and Bray-Curtis similarity.
Conclusions
Agglomerative clustering methods were unsatisfactory for studying the structure of the Hinkley fish data because the
result was so dependent upon the similarity measure used. The Bray-Curtis similarity measure was found to produce a
dendrogram having features in common with other ordination methods. This measure is favoured by benthic ecologists
and is frequently used within NMDS. In some fields, it can at least be argued that the similarity measure used has been
chosen because it is the standard method used by other scientists. However, it is clear that the choices are often painfully
arbitrary and in danger of being driven by the desire to get a pleasing result, which will often reinforce preconceived ideas.
For many applications, if a dendrogram is required for presentational purposes, the Bray-Curtis similarity index should be
considered.
In conclusion, use agglomerative clustering dendrograms to present patterns that have also been identified by other
methods.
Results
As the authors did not specify the distance or similarity measure used, it was not possible to precisely reproduce their
dendrogram. However, a wide variety of methods gave a result similar to that published in their paper; Fig. 90 shows
a clear distinction in the growth curve parameters of large species – the English mastiff to Newfoundland group, and
smaller species – the Labrador retriever to Papillon group. This dendrogram was produced using Ward’s method and
Euclidean distance. Within each of these main divisions there are further divisions that the authors considered notable,
for example, the separation of the Newfoundlands from other giant breeds.
Conclusions
Agglomerative clustering allows the different growth patterns of small and large dog breeds to be elegantly displayed.
However, the significance of further differences within the two main groups is considerably more subjective and may
not be significant. The authors did not attempt any statistical analysis of the significance of their grouping into large and
small dogs. An Analysis of Similarities randomisation test (ANOSIM) showed that the large and small dog groups were
significantly different (test statistic = 0.06, p = 0.001).
Fig. 90: The relationship between the growth parameters of 12 breeds of dog. The dendrogram was produced
using Ward’s method and Euclidean distance.
Fig. 91: A 3D MDS plot of the dog growth variables for the 12 different breeds. The large and small dogs are
shown in red and blue, respectively.
Results
The authors undertook ACA using complete-linkage (furthest neighbour), average-linkage and single-linkage (nearest neighbour) with both Euclidean and city-block distance measures. Fig. 92, page 156 shows the dendrogram produced using single-linkage joining and Euclidean distance. The authors examined dendrograms produced with all permutations of the three linkage methods and two distance measures to ensure that the conclusions were not sensitive to the method chosen. In Fig. 93, page 157, we show the dendrogram produced using average-linkage joining and city-block distance. As was concluded in the paper, the styles of the individual authors form separate branches.

In CAP, the city-block distance is called the Manhattan distance.
To help determine if the impression gained from the ACA that the 4 authors belonged to separate groups was correct,
Jin & Murakami (1993) examined the ordination produced by Principal Component Analysis (PCA) using the variance-
covariance matrix (see page 20). Fig. 94, page 158 shows that PCA also showed a clear distinction between the four
authors, increasing confidence that the dendrogram groups reflected real differences.
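A comparable PCA cross-check can be sketched in base R with prcomp(); the comma-count matrix below is simulated, not the Jin & Murakami data. Leaving scale. = FALSE means the analysis is based on the variance-covariance matrix rather than the correlation matrix.

```r
# PCA on the variance-covariance matrix: centre the data, do not scale it.
set.seed(2)
counts <- matrix(rpois(60, 10), nrow = 10,
                 dimnames = list(paste0("text", 1:10), paste0("v", 1:6)))

pca <- prcomp(counts, center = TRUE, scale. = FALSE)

summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # sample scores on the first two axes, as plotted in Fig. 94
```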
Fig. 92: Agglomerative Cluster Analysis of comma use by the Japanese authors, Yasushi Inoue, Atsushi Nakajima, Yukio Mishima and Junichiro Tanizaki. Single-linkage joining with Euclidean distance was used.
Fig. 93: Agglomerative Cluster Analysis of comma use by the Japanese authors, Yasushi Inoue, Atsushi Nakajima, Yukio Mishima and Junichiro Tanizaki. Average-linkage joining with Manhattan distance was used.
Fig. 94: The PCA grouping of four Japanese authors by reference to their use of commas. The analysis was
undertaken on the variance-covariance matrix.
Conclusions
Jin & Murakami (1993) concluded that comma placement by individual authors generally remained relatively consistent,
and that such analysis could be used to determine authorship and authenticity of modern Japanese literature.
Alternative approaches
Jin & Murakami (1993) had no simple way of testing whether their assignment of works to groups was significant. Generally, there are two approaches that might be taken: group membership could be tested using Linear Discriminant Analysis (DA) or, alternatively, the ANOSIM randomisation test could be applied. DA could not be used, as Nakajima was represented by only a single work. For the full data set, ANOSIM gave a sample statistic of 0.96 and a probability of only p = 0.001 that the observed within-group similarity could have been generated by random chance. This suggested that the separation of the individual authors was highly significant. ANOSIM also tests the difference between pairs of groups (in our case comparing all combinations of two authors - see Table 19). This shows that the data set does not demonstrate a significant difference at the 10% level between Inoue and Nakajima, Mishima and Nakajima, or Nakajima and Tanizaki. This lack of significance is related to the small number of works available for each author, and does not invalidate the method.

In CAP, Analysis of Similarities (ANOSIM) and Discriminant Analysis (DA) are found on the Groups drop-down menu. Remember that, before an ANOSIM can be carried out, the individual samples must be assigned to groups.
Table 19: Results of all pairwise tests generated by ANOSIM for the Japanese author analysis.
1st author 2nd author Permutations P-value Level % No >= Obs Sample stat.
Inoue (2) Mishima (3) 10 0.1 10 1 1
Inoue (2) Nakajima (1) 3 0.33333 33.33 1 1
Inoue (2) Tanizaki (3) 10 0.1 10 1 1
Mishima (3) Nakajima (1) 4 0.25 25 1 1
Mishima (3) Tanizaki (3) 10 0.05 10 1 0.925
Nakajima (1) Tanizaki (3) 4 0.25 25 1 0.777
Chapter
10
10: Analysis of Similarities (ANOSIM)
Uses
To test if the grouping of samples or objects is statistically significant.
if the high and low similarities are perfectly mixed and bear no relationship to the group. A value of -1 indicates that the
most similar samples are never in the same group. While negative values might seem a most unlikely eventuality, they have been found to occur with surprising frequency.
To test for significance, the ranked similarity within and between groups is compared with the similarity that would be
generated by random chance. Essentially, the samples are randomly assigned to groups 1000 times and the value of R is
calculated for each permutation. The observed value of R is then compared against the random distribution to determine
whether it is significantly different from that which could occur at random. If the value of R is significant, it can be
concluded that there is evidence the samples within groups are more similar than would be expected by random chance.
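The calculation can be sketched compactly in base R (for serious use, the vegan package provides a full implementation in anosim()). The data below are simulated and the function name is our own: R is the difference between the mean rank of the between-group and within-group dissimilarities, scaled by M/2, where M = n(n − 1)/2.

```r
# A minimal ANOSIM sketch: the R statistic plus a permutation test.
anosim_R <- function(d, groups) {
  d   <- as.matrix(d)
  n   <- nrow(d)
  idx <- which(upper.tri(d))                 # each pair of samples once
  r   <- rank(d[idx])                        # rank all dissimilarities
  same <- outer(groups, groups, "==")[idx]   # TRUE for within-group pairs
  M <- n * (n - 1) / 2
  (mean(r[!same]) - mean(r[same])) / (M / 2)
}

# Two well-separated simulated groups of 4 samples each:
set.seed(3)
x <- rbind(matrix(rnorm(20, mean = 0), 4), matrix(rnorm(20, mean = 3), 4))
g <- rep(c("A", "B"), each = 4)
d <- dist(x)

obs  <- anosim_R(d, g)                          # observed R
perm <- replicate(999, anosim_R(d, sample(g)))  # R under random grouping
p <- (sum(perm >= obs) + 1) / 1000              # permutation p-value
```

With only 4 samples per group the number of distinct permutations is small, so — as in the Japanese-author example above — very small p-values are unattainable however strong the separation.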
Results
The global ANOSIM test statistic of R = 0.336 indicated that the pottery sherds did form four groups based on locality.
To put it another way, the sherds within each of the 4 groups were significantly more similar than would be expected by random chance. The quoted probability that such a similarity would be observed by chance was 0.001, or about 1 in 1000. The true probability may be lower still, but the program only performed 1000 randomisations, which was sufficient to demonstrate the validity of the groups. ANOSIM also produces all the pair-wise comparisons between the 4 groups (Table 20). These
results show that the pottery sherds from all the localities were significantly different from each other.
Table 20: The results of pairwise randomisation tests of the Jomon pottery produced by ANOSIM
1st Group 2nd Group Perms. done P-value Level % No >= Obs Sample stat.
Ariyoshi-kita (14) Kamikaizuka (30) 1000 0.001 0.1 1 0.434
Ariyoshi-kita (14) Narita60 (15) 1000 0.001 0.1 1 0.590
Ariyoshi-kita (14) Shouninzuka (33) 1000 0.001 0.1 1 0.434
Kamikaizuka (30) Narita60 (15) 1000 0.002 0.2 2 0.220
Kamikaizuka (30) Shouninzuka (33) 1000 0.001 0.1 1 0.277
Narita60 (15) Shouninzuka (33) 1000 0.001 0.1 1 0.293
Conclusions
Hall (2001) concluded that there are four major groups in the data set, which correspond to site location. ANOSIM
supports this conclusion.
Alternative approaches
There are no alternative approaches to ANOSIM.
Chapter
11
11: Tips on CAP and Ecom
A guide to getting the best out of your data
CAP and Ecom were designed to make the application of multivariate analysis techniques less intimidating than is the case
with other programs. They offer a range of analytical techniques commonly used by researchers in fields such as biology,
geology, palaeontology, archaeology and the social sciences. Software to carry out these techniques has long been available,
but it is often difficult to use, offering limited help and non-intuitive interfaces. It also frequently has limited graphical
output.
each organised to have the same number of samples in the same order. There must be fewer environmental or predictive variables than there are variables (pottery types, different metals, individual species, etc.) in the sample dataset.
Data organisation
Data to be used in the programs can be organised using standard spreadsheet programs such as Excel, Calc or Google
Spreadsheet. The output from CAP and Ecom is displayed, exported and printed using standard Windows techniques. The
program can open .csv (comma-delimited text files) and .xls (Excel files). The newer Excel format (.xlsx) is not supported,
so if your data set is currently in a spreadsheet of that format, save it out as the older Excel format, or as a .csv file.
Both programs expect data sets arranged with the samples (or quadrat, trench etc.) as the columns, and variables (or
species, chemical composition, etc.) as rows. If your data are the other way round, then you can use the Transpose function
in either CAP or Ecom to change columns to rows and rows to columns. To do this, select the Working Data tab, then
select Transpose in the Type of Adjustment radio box and click on the Submit button (Fig. 96, page 168).
It is advisable to export the working data (i.e. the transposed data set) out of CAP or Ecom. Giving it a unique name
will stop you overwriting the original file later by mistake. To save your working data use File: Export, then reload the
transposed data into the program.
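The same transposition can also be done in R before the data are loaded into CAP or Ecom; the data frame below is a made-up example with two samples and three species.

```r
# A small samples-as-columns data set, as CAP and Ecom expect:
dat <- data.frame(SiteA = c(3, 0, 7),
                  SiteB = c(1, 4, 0),
                  row.names = c("sp1", "sp2", "sp3"))

flipped <- as.data.frame(t(dat))   # columns become rows and vice versa
flipped                            # samples now form the rows

# Save under a new name so the original file is not overwritten:
# write.csv(flipped, "mydata_transposed.csv")
```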
The ability of CAP and Ecom to transpose data allows you to use data sets that have been prepared with rows as samples.
If you have multiple data sets that contain similar data, such as samples from the same site at different times, or from
adjacent samples, then you may find the Pisces List Combiner 1 useful. List Combiner will take all your worksheets and combine them into one large .csv file, summing and sorting the values for each variable along the way.
On the Summary tab you can review the data in several ways. The first choice is whether to look at the raw or the working
data. Often these will be the same, but if you have transformed the data in any way then selecting Working Data will show
you the effects of that transformation.
Once you have chosen which data set to look at, you can then view the General, Row, Column and, in CAP, Group
statistics. General statistics tell you about the whole data set. These statistics are particularly useful prior to undertaking
many analyses. For example, PCA should not be undertaken on a data set that holds large numbers of zero observations.
The next two options, Row and Column, are useful to show you the range of data in each variable. If any variables are all 0, you may want to remove them from the analysis - some analyses will fail, as they will produce a division-by-zero error. Alternatively, if you have one or two variables whose range is much larger than that of the other variables, you might want to transform them to bring them within range.
Group summary data is slightly different. This gives you the average value of each variable in that group. This can be
useful in understanding the differences between your groups, and when summarising the results of methods such as
Linear Discriminant Analysis.
You can apply multiple transformations to your data, but be aware that each time you do, it will get more and more difficult
to explain! If you find that you want to start again, the Reload Raw Data button on the Working Data tab will reload
the original data.
Alternatively, transforming one or more variables, or using a method that is not affected by magnitude, can reduce the
influence of the outlier. It is worth noting down what transformation(s) you have undertaken, and which outliers you have
removed. It can be very frustrating trying to reproduce the perfect plots you got the last time you analysed the data, when
you cannot remember which transformation you used, and which of the samples you removed.
• Data file order – this is the order of the samples in the data set as it is loaded into the program. This is best if your data are already organised by group or location.
• Alphabetical – the names of the samples are sorted alphabetically. This is best for finding a sample when you know its name.
• By row value – this sorts the data in the Grouping tab by the value of the variable in the row that is selected in the dropdown list to the right. You can change the row you use with the dropdown list, and see the values by using the Show values check box. This is useful if you want to separate the samples into groups based on the values in a particular row (perhaps the most abundant species, or different values of a predictive variable).
Once you have created and/or modified the sample groups, click the Save Group File button. The group information is
saved in a small text file. This file is saved in the same folder as your data set, and carries the same file name, with a .pcg
file extension. The Reset Group File button deletes the saved group information and reassigns all the samples to a single
group. It does not affect the original data. This option can help when you wish to start again, when you have transposed
the data, or if the group file is damaged or lost (for instance, if you move the main data set to another folder, but leave
the group file behind or delete it).
Data analysis is interesting; you might be surprised what patterns there are in your data. Without exploration, multivariate
techniques tend to only show you what you already knew. If several of the methods give you similar groupings or
classifications, this should give you confidence that your analysis is showing something real. Even if they do not, think
about why not – is it that one method uses quantitative data and another presence/absence? Is it an effect of sparse data
or an outlier?
the Tools symbol on the toolbar (2nd button from left) will open the Edit Chart dialog shown below (Fig. 97).
In this dialog you can see the situation with a typical PCA chart with 5 user-defined groups. The chart contains a total of 7 series; the other two series control the vector plots – one for drawing the vectors and one to allow names to be put on
the end of the arrows. Each of these series is plotted separately on the chart; this allows you to edit every aspect of the
chart independently, if you wish.
The Edit Chart dialog has an expanding menu of options in the left-hand pane; Series, Chart, Data, Tools, Animations,
Export, Print and Themes. Each option in this expanding menu opens up a series of tabbed pages covering each major
area of the chart to be edited. Sub-tabs then allow you to edit every aspect of these major areas. You will mainly use
the Chart and Series options. Chart allows you to control the background features of the chart, such as the axes, the
background colour, the chart and axes titles, or the legend. The Series option gives you control over each of the plotted
sets of data. Here you can choose the shape, size and colour used to plot each series.
In the instance above, each of the group series has a different colour. You can change the details of a series by double-
clicking on the series in the central panel of the dialog, or alternatively clicking on the series name in the expanding tree
menu in the left-hand panel of the dialog.
Returning to the other chart options; Data shows you the data used to plot the chart, while Tools allows you to add items
that are not directly related to the plotting of the chart, such as annotations and extra lines to divide areas. Animations
gives you the option to add animated effects to charts.
The Export tab has three sub-areas. Export: Picture saves the chart in a variety of formats. Many of the charts in this
book have been exported using Copy and Paste – this produces an enhanced metafile (.emf) format, which is best for
resizing the image in Word documents and other similar applications. Other formats are better for web presentations, etc.
Export: Native allows you to save a chart in a native format. This is useful when you haven’t made a decision on the final
layout of the chart. You can reopen these charts and edit them using TeeChartOffice 1, a free program from Steema Software.
Finally, Export: Data allows you to export the data used to plot the chart; you can use this to export the data to another
program for plotting.
Print allows you to preview the chart before printing, and Themes allows you to apply a whole range of pre-defined styles
to your chart.
Editing dendrograms
Dendrograms are used to show the relationships calculated in methods such as TWINSPAN and clustering. These can
be edited using the Edit Dendrogram button. This dialog (Fig. 98) allows you to change the layout of the dendrogram.
Useful features include the Space Equally option under the Lines tab, which spreads out the dendrogram, so that each
divide in the horizontal direction is given equal weight (this option is also available as the Space Equally Horizontally
tickbox on the Options panel of the dendrogram). Space Equally works well if the top axis has no useful meaning, or
you need to see the relationships more clearly. The Labels tab allows you to control how the labels at the right-hand end
are presented. You can choose to have them coloured by groups, position the labels on the top of the lines or set the stub
lengths. You can copy or save the dendrogram in the Copy/Export tab.
Glossary
covariance between a variable and itself) are the variances of the variables.
DECORANA - strictly speaking, a tool which implements a Detrended Correspondence Analysis; in practice, DECORANA
is often used to refer to the analysis itself.
Detrended Correspondence Analysis (DECORANA) - an iterative multivariate technique used to find the main factors
or gradients in ecological community data. It is often used to overcome issues such as the Arch effect (q.v.) inherent in
Correspondence Analysis (q.v.)
Dimension - used here as the number of axes used to express the relationship between samples.
Discriminant Analysis (DA) - a technique to assign samples to different groups.
Dispersion matrix - general term used for a matrix holding either the correlations or covariances between variables.
Downweighting - a method in ordination programs to reduce the influence of rare species.
Eigenvalue - a mathematical quantity used in linear algebra. In the methods discussed here it often measures the variability assigned to a particular axis.
Eigenvector - a vector used in linear algebra. Used here in PCA to show the direction of influence of each variable within
the ordination plot.
Environmental variable - a variable that is believed to influence the structure of the samples.
Euclidean distance - the straight-line distance between two points in a Cartesian coordinate system; it is the distance you would measure with a ruler.
Horseshoe effect - a distortion of a PCA ordination diagram.
Inertia - a measure of the total amount of variance in a data set.
Iteration - a single step in a repeated mathematical procedure.
Matrix - a set of numbers arranged in rows and columns. A 2-dimensional matrix comprises a grid of numbers.
MDS - an acronym for MultiDimensional Scaling.
Index
Monte Carlo test: 106, 115–116, 120.
multicollinearity: 105–106, 110, 112, 120, 171.
MultiDimensional Scaling (MDS): 8, 18, 50, 52, 73–89, 102, 152, 161.
multiple regression: 2, 104, 105.
Multivariate Analysis of Variance (MANOVA): 90.

N
NMDS - See Non-metric MultiDimensional Scaling (NMDS).
Non-metric MultiDimensional Scaling (NMDS): 1, 23, 73, 82, 84, 87, 118, 126, 151, 175 - See also MultiDimensional Scaling (MDS).

O
ordination: iv, 2, 8–9, 23, 26, 42, 46, 50–51, 52–54, 70–71, 73–75, 89, 104–106, 115–118, 126, 127–129.
orthogonal lines: 122.
outliers: 7, 27–29, 34–35, 82, 91, 105, 168–169.

P
PCA - See Principal Component Analysis (PCA).
percentage variability: 26, 37.
Prediction - Accuracy Table: 101.
presence/absence data: 6, 73, 74, 76, 83, 128, 141, 142.
Principal Component Analysis (PCA): iii, 8–9, 17–51, 54, 75, 84, 143, 167, 171, 175.
principal components: 20, 29–30, 34.
procrustes function: 70–71.
Procrustes method: 70, 84.
pseudospecies: 128–129, 130, 131, 138.

Q
Q-mode analysis: 143.
quantitative data: 20, 33, 45, 51, 73, 74, 128, 142, 171.
quantitative measures: 6, 76.

R
R: iv, 2, 10–16.
  Agglomerative Cluster Analysis (ACA): 160–161.
  Canonical Correspondence Analysis: 123–125.
  Correspondence Analysis: 58, 61–63, 68–71.
  Download code examples: 9.
  heatmaps: 160–161.
  Linear Discriminant Analysis: 101–103.
  MDS: 84–86.
  Opening files in: 5–6.
  Organising data for: 3, 4–5.
  PCA functions: 24–25, 29–32, 34–35, 39, 48–49.
  Procrustes in: 70–71.
  Transformation in: 7, 26.
RA - See Reciprocal Averaging (RA).
randomisation tests: 2, 126, 144, 146, 152, 159, 164.
rare species: 8, 51, 55 - See also down-weight rare species.
RDA - See Redundancy Analysis (RDA).
Reciprocal Averaging (RA): 54, 118, 171 - See also Correspondence Analysis (CA).
Redundancy Analysis (RDA): 118–119.
regression analysis: 2, 17, 104, 114, 118.
Renkonen measure: 140, 145, 146.
R-mode analysis: 87, 143–144.
R packages: 14–16.
  for cluster analysis & heatmaps: 160.
  for PCA: 24–25.
T
taxi-cab distance - See Manhattan distance.
total variability: 20, 26, 42, 48, 121.
transformation: 7, 8, 21–22, 26, 36, 45, 46, 48, 50, 51, 68, 109–110, 131, 144, 167–168 - See also square root transformation.
Pisces software titles & training
QED Statistics offers a comprehensive range of statistics, chosen to meet the needs of all students, researchers and post-
grads wanting to analyse quantitative data. The program holds your hand, from data input, through single sample stats,
right up to General Linear Models.
More information at www.qedstatistics.com
Aside from statistics, we also produce a wide range of other software titles for analysis of scientific data, as well as over
60 classic scientific works as e-books, and a series of ready-made lectures on ecology.
Information on all of these is available on our website at www.pisces-conservation.com