
Applying Multivariate Methods

using R, CAP and Ecom


E-book edition

Peter Henderson & Richard Seaby
Pisces Conservation Ltd
Applying Multivariate Methods using R, CAP and Ecom
This book, based upon A Practical Handbook for Multivariate Methods (2008), is invaluable for anyone interested in multivariate statistics, and has been
extensively revised to reflect the ever-growing popularity of R in statistical analysis. All the main multivariate techniques used in Advertising, Archaeology,
Anthropology, Botany, Geology, Palaeontology, Sociology and Zoology are clearly presented in a manner which does not assume a mathematical
background. The authors’ aim is to give you confidence in the use of multivariate methods so that you can clearly explain why you have chosen a particular
method. The book is richly illustrated, showing how your results can be displayed in both publications and presentations. For each method, an introductory
section describes the method of calculation and the type of data to which it should be applied. This is followed by a series of carefully selected examples
from published papers covering a wide range of disciplines. You will find examples from fields you are familiar with. The data sets and R code used are
available from the Pisces Conservation website, allowing you to repeat the analyses and explore the various approaches.

This book has been written by the team that created the software titles Community Analysis Package (CAP) and ECOM. The two programs have themselves
been updated to run R code natively within them, and tips are given on how to use these programs to get the best presentation for your ideas.

Peter Henderson is a director of Pisces Conservation and a Senior Research Associate of the
Department of Zoology, University of Oxford, England. His personal research area is community and
population ecology, and he is author of “Ecological Methods”, “Marine Ecology”, “Practical Methods in Ecology”
and “Ecological Effects of Electricity Generation, Storage and Use”. He teaches multivariate statistical methods. He
was part of the team that wrote the Community Analysis Package and ECOM software.

Richard Seaby is a director of Pisces Conservation where he develops software and was the designer of the Community Analysis Package and ECOM user interface. His personal research area is aquatic ecology and the population dynamics of fish.

© Pisces Conservation Ltd, 2019
www.pisces-conservation.com
Pisces Conservation Ltd

Applying Multivariate Methods


using R, CAP and Ecom

Dr Peter A. Henderson

With instructions on the use of software by

Dr Richard M. H. Seaby
© Peter A. Henderson and Richard M. H. Seaby, Pisces Conservation Ltd, 2019
IRC House, The Square,
Pennington, Lymington, Hants
SO41 8GN, UK
Phone +44 (0)1590 674000
[email protected]
www.pisces-conservation.com

ISBN 978-1-904690-67-2

Extensively revised and based upon “A Practical Handbook for Multivariate Methods”,
by the same authors, published in Great Britain in 2008 by Pisces Conservation Ltd,
ISBN 978-1-904690-57-3

The contributors have asserted their moral rights.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, without the prior permission in writing of the publisher, nor be circulated in any form of
binding or cover other than that in which it is published.

Editing, proofing and layout by Robin Somes

This is the electronic version of our printed work, “Applying Multivariate Methods”,
(ISBN 978-1-904690-66-5)
Applying Multivariate Methods

Preface
Multivariate methods such as Principal Component Analysis, Correspondence Analysis and Discriminant Analysis are
essential tools for researchers in all scientific disciplines and are important in commercial activities such as marketing
and market analysis. Not everyone who needs to use these methods has either the skills or the time to acquire a deep
understanding of the mathematical procedures and theory underlying them. What all practitioners do need is
an appreciation of the strengths and weaknesses of the available methods, so that they can confidently explain why they
chose their selected methodology. Further, we all need to appreciate the pitfalls and weaknesses of the methods, so that we
correctly interpret the features within our data. This book is aimed at giving you these skills. It describes, as far as possible
in non-mathematical terms, how all the key techniques are undertaken. The strengths and weaknesses of each method
are discussed, so you can confidently explain why you chose to use a particular approach. Most importantly, this book
presents numerous examples taken from published work. These examples have been carefully chosen to reflect the range
of disciplines that use multivariate techniques. Thus, archaeologists, anthropologists, botanists, geologists, palaeontologists
and zoologists, and many others will find examples that relate to their subject. This has been done to ensure that the text
is relevant to your needs, and also to help in the learning process. I have found it easier to teach these complex techniques
if the examples are drawn from subjects with which the researcher is familiar.
The writing of this book follows on from the development of CAP, the Community Analysis Package, with Dr Richard
Seaby and other members of Pisces. This program was developed following discussions with undergraduate and postgraduate students at the University of Oxford who needed a simple, intuitive-to-use, Windows-based suite of programs for multivariate analysis. Subsequently, we developed Ecom to undertake constrained ordination when environmental
variables potentially able to explain the observed pattern have also been collected. Together, these packages have been
under continuous development and improvement for more than 15 years, gradually becoming easier to use and more
versatile in terms of their presentation of the results. Most recently we added the ability to run R packages within the CAP
and Ecom environment. This allowed those who found R difficult and did not have the time to invest in a training course
to have access to R packages. In this revised book I introduce the R packages available to undertake multivariate analyses
and present working R code. This code is generally a simple working version that the reader can modify and enhance as
they become proficient in R. While this book should prove useful irrespective of the software that will be used to analyse
the data, care has been taken to produce information boxes separate from the main run of the text that explain how to
produce the output displayed using CAP and Ecom. Instructions on how to use these programs were written by Richard
Seaby, who was the developer responsible for the design of the Windows user interface and the graphics.
Finally, the best way to learn is by experimenting with data. This book is supported by a website1 from which you can
download for yourself all of the data sets used to undertake the analyses. If you are using CAP and Ecom, you will be able
to produce all of the graphs printed in the book as they were generated using these programs.

Peter Henderson
Director, Pisces Conservation Ltd
Senior Research Associate, University of Oxford.

1 www.pisces-conservation.com/phmm-data.html


Contents
Preface........................................................................................................................................................................................iii
1: Introduction.............................................................................................................................................................................1
Why use multivariate methods?.................................................................................................................................................... 2
Organising your data - CAP and Ecom........................................................................................................................................... 3
Organising your data for R............................................................................................................................................................ 4
Opening a .csv file in R................................................................................................................................................................. 5
The appropriate types of variable.................................................................................................................................................. 6
Data transformation..................................................................................................................................................................... 7
Which method to use?.................................................................................................................................................................. 7
Where to obtain the example data sets and R code........................................................................................................................ 9
Software featured in the book....................................................................................................................................................... 9
2: Installing R and getting started.............................................................................................................................................10
Getting R running on your computer............................................................................................................................................10
Installing R.................................................................................................................................................................................11
Installing RStudio........................................................................................................................................................................11
Introducing RStudio....................................................................................................................................................................11
R code conventions used in the book...........................................................................................................................................16
Final advisory note on R code......................................................................................................................................................16
3: Principal Component Analysis................................................................................................................................................17
Uses..........................................................................................................................................................................................17
Summary of the method..............................................................................................................................................................18
The use of the correlation or covariance matrix.............................................................................................................................20
How many dimensions are meaningful?........................................................................................................................................20
Do my variables need to be normally distributed?..........................................................................................................................21
Data transformations...................................................................................................................................................................21
The horseshoe effect...................................................................................................................................................................22
PCA functions in R......................................................................................................................................................................24


Example: archaeology: the classification of Jomon pottery sherds .................................................................................................25


Example: geology: An investigation of the Martinsville igneous complex..........................................................................................36
Example: biology: Comparing the songs of cicadas........................................................................................................................42
Example: biology: Analysis of community change with climate warming..........................................................................................45
Concluding remarks on the use of PCA.........................................................................................................................................50
4: Correspondence Analysis.......................................................................................................................................................52
Uses..........................................................................................................................................................................................52
Summary of the method..............................................................................................................................................................52
Example: marketing: Soft drink consumption................................................................................................................................56
Example: archaeology: The temporal relationship between the pots from trenches on the Greek island of Melos...............................59
Example: biology: Analysis of community change with climate warming..........................................................................................65
A Procrustes method to compare ordinations ...............................................................................................................................70
Concluding remarks on the use of CA...........................................................................................................................................72
5: MultiDimensional Scaling ......................................................................................................................................................73
Uses..........................................................................................................................................................................................73
Summary of the method..............................................................................................................................................................73
Metric and non-metric similarity or distance measures...................................................................................................................75
Common user options.................................................................................................................................................................75
Example: archaeology: Seriation of Nigerian pottery......................................................................................................................77
Example: biology: Analysis of fish community change....................................................................................................................82
Example: geology: Ordovician mollusc assemblages......................................................................................................................86
Concluding remarks on the use of MDS........................................................................................................................................89
6: Linear Discriminant Analysis..................................................................................................................................................90
Uses..........................................................................................................................................................................................90
Summary of the method..............................................................................................................................................................90
Normality...................................................................................................................................................................................91
Example: biology: Iris systematics................................................................................................................................................92
Example: archaeology: Skull shape..............................................................................................................................................94
Example: archaeology: Chemical analysis of Romano-British pottery...............................................................................................97


Concluding remarks on the use of Linear Discriminant Analysis....................................................................................................102


7: Canonical Correspondence Analysis (CCA)..........................................................................................................................104
Uses........................................................................................................................................................................................104
Summary of the method............................................................................................................................................................104
Do my variables need to be normally distributed?........................................................................................................................105
Minimising the number of explanatory variables..........................................................................................................................105
Multicollinearity and the selection of variables.............................................................................................................................105
Testing for significance..............................................................................................................................................................106
The choice of scaling for the CCA output....................................................................................................................................106
Example: palaeontology: Palaeogeography of forest trees in the Czech Republic around 2000 BP...................................................109
Example: biology: Effects of climate change on an estuarine fish community.................................................................................120
Concluding remarks on the use of CCA.......................................................................................................................................126
8: TWINSPAN (Two-Way Indicator Species Analysis).............................................................................................................127
Uses........................................................................................................................................................................................127
Summary of the method............................................................................................................................................................127
Example: biology: Woody species composition of floodplain forests of the Little River, McCurtain and LeFlore Counties, Oklahoma....131
Example: biology: The effect of climate change on an estuarine fish community............................................................................135
Concluding remarks on the use of TWINSPAN.............................................................................................................................138
9: Hierarchical Agglomerative Cluster Analysis.......................................................................................................................139
Uses........................................................................................................................................................................................139
Summary of the method............................................................................................................................................................139
Some useful measures of similarity ...........................................................................................................................................140
Example: biology: Woody species composition of floodplain forests of the Little River, McCurtain and LeFlore Counties, Oklahoma....144
Example: biology: The effect of climate change on an estuarine fish community............................................................................147
Example: veterinary science: Body-weight changes during growth in puppies of different breed.....................................................151
Example: linguistics: Authors’ characteristic writing styles as seen through their use of commas.....................................................155
Example: cluster analysis and heat maps in R.............................................................................................................................160
Concluding remarks on the use of Agglomerative Cluster Analysis................................................................................................161
10: Analysis of Similarities (ANOSIM)......................................................................................................................................162


Uses........................................................................................................................................................................................162
Summary of the method............................................................................................................................................................162
Example: archaeology: The classification of Jomon pottery sherds................................................................................................163
11: Tips on CAP and Ecom........................................................................................................................................................165
When to use CAP and Ecom......................................................................................................................................................165
Data organisation.....................................................................................................................................................................166
What type of data do you have?................................................................................................................................................166
Transforming your data.............................................................................................................................................................167
Dealing with outliers ................................................................................................................................................................168
Organising samples into groups.................................................................................................................................................169
Explore your data.....................................................................................................................................................................170
Dealing with multicollinearity in Ecom.........................................................................................................................................171
Getting the most out of your charts............................................................................................................................................171
Editing dendrograms.................................................................................................................................................................174
Suggestions for how to present your methods.............................................................................................................................175
Glossary of multivariate terms.................................................................................................................................................176
Index........................................................................................................................................................................................181
Pisces software titles & training..............................................................................................................................................185


Chapter 1

1: Introduction

If you record a number of variables from each sample, you are building a multivariate data set, and you will probably
want to use multivariate methods to analyse and present your findings. This handbook is a practical guide to the methods
available for exploring multivariate data and identifying relationships between samples and variables. I have whenever
possible used actual data from published studies, so that the reader can understand how the authors generated their
presentation and conclusions. The focus is on the use of mathematical methods to explore and present relationships in
field data, and not the statistical testing of hypotheses using experimental data. The data sets discussed arise from field
observations in geology, archaeology, biology and palaeontology.
Typically, the sampling program has been designed to gain insight into what is present at different localities or times,
and to compare the results to see if they fall into some sort of pattern or classification. In many studies, the variables
that can explain this pattern are not measured, or may not even be known. Possibly, after a pattern has been detected, an
explanation might be inferred. For example, the distribution of pottery shards observed might lead to an idea about the
past human activity in the area. Methods discussed here, which do not explicitly consider the explanatory variables, include
Principal Component Analysis (PCA), Correspondence Analysis (CA), Non-metric MultiDimensional Scaling (NMDS),
Cluster Analysis and TWINSPAN.


If the objects or samples under study can be placed into groups, we often need to test whether these groups are statistically
significant. Analysis of Similarities (ANOSIM) is often the method of choice to test if members of a group are more
similar to each other than they are to members of other groups. This randomisation test is particularly useful as it is
generally applicable. With previously-defined groups, Discriminant Analysis (DA) can be used to create a discriminant
function to predict group membership. These predicted memberships can also be used to validate groups formed by
Agglomerative Cluster Analysis (ACA).
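As a foretaste of Chapter 10, a minimal sketch of an ANOSIM test using the vegan package is shown below; the objects comm (a samples-by-species matrix) and groups (a factor of group labels) are hypothetical stand-ins for your own data:

library(vegan)
d <- vegdist(comm) # Bray-Curtis dissimilarities between samples (the vegan default)
fit <- anosim(d, groups) # randomisation test: are samples more similar within groups?
summary(fit) # the R statistic and its permutation p-value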
There is also a group of methods for the analysis of situations where possible explanatory variables have been measured
together with the descriptive variables. One of the most familiar is multiple regression, where a model is constructed in
which a number of explanatory variables are used to predict the value of a dependent variable. This handbook does not
discuss standard regression methods, which are well-covered in standard statistical textbooks. It does cover Canonical
Correspondence Analysis (CCA), a constrained ordination method based on correspondence and regression analysis. This
method is presently by far the most popular constrained ordination method. CCA is termed a constrained method because
the sample ordination scores are constrained to be a linear combination of the explanatory variables. Unlike unconstrained
methods, CCA allows significance testing for the possible explanatory variables via randomisation tests.
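As a first taste of Chapter 7, here is a minimal sketch of a CCA with randomisation tests in the vegan package; comm (a species matrix) and env (a data frame of explanatory variables) are hypothetical:

library(vegan)
m <- cca(comm ~ ., data = env) # sample scores constrained by the explanatory variables
anova(m, by = "terms") # randomisation test for each explanatory variable in turn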

Why use multivariate methods?


There are several motivations for using multivariate methods. In general, they are required because we need to search for
the pattern of relationships between many variables simultaneously. Complex inter-relationships will not allow a useful
analysis to be obtained by using each variable in isolation. The main motivations are:
○○ Classification - dividing variables or samples into groups with shared properties.
○○ Identifying gradients, trends or other patterns in multivariate data.
○○ Identifying which explanatory, independent or environmental variables are most influential in determining sample or community structure.
○○ Finally, and usually most importantly, we aim to distil the most important features from a set of data derived from an almost infinitely complex world, so that these can be presented clearly to others. This often entails displaying the main features in a 2- or 3-dimensional plot.

CAP, Ecom and R can open Excel files, but you must be careful to import the correct data when using multiple worksheets. It is often wise to export the desired data set from Excel as a .csv file. This will ensure no non-standard features or data are present.
Organising your data - CAP and Ecom

Almost all multivariate analysis is undertaken using computers, and the choice of software will dictate the particular format and organisation of the data. Your data set will typically comprise a number of samples, cases or observations. For each sample there will be values for a number of variables. These data are typically organised as a two-dimensional table using a spreadsheet such as Excel.

Our first example is a typical biological data set, and is from a study of spider communities. In a study of the effects of farming, spiders were sampled by sweep-netting in 4 areas of saltmarsh, and the number of each species present counted. The researchers wished to know which sites were the most similar. The tabulated data were arranged as shown in Table 1. Note that some software (such as R) requires the variables as the columns and the samples as the rows. It is, however, usually easy on a computer to transpose your data. With the exception of the R analyses, in this book, data are arranged with samples as the columns and variables as the rows.

CAP can transpose (switch rows and columns) your data. Click on the Working Data tab and then select the Transpose radio button at the lower left side of the window, then click the Submit button. It is often easier to transpose your data set within Excel or another spreadsheet. In Excel use the Paste: Transpose option in the menu bar.

In R a matrix or data frame is transposed using the t(x) function, where x is the matrix to be transposed. For example:

D <- matrix(1:30, 5, 6)
Trans_D <- t(D)

Table 1: An example of a typical multivariate data set; the tabulated results from a study of spider community structure showing the first few rows and columns. Unpublished data. Note that in this table blanks are actually zero values.

Spider species          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7
Pachygnatha clercki     6 1 5
Pachygnatha degeeri     1 1
Tetragnatha extensa     2 1 2
Bathyphantes gracilis   1 1 2 4 3
Lepthyphantes tenuis    1 3 3 7 1
Erigone longipalpis     1 1
Erigone atra female     1 4 4
Agyneta cf decora       1
Savignya frontata       1

CAP will automatically add zeros to blank or empty cells. Also take care to use a zero and not a capital O. When working with Excel, zeros can be added using the Replace function. Select the area to search and then simply select Replace without anything in the Find what cell and a 0 (zero) in the Replace with cell.
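For example, a table stored with samples as columns, like Table 1, could be read and transposed for R as follows (the file name Spiders.csv is hypothetical):

spiders <- read.csv("D:\\Demo Data\\Spiders.csv", row.names = 1) # species as row names
spiders.t <- as.data.frame(t(spiders)) # t() returns a matrix, so convert back to a data frame
head(spiders.t, 3) # samples are now the rows, species the columns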

Our second example is derived from archaeology (Table 2). This data set comprises counts of different-shaped pottery
shards from different localities in the Mississippi valley. In both of these examples the data comprise integer counts.
Most methods are equally applicable to continuous real variables, such as the height of people or the speed of swimming
of individual dolphins. It is also possible in some circumstances to mix different types of data such as continuous real
numbers and classificatory variables that are only represented by a few discrete numbers. The type of data applicable to
each method is discussed under that method.
Table 2: Counts for pottery shards from different localities in the Mississippi valley. From Pierce, 19981.

                Caney    Claiborne  Copes  Hearns  Jaketown  Linsley  Poverty  Shoe   Teoc   Lewis
                Mounds                                                Point    Bayou  Creek
Biconical       29       3259       67     57      485       58       3122     23     228    104
Cylindrical     5        1230       78     0       1411      7        4718     3      12     0
Ellipsoidal     1        3476       130    5       3         33       5103     1      2      108
Spheroidal      7        824        2      6       29        4        355      1      16     5
GroovedSphere   1        2014       17     0       410       8        3434     0      2      93
Biscuit         58       22         11     8       0         4        138      12     2      12
Amorphous       55       476        56     90      0         7        866      116    4      65
Other           1        143        1      12      0         6        187      1      1      4
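As a small sketch of mixing variable types in R (the values below are hypothetical):

dolphins <- data.frame(speed_m_s = c(3.1, 4.7, 2.9), # continuous real numbers
                       sex = factor(c("F", "M", "F")), # a classificatory variable
                       count = c(2, 1, 4)) # integer counts
str(dolphins) # check how R has typed each column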

Organising your data for R


Within R you will need to organise your data with the variables as the columns and the rows as the individual samples or
observations. The best approach is usually to organise the data in a spreadsheet such as Excel and save the data as a .csv
file. This can then be easily opened in R. Fig. 1 shows a section of the Jomon pottery sherd Excel spreadsheet. These data are used in the example in Chapter 3, on page 29. Note that each piece of pottery is a row and the concentrations of the various elements present in each piece are the columns. The data set was saved as the comma-delimited file Jomon Hall R.csv. It is possible to open Excel files in R; however, this is best avoided because it is more difficult to control which data will be opened, given that Excel can have multiple worksheets. It is generally recommended that you place your data in a simple comma-delimited (*.csv) file which can then be easily opened in R.

Fig. 1: Part of the Jomon pottery sherd Excel spreadsheet, organised for use in R.

1 Pierce, C., 1998. Theory, measurement, and explanation: variable shapes in Poverty Point Objects. In Unit Issues in Archaeology, edited by Ann F. Ramenofsky and Anastasia Steffen, pp. 163-190. University of Utah Press, Salt Lake City.

Opening a .csv file in R


The read.table command can be used, as shown in the example below, to read the data ready for a PCA or similar method. Note that this data set holds sample and location names in the first 2 columns. The tabulated data are held in the file Jomon Hall R.csv. The header = TRUE option tells R that the first row holds column titles, and sep = "," that the individual cell contents are separated by commas. The numerical data for the PCA are then log-transformed and placed in log.sherds. The location of each sample is held in location.sherds.

# Open data set
pottery.csv <- read.table("D:\\Demo Data\\Jomon Hall R.csv", header = TRUE, sep = ",")
head(pottery.csv, 4) # Print first lines to check data loaded correctly and find names of variables

# log transform
log.sherds <- log(pottery.csv[, 3:16]) # log the numerical data
head(log.sherds, 4) # Some output to check values have been logged
location.sherds <- pottery.csv[, 2] # Put locations in a vector


The first 4 rows of the Jomon Hall R.csv file, opened in Excel, are shown below:
Sample Location Ti Mn Fe Ni Cu Zn Ga Pb Th Rb Sr Y Zr Ba
s:ma:001 Shouninzuka 11035 1065 40145 43 131 114 33 68 16 54 143 26 212 892
s:ma:002 Shouninzuka 9958 574 37170 21 70 70 23 50 10 44 88 18 177 748
s:mb:001 Shouninzuka 10147 851 68136 40 108 73 23 32 13 59 76 13 185 1069

R also offers a read.csv command to enter data in .csv format, as shown below:

# The data set has variable names in row 1; column 1 is sample identifiers, column 2 group allocation
x <- read.csv(file = "D:\\Demo Data\\Jomon Hall R.csv", header = TRUE, row.names = 1)
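Whichever command you use, it is worth checking what R has actually loaded; for example:

str(x) # the type and dimensions of each variable
head(x, 3) # the first rows, with the sample identifiers as row names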

NOTE: Throughout the book, we have used the generic file location of ‘D:\\Demo Data’ in the R code; obviously you
will need to amend this to reflect your own file locations if you use the code yourself.

The appropriate types of variable


The methods discussed in this handbook can be applied to the following types of variable record:
○○ Quantitative measures – e.g. the number of animals in a sample, bits of pottery at a certain depth or the concentration
of a chemical. These can be integer counts 1, 2, 3… or real numbers, 1.21, 1.325, etc.
○○ Semi-quantitative measures – e.g. plant density on a scale 1 to 5, or perceived attractiveness on a scale 1 to 10.
○○ Binary or presence/absence records – e.g. a species or other object has a score of 1 if present in a sample and zero
if not.
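As a minimal sketch of the last category, a quantitative count matrix can be reduced to binary presence/absence records in R (the matrix counts is hypothetical):

counts <- matrix(c(6, 0, 2, 0,
                   1, 3, 0, 4,
                   0, 1, 1, 0), nrow = 3, byrow = TRUE) # 3 samples x 4 species
pa <- (counts > 0) * 1 # 1 if present in a sample, 0 if not
pa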
For some methods it is possible to combine quantitative, semi-quantitative and binary measurements. It is also appropriate
to mix variables with different units of measurement. For example, a data set might comprise variables measuring
concentrations in mg/l and acidity expressed as pH. However, you should generally ensure that all your variables have
a similar magnitude. If not, then the high-magnitude variables may dominate the analysis. In general, the units of measurement are arbitrary, so you can change these to produce variables with a similar magnitude. For example, if the
concentration of sodium is 5 mg/l and calcium is 5000 mg/l the calcium concentration can be expressed as grams to
become 5 g/l. This will result in both variables having a similar magnitude and variance.
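A small sketch of this rescaling in R (the values are hypothetical):

water <- data.frame(sodium_mg_l = c(5.2, 4.8, 5.1),
                    calcium_mg_l = c(4900, 5100, 5050)) # three orders of magnitude larger
water$calcium_g_l <- water$calcium_mg_l / 1000 # express calcium in g/l instead
sapply(water[, c("sodium_mg_l", "calcium_g_l")], var) # variances are now comparable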

Data transformation

It is common practice in statistical analysis to transform data by taking, for example, the square root or logarithm of each observation. The transformation of data is discussed, when appropriate, under the various examples below. We transform data for two main reasons: first, to reduce the influence of the high-magnitude variables on the outcome of the analysis, and second, to normalise the data. Generally, for the methods discussed here, we are little concerned about the normality of the variables.

While not always necessary, transformations are often useful when some variables have a highly-skewed distribution, e.g. the bulk of the observations are low, but there are a number of much higher values. Common transformations that can be useful are logarithmic and square root; there was a fashion for the 4th root transformation, which should be avoided, because your data will be so heavily modified that it will bear little relationship to what you actually recorded. If your data set includes zero values, a log transformation cannot be applied, but a square root transformation will usually prove appropriate.

If your data set includes just a few extreme outliers it may be sensible to exclude these data from the analysis. If you take this approach it is important to consider why these extreme values were present. Presumably they reflect a particular set of circumstances, e.g. unusual environmental conditions.

CAP offers a wide range of transformations within the Working Data tab. You should experiment with different transformations as this will not harm your original data. To revert to the raw data just click the Reload Raw Data button.

In R, log and square root transformations are accomplished with the log() and sqrt() functions. For example, l.pot <- log(pot.csv[, 3:16]) logs columns 3 to 16 of pot.csv, and sqrtSS <- sqrt(SS) puts the square root of all values in SS in sqrtSS.
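A brief sketch of why the square root is the safer choice for counts that include zeros:

counts <- c(0, 1, 3, 0, 12, 150) # hypothetical counts including zeros
log(counts) # gives -Inf for the zero values, so cannot be used directly
sqrt(counts) # the square root transformation handles zeros without difficulty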

Which method to use?


There are no simple rules for the selection of the best method. It is frequently observed that if a data set tells a strong
story, this will be clearly shown by all of the methods that can be appropriately applied. Your choice of method is therefore, in part, determined by the method that best displays the main features of the data. If you have good reasons for
believing your data were collected from along a simple environmental gradient, which is reflected in the magnitudes of the
sample variables, then Correspondence Analysis or Canonical Correspondence Analysis may be most appropriate. If the
samples are likely to form distinct groups then a cluster analysis may be useful. Alternatively, if the samples are unlikely
or not necessarily likely to split into a few clear groups, Principal Component Analysis or MultiDimensional Scaling may
be most appropriate. MultiDimensional Scaling is a highly flexible method that allows a wide range of different similarity
measures between samples to be used. However, it does not allow the most important variables determining the ordination
of samples to be easily identified and presented graphically. Authors almost never explain why they chose a particular
method. Probably the two main reasons were that it was available on their computer system, or they tried several, and the
one chosen gave the clearest presentation or the result they hoped or expected to see. The other common reason is that a
particularly opinionated person has asserted the superiority of one approach. Treat such know-alls with the polite disdain
you show their type in other walks of life. Be aware that, because there is no mathematical argument to demonstrate the
clear superiority of a single method for all types of data, multivariate analysis is prone to methods moving in and out of
fashion.
I would advise that all appropriate methods are tried, and if any lead to clearly different conclusions, care is taken to
find out why this has come about. If, as is often the case, Principal Component Analysis, Correspondence Analysis and
MultiDimensional Scaling all give broadly similar ordinations, then the method most clearly showing the main features can
be chosen, confident that your conclusions are robust. If you have to struggle hard to find any interpretable pattern, and
different methods lead to different conclusions, you should accept that there is no clear structure within the data. However,
before conceding that your data do not tell a story, make sure no single variable is overwhelming a pattern displayed by
other, possibly lower-magnitude, variables. Such structure may be discovered by transformation of some or all variables.

It may even be useful to remove some variables. Frequently, many variables represent rare objects and are in very low abundance. Biological data sets, for example, frequently comprise a large number of zero observations, because most species are only found once or twice. It is not uncommon for 80% of the elements of a data matrix to be zero values. Such matrices are termed sparse, and methods such as PCA generally should not be applied to sparse arrays. Rare observations can be removed, as they contribute nothing to the general pattern. Remember, species or objects that are only observed once cannot be correlated with other features so they cannot contribute to an ordination such as PCA.

In CAP, the rare species are removed from the Working Data tab: select Handling zeros and choose Remove sparse rows. The proportion of zero values in the data set is given in the Summary tab.
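A minimal sketch of checking sparsity and removing singletons in R, assuming a samples-by-species data frame called comm (a hypothetical name):

sum(comm == 0) / (nrow(comm) * ncol(comm)) # proportion of zero cells in the matrix
comm.reduced <- comm[, colSums(comm > 0) > 1] # drop species recorded in only one sample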

Where to obtain the example data sets and R code


It is because of the difficulty of giving simple advice on the best method that this handbook comprises a series of worked
examples. At the head of each example is the source of the data, a reference to any published work, and the name of the data file(s), which can be downloaded (in a zipped format) from our website at:
www.pisces-conservation.com/phmm-data.html
The R code for all of the worked examples is also available from the same page, as a single zipped file.

Software featured in the book


All of the calculations, graphs and dendrograms presented in this handbook have been calculated using either (1)
Community Analysis Package (CAP) and Ecological Community Analysis (Ecom), produced by Pisces Conservation Ltd.,
or (2) R. In the margins of this book are occasional text-boxes giving details and tips on producing the output using R, CAP or Ecom. The final chapter gives some tips and advice on using both CAP & Ecom. R code can be run within the CAP & Ecom environment.


Chapter 2

2: Installing R and getting started

Getting R running on your computer


R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX
platforms, Windows and MacOS. To use R for the multivariate analyses in this book you will need to undertake the
following 3 steps in the order listed below. You do not need to use RStudio, but it is available free of charge and makes
your life easier.
○○ Install R
○○ Install RStudio
○○ Install R packages
You will find many useful YouTube videos taking you through the installation, for example:
https://www.youtube.com/watch?v=cX532N_XLIs for Mac users
https://www.youtube.com/watch?v=MFfRQuQKGYg for Windows
https://www.youtube.com/watch?v=Led-G7lkMZ4 for Linux


Installing R
To download R, choose your preferred CRAN mirror at https://cran.r-project.org/mirrors.html
You will find a range of options for different operating systems and computers. For example, download and install the
current Windows version of R from https://cran.r-project.org/bin/windows/base/
When installing, you can usually just accept the default settings.

Installing RStudio
RStudio is an open-source front-end for R which makes R a little simpler to use. It offers a feature-rich source-code editor
which includes syntax highlighting, parentheses completion, spell-checking, etc., a system for examining objects saved in R,
an interface to R help, and extended features to examine and save plots. RStudio is easy to learn.
○○ Go to the RStudio download page at https://www.rstudio.com/products/rstudio/#Desktop. Select the
DOWNLOAD RSTUDIO DESKTOP button if you wish to use the free version.
○○ Select the link from the “Installers for Supported Platforms” list that corresponds to the operating system on your
computer.
○○ Once installed you can personalise RStudio by opening the program and selecting from Tools the Global Options…
submenu. It is usually best not to change the defaults until you have gained some experience with the program.

Introducing RStudio

RStudio design
RStudio is organized around a four-panel layout, seen in Fig. 2, page 12. The upper-left panel, R Script Editor, may
not be visible when you open the program for the first time; if this is the case, click File: New File: R Script to display
it, or press Ctrl+Shift+N on your keyboard.


Fig. 2: The RStudio layout.


The upper-left panel is the R Script Editor. R commands are typed into this panel and submitted to the R Console in
the lower-left panel. For most applications, you will type R commands into the Script Editor and submit them to the
Console; you will not type commands directly into the Console. The Script Editor is a high-level text editor, whereas the
Console is the R program.
The upper-right panel contains at least two tabs - Environment and History. Many items listed under the Environment tab can be double-clicked to open them for viewing as a tab in the Script Editor. The History tab simply shows all of the
commands that you have submitted to the Console during the current session.
The lower-right panel contains at least five tabs - Files, Plots, Packages, Help, and Viewer. The Plots tab will show the
plots produced by commands submitted to the Console. One can cycle through the history of constructed plots with the
arrows on the left side of the plot toolbar, and plots can be saved to external files using the Export tab on the plot toolbar
(see figure above). A list of all installed packages is seen by selecting the Packages tab (packages can also be installed
through this tab, as described below). Help for each package can be obtained by clicking on the name of the package. The
help will then appear in the Help tab.

Basic usage
Your primary interaction with RStudio will be through developing R scripts in the Script Editor, submitting those scripts
to the Console, and viewing textual or tabular results in the Console and graphical results in the Plot panel. In this section,
we briefly introduce how to construct and run R scripts in RStudio.
To open a blank file, select the New icon and then R Script. Alternatively, you can select the File menu, New submenu, and R Script option, or use <CTRL>+<Shift>+N. In the newly-created Script Editor panel, type the three lines exactly as shown below (for the moment, don't worry about what the lines do).

dat <- rnorm(100) # create random normal data (n=100)
hist(dat, main = "") # histogram of data without a title
summary(dat) # summary statistics

These commands must be submitted to the Console to perform the requested calculations. Commands may be submitted
to the Console in a variety of ways:
○○ Put the cursor on a line in the Script Editor and press the Run icon; alternatively press <CTRL>+<Enter>. This will submit that line to the Console and move the cursor to the next line in the Script Editor. Pressing Run (or <CTRL>+<Enter>) will submit this next line. And so on.
○○ Select all lines in the Script Editor that you wish to submit and press Run (or <CTRL>+<Enter>).


The RStudio layout after using the first method is shown in Fig. 2, page 12.
The R Script in the Script Editor should now be saved by selecting the File menu and the Save option (alternatively,
pressing <CTRL>+S). Give a name to the script, choose the preferred directory, and click Save; it will be saved with a
.R file extension. RStudio can now be closed (do NOT save the workspace). When RStudio is restarted later, the script
can be reopened (choose the File menu and the Open file submenu, if the file is not already in the Script Editor) and
resubmitted to the Console to exactly repeat the analyses. (Note that the results of commands are not saved in R or
RStudio; rather the commands are saved and resubmitted to re-perform the analysis).

R packages to install
Each chapter with a section on R will list the packages required to carry out the analyses featured in that chapter. The full
list required is as follows:
devtools (needed to install some other packages, such as ggbiplot)
stats (already installed as standard)
ade4
ca
calibrate
flipMultivariates
ggbiplot (refer to the section Packages not available from the CRAN repository, below)
ggplot2
graphics
grDevices
RColorBrewer
vegan

If you use the R functionality within CAP and Ecom, the programs will automatically check and install the required packages.
Other packages which may be useful, but are not featured in analyses in this book, include FactoMineR, MASS and
pcaMethods.


Fig. 3: (left) The RStudio window featuring the Packages tab (lower right).
Fig. 4: (right) The Install Packages dialog.

Installing R packages
The first time you try to install a package, the installation routine may ask if you wish to create a library folder to store the
installed packages, for instance D:\R\win-library\3.5. If this option fails, it may be necessary to create the R folder and
its sub-folders by hand; the rest of the installation should then proceed.
R packages may be installed in RStudio using the Tools: Install Packages... option on the top toolbar. Alternatively, look for the Packages tab on the lower right pane of the main program window, and click the Install button - see Fig. 3. You
will then see the Install Packages dialog (Fig. 4). Start typing the name of the package you want into the Packages box,
and select it from the drop-down list of available packages; the installation will then proceed.


Packages not available from the CRAN repository


Not all packages are immediately available from the default CRAN repository. ggbiplot is one such package1. In order to
install this, first ensure that you have devtools installed. Then in the Script Editor pane, enter:
library(devtools)
install_github("vqv/ggbiplot")
Select the two lines of code and press Run; the installation will proceed. The installation may require RTools to be
installed, if you do not already have it; this should be carried out automatically.

R code conventions used in the book


NOTE: R is case-sensitive, so please ensure that your code matches the case shown, to avoid unnecessary effort tracking down syntax errors in your code.
Throughout the book, where R code is presented:
Comment lines are in green text, preceded by the hash (#) symbol.
Active lines of code are in bold blue.
Throughout the book, we have used the generic file location of ‘D:\\Demo Data’ in the R code; obviously you will need
to amend this to reflect your own file locations if you use the code yourself.

Final advisory note on R code


The R code snippets given in this book were tested and shown to work with the data sets supplied, at the time of going
to press (June 2019). Please be aware that the various packages used are frequently updated, and so some editing of the
code may be necessary. We will endeavour to ensure that the downloadable versions of the code on our website are kept
updated. However, as the R packages themselves are beyond our control, we cannot guarantee that the code snippets will
continue to work, or give the same results, indefinitely, without some modification.
1 https://github.com/vqv/ggbiplot


Chapter 3

3: Principal Component Analysis
A standard method to summarise multivariate data in a 2-dimensional plot

Principal Component Analysis (PCA) is one of the oldest and most important multivariate methods. When used
appropriately it can display the main features of a multivariate data set and may reveal hidden features within your data.

Uses
Use PCA to show the relationship between objects or samples in a simple 1-, 2- or 3-dimensional plot. The method also
identifies those variables which best define the similarity between the objects. In a PCA plot, the most similar objects
will be placed closest together. It can also be used to create composite variables that capture the general magnitude of a
feature that can only be described using a combination of variables. These new composite variables created by PCA can
then be used as input for other forms of analysis. For example, PCA scores can be used to create a dendrogram, or used
in a regression analysis.

The ability to summarise multivariate data in a 2- or 3-dimensional plot, or create new composite variables that summarise a general feature, is dependent on the existence of some level of correlation between the variables. The method is most valuable when some of the measured variables are highly correlated. If the variables are uncorrelated, the method is powerless to help you in your analysis, so try MultiDimensional Scaling instead (see Chapter 5).

In CAP you will find PCA within the Ordination drop-down menu at the top of the program.

Summary of the method


PCA is essentially a method for displaying your data in a space with fewer dimensions than the number of variables
measured. To show how this is achieved, we use the simplest example possible. Consider a plot of two highly-correlated
variables (Fig. 5) and the resulting PCA plot. After the PCA, the points have been rotated and are now arranged along
the new x-axis (Fig. 6). This new axis is a linear combination of variables 1 and 2. However, the most important point to
note is that PCA has arranged the points almost entirely along a single axis. It has, in effect, converted a 2D plot to a 1D
plot. PCA is sometimes described as a rotation. This is because the method can be thought of as a way to rotate the axes
so that the new x-axis (first principal axis) runs through the line of greatest variability in the data. The second principal
axis is then set at right angles to the first, and is similarly orientated to pass along the line of second greatest variability.

Fig. 5: (left) A simple example of the plot of two highly positively-correlated variables.
Fig. 6: (right) The final ordination produced by PCA for the data set plotted in Fig. 5.


Fig. 7: (left) A scatter plot of 2 random variables showing no correlation.


Fig. 8: (right) The ordination produced by PCA of the data plotted in Fig. 7.
It is easily demonstrated that PCA is of no value when applied to data in which there are no correlations between the
variables.
Fig. 7 plots two random, uncorrelated, variables, and Fig. 8 the plot after PCA. Note that the two plots are similar and
still 2-dimensional.
The number of principal axes that can be calculated equals the number of variables measured. For the mathematically-minded, the principal axes are found using matrix algebra to solve the equation:

$$(S - \lambda_k I)u_k = 0$$

where S is the dispersion matrix1, λk is an eigenvalue, and uk the associated eigenvector.

The eigenvalues are found by solving numerically the characteristic equation:

$$|S - \lambda_k I| = 0$$

1 The dispersion matrix is either a matrix giving the variances and covariances of the variables, or a matrix of the correlations between the variables.


and these are then used to solve for the eigenvectors. The eigenvectors are the principal axes of the dispersion matrix S.
The eigenvectors are scaled to a unit length and then used to compute the positions of the samples on each principal axis.
The dispersion matrix, S, can be either the variance-covariance or the correlation matrix for the variables. The choice will
depend on your data and is discussed below.
The eigenvalues of S give the amount of variance explained by each principal axis. PCA is therefore a partitioning of
the total variability in the original data set between a new set of variables. When successful, PCA can place most of the
variability in the dispersion matrix into only 2 or 3 dimensions.
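For readers who want to see the mechanics, the short base-R sketch below (using a small simulated data set, so every value in it is illustrative) performs the eigenanalysis directly on a correlation matrix and projects the standardised data onto the principal axes; prcomp() and the other functions discussed later do the same job more robustly.

# PCA by direct eigenanalysis (illustrative sketch; the data are simulated)
set.seed(1)
x1 <- rnorm(50)
X  <- cbind(x1, x1 + rnorm(50, sd = 0.3))  # two highly-correlated variables
S  <- cor(X)                               # the dispersion matrix (here, correlations)
e  <- eigen(S)                             # solves (S - lambda*I)u = 0
e$values                                   # eigenvalues: variance explained by each axis
scores <- scale(X) %*% e$vectors           # positions of the samples on the principal axes
plot(scores, xlab = "PC1", ylab = "PC2")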

The use of the correlation or covariance matrix


If all of the variables are of the same kind and order of magnitude then you should generally undertake a PCA using the variance-covariance matrix. If you use the correlation matrix, you give equal weight to all your variables, irrespective of their magnitude. This is unwise if you have quantitative data: why bother to collect quantitative data if it is not used?

The correlation matrix is necessary when variables are measured in a number of completely different ways. For example, if some variables are the concentration of chemicals in water, and others describe non-quantitatively the structure of the lake bed, a correlation matrix would be appropriate. The results from using a variance-covariance or correlation matrix can be quite different. This is because, as mentioned above, using the correlation matrix gives every variable an equal weighting, while an analysis based on the variance-covariance matrix will be dominated by the variables with the largest magnitude.

In CAP you can choose to perform a PCA on the variance-covariance or correlation matrix under the Ordination drop-down menu. The dispersion matrix used is shown under the Cross Products tab.

With the R function prcomp() use scale = FALSE to perform a PCA on the variance-covariance matrix or scale = TRUE to use the correlation matrix. Similarly, with princomp() use cor = TRUE to use the correlation matrix.

How many dimensions are meaningful?

Statistical tests can rarely be used to determine how many components (dimensions) are meaningful. This is because the tests assume the variables are all normally distributed, and this is hardly ever the case. An empirical rule is that you should only interpret principal components if the corresponding eigenvalue is larger than the mean of all the eigenvalues. Software normally gives the sum of the eigenvalues; divide this sum by the total number of variables in your data set to get the mean.

If you undertake a PCA on the correlation matrix then the mean eigenvalue is 1, so the rule is simply to only consider components with an eigenvalue > 1.
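A minimal sketch of this rule in R, assuming mydata is a placeholder for your own numeric data frame:

# Mean-eigenvalue rule applied to a prcomp() result; 'mydata' is a placeholder
pca <- prcomp(mydata, scale = TRUE)  # scale = TRUE -> PCA on the correlation matrix
eig <- pca$sdev^2                    # the eigenvalues
mean(eig)                            # equals 1 when the correlation matrix is used
which(eig > mean(eig))               # the components worth interpreting under the rule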

Do my variables need to be normally distributed?


In practice, normality is not essential. PCA was developed for the analysis of multinormal distributions, and the method
works best when applied to data sets for which all the variables are normally distributed. However, it can give good results
with non-normal data, provided none of the variables is highly skewed or has extreme outliers. If some of the variables are
very skewed, then the first few components usually just separate out the few objects that have high values for the skewed
variables. You could do this by eye without using PCA. It is often possible to reduce the skew using a transformation, as
is discussed below.

Data transformations
It is often essential to transform some or all of the variables prior to their use in a PCA. Remember that if one variable
has a much larger mean than the rest it should be rescaled. For example, if frequency is measured in hertz and ranges from
6000 to 8000 Hz, while another variable, the pulse rate, ranges from 5 to 20 per second, you should express the frequency
in KHz so that it ranges from 6 to 8.
We transform data to:
1. Reduce the influence of the high-magnitude variables and, conversely, increase that of the low-magnitude variables, e.g. the highly-abundant species. This is not always relevant; for example, when PCA is applied to the correlation matrix, all variables have equal weight.
2. Normalise the data. Transformations are often useful when some variables have a highly-skewed distribution, e.g. when the bulk of the observations are low, but there are a few much higher values. However, do not be too worried if your variables are not normally distributed; they rarely are, and PCA can still give useful results.

Common transformations that can be useful are logarithmic and square root. The square root transformation has the advantage that it can be used in data sets with zero values. There was a fashion for the 4th root transformation; this should be avoided, as it excessively distorts your data. Generally, transformations make it more difficult to interpret your results and should only be used when necessary.

CAP has a range of transformations on the Working Data tab. You can experiment with different transformations without altering your original data. To revert to the raw data, use Reload Raw Data.
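The sketch below illustrates rescaling and the two recommended transformations; freq and pulse are hypothetical variables echoing the hertz example above.

# Rescaling and common transformations; 'freq' and 'pulse' are hypothetical
freq  <- c(6230, 7410, 6890)  # frequencies in Hz
pulse <- c(7, 12, 18)         # pulse rates per second
freq.khz <- freq / 1000       # rescale to kHz so both variables have similar magnitude
log.freq <- log10(freq.khz)   # log transformation (only valid when there are no zeros)
sqrt.pulse <- sqrt(pulse)     # square root transformation copes with zero values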

The horseshoe effect


In ecological sampling, PCA can produce what is termed the arch or horseshoe effect in a plot of the first and second
principal axes. This is frequently observed when samples are taken along a transect which follows an environmental
gradient. Under these circumstances, the different species tend to reach their maximum abundance at different points
along the transect. An extreme example of this type of distribution is shown in Table 3, and the resulting plot showing
the horseshoe effect in Fig. 9. Note that samples 1 and 10, which have no species in common, are placed alongside each
other, giving the mistaken impression that they have common attributes.
Table 3: A set of artificial data designed to show an extreme form of the horseshoe effect with Principal
Component Analysis.
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
sp1 10 5 0 0 0 0 0 0 0 0
sp2 0 10 5 0 0 0 0 0 0 0
sp3 0 5 10 5 0 0 0 0 0 0
sp4 0 0 5 10 5 0 0 0 0 0
sp5 0 0 0 5 10 5 0 0 0 0
sp6 0 0 0 0 5 10 5 0 0 0
sp7 0 0 0 0 0 5 10 5 0 0
sp8 0 0 0 0 0 0 5 10 5 0
sp9 0 0 0 0 0 0 0 5 10 0
sp10 0 0 0 0 0 0 0 0 5 10
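You can reproduce the horseshoe effect with a few lines of R; the sketch below builds a banded gradient table in the style of Table 3 (not the identical values) and ordinates the samples with prcomp().

# Build a banded gradient table similar to Table 3 and run a PCA on the samples
n <- 10
m <- matrix(0, n, n, dimnames = list(paste0("sp", 1:n), paste0("sample", 1:n)))
for (i in 1:n) {              # each species peaks at one point along the gradient
  m[i, i] <- 10
  if (i > 1) m[i, i - 1] <- 5
  if (i < n) m[i, i + 1] <- 5
}
pca <- prcomp(t(m))           # samples as rows; variance-covariance matrix
plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = 1:n)  # the arch bends samples 1 and 10 together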


Fig. 9: The result of a Principal Component Analysis using the data set shown in Table 3, demonstrating the horseshoe effect. Note that Samples 1 and 10, which have no species in common, are placed close together. Data set: artificial data horseshoe effect.xls.
The horseshoe effect arises from the assumption that the
descriptors (species, in the present discussion) are linearly related
to each other. This is not the case when species or descriptors
are arranged along a gradient and each tends to form a unimodal
distribution.
The existence of the horseshoe effect has been used to argue that
PCA is inappropriate for ecological and environmental studies
and should be avoided. The method is too useful to be so easily
discarded, but it should not be used to study organisation along
gradients. For such studies, consider Correspondence Analysis1
(CA) or Detrended Correspondence Analysis (DECORANA).
Using again the example data in Table 3, a DECORANA
ordination plot (Fig. 10, page 24) has no horseshoe effect, and the samples are arranged as per their position along
the gradient. Note that in this figure, samples 10 and 1 are at opposite ends of axis 1. It is sometimes assumed that Non-
metric MultiDimensional Scaling (NMDS) is generally superior to PCA. With respect to the horseshoe effect, this is not
the case. As shown in Fig. 11, page 24, with data showing a powerful gradient, such as the example data in Table 3, it
also produces the same horseshoe effect. This problem occurs with most of the common similarity measures used with
NMDS.

1 Also termed Reciprocal Averaging (RA)


Fig. 10: (left) An ordination produced by DECORANA of the demonstration data given in Table 3, page 22. These data produce the horseshoe effect when analysed using PCA.
Fig. 11: (right) An ordination produced by Non-metric MultiDimensional Scaling of the demonstration data given in Table 3, page 22. These data produce the horseshoe effect when analysed using PCA.

In CAP, to change the font and size of the label for the points, click on the Edit button above the graph and then choose Series – Unassigned – Marks – Text – Font.
PCA functions in R
A variety of R packages can be used to undertake Principal Component Analysis. The widely-used function prcomp()
is in the stats package. It has been argued that prcomp() is generally superior to princomp() because it uses singular
value decomposition of the data matrix to obtain the eigenvalues and eigenvectors, rather than spectral decomposition of


either the correlation or variance-covariance matrix. In practice, most users will find no difference in the results obtained.
Ecologists may favour the rda() function in the vegan package as this package offers a wide range of analytical techniques
used by ecologists for community analysis. Another function undertaking a PCA is dudi.pca() in the ade4 package which
has also been developed to analyse ecological data. The use of each of these functions is demonstrated below in the
example applications.
Other PCA functions not discussed further here are PCA(X) in the FactoMineR package, and pca(X) in the
bioconductor pcaMethods package. The function pca(X) is a wrapper for the prcomp() function which offers further
visualisation functions and methods for handling missing values. The PCA(X) function has many plotting options and
allows observations or variables to be given different weights.
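For orientation, the calls below sketch how the same correlation-matrix PCA is requested in each of the four functions used in this chapter; X stands for a numeric data frame of your own.

# Equivalent correlation-matrix PCAs; X is a placeholder numeric data frame
library(vegan)
library(ade4)
p1 <- prcomp(X, center = TRUE, scale = TRUE)             # stats, SVD-based
p2 <- princomp(X, cor = TRUE)                            # stats, spectral decomposition
p3 <- rda(X, scale = TRUE)                               # vegan
p4 <- dudi.pca(X, scannf = FALSE, nf = 2, scale = TRUE)  # ade4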

Example: archaeology: the classification of Jomon pottery sherds


Demonstration data set: Jomon_Hall.csv or Jomon Hall R.csv and Jomon Hall Shouninzuka R.csv
Reference: Hall, M. E., 2001. Pottery styles during the early Jomon period: geochemical perspectives on the Moroiso and
Ukishima pottery styles. Archaeometry 43, 59-75.
Required R packages: ggbiplot, ggplot2, ClassDiscovery
This example is based on the study by Hall (2001) of Japanese pottery sherds from the early Jomon period (c. 5000–2500 BC). Energy-dispersive X-ray fluorescence was used to determine the concentration of 15 minor and trace chemical
elements in 92 pottery sherds. The sherds came from four sites in the Kanto region and belong to either the Moroiso or
Ukishima style of pottery.
The author reasoned that if the pottery were locally produced, we should expect to find statistically significant differences
in the chemical composition between potsherds from different sites. If there are no differences between sites, then we
can assume that the Jomon potters utilized raw materials that were geochemically similar, or that the pottery was part of a
trade/exchange/redistribution network between settlements. For each sherd the elemental compositions of barium (Ba),
copper (Cu), gallium (Ga), iron (Fe), lead (Pb), manganese (Mn), nickel (Ni), niobium (Nb), rubidium (Rb), strontium (Sr),
thorium (Th), titanium (Ti), yttrium (Y), zinc (Zn) and zirconium (Zr) were measured.


Preliminary data examination and transformation

The concentration of the elements present varied greatly, from about 10 ppm for yttrium (Y) to around 10^5 ppm for iron (Fe). The author therefore undertook a log10 transformation of the data to reduce the dominance of Fe and Ti in the analysis. Given the 5 orders of magnitude difference in concentrations, and the fact that the data set has no zeros, a log transformation is a good choice. Niobium was removed from the data set prior to analysis as it was generally below the detectable limit.

In CAP, the Log10 transformation is easily undertaken. First click on the Working Data tab. Second, select the Transform radio button and finally select Log10 from the list of possible transformations. Remember your data are not permanently altered, so you can always try a range of different transformations.

In R, use the log10() function for a base-10 transformation, or log() for natural logs; for a PCA on the correlation matrix the base makes no difference to the ordination. For example: log.sherds <- log(pottery.csv)

The use of the correlation matrix

PCA was undertaken on the correlation matrix of the log-transformed data. The author does not explain why a log-transformation was used, but as the analysis was undertaken using the correlation matrix this transformation did not alter the conclusions. By using the correlation matrix the author was giving all the elements the same influence on the final ordination. This is the correct choice if it is believed that all elements can potentially contribute equally to the identification of similarities between sherds, irrespective of concentration. In fact, for these data the author would have reached substantially similar conclusions if the variance-covariance matrix of log-transformed data had been used instead.

Results using CAP


As shown in Table 4, the first 3 axes explained about 58% of the total variability in the data set. The sum of all the eigenvalues, which is a measure of the total variability, or inertia, is 14. Because the analysis is undertaken on the correlation matrix, this is simply the number of variables used in the analysis; this is always true when a correlation matrix is used. Therefore, the percentage variability explained by the largest eigenvalue is 4.713/14 x 100 = 33.66%.
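These percentages are easy to verify from a prcomp() result; a short check, assuming the sherds.pca object created in the R code later in this example:

# Percentage of the total variance explained by each axis
eig <- sherds.pca$sdev^2        # eigenvalues; these sum to 14 here
round(100 * eig / sum(eig), 2)  # the first value is 4.713/14 x 100 = 33.66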


Table 4: The eigenvalues for the first 5 axes of the PCA undertaken on Jomon pottery sherds. As there were 14 variables and the correlation matrix was analysed, the sum of the eigenvalues, which is the total inertia or variance of the data set, was 14.

Axis  Eigenvalue  Cumulative percentage of the total variance
1     4.713       33.66
2     1.954       47.62
3     1.437       57.89
4     1.228       66.66
5     0.9771      73.64

In CAP, the eigenvalues are shown in the Variance tabbed output window.

In R, using the prcomp() function to undertake a PCA, the variances are shown using the summary() function.
These results suggest that much of the variability in elemental composition can be expressed in 3 dimensions. The first 4 dimensions are probably meaningful (eigenvalues > 1).

An examination of the 3 possible 2D plots (axes 1 and 2; axes 1 and 3; axes 2 and 3) for the 3 largest components showed that the position of the sherds in the 2-dimensional space defined by the 1st and 3rd principal components separated the sherds into the 4 localities (Fig. 12, page 28). Four sample outliers in the PCA were a:mb:002, k:uk2:008, n:uk:137 and s:mb:007. For example, the sherd s:mb:007 is represented by the X on the far lower left of the plot. By repeating the analysis with the outliers removed we can see more clearly the grouping of the sherds between the 4 sites (Fig. 13, page 28). In this figure, for which the first 2 principal components are plotted, each of the sites is coded as a different symbol and colour. You will see, for example, that the crosses representing the Narita 60 site are clustered in a single discrete area.

In CAP, you can obtain the averages of the variables for each group by using the Summary tab. Click on the Group radio button and you will be presented with the averages for each group. Remember that this will only work if you have previously defined group membership.
The eigenvector plot (Fig. 14, page 29) shows that Principal axis 1 is a measure of the concentration of the elements
Zn, Ba, Mn, Zr, Ga, Cu, Ni, Fe, Y and Ti present, with sherds to the right (positive direction) of the axis having the largest
concentrations. Axis 2 is a measure of Sr, Rb, Th and Pb concentration, with the greatest concentrations at the bottom
(negative direction) of the axis.


Fig. 12: (left) The ordination plot of the Jomon pottery sherd data, based on the first and third principal axes.
PCA was undertaken using the correlation matrix.
Fig. 13: (right) PCA ordination of Jomon potsherds with outliers removed. PCA was undertaken using the
correlation matrix.
The samples were also classified by pottery style – Red is Moroiso A, Blue Moroiso B and Green Ukishima (Fig. 15)1. By
comparing the plot showing the sites and the styles, it is apparent that the Ukishima pottery is found at all the sites. Note
that there is a difference in the elemental composition in the pottery styles, with Moroiso sherds having generally higher
concentrations of all the elements measured except lead (Pb).

1 Note that changing the group membership does not affect a PCA ordination.


In CAP, to deselect the outliers, click on the Working Data tab. Find the first sherd to remove and select any cell in this column. Now select Handling zeros in the Type of Adjustment radio box, choose Deselect Column in the list, and click on Submit.

Fig. 14: Eigenvectors for the chemical composition variables produced by PCA for the Jomon potsherds with outliers removed. PCA was undertaken using the correlation matrix.
Fig. 15: PCA ordination of Jomon potsherds grouped by pottery style with outliers removed. PCA was undertaken using the correlation matrix. The samples have been allocated to groups based on pottery style.

Code and results using R and the prcomp() function

For this example the prcomp() function is used to undertake the PCA. Note that scale = TRUE results in PCA based on the correlation matrix. To use the variance-covariance matrix specify scale = FALSE. To identify how many of the principal components are important, the summary() function is used together with plot() to make a scree plot (Fig. 16, page 30).


# Open data set, print first lines to check data loaded correctly and find names of variables
pottery.csv <- read.table("D:\\Demo Data\\Jomon Hall R.csv", header = TRUE, sep = ",")
head(pottery.csv, 4)
# Log transform
log.sherds <- log(pottery.csv[, 3:16]) # log the numerical data
head(log.sherds, 4) # Some output to check values have been logged
location.sherds <- pottery.csv[, 2] # Put locations in a vector
# Run analysis
sherds.pca <- prcomp(log.sherds, center = TRUE, scale = TRUE)
print(sherds.pca) # Print the output
# Investigate the importance of each principal component
summary(sherds.pca) # Proportion of variance explained by each PC
plot(sherds.pca, type = "barplot") # makes a scree bar plot; otherwise use type = "lines"

Fig. 16: The scree plot generated by plot(sherds.pca, type = "barplot"). Note that a high proportion of the total variance is explained by the first two principal components.

The additional code segment below uses the function ggbiplot() from the ggbiplot package, which generates a particularly good biplot (Fig. 17), with each location group coloured differently. For each group, a 68% normal probability ellipse is drawn. The plot also shows the eigenvectors for each variable.


# Plot results using ggbiplot
library(ggbiplot)
g <- ggbiplot(sherds.pca, obs.scale = 1, var.scale = 1, groups = location.sherds, ellipse = TRUE, circle = TRUE) # plots PCs 1 & 2
g <- g + scale_color_discrete(name = "")
g <- g + theme(legend.direction = "horizontal", legend.position = "top")
print(g)

The default for ggbiplot() is a plot of PCs 1 and 2. Use the choices option to select other PCs. For example, to plot PCs 1 and 3 use:
g <- ggbiplot(sherds.pca, choices = c(1,3), obs.scale = 1, var.scale = 1, groups = location.sherds, ellipse = TRUE, circle = TRUE)

Fig. 17: The output using ggbiplot to show the variable associations with each PC.


The following R code produces the plot shown in Fig. 18 which will aid understanding of how the variables are associated
with each PCA axis. Note that the plot shows that negative values on PC2 are associated with elevated Pb (lead), Rb
(rubidium) and Th (thorium) and positive values with high Ni (nickel).

# A plot to identify variables associated with each PC
require(ggplot2)
theta <- seq(0, 2*pi, length.out = 100)
circle <- data.frame(x = cos(theta), y = sin(theta)) # a unit circle to frame the loadings
p <- ggplot(circle, aes(x, y)) + geom_path()
loadings <- data.frame(sherds.pca$rotation, .names = row.names(sherds.pca$rotation))
p + geom_text(data = loadings, mapping = aes(x = PC1, y = PC2, label = .names, colour = .names)) +
  coord_fixed(ratio = 1) + labs(x = "PC1", y = "PC2")

Fig. 18: Plot of the variable loadings on the first two principal components for the Jomon potsherd data, produced by the code above.

Conclusions
Hall (2001) concluded that Principal Component Analysis
indicated that there are four major groups in the data set, which
correspond to site location. This led him to the conclusion that
the majority of Early Jomon pottery found at four sites in Chiba
Prefecture, Japan, was made from locally-available raw materials.
While the Kamikaizuka and Shouninzuka groups overlap, both
sites are less than 10 km apart and their potters could have shared
raw material sources. The ordination is improved following the
removal of outliers.
For sites having both Moroiso- and Ukishima-style pottery, both
styles of pottery were made from the same or geochemically


similar raw materials. This suggests that both styles were probably made at the same site, and indicates that if the different
pottery styles are reflecting ethnic identity, then intermarriage between ethnic groups was occurring. Alternatively, the
pottery styles could be reflecting some sort of social interaction between groups.
Hall (2001) did not test for the statistical significance of the 4 groups. Without such tests the results are not convincing.
The statistical significance of the grouping of the pottery into four groups relating to their location is tested below in the
chapter on ANOSIM.

Alternative approaches
Hall (2001) used a correlation matrix for the PCA, which had the effect of giving equal weighting to every element. The plot in Fig. 19 shows the ordination of the sites using the variance-covariance matrix, which fully uses the quantitative data. It is interesting that the same conclusion is reached, and a slightly clearer division of the four sites is shown using a plot of the first and second principal axes.

Fig. 19: PCA ordination of Jomon potsherds using the variance-covariance matrix.


Using Mahalanobis distance with a PCA to detect outliers


Outliers are data points which lie outside the range of values you would expect for the data set. If you run a PCA and one of the samples is an extreme outlier, it will dominate the PCA plot and you will be unable to see clearly the relationships between the other samples. We can use the Mahalanobis distance to decide if a data point should be removed. In a PCA, each sample is placed in a multi-dimensional space in which the cloud of points approximates an ellipsoid, so we need a multi-dimensional measure, in standard deviation units, of the distance of each sample point from the multi-dimensional centroid of the distribution. The Mahalanobis distance is the distance of a point from the centre of mass, divided by the width of the ellipsoid in the direction of the test point. It is a unitless and scale-invariant measure.
In the Jomon pottery sherds example previously discussed in this section, we looked at the PCA plot before and after
outlier removal. We can use Mahalanobis distance to decide if a point is an outlier. Fig. 12 and Fig. 13, page 28, show
that one point from the Shouninzuka group seems to be an outlier. We can test this formally.
The R code below uses the package ClassDiscovery and a data set which only includes samples from the Shouninzuka group. For this library, each column in the data set represents a sample, and the rows are the individual variables. The theory says that, under the null hypothesis that all samples arise from the same multivariate normal distribution, the distance from the centre of a d-dimensional PC space should follow a chi-squared distribution with d degrees of freedom. This theory lets us compute p-values associated with the Mahalanobis distances for each sample. The line
round(cumsum(spc@variances)/sum(spc@variances), digits=2)
shows that the first two axes explain 50% of the total variance. This is quite large for a typical data set, suggesting a plot of the first two axes should show the relationship between the samples. Fig. 20 shows that the sample s.mb.007 appears to be an outlier. The line
maha2 <- mahalanobisQC(spc,2)
calculates the Mahalanobis distances using the first two principal components. To use more axes, change the 2 to a higher number. The calculations of the Mahalanobis distances for each sample and their significance values are shown in Table 5, demonstrating that s.mb.007 (marked in bold) is clearly significantly different. This point can be removed from the data set and the analysis re-run to determine if other points such as s.ma.001, s.mb.006 and s.mb.008 are also significant outliers. Be aware that if your data encompass nonlinear relationships, Mahalanobis distance can be misleading, as it assumes linear relationships between variables.

The R code for Mahalanobis distance has also been implemented into CAP and can be run on either the correlation or the covariance matrix; to run it, simply select Run R Code: PCA - Cor - Outlier or PCA - Covar - Outlier from the main program menu.
library(ClassDiscovery)
# Open data set
Shouninzuka <- read.table("D:\\Demo Data\\Jomon Hall Shouninzuka R.csv", header = TRUE, sep = ",")
head(Shouninzuka, 4) # Print first lines to check data
spc <- SamplePCA(Shouninzuka, usecor = TRUE) # Undertake a PCA on the correlation matrix
round(cumsum(spc@variances)/sum(spc@variances), digits = 2)
plot(spc) # Plot the first 2 principal components
text(spc, cex = 0.7, pos = 2) # Put a label by each sample point
maha2 <- mahalanobisQC(spc, 2)
maha2

Fig. 20: A plot of the two largest principal components for the Shouninzuka group of the Jomon pottery sherds. Note that point s.mb.007 on the right (arrowed) appears to be an outlier.

Table 5: Excerpt from the Mahalanobis distance output, showing the outlier point s.mb.007.

Sample     Statistic    P value
s.mb.003   1.09061387   5.796638e-01
s.mb.004   4.57285846   1.016287e-01
s.mb.005   0.35119079   8.389574e-01
s.mb.006   4.71815285   9.450747e-02
s.mb.007   41.51771085  9.650363e-10


Example: geology: An investigation of the Martinsville igneous complex


Demonstration data set: Petrology.csv or Petrology R.csv
Reference: Ragland, P. C., Conley, J. F., Parker, W. C. & Van Orman, J. A., 1997. Use of Principal Components Analysis in petrology: an example from the Martinsville igneous complex, Virginia, U.S.A. Mineralogy and Petrology 60: 165-184.
Required R package: ade4
This example is based on the study by Ragland et al. (1997) which examined the utility of PCA for the analysis of the
relationship between geological structures using a chemical data set for the Martinsville igneous complex (MIC), Virginia,
USA. The study sought to answer 4 main questions:
• Can PCA discern geochemical trends or relationships among lithologic units that have petrogenetic significance? If so, what are these trends and what do they indicate about the origin of the rocks?
• Can PCA determine which of the original chemical variables are the most meaningful, and do these correspond to the traditionally-accepted variables, such as SiO2 and MgO?
• Are the PCA-generated variables as useful or more useful for petrogenetic purposes than the original chemical variables?
• Overall, is PCA a useful alternative to the traditional approaches for examining these types of geochemical and petrologic data?

Preliminary data examination and transformation


The data set comprised data on the percentage weight of the oxides of 10 major elements, and the concentration in parts per million of 3 trace elements (Rb, Sr and Zr). The authors checked the variables for normality and noted that MgO was not normal, and so log-transformed this variable. This transformation does not make any difference to the ordination or the resulting conclusions, so was not carried out here. The two water variables, H2O+ and H2O-, were excluded.

The use of the correlation matrix


PCA was undertaken on the correlation matrix; this was essential if the ordination was not to be dominated by the three variables (Zr, Sr and Mg) with the largest variances because of their high relative magnitudes. This is the correct choice if it is believed that all elements can potentially contribute equally to the study of the relationships between the rocks.

Results
As shown in Table 6, the first 2 axes explained about 72.9% of the total variability in the data set. The sum of all the eigenvalues, a measure of the total variability, is 14, which is simply the number of variables used in the analysis, because the correlation matrix was used. Therefore, the percentage variability explained by the largest eigenvalue is 7.282/14 x 100 = 52.01%. The first 3 dimensions are probably meaningful (eigenvalues > 1).

Table 6: The eigenvalues for the first 5 axes of the PCA undertaken on chemical variables for the Martinsville igneous complex. As there were 14 variables, and the correlation matrix was analysed, the sum of the eigenvalues, which is the total inertia or variance of the data set, was 14.

Axis  Eigenvalue  Cumulative percentage of the total variance
1     7.282       52.01
2     2.918       72.86
3     1.27        81.93
4     0.8209      87.79
5     0.5165      91.48

The authors reported a higher percentage of the variability explained by the first two axes, probably because they combined
the percentage composition for the two iron oxides into a single variable. However, this makes little difference to the
ordination produced.
These results show that much of the variability in chemical composition can be expressed in 2 dimensions.
The plot of the eigenvectors (Fig. 21, page 38) shows that Principal axis 1 arranges the samples so that those with the highest concentrations of Ca, Fe, Al, Mn, Sr, P and Ti are towards the left (negative side) and those with the highest concentrations of Si, Rb and K to the right (positive side). When deciding which variables are making the greatest contribution to an axis, examine both the direction and the length of the eigenvectors. To make a strong contribution to an axis, an eigenvector should point approximately along the axis and be relatively long. Axis 2 is a measure of Zr and Na concentration, with the greatest concentrations at the top (positive direction) of the axis. The authors recognised 5 groups of eigenvectors: 1) Si, Rb and K; 2) Ca and Mg; 3) Fe, Al, Sr and Mn; 4) P and Ti; and 5) Na and Zr. When grouping eigenvectors, you must consider the angle between them, not the length of the vector. The present results would suggest that P and Ti make a poor group, but they can be viewed as intermediate between {Zr, Na} and {Fe, Al, Sr, Mn}.

Fig. 21: (left) Plot of eigenvectors of chemical composition for rock samples in the Martinsville igneous complex. PCA was undertaken on the correlation matrix using CAP.
Fig. 22: (right) Ordination of rocks within the Martinsville igneous complex. PCA was undertaken on the correlation matrix using CAP.

In CAP, you can edit all aspects of the legend. First select the Edit tool button above the chart. Then choose the tabs Chart: Legend. To remove the tick boxes, select the tabs Chart: Legend: Style and select No check boxes from the drop-down menu.

An examination of the 2D PCA ordination plots (Fig. 22) shows a clear clustering of the rock samples. When the samples are grouped according to their mineralogy it is clear that the PCA ordination based on chemistry produces a similar classification. For example, the syenodiorite can best be distinguished by its relatively high Na and Zr contents. The granites are characterised by relatively high Si, Rb, and K, and the Rich Acres gabbros by being relatively rich in Mg, Ca, and Fe. The authors do not consider these findings "particularly a surprise" and if the PCA "only confirmed the mineralogical groupings and chemical differences easily apparent, they would be of limited value." The particular value of the PCA is in showing relationships and hybrids. For instance, the hybrid Leatherwood rocks (blue squares) are intermediate in composition between the granites (yellow squares) and the diorites (red squares).

Code and results using R

library("ade4") # PCA example using dudi.pca in the ade4 package
# Open data set
rocks.csv <- read.table("D:\\Demo Data\\Petrology R.csv", header = TRUE, sep = ",")
head(rocks.csv, 4) # Print out the first 4 lines to check data loaded correctly
rock.obs <- rocks.csv[, 2:17] # get numerical data - the first column holds the classification
names <- rocks.csv[, 1] # Create a variable holding the classifications
# Run PCA on the correlation matrix
pca.results <- dudi.pca(rock.obs, scannf = FALSE, nf = 5, center = TRUE, scale = TRUE)
scatter(pca.results) # create a plot
s.class(pca.results$li, factor(names)) # create plot showing classifications

The dudi.pca() function in ade4 is used for the PCA. Note that scale = TRUE results in PCA based on the correlation
matrix. The scannf=FALSE option suppresses the scree plot. The plots are shown in Fig. 23 and Fig. 24, page 40.
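Grouping eigenvectors by eye can be checked numerically: the angle between two loading vectors measures how closely the variables covary on the plotted axes. A hedged sketch, assuming the pca.results object above (ade4 stores the variable loadings in $c1); substitute the rows for the variables you want to compare.

# Angle (in degrees) between two variable loading vectors; small angles suggest a group
angle <- function(a, b) acos(sum(a * b) / sqrt(sum(a^2) * sum(b^2))) * 180 / pi
load2d <- pca.results$c1[, 1:2]                 # loadings on the first two axes
angle(unlist(load2d[1, ]), unlist(load2d[2, ])) # compare the first two variables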

Conclusions
Ragland et al. (1997) concluded that PCA was a useful tool: "... PCA is an insightful tool in petrology and geochemistry and is recommended as a first-step, exploratory technique for a dataset of chemical analyses. It allows the researcher to determine which of the original variables may be the most useful in characterizing the dataset." Further, it is capable of identifying possible relationships and hybrids, and thus can be used as an aid when generating hypotheses about relationships between and origins of rocks.


Fig. 23: The PCA scatter plot for the Martinsville rocks, produced by the function scatter() in the ade4 R package.
The numbers refer to each rock sample and the eigenvectors are the chemical components.
Fig. 24: The PCA plot for the Martinsville rocks, showing the different classifications. The ordination produced
a clear separation of the different rock types.

Alternative approaches
Ragland et al. (1997) used the correlation matrix for the PCA, which had the effect of giving equal weighting to every element. Fig. 25 shows the ordination of the sites using the variance-covariance matrix calculated with all variables log-transformed. It is interesting to note that essentially the same clusters are formed, but the eigenvectors show a number of tight pairs: {Zr, Na}, {K, Rb}, {Mn, Fe} and {Ca, Mg}. This plot also shows that it is possible to place the samples and the variable eigenvectors on the same plot. Some authors, including Ragland et al. (1997), plot only the apex of the eigenvectors. For clarity, this should be avoided; the relationship between eigenvectors, given by their angular difference, is more easily studied if they are plotted as vectors (arrows).

In CAP, to show only the largest eigenvectors, use the slider below the ordination plot.

Fig. 25: Ordination biplot of the rocks within the Martinsville igneous complex. The plot also shows the largest 9 eigenvectors for the chemical variables used to produce the ordination of the rocks. The PCA was undertaken using the variance-covariance matrix using CAP.


Example: biology: Comparing the songs of cicadas.


Demonstration data set: cicada.csv
Reference: Ohya, E., 2004. Identification of Tibicen cicada species by a Principal Components Analysis of their songs. Anais da Academia Brasileira de Ciências 76: 441-444.
Ohya (2004) used recordings of cicadas to demonstrate that the songs of different species could be differentiated using
PCA. This example shows the use of PCA to compare the features of time series. It also shows that standard measurements
for known types, in this case species, can be included in the data set so that samples can be assigned to group by their
proximity to these standards within the ordination space.

Preliminary data examination and transformation


The data set comprised observations of peak frequency (Hz), mean frequency (Hz) and number of pulses per 0.2 second.
Recordings were made on 12 individuals of unknown species, and 3 standard sets for the species Tibicen japonicus, T.
flammatus and T. bihamatus. No transformations were undertaken.

The use of the correlation matrix


PCA was undertaken on the correlation matrix; this was essential if the ordination was not to be dominated by the frequency measurements, which were between 5000 and 7000 Hz, while the pulse rate ranged between 8 and 20. It would also have been possible to have rescaled frequency in kHz and used the variance-covariance matrix.

Results
As shown in Table 7, the first 2 axes explained about 98% of the total variability in the data set, demonstrating that a 2D graph can show the relationship between the cicada species. The sum of all the eigenvalues, which is a measure of the total variability, is 3, which is simply the number of variables used in the analysis, because the correlation matrix was used with 3 variables. Only the first dimension has an eigenvalue > 1, but the second is required to distinguish between T. japonicus and T. flammatus.


Table 7: The eigenvalues for the first 2 axes of the PCA undertaken on 3 cicada song variables. As PCA was undertaken on the correlation matrix, the sum of the eigenvalues, which is the total inertia in the data set, equals 3.

Axis  Eigenvalue  Cumulative percentage of the total variance
1     2.69        89.62
2     0.26        98.15

Fig. 26, page 44 is a biplot of the eigenvectors and the sample scores. They can be effectively placed on the same graph
because of the small number of variables and samples included in this study. The 3 samples for known species have been
marked as large circles and labelled with the species name.
The 2D plot shows a clear separation between the species, and the clustering of the unknown samples around the T.
bihamatus standard indicates that all samples except for S1 can be assigned to this species. S1 has a song very similar to that
of T. japonicus.

Conclusions
As the author stated, "The cluster analysis of the PCA scores clearly separated T. japonicus, T. flammatus and T. bihamatus from each other and allocated the samples as expected." He did include a warning: "However, one should collect real specimens with each sound recording in order to check the result of this method."

Alternative approaches
For data of this type, there are no other methods that work as well as a PCA applied to the correlation matrix.


Fig. 26: The relationship between the songs of 3 species of cicada. The PCA ordination is a biplot of the
eigenvectors and the sample scores using the correlation matrix.


Example: biology: Analysis of community change with climate warming.


Demonstration data set: Hinkley fish.csv or Hinkley fish R.csv
Reference: Henderson, P. A., 2007. Discrete and continuous change in the fish community of the Bristol Channel in response to climate change. J. Mar. Biol. Ass. U.K. 87: 589-598.
Required R package: vegan
The actual data set used by Henderson (2007) is not supplied with the paper. The data set we analyse here is not identical,
but close to the original. The author reports on changes in the fish community in Bridgwater Bay, UK over a 25-year
period. The data set comprises the annual catch derived from monthly sampling for the 16 species of fish that were caught
every year between 1981 and 2004. Data for 1986 were excluded, as not all months were sampled.

Preliminary data examination and transformation


The fish species vary greatly in abundance, and the abundance of individual fish varies greatly through time. There is a clear
need to transform these data. As all species are present in all years, a log transformation is appropriate. Henderson (2007)
used a square root transformation, which is the most useful transformation for ecological data when zero observations
occur. While it is possible to cope with zeros by using a log + 1 transformation, this should be avoided, as it can give
misleading results. For this example, a log10 transformation has been applied to all the fish species variables.

The use of the variance-covariance matrix


PCA was undertaken by Henderson (2007) on the variance-covariance matrix derived from square root-transformed data.
The use of the variance-covariance matrix instead of the correlation matrix reflects the desire to use the full potential of
the quantitative data in the analysis. The inevitable result will be to place more emphasis on the most abundant species
within the data set.

Results
As shown in Table 8, page 46, using a log10 transformation, the first 2 axes explained about 46% of the total variability


in the data set. For an ecological data set this is quite a large proportion of the total variability, and community analyses
are often presented in which the first two axes explain less than 30%. Henderson (2007) reported that, when using a
square root transformation, the first 2 axes explained about 62% of the variability. If a PCA is undertaken again on the
square root-transformed data used here, 67.5% of the variance is explained by the first two axes. However, the ordination
produced does not give such a clear distinction between the communities in the early 1980s and after 1987. The point to
note is that when using the variance-covariance matrix, the best transformation is not necessarily the one which explains
the largest amount of the variance in the first two axes.
Table 8: The eigenvalues for the first 3 axes of the PCA undertaken on the fish community data collected at Hinkley Point.

Axis  Eigenvalue  Cumulative percentage of the total variance
1     14.05       27.53
2     9.4         45.95
3     6.735       59.15

The plot of the fish species eigenvectors shows that Principal axis 1 arranges the years so that those with the highest
abundance of Sprattus lie towards the left (Fig. 27). Axis 2 is a measure of the abundance of most of the other fish species,
with warm-water species, Solea, Dicentrarchus and Trisopterus increasing in the positive direction, and cold-water species,
Liparis and Limanda, in the negative direction. Axis 2 can therefore be thought of as a temperature axis.
An examination of the 2D ordination of the years clearly shows that the fish community in the early 1980s was different
from that in later years (Fig. 28).


Fig. 27: (left) The 6 largest eigenvectors for the fish species variables calculated by PCA for the Hinkley Point data. PCA was undertaken on the variance-covariance matrix derived from square root-transformed data.
Fig. 28: (right) The ordination of the fish community of the Severn estuary, showing the change between the 1980s and later years. As above, PCA was undertaken on the variance-covariance matrix derived from square root-transformed data.

In CAP, you can select the number of eigenvectors to present on the chart using the slider at the bottom of the PCA plot tab. It is often worth only showing the major eigenvectors, which are making the greatest contribution to the principal axes.

To produce the large plotting symbols and labels for years in the 1980s, we activated Chart edit by clicking on the Tools button on the left above the graph. The pre- and post-1986 years had already been defined as groups. It was therefore possible to double-click on the pre-1986 series in the Series list. This opened the Series dialog, where the symbol and the label text could be changed.


Using this ordination result, Henderson (2007) concluded that there was an abrupt change in the fish community around
1986, which was related to a change in climatic conditions and a switch in the North Atlantic Oscillation. Note that the
PCA analysis can tell us nothing about the causality behind the change in fish community. The explanation came from
an investigation of changes in the physical environment between the early 1980s and after 1987. Further, biological
knowledge about which fish favoured warmer conditions allowed axis 2 to be interpreted as a water temperature axis,
indicating a climatic effect.

Code and results using R

This example uses the rda() function in the vegan package to undertake the PCA. The fish data include no zero values, so a log transformation is applied. If the data were not log-transformed, the analysis would have been dominated by the most abundant species. Note that scale = FALSE results in PCA based on the variance-covariance matrix. To identify how many of the PCs are important, the summary() function is used together with barplot() to show the proportion of the total variability explained by each principal axis (Fig. 29). The function biplot() plots the scores for each year and the eigenvectors (Fig. 30). You can choose different scalings. With scaling = 1 the eigenvectors are scaled to unit length, and the distances between objects (years in this case) scale with their Euclidean distance apart in n-dimensional space. With scaling = 2, each eigenvector is scaled to the square root of its eigenvalue, and the angles between descriptors (species) reflect their correlation.

library(vegan) # PCA example using rda in the vegan package
# Open data set
fish.csv <- read.table("D:\\Demo Data\\Hinkley fish R.csv", header = TRUE, sep = ",")
head(fish.csv, 4) # Print first lines to check data loaded correctly and find names of variables
log.fish <- log(fish.csv[, 2:17]) # log transform, as there are no 0 values
head(log.fish, 4) # Some output to check values have been transformed as expected
# Run analysis
fish.pca <- rda(log.fish, scale = FALSE) # do PCA on the variance-covariance matrix
# Investigate the importance of each principal component
summary(fish.pca) # This prints out the main results
ev <- fish.pca$CA$eig
barplot(t(cbind(100*ev/sum(ev))), main = "% variance") # Plot % variance explained by each axis
biplot(fish.pca, scaling = 1, main = "Hinkley fish PCA - scaling = 1") # biplot


A weakness of the rda() function is the difficulty of producing good graphical output. As is shown below in Fig. 30, the
objects, in this case years, are always called sites, so in the biplot sit1 is 1981, sit2 is 1982, and so on.

Fig. 29: (left) Scree plot output generated by the R function barplot() using the Hinkley Point fish data. The plot
shows that over 25% of the total variance is included in PC1 and that PCs 1, 2 and 3 can together summarise the
relationships within the data set.
Fig. 30: (right) Output generated by the R function rda() and plotted using the biplot() function. It shows that
the fish community in the years 1981 to 1985 (labelled Sit1 - Sit5) forms a separate group reflecting the different
colder-water community present in those years.


Conclusions

Essentially similar conclusions would be reached if the PCA were undertaken on the correlation matrix, or with square root-transformed data and the variance-covariance matrix. However, a log transformation and PCA applied to the variance-covariance matrix gave the clearest results. The log was better than a square root transformation because it reduced the dominant role of Sprattus. It might be argued that the dominance of Sprattus is a real feature and therefore should not be reduced by transformation. However, this dominance is exceptionally great because abundance has been measured in terms of the number of individuals caught. The dominance of Sprattus would have been much lower if abundance had been measured as the weight caught each year, because Sprattus is one of the smaller fish caught. As our units of measurement change the pattern of dominance, it seems fair to use a transformation to increase the weighting given to the less numerically-abundant species.

In CAP, the rare species can be removed from the data set on the Working Data tab. Select Handling zeros and choose Remove sparse rows. The proportion of zero values in the data set is given in the Summary tab.

Alternative approaches
Detrended Correspondence Analysis and MultiDimensional Scaling (see page 52 and page 73, respectively)
also show the distinct change in fish community. PCA is to be favoured in this case because the eigenvectors give insight
into which species are structuring the community and show the role of climate.

Concluding remarks on the use of PCA


PCA can be the method of choice to represent the relationship between quantitative samples in a 2- or 3-dimensional space. Remember, PCA can only be effective if some of your variables are correlated. It is possible to produce a biplot which shows the relationships between both the samples and their variables (e.g. Fig. 26, page 44). However, when there are many samples or variables, this can be too confusing, and it is best to plot the ordination of the samples and the eigenvectors of the variables separately (e.g. Fig. 27 and Fig. 28, page 47).

PCA is most effective when the data matrix does not have a very high proportion of zero values. Data arrays with more than 80% zero values are common in biology, where the vast majority of species are only recorded in a small number of samples. In such cases you should remove all the rare species from the analysis; such species make a negligible contribution to the ordination.
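In R, one simple recipe is to drop species that occur in fewer than some arbitrary fraction of the samples before running the PCA; comm is a placeholder samples-by-species data frame and the 5% threshold is only a suggestion.

# Remove sparse (rare) species before ordination; 'comm' is a placeholder data frame
present  <- colSums(comm > 0) / nrow(comm)  # proportion of samples in which each species occurs
comm.red <- comm[, present >= 0.05]         # keep species present in at least 5% of samples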
If you have quantitative data, PCA should generally be undertaken using the variance-covariance matrix. However, if your variables differ greatly in magnitude or variability, then there is a risk that the ordination will simply reflect the abundance of the most variable, highest-magnitude variables. To avoid this, either rescale your variables to similar magnitudes, or undertake a log or square root transformation. Use a square root transformation if your data set includes zero values. Avoid 4th root transformations and other exotica, as you will not generate easily-interpretable results.

Remember, if a PCA is undertaken on the correlation matrix, all your variables will acquire equal importance. This may be quite appropriate when using physical variables; a low mercury concentration may be just as important as a calcium concentration a thousand-fold larger. But do you consider that the rarer species are as important as the common forms in defining a community? Only the first 3 or so principal axes are likely to hold interpretable variability, so do not waste time studying the plot of the 5th and 10th principal axes!

In CAP, a data transformation is easily undertaken. First click on the Working Data tab. Second, select the Transform radio button and finally select Log10 etc. from the list of possible transformations. Remember your data set is not permanently altered, so you can always try a range of different transformations.


Chapter 4: Correspondence Analysis

This method, also called Reciprocal Averaging, produces an ordination of both the samples and the variables, and is particularly effective when samples are taken along a gradient.

Uses
Correspondence Analysis was developed by statisticians in the 1930s, but only came into use in ecology in the 1960s. It is particularly favoured by plant ecologists and is used within the TWINSPAN method. Use Correspondence Analysis to show the relationship between objects and variables within a single plot. The method can be used with categorical data, so is particularly appropriate when it is impossible or too costly to collect quantitative data. It produces a particularly clear and useful ordination when the objects have been sampled along a single dominant gradient; unlike Principal Component Analysis and MultiDimensional Scaling, it does not produce a powerful horseshoe effect (see Chapter 3, page 22) in which samples from opposite ends of the gradient are placed close together. However, as is discussed below, the ordination can form an arch.

Summary of the method


Correspondence Analysis was originally developed and is still used by statisticians for the graphical analysis of contingency
tables. These are used to record and analyse the relationships between two or more categorical variables. For example,
if a study were made of the frequency of hair colour and left and right-handedness, the data could be presented in a


contingency table, as shown in Table 9.


Table 9: An example of a typical two-way contingency table. This example gives the frequencies of people who were left- and right-handed and fair- and dark-haired.

              Fair-haired  Dark-haired  Total
Left-handed   12           20           32
Right-handed  18           30           48
Total         30           50           80
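Testing independence in such a table takes one line in R; the sketch below reproduces Table 9 and applies the standard test.

# Chi-squared test of independence for the contingency table in Table 9
hair <- matrix(c(12, 20,
                 18, 30), nrow = 2, byrow = TRUE,
               dimnames = list(c("Left-handed", "Right-handed"),
                               c("Fair-haired", "Dark-haired")))
chisq.test(hair)  # tests whether handedness and hair colour are independent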

Chi-squared tests are frequently applied to contingency table data of this type to test if the two variables, handedness and
hair colour, are independently distributed. The data sets we are concerned with in multivariate analysis of field data are
also often arranged as a two-dimensional table with the elements comprising frequency data. For example, Table 10 shows
the counts for pottery shards at different localities.
Table 10: Pottery shard counts from different localities in the Mississippi valley. From Pierce (1998)1.

               Caney   Claiborne  Copes  Hearns  Jaketown  Linsley  Poverty  Shoe   Teoc   Terral
               Mounds                                               Point    Bayou  Creek  Lewis
Biconical      29      3259       67     57      485       58       3122     23     228    104
Cylindrical    5       1230       78     0       1411      7        4718     3      12     0
Ellipsoidal    1       3476       130    5       3         33       5103     1      2      108
Spheroidal     7       824        2      6       29        4        355      1      16     5
GroovedSphere  1       2014       17     0       410       8        3434     0      2      93
Biscuit        58      22         11     8       0         4        138      12     2      12
Amorphous      55      476        56     90      0         7        866      116    4      65
Other          1       143        1      12      0         6        187      1      1      4

1 Pierce, Christopher, 1998. Theory, measurement, and explanation: variable shapes in Poverty Point Objects. In Unit Issues in Archaeology, edited by Ann F. Ramenofsky and Anastasia Steffen, pp. 163-190. Utah University Press, Salt Lake City.


In biology, this table would usually comprise samples as the columns and species as the rows. Correspondence Analysis
measures the degree of interdependence between the columns (sites or samples) and the rows (species, types of pottery,
etc.) using the standard Chi-squared statistic, which compares the observed frequencies with those expected if the rows
and columns were independent. Generally, we hope that our data do not show independence, as we expect certain species
or types of pottery to be associated with particular samples or sites. Looked at from the sample perspective, we expect
samples or sites to differ in their species or pottery composition.
Generally, linear algebra is used to calculate the ordination. Using the Chi-squared statistic to depict the degree to which
any associations in the data matrix depart from independence, the variance in these Chi-squared “distances” is evaluated
by eigenanalysis. The scores for the samples or sites are derived from a metric of species associations, and the more these
associations depart from independence, the further separated the final scores will be. Similarly, the scores for the sites
are used to find final scores for the row variables. The final result is an ordination in which samples most similar in their
species assemblage are closest together. Similarly, species which have a similar distribution across the samples will be
closest together.
CA and PCA ordinations are both usually derived by eigenanalysis and are related methods, which differ in their measure
of the distance between samples. In PCA, the matrix of species abundances is transformed into a matrix of covariances or
correlations, each abundance value being replaced by a measure of correlation or covariance with other species. CA uses
Chi-squared as a measure of association, rather than the correlation coefficient or covariance.
CA is also called Reciprocal Averaging (RA). RA describes an alternative method for calculating the ordination, developed
by M. O. Hill in 1973, who, at the time, did not realise that his solution was in fact another route to Correspondence
Analysis. This method involves the repeated averaging of column (sample) scores and row (species) scores until the
correspondence between row scores and column scores is maximised and convergence is reached. This approach can be
more easily understood and applied by hand to small data sets; a short R sketch of the iteration is given below.
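
To make the iteration concrete, here is a minimal R sketch of reciprocal averaging for the first CA axis. This is a didactic illustration rather than production code; x is assumed to be a samples-by-species abundance matrix with no empty rows or columns.

# Reciprocal averaging for the first CA axis: alternate weighted averaging
# of sample and species scores until the scores stabilise
reciprocal.averaging <- function(x, iterations = 100) {
  x <- as.matrix(x)
  spp <- seq_len(ncol(x)) # arbitrary initial species scores
  for (i in seq_len(iterations)) {
    site <- as.vector(x %*% spp) / rowSums(x) # sample score = weighted average of its species scores
    spp <- as.vector(t(x) %*% site) / colSums(x) # species score = weighted average of its sample scores
    spp <- spp - weighted.mean(spp, colSums(x)) # remove the trivial constant solution
    spp <- spp / sd(spp) # rescale so the scores do not shrink away
  }
  list(sample.scores = site, species.scores = spp)
}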
The 2nd and higher axes of a CA ordination, like those of PCA, can be distorted by non-linear relationships between the
row variables. This is termed the arch effect, and is most likely to occur in ecological data sets with high β diversity and
a strong gradation of species (β diversity is that between locations). As well as the arch, the axis extremes of CA can be
compressed. In other words, the spacing of samples along an axis may not reflect true differences in species composition.


The arch effect and Detrended Correspondence Analysis


Detrended Correspondence Analysis (DCA) eliminates the arch effect and compression at the extremes of an axis (Hill
and Gauch, 1982). To detrend the second axis by segments, the first axis is divided up into segments, and the samples
within each segment are centred to have a zero mean for the second axis (see illustrations in Gauch 1982). The procedure
is repeated for different ‘starting points’ of the segments. Although results in some cases are sensitive to the number of
segments (Jackson and Somers 1991), the default of 26 segments is usually satisfactory. The detrending of higher axes
proceeds by a similar process. The compression of the ends of the gradients is corrected by nonlinear rescaling. Rescaling
shifts sample scores along each axis such that the average width is equal to 1.
The merits of DCA have been much debated. Often, the presence of an arch can just be accepted as an artefact and
considered no further. However, DCA may be helpful to present the results in the clearest and simplest way. I suggest that
CA and DCA are both tried, and CA used if the ordinations are similar or both equally easily understood.

Common user options


With most data and computer programs for CA and DCA you can confidently use the default options. In CA, a common
option is to down-weight rare species. Select this option if the influence of rare species is to be reduced. If selected,
species whose frequency is less than one-fifth of that of the commonest species are down-weighted in proportion
to their frequency. There are further options available within DCA, or DECORANA as it is commonly known:
Number of axis re-scalings - Input an integer number; the default of 4 should generally be used. Values between zero
and 20 are generally permitted. In DECORANA, the axis is scaled to mean squared deviation of species scores per sample
of 1. The program attempts to find the longest axes with the correct mean square deviation all the way along the axes. This
option defines the number of attempts at squeezing the axes in and out to find the optimum solution.
Rescaling threshold - Axes shorter than this value will not be re-scaled. The default is zero; in other words, axes of all
lengths will be re-scaled. Generally, the default should be used.
Number of segments - This is the number of segments the axis is divided into for re-scaling. Input an integer number,
the default of 26 should generally be used.
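
If you are working in R rather than CAP, the same options are exposed by the decorana() function in the vegan package. Below is a minimal sketch; the arguments iweigh (down-weight rare species), iresc (number of re-scalings), short (rescaling threshold) and mk (number of segments) correspond to the options described above, and the file path is illustrative only:

# DCA in R with the DECORANA options discussed above
library(vegan)
x <- read.csv("D:\\Demo Data\\Hinkley fish R.csv", row.names = 1) # any samples-by-species table
dca <- decorana(x,
                iweigh = 1, # 1 = down-weight rare species (default 0)
                iresc = 4,  # number of axis re-scalings (default 4)
                short = 0,  # axes shorter than this are not re-scaled (default 0)
                mk = 26)    # number of segments (default 26)
dca
plot(dca)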


Example: marketing: Soft drink consumption.


Demonstration data set: Beverages.csv or beverages R2.csv
Reference: Hoffman, D. L. & Franke, G. R., 1986, Correspondence Analysis: Graphical Representation of Categorical
Data in Marketing Research. Journal of Marketing Research, 23, 213-227.
Required R package: vegan
This is an example of the use of Correspondence Analysis with categorical data. The data set comprises information
about the consumption of soft drinks. A group of male and female MBA students from Columbia University were asked
to indicate, for a variety of popular soft drinks, the frequency with which they purchased and consumed the soft drinks in
a 1-month period. The results were coded 1 if the individual purchased and consumed at least one every other week, and
0 for any lower frequency.

Preliminary data examination and transformation


An interesting feature of the approach taken by Hoffman and Franke (1986) is that the number of variables is doubled
by including the opposite for each drink in the data set. Thus, there is a variable for those who drank Coke, labelled
coke+, and one for those who did not drink Coke, labelled coke-. This doubling is used because researchers typically record only the
positive endorsements and infer the negative by subtraction. Therefore, if an individual scored 1 for coke+, meaning they
had bought a Coke at least every other week, they scored 0 for coke-.
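
In R the doubling is easily reproduced; a minimal sketch, assuming x is the 0/1 matrix of positive endorsements (the file name is illustrative):

# Double a 0/1 indicator matrix by appending the complement of each variable
x <- read.csv("beverages positive.csv", row.names = 1) # hypothetical file of 0/1 endorsements
doubled <- cbind(x, 1 - x)
colnames(doubled) <- c(paste0(colnames(x), "+"), paste0(colnames(x), "-"))
head(doubled)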

Results
Fig. 31 shows a joint plot of the individuals (green squares) and the drinks (labelled red triangles). Note that diet and
non-diet Coke and Pepsi are separated at opposite ends of axis 1. Axis 2 distinguishes between Sprite and 7 Up drinkers
(Sprite+ and 7 up+) and non-drinkers (Sprite- and 7 up-). The lack of individuals in the upper right-hand region of the
ordination indicates that no one drinks only diet non-colas.


Fig. 31: The choice of carbonated drink in the USA: an example biplot of Correspondence Analysis applied to
categorical data. The triangles show the variables, and the green squares the individuals.

Conclusions
It is possible to classify individuals by their consumption of diet and non-diet, and cola and non-cola, soft drinks.

Alternative approaches
The doubling of the variables by including their opposite does not greatly add to the analysis, and a simpler plot can be
obtained with only the positive variables included (see R code below). In this case, similar results are produced by
Correspondence Analysis and Detrended Correspondence Analysis (DECORANA).


Code and results using the vegan package in R on the beverages data
For this example, the cca() function in the vegan package is used to undertake the Correspondence Analysis. The
summary() function defaults to the scaling = 2 option in which the columns are at the centroids of the rows.
This scaling is most appropriate if our primary interest is the variables rather than the individuals. As an example of the
alternative approach, the data set beverages R2.csv comprises only the positive variables (see Alternative approaches above).
As before, diet and non-diet Coke and Pepsi are separated at opposite ends of axis 1. Axis 2 distinguishes between Sprite
and 7 Up drinkers and non-drinkers (Fig. 32).

library(vegan) # CA example using cca in the vegan package
# Open data set
beverages <- read.table("D:\\Demo Data\\beverages R2.csv", header = TRUE, sep = ",")
print(beverages) # Print data to check it is OK
# CA with graphical output
mca <- cca(beverages) # run a CA
mca # Print results
plot(mca) # Plot results
# Below is an approach to a more attractive plot
plot(mca, type = "n")
points(mca, col = "blue", pch = 16)
text(mca, col = "red", dis = "sp")

Fig. 32: The choice of carbonated drink in the USA: an example of Correspondence Analysis applied to
categorical data using the cca() function in the R vegan package. The biplot was produced by Correspondence
Analysis. The blue dots are the individuals.


Example: archaeology: The temporal relationship between the pots from trenches on the Greek island of Melos.
Demonstration data set: Melos.csv or Melos R grouped.csv
Reference: Berg, I. & Blieden, S., 2000. The Pots of Phylakopi: Applying Statistical Techniques to Archaeology. Chance,
13, 8-15.
Required R packages: ca, calibrate
This is an example of the use of Correspondence Analysis with semi-quantitative data.
The temporal sequence of the ancient Greek pottery of Melos, an island in the Greek Cyclades, was analysed by Berg
and Blieden (2000). The purpose of the paper was to demonstrate the use of Correspondence Analysis in addition to the
standard archaeological technique of seriation. The authors point out that seriation places all layers from all trenches into
a single linear temporal sequence. It is difficult to place the pottery from layers in different trenches into a single temporal
sequence, and CA may help to clarify the relationship between the trenches.
The data set comprises pottery derived from different layers from a number of trenches. For example, in trench PiS, finds
are divided into layers called PiS 9, PiS 18, and so on. The general idea is that each numbered layer contains material that
was found close together. Excavations remove material from the top downward; thus, lower numbers are associated with
higher layers, while higher numbers refer to deeper (older) layers. To avoid analysing layers with small numbers of finds,
they merged successive layers to get combined layers containing >100 pieces. This is what has been done, for instance,
with the successive layers KKd 19, KKd 21, KKd 23, and KKd 25. They were merged into a single layer referred to as
KKd 19–25.
Data for 8 trenches were analysed and coded as KKd, PiA, PiC, PiDl, PiE, PiS, PK and Pla. It is the layers of these
trenches which comprise the samples or objects.
Each piece of pottery found was classified into the following mutually-exclusive categories: burnished pots (burn),
monochrome pots (mono), slipped pots (slip), pots painted in curvilinear decoration (curv), pots painted in geometric
decoration (geo), pots painted in naturalistic decoration (nature), pots made of “Cycladic White” clay (CW), and pots
made of “conical cup” clay (cc). These 8 pottery types were the variables.


The data set analysed here was taken from the seriation sequence diagram in Berg and Blieden (2000). The width of the
lines gave some measure of relative abundance in each layer, and this was converted into a simple abundance scale from 0
to 8. Because of the difficulty of reading data from this figure, the data will not be the same as the original; however, the
analysis gives very similar results.

Results
Fig. 33 shows a joint plot of the pottery types (green triangles) and the trench layers (dark green squares). The squares
make up a roughly horseshoe-shaped curve. The authors state that “By moving from top left to top right along this ‘horseshoe’ we
seem to go from older layers to younger layers. The layers that are marked near the top left of the Figure (such as those of PiDI/PiE) appear
near the bottom of the seriation diagram, and we believe these to be the most ancient. PiA 69 and PiA 67 and various layers from KKd are
on the most extreme right of the Figure and are listed as the youngest layers by seriation.”
They also consider the structure of an individual trench. Pla samples are clumped at the base of the horseshoe, and on
either side of this cluster there are clusters of PiS samples. The authors suggest that, “deposits of trench PiS fall into distinct
phases” ... “We believe that this trench is located in a part of Phylakopi that was occupied by the local shrine and this fits in well with the
statistical pattern we have just observed. Once a shrine is built, we would expect the area around it to remain unchanged for a long time.”
This stability is reflected in the discontinuity in PiS layers.
Fig. 33: Correspondence Analysis joint plot of the pottery types (green squares) and the trench layers (smaller
symbols) for the Melos study.


Code and results using the ca package in R on the Melos data


For this example, the ca package is used to undertake the Correspondence Analysis. The code below is adapted from
that developed by Peeples (2011) for archaeological seriation. The extended output, which corresponds with that found
previously, but with a different orientation, is shown in Fig. 34, page 62 & Fig. 35, page 63.
# Modified from: Peeples, Matthew A. (2011) RScript for Seriation Using Correspondence Analysis.
# Available: http://www.mattpeeples.net/ca.html. (October 21, 2015)
library(ca)
library(calibrate)
# The dataset has variable names in row 1; column 1 is sample identifiers, column 2 group allocation
x <- read.csv(file="D:\\Demo Data\\Melos R grouped.csv", header=T, row.names=1)
x <- na.omit(x)
Group <- as.factor(x$Group) # as a factor, the group codes can index plotting colours
x$Group <- NULL
p.lab <- as.matrix(Group)
x2 <- as.matrix(x)
y <- as.matrix(colnames(x))
prop.table(as.matrix(x), 1) # row proportions
prop.table(as.matrix(x), 2) # column proportions
ca(x) # undertakes correspondence analysis
summary(ca(x)) # summary output
plot(ca(x)) # simple plot - see below for grouped output

# The following code produces better output with groups colour coded and
# output summary directed to a text file and graphics to a pdf file
r.c <- ca(x)$rowcoord
c.c <- ca(x)$colcoord
xrange <- range(r.c[,1]*1.5, c.c[,1]*1.5)
yrange <- range(r.c[,2]*1.5, c.c[,2]*1.5)
plot(xrange, yrange, type="n", xlab="Dimension 1", ylab="Dimension 2", main="Correspondence Plot")
points(r.c[,1], r.c[,2], pch=p.lab, col=Group, cex=0.75)
points(c.c[,1], c.c[,2], pch=4)


textxy(c.c[,1], c.c[,2], labs=y, cex=0.75)

out <- capture.output(summary(ca(x)))
cat(out, file="CA_out.txt", sep="\n", append=F) # summary text saved to file in default directory
pdf(file="ca.pdf") # pdf file saved to default directory - change as desired
plot(ca(x))
plot(xrange, yrange, type="n", xlab="Dimension 1", ylab="Dimension 2", main="Correspondence Plot")
points(r.c[,1], r.c[,2], pch=p.lab, col=Group, cex=0.75)
points(c.c[,1], c.c[,2], pch=4)
textxy(c.c[,1], c.c[,2], labs=y, cex=0.75)
dev.off()

Fig. 34: Correspondence Analysis undertaken on the Melos data using the ca R package. Biplot of the first 2
axes. The blue dots are the individual samples and the red triangles the variables.


Fig. 35: Correspondence Analysis undertaken on the Melos data using the ca R package. Biplot of the first 2
axes. The individual samples are coded by the Group variable in the dataset. A, PiA; B, KKd; C, PiS; D, PK; E,
PiC; F, PLa; G, PiD. The black crosses mark the position of the variables.
Conclusions
CA successfully ordered the samples along a temporal gradient and clearly showed the discontinuity in the PiS trench.


Alternative approaches
The use of the word horseshoe by the authors in their description of the ordination plot immediately raises the question as
to whether a clearer ordination of the temporal sequence could have been presented using Detrended Correspondence
Analysis (DECORANA). As shown below in Fig. 36, there is actually little improvement.

Fig. 36: Detrended Correspondence Analysis joint plot or biplot of the pottery types (large green triangles) and
the trench layers (smaller symbols) for the Melos study.


Example: biology: Analysis of community change with climate warming.


Demonstration data set: Hinkley fish.csv or Hinkley fish R.csv
Reference: Henderson, P. A., 2007. Discrete and continuous change in the fish community of the Bristol Channel in
response to climate change. J. Mar. Biol. Ass. U.K., 87, 589–598.
Required R packages: ca, vegan
In this example we apply CA to fully quantitative data of the annual abundance of marine fish in Bridgwater Bay, Somerset,
UK. The actual data set used by Henderson (2007) is not supplied with the paper. The data set we analyse here is not
identical, but close to the original. The author reports on changes in the fish community in Bridgwater Bay, UK over a 25-
year period. The data set comprises the annual catch derived from monthly sampling for the 16 species of fish that were
caught every year between 1981 and 2004. Data for 1986 were excluded as not all months were sampled.
Like the author, we previously analysed these data using PCA. This showed that the early 1980s had a fish community that
differed substantially from that present in later years. We now examine the results using CA.

Preliminary data examination and transformation


Because of the great differences in abundance between the species, the abundance data were square root transformed. A
log transformation could not be used because the data set holds zero values. Without this transformation, a similar, but less
clear, separation of the early 1980s was produced. To avoid the possibility of the arch effect, a Detrended Correspondence
Analysis was undertaken. In fact, an arch is not in evidence with these data.

Results
As with the PCA results, there is a clear difference in the fish community between the early 1980s and later years. Because
of the number of variables (fish species) included in the analysis it was decided to plot the samples (years) and variables
(species) on different graphs (Fig. 37, page 66 and Fig. 38, page 67). The 1980s samples were dominated by colder-
water species, Limanda limanda, Liparis liparis, Trisopterus minutus and Anguilla anguilla (Fig. 38). The author concluded that
climate change was having an impact on the fish community.


Conclusions
PCA and CA lead to the same conclusion. The fish community was different in the early 1980s from later years. It is
normal for the different methods to give consistent results, suggesting that when there are clear differences and similarities
between samples the choice of method is not as critical as is often implied in critiques of the ordination methods. If this is
not the case, consider carefully if you can make any conclusions from your data. You may just be seeing random patterns.
Fig. 37: DECORANA plot of the years for the Hinkley fish data, showing the clear difference in species
composition pre- and post-1986.


Fig. 38: DECORANA plot of the fish variables for the Hinkley fish data. By comparison with Fig. 37 it is possible
to identify the species change pre- and post-1986. The fish species Limanda, Liparis, Anguilla and Trisopterus M are
all associated with the pre-1986 period.

In CAP, the default is a colour plot. To change to a black and white image suitable for many publications, click on the
B&W/Colour button above the plot.

Once your graph is in black and white you may wish to change the plotting symbols. Click the Edit button on the left
above the plot. Now choose the Series tab and select the series you wish to change from the drop-down list. Clicking
on the Point tab allows you to change the symbol.


Code and results using the ca package in R on the Hinkley data


For this example, the ca package is used to undertake the correspondence analysis. The code below is the minimum to
undertake a simple correspondence analysis on the Hinkley fish R.csv data set. The plotted output is shown in Fig. 39.

library(ca)
# The data set has names of fish in row 1; column 1 is the year of sampling
x <- read.csv(file="D:\\Demo Data\\Hinkley fish R.csv", header=T, row.names=1)
ca(x) # undertakes correspondence analysis
summary(ca(x)) # summary output
plot(ca(x)) # simple plot

In Fig. 40 a square root transformation was used to give a clearer ordination; using R this is easily accomplished using the
function sqrt(y).

library(ca)
# The data has names of fish in row 1; column 1 is the year of sampling
y <- read.csv(file="D:\\Demo Data\\Hinkley fish R.csv", header=T, row.names=1)
x <- sqrt(y)
ca(x) # undertakes correspondence analysis
summary(ca(x)) # summary output
plot(ca(x)) # simple plot


Fig. 39: (left) Correspondence Analysis undertaken on the untransformed Hinkley fish data using the ca package
in R. Biplot of the first 2 axes. The blue dots are the individual years and the red triangles the fish. The early
1980s are clearly different and characterised by the presence of the fish species Anguilla anguilla, Liparis liparis,
Limanda limanda and Trisopterus minutus.
Fig. 40: (right) Correspondence Analysis undertaken on the square root-transformed Hinkley fish data using
the ca package in R. Biplot of the first 2 axes. The blue dots are the individual years and the red triangles the
fish. The early 1980s are clearly different and characterised by the presence of the fish species Anguilla anguilla,
Liparis liparis, Limanda limanda and Trisopterus minutus. Square root transforming gives a clear grouping of the 1980s
samples.


Code and results using the vegan package


The vegan package in R can be used to undertake Correspondence Analysis (CA) and Detrended Correspondence
Analysis (DECORANA). The listing below uses the decorana function (with ira=1 to perform a basic CA), and plots the results in various ways.
fish.csv <- read.table("D:\\Demo Data\\Hinkley fish R.csv", header = TRUE, sep = ",", row.names=1)
# Run analysis
fish.dca <- decorana(fish.csv, ira=1) # undertake CA: ira=0 performs DECORANA; ira=1 performs CA (the default is DECORANA)
print(fish.dca) # Print the output
plot(fish.dca, type = "t") # A text plot of the scores
fish.t2.dca.years <- scores(fish.dca, display=c("sites"), choices=c(1,2)) # extracts axis 1 & 2 scores for the samples (years)
plot(fish.t2.dca.years, pch=3, col="green")
text(fish.dca, dis="sites", col="blue")
plot(fish.dca, type="n") # Plots an empty graph
points(fish.dca, display = "sites", col="blue", pch=16) # Plots the sample scores as blue dots
text(fish.dca, col="red", dis="sp") # Plots the species names in red
text(fish.dca, dis="sites") # Labels the sample scores

A Procrustes method to compare ordinations


vegan also has the function procrustes which allows the comparison of two ordinations. Procrustes rotation rotates a
matrix to maximum similarity with a target matrix, minimizing the sum of squared differences. In the example below the
procrustes function is used to compare the CA and DECORANA ordinations of the Hinkley data set (Fig. 41).

library(vegan) # Comparing CA and DECORANA ordinations using the vegan package

# Open data set
fish.csv <- read.table("D:\\Demo Data\\Hinkley fish R.csv", header = TRUE, sep = ",", row.names=1)
# Comparing 2 ordinations
# Scores from CA and DECORANA compared, using vegan to undertake the analyses


fish.dca <- decorana(fish.csv, ira=0) # undertake DECORANA
fish.ca <- decorana(fish.csv, ira=1) # undertake CA
# Extract scores
t1.dca.scores <- fish.dca$rproj[ ,1:2]
t2.ca.scores <- fish.ca$rproj[ ,1:2]
t2.ca.scores
# Compare the two ordination results
comparison <- procrustes(t1.dca.scores, t2.ca.scores, scale = FALSE) # CA is rotated to DECORANA
plot(comparison)
# displays plot showing the first ordination (tips of the blue arrows) relative to the second ordination (open black circles)

Fig. 41: A comparison of two ordinations using procrustes in R. The relative positions of the sampled years under
CA and DECORANA are shown, with the CA position marked as a point and the head of the arrow showing the
DECORANA position. Note how the CA ordination is pulled in and the horseshoe effect reduced.


Concluding remarks on the use of CA


CA is the method of choice to represent the relationship between samples collected along an environmental gradient in a
2-dimensional space. It is possible to produce a biplot which shows the relationships between both the samples and their
variables (e.g. Fig. 36, page 64). However, when there are many samples or variables, this can be too confusing, and it
is best to plot the ordination of the samples and the variables separately (e.g. Fig. 37, page 66 and Fig. 38, page 67).
If a strong arch effect is present, DECORANA can be used. However, if the arch effect is not too strong, this is
unnecessary. There are good arguments to avoid Detrended Correspondence Analysis; just remember not to interpret an
arch as an interesting feature.


Chapter 5: MultiDimensional Scaling
This method produces an ordination of only the samples in an n-dimensional space so that the most similar samples are
placed closest together. The measure of similarity used is at the discretion of the researcher.

Uses
Use MultiDimensional Scaling to ordinate samples when you do not wish to be constrained by a particular measure of
similarity or distance between objects.

Summary of the method


MultiDimensional Scaling (MDS) refers to a group of methods that all produce a graphical representation of the similarity
between samples or objects in a small number of dimensions. This similarity is measured in terms of their row attributes,
which, in biology, is usually species abundance or presence data. In archaeology, it might be the frequency of various
artefacts. The method analyses the matrix of pair-wise similarities between all the samples in the data set. The actual
choice of similarity (or dissimilarity) measure can be varied. It is common to use both similarity measures requiring
presence/absence data, such as the Sørensen or Jaccard indices, and measures that use quantitative data, such as Bray-
Curtis or Euclidean distance. Many of these similarity measures are not metric (see the next section for an explanation of
this term) so the method is often referred to as Non-metric MultiDimensional Scaling (NMDS). The key advantage of this


method over PCA or CA is the wide choice of similarity/distance or association measure that can be used.
The key idea behind MDS is to find the best arrangement within a reduced space of 2 or 3 dimensions that places the
most similar samples closest together and the least similar further apart. Because of the multidimensional nature of the
data, it is usually not possible to find a perfect arrangement in a small number of dimensions. The degree to which the
ordination produces a good arrangement, in which the samples are positioned so that their distances apart reflect their
similarities, is measured by a parameter termed the stress. The greater the stress, the poorer is the representation of the
sample similarity in the reduced dimensional space.

In CAP, to test the adequacy of the final MDS ordination, you must run the procedure a number of times while
selecting Random as the initial starting position.
Once you have selected the number of dimensions in which to display your results, it is not possible to calculate the
best positions for the samples directly, so MDS uses an iterative procedure to find a good solution. It is possible for this iterative
procedure not to find the best, or even a good, solution. For this reason, it is normal to test a number of different
starting positions and compare the ordinations produced. The best ordination will be the one with the lowest stress. With
typical data sets, very similar ordinations will be produced irrespective of the starting positions. If this is not the case, try
increasing the maximum number of iterations in the set-up options.
It is important to consider if the relationships are well expressed with the reduced number of dimensions chosen. This
can be done by looking at the change in stress with increasing number of dimensions. Ideally, a 2- or 3-dimensional plot
should have a stress level little greater than the arrangement of samples in a 4-, 5- or 6-dimensional space.
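
In R this check is easy to script with the metaMDS function in the vegan package; a sketch, with the demonstration file path used elsewhere in this book standing in for your own data:

# Plot minimum stress against the number of ordination dimensions
library(vegan)
x <- read.csv("D:\\Demo Data\\Hinkley fish R.csv", row.names = 1)
stress.by.k <- sapply(1:6, function(k) metaMDS(x, k = k, trace = FALSE)$stress)
plot(1:6, stress.by.k, type = "b", xlab = "Number of dimensions", ylab = "Stress")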
There exists a multitude of variants of MDS with slightly different functions and algorithms to find the best solution.
In practice, the exact mathematical method is of little importance compared with the choice of similarity measure. It
is difficult to give simple advice on this topic. Obviously, if you have gone to great effort to collect quantitative data
it would be foolish to use a measure such as a Jaccard index, that only uses presence/absence information. In marine
benthic ecology, where animals vary greatly in abundance, but quantitative data are collected, a good compromise is the
Bray-Curtis similarity index. In pollen record studies, Gavin et al. (2003)1 have shown that the squared-chord distance
“outperforms most other metrics”.
1 Daniel G. Gavin, W. Wyatt Oswald, Eugene R. Wahl, and John W. Williams. (2003). A statistical approach to evaluating distance metrics and
analog assignments for pollen records. Quaternary Research 60, 356–367
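
The squared-chord distance is simple to compute; a minimal sketch (by convention it is applied to proportions, and the example vectors are hypothetical):

# Squared-chord distance between two composition vectors
squared.chord <- function(p, q) sum((sqrt(p) - sqrt(q))^2)
a <- c(0.50, 0.30, 0.20) # hypothetical pollen proportions, sample A
b <- c(0.45, 0.25, 0.30) # hypothetical pollen proportions, sample B
squared.chord(a, b)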


Metric and non-metric similarity or distance measures


Metric distance measures always conform to the triangle inequality. Consider 3 samples a, b and c, and the calculated
distances between these samples, D(a,b), D(a,c) and D(b,c). Any metric measure conforms to the inequality D(a,b) +
D(b,c) ≥ D(a,c). This rule holds for any triangle made from 3 points plotted in Euclidean space (Euclidean space is the
common-sense space we are used to thinking about; an ordinary piece of graph paper represents Euclidean space). For
a triangle in Euclidean space, the sum of two sides is always greater than or equal to the length of the third side. As
examples, the Jaccard similarity or distance measure is metric, as is Euclidean distance.
Non-metric, semi-metric or pseudo-metric measures of distance do not conform to the triangle inequality rule, and a plot
in Euclidean space cannot show the distances between the points. An example of a semi-metric measure is the Sørensen
index.
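
A quick numerical check of the triangle inequality can reassure you about a chosen measure; below is a sketch using vegdist from the vegan package, with a toy abundance matrix standing in for real data:

# Count triples of samples that violate the triangle inequality
library(vegan)
triangle.violations <- function(d, eps = 1e-12) {
  m <- as.matrix(d); n <- nrow(m); bad <- 0
  for (a in 1:(n - 2)) for (b in (a + 1):(n - 1)) for (cc in (b + 1):n) {
    if (m[a, b] + m[b, cc] < m[a, cc] - eps ||
        m[a, cc] + m[b, cc] < m[a, b] - eps ||
        m[a, b] + m[a, cc] < m[b, cc] - eps) bad <- bad + 1
  }
  bad
}
set.seed(1)
x <- matrix(rpois(60, 5), nrow = 6) # toy abundance matrix: 6 samples
triangle.violations(vegdist(x, "euclidean")) # metric: always 0
triangle.violations(vegdist(x, "bray")) # semi-metric: can be > 0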

Common user options


Software to undertake MDS almost always offers a range of options. As an example, the MDS Setup screen within CAP
is shown in Fig. 42, page 76.

Initial start position


The algorithm used to find the final coordinates will produce different results depending upon the starting configuration.
Indeed, a poor initial choice may result in the algorithm becoming trapped before a low stress value is reached. Select PCA
to use a Principal Component Analysis to calculate the starting coordinates of the sites (columns). Select Random to use a
random number generator to assign initial coordinates. Generally, PCA will produce a satisfactory choice. Before moving
to a final publication, it is wise to undertake a few runs with random starting points, to satisfy yourself that the algorithm
is finding a minimum stress solution.

Rotate output
If selected, a PCA is performed on the final site coordinates. The default is usually to select this option.


Fig. 42: The option window for MultiDimensional Scaling within the
CAP software.
Similarity measure
The measure to use must be selected. MDS can use a wide range of distance
or similarity measures, and this needs careful consideration. Some measures,
such as Sørensen’s, use only presence/absence data, and therefore should
not be used when you have spent effort and money in quantifying your
variables. In this case a Bray-Curtis, Euclidean or other measure might be
suitable. Generally, there is no theory to guide this choice, so try a few
different measures to check that your conclusions are robust. While different
measures produce different ordinations, if your conclusions are robust,
quantitative measures should each show the same groupings of points.

Number of dimensions
The default value is 2, as you need to display your ordination on a flat
surface. For higher values, the program calculates the configuration of the
points from this number of dimensions down to 1 dimension, listing the
final stress for each number of dimensions. More than 3 dimensions are only
used to find out how the stress changes with the number of dimensions.

Maximum iterations
You can change the number of iterations used by the stress minimisation algorithm; in CAP the default is 200. While
there is a relationship between the number of iterations and the magnitude of the stress level achieved, in practice there is
often little advantage in selecting a higher iteration number. You may like to vary this number to become satisfied that the
minimum stress level possible has been achieved.


Example: archaeology: Seriation of Nigerian pottery.


Demonstration data set: Nigerian pottery.csv
Reference: Usman, A. A., 2003, Ceramic Seriation, Sites Chronology, and Old Oyo Factor in Northcentral Yorubaland,
Nigeria. African Archaeological Review, 20, 149-169
This is an example of the use of MDS using percentage frequency data.
The data comprise the percentage frequency of pottery shards of different types excavated from 15 sites. The author
states that “My main concern in this study is to establish a chronological seriation of pottery from Igbomina, tied to an absolute timescale derived
from radiometric dates”. Quantitative data were recorded for two aspects of design on potsherds: (1) decorative techniques
and motifs (e.g., twisted string roulette, wavy incision); and (2) the location of decoration e.g. interior or lip.

Preliminary considerations
The author used the Euclidean distance measure. This measure is not really appropriate for percentage frequency data
so here we have used the Bray-Curtis percentage similarity measure. The general conclusions on the seriation of the
potsherds are not sensitive to the distance measure, although the Bray-Curtis similarity measure gives a clearer ordination.
Further, the MDS ordination gives similar results for almost all random starting positions. Using Euclidean distance, the
ordination is more sensitive to the starting position.

Results
Fig. 43, page 78 shows the results of a multidimensional scaling in 2 dimensions. The results are similar to those
presented by Usman (2003), and were generally consistent irrespective of starting position, indicating that this is close to
the ordination that minimises stress. Site GIP-5b is a clear outlier. The other sites fit into an approximately linear sequence
with the earliest (GIP-21a and GIP-21b) at the top and the latest (GLR-13 and GLR-15) at the bottom.
For comparison, the plot presented by Usman (2003) computed using the Euclidean distance is also shown, in Fig. 44,
page 79. Note that the temporal sequence is now in the form of a horseshoe curve.


The author did not consider the stress level of the 2D plot, or the change in stress with the number of dimensions; this
is shown in Fig. 45. Note that at 4 dimensions and above, stress is very low and hardly declines further as the number
of dimensions is increased. There is, however, a notable decline in stress between 2 and 3 dimensions, suggesting that
these data might be more usefully displayed as a 3-dimensional plot (Fig. 46, page 80). On this 3-dimensional plot, the
clusters identified by Usman using k-means clustering have also been marked as different colours (Fig. 47, page 81).
It is clear that the k-means clustering and MDS plots do not completely agree. In this plot the light blue and red groups
are not clearly separated, and GLR-12 does not form a clear solitary group. Both the light blue group and GLR-12 are
characterised by a large score for axis 2. This is made clearer in Fig. 47, where the greater height of the light blue group
is more visible, which demonstrates the importance of the choice of angle in a 3D presentation.

Remember that because MDS is a numerical method you may not get exactly the same ordination as shown here.
However, it should show the same relationships. In CAP, to ensure the solution is stable, you should run a number of
ordinations with the initial starting position set to Random, and compare the results.

Fig. 43: The seriation of Nigerian pottery using MultiDimensional Scaling. The Bray-Curtis similarity measure
was used.


Fig. 44: (left) The seriation of Nigerian pottery using MultiDimensional Scaling as published by Usman (2003).
The author used Euclidean distance.
Fig. 45: (right) The change in stress with the number of dimensions for the seriation of Nigerian pottery using
MultiDimensional Scaling. The Bray-Curtis similarity measure was used.
In CAP, to produce a 6D stress graph for an MDS plot, first select 6 dimensions in the initial MDS Setup screen.
Then click on the MDS plots tab, which will show a 2D graph. Select the Stress vs Dimension radio button, and a
stress plot will be produced.


Fig. 46: A 3-dimensional plot of the seriation of Nigerian pottery using MultiDimensional Scaling. The
Bray-Curtis similarity measure was used.
In CAP, to produce a 3D graph for an MDS plot, first select 3 dimensions in the initial MDS Setup screen. Then click on the MDS Plots tab,
which will show a 2D graph. Select the Samples (3D) radio button and a 3D plot will be produced. To put stalks on the points, click on the
Drop Lines button (second from the right above the graph).


Fig. 47: Alternative view of a 3-dimensional plot of the seriation of Nigerian pottery using MultiDimensional
Scaling. The Bray-Curtis similarity measure was used.


Conclusions
NMDS using the Bray-Curtis similarity measure was able to produce a reliable ordination, which was stable irrespective
of starting position. This ordination arranged the sites in a logical temporal order in both 2D and 3D space. The stress
vs. dimension plot indicates that a 3D plot is most suitable for these data. This is confirmed by other analyses undertaken
in Usman (2003), which suggest the presence of 5 groups that could be distinguished in the 3D MDS plot. However,
this plot needed to be carefully rotated to show the relationship between the samples. The author had used a Euclidean
distance measure. This resulted in a horseshoe shape for the temporal sequence that was less satisfactory than the almost-
linear sequence produced here. A Euclidean distance measure is unsatisfactory for percentage frequency data and should
not have been used. However, it did lead to the same general conclusion as the Bray-Curtis similarity measure.
A final consideration is the exclusion of the outlier GIP-5b. If this site is removed from the analysis a clearer ordination
showing the temporal sequence of the sites is produced.

Example: biology: Analysis of fish community change.


Demonstration data set: Hinkley annual fish.csv and Hinkley fish R.csv
Reference: Henderson, P. A., 2007. Discrete and continuous change in the fish community of the Bristol Channel in
response to climate change. J. Mar. Biol. Ass. U.K., 87, 589–598.
Required R package: vegan
We previously considered the analysis undertaken by Henderson (2007) in the PCA chapter. The author used PCA on
a subset of common species, which were present every year, and thus the data matrix did not contain zeros. He also
considered the change between years in the full species complement of 81 fish species. This data matrix has about 52%
zero elements.
Henderson (2007) does not present an MDS plot, as a referee thought it was unnecessary and asked for it to be removed.
However, Henderson states “An examination of the ordination of the total fish community using NMDS found that years pre- and
post-1993 occupied distinct regions of the ordination space, indicating a marked change in the total fish species complement after 1993.” We
will run an MDS to check this statement.


Preliminary considerations
Henderson (2007) used Sørensen’s similarity index, which only uses presence/absence data. This was deliberate as he
wished to give every species, irrespective of abundance, an equal weighting in the analysis. The ordination is therefore a
search for an abrupt change in species presence.
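
In vegan, a Sørensen-based ordination can be obtained by computing a binary Bray-Curtis dissimilarity first (with presence/absence data the Bray-Curtis measure is identical to the Sørensen dissimilarity); a minimal sketch:

# NMDS on the Sørensen (binary Bray-Curtis) dissimilarity
library(vegan)
x <- read.csv("D:\\Demo Data\\Hinkley fish R.csv", row.names = 1)
d <- vegdist(x, method = "bray", binary = TRUE) # Sørensen dissimilarity
mds <- metaMDS(d) # metaMDS also accepts a ready-made dissimilarity matrix
plot(mds, type = "t")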

Results
Fig. 48 shows the results of a MultiDimensional Scaling in 2 dimensions. The results support the view that there was a
change in community structure around 1993, and that years pre- and post-1993 form 2 distinct groups.

In CAP, to produce the perimeters round each group, first ensure you have defined the groups, and then right-click on
the plot, and select each group for which you want the perimeter drawn.

To get the labels in colour, click on the Edit button above the chart, select the Series tab, choose the series in the
drop-down menu and change the colour by selecting the Marks: Text: Font tabs.

Fig. 48: The results of an MDS ordination for the Hinkley Point fish data set, showing the difference in the fish
community pre- and post-1993. The Sørensen similarity index was used.


Code and results using the vegan package in R on the Hinkley data
For this example, the metaMDS function in the vegan package is used to undertake multidimensional scaling. The data
set is organised as a standard vegan community matrix with species (variables) forming columns and sites (samples) the
rows. The first column and row hold species and sample names. In this Hinkley example the sites are years. For biological
data it is generally appropriate to accept the default options of the metaMDS function, as was the case in the box below. The
default similarity measure is the Bray-Curtis, which is a standard measure for quantitative ecological data. Other measures
are available in vegan; for example, distance="euclidean" applies the Euclidean distance measure.
Wisconsin double standardization: the abundance values are first standardized by species maximum standardization,
and then by sample total standardization, and by convention multiplied by 100.

As a default, the metaMDS function performs a Wisconsin double standardization if the data values are larger than
common class scales. After finding a solution, metaMDS runs postMDS for the final result. Function postMDS moves
the origin to the average of the axes and undertakes a Principal Component Analysis to rotate the configuration so
that the variance of points is maximized on the first dimension (with the function metaMDSrotate you can alternatively
rotate the configuration so that the first axis is parallel to an environmental variable).

NMDS is easily trapped by local optima, and you must start NMDS several times from random
starting positions to be confident that you have found the global solution. The default in isoMDS
is to start from a metric scaling which typically is close to a local optimum. metaMDS first runs a default isoMDS, or
uses the previous.best solution if supplied (see box below), and takes this solution as the standard (Run 0). Then
metaMDS starts isoMDS from several random starts (maximum given by trymax). If a solution has a lower stress than
the previous standard, it is taken as the new standard. If the solution is better or close to a standard, metaMDS compares
two solutions using Procrustes analysis (page 70). If the two solutions have very similar Procrustes rmse and the
largest residual is very small, the solutions are regarded as convergent and the result is output.

library(vegan) # MDS example using the metaMDS function in the vegan package

# Open data set
x <- read.csv(file="D:\\Demo Data\\Hinkley fish R.csv", header=T, row.names=1)
head(x, 4) # Print out the first lines to check data
# The MDS with graphical output


mds <- metaMDS(x) # run NMDS - the default is Bray-Curtis dissimilarity
mds # Print results
plot(mds, type="t") # Plot results with labels
plot(mds, type="t", display="species") # label species
plot(mds, type="t", display="sites") # label sites
stressplot(mds) # plot stress

# For congested plots, the following will display a plot with symbols for both samples and variables.
# Click on individual points you would like identified.
# Then press "escape" to visualize

fig <- ordiplot(mds)
identify(fig, "sites")
identify(fig, "spec")

# Start from the previous best solution
# This will give assurance you have a stable solution
mds2 <- metaMDS(x, previous.best = mds)

The ordination plotted in Fig. 49, page 86 shows that the fish community has changed over time. This is illustrated by
using the year label to mark the position of each sample. Note that the fish communities observed between 1981 and 1985
are all placed within the lower left-hand corner of the ordination. The ordination gives no clue as to the reason for this
temporal change. An examination of the changes in individual species’ abundances showed this to be linked to a change in
the relative abundances of cold- and warm-water species, linked to an increase in water temperature from the late 1980s.
The ability of the ordination to summarise the dissimilarity between the samples can be examined by plotting the observed
dissimilarity against the distance apart of each sample in the generated ordination using the function stressplot(mds)
(Fig. 50, page 86). A high correlation suggests a successful ordination.


Fig. 49: MultiDimensional Scaling undertaken on the Hinkley fish data using the metaMDS function in the
vegan package. The early 1980s are clearly different.
Fig. 50: A comparison of the observed dissimilarity between samples and their distance apart in a 2D
ordination plot. The high correlation suggests that the ordination gives a successful summary of relationships.
MultiDimensional Scaling undertaken on the Hinkley fish data using the metaMDS function in the vegan
package. Plot produced using the function stressplot(mds).

Example: geology: Ordovician mollusc assemblages.


Demonstration data set: Ordovician fossils.csv
Reference: Novack-Gottshall, P. M. & Miller, A. I., 2003. Comparative Taxonomic Richness and Abundance of Late
Ordovician Gastropods and Bivalves in Mollusc-rich Strata of the Cincinnati Arch. PALAIOS, 18, 559–571.
Novack-Gottshall and Miller (2003) describe the relative abundance of bivalves and gastropods in strata in the Cincinnatian


Arch (Upper Ordovician). 27 bulk samples were collected from ten localities; these samples represent eight formations
and five of the six stratigraphic sequences composing the type Cincinnatian. The authors wished to study the pattern of
occurrence of bivalves and gastropods to test if they each favoured different palaeoenvironments.
Because the interest is in the pattern of occurrence of the bivalve and gastropod genera, this is an example of an R-mode analysis.
The similarity of the gastropod genera, in terms of the samples they were recorded from, is analysed. In a more typical
Q-mode analysis we examine the similarity of the samples in terms of their genera or other attributes. (See page 143 for
a fuller explanation of R- and Q-mode analyses).
The authors reached the conclusions that “Non-metric multidimensional scaling and several statistical analyses show that the taxonomic
richness and abundance of these classes (bivalve and gastropod) within samples were significantly negatively correlated, such that bivalve-rich
settings were only sparsely inhabited by gastropods and vice versa.”

Preliminary considerations
The authors used the Bray-Curtis similarity measure, which employs quantitative data. This was presumably chosen because
they had good counts of the abundance of each genus of bivalve and gastropod in the samples.

Results
Fig. 51, page 88 shows the results of a Non-metric MultiDimensional Scaling in 3 dimensions, which clearly show
that the bivalves and gastropods occupy different regions of the ordination space. The authors were therefore able to
demonstrate that bivalves and gastropods were most abundant in different strata during the Ordovician, presumably
because of differences in habitat preference.


In CAP, to change the colour of the points and their stalks, click on Edit and choose the Series tab, then select the
series, click on the Format tab and select the colour for the point and base line.

To select the view that shows your data to best advantage, click the Edit button and choose Chart: 3D: Options. We
chose the Orthogonal option.

Fig. 51: Gastropod and bivalve Ordovician fossil assemblages, as shown by MDS. The Bray-Curtis similarity
measure is applied.


Concluding remarks on the use of MDS


MDS is the method of choice to produce an ordination diagram of the relationship between samples when you need the
flexibility to select the distance or similarity measure. This can be required when your data set includes non-quantitative
variables, for example, classifications into 5 levels from rare to abundant, or a mixture of variable types. It is also useful
when variables vary greatly in magnitude or variability. It is important to remember that the choice of similarity
measures such as the Sørensen index effectively changes your data into a presence/absence data set of zeros and ones.
If you have invested heavily in collecting quantitative or semi-quantitative measurements for the variables, you will be
discarding your hard work. In such cases the Bray-Curtis measure is often favoured.

The R metaMDS function in the vegan package has the Bray-Curtis similarity measure as the default.
When using this method, it is wise to ensure that the algorithm has found a good solution with minimum stress by
repeating the procedure a number of times with different random starting positions.
While MDS can seem the obvious ordination method for general use it is notable that when the data hold a clear story, all
methods work well and will display it. If you look back over the Hinkley fish example analysed using PCA, CA and MDS
you will see that in all cases the early 1980s were identified as holding a different fish community than later years. What
none of these methods can tell you is that this was linked to the low water temperatures during the 1980s.


Chapter 6: Linear Discriminant Analysis
Discriminant Analysis (DA) is also called Canonical Variate Analysis (CVA).

Uses
Discriminant Analysis is a standard method for testing the significance of previously-defined groups, identifying and
describing which variables distinguish between the groups, and producing a model to allocate new samples to a group.
DA allows the relationship between groups of samples to be displayed graphically. A goal of a Discriminant Analysis is to
produce a simple function that, given the available parameters, will classify the samples or objects.

Summary of the method


Computationally, DA is closely related to Multivariate Analysis of Variance (MANOVA). The method finds a linear
combination of the row variables (species, fossils, pottery shards etc.) that will separate the samples or sites into the
predefined groups.
There are a number of approaches that can be taken when computing a DA; Fisher's approach1 is intuitive and easy to explain.
1 Sir Ronald Aylmer Fisher, FRS (1890–1962) was an English statistician, evolutionary biologist, and geneticist, and a founder of modern
statistical analysis.


Fig. 52 shows a plot of two groups, 1 and 2, with respect to two variables. Note that if only one variable is used,
there is considerable overlap, and the groups are poorly differentiated. As shown in Fig. 53, using a linear combination of
both variables we can produce an axis that combines both variables, giving a far better discrimination between the groups
than either variable alone. Fisher’s method was to find the linear combination of variables that maximised the ratio of the
between-group sums of squares to the within-group sums of squares.
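
A compact R sketch of Fisher's two-group calculation may make this concrete. This is a didactic illustration only (in practice the lda function in the MASS package is the usual tool); the iris data used here are analysed later in this chapter:

# Fisher's linear discriminant for two groups: the weights are proportional to
# the inverse pooled within-group covariance times the difference in group means
fisher.weights <- function(X1, X2) {
  m1 <- colMeans(X1); m2 <- colMeans(X2)
  Sw <- ((nrow(X1) - 1) * cov(X1) + (nrow(X2) - 1) * cov(X2)) /
        (nrow(X1) + nrow(X2) - 2) # pooled within-group covariance
  w <- solve(Sw, m1 - m2) # direction maximising the between/within ratio
  w / sqrt(sum(w^2)) # scale to unit length for readability
}
X1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])
X2 <- as.matrix(iris[iris$Species == "virginica", 1:4])
fisher.weights(X1, X2)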

Normality
The method assumes that the values for the predictors (row variables) are independently and randomly sampled from the
population and that the sample distribution of any linear combination of predictors is normally distributed. Deviations
from normality caused by skew are unlikely to invalidate the result. However, the method is sensitive to outliers, which
should be removed, or the data transformed to reduce their influence.

Fig. 52: (left) A sketch of the plot of two hypothetical groups defined by two variables. Below the plot is shown
the distribution of the two groups along variable 1. Note that they are very poorly separated and it would be
difficult to assign many samples to groups.
Fig. 53: (right) The same two hypothetical groups defined by two variables; to one side of the plot is shown
the distribution of the two groups along an axis combining variables 1 and 2. Note that compared with using a
single variable, the linear combination allows most samples to be clearly allocated to one group or another.


Example: biology: Iris systematics


Demonstration data set: irises.csv
Discriminant Analysis was first proposed by Fisher (1936)1 in a paper where he used a data set of measurements of the
morphology of 150 iris flowers using 4 characters (lengths and widths of sepals and petals). The measurements were from
3 established species and comprised 50 measurements on Iris setosa, 50 on I. versicolor, and 50 on I. virginica. These data are
commonly referred to as the “Fisher Iris Data”, although the data were collected by Dr. Edgar Anderson.

Results
Each of the 150 observations was allocated to the setosa, versicolor or virginica group, and a discriminant plot produced (Fig.
54). Iris setosa is clearly separated, and it is simple to draw a line on the chart to distinguish this species from the other two.
While versicolor and virginica form separate clusters, there are 3 points which would be misclassified if a straight line were
drawn between the groups. This was shown to be the case when the discriminant functions were used to allocate samples
to species (Table 11), which shows that two specimens of versicolor and one of virginica were misclassified.

Table 11: A table comparing the original iris classifications and that produced by Discriminant Analysis.
                            Original      Original         Original        No.      %
                            Iris setosa   Iris versicolor  Iris virginica  correct  correct
Predicted Iris setosa           50             0                0            50      100
Predicted Iris versicolor        0            48                1            48       98
Predicted Iris virginica         0             2               49            49       96
Total                           50            50               50           147       98

1 Fisher, R.A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7: 179–188
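
A confusion matrix like Table 11 can be reproduced in R with the lda function in the MASS package; a sketch (the book's table was produced in CAP, so the formatting will differ):

# Linear Discriminant Analysis of Fisher's iris data
library(MASS)
fit <- lda(Species ~ ., data = iris) # fit using all four measurements
pred <- predict(fit)$class # resubstitution predictions
table(Predicted = pred, Original = iris$Species) # confusion matrix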


In CAP, Discriminant Analysis is selected from within the Groups drop-down menu. The method is only available if
each sample or object has been assigned to a group. This is undertaken using the Edit Groups menu.

The centroids were removed from the legend by clicking on the Series tab, selecting the Centroids series, then clicking
on the General tab and removing the tick next to Show in Legend.

Fig. 54: The results of a Linear Discriminant Analysis applied to the well-known Fisher iris data. The centroid
(centre of mass) of each species is shown as a square.


Example: archaeology: Skull shape


Demonstration data set: Egyptian skulls.csv
Reference: Thomson, A. and Randall-Maciver, R. (1905) Ancient Races of the Thebaid, Oxford: Oxford University
Press. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, New York: Chapman & Hall, pp. 299 - 301.
Manly, B.F.J. (1986) Multivariate Statistical Methods, New York: Chapman & Hall.
Four measurements were made of male Egyptian skulls from five different time periods ranging from 4000 B.C. to 150
A.D. The researchers theorised that a change in skull size over time was evidence of the interbreeding of the Egyptians
with immigrant populations.

Results
An initial Discriminant Analysis with the different periods as the groups gave a confusing plot with the different
periods not forming tight clusters (Fig. 55). However, examination of the plot shows that more recent skulls lie more
to the left. This tendency for the skulls to be arranged in a time sequence from right to left is shown clearly if only the
cluster centroids are plotted (Fig. 56). Examination of the positions of the centroids suggests that it might be possible
to discriminate between early skulls (from 4000 and 3300 BC) and late skulls (from 200 BC and 150 AD). The plot was
therefore changed to only show the 4000 BC and 150 AD skulls (Fig. 57, page 96). While there are skulls that do not
conform to their group and create overlapping clusters, a clear separation is apparent.

Fig. 55: The plot of the first and second discriminant functions for the Egyptian skull data. The groups are
defined by age.

Fig. 56: The position of the centroids within the space defined by the first and second discriminant functions for the Egyptian skull data. Note that the centroids represent groups of increasing age in a left-right direction.


Fig. 57: The plot of the first and second discriminant functions for the Egyptian skull data. Only the 4000 BC
and 150 AD groups are plotted.
In CAP, to select a reduced number of groups in the plot, first display it by opening the Plot tab. Next click on the Edit button above the plot
(it has a set square symbol). Open the Series tab (if not already open) and remove the tick from groups/series you do not wish to see.


Example: archaeology: Chemical analysis of Romano-British pottery


Demonstration data set: Romano British pottery.csv or roman pottery R.csv
Reference: Tubb, A., Parker, A.J. and Nickless, G. (1980) The analysis of Romano-British pottery by atomic absorption
spectrophotometry. Archaeometry, 22, 153-171. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets,
London: Chapman & Hall, 252.
Required R packages: remotes, flipMultivariates
Twenty-six samples of Romano-British pottery were found at four different kiln sites in Wales and the New Forest, Hampshire. The sites are Llanederyn, Caldicot, Island Thorns, and Ashley Rails. The variables measured on each piece
were the percentage of oxides of various metals measured by atomic absorption spectrophotometry:
Al: Percentage of aluminium oxide in sample
Fe: Percentage of iron oxide in sample
Mg: Percentage of magnesium oxide in sample
Ca: Percentage of calcium oxide in sample
Na: Percentage of sodium oxide in sample
The data were collected to see if different sites contained pottery of different chemical compositions.

Results
The DA plot (Fig. 58, page 99) strongly suggests that pottery from Llanederyn and Caldicot is different in chemical
composition from that collected from Ashley Rails and Island Thorns. Further, there seems to be no clear separation
between the two New Forest sites, Ashley Rails and Island Thorns, suggesting they form a single group. This view is
reinforced following examination of the observed and predicted allocation to group (Table 12, page 100). DA allocated
one Ashley Rails sample to Island Thorns, and one Island Thorns sample to Ashley Rails, indicating a lack of discrimination
between these sites. The analysis was therefore repeated with the Ashley Rails and Island Thorns samples combined into
a New Forest group, producing the clearer plot shown in Fig. 59, page 100.
To assign a piece of pottery to one of our 3 groups we use the classification equations generated by DA to give a score


for a sample. These have the form:

$C_j = c_{j0} + c_{j1}X_1 + c_{j2}X_2 + \dots + c_{jp}X_p$

where the $c_j$ values are the classification function coefficients, $j$ denotes the group and $p$ the number of variables. In our case the equations are:

Caldicot:
$C_{cald} = -76.217 + 3.73X_{Al} + 11.17X_{Fe} + 0.84X_{Mg} + 155.68X_{Ca} - 17.22X_{Na}$

Llanederyn:
$C_{lla} = -80.81 + 3.75X_{Al} + 11.75X_{Fe} + 4.17X_{Mg} + 85.62X_{Ca} + 8.73X_{Na}$

New Forest:
$C_{NF} = -75.29 + 8.75X_{Al} - 2.52X_{Fe} + 0.7X_{Mg} - 1.49X_{Ca} - 19.67X_{Na}$

In CAP, the classification coefficients used here are listed under the tab labelled Fisher's Disc. Func.

A sample is allocated to the group for which it has the highest classification score. For example, a sample with 14% Al, 7% Fe, 4% Mg, 0.1% Ca and 0.5% Na gives:

$C_{cald} = -76.217 + 3.73 \times 14 + 11.17 \times 7 + 0.84 \times 4 + 155.68 \times 0.1 - 17.22 \times 0.5 = 64.511$

and in similar fashion:

$C_{lla} = 83.547$
$C_{NF} = 22.286$

This sample is therefore allocated to the Llanederyn group as it has the highest score.
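This allocation rule is easy to script. A minimal R sketch using the rounded coefficients quoted above (because the coefficients are rounded, small differences from the printed scores are possible):

# Rounded classification coefficients from the text: intercept, Al, Fe, Mg, Ca, Na
coefs <- rbind(Caldicot   = c(-76.217, 3.73, 11.17, 0.84, 155.68, -17.22),
               Llanederyn = c(-80.81,  3.75, 11.75, 4.17,  85.62,   8.73),
               New_Forest = c(-75.29,  8.75, -2.52, 0.70,  -1.49,  -19.67))
new_sample <- c(1, 14, 7, 4, 0.1, 0.5)  # leading 1 multiplies the intercept
scores <- drop(coefs %*% new_sample)    # one classification score per group
scores                                  # print the three scores
names(which.max(scores))                # "Llanederyn", the allocated group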


Fig. 58: The plot of the first two discriminant scores to separate different types of Romano-British pottery.


Table 12: A comparison of the actual group allocation with that generated by Discriminant Analysis for the Romano-British pottery data set.
Original Ashley Rails Original Caldicot Original Island Thorns Original Llanederyn No. correct % correct
Predicted Ashley Rails 4 0 1 0 4 80
Predicted Caldicot 0 2 0 0 2 100
Predicted Island Thorns 1 0 4 0 4 80
Predicted Llanederyn 0 0 0 14 14 100

In CAP, the observed and predicted group memberships are found under the Predictive Validation tab.

Fig. 59: The plot of the first two discriminant scores to separate different types of Romano-British pottery. The
Ashley Rails and Island Thorns samples have been combined into a single group.


Linear Discriminant Analysis using R


The MASS package contains functions for performing linear and quadratic discriminant function analysis. However, a more attractive and interpretable output is easily achieved using the flipMultivariates package, available from GitHub. The following code both installs the packages you need and runs the analysis on the Roman pottery data used above. The data file is roman pottery R.csv.

# Get required packages
install.packages("remotes")
remotes::install_github("Displayr/flipMultivariates")
library(flipMultivariates)
# Open the data set
roman_pottery <- read.table("D:\\Demo Data\\roman pottery R.csv", header = TRUE, sep = ",")
# Output the data set to check it loaded correctly
roman_pottery
# The first column holds the Class names, e.g. Caldicot or Ashley Rails
# The following columns hold the numerical variables
# Undertake a default analysis
lda <- LDA(Class ~ ., data = roman_pottery)
lda
# Scatter plot of the discriminant scores
lda_scat <- LDA(Class ~ ., data = roman_pottery, output = "Scatterplot")
lda_scat
# Observed v predicted group membership
lda_pred <- LDA(Class ~ ., data = roman_pottery, output = "Prediction-Accuracy table")
lda_pred

The default Linear Discriminant Analysis produced the simplified output shown in Fig. 60, page 102. This shows that
the LDA gave 100% correct predictions for Caldicot and Llanederyn. The scatter plot output option (lda_scat) gave the
scatterplot shown in Fig. 61, page 102. The results show that Island Thorns and Ashley Rails are similar and could be
combined into a single group.
The Prediction - Accuracy Table option (lda_pred) gives the output shown in Fig. 62, page 103. LDA offers further
output options: see Help for the full range of outputs, which includes “Discriminant Functions”.
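As noted above, the MASS package offers the same analysis without the flipMultivariates dependency. A minimal sketch, reusing the roman_pottery data frame loaded above:

library(MASS)
roman_pottery$Class <- factor(roman_pottery$Class)  # ensure the labels are a factor
pot_lda <- lda(Class ~ ., data = roman_pottery)     # fit the discriminant functions
pot_pred <- predict(pot_lda)
table(Observed = roman_pottery$Class, Predicted = pot_pred$class)  # cf. Table 12
# Leave-one-out cross-validation gives a less optimistic error rate
pot_cv <- lda(Class ~ ., data = roman_pottery, CV = TRUE)
mean(pot_cv$class == roman_pottery$Class)  # proportion correctly classified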


Fig. 60: (top) Default output from the Linear Discriminant Analysis in the flipMultivariates package in R.
Fig. 61: (right) Scatter plot output from LDA, showing that the Island Thorns and Ashley Rails sites are similar, and can be combined.

Concluding remarks on the use of Linear Discriminant Analysis


This is the method of choice for classifying objects into predefined groups when a linear (straight-line) function to split the data is appropriate. If curved, non-linear boundaries are needed to separate the groups, the method is not appropriate.
LDA can be used to test the validity of groups identified in an MDS plot, or from a dendrogram generated using Agglomerative Cluster Analysis. However, given the assumption of linearity, in general it is probably best to test for the significance of groups using Analysis of Similarities (ANOSIM), which is based on a randomisation procedure and assumes no model.


Fig. 62: (right) The Prediction-Accuracy table produced by the lda_pred option.


Chapter 7: Canonical Correspondence Analysis (CCA)

Uses
CCA is the favoured method for producing an ordination which includes the possible causal factors within the analysis.
The result is a plot that shows the relationship between the samples, the dependent variables within each sample and the
explanatory variables.

Summary of the method


CCA is a form of regression analysis. It is a constrained ordination method derived from Correspondence Analysis (see
page 52), modified to incorporate environmental or explanatory data into the analysis. It could be calculated using
the Reciprocal Averaging algorithm of Correspondence Analysis with a multiple regression of the sample scores on the
environmental variables undertaken over each cycle of the averaging process. New site scores are calculated based on
this regression, and then the process is repeated, continuing until the scores stabilise. The result is that the axes of the
final ordination, rather than simply reflecting the dimensions of the greatest variability in the species data, are a linear
combination of the environmental variables and the species data.


CCA is called a constrained analysis method, as the ordination is constrained by the environmental variables.

Do my variables need to be normally distributed?


There is generally little concern about the distribution of the variables when undertaking CCA. It is believed that for
sufficiently large sample sizes CCA is robust to deviations from normality. However, extreme outliers should be removed,
or transformed to reduce their distance from the mean. Outliers can greatly change the calculated correlation coefficients,
and one advantage of large sample sizes is to reduce the impact of a small number of outliers. The significance of the
relationships found is tested using a permutation test, which does not assume normality. Heavily-skewed distributions
should be avoided.

Minimising the number of explanatory variables


To undertake a CCA your data set should comprise more samples than environmental variables. Further, there should be
more dependent variables in the primary data set than explanatory or environmental variables. In practice, it is important
not to introduce more environmental variables than necessary, and it is rarely the case that a data set is large enough to
allow the recognition of the effects of more than 4 or 5 explanatory variables. There are two components to the reduction
of explanatory variable number. First, some explanatory variables will often be highly correlated and so one of a correlated
pair should be removed (see the multicollinearity discussion below). Second, use a stepwise CCA which adds variables one
at a time to identify the most important variables.

Multicollinearity and the selection of variables


If any combination of the environmental variables is highly correlated, the results obtained by a multiple regression method such as CCA can be unreliable, or in the worst case, unobtainable. Such a lack of independence between the independent environmental variables is termed multicollinearity. It is not always easy to identify. One approach is to undertake multiple regressions for each of the environmental variables in turn, with all the other environmental variables acting as the independent variables. Values of the coefficient of determination (R2) close to 1, or variance inflation factors (VIFs) well above 1, are indicative of multicollinearity. When this occurs, you should consider removing one of the highly-correlated variables from the analysis.

In Ecom, you can check for multicollinearity and remove affected variables by running Test for Multicollinearity from the Ordination menu.
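This check is easy to script. A minimal R sketch, assuming env is a data frame holding only the numeric environmental variables:

# Regress each variable on all the others; VIF = 1 / (1 - R-squared)
vif_table <- t(sapply(names(env), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(env), v), response = v),
                   data = env))$r.squared
  c(R2 = r2, VIF = 1 / (1 - r2))
}))
vif_table  # one row per variable, cf. Table 13 below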

Testing for significance


Unlike all the other methods previously described, CCA allows the testing of the significance of environmental variables
in explaining the observed variation. The relative amount of variability explained by each axis of the ordination is given
by the eigenvalues presented in a standard CCA output. However, the magnitude alone does not give an indication of the
significance of the result, because the eigenvalues give no indication as to whether the amount of variability explained
by the explanatory variables is larger than would be expected by random chance. We can test for significance using a
Monte Carlo test. The test works by using simulation to estimate the magnitude and variability of the eigenvalues of a
purely random data set, in which the environmental variables had no effect on the dependent variables. To do this, a large
number of simulations are run in which the order of the samples in the sample-species array is randomly shuffled and the eigenvalues calculated as usual. By shuffling the order of the samples we break the correlation between the environmental variables and the species, so any relationship remaining in the shuffled data is due to chance alone. These randomly-generated eigenvalues are then compared against those observed, to see if the observations are greater than would be expected by chance.
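In R, the vegan package provides a permutation test of this kind through its anova method for fitted cca objects. A brief sketch, assuming a fitted model called cca_fit:

library(vegan)
anova(cca_fit, permutations = 999)                # overall test of the constraints
anova(cca_fit, by = "axis", permutations = 999)   # test each canonical axis
anova(cca_fit, by = "terms", permutations = 999)  # test each explanatory variable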

The choice of scaling for the CCA output


There are a number of ways in which the output of a CCA can be scaled for plotting. The two commonest will emphasise
different features.
Species at site centroids (Fig. 63) – this will emphasise the differences between the species. The CCA procedure in vegan defaults to this option, which can also be selected using the scaling command in plot(CCA_output, scaling = "species"). This is also referred to as scaling option 2, as was the case in the original CANOCO program.
Sites at species centroids (Fig. 64, page 108) – this will emphasise the differences between the samples. The CCA procedure in vegan produces this scaling by selecting sites in plot(CCA_output, scaling = "sites"). This is also referred to as scaling option 1, as in the original CANOCO program.


Fig. 63: An example of a CCA triplot with species at site centroid scaling. This emphasises the differences between the variables, which are plotted here as red squares.


Fig. 64: An example of a CCA triplot with sites at species centroid scaling. This emphasises the differences between the samples or objects, which are plotted here as green triangles.


Example: palaeontology: Palaeogeography of forest trees in the Czech Republic around 2000 BP
Demonstration data sets: Pollen data biological.csv and Pollen data environmental.csv
Reference: Petr Pokorný (2002) Palaeogeography of forest trees in the Czech Republic around 2000 BP: Methodical
approach and selected results. Preslia, Praha, 74: 235-246.
Pollen analysis was used to reconstruct the forest tree composition of the Czech Republic around 2000 BP. Data on the
percentage composition of tree pollen was mainly derived from publications. Environmental variables selected to explain
the observed pollen communities were altitude, temperature, precipitation, latitude and longitude. Latitude and longitude
were presented as degrees and minutes; these have been converted for the present analysis into decimal numbers so that
50º30’ becomes 50.5. This is probably the conversion used by the author, although this is not described in the paper.
The author sought to answer the following questions.
1. Is pollen analysis able to detect regional forest composition with sufficient spatial accuracy?
2. To what degree did the past forest composition reflect regional climatic and edaphic (soil) conditions? Was the
distribution of forest trees directly affected by human activity?
3. Do the observed patterns agree with the results of potential vegetation mapping and the present geobotanical view
of the problem?

Preliminary data examination and transformation


The environmental explanatory variables vary greatly in magnitude, with temperature ranging from 3.9 to 9.2 and
precipitation from 474 to 1267. The numerical dominance of precipitation needs to be addressed or it might dominate
the analysis simply because of the units of measurement used. One approach would be to rescale some variables so that
they all cover the same order of magnitude. For example, if precipitation were measured in metres rather than millimetres
per year, the range would be 0.474 to 1.267. An alternative, which was used in the output shown below, was to use a log10
transformation of all the environmental variables. A log transformation resulted in an ordination close to that presented
by Pokorný (2002), who did not state what transformation he used. A rescaling transformation must have been applied,


as the raw data do not produce the ordination presented in the paper. The dependent variables are given as percentage
compositions in each sample. These numbers are used untransformed.
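A hedged sketch of this analysis in R using vegan (the file paths are placeholders, the sample labels are assumed to sit in the first column, and the paper's exact settings are unknown, so details of the output may differ):

library(vegan)
pollen <- read.csv("Pollen data biological.csv", row.names = 1)     # % tree pollen composition
env <- read.csv("Pollen data environmental.csv", row.names = 1)     # explanatory variables
cca_fit <- cca(pollen ~ ., data = log10(env))  # log10 transformation, as discussed above
plot(cca_fit)                                  # triplot of samples, species and vectors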

Results
The ordination plot of the samples and the environmental vectors is shown in Fig. 65. The general arrangement of the
samples is similar to that presented by Pokorný (2002) and the samples are grouped as per his analysis. In producing this
result, multicollinearity between explanatory variables was ignored, but will be discussed below.
One lowland site (dark blue cross, sample 7), situated in the floodplain of the river Labe, forms a single outlier shown in
blue. This site has a high percentage of Pinus pollen and the unusual community was related to soil conditions.
Group 1 (green crosses, samples 6, 8, 15, 16) were lowland sites. Low percentages of Picea pollen and a somewhat higher
percentage of Pinus and other forest tree pollen characterised these samples.
Group 2 (pink squares, samples 1 to 5 and 9 to 12). This group consists of upland and mountain sites, with the exception of site 1, which was believed to have deposited a record of mountain vegetation. Common features of these samples were low percentages of Quercus and Carpinus, and high percentages of Abies, Fagus and Picea pollen.
Group 3 (light blue stars, samples 13 and 14). This small group consists of two sites in the Třeboňská pánev basin with
similar qualitative characteristics to Group 2, but with a higher percentage of Pinus pollen present.

Multicollinearity and the selection of variables


Values of the coefficient of determination (R2) close to 1, or variance inflation factor (VIF) well above 1 are indicative of
multicollinearity. When this occurs you should consider removing one of a group of highly-correlated variables from the
analysis. R-squared and VIF values for the pollen analysis environmental variables are given in Table 13, page 112. The
values marked by bold underline indicate that precipitation is highly correlated with one or more other variables.


Fig. 65: A Canonical Correspondence Analysis (CCA) ordination biplot of 16 pollen samples. The samples are
placed into the groups defined by Pokorný (2002).


Table 13: Results of a check for multicollinearity between the variables used in the pollen study of Pokorný
(2002). The highly significant VIF for the variable precipitation is highlighted by bold underline.
Dependent variable R2 VIF
Altitude 0.804149 5.10592
Temperature 0.896613 9.67242
Precipitation 0.933205 14.9712
Latitude 0.37461 1.599
Longitude 0.72948 3.69659

An examination of simple scatter plots indicates that precipitation is correlated positively with altitude and longitude, and negatively with temperature (Fig. 66). With precipitation removed from the analysis, the result is a clearer ordination with the same grouping as before (Fig. 67).

In Ecom, the Scatter plot tab allows you to quickly examine scatter plots and the correlation between variables.

Fig. 66: Simple scatter plots showing the correlation between some of the environmental variables used by Pokorný (2002).


Fig. 67: A CCA ordination biplot of 16 pollen samples. The samples are placed into the groups defined by
Pokorný (2002). Because of correlation with other variables, precipitation has been removed from the analysis.


The significance of the results


Pokorný (2002) did not consider the significance of the relationships between environmental variables and the tree pollen
community. The species-environment correlations (at the bottom of Table 14) are quite high, for example the multiple
correlation between the species and environment scores for canonical axis 1 is 0.927. Do not be deceived, as in constrained
ordinations we are maximising the relationships between species and the environment, so such numbers will appear to be
quite high even for random data.
Table 14: The tabulated output produced by a CCA analysis of the Czech pollen data. CAP produced this table
of values.
Number of sites (objects) = 16
Number of species (response variables) = 8
Number of environmental (explanatory) variables = 4
Number of canonical axes = 4
Number of non-canonical axes = 7
Total variance (inertia) in species data = 0.52494
Sum of canonical eigenvalues = 0.254287
Sum of non-canonical eigenvalues = 0.270653
Can. Axis 1 Can. Axis 2 Can. Axis 3 Can. Axis 4
Canonical eigenvalue = 0.206817 0.0231581 0.0186781 0.00563405
% variance explained = 39.3982 4.41158 3.55814 1.07328
Cumulative % variance = 39.3982 43.8098 47.3679 48.4412
Multiple correlation species/environment scores = 0.927629 0.518796 0.560266 0.305178
Kendal rank correlation of species/environment scores = 0.666667 0.45 0.483333 0.25

The row titled “Cumulative % variance” indicates that the first axis explains about 39.39% of the total variation (inertia)
in the data set. Taken together, the first two axes explain about 44% of the variation.
Overall, how well do our measured variables explain species composition? As CCA is in part a regression method, an


obvious measure would be to create an analogue to the coefficient of determination, R2, and to divide the explained
variance by the total variance. However, there are problems with this approach and unfortunately, there are as yet no good
solutions.
The relative amount of variability explained by each axis of the ordination is given by the eigenvalues presented in a
standard CCA output (see Table 14). The eigenvalue for axis 1 is far larger than the other two axes presented in the table,
indicating that there is a relatively strong gradient acting along axis 1. The amount of the total variation that we can explain
by the environmental variation is the sum of all canonical (or constrained) eigenvalues, which in this case equals 48% of
the total. However, axis 1 alone explains 39%, which highlights the importance of this axis. The analysis also shows that
the pollen species-environment correlations are quite high, but in constrained ordinations this is normal, even for random
data.
The eigenvalues presented give no indication as to whether the amount of variability explained by the environmental or
explanatory variables is larger than would be expected by random chance. We can test for significance using a Monte Carlo
test. Shown in Table 15, page 116 is the probability that an eigenvalue as large as that observed could have occurred by
chance. The probability for axis 1, marked by bold underline, shows that this axis is highly significant, as the value is much
lower than 0.05, the 5% probability level conventionally used in statistical tests. Axes 2 and 3 are not significant at the 5%
level, suggesting that analysis should be restricted to a consideration of the ordination along axis 1 alone. From Fig. 67,
page 113 it can be seen that axis 1 is essentially a temperature - altitude axis and that axis 2 is a latitude - longitude axis.
We conclude that the primary variables determining the forest tree community are altitude and temperature, which are
negatively correlated.


Table 15: The results of a Monte Carlo simulation of the pollen data to determine if the explanatory variables
account for a significant amount of the variation. The results are for 1000 random simulations. These calculations
were undertaken after the removal of precipitation from the environmental data set. A log10 transformation was
applied to the environmental data. The highly significant probability for axis 1 is highlighted in bold underline.
Axis 1 2 3
Actual eigenvalues 0.206817 0.0231581 0.0186781
Eigenvalue results from simulation
Mean 0.0861691 0.0333423 0.0137693
Maximum 0.206023 0.0822763 0.0470048
Minimum 0.0209588 0.0084877 0.00138418
Probability 0.000999001 0.771229 0.211788

In Ecom, the precipitation variable can be removed under the Working Environmental data tab. First, click on a cell in the Precipitation row. Second, select the Handling zeros radio button. Finally, choose Deselect row from the drop-down menu and click on Submit.

Conclusions
Pokorný (2002) concluded that it was possible to reconstruct past forest vegetation from pollen data. The altitudinal zonation of forests was similar to that observed today. The data indicated that Man had affected lowland forests by 2000 BP. All of the significant gradation between pollen communities is presented along axis 1, which is primarily an altitude - temperature axis. As was concluded by Pokorný (2002), the two main groups in the data comprise lowland and upland forest communities.

Additional and alternative approaches


Pokorný (2002) chose to present only sites and environmental vectors on the ordination. CCA also ordinates the dependent
variables, in this case the pollen of tree species. Fig. 68 is an ordination of sites, species and environmental variables,
which shows that Carpinus, Quercus, Pinus and Cerealia are lowland, higher-temperature species. CCA also creates ranked
biplots of the species along the environmental gradients (Fig. 69 and Fig. 70). These plots clearly rank the different tree
species along the environmental gradients. Note the direction of the arrows in these plots. Picea, for example, is found at
the highest altitudes.


Fig. 68: (left) A CCA ordination plot of 16 pollen samples, environmental vectors and pollen species (large red squares). The samples are placed into the groups defined by Pokorný (2002).
Fig. 69: (bottom left) The ranked biplots of tree species along the altitude environmental vector. Data from Pokorný (2002).
Fig. 70: (bottom right) The ranked biplots of tree species along the temperature environmental vector. Data from Pokorný (2002).


Almost exactly the same conclusions to those presented by Pokorný (2002) could have been arrived at using a simpler
ordination method that did not explicitly consider the environmental variables. Fig. 71 shows the result of a Correspondence
Analysis, which shows the same grouping as CCA. For example, Pokorný's Group 1 (samples 6, 8, 15, 16) is distinct, and the lowland status of these sites and their associated species can be identified by simply looking at the associated environmental data. That the resulting ordinations with CCA and CA should be similar is expected, as they both use Reciprocal Averaging.
Somewhat different results are obtained if Non-metric MultiDimensional Scaling is used (Fig. 72). In this case Group 1 is
less distinct, with samples 6, 8, 16 forming a clear group and 15 taking a more intermediate position between the lowland
and highland groups. The unique nature of sample 7 is still apparent. Redundancy Analysis, which is the extension of
multiple linear regression to include multivariate response data, produces an ordination of sites with similarities to that
produced by NMDS (Fig. 73). For example, site 15 is separated from Group 1 and closer to sites 13 and 14.

Fig. 71: Results of a Correspondence Analysis of the 16 pollen samples (red dots), and pollen species (green
squares). Data from Pokorný (2002).


Fig. 72: (left) Results of a Non-metric MultiDimensional Scaling ordination of the 16 pollen samples. The Bray-
Curtis distance measure was used. Data from Pokorný (2002).
Fig. 73: (right) Results of a Redundancy Analysis ordination of the 16 pollen samples (in blue), environmental
vectors (in blue) and pollen species (in red). Data from Pokorný (2002).


Example: biology: Effects of climate change on an estuarine fish community
Demonstration data sets: Hinkley annual fish CCA.csv and Hinkley annual env var CCA.csv or Hinkley annual fish R CCA.csv
and Hinkley annual env var R CCA.csv
Reference: Henderson, P. A. (2007) Discrete and continuous change in the fish community of the Bristol Channel in
response to climate change. J. Mar. Biol. Ass. UK. 87, 589-598.
Required R package: vegan
A data set on fish abundance collected between 1981 and 2005 was used to study the effects of climate change. A stepwise1
CCA was used to identify temperature, salinity and the North Atlantic Oscillation Index as the environmental variables
most influencing the fish community. It was concluded that the fish community of Bridgwater Bay in the outer Severn
estuary is rapidly responding to changes in seawater temperature, salinity and the North Atlantic Oscillation (NAO). CCA
was able to rank the fish in terms of their temperature preference, which can be used as a starting point to predict their
response to climate warming.

Preliminary data examination and transformation


The fish species abundances vary greatly between species and between years. As the data set only comprises fish caught in
every year of the study, there are no zeros in the data set, so Henderson (2007) selected a natural logarithm transformation
(loge). The environmental variables were all scaled to a similar magnitude, so no transformations or rescaling was required.

Results
A test for multicollinearity between these three variables gave variance inflation factors of between 1.03 and 1.1, indicating
that each was varying independently of the others. A CCA using these three variables was then subjected to a Monte Carlo
test to ensure that the observed relationships could not have been generated by random chance. The probability that the
eigenvalues for axes 1, 2 and 3 were generated by random chance was estimated as 0.0529, 0.1778 and 0.011 respectively.

1 Used here for analyses where variables are added to the analysis one at a time.


This indicated that at just over the 5% significance level, axes 1 and 3 were explaining more of the total variability than
would be expected by random chance. Axis 2 was clearly not significant.
The CCA biplots for species and years in relation to the environmental variables are plotted in Fig. 74 and Fig. 75, respectively. Grey mullet and pout are associated with years of higher than average salinity. Sea snail, dab, poor cod, transparent goby and eel were most abundant in years with lower than average seawater temperatures and a higher than average NAOI. Bass were highly responsive to increased seawater temperature. Fig. 75, page 121 shows that the years from 1981 to 1987 formed a group characterised by lower seawater temperatures and high NAOI.

Fig. 74: (left) The Canonical Correspondence Analysis (CCA) biplot of fish species and environmental variables for the Hinkley Point fish data analysed in Henderson (2007). Fish abundances were natural log-transformed (loge) prior to analysis.
Fig. 75: (right) The Canonical Correspondence Analysis (CCA) biplot of the annual samples and environmental variables for the Hinkley Point fish data analysed in Henderson (2007). Fish abundances were natural log-transformed (loge) prior to analysis.
The ordination of the species along two of the environmental axes is shown in Fig. 76 A & B. The position of species
along the vector representing each environmental variable was found by projecting orthogonal lines from the species
positions on to the vector. These points of intersection show the relative response of each species to the environmental
variables.

Fig. 76: The inferred ranking of common fish species to the environmental variables, NAOI (A, left) and seawater
temperature (B, right) at Hinkley Point. Results obtained by Canonical Correspondence Analysis.


Conclusions
The results of Henderson (2007) show that CCA can be an effective method to identify key environmental variables, and
to show how both a community and the individual species respond to a change in the physical environment.

Canonical Correspondence Analysis using R


CCA is offered by the vegan package. You will need to prepare two data sets, (1) the observations for each sample
(in biology these would typically be the observed number for each species) and (2) the explanatory (or environmental)
variables for each sample. These are best organised in a spreadsheet like Excel and then saved as .csv files.
Table 16 below shows the structure of the layout for the observations in the Hinkley annual fish data set. The species,
hooknose, eel etc. are the columns and the rows are each individual annual sample.
Table 16: The layout of the Hinkley annual fish data set.
Hooknose Eel Transparent goby 5-bearded rockling Conger Bass Dab Sea snail Mullet Whiting
1 1 44 2 9 11 36 70 199 19 552
2 3 61 74 43 13 49 65 210 19 700
3 8 17 21 14 4 49 82 90 15 695
4 2 23 35 14 12 11 199 229 5 1040
5 4 12 27 3 8 16 56 281 13 927
6 10 18 1 6 5 14 94 144 23 3180

Table 17, page 124 shows the layout for the explanatory variables, salinity, temperature and NAOWI. Remember that
both the observation and the explanatory data sets must have the same number of rows.


Table 17: The layout of the explanatory variables data set.


Salinity Temperature NAOWI
1 26.43333 12.4625 0.9
2 26.29091 12.38333 0.2475
3 25.25 12 2
4 26.7 12.09167 0.7425
5 25.775 11.65833 -0.38
6 26.9 11.76667 -0.0325

The following R code opens two data sets and runs a CCA. It uses the Hinkley Point fish and environmental data sets. The code produces a series of plots to visualise the results, which are similar to those produced by CAP and discussed previously. The results do not look the same as those presented previously because in this case the data were analysed untransformed; the previous analysis used log-transformed data. Using untransformed data will place a greater emphasis on the abundant species, such as sprat and whiting.
library(vegan) # We will use the vegan community analysis package
# Load Hinkley data - species first
species <- read.csv("D:\\Demo Data\\Hinkley annual fish R CCA.csv", header = TRUE, row.names = 1)
print(species) # Print data to check it is OK
# Load Hinkley data - explanatory variables
enviro <- read.csv("D:\\Demo Data\\Hinkley annual env var R CCA.csv", header = TRUE, row.names = 1)
print(enviro) # Print data to check it is OK
# Run a CCA
CCA_output <- cca(species, enviro)
CCA_output # Print results
summary(CCA_output)
# Correlation between species (WA) and constraint (LC) scores
spenvcor(CCA_output)
# Default plot: species & WA scores, environmental variables
plot(CCA_output)
# The components plotted can be varied
plot(CCA_output, display = c("lc", "bp"))
plot(CCA_output, display = c("sp", "bp"))




Fig. 77 was created using the instruction plot(CCA_output, display = c("lc", "bp")). This shows that low temperatures were associated with the 1980s. Fig. 78 used the instruction plot(CCA_output, display = c("sp", "bp")). This shows, for example, that bass are most abundant in warmer years and sea snail, a more northerly fish, is most abundant in colder years.

Fig. 77: (left) Demonstrating that low temperatures were associated with the 1980s.
Fig. 78: (right) Showing the association between some fish species (such as bass) and warmer years, and other
species (such as sea snail), with colder years.


Additional and alternative approaches


As always with CCA, other ordination methods such as PCA and CA would have shown the changing structure of the
community through time and also the relationships between the various species. However, CCA was the superior method,
as it aided initial environmental variable selection, and in addition to a standard ordination of the samples (years) it
produced clear ranked plots of the response of the fish to the key environmental variables.

Concluding remarks on the use of CCA


CCA is the method of choice for studying the influence of environmental variables on a set of dependent variables. It
produces ordinations of the samples and shows the influence of the environmental variables on the dependent variables.
The environmental variables can be of both quantitative and classificatory type.
The important feature of the method is the use of randomisation tests to show the statistical significance of the explanatory
variables. The method is often applied in a stepwise fashion to select explanatory variables.
The problem with this method is that it demands large data sets and is dependent on your choice of environmental
(explanatory) variables. Often the environmental variables are selected because they are measurable. However, they may
not be the key variables that are moulding the community you are studying. Consider carefully how and why you selected
the environmental variables in your study.
It has become popular to use the environmental information directly and use constrained or ‘canonical’ ordination methods.
But unconstrained analyses (Principal Component Analysis, Non-metric MultiDimensional Scaling and Correspondence
Analysis) should still be used. Constrained analysis such as CCA is well suited for confirmatory research, where we have
specific a priori hypotheses on the important variables, and we want to test those hypotheses.
In exploratory analysis, when we wish to spot environmental variables that might be important, it is better to use
unconstrained analysis, with environmental information used during the interpretation of the ordination.


Chapter 8: TWINSPAN (Two-Way Indicator Species Analysis)

Uses
TWINSPAN produces dendrograms showing the relationship between the samples, and between the variables that make
up the samples. Further, the method identifies key variables for each bifurcation in the sample dendrogram. The method
was originally developed for the analysis of botanical data, but is now far more widely used.

Summary of the method


TWINSPAN was developed by M. O. Hill in the 1970s (Hill, 1979). As Hill suggested, it might have been better named as
dichotomised ordination analysis. The basic idea is to use Correspondence Analysis1, described in Chapter 4, to ordinate
the samples along a single axis. These are then divided into 2 groups on the right and left of the centroid for the first
axis. Those on the left of the centroid are termed the negative group and those on the right the positive group. Indicator
species are defined as species that tend to be found in higher frequency in either the positive or negative group. At any
split, up to 5 indicator species can be used; the number chosen will not change the form of the dendrogram produced.
1 Also termed Reciprocal Averaging.


Examining the score of each sample in terms of the indicator species present then refines the membership of the negative
and positive groups previously defined by their Correspondence Analysis score. Generally, for most samples, a positive
indicator species score will correspond to a high ordination score and these samples would be defined as the positive
group. If the indicator score and the ordination score are contradictory these samples are termed misclassified. Samples
that lie close to the centroid are termed borderline.
Once the data have been split into 2 groups the procedure is repeated for each group to form 4, 8, or more groups, until a
minimum group size is reached. The results can then be presented in the form of a dendrogram with the indicator species
marked at each dichotomy.
Classification of species is undertaken in a similar fashion to the samples. However, the ordination is based on the degree
to which species are confined to groups of samples, termed the fidelity, rather than the raw data. Unlike the sample
ordination, the species are always split into positive and negative groups with no recognition of borderline, difficult to
classify, species.
An odd aspect of TWINSPAN is the use of pseudospecies. TWINSPAN only works on presence/absence data. The
indicator species are defined by their presence in a group of samples, and no account is taken of their abundance. As
typical TWINSPAN data are quantitative or semi-quantitative measurements, the pseudospecies concept was introduced
to allow quantitative data to be reflected in the indicator species. Each species is divided into a number of abundance levels,
each of which is considered computationally as a different species during the identification of indicator species. Within
TWINSPAN, the user defines these abundance levels by selecting cut levels. Thus the same species might be an indicator
species for a number of splits. At one split the presence/absence at a low density might be an indicator, while at another
split presence/absence at high density might be an indicator. A suitable choice of cut level can only be made following
analysis of the data and trial and error. However, a robust conclusion will only be obtained if the final classifications are
not particularly sensitive to the cut levels chosen.
Pseudospecies are scored as 0s or 1s, with 0 representing absent and 1 representing present. A species present at a specified abundance level is also scored as present at all lower abundance levels. For example, species C in Table 18 has an abundance of 49. With cut levels set at 5, 10, 50, 100 and 500, species C is marked as present for pseudospecies 3 (abundance between 11 and 50) and all pseudospecies up to 3 (i.e. pseudospecies 1 and 2). So in this example TWINSPAN would use 3 separate pseudospecies to represent species C.

In CAP, pseudospecies are indicated by a number after the variable. For example, Acer 2 represents pseudospecies 2 for this species.
Table 18: An example of how pseudospecies are calculated for a specific set of cut levels. Four species and their
abundances are presented and the presence/absence of each pseudospecies for each species is shown.
Species name A B C D
Actual abundance 75 620 49 1
Cut level Range of values Pseudospecies A B C D
5 1-5 Pseudospecies 1 1 1 1 1
10 6-10 Pseudospecies 2 1 1 1 0
50 11-50 Pseudospecies 3 1 1 1 0
100 51-100 Pseudospecies 4 1 1 0 0
500 101-500 Pseudospecies 5 0 1 0 0
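The coding in Table 18 is simple to reproduce. A minimal R sketch, taking the threshold for pseudospecies 1 as any abundance above zero:

abundance <- c(A = 75, B = 620, C = 49, D = 1)
cuts <- c(5, 10, 50, 100, 500)
lower <- c(0, cuts[-length(cuts)])           # lower edge of each pseudospecies range
pseudo <- outer(abundance, lower, ">") * 1   # 1 = present, 0 = absent
colnames(pseudo) <- paste0("Pseudospecies ", seq_along(cuts))
pseudo  # rows are species A-D; the values match Table 18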

Generally, the results produced by TWINSPAN make sense and can be an aid to those seeking to summarise and classify.
However, when there are two or more gradients moulding the community, the method may fail. Using synthetic data,
van Groenewoud (1992)1 found TWINSPAN to be unsatisfactory: “Thus, there appear to be two reasons why a TWINSPAN
analysis fails to produce acceptable and ecologically meaningful results: (1) The displacement of the sample points along a first CA axis may
be considerable, resulting in the failure of both CA and TWINSPAN analysis. (2) In TWINSPAN, the division of the first CA axis
in the middle, followed by separate CA analyses of each of the two halves of the original data matrix, creates conditions under which the
second CA analysis will result in a greater displacement of the sample points, thus producing a spurious classification. The erratic behavior of
TWINSPAN beyond the first division makes the results of this analysis of real vegetation data suspect”.

1 van Groenewoud, H. (1992). The Robustness of Correspondence, Detrended Correspondence, and TWINSPAN Analysis. Journal of
Vegetation Science, 3, 239-246.


Common user options


TWINSPAN offers a number of standard user options, and the option window displayed by CAP is shown in Fig. 79.
Data type. Informs the program of the type of data you are inputting. The choice will affect the default cut levels.
Cut levels. This determines the abundance levels in the working data at which a species is divided into pseudospecies. A
maximum of 9 cut levels can be defined. The default state is for 5 cut levels set at 0, 2, 5, 10 and 20.
Options. Three options are available. Maximum
number of indicators per division defines the
maximum number of indicator species that can be
found per division. Maximum level of divisions
defines the number of subdivisions of the groups
that can be undertaken. The maximum is 9. Minimum
group size per division gives the minimum number
of samples (quadrats) in each group. Once a group
size has reached this minimum level no further
subdivisions are undertaken.
Weighting. Each box in this panel gives the relative
weighting given to each pseudospecies cut level. The
default is all 1s, indicating that all pseudospecies are
given equal weighting. The default should generally
be used.
Indicator levels. Defines which pseudospecies cut
levels can act as indicators. The default is that all are
used. The default should generally be used.
Fig. 79: The TWINSPAN option screen used by
CAP.


Example: biology: Woody species composition of floodplain forests of the Little River, McCurtain and LeFlore Counties, Oklahoma
Demonstration data set: Oklahoma floodplain forest.csv
Reference: B. W. Hoagland, L. R. Sorrels and S. M. Glenn (1996). Woody species composition of floodplain forests of
the Little River, McCurtain and LeFlore Counties, Oklahoma. Proc. Okla. Acad. Sci. 76: 23 - 29
Species composition and structure of bottomland hardwood forests were studied in the Coastal Plain region of south-eastern Oklahoma. The objectives of this study were to develop a quantitative vegetation classification and analysis of species diversity patterns in bottomland forests of the Little River. Fourteen bottomland sites were sampled with 10-m² circular plots. Relative frequency, relative density, and relative basal area were calculated and summed to derive an Importance Value (IV) for each species at a site.

Preliminary data examination and transformation


There is no need to transform data prior to undertaking a TWINSPAN. A critical decision when using TWINSPAN is the
choice of cut levels for the pseudospecies. The authors do not give details of the values they used, but a choice of 0 and
10 gives results that produce the same major groupings of samples as they report. If cut levels of 0, 50, 100 and 200 are
used, the Crooked Creek sample will no longer group with Cypress Creek and Forked Lake. The reason for this difference
is that Taxodium distichum has an importance value of 18.4 in Crooked Creek, compared with 139.4 and 36.5 at the other two sites. Generally, robust conclusions are not sensitive to the choice of cut levels. This observation would suggest that the authors' conclusion that there were three groups of sites holding 1) Quercus phellos, 2) Carpinus caroliniana and 3) Taxodium distichum communities may not be the only possible interpretation of these data.

Results
Fig. 80, page 133 shows the sample dendrogram produced with cut levels of 0, 10 and 100. The lowest level dichotomy
has Acer saccharum as the only indicator species. Examination of the data shows this species to be present only at sites in
the lower half of the dendrogram (Cucumber Creek to Forked Lake). The authors identified the Cypress Creek, Crooked


Creek and Forked Lake samples as representing the Taxodium distichum community. They form a discrete group in the
dendrogram, but an equally-legitimate indicator species would be Betula nigra, as selected by TWINSPAN, which was also
only found at these three sites. The author’s Carpinus caroliniana community was found at Cucumber Creek 1 & 2 and
Refuge Creek 1 & 2. TWINSPAN certainly shows these sites as forming a discrete group, but suggests that they might
be described as an A. saccharum community without B. nigra. Finally, the authors classified all the other sites as holding a
Quercus phellos community. TWINSPAN indicates that these may not form a single group, with the Caney and Refuge North sites characterised by the presence of Bumelia lanuginosa. Following a Detrended Correspondence Analysis (DECORANA), the authors noted the difference in community at these two sites.
“The high IV (Importance Value) for Liquidambar styraciflua separated the Caney and Refuge-North sites from others in the Quercus
phellos community type. Overall, importance values for water-tolerant Quercus spp. were low in the Carpinus caroliniana community type.
Second axis DCA scores were high for sites with Quercus alba, regardless of community type. The high axis 1 and low axis 2 scores for the
Taxodium distichum community type are most likely due to the singular presence of Taxodium distichum, Betula nigra, and Acer
saccharum at those sites”.
The authors did not refer to the species dendrogram also produced by TWINSPAN (see Fig. 81, page 134). This shows
that the species divide into two main groups. Quercus phellos is in the upper group and Carpinus caroliniana in the lower group.

Concluding remarks on TWINSPAN


While Hoagland et al. (1996) used TWINSPAN, it is notable that they did not include a dendrogram produced by
TWINSPAN, and in the choice of species characteristic of the different communities they did not use the indicator
species identified by TWINSPAN. Their conclusions were more influenced by the ordination produced by DECORANA
(see Fig. 82, page 134), and it is notable that an ordination plot was included in their paper. Their analysis may have been
clearer and more easily understood if they had based it on the sample dendrogram and indicator species shown on Fig.
80. Their conclusion that there are three community types seems to be based on their intuition, rather than the quantitative
analysis they undertook. I cannot judge if this was wise.


Fig. 80: Sample dendrogram produced by TWINSPAN for the Oklahoma forest study. Cut levels of 0, 10 and
100 were used.


Fig. 82: The results of a Detrended Correspondence Analysis for the Oklahoma forest study. The different coloured points are used to distinguish between the communities identified by Hoagland et al. (1996). Blue - Quercus phellos, Red - Carpinus caroliniana and Green - Taxodium distichum communities.

Fig. 81: Species dendrogram produced by TWINSPAN for the Oklahoma forest study. Cut levels of 0, 10 and 100 were used.


Example: biology: The effect of climate change on an estuarine fish community
Demonstration data sets: Hinkley annual fish.csv and Hinkley fish.csv
Reference: Henderson, P. A. (2007) Discrete and continuous change in the fish community of the Bristol Channel in
response to climate change. J. Mar. Biol. Ass. UK. 87, 589-598.
A data set on fish abundance collected between 1981 and 2005 was used to study the change in fish community structure
in the Bristol Channel, England. The author used PCA and a stepwise CCA to identify temperature, salinity and the North
Atlantic Oscillation Index as the environmental variables most influencing the fish community. It was concluded that the
fish community of Bridgwater Bay in the outer Severn estuary is rapidly responding to changes in seawater temperature,
salinity and the North Atlantic Oscillation (NAO). An interesting feature of this paper is the recognition that some
changes were gradual, while others occurred in a sudden step. Using PCA or CCA, the years between 1981 and 1985
are clearly different from later years (see page 45 and page 65). Here we investigate whether TWINSPAN will also
identify this discontinuity by classifying the early 1980s samples together.

Preliminary data examination and transformation


Hinkley annual fish.csv holds annual abundance data for all the fish sampled at Hinkley Point in the Bristol Channel, while
Hinkley fish.csv only holds data for the resident species caught every year. No data transformations were used. For the
Hinkley annual fish.csv a power series of cut values of 0, 10, 100 and 1000 was used. Several other patterns of cut level
were tried with generally similar results. In contrast, for Hinkley fish.csv only two cut levels, 0 and 100, were used, and the
maximum number of indicator species set to three. Again, the output was not particularly sensitive to the values used.

Results
First, using the total data set (Hinkley annual fish.csv) the dendrogram of the classification of the years between 1981 and
2004 is shown in Fig. 83, page 136. The results do not show a clear discontinuity between the 1980s and later years.
Using the Hinkley fish.csv data set of resident fish, the dendrogram now shows the 1980s as forming a distinct group, (Fig.


84, page 137), as was found with PCA. Some insight into why the two data sets produce such different results is obtained
by examining the indicator species. When the full data set is used, TWINSPAN bases the classification on occasionally-
present species. As these are to some extent random in occurrence the classification shows little pattern. The point to note
is that TWINSPAN will tend to classify using uncommon species; this may not be helpful when it is used to either identify
a classification scheme or to produce a classification of samples.

Fig. 83: The dendrogram of relationships between the fish communities present in the Bristol Channel for the
years 1981 to 2004; obtained using the total data set. The indicator species are placed on the dendrogram at the
branch to which they refer. Predefined year groups are indicated by different colours.


Fig. 84: The dendrogram of relationships between the fish communities present in the Bristol Channel for the
years 1981 to 2004; obtained using data for the resident species only. The indicator species are placed on the
dendrogram at the branch to which they refer. Predefined year groups are indicated by different colours.


Concluding remarks on the use of TWINSPAN


TWINSPAN is a remarkably effective method that combines features of ordination and cluster analysis. It is based on
Correspondence Analysis to split sites. Unlike Agglomerative Cluster Analysis it produces dendrograms showing both the
relationship between the samples and the individual variables, which in botany are usually species. It also shows the key
variables or species causing a branch in the dendrogram. TWINSPAN is therefore particularly useful if you are trying to
produce a classificatory scheme for future samples or objects.
The weakness of the method is the rather odd use of pseudospecies, which are used to capture the quantitative aspect of
a variable. This concept does not work well in disciplines other than botany, for which it was initially designed. It is best
thought of as a way of capturing the effect of a variable at different levels of magnitude. TWINSPAN is often viewed
as a closed box method and the user is often unclear exactly what has been undertaken and why the observed result was
generated. This leaves many researchers uneasy.


Chapter 9: Hierarchical Agglomerative Cluster Analysis

Uses
Agglomerative Cluster Analysis (ACA) methods are used to show the relationships between objects or samples in a
dendrogram, tree or branching diagram. The approach is useful when samples clearly fall into discrete groups. Dendrograms
are a powerful presentation method, which are easily understood.

Summary of the method


The basic computational scheme used in Cluster Analysis can be illustrated using single-linkage cluster analysis as an
example. This is the simplest procedure and consists of the following steps:
1. Start with n groups each containing a single object (sites or species).
2. Calculate, using the similarity measure of choice, the array of between-object similarities.
3. Find the two objects with the greatest similarity, and group them into a single object.
4. Assign similarities between this group and each of the other objects using the rule that the new similarity will be the
greater of the two similarities prior to the join.

5. Continue steps 3 and 4 until only one object is left.
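In R, this scheme is implemented by the hclust() function in the stats package supplied with base R. Below is a minimal sketch, using the built-in USArrests data purely for illustration; the object names are our own.

x <- scale(USArrests)                       # standardise the variables
d <- dist(x, method = "euclidean")          # step 2: the array of between-object distances
hc_single <- hclust(d, method = "single")   # single-linkage clustering, as in steps 3-5
hc_ward <- hclust(d, method = "ward.D2")    # Ward's minimum-variance method, for comparison
plot(hc_single, main = "Single linkage")    # often shows chaining
plot(hc_ward, main = "Ward's method")       # usually gives clearer groups
groups <- cutree(hc_ward, k = 2)            # cut the dendrogram into two groups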


At each iteration, a number of different procedures can be used to determine which objects or groups should be joined to
form a single group. Common methods are Ward’s, single-linkage, complete-linkage, average-linkage, McQuitty’s, Gower’s
and centroid. For example, with Ward’s method, also termed minimum variance or error sums of squares clustering, at
each iteration all possible pairs of groups are compared and the two groups chosen for fusion are those which will produce
a group with the lowest variance. In comparison, for the single-linkage method, at each iteration, the clusters are compared
in terms of the similarity of their most similar samples, and the two clusters that hold the most similar samples are fused.
These clustering methods can use many different similarity or distance measures. Common measures include Euclidean,
Mahalanobis and Manhattan1 distance. It is not possible to give any clear guidance as to the choice of clustering method
and similarity measure. I recommend that you try a few and choose one that produces a dendrogram as free as possible
of chaining. Chaining occurs when samples are added sequentially to produce no clear groups of samples. An example
of chaining is shown in Fig. 86, page 146, which can be compared with the successful dendrogram shown in Fig. 85,
page 145. Chaining is a particular problem with single-linkage clustering. The Euclidean distance measure is one of the
most commonly-used, and is the natural measure for the distance apart of two points on a graph using a ruler. It is worth
noting that the Mahalanobis distance measure2 differs from Euclidean distance in that it takes into account the correlations
between variables in the data set, and is scale-invariant. The Renkonen index of similarity is based on proportional
abundances and is believed to be little influenced by changes in sample size. It is therefore used widely in situations where
sampling effort between localities has varied. In pollen record studies, Gavin et al. (2003)3 have shown that the squared-
chord distance “outperforms most other metrics”.
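Such measures are also easy to compute directly. As a minimal R sketch, here is the squared-chord distance; the function name and the toy vectors are our own, with a and b being abundance vectors for two samples:

sq_chord <- function(a, b) {
  pa <- a / sum(a)                          # convert abundances to proportions
  pb <- b / sum(b)
  sum((sqrt(pa) - sqrt(pb))^2)              # squared-chord distance
}
sq_chord(c(10, 3, 0, 5), c(8, 0, 2, 6))     # example call with two toy samples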

Some useful measures of similarity


When we compare a fauna or other community of objects sampled at different localities, we often wish to know how
similar they are in terms of their assemblages. Numerous methods have been devised for the measurement of similarity, the most successful of which are described below. We will discuss similarity measures; some authors discuss dissimilarity, which is generally just 1 - similarity.

1 Otherwise known as city-block or taxi-cab distance.
2 Designed by P. C. Mahalanobis in 1936.
3 Gavin, D. G., Oswald, W. W., Wahl, E. R. & Williams, J. W. (2003). A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356-367.
Similarity indices are simple measures of either the extent to which two samples have species or other attributes in common (Q analysis), or the extent to which variables (e.g. species) have samples in common (R analysis). For a fuller explanation of Q
and R analyses, see page 143. Binary similarity coefficients use presence/absence data, but more complex quantitative
coefficients can be used if you have data on species abundance. When comparing the species at two localities, indices can
be divided into those that take account of the absence of an attribute or species from both communities (double-zero
methods), and those that do not. In most applications it is unwise to use double-zero methods as they assign a high level of
similarity to localities which both lack many variables or species. We would not normally consider two sites highly similar
because their only common feature was the joint lack of a group of variables. This might occur because of sampling errors
or chance. For that reason, here I only describe binary indices that exclude double zeros.

Binary coefficients for presence/absence data


When comparing two sites, let a be the number of types of object (e.g. species) held in common, and b and c the number
of objects found at only one of the sites. When comparing two species or other objects over many sites, the terms are
similar e.g. a is the number of sites where they both were caught. The three simplest coefficients are:

Jaccard: CJ = a / (a + b + c)

Sørensen: CS = 2a / (2a + b + c)

Mountford: CM = 2a / (2bc - (b + c)a)

CM was designed to be less sensitive to sample size than CJ or CS; however, it assumes that variable abundance fits a log-series model, which may be inappropriate.
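These coefficients can be computed directly from presence/absence vectors. A minimal R sketch, assuming x and y are 0/1 vectors for the two sites; the function name is our own:

binary_coefs <- function(x, y) {
  a <- sum(x == 1 & y == 1)                 # objects present at both sites
  b <- sum(x == 1 & y == 0)                 # present only at the first site
  c <- sum(x == 0 & y == 1)                 # present only at the second site
  c(Jaccard = a / (a + b + c),
    Sorensen = 2 * a / (2 * a + b + c),
    Mountford = 2 * a / (2 * b * c - (b + c) * a))
}
binary_coefs(c(1, 1, 0, 1, 0), c(1, 0, 1, 1, 0))  # example with two toy sites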


Following evaluation of similarity measures using Rothamsted insect data, Smith (1986)1 concluded that no index based
on presence/absence data was entirely satisfactory, but the Sørensen was the best of those considered.

Quantitative coefficients
Because they are based purely on presence/absence, binary coefficients give equal weight to all objects, and hence tend
to place too much significance on rare features whose discovery or capture will depend heavily on chance. Bray & Curtis
(1957)2 brought abundance into consideration in a modified Sørensen coefficient, and this approach is widely used in plant
ecology; the coefficient, as modified, essentially reflects the similarity in individuals between the habitats:

CN = 2jN / (aN + bN)

where aN = the total of individual objects sampled in habitat a, bN = the same in habitat b, and jN = the sum of the lesser values of abundance in both samples for the objects common to both samples (often termed W). For example, if there are 10 of a species in sample 1 and only 3 in sample 2, the lesser value is 3.
However, for quantitative data, Wolda (1981)3 found that the only index not strongly influenced by sample size and species richness was the Morisita-Horn index:

CMH = 2Σ(ani × bni) / ((da + db) × aN × bN)

where aN and bN = the total number of objects in sites a and b respectively, ani and bni = the number of the ith object or variable in samples a and b respectively, da = Σani² / aN², and db is calculated in similar fashion using sample b.

1 Smith, B. (1986). Evaluation of different similarity indices applied to data from the Rothamsted insect survey. Unpublished M.Sc. Thesis, University of York.
2 Bray, J. R. & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecol. Monogr. 27, 325-349.
3 Wolda, H. (1981). Similarity indices, sample size and diversity. Oecologia 50, 296-302.


A considerably simpler, but frequently-used index is percent similarity (Whittaker 1952)1, calculated using:

P = 100 - 0.5 Σ|Pa,i - Pb,i|

where Pa,i and Pb,i are the percentage abundances of object i in samples a and b respectively, and the sum is taken over all S objects, S being the total number of objects. This index takes little account of rare objects, and thus will give a good indication of the similarity in dominant forms between the sites.

1 Whittaker, R. H. (1952). A study of summer foliage insect communities in the Great Smoky Mountains. Ecol. Monogr. 22, 1-44.
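The quantitative coefficients above are equally simple to compute. A minimal R sketch following the formulas as given; the function names are our own, and a and b are abundance vectors. For routine work, the vegan package's vegdist() function offers related dissimilarities (e.g. method = "bray").

bray_curtis <- function(a, b) {
  2 * sum(pmin(a, b)) / (sum(a) + sum(b))   # 2jN / (aN + bN)
}
morisita_horn <- function(a, b) {
  da <- sum(a^2) / sum(a)^2                 # da = sum(ani^2) / aN^2
  db <- sum(b^2) / sum(b)^2                 # db calculated in similar fashion
  2 * sum(a * b) / ((da + db) * sum(a) * sum(b))
}
percent_similarity <- function(a, b) {
  pa <- 100 * a / sum(a)                    # percentage abundances
  pb <- 100 * b / sum(b)
  100 - 0.5 * sum(abs(pa - pb))
}
a <- c(10, 3, 0, 5); b <- c(8, 0, 2, 6)     # two toy samples
bray_curtis(a, b); morisita_horn(a, b); percent_similarity(a, b)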

What is R- and Q-mode analysis?


There is a large and confused literature discussing R- and Q-mode analysis. Because of the lack of consensus as to the
definition of these terms I would recommend that the terms be avoided. In an R-mode analysis, we examine the similarity
between variables or descriptors, and cluster or ordinate the variables or descriptors. In contrast, in a Q-mode analysis the
similarity between objects or samples is examined, and we cluster or ordinate the objects or samples.
In Agglomerative Cluster Analysis, both R-mode and Q-mode clustering can be carried out. TWINSPAN always produces
R- and Q-mode dendrograms.
However, Principal Component Analysis only makes sense if an R-mode analysis is undertaken. Some workers have
transposed the data matrix and undertaken a PCA with the aim of positioning the descriptors or variables in a reduced
space defined by the objects or samples. This should not be done, as the correlations between the samples or objects in
terms of a number of different variables are not meaningful. Further, such a procedure is unnecessary, as the plot of
the eigenvectors shows the relationship between the descriptors. An example of MultiDimensional Scaling used with an
R-mode analysis is given on page 86.

Testing for the significance of groups


A key weakness of ACA methods is the wide variety of linking and distance methods that can be used, which can each
produce a different dendrogram. This results in little confidence in the method. Frequently, researchers simply choose to
present the dendrogram that shows the pattern that they believe to exist. An approach that has frequently been used is
to demonstrate the robustness of the results by running a range of ACA methods and similarity measures. There are two
generally-applicable approaches that can be taken to test for the significance of groups identified in a dendrogram. The
first is to assign the samples to the groups identified in your interpretation of the dendrogram and then undertake a Linear
Discriminant Analysis (DA). DA will show how many of the samples are correctly assigned to their group. A second approach is to
use an Analysis of Similarities (ANOSIM) randomisation test on the group memberships identified from the dendrogram.
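As a minimal sketch of the first approach in R, MASS::lda() can be used to see how many samples are correctly re-assigned to their groups. The built-in iris data stand in here for a real samples-by-variables data frame with dendrogram-derived groups:

library(MASS)                                     # provides lda()
fit <- lda(Species ~ ., data = iris)              # discriminate using the assigned groups
pred <- predict(fit)$class                        # predicted group for each sample
table(Predicted = pred, Assigned = iris$Species)  # confusion matrix of assignments
mean(pred == iris$Species)                        # proportion correctly assigned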

Example: biology: Woody species composition of floodplain forests of the Little River, McCurtain and LeFlore Counties, Oklahoma
Demonstration data set: Oklahoma floodplain forest.csv
Reference: B. W. Hoagland, L. R. Sorrels and S. M. Glenn (1996) Woody species composition of floodplain forests of the
Little River, McCurtain and LeFlore Counties, Oklahoma. Proc. Okla. Acad. Sci. 76: 23 - 29
This is the data set used above in the TWINSPAN example applications. Species composition and structure of bottomland
hardwood forests were studied in the Coastal Plain region of south-eastern Oklahoma. Fourteen bottomland sites were
sampled with 10 m² circular plots. Relative frequency, relative density, and relative basal area were calculated and summed
to derive an Importance Value (IV) for each species at a site.

Preliminary data examination and transformation


There is no need to transform data prior to undertaking a Cluster Analysis.


Results
Fig. 85 shows the dendrogram produced using Ward’s method with the Euclidean distance measure. It produces results
markedly similar to the TWINSPAN dendrogram shown in Fig. 80, page 133. As an example of a poor choice of
method, Fig. 86, page 146 shows the dendrogram generated using single-linkage clustering and the Renkonen similarity
measure for the same data set. Note that chaining has occurred and the clear split into two groups shown in Fig. 85 is no
longer apparent.

Fig. 85: Sample dendrogram for the Oklahoma floodplain forest data. Agglomerative cluster analysis undertaken
with Ward’s method and Euclidean distance.


Fig. 86: An example of chaining using single-linkage clustering and the Renkonen similarity measure on the
Oklahoma floodplain forest data. Note that this dendrogram also shows a familiar problem, in that at one point
in the process (arrowed) the combined group resulted in an increase in similarity at the next cluster aggregation.
Conclusions
Agglomerative clustering methods can produce dendrograms which clearly show the main clusters of objects within the
data set. However, the results can seem arbitrary and highly dependent upon the method used to join together the clusters
and the similarity or distance measure used. The significance of group membership could be tested using the Analysis of Similarities (ANOSIM) randomisation test. This approach is used in later examples below.


Example: biology: The effect of climate change on an estuarine fish community
Demonstration data sets: Hinkley annual fish.csv and Hinkley fish.csv
Reference: Henderson, P. A. (2007) Discrete and continuous change in the fish community of the Bristol Channel in
response to climate change. J. Mar. Biol. Ass. UK. 87, 589-598.
A data set on fish abundance collected between 1981 and 2005 was used to study the change in fish community structure
in the Bristol Channel, England. The author used PCA and a stepwise CCA to identify temperature, salinity and the North
Atlantic Oscillation Index as the environmental variables most influencing the fish community. It was concluded that the
fish community of Bridgwater Bay in the outer Severn estuary is rapidly responding to changes in seawater temperature,
salinity and the North Atlantic Oscillation (NAO). An interesting feature of this paper is the recognition that some
changes were gradual, while others occurred in a sudden step. Using PCA or CCA the years between 1981 and 1985 are
clearly different from later years (see page 45 and page 65). Here we investigate if agglomerative clustering will also
identify this discontinuity by classifying the early 1980s samples together.

Preliminary data examination and transformation


No preliminary data transformation is required.

Results
A typical example of the type of dendrogram produced for the Hinkley data is shown in Fig. 87, page 148. It is notable
that, as in the case of TWINSPAN, when the Hinkley annual fish.csv data set is used the years do not form clear pre- and
post- early 1990s groups. This data set comprises all the fish species. However, while TWINSPAN produced a dendrogram
which separated the 1980s when only the resident species were considered (Fig. 84, page 137), this was not the case for
Agglomerative Clustering using Euclidean distance (Fig. 88, page 149). By contrast, Ward’s method together with the
Bray-Curtis similarity measure did produce a clustering of years with features in common with that observed with PCA
and TWINSPAN (Fig. 89, page 150).


Fig. 87: The Agglomerative Cluster Analysis dendrogram of relationships between the fish communities present
in the Bristol Channel for the years 1981 to 2004, obtained using data for all fish species in the Hinkley annual fish.
csv data set. Clustering used Ward’s method and Euclidean distance.


Fig. 88: The Agglomerative Cluster Analysis dendrogram of relationships between the fish communities present
in the Bristol Channel for the years 1981 to 2004, obtained using data for resident fish species in the Hinkley fish.
csv data set. Clustering used Ward’s method and Euclidean distance.


Fig. 89: The Agglomerative Cluster Analysis dendrogram of relationships between the fish communities present
in the Bristol Channel for the years 1981 to 2004, obtained using data for resident fish species in the Hinkley fish.
csv data set. Clustering used Ward’s method and Bray-Curtis similarity.


Conclusions
Agglomerative clustering methods were unsatisfactory for studying the structure of the Hinkley fish data because the
result was so dependent upon the similarity measure used. The Bray-Curtis similarity measure was found to produce a
dendrogram having features in common with other ordination methods. This measure is favoured by benthic ecologists
and is frequently used within NMDS. In some fields, it can at least be argued that the similarity measure used has been
chosen because it is the standard method used by other scientists. However, it is clear that the choices are often painfully
arbitrary and in danger of being driven by the desire to get a pleasing result, which will often reinforce preconceived ideas.
For many applications, if a dendrogram is required for presentational purposes, the Bray-Curtis similarity index should be
considered.
In conclusion, use agglomerative clustering dendrograms to present patterns that have also been identified by other
methods.

Example: veterinary science: Body-weight changes during growth in puppies of different breeds
Demonstration data sets: dog growth study.csv
Reference: Hawthorne, A. J. Booles, D., Nugent, P. A., Gettinby, G. and Wilkinson, J. (2004) Body-weight changes during
growth in puppies of different breeds. American Society for Nutritional Sciences. J. Nutr. 134, 2027S–2030S.
The aim of this study was to compare the growth curves of 12 different-sized dog breeds to investigate if feeding guides
need to be breed-specific. Logistic growth curves were fitted to size at age data for 12 breeds of dog. The fitted growth
curve parameters of final adult weight, growth rate and time taken to reach 50% of the adult weight were used to compare
the breeds. The authors used Agglomerative Cluster Analysis with Ward’s method to show the relationship between the
breeds. The methods section did not specify the distance measure used.

Preliminary data examination and transformation


No preliminary data transformation is required.


Results
As the authors did not specify the distance or similarity measure used, it was not possible to precisely reproduce their
dendrogram. However, a wide variety of methods gave a result similar to that published in their paper; Fig. 90 shows
a clear distinction in the growth curve parameters of large species – the English mastiff to Newfoundland group, and
smaller species – the Labrador retriever to Papillon group. This dendrogram was produced using Ward’s method and
Euclidean distance. Within each of these main divisions there are further divisions that the authors considered notable,
for example, the separation of the Newfoundlands from other giant breeds.

Conclusions
Agglomerative clustering allows the different growth patterns of small and large dog breeds to be elegantly displayed.
However, the further differences within the two main groups are considerably more subjective and may not be significant. The authors did not attempt any statistical analysis of the significance of their grouping into large and
small dogs. An Analysis of Similarities randomisation test (ANOSIM) showed that the large and small dog groups were
significantly different (test statistic = 0.06, p = 0.001).

Additional and alternative approaches


In this example, there are only 3 variables, so it is possible to present the relationships between the breeds using a simple
3-dimensional MDS plot (Fig. 91, page 154). This graph shows clearly that the Labrador retriever actually lies between
the small and large dog groups.


Fig. 90: The relationship between the growth parameters of 12 breeds of dog. The dendrogram was produced
using Ward’s method and Euclidean distance.


Fig. 91: A 3D MDS plot of the dog growth variables for the 12 different breeds. The large and small dogs are
shown in red and blue, respectively.


Example: linguistics: Authors’ characteristic writing styles as seen through their use of commas
Demonstration data sets: comma placement.csv
Reference: Jin, M. & Murakami, M. (1993) Authors’ characteristic writing styles as seen through their use of commas.
Behaviormetrika, 20, 63-74.
The aim of this study was to test whether the work of different Japanese authors could be distinguished by their writing
style, as measured by their use of commas. The styles of the different authors were investigated by comparing the frequency
of various symbols preceding a comma. The number of works compared for each author ranged from 1 to 3.

Preliminary data examination and transformation


No preliminary data transformation is required.

Results
The authors undertook ACA using complete-linkage (furthest neighbour), average-linkage and single-linkage (nearest neighbour) with both Euclidean and city-block distance measures. (In CAP, the city-block distance is called the Manhattan distance.) The authors examined dendrograms produced with all permutations of the three linkage methods and two distance measures, to ensure that the conclusions were not sensitive to the method chosen. Fig. 92, page 156 shows the dendrogram produced using single-linkage joining and Euclidean distance, and Fig. 93, page 157 the dendrogram produced using average-linkage and city-block distance. As was concluded in the paper, the styles of the individual authors form separate branches.
To help determine if the impression gained from the ACA that the 4 authors belonged to separate groups was correct,
Jin & Murakami (1993) examined the ordination produced by Principal Component Analysis (PCA) using the variance-
covariance matrix (see page 20). Fig. 94, page 158 shows that PCA also showed a clear distinction between the four
authors, increasing confidence that the dendrogram groups reflected real differences.


Fig. 92: Agglomerative Cluster Analysis of comma use by the Japanese authors, Yasushi Inoue, Atsushi Nakajima, Yukio Mishima and Junichiro Tanizaki. Single-linkage joining with Euclidean distance was used.


Fig. 93: Agglomerative Cluster Analysis of comma use by the Japanese authors, Yasushi Inoue, Atsushi Nakajima, Yukio Mishima and Junichiro Tanizaki. Average-linkage joining with Manhattan distance was used.


Fig. 94: The PCA grouping of four Japanese authors by reference to their use of commas. The analysis was
undertaken on the variance-covariance matrix.


Conclusions
Jin & Murakami (1993) concluded that comma placement by individual authors generally remained relatively consistent,
and that such analysis could be used to determine authorship and authenticity of modern Japanese literature.

Alternative approaches
Jin & Murakami (1993) had no simple way of testing if their assignment to group was significant. Generally, there are two approaches that might be taken; group membership could be tested using Linear Discriminant Analysis (DA) or, alternatively, the ANOSIM randomisation test could be applied. (In CAP, Analysis of Similarities (ANOSIM) and Discriminant Analysis (DA) are found on the Groups drop-down menu. Remember that, before an ANOSIM can be carried out, the individual samples must be assigned to groups.) DA could not be used, as Nakajima was represented by only a single work. For the full data set, ANOSIM gave a sample statistic of 0.96 and a probability of only p = 0.001 that the observed within-group similarity could have been generated by random chance. This suggested that the separation of the individual authors was highly significant. ANOSIM also tests the difference between pairs of groups (in our case comparing all combinations of two authors - see Table 19). This shows that the data set does not demonstrate a significant difference at the 10% level between Inoue and Nakajima, Mishima and Nakajima, and Nakajima and Tanizaki. This lack of significance is related to the small number of works available for each author, and does not invalidate the method.
Table 19: Results of all pairwise tests generated by ANOSIM for the Japanese author analysis.
1st author 2nd author Permutations P-value Level % No >= Obs Sample stat.
Inoue (2) Mishima (3) 10 0.1 10 1 1
Inoue (2) Nakajima (1) 3 0.33333 33.33 1 1
Inoue (2) Tanizaki (3) 10 0.1 10 1 1
Mishima (3) Nakajima (1) 4 0.25 25 1 1
Mishima (3) Tanizaki (3) 10 0.05 10 1 0.925
Nakajima (1) Tanizaki (3) 4 0.25 25 1 0.777


Example: cluster analysis and heat maps in R


Demonstration data set: Hinkley annual data.csv
Required R packages: graphics, grDevices, RColorBrewer
Returning to the Hinkley Point fish data set, it is also possible using R to cluster both the samples (years) and the species and produce a heat map. As shown in Fig. 95, the heat map orders the species and samples (years) so that all the species showing similar patterns of abundance are clustered together. The colours encode the relative abundance of each species in the different years.

require(graphics); require(grDevices); require(RColorBrewer) # load relevant packages

my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299) # set colour palette
data <- read.table("D:\\Demo Data\\Hinkley annual data.csv", header = TRUE, sep = ",") # load data
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform columns 2-n into a matrix
rownames(mat_data) <- rnames # assign row names
mat_data <- log10(mat_data + 1) # log10 transform (skip for untransformed data)
x <- as.matrix(mat_data) # create a matrix, x
row_distance <- dist(x, method = "euclidean") # Euclidean distance between matrix rows
row_cluster <- hclust(row_distance, method = "ward.D") # cluster the rows of the matrix
col_distance <- dist(t(x), method = "euclidean") # transpose matrix and calculate Euclidean
                                                 # distance between the columns
col_cluster <- hclust(col_distance, method = "ward.D") # cluster the columns of the matrix
hv <- heatmap(x, col = my_palette, scale = "column",
  Rowv = as.dendrogram(row_cluster), # apply row clustering
  Colv = as.dendrogram(col_cluster), # apply column clustering
  # xlab = "XXX", ylab = "YYY", # add x and y labels if required
  cexCol = 0.4, # reduce size of column labels
  cexRow = 0.4, # reduce size of row labels
  main = "Heat map for Hinkley data") # main title


Concluding remarks on the use of Agglomerative Cluster Analysis
ACA can produce dendrograms which are highly effective ways of presenting your data. The problem is that it encompasses a swarm of possible approaches, and these can produce radically different dendrograms. If you use this method you must consider how robust your conclusions about observed groups are. Remember that even purely random data will produce a pleasing dendrogram. To test for robustness, consider the following:
○○ Do your groups make sense?
○○ Do the dendrograms produce similar groupings irrespective of the linking method and similarity measure used?
○○ Are the results consistent with the clustering of points produced by an ordination method such as MDS or PCA?
○○ Does ANOSIM show that samples within groups are significantly more similar than would be expected by random chance?
If you can answer ‘yes’ to all of the above, your dendrogram is probably a good way of displaying the relationships between the samples.

Fig. 95: Heat map ordering the species and samples (years) so that all species with similar patterns of abundance cluster together.


Chapter 10: Analysis of Similarities (ANOSIM)

Uses
To test if the grouping of samples or objects is statistically significant.

Summary of the method


This test was developed by Clarke1 as a test of the significance of the groups that had been previously defined. The idea is
simple: if the assigned groups are meaningful, samples within groups should be more similar in composition than samples
from different groups. The method uses the Bray-Curtis measure of similarity. The null hypothesis is therefore that there
are no differences between the members of the various groups.
The test statistic, R, measures the difference between the mean of the ranked similarity BETWEEN groups and WITHIN groups. R scales from +1 to -1. A value of +1 indicates that all the most similar samples are within the same groups. R = 0 occurs if the high and low similarities are perfectly mixed and bear no relationship to the group. A value of -1 indicates that the most similar samples are never in the same group. While negative values might seem a most unlikely eventuality, they have been found to occur with surprising frequency.

1 Clarke, K. R. (1988). Detecting change in benthic community structure. 131-142 in R. Oger [ed.] Proceedings of invited papers, 14th international biometric conference, Namur, Belgium; and Clarke, K. R. (1993). Non-parametric multivariate analyses of changes in community structure. Aust. J. Ecol. 18, 117-143.
To test for significance, the ranked similarity within and between groups is compared with the similarity that would be
generated by random chance. Essentially, the samples are randomly assigned to groups 1000 times and the value of R is
calculated for each permutation. The observed value of R is then compared against the random distribution to determine
whether it is significantly different from that which could occur at random. If the value of R is significant, it can be
concluded that there is evidence the samples within groups are more similar than would be expected by random chance.
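In R, this test is available as the anosim() function in the vegan package. A minimal sketch using the dune meadow data shipped with vegan (not one of this book's data sets):

library(vegan)                                   # provides anosim() and example data
data(dune, dune.env)                             # community matrix and sample factors
fit <- anosim(dune, grouping = dune.env$Management,
              distance = "bray", permutations = 999)
summary(fit)                                     # global R statistic and permutation p-value
plot(fit)                                        # between- vs within-group ranked dissimilarities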

Example: archaeology: The classification of Jomon pottery sherds


Demonstration data set: Jomon_Hall.csv
Reference: Hall, M. E., 2001. Pottery styles during the early Jomon period: geochemical perspectives on the Moroiso and
Ukishima pottery styles. Archaeometry 43, 59-75.
This example is based on the study by Hall (2001) of Japanese pottery sherds from the early Jomon period (c. 5000-2500
BC). Energy-dispersive X-ray fluorescence was used to determine the concentration of 15 minor and trace chemical
elements in 92 pottery sherds. The sherds came from four sites in the Kanto region and belong to either the Moroiso or
Ukishima style of pottery. These data were discussed above in Chapter 3 on page 25.

Preliminary data examination and transformation


No preliminary data analysis is required. However, the groups were defined following examination of the PCA ordination
and were the four localities from which the sherds were collected (see page 25).

Results
The global ANOSIM test statistic of R = 0.336 indicated that the pottery sherds did form four groups based on locality.
To put it another way, the sherds within each of the 4 groups were significantly more similar than would be expected by


random chance. The given probability that such a similarity would be observed by chance was 0.001, or about 1 in 1000. The true probability may be lower still; p = 0.001 is the smallest value the program's 1000 randomisations can report, and it is sufficient to establish the validity of the groups. ANOSIM also produces all the pair-wise comparisons between the 4 groups (Table 20). These
results show that the pottery sherds from all the localities were significantly different from each other.
Table 20: The results of pairwise randomisation tests of the Jomon pottery produced by ANOSIM

1st Group 2nd Group Perms. done P-value Level % No >= Obs Sample stat.
Ariyoshi-kita (14) Kamikaizuka (30) 1000 0.001 0.1 1 0.434
Ariyoshi-kita (14) Narita60 (15) 1000 0.001 0.1 1 0.590
Ariyoshi-kita (14) Shouninzuka (33) 1000 0.001 0.1 1 0.434
Kamikaizuka (30) Narita60 (15) 1000 0.002 0.2 2 0.220
Kamikaizuka (30) Shouninzuka (33) 1000 0.001 0.1 1 0.277
Narita60 (15) Shouninzuka (33) 1000 0.001 0.1 1 0.293

Conclusions
Hall (2001) concluded that there are four major groups in the data set, which correspond to site location. ANOSIM
supports this conclusion.

Alternative approaches
There are no alternative approaches to ANOSIM.


Chapter 11: Tips on CAP and Ecom
A guide to getting the best out of your data

CAP and Ecom were designed to make the application of multivariate analysis techniques less intimidating than is the case
with other programs. They offer a range of analytical techniques commonly used by researchers in fields such as biology,
geology, palaeontology, archaeology and the social sciences. Software to carry out these techniques has long been available,
but it is often difficult to use, offering limited help and non-intuitive interfaces. It also frequently has limited graphical
output.

When to use CAP and Ecom


Use CAP when you only have data for the samples. For example, if you have quantified the different types of pottery
at each site, the amount of different metals in several rock samples, or the number of animals in each sample, then you
have data suitable for CAP. If you have this type of sample data and additional information on the environment of the
sample, which could be used to predict the nature of the sample, then the methods offered by Ecom can be used. These
predictive variables might be, for instance, the depth at which the sample was taken, the pH of the soil or water, or the
position relative to another feature; the predictive variables must exist for all the samples in the sample dataset. Ecom
therefore requires two separate data sets, one for the samples and the other for the environmental or predictive variables;


each organised to have the same number of samples in the same order. There must be fewer environmental or predictive variables than there are variables (pottery types, different metals, individual species, etc.) in the sample dataset.

Data organisation
Data to be used in the programs can be organised using standard spreadsheet programs such as Excel, Calc or Google
Spreadsheet. The output from CAP and Ecom is displayed, exported and printed using standard Windows techniques. The
program can open .csv (comma-delimited text files) and .xls (Excel files). The newer Excel format (.xlsx) is not supported,
so if your data set is currently in a spreadsheet of that format, save it out as the older Excel format, or as a .csv file.
Both programs expect data sets arranged with the samples (or quadrat, trench etc.) as the columns, and variables (or
species, chemical composition, etc.) as rows. If your data are the other way round, then you can use the Transpose function
in either CAP or Ecom to change columns to rows and rows to columns. To do this, select the Working Data tab, then
select Transpose in the Type of Adjustment radio box and click on the Submit button (Fig. 96, page 168).
It is advisable to export the working data (i.e. the transposed data set) out of CAP or Ecom. Giving it a unique name
will stop you overwriting the original file later by mistake. To save your working data use File: Export, then reload the
transposed data into the program.
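If you prefer to prepare the file outside the programs, the same transposition can be done in R; a minimal sketch, in which the file names are purely illustrative:

data <- read.csv("mydata.csv", row.names = 1)    # data with samples as rows
write.csv(t(data), "mydata transposed.csv")      # saved with samples as columns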
The ability of CAP and Ecom to transform data allows you to use data sets that have been prepared with rows as samples.
If you have multiple data sets that contain similar data, such as samples from the same site at different times, or from
adjacent samples, then you may find the Pisces List Combiner1 useful. List Combiner will take all your worksheets and combine them into one large .csv file, summing and sorting the values for each variable along the way.

What type of data do you have?


CAP and Ecom can handle a wide range of data types; however some methods require certain data types or data ranges
to work properly. If your data set is small it is easy to look at the data grid and check all the values in each column or row.
If you have a large data set, however, it is not so easy. CAP and Ecom both provide a tab called Summary to help you.
1 www.pisces-conservation.com/softutils.html


On the Summary tab you can review the data in several ways. The first choice is whether to look at the raw or the working
data. Often these will be the same, but if you have transformed the data in any way then selecting Working Data will show
you the effects of that transformation.
Once you have chosen which data set to look at, you can then view the General, Row, Column and, in CAP, Group
statistics. General statistics tell you about the whole data set. These statistics are particularly useful prior to undertaking
many analyses. For example, PCA should not be undertaken on a data set that holds large numbers of zero observations.
The next two options, Row and Column, are useful to show you the range of data in each variable. If any variables are
all 0, then you may want to remove them from the analysis - some analyses will fail, as they will produce a division-by-
zero error. Or, if you have one or two variables whose range is much larger than the other variables you might want to
transform them to bring them within range.
Group summary data is slightly different. This gives you the average value of each variable in that group. This can be
useful in understanding the differences between your groups, and when summarising the results of methods such as
Linear Discriminant Analysis.

Transforming your data


Transformations can be performed on the Working Data tab (Fig. 96, page 168). This method always leaves your
original data unchanged, so it is easy to return to the raw data and undertake another analysis.
You can apply transformations to all the variables at once, or just a single selected variable, by using the Selected Variable
drop-down menu.
The types of adjustment are:
○○ Transform, which performs actions such as logging or rooting the data.
○○ Relative, which will relativise the data in various ways so that all variables are equal in magnitude.
○○ Handling zeros, which allows you to remove rows or columns completely, or selectively remove those with little or no data.
○○ Transpose, which allows you to swap rows and columns. It is best to save the transposed data with a new name and reload, to avoid problems with the grouping.


If your dataset contains zeros, then log transformations will not work. The log +1 options are provided for this reason,
but should be used with caution. Square root transformations work well with data with zeros. Arcsin transformations
require the data to be between -1 and 1. If the program finds a value it cannot transform, it will underline it in red.
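The equivalent transformations are one-liners in R; a minimal sketch on a toy abundance matrix:

x <- matrix(c(0, 1, 10, 250, 3, 0), nrow = 2)    # toy abundance matrix containing zeros
log10(x + 1)                                     # log10 + 1: safe with zero counts
sqrt(x)                                          # square root: also safe with zeros
asin(x / max(x))                                 # arcsine: values must first be scaled into [-1, 1]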

Fig. 96: Applying transformations on the Working Data tab.

You can apply multiple transformations to your data, but be aware that each time you do, it will get more and more difficult
to explain! If you find that you want to start again, the Reload Raw Data button on the Working Data tab will reload
the original data.

Dealing with outliers


Sometimes your data set contains a site that is so different from all the others that it will mask any interesting relationships
between the remaining sites. If this is the case, it is common to remove the outlier from the analysis. An example of this
can be seen in Fig. 12 and Fig. 13, page 28. Removing outliers can be done easily from the Working Data tab. Select
the column or row in the data grid you want to remove, select Handling zeros under the Type of Adjustment, choose
Remove Row or Remove Column, and press the Submit button.


Alternatively, transforming one or more variables, or using a method that is not affected by magnitude, can reduce the
influence of the outlier. It is worth noting down what transformation(s) you have undertaken, and which outliers you have
removed. It can be very frustrating trying to reproduce the perfect plots you got the last time you analysed the data, when
you cannot remember which transformation you used, and which of the samples you removed.

Organising samples into groups


The Groups function allows you to pre-assign your samples to a group. In CAP, groups are used in calculations for
ANOSIM, SIMPER and Discriminant Analysis. For all other methods, the function is simply used to distinguish
between the groups of samples in plots.
To work with groups, load your data and select the Grouping tab. This shows you a grid where you can assign and
unassign samples to groups. There are two ways to assign the samples to a group. The first method is by selecting and
dragging the samples with the cursor; this is easy to use if your samples are in order (i.e. Group 1 samples are in the first
few rows, Group 2 in the next rows, etc.). To select the samples, left-click and hold the mouse button down while moving
over the samples required, and then drag them either to a new group or an existing group. You can also select a block of
samples by clicking on the first sample, then move to the last sample, hold down the Shift button and click again to select
the range of cells required.
The second method uses the right-hand mouse click. When you have selected a sample or samples to move to a group,
click the right-hand mouse button and a pop-up dialog box will appear, offering the option to move the samples to any of
your defined groups, or create a new group.
Dragging or sending the selected samples to the Create Group column will create a new group and open a dialog box
asking you to name it. The program will give the group a unique name and define a default colour to it.
To edit the properties of a group, double click on the name of a group in the Grouping tab. This allows you to change
the name and default colour of the group, as well as set the shape and size of the object used to plot it.
The bottom panel of the tab allows you to sort the data for grouping in three ways:


○○ Data file order – this is the order of the samples in the data set as it is loaded into the program. This is best if your
data are already organised by group or location.
○○ Alphabetical – the names of the samples are sorted alphabetically. This is best for finding a sample when you know
its name.
○○ By row value – this sorts the data in the Grouping tab by the value of the variable in the row that is selected in
the dropdown list to the right. You can change the row you use with the dropdown list, and see the values by using
the Show values check box. This is useful if you want to separate the samples into groups based on the values in a
particular row (perhaps, the most abundant species, or different values of a predictive variable).
Once you have created and/or modified the sample groups, click the Save Group File button. The group information is
saved in a small text file. This file is saved in the same folder as your data set, and carries the same file name, with a .pcg
file extension. The Reset Group File button deletes the saved group information and reassigns all the samples to a single
group. It does not affect the original data. This option can help when you wish to start again, when you have transposed
the data, or if the group file is damaged or lost (for instance, if you move the main data set to another folder, but leave
the group file behind or delete it).

Explore your data


CAP and Ecom are easy to use. This allows you to explore your data freely and become familiar with them. The Compare
menu allows you to examine the relationships between samples. Compare samples shows you what species two
samples (or groups) have in common. The Profile plot, Scatter plot and Matrix plot options allow you to look at the
relationships between samples. Heatmap represents the individual values contained in a matrix as colours, with larger
numbers represented by darker colours. It is useful for displaying a general view of the data, which can help show where the information lies within the matrix.
Don’t be afraid to run several different analyses on the same data, both with and without transformations. Try removing
some of the variables that are only occasionally present. Remember your original data will not be changed; you can always
use the Reload Raw Data option to get back your original untransformed data. It is easy to save out the working data,
with the removed columns or rows, as a new data set.


Data analysis is interesting; you might be surprised what patterns there are in your data. Without exploration, multivariate
techniques tend to only show you what you already knew. If several of the methods give you similar groupings or
classifications, this should give you confidence that your analysis is showing something real. Even if they do not, think
about why not – is it that one method uses quantitative data and another presence/absence? Is it an effect of sparse data
or an outlier?

Dealing with multicollinearity in Ecom


In Ecom, multicollinearity occurs when two or more explanatory variables in the Environmental data set are highly
related to each other. In the Multicollinearity tab, the variables the program has identified as problematic are highlighted
in red. In the Use/Exclude column of the grid, click on the Use cell for the problem variable, to change it to Exclude.
Then, click Remove excluded variables. This will remove the problem variable when you rerun the analysis. Click Reset
to add the excluded variable(s) back in to the data set.
You can also exclude problem variables by going to the Working Environmental data tab, clicking into the relevant
variable row, selecting Type of Adjustment: Handling zeros, clicking Deselect Row, then clicking Submit.
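If you want to check for the problem yourself in R, here is a minimal sketch, assuming env is a data frame with one column per environmental variable; the names and values are our own:

env <- data.frame(depth = c(1, 2, 3, 4, 5),
                  temp = c(10, 12, 13, 15, 16),
                  temp2 = c(20.1, 24.0, 26.2, 29.9, 32.1))  # toy data; temp2 is roughly 2 x temp
cors <- cor(env)                                            # pairwise correlations
round(cors, 2)                                              # inspect the correlation matrix
which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)    # flag highly-correlated pairs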

Getting the most out of your charts


The charts in CAP and Ecom are highly customisable. You can change almost any feature on the graph, from the font to
the background image. The best way to set colours, plot shapes and sizes for groups is by editing the group properties in
the Grouping tab. This sets the default group colours etc. for any plots the data set is used in, rather than just one specific
chart. Where appropriate on a chart, a right-hand mouse click will pop up a menu that allows you to show the perimeter
of a group – this plots a line around all the members of a group.
If your plot for a PCA, DECORANA or Reciprocal Averaging analysis is complicated, you can use the Find Point
page in the left-hand pane of the chart. At the top of that pane you can then choose the samples, groups or other plotted features. By clicking on a sample name, a cross-hair will show over the chart in the position of the point.
The toolbar above the graph has controls to perform many of the commonly-required edits; if you need more, clicking on

171
11: Tips on using CAP and ECOM

the Tools symbol on the toolbar (2nd button from left) will open the Edit Chart dialog shown below (Fig. 97).
In this dialog you can see the situation with a typical PCA chart with 5 user-defined groups. The chart contains a total of

Fig. 97: The Edit Chart dialog box.

172
Applying Multivariate Methods

7 series; the other two series control the vector plots – one for drawing the vectors and one to allow names to be put on
the end of the arrows. Each of these series is plotted separately on the chart; this allows you to edit every aspect of the
chart independently, if you wish.
The Edit Chart dialog has an expanding menu of options in the left-hand pane; Series, Chart, Data, Tools, Animations,
Export, Print and Themes. Each option in this expanding menu opens up a series of tabbed pages covering each major
area of the chart to be edited. Sub-tabs then allow you to edit every aspect of these major areas. You will mainly use
the Chart and Series options. Chart allows you to control the background features of the chart, such as the axes, the
background colour, the chart and axes titles, or the legend. The Series option gives you control over each of the plotted
sets of data. Here you can choose the shape, size and colour used to plot each series.
In the instance above, each of the group series has a different colour. You can change the details of a series by double-
clicking on the series in the central panel of the dialog, or alternatively clicking on the series name in the expanding tree
menu in the left-hand panel of the dialog.
Returning to the other chart options; Data shows you the data used to plot the chart, while Tools allows you to add items
that are not directly related to the plotting of the chart, such as annotations and extra lines to divide areas. Animations
gives you the option to add animated effects to charts.
The Export tab has three sub-areas. Export: Picture saves the chart in a variety of formats. Many of the charts in this
book have been exported using Copy and Paste – this produces an enhanced metafile (.emf) format, which is best for
resizing the image in Word documents and other similar applications. Other formats are better for web presentations, etc.
Export: Native allows you to save a chart in a native format. This is useful when you haven’t made a decision on the final
layout of the chart. You can reopen these charts and edit them using TeeChartOffice 1, a free program from Steema Software.
Finally, Export: Data allows you to export the data used to plot the chart; you can use this to export the data to another
program for plotting.
Print allows you to preview the chart before printing, and Themes allows you to apply a whole range of pre-defined styles
to your chart.

1 https://www.steema.com/downloads/vcl - scroll down to the bottom of the page


Editing dendrograms
Dendrograms are used to show the relationships calculated in methods such as TWINSPAN and clustering. These can
be edited using the Edit Dendrogram button. This dialog (Fig. 98) allows you to change the layout of the dendrogram.
Useful features include the Space Equally option under the Lines tab, which spreads out the dendrogram, so that each
divide in the horizontal direction is given equal weight (this option is also available as the Space Equally Horizontally
tickbox on the Options panel of the dendrogram). Space Equally works well if the top axis has no useful meaning, or
you need to see the relationships more clearly. The Labels tab allows you to control how the labels at the right-hand end
are presented. You can choose to have them coloured by groups, position the labels on the top of the lines or set the stub
lengths. You can copy or save the dendrogram in the Copy/Export tab.

Fig. 98: The Edit Dendrogram dialog.


Suggestions for how to present your methods


Writing up the methods section of a report or paper that uses multivariate analysis is difficult, as it is easy to produce a text that is hard to understand for readers unfamiliar with the methods used. A simple outline of the section might include some of the following:
○○ Explain the nature of your data; is it quantitative, presence/absence, categorical, or a mix of these?
○○ Describe which transformations you undertook, and why.
○○ If any outliers were removed, describe how many, and what made them outliers.
○○ For methods such as Principal Component Analysis, sparse variables should be removed. They rarely add to the
analysis and can cause confusion, particularly in graphical presentations. Explain whether this has been carried out.
○○ Decide if groups should be assigned a priori. Are you looking for evidence that groups exist? Are your samples from
different places? Explain in the methods.
○○ Describe the multivariate method used; many have more than one name, so choose the name commonly-used in
your discipline. Describe the options and any major choices made. For example, in Non-metric MultiDimensional
Scaling you need to say which distance measure you used. For PCA you need to say whether the analysis used the
covariance or correlation matrix.
○○ Groups assigned a posteriori. Having looked at the plots, are there groups present? Are there different groups to those you assigned to begin with?
○○ Did you do any statistical group testing (e.g. ANOSIM)? If so which methods?
Generally, colour charts with or without 3D are excellent in essays, reports, posters and PowerPoint presentations.
Unfortunately, due to the cost of reproduction, black and white, 2D charts are normally required for publication.
Remember that any 3D plot can be displayed as three 2D plots. Further, try to keep the plots simple. With small data sets it
may be possible to present a plot of the ordination of the samples and the eigenvectors all with labels on a single plot.
However, this is often not possible and should certainly be avoided in PowerPoint lectures where your audience may only
have 1 minute or less to examine and understand a graph.


Glossary of multivariate terms


Cross-references in underline.
Analysis of Similarities (ANOSIM) - a non-parametric test which operates on a matrix of dissimilarities between a set
of samples, to test whether there is a significant difference between two or more groups of samples.
Arch effect - a distortion of a CA ordination plot, in which the second axis is an arched function of the first axis. In
biology, it may be caused by the unimodal distribution of species along gradients. One of the main purposes of Detrended
Correspondence Analysis is to remove the arch effect.
Biplot - a graph that includes a plot of both the samples and the variables.
CA - the acronym for Correspondence Analysis.
Canonical Correspondence Analysis (CCA) - a widely-used method for direct gradient analysis primarily developed by
C.J.F. ter Braak. CCA assumes that species have unimodal distributions along environmental gradients.
Categorical variable - a variable that is represented by several different types, for example: North, South, East, West or
Male, Female.
CCA - the acronym for Canonical Correspondence Analysis.
Centroid - the mean or average of a multivariate data set.
Correlation - a measure of the strength of the relationship between variables.
Correlation coefficient - a number which measures the strength of the correlation between variables.
Correlation matrix - a square, symmetric matrix consisting of correlation coefficients.
Correspondence Analysis (CA) - an ordination method, also known as Reciprocal Averaging.
Covariance matrix - a square, symmetric matrix of covariances between variables. The diagonal elements (i.e. the covariance between a variable and itself) are the variances of the variables.
DECORANA - strictly speaking, a tool which implements a Detrended Correspondence Analysis; in practice, DECORANA
is often used to refer to the analysis itself.
Detrended Correspondence Analysis (DECORANA) - an iterative multivariate technique used to find the main factors
or gradients in ecological community data. It is often used to overcome issues such as the Arch effect (q.v.) inherent in
Correspondence Analysis (q.v.)
Dimension - used here as the number of axes used to express the relationship between samples.
Discriminant Analysis (DA) - a technique to assign samples to different groups.
Dispersion matrix - general term used for a matrix holding either the correlations or covariances between variables.
Downweighting - a method in ordination programs to reduce the influence of rare species.
Eigenvalue - a mathematical term for a particular type of variable used in linear algebra. In the methods discussed here
it often measures the variability assigned to a particular axis.
Eigenvector - a vector used in linear algebra. Used here in PCA to show the direction of influence of each variable within
the ordination plot.
Environmental variable - a variable that is believed to influence the structure of the samples.
Euclidean distance - the straight-line distance between two points in a Cartesian coordinate system. It is the distance you would measure with a ruler.
Horseshoe effect - a distortion of a PCA ordination diagram.
Inertia - a measure of the total amount of variance in a data set.
Iteration - a single step in an often potentially-repeated mathematical operation.
Matrix - a set of numbers arranged in rows and columns. A 2-dimensional matrix comprises a grid of numbers.
MDS - an acronym for MultiDimensional Scaling.


Monte Carlo tests - another expression for a Randomisation test.


Multicollinearity - describes the situation in which a number of variables are correlated. In multivariate data sets it can
be difficult to spot.
MultiDimensional Scaling (MDS) - an ordination method in which the distance measure can be chosen. Once termed
Principal Coordinates Analysis.
Multiple Regression - used here as the fitting of a function, usually based on the Least Squares principle, which attempts to describe or “fit” a measured dependent variable as a function of multiple measured independent variables.
Multivariate analysis - any simultaneous analysis of 2 or more variables. A Multiple Regression is not considered a
multivariate analysis as only one dependent (response) variable is studied at a time.
NMDS - the acronym for Non-metric MultiDimensional Scaling.
Nominal variable - a variable whose values are unordered labels; the simplest case is a binary variable, for example on and off.
Non-metric MultiDimensional Scaling (NMDS) - an ordination method that allows the user to specify the distance
measure to be used.
Ordination - the act of ordering objects in a logical fashion.
Orthogonal - used here to describe lines oriented at right angles to one another.
PCA - the acronym for Principal Component Analysis.
Principal components - the axes produced by a Principal Component Analysis.
Principal Component Analysis - an ordination method that produces compound variables from correlated variables,
which can result in the simplification of the plot showing the relationship between samples.
Q-Mode - an analysis of samples to determine which species or other attributes they have in common (see R-Mode).
R-Mode - an analysis of variables (e.g. species) to determine which samples they have in common (see Q-Mode).
RA - the acronym for Reciprocal Averaging, also known as Correspondence Analysis.
Randomisation test - a method of testing for statistical significance by randomly generating data sets to estimate the
likelihood that the observed pattern could occur by chance. Such tests have been developed since the advent of computers
and offer the advantage that they are not dependent on assumptions about the underlying distributions.
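A minimal sketch of the idea in R, using two invented samples and testing the observed difference in means by repeatedly shuffling the group labels:

set.seed(1)
a <- rnorm(20, mean = 10)      # invented sample A
b <- rnorm(20, mean = 11)      # invented sample B
obs <- mean(b) - mean(a)       # the observed difference
perm <- replicate(999, {
  z <- sample(c(a, b))         # randomly relabel the 40 values
  mean(z[21:40]) - mean(z[1:20])
})
mean(c(perm, obs) >= obs)      # approximate one-sided p-value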
RDA - the acronym for Redundancy Analysis.
Reciprocal Averaging - an ordination method also called Correspondence Analysis. It refers to a particular way of
finding the solution to the ordination.
Redundancy Analysis (RDA) - an extension of multiple regression when there is more than one response (species)
variable. If there is only one species then RDA reduces to a simple multiple regression. The input data comprises a biological
data set of samples/species and another set of environmental/explanatory variables. The results are normally presented
as a biplot.
Regression - a method of fitting an equation to data, usually by minimising the sum of squared deviations (Least Squares).
Regression coefficient - a parameter fitted by regression analysis.
Sample score - the position of a sample on an ordination axis.
Segments - in DECORANA, used to describe the subdivision of the axes to remove the arch effect.
Similarity index - a measure of the similarity between two samples. Examples include the Sørensen and Jaccard
indices.
Similarity matrix - a square matrix in which the entries are similarities between samples.
Singular matrix - a square matrix which cannot be inverted. Singular matrices are a problem in multivariate methods.
If your analysis is not possible because your matrix is singular, check to see if two or more of the variables are almost
perfectly correlated.
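A quick way to run that check in R, assuming your variables are the columns of a data frame called dat (a hypothetical name):

r <- cor(dat)                          # correlation matrix of your variables
diag(r) <- 0                           # ignore each variable's self-correlation
which(abs(r) > 0.99, arr.ind = TRUE)   # pairs that may make the matrix singular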
Site score - the position of a site or sample on an ordination axis.
Species score - the position of a species on an ordination axis.
Standardization - the scaling of a variable so that different variables, measured in different units, can be compared.
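In R, scale() is one way to do this: it centres each column on zero and rescales it to unit standard deviation, so variables measured in different units become directly comparable:

z <- scale(iris[, 1:4])  # subtract column means, divide by column SDs
apply(z, 2, sd)          # every column now has standard deviation 1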
Stepwise analysis - an approach in which variables are added to or removed from an analysis one at a time.
Stress - a measure of the ability of a MultiDimensional Scaling ordination to express the similarity and dissimilarity
between the objects. The lower the stress, the better the ordination reflects the similarity.
Transformation - the changing of a variable into a numerical range more suitable for analysis, for example by log or square
root transformation.
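For example, in R, with a vector of invented counts:

counts <- c(0, 1, 4, 100)  # invented abundance counts
log(counts + 1)            # log transformation; the +1 copes with zeros
sqrt(counts)               # square root transformation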
TWINSPAN - the acronym for Two-Way Indicator Species Analysis.
Two-Way Indicator Species Analysis (TWINSPAN) - a classification method derived from Correspondence Analysis.
Vector - a compound variable that has magnitude and direction. It is represented graphically as an arrow. The length is the
magnitude and the arrowhead shows the direction.
Weighted average - an average where the importance of each observation is given an individual weight. These weights
are often the abundance, so the most abundant forms have the greatest influence on the average.
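In R this is weighted.mean(); a minimal sketch with invented species optima weighted by abundance:

optima <- c(5.2, 6.8, 7.4)    # invented species optima
abund <- c(10, 3, 1)          # abundances used as weights
weighted.mean(optima, abund)  # sum(optima * abund) / sum(abund)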

Index

A
ACA - See Agglomerative Cluster Analysis (ACA).
Agglomerative Cluster Analysis (ACA): 1, 2, 102, 138, 139–161.
Analysis of Similarities (ANOSIM): 2, 152, 159, 161, 162–164, 169.
ANOSIM - See Analysis of Similarities (ANOSIM).
arch, or horseshoe, effect: 22–24, 52, 54–56, 71–72.
average-linkage measure: 140, 155.

B
binary measurements: 6.
biplots: 30, 43–44, 48–50, 72, 116–117, 121–122.
blank cells - See zero values.
Bray-Curtis similarity: 73–82, 84–89, 147, 150–151, 162.

C
CA - See Correspondence Analysis (CA).
Canonical Correspondence Analysis (CCA): 2, 104–126, 135, 147.
Canonical Variate Analysis - See Linear Discriminant Analysis (DA).
CAP - See Community Analysis Package (CAP).
categorical variables: 52.
CCA - See Canonical Correspondence Analysis (CCA).
centroids: 34, 58, 93–95, 106–108, 127–128.
Chi-squared test: 53, 54.
city-block distance - See Manhattan distance.
classification equations: 97.
cluster analysis: 8, 43, 138, 139–161, 160 - See also Agglomerative Cluster Analysis (ACA).
coefficient of determination (R-squared): 105, 110, 115.
comma-delimited (*.csv) file: 5, 166.
Community Analysis Package (CAP): iii–iv, 165–175, 186.
  Charts and dendrograms: 171–175.
  Grouping: 169–170.
  Organising data for: 3–4, 166.
  Removing outliers: 168–169.
  Transforming data: 167–168.
  User options:
    DECORANA: 55.
    MDS: 75–76.
    TWINSPAN: 130.
complete-linkage measure: 140, 155.
constrained analysis: 105, 126.
contingency tables: 52–53.
correlation coefficient: 54.
correlation matrix: 20–21, 26, 33, 36, 42, 50–51.
Correspondence Analysis (CA): iii, 1, 8, 23, 50, 52–72, 126, 127–129.
covariance matrix: 20, 26, 33, 40, 45–46, 48, 50–51, 155.
csv - See comma-delimited (*.csv) file.
cut levels: 128–130, 131, 134, 135.
CVA - See Linear Discriminant Analysis (DA).

D
DA - See Linear Discriminant Analysis (DA).
data organisation: 3–4, 166.
data sets - See example data sets, how to obtain.
data transformation: 7, 21, 51.
DCA - See Detrended Correspondence Analysis (DECORANA, DCA).
DECORANA - See Detrended Correspondence Analysis (DECORANA, DCA).
dendrograms: 9, 102, 127–128, 131–138, 139–140, 143–146, 147, 151, 155, 161, 174.
dependent variables: 2, 104–106, 116, 126.
Detrended Correspondence Analysis (DECORANA, DCA): 23, 50, 55, 57, 64–66, 70–72, 132, 171.
dimensions: 20, 27, 37, 42, 73–87.
Discriminant Analysis - See Linear Discriminant Analysis (DA).
discriminant functions: 2, 92, 94–96.
dispersion matrix: 19–20.
dominance: 26, 50, 109.
down-weight rare species: 55.

E
Ecological Community Analysis - See Ecom.
Ecom: iv, 3–4, 165–175, 186.
editing dendrograms: 174.
eigenanalysis: 54.
eigenvalues: 19–21, 24, 26, 106, 115.
eigenvectors: 19, 20, 24, 37–38, 40–41, 47, 50, 106.
empty cells - See zero values.
environmental gradient: 8, 22, 72, 116.
environmental variables: iv, 2, 104–106, 109, 120–123, 126.
environmental vector: 116, 117.
Euclidean distance: 48, 73, 75, 77, 82, 84, 140.
example data sets, how to obtain: 9.
Excel files: 2, 5, 166.
explanatory variables: 1–2, 104–106, 109–110, 123–124, 126, 171.

G
Gower’s measure: 140.
groups: 2, 8, 32–33, 38, 90–93, 96, 97, 102, 118, 139–140, 144, 159, 161, 162–164, 167, 169–171, 175.

H
heat maps: 160, 170.
Hierarchical Agglomerative Cluster Analysis - See Agglomerative Cluster Analysis (ACA).
horseshoe effect - See arch, or horseshoe, effect.

I
independent variables: 105.
indicator species: 127–128, 130–132, 136.
inertia - See total variability.
iterations: 74, 76.

J
Jaccard index: 73, 74, 75, 141.

L
Linear Discriminant Analysis (DA): 2, 90–103, 144, 159, 167.
log transformation: 7, 26, 45, 48, 50, 65, 109.

M
Mahalanobis distance: 34–35, 140.
Manhattan distance: 140, 155.
MANOVA - See Multivariate Analysis of Variance (MANOVA).
McQuitty’s measure: 140.
MDS - See MultiDimensional Scaling (MDS).
Monte Carlo test: 106, 115–116, 120.
multicollinearity: 105–106, 110, 112, 120, 171.
MultiDimensional Scaling (MDS): 8, 18, 50, 52, 73–89, 102, 152, 161.
multiple regression: 2, 104, 105.
Multivariate Analysis of Variance (MANOVA): 90.

N
NMDS - See Non-metric MultiDimensional Scaling (NMDS).
Non-metric MultiDimensional Scaling (NMDS): 1, 23, 73, 82, 84, 87, 118, 126, 151, 175 - See also MultiDimensional Scaling (MDS).

O
ordination: iv, 2, 8–9, 23, 26, 42, 46, 50–51, 52–54, 70–71, 73–75, 89, 104–106, 115–118, 126, 127–129.
orthogonal lines: 122.
outliers: 7, 27–29, 34–35, 82, 91, 105, 168–169.

P
PCA - See Principal Component Analysis (PCA).
percentage variability: 26, 37.
Prediction - Accuracy Table: 101.
presence/absence data: 6, 73, 74, 76, 83, 128, 141, 142.
Principal Component Analysis (PCA): iii, 8–9, 17–51, 54, 75, 84, 143, 167, 171, 175.
principal components: 20, 29–30, 34.
procrustes function: 70–71.
Procrustes method: 70, 84.
pseudospecies: 128–129, 130, 131, 138.

Q
Q-mode analysis: 143.
quantitative data: 20, 33, 45, 51, 73, 74, 128, 142, 171.
quantitative measures: 6, 76.

R
R: iv, 2, 10–16.
  Agglomerative Cluster Analysis (ACA): 160–161.
  Canonical Correspondence Analysis: 123–125.
  Correspondence Analysis: 58, 61–63, 68–71.
  Download code examples: 9.
  heatmaps: 160–161.
  Linear Discriminant Analysis: 101–103.
  MDS: 84–86.
  Opening files in: 5–6.
  Organising data for: 3, 4–5.
  PCA functions: 24–25, 29–32, 34–35, 39, 48–49.
  Procrustes in: 70–71.
  Transformation in: 7, 26.
RA - See Reciprocal Averaging (RA).
randomisation tests: 2, 126, 144, 146, 152, 159, 164.
rare species: 8, 51, 55 - See also down-weight rare species.
RDA - See Redundancy Analysis (RDA).
Reciprocal Averaging (RA): 54, 118, 171 - See also Correspondence Analysis (CA).
Redundancy Analysis (RDA): 118–119.
regression analysis: 2, 17, 104, 114, 118.
Renkonen measure: 140, 145, 146.
R-mode analysis: 87, 143–144.
R packages: 14–16.
  for cluster analysis & heatmaps: 160.
  for PCA: 24–25.
R-squared - See coefficient of determination (R-squared).
RStudio: 10–16.

S
sample dendrogram: 127, 131–133, 145.
sample scores: 43–44, 55, 104.
scatter plots: 101, 112, 170.
scree plot: 29–30, 39.
segments (in DECORANA): 55.
semi-quantitative measures: 6.
seriation: 59–61, 77–82.
similarity indices: 8, 23, 73–76, 83, 89, 141–142, 144, 151.
single-linkage: 139–140, 145–146, 155.
site scores: 104.
Sørensen similarity measure: 73, 75, 76, 83, 89, 141–142.
species dendrogram: 132.
species score: 55, 128.
spreadsheet: 3, 4–5, 123, 166.
squared-chord distance: 74, 140.
square root transformation: 7, 22, 45–46, 50–51, 68, 168 - See also transformation.
standardization: 84 - See also Wisconsin double standardization.
stepwise analysis: 105, 120, 126, 135, 147.
stress: 74–78, 79, 84, 89.

T
taxi-cab distance - See Manhattan distance.
total variability: 20, 26, 42, 48, 121.
transformation: 7, 8, 21–22, 26, 36, 45, 46, 48, 50, 51, 68, 109–110, 131, 144, 167–168 - See also square root transformation.
triplot: 107, 108.
TWINSPAN (Two-Way Indicator Species Analysis): 1, 52, 127–138, 143, 174.

V
variability: 18, 20, 26–27, 37, 42, 48, 51, 89, 106, 115.
variance: 7, 20, 27, 29, 30, 34, 45–46, 48–51, 54, 84, 114–115, 140 - See also variability.
variance-covariance matrix - See covariance matrix.
variance inflation factor (VIF): 106–107, 110, 112, 120.
vector: 5, 38, 41, 110, 116–117, 122, 173 - See also eigenvectors; See also environmental vector.
VIF - See variance inflation factor (VIF).

W
Ward’s measure: 140, 145, 147.
Wisconsin double standardization: 84.

X
xls files - See Excel files.

Z
zero values: 3, 7, 8, 22, 48, 50–51.

Pisces software titles & training


Software packages
Pisces staff have many years’ experience in research science and ecology; we both produce software and undertake
research and consultancy using our own software, so we understand how important ease of use and great presentation are.
In creating our software we have tried to keep the user interface as intuitive as possible, without sacrificing power or
speed. Our range of software for statistical analysis includes:
Community Analysis Package (CAP): a multivariate analysis package for Windows PCs, suitable for undergraduate
and post-graduate students and researchers in a wide range of sciences. It offers NMDS & PCA, Reciprocal Averaging,
DECORANA & TWINSPAN, Similarity & Distance measures, ANOSIM & SIMPER, Agglomerative & Divisive Cluster
Analysis, Discriminant Analysis and Association Analysis - plus the ability to run R code from within the program.
More information at www.pisces-conservation.com/softcap.html
Ecom offers you a range of analytical techniques to detect, visualise and order relationships in multivariate data,
incorporating the possible causal factors within the analysis. It includes Canonical Correspondence Analysis (CCA),
Redundancy Analysis (RDA), Multiple Regression (MR) and Forward/Backward Stepwise Regression - plus the ability to
run R code from within the program.
More information at www.pisces-conservation.com/softecom.html
Fuzzy Grouping offers the fuzzy clustering methods that are now becoming popular for analysing multivariate data.
Includes Fuzzy C-means, Fuzzy Ordination and Fuzzy Discriminant Analysis.
More information at www.pisces-conservation.com/softfuzz.html
Species Diversity & Richness (SDR) offers over 70 methods in one easy-to-use program, allowing you to use the latest
methods to gain insights into your data. For example, it offers 11 Alpha, 8 Beta, 14 Evenness and 14 Richness indices,
many with bootstrapped confidence intervals.
More information at www.pisces-conservation.com/softdiversity.html

QED Statistics offers a comprehensive range of statistics, chosen to meet the needs of all students, researchers and post-
grads wanting to analyse quantitative data. The program holds your hand, from data input, through single sample stats,
right up to General Linear Models.
More information at www.qedstatistics.com
Aside from statistics, we also produce a wide range of other software titles for analysis of scientific data, as well as over
60 classic scientific works as e-books, and a series of ready-made lectures on ecology.
Information on all of these is available on our website at www.pisces-conservation.com

Multivariate statistical methods tuition


We have designed our CAP and Ecom packages to be simple to use and to produce clear, unambiguous output. However,
even the most experienced researcher will know that it is not always easy to interpret the results of a multivariate analysis.
To help you understand how CAP and Ecom work, and what their output means, Pisces run one-day workshops on
multivariate statistics, either at our office in Lymington or at your site. The course has been run all over the world since
August 2001 and has proved successful and rewarding for the participants. We prefer to use your data to explain the
methods and explore the results, so that you get the most out of the course.
More information at www.pisces-conservation.com/statswork.html
Quotes from our users:
“Thank you very much for including me in your workshop. It was a great day and very helpful for my research. I will be an avid user of Ecom
now”.
“We found it to be a concise informative workshop which we could tailor to our own needs, covering a wide range of multi-variate techniques.
Using our own datasets allowed for better interpretation of the program and gave us a chance to deal with problems usually encountered while
running such statistics”.
“The one-day course was very informative and covered theory, data manipulation and interpretation of outputs from the user-friendly software”.

Pisces Conservation Ltd


IRC House, The Square
Pennington, Lymington
Hampshire, SO41 8GN, UK
phone: +44 (0)1590 674000
fax: +44 (0)1590 675599
email: [email protected]
web: www.pisces-conservation.com
