LBOE2112 Module 2 Multivariate Data Analysis - 2024-2025 - All (1)

LBOE2112 is a course on multivariate data analysis in biological systems, taught in English with a focus on interactive learning and practical applications. The course includes lectures and practical work, covering statistical tools and their application in ecology, culminating in an open-book exam. Key topics include data visualization, statistical inference, and various multivariate analysis techniques.

LBOE2112 - Data analysis and modeling of biological systems

Module 2: Multivariate data analysis
Ecole de Biologie
Nicolas Schtickzelle
2024-2025

12h lectures: Nicolas Schtickzelle ([email protected], 010/47.20.52, Carnoy C.157)
18h practical work: Hadrien Glibert ([email protected], Carnoy A.104)
www.nicolas-schtickzelle.net
LBOE2112 is taught in English
• Some courses of the Master BOE program are
taught in English.
LBOE2112 is one of them.
• Still, what matters most is the content,
not the language.
• I’m not a native English speaker, and you
probably aren’t either. This means we will all make
language mistakes. They don’t matter.
• So don’t be afraid to ask questions, to use
French words when you need…
https://cheezburger.com/8553617408/funny-web-comics-when-english-is-not-your-native-language
LBOE2112 on Moodle
• The Moodle website is the central point of the course documentation:
– Visuals of the theoretical course
– Practical work: instructions, data sets…
– Various additional resources (e.g. summary of matrix algebra)
– Instructions and announcements
– …
• Register as soon as possible:
https://moodle.uclouvain.be/course/view.php?id=949

This course is intended to be interactive
• The course includes a series of anonymous, real-time surveys conducted
with an online polling tool
• You are cordially invited to participate actively; this helps me gauge the
group’s understanding and lets you situate your own level

• Your answers are not marked but you need to login (via UCLouvain)
so I can send your individual report for you to review your responses later
My teaching philosophy for statistics
• Statistics are decision support tools, essential in
science, particularly ecology
• You need to know which tool to use for a specific
task, how to handle it, what result you can expect
and how to troubleshoot its use
• You do not need to know the internal machinery
• Same for statistics:
– Select the right tool for the job
– Understand its requirements and inputs
– Understand its outputs
– Be able to use it in practice

McKillup (2011) Statistics explained, ISBN 9780521183284
https://hikingartist.com/thrive/nail-screw/
In practice
• This course combines:
– Lectures to set the concepts and illustrate the basic statistical methods that can be
used to analyze multivariate data
– Practical work sessions to put this knowledge into practice, using statistical software
• The evaluation is a written open-book exam on Moodle
– "Theoretical" type (5/10): reflection questions to assess your level of understanding
of the ins and outs of the statistical analysis process of multivariate biological data
– "Practical" type (10/20): perform statistical analysis of concrete cases of multivariate
biological data: what analysis? why? how? interpretation? limits?...
– Closed questions (true/false, MCQ, numerical response, etc.)
– Open questions
A cautionary note
• Course objective = informed use of statistics,
not recitation of formulas,
not blind application of recipes (“cooking”)
• Statistics are not memorized, they are understood
• Understanding is built little by little, not in one day!

https://www.cardstacker.com/#/tallest-house-of-cards/

Facultative supporting books

• https://shop.elsevier.com/books/numerical-ecology/legendre/978-0-444-53868-0
The classical book for multivariate data analysis in an ecological context,
with an interesting review of some basic statistics principles.
• https://link.springer.com/book/10.1007/978-3-319-71404-2
The R companion of the “Numerical ecology” book, so very much tool (R) oriented.
• https://link.springer.com/book/10.1007/978-0-387-45972-1
Many case studies. Useful to relate to similar cases, but “crucial to
understand the background provided in the first part of the book” (page 3).

Except the 1st one, all can be downloaded in pdf for free using your UCLouvain login
Facultative supporting books

• https://link.springer.com/book/10.1007/978-1-4419-9650-3
Focussed on R, more mathematical, not focussed on ecology.
• https://link.springer.com/book/10.1007/978-3-031-13005-2
• https://link.springer.com/book/10.1007/978-3-319-24277-4
Not on multivariate statistics but on the very powerful ggplot2 R library
to visualize data.

All can be downloaded in pdf for free using your UCLouvain login
A note on statistical software
• There are many statistical software packages, e.g.

• Some are more popular than others

https://r4stats.com/articles/popularity/

https://hikingartist.com/thrive/nail-screw/
A note on statistical software
• In ecology research, R has become the norm; previously it was SAS
• R is free and can be very powerful, in particular because of the huge amount
of available libraries

Zelterman (2022) Applied Multivariate Statistics with R


• But this also makes it dangerous
in the wrong hands
• R needs more user investment
to produce correct results

Be sure to read the library documentation.
In particular, check the default settings !!
What this module contains
• Chapter 1 − Reviewing basic statistical knowledge
• Chapter 2 − Multivariate data and their visualization
• Chapter 3 − A first technique in detail:
ordination by Principal Component Analysis (PCA)
• Chapter 4 − Ordination of a contingency table:
Correspondence Analysis (CA)
• Chapter 5 − Expanding to other ordination techniques:
Multidimensional scaling
• Chapter 6 − Grouping objects: Clustering
• Chapter 7 − Assigning objects to groups: Discriminant analysis
Chapter 1 − Reviewing basic statistical knowledge

What you should already master
• Data terminology: variable, individual, observation, population, sample,
distribution, degrees-of-freedom, expected value,…
• Concept of probability
• Types of scales on which variables are measured
• Principle of statistical inference: sample -> population, p-value,
type I & II errors,…
• Basic statistical methods:
– For testing whether groups differ in average: t-test, ANOVA, Chi²
– For testing the relation between two variables: correlation, linear regression

Let’s recap that !


Data terminology
• Individual/object: the “thing” on which data are recorded
• Observation: a measurement of an attribute on an individual
• Variable: attribute measured for each observation
• We can distinguish:
– Response/dependent variable (Y): a random variable that we are interested in measuring
– Explanatory/independent variable (X): a variable representing the change from the
norm and assumed to explain Y; it can be
• Controlled
• Not controlled (random)
• Population: all the existing individuals
• Sample: a subset of the population

Image credit: Ferran Sayol (https://www.phylopic.org/images/2ff4c7f3-d403-407d-a430-e0e2bc54fab0/charadrius)
Measurement scales of variables
• Variables are of different nature:
– A continuous variable has an infinite number of possible values
– A discrete variable has a limited number of possible values
– A numerical variable contains numbers
– An alphanumerical variable contains text
• Variables can be measured on different scales:

Scale    | Order     | Rank differences | Meaning of 0
Nominal  | Undefined | na               | na
Ordinal  | Defined   | = or ≠           | na
Interval | Defined   | =                | Arbitrary
Ratio    | Defined   | =                | Natural
Distribution of a variable
• Distribution: the collection of probabilities associated with the possible
values for a random variable: P(Y=y)

[Figure: example distributions of a discrete variable and a continuous variable]
The correct encoding of data
• A dataset should be formatted as a table with
– variables as columns
– individuals as rows
Variable 1 Variable 2 … Variable p
Individual 1
Individual 2

Individual n

• All cells must be filled in; if necessary, with a “missing value” code
(“NA”, “[space]”, “.”,… depending on the software)
Sources of variation in the data
• To ensure that a valid conclusion can be inferred about the population,
a good estimate of the random variation of a variable must be obtained
because it serves as the reference to determine when effects are significant
• The variation measured for a random variable depends on:
– The degree of intrinsic variation of the variable (natural variation)
– The precision of the measurement technique: random error + possible systematic bias
• Taking more precise measurements and more observations on independent
individuals (replication) helps to better quantify the effect of chance
(random variation) and improves the ability to make statistical inferences
• The set of individuals must be a representative sample of the population
Characterize a distribution
• Central moments (location):
– Mean: µ = (1/N) × Σᵢ Yᵢ
– Median: q0.5, with 50 % of observations below and 50 % above
– Quantile qx: x % of observations below, (1−x) % above
– Mode: most frequent value

Zar (2011) Biostatistical analysis
Characterize a distribution
• Dispersion moments:
– Range = max − min
– Inter-quartile range: EQ = q0.75 − q0.25
– Variance: σ² = (1/N) × Σᵢ (Yᵢ − µ)²
– Standard deviation: σ = √σ²
– Coefficient of variation: cv = σ / µ
– Note: variance estimator on a sample: S² = Σᵢ (Yᵢ − Ȳ)² / (n − 1)

McKillup (2011) Statistics explained
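These dispersion measures are easy to verify numerically. The course's software is R, but the arithmetic is language-neutral; here is a minimal sketch in Python/numpy on made-up measurements:

```python
import numpy as np

y = np.array([4.0, 8.0, 6.0, 2.0, 10.0])   # hypothetical measurements

value_range = y.max() - y.min()   # range = max - min
var_pop = y.var(ddof=0)           # population variance sigma^2 (divide by N)
sd = np.sqrt(var_pop)             # standard deviation sigma
cv = sd / y.mean()                # coefficient of variation
s2 = y.var(ddof=1)                # sample estimator S^2 (divide by n - 1)

print(var_pop, s2)  # 8.0 10.0
```

Note how S² (the n−1 estimator) is larger than σ²: the sample formula corrects the downward bias of the population formula applied to a sample.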
The normal distribution
• Many natural continuous variables follow a normal distribution
• A normal distribution is characterized by its mean and variance: 𝑌~𝑁(𝜇, 𝜎 2 )

https://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg

• A key property is that the proportion of observations
falling in intervals of σ around µ is the same for every
normal distribution: ~68 % within ±1σ, ~95 % within ±2σ,
~99.7 % within ±3σ (the 3-sigma rule)

https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg
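The fixed proportions of the 3-sigma rule can be recomputed from the normal cumulative distribution; a small Python sketch using only the standard library (the helper name `within` is made up for illustration):

```python
import math

def within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # P(|Y - mu| < k*sigma) = erf(k / sqrt(2)) for any normal distribution
    return math.erf(k / math.sqrt(2))

print(round(within(1), 4))  # 0.6827
print(round(within(2), 4))  # 0.9545
print(round(within(3), 4))  # 0.9973
```

The result does not depend on µ or σ, which is exactly why the rule holds for every normal distribution.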
Center and standardize a normal distribution
• Centering a variable means:
– setting the mean µ to 0
• Standardizing/normalizing a variable means:
– setting the mean µ to 0
– setting the variance σ² to 1
by computing Z = (Y − µ) / σ
That is: Z is distributed according to a standard normal:
Y ~ N(µ, σ²)  →  Z ~ N(0, 1)

Useful for comparing variables measured on different scales (e.g. cm and kg)
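The transformation above can be sketched in a few lines (Python/numpy as a stand-in for the course's R; the numbers are hypothetical):

```python
import numpy as np

# Hypothetical sample, e.g. body lengths in cm
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

mu, sigma = y.mean(), y.std()   # location and scale of the sample
z = (y - mu) / sigma            # Z = (Y - mu) / sigma

# After standardization the mean is ~0 and the variance ~1,
# whatever the original units were
print(z.mean(), z.var())
```

Applying the same operation to a kg-scale variable would also yield mean 0 and variance 1, which is what makes differently scaled variables comparable.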
2 types of error are possible
• When you make a decision about a hypothesis based on a given sample,
you are never sure you are making the right decision!!
• You can make a mistake in 2 ways:

" © S. McKillup 2005, reproduced with


permission of Cambridge University Press"

• It is not possible to prove that a hypothesis is true!!
It is only retained until, maybe one day, it is proven false.
Chapter 2 − Multivariate data and their visualization

35
Uni- vs multivariate data
• Univariate data: observations of a single random variable are recorded on a
random sample of individuals
• Example: shorebirds, comparing Charadrii and Scolopaci

Baguette (…) & Schtickzelle in prep.

Do Charadrii and Scolopaci differ in
average beak length ?
Uni- vs multivariate data
• Is the raw beak length the most adequate variable to test ?

• Somehow, we need to correct for body size
Uni- vs multivariate data
• The basic body size measures for shorebirds

[Figure: the basic body size measures, e.g. weight, toe length.
NB: in reality, specific bird positions must be used]

• Multivariate data: observations of several random variables are recorded
on a random sample of individuals
Multivariate data analysis
• Multivariate statistical analysis is the
simultaneous statistical analysis of a collection of variables,
which improves upon separate univariate analyses of each variable
by using information about the relationships between the variables
• Many methods are descriptive / exploratory
and their results apply to the analyzed sample
and are not meant to test specific hypotheses about the population
• In ecology, descriptive multivariate methods are often used to pretreat a set
of variables to be used in a further inferential method (e.g. a linear model),
in particular to deal with the multicollinearity of explanatory variables

40
Multivariate data analysis
• Many methods differ in subtle ways,
generating some confusion
• No two books about multivariate
data analysis cover the same methods
• For this course, I chose the basic methods,
those that are most likely to be of use to you
• But you may need other methods for specific
study cases

McGarigal, Cushman & Stafford (2000) Multivariate Statistics for Wildlife and Ecology Research
The first step:
visualizing multivariate data:
• The first and key step to analyse multivariate data is to look at them
• It is essential for several reasons:
– Provide an overview of the data
– Detect errors or outliers
– Check the distribution of response variables
– Suggest relations among variables
– Determine / check the shape of the relations among variables
–…
• Data visualization and exploration often take > 25 % of the analysis time
Visualizing multivariate data:
pairwise relations
• Do you think these 5 shorebird body size variables are independent ?
• How can we check that ?
• Pearson (linear) correlation test: impact of outliers

One species with wrong body size encoding (mm instead of cm)
makes the whole correlation vanish

[Figure: scatterplot]
Visualizing multivariate data:
pairwise relations
• Do you think these 5 shorebird body size variables are independent ?
• How can we check that ?
• Pearson (linear) correlation test: impact of a non-linear relation

The relation is not linear because weight is a volume, not a length.
Use weight^(1/3)

[Figure: scatterplot]
Visualizing multivariate data:
pairwise relations
• Do you think these 5 shorebird body size variables are independent ?
• How can we check that ?
• Pearson (linear) correlation test:

45
Visualizing multivariate data:
use the best plot(s)
• There are so many ways to format a scatterplot
• Find the right one(s) to show the most important properties of your data


Zelterman (2022) Applied Multivariate Statistics with R

46
Visualizing multivariate data:
correlation matrix
• A Pearson correlation matrix is an easy way to see all the pairwise
correlations at once

Visualizing multivariate data:
correlation matrix
• A Pearson correlation matrix is an easy way to see all the pairwise
correlations at once

Correlation vs covariance
Correlation is a measure of association that is
independent of the variable scales:

r(X1,X2) = Σᵢ (X1ᵢ − X̄1) × (X2ᵢ − X̄2) / [(n−1) × S1 × S2]

It is the covariance of the standardized variables:

r(X1,X2) = Σᵢ Z_X1,i × Z_X2,i / (n−1),  where  Z_X,i = (Xᵢ − X̄) / S
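The identity "correlation = covariance of the standardized variables" is easy to check numerically; a minimal Python/numpy sketch on hypothetical data (the course itself works in R):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical variables
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
n = len(x1)

# Pearson correlation computed directly
r = np.corrcoef(x1, x2)[0, 1]

# Standardize each variable with the sample SD (n-1 denominator), then
# compute their covariance: it equals the correlation
z1 = (x1 - x1.mean()) / x1.std(ddof=1)
z2 = (x2 - x2.mean()) / x2.std(ddof=1)
r_from_z = (z1 * z2).sum() / (n - 1)

print(np.isclose(r, r_from_z))  # True
```

This is why a PCA on standardized variables and a PCA on the correlation matrix are one and the same thing.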
Visualizing multivariate data:
correlation matrix
• A Pearson correlation matrix is an easy way to see all the pairwise
correlations at once

Reminder: compare the correlation matrix
with the covariance matrix
Visualizing multivariate data:
correlation matrix
• A Pearson correlation matrix is an easy way to see all the pairwise
correlations at once

There are many ways to format such a correlation
matrix view. You will see some others during the practical
sessions, and the internet is full of examples.

R is very powerful, use it,
but understand what you do!

51
Visualizing multivariate data:
3D view of 3 variables
• We can visualize these 3 relations at once using a 3D graph
• Making the graph interactive helps to look at this
potentially more complex set of relations
Visualizing multivariate data:
3D view of more than 3 variables
• To view more than 3 variables, we can only rely on combinations of 3D plots
The problem of missing data
• Missing data can rapidly create a major problem in multivariate data analysis
Variable Variable Variable Variable Variable Variable Variable …
1 2 3 4 5 6 7
Individual 1
Individual 2 NA
Individual 3
Individual 4 NA
Individual 5 NA
Individual 7 NA

54
55
How to deal with missing data
• Option 1 ─ complete-case analysis / listwise deletion:
it discards all the observations for which at least one variable is missing
Variable Variable Variable Variable Variable Variable Variable …
1 2 3 4 5 6 7
Individual 1
Individual 2 NA
Individual 3
Individual 4 NA
Individual 5 NA
Individual 7 NA

• Problem solved but at a potentially high cost in terms of sample size and
estimation bias if the discarded observations are not a random subsample 56
How to deal with missing data
• Option 2 ─ available-case analysis / pairwise deletion:
similar but it considers only the variables involved in the current analysis,
and discards observations for which at least one of them is missing
Variable Variable Variable Variable Variable Variable Variable …
1 2 3 4 5 6 7
Individual 1
Individual 2 NA
Individual 3
Individual 4 NA
Individual 5 NA
Individual 7 NA

• Problem solved but the data set changes among analyses 57


How to deal with missing data
• Option 3 ─ imputation:
missing values are replaced by plausible values and the whole data set
is then analyzed
• There are various methods to perform imputation, e.g.:
– Use the mean of the variable
– Use a linear model to predict the missing value
– Use multiple imputation

https://amices.org/
• None is perfect: imputed values are not real measurements,
missing information cannot be created from nothing
• So, care must be exercised, particularly if missing values are frequent
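Options 1 and 3 can be contrasted in a few lines; a minimal Python/numpy sketch on a hypothetical data matrix (the course works in R, e.g. with the mice package at https://amices.org/):

```python
import numpy as np

# Hypothetical matrix: rows = individuals, columns = variables, NaN = missing
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Option 1: listwise deletion keeps only the complete rows (here 2 of 4)
complete = X[~np.isnan(X).any(axis=1)]

# Option 3 (simplest form): replace each missing value by its variable mean
col_means = np.nanmean(X, axis=0)
X_imp = np.where(np.isnan(X), col_means, X)
```

Listwise deletion halves the sample here, while mean imputation keeps all rows but shrinks the variance of the imputed variables: neither creates the missing information.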
Chapter 3 − A first technique in detail:
ordination by Principal Component Analysis (PCA)

Let’s go back to our shorebirds example
• How can we measure body size on shorebirds, to use it as some correction
factor to convert raw beak length into a relative measure ?
• 5 body size variables are highly correlated

[Figure: scatterplots of the 5 body size variables, e.g. weight, toe length]
Principal Component Analysis (PCA):
a method to summarize a set of correlated variables
• Aim: reducing the dimensionality of a multivariate data set while accounting
for as much as possible of the variation present in the original data set
• Method: transforming to a new set of variables (the principal components),
that are linear combinations of the original variables, which are uncorrelated
and are ordered so that the first few of them account for most of the
variation in all the original variables
• In other words, PCA combines variables into several independent principal
components (PCs) that capture the highest variation existing in the data
• It is an exploratory technique that provides sample PCs (not population PCs)
that are not tested for any hypothesis
Simplest example: 2 variables
• 2 variables (Body size & wing length)

• First step: standardize the variables so that units are comparable (unit = 1 SD)
[Note: this is done by default when PCA is performed on the correlation matrix]
Simplest example: 2 variables
• 2 variables (Body size & wing length)
Original variables → Principal components

With 2 variables, PCA only rotates and rescales the data.
Euclidean distances among individuals are preserved.
Let’s check it.
Barycentre = (0,0)

• PC1 accounts for the most variation that one axis can represent (i.e. the longest dimension)
• PC2 accounts for the most remaining variation, with the condition of being independent from
PC1 (i.e. perpendicular or orthogonal)
Simplest example: 2 variables
[Figure: from original variables to principal components:
rotate, rescale PC1, rescale PC2, possibly invert PC2.
The sign of PCs is indeed arbitrary]
Simplest example: 2 variables
• 2 variables (Body size & wing length)
Original variables Principal components Scree plot

• PC1 accounts for the most variation that one axis can represent (i.e. longest dimension)
• PC2 accounts for the most remaining variation with the condition to be independent from
PC1 (i.e. perpendicular or orthogonal) 67
Principal Component Analysis (PCA):
the general working principle
Whatever the number of variables, the PCA always proceeds the same way:
1. Compute the pairwise correlation matrix of the p variables
2. Find PC1 as the linear combination of the p variables
(i.e. an axis in a p-dimensional space) that accounts for most of the variation
that one axis can represent
3. Find PC2 as the axis that accounts for most of the remaining variation
and is independent of PC1
4. Find PC3 as the axis that accounts for most of the remaining variation
and is independent of PC1 and PC2
5. Continue up to PCp
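The steps above boil down to an eigen-decomposition of the correlation matrix. A minimal Python/numpy sketch on simulated data (the course uses R; variable names here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
base = rng.normal(size=n)
# Hypothetical data: 3 variables sharing a common component, hence correlated
X = np.column_stack([base + 0.3 * rng.normal(size=n) for _ in range(3)])

# Step 1: standardize and compute the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Steps 2-5: the eigen-decomposition of R gives all PCs at once:
# eigenvectors = the axes (linear combinations of the variables),
# eigenvalues = the variance each axis captures
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]            # sort PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                          # coordinates of individuals on the PCs
explained = eigvals / eigvals.sum()           # fraction of total inertia per PC
```

Two properties worth checking: the eigenvalues sum to p (here 3), and the PC scores are mutually uncorrelated, with variances equal to the eigenvalues.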
A more complex example with a clear interpretation:
visualization of the data
• The 5 body size variables measured on 26 shorebirds species

A more complex example with a clear interpretation:
checking PCA requirements
• Visualization of the data confirmed that:
– There are no missing data
– There are no major outliers
– The variables are linearly correlated
– The variables are not on the same scale
• We can perform a PCA on the
correlation matrix

A more complex example with a clear interpretation:
how much variation is explained by the PCs?
• First, we look at the scree plot to see how much of the variation existing
in this dataset the PCs can explain
[By default, there are as many PCs as original variables,
so together the PCs explain 100 % of the original variation]
• PC1 explains a huge 88.5%
• PC1+PC2 explain 95.6%
• So, PCA is very efficient in summarizing the variation in this data set
• How many PCs to keep ?
There is no firm rule (we’ll come back to that later)
• Here, PC1 could be enough.
But let’s keep PC1+PC2 to look at the meaning of PC2
A more complex example with a clear interpretation:
how are variables and PCs linked?
• Second, we interpret the PCs by looking at
the loadings matrix and at the correlation circle
(loadings = correlations between PCs and variables)
• The angle between 2 arrows approximates the degree of correlation of the
2 variables (R = cos θ); it is only an approximation because
only 2 dimensions (PC1 and PC2) are considered
• Interpreting the proximity between the arrow apices as correlation is incorrect
A more complex example with a clear interpretation:
what is the biological meaning of the PCs?
[Figure: correlation circle of the 5 variables]

• PC1 is a body size index: strongly & positively correlated with all 5 variables
• PC2 is an axis contrasting “big-heavy-stocky” species
vs “smaller-lighter-slender” species
A more complex example with a clear interpretation:
how are species positioned in the PC1 × PC2 space?
• Third, we look at the score plot, which ordinates the species
= value of an
individual on a PC

• There are no clearly separated groups of individuals


A more complex example with a clear interpretation:
let’s go back to our original aim: correct for body size
• Using the body size index (PC1), we can compute a beak length that is
relative to body size and use it to correct beak length

A more complex example with a clear interpretation:
let’s go back to our original aim: correct for body size
• The approach is to compute first the beak
length that is expected for a species given
its body size using a linear regression
• Then some deviation from this expected
value is computed:

RTE beak length = ln( beak length / expected beak length )
A more complex example with a clear interpretation:
let’s go back to our original aim: correct for body size
• Let’s test whether Charadri and Scolopaci differ in average RTE beak length

Scolopaci have a longer beak than Charadrii
for their body size,
but not a longer beak in absolute terms
Explore a bit deeper the PCA statistics:
splitting the correlation matrix into scale and direction
• PCA splits the correlation matrix of the variables into
1. a scale (length) part: eigenvalues
2. a direction part: eigenvectors

81
Explore a bit deeper the PCA statistics:
how much variation is explained by the PCs?
• The total variation in the original data set is called the inertia
• It is in fact the total variance summed over the n observations
of the p standardized variables:  I = (1/n) × Σᵢ Σⱼ (Zᵢⱼ)² = p
• PCA splits the inertia into eigenvalues λ,
one for each PC in decreasing order:  I = λ1 + λ2 + … + λp
• So λ1 quantifies how much variation PC1 explains, i.e. the variance of PC1
• It is easier to express this as a relative value,
i.e. explained inertia/variance:  Explained variance = (λα / p) × 100 %
[the equivalent of R² in linear regression]
Explore a bit deeper the PCA statistics:
how many PCs to keep?
• There is no well-accepted objective way to decide how many PCs are enough
• Several possible approaches:
– Keep the PCs with λ > 1 [1 is the explained inertia of one of the original variables]
– Keep the PCs that explain more variance than expected
(broken stick model)
– Keep the PCs that together account for a certain fraction
of the total inertia that you fix: 70%, 80%, 90%,...
– Keep the PCs up to the first one after a large decrease in inertia (“down to the elbow”)
– Use a resampling procedure to get a p-value on λ for each PC [see next slide]
• This important choice must be made by considering the aim of the PCA
and your understanding of the data and the scientific context
• Importantly, the first m PCs are the same whether we retain all possible p
PCs or just the first m. In other words, the choice can be made a posteriori
A short digression:
the use of permutation / randomization tests
• Permutation / randomization tests can be used to test the significance of
virtually any statistical parameter, in particular when the test statistic does
not follow a known distribution
• Very easy to understand and relatively easy to perform with a computer:
1. Shuffle/randomize the dataset to suppress any property you want to test,
i.e. create a copy of your dataset under H0
2. Compute the statistic of which you want to test the significance
3. Repeat a large number of times (>1000)
and compute the p-value as the proportion
of the simulations for which
the simulated statistic > observed statistic
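The 3 steps above can be sketched for a concrete case, testing the significance of a correlation; a minimal Python/numpy illustration on simulated data (the course's PCA version of this idea is implemented in the PCAtest R package):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = x + rng.normal(size=40)          # hypothetical pair with a real association

# Step 2 first: the observed statistic
obs = abs(np.corrcoef(x, y)[0, 1])

# Steps 1 & 3: shuffle y to destroy the association (a copy of the data under
# H0), recompute the statistic, and repeat many times
n_perm = 999
perm = np.array([abs(np.corrcoef(x, rng.permutation(y))[0, 1])
                 for _ in range(n_perm)])

# p-value = proportion of H0 replicates at least as extreme as the observed
# value (the +1 counts the observed dataset itself among the replicates)
p_val = (np.sum(perm >= obs) + 1) / (n_perm + 1)
```

Because x and y really are associated here, the shuffled correlations almost never reach the observed one and the p-value comes out small.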

A short digression:
the use of randomization tests
• The “PCAtest” R package performs such tests for PCA

• PC1 is highly significant, and all 5 variables are significantly correlated with it
• PC2 is still interesting to keep
86
Explore a bit deeper the PCA statistics:
how are variables combined into PCs?
• The PCs follow the eigenvectors (in a space with p dimensions)
of the correlation matrix: 𝑢1 , 𝑢2 ,… 𝑢𝑝
• Each eigenvector
– starts at the [0,0,…,0] position (the barycentre of the data)
– ends at the [u(1,α), u(2,α), …, u(p,α)] position
– has a length of 1 (i.e. ‖uα‖ = 1)
• All eigenvectors are orthogonal/independent to each other: uα ⊥ uβ
• The PCs or factorial axes are the lines matching the eigenvectors

Example: PC1ᵢ = 0.46 × Wing lengthᵢ + 0.44 × Toe lengthᵢ + 0.41 × Tarse lengthᵢ
+ 0.45 × Cub weightᵢ + 0.47 × Body sizeᵢ
Explore a bit deeper the PCA statistics:
how much is each variable represented by the PCs?
• The correlation circle illustrates how much (= arrow length)
the variables are represented in the PC1×PC2 space
• The loading is the coordinate of a variable on a PC
(i.e. cos θ) and equals the variable-PC correlation
• Loadings combine scale and direction: loading(j,α) = √λα × u(j,α)
Explore a bit deeper the PCA statistics:
how much is each variable represented by the PCs?
• The loading² (= cos² θ) quantifies the quality of
representation of a variable on a PC,
i.e. the proportion of its variance explained by the PC:
loading²(j,α) = (√λα × u(j,α))² = λα × u²(j,α)
• Loadings² can be summed over several PCs to see
how much of the original variation in a variable
is kept in this simplified space (also called communality)
Explore a bit deeper the PCA statistics:
how much does each variable contribute to each PCs?
• The contribution of a variable to a PC is its
cos², scaled so that contributions sum to 100 % per PC
• This is a place where confusion often emerges in R
due to the existence of synonym outputs:
– coord AND cor: give the loadings/coordinates/correlations (coord = cos θ)
– cos2: gives loadings² = coordinates² = cos², i.e. the quality of representation
– contrib: gives variable contributions, i.e. cos² / λ
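The relations loading = √λ × u, cos² = λ × u², and contrib = cos²/λ can be verified on any correlation matrix; a minimal Python/numpy sketch on a hypothetical 3-variable case (the R outputs named above come from packages such as FactoMineR/factoextra):

```python
import numpy as np

# Hypothetical correlation matrix for p = 3 variables (positive definite)
R = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                # decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)   # "coord"/"cor": variable-PC correlations
cos2 = loadings ** 2                    # quality of representation
contrib = cos2 / eigvals                # contribution: each PC column sums to 1
```

Two sanity checks follow from the theory: each contrib column sums to 1 (100 % per PC), and each cos2 row sums to 1 (a variable is fully represented once all p PCs are kept, its communality over all PCs).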


Explore a bit deeper the PCA statistics:
how much is each individual represented by the PCs?
• The coordinates of each individual in the PC1*PC2 space (i.e. its scores) are
computed using the eigenvectors applied to the original variables
𝑃𝐶1𝑖 = 0.46 × 𝑊𝑖𝑛𝑔 𝑙𝑒𝑛𝑔𝑡ℎ𝑖 + 0.44 × 𝑇𝑜𝑒 𝑙𝑒𝑛𝑔𝑡ℎ𝑖 + 0.41 × 𝑇𝑎𝑟𝑠𝑒 𝑙𝑒𝑛𝑔𝑡ℎ𝑖 + 0.45 × 𝐶𝑢𝑏 𝑤𝑒𝑖𝑔ℎ𝑡𝑖 + 0.47 × 𝐵𝑜𝑑𝑦 𝑠𝑖𝑧𝑒𝑖
𝑃𝐶2𝑖 = −0.25 × 𝑊𝑖𝑛𝑔 𝑙𝑒𝑛𝑔𝑡ℎ𝑖 + 0.23 × 𝑇𝑜𝑒 𝑙𝑒𝑛𝑔𝑡ℎ𝑖 + 0.77 × 𝑇𝑎𝑟𝑠𝑒 𝑙𝑒𝑛𝑔𝑡ℎ𝑖 − 0.52 × 𝐶𝑢𝑏 𝑤𝑒𝑖𝑔ℎ𝑡𝑖 − 0.16 × 𝐵𝑜𝑑𝑦 𝑠𝑖𝑧𝑒𝑖

• Like variables, cos² quantifies


how much an individual is
represented on a PC
(or a group of PCs)

92
Performing a PCA on non-standardized variables
• Very often, PCA is performed using the correlation matrix of the Xi variables
• If the Xi variables are not standardized into Zi variables, the PCA is performed
on the variance-covariance matrix of the Xi instead of their correlation matrix

• The result will depend on the choice you make!

93
An important reminder on statistical tools:
comparing R and SAS for the shorebird PCA

An important reminder on statistical tools:
choose yourself but master it!
• All the results are identical, fortunately!!
• R is a great tool, powerful and flexible
• But it is less straightforward than other software in producing the complete
set of results for a statistical analysis
• Often, there are many ways/libraries to produce the same results
[you will experience that during the practical sessions]
• With R, you need to ask specifically for every piece of info,
and you need to understand and check your analysis
• As an example: the R code used to produce
the examples in these slides is >500 lines
An example that is less straightforward to interpret:
trait association in T. thermophila strains
• Tetrahymena thermophila is a ciliated protist,
much used in microcosm ecology
• We characterized 44 strains (i.e. clonal genotypes)
for several life history traits:
– Growth rate and peak density (i.e. reproduction)

Abundance
Peak density
– Morphology (size and shape)
Growth rate
– Dispersal rate and swimming speed
– Survival and morphology change when fasting
Time

98
An example that is less straightforward to interpret:
trait association in T. thermophila strains
• Few variables are correlated
• PCA might not be able to
efficiently summarize the variation
in these variables into a smaller
number of PCs

99
An example that is less straightforward to interpret:
trait association in T. thermophila strains
• This is confirmed by the scree plot and the randomization tests

100
An example that is less straightforward to interpret
trait association in T. thermophila strains
• What is the meaning of PC1 and PC2?

101
An example that is less straightforward to interpret:
trait association in T. thermophila strains
• The score plot does not show clear groups of strains

• So, T. thermophila strains exhibit continuous variation
rather than distinct types
PCA, a conceptual summary
• PCA is an ordination method that ordinates a set of individuals along
continuous gradients (PCs) by summarizing the redundancy in quantitative
variables (with no distinction between response and explanatory variables)
• PCs are linear combinations of the original variables, ordered to capture the
highest variation existing in the data and uncorrelated to each other; PCs are
useful as explanatory variables in an inferential analysis (e.g. a linear model)
• PCA is not an inferential method, it is a descriptive / exploratory method, i.e.
it describes/summarizes the variation in the data without assuming / testing
any mechanisms or causal links; it is said to be unconstrained
• The biological interpretation of the PCs is performed a posteriori by the
experimenter, as is the potential inferential part
103
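The summary above can be illustrated numerically. The sketch below (Python/numpy on small simulated data, not the shorebird dataset) builds PCs as eigenvectors of the correlation matrix, i.e. a PCA on standardized variables; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
# Illustrative data: 50 individuals, 3 strongly correlated quantitative variables
z = rng.normal(size=(50, 1))
X = np.hstack([z + 0.3 * rng.normal(size=(50, 1)) for _ in range(3)])

# PCA on the correlation matrix = PCA on standardized variables
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Xs, rowvar=False)          # p x p correlation matrix
eigval, eigvec = np.linalg.eigh(R)         # eigh returns ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

explained = eigval / eigval.sum()          # proportion of variance per PC
scores = Xs @ eigvec                       # individual scores on the PCs
```

Because the PCs diagonalize the correlation matrix, the scores are uncorrelated with each other and the eigenvalues sum to p (the trace of R), exactly as in the slides.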
Chapter 4 − Ordination of a contingency table:
Correspondence Analysis (CA)

104
Correspondence Analysis (CA):
the PCA equivalent for contingency table data
• CA (also known as RA, Reciprocal averaging) is the PCA equivalent for when
the dataset is a contingency table with two categorical variables
Contingency table & X² test
Presence/absence (abundance class) of two bacterial species:
are they distributed independently?

H0: p_ij = p_i· × p·j
X²_obs = Σ_{i=1..c} Σ_{j=1..r} (O_ij − E_ij)² / E_ij ~ X²[(c − 1) × (r − 1)]
E_ij = N × p_ij = N × p_i· × p·j

Bacteria B \ Bacteria A   Many   Some    Few   None | Total | Prop
Many                        30     20     15     20 |    85 | 0.09
Some                        25     80     35     40 |   180 | 0.19
Few                         10     70    100     30 |   210 | 0.22
None                        60     80     50    300 |   490 | 0.51
Total                      125    250    200    390 |   965 |
Proportion                0.13   0.26   0.21   0.40

X²(9) = 298.54, p < 0.0001


105
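The X² computation on this table can be reproduced in a few lines (a Python/numpy sketch; the observed counts are the ones from the slide):

```python
import numpy as np

# Observed counts: rows = Bacteria B (Many, Some, Few, None),
# columns = Bacteria A (Many, Some, Few, None)
O = np.array([[30, 20, 15, 20],
              [25, 80, 35, 40],
              [10, 70, 100, 30],
              [60, 80, 50, 300]])

N = O.sum()
p_row = O.sum(axis=1) / N                  # marginal row proportions p_i.
p_col = O.sum(axis=0) / N                  # marginal column proportions p_.j
E = N * np.outer(p_row, p_col)             # expected counts under H0

chi2 = ((O - E) ** 2 / E).sum()            # the slide reports X²(9) = 298.54
df = (O.shape[0] - 1) * (O.shape[1] - 1)   # (r - 1) x (c - 1) = 9
```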
Determining how the two variables are related?
• When the two variables are not independent,
CA aims to determine how their modalities are associated
• We can look at how the individual frequencies are distributed (profiles)
among rows and among columns

106
107
Determining how the two variables are related?
• If the variables are independent,
the cell proportions should respect
H0: pij = pi· x p·j

• CA computes how each cell proportion
deviates from this expectation:

q_ij = (p_ij − p_i· × p·j) / √(p_i· × p·j)

• n × q_ij² is the contribution of cell i,j
to the X² statistic
108
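This deviation matrix and its link to X² can be checked numerically on the bacteria table (a sketch; with q_ij = (p_ij − p_i·p·j)/√(p_i·p·j), the sum n × Σ q_ij² reproduces the X² statistic):

```python
import numpy as np

O = np.array([[30, 20, 15, 20],
              [25, 80, 35, 40],
              [10, 70, 100, 30],
              [60, 80, 50, 300]])
n = O.sum()
P = O / n                                  # cell proportions p_ij
p_i = P.sum(axis=1, keepdims=True)         # row marginals p_i.
p_j = P.sum(axis=0, keepdims=True)         # column marginals p_.j

Q = (P - p_i * p_j) / np.sqrt(p_i * p_j)   # deviations from independence
chi2_from_q = n * (Q ** 2).sum()           # equals the X² statistic
```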
Joint ordination of rows and columns
• Then CA ordinates the row- and column- profiles together in a way that is
similar to PCA but adapted to consider the contingency nature of the data,
i.e. that rows and columns have a symmetrical role
• The maximum number of axes that can be computed is min[ 𝑐 − 1 , 𝑟 − 1 ]
• A scree plot is useful to determine
how many axes to keep
based on the inertia

109
Joint ordination of rows and columns
• Both row and column profiles contribute to the creation of the axes
• It is then possible to explore in detail for each row and each column: its
inertia, its quality of representation on each axis, its contribution to each axis
• Several score plots (called biplots) can be
produced because it is not possible
to place the row and column modalities
in the same geometrical space
[Figure: symmetric biplot, in which both row and column modalities are
represented; only the distances between either row modalities or
column modalities can be interpreted]

• The two bacteria species seem
positively associated, with a close
association of their abundance classes
110
Chapter 5 − Expanding to other ordination
techniques: Multidimensional scaling

111
112
Enlarging the search for relations
• In multivariate data analysis, three types of relations / associations are
considered:
– R-mode analysis : relations among variables (or descriptors)
– Q-mode analysis : relations among individuals (or objects, observations)
– R- and Q-mode analyses combined: relations among variables, among
individuals and relations between variables and individuals
• Examples in ecology often imply that species are recorded in sites,
so many ecological sources present multivariate analyses in this restricted
context; e.g. Legendre & Legendre (2012), vegan R package
• But ordination techniques are not restricted to species x sites !
113
The matrix of association / (dis)similarity
• Most multivariate analysis methods are based on comparison coefficients
computed for all pairs of variables or individuals
• These are assembled in an association /
(dis)similarity square matrix:
– R-mode: a p × p matrix (Var 1 … Var p) containing a measure of
dependence: correlation, covariance
– Q-mode: an n × n matrix (Ind 1 … Ind n) containing a measure of
association / (dis)similarity: many, many measures
114
115
PCA, a (mostly) R-mode analysis
• PCA is primarily interested in the
relations among variables
• It is an R-mode analysis based on the p x p
correlation matrix of the p variables

• The relative position of individuals is


represented in the score plot,
but their similarity is a deduced
result, often not the main focus

116
Principal Coordinate Analysis (PCoA)
= Classical / Metric MultiDimensional Scaling (MDS)
• PCoA is mainly interested in looking for association / (dis)similarity among
elements and produces a representation / ordination of these elements in a
lower-dimensional (often 2D or 3D) Euclidean space that preserves the
distance relationships computed using any (dis)similarity coefficient
• Elements can be:
– most frequently the individuals:
Q-mode analysis based on a n x n matrix of (dis)similarity
– rarely the variables:
R-mode analysis based on a p x p matrix of dependence

117
Principal Coordinate Analysis (PCoA)
= Classical / Metric MultiDimensional Scaling (MDS)
• The analysis is entirely based on this association / (dis)similarity matrix
• PCoA can then be applied to any type of variables
provided a comparison coefficient appropriate to the data
is used to compute the (dis)similarity matrix
• PCoA offers great flexibility in the choice of association measures

118
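The core computation behind PCoA (classical/metric MDS) can be sketched: double-centre −0.5 D², eigendecompose, and scale the eigenvectors by the square roots of the eigenvalues. A minimal numpy sketch on illustrative data (any (dis)similarity matrix could replace the Euclidean D used here):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                   # 20 individuals, 4 variables

# Any (dis)similarity matrix can be used; here, Euclidean distances
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

# Gower double-centring of -0.5 * D^2
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

eigval, eigvec = np.linalg.eigh(B)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Principal coordinates: keep the axes with positive eigenvalues
k = (eigval > 1e-8 * eigval.max()).sum()
coords = eigvec[:, :k] * np.sqrt(eigval[:k])

# With a Euclidean input distance, PCoA reproduces the distances exactly
D_hat = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))
```

With a non-Euclidean coefficient, some eigenvalues can be negative and the reproduction is only approximate; that is precisely why the choice of coefficient matters.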
It is key to choose the appropriate measure of
association / (dis)similarity
• There are many possible ways to measure the association
/ (dis)similarity among elements
• In R-mode: a measure of dependence among variables
[association: 1 (or −1); no association: 0]
• In Q-mode: a measure of similarity (S)
[association: 1; no association: 0]
or its contrary, dissimilarity or distance (D), among individuals
[association: 0; no association: ∞]
• They differ in particular on:
– The meaning of measure = 0
– Their (a)symmetrical nature and the treatment of “0,0” (double zero)
– The way the distance among elements is computed
• It is out of the scope of this course to discuss all the potential measures
[see Legendre & Legendre (2012) Numerical ecology (chapter 7)] 119
Dependence coefficients for R-mode analysis

Legendre & Legendre (2012) Numerical ecology


120
Similarity / distance coefficients for Q-mode analysis

Legendre & Legendre (2012) Numerical ecology


121
Similarity / distance coefficients for Q-mode analysis
• Similarity can be transformed into distance and vice-versa
• Unfortunately, there are several formulas to do that:
D = 1 − S   OR   D = √(1 − S)   OR   D = √(1 − S²)

• In particular in R, all similarity measures are converted to dissimilarities


to compute a square matrix of class "dist" in which the diagonal (distance
between each object and itself) is 0 and can be ignored.
The conversion formula varies with the package used,
and this is not without consequences!

122
Similarity / distance coefficients for Q-mode analysis
• Presence / absence data can be
summarised as a 2x2 contingency table:
• Many coefficients exist to summarize
such a table:

Legendre & Legendre (2012) Numerical ecology 123


124
Measure of association / (dis)similarity :
The double zero problem
• Multidimensional scaling is often used in community ecology to analyse
community composition, i.e. presence / abundance of species in sites
• Such data typically contain many 0
• With such data, there are two key points to consider about absence data:
– The meaning of a 0: is an absence real or a pseudoabsence (i.e. the
species was not recorded but maybe because of no/too low sampling)?
– The meaning of 0,0: should the analysis consider the fact that a species
has not been recorded in two different sites as indication of similarity?

125
Measure of association / (dis)similarity :
The double zero problem
• It is often preferable to draw no ecological conclusion from the
simultaneous absence of a species at two sites
• This means 0,0 are skipped when computing similarity or distance
coefficients using species presence-absence or abundance data
• This is done using asymmetrical similarity or distance coefficients,
because they treat double absences (0,0)
differently from double presences

126
Habitat quality
in the bog fritillary butterfly
• 106 50x50cm vegetation plots sampled in the
butterfly habitat for 3 years (2000, 2001, 2004)
• Recorded variables:
year plot patch habitat_quality bist_cover bist_leafquality bist_bud bist_flower bist_withered grass_cover nb_tussocks filipendula thistle galium
2000 2000_1 1 bad 1 1 0 0 0 4 0 0 0 1
2000 2000_2 1 bad 1 1 0 0 0 3 1 0 0 0
2000 2000_3 1 bad 3 1 0 0 0 1 0 0 0 0
2000 2000_4 1 bad 1 0 0 0 0 2 0 1 0 1
2000 2000_5 1 bad 3 1 0 0 0 0 0 0 0 2
2000 2000_6 1 bad 0 NA NA NA NA 4 0 1 0 0
2000 2000_7 1 bad 1 0 0 0 1 1 0 3 0 0

• Properties of these variables:


– Quantitative but not continuous, always positive
– Not on the same measurement scale
– 0 is meaningful (an absence is certain) and 0,0 indicates similarity
– Missing values for bistort leaf and flowers when there is no bistort 127
Habitat quality in the bog fritillary butterfly:
Choosing a comparison coefficient for PCoA
• Choices made according to these properties of the variables:
– Quantitative but not continuous, always positive
→ PCoA, not PCA
– Not on the same measurement scale
→ standardize by dividing each variable by its maximum observed value
– 0 is meaningful (an absence is certain) and 0,0 indicates similarity
→ no need for an asymmetrical association measure
– Missing values for bistort leaf and flowers when there is no bistort
→ meaningful here to replace NA by 0

• The Gower coefficient (S15) is chosen:
S15(x1, x2) = Σⱼ (w12j × s12j) / Σⱼ w12j   with   s12j = 1 − |y1j − y2j| / Rj
where y1j is the value of variable j for individual 1
and Rj is the range of variable j over all individuals 128
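The Gower similarity between two individuals can be sketched in plain Python (illustrative values, not the vegetation-plot data; setting a weight w_12j to 0 skips that variable for the pair, e.g. when a value is missing):

```python
def gower_similarity(y1, y2, ranges, weights=None):
    """Gower (S15) similarity between two individuals.

    y1, y2  : values of the p variables for the two individuals
    ranges  : range R_j of each variable over all individuals
    weights : w_12j per variable; set to 0 to skip a comparison
    """
    if weights is None:
        weights = [1] * len(y1)
    num = sum(w * (1 - abs(a - b) / r)
              for a, b, r, w in zip(y1, y2, ranges, weights))
    return num / sum(weights)

# Two plots described by 3 variables standardized to [0, 1] (ranges = 1)
s = gower_similarity([1, 0.5, 0], [1, 0.25, 0.5], ranges=[1, 1, 1])
```

Here s = (1 + 0.75 + 0.5) / 3 = 0.75; identical individuals give a similarity of 1.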
Habitat quality in the bog fritillary butterfly:
Result of PCoA with Gower similarity coefficient
• The first two axes explain 39%
of the variation in the data (inertia).
This means that PCoA is not so efficient in
summarizing the existing variation in the original
variables

• The score plot shows no year effect


– The Euclidean distance between points in the
score plot approximates the original dissimilarity
between the corresponding individuals
– Groups or clusters of points in the score plot can
suggest the presence of distinct subgroups
129
Habitat quality in the bog fritillary butterfly:
Result of PCoA with Gower similarity coefficient
• The score plot shows there is a large
overlap in vegetation descriptors
between bad and good quality
habitat patches
• But patches with a lower habitat
quality have a larger range
(mainly smaller values in axis 2)

130
Habitat quality in the bog fritillary butterfly:
Interpretation of PCoA axes
• PCoA is computed from a distance matrix among elements (here Gower, S15)
• The original variables do not intervene in PCoA,
so PCoA does not produce any loadings
• To interpret the axes in terms of
the original variables, we must
compute the correlation between
the axes and the original variables

131
Habitat quality in the bog fritillary butterfly:
Interpretation of PCoA axes

[Figure: PCoA score plot with species pictures; one end of the gradient is
dominated by Bistorta officinalis, the other by Filipendula ulmaria.
Pictures: Victor Brans] 132


Habitat quality in the bog fritillary butterfly:
Impact of the choice of the association measure
• Using different coefficients to compute the (dis)similarity matrix
leads to different results
Gower Kulczynski

• The general conclusion is here quite similar: the measured vegetation
features are not so different, but more restricted, in good quality patches 133
Nonmetric MultiDimensional Scaling (nMDS)
• PCoA produces a representation / ordination of the elements in a lower-
dimensional Euclidean space that preserves the distance relationships
computed using any (dis)similarity coefficient
• nMDS does the same but by preserving the rank order of the distance
relationships: it uses only the rank info in the dissimilarity matrix to replace
the assumption of linearity by a less problematic assumption of monotonicity
• It can use the same variety of association measures as PCoA
• It is not a method with an analytical solution, but involves optimization
algorithms

Help of vegan R package 134


Habitat quality in the bog fritillary butterfly:
Result of nMDS with Gower similarity coefficient
• First thing to do: check the algorithm likely
found the global optimum solution

• The fit of the nMDS is expressed via a
stress plot, which shows the correlation between
the observed dissimilarity between individuals h and i
(d_hi) and the ordination distance (d̂_hi)

• Stress is a summary measure, over all pairs
of individuals, of the difference
between d_hi and d̂_hi
135
Habitat quality in the bog fritillary butterfly:
Result of nMDS with Gower similarity coefficient
• Axes of the nMDS score plot are arbitrary
• Results are very similar to those for PCoA using the same association
measure (Gower similarity)
Gower nMDS Gower PCoA

136
Ordination, a summary

[Summary table of ordination methods, from Legendre & Legendre (2012)
Numerical ecology. Annotations: PCoA maximises the linear correlation
between distances among individuals; nMDS maximises the rank-order
correlation between distances among individuals, using an iterative
fitting method]
137
Chapter 6 − Grouping objects:
Clustering

138
The principle of clustering elements:
partitioning the elements into groups
• To cluster is to recognize that:
1. some elements are sufficiently similar to be put in the same group
2. and some are sufficiently distinct to justify separations
between groups of elements
• The aim of clustering is therefore to divide / partition the elements into
a certain number of groups, creating a typology (i.e. a system of types)
• Groups are assumed to exist, but clustering is heuristic (i.e. groups are
created without any knowledge of processes really occurring)

139
The principle of clustering elements:
partitioning the elements into groups
• Usually, individuals / objects are clustered (Q-mode analysis),
but clustering can also be used to identify groups of collinear variables
or groups of species (R-mode analysis)
• There are many clustering approaches, and the choice of method is critical:
– Agglomerative vs divisive
– Sequential vs simultaneous
– Single vs hierarchical [see slide below]
– Fuzzy vs hard (the groups are mutually exclusive,
i.e. an object cannot belong to more than 1 group)
– Unconstrained vs constrained by external information
140
The principle of clustering elements:
partitioning the elements into groups
• Many methods are based on a (dis)similarity matrix,
also a critical choice (as in MDS)
• The final number of groups must be determined by the user;
the choice of how to do it (>30 rules) is also critical

141
The principle of clustering elements:
partitioning the elements into groups
Legendre & Legendre (2012) Numerical ecology

142
Single vs hierarchical clustering
• Groups can be the result of a single partition or of several partition levels
embedded into each other (hierarchical partitioning)

[Figure: a dendrogram showing embedded partition levels
(e.g. a second partition)]

143
An example with K-means clustering:
grouping shorebirds according to body size
• Aim: given n individuals in a p–dimensional space,
determine a partition of the individuals into K groups (clusters)
such that the individuals within each cluster are more similar to one another
than to individuals in the other clusters
• (Dis)similarity criterion:
– On the raw data: Euclidean distance
– On a (dis)similarity matrix: any
• Approach: optimization algorithm that groups individuals in the way that
minimizes the sum of squares of errors (i.e. distance between each
individual and the centroid of its group, called SSE or E² or TESS)
• K: the number of clusters is determined by the user (a priori or a posteriori)
144
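The optimization loop described above can be sketched in numpy (Lloyd's algorithm with random restarts; illustrative data, not the shorebird measurements):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # initial centroids
    for _ in range(n_iter):
        # Assign each individual to the nearest centroid (Euclidean distance)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids (keep the old one if a cluster empties out)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):                  # assignment is stable
            break
        centroids = new
    sse = ((X - centroids[labels]) ** 2).sum()           # SSE / E² / TESS
    return labels, centroids, sse

# Two well-separated clouds of 25 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(5, 0.3, (25, 2))])

# As in practice, keep the best of several random starts (lowest SSE)
labels, centroids, sse = min((kmeans(X, K=2, seed=s) for s in range(5)),
                             key=lambda r: r[2])
```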
An example with K-means clustering:
grouping shorebirds according to body size
• Criterion used here: Euclidean distance
based on the 5 standardized body size variables
• Optimization criterion: Simple Structure Index (in vegan R package)

• K = 2 to 4 clusters

145
An example with K-means clustering:
grouping shorebirds according to body size
[Figure: K-means score plots for 2 and 4 groups]
Groups are structured along Dim1, confirming the PCA analysis:
88.5% of the variation in the five variables
can be summarized by this body size index.

146
Chapter 7 − Assigning objects to groups:
Discriminant analysis

147
The principle of discriminating elements:
assigning the elements to groups
• Clustering analysis aims at grouping elements according to their
characteristics (i.e. values of variables)
• Discrimination / classification analysis does the reverse job:
identifying the characteristics that discriminate the groups of elements,
when group membership is known
• Linear discriminant analysis (LDA) aims at finding linear combinations
of the variables (as in PCA) that
– maximize the separability / differences among known groups of elements
– minimize the variation among elements belonging to the same group
• As in PCA, variables must be quantitative or binary
148
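For two groups, the linear discriminant can be sketched directly from the group means and the pooled within-group covariance matrix (a minimal numpy sketch on illustrative data, not the cell/artefact dataset; group names are only placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two known groups differing in mean on 2 quantitative variables
A = rng.normal([0, 0], 0.8, (100, 2))      # e.g. "artefacts"
B = rng.normal([2, 1], 0.8, (100, 2))      # e.g. "cells"
X = np.vstack([A, B])
y = np.array([0] * 100 + [1] * 100)

# Pooled within-group covariance matrix
Sw = (np.cov(A, rowvar=False) * (len(A) - 1)
      + np.cov(B, rowvar=False) * (len(B) - 1)) / (len(X) - 2)

# Fisher's discriminant direction: w = Sw^-1 (mean_B - mean_A)
w = np.linalg.solve(Sw, B.mean(axis=0) - A.mean(axis=0))

# Classify by projecting on w and thresholding at the midpoint of group means
threshold = (A.mean(axis=0) + B.mean(axis=0)) @ w / 2
pred = (X @ w > threshold).astype(int)
accuracy = (pred == y).mean()
```

The projection X @ w is the single LD1 axis: it maximizes between-group separation relative to within-group variation, which is exactly what the slides describe for LDA.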
An example with simple linear discriminant analysis:
discriminating Tetrahymena cells from artefacts on photos
• Tetrahymena thermophila
is a ciliated protist (~40µm)

• It has become an established model system for


ecology and evolution in the last 20 years

• Especially for the study of dispersal as it is


actively swimming

149
An example with simple linear discriminant analysis:
discriminating Tetrahymena cells from artefacts on photos
• Cells are identified on digital pictures shot using a dark field microscope
• Pictures also contain artefacts, and these must be discarded from the data
before further analyses
• A series of characteristics are measured for each identified particle,
related to:
– Size
– Shape
– Brightness

150
An example with simple linear discriminant analysis:
discriminating Tetrahymena cells from artefacts on photos
• Aim: create a function (equation) to discriminate
cells and artefacts on any picture
shot under the same conditions
• Data: >40000 particles have been manually classified:
~30000 cells and ~10000 artefacts
• Approach: simple LDA, i.e. LDA discriminating objects from 2 groups

151
An example with simple linear discriminant analysis:
discriminating Tetrahymena cells from artefacts on photos
• There are differences
in cells vs artefacts
distributions but also
a large overlap
• Use LDA to search for
the best linear
combination of these
variables to
discriminate cells and
artefacts

152
An example with simple linear discriminant analysis:
discriminating Tetrahymena cells from artefacts on photos
• The discriminant function is:

• It makes it possible to compute the probability that
each particle is an artefact or a cell:

153
An example with simple linear discriminant analysis:
discriminating Tetrahymena cells from artefacts on photos
• The quality of the discrimination can be
checked by looking at the classification matrix:
75% of artefacts and 96% of cells are correctly classified:

• The score plot shows that the LD1 axis


allows a good discrimination between
cells and artefacts, much better than
any of the original variables

154
PCA vs LDA
• Both PCA and LDA create a reduced-dimension space to ordinate the
individuals by creating a set of independent axes, ranked in order of
decreasing importance
• But they differ in how they quantify this importance:
– PCA: the amount of variation in the variables
– LDA: the ability to discriminate the groups

155
