Chapter 9 Multivariate Regression Tree - Workshop 10 - Advanced Multivariate Analyses in R
The MRT splits the data into clusters of samples similar in their species composition based on
environmental value thresholds. It involves two procedures running at the same time: 1) the
computation of the constrained partitioning of the data, and 2) the calculation of the relative
error of the successive partitioning levels by multiple cross-validations. The cross-validation is,
in essence, aimed at identifying the best predictive tree. The “best” tree varies depending on your
study goals: usually you want a tree that is parsimonious, but that still has an informative number
of groups. This is, of course, a subjective decision to make according to the question you are
trying to answer.
First, the method computes all possible partitions of the sites into two groups. For each
quantitative explanatory variable, the sites are sorted in ascending order of the variable's
values; for categorical variables, the sites are aggregated by levels so that all combinations
of levels can be tested. The method splits the data after the first object, the second object,
and so on, and computes the sum of the within-group sums of squared distances to the group mean
(within-group SS) for the response data. It retains the partition into two groups that minimizes
the within-group SS, along with the threshold value (or level) of the explanatory variable that
produces it. These steps are repeated within the two subgroups formed previously, until all
objects form their own group. In other words, this process ends when each leaf of the tree
contains one object. A toy sketch of this split search is shown below.
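To make the split search concrete, here is a small, self-contained sketch (our own illustration,
not mvpart's actual internals) that finds the threshold of one quantitative variable minimizing
the within-group SS of a multivariate response:

# Toy illustration of the MRT split search for one quantitative variable.
# x: a quantitative explanatory variable; Y: the (sites x species) response matrix.
best_split <- function(x, Y) {
    ord <- order(x)
    x <- x[ord]
    Y <- as.matrix(Y)[ord, , drop = FALSE]
    # Within-group SS: sum of squared distances to the column (species) means
    ss <- function(M) sum(scale(M, scale = FALSE)^2)
    n <- nrow(Y)
    # Try splitting after the 1st, 2nd, ..., (n - 1)th sorted object
    wss <- sapply(seq_len(n - 1), function(i)
        ss(Y[1:i, , drop = FALSE]) + ss(Y[(i + 1):n, , drop = FALSE]))
    i <- which.min(wss)
    list(threshold = mean(x[i:(i + 1)]), within_SS = wss[i])
}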
The next step is to perform a cross-validation to identify the best predictive tree. The cross-
validation procedure consists of using a subset of the objects to construct the tree, and then
allocating the remaining objects to the groups. In a good predictive tree, objects are assigned to
the appropriate groups. The cross-validated relative error (CVRE) is the measure of this
predictive error. Without cross-validation, one would retain the number of partitions minimizing
the variance not explained by the tree (i.e. the relative error: the sum of the within-group SS
over all leaves divided by the overall SS of the data). This is the solution that maximizes the
R², so to speak.
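In symbols (our notation; the CVRE follows De'ath's (2002) definition, with y_is the abundance of
species s at site i, ŷ_is the tree's prediction for a held-out site, and ȳ_s the species mean):

RE = \frac{\sum_{j=1}^{k} SS_j}{SS_{\mathrm{total}}}

CVRE = \frac{\sum_{i=1}^{n} \sum_{s=1}^{p} (y_{is} - \hat{y}_{is})^2}{\sum_{i=1}^{n} \sum_{s=1}^{p} (y_{is} - \bar{y}_s)^2}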
9.2 MRT in R
The function mvpart() from the package mvpart computes both the partition and the cross-
validation steps required to build a multivariate regression tree.
We will demonstrate the process of building a multivariate regression tree on the Doubs River
data.
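The call that produced the output below is not shown in the extracted text; here is a minimal
sketch, assuming spe.hel (the Hellinger-transformed Doubs species matrix) and env (the
environmental variables) are already in the workspace. The argument values are assumptions:
xvmult = 100 is chosen to match the 100 cross-validation runs tallied in the output.

library(mvpart)

# Build the tree and pick the tree size interactively (xv = "pick")
doubs.mrt <- mvpart(as.matrix(spe.hel) ~ ., data = env,
    xv = "pick", # pick the tree size interactively
    xval = nrow(spe.hel), # leave-one-out cross-validation (assumption)
    xvmult = 100, # number of multiple cross-validations
    which = 4, legend = FALSE, margin = 0.01)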
## X-Val rep : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## Minimum tree sizes
## tabmins
## 2 3 4 5 6 7 8 9 10
## 7 1 2 4 11 9 23 4 39
At this point, you will need to select the tree with an appropriate number of groups, depending
on the aim of your study. In other words, you must prune the tree by picking the best-fit tree. A
fully resolved tree is not the desirable outcome; instead, one is usually interested in a tree
including only informative partitions/groups. In such cases, it is possible to have an a priori
idea of the number of potential groups to be retained. You can make this choice interactively,
with the argument xv = "pick".
The resulting figure shows the relative error RE (in green) and the cross-validated relative
error CVRE (in blue) of trees of increasing size. The red dot indicates the solution with the
smallest CVRE, and the orange dot shows the smallest tree within one standard error of the
minimum CVRE. It has been suggested that, instead of choosing the solution minimizing the CVRE,
it would be more parsimonious to opt for the smallest tree for which the CVRE is within one
standard error of the tree with the lowest CVRE (Breiman et al. 1984). The green bars at the top
indicate the number of times each size was chosen during the cross-validation process. This
graph is interactive, which means you will have to click on the blue point corresponding to your
choice of tree size.
We don’t have an a priori expectation about how to partition these data, so we’ll select the
smallest tree within 1 standard error of the overall best-fit tree (i.e. the orange dot). We can
select this tree directly with the xv = "1se" argument, as sketched below.
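A sketch of the non-interactive version, under the same assumptions as the call above:

# Select the smallest tree within 1 SE of the minimum CVRE automatically
doubs.mrt <- mvpart(as.matrix(spe.hel) ~ ., data = env,
    xv = "1se", # smallest tree within 1 SE of the best tree
    xval = nrow(spe.hel), xvmult = 100,
    which = 4, legend = FALSE, margin = 0.01)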
## X-Val rep : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## Minimum tree sizes
## tabmins
## 2 3 5 6 7 8 9 10
## 10 2 3 13 6 26 4 36
The statistics at the bottom of the figure are: the residual error, the cross-validated error, and
the standard error. This tree has only two leaves separated by one node. Each leaf is
characterized by a small barplot showing the abundances of the species included in the group,
the number of sites in the group, and the group’s relative error. From this figure, we can report
the following statistics:
* The species matrix is partitioned according to an altitude threshold (361.5 m)
* Residual error = 0.563, which means the model’s R² is 0.437 (1 − 0.563 = 0.437), i.e. the tree explains 43.7% of the variance
We can also compare solutions, to help us choose the best tree. For example, let’s take a look
at a 10-group solution!
# Trying 10 groups
mvpart(as.matrix(spe.hel) ~ ., data = env,
    xv = "none", # no cross-validation
    size = 10, # set tree size
    which = 4) # label all nodes
This tree is much harder to interpret, because there are so many groups! Although this version
of the tree offers higher explanatory power, its predictive power (CV Error = 0.671) is basically
the same as the previous two-group solution (CV Error = 0.673). This suggests that we may
want to try a tree with a few more groupings than the two-group solution, while staying lower
than 10 groups.
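The tree discussed next is such an intermediate solution. The exact size used is not shown in the
extracted text; a four-group tree is one plausible choice (hypothetical):

# Trying an intermediate solution (size = 4 is a hypothetical example)
mvpart(as.matrix(spe.hel) ~ ., data = env,
    xv = "none",
    size = 4, # e.g. a 4-group solution
    which = 4)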
This tree is much easier to interpret! It also offers higher explanatory power (lower residual
error) than our original solution, and higher predictive power (lower CV error) than both
previous solutions. We have a winner!
To find out how much variance is explained by each node in the tree, we need to look at the
complexity parameter (CP). The CP at nsplit = 0 is the R² of the entire tree.
# Checking the complexity parameter
doubs.mrt$cptable
The summary then outlines, for each node, the best threshold values to split the data.
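This is standard rpart-style output; assuming doubs.mrt is the tree fitted above, the call below
prints it:

# Full summary: the cptable plus, for each node, the best split thresholds
summary(doubs.mrt)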
##
## CP nsplit rel error xerror xstd
## 1 0.4369561 0 1.0000000 1.0758122 0.07493568
## 2 0.1044982 1 0.5630439 0.6755865 0.09492709
##
## Node number 1: 29 observations, complexity param=0.4369561
## Means=0.07299,0.2472,0.2581,0.2721,0.07133,0.06813,0.06897,0.07664,0.1488,0.2331,0.
## left son=2 (15 obs) right son=3 (14 obs)
## Primary splits:
## alt < 361.5 to the right, improve=0.4369561, (0 missing)
## deb < 23.65 to the left, improve=0.4369561, (0 missing)
##
## Node number 2: 15 observations
## Means=0.1208,0.4463,0.4194,0.4035,0.1104,0.09023,0,0.02108,0.1256,0.2164,0.04392,0.
##
## Node number 3: 14 observations
## Means=0.02179,0.03391,0.08514,0.1313,0.02945,0.04444,0.1429,0.1362,0.1736,0.2509,0.
You might also be interested in finding out which species are significant indicator species for
each grouping of sites.
# Calculate indicator values (indval) for each species, using the labdsv package
library(labdsv)
doubs.mrt.indval <- indval(spe.hel, doubs.mrt$where)
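The output below lists, for each significant indicator species, the leaf it indicates. A sketch of
how to extract this, using the standard components of a labdsv indval object (pval, maxcls,
indcls); the first command produces the output shown, and the second retrieves the indicator
values discussed afterwards:

# Leaf (cluster) indicated by each significant species (p <= 0.05)
doubs.mrt.indval$maxcls[which(doubs.mrt.indval$pval <= 0.05)]

# Indicator value of each significant species in its leaf
doubs.mrt.indval$indcls[which(doubs.mrt.indval$pval <= 0.05)]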
## TRU VAI LOC HOT TOX BAR SPI GOU BRO PER BOU PSO ROT CAR BCO PCH GRE GAR BBO ABL
## 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## ANG
## 2
TRU has the highest indicator value (0.867) overall, and is an indicator species for the first (alt
>= 361.5) leaf of the tree.
9.3 Challenge 4
Create a multivariate regression tree for the mite data.
* Select the smallest tree within 1 SE of the CVRE.
* What is the proportion of variance (R²) explained by this tree?
* How many leaves does it have?
* What are the top 3 discriminant species?
A sketch of one possible approach is shown below.
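# A sketch of one possible solution, assuming the mite data shipped with
# vegan and the same workflow as for the Doubs data:
library(vegan)
library(mvpart)
data(mite) # mite species abundances
data(mite.env) # environmental variables
mite.hel <- decostand(mite, method = "hellinger") # Hellinger transformation

mite.mrt <- mvpart(as.matrix(mite.hel) ~ ., data = mite.env,
    xv = "1se", # smallest tree within 1 SE of the minimum CVRE
    xval = nrow(mite.hel), xvmult = 100,
    which = 4, legend = FALSE, margin = 0.01)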
## X-Val rep : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Which species are significant indicator species for each grouping of sites?
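A sketch following the same steps as before, assuming mite.mrt from the code above:

# Indicator species per leaf (p <= 0.05), as for the Doubs data
mite.mrt.indval <- indval(mite.hel, mite.mrt$where)
mite.mrt.indval$maxcls[which(mite.mrt.indval$pval <= 0.05)]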
## 2 2 2 1 1 2 2 2
## Miniglmn LRUG Ceratoz3 Trimalc2
## 2 1 1 1
References
Breiman, Leo, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification
and Regression Trees. CRC Press.
De’ath, Glenn. 2002. “Multivariate Regression Trees: A New Technique for Modeling
Species–Environment Relationships.” Ecology 83 (4): 1105–17.
All the content of the workshop series is under a Creative Commons Attribution-NonCommercial-
ShareAlike 4.0 International License.