Partial Least Squares
Among all the software packages available for discriminant analyses based on projection to latent
structures (PLS-DA) or orthogonal projection to latent structures (OPLS-DA), SIMCA (Umetrics, Umeå,
Sweden) is the most widely used in the metabolomics field. SIMCA proposes several parameters and tests to
assess the quality of the computed model:
the number of significant components,
R2 (explained variation),
Q2 (predicted variation),
pCV-ANOVA (CV = cross-validation),
and the permutation test.
Significance thresholds for these parameters are strongly application-dependent. Concerning the Q2
parameter, a significance threshold of 0.5 is generally accepted. However, during the last few years, many
PLS-DA/OPLS-DA models built using SIMCA have been published with Q2 values lower than 0.5. The
purpose of this opinion note is to point out that, in some circumstances frequently encountered in
metabolomics, the values of these parameters strongly depend on the individuals that constitute the
validation subsets. As a result of the way in which the software selects members of the calibration and
validation subsets, a simple permutation of dataset rows can, in several cases, lead to contradictory
conclusions about the significance of the models when a K-fold cross-validation is used. We believe that,
when Q2 values lower than 0.5 are obtained, SIMCA users should at least verify that the quality
parameters are stable towards permutation of the rows in their dataset.
PLS/OPLS models try to find a linear relationship between an X predictor matrix and a Y response matrix.
The only way to reliably estimate the ability of the model to predict Y values of new individuals is to predict
individuals from an independent dataset (i.e. that were not used to build this model). This can be achieved
by splitting the dataset into a training set and a test set. The training set is used to build the model and
the test set is used to estimate the predictability.
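To make the training/test split concrete, here is a minimal sketch using scikit-learn's PLSRegression in
place of SIMCA; the synthetic data, the 70/30 split, and the choice of 2 components are assumptions for
illustration only.

```python
# Hedged sketch: external validation of a PLS model with a train/test split.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                     # 60 individuals, 200 variables
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=60)

# The training set builds the model; the held-out test set estimates predictability.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

pls = PLSRegression(n_components=2).fit(X_train, y_train)
print("R2 on training set (goodness of fit):", r2_score(y_train, pls.predict(X_train)))
print("R2 on test set (predictability):", r2_score(y_test, pls.predict(X_test)))
```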
The default SIMCA cross-validation is the so-called K-fold cross-validation. Results of the cross-validation
procedure are summarized by the value of different quality parameters. The most frequently mentioned
in the metabolomics literature are the R2 and Q2 parameters (the latter also called the cross-validated R2). R2 measures
the goodness of fit while Q2 measures the predictive ability of the model. R2 = 1 indicates perfect
description of the data by the model, whereas Q2 = 1 indicates perfect predictability. R2 increases
monotonically with the number of components (NC) and will automatically approach 1 if NC approaches
the rank of the X matrix. Q2 will not necessarily approach 1.
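For reference, Q2 is conventionally defined from the prediction residual sum of squares (PRESS)
accumulated over the cross-validation rounds; this is the standard textbook form, not necessarily SIMCA's
exact internal computation:

```latex
Q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{SS}_{\mathrm{tot}}}
    = 1 - \frac{\sum_i \left( y_i - \hat{y}_{i,\mathrm{CV}} \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}
```

where \hat{y}_{i,\mathrm{CV}} is the value predicted for individual i by a submodel that was built without it.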
A large discrepancy between R2 and Q2 indicates an overfitting of the model through the use of too many
components. According to the SIMCA users’ guide, Q2 > 0.5 indicates good predictability (SIMCA
P12 users’ guide, p. 514).
It has been shown that in practice it is difficult to give a general limit that corresponds to a good
predictability since this strongly depends on the properties of the dataset. For example, an acceptable Q2
threshold will strongly depend on the number of observations included.
During the last few years, a large number of SIMCA PLS-DA/OPLS-DA models have been published with Q2
below 0.4 or even below 0.3 (for example, see ref. 11 and 12). These models with poor predictability are
frequently validated by a permutation test that consists in comparing the Q2 obtained for the original
dataset with the distribution of Q2 values calculated when original Y values are randomly assigned to the
individuals.
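The permutation test just described can be sketched with scikit-learn's permutation_test_score; this is a
hedged illustration of the idea (with X, y and the number of components carried over from the earlier
sketch), not a reproduction of SIMCA's own implementation.

```python
# Compare the cross-validated Q2 for the real labels with the distribution of
# Q2 values obtained when the Y values are randomly reassigned to individuals.
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import permutation_test_score

q2, permuted_q2, p_value = permutation_test_score(
    PLSRegression(n_components=2), X, y,
    scoring="r2", cv=7, n_permutations=200, random_state=0)
print("Q2 (original labels):", q2)
print("p-value against permuted-label distribution:", p_value)
```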
The cross-validation procedure also makes it possible to calculate a p-value that estimates the
significance of PLS/OPLS models (pCV-ANOVA).
The number of components used in the final model, Q2 and pCV-ANOVA values should be presented to
allow the reader to assess the quality of the model calculated by SIMCA.
However, in this Opinion piece, we want to point out that in some cases, because of the way in which the
default SIMCA cross-validation procedure selects members of the calibration and validation subsets,
permutation of the rows of a dataset can result in variations in the values of the quality parameters.
As a consequence, in these circumstances, different conclusions on the quality of the PLS/OPLS models
may be drawn from the same dataset.
In a first part, we will show that under some conditions a random permutation of rows in the
dataset strongly affects the quality parameter values obtained when default SIMCA cross-
validation settings are used.
In a second part, we will discuss three different types of situations, frequently encountered in
metabolomics studies, where the K-fold cross-validation procedure fails to produce a Q2 value that is
independent of the arbitrary order of the rows in a dataset.
Default SIMCA cross-validation procedure
We give here a very basic description of the default SIMCA cross-validation procedure. Only the way the
validation sets are built will be discussed in detail. For an exhaustive description of the procedure, the
reader should refer to the Umetrics documentation. Cross-validation makes it possible to estimate the
ability of a model to correctly predict the Y response matrix of new individuals.
In the SIMCA software, cross-validation is also used to avoid overfitting by estimating the number of
significant components (NSCs) to use in the model. Many cross-validation procedures are used in the
metabolomics community (K-fold, Leave One Out, Monte-Carlo, 2CV, etc.).
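Several of these schemes have readily available counterparts in scikit-learn; a brief illustration (2CV is a
chemometrics-specific procedure with no direct scikit-learn equivalent):

```python
# Assumed illustration: common cross-validation splitters in scikit-learn.
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

kfold = KFold(n_splits=7)                        # K-fold (SIMCA default uses K = 7)
loo = LeaveOneOut()                              # Leave One Out
mccv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)  # Monte-Carlo CV
```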
The default SIMCA cross-validation procedure is a 7-fold cross-validation, where the dataset is split into
7 different subsets. For a fixed number of components (NC), the Y values of all individuals of each subset
are predicted using a submodel built with the 6 other subsets (the calibration subset).
The differences between the predicted Y values and the observed Y values are used to calculate the
Q2(NC) parameter for this number of components. The procedure starts at NC = 1 and is repeated,
incrementing NC as long as the increase in Q2(NC) is larger than a limit value fixed by various rules.
Each subset is constituted by selecting one row every seven rows in the dataset. The first subset is built
with the individuals corresponding to rows 7, 14, 21 and so on.
The second subset is constituted with the individuals corresponding to rows 1, 8, 15, and so on. The other subsets
are built in the same way (Scheme 1a).
Considering the way the subsets are built, it is clear that a permutation in row order of the X and Y dataset
changes the individual positions and modifies the composition of these subsets (Scheme 1b).
Thus, submodels and predicted Y values calculated during the cross-validation procedure are also affected
by a permutation of rows.
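A small sketch makes this concrete. The function below mimics the interleaved, position-based subset
assignment described above (it is an illustration of the described behaviour, not SIMCA's actual code) and
shows that permuting the rows changes which individuals share a validation subset.

```python
import numpy as np

def seven_fold_subsets(n_rows, k=7):
    """Assign rows to k validation subsets by position: one row every k rows."""
    return [list(range(start, n_rows, k)) for start in range(k)]

rows = np.arange(21)                    # 21 individuals, identified by row index
subsets = seven_fold_subsets(len(rows))
# 0-based rows 6, 13, 20 are the paper's 1-based rows 7, 14, 21 (Scheme 1a).
print("One subset:", subsets[6])

# Permuting the rows moves individuals to new positions, so every subset now
# contains different individuals and the cross-validation submodels change.
rng = np.random.default_rng(1)
permuted = rng.permutation(rows)
print("Individuals now grouped together:", permuted[subsets[6]])
```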
The range of R2 is between 0 and 1; the higher the value, the higher the predictive accuracy. According
to Chin (1998) and Henseler et al. (2009), an R2 value greater than 0.67 indicates high predictive accuracy,
a value in the range 0.33 - 0.67 indicates a moderate effect, an R2 between 0.19 and 0.33 indicates a low
effect, and an R2 below 0.19 is considered unacceptable (the exogenous variables are unable to explain the
endogenous dependent variable). A Q2 value greater than zero for a particular reflective endogenous
latent variable indicates the path model’s predictive relevance for that specific dependent construct
(Hair et al. 2016).
As far as a PLSR model is concerned, you have to be careful when evaluating the performance of your
model based just on R2. The values of
root mean square error of calibration (RMSEC),
root mean square error of cross-validation (RMSECV), and
root mean square error of prediction (RMSEP) are equally important.
So you have to select the model that combines a high R2 with the lowest error. I agree with Ahmed Meri that
some research works have presented thresholds for model performance, but mostly it depends
on your data and, most importantly, on how you internally and externally validate your model.
Similarly, when you evaluate the value of Q2, it should be weighed against the root mean square
error of prediction (RMSEP). As a rough example, say you get one model with Q2 = 0.72 and an RMSEP
of 9.567, and another model with Q2 = 0.70 but an RMSEP of 9.019; in this case it is much better
to choose the model with Q2 = 0.70, since it is more reliable.
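A small sketch of these error metrics, reusing the X_train/X_test split from the earlier example; the
variable names and the 7-fold setting are illustrative assumptions.

```python
# RMSEC (calibration), RMSECV (cross-validation) and RMSEP (prediction).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

pls = PLSRegression(n_components=2).fit(X_train, y_train)

rmsec = np.sqrt(mean_squared_error(y_train, pls.predict(X_train)))
y_cv = cross_val_predict(PLSRegression(n_components=2), X_train, y_train, cv=7)
rmsecv = np.sqrt(mean_squared_error(y_train, y_cv))
rmsep = np.sqrt(mean_squared_error(y_test, pls.predict(X_test)))
print(f"RMSEC={rmsec:.3f}  RMSECV={rmsecv:.3f}  RMSEP={rmsep:.3f}")
```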
Q2 is the R2 obtained when the PLS model built on a training set is applied to a test set. So a good value
for Q2 is one that is close to the R2. That means that your PLS model works independently of the specific
data that was used to train it. Adding more variables always makes R2 go up, but might not make
Q2 go up.
Reference: https://www.youtube.com/watch?v=dWPzr9NjJxU
Q2 is similar to R-squared.
If the value of Q2 < 0:
your model is very poor,
the independent variables cannot explain the dependent variable,
and the model has no predictive relevance.
Q2 values larger than 0 indicate that the exogenous constructs have predictive relevance for the
endogenous construct under consideration.
Predictive relevance (q2):
0.02 – small predictive relevance
0.15 – medium predictive relevance
0.35 – large predictive relevance
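The q2 effect size quoted in these thresholds is conventionally computed from the Q2 values obtained
with and without a given exogenous construct; the standard PLS-SEM definition (as in Hair et al.) is given
here for reference:

```latex
q^2 = \frac{Q^2_{\mathrm{included}} - Q^2_{\mathrm{excluded}}}{1 - Q^2_{\mathrm{included}}}
```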