Partial Least Squares
Among all the software packages available for discriminant analyses based on projection to latent
structures (PLS-DA) or orthogonal projection to latent structures (OPLS-DA), SIMCA (Umetrics, Umeå,
Sweden) is the most widely used in the metabolomics field. SIMCA proposes several parameters and tests to
assess the quality of the computed model:
the number of significant components,
R2 (explained variation),
Q2 (predicted variation),
pCV-ANOVA (CV = cross-validation),
and the permutation test.
Significance thresholds for these parameters are strongly application-dependent. Concerning the Q2
parameter, a significance threshold of 0.5 is generally accepted. However, during the last few years, many
PLS-DA/OPLS-DA models built using SIMCA have been published with Q2 values lower than 0.5. The
purpose of this opinion note is to point out that, in some circumstances frequently encountered in
metabolomics, the values of these parameters strongly depend on the individuals that constitute the
validation subsets. As a result of the way in which the software selects members of the calibration and
validation subsets, a simple permutation of dataset rows can, in several cases, lead to contradictory
conclusions about the significance of the models when a K-fold cross-validation is used. We believe that,
when Q2 values lower than 0.5 are obtained, SIMCA users should at least verify that the quality
parameters are stable towards permutation of the rows in their dataset.
PLS/OPLS models try to find a linear relationship between an X predictor matrix and a Y response matrix.
The only way to reliably estimate the ability of the model to predict Y values of new individuals is to predict
individuals from an independent dataset (i.e. that were not used to build this model). This can be achieved
by splitting the dataset into a training set and a test set. The training set is used to build the model and
the test set is used to estimate the predictability.
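To make the training/test split concrete, here is a minimal sketch using scikit-learn's PLSRegression in
place of SIMCA; the synthetic data, the 70/30 split, and the choice of 2 components are assumptions for
illustration only.

```python
# Hedged sketch: external validation of a PLS model with a train/test split.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                     # 60 individuals, 200 variables
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=60)

# The training set builds the model; the held-out test set estimates predictability.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

pls = PLSRegression(n_components=2).fit(X_train, y_train)
print("R2 on training set (goodness of fit):", r2_score(y_train, pls.predict(X_train)))
print("R2 on test set (predictability):", r2_score(y_test, pls.predict(X_test)))
```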
The default SIMCA cross-validation is the so-called K-fold cross-validation. Results of the cross-validation
procedure are summarized by the value of different quality parameters. The most frequently mentioned
in the metabolomics literature are the R2 and Q2 parameters (the latter also called the cross-validated R2). R2 measures
the goodness of fit while Q2 measures the predictive ability of the model. R2 = 1 indicates perfect
description of the data by the model, whereas Q2 = 1 indicates perfect predictability. R2 increases
monotonically with the number of components (NC) and will automatically approach 1 if NC approaches
the rank of the X matrix. Q2 will not necessarily approach 1.
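For reference, Q2 is conventionally defined from the prediction residual sum of squares (PRESS)
accumulated over the cross-validation rounds; this is the standard textbook form, not necessarily SIMCA's
exact internal computation:

```latex
Q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{SS}_{\mathrm{tot}}}
    = 1 - \frac{\sum_i \left( y_i - \hat{y}_{i,\mathrm{CV}} \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}
```

where \hat{y}_{i,\mathrm{CV}} is the value predicted for individual i by a submodel that was built without it.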
A large discrepancy between R2 and Q2 indicates an overfitting of the model through the use of too many
components. According to the SIMCA users’ guide, Q2 > 0.5 indicates good predictability (SIMCA
P12 users’ guide, p. 514).
It has been shown that in practice it is difficult to give a general limit that corresponds to a good
predictability since this strongly depends on the properties of the dataset. For example, an acceptable Q2
threshold will strongly depend on the number of observations included.
During the last few years, a large number of SIMCA PLS-DA/OPLS-DA models have been published with Q2
below 0.4 or even below 0.3 (for example, see ref. 11 and 12). These models with poor predictability are
frequently validated by a permutation test that consists in comparing the Q2 obtained for the original
dataset with the distribution of Q2 values calculated when original Y values are randomly assigned to the
individuals.
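The permutation test just described can be sketched with scikit-learn's permutation_test_score; this is a
hedged illustration of the idea (with X, y and the number of components carried over from the earlier
sketch), not a reproduction of SIMCA's own implementation.

```python
# Compare the cross-validated Q2 for the real labels with the distribution of
# Q2 values obtained when the Y values are randomly reassigned to individuals.
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import permutation_test_score

q2, permuted_q2, p_value = permutation_test_score(
    PLSRegression(n_components=2), X, y,
    scoring="r2", cv=7, n_permutations=200, random_state=0)
print("Q2 (original labels):", q2)
print("p-value against permuted-label distribution:", p_value)
```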
The cross-validation procedure also makes it possible to calculate a p-value that estimates the
significance of PLS/OPLS models (pCV-ANOVA).
The number of components used in the final model, Q2 and pCV-ANOVA values should be presented to
allow the reader to assess the quality of the model calculated by SIMCA.
However, in this Opinion piece, we want to point out that in some cases, because of the way in which the
default SIMCA cross-validation procedure selects members of the calibration and validation subsets,
permutation of the rows of a dataset can result in variations in the values of the quality parameters.
As a consequence, in these circumstances, different conclusions on the quality of the PLS/OPLS models
may be drawn from the same dataset.
In a first part, we will show that under some conditions a random permutation of rows in the
dataset strongly affects the quality parameter values obtained when default SIMCA cross-
validation settings are used.
In a second part, we will discuss three different types of situations, frequently encountered in
metabolomics studies, where the K-fold cross-validation procedure fails to produce a Q2 value that is
independent of the arbitrary order of the rows in a dataset.
Default SIMCA cross-validation procedure
We give here a very basic description of the default SIMCA cross-validation procedure. Only the way the
validation sets are built will be discussed in detail. For an exhaustive description of the procedure, the
reader should refer to the Umetrics documentation. Cross-validation makes it possible to estimate the
ability of a model to correctly predict the Y response matrix of new individuals.
In the SIMCA software, cross-validation is also used to avoid overfitting by estimating the number of
significant components (NSCs) to use in the model. Many cross-validation procedures are used in the
metabolomics community (K-fold, Leave One Out, Monte-Carlo, 2CV, etc.).
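Several of these schemes have readily available counterparts in scikit-learn; a brief illustration (2CV is a
chemometrics-specific procedure with no direct scikit-learn equivalent):

```python
# Assumed illustration: common cross-validation splitters in scikit-learn.
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

kfold = KFold(n_splits=7)                        # K-fold (SIMCA default uses K = 7)
loo = LeaveOneOut()                              # Leave One Out
mccv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)  # Monte-Carlo CV
```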
The default SIMCA cross-validation procedure is a 7-fold cross-validation, where the dataset is split into
7 different subsets. For a fixed number of components (NC), the Y values of all individuals of each subset
are predicted using a submodel built with the 6 other subsets (the calibration subset).
The differences between the predicted Y values and the observed Y values are used to calculate the
Q2(NC) parameter for this number of components. The procedure starts at NC = 1 and is repeated,
incrementing NC as long as the increase in Q2(NC) is larger than a limit value fixed by various rules.
Each subset is constituted by selecting one row every seven rows in the dataset. The first subset is built
with the individuals corresponding to rows 7, 14, 21 and so on.
The second subset is constituted with the individuals corresponding to rows 1, 8, 15, and so on. The other subsets
are built in the same way (Scheme 1a).
Considering the way the subsets are built, it is clear that a permutation in row order of the X and Y dataset
changes the individual positions and modifies the composition of these subsets (Scheme 1b).
Thus, submodels and predicted Y values calculated during the cross-validation procedure are also affected
by a permutation of rows.
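A small sketch makes this concrete. The function below mimics the interleaved, position-based subset
assignment described above (it is an illustration of the described behaviour, not SIMCA's actual code) and
shows that permuting the rows changes which individuals share a validation subset.

```python
import numpy as np

def seven_fold_subsets(n_rows, k=7):
    """Assign rows to k validation subsets by position: one row every k rows."""
    return [list(range(start, n_rows, k)) for start in range(k)]

rows = np.arange(21)                    # 21 individuals, identified by row index
subsets = seven_fold_subsets(len(rows))
# 0-based rows 6, 13, 20 are the paper's 1-based rows 7, 14, 21 (Scheme 1a).
print("One subset:", subsets[6])

# Permuting the rows moves individuals to new positions, so every subset now
# contains different individuals and the cross-validation submodels change.
rng = np.random.default_rng(1)
permuted = rng.permutation(rows)
print("Individuals now grouped together:", permuted[subsets[6]])
```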
The range of R2 is between 0 and 1; the higher the value, the higher the predictive accuracy. According
to Chin (1998) and Henseler et al. (2009), an R2 value greater than 0.67 indicates high predictive accuracy,
a value in the range 0.33 - 0.67 indicates a moderate effect, an R2 between 0.19 and 0.33 indicates a low
effect, and an R2 below 0.19 is considered unacceptable (the exogenous variables are unable to explain the
endogenous dependent variable). A Q2 value greater than zero for a particular reflective endogenous
latent variable indicates the path model’s predictive relevance for that specific dependent construct
(Hair et al. 2016).
As far as a PLSR model is concerned, you have to be careful when evaluating the performance of your
model based just on R2. The values of
root mean square error of calibration (RMSEC),
root mean square error of cross-validation (RMSECV), and
root mean square error of prediction (RMSEP) are equally important.
So you have to select the model that combines a high R2 with the lowest error. I agree with Ahmed Meri that
some research works have presented thresholds for model performance, but mostly it depends
on your data and, most importantly, on how you internally and externally validate your model.
Similarly, when you evaluate the value of Q2, it should be weighed against the root mean square
error of prediction (RMSEP). As a rough example, say you get one model with Q2 = 0.72 and an RMSEP
of 9.567, and another model with Q2 = 0.70 but an RMSEP of 9.019; in this case it is much better
to choose the model with Q2 = 0.70, since it is more reliable.
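A small sketch of these error metrics, reusing the X_train/X_test split from the earlier example; the
variable names and the 7-fold setting are illustrative assumptions.

```python
# RMSEC (calibration), RMSECV (cross-validation) and RMSEP (prediction).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

pls = PLSRegression(n_components=2).fit(X_train, y_train)

rmsec = np.sqrt(mean_squared_error(y_train, pls.predict(X_train)))
y_cv = cross_val_predict(PLSRegression(n_components=2), X_train, y_train, cv=7)
rmsecv = np.sqrt(mean_squared_error(y_train, y_cv))
rmsep = np.sqrt(mean_squared_error(y_test, pls.predict(X_test)))
print(f"RMSEC={rmsec:.3f}  RMSECV={rmsecv:.3f}  RMSEP={rmsep:.3f}")
```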
Q2 is the R2 obtained when the PLS model built on a training set is applied to a test set. So a good value
for Q2 is one that is close to the R2. That means that your PLS model works independently of the specific
data that was used to train it. Adding more variables always makes R2 go up, but might not make
Q2 go up.
Reference: https://www.youtube.com/watch?v=dWPzr9NjJxU
Q2 is similar to R-squared.
If the value of Q2 < 0:
your model is very poor,
the independent variables cannot explain the dependent variable,
and the model has no predictive relevance.
Q2 values larger than 0 indicate that the exogenous constructs have predictive relevance for the
endogenous construct under consideration.
Predictive relevance (q2):
0.02 – small predictive relevance
0.15 – medium predictive relevance
0.35 – large predictive relevance
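The q2 effect size quoted in these thresholds is conventionally computed from the Q2 values obtained
with and without a given exogenous construct; the standard PLS-SEM definition (as in Hair et al.) is given
here for reference:

```latex
q^2 = \frac{Q^2_{\mathrm{included}} - Q^2_{\mathrm{excluded}}}{1 - Q^2_{\mathrm{included}}}
```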