International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018
DOI: 10.5121/ijdkp.2018.8302
USE OF PLS COMPONENTS TO IMPROVE
CLASSIFICATION ON BUSINESS DECISION MAKING
José C. Vega Vilca, Aniel Nieves-González and Roxana Aparicio
Institute of Statistics and Computer Information Systems, University of Puerto Rico, Rio
Piedras Campus, Puerto Rico
ABSTRACT
This paper presents a methodology that eliminates the multicollinearity of the predictor variables in
supervised classification by transforming the predictor variables into orthogonal components obtained
from the application of Partial Least Squares (PLS) Logistic Regression. PLS logistic regression was
developed by Bastien, Esposito-Vinzi, and Tenenhaus [1]. We apply the techniques of supervised
classification to data based on the original variables and to data based on the PLS components. The error
rates are calculated and the results compared. The implementation of the classification methodology
rests upon the development of computer programs written in the R language to make possible the
calculation of the PLS components and of the classification error rates. The impact of this research will be
disseminated based on evidence that the Partial Least Squares Logistic Regression methodology is
fundamental when working in supervised classification with data that have many predictor variables.
KEYWORDS
Supervised classification, error rate, multivariate analysis, Logistic Regression
1. INTRODUCTION
In data analysis via supervised classification [13] a classifier is constructed based on the observed
data. The data are arranged into an n × p matrix X, where n is the number of rows (subjects) and
p is the number of columns (variables in the study), together with a column vector y that contains an
indicator of the group to which each of the subjects belongs. The goal of constructing the
classifier is to place new subjects into one of the groups established in the given problem.
Whenever p (the number of variables in the predictor matrix X) is large, multicollinearity between
the variables is generally present. Such multicollinearity is defined as a high linear
dependence between the predictor variables. In this study it is demonstrated, through case studies, that
the multicollinearity should be eliminated in order to construct a better classifier.
The general rules of thumb of data analysis by supervised classification can be summarized as
follows:
• Given a new subject characterized by the p variables in the study, into which of the defined
groups should the subject be classified?
• The new subject should be classified into the group for which the probability of belonging is
greater than the probability of belonging to any of the other groups.
• Based on the matrix X and the vector y one should construct a classifier with a minimum
error rate of classification.
The lack of knowledge about the consequences of multicollinearity in the predictor matrix X
forces researchers to apply the techniques of supervised classification directly and to construct
inefficient classifiers with a high error rate. The classifier error rate is defined as follows.
Definition 1.1
Let ε be the classification error rate of a classifier δ, and let x be a new subject that does not
belong to a group G. Then ε is the probability

$$\varepsilon = P\left(\delta(x) = G \mid x \notin G\right) \qquad (1)$$

That is, ε is the conditional probability that the classifier places a new subject into a group to
which the subject does not belong.
In this work the multicollinearity problem is solved by transforming the predictor variables into
latent variables, also called components. The components are linear combinations of predictor
variables that have the property of being orthogonal (not correlated) and are obtained through the
application of a method named Logistic Regression by Partial Least Squares (PLS). This method
was introduced by Bastien, Esposito-Vinzi, and Tenenhaus [1].
This work presents a method to improve the strategies for data analysis in situations where the
subjects under consideration (e.g. people, animals, or things) should be classified correctly into
groups according to their characteristics in order to find favorable or unfavorable patterns. For instance, a
loan applicant at a bank provides personal information such as income, sex, age, number of
dependents, expenses, etc. The applicant is evaluated according to the information provided and
is classified as a potentially good or bad borrower, with the objective of determining whether the loan
should or should not be granted.
The goal of this study is to disseminate the application of Logistic Regression by Partial Least
Squares, introduced by Bastien, Esposito-Vinzi, and Tenenhaus [1], to eliminate the
problem of multicollinearity in the predictor matrix X, and to demonstrate, by means of case studies, that
the multicollinearity should be eliminated in order to construct a better classifier function,
characterized by a minimal error rate of classification.
2. MULTICOLLINEARITY
The authors in [11] analyze multicollinearity in multiple regression problems and verify two
aspects about multicollinearity. First, it is a problem that makes it difficult to precisely quantify
the effect that each predictor variable exerts on the dependent variable. Second, it can be
detected by computing the Variance Inflation Factor (VIF) and the condition
number (η). The VIF is an indicator of the specific multicollinearity of each predictor variable and
is defined as

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (2)$$

where $R_j^2$ is the coefficient of determination for the linear regression of the predictor $X_j$ on the
other predictor variables. As a rule of thumb, if $\mathrm{VIF}_j > 10$, then there is strong multicollinearity.
The condition number of the correlation matrix of the predictor variables is an indicator of the
global multicollinearity of the predictor variables. The condition number is computed as

$$\eta = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} \qquad (3)$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues (in absolute value) of the correlation
matrix of the predictor variables. Generally, if $\eta > 25$, then there is strong multicollinearity.
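As an illustration of both diagnostics, the following minimal R sketch (on simulated data, not on the data sets of this paper) computes the VIF values from the diagonal of the inverse correlation matrix and the condition number from the eigenvalues; the square-root form of Eq. (3) and the thresholds 10 and 25 are the conventions assumed above.

```r
# Minimal sketch of the multicollinearity diagnostics on a simulated predictor matrix X.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 5] <- X[, 1] + 0.1 * rnorm(n)            # induce near-collinearity between X1 and X5

R      <- cor(X)                             # correlation matrix of the predictors
vif    <- diag(solve(R))                     # VIF_j = 1/(1 - R_j^2): diagonal of R^{-1}
lambda <- eigen(R, symmetric = TRUE)$values  # eigenvalues of the correlation matrix
eta    <- sqrt(max(lambda) / min(lambda))    # condition number, Eq. (3)

vif   # values greater than 10 flag specific multicollinearity
eta   # values greater than 25 flag global multicollinearity
```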
Once the multicollinearity is detected it should be eliminated by means of the method proposed in
this work, Logistic Regression by Partial Least Squares (PLS).
2.1. DIAGNOSIS OF MULTICOLLINEARITY
Fernando Tusell [10] states that there are some indicators and statistical values that help to
diagnose multicollinearity in multiple regression. Below, we present three basic rules for
multicollinearity diagnosis. The first one is strictly related to multiple regression, and the other
two are related to supervised classification.
• A large value of the coefficient of determination R² together with non-significance of most of the
parameters. In the presence of multicollinearity the estimated regression coefficients may have
a sign opposite to what was expected. Moreover, their variances are also high, which leads
to the non-significance of the parameters. In this case it seems that none of the predictor
variables explains the response variable, whereas all of them, as a whole, do explain the
response variable. The multicollinearity does not allow the contribution of each predictor
variable to be clarified.
• An eigenvalue of the correlation matrix with magnitude close to zero (zero in the case of
perfect multicollinearity). In this case, because of the difference between the smallest and the
greatest eigenvalues, the condition number of the correlation matrix will be large and
therefore the multicollinearity is evident.
• A large value of the VIF for the predictor variables. If VIF_j ≥ 10 for some predictor variable
X_j, then the coefficient of determination R_j² for the regression of that variable
on the other variables is greater than or equal to 0.90. This indicates dependence between
variables that are supposedly independent. Furthermore, it can be shown that
the VIF of each predictor variable is located on the main diagonal of the inverse of the
correlation matrix.
3. LOGISTIC REGRESSION PLS
Bastien, Esposito-Vinzi and Tenenhaus [1] presented an algorithm that transforms predictor
variables (with multicollinearity) into latent variables, also called PLS components (without
multicollinearity). The authors of [1] illustrate their methodology by analyzing a data set named
"Bordeaux". This data set corresponds to 34 years of observations of a French wine in terms of
quality: good, average, and poor. The predictor variables are the sum of the average
daily temperatures (in °C), the duration of sunny weather (in hours), the number of very
hot days, and the amount of rainfall (in mm). Without any multicollinearity analysis the
investigators used logistic regression as a classifier. They classified the data and found 7
classification errors, so the estimated error rate was 7/34 ≈ 20.6%. Using the method of
Logistic Regression PLS, the authors transformed the four predictor variables into one PLS
component and used the logistic regression classifier. They reclassified the data and found 6 errors,
so the error rate was 6/34 ≈ 17.6%.
It has been observed that the PLS logistic regression method is efficient even though the data
analyzed have low multicollinearity. For the Bordeaux data, in no case was the variance inflation
factor (VIF) of a predictor variable greater than 10, and the condition number was less than 25.
Thereby, the existing multicollinearity is minimal or almost nonexistent.
Recently, Bertrand, Meyer and Maumy-Bertrand [2] presented a library for R called plsRglm:
PLS generalized linear models for R. The library deals with PLS Regression for the case of
multiple regression and with PLS logistic regression for the case of supervised classification.
They also solve the classification problem for the "Bordeaux" wine data. For that problem the
investigators compute all the possible PLS components (four in that case) and select the optimal
number of components in the data in order to find the best model for classification. They did that
by using the following criteria:
• Akaike Information Criterion (AIC).
• Bayesian Information Criterion (BIC).
• Misclassification error rate.
To select the number of components one must keep in mind that an overly simplistic model (too
few components) produces a large approximation error (underfitting) whereas an overly complex
model (too many components) produces a large estimation error (overfitting).
3.1. SELECTION OF THE NUMBER OF COMPONENTS
Three criteria are used to select the number of PLS components: the Akaike Information Criterion,
the Bayesian Information Criterion, and the number of misclassifications. The way in which the
AIC and BIC criteria work is explained in [3]. A short R sketch using the plsRglm package is given
after the list below.
1. The Akaike Information Criterion (AIC) estimates the relative distance between the
unknown likelihood function of the data and the fitted likelihood function of the model.
Thus, a smaller AIC value means that the analyzed model is closer to the true model.
2. The Bayesian Information Criterion (BIC) estimates the posterior probability that a
model under a given Bayesian configuration is the true model. Hence, a smaller BIC value
means that it is more probable that the analyzed model is the true model.
3. The misclassification error rate: after constructing the classifier, the data that were used to
construct the classifier are classified and the number of misclassifications is counted.
The model with the minimum number of misclassifications is considered the best one.
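As a rough illustration of this selection step, the following R sketch (on simulated data, not the authors' script) fits a PLS logistic regression with the plsRglm package [2] and inspects the three criteria; the element names InfCrit and tt are those used by recent versions of the package and should be treated as an assumption.

```r
# Sketch: PLS logistic regression and inspection of AIC, BIC and misclassification counts.
library(plsRglm)

set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
X[, 10] <- X[, 1] + 0.05 * rnorm(100)           # correlated predictors
y <- rbinom(100, 1, plogis(X[, 1] - X[, 2]))    # binary group indicator

modpls <- plsRglm(dataY = y, dataX = data.frame(X), nt = 5,
                  modele = "pls-glm-logistic")

modpls$InfCrit    # AIC, BIC and misclassification count for 0..nt components
head(modpls$tt)   # the orthogonal PLS components (latent variables)
```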
4. CLASSIFIERS
We now present seven classifiers that are commonly used in supervised classification: logistic
regression, linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbors with
k = 3 and with k = 5, naive Bayes, and recursive partitioning and regression trees (a
classification tree).
4.1. LOGISTIC REGRESSION:
Logistic regression is a regression model widely used for data analysis. In this case the response
variable is binary (dichotomous) or, in some cases, polytomous, whereas the predictor variables can be
continuous or categorical. Logistic regression is a special case of the Generalized Linear Model (GLM),
where the parameter estimation, and hence the probability estimation, is done by the maximum
likelihood method [6].
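A small hedged sketch of this classifier in R is given below; the data frame dat and the train/test split are simulated stand-ins, not the data sets analyzed in this paper (the same split is reused in the sketches for the remaining classifiers).

```r
# Logistic regression as a GLM fitted by maximum likelihood (glm with binomial family).
set.seed(1)
dat <- data.frame(group = factor(rep(c("A", "B"), each = 50)),
                  x1 = c(rnorm(50), rnorm(50, mean = 1)),
                  x2 = rnorm(100))
idx   <- sample(nrow(dat), 70)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit  <- glm(group ~ ., data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")   # estimated P(group == "B")
pred <- factor(ifelse(prob > 0.5, "B", "A"), levels = levels(dat$group))
mean(pred != test$group)                                   # proportion misclassified on the test split
```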
4.2. DISCRIMINANT ANALYSIS:
Discriminant analysis is a multivariate analysis technique that constructs a classifier function based on
multivariate data belonging to well-defined classes or groups. The goal is to assign new subjects to one of
these groups. The classifier function is constructed as a linear combination of a set of
independent or predictor variables. If the covariance matrices of the groups under consideration are
homogeneous, then we apply Linear Discriminant Analysis; otherwise we apply Quadratic
Discriminant Analysis [12].
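A brief sketch of both variants with the MASS package, reusing the simulated train/test split from the logistic regression sketch above:

```r
# Linear and quadratic discriminant analysis via MASS::lda and MASS::qda.
library(MASS)

fit_lda <- lda(group ~ ., data = train)   # assumes homogeneous group covariance matrices
fit_qda <- qda(group ~ ., data = train)   # allows a separate covariance matrix per group

pred_lda <- predict(fit_lda, newdata = test)$class
pred_qda <- predict(fit_qda, newdata = test)$class
mean(pred_lda != test$group)              # misclassification proportion for LDA
```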
4.3. K-NEAREST NEIGHBORS:
The k-Nearest Neighbor (KNN) classifier is a simple classifier based on distance. A
new subject is classified into the most frequent class among its k nearest neighbors.
For k = 3 and k = 5 (the most commonly used values) a different classifier function is obtained [8].
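A sketch with the class package, again on the simulated split used above; knn() takes numeric predictor matrices and the training labels:

```r
# k-nearest-neighbour classification for k = 3 and k = 5 via class::knn.
library(class)

pred_knn3 <- knn(train = train[, c("x1", "x2")], test = test[, c("x1", "x2")],
                 cl = train$group, k = 3)
pred_knn5 <- knn(train = train[, c("x1", "x2")], test = test[, c("x1", "x2")],
                 cl = train$group, k = 5)
mean(pred_knn3 != test$group)             # error for k = 3
```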
4.4. NAIVE BAYES
Naive Bayes is a simple but efficient algorithm that predicts the class to which a new subject belongs. It
is based on Bayes' theorem, and the term naive is used because the algorithm does not consider
possible dependencies between the predictor variables [7].
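A sketch with the e1071 package on the same simulated split:

```r
# Naive Bayes: class-conditional independence is assumed for the predictors.
library(e1071)

fit_nb  <- naiveBayes(group ~ ., data = train)
pred_nb <- predict(fit_nb, newdata = test)
mean(pred_nb != test$group)
```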
4.5. CLASSIFICATION TREES
A classification tree is a classifier that recursively splits up the range of possible values of the predictor
variables. The goal is to construct logical networks and to establish rules that represent the knowledge of
the problem through a tree structure. We used Recursive Partitioning and Regression Trees (rpart) as
established in [4].
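A sketch with the rpart package [4] on the same simulated split:

```r
# Classification tree via recursive partitioning (rpart with method = "class").
library(rpart)

fit_tree  <- rpart(group ~ ., data = train, method = "class")
pred_tree <- predict(fit_tree, newdata = test, type = "class")
mean(pred_tree != test$group)
```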
5. CLASSIFIER ERROR RATE
The classifier error rate is defined as the probability that a classifier function classifies a new
individual into a group to which it does not belong (see Eq. (1)). The most commonly used classifier
error rates are: the apparent error rate, the leave-one-out cross-validation error rate (cv-n), and the
10-fold cross-validation error rate (cv-10).
5.1. APPARENT ERROR RATE [5].
Although the apparent error rate is used by many investigators, its use is not recommended
because it is overly optimistic (it usually yields low values) and has a high bias. Figure 1 illustrates
the computation of the apparent error rate, and a short R sketch is given after Figure 1. We followed
the following procedure in its computation:
1. A classifier function is constructed using all the data.
2. The classifier function classifies the data that was used to construct the classifier.
3. The number of misclassifications is counted.
4. The proportion of misclassifications is computed: the total number of misclassifications
divided by the sample size.
Figure 1: Apparent error rate.
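The procedure can be sketched in R as follows, with LDA as an example classifier and the simulated data frame dat from Section 4 (any of the seven classifiers could be substituted):

```r
# Apparent (resubstitution) error rate: fit on all the data, then reclassify the same data.
library(MASS)

fit  <- lda(group ~ ., data = dat)          # step 1: construct the classifier on all the data
pred <- predict(fit, newdata = dat)$class   # step 2: classify the same data
apparent_error <- mean(pred != dat$group)   # steps 3-4: proportion misclassified
apparent_error
```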
5.2. ERROR RATE BY 10-FOLD CROSS-VALIDATION [9].
This method yields a more accurate error rate. Figure 2 illustrates the computation of this error
rate, and a generic K-fold sketch in R is given after Figure 2. The following procedure was used to
compute this error rate:
1. The data set is split into 10 subsets.
2. The classifier function is constructed using 9 of the 10 subsets of the sample.
3. The subset not used to construct the classifier is classified using the classifier function.
4. Steps 2 and 3 are repeated until all subsets are classified.
5. The number of bad classifications is counted.
6. The proportion of bad classifications is computed as the number of bad classifications
divided by the sample size.
Figure 2: Error rate by cross validation.
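The generic K-fold procedure can be sketched in R as below (K = 10 here), again with LDA as the example classifier and the simulated data frame dat from Section 4:

```r
# Generic K-fold cross-validation error rate.
library(MASS)

cv_error_rate <- function(dat, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(dat)))              # step 1: split into K subsets
  wrong <- 0
  for (k in 1:K) {
    fit  <- lda(group ~ ., data = dat[folds != k, ])             # step 2: fit on the other K - 1 subsets
    pred <- predict(fit, newdata = dat[folds == k, ])$class      # step 3: classify the held-out subset
    wrong <- wrong + sum(pred != dat$group[folds == k])          # step 5: count the misclassifications
  }
  wrong / nrow(dat)                                              # step 6: proportion misclassified
}

cv10 <- cv_error_rate(dat, K = 10)
```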
5.3. ERROR RATE BY LEAVE-ONE-OUT CROSS-VALIDATION.
This method is also known as the error rate by n-fold cross-validation. Like 10-fold cross-validation,
it yields a more accurate error rate. Figure 2 also illustrates this computation, which proceeds by the
following steps; as noted after the list, it corresponds to the K-fold sketch above with K = n:
1. The data set is split into n parts, where n is the sample size.
2. The classifier function is constructed using n − 1 parts of the sample.
3. The individual that was not considered in the classifier construction is then classified.
4. Steps 2 and 3 are repeated until all members of the sample are classified.
5. The number of misclassifications is counted.
6. The proportion of misclassifications is computed as the number of misclassifications
divided by the sample size.
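Leave-one-out is simply the K = n special case of the K-fold sketch in Section 5.2:

```r
# Leave-one-out (cv-n) via the K-fold helper sketched above, with K equal to the sample size.
cv_n <- cv_error_rate(dat, K = nrow(dat))
```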
6. DATASETS
Five different data sets were used in the present work. We describe such data sets below and in
Table 1.
6.1 AUSTRALIAN DATA SET
The Australian database contains the characteristics of 690 clients of a financial institution. The
dependent variable is "credit card" and there are 14 predictor variables. The dependent variable
indicates whether or not the client obtains credit card approval. The data set is available at
https://archive.ics.uci.edu/ml/datasets.
6.2 HOUSEVOTES84 DATA SET
The data set includes the votes of the members of the House of Representatives of the United
States of America on 16 key votes identified by the Congressional Quarterly Almanac
(CQA). The number of predictor variables is 16 and the response variable has two possible
values: republican or democrat. Variable number three was eliminated because it takes the same
value across the observations used. The data are available in the Machine Learning Databases
repository of the University of California at Irvine (UCI),
http://www.ics.uci.edu/~mlearn/MLRepository.html
6.3 GERMAN DATA SET
This data set contains 20 variables of financial information about 1000 loan applicants, and a
classifier variable that expresses whether the applicant is a "good" client. The data are available at
https://archive.ics.uci.edu/ml/datasets.
6.4 SONAR DATA SET
A database with 208 observations, each with 60 variables, and 2 classes. The data are
available in the UCI Machine Learning Databases repository,
https://archive.ics.uci.edu/ml/datasets.
6.5 COLON DATA SET
A data set that consists of microarray experiment results. The data contain 2000 attributes for
two types of colon tissue: normal and tumor. The data are available on the Gene Expression Project
webpage of Princeton University, http://microarray.princeton.edu/oncology.
Table 1. Data sets description

Name             Subjects   Predictors   Classes   Description
Australian       690        14           2         Clients
House Votes 84   232        15           2         Voters
German           1000       20           2         Clients
Sonar            208        60           2         Sonar signals
Colon            62         2000         2         Microarrays
7. IMPLEMENTATION AND RESULTS
The methodology presented in this study was applied to the data sets in Table 1. Each data set was
processed in the following manner (a compact R sketch of this workflow is given after the list):

1. Each data set, characterized by its original variables, was analyzed. The apparent error rate,
the leave-one-out cross-validation error rate (cv-n), and the 10-fold cross-validation error
rate (cv-10) were calculated.

2. Each data set was transformed to PLS components, which were then analyzed. First, we
examined the degree of multicollinearity of the predictor variables by means of the
condition number. Second, the predictor variables were transformed to (uncorrelated) PLS
components, and the number of components used was determined by the AIC, the BIC, and
the misclassification error rate; these results are shown in Table 2. Finally, the apparent
error rate, the leave-one-out cross-validation error rate (cv-n), and the 10-fold
cross-validation error rate (cv-10) were calculated.
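The following hypothetical sketch ties the steps together for one data set: the cv-10 error rate is computed on the original predictors and on a small number of PLS components, reusing the cv_error_rate() helper from Section 5 and plsRglm from Section 3. Here dat is assumed to be a data frame whose first column, group, is the class and whose remaining columns are the predictors; nt = 2 is used only as an example (cf. Table 2).

```r
# Compare cv-10 error rates: original predictors versus PLS components.
library(plsRglm)

modpls  <- plsRglm(dataY = as.numeric(dat$group) - 1,   # 0/1 group indicator
                   dataX = dat[, -1], nt = 2,
                   modele = "pls-glm-logistic")
dat_pls <- data.frame(group = dat$group, modpls$tt)     # predictors replaced by the PLS components

err_original <- cv_error_rate(dat, K = 10)
err_pls      <- cv_error_rate(dat_pls, K = 10)
c(original = err_original, pls = err_pls)
```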
Table 2. Determination of the number of PLS components for each data set

Set              PLS components   AIC      BIC      Misclassified
Australian       PLS_Comp_0       950.2    954.7    307
(η = 3.59)       PLS_Comp_1       479.2    488.3    98
                 PLS_Comp_2       437.3    450.9    87
                 PLS_Comp_3       432.8    451.0    90
                 PLS_Comp_4       434.0    456.7    88
                 PLS_Comp_5       436.0    463.2    86
House Votes 84   PLS_Comp_0       322.5    326.0    108
(η = 8.58)       PLS_Comp_1       106.1    113.0    20
                 PLS_Comp_2       47.1     57.4     10
                 PLS_Comp_3       33.1     46.9     6
                 PLS_Comp_4       32.7     50.0     5
                 PLS_Comp_5       34.3     55.0     5
German           PLS_Comp_0       1223.7   1228.6   300
(η = 3.12)       PLS_Comp_1       985.1    995.0    236
                 PLS_Comp_2       967.8    982.5    227
                 PLS_Comp_3       965.6    985.3    224
                 PLS_Comp_4       966.7    991.2    228
                 PLS_Comp_5       968.6    998.0    233
Sonar            PLS_Comp_0       289.4    292.7    97
(η = 42.99)      PLS_Comp_1       210.8    217.5    55
                 PLS_Comp_2       167.4    177.4    38
                 PLS_Comp_3       142.6    156.0    27
                 PLS_Comp_4       137.0    153.7    23
                 PLS_Comp_5       123.0    143.1    24
Colon            PLS_Comp_0       82.6     84.8     22
(η = ∞)          PLS_Comp_1       60.6     64.8     16
                 PLS_Comp_2       36.0     42.4     6
                 PLS_Comp_3       17.5     26.0     2
                 PLS_Comp_4       10.0     20.6     0
                 PLS_Comp_5       12.0     24.8     0
Table 2 shows that the Australian, House Votes 84 and German data sets have low multicollinearity,
since their condition numbers are 3.59, 8.58 and 3.12, respectively, all less than 25. Regarding the
number of PLS components, we observe that the Australian data set needs only 2 components from
its 14 predictor variables, the House Votes 84 data set needs 3 PLS components from 15 predictor
variables, and the German data set needs 2 PLS components from 20 predictor variables. The Sonar
and Colon data sets have high multicollinearity, because their condition numbers are 42.99 and
infinity, respectively (both greater than 25). Only 4 PLS components were used for the Sonar data
set, from 60 predictor variables, and only 4 components were used for the Colon data set, from 2000
predictor variables.
Tables 3, 4, 5, 6 and 7 show the apparent error rates, the leave-one-out cross-validation (cv-n) error
rates, and the 10-fold cross-validation (cv-10) error rates. These errors were calculated for the
original data, based on the predictor variables, and for the processed data, based on the PLS
components. In general, we note the following:
1. The apparent error rate is always lower than the leave-one-out and 10-fold cross-validation
error rates, for both the original data and the PLS components.

2. For data with low multicollinearity in their predictor variables, such as the Australian,
House Votes 84, and German data sets, the three types of error rates yielded almost the
same values. The difference is that the error rates for the transformed data were calculated
using a minimal number of PLS components: 2 components for the 14 predictors of
Australian, 3 components for the 15 predictors of House Votes 84, and 2 components for
the 20 predictors of German.

3. For data with high multicollinearity in their predictor variables, such as the Sonar and
Colon data sets, the three types of error rates yielded smaller values for the processed data
based on PLS components than for the original data. Again, the error rates for the
transformed data were calculated using a minimal number of PLS components: 4
components for the 60 predictors of the Sonar data set and 4 components for the 2000
predictors of the Colon data set.
4. The minimum error rate identifies the best classifier, which is not unique and depends on the
data. The error rates that should be used to evaluate a classifier are 10-fold cross-validation
(cv-10) and leave-one-out cross-validation (cv-n), in that order. For the Australian data set,
the best classifier is logistic regression with the original data and LDA with the processed
data; for the House Votes 84 data set the best classifiers are LDA and rpart with the original
data and logistic regression with the processed data; for the German data set the best
classifier is LDA with the original data and logistic regression with the processed data; for
the Sonar data set, the best classifier is knn-3 with the original data, and knn-3 and logistic
regression with the processed data. The best classifier for the original Colon data set is
logistic regression, and for the processed data the best classifiers are logistic regression and
LDA.
Table 3. Australian dataset error rates

                 Original data (14 predictors)     2-component PLS
Method           apparent   cv-n     cv-10         apparent   cv-n     cv-10
Logistic Reg.    12.46      13.91    13.77         12.61      12.60    12.46
LDA              13.91      14.20    14.06         11.59      11.59    12.32
QDA              18.84      20.00    20.29         14.20      14.20    14.78
knn-3            16.38      32.75    32.61         9.86       14.93    13.77
knn-5            22.03      31.16    31.16         10.29      14.49    13.48
Naive Bayes      20.00      20.29    21.16         13.91      13.91    13.91
Rpart            11.74      12.17    14.35         12.03      13.48    12.75
Table 4. House Votes 84 dataset error rates

                 Original data (15 predictors)     3-component PLS
Method           apparent   cv-n     cv-10         apparent   cv-n     cv-10
Logistic Reg.    2.16       6.47     6.90          2.59       2.59     2.59
LDA              3.02       3.02     3.02          3.02       3.02     3.02
QDA              3.45       NA       NA            3.45       3.45     3.45
knn-3            6.03       7.76     7.76          1.72       3.02     3.88
knn-5            7.76       8.62     8.62          2.59       3.45     3.45
Naive Bayes      5.17       5.17     7.33          6.47       6.90     6.90
Rpart            3.02       3.02     3.02          3.45       3.88     6.47
Table 5. German dataset error rates

                 Original data (20 predictors)     2-component PLS
Method           apparent   cv-n     cv-10         apparent   cv-n     cv-10
Logistic Reg.    23.40      25.00    24.90         22.70      22.90    22.80
LDA              23.10      24.20    24.50         22.70      22.80    22.90
QDA              22.00      26.90    26.70         22.30      22.60    23.10
knn-3            19.20      37.40    37.70         15.70      26.80    27.60
knn-5            25.10      35.10    35.40         18.50      25.60    26.00
Naive Bayes      24.50      25.50    26.30         22.50      22.50    23.00
Rpart            21.80      26.90    26.50         21.20      21.90    25.60
Table 6. Sonar dataset error rates

                 Original data (60 predictors)     4-component PLS
Method           apparent   cv-n     cv-10         apparent   cv-n     cv-10
Logistic Reg.    0.00       27.40    26.92         11.06      12.50    12.98
LDA              9.62       24.52    25.48         13.46      13.94    14.42
QDA              0.00       24.04    25.48         14.90      15.38    15.38
knn-3            11.06      18.75    20.67         8.17       12.50    12.98
knn-5            13.46      17.31    21.15         9.13       12.02    13.94
Naive Bayes      26.92      32.69    32.69         15.38      17.31    18.75
Rpart            12.50      33.17    29.33         9.62       16.35    17.79
Table 7. Colon dataset error rates

                 Original data (2000 predictors)   4-component PLS
Method           apparent   cv-n     cv-10         apparent   cv-n     cv-10
Logistic Reg.    0.00       51.61    4.84          0.00       4.84     3.23
LDA              3.23       22.58    19.35         1.61       1.61     3.23
QDA              3.23       NA       6.45          1.61       4.84     6.45
knn-3            8.06       14.52    14.52         6.45       9.68     12.90
knn-5            12.90      16.13    16.13         6.45       9.68     9.68
Naive Bayes      29.03      40.32    64.52         1.61       6.45     8.06
Rpart            8.06       41.94    79.03         9.68       20.97    25.81
8. CONCLUSIONS
1. In each data set, the choice of the number of PLS components is independent of the degree of
multicollinearity of the predictor variables. The Australian data set has condition number
η = 3.59 and 2 PLS components were selected. The House Votes 84 data set has condition
number η = 8.58 and 3 PLS components were selected. The German data set has condition
number η = 3.12 and 2 PLS components were selected. The Sonar data set has condition
number η = 42.99 and 4 PLS components were selected. The Colon data set has an infinite
condition number and 4 PLS components were selected.
2. BIC was the most frequent criterion for selecting the number of PLS components in each
data set. The misclassification error rate criterion was used only for selecting the number of
PLS components for the Sonar data set. For the Colon data set, the three selection criteria,
i.e. AIC, BIC, and misclassification error rate, agreed.
3. The dimensionality of each data set was drastically reduced by the transformation to PLS
components: Australian dropped from 14 variables to 2 PLS components, House Votes 84
from 15 variables to 3 PLS components, German from 20 variables to 2 PLS components,
Sonar from 60 variables to 4 PLS components, and Colon from 2000 variables to 4 PLS
components.
4. The apparent error rates of each classifier, for all data sets, are on average slightly lower
when using PLS components than when using the original predictor variables.
5. The 10-fold cross-validation (cv-10) error rates of each classifier, for all data sets, are on
average slightly higher than the leave-one-out cross-validation error rates, both when using
the original variables and when using PLS components.
6. The 10-fold cross-validation (cv-10) and leave-one-out cross-validation error rates for all
data sets are generally lower when using PLS components than the equivalent error rates
when using all the predictor variables. Herein lies the benefit of working with PLS
components: a significant decrease in the error rate is achieved across the classifiers.
7. There is no ideal classifier with a minimal error rate for every data set. Analyzing the 10-fold
cross-validation error rates for each data set, we found that the best classifier for the
Australian data set is linear discriminant analysis, for the House Votes 84 and German data
sets it is logistic regression, for the Sonar data set the best classifiers are logistic regression
and knn-3, and for the Colon data set the best classifiers are logistic regression and linear
discriminant analysis.
REFERENCES
[1] Philippe Bastien, Vincenzo Esposito-Vinzi, and Michel Tenenhaus. PLS generalised linear
regression. Computational Statistics & Data Analysis, 48(1):17–46, 2005.
[2] F. Bertrand, N. Meyer, and M. Maumy-Bertrand. Package 'plsRglm', version 1.1.1. R
documentation, 2015.
[3] Guillaume Bouchard and Gilles Celeux. Selection of generative models in classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 28(4):544–554, 2006.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees.
CRC Press, 1984.
[5] C. Smith. Some examples of discrimination. Annals of Eugenics, 18:272–282, 1947.
[6] Annette J. Dobson and Adrian Barnett. An Introduction to Generalized Linear Models. CRC
Press, 2008.
[7] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduction to
Information Retrieval. Cambridge University Press, 2008.
[8] Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[9] Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the
Royal Statistical Society, Series B (Methodological), pages 111–147, 1974.
[10] Fernando Tusell. Análisis de regresión. Introducción teórica y práctica basada en R. 2011.
[11] José Carlos Vega-Vilca and Josué Guzmán. Regresión PLS y PCA como solución al problema de
multicolinealidad en regresión múltiple. Revista de Matemática: Teoría y Aplicaciones,
18(1):9–20, 2011.
[12] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag,
2002.
[13] Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 2016.
AUTHORS
Dr. José C. Vega holds a Ph.D. degree in Computer and Information Sciences and
Engineering from the University of Puerto Rico - Mayaguez Campus. He received his
MS degree in Statistics from the University of San Marcos, Lima, Peru and his BS in
Statistics from the Agraria La Molina University, Lima, Peru. He is a Professor in the
Institute of Statistics and Information Systems of the University of Puerto Rico - Río
Piedras Campus.
Roxana Aparicio received her Ph.D. degree in Computer and Information Sciences and
Engineering from the University of Puerto Rico - Mayaguez Campus in 2012. She
received her MS degree in Scientific Computing from the University of Puerto Rico and
her BS in Computer Engineering from the University San Antonio Abad, Cusco, Peru.
Currently she is a professor at the Institute of Statistics and Information Systems of the
University of Puerto Rico - Río Piedras Campus.
Aniel Nieves-Gonzalez received his Ph.D. in Applied Mathematics from the State
University of New York at Stony Brook in 2010. He has a M.S. in applied mathematics
from the University of Puerto Rico and his undergraduate degree is in Computer Science
and Physics also from the University of Puerto Rico. He is currently an Assistant
Professor at the Institute of Statistics and Computerized Information Systems at the
University of Puerto Rico Rio Piedras Campus. He has published papers about
mathematical models (differential equations) of complex physiological systems, such as the thick
ascending limb (a part of the kidney). He still works on problems related to kidney physiology, but also
on problems related to coral population dynamics and spectral analysis of high-frequency financial data.
His research interests include dynamical systems, power spectral analysis, wavelet analysis, and parallel
computing.
More Related Content

What's hot (12)

PPTX
Logistic regression with SPSS examples
Gaurav Kamboj
 
PPTX
Cannonical Correlation
domsr
 
PPTX
Logistic regression
saba khan
 
PDF
QNT 275 Week 5 Apply Connect Week 5 Case Qnt 275 qnt275 https://uopcourses.co...
NewUOPCourse
 
PPTX
Introduction to principal component analysis (pca)
Mohammed Musah
 
PPTX
Logistic regression
DrZahid Khan
 
PPT
Discriminant analysis
Murali Raj
 
PPTX
Logistic regression with SPSS
LNIPE
 
DOCX
QNT 275 qnt275 QNT275 Qnt 275 qnt275 QNT/275 STATISTICS FOR DECISION MAKING h...
UOPCourseHelp
 
PPTX
Discriminant analysis
Wansuklangk
 
PPT
Discriminant analysis group no. 4
Advait Bhobe
 
PPT
Logistic regression
Khaled Abd Elaziz
 
Logistic regression with SPSS examples
Gaurav Kamboj
 
Cannonical Correlation
domsr
 
Logistic regression
saba khan
 
QNT 275 Week 5 Apply Connect Week 5 Case Qnt 275 qnt275 https://uopcourses.co...
NewUOPCourse
 
Introduction to principal component analysis (pca)
Mohammed Musah
 
Logistic regression
DrZahid Khan
 
Discriminant analysis
Murali Raj
 
Logistic regression with SPSS
LNIPE
 
QNT 275 qnt275 QNT275 Qnt 275 qnt275 QNT/275 STATISTICS FOR DECISION MAKING h...
UOPCourseHelp
 
Discriminant analysis
Wansuklangk
 
Discriminant analysis group no. 4
Advait Bhobe
 
Logistic regression
Khaled Abd Elaziz
 

Similar to USE OF PLS COMPONENTS TO IMPROVE CLASSIFICATION ON BUSINESS DECISION MAKING (20)

PPTX
Multicolinearity
Pawan Kawan
 
PPTX
correction maximum likelihood estimation method
qazikhanzla
 
PDF
Multicollinearity econometrics semester 4 Delhi University
killerharsh4100
 
PPTX
Multicollinearity PPT
GunjanKhandelwal13
 
PPT
Econometric model ing
Matt Grant
 
PPT
Econometrics ch11
Baterdene Batchuluun
 
PDF
A comparative analysis of predictve data mining techniques4
Mintu246
 
PDF
A comparative analysis of predictve data mining techniques4
Mintu246
 
PPT
Econometrics_ch11.ppt
MewdedDelelegn
 
PDF
Multicollinearity1
Muhammad Ali
 
PPTX
Chapter8_Final.pptxkhnhkjllllllllllllllllllllllllllllllllllllllllllllllllllll...
DibyenduRoy49
 
PDF
International Journal of Quantum Chemistry
speterangelo
 
PPTX
The 10 Algorithms Machine Learning Engineers Need to Know.pptx
Chode Amarnath
 
PPTX
LEC11 (1).pptx
BokulHossain1
 
PDF
Factor analysis
Mintu246
 
PDF
A comparative analysis of predictve data mining techniques3
Mintu246
 
PDF
Assumptions of Linear Regression - Machine Learning
Kush Kulshrestha
 
PPTX
statistical learning theory
HarshKumar943076
 
PPTX
Multicollinearity.pptx this is a presentation of hetro.
vrao95787
 
DOCX
A researcher in attempting to run a regression model noticed a neg.docx
evonnehoggarth79783
 
Multicolinearity
Pawan Kawan
 
correction maximum likelihood estimation method
qazikhanzla
 
Multicollinearity econometrics semester 4 Delhi University
killerharsh4100
 
Multicollinearity PPT
GunjanKhandelwal13
 
Econometric model ing
Matt Grant
 
Econometrics ch11
Baterdene Batchuluun
 
A comparative analysis of predictve data mining techniques4
Mintu246
 
A comparative analysis of predictve data mining techniques4
Mintu246
 
Econometrics_ch11.ppt
MewdedDelelegn
 
Multicollinearity1
Muhammad Ali
 
Chapter8_Final.pptxkhnhkjllllllllllllllllllllllllllllllllllllllllllllllllllll...
DibyenduRoy49
 
International Journal of Quantum Chemistry
speterangelo
 
The 10 Algorithms Machine Learning Engineers Need to Know.pptx
Chode Amarnath
 
LEC11 (1).pptx
BokulHossain1
 
Factor analysis
Mintu246
 
A comparative analysis of predictve data mining techniques3
Mintu246
 
Assumptions of Linear Regression - Machine Learning
Kush Kulshrestha
 
statistical learning theory
HarshKumar943076
 
Multicollinearity.pptx this is a presentation of hetro.
vrao95787
 
A researcher in attempting to run a regression model noticed a neg.docx
evonnehoggarth79783
 
Ad

Recently uploaded (20)

PPTX
Comparing Translational and Rotational Motion.pptx
AngeliqueTolentinoDe
 
PPTX
ESP 10 Edukasyon sa Pagpapakatao PowerPoint Lessons Quarter 1.pptx
Sir J.
 
PDF
Indian National movement PPT by Simanchala Sarab, Covering The INC(Formation,...
Simanchala Sarab, BABed(ITEP Secondary stage) in History student at GNDU Amritsar
 
PPTX
How to Add a Custom Button in Odoo 18 POS Screen
Celine George
 
PDF
Learning Styles Inventory for Senior High School Students
Thelma Villaflores
 
PPTX
week 1-2.pptx yueojerjdeiwmwjsweuwikwswiewjrwiwkw
rebznelz
 
PPTX
Practice Gardens and Polytechnic Education: Utilizing Nature in 1950s’ Hu...
Lajos Somogyvári
 
PDF
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
PPTX
Connecting Linear and Angular Quantities in Human Movement.pptx
AngeliqueTolentinoDe
 
PPTX
Parsing HTML read and write operations and OS Module.pptx
Ramakrishna Reddy Bijjam
 
PDF
Our Guide to the July 2025 USPS® Rate Change
Postal Advocate Inc.
 
DOCX
Lesson 1 - Nature and Inquiry of Research
marvinnbustamante1
 
PPTX
Lesson 1 Cell (Structures, Functions, and Theory).pptx
marvinnbustamante1
 
PPTX
How to Setup Automatic Reordering Rule in Odoo 18 Inventory
Celine George
 
PDF
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.06.25.pdf
TechSoup
 
PDF
Lesson 1 - Nature of Inquiry and Research.pdf
marvinnbustamante1
 
DOCX
MUSIC AND ARTS 5 DLL MATATAG LESSON EXEMPLAR QUARTER 1_Q1_W1.docx
DianaValiente5
 
PPTX
How to Configure Refusal of Applicants in Odoo 18 Recruitment
Celine George
 
PDF
Genomics Proteomics and Vaccines 1st Edition Guido Grandi (Editor)
kboqcyuw976
 
PPTX
PLANNING A HOSPITAL AND NURSING UNIT.pptx
PRADEEP ABOTHU
 
Comparing Translational and Rotational Motion.pptx
AngeliqueTolentinoDe
 
ESP 10 Edukasyon sa Pagpapakatao PowerPoint Lessons Quarter 1.pptx
Sir J.
 
Indian National movement PPT by Simanchala Sarab, Covering The INC(Formation,...
Simanchala Sarab, BABed(ITEP Secondary stage) in History student at GNDU Amritsar
 
How to Add a Custom Button in Odoo 18 POS Screen
Celine George
 
Learning Styles Inventory for Senior High School Students
Thelma Villaflores
 
week 1-2.pptx yueojerjdeiwmwjsweuwikwswiewjrwiwkw
rebznelz
 
Practice Gardens and Polytechnic Education: Utilizing Nature in 1950s’ Hu...
Lajos Somogyvári
 
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
Connecting Linear and Angular Quantities in Human Movement.pptx
AngeliqueTolentinoDe
 
Parsing HTML read and write operations and OS Module.pptx
Ramakrishna Reddy Bijjam
 
Our Guide to the July 2025 USPS® Rate Change
Postal Advocate Inc.
 
Lesson 1 - Nature and Inquiry of Research
marvinnbustamante1
 
Lesson 1 Cell (Structures, Functions, and Theory).pptx
marvinnbustamante1
 
How to Setup Automatic Reordering Rule in Odoo 18 Inventory
Celine George
 
TechSoup Microsoft Copilot Nonprofit Use Cases and Live Demo - 2025.06.25.pdf
TechSoup
 
Lesson 1 - Nature of Inquiry and Research.pdf
marvinnbustamante1
 
MUSIC AND ARTS 5 DLL MATATAG LESSON EXEMPLAR QUARTER 1_Q1_W1.docx
DianaValiente5
 
How to Configure Refusal of Applicants in Odoo 18 Recruitment
Celine George
 
Genomics Proteomics and Vaccines 1st Edition Guido Grandi (Editor)
kboqcyuw976
 
PLANNING A HOSPITAL AND NURSING UNIT.pptx
PRADEEP ABOTHU
 
Ad

USE OF PLS COMPONENTS TO IMPROVE CLASSIFICATION ON BUSINESS DECISION MAKING

  • 1. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018 DOI: 10.5121/ijdkp.2018.8302 15 USE OF PLS COMPONENTS TO IMPROVE CLASSIFICATION ON BUSINESS DECISION MAKING José C. Vega Vilca, Aniel Nieves-González and Roxana Aparicio Institute of Statistics and Computer Information Systems, University of Puerto Rico, Rio Piedras Campus, Puerto Rico ABSTRACT This paper presents a methodology that eliminates multicollinearity of the predictors variables in supervised classification by transforming the predictor variables into orthogonal components obtained from the application of Partial Least Squares (PLS) Logistic Regression. The PLS logistic regression was developed by Bastien, Esposito-Vinzi, and Tenenhaus [1]. We apply the techniques of supervised classification on data, based on the original variables and data based on the PLS components. The error rates are calculated and the results compared. The implementation of the methodology of classification is rests upon the development of computer programs written in the R language to make possible the calculation of PLS components and error rates of classification. The impact of this research will be disseminated, based on evidence that the methodology of Partial Least Squares Logistic Regression, is fundamental when working in a supervised classification with data of many predictors variables. KEYWORDS Supervised classification, error rate, multivariate analysis, Logistic Regression 1. INTRODUCTION In data analysis via supervised classification [13] a classifier is constructed based on the observed data. The data is arranged into an matrix where is the number of rows (subjects) and is the number of columns (variables in the study), and a column vector that contains and indicator of the group to which each of the subjects belongs to. The goal of constructing the classifier is to place new subjects into one of the groups established in the given problem. Whenever (the variables of the predictor matrix ) is large, is generally implied multicollinearity between the variables. Such multicollinearity is defined as a high linear dependence between the predictor variables. In this study it is demonstrated, by case studies, that the multicollinearity should be eliminated in order to construct a better classifier. The general rules of thumb of data analysis by supervised classification can be summarized as follows: • Given a new subject characterized by the variables in the study. Into which of the defined groups ( ) does the subject should be classified? • The new subject should be classified into the group where the probability of belonging to that group is greater than the probability of belonging to the other groups. • Based on the matrix and the vector one should construct a classifier with a minimum error rate of classification. The lack of knowledge about the consequences of multicollinearity in the predictor matrix force the researchers to directly apply the techniques of supervised classification and to construct inefficient classifiers with a high error rate. The classifier error rate is defined as follows.
  • 2. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018 16 Definition 1.1 Let be the error rate of classification for a classifier , and be a new subject that does not belong to a group . Then is the probability (1) That is, , is the conditional probability that the classifier locates a new subject into a group to which the subject does not belong to. In this work the multicollinearity problem is solved by transforming the predictor variables into latent variables, also called components. The components are linear combinations of predictor variables that have the property of being orthogonal (not correlated) and are obtained through the application of a method named Logistic Regression by Partial Least Squares (PLS). This method was introduced by Bastien, Esposito-Vinzi, and Tenenhaus [1]. This work states a method to improve the strategies for data analysis in situations where the subjects under consideration (e.g. people, animals, or things), should be classified correctly into groups according to their characteristics to find favorable or unfavorable patterns. For instance, a loan applicant to a bank provides personal information like income, sex, age, number of dependents, expenses, etc. This applicant is evaluated according to the information provided and is classified into potential good or bad borrower with the objective to determine whether the loan should be granted or not granted to the applicant. The goal of this study is to disseminate the application of Logistic Regression by Partial Minimum Squares, introduced by Bastien, Esposito-Vinzi, and Tenenhaus [1], to eliminate the problem of multicollinearity in the predictor matrix and demonstrate. by means of case study, that the multicollinearity should be eliminated in order to construct a better classifier function, characterized by a minimal error rate of classification 2. MULTICOLLINEARITY The authors in [11] analyze multicollinearity in multiple regression problems and verify two aspects about multicollinearity: First, it is a problem that makes it difficult to precisely quantify the effect that exerts each predictor variable over the dependent variable. Second, it can be determined by the computation of the Variance Inflation Factor (VIF) and by the condition number ( ). The VIF is an indicator of specific multicollinearity of each predictor variable. The VIF is defined as: (2) where is the coefficient of determination for the linear regression of with respect of the other predictor variables. As a rule of thumb, if , then there is strong multicollinearity. The condition number of the correlation matrix of the predictor variables is an indicator of the global multicollinearity of the predictor variables. The condition number is computed as (3) where and are the minimum and maximal eigenvalue (by moduli) of the correlation matrix of the predictor variables. Generally, if , then there is strong multicollinearity. Once the multicollinearity is detected it should be eliminated by means of the method proposed in this work, Logistic Regression by Partial Least Squares (PLS).
  • 3. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018 17 2.1. DIAGNOSIS OF MULTICOLLINEARITY Fernando Tusell [10] states that there are some indicators and statistical values that help to diagnose multicollinearity in multiple regression. Below, we present three basic rules for multicollinearity diagnosis. The first one is strictly related to multiple regression, and the other two are related to supervised classification. • A large value for the coefficient of determination and the not significance of most of the parameters. In the presence of multicollinearity the estimated regression coefficients have a sign that is the opposite of what was expected. Moreover, its variance is also high, and because of that one gets the not significance of the parameters. In this case it seems that none of the predictor variables explains the response variable, whereas all of them, as a whole, do explain the response variable. The multicollinearity does not allow to clarify the contribution of each predictor variable. • An eigenvalue of the correlation matrix with magnitude close to zero (zero in the case of perfect multicollinearity). In this case, because difference between the smallest and the greatest eigenvalue, the condition number of the correlation matrix will be large and therefore the multicollinearity is evident. • A large value of the VIF for the predictor variables. If for some predictor variable , then the coefficient of determination for the regression of such variables versus the other variables is greater or equal to . This indicates dependence between the variables that are supposedly independent. Furthermore, it can be demonstrated that the VIF for each predictor variable is located in the main diagonal of the inverse of the correlation matrix. 3. LOGISTIC REGRESSION PLS Bastien, Esposito Vinzi y Tenenhaus [1] presented an algorithm that transforms predictor variables (with multicollinearity) into latent variables, also called PLS components (with no multicollinearity). The authors of [1] illustrate their methodology by analyzing a data set named "Bordeaux". This data set corresponds to 34 years of observations of a French wine in terms of quality ( ): good, average, and poor. The predictor variables are: , the sum of the average daily temperatures (in ); , the duration of sunny weather (in hours); , the number of very hot days; and ,the amount of rainfall (in mm). Without any multicollinearity analysis the investigators used the logistic regression as a classifier. They classified the data and found 7 classification errors, therefore the estimated error rate was . Using the method of Logistic Regression PLS the authors transform the four predictor variables into one PLS component and use the logistic regression classifier. They reclassified the data and found 6 errors, ergo the error rate is . It has been observed that the PLS logistic regression method is efficient albeit the data that is analyzed have low multicollinearity. In no case the variance inflation factor (VIF) was greater than 10. The values of VIF for the predictor variables were: , , and . The condition number was , which is lesser than 25. Thereby, the existence of multicollinearity is minimal or almost none. Recently, Bertrand, Meyer and Maumy-Bertrand [2] presented a library for R called plsRglm: PLS generalized linear models for R. 
The library deals with PLS Regression for the case of multiple regression and with PLS logistic regression for the case of supervised classification. They also solve the classification problem for the "Bordeaux" wine data. For that problem the investigators compute all the possible PLS components (four in that case) and select the optimal
  • 4. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018 18 number of components in the data in order to find the best model for classification. They did that by using the following criteria: • Akaike Information Criterion (AIC). • Bayesian Information Criterion (BIC). • Misclassification error rate. To select the number of components one must keep in mind that an overly simplistic model (too few components) produces a large approximation error (underfitting) whereas an overly complex model (too many components) produces a large estimation error (overfitting). 3.1. SELECTION OF THE NUMBER OF COMPONENTS Three criteria are used to select the number of components PLS: Akaike Information Criterion, Bayesian Information Criterion and the number of bad classifications. The manner in which the AIC and BIC criteria work is explained in [3]. 1. The Akaike Information Criterion (AIC) estimates the relative distance between the unknown likelihood function of the data and the adjusted likelihood function of the model. Thus, a smaller AIC values means that the analyzed model is closer to the true model. 2. The Bayesian Information Criterion (BIC) estimates the posterior probability function that a model under a given bayesian configuration is the true model. Hence, a smaller BIC value means that is more probable that the analyzed model is the true model. 3. The Misclassification error rate: After constructing the classifier, the data that was used to construct the classifier is classified. Then the number of misclassifications is counted. Whenever the number of bad classifications is minimum then it is considered that the analyzed model is the best one. 4. CLASSIFIERS We now present seven classifiers that are usually used in supervised classification: logistic regression, linear discriminant analysis, quadratic discriminant analysis, -nearest neighbors with and , naive Bayes, recursive partitioning, and regression trees (the latter two are classification trees). 4.1. LOGISTIC REGRESSION: It is a regression model widely used for data analysis. In this case the response variable is binary and dichotome or in some cases polytome, whereas the predictor variables could be continuous or categorical. The logistic regression is a special case of the Generalized Linear Model (GLM), where the parameter estimation and hence the probability estimation is done using the maximum likelihood method [6]. 4.2. DISCRIMINANT ANALYSIS: It is a multivariate analysis technique that constructs a classifier function based on multivariate data that belongs well-defined classes or groups. The goal is to assign new subjects to one of these groups. The classifier function is then constructed as a linear combination of a set of independent or predictor variables. If the covariance matrix of the groups under consideration is homogeneous, then we apply the Linear Discriminant Analysis, otherwise we apply the quadratic Discriminant Analysis [12].
  • 5. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018 19 4.3. -NEAREST NEIGHBORS: The classifier function -Nearest Neighbor (KNN) is a simple classifier based on distance. A new subject will be classified into the most frequent class that its -nearest neighbors belong to. For and (the most used values) there is a different classifier function [8]. 4.4. NAIVE BAYES It is a simple but efficient algorithm that predicts the class to which a new subject belongs to. It based on Bayes's theorem and the term naive is used because the algorithm uses bayesian techniques that do not consider possible dependencies between predictor variables [7]. 4.5. CLASSIFICATION TREES It is a classifier that recursively splits up the interval of possible values of the predictor variables. The goal is to construct logical networks and to establish rules that represent the knowledge of the problem through a tree structure. We used Recursive Partitioning and Regression Trees (rpart) as established in [4]. 5. CLASSIFIER ERROR RATE The classifier error rate is defined as the probability that a classifier function classify a new individual into a group that does not belong to (see Eq. (1)). The most commonly used classifier error rates are: the apparent, cross-validation leaving 1 out (cv-n), and cross-validation 10 (cv- 10). 5.1. APPARENT ERROR RATE [5]. Although the apparent error rate is used by many investigators, its use is not recommended because is overly optimistic (usually yields low values) and has a high bias. Figure 1 illustrates the computation of the apparent error rate. We followed the following procedure in its computation: 1. A classifier function is constructed using all the data. 2. The classifier function classifies the data that was used to construct the classifier. 3. The number of misclassifications is counted. 4. The proportions of bad classifications are computed. It is the total number of bad classifications divided by the sample size. Figure 1: Apparent error rate.
  • 6. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.8, No.3, May 2018 20 5.2. ERROR RATE BY 10-FOLD CROSS-VALIDATION [9]. This method yields a more accurate error rate. Figure 1 illustrates the computation of this error rate. The following procedure was used to compute this error rate: 1. The data set is split into 10 subsets. 2. The classifier function is constructed using 9 of the 10 subsets of the sample. 3. The subset not used to construct the classifier is classified using the classifier function. 4. Steps 2 and 3 are repeated until all subsets are classified. 5. The number of bad classifications is counted. 6. The proportion of bad classifications is computed as the number of bad classifications divided by the sample size. Figure 2: Error rate by cross validation. 5.3. ERROR RATE BY LEAVE-ONE-OUT CROSS-VALIDATION. Error rate by cross-validation leaving 1 out. This method is also known as error rate by n-fold cross-validation. Akin to cross-validation 10, this method yields a more accurate error rate. Figure 2 shows the computation of the error rate by means of the following steps: 1. The data set is split into parts, where n is the sample size. 2. The classifier function is constructed using parts of the sample. 3. The individual that was not considered for the classifier construction is then classified. 4. Steps 2 and 3 are repeated until all members of the sample are classified. 5. The number of bad classifications is counted. 6. The proportion of bad classifications is computed as the number of bad classifications divided by the sample size. 6. DATASETS Five different data sets were used in the present work. We describe such data sets below and in Table 1.
6. DATASETS

Five different data sets were used in the present work. We describe these data sets below and in Table 1.

6.1. AUSTRALIAN DATA SET

The Australian database contains the characteristics of 690 clients of a financial institution. The dependent variable is "credit card" and there are 14 predictor variables. The dependent variable indicates whether or not the client's credit card application is approved. The data set is available at https://archive.ics.uci.edu/ml/datasets.

6.2. HOUSEVOTES84 DATA SET

This data set includes the votes of the members of the House of Representatives of the United States of America on 16 key votes identified by the Congressional Quarterly Almanac (CQA). The response variable has two possible values: republican or democrat. Of the 16 original predictor variables, variable number three was eliminated because its values are all the same, leaving 15 predictors. The data is available in the repository of Machine Learning Databases of the University of California at Irvine (UCI), http://www.ics.uci.edu/~mlearn/MLRepository.html.

6.3. GERMAN DATA SET

This data set contains 20 variables of financial information on 1000 loan applicants, and a class variable that indicates whether the applicant is a "good" client. The data is available at https://archive.ics.uci.edu/ml/datasets.

6.4. SONAR DATA SET

A database with 208 observations, each with 60 predictor variables, and 2 classes. The data is available in the repository of Machine Learning Databases of UCI, https://archive.ics.uci.edu/ml/datasets.

6.5. COLON DATA SET

A data set that consists of microarray experiment results. The data contains 2000 attributes for two types of colon tissue: normal and tumor. The data is available on the Gene Expression Project webpage of Princeton University, http://microarray.princeton.edu/oncology.

Table 1. Data sets description

    Name             Subjects   Predictors   Classes   Description
    Australian          690         14          2      Clients
    House Votes 84      232         15          2      Voters
    German             1000         20          2      Clients
    Sonar               208         60          2      Sonar signals
    Colon                62       2000          2      Microarrays
7. IMPLEMENTATION AND RESULTS

The methodology presented in this study was applied to the data sets of Table 1; each data set was processed in the following manner:
1. Each data set, characterized by its original variables, was analyzed: the apparent error rate, the leave-one-out cross-validation error rate (cv-n) and the 10-fold cross-validation error rate (cv-10) were calculated.
2. Each data set was transformed to PLS components, and the transformed data were analyzed. First, the degree of multicollinearity of the predictor variables was examined by means of the condition number. Second, the predictor variables were transformed to (uncorrelated) PLS components, and the number of components retained was determined by the AIC, the BIC and the misclassification error rate; these results are shown in Table 2. Finally, the apparent, cv-n and cv-10 error rates were calculated. A minimal sketch of this step is given below, after Table 2.

Table 2. Determination of the number of PLS components for each data set

    Data set                     PLS component    AIC      BIC     Misclassified
    Australian (η = 3.59)        PLS_Comp_0      950.2    954.7        307
                                 PLS_Comp_1      479.2    488.3         98
                                 PLS_Comp_2      437.3    450.9         87
                                 PLS_Comp_3      432.8    451.0         90
                                 PLS_Comp_4      434.0    456.7         88
                                 PLS_Comp_5      436.0    463.2         86
    House Votes 84 (η = 8.58)    PLS_Comp_0      322.5    326.0        108
                                 PLS_Comp_1      106.1    113.0         20
                                 PLS_Comp_2       47.1     57.4         10
                                 PLS_Comp_3       33.1     46.9          6
                                 PLS_Comp_4       32.7     50.0          5
                                 PLS_Comp_5       34.3     55.0          5
    German (η = 3.12)            PLS_Comp_0     1223.7   1228.6        300
                                 PLS_Comp_1      985.1    995.0        236
                                 PLS_Comp_2      967.8    982.5        227
                                 PLS_Comp_3      965.6    985.3        224
                                 PLS_Comp_4      966.7    991.2        228
                                 PLS_Comp_5      968.6    998.0        233
    Sonar (η = 42.99)            PLS_Comp_0      289.4    292.7         97
                                 PLS_Comp_1      210.8    217.5         55
                                 PLS_Comp_2      167.4    177.4         38
                                 PLS_Comp_3      142.6    156.0         27
                                 PLS_Comp_4      137.0    153.7         23
                                 PLS_Comp_5      123.0    143.1         24
    Colon (η = ∞)                PLS_Comp_0       82.6     84.8         22
                                 PLS_Comp_1       60.6     64.8         16
                                 PLS_Comp_2       36.0     42.4          6
                                 PLS_Comp_3       17.5     26.0          2
                                 PLS_Comp_4       10.0     20.6          0
                                 PLS_Comp_5       12.0     24.8          0

Table 2 shows that the Australian, House Votes 84 and German data sets have low multicollinearity, since their condition numbers (3.59, 8.58 and 3.12, respectively) are all less than 25. Regarding the number of PLS components, the Australian data set needs only 2 components from its 14 predictor variables, the House Votes 84 data set needs 3 PLS components from its 15 predictor variables, and the German data set needs 2 PLS components from its 20 predictor variables. The Sonar and Colon data sets have high multicollinearity, because their condition numbers are 42.99 and infinity, respectively (both greater than 25). Only 4 PLS components were used for the Sonar data set from its 60 predictor variables, and only 4 components for the Colon data set from its 2000 predictor variables.
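The following R sketch shows how the quantities in Table 2 can be obtained. The condition number is computed directly from the singular values of the standardized predictor matrix. The PLS logistic regression fit is shown as a commented call to the plsRglm package of Bertrand, Meyer and Maumy-Bertrand; its argument names and output fields follow that package's documentation but may differ between versions, so this part should be read as an assumption rather than as the exact code used in the study.

    # Condition number of the standardized predictor matrix: the ratio of the
    # largest to the smallest singular value.  Values above roughly 25 are taken
    # as an indication of strong multicollinearity.
    condition_number <- function(X) {
      d <- svd(scale(as.matrix(X)))$d
      max(d) / min(d)
    }

    # PLS logistic regression (package plsRglm; assumed interface, check the
    # package documentation for the exact arguments in your version):
    # library(plsRglm)
    # fit <- plsRglm(dataY = y, dataX = X, nt = 5, modele = "pls-glm-logistic")
    # fit$InfCrit       # AIC, BIC and misclassification count per number of components
    # scores <- fit$tt  # PLS component scores, used as the new uncorrelated predictors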
Tables 3, 4, 5, 6 and 7 show the apparent, leave-one-out cross-validation (cv-n) and 10-fold cross-validation (cv-10) error rates. These error rates were calculated both for the original data, based on the predictor variables, and for the processed data, based on the PLS components. In general, we note the following:
1. The apparent error rate is always lower than the leave-one-out and 10-fold cross-validation error rates, for both the original data and the PLS components.
2. For data with low multicollinearity among the predictor variables, such as the Australian, House Votes 84 and German data sets, the three types of error rates yielded almost the same values for the original and the transformed data. The difference is that the error rates for the transformed data were calculated with a minimum number of PLS components: 2 components for the 14 predictors of Australian, 3 components for the 15 predictors of House Votes 84, and 2 components for the 20 predictors of German.
3. For data with high multicollinearity among the predictor variables, such as the Sonar and Colon data sets, the three types of error rates yielded lower values for the processed data based on PLS components than for the original data. Again, the error rates for the transformed data were calculated with a minimum number of PLS components: 4 components for the 60 predictors of Sonar and 4 components for the 2000 predictors of Colon.
4. The minimum error rate identifies the best classifier, which is not unique and depends on the data. The error rates that should be used to evaluate a classifier are the 10-fold cross-validation (cv-10) and the leave-one-out cross-validation (cv-n) error rates, in that order. For the Australian data set, the best classifier is logistic regression with the original data and LDA with the processed data; for the House Votes 84 data set, the best classifiers are LDA and Rpart with the original data and logistic regression with the processed data; for the German data set, the best classifier is LDA with the original data and logistic regression with the processed data; for the Sonar data set, the best classifier is knn-3 with the original data, and knn-3 and logistic regression with the processed data; for the Colon data set, the best classifier is logistic regression with the original data, and logistic regression and LDA with the processed data.
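As an illustration of how the comparisons reported in Tables 3 to 7 can be reproduced, the sketch below wraps several of the classifiers of Section 4 in a common interface and evaluates them with the cv_error_rate() helper sketched in Section 5. All object names (X_orig, X_pls, y) are hypothetical placeholders for one data set with numerically coded predictors, its PLS component scores, and its class labels; this is an illustrative sketch, not the program used in the study.

    library(MASS)    # lda()
    library(class)   # knn()
    library(e1071)   # naiveBayes()
    library(rpart)   # rpart()

    # Each wrapper takes (Xtr, ytr, Xte) and returns predicted labels for Xte,
    # so it can be passed to cv_error_rate().
    classifiers <- list(
      logistic = function(Xtr, ytr, Xte) {
        fit <- glm(y ~ ., data = data.frame(Xtr, y = ytr), family = binomial)
        p   <- predict(fit, newdata = data.frame(Xte), type = "response")
        factor(levels(ytr)[1 + (p > 0.5)], levels = levels(ytr))
      },
      lda   = function(Xtr, ytr, Xte) predict(lda(Xtr, grouping = ytr), Xte)$class,
      knn3  = function(Xtr, ytr, Xte) knn(Xtr, Xte, cl = ytr, k = 3),
      bayes = function(Xtr, ytr, Xte) predict(naiveBayes(data.frame(Xtr), ytr), data.frame(Xte)),
      rpart = function(Xtr, ytr, Xte) {
        fit <- rpart(y ~ ., data = data.frame(Xtr, y = ytr), method = "class")
        predict(fit, newdata = data.frame(Xte), type = "class")
      }
    )

    # cv-10 error rates (in %) on the original predictors and on the PLS scores
    # sapply(classifiers, function(f) 100 * cv_error_rate(X_orig, y, f, k = 10))
    # sapply(classifiers, function(f) 100 * cv_error_rate(X_pls,  y, f, k = 10))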
Table 3. Australian data set error rates (in %)

    Method          Original data (14 predictors)        2-component PLS
                    apparent    cv-n     cv-10           apparent    cv-n     cv-10
    Logistic Reg.     12.46    13.91     13.77             12.61    12.60     12.46
    LDA               13.91    14.20     14.06             11.59    11.59     12.32
    QDA               18.84    20.00     20.29             14.20    14.20     14.78
    knn-3             16.38    32.75     32.61              9.86    14.93     13.77
    knn-5             22.03    31.16     31.16             10.29    14.49     13.48
    Naive Bayes       20.00    20.29     21.16             13.91    13.91     13.91
    Rpart             11.74    12.17     14.35             12.03    13.48     12.75

Table 4. House Votes 84 data set error rates (in %)

    Method          Original data (15 predictors)        3-component PLS
                    apparent    cv-n     cv-10           apparent    cv-n     cv-10
    Logistic Reg.      2.16     6.47      6.90              2.59     2.59      2.59
    LDA                3.02     3.02      3.02              3.02     3.02      3.02
    QDA                3.45       NA        NA              3.45     3.45      3.45
    knn-3              6.03     7.76      7.76              1.72     3.02      3.88
    knn-5              7.76     8.62      8.62              2.59     3.45      3.45
    Naive Bayes        5.17     5.17      7.33              6.47     6.90      6.90
    Rpart              3.02     3.02      3.02              3.45     3.88      6.47

Table 5. German data set error rates (in %)

    Method          Original data (20 predictors)        2-component PLS
                    apparent    cv-n     cv-10           apparent    cv-n     cv-10
    Logistic Reg.     23.40    25.00     24.90             22.70    22.90     22.80
    LDA               23.10    24.20     24.50             22.70    22.80     22.90
    QDA               22.00    26.90     26.70             22.30    22.60     23.10
    knn-3             19.20    37.40     37.70             15.70    26.80     27.60
    knn-5             25.10    35.10     35.40             18.50    25.60     26.00
    Naive Bayes       24.50    25.50     26.30             22.50    22.50     23.00
    Rpart             21.80    26.90     26.50             21.20    21.90     25.60

Table 6. Sonar data set error rates (in %)

    Method          Original data (60 predictors)        4-component PLS
                    apparent    cv-n     cv-10           apparent    cv-n     cv-10
    Logistic Reg.      0.00    27.40     26.92             11.06    12.50     12.98
    LDA                9.62    24.52     25.48             13.46    13.94     14.42
    QDA                0.00    24.04     25.48             14.90    15.38     15.38
    knn-3             11.06    18.75     20.67              8.17    12.50     12.98
    knn-5             13.46    17.31     21.15              9.13    12.02     13.94
    Naive Bayes       26.92    32.69     32.69             15.38    17.31     18.75
    Rpart             12.50    33.17     29.33              9.62    16.35     17.79
Table 7. Colon data set error rates (in %)

    Method          Original data (2000 predictors)      4-component PLS
                    apparent    cv-n     cv-10           apparent    cv-n     cv-10
    Logistic Reg.      0.00    51.61      4.84              0.00     4.84      3.23
    LDA                3.23    22.58     19.35              1.61     1.61      3.23
    QDA                3.23       NA      6.45              1.61     4.84      6.45
    knn-3              8.06    14.52     14.52              6.45     9.68     12.90
    knn-5             12.90    16.13     16.13              6.45     9.68      9.68
    Naive Bayes       29.03    40.32     64.52              1.61     6.45      8.06
    Rpart              8.06    41.94     79.03              9.68    20.97     25.81

8. CONCLUSIONS

1. In each data set, the choice of the number of PLS components is independent of the degree of multicollinearity of the predictor variables. The Australian data set has condition number η = 3.59 and 2 PLS components were selected; the House Votes 84 data set has condition number η = 8.58 and 3 PLS components were selected; the German data set has condition number η = 3.12 and 2 PLS components were selected; the Sonar data set has condition number η = 42.99 and 4 PLS components were selected; and the Colon data set has an infinite condition number and 4 PLS components were selected.
2. The BIC was the most frequent criterion for selecting the number of PLS components. The misclassification error rate criterion was used only for selecting the number of PLS components of the Sonar data set. For the Colon data set, the three selection criteria (AIC, BIC and misclassification error rate) agreed.
3. The dimensionality of each data set was drastically reduced by the transformation to PLS components: Australian dropped from 14 variables to 2 PLS components, House Votes 84 from 15 variables to 3 PLS components, German from 20 variables to 2 PLS components, Sonar from 60 variables to 4 PLS components, and Colon from 2000 variables to 4 PLS components.
4. The apparent error rates of the classifiers, for all data sets, are on average slightly lower when using PLS components than when using the original predictor variables.
5. The 10-fold cross-validation (cv-10) error rates of the classifiers, for all data sets, are on average slightly higher than the leave-one-out cross-validation error rates, both when using the original variables and when using PLS components.
6. The 10-fold cross-validation (cv-10) and leave-one-out cross-validation error rates for all data sets are generally lower when using PLS components than the corresponding error rates when using all the predictor variables. Herein lies the benefit of working with PLS components: a significant decrease in the error rate of all the classifiers.
7. There is no single classifier with minimal error rate for every data set. Analyzing the 10-fold cross-validation error rates for each data set, we found that the best classifier for the Australian data set is linear discriminant analysis, for the House Votes 84 and German data sets it is logistic regression, for the Sonar data set the best classifiers are logistic regression and knn-3, and for the Colon data set the best classifiers are logistic regression and linear discriminant analysis.
REFERENCES

[1] Philippe Bastien, Vincenzo Esposito-Vinzi, and Michel Tenenhaus. PLS generalised linear regression. Computational Statistics & Data Analysis, 48(1):17–46, 2005.
[2] F. Bertrand, N. Meyer, and M. Maumy-Bertrand. Package 'plsRglm', version 1.1.1. R documentation, 2015.
[3] Guillaume Bouchard and Gilles Celeux. Selection of generative models in classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):544–554, 2006.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press, 1984.
[5] C. Smith. Some examples of discrimination. Annals of Eugenics, 18:272–282, 1947.
[6] Annette J. Dobson and Adrian Barnett. An Introduction to Generalized Linear Models. CRC Press, 2008.
[7] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press, 2008.
[8] Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[9] Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), pages 111–147, 1974.
[10] Fernando Tusell. Análisis de regresión. Introducción teórica y práctica basada en R. 2011.
[11] José Carlos Vega-Vilca and Josué Guzmán. Regresión PLS y PCA como solución al problema de multicolinealidad en regresión múltiple. Revista de Matemática: Teoría y Aplicaciones, 18(1):9–20, 2011.
[12] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag, 2002.
[13] Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.
AUTHORS

Dr. José C. Vega holds a Ph.D. degree in Computer and Information Sciences and Engineering from the University of Puerto Rico - Mayaguez Campus. He received his M.S. degree in Statistics from the University of San Marcos, Lima, Peru, and his B.S. in Statistics from the Agraria La Molina University, Lima, Peru. He is a Professor in the Institute of Statistics and Computer Information Systems of the University of Puerto Rico - Río Piedras Campus.

Roxana Aparicio received her Ph.D. degree in Computer and Information Sciences and Engineering from the University of Puerto Rico - Mayaguez Campus in 2012. She received her M.S. degree in Scientific Computing from the University of Puerto Rico and her B.S. in Computer Engineering from the University San Antonio Abad, Cusco, Peru. Currently she is a professor in the Institute of Statistics and Computer Information Systems of the University of Puerto Rico - Río Piedras Campus.

Aniel Nieves-González received his Ph.D. in Applied Mathematics from the State University of New York at Stony Brook in 2010. He has an M.S. in Applied Mathematics from the University of Puerto Rico, and his undergraduate degree is in Computer Science and Physics, also from the University of Puerto Rico. He is currently an Assistant Professor at the Institute of Statistics and Computer Information Systems at the University of Puerto Rico - Río Piedras Campus. He has published papers on mathematical models (differential equations) of complex physiological systems such as the thick ascending limb (a part of the kidney). He still works on problems related to kidney physiology, but also on problems related to coral population dynamics and the spectral analysis of high-frequency financial data. His research interests include dynamical systems, power spectral analysis, wavelet analysis, and parallel computing.