PII: S1568-4946(19)30717-3
DOI: https://doi.org/10.1016/j.asoc.2019.105936
Reference: ASOC 105936
Please cite this article as: N. Arora and P.D. Kaur, A Bolasso based consistent feature selection
enabled random forest classification algorithm: An application to credit risk assessment, Applied
Soft Computing Journal (2019), doi: https://doi.org/10.1016/j.asoc.2019.105936.
Highlights of Paper:
1) A novel Bolasso enabled Random Forest algorithm (BS-RF) is proposed to classify a borrower as defaulter or legitimate.
2) The stability of Bolasso is compared with that of other feature selectors (Chi-square, Gain Ratio, ReliefF) in terms of JSM.
3) Time complexities are computed and run times of the various algorithms are recorded.
4) The proposed algorithm is compared with Bolasso enabled Naïve Bayes, SVM and KNN classifiers.
5) The experimental results show that Bolasso selected features are stable with respect to small variations in the dataset.
6) Results further emphasize that BS-RF is better than other methods in terms of AUC and accuracy.
A Bolasso Based Consistent Feature Selection Enabled Random Forest Classification Algorithm: An Application to Credit Risk Assessment
Nisha Arora, Research Scholar, Guru Nanak Dev University, Regional Campus, Jalandhar
Dr. Pankaj Deep Kaur, Assistant Professor, Guru Nanak Dev University, Regional Campus, Jalandhar
Abstract
Credit risk assessment is a crucial issue, as it forecasts whether an individual will default on a loan. Classifying an applicant as a good or bad debtor helps the lender make a wise decision. Modern data mining and machine learning techniques have been found to be very useful and accurate for credit risk prediction and correct decision making. Classification is one of the most widely used techniques in machine learning. To increase the prediction accuracy of standalone classifiers while keeping the overall cost to a minimum, feature selection techniques have been utilized, as feature selection removes redundant and irrelevant attributes from the dataset. This paper first introduces Bolasso (Bootstrap-Lasso), which selects consistent and relevant features from the pool of features. Consistent feature selection is defined as the robustness of the selected features with respect to changes in the dataset. The Bolasso shortlisted features are then applied to various classification algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbours (K-NN), to test their predictive accuracy. It is observed that the Bolasso enabled Random Forest algorithm (BS-RF) provides the best results for credit risk evaluation. The classifiers are built on training and test data partitions (70:30) of three datasets (Lending Club's peer-to-peer dataset, Kaggle's "Bank Loan Status" dataset and the German Credit dataset obtained from UCI). The performance of the Bolasso enabled classification algorithms is then compared with that of other baseline feature selection methods, like Chi-square, Gain Ratio and ReliefF, and with stand-alone classifiers (no feature selection method applied). The experimental results show that Bolasso provides phenomenal stability of features when compared with the other algorithms; the Jaccard Stability Measure (JSM) is used to assess the stability of the feature selection methods. Moreover, BS-RF has good classification accuracy and is better than the other methods in terms of AUC and accuracy, thereby effectively improving the decision making process of lenders.

1. Introduction
As per the data obtained from Capitaline Plus [33], India's gross bad loans stood at Rs 10.25 lakh crore as on 31 March 2018, which is 11.8% of the total loans given by the banking industry. State Bank of India (SBI), which tops the bad loans chart, logged an increase of Rs 24,286 crore in bad loans in the March quarter, to Rs 2.23 lakh crore. The Nirav Modi scam-hit Punjab National Bank (PNB) reported the maximum rise, of Rs 29,100 crore, in gross bad loans, to Rs 86,620 crore in the March quarter. As per the statistical data obtained from the Federal Reserve [34], the delinquency and charge-off rates of banks for 2019 Q1 are 1.74 (real estate loans) and 2.33 (consumer loan category).
This alarming situation in banks and financial institutions has drawn the attention of various researchers. Modern data mining and machine learning techniques have been found to be very useful and accurate for credit risk prediction and correct decision making. To enhance accuracy, several approaches have been proposed, including feature selection techniques [1,2]. Feature selection is a pre-processing technique which helps in selecting appropriate attributes. Since real-world instances of data are usually high dimensional, feature selection techniques help in reducing the dimensionality of data by removing irrelevant and redundant features. Irrelevant features, if included in the classification process, not only reduce its accuracy but also consume more space and execution time. Feature selection is therefore very important, as it facilitates a deeper understanding of data by studying only the relevant features and enhances the accuracy, speed and predictive capability of the classifier [3]. Feature selection methods are classified into three categories: filter, wrapper and embedded methods.
Filter methods rank features using statistical measures computed independently of any classifier; Chi-square and Gain Ratio are typical examples [4]. In the wrapper approach, the classification algorithm is executed many times, each time with a different feature subset; with each iteration the quality of the feature subset is evaluated, and finally the best feature subset is selected [5]. Examples of this method are Beam Search, Sequential Forward Selection (SFS) and Sequential Backward Elimination (SBE). The embedded method, however, shortlists features within the training process of the learning model, and the selected features are output at the end of the training process. Examples are Weighted Naïve Bayes and Lasso. The performance of a feature selection method is evaluated using machine learning models like Naïve Bayes, SVM, KNN, C4.5, etc. [6]
This paper proposes an improved version of the embedded Lasso approach, namely Bolasso, to shortlist consistent and relevant features. As credit risk assessment is a difficult learning problem, the selection of an appropriate set of features is critical for the success of the learning process and is therefore a vital issue. An important property of a feature selection strategy is selection consistency (stability), i.e., the ability to find the "true" set of features [35]. Feature stability matters especially when feature selection is applied for knowledge discovery. In credit risk assessment, a feature selection algorithm may select different subsets of features when there are slight variations in the training data; although most of these subsets may result in good prediction accuracy, such changes (instability) in the selected features may reduce the confidence of lenders in investigating reliable risk factors. Bolasso exhibits stability and also provides reasonably good classification accuracy. Various classification algorithms are used to verify the accuracy of the Bolasso selected features, among which random forest performs best.
In summary, the main contributions of this paper are:
- This paper presents the stability property of Bolasso using the Jaccard index stability metric.
- A novel Bolasso enabled Random Forest algorithm is introduced to classify a borrower as a potential defaulter or legitimate.
- Four standalone classifiers (SVM, Naïve Bayes, KNN, RF) are used to predict the behaviour of debtors.
- A total of sixteen feature selection enabled models are compared. Of these sixteen, four algorithms are based on the chi-square feature selector (C-SVM, C-NB, C-KNN, C-RF), four on the gain ratio feature selector (G-SVM, G-NB, G-KNN, G-RF), four on the ReliefF feature selector (R-SVM, R-NB, R-KNN, R-RF), and four on the Bolasso based feature selector (BS-SVM, BS-NB, BS-KNN, BS-RF).
- The performance of the algorithms in terms of AUC and accuracy is compared, and it is observed that the BS-RF algorithm gives the best results.
- Time complexities of the classification algorithms are computed, and run times are also recorded during the experiments.
The rest of the paper is organized as follows. Section 2 reviews the existing literature on classification in credit risk, feature selection based studies in credit risk assessment and Lasso based studies. Section 3 describes the methodology, which includes a description of the feature selection methods and classifiers used in this paper. Section 4 describes the datasets used for the experiments. Section 5 presents the design of the Bolasso based Random Forest algorithm (BS-RF), the experiments carried out and a summary of the results; finally, Section 6 concludes the research.

2. Literature Review
Various classification techniques have been explored for credit risk assessment, including Logistic Regression, SVM, Naïve Bayes, KNN, Random Forest etc. Kruppa, J. et al. [49] estimated consumers' default probabilities and concluded
that Random Forest outperformed logistic regression and KNN. Malekipirbazari M. et al. [32] focused on peer-to-peer lending and compared the performance of machine learning methods, namely Random Forest, SVM, Logistic Regression and K-NN, in terms of AUC, accuracy, Root Mean Square Error (RMSE) and the confusion matrix. They concluded that Random Forest provides matchless results in identifying the best borrowers (those with low default probability), thereby helping lenders take better investment decisions. Shi, L. [50] also conducted experiments to identify bad debts and concluded that Random Forest is an effective classification tool for credit assessment. Behr, A., & Weinblat, J. [52] recommended Random Forest as a worthwhile alternative to well established default prediction models and estimated the default tendencies of firms from seven European countries using two datasets. The other classification models used in this study for comparison, namely SVM, NB and K-NN, have also been explored by other researchers for credit risk assessment. Table 1 briefly outlines the application of the mentioned models in the credit risk context.
Table 1: Classification technique and reference list

Classification Technique | Reference list
Naïve Bayes | [54], [55], [56]
SVM | [57], [58], [59], [26]
KNN | [67], [60], [61], [62], [63]
Random Forest is an ensemble classification algorithm which, besides providing good classification accuracy, offers numerous other advantages: it is immune to overfitting, fast and simple, helps in better estimating the internal error, and tolerates outliers and noise well. In this paper Bolasso is integrated with Random Forest, Naïve Bayes, SVM and KNN, and the best results are obtained with Random Forest.
Generally, datasets on credit scoring are high dimensional, which makes the classification problem complex: not only is it computationally demanding, it is also less accurate for prediction [15]. Feature selection becomes necessary, as it reduces the computational burden and also helps to improve accuracy. Each feature selection method has its own pros and cons; for example, the filter approach is faster but neglects a feature's influence on the classifier, while the wrapper approach utilizes optimization algorithms to generate subsets and then evaluates the subsets using a classifier. Consequently, hybrid approaches combining the two have been developed. Oreski and Oreski [1] proposed a hybrid GA with NN (HGA-NN), which uses a filter method to shortlist 12 features; the reduced feature subset was applied to a genetic algorithm, and the method was shown to be promising for credit risk assessment. Wang D. [3] proposed HMPGA for credit scoring, in which three initial feature subsets are first generated using three filter approaches, namely F-score, information gain ratio and Pearson's correlation, and a wrapper approach, MPGA (Multiple Population Genetic Algorithm), is then applied for feature selection. Jadhav S. [16] developed IGDFS, a hybrid model based on Information Gain (filter approach) and a Genetic Algorithm (wrapper approach), to select features for the credit scoring problem, and concluded that the model improves when features are selected carefully. X. Zhang [17] developed the NCSM model, which uses the information gain ratio to filter irrelevant and redundant features and then applies a grid search to produce an optimized random forest algorithm. The conclusion drawn was that keeping fewer features helps to reduce the workload of credit evaluators and, at the same time, increases predictive efficiency.
Although hybrid approaches have been widely used, they still have some limitations. The filter step may eliminate potentially useful features, as there is no guarantee that the features shortlisted by the filter method are good candidates for the wrapper method [18]. Moreover, wrapper methods remain computationally intensive, as they call the classification algorithm repeatedly. To overcome the limitations of hybrid and wrapper approaches, this paper employs Bolasso based feature selection, which has the advantage of integrating feature selection and the learning process. Moreover, it is a one-phase feature selection algorithm, which shortlists and outputs the selected features in a single step.
2.3 Stability based feature selection
A prominent feature of Bolasso is that it provides stability in feature selection. High stability of a feature selection algorithm is as important as high classification accuracy; ignoring the stability issue may lead to wrong conclusions and may cast suspicion on the claimed results of domain experts [36],[38],[39],[40],[51]. Stability was introduced by Turney [37], who emphasized its importance in learning algorithms and discussed the relationship between stability, predictive accuracy and bias. Z. He [41] reviewed various research papers on the stability of biomarker discovery. Abeel et al. [44] proposed an ensemble framework for biomarker discovery in microarray datasets, utilizing SVM-RFE to improve feature stability: they first rank all the features using SVM, then eliminate the features with the lowest scores, and finally aggregate the feature scores using a linear combination of the scores from all iterations. Kamkar et al. [42] focused on feature stability and proposed C-SVM to increase the stability of the l1-norm SVM, testing it on three datasets (breast cancer, cancer and AMI). Li Y. et al. [43] proposed FREL (Feature weighting as Regularized Energy based Learning) and ensemble FREL, stability based feature selection algorithms, and tested their stability on four datasets (TOX, SMK, Leukemia and Prostate). Han and Yu [45] proposed a variance reduction framework to improve the stability of feature selection, showing that the stability of feature selection under training data variations can be improved by variance reduction techniques.
The stability of Bolasso has not been explored earlier. Bolasso can be considered a consensus combination scheme, as it generates bootstrap replications of a dataset, calculates the Lasso estimates, intersects the supports of the bootstrap replications, and keeps the largest subset of variables on which all regressors agree in terms of variable selection. Bolasso incorporates stability because it produces similar feature sets even when small variations occur in the training data.
Lin et al. [8] studied the influence factors of companies' debt costs, where Lasso selected a subset of variables out of 18, and concluded that Lasso provided the smallest average prediction error. Fang et al. [9] analysed personal credit evaluation in China and found that Lasso regression does better in prediction accuracy. Various other versions of Lasso, such as Group Lasso and Tree-Lasso, have been developed and used for feature selection. Hongmei Chen [10] constructed a credit scoring model based on Group Lasso logistic regression and concluded that the Group Lasso method is better in interpretability and predictive analysis; Group Lasso is applicable when features form different groups and the variables within a group are correlated. Kamkar I. et al. [11] proposed a Tree-Lasso model for stable feature selection in health care and concluded that the features selected by Tree-Lasso are more stable compared with other methods like the t-test and Lasso. In another work, Zhang Z. et al. [12] proposed a novel interactive Lasso regression model to identify high-order feature interactions.
However, when the dimensionality of the dataset is large and there is strong dependence between relevant and irrelevant variables, the Bolasso method can be applied to obtain a consistent set of variables [13].
3. Methodology
Section 3.1 describes the feature selection methods, namely chi-square, gain ratio, ReliefF and Bolasso, and Section 3.2 describes the baseline classifiers used in this study to evaluate the predictive performance of the feature selection methods.
3.1 Feature selection methods
3.1.1 Chi-square test: The chi-square test [19] is a statistical test of independence used to determine the dependency of two variables. It is a filter based feature selection method which evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. It measures the ability to predict the value of an attribute from the value of the class by testing the independence between them and checking the absence of statistical links [20,21]. It is computed as

$$\chi^2 = \sum_{i}\sum_{j}\frac{(n_{ij}-E_{ij})^2}{E_{ij}}, \quad E_{ij} = \frac{n_{i\cdot}\,n_{\cdot j}}{n} \qquad (1)$$

where $n_{ij}$ represents the observed cell value in the contingency table, $n_{i\cdot}$ is the row total, $n_{\cdot j}$ is the column total and $n$ is the total number of instances.
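A minimal sketch of this scoring in R (the data frame train_x of discrete attributes and the class vector train_y are illustrative names): each attribute is cross-tabulated against the class and ranked by the statistic of Eq. (1).

```r
# Rank attributes by the chi-squared statistic against the class.
# Continuous attributes would first need discretization (e.g., binning).
chi_scores <- sapply(names(train_x), function(f) {
  tab <- table(train_x[[f]], train_y)                    # contingency table n_ij
  unname(suppressWarnings(chisq.test(tab)$statistic))    # chi-squared value of Eq. (1)
})
top_features <- names(sort(chi_scores, decreasing = TRUE))[1:18]  # top-k; k = 18 for LC
```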
3.1.2 Gain Ratio: Gain ratio is an entropy based filter method. The entropy of a variable is computed as

$$E = -\sum_{i} p_i \log_2 p_i \qquad (2)$$

where $p_i$ is the probability with which a particular value occurs in the sample space. Entropy ranges from 0 (all instances of a variable have the same value) to 1 (equal number of instances of each value). It is a measure of how the values of an attribute are distributed and signifies the pureness of an attribute. High entropy means the distribution is uniform, i.e., the histogram of the distribution is flat and we have an equal chance of obtaining any possible class; low entropy means the distribution is gathered around a point [22].
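As a small illustration, Eq. (2) in R (the function name is ours):

```r
# Entropy of a discrete attribute: -sum(p * log2(p)) over its observed values.
entropy <- function(x) {
  p <- table(x) / length(x)     # empirical probabilities p_i
  -sum(p * log2(p))
}
entropy(c("A", "A", "B", "B"))  # 1 bit: two equally likely values (uniform case)
```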
3.1.3 Relief and ReliefF: The Relief and ReliefF algorithms are successful attribute estimators. They are able to detect conditional dependencies between attributes and provide a unified view of attribute estimation in classification problems [23]. The key idea of Relief is to rank the quality of features according to how well their values distinguish between cases that are near to each other; it is based on the Euclidean distance measure. The enhanced method ReliefF [24] uses the Manhattan (l1) norm instead of the Euclidean (l2) norm to rank the features. The top k features are selected as the final subset.
3.1.4 Lasso and Bolasso
Lasso (Least Absolute Shrinkage and Selection Operator) is a widely used feature selection method which selects variables and also utilizes regularization to increase prediction accuracy. Bolasso (Bootstrap enabled Lasso) was introduced by Francis R. Bach (2008) [13], who presented a model for the selection of consistent variables. It was proved that, in the presence of strong correlation between relevant and irrelevant variables, Lasso cannot generate consistent results. Bolasso, however, runs the Lasso model on several bootstrapped samples of the original data and then intersects the sets of non-zero coefficients to estimate a consistent set of coefficients. The computational complexity of Bolasso is $O(m(p^3 + p^2 n))$, where $m$ is the number of bootstrap samples, $p$ the number of features and $n$ the number of instances. The probability that Bolasso does not select the correct model has the following upper bound:

$$P(\hat{J} \neq J) \leq A_1 e^{-A_2 n} + A_3 \frac{\log n}{\sqrt{n}} + A_4 \frac{\log m}{m} \qquad (3)$$

where $\hat{J}$ is the set of variables selected by Bolasso, $J$ is the true model, and $A_1,\dots,A_4$ are positive constants [13].
3.2 Classifiers
3.2.1 Support Vector Machine (SVM): SVM constructs an optimal hyperplane that separates the data into two classes, positive or negative. Carrizosa, E., & Morales, D.R. [27] used mathematical optimization to address issues in SVM, such as the detection of relevant features. In its dual form, SVM solves

$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \qquad (4)$$

s.t. $\sum_{i=1}^{n}\alpha_i y_i = 0$, $0 \le \alpha_i \le C$,

where $\alpha_i$ is the Lagrange multiplier associated with instance $i$ and $K(\cdot,\cdot)$ is the kernel function. The most common kernel functions are:
1) Linear kernel function: $K(x_i, x_j) = x_i \cdot x_j$
2) Polynomial function: $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$
3) Gaussian function: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
4) Radial kernel function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
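A hedged sketch of fitting such classifiers in R with the e1071 package (a libsvm wrapper; train_x, train_y, test_x and the gamma value are illustrative):

```r
library(e1071)
# A linear-kernel SVM and an RBF (radial) SVM, as in kernels 1) and 4) above.
fit_linear <- svm(x = train_x, y = train_y, kernel = "linear", cost = 1)
fit_radial <- svm(x = train_x, y = train_y, kernel = "radial", gamma = 0.1)
pred <- predict(fit_radial, newdata = test_x)   # predicted class labels
```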
3.2.2 Naïve Bayes: The Naïve Bayes classifier assumes that the attributes are conditionally independent of each other given the class variable [30]. The final decision about a class is made after the class probability estimation

$$P(C=c \mid X=x) = \frac{P(C=c)\prod_{i} P(X_i = x_i \mid C=c)}{P(X=x)} \qquad (5)$$

Here $C$ is the random variable denoting the class of an instance and $X$ is a vector of random variables denoting the observed attribute value vector.
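For illustration, a minimal fit of this classifier in R, again via e1071 (the data frame and column names are assumptions):

```r
library(e1071)
# Estimates P(C = c) and the per-attribute conditionals of Eq. (5) from the data.
nb_fit  <- naiveBayes(loan_status ~ ., data = train_data)
nb_pred <- predict(nb_fit, newdata = test_data)                # most probable class
nb_prob <- predict(nb_fit, newdata = test_data, type = "raw")  # class probabilities
```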
3.2.3 K-NN
This method was first utilized for credit scoring by Henley et al. [67] in 1997. It is a non-parametric classifier that uses a distance measure to make predictions without building a model. The training stage of K-NN consists of storing the feature vectors and class labels of the training samples. In the testing stage, the distances from a new vector to all stored vectors are computed and the K closest samples are selected; the class label is then assigned according to the class of the majority of the k nearest neighbours [60],[61].
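A minimal sketch of this prediction step in R using the class package (k = 5 and the matrix names are illustrative; features should be standardized first, since the distances are scale-sensitive):

```r
library(class)
# No model is built: the k closest stored training vectors vote on each test row.
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
```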
3.2.4 Random Forest
Random Forest draws multiple bootstrap samples from the training data; on each sample a decision tree using x randomly selected features is built, and together these trees constitute a random forest. Each decision tree determines a class label prediction for the new instance (termed a vote). The votes cast by the various decision trees are then counted, and the class with the majority of votes "wins", hence making the prediction for the new instance.
In our implementation, 500 such decision trees are built, and x is set to the square root of the number of features (sqrt(n)). Ours is a classification model, hence each decision tree outputs either "Charged Off" or "Fully Paid" (defaulter or legitimate).
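This configuration maps directly onto R's randomForest package, sketched below (the data frame names are assumptions; mtry = floor(sqrt(p)) is also the package's default for classification):

```r
library(randomForest)
p <- ncol(train_data) - 1                        # number of predictor features
rf_fit <- randomForest(loan_status ~ ., data = train_data,
                       ntree = 500,              # 500 trees, as in our implementation
                       mtry  = floor(sqrt(p)))   # x = sqrt(n) features per split
rf_pred <- predict(rf_fit, newdata = test_data)  # majority vote over the 500 trees
```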
4. Dataset Description
4.1 Lending Club [64]: The dataset consists of 42,538 loan records with 143 attributes, issued by Lending Club between the years 2007-2011. Some of the important features describing a loan are shown in Table 2. The dependent variable is loan status, which takes two values, namely Charged Off and Fully Paid. Online peer-to-peer lending is a risky task, since borrowers and lenders are not known to each other. The work done in this paper can hence be used by investors to classify a new debtor as good or bad based on the learning of the classification algorithms.
Table 2: Lending Club's Dataset Fields

ID | Field | Description | Mean | Std. Dev. | Min | Max
A7 | Installment | The monthly payment owed by the borrower | 322.6 | 208.93 | 15.67 | 1305.19
A8 | Grade | LC assigned loan grade (categorical, range A-G) | 2.671 | 1.4384 | 1 | 7
A9 | Sub_grade | LC assigned loan subgrade | 11.41 | 7.0714 | 1 | 35
A10 | Emp-Length | Employment length in years; 0 means less than one year and 10 means ten or more years | 4.913 | 3.4614 | 0 | 10
A11 | Home_ownership | The home ownership status provided by the borrower; values are RENT, OWN, MORTGAGE, OTHER | 3.134 | 1.9337 | 1 | 5
A12 | Annual_inc | The self-reported annual income provided by the borrower | 68861 | 551599.77 | 0 | 3900000
A13 | Verification Status | Indicates if income was verified by LC or not verified | 1.876 | 0.8615 | 1 | 3
A14 | Purpose | A category provided by the borrower for the loan request | 4.904 | 3.4278 | 1 | 14
A15 | Addr_state | The state provided by the borrower | 22.83 | 4.7671 | 1 | 50
A16 | Dti | A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income | 13.37 | 6.7254 | 0 | 29.99
A17 | Delinq_2yrs | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years | 0.1524 | 0.5122 | 0 | 13
A18 | Inq_last_6months | The number of inquiries in the past 6 months | 1.081 | 1.5272 | 0 | 33
A19 | Mths_since_last_delinq | The number of months since the borrower's last delinquency | 12.85 | 1.6609 | 0 | 120
A20 | Open_acc | The number of open credit lines in the borrower's credit file | 9.338 | 4.5013 | 0 | 47
A21 | Pub_rec | Number of derogatory public records | 0.058 | 0.2456 | 0 | 5
A22 | Revol_bal | Total credit revolving balance | 14298 | 22018.8 | 0 | 1207359
A23 | Revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit | 49.01 | 28.4240 | 0 | 119
A24 | Total_acc | The total number of credit lines currently in the borrower's credit file | 22.11 | 11.6033 | 0 | 90
A25 | Total_payment | Payments received to date for total amount funded | 12020 | 9094.68 | 0 | 58886
A26 | Total_pamt_inv | Payments received to date for the portion of the total amount funded by investors | 11313 | 9038.65 | 0 | 58564
A27 | Total_rec_prncpl | Principal received to date | 9676 | 6106.009 | 0 | 35000
A31 | Acc_now_delinq | The number of accounts on which the borrower is now delinquent | 9.41e-05 | 0.00969 | 0 | 1
A32 | Delinq_amt | The past-due amount owed for the accounts on which the borrower is now delinquent | 0.143 | 29.3512 | 0 | 6053
A33 | Pub_rec_bankruptcies | Number of public record bankruptcies | 0.0437 | 0.2055 | 0 | 2
A34 | Tax_liens | Number of tax liens | 2.35e-05 | 0.00484 | 0 | 1
A35 | Debt_settlement_flag | Flags whether or not the borrower, who has charged off, is working with a debt-settlement company | 1.004 | 0.0608 | 1 | 2
A36 | Loan_status | Current status of the loan; possible values: Charged Off / Fully Paid | 1.849 | 0.3582 | 1 | 2
Table 3: Kaggle's Bank Loan Status Dataset Fields

ID | Field | Description | Mean | Std. Dev. | Min | Max
B5 | Credit score | A score given on the basis of credit history and information in the application file | 866.8 | 1386.5 | 0 | 7510
B6 | Annual Income | Annual income as reported by the borrower | 1112375 | 1009341 | 0 | 36475440
B7 | In current job | No. of years spent by the borrower in the current job | 5.744 | 3.6241 | 0 | 10
B8 | Home Ownership | Possible values are: RENT, OWN, ON MORTGAGE | 2.932 | 0.9544 | 1 | 4
B9 | Purpose | Possible values are: BUSINESS LOAN, EDUCATIONAL, DEBT CONSOLIDATION, MEDICAL BILLS, BUY A HOME, etc. | 4.784 | 2.1975 | 1 | 16
B10 | Monthly debt | Amount of debt a borrower is paying monthly | 18467 | 12275.07 | 0 | 435843
B11 | Years of credit history | No. of years of credit | 18.15 | 6.9652 | 3.60 | 70.50
B12 | Months since last delinquent | The number of months since the borrower's last delinquency | 17.16 | 22.991 | 0 | 176
B13 | Number of open accounts | Total no. of accounts a borrower owns | 11.13 | 4.9858 | 0 | 56
B14 | Number of credit problems | Number indicating the problems encountered in repaying the credit balance | 0.1669 | 0.4860 | 0 | 15
B15 | Current credit balance | Total balance against which the borrower is credited | 296526 | 387564.9 | 0 | 32878968
B16 | Max. open credit plans | Number of credit accounts opened | 802700 | 10990056 | 0 | 1539737892
Table 4: German Credit Dataset Fields

ID | Field | Description | Mean | Std. Dev. | Min | Max
C15 | Housing | Accommodation on rent/own | 151.929 | 0.53126 | 151 | 153
C16 | Number of existing credits | Other credits taken from this bank | 1.407 | 0.57765 | 1 | 4
C17 | Job | Whether unemployed / skilled employee / unskilled / highly qualified / officer etc. | 172.904 | 0.65361 | 171 | 174
C18 | Maintenance | Number of people being liable to provide maintenance for | 1.155 | 0.36208 | 1 | 2
C19 | Telephone | Yes/no | 191.404 | 0.49094 | 191 | 192
C20 | Foreign worker | Yes/no | 201.037 | 0.1888 | 201 | 202
C21 | Credit status | Paid/defaulter | 1.3 | 0.4584 | 1 | 2
5.1 Workflow Diagram and Experiment: Fig. 1 shows the workflow diagram.

Figure 1: Workflow diagram of the proposed approach.
1) Data Cleaning: For the Lending Club dataset, attributes containing redundant or useless information, along with attributes that would allow the algorithms to learn from the future, are removed, as they lead to undesired results such as unrealistically high accuracy and precision (close to 100%). After deleting these attributes we are left with 36 features, which are shown in Table 2 along with their statistical details. The Kaggle and German Credit datasets need no such operation, as they contain no attributes with any sort of redundancy or useless information. The only pre-processing step taken in Kaggle's dataset is changing the values of loan Id and customer Id: the original values are alphabetical and 33 characters long, so for simplicity their values are taken as 1, 2, ..., 42000.
2) Data Standardization: As is clear from Table 2, Table 3 and Table 4, there is a large disparity in the variability of the predictor variables. Many variables have a standard deviation of less than 1, while for many others the value runs into lakhs. Before applying any technique the variables need to be standardized, otherwise variables with a higher standard deviation dominate those with a lower standard deviation. The variables are standardized using the scale() function in R, which centers and scales the columns of a numeric matrix. It performs z-score normalization by subtracting the mean and dividing by the standard deviation, i.e., $z_i = (x_i - \mu)/\sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
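A short sketch of this step (the matrix names are illustrative; scaling the test set with the training statistics is a conventional choice we assume here):

```r
# Column-wise z-score normalization with base R's scale().
train_std <- scale(train_num)   # (x - mean) / sd per column
test_std  <- scale(test_num,
                   center = attr(train_std, "scaled:center"),  # reuse training means
                   scale  = attr(train_std, "scaled:scale"))   # and training sds
```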
After applying the pre-processing steps, the Bolasso, ReliefF, chi-square and gain ratio algorithms are applied to the Lending Club, Kaggle Bank Loan Status and German Credit datasets. The Bolasso method shortlists 18 attributes (out of 35) for the Lending Club dataset, 6 attributes (out of 18) for Kaggle's dataset and 5 attributes (out of 20) for the German Credit dataset. The ReliefF, chi-square and gain ratio algorithms return attribute importance scores, from which the topmost 18 (Lending Club), 6 (Kaggle) and 5 (German Credit) attributes are selected for experimental purposes. The size of each dataset is reduced to include only the features shortlisted by the respective algorithm. Tables 5, 6 and 7 show the attributes selected by the algorithms. For the original baseline SVM, NB, K-NN and RF classifiers, however, the complete dataset is used for carrying out the experiments.
Table 5: Attributes selected by the various algorithms for the Lending Club dataset (each of Chi-Square, Gain Ratio, ReliefF and Bolasso marks its 18 selected attributes among A1-A35).

Table 6: Attributes selected by the various algorithms for Kaggle's Bank Loan Status dataset (each method marks its 6 selected attributes among B1-B18).

Table 7: Attributes selected by the various algorithms for the German Credit dataset (each method marks its 5 selected attributes among C1-C20).
The dataset obtained after reducing the dimensionality is then partitioned into training and test sets in the ratio 70:30. For the Lending Club dataset, 29,774 rows are used for training and 12,756 rows for testing. The Kaggle dataset consists of 29,465 rows for training and 12,535 rows for testing. For the German Credit dataset, 700 rows are used for training and 300 for testing. The various versions of the SVM, NB, K-NN and Random Forest algorithms are then applied to the training datasets; the models are subsequently tested and validated on the test sets for overall performance evaluation. The R software [68] is used for running the various functions. Algorithm 1 provides the operational steps of the Bolasso based feature selection algorithm followed by the Random Forest classifier.
Algorithm 1: Bolasso based feature selection
Input:
D: training dataset with N instances
m: number of bootstrap iterations
Output:
J: list of selected variables
F: dataset with reduced features
Procedure:
1) Split D into (X, Y) // predictor matrix is stored in X, response matrix is stored in Y
2) For k = 1 to m:
   Generate a bootstrap sample by sampling N instances with replacement from (X, Y); name it (X_k, Y_k)
   Compute the Lasso estimate (denoted by w_k) from (X_k, Y_k)
   Compute the support J_k = {j : w_k(j) ≠ 0} // select the variables for which the lasso estimate is non-zero
   End for
3) Compute J = ∩_{k=1}^{m} J_k // J now contains the list of attributes whose lasso estimate is non-zero consistently among the m iterations
4) F = D(J) // reduce the dimensions of the training dataset by eliminating the variables not selected in J
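A compact R sketch of Algorithm 1 using the glmnet package (a binary response is assumed; m = 32 and the use of cv.glmnet to pick the regularization strength on each bootstrap are our illustrative choices, not prescribed by the algorithm):

```r
library(glmnet)

bolasso <- function(X, y, m = 32) {
  n <- nrow(X)
  J <- NULL
  for (k in 1:m) {
    idx   <- sample(n, n, replace = TRUE)               # bootstrap sample (X_k, Y_k)
    cvfit <- cv.glmnet(X[idx, ], y[idx], family = "binomial")
    w     <- as.vector(coef(cvfit, s = "lambda.min"))[-1]  # lasso estimate w_k (no intercept)
    Jk    <- which(w != 0)                              # support J_k
    J     <- if (k == 1) Jk else intersect(J, Jk)       # running intersection
  }
  colnames(X)[J]                                        # variables kept in every iteration
}
```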
In order to compare the stability of Bolasso with that of the other feature selection algorithms, the Jaccard stability measure is used. The Jaccard stability measure is an intersection based metric that finds the average similarity between different feature sets. Formally, it is calculated as

$$JSM = \frac{2}{Q(Q-1)}\sum_{q=1}^{Q-1}\sum_{q'=q+1}^{Q}\frac{|S_q \cap S_{q'}|}{|S_q \cup S_{q'}|} \qquad (6)$$

where Q is the number of sub-samples of the training data, q = 1, ..., Q, S_q and S_q' denote the feature sets selected on sub-samples q and q', and $|S_q \cap S_{q'}|$ denotes the number of common features.
The value of JSM ranges from 0 to 1; a value near 1 is desirable, as it means that the selected feature set does not change significantly and hence is more stable with respect to small variations in the dataset. JSM falls under the "stability by index" category, which considers the indices of the selected features without taking into consideration the weights or ranks of the selected features.
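Eq. (6) is simply the mean pairwise Jaccard similarity, as the short R sketch below shows (the function name and the list-of-character-vectors input are ours):

```r
# Average pairwise Jaccard similarity between Q selected feature subsets.
jaccard_stability <- function(subsets) {    # subsets: list of character vectors
  Q <- length(subsets)
  pairs <- combn(Q, 2)                      # all (q, q') pairs with q < q'
  sims <- apply(pairs, 2, function(p) {
    a <- subsets[[p[1]]]; b <- subsets[[p[2]]]
    length(intersect(a, b)) / length(union(a, b))
  })
  mean(sims)                                # in [0, 1]; near 1 means stable
}
```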
The predictive performance of the models is assessed using Accuracy and AUC. Accuracy is computed from the confusion matrix, whose cells denote:
o TP (True Positive): the number of good debtors who are classified as good by the model.
o TN (True Negative): the number of bad debtors who are classified as bad by the model.
o FP (False Positive): the number of bad debtors who are misclassified as good by the model.
o FN (False Negative): the number of good debtors who are misclassified as bad by the model.
Accuracy is then $(TP + TN)/(TP + TN + FP + FN)$.
AUC stands for "Area Under the ROC Curve". In the literature it is considered a better test of classification than accuracy for determining which model predicts the class best.
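A hedged sketch of computing both measures in R (pROC is one common choice; test_y, rf_pred and the positive-class probabilities prob_good are assumed to already exist):

```r
library(pROC)
accuracy <- mean(rf_pred == test_y)                  # (TP + TN) / total
roc_obj  <- roc(response = test_y, predictor = prob_good)
auc(roc_obj)                                         # area under the ROC curve
```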
5.3.1 Analysis I: Stability

Table 8: Number of attributes selected by Bolasso

Dataset | Total attributes | Attributes selected
Lending Club dataset | 35 | 18
Kaggle's Bank Loan Status dataset | 18 | 6
German Credit dataset | 20 | 5

Table 9: Stability (JSM) of the feature selection algorithms

Method | Lending Club | Kaggle | German Credit
Chi-square | 0.9461 | 0.8539 | 0.6433
Gain ratio | 0.8963 | 0.7650 | 0.6793
ReliefF | 0.5505 | 0.4342 | 0.2701
Bolasso | 0.9888 | 0.9682 | 1

The figures in Table 9 depict the stability behaviour of the Bolasso algorithm in terms of JSM: Bolasso clearly outperforms the other feature selection algorithms in stability.
5.3.2 Analysis II: Classification Accuracy
In order to compare the performance of Bolasso with the other feature selection methods, we apply the features obtained using each feature selection method to the different classifiers, namely SVM, NB, K-NN and RF. The classification performance of the various algorithms is shown in Table 10. For example, the values for the LC dataset are (0.929, 0.883, 0.832, 0.931, 0.934), and the highest is for Bolasso.
- For the Lending Club dataset: the comparison of AUC is shown in Figs. 2(a)-2(d). The closer the curve is to the top-left corner, the better the performance compared with a curve closer to the diagonal. Bolasso enabled SVM, NB, KNN and RF performed consistently better than the other versions of SVM, NB, KNN and RF.
- For Kaggle's Bank Loan Status dataset: the AUC of the Bolasso method in the case of SVM, NB and KNN does not stand out; the results are found to be equivalent to those of the other methods (baseline, chi-square, gain ratio and ReliefF). However, the Bolasso enabled Random Forest algorithm performs extraordinarily well and provides the best results. The AUC curves are shown in Figs. 3(a)-3(d).
- For the German Credit dataset: the AUC of Bolasso enabled SVM, NB, KNN and RF is superior to that of their counterparts. Figs. 4(a)-4(d) show their performance graphically.
Figures 2(a)-2(d), 3(a)-3(d) and 4(a)-4(d): ROC curves of the SVM, NB, KNN and RF variants on the Lending Club, Kaggle Bank Loan Status and German Credit datasets respectively.
Figure 5(a): Comparison of Bolasso based classifiers for the LC dataset.
Figure 5(b): Comparison of Bolasso based classifiers for the Kaggle dataset.
Figure 5(c): Comparison of Bolasso based classifiers for the German Credit dataset.
A graphical comparison of the accuracy of the various algorithms is shown in Figs. 6(a)-6(c). It is concluded that the Bolasso based Random Forest algorithm performs better than all the other algorithms.
Fig. 6(a): Accuracy comparison of the various algorithms (Lending Club dataset).
Fig. 6(b): Accuracy comparison of the various algorithms (Kaggle dataset).
Fig. 6(c): Accuracy comparison of the various algorithms (German Credit dataset).
Experiments are carried out on an i3 processor with 2 GB RAM running 64-bit Windows 7. Table 11 depicts the time complexities (in Big-O notation) of the feature selection algorithms, while the run times and time complexities of the various classification algorithms are shown in Table 12. Here n denotes the number of instances, f the number of features and m the number of bootstrap iterations.

Table 11: Time complexity of the feature selection algorithms

Method | Time complexity
Chi-Square | O(nf)
Gain Ratio | O(nf)
ReliefF | O(n²f)
Bolasso | O(mf³ + f²nm)
Table 12: Time complexity and run time of several classification algorithms (f′ = number of selected features)

Algorithm | Time complexity | Run time: Lending Club | German Credit | Kaggle
K-NN | O(nfk); the optimal value of k is found in this study | 8000.45 | 3.07 | 6500.80
C-KNN | Test phase = O(nf′k) | 4494.88 | 1.71 | 2166.27
R-KNN | Test phase = O(nf′k) | 377.79 | 1.76 | 1820.53
G-KNN | Test phase = O(nf′k) | 344.25 | 1.81 | 1755.25
BS-KNN | Test phase = O(nf′k) | 417.10 | 1.88 | 976.18
RF | – | 243.34 | 0.92 | 330.05
C-RF | – | 120.45 | 0.83 | 175.45
R-RF | – | 124.69 | 0.86 | 84.42
G-RF | – | 159.27 | 0.14 | 113.14
BS-RF | – | 123.51 | 0.16 | 200.03
a) Run time is directly proportional to the size of the dataset. Since the Kaggle and Lending Club datasets are high dimensional, their run times exceed that of the German dataset, which has 1000 rows.
b) The run time of the classification algorithms is reduced significantly after applying the feature selection algorithms.
c) Naïve Bayes based algorithms have the minimum run time, while K-NN based algorithms are the most time consuming.
6. Conclusion
In this paper, the stability of Bolasso based feature selection was compared with the stability of other feature selection algorithms in terms of JSM. We then reduced the dimensionality of each dataset to include only the features selected by the respective feature selection algorithm, and used different types of classifiers: SVM, NB, K-NN and RF. To show the importance of feature selection, the results were also compared with baseline classifiers applied to the complete datasets. It was found that the classification performance of the random forest algorithm, when used with the Bolasso shortlisted features, provided better lending decisions than the other methods.
In the future, we would like to test the Bolasso shortlisted features on an ensemble of various classifiers. Another interesting direction is to perform sentence and sentiment analysis on text in order to correctly judge the intentions of a debtor, thereby increasing the predictive capability of the classifier.
References:
[1] Oreski, S., & Oreski, G., Genetic algorithm-based heuristic for feature selection in credit risk assessment, Expert Systems with Applications (2013). http://dx.doi.org/10.1016/j.eswa.2013.09.004
[2] Dahiya, S., Handa, S.S., Singh, N.P., A feature selection enabled hybrid-bagging algorithm for credit risk evaluation, Expert Systems 2017;34:e12217. https://doi.org/10.1111/exsy.12217
[3] Wang, D., Zhang, Z., A hybrid system with filter approach and multiple population genetic algorithm for feature selection in credit scoring, Journal of Computational and Applied Mathematics. http://dx.doi.org/10.1016/j.cam.2017.04.036
[4] Chandrashekar, G., Sahin, F., A survey on feature selection methods, Computers and Electrical Engineering (2014), Elsevier. https://doi.org/10.1016/j.compeleceng.2013.11.024
[5] Dehuri, S., Ghosh, A., Revisiting evolutionary algorithms in feature selection and nonfuzzy/fuzzy rule based classification (2013), WIREs Data Mining and Knowledge Discovery, 3: 83-108, John Wiley & Sons. doi: 10.1002/widm.1087
[6] Cai, J., Luo, J., Wang, S., Yang, S., Feature selection in machine learning: a new perspective, Neurocomputing (2018).
267-288.
[8] Lin, L., Shuang, W., Yifang, L., & Shouyang, W. (2014). A new idea of study on the influence factors of companies' debt costs in the big data era. Procedia Computer Science, 31, 532-541. doi:10.1016/j.procs.2014.05.299
[9] Fang, K., Zhang, G., Zhang, H., Individual credit risk prediction model: application of Lasso-Logistic model. The Journal of Quantitative and Technical Economics, 2014.
[10] Hongmei Chen, Yaoxin Xiang, The study of credit scoring model based on Group Lasso.
[13] Bach, F. (2008). Bolasso: model consistent Lasso estimation through the bootstrap. arXiv:0804.1302, INRIA - Willow Project-Team, Paris, France.
[14] Huang, X., Liu, X., Ren, Y., Enterprise credit risk evaluation based on neural network algorithm, Cognitive Systems Research (2018). doi: https://doi.org/10.1016/j.cogsys.2018.07.023
[15] Liu, Y., Schumann, M., Data mining feature selection for credit scoring models, Journal of the Operational Research Society 56 (2005) 1099-1108. doi: 10.1057/palgrave.jors.2601976
[16] Jadhav, S., He, H., Jenkins, K., Information gain directed genetic algorithm wrapper feature selection for credit rating, Applied Soft Computing, Volume 69 (2018), Pages 541-553, ISSN 1568-4946. https://doi.org/10.1016/j.asoc.2018.04.033
[17] Zhang, X., Yang, Y., Zhou, Z., "A novel credit scoring model based on optimized random forest," 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV (2018), pp. 60-65. doi: 10.1109/CCWC.2018.8301707
[18] Liu, X.-Y., Liang, Y., et al., A hybrid genetic algorithm with wrapper-embedded approaches for feature selection, IEEE Access. doi: 10.1109/ACCESS.2018.2818682
[19] Liu, H., & Setiono, R. (1995). Chi2: feature selection and discretization of the numeric attributes. In Proceedings of the International Conference on Tools with Artificial Intelligence (pp. 388-391), IEEE.
[20] Trabelsi, M., Meddouri, N., Maddouri, M., A new feature selection method for nominal classifier based on formal concept analysis, International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2017, 6-8 September (2017), Marseille, France, Procedia Computer Science. https://doi.org/10.1016/j.procs.2017.08.227
2012, pp. 1-5. doi: 10.1109/INISTA.2012.6247011
[24] Liu, Y., and M. Schumann. "Data mining feature selection for credit scoring models." The Journal of the Operational Research Society, vol. 56, no. 9, 2005, pp. 1099-1108. JSTOR, www.jstor.org/stable/4102203
[32] Malekipirbazari, M., Aksakalli, V., "Risk assessment in social lending via random forests", Expert Systems with Applications 42 (2015) 4621-4631, ScienceDirect.
[33] https://www.capitaline.com
[34] https://www.federalreserve.gov/releases/chargeoff/delallsa.htm
[35] Yue Zhang, Weihong Guo, Soumya Ray, "On the consistency of feature selection with Lasso for non-linear targets", Proceedings of the 33rd International Conference on Machine Learning, PMLR 48:183-191, 2016.
[36] Khaire, U. M., & Dhanalakshmi, R. (2019). Stability of feature selection algorithm: A review. Journal of King Saud University - Computer and Information Sciences. doi:10.1016/j.jksuci.2019.06.012
[37] Turney, P. Technical note: Bias and the quantification of stability. Machine Learning, 20:23-33, 1995.
[38] Somol, P., Novovicova, J., "Evaluating stability and comparing output of feature selectors that optimize feature subset cardinality," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 11, pp. 1921-1939, Nov. 2010. doi: 10.1109/TPAMI.2010.34
[39] Kuncheva, L.I., "A stability index for feature selection," Proc. 25th IASTED Int'l Multi-Conf. Artificial Intelligence and Applications, pp. 421-427, 2007.
[40] Kamkar, I., Gupta, S. K., Phung, D., Venkatesh, S., "Exploiting feature relationships towards stable feature selection," 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, 2015, pp. 1-10. doi: 10.1109/DSAA.2015.7344859
[41] He, Z., Yu, W., "Stable feature selection for biomarker discovery", Computational Biology and Chemistry, Elsevier. doi: https://doi.org/10.1016/j.compbiolchem.2010.07.002
[42] Kamkar, Iman, Gupta, Sunil, Phung, Dinh, Venkatesh, Svetha (2015). Stable feature selection with support vector machines. 9457. 10.1007/978-3-319-26350-2_26
[43] Li, Y., Si, J., Zhou, G., Huang, S., & Chen, S. (2015). FREL: A stable feature selection algorithm. IEEE Transactions on Neural Networks and Learning Systems.
[44] Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., Saeys, Y. (2010). Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26(3), pp. 392-398.
[45] Han, Y., & Yu, L. (2010). A variance reduction framework for stable feature selection. 2010 IEEE International Conference on Data Mining. doi:10.1109/icdm.2010.144
[46] Pandey, T. N., Jagadev, A. K., Mohapatra, S. K., & Dehuri, S. (2017). Credit risk analysis using machine learning classifiers. 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS). doi:10.1109/icecds.2017.8389769
[47] Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1), 124-136. doi:10.1016/j.ejor.2015.05.030
[48] Wei-Yang Lin, Ya-Han Hu, & Chih-Fong Tsai (2012). Machine learning in financial crisis prediction: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 421-436. doi:10.1109/tsmcc.2011.2170420
[49] Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications, 40(13), 5125-5131. doi:10.1016/j.eswa.2013.03.019
[50] Shi, L., Liu, Y., & Ma, X. (2011). Credit assessment with random forests. Emerging Research in Artificial Intelligence and Computational Intelligence, 24-28. doi:10.1007/978-3-642-24282-3_4
[51] Pes, Barbara (2019). Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains. Neural Computing and Applications. 10.1007/s00521-019-04082-3
[52] Behr, A., & Weinblat, J. (2016). Default patterns in seven EU countries: A random forest approach. International Journal of the Economics of Business, 24(2), 181-222. doi:10.1080/13571516.2016.1252532
[53] Ha Van Sang, Nguyen Ha Nam, Nguyen Duc Nhan. A novel credit scoring prediction model based on feature selection approach and parallel random forest. Indian Journal of Science and Technology, Vol 9(20), May (2016). DOI: 10.17485/ijst/2016/v9i20/92299
[54] Bingamawa, Muhammad Tosan, & Agus Santoso, Heru (2016). "Implementation of Naïve Bayes algorithm to determine customer credit status in PT. Multindo Auto Finance Semarang". doi: 10.13140/RG.2.2.20330.52164
[55] Antonakis, A. C., and Sfakianakis, M. E. (2009), "Assessing naïve Bayes as a method for screening credit applicants", Journal of Applied Statistics, Vol 5(36), 537-545, Taylor & Francis. doi: 10.1080/02664760802554263
[56] Yeh, I.-C., & Lien, C. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480. doi:10.1016/j.eswa.2007.12.020
[57] Paulius Danenas, Gintautas Garsva, Saulius Gudas, "Credit risk evaluation model development using support vector based classifiers", Procedia Computer Science, Volume 4, 2011, Pages 1699-1707, ISSN 1877-0509. https://doi.org/10.1016/j.procs.2011.04.184
[58] Danenas, P., & Garsva, G. (2015). Selection of support vector machines based classifiers for credit risk domain. Expert Systems with Applications, 42(6), 3194-3204. doi:10.1016/j.eswa.2014.12.001
[59] Sivasankar, E., Selvi, C., Mala, C. (2017). A study of dimensionality reduction techniques with machine learning methods for credit risk prediction. In: Behera, H., Mohapatra, D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 556. Springer, Singapore.
[60] Baesens, B., Van Gestel, T., Viaene, S., et al. (2003), "Benchmarking state of the art classification algorithms for credit scoring", Journal of the Operational Research Society 54: 627. https://doi.org/10.1057/palgrave.jors.2601545
[61] Li, Feng-Chia (2009). The hybrid credit scoring model based on KNN classifier. Sixth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 330-334. IEEE Computer Society.
[62] Hand, D. J., and Vinciotti, V., 2003, "Choosing k for two-class nearest neighbour classifiers with unbalanced classes".
[63] Islam, M. J., Wu, Q. M. J., Ahmadi, M., and Sid-Ahmed, M. A., 2007, "Investigating the performance of Naive-Bayes classifiers and K-Nearest Neighbor classifiers", International Conference on Convergence Information Technology. IEEE Computer Society.
[64] https://www.lendingclub.com/info/download-data.action
[65] https://www.kaggle.com/zaurbegiev/my-dataset
[66] https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
[68] R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Declaration of Interest

Dear Editor-in-Chief,
10-05-2019

We hereby declare that we have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria, educational grants, membership, employment, etc.) or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.

Regards,
Nisha Arora,
Dr. Pankaj Deep Kaur