
Journal Pre-proof

A Bolasso based consistent feature selection enabled random forest


classification algorithm: An application to credit risk assessment

Nisha Arora, Pankaj Deep Kaur

PII: S1568-4946(19)30717-3
DOI: https://doi.org/10.1016/j.asoc.2019.105936
Reference: ASOC 105936

To appear in: Applied Soft Computing Journal

Received date: 10 May 2019
Revised date: 5 September 2019
Accepted date: 12 November 2019

Please cite this article as: N. Arora and P.D. Kaur, A Bolasso based consistent feature selection
enabled random forest classification algorithm: An application to credit risk assessment, Applied
Soft Computing Journal (2019), doi: https://doi.org/10.1016/j.asoc.2019.105936.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.



Highlights of Paper:
1) A novel Bolasso enabled Random Forest algorithm (BS-RF) is proposed to classify borrowers as defaulters or legitimate.
2) The stability of Bolasso is compared with that of other feature selectors (Chi-square, Gain Ratio, ReliefF) in terms of JSM.
3) Time complexities are computed and the run times of the various algorithms are recorded.
4) The proposed algorithm is compared with Bolasso enabled Naïve Bayes, SVM and KNN classifiers.
5) The experimental results show that Bolasso selected features are stable with respect to small variations in the dataset.
6) The results further emphasize that BS-RF is better than the other methods in terms of AUC and accuracy.


A Bolasso Based Consistent Feature Selection Enabled Random Forest Classification Algorithm: An Application to Credit Risk Assessment

Nisha Arora, Research Scholar, Guru Nanak Dev University, Regional Campus, Jalandhar
Dr. Pankaj Deep Kaur, Asst. Prof., Guru Nanak Dev University, Regional Campus, Jalandhar
Abstract
Credit risk assessment has been a crucial issue, as it forecasts whether an individual will default on a loan or not. Classifying an applicant as a good or bad debtor helps the lender make a wise decision. Modern data mining and machine learning techniques have been found to be very useful and accurate in credit risk prediction and correct decision making. Classification is one of the most widely used techniques in machine learning. To increase the prediction accuracy of standalone classifiers while keeping the overall cost to a minimum, feature selection techniques have been utilized, as feature selection removes redundant and irrelevant attributes from the dataset. This paper first introduces Bolasso (Bootstrap-Lasso), which selects consistent and relevant features from the pool of features. Consistent feature selection is defined as the robustness of the selected features with respect to changes in the dataset. The Bolasso-shortlisted features are then applied to various classification algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbours (K-NN), to test their predictive accuracy. It is observed that the Bolasso enabled Random Forest algorithm (BS-RF) provides the best results for credit risk evaluation. The classifiers are built on a 70:30 training/test partition of three datasets (Lending Club's peer-to-peer dataset, Kaggle's "Bank Loan Status" dataset and the German Credit dataset obtained from UCI). The performance of the Bolasso enabled classification algorithms is then compared with that of other baseline feature selection methods, namely Chi-square, Gain Ratio and ReliefF, and with stand-alone classifiers (no feature selection method applied). The experimental results show that Bolasso provides phenomenal stability of features when compared with the stability of the other algorithms; the Jaccard Stability Measure (JSM) is used to assess the stability of the feature selection methods. Moreover, BS-RF has good classification accuracy and is better than the other methods in terms of AUC and accuracy, effectively improving the decision making process of lenders.

Keywords: Bootstrap-Lasso; stable feature selection; BS-RF; credit risk assessment; Random Forest in credit risk
1. Introduction
Nowadays, credit risk assessment is an indispensable issue in financial institutions. Credit risk is defined as the probability that a borrower will fail to repay the borrowed amount. The decision to grant or reject a loan is very critical and is based on the applicant's personal information, credit history, living status, loyalty, etc.
As per the data obtained from Capitaline Plus [33], India's gross bad loans stood at Rs 10.25 lakh crore as on 31 March 2018, which is 11.8% of the total loans given by the banking industry. State Bank of India (SBI), which tops the bad-loan chart, logged an increase of Rs 24,286 crore in bad loans in the March quarter, to Rs 2.23 lakh crore. The Nirav Modi scam-hit Punjab National Bank (PNB) reported the maximum rise of Rs 29,100 crore in gross bad loans, to Rs 86,620 crore in the March quarter. As per the statistical data obtained from the Federal Reserve [34], the delinquency and charge-off rates of banks for 2019 Q1 are 1.74 (real estate loans) and 2.33 (consumer loan category).
This alarming situation in banks and financial institutions has drawn the attention of various researchers. Modern data mining and machine learning techniques have been found to be very useful and accurate in credit risk prediction and correct decision making. To enhance accuracy, several approaches have been proposed, including feature selection techniques [1,2].
Feature selection is a pre-processing technique which helps in selecting appropriate attributes. Since real-world instances of data are usually high dimensional, feature selection techniques help in reducing the dimensionality of the data by removing irrelevant and redundant features. Irrelevant features, if included in the classification process, not only reduce its accuracy but also consume more space and execution time. Feature selection is therefore very important, as it facilitates a deep understanding of the data by studying only relevant features and enhances the accuracy, speed and predictive capability of the classifier [3]. Feature selection methods are classified into three categories:

a) Filter Method b) Wrapper Method c) Embedded Method


The filter approach first shortlists the important features independently of any classification algorithm. Filter methods use variable ranking techniques to score the variables; the topmost variables are chosen from the pool of variables, resulting in the removal of less relevant variables [4]. Examples of filter methods are the T-test, Information Gain, Chi-square, ReliefF, etc. In the wrapper method, feature selection is performed by taking the classification algorithm into consideration. In this approach, the classification algorithm is executed many times, each time with a different subset; with each iteration the quality of the feature subset is evaluated, and finally the best feature subset is selected [5]. Examples of this method are Beam Search, Sequential Forward Selection (SFS), Sequential Backward Elimination (SBE), etc. The embedded method, however, shortlists features during the training process of the learning model, and the selected features are output at the end of training. Examples are Weighted Naïve Bayes and Lasso. The performance of a feature selection method is evaluated by machine learning models like Naïve Bayes, SVM, KNN, C4.5, etc. [6]

This paper proposes an improved version of the embedded Lasso approach, namely Bolasso, to shortlist consistent and relevant features. As credit risk assessment is a difficult learning problem, the selection of an appropriate set of features is critical to the success of the learning process and is therefore a vital issue. An important property of a feature selection strategy is selection consistency (stability), i.e., finding the "true" set of features [35]. Feature stability matters especially when feature selection is applied for knowledge discovery. In credit risk assessment, a feature selection algorithm may select different subsets of features when there are slight variations in the training data; although most of these subsets may result in good prediction accuracy, such changes (instability) in the selected features may reduce the confidence of lenders in investigating reliable risk factors. Bolasso exhibits stability and also provides reasonably good classification accuracy. Various classification algorithms are used to verify the accuracy of the Bolasso selected features, out of which Random Forest performs best.
In summary, the main contributions of this paper are:
 This paper presents the stability property of Bolasso using the Jaccard Index stability metric.
 A novel Bolasso enabled Random Forest algorithm is introduced to classify borrowers as potential defaulters or legitimate.
 Four standalone classifiers (SVM, Naïve Bayes, KNN, RF) are used to predict the behavior of debtors.
 A total of sixteen feature selection enabled models are compared. Of these sixteen, four algorithms are based on the Chi-square feature selector (C-SVM, C-NB, C-KNN, C-RF), four on the Gain Ratio feature selector (G-SVM, G-NB, G-KNN, G-RF), four on the ReliefF feature selector (R-SVM, R-NB, R-KNN, R-RF), and four on the Bolasso based feature selector (BS-SVM, BS-NB, BS-KNN, BS-RF).
 The performance of the algorithms in terms of AUC and accuracy is compared, and it is observed that the BS-RF algorithm gives the best results.
 Time complexities of the classification algorithms are computed, and run times are recorded during the experiments.
The rest of the paper is organized as follows. Section 2 reviews the existing literature on classification in credit risk, feature selection based studies in credit risk assessment and Lasso based studies. Section 3 describes the methodology, including the feature selection methods and classifiers used in this paper. Section 4 describes the datasets used for the experiments. Section 5 presents the design of the Bolasso based Random Forest algorithm (BS-RF), the experiments carried out and a summary of the results. Finally, Section 6 concludes the research.
2. Literature Review
2.1 Classification in credit risk prediction

Classification models have been used widely in credit risk assessment, as they help to find the relationship between the attributes of loan applications (such as credit history, number of customer accounts, annual income, home ownership, etc.) and potential defaulters [14]. Various models and classification algorithms have been used to predict credit risk [46],[47],[48],[49], among them Naïve Bayes, Decision Trees, Logistic Regression, K-NN, K-means, Neural Networks and Random Forest. Kruppa, J. et al. [49] estimated consumers' default probabilities and concluded that Random Forest outperformed Logistic Regression and KNN. Malekipirbazari, M. et al. [32] focused on peer-to-peer lending and compared the performance of machine learning methods, namely Random Forest, SVM, Logistic Regression and K-NN, in terms of AUC, accuracy, Root Mean Square Error (RMSE) and confusion matrix. They concluded that Random Forest provides matchless results in identifying the best borrowers (those with low default probability), thus helping lenders take better investment decisions. Shi, L. [50] also conducted experiments to identify bad debts and concluded that Random Forest is an effective classification tool for credit assessment. Behr, A., & Weinblat, J. [52] recommended Random Forest as a worthwhile alternative to well-established default prediction models and estimated the default tendencies of firms from seven European countries using two datasets. The other classification models used in this study for comparison, namely SVM, NB and K-NN, have also been explored by other researchers for credit risk assessment. Table 1 briefly outlines the application of the mentioned models in the credit risk context.
Table 1: Classification techniques and reference list

Classification Technique | Reference list
Naïve Bayes | [54],[55],[56]
SVM | [57],[58],[59],[26]
KNN | [67],[60],[61],[62],[63]

Random Forest is an ensemble classification algorithm which, besides providing good classification accuracy, offers numerous other advantages: it is immune to overfitting, fast and simple, helps in better estimating the internal error, and provides good tolerance of outliers and noise. In this paper Bolasso is integrated with Random Forest, Naïve Bayes, SVM and KNN, and the best results are obtained with Random Forest.

2.2 Feature selection based studies in credit risk assessment

Generally, datasets on credit scoring are highly dimensional, making the classification problem complex: not only is it computationally demanding, but prediction also becomes less accurate [15]. Feature selection becomes necessary, as it reduces the computational burden and also helps to improve accuracy. Each feature selection method has its own pros and cons: the filter approach is faster but neglects a feature's influence on the classifier, while the wrapper approach utilizes optimization algorithms to generate subsets and then evaluates the subsets using classifiers to guarantee the best features, but is computationally intensive.

Hybridization is an important concept, as it takes advantage of combining various methods [2]. Recent studies in credit risk assessment use hybrid methods to combine the best properties of filters and wrappers, achieving a balance of accuracy and computational time. In such methods, the final feature subset is selected in two phases: a filter method (used in the first phase) ranks the features and shortlists some of the topmost features, thus reducing the feature space, and then (in the second phase) a wrapper is applied to find the best candidate subset. Oreski and Oreski [1] proposed a hybrid GA with NN (Neural Network) (HGA-NN), which uses a filter method to shortlist 12 features; the reduced feature subset was applied to a genetic algorithm, and they showed that it is a promising method for credit risk assessment. Wang, D. [3] proposed HMPGA for credit scoring, in which three initial feature subsets are first generated using three filter approaches, namely F-score, information gain ratio and Pearson's correlation; then a wrapper approach, MPGA (Multiple Population Genetic Algorithm), is applied for feature selection. Jadhav, S. [16] developed IGDFS, a hybrid model based on Information Gain (filter approach) and a Genetic Algorithm (wrapper approach), to select features for the credit scoring problem, and concluded that the model improves when features are selected carefully. X. Zhang [17] developed the NCSM model, which uses information gain ratio to filter irrelevant and redundant features and then applies a grid search method to produce an optimized Random Forest algorithm. The conclusion drawn was that keeping fewer features helps to reduce the workload of credit evaluators and, at the same time, increases predictive efficiency.
Although hybrid approaches have been used widely, they still possess some limitations: the filter method may eliminate potentially useful features, as there is no guarantee that the features it shortlists are good candidates for the wrapper method [18]; moreover, wrapper methods are still computationally intensive, as they tend to call the classification algorithm repeatedly. To overcome the limitations of hybrid and wrapper approaches, this paper employs Bolasso based feature selection, which has the advantage of integrating feature selection and the learning process. Moreover, it is a one-phase feature selection algorithm which shortlists and outputs the selected features in a single step.
2.3 Stability based feature selection

A prominent feature of Bolasso is that it provides stability in feature selection. High stability of a feature selection algorithm is just as important as high classification accuracy, and ignoring the stability issue may lead to drawing wrong conclusions and may cast suspicion on the claimed results of domain experts [36],[38],[39],[40],[51]. Stability was introduced by Turney [37], who emphasized its importance in learning algorithms and discussed the relationship between stability, predictive accuracy and bias. Z. He [41] reviewed various research papers on the stability of biomarker discovery. Abeel et al. [44] proposed an ensemble framework for biomarker discovery in microarray datasets. They utilized SVM-RFE to improve feature stability: in their method, they first rank all the features using SVM and then eliminate the features with the lowest scores; finally, they aggregate the feature scores using a linear combination of all scores over all iterations. Kamkar et al. [42] focused on feature stability and proposed C-SVM to increase the stability of the l1-norm SVM, testing it on three datasets (Breast Cancer, Cancer and AMI). Li, Y. et al. [43] proposed FREL (Feature weighting as Regularized Energy based Learning) and ensemble FREL, stability based feature selection algorithms, and tested their stability on four datasets (TOX, SMK, Leukemia and Prostate). Han and Yu [45] proposed a variance reduction framework to improve the stability of feature selection; they showed that the stability of feature selection under training data variations can be improved by variance reduction techniques.

The stability of Bolasso has not been explored earlier. Bolasso is considered a consensus combination scheme, as it generates bootstrap replications of a dataset, calculates the Lasso estimates, intersects the supports of the bootstrap replications and keeps the largest subset of variables on which all regressors agree in terms of variable selection. Bolasso incorporates stability, as it produces similar features even when small variations occur in the training data or when new training samples are added or removed.

2.4 Lasso based feature selection

The main aim of feature selection is to find a feature subset which is small in size but high in predictive accuracy. The idea of using Lasso for feature selection was proposed by Tibshirani, R. in 1996 [7]. As per his findings, Lasso does not focus on subsets but rather defines a continuous shrinking operation that can produce coefficients that are exactly 0, and is hence applicable to both model selection and parameter estimation. Lin, L., Shuang, W., Yifang, L. et al. [8] applied Lasso to data on 2301 companies with 18 indexes; the Lasso method selected 9 variables out of 18, and they concluded that Lasso provided the smallest average prediction error. Fang et al. [9] analysed personal credit evaluation in China and found that Lasso regression does better in prediction accuracy. Various other versions of Lasso, such as Group Lasso and Tree-Lasso, have been developed and used for feature selection. Hongmei Chen [10] constructed a credit scoring model based on Group Lasso Logistic Regression and concluded that the Group Lasso method is better in interpretability and predictive analysis; Group Lasso is applicable when features form different groups and the variables within a group are correlated. Kamkar, I. et al. [11] proposed a Tree-Lasso model for stable feature selection in health care and concluded that features selected by Tree-Lasso are more stable compared to other methods like the t-test, Lasso, etc. In another work, Zhang, Z. et al. [12] proposed a novel interactive Lasso regression model to identify high-order feature interactions.

However, when the dimensionality of the dataset is large and there is strong dependence between relevant and irrelevant variables, the Bolasso method can be applied to obtain a set of consistent variables [13].

3. Methodology

Section 3.1 describes the feature selection methods, namely Chi-square, Gain Ratio, ReliefF and Bolasso, and Section 3.2 describes the baseline classifiers used in this study to evaluate the predictive performance of the feature selection methods.
3.1 Feature selection methods

3.1.1 Chi-square test: The Chi-square test [19] is a statistical test of independence used to determine the dependency of two variables. As a filter based feature selection method, it evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. It measures the ability to predict the value of an attribute from the value of the class by testing the independence between them and checking for the absence of statistical links [20,21]. It is computed as

$$CT = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}, \quad \text{where } e_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}, \qquad (1)$$

where $n_{ij}$ is the observed cell value in the contingency table, $n_{i\cdot}$ is the row total, $n_{\cdot j}$ is the column total and $n$ is the grand total.


3.1.2 Gain Ratio: Gain Ratio calculates the ratio of the information gained from an attribute with respect to entropy; it is a normalized version of Information Gain:

Gain Ratio(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute), (2)

where H is the entropy, defined as

$$H = -\sum_{i} p_i \log_2 p_i,$$

where $p_i$ is the probability with which a particular value occurs in the sample space. Entropy ranges from 0 (all instances of a variable have the same value) to 1 (equal numbers of instances of each value). It measures how the values of an attribute are distributed and signifies the pureness of the attribute: high entropy means the distribution is uniform, i.e., the histogram of the distribution is flat and there is an equal chance of obtaining any possible class, while low entropy means the distribution is gathered around a point [22].
3.1.3 Relief and ReliefF: The Relief and ReliefF algorithms are successful attribute estimators. They are able to detect conditional dependencies between attributes and provide a unified view of attribute estimation in classification problems [23]. The key idea of Relief is to rank the quality of features according to how well their values distinguish between cases that are near to each other; it is based on the Euclidean distance measure. The enhanced method ReliefF [24] uses the Manhattan (l1) norm instead of the Euclidean (l2) norm to rank the features. The top k features are selected as the final set.
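To make the three filter scores above concrete, the sketch below shows one way they might be computed in R (the language used for the experiments in Section 5). The FSelector package, the data frame credit and the class column Loan_status are illustrative assumptions, not details taken from the paper:

library(FSelector)
# Score every predictor against the class with each filter method
chi_scores <- chi.squared(Loan_status ~ ., credit)   # Chi-square, Eq. (1)
gr_scores  <- gain.ratio(Loan_status ~ ., credit)    # Gain Ratio, Eq. (2)
rel_scores <- relief(Loan_status ~ ., credit,
                     neighbours.count = 5, sample.size = 20)
# Keep the k top-ranked attributes, as done for the filter baselines
top_features <- cutoff.k(rel_scores, k = 18)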
3.1.4 Lasso and Bolasso

Lasso (Least Absolute Shrinkage and Selection Operator) is a widely used feature selection method; it selects variables and also utilizes regularization to increase prediction accuracy. Bolasso (Bootstrap-enabled Lasso) was introduced by Francis R. Bach (2008) [13] as a model for consistent variable selection. It was proved that in the presence of strong correlation between relevant and irrelevant variables, Lasso cannot generate consistent results. Bolasso, however, runs the Lasso model on several bootstrapped samples of the original data and then intersects the non-zero coefficients to estimate consistent coefficients. The computational complexity of Bolasso is $O(m(p^3 + p^2 n))$, where $p$ is the number of features. The probability that Bolasso does not select the correct model has the following upper bound:

$$P(\hat{J} \neq J) \;\leq\; m A_1 e^{-A_2 n} \;+\; A_3 \frac{\log n}{\sqrt{n}} \;+\; A_4 \frac{\log m}{m}, \qquad (3)$$

where $A_1, A_2, A_3, A_4$ are positive constants, $m$ is the number of bootstrapped replications, $n$ is the number of finite data points and $J$ is the list of selected variables.
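The paper does not describe its Bolasso implementation, but the procedure can be sketched in R with the glmnet package: fit the Lasso on m bootstrap replications and intersect the supports of the non-zero coefficients. The cross-validated choice of the regularization parameter is an assumption for illustration:

library(glmnet)
# Minimal Bolasso sketch: X is a numeric predictor matrix, y a binary response
bolasso_select <- function(X, y, m = 100) {
  selected <- NULL
  for (k in 1:m) {
    idx   <- sample(nrow(X), replace = TRUE)                  # bootstrap replication
    cvfit <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 1)
    w     <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]  # drop the intercept
    Jk    <- which(w != 0)                                    # support J_k
    selected <- if (is.null(selected)) Jk else intersect(selected, Jk)
  }
  colnames(X)[selected]  # variables with non-zero estimates in every replication
}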
3.2 Baseline Classifiers

3.2.1 Support Vector Machine

Support Vector Machines (SVMs), introduced by Cortes, C. and Vapnik, V., 1995 [25], are very popular machine learning techniques for both classification and regression analysis. They have been applied in various fields such as text categorization, pattern recognition and binary problems. An SVM constructs a hyperplane that maximizes the margin between different classes [26]; this hyperplane separates the data into two classes, positive or negative. Carrizosa, E. & Morales, D.R. [27] used mathematical optimization to address issues in SVM, such as the detection of relevant features or the accommodation of measurement costs associated with the variables. Fang, K. and Zhang, G. [9] state that the SVM (dual) optimization problem can be written as

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (4)$$

$$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C,$$

where $\alpha_i$ is the Lagrange multiplier associated with instance $i$ and $K(\cdot,\cdot)$ is the kernel function. The most common kernel functions are:

1) Linear kernel: $K(x_i, x_j) = x_i^{T} x_j$
2) Polynomial kernel: $K(x_i, x_j) = (x_i^{T} x_j + 1)^d$
3) Gaussian kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
4) Radial kernel: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

In this study, we employ the radial kernel function $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$.
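As an illustration only (the paper names neither the package nor the parameter values), such a radial-kernel SVM could be fitted in R with the e1071 package on an assumed 70:30 train/test split:

library(e1071)
# Radial-kernel SVM; gamma and cost are example values, not tuned settings
svm_fit  <- svm(Loan_status ~ ., data = train, kernel = "radial",
                gamma = 0.1, cost = 1)
svm_pred <- predict(svm_fit, newdata = test)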


3.2.2 Naïve Bayes

The Naïve Bayes algorithm, introduced by John, G.H. [28], is a probabilistic supervised machine learning method. It applies Bayes' theorem to predict the class [29]. The Naïve Bayes classifier ignores possible dependencies (correlations) among the inputs, i.e., all feature variables are assumed conditionally independent of each other given the class variable [30]. The final decision about a class is made after the class probability estimation

$$P(C = c \mid X = x) = \frac{P(C = c)\, P(X = x \mid C = c)}{P(X = x)}, \qquad (5)$$

where C is the random variable denoting the class of an instance and X is a vector of random variables denoting the observed attribute value vector.
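A corresponding hedged sketch with e1071's naiveBayes (data object names as in the SVM sketch above, again assumptions):

library(e1071)
# Naive Bayes applies Eq. (5) under the conditional-independence assumption
nb_fit  <- naiveBayes(Loan_status ~ ., data = train)
nb_pred <- predict(nb_fit, newdata = test)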

3.2.3 K-NN

This method was first utilised for credit scoring by Henley et al. [67] in 1997. It is a non-parametric classifier that uses a distance measure to make predictions without building a model. The training stage of K-NN consists of storing the feature vectors and class labels of the training samples. In the testing stage, the distances from the new vector to all stored vectors are computed and the K closest samples are selected; the class label is then assigned according to the class of the majority of the k nearest neighbours [60],[61].
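Because K-NN builds no model, a sketch only needs the stored training vectors; in R the class package provides this (train_x, test_x, train_y and k = 5 are illustrative assumptions, and the features should be standardized first, as described in Section 5.1):

library(class)
# Each test vector is labelled by the majority class of its k nearest neighbours
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)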

3.2.4 Random Forest

Random Forest, an ensemble learning algorithm developed by Breiman, L. [31], is used for classification and regression. A Random Forest is constructed by combining the results of various decision trees. Bagging is applied to the original dataset to select a sample; then x features are selected randomly from the pool of actual features, and a candidate split node is chosen from one of these x features. A fresh selection of attributes is made at each split, and splitting is continued until depth d, where a decision tree is completed [32]. A large number of such decision trees are then built, constituting a random forest. Each decision tree determines a class label prediction for the new instance (termed a vote); the votes cast by the various decision trees are then counted, and the class with the majority of votes "wins", yielding the prediction for the new instance.

In our implementation, 500 such decision trees are built and x = sqrt(n). Since ours is a classification model, each decision tree outputs either "Charged Off" or "Fully Paid" (defaulter or legitimate).
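A hedged sketch of this configuration with R's randomForest package (for classification, randomForest's default mtry is the square root of the feature count, matching the setting above; data object names are assumptions):

library(randomForest)
# 500 trees, as stated above; each tree votes "Charged Off" or "Fully Paid"
rf_fit  <- randomForest(Loan_status ~ ., data = train, ntree = 500)
rf_pred <- predict(rf_fit, newdata = test)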
4. Dataset Description

4.1 Lending Club [64]: The dataset consists of 42,538 loan records with 143 attributes, issued by Lending Club between the years 2007-2011. Some of the important features describing the loans are shown in Table 2. The dependent variable is loan status, which has two values, namely Charged Off and Fully Paid. Online peer-to-peer lending is a risky task, since borrowers and lenders are not known to each other. The work done in this paper can hence be used by investors to classify a new debtor as good or bad based on learning by the classification algorithms.
Table 2: Lending Club dataset fields

Att No | Field name | Explanation | Mean | Std. Deviation | Min. | Max.
A1 | Id | A unique LC assigned ID | 21268 | 12279.3 | 1 | 42538
A2 | Loan_amount | Amount of the loan applied/approved | 11090 | 7411.15 | 500 | 35000
A3 | Funded_Amount | Amount committed to loan | 10822 | 7147.12 | 500 | 35000
A4 | Funded_Amt_inv | Total amount committed by investors | 10140 | 7131.86 | 0 | 35000
A5 | Term | Number of payments on loan, either 36 or 60 (months) | 42.21 | 10.509 | 36 | 60
A6 | Int_rate | Interest rate | 12.16 | 3.707 | 5.42 | 24.59
A7 | Installment | The monthly payment owed by the borrower | 322.6 | 208.93 | 15.67 | 1305.19
A8 | Grade | LC assigned loan grade (categorical range A-G) | 2.671 | 1.4384 | 1 | 7
A9 | Sub_grade | LC assigned loan subgrade | 11.41 | 7.0714 | 1 | 35
A10 | Emp_Length | Employment length in years; 0 means less than one year, 10 means ten or more years | 4.913 | 3.4614 | 0 | 10
A11 | Home_ownership | The home ownership status provided by the borrower; values are RENT, OWN, MORTGAGE, OTHER | 3.134 | 1.9337 | 1 | 5
A12 | Annual_inc | The self-reported annual income provided by the borrower | 68861 | 551599.77 | 0 | 3900000
A13 | Verification_Status | Indicates if income was verified by LC or not verified | 1.876 | 0.8615 | 1 | 3
A14 | Purpose | A category provided by the borrower for the loan request | 4.904 | 3.4278 | 1 | 14
A15 | Addr_state | The state provided by the borrower | 22.83 | 4.7671 | 1 | 50
A16 | Dti | A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income | 13.37 | 6.7254 | 0 | 29.99
A17 | Delinq_2yrs | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years | 0.1524 | 0.5122 | 0 | 13
A18 | Inq_last_6months | The number of inquiries in the past 6 months | 1.081 | 1.5272 | 0 | 33
A19 | Mths_since_last_delinq | The number of months since the borrower's last delinquency | 12.85 | 1.6609 | 0 | 120
A20 | Open_acc | The number of open credit lines in the borrower's credit file | 9.338 | 4.5013 | 0 | 47
A21 | Pub_rec | Number of derogatory public records | 0.058 | 0.2456 | 0 | 5
A22 | Revol_bal | Total credit revolving balance | 14298 | 22018.8 | 0 | 1207359
A23 | Revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit | 49.01 | 28.4240 | 0 | 119
A24 | Total_acc | The total number of credit lines currently in the borrower's credit file | 22.11 | 11.6033 | 0 | 90
A25 | Total_payment | Payments received to date for total amount funded | 12020 | 9094.68 | 0 | 58886
A26 | Total_pamt_inv | Payments received to date for portion of total amount funded by investors | 11313 | 9038.65 | 0 | 58564
A27 | Total_rec_prncpl | Principal received to date | 9676 | 6106.009 | 0 | 35000
A28 | Total_rec_int | Interest received to date | 2240.1 | 2585.16 | 0 | 23886.5
A29 | Total_rec_late_fee | Late fees received to date | 1.517 | 7.8302 | 0 | 209
A30 | Last_pymnt_amt | Last total payment amount received | 2613.5 | 4385.28 | 0 | 36115.1
A31 | Acc_now_delinq | The number of accounts on which the borrower is now delinquent | 9.41e-05 | 0.00969 | 0 | 1
A32 | Delinq_amt | The past-due amount owed for the accounts on which the borrower is now delinquent | 0.143 | 29.3512 | 0 | 6053
A33 | Pub_rec_bankruptcies | Number of public record bankruptcies | 0.0437 | 0.2055 | 0 | 2
A34 | Tax_liens | Number of tax liens | 2.35e-05 | 0.00484 | 0 | 1
A35 | Debt_settlement_flag | Flags whether or not the borrower, who has charged off, is working with a debt-settlement company | 1.004 | 0.0608 | 1 | 2
A36 | Loan_status | Current status of the loan; possible values: Charged Off / Fully Paid | 1.849 | 0.3582 | 1 | 2

4.2 Kaggle's Bank Loan Status dataset [65]:

In this dataset also, the loan status attribute (the dependent attribute) contains two values: Charged Off / Fully Paid. After applying the classification algorithms to the dataset, any new loan applicant can be classified as good or bad, thus helping banks to make decisions regarding loan approval or rejection. Table 3 describes the features along with their statistical details.
Table 3: Kaggle's Bank Loan Status dataset fields

Att No | Field name | Explanation | Mean | Std. Deviation | Min. | Max.
B1 | Loan Id | Unique loan ID | 21001 | 12124.5 | 1 | 42000
B2 | Customer Id | Unique customer ID | 21001 | 12124.5 | 1 | 42000
B3 | Current loan Amount | Loan amount applied for by customer | 11853348 | 31189447 | 112 | 99999999
B4 | Term | Possible values: Short Term / Long Term | 1.721 | 0.4487 | 1 | 2
B5 | Credit score | A score given on the basis of credit history and information in the application file | 866.8 | 1386.5 | 0 | 7510
B6 | Annual Income | Annual income as reported by borrower | 1112375 | 1009341 | 0 | 36475440
B7 | In current job | No. of years spent by borrower in current job | 5.744 | 3.6241 | 0 | 10
B8 | Home Ownership | Possible values: RENT, OWN, ON MORTGAGE | 2.932 | 0.9544 | 1 | 4
B9 | Purpose | Possible values: BUSINESS LOAN, EDUCATIONAL, DEBT CONSOLIDATION, MEDICAL BILLS, BUY A HOME, etc. | 4.784 | 2.1975 | 1 | 16
B10 | Monthly debt | Amount of debt a borrower is paying monthly | 18467 | 12275.07 | 0 | 435843
B11 | Years of credit History | No. of years of credit | 18.15 | 6.9652 | 3.60 | 70.50
B12 | Months since last delinquent | The number of months since the borrower's last delinquency | 17.16 | 22.991 | 0 | 176
B13 | Number of open accounts | Total no. of accounts a borrower owns | 11.13 | 4.9858 | 0 | 56
B14 | Number of credit problems | Number indicating the problems encountered in repaying credit balance | 0.1669 | 0.4860 | 0 | 15
B15 | Current credit balance | Total balance against which the borrower is credited | 296526 | 387564.9 | 0 | 32878968
B16 | Max. open credit | Number of credit accounts opened | 802700 | 10990056 | 0 | 1539737892
B17 | Bankruptcies | Number of bankruptcies | 0.1163 | 0.34823 | 0 | 7
B18 | Tax liens | Number of tax liens | 0.02919 | 0.26675 | 0 | 15
B19 | Loan status | Charged Off / Fully Paid, indicating bad/good debtor | 1.774 | 0.4182 | 1 | 2
4.3 German Credit dataset [66]:

The original dataset contains 1000 entries with 13 categorical and 7 numerical attributes, prepared by Prof. Hofmann. Each entry represents a person who takes a credit from a bank and is classified as good or bad according to the set of attributes. For the purpose of inferring statistical details (as shown in Table 4), the categorical values were converted to numerical fields; for experimental purposes they are treated as categorical fields only.

Table 4: German Credit dataset fields
Att No | Field name | Explanation | Mean | Std. Deviation | Min. | Max.
C1 | Salary | Status of existing checking accounts | 12.577 | 1.257 | 11 | 14
C2 | Duration | Duration of credit taken (in months) | 20.903 | 12.058 | 4 | 72
C3 | Credit History | Categorical field describing whether credit taken in the past was paid on time or not | 32.545 | 1.083 | 30 | 34
C4 | Purpose | Possible values: car, furniture, education, domestic appliances, etc. | 47.148 | 40.0953 | 40 | 410
C5 | Credit amount | Amount of credit | 3271.258 | 2822.73 | 250 | 18424
C6 | Saving accounts/Bonds | Money kept in saving accounts | 62.105 | 1.58002 | 61 | 65
C7 | Employment since | Categorical value describing number of years of employment | 73.384 | 1.20830 | 71 | 75
C8 | Installment rate | Installment rate as a percentage of disposable income | 2.973 | 1.11871 | 1 | 4
C9 | Personal Status | Married/Unmarried/Divorced | 92.682 | 0.70808 | 91 | 94
C10 | Other debtors | Whether co-applicant/guarantor | 101.145 | 0.47770 | 101 | 103
C11 | Present residence since | Shows stability of debtor at the present residence | 2.845 | 1.1037 | 1 | 4
C12 | Property | Whether a person possesses a car/real estate/does not own any property | 122.358 | 1.0502 | 121 | 124
C13 | Age | - | 35.546 | 11.3754 | 19 | 75
C14 | Other installment plans | Possible values: bank/stores/none | 142.675 | 0.70560 | 141 | 143
C15 | Housing | Accommodation on rent/own | 151.929 | 0.53126 | 151 | 153
C16 | Number of existing credits | Other credits taken from this bank | 1.407 | 0.57765 | 1 | 4
C17 | Job | Whether unemployed/skilled employee/unskilled/highly qualified/officer, etc. | 172.904 | 0.65361 | 171 | 174
C18 | Maintenance | Number of people liable to provide maintenance for | 1.155 | 0.36208 | 1 | 2
C19 | Telephone | Yes/no | 191.404 | 0.49094 | 191 | 192
C20 | Foreign worker | Yes/no | 201.037 | 0.1888 | 201 | 202
C21 | Credit Status | Paid/defaulter | 1.3 | 0.4584 | 1 | 2

5. Experiments and Results

5.1 Workflow Diagram and Experiment: Fig. 1 shows the workflow diagram.

[Figure 1: Workflow Diagram]

The datasets have been refined in the following stages:

1) Data pre-processing: Initially, in the Lending Club dataset, the dimensionality was reduced manually. Redundant features, features having 80% to 100% missing values and features having the same value in all rows of the original dataset were removed. Some attributes that would allow the algorithms to learn from the future were also removed, as they lead to undesired results such as accuracy and precision close to 100%. After deleting attributes we are left with 36 features, which are shown in Table 2 along with their statistical details. The Kaggle and German Credit datasets need no such operation, as they contain no attributes with redundancy or useless information. The only pre-processing step taken in the Kaggle dataset is changing the values of Loan Id and Customer Id: the original values are alphabetical and 33 characters long, so for simplicity they are replaced by 1, 2, …, 42000.
2) Data standardization: As is clear from Table 2, Table 3 and Table 4, there is large disparity in the variability of the predictor variables. Many variables have a standard deviation of less than 1, while for many others it runs into lakhs. Before applying any technique, the variables need to be standardized, otherwise variables with a higher standard deviation dominate variables with a lower standard deviation. The variables are standardized using the scale() function in R, which centers and scales the columns of a numeric matrix. It performs z-score normalization by subtracting the mean and dividing by the standard deviation, i.e. $z_i = \frac{x_i - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
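For example, the standardization step could look as follows (the data frame credit and its column selection are assumptions for illustration):

# z-score normalization of the numeric predictors using scale()
num_cols         <- sapply(credit, is.numeric)
credit[num_cols] <- scale(credit[num_cols])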
After applying the pre-processing steps, the Bolasso, ReliefF, Chi-square and Gain Ratio algorithms are applied to the Lending Club, Kaggle Bank Loan Status and German Credit datasets. The Bolasso method shortlists 18 attributes (out of 35) for the Lending Club dataset, 6 attributes (out of 18) for the Kaggle dataset and 5 attributes (out of 20) for the German Credit dataset. The ReliefF, Chi-square and Gain Ratio algorithms return attribute importance scores, from which the topmost 18 (Lending Club), 6 (Kaggle) and 5 (German Credit) attributes are selected for experimental purposes. The size of each dataset is reduced to include only the features shortlisted by the respective algorithm. Table 5, Table 6 and Table 7 show the attributes selected by the algorithms. However, for the original baseline SVM, NB, K-NN and RF classifiers, the complete dataset is used for carrying out the experiments.
Table 5: Attributes selected by various algorithms for Lending Club dataset

Method/Attribute: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18
Chi-Square √ √ √ √ √ √ √ √ √ √

Gain Ratio √ √ √ √ √ √ √ √

ReliefF √ √ √ √ √ √ √ √ √ √ √
Jo

Bolasso √ √ √ √ √ √ √ √ √ √ √
Table 5 (continued)

Att/Method A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34 A35

Chi-Square √ √ √ √ √ √ √ √

Gain Ratio √ √ √ √ √ √ √ √ √ √


Bolasso √ √ √ √ √ √ √

Table 6: Attributes selected by various algorithms for Kaggle’s Bank Loan Status dataset

Method/Attribute: B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15 B16 B17 B18

Chi-Square √ √ √ √ √ √

Gain Ratio √ √ √ √ √ √

ReliefF √ √ √ √ √ √

Bolasso √ √ √ √ √ √
Table 7: Attributes selected by various algorithms for German credit dataset

Method/Attribute: C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20
Chi-Square √ √ √ √ √

Gain Ratio √ √ √ √ √

ReliefF √ √ √ √ √

Bolasso √ √ √ √ √
The dataset obtained after reducing the dimensionality is then partitioned into training and test sets in the ratio 70:30. For the Lending Club dataset, 29,774 rows are used for training and 12,756 rows for testing; the Kaggle dataset uses 29,465 rows for training and 12,535 rows for testing; in the German Credit dataset, 700 rows are used for training and 300 for testing. The various versions of the SVM, NB, K-NN and Random Forest algorithms are then applied to the training datasets, and the models are tested and validated on the test set for overall performance evaluation. The R software [68] is used for running the various functions. Algorithm 1 provides the operational steps of the Bolasso based feature selection algorithm followed by the Random Forest classifier.
Algorithm 1: BS-RF Classifier

Input: D, a set of d instances; m, the number of bootstrap iterations; N, the sample size
Output: J, the list of selected variables; F, the dataset with reduced features

Procedure:
1) Split D into (X, Y) // predictor matrix X, response matrix Y
2) For k = 1 to m:
 Generate a bootstrap sample by drawing N instances from (X, Y); call it (Xk, Yk)
 Compute the Lasso estimate wk from (Xk, Yk)
 Compute the support Jk = { j : wk(j) ≠ 0 } // variables whose Lasso estimate is non-zero
End for
3) Compute J = J1 ∩ J2 ∩ … ∩ Jm // attributes whose Lasso estimate is non-zero consistently over the m iterations
4) F = D(J) // reduce the dimensionality of the dataset by eliminating the variables not in J
5) (TR, TE) = split F into training and test sets in the ratio 70:30
6) Train the Random Forest classifier on TR
7) Make predictions for the test set TE
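A hedged end-to-end rendering of Algorithm 1 in R, reusing the bolasso_select() sketch from Section 3.1.4; it assumes a data frame credit whose predictors are all numeric after pre-processing, with response column Loan_status:

library(randomForest)
# Steps 1-3: Bolasso feature selection on the full predictor matrix
X <- as.matrix(credit[, setdiff(names(credit), "Loan_status")])
J <- bolasso_select(X, credit$Loan_status, m = 100)
# Step 4: reduce the dataset to the selected variables
F_red <- credit[, c(J, "Loan_status")]
# Step 5: 70:30 train/test split
idx <- sample(nrow(F_red), 0.7 * nrow(F_red))
TR  <- F_red[idx, ]
TE  <- F_red[-idx, ]
# Steps 6-7: train Random Forest on TR and predict on TE
rf   <- randomForest(Loan_status ~ ., data = TR, ntree = 500)
pred <- predict(rf, newdata = TE)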

5.2 Evaluation Measures:

5.2.1 Feature selection stability measure:

In order to compare the stability of Bolasso with the other feature selection algorithms, the Jaccard stability measure (JSM) is used. The Jaccard stability measure is an intersection based metric that finds the average similarity between different feature sets. Formally, it is calculated as

$$JSM = \frac{2}{Q(Q-1)} \sum_{q=1}^{Q-1} \sum_{q'=q+1}^{Q} \frac{|S_q \cap S_{q'}|}{|S_q \cup S_{q'}|}, \qquad (6)$$

where Q is the number of sub-samples of the training data, q = 1, …, Q, $S_q$ and $S_{q'}$ denote feature sets, and $|S_q \cap S_{q'}|$ denotes the number of common features. The value of JSM ranges from 0 to 1; a value near 1 is desirable, as it means the selected feature set does not change significantly and is hence more stable with respect to small variations in the dataset. JSM falls under the "stability by index" category, which considers the indices of the selected features without taking into consideration the weight or rank of the selected features.
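Eq. (6) can be computed directly from the feature sets selected on the Q sub-samples; the helper below is a hypothetical illustration, not the authors' code:

# Average pairwise Jaccard similarity over a list of selected-feature sets
jsm <- function(sets) {
  Q <- length(sets)
  s <- 0
  for (q in 1:(Q - 1)) {
    for (r in (q + 1):Q) {
      s <- s + length(intersect(sets[[q]], sets[[r]])) /
               length(union(sets[[q]], sets[[r]]))
    }
  }
  2 * s / (Q * (Q - 1))  # average over the Q(Q-1)/2 pairs
}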
5.2.2 Classification based measures:

The measures used for evaluating the models on the test data are accuracy and the area under the receiver operating characteristic curve (AUC).

 Accuracy: It measures the overall effectiveness of the model; the higher the accuracy, the better the performance. It is calculated as Accuracy = (TP + TN) / (TP + TN + FP + FN), where:
o TP (True Positive): the number of good debtors who are classified as good by the model.
o TN (True Negative): the number of bad debtors who are classified as bad by the model.
o FP (False Positive): the number of bad debtors who are misclassified as good by the model.
o FN (False Negative): the number of good debtors who are misclassified as bad by the model.
 AUC: It stands for "Area Under the ROC Curve". In the literature, it is considered a better test of classification than accuracy for determining which model predicts the class best.
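Both measures can be obtained from the test-set predictions of the BS-RF sketch above; pROC is one common R package for computing the AUC, and the object names are assumptions:

library(pROC)
# Accuracy from the confusion matrix: (TP + TN) / (TP + TN + FP + FN)
cm       <- table(predicted = pred, actual = TE$Loan_status)
accuracy <- sum(diag(cm)) / sum(cm)
# AUC from the forest's predicted class probabilities
probs <- predict(rf, newdata = TE, type = "prob")[, "Charged Off"]
auc(roc(TE$Loan_status, probs))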

5.3 Results and Discussion:

5.3.1 Analysis I: Stability of Bolasso

Let k be the number of features in a dataset. To quantify stability, a single dataset is resampled 10 times, randomly selecting 70% of the rows each time and taking all k attributes. Out of the total k features, m features are shortlisted by the feature selection algorithm; the most stable algorithm selects the same m features every time. Table 8 shows the values of k and m used for experimental purposes, and Table 9 shows the stability results for the various algorithms in terms of JSM.

Table 8: k and m values of the datasets

Dataset | k (total number of features) | m (top features selected by feature selection algorithm)
Lending Club | 35 | 18
Kaggle's Bank Loan Status dataset | 18 | 6
German Credit dataset | 20 | 5

Table 9: JSM values of the feature selection algorithms for the various datasets

Feature selection algorithm | Lending Club dataset | Kaggle's Bank Loan Status dataset | German Credit dataset
Chi-square | 0.9461 | 0.8539 | 0.6433
Gain Ratio | 0.8963 | 0.7650 | 0.6793
ReliefF | 0.5505 | 0.4342 | 0.2701
Bolasso | 0.9888 | 0.9682 | 1

The figures in Table 9 depict the stability behavior of the Bolasso algorithm in terms of JSM; Bolasso clearly outperforms the stability of the other feature selection algorithms.
5.3.2 Analysis II: Classification Accuracy

In order to compare the performance of Bootstrap-Lasso with the other feature selection methods, we apply the features obtained using each feature selection method to the different classifiers, namely SVM, NB, K-NN and RF. The classification performance of the various algorithms is shown in Table 10.

Table 10: Comparison of evaluation measures across the various models

Algorithm | LC Accuracy | LC AUC | Kaggle Accuracy | Kaggle AUC | German Accuracy | German AUC
Base classifiers:
SVM | 0.978 | 0.929 | 0.816 | 0.595 | 0.741 | 0.59
NB | 0.445 | 0.668 | 0.347 | 0.575 | 0.729 | 0.630
RF | 0.979 | 0.934 | 0.818 | 0.607 | 0.766 | 0.653
KNN | 0.859 | 0.544 | 0.813 | 0.595 | 0.745 | 0.633
Chi-square:
C-SVM | 0.964 | 0.883 | 0.816 | 0.595 | 0.747 | 0.613
C-NB | 0.783 | 0.805 | 0.372 | 0.582 | 0.744 | 0.655
C-RF | 0.979 | 0.934 | 0.816 | 0.598 | 0.68 | 0.573
C-KNN | 0.859 | 0.543 | 0.814 | 0.601 | 0.729 | 0.637
Gain Ratio:
G-SVM | 0.946 | 0.832 | 0.816 | 0.595 | 0.704 | 0.563
G-NB | 0.797 | 0.808 | 0.386 | 0.582 | 0.729 | 0.627
G-RF | 0.975 | 0.938 | 0.814 | 0.601 | 0.745 | 0.633
G-KNN | 0.866 | 0.603 | 0.815 | 0.604 | 0.738 | 0.639
ReliefF:
R-SVM | 0.979 | 0.931 | 0.773 | 0.50 | 0.701 | 0.5
R-NB | 0.818 | 0.607 | 0.338 | 0.572 | 0.698 | 0.601
R-RF | 0.978 | 0.929 | 0.762 | 0.501 | 0.68 | 0.576
R-KNN | 0.857 | 0.528 | 0.77 | 0.5 | 0.719 | 0.611
Bolasso:
BS-SVM | 0.979 | 0.934 | 0.816 | 0.602 | 0.76 | 0.642
BS-NB | 0.816 | 0.827 | 0.374 | 0.583 | 0.76 | 0.663
BS-KNN | 0.884 | 0.641 | 0.816 | 0.612 | 0.748 | 0.649
BS-RF | 0.988 | 0.964 | 0.823 | 0.731 | 0.84 | 0.713

As seen from Table 10, the following observations are made:

1) The AUC and accuracy of the Bolasso methods are on average higher than or equivalent to those of the filter based approaches. For example, when the AUC of SVM is compared with C-SVM, G-SVM, R-SVM and BS-SVM, the values for the LC dataset are (0.929, 0.883, 0.832, 0.931, 0.934), and the highest is for Bolasso.
 For the Lending Club dataset: The comparison of AUC is shown in Figs. 2(a)-2(d). The closer the line is to the top left corner, the better the performance compared with a line closer to the diagonal. Bolasso enabled SVM, NB, KNN and RF performed consistently better than the other versions of SVM, NB, KNN and RF.
 For Kaggle's Bank Loan Status dataset: The AUC of the Bolasso method in the case of SVM, NB and KNN does not perform well; their results are found to be equivalent to the other methods (baseline, Chi-square, Gain Ratio and ReliefF). But the Bolasso enabled Random Forest algorithm performs extraordinarily well and provides the best results. The AUC curves are shown in Figs. 3(a)-3(d).
 For the German Credit dataset: The AUC of Bolasso enabled SVM, NB, KNN and RF is superior to that of their other counterparts. Figs. 4(a)-4(d) show their performance graphically.
[Figures 2(a)-2(d), 3(a)-3(d) and 4(a)-4(d): AUC/ROC comparison curves for the Lending Club, Kaggle and German Credit datasets]

2) When Bolasso is applied to SVM, NB, KNN and RF (BS-SVM, BS-NB, BS-KNN, BS-RF), Random Forest outperforms the predictive capability of SVM, NB and KNN, as can also be seen from Figs. 5(a)-5(c).

[Figure 5(a): Comparison of Bolasso based classifiers for the LC dataset; Figure 5(b): Comparison of Bolasso based classifiers for the Kaggle dataset; Figure 5(c): Comparison of Bolasso based classifiers for the German Credit dataset]
re-
A graphical comparison of the accuracy of the various algorithms is shown in Figs. 6(a)-6(c). It is concluded that the Bolasso based Random Forest algorithm performs better than all the other algorithms.

[Figure 6(a): Accuracy comparison of the various algorithms (Lending Club dataset); Figure 6(b): Accuracy comparison of the various algorithms (Kaggle dataset)]
[Figure 6(c): Accuracy comparison of the various algorithms (German Credit dataset)]

5.3.3 Analysis III: Time Complexity

The experiments were carried out on an i3 processor with 2 GB RAM running 64-bit Windows 7. Table 11 gives the time complexities (in Big-O notation) of the feature selection algorithms, while the run times and time complexities of the various classification algorithms are shown in Table 12.

Table 11: Time complexity of the feature selection algorithms


Algorithm | Complexity of feature selection algorithm
Chi-square | O(nf)
Gain Ratio | O(nf)
ReliefF | O(n^2 f)
Bolasso | O(mf^3 + f^2 nm)

Table 12: Time complexity and run time of the classification algorithms

Algorithm | Complexity (training phase) | Kaggle run time (s) | German run time (s) | Lending Club run time (s)
SVM | O(n^2 f + n^3) | 927.68 | 0.13 | 305.82
C-SVM | O(n^2 f' + n^3) | 186 | 0.08 | 70.71
R-SVM | O(n^2 f' + n^3) | 203.93 | 0.08 | 64.38
G-SVM | O(n^2 f' + n^3) | 211.38 | 0.06 | 68.22
BS-SVM | O(n^2 f' + n^3) | 195.66 | 0.07 | 109.27
NB | O(nf) | 0.10 | 0.05 | 1.00
C-NB | O(nf') | 0.05 | 0.01 | 0.07
R-NB | O(nf') | 0.07 | 0.01 | 0.07
G-NB | O(nf') | 0.06 | 0.01 | 0.07
BS-NB | O(nf') | 0.05 | 0.01 | 0.19
K-NN | O(nfk), optimal value of k found in this study | 6500.80 | 3.07 | 8000.45
C-KNN | test phase = O(nf'k) | 4494.88 | 1.71 | 2166.27
R-KNN | test phase = O(nf'k) | 377.79 | 1.76 | 1820.53
G-KNN | test phase = O(nf'k) | 344.25 | 1.81 | 1755.25
BS-KNN | test phase = O(nf'k) | 417.10 | 1.88 | 976.18
RF | - | 243.34 | 0.92 | 330.05
C-RF | - | 120.45 | 0.83 | 175.45
R-RF | - | 124.69 | 0.86 | 84.42
G-RF | - | 159.27 | 0.14 | 113.14
BS-RF | - | 123.51 | 0.16 | 200.03

where:
n = number of rows in the training sample
f = total number of features in the training sample
f' = number of shortlisted features after applying a feature selection algorithm
m = number of bootstrapped replications (100 for Bolasso in our experiments)
p = number of trees (500 for Random Forest in our experiments)
k = the K-NN algorithm is tested with different k values

The following observations are drawn from the run times:

a) Run time is directly proportional to the size of the dataset. As the Kaggle and Lending Club datasets are highly dimensional, their run times are greater than that of the German dataset, which has 1000 rows.
b) The run time of the classification algorithms reduces significantly after applying the feature selection algorithms.
c) Naïve Bayes based algorithms have the minimum run time, while K-NN based algorithms are the most time consuming.

6. Conclusion and Future Work

Lending money is a risky affair, and considering the wide competition, a wise and quick decision is needed to assess credit risk, for which advanced methods are required. Feature selection is one of the pre-processing requirements in developing such methods, as it not only helps in reducing the dimensionality of the dataset but at the same time enhances accuracy and AUC.

In this paper, we first shortlisted the features by using different types of feature selectors, namely the Gain Ratio, Chi-square and ReliefF filters and Bolasso. The stability of Bolasso was compared with the stability of the other feature selection algorithms in terms of JSM. We then reduced the dimensionality of each dataset to include only the features selected by the respective feature selection algorithm, and used different types of classifiers: SVM, NB, K-NN and RF. To show the importance of feature selection, the results were also compared with baseline classifiers applied to the complete datasets. It was found that the classification performance of the Random Forest algorithm, when used with the Bolasso shortlisted features, provided the best decisions for lending money compared with the other methods.

In future, we would like to test the Bolasso shortlisted features on ensembles of various classifiers. Another interesting direction is to perform sentence and sentiment analysis on text to correctly judge the intentions of the debtor, thereby increasing the predictive capability of the classifier.
References:

[1] Oreski, S. & Oreski, G., Genetic algorithm-based heuristic for feature selection in credit risk assessment, Expert Systems with Applications (2013). http://dx.doi.org/10.1016/j.eswa.2013.09.004
[2] Dahiya, S., Handa, S.S., Singh, N.P., A feature selection enabled hybrid-bagging algorithm for credit risk evaluation, Expert Systems 2017;34:e12217. https://doi.org/10.1111/exsy.12217
[3] Wang, D., Zhang, Z., A hybrid system with filter approach and multiple population genetic algorithm for feature selection in credit scoring, Journal of Computational and Applied Mathematics. http://dx.doi.org/10.1016/j.cam.2017.04.036
[4] Chandrashekar, G., Sahin, F., A survey on feature selection methods, Computers and Electrical Engineering (2014), Elsevier. https://doi.org/10.1016/j.compeleceng.2013.11.024
[5] Dehuri, S., Ghosh, A., Revisiting evolutionary algorithms in feature selection and nonfuzzy/fuzzy rule based classification (2013), WIREs Data Mining and Knowledge Discovery, 3: 83-108, John Wiley & Sons. doi: 10.1002/widm.1087
[6] Cai, J., Luo, J., Wang, S., Yang, S., Feature selection in machine learning: a new perspective, Neurocomputing (2018). doi: 10.1016/j.neucom.2017.11.077
[7] Tibshirani, R., Regression shrinkage and selection via the lasso, J. Royal Statist. Soc. Ser. B, 58 (1996), 267-288.
[8] Lin, L., Shuang, W., Yifang, L., & Shouyang, W. (2014). A new idea of study on the influence factors of companies' debt costs in the big data era. Procedia Computer Science, 31, 532-541. doi: 10.1016/j.procs.2014.05.299
[9] Fang, K., Zhang, G., Zhang, H., Individual credit risk prediction model: application of the Lasso-Logistic model, The Journal of Quantitative and Technical Economics, 2014.
[10] Chen, Hongmei, Xiang, Yaoxin, The study of credit scoring model based on Group Lasso, Procedia Computer Science, 5th International Conference on Information Technology and Quantitative Management, ITQM 2017.
[11] Kamkar, I., Gupta, S.K., Phung, D., Venkatesh, S., "Stable feature selection for clinical prediction: exploiting ICD tree structure using Tree-Lasso", Journal of Biomedical Informatics (2015), Elsevier. doi: 10.1016/j.jbi.2014.11.013
[12] Zhang, Z. et al., High-order covariate interacted Lasso for feature selection, Pattern Recognition Letters (2016). http://dx.doi.org/10.1016/j.patrec.2016.08.005
[13] Bach, F., 2008, Bolasso: Model consistent Lasso estimation through the bootstrap, arXiv:0804.1302, INRIA, Willow Project-Team, Paris, France.
[14] Huang, X., Liu, X., Ren, Y., Enterprise credit risk evaluation based on neural network algorithm, Cognitive Systems Research (2018). https://doi.org/10.1016/j.cogsys.2018.07.023
[15] Liu, Y., Schumann, M., Data mining feature selection for credit scoring models, Journal of the Operational Research Society 56 (2005) 1099-1108. doi: 10.1057/palgrave.jors.2601976
[16] Jadhav, S., He, H., Jenkins, K., Information gain directed genetic algorithm wrapper feature selection for credit rating, Applied Soft Computing, Volume 69 (2018), Pages 541-553, ISSN 1568-4946. https://doi.org/10.1016/j.asoc.2018.04.033
[17] Zhang, X., Yang, Y., Zhou, Z., "A novel credit scoring model based on optimized random forest," 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV (2018), pp. 60-65. doi: 10.1109/CCWC.2018.8301707
[18] Liu, Xiao-Ying, Liang, Yong, et al., A hybrid genetic algorithm with wrapper-embedded approaches for feature selection, IEEE Access. doi: 10.1109/ACCESS.2018.2818682
[19] Liu, H. & Setiono, R. (1995). Chi2: feature selection and discretization of numeric attributes. In Proceedings of the International Conference on Tools with Artificial Intelligence (pp. 388-391), IEEE.
[20] Trabelsi, M., Meddouri, N., Maddouri, M., A new feature selection method for nominal classifier based on formal concept analysis, International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2017, 6-8 September 2017, Marseille, France, Procedia Computer Science. https://doi.org/10.1016/j.procs.2017.08.227
[21] McHugh, M., The chi-square test of independence, Biochemia Medica 2013;143-149.
[22] Dağ, H., Sayin, K.E., Yenidoğan, I., Albayrak, S., Acar, C., "Comparison of feature selection algorithms for medical data," 2012 International Symposium on Innovations in Intelligent Systems and Applications, Trabzon, 2012, pp. 1-5. doi: 10.1109/INISTA.2012.6247011
[23] Robnik-Šikonja, M. & Kononenko, I., Machine Learning (2003) 53:23. https://doi.org/10.1023/A:1025667309714
[24] Liu, Y., and M. Schumann, "Data mining feature selection for credit scoring models," The Journal of the Operational Research Society, vol. 56, no. 9, 2005, pp. 1099-1108. JSTOR, www.jstor.org/stable/4102203
[25] Cortes, C. & Vapnik, V., Machine Learning (1995) 20:273. https://doi.org/10.1023/A:1022627411411
[26] Jiang, Hao, Ching, Wai-Ki, Yiu, Ka Fai Cedric, Qiu, Yushan (2018). Stationary Mahalanobis kernel SVM for credit risk evaluation. Applied Soft Computing 71. doi: 10.1016/j.asoc.2018.07.005
[27] Carrizosa, E. & Morales, D.R. (2013), 'Supervised classification and mathematical optimization', Computers & Operations Research, vol. 40, no. 1, pp. 150-165.
[28] John, G.H., Langley, P., Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (1995), 338-345.
[29] Zareapoor, Masoumeh, Shamsolmoali, Pourya, Application of credit card fraud detection: based on bagging ensemble classifier, International Conference on Computer, Communication and Convergence (ICCC 2015), Procedia Computer Science. https://doi.org/10.1016/j.procs.2015.04.201, http://www.sciencedirect.com/science/article/pii/S1877050915007103
[30] Mase, S. (2008). Credit-rating of companies. In Bayesian Networks (eds O. Pourret, P. Naim and B. Marcot). doi: 10.1002/9780470994559.ch15
[31] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

[32] Malekipirbazari M, Aksakalli V, “Risk assessment in social lending via random forests”,Expert Systems
with Applications 42 (2015) 4621–4631,Sciencedirect
[33] https://www.capitaline.com
Journal Pre-proof

[34] https://www.federalreserve.gov/releases/chargeoff/delallsa.htm

[35] Yue Zhang, Weihong Guo, Soumya Ray ; “On the consistency of Feature Selection with Lasso for Non-
Linear Targets”, Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:183-191, 2016

[36] Khaire, U. M., & Dhanalakshmi, R. (2019). Stability of feature selection algorithm: A review. Journal of
King Saud University - Computer and Information Sciences. doi:10.1016/j.jksuci.2019.06.012

of
[37] P. Turney. Technical note: Bias and the quantification of stability. Machine Learning, 20:23–33, 1995

[38] P. Somol and J. Novovicova, "Evaluating Stability and Comparing Output of Feature Selectors that
Optimize Feature Subset Cardinality," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32,
no. 11, pp. 1921-1939, Nov. 2010.doi: 10.1109/TPAMI.2010.34

pro
[39] L.I. Kuncheva, “A Stability Index for Feature Selection,” Proc. 25th IASTED Int’l Multi-Conf. Artificial
Intelligence and Applications, pp. 421-427, 2007
[40] I. Kamkar, S. K. Gupta, D. Phung and S. Venkatesh, "Exploiting feature relationships towards stable feature
selection," 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, 2015, pp. 1-
10.doi: 10.1109/DSAA.2015.7344859
[41] He Z,Yu W ,”Stable feature selection for biomarker discovery”,Computaional Biology and Discovery,
re-
Elsevier, doi:https://doi.org/10.1016/j.compbiolchem.2010.07.002
[42] Kamkar, Iman & Gupta, Sunil & Phung, Dinh & Venkatesh, Svetha. (2015). Stable Feature Selection with
Support Vector Machines. 9457. 10.1007/978-3-319-26350-2_26.
[43]Li, Y., Si, J., Zhou, G., Huang, S., & Chen, S. (2015). FREL: A Stable Feature Selection Algorithm. IEEE
lP

Transactions on Neural Networks and Learning Systems, 26(7), 1388–1402.doi:10.1109/tnnls.2014.2341627

[44] Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P. and Saeys, Y. (2010), Robust biomarker identification
for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26(3), pp. 392–398.
[45] Han, Y., & Yu, L. (2010). A Variance Reduction Framework for Stable Feature Selection. (2010) IEEE
International Conference on Data Mining. doi:10.1109/icdm.2010.144
a

[46] Pandey, T. N., Jagadev, A. K., Mohapatra, S. K., & Dehuri, S. (2017). Credit risk analysis using machine
learning classifiers. 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing
(ICECDS).doi:10.1109/icecds.2017.8389769
urn

[47] Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art
classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1),
124–136.doi:10.1016/j.ejor.2015.05.030

[48] Wei-Yang Lin, Ya-Han Hu, & Chih-Fong Tsai. (2012). Machine Learning in Financial Crisis Prediction: A
Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 421–
436.doi:10.1109/tsmcc.2011.2170420
Jo

[49] Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability
estimates using machine learning. Expert Systems with Applications, 40(13), 5125–5131.
doi:10.1016/j.eswa.2013.03.019
[50] Shi, L., Liu, Y., & Ma, X. (2011). Credit Assessment with Random Forests. Emerging Research in Artificial
Intelligence and Computational Intelligence, 24–28.doi:10.1007/978-3-642-24282-3_4
[51] Pes, Barbara. (2019). Ensemble feature selection for high-dimensional data: a stability analysis across multiple
domains. Neural Computing and Applications. 10.1007/s00521-019-04082-3.
Journal Pre-proof

[52] Behr, A., & Weinblat, J. (2016). Default Patterns in Seven EU Countries: A Random Forest Approach.
International Journal of the Economics of Business, 24(2), 181–222.doi:10.1080/13571516.2016.1252532 .

[53] Ha Van Sang1 , Nguyen Ha Nam , Nguyen Duc Nhan. A Novel Credit Scoring Prediction Model based on
Feature Selection Approach and Parallel Random Forest. Indian Journal of Science and Technology, Vol 9(20), DOI:
10.17485/ijst/2016/v9i20/92299, May (2016)
[54] Bingamawa, Muhammad Tosan & Agus Santoso, Heru. (2016).” IMPLEMENTATION OF NAÏVE BAYES

of
ALGORITHM TO DETERMINE CUSTOMER CREDIT STATUS IN PT. MULTINDO AUTO FINANCE
SEMARANG.” , doi: 10.13140/RG.2.2.20330.52164.
[55] A. C. Antonakis and M. E. Sfakianakis(2009) ,”Assessing naïve Bayes as a method for screening
credit applicants”,Journal of Applied Statistics,Vol 5(36),537-545,Taylor & Francis, doi =

pro
10.1080/02664760802554263
[56] Yeh, I.-C., & Lien, C. (2009). The comparisons of data mining techniques for the predictive accuracy of
probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473–
2480doi:10.1016/j.eswa.2007.12.020

[57] Paulius Danenas, Gintautas Garsva, Saulius Gudas, “Credit Risk Evaluation Model Development Using
Support Vector Based Classifiers”,Procedia Computer Science,Volume 4,2011,Pages 1699-1707,ISSN 1877-0509,
re-
https://doi.org/10.1016/j.procs.2011.04.184.

[58] Danenas, P., & Garsva, G. (2015). Selection of Support Vector Machines based classifiers for credit risk
domain. Expert Systems with Applications, 42(6), 3194–3204.doi:10.1016/j.eswa.2014.12.001

[59] Sivasankar E., Selvi C., Mala C. (2017) A Study of Dimensionality Reduction Techniques with Machine
lP

Learning Methods for Credit Risk Prediction. In: Behera H., Mohapatra D. (eds) Computational Intelligence in Data
Mining. Advances in Intelligent Systems and Computing, vol 556. Springer, Singapore

[60] Baesens, B., Van Gestel, T., Viaene, S. et al. J Oper Res Soc (2003), “Benchmarking state of the art
classification algorithm for credit scoring” 54: 627. https://doi.org/10.1057/palgrave.jors.2601545

[61] Li, Feng-Chia. (2009). The hybrid credit scoring model based on KNN classifier. 330-334.Sixth
a

International Conference on Fuzzy Systems and Knowledge Discovery. IEEE Computer Society

[62] Hand, D. J., and Vinciotti, V., 2003, "Choosing k for two-class nearest neighbor classifiers with unbalanced
urn

classes." Pattern Recognition Letters 24(9-10), 1555-1562

[63] Islam, M. J., Wu, Q. M. J., Ahmadi, M., and Sid-Ahmed, M. A., 2007, "Investigating the Performance of
Naive- Bayes Classifiers and K- Nearest Neighbor Classifiers"International Conference on Convergence Information
Technology. IEEE Computer Society.

[64] https://www.lendingclub.com/info/download-data.action

[65] https://www.kaggle.com/zaurbegiev/my-dataset
Jo

[66] https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

[67] W. E. HENLEY and D. j. Hand, "Construction of a k-nearest-neighbour credit-scoring system†," in IMA


Journal of Management Mathematics, vol. 8, no. 4, pp. 305-321, Oct. 1997.doi: 10.1093/imaman/8.4.305

[68] R Core Team (2018). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Journal Pre-proof
Declaration of Interest

Dear Editor-in-Chief,
Applied Soft Computing

10-05-2019

We hereby declare that we have no affiliations with or involvement in any organization or entity with
any financial interest (such as honoraria, educational grants, membership, or employment) or
non-financial interest (such as personal or professional relationships, affiliations, knowledge, or
beliefs) in the subject matter or materials discussed in this manuscript.

Regards,
Nisha Arora
Dr Pankaj Deep Kaur