Feature Selection For Loan Repayment Prediction System Using Machine Learning
https://doi.org/10.22214/ijraset.2023.49748
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue III Mar 2023- Available at www.ijraset.com
Abstract: Banks need to evaluate and predict the repayment ability of borrowers in order to minimise the risk of loan
default. For this reason, banks build systems that process loan requests based on the borrower’s status, such as
employment status and credit history. This paper attempts to determine the most significant factors/features that help
in predicting whether a loan applicant will be able to repay their loan. Feature selection provides an effective way to solve this
problem by removing irrelevant and redundant data, which can reduce computation time, improve learning accuracy, and
facilitate a better understanding of the learning model or data. In order to properly assess the repayment ability of all groups of
people, several frequently used evaluation measures for feature selection are applied, and different feature sets are generated
using different feature selection methods. These sets are then tested against different machine learning models to
find the most effective feature set for assessing an applicant’s repayment ability. The
data used in this study was gathered from a Kaggle dataset containing the details of more than 300,000 borrowers and whether
they were able to repay their loans. After data cleaning and feature engineering, the dataset still appeared quite
imbalanced, so precision, recall, and F1 score were considered along with accuracy. The results of
the study indicate that days employed, number of family members, number of children, and income were among the
most significant factors for determining a borrower’s performance.
I. INTRODUCTION
Many people struggle to get loans from trustworthy sources such as banks because of insufficient credit histories. In 2015, the federal
Consumer Financial Protection Bureau (CFPB) reported that one in every 10 American adults is “credit invisible,” meaning they
don’t have a credit history with one of the three major credit bureaus. Students and unemployed adults, who often lack
sufficient credit history, commonly fall into this category; according to data from the credit rating company VantageScore,
“20 percent of people ages 18 to 22 have no credit report”. Apart from evaluating borrowers based on their credit score,
there are other ways to measure or predict their ability to repay. For example, employment is generally a major factor in a
person’s repayment ability, since an employed adult has a more stable income and cash flow. Factors such as marital status, property
owned, and number of dependents might also affect repayment ability.
In this project, feature selection methods are used to choose the set of factors that play a crucial role in determining the repayment
ability of an applicant. Feature selection refers to the process of obtaining a subset from an original feature set according to a certain
criterion. This helps compress the data by removing redundant and irrelevant features [2][3]. This pre-processing
reduces the amount of data used for training and testing, which in turn reduces the time the model takes to learn and simplifies the results
[1][3]. The dataset, ‘Loan Application Prediction Analysis’ from Kaggle.com, was used in this project. This open dataset contains
records for 300,000+ anonymous clients with 122 unique features [1][4]. For such a large dataset, feature selection is highly recommended
in order to reduce the training time and simplify the results. Studying the correlation between these features and the repayment ability of
the clients would help lenders evaluate borrowers along the most significant dimensions and would also help borrowers, especially
those who do not have sufficient credit histories, to find a credible lender.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2394
Out of all the models, K-clustering achieved the highest accuracy with 71.57% [1]. ‘Credit Risk Analysis and Prediction Modelling
of Bank Loans Using R’ by Sudhamathy G. focused on preprocessing and used clustering and classification techniques in R to
prepare the data for further use [8]. A decision tree classifier was then built on the preprocessed dataset and achieved a precision
of 0.833 [8]. The ‘Loan Prediction by using Machine Learning Models’ paper also emphasizes pre-processing, applying
outlier detection and removal, as well as imputation, in the pre-processing stage [1][6]. To predict the
status of a loan approval, SVM, DT, KNN, and gradient boosting models were used [6]. According to the
results, the Decision Tree had significantly higher loan prediction accuracy than the other models, yielding
an accuracy of 81.1% [6].
Another paper takes a different approach and uses Exploratory Data Analysis (EDA) as a method for predicting loan amounts based
on the nature of the client and their needs [7]. Annual income versus loan purpose, customer trust, loan tenure versus delinquent
months, loan tenure versus credit category, loan tenure versus the number of years in current job,
and chances of loan repayment versus homeownership were the major factors examined during the data analysis [7]. The
purpose of this study was to infer the constraints that the customer faces when applying for a loan, as well as to make a prediction
about repayment [1][7][8]. It also revealed that borrowers are more interested in short-term loans than long-term loans [7].
IV. DATASET
A. Introduction
The dataset was obtained from Kaggle. Initially it had information about 300,000+ people to whom loans were granted and whether
they were able to repay their loans. It contained 122 columns/dimensions, including the TARGET variable, which indicates the
loan repayment status. The number of people who were able to repay their loans is far higher than the number of loan
repayment defaulters. Fig 1 shows the frequency of loan defaulters and loan payers.
V. METHODS
A. Feature Selection Techniques
1) Manual Feature Selection Through Correlation: A good feature subset is one that contains features highly correlated with
(predictive of) the class, yet uncorrelated with (not predictive of) each other. A feature is said to be redundant if one or more of
the other features are highly correlated with it. Correlation was computed, and the features that were highly
correlated with the ‘TARGET’ field were selected.
2) Mutual Information Feature Selection: Mutual information, from the field of information theory, is the application of
information gain to feature selection. It is calculated between two variables and measures the reduction in uncertainty for one
variable given a known value of the other variable. This feature selection was implemented between the independent variables
(the features) and the class.
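As an illustration of the two scoring criteria described above, the following is a minimal sketch in plain Python, not the implementation used in this study. The toy data and the names `pearson`, `mutual_information`, `rank_features`, `employed`, `children`, and `noise` are hypothetical and chosen only for the example. The mutual information shown here is the discrete (count-based) form; library estimators such as scikit-learn's `mutual_info_classif` handle continuous features differently.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences, from counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def rank_features(columns, target, score):
    """Return feature names sorted by descending score against the target."""
    scores = {name: score(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: 8 applicants, TARGET = 1 means default.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "employed": [1, 1, 1, 1, 0, 0, 0, 0],  # separates the two classes exactly
    "children": [0, 1, 0, 2, 1, 2, 2, 3],  # partially related to the target
    "noise":    [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the target
}

# Rank by absolute correlation with the target, then by mutual information.
by_corr = rank_features(columns, target, lambda x, y: abs(pearson(x, y)))
by_mi = rank_features(columns, target, mutual_information)
print("by correlation:", by_corr)
print("by mutual information:", by_mi)
```

The same `pearson` function could also be applied between pairs of features to flag redundant ones, in line with the definition of redundancy given in item 1.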
Value Frequency
0 (repayers) 282686
1 (defaulters) 24825
Fig 1. Frequency of 0 and 1 in ‘TARGET’ field
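To make the imbalance in Fig 1 concrete, the short sketch below works through the arithmetic for a hypothetical baseline that predicts "repays" for every applicant. The baseline is an assumption introduced only for illustration and is not one of the models evaluated in this study; the class counts are those reported in Fig 1.

```python
# Class counts as reported in Fig 1.
repayers = 282686    # TARGET = 0
defaulters = 24825   # TARGET = 1
total = repayers + defaulters

defaulter_share = defaulters / total

# Hypothetical baseline: predict 0 (repays) for every applicant.
# Treating "defaulter" as the positive class:
tp = 0            # defaulters correctly flagged
fn = defaulters   # defaulters missed
tn = repayers     # repayers correctly passed
fp = 0            # repayers wrongly flagged

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"defaulter share:   {defaulter_share:.2%}")
print(f"baseline accuracy: {accuracy:.2%}")
print(f"baseline recall:   {recall:.2f}, F1: {f1:.2f}")
```

Defaulters make up roughly 8% of the records, so this trivial baseline reaches about 92% accuracy while identifying no defaulters at all (recall and F1 of zero). This is why accuracy alone is a weak measure on this dataset and why precision, recall, and F1 score are reported alongside it.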
REFERENCES
[1] Yiyun Liang, Xiaomeng Jin, Zihan Wang: “Loanliness: Predicting Loan Repayment Ability by Using Machine Learning Methods” (2019)
[2] Ritika Purswani, Sakshi Verma, Yash Jaiswal, Prof. Surekha M: “Loan Approval Prediction using Machine Learning” (June 2021)
[3] Jie Cai, Jiawei Luo, Shulin Wang, Sheng Yang: “Feature selection in machine learning: a new perspective” (2018)
[4] Kaggle Dataset: Link
[5] Aboobyda Jafar Hamid, Tarig Mohammed Ahmed: “Developing Prediction Model of Loan Risk in Banks using Data Mining”
[6] Pidikiti Supriya, Myneedi Pavani, Nagarapu Saisushma: “Loan Prediction by using Machine Learning Models” (April 2019)
[7] X. Francis Jency, V.P. Sumathi, Janani Shiva Sri: “An Exploratory Data Analysis for Loan Prediction Based on Nature of the Clients”
[8] Sudhamathy G.: “Credit Risk Analysis and Prediction Modelling of Bank Loans Using R”, (IJET) (Oct-Nov 2016)
[9] Aphale & Shide: “Predict Loan Approval in Banking System Machine Learning Approach For Cooperative Banks Loan Approval”, International Research Journal of Engineering and technology (2020)
[10] Chandra & Rekha: “Exploring the Machine Algorithm for Prediction the Loan Sanctioning” (2019)
[11] Khan et al.: “Loan Approval Prediction Model: A Comparative Analysis. Advances and Applications” (2021)
[12] Nikhil Madane, Siddharth Nanda: “Loan Prediction using Decision tree”, Journal of the Gujrat Research History (December 2019)
[13] Shrishti Srivastava, Ayush Garg, Arpit Sehgal, Ashok kumar: “Analysis and comparison of Loan Sanction Prediction Model using Python”, International journal of computer science engineering and information technology research (IJCSEITR) (June 2018)
[14] Anchal Goyal, Ranpreet Kaur: “A survey on ensemble model of Loan Prediction”, International journal of engineering trends and application (IJETA) (Feb 2016)
[15] Li Y: “Credit risk prediction based on machine learning methods”, The 14th Int. Conf. on Computer Science & Education (ICCSE), pp 1011–3 (2019)
[16] Ahmed M S I, Rajaleximi P R: “An empirical study on credit scoring and credit scorecard for financial institutions”, Int. Journal of Advanced Research in Computer Engineering & Technol. (IJARCET) (2019)
[17] Shoumo S Z H, Dhruba M I M, Hossain S, Ghani N H, Arif H, Islam S: “Application of machine learning in credit risk assessment: a prelude to smart banking”, TENCON 2019 – 2019 IEEE Region 10 Conf. (2019)
[18] Alshouiliy K, Alghamdi A, Agrawal D P: “AzureML based analysis and prediction loan borrowers creditworthy”, The 3rd Int. Conf. on Information and Computer Technologies (ICICT) (2020)
[19] Li M, Mickel A, Taylor S: “‘Should this loan be approved or denied?’: a large dataset with class assignment guidelines”, Journal of Statistics Education (2018)
[20] Vaidya A: “Predictive and probabilistic approach using logistic regression: application to prediction of loan approval”, The 8th Int. Conf. on Computing, Communication and Networking Technologies (ICCCNT) (2017)
[21] S. Vimala, K.C. Sharmili: “Prediction of Loan Risk using NB and Support Vector Machine”, International Conference on Advancements in Computing Technologies (ICACT 2018)
[22] A. Goyal, R. Kaur: “Accuracy Prediction for Loan Risk Using Machine Learning Models”