Bank Loan Approval Prediction Using Data Science Technique (ML)
https://doi.org/10.22214/ijraset.2022.43665
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue V May 2022- Available at www.ijraset.com
Abstract: Banks earn a major part of their profits through loans, so loan approval is a very important process for banking
organizations. Predicting whether a customer will repay a loan is difficult: the rate of loan defaults is increasing, and banking
authorities find it harder to correctly assess loan requests and manage the risk of people defaulting on loans. In recent years,
many researchers have worked on loan approval prediction systems. Machine learning techniques are very useful for predicting
outcomes from large amounts of data. In this paper, four algorithms (Random Forest, Decision Tree, Naive Bayes, and Logistic
Regression) are used to predict the loan approval of customers. All four algorithms are applied to the same dataset, and the
algorithm with the maximum accuracy is selected to deploy the model. We thus develop a bank loan prediction system using
machine learning techniques, so that the system automatically selects the eligible candidates for loan approval.
Keywords: Loan approval, Loan Default, Random Forest algorithm, Decision Tree algorithm, Naive Bayes algorithm, Logistic
Regression algorithm, Loan prediction, Machine learning.
I. INTRODUCTION
Loans are the major source of income for the banking sector and also a major source of financial risk for banks. A large portion of
a bank's assets comes directly from the interest earned on the loans it gives. Lending carries great risks, including the inability of
the borrower to pay back the loan by the stipulated time; this is referred to as "credit risk". A candidate's worthiness for loan
approval or rejection has been based on a numerical score called the "credit score". The goal of this paper is therefore to discuss
the application of different machine learning approaches that accurately identify whom to lend to and help banks identify likely
loan defaulters, for much-reduced credit risk.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 5228
A. Disadvantages
1) They proposed a mathematical model; machine learning algorithms were not used.
2) The class imbalance problem was not addressed, and proper measures were not taken.
B. Proposed System
In our proposed system, we combine datasets from different sources to form a generalized dataset and apply four machine learning
algorithms (Random Forest, Logistic Regression, Decision Tree, and Naive Bayes) to the same dataset. The dataset is split into a
training set and a test set in the ratio 7:3. Each of the four algorithms is trained on the training set, and the algorithm with the
best test result is used for the final test-set prediction. After that, we deploy the model using the Flask framework.
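The comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn is available, uses a synthetic dataset from make_classification as a stand-in for the loan data, and assumes GaussianNB as the Naive Bayes variant, since the paper does not name one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan dataset (assumption: the real data is not shown).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 7:3 split into training and test sets, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The four algorithms named in the paper (GaussianNB is an assumed variant).
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training set and score it on the test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Pick the algorithm with the highest test accuracy.
best_name = max(accuracies, key=accuracies.get)
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
print("Best:", best_name)
```

The fixed random_state values only make the sketch repeatable; the ranking of the four algorithms on real loan data may differ.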
C. Advantages
1) Performance and accuracy of the algorithms can be calculated and compared.
2) Class imbalance can be dealt with using machine learning approaches.
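The paper does not say which technique is used for class imbalance. One common option, shown here purely as an assumption, is to weight each class inversely to its frequency so that the minority class counts more during training. The helper name balanced_class_weights and the 80/20 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: 80 approved (0) and 20 defaulted (1).
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)
print(weights)  # the minority class 1 receives the larger weight
```

Weights of this form can be passed to classifiers that accept per-class weights; resampling the training set is another possible approach.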
The dataset is obtained by gathering the required datasets and combining them into a generalised dataset. The resulting dataset is
pre-processed, i.e., cleaned, before data visualization. The four algorithms are then applied to the same pre-processed dataset, and
their performance is compared to find the best one. The best-performing algorithm is used to train the model, which is then tested
to check how accurately it predicts the output. Finally, we deploy that model to predict whether a bank loan can be approved for a
specific candidate.
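As an illustration of the cleaning step, the sketch below fills missing values with the column median (numeric) or the most frequent value (categorical). The paper does not state its imputation rule, so this choice is an assumption, and the column names income and married are hypothetical.

```python
from collections import Counter
from statistics import median

def impute_missing(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None replaced by the column median or mode."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]

# Hypothetical applicant records with gaps.
rows = [
    {"income": 4000, "married": "Yes"},
    {"income": None, "married": "Yes"},
    {"income": 6000, "married": None},
    {"income": 5000, "married": "No"},
]
clean = impute_missing(rows, ["income"], ["married"])
print(clean[1]["income"], clean[2]["married"])
```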
Use case diagrams are used for high-level requirement analysis of a system. When the requirements of a system are analysed, its
functionalities are captured in use cases. Use cases are thus the functionalities of the system written in an organized manner.
A. Class Diagram
A class diagram is a graphical representation of the static view of the system and represents different aspects of the
application. A collection of class diagrams represents the whole system.
B. Activity Diagram
Activity diagrams not only visualize the dynamic nature of a system; they are also used to construct an executable system through
forward and reverse engineering techniques. An activity diagram is sometimes considered a flow chart, but it is not.
C. Sequence Diagram
Sequence diagrams model the flow of logic within our system in a visual manner, enabling us both to document and to validate our
logic, and are commonly used for both analysis and design purposes. Sequence diagrams are the most popular UML artifact for
dynamic modelling, which focuses on identifying the behaviour within the system.
An entity relationship diagram (ERD) is a graphical representation of an information system that depicts the relationships among
people, objects, places, concepts or events within that system. An Entity relationship model is a data modelling technique that helps
define business processes and can be used as the foundation for a relational database.
A. Confusion Matrix
The confusion matrix is one of the performance metrics used to assess the correctness and accuracy of the model. It has the
following four parameters: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
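A minimal sketch of how the four confusion-matrix counts (TP, TN, FP, FN) used in the formulas of this section can be tallied from actual and predicted labels. The function name confusion_counts and the sample labels are hypothetical; label 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical labels for eight applicants.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```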
F. Accuracy
Accuracy is the most important performance metric: it is the ratio of correctly predicted observations to the total number of
observations. A higher accuracy indicates a better model, but only for symmetric datasets in which the numbers of false positives
and false negatives are almost the same.
Accuracy = (TP+TN) / (FP+FN+ TN+TP)
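The caveat about symmetric datasets can be seen in a small worked example. The counts below are hypothetical: 95 non-defaulters, 5 defaulters, and a model that predicts "no default" for everyone. Accuracy looks high even though no defaulter is found.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (FP + FN + TN + TP)."""
    return (tp + tn) / (fp + fn + tn + tp)

def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (fn + tp) if (fn + tp) else 0.0

# Hypothetical counts: every applicant is predicted as a non-defaulter.
tp, tn, fp, fn = 0, 95, 0, 5
acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
print(acc, rec)  # 0.95 0.0
```

This is why accuracy alone is not a reliable guide when the classes are imbalanced.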
G. Precision
Precision is the ratio of correctly predicted positive observations to the total number of predicted positive observations. A high
precision corresponds to a low false positive rate. We obtained a precision of 0.876, which is good.
Precision = TP / (FP+TP)
H. Recall
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class ("yes"), i.e., the
proportion of actual positives that are correctly predicted. Here, it is the proportion of actual defaulters that the model correctly
identifies.
Recall = TP / (FN +TP)
I. F1 Score
The F1 Score is the weighted average (harmonic mean) of Precision and Recall. It therefore takes both false positives and false
negatives into consideration. The F1 score is more informative than accuracy when the class distribution is uneven. If the numbers
of false positives and false negatives are very different, it is better to look at both Precision and Recall.
F1 Score = {(Precision * Recall ) * 2} / ( Precision + Recall )
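The three formulas above can be checked with a short worked example. The counts TP = 30, FP = 10, FN = 20 are hypothetical and are not the paper's results; the zero-denominator guards are an added convention.

```python
def precision(tp, fp):
    """Precision = TP / (FP + TP)."""
    return tp / (fp + tp) if (fp + tp) else 0.0

def recall(tp, fn):
    """Recall = TP / (FN + TP)."""
    return tp / (fn + tp) if (fn + tp) else 0.0

def f1_score(p, r):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical confusion-matrix counts.
tp, fp, fn = 30, 10, 20
p = precision(tp, fp)   # 30 / 40 = 0.75
r = recall(tp, fn)      # 30 / 50 = 0.60
f1 = f1_score(p, r)     # 0.9 / 1.35, about 0.667
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.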
X. LOGISTIC REGRESSION
Logistic regression is a supervised machine learning classification algorithm used for predicting the probability of a categorical
dependent variable. It is a statistical method for analysing a dataset in which one or more independent variables determine an
outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). The primary objective of
logistic regression is to find the best-fitting model to describe the relationship between the dependent variable and a set of
independent variables. In logistic regression, the dependent variable is binary, coded as 1 (yes, success, etc.) or 0 (no, failure,
etc.).
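To make the idea concrete, here is a minimal from-scratch sketch of logistic regression fitted by batch gradient descent on the log-loss. It is an illustration under assumptions, not the paper's implementation: the learning rate, epoch count, function names, and the single-feature toy data are all hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability that the outcome is 1 for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of log-loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: one feature, outcome coded as 0 (no) or 1 (yes).
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba([1.0], w, b), predict_proba([3.5], w, b))
```

A predicted probability above 0.5 is typically mapped to class 1 and below 0.5 to class 0, which matches the binary coding described above.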
XV. CONCLUSION
The analysis starts with data cleaning and the processing of missing values, continues with exploratory analysis, and ends with
model building and evaluation. The best model is the one that achieves the highest accuracy score on the public test set, considered
together with the other performance metrics. This work can help predict whether a bank loan should be approved for a candidate.