Loan Eligibility Prediction
Obtaining a loan is a time-consuming process. An application must pass through many stages, and
approval is still not guaranteed. To shorten the approval time and reduce the risk associated
with lending, many loan prediction models have been introduced. A prediction model uses
statistics, probability and data mining to forecast an outcome. Every model has some variables
that are likely to influence the outcome. A prediction model helps banks by minimizing the risk
associated with the loan approval process and helps applicants by reducing the time the process
takes.
Background
At present, a loan must be approved manually by a representative of the bank, who is responsible
for deciding whether the applicant is eligible for the loan and for calculating the risk
associated with it. Because this is done by a human, it is time-consuming and susceptible to
error. Banks earn most of their profit from the interest paid on loans, so a loan that is not
repaid counts as a loss to the bank. If banks lose too much money, the result is a banking
crisis, and banking crises affect the economy of the country. It is therefore very important
that loans be approved with the least possible error in risk calculation and in the least
possible time. A loan prediction model is required that can quickly predict whether a loan can be
approved with the least possible risk.
Problem Statement
The two most pressing questions in the banking sector are: 1) How risky is the borrower? 2) Should
we lend to the borrower given the risk? The answer to the first question dictates the borrower’s
interest rate. The interest rate, among other things (such as the time value of money), reflects
the riskiness of the borrower, i.e., the higher the interest rate, the riskier the borrower. We can
then decide whether the applicant is suitable for the loan based on the interest rate. Lenders
(investors) make loans to borrowers in return for the promise of repayment with interest. That is,
the lender only makes a return (interest) if the borrower repays the loan; if the borrower does
not repay, the lender loses money. Some borrowers default on their debts, unable to repay them for
a variety of reasons. The bank holds insurance to limit its loss in the case of a default. The
insured sum can cover the whole loan amount or just a portion of it. Banks use manual procedures
to assess a borrower and decide, based on the results, whether the borrower is suitable for a loan.
Manual procedures were mostly effective, but they were insufficient when there were a large
number of loan applications, and reaching a decision would then take a long time. A machine
learning loan prediction model can therefore be used to assess a customer’s loan status and to
build strategies. The model extracts the essential features of a borrower that influence the
customer’s loan status and then produces the predicted output (loan status). These results make a
bank manager’s job simpler and quicker.
Objective
To develop a simple loan eligibility prediction model that automatically decides whether a
person is eligible for a loan.
To compare different algorithms for building the model and calculate their accuracy.
Literature Review
Vaidya, Ashlesha [1] uses logistic regression as a machine learning tool and shows how predictive
approaches can be used in real-world loan approval problems. The paper uses a statistical model
(logistic regression) to predict whether a loan should be approved for a set of applicant
records. Logistic regression can also accommodate power terms and nonlinear effects. Some
limitations of this model are that it requires independent variables for estimation and a large
sample for parameter estimation.
A work by Amin, Rafik Khairul and Yuliant Sibaroni [2] used the C4.5 decision tree algorithm to
implement a predictive model. This algorithm creates a decision tree that generally gives high
accuracy in decision-making problems. A dataset of 1000 cases is used, of which 70% are approved
and the rest are rejected. The paper shows the performance of the C4.5 algorithm in recognizing
whether an applicant is able to repay his/her loan. From the tests conducted, the highest
precision value, 78.08%, was found using a data partition of 90:10. The best recall value, 96.4%,
was reached with a data partition of 80:20. The 80:20 partition is considered the best since it
has a high recall and the highest accuracy.
The work by Arora, Nisha and Pankaj Deep Kaur [3] aimed at forecasting whether an applicant will
be a loan defaulter. It uses Bolasso to select the most relevant attributes based on their
robustness; the selected attributes are then passed to classification algorithms such as Random
Forest, SVM, Naïve Bayes and K-Nearest Neighbours (KNN) to test how accurately they can predict
the results. It is concluded that the Bolasso-enabled Random Forest algorithm (BS-RF) provides
the best results in credit risk evaluation and that optimized feature selection improves
accuracy.
Methodology
Start
Data Collection
Analysis of Data
Data Cleaning
Model building using Decision Tree and Naïve Bayes Algorithm
End
Data Collection
We used the Loan Eligibility Prediction dataset from Kaggle, a large online data science
community that provides tools and resources. The dataset consists of 614 applicants with the
following attributes:
First, we loaded the dataset using pandas and preprocessed it. We then used 80% of the data to
train the model and verified its accuracy on the remaining 20%.
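The 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the frame is a synthetic stand-in with 614 rows, and the column names (ApplicantIncome, LoanAmount, Loan_Status) and the random seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle file: 614 rows, hypothetical column names.
# With the real file this step would be a pd.read_csv(...) call instead.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ApplicantIncome": rng.integers(1_000, 20_000, size=614),
    "LoanAmount": rng.integers(50, 500, size=614),
    "Loan_Status": rng.choice(["Y", "N"], size=614),
})

# Separate the features from the target column.
X = df.drop(columns="Loan_Status")
y = df["Loan_Status"]

# Hold out 20% of the rows for testing; the seed is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

With 614 rows and test_size=0.2, scikit-learn rounds the test share up, leaving 491 rows for training and 123 for testing.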
Preprocessing
There were some missing values in the dataset. Depending on the variable, we filled them with
either the mean or the mode of that variable. Missing values in applicant income were replaced
with the mean applicant income of the dataset. Similarly, missing values in gender, marital
status and number of dependents were replaced with the mode of the respective variable.
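A small sketch of this imputation scheme, using pandas, is shown below. The four-row frame and its column names are hypothetical and only serve to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with gaps; column names are assumptions.
df = pd.DataFrame({
    "ApplicantIncome": [5000.0, np.nan, 3000.0, 4000.0],
    "Gender": ["Male", "Male", None, "Female"],
    "Married": ["Yes", None, "Yes", "No"],
    "Dependents": ["0", "1", None, "0"],
})

# Numeric variable: fill gaps with the column mean (NaN is skipped when averaging).
df["ApplicantIncome"] = df["ApplicantIncome"].fillna(df["ApplicantIncome"].mean())

# Categorical variables: fill gaps with the most frequent value (the mode).
for col in ["Gender", "Married", "Dependents"]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

The mean suits a continuous variable such as income, while the mode is the natural choice for categories, where an average has no meaning.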
Missing values of variables
Next, the distribution of the variables was studied using box plots and histograms. Studying
the distribution of the data gave a general idea of the variables related to the applicants.
Histogram of applicant income
The data was then normalized so that outliers could be handled effectively. For this, the
logarithmic function was applied to the total of applicant income and co-applicant income, which
compresses very large values and reduces the skew of the distribution.
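The log transformation of total income might look like the sketch below. The income values and the column names (including TotalIncome and TotalIncome_log) are made up for illustration, with one deliberately large value standing in for an outlier.

```python
import numpy as np
import pandas as pd

# Illustrative incomes; the third row is an intentional outlier.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 81000, 2500],
    "CoapplicantIncome": [0, 1500, 0, 2000],
})

# Combine both incomes, then take the natural logarithm of the total.
# The total must be strictly positive for the logarithm to be defined.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["TotalIncome_log"] = np.log(df["TotalIncome"])

print(df[["TotalIncome", "TotalIncome_log"]])
```

On the raw scale the largest total is 18 times the smallest; after the transformation the two differ by less than 3 units, so the outlier no longer dominates the range.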
System Design
Decision Tree
This is a supervised machine learning algorithm mostly used for classification problems. All
features should be discretized in this model so that the population can be split into two or more
homogeneous sets or subsets. The model can use different algorithms (splitting criteria) to split
a node into two or more sub-nodes. As more sub-nodes are created, the homogeneity and purity of
the nodes increase with respect to the dependent variable.
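To make the idea of node purity concrete, the sketch below computes Gini impurity before and after a split. Gini impurity is one common splitting criterion; the text does not say which criterion was used, so its choice here is an assumption, and the labels are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 0.0 means the node is perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical loan-status labels at a parent node and its two sub-nodes.
parent = ["Y", "Y", "Y", "N", "N", "N"]
left, right = ["Y", "Y", "Y", "N"], ["N", "N"]

parent_impurity = gini(parent)
split_impurity = weighted_gini(left, right)
print(parent_impurity, split_impurity)
```

A tree-building algorithm evaluates candidate splits in this way and keeps the one that lowers the weighted impurity the most, which is why purity rises as sub-nodes are added.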
Naïve Bayes
Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes’
theorem with the “naïve” assumption of conditional independence between every pair of features
given the value of the class variable. Bayes’ theorem states the following relationship, given class
variable y and dependent feature vector x1 through xn:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$
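Under the naive independence assumption, the joint likelihood P(x_1, ..., x_n | y) that appears in the relationship above factorizes into one term per feature, which is what makes the classifier practical to estimate. The block below writes this out in the same symbols; the predicted class \hat{y} is notation introduced here for illustration.

```latex
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
\quad\Longrightarrow\quad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator P(x_1, ..., x_n) is the same for every class, so it can be dropped when only the most probable class is needed.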
IMPLEMENTATION AND TESTING
Implementation
A loan eligibility prediction model was developed which can effectively predict whether a person
is eligible for a loan or not. To develop a working system, implementation was done in a single
phase where we preprocessed the data and created a model to predict loan eligibility using
Decision Tree and Naïve Bayes Algorithms.
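A minimal sketch of training and comparing the two classifiers with scikit-learn is given below. It is not the authors' code: the data is synthetic, the feature names (log_income, credit_history) and the labeling rule are invented, and GaussianNB is assumed as the Naïve Bayes variant because the text does not specify one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two hypothetical features and a made-up eligibility rule.
rng = np.random.default_rng(0)
n = 200
log_income = rng.normal(8.5, 0.5, size=n)
credit_history = rng.integers(0, 2, size=n)
X = np.column_stack([log_income, credit_history])
y = ((credit_history == 1) & (log_income > 8.2)).astype(int)

# Same 80/20 partition as described for the real dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),  # assumed variant
}

# Fit each model on the training rows and score it on the held-out rows.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {accuracies[name]:.2%}")
```

Accuracies on this toy data say nothing about the real dataset; the point is only the shape of the comparison, in which both models see the same split.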
Tools Used
1. Python
2. NumPy
The NumPy library was used to work with multidimensional arrays, linear algebra and matrices.
3. Pandas
4. Matplotlib
5. Sklearn
Testing
Accuracy, precision, recall and F-measure were used to validate the performance of the models.
The overall test was intended to check the validity and reliability of the system. With the
decision tree, the accuracy achieved was 62.60%, precision was 0.83, recall was 0.611 and
F-measure was 0.71. With Naïve Bayes, the accuracy was 83.74%, precision was 0.83, recall was
0.97 and F-measure was 0.90.
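For reference, the sketch below shows how precision, recall and F-measure follow from confusion-matrix counts, and recomputes the F-measure from the precision and recall values reported above. The function names and the example counts (tp, fp, fn) are hypothetical; the report does not give the underlying confusion matrices.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Derive the three metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts, only to show the calculation.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)

# F-measure recomputed from the reported (already rounded) precision and recall.
dt_f = f_measure(0.83, 0.611)   # decision tree
nb_f = f_measure(0.83, 0.97)    # Naive Bayes
print(round(dt_f, 3), round(nb_f, 3))
```

The recomputed values come out at roughly 0.70 and 0.89, close to the reported 0.71 and 0.90; differences of this size are plausible when the inputs have already been rounded.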
Conclusion
Recommendations
The model can be further enhanced by retraining it on more datasets and by adding more features,
such as reporting why a loan application was rejected. Adding such features is likely to make the
model more useful.