Loan Prediction Using Artificial Intelligence and Machine Learning
PROJECT ON
Submitted by:
B.E. in
This is to certify that the report entitled “Loan prediction using machine
learning and artificial intelligence”, being submitted by LAKSHAY SINGH,
ARCHIT SALRIWAL, SUMIT KUMAR CHAUDHARY & KUNAL to the
Division of Electronics and Communication Engineering, NSIT, for the award of
the bachelor’s degree in engineering, is the record of the bonafide work carried
out by them under our supervision and guidance. The results contained in the
report have not been submitted either in part or in full to any other university
or institute for the award of any degree or diploma.
SUPERVISOR
ACKNOWLEDGMENT
We offer our earnest thanks to our supervisor, Dr. Urvashi Bansal, for her
cooperation over the span of the project. She helped us in overcoming the challenges
We would like to express our sincere thanks to the Head of Department for allowing us
Project Thesis for the Partial fulfillment of the requirements for the award of the degree
of Bachelor of Technology.
We take this opportunity to thank all our professors who have helped us in our project.
Archit Salriwal (2018UEC2018)
Sumit Chaudhary (2018UEC2013)
Kunal (2018UEC2044)
ABSTRACT
In the banking system, banks have many products to sell, but the main source of
income of any bank is its credit line: banks earn from the interest on the loans
they extend. A bank's profit or loss depends to a large extent on loans, i.e. on
whether customers pay back their loans or default. By predicting loan
defaulters, a bank can reduce its Non-Performing Assets. This makes the study of
this phenomenon very important. Previous research in this area has shown that
there are many methods to study the problem of controlling loan default. But as
the right predictions are very important for the maximization of profits, it is
essential to study the nature of the different methods and to compare them. A
very important approach in predictive analytics is used to study the problem of
predicting loan defaulters: the logistic regression model. The data is collected
from Kaggle for study and prediction. Logistic regression models are fitted and
different measures of performance are computed. The models are compared on the
basis of performance measures such as sensitivity and specificity. The final
results show that the models produce different results. The model that includes
variables (personal attributes of the customer such as age, purpose, credit
history, credit amount, credit duration, etc.) other than checking account
information (which shows the wealth of a customer) is marginally better, because
these variables should be taken into account to calculate the probability of
default on a loan correctly. Therefore, by using a logistic regression approach,
the right customers to be targeted for granting loans can be easily detected by
evaluating their likelihood of default. The study concludes that a bank should
not only target rich customers for granting loans but should also assess the
other attributes of a customer, which play a very important part in credit
granting decisions and in predicting loan defaulters.
LIST OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
INDEX
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 OBJECTIVE
1.2 THE CLASSIFICATION PROBLEM
CHAPTER 3: DATASET
3.1 FEATURES
3.2 LABELS
CHAPTER 4: RESULT
4.1 MODELS OF TRAINING AND TESTING THE DATASET
CHAPTER 5: CONCLUSION
CHAPTER 1: INTRODUCTION
1. Loan Prediction – It is the process by which a machine learning algorithm predicts
whether a person will get a loan or not.
2. Understanding the problem statement is the first and foremost step. It gives an
intuition of what lies ahead. Let us see the problem statement.
3. Dream Housing Finance company deals in all home loans. It has a presence across all
urban, semi-urban and rural areas. A customer first applies for a home loan, after which
the company validates the customer's eligibility for the loan. The company wants to
automate the loan eligibility process (in real time) based on the customer details provided
while filling in the online application form. These details are Gender, Marital Status,
Education, Number of Dependents, Income, Loan Amount, Credit History and others. To
automate this process, the company has posed the problem of identifying the customer
segments that are eligible for a loan amount, so that it can specifically target these
customers.
5. Binary Classification: In this classification we have to predict either of two given
classes. For example: classifying the gender as male or female, predicting the result as win
or loss, etc. Multiclass Classification: Here we have to classify the data into three or more
classes. For example: classifying a movie's genre as comedy, action or romance, classifying
fruits as oranges, apples or pears, etc.
6. Loan prediction is a very common real-life problem that every retail bank faces at least
once in its lifetime. If done correctly, it can save a retail bank a lot of man-hours.
1. Data Collection
● The quantity & quality of our data dictate how accurate our model is
● The outcome of this step is generally a representation of data which we will use for
training
● Using pre-collected data, by way of datasets from Kaggle, UCI, etc., still fits into this
step.
2. Data Preparation
● Clean that which may require it (remove duplicates, correct errors, deal with missing
values, normalization, data type conversions, etc.)
● Randomize data, which erases the effects of the particular order in which we collected
and/or otherwise prepared our data.
3. Choose a Model
● Different algorithms are for different tasks; choose the right one
6. Parameter Tuning
● This step refers to hyper-parameter tuning, which is an "art form" as
opposed to a science
● Tune model parameters for improved performance.
● Simple model hyper-parameters may include: number of training steps,
learning rate, initialization values and distribution, etc.
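The preparation and tuning steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy table, its column names (income, credit_history, approved) and the grid of values for the regularization strength C are all hypothetical, and scikit-learn's GridSearchCV is just one possible way to tune hyper-parameters.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the collected dataset (step 1).
raw = pd.DataFrame({
    "income": [2500, 4000, np.nan, 6000, 3000, 4000,
               5200, 1800, 7500, 2500, 3900, 6100],
    "credit_history": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    "approved": [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})

# Step 2: clean (remove duplicates, fill missing values) and randomize the order.
clean = raw.drop_duplicates()
clean = clean.fillna({"income": clean["income"].median()})
clean = clean.sample(frac=1.0, random_state=0).reset_index(drop=True)

X = clean[["income", "credit_history"]]
y = clean["approved"]

# Step 3: choose a model; the scaler handles the normalization part of step 2.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step 6: try a few values of one hyper-parameter with 3-fold cross-validation.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print("Best hyper-parameters:", search.best_params_)
```

On a real dataset the grid would normally cover more hyper-parameters and more values; the small grid here only keeps the sketch short.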
We could not find a literature review on loan prediction that recommends specific
machine learning algorithms, which would have been a possible starting point for our
paper. Instead, since loan prediction is a classification problem, we went with popular
classification algorithms used for similar problems. Ashlesha Vaidya [2] used
logistic regression as a probabilistic and predictive approach to loan approval
prediction. The author pointed out that artificial neural networks and logistic
regression are the most used for loan prediction, as they are comparatively easier to
develop and provide the most accurate predictive analysis. One of the reasons behind
this is that other algorithms are generally bad at predicting from non-normalized data.
But nonlinear effects and power terms are easily handled by logistic regression, as
there is no need for the independent variables on which the prediction takes place to
be normally distributed.
Logistic regression still has its limitations: it requires a large sample of data for
parameter estimation. Logistic regression also requires that the input variables be
independent of each other; otherwise the model tends to overweigh the importance of
the mutually dependent variables.
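As a minimal sketch of the probabilistic view described above: logistic regression passes a weighted sum of the input features through the logistic (sigmoid) function to obtain a probability, and a power term is handled simply by supplying it as an extra feature. The function name, weights and feature values below are hypothetical and chosen only for illustration.

```python
import math


def approval_probability(features, weights, bias):
    """Return the logistic-regression probability for one applicant."""
    # Linear score: bias plus the weighted sum of the features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps the score to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical example: one raw feature and its square (a power term).
income = 2.0
features = [income, income ** 2]
weights = [1.0, -0.5]  # illustrative values, not fitted coefficients
p = approval_probability(features, weights, bias=0.0)
print(f"Predicted probability: {p:.2f}")
```

In practice the weights and bias are estimated from the training data rather than set by hand.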
Zaghdoudi, Djebali & Mezni [4] compared Linear Discriminant Analysis with Logistic
Regression for credit scoring and default risk prediction, to foresee the default
risk of small and medium enterprises. Linear Discriminant Analysis (LDA) is, like PCA,
a dimensionality reduction technique, but instead of looking for the directions of
most variation, LDA focuses on maximizing the separability among the known
categories. The subspace that separates the classes well is usually one in which
a linear classifier can be learned. The difference between the two methods in
classifying those enterprises correctly into their original groups was inconsequential,
with Logistic Regression having a 0.3% better accuracy score than LDA.
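A comparison of this kind can be sketched with scikit-learn as follows. The synthetic data from make_classification is an assumption standing in for the enterprise data of [4], so the scores it prints say nothing about the results reported there.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data (hypothetical stand-in for a credit scoring dataset).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)

# Compare the two classifiers with 5-fold cross-validated accuracy.
scores = {}
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")

# LDA used for dimensionality reduction: with two classes it projects the
# data onto a single axis chosen to separate the classes.
projected = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print("Projected shape:", projected.shape)
```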
Another novel approach, by T. Sunitha and colleagues [5], was to predict loan status
using Logistic Regression and a binary tree. A decision tree is an algorithm for
predictive machine learning models.
Classification and Regression Trees, referred to as CART for short, were introduced by
Leo Breiman. CART suits both predictive and decision modeling problems. In this
binary tree methodology, a greedy method is used to select the best split.
Although decision trees gave a similar accuracy, the benefit of decision trees in
this case was that they give equal importance to both accuracy and prediction.
This model was successful in making a lower number of false predictions, which
reduces the risk factor.
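The greedy split selection can be illustrated for a single numeric feature. The sketch below assumes Gini impurity as the splitting criterion (a common choice for CART classification trees) and binary 0/1 labels; the function names and the toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of positive labels
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Try every candidate threshold and keep the lowest weighted impurity."""
    best_threshold, best_impurity = None, float("inf")
    n = len(values)
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical toy data: small amounts are repaid (0), large ones default (1).
credit_amount = [1, 2, 3, 10, 11, 12]
defaulted = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(credit_amount, defaulted)
print(f"Best split: credit_amount <= {threshold} (weighted Gini = {impurity})")
```

A full tree repeats this search over all features at every node and recurses into the two resulting groups; the choice is greedy because each split is made locally without looking ahead.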
Rajiv Kumar and Vinod Jain [6] proposed a model using machine learning
algorithms to predict the loan approval of customers. They applied three machine
learning algorithms, Logistic Regression (LR), Decision Tree (DT), and Random
Forest (RF) using Python on a test data set. From the results, they concluded that the
Decision Tree machine learning algorithm performs better than Logistic Regression
and Random Forest machine learning approaches. Their work also opens up other areas in
which the Decision Tree algorithm is applicable.
Some machine learning models give a different weight to each factor, but in practice
loans can sometimes be sanctioned based on a single strong factor only. To address
this problem, J. Tejaswini and T. Mohana Kavya [7] built a loan prediction system that
automatically calculates the weight of each feature taking part in loan processing;
on new test data, the same features are processed according to their associated
weights. They implemented six machine learning classification models in R for
choosing the deserving loan applicants. The models include Decision Trees, Random
Forest, Support Vector Machine, Linear Models, Neural Network and Adaboost. The
authors concluded that the Decision Tree has the highest accuracy among all the
models and performs best on the loan prediction system.
Anchal Goyal and Ranpreet Kaur [9] discuss various ensemble algorithms. An ensemble
algorithm is a supervised machine learning algorithm that combines two or more
algorithms to get better predictive performance. They carried out a systematic
literature review to compare ensemble models with various stand-alone models such
as neural networks, SVM, regression, etc. After reviewing the literature, the authors
concluded that the ensemble model performs better than the stand-alone models, and
that combining algorithms also improves the accuracy of the model.
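One simple way to combine algorithms is majority voting. The sketch below uses scikit-learn's VotingClassifier on synthetic data; the choice of base models and the data are assumptions made for illustration, and the scores printed here do not reproduce any result from [9].

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (hypothetical stand-in for loan records).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=1)

# Three stand-alone models that will also serve as ensemble members.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: each member predicts a class and the majority wins.
ensemble = VotingClassifier(estimators=base_models, voting="hard")

# Cross-validated accuracy of each stand-alone model and of the ensemble.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in base_models}
scores["ensemble"] = cross_val_score(ensemble, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.3f}")
```

Whether the ensemble beats every stand-alone model depends on the data; voting tends to help most when the members make different kinds of errors.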
Data mining is also becoming popular in the banking sector, as it extracts
information from tremendous amounts of accumulated data. Aboobyda Jafar
Hamid and Tarig Mohammed Ahmed [10] focused on implementing data mining
techniques using three models, J48, BayesNet and NaiveBayes, for classifying loan
risk in the banking sector. The authors implemented and tested the models using the
Weka application. In their work, they compared these algorithms in terms of
accuracy in classifying the data correctly. The data was split so that 80% formed
the training dataset and 20% formed the testing dataset. After analyzing the
results, the authors concluded that the best algorithm among the three is J48, in
terms of high accuracy and low mean absolute error.
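An 80%/20% evaluation of this kind can be sketched in Python. Several assumptions apply: the data is synthetic, scikit-learn's DecisionTreeClassifier (a CART-style tree) stands in for Weka's J48, GaussianNB stands in for the Bayesian models, and the mean absolute error is computed between the predicted class probability and the true 0/1 label.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (hypothetical stand-in for bank loan records).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "Decision tree (CART)": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # Error between the predicted probability of class 1 and the true label.
    mae = mean_absolute_error(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = (accuracy, mae)
    print(f"{name}: accuracy = {accuracy:.3f}, MAE = {mae:.3f}")
```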
CHAPTER 3: DATASET
● Here we have two datasets: train_dataset.csv and test_dataset.csv.
● These are datasets of loan approval applications, with features such as
annual income, married or not, dependents or not, educated or not,
credit history present or not, loan amount, etc.
● The outcome of the dataset is represented by loan_status in the train
dataset.
● This column is absent in test_dataset.csv, as we need to assign the loan
status with the help of the training dataset.
● These two datasets are already uploaded on Google Colab.
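The relationship between the two files can be checked with a few lines of pandas. In this sketch, small in-memory CSV snippets stand in for train_dataset.csv and test_dataset.csv, and the column names and values are hypothetical; only the idea that the label column exists in the training file alone is taken from the description above.

```python
import io

import pandas as pd

# Hypothetical miniature versions of the two files.
train_csv = io.StringIO(
    "Gender,Married,ApplicantIncome,Credit_History,Loan_Status\n"
    "Male,Yes,4500,1,Y\n"
    "Female,No,3000,0,N\n"
)
test_csv = io.StringIO(
    "Gender,Married,ApplicantIncome,Credit_History\n"
    "Male,No,5200,1\n"
)

df_train = pd.read_csv(train_csv)
df_test = pd.read_csv(test_csv)

# Columns present only in the training data are the labels to be predicted.
label_columns = [c for c in df_train.columns if c not in df_test.columns]
print("Label column(s):", label_columns)

# Split the training data into features and label for model fitting.
feature_columns = list(df_test.columns)
X_train = df_train[feature_columns]
y_train = df_train[label_columns[0]]
```

With the real files, the two io.StringIO objects would be replaced by the file names passed to pd.read_csv.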
LABELS
● LOAN_STATUS – Based on the mentioned features, the machine
learning algorithm decides whether the person should be given a loan or not.
Visualizing data
Code and output
#Importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df_train = pd.read_csv('train_dataset.csv')
# take a look at the top 5 rows of the train set, notice the column "Loan_Status"
df_train.head()
# This code visualizes the people applying for loan who are categorized based on gender and marriage
# load the test set
df_test = pd.read_csv('test_dataset.csv')
# take a look at the top 5 rows of the test set, notice the absence of "Loan_Status" that we will predict
df_test.head()
# import classification_report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
0.8024081632653062
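The evaluation step can also be sketched on its own. The two label lists below are hypothetical and stand in for the real y_test and y_pred; the sketch shows how the classification report, the accuracy and the sensitivity and specificity mentioned in the abstract can be derived from them with scikit-learn.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Hypothetical true and predicted labels (1 = loan approved, 0 = rejected).
y_test = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]

# Per-class precision, recall and F1 score.
print(classification_report(y_test, y_pred))

# Overall fraction of correct predictions.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)  # share of actual positives predicted correctly
specificity = tn / (tn + fp)  # share of actual negatives predicted correctly
print(f"Sensitivity: {sensitivity:.3f}, Specificity: {specificity:.3f}")
```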
CHAPTER 5: CONCLUSION