Project Report Final
A Project Report
Submitted by
ABHISHEK: 20BCS3591
BACHELOR OF ENGINEERING
IN
INFORMATION SECURITY
CHANDIGARH UNIVERSITY
NOVEMBER 2022
BONAFIDE CERTIFICATE
First of all, we would like to thank our supervisor Mr. Nirmalya Basu who was
a constant source of inspiration. He encouraged us to think creatively and
motivated us to work on this project without giving it a second thought. He
expressed full support and provided us with the different teaching aids that were
required to complete this project. He believed in us even when we could not
believe that we could do it.
We are also thankful to every member of this group. It was each individual's
contribution that made this project a success. We were always there to lift each
other up, and that was what helped us stay together till the end. The group
members continuously researched the working of the project and other aspects
that will be helpful for its future scope. They worked day and night and
proposed this system so that society can avail itself of its benefits. Every
group member had a unique role in this project, and we cannot imagine its
success without the contribution of each one.
We thank our parents for always trusting in us and teaching us to believe in our
abilities and strengths and never give up until the goal is achieved. We are
thankful to all our friends who extended their moral support, and above all, we
are thankful to God for being with us and giving us the wisdom and ability to do
this project.
Thank You
Table of Contents
List of Figures
Abstract
1 Introduction
1.1 Project Overview
1.2 Problem Identification
1.3 Gantt Charts
2 Literature Survey
2.1 Existing System
2.2 Proposed System
3 Design Process
3.1 Design Overview
3.2 Hardware Specifications
3.3 Software Specifications
3.4 Methodology
4 Results, Analysis, and Observations
4.1 Results
4.2 Graphical Data Visualization and Representation
4.3 Major Observations
5 Conclusion and Future Work
5.1 Conclusion
5.2 References
5.3 User Manual to Run the Product
List of Figures
ABSTRACT
In today’s world, banks offer many services to people around the globe, but the most
frequently used service is none other than the loan system. Offering loans is a profitable
line of business for banks, and the demand for loans grows with every passing day. In the
process of loan sanctioning, a customer or potential borrower applies for a loan, and the
bank processes the request after screening and verifying the applicant’s details. As simple
as it might seem, the process of verifying the eligibility of an applicant is quite
complicated and time-consuming. Credit score, annual income, and many other factors come
into play when deciding loan eligibility.
The risk associated with the decision to approve a loan is immense. Approving a loan that
cannot be repaid can cause the bank to lose capital. No one wants to waste precious time
or lose capital, so we are developing a framework to address these issues. Our motive in
this project is to create a framework that automates Loan Eligibility Prediction for the
banking system by taking only the necessary information about the applicant. It can help
minimize losses for banks and reduce human error. Our framework predicts eligibility for
loan sanction with high accuracy by assessing the credibility of the individual using
Machine Learning.
1
INTRODUCTION
Banks nowadays provide their services not only in urban and suburban areas but also in
rural areas. The banking system is growing rapidly and deals with millions of people all
the time. Customers apply for a loan by filling out numerous forms and going back and forth
between employees to confirm their eligibility. The eligibility of a customer can be
determined from details such as the applicant's income, the loan amount they need, and other
similar information that they provide. The system requires a real-time process to improve
efficiency and reduce the latency caused by human error. That is why we want to provide a
system that automates the process by bringing every aspect of loan eligibility under one roof.
1.2 Problem Identification
The main profit-making business of practically all banks is the distribution of loans. The
main portion of a bank's assets comes directly from the profit earned on the loans it
distributes. In a banking environment, the main goal is to place one's assets in trustworthy
hands. Today, many banks and financial institutions grant loans through a lengthy
verification and validation process, but there is still no assurance that the chosen
applicant is the most deserving candidate among all applicants.
When authorizing loans, banks and other lenders must ensure that the borrower will return
the money with interest. Therefore, they need to know the credibility of the borrower before
lending money. To do this, credit agencies must thoroughly check the background and
credibility of the borrower. However, manually iterating over multiple variables and factors
for each borrower is time-consuming and highly inefficient. Banks offer different types of
loans worldwide, including mortgages, personal loans, and business loans. These institutions
operate in urban, semi-urban, and rural areas. After a customer applies for a loan, they
verify whether the customer is eligible for it.
1.3 Gantt Charts
Figure 1.3: Date-wise Work Distribution
2
LITERATURE SURVEY
A review of the literature shows that various machine learning models can be used to predict
the creditworthiness of applicants. Initially, we applied different learning algorithms to
the dataset to determine the best algorithm for exploring the bank loan dataset. Neural
Networks, K-Nearest Neighbors, Linear Regression, Decision Trees, Ensemble Methods, and
Logistic Regression are some of the algorithms used. In many experiments, it was found that,
with the exception of Nearest Centroid and Gaussian Naive Bayes, the algorithms performed
reliably in terms of accuracy and other performance metrics, each achieving an accuracy
between 76% and slightly over 80%.
The literature also discusses how the technology world is moving rapidly towards full
automation, and the role of artificial intelligence and machine learning in that shift. One
of the most important capabilities to consider in this transition is the machine's
decision-making ability. It has been stated that decision making can be achieved through
predictive and probabilistic approaches, which are developed using various machine learning
algorithms. In particular, logistic regression has been emphasized as a machine learning
model for this purpose. Taking loan eligibility as an example, logistic regression is used
to design a model that makes decisions based on multiple variables such as gender, income,
employment status, and dependents. The final score determines whether the borrower in
question is eligible for a loan.
2.2 Proposed System
The proposed system automates the process of determining the creditworthiness of the
applicant. A dataset containing data on loan applicants is collected, structured, and
analyzed using appropriate analysis techniques. The dataset is divided into two categories:
• The train data is used to train the model, i.e. our model learns from this file. It
contains all independent variables and the target variable.
• The test data contains all the independent variables but no target variable. We apply the
model to predict the target variable for the test data. A logistic regression model is used
to predict the binary result.
Python is the language used to implement this project. The output variable is binary, as the
problem specification requires. The desired output is obtained by feeding the data into a
logistic regression model with multiple independent variables.
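The train/test workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the rows are made up, the column names (ApplicantIncome, LoanAmount, Credit_History, Loan_Status) are assumptions, and feature scaling is added here only to help the solver converge.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up sample rows; column names are illustrative assumptions.
data = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 6000, 1500, 8000, 3000, 5200, 1800, 7000, 2200],
    "LoanAmount":      [120, 100, 150, 130, 200, 110, 140, 125, 160, 135],
    "Credit_History":  [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],
    "Loan_Status":     [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],  # target: 1 = eligible
})

X = data.drop(columns="Loan_Status")   # independent variables
y = data["Loan_Status"]                # target variable

# Hold out 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression on the training rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)                # binary result (0 or 1)
probabilities = model.predict_proba(X_test)[:, 1]  # estimated P(eligible)
print(predictions, probabilities)
```

The probabilities show that logistic regression produces a score first and a yes/no decision second; the default decision threshold is 0.5.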
In the proposed credit prediction model, the dataset is split into training and test data.
The training data is then used to fit a decision tree algorithm, which produces a predictive
model. To predict loans, the test dataset is given to the model. The purpose of this work is
to predict defaults on loan repayment. Libraries such as Pandas and NumPy were used. After
the dataset is loaded, the data is preprocessed, which includes handling missing numerical
and categorical values and validating the values. Numeric and categorical values are
separated. A frequency analysis is performed, and outliers are checked using boxplots of the
attributes.
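As a rough sketch of the preprocessing steps above, the snippet below separates numeric and categorical columns, fills missing values, and flags outliers with the 1.5 × IQR rule that a boxplot uses for its whiskers. The data and column names are made up, and the choice of median and mode as fill values is an assumption rather than something the report specifies.

```python
import pandas as pd

# Made-up rows with missing values and one extreme loan amount.
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male", "Male", "Female"],
    "LoanAmount": [120.0, None, 100.0, 130.0, 110.0, 900.0],
})

# Separate numeric and categorical columns.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Assumed imputation strategy: median for numeric, most frequent for categorical.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Boxplot-style outlier check: values beyond 1.5 * IQR from the quartiles.
q1 = df["LoanAmount"].quantile(0.25)
q3 = df["LoanAmount"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["LoanAmount"] < q1 - 1.5 * iqr) | (df["LoanAmount"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Whether flagged rows are dropped, capped, or transformed is a separate modelling decision.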
2.2.2 IMPLEMENTATION
DATA DESCRIPTION
The dataset consists of various attributes that are considered before sanctioning a loan to
the applicant. A Decision Tree (DT) is a supervised learning algorithm that is used to solve
both classification and regression problems. Here, DT uses a tree representation to solve
the prediction problem.
Step 1: Collect real data and create a training set.
Step 2: The training set is divided into subsets, each containing 4712 similar attribute values.
Step 3: Step 2 is repeated for each subset until every branch of the tree ends in a leaf node.
The decision tree algorithm is a machine learning technique that performs both
classification and regression tasks efficiently. Decision trees are widely used in the
banking industry because of their high accuracy and their ability to express statistical
models in simple language. In a decision tree, each node represents a feature (attribute),
each link (branch) represents a decision (rule), and each leaf represents a result (a
categorical or continuous value). Various data analysis tools can be used to produce credit
forecasts and assess their severity.
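To make the node, branch, and leaf structure concrete, here is a small sketch that fits a decision tree with scikit-learn and prints it as text rules. The six training rows and the two feature names (credit_history, income_k) are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [credit_history (0/1), applicant income in thousands]. Made-up values.
X = [[1, 5], [1, 3], [0, 6], [0, 2], [1, 8], [0, 4]]
y = [1, 1, 0, 0, 1, 0]  # 1 = loan approved, 0 = rejected

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each printed condition is a node/branch; each "class:" line is a leaf.
rules = export_text(tree, feature_names=["credit_history", "income_k"])
print(rules)

# Classify a new applicant with a credit history and an income of 4 (thousand).
prediction = tree.predict([[1, 4]])[0]
print(prediction)
```

In this toy data, credit history alone separates the two classes, so the tree needs only a single split.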
This process involves training models on the data using different algorithms and comparing
user data against the trained models to predict loan types. In prior work, several R
functions and packages were used to prepare the data and create the classification model,
showing that R is an efficient tool for visualization and for applying data mining
techniques to customer data. Based on this analysis, the bank can approve or reject the
loan. In practice, customer records can contain a lot of missing or imputed data that must
be replaced with valid data generated from the complete data available. The dataset has many
attributes that define the credibility of customers looking for different types of loans.
The values of these attributes may contain outliers that fall outside the normal data range.
3
DESIGN PROCESS
3.1 Design Overview
3.3 Software Specifications
The following software is recommended for reproducing and developing the program:
a) Anaconda Navigator (Python version 3.9.7 preferred)
b) Libraries:
csv
pandas
matplotlib
xgboost
sklearn
tkinter
pil
seaborn
3.4 Methodology
These are the most used concepts in our project:
3.4.2 LOGISTIC REGRESSION
This is a classification algorithm that uses the logistic function to predict a binary
outcome (true/false, 0/1, yes/no) given one or more independent variables. The goal of this
model is to find relationships between features and the probabilities of certain outcomes.
The model is linear in the logit, which is the logarithm of the odds in favor of an event.
The inverse of the logit, the logistic (sigmoid) function, maps any real-valued score to a
probability between 0 and 1, producing an S-shaped curve that resembles a smoothed step
function.
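The relationship between the logit and the logistic (sigmoid) function can be illustrated with a few lines of standard-library Python. The intercept and weights in the example are hypothetical values chosen for illustration, not coefficients from the trained model.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic function: maps any real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The two functions are inverses of each other.
print(sigmoid(logit(0.8)))  # approximately 0.8

# Hypothetical model: score = intercept + sum of weight * feature value.
intercept = -1.0
weights = {"credit_history": 2.5, "income_scaled": 0.4}
applicant = {"credit_history": 1, "income_scaled": 0.5}

score = intercept + sum(weights[name] * applicant[name] for name in weights)
probability = sigmoid(score)
print(score, probability)
```

A score of 0 corresponds to a probability of exactly 0.5, which is the usual cut-off between the two classes.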
3.4.5 XGBOOST
This algorithm works with quantitative variables, so categorical variables must be encoded
numerically before training. It is a gradient boosting algorithm that combines many weak
learners, typically shallow decision trees, into a single strong learner. It is fast and
efficient, and it has become widely used in machine learning because of its high performance
and speed.
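Because the boosting model expects numeric inputs, categorical columns need to be encoded first. One common option, shown here as an assumption about how this could be done rather than the project's actual preprocessing, is one-hot encoding with pandas. The column names and values are illustrative.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5000, 3000, 4000],
})

# One-hot encode the categorical columns into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["Gender", "Property_Area"], dtype=int)
print(encoded)
```

The resulting all-numeric frame could then be passed to a gradient boosting classifier such as xgboost.XGBClassifier.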
3.4.6 SKLEARN
This Python library is helpful for building machine learning and statistical models for
tasks such as clustering, classification, and regression. Though it can also be used for
reading, manipulating, and summarizing data, other libraries are better suited to those
functions.
Figure 3.2: Block diagram
Since predicting whether a loan application will be approved is a classification problem, the
model is trained using algorithms for classification including logistic regression, decision trees,
random forests, and support vector machines.
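A comparison of these classifiers can be sketched with scikit-learn's cross-validation helper. The snippet uses a synthetic dataset from make_classification as a stand-in for the loan data, so the printed scores say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the loan dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "support vector machine": SVC(),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation gives each model the same folds to train and test on, which makes the accuracy figures comparable.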
4
RESULTS ANALYSIS AND OBSERVATIONS
4.1 Results
The project works properly and meets all the goals we set before starting it. An accuracy
of 90.3909% was observed for our model when tested with different auto-populated data.
The AUC (Area Under the ROC Curve) score after training the model was 0.986202. AUC ranges
in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0, whereas
one whose predictions are 100% correct has an AUC of 1.0. An AUC of 0.986202 therefore means
that, on the training data, the model ranks a randomly chosen eligible applicant above a
randomly chosen ineligible one about 98.6% of the time. This measures ranking quality and is
distinct from the accuracy figure above.
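The difference between AUC and accuracy can be seen in a small pure-Python sketch. It computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are made up for illustration.

```python
def auc_score(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Share of examples classified correctly at the given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == label for p, label in zip(predicted, labels)) / len(labels)

labels = [1, 1, 0, 1, 0]            # made-up ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.2]  # made-up model scores

auc = auc_score(labels, scores)
acc = accuracy(labels, scores)
print(auc, acc)
```

On this toy data the two metrics differ (5 of 6 pairs ranked correctly versus 4 of 5 examples classified correctly), which is why an AUC value should not be read as a percentage of correct predictions.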
Figure 4.2: Gender VS Education
Figure 4.4: Self-Employed VS Education
Figure 4.7: Married VS Credit History
Figure 4.9: Credit History VS Property Area
Figure 4.11: Probability Plot of Applicant Income based on other quantities
Figure 4.12: Applicant Income, Co-Applicant Income, and Loan Amount Distribution
Figure 4.15: Distribution of Applicant Income and Loan Amount
Figure 4.17: Number of applicants based on Education, Self-Employment, and Property Area
Figure 4.18: Number of applicants based on Gender, Marital Status, and Dependents Count
4.3 Major Observations
1. Applicants who are male and married tend to have the highest applicant income, whereas
applicants who are female and married have the lowest.
2. Among male applicants, graduates have higher applicant income than non-graduates.
3. Likewise, applicants who are married and graduates have higher applicant income.
4. Applicants who are not self-employed have higher applicant income than those who are
self-employed.
5. Applicants with more dependents have the lowest applicant income, whereas applicants with
no dependents have the highest.
6. Applicants who own property in urban areas and have a credit history have the highest
applicant income.
7. Applicants who are graduates and have a credit history have higher applicant income.
9. The heat maps show that applicant income and loan amount are highly positively correlated.
11. Married applicants outnumber unmarried applicants.
13. Graduate applicants outnumber non-graduate applicants.
14. Most properties are located in semi-urban areas and the fewest in rural areas.
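Observations of this kind are typically derived from grouped averages and a correlation matrix. The sketch below shows how that could be done with pandas. The rows are made up for illustration and are not the project's dataset, and the column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Married": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "ApplicantIncome": [6000, 4000, 2500, 3500, 7000, 3000],
    "LoanAmount": [180, 120, 90, 110, 200, 100],
})

# Average applicant income for each gender / marital status combination.
mean_income = df.groupby(["Gender", "Married"])["ApplicantIncome"].mean()
print(mean_income)

# Pearson correlation between income and loan amount (the value a heat map cell shows).
correlation = df["ApplicantIncome"].corr(df["LoanAmount"])
print(correlation)
```

The grouped means support comparisons such as observation 1, and the correlation coefficient is the quantity behind observation 9.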
5
CONCLUSION AND FUTURE WORK
5.1 Conclusion
The project revealed that algorithms such as XGBoost, AdaBoost, random forest, and decision
tree perform creditably well in terms of accuracy and other performance evaluation metrics.
Loan organizations issue loans based on a comprehensive screening and validation process.
However, they cannot say for sure whether the applicant will be able to repay the loan
without any trouble. With the loan prediction system, they will be able to select the
worthiest applicants quickly, conveniently, and effectively, which could offer the bank a
distinct advantage.
In this project, we have examined how to create a Loan Approval Prediction System. The
analytical steps in developing this system include data gathering, exploratory data analysis,
data preprocessing, model construction, and model testing. We also carried out a detailed
analysis of earlier research papers on this topic. Among the most popular algorithms are
Logistic Regression, Decision Trees, and Random Forest.
High accuracy is achieved using random forest. From a careful study of the system's
advantages and limitations, we conclude that it can be an effective component of the loan
approval process. The application works properly and satisfies the requirements set forth by
the bank.
5.2 References
1. Mridul Bhandari, How to predict Loan Eligibility using Machine Learning Models, GitHub.
2. Kumar Arun, Garg Ishan, Kaur Sanmeet, Loan Approval Prediction based on Machine Learning
Approach, IOSR Journal of Computer Engineering (IOSR-JCE), May-Jun. 2016.
3. Wei Li, Shuai Ding, Yi Chen, and Shanlin Yang, Heterogeneous Ensemble for Default
Prediction of Peer-to-Peer Lending in China, Key Laboratory of Process Optimization and
Intelligent Decision-Making, Ministry of Education, Hefei University of Technology,
Hefei 2009, China.
Our project can either predict loan eligibility of a single individual or multiple individuals at
once.
5.3.1 Guide for Predicting Eligibility of Multiple Entries:
1) Click the “Multiple Entries” button, and you will be greeted with the following menu:
2) Click the “Browse File” button and browse for the file that contains the data. The
contents of the *.csv file should be in the following format; this is the file that will be
used to predict eligibility:
3) After selecting the file, you can see its full path along with the filename. Click the
“Predict” button to predict the loan eligibility of all the applicants.
4) The program now displays its predictions against the Loan ID(s) as output.
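Conceptually, the multiple-entries mode reads each row of the CSV file and reports a prediction per Loan ID. The sketch below illustrates that flow with the standard csv module. The column names (Loan_ID, Credit_History, ApplicantIncome), the helper predict_batch, and the rule inside stand_in_model are all hypothetical: the real tool applies its trained model and expects its own file format.

```python
import csv
import io

def predict_batch(csv_text, predict_fn):
    """Return {Loan_ID: prediction} for every data row in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Loan_ID"]: predict_fn(row) for row in reader}

# Hypothetical file contents; the real column layout may differ.
sample = (
    "Loan_ID,Credit_History,ApplicantIncome\n"
    "LP001,1,5000\n"
    "LP002,0,3000\n"
)

def stand_in_model(row):
    """Placeholder rule standing in for the trained model's prediction."""
    return "Eligible" if row["Credit_History"] == "1" else "Not eligible"

results = predict_batch(sample, stand_in_model)
for loan_id, outcome in results.items():
    print(loan_id, outcome)
```

Passing the predictor as a function keeps the file handling separate from the model, so the same loop works with any trained classifier.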
Figure 5.8: Multiple Results for Multiple Entries (2nd Part)
2) Fill in the following details as applicable and click the “Predict” button.
Gender
Married
Dependents
Education
Self Employed
Applicant Income
Co-applicant Income
Loan Amount
Loan Amount Term
Credit Card History
Property Area
3) Now we can see the prediction made by the program against the provided input.
Figure 5.13: Single Result for Single Input (2nd Entry)