Project Report Final

The document presents a project report on a Loan Eligibility Prediction System developed by students at Chandigarh University. It outlines the project's aim to automate the loan eligibility verification process using machine learning techniques, specifically logistic regression and decision trees, to enhance efficiency and accuracy in banking. The report includes sections on the introduction, literature survey, design process, results, and future work, detailing the methodology and implementation of the system.

Uploaded by

Soham Mukherjee

LOAN ELIGIBILITY PREDICTION SYSTEM

A Project Report

Submitted by

SOHAM MUKHERJEE: 20BCS3593

ABHISHEK: 20BCS3591

AANANAYA MATHUR: 20BCS3627

ASMITA BHARTI: 20BCS3584

In partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

INFORMATION SECURITY

CHANDIGARH UNIVERSITY
NOVEMBER 2022
BONAFIDE CERTIFICATE

Certified that this project report of LOAN ELIGIBILITY PREDICTION is the
bonafide work of Soham Mukherjee, Abhishek, Asmita Bharti, and
Aananaya Mathur who carried out the project work under my/our supervision.

Signature of the HoD                         Signature of Supervisor
(Mr. Aman Kaushik)                           (Mr. Nirmalya Basu)
HoD of CSE – AIT                             Project Supervisor

Submitted for the project viva-voce examination held on 19/11/2022

Signature of Internal Examiner               Signature of External Examiner


ACKNOWLEDGEMENT

First of all, we would like to thank our supervisor Mr. Nirmalya Basu who was
a constant source of inspiration. He encouraged us to think creatively and
motivated us to work on this project without giving it a second thought. He
expressed full support and provided us with the different teaching aids that were
required to complete this project. He believed in us even when we could not
believe that we could do it.

We are also thankful to every member of this group. It was each and every
individual’s contribution that made this assignment a success. We were always
there to lift each other up, and that was what helped us stay together till the end.
The group members continuously researched the working of the project and other
aspects that will be helpful for its future scope. They worked day and night and
proposed this system so that society can avail its benefits. Every group member
had a unique role in this project, and we cannot imagine its success without the
contribution of each of them.

We thank our parents for always trusting in us and teaching us to believe in our
abilities and strengths and never give up until the goal is achieved. We are
thankful to all our friends who extended their moral support, and above all, we
are thankful to God for being with us and giving us the wisdom and ability to do
this project.

Thank You
Table of Contents

List of Figures
Abstract
1 Introduction
1.1 Project Overview
1.2 Problem Identification
1.3 Gantt Charts
2 Literature Survey
2.1 Existing System
2.2 Proposed System
3 Design Process
3.1 Design Overview
3.2 Hardware Specifications
3.3 Software Specifications
3.4 Methodology
4 Results, Analysis, and Observations
4.1 Results
4.2 Graphical Data Visualization and Representation
4.3 Major Observations
5 Conclusion and Future Work
5.1 Conclusion
5.2 References
5.3 User Manual to Run the Product
List of Figures

Figure 1.1 Work Distribution Timeline
Figure 1.2 Evaluation Timeline
Figure 1.3 Date-wise Work Distribution
Figure 2.1 Diagram of Prediction Model
Figure 2.2 Decision Tree Algorithm
Figure 3.1 Process for loan prediction
Figure 3.2 Block diagram
Figure 4.1 Gender VS Married
Figure 4.2 Gender VS Education
Figure 4.3 Married VS Education
Figure 4.4 Self-Employed VS Education
Figure 4.5 Married VS Dependents
Figure 4.6 Credit History VS Property
Figure 4.7 Married VS Credit History
Figure 4.8 Education VS Credit History
Figure 4.9 Credit History VS Property Area
Figure 4.10 Applicant Income Distribution
Figure 4.11 Probability Plot of Applicant Income based on other quantities
Figure 4.12 Applicant Income, Co-Applicant Income, and Loan Amount Distribution
Figure 4.13 Male VS Female Loan Applications based on Applicant Income
Figure 4.14 Distribution of Applicant Income and Co-Applicant Income
Figure 4.15 Distribution of Applicant Income and Loan Amount
Figure 4.16 Correlation Matrix
Figure 4.17 Number of applicants based on Education, Self-Employment, and Property Area
Figure 4.18 Number of applicants based on Gender, Marital Status, and Dependents Count
Figure 5.1 GUI Main Menu
Figure 5.2 Multiple Entries Hint
Figure 5.3 Single Entries Hint
Figure 5.4 Multiple Entries Menu
Figure 5.5 CSV File and Format
Figure 5.6 File displayed after selection
Figure 5.7 Multiple Results for Multiple Entries (1st Part)
Figure 5.8 Multiple Results for Multiple Entries (2nd Part)
Figure 5.9 Single Entries Menu
Figure 5.10 Fill details from menu (1st Entry)
Figure 5.11 Single Result for Single Input (1st Entry)
Figure 5.12 Fill details from menu (2nd Entry)
Figure 5.13 Single Result for Single Input (2nd Entry)

ABSTRACT

In today’s world, banks offer many services to people around the globe, but the most
frequently used service is none other than the loan system. Offering loans is a profitable
line of business for banks, and the demand for loans grows with every passing day. In the
loan sanctioning process, a customer or potential borrower applies for a loan, and the bank
processes the application after screening and verifying the applicant’s details. As simple as
it might seem, verifying the eligibility of an applicant is quite complicated and
time-consuming. Credit score, annual income, and many other factors come into play when
deciding loan eligibility.
The risk associated with the decision to approve a loan is immense. Approving a loan that
cannot be repaid can cause the bank to lose capital. No one wants to waste precious time or
lose capital, so we are developing a framework to address these issues. Our motive in this
project is to create a framework that automates Loan Eligibility Prediction for the banking
system by taking only the necessary information about the applicant. It can help minimize
losses for banks and reduce human error. Our framework predicts eligibility for loan sanction
at a high accuracy rate by verifying the credibility of the individual using Machine Learning.

1
INTRODUCTION

Banks nowadays provide their services not only in urban and suburban areas, but also in
rural areas. The banking system is growing rapidly and deals with millions of people all
the time. Customers apply for a loan by filling out numerous forms and going back and forth
to bank employees to validate their eligibility status. The eligibility of a customer can be
determined from details such as the applicant's income, the loan amount they need, and other
similar information that they provide. The system requires a real-time process to improve
efficiency and reduce the latency caused by human error. That is why we want to provide a
system that automates the process by bringing every aspect of loan eligibility under one roof.

1.1 Project Overview


The proposed model in our project predicts customer creditworthiness based on available data.
Inputs to the model include attributes like the gender of the applicant, marital status, number
of dependents, education, self-employment status, applicant income, co-applicant income, loan
amount, credit history, property area, and loan term/period. The output of the model is a
Boolean decision as to whether the customer is eligible for a loan or not. Logistic regression
and other machine learning models are an important approach in predictive analytics and are
used here to analyze the problem and predict loan defaults.
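
As a rough illustration of this interface, the inputs and the Boolean output listed above could be represented as follows. The class name, the field names, and the placeholder rule are assumptions made for this sketch; they are not taken from the project's code, and the placeholder rule only stands in for the trained model.

```python
from dataclasses import dataclass


@dataclass
class LoanApplication:
    """One applicant record; all field names are hypothetical."""
    gender: str
    married: bool
    dependents: int
    education: str
    self_employed: bool
    applicant_income: float
    coapplicant_income: float
    loan_amount: float
    loan_term_months: int
    credit_history: bool
    property_area: str


def is_eligible(app: LoanApplication) -> bool:
    """Placeholder rule standing in for the trained model (assumption)."""
    total_income = app.applicant_income + app.coapplicant_income
    return app.credit_history and total_income > 0


# Made-up applicant used only to show the call shape.
example = LoanApplication("Male", True, 1, "Graduate", False,
                          5000.0, 1500.0, 120.0, 360, True, "Urban")
print(is_eligible(example))
```

In the real system the body of `is_eligible` would be replaced by a call to the trained classifier.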

1.2 Problem Identification

The main profit-making business of practically all banks is the distribution of loans. The main
portion of a bank's assets comes directly from the profit earned on the loans it distributes.
In a banking environment, the main goal is to place one's assets in trustworthy hands.
Today, many banks and financial institutions grant loans through a lengthy verification and
validation process, but there is still no assurance that the chosen applicant is the most deserving
candidate among all applicants.
When authorizing loans, banks and other lenders must ensure that borrowers will return the
money with interest. Therefore, they need to know the credibility of the borrower before
lending money. To do this, credit agencies must thoroughly check the background and
credibility of the borrower. However, manually iterating over multiple variables and factors for
each borrower is a time-consuming and highly inefficient process. Banks offer different types
of loans across nations worldwide, including mortgages, personal loans, and business loans.
These institutions operate in urban, semi-urban, and rural areas as well. After a customer
applies for a loan, these institutions verify whether the customer is eligible for the loan.

1.3 Gantt Charts

Figure 1.1: Work Distribution Timeline

Figure 1.2: Evaluation Timeline

Figure 1.3: Date-wise Work Distribution

2
LITERATURE SURVEY
A review of the literature shows that various machine learning models can be used to predict
the creditworthiness of applicants. Initially, we applied different learning algorithms to the
bank loan dataset to determine which algorithm suits it best. Neural Networks, K-Nearest
Neighbors, Linear Regression, Decision Trees, Ensemble Learning/Methods, and Logistic
Regression are some of the algorithms used. In many experiments, it was found that, with the
exception of Nearest Centroid and Gaussian Naive Bayes, the remaining algorithms performed
reliably in terms of accuracy and other performance metrics. Each of these algorithms achieved
accuracies of 76% to barely over 80%. The literature also discusses how the technology world
is moving rapidly towards full automation, the importance of automation, and the role of
artificial intelligence and machine learning in it. One of the most important capabilities to
consider in this transition to automation is the machine's decision-making ability. It has been
stated that decision making can be achieved through predictive and probabilistic approaches,
which are developed by various machine learning algorithms. In particular, the literature
emphasizes logistic regression as a machine learning model for this predictive and
probabilistic approach. Using the example of predicting loan eligibility, logistic regression
is used to design a machine learning model that makes decisions based on multiple variables
such as gender, income, employment status, and dependents. The final output determines whether
the borrower in question is eligible for a loan or not.

2.1 Existing System


Banks need to analyze whether people who apply for loans can repay the loans. In some cases,
customers may provide partial data to their bank. In this case, a person can obtain a loan without
proper verification and the bank may incur a loss. Bankers cannot manually analyze vast
amounts of data. Checking if a person will repay a loan can be a big headache. It is very
important to know if the person taking the loan is eligible or not. Therefore, it is very important
to have an automated model that predicts whether a customer who receives a loan will repay
the loan.

2.2 Proposed System

The proposed system automates the process of determining the creditworthiness of the
applicant. A dataset containing data on loan applicants is collected, structured, and
analyzed using appropriate analysis techniques. The dataset is divided into two categories:
• The train data is used to train the model, i.e. our model learns from this file. It contains
all independent variables and the target variable.
• The test data contains all the independent variables, but no target variable. We apply the
model to predict the target variable for the test data. A logistic regression model is used to
predict the binary result.
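
The split described in the two bullets above can be sketched as follows. The 80/20 ratio, the fixed seed, the function name, and the toy records are assumptions made for illustration; the report does not state how its train and test files were produced.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records: (applicant_income, credit_history, loan_status); values are made up.
records = [(2500 + 100 * i, i % 2, "Y" if i % 2 else "N") for i in range(10)]
train, test = train_test_split(records)

# The test features drop the target column, mirroring the description above.
test_features = [row[:-1] for row in test]
print(len(train), len(test))
```

Fixing the seed keeps the split repeatable, which makes accuracy figures comparable between runs.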
Python is the language used to implement this project. The output variable is divided into
two classes, according to the problem specification. The desired output is obtained by entering
the data into a logistic regression model for multiple independent variables.
In the proposed credit prediction model, the dataset is split into training and test data. The
training dataset is then used to train a decision tree algorithm, which produces the predictive
model. To predict loans, a test dataset is given to the model.
The purpose of this work is to predict default on loan repayment. Various libraries such as
Pandas and NumPy were used. After loading the dataset, the data is preprocessed: missing
numerical and categorical values are handled and the values are validated. Numeric and
categorical values are separated. A frequency analysis is performed, and outliers are checked
using boxplots of the attributes.
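
A minimal sketch of the preprocessing steps named above, using only the Python standard library. Filling numeric gaps with the median and categorical gaps with the most frequent value is an assumption (the report does not say which imputation it used), and the 1.5 × IQR rule is the usual convention behind boxplot whiskers. The column contents are made up.

```python
import statistics


def fill_missing(values, fill_value):
    """Replace None entries with the given fill value."""
    return [fill_value if v is None else v for v in values]


def iqr_outliers(values):
    """Return values outside the 1.5 * IQR whiskers used by a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Made-up columns with gaps (None marks a missing value).
loan_amount = [120, 100, None, 130, 110, 900, 125]
gender = ["Male", None, "Female", "Male", "Male", "Female", "Male"]

# Numeric column: fill with the median of the known values (assumption).
known_amounts = [v for v in loan_amount if v is not None]
loan_amount = fill_missing(loan_amount, statistics.median(known_amounts))

# Categorical column: fill with the most frequent category (assumption).
known_gender = [v for v in gender if v is not None]
gender = fill_missing(gender, statistics.mode(known_gender))

print(loan_amount)
print(iqr_outliers(loan_amount))
```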

2.2.1 DATA SET


The dataset is fed to a machine learning model, and the model is trained on this data. The
data records are divided into existing and new customers. The information of each new
applicant serves as the test set. After testing, the model predicts whether a new applicant
is eligible for loan approval based on the conclusions it drew from the training data.

2.2.2 IMPLEMENTATION

Figure 2.1: Diagram of Prediction Model

DATA DESCRIPTION
The dataset consists of various attributes that are considered before sanctioning a loan to
the applicant. A Decision Tree (DT) is a supervised learning algorithm that is used to solve
both classification and regression problems. Here, the DT uses a tree representation to solve
the prediction problem.
Step 1: Collect real data and create a training set.
Step 2: The training set is divided into subsets, each containing 4712 similar attribute values.
Step 3: Step 2 is repeated for all subsets until all leaf nodes in the tree have been traversed.

A decision tree is a machine learning technique that efficiently performs both classification
and regression tasks. Decision trees are widely used in the banking industry due to their high
accuracy and their ability to express statistical models in simple language. In a decision
tree, each node represents a feature (attribute), each link (branch) represents a decision
(rule), and each leaf represents a result (categorical or continuous value). Various data
analysis tools can be used to make credit predictions and assess their severity.

Figure 2.2: Decision Tree Algorithm

This process involves training models on the data using different algorithms and comparing
user data against the trained model to predict loan types. Several R functions and packages
were used to prepare the data and create the classification model. This work shows that the
R package is an efficient visualization tool for applying data mining techniques, and it can
be used to analyze customer data. Based on this analysis, the bank can approve or reject the
loan. In practice, customer records can contain a lot of missing data, which must be imputed
with valid values generated from the complete data available. The dataset has many attributes
that define the credibility of customers looking for different types of loans. The values of
these attributes may contain outliers that fall outside the normal data range.

3
DESIGN PROCESS
3.1 Design Overview

Figure 3.1: Process for loan prediction

3.2 Hardware Specifications


Any Windows PC/Laptop with the following configuration can run our program easily:
a) Windows 7/8/8.1/10/11.
b) 2 GB or more RAM.
c) 2 GHz or more CPU Clock Speed.
d) 25 MB or more available Disk Space.

3.3 Software Specifications
The following software is recommended for reproducing and developing the program:
a) Anaconda Navigator (Python version 3.9.7 preferred)
b) Libraries:
• csv
• pandas
• matplotlib
• xgboost
• sklearn
• tkinter
• pil
• seaborn

3.4 Methodology
These are the most used concepts in our project:

3.4.1 MACHINE LEARNING


Machine learning is an evolving interdisciplinary field that applies artificial intelligence
(AI) to give systems the ability to learn from existing, available datasets, called training
data. The entire modeling process begins with appropriate data selection and subsequent
analysis. Observed data can help identify patterns and enable better decisions in the future.
The purpose is to allow the system to learn by itself without human intervention. In the
future, this will enable machines to make constructive decisions that complement the overall
decision-making process.
The main types of machine learning methods are as follows. Supervised machine learning applies
what was learned in the past to new data, using labeled examples to predict future events. In
contrast, unsupervised machine learning algorithms are used when the information used for
training is neither classified nor labeled. Semi-supervised machine learning algorithms fall
somewhere between supervised and unsupervised learning, because they use both labeled and
unlabeled data for training; typically, a small amount of labeled data is combined with a
large amount of unlabeled data. Reinforcement learning algorithms interact with the
environment by producing actions and discovering errors or rewards. Trial-and-error search and
delayed reward are the most relevant features of reinforcement learning.

3.4.2 LOGISTIC REGRESSION
This is a classification algorithm that uses the logistic function to predict a binary outcome
(true/false, 0/1, yes/no) given one or more independent variables. The goal of this model is to
find relationships between features and the probabilities of certain outcomes. The model is
linear in the logit, which is the logarithm of the odds in favor of an event; the logistic
function is the inverse of the logit. The logistic function produces a sigmoidal curve whose
probability estimates resemble a smoothed step function.
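
The relationship between the two functions can be sketched in a few lines. The logit is the log-odds of a probability, and the logistic (sigmoid) function is its inverse, mapping a linear score back to a probability. The weights, bias, threshold, and feature values below are hypothetical and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, binary decision) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, p >= threshold


# Hypothetical weights for two scaled features (credit history, income).
prob, eligible = predict([1.0, 0.4], weights=[2.0, 0.5], bias=-1.0)
print(round(prob, 3), eligible)
```

Training a logistic regression model amounts to choosing the weights and bias that make these probabilities match the labeled data as closely as possible.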

3.4.3 DECISION TREE


This is a supervised machine learning algorithm primarily used for classification problems. The
model requires all features to be discretized so that the population can be divided into two or
more homogeneous sets or subsets. This model splits a node into two or more sub-nodes using
different algorithms. As more sub-nodes are formed, the homogeneity and purity of the nodes
with respect to the dependent variable increase.
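
One common way to measure the purity mentioned above is Gini impurity; the report does not say which splitting criterion was used, so the choice here is an assumption. The sketch scores each candidate feature by the weighted impurity of the sub-nodes it would create and picks the lowest. The rows are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_impurity(rows, feature_index):
    """Weighted Gini impurity after splitting rows on a categorical feature."""
    groups = {}
    for *features, label in rows:
        groups.setdefault(features[feature_index], []).append(label)
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


# Made-up rows: (credit_history, education, loan_status).
rows = [
    (1, "Graduate", "Y"), (1, "Not Graduate", "Y"), (1, "Graduate", "Y"),
    (0, "Graduate", "N"), (0, "Not Graduate", "N"), (1, "Not Graduate", "N"),
]
parent = gini([r[-1] for r in rows])
best = min(range(2), key=lambda i: split_impurity(rows, i))
print(parent, best)
```

A full tree repeats this choice recursively on each sub-node until the nodes are pure or a stopping rule is reached.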

3.4.4 RANDOM FOREST


This is a tree-based ensemble model that helps improve model accuracy. It combines many
decision trees to create a powerful predictive model. To build each decision tree, a random
sample of rows and features is taken. The final prediction is the mode of the individual trees'
predictions for classification, or their mean for regression.
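
The two ideas in this paragraph, random sampling of rows and features per tree and combining the trees' outputs, can be sketched as below. The function names, the feature names, and the example votes are hypothetical; the trees themselves are omitted.

```python
import random
import statistics


def bootstrap_sample(rows, feature_names, n_features, rng):
    """Draw rows with replacement and a random subset of features for one tree."""
    sampled_rows = rng.choices(rows, k=len(rows))
    sampled_features = rng.sample(feature_names, n_features)
    return sampled_rows, sampled_features


def aggregate_classification(tree_votes):
    """Majority vote: the most common class among the trees' predictions."""
    return statistics.mode(tree_votes)


def aggregate_regression(tree_outputs):
    """Average of the trees' numeric predictions."""
    return statistics.mean(tree_outputs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(8))   # stand-in for eight training rows
features = ["income", "credit_history", "education", "property_area"]
sample_rows, sample_features = bootstrap_sample(rows, features, 2, rng)

# Hypothetical outputs of five trees for one applicant.
votes = ["Y", "N", "Y", "Y", "N"]
scores = [0.8, 0.4, 0.7, 0.9, 0.2]
print(aggregate_classification(votes), aggregate_regression(scores))
```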

3.4.5 XGBOOST
This algorithm works only with quantitative variables. It is a gradient boosting algorithm
that builds strong rules for models by combining weak learners into a strong learner. It is a
fast and efficient algorithm that has recently become dominant in applied machine learning due
to its high performance and speed.
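
Because the algorithm is described as working only with quantitative variables, categorical fields such as property area would have to be converted to numbers first. The report does not say which encoding was used; the sketch below shows two common options, label encoding and one-hot encoding, on a made-up column.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted for stability)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot(values):
    """Expand a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


property_area = ["Urban", "Rural", "Semiurban", "Urban"]  # made-up column
codes, mapping = label_encode(property_area)
columns, categories = one_hot(property_area)
print(codes, mapping)
print(columns, categories)
```

Label encoding keeps one column but imposes an arbitrary order on the categories; one-hot encoding avoids that at the cost of extra columns.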

3.4.6 SKLEARN
This Python library is helpful for building machine learning and statistical models for tasks
such as clustering, classification, and regression. Though it can also be used for reading,
manipulating, and summarizing data, better libraries exist for these functions.

Figure 3.2: Block diagram

Since predicting whether a loan application will be approved is a classification problem, the
model is trained using algorithms for classification including logistic regression, decision trees,
random forests, and support vector machines.

4
RESULTS, ANALYSIS, AND OBSERVATIONS
4.1 Results
The project works properly and meets all the goals that we set before starting it. An accuracy
of 90.3909% was observed for our model while testing with different auto-populated data.
The AUC (Area Under the ROC Curve) score after training the model was noted to be 0.986202.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0,
whereas one whose predictions are 100% correct has an AUC of 1.0. An AUC of 0.986202 therefore
means that the trained model ranks a randomly chosen eligible applicant above a randomly chosen
ineligible one about 98.6% of the time; it measures ranking quality rather than the share of
correct predictions.
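
To make the difference between the two metrics concrete, the sketch below computes accuracy as the share of matching predictions and AUC as the share of (eligible, ineligible) pairs that the scores rank in the right order. The labels and scores are made up and do not reproduce the figures reported above; the 0.5 threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = eligible) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), auc(y_true, scores))
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores order the applicants, which is why the two numbers can differ for the same model.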

4.2 Graphical Data Visualization and Representation

Figure 4.1: Gender VS Married

Figure 4.2: Gender VS Education

Figure 4.3: Married VS Education

Figure 4.4: Self-Employed VS Education

Figure 4.5: Married VS Dependents

Figure 4.6: Credit History VS Property

Figure 4.7: Married VS Credit History

Figure 4.8: Education VS Credit History

Figure 4.9: Credit History VS Property Area

Figure 4.10: Applicant Income Distribution

Figure 4.11: Probability Plot of Applicant Income based on other quantities

Figure 4.12: Applicant Income, Co-Applicant Income, and Loan Amount Distribution

Figure 4.13: Male VS Female Loan Applications based on Applicant Income

Figure 4.14: Distribution of Applicant Income and Co-Applicant Income

Figure 4.15: Distribution of Applicant Income and Loan Amount

Figure 4.16: Correlation Matrix

Figure 4.17: Number of applicants based on Education, Self-Employment, and Property Area

Figure 4.18: Number of applicants based on Gender, Marital Status, and Dependents Count

4.3 Major Observations
1. Applicants who are male and married tend to have the highest applicant income, whereas
applicants who are female and married have the lowest applicant income.

2. Male applicants who are graduates have more applicant income than applicants who have not
graduated.

3. Likewise, applicants who are married and graduates have more applicant income.

4. Applicants who are not self-employed have more applicant income than applicants who are
self-employed.

5. Applicants who have more dependents have the lowest applicant income, whereas applicants
who have no dependents have the highest applicant income.

6. Applicants who have property in urban areas and have a credit history have the highest
applicant income.

7. Applicants who are graduates and have a credit history have more applicant income.

8. Loan amount is linearly dependent on applicant income.

9. From the heat map, applicant income and loan amount are highly positively correlated.

10. There are more male applicants than female applicants.

11. The number of married applicants is greater than the number of unmarried applicants.

12. Applicants with no dependents are the largest group.

13. There are more graduate applicants than non-graduate applicants.

14. Most properties are in semi-urban areas, and the fewest are in rural areas.
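
The correlation observations in the list above rest on the Pearson correlation coefficient, which is what a correlation-matrix heat map typically displays. The sketch below computes it by hand; the income and loan values are made up for illustration and are not taken from the project's dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, not taken from the project's dataset.
applicant_income = [2500, 4000, 5200, 6100, 8000]
loan_amount = [90, 120, 150, 160, 210]
print(round(pearson(applicant_income, loan_amount), 3))
```

A value close to +1 indicates a strong positive linear relationship, which is the pattern the observations describe for applicant income and loan amount.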

5
CONCLUSION AND FUTURE WORK
5.1 Conclusion
The project revealed that algorithms such as XGBoost, AdaBoost, random forest, and decision
tree perform credibly well in terms of accuracy and other performance evaluation metrics.

Based on a comprehensive screening and validation process, loan organizations issue loans.
However, they cannot say for sure whether the applicant will be able to repay the loan without
having any trouble. They will be able to quickly, conveniently, and effectively select the
worthiest applicants thanks to the loan prediction system. It could offer the bank special
advantages.

We have examined how to create a Loan Approval Prediction System in this project. The
analytical steps in developing this system include data gathering, exploratory data analysis,
data preprocessing, model construction, and model testing. In this work, we made a detailed
analysis of earlier research papers on this topic. Among the most popular algorithms are
Logistic Regression, Decision Trees, and Random Forest.

We have suggested using supervised learning approaches to determine whether a loan
candidate would pay back the loan or not. In this project, several algorithms were put into
practice to forecast consumer loans. Logistic Regression, Random Forest, KNN, SVM, and
Decision Tree Classifier were used to achieve the best results. These five algorithms are
compared.

High accuracy is achieved using random forest. From a careful study of the system's advantages
and limitations, it can be concluded that the product can be a very effective component. The
application works properly and satisfies the requirements set forth by the bank.

5.2 References
1. Mridul Bhandari, How to predict Loan Eligibility using Machine Learning Models, GitHub.
2. Kumar Arun, Garg Ishan, Kaur Sanmeet, Loan Approval Prediction based on Machine Learning
Approach, IOSR Journal of Computer Engineering (IOSR-JCE), May-Jun. 2016.
3. Wei Li, Shuai Ding, Yi Chen, and Shanlin Yang, Heterogeneous Ensemble for Default Prediction
of Peer-to-Peer Lending in China, Key Laboratory of Process Optimization and Intelligent
Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 2009, China.

5.3 User Manual to Run the Product


1) Run the app and you will be greeted with a window like the following:

Figure 5.1: GUI Main Menu

Our project can predict the loan eligibility of either a single individual or multiple
individuals at once.

Figure 5.2: Multiple Entries Hint

Figure 5.3: Single Entries Hint

5.3.1 Guide for Predicting Eligibility of Multiple Entries:
1) Click on the “Multiple Entries” button and you will be greeted with the following menu:

Figure 5.4: Multiple Entries Menu

2) Click on the “Browse File” button and browse for the file that contains the data. The
contents of the *.csv file should be in the following format; this is the file that will be
used to predict eligibility:

Figure 5.5: CSV File and Format

3) After selecting the file, you can see the full path of the file along with the filename that
you have selected. Click on the “Predict” button to predict the loan eligibility of all the
applicants.

Figure 5.6: File displayed after selection

4) Now we can see all predictions made by the program against the Loan ID(s) as output.

Figure 5.7: Multiple Results for Multiple Entries (1st Part)

Figure 5.8: Multiple Results for Multiple Entries (2nd Part)

5) Click on the “Back” button to go back to the Main Menu.
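
For readers preparing an input file for the multiple-entries flow, the sketch below shows how such a CSV could be parsed and checked for empty fields before prediction. The column names in the sample are hypothetical; the actual required format is the one shown in Figure 5.5, and the helper names are not taken from the project's code.

```python
import csv
import io

# Hypothetical header; the real format is the one shown in Figure 5.5.
SAMPLE_CSV = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Credit_History
LP001,Male,Yes,5000,120,1
LP002,Female,No,3000,,0
"""


def read_applicants(text):
    """Parse CSV text into a list of dicts, one per applicant row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_with_missing_values(rows):
    """Return the Loan_ID of every row that has at least one empty field."""
    return [row["Loan_ID"] for row in rows if any(v == "" for v in row.values())]


applicants = read_applicants(SAMPLE_CSV)
print(len(applicants), rows_with_missing_values(applicants))
```

Reading from a file on disk works the same way by passing an opened file object to `csv.DictReader` instead of the `io.StringIO` wrapper.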

5.3.2 Guide for Predicting Eligibility of Single Entries:


1) Click on “Single Entries” Button and you will be greeted with the following menu:

Figure 5.9: Single Entries Menu

2) Fill in the following details as applicable and click on the “Predict” button.
• Gender
• Married
• Dependents
• Education
• Self Employed
• Applicant Income
• Co-applicant Income
• Loan Amount
• Loan Amount Term
• Credit Card History
• Property Area

Figure 5.10: Fill details from menu (1st Entry)

3) Now we can see the prediction made by the program against the provided input.

Figure 5.11: Single Result for Single Input (1st Entry)

4) Click on “Back” button to go back to Main Menu.


5) Click on “Single Entries” button again to input for another individual.

Figure 5.12: Fill details from menu (2nd Entry)

Figure 5.13: Single Result for Single Input (2nd Entry)