Summary and Context
IN BFS
Presented by:
Shakti Singh
Introduction to Exploratory Data Analysis
• “While fraud reduction is a common goal for banks and financial institutions, analytics
can be used to manage risk instead of simply detecting fraud” (Janaha, Data Analytics
in banking and Financial Services 2023).
• “Analytics can be used to identify and rate individual customers who are at risk of
fraud and then apply different levels of monitoring and verification to those accounts.
Analyzing the risk of the accounts allows banks and financial institutions to know
what to prioritize in their fraud detection efforts” (Janaha, Data Analytics in banking
and Financial Services 2023).
• “With the rise of computing power and new analytical techniques, banks can now
extract deeper and more valuable insights from their ever-growing mountains of data.”
Moreover, “The recent dramatic increases in computing power have allowed banks to
deploy advanced analytical techniques at an industrial scale” (Dash et al., Risk
analytics enters its prime 2017).
Introduction to Exploratory Data Analysis
• “Machine-learning techniques, such as deep learning, random forest, and XGBoost, are now
common at top risk-analytics departments” (Dash et al., Risk analytics enters its prime 2017).
• “The new tools radically improve banks’ decision models. And techniques such as natural-
language processing and geospatial analysis expand the database from which banks can
derive insights” (Dash et al., Risk analytics enters its prime 2017).
• “This means that risk teams can increasingly measure and mitigate risk more accurately and
faster” (Dash et al., Risk analytics enters its prime 2017).
• “Banks that are fully exploiting these shifts are experiencing a “golden age” of risk analytics,
capturing benefits in the accuracy and reach of their credit-risk models and in entirely new
business models” (Dash et al., Risk analytics enters its prime 2017).
Introduction to the Case Study
• “The loan providing companies find it hard to give loans to the people due to their insufficient or
non-existent credit history. Because of that, some consumers use it to their advantage by becoming a
defaulter” (upGrad, Credit EDA Assignment 2023).
• “When the company receives a loan application, the company has to decide for loan approval based
on the applicant’s profile” (upGrad, Credit EDA Assignment 2023).
• “This case study aims to identify patterns which indicate if a client has difficulty paying their
instalments which may be used for taking actions such as denying the loan, reducing the amount of
loan, lending (to risky applicants) at a higher interest rate, etc.” (upGrad, Credit EDA Assignment
2023).
• Problem Statement: “The company wants to understand the driving factors (or driver variables)
behind loan default, i.e. the variables which are strong indicators of default. The company can
utilise this knowledge for its portfolio and risk assessment” (upGrad, Credit EDA Assignment 2023).
Procedure followed from the Beginning
• The required libraries were imported: warnings, numpy, pandas, matplotlib.pyplot and seaborn.
• The CSV files ‘application_data.csv’ and ‘previous_application.csv’ were then loaded.
• Each file was read into its own pandas DataFrame and stored in a variable.
• Analysis was performed first on the current application dataset and then on the previous
application dataset.
• The first and last rows were inspected using the head() and tail() functions.
• The columns were then checked and the output of info() was printed.
• Finally, the dataset was checked for null values.
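The loading and inspection steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook used for the case study: the inline rows are hypothetical stand-ins for ‘application_data.csv’, and every column name other than AMT_INCOME_TOTAL and AMT_CREDIT is an assumption.

```python
import io
import warnings

import pandas as pd

warnings.filterwarnings("ignore")  # suppress library warnings during exploration

# Hypothetical stand-in rows for 'application_data.csv'.
# With the real file, use pd.read_csv("application_data.csv") instead.
csv_text = """SK_ID_CURR,TARGET,AMT_INCOME_TOTAL,AMT_CREDIT,OCCUPATION_TYPE
100001,0,202500.0,406597.5,Laborers
100002,1,270000.0,1293502.5,
100003,0,67500.0,135000.0,Laborers
100004,0,135000.0,,Core staff
"""
app_df = pd.read_csv(io.StringIO(csv_text))

print(app_df.head(2))            # first rows
print(app_df.tail(2))            # last rows
print(app_df.columns.tolist())   # column names
app_df.info()                    # dtypes and non-null counts

# Number of null values in each column
null_counts = app_df.isnull().sum()
print(null_counts)
```

The same calls would apply unchanged to ‘previous_application.csv’.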
Null Value Treatment
• The null counts were converted to percentages of the total number of rows.
• Columns with more than 35 percent null values were removed from the dataset.
• Columns with a lower share of missing values, below roughly 13 to 20 percent, were
identified and labeled as minor_missing_values.
• The unique values of the columns were counted using the nunique() function, which showed
that they were categorical in nature.
• Columns that were not needed for the analysis were dropped.
• Data types were corrected: columns were converted to object, categorical or numerical type
as appropriate.
• Negative values were converted to positive where required.
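A minimal sketch of these cleaning steps, under stated assumptions: the data is made up, the column names MOSTLY_EMPTY, DAYS_BIRTH and CODE_GENDER are hypothetical, and the 20 percent cut-off for minor_missing_values is one possible reading of the range given above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (8 rows)
df = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, np.nan, 400.0, 500.0, 600.0, 700.0, 800.0],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0, 2.0, 3.0],
    "DAYS_BIRTH": [-9000, -12000, -15000, -10000, -20000, -11000, -13000, -14000],
    "CODE_GENDER": ["F", "M", "F", "F", "M", "F", "M", "F"],
})

# Percentage of null values per column
null_pct = df.isnull().mean() * 100

# Remove columns with more than 35 percent null values
df = df.drop(columns=null_pct[null_pct > 35].index)

# Columns with some, but comparatively few, missing values (assumed cut-off: 20 percent)
minor_missing_values = null_pct[(null_pct > 0) & (null_pct <= 20)]

# A small number of unique values suggests a categorical column
unique_counts = df.nunique()
df["CODE_GENDER"] = df["CODE_GENDER"].astype("category")

# Convert negative values to positive where required
df["DAYS_BIRTH"] = df["DAYS_BIRTH"].abs()

print(null_pct)
print(unique_counts)
```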
Data Binning
Inference: Many outliers are clearly present. The median or mode can be used for imputation, because both are robust to outliers and give a more accurate representation of the whole dataset.
Box Plot for AMT_INCOME_TOTAL
Inference: Outliers are present here as well. The median or mode can again be used for imputation, because both are robust to outliers and give a more accurate representation of the whole dataset.
Box Plot for AMT_CREDIT
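The outlier reasoning in the two inferences above can be illustrated with a short sketch. It assumes the usual box plot convention (points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers) and uses made-up income values, not figures from the dataset.

```python
import pandas as pd

# Hypothetical income values with one extreme outlier
income = pd.Series(
    [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000],
    name="AMT_INCOME_TOTAL",
)

# Box plot whisker rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]

# The mean is pulled upward by the outlier, while the median is not
mean_value = income.mean()
median_value = income.median()

print(outliers.tolist())
print(mean_value, median_value)
```

In this toy series the mean lies far above every typical value, which is why the median is the safer summary when outliers are present.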
An interesting observation: clients who have been paying regularly have moved more frequently, while clients who have had payment problems have moved less often.
Numerical Univariate Analysis
Box Plot comparing AMT_ANNUITY for clients with Target 0 and Target 1
An interesting observation: males in both categories have defaulted less than females. However, more females than males have taken a loan.
Categorical Univariate Analysis
Count Plot comparing EDUCATION_TYPE for clients with Target 0 and Target 1
Inference: People without payment difficulties take more credit for the annuity.
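The Target 0 and Target 1 comparisons in the univariate slides above rest on splitting the data by the target column. A minimal sketch follows, assuming the column is named TARGET and that 1 marks clients with payment difficulties and 0 all other clients; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; TARGET = 1 is assumed to mean payment difficulties
df = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 0],
    "EDUCATION_TYPE": ["Higher education", "Secondary", "Secondary",
                       "Secondary", "Higher education", "Secondary"],
    "AMT_ANNUITY": [30_000, 25_000, 27_000, 20_000, 22_000, 26_000],
})

# Separate data frames for the two client groups
target0 = df[df["TARGET"] == 0]
target1 = df[df["TARGET"] == 1]

# Counts behind a categorical count plot, per target group
edu_counts = df.groupby("TARGET")["EDUCATION_TYPE"].value_counts()

# Median behind a numerical box plot, per target group
annuity_median = df.groupby("TARGET")["AMT_ANNUITY"].median()

print(edu_counts)
print(annuity_median)
```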
Bi-variate Analysis
Count Plot on Contract Type and Credit Range
Inference: People in both categories take cash loans more often than revolving loans, but this can be attributed to most people being in the labor and sales class.
Bi-variate Analysis
Count Plot on Gender and Credit Range
Inference: In both categories, females have taken out more loans, and the loan amount is also greater for females in both cases.
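The credit range used in the count plots above implies that AMT_CREDIT was binned into categories. The slides do not show the bin edges, so the sketch below uses hypothetical edges and labels purely to illustrate the technique with pandas `cut`.

```python
import pandas as pd

# Hypothetical credit amounts
credit = pd.Series(
    [90_000, 250_000, 480_000, 750_000, 1_200_000],
    name="AMT_CREDIT",
)

# Assumed bin edges and labels; the real analysis may have used different ones
bins = [0, 200_000, 500_000, 1_000_000, float("inf")]
labels = ["Low", "Medium", "High", "Very high"]

# Each value falls into the interval (left edge, right edge]
credit_range = pd.cut(credit, bins=bins, labels=labels)

print(credit_range.tolist())
```

The resulting categorical column can then be passed to a count plot together with gender or contract type.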
Numerical and Categorical Bi-variate Analysis
Box Plot on Credit Amount and Education Type
Inference: Among clients with difficulty paying the loan, people with higher education struggled more with repayment. This may be due to the state of the job market, the loan amount, or their employment status. On the other hand, clients with higher education also lead among those without payment difficulty.
Numerical and Categorical Bi-variate Analysis
Box Plot on Total Income and Education Type
Inference: There are numerous outliers in both cases, but academic degree has the fewest in both cases because the dataset contains few such clients.
Numerical and Categorical Bi-variate Analysis
Box Plot on Credit Amount and Occupation Type
Inference: The amount of credit taken is higher for clients with no difficulties, and defaulters tend to take less credit in every occupation type. Accountants and Managers also tend to take more loans and have more difficulty paying them back.
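The numerical and categorical comparisons above can also be checked numerically. The sketch below computes the median credit amount per occupation type and target group with a pivot table. The column names OCCUPATION_TYPE and TARGET are assumptions, and the values are invented only to exercise the code; they are not results from the dataset.

```python
import pandas as pd

# Hypothetical rows; TARGET = 1 is assumed to mean payment difficulties
df = pd.DataFrame({
    "OCCUPATION_TYPE": ["Accountants", "Accountants", "Accountants",
                        "Managers", "Managers", "Laborers", "Laborers"],
    "TARGET": [0, 0, 1, 0, 1, 0, 1],
    "AMT_CREDIT": [800_000, 600_000, 500_000, 900_000, 700_000, 400_000, 300_000],
})

# Median credit amount for each occupation type, split by target group
median_credit = df.pivot_table(
    index="OCCUPATION_TYPE",
    columns="TARGET",
    values="AMT_CREDIT",
    aggfunc="median",
)

print(median_credit)
```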
Previous Application Data
• The same data cleaning steps used for the current application data were followed.
• After cleaning, the current and previous application datasets were merged to perform the final analysis.
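A minimal sketch of the merge step, assuming both datasets share a client identifier column (here called SK_ID_CURR) and that an inner join was used. Neither the key nor the join type is stated on the slide, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned current application data
app_df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
})

# Hypothetical cleaned previous application data (a client may have several rows)
prev_df = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Canceled"],
})

# Keep only clients that appear in both datasets (assumed join type)
final_df = app_df.merge(prev_df, on="SK_ID_CURR", how="inner")

print(final_df)
```

With an inner join, clients without any previous application are dropped, and clients with several previous applications appear once per application.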
Final Dataset Analysis
Count Plot and Pie Plot showing Different status of Loan Offered
Final Dataset Analysis
Count Plot for Contract Type with four subcategories
Inference: The majority of rejected loans are from the category 'repairs'. Also, education has an equal number of approvals and rejections.
Inference: Paying other loans and buying a new car have significantly more rejections than approvals.
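Approvals and rejections per loan purpose can be tabulated with a cross-tabulation. In the sketch below, the column names LOAN_PURPOSE and CONTRACT_STATUS are hypothetical and the rows are invented only to show the technique; they do not reproduce the figures behind the inferences above.

```python
import pandas as pd

# Hypothetical merged rows
df = pd.DataFrame({
    "LOAN_PURPOSE": ["Repairs", "Repairs", "Repairs", "Education",
                     "Education", "Buying a new car", "Buying a new car"],
    "CONTRACT_STATUS": ["Refused", "Refused", "Approved", "Approved",
                        "Refused", "Refused", "Approved"],
})

# Number of applications per purpose and status
status_counts = pd.crosstab(df["LOAN_PURPOSE"], df["CONTRACT_STATUS"])

# Share of refused applications within each purpose
refusal_share = pd.crosstab(
    df["LOAN_PURPOSE"], df["CONTRACT_STATUS"], normalize="index"
)["Refused"]

print(status_counts)
print(refusal_share)
```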
Univariate Analysis on Final Dataset
Logarithmic Comparison of Contract Status
Inference: