
Capstone Project

NBFC Loan Foreclosure Prediction

Project Notes - I

Submitted By: Ms. Shilpa Naik


Project Mentor: Mr. Dheeraj Singh
Date: 8th Nov 2020
Table of Contents

Introduction
    Company Details
    Problem Statement
    Need of Study / Project
    Understanding business/social opportunity
        Impacts of Foreclosure
        Alternative to Foreclosure
    Is Foreclosure Prevention Effective?
    Is Foreclosure Prevention Cost Effective?
Data Report
    Dataset Name
    Data Details
    Insights/observations Based on Attribute details and descriptive statistics above
Exploratory Data Analysis
    Univariate Analysis for Categorical Variables
    Univariate Analysis for Continuous Variables
    Bivariate Analysis
    Removal of unwanted variables
    Missing Value Treatment
    Outlier Treatment
    Variable Transform
        Datatype change
        Normalizing and Scaling
        Encoding
    Variable Addition
Business Insights from EDA
    Data Imbalance
    Methods for dealing with class imbalance
    Insights
Introduction
Company Details
A Non-Banking Financial Company (NBFC) is a company registered under the Companies Act, 1956, engaged in the business of loans and advances, among other financial services.

Problem Statement
Predicting Loan Foreclosure

The main business of banks and Non-Banking Financial Companies (NBFCs) is to lend money to interested individuals or businesses, called "Borrowers", against assets pledged as "Collateral" for the loan.

Borrowers then repay the loan amount in installments with an agreed interest on the loan amount. The NBFC makes profit mainly through the interest paid by borrowers. Sometimes borrowers fail to repay the loan and interest to the lending institution; this is called loan default. In the event of loan default, the lending institution has the right to recover the balance of the loan from the defaulted borrower by forcibly selling the asset used as collateral. This process is called foreclosure.

The impacts of foreclosures are widespread and costly, not only for borrowers but also for lenders, servicers, insurers, cities, and the wider economy.

Lenders want to find suitable solutions to avoid foreclosures. If, through this analysis, we can predict "FORECLOSURE" correctly in advance, it will help the NBFC take the required actions to avoid foreclosure and retain customers.

Need of Study / Project


The major objective of this study is to understand how consumer attributes and loan attributes influence the tendency of default / foreclosure of a loan.

This study will help predict loan defaulters based on the available loan and borrower data. Based on the predictions, the lending NBFC can take preventive measures to avoid foreclosure by offering different repayment options and schemes, and try to retain customers.

By avoiding foreclosures, the NBFC can increase profits, save time on legal foreclosure battles, and make the business more sustainable.

Understanding business/social opportunity


Impacts of Foreclosure

The impacts of foreclosures are widespread and costly, not only for borrowers but also for lenders, servicers, and insurers. Indirectly, they also impact the economy and public money.

The United States subprime mortgage crisis was a nationwide financial crisis that occurred
between 2007 and 2010, and contributed to the U.S. financial crisis. It was triggered by a large
decline in home prices after the collapse of a housing bubble, leading to mortgage delinquencies,
foreclosures, and the devaluation of housing-related securities. Declines in residential investment
preceded the Great Recession and were followed by reductions in household spending and then
business investment.

Borrowers:
A damaged credit rating. Poor credit resulting from foreclosure often becomes a barrier for borrowers to obtaining a new loan or expanding a business.

Potentially higher interest rates if a new loan is approved.


Possible tax consequences. For tax purposes, foreclosure is treated like a sale; any principal
balance and accrued interest forgiven are treated as income for the borrower. The amount of gain
or loss is determined just as if the collateral had been sold for cash equal to the face amount of
the debt.

Loan servicers:
For loan servicers, the income stream from servicing fees stops when borrowers halt payments.

Insurers:
Many borrowers have insurance on the loan. Foreclosure by selling the collateral affects the insurer as well. The amount of loss equals the outstanding principal plus all the expenses incurred, less the proceeds from the sale of the collateral.

Alternative to Foreclosure

There are workout options available to lenders to help borrowers avoid becoming defaulters and stop foreclosure.

Reinstatement:
Accepting the total amount of back interest and principal owed by a specific date. This option is
often combined with forbearance.

Forbearance:
Reducing or suspending payments for a short period, after which another option is agreed upon
to bring the loan current. A forbearance option is often combined with a reinstatement, when it is
known that the borrower will have enough money to bring the account current at a specific time
in the future. The money might come from a bonus, investment, insurance settlement, or a tax
refund.

Repayment Plan:
With a repayment plan, the lender agrees to add, for example, half the amount of the first missed payment onto each of the next two payments. These plans provide some relief for borrowers with short-term financial problems.

Loan Modifications:
If the borrower can make the payments on the loan, but does not have enough money to bring the
account current or cannot afford the total amount of the current payment, a change to one or
more of the original loan terms may make the payments more affordable.

• Adding the missed payments to the outstanding loan balance


• Changing the interest rate, including making an adjustable rate into a fixed rate
• Extending the repayment term
Short Refinance:
Forgive some of the debt and refinance the rest into a new loan, usually resulting in lower
financial loss to lender than foreclosing.

Claim Advance:
If the mortgage is insured, the borrower may qualify for an interest-free loan from the insurer to
bring the account current. Full repayment of this loan may be delayed for several years.

Is Foreclosure Prevention Effective?


National and international studies show that repayment plans, loan modifications, and short refinancing help borrowers catch up on loan repayment and avoid foreclosure.

This is beneficial for both lenders and borrowers and helps keep the economy stable.

Is Foreclosure Prevention Cost Effective?


National and international studies, such as those of the Mortgage Foreclosure Prevention Program (MFPP), show that the cost of foreclosure prevention is far lower than the actual cost of foreclosure and the losses incurred during foreclosure.

It also prevents damage to credit rating, reputation, and social status.


Data Report
Dataset Name:

NBFC Loan Transaction Data.xlsx

Data Details:
The NBFC has provided a dataset consisting of aggregated loan transaction data for its customers.

The dataset has 20012 aggregated transactions across 53 different attributes.

Loans approved by the NBFC have authorization dates ranging from Aug 2010 to Dec 2018.

The target attribute is "FORECLOSURE", which is binary in nature and indicates whether the loan is foreclosed or not.
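As a quick illustration, the dataset can be loaded and inspected with pandas (a minimal sketch; the file name is taken from the Dataset Name section above, and the default sheet layout is assumed):

import pandas as pd

# Load the aggregated loan transaction data provided by the NBFC
df = pd.read_excel('NBFC Loan Transaction Data.xlsx')

# Expect 20012 rows and 53 columns, with FORECLOSURE as the target
print(df.shape)
print(df['FORECLOSURE'].value_counts())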

Attribute details table:

Column Name | Description | Data Type | Null Values Present | Business Importance
AGREEMENTID | Agreement ID of the loan account (a customer can have multiple loans) | int64 |  | No - Just an Identifier
AUTHORIZATIONDATE | Authorization date of the loan | datetime64 |  | Yes (can be used for time period calculation)
BALANCE_EXCESS | Balance of excess amount | float64 |  | Yes
BALANCE_TENURE | Remaining tenure | int64 |  | Yes
CITY | City of origination | object |  | Yes
COMPLETED_TENURE | Completed tenure | int64 |  | Yes
CURRENT_INTEREST_RATE | Current rate of interest on the loan. Renamed field (old name: CURRENT_ROI) | float64 |  | Yes
CURRENT_INTEREST_RATE_MAX | Maximum value of the current ROI across transactions | float64 |  | Yes
CURRENT_INTEREST_RATE_MIN | Minimum value of the current ROI across transactions | float64 |  | Yes
CURRENT_INTEREST_RATE_CHANGES | Number of times the current ROI has changed | int64 |  | Yes
CURRENT_TENOR | Current tenor of the loan | int64 |  | Yes
CUSTOMERID | Unique Customer ID given to each customer | float64 | Yes | No - Just an Identifier
DIFF_AUTH_INT_DATE | Difference between authorization and interest start date | int64 |  | Yes
DIFF_CURRENT_INTEREST_RATE_MAX_MIN | Difference between the maximum and minimum interest rate per agreement | float64 |  | Yes
DIFF_EMI_AMOUNT_MAX_MIN | Difference between maximum and minimum EMI amount | float64 | Yes | Yes
DIFF_ORIGINAL_CURRENT_INTEREST_RATE | Difference in original ROI and current ROI (ORIGNAL_ROI - CURRENT_ROI) | float64 |  | Yes
DIFF_ORIGINAL_CURRENT_TENOR | Difference in original and current tenor (ORIGNAL_TENOR - CURRENT_TENOR) | int64 |  | Yes
DPD | Days past due | int64 |  | Yes
DUEDAY | Next due date of the loan | int64 |  | Yes
EMI_AMOUNT | Mode of the receipt amount | float64 |  | Yes
EMI_DUEAMT | EMI due amount | float64 |  | Yes
EMI_OS_AMOUNT | EMI outstanding amount | float64 |  | Yes
EMI_RECEIVED_AMT | EMI received amount | float64 |  | Yes
EXCESS_ADJUSTED_AMT | Excess adjusted amount | float64 |  | Yes
EXCESS_AVAILABLE | Excess received | float64 |  | Yes
FOIR | Fixed obligation to income ratio (value should range from 0-1; derived variable) | float64 |  | Yes
INTEREST_START_DATE | Interest start date on the loan | datetime64 |  | Yes (can be used for time period calculation)
LAST_RECEIPT_AMOUNT | Last receipt amount | float64 | Yes | Yes
LAST_RECEIPT_DATE | Last receipt date | datetime64 | Yes | Yes (can be used for time period calculation)
LATEST_TRANSACTION_MONTH | Month of last receipt date; in case the account is foreclosed, it will be the month of foreclosure | float64 | Yes | Yes
LOAN_AMT | Loan amount which was sanctioned | float64 |  | Yes
MAX_EMI_AMOUNT | Maximum receipt amount | float64 | Yes | Yes
MIN_EMI_AMOUNT | Minimum receipt amount | float64 | Yes | Yes
MONTHOPENING | Month of opening | float64 |  | Yes
NET_DISBURSED_AMT | Amount that was disbursed | float64 |  | Yes
NET_LTV | Net Loan to Value ratio (value ranges from 0-100, in %; derived variable) | float64 |  | Yes
NET_RECEIVABLE | Net receivable: (EMI_DUEAMT - EMI_RECEIVED_AMT = EMI_OS_AMOUNT) + (EXCESS_AVAILABLE - EXCESS_ADJUSTED_AMT = BALANCE_EXCESS) = NET_RECEIVABLE | float64 |  | Yes
NUM_EMI_CHANGES | Number of different values in the receipt amounts | int64 |  | Yes
NUM_LOW_FREQ_TRANSACTIONS | Number of transactions done in less than 28 days | int64 |  | Yes
ORIGNAL_INTEREST_RATE | Original rate of interest on the loan (when the loan was sanctioned). Renamed field (old name: ORIGNAL_ROI) | float64 |  | Yes
ORIGNAL_TENOR | Original tenor of the loan (when the loan was sanctioned) | int64 |  | Yes
OUTSTANDING_PRINCIPAL | Outstanding principal | float64 |  | Yes
PAID_INTEREST | Paid interest | float64 |  | Yes
PAID_PRINCIPAL | Paid principal | float64 |  | Yes
PRE_EMI_DUEAMT | Pre EMI due amount for the loan | float64 |  | Yes
PRE_EMI_OS_AMOUNT | Pre EMI outstanding amount | float64 |  | Yes
PRE_EMI_RECEIVED_AMT | Pre EMI that was received | float64 |  | Yes
PRODUCT | Loan product | object |  | Yes
SCHEMEID | Scheme ID under which the loan was given | float64 | Yes | Yes
NPA_IN_LAST_MONTH | Whether NPA in last month | object | Yes | Yes
NPA_IN_CURRENT_MONTH | Whether NPA in current month | object | Yes | Yes
MOB | Internal code | int64 |  | Yes
FORECLOSURE | Labelled field (target) | int64 |  | Yes

Descriptive statistics:

Column Name | mean | std | min | 25% | 50% | 75% | max
BALANCE_EXCESS | 78995.98 | 1348636.3 | 0 | 0 | 0 | 57.422352 | 75555999.5
BALANCE_TENURE | 172.82461 | 64.004484 | 0 | 136 | 174 | 216 | 674
COMPLETED_TENURE | 17.269089 | 16.486279 | 0 | 6 | 12 | 25 | 98
CURRENT_INTEREST_RATE | 14.781931 | 2.4858582 | 9.9010174 | 12.797658 | 14.545631 | 16.231176 | 25.0958952
CURRENT_INTEREST_RATE_MAX | 14.900248 | 2.480029 | 10.425409 | 13.109796 | 14.670486 | 16.543314 | 37.45656
CURRENT_INTEREST_RATE_MIN | 14.301873 | 2.6770138 | -5.056636 | 12.423092 | 13.734072 | 16.168748 | 24.034626
CURRENT_INTEREST_RATE_CHANGES | 0.7580951 | 1.1343233 | 0 | 0 | 0 | 2 | 9
CURRENT_TENOR | 190.09369 | 58.559953 | 6 | 166 | 180 | 228 | 713
DIFF_AUTH_INT_DATE | 0.0062962 | 0.5696331 | -17 | 0 | 0 | 0 | 70
DIFF_CURRENT_INTEREST_RATE_MAX_MIN | 0.5983747 | 0.9669352 | 0 | 0 | 0 | 1.1861244 | 24.346764
DIFF_EMI_AMOUNT_MAX_MIN | 115209.42 | 967082.44 | 0 | 10207 | 19885 | 42466.485 | 84968249.9
DIFF_ORIGINAL_CURRENT_INTEREST_RATE | -0.380504 | 0.8811203 | -7.179174 | -1.186124 | 0 | 0 | 10.3244569
DIFF_ORIGINAL_CURRENT_TENOR | -6.796372 | 33.525757 | -461 | -14 | 0 | 0 | 234
DPD | 7.5740556 | 66.098901 | 0 | 0 | 0 | 0 | 2054
DUEDAY | 5.776634 | 2.7190093 | 1 | 5 | 5 | 5 | 15
EMI_AMOUNT | 43609.495 | 113131.82 | 0 | 10685 | 18937.5 | 36424 | 4879479
EMI_DUEAMT | 1991553.2 | 6838394.2 | 0 | 204021.62 | 545065.11 | 1481417.2 | 354610410
EMI_OS_AMOUNT | 33297.348 | 656131.13 | 0 | 0 | 0 | 0 | 58995308.8
EMI_RECEIVED_AMT | 1958255.8 | 6762984.2 | 0 | 202093.55 | 537657.63 | 1456413.6 | 354610410
EXCESS_ADJUSTED_AMT | 359900.21 | 3923345.6 | 0 | 0 | 0 | 260.60914 | 284164207
EXCESS_AVAILABLE | 438896.19 | 4169759.4 | 0 | 0 | 260.60914 | 3105.0088 | 284164207
FOIR | 27.960034 | 3871.0648 | -170.33 | 0.41 | 0.52 | 0.68 | 547616
LAST_RECEIPT_AMOUNT | 80674.458 | 808402.7 | 1 | 11061 | 19642 | 38219 | 84968811.9
LATEST_TRANSACTION_MONTH | 10.692231 | 2.8214091 | 1 | 12 | 12 | 12 | 12
LOAN_AMT | 5897355.3 | 12985661 | 37532.395 | 1558947.4 | 2684572.1 | 5233435.7 | 424566452
MAX_EMI_AMOUNT | 122254.44 | 970451.59 | 13.34 | 13318 | 23600 | 49360.5 | 84968811.9
MIN_EMI_AMOUNT | 7045.0255 | 43425.488 | 0.01 | 118 | 133.18 | 3334 | 3156965
MONTHOPENING | 5447511.2 | 11838513 | 0 | 1483751.7 | 2503693.7 | 4791777.8 | 381836715
NET_DISBURSED_AMT | 5847665.5 | 12911932 | 37532.395 | 1544082.7 | 2640779.3 | 5186724.8 | 424566452
NET_LTV | 51.18924 | 21.10683 | 0.38 | 35.16 | 53.3 | 66.77 | 100
NET_RECEIVABLE | -45439.15 | 1348502.3 | -75345538 | -17.66842 | 0 | 0 | 38643502.1
NUM_EMI_CHANGES | 2.9498301 | 2.6355002 | -1 | 2 | 2 | 4 | 33
NUM_LOW_FREQ_TRANSACTIONS | 2.7691385 | 2.571271 | 0 | 1 | 2 | 3 | 30
ORIGNAL_INTEREST_RATE | 14.401427 | 2.6032649 | 9.651307 | 12.48552 | 13.734072 | 16.168748 | 27.780282
ORIGNAL_TENOR | 183.29732 | 44.600262 | 14 | 180 | 180 | 228 | 300
OUTSTANDING_PRINCIPAL | 5212982.4 | 11521353 | -0.750648 | 1428919.5 | 2394655.4 | 4551203.7 | 381836715
PAID_INTEREST | 989054.69 | 3026052.5 | 0 | 125331.93 | 309724.83 | 795467.96 | 123036221
PAID_PRINCIPAL | 866763.73 | 34697581 | 0 | 23418.338 | 78786.502 | 291780.97 | 4885216533
PRE_EMI_DUEAMT | 57804.47 | 377664.74 | 0 | 4768.2638 | 10696.017 | 31878.792 | 31775396.1
PRE_EMI_OS_AMOUNT | 259.47789 | 10967.445 | 0 | 0 | 0 | 0 | 1074263.99
PRE_EMI_RECEIVED_AMT | 57544.992 | 376971.85 | 0 | 4755.0125 | 10679.453 | 31805.357 | 31775396.1
SCHEMEID | 10901216 | 88.905192 | 10901100 | 10901112 | 10901264 | 10901291 | 10901455
MOB | 18.813612 | 16.541875 | 0 | 7 | 13 | 26 | 98
FORECLOSURE | 0.0896962 | 0.2857531 | 0 | 0 | 0 | 0 | 1
Insights/observations Based on Attribute details and descriptive statistics
above:
• Attribute names are meaningful and well defined and do not need renaming.
• A few attributes like MAX_EMI_AMOUNT, MIN_EMI_AMOUNT, NPA_IN_LAST_MONTH etc. have NULL values that need to be addressed.
• Attributes like SCHEMEID, BALANCE_TENURE, COMPLETED_TENURE etc. hold numeric values but are categorical in nature and need to be converted.
• Attributes like CUSTOMERID and AGREEMENTID are just unique identifiers and do not play any important role from a business perspective.
• The attribute FOIR will need transformation – the expected value is between 0 and 1, but the dataset contains completely out-of-range values.
• EMI_AMOUNT is 0 for a few transactions, which is incorrect; these values need to be imputed.
• The statistics show that attribute values are not on the same scale, so the dataset will need scaling in later analysis.
• NUM_EMI_CHANGES has negative values, which is incorrect. It records the number of times the EMI has changed, so it can only be 0 (no change) or a positive number, never negative.
Exploratory Data Analysis
Univariate Analysis for Categorical Variables
• Target Variable / Dependent Variable

FORECLOSURE is the target variable. It is binary in nature and indicates whether the loan is foreclosed or not. Although the variable's datatype is integer, it is categorical in nature with possible values 0 or 1.

0 – Indicating loan not foreclosed


1 – Indicating loan is foreclosed

Record distribution for variable FORECLOSURE

FORECLOSURE | RECORD COUNT | % OF TOTAL RECORDS
0 | 18217 | 91.03038
1 | 1795 | 8.96962
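A sketch of how this distribution can be computed with pandas (assuming the dataframe df loaded earlier):

# Absolute counts and percentage share of each FORECLOSURE class
print(df['FORECLOSURE'].value_counts())
print(df['FORECLOSURE'].value_counts(normalize=True) * 100)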

• Independent Variables

CITY – the variable has 272 distinct city names.

The top 5 cities with the highest number of customers opting for a loan are:

CITY Name | Customer Count
MUMBAI | 2028
HYDERABAD | 1567
AHMEDABAD | 1396
SURAT | 1391
PUNE | 1202

We can see that a few city names are incorrectly spelled, resulting in two different cities for the same place, and need transformation to a single correct city name. The records are as follows:

Given City Name | Correct Name
BHIWADI, BHIWANDI | BHIWANDI
CHAMARAJANAGAR, CHAMRAJNAGAR | CHAMRAJNAGAR
CHENGALPATTU, CHENGALPET | CHENGALPATTU
DHARAMPUR, DHARMAPURI | DHARAMPUR
KANCHEEPURAM, KANCHIPURAM | KANCHIPURAM
KRISHNA, KRISHNAGIRI | KRISHNAGIRI
PONDICHERRY, PONDICHERRY 1 | PONDICHERRY
PUDUKKOTTAI, PUDUKOTTAI | PUDUKOTTAI
VIJAYAWADA, VIJAYWADA | VIJAYWADA
VILLUPPURAM, VILLUPURAM | VILLUPURAM

Unique city name count after correction: 263.

PRODUCT – the variable has 4 different values, indicating the loan product/type.

Product Name/Type Customer Count


STHL 7268

LAP 6226

HL 3482

STLAP 3036

STHL and LAP have a higher number of customers compared to HL and STLAP; foreclosures by product are examined in the bivariate analysis below.

SCHEMEID – Although variable contains numeric data, it is categorical in nature. It is unique


identifier for scheme under which loan is sanctioned.

Top 5 Scheme IDs with highest number of customers,

SCHEME ID Count
10901104 2359
10901106 1463
10901295 1090
10901112 1019
10901287 1018
We do not have more details about schemes provided in dataset so we will ignore scheme details. Also,
this variable has null values and needs to be addressed.

NPA_IN_LAST_MONTH & NPA_IN_CURRENT_MONTH – NPA stands for non-performing


asset. A non-performing loan is a loan that is in default or close to being in default. Many loans become
non-performing after being in default for 90 days, but this can depend on the contract terms.

NPA_IN_LAST_MONTH Count
0 102
Yes 15
#N/A 2

NPA_IN_CURRENT_MONTH Count
0 103
Yes 16

Both these variables should be binary in nature, indicating whether or not the loan is a non-performing asset. They need transformation, and their null values need to be filled.

0 / No – the loan is not an NPA, which is a good sign.

1 / Yes – the loan is an NPA, which is not a good sign; the loan will most probably be foreclosed in this case.

CUSTOMERID & AGREEMENTID – Customer ID is a unique identifier for the customer and Agreement ID is a unique identifier for the loan agreement. A customer can have multiple loans, but analysis of the dataset shows there are no customers with multiple loans.

CURRENT_INTEREST_RATE_CHANGES - Although variable contains numeric data, it is


categorical in nature. It shows number of times interest rate has been changed in past. We will convert it
to categorical variable for future analysis.

No. of Time
Interest rate
has changed Count
0 12496
2 4404
1 1934
3 599
4 342
5 175
6 48
7 12
9 1
8 1

We can see that for ~60% of customers the interest rate has not changed (value 0). For ~32% of customers the interest rate has changed 1 or 2 times. Interest rate changes of more than 2 times are infrequent.
DIFF_AUTH_INT_DATE - Although variable contains numeric data, it is categorical in nature. It
shows difference of number of days between Authorization date and Initiation date. We will convert it to
categorical variable for future analysis.

Diff of days between Authorization and Initiation date | Count
0 | 19926
-1 | 49
5 | 8
2 | 4
3 | 4
-4 | 4
-2 | 3
1 | 2
7 | 2
6 | 1
4 | 1
-17 | 1
70 | 1
15 | 1
11 | 1
12 | 1
-3 | 1
14 | 1
9 | 1

More than 95% of customers have no difference (0 days) between the authorization date and the initiation date, which is expected. A difference greater or less than 0 is exceedingly rare and might occur under particular legal loan approval circumstances.

DPD – Although the variable contains numeric data, it is categorical in nature. DPD stands for days past due. It shows the number of days past the due date. We will convert it to a categorical variable for future analysis.

The higher the DPD, the higher the chance of the loan getting foreclosed. We plot only records where DPD is greater than 0. Most of the customers fall between 0-200 days; beyond 90 days it is usually difficult to get back on track.

DPD Count
0 18770
26 407
56 161
87 108
25 91
DUEDAY – Although the variable contains numeric data, it is categorical in nature. DUEDAY shows the day of the month on which the EMI is due. It could be anywhere between 1 and 31, but in this dataset we can see that the NBFC has fixed the payment day to the 1st, 5th or 15th of the month. We will convert it to a categorical variable for future analysis.

DUEDAY Count

5 18343

15 1587

1 82

~90% customers have due day=5

NUM_LOW_FREQ_TRANSACTIONS - Although variable contains numeric data, it is categorical in


nature. It shows number of transactions done in less than 28 days. We will convert it to categorical
variable for future analysis.

Most of the customers have 1-5 transactions in less than 28 days.

LATEST_TRANSACTION_MONTH - Although variable contains numeric data, it is categorical in


nature. It shows Month of last receipt date. In case account is Foreclosed, it will be month of
Foreclosure. We will convert it to categorical variable for future analysis.

Month Count
12 15203
11 658
8 556
9 460
10 421
7 413
6 392
3 384
4 384
1 378
2 350
5 338

Since it shows either the month of the last receipt date or the month of the foreclosure date, we do not get many insights looking at this variable alone. The spike in December may simply mean that the data was collected in January and the last receipt date for most transactions falls in December.
We will need bivariate analysis of this variable along with FORECLOSURE to find a pattern.

NUM_EMI_CHANGES – Although the variable contains numeric data, it is categorical in nature. It shows the number of times the EMI has changed, i.e. the number of different values in the receipt amount for a particular loan agreement. We will convert it to a categorical variable for future analysis.
The distribution is right skewed, and most customers have 1 to 4 EMI changes.

Here -1 is an unacceptable value: we can have either no receipt amount (0) or one or more receipt amounts (>= 1), but the count cannot be negative. This column needs imputation to correct the value.

ORIGNAL_TENOR - Although variable contains numeric data, it is categorical in nature. It shows


original tenor of loan in number of months. We will convert it to categorical variable for future analysis.

Most of the customers have an original tenor between 170 and 250 months, i.e. roughly 15-20 years.

BALANCE_TENURE - Although variable contains numeric data, it is categorical in nature. It shows


balance tenure of loan repayment in number of months. We will convert it to categorical variable for
future analysis.
Most of the customers have a balance tenure between 150 and 250 months, i.e. roughly 12-20 years.

COMPLETED_TENURE - Although variable contains numeric data, it is categorical in nature. It


shows completed tenure of loan repayment in number of months. We will convert it to categorical
variable for future analysis.

~60% of customers have a completed tenure of less than 20 months, showing that the dataset mostly contains relatively new loans.

CURRENT_TENOR - Although variable contains numeric data, it is categorical in nature. It shows


current tenure of loan repayment in number of months. We will convert it to categorical variable for
future analysis.

~80% of customers have a current tenure between 100-300 months, which differs from the original tenor.

DIFF_ORIGINAL_CURRENT_TENOR – Although the variable contains numeric data, it is categorical in nature. It shows the difference between the original and current tenure. We will convert it to a categorical variable for future analysis.
A negative value indicates the tenure has been increased – possibly to help the customer repay rather than foreclose. Positive values indicate the tenure has been reduced; depending on repayment capacity, the tenure can be reduced.

MOB – Although the variable contains numeric data, it is categorical in nature. It is an internal code. We will convert it to a categorical variable for future analysis.
Univariate Analysis for Continuous Variables
BALANCE_EXCESS – It shows excess balance amount.

It is clearly visible that for most of the transactions amount is 0. This column has outliers.

CURRENT_INTEREST_RATE – It shows current interest rate.

---- Blue dotted line – Median, Red Solid Line – Mean

No clear linear increasing or decreasing pattern. The current interest rate ranges between 10-25%, with most of the data between 13-16% and very few outliers.

CURRENT_INTEREST_RATE_MAX – It shows current maximum interest rate.


---- Blue dotted line – Median, Red Solid Line – Mean

No clear linear increasing or decreasing pattern. The current maximum interest rate ranges between 10-40%, with most of the data between 13-16% and a few outliers.

CURRENT_INTEREST_RATE_MIN – It shows current minimum interest rate.

---- Blue dotted line – Median, Red Solid Line – Mean

No clear linear increasing or decreasing pattern. The current minimum interest rate ranges between -5% and 25%, with most of the data between 13-16% and a few outliers.

Negative interest rates effectively mean that a bank pays a borrower to take money off its hands, so the borrower pays back less than was loaned. This scenario occurs very rarely. To avoid foreclosure, the institution may restructure the loan and interest rate to help the borrower repay more easily.

Under a negative interest rate, borrowers still make a monthly repayment as usual, but the amount outstanding is reduced each month by more than the borrower has paid.

DIFF_CURRENT_INTEREST_RATE_MAX_MIN– It shows difference between current maximum and


minimum interest rate.

Most customers opt for a floating interest rate rather than a fixed one. With a fixed interest rate, the maximum and minimum are the same and the difference is 0. With a floating interest rate, the lender offers a range of interest rates, generally not very wide, and depending on economic conditions the rate may increase or decrease within that range. A very wide range is riskier for both lender and borrower.
Most customers have either a fixed rate (difference between min and max = 0) or a difference of 0-3. Large differences are rare. The dataset has many outliers.

ORIGINAL_INTEREST_RATE– It shows original interest rate.

Original interest rate ranges between 8-28% with most of customers having interest rate between 12.5 -16 %
and very few outliers.

DIFF_ORIGINAL_CURRENT_INTEREST_RATE– It shows difference between original and current


interest rate.

It ranges between -7.5 and +10. The dataset has many outliers. A negative value indicates the current interest rate is higher than the original, and a positive value indicates the current rate is lower than the original. A low or zero difference makes the loan more stable.

EMI_AMOUNT – Indicates EMI Amount.


A few records have very high values, possibly indicating a higher loan amount or a shorter tenure. The data has outliers. Let us plot only smaller values, i.e. EMI amounts less than 100k.

An EMI amount of 0 seems incorrect when the loan is not yet fully paid; we need to impute these values.

EMI_DUEAMT – Indicates EMI due Amount.

This variable also exhibits same pattern as EMI_AMOUNT. Many outliers. Let us plot only smaller values like
emi due amount less than 100k.
EMI_OS_AMOUNT – Indicates outstanding emi amount.

>95% of records have an outstanding EMI amount equal to 0, which is a good sign: borrowers are paying on time with nothing outstanding. A few outliers with very high outstanding amounts may indicate foreclosure; we will check this in the bivariate analysis.

EMI_RECEIVED_AMT – Indicates received emi amount

This variable also exhibits same pattern as EMI_DUEAMT. Many outliers. Let us plot only smaller values like
emi received amount less than 100k. EMI_DUEAMT & EMI_RECEIVED_AMT graphs are closely matching
as expected.
MAX_EMI_AMOUNT – Maximum emi amount.

A few maximum EMI amounts are very large, so the dataset has outliers. Let's plot values having a max EMI amount less than 100k.

A maximum EMI amount of 0 seems incorrect when the loan is not yet fully paid; we need to impute it.

MIN_EMI_AMOUNT – Minimum EMI amount.

A few minimum EMI amounts are very large, so the dataset has outliers. Let's plot values having a min EMI amount less than 100k.
A minimum EMI amount of 0 seems incorrect when the loan is not yet fully paid; we need to impute it.

PRE_EMI_DUEAMT– Pre EMI due amount

This indicates pre emi due amount and has outliers. Let us plot values having pre emi dueamt less than 100k.

Most customers have pre emi due amount less than 20K
PRE_EMI_OS_AMOUNT– Pre EMI outstanding amount

Most of the values are 0, which is a good sign – nothing outstanding. A few very large values exist in the dataset.

PRE_EMI_RECEIVED_AMT– Pre EMI received amount

This indicates the pre EMI received amount and has outliers. Let us plot values having a pre EMI received amount less than 100k. It closely matches PRE_EMI_DUEAMT, which is expected.
EXCESS_ADJUSTED_AMT– Excess adjusted amount

Most of the excess adjusted amounts are 0, with a few large values.

EXCESS_AVAILABLE – Excess available

Most of the excess available amounts are 0, with a few large values.

FOIR – Fixed obligation to income ratio (the value should range from 0-1; derived variable).

FOIR values should be between 0 and 1, but the graph above shows that the dataset has some incorrect values which need to be imputed with correct values; we will do this in a later part of the report.
LAST_RECEIPT_AMOUNT– Last received amount

This variable has few large values. Let us plot values which are less than 100k

LOAN_AMT– Loan amount

This variable has a few large values. Let us plot values which are less than 5 million. Mostly, the loan amount is between 1-3 million.
MONTHOPENING – The data description says "month of opening", which should be between 1 and 12 (Jan-Dec), but the dataset has amount values. We assume this is the opening balance for the month.

This variable also has a lot of outliers. Let us plot data where the opening balance is < 5 million.

It is somewhat similar to LOAN_AMT. During bivariate analysis, if we find these are correlated, we can eliminate one of them.

NET_DISBURSED_AMT – Net disbursed amount

This variable also has outliers. Let us plot values less than 5 million. In general, this should closely match LOAN_AMT.
NET_LTV– Net Loan to Value ratio (Value ranges from 0-100 (in %) – Derived variable)

This variable shows fair distribution and does not have any outliers. Most of records have LTV ratio between
35-65 %.

The loan-to-value (LTV) ratio is an assessment of lending risk that financial institutions and other lenders
examine before approving a mortgage. Typically, loan assessments with high LTV ratios are considered higher
risk loans. Therefore, if the loan is approved, the loan has a higher interest rate.

NET_RECEIVABLE - Net receivable amount

Net receivables refer to the net amount of money remaining after deducting the provision for bad debt. The concept is primarily used in businesses that sell on credit.

Net receivables = (total amount borrowed by customers) - (amount borrowed by customers that will never be repaid)

The higher the net receivables, the more the company still needs to collect from its customers.

OUTSTANDING_PRINCIPAL – Outstanding principal amount


Outstanding principal amount = loan amount – paid principal

Here we see few very large values. Let us plot values less than 5 million.

Since most of these loan transactions belong to new customers, little has been paid yet and the outstanding principal is close to the loan amount. In the ideal scenario, the outstanding principal reduces towards the end of the tenure.

PAID_INTEREST– Paid interest till now

Here we see few very large values. Let us plot values less than 2 million.

PAID_PRINCIPAL – Paid principal till now


Here we see a few very large values. Let us plot values less than 100k. Most of the transactions are new, so the paid principal is very small. The dataset has a lot of 0 values, which might need imputation based on a calculation using the loan amount and outstanding principal.

Bivariate Analysis
We see in the dataset that most of the attributes are highly correlated with each other based on their function, definition and usage. Several variables belong to one particular parent attribute group such as loan amount, EMI, tenure or interest rate. These parent categories are also dependent on each other or have derived values.

Let us look at a broader grouping:

LOAN / TOTAL Amount Related: BALANCE_EXCESS, EXCESS_ADJUSTED_AMT, EXCESS_AVAILABLE, LOAN_AMT, NET_DISBURSED_AMT, NET_LTV, NET_RECEIVABLE, OUTSTANDING_PRINCIPAL, PAID_INTEREST, PAID_PRINCIPAL, FOIR, LAST_RECEIPT_AMOUNT, MONTHOPENING

Tenure Related: BALANCE_TENURE, COMPLETED_TENURE, CURRENT_TENOR, DIFF_ORIGINAL_CURRENT_TENOR, ORIGNAL_TENOR

Interest Rate Related: CURRENT_INTEREST_RATE, CURRENT_INTEREST_RATE_MAX, CURRENT_INTEREST_RATE_MIN, CURRENT_INTEREST_RATE_CHANGES, DIFF_CURRENT_INTEREST_RATE_MAX_MIN, DIFF_ORIGINAL_CURRENT_INTEREST_RATE, ORIGNAL_INTEREST_RATE

Date/Time Related: DIFF_AUTH_INT_DATE, DPD, DUEDAY, LATEST_TRANSACTION_MONTH

EMI Related: EMI_AMOUNT, EMI_DUEAMT, EMI_OS_AMOUNT, EMI_RECEIVED_AMT, DIFF_EMI_AMOUNT_MAX_MIN, MAX_EMI_AMOUNT, MIN_EMI_AMOUNT, NUM_EMI_CHANGES, NUM_LOW_FREQ_TRANSACTIONS, PRE_EMI_DUEAMT, PRE_EMI_OS_AMOUNT, PRE_EMI_RECEIVED_AMT

We will do bivariate analysis for a few combinations that are most suitable and can provide business insights.
CITY count w.r.t FORECLOSURE

Top 5 cities with highest foreclosure numbers,

CITY | Count | % of total FORECLOSURE=1
MUMBAI | 353 | 19.66
HYDERABAD | 165 | 9.19
PUNE | 151 | 8.41
CHENNAI | 109 | 6.07
AHMEDABAD | 90 | 5.01

It is easily visible that big metro cities have the highest number of customers and total loan amounts. The number of foreclosed loans is also higher in metro cities compared to the other cities provided in the dataset.

CITY w.r.t LOAN_AMOUNT

Top 5 cities with highest sum of loan amount

CITY Name SUM(LOAN_AMT)


MUMBAI 26,453,929,441.7286
PUNE 10,936,817,628.4622
DELHI 10,285,276,185.3309
BANGALORE 8,262,236,184.9314
CHENNAI 7,497,097,157.3392

Metro cities have higher total loan amounts as the number of customers there is also higher.

PRODUCT w.r.t FORECLOSURE

PRODUCT NAME | Count | % of total FORECLOSURE=1
HL | 990 | 55.153
STHL | 803 | 44.735
LAP | 2 | 0.111

Most loan foreclosures are under the HL and STHL products, with HL being the highest.
SCHEMEID w.r.t FORECLOSURE

Top 5 schemeids with highest foreclosure numbers,

SCHEMEID | Count | % of total FORECLOSURE=1
10901104 | 775 | 43.17548747
10901112 | 462 | 25.73816156
10901291 | 56 | 3.119777159
10901142 | 52 | 2.896935933
10901251 | 50 | 2.78551532

Although SCHEMEIDs 10901291, 10901142 and 10901251 are not among the top-selling schemes, they have more loan foreclosures compared to other SCHEMEIDs.

LOAN_AMT vs EMI_AMOUNT w.r.t FORECLOSURE

The scatterplot clearly shows that the higher the loan amount, the higher the EMI, which is expected. The pattern for foreclosed loans is not clearly visible, but we can see that very high loan amounts are not foreclosed.

Since EMI_DUEAMT , EMI_RECEIVED_AMT are closely related to EMI_AMOUNT they also exhibit
similar pattern.
LOAN_AMT vs CURRENT_INTEREST_RATE w.r.t FORECLOSURE

There is no clearly visible pattern between loan amount and interest rate; irrespective of the loan amount, the interest rate can vary widely. The same applies to CURRENT_INTEREST_RATE_MAX, CURRENT_INTEREST_RATE_MIN and CURRENT_INTEREST_RATE_CHANGES, as they are closely related.

LOAN_AMT vs CURRENT_TENOR w.r.t FORECLOSURE

Let us restrict the plot by excluding high-value loan amounts (> 10M).

We can see more foreclosures for lower current tenors and for tenures between 250-300 months.

LOAN_AMT vs ORIGNAL_TENOR w.r.t FORECLOSURE

The pattern is similar to CURRENT_TENOR, and a similar pattern is visible for the balance tenure. However, there is no clear pattern between loan amount and tenure.

LOAN_AMT vs NET_LTV w.r.t FORECLOSURE


Loan amount does not have a clear relationship with NET_LTV; NET_LTV varies irrespective of the loan amount. However, the higher the loan amount and the higher the LTV, the more foreclosures we observe.

Let us look at a pair plot. Since we have many attributes, we will do the pair plot for selected attributes only.

The pair plot shows that none of the attributes clearly exhibits a pattern or helps in separating the classes.

Very few attributes show a linear relationship, e.g. LOAN_AMT and EMI_AMOUNT. Most of the attributes have right-skewed distributions.
Let us look at Heatmap,
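A minimal sketch of how the correlation heatmap can be produced (assuming seaborn and matplotlib are available):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations of the numeric attributes
corr = df.select_dtypes(include='number').corr()

plt.figure(figsize=(18, 14))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.show()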

As expected, CURRENT_INTEREST_RATE, CURRENT_INTEREST_RATE_MAX, CURRENT_INTEREST_RATE_MIN and ORIGNAL_INTEREST_RATE are highly correlated with each other.

Same way EMI_AMOUNT , EMI_DUEAMT , EMI_RECEIVED_AMT are highly correlated.

LOAN_AMT, NET_DISBURSED_AMT and MONTHOPENING are highly correlated, which is expected.

PRE_EMI_DUEAMT , PRE_EMI_RECEIVED_AMT are highly correlated as expected.

As we also saw in the scatterplot, LOAN_AMT is highly correlated with EMI_AMOUNT.

OUTSTANDING_PRINCIPAL and EMI Related columns have considerable correlation.

NET_LTV shows an inverse correlation with the interest rate related attributes.

EMI_AMOUNT has correlation with principal and interest paid and interest rate also which is technically
correct.

Highly correlated attributes are not useful in model building and can hurt model performance. We will need to remove correlated attributes before model building, using methods like VIF (variance inflation factor) to quantify multicollinearity, checking p-values and feature importance, and dropping the redundant variables.

The other correlations are of significantly lower intensity, which is good for model building.
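A sketch of how VIF can be computed with statsmodels (assuming the numeric feature matrix has already been imputed and scaled):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric predictors only, excluding the target
X = df.select_dtypes(include='number').drop(columns=['FORECLOSURE'])

vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif.sort_values('VIF', ascending=False))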
Removal of unwanted variables
There are a few variables in the dataset which have no business value; such variables are not useful in model building and need to be dropped. These are mostly identifiers or, sometimes, duplicate columns.

The NBFC dataset contains a few such unwanted variables which can be safely dropped before model building.

List of unwanted variables:

AGREEMENTID – This is unique identifier for loan contract

CUSTOMERID – This is unique identifier for customer

AUTHORIZATIONDATE – This is the loan authorization date. Sometimes date values can be used to derive columns by comparing the date with a fixed point in the past or future and calculating the difference to measure recency, but for the given dataset and business scenario this does not seem useful.

INTEREST_START_DATE – This is interest start date and for given business problem this is not useful.

LAST_RECEIPT_DATE – This is last payment receipt date and for given business problem this is not useful.
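A minimal sketch of dropping these variables with pandas:

# Drop identifiers and date columns that are not useful for modeling
drop_cols = ['AGREEMENTID', 'CUSTOMERID', 'AUTHORIZATIONDATE',
             'INTEREST_START_DATE', 'LAST_RECEIPT_DATE']
df = df.drop(columns=drop_cols)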

Missing Value Treatment


During descriptive analysis and univariate analysis, we found that many columns have null or missing values. Null / missing values are not acceptable during model building and need to be taken care of before proceeding further.

There are many ways in which null / missing values can be imputed; the choice depends on the type of data being handled.

For categorical / discrete variables, we generally impute missing or null values with the MODE of the data, i.e. the most frequent value.

For continuous variables, we impute null / missing values with the MEAN or MEDIAN. The mean is affected by the presence of outliers, as it is the average of the values; the median is much less affected by outliers.

Other methods like formula-based calculation, interpolation and imputation with a constant can also be used, depending on the need.

Depending on the column of the NBFC dataset that needs imputation, we will decide the best strategy.
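For illustration, typical mode and median imputation in pandas look like this (a generic sketch; the column names are placeholders, and the actual strategy per column is decided below):

# Categorical / discrete column: impute with the most frequent value (mode)
df['SOME_CATEGORICAL_COL'] = df['SOME_CATEGORICAL_COL'].fillna(
    df['SOME_CATEGORICAL_COL'].mode()[0])

# Continuous column: impute with the median (robust to outliers)
df['SOME_NUMERIC_COL'] = df['SOME_NUMERIC_COL'].fillna(
    df['SOME_NUMERIC_COL'].median())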

The NBFC columns containing null values / missing data are listed in the table below.
Column Name Null Values
NPA_IN_CURRENT_MONTH 19893
NPA_IN_LAST_MONTH 19893
SCHEMEID 281
LAST_RECEIPT_AMOUNT 247
DIFF_EMI_AMOUNT_MAX_MIN 89
MAX_EMI_AMOUNT 89
MIN_EMI_AMOUNT 89
LATEST_TRANSACTION_MONTH 75

NPA_IN_CURRENT_MONTH – This is a categorical variable, and Null in this case means the loan is not a non-performing asset, which is good.

NPA_IN_CURRENT_MONTH Count
0 103
Yes 16

Let us impute the values as below:

NULL => 0
Yes => 1

After imputation,

NPA_IN_CURRENT_MONTH Count
0 19996
1 16

NPA_IN_LAST_MONTH – This is a categorical variable, and Null in this case means the loan is not a non-performing asset, which is good.

NPA_IN_LAST_MONTH Count
0 102
Yes 15
#N/A 2

Let us impute the values as below:

NULL => 0
Yes => 1
#N/A => 0

After imputation,

NPA_IN_LAST_MONTH Count
0 19997
1 15
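A sketch of the mapping applied to both NPA columns (assuming the raw values are 0, 'Yes', '#N/A' and NaN):

import numpy as np

# Map Yes -> 1 and everything else (0, #N/A, NaN) -> 0 for both NPA flags
for col in ['NPA_IN_LAST_MONTH', 'NPA_IN_CURRENT_MONTH']:
    df[col] = np.where(df[col] == 'Yes', 1, 0)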
SCHEMEID – This is a categorical variable, and Null in this case can be imputed with the most frequent value, as we have no other source of information for this variable.

The most frequent value in SCHEMEID is 10901104. Nulls are replaced with this value in the dataset.
After imputation,

MAX_EMI_AMOUNT and MIN_EMI_AMOUNT – These show the maximum and minimum EMI amounts. They are continuous variables, and continuous variables are generally imputed with the mean, but we saw earlier that the dataset has large values (outliers) in the related columns, which would distort the mean heavily.

We could safely replace the missing values with the corresponding EMI_AMOUNT, but in these cases EMI_AMOUNT is also 0, which is incorrect.

The general EMI formula is

EMI = P * r * (1 + r)^n / ((1 + r)^n - 1), where P = loan amount, r = periodic interest rate, and n = tenure in number of months.

For the max and min EMI amounts we use the maximum and minimum interest rates respectively with the current tenor; for EMI_AMOUNT we use the original interest rate and original tenor.

# Rates in the dataset appear to be annual percentages (roughly 10-25%), so convert
# to a monthly decimal rate (assumption); tenors are already in months.
r_max = df['CURRENT_INTEREST_RATE_MAX'] / 1200
r_min = df['CURRENT_INTEREST_RATE_MIN'] / 1200
r_org = df['ORIGNAL_INTEREST_RATE'] / 1200
n_cur = df['CURRENT_TENOR']
n_org = df['ORIGNAL_TENOR']

# EMI = P * r * (1 + r)^n / ((1 + r)^n - 1)
df['MAX_EMI_AMOUNT'] = df['MAX_EMI_AMOUNT'].fillna(
    df['LOAN_AMT'] * r_max * (1 + r_max)**n_cur / ((1 + r_max)**n_cur - 1))

df['MIN_EMI_AMOUNT'] = df['MIN_EMI_AMOUNT'].fillna(
    df['LOAN_AMT'] * r_min * (1 + r_min)**n_cur / ((1 + r_min)**n_cur - 1))

# Replace incorrect EMI_AMOUNT values of 0 with the computed EMI
emi_calc = df['LOAN_AMT'] * r_org * (1 + r_org)**n_org / ((1 + r_org)**n_org - 1)
df['EMI_AMOUNT'] = df['EMI_AMOUNT'].mask(df['EMI_AMOUNT'] == 0, emi_calc)

DIFF_EMI_AMOUNT_MAX_MIN – This is the difference between the maximum and minimum EMI amounts. Since we have already filled the missing values for the min and max EMI amounts, this one is easy to calculate:

DIFF_EMI_AMOUNT_MAX_MIN = MAX_EMI_AMOUNT - MIN_EMI_AMOUNT
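In pandas this can be recomputed directly (a sketch):

# Recompute the max-min EMI difference wherever it is missing
df['DIFF_EMI_AMOUNT_MAX_MIN'] = df['DIFF_EMI_AMOUNT_MAX_MIN'].fillna(
    df['MAX_EMI_AMOUNT'] - df['MIN_EMI_AMOUNT'])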

LAST_RECEIPT_AMOUNT – This is the last received amount, and technically it should equal the EMI amount. We fill null values in this column with the corresponding EMI_AMOUNT.

LATEST_TRANSACTION_MONTH – This is a categorical variable, and we replace missing values with the most frequent value, which is 12.
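A sketch of both imputations:

# LAST_RECEIPT_AMOUNT: fall back to the corresponding EMI amount
df['LAST_RECEIPT_AMOUNT'] = df['LAST_RECEIPT_AMOUNT'].fillna(df['EMI_AMOUNT'])

# LATEST_TRANSACTION_MONTH: fill with the most frequent month (12)
df['LATEST_TRANSACTION_MONTH'] = df['LATEST_TRANSACTION_MONTH'].fillna(12)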

After all imputations,


Outlier Treatment
An outlier is a data point that differs significantly from other observations. An outlier may be due to variability
in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
An outlier can cause serious problems in statistical analyses.

We have already seen boxplots showing outliers in univariate analysis for continuous variables.

• The presence of data points beyond the whiskers/fences doesn't necessarily mean they are erroneous observations.
• The rule a box plot follows to flag an outlier is: any point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is an outlier.

We need to get rid of outliers before model building.

In the given dataset, we have seen that except for NET_LTV, all other continuous variables have outliers, and to treat them we can write a function:

import numpy as np

def remove_outlier(col):
    # IQR-based fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
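A sketch of applying these fences to cap each continuous column (NET_LTV is excluded since it showed no outliers):

# Cap every numeric column (except NET_LTV and the target) at its IQR fences
for col in df.select_dtypes(include='number').columns:
    if col in ('NET_LTV', 'FORECLOSURE'):
        continue
    low, high = remove_outlier(df[col])
    df[col] = df[col].clip(lower=low, upper=high)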

Attributes boxplot after removal of outliers,


Variable Transform
FOIR – Fixed obligation to income ratio (the value should range from 0 to 1; derived variable). We have seen in the univariate analysis that the values in the given dataset are incorrect.

An applicant with higher FOIR is offered a smaller loan amount due to the smaller EMI limit and greater risk of
default in payment.

FOIR = total obligations (i.e. debt and living expenses) / net monthly salary

The given dataset does not have any information related to living expenses and monthly salary, so we cannot recalculate it.

FOIR varies from bank to bank and from case to case, but on average it should be between 0.4 and 0.55.

In the given dataset, we will replace the values that are not between 0 and 1 with the average standard value of 0.5.
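A sketch of this replacement in pandas:

# Replace FOIR values outside the valid 0-1 range with the standard value 0.5
df.loc[(df['FOIR'] < 0) | (df['FOIR'] > 1), 'FOIR'] = 0.5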

Before imputation,

After imputation,
After outlier treatment,

CITY – The city attribute has a few incorrectly spelled city names, or some cities appear under different names. We have imputed these city names with the correct values.

Given City Name | Correct Name
BHIWADI, BHIWANDI | BHIWANDI
CHAMARAJANAGAR, CHAMRAJNAGAR | CHAMRAJNAGAR
CHENGALPATTU, CHENGALPET | CHENGALPATTU
DHARAMPUR, DHARMAPURI | DHARAMPUR
KANCHEEPURAM, KANCHIPURAM | KANCHIPURAM
KRISHNA, KRISHNAGIRI | KRISHNAGIRI
PONDICHERRY, PONDICHERRY 1 | PONDICHERRY
PUDUKKOTTAI, PUDUKOTTAI | PUDUKOTTAI
VIJAYAWADA, VIJAYWADA | VIJAYWADA
VILLUPPURAM, VILLUPURAM | VILLUPURAM
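A sketch of applying these corrections with a replacement dictionary, following the mapping in the table above:

# Map variant / misspelled city names to a single correct spelling
city_map = {
    'BHIWADI': 'BHIWANDI',
    'CHAMARAJANAGAR': 'CHAMRAJNAGAR',
    'CHENGALPET': 'CHENGALPATTU',
    'DHARMAPURI': 'DHARAMPUR',
    'KANCHEEPURAM': 'KANCHIPURAM',
    'KRISHNA': 'KRISHNAGIRI',
    'PONDICHERRY 1': 'PONDICHERRY',
    'PUDUKKOTTAI': 'PUDUKOTTAI',
    'VIJAYAWADA': 'VIJAYWADA',
    'VILLUPPURAM': 'VILLUPURAM',
}
df['CITY'] = df['CITY'].replace(city_map)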
Datatype change

As seen in univariate analysis, many columns are categorical in nature but defined as float or int. For
example, tenure related attributes and schemeid. We have converted all such columns to object type
using “.astype” method.

Columns converted to object type:

'CURRENT_INTEREST_RATE_CHANGES', 'DIFF_AUTH_INT_DATE', 'DPD', 'DUEDAY', 'LATEST_TRANSACTION_MONTH', 'NUM_EMI_CHANGES', 'NUM_LOW_FREQ_TRANSACTIONS', 'ORIGNAL_TENOR', 'BALANCE_TENURE', 'COMPLETED_TENURE', 'CURRENT_TENOR', 'DIFF_ORIGINAL_CURRENT_TENOR', 'MOB', 'SCHEMEID', 'FORECLOSURE'
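A sketch of this conversion:

# Treat these numeric-looking columns as categorical (object) for EDA
cat_cols = ['CURRENT_INTEREST_RATE_CHANGES', 'DIFF_AUTH_INT_DATE', 'DPD',
            'DUEDAY', 'LATEST_TRANSACTION_MONTH', 'NUM_EMI_CHANGES',
            'NUM_LOW_FREQ_TRANSACTIONS', 'ORIGNAL_TENOR', 'BALANCE_TENURE',
            'COMPLETED_TENURE', 'CURRENT_TENOR', 'DIFF_ORIGINAL_CURRENT_TENOR',
            'MOB', 'SCHEMEID', 'FORECLOSURE']
df[cat_cols] = df[cat_cols].astype('object')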

Normalizing and Scaling

In the NBFC dataset, the variables are on very different scales: one variable is in millions while another is in the hundreds. For example, LOAN_AMT has values in millions, while the INTEREST_RATE related variables have just two digits. Since the data in these variables are on different scales, it is hard to compare them.

Feature scaling (also known as data normalization) is the method used to standardize the range of
features of data. Since, the range of values of data may vary widely, it becomes a necessary step in data
preprocessing while using machine learning algorithms.

In this method, we convert variables with different scales of measurements into a single scale.

StandardScaler standardizes the data using the formula (x - mean) / standard deviation.

We will be doing this only for the numerical variables.

Before scaling

from sklearn.preprocessing import StandardScaler

std_scale = StandardScaler()

# Standardize only the numeric (non-object) columns: z = (x - mean) / std
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = std_scale.fit_transform(df[num_cols])
After scaling,

We can see that the principal amount, EMI amount, etc. are now all on the same scale.

ENCODING

Most of the machine learning models are designed to work on numeric data. Hence, we need to convert
categorical text data into numerical data for model building.

Categorical variables are of two types

Ordinal: where categories have order and you can arrange them in ascending or descending order.
Nominal: Without any order or ranks like city names, products, etc

Label encoding:

In label encoding, we map each category to a number or a label. The labels chosen for the categories
have no relationship.

One-Hot encoding

In one-hot encoding, a dummy attribute is created for each unique category, and it is assigned the value 1 or 0 depending on the category value.

One-hot encoding results in high dimensionality when a column has many unique values, and high dimensionality is a curse for machine learning algorithms/models.

For NBFC data, we will use label encoding,

Let us look at one sample column – product


Variable Addition
With the current dataset, we do not need to add any new or derived variables at this point. Derived fields are highly correlated with the underlying values they are built from and are mostly not useful if those underlying values are also used in model building. If needed, we can add variables later depending on requirements.

Business Insights from EDA


DATA Imbalance

Classification predictive modeling involves predicting a class label for a given observation.

An imbalanced classification problem is an example of a classification problem where the distribution of


examples across the known classes is biased or skewed.

The NBFC dataset also has imbalanced classes. We are predicting whether a loan will be foreclosed or not, which is a binary classification problem.

FORECLOSURE is target variable

0 – Indicating loan not foreclosed


1 – Indicating loan is foreclosed

Record distribution for variable FORECLOSURE

FORECLOSURE | RECORD COUNT | % OF TOTAL RECORDS
0 | 18217 | 91.03038
1 | 1795 | 8.96962

The ratio is approximately 90:10, indicating that roughly 1 in 10 loans is foreclosed. This happens because loan defaulters are far less common than non-defaulters. Only rarely, during economic crises (like the 2008 recession), might we see data with many defaulters; in general, defaulters are far fewer than non-defaulters.

The class or classes with abundant examples are called the major or majority classes – class 0 in Foreclosure,
whereas the class with few examples (and there is typically just one) is called the minor or minority class – class
1 in Foreclosure

Most machine learning algorithms work best when the number of samples in each class is about equal, because most algorithms are designed to maximize accuracy and reduce error.
We can deal with the class imbalance to build better models.
Methods for dealing with class imbalance:

1) Change the performance metric


Accuracy is not the best metric to use when evaluating imbalanced datasets as it can be very misleading.
Use other metrics like F1 Score , Recall etc. for model evaluation.

2) Change the algorithm


While in every machine learning problem, it’s a good rule of thumb to try a variety of algorithms, it can
be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced
data. They work by learning a hierarchy of if/else questions and this can force both classes to be
addressed.

3) Resampling Techniques — Oversample minority class


Oversampling can be defined as adding more copies of the minority class. Oversampling can be a good
choice when you don’t have a ton of data to work with.
We can use the resampling module from Scikit-Learn to randomly replicate samples from the minority
class. Always split into test and train sets before trying oversampling techniques. Oversampling before
splitting the data can allow the exact same observations to be present in both the test and train sets. This
can allow our model to simply memorize specific data points and cause overfitting and poor
generalization to the test data.

4) Resampling techniques — Undersample majority class


Undersampling can be defined as removing some observations of the majority class. Undersampling can
be a good choice when you have a ton of data -think millions of rows. But a drawback is that we are
removing information that may be valuable. This could lead to underfitting and poor generalization to
the test set.

5) Generate synthetic samples


A technique similar to up sampling is to create synthetic samples. Here we will use imblearn’s SMOTE
or Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors’ algorithm to
generate new and synthetic data we can use for training our model.
Again, it’s important to generate the new samples only in the training set to ensure our model
generalizes well to unseen data.

For the NBFC dataset in our current project, we will use different models, different performance metrics, and the SMOTE technique to treat the class imbalance.
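A minimal sketch of applying SMOTE only to the training split (assuming imbalanced-learn is installed and X, y are the encoded features and target):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X = df.drop(columns=['FORECLOSURE'])
y = df['FORECLOSURE']

# Split first, then oversample only the training data to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print(y_train_res.value_counts())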

Insights

The dataset from the NBFC is an interesting one. Most of the insights have already been provided during the univariate and bivariate analysis. Additionally, if the company could collect other attributes describing the customer profile, such as salary, expenses and age, which indicate repayment capacity, the models would be more robust, as values like loan amount, interest rate and tenure largely depend on these.
Metro cities have more customers and more loan foreclosures. This might be due to their dynamic and fast-changing environments. The NBFC can take more precautions while providing loans in metro cities, for example by charging a higher interest rate or making insurance mandatory for the customer in case of unfortunate events.
It has also been observed that certain kinds of products have more defaulters, but we do not have separate product details available. If product details were available, we could study them and find patterns or relational factors within each product.
