Capstone Project NBFC Loan Foreclosure Prediction
Project Notes - I
Introduction
Problem Statement
Predicting Loan Foreclosure
The main business of banks and Non-Banking Financial Companies (NBFCs) is to lend money to interested
individuals or businesses, called “Borrowers”, against assets pledged as “Collateral” for the loan.
Borrowers then repay the loan amount in installments, with an agreed rate of interest on the loan amount.
An NBFC makes its profit mainly through the interest paid by borrowers. Sometimes borrowers fail to repay
the loan and interest to the lending institution. This is called loan default. In the event of default, the lending
institution has the right to recover the loan balance from the defaulted borrower by forcibly selling the asset
used as collateral. This process is called foreclosure.
The impacts of foreclosures are widespread and costly, not only for borrowers but also for lenders, servicers,
insurers and cities.
Lenders want a way to avoid foreclosures. If the analysis can predict “FORECLOSURE” correctly in
advance, the NBFC can take the actions required to avoid foreclosure and retain customers.
This study aims to predict loan defaulters from the available loan and borrower data.
Based on the predictions, the NBFC can take preventive measures to avoid foreclosure, such as
offering different repayment options and schemes, and try to retain customers.
By avoiding foreclosures, the NBFC can increase profits, save time on legal foreclosure battles, and make
the business more sustainable.
Beyond borrowers, lenders, servicers and insurers, foreclosures also indirectly affect the wider economy
and public money.
The United States subprime mortgage crisis was a nationwide financial crisis that occurred
between 2007 and 2010, and contributed to the U.S. financial crisis. It was triggered by a large
decline in home prices after the collapse of a housing bubble, leading to mortgage delinquencies,
foreclosures, and the devaluation of housing-related securities. Declines in residential investment
preceded the Great Recession and were followed by reductions in household spending and then
business investment.
Borrowers:
A damaged credit rating. Poor credit resulting from foreclosure often becomes a barrier for
borrowers who want to obtain a new loan or expand their business.
Loan servicers:
For loan servicers, the income stream from servicing fees stops when borrowers halt payments.
Insurers:
Many borrowers have insurance on the loan, so a foreclosure that sells the collateral also affects
insurers. The amount of loss equals the outstanding principal plus all expenses
incurred, less the proceeds from the sale of the collateral.
Alternative to Foreclosure
Lenders have several workout options to help borrowers avoid default and to stop
foreclosure.
Reinstatement:
Accepting the total amount of back interest and principal owed by a specific date. This option is
often combined with forbearance.
Forbearance:
Reducing or suspending payments for a short period, after which another option is agreed upon
to bring the loan current. A forbearance option is often combined with a reinstatement, when it is
known that the borrower will have enough money to bring the account current at a specific time
in the future. The money might come from a bonus, investment, insurance settlement, or a tax
refund.
Repayment Plan:
With a repayment plan, the lender agrees to add, for example, half the amount of the first missed
payment onto each of the next subsequent two payments. These plans provide some relief for
borrowers with short-term financial problems.
Loan Modifications:
If the borrower can make the payments on the loan, but does not have enough money to bring the
account current or cannot afford the total amount of the current payment, a change to one or
more of the original loan terms may make the payments more affordable.
Claim Advance:
If the mortgage is insured, the borrower may qualify for an interest-free loan from the insurer to
bring the account current. Full repayment of this loan may be delayed for several years.
These alternatives benefit both lenders and borrowers and help keep the economy stable.
Data Details:
The NBFC has provided a dataset of aggregated loan transaction data for its customers.
The loans have authorization dates between Aug 2010 and Dec 2018.
The target attribute is “FORECLOSURE”, which is binary and indicates whether the loan is foreclosed or
not.
Column Name | Description | Data Type | Null Values Present | Business Importance
AGREEMENTID | Agreement ID of the loan account (a customer can have multiple loans) | int64 | - | No - Just an Identifier
AUTHORIZATIONDATE | Authorization date of the loan | datetime64 | - | Yes (Can be used for time period calculation)
BALANCE_EXCESS | Balance of excess amount | float64 | - | Yes
BALANCE_TENURE | Remaining tenure | int64 | - | Yes
CITY | City of origination | object | - | Yes
COMPLETED_TENURE | Completed tenure | int64 | - | Yes
CURRENT_INTEREST_RATE | Current rate of interest on the loan. Renamed field (Old Name: CURRENT_ROI) | float64 | - | Yes
CURRENT_INTEREST_RATE_MAX | Maximum value of the CURRENT ROI across transactions | float64 | - | Yes
CURRENT_INTEREST_RATE_MIN | Minimum value of the CURRENT ROI across transactions | float64 | - | Yes
CURRENT_INTEREST_RATE_CHANGES | Number of times the CURRENT ROI has changed | int64 | - | Yes
CURRENT_TENOR | Current tenor of the loan | int64 | - | Yes
CUSTOMERID | Unique Customer ID given to each customer | float64 | Yes | No - Just an Identifier
DIFF_AUTH_INT_DATE | Difference between authorization and interest start date | int64 | - | Yes
DIFF_CURRENT_INTEREST_RATE_MAX_MIN | Difference between the maximum and minimum interest rate per agreement | float64 | - | Yes
DIFF_EMI_AMOUNT_MAX_MIN | Difference between maximum and minimum EMI AMOUNT | float64 | Yes | Yes
DIFF_ORIGINAL_CURRENT_INTEREST_RATE | Difference in original ROI and current ROI (ORIGNAL_ROI - CURRENT_ROI) | float64 | - | Yes
DIFF_ORIGINAL_CURRENT_TENOR | Difference in original and current tenor (ORIGNAL_TENOR - CURRENT_TENOR) | int64 | - | Yes
DPD | Days past due | int64 | - | Yes
DUEDAY | Next due date of the loan | int64 | - | Yes
EMI_AMOUNT | Mode of the receipt amount | float64 | - | Yes
EMI_DUEAMT | EMI due amount | float64 | - | Yes
EMI_OS_AMOUNT | EMI outstanding amount | float64 | - | Yes
EMI_RECEIVED_AMT | EMI received amount | float64 | - | Yes
EXCESS_ADJUSTED_AMT | Excess adjusted amount | float64 | - | Yes
EXCESS_AVAILABLE | Excess received | float64 | - | Yes
FOIR | Fixed obligation to income ratio (Value should range from 0-1 – Derived variable) | float64 | - | Yes
INTEREST_START_DATE | Interest start date on the loan | datetime64 | - | Yes (Can be used for time period calculation)
LAST_RECEIPT_AMOUNT | Last receipt amount | float64 | Yes | Yes
LAST_RECEIPT_DATE | Last receipt date | datetime64 | Yes | Yes (Can be used for time period calculation)
LATEST_TRANSACTION_MONTH | Month of last receipt date. In case account is Foreclosed, it will be month of Foreclosure | float64 | Yes | Yes
LOAN_AMT | Loan amount which was sanctioned | float64 | - | Yes
MAX_EMI_AMOUNT | Maximum receipt amount | float64 | Yes | Yes
MIN_EMI_AMOUNT | Minimum receipt amount | float64 | Yes | Yes
MONTHOPENING | Month of opening | float64 | - | Yes
NET_DISBURSED_AMT | Amount that was disbursed | float64 | - | Yes
NET_LTV | Net Loan to Value ratio (Value ranges from 0-100 (in %) – Derived variable) | float64 | - | Yes
NET_RECEIVABLE | Net receivable: (EMI_DUEAMT - EMI_RECEIVED_AMT = EMI_OS_AMOUNT) + (EXCESS_AVAILABLE - EXCESS_ADJUSTED_AMT = BALANCE_EXCESS) = NET_RECEIVABLE | float64 | - | Yes
NUM_EMI_CHANGES | Number of different values in the receipts amount | int64 | - | Yes
NUM_LOW_FREQ_TRANSACTIONS | Number of transactions done in less than 28 days | int64 | - | Yes
Descriptive statistics:
FORECLOSURE is the target variable. It is binary and indicates whether the loan is foreclosed or not.
Although the variable's datatype is integer, it is categorical in nature, with possible values 0 or 1.
FORECLOSURE | RECORD COUNT | % OF TOTAL RECORDS
0 | 18217 | 91.03038
1 | 1795 | 8.96962
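The class split above can be reproduced with a short pandas sketch. The small frame below is only a stand-in rebuilt from the reported counts; in practice `df` would be the NBFC dataset itself, and the output column names are illustrative.

```python
import pandas as pd

# Stand-in for the NBFC data, rebuilt from the counts reported above
df = pd.DataFrame({"FORECLOSURE": [0] * 18217 + [1] * 1795})

# Record count and share of total records per class
counts = df["FORECLOSURE"].value_counts().sort_index()
summary = pd.DataFrame({
    "RECORD_COUNT": counts,
    "PCT_OF_TOTAL": (counts / counts.sum() * 100).round(5),
})
print(summary)
```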
• Independent Variables
The top 5 cities with the highest number of customers opting for loans are
A few city names are spelled incorrectly, so that one city appears as two different cities. These need to be
transformed into a single correct city name. Record counts are as follows:
LAP 6226
HL 3482
STLAP 3036
Most loan foreclosures are under STHL and LAP has higher number of customers compared to HL and
STLAP .
SCHEME ID Count
10901104 2359
10901106 1463
10901295 1090
10901112 1019
10901287 1018
The dataset provides no further details about the schemes, so we will ignore scheme details. This
variable also has null values, which need to be addressed.
NPA_IN_LAST_MONTH Count
0 102
Yes 15
#N/ 2
NPA_IN_CURRENT_MONTH Count
0 103
Yes 16
Both of these variables should be binary, indicating whether or not the loan is a non-performing asset.
They need to be transformed, and their null values need to be filled in.
No. of times interest rate has changed | Count
0 12496
2 4404
1 1934
3 599
4 342
5 175
6 48
7 12
9 1
8 1
We can see that for ~62% of customers the interest rate has never changed (value 0). For ~32% of
customers the interest rate has changed 1 or 2 times. More than 2 changes is less
frequent.
DIFF_AUTH_INT_DATE - Although this variable contains numeric data, it is categorical in nature. It
shows the number of days between the authorization date and the interest start date. We will convert it to
a categorical variable for future analysis.
DPD - Although this variable contains numeric data, it is categorical in nature. DPD stands for days past
due and shows the number of days a payment is overdue. We will convert it to a categorical variable for
future analysis.
The higher the DPD, the greater the chance of the loan being foreclosed. We will plot only records where DPD is
greater than 0. Most of these customers fall between 0-200 days. Beyond 90 days it is usually difficult to get
back on track.
DPD Count
0 18770
26 407
56 161
87 108
25 91
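One way to turn DPD into a categorical variable is to bucket it, as in the minimal sketch below. Only the 90-day threshold comes from the text above; the other bucket edges and the label names are assumptions chosen for illustration, and the sample values are made up.

```python
import pandas as pd

# Illustrative DPD values; in practice this would be df["DPD"]
dpd = pd.Series([0, 0, 26, 56, 87, 25, 120, 200], name="DPD")

# Bucket edges are assumptions for illustration (bins are right-inclusive)
bins = [-1, 0, 30, 60, 90, float("inf")]
labels = ["0", "1-30", "31-60", "61-90", "90+"]
dpd_bucket = pd.cut(dpd, bins=bins, labels=labels)

print(dpd_bucket.value_counts().sort_index())
```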
DUEDAY - Although this variable contains numeric data, it is categorical in nature. DUEDAY shows the day
of the month on which the EMI is due. It could be anywhere between 1 and 31, but in this dataset we can see
that the NBFC has fixed the payment day to the 1st, 5th or 15th of the month. We will convert it to a
categorical variable for future analysis.
DUEDAY Count
5 18343
15 1587
1 82
Month Count
12 15203
11 658
8 556
9 460
10 421
7 413
6 392
3 384
4 384
1 378
2 350
5 338
Since this variable shows either the month of the last receipt date or the month of foreclosure, it offers
few insights on its own. The spike in December may simply mean that the data was collected in January and the
last receipt date for most transactions falls in December.
We will need bivariate analysis of this variable against foreclosure to find a pattern.
Most customers have an original tenor between 170 and 250 months, i.e. roughly 15-20 years.
~60% of customers have a completed tenure of less than 20 months, showing that the dataset consists mostly
of new customers.
~80% of customers have a current tenor between 100 and 300 months, which differs from the original tenor.
MOB - Although this variable contains numeric data, it is categorical in nature. It is an internal id. We will
convert it to a categorical variable for future analysis.
Univariate Analysis for Continuous Variables
BALANCE_EXCESS – It shows excess balance amount.
It is clearly visible that for most transactions the amount is 0. This column has outliers.
No clear linear increasing or decreasing pattern. The current interest rate ranges between 10-25%, with most of the
data between 13-16% and very few outliers.
No clear linear increasing or decreasing pattern. The maximum current interest rate ranges between 10-40%, with
most of the data between 13-16% and a few outliers.
No clear linear increasing or decreasing pattern. The minimum current interest rate ranges between -5% and 25%, with
most of the data between 13-16% and a few outliers.
A negative interest rate effectively means that the lender pays the borrower to take money off its hands, so the
borrower pays back less than was loaned. This scenario occurs very rarely. When a loan is at risk of foreclosure, the
institution may restructure the loan and its interest rate to help the borrower repay more easily and avoid foreclosure.
Under a negative interest rate the borrower makes a monthly repayment as usual, but the amount still
outstanding is reduced each month by more than the borrower has paid.
Most customers choose a floating interest rate rather than a fixed one. With a fixed interest rate, the maximum
and minimum are the same and the difference is 0. With a floating interest rate, the lender offers a range of
interest rates, generally not very wide, and the rate may increase or decrease within that range depending on
economic conditions. A very wide range is riskier for both lender and borrower.
Most customers have either a fixed rate (difference between min and max = 0) or a difference of 0-3.
A large difference is rare. The dataset has many outliers.
Original interest rate ranges between 8-28% with most of customers having interest rate between 12.5 -16 %
and very few outliers.
It ranges between -7.5 and +10. The dataset has many outliers. A negative value indicates that the current interest rate
is higher than the original, and a positive value that it is lower. A small or zero difference indicates a more stable loan.
An EMI amount of 0 seems incorrect for a loan that is not yet paid off. We need to impute this.
This variable exhibits the same pattern as EMI_AMOUNT, with many outliers. Let us plot only smaller values, i.e.
EMI due amounts less than 100k.
EMI_OS_AMOUNT – Indicates outstanding emi amount.
More than 95% of records have an outstanding EMI amount of 0, which is a good sign: borrowers are paying on
time and nothing is outstanding. There are a few outliers with a very high outstanding amount, which may indicate
foreclosure. We will check this in bivariate analysis.
This variable exhibits the same pattern as EMI_DUEAMT, with many outliers. Let us plot only smaller values, i.e.
EMI received amounts less than 100k. The EMI_DUEAMT and EMI_RECEIVED_AMT graphs match closely,
as expected.
MAX_EMI_AMOUNT – Maximum emi amount.
A few max EMI amounts are very large, and the dataset has outliers. Let us plot values with a max EMI amount less
than 100k.
A maximum EMI amount of 0 seems incorrect for a loan that is not yet paid off. We need to impute this.
A few min EMI amounts are very large, and the dataset has outliers. Let us plot values with a min EMI amount less than
100k.
A minimum EMI amount of 0 seems incorrect for a loan that is not yet paid off. We need to impute this.
This indicates the pre-EMI due amount and has outliers. Let us plot values with a pre-EMI due amount less than 100k.
Most customers have a pre-EMI due amount of less than 20k.
PRE_EMI_OS_AMOUNT– Pre EMI outstanding amount
Most of the values are 0, which is a good sign: nothing is outstanding. A few very large values exist in the dataset.
This indicates the pre-EMI received amount and has outliers. Let us plot values with a pre-EMI received amount less than
100k. This closely matches PRE_EMI_DUEAMT, as expected.
EXCESS_ADJUSTED_AMT– Excess adjusted amount
FOIR – Fixed obligation to income ratio (Value should range from 0-1 – Derived variable)
FOIR values should lie between 0 and 1, but the graph shows that the dataset has a few incorrect values, which
need to be replaced with a valid value. We will do this in a later part of the report.
LAST_RECEIPT_AMOUNT– Last received amount
This variable has a few large values. Let us plot values less than 100k.
This variable has a few large values. Let us plot values less than 5 million. Most loan amounts are
between 1 and 3 million.
MONTHOPENING – The data description says month of opening, which should be between 1-12 (Jan-Dec),
but the dataset has amount values. We assume this is the opening balance for the month.
This variable also has a lot of outliers. Let us plot data where the opening balance is < 5 million.
This is somewhat similar to LOAN_AMT. If bivariate analysis shows that the two are correlated, we can
eliminate one of them.
This variable also has outliers. Let us plot values less than 5 million. In general this should closely match
LOAN_AMT.
NET_LTV– Net Loan to Value ratio (Value ranges from 0-100 (in %) – Derived variable)
This variable is fairly evenly distributed and has no outliers. Most records have an LTV ratio between
35-65%.
The loan-to-value (LTV) ratio is an assessment of lending risk that financial institutions and other lenders
examine before approving a mortgage. Loans with high LTV ratios are typically considered higher
risk. Therefore, if such a loan is approved, it carries a higher interest rate.
Net receivables refer to the net amount of money remaining after deducting the provision for bad debt. The measure is
primarily used in businesses that sell on credit.
Net Receivables = (Total Amount Borrowed By Customers) - (Amount Borrowed By Customers That Will
Never Be Repaid)
The higher the net receivables, the more the company needs to collect from customers.
Here we see a few very large values. Let us plot values less than 5 million.
Since most of these loans belong to new customers, little has been repaid yet and the outstanding principal is
close to the loan amount. In the ideal scenario, the outstanding principal reduces towards the end of the tenure.
Here we see a few very large values. Let us plot values less than 2 million.
Bivariate Analysis
Most attributes in the dataset are highly correlated with each other because of how they are defined
and used. Several variables belong to one parent category, such as loan amount, EMI, tenure or interest
rate. These parent categories also depend on each other or contain derived values.
We will do bivariate analysis for the few combinations that are most suitable and can provide business insights.
CITY count w.r.t FORECLOSURE
FORECLOSURE | CITY | Count | % OF TOTAL FORECLOSURE=1
1 | MUMBAI | 353 | 19.66
1 | HYDERABAD | 165 | 9.19
1 | PUNE | 151 | 8.41
1 | CHENNAI | 109 | 6.07
1 | AHMEDABAD | 90 | 5.01
It is easily visible that the big metro cities have the highest number of customers and the highest total loan amounts.
The number of foreclosed loans is also higher in metro cities than in the other cities in the dataset.
Metro cities have higher total loan amounts because they also have more customers.
FORECLOSURE | PRODUCT NAME | Count | % OF TOTAL FORECLOSURE=1
1 | HL | 990 | 55.153
1 | STHL | 803 | 44.735
1 | LAP | 2 | 0.111
Most loan foreclosures are under the HL and STHL products, with HL being the highest.
SCHEMEID w.r.t FORECLOSURE
Although SCHEMEIDs 10901291, 10901142 and 10901251 are not the top sellers, they have more loan
foreclosures than other SCHEMEIDs.
The scatterplot clearly shows that the higher the loan amount, the higher the EMI, which is expected. No clear
pattern is visible for foreclosed loans, but we can see that loans with the highest amounts are not foreclosed.
Since EMI_DUEAMT and EMI_RECEIVED_AMT are closely related to EMI_AMOUNT, they exhibit a
similar pattern.
LOAN_AMT vs CURRENT_INTEREST_RATE w.r.t FORECLOSURE
There is no clearly visible pattern between loan amount and interest rate. Irrespective of loan amount, the interest
rate can vary widely. The same applies to CURRENT_INTEREST_RATE_MAX,
CURRENT_INTEREST_RATE_MIN and CURRENT_INTEREST_RATE_CHANGES, as they are closely related.
The pattern is similar to that of CURRENT_TENOR. A similar pattern will be visible for balance tenure. But there is no
clear pattern between LOAN_AMT and tenure.
Let us look at a pair plot. Since we have many attributes, we will plot only selected attributes.
The pair plot shows that none of the attributes clearly shows a pattern or helps in identifying the class.
Very few attributes show a linear relation, LOAN_AMT and EMI_AMOUNT being one example. Most of the attributes are
left skewed.
Let us look at the heatmap.
As expected, CURRENT_INTEREST_RATE, CURRENT_INTEREST_RATE_MAX,
CURRENT_INTEREST_RATE_MIN and ORIGINAL_INTEREST_RATE are highly correlated with each other.
EMI_AMOUNT is correlated with principal paid, interest paid and interest rate, which is technically
correct.
Highly correlated attributes add little to model building and can hurt model performance. We will need
to remove correlated attributes before model building, using methods such as the variance inflation factor (VIF)
together with p-values and variable importance to decide which variables to drop.
The other correlations are weak, which is good for model building.
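The VIF check mentioned above can be sketched with NumPy and pandas alone, using the fact that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix. The helper name `vif_table`, the synthetic data and its column names are all assumptions for illustration; a cut-off of around 5 to 10 is a common rule of thumb rather than something stated in this report.

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF").sort_values(ascending=False)

# Synthetic data: rate_max is almost a copy of rate, loan_amt is independent
rng = np.random.default_rng(0)
rate = rng.normal(14, 2, 500)
demo = pd.DataFrame({
    "rate": rate,
    "rate_max": rate + rng.normal(0, 0.1, 500),
    "loan_amt": rng.normal(2e6, 5e5, 500),
})
vif = vif_table(demo)
print(vif)
```

In this toy data the two interest-rate columns get a very large VIF while the independent column stays close to 1, which is the signal used to drop one of a correlated pair.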
Removal of unwanted variables
A few variables in the dataset have no business value. Such variables are not useful
in model building and need to be dropped. They are mostly identifiers or, sometimes, duplicate columns.
The NBFC dataset contains a few unwanted variables which can safely be dropped before model building.
AUTHORIZATIONDATE – This is the loan authorization date. Date values can sometimes be used to derive
columns, by comparing the date with a fixed point in the past or future and using the difference as a measure of
recency, but for this dataset and business scenario that does not seem useful.
INTEREST_START_DATE – This is the interest start date, and it is not useful for the given business problem.
LAST_RECEIPT_DATE – This is the last payment receipt date, and it is not useful for the given business problem.
There are many ways to impute null / missing values. The choice depends on the type of data being
handled.
For categorical / discrete variables, we generally impute missing or null values with the MODE of the data, i.e. the
most frequent value.
For continuous variables, we impute null / missing values with the MEAN or MEDIAN. The mean is affected by
outliers in the data, as it is the average of all values. The median is much less affected by
outliers.
Other methods, such as formula-based calculation, interpolation and imputation with a constant, can also be
used depending on need.
We will decide the best strategy for each column of the NBFC dataset that needs imputation.
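The mode and median strategies described above might look like this in pandas. The frame is a toy example: SCHEMEID mirrors a real column, while `SOME_AMOUNT` is a hypothetical continuous column and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; SOME_AMOUNT is a hypothetical continuous column with one outlier
df = pd.DataFrame({
    "SCHEMEID": [10901104, 10901104, np.nan, 10901106],
    "SOME_AMOUNT": [12000.0, np.nan, 15000.0, 900000.0],
})

# Categorical / discrete: fill with the most frequent value (mode)
df["SCHEMEID"] = df["SCHEMEID"].fillna(df["SCHEMEID"].mode().iloc[0])

# Continuous: the median is barely moved by the 900000 outlier, unlike the mean
df["SOME_AMOUNT"] = df["SOME_AMOUNT"].fillna(df["SOME_AMOUNT"].median())
print(df)
```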
The table below lists the NBFC columns that contain null / missing values.
Column Name Null Values
NPA_IN_CURRENT_MONTH 19893
NPA_IN_LAST_MONTH 19893
SCHEMEID 281
LAST_RECEIPT_AMOUNT 247
DIFF_EMI_AMOUNT_MAX_MIN 89
MAX_EMI_AMOUNT 89
MIN_EMI_AMOUNT 89
LATEST_TRANSACTION_MONTH 75
NPA_IN_CURRENT_MONTH – This is categorical variable and Null in this case means Loan is not a
nonperforming asset which is good.
NPA_IN_CURRENT_MONTH Count
0 103
Yes 16
NULL => 0
Yes => 1
After impute,
NPA_IN_CURRENT_MONTH Count
0 19996
1 16
NPA_IN_LAST_MONTH – This is categorical variable and Null in this case means Loan is not a
nonperforming asset which is good.
NPA_IN_LAST_MONTH Count
0 102
Yes 15
#N/ 2
NULL => 0
Yes => 1
#N/ => 0
After impute,
NPA_IN_LAST_MONTH Count
0 19997
1 15
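The recoding rules above (null and unrecognised values to 0, "Yes" to 1) can be expressed as a single comparison. This is a minimal sketch on a toy column; the raw values simply mirror the kinds of entries listed above.

```python
import numpy as np
import pandas as pd

# Toy column with the kinds of raw values seen above (NaN, 0, "Yes", junk text)
raw = pd.Series([np.nan, 0, "Yes", "#N/", np.nan, "Yes"], name="NPA_IN_LAST_MONTH")

# "Yes" becomes 1; nulls, 0 and anything unrecognised become 0
npa_flag = raw.eq("Yes").astype(int)
print(npa_flag.value_counts())
```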
SCHEMEID– This is categorical variable and Null in this case can be imputed with most frequent value as we
do not have any other source / information available for this variable.
Most frequent value in SCHEMEID is 10901104. Null is replaced with this value in dataset.
After imputation,
MAX_EMI_AMOUNT and MIN_EMI_AMOUNT – These show the maximum and minimum EMI amounts. They are continuous
variables, and missing values in continuous variables are generally replaced with the mean, but we saw earlier that
related columns contain large values (outliers), which would distort the mean considerably.
We could replace them with the corresponding EMI_AMOUNT, but for the rows with NULL values
EMI_AMOUNT is 0, which is itself incorrect.
We therefore calculate the EMI from the standard formula
$$\mathrm{EMI} = \frac{P \, r \, (1 + r)^{n}}{(1 + r)^{n} - 1}$$
where P = loan amount, r = monthly interest rate as a fraction and n = tenure in number of months.
For EMI_AMOUNT we use the original interest rate and original tenor.
For the max and min EMI amounts we use the maximum and minimum interest rate respectively, with the current tenor.
def emi(loan_amt, annual_rate_pct, tenor_months):
    # Assumes the rate columns hold annual percentages (e.g. 14.5), converted here to a monthly fraction
    r = annual_rate_pct / 12 / 100
    growth = (1 + r) ** tenor_months
    return loan_amt * r * growth / (growth - 1)

df['MAX_EMI_AMOUNT'] = df['MAX_EMI_AMOUNT'].fillna(
    emi(df['LOAN_AMT'], df['CURRENT_INTEREST_RATE_MAX'], df['CURRENT_TENOR']))
df['MIN_EMI_AMOUNT'] = df['MIN_EMI_AMOUNT'].fillna(
    emi(df['LOAN_AMT'], df['CURRENT_INTEREST_RATE_MIN'], df['CURRENT_TENOR']))
# Replace EMI_AMOUNT only in rows where it is 0
df['EMI_AMOUNT'] = df['EMI_AMOUNT'].mask(
    df['EMI_AMOUNT'] == 0,
    emi(df['LOAN_AMT'], df['ORIGNAL_INTEREST_RATE'], df['ORIGNAL_TENOR']))
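As a quick sanity check on the EMI formula, here is a worked example for a hypothetical loan (the figures are not from the dataset): P = 1,000,000, an annual rate of 12% and n = 240 months, taking r as the monthly rate expressed as a fraction.

```latex
\begin{aligned}
r &= \frac{12}{12 \times 100} = 0.01, \qquad (1 + r)^{n} = 1.01^{240} \approx 10.89 \\
\mathrm{EMI} &= \frac{P \, r \, (1 + r)^{n}}{(1 + r)^{n} - 1}
  \approx \frac{1{,}000{,}000 \times 0.01 \times 10.89}{10.89 - 1} \approx 11{,}011
\end{aligned}
```

An EMI of roughly 1.1% of the principal per month is plausible for a 20-year loan at this rate, whereas using the percentage rate directly as r would give an instalment larger than the loan itself.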
DIFF_EMI_AMOUNT_MAX_MIN – This is the difference between the max and min EMI amounts. Since we have
already filled the missing values for the min and max EMI amounts, it is easy to calculate.
LAST_RECEIPT_AMOUNT – This is the last received amount, which should normally equal the EMI amount.
We fill null values in this column with the corresponding EMI_AMOUNT.
We have already seen boxplots showing outliers in the univariate analysis of continuous variables.
• The presence of data points beyond the whiskers/fences doesn't necessarily mean they are outliers
• The rule a box plot follows to flag an outlier is that "any point greater than Q3 +
1.5IQR or less than Q1 - 1.5IQR is an outlier"
In the given dataset we have seen that all continuous variables except NET_LTV have outliers. To
treat them, we can write a function.
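import numpy as np

def remove_outlier(col):
    # Returns the IQR fences for a column; values outside them are treated as outliers
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range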
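The function above only returns the fences, so a second step is needed to treat the values. One common option, assumed here for illustration since the report does not say which was used, is to cap values at the fences rather than drop rows. The helper name `iqr_bounds` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_bounds(col):
    """Lower and upper fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy values with one extreme point; in practice loop over the continuous columns
emi_amount = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
low, high = iqr_bounds(emi_amount)

# Capping keeps every row and pulls extreme values back to the fence
capped = emi_amount.clip(lower=low, upper=high)
print(low, high, capped.tolist())
```

Capping preserves the record count, which matters here because the foreclosure class is already small.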
An applicant with a higher FOIR is offered a smaller loan amount, because of the smaller EMI limit and the greater
risk of default in payment.
FOIR = Total obligations (i.e. debt and living expenses) / net monthly salary
The given dataset has no information on living expenses or monthly salary, so we cannot recalculate
it.
Acceptable FOIR varies from bank to bank and from case to case, but on average it should be between 0.4 and 0.55.
In the given dataset, we will replace values that are not between 0 and 1 with a typical value of 0.5.
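The FOIR rule above can be sketched in one line with `Series.where`. The values below are made up; in practice the series would be `df["FOIR"]`.

```python
import pandas as pd

# Toy FOIR values; 1.8 and -0.2 fall outside the valid 0-1 range
foir = pd.Series([0.35, 0.52, 1.8, -0.2, 0.47], name="FOIR")

# Keep values inside [0, 1]; replace everything else with 0.5
foir_clean = foir.where(foir.between(0, 1), 0.5)
print(foir_clean.tolist())
```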
Before impute,
After imputation,
After outlier treatment,
CITY – The city attribute has a few incorrectly spelled city names, and some cities appear under different names. We
have replaced these with the correct city names.
As seen in the univariate analysis, many columns are categorical in nature but defined as float or int, for
example the tenure-related attributes and SCHEMEID. We have converted all such columns to object type
using “.astype” method.
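A minimal sketch of the type conversion follows. The frame is a toy example and the list of columns to convert is an assumption based on the variables named in this report.

```python
import pandas as pd

# Toy frame; values are illustrative
df = pd.DataFrame({
    "DUEDAY": [5, 15, 1],
    "SCHEMEID": [10901104, 10901106, 10901295],
    "LOAN_AMT": [1.2e6, 2.5e6, 9.0e5],
})

# Columns that hold numbers but are categorical in nature (assumed list)
categorical_cols = ["DUEDAY", "SCHEMEID"]
df = df.astype({col: "object" for col in categorical_cols})
print(df.dtypes)
```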
In the NBFC dataset, variables are on very different scales: one variable is in
millions and another only in hundreds. For example, LOAN_AMT has values in millions while the
INTEREST_RATE related variables have just two digits. Since these variables are on
different scales, it is hard to compare them.
Feature scaling (also known as data normalization) is the method used to standardize the range of
features. Since the range of values may vary widely, it is a necessary step in data
preprocessing for many machine learning algorithms.
In this method, we convert variables with different scales of measurement onto a single scale.
Before scaling
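from sklearn.preprocessing import StandardScaler

std_scale = StandardScaler()
for column in df.columns:
    # Scale numeric predictors only; leave object columns and the target untouched
    if df[column].dtype != 'object' and column != 'FORECLOSURE':
        df[column] = std_scale.fit_transform(df[[column]]).ravel()
After scaling,
We can see that the principal amount and EMI amount are now on the same scale.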
ENCODING
Most machine learning models are designed to work on numeric data. Hence, we need to convert
categorical text data into numerical data for model building.
Categorical variables are of two types:
Ordinal: categories have an order and can be arranged in ascending or descending order.
Nominal: categories without any order or rank, such as city names or products.
Label encoding:
In label encoding, we map each category to a number or label. The numbers chosen for the categories
carry no inherent relationship between them.
One-Hot encoding
In one-hot encoding, a dummy attribute is created for each unique category and is assigned the value 1 or 0
depending on whether the record belongs to that category.
One-hot encoding results in high dimensionality when a column has many unique values. High
dimensionality is a problem for machine learning algorithms/models (the curse of dimensionality).
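The two encodings can be compared on a small example. The column name `PRODUCT` is hypothetical; the category values come from the product table earlier in the report.

```python
import pandas as pd

# Toy frame; the column name is an assumption, the categories are from the report
df = pd.DataFrame({"PRODUCT": ["HL", "STHL", "LAP", "HL"]})

# Label encoding: one integer code per category (codes follow sorted category order)
df["PRODUCT_LABEL"] = df["PRODUCT"].astype("category").cat.codes

# One-hot encoding: one 0/1 dummy column per category
one_hot = pd.get_dummies(df["PRODUCT"], prefix="PRODUCT").astype(int)
print(df.join(one_hot))
```

Label encoding keeps a single column but imposes an arbitrary numeric order, while one-hot encoding avoids that at the cost of one extra column per category, which is why a column like CITY or SCHEMEID can inflate the feature count.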
Classification predictive modeling involves predicting a class label for a given observation.
The NBFC dataset has imbalanced classes. We aim to predict whether or not a loan will be foreclosed, which is a
binary classification problem.
FORECLOSURE | RECORD COUNT | % OF TOTAL RECORDS
0 | 18217 | 91.03038
1 | 1795 | 8.96962
The ratio is approximately 91:9, i.e. roughly 1 in 11 records is a foreclosure. This happens because loan defaulters
are far less common than non-defaulters. Very rarely, during an economic crisis (like the 2008 recession), we may
see data with many defaulters, but in general defaulters are far fewer than non-defaulters.
The class or classes with abundant examples are called the major or majority classes (class 0 in FORECLOSURE),
whereas the class with few examples (and there is typically just one) is called the minor or minority class (class
1 in FORECLOSURE).
Most machine learning algorithms work best when the number of samples in each class is about equal. This is
because most algorithms are designed to maximize accuracy and reduce error.
We can deal with class imbalance to obtain better models.
Methods for dealing with class imbalance:
For the NBFC dataset in our current project, we will use different models, different performance metrics
and the SMOTE technique to treat class imbalance.
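To illustrate the idea behind SMOTE, the sketch below creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified NumPy version written for this note, not the implementation used in the project; the function name `smote_like`, its parameters and the toy data are all assumptions. In practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances inside the minority class only
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two features
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

Oversampling of this kind should be applied to the training split only, so that the test set keeps the real 91:9 class ratio.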
Insights
The NBFC dataset is an interesting one. Most of the insights have already been given in the
univariate and bivariate analysis. In addition, if the company could collect further attributes describing the
customer profile, such as salary, expenses and age, which indicate repayment capacity, the models would be
more robust, because values like loan amount, interest rate and tenure largely depend on these.
Metro cities have the most customers and the most loan foreclosures. This might be due to their dynamic,
fast-changing environments. The NBFC can take more precautions when lending in metro cities,
for example by charging a higher interest rate or by making insurance mandatory to cover unfortunate
events.
It has also been observed that some products have more defaulters than others, but product
details are not available separately. If product details were available, we could study them to find patterns
or related factors within each product.