
Linear Models Reading

October 31, 2017

For the theoretical background regarding these techniques, please refer to the theoretical discussion on linear models.
Now we'll discuss a case study solution using multiple linear regression and regularised linear regression [Ridge and Lasso]. We'll also look at hyperparameter tuning for regularised regression.
A little background on the case study. This data belongs to a loan aggregator agency which connects loan applications to different financial institutions in an attempt to get the best interest rate. They now want to utilise past data to predict the interest rate offered by any financial institution just by looking at loan application characteristics.
To achieve that, they have decided to do a POC with data from a particular financial institution. The data is given in the file "loans data.csv". Let's begin:

In [1]: data_file=r'/Users/lalitsachan/Dropbox/March onwards/Python Data Science/Data/loans data.csv'

import pandas as pd
import math
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
import numpy as np
from sklearn.cross_validation import KFold
%matplotlib inline

ld=pd.read_csv(data_file)

In [2]: ld.head()

Out[2]: ID Amount.Requested Amount.Funded.By.Investors Interest.Rate \


0 81174.0 20000 20000 8.90%
1 99592.0 19200 19200 12.12%
2 80059.0 35000 35000 21.98%
3 15825.0 10000 9975 9.99%
4 33182.0 12000 12000 11.71%

Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership \


0 36 months debt_consolidation 14.90% SC MORTGAGE
1 36 months debt_consolidation 28.36% TX MORTGAGE
2 60 months debt_consolidation 23.81% CA MORTGAGE
3 36 months debt_consolidation 14.30% KS MORTGAGE

4 36 months credit_card 18.78% NJ RENT

Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance \


0 6541.67 735-739 14 14272
1 4583.33 715-719 12 11140
2 11500.00 690-694 14 21977
3 3833.33 695-699 10 9346
4 3195.00 695-699 11 14469

Inquiries.in.the.Last.6.Months Employment.Length
0 2.0 < 1 year
1 1.0 2 years
2 1.0 2 years
3 0.0 5 years
4 0.0 9 years

You can see that the variables Interest.Rate and Debt.To.Income.Ratio contain a "%" sign in their values, which is why they have been read in as character columns. Let's remove these percentage signs first.

In [3]: for col in ["Interest.Rate","Debt.To.Income.Ratio"]:
            ld[col]=ld[col].astype("str")
            ld[col]=[x.replace("%","") for x in ld[col]]

In [4]: ld.dtypes

Out[4]: ID float64
Amount.Requested object
Amount.Funded.By.Investors object
Interest.Rate object
Loan.Length object
Loan.Purpose object
Debt.To.Income.Ratio object
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines object
Revolving.CREDIT.Balance object
Inquiries.in.the.Last.6.Months float64
Employment.Length object
dtype: object

We can see that many columns which should really have been numbers have been imported as character columns, probably because of some character values in those columns in the file. We'll convert all such columns to numeric.

In [5]: for col in ["Amount.Requested","Amount.Funded.By.Investors","Open.CREDIT.Lines","Revolving.CREDIT.Balance",
                    "Inquiries.in.the.Last.6.Months","Interest.Rate","Debt.To.Income.Ratio"]:
            ld[col]=pd.to_numeric(ld[col],errors="coerce")

In [6]: ld.dtypes

Out[6]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Loan.Length object
Loan.Purpose object
Debt.To.Income.Ratio float64
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length object
dtype: object

Next we will make dummy variables for the remaining categorical variables.

In [7]: ld["Loan.Length"].value_counts()

Out[7]: 36 months 1950


60 months 548
. 1
Name: Loan.Length, dtype: int64

The function get_dummies creates dummy variables for all the categories present in a categorical variable. The result is a dataframe; we can then drop the dummies we do not need and attach the selected ones back to our original data.

In [8]: ll_dummies=pd.get_dummies(ld["Loan.Length"])

In [9]: ll_dummies.head()

Out[9]: . 36 months 60 months


0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 0.0 1.0 0.0
4 0.0 1.0 0.0

We'll add the dummy variable for "36 months" to our data and ignore the other two.

In [10]: ld["LL_36"]=ll_dummies["36 months"]

Now that we're done with the dataframe ll_dummies, we can drop it. Below we demonstrate a general way of removing variables from the notebook environment.

In [11]: %reset_selective ll_dummies

Once deleted, variables cannot be recovered. Proceed (y/[n])? y

To see which variables are currently in the environment, you can use the magic command "who".

In [12]: who

KFold Lasso LinearRegression Ridge col data_file


pd train_test_split

Now that we have created dummies for Loan.Length, we need to remove it from the dataframe.

In [13]: ld=ld.drop('Loan.Length',axis=1)

In [14]: ld.dtypes

Out[14]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Loan.Purpose object
Debt.To.Income.Ratio float64
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length object
LL_36 float64
dtype: object

Next we examine variable "Loan.Purpose".

In [15]: ld["Loan.Purpose"].value_counts()

Out[15]: debt_consolidation 1307


credit_card 444
other 200
home_improvement 152
major_purchase 101
small_business 87
car 50
wedding 39
medical 30

moving 29
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64

There are 14 categories in this variable. We can either make 13 dummies, or we can club a few of these categories together to reduce the number of effective categories and then make dummy variables for those.
It makes sense to club categories which behave similarly in terms of their effect on the response; in other words, we can club the categories for which the average interest rates in the data are similar.

In [16]: round(ld.groupby("Loan.Purpose")["Interest.Rate"].mean())

Out[16]: Loan.Purpose
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64

We can see from the table above that there are 4 effective categories in the data. Let's club them.

In [17]: for i in range(len(ld.index)):
             if ld["Loan.Purpose"][i] in ["car","educational","major_purchase"]:
                 ld.loc[i,"Loan.Purpose"]="cem"
             if ld["Loan.Purpose"][i] in ["home_improvement","medical","vacation","wedding"]:
                 ld.loc[i,"Loan.Purpose"]="hmvw"
             if ld["Loan.Purpose"][i] in ["credit_card","house","other","small_business"]:
                 ld.loc[i,"Loan.Purpose"]="chos"
             if ld["Loan.Purpose"][i] in ["debt_consolidation","moving"]:
                 ld.loc[i,"Loan.Purpose"]="dm"

Now we make dummies for this variable.

In [18]: lp_dummies=pd.get_dummies(ld["Loan.Purpose"],prefix="LP")

In [19]: lp_dummies.head()

Out[19]: LP_cem LP_chos LP_dm LP_hmvw LP_renewable_energy
0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0 0.0

We'll add these dummies to the original data, and then drop the original variable "Loan.Purpose" and one of the dummies, "LP_renewable_energy".

In [20]: ld=pd.concat([ld,lp_dummies],1)
ld=ld.drop(["Loan.Purpose","LP_renewable_energy"],1)

In [21]: ld.dtypes

Out[21]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Debt.To.Income.Ratio float64
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length object
LL_36 float64
LP_cem float64
LP_chos float64
LP_dm float64
LP_hmvw float64
dtype: object

Next we look at variable "State".

In [22]: ld["State"].nunique()

Out[22]: 47

There are too many unique values. That is not, by itself, a legitimate reason to drop a variable, but we'll ignore State in this discussion anyway in order to reduce the amount of data prep we are doing here. You can try including it in the model (a sketch follows below) and see if the performance improves.
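Purely for illustration (it is not actually run in this notebook), a minimal sketch of one way to include State, assuming you do it before the column is dropped in the next cell, would be to represent it with dummy variables; note that this adds 46 extra columns:

    st_dummies = pd.get_dummies(ld["State"], prefix="ST")   # one column per state
    ld = pd.concat([ld, st_dummies.iloc[:, 1:]], 1)         # keep 46 of 47; the first acts as baseline

Whether the extra columns are worth it is something you would judge from the test RMSE later on.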

In [23]: ld=ld.drop(["State"],1)

Next we take care of variable Home.Ownership.

In [24]: ld["Home.Ownership"].value_counts()

Out[24]: MORTGAGE 1147
RENT 1146
OWN 200
OTHER 5
NONE 1
Name: Home.Ownership, dtype: int64

In [25]: ld["ho_mort"]=np.where(ld["Home.Ownership"]=="MORTGAGE",1,0)
ld["ho_rent"]=np.where(ld["Home.Ownership"]=="RENT",1,0)
ld=ld.drop(["Home.Ownership"],1)

We have simply ignored the values OTHER and NONE, treated the variable as having only 3 categories, and created only two dummies. We did this because of the very low frequencies of OTHER and NONE.

In [26]: ld["FICO.Range"].head()

Out[26]: 0 735-739
1 715-719
2 690-694
3 695-699
4 695-699
Name: FICO.Range, dtype: object

If you look at the first few values of the variable FICO.Range, you can see that we can convert it to a numeric variable by taking the average of the given range. To do that we first need to split the column on "-", so that we have both ends of the range in separate columns; then we can simply average them. Let's first split.

In [27]: ld['f1'], ld['f2'] = zip(*ld['FICO.Range'].apply(lambda x: x.split('-', 1)))

Now we create a new variable "fico" by averaging f1 and f2, and then drop the original variable FICO.Range along with f1 and f2.

In [28]: ld["fico"]=0.5*(pd.to_numeric(ld["f1"])+pd.to_numeric(ld["f2"]))

ld=ld.drop(["FICO.Range","f1","f2"],1)

Next we look at the variable Employment.Length. You'll see that we can convert it to a number as well.

In [29]: ld["Employment.Length"].value_counts()

Out[29]: 10+ years 653


< 1 year 249
2 years 243
3 years 235
5 years 202
4 years 191
1 year 177

6 years 163
7 years 127
8 years 108
n/a 77
9 years 72
. 2
Name: Employment.Length, dtype: int64

In [30]: ld["Employment.Length"]=ld["Employment.Length"].astype("str")
ld["Employment.Length"]=[x.replace("years","") for x in ld["Employment.Length"]]
ld["Employment.Length"]=[x.replace("year","") for x in ld["Employment.Length"]]

We can convert everything else to numbers, but the "n/a" values are a problem. We can look at the average interest rate across all values of Employment.Length and then replace "n/a" with the value that has the closest average response.

In [31]: round(ld.groupby("Employment.Length")["Interest.Rate"].mean(),2)

Out[31]: Employment.Length
. 11.34
1 12.49
10+ 13.34
2 12.87
3 12.77
4 13.14
5 13.40
6 13.29
7 13.10
8 13.01
9 13.15
< 1 12.86
n/a 12.85
nan 7.51
Name: Interest.Rate, dtype: float64

As you can see "n/a" is similar to "< 1".

In [32]: ld["Employment.Length"]=[x.replace("n/a","< 1") for x in ld["Employment.Length"]]


ld["Employment.Length"]=[x.replace("10+","10") for x in ld["Employment.Length"]]
ld["Employment.Length"]=[x.replace("< 1","0") for x in ld["Employment.Length"]]
ld["Employment.Length"]=pd.to_numeric(ld["Employment.Length"],errors="coerce")

In [33]: ld.dtypes

Out[33]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Debt.To.Income.Ratio float64

Monthly.Income float64
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length float64
LL_36 float64
LP_cem float64
LP_chos float64
LP_dm float64
LP_hmvw float64
ho_mort int64
ho_rent int64
fico float64
dtype: object

Now we have all the variables as numbers. After dropping observations with missing values, we can proceed to build our model.

In [34]: ld.shape

Out[34]: (2500, 18)

In [35]: ld.dropna(axis=0,inplace=True)

In [36]: ld.shape

Out[36]: (2471, 18)

We now split our data into two random parts: one to build the model on, and another to test its performance. The option "random_state" is used to make our random operation reproducible.

In [37]: ld_train, ld_test = train_test_split(ld, test_size = 0.2,random_state=2)

In [38]: lm=LinearRegression()

The line above creates an object of class LinearRegression named lm. We can use this object to access all functions related to LinearRegression.
Now we'll separate predictors and response for both datasets. We'll also drop ID from the predictor list because it doesn't make sense to include an ID variable in the model. The variable "Amount.Funded.By.Investors" will also be dropped because it won't be available until the loan has been processed; we can use only those variables which are available at the point of the business process where we want to apply our model.

In [39]: x_train=ld_train.drop(["Interest.Rate","ID","Amount.Funded.By.Investors"],1)
y_train=ld_train["Interest.Rate"]
x_test=ld_test.drop(["Interest.Rate","ID","Amount.Funded.By.Investors"],1)
y_test=ld_test["Interest.Rate"]

Now we can fit our model using lm, the LinearRegression object that we created earlier.

In [40]: lm.fit(x_train,y_train)

Out[40]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Next we predict the response on our test data, calculate the errors of those predictions, and then the RMSE of those residuals. That is our measure of performance on the test data, which we can use to compare the other models that we'll build.

In [41]: p_test=lm.predict(x_test)

residual=p_test-y_test

rmse_lm=np.sqrt(np.dot(residual,residual)/len(p_test))

rmse_lm

Out[41]: 1.9984182813783065

We can use this to compare our linear regression model with other techniques and eventually pick the one with the least error.
Next we show how to extract the coefficients produced by our model.

In [42]: coefs=lm.coef_

features=x_train.columns

list(zip(features,coefs))

Out[42]: [('Amount.Requested', 0.00016471416513021343),


('Debt.To.Income.Ratio', 0.001940716763844156),
('Monthly.Income', -1.964495403046129e-05),
('Open.CREDIT.Lines', -0.034083616785864863),
('Revolving.CREDIT.Balance', -3.9668091914257398e-06),
('Inquiries.in.the.Last.6.Months', 0.35395352269203217),
('Employment.Length', 0.0062596138442035315),
('LL_36', -3.1338448528798901),
('LP_cem', -0.36782330890010462),
('LP_chos', -0.24412655191507102),
('LP_dm', -0.43656408581180051),
('LP_hmvw', -0.44251741243247011),
('ho_mort', -0.51263319187574363),
('ho_rent', -0.2334213637532894),
('fico', -0.086502602177937149)]

We can see that linear regression has produced coefficients for all variables. If you recall our theoretical discussion, we need to penalise the coefficients of variables which are not really contributing to our response and might be causing overfitting of the model. Among the regularised techniques we'll first look at Ridge regression.
Since the penalty in ridge regression is a hyperparameter, we'll look at multiple values of it and choose the best one through 10-fold cross-validation.

In [43]: # Finding best value of penalty weight with cross validation for ridge regression
alphas=np.linspace(.0001,10,100)
# We need to reset index for cross validation to work without hitch
x_train.reset_index(drop=True,inplace=True)
y_train.reset_index(drop=True,inplace=True)

In [44]: rmse_list=[]
         for a in alphas:
             ridge = Ridge(fit_intercept=True, alpha=a)

             # computing average RMSE across 10-fold cross validation
             kf = KFold(len(x_train), n_folds=10)
             xval_err = 0
             for train, test in kf:
                 ridge.fit(x_train.loc[train], y_train[train])
                 p = ridge.predict(x_train.loc[test])
                 err = p - y_train[test]
                 xval_err += np.dot(err,err)
             rmse_10cv = np.sqrt(xval_err/len(x_train))
             # uncomment below to print rmse values for individual alphas
             # print('{:.3f}\t {:.6f}\t '.format(a,rmse_10cv))
             rmse_list.extend([rmse_10cv])
         best_alpha=alphas[rmse_list==min(rmse_list)]
         print('Alpha with min 10cv error is : ',best_alpha)

Alpha with min 10cv error is : [ 4.04046364]

The best value of alpha might be slightly different across runs because of the random nature of cross-validation, so don't worry if you determine a different value of best alpha.
Next we fit Ridge regression on the entire training data with the best value of alpha we just determined.

In [45]: ridge=Ridge(fit_intercept=True,alpha=best_alpha)

ridge.fit(x_train,y_train)

p_test=ridge.predict(x_test)

residual=p_test-y_test

rmse_ridge=np.sqrt(np.dot(residual,residual)/len(p_test))

rmse_ridge

Out[45]: 1.9986610201010222

In [46]: list(zip(x_train.columns,ridge.coef_))

Out[46]: [('Amount.Requested', 0.00016586905207985483),
('Debt.To.Income.Ratio', 0.0020224200468193219),
('Monthly.Income', -2.0262579354428066e-05),
('Open.CREDIT.Lines', -0.034289979364722015),
('Revolving.CREDIT.Balance', -4.0012363985016231e-06),
('Inquiries.in.the.Last.6.Months', 0.35358237791769503),
('Employment.Length', 0.0060576014674008482),
('LL_36', -3.085888228910501),
('LP_cem', -0.060597535013434234),
('LP_chos', 0.051904670459773983),
('LP_dm', -0.13915040742140772),
('LP_hmvw', -0.13894706764602308),
('ho_mort', -0.48648146285695704),
('ho_rent', -0.21080912056441062),
('fico', -0.086530387911495879)]

You can see that ridge regression shrinks the coefficients but never makes them exactly zero, so it never reduces the size of our model. Next we look at Lasso regression.

In [47]: alphas=np.linspace(0.0001,1,100)
         rmse_list=[]
         for a in alphas:
             lasso = Lasso(fit_intercept=True, alpha=a,max_iter=10000)

             # computing RMSE using 10-fold cross validation
             kf = KFold(len(x_train), n_folds=10)
             xval_err = 0
             for train, test in kf:
                 lasso.fit(x_train.loc[train], y_train[train])
                 p = lasso.predict(x_train.loc[test])
                 err = p - y_train[test]
                 xval_err += np.dot(err,err)
             rmse_10cv = np.sqrt(xval_err/len(x_train))
             rmse_list.extend([rmse_10cv])
             # printing rmse values for individual alphas (comment out to suppress)
             print('{:.3f}\t {:.4f}\t '.format(a,rmse_10cv))
         best_alpha=alphas[rmse_list==min(rmse_list)]
         print('Alpha with min 10cv error is : ',best_alpha)

0.000 2.0755
0.010 2.0747
0.020 2.0759
0.030 2.0774
0.041 2.0795
0.051 2.0823
0.061 2.0851
0.071 2.0878
0.081 2.0907

0.091 2.0939
0.101 2.0975
0.111 2.1015
0.121 2.1059
0.131 2.1107
0.141 2.1159
0.152 2.1214
0.162 2.1274
0.172 2.1337
0.182 2.1403
0.192 2.1474
0.202 2.1547
0.212 2.1625
0.222 2.1705
0.232 2.1789
0.242 2.1877
0.253 2.1968
0.263 2.2062
0.273 2.2160
0.283 2.2260
0.293 2.2364
0.303 2.2471
0.313 2.2582
0.323 2.2695
0.333 2.2811
0.343 2.2930
0.354 2.3052
0.364 2.3177
0.374 2.3305
0.384 2.3436
0.394 2.3569
0.404 2.3705
0.414 2.3844
0.424 2.3978
0.434 2.4056
0.444 2.4096
0.455 2.4111
0.465 2.4126
0.475 2.4141
0.485 2.4156
0.495 2.4169
0.505 2.4182
0.515 2.4193
0.525 2.4202
0.535 2.4207
0.545 2.4210
0.556 2.4211
0.566 2.4212

0.576 2.4212
0.586 2.4212
0.596 2.4212
0.606 2.4212
0.616 2.4211
0.626 2.4211
0.636 2.4211
0.646 2.4211
0.657 2.4211
0.667 2.4210
0.677 2.4210
0.687 2.4210
0.697 2.4210
0.707 2.4210
0.717 2.4210
0.727 2.4209
0.737 2.4209
0.747 2.4209
0.758 2.4209
0.768 2.4209
0.778 2.4209
0.788 2.4209
0.798 2.4209
0.808 2.4209
0.818 2.4209
0.828 2.4210
0.838 2.4210
0.848 2.4210
0.859 2.4210
0.869 2.4210
0.879 2.4210
0.889 2.4210
0.899 2.4210
0.909 2.4210
0.919 2.4210
0.929 2.4210
0.939 2.4210
0.949 2.4210
0.960 2.4210
0.970 2.4210
0.980 2.4210
0.990 2.4210
1.000 2.4210
Alpha with min 10cv error is : [ 0.0102]

In [48]: lasso=Lasso(fit_intercept=True,alpha=best_alpha)

lasso.fit(x_train,y_train)

p_test=lasso.predict(x_test)

residual=p_test-y_test

rmse_lasso=np.sqrt(np.dot(residual,residual)/len(p_test))

rmse_lasso

Out[48]: 1.9957102870584467

In [49]: list(zip(x_train.columns,lasso.coef_))

Out[49]: [('Amount.Requested', 0.00016596990419555165),


('Debt.To.Income.Ratio', 0.0018512121409904694),
('Monthly.Income', -2.1507229501854823e-05),
('Open.CREDIT.Lines', -0.033670424591482777),
('Revolving.CREDIT.Balance', -3.9690552293878496e-06),
('Inquiries.in.the.Last.6.Months', 0.34548495700631621),
('Employment.Length', 0.0043033318748096124),
('LL_36', -3.0510561974874602),
('LP_cem', 0.0),
('LP_chos', 0.11552185942149593),
('LP_dm', -0.024266912121686871),
('LP_hmvw', -0.0),
('ho_mort', -0.26583276808536688),
('ho_rent', -0.0),
('fico', -0.086541751428909866)]

We can see that lasso regression not only improves performance on the test data slightly, but also makes the model smaller by setting several coefficients exactly to zero, thus excluding those variables from the model. A small follow-up sketch is shown below.
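A minimal sketch, using only the objects already created above, to list the features lasso kept and to compare the three test RMSEs side by side:

    # features with non-zero lasso coefficients
    kept = [f for f, c in zip(x_train.columns, lasso.coef_) if c != 0]
    print(kept)

    # test RMSEs of the three models built so far
    print('OLS  :', rmse_lm)
    print('Ridge:', rmse_ridge)
    print('Lasso:', rmse_lasso)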

0.0.1 Logistic Model for Binary Classification


A retail banking institution is going to float a stock trading facility for their existing customers. Since this kind of facility is nothing new, the company knows that they will have to incentivise their customers to adopt the offering. One way to incentivise them is to offer discounts on the commission for trading transactions.
One issue with that is that only about 10% of the customers do enough trades for the earnings after discounts to be profitable. The company wants to figure out which customers make up that 10% so that it can selectively offer them the discount. There is no magic way to figure that out, so the company rolled out this service to about 10,000+ of their customers, observed their trading behaviour for 6 months, and then labelled them into two revenue grids, 1 and 2. Using this data, they now want us to build a classification model which can be used to classify their remaining customers into these revenue grids.

Logistic Regression from Scikit Learn
Logistic regression in scikit-learn already supports penalties: l1 and l2 [read as L-one and L-two]. The l1 penalty is the same as the lasso penalty, whereas l2 is the same as the ridge penalty. The parameter C of the logistic regression function is the hyperparameter for the penalty; however, it works in an inverse fashion, i.e. a smaller C means a higher penalty.
For the case discussed here, we have used the l1 penalty with C set to 1. We have left the following things for you to try on your own.

• a model with the l2 penalty

• finding the optimal value of the hyperparameter C with cross-validation, for both penalties

You will find these in the practice exercises as well.


You can use the AUC value obtained from the function roc_auc_score to select the best value of the hyperparameter; the higher the AUC, the better the model. If you don't recall this, please go back to the theoretical reading material. A hedged sketch of such a search is given just below.
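As a rough, hypothetical sketch (not part of the original exercise), such a search could look like the following. It assumes the predictor matrix x_train and response y_train prepared later in this section, the LogisticRegression import from the cell below, and an arbitrary grid of C values; note that in newer scikit-learn versions cross_val_score lives in sklearn.model_selection rather than sklearn.cross_validation.

    from sklearn.cross_validation import cross_val_score   # sklearn.model_selection in newer versions

    best = {}
    for penalty in ["l1", "l2"]:
        for C in np.linspace(0.01, 2, 20):                  # arbitrary grid, an assumption
            logr = LogisticRegression(penalty=penalty, C=C,
                                      class_weight="balanced", random_state=2)
            auc = cross_val_score(logr, x_train, y_train, scoring="roc_auc", cv=10).mean()
            if auc > best.get(penalty, (None, -1))[1]:
                best[penalty] = (C, auc)
    print(best)   # best C and its cross-validated AUC for each penalty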
Let's begin our model building process.

In [50]: from sklearn.linear_model import LogisticRegression


from sklearn.metrics import roc_auc_score

In [51]: data_file=r'/Users/lalitsachan/Dropbox/March onwards/Python Data Science/Data/Existing


bd=pd.read_csv(data_file)

In [52]: bd.head()

Out[52]: REF_NO children age_band status occupation \


0 1 Zero 51-55 Partner Manual Worker
1 2 Zero 55-60 Single/Never Married Retired
2 3 Zero 26-30 Single/Never Married Professional
3 5 Zero 18-21 Single/Never Married Professional
4 6 Zero 45-50 Partner Business Manager

occupation_partner home_status family_income self_employed \


0 Secretarial/Admin Own Home <17,500, >=15,000 No
1 Retired Own Home <27,500, >=25,000 No
2 Other Own Home <30,000, >=27,500 Yes
3 Manual Worker Own Home <15,000, >=12,500 No
4 Unknown Own Home <30,000, >=27,500 No

self_employed_partner ... Investment Tax Saving Bond \


0 No ... 19.99
1 No ... 0.00
2 No ... 0.00
3 No ... 0.00
4 No ... 0.00

Home Loan Online Purchase Amount Revenue Grid gender region \


0 0.00 0.00 1 Female Wales
1 0.00 0.00 2 Female North West
2 3.49 0.00 2 Male North

3 0.00 0.00 2 Female West Midlands
4 45.91 25.98 2 Female Scotland

Investment in Commudity Investment in Equity Investment in Derivative \


0 74.67 18.66 32.32
1 20.19 0.00 4.33
2 98.06 31.07 80.96
3 4.10 14.15 17.57
4 70.16 55.86 80.44

Portfolio Balance
0 89.43
1 22.78
2 171.78
3 -41.70
4 235.02

[5 rows x 32 columns]

In [53]: bd["children"].value_counts()

Out[53]: Zero 6208


1 1848
2 1607
3 473
4+ 19
Name: children, dtype: int64

It seems we can directly convert this to numeric.

In [54]: bd.loc[bd["children"]=="Zero","children"]="0"
bd.loc[bd["children"]=="4+","children"]="4"
bd["children"]=pd.to_numeric(bd["children"],errors="coerce")

In [55]: bd["Revenue Grid"].value_counts()

Out[55]: 2 9069
1 1086
Name: Revenue Grid, dtype: int64

In [56]: bd["y"]=np.where(bd["Revenue Grid"]==2,0,1)


bd=bd.drop(["Revenue Grid"],1)

For the variable age_band, if we treat it as a categorical variable, we can combine its categories by looking at the average response rate across its categories.

In [57]: round(bd.groupby("age_band")["y"].mean(),2)

Out[57]: age_band
18-21 0.17
22-25 0.11
26-30 0.11
31-35 0.11
36-40 0.13
41-45 0.11
45-50 0.10
51-55 0.10
55-60 0.11
61-65 0.09
65-70 0.10
71+ 0.10
Unknown 0.05
Name: y, dtype: float64

In [58]: for i in range(len(bd)):
             if bd["age_band"][i] in ["71+","65-70","51-55","45-50"]:
                 bd.loc[i,"age_band"]="ab_10"
             if bd["age_band"][i] in ["55-60","41-45","31-35","22-25","26-30"]:
                 bd.loc[i,"age_band"]="ab_11"
             if bd["age_band"][i]=="36-40":
                 bd.loc[i,"age_band"]="ab_13"
             if bd["age_band"][i]=="18-21":
                 bd.loc[i,"age_band"]="ab_17"
             if bd["age_band"][i]=="61-65":
                 bd.loc[i,"age_band"]="ab_9"
         ab_dummies=pd.get_dummies(bd["age_band"])
         ab_dummies.head()

Out[58]: Unknown ab_10 ab_11 ab_13 ab_17 ab_9


0 0.0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 0.0
4 0.0 1.0 0.0 0.0 0.0 0.0

We will add it back to the dataset, dropping the dummy for "Unknown".

In [59]: bd=pd.concat([bd,ab_dummies],1)
bd=bd.drop(["age_band","Unknown"],1)

In [60]: bd["status"].value_counts()

Out[60]: Partner 7709


Single/Never Married 1101
Divorced/Separated 679
Widowed 618
Unknown 48
Name: status, dtype: int64

In [61]: bd["st_partner"]=np.where(bd["status"]=="Partner",1,0)
bd["st_singleNm"]=np.where(bd["status"]=="Single/Never Married",1,0)
bd["st_divSep"]=np.where(bd["status"]=="Divorced/Separated",1,0)
bd=bd.drop(["status"],1)

In [62]: round(bd.groupby("occupation")["y"].mean(),2)

Out[62]: occupation
Business Manager 0.12
Housewife 0.09
Manual Worker 0.11
Other 0.11
Professional 0.12
Retired 0.10
Secretarial/Admin 0.11
Student 0.11
Unknown 0.11
Name: y, dtype: float64

In [63]: for i in range(len(bd)):
             if bd["occupation"][i] in ["Unknown","Student","Secretarial/Admin","Other","Manual Worker"]:
                 bd.loc[i,"occupation"]="oc_11"
             if bd["occupation"][i] in ["Professional","Business Manager"]:
                 bd.loc[i,"occupation"]="oc_12"
             if bd["occupation"][i]=="Retired":
                 bd.loc[i,"occupation"]="oc_10"
         oc_dummies=pd.get_dummies(bd["occupation"])
         oc_dummies.head()

Out[63]: Housewife oc_10 oc_11 oc_12


0 0.0 0.0 1.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 1.0

In [64]: bd=pd.concat([bd,oc_dummies],1)

bd=bd.drop(["occupation","Housewife"],1)

In [65]: round(bd.groupby("occupation_partner")["y"].mean(),2)

Out[65]: occupation_partner
Business Manager 0.11
Housewife 0.11
Manual Worker 0.11
Other 0.10
Professional 0.11
Retired 0.10

Secretarial/Admin 0.12
Student 0.12
Unknown 0.10
Name: y, dtype: float64

In [66]: bd["ocp_10"]=0
bd["ocp_12"]=0
for i in range(len(bd)):
if bd["occupation_partner"][i] in ["Unknown","Retired","Other"]:
bd.loc[i,"ocp_10"]=1
if bd["occupation_partner"][i] in ["Student","Secretarial/Admin"]:
bd.loc[i,"ocp_12"]=1

bd=bd.drop(["occupation_partner","TVarea","post_code","post_area","region"],1)

You can see that we have also dropped the variables TVarea, region, post_code and post_area. If you look at the number of unique values taken by post_area and post_code, you'll realise why we decided to drop them. TVarea and region, on the other hand, we have left for you to make use of, to see whether using them improves your model; a brief sketch of one way to start is given below.
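As a purely illustrative sketch (run before region is dropped in the cell above), one could first check the average response per region and then club regions with similar response rates into dummies, exactly as we did for age_band:

    # average response rate per region, to decide which regions to club together
    print(round(bd.groupby("region")["y"].mean(), 2))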

In [67]: bd["home_status"].value_counts()

Out[67]: Own Home 9413


Rent from Council/HA 322
Rent Privately 261
Live in Parental Hom 109
Unclassified 50
Name: home_status, dtype: int64

In [68]: bd["hs_own"]=np.where(bd["home_status"]=="Own Home",1,0)


del bd["home_status"]

Notice that we used an alternate syntax for dropping a column here. You can use that too if
you like this syntax better.

In [69]: bd["gender"].value_counts()

Out[69]: Female 7634


Male 2486
Unknown 35
Name: gender, dtype: int64

In [70]: bd["gender_f"]=np.where(bd["gender"]=="Female",1,0)
del bd["gender"]

In [71]: bd["self_employed"].value_counts()

Out[71]: No 9436
Yes 719
Name: self_employed, dtype: int64

In [72]: bd["semp_yes"]=np.where(bd["self_employed"]=="Yes",1,0)
del bd["self_employed"]

In [73]: bd["self_employed_partner"].value_counts()

Out[73]: No 9026
Yes 1129
Name: self_employed_partner, dtype: int64

In [74]: bd["semp_part_yes"]=np.where(bd["self_employed_partner"]=="Yes",1,0)
del bd["self_employed_partner"]

In [75]: bd["family_income"].value_counts()

Out[75]: >=35,000 2517


<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
Name: family_income, dtype: int64

We can convert this to a number as the average of the range, once we have figured out what to do with the category "Unknown".

In [76]: round(bd.groupby("family_income")["y"].mean(),4)

Out[76]: family_income
< 4,000 0.0755
< 8,000, >= 4,000 0.0796
<10,000, >= 8,000 0.1066
<12,500, >=10,000 0.1019
<15,000, >=12,500 0.1113
<17,500, >=15,000 0.1230
<20,000, >=17,500 0.1113
<22,500, >=20,000 0.1186
<25,000, >=22,500 0.1032
<27,500, >=25,000 0.0970
<30,000, >=27,500 0.1157
>=35,000 0.1116
Unknown 0.0703
Name: y, dtype: float64

In [77]: bd["fi"]=4 # by doing this , we have essentially clubbed <4000 and Unknown values . Ho
bd.loc[bd["family_income"]=="< 8,000, >= 4,000","fi"]=6
bd.loc[bd["family_income"]=="<10,000, >= 8,000","fi"]=9
bd.loc[bd["family_income"]=="<12,500, >=10,000","fi"]=11.25
bd.loc[bd["family_income"]=="<15,000, >=12,500","fi"]=13.75
bd.loc[bd["family_income"]=="<17,500, >=15,000","fi"]=16.25
bd.loc[bd["family_income"]=="<20,000, >=17,500","fi"]=18.75
bd.loc[bd["family_income"]=="<22,500, >=20,000","fi"]=21.25
bd.loc[bd["family_income"]=="<25,000, >=22,500","fi"]=23.75
bd.loc[bd["family_income"]=="<27,500, >=25,000","fi"]=26.25
bd.loc[bd["family_income"]=="<30,000, >=27,500","fi"]=28.75
bd.loc[bd["family_income"]==">=35,000","fi"]=35
bd=bd.drop(["family_income"],1)

In [78]: bd.dtypes

Out[78]: REF_NO int64


children int64
year_last_moved int64
Average Credit Card Transaction float64
Balance Transfer float64
Term Deposit float64
Life Insurance float64
Medical Insurance float64
Average A/C Balance float64
Personal Loan float64
Investment in Mutual Fund float64
Investment Tax Saving Bond float64
Home Loan float64
Online Purchase Amount float64
Investment in Commudity float64
Investment in Equity float64
Investment in Derivative float64
Portfolio Balance float64
y int64
ab_10 float64
ab_11 float64
ab_13 float64
ab_17 float64
ab_9 float64
st_partner int64
st_singleNm int64
st_divSep int64
oc_10 float64
oc_11 float64
oc_12 float64
ocp_10 int64
ocp_12 int64

hs_own int64
gender_f int64
semp_yes int64
semp_part_yes int64
fi float64
dtype: object

Now that the entire dataset is numeric, let's begin our modelling process after removing the NAs from the data.

In [79]: bd.dropna(axis=0,inplace=True)
bd_train, bd_test = train_test_split(bd, test_size = 0.2,random_state=2)

In [80]: x_train=bd_train.drop(["y","REF_NO"],1)
y_train=bd_train["y"]
x_test=bd_test.drop(["y","REF_NO"],1)
y_test=bd_test["y"]

In [81]: logr=LogisticRegression(penalty="l1",class_weight="balanced",random_state=2)

In [82]: logr.fit(x_train,y_train)

Out[82]: LogisticRegression(C=1.0, class_weight='balanced', dual=False,


fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l1', random_state=2,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [83]: # score model performance on the test data


roc_auc_score(y_test,logr.predict(x_test))

Out[83]: 0.89959186496956278
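Note that roc_auc_score is commonly computed on predicted probabilities rather than on hard 0/1 predictions; a minimal sketch using the fitted model above:

    # AUC computed on the predicted probability of the positive class
    roc_auc_score(y_test, logr.predict_proba(x_test)[:, 1])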

To arrive at the eventual 1/0 prediction, we need some way [some cutoff] to convert the predicted probabilities into two classes. Let's first get the probabilities out.

In [84]: prob_score=pd.Series(list(zip(*logr.predict_proba(x_train)))[1])

On these scores, we will consider many cutoffs between 0 and 1.

In [85]: cutoffs=np.linspace(0,1,100)

For each of these cutoffs, we are going to look at the TP, FP, TN and FN values and calculate KS. Then we'll choose the best cutoff as the one with the highest KS.

In [86]: KS_cut=[]
         for cutoff in cutoffs:
             predicted=pd.Series([0]*len(y_train))
             predicted[prob_score>cutoff]=1
             df=pd.DataFrame(list(zip(y_train,predicted)),columns=["real","predicted"])
             TP=len(df[(df["real"]==1) &(df["predicted"]==1) ])
             FP=len(df[(df["real"]==0) &(df["predicted"]==1) ])
             TN=len(df[(df["real"]==0) &(df["predicted"]==0) ])
             FN=len(df[(df["real"]==1) &(df["predicted"]==0) ])
             P=TP+FN
             N=TN+FP
             KS=(TP/P)-(FP/N)
             KS_cut.append(KS)

         cutoff_data=pd.DataFrame(list(zip(cutoffs,KS_cut)),columns=["cutoff","KS"])

         KS_cutoff=cutoff_data[cutoff_data["KS"]==cutoff_data["KS"].max()]["cutoff"]

Now we'll see how this model, with the cutoff determined here, performs on the test data.

In [87]: # Performance on test data


prob_score_test=pd.Series(list(zip(*logr.predict_proba(x_test)))[1])

predicted_test=pd.Series([0]*len(y_test))
predicted_test[prob_score_test>float(KS_cutoff)]=1

df_test=pd.DataFrame(list(zip(y_test,predicted_test)),columns=["real","predicted"])

k=pd.crosstab(df_test['real'],df_test["predicted"])
print('confusion matrix :\n \n ',k)
TN=k.iloc[0,0]
TP=k.iloc[1,1]
FP=k.iloc[0,1]
FN=k.iloc[1,0]
P=TP+FN
N=TN+FP

confusion matrix :

predicted 0 1
real
0 1646 161
1 26 198

In [88]: # Accuracy of test


(TP+TN)/(P+N)

Out[88]: 0.9079271294928607

In [89]: # Sensitivity on test


TP/P

Out[89]: 0.8839285714285714

In [90]: #Specificity on test
TN/N

Out[90]: 0.91090204759269511

Next we see how the cutoff determined by the F_beta score performs on the test data, for beta values 0.5, 1 and 2.

In [91]: cutoffs=np.linspace(0.010,0.99,100)
         def Fbeta_perf(beta,cutoffs,y_train,prob_score):
             FB_cut=[]
             for cutoff in cutoffs:
                 predicted=pd.Series([0]*len(y_train))
                 predicted[prob_score>cutoff]=1
                 df=pd.DataFrame(list(zip(y_train,predicted)),columns=["real","predicted"])

                 TP=len(df[(df["real"]==1) &(df["predicted"]==1) ])
                 FP=len(df[(df["real"]==0) &(df["predicted"]==1) ])
                 FN=len(df[(df["real"]==1) &(df["predicted"]==0) ])
                 P=TP+FN

                 Precision=TP/(TP+FP)
                 Recall=TP/P
                 FB=(1+beta**2)*Precision*Recall/((beta**2)*Precision+Recall)
                 FB_cut.append(FB)

             cutoff_data=pd.DataFrame(list(zip(cutoffs,FB_cut)),columns=["cutoff","FB"])

             FB_cutoff=cutoff_data[cutoff_data["FB"]==cutoff_data["FB"].max()]["cutoff"]

             prob_score_test=pd.Series(list(zip(*logr.predict_proba(x_test)))[1])

             predicted_test=pd.Series([0]*len(y_test))
             predicted_test[prob_score_test>float(FB_cutoff)]=1

             df_test=pd.DataFrame(list(zip(y_test,predicted_test)),columns=["real","predicted"])

             k=pd.crosstab(df_test['real'],df_test["predicted"])
             # print('confusion matrix :\n \n ',k)
             TN=k.iloc[0,0]
             TP=k.iloc[1,1]
             FP=k.iloc[0,1]
             FN=k.iloc[1,0]
             P=TP+FN
             N=TN+FP
             print('For beta :',beta)
             print('Accuracy is :',(TP+TN)/(P+N))
             print('Sensitivity is :',(TP/P))
             print('Specificity is :',(TN/N))
             print('\n \n \n')

In [92]: Fbeta_perf(0.5,cutoffs,y_train,prob_score)
Fbeta_perf(1,cutoffs,y_train,prob_score)
Fbeta_perf(2,cutoffs,y_train,prob_score)

For beta : 0.5


Accuracy is : 0.939931068439
Sensitivity is : 0.625
Specificity is : 0.978970669618

For beta : 1
Accuracy is : 0.932053175775
Sensitivity is : 0.758928571429
Specificity is : 0.953514111787

For beta : 2
Accuracy is : 0.929591334318
Sensitivity is : 0.834821428571
Specificity is : 0.941339236303

You can see that beta < 1 favors specificity, whereas beta > 1 favors sensitivity.
We'll conclude our discussion here. Please do the practice exercises. If you face any issues, we'll discuss them either in class or on the QA forum on the LMS.
Prepared by: Lalit Sachan ([email protected])
In case of any doubts or an errata alert, please take it to the QA forum for discussion.
Doubts will be discussed in live class sessions too. [This doesn't apply to self-paced students]
