Linear Models Reading
For the theoretical background on these techniques, please refer to the theoretical discussion on linear models.
Now we'll walk through a case study solution using multiple linear regression and regularised linear regression [Ridge and Lasso]. We'll also look at hyperparameter tuning for regularised regression.
A little background on the case study: this data belongs to a loan aggregator agency which connects loan applications to different financial institutions in an attempt to get the best interest rate. They now want to use past data to predict the interest rate offered by a financial institution just by looking at loan application characteristics.
To achieve that, they have decided to do a POC with data from a particular financial institution. The data is given in the file "loans data.csv". Let's begin:
In [1]: import pandas as pd
        import math
        import numpy as np
        # train_test_split and KFold now live in sklearn.model_selection;
        # the old sklearn.cross_validation module has been removed
        from sklearn.model_selection import train_test_split, KFold
        from sklearn.linear_model import LinearRegression, Lasso, Ridge
        %matplotlib inline

        ld=pd.read_csv("loans data.csv")
In [2]: ld.head()
4 36 months credit_card 18.78% NJ RENT
Inquiries.in.the.Last.6.Months Employment.Length
0 2.0 < 1 year
1 1.0 2 years
2 1.0 2 years
3 0.0 5 years
4 0.0 9 years
You can see that the variables Interest.Rate and Debt.To.Income.Ratio contain a "%" sign in their values, because of which they have come in as character columns. Let's remove these percentage signs first.
In [4]: ld.dtypes
Out[4]: ID float64
Amount.Requested object
Amount.Funded.By.Investors object
Interest.Rate object
Loan.Length object
Loan.Purpose object
Debt.To.Income.Ratio object
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines object
Revolving.CREDIT.Balance object
Inquiries.in.the.Last.6.Months float64
Employment.Length object
dtype: object
We can see that many columns which should really have been numbers have been imported as character columns, probably because of some character values present in those columns in the file. We'll convert all such columns to numbers; a sketch of the conversion follows.
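The conversion cell itself is not shown in this reading; below is a minimal sketch of what it does. The column list is taken from the dtype changes visible in the next output, and the use of errors="coerce" (which turns unparseable values into NaN) is an assumption.

In [5]: # strip the "%" sign from the two percentage columns
        for col in ["Interest.Rate","Debt.To.Income.Ratio"]:
            ld[col]=ld[col].astype("str").str.replace("%","",regex=False)
        # coerce the should-be-numeric columns to numbers; bad values become NaN
        for col in ["Amount.Requested","Amount.Funded.By.Investors","Interest.Rate",
                    "Debt.To.Income.Ratio","Open.CREDIT.Lines","Revolving.CREDIT.Balance"]:
            ld[col]=pd.to_numeric(ld[col],errors="coerce")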
In [6]: ld.dtypes
Out[6]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Loan.Length object
Loan.Purpose object
Debt.To.Income.Ratio float64
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length object
dtype: object
In [7]: ld["Loan.Length"].value_counts()
The function get_dummies creates dummy variables for all the categories present in a categorical variable. The result is a dataframe; we can then drop the dummies that we don't want and attach the selected ones back to our original data.
In [8]: ll_dummies=pd.get_dummies(ld["Loan.Length"])
In [9]: ll_dummies.head()
We'll add the dummy variable for "36 months" to our data and ignore the other two, as sketched below.
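The cell that adds this dummy is not shown; a minimal sketch (the category name "36 months" matches the value seen in the head output above, and the column name LL_36 matches the dtypes output below):

In [10]: ld["LL_36"]=ll_dummies["36 months"]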
Now that we are done with the dataframe ll_dummies, we can drop it. Below we demonstrate a general way of removing variables from the notebook environment.
In [11]: %reset_selective ll_dummies
To know what all variables are in the environment, you can use the magic command "who".
In [12]: who
Now that we have created dummies for Loan.Length, we need to remove it from the dataframe.
In [13]: ld=ld.drop('Loan.Length',axis=1)
In [14]: ld.dtypes
Out[14]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Loan.Purpose object
Debt.To.Income.Ratio float64
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length object
LL_36 float64
dtype: object
In [15]: ld["Loan.Purpose"].value_counts()
moving 29
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64
There are 14 categories in this variable. We can either make 13 dummies, or we can club some of these categories together to reduce the number of effective categories and then make dummy variables for those.
It makes sense to club those categories which behave similarly in terms of their effect on the response. In other words, we can club those categories for which the average interest rates in the data are similar.
In [16]: round(ld.groupby("Loan.Purpose")["Interest.Rate"].mean())
Out[16]: Loan.Purpose
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64
We can see from the table above that there are 4 effective categories in the data. Let's club them; a sketch of the clubbing follows.
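The clubbing cell is not shown; below is a minimal sketch consistent with the table above and with the dummy names in the next output. The grouping is an inference from those names: cem = car/educational/major_purchase (rate 11), chos = credit_card/house/other/small_business (13), dm = debt_consolidation/moving (14), hmvw = home_improvement/medical/vacation/wedding (12).

In [17]: # club categories whose average interest rates are similar
         ld["Loan.Purpose"]=ld["Loan.Purpose"].replace(
             {"car":"cem","educational":"cem","major_purchase":"cem",
              "credit_card":"chos","house":"chos","other":"chos","small_business":"chos",
              "debt_consolidation":"dm","moving":"dm",
              "home_improvement":"hmvw","medical":"hmvw","vacation":"hmvw","wedding":"hmvw"})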
In [18]: lp_dummies=pd.get_dummies(ld["Loan.Purpose"],prefix="LP")
In [19]: lp_dummies.head()
Out[19]: LP_cem LP_chos LP_dm LP_hmvw LP_renewable_energy
0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0
4 0.0 1.0 0.0 0.0 0.0
We'll add this data to the original data, and then drop the original variable "Loan.Purpose" and one of the dummies, "LP_renewable_energy".
In [20]: ld=pd.concat([ld,lp_dummies],axis=1)
         ld=ld.drop(["Loan.Purpose","LP_renewable_energy"],axis=1)
In [21]: ld.dtypes
Out[21]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Debt.To.Income.Ratio float64
State object
Home.Ownership object
Monthly.Income float64
FICO.Range object
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length object
LL_36 float64
LP_cem float64
LP_chos float64
LP_dm float64
LP_hmvw float64
dtype: object
In [22]: ld["State"].nunique()
Out[22]: 47
There are too many unique values. Although that by itself is not a legitimate reason to drop a variable, we'll ignore it in this discussion anyway, to reduce the amount of data prep we are doing here. You can try including it in the model and see if the performance improves.
In [23]: ld=ld.drop(["State"],axis=1)
In [24]: ld["Home.Ownership"].value_counts()
Out[24]: MORTGAGE 1147
RENT 1146
OWN 200
OTHER 5
NONE 1
Name: Home.Ownership, dtype: int64
In [25]: ld["ho_mort"]=np.where(ld["Home.Ownership"]=="MORTGAGE",1,0)
         ld["ho_rent"]=np.where(ld["Home.Ownership"]=="RENT",1,0)
         ld=ld.drop(["Home.Ownership"],axis=1)
We have simply ignored the values OTHER and NONE, treated the variable as having only 3 categories, and created just two dummies. We did this because of the very low frequencies of OTHER and NONE.
In [26]: ld["FICO.Range"].head()
Out[26]: 0 735-739
1 715-719
2 690-694
3 695-699
4 695-699
Name: FICO.Range, dtype: object
If you look at the first few values of the variable FICO.Range, you can see that we can convert it to a numeric variable by taking the average of the given range. To do that, we first need to split the column on "-", so that we have both ends of the range in separate columns, and then we can simply average them. Let's first split; a sketch follows.
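The split cell is not shown; a minimal sketch using pandas string methods (the column names f1 and f2 come from the averaging cell below):

In [27]: # split "735-739" style ranges into their two endpoints
         ld[["f1","f2"]]=ld["FICO.Range"].str.split("-",expand=True)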
Now we create a new variable "fico" by averaging f1 and f2, and then drop the original variable FICO.Range along with f1 and f2.
In [28]: ld["fico"]=0.5*(pd.to_numeric(ld["f1"])+pd.to_numeric(ld["f2"]))
         ld=ld.drop(["FICO.Range","f1","f2"],axis=1)
Next we look at the variable Employment.Length. You'll see that we can convert it to a number as well.
In [29]: ld["Employment.Length"].value_counts()
6 years 163
7 years 127
8 years 108
n/a 77
9 years 72
. 2
Name: Employment.Length, dtype: int64
In [30]: ld["Employment.Length"]=ld["Employment.Length"].astype("str")
ld["Employment.Length"]=[x.replace("years","") for x in ld["Employment.Length"]]
ld["Employment.Length"]=[x.replace("year","") for x in ld["Employment.Length"]]
We can convert everything else to numbers, but the "n/a" values are a problem. We can look at the average interest rate across all values of Employment.Length and then replace "n/a" with the value which has the closest average response; we'll sketch this replacement after the table below.
In [31]: round(ld.groupby("Employment.Length")["Interest.Rate"].mean(),2)
Out[31]: Employment.Length
. 11.34
1 12.49
10+ 13.34
2 12.87
3 12.77
4 13.14
5 13.40
6 13.29
7 13.10
8 13.01
9 13.15
< 1 12.86
n/a 12.85
nan 7.51
Name: Interest.Rate, dtype: float64
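The replacement cell is not shown; here is a minimal sketch. The table above shows that "< 1" (12.86) has the average interest rate closest to that of "n/a" (12.85), so we map "n/a" to "< 1". The numeric codes chosen for "< 1" and "10+" below are assumptions.

In [32]: # "n/a" gets the value with the closest average response, which is "< 1"
         ld["Employment.Length"]=ld["Employment.Length"].replace("n/a","< 1")
         # assumed numeric codes: "< 1" becomes 0 and "10+" becomes 10;
         # leftover oddities such as "." become NaN through errors="coerce"
         ld["Employment.Length"]=ld["Employment.Length"].str.replace("10+","10",regex=False)
         ld["Employment.Length"]=ld["Employment.Length"].str.replace("< 1","0",regex=False)
         ld["Employment.Length"]=pd.to_numeric(ld["Employment.Length"],errors="coerce")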
In [33]: ld.dtypes
Out[33]: ID float64
Amount.Requested float64
Amount.Funded.By.Investors float64
Interest.Rate float64
Debt.To.Income.Ratio float64
Monthly.Income float64
Open.CREDIT.Lines float64
Revolving.CREDIT.Balance float64
Inquiries.in.the.Last.6.Months float64
Employment.Length float64
LL_36 float64
LP_cem float64
LP_chos float64
LP_dm float64
LP_hmvw float64
ho_mort int64
ho_rent int64
fico float64
dtype: object
Now we have all the variables as numbers. After dropping observations with missing values, we can proceed to build our model.
In [34]: ld.shape
In [35]: ld.dropna(axis=0,inplace=True)
In [36]: ld.shape
We now split our data into two random parts: one to build the model on, another to test its performance. The option "random_state" is used to make our random operation reproducible. A sketch of the split follows.
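The split cell is not shown; a minimal sketch (the 80/20 split proportion is an assumption, mirroring what we do in the second case study below):

In [37]: ld_train, ld_test = train_test_split(ld, test_size=0.2, random_state=2)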
In [38]: lm=LinearRegression()
The line above creates an object of class LinearRegression named lm. We can use this object to access all functions related to LinearRegression.
Now we'll separate predictors and response for both datasets. We'll also drop ID from the predictor list because it doesn't make sense to include an ID variable in the model. The variable "Amount.Funded.By.Investors" will also be dropped because it won't be available until the loan has been processed; we can only use those variables which are present at the point of the business process where we want to apply our model.
In [39]: x_train=ld_train.drop(["Interest.Rate","ID","Amount.Funded.By.Investors"],axis=1)
         y_train=ld_train["Interest.Rate"]
         x_test=ld_test.drop(["Interest.Rate","ID","Amount.Funded.By.Investors"],axis=1)
         y_test=ld_test["Interest.Rate"]
Now we can fit our model using lm, the LinearRegression object that we created earlier.
In [40]: lm.fit(x_train,y_train)
Next we predict the response on our test data, calculate the errors of those predictions, and then compute the RMSE of those residuals. That is our measure of performance on the test data; we can use it to compare the other models that we'll build.
In [41]: p_test=lm.predict(x_test)
residual=p_test-y_test
rmse_lm=np.sqrt(np.dot(residual,residual)/len(p_test))
rmse_lm
Out[41]: 1.9984182813783065
We can use this to compare our linear regression model with other techniques and eventually pick the one with the least error.
Next we show how to extract the coefficients produced by our model.
In [42]: coefs=lm.coef_
features=x_train.columns
list(zip(features,coefs))
We can see that linear regression has produced coefficients for all variables. If you recall our theoretical discussion, we need to penalise coefficients of the variables which are not really contributing to the response and might be causing overfitting of the model. Among the regularised techniques, we'll first look at Ridge regression.
Since the penalty in ridge regression is a hyperparameter, we'll look at multiple values of it and choose the best one through 10-fold cross validation.
In [43]: # Finding best value of penalty weight with cross validation for ridge regression
alphas=np.linspace(.0001,10,100)
# We need to reset index for cross validation to work without hitch
x_train.reset_index(drop=True,inplace=True)
y_train.reset_index(drop=True,inplace=True)
In [44]: rmse_list=[]
         # a minimal sketch of the 10-fold CV loop for each candidate alpha
         kf=KFold(n_splits=10,shuffle=True)
         for a in alphas:
             ridge = Ridge(fit_intercept=True, alpha=a)
             xval_err=0
             for train_idx, test_idx in kf.split(x_train):
                 ridge.fit(x_train.iloc[train_idx],y_train.iloc[train_idx])
                 p=ridge.predict(x_train.iloc[test_idx])
                 xval_err+=np.sum((p-y_train.iloc[test_idx])**2)
             rmse_list.append(np.sqrt(xval_err/x_train.shape[0]))
         best_alpha=alphas[np.argmin(rmse_list)]

The best value of alpha might be slightly different across runs because of the random nature of cross validation, so don't worry if you determine a slightly different best alpha.
Next we fit Ridge regression on the entire training data with the best value of alpha we just determined.
In [45]: ridge=Ridge(fit_intercept=True,alpha=best_alpha)
ridge.fit(x_train,y_train)
p_test=ridge.predict(x_test)
residual=p_test-y_test
rmse_ridge=np.sqrt(np.dot(residual,residual)/len(p_test))
rmse_ridge
Out[45]: 1.9986610201010222
In [46]: list(zip(x_train.columns,ridge.coef_))
Out[46]: [('Amount.Requested', 0.00016586905207985483),
('Debt.To.Income.Ratio', 0.0020224200468193219),
('Monthly.Income', -2.0262579354428066e-05),
('Open.CREDIT.Lines', -0.034289979364722015),
('Revolving.CREDIT.Balance', -4.0012363985016231e-06),
('Inquiries.in.the.Last.6.Months', 0.35358237791769503),
('Employment.Length', 0.0060576014674008482),
('LL_36', -3.085888228910501),
('LP_cem', -0.060597535013434234),
('LP_chos', 0.051904670459773983),
('LP_dm', -0.13915040742140772),
('LP_hmvw', -0.13894706764602308),
('ho_mort', -0.48648146285695704),
('ho_rent', -0.21080912056441062),
('fico', -0.086530387911495879)]
You can see that ridge regression shrinks the coefficients but never makes them exactly zero; it essentially never reduces the size of our model. Next we look at Lasso regression.
In [47]: alphas=np.linspace(0.0001,1,100)
         rmse_list=[]
         # same 10-fold CV sketch as for ridge, printing alpha and its CV rmse
         kf=KFold(n_splits=10,shuffle=True)
         for a in alphas:
             lasso = Lasso(fit_intercept=True, alpha=a, max_iter=10000)
             xval_err=0
             for train_idx, test_idx in kf.split(x_train):
                 lasso.fit(x_train.iloc[train_idx],y_train.iloc[train_idx])
                 p=lasso.predict(x_train.iloc[test_idx])
                 xval_err+=np.sum((p-y_train.iloc[test_idx])**2)
             rmse=np.sqrt(xval_err/x_train.shape[0])
             rmse_list.append(rmse)
             print("%.3f %.4f" % (a,rmse))
         best_alpha=alphas[np.argmin(rmse_list)]
         print("Alpha with min 10cv error is :",best_alpha)
0.000 2.0755
0.010 2.0747
0.020 2.0759
0.030 2.0774
0.041 2.0795
0.051 2.0823
0.061 2.0851
0.071 2.0878
0.081 2.0907
0.091 2.0939
0.101 2.0975
0.111 2.1015
0.121 2.1059
0.131 2.1107
0.141 2.1159
0.152 2.1214
0.162 2.1274
0.172 2.1337
0.182 2.1403
0.192 2.1474
0.202 2.1547
0.212 2.1625
0.222 2.1705
0.232 2.1789
0.242 2.1877
0.253 2.1968
0.263 2.2062
0.273 2.2160
0.283 2.2260
0.293 2.2364
0.303 2.2471
0.313 2.2582
0.323 2.2695
0.333 2.2811
0.343 2.2930
0.354 2.3052
0.364 2.3177
0.374 2.3305
0.384 2.3436
0.394 2.3569
0.404 2.3705
0.414 2.3844
0.424 2.3978
0.434 2.4056
0.444 2.4096
0.455 2.4111
0.465 2.4126
0.475 2.4141
0.485 2.4156
0.495 2.4169
0.505 2.4182
0.515 2.4193
0.525 2.4202
0.535 2.4207
0.545 2.4210
0.556 2.4211
0.566 2.4212
0.576 2.4212
0.586 2.4212
0.596 2.4212
0.606 2.4212
0.616 2.4211
0.626 2.4211
0.636 2.4211
0.646 2.4211
0.657 2.4211
0.667 2.4210
0.677 2.4210
0.687 2.4210
0.697 2.4210
0.707 2.4210
0.717 2.4210
0.727 2.4209
0.737 2.4209
0.747 2.4209
0.758 2.4209
0.768 2.4209
0.778 2.4209
0.788 2.4209
0.798 2.4209
0.808 2.4209
0.818 2.4209
0.828 2.4210
0.838 2.4210
0.848 2.4210
0.859 2.4210
0.869 2.4210
0.879 2.4210
0.889 2.4210
0.899 2.4210
0.909 2.4210
0.919 2.4210
0.929 2.4210
0.939 2.4210
0.949 2.4210
0.960 2.4210
0.970 2.4210
0.980 2.4210
0.990 2.4210
1.000 2.4210
Alpha with min 10cv error is : [ 0.0102]
In [48]: lasso=Lasso(fit_intercept=True,alpha=best_alpha)
lasso.fit(x_train,y_train)
p_test=lasso.predict(x_test)
residual=p_test-y_test
rmse_lasso=np.sqrt(np.dot(residual,residual)/len(p_test))
rmse_lasso
Out[48]: 1.9957102870584467
In [49]: list(zip(x_train.columns,lasso.coef_))
We can see that lasso regression not only improves performance on the data slightly, but also makes the model smaller by setting many coefficients to exactly zero, thus excluding those variables from our model.
Logistic Regression from Scikit Learn
Logistic regression in scikit-learn already contains penalties: l1 and l2 [read as L-one and L-two]. The l1 penalty is the same as the lasso penalty, whereas l2 is the same as the ridge penalty. The parameter C of the logistic regression function is the hyperparameter for the penalty. However, it works in an inverse fashion: a smaller value of C means a higher penalty.
For the case study discussed here, we use the l1 penalty with the value of C as 1. We have left other things, such as tuning C, for you to try on your own.
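The cells that import LogisticRegression and load this second case study's data (the dataframe bd used below) are not shown; a minimal sketch (the file name here is just a placeholder, use whatever file came with the case study):

In [51]: from sklearn.linear_model import LogisticRegression
         bd=pd.read_csv("existing_base_data.csv")   # placeholder file name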
In [52]: bd.head()
3 0.00 0.00 2 Female West Midlands
4 45.91 25.98 2 Female Scotland
Portfolio Balance
0 89.43
1 22.78
2 171.78
3 -41.70
4 235.02
[5 rows x 32 columns]
In [53]: bd["children"].value_counts()
In [54]: bd.loc[bd["children"]=="Zero","children"]="0"
bd.loc[bd["children"]=="4+","children"]="4"
bd["children"]=pd.to_numeric(bd["children"],errors="coerce")
In [55]: bd["Revenue Grid"].value_counts()
Out[55]: 2 9069
1 1086
Name: Revenue Grid, dtype: int64
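The cell that creates the binary response y is not shown. Given the counts above (1086 out of 10155 is roughly the 11% response rate seen in the group-wise tables below), presumably Revenue Grid value 1 marks a responder; a minimal sketch under that assumption:

In [56]: # assumption: Revenue Grid == 1 is the event of interest
         bd["y"]=np.where(bd["Revenue Grid"]==1,1,0)
         bd=bd.drop(["Revenue Grid"],axis=1)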
For the variable age_band, if we treat it as a categorical variable, we can combine its categories by looking at the average response rate across them.
In [57]: round(bd.groupby("age_band")["y"].mean(),2)
Out[57]: age_band
18-21 0.17
22-25 0.11
26-30 0.11
31-35 0.11
36-40 0.13
41-45 0.11
45-50 0.10
51-55 0.10
55-60 0.11
61-65 0.09
65-70 0.10
71+ 0.10
Unknown 0.05
Name: y, dtype: float64
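The cell creating these dummies is not shown; it mirrors what we did for Loan.Length (no prefix is used, since the dummy dropped below is named plain "Unknown"):

In [58]: ab_dummies=pd.get_dummies(bd["age_band"])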
We will add these dummies back to the dataset, dropping the dummy for "Unknown".
In [59]: bd=pd.concat([bd,ab_dummies],axis=1)
         bd=bd.drop(["age_band","Unknown"],axis=1)
In [60]: bd["status"].value_counts()
In [61]: bd["st_partner"]=np.where(bd["status"]=="Partner",1,0)
bd["st_singleNm"]=np.where(bd["status"]=="Single/Never Married",1,0)
bd["st_divSep"]=np.where(bd["status"]=="Divorced/Separated",1,0)
bd=bd.drop(["status"],axis=1)
In [62]: round(bd.groupby("occupation")["y"].mean(),2)
Out[62]: occupation
Business Manager 0.12
Housewife 0.09
Manual Worker 0.11
Other 0.11
Professional 0.12
Retired 0.10
Secretarial/Admin 0.11
Student 0.11
Unknown 0.11
Name: y, dtype: float64
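The dummy-creation cell for occupation is again not shown; a minimal sketch consistent with the plain category name ("Housewife") dropped below:

In [63]: oc_dummies=pd.get_dummies(bd["occupation"])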
In [64]: bd=pd.concat([bd,oc_dummies],axis=1)
         bd=bd.drop(["occupation","Housewife"],axis=1)
In [65]: round(bd.groupby("occupation_partner")["y"].mean(),2)
Out[65]: occupation_partner
Business Manager 0.11
Housewife 0.11
Manual Worker 0.11
Other 0.10
Professional 0.11
Retired 0.10
Secretarial/Admin 0.12
Student 0.12
Unknown 0.10
Name: y, dtype: float64
In [66]: bd["ocp_10"]=0
         bd["ocp_12"]=0
         # flag occupation_partner categories with similar average response rates
         # (the 0.10 group and the 0.12 group from the table above)
         for i in range(len(bd)):
             if bd["occupation_partner"][i] in ["Unknown","Retired","Other"]:
                 bd.loc[i,"ocp_10"]=1
             if bd["occupation_partner"][i] in ["Student","Secretarial/Admin"]:
                 bd.loc[i,"ocp_12"]=1

         bd=bd.drop(["occupation_partner","TVarea","post_code","post_area","region"],axis=1)
You can see that we have also dropped the variables TVarea, region, post_code and post_area. If you look at the number of unique values taken by post_area and post_code, you'll realise why we decided to drop them. TVarea and region, on the other hand, we have left for you to make use of and see if using them improves your model.
In [67]: bd["home_status"].value_counts()
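The cell handling home_status is not shown; a minimal sketch (the category name "Own Home" is an assumption; the dummy hs_own appears in the dtypes output further down):

In [68]: bd["hs_own"]=np.where(bd["home_status"]=="Own Home",1,0)   # "Own Home" is an assumed category name
         del bd["home_status"]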
Notice that we used an alternate syntax (del) for dropping a column here. You can use that too if you like this syntax better.
In [69]: bd["gender"].value_counts()
In [70]: bd["gender_f"]=np.where(bd["gender"]=="Female",1,0)
del bd["gender"]
In [71]: bd["self_employed"].value_counts()
Out[71]: No 9436
Yes 719
Name: self_employed, dtype: int64
In [72]: bd["semp_yes"]=np.where(bd["self_employed"]=="Yes",1,0)
del bd["self_employed"]
In [73]: bd["self_employed_partner"].value_counts()
Out[73]: No 9026
Yes 1129
Name: self_employed_partner, dtype: int64
In [74]: bd["semp_part_yes"]=np.where(bd["self_employed_partner"]=="Yes",1,0)
del bd["self_employed_partner"]
In [75]: bd["family_income"].value_counts()
We can convert this to a number as the average of the range, once we have figured out what to do with the category "Unknown".
In [76]: round(bd.groupby("family_income")["y"].mean(),4)
Out[76]: family_income
< 4,000 0.0755
< 8,000, >= 4,000 0.0796
<10,000, >= 8,000 0.1066
<12,500, >=10,000 0.1019
<15,000, >=12,500 0.1113
<17,500, >=15,000 0.1230
<20,000, >=17,500 0.1113
<22,500, >=20,000 0.1186
<25,000, >=22,500 0.1032
<27,500, >=25,000 0.0970
<30,000, >=27,500 0.1157
>=35,000 0.1116
Unknown 0.0703
Name: y, dtype: float64
In [77]: bd["fi"]=4 # by doing this, we have essentially clubbed "< 4,000" and "Unknown", whose response rates above are the lowest and closest to each other
bd.loc[bd["family_income"]=="< 8,000, >= 4,000","fi"]=6
bd.loc[bd["family_income"]=="<10,000, >= 8,000","fi"]=9
bd.loc[bd["family_income"]=="<12,500, >=10,000","fi"]=11.25
bd.loc[bd["family_income"]=="<15,000, >=12,500","fi"]=13.75
bd.loc[bd["family_income"]=="<17,500, >=15,000","fi"]=16.25
bd.loc[bd["family_income"]=="<20,000, >=17,500","fi"]=18.75
bd.loc[bd["family_income"]=="<22,500, >=20,000","fi"]=21.25
bd.loc[bd["family_income"]=="<25,000, >=22,500","fi"]=23.75
bd.loc[bd["family_income"]=="<27,500, >=25,000","fi"]=26.25
bd.loc[bd["family_income"]=="<30,000, >=27,500","fi"]=28.75
bd.loc[bd["family_income"]==">=35,000","fi"]=35
bd=bd.drop(["family_income"],axis=1)
In [78]: bd.dtypes
hs_own int64
gender_f int64
semp_yes int64
semp_part_yes int64
fi float64
dtype: object
Now that the entire data is of numeric type, let's begin our modelling process after removing NAs from the data.
In [79]: bd.dropna(axis=0,inplace=True)
bd_train, bd_test = train_test_split(bd, test_size = 0.2,random_state=2)
In [80]: x_train=bd_train.drop(["y","REF_NO"],axis=1)
         y_train=bd_train["y"]
         x_test=bd_test.drop(["y","REF_NO"],axis=1)
y_test=bd_test["y"]
In [81]: # solver="liblinear" supports the l1 penalty (it was the default in older scikit-learn)
         logr=LogisticRegression(penalty="l1",class_weight="balanced",random_state=2,solver="liblinear")
In [82]: logr.fit(x_train,y_train)
Out[83]: 0.89959186496956278
To arrive at the eventual 1/0 prediction, we need to find some cutoff to convert the predicted probabilities into the two classes. Let's first get the probabilities out.
In [84]: prob_score=pd.Series(list(zip(*logr.predict_proba(x_train)))[1])
In [85]: cutoffs=np.linspace(0,1,100)
For each of these cutoffs, we are going to look at the TP, FP, TN, FN values and calculate KS (the difference between the true positive rate TP/P and the false positive rate FP/N). Then we'll choose the best cutoff as the one having the highest KS.
In [86]: KS_cut=[]
         for cutoff in cutoffs:
             predicted=pd.Series([0]*len(y_train))
             predicted[prob_score>cutoff]=1
             df=pd.DataFrame(list(zip(y_train,predicted)),columns=["real","predicted"])
             TP=len(df[(df["real"]==1) & (df["predicted"]==1)])
             FP=len(df[(df["real"]==0) & (df["predicted"]==1)])
             TN=len(df[(df["real"]==0) & (df["predicted"]==0)])
             FN=len(df[(df["real"]==1) & (df["predicted"]==0)])
             P=TP+FN
             N=TN+FP
             KS=(TP/P)-(FP/N)
             KS_cut.append(KS)

         cutoff_data=pd.DataFrame(list(zip(cutoffs,KS_cut)),columns=["cutoff","KS"])
         KS_cutoff=cutoff_data[cutoff_data["KS"]==cutoff_data["KS"].max()]["cutoff"]
Now we'll see how this model, with the cutoff determined here, performs on the test data. Note that we first need the predicted probabilities on the test data.

In [87]: prob_score_test=pd.Series(list(zip(*logr.predict_proba(x_test)))[1])
         predicted_test=pd.Series([0]*len(y_test))
         predicted_test[prob_score_test>float(KS_cutoff)]=1
df_test=pd.DataFrame(list(zip(y_test,predicted_test)),columns=["real","predicted"])
k=pd.crosstab(df_test['real'],df_test["predicted"])
print('confusion matrix :\n \n ',k)
TN=k.iloc[0,0]
TP=k.iloc[1,1]
FP=k.iloc[0,1]
FN=k.iloc[1,0]
P=TP+FN
N=TN+FP
confusion matrix :
predicted 0 1
real
0 1646 161
1 26 198
In [88]: #Accuracy on test
         (TP+TN)/(P+N)
Out[88]: 0.9079271294928607
In [89]: #Sensitivity on test
         TP/P
Out[89]: 0.8839285714285714
In [90]: #Specificity on test
TN/N
Out[90]: 0.91090204759269511
Next we see how cutoffs determined by the F-beta score perform on the test data, for beta values 0.5, 1 and 2.
In [91]: cutoffs=np.linspace(0.010,0.99,100)

         def Fbeta_perf(beta,cutoffs,y_train,prob_score):
             FB_cut=[]
             for cutoff in cutoffs:
                 predicted=pd.Series([0]*len(y_train))
                 predicted[prob_score>cutoff]=1
                 df=pd.DataFrame(list(zip(y_train,predicted)),columns=["real","predicted"])
                 TP=len(df[(df["real"]==1) & (df["predicted"]==1)])
                 FP=len(df[(df["real"]==0) & (df["predicted"]==1)])
                 FN=len(df[(df["real"]==1) & (df["predicted"]==0)])
                 P=TP+FN
                 Precision=TP/(TP+FP)
                 Recall=TP/P
                 # F-beta combines precision and recall, weighting recall beta times as much
                 FB=(1+beta**2)*Precision*Recall/((beta**2)*Precision+Recall)
                 FB_cut.append(FB)
             # cutoff with the highest F-beta score on the training data
             cutoff_data=pd.DataFrame(list(zip(cutoffs,FB_cut)),columns=["cutoff","FB"])
             FB_cutoff=cutoff_data[cutoff_data["FB"]==cutoff_data["FB"].max()]["cutoff"]
             # apply that cutoff to the test data and report performance
             prob_score_test=pd.Series(list(zip(*logr.predict_proba(x_test)))[1])
             predicted_test=pd.Series([0]*len(y_test))
             predicted_test[prob_score_test>float(FB_cutoff)]=1
             df_test=pd.DataFrame(list(zip(y_test,predicted_test)),columns=["real","predicted"])
             k=pd.crosstab(df_test['real'],df_test["predicted"])
             # print('confusion matrix :\n \n ',k)
             TN=k.iloc[0,0]
             TP=k.iloc[1,1]
             FP=k.iloc[0,1]
             FN=k.iloc[1,0]
             P=TP+FN
             N=TN+FP
             print('For beta :',beta)
             print('Accuracy is :',(TP+TN)/(P+N))
             print('Sensitivity is :',(TP/P))
             print('Specificity is :',(TN/N))
             print('\n \n \n')
In [92]: Fbeta_perf(0.5,cutoffs,y_train,prob_score)
Fbeta_perf(1,cutoffs,y_train,prob_score)
Fbeta_perf(2,cutoffs,y_train,prob_score)
For beta : 1
Accuracy is : 0.932053175775
Sensitivity is : 0.758928571429
Specificity is : 0.953514111787
For beta : 2
Accuracy is : 0.929591334318
Sensitivity is : 0.834821428571
Specificity is : 0.941339236303
You can see that beta < 1 favors specificity, whereas beta > 1 favors sensitivity.
We'll conclude our discussion here. Please do the practice exercises. If you face any issues, we'll discuss them either in class or on the QA forum on the LMS.
Prepared By : Lalit Sachan ([email protected])
In case of any doubts or errata alerts, please take them to the QA forum for discussion.
Doubts will be discussed in live class sessions too. [This doesn't apply to self-paced students.]