Prepared by Asif Bhat Exploratory Data Analysis: Explore Dataset
Explore Dataset
Out[2]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_R
In [3]: app_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
Out[5]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_AN
Out[6]: float64 65
int64 41
object 16
dtype: int64
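The dtype breakdown above (`float64(65), int64(41), object(16)`) can be reproduced on any DataFrame; a minimal sketch on a toy frame (hypothetical values, standing in for `app_data`):

```python
import pandas as pd

# Toy stand-in for app_data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],        # int64
    "AMT_CREDIT": [1.5, 2.0, 3.5],  # float64
    "CODE_GENDER": ["F", "M", "F"], # object
})

# Count how many columns each dtype has, mirroring the Out[6] breakdown
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts.to_dict())
```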
Data Cleaning
[Bar chart: NULL-value percentage per column — COMMONAREA_*, NONLIVINGAPARTMENTS_*, FONDKAPREMONT_MODE, LIVINGAPARTMENTS_*, FLOORSMIN_*, YEARS_BUILD_*, OWN_CAR_AGE, LANDAREA_*, BASEMENTAREA_*, EXT_SOURCE_1, NONLIVINGAREA_*, ELEVATORS_*, WALLSMATERIAL_MODE, APARTMENTS_MEDI and others are missing in roughly 50-60% of rows]
In [10]: # Get column names with NULL percentage greater than 40%
cols = (app_data.isnull().mean()*100 > 40)[app_data.isnull().mean()*100 > 40].index.to_list()
cols
Out[10]: ['OWN_CAR_AGE',
'EXT_SOURCE_1',
'APARTMENTS_AVG',
'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG',
'COMMONAREA_AVG',
'ELEVATORS_AVG',
'ENTRANCES_AVG',
'FLOORSMAX_AVG',
'FLOORSMIN_AVG',
'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG',
'LIVINGAREA_AVG',
'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG',
'APARTMENTS_MODE',
'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE',
'YEARS_BUILD_MODE',
'COMMONAREA_MODE',
'ELEVATORS_MODE',
'ENTRANCES_MODE',
'FLOORSMAX_MODE',
'FLOORSMIN_MODE',
'LANDAREA_MODE',
'LIVINGAPARTMENTS_MODE',
'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE',
'NONLIVINGAREA_MODE',
'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI',
'YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI',
'ELEVATORS_MEDI',
'ENTRANCES_MEDI',
'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI',
'LANDAREA_MEDI',
'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI',
'NONLIVINGAPARTMENTS_MEDI',
'NONLIVINGAREA_MEDI',
'FONDKAPREMONT_MODE',
'HOUSETYPE_MODE',
'TOTALAREA_MODE',
'WALLSMATERIAL_MODE',
'EMERGENCYSTATE_MODE']
In [11]: # We are good to delete these 49 columns because the NULL percentage for each of them is greater than 40%
len(cols)
Out[11]: 49
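The same threshold-and-drop idea can be sketched on a toy frame (hypothetical column names, for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-missing column (hypothetical data)
df = pd.DataFrame({
    "keep_col": [1, 2, 3, 4, 5],
    "sparse_col": [np.nan, np.nan, np.nan, 4.0, np.nan],  # 80% NULL
})

# Same idea as In [10]/[11]: find columns whose NULL percentage exceeds 40%
null_pct = df.isnull().mean() * 100
cols_to_drop = null_pct[null_pct > 40].index.to_list()
df = df.drop(columns=cols_to_drop)
print(cols_to_drop)  # → ['sparse_col']
```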
In [15]: s2.head(10)
In [16]: app_data.head()
Out[16]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_R
5 rows × 73 columns
In [17]: '''
Impute the missing values of below columns with mode
- AMT_REQ_CREDIT_BUREAU_MONTH
- AMT_REQ_CREDIT_BUREAU_WEEK
- AMT_REQ_CREDIT_BUREAU_DAY
- AMT_REQ_CREDIT_BUREAU_HOUR
- AMT_REQ_CREDIT_BUREAU_QRT
'''
for i in s2.head(10).index.to_list():
    if 'AMT_REQ_CREDIT' in i:
        print('Most frequent value in {0} is : {1}'.format(i, app_data[i].mode()[0]))
        print('Imputing the missing value with : {0}'.format(app_data[i].mode()[0]))
        app_data[i].fillna(app_data[i].mode()[0], inplace=True)
        print('NULL Values in {0} after imputation : {1}'.format(i, app_data[i].isnull().sum()))
        print()
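The mode-imputation step above reduces to a couple of lines per column; a minimal sketch on a hypothetical enquiry-count series:

```python
import numpy as np
import pandas as pd

# Hypothetical enquiry-count column with missing values
s = pd.Series([0.0, 0.0, 1.0, np.nan, 0.0, np.nan])

# Impute with the most frequent value, as the In [17] loop does;
# plain assignment avoids relying on the inplace=True pattern
most_frequent = s.mode()[0]
s = s.fillna(most_frequent)
print(s.isnull().sum())  # → 0
```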
[Bar chart: Count by Occupation Type — Laborers is by far the most frequent category]
In [20]: app_data.OCCUPATION_TYPE.fillna('Laborers',inplace=True)
In [21]: app_data['CODE_GENDER'].value_counts()
Out[21]: F 202448
M 105059
XNA 4
Name: CODE_GENDER, dtype: int64
In [22]: app_data['CODE_GENDER'].replace(to_replace='XNA',value=app_data['CODE_GENDER'].mode()[0],inplace=True)
In [23]: app_data['CODE_GENDER'].value_counts()
Out[23]: F 202452
M 105059
Name: CODE_GENDER, dtype: int64
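Replacing a rare placeholder category with the modal one, as In [22] does for the four `XNA` rows, can be sketched on a toy series (hypothetical values):

```python
import pandas as pd

# Hypothetical gender column with a rare 'XNA' placeholder
gender = pd.Series(["F", "F", "M", "XNA", "F"])

# Replace the placeholder with the modal category, mirroring In [22]
gender = gender.replace("XNA", gender.mode()[0])
print(gender.value_counts().to_dict())  # → {'F': 4, 'M': 1}
```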
In [24]: app_data.EXT_SOURCE_3.dtype
Out[24]: dtype('float64')
In [25]: app_data.EXT_SOURCE_3.fillna(app_data.EXT_SOURCE_3.median(),inplace=True)
In [28]: app_data.columns
Out[29]: ['FLAG_OWN_CAR',
'FLAG_OWN_REALTY',
'FLAG_MOBIL',
'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE',
'FLAG_PHONE',
'FLAG_EMAIL',
'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21']
In [30]: # DELETE all flag columns as they won't be much useful in our analysis
app_data.drop(columns=col,inplace=True)
app_data.head()
#OR
#app_data= app_data[[i for i in app_data.columns if 'FLAG' not in i]]
Out[30]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER CNT_CHILDREN AMT_INCOME_
5 rows × 45 columns
In [32]: app_data['AMT_ANNUITY'].fillna(app_data['AMT_ANNUITY'].median(),inplace=True)
app_data['AMT_GOODS_PRICE'].fillna(app_data['AMT_GOODS_PRICE'].median(),inplace=True)
app_data['AMT_ANNUITY'].isnull().sum()
app_data['AMT_GOODS_PRICE'].isnull().sum()
Out[32]: 0
Correcting Data
In [33]: days = []
for i in app_data.columns:
    if 'DAYS' in i:
        days.append(i)
        print('Unique Values in {0} column : {1}'.format(i, app_data[i].unique()))
        print('NULL Values in {0} column : {1}'.format(i, app_data[i].isnull().sum()))
        print()
In [34]: app_data[days]
Out[34]:
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH DAYS_LAST_PHO
Out[35]:
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH DAYS_LAST_PHO
Binning
Low 85756
High 75513
Very Low 63671
Very High 47118
Medium 35453
Name: AMT_INCOME_TOTAL_Range, dtype: int64
Medium 61569
Very Low 61507
Low 61499
Very High 61484
High 61452
Name: AMT_ANNUITY_Range, dtype: int64
Out[40]:
SK_ID_CURR Client_Age Age Group
Out[42]:
SK_ID_CURR YEARS_EMPLOYED Work Experience
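The near-equal `AMT_ANNUITY_Range` counts (~61.5k each) versus the uneven `AMT_INCOME_TOTAL_Range` counts suggest quantile-based versus width-based binning. A minimal sketch of both, on synthetic lognormal amounts (hypothetical data; the original bin edges are not shown here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=1000))

labels = ["Very Low", "Low", "Medium", "High", "Very High"]

# pd.cut: equal-width value ranges -> bin counts are usually uneven
width_bins = pd.cut(amounts, bins=5, labels=labels)

# pd.qcut: quantile-based ranges -> bin counts are (near-)equal,
# which matches the AMT_ANNUITY_Range counts above
quantile_bins = pd.qcut(amounts, q=5, labels=labels)
print(quantile_bins.value_counts().to_dict())
```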
Outlier Detection
AMT_INCOME_TOTAL
AMT_ANNUITY
In [45]: fig=px.box(app_data['CNT_CHILDREN'])
fig.update_layout(
title=dict(text = "Number of children",x=0.5,y=0.95),
title_font_size=20,
showlegend=False,
width =400,
height =400,
)
fig.show()
[Box plot: Number of children (CNT_CHILDREN) — a handful of values above 10 stand out as outliers]
In [46]: app_data['CNT_CHILDREN'].value_counts()
Out[46]: 0 215371
1 61118
2 26748
3 3717
4 429
5 84
6 21
7 7
14 3
8 2
9 2
12 2
10 2
19 2
11 1
Name: CNT_CHILDREN, dtype: int64
In [47]: app_data.shape[0]
Out[47]: 307509
Out[48]: 307501
In [49]: sns.boxplot(y=app_data['YEARS_EMPLOYED'])
Out[49]: <AxesSubplot:ylabel='YEARS_EMPLOYED'>
In [50]: app_data['YEARS_EMPLOYED'].value_counts()
In [51]: app_data.shape[0]
Out[51]: 307501
In [52]: app_data.loc[app_data['YEARS_EMPLOYED'] > 1000, 'YEARS_EMPLOYED'] = np.nan
In [53]: sns.boxplot(y=app_data['YEARS_EMPLOYED'])
plt.show()
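The notebook blanks out the implausible `YEARS_EMPLOYED` values with a fixed cutoff; a data-driven alternative is the 1.5×IQR rule, sketched here on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one extreme value (like YEARS_EMPLOYED ~ 1000)
years = pd.Series([1.0, 3.0, 5.0, 7.0, 10.0, 1000.0])

# Flag values beyond 1.5*IQR above Q3 as outliers, then blank them with .loc
# (direct chained indexing risks pandas' chained-assignment warning)
q1, q3 = years.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
years.loc[years > upper] = np.nan
print(years.isnull().sum())  # → 1
```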
In [54]: app_data.isnull().sum().sort_values(ascending=False).head(10)
Out[55]: ['AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR']
Univariate Analysis
In [58]: app_data.columns
In [59]: fig1=px.bar(app_data['OCCUPATION_TYPE'].value_counts(),color=app_data['OCCUPATION_TYPE'].value_counts().index,
                     text=app_data['OCCUPATION_TYPE'].value_counts())
fig1.update_traces(textposition='outside',marker_coloraxis=None)
fig1.update_xaxes(title='Occupation Type')
fig1.update_yaxes(title='Count')
fig1.update_layout(
title=dict(text = "Occupation Type",x=0.5,y=0.95),
title_font_size=20,
showlegend=False,
height =450,
)
fig1.show()
[Bar chart: Count by Occupation Type — Laborers, Sales staff, Core staff and Managers lead]
In [60]: fig2=px.bar(app_data['ORGANIZATION_TYPE'].value_counts(),color=app_data['ORGANIZATION_TYPE'].value_counts().index,
                     text=app_data['ORGANIZATION_TYPE'].value_counts())
fig2.update_traces(textposition='outside',marker_coloraxis=None)
fig2.update_xaxes(title='Organization Type')
fig2.update_yaxes(title='Count')
fig2.update_layout(
    title=dict(text = "Organization Type",x=0.5,y=0.95),
    title_font_size=20,
    showlegend=False,
    height =450,
)
fig2.show()
[Bar chart: Count by Organization Type — Business Entity Type 3, Self-employed, Other, Medicine and Government are the most frequent of the 57 categories]
Insights
[Grid of charts: Age Group, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_CONTRACT_TYPE, CODE_GENDER and Work Experience distributions]
The bank has received the majority of loan applications from the 30-40 and 40-50 age groups.
More than 50% of clients who have applied for a loan belong to the Working income type.
88.7% of clients with the Secondary / Secondary Special education type have applied for a loan.
Married people tend to apply more for loans: 63.9% of the clients who have applied are married.
65.8% of loan applicants are female.
In [62]: app_data.nunique().sort_values()
Out[62]: LIVE_REGION_NOT_WORK_REGION 2
REG_REGION_NOT_LIVE_REGION 2
REG_REGION_NOT_WORK_REGION 2
REG_CITY_NOT_LIVE_CITY 2
LIVE_CITY_NOT_WORK_CITY 2
REG_CITY_NOT_WORK_CITY 2
CODE_GENDER 2
NAME_CONTRACT_TYPE 2
TARGET 2
REGION_RATING_CLIENT 3
REGION_RATING_CLIENT_W_CITY 3
AMT_CREDIT_Range 5
AMT_INCOME_TOTAL_Range 5
AMT_REQ_CREDIT_BUREAU_HOUR 5
AMT_GOODS_PRICE_Range 5
NAME_EDUCATION_TYPE 5
Age Group 5
AMT_ANNUITY_Range 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
WEEKDAY_APPR_PROCESS_START 7
Work Experience 7
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
AMT_REQ_CREDIT_BUREAU_WEEK 9
AMT_REQ_CREDIT_BUREAU_DAY 9
DEF_60_CNT_SOCIAL_CIRCLE 9
AMT_REQ_CREDIT_BUREAU_QRT 10
DEF_30_CNT_SOCIAL_CIRCLE 10
CNT_CHILDREN 11
CNT_FAM_MEMBERS 12
OCCUPATION_TYPE 18
HOUR_APPR_PROCESS_START 24
AMT_REQ_CREDIT_BUREAU_MON 24
AMT_REQ_CREDIT_BUREAU_YEAR 25
OBS_60_CNT_SOCIAL_CIRCLE 33
OBS_30_CNT_SOCIAL_CIRCLE 33
ORGANIZATION_TYPE 57
REGION_POPULATION_RELATIVE 81
EXT_SOURCE_3 814
AMT_GOODS_PRICE 1002
AMT_INCOME_TOTAL 2547
DAYS_LAST_PHONE_CHANGE 3773
AMT_CREDIT 5603
DAYS_ID_PUBLISH 6168
YEARS_EMPLOYED 12573
AMT_ANNUITY 13671
DAYS_REGISTRATION 15688
Client_Age 17460
EXT_SOURCE_2 119830
SK_ID_CURR 307500
dtype: int64
Checking Imbalance
In [63]: app_data['TARGET'].value_counts(normalize=True)
Out[63]: 0 0.919275
1 0.080725
Name: TARGET, dtype: float64
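The ~92:8 split above can be summarized as an imbalance ratio; a minimal sketch on a hypothetical target with the same proportions:

```python
import pandas as pd

# Hypothetical target with the same ~92:8 split as Out[63]
target = pd.Series([0] * 92 + [1] * 8)

# Majority-to-minority ratio: how many non-defaulters per defaulter
counts = target.value_counts()
imbalance_ratio = counts[0] / counts[1]
print(imbalance_ratio)  # → 11.5
```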
In [64]: fig=px.pie(values=app_data['TARGET'].value_counts(normalize=True),
names=app_data['TARGET'].value_counts(normalize=True).index,
hole = 0.5
)
fig.update_layout(
title=dict(text = "Target Imbalance",x=0.5,y=0.95),
title_font_size=20,
showlegend=False
)
fig.show()
[Donut chart: Target Imbalance — 91.9% of clients have no payment difficulties (TARGET=0), 8.07% do (TARGET=1)]
In [66]: app_target0.shape
In [67]: app_target1.shape
title = [None]*(2*len(cols))
title[::2] = [i+' (Non-Payment Difficulties)' for i in cols]
title[1::2] = [i+' (Payment Difficulties)' for i in cols]

# Subplot initialization
fig = make_subplots(
    rows=4,
    cols=2,
    subplot_titles=title,
)

# Adding subplots
count=0
for i in range(1,5):
    for j in range(1,3):
        if j==1:
            fig.add_trace(go.Bar(x=app_target0[cols[count]].value_counts().index,
                                 y=app_target0[cols[count]].value_counts(),
                                 name=cols[count],
                                 textposition='auto',
                                 text=[str(i) + '%' for i in (app_target0[cols[count]].value_counts(normalize=True)*100).round(1)],
                                 ),
                          row=i,col=j)
        else:
            fig.add_trace(go.Bar(x=app_target1[cols[count]].value_counts().index,
                                 y=app_target1[cols[count]].value_counts(),
                                 name=cols[count],
                                 textposition='auto',
                                 text=[str(i) + '%' for i in (app_target1[cols[count]].value_counts(normalize=True)*100).round(1)],
                                 ),
                          row=i,col=j)
    count+=1

fig.update_layout(
    title=dict(text = "Analyze Categorical variables (Payment/Non-Payment Difficulties)",x=0.5,y=0.95),
    title_font_size=20,
    showlegend=False,
    height = 1600,
)
fig.show()
[4×2 grids of bar charts comparing value distributions for Non-Payment vs Payment Difficulties: Age Group, NAME_CONTRACT_TYPE, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE, CODE_GENDER and Work Experience]
[Donut charts: OCCUPATION_TYPE and ORGANIZATION_TYPE composition for Non-Payment vs Payment Difficulties — Laborers and Business Entity Type 3 dominate both groups]
Bivariate / Multivariate Analysis
In [71]: # Group data by 'AMT_CREDIT_Range' & 'CODE_GENDER'
df1=app_data.groupby(by=['AMT_CREDIT_Range','CODE_GENDER']).count().reset_index()
df1
Out[71]:
  AMT_CREDIT_Range CODE_GENDER  SK_ID_CURR
2              Low           F       38185
3              Low           M       19912
4           Medium           F       39807
5           Medium           M       21743
6             High           F       42360
7             High           M       21662
8        Very High           F       38019
In [72]: # Group data by 'AMT_INCOME_TOTAL_Range' & 'CODE_GENDER'
df2=app_data.groupby(by=['AMT_INCOME_TOTAL_Range','CODE_GENDER']).count().reset_index()
df2
Out[72]:
  AMT_INCOME_TOTAL_Range CODE_GENDER  SK_ID_CURR
2                    Low           F       59967
3                    Low           M       25787
4                 Medium           F       23184
5                 Medium           M       12269
6                   High           F       43441
7                   High           M       32070
In [73]: fig1=px.bar(data_frame=df1,
            x='AMT_CREDIT_Range',
            y='SK_ID_CURR',color='CODE_GENDER',
            barmode='group',
            text='SK_ID_CURR'
            )
fig1.update_traces(textposition='outside')
fig1.update_xaxes(title='Credit Range')
fig1.update_yaxes(title='Application count')
fig1.update_layout(
    title=dict(text = "Loan Applications by Gender & Credit Range",x=0.5,y=0.95),
    title_font_size=20,
)
fig1.show()
[Grouped bar chart: Loan Applications by Gender & Credit Range]
Insights
Female applicants outnumber male applicants in every credit range.
In [74]: fig2=px.bar(data_frame=df2,
            x='AMT_INCOME_TOTAL_Range',
            y='SK_ID_CURR',color='CODE_GENDER',
            barmode='group',
            text='SK_ID_CURR'
            )
fig2.update_traces(textposition='outside')
fig2.update_xaxes(title='Income Range')
fig2.update_yaxes(title='Application count')
fig2.update_layout(
    title=dict(text = "Loan Applications by Gender & Total Income Range",x=0.5,y=0.95),
    title_font_size=20,
)
fig2.show()
[Grouped bar chart: Loan Applications by Gender & Total Income Range]
Insights
Females with Low & Very Low total income have applied the most for loans.
[Box plots: AMT_CREDIT distribution by NAME_EDUCATION_TYPE, AMT_INCOME_TOTAL_Range, Age Group and Work Experience, shown separately for Non-Payment and Payment Difficulties]
In [84]: plt.figure(figsize=(12,8))
sns.heatmap(app_target0[['AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE',
                         'AMT_CREDIT', 'AMT_ANNUITY',
                         'Client_Age','YEARS_EMPLOYED',
                         'DAYS_ID_PUBLISH', 'DAYS_REGISTRATION',
                         'EXT_SOURCE_2','EXT_SOURCE_3','REGION_POPULATION_RELATIVE']].corr())
plt.title('Correlation matrix for Non-Payment Difficulties')
plt.show()
In [85]: plt.figure(figsize=(12,8))
sns.heatmap(app_target1[['AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE',
                         'AMT_CREDIT', 'AMT_ANNUITY',
                         'Client_Age','YEARS_EMPLOYED',
                         'DAYS_ID_PUBLISH', 'DAYS_REGISTRATION',
                         'EXT_SOURCE_2','EXT_SOURCE_3','REGION_POPULATION_RELATIVE']].corr())
plt.title('Correlation Matrix for Payment Difficulties')
plt.show()
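Beyond eyeballing the heatmaps, the strongest relationships can be ranked programmatically. A minimal sketch on synthetic data (the column names reuse those above, but the values are randomly generated for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=500)
df = pd.DataFrame({
    "AMT_CREDIT": x,
    "AMT_GOODS_PRICE": x + rng.normal(scale=0.1, size=500),  # strongly related
    "EXT_SOURCE_2": rng.normal(size=500),                    # unrelated
})

# Keep the upper triangle of the |corr| matrix, then rank the unique pairs
corr = df.corr().abs()
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
top_pair = pairs.sort_values(ascending=False).index[0]
print(top_pair)  # → ('AMT_CREDIT', 'AMT_GOODS_PRICE')
```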
Out[86]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CR
5 rows × 37 columns
In [88]: appdata_previous.shape
In [90]: appdata_previous.shape
In [91]: days = []
for i in appdata_previous.columns:
    if 'DAYS' in i:
        days.append(i)
        print('Unique Values in {0} column : {1}'.format(i, appdata_previous[i].unique()))
        print()
Unique Values in DAYS_DECISION column : [ -73 -164 -301 ... -1967 -2389 -1]
In [93]: appdata_previous[days]
Out[93]:
         DAYS_DECISION
0                   73
1                  164
2                  301
3                  512
4                  781
...                ...
1670209            544
1670210           1694
1670211           1488
1670212           1185
1670213           1193
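The `DAYS_*` columns store negative offsets from the application date; converting them to positive years, as done for `Client_Age` and `YEARS_EMPLOYED`, can be sketched as follows (hypothetical values):

```python
import pandas as pd

# Hypothetical DAYS_* column: negative day offsets relative to application date
days_birth = pd.Series([-12000, -16425, -9125])

# Convert to positive whole years (truncating), as for Client_Age
age_years = (days_birth.abs() / 365).astype(int)
print(age_years.to_list())  # → [32, 45, 25]
```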
In [95]: appdata_previous.columns
[Bar/donut charts for previous applications: NAME_CONTRACT_STATUS (62.1% Approved, 18.9% Canceled, 17.4% Refused, 1.6% Unused offer), WEEKDAY_APPR_PROCESS_START, NAME_PAYMENT_TYPE, CODE_REJECT_REASON, NAME_CONTRACT_TYPE and NAME_CLIENT_TYPE]
Approved Loans
In [97]: approved=appdata_previous[appdata_previous['NAME_CONTRACT_STATUS']=='Approved']
In [98]: cols = ['NAME_PORTFOLIO','NAME_GOODS_CATEGORY',
        'CHANNEL_TYPE','NAME_YIELD_GROUP' , 'NAME_PRODUCT_TYPE','NAME_CASH_LOAN_PURPOSE']

# Subplot initialization
fig = make_subplots(
    rows=3,
    cols=2,
    subplot_titles=cols,
    horizontal_spacing=0.1,
    vertical_spacing=0.19
)

# Adding subplots
count=0
for i in range(1,4):
    for j in range(1,3):
        fig.add_trace(go.Bar(x=approved[cols[count]].value_counts().index,
                             y=approved[cols[count]].value_counts(),
                             name=cols[count],
                             textposition='auto',
                             text=[str(i) + '%' for i in (approved[cols[count]].value_counts(normalize=True)*100).round(1)],
                             ),
                      row=i,col=j)
        count+=1

fig.update_layout(
    title=dict(text = "Analyze Categorical variables (Frequency)",x=0.5,y=0.95),
    title_font_size=20,
    showlegend=False,
    width = 960,
    height = 1400,
)
fig.show()
[Bar charts for approved loans: NAME_PORTFOLIO (POS leads), NAME_GOODS_CATEGORY (Mobile, Audio/Video, Furniture lead), CHANNEL_TYPE, NAME_YIELD_GROUP, NAME_PRODUCT_TYPE and NAME_CASH_LOAN_PURPOSE]
Refused Loans
In [99]: refused=appdata_previous[appdata_previous['NAME_CONTRACT_STATUS']=='Refused']
[Bar charts for refused loans: NAME_PORTFOLIO (Cash leads), NAME_GOODS_CATEGORY, CHANNEL_TYPE, NAME_YIELD_GROUP, NAME_PRODUCT_TYPE and NAME_CASH_LOAN_PURPOSE — Repairs, Other and Urgent needs are the most common cash-loan purposes]
Merging Application & Previous Application Data
In [101]: appdata_merge = app_data.merge(appdata_previous,on='SK_ID_CURR', how='inner')
appdata_merge.shape
Out[101]: (1413670, 76)
Analysis of Merged Data
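The inner merge on `SK_ID_CURR` is one-to-many (one current application, many previous ones), which is why the merged frame grows well past 307k rows. A minimal sketch on toy tables (hypothetical data), using pandas' `validate` parameter to assert the expected relationship:

```python
import pandas as pd

# Toy versions (hypothetical) of the current and previous application tables
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "AMT_APPLICATION": [100.0, 200.0, 50.0]})

# Inner merge keeps only clients present in both tables; validate raises
# if the left keys are not unique (i.e. the relation is not one-to-many)
merged = app.merge(prev, on="SK_ID_CURR", how="inner", validate="one_to_many")
print(merged.shape)  # → (3, 3)
```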
In [102]: # Functions for multiple plotting per contract status - Bar Chart & Pie Chart
def plot_merge(appdata_merge,column_name):
    col_value = ['Refused','Approved', 'Canceled' , 'Unused offer']
    # Subplot initialization
    fig = make_subplots(
        rows=2,
        cols=2,
        subplot_titles=col_value,
        horizontal_spacing=0.1,
        vertical_spacing=0.3
    )
    # Adding subplots
    count=0
    for i in range(1,3):
        for j in range(1,3):
            subset = appdata_merge[appdata_merge['NAME_CONTRACT_STATUS']==col_value[count]]
            fig.add_trace(go.Bar(x=subset[column_name].value_counts().index,
                                 y=subset[column_name].value_counts(),
                                 name=col_value[count],
                                 textposition='auto',
                                 text=[str(i) + '%' for i in (subset[column_name].value_counts(normalize=True)*100).round(1)],
                                 ),
                          row=i,col=j)
            count+=1
    fig.update_layout(
        title=dict(text = "NAME_CONTRACT_STATUS VS "+column_name,x=0.5,y=0.95),
        title_font_size=20,
        showlegend=False,
        width = 960,
        height = 960,
    )
    fig.show()

def plot_pie_merge(appdata_merge,column_name):
    col_value = ['Refused','Approved', 'Canceled' , 'Unused offer']
    # Subplot initialization
    fig = make_subplots(
        rows=2,
        cols=2,
        subplot_titles=col_value,
        specs=[[{"type": "pie"}, {"type": "pie"}],[{"type": "pie"}, {"type": "pie"}]]
    )
    # Adding subplots
    count=0
    for i in range(1,3):
        for j in range(1,3):
            subset = appdata_merge[appdata_merge['NAME_CONTRACT_STATUS']==col_value[count]]
            fig.add_trace(go.Pie(labels=subset[column_name].value_counts().index,
                                 values=subset[column_name].value_counts(),
                                 textinfo='percent',
                                 insidetextorientation='auto',
                                 hole=.3
                                 ),
                          row=i,col=j)
            count+=1
    fig.update_layout(
        title=dict(text = "NAME_CONTRACT_STATUS VS "+column_name,x=0.5,y=0.95),
        title_font_size=20,
        width = 960,
        height = 960,
    )
    fig.show()
In [104]: plot_pie_merge(appdata_merge,'NAME_CONTRACT_TYPE_y')
[Donut charts: NAME_CONTRACT_TYPE_y composition within Refused, Approved, Canceled and Unused offer applications]
Insights
In [105]: plot_pie_merge(appdata_merge,'NAME_CLIENT_TYPE')
[Donut charts: NAME_CLIENT_TYPE composition within each contract status — Repeater clients dominate every status]
Insights
Most of the approved, refused & canceled loans belong to old (repeat) clients.
Almost 27.4% of loans were provided to new customers.
In [106]: plot_pie_merge(appdata_merge,'CODE_GENDER')
[Donut charts: gender split within each contract status — roughly two-thirds of applications in each status come from female clients]
Insights
In [107]: plot_merge(appdata_merge,'NAME_EDUCATION_TYPE')
[Bar charts: NAME_EDUCATION_TYPE distribution within each contract status — Secondary / Secondary Special accounts for ~73% of Refused and Approved applications]
Insights
Most of the approved loans belong to applicants with Secondary / Secondary Special
education type.
In [108]: plot_merge(appdata_merge,'NAME_INCOME_TYPE')
[Bar charts: NAME_INCOME_TYPE distribution within each contract status — Working income type leads in every status]
Insights
Across all contract statuses (Approved, Refused, Canceled, Unused offer), people with the Working income type lead, so it is quite evident that the majority of loan applications come from this income-type class.
In [109]: plot_pie_merge(appdata_merge,'NAME_FAMILY_STATUS')
[Donut charts: NAME_FAMILY_STATUS within each contract status — Married applicants account for roughly 62-65% of each]
Insights
The share of married applicants is higher among approved loans than in the other contract statuses (refused, canceled, etc.).
In [110]: plot_pie_merge(appdata_merge,'NAME_PORTFOLIO')
[Donut charts: NAME_PORTFOLIO within each contract status — Cash dominates Canceled applications (93.4%), while POS leads elsewhere]
Insights
In [111]: plot_merge(appdata_merge,'OCCUPATION_TYPE')
[Charts: OCCUPATION_TYPE within each contract status — Laborers (~49-52%) and Sales staff lead in every status]
In [112]: plot_merge(appdata_merge,'NAME_GOODS_CATEGORY')
[Charts: NAME_GOODS_CATEGORY within each contract status — Mobile, Computers, Audio/Video and Consumer Electronics are the most common categories]
In [113]: plot_merge(appdata_merge,'PRODUCT_COMBINATION')
[Charts: PRODUCT_COMBINATION within each contract status — POS household with interest and POS mobile with interest lead among approved loans; Cash dominates canceled loans]
Insights
Most of the approved loans belong to the POS household with interest & POS mobile with interest product combinations.
About 15% of refused loans belong to the Cash X-Sell: low product combination.
Most of the canceled loans belong to the Cash category.
81.3% of Unused offer loans belong to POS mobile with interest.
In [114]: plot_merge(appdata_merge,'NAME_PAYMENT_TYPE')
[Bar charts: NAME_PAYMENT_TYPE within each contract status — Cash through the bank is the dominant payment type]
In [115]: plot_merge(appdata_merge,'CHANNEL_TYPE')
[Bar charts: CHANNEL_TYPE within each contract status — Credit and cash offices and Country-wide are the leading channels]
Insights
Most of the approved loans come through either the Country-wide or the Credit and cash offices channel type.
More than 50% of refused loans belong to the Credit and cash offices channel type.
Credit and cash offices channel loans are canceled the most.
More than 90% of Unused offer loans belong to the Country-wide channel type.
In [116]: plot_pie_merge(appdata_merge,'NAME_YIELD_GROUP')
[Donut charts: NAME_YIELD_GROUP within each contract status]
Insights
In [117]: plot_pie_merge(appdata_merge,'NAME_HOUSING_TYPE')
[Donut charts: NAME_HOUSING_TYPE within each contract status — House / apartment accounts for ~90% of each]
[Bar charts: Age Group within each contract status — the 30-40 and 40-50 groups lead in every status]
[Bar charts: Work Experience within each contract status — 0-5 years of experience accounts for roughly half of each]
In [120]: plot_merge(appdata_merge,'AMT_CREDIT_Range')
[Bar charts: AMT_CREDIT_Range within each contract status — counts are fairly even (~16-23%) across the five credit ranges]
Insights
Most of the approved loans belong to the Very Low & High credit ranges.
Medium & Very Low credit-range loans are canceled and rejected the most.
In [121]: plot_merge(appdata_merge,'AMT_INCOME_TOTAL_Range')
[Bar charts: AMT_INCOME_TOTAL_Range within each contract status]
Insights
Most loans are approved for applicants in the Low income range; they may be opting for low-credit loans.
Almost 28% of loan applications are rejected or canceled even when the applicant belongs to the High income range; they may have requested quite a high credit amount.
END