SN Travel - Jupyter Notebook

The document walks through an analysis of travel and survey data from two datasets using pandas in a Jupyter Notebook. It loads the train and test files, merges them on ID, summarises the combined data (shape, missing values, target distribution, descriptive statistics), imputes the missing values, encodes the categorical columns, and then trains and tunes Random Forest models to predict Overall_Experience.


In [362]:

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [363]:

travel_df_train = pd.read_csv('Traveldata_train.csv')
travel_df_test = pd.read_csv('Traveldata_test.csv')

survey_df_train = pd.read_csv('Surveydata_train.csv')
survey_df_test = pd.read_csv('Surveydata_test.csv')


In [364]:

def exp_data_ana(df, target, tvar):
    pd.set_option('display.expand_frame_repr', False)
    # print('\n****Display first 5 rows****')
    # print('****************************')
    # print(df.head())

    # Shape of the data
    print('\n****Shape of the data****')
    print('*************************')
    print('No of rows\t:\t{}\nNo of columns\t:\t{}'.format(df.shape[0], df.shape[1]))

    print('\n****Show information of the data****')
    print('***********************************')
    print(df.info())

    if target == True:
        # print the % of each class in the target variable
        perclass = df[tvar].value_counts(normalize=True)
        print('\n****Target Variable Distribution****')
        print('************************************')
        print('{} Yes\t:\t{}% \n{} No\t:\t{}%'.format(tvar, round(perclass[1]*100, 2),
                                                      tvar, round(perclass[0]*100, 2)))

    # Check for missing values
    print('\n****Missing Values in the Dataset****')
    print('*************************************')
    msv = df.isnull().sum()[df.isnull().sum() > 0]
    if msv.empty:
        print('There are no missing values in the data.')
    else:
        for i in range(msv.count()):
            print('{} Missing values in {} which is {}% of total data'.format(
                msv[i], msv.index[i], round(msv[i]/df.shape[0]*100, 2)))

    # Check for duplicate values
    print('\n****Duplicate data in the Dataset****')
    print('**************************************')
    dups = df.duplicated().sum()
    if dups == 0:
        print('There are no duplicate values in the data.')
    else:
        print(dups)

    print('\n****Describe the data****')
    print('*************************')
    print(df.describe())

    print('\n****Describe the categorical data****')
    print('*************************')
    print(df.describe(include=['O']))

In [365]:

df_train = travel_df_train.merge(survey_df_train,how='left',on=['ID'])
df_test = travel_df_test.merge(survey_df_test,how='left',on=['ID'])

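The two training files are combined with a left merge on ID, so every travel record is kept and the matching survey responses are attached. A minimal sanity check along these lines (not part of the original notebook, and assuming ID is unique in both files) would confirm that the join neither duplicated nor dropped rows:

# Left-merging on a unique key keeps the row count of the left table and
# adds the survey columns (minus the shared ID column).
assert df_train['ID'].is_unique
assert df_train.shape[0] == travel_df_train.shape[0]
assert df_train.shape[1] == travel_df_train.shape[1] + survey_df_train.shape[1] - 1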

In [366]:

exp_data_ana(df_train,True,'Overall_Experience')

****Shape of the data****


*************************
No of rows : 94379
No of columns : 25

****Show information of the data****


***********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94379 entries, 0 to 94378
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 94379 non-null int64
1 Gender 94302 non-null object
2 CustomerType 85428 non-null object
3 Age 94346 non-null float64
4 TypeTravel 85153 non-null object
5 Travel_Class 94379 non-null object
6 Travel_Distance 94379 non-null int64
7 DepartureDelay_in_Mins 94322 non-null float64
8 ArrivalDelay_in_Mins 94022 non-null float64
9 Overall_Experience 94379 non-null int64
10 Seat_comfort 94318 non-null object
11 Seat_Class 94379 non-null object
12 Arrival_time_convenient 85449 non-null object
13 Catering 85638 non-null object
14 Platform_location 94349 non-null object
15 Onboardwifi_service 94349 non-null object
16 Onboard_entertainment 94361 non-null object
17 Online_support 94288 non-null object
18 Onlinebooking_Ease 94306 non-null object
19 Onboard_service 86778 non-null object
20 Leg_room 94289 non-null object
21 Baggage_handling 94237 non-null object
22 Checkin_service 94302 non-null object
23 Cleanliness 94373 non-null object
24 Online_boarding 94373 non-null object
dtypes: float64(3), int64(3), object(19)
memory usage: 18.7+ MB
None

****Target Variable Distribution****


************************************
Overall_Experience Yes : 54.67%
Overall_Experience No : 45.33%

****Missing Values in the Dataset****


*************************************
77 Missing values in Gender which is 0.08% of total data
8951 Missing values in CustomerType which is 9.48% of total data
33 Missing values in Age which is 0.03% of total data
9226 Missing values in TypeTravel which is 9.78% of total data
57 Missing values in DepartureDelay_in_Mins which is 0.06% of total data
357 Missing values in ArrivalDelay_in_Mins which is 0.38% of total data
61 Missing values in Seat_comfort which is 0.06% of total data
8930 Missing values in Arrival_time_convenient which is 9.46% of total data
8741 Missing values in Catering which is 9.26% of total data
30 Missing values in Platform_location which is 0.03% of total data
30 Missing values in Onboardwifi_service which is 0.03% of total data
18 Missing values in Onboard_entertainment which is 0.02% of total data
91 Missing values in Online_support which is 0.1% of total data
73 Missing values in Onlinebooking_Ease which is 0.08% of total data
7601 Missing values in Onboard_service which is 8.05% of total data
90 Missing values in Leg_room which is 0.1% of total data
142 Missing values in Baggage_handling which is 0.15% of total data
77 Missing values in Checkin_service which is 0.08% of total data
6 Missing values in Cleanliness which is 0.01% of total data
6 Missing values in Online_boarding which is 0.01% of total data

****Duplicate data in the Dataset****


**************************************
There are no duplicate values in the data.

****Describe the data****


*************************
                 ID           Age  Travel_Distance  DepartureDelay_in_Mins  ArrivalDelay_in_Mins  Overall_Experience
count  9.437900e+04  94346.000000     94379.000000            94322.000000          94022.000000        94379.000000
mean   9.884719e+07     39.419647      1978.888185               14.647092             15.005222            0.546658
std    2.724501e+04     15.116632      1027.961019               38.138781             38.439409            0.497821
min    9.880000e+07      7.000000        50.000000                0.000000              0.000000            0.000000
25%    9.882360e+07     27.000000      1359.000000                0.000000              0.000000            0.000000
50%    9.884719e+07     40.000000      1923.000000                0.000000              0.000000            1.000000
75%    9.887078e+07     51.000000      2538.000000               12.000000             13.000000            1.000000
max    9.889438e+07     85.000000      6951.000000             1592.000000           1584.000000            1.000000

****Describe the categorical data****
*************************
[Wide describe(include=['O']) output for the 19 categorical columns: counts match the non-null counts above, unique levels per column range from 2 to 6, and the most frequent levels include Female (Gender, 47815), Loyal Customer (CustomerType, 69823), Business travel (TypeTravel, 58617), Eco (Travel_Class, 49342), Green Car (Seat_Class, 47435) and 'good'/'acceptable'/'manageable' for most of the service-rating columns.]


In [367]:

exp_data_ana(df_test,False,False)

****Shape of the data****


*************************
No of rows : 35602
No of columns : 24

****Show information of the data****


***********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35602 entries, 0 to 35601
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 35602 non-null int64
1 Gender 35572 non-null object
2 CustomerType 32219 non-null object
3 Age 35591 non-null float64
4 TypeTravel 32154 non-null object
5 Travel_Class 35602 non-null object
6 Travel_Distance 35602 non-null int64
7 DepartureDelay_in_Mins 35573 non-null float64
8 ArrivalDelay_in_Mins 35479 non-null float64
9 Seat_comfort 35580 non-null object
10 Seat_Class 35602 non-null object
11 Arrival_time_convenient 32277 non-null object
12 Catering 32245 non-null object
13 Platform_location 35590 non-null object
14 Onboardwifi_service 35590 non-null object
15 Onboard_entertainment 35594 non-null object
16 Online_support 35576 non-null object
17 Onlinebooking_Ease 35584 non-null object
18 Onboard_service 32730 non-null object
19 Leg_room 35577 non-null object
20 Baggage_handling 35562 non-null object
21 Checkin_service 35580 non-null object
22 Cleanliness 35600 non-null object
23 Online_boarding 35600 non-null object
dtypes: float64(3), int64(2), object(19)
memory usage: 6.8+ MB
None

****Missing Values in the Dataset****


*************************************
30 Missing values in Gender which is 0.08% of total data
3383 Missing values in CustomerType which is 9.5% of total data
11 Missing values in Age which is 0.03% of total data
3448 Missing values in TypeTravel which is 9.68% of total data
29 Missing values in DepartureDelay_in_Mins which is 0.08% of total data
123 Missing values in ArrivalDelay_in_Mins which is 0.35% of total data
22 Missing values in Seat_comfort which is 0.06% of total data
3325 Missing values in Arrival_time_convenient which is 9.34% of total data
3357 Missing values in Catering which is 9.43% of total data
12 Missing values in Platform_location which is 0.03% of total data
12 Missing values in Onboardwifi_service which is 0.03% of total data
8 Missing values in Onboard_entertainment which is 0.02% of total data
26 Missing values in Online_support which is 0.07% of total data
18 Missing values in Onlinebooking_Ease which is 0.05% of total data
2872 Missing values in Onboard_service which is 8.07% of total data
25 Missing values in Leg_room which is 0.07% of total data
40 Missing values in Baggage_handling which is 0.11% of total data
22 Missing values in Checkin_service which is 0.06% of total data
2 Missing values in Cleanliness which is 0.01% of total data
2 Missing values in Online_boarding which is 0.01% of total data

****Duplicate data in the Dataset****


**************************************
There are no duplicate values in the data.

****Describe the data****


*************************
                 ID           Age  Travel_Distance  DepartureDelay_in_Mins  ArrivalDelay_in_Mins
count  3.560200e+04  35591.000000     35602.000000            35573.000000          35479.000000
mean   9.991780e+07     39.446995      1987.151761               14.880696             15.308802
std    1.027756e+04     15.137554      1024.308863               37.895453             38.531293
min    9.990000e+07      7.000000        50.000000                0.000000              0.000000
25%    9.990890e+07     27.000000      1360.000000                0.000000              0.000000
50%    9.991780e+07     40.000000      1929.000000                0.000000              0.000000
75%    9.992670e+07     51.000000      2559.000000               13.000000             13.000000
max    9.993560e+07     85.000000      6868.000000              978.000000            970.000000

****Describe the categorical data****
*************************
[Wide describe(include=['O']) output for the 19 categorical columns: counts match the non-null counts above, unique levels per column range from 2 to 6, and the most frequent levels include Female (Gender, 18069), Loyal Customer (CustomerType, 26349), Business travel (TypeTravel, 22313), Eco (Travel_Class, 18473), Ordinary (Seat_Class, 17860) and 'good'/'acceptable'/'manageable' for most of the service-rating columns.]

In [368]:

cat = []
num = []
for i in df_train.loc[:, ~df_train.columns.isin(['EmployeeID'])]:
    if df_train[i].dtype == "object":
        cat.append(i)
    else:
        num.append(i)

print('Categorical Variables : \n*****************\n', cat)
print('\nNumerical Variables : \n*****************\n', num)

Categorical Variables :
*****************
['Gender', 'CustomerType', 'TypeTravel', 'Travel_Class', 'Seat_comfort', 'Seat_Class', 'Arrival_time_convenient', 'Catering', 'Platform_location', 'Onboardwifi_service', 'Onboard_entertainment', 'Online_support', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room', 'Baggage_handling', 'Checkin_service', 'Cleanliness', 'Online_boarding']

Numerical Variables :
*****************
['ID', 'Age', 'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDelay_in_Mins', 'Overall_Experience']

In [369]:

#Unique values of the categorical variables
for i in cat:
    print(f'Unique values of the categorical variable {i}')
    print('*******************************************************\n')
    print(df_train[i].value_counts())
    print('\n')
Unique values of the categorical variable Gender
*******************************************************

Female    47815
Male      46487
Name: Gender, dtype: int64


Unique values of the categorical variable CustomerType
*******************************************************

Loyal Customer       69823
disloyal Customer    15605
Name: CustomerType, dtype: int64


Unique values of the categorical variable TypeTravel
*******************************************************

Business travel    58617

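The loop above prints a long value_counts block per column. A compact one-frame overview of the same information (a sketch, not in the original notebook; cat_overview is a hypothetical name) could be built like this:

# One row per categorical column: number of levels, most frequent level and its count.
cat_overview = pd.DataFrame({
    'n_levels': [df_train[c].nunique() for c in cat],
    'top_level': [df_train[c].value_counts().idxmax() for c in cat],
    'top_count': [df_train[c].value_counts().max() for c in cat],
}, index=cat)
print(cat_overview)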

In [370]:

# bins = [7,30,45,60,85]
# labels=['7-30','31-45','46-60','61-85']

# df_train['AgeRange'] = pd.cut(df_train['Age'],bins=bins, labels=labels)

# bins = [0,320,640,960,1280,1600]
# labels=['0-320','321-640','641-960','961-1280','1281-1600']

# df_train['DepartureDelay_in_MinsRange'] = pd.cut(df_train['DepartureDelay_in_Mins'],bins=bins, labels=labels)

# bins = [0,320,640,960,1280,1600]
# labels=['0-320','321-640','641-960','961-1280','1281-1600']

# df_train['ArrivalDelay_in_MinsRange'] = pd.cut(df_train['ArrivalDelay_in_Mins'],bins=bins, labels=labels)

In [371]:

nmstr=df_train['CustomerType'].isnull().sum()
nmste=df_test['CustomerType'].isnull().sum()
print(f'Number of missing values in CustomerType: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')
Number of missing values in CustomerType:
Train : 8951
Test : 3383

In [372]:

df_train.groupby(['Travel_Class'])['CustomerType'].agg(pd.Series.mode)
Out[372]:

Travel_Class
Business Loyal Customer
Eco Loyal Customer
Name: CustomerType, dtype: object

In [373]:

#Impute CustomerType with the mode value within each Travel_Class group


df_train['CustomerType'] = df_train.groupby(['Travel_Class'])['CustomerType'].apply(lambda x: x.fillna(x.mode()[0]))
df_test['CustomerType'] = df_test.groupby(['Travel_Class'])['CustomerType'].apply(lambda x: x.fillna(x.mode()[0]))

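The cell above fills the missing CustomerType values with the most frequent value inside each Travel_Class group. An equivalent sketch using groupby(...).transform (an alternative pattern, not the notebook's own code; fill_with_group_mode is a hypothetical helper) makes the intent explicit and can be reused for the other categorical imputations below:

# Fill NaNs in a column with the per-group mode; transform keeps the original index,
# so the result can be assigned straight back. Assumes each group has at least one
# non-missing value, which holds for this data.
def fill_with_group_mode(df, col, group_cols):
    return df[col].fillna(df.groupby(group_cols)[col].transform(lambda x: x.mode()[0]))

# e.g. df_train['CustomerType'] = fill_with_group_mode(df_train, 'CustomerType', ['Travel_Class'])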
In [374]:

nmstr=df_train['CustomerType'].isnull().sum()
nmste=df_test['CustomerType'].isnull().sum()
print(f'Number of missing values in CustomerType: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in CustomerType:


Train : 0
Test : 0


In [375]:

nmstr=df_train['Age'].isnull().sum()
nmste=df_test['Age'].isnull().sum()
print(f'Number of missing values in Age: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')
Number of missing values in Age:
Train : 33
Test : 11

In [376]:

df_train.groupby(by=['Gender','CustomerType','Travel_Class'])['Age'].median()
Out[376]:

Gender CustomerType Travel_Class


Female Loyal Customer Business 44.0
Eco 40.0
disloyal Customer Business 30.0
Eco 26.0
Male Loyal Customer Business 44.0
Eco 40.0
disloyal Customer Business 30.0
Eco 25.0
Name: Age, dtype: float64

In [377]:

#Impute Age with the median value within each (Gender, CustomerType, Travel_Class) group


df_train['Age'] = df_train['Age'].fillna(df_train.groupby(by=['Gender','CustomerType','Travel_Class'])['Age'].transform('median'))
df_test['Age'] = df_test['Age'].fillna(df_test.groupby(by=['Gender','CustomerType','Travel_Class'])['Age'].transform('median'))

In [378]:

nmstr=df_train['Age'].isnull().sum()
nmste=df_test['Age'].isnull().sum()
print(f'Number of missing values in Age: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in Age:


Train : 0
Test : 0

In [379]:

nmstr=df_train['DepartureDelay_in_Mins'].isnull().sum()
nmste=df_test['DepartureDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in DepartureDelay_in_Mins: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in DepartureDelay_in_Mins:


Train : 57
Test : 29

In [380]:

df_train['DepartureDelay_in_Mins'] = df_train['DepartureDelay_in_Mins'].fillna(0)
df_test['DepartureDelay_in_Mins'] = df_test['DepartureDelay_in_Mins'].fillna(0)

# df_train['ArrivalDelay_in_Mins'] = df_train['ArrivalDelay_in_Mins'].fillna(0)
# df_test['ArrivalDelay_in_Mins'] = df_test['ArrivalDelay_in_Mins'].fillna(0)


In [381]:

nmstr=df_train['DepartureDelay_in_Mins'].isnull().sum()
nmste=df_test['DepartureDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in DepartureDelay_in_Mins: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')
Number of missing values in DepartureDelay_in_Mins:
Train : 0
Test : 0

In [382]:

nmstr=df_train['ArrivalDelay_in_Mins'].isnull().sum()
nmste=df_test['ArrivalDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in ArrivalDelay_in_Mins: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')
Number of missing values in ArrivalDelay_in_Mins:
Train : 357
Test : 123

In [383]:

round(df_train.groupby(by=['Arrival_time_convenient'])['ArrivalDelay_in_Mins'].mean(),2)
Out[383]:

Arrival_time_convenient
acceptable 15.28
excellent 14.55
extremely poor 12.96
good 15.05
need improvement 15.60
poor 15.34
Name: ArrivalDelay_in_Mins, dtype: float64

In [384]:

df_train['ArrivalDelay_in_Mins'] = df_train['ArrivalDelay_in_Mins'].fillna(df_train.groupby(by=['Arrival_time_convenient'])['ArrivalDelay_in_Mins'].transform('mean'))
df_test['ArrivalDelay_in_Mins'] = df_test['ArrivalDelay_in_Mins'].fillna(df_test.groupby(by=['Arrival_time_convenient'])['ArrivalDelay_in_Mins'].transform('mean'))

In [385]:

df_train['ArrivalDelay_in_Mins'] = df_train['ArrivalDelay_in_Mins'].fillna(0)
df_test['ArrivalDelay_in_Mins'] = df_test['ArrivalDelay_in_Mins'].fillna(0)

In [386]:

nmstr=df_train['ArrivalDelay_in_Mins'].isnull().sum()
nmste=df_test['ArrivalDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in ArrivalDelay_in_Mins: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in ArrivalDelay_in_Mins:


Train : 0
Test : 0


In [387]:

nmstr=df_train['Arrival_time_convenient'].isnull().sum()
nmste=df_test['Arrival_time_convenient'].isnull().sum()
print(f'Number of missing values in Arrival_time_convenient: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in Arrival_time_convenient:


Train : 8930
Test : 3325

In [388]:

df_train['Arrival_time_convenient'] = df_train['Arrival_time_convenient'].fillna('0')
df_test['Arrival_time_convenient'] = df_test['Arrival_time_convenient'].fillna('0')

In [389]:

df_train['Arrival_time_convenient'].value_counts()
Out[389]:

good 19574
excellent 17684
acceptable 15177
need improvement 14990
poor 13692
0 8930
extremely poor 4332
Name: Arrival_time_convenient, dtype: int64

In [390]:

df_test['Arrival_time_convenient'].value_counts()

Out[390]:

good 7361
excellent 6589
acceptable 5844
need improvement 5684
poor 5131
0 3325
extremely poor 1668
Name: Arrival_time_convenient, dtype: int64


In [391]:

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'excellent' if (row['Arrival_time_convenient']=='0' and row['Arrival
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'acceptable' if (row['Arrival_time_convenient']=='0' and (row['Arriv
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'poor' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'good' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'extremely poor' if (row['Arrival_time_convenient']=='0' and (row['A
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'need improvement' if (row['Arrival_time_convenient']=='0' and (row[
axis=1
)

df_train['Arrival_time_convenient'].value_counts()

Out[391]:

excellent 23101
good 19575
acceptable 18391
need improvement 14990
poor 13990
extremely poor 4332
Name: Arrival_time_convenient, dtype: int64

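The chain of apply calls above reassigns the temporary '0' placeholder to a rating level driven by the row's ArrivalDelay_in_Mins. The same idea can be written more compactly with pd.cut; the bin edges and label order below are illustrative assumptions, not the notebook's actual thresholds:

# Sketch: map the '0' placeholder to a rating based on the arrival delay.
# delay_bins and delay_labels are placeholder values for illustration only.
delay_bins = [-1, 5, 15, 30, 60, 180, np.inf]
delay_labels = ['excellent', 'good', 'acceptable',
                'need improvement', 'poor', 'extremely poor']

mask = df_train['Arrival_time_convenient'] == '0'
df_train.loc[mask, 'Arrival_time_convenient'] = pd.cut(
    df_train.loc[mask, 'ArrivalDelay_in_Mins'],
    bins=delay_bins, labels=delay_labels).astype(str)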

In [392]:

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'excellent' if (row['Arrival_time_convenient']=='0' and row['Arrival
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'acceptable' if (row['Arrival_time_convenient']=='0' and (row['Arriv
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'poor' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'good' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'extremely poor' if (row['Arrival_time_convenient']=='0' and (row['A
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'need improvement' if (row['Arrival_time_convenient']=='0' and (row[
axis=1
)

df_test['Arrival_time_convenient'].value_counts()

Out[392]:

excellent 8564
good 7361
acceptable 7075
need improvement 5684
poor 5250
extremely poor 1668
Name: Arrival_time_convenient, dtype: int64

In [393]:

nmstr=df_train['Arrival_time_convenient'].isnull().sum()
nmste=df_test['Arrival_time_convenient'].isnull().sum()
print(f'Number of missing values in Arrival_time_convenient: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in Arrival_time_convenient:


Train : 0
Test : 0


In [394]:

nmstr=df_train['TypeTravel'].isnull().sum()
nmste=df_test['TypeTravel'].isnull().sum()
print(f'Number of missing values in TypeTravel: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in TypeTravel:


Train : 9226
Test : 3448

In [395]:

df_train.groupby(['CustomerType','Travel_Class'])['TypeTravel'].agg(pd.Series.mode)

Out[395]:

CustomerType Travel_Class
Loyal Customer Business Business travel
Eco Personal Travel
disloyal Customer Business Business travel
Eco Business travel
Name: TypeTravel, dtype: object

In [396]:

#Impute TypeTravel with the mode value within each (CustomerType, Travel_Class) group


df_train['TypeTravel'] = df_train.groupby(['CustomerType','Travel_Class'])['TypeTravel'].apply(lambda x: x.fillna(x.mode()[0]))
df_test['TypeTravel'] = df_test.groupby(['CustomerType','Travel_Class'])['TypeTravel'].apply(lambda x: x.fillna(x.mode()[0]))

In [397]:

nmstr=df_train['TypeTravel'].isnull().sum()
nmste=df_test['TypeTravel'].isnull().sum()
print(f'Number of missing values in TypeTravel: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in TypeTravel:


Train : 0
Test : 0

In [398]:

nmstr=df_train['TypeTravel'].isnull().sum()
nmste=df_test['TypeTravel'].isnull().sum()
print(f'Number of missing values in TypeTravel: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')
Number of missing values in TypeTravel:
Train : 0
Test : 0

In [400]:

cat = ['Gender','Seat_comfort','Catering','Platform_location',
       'Onboardwifi_service','Onboard_entertainment',
       'Online_support','Onlinebooking_Ease','Leg_room','Baggage_handling','Checkin_service',
       'Cleanliness','Online_boarding']
for i in cat:
    df_train[i] = df_train[i].fillna(df_train[i].mode()[0])
    df_test[i] = df_test[i].fillna(df_test[i].mode()[0])


In [402]:

nmstr=df_train['Onboard_service'].isnull().sum()
nmste=df_test['Onboard_service'].isnull().sum()
print(f'Number of missing values in Onboard_service : \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in Onboard_service :


Train : 7601
Test : 2872

In [401]:

df_train.groupby(['Onboardwifi_service','Onboard_entertainment'])['Onboard_service'].agg(pd.Series.mode)

Out[401]:

Onboardwifi_service Onboard_entertainment
acceptable acceptable good
excellent good
extremely poor acceptable
good good
need improvement good
poor good
excellent acceptable good
excellent good
extremely poor acceptable
good good
need improvement good
poor good
extremely poor acceptable [acceptable, excellent]
excellent need improvement
good excellent
need improvement good
poor need improvement
good acceptable good
excellent good
extremely poor good
good good
need improvement good
poor good
need improvement acceptable good
excellent good
extremely poor good
good good
need improvement acceptable
poor good
poor acceptable acceptable
excellent excellent
extremely poor acceptable
good good
need improvement good
poor good
Name: Onboard_service, dtype: object

In [403]:

df_train['Onboard_service'] = df_train.groupby(['Onboardwifi_service','Onboard_entertainment'])['Onboard_service'].apply(lambda x: x.fillna(x.mode()[0]))
df_test['Onboard_service'] = df_test.groupby(['Onboardwifi_service','Onboard_entertainment'])['Onboard_service'].apply(lambda x: x.fillna(x.mode()[0]))

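One detail visible in the groupby output above: the (extremely poor, acceptable) group has a tied mode, [acceptable, excellent], so pd.Series.mode returns two values there. Taking the first mode resolves the tie deterministically; a small sketch that makes that choice explicit:

# For each (wifi, entertainment) group, pick a single fill value even when the mode is tied.
fill_values = (df_train.groupby(['Onboardwifi_service', 'Onboard_entertainment'])['Onboard_service']
               .agg(lambda x: x.mode()[0]))   # the first mode breaks ties
print(fill_values.head())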

In [404]:

#Check for missing values


print('\n****Missing Values in the Dataset****')
print('*************************************')
msv = df_train.isnull().sum()[df_train.isnull().sum()>0]
if msv.empty:
    print('There are no missing values in the data.')
else:
    for i in range(msv.count()):
        print('{} Missing values in {} which is {}% of total data'.format(
            msv[i], msv.index[i], round(msv[i]/df_train.shape[0]*100, 2)))

****Missing Values in the Dataset****
*************************************
There are no missing values in the data.

In [405]:

#Check for missing values


print('\n****Missing Values in the Dataset****')
print('*************************************')
msv = df_test.isnull().sum()[df_test.isnull().sum()>0]
if msv.empty:
    print('There are no missing values in the data.')
else:
    for i in range(msv.count()):
        print('{} Missing values in {} which is {}% of total data'.format(
            msv[i], msv.index[i], round(msv[i]/df_test.shape[0]*100, 2)))

****Missing Values in the Dataset****
*************************************
There are no missing values in the data.

In [ ]:

num
Out[293]:

['ID',
'Age',
'Travel_Distance',
'DepartureDelay_in_Mins',
'ArrivalDelay_in_Mins',
'Overall_Experience']

In [ ]:

# fig, axis=plt.subplots(nrows=4,ncols=2)
# fig.set_size_inches(15,17)
# fig.tight_layout()

# for i in range(1,5):
# sns.distplot(df_train[num[i]],ax=axis[i-1][0]);
# sns.boxplot(df_train[num[i]],ax=axis[i-1][1]);


In [ ]:

#Define the function to identify the outliers


def remove_outlier(col):
    sorted(col)
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

In [ ]:

# #Fix the outliers


# columns = ['Travel_Distance','DepartureDelay_in_Mins','ArrivalDelay_in_Mins']
# for column in columns:
# lr,ur=remove_outlier(df_train[column])
# df_train[column]=np.where(df_train[column]>ur,ur,df_train[column])

In [ ]:

# fig, axis=plt.subplots(nrows=4,ncols=2)
# fig.set_size_inches(15,17)
# fig.tight_layout()

# for i in range(1,5):
# sns.distplot(df_train[num[i]],ax=axis[i-1][0]);
# sns.boxplot(df_train[num[i]],ax=axis[i-1][1]);

In [406]:

#Heat map - Relationalship analysis


plt.figure(figsize=(18,13))
sns.heatmap(round(df_train.corr(),3),annot=True,mask=np.triu(df_train.corr(),+1));


In [407]:

from sklearn.ensemble import RandomForestClassifier


from IPython.display import HTML
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.metrics import roc_auc_score,roc_curve,plot_confusion_matrix

In [408]:

def display_dataframe(df):
    numeric_col_mask = df.dtypes.apply(lambda d: issubclass(np.dtype(d).type, np.number))

    # Dict used to center the table headers
    d = dict(selector="th",
             props=[('text-align', 'center')])

    # Style: right-align the numeric columns, left-align the rest,
    # format the numbers and center the header
    display(df.style.set_properties(subset=df.columns[numeric_col_mask],
                                    **{'width': '5em', 'height': '3em', 'text-align': 'right'})
              .set_properties(subset=df.columns[~numeric_col_mask],
                              **{'width': '5em', 'text-align': 'left'})
              .format(lambda x: '{:,.0f}'.format(x) if x > 1e3 else '{:,.2f}'.format(x),
                      subset=pd.IndexSlice[:, df.columns[numeric_col_mask]])
              # .highlight_max('color: green')
              .hide_index()
              .set_table_styles([d]))

In [409]:

#AUC and ROC Value


def roc_model(model_name, x, y):
    # predict probabilities
    probs = model_name.predict_proba(x)
    # keep probabilities for the positive outcome only
    probs = probs[:, 1]
    # calculate AUC
    from sklearn.metrics import roc_auc_score
    auc = roc_auc_score(y, probs)
    fpr, tpr, _ = roc_curve(y, probs)
    return probs, auc, fpr, tpr


In [410]:

def con_mat(y_train, y_predict_train, y_test, y_predict_test):
    fig, axis = plt.subplots(nrows=1, ncols=2)
    fig.set_size_inches(10, 4)
    fig.tight_layout()

    cm = confusion_matrix(y_train, y_predict_train, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=[i for i in ["No", "Yes"]],
                         columns=[i for i in ["No", "Yes"]])
    sns.heatmap(df_cm, annot=True, fmt='g', ax=axis[0])
    axis[0].title.set_text('Confusion Matrix - Train Data')

    cm = confusion_matrix(y_test, y_predict_test, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=[i for i in ["No", "Yes"]],
                         columns=[i for i in ["No", "Yes"]])
    sns.heatmap(df_cm, annot=True, fmt='g', ax=axis[1])
    axis[1].title.set_text('Confusion Matrix - Test Data')

In [411]:

def scores_train_test(model, X_train, X_test, y_train, y_test, y_predict_train, y_predict_test, mname):
    # model=bgcl
    s = [[None for j in range(6)] for i in range(2)]

    print(str(model).split('(')[0])
    print('********************************\n')

    model_name = str(model).split('(')[0]
    s[0][0] = model_name + '_' + mname + '_Train'
    s[1][0] = model_name + '_' + mname + '_Test'

    for prf in range(1, 4):
        s[0][prf] = round(score(y_train, y_predict_train)[prf-1][1]*100, 2)
        s[1][prf] = round(score(y_test, y_predict_test)[prf-1][1]*100, 2)

    s[0][prf+1] = round(model.score(X_train, y_train)*100, 2)
    s[1][prf+1] = round(model.score(X_test, y_test)*100, 2)

    probs, auc, fpr, tpr = roc_model(model, X_train, y_train)
    probst, auc1, fpr1, tpr1 = roc_model(model, X_test, y_test)

    s[0][prf+2] = round(auc*100, 2)
    s[1][prf+2] = round(auc1*100, 2)

    df = pd.DataFrame(data=s, columns=['Scores', 'Precision', 'Recall', 'F-Score', 'Accuracy', 'AUC'])

    con_mat(y_train, y_predict_train, y_test, y_predict_test)

    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, marker='o', label='AUC - Train:' + str(s[0][prf+2]))
    plt.plot(fpr1, tpr1, marker='o', label='AUC - Test:' + str(s[1][prf+2]))
    plt.legend(loc="lower right")
    plt.plot([0, 1], [0, 1], linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC')

    print('Display Metrics of Train and Test Data')
    print('***************************************')

    display_dataframe(df)
    return(df)

In [412]:

for feature in df_train.columns:
    if df_train[feature].dtype == 'object':
        df_train[feature] = pd.Categorical(df_train[feature]).codes

for feature in df_test.columns:
    if df_test[feature].dtype == 'object':
        df_test[feature] = pd.Categorical(df_test[feature]).codes

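pd.Categorical(...).codes assigns integer codes to each frame independently, so a level that appears in only one of the frames, or in a different order, could map to different codes in train and test. A defensive sketch (an alternative to the cell above, not the notebook's approach) fixes the category list from the training data and reuses it for the test data; it assumes it runs before the object columns have been converted:

# Derive the category list from the training data and reuse it for the test data,
# so the same level always maps to the same integer code (unseen test levels become -1).
for feature in df_train.columns:
    if df_train[feature].dtype == 'object':
        levels = sorted(df_train[feature].dropna().unique())
        df_train[feature] = pd.Categorical(df_train[feature], categories=levels).codes
        if feature in df_test.columns:
            df_test[feature] = pd.Categorical(df_test[feature], categories=levels).codes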

In [413]:

tvar = 'Overall_Experience'

In [414]:

X = df_train.drop(['ID',tvar], axis=1)
y = df_train[tvar]

In [415]:

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.30,random_state=0)

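train_test_split draws a plain random 70/30 split here. With a roughly 55/45 target this is usually fine; a stratified split (an optional variant, not used above) would keep the class ratio identical in both parts:

# Optional variant: preserve the Overall_Experience class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)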
In [416]:

#Scale the data using feature scaling


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [417]:

sc_train = scaler.fit_transform(X_train)
X_train_sc = pd.DataFrame(sc_train, index=X_train.index, columns=X_train.columns)

In [418]:

sc_test = scaler.transform(X_test)
X_test_sc = pd.DataFrame(sc_test, index=X_test.index, columns=X_test.columns)


In [438]:

rfcl = RandomForestClassifier(random_state=0,max_features=14)

rfcl.fit(X_train, y_train)
rf_train = rfcl.predict(X_train)
rf_test = rfcl.predict(X_test)

res_df = scores_train_test(rfcl,X_train,X_test,y_train,y_test,rf_train,rf_test,'Base')
RandomForestClassifier
********************************

Display Metrics of Train and Test Data


***************************************

Scores Precision Recall F-Score Accuracy AUC

RandomForestClassifier_Base_Train 100.00 100.00 100.00 100.00 100.00

RandomForestClassifier_Base_Test 96.17 94.37 95.26 94.90 99.07

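The baseline forest scores 100% on the training split and about 95% on the held-out split, which is typical of fully grown trees memorising the training data. A cheap extra generalisation check (a sketch, not part of the notebook) is the out-of-bag score, which needs no additional split:

# Out-of-bag estimate: each tree is scored on the bootstrap samples it never saw.
rfcl_oob = RandomForestClassifier(random_state=0, max_features=14, oob_score=True)
rfcl_oob.fit(X_train, y_train)
print('OOB accuracy:', round(rfcl_oob.oob_score_ * 100, 2))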

In [ ]:

# features = X_train.columns
# importances = rfcl.feature_importances_
# indices = np.argsort(importances)

# plt.title('Feature Importances')
# plt.barh(range(len(indices)), importances[indices], color='b', align='center')
# plt.yticks(range(len(indices)), [features[i] for i in indices])
# plt.xlabel('Relative Importance')
# plt.show()

In [420]:

def rfrun(x_train, train_labels, x_test, max_f, no_est, max_dep, min_sam, min_spl):
    param_grid = {
        'criterion': ['gini'],
        'max_depth': max_dep,          # ,7,9],
        'max_features': max_f,         # ,32],
        'min_samples_leaf': min_sam,   # 15,20],
        'min_samples_split': min_spl,  # 75,60],
        'n_estimators': no_est
    }

    rfcl = RandomForestClassifier(random_state=1)

    grid_search = GridSearchCV(estimator=rfcl, param_grid=param_grid, cv=5, scoring='accuracy')
    grid_search.fit(x_train, train_labels)
    print(grid_search.best_params_)

    rfcl = grid_search.best_estimator_
    rfcl

    rfcl_y_predict_train = rfcl.predict(x_train)
    rfcl_y_predict_test = rfcl.predict(x_test)
    return(rfcl, rfcl_y_predict_train, rfcl_y_predict_test)


In [451]:

from sklearn.model_selection import GridSearchCV


rfcl_tuned,rfcl_y_predict_train,rfcl_y_predict_test = rfrun(
X_train,y_train,X_test,[13,17,21],[100,200,300],[15,17,25],[1,3],[3,6])

res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,y_test,rfcl_y_predict_train,rfcl_y_predict_test,'Tuned')
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-450-a2387e5d0eb9> in <module>
      1 from sklearn.model_selection import GridSearchCV
----> 2 rfcl_tuned,rfcl_y_predict_train,rfcl_y_predict_test = rfrun(
      3     X_train,y_train,X_test,[13,17,21],[100,200,300],[15,17,25],[1,3],[3,6])
      4
      5 res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,y_test,rfcl_y_predict_train,rfcl_y_predict_test,'Tuned')

<ipython-input-420-9de6e369a47f> in rfrun(x_train, train_labels, x_test, max_f, no_est, max_dep, min_sam, min_spl)
     12
     13     grid_search = GridSearchCV(estimator = rfcl, param_grid = param_grid, cv = 3, scoring='accuracy')
---> 14     grid_search.fit(x_train, train_labels)
     15     print(grid_search.best_params_)

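The grid above spans 3 x 3 x 3 x 2 x 2 = 108 parameter combinations, each cross-validated, which is why the run was interrupted. A randomized search over the same ranges (an alternative sketch, not the notebook's code) samples a fixed budget of candidates and finishes in a predictable time:

# Sample 20 candidates from the same hyper-parameter space instead of trying all 108.
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'criterion': ['gini'],
    'max_depth': [15, 17, 25],
    'max_features': [13, 17, 21],
    'min_samples_leaf': [1, 3],
    'min_samples_split': [3, 6],
    'n_estimators': [100, 200, 300],
}
rnd_search = RandomizedSearchCV(RandomForestClassifier(random_state=1),
                                param_distributions=param_dist,
                                n_iter=20, cv=3, scoring='accuracy',
                                random_state=1, n_jobs=-1)
# rnd_search.fit(X_train, y_train)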

In [442]:

res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,y_test,rfcl_y_predict_train,rfcl_y_predict_test,'Tuned')
RandomForestClassifier
********************************

Display Metrics of Train and Test Data


***************************************

Scores Precision Recall F-Score Accuracy AUC

RandomForestClassifier_Tuned_Train 99.96 100.00 99.98 99.97 100.00

RandomForestClassifier_Tuned_Test 96.16 94.61 95.38 99.97 100.00

In [ ]:

# from sklearn.neighbors import KNeighborsClassifier


In [ ]:

# for i in range(1,9,2):
# KNN_Model = KNeighborsClassifier(n_neighbors=i,metric='euclidean')
# KNN_Model.fit(X_train_sc,y_train)
# y_test_p = KNN_Model.predict(X_test_sc)
# print(f'Accuracy Score for K={i} : ',KNN_Model.score(X_test_sc,y_test))

In [ ]:

# from sklearn.neural_network import MLPClassifier

In [441]:

def nnrun(x_train, train_labels, x_test, hid_ly, max_int, tol, sol, act):
    param_grid = {
        'hidden_layer_sizes': hid_ly,
        'max_iter': max_int,
        'activation': act,
        'solver': sol,
        'tol': tol,
        'random_state': [0]  # 1
    }

    nncl = MLPClassifier(random_state=0)

    grid_search = GridSearchCV(estimator=nncl, param_grid=param_grid, cv=5, scoring='accuracy')
    grid_search.fit(x_train, train_labels)
    grid_search.best_params_
    print(grid_search.best_params_)
    nn_model = grid_search.best_estimator_
    nn_model
    nn_train = nn_model.predict(x_train)
    nn_test = nn_model.predict(x_test)
    return(nn_model, nn_train, nn_test)

In [ ]:

nn_model,nn_train_p,nn_test_p = nnrun(X_train_sc,y_train,X_test_sc,[100],[1000],[0.0


In [ ]:

res_df = scores_train_test(nn_model,X_train_sc,X_test_sc,y_train,y_test,nn_train_p,n

In [443]:

df_test.drop(tvar,axis=1,inplace=True)

In [444]:

final_model = rfcl_tuned.fit(X,y)

In [445]:

rf_output = final_model.predict(df_test.drop('ID',axis=1))

In [446]:

rf_output
Out[446]:

array([1, 1, 1, ..., 1, 1, 0])

In [447]:

df_test[tvar]= rf_output

In [448]:

m='tuned'

In [449]:

df_test[['ID',tvar]].to_csv('./Hack_submission_'+m+'.csv',index=False)

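Before uploading, a quick check (a sketch, not in the original notebook) that the submission file has one prediction per test ID and only the two expected columns:

# Hypothetical sanity check on the written submission file.
submission = pd.read_csv('./Hack_submission_' + m + '.csv')
assert list(submission.columns) == ['ID', 'Overall_Experience']
assert submission.shape[0] == 35602              # rows in the test set
assert submission['Overall_Experience'].isin([0, 1]).all()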