0% found this document useful (0 votes)

46 views

SN Travel Jupyter Notebook PDF

The document discusses analyzing travel and survey data from two datasets using Pandas in a Jupyter Notebook. It loads the datasets, merges them on ID, then performs several summary analyses including checking the shape of the data, distributions of missing values and target variables, and describing various columns. No significant issues were found in the data.

Uploaded by

sonali Pradhan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views

SN Travel Jupyter Notebook PDF

Uploaded by

sonali Pradhan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 28

09/05/2021 SN Travel - Jupyter Notebook

In [362]:

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [363]:

travel_df_train = pd.read_csv('Traveldata_train.csv')
travel_df_test = pd.read_csv('Traveldata_test.csv')

survey_df_train = pd.read_csv('Surveydata_train.csv')
survey_df_test = pd.read_csv('Surveydata_test.csv')

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 1/28

09/05/2021 SN Travel - Jupyter Notebook

In [364]:

def exp_data_ana(df,target,tvar):
pd.set_option('display.expand_frame_repr', False)
# print('\n****Display first 5 rows****')
# print('****************************')
# print(df.head())

#Shape of the Data

print('\n****Shape of the data****')
print('*************************')
print('No of rows\t:\t{}\nNo of columns\t:\t{}'.format(df.shape[0],df.shape[1]))

print('\nShow infomation of the data')

print('***********************************')
print(df.info())

if(target==True):
#print the % of each class in the Target variable
perclass = df[tvar].value_counts(normalize=True)
print('\n****Target Varible Distribution****')
print('************************************')
print('{} Yes\t:\t{}% \n{} No\t:\t{}%'.format(tvar,round(perclass[1]*100,2),

#Check for missing value

print('\n****Missing Values in the Dataset****')
print('*************************************')
msv = df.isnull().sum()[df.isnull().sum()>0]
if msv.empty:
print('There is no missing values in the data.')
else:
for i in range(msv.count()):
print('{} Missing values in {} which is {}% of total data'.format(msv[i]

# #Check for duplicate values

print('\n****Duplicates data in the Dataset****')
print('**************************************')

dups = df.duplicated().sum()
if dups ==0:
print('There is no duplicate values in the data.')
else:
print(dups)

print('\nDescribe the data')

print('*************************')
print(df.describe())

print('\nDescribe the catagorical data')

print('*************************')
print(df.describe(include=['O']))

In [365]:

df_train = travel_df_train.merge(survey_df_train,how='left',on=['ID'])
df_test = travel_df_test.merge(survey_df_test,how='left',on=['ID'])

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 2/28

09/05/2021 SN Travel - Jupyter Notebook

In [366]:

exp_data_ana(df_train,True,'Overall_Experience')

Shape of the data

*************************
No of rows : 94379
No of columns : 25

Show infomation of the data

***********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94379 entries, 0 to 94378
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 94379 non-null int64
1 Gender 94302 non-null object
2 CustomerType 85428 non-null object
3 Age 94346 non-null float64
4 TypeTravel 85153 non-null object
5 Travel_Class 94379 non-null object
6 Travel_Distance 94379 non-null int64
7 DepartureDelay_in_Mins 94322 non-null float64
8 ArrivalDelay_in_Mins 94022 non-null float64
9 Overall_Experience 94379 non-null int64
10 Seat_comfort 94318 non-null object
11 Seat_Class 94379 non-null object
12 Arrival_time_convenient 85449 non-null object
13 Catering 85638 non-null object
14 Platform_location 94349 non-null object
15 Onboardwifi_service 94349 non-null object
16 Onboard_entertainment 94361 non-null object
17 Online_support 94288 non-null object
18 Onlinebooking_Ease 94306 non-null object
19 Onboard_service 86778 non-null object
20 Leg_room 94289 non-null object
21 Baggage_handling 94237 non-null object
22 Checkin_service 94302 non-null object
23 Cleanliness 94373 non-null object
24 Online_boarding 94373 non-null object
dtypes: float64(3), int64(3), object(19)
memory usage: 18.7+ MB
None

Target Varible Distribution

************************************
Overall_Experience Yes : 54.67%
Overall_Experience No : 45.33%

Missing Values in the Dataset

*************************************
77 Missing values in Gender which is 0.08% of total data
8951 Missing values in CustomerType which is 9.48% of total data
33 Missing values in Age which is 0.03% of total data
9226 Missing values in TypeTravel which is 9.78% of total data
57 Missing values in DepartureDelay_in_Mins which is 0.06% of total da
ta
357 Missing values in ArrivalDelay_in_Mins which is 0.38% of total dat
a
localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 3/28
09/05/2021 SN Travel - Jupyter Notebook
61 Missing values in Seat_comfort which is 0.06% of total data
8930 Missing values in Arrival_time_convenient which is 9.46% of total
data
8741 Missing values in Catering which is 9.26% of total data
30 Missing values in Platform_location which is 0.03% of total data
30 Missing values in Onboardwifi_service which is 0.03% of total data
18 Missing values in Onboard_entertainment which is 0.02% of total dat
a
91 Missing values in Online_support which is 0.1% of total data
73 Missing values in Onlinebooking_Ease which is 0.08% of total data
7601 Missing values in Onboard_service which is 8.05% of total data
90 Missing values in Leg_room which is 0.1% of total data
142 Missing values in Baggage_handling which is 0.15% of total data
77 Missing values in Checkin_service which is 0.08% of total data
6 Missing values in Cleanliness which is 0.01% of total data
6 Missing values in Online_boarding which is 0.01% of total data

Duplicates data in the Dataset

**************************************
There is no duplicate values in the data.

Describe the data

*************************
ID Age Travel_Distance DepartureDelay_in_
Mins ArrivalDelay_in_Mins Overall_Experience
count 9.437900e+04 94346.000000 94379.000000 94322.00
0000 94022.000000 94379.000000
mean 9.884719e+07 39.419647 1978.888185 14.64
7092 15.005222 0.546658
std 2.724501e+04 15.116632 1027.961019 38.13
8781 38.439409 0.497821
min 9.880000e+07 7.000000 50.000000 0.00
0000 0.000000 0.000000
25% 9.882360e+07 27.000000 1359.000000 0.00
0000 0.000000 0.000000
50% 9.884719e+07 40.000000 1923.000000 0.00
0000 0.000000 1.000000
75% 9.887078e+07 51.000000 2538.000000 12.00
0000 13.000000 1.000000
max 9.889438e+07 85.000000 6951.000000 1592.00
0000 1584.000000 1.000000

Describe the catagorical data

*************************
Gender CustomerType TypeTravel Travel_Class Seat_comf
ort Seat_Class Arrival_time_convenient Catering Platform_location O
nboardwifi_service Onboard_entertainment Online_support Onlinebooking_
Ease Onboard_service Leg_room Baggage_handling Checkin_service Cleanli
ness Online_boarding
count 94302 85428 85153 94379 94
318 94379 85449 85638 94349
94349 94361 94288 94306
86778 94289 94237 94302 94373
94373
unique 2 2 2 2
6 2 6 6 6
6 6 6 6
6 6 5 6 6
6
top Female Loyal Customer Business travel Eco accepta
ble Green Car good acceptable manageable

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 4/28

09/05/2021 SN Travel - Jupyter Notebook
good good good good
good good good good good
good
freq 47815 69823 58617 49342 21
158 47435 19574 18468 24173
22835 30446 30016 28909
27265 28870 34944 26502 35427
25533

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 5/28

09/05/2021 SN Travel - Jupyter Notebook

In [367]:

exp_data_ana(df_test,False,False)

Shape of the data

*************************
No of rows : 35602
No of columns : 24

Show infomation of the data

***********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35602 entries, 0 to 35601
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 35602 non-null int64
1 Gender 35572 non-null object
2 CustomerType 32219 non-null object
3 Age 35591 non-null float64
4 TypeTravel 32154 non-null object
5 Travel_Class 35602 non-null object
6 Travel_Distance 35602 non-null int64
7 DepartureDelay_in_Mins 35573 non-null float64
8 ArrivalDelay_in_Mins 35479 non-null float64
9 Seat_comfort 35580 non-null object
10 Seat_Class 35602 non-null object
11 Arrival_time_convenient 32277 non-null object
12 Catering 32245 non-null object
13 Platform_location 35590 non-null object
14 Onboardwifi_service 35590 non-null object
15 Onboard_entertainment 35594 non-null object
16 Online_support 35576 non-null object
17 Onlinebooking_Ease 35584 non-null object
18 Onboard_service 32730 non-null object
19 Leg_room 35577 non-null object
20 Baggage_handling 35562 non-null object
21 Checkin_service 35580 non-null object
22 Cleanliness 35600 non-null object
23 Online_boarding 35600 non-null object
dtypes: float64(3), int64(2), object(19)
memory usage: 6.8+ MB
None

Missing Values in the Dataset

*************************************
30 Missing values in Gender which is 0.08% of total data
3383 Missing values in CustomerType which is 9.5% of total data
11 Missing values in Age which is 0.03% of total data
3448 Missing values in TypeTravel which is 9.68% of total data
29 Missing values in DepartureDelay_in_Mins which is 0.08% of total d
ata
123 Missing values in ArrivalDelay_in_Mins which is 0.35% of total da
ta
22 Missing values in Seat_comfort which is 0.06% of total data
3325 Missing values in Arrival_time_convenient which is 9.34% of tota
l data
3357 Missing values in Catering which is 9.43% of total data
12 Missing values in Platform_location which is 0.03% of total data
12 Missing values in Onboardwifi_service which is 0.03% of total data
localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 6/28
09/05/2021 SN Travel - Jupyter Notebook
8 Missing values in Onboard_entertainment which is 0.02% of total dat
a
26 Missing values in Online_support which is 0.07% of total data
18 Missing values in Onlinebooking_Ease which is 0.05% of total data
2872 Missing values in Onboard_service which is 8.07% of total data
25 Missing values in Leg_room which is 0.07% of total data
40 Missing values in Baggage_handling which is 0.11% of total data
22 Missing values in Checkin_service which is 0.06% of total data
2 Missing values in Cleanliness which is 0.01% of total data
2 Missing values in Online_boarding which is 0.01% of total data

Duplicates data in the Dataset

**************************************
There is no duplicate values in the data.

Describe the data

*************************
ID Age Travel_Distance DepartureDelay_in
_Mins ArrivalDelay_in_Mins
count 3.560200e+04 35591.000000 35602.000000 35573.0
00000 35479.000000
mean 9.991780e+07 39.446995 1987.151761 14.8
80696 15.308802
std 1.027756e+04 15.137554 1024.308863 37.8
95453 38.531293
min 9.990000e+07 7.000000 50.000000 0.0
00000 0.000000
25% 9.990890e+07 27.000000 1360.000000 0.0
00000 0.000000
50% 9.991780e+07 40.000000 1929.000000 0.0
00000 0.000000
75% 9.992670e+07 51.000000 2559.000000 13.0
00000 13.000000
max 9.993560e+07 85.000000 6868.000000 978.0
00000 970.000000

Describe the catagorical data

*************************
Gender CustomerType TypeTravel Travel_Class Seat_com
fort Seat_Class Arrival_time_convenient Catering Platform_location
Onboardwifi_service Onboard_entertainment Online_support Onlinebookin
g_Ease Onboard_service Leg_room Baggage_handling Checkin_service Clea
nliness Online_boarding
count 35572 32219 32154 35602 3
5580 35602 32277 32245 35590
35590 35594 35576 35584
32730 35577 35562 35580 35600
35600
unique 2 2 2 2
6 2 6 6 5
6 6 5 6
5 6 5 5 5
6
top Female Loyal Customer Business travel Eco accept
able Ordinary good acceptable manageable
good good good good
good good good good good
good
freq 18069 26349 22313 18473
8003 17860 7361 7133 9364
8743 11436 11487 11025

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 7/28

09/05/2021 SN Travel - Jupyter Notebook
10222 10842 13291 9982 13421
9680

In [368]:

cat=[]
num=[]
for i in df_train.loc[:, ~df_train.columns. isin(['EmployeeID'])]:
if df_train[i].dtype=="object":
cat.append(i)
else:
num.append(i)

print('Catogorical Variables : \n*****************\n', cat)

print('\nNumerical Variables : \n*****************\n', num)

Catogorical Variables :
*****************
['Gender', 'CustomerType', 'TypeTravel', 'Travel_Class', 'Seat_comfor
t', 'Seat_Class', 'Arrival_time_convenient', 'Catering', 'Platform_loc
ation', 'Onboardwifi_service', 'Onboard_entertainment', 'Online_suppor
t', 'Onlinebooking_Ease', 'Onboard_service', 'Leg_room', 'Baggage_hand
ling', 'Checkin_service', 'Cleanliness', 'Online_boarding']

Numerical Variables :
*****************
['ID', 'Age', 'Travel_Distance', 'DepartureDelay_in_Mins', 'ArrivalDe
lay_in_Mins', 'Overall_Experience']

In [369]:

#Unique values of the catagrical variables

for i in cat:
print(f'Unique values of the catagorical variable {i}')
print('*******************************************************\n')
print(df_train[i].value_counts())
print('\n')
Unique values of the catagorical variable Gender
*******************************************************

Female 47815
Male 46487
Name: Gender, dtype: int64

Unique values of the catagorical variable CustomerType

*******************************************************

Loyal Customer 69823

disloyal Customer 15605
Name: CustomerType, dtype: int64

Unique values of the catagorical variable TypeTravel

*******************************************************

Business travel 58617

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 8/28

09/05/2021 SN Travel - Jupyter Notebook

In [370]:

# bins = [7,30,45,60,85]
# labels=['7-30','31-45','46-60','61-85']

# df_train['AgeRange'] = pd.cut(df_train['Age'],bins=bins, labels=labels)

# bins = [0,320,640,960,1280,1600]
# labels=['0-320','321-640','641-960','961-1280','1281-1600']

# df_train['DepartureDelay_in_MinsRange'] = pd.cut(df_train['DepartureDelay_in_Mins

# bins = [0,320,640,960,1280,1600]
# labels=['0-320','321-640','641-960','961-1280','1281-1600']

# df_train['ArrivalDelay_in_MinsRange'] = pd.cut(df_train['ArrivalDelay_in_Mins'],bi

In [371]:

nmstr=df_train['CustomerType'].isnull().sum()
nmste=df_test['CustomerType'].isnull().sum()
print(f'Number of missing values in CustomerType: \n Train\t:\t{nmstr}\n Test\t:\t{n
Number of missing values in CustomerType:
Train : 8951
Test : 3383

In [372]:

df_train.groupby(['Travel_Class'])['CustomerType'].agg(pd.Series.mode)
Out[372]:

Travel_Class
Business Loyal Customer
Eco Loyal Customer
Name: CustomerType, dtype: object

In [373]:

#Impute the Department with mode value of group by "EductaionField"

df_train['CustomerType'] = df_train.groupby(['Travel_Class'])['CustomerType'].apply(
df_test['CustomerType'] = df_test.groupby(['Travel_Class'])['CustomerType'].apply(la

In [374]:

nmstr=df_train['CustomerType'].isnull().sum()
nmste=df_test['CustomerType'].isnull().sum()
print(f'Number of missing values in CustomerType: \n Train\t:\t{nmstr}\n Test\t:\t{n

Number of missing values in CustomerType:

Train : 0
Test : 0

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 9/28

09/05/2021 SN Travel - Jupyter Notebook

In [375]:

nmstr=df_train['Age'].isnull().sum()
nmste=df_test['Age'].isnull().sum()
print(f'Number of missing values in Age: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')
Number of missing values in Age:
Train : 33
Test : 11

In [376]:

df_train.groupby(by=['Gender','CustomerType','Travel_Class'])['Age'].median()
Out[376]:

Gender CustomerType Travel_Class

Female Loyal Customer Business 44.0
Eco 40.0
disloyal Customer Business 30.0
Eco 26.0
Male Loyal Customer Business 44.0
Eco 40.0
disloyal Customer Business 30.0
Eco 25.0
Name: Age, dtype: float64

In [377]:

#Impute the Age with group by "WorkExprience" and "EductaionField"

df_train['Age'] = df_train['Age'].fillna(df_train.groupby(by=['Gender','CustomerType
df_test['Age'] = df_test['Age'].fillna(df_test.groupby(by=['Gender','CustomerType','

In [378]:

nmstr=df_train['Age'].isnull().sum()
nmste=df_test['Age'].isnull().sum()
print(f'Number of missing values in Age: \n Train\t:\t{nmstr}\n Test\t:\t{nmste}')

Number of missing values in Age:

Train : 0
Test : 0

In [379]:

nmstr=df_train['DepartureDelay_in_Mins'].isnull().sum()
nmste=df_test['DepartureDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in DepartureDelay_in_Mins: \n Train\t:\t{nmstr}\n T

Number of missing values in DepartureDelay_in_Mins:

Train : 57
Test : 29

In [380]:

df_train['DepartureDelay_in_Mins'] = df_train['DepartureDelay_in_Mins'].fillna(0)
df_test['DepartureDelay_in_Mins'] = df_test['DepartureDelay_in_Mins'].fillna(0)

# df_train['ArrivalDelay_in_Mins'] = df_train['ArrivalDelay_in_Mins'].fillna(0)
# df_test['ArrivalDelay_in_Mins'] = df_test['ArrivalDelay_in_Mins'].fillna(0)

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 10/28

09/05/2021 SN Travel - Jupyter Notebook

In [381]:

nmstr=df_train['DepartureDelay_in_Mins'].isnull().sum()
nmste=df_test['DepartureDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in DepartureDelay_in_Mins: \n Train\t:\t{nmstr}\n T
Number of missing values in DepartureDelay_in_Mins:
Train : 0
Test : 0

In [382]:

nmstr=df_train['ArrivalDelay_in_Mins'].isnull().sum()
nmste=df_test['ArrivalDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in ArrivalDelay_in_Mins: \n Train\t:\t{nmstr}\n Tes
Number of missing values in ArrivalDelay_in_Mins:
Train : 357
Test : 123

In [383]:

round(df_train.groupby(by=['Arrival_time_convenient'])['ArrivalDelay_in_Mins'].mean(
Out[383]:

Arrival_time_convenient
acceptable 15.28
excellent 14.55
extremely poor 12.96
good 15.05
need improvement 15.60
poor 15.34
Name: ArrivalDelay_in_Mins, dtype: float64

In [384]:

df_train['ArrivalDelay_in_Mins'] = df_train['ArrivalDelay_in_Mins'].fillna(df_train.
df_test['ArrivalDelay_in_Mins'] = df_test['ArrivalDelay_in_Mins'].fillna(df_test.gro

In [385]:

df_train['ArrivalDelay_in_Mins'] = df_train['ArrivalDelay_in_Mins'].fillna(0)
df_test['ArrivalDelay_in_Mins'] = df_test['ArrivalDelay_in_Mins'].fillna(0)

In [386]:

nmstr=df_train['ArrivalDelay_in_Mins'].isnull().sum()
nmste=df_test['ArrivalDelay_in_Mins'].isnull().sum()
print(f'Number of missing values in ArrivalDelay_in_Mins: \n Train\t:\t{nmstr}\n Tes

Number of missing values in ArrivalDelay_in_Mins:

Train : 0
Test : 0

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 11/28

09/05/2021 SN Travel - Jupyter Notebook

In [387]:

nmstr=df_train['Arrival_time_convenient'].isnull().sum()
nmste=df_test['Arrival_time_convenient'].isnull().sum()
print(f'Number of missing values in Arrival_time_convenient: \n Train\t:\t{nmstr}\n

Number of missing values in Arrival_time_convenient:

Train : 8930
Test : 3325

In [388]:

df_train['Arrival_time_convenient'] = df_train['Arrival_time_convenient'].fillna('0'
df_test['Arrival_time_convenient'] = df_test['Arrival_time_convenient'].fillna('0')

In [389]:

df_train['Arrival_time_convenient'].value_counts()
Out[389]:

good 19574
excellent 17684
acceptable 15177
need improvement 14990
poor 13692
0 8930
extremely poor 4332
Name: Arrival_time_convenient, dtype: int64

In [390]:

df_test['Arrival_time_convenient'].value_counts()

Out[390]:

good 7361
excellent 6589
acceptable 5844
need improvement 5684
poor 5131
0 3325
extremely poor 1668
Name: Arrival_time_convenient, dtype: int64

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 12/28

09/05/2021 SN Travel - Jupyter Notebook

In [391]:

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'excellent' if (row['Arrival_time_convenient']=='0' and row['Arrival
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'acceptable' if (row['Arrival_time_convenient']=='0' and (row['Arriv
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'poor' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'good' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'extremely poor' if (row['Arrival_time_convenient']=='0' and (row['A
axis=1
)

df_train['Arrival_time_convenient']=df_train.apply(
lambda row: 'need improvement' if (row['Arrival_time_convenient']=='0' and (row[
axis=1
)

df_train['Arrival_time_convenient'].value_counts()

Out[391]:

excellent 23101
good 19575
acceptable 18391
need improvement 14990
poor 13990
extremely poor 4332
Name: Arrival_time_convenient, dtype: int64

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 13/28

09/05/2021 SN Travel - Jupyter Notebook

In [392]:

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'excellent' if (row['Arrival_time_convenient']=='0' and row['Arrival
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'acceptable' if (row['Arrival_time_convenient']=='0' and (row['Arriv
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'poor' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'good' if (row['Arrival_time_convenient']=='0' and (row['ArrivalDela
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'extremely poor' if (row['Arrival_time_convenient']=='0' and (row['A
axis=1
)

df_test['Arrival_time_convenient']=df_test.apply(
lambda row: 'need improvement' if (row['Arrival_time_convenient']=='0' and (row[
axis=1
)

df_test['Arrival_time_convenient'].value_counts()

Out[392]:

excellent 8564
good 7361
acceptable 7075
need improvement 5684
poor 5250
extremely poor 1668
Name: Arrival_time_convenient, dtype: int64

In [393]:

nmstr=df_train['Arrival_time_convenient'].isnull().sum()
nmste=df_test['Arrival_time_convenient'].isnull().sum()
print(f'Number of missing values in Arrival_time_convenient: \n Train\t:\t{nmstr}\n

Number of missing values in Arrival_time_convenient:

Train : 0
Test : 0

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 14/28

09/05/2021 SN Travel - Jupyter Notebook

In [394]:

nmstr=df_train['TypeTravel'].isnull().sum()
nmste=df_test['TypeTravel'].isnull().sum()
print(f'Number of missing values in TypeTravel: \n Train\t:\t{nmstr}\n Test\t:\t{nms

Number of missing values in TypeTravel:

Train : 9226
Test : 3448

In [395]:

df_train.groupby(['CustomerType','Travel_Class'])['TypeTravel'].agg(pd.Series.mode)

Out[395]:

CustomerType Travel_Class
Loyal Customer Business Business travel
Eco Personal Travel
disloyal Customer Business Business travel
Eco Business travel
Name: TypeTravel, dtype: object

In [396]:

#Impute the Department with mode value of group by "EductaionField"

df_train['TypeTravel'] = df_train.groupby(['CustomerType','Travel_Class'])['TypeTrav
df_test['TypeTravel'] = df_test.groupby(['CustomerType','Travel_Class'])['TypeTravel

In [397]:

nmstr=df_train['TypeTravel'].isnull().sum()
nmste=df_test['TypeTravel'].isnull().sum()
print(f'Number of missing values in TypeTravel: \n Train\t:\t{nmstr}\n Test\t:\t{nms

Number of missing values in TypeTravel:

Train : 0
Test : 0

In [398]:

nmstr=df_train['TypeTravel'].isnull().sum()
nmste=df_test['TypeTravel'].isnull().sum()
print(f'Number of missing values in TypeTravel: \n Train\t:\t{nmstr}\n Test\t:\t{nms
Number of missing values in TypeTravel:
Train : 0
Test : 0

In [400]:

cat = ['Gender','Seat_comfort','Catering','Platform_location',
'Onboardwifi_service','Onboard_entertainment',
'Online_support','Onlinebooking_Ease','Leg_room','Baggage_handling','Checkin_service
for i in cat:
df_train[i] = df_train[i].fillna(df_train[i].mode()[0])
df_test[i] = df_test[i].fillna(df_test[i].mode()[0])

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 15/28

09/05/2021 SN Travel - Jupyter Notebook

In [402]:

nmstr=df_train['Onboard_service'].isnull().sum()
nmste=df_test['Onboard_service'].isnull().sum()
print(f'Number of missing values in Onboard_service : \n Train\t:\t{nmstr}\n Test\t:

Number of missing values in Onboard_service :

Train : 7601
Test : 2872

In [401]:

df_train.groupby(['Onboardwifi_service','Onboard_entertainment'])['Onboard_service']

Out[401]:

Onboardwifi_service Onboard_entertainment
acceptable acceptable good
excellent good
extremely poor acceptable
good good
need improvement good
poor good
excellent acceptable good
excellent good
extremely poor acceptable
good good
need improvement good
poor good
extremely poor acceptable [acceptable, excellent]
excellent need improvement
good excellent
need improvement good
poor need improvement
good acceptable good
excellent good
extremely poor good
good good
need improvement good
poor good
need improvement acceptable good
excellent good
extremely poor good
good good
need improvement acceptable
poor good
poor acceptable acceptable
excellent excellent
extremely poor acceptable
good good
need improvement good
poor good
Name: Onboard_service, dtype: object

In [403]:

df_train['Onboard_service'] = df_train.groupby(['Onboardwifi_service','Onboard_enter
df_test['Onboard_service'] = df_test.groupby(['Onboardwifi_service','Onboard_enterta

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 16/28

09/05/2021 SN Travel - Jupyter Notebook

In [404]:

#Check for missing value

print('\n****Missing Values in the Dataset****')
print('*************************************')
msv = df_train.isnull().sum()[df_train.isnull().sum()>0]
if msv.empty:
print('There is no missing values in the data.')
else:
for i in range(msv.count()):
print('{} Missing values in {} which is {}% of total data'.format(msv[i],msv

Missing Values in the Dataset

*************************************
There is no missing values in the data.

In [405]:

#Check for missing value

print('\n****Missing Values in the Dataset****')
print('*************************************')
msv = df_test.isnull().sum()[df_test.isnull().sum()>0]
if msv.empty:
print('There is no missing values in the data.')
else:
for i in range(msv.count()):
print('{} Missing values in {} which is {}% of total data'.format(msv[i],msv

Missing Values in the Dataset

*************************************
There is no missing values in the data.

In [ ]:

num
Out[293]:

['ID',
'Age',
'Travel_Distance',
'DepartureDelay_in_Mins',
'ArrivalDelay_in_Mins',
'Overall_Experience']

In [ ]:

# fig, axis=plt.subplots(nrows=4,ncols=2)
# fig.set_size_inches(15,17)
# fig.tight_layout()

# for i in range(1,5):
# sns.distplot(df_train[num[i]],ax=axis[i-1][0]);
# sns.boxplot(df_train[num[i]],ax=axis[i-1][1]);

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 17/28

09/05/2021 SN Travel - Jupyter Notebook

In [ ]:

#Define the function to identify the outliers

def remove_outlier(col):
sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range

In [ ]:

# #Fix the outliers

# columns = ['Travel_Distance','DepartureDelay_in_Mins','ArrivalDelay_in_Mins']
# for column in columns:
# lr,ur=remove_outlier(df_train[column])
# df_train[column]=np.where(df_train[column]>ur,ur,df_train[column])

In [ ]:

# fig, axis=plt.subplots(nrows=4,ncols=2)
# fig.set_size_inches(15,17)
# fig.tight_layout()

# for i in range(1,5):
# sns.distplot(df_train[num[i]],ax=axis[i-1][0]);
# sns.boxplot(df_train[num[i]],ax=axis[i-1][1]);

In [406]:

#Heat map - Relationalship analysis

plt.figure(figsize=(18,13))
sns.heatmap(round(df_train.corr(),3),annot=True,mask=np.triu(df_train.corr(),+1));

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 18/28

09/05/2021 SN Travel - Jupyter Notebook

In [407]:

from sklearn.ensemble import RandomForestClassifier

from IPython.display import HTML
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.metrics import roc_auc_score,roc_curve,plot_confusion_matrix

In [408]:

def display_dataframe(df):
numeric_col_mask = df.dtypes.apply(lambda d: issubclass(np.dtype(d).type, np.num

# Dict used to center the table headers

d = dict(selector="th",
props=[('text-align', 'center')])

# Style
display(df.style.set_properties(subset=df.columns[numeric_col_mask], # right-ali
**{'width':'5em', 'height':'3em','text-align':'right','b
.set_properties(subset=df.columns[~numeric_col_mask], # left-align the n
**{'width':'5em', 'text-align':'left'})\
.format(lambda x: '{:,.0f}'.format(x) if x > 1e3 else '{:,.2f}'.format(x
subset=pd.IndexSlice[:,df.columns[numeric_col_mask]])\
.hide_index()\
# .highlight_max('color: green')\
.set_table_styles([d])) # center the header

In [409]:

#AUC and ROC Value

def roc_model(model_name,x,y):
# predict probabilities
probs = model_name.predict_proba(x)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y, probs)
fpr, tpr, _ = roc_curve(y, probs)
return probs,auc,fpr,tpr

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 19/28

09/05/2021 SN Travel - Jupyter Notebook

In [410]:

def con_mat(y_train,y_predict_train,y_test,y_predict_test):
fig, axis=plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(10,4)
fig.tight_layout()

cm=confusion_matrix(y_train,y_predict_train,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Conservative","Labour"]],

columns = [i for i in ["Conservative","Labour"]])
sns.heatmap(df_cm, annot=True ,fmt='g',ax=axis[0])
axis[0].title.set_text('Confustion Matrix - Train Data')

cm=confusion_matrix(y_test,y_predict_test,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Conservative","Labour"]],

columns = [i for i in ["Conservative","Labour"]])
sns.heatmap(df_cm, annot=True ,fmt='g',ax=axis[1])
axis[1].title.set_text('Confustion Matrix - Test Data')

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 20/28

09/05/2021 SN Travel - Jupyter Notebook

In [411]:

def scores_train_test(model,X_train,X_test,y_train,y_test,y_predict_train,y_predict_
#model=bgcl

s = [[None for j in range(6)] for i in range(2)]

print(str(model).split('(')[0])
print('********************************\n')

model_name = str(model).split('(')[0]
s[0][0] = model_name + '_' + mname +'_Train'
s[1][0] = model_name + '_' + mname +'_Test'

for prf in range(1,4):

s[0][prf]=round(score(y_train,y_predict_train)[prf-1][1]*100,2)
s[1][prf]= (round(score(y_test,y_predict_test)[prf-1][1]*100,2))

s[0][prf+1]=round(model.score(X_train,y_train)*100,2)
s[1][prf+1]=round(model.score(X_test,y_test)*100,2)

probs, auc, fpr, tpr = roc_model(model,X_train,y_train)

probst, auc1, fpr1, tpr1 = roc_model(model,X_test,y_test)

s[0][prf+2]=round(auc*100,2)
s[1][prf+2]=round(auc1*100,2)

df = pd.DataFrame(data=s,columns=['Scores','Precision','Recall','F-Score','Accur

con_mat(y_train,y_predict_train,y_test,y_predict_test)

plt.figure(figsize=(5,5))

plt.plot(fpr,tpr, marker='o', label='AUC - Train:' + str(s[0][prf+2]))

plt.plot(fpr1,tpr1, marker='o', label='AUC - Test:' + str(s[1][prf+2]))

plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')

print('Display Metrics of Train and Test Data')

print('***************************************')

display_dataframe(df)
return(df)

In [412]:

for feature in df_train.columns:

if df_train[feature].dtype == 'object':
df_train[feature] = pd.Categorical(df_train[feature]).codes

for feature in df_test.columns:

if df_test[feature].dtype == 'object':
df_test[feature] = pd.Categorical(df_test[feature]).codes

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 21/28

09/05/2021 SN Travel - Jupyter Notebook

In [413]:

tvar = 'Overall_Experience'

In [414]:

X = df_train.drop(['ID',tvar], axis=1)
y = df_train[tvar]

In [415]:

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.30,random_state=0)

In [416]:

#Scale the data using feature scaling

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [417]:

sc_train = scaler.fit_transform(X_train)
X_train_sc = pd.DataFrame(sc_train, index=X_train.index, columns=X_train.columns)

In [418]:

sc_test = scaler.transform(X_test)
X_test_sc = pd.DataFrame(sc_test, index=X_test.index, columns=X_test.columns)

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 22/28

09/05/2021 SN Travel - Jupyter Notebook

In [438]:

rfcl = RandomForestClassifier(random_state=0,max_features=14)

rfcl.fit(X_train, y_train)
rf_train = rfcl.predict(X_train)
rf_test = rfcl.predict(X_test)

res_df = scores_train_test(rfcl,X_train,X_test,y_train,y_test,rf_train,rf_test,'Base
RandomForestClassifier
********************************

Display Metrics of Train and Test Data

***************************************

Scores Precision Recall F-Score Accuracy AUC

RandomForestClassiﬁer_Base_Train 100.00 100.00 100.00 100.00 100.00

RandomForestClassiﬁer_Base_Test 96.17 94.37 95.26 94.90 99.07

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 23/28

09/05/2021 SN Travel - Jupyter Notebook

In [ ]:

# features = X_train.columns
# importances = rfcl.feature_importances_
# indices = np.argsort(importances)

# plt.title('Feature Importances')
# plt.barh(range(len(indices)), importances[indices], color='b', align='center')
# plt.yticks(range(len(indices)), [features[i] for i in indices])
# plt.xlabel('Relative Importance')
# plt.show()

In [420]:

def rfrun(x_train,train_labels,x_test,max_f,no_est,max_dep,min_sam,min_spl):
param_grid = {
'criterion': ['gini'],
'max_depth': max_dep, #,7,9],
'max_features':max_f, #,32],
'min_samples_leaf': min_sam,#15,20],
'min_samples_split': min_spl,#75,60],
'n_estimators': no_est
}

rfcl = RandomForestClassifier(random_state=1)

grid_search = GridSearchCV(estimator = rfcl, param_grid = param_grid, cv = 5,sco

grid_search.fit(x_train, train_labels)
print(grid_search.best_params_)

rfcl = grid_search.best_estimator_
rfcl

rfcl_y_predict_train = rfcl.predict(x_train)
rfcl_y_predict_test = rfcl.predict(x_test)
return(rfcl,rfcl_y_predict_train,rfcl_y_predict_test)

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 24/28

09/05/2021 SN Travel - Jupyter Notebook

In [451]:

from sklearn.model_selection import GridSearchCV

rfcl_tuned,rfcl_y_predict_train,rfcl_y_predict_test = rfrun(
X_train,y_train,X_test,[13,17,21],[100,200,300],[15,17,25],[1,3],[3,6])

res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,y_test,rfcl_y_predict_t
---------------------------------------------------------------------
------
KeyboardInterrupt Traceback (most recent call
last)
<ipython-input-450-a2387e5d0eb9> in <module>
1 from sklearn.model_selection import GridSearchCV
----> 2 rfcl_tuned,rfcl_y_predict_train,rfcl_y_predict_test = rfrun(
3 X_train,y_train,X_test,[13,17,21],[100,200,300],[15,17,2
5],[1,3],[3,6])
4
5 res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,
y_test,rfcl_y_predict_train,rfcl_y_predict_test,'Tuned')

<ipython-input-420-9de6e369a47f> in rfrun(x_train, train_labels, x_te

st, max_f, no_est, max_dep, min_sam, min_spl)
12
13 grid_search = GridSearchCV(estimator = rfcl, param_grid =
param_grid, cv = 3,scoring='accuracy')
---> 14 grid_search.fit(x_train, train_labels)
15 i t( id h b t )

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 25/28

09/05/2021 SN Travel - Jupyter Notebook

In [442]:

res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,y_test,rfcl_y_predict_t
RandomForestClassifier
********************************

Display Metrics of Train and Test Data

***************************************

Scores Precision Recall F-Score Accuracy AUC

RandomForestClassiﬁer_Tuned_Train 99.96 100.00 99.98 99.97 100.00

RandomForestClassiﬁer_Tuned_Test 96.16 94.61 95.38 99.97 100.00

In [ ]:

# from sklearn.neighbors import KNeighborsClassifier

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 26/28

09/05/2021 SN Travel - Jupyter Notebook

In [ ]:

# for i in range(1,9,2):
# KNN_Model = KNeighborsClassifier(n_neighbors=i,metric='euclidean')
# KNN_Model.fit(X_train_sc,y_train)
# y_test_p = KNN_Model.predict(X_test_sc)
# print(f'Accuracy Score for K={i} : ',KNN_Model.score(X_test_sc,y_test))

In [ ]:

# from sklearn.neural_network import MLPClassifier

In [441]:

def nnrun(x_train,train_labels,x_test,hid_ly,max_int,tol,sol,act):
param_grid = {
'hidden_layer_sizes': hid_ly,
'max_iter': max_int,
'activation':act,
'solver': sol,
'tol': tol,
'random_state':[0] #1
}

nncl = MLPClassifier(random_state=0)

grid_search = GridSearchCV(estimator = nncl, param_grid = param_grid, cv = 5,sco

grid_search.fit(x_train, train_labels)
grid_search.best_params_
print(grid_search.best_params_)
nn_model = grid_search.best_estimator_
nn_model
nn_train = nn_model.predict(x_train)
nn_test = nn_model.predict(x_test)
return(nn_model,nn_train,nn_test)

---------------------------------------------------------------------
------
KeyboardInterrupt Traceback (most recent call
last)
<ipython-input-440-a2387e5d0eb9> in <module>
1 from sklearn.model_selection import GridSearchCV
----> 2 rfcl_tuned,rfcl_y_predict_train,rfcl_y_predict_test = rfrun(
3 X_train,y_train,X_test,[13,17,21],[100,200,300],[15,17,2
5],[1,3],[3,6])
4
5 res_df = scores_train_test(rfcl_tuned,X_train,X_test,y_train,
y_test,rfcl_y_predict_train,rfcl_y_predict_test,'Tuned')

<ipython-input-420-9de6e369a47f> in rfrun(x_train, train_labels, x_te

nn_model,nn_train_p,nn_test_p = nnrun(X_train_sc,y_train,X_test_sc,[100],[1000],[0.0

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 27/28

09/05/2021 SN Travel - Jupyter Notebook

In [ ]:

res_df = scores_train_test(nn_model,X_train_sc,X_test_sc,y_train,y_test,nn_train_p,n

In [443]:

df_test.drop(tvar,axis=1,inplace=True)

In [444]:

final_model = rfcl_tuned.fit(X,y)

In [445]:

rf_output = final_model.predict(df_test.drop('ID',axis=1))

In [446]:

rf_output
Out[446]:

array([1, 1, 1, ..., 1, 1, 0])

In [447]:

df_test[tvar]= rf_output

In [448]:

m='tuned'

In [449]:

df_test[['ID',tvar]].to_csv('./Hack_submission_'+m+'.csv',index=False)

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 28/28

ML Quiz 3 Machine Learning Great Learning
89% (9)
ML Quiz 3 Machine Learning Great Learning
7 pages
Quiz 3 LDA Predictive Modeling Great Learning
100% (5)
Quiz 3 LDA Predictive Modeling Great Learning
7 pages
Delhivery Feature Engineering Cs
No ratings yet
Delhivery Feature Engineering Cs
46 pages
Clustering Documentation R Code
100% (1)
Clustering Documentation R Code
9 pages
Machine Learning Project Problem 1 Jupyter Notebook PDF
100% (5)
Machine Learning Project Problem 1 Jupyter Notebook PDF
85 pages
Asphalt Shingles Data Analysis PDF
No ratings yet
Asphalt Shingles Data Analysis PDF
4 pages
Business Report Project SMDM Sonali Pradhan
100% (1)
Business Report Project SMDM Sonali Pradhan
56 pages
Hackathon Overall Travel Experience of Traveling in Shinkansen Bullet Train Merging Two Data Set
No ratings yet
Hackathon Overall Travel Experience of Traveling in Shinkansen Bullet Train Merging Two Data Set
59 pages
Airlanes Booking Analys
No ratings yet
Airlanes Booking Analys
26 pages
Airline Passenger Booking Analyze
No ratings yet
Airline Passenger Booking Analyze
26 pages
BPP Business School - Applied Modelling and Visualisation
No ratings yet
BPP Business School - Applied Modelling and Visualisation
19 pages
Step 16 Chapter4
No ratings yet
Step 16 Chapter4
64 pages
Aviation Marketing Project - Capstone 1
100% (1)
Aviation Marketing Project - Capstone 1
25 pages
E-Commerce Product Delivery Prediction
No ratings yet
E-Commerce Product Delivery Prediction
13 pages
BDABI - Project Work Mikiyas Henok Dureti
No ratings yet
BDABI - Project Work Mikiyas Henok Dureti
34 pages
Car-price-prediction (1)
No ratings yet
Car-price-prediction (1)
42 pages
Random Forest Model
No ratings yet
Random Forest Model
16 pages
Ai Tools and Applications-Lab
No ratings yet
Ai Tools and Applications-Lab
33 pages
Flight Price Prediction Capstone Project Submission 2
No ratings yet
Flight Price Prediction Capstone Project Submission 2
69 pages
P1) Code Uber
No ratings yet
P1) Code Uber
6 pages
ML5 Decision Tree Airline Safety
No ratings yet
ML5 Decision Tree Airline Safety
3 pages
Social Media Data For DSBA - Vijay Borade - Ipynb - Colab
No ratings yet
Social Media Data For DSBA - Vijay Borade - Ipynb - Colab
22 pages
FDS Practical 2
No ratings yet
FDS Practical 2
8 pages
Data Cleaning
No ratings yet
Data Cleaning
13 pages
Pyt Manual 1
No ratings yet
Pyt Manual 1
85 pages
Supervised Regression
No ratings yet
Supervised Regression
24 pages
Analyzing Taxi Trends
No ratings yet
Analyzing Taxi Trends
43 pages
Data Preprocessing - Ipynb - Colaboratory
No ratings yet
Data Preprocessing - Ipynb - Colaboratory
7 pages
Airfare ML - Predicting Flight Fares
No ratings yet
Airfare ML - Predicting Flight Fares
21 pages
About The Dataset - Car Evaluation Dataset (UCI Machine Learning Repository
No ratings yet
About The Dataset - Car Evaluation Dataset (UCI Machine Learning Repository
5 pages
AI Final PDF
No ratings yet
AI Final PDF
38 pages
U19ADS2035-Python For Data Science Laboratory Page No:17
No ratings yet
U19ADS2035-Python For Data Science Laboratory Page No:17
5 pages
Customer Churn Syntax
No ratings yet
Customer Churn Syntax
66 pages
ML LAB - BCSL606
No ratings yet
ML LAB - BCSL606
67 pages
PRAC3_23BME053
No ratings yet
PRAC3_23BME053
5 pages
PythonForMachineLearning
No ratings yet
PythonForMachineLearning
66 pages
Sprob 2
No ratings yet
Sprob 2
55 pages
Project
No ratings yet
Project
4 pages
Flight-Price-Prediction - Flight - Price - Ipynb at Master Mandal-21 - Flight-Price-Prediction
No ratings yet
Flight-Price-Prediction - Flight - Price - Ipynb at Master Mandal-21 - Flight-Price-Prediction
28 pages
Railway Price Prediction
No ratings yet
Railway Price Prediction
20 pages
ML File 211173
No ratings yet
ML File 211173
19 pages
Q1 Video Games Sales: #Import The Libraries
No ratings yet
Q1 Video Games Sales: #Import The Libraries
16 pages
00 - Lesson - Data Science Workflow - Jupyter Notebook
No ratings yet
00 - Lesson - Data Science Workflow - Jupyter Notebook
6 pages
Jupyter Notebook Project CART RF ANN
100% (1)
Jupyter Notebook Project CART RF ANN
41 pages
DAV Guidelines
No ratings yet
DAV Guidelines
4 pages
How To Convert Casuals To Members?": Google Data Analytics Course Capstone Project: Case Study 1 "Cyclistic"
No ratings yet
How To Convert Casuals To Members?": Google Data Analytics Course Capstone Project: Case Study 1 "Cyclistic"
18 pages
Loading The Dataset: ## The Matplotlib and Seaborn Library For Result Visualization and Analysis
No ratings yet
Loading The Dataset: ## The Matplotlib and Seaborn Library For Result Visualization and Analysis
13 pages
Preprocessing Data For Machine Learning: Sarah Guido
No ratings yet
Preprocessing Data For Machine Learning: Sarah Guido
21 pages
Practical 1
No ratings yet
Practical 1
6 pages
Assessment Brief: Learning Outcomes To Be Assessed
No ratings yet
Assessment Brief: Learning Outcomes To Be Assessed
7 pages
02 Amazon Fine Food Reviews Analysis - TSNE - Slides
No ratings yet
02 Amazon Fine Food Reviews Analysis - TSNE - Slides
1 page
Computer Science - Term 1 (2020 - 21) Grade 10 Advanced - Checkpoint 4 Part - A
No ratings yet
Computer Science - Term 1 (2020 - 21) Grade 10 Advanced - Checkpoint 4 Part - A
5 pages
Twitter Sentiment Analysis
No ratings yet
Twitter Sentiment Analysis
16 pages
Week 1 To Week 9
No ratings yet
Week 1 To Week 9
30 pages
Delhivery_Case_Study_compressed
No ratings yet
Delhivery_Case_Study_compressed
31 pages
Vertopal.com AML Project LearnerNotebook LowCode
No ratings yet
Vertopal.com AML Project LearnerNotebook LowCode
74 pages
Cleaning Data in Python
No ratings yet
Cleaning Data in Python
8 pages
Lab1.ipynb - Colaboratory
No ratings yet
Lab1.ipynb - Colaboratory
9 pages
Multiple - Linear - Regression - AirBNB - Solution-0.2 - New - Ipynb - Colaboratory
No ratings yet
Multiple - Linear - Regression - AirBNB - Solution-0.2 - New - Ipynb - Colaboratory
11 pages
Preprocessing ch.1
No ratings yet
Preprocessing ch.1
24 pages
Machine Learning Program
No ratings yet
Machine Learning Program
12 pages
4. Data Analytics I
No ratings yet
4. Data Analytics I
4 pages
warmUp
No ratings yet
warmUp
5 pages
Divvy Exercise R Script
No ratings yet
Divvy Exercise R Script
5 pages
Data Science Programming In Python
From Everand
Data Science Programming In Python
Anita Raichand
No ratings yet
Next Pathway Hack Backpackers Problem Statement
No ratings yet
Next Pathway Hack Backpackers Problem Statement
11 pages
CLUSTERING ANALYSIS State Wise Health PDF
No ratings yet
CLUSTERING ANALYSIS State Wise Health PDF
14 pages
SMDM-Business Report
No ratings yet
SMDM-Business Report
11 pages
Answer
100% (3)
Answer
5 pages
ANN Doc
No ratings yet
ANN Doc
2 pages
CMSU Survey Data Analysis PDF
100% (3)
CMSU Survey Data Analysis PDF
13 pages
A B Shingles Case
No ratings yet
A B Shingles Case
2 pages

SN Travel Jupyter Notebook PDF

Uploaded by

SN Travel Jupyter Notebook PDF

Uploaded by

09/05/2021 SN Travel - Jupyter Notebook

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 1/28

#Shape of the Data

print('\n****Show infomation of the data****')

#Check for missing value

# #Check for duplicate values

print('\n****Describe the data****')

print('\n****Describe the catagorical data****')

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 2/28

****Shape of the data****

****Show infomation of the data****

****Target Varible Distribution****

****Missing Values in the Dataset****

****Duplicates data in the Dataset****

****Describe the data****

****Describe the catagorical data****

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 4/28

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 5/28

****Shape of the data****

****Show infomation of the data****

****Missing Values in the Dataset****

****Duplicates data in the Dataset****

****Describe the data****

****Describe the catagorical data****

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 7/28

print('Catogorical Variables : \n*****************\n', cat)

#Unique values of the catagrical variables

Unique values of the catagorical variable CustomerType

Loyal Customer 69823

Unique values of the catagorical variable TypeTravel

Business travel 58617

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 8/28

# df_train['AgeRange'] = pd.cut(df_train['Age'],bins=bins, labels=labels)

#Impute the Department with mode value of group by "EductaionField"

Number of missing values in CustomerType:

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 9/28

Gender CustomerType Travel_Class

#Impute the Age with group by "WorkExprience" and "EductaionField"

Number of missing values in Age:

Number of missing values in DepartureDelay_in_Mins:

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 10/28

Number of missing values in ArrivalDelay_in_Mins:

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 11/28

Number of missing values in Arrival_time_convenient:

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 12/28

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 13/28

Number of missing values in Arrival_time_convenient:

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 14/28

Number of missing values in TypeTravel:

#Impute the Department with mode value of group by "EductaionField"

Number of missing values in TypeTravel:

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 15/28

Number of missing values in Onboard_service :

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 16/28

#Check for missing value

****Missing Values in the Dataset****

#Check for missing value

****Missing Values in the Dataset****

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 17/28

#Define the function to identify the outliers

# #Fix the outliers

#Heat map - Relationalship analysis

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 18/28

from sklearn.ensemble import RandomForestClassifier

# Dict used to center the table headers

#AUC and ROC Value

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 19/28

df_cm = pd.DataFrame(cm, index = [i for i in ["Conservative","Labour"]],

df_cm = pd.DataFrame(cm, index = [i for i in ["Conservative","Labour"]],

localhost:8888/notebooks/DS/Hackathon/SN Travel/SN Travel.ipynb 20/28

s = [[None for j in range(6)] for i in range(2)]

for prf in range(1,4):

probs, auc, fpr, tpr = roc_model(model,X_train,y_train)

plt.plot(fpr,tpr, marker='o', label='AUC - Train:' + str(s[0][prf+2]))

print('Display Metrics of Train and Test Data')

for feature in df_train.columns:

print('\nShow infomation of the data')

print('\nDescribe the data')

print('\nDescribe the catagorical data')

Shape of the data

Show infomation of the data

Target Varible Distribution

Missing Values in the Dataset

Duplicates data in the Dataset

Describe the data

Describe the catagorical data

Shape of the data

Show infomation of the data

Missing Values in the Dataset

Duplicates data in the Dataset

Describe the data

Describe the catagorical data

Missing Values in the Dataset

Missing Values in the Dataset