Titanic Data Analysis
dataset
In [1]: #Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.250
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.283
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.925
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.100
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.050
5    6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.458
6    7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.862
7    8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.075
8    9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742            11.133
9    10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.070

     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.00
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.00
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.45
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.00
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.75
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
SibSp - Number of siblings/spouses aboard. The dataset defines family relations in this way:
    Sibling = brother, sister, stepbrother, stepsister
    Spouse = husband, wife
Parch - Number of parents/children aboard. The dataset defines family relations in this way:
    Parent = mother, father
    Child = daughter, son, stepdaughter, stepson
    Some children travelled only with a nanny, therefore Parch = 0 for them.
Ticket - Ticket number
Fare - Passenger fare
Cabin - Cabin number
Embarked - Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
In [6]: #4.Find the total number of rows and columns in the dataset.
train_data.shape
Out[6]: (891, 12)
Out[7]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
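The per-column missing-value counts above (177 for Age, 687 for Cabin, 2 for Embarked) are the kind of result that `isnull().sum()` produces. A minimal sketch, assuming a small hypothetical DataFrame named `sample` rather than the real `train_data`:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the values are illustrative, not taken from train_data.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S'],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing = sample.isnull().sum()
print(missing)
```

Applied to `train_data`, the same call would be expected to give the table shown in Out[7].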
In [23]: #Draw a count plot of the passengers who survived and who did not.
sb.countplot(x='Survived',hue='Survived',data=train_data)
plt.show()
In [26]: #Identify the number of males and females who survived and who did not.
train_data.groupby(['Sex', 'Survived'])['Survived'].count()
Out[26]:
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
In [12]: #Plot to show whether sex has any impact on survived vs dead.
#train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
sb.countplot(x='Sex',hue='Survived',data=train_data)
plt.show()
In [10]: #Identify the number of males and females who survived or died, by passenger
#class.
pd.crosstab([train_data.Sex,train_data.Survived],train_data.Pclass)
Out[10]:
Pclass              1    2    3
Sex     Survived
female  0           3    6   72
        1          91   70   72
male    0          77   91  300
        1          45   17   47
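The raw counts in Out[10] are easier to compare once they are turned into survival rates. A rough sketch that re-enters the counts by hand (the names `died`, `survived` and `rate` are introduced here for illustration only):

```python
import pandas as pd

# Counts copied from the Sex/Survived by Pclass crosstab (Out[10]).
died = pd.DataFrame({1: [3, 77], 2: [6, 91], 3: [72, 300]},
                    index=['female', 'male'])
survived = pd.DataFrame({1: [91, 45], 2: [70, 17], 3: [72, 47]},
                        index=['female', 'male'])

# Survival rate = survivors / (survivors + deaths), computed cell by cell.
rate = survived / (survived + died)
print(rate.round(3))
```

With these counts, women in classes 1 and 2 survived at rates above 90% (91 of 94 and 70 of 76), while men in classes 2 and 3 survived at rates below 16% (17 of 108 and 47 of 347).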
In [11]: #Find the age of oldest,youngest and average age of person travelled.
print('Age of oldest person travelled :',train_data['Age'].max())
print('Age of youngest person travelled :',train_data['Age'].min())
print('Average Age of person travelled :',train_data['Age'].mean())
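In [27]: #Extract the title (the letters immediately before a period) from each name.
#expand=False returns a Series, so no loop or placeholder column is needed.
train_data['Initial']=train_data.Name.str.extract(r'([A-Za-z]+)\.',expand=False)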
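To see what the regular expression picks out, here is a small sketch that applies the same pattern to four names from the rows shown earlier. The Series `names` is a stand-in for `train_data.Name`:

```python
import pandas as pd

# Four names copied from the sample rows of the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Montvila, Rev. Juozas",
])

# ([A-Za-z]+)\. captures the first run of letters that is directly followed by a period.
# expand=False makes str.extract return a Series instead of a one-column DataFrame.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

The surname is skipped because it is followed by a comma, not a period, so the first match is the title.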
In [28]: pd.crosstab(train_data.Initial,train_data.Sex)
Out[28]: Sex female male
Initial
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
In [14]: train_data.groupby('Initial')['Age'].mean()
Out[14]:
Initial
Capt 70.000000
Col 58.000000
Countess 33.000000
Don 40.000000
Dr 42.000000
Jonkheer 38.000000
Lady 48.000000
Major 48.500000
Master 4.574167
Miss 21.773973
Mlle 24.000000
Mme 24.000000
Mr 32.368090
Mrs 35.898148
Ms 28.000000
Rev 43.166667
Sir 49.000000
Name: Age, dtype: float64
In [15]: train_data['Initial'].replace(['Capt','Col','Countess','Don','Dr','Jonkhe
'Mr','Miss','Mr','Other','Mr','Mrs','Mr','
In [16]: train_data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Cou
'Miss','Miss','Mr','Mr','Mrs','Mrs','Othe
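The two `replace` calls above are cut off in this export, so the exact title mapping cannot be read from them. As a sketch of the general idea only, the following collapses rare titles into the five groups that appear later (Master, Miss, Mr, Mrs, Other). The dictionary `title_map` and the rule "everything else becomes Other" are assumptions made for illustration, not the notebook's own mapping:

```python
import pandas as pd

# A few titles taken from the Initial/Sex crosstab.
titles = pd.Series(['Mr', 'Mlle', 'Mme', 'Ms', 'Capt', 'Master', 'Mrs', 'Miss'])

# Hypothetical mapping of spelling variants onto the common titles.
title_map = {'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Miss'}
common = ['Master', 'Miss', 'Mr', 'Mrs']

# First fold the variants, then label anything still uncommon as 'Other'.
merged = titles.replace(title_map)
merged = merged.where(merged.isin(common), 'Other')
print(merged.tolist())
```

A dictionary keeps each old title next to its replacement, which is harder to get out of step than two parallel lists.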
In [17]: #Average age based on initials
train_data.groupby('Initial')['Age'].mean()
Out[17]:
Initial
Master 4.574167
Miss 21.879195
Mr 32.891990
Mrs 35.828829
Other 42.000000
Name: Age, dtype: float64
In [19]: #Fill the null values of age with average age based on initial
train_data.loc[(train_data.Age.isnull()) & (train_data.Initial=='Mr'),'Ag
train_data.loc[(train_data.Age.isnull()) & (train_data.Initial=='Mrs'),'A
train_data.loc[(train_data.Age.isnull()) & (train_data.Initial=='Master')
train_data.loc[(train_data.Age.isnull()) & (train_data.Initial=='Miss'),'
train_data.loc[(train_data.Age.isnull()) & (train_data.Initial=='Other'),
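The five assignments above are truncated in this export, but the comment states the intent: fill each missing age with the average age of the passenger's title group. One possible way to do that in a single step, shown on a hypothetical frame `df` with made-up ages, is `groupby(...).transform('mean')`:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the ages are illustrative, not taken from train_data.
df = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master', 'Master'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, np.nan],
})

# transform('mean') returns, for every row, the mean age of that row's title group
# (missing ages are ignored when the mean is computed).
group_mean = df.groupby('Initial')['Age'].transform('mean')

# Only the missing ages are replaced; known ages are left untouched.
df['Age'] = df['Age'].fillna(group_mean)
print(df)
```

This differs from the cell above in that the group means are computed from the data rather than typed in, so the values filled in may not match the notebook's exactly.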
In [27]: train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
Initial 891 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 90.6+ KB
In [28]: train_data.Age.isnull().any()
Out[28]: False
In [20]: f,ax=plt.subplots(1,2,figsize=(20,20))
train_data[train_data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edge
ax[0].set_title('Survived = 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
train_data[train_data['Survived']==1].Age.plot.hist(ax=ax[1],bins=20,edge
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
ax[1].set_title('Survived = 1')
plt.show()
In [ ]: #Observations: (1) First priority during Rescue is given to children and
#as the persons<5 are saved in large numbers (2) The oldest saved passange
# of age 80 (3) The most deaths were between 30-40
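The histograms above use tick marks every 5 years (`range(0,85,5)`). The same 5-year bands can be counted numerically with `pd.cut`. A sketch, using only the nine non-missing ages from the first ten rows of the dataset, so the counts say nothing about the full data:

```python
import pandas as pd

# Non-missing ages of the first ten passengers shown earlier.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0])

# Band edges 0, 5, ..., 80 give 16 right-closed bands: (0, 5], (5, 10], ...
bins = list(range(0, 85, 5))
bands = pd.cut(ages, bins=bins)

# Count passengers per band and keep the bands in age order.
counts = bands.value_counts().sort_index()
print(counts)
```

Because the bands are closed on the right, an age of exactly 35 falls in (30, 35] and not in (35, 40].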
In [31]: #Identify the number of passengers who died based on the size of the family
#(using the SibSp feature) and also draw the plot.
pd.crosstab([train_data.SibSp],train_data.Survived)
Out[31]: Survived 0 1
SibSp
0 398 210
1 97 112
2 15 13
3 12 4
4 15 3
5 5 0
8 7 0
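The counts in Out[31] can be turned into a survival rate per SibSp value, which is what the bar plot of SibSp against Survived displays. A sketch with the counts re-entered by hand (`table` and `rate` are names chosen here):

```python
import pandas as pd

# Counts copied from Out[31]: column 0 = died, column 1 = survived.
table = pd.DataFrame(
    {0: [398, 97, 15, 12, 15, 5, 7], 1: [210, 112, 13, 4, 3, 0, 0]},
    index=pd.Index([0, 1, 2, 3, 4, 5, 8], name='SibSp'),
)

# Survival rate per SibSp value = survivors / all passengers with that value.
rate = table[1] / table.sum(axis=1)
print(rate.round(3))
```

For SibSp = 0 this gives 210 / 608, roughly 34.5%, and for SibSp = 5 and 8 it gives 0.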
In [29]: #Identify the number of passengers who died based on the size of the family
#(using the SibSp feature) and also draw the plot.
pd.crosstab([train_data.SibSp],train_data.Survived).style.background_grad
Out[29]: Survived 0 1
SibSp
0 398 210
1 97 112
2 15 13
3 12 4
4 15 3
5 5 0
8 7 0
In [39]: f,ax=plt.subplots(1,2,figsize=(20,8))
sb.barplot(x='SibSp',y='Survived',data=train_data,ax=ax[0])
ax[0].set_title('SibSp vs Survived in BarPlot')
plt.show()
In [40]: f,ax=plt.subplots(1,2,figsize=(20,8))
sb.barplot(x='SibSp',y='Survived',data=train_data,ax=ax[0])
ax[0].set_title('SibSp vs Survived in BarPlot')
#pointplot is the axes-level plot that factorplot drew by default; unlike
#factorplot it accepts ax, so no extra figure has to be closed afterwards.
sb.pointplot(x='SibSp',y='Survived',data=train_data,ax=ax[1])
ax[1].set_title('SibSp vs Survived in PointPlot')
plt.show()
In [33]: pd.crosstab(train_data.SibSp,train_data.Pclass).style.background_gradient
Out[33]:
Pclass    1    2    3
SibSp
0       137  120  351
1        71   55   83
2         5    8   15
3         3    1   12
4         0    0   18
5         0    0    5
8         0    0    7
In [ ]: #The barplot and crosstab show that a passenger travelling with no
#siblings or spouse (SibSp=0) had a survival rate of 34.5% (210 of 608). The
#rate peaks at SibSp=1 (112 of 209, about 53.6%) and falls as SibSp grows.
#This is interesting because, if I had family on board, I would try to save
#them instead of saving myself. But one thing stands out: the survival rate
#for passengers with SibSp of 5 or 8 is 0%. Is this because of Pclass?
#Yes, it is Pclass. The crosstab shows that passengers with SibSp>3 were all
#in Pclass 3, and every Pclass 3 passenger with SibSp of 5 or 8 died.
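In [64]: #Restrict to numeric columns; text columns such as Name cannot be correlated.
train_data.select_dtypes(include='number').corr(method='pearson')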
In [ ]: #From the correlation table above we can see that Survived is inversely
#correlated with the Pclass value. Since Class 1 has the lowest numerical value,
#it has a better survival rate than the other classes.
#We also see that Age and Survived are only slightly correlated.
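The sign of the Pclass/Survived correlation can be checked from the crosstab counts alone, without the raw data. A sketch that sums the Out[10] counts over Sex and rebuilds one (Pclass, Survived) pair per passenger (the names `counts`, `pclass`, `survived` and `r` are introduced here):

```python
import numpy as np

# (Pclass, Survived) -> number of passengers, from Out[10] summed over Sex.
counts = {
    (1, 0): 3 + 77,  (1, 1): 91 + 45,
    (2, 0): 6 + 91,  (2, 1): 70 + 17,
    (3, 0): 72 + 300, (3, 1): 72 + 47,
}

# Repeat each pair as many times as it occurs, giving two arrays of length 891.
reps = list(counts.values())
pclass = np.repeat([key[0] for key in counts], reps)
survived = np.repeat([key[1] for key in counts], reps)

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(pclass, survived)[0, 1]
print(round(r, 3))
```

The coefficient comes out negative: a higher class number (a lower class) goes with a lower chance of survival.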
In [70]: train_data.groupby(['Survived']).hist()
In [78]: #Plot the age distribution of the passengers aboard.
#histplot replaces the deprecated distplot; kde=False draws the histogram only.
sb.histplot(train_data['Age'].dropna(),bins=15,kde=False)
In [ ]: #Note: Many passengers are of age 15-40 yrs. But this is not complete da