
International Journal of Computer Applications (0975 – 8887)

Volume 179 – No.44, May 2018

Predicting Survival on Titanic by Applying Exploratory Data Analytics and Machine Learning Techniques

Yogesh Kakde
Asst. Professor, AITR, Indore

Shefali Agrawal
UG Scholar, AITR, Indore

ABSTRACT
The sinking of the RMS Titanic, which caused the death of thousands of passengers and crew, is one of the deadliest maritime disasters in history. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. An interesting observation about the sinking is that some people were more likely to survive than others: women and children were the ones given priority in the rescue. The objective is to first uncover hidden or previously unknown information by applying exploratory data analytics to the available dataset, and then to apply different machine learning models to complete the analysis of what sorts of people were likely to survive. The results of the machine learning models are then compared and analyzed on the basis of accuracy.

General Terms
Data Analytics, Exploratory Data Analytics, Machine Learning, Model Evaluation, Data Science.

Keywords
Data mining, ggplot, Logistic Regression, Random Forest, Feature Engineering, Support Vector Machine, Confusion Matrix.

1. INTRODUCTION
The most infamous maritime disaster occurred over a century ago, on April 15, 1912, and is well known as the sinking of "The Titanic". The collision with the iceberg ripped open many parts of the ship. People of all classes, ages and genders were present on that fateful night, but there were only a few lifeboats for rescue. The dead included a large number of men, whose places were given to the many women and children on board; the men travelling in second class fared worst of all. [1]

Machine learning algorithms are applied to predict which passengers survived the sinking of the Titanic. Features like ticket fare, age, sex and class are used to make the predictions. Predictive analysis is a procedure that uses computational methods to determine important and useful patterns in large data. Using the machine learning algorithms, survival is predicted on different combinations of features.

The objective is to perform exploratory data analytics to mine information from the available dataset and to find the effect of each field on passenger survival by relating every field of the dataset to the "Survival" field. Predictions are then made for newer data by applying machine learning algorithms. The applied algorithms are analyzed, their accuracy is checked, the different algorithms are compared on the basis of accuracy, and the best performing model is suggested for predictions. [2]

2. DATA ANALYTICS AND ITS CATEGORIES
Fig 1: Data Analytics (pipeline: data cleaning, data transforming, data analysis and modeling)

Fig 2: Categories of Data Analytics (descriptive, exploratory, confirmative and predictive DA)

3. PROCESS FLOW
There is a step-by-step approach to choosing a particular model for the current problem. [27] We need to decide whether a particular machine learning model is suitable for our problem or not. The process flow being followed is shown in Fig 3.
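The paper's implementation is in R. Purely as an illustrative sketch of the process flow just described (load, clean, encode, split, fit, evaluate), an equivalent in Python with scikit-learn might look like the following. The tiny inline sample and every name below are assumptions for demonstration, not the authors' code:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny inline stand-in for the Titanic training data (a real run would
# load the csv file mentioned in Section 4 instead).
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3, 1, 3, 2, 3, 1, 3],
    "Sex":    ["female", "male", "female", "male", "female", "male",
               "female", "male", "female", "male", "female", "male"],
    "Age":    [29.0, 22.0, None, 54.0, 27.0, 20.0,
               38.0, 26.0, 35.0, 31.0, 58.0, 40.0],
    "Fare":   [100.0, 7.9, 8.1, 52.0, 21.0, 8.5,
               71.0, 7.8, 26.0, 7.7, 146.0, 7.9],
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Step 1 - clean: fill the missing Age (median here, for brevity).
df["Age"] = df["Age"].fillna(df["Age"].median())

# Step 2 - encode: Sex as a 0/1 dummy variable.
df["Sex"] = (df["Sex"] == "female").astype(int)

# Step 3 - split, fit a model, and check its accuracy.
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {acc:.2f}")
```

Any of the models discussed later can be swapped in at step 3; the surrounding flow stays the same.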


Fig 3: Process of fitting a Machine Learning Model

4. DESCRIPTION OF DATA
In R, the str() function is used to find the structure of the dataset held in the csv file. Fig 4 shows a snippet of the output we got by executing str() in RStudio.

Fig 4: Structure of input Dataset

Table 1 shows the meaning of each attribute.

Table 1. Description of each attribute in our dataset

Attribute   Description                                   Factors
Survival    Survival of passenger                         0 = No, 1 = Yes
Pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
Sex         Sex                                           Male / Female
Age         Age of passenger in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
Embarked    Port from where passenger embarked            C = Cherbourg, Q = Queenstown, S = Southampton

Now let us explore our dataset by finding the influence of each attribute on the survival of passengers. We will create histograms and bar plots to achieve this.

5. DATA CLEANING
Before applying any type of data analytics to the dataset, the data is first cleaned. There are some missing values in the dataset which need to be handled. In attributes like Age, Cabin and Embarked, missing values are replaced with random samples from the existing values. [15]

In the Fare column we found one passenger with a missing fare, with passenger id 1044. To put a meaningful value in the fare column, we first found the Embarked and Pclass values of this passenger, then took the median of the fares of all passengers whose embarkation port and Pclass were the same as those of passenger 1044.

6. EXPLORATORY DATA ANALYSIS
We perform exploratory data analysis for our problem in the first stage. In exploratory data analysis the dataset is explored to figure out the features which influence the survival rate. The data is analysed in depth by finding a relationship between each attribute and survival.

6.1 Age versus Survival
Fig 5 shows how the survival rate is affected by age: the lower the age, the higher the chances of survival, and vice versa.
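The cleaning steps of Section 5 can be sketched in pandas as follows. This is an illustrative equivalent of the paper's R procedure, run here on a toy frame; column names follow the Titanic dataset and everything else is an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame with the relevant columns; a real run would use the full dataset.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Age":      [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare":     [7.25, 71.28, np.nan, 8.05, 8.46],
    "Pclass":   [3, 1, 3, 3, 3],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Age: replace missing values with random samples drawn from the
# observed ages, as described in Section 5.
observed = df["Age"].dropna().to_numpy()
n_missing = df["Age"].isna().sum()
df.loc[df["Age"].isna(), "Age"] = rng.choice(observed, size=n_missing)

# Fare: for a passenger with a missing fare, use the median fare of
# passengers sharing the same Embarked port and Pclass.
for i in df.index[df["Fare"].isna()]:
    same = ((df["Embarked"] == df.loc[i, "Embarked"])
            & (df["Pclass"] == df.loc[i, "Pclass"]))
    df.loc[i, "Fare"] = df.loc[same, "Fare"].median()

print(df[["Age", "Fare"]].isna().sum().sum())  # no missing values remain
```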


Fig 5: Age v/s Survival

Fig 6: Sex v/s Survival

In the same way there are some more facts we found. Table 2 shows each age group and the survival rate of that age group.

Table 2. Age Group and Survival Rate

Age Group   Survival Rate (%)
0-10        53.24675
10-20       38.29787
20-30       37.03704
30-40       40.21739
40-50       34.82143
50-60       34.61538
60-70       22.72727
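As an illustration only, age-group survival rates like those in Table 2 can be computed with pandas (the paper uses R; the binning scheme and sample data here are assumptions):

```python
import pandas as pd

# Toy sample; with the full dataset this style of groupby reproduces Table 2.
df = pd.DataFrame({
    "Age":      [5, 8, 15, 25, 25, 35, 45, 55, 65, 2],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Bin ages into decades, then average Survived within each bin.
bins = range(0, 81, 10)  # (0,10], (10,20], ... (70,80]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins)
rate = df.groupby("AgeGroup", observed=True)["Survived"].mean() * 100
print(rate)
```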


6.2 Sex versus Survival
From Fig. 6 it is clear that females were more likely to survive than males. We calculated that the survival rates of females and males are 74.20382% and 18.89081% respectively.

In a similar way, the relationship between survival and the other attributes like fare, cabin, title, family, Pclass and Embarked is found. We extracted the title from the attribute 'name', and we combined parch and sibsp. In this way we are able to decide the emphasis of each attribute on the survival of a passenger.

7. METHODOLOGY
7.1 Feature Engineering
Feature engineering is the most important part of the data analytics process. It deals with selecting the features that are used in training and making predictions. In feature engineering, domain knowledge is used to find the features in the dataset which are helpful in building the machine learning model; it helps in understanding the dataset in terms of modeling. A bad feature selection may lead to a less accurate or poor predictive model: the accuracy and the predictive power depend on the choice of correct features. Feature engineering also filters out all the unused or redundant features.

Based on the exploratory analysis above, the following features are used: age, sex, cabin, title, Pclass, family size (parch plus sibsp columns), fare and embarked. The Survival column is chosen as the response column. These features are selected because their values have an impact on the rate of survival; they are the values of "x" in the bar plots. If the wrong features were selected, then even a good algorithm may produce bad predictions. Therefore, feature engineering acts as the backbone of building an accurate predictive model.

7.2 Machine Learning Models
Various machine learning models are implemented to validate and predict survival.

7.2.1 Logistic Regression
Logistic regression is the technique which works best when the dependent variable is dichotomous (binary or categorical). [23] Describing the data and explaining the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables is done with the help of logistic regression. It is used to solve binary classification problems; some real-life examples are spam detection (predicting whether an email is spam or not), health (predicting whether a given mass of tissue is benign or malignant), and marketing (predicting whether a given user will buy an insurance product or not).

7.2.2 Decision Tree
Decision tree is a supervised learning algorithm, generally used in problems based on classification. It is suitable for both categorical and continuous input and output variables. Each root node represents a single input variable (x) and a split point on that variable; the dependent variable (y) is present at the leaf nodes. For example: suppose there are two independent input variables (x), height in centimeters and weight in kilograms, and the task is to find the gender of a person based on the given data (a hypothetical example, for demonstration purposes only).

Fig 7: Example of a Decision Tree

There are two types of decision tree, based on the type of target variable:
1. Categorical Variable Decision Tree: the tree in which the target variable has categorical values.
2. Continuous Variable Decision Tree: the tree in which the target variable has continuous values.

7.2.3 Random Forest
Random forest is a supervised classification algorithm. The algorithm basically builds a forest with a large number of trees; the higher the number of trees in the forest, the higher the accuracy of the results. Random forest can be used for both classification and regression problems. For instance, it will take random samples of 100 observations and 5 randomly chosen initial variables to build a model. The same process is repeated a number of times, and then the final prediction is made from the individual predictions: the final prediction is a function (mean) of each prediction.

7.2.4 Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm, used to solve both classification and regression problems. Classification is performed by constructing hyperplanes in a multidimensional space that separate cases of different class labels. For categorical variables, dummy variables are created with values of either 0 or 1. So a categorical dependent variable consisting of three levels, say (A, B, C), can be represented by a set of three dummy variables:

A: {1, 0, 0}; B: {0, 1, 0}; C: {0, 0, 1}

8. MODEL EVALUATION
The accuracy of each model is evaluated using a "confusion matrix". A confusion matrix is a table layout that allows one to visualize the correctness and the performance of an algorithm.

8.1 Confusion Matrix
A confusion matrix is a method to verify how accurately the classification model works. It gives the actual numbers of predictions which were correct or incorrect when compared to the actual results in the data. The matrix is of order N*N, where N is the number of classes. Performance of such models is commonly evaluated using the data in the matrix.

Sensitivity: the percentage of actual positives which are correctly identified; it is complementary to the false negative rate. Sensitivity = true positives / (true positives + false negatives). The ideal value for sensitivity is "1.0" and the minimum value is "0.0".

Specificity: the proportion of negatives which are correctly identified; it is complementary to the false positive rate. Specificity = true negatives / (true negatives + false positives). The ideal value for specificity is "1.0" and the least value is "0.0".

Positive Predictive Value: a performance measure of the statistical test. It is the ratio of true positives (events predicted positive whose actual result is also positive) to the sum of true positives and false positives (events predicted positive whose actual result is negative).

Negative Predictive Value: the ratio of true negatives (events predicted negative whose actual result is also negative) to the sum of true negatives and false negatives (events predicted negative whose actual result is positive).

8.2 Accuracy
Accuracy gives the measure of the percentage of correct predictions made by the model/algorithm. The best value is "1.0" and the worst value is "0.0".

Fig 8: Generalized confusion matrix

In R, these calculations are performed and the accuracy of each model is found. The confusion matrices and accuracies we achieved for each model are shown below.

8.2.1 Logistic Regression

Fig 9: Confusion Matrix for Logistic Regression

In the confusion matrix, the values "a, b, c, d" give the counts of true positives, true negatives, false positives and false negatives respectively. The accuracy of this confusion matrix is close to "1", which shows that the model makes mostly correct predictions.

8.2.2 Random Forest

Fig 10: Confusion Matrix for Random Forest

8.2.3 Decision Tree

Fig 11: Confusion Matrix for Decision Tree

8.2.4 Support Vector Machine

Fig 12: Confusion Matrix for SVM

9. PREDICTION
Any of the models can be chosen to predict the survival of the test sample. Since all the models have been evaluated using confusion matrices, prediction is done using the model with the highest accuracy. We performed prediction on the dataset using the logistic model and SVM.

10. GUI IMPLEMENTATION IN R
We have also added a GUI to our implementation. R provides a library called "shiny" which is used to present the analysis in a presentable interface. Using shiny, a graphical user interface can be created; to use a dashboard, the "shinydashboard" library must be included. The dashboard contains different tabs showing the exploratory data analytics, which include graphs of age v/s survival, sex v/s survival, cabin v/s survival, Pclass v/s survival, fare v/s survival, name v/s survival, family v/s survival and embarked v/s survival. Another tab shows the predictive analysis details under the headings of logistic regression, decision tree, random forest and support vector machine. A value box is included to show the accuracy of each model.
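As a hedged illustration of Sections 7.2 through 9 (the paper's actual code is in R; this Python/scikit-learn sketch on synthetic data is an assumption, not the authors' implementation), the four models can be fitted, confusion matrices built, sensitivity and specificity derived from them, and the most accurate model selected for prediction:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "random forest":       RandomForestClassifier(random_state=0),
    "svm":                 SVC(),
}

accuracies = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    # Confusion matrix layout: rows = actual class, columns = predicted.
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracies[name] = accuracy_score(y_test, pred)
    print(f"{name}: acc={accuracies[name]:.3f} "
          f"sens={sensitivity:.3f} spec={specificity:.3f}")

# Section 9: predict with whichever model scored the highest accuracy.
best = max(accuracies, key=accuracies.get)
print("best model:", best)
```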


Fig 13: GUI Environment

11. CONCLUSION
Data cleaning is the first step in performing data analysis. Exploratory data analytics helps one to understand the dataset and the dependencies among the attributes; EDA is used to figure out the relationships between the features of the dataset. This is done using various graphical techniques, here ggplot and histograms.

By applying EDA, some conclusions are drawn and facts are found.

Age has a high influence on survival. We can see from Table 2 that as age increases, survival decreases.

It can be seen that the survival rate of females is very high (approx. 74%) while the survival rate of males is very low. This fact can also be verified by extracting titles (Mr, Mrs, Ms, etc.) from the name column: the survival rate for the title Mr. is approximately 16%, while the survival rate for Mrs. is 79%.

Survival v/s Pclass can be seen in the following table.

Table 3. Passenger Class vs. Survival Rate

Passenger class   Survival Rate (%)
1                 62.96296
2                 47.28261
3                 24.23625

We found that passengers travelling in first class were more likely to survive.

We combined the parch and sibsp columns to know the family size of each passenger. We found that the survival rate increases when family size lies between 0 and 3, but when family size becomes greater than 3, the survival rate decreases.

Similarly, it was found that passengers who had more cabins had a higher survival rate.

Table 4. Fare Group vs. Survival Rate

Fare Group   Survival Rate (%)
(0,50]       32.40223
(50,100]     65.42056
(100,150]    79.16667
(150,200]    66.66667
(200,250]    63.63636
(250,300]    66.66667
(500,550]    100

With these figures we can say that the higher the fare, the higher the survival rate.

In feature engineering, the actual parameters to be used in designing the training model and the prediction model are found on the basis of the exploratory data analytics process. The machine learning models predict which passengers survived; the logistic regression technique is used for making predictions in this classification problem.
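The title extraction and family-size combination mentioned above can be sketched in pandas (illustrative only; the paper's actual implementation is in R, and the sample names below are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":  ["Braund, Mr. Owen Harris",
              "Cumings, Mrs. John Bradley",
              "Heikkinen, Miss. Laina"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Title: the token between the comma and the period in the Name column.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Family size: parch and sibsp combined, as described in the conclusion.
df["FamilySize"] = df["SibSp"] + df["Parch"]

print(df[["Title", "FamilySize"]])
```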


The confusion matrix gives the accuracy of all the models; logistic regression proves to be the best among them, with an accuracy of 0.837261504. This means that the predictive power of logistic regression on this dataset, with the chosen features, is very high.

It should be clearly stated that the accuracy of the models may vary when the choice of features is different. Ideally, logistic regression and support vector machines are the models which give a good level of accuracy when it comes to classification problems.

12. FUTURE WORK
This project involves an implementation of data analytics and machine learning, and this work can be used as a reference for learning the implementation of EDA and machine learning from the very basics.

In future, the idea can be extended by building a more advanced graphical user interface with the help of newer libraries like shiny in R. An interactive page can be made, i.e. if the value of an attribute is changed on a scale, the values in the corresponding graph (ggplot or histogram) also change. More focused conclusions can also be drawn by combining the results we obtained.

13. REFERENCES
[1] Analyzing Titanic Disaster using Machine Learning Algorithms, International Conference on Computing, Communication and Automation (ICCCA), IEEE, 21 December 2017.
[2] Eric Lam, Chongxuan Tang, "Titanic Machine Learning From Disaster", LamTang-Titanic Machine Learning From Disaster, 2012.
[3] S. Cicoria, J. Sherlock, M. Muniswamaiah, L. Clarke, "Classification of Titanic Passenger Data and Chances of Surviving the Disaster", Proceedings of Student-Faculty Research Day, CSIS, pp. 1-6, May 2014.
[4] Corinna Cortes, Vladimir Vapnik, "Support-Vector Networks", Machine Learning, Volume 20, Issue 3, pp. 273-297.
[5] L. Breiman, "Random Forests", Machine Learning, 2001; Ng, CS229 Notes, Stanford University, 2012.
[6] S.J. Russell, P. Norvig, "Artificial Intelligence: A Modern Approach", 2016.
[7] Lonnie Stevans, David L. Gleicher, "Who Survived the Titanic? A Logistic Regression Analysis", International Journal of Maritime History, December 2004.
[8] Michael Aaron Whitley, "Using Statistical Learning to Predict Survival of Passengers on the RMS Titanic", 2015.
[9] Kunal Vyas, Zeshi Zheng, Lin Li, "Titanic - Machine Learning From Disaster", 2015.
[10] Xiaodong Yang, "EECS 349: Titanic - Machine Learning From Disaster", Northwestern University.
[11] Tryambak Chatterjee, "Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine Learning Algorithms", IJERMT, 2017.
[12] Chao-Ying Joanne Peng, Kuk Lida Lee, Gary M. Ingersoll, "An Introduction to Logistic Regression Analysis and Reporting", April 2010.
[13] Zhenyan Liu, Yifei Zeng, Yida Yan, Pengfei Zhang, Yong Wang, "Machine Learning for Analyzing Malware", Journal of Cyber Security and Mobility, Vol. 6, Issue 3, July 2017.
[14] Andy Liaw, Matthew Wiener, "Classification and Regression by randomForest", R News, Vol. 2/3, December 2002.
[15] Galit Shmueli, Otto R. Koppius, "Predictive Analytics in Information Systems Research", MIS Quarterly, Vol. 35, No. 3, September 2011, pp. 553-572.
[16] John D. Kelleher, Brian Mac Namee, Aoife D'Arcy, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms".
[17] Neeraj Bhargava, Girja Sharma, "Decision Tree Analysis on J48 Algorithm for Data Mining", Volume 3, Issue 6, June 2013.
[18] Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, "Data Mining: Practical Machine Learning Tools and Techniques".
[19] D.W. Hosmer, T. Hosmer, S. Le Cessie, S. Lemeshow, "A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model".
[20] L. Breiman, "Random Forests", Machine Learning, 45:5-32, 2001.
[21] Stuart J. Russell, Peter Norvig, "Artificial Intelligence: A Modern Approach", Pearson Education, 2003, pp. 697-702.
[22] Corinna Cortes, Vladimir N. Vapnik, "Support-Vector Networks", Machine Learning, 20, 1995.
[23] A. Unwin, H. Hofmann, "GUI and Command-line - Conflict or Synergy?", in K. Berk, M. Pourahmadi (eds.), Computing Science and Statistics, 1999.
[24] Mark R. Segal, "Machine Learning Benchmarks and Random Forest Regression", 2004.
[25] Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2nd, 2014.
