5. DATA CLEANING
Before applying any type of data analytics to the dataset, the data is first cleaned. There are some missing values in the dataset which need to be handled. In attributes like Age, Cabin and Embarked, missing values are replaced with random samples drawn from the existing values of the same attribute. [15]
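As an illustrative sketch (not code from the paper), this imputation can be done in R as follows; the data frame name titanic and the standard Kaggle column names are assumptions:

    # Assumes the Kaggle Titanic data is loaded into a data frame named titanic
    set.seed(42)  # fix the random draws so the cleaning step is reproducible

    # Replace missing Age values with random samples from the observed ages
    missing_age <- is.na(titanic$Age)
    titanic$Age[missing_age] <- sample(titanic$Age[!missing_age],
                                       sum(missing_age), replace = TRUE)

    # Embarked is blank (empty string) where missing in this dataset
    missing_emb <- is.na(titanic$Embarked) | titanic$Embarked == ""
    titanic$Embarked[missing_emb] <- sample(titanic$Embarked[!missing_emb],
                                            sum(missing_emb), replace = TRUE)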
In the case of the Fare column, we found that there is one passenger with a missing fare, having passenger id 1044. To put a meaningful value into the Fare column, we first found the Embarked and Pclass values of this passenger. The median was then calculated over the fares of all passengers whose port of embarkation and Pclass were the same as those of passenger 1044.
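A minimal R sketch of this median imputation, under the same assumptions as above (passenger id 1044 belongs to the test file, so the combined train and test data is assumed):

    # Locate the single passenger with a missing Fare
    p <- titanic[titanic$PassengerId == 1044, ]

    # Median fare over passengers with the same Embarked port and Pclass
    same_group <- titanic$Embarked == p$Embarked & titanic$Pclass == p$Pclass
    titanic$Fare[titanic$PassengerId == 1044] <-
      median(titanic$Fare[same_group], na.rm = TRUE)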
Sex versus Survival
From Fig. 6 it is clear that females are more likely to survive than males. We calculated that the survival rates of females and males are 74.20382% and 18.89081% respectively.
In a similar way, the relationship between survival and the other attributes, such as fare, cabin, title, family, Pclass and Embarked, is found. We extracted the title from the attribute 'name' and combined parch and sibsp. In this way we are able to decide the emphasis of each attribute on the survival of a passenger.
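The survival rates quoted above can be reproduced with a one-line aggregation; a sketch, again assuming the titanic data frame:

    # Mean of the 0/1 Survived column per sex, expressed as a percentage
    round(100 * tapply(titanic$Survived, titanic$Sex, mean, na.rm = TRUE), 5)
    #   female     male
    # 74.20382 18.89081   (values reported in the paper)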
7. METHODOLOGY
7.1 Feature Engineering
Feature engineering is the most important part of the data analytics process. It deals with selecting the features that are used in training and making predictions. In feature engineering, domain knowledge is used to find features in the dataset which are helpful in building a machine learning model. It helps in understanding the dataset in terms of modeling. A bad feature selection may lead to a less accurate or poor predictive model. The accuracy and the predictive power depend on the choice of correct features. It also filters out all the unused or redundant features.

Based on the exploratory analysis above, the following features are used: age, sex, cabin, title, Pclass, family size (parch plus sibsp columns), fare and embarked. The Survival column is chosen as the response column. These features are selected because their values have an impact on the rate of survival. These features will be the value of "x" in the bar-plots. If the wrong features were selected, then even a good algorithm may produce bad predictions. Therefore, feature engineering acts like a backbone in building an accurate predictive model.
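For instance, the family-size and title features described above could be derived as follows (an illustrative sketch, not code from the paper):

    # Family size: siblings/spouses plus parents/children
    titanic$FamilySize <- titanic$SibSp + titanic$Parch

    # Title: the token between the comma and the period in the Name column,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr"
    titanic$Title <- sub(".*, ([A-Za-z ]+)\\..*", "\\1", titanic$Name)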
7.2 Machine Learning Models
Various machine learning models are implemented to validate and predict survival.
7.2.1 Logistic Regression
Logistic regression is a technique which works best when the dependent variable is dichotomous (binary or categorical). [23] Describing the data and explaining the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables is done with the help of logistic regression. It is used to solve binary classification problems; some real-life examples are spam detection (predicting whether an email is spam or not), health (predicting whether a given mass of tissue is benign or malignant) and marketing (predicting whether a given user will buy an insurance product or not).
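In R, such a model can be fitted with glm(); a minimal sketch using the features from Section 7.1 (the exact formula is an assumption):

    # Binomial family gives logistic regression; Survived is the 0/1 response
    fit_glm <- glm(Survived ~ Age + Sex + Pclass + Fare + FamilySize + Embarked,
                   data = titanic, family = binomial)

    # Predicted survival probabilities, thresholded at 0.5
    pred_glm <- as.integer(predict(fit_glm, type = "response") > 0.5)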
7.2.2 Decision Tree
A decision tree is a supervised learning algorithm. It is generally used in problems based on classification and is suitable for both categorical and continuous input and output variables. Each internal node represents a single input variable (x) and a split point on that variable. The dependent variable (y) is present at the leaf nodes. For example: suppose there are two independent (input) variables (x), height in centimeters and weight in kilograms, and the task is to find the gender of a person based on the given data (a hypothetical example, for demonstration purposes only).

Fig 7: Example of a Decision Tree

There are two types of decision tree, based on the type of target variable:
1. Categorical Variable Decision Tree: a tree in which the target variable has categorical values.
2. Continuous Variable Decision Tree: a tree in which the target variable has continuous values.
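A minimal classification-tree sketch using the rpart package (the feature set is assumed, as in Section 7.1):

    library(rpart)

    # method = "class" builds a categorical-variable decision tree
    fit_tree <- rpart(Survived ~ Age + Sex + Pclass + Fare + FamilySize,
                      data = titanic, method = "class")
    pred_tree <- predict(fit_tree, type = "class")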
7.2.3 Random Forest
The random forest algorithm is a supervised classification algorithm. The algorithm basically builds a forest with a large number of trees; the higher the number of trees in the forest, the higher the accuracy of the results. The random forest algorithm can be used for both classification and regression problems. For instance, it will take random samples of 100 observations and 5 randomly chosen initial variables to build a model. The same process is repeated a number of times, and then the final prediction is made according to the observations. The final prediction is a function (mean) of each individual prediction.
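A sketch with the randomForest package (the tree count and feature set are illustrative assumptions):

    library(randomForest)

    # Each tree is grown on a bootstrap sample of the observations, with a
    # random subset of the variables considered at each split
    titanic$Sex <- factor(titanic$Sex)   # categorical inputs must be factors
    fit_rf <- randomForest(factor(Survived) ~ Age + Sex + Pclass + Fare + FamilySize,
                           data = titanic, ntree = 500)
    pred_rf <- predict(fit_rf)           # out-of-bag predictions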
7.2.4 Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm. It is used to solve both classification and regression problems. Classification is performed by constructing hyperplanes in a multidimensional space that separate cases of different class labels. For categorical variables a dummy variable is created, with values of either 0 or 1. So a categorical dependent variable consisting of three levels, say (A, B, C), can be represented by a set of three dummy variables:

A: {1, 0, 0}; B: {0, 1, 0}; C: {0, 0, 1}
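A sketch using the svm() function from the e1071 package; its formula interface performs the dummy coding of factor inputs shown above internally:

    library(e1071)

    # One 0/1 dummy column per factor level, as in the A/B/C example
    head(model.matrix(~ Embarked - 1, data = titanic))

    fit_svm <- svm(factor(Survived) ~ Age + Sex + Pclass + Fare + FamilySize,
                   data = titanic, kernel = "radial")
    pred_svm <- predict(fit_svm)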
8. MODEL EVALUATION
The accuracy of each model is evaluated using a "confusion matrix". A confusion matrix is a table layout that allows one to visualize the correctness and the performance of an algorithm.

8.1 Confusion Matrix
A confusion matrix is a method to verify how accurately the classification model works. It gives the actual number of predictions which were correct or incorrect when compared to the actual outcomes in the data. The matrix is of order N*N, where N is the number of class values. The performance of such models is commonly evaluated using the data in the matrix.

Sensitivity: the percentage of actual positives which are correctly identified; it is complementary to the false negative rate. Sensitivity = true positives / (true positives + false negatives). The ideal value for sensitivity is 1.0 and the minimum value is 0.0.

Specificity: the proportion of negatives which are correctly identified; it is complementary to the false positive rate. Specificity = true negatives / (true negatives + false positives). The ideal value for specificity is 1.0 and the least value is 0.0.

Positive Predictive Value: a performance measure of the statistical test. It is the ratio of true positives (cases predicted positive whose actual outcome is positive) to the sum of true positives and false positives (cases predicted positive whose actual outcome is negative).

Negative Predictive Value: the ratio of true negatives (cases predicted negative whose actual outcome is negative) to the sum of true negatives and false negatives (cases predicted negative whose actual outcome is positive).
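These quantities follow directly from the four cells of the matrix; a sketch for the logistic regression predictions from Section 7.2.1 (computed on the training rows):

    # Rows: predicted class; columns: actual class
    cm <- table(Predicted = pred_glm, Actual = titanic$Survived)

    TP <- cm["1", "1"]; TN <- cm["0", "0"]
    FP <- cm["1", "0"]; FN <- cm["0", "1"]

    sensitivity <- TP / (TP + FN)   # true positive rate
    specificity <- TN / (TN + FP)   # true negative rate
    ppv         <- TP / (TP + FP)   # positive predictive value
    npv         <- TN / (TN + FN)   # negative predictive value
    accuracy    <- (TP + TN) / sum(cm)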
9. PREDICTION
Here we can choose any of the models to predict the survival of the test sample. Since we have evaluated all the models using the confusion matrix, we predict using the model which has the highest accuracy.

We performed prediction on the test dataset using the logistic regression model and SVM.
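A sketch of this final step, assuming the test set has been cleaned and engineered in the same way as the training data (the data frame name test is an assumption):

    # Predict survival for the unseen passengers with the best-scoring model
    test$Survived <- as.integer(
      predict(fit_glm, newdata = test, type = "response") > 0.5)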
11. CONCLUSION
Data cleaning is the first step while performing data analysis. Exploratory data analytics helps one to understand the dataset and the dependency among the attributes. EDA is used to figure out the relationships between the features of the dataset. This is done by using various graphical techniques; the ones used above are ggplots and histograms.

By applying EDA, some conclusions are drawn and facts are found.

There is a high influence of age on survival. We can see from Table 2 that as age increases, survival decreases.

It can be seen that the survival rate of females is very high (approx. 74%) while the survival rate of males is very low. This fact can also be verified by extracting titles (Mr, Mrs, Ms, etc.) from the name column. The survival rate for the title Mr. is approximately 16%, while the survival rate for Mrs. is 79%.

We can also see survival versus Pclass in the following table:

Table 3. Passenger Class vs. Survival Rate

    Passenger class    Survival rate (%)
    1                  62.96296
    2                  47.28261
    3                  24.23625

We found that passengers who were travelling in first class were more likely to survive.

We combined the parch and sibsp columns to know the family size of a particular passenger. We found that the survival rate increases when family size lies between 0 and 3, but when family size becomes greater than 3, the survival rate decreases.

Similarly, it is found that passengers who had more cabins had a higher survival rate.

Table 4. Fare Group vs. Survival Rate

    Fare group    Survival rate (%)
    (0,50]        32.40223
    (50,100]      65.42056
    (100,150]     79.16667
    (150,200]     66.66667
    (200,250]     63.63636
    (250,300]     66.66667
    (500,550]     100

From these figures we can say that the higher the fare, the higher the survival rate.

In feature engineering, the actual parameters to be used while designing the training and prediction models are found on the basis of the exploratory data analytics process. Machine learning models predict which passengers survived. The logistic regression technique is used for making predictions in this classification problem.
The confusion matrix gives the accuracy of all the models; logistic regression proves to be the best among them, with an accuracy of 0.837261504. This means that the predictive power of logistic regression on this dataset, with the chosen features, is very high.

It must be clearly stated that the accuracy of the models may vary when the choice of features is different. In general, logistic regression and support vector machines are the models which give a good level of accuracy when it comes to classification problems.
12. FUTURE WORK
This project involves the implementation of data analytics and machine learning. The work can be used as a reference for learning the implementation of EDA and machine learning from the very basics.

In future, the idea can be extended by building a more advanced graphical user interface with the help of newer libraries such as shiny in R. An interactive page can be made, i.e. if the value of an attribute is changed on a scale, the values in the corresponding graph (ggplot or histogram) also change. More focused conclusions can also be drawn by combining the results we obtained.
13. REFERENCES
[1] Analyzing Titanic Disaster using Machine Learning Algorithms, Computing, Communication and Automation (ICCCA), 2017 International Conference on, 21 December 2017, IEEE.
[2] Eric Lam, Chongxuan Tang, "Titanic Machine Learning From Disaster", LamTang-Titanic Machine Learning From Disaster, 2012.
[3] S. Cicoria, J. Sherlock, M. Muniswamaiah, L. Clarke, "Classification of Titanic Passenger Data and Chances of Surviving the Disaster", Proceedings of Student-Faculty Research Day, CSIS, pp. 1-6, May 2014.
[4] Corinna Cortes, Vladimir Vapnik, "Support-Vector Networks", Machine Learning, Volume 20, Issue 3, pp. 273-297.
[5] L. Breiman, "Random Forests", Machine Learning, 2001; Ng, CS229 Notes, Stanford University, 2012.
[6] S. J. Russell, P. Norvig, "Artificial Intelligence: A Modern Approach", 2016.
[7] Lonnie Stevans, David L. Gleicher, "Who Survived the Titanic? A Logistic Regression Analysis", International Journal of Maritime History, December 2004.
[8] Michael Aaron Whitley, "Using Statistical Learning to Predict Survival of Passengers on the RMS Titanic", 2015.
[9] Kunal Vyas, Zeshi Zheng, Lin Li, "Titanic: Machine Learning From Disaster", 2015.
[10] Xiaodong Yang, "EECS 349 Titanic: Machine Learning From Disaster", Northwestern University.
[11] Tryambak Chatterlee, "Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine Learning Algorithms", IJERMT, 2017.
[12] Chao-Ying Joanne Peng, Kuk Lida Lee, Gary M. Ingersoll, "An Introduction to Logistic Regression Analysis and Reporting", April 2010.
[13] Zhenyan Liu, Yifei Zeng, Yida Yan, Pengfei Zhang, Yong Wang, "Machine Learning for Analyzing Malware", Journal of Cyber Security and Mobility, Vol. 6, Issue 3, July 2017.
[14] Andy Liaw, Matthew Wiener, "Classification and Regression by randomForest", Vol. 2/3, December 2002.
[15] Galit Shmueli, Otto R. Koppius, "Predictive Analytics in Information Systems Research", MIS Quarterly, Vol. 35, No. 3, September 2011, pp. 553-572.
[16] John D. Kelleher, Brian Mac Namee, Aoife D'Arcy, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms".
[17] Neeraj Bhargava, Girja Sharma, "Decision Tree Analysis on J48 Algorithm for Data Mining", Volume 3, Issue 6, June 2013.
[18] Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, "Data Mining: Practical Machine Learning Tools and Techniques".
[19] D. W. Hosmer, T. Hosmer, S. Le Cessie, S. Lemeshow, "A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model".
[20] L. Breiman, "Random Forests", Machine Learning 45:5-32, 2001.
[21] Stuart J. Russell, Peter Norvig, "Artificial Intelligence: A Modern Approach", Pearson Education, 2003, pp. 697-702.
[22] Corinna Cortes, Vladimir N. Vapnik, "Support-Vector Networks", Machine Learning, 20, 1995.
[23] A. Unwin, H. Hofmann, "GUI and Command-line: Conflict or Synergy?", in K. Berk, M. Pourahmadi (eds.), Computing Science and Statistics, 1999.
[24] Mark R. Segal, "Machine Learning Benchmarks and Random Forest Regression", 2004.
[25] Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2nd, 2014.