Report TSP
Submitted by
Submitted to
1 INTRODUCTION
3 IMPLEMENTATION FOR PREDICTING ACCURACY
4 CONCLUSION
5 REFERENCES
INTRODUCTION
The sinking of the Titanic, which caused the death of more than 1,500 passengers
and crew, is one of the deadliest accidents in history. The loss of life was caused
largely by the shortage of lifeboats. A striking observation from the incident is
that some groups of people were more likely to survive than others: women and
children were given priority in the rescue. The main objective of this work is
first to uncover expected or previously unknown patterns by performing exploratory
data analysis (EDA) on the available training data, and then to apply different
machine learning models and classifiers to complete the analysis.
This will predict which people are more likely to survive. The results of the
machine learning algorithms are then compared on the basis of performance and
accuracy.
Why EDA is used:
It is good practice to understand the data first and to gather as many insights
from it as possible. EDA is about making sense of the data in hand before getting
your hands dirty with it.
1) A quick glance at the data:
First, we import the necessary packages and load the data set.
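A minimal sketch of this loading step, assuming pandas and the column layout of the public Kaggle Titanic training file. The inline sample rows, the passenger names, and the `train.csv` file name mentioned in the comment are illustrative assumptions, not the report's actual data.

```python
import io

import pandas as pd

# Hypothetical sample rows in the layout of the Kaggle Titanic training file.
# With the real data this would be pd.read_csv("train.csv") (assumed file name).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T100,7.25,,S
2,1,1,Passenger B,female,38,1,0,T200,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T300,7.92,,S
4,0,1,Passenger D,male,40,0,0,T400,0.0,B94,S
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# First look: size, column types, and the first rows.
print(df.shape)
print(df.dtypes)
print(df.head())

# Passengers recorded with a fare of zero, listed for spot checking.
zero_fare = df[df["Fare"] == 0]
print(zero_fare[["PassengerId", "Name", "Fare"]])
```

Empty fields in the file are read as missing values, so gaps in Age and Cabin are already visible at this stage.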
Above is a list of passengers with a $0 fare. We spot-checked a few
passengers to see whether the $0 fare is intended.
Passengers who share the same ticket number seem to belong to the same
traveling group. We can create a boolean variable for traveling group to
see whether people who travelled in groups were more likely to survive.
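One way such a traveling-group flag could be built is sketched here, assuming pandas and a `Ticket` column as in the Kaggle data. The column name `InGroup` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical tickets; T200 is shared by two passengers.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Ticket": ["T100", "T200", "T200", "T300", "T400"],
    "Survived": [0, 1, 1, 0, 1],
})

# A ticket number that appears more than once marks a traveling group.
# keep=False flags every passenger on a shared ticket, not only the repeats.
df["InGroup"] = df.duplicated(subset="Ticket", keep=False)

# Survival rate for solo travelers (False) versus group travelers (True).
print(df.groupby("InGroup")["Survived"].mean())
```

Comparing the two survival rates on the full training data would show whether the flag carries any signal.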
About 20% of the Age data, 77% of the Cabin data, and 0.2% of the
Embarked data are missing. We'll need to handle the missing data before
modeling. This will also be covered in the Feature Engineering article.
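Percentages like these can be computed per column, as in the sketch below. It assumes pandas; the rows are hypothetical and do not reproduce the figures quoted above.

```python
import pandas as pd

# Hypothetical rows with gaps in each column.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", "S", None, "Q"],
})

# isna() marks missing cells; the column mean of those flags is the
# fraction of missing entries, converted here to a percentage.
missing_pct = (df.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```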
2) Numerical variables:
According to the box plots, survivors and victims have similar quartiles in
Age and SibSp. Compared to victims, survivors were more likely to have
parents / children aboard the Titanic and to have paid for relatively more
expensive tickets.
A box plot provides a quick view of numerical data through its quartiles.
Let's also check the data distribution using histograms to uncover
additional patterns.
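The quartiles that a box plot draws can also be read off numerically. The sketch below assumes pandas and uses hypothetical fares; it only illustrates the comparison between the two groups.

```python
import pandas as pd

# Hypothetical fares for victims (Survived = 0) and survivors (Survived = 1).
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Fare": [7.0, 8.0, 9.0, 10.0, 11.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# describe() reports the 25%, 50% (median) and 75% points per group,
# which are the box edges and the middle line of a box plot.
quartiles = df.groupby("Survived")["Fare"].describe()[["25%", "50%", "75%"]]
print(quartiles)
```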
3) Data distribution:
When comparing the distributions of two sets of data, it's preferable to use
the relative frequency instead of the absolute frequency. Using Age as an
example, the histogram of absolute frequency suggests that there were many
more victims than survivors in the 20–30 age group, but this partly reflects
the fact that there were more victims than survivors overall.
In the histogram of relative frequency for age, what really stands out is the
age group under 10: children made up a larger share of the survivors than of
the victims, so they were more likely to survive than passengers in the other
age groups.
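The difference between the two kinds of histogram can be illustrated with a small sketch in plain Python. The ages are hypothetical, and the helper names `age_bin` and `relative_frequency` are introduced here only for illustration.

```python
from collections import Counter


def age_bin(age, width=10):
    """Return the lower edge of the age bin, e.g. 27 falls in the bin 20."""
    return int(age // width) * width


def relative_frequency(ages, width=10):
    """Share of the group that falls in each age bin; the shares sum to 1."""
    counts = Counter(age_bin(a, width) for a in ages)
    total = sum(counts.values())
    return {b: counts[b] / total for b in sorted(counts)}


# Hypothetical ages: the victim group is twice as large as the survivor group.
victims = [22, 25, 28, 24, 35, 40, 4, 29]
survivors = [3, 8, 26, 33]

# Absolute counts are dominated by the larger group.
abs_v = Counter(age_bin(a) for a in victims)
abs_s = Counter(age_bin(a) for a in survivors)
print("absolute, ages 20-29:", abs_v[20], "victims vs", abs_s[20], "survivors")

# Relative frequencies compare the shape of the two distributions instead.
rel_v = relative_frequency(victims)
rel_s = relative_frequency(survivors)
print("relative, ages 0-9:", rel_v[0], "of victims vs", rel_s[0], "of survivors")
```

In this toy data the under-10 bin holds half of the survivors but only one eighth of the victims, a pattern the absolute counts alone would hide.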
Fig 7 : Pie Plot for Survived data
From the pie plots, we can tell that passengers with missing
age were more likely to be victims.
STORYTELLING
The columns 'Age' and 'Cabin' contain null values. While
'Cabin' has a large number of null values, 'Age' has a moderate
number.
Fig 10 : Factorplot
Inference: As is well known from the film as well as from the history of
the Titanic, women were given priority when passengers were rescued. The
graph above tells the same story: more male passengers died than female
passengers.
Here the 'SibSp' variable refers to the number of siblings or spouses
who accompanied the person. We can see that most people
travelled alone.
Fig 13 : Boxplot
Next, we need a way to fill the missing values of the 'Age'
variable. Here we split 'Age' by the Pclass variable, because
the 'Age' and 'Pclass' columns were found to be related. A box
plot shows the central Age value (the median line) for each
Pclass, which can then be used to fill the gaps.
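A minimal sketch of filling Age by passenger class, assuming pandas. It uses the per-class median, which is the value a box plot marks; this choice is an assumption, and passing `"mean"` instead of `"median"` would use the per-class mean. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical passengers; one Age is missing in each class.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, None, 30.0, None, 20.0, 24.0, None],
})

# Median age of each class, aligned row by row with the original frame.
class_median = df.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; known ages are left unchanged.
df["Age"] = df["Age"].fillna(class_median)
print(df)
```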
IMPLEMENTATION FOR PREDICTING ACCURACY
CONCLUSION
In conclusion, this data set gives us information about the travellers and
whether or not they survived.
The confusion matrices give the accuracy of each model. Logistic regression
proves to be the best of them, with an accuracy of 0.8272. This means that,
with the chosen features, logistic regression correctly classifies about
83% of the passengers it is evaluated on.
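Accuracy follows from a confusion matrix as the share of correct predictions, (TP + TN) / (TP + TN + FP + FN). The sketch below uses hypothetical counts, not the report's results, and the function name `accuracy_from_confusion` is introduced only for illustration.

```python
def accuracy_from_confusion(tn, fp, fn, tp):
    """Share of predictions on the diagonal of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts for a binary survived / did-not-survive classifier.
acc = accuracy_from_confusion(tn=50, fp=10, fn=10, tp=30)
print(f"accuracy = {acc:.4f}")
```

Accuracy alone does not show whether the errors fall mainly on survivors or on victims, which is why the full matrix is worth reporting alongside it.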
Note that the accuracy of the models may vary when different features are
chosen. In general, logistic regression and support vector machines are
models that give a good level of accuracy on classification problems.
I really hope this has been a great read and a source of inspiration to
develop and innovate.
REFERENCES