Group Project
Background
On April 10, 1912, the Titanic set sail from Southampton with 2,240 passengers and crew on board,
ranging in age from five months to eighty and drawn from different classes of society. On April 14, 1912, the
ship hit an iceberg and later broke apart and sank, killing more than 1,500 crew and passengers
(National Oceanic and Atmospheric Administration, n.d.). Since then, analysts have been trying
BUSI group project 2
to identify if more people could have survived and what were the factors leading to surviving or
This project uses data gathered from Kaggle on 891 passengers, or about 40% of the people
on the ship, covering class, age, sex, and port of embarkation. We apply
descriptive, diagnostic, and predictive analysis to help determine the likelihood of survival.
This project analyzes the survival rate based on age, sex, and class. We
want to determine how these factors contributed to whether or not people survived the wreck.
By closely examining age, sex, class, and port of embarkation through logistic regression, we hope to
conclude which was most strongly associated with survival. The results from this project can be
used to clear up myths and suggestions surrounding the sinking, where it is believed that no
Data source
https://www.kaggle.com/competitions/titanic/overview
Data attributes
The dataset contains information for 891 passengers, each with four attributes relevant to
Attribute Description
0: No
1: Yes
socioeconomic status.
to luxurious amenities
amenities
3: Mostly immigrants
voyage from:
1: S- Southampton
2: C- Cherbourg
3: Q- Queenstown
Methodology
Once the data was gathered from Kaggle, we saw that approximately 23.5% of the values
were missing. We used an imputation technique to complete the dataset by following these
steps:
First, we removed the columns that cannot be converted into numerical form, such as Cabin, Name, and
Ticket.
Second, we filled in all the blanks with the column average, using the Go To Special
tool in Excel.
Third, we converted all text data into numerical form using the Replace tool in Excel.
For example, we coded Sex as male=0 and female=1, and Embarked as S=1, C=2,
and Q=3.
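The three steps above were carried out in Excel; the following is only a minimal Python sketch of the same logic. The sample rows, the `preprocess` function, and the constant names are hypothetical and exist purely for illustration; only the dropped columns and the numeric codes come from the steps described above.

```python
from statistics import mean

# Hypothetical sample rows in the Kaggle Titanic layout (values are illustrative only).
rows = [
    {"Survived": 0, "Pclass": 3, "Name": "A", "Sex": "male", "Age": 22.0,
     "Ticket": "T1", "Cabin": None, "Embarked": "S"},
    {"Survived": 1, "Pclass": 1, "Name": "B", "Sex": "female", "Age": 38.0,
     "Ticket": "T2", "Cabin": "C85", "Embarked": "C"},
    {"Survived": 1, "Pclass": 3, "Name": "C", "Sex": "female", "Age": None,
     "Ticket": "T3", "Cabin": None, "Embarked": "Q"},
]

DROP_COLUMNS = {"Name", "Ticket", "Cabin"}      # step 1: non-numeric columns
SEX_CODES = {"male": 0, "female": 1}            # step 3: coding used in the report
EMBARKED_CODES = {"S": 1, "C": 2, "Q": 3}       # step 3: coding used in the report


def preprocess(raw_rows):
    """Drop text columns, fill missing ages with the mean, and encode categories."""
    # Step 1: remove the columns that cannot be converted to numbers.
    cleaned = [{k: v for k, v in r.items() if k not in DROP_COLUMNS} for r in raw_rows]

    # Step 2: fill blank ages with the average of the known ages.
    mean_age = mean(r["Age"] for r in cleaned if r["Age"] is not None)
    for r in cleaned:
        if r["Age"] is None:
            r["Age"] = mean_age
        # Step 3: replace text values with their numeric codes.
        r["Sex"] = SEX_CODES[r["Sex"]]
        r["Embarked"] = EMBARKED_CODES[r["Embarked"]]
    return cleaned


clean_rows = preprocess(rows)
```

In this toy sample the third passenger's missing age is filled with the mean of the two known ages. Mean imputation keeps every row but concentrates many passengers at a single age, which matters when reading any chart of survival by age.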
The average age of 30 was imputed to complete the dataset. The model was then fit to this completed
training set so it could learn the underlying patterns and relationships and develop an accurate
predictive model. The model was also run against a test file to assist us in assessing the model’s
Descriptive Analysis
Figure 1
Figure 1 shows that of the 891 passengers in the dataset, 577 were male, and
Figure 2
Figure 2 shows that more passengers from class 3 died, a total of 372, compared to classes 2
and 1, where 97 and 80 people died, respectively. On the other hand, more people from class 1
survived, a total of 136 passengers, than from class 3, where 119 people survived. Class 2 had
Figure 3
Figure 3 shows survivors by age: 140 passengers aged 30 did not survive, while 62 passengers of the
same age survived. This does not consider other factors such as sex or class. Note that 30 is also the
average age used to fill in missing ages, so this group is inflated by the imputation.
Most passengers who survived or did not survive were between the ages of 15-55, but
Figure 4
Figure 4 shows the total survivors and non-survivors by gender.
In the data, 468 males did not survive compared to 81 females. On the other hand, 233
females survived compared to 109 males. More females survived than
males, both in absolute numbers and as a share of their group.
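The counts reported for Figure 4 can be turned into survival rates with simple arithmetic. The sketch below uses only the four counts quoted in the text; the dictionary layout and variable names are our own.

```python
# Counts by gender as reported for Figure 4.
counts = {
    "male": {"survived": 109, "died": 468},
    "female": {"survived": 233, "died": 81},
}

# Group sizes and the overall number of passengers in the dataset.
group_totals = {sex: c["survived"] + c["died"] for sex, c in counts.items()}
total_passengers = sum(group_totals.values())

# Survival rate within each gender, and each gender's share of all passengers.
survival_rates = {sex: counts[sex]["survived"] / group_totals[sex] for sex in counts}
shares = {sex: group_totals[sex] / total_passengers for sex in counts}

for sex in counts:
    print(f"{sex}: {shares[sex]:.0%} of passengers, {survival_rates[sex]:.0%} survived")
```

The four counts add up to the 891 passengers in the dataset, and the rates show the gap more clearly than the raw counts: roughly one in five males survived, against roughly three in four females.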
Correlation
Figure 5
Figure 5 shows the correlation between survived, class, sex, and age. The correlation
between survived and class is -0.338481036, indicating a negative correlation between the two
variables. In other words, the likelihood of survival falls as class rises from 1 to 3. The
correlation between sex and survival is positive at 0.543351381. Because sex is coded
male=0 and female=1, this shows that females had a
higher chance of surviving than males. The correlation coefficient for age is -0.070657231, indicating a weak negative
association between age and survival. This weak
association suggests that older passengers may have had a slightly lower chance of surviving than younger
ones.
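As a check on how such a coefficient is obtained, the sketch below computes the Pearson correlation between sex and survival directly from the gender counts reported for Figure 4 (468 and 109 males, 81 and 233 females), with sex coded male=0 and female=1. The `pearson` helper is our own plain-Python implementation, not the tool used to produce Figure 5.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# (sex code, survived, number of passengers), with male=0 and female=1.
cells = [(0, 0, 468), (0, 1, 109), (1, 0, 81), (1, 1, 233)]

# Expand the four cells into one value per passenger.
sex, survived = [], []
for sex_code, outcome, n in cells:
    sex += [sex_code] * n
    survived += [outcome] * n

r_sex_survived = pearson(sex, survived)
print(round(r_sex_survived, 4))
```

Rebuilding the two columns from the counts gives a coefficient of about 0.5434, which agrees with the value reported for sex and survival.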
Predictive Analysis
Logistic Regression
H0: Fewer people between the ages of 20 and 30 survived.
H1: More people between the ages of 20 and 30 survived.
H0: People between the ages of 60 and 70 in class 1 did not survive.
The data below confirms that the p-values for class, age, and sex are all very low. This
gives us grounds to reject the four null hypotheses and accept the alternatives (H1), which say
there is a statistically significant relationship between the independent and dependent variables.
It should be noted that the p-value describes a probability and not certainty (Andrade, 2019). The
other variables, SibSp (siblings/spouses), Parch (parents/children), Fare, and Embarked, all have p-values
greater than 0.05, which may indicate that these variables are not significant predictors of
survival.
Figure 6
Embarked*X7))
Here,
X1= Pclass, X2= Sex, X3= Age, X4= SibSp, X5= Parch, X6= Fare, X7= Embarked.
By using the above model, we have predicted the data and used this model for the test file
to cross check.
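For readers unfamiliar with the form of the model, a logistic regression turns a weighted sum of X1 through X7 into a probability between 0 and 1. The sketch below illustrates that general form only. The intercept and coefficient values are hypothetical placeholders, since the fitted values are not reproduced in the text, and the function name and sample passengers are likewise our own.

```python
from math import exp

# Hypothetical intercept and coefficients, for illustration only.
# They are NOT the fitted values from this project.
INTERCEPT = 2.0
COEFFICIENTS = {
    "Pclass": -1.0,    # X1
    "Sex": 2.5,        # X2 (male=0, female=1)
    "Age": -0.03,      # X3
    "SibSp": -0.3,     # X4
    "Parch": -0.1,     # X5
    "Fare": 0.002,     # X6
    "Embarked": 0.1,   # X7 (S=1, C=2, Q=3)
}


def survival_probability(passenger):
    """Logistic model: P(survived) = 1 / (1 + exp(-(b0 + b1*X1 + ... + b7*X7)))."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * passenger[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + exp(-z))


# Two made-up passengers: a third-class male and a first-class female.
male_third = {"Pclass": 3, "Sex": 0, "Age": 22, "SibSp": 1, "Parch": 0,
              "Fare": 7.25, "Embarked": 1}
female_first = {"Pclass": 1, "Sex": 1, "Age": 38, "SibSp": 1, "Parch": 0,
                "Fare": 71.28, "Embarked": 2}

p_male = survival_probability(male_third)
p_female = survival_probability(female_first)
```

With coefficients of these signs, the first-class female receives a much higher predicted probability than the third-class male, which is the direction the correlations reported earlier would suggest.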
Data Model
Given the data in Figure 6, the model predicted a survival probability of 0.09 for
passenger 1. The predicted and actual outcomes are both 0 and the label is True, which means the
model correctly predicted that the passenger did not survive. For passenger
2, the model predicted a survival probability of 0.91. The predicted and actual outcomes are both 1 and the label
is True, which means the model again correctly predicted that the passenger would survive.
Overall, the model predicted the survival or non-survival of passengers with an acceptable
80% accuracy.
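The accuracy figure is the share of passengers whose predicted outcome matches the actual outcome. The sketch below shows that calculation, assuming a probability of 0.5 or more is classified as survival; the text does not state the cut-off, so the threshold is an assumption. The first two rows use the probabilities quoted above for passengers 1 and 2, and the remaining rows are hypothetical.

```python
def classify(probability, threshold=0.5):
    """Convert a predicted probability to 1 (survived) or 0 (did not survive).

    The 0.5 threshold is an assumption; the report does not state its cut-off.
    """
    return 1 if probability >= threshold else 0


def accuracy(probabilities, actual, threshold=0.5):
    """Return the share of correct predictions and the per-passenger True/False labels."""
    labels = [classify(p, threshold) == a for p, a in zip(probabilities, actual)]
    return sum(labels) / len(labels), labels


# Rows 1 and 2 use the probabilities quoted in the text; rows 3 to 5 are hypothetical.
predicted_probabilities = [0.09, 0.91, 0.60, 0.30, 0.20]
actual_outcomes = [0, 1, 0, 0, 0]

score, labels = accuracy(predicted_probabilities, actual_outcomes)
print(score, labels)
```

In this toy example four of the five predictions match, so the label is True for four rows and the score is 0.8. The same calculation over all passengers in the dataset produces the overall accuracy.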
The 80% predictive accuracy of the logistic model is considered high enough to be acceptable.
Therefore, this model can be used to predict, based on a passenger's age, sex, and class, whether
they were likely or unlikely to live. From the findings, we can expect that if you are a male, you
are more likely to die than a female. This may be because males were heads of the household and
the ones who took care of the family; therefore, they would have let the females board a
lifeboat first and followed later if they could be accommodated. More passengers between the
ages of 15 and 37 survived, possibly because they were more agile and better able to swim or hold on to wreckage,
given the distance from rescue and the lack of lifeboats. Additionally, passenger class was significantly
related to survival, which may be linked to the societal norms of the time, when the wealthy were given first
preference regardless of age and gender. Based on reports, lifeboats were launched from the first-
and second-class decks, which may suggest why most survivors came from classes 1 and 2
(Henderson, 1998). The number of class 3 survivors was surprisingly higher than that of class 2; however, this might
be due to their age and ability to reach a lifeboat after it was launched. It should also be noted that,
since most women survived, the class 1 survivors were likely mostly female.
Test data:
Using the logistic regression model, we applied the model to the test file. We
found that the model predicts non-survival for 63% of passengers and survival for the remaining 37%.
Conclusion
The sinking of the Titanic in 1912 is one of the best-known and most serious maritime
accidents in history. We were commissioned to carry out a descriptive and predictive analysis based
Regarding the descriptive analysis, it was observed that although the people
traveling on the Titanic were predominantly male (65%), only 19% of the men managed to survive. In
comparison, 74% of the women survived. A substantial
difference is also observed in the proportion of people who survived depending on the class in which
they traveled. Third class had far more deaths than second and first class.
For the predictive analysis, a logistic regression model was used,
with passenger survival or death as the dependent variable. For the independent
variables, gender, age, and class of travel were considered. On the class side, the
results show that class 3 passengers have the lowest probability of survival and class 1 passengers
have the highest. For the sex variable, the findings indicate that women are
References
R.M.S Titanic - history and significance. R.M.S Titanic - History and Significance | National
international-section/rms-titanic-history-and-significance#:~:text=Titanic%2C
%20launched%20on%20May%2031,than%201%2C500%20passengers%20and%20crew
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532382/
Henderson, J. R. (1998, June 6). Titanic: Demographics of the passengers. Retrieved March 26,