
EMERALD VALLEY PUBLIC SCHOOL

ACADEMIC YEAR : 2024 – 25

PROJECT REPORT ON

TITANIC SURVIVAL PREDICTION


USING MACHINE LEARNING

Submitted to:
Mrs. Priya P, B.Sc., M.C.A., B.Ed., M.Phil.
PGT (CS)
Emerald Valley Public School,
Salem – 636008
Tamil Nadu
EMERALD VALLEY PUBLIC SCHOOL

CERTIFICATE

This is to certify that NIVASINI M V, Roll No. :

has successfully completed the project work entitled “TITANIC
SURVIVAL PREDICTION USING MACHINE LEARNING” in
the subject Data Science (844) as laid down in the regulations of CBSE
for the purpose of the Practical Examination in Class XII, to be held at
Emerald Valley Public School, Yercaud Foothills, Salem – 636008,
during the academic year 2024 – 25.

Priya P

Name :
Signature :
Date :
ACKNOWLEDGEMENT

First and foremost, I owe my wholehearted thanks to my parents for their love,
encouragement and moral support for completing this project.

I sincerely appreciate our Principal Mr. K. Manimaran for permitting access to the well-
equipped lab and the resources required for the project.

I am deeply appreciative of my project mentor, Priya P, for offering invaluable guidance
and motivation throughout the project. She carefully monitored my progress, clarified my
uncertainties, and provided constructive feedback that improved the quality of my
project.

My special thanks to my classmates, who were incredibly helpful. They assisted me at
various stages of the project by providing useful insights, engaging in brainstorming
sessions, and offering support.

The encouragement from my teacher, principal and friends was invaluable. I will always

remain grateful for their support.


TITANIC SURVIVAL PREDICTION

USING MACHINE LEARNING

PROJECT DONE BY : NIVASINI M V


CONTENTS

SERIAL NO.   DESCRIPTION

1            PROBLEM DEFINITION
2            REQUIREMENTS
3            INTRODUCTION
4            EXPLORATORY DATA ANALYSIS
5            SOURCE CODE
6            PREDICTION
7            VISUALIZATION
8            BIBLIOGRAPHY
TITANIC SURVIVAL PREDICTION
USING MACHINE LEARNING

PROBLEM DEFINITION

The Titanic Survival Prediction problem is a classic
machine learning challenge that involves predicting whether a
passenger survived or perished in the tragic sinking of the
RMS Titanic. The dataset typically consists of historical
passenger information, such as age, gender, class, ticket fare,
and other attributes, alongside the target variable indicating
whether the passenger survived the disaster (binary outcome:
1 = survived, 0 = did not survive). This problem is widely
used in data science education and competitions (such as on
Kaggle) to help beginners practice machine learning concepts,
data preprocessing, and model evaluation.
The goal of this problem is to develop a machine learning
model that can predict whether a passenger survived or not
based on a set of features describing the passenger’s attributes.
HARDWARE AND SOFTWARE
REQUIREMENTS

HARDWARE REQUIRED
• Printer, to print the required documents of the project
• Drive
• Processor: Intel i5
• RAM: 4 GB and above
• Hard Disk: 1 TB

SOFTWARE REQUIRED
• Operating System: Windows 11
• Jupyter Notebook
• Python
• Visual Studio Code
• MS Word (for preparing and presenting the project)
INTRODUCTION
Introduction to Machine Learning

Machine Learning (ML) is a branch of artificial intelligence (AI) that enables systems to
automatically learn and improve from experience without being explicitly programmed.
In other words, machine learning allows computers to identify patterns, make decisions,
and improve performance through exposure to data, rather than following strict, pre-
defined rules.

At its core, machine learning involves the development of algorithms that can analyze
data, learn from it, and then make predictions or decisions based on that learning.
Machine learning has become a cornerstone of AI, and it is used in a variety of fields
such as healthcare, finance, e-commerce, entertainment, and more.

Key Concepts in Machine Learning

1. Data: The foundation of machine learning. It consists of input features (also called
variables or attributes) and labels (the outcome you want to predict). The quality
and quantity of data are critical for the success of machine learning models.

2. Algorithms: Machine learning algorithms are the methods used to find patterns in
the data. There are different types of algorithms based on the problem you're trying
to solve.

3. Model: A model is the output of a machine learning algorithm after it has been
trained on data. It represents the patterns or relationships the algorithm has learned
and is used to make predictions or decisions on new data.
4. Training: The process of feeding data into a machine learning algorithm to help it
learn the underlying patterns.

5. Testing: After a model has been trained, it is evaluated on new, unseen data (test
data) to assess its performance and ability to generalize to real-world situations.
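
As a minimal sketch of the training and testing steps (using scikit-learn, which the project itself uses later), the example below trains a model on one part of a small built-in toy dataset and tests it on the held-out part. It is illustrative only and is not part of the Titanic code presented later in this report.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data: input features X and labels y from a small built-in dataset.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Training: the algorithm learns patterns from the training split only.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Testing: performance is measured on data the model has never seen.
print("Test accuracy:", model.score(X_test, y_test))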

Types of Machine Learning

Machine learning is typically classified into three major categories:

1. Supervised Learning:
o In supervised learning, the algorithm is trained on a labeled dataset, meaning
the input data is paired with the correct output (or label).
o The goal is for the model to learn a mapping from inputs to outputs, so it can
predict the output for new, unseen inputs.
o Examples: Classification (e.g., spam email detection) and Regression (e.g.,
predicting house prices based on features like size, location, etc.).

2. Unsupervised Learning:
o In unsupervised learning, the algorithm is given data without labels and
must find the underlying structure or patterns in the data on its own.
o The goal is often to discover hidden structures like clusters or associations in
the data.
o Examples: Clustering (e.g., grouping similar customers based on purchasing
behavior) and Dimensionality Reduction (e.g., reducing the number of
variables in a dataset).

3. Reinforcement Learning:
o In reinforcement learning, an agent learns by interacting with an environment,
receiving rewards or penalties for its actions and adjusting its behaviour to
maximize the cumulative reward over time.
o Examples: game playing and robotics control.
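
A minimal sketch of the first two styles on toy data (illustrative only, using scikit-learn): the supervised model is given labels, while the clustering model must find the two groups by itself.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])

# Supervised: labels y are provided and the model learns the mapping X -> y.
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [9.5]]))   # expected: [0 1]

# Unsupervised: no labels; KMeans groups the points into two clusters on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)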
Data Requirements

The Titanic Survival Prediction task is a binary classification problem: given a set of
features (e.g., gender, age, class), the model must classify whether a passenger
survived (1) or did not survive (0).

The Titanic dataset typically includes the following features (columns):

PassengerId: Unique identifier for each passenger (typically not used for prediction).
Pclass: Passenger class (1, 2, or 3), representing the socio-economic status of the
passenger, from 1st class (highest) to 3rd class (lowest).
Name: Name of the passenger (can be useful for extracting titles such as Mr., Mrs.,
etc., but typically not used directly).
Sex: Gender of the passenger (categorical: male or female).
Age: Age of the passenger (continuous variable, often with missing values).
SibSp: Number of siblings or spouses aboard the Titanic.
Parch: Number of parents or children aboard the Titanic.
Ticket: Ticket number (can be mined for patterns, but often not used directly).
Fare: The fare the passenger paid for the ticket (continuous variable).
Cabin: Cabin number (with many missing values; can be partially used or omitted).
Embarked: Port of embarkation (categorical: C = Cherbourg, Q = Queenstown, S =
Southampton).
Survived (target): Whether the passenger survived (1) or did not survive (0).

Additional features, such as Title (extracted from the Name column) or family size
(SibSp + Parch), may also be engineered during preprocessing, as shown in the sketch below.
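
As an illustration of this kind of feature engineering, the short sketch below derives a Title and a FamilySize column; it assumes the Kaggle train.csv file has been loaded into a DataFrame named df (a name chosen here only for illustration).

import pandas as pd

df = pd.read_csv("train.csv")   # assumes the Kaggle training file is available

# FamilySize: siblings/spouses plus parents/children aboard
# (some versions also add 1 for the passenger themselves).
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Title: the honorific between the surname and the dot in the Name column,
# e.g. "Braund, Mr. Owen Harris" -> "Mr".
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

print(df[["Name", "Title", "FamilySize"]].head())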
Types of Models Used:

The Titanic survival prediction is a binary classification problem, where several machine
learning algorithms can be employed:
Logistic Regression: A simple and interpretable model for binary classification.
Decision Trees: A non-linear model that splits the data based on feature values. Often
prone to overfitting.
Random Forest: An ensemble of decision trees that reduces overfitting by averaging
multiple trees.
Support Vector Machines (SVMs): A classifier that finds the optimal hyperplane to
separate classes.
K-Nearest Neighbors (KNN): A simple algorithm that classifies based on the majority
class of the nearest neighbors.
Gradient Boosting (XGBoost, LightGBM, CatBoost): Ensemble methods that combine
multiple weak learners (decision trees) to improve accuracy.
Neural Networks: Deep learning models, though generally not necessary for this dataset’s
complexity, can still be used for more advanced approaches.
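
Before committing to one of these models, a quick cross-validated comparison is often useful. The sketch below is illustrative: it assumes the Kaggle train.csv is available and uses a deliberately minimal preprocessing, just enough to compare a few of the classifiers listed above.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Minimal preprocessing just for this comparison (assumes train.csv is present):
# keep a few numeric-friendly columns, encode Sex, and fill missing ages.
data = pd.read_csv("train.csv")
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Age"] = data["Age"].fillna(data["Age"].median())
X = data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = data["Survived"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# 5-fold cross-validation gives a more stable estimate than a single split.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")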
DATA SET
1. gender_submission

2. train

Exploratory Data Analysis (EDA)


1. Data Collection: Obtain the Titanic dataset, which is typically available as CSV files,
from sources like Kaggle or other open data repositories.

2. Data Preprocessing:

Handling Missing Data: Many columns, like Age and Cabin, have missing values that
need to be imputed or removed.

Categorical Encoding: Convert categorical features (like Sex and Embarked) into
numerical form using one-hot encoding or label encoding.

Feature Engineering: Create new features such as family size (SibSp + Parch) or extract
titles from the Name column (e.g., Mr, Mrs, Miss).

Scaling/Normalization: Some algorithms, like Logistic Regression or SVM, may benefit
from feature scaling, especially for continuous features (e.g., Age, Fare).

3. Model Selection: Choose machine learning models suitable for classification tasks
(e.g., Random Forest, Logistic Regression, Gradient Boosting).

4. Model Training: Train the model on the preprocessed training dataset using the
Survived column as the target.
5. Model Evaluation:

Accuracy: The proportion of correctly predicted survival statuses (survived or not).

Confusion Matrix: To better understand the classification performance, including false
positives and false negatives.

Precision, Recall, F1-Score: Especially important if the dataset is imbalanced (i.e., more
passengers did not survive).

ROC-AUC: The area under the Receiver Operating Characteristic curve, which is useful
for evaluating classification models, especially when dealing with class imbalance.
(A short sketch of these metrics, together with grid-search tuning, follows step 7 below.)

6. Model Tuning: Hyperparameter tuning using techniques such as grid search or random
search to improve model performance.

7. Prediction and Submission: After training and evaluating the model, predict the
survival status for the test dataset and prepare the output in the required format (usually a
CSV file with PassengerId and Survived predictions).
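
The sketch below illustrates steps 5 and 6 together: it evaluates a classifier with the metrics listed above and then runs a small grid search. It is illustrative only; it assumes the Kaggle train.csv is available and uses a minimal preprocessing chosen here for brevity, not the full pipeline from the source code later in this report.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Minimal preprocessing for illustration (assumes train.csv is present).
data = pd.read_csv("train.csv")
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Age"] = data["Age"].fillna(data["Age"].median())
X = data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = data["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 5: train one model and evaluate it with several metrics.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, pred))
print("Confusion matrix:\n", confusion_matrix(y_val, pred))
print(classification_report(y_val, pred))          # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Step 6: tune a few hyperparameters with grid search.
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 8, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)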

Challenges and Considerations:

Handling Missing Data: Many features (like Age and Cabin) have missing values. How
missing values are imputed (mean, median, mode, or other methods) can significantly
affect model performance.

Class Imbalance: If the dataset is imbalanced (e.g., many more passengers did not
survive), special care must be taken to avoid biased models that predict one class more
often than the other.

Feature Engineering: Some features (like Name) need to be parsed into useful
information (e.g., extracting titles like Mr, Mrs) to improve model performance.
Model Interpretability: While models like decision trees provide interpretability, more
complex models (e.g., neural networks or gradient boosting) may be harder to explain but
often offer higher accuracy.

Overfitting: Overfitting to the training data can happen, especially when the model is too
complex. Regularization methods (e.g., pruning decision trees, using ensemble methods)
can help mitigate overfitting.
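
One practical way to see overfitting is to compare training and validation accuracy for an unconstrained decision tree against a depth-limited one. The sketch below is illustrative and reuses the same minimal preprocessing as the earlier sketches; a large gap between training and validation accuracy for the deep tree signals overfitting.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Minimal preprocessing for illustration (assumes train.csv is present).
data = pd.read_csv("train.csv")
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Age"] = data["Age"].fillna(data["Age"].median())
X = data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = data["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# An unconstrained tree can memorize the training data; limiting depth
# (a simple form of pruning/regularization) usually generalizes better.
for name, tree in [("deep tree", DecisionTreeClassifier(random_state=0)),
                   ("pruned tree", DecisionTreeClassifier(max_depth=4, random_state=0))]:
    tree.fit(X_train, y_train)
    print(name,
          "| train acc:", round(tree.score(X_train, y_train), 3),
          "| val acc:", round(tree.score(X_val, y_val), 3))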

SOURCE CODE
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline
warnings.filterwarnings('ignore')

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# To know number of columns and rows


train.shape
# (891, 12)

train.info()

train.isnull().sum()

f, ax = plt.subplots(1, 2, figsize=(12, 4))


train['Survived'].value_counts().plot.pie(
explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=False)
ax[0].set_title('Survivors (1) and the dead (0)')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=train, ax=ax[1])
ax[1].set_ylabel('Quantity')
ax[1].set_title('Survivors (1) and the dead (0)')
plt.show()

f, ax = plt.subplots(1, 2, figsize=(12, 4))


train[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survivors by sex')
sns.countplot(x='Sex', hue='Survived', data=train, ax=ax[1])
ax[1].set_ylabel('Quantity')
ax[1].set_title('Survived (1) and deceased (0): men and women')
plt.show()
# Create a new column cabinbool indicating
# if the cabin value was given or was NaN
train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

# Delete the column 'Cabin' from test


# and train dataset
train = train.drop(['Cabin'], axis=1)
test = test.drop(['Cabin'], axis=1)

train = train.drop(['Ticket'], axis=1)


test = test.drop(['Ticket'], axis=1)

# replacing the missing values in


# the Embarked feature with S
train = train.fillna({"Embarked": "S"})

# sort the ages into logical categories


train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels=labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels=labels)

# create a combined group of both datasets


combine = [train, test]

# extract a title for each Name in the
# train and test datasets
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train['Title'], train['Sex'])

# replace various titles with more common names
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major',
                                                 'Rev', 'Jonkheer', 'Dona'],
                                                'Rare')
    dataset['Title'] = dataset['Title'].replace(
        ['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

# map each of the title groups to a numerical value
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3,
                 "Master": 4, "Royal": 5, "Rare": 6}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
mr_age = train[train["Title"] == 1]["AgeGroup"].mode() # Young Adult
miss_age = train[train["Title"] == 2]["AgeGroup"].mode() # Student
mrs_age = train[train["Title"] == 3]["AgeGroup"].mode() # Adult
master_age = train[train["Title"] == 4]["AgeGroup"].mode() # Baby
royal_age = train[train["Title"] == 5]["AgeGroup"].mode() # Adult
rare_age = train[train["Title"] == 6]["AgeGroup"].mode() # Adult

age_title_mapping = {1: "Young Adult", 2: "Student",
                     3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}

# fill unknown age groups using the typical age group for each title
for x in range(len(train["AgeGroup"])):
    if train["AgeGroup"][x] == "Unknown":
        train.loc[x, "AgeGroup"] = age_title_mapping[train["Title"][x]]
for x in range(len(test["AgeGroup"])):
    if test["AgeGroup"][x] == "Unknown":
        test.loc[x, "AgeGroup"] = age_title_mapping[test["Title"][x]]

# map each Age value to a numerical value


age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3,
'Student': 4, 'Young Adult': 5, 'Adult': 6,
'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)

train.head()

# dropping the Age feature for now, might change


train = train.drop(['Age'], axis=1)
test = test.drop(['Age'], axis=1)

train = train.drop(['Name'], axis=1)


test = test.drop(['Name'], axis=1)

sex_mapping = {"male": 0, "female": 1}


train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)

embarked_mapping = {"S": 1, "C": 2, "Q": 3}


train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)

# fill the missing Fare value in the test set with the mean fare of that Pclass
for x in range(len(test["Fare"])):
    if pd.isnull(test["Fare"][x]):
        pclass = test["Pclass"][x]  # Pclass = 3
        test.loc[x, "Fare"] = round(
            train[train["Pclass"] == pclass]["Fare"].mean(), 4)

# map Fare values into groups of


# numerical values
train['FareBand'] = pd.qcut(train['Fare'], 4,
labels=[1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4,
labels=[1, 2, 3, 4])

# drop Fare values


train = train.drop(['Fare'], axis=1)
test = test.drop(['Fare'], axis=1)

from sklearn.model_selection import train_test_split

# Drop the Survived and PassengerId


# column from the trainset
predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(
predictors, target, test_size=0.2, random_state=0)

from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import accuracy_score

randomforest = RandomForestClassifier()

# Fit the training data along with its output


randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)

# Find the accuracy score of the model


acc_randomforest = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_randomforest)

ids = test['PassengerId']
predictions = randomforest.predict(test.drop('PassengerId', axis=1))

# set the output as a dataframe and convert


# to csv file named resultfile.csv
output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
output.to_csv('resultfile.csv', index=False)
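
Beyond the single accuracy figure above, the validation split can also be inspected with a confusion matrix and a per-class report. The cell below is a small optional addition, not part of the original code: it assumes it runs in the same notebook, after the cells above, so that randomforest, x_val and y_val are already defined.

# Optional follow-up cell: inspect the validation predictions in more detail.
from sklearn.metrics import confusion_matrix, classification_report

val_pred = randomforest.predict(x_val)

# Rows = actual class (0 = did not survive, 1 = survived), columns = predicted class.
print(confusion_matrix(y_val, val_pred))

# Precision, recall and F1-score for each class.
print(classification_report(y_val, val_pred))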
Prediction

We are provided with the testing dataset on which we have to perform the
prediction. To predict, we pass the test dataset into our trained model and
save the result into a CSV file with two columns: PassengerId and Survived.
PassengerId is the identifier of each passenger in the test data, and the
Survived column will be either 0 or 1.

PassengerId  Survived
892 0
893 1
894 0
895 0
896 1
897 0
898 1
899 0
900 1
901 0
902 0
903 0
904 1
905 0
906 1
907 1
908 0
909 0
910 1
911 1
912 0
913 0
914 1
915 0
916 1
917 0
918 1
919 0
920 0
921 0
922 0
923 0
924 1
925 1
926 0
927 0
928 1
929 1
930 0
931 0
932 0
933 0
934 0
935 1
936 1
937 0
938 0
939 0
940 1
941 1
942 0
943 0
944 1
945 1
946 0
947 0
948 0
949 0
950 0
951 1
952 0
953 0
954 0
955 1
956 0
957 1
958 1
959 0
960 0
961 1
962 1
963 0
964 1
965 0
966 1
967 0
968 0
969 1
970 0
971 1
972 0
973 0
974 0
975 0
976 0
977 0
978 1
979 1
980 1
981 0
982 1
983 0
984 1
985 0
986 0
987 0
988 1
989 0
990 1
991 0
992 1
993 0
994 0
995 0
996 1
997 0
998 0
999 0
1000 0
1001 0
1002 0
1003 1
1004 1
1005 1
1006 1
1007 0
1008 0
1009 1
1010 0
1011 1
1012 1
1013 0
1014 1
1015 0
1016 0
1017 1
1018 0
1019 1
1020 0
1021 0
1022 0
1023 0
1024 1
1025 0
1026 0
1027 0
1028 0
1029 0
1030 1
1031 0
1032 1
1033 1
1034 0
1035 0
1036 0
1037 0
1038 0
1039 0
1040 0
1041 0
1042 1
1043 0
1044 0
1045 1
1046 0
1047 0
1048 1
1049 1
1050 0
1051 1
1052 1
1053 0
1054 1
1055 0
1056 0
1057 1
1058 0
1059 0
1060 1
1061 1
1062 0
1063 0
1064 0
1065 0
1066 0
1067 1
1068 1
1069 0
1070 1
1071 1
1072 0
1073 0
1074 1
1075 0
1076 1
1077 0
1078 1
1079 0
1080 1
1081 0
1082 0
1083 0
1084 0
1085 0
1086 0
1087 0
1088 0
1089 1
1090 0
1091 1
1092 1
1093 0
1094 0
1095 1
1096 0
1097 0
1098 1
1099 0
1100 1
1101 0
1102 0
1103 0
1104 0
1105 1
1106 1
1107 0
1108 1
1109 0
1110 1
1111 0
1112 1
1113 0
1114 1
1115 0
1116 1
1117 1
1118 0
1119 1
1120 0
1121 0
1122 0
1123 1
1124 0
1125 0
1126 0
1127 0
1128 0
1129 0
1130 1
1131 1
1132 1
1133 1
1134 0
1135 0
1136 0
1137 0
1138 1
1139 0
1140 1
1141 1
1142 1
1143 0
1144 0
1145 0
1146 0
1147 0
1148 0
1149 0
1150 1
1151 0
1152 0
1153 0
1154 1
1155 1
1156 0
1157 0
1158 0
1159 0
1160 1
1161 0
1162 0
1163 0
1164 1
1165 1
1166 0
1167 1
1168 0
1169 0
1170 0
1171 0
1172 1
1173 0
1174 1
1175 1
1176 1
1177 0
1178 0
1179 0
1180 0
1181 0
1182 0
1183 1
1184 0
1185 0
1186 0
1187 0
1188 1
1189 0
1190 0
1191 0
1192 0
1193 0
1194 0
1195 0
1196 1
1197 1
1198 0
1199 0
1200 0
1201 1
1202 0
1203 0
1204 0
1205 1
1206 1
1207 1
1208 0
1209 0
1210 0
1211 0
1212 0
1213 0
1214 0
1215 0
1216 1
1217 0
1218 1
1219 0
1220 0
1221 0
1222 1
1223 0
1224 0
1225 1
1226 0
1227 0
1228 0
1229 0
1230 0
1231 0
1232 0
1233 0
1234 0
1235 1
1236 0
1237 1
1238 0
1239 1
1240 0
1241 1
1242 1
1243 0
1244 0
1245 0
1246 1
1247 0
1248 1
1249 0
1250 0
1251 1
1252 0
1253 1
1254 1
1255 0
1256 1
1257 1
1258 0
1259 1
1260 1
1261 0
1262 0
1263 1
1264 0
1265 0
1266 1
1267 1
1268 1
1269 0
1270 0
1271 0
1272 0
1273 0
1274 1
1275 1
1276 0
1277 1
1278 0
1279 0
1280 0
1281 0
1282 0
1283 1
1284 0
1285 0
1286 0
1287 1
1288 0
1289 1
1290 0
1291 0
1292 1
1293 0
1294 1
1295 0
1296 0
1297 0
1298 0
1299 0
1300 1
1301 1
1302 1
1303 1
1304 1
1305 0
1306 1
1307 0
1308 0
1309 0
