PROJECT REPORT ON
Titanic Survival Prediction Using Machine Learning
Submitted to :
Mrs. Priya P, B.Sc., M.C.A., B.Ed., M.Phil.
PGT(CS)
Emerald Valley Public School,
Salem – 636008
Tamilnadu
EMERALD VALLEY PUBLIC SCHOOL
CERTIFICATE
Name : Priya P
Signature :
Date :
ACKNOWLEDGEMENT
First and foremost, I owe my wholehearted thanks to my parents for their love,
encouragement and moral support for completing this project.
I sincerely appreciate our Principal Mr. K. Manimaran for permitting access to the well-
equipped lab and the resources required for the project.
The encouragement from my teacher, principal and friends was invaluable. I will always be grateful for their support.
1 PROBLEM DEFINITION
2 REQUIREMENTS
3 INTRODUCTION
4 DATA SET
5 SOURCE CODE
6 PREDICTION
7 VISUALIZATION
8 BIBLIOGRAPHY
TITANIC SURVIVAL PREDICTION
USING MACHINE LEARNING
PROBLEM DEFINITION
The aim of this project is to predict whether a passenger survived the sinking of the Titanic using machine learning. Given passenger attributes such as age, sex, ticket class and fare from the Kaggle Titanic data set, the task is to build a binary classification model that outputs a survival status (1 = survived, 0 = did not survive) for every passenger in the test data.
REQUIREMENTS
HARDWARE REQUIRED
Printer, to print the required documents of the project
Drive
Processor : Intel i5
RAM : 4 GB and above
Hard Disk : 1 TB
SOFTWARE REQUIRED
Operating System : Windows 11
Jupyter Notebook
Python
Visual Studio Code
MS Word (for preparing and presenting the project)
INTRODUCTION
Introduction to Machine Learning
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables systems to
automatically learn and improve from experience without being explicitly programmed.
In other words, machine learning allows computers to identify patterns, make decisions,
and improve performance through exposure to data, rather than following strict, pre-
defined rules.
At its core, machine learning involves the development of algorithms that can analyze
data, learn from it, and then make predictions or decisions based on that learning.
Machine learning has become a cornerstone of AI, and it is used in a variety of fields
such as healthcare, finance, e-commerce, entertainment, and more.
The key components of machine learning are:
1. Data: The foundation of machine learning. It consists of input features (also called
variables or attributes) and labels (the outcome you want to predict). The quality
and quantity of data are critical for the success of machine learning models.
2. Algorithms: Machine learning algorithms are the methods used to find patterns in
the data. There are different types of algorithms based on the problem you're trying
to solve.
3. Model: A model is the output of a machine learning algorithm after it has been
trained on data. It represents the patterns or relationships the algorithm has learned
and is used to make predictions or decisions on new data.
4. Training: The process of feeding data into a machine learning algorithm to help it
learn the underlying patterns.
5. Testing: After a model has been trained, it is evaluated on new, unseen data (test
data) to assess its performance and ability to generalize to real-world situations.
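To make these components concrete, here is a minimal sketch of the train/test workflow, shown on scikit-learn's small built-in iris data set rather than the Titanic data (the classifier chosen here is only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)           # 1. Data: features X, labels y

# Hold out 25% of the rows as unseen test data (5. Testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier()            # 2. Algorithm
model.fit(X_train, y_train)                 # 4. Training yields 3. the Model

print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on unseen data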
The main types of machine learning are:
1. Supervised Learning:
o In supervised learning, the algorithm is trained on a labeled dataset, meaning
the input data is paired with the correct output (or label).
o The goal is for the model to learn a mapping from inputs to outputs, so it can
predict the output for new, unseen inputs.
o Examples: Classification (e.g., spam email detection) and Regression (e.g.,
predicting house prices based on features like size, location, etc.).
2. Unsupervised Learning:
o In unsupervised learning, the algorithm is given data without labels and
must find the underlying structure or patterns in the data on its own.
o The goal is often to discover hidden structures like clusters or associations in
the data.
o Examples: Clustering (e.g., grouping similar customers based on purchasing
behavior) and Dimensionality Reduction (e.g., reducing the number of
variables in a dataset).
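A small illustration of the difference: the supervised sketch above learned from labels, whereas an unsupervised algorithm such as k-means receives only the features and must group them on its own (the customer numbers below are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: [annual spend, purchases per year] for six customers
X = np.array([[500, 3], [520, 4], [80, 30], [90, 28], [100, 33], [480, 2]])

# k-means must discover the two customer groups itself - no labels given
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [1 1 0 0 0 1]: big spenders vs frequent buyers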
Choice of Algorithms
The Titanic survival prediction is a binary classification problem, for which several machine
learning algorithms can be employed:
Logistic Regression: A simple and interpretable model for binary classification.
Decision Trees: A non-linear model that splits the data based on feature values. Often
prone to overfitting.
Random Forest: An ensemble of decision trees that reduces overfitting by averaging
multiple trees.
Support Vector Machines (SVMs): A classifier that finds the optimal hyperplane to
separate classes.
K-Nearest Neighbors (KNN): A simple algorithm that classifies based on the majority
class of the nearest neighbors.
Gradient Boosting (XGBoost, LightGBM, CatBoost): Ensemble methods that combine
multiple weak learners (decision trees) to improve accuracy.
Neural Networks: Deep learning models, though generally not necessary for this dataset’s
complexity, can still be used for more advanced approaches.
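In practice these candidates can be compared by cross-validating each on the same data. A minimal sketch, assuming train has already been loaded and numerically encoded as in the SOURCE CODE section below:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Assumes `train` is preprocessed (numeric columns, no missing values)
X = train.drop(['Survived', 'PassengerId'], axis=1)
y = train['Survived']

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
}

# 5-fold cross-validated accuracy for each candidate
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f}')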
DATA SET
1. gender_submission
2. train
1. Data Exploration: Load the data set and examine the features, their types, and their missing values.
2. Data Preprocessing: Handling Missing Data: Many columns, like Age and Cabin, have
missing values that need to be imputed or removed.
Categorical Encoding: Convert categorical features (like Sex and Embarked) into
numerical form using one-hot encoding or label encoding.
Feature Engineering: Create new features such as family size (SibSp + Parch) or extract
titles from the Name column (e.g., Mr, Mrs, Miss).
3. Model Selection: Choose machine learning models suitable for classification tasks
(e.g., Random Forest, Logistic Regression, Gradient Boosting).
4. Model Training: Train the model on the preprocessed training dataset using the
Survived column as the target.
5. Model Evaluation: Accuracy: The proportion of correctly predicted survival statuses
(survived or not).
Precision, Recall, F1-Score: Especially important if the dataset is imbalanced (i.e., more
passengers did not survive).
ROC-AUC: The area under the Receiver Operating Characteristic curve, which is useful
for evaluating classification models, especially when dealing with class imbalance. (A
sketch of these metrics, together with the tuning in step 6, follows this list.)
6. Model Tuning: Hyperparameter tuning using techniques such as grid search or random
search to improve model performance.
7. Prediction and Submission: After training and evaluating the model, predict the
survival status for the test dataset and prepare the output in the required format (usually a
CSV file with PassengerId and Survived predictions).
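As referenced in step 5, the following is a hedged sketch of evaluating and then tuning a model on a held-out split of the training data (assuming X and y prepared as in the earlier sketch):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Step 5: hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_val)

print('Accuracy :', accuracy_score(y_val, y_pred))
print('Precision:', precision_score(y_val, y_pred))
print('Recall   :', recall_score(y_val, y_pred))
print('F1-score :', f1_score(y_val, y_pred))
print('ROC-AUC  :', roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Step 6: grid search over a few hyperparameters with 5-fold cross-validation
param_grid = {'n_estimators': [100, 300], 'max_depth': [4, 8, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)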
Handling Missing Data: Many features (like Age and Cabin) have missing values. How
missing values are imputed (mean, median, mode, or other methods) can significantly
affect model performance.
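For instance, a short sketch of the usual imputation choices on these Titanic columns:

# Numerical column: the median is robust to the skew in Age
train['Age'] = train['Age'].fillna(train['Age'].median())

# Categorical column: the mode (most frequent port) for Embarked
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

# Cabin is mostly missing, so it is often dropped entirely
train = train.drop('Cabin', axis=1)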
Class Imbalance: If the dataset is imbalanced (e.g., many more passengers did not
survive), special care must be taken to avoid biased models that predict one class more
often than the other.
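One simple remedy, sketched here, is to reweight the classes so that errors on the minority class cost more; scikit-learn's class_weight='balanced' option does exactly that:

from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency in y
logreg = LogisticRegression(max_iter=1000, class_weight='balanced')
logreg.fit(X, y)  # assumes X, y prepared as in the earlier sketches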
Feature Engineering: Some features (like Name) need to be parsed into useful
information (e.g., extracting titles like Mr, Mrs) to improve model performance.
Model Interpretability: While models like decision trees provide interpretability, more
complex models (e.g., neural networks or gradient boosting) may be harder to explain but
often offer higher accuracy.
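Tree ensembles do expose which inputs mattered; a sketch of inspecting them, assuming `model` is the fitted RandomForestClassifier from the evaluation sketch above:

import pandas as pd

# Mean decrease in impurity contributed by each input feature, sorted
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))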
Overfitting: Overfitting to the training data can happen, especially when the model is too
complex. Regularization methods (e.g., pruning decision trees, using ensemble methods)
can help mitigate overfitting.
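The symptom is a gap between training and validation accuracy; a brief sketch showing how constraining tree depth narrows that gap (using the X_train/X_val split from the evaluation sketch):

from sklearn.tree import DecisionTreeClassifier

for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f'max_depth={depth}: train={tree.score(X_train, y_train):.3f}, '
          f'val={tree.score(X_val, y_val):.3f}')
# An unconstrained tree typically fits the training data almost perfectly but
# scores lower on validation; max_depth=3 trades training fit for generalization.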
SOURCE CODE
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier

plt.style.use('fivethirtyeight')
%matplotlib inline
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.info()
train.isnull().sum()
# Extract the title (Mr, Mrs, Miss, ...) from the Name column
for dataset in [train, test]:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train['Title'], train['Sex'])

# Normalise rare and French title variants in both data sets
for dataset in [train, test]:
    dataset['Title'] = dataset['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train.head()
# The test set has one missing Fare; impute it with the mean fare of its Pclass
for x in range(len(test['Fare'])):
    if pd.isnull(test['Fare'][x]):
        pclass = test['Pclass'][x]  # Pclass = 3
        test.loc[x, 'Fare'] = round(train[train['Pclass'] == pclass]['Fare'].mean(), 4)
# Encode remaining categorical and missing values so the forest can train
# (a minimal completion: these encodings are one simple, common choice)
age_median = train['Age'].median()
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Royal': 5}
for dataset in [train, test]:
    dataset['Sex'] = dataset['Sex'].map({'male': 0, 'female': 1})
    dataset['Embarked'] = dataset['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})
    dataset['Age'] = dataset['Age'].fillna(age_median)
    dataset['Title'] = dataset['Title'].map(title_map).fillna(0).astype(int)
    dataset.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Fit the random forest on the training features and the Survived target
randomforest = RandomForestClassifier()
randomforest.fit(train.drop(['Survived', 'PassengerId'], axis=1), train['Survived'])

ids = test['PassengerId']
predictions = randomforest.predict(test.drop('PassengerId', axis=1))

# Save the predictions in the required submission format
output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
PREDICTION
To predict, we pass the test dataset into our trained model and save the output into a CSV
file with two columns: PassengerId, the id of each passenger in the test data, and Survived,
the model's predicted survival status (1 = survived, 0 = did not survive).

PassengerId Survived
892 0
893 1
894 0
895 0
896 1
897 0
898 1
899 0
900 1
901 0
902 0
903 0
904 1
905 0
906 1
907 1
908 0
909 0
910 1
911 1
912 0
913 0
914 1
915 0
916 1
917 0
918 1
919 0
920 0
921 0
922 0
923 0
924 1
925 1
926 0
927 0
928 1
929 1
930 0
931 0
932 0
933 0
934 0
935 1
936 1
937 0
938 0
939 0
940 1
941 1
942 0
943 0
944 1
945 1
946 0
947 0
948 0
949 0
950 0
951 1
952 0
953 0
954 0
955 1
956 0
957 1
958 1
959 0
960 0
961 1
962 1
963 0
964 1
965 0
966 1
967 0
968 0
969 1
970 0
971 1
972 0
973 0
974 0
975 0
976 0
977 0
978 1
979 1
980 1
981 0
982 1
983 0
984 1
985 0
986 0
987 0
988 1
989 0
990 1
991 0
992 1
993 0
994 0
995 0
996 1
997 0
998 0
999 0
1000 0
1001 0
1002 0
1003 1
1004 1
1005 1
1006 1
1007 0
1008 0
1009 1
1010 0
1011 1
1012 1
1013 0
1014 1
1015 0
1016 0
1017 1
1018 0
1019 1
1020 0
1021 0
1022 0
1023 0
1024 1
1025 0
1026 0
1027 0
1028 0
1029 0
1030 1
1031 0
1032 1
1033 1
1034 0
1035 0
1036 0
1037 0
1038 0
1039 0
1040 0
1041 0
1042 1
1043 0
1044 0
1045 1
1046 0
1047 0
1048 1
1049 1
1050 0
1051 1
1052 1
1053 0
1054 1
1055 0
1056 0
1057 1
1058 0
1059 0
1060 1
1061 1
1062 0
1063 0
1064 0
1065 0
1066 0
1067 1
1068 1
1069 0
1070 1
1071 1
1072 0
1073 0
1074 1
1075 0
1076 1
1077 0
1078 1
1079 0
1080 1
1081 0
1082 0
1083 0
1084 0
1085 0
1086 0
1087 0
1088 0
1089 1
1090 0
1091 1
1092 1
1093 0
1094 0
1095 1
1096 0
1097 0
1098 1
1099 0
1100 1
1101 0
1102 0
1103 0
1104 0
1105 1
1106 1
1107 0
1108 1
1109 0
1110 1
1111 0
1112 1
1113 0
1114 1
1115 0
1116 1
1117 1
1118 0
1119 1
1120 0
1121 0
1122 0
1123 1
1124 0
1125 0
1126 0
1127 0
1128 0
1129 0
1130 1
1131 1
1132 1
1133 1
1134 0
1135 0
1136 0
1137 0
1138 1
1139 0
1140 1
1141 1
1142 1
1143 0
1144 0
1145 0
1146 0
1147 0
1148 0
1149 0
1150 1
1151 0
1152 0
1153 0
1154 1
1155 1
1156 0
1157 0
1158 0
1159 0
1160 1
1161 0
1162 0
1163 0
1164 1
1165 1
1166 0
1167 1
1168 0
1169 0
1170 0
1171 0
1172 1
1173 0
1174 1
1175 1
1176 1
1177 0
1178 0
1179 0
1180 0
1181 0
1182 0
1183 1
1184 0
1185 0
1186 0
1187 0
1188 1
1189 0
1190 0
1191 0
1192 0
1193 0
1194 0
1195 0
1196 1
1197 1
1198 0
1199 0
1200 0
1201 1
1202 0
1203 0
1204 0
1205 1
1206 1
1207 1
1208 0
1209 0
1210 0
1211 0
1212 0
1213 0
1214 0
1215 0
1216 1
1217 0
1218 1
1219 0
1220 0
1221 0
1222 1
1223 0
1224 0
1225 1
1226 0
1227 0
1228 0
1229 0
1230 0
1231 0
1232 0
1233 0
1234 0
1235 1
1236 0
1237 1
1238 0
1239 1
1240 0
1241 1
1242 1
1243 0
1244 0
1245 0
1246 1
1247 0
1248 1
1249 0
1250 0
1251 1
1252 0
1253 1
1254 1
1255 0
1256 1
1257 1
1258 0
1259 1
1260 1
1261 0
1262 0
1263 1
1264 0
1265 0
1266 1
1267 1
1268 1
1269 0
1270 0
1271 0
1272 0
1273 0
1274 1
1275 1
1276 0
1277 1
1278 0
1279 0
1280 0
1281 0
1282 0
1283 1
1284 0
1285 0
1286 0
1287 1
1288 0
1289 1
1290 0
1291 0
1292 1
1293 0
1294 1
1295 0
1296 0
1297 0
1298 0
1299 0
1300 1
1301 1
1302 1
1303 1
1304 1
1305 0
1306 1
1307 0
1308 0
1309 0