ml dataset performance
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
[11]: #check for duplicates
print(f"Duplicates:{df.duplicated().sum()}")
df.drop_duplicates(inplace=True)
Duplicates:0
PassengerId 0
Survived 0
PassengerClass 0
Name 0
Sex 0
Age 177
SiblingSpouses 0
Parch 0
Fare 0
Embarked 2
dtype: int64
C:\Users\alish\AppData\Local\Temp\ipykernel_18424\1672961352.py:2:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.
df['Age'].fillna(df['Age'].median(), inplace=True)
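The warning above comes from calling `fillna(..., inplace=True)` on `df['Age']`, which is an intermediate object rather than the frame itself. A minimal sketch of the usual fix is to assign the filled column back to `df`. The toy frame and its values are illustrative only, and filling `Embarked` with its most frequent value is an assumption, since the notebook does not show how those two gaps were filled.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Assign the result back to the column instead of using inplace=True
# on df['Age'], which would act on an intermediate object.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Assumption: fill the categorical column with its most frequent value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Every column should now report zero missing values.
print(df.isnull().sum())
```

Assigning back works the same way before and after pandas 3.0, so the cell does not need to change when the behavior described in the warning takes effect.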
PassengerId Survived PassengerClass \
1 2 1 1
3 4 1 1
4 5 0 3
6 7 0 1
11 12 1 1
[17]: #Sorting
#sort passenger by age
df.sort_values(by='Age',ascending=False,inplace=True)
print(df.head())
Fare Embarked
630 30.0000 S
851 7.7750 S
493 49.5042 C
96 34.6542 C
116 7.7500 Q
[19]: #Aggregation
#group by passengerClass and find the average fare
avg_fare=df.groupby('PassengerClass')['Fare'].mean()
print(avg_fare)
PassengerClass
1 84.193516
2 20.662183
3 13.675550
Name: Fare, dtype: float64
Fare Embarked
630 30.0000 S
851 7.7750 S
493 49.5042 C
96 34.6542 C
116 7.7500 Q
672 10.5000 S
745 71.0000 S
33 10.5000 S
456 26.5500 S
54 61.9792 C
[22]: PassengerId 0
Survived 0
PassengerClass 0
Name 0
Sex 0
Age 0
SiblingSpouses 0
Parch 0
Fare 0
Embarked 0
dtype: int64
[23]: #rename
df.rename(columns={'Pclass':'PassengerClass','SibSp':'SiblingSpouses'},inplace=True)
[24]: df
116 Connors, Mr. Patrick male 70.50 0
.. … … … …
831 Richards, Master. George Sibley male 0.83 1
644 Baclini, Miss. Eugenie female 0.75 2
469 Baclini, Miss. Helene Barbara female 0.75 2
755 Hamalainen, Master. Viljo male 0.67 1
803 Thomas, Master. Assad Alexander male 0.42 0
[30]: #select the Name,Age and Fare columns and display first 5 rows
selected_columns=df[['Name','Age','Fare']]
print(selected_columns.head())
.. … … …
867 868 0 1
215 216 1 1
671 672 0 1
318 319 1 1
690 691 1 1
[37]: #sort the dataset by fare in descending order and display the top 10 passengers
s_passenger=df.sort_values(by='Fare',ascending=False)
T_passenger=s_passenger.head(10)
print(T_passenger)
311 312 1 1
742 743 1 1
299 300 1 1
[43]: #find the survival rate for each PassengerClass
survival_rate=df.groupby('PassengerClass')['Survived'].mean()
print("survival rate by PassengerClass:",survival_rate)
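Because `Survived` is coded as 0 or 1, the group mean is the fraction of survivors in each class. A small sketch on made-up rows (not the Titanic values) shows why the mean can be read as a rate:

```python
import pandas as pd

# Made-up rows, only to illustrate the calculation.
toy = pd.DataFrame({
    "PassengerClass": [1, 1, 2, 3, 3, 3],
    "Survived":       [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column = survivors / passengers within each group.
rate = toy.groupby("PassengerClass")["Survived"].mean()

# Same quantity computed explicitly, for comparison.
explicit = (toy.groupby("PassengerClass")["Survived"].sum()
            / toy.groupby("PassengerClass")["Survived"].count())

print(rate)
print(rate.equals(explicit))
```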
[50]: #remove duplicates if they exist and verify the total number of rows
df=df.drop_duplicates()
print(f"Total number of rows after removing duplicates: {len(df)}")
[54]: #add a new column FamilySize by summing the SiblingSpouses and Parch columns
df['FamilySize'] = df['SiblingSpouses'] + df['Parch']
[56]: #create a new column IsAlone: True if FamilySize is 0, otherwise False
df['IsAlone']=df['FamilySize']==0
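The two derived columns can be checked on a few hand-made rows. This is a sketch with invented values. Note that `FamilySize` as defined here counts only relatives on board, not the passenger, so a passenger travelling alone has `FamilySize` 0.

```python
import pandas as pd

# Invented rows for checking the derived columns by hand.
toy = pd.DataFrame({
    "SiblingSpouses": [1, 0, 0, 3],
    "Parch":          [0, 0, 2, 1],
})

# Relatives on board: siblings/spouses plus parents/children.
toy["FamilySize"] = toy["SiblingSpouses"] + toy["Parch"]

# The comparison yields a boolean column directly; no loop or apply needed.
toy["IsAlone"] = toy["FamilySize"] == 0

print(toy)
```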
[57]: #save the cleaned dataset to a new csv file named cleaned_titanic.csv
df.to_csv('cleaned_titanic.csv',index=False)
print("Dataset cleaned and saved as 'cleaned_titanic.csv'.")
Dataset cleaned and saved as 'cleaned_titanic.csv'.