ml dataset performance

Uploaded by

Rutuja Jadhav

[1]: import pandas as pd

[6]: #1.load the dataset


df=pd.read_csv("titanic.csv")

[8]: # 2. Inspect the data


# Print the first few rows
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                              Name     Sex   Age  SibSp  \
0                          Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1
2                           Heikkinen, Miss. Laina  female  26.0      0
3     Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                         Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare  Cabin  Embarked
0      0         A/5 21171   7.2500    NaN         S
1      0          PC 17599  71.2833    C85         C
2      0  STON/O2. 3101282   7.9250    NaN         S
3      0            113803  53.1000   C123         S
4      0            373450   8.0500    NaN         S

[4]: #To print summary of dataset


print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

[5]: #to print summary statistics


print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200

[9]: # 3. Clean the data


# Rename the columns
df.rename(columns={'Pclass': 'PassengerClass', 'SibSp': 'SiblingSpouses'}, inplace=True)

[10]: # Drop the unnecessary columns


df.drop(columns=['Cabin', 'Ticket'], inplace=True)

[11]: #check for duplicates
print(f"Duplicates:{df.duplicated().sum()}")
df.drop_duplicates(inplace=True)

Duplicates:0

[12]: #4.handling missing values


#check for missing values
print(df.isnull().sum())

PassengerId 0
Survived 0
PassengerClass 0
Name 0
Sex 0
Age 177
SiblingSpouses 0
Parch 0
Fare 0
Embarked 2
dtype: int64
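
Raw counts are easier to judge as a share of all rows. From the figures above, Age is missing in 177 of 891 rows (about 19.9%) and Embarked in 2 of 891 (about 0.2%), which is one reason to fill the first and simply drop the second. A minimal sketch of computing that share with isna().mean(), on a small made-up frame (the toy data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data, not rows from titanic.csv
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() gives True/False per cell; the mean of booleans is the share of True
missing_share = toy.isna().mean()
print(missing_share)
```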

[14]: # Fill missing 'Age' with the median


# Assign the result back rather than calling fillna(..., inplace=True) on
# df['Age']: that chained form raises a FutureWarning in pandas 2.x and,
# per the warning text, stops modifying df in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].median())
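
Calling fillna(..., inplace=True) on df['Age'] is chained assignment, which pandas 2.x flags with a FutureWarning. A minimal sketch of the two forms the pandas message suggests instead, on a small made-up frame (the toy data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data, not rows from titanic.csv
toy = pd.DataFrame({"Age": [22.0, None, 38.0, None, 26.0]})

# Form 1: assign the filled column back to the frame
filled_a = toy.copy()
filled_a["Age"] = filled_a["Age"].fillna(filled_a["Age"].median())

# Form 2: pass a {column: value} mapping to DataFrame.fillna
filled_b = toy.fillna({"Age": toy["Age"].median()})

print(filled_a["Age"].tolist())
print(filled_b.equals(filled_a))
```

Both forms operate on the frame itself, so neither depends on whether df['Age'] is a view or a copy.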

[15]: # Drop rows with missing 'Embarked'


df.dropna(subset=['Embarked'], inplace=True)

[16]: # 5. Perform basic DataFrame operations


# Selecting and filtering
# Select passengers older than 30
older_passenger=df[df['Age']>30]
print(older_passenger.head())

PassengerId Survived PassengerClass \
1 2 1 1
3 4 1 1
4 5 0 3
6 7 0 1
11 12 1 1

Name Sex Age \


1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
4 Allen, Mr. William Henry male 35.0
6 McCarthy, Mr. Timothy J male 54.0
11 Bonnell, Miss. Elizabeth female 58.0

SiblingSpouses Parch Fare Embarked


1 1 0 71.2833 C
3 1 0 53.1000 S
4 0 0 8.0500 S
6 0 0 51.8625 S
11 0 0 26.5500 S

[17]: # Sorting
# Sort passengers by age, oldest first
df.sort_values(by='Age',ascending=False,inplace=True)
print(df.head())

PassengerId Survived PassengerClass \


630 631 1 1
851 852 0 3
493 494 0 1
96 97 0 1
116 117 0 3

Name Sex Age SiblingSpouses Parch \


630 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0
851 Svensson, Mr. Johan male 74.0 0 0
493 Artagaveytia, Mr. Ramon male 71.0 0 0
96 Goldschmidt, Mr. George B male 71.0 0 0
116 Connors, Mr. Patrick male 70.5 0 0

Fare Embarked
630 30.0000 S
851 7.7750 S
493 49.5042 C
96 34.6542 C
116 7.7500 Q

[19]: #Aggregation
#group by passengerClass and find the average fare
avg_fare=df.groupby('PassengerClass')['Fare'].mean()
print(avg_fare)

PassengerClass
1 84.193516
2 20.662183
3 13.675550
Name: Fare, dtype: float64
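
Fare is heavily skewed (the summary statistics above show a median of 14.45 against a maximum of 512.33), so a group mean on its own can be pulled up by a few expensive tickets. A minimal sketch of asking groupby for several statistics at once with agg, on made-up numbers (the toy data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data; one very large fare in class 1
toy = pd.DataFrame({
    "PassengerClass": [1, 1, 1, 3, 3],
    "Fare": [30.0, 50.0, 500.0, 7.0, 9.0],
})

# One row per class, one column per statistic
fare_stats = toy.groupby("PassengerClass")["Fare"].agg(["mean", "median", "count"])
print(fare_stats)
```

In this toy frame the class 1 mean is far above its median, which is the pattern to look for in the real data.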

[20]: # Display the first 10 rows


df.head(10)

[20]: PassengerId Survived PassengerClass \


630 631 1 1
851 852 0 3
493 494 0 1
96 97 0 1
116 117 0 3
672 673 0 2
745 746 0 1
33 34 0 2
456 457 0 1
54 55 0 1

Name Sex Age SiblingSpouses Parch \


630 Barkworth, Mr. Algernon Henry Wilson male 80.0 0 0
851 Svensson, Mr. Johan male 74.0 0 0
493 Artagaveytia, Mr. Ramon male 71.0 0 0
96 Goldschmidt, Mr. George B male 71.0 0 0
116 Connors, Mr. Patrick male 70.5 0 0
672 Mitchell, Mr. Henry Michael male 70.0 0 0
745 Crosby, Capt. Edward Gifford male 70.0 1 1
33 Wheadon, Mr. Edward H male 66.0 0 0
456 Millet, Mr. Francis Davis male 65.0 0 0
54 Ostby, Mr. Engelhart Cornelius male 65.0 0 1

Fare Embarked
630 30.0000 S
851 7.7750 S
493 49.5042 C
96 34.6542 C
116 7.7500 Q
672 10.5000 S
745 71.0000 S
33 10.5000 S
456 26.5500 S
54 61.9792 C

[21]: # Print the total number of rows and columns


df.shape

[21]: (889, 10)

[22]: #print missing values


df.isnull().sum()

[22]: PassengerId 0
Survived 0
PassengerClass 0
Name 0
Sex 0
Age 0
SiblingSpouses 0
Parch 0
Fare 0
Embarked 0
dtype: int64

[23]: # Rename (no effect here: 'Pclass' and 'SibSp' were already renamed in cell [9])
df.rename(columns={'Pclass': 'PassengerClass', 'SibSp': 'SiblingSpouses'}, inplace=True)

[24]: df

[24]: PassengerId Survived PassengerClass \


630 631 1 1
851 852 0 3
493 494 0 1
96 97 0 1
116 117 0 3
.. … … …
831 832 1 2
644 645 1 3
469 470 1 3
755 756 1 2
803 804 1 3

Name Sex Age SiblingSpouses \


630 Barkworth, Mr. Algernon Henry Wilson male 80.00 0
851 Svensson, Mr. Johan male 74.00 0
493 Artagaveytia, Mr. Ramon male 71.00 0
96 Goldschmidt, Mr. George B male 71.00 0
116 Connors, Mr. Patrick male 70.50 0
.. … … … …
831 Richards, Master. George Sibley male 0.83 1
644 Baclini, Miss. Eugenie female 0.75 2
469 Baclini, Miss. Helene Barbara female 0.75 2
755 Hamalainen, Master. Viljo male 0.67 1
803 Thomas, Master. Assad Alexander male 0.42 0

Parch Fare Embarked


630 0 30.0000 S
851 0 7.7750 S
493 0 49.5042 C
96 0 34.6542 C
116 0 7.7500 Q
.. … … …
831 1 18.7500 S
644 1 19.2583 C
469 1 19.2583 C
755 1 14.5000 S
803 1 8.5167 C

[889 rows x 10 columns]

[29]: # Drop rows with missing values in the Embarked column


df.dropna(subset=['Embarked'],inplace=True)

[30]: #select the Name,Age and Fare columns and display first 5 rows
selected_columns=df[['Name','Age','Fare']]
print(selected_columns.head())

Name Age Fare


630 Barkworth, Mr. Algernon Henry Wilson 80.0 30.0000
851 Svensson, Mr. Johan 74.0 7.7750
493 Artagaveytia, Mr. Ramon 71.0 49.5042
96 Goldschmidt, Mr. George B 71.0 34.6542
116 Connors, Mr. Patrick 70.5 7.7500

[34]: # Passengers aged above 30 who paid a fare greater than 50


passenger=df[(df['Age']>30)&(df['Fare']>50)]
print(passenger)

PassengerId Survived PassengerClass \


745 746 0 1
54 55 0 1
438 439 0 1
275 276 1 1
366 367 1 1
.. … … …
867 868 0 1
215 216 1 1
671 672 0 1
318 319 1 1
690 691 1 1

Name Sex Age \


745 Crosby, Capt. Edward Gifford male 70.0
54 Ostby, Mr. Engelhart Cornelius male 65.0
438 Fortune, Mr. Mark male 64.0
275 Andrews, Miss. Kornelia Theodosia female 63.0
366 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0
.. … … …
867 Roebling, Mr. Washington Augustus II male 31.0
215 Newell, Miss. Madeleine female 31.0
671 Davidson, Mr. Thornton male 31.0
318 Wick, Miss. Mary Natalie female 31.0
690 Dick, Mr. Albert Adrian male 31.0

SiblingSpouses Parch Fare Embarked


745 1 1 71.0000 S
54 0 1 61.9792 C
438 1 4 263.0000 S
275 1 0 77.9583 S
366 1 0 75.2500 C
.. … … … …
867 0 0 50.4958 S
215 1 0 113.2750 C
671 1 0 52.0000 S
318 0 2 164.8667 S
690 1 0 57.0000 S

[81 rows x 10 columns]
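
Each comparison in the filter produces a boolean Series, and & combines them element by element. The parentheses are required because & binds more tightly than > in Python. A minimal sketch on made-up rows, with DataFrame.query shown as an equivalent spelling (the toy data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data, not rows from titanic.csv
toy = pd.DataFrame({
    "Age": [25.0, 35.0, 64.0, 31.0],
    "Fare": [80.0, 8.05, 263.0, 52.0],
})

# Both conditions must hold for a row to be kept
mask = (toy["Age"] > 30) & (toy["Fare"] > 50)
by_mask = toy[mask]

# Same filter written as a query string
by_query = toy.query("Age > 30 and Fare > 50")

print(by_mask.index.tolist())
print(by_mask.equals(by_query))
```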

[37]: #sort the dataset by fare in descending order and display the top 10 passengers
s_passenger=df.sort_values(by='Fare',ascending=False)
T_passenger=s_passenger.head(10)
print(T_passenger)

PassengerId Survived PassengerClass \


737 738 1 1
258 259 1 1
679 680 1 1
27 28 0 1
341 342 1 1
88 89 1 1
438 439 0 1
311 312 1 1
742 743 1 1
299 300 1 1

Name Sex Age \


737 Lesurer, Mr. Gustave J male 35.0
258 Ward, Miss. Anna female 35.0
679 Cardeza, Mr. Thomas Drake Martinez male 36.0
27 Fortune, Mr. Charles Alexander male 19.0
341 Fortune, Miss. Alice Elizabeth female 24.0
88 Fortune, Miss. Mabel Helen female 23.0
438 Fortune, Mr. Mark male 64.0
311 Ryerson, Miss. Emily Borie female 18.0
742 Ryerson, Miss. Susan Parker "Suzette" female 21.0
299 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0

SiblingSpouses Parch Fare Embarked


737 0 0 512.3292 C
258 0 0 512.3292 C
679 0 1 512.3292 C
27 3 2 263.0000 S
341 3 2 263.0000 S
88 3 2 263.0000 S
438 1 4 263.0000 S
311 2 2 262.3750 C
742 2 2 262.3750 C
299 0 1 247.5208 C
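
Sorting the whole frame and then taking head(10) can also be written with DataFrame.nlargest. A minimal sketch on made-up rows without tied fares (the toy data is hypothetical; when fares are tied, as in the output above, the order among the tied rows may differ between the two approaches):

```python
import pandas as pd

# Hypothetical toy data with no tied fares
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Fare": [7.25, 512.33, 26.55, 263.0],
})

# Approach used in the notebook: sort everything, then take the top rows
top2_sorted = toy.sort_values(by="Fare", ascending=False).head(2)

# Alternative: ask directly for the rows with the largest fares
top2_nlargest = toy.nlargest(2, "Fare")

print(top2_nlargest["Name"].tolist())
```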

[41]: # Average age of passengers in each class


avg_age=df.groupby('PassengerClass')['Age'].mean()
print("Average Age by PassengerClass:",avg_age)

Average Age by PassengerClass: PassengerClass


1 36.688879
2 29.765380
3 25.932627
Name: Age, dtype: float64

[42]: # Total fare paid by passengers in each class


total_fare=df.groupby('PassengerClass')['Fare'].sum()
print("Total fare by PassengerClass:",total_fare)

Total fare by PassengerClass: PassengerClass


1 18017.4125
2 3801.8417
3 6714.6951
Name: Fare, dtype: float64

[43]: # Find the survival rate for each PassengerClass
survival_rate=df.groupby('PassengerClass')['Survived'].mean()
print("survival rate by PassengerClass:",survival_rate)

survival rate by PassengerClass: PassengerClass


1 0.626168
2 0.472826
3 0.242363
Name: Survived, dtype: float64
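
The mean works as a survival rate because Survived is coded 0 or 1, so the mean equals survivors divided by passengers. A minimal sketch that shows the numerator, the denominator, and the rate side by side using named aggregation (the toy data and the output column names are hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data: 3 of 4 survive in class 1, 1 of 4 in class 3
toy = pd.DataFrame({
    "PassengerClass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived":       [1, 1, 1, 0, 0, 0, 0, 1],
})

summary = toy.groupby("PassengerClass").agg(
    survivors=("Survived", "sum"),       # count of 1s
    passengers=("Survived", "size"),     # rows in the group
    survival_rate=("Survived", "mean"),  # survivors / passengers
)
print(summary)
```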

[44]: # Check for duplicate rows


# duplicated() returns one boolean per row, so sum it to get a count
duplicates=df.duplicated().sum()
print("Number of duplicate rows:",duplicates)

Number of duplicate rows: 0
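
duplicated() returns one boolean per row (True for a row that repeats an earlier one), so printing it shows a Series rather than a count. Summing the booleans gives the number of duplicate rows. A minimal sketch on made-up rows (the toy data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Fare": [7.25, 8.05, 7.25, 7.25],
})

mask = toy.duplicated()        # True only for later repeats of a full row
n_dupes = int(mask.sum())      # True counts as 1
deduped = toy.drop_duplicates()

print(mask.tolist())
print(n_dupes, len(deduped))
```

Row 3 shares a fare with row 0 but has a different name, so it is not counted: by default all columns must match.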

[50]: # Remove duplicates if they exist and verify the total number of rows
df=df.drop_duplicates()
print(f"Total number of rows after removing duplicates: {len(df)}")

Total number of rows after removing duplicates: 889

[51]: # Count the rows that contain at least one missing value


missing=df.isnull().any(axis=1).sum()
print("number of rows with missing values:",missing)

number of rows with missing values: 0

[54]: # 1. Add a new column FamilySize by summing the SiblingSpouses and Parch columns
df['FamilySize'] = df['SiblingSpouses'] + df['Parch']

[56]: # Create a new column IsAlone: True if FamilySize is 0, otherwise False
df['IsAlone']=df['FamilySize']==0
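
Both new columns are computed for every row at once: the addition is element-wise, and the comparison with 0 yields a boolean column. Note that FamilySize as defined here counts relatives aboard and does not include the passenger. A minimal sketch on made-up rows (the toy data is hypothetical, not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data, not rows from titanic.csv
toy = pd.DataFrame({
    "SiblingSpouses": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# Element-wise sum of the two relative counts
toy["FamilySize"] = toy["SiblingSpouses"] + toy["Parch"]

# Boolean column: True where the passenger has no relatives aboard
toy["IsAlone"] = toy["FamilySize"] == 0

print(toy)
```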

[57]: # Save the cleaned dataset to a new CSV file named cleaned_titanic.csv
df.to_csv('cleaned_titanic.csv',index=False)
print("Dataset cleaned and saved as 'cleaned_titanic.csv'.")

Dataset cleaned and saved as 'cleaned_titanic.csv'.
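
Because the frame was sorted earlier, its index labels are no longer in 0, 1, 2, ... order. index=False leaves those labels out of the file, so reloading gives a fresh default index. A minimal sketch of that round trip, using an in-memory buffer in place of a file (the toy frame and its index labels are hypothetical):

```python
import io

import pandas as pd

# Hypothetical toy frame with non-default index labels, as after a sort
toy = pd.DataFrame({"Name": ["A", "B"], "Age": [30.0, 41.5]}, index=[5, 9])

buf = io.StringIO()
toy.to_csv(buf, index=False)   # the index labels 5 and 9 are not written
buf.seek(0)

reloaded = pd.read_csv(buf)
print(reloaded.columns.tolist())
print(reloaded.index.tolist())
```

Without index=False the labels would be written as an extra unnamed first column.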
