Data Wrangling
# Import module
import pandas as pd

# Assign data (missing marks are stored as the string 'NaN')
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df
Output:

Replacing NaN values with average

Data Replacing in Data Wrangling
In the Gender column, we can replace the values by mapping each category to a number.
# Categorize gender: 'M' becomes 0.0 and 'F' becomes 1.0
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
df

... order to merge the data and give it meaning, so that the teacher can analyze it easily. It also saves the teacher the time and effort of manual merging.
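The merge described above can be sketched as follows. This is only an illustration under stated assumptions: the two small frames, their column names (ID, Name, Pending) and all values are made up here and are not the article's datasets. pd.merge() joins the rows of both frames that share the same key.

```python
import pandas as pd

# Hypothetical student details (illustrative values only)
details = pd.DataFrame({'ID': [101, 102, 103],
                        'Name': ['Jai', 'Princi', 'Gaurav']})

# Hypothetical fee status for the same students (illustrative values only)
fees = pd.DataFrame({'ID': [101, 102, 103],
                     'Pending': ['5000', '250', 'NIL']})

# Join the two frames on the shared 'ID' column
merged = pd.merge(details, fees, on='ID')
print(merged)
```

Each row of merged now carries both the name and the pending amount, so nothing has to be matched by hand.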
# Display data
df
Output:
printing dataframe
Creating Second DataFrame to Perform Merge Operation using Data Wrangling:
# Import module
import pandas as pd
# Group the data when year = 2010
grouped = df.groupby('Year')
print(grouped.get_group(2010))
Output:
Using groupby method on dataframe

Data Wrangling by Removing Duplication
The Pandas duplicated() method helps us find duplicate values in large data so that they can be removed. Removing duplicate values is an important part of Data Wrangling on a large data set.
Syntax: DataFrame.duplicated(subset=None, keep='first')
Here subset is the column label (or list of labels) in which we look for duplicate values.
For keep, we have 3 options:
if keep='first', the first occurrence is treated as the original and every later occurrence is marked as a duplicate.
if keep='last', the last occurrence is treated as the original and every earlier occurrence is marked as a duplicate.
if keep=False, every value that occurs more than once is marked as a duplicate.
For example, a university is organizing an event. To participate, students have to fill in their details in an online form so that the organizers can contact them. A student may fill out the form multiple times, which may cause difficulty for the event organizer.

# Import module
import pandas as pd

# Initializing data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop', 'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit', 'Rahul', 'Praveen', 'Amit'],
                'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
                'Email': ['[email protected]', '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]', '[email protected]']}

# Creating DataFrame of data
df = pd.DataFrame(student_data)

# Printing DataFrame
print(df)
Output:
Student Dataset who want to participate in the event

Removing Duplicate Data from the Dataset using Data Wrangling:
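Before the student data is filtered, the three keep settings described above can be compared on a tiny frame. This is a minimal sketch with made-up roll numbers. The variable names (rolls, first, last, every, deduped) are assumptions for illustration only.

```python
import pandas as pd

# Made-up roll numbers: 23 appears three times and 54 twice
rolls = pd.DataFrame({'Roll_no': [23, 54, 29, 54, 23, 23]})

# duplicated() returns a boolean Series; True marks a row as a duplicate
first = rolls.duplicated('Roll_no', keep='first')
last = rolls.duplicated('Roll_no', keep='last')
every = rolls.duplicated('Roll_no', keep=False)

print(first.tolist())  # [False, False, False, True, True, True]
print(last.tolist())   # [True, True, False, False, True, False]
print(every.tolist())  # [True, True, False, True, True, True]

# Inverting the mask with ~ keeps only the rows not marked as duplicates
deduped = rolls[~first]
print(deduped['Roll_no'].tolist())  # [23, 54, 29]
```

The example that follows applies the same idea to the student data with the default keep='first' on the Roll_no column.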
# Import module
import pandas as pd

# Initializing data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop', 'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit', 'Rahul', 'Praveen', 'Amit'],
                'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
                'Email': ['[email protected]', '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]', '[email protected]',
                          '[email protected]', '[email protected]', '[email protected]']}

# Creating DataFrame
df = pd.DataFrame(student_data)

# df.duplicated('Roll_no') marks repeated Roll_no entries as True,
# so ~ (NOT) is applied to select the non-duplicate rows.
non_duplicate = df[~df.duplicated('Roll_no')]

# Printing non-duplicate values
print(non_duplicate)
Output:
Removing duplicate data from the dataset using Data Wrangling

Creating New Datasets Using the Concatenation of Two Datasets in Data Wrangling
We can join two DataFrames in several ways. For our example of concatenating two datasets, we use the pd.concat() function.
Creating Two DataFrames for Concatenation.
# Importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

# Define a dictionary containing employee data
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1, index=[0, 1, 2, 3])

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
We will join these two DataFrames along axis 0.
res = pd.concat([df, df1])
print(res)
Output:
     Name  Age    Address Qualification  Mobile No  Salary
0     Jai   27     Nagpur           Msc       97.0     NaN
1  Princi   24     Kanpur            MA       91.0     NaN
2  Gaurav   22  Allahabad           MCA       58.0     NaN
3    Anuj   32    Kannuaj           Phd       76.0     NaN
2  Gaurav   22  Allahabad           MCA        NaN  1000.0
3    Anuj   32    Kannuaj           Phd        NaN  2000.0
6  Dhiraj   12  Allahabad          Bcom        NaN  3000.0
7  Hitesh   52    Kannuaj        B.hons        NaN  4000.0
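pd.concat() stacks rows along axis 0 and keeps each frame's own index labels, so labels present in both frames are repeated in the result. The sketch below shows two common variations: ignore_index=True to renumber the rows, and join='inner' to keep only the columns both frames share. It uses smaller made-up frames, and the names left and right are assumptions for illustration.

```python
import pandas as pd

# Two small illustrative frames with overlapping index labels (2 and 3)
left = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                     'Mobile No': [97, 91, 58, 76]},
                    index=[0, 1, 2, 3])
right = pd.DataFrame({'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
                      'Salary': [1000, 2000, 3000, 4000]},
                     index=[2, 3, 6, 7])

# Default: original index labels are kept, so 2 and 3 appear twice
stacked = pd.concat([left, right])
print(stacked.index.tolist())      # [0, 1, 2, 3, 2, 3, 6, 7]

# ignore_index=True renumbers the rows from 0 to 7
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7]

# join='inner' keeps only the columns present in both frames
shared = pd.concat([left, right], join='inner', ignore_index=True)
print(shared.columns.tolist())     # ['Name']
```

With the default outer join, columns that exist in only one frame are filled with NaN for the rows of the other frame.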