Data Wrangling

Data wrangling is the process of transforming raw data into a more usable format for analysis and decision-making. It involves data exploration, handling missing values, reshaping, filtering, and merging datasets using Python's pandas library. Techniques such as grouping and removing duplicates are also essential for efficient data management.

Uploaded by

suhani thakur
Copyright
© All Rights Reserved

Data Wrangling

Data wrangling is the process of gathering, collecting, and transforming raw data into another format so that it can be understood, accessed, and analyzed in less time, and so that it better supports decision-making.

Data wrangling in Python covers the following tasks:
1. Data exploration: The data is studied, analyzed, and understood by visualizing representations of it.
2. Dealing with missing values: Most large datasets contain missing (NaN) values. They need to be handled, either by replacing them with the mean or the mode (the most frequent value of the column), or simply by dropping the rows that contain a NaN value.
3. Reshaping data: The data is manipulated according to the requirements; new data can be added or existing data can be modified.
4. Filtering data: Datasets sometimes contain unwanted rows or columns that need to be removed or filtered out.
5. Other: After the raw dataset has been processed with the steps above, we get an efficient dataset that meets our requirements. It can then be used for purposes such as data analysis, machine learning, data visualization, and model training.

Data exploration in Python

Here, we load the data into a dataframe and then display it in tabular form.
# Import pandas package
import pandas as pd

# Assign data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df
Output:

Dealing with missing values in Python
As the previous output shows, the Marks column contains NaN entries, which represent missing values in the dataframe. Here they are handled by replacing them with the column mean.
# Compute average of the numeric entries
# (the missing marks are stored as the string 'NaN', so they are skipped)
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c

# Replace missing values
df = df.replace(to_replace="NaN", value=avg)

# Display data
df
Output:
[Figure: Replacing NaN values with average]

Data Replacing in Data Wrangling
In the Gender column, we can replace the values by encoding each category as a number.
# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
df
Output:
[Figure: Data encoding for the Gender variable in data wrangling]

Filtering data in Data Wrangling
Suppose we need the name, gender, and marks of the top-scoring students. Here we remove the unwanted data using pandas slicing (boolean indexing) and by dropping a column.
# Filter top-scoring students
df = df[df['Marks'] >= 75].copy()

# Remove the Age column from the filtered DataFrame
df.drop('Age', axis=1, inplace=True)

# Display data
df
Output:
[Figure: Dropping column and filtering rows]

Data Wrangling Using Merge Operation
The merge operation combines two raw datasets into the desired format.
Syntax: pd.merge(data_frame1, data_frame2, on="field")
Here, field is the name of a column that is present in both dataframes.
For example: Suppose a teacher has two sets of data. The first contains the details of the students, and the second contains the pending fee status obtained from the accounts office. The teacher can use the merge operation to combine the two and give the data meaning. This makes the data easier for the teacher to analyze and saves the time and effort of merging it manually.

Creating First Dataframe to Perform Merge Operation using Data Wrangling:
# Import module
import pandas as pd

# Creating DataFrame for student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Printing details
print(details)
Output:
[Figure: Printing dataframe]

Creating Second Dataframe to Perform Merge operation using Data Wrangling:
# Import module
import pandas as pd

# Creating DataFrame for fees status
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL', '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})

# Printing fees_status
print(fees_status)
Output:
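The pd.merge() syntax described above also accepts keyword arguments that control which rows are kept. The following is a minimal sketch under stated assumptions, not part of the original tutorial: the frames left and right are hypothetical miniature data, and the how and indicator arguments are standard pd.merge() parameters that the tutorial itself does not use.

```python
import pandas as pd

# Hypothetical miniature frames (not the tutorial's data) sharing the key column 'ID'
left = pd.DataFrame({'ID': [1, 2, 3], 'NAME': ['A', 'B', 'C']})
right = pd.DataFrame({'ID': [2, 3, 4], 'PENDING': ['250', 'NIL', '900']})

# The default is an inner join: only IDs present in both frames are kept
inner = pd.merge(left, right, on='ID')

# how='outer' keeps every ID from either frame; indicator=True adds a
# '_merge' column recording whether a row came from the left frame,
# the right frame, or both
outer = pd.merge(left, right, on='ID', how='outer', indicator=True)

print(inner)
print(outer)
```

In the tutorial's data every ID appears in both dataframes, so the default inner join already returns all ten students; the outer variant is mainly useful when the two sources may not line up exactly.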
[Figure: Define second dataframe]

Data Wrangling Using Merge Operation:
# Import module
import pandas as pd

# Creating Dataframe
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Creating Dataframe
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL', '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})

# Merging Dataframe
print(pd.merge(details, fees_status, on='ID'))
Output:
[Figure: Merging two dataframes]

Data Wrangling Using Grouping Method
The grouping method in data wrangling is used to produce results for individual groups taken from a large dataset. The pandas groupby() method splits a large dataset into such groups.
Example: A car-selling company deals in different brands from various car manufacturers, such as Maruti, Toyota, Mahindra, and Ford, and has data on how many cars were sold in different years. The company wants to extract only the data for cars sold during the year 2010. For this problem, we use another data wrangling technique: the pandas groupby() method.

Creating dataframe to use Grouping methods [Car selling datasets]:
# Import module
import pandas as pd

# Creating Data
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti',
                              'Maruti', 'Hyundai', 'Hyundai',
                              'Toyota', 'Mahindra', 'Mahindra',
                              'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013,
                             2010, 2011, 2011, 2010,
                             2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}

# Creating Dataframe of car_selling_data
df = pd.DataFrame(car_selling_data)

# Printing Dataframe
print(df)
Output:
[Figure: Creating new dataframe]

Creating Dataframe to use Grouping methods [DATA OF THE YEAR 2010]:
# Import module
import pandas as pd

# Creating Data
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti',
                              'Maruti', 'Hyundai', 'Hyundai',
                              'Toyota', 'Mahindra', 'Mahindra',
                              'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013,
                             2010, 2011, 2011, 2010,
                             2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}

# Creating Dataframe for Provided Data
df = pd.DataFrame(car_selling_data)

# Group the data by year and select the group for 2010
grouped = df.groupby('Year')
print(grouped.get_group(2010))
Output:
[Figure: Using groupby method on dataframe]

Data Wrangling by Removing Duplication
The pandas duplicated() method helps us remove duplicate values from large datasets. Removing duplicate values is an important part of data wrangling.
Syntax: DataFrame.duplicated(subset=None, keep='first')
Here, subset is the column (or columns) in which to look for duplicate values. The method returns a Boolean mask that marks the duplicate rows, which can then be filtered out.
The keep parameter has three options:
• if keep='first', the first occurrence is treated as the original and all later occurrences are marked as duplicates.
• if keep='last', the last occurrence is treated as the original and all earlier occurrences are marked as duplicates.
• if keep=False, all values that occur more than once are marked as duplicates.
For example, a university is organizing an event. To participate, students have to fill in their details in an online form so that the organizers can contact them. A student may fill out the form more than once, and multiple entries from a single student make the organizers' work harder. The data that the organizers receive can easily be wrangled by removing the duplicate values.

Creating a Student Dataset who want to participate in the event:
# Import module
import pandas as pd

# Initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],

                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],

                'Email': ['[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]']}

# Creating Dataframe of Data
df = pd.DataFrame(student_data)

# Printing Dataframe
print(df)
Output:
[Figure: Student Dataset who want to participate in the event]

Removing Duplicate data from the Dataset using Data wrangling:
# Import module
import pandas as pd

# Initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],

                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],

                'Email': ['[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]']}

# Creating dataframe
df = pd.DataFrame(student_data)

# df.duplicated('Roll_no') flags the repeated entries in Roll_no,
# so ~ (NOT) is applied to keep only the non-duplicate rows.
non_duplicate = df[~df.duplicated('Roll_no')]

# Printing non-duplicate values
print(non_duplicate)
Output:
[Figure: Removing duplicate data from the dataset using data wrangling]

Creating New Datasets Using the Concatenation of Two Datasets In Data Wrangling.
We can join two dataframes in several ways. In this example of concatenating two datasets, we use the pd.concat() function.

Creating Two Dataframe For Concatenation.
# Importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

# Define a dictionary containing employee data
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1, index=[0, 1, 2, 3])

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])

We will join these two dataframes along axis 0.
res = pd.concat([df, df1])

# Display the result
print(res)
Output:
     Name  Age    Address Qualification  Mobile No  Salary
0     Jai   27     Nagpur           Msc       97.0     NaN
1  Princi   24     Kanpur            MA       91.0     NaN
2  Gaurav   22  Allahabad           MCA       58.0     NaN
3    Anuj   32    Kannuaj           Phd       76.0     NaN
2  Gaurav   22  Allahabad           MCA        NaN  1000.0
3    Anuj   32    Kannuaj           Phd        NaN  2000.0
6  Dhiraj   12  Allahabad          Bcom        NaN  3000.0
7  Hitesh   52    Kannuaj        B.hons        NaN  4000.0
Because pd.concat() keeps each dataframe's original index labels, the labels 2 and 3 appear twice in the result. Columns that exist in only one of the two dataframes (Mobile No and Salary) are filled with NaN for the rows that come from the other.
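When two dataframes are concatenated along axis 0, pd.concat() carries over the index labels of each input, so labels can repeat in the combined result. The sketch below is an illustration under stated assumptions rather than part of the original tutorial: the frames a and b are hypothetical, and ignore_index=True is a standard pd.concat() argument that renumbers the rows.

```python
import pandas as pd

# Hypothetical small frames whose index labels overlap (label 1 is in both)
a = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]}, index=[0, 1])
b = pd.DataFrame({'Name': ['Dhiraj', 'Hitesh'], 'Salary': [3000, 4000]}, index=[1, 2])

# Default behaviour: the original index labels are kept, so label 1 repeats
kept = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers the rows from 0
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Renumbering is a reasonable choice when the old labels carry no meaning; if they identify records, keeping them (and checking for repeats) may be preferable.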
