Data Wrangling

Data wrangling is the process of transforming raw data into a more usable format for analysis and decision-making. It involves data exploration, handling missing values, reshaping, filtering, and merging datasets using Python's pandas library. Techniques such as grouping and removing duplicates are also essential for efficient data management.

Uploaded by

suhani thakur

Data Wrangling

Data wrangling is the process of gathering, collecting, and transforming raw data into another format so that it can be understood, accessed, and analyzed in less time and can better support decision-making.

Data wrangling in Python deals with the following functionalities:
1. Data exploration: The data is studied, analyzed, and understood by visualizing representations of it.
2. Dealing with missing values: Most large datasets contain missing values (NaN). They need to be handled, either by replacing them with the mean or the mode (the most frequent value) of the column, or simply by dropping the rows that contain a NaN value.
3. Reshaping data: The data is manipulated according to the requirements; new data can be added or existing data can be modified.
4. Filtering data: Datasets sometimes contain unwanted rows or columns that need to be removed or filtered out.
5. Other: After applying the steps above to the raw dataset, we get an efficient dataset that meets our requirements and can be used for purposes such as data analysis, machine learning, data visualization, and model training.

Data exploration in Python

In data exploration, we load the data into a DataFrame and then view it in tabular form.

# Import pandas package
import pandas as pd

# Assign data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df

Output: [Figure: the DataFrame in tabular form]

Dealing with missing values in Python

As the previous output shows, the Marks column contains NaN values. These are missing values, and here they are handled by replacing them with the column mean. In this example the missing entries are stored as the string 'NaN' rather than as a floating-point NaN, which is why the code tests each element and then replaces that string.

# Compute average of the numeric entries
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c

# Replace missing values
df = df.replace(to_replace="NaN", value=avg)

# Display data
df

Output: [Figure: replacing NaN values with the average]

Data Replacing in Data Wrangling

In the Gender column, we can replace the values with numeric category codes.

# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
df

Output: [Figure: data encoding for the Gender variable]
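The two cleaning steps above (filling missing marks and encoding gender) can also be written with pandas' own vectorised helpers. The following is a minimal sketch, not the author's method: it assumes the same sample data and swaps the manual loop for `pd.to_numeric` and `fillna`.

```python
import pandas as pd

# Same sample data as above; the missing marks are the string 'NaN'.
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
df = pd.DataFrame(data)

# Convert the column to real numbers; non-numeric entries become a true NaN.
df['Marks'] = pd.to_numeric(df['Marks'], errors='coerce')

# Fill the gaps with the mean of the values that are present.
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())

# Encode gender as numeric category codes.
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1})

print(df)
```

Converting to a numeric dtype first means later comparisons and aggregations on Marks operate on floats rather than on a mixed object column.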
Filtering data in Data Wrangling

Suppose we need only the name, gender, and marks of the top-scoring students. We remove the unwanted data by filtering the rows with a boolean condition and then dropping the column we do not need.

# Filter top scoring students
df = df[df['Marks'] >= 75].copy()

# Remove age column from filtered DataFrame
df.drop('Age', axis=1, inplace=True)

# Display data
df

Output: [Figure: dropping a column and filtering rows]

Data Wrangling Using Merge Operation

The merge operation combines two raw datasets into the desired format.

Syntax: pd.merge(data_frame1, data_frame2, on="field")

Here "field" is the name of a column that is present in both DataFrames.

For example, suppose a teacher has two sets of data: the first contains the details of the students, and the second contains their pending fee status, taken from the accounts office. The teacher can use the merge operation to combine the two and give the data meaning. The merged data is easier to analyze, and the teacher is spared the time and effort of merging it manually.

Creating First Dataframe to Perform Merge Operation using Data Wrangling:

# Import module
import pandas as pd

# Creating DataFrame for student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Printing details
print(details)

Output: [Figure: printing dataframe]

Creating Second Dataframe to Perform Merge Operation using Data Wrangling:

# Import module
import pandas as pd

# Creating DataFrame for fees status
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL', '9000', '15000', 'NIL',
                '4500', '1800', '250', 'NIL']})

# Printing fees_status
print(fees_status)

Output: [Figure: the second dataframe]
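The pd.merge(..., on=...) syntax described above keeps only the rows whose key appears in both DataFrames. As a hedged sketch of what happens when the keys do not line up, the small frames below are hypothetical (they are not the tutorial's data): ID 102 has no fee record and ID 104 has no student record. The `how` and `indicator` parameters are standard `pd.merge` options.

```python
import pandas as pd

# Hypothetical data with deliberately mismatched IDs.
details = pd.DataFrame({'ID': [101, 102, 103],
                        'NAME': ['Jagroop', 'Praveen', 'Harjot']})
fees_status = pd.DataFrame({'ID': [101, 103, 104],
                            'PENDING': ['5000', 'NIL', '9000']})

# Default (inner) merge: only IDs present in both frames survive.
inner = pd.merge(details, fees_status, on='ID')

# Left merge: every student is kept; indicator=True adds a _merge column
# that records whether a matching fee row was found.
left = pd.merge(details, fees_status, on='ID', how='left', indicator=True)

print(inner)
print(left)
```

In the left merge, the student without a fee record gets NaN in PENDING and is flagged as left_only, which makes unmatched rows easy to spot.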
Data Wrangling Using Merge Operation:

# Import module
import pandas as pd

# Creating Dataframe
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Creating Dataframe
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL', '9000', '15000', 'NIL',
                '4500', '1800', '250', 'NIL']})

# Merging Dataframe
print(pd.merge(details, fees_status, on='ID'))

Output: [Figure: merging two dataframes]

Data Wrangling Using Grouping Method

The grouping method in data wrangling splits a large dataset into groups so that results can be reported per group. In pandas, it is used to pull a subset of the data out of a large dataset.

Example: A car-selling company deals in brands from various car manufacturers, such as Maruti, Toyota, Mahindra, and Ford, and has data on how many cars were sold in different years. The company wants to extract only the data for cars sold during the year 2010. For this problem, we use another data wrangling technique: the pandas groupby() method.

Creating dataframe to use Grouping methods [Car selling datasets]:

# Import module
import pandas as pd

# Creating Data
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti',
                              'Maruti', 'Hyundai', 'Hyundai',
                              'Toyota', 'Mahindra', 'Mahindra',
                              'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013,
                             2010, 2011, 2011, 2010,
                             2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}

# Creating Dataframe of car_selling_data
df = pd.DataFrame(car_selling_data)

# Printing Dataframe
print(df)

Output: [Figure: creating new dataframe]
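Besides pulling out a single group, groupby() is commonly followed by an aggregation. The sketch below is an illustration under the assumption that the car-sales columns are as listed in this section; it totals the units sold per year and shows that a plain boolean filter selects the 2010 rows as well.

```python
import pandas as pd

# Car-sales data with the same columns and values as in this section.
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti',
                              'Maruti', 'Hyundai', 'Hyundai',
                              'Toyota', 'Mahindra', 'Mahindra',
                              'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013,
                             2010, 2011, 2011, 2010,
                             2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}
df = pd.DataFrame(car_selling_data)

# Aggregate: total units sold in each year.
sold_per_year = df.groupby('Year')['Sold'].sum()
print(sold_per_year)

# Alternative to selecting one group: filter the rows directly.
sold_2010 = df[df['Year'] == 2010]
print(sold_2010)
```

A boolean filter is enough when only one year is needed; groupby() pays off when the same calculation is wanted for every group at once.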
Creating Dataframe to use Grouping methods [Data of the year 2010]:

# Import module
import pandas as pd

# Creating Data
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti',
                              'Maruti', 'Hyundai', 'Hyundai',
                              'Toyota', 'Mahindra', 'Mahindra',
                              'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013,
                             2010, 2011, 2011, 2010,
                             2013, 2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5,
                             2, 8, 7, 2, 4, 2]}

# Creating Dataframe for Provided Data
df = pd.DataFrame(car_selling_data)

# Group the data by year and select the group for 2010
grouped = df.groupby('Year')
print(grouped.get_group(2010))

Output: [Figure: using the groupby method on a dataframe]

Data Wrangling by Removing Duplication

The pandas duplicated() method helps us find duplicate values in large data so that they can be removed. Removing duplicate values from a large dataset is an important part of data wrangling.

Syntax: DataFrame.duplicated(subset=None, keep='first')

Here subset is the column (or list of columns) that is checked for duplicate values. The method returns a boolean Series in which True marks a row as a duplicate.

The keep parameter has three options:
• if keep='first', the first occurrence is treated as the original and all later occurrences are marked as duplicates.
• if keep='last', the last occurrence is treated as the original and all earlier occurrences are marked as duplicates.
• if keep=False, every value that occurs more than once is marked as a duplicate.

For example, a university is organizing an event. To participate, students have to fill in their details in an online form so that the organizers can contact them. A student may fill out the form more than once, and multiple entries from a single student make things difficult for the event organizers. The data the organizers receive can easily be wrangled by removing the duplicate values.

Creating a dataset of students who want to participate in the event:

# Import module
import pandas as pd

# Initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],

                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],

                'Email': ['[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]']}

# Creating Dataframe of Data
df = pd.DataFrame(student_data)

# Printing Dataframe
print(df)

Output: [Figure: dataset of students who want to participate in the event]
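The effect of the three keep options is easiest to see on a small frame. The sketch below uses hypothetical sign-up entries (not the tutorial's dataset) and `DataFrame.drop_duplicates`, which takes the same subset and keep arguments as duplicated() but returns the surviving rows directly.

```python
import pandas as pd

# Hypothetical sign-ups: roll numbers 23 and 54 were each submitted twice.
entries = pd.DataFrame({'Name': ['Amit', 'Praveen', 'Rahul', 'Amit', 'Praveen'],
                        'Roll_no': [23, 54, 36, 23, 54]})

# keep='first': retain the first submission of each roll number.
first = entries.drop_duplicates(subset='Roll_no', keep='first')

# keep='last': retain the most recent submission of each roll number.
last = entries.drop_duplicates(subset='Roll_no', keep='last')

# keep=False: discard every roll number that appears more than once.
unique_only = entries.drop_duplicates(subset='Roll_no', keep=False)

print(first.index.tolist())
print(last.index.tolist())
print(unique_only.index.tolist())
```

Note that keep=False is the Python boolean False, not the string 'false'.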
Removing Duplicate data from the Dataset using Data wrangling:

# Import module
import pandas as pd

# Initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],

                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],

                'Email': ['[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]',
                          '[email protected]']}

# Creating dataframe
df = pd.DataFrame(student_data)

# df.duplicated('Roll_no') marks repeated Roll_no entries as True,
# so ~ (NOT) is applied to keep only the non-duplicate rows.
non_duplicate = df[~df.duplicated('Roll_no')]

# Printing non-duplicate values
print(non_duplicate)

Output: [Figure: removing duplicate data from the dataset using data wrangling]

Creating New Datasets Using the Concatenation of Two Datasets In Data Wrangling

We can join two DataFrames in several ways. For our example of concatenating two datasets, we use the pd.concat() function.

Creating Two Dataframes For Concatenation:

# Importing pandas module
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

# Define a dictionary containing employee data
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1, index=[0, 1, 2, 3])

# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])

We will join these two DataFrames along axis 0.

res = pd.concat([df, df1])
print(res)

Output:
     Name  Age    Address Qualification  Mobile No  Salary
0     Jai   27     Nagpur           Msc       97.0     NaN
1  Princi   24     Kanpur            MA       91.0     NaN
2  Gaurav   22  Allahabad           MCA       58.0     NaN
3    Anuj   32    Kannuaj           Phd       76.0     NaN
4  Gaurav   22  Allahabad           MCA        NaN  1000.0
5    Anuj   32    Kannuaj           Phd        NaN  2000.0
6  Dhiraj   12  Allahabad          Bcom        NaN  3000.0
7  Hitesh   52    Kannuaj        B.hons        NaN  4000.0
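pd.concat() keeps each frame's own index labels, so frames that share labels produce repeated labels in the result. As a minimal sketch with hypothetical two-row frames (not the employee data above), the standard `ignore_index=True` option renumbers the rows instead.

```python
import pandas as pd

# Hypothetical frames that both use the default index 0, 1.
first = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]})
second = pd.DataFrame({'Name': ['Dhiraj', 'Hitesh'], 'Age': [12, 52]})

# Plain concatenation along axis 0 keeps the original labels.
stacked = pd.concat([first, second])

# ignore_index=True discards them and assigns a fresh 0..n-1 index.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Renumbering is usually the safer choice when the original labels carry no meaning, because label-based lookups such as loc then return exactly one row per label.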
