
DATA WRANGLING

Data wrangling (also called data cleaning, data remediation, or data munging) refers to a variety of processes designed to transform raw data into more readily usable formats.
Some examples of data wrangling include:
Merging multiple data sources into a single dataset for analysis
Identifying gaps in the data (empty cells) and either filling or deleting them
Deleting data that is unnecessary or irrelevant
Identifying extreme outliers in the data and either explaining the discrepancies or removing them so that analysis can take place.
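To make these operations concrete, here is a minimal pandas sketch using invented sales data (the column names and values are hypothetical, chosen only for illustration):

# Import pandas package
import pandas as pd

# Hypothetical sales data with an irrelevant column, an empty cell, and an extreme outlier
sales = pd.DataFrame({'region': ['North', 'South', 'East', 'West'],
                      'revenue': [1200.0, None, 1150.0, 985000.0],
                      'notes': ['ok', '', 'ok', 'check']})
targets = pd.DataFrame({'region': ['North', 'South', 'East', 'West'],
                        'target': [1000.0, 1100.0, 1050.0, 1200.0]})

# Merge the two sources into a single dataset
merged = pd.merge(sales, targets, on='region')

# Delete a column that is unnecessary for the analysis
merged = merged.drop(['notes'], axis=1)

# Fill the gap (empty cell) with the column median
merged['revenue'] = merged['revenue'].fillna(merged['revenue'].median())

# Remove the extreme outlier so analysis can take place
merged = merged[merged['revenue'] < 10000]
print(merged)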
THE IMPORTANCE OF DATA WRANGLING
Any analyses a business performs will ultimately be constrained by the data that informs them. If the data is incomplete, unreliable, or faulty, then the analyses will be too, diminishing the value of any insights gleaned.
Data wrangling seeks to remove that risk by ensuring data is in a reliable state before it’s
analyzed and leveraged. This makes it a critical part of the analytical process.
It’s important to note that data wrangling can be time-consuming and taxing on resources,
particularly when done manually. This is why many organizations institute policies and
best practices that help employees streamline the data cleanup process—for example,
requiring that data include certain information or be in a specific format before it’s
uploaded to a database.
For this reason, it’s vital to understand the steps of the data wrangling process and the
negative outcomes associated with incorrect or faulty data.
DATA WRANGLING STEPS
1. Discovery - Discovery refers to the process of familiarizing yourself with data so you can
conceptualize how you might use it. During discovery, you may identify trends or patterns in
the data, along with obvious issues, such as missing or incomplete values that need to be
addressed. This is an important step, as it will inform every activity that comes afterward.
2. Structuring - Raw data is typically unusable in its original state because it is either incomplete or incorrectly formatted for its intended application. Data structuring is the process of taking raw
data and transforming it to be more readily leveraged. The form your data takes will depend
on the analytical model you use to interpret it.
3. Cleaning - Data cleaning is the process of removing inherent errors in data that might
distort your analysis or render it less valuable. Cleaning can come in different forms,
including deleting empty cells or rows, removing outliers, and standardizing inputs. The
goal of data cleaning is to ensure there are no errors (or as few as possible) that could
influence your final analysis.
4. Enriching - Once you understand your existing data and have transformed it into a
more usable state, you must determine whether you have all of the data necessary for the
project at hand. If not, you may choose to enrich or augment your data by incorporating
values from other datasets. For this reason, it’s important to understand what other data is
available for use.
5. Validating - Data validation refers to the process of verifying that your data is both consistent and of sufficiently high quality. During validation, you may discover issues you need to resolve or conclude that your data is ready to be analyzed. Validation is typically achieved through automated checks written in code (a small sketch follows this list).
6. Publishing - Once your data has been validated, you can publish it. This involves
making it available to others within your organization for analysis. The format you use to
share the information—such as a written report or electronic file—will depend on your
data and the organization’s goals.
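As a rough illustration of step 5, the checks below show the kind of automated validation that might be scripted; the data, rules, and column names are hypothetical:

# Import pandas package
import pandas as pd

# Hypothetical cleaned data to validate
df = pd.DataFrame({'ID': [101, 102, 103],
                   'Age': [17, 18, 17],
                   'Marks': [90.0, 76.0, 71.0]})

# Automated consistency and quality checks
assert df['ID'].is_unique, "IDs must be unique"
assert df['Marks'].between(0, 100).all(), "Marks must lie between 0 and 100"
assert df.notna().all().all(), "No missing values allowed"
print("Validation passed")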
DATA WRANGLING FUNCTIONS
Data wrangling in Python deals with the following functionalities:
1. Data exploration: The data is studied, analyzed, and understood, often by visualizing or tabulating it.
2. Dealing with missing values: Large datasets frequently contain missing (NaN) values. These need to be handled, either by replacing them with the mean, mode, or most frequent value of the column, or simply by dropping the rows that contain them.
3. Reshaping data: The data is restructured according to the requirements; new data can be added or existing data can be modified.
4. Filtering data: Datasets sometimes contain unwanted rows or columns, which need to be removed or filtered out.
5. Other: After applying the above functionalities to the raw dataset, we obtain a clean dataset that meets our requirements and can be used for purposes such as data analysis, machine learning, data visualization, or model training.
DATA EXPLORATION
Here we assign the data and then display it in a tabular format.
# Import pandas package
import pandas as pd

# Assign data
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
print(df)
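In practice, exploration usually also relies on pandas' built-in summaries; a brief sketch, reusing the df created above:

# Common exploration helpers
print(df.head())      # first few rows
df.info()             # column types and non-null counts (prints directly)
print(df.describe())  # summary statistics for numeric columns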
DEALING WITH MISSING VALUES
As the previous output shows, there are NaN values in the Marks column; these are handled by replacing them with the column mean.
# Compute the average of the numeric Marks values
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c

# Replace missing values with the average
df = df.replace(to_replace="NaN", value=avg)

# Display data
print(df)
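If the missing entries are stored as real NaN values rather than the string "NaN", pandas can handle them directly; a minimal sketch of this alternative, assuming numpy is available:

# Alternative: use real NaN values and fillna
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj',
                             'Ravi', 'Natasha', 'Riya'],
                    'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]})

# fillna replaces every missing value with the column mean
df2['Marks'] = df2['Marks'].fillna(df2['Marks'].mean())
print(df2)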
RESHAPING DATA
In the Gender column, we can reshape the data by encoding the categories as numbers.
# Encode gender as numbers
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

# Display data
print(df)
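Mapping to 0 and 1 works well for a two-value column; for categories without a natural order, one-hot encoding with pd.get_dummies is a common alternative. A small sketch with the same invented values:

# One-hot encode a categorical column
import pandas as pd

df3 = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F']})
print(pd.get_dummies(df3, columns=['Gender']))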
FILTERING DATA
Suppose we only need the name, gender, and marks of the top-scoring students. The remaining, unwanted data has to be removed.
# Filter top-scoring students
df = df[df['Marks'] >= 75]

# Remove the Age column
df = df.drop(['Age'], axis=1)

# Display data
print(df)
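Selecting only the required columns is an equivalent alternative to dropping the unwanted ones; a small self-contained sketch with hypothetical values:

# Keep only the required columns with loc
import pandas as pd

df4 = pd.DataFrame({'Name': ['Jai', 'Princi', 'Riya'],
                    'Age': [17, 17, 17],
                    'Marks': [90, 76, 71]})
print(df4.loc[df4['Marks'] >= 75, ['Name', 'Marks']])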
WRANGLING DATA USING MERGE OPERATION
The merge operation is used to combine raw data from multiple sources into the desired format.

Syntax:
pd.merge(data_frame1, data_frame2, on="field")
Here field is the name of the column that is common to both DataFrames.

For example, suppose a teacher has two kinds of data: the first contains details of students, and the second contains each student's pending-fees status obtained from the accounts office. The teacher can use the merge operation to combine the two and give the data meaning, so that it is easy to analyze, and this also saves the time and effort of merging manually.
# Import module
import pandas as pd

# Creating DataFrame for student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106,
           107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Printing details
print(details)

# Creating DataFrame for fees_status
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105,
            106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL',
                 '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})

# Printing fees_status
print(fees_status)
WRANGLING DATA USING MERGE OPERATION
# Import module
import pandas as pd

# Creating DataFrame of student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul',
             'Nikita', 'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Creating DataFrame of fees status
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL', '9000', '15000',
                 'NIL', '4500', '1800', '250', 'NIL']})

# Merging the DataFrames on the common ID column
print(pd.merge(details, fees_status, on='ID'))
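By default pd.merge performs an inner join, keeping only IDs that appear in both DataFrames; the how parameter changes this behavior. A short sketch with a smaller, invented set of IDs:

# Inner join vs. left join
import pandas as pd

details = pd.DataFrame({'ID': [101, 102, 103],
                        'NAME': ['Jagroop', 'Praveen', 'Harjot']})
fees_status = pd.DataFrame({'ID': [102, 103, 104],
                            'PENDING': ['250', 'NIL', '9000']})

# Inner join (default): only IDs 102 and 103 survive
print(pd.merge(details, fees_status, on='ID'))

# Left join: every student is kept; missing fee records become NaN
print(pd.merge(details, fees_status, on='ID', how='left'))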
