DWDV UNIT 1
Unit-1
Data Wrangling:
Data wrangling is the process of transforming and structuring data from one raw
form into a desired format with the intent of improving data quality and making it
more consumable and useful for analytics or machine learning. It’s also sometimes
called data munging.
The data wrangling process often includes transforming, cleansing, and enriching
data from multiple sources. As a result of data wrangling, the data being analyzed
is more accurate and meaningful, leading to better solutions, decisions, and
outcomes.
Because of the increase in data collection and usage, especially diverse and
unstructured data from multiple data sources, organizations are now dealing with
larger amounts of raw data, and preparing it for analysis can be
time-consuming and costly.
Self-service approaches and analytics automation can speed up and increase the
accuracy of data wrangling processes by eliminating the errors that can be
introduced by people when they transform data using Excel or other manual
processes.
1. Cleanse: Data often contains errors as a result of manual entry, incomplete data,
data automatically collected from sensors, or even malfunctioning equipment. Data
cleansing corrects those entry errors, removes duplicates and outliers (if
appropriate), eliminates missing data, and imputes null values based on statistical
or conditional modeling to improve data quality.
2. Data Consistency: Since businesses often use data from multiple sources,
including third parties, the data can include many errors. An important step
of the data wrangling process is creating uniform datasets that help eliminate
the errors introduced by people and by different formatting standards across
third parties, which results in improved accuracy during analysis.
3. Improved Accuracy and Precision of Data: The way data is manipulated and
arranged can affect the accuracy and precision of analysis, especially when it’s
related to identifying relevant patterns and trends. Examples of good data
wrangling include organizing data by numerical values rather than categorical
labels, or arranging data in tables rather than in single columns. Grouping
similar data together improves accuracy.
4. Improved Communication and Decision-Making: Increased clarity and
improved accuracy reduce the time it takes for others to understand and interpret
data, leading to better understanding and communication between teams. This
benefit can lead to increased collaboration, transparency, and better decisions.
When working with multiple data sources, there are many chances for data to
be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and
algorithms are unreliable, even though they may look correct. Data cleaning in
data science using Python is changing or eliminating garbage, incorrect,
duplicate, corrupted, or incomplete data in a dataset. There is no absolute way
to describe the precise steps in data cleaning because the processes vary
from dataset to dataset. This general data preparation process is referred to
as data cleaning, data cleansing, or data scrubbing.
Data Cleaning using Pandas in Python is the most important task that a data
science professional should do. Wrong or bad-quality data can be detrimental to
processes and analysis. Clean data will ultimately increase overall productivity
and permit the very best quality information in decision-making.
Following are some reasons why Python data cleaning is essential:
1. Error-Free Data:
The quality of the data is the degree to which it follows the rules of particular
requirements. For example, if we have imported phone number data of different
customers, and in some places, we have added customers’ email addresses.
However, because our needs were straightforward for phone numbers, the email
addresses would be invalid data. Here, some pieces of data follow a specific
format. Some types of numbers have to be in a specific range.
Some data cells might require specific data types, like numeric or Boolean. In
every scenario, there are some mandatory constraints our data should follow.
Certain conditions affect multiple fields of data in a particular form. Particular
types of data have unique restrictions. Data will always be invalid if it isn’t in
the required format. Data cleaning in data science using Python will help us
simplify this process and avoid useless data values.
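The phone-number example above can be sketched with pandas string matching. The column values and the phone pattern below are assumptions for illustration, not part of any real dataset:

```python
import pandas as pd

# Hypothetical contact column mixing phone numbers with a stray email address.
contacts = pd.Series(["555-0100", "555-0101", "jane@example.com"])

# Keep only entries matching a simple phone pattern (the pattern is an assumption).
mask = contacts.str.fullmatch(r"\d{3}-\d{4}")
phones = contacts[mask]     # valid entries
invalid = contacts[~mask]   # the email address fails the format constraint
```

Filtering on a format constraint like this is one way to surface invalid data before it reaches analysis.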
3. Accurate and Efficient:
Accuracy is the degree to which the data is close to the correct values. We
know that most data in a dataset is valid, so we should focus on establishing
its accuracy. Data can be valid in form and still be inaccurate. For example,
a customer's address may be stored in the specified format, yet not be the
customer's actual address; an email may contain an additional character that
makes it incorrect or invalid; the same can happen with a phone number. This
means that
we have to rely on data sources to cross-check the data to figure out if it’s
accurate or not. Depending on the kind of data we are using, we might be able
to find various resources that could help us in this regard for cleaning.
4. Complete Data:
Completeness is the degree to which we should know all the required values.
Completeness is a little more challenging to achieve than accuracy or quality
because it is nearly impossible to have all the information we need; only
known facts can be entered. We can try to complete data by redoing the data-gathering
activities like approaching the clients again, re-interviewing people, etc. For
example, we might need to enter every customer’s contact information.
However, a number of them might not have email addresses. In this case, we
have to leave those columns empty. If a system requires all columns to be
filled, we can enter placeholder values such as "missing" or "unknown."
However, entering such values does not mean that the data is complete; it
would still be considered incomplete.
5. Maintains Data Consistency:
To ensure the data is consistent within the same dataset or across multiple
datasets, we can measure consistency by comparing two similar systems. We
can also check the data values within the same dataset to see if they are
consistent. Consistency can be relational. For example, a customer’s age might
be 25, which is a valid value and also accurate, but it is also stated as a senior
citizen in the same system. In such cases, we must cross-check the data, similar
to measuring accuracy, and see which value is true. Is the client a 25-year-old?
Or is the client a senior citizen? Only one of these values can be true. There are
multiple ways to make your data consistent.
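The age-versus-senior-citizen contradiction above can be detected with a simple cross-check in pandas. The column names and the senior-age threshold of 60 are assumptions for illustration:

```python
import pandas as pd

# Hypothetical records reproducing the example: age 25 flagged as a senior citizen.
customers = pd.DataFrame({"age": [25, 70], "is_senior": [True, True]})

# Flag rows where the age and the senior flag contradict each other
# (60 as the senior threshold is an assumption).
conflicts = customers[(customers["age"] < 60) & customers["is_senior"]]
print(conflicts)
```

Rows surfaced this way still need cross-checking against a trusted source to decide which value is true.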
Data scientists spend a lot of time cleaning datasets and getting them into a
form they can work with. It is an essential skill of data scientists to work
with messy data, missing values, and inconsistent, noisy, or nonsensical data.
Python provides a library called Pandas that handles this smoothly. Pandas is
a popular Python library for data processing, cleaning, manipulation, and
analysis; the name stands for "Python Data Analysis Library." It provides
functionality for reading, processing, and writing CSV files. Numerous data
cleaning tools exist, but the Pandas library provides a fast and efficient way
to manage and explore data. It does that by providing us with Series and
DataFrames, which help us represent data efficiently and manipulate it in
various ways.
This article will use the Pandas module to clean our dataset.
We are using a simple dataset for data cleaning, i.e., the iris species dataset.
You can download this dataset from kaggle.com.
To start working with Pandas, we need first to import it. We are using Google
Colab as IDE to import Pandas in Google Colab.
#importing module
import pandas as pd
Step 1: Import Dataset
To import the dataset, we use the read_csv() function of pandas and store the
result in a DataFrame named data. Since the dataset is in tabular format,
read_csv() loads it directly into a DataFrame. A DataFrame is a
two-dimensional, mutable data structure in Python; it is a combination of rows
and columns, like an Excel sheet.
Python Code:
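The loading step might look like the sketch below. The filename Iris.csv is an assumption (it depends on how the Kaggle download is named); a small in-memory sample stands in for the file so the sketch runs on its own:

```python
import io
import pandas as pd

# In practice: data = pd.read_csv("Iris.csv")  # filename is an assumption
# A tiny in-memory sample of the iris dataset stands in for the file here.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,7.0,3.2,4.7,1.4,Iris-versicolor
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())  # by default, the first five rows (all three here)
```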
The head() function is a built-in pandas DataFrame method used to display the
first rows of the dataset. We can specify the number of rows by giving a
number within the parentheses; by default, it displays the first five rows of
the dataset. If we want to see the last five rows of the dataset, we use the
tail() function of the DataFrame like this:
#displaying last five rows of dataset
data.tail()
Step 2: Merge Dataset
Merging combines two datasets into one, lining up rows based on a shared or
common column, for data analysis. We can do this using the merge() function of
the DataFrame. Following is the syntax of the merge function:
DataFrame_name.merge(right, how='inner', on=None, left_on=None,
right_on=None, left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
However, in this case, we don’t need to merge two datasets, so we will skip this
step.
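Although merging is skipped here, the syntax above could be exercised like this sketch. Both frames and the shared Species key are hypothetical:

```python
import pandas as pd

# Hypothetical datasets sharing a "Species" key column.
left = pd.DataFrame({"Species": ["Iris-setosa", "Iris-versicolor"],
                     "MeanPetalLengthCm": [1.46, 4.26]})
right = pd.DataFrame({"Species": ["Iris-setosa", "Iris-versicolor"],
                      "Habitat": ["meadow", "wetland"]})

# Inner join: keep only rows whose key appears in both frames.
merged = left.merge(right, how="inner", on="Species")
print(merged)
```

Changing `how` to "left", "right", or "outer" controls which unmatched rows survive the join.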
Step 3: Rebuild Missing Data
We will use another function to find and fill in the missing data in the dataset.
There are several ways to find null values, if present, in the dataset. Let's
see them one by one:
Using isna().any()
data.isna().any()
This function returns a boolean value for each column, indicating whether any
null value is present in that column.
Using isna().any().sum()
data.isna().any().sum()
This chain returns a single number: the count of columns that contain at least
one null value. A result of 0 means no nulls are present.
There are no null values present in our dataset. But if any null values were
present, we could fill those places with another value using the fillna()
function of the DataFrame. Following is the syntax of the fillna() function:
DataFrame_name.fillna(value=None, method=None, axis=None,
inplace=False, limit=None, downcast=None)
This function fills NA/NaN entries with the specified value (for example, 0).
You may also drop rows with null values using the dropna() method when the
amount of missing data is small and unlikely to affect the overall analysis.
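If nulls were present, the fill-or-drop choice might look like this sketch. The toy values are hypothetical, since the iris dataset itself has no missing values:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing measurement.
df = pd.DataFrame({"SepalLengthCm": [5.1, np.nan, 7.0],
                   "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"]})

print(df.isna().any())  # per-column flag: SepalLengthCm is True

# Option 1: impute the missing value with the column mean.
filled = df.fillna({"SepalLengthCm": df["SepalLengthCm"].mean()})

# Option 2: drop the incomplete row entirely.
dropped = df.dropna()
```

Mean imputation is only one statistical choice; the right fill value depends on the column and the analysis.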
Step 4: Standardization and Normalization
This step is not needed for the dataset we are using. So, we will skip this step.
After removing null, duplicate, and incorrect values, we should verify the
dataset and its accuracy. In this step, we have to check that the data cleaned
so far makes sense. If the data is incomplete, we have to enrich it again
through data-gathering activities like approaching the clients again,
re-interviewing people, etc. Completeness is a little more challenging to
achieve than accuracy or quality in the dataset.
Step 7: Export Dataset
This is the last step of the data-cleaning process. After performing all the above
operations, the data is transformed into a clean dataset, and it is ready to export
for the next process in Data Science or Data Analysis.
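The export step is typically a single to_csv() call. The output filename below is an assumption; an in-memory buffer is used here so the sketch is self-contained:

```python
import io
import pandas as pd

# A minimal cleaned frame standing in for the real dataset.
df = pd.DataFrame({"Species": ["Iris-setosa"], "SepalLengthCm": [5.1]})

# In practice: df.to_csv("cleaned_iris.csv", index=False)
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
print(buffer.getvalue())
```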
Need of data cleanup:
When combining multiple data sources, there are many opportunities for data to
be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes
will vary from dataset to dataset. But it is crucial to establish a template
for your data cleaning process so you know you are doing it the right way
every time.
Data cleanup is a crucial step in data wrangling because it ensures the quality and
reliability of the data used for analysis and visualization. Here are some key
reasons why data cleanup is necessary:
1. Accuracy and Reliability
Combining Datasets: Clean data from different sources can be more easily
combined and integrated, facilitating comprehensive analyses and insights.
Interoperability: Clean and standardized data improves interoperability
between different systems and tools.
8. Facilitating Advanced Analytics
Machine Learning and AI: Clean data is essential for training reliable and
effective machine learning models. Poor-quality data can lead to model
inaccuracies and failures.
Complex Analysis: Advanced analytics techniques often require high-
quality data to produce meaningful and actionable results.
9. Regulatory Compliance
By ensuring that the data is accurate, consistent, and complete, data cleanup helps
in building a strong foundation for any data-driven project.
Data Cleanup Basics
Effective data cleanup involves several key steps to ensure the data is accurate,
consistent, and ready for analysis. Here are the basic steps and techniques
involved:
1. Formatting
Date and Time: Standardize date and time formats (e.g., YYYY-MM-DD).
Text Data: Ensure consistent casing (e.g., all uppercase or lowercase),
remove leading/trailing spaces, and correct spelling errors.
Numeric Data: Ensure numeric data is in a consistent format, and handle
decimal points and thousand separators correctly.
Example: Convert all dates to the format YYYY-MM-DD and ensure all names
are in proper case (e.g., "John Doe").
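The date and name standardization described above can be sketched in pandas. The column names, the raw values, and the day-first input date format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw values needing standardization.
df = pd.DataFrame({"name": ["  john DOE ", "JANE smith"],
                   "joined": ["03/01/2024", "15/02/2024"]})

# Proper-case the names and strip stray spaces.
df["name"] = df["name"].str.strip().str.title()

# Convert day-first dates to the YYYY-MM-DD standard (input format is assumed).
df["joined"] = pd.to_datetime(df["joined"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
```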
2. Handling Outliers
Definition: Identifying and handling data points that deviate significantly from
other observations.
Detection Methods:
o Statistical Methods: Using measures like the z-score, which indicates
how many standard deviations a data point is from the mean.
o Visualization: Box plots and scatter plots can visually identify
outliers.
Handling Methods:
Example: Identify and possibly remove data points where the z-score is greater
than 3.
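The z-score rule above can be sketched directly in pandas. The series below is hypothetical, with one deliberately extreme value:

```python
import pandas as pd

# Mostly similar measurements plus one hypothetical extreme value.
s = pd.Series([10, 11, 12, 13] * 7 + [12, 100])

# z-score: how many standard deviations each point sits from the mean.
z = (s - s.mean()) / s.std()

# Flag points more than 3 standard deviations from the mean.
outliers = s[z.abs() > 3]
print(outliers)
```

Whether to remove, cap, or keep such points depends on whether they are errors or genuine extremes.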
3. Removing Duplicates
Z-score Standardization:
o Formula: X' = (X − μ) / σ
o μ is the mean of the data.
o σ is the standard deviation of the data.
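For the "Removing Duplicates" step named above, pandas provides drop_duplicates(). A minimal sketch, with a hypothetical frame in which one row appears twice:

```python
import pandas as pd

# Hypothetical frame where the row with Id 2 appears twice.
df = pd.DataFrame({"Id": [1, 2, 2, 3],
                   "Species": ["Iris-setosa", "Iris-setosa",
                               "Iris-setosa", "Iris-versicolor"]})

deduped = df.drop_duplicates()             # drop fully identical rows
by_id = df.drop_duplicates(subset="Id")    # or treat Id as the uniqueness key
```

The `subset` parameter matters when rows differ only in columns that are not part of the record's identity.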