DWDV UNIT 1


Data Wrangling:
Data wrangling is the process of transforming and structuring data from one raw
form into a desired format with the intent of improving data quality and making it
more consumable and useful for analytics or machine learning. It’s also sometimes
called data munging.
The data wrangling process often includes transforming, cleansing, and enriching
data from multiple sources. As a result of data wrangling, the data being analyzed
is more accurate and meaningful, leading to better solutions, decisions, and
outcomes.
Because of the increase in data collection and usage, especially of diverse and
unstructured data from multiple sources, organizations now deal with larger
amounts of raw data, and preparing it for analysis can be time-consuming and
costly.
Self-service approaches and analytics automation can speed up and increase the
accuracy of data wrangling processes by eliminating the errors that can be
introduced by people when they transform data using Excel or other manual
processes.

Why Is Self-Service Wrangling Important?


Complex data sets have increased the time required to cull, clean, and organize
data ahead of a broader analysis. At the same time, with data informing just about
every business decision, business users have less time to wait on technical
resources for prepared data, which is where data wrangling becomes valuable.
This necessitates a self-service approach that democratizes data analysis.
Self-service data wrangling tools allow analysts to tackle more complex data
more quickly, produce more accurate results, and make better decisions. As a
result, more businesses have started using data wrangling tools to prepare data
before analysis.
How Data Wrangling Works
Data wrangling follows six major steps: explore, cleanse, transform, enrich,
validate, and store.
Explore: Data exploration or discovery is a way to identify patterns, trends, and
missing or incomplete information in a dataset. The bulk of exploration happens
before creating reports, data visualizations, or training models, but it’s common to
uncover surprises and insights in a dataset during analysis too.

Cleanse: Data often contains errors as a result of manual entry, incomplete data,
data automatically collected from sensors, or even malfunctioning equipment. Data
cleansing corrects those entry errors, removes duplicates and outliers (if
appropriate), eliminates missing data, and imputes null values based on statistical
or conditional modeling to improve data quality.

Transform: Data transformation or data structuring is important; if not done early
on, it can compromise the rest of the wrangling process. Data transformation
involves putting the raw data in the right shape and format that will be useful for a
report, data visualization, or analytic or modeling process. It may involve creating
new variables (aka features) and performing mathematical functions on the data.

Enrich: Enrichment or blending makes a dataset more useful by integrating
additional sources such as authoritative third-party census, firmographic, or
demographic data. The enrichment process may also help uncover additional
insights from the data within an organization or spark new ideas for capturing and
storing additional customer information in the future. This is an opportunity to
think strategically about what additional data might contribute to a report, model,
or business process.
Validate: Validation rules are repeatable programmatic checks that verify data
consistency, quality, and security. Examples of validation include checking that
attributes follow their expected distribution (e.g., birth dates) or confirming the
accuracy of fields by cross-checking them against other data. This is a vital
step in the data wrangling process.
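As an illustration (not part of the original description), a minimal pandas sketch of such validation checks, assuming a hypothetical DataFrame with birth_date and email columns:

# hypothetical validation rules; the column names and data are assumptions
import pandas as pd

df = pd.DataFrame({
    "birth_date": ["1990-05-01", "2050-01-01"],   # the second value is implausible
    "email": ["a@example.com", "not-an-email"],
})
df["birth_date"] = pd.to_datetime(df["birth_date"])

# rule 1: birth dates must not lie in the future
future_births = df[df["birth_date"] > pd.Timestamp.today()]

# rule 2: email addresses must contain an "@" character
bad_emails = df[~df["email"].str.contains("@")]

print(future_births)
print(bad_emails)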
Store: The last part of the wrangling process is to store or preserve the final
product, along with all the steps and transformations that took place so it can be
audited, understood, and repeated in the future.

Benefits of Data Wrangling


Data wrangling makes it easier to analyze and interpret information, which leads to
many benefits, including:
1. Increased Clarity and Understanding: If you’ve ever come across
disorganized data, or a large dataset that’s not easy to interpret, you know how
hard it is to work out what the data represents and what it can be used for.
Properly wrangled datasets can more easily be used for reporting in
Tableau and other data visualization tools.

2. Data Consistency: Since businesses often use data from multiple sources,
including third parties, the data can contain many errors. An important step
of the data wrangling process is creating uniform datasets that eliminate the
errors introduced by people and by differing formatting standards across third
parties, which results in improved accuracy during analysis.
3. Improved Accuracy and Precision of Data: The way data is manipulated and
arranged can affect the accuracy and precision of analysis, especially when
identifying relevant patterns and trends. Examples of good data wrangling
include organizing data by numerical values rather than categorical labels, or
arranging data in well-structured tables rather than loose columns. Grouping
similar data together improves accuracy.
4. Improved Communication and Decision-Making: Increased clarity and
improved accuracy reduce the time it takes for others to understand and interpret
data, leading to better understanding and communication between teams. This
benefit can lead to increased collaboration, transparency, and better decisions.

5. Cost Efficiency: Reducing errors, organizing data, and increasing collaboration
all lead to more efficient use of time, saving organizations money. As one
example, thoroughly cleaned and organized data reduces errors and saves
developers time in creating reports or machine learning models. Consistent datasets
make it easier for data scientists to reuse algorithms for their models or apply new
ones through data science and automated machine learning.

What Is Data Cleaning?

When working with multiple data sources, there are many chances for data to
be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and
algorithms are unreliable, even though they may look correct. Data cleaning in
data science using Python is changing or eliminating garbage, incorrect,
duplicate, corrupted, or incomplete data in a dataset. There’s no absolute way
to describe the precise steps in data cleaning because the processes vary from
dataset to dataset. This general data preparation step is also referred to as data
cleansing or data scrubbing.

Data Cleaning using Pandas in Python is important for producing reliable
answers within the analytical process and is a fundamental part of data science.
The goal of data cleaning in Python is to construct uniform, standardized
datasets that data analytics and business intelligence tools can work with
easily and that provide accurate data for each problem.

Why Is Data Cleaning Essential?

Data Cleaning using Pandas in Python is one of the most important tasks a data
science professional performs. Wrong or bad-quality data can be detrimental to
processes and analysis. Clean data ultimately increases overall productivity
and allows the highest-quality information to support decision-making.
Following are some reasons why Python data cleaning is essential:
1. Error-Free Data:

When combining multiple data sources, there are many opportunities for errors
to creep in. Data cleaning in data science using Python removes these errors
from the data. Clean data, free from wrong and garbage values, helps us perform
analysis faster and more efficiently and saves a considerable amount of time. If
we use data containing garbage values, the results won’t be accurate, and we
will inevitably make mistakes. Monitoring errors and good reporting help us
find where errors come from and make it easier to fix incorrect or corrupt data
in future applications.
2. Data Quality:

The quality of data is the degree to which it follows the rules of particular
requirements. For example, suppose we import phone number data for different
customers, but in some places the field contains email addresses instead.
Because the requirement was for phone numbers only, the email addresses
would be invalid data. Some pieces of data must follow a specific format, some
numbers have to fall within a specific range, and some data cells might require
a particular data type such as numeric or Boolean. In every scenario there are
mandatory constraints our data should follow, and certain conditions can affect
multiple fields at once. Particular types of data have unique restrictions, and
data will always be invalid if it isn’t in the required format. Data cleaning in
data science using Python helps us simplify this process and avoid useless data
values.
3. Accurate and Efficient:

Accuracy means the data is close to the correct values. Most data in a dataset
may be valid, but valid data is not necessarily accurate, so we should focus on
establishing accuracy as well. For example, a customer’s address may be stored
in the specified format yet still not be the right address, or an email address
may contain an extra character that makes it invalid; the same can happen with
a customer’s phone number. This means we have to rely on other data sources to
cross-check the data and figure out whether it’s accurate. Depending on the
kind of data we are using, we may find various resources that can help with
this kind of cleaning.
4. Complete Data:

Completeness is the degree to which we should know all the required values.
Completeness is a little more challenging to achieve than accuracy or quality
because it’s nearly impossible to have all the information we need; only known
facts can be entered. We can try to complete data by redoing data-gathering
activities such as approaching the clients again or re-interviewing people. For
example, we might need to enter every customer’s contact information, but some
of them might not have email addresses. In that case, we have to leave those
columns empty. If a system requires us to fill every column, we can enter
placeholder values such as "missing" or "unknown". However, entering such
values does not make the data complete; it would still be considered incomplete.
5. Maintains Data Consistency:

To ensure the data is consistent within the same dataset or across multiple
datasets, we can measure consistency by comparing two similar systems. We
can also check the data values within the same dataset to see if they are
consistent. Consistency can be relational. For example, a customer’s age might
be 25, which is a valid value and also accurate, but it is also stated as a senior
citizen in the same system. In such cases, we must cross-check the data, similar
to measuring accuracy, and see which value is true. Is the client a 25-year-old?
Or is the client a senior citizen? Only one of these values can be true. There are
multiple ways to make your data consistent.

• By checking in different systems.
• By checking the source.
• By checking the latest data.

Data Cleaning Cycle

It is the method of analyzing, identifying, and correcting untidy, raw data.
Pandas data cleaning involves filling in missing values, handling outliers, and
identifying and fixing errors in the dataset. The techniques used for data
cleaning in data science using Python vary with different types of datasets. In
this tutorial, we will learn how to clean data using pandas, following the
standard steps of the data cleaning cycle described below.
Data Cleaning With Pandas

Data scientists spend a lot of time cleaning datasets and getting them into a
form they can work with. Working with messy data, missing values, and
inconsistent, noisy, or nonsensical data is an essential skill for data
scientists. Python’s Pandas library makes this work smooth. Pandas is a
popular Python library for data processing, cleaning, manipulation, and
analysis, often described as the “Python Data Analysis Library.” It provides
functions for reading, processing, and writing data files such as CSVs.
Numerous data cleaning tools exist, but the Pandas library provides a fast and
efficient way to manage and explore data. It does that by providing us with
Series and DataFrames, which help us represent data efficiently and manipulate
it in various ways.
This article will use the Pandas module to clean our dataset.

We are using a simple dataset for data cleaning, i.e., the iris species dataset.
You can download this dataset from kaggle.com.

Let’s get started with data cleaning using Pandas.

To start working with Pandas, we first need to import it. In this tutorial, we
are using Google Colab as the IDE.
#importing module
import pandas as pd
Step 1: Import Dataset

To import the dataset, we use the read_csv() function of pandas and store the
result in a DataFrame named data. Because the dataset is in tabular format,
read_csv() loads it directly into a DataFrame: a two-dimensional, mutable data
structure in Python made up of rows and columns, much like an Excel sheet.

Python Code:
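A minimal sketch of this step, assuming the downloaded file is named Iris.csv (the actual file name and path from kaggle.com may differ on your machine):

#importing the dataset into a DataFrame named data
data = pd.read_csv("Iris.csv")

#displaying first five rows of the dataset
data.head()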

The head() function is a built-in function in pandas for the dataframe used to
display the rows of the dataset. We can specify the number of rows by giving
the number within the parenthesis. By default, it displays the first five rows of
the dataset. If we want to see the last five rows of the dataset, we use the
tail()function of the dataframe like this:
#displaying last five rows of the dataset
data.tail()
Step 2: Merge Dataset

Merging combines two datasets into one, lining up rows based on a particular
shared property or column for data analysis. We can do this using the merge()
function of the DataFrame. Following is the syntax of the merge function:
DataFrame_name.merge(right, how='inner', on=None, left_on=None,
right_on=None, left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

However, in this case, we don’t need to merge two datasets, so we will skip this
step.
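Although merging is skipped here, a minimal hedged sketch of how merge() could be used, with two small hypothetical DataFrames sharing an Id column:

#hypothetical example of merging two DataFrames on a shared "Id" column
left = pd.DataFrame({"Id": [1, 2, 3], "SepalLengthCm": [5.1, 4.9, 4.7]})
right = pd.DataFrame({"Id": [1, 2, 3], "Species": ["Iris-setosa"] * 3})
merged = left.merge(right, how="inner", on="Id")
print(merged)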
Step 3: Rebuild Missing Data

We will now find and fill in the missing data in the dataset. There are several
ways to find null values if they are present in the dataset. Let’s see them
one by one:

Using isnull() function:


data.isnull()
This function returns a boolean value for every cell in the dataset, indicating
whether that cell is null.

Using isna() function:


data.isna()
This works the same as the isnull() function and provides the same output.

Using isna().any()
data.isna().any()

This function also gives a boolean value indicating whether a null value is
present or not, but it gives results column-wise, not in tabular format.

Using isna().sum()


data.isna().sum()
This function gives the number of null values present in each column of the
dataset.

Using isna().any().sum()
data.isna().any().sum()

This function returns a single number: the count of columns that contain any null values.

There are no null values present in our dataset. But if any null values were
present, we could fill those places with another value using the fillna()
function of the DataFrame. Following is the syntax of the fillna() function:
DataFrame_name.fillna(value=None, method=None, axis=None,
inplace=False, limit=None, downcast=None)

This function fills NA/NaN entries with a specified value (for example, 0) in
place of the null spaces. You may also drop null values using the dropna()
method when the amount of missing data is relatively small and unlikely to
affect the overall analysis.
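As an illustrative sketch only (our dataset has no nulls), filling or dropping missing values might look like this:

#fill all NaN cells with 0 (the fill value here is just an example)
data_filled = data.fillna(value=0)

#alternatively, drop any rows that contain NaN values
data_dropped = data.dropna()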
Step 4: Standardization and Normalization

Data standardization and normalization are common practices in machine learning.

Standardization is a scaling technique where the values are centered around the
mean with a unit standard deviation. This means that the mean of the attribute
becomes zero, and the resulting distribution has a unit standard deviation.

Normalization is a scaling technique in which values are shifted and rescaled so
that they range between 0 and 1. It is also known as Min-Max scaling.


This step is not needed for the dataset we are using. So, we will skip this step.
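For datasets that do need scaling, a minimal pandas sketch of both techniques, assuming a numeric column named SepalLengthCm (column names in your copy of the dataset may differ):

#min-max normalization: rescale a column to the 0-1 range
col = data["SepalLengthCm"]
data["SepalLengthCm_norm"] = (col - col.min()) / (col.max() - col.min())

#z-score standardization: zero mean, unit standard deviation
data["SepalLengthCm_std"] = (col - col.mean()) / col.std()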

Step 5: De-Duplicate Data

De-duplication means removing all duplicate values. There is no need for
duplicate values in data analysis; they only hurt the accuracy and efficiency
of the analysis results. To find duplicate values in the dataset, we will use a
simple DataFrame function, i.e., duplicated(). Let’s see the example:
data.duplicated()
This function returns a boolean value for each row, indicating whether it is a
duplicate. As we can see, the dataset doesn’t contain any duplicate values.
Duplicate rows, when present, can be removed using the drop_duplicates()
function. Following is the syntax of this function:
DataFrame_name.drop_duplicates(subset=None, keep='first', inplace=False,
ignore_index=False)
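A short usage sketch (for illustration only, since our dataset has no duplicates):

#count duplicate rows before removing them
print(data.duplicated().sum())

#keep the first occurrence of each duplicated row and drop the rest
data = data.drop_duplicates(keep='first')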
Step 6: Verify and Enrich the Data

After removing null, duplicate, and incorrect values, we should verify the
dataset and its accuracy. In this step, we check that the data cleaned so far
makes sense. If the data is incomplete, we have to enrich it again through
data-gathering activities such as approaching the clients again or
re-interviewing people. Completeness is a little more challenging to achieve
than accuracy or quality in the dataset.
Step 7: Export Dataset

This is the last step of the data-cleaning process. After performing all the above
operations, the data is transformed into a clean dataset, and it is ready to export
for the next process in Data Science or Data Analysis.
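A minimal sketch of this export step; the output file name cleaned_iris.csv is an assumption:

#write the cleaned dataset to a new CSV file without the index column
data.to_csv("cleaned_iris.csv", index=False)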
Need for Data Cleanup:

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining
multiple data sources, there are many opportunities for data to be duplicated or
mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct. There is no one absolute way to prescribe the exact
steps in the data cleaning process because the processes will vary from dataset to
dataset. But it is crucial to establish a template for your data cleaning process so
you know you are doing it the right way every time.

Data cleanup is a crucial step in data wrangling because it ensures the quality and
reliability of the data used for analysis and visualization. Here are some key
reasons why data cleanup is necessary:
1. Accuracy and Reliability

• Error Reduction: Cleaning data removes errors, inaccuracies, and
inconsistencies, which can otherwise lead to incorrect conclusions and faulty
analyses.
• Improved Decision Making: Reliable and accurate data leads to better
insights and more informed decision-making.
2. Consistency

• Standardization: Data cleanup ensures that data follows a consistent
format, which is essential for accurate analysis. For example, date formats
should be consistent across the dataset.
• Uniformity: It helps in maintaining uniformity in data, making it easier to
process and analyze.
3. Efficiency

• Data Processing: Clean data is easier to process, leading to more efficient
data analysis workflows. It reduces the time and computational resources
required for data processing.
• Automation: Consistent and clean data is easier to automate, allowing for
more streamlined and repeatable data analysis processes.
4. Removing Redundancies

• Duplicates: Removing duplicate records prevents skewed analysis and
ensures that each data point is unique and valuable.
• Redundant Information: Eliminating redundant information simplifies
datasets, making them easier to work with.
5. Handling Missing Data

• Completeness: Filling in or appropriately handling missing data ensures
that analyses are based on complete datasets, reducing bias and improving
the reliability of results.
• Integrity: Properly addressing missing data maintains the integrity of the
dataset and ensures that the analysis reflects the true nature of the data.
6. Enhancing Data Quality

• Accuracy: Ensures that data accurately represents real-world conditions and
measurements.
• Validity: Ensures that data is valid and falls within the expected range and
format.
7. Improving Data Integration

• Combining Datasets: Clean data from different sources can be more easily
combined and integrated, facilitating comprehensive analyses and insights.
• Interoperability: Clean and standardized data improves interoperability
between different systems and tools.
8. Facilitating Advanced Analytics

• Machine Learning and AI: Clean data is essential for training reliable and
effective machine learning models. Poor-quality data can lead to model
inaccuracies and failures.
• Complex Analysis: Advanced analytics techniques often require high-
quality data to produce meaningful and actionable results.
9. Regulatory Compliance

• Data Governance: Ensuring data is clean helps in complying with data
governance and regulatory requirements, which may mandate accuracy and
consistency in data reporting and handling.
Common Data Cleanup Tasks

1. Removing Duplicates: Identifying and eliminating duplicate entries.
2. Handling Missing Values: Imputing or removing missing data points.
3. Correcting Errors: Fixing typographical errors and inaccuracies.
4. Standardizing Formats: Ensuring consistent data formats (e.g., date
formats, string cases).
5. Outlier Detection: Identifying and handling outliers that may distort
analysis.

By ensuring that the data is accurate, consistent, and complete, data cleanup helps
in building a strong foundation for any data-driven project.
Data Cleanup Basics

Effective data cleanup involves several key steps to ensure the data is accurate,
consistent, and ready for analysis. Here are the basic steps and techniques
involved:
1. Formatting

Definition: Ensuring data is in a consistent and usable format.

• Date and Time: Standardize date and time formats (e.g., YYYY-MM-DD).
• Text Data: Ensure consistent casing (e.g., all uppercase or lowercase),
remove leading/trailing spaces, and correct spelling errors.
• Numeric Data: Ensure numeric data is in a consistent format, and handle
decimal points and thousand separators correctly.

Example: Convert all dates to the format YYYY-MM-DD and ensure all names
are in proper case (e.g., "John Doe").
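A minimal pandas sketch of these formatting fixes; the DataFrame, column names, and values below are assumptions for illustration:

import pandas as pd

#hypothetical raw records with inconsistent formatting
df = pd.DataFrame({
    "date": ["01/05/2023", "06/12/2023"],
    "name": ["  john doe ", "JANE SMITH"],
    "price": ["1,200.50", "300"],
})

#standardize dates to YYYY-MM-DD
df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
#trim spaces and apply proper case to names
df["name"] = df["name"].str.strip().str.title()
#remove thousand separators and convert prices to numeric
df["price"] = df["price"].str.replace(",", "", regex=False).astype(float)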
2. Handling Outliers

Definition: Identifying and handling data points that deviate significantly from
other observations.

• Detection Methods:
o Statistical Methods: Using measures like the z-score, which indicates
how many standard deviations a data point is from the mean.
o Visualization: Box plots and scatter plots can visually identify outliers.

Handling Methods:

• Removal: If outliers are errors or irrelevant, they can be removed.
• Transformation: Apply transformations to reduce the impact of outliers
(e.g., log transformation).
• Capping: Set upper and lower bounds for data values.

Example: Identify and possibly remove data points where the z-score is greater
than 3.
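A minimal sketch of z-score-based outlier detection, reusing the iris DataFrame data loaded earlier and assuming a numeric column named SepalLengthCm:

#compute z-scores for a numeric column
col = data["SepalLengthCm"]
z_scores = (col - col.mean()) / col.std()

#flag rows whose absolute z-score exceeds 3, then keep only the rest
outliers = data[z_scores.abs() > 3]
data_no_outliers = data[z_scores.abs() <= 3]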
3. Removing Duplicates

Definition: Identifying and eliminating duplicate records.

• Exact Duplicates: Remove rows where all columns are identical.
• Partial Duplicates: Identify duplicates based on key columns (e.g., same
name and date of birth).

Example: Use functions like drop_duplicates() in pandas to remove duplicate rows
from a DataFrame.
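A short sketch covering both cases; the records and column names are assumptions for illustration:

import pandas as pd

#hypothetical records: the first two rows share a name and date of birth
df = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Jane Smith"],
    "date_of_birth": ["1990-05-01", "1990-05-01", "1985-11-23"],
    "city": ["Austin", "Dallas", "Boston"],
})

#exact duplicates: drop rows where every column is identical (none here)
df = df.drop_duplicates()

#partial duplicates: same name and date of birth, keep the first occurrence
df = df.drop_duplicates(subset=["name", "date_of_birth"], keep="first")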
4. Normalizing Data

Definition: Adjusting values measured on different scales to a common scale.

• Min-Max Scaling: Rescale the data to a fixed range, typically 0 to 1.
o Formula: X' = (X − X_min) / (X_max − X_min)

Example: Rescale all feature values in a dataset to be within the range of 0 to 1.


5. Standardizing Data

Definition: Transforming data to have a mean of 0 and a standard deviation of 1.

• Z-score Standardization:
o Formula: X' = (X − μ) / σ
o μ is the mean of the data.
o σ is the standard deviation of the data.

Example: Use z-score standardization to standardize the features of a dataset for a
machine learning algorithm.
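A minimal sketch of this, assuming scikit-learn is available; the feature matrix and column names are assumptions for illustration:

from sklearn.preprocessing import StandardScaler
import pandas as pd

#hypothetical feature matrix
X = pd.DataFrame({"height_cm": [150.0, 165.0, 180.0], "weight_kg": [55.0, 70.0, 85.0]})

#z-score standardization: each column ends up with mean 0 and unit standard deviation
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled)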
