DWDV UNIT 1
Unit-1
Data Wrangling:
Data wrangling is the process of transforming and structuring data from one raw
form into a desired format with the intent of improving data quality and making it
more consumable and useful for analytics or machine learning. It’s also sometimes
called data munging.
The data wrangling process often includes transforming, cleansing, and enriching
data from multiple sources. As a result of data wrangling, the data being analyzed
is more accurate and meaningful, leading to better solutions, decisions, and
outcomes.
Because of the increase in data collection and usage, especially diverse and
unstructured data from multiple data sources, organizations are now dealing with
larger amounts of raw data, and preparing it for analysis can be
time-consuming and costly.
Self-service approaches and analytics automation can speed up and increase the
accuracy of data wrangling processes by eliminating the errors that can be
introduced by people when they transform data using Excel or other manual
processes.
1. Cleanse: Data often contains errors as a result of manual entry, incomplete data,
data automatically collected from sensors, or even malfunctioning equipment. Data
cleansing corrects those entry errors, removes duplicates and outliers (if
appropriate), eliminates missing data, and imputes null values based on statistical
or conditional modeling to improve data quality.
2. Data Consistency: Since businesses often use data from multiple sources,
including third parties, the data can include many errors. An important step
of the data wrangling process is creating uniform datasets that help eliminate
the errors introduced by people and by different formatting standards across
third parties, which results in improved accuracy during analysis.
3. Improved Accuracy and Precision of Data: The way data is manipulated and
arranged can affect the accuracy and precision of analysis, especially when it’s
related to identifying relevant patterns and trends. Examples of good data
wrangling include organizing data by numerical values rather than categorical
labels, or arranging data in tables rather than in single columns. Grouping
similar data together improves accuracy.
4. Improved Communication and Decision-Making: Increased clarity and
improved accuracy reduce the time it takes for others to understand and interpret
data, leading to better understanding and communication between teams. This
benefit can lead to increased collaboration, transparency, and better decisions.
When working with multiple data sources, there are many chances for data to
be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and
algorithms are unreliable, even though they may look correct. Data cleaning in
data science using Python is changing or eliminating garbage, incorrect,
duplicate, corrupted, or incomplete data in a dataset. There is no absolute way
to describe the precise steps in data cleaning because the processes vary
from dataset to dataset. This general data preparation process is referred to
as data cleaning, data cleansing, or data scrubbing.
Data Cleaning using Pandas in Python is the most important task that a data
science professional should do. Wrong or bad-quality data can be detrimental to
processes and analysis. Clean data will ultimately increase overall productivity
and permit the very best quality information in decision-making.
Following are some reasons why Python data cleaning is essential:
1. Error-Free Data:
The quality of the data is the degree to which it follows the rules of particular
requirements. For example, if we have imported phone number data of different
customers, and in some places, we have added customers’ email addresses.
However, because our needs were straightforward for phone numbers, the email
addresses would be invalid data. Here, some pieces of data follow a specific
format. Some types of numbers have to be in a specific range.
Some data cells might require specific data types, like numeric or Boolean. In
every scenario, there are some mandatory constraints our data should follow.
Certain conditions affect multiple fields of data in a particular form. Particular
types of data have unique restrictions. Data will always be invalid if it isn’t in
the required format. Data cleaning in data science using Python will help us
simplify this process and avoid useless data values.
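The phone-number example above can be sketched with pandas string matching. The column values and the phone pattern below are assumptions for illustration, not part of any real dataset:

```python
import pandas as pd

# Hypothetical contact column mixing phone numbers with a stray email address.
contacts = pd.Series(["555-0100", "555-0101", "jane@example.com"])

# Keep only entries matching a simple phone pattern (the pattern is an assumption).
mask = contacts.str.fullmatch(r"\d{3}-\d{4}")
phones = contacts[mask]     # valid entries
invalid = contacts[~mask]   # the email address fails the format constraint
```

Filtering on a format constraint like this is one way to surface invalid data before it reaches analysis.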
3. Accurate and Efficient:
Accuracy is the degree to which the data is close to the correct values. We
know that most data in a dataset is valid, so we should focus on establishing
its accuracy. Data can be valid in form and still be inaccurate. For example,
a customer's address may be stored in the specified format, yet not be the
customer's actual address; an email may contain an additional character that
makes it incorrect or invalid; the same can happen with a phone number. This
means that
we have to rely on data sources to cross-check the data to figure out if it’s
accurate or not. Depending on the kind of data we are using, we might be able
to find various resources that could help us in this regard for cleaning.
4. Complete Data:
Completeness is the degree to which we should know all the required values.
Completeness is a little more challenging to achieve than accuracy or quality
because it is nearly impossible to have all the information we need; only
known facts can be entered. We can try to complete data by redoing the data-gathering
activities like approaching the clients again, re-interviewing people, etc. For
example, we might need to enter every customer’s contact information.
However, a number of them might not have email addresses. In this case, we
have to leave those columns empty. If a system requires all columns to be
filled, we can enter placeholder values such as "missing" or "unknown."
However, entering such values does not mean that the data is complete; it
would still be considered incomplete.
5. Maintains Data Consistency:
To ensure the data is consistent within the same dataset or across multiple
datasets, we can measure consistency by comparing two similar systems. We
can also check the data values within the same dataset to see if they are
consistent. Consistency can be relational. For example, a customer’s age might
be 25, which is a valid value and also accurate, but it is also stated as a senior
citizen in the same system. In such cases, we must cross-check the data, similar
to measuring accuracy, and see which value is true. Is the client a 25-year-old?
Or is the client a senior citizen? Only one of these values can be true. There are
multiple ways to make your data consistent.
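The age-versus-senior-citizen contradiction above can be detected with a simple cross-check in pandas. The column names and the senior-age threshold of 60 are assumptions for illustration:

```python
import pandas as pd

# Hypothetical records reproducing the example: age 25 flagged as a senior citizen.
customers = pd.DataFrame({"age": [25, 70], "is_senior": [True, True]})

# Flag rows where the age and the senior flag contradict each other
# (60 as the senior threshold is an assumption).
conflicts = customers[(customers["age"] < 60) & customers["is_senior"]]
print(conflicts)
```

Rows surfaced this way still need cross-checking against a trusted source to decide which value is true.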
Data scientists spend a lot of time cleaning datasets and getting them into a
form they can work with. It is an essential skill of data scientists to work
with messy data, missing values, and inconsistent, noisy, or nonsensical data.
Python provides a library called Pandas that handles this smoothly. Pandas is
a popular Python library for data processing, cleaning, manipulation, and
analysis; the name stands for "Python Data Analysis Library." It provides
functionality for reading, processing, and writing CSV files. Numerous data
cleaning tools exist, but the Pandas library provides a fast and efficient way
to manage and explore data. It does that by providing us with Series and
DataFrames, which help us represent data efficiently and manipulate it in
various ways.
This article will use the Pandas module to clean our dataset.
We are using a simple dataset for data cleaning, i.e., the iris species dataset.
You can download this dataset from kaggle.com.
To start working with Pandas, we need first to import it. We are using Google
Colab as IDE to import Pandas in Google Colab.
#importing module
import pandas as pd
Step 1: Import Dataset
To import the dataset, we use the read_csv() function of pandas and store the
result in a DataFrame named data. Since the dataset is in tabular format,
read_csv() loads it directly into a DataFrame. A DataFrame is a
two-dimensional, mutable data structure in Python; it is a combination of rows
and columns, like an Excel sheet.
Python Code:
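The loading step might look like the sketch below. The filename Iris.csv is an assumption (it depends on how the Kaggle download is named); a small in-memory sample stands in for the file so the sketch runs on its own:

```python
import io
import pandas as pd

# In practice: data = pd.read_csv("Iris.csv")  # filename is an assumption
# A tiny in-memory sample of the iris dataset stands in for the file here.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,7.0,3.2,4.7,1.4,Iris-versicolor
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())  # by default, the first five rows (all three here)
```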
The head() function is a built-in pandas DataFrame method used to display the
first rows of the dataset. We can specify the number of rows by giving a
number within the parentheses; by default, it displays the first five rows of
the dataset. If we want to see the last five rows of the dataset, we use the
tail() function of the DataFrame like this:
#displaying last five rows of dataset
data.tail()
Step 2: Merge Dataset
Merging combines two datasets into one, lining up rows based on a shared or
common column, for data analysis. We can do this using the merge() function of
the DataFrame. Following is the syntax of the merge function:
DataFrame_name.merge(right, how='inner', on=None, left_on=None,
right_on=None, left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
However, in this case, we don’t need to merge two datasets, so we will skip this
step.
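Although merging is skipped here, the syntax above could be exercised like this sketch. Both frames and the shared Species key are hypothetical:

```python
import pandas as pd

# Hypothetical datasets sharing a "Species" key column.
left = pd.DataFrame({"Species": ["Iris-setosa", "Iris-versicolor"],
                     "MeanPetalLengthCm": [1.46, 4.26]})
right = pd.DataFrame({"Species": ["Iris-setosa", "Iris-versicolor"],
                      "Habitat": ["meadow", "wetland"]})

# Inner join: keep only rows whose key appears in both frames.
merged = left.merge(right, how="inner", on="Species")
print(merged)
```

Changing `how` to "left", "right", or "outer" controls which unmatched rows survive the join.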
Step 3: Rebuild Missing Data
We will use another function to find and fill in the missing data in the dataset.
There are several ways to find null values, if present, in the dataset. Let's
see them one by one:
Using isna().any()
data.isna().any()
This function returns a boolean value for each column, indicating whether any
null value is present in that column.
Using isna().any().sum()
data.isna().any().sum()
This chain returns a single number: the count of columns that contain at least
one null value. A result of 0 means no nulls are present.
There are no null values present in our dataset. But if any null values were
present, we could fill those places with another value using the fillna()
function of the DataFrame. Following is the syntax of the fillna() function:
DataFrame_name.fillna(value=None, method=None, axis=None,
inplace=False, limit=None, downcast=None)
This function fills NA/NaN entries with the specified value (for example, 0).
You may also drop rows with null values using the dropna() method when the
amount of missing data is small and unlikely to affect the overall analysis.
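If nulls were present, the fill-or-drop choice might look like this sketch. The toy values are hypothetical, since the iris dataset itself has no missing values:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing measurement.
df = pd.DataFrame({"SepalLengthCm": [5.1, np.nan, 7.0],
                   "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"]})

print(df.isna().any())  # per-column flag: SepalLengthCm is True

# Option 1: impute the missing value with the column mean.
filled = df.fillna({"SepalLengthCm": df["SepalLengthCm"].mean()})

# Option 2: drop the incomplete row entirely.
dropped = df.dropna()
```

Mean imputation is only one statistical choice; the right fill value depends on the column and the analysis.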
Step 4: Standardization and Normalization
This step is not needed for the dataset we are using. So, we will skip this step.
After removing null, duplicate, and incorrect values, we should verify the
dataset and its accuracy. In this step, we have to check that the data cleaned
so far makes sense. If the data is incomplete, we have to enrich it again
through data-gathering activities like approaching the clients again,
re-interviewing people, etc. Completeness is a little more challenging to
achieve than accuracy or quality in the dataset.
Step 7: Export Dataset
This is the last step of the data-cleaning process. After performing all the above
operations, the data is transformed into a clean dataset, and it is ready to export
for the next process in Data Science or Data Analysis.
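The export step is typically a single to_csv() call. The output filename below is an assumption; an in-memory buffer is used here so the sketch is self-contained:

```python
import io
import pandas as pd

# A minimal cleaned frame standing in for the real dataset.
df = pd.DataFrame({"Species": ["Iris-setosa"], "SepalLengthCm": [5.1]})

# In practice: df.to_csv("cleaned_iris.csv", index=False)
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
print(buffer.getvalue())
```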
Need of data cleanup:
When combining multiple data sources, there are many opportunities for data to
be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes
will vary from dataset to dataset. But it is crucial to establish a template
for your data cleaning process so you know you are doing it the right way
every time.
Data cleanup is a crucial step in data wrangling because it ensures the quality and
reliability of the data used for analysis and visualization. Here are some key
reasons why data cleanup is necessary:
1. Accuracy and Reliability
Combining Datasets: Clean data from different sources can be more easily
combined and integrated, facilitating comprehensive analyses and insights.
Interoperability: Clean and standardized data improves interoperability
between different systems and tools.
8. Facilitating Advanced Analytics
Machine Learning and AI: Clean data is essential for training reliable and
effective machine learning models. Poor-quality data can lead to model
inaccuracies and failures.
Complex Analysis: Advanced analytics techniques often require high-
quality data to produce meaningful and actionable results.
9. Regulatory Compliance
By ensuring that the data is accurate, consistent, and complete, data cleanup helps
in building a strong foundation for any data-driven project.
Data Cleanup Basics
Effective data cleanup involves several key steps to ensure the data is accurate,
consistent, and ready for analysis. Here are the basic steps and techniques
involved:
1. Formatting
Date and Time: Standardize date and time formats (e.g., YYYY-MM-DD).
Text Data: Ensure consistent casing (e.g., all uppercase or lowercase),
remove leading/trailing spaces, and correct spelling errors.
Numeric Data: Ensure numeric data is in a consistent format, and handle
decimal points and thousand separators correctly.
Example: Convert all dates to the format YYYY-MM-DD and ensure all names
are in proper case (e.g., "John Doe").
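The date and name standardization described above can be sketched in pandas. The column names, the raw values, and the day-first input date format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw values needing standardization.
df = pd.DataFrame({"name": ["  john DOE ", "JANE smith"],
                   "joined": ["03/01/2024", "15/02/2024"]})

# Proper-case the names and strip stray spaces.
df["name"] = df["name"].str.strip().str.title()

# Convert day-first dates to the YYYY-MM-DD standard (input format is assumed).
df["joined"] = pd.to_datetime(df["joined"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
```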
2. Handling Outliers
Definition: Identifying and handling data points that deviate significantly from
other observations.
Detection Methods:
o Statistical Methods: Using measures like the z-score, which indicates
how many standard deviations a data point is from the mean.
o Visualization: Box plots and scatter plots can visually identify
outliers.
Handling Methods:
Example: Identify and possibly remove data points where the z-score is greater
than 3.
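The z-score rule above can be sketched directly in pandas. The series below is hypothetical, with one deliberately extreme value:

```python
import pandas as pd

# Mostly similar measurements plus one hypothetical extreme value.
s = pd.Series([10, 11, 12, 13] * 7 + [12, 100])

# z-score: how many standard deviations each point sits from the mean.
z = (s - s.mean()) / s.std()

# Flag points more than 3 standard deviations from the mean.
outliers = s[z.abs() > 3]
print(outliers)
```

Whether to remove, cap, or keep such points depends on whether they are errors or genuine extremes.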
3. Removing Duplicates
Z-score Standardization:
o Formula: X' = (X − μ) / σ
o μ is the mean of the data.
o σ is the standard deviation of the data.
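For the "Removing Duplicates" step named above, pandas provides drop_duplicates(). A minimal sketch, with a hypothetical frame in which one row appears twice:

```python
import pandas as pd

# Hypothetical frame where the row with Id 2 appears twice.
df = pd.DataFrame({"Id": [1, 2, 2, 3],
                   "Species": ["Iris-setosa", "Iris-setosa",
                               "Iris-setosa", "Iris-versicolor"]})

deduped = df.drop_duplicates()             # drop fully identical rows
by_id = df.drop_duplicates(subset="Id")    # or treat Id as the uniqueness key
```

The `subset` parameter matters when rows differ only in columns that are not part of the record's identity.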