4. Data Cleaning and Preparation
Methods for cleaning data and preprocessing
• In data science, extracting meaningful insights from data often begins
with data cleaning and preprocessing.
• These initial steps are like laying the foundation for a sturdy building.
• Without clean, well-structured data, any analysis or modelling effort
can be riddled with errors and misinterpretations.
What is data cleaning?
• Data cleaning involves identifying and fixing errors in a dataset, such
as incorrect, corrupted, duplicated, or incomplete data. When merging
different data sources, there is a high chance of duplicated or
mislabeled records. Incorrect data can make outcomes and algorithms
unreliable, even when it appears correct. The exact steps in data
cleaning vary from dataset to dataset, so it is important to create a
template for consistent and proper data cleaning practices.
The Significance of Data Cleaning and Preprocessing
• Garbage In, Garbage Out (GIGO): Inaccurate or incomplete data can
lead to unreliable results. Cleaning and preprocessing ensure the data
you analyze is as accurate and complete as possible.
• Consistency: Datasets often come from various sources, and the data
may not arrive in a consistent format or structure. Cleaning and
preprocessing standardize the data, making it easier to work with.
• Removing Noise: Noise in data can come from various sources,
including measurement errors or outliers. Cleaning helps identify and
remove such noise to focus on the underlying patterns.
The Significance of Data Cleaning and Preprocessing
• Handling Missing Data: Real-world data is rarely complete. Cleaning
includes strategies for dealing with missing values, such as imputation
or removal.
• Feature Engineering: Preprocessing can involve creating new features
or transforming existing ones to improve the quality of input data for
machine learning models.
What is the difference between data cleaning and data transformation?
• Data cleaning is the process that removes data that does not belong
in your dataset.
• Data transformation is the process of converting data from one
format or structure into another.
• Transformation is also referred to as data wrangling or data munging:
transforming and mapping data from one "raw" form into another
format suitable for warehousing and analysis.
Common Data Cleaning and Preprocessing Tasks
• Handling Missing Values
• Missing data is a common challenge. Strategies include imputation (replacing
missing values with estimates) or removing rows or columns with too many
missing values.
• Outlier Detection and Treatment
• Outliers are extreme values that can skew your analysis. Detecting and
addressing outliers is crucial for accurate results.
• Data Standardization and Normalization
• Standardizing and normalizing data scales variables to make them comparable
and removes biases due to different units or scales.
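A minimal sketch of these three tasks using pandas and scikit-learn; the DataFrame, its "income" column, and the 1.5 × IQR rule used here are illustrative assumptions, not a prescribed recipe.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one numeric column and a missing value
df = pd.DataFrame({"income": [42000, 51000, None, 48000, 250000, 47500]})

# Handling missing values: impute with the median (dropping rows is the alternative)
df["income"] = df["income"].fillna(df["income"].median())

# Outlier detection: flag values outside 1.5 * IQR of the middle 50%
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Standardization: rescale to zero mean and unit variance
# (MinMaxScaler would instead normalize to a fixed [0, 1] range)
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()
print(df)
```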
Common Data Cleaning and Preprocessing Tasks
• Encoding Categorical Variables
• Machine learning models require numerical data. Categorical variables are
often converted into numerical representations using one-hot or label
encoding techniques.
• Removing Duplicates
• Duplicate entries can distort the analysis. Identifying and removing duplicates
is a fundamental cleaning step.
• Handling Data Types
• Ensuring that data types are consistent and appropriate for the analysis is
essential. For example, dates should be in date format, not as text.
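A rough pandas illustration of these three tasks; the customer table, its column names, and the date format are hypothetical.

```python
import pandas as pd

# Hypothetical records: a categorical column, a duplicated row, and dates stored as text
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "segment": ["retail", "wholesale", "wholesale", "retail"],
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-03-22"],
})

# Encoding categorical variables: one-hot encode "segment"
df = pd.get_dummies(df, columns=["segment"])

# Removing duplicates: keep only the first occurrence of each repeated row
df = df.drop_duplicates()

# Handling data types: parse date strings into proper datetime values
df["signup_date"] = pd.to_datetime(df["signup_date"])
print(df.dtypes)
```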
Common Data Cleaning and Preprocessing Tasks
• Feature Engineering
• Feature engineering involves creating new features or transforming existing
ones to improve model performance.
• Data Splitting
• Splitting data into training, validation, and test sets is essential for model
evaluation.
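A brief sketch of both ideas, assuming a pandas DataFrame and scikit-learn's train_test_split; the columns and the 25% test fraction are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; all column names are made up for the example
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 15.0, 11.0, 13.5],
    "quantity": [3, 1, 5, 2, 4, 2],
    "churned": [0, 1, 0, 1, 0, 1],
})

# Feature engineering: derive a new feature from existing columns
df["revenue"] = df["price"] * df["quantity"]

# Data splitting: hold out 25% of the rows for final evaluation
# (a further split of X_train can serve as a validation set)
X = df[["price", "quantity", "revenue"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```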
How to clean data
Step 1: Remove duplicate or irrelevant observations
• Get rid of any unwanted data in your dataset, such as duplicates or
irrelevant information.
• Duplicates are common during data collection, especially when
combining or receiving data from multiple sources.
• Removing duplicates is crucial in this process.
• Irrelevant data refers to observations that do not pertain to the specific
issue you are studying.
• For instance, if you are analyzing data on millennial customers and your
dataset contains information on older generations, you should eliminate
those irrelevant observations.
• Doing so can streamline your analysis, keep you focused on your main goal,
and make your dataset more manageable and efficient.
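A small pandas sketch of this step; the customer table, the "birth_year" column, and the millennial birth-year range used for filtering are assumptions made for the example.

```python
import pandas as pd

# Hypothetical customer table with one duplicated row
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "birth_year": [1990, 1985, 1985, 1955, 1995],
})

# Remove duplicate observations
df = df.drop_duplicates()

# Remove irrelevant observations: keep only millennial customers
# (roughly birth years 1981-1996 for the purposes of this example)
df = df[df["birth_year"].between(1981, 1996)]
print(df)
```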
Step 2: Fix structural errors
• Structural errors are inconsistencies such as strange naming
conventions, typos, or inconsistent capitalization introduced when
collecting or transferring data.
• These discrepancies can lead to mislabeled groups or classifications.
• For instance, if you come across both "N/A" and "Not Applicable,"
they should be considered as the same category.
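One possible way to fix such structural errors with pandas; the survey column and the category labels are invented for illustration.

```python
import pandas as pd

# Hypothetical responses with inconsistent capitalization, whitespace, and labels
df = pd.DataFrame({"status": ["Employed", "employed ", "N/A", "Not Applicable", "Self-Employed"]})

# Normalize whitespace and capitalization so identical categories match
df["status"] = df["status"].str.strip().str.lower()

# Map variant spellings of the same category to a single label
df["status"] = df["status"].replace({"n/a": "not applicable"})
print(df["status"].value_counts())
```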
Step 3: Filter unwanted outliers
• Sometimes when analyzing data, you may come across unusual
observations that don't seem to match the rest of the data.
• If you have a good reason to exclude these outliers, such as a data
entry error, removing them can improve the quality of your analysis.
• However, outliers can also be important in validating a theory you are
testing.
• It's important to remember that just because a data point is an
outlier, it doesn't mean it's wrong.
• It's essential to investigate the validity of the outlier.
• If an outlier turns out to be insignificant or a mistake, it may be
appropriate to remove it from the analysis.
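A sketch of one common way to surface suspect points before deciding what to do with them; the readings and the z-score cutoff are illustrative (a cutoff of 3 is the usual rule of thumb; a looser value is used here only because the sample is tiny).

```python
import pandas as pd

# Hypothetical sensor readings; one value looks suspicious
readings = pd.Series([20.1, 19.8, 20.5, 21.0, 20.3, 19.9, 20.7, 98.7], name="temperature")

# Flag points far from the mean in standard-deviation units (z-scores)
z_scores = (readings - readings.mean()) / readings.std()
suspects = readings[z_scores.abs() > 2]
print(suspects)  # investigate these before deciding anything

# Only after confirming a point is a mistake (e.g. a data entry error) drop it
cleaned = readings.drop(suspects.index)
```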
Step 4: Handle missing data
• You can’t ignore missing data, because many algorithms will not accept
missing values. There are a few ways to deal with missing data. None is
ideal, but all can be considered.
• As a first option, you can drop observations that have missing values,
but doing so discards information, so be mindful of this before you
remove them.
• As a second option, you can impute missing values based on other
observations; here, too, you risk losing some integrity of the data,
because you may be operating from assumptions rather than actual
observations.
• As a third option, you might alter the way the data is used to effectively
navigate null values.
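A minimal pandas sketch of the three options above; the DataFrame and the choice of the median as an imputation value are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records with gaps in the "age" column
df = pd.DataFrame({"age": [34, None, 29, None, 41], "score": [88, 92, 79, 85, 90]})

# Option 1: drop observations with missing values (information is lost)
dropped = df.dropna(subset=["age"])

# Option 2: impute missing values from other observations (here, the median)
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Option 3: keep the gaps and work around them, e.g. many pandas
# aggregations skip NaN values by default
mean_age_ignoring_gaps = df["age"].mean()
```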
Step 5: Validate and QA
• At the end of the data cleaning process, you should be able to answer
these questions as a part of basic validation: