
Data Cleaning and Preparation
Methods for cleaning data and preprocessing
• In data science, extracting meaningful insights from data often begins
with data cleaning and preprocessing.
• These initial steps are like laying the foundation for a sturdy building.
• Without clean, well-structured data, any analysis or modelling effort can be riddled with errors and misinterpretations.
What is data cleaning?
• Data cleaning involves identifying and fixing errors in a dataset, such as incorrect, corrupted, duplicate, or incomplete records. When merging different data sources, there is a high chance of duplicated or mislabeled data. Even when such data appears correct, it can lead to unreliable outcomes and algorithms. The exact steps in data cleaning vary from dataset to dataset, so it is important to create a template for consistent and proper data cleaning practices.
The Significance of Data Cleaning and Preprocessing
• Garbage In, Garbage Out (GIGO): Inaccurate or incomplete data can
lead to unreliable results. Cleaning and preprocessing ensure the data
you analyze is as accurate and complete as possible.
• Consistency: Datasets often come from various sources, and the data may not be consistent in format or structure. Cleaning and preprocessing standardize the data, making it easier to work with.
• Removing Noise: Noise in data can come from various sources,
including measurement errors or outliers. Cleaning helps identify and
remove such noise to focus on the underlying patterns.
The Significance of Data Cleaning and Preprocessing
• Handling Missing Data: Real-world data is rarely complete. Cleaning
includes strategies for dealing with missing values, such as imputation
or removal.
• Feature Engineering: Preprocessing can involve creating new features
or transforming existing ones to improve the quality of input data for
machine learning models.
What is the difference between data cleaning and data transformation?
• Data cleaning is the process that removes data that does not belong
in your dataset.
• Data transformation is the process of converting data from one
format or structure into another.
• Data transformation is also referred to as data wrangling or data munging: transforming and mapping data from one "raw" form into another format suitable for warehousing and analysis.
Common Data Cleaning and Preprocessing Tasks
• Handling Missing Values
• Missing data is a common challenge. Strategies include imputation (replacing
missing values with estimates) or removing rows or columns with too many
missing values.
• Outlier Detection and Treatment
• Outliers are extreme values that can skew your analysis. Detecting and
addressing outliers is crucial for accurate results.
• Data Standardization and Normalization
• Standardizing and normalizing data scales variables to make them comparable
and removes biases due to different units or scales.
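As a quick illustration of the standardization and normalization task above, here is a minimal sketch using pandas and scikit-learn; the DataFrame and column names are hypothetical.

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical dataset with columns on very different scales
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [32000, 87000, 54000, 120000],
})

# Standardization: rescale each column to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(df[["age", "income"]])
df["age_std"] = scaled[:, 0]
df["income_std"] = scaled[:, 1]

# Normalization: rescale each column to the [0, 1] range
normed = MinMaxScaler().fit_transform(df[["age", "income"]])
df["age_norm"] = normed[:, 0]
df["income_norm"] = normed[:, 1]

print(df)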
Common Data Cleaning and Preprocessing Tasks
• Encoding Categorical Variables
• Machine learning models require numerical data. Categorical variables are
often converted into numerical representations using one-hot or label
encoding techniques.
• Removing Duplicates
• Duplicate entries can distort the analysis. Identifying and removing duplicates
is a fundamental cleaning step.
• Handling Data Types
• Ensuring that data types are consistent and appropriate for the analysis is
essential. For example, dates should be in date format, not as text.
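A minimal pandas sketch of the tasks above: one-hot encoding a categorical column, removing duplicate rows, and fixing data types. The DataFrame, columns, and values are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "city": ["Manila", "Cebu", "Manila", "Manila"],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-01-05", "2024-03-20"],
    "amount": ["100", "250", "100", "75"],
})

# Encoding categorical variables: one-hot encode the 'city' column
df = pd.get_dummies(df, columns=["city"])

# Removing duplicates: keep the first occurrence of each identical row
df = df.drop_duplicates()

# Handling data types: store dates as datetimes and numbers as numerics, not text
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["amount"] = pd.to_numeric(df["amount"])

print(df.dtypes)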
Common Data Cleaning and Preprocessing Tasks
• Feature Engineering
• Feature engineering involves creating new features or transforming existing
ones to improve model performance.
• Data Splitting
• Splitting data into training, validation, and test sets is essential for model
evaluation.
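A short sketch of these two tasks with hypothetical data: deriving a new feature from existing columns, then splitting into train, validation, and test sets with scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "total_spend": [100, 250, 80, 400, 160, 90],
    "num_orders": [2, 5, 1, 8, 4, 3],
    "churned": [0, 1, 0, 1, 0, 1],
})

# Feature engineering: derive average spend per order from existing columns
df["avg_order_value"] = df["total_spend"] / df["num_orders"]

X = df[["total_spend", "num_orders", "avg_order_value"]]
y = df["churned"]

# Data splitting: first hold out a test set, then carve a validation set from the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))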
How to clean data
Step 1: Remove duplicate or irrelevant observations
• Get rid of any unwanted data in your dataset, such as duplicates or
irrelevant information.
• Duplicates are common during data collection, especially when combining data from multiple sources.
• Removing duplicates is crucial in this process.
• Irrelevant data refers to observations that do not pertain to the specific
issue you are studying.
• For instance, if you are analyzing data on millennial customers and your
dataset contains information on older generations, you should eliminate
those irrelevant observations.
• Doing so can streamline your analysis, keep you focused on your main goal,
and make your dataset more manageable and efficient.
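A minimal pandas sketch of this step, assuming a hypothetical customer DataFrame with a birth_year column used to keep only millennial customers.

import pandas as pd

df = pd.DataFrame({
    "customer": ["Ana", "Ben", "Ana", "Carla"],
    "birth_year": [1990, 1965, 1990, 1996],
})

# Remove duplicate observations
df = df.drop_duplicates()

# Remove irrelevant observations: keep only millennials (roughly born 1981-1996)
df = df[df["birth_year"].between(1981, 1996)]

print(df)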
Step 2: Fix structural errors
• Structural errors arise when data is collected or transferred and include strange naming conventions, typos, or inconsistent capitalization.
• These discrepancies can lead to mislabeled groups or classifications.
• For instance, if you come across both "N/A" and "Not Applicable,"
they should be considered as the same category.
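A small sketch of how such structural errors might be fixed with pandas; the column name and the label mapping are hypothetical.

import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "approved", " Approved "]})

# Normalize whitespace and capitalization first
df["status"] = df["status"].str.strip().str.lower()

# Map different spellings of the same category to a single label
df["status"] = df["status"].replace({"n/a": "not applicable"})

print(df["status"].value_counts())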
Step 3: Filter unwanted outliers
• Sometimes when analyzing data, you may come across unusual
observations that don't seem to match the rest of the data.
• If you have a good reason to exclude these outliers, such as a data
entry error, removing them can improve the quality of your analysis.
• However, outliers can also be important in validating a theory you are
testing.
• It's important to remember that just because a data point is an
outlier, it doesn't mean it's wrong.
• It's essential to investigate the validity of the outlier.
• If an outlier turns out to be insignificant or a mistake, it may be
appropriate to remove it from the analysis.
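One common (though not the only) way to flag outliers is the interquartile range rule; the sketch below assumes a hypothetical numeric column and filters rows falling outside 1.5 times the IQR.

import pandas as pd

df = pd.DataFrame({"purchase_amount": [20, 25, 22, 30, 28, 1500]})

# Compute the interquartile range (IQR) of the column
q1 = df["purchase_amount"].quantile(0.25)
q3 = df["purchase_amount"].quantile(0.75)
iqr = q3 - q1

# Keep only rows within 1.5 * IQR of the quartiles;
# investigate flagged values before deciding to drop them for good
mask = df["purchase_amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(df_clean)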
Step 4: Handle missing data
• You can’t simply ignore missing data, because many algorithms will not accept missing values. There are a few ways to deal with missing data; none is optimal, but all can be considered.
• As a first option, you can drop observations that have missing values, but doing so discards information, so be mindful of this before you remove them.
• As a second option, you can impute missing values based on other observations; again, there is a risk of losing the integrity of the data, because you may be operating from assumptions rather than actual observations.
• As a third option, you might alter the way the data is used to effectively navigate null values.
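A short pandas sketch of the first two options (drop or impute); the DataFrame and the median-imputation choice are illustrative assumptions.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [30000, 45000, np.nan, 52000],
})

# Option 1: drop rows that contain any missing value (information is lost)
dropped = df.dropna()

# Option 2: impute missing values from other observations, e.g. the column median
imputed = df.fillna(df.median(numeric_only=True))

print(dropped)
print(imputed)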
Step 5: Validate and QA
• At the end of the data cleaning process, you should be able to answer
these questions as a part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?
• Incorrect or “dirty” data can lead to wrong conclusions which can
result in bad business decisions. Making false conclusions can be
embarrassing during a meeting when you realize your data is flawed.
It is crucial to establish a culture of quality data within your
organization to avoid these situations. Start by outlining the tools
and standards necessary to maintain high data quality.
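Basic validation can be partly automated; this is a minimal sketch with hypothetical rules (no nulls in key columns, expected data types, values within a plausible range).

import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-09"]),
    "amount": [120.0, 75.5, 210.0],
})

# Rule 1: key columns must not contain missing values
assert df["order_id"].notna().all(), "order_id has missing values"

# Rule 2: columns must have the expected data types
assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "order_date is not a date"

# Rule 3: values must fall inside a plausible business range
assert df["amount"].between(0, 100000).all(), "amount outside expected range"

print("All validation checks passed")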
Components of quality data
Determining the quality of data requires an examination of its characteristics, then
weighing those characteristics according to what is most important to your
organization and the application(s) for which they will be used.
5 characteristics of quality data
1. Validity. The degree to which your data conforms to defined
business rules or constraints.
2. Accuracy. The degree to which your data is close to the true values.
3. Completeness. The degree to which all required data is known.
4. Consistency. The degree to which your data is consistent within the same dataset and/or across multiple data sets.
5. Uniformity. The degree to which the data is specified using the
same unit of measure.
Advantages and benefits of data cleaning
• Having clean data will ultimately increase overall productivity and
allow for the highest quality information in your decision-making.
Benefits include:
1. Removal of errors when multiple sources of data are at play.
2. Fewer errors make for happier clients and less-frustrated employees.
3. Ability to map the different functions and what your data is intended to do.
4. Monitoring errors and better reporting to see where errors are coming
from, making it easier to fix incorrect or corrupt data for future
applications.
5. Using tools for data cleaning will make for more efficient business practices
and quicker decision-making.
• https://www.linkedin.com/pulse/data-cleaning-preprocessing-first-step-science-devsort/
• https://www.tableau.com/learn/articles/what-is-data-cleaning#definition
