4. Data Cleaning and Preparation
Methods for cleaning data and preprocessing
• In data science, extracting meaningful insights from data often begins
with data cleaning and preprocessing.
• These initial steps are like laying the foundation for a sturdy building.
• Without clean, well-structured data, any analysis or modelling effort
can be riddled with errors and misinterpretations.
What is data cleaning?
• Data cleaning involves identifying and fixing errors in a dataset, such
as incorrect, corrupted, duplicated, or incomplete data. When merging
different data sources, there is a high chance of duplicated or
mislabeled records. Incorrect data can make outcomes and algorithms
unreliable, even when it appears correct. The exact steps in data
cleaning vary from dataset to dataset, so it is important to create a
template for consistent and proper data cleaning practices.
The Significance of Data Cleaning and Preprocessing
• Garbage In, Garbage Out (GIGO): Inaccurate or incomplete data can
lead to unreliable results. Cleaning and preprocessing ensure the data
you analyze is as accurate and complete as possible.
• Consistency: Datasets often come from various sources, and the data
may not arrive in a consistent format or structure. Cleaning and
preprocessing standardize the data, making it easier to work with.
• Removing Noise: Noise in data can come from various sources,
including measurement errors or outliers. Cleaning helps identify and
remove such noise to focus on the underlying patterns.
The Significance of Data Cleaning and Preprocessing
• Handling Missing Data: Real-world data is rarely complete. Cleaning
includes strategies for dealing with missing values, such as imputation
or removal.
• Feature Engineering: Preprocessing can involve creating new features
or transforming existing ones to improve the quality of input data for
machine learning models.
What is the difference between data cleaning and data transformation?
• Data cleaning is the process that removes data that does not belong
in your dataset.
• Data transformation is the process of converting data from one
format or structure into another.
• Transformation is also referred to as data wrangling or data munging:
transforming and mapping data from one "raw" form into another
format suitable for warehousing and analysis.
Common Data Cleaning and Preprocessing Tasks
• Handling Missing Values
• Missing data is a common challenge. Strategies include imputation (replacing
missing values with estimates) or removing rows or columns with too many
missing values.
• Outlier Detection and Treatment
• Outliers are extreme values that can skew your analysis. Detecting and
addressing outliers is crucial for accurate results.
• Data Standardization and Normalization
• Standardizing and normalizing data scales variables to make them comparable
and removes biases due to different units or scales.
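A minimal sketch of these three tasks using pandas and scikit-learn; the DataFrame, its "income" column, and the 1.5 × IQR rule used here are illustrative assumptions, not a prescribed recipe.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one numeric column and a missing value
df = pd.DataFrame({"income": [42000, 51000, None, 48000, 250000, 47500]})

# Handling missing values: impute with the median (dropping rows is the alternative)
df["income"] = df["income"].fillna(df["income"].median())

# Outlier detection: flag values outside 1.5 * IQR of the middle 50%
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Standardization: rescale to zero mean and unit variance
# (MinMaxScaler would instead normalize to a fixed [0, 1] range)
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()
print(df)
```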
Common Data Cleaning and Preprocessing Tasks
• Encoding Categorical Variables
• Machine learning models require numerical data. Categorical variables are
often converted into numerical representations using one-hot or label
encoding techniques.
• Removing Duplicates
• Duplicate entries can distort the analysis. Identifying and removing duplicates
is a fundamental cleaning step.
• Handling Data Types
• Ensuring that data types are consistent and appropriate for the analysis is
essential. For example, dates should be in date format, not as text.
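A rough pandas illustration of these three tasks; the customer table, its column names, and the date format are hypothetical.

```python
import pandas as pd

# Hypothetical records: a categorical column, a duplicated row, and dates stored as text
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "segment": ["retail", "wholesale", "wholesale", "retail"],
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-03-22"],
})

# Encoding categorical variables: one-hot encode "segment"
df = pd.get_dummies(df, columns=["segment"])

# Removing duplicates: keep only the first occurrence of each repeated row
df = df.drop_duplicates()

# Handling data types: parse date strings into proper datetime values
df["signup_date"] = pd.to_datetime(df["signup_date"])
print(df.dtypes)
```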
Common Data Cleaning and Preprocessing Tasks
• Feature Engineering
• Feature engineering involves creating new features or transforming existing
ones to improve model performance.
• Data Splitting
• Splitting data into training, validation, and test sets is essential for model
evaluation.
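A brief sketch of both ideas, assuming a pandas DataFrame and scikit-learn's train_test_split; the columns and the 25% test fraction are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; all column names are made up for the example
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 15.0, 11.0, 13.5],
    "quantity": [3, 1, 5, 2, 4, 2],
    "churned": [0, 1, 0, 1, 0, 1],
})

# Feature engineering: derive a new feature from existing columns
df["revenue"] = df["price"] * df["quantity"]

# Data splitting: hold out 25% of the rows for final evaluation
# (a further split of X_train can serve as a validation set)
X = df[["price", "quantity", "revenue"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```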
How to clean data
Step 1: Remove duplicate or irrelevant observations
• Get rid of any unwanted data in your dataset, such as duplicates or
irrelevant information.
• Duplicates are common during data collection, especially when
combining or receiving data from multiple sources.
• Removing duplicates is crucial in this process.
• Irrelevant data refers to observations that do not pertain to the specific
issue you are studying.
• For instance, if you are analyzing data on millennial customers and your
dataset contains information on older generations, you should eliminate
those irrelevant observations.
• Doing so can streamline your analysis, keep you focused on your main goal,
and make your dataset more manageable and efficient.
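A small pandas sketch of this step; the customer table, the "birth_year" column, and the millennial birth-year range used for filtering are assumptions made for the example.

```python
import pandas as pd

# Hypothetical customer table with one duplicated row
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "birth_year": [1990, 1985, 1985, 1955, 1995],
})

# Remove duplicate observations
df = df.drop_duplicates()

# Remove irrelevant observations: keep only millennial customers
# (roughly birth years 1981-1996 for the purposes of this example)
df = df[df["birth_year"].between(1981, 1996)]
print(df)
```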
Step 2: Fix structural errors
• Structural errors are inconsistencies such as strange naming
conventions, typos, or inconsistent capitalization introduced when
collecting or transferring data.
• These discrepancies can lead to mislabeled groups or classifications.
• For instance, if you come across both "N/A" and "Not Applicable,"
they should be considered as the same category.
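One possible way to fix such structural errors with pandas; the survey column and the category labels are invented for illustration.

```python
import pandas as pd

# Hypothetical responses with inconsistent capitalization, whitespace, and labels
df = pd.DataFrame({"status": ["Employed", "employed ", "N/A", "Not Applicable", "Self-Employed"]})

# Normalize whitespace and capitalization so identical categories match
df["status"] = df["status"].str.strip().str.lower()

# Map variant spellings of the same category to a single label
df["status"] = df["status"].replace({"n/a": "not applicable"})
print(df["status"].value_counts())
```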
Step 3: Filter unwanted outliers
• Sometimes when analyzing data, you may come across unusual
observations that don't seem to match the rest of the data.
• If you have a good reason to exclude these outliers, such as a data
entry error, removing them can improve the quality of your analysis.
• However, outliers can also be important in validating a theory you are
testing.
• It's important to remember that just because a data point is an
outlier, it doesn't mean it's wrong.
• It's essential to investigate the validity of the outlier.
• If an outlier turns out to be insignificant or a mistake, it may be
appropriate to remove it from the analysis.
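A sketch of one common way to surface suspect points before deciding what to do with them; the readings and the z-score cutoff are illustrative (a cutoff of 3 is the usual rule of thumb; a looser value is used here only because the sample is tiny).

```python
import pandas as pd

# Hypothetical sensor readings; one value looks suspicious
readings = pd.Series([20.1, 19.8, 20.5, 21.0, 20.3, 19.9, 20.7, 98.7], name="temperature")

# Flag points far from the mean in standard-deviation units (z-scores)
z_scores = (readings - readings.mean()) / readings.std()
suspects = readings[z_scores.abs() > 2]
print(suspects)  # investigate these before deciding anything

# Only after confirming a point is a mistake (e.g. a data entry error) drop it
cleaned = readings.drop(suspects.index)
```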
Step 4: Handle missing data
• You can’t ignore missing data, because many algorithms will not accept
missing values. There are a few ways to deal with missing data. None is
ideal, but all can be considered.
• As a first option, you can drop observations that have missing values,
but doing so discards information, so be mindful of this before you
remove them.
• As a second option, you can impute missing values based on other
observations; here, too, you risk losing some integrity of the data,
because you may be operating from assumptions rather than actual
observations.
• As a third option, you might alter the way the data is used to effectively
navigate null values.
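A minimal pandas sketch of the three options above; the DataFrame and the choice of the median as an imputation value are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records with gaps in the "age" column
df = pd.DataFrame({"age": [34, None, 29, None, 41], "score": [88, 92, 79, 85, 90]})

# Option 1: drop observations with missing values (information is lost)
dropped = df.dropna(subset=["age"])

# Option 2: impute missing values from other observations (here, the median)
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Option 3: keep the gaps and work around them, e.g. many pandas
# aggregations skip NaN values by default
mean_age_ignoring_gaps = df["age"].mean()
```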
Step 5: Validate and QA
• At the end of the data cleaning process, you should be able to answer
these questions as a part of basic validation: