Data Handling and Visualization 3rd Unit
Data Preprocessing is an important step in the Data Preparation stage of a Data Science
development lifecycle that ensures reliable, robust, and consistent results. The main
objective of this step is to check and ensure the quality of data before applying any Machine
Learning or Data Mining methods. Let's review some of its benefits -
Accuracy - Data Preprocessing ensures that input data is accurate and reliable by removing
manual entry errors, duplicates, etc.
Completeness - It ensures that missing values are handled, and data is complete for further
analysis.
Consistency - Data Preprocessing ensures that input data is consistent, i.e., the same data kept
in different places matches.
Timeliness - It checks whether data is updated regularly and on a timely basis.
Trustworthiness - It checks whether data comes from trustworthy sources.
Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw data
into an interpretable format.
Let’s explore a few of the key steps involved in the Data Preprocessing stage -
Data Cleaning
Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing values.
Common techniques include imputing or dropping missing values, removing duplicate records,
correcting entry errors, and smoothing noisy data.
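As a minimal sketch of these ideas (using pandas on a made-up toy table), duplicate removal and missing-value imputation can look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing age and one duplicate row.
df = pd.DataFrame({
    "age":  [25, 32, np.nan, 32],
    "city": ["Pune", "Delhi", "Delhi", "Delhi"],
})

df = df.drop_duplicates()                       # drop the repeated (32, Delhi) row
df["age"] = df["age"].fillna(df["age"].mean())  # impute the missing age with the mean

print(len(df), df["age"].isna().sum())  # 3 rows left, 0 missing values
```

Mean imputation is only one choice; depending on the variable, the median, mode, or a model-based estimate may be more appropriate.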
Data Integration
Data Integration can be defined as combining data from multiple sources into a unified view.
A few of the issues to be considered during Data Integration include schema matching (the
same attribute may appear under different names in different sources), redundant attributes,
and conflicting data values.
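One frequent integration issue, matching keys that are named differently across sources, can be sketched with pandas (the tables and column names below are made up for illustration):

```python
import pandas as pd

# Two hypothetical sources keyed on the same customer, but the key
# column is named differently in each source (a schema-matching issue).
orders = pd.DataFrame({"cust_id": [1, 2], "amount": [250, 400]})
profiles = pd.DataFrame({"customer": [1, 2], "city": ["Pune", "Delhi"]})

merged = orders.merge(profiles, left_on="cust_id", right_on="customer")
print(list(merged["city"]))  # ['Pune', 'Delhi']
```

An inner join is used here; an outer join would instead keep customers that appear in only one source, which is often the right call when auditing integration gaps.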
Data Reduction
Data Reduction is used to reduce the volume or size of the input data. Its main objective is to
lower storage and analysis costs while retaining the information needed for analysis. A few of
the popular techniques to perform Data Reduction include -
Dimensionality Reduction - It is the process of reducing the number of features in the input
dataset. It can be performed in various ways, such as selecting features with the highest
importance, Principal Component Analysis (PCA), etc.
Numerosity Reduction - In this method, various techniques can be applied to reduce the
volume of data by choosing alternative smaller representations of the data. For example, a
variable can be approximated by a regression model, and instead of storing the entire variable,
we can store the regression model to approximate it.
Data Compression - Data is encoded into a more compact representation. Compression can be
lossless or lossy, depending on whether information is lost during the process.
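The Dimensionality Reduction technique mentioned above, PCA, can be sketched from first principles with NumPy; the data here is synthetic, with a deliberately redundant third feature:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Third feature is almost a copy of the first, so it carries little new information.
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=100)])

Xc = X - X.mean(axis=0)           # center each feature
cov = np.cov(Xc, rowvar=False)    # 3x3 covariance matrix
vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
X_reduced = Xc @ vecs[:, -2:]     # project onto the 2 strongest components

print(X_reduced.shape)            # (100, 2)
```

Because the third feature is nearly redundant, the two retained components explain almost all of the variance, so little information is lost by dropping one dimension.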
Data Transformation
Data Transformation is a process of converting data into a format that helps in building
efficient ML models and deriving better insights. A few of the most common methods for
Data Transformation include -
Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps identify
important features and detect patterns. Therefore, it can help in predicting trends or future
events.
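A minimal smoothing sketch, using a trailing moving average in plain Python (the window size is an arbitrary choice):

```python
def moving_average(series, window=3):
    """Smooth a numeric series with a trailing moving average.

    Early points use a shorter window, so the output has the same length.
    """
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out

noisy = [1, 9, 2, 8, 3, 7]
print(moving_average(noisy))  # smoothed values fluctuate far less than the raw series
```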
Aggregation - Data Aggregation is the process of transforming large volumes of data into
an organized, summarized format that is easier to understand and analyze.
For example, a company may look at monthly sales data of a product instead of raw sales data
to understand its performance better and forecast future sales.
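The monthly-sales example above can be sketched with a pandas group-by (the figures are made up):

```python
import pandas as pd

# Hypothetical raw sales records, aggregated to monthly totals.
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "units": [10, 15, 7, 3],
})
monthly = sales.groupby("month", sort=False)["units"].sum()
print(monthly["Jan"], monthly["Feb"])  # 25 10
```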
Discretization - Data Discretization is a process of converting numerical or continuous
variables into a set of intervals/bins, which makes the data easier to analyze. For example, an
age feature can be converted into intervals such as (0-10, 11-20, ..) or labels such as (child, young, …).
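The age-binning example can be sketched in plain Python; the bin edges and labels below are arbitrary choices:

```python
import bisect

bins = [10, 20, 60]                            # upper edges of: 0-10, 11-20, 21-60
labels = ["child", "young", "adult", "senior"]

def discretize(age):
    # bisect_left returns the index of the first edge >= age,
    # which is exactly the bin label we want.
    return labels[bisect.bisect_left(bins, age)]

print([discretize(a) for a in [5, 15, 35, 70]])  # ['child', 'young', 'adult', 'senior']
```

With pandas available, pd.cut performs the same binning in one call.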
Normalization - Data Normalization is a process of converting a numeric variable into a
specified range such as [-1,1], [0,1], etc. A few of the most common approaches to performing
normalization are Min-Max Normalization, Data Standardization or Data Scaling, etc.
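Min-Max Normalization, as a minimal sketch in plain Python:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

print(min_max_normalize([10, 20, 30]))         # [0.0, 0.5, 1.0]
print(min_max_normalize([10, 20, 30], -1, 1))  # [-1.0, 0.0, 1.0]
```

Standardization, by contrast, rescales a variable to zero mean and unit variance rather than to a fixed range.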
Data Preprocessing is important in the early stages of a Machine Learning and AI application
development lifecycle, where it prepares raw data before model training, analytics, and reporting.
Conclusion
Data Preprocessing is a process of converting raw datasets into a format that is consumable,
understandable, and usable for further analysis. It is an important step in any Data Analysis
project that will ensure the input dataset's accuracy, consistency, and completeness.
The key steps in this stage include - Data Cleaning, Data Integration, Data Reduction, and
Data Transformation.
It can help build accurate ML models, reduce analysis costs, and turn raw data into
dashboards and reports.
Data Acquisition