Data Preprocessing
Data Preprocessing covers:
• Data cleaning: handling missing values, outliers, duplicates
• Data transformation: scaling, normalization, encoding categorical variables
• Feature selection: selecting relevant features/columns
• Data merging: combining multiple datasets
Data cleaning: Data cleaning is one of the most important parts of machine learning and plays a significant role in building a model.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time
in this step because of the belief that “Better data beats fancier algorithms”.
Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.
Data cleaning involves the systematic identification and correction of errors,
inconsistencies, and inaccuracies within a dataset, encompassing tasks such as
handling missing values, removing duplicates, and addressing outliers. This
meticulous process is essential for enhancing the integrity of analyses, promoting
more accurate modeling, and ultimately facilitating informed decision-making based
on trustworthy and high-quality data.
The first common strategy for dealing with missing data is to delete the rows with
missing values. Typically, any row which has a missing value in any cell gets
deleted. However, this often means many rows will get removed, leading to loss of
information and data. Therefore, this method is typically not used when there are
few data samples.
We can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the
dataset.
Finally, we can use classification or regression models to predict missing values.
• Replace it with a constant value. This can be a good approach when the value is chosen in discussion with a domain expert for the data we are dealing with.
• Replace it with the mean or median. This is a decent approach when the data size is small, but it does add bias.
• Replace it with values derived from information in other columns. (These strategies are illustrated in the sketch below.)
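As a minimal sketch of these strategies, the following hypothetical pandas example drops rows with missing values and imputes the rest with a column statistic or a constant (the column names and values are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing values (NaN / None)
df = pd.DataFrame({
    "age": [25, np.nan, 38, 41, np.nan],
    "salary": [50000, 62000, np.nan, 71000, 58000],
    "city": ["Pune", "Mumbai", None, "Delhi", "Pune"],
})

# Strategy 1: delete rows that contain any missing value
dropped = df.dropna()

# Strategy 2: impute with a statistic computed from the same column
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())            # mean
imputed["salary"] = imputed["salary"].fillna(imputed["salary"].median()) # median

# Strategy 3: replace with a constant agreed with a domain expert
imputed["city"] = imputed["city"].fillna("Unknown")

print(dropped)
print(imputed)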
A data scientist can use several techniques to identify outliers and decide if they are
errors or novelties.
Numeric outlier
This is the simplest nonparametric technique and applies to one-dimensional data. The data is divided into quartiles and the interquartile range (IQR = Q3 − Q1) is computed. The range limits are then set as the upper and lower whiskers of a box plot, typically Q1 − 1.5·IQR and Q3 + 1.5·IQR. The data that falls outside those limits can be removed.
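A minimal sketch of this quartile-based check, assuming a small made-up one-dimensional sample and the usual 1.5 × IQR whisker rule:

import pandas as pd

# Hypothetical one-dimensional sample with an obvious outlier
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Whisker limits of a box plot: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
cleaned = s[(s >= lower) & (s <= upper)]
print(outliers.tolist())  # [95]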
Z-score
This parametric technique indicates how many standard deviations a given data point lies from the sample mean. It assumes a Gaussian distribution (a normal, bell-shaped curve). If the data is not normally distributed, it can first be transformed by scaling to give it a more normal appearance. The z-score of each data point is then calculated as z = (x − μ) / σ, and a cut-off threshold is chosen using a heuristic (rule of thumb), commonly 2 or 3 standard deviations. Data points that lie beyond that threshold are classified as outliers and removed. The z-score is a simple, powerful way to remove outliers, but it is only useful with small to medium data sets and cannot be used when the data does not follow a parametric (approximately normal) distribution.
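A short sketch of z-score filtering on a made-up sample; the cut-off of 2 standard deviations is just one common rule of thumb:

import numpy as np

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 25.0])  # 25.0 looks suspicious

# z-score of each point relative to the sample mean and standard deviation
z = (data - data.mean()) / data.std()

# Rule-of-thumb cut-off; 2 or 3 standard deviations are common choices
threshold = 2
cleaned = data[np.abs(z) < threshold]
print(cleaned)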
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clusters the data based on the density of points, grouping related points together. DBSCAN labels the data as core points, border points, and outliers. Core points lie in dense regions of the data, border points have enough nearby neighbors to be considered part of a cluster, and outliers belong to no cluster at all and can be disregarded from the data.
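A minimal sketch using scikit-learn's DBSCAN on synthetic 2-D data; the eps and min_samples values are assumptions that would need tuning on real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic data: one dense blob plus a few far-away points
blob = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
noise = np.array([[8.0, 8.0], [-7.0, 9.0], [10.0, -6.0]])
X = np.vstack([blob, noise])

# eps and min_samples control what counts as a dense neighborhood
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no cluster with the label -1
outliers = X[labels == -1]
print(outliers)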
Isolation forest
This method is effective for finding both novelties and outliers. It builds binary decision trees using randomly selected features and random split values. The trees together form a forest, and the path lengths needed to isolate each point are averaged. From these, an outlier score between 0 and 1 is calculated for each data point, where values close to 0 indicate normal points and values close to 1 indicate outliers.
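A brief sketch with scikit-learn's IsolationForest on synthetic data; note that scikit-learn reports anomalies through predict (-1 for outliers) and decision_function (lower is more anomalous) rather than a 0-to-1 score, and the contamination value is an assumption:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),  # normal points
               [[6, 6], [-7, 5]]])               # injected anomalies

# contamination is an assumed share of outliers in the data
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)

pred = iso.predict(X)               # +1 = normal, -1 = outlier
scores = iso.decision_function(X)   # lower scores indicate more anomalous points
print(X[pred == -1])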
We can use the box plot, or the box and whisker plot, to explore the dataset and
visualize the presence of outliers. The points that lie beyond the whiskers are
detected as outliers.
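A minimal plotting sketch, reusing the made-up sample from above:

import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95])

# Points drawn beyond the whiskers (here, 95) are the detected outliers
plt.boxplot(s)
plt.title("Box plot for outlier inspection")
plt.show()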
Handling Outliers
a. Removing Outliers
i) Listwise deletion: Remove entire rows that contain outliers.
ii) Trimming: Remove a fixed percentage (e.g., the top and bottom 1% or 5%) of the most extreme values.
b. Transforming Outliers (illustrated in the sketch after this list)
i) Winsorization: Cap or replace outliers with values at a specified percentile.
ii) Log Transformation: Apply a log transformation to reduce the impact of extreme values.
c. Imputation
Impute outliers with a value derived from statistical measures (mean, median) or more advanced imputation methods.
d. Treating as Anomaly: Treat outliers as anomalies and analyze them
separately. This is common in fraud detection or network security.
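A short sketch of winsorization and a log transformation on a made-up series; the 5th/95th percentile caps are an assumption:

import numpy as np
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95])

# Winsorization: cap values at the 5th and 95th percentiles
low, high = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=low, upper=high)

# Log transformation: compress the scale so extreme values have less impact
logged = np.log1p(s)  # log1p handles zero values safely

print(winsorized.tolist())
print(logged.round(2).tolist())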
Data Transformation:
Min-Max Scaling:
The objective of Min-Max scaling is to rescale the values of a column to a fixed range, usually [0, 1] or [-1, 1], using x' = (x − min) / (max − min). A drawback of bounding the data to this small, fixed range is that we end up with smaller standard deviations, which suppresses the weight of outliers in our data.
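A minimal sketch using scikit-learn's MinMaxScaler on made-up values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Scale to the default [0, 1] range; feature_range=(-1, 1) is also possible
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())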
Standardization (Z-Score Normalization):
Standardization is used to compare features that have different units or scales. Each value is rescaled by subtracting the column mean (μ) and dividing by the standard deviation (σ): z = (x − μ) / σ. This transforms your data so that the resulting distribution has a mean of 0 and a standard deviation of 1. Standardization is preferred when there are important outliers in our data and we don't want to remove them and lose their impact, since it does not bound the values to a fixed range.
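A minimal sketch using scikit-learn's StandardScaler on a small made-up height/weight table:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: height (cm) and weight (kg)
X = np.array([[170.0, 65.0], [180.0, 85.0], [160.0, 55.0], [175.0, 72.0]])

# z = (x - mean) / std applied column by column
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately 0 and 1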
2. Normalization
Note: The main difference between normalizing and scaling is that in normalization you are changing the shape of the distribution, while in scaling you are changing the range of your data. Normalizing is a useful method when you know the distribution is not Gaussian: it adjusts the values of your numeric data toward a common, more Gaussian-like distribution, whereas scaling shrinks or stretches the data to fit within a specific range.
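One way to illustrate this difference is to compare a range-changing scaler with a shape-changing transform; the sketch below uses scikit-learn's MinMaxScaler and a Yeo-Johnson PowerTransformer on synthetic skewed data (the choice of transform is an assumption, not the only way to normalize):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(500, 1))  # clearly non-Gaussian data

# Scaling: changes the range, keeps the shape of the distribution
scaled = MinMaxScaler().fit_transform(skewed)

# Normalizing (here via a Yeo-Johnson power transform): reshapes the distribution
normalized = PowerTransformer(method="yeo-johnson").fit_transform(skewed)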
Encoding Categorical Variables:
1. Label Encoding:
Label encoding assigns a unique integer to each category of a categorical variable. For example, a Height column can be encoded as follows:

Height      Height (encoded)
Tall        0
Medium      1
Short       2
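A minimal sketch reproducing this mapping with pandas; the explicit dictionary is chosen to match the table above, while scikit-learn's LabelEncoder would assign the integers in alphabetical order:

import pandas as pd

df = pd.DataFrame({"Height": ["Tall", "Medium", "Short", "Medium", "Tall"]})

# Explicit mapping reproducing the table above
mapping = {"Tall": 0, "Medium": 1, "Short": 2}
df["Height_encoded"] = df["Height"].map(mapping)
print(df)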
2. One-Hot Encoding:
One-hot encoding creates a new binary (0/1) column for each category of the variable.
Advantages:
• It allows the use of categorical variables in models that require numerical input.
• It can improve model performance by providing more information to the model about the categorical variable.
• It can help to avoid the problem of ordinality, which can occur when a categorical variable has a natural ordering (e.g. “small”, “medium”, “large”).
Consider the following data, where the fruit column has been label encoded and each row also has a price:

fruit     label     price
apple     1         5
mango     2         10
apple     1         15
orange    3         20
The output after applying one-hot encoding on the data is given as follows,
apple   mango   orange   price
1       0       0        5
0       1       0        10
1       0       0        15
0       0       1        20
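A minimal sketch producing this kind of output with pandas get_dummies (scikit-learn's OneHotEncoder is an alternative):

import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "apple", "orange"],
                   "price": [5, 10, 15, 20]})

# get_dummies creates one binary column per category
one_hot = pd.get_dummies(df, columns=["fruit"], dtype=int)
print(one_hot)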
3. Binary Encoding:
Binary encoding combines elements of label encoding and one-hot encoding. It first
assigns unique integer labels to each category and then represents these labels in
binary form. It’s especially useful when we have many categories, reducing the
dimensionality compared to one hot encoding.
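A small sketch of binary encoding done by hand with pandas and integer bit operations; the fruit values are made up, and libraries such as category_encoders provide a ready-made BinaryEncoder for the same purpose:

import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "mango", "apple", "orange", "banana"]})

# Step 1: assign an integer label to each category (1-based here)
codes = (df["fruit"].astype("category").cat.codes + 1).to_numpy()

# Step 2: write each integer in binary and split the bits into separate columns
n_bits = int(codes.max()).bit_length()
for i in range(n_bits):
    # Most significant bit first
    df[f"fruit_bin_{i}"] = (codes >> (n_bits - 1 - i)) & 1

print(df)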
Feature Selection
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection.
Each machine learning process depends on feature engineering, which mainly
contains two processes, which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective,
both are completely different from each other. The main difference between them is
that feature selection is about selecting the subset of the original feature set, whereas
feature extraction creates new features.
Feature selection is a way of reducing the input variable for the model by using only
relevant data to reduce overfitting in the model.
We can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.
1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with
other combinations. It trains the algorithm by using the subset of features iteratively.
Based on the output of the model, features are added or removed, and the model is trained again with the new feature set.
Some techniques of wrapper methods are:
o Forward Selection
o Backward Elimination
o Recursive Feature Elimination
2. Filter Methods
In filter methodology, features are selected on the basis of statistical measures computed independently of any machine learning algorithm. Some techniques of filter methods are:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
3. Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods by performing feature selection during model training. These methods are also iterative: each iteration is evaluated to find the features that contribute the most to training in that iteration. Some techniques of embedded methods are:
o Regularization (e.g., LASSO)
o Tree-based (Random Forest) feature importance
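As a rough illustration of a filter technique and a wrapper technique side by side, the sketch below applies SelectKBest with the chi-square test and Recursive Feature Elimination to scikit-learn's built-in iris data; the choice of k = 2 features is arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)

# Filter method: score features with a statistical test (chi-square)
filter_sel = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(X.columns[filter_sel.get_support()].tolist())

# Wrapper method: recursive feature elimination around an estimator
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(X.columns[wrapper_sel.get_support()].tolist())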
Data Merging
Data merging combines multiple datasets into a single dataset, typically by joining tables on common columns. The main types of joins are:
• Inner Join: Uses a comparison operator to match rows from two tables based on the values in common columns from each table.
• Left Join / Left Outer Join: Returns all the rows from the left table specified in the left outer join clause, not just the rows in which the columns match.
• Right Join / Right Outer Join: Returns all the rows from the right table specified in the right outer join clause, not just the rows in which the columns match.
• Full Outer Join: Returns all the rows in both the left and right tables.
• Cross Join (Cartesian Join): Returns all possible combinations of rows from the two tables.
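A minimal sketch of these join types with pandas merge on two made-up tables (the cross join requires pandas 1.2 or newer):

import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [50000, 60000, 55000]})

inner = pd.merge(employees, salaries, on="emp_id", how="inner")  # matching rows only
left = pd.merge(employees, salaries, on="emp_id", how="left")    # all rows from the left table
right = pd.merge(employees, salaries, on="emp_id", how="right")  # all rows from the right table
full = pd.merge(employees, salaries, on="emp_id", how="outer")   # all rows from both tables
cross = pd.merge(employees, salaries, how="cross")               # cartesian product

print(full)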
Chapter Ends…