
Chapter-3

Data Preprocessing
Data Preprocessing: Data cleaning: handling missing values, outliers, duplicates; Data transformation: scaling, normalization, encoding categorical variables; Feature selection: selecting relevant features/columns; Data merging: combining multiple datasets.

Data cleaning: Data cleaning is one of the most important parts of machine learning and plays a significant role in building a model.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data
cleaning is to ensure that the data is accurate, consistent, and free of errors, as
incorrect or inconsistent data can negatively impact the performance of the ML
model. Professional data scientists usually invest a very large portion of their time
in this step because of the belief that “Better data beats fancier algorithms”.
Data cleaning is essential because raw data is often noisy, incomplete, and
inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.
Data cleaning involves the systematic identification and correction of errors,
inconsistencies, and inaccuracies within a dataset, encompassing tasks such as
handling missing values, removing duplicates, and addressing outliers. This
meticulous process is essential for enhancing the integrity of analyses, promoting
more accurate modeling, and ultimately facilitating informed decision-making based
on trustworthy and high-quality data.

Steps to Perform Data Cleaning


Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset.
• Removal of Unwanted Observations: Identify and eliminate irrelevant
or redundant observations from the dataset. This step involves scrutinizing
data entries for duplicate records, irrelevant information, or data points that
do not contribute meaningfully to the analysis. Removing unwanted
observations streamlines the dataset, reducing noise and improving the
overall quality.
• Fixing Structural Errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity
in data representation. Fixing structure errors enhances data consistency
and facilitates accurate analysis and interpretation.
• Managing Unwanted Outliers: Identify and manage outliers, which are
data points significantly deviating from the norm. Depending on the
context, decide whether to remove outliers or transform them to minimize
their impact on analysis. Managing outliers is crucial for obtaining more
accurate and reliable insights from the data.
• Handling Missing Data: Devise strategies to handle missing data
effectively. This may involve imputing missing values based on statistical
methods, removing records with missing values, or employing advanced
imputation techniques. Handling missing data ensures a more complete
dataset, preventing biases and maintaining the integrity of analyses.

Handling missing values:

Identify the Missing Data Values


Most analytics projects will encounter three possible types of missing data values,
depending on whether there’s a relationship between the missing data and the other
data in the dataset:

• Missing completely at random (MCAR): In this case, there may be no
pattern as to why a column’s data is missing. For example, survey data is
missing because someone could not make it to an appointment, or an
administrator misplaces the test results he is supposed to enter into the
computer. The reason for the missing values is unrelated to the data in the
dataset.
• Missing at random (MAR): In this scenario, the reason the data is missing
in a column can be explained by the data in other columns. For example, a
school student is typically given a grade only when their score is above the
cutoff, so a missing grade can be explained by a score below the cutoff in the
score column. The reason for these missing values can be described by data
in another column.
• Missing not at random (MNAR): Sometimes, the missing value is related to
the value itself. For example, higher income people may not disclose their
incomes. Here, there is a correlation between the missing values and the actual
income. The missing values are not dependent on other variables in the
dataset.
Handling Missing Data Values

The first common strategy for dealing with missing data is to delete the rows with
missing values. Typically, any row which has a missing value in any cell gets
deleted. However, this often means many rows will get removed, leading to loss of
information and data. Therefore, this method is typically not used when there are
few data samples.
We can also impute the missing data. This can be based solely on information in the
column that has missing values, or it can be based on other columns present in the
dataset.
Finally, we can use classification or regression models to predict missing values.

1. Missing Values in Numerical Columns


The first approach is to replace the missing value with one of the following
strategies:

• Replace it with a constant value. This can be a good approach when used in
discussion with the domain expert for the data we are dealing with.
• Replace it with the mean or median. This is a decent approach when the data
size is small—but it does add bias.
• Replace it with values by using information from other columns.
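As a minimal illustration of these strategies, the sketch below uses a hypothetical pandas frame with a Salary column (the column name and values are made up for illustration):

import pandas as pd

# Hypothetical numeric column with gaps, for illustration only
df = pd.DataFrame({"Salary": [50000, None, 62000, None, 58000]})

df_dropped = df.dropna(subset=["Salary"])                    # delete rows with a missing Salary
df_const   = df.fillna({"Salary": 0})                        # replace with a constant value
df_mean    = df.fillna({"Salary": df["Salary"].mean()})      # replace with the mean
df_median  = df.fillna({"Salary": df["Salary"].median()})    # replace with the median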

2. Predicting Missing Values Using an Algorithm


Another way to handle missing values is to create a simple regression model. For
example, a Salary column with gaps can be predicted from the other columns in the dataset. If there are
missing values in the input columns, we must handle those conditions when creating
the predictive model. A simple way to manage this is to choose only the features that
do not have missing values or take the rows that do not have missing values in any
of the cells.
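A minimal sketch of this idea is shown below, assuming a hypothetical frame in which Salary is predicted from Age and Experience (column names chosen only for illustration); only the rows without missing values are used for training:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict missing Salary values from Age and Experience
df = pd.DataFrame({
    "Age":        [25, 32, 40, 28, 36],
    "Experience": [2, 8, 15, 4, 10],
    "Salary":     [30000, 52000, None, 38000, None],
})

known   = df[df["Salary"].notna()]      # rows usable for training
unknown = df[df["Salary"].isna()]       # rows whose Salary will be predicted

model = LinearRegression().fit(known[["Age", "Experience"]], known["Salary"])
df.loc[df["Salary"].isna(), "Salary"] = model.predict(unknown[["Age", "Experience"]])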

3. Missing Values in Categorical Columns


Dealing with missing data values in categorical columns is a lot easier than in
numerical columns. Simply replace the missing value with a constant value or the
most popular category. This is a good approach when the data size is small, though
it does add bias.
For example, say we have a column for Education with two possible values: High
School and College. If there are more people with a college degree in the dataset, we
can replace the missing values with College.
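A minimal pandas sketch of this mode-based replacement, using a made-up Education column:

import pandas as pd

df = pd.DataFrame({"Education": ["College", "High School", None, "College", None]})

# Replace missing categories with the most frequent value (the mode)
most_common = df["Education"].mode()[0]
df["Education"] = df["Education"].fillna(most_common)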
2. Handling Duplicates:
The simplest and most straightforward way to handle duplicate data is to delete it.
This can reduce the noise and redundancy in our data, as well as improve the
efficiency and accuracy of our models. However, we need to be careful and make
sure that we are not losing any valuable or relevant information by removing
duplicate data. We also need to consider the criteria and logic for choosing which
duplicates to keep or discard. For example, we can use the df.drop_duplicates()
method in pandas to remove duplicate rows, specifying the subset, keep, and inplace
arguments.
Removing duplicates:
In python using Pandas: df.drop_duplicates()
In SQL: Use DISTINCT keyword in SELECT statement.
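A short pandas sketch of drop_duplicates() on a made-up frame, showing the subset and keep arguments mentioned above:

import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 2, 3],
    "name": ["a", "b", "b", "c"],
})

# Drop fully duplicated rows, keeping the first occurrence (the default)
deduped = df.drop_duplicates()

# Drop rows duplicated only on selected columns, keeping the last occurrence
deduped_by_id = df.drop_duplicates(subset=["id"], keep="last")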

3. Outliers Detection and Treatment


Outlier detection is the process of identifying data points that lie far away from the
average; how such points are treated depends on what we are trying to accomplish.
Detecting and appropriately dealing with outliers is essential in data science to
ensure that statistical analysis and machine learning models are not unduly
influenced, and the results are accurate and reliable.
Techniques used for outlier detection.

A data scientist can use several techniques to identify outliers and decide if they are
errors or novelties.
Numeric outlier
This is the simplest nonparametric technique, where the data lies in a one-dimensional
space. The data is divided into quartiles, and the range limits are set as the upper and
lower whiskers of a box plot (typically 1.5 times the interquartile range beyond the
first and third quartiles). Then, the data that falls outside those limits can be removed.
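A minimal pandas sketch of this IQR / box-plot rule on made-up values (the 1.5 × IQR whisker limits are the usual convention):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])          # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # the box-plot "whiskers"

outliers = s[(s < lower) | (s > upper)]
cleaned  = s[(s >= lower) & (s <= upper)]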
Z-score
This parametric technique indicates how many standard deviations a data point lies
from the sample’s mean. It assumes a Gaussian distribution (a normal, bell-shaped
curve). However, if the data is not normally distributed, it can first be transformed by
scaling to give it a more normal shape. The z-score of each data point is then
calculated and, using a heuristic (rule-of-thumb) cut-off such as 2 or 3 standard
deviations, the data points that lie beyond that threshold can be classified as outliers
and removed from the equation. The Z-score is a simple, powerful way to remove
outliers, but it is only useful with medium to small data sets. It can’t be used for
nonparametric data.
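A minimal NumPy sketch of the z-score rule on made-up values, assuming a rule-of-thumb cut-off of 2 standard deviations:

import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 40.0])

z = (data - data.mean()) / data.std()    # standard scores
threshold = 2.0                          # rule-of-thumb cut-off (2 or 3 is common)
outliers = data[np.abs(z) > threshold]
cleaned  = data[np.abs(z) <= threshold]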
DBSCAN
This is Density-Based Spatial Clustering of Applications with Noise, a density-based
clustering algorithm. It groups data into clusters of closely packed, related points and
classifies each point as a core point, a border point, or an outlier. Core points form
the main data groups, border points have enough nearby neighbors to be considered
part of a group, and outliers belong to no cluster at all and can be disregarded from
the data.
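A minimal scikit-learn sketch of DBSCAN on made-up two-dimensional points; the eps and min_samples values are arbitrary choices for this toy data:

import numpy as np
from sklearn.cluster import DBSCAN

# Two tight clusters plus one far-away point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [25.0, 25.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
outliers = X[labels == -1]    # DBSCAN labels noise points with -1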
Isolation forest
This method is effective for finding novelties and outliers. It uses binary decision
trees which are constructed using randomly selected features and a random split
value. The trees together form a forest whose results are averaged. Outlier scores can
then be calculated, giving each data point a score from 0 to 1, where scores near 0
indicate normal points and scores near 1 indicate likely outliers.
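A minimal scikit-learn IsolationForest sketch on made-up values; note that scikit-learn reports anomalies with a predicted label of -1 and lower decision-function scores, rather than a 0-to-1 score:

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [95.0]])

iso = IsolationForest(contamination=0.2, random_state=42).fit(X)
labels = iso.predict(X)              # 1 = normal, -1 = outlier
scores = iso.decision_function(X)    # lower scores are more anomalous
outliers = X[labels == -1]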

Visualization for Outlier Detection

We can use the box plot, or the box and whisker plot, to explore the dataset and
visualize the presence of outliers. The points that lie beyond the whiskers are
detected as outliers.
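A minimal matplotlib sketch of such a box plot on made-up values:

import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])

plt.boxplot(s)    # points beyond the whiskers are drawn as individual markers
plt.title("Box plot for outlier inspection")
plt.show()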

Handling Outliers
a. Removing Outliers
i) Listwise deletion: Remove rows that contain outliers.
ii) Trimming: Remove extreme values while keeping a certain percentage (1%
or 5%) of data.
b. Transforming Outliers
i) Winsorization: Cap or replace outliers with values at a specified percentile.
ii) Log Transformation: Apply a log transformation to reduce the impact of
extreme values.
c. Imputation
Impute outliers with a value derived from statistical measures (mean, median)
or more advanced imputation methods.
d. Treating as Anomaly: Treat outliers as anomalies and analyze them
separately. This is common in fraud detection or network security.
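A minimal pandas/NumPy sketch of winsorization (capping at the 5th and 95th percentiles, an arbitrary choice) and a log transformation, on made-up values:

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])

# Winsorization: cap values at the 5th and 95th percentiles
low, high = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=low, upper=high)

# Log transformation: compress the influence of extreme values
log_transformed = np.log1p(s)    # log1p also handles zeros gracefully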
Data Transformation:

Data transformation is the process of converting data from one format or structure
into another format or structure. It is a fundamental aspect of most data integration
and data management tasks such as data wrangling, data warehousing, data
integration, and application integration.
Data transformation is part of an ETL process and refers to preparing data for
analysis and modeling. This involves cleaning (removing duplicates, filling in
missing values), reshaping (converting currencies, pivoting tables), and computing
new dimensions and metrics.

Data transformation techniques include scaling, normalization, and encoding
categorical variables.

1. Scaling: Scaling is the process of transforming the features of a dataset
so that they fall within a specific range. Scaling is useful when we want
to compare two different variables on equal grounds. This is especially
useful with models that use distance measures. For example, models that
use Euclidean distance are sensitive to the magnitude of distance, so
scaling helps even out the weights of all the features. This is important
because if one variable is more heavily weighted than another, it
introduces bias into our analysis.

Min-Max Scaling:

The objective of Min-Max scaling is to rescale the values of a column to a fixed
range, usually [0, 1] or [-1, 1], by subtracting the column's minimum value and
dividing by its range (maximum minus minimum). A drawback of bounding the data
to a small, fixed range is that we end up with smaller standard deviations, which
suppresses the weight of outliers in our data.
Standardization (Z-Score Normalization):

Standardization is used to compare features that have different units or scales. This
is done by subtracting a measure of location, the mean (x − x̅), and dividing by a
measure of scale, the standard deviation (σ). This transforms the data so that the
resulting distribution has a mean of 0 and a standard deviation of 1. This method is
useful (in comparison to normalization) when we have important outliers in our data
and we don’t want to remove them and lose their impact.
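A minimal scikit-learn sketch contrasting the two approaches on a made-up single-feature array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])

X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # values bounded to [0, 1]
X_std    = StandardScaler().fit_transform(X)                     # mean 0, standard deviation 1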

2. Normalization

Data normalization is a technique used in data mining to transform the values of a
dataset into a common scale. This is important because many machine learning
algorithms are sensitive to the scale of the input features and can produce better
results when the data is normalized.
There are several different normalization techniques that can be used in data mining,
including:

1. Min-Max normalization: This technique scales the values of a feature to
a range between 0 and 1. This is done by subtracting the minimum value
of the feature from each value, and then dividing it by the range of the
feature.
2. Z-score normalization: This technique scales the values of a feature to
have a mean of 0 and a standard deviation of 1. This is done by subtracting
the mean of the feature from each value, and then dividing it by the
standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing
the values of a feature by a power of 10.
4. Logarithmic transformation: This technique applies a logarithmic
transformation to the values of a feature. This can be useful for data with
a wide range of values, as it can help to reduce the impact of outliers.
5. Root transformation: This technique applies a square root transformation
to the values of a feature. This can be useful for data with a wide range of
values, as it can help to reduce the impact of outliers.
Note: Normalization should be applied only to the input features, not the target
variable, and different normalization techniques may work better for different
types of data and models.
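A minimal NumPy sketch of decimal scaling and the logarithmic transformation described above, on made-up values:

import numpy as np

x = np.array([12.0, 340.0, 7800.0, 95000.0])

# Decimal scaling: divide by 10^j, where j is the smallest power of 10
# that brings the largest absolute value below 1
j = int(np.ceil(np.log10(np.abs(x).max())))
x_decimal = x / (10 ** j)

# Logarithmic transformation: dampens very large values
x_log = np.log10(x)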

Note: The main difference between normalizing and scaling is that normalization
changes the shape of the distribution of your data, whereas scaling changes only its
range. Normalizing is a useful method when you know the distribution is not
Gaussian, while scaling simply shrinks or stretches the data to fit within a specific
range without altering the shape of its distribution.

3. Encoding Categorical Variables

The process of encoding categorical data into numerical data is called
"categorical encoding." It involves transforming categorical variables into a
numerical format suitable for machine learning models.

1. Label Encoding: Label encoding is a technique that is used to convert
categorical columns into numerical ones so that they can be fitted by
machine learning models which only take numerical data. It is an
important preprocessing step in a machine-learning project.

Example Of Label Encoding


Suppose we have a column Height in some dataset that has the values Tall, Medium,
and Short. To convert this categorical column into a numerical column, we apply
label encoding to it. After applying label encoding, the Height column becomes a
numerical column with the values 0, 1, and 2, where 0 is the label for Tall, 1 is the
label for Medium, and 2 is the label for Short.

Height      Height (encoded)
Tall        0
Medium      1
Short       2
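A minimal pandas sketch that reproduces the mapping in the table above; note that scikit-learn's LabelEncoder would instead assign codes alphabetically:

import pandas as pd

df = pd.DataFrame({"Height": ["Tall", "Medium", "Short", "Tall"]})

# Explicit mapping, matching the table above
mapping = {"Tall": 0, "Medium": 1, "Short": 2}
df["Height_encoded"] = df["Height"].map(mapping)

# Alternatively (alphabetical codes):
# from sklearn.preprocessing import LabelEncoder
# df["Height_encoded"] = LabelEncoder().fit_transform(df["Height"])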

2. One-Hot Encoding: One-hot encoding is a technique that we use
to represent categorical variables as numerical values in a machine
learning model. One-hot encoding is used when there is no ordinal
relationship among the categories, and each category is treated as a
separate independent feature. It creates a binary column for each
category, where a "1" indicates the presence of the category and "0" its
absence. This method is suitable for nominal variables.

Advantages:
It allows the use of categorical variables in models that require
numerical input.
It can improve model performance by providing more information to
the model about the categorical variable.
It can help to avoid the problem of ordinality, which can occur when a
categorical variable has a natural ordering (e.g. “small”, “medium”,
“large”).

One Hot Encoding Examples


In one-hot encoding, a categorical column (say, Gender) produces a separate column
for each of its labels, such as Male and Female. Wherever there is a Male, the value
will be 1 in the Male column and 0 in the Female column, and vice versa. Let's
understand with another example: consider the data below, where fruits, their
corresponding categorical values, and prices are given.

Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20

The output after applying one-hot encoding on the data is given as follows:

apple     mango     orange     price
1         0         0          5
0         1         0          10
1         0         0          15
0         0         1          20
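A minimal pandas sketch that produces an encoding like the table above with pd.get_dummies:

import pandas as pd

df = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "Price": [5, 10, 15, 20],
})

# One binary column per fruit category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)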

The disadvantages of using one hot encoding include:


1. It can lead to increased dimensionality, as a separate column is created for
each category in the variable. This can make the model more complex and
slower to train.

2. It can lead to sparse data, as most observations will have a value of 0 in
most of the one-hot encoded columns.
3. It can lead to overfitting, especially if there are many categories in the
variable and the sample size is relatively small.

In summary, one-hot encoding is a powerful technique for treating categorical data,
but it can lead to increased dimensionality, sparsity, and overfitting. It is important
to use it cautiously and to consider other methods such as ordinal encoding or binary
encoding.

3. Binary Encoding:
Binary encoding combines elements of label encoding and one-hot encoding. It first
assigns unique integer labels to each category and then represents these labels in
binary form. It’s especially useful when we have many categories, reducing the
dimensionality compared to one hot encoding.
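A minimal sketch using the third-party category_encoders package (an assumption: it must be installed separately, and the City column here is made up):

# pip install category_encoders
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Pune", "Delhi", "Chennai"]})

encoder = ce.BinaryEncoder(cols=["City"])
encoded = encoder.fit_transform(df)   # each city's integer label is expanded into binary-digit columns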

4. Frequency Encoding (Count Encoding)


Frequency encoding replaces each category with the count of how often it appears
in the dataset. This can be useful when we suspect that the frequency of a category
is related to the target variable.

5. Target Encoding (Mean Encoding)


Target encoding is used when we want to encode categorical variables based on their
relationship with the target variable. It replaces each category with the mean of the
target variable for that category.
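A minimal pandas sketch of both frequency encoding and target (mean) encoding, on a made-up City/Sales frame; in practice target encoding should be computed on training data only, to avoid leaking the target:

import pandas as pd

df = pd.DataFrame({
    "City":  ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi"],
    "Sales": [100, 200, 150, 120, 130],
})

# Frequency encoding: replace each category with how often it occurs
df["City_freq"] = df["City"].map(df["City"].value_counts())

# Target (mean) encoding: replace each category with the mean target value for that category
df["City_target"] = df["City"].map(df.groupby("City")["Sales"].mean())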

Feature Selection

A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection.
Each machine learning process depends on feature engineering, which mainly
contains two processes, which are Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective,
both are completely different from each other. The main difference between them is
that feature selection is about selecting the subset of the original feature set, whereas
feature extraction creates new features.
Feature selection is a way of reducing the input variables for the model by using only
relevant data, which helps reduce overfitting in the model.
We can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:

o Supervised Feature Selection technique


Supervised Feature selection techniques consider the target variable and can
be used for the labeled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can
be used for the unlabeled dataset.

There are mainly three techniques under supervised feature Selection:

1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with
other combinations. It trains the algorithm by using the subset of features iteratively.
Based on the output of the model, features are added or removed, and the model is
trained again with the updated feature set.
Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process, which begins


with an empty set of features. After each iteration, it keeps adding on a feature
and evaluates the performance to check whether it is improving the
performance or not. The process continues until the addition of a new
variable/feature does not improve the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach,
but it is the opposite of forward selection. This technique begins the process
by considering all the features and removes the least significant feature. This
elimination process continues until removing the features does not improve
the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the best
feature selection methods, as it evaluates every possible feature subset by brute
force. This method tries each possible combination of features and returns the
best-performing feature set.
o Recursive Feature Elimination-
Recursive feature elimination is a recursive greedy optimization approach,
where features are selected by recursively taking a smaller and smaller subset
of features. An estimator is trained with each set of features, and the
importance of each feature is determined using the coef_ attribute or the
feature_importances_ attribute.
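A minimal scikit-learn sketch of one wrapper-style technique, recursive feature elimination, on synthetic data (the estimator and the number of features to keep are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)    # boolean mask of the selected features
print(selector.ranking_)    # rank 1 marks a selected feature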
2. Filter Methods
In the Filter Method, features are selected on the basis of statistical measures. This
method does not depend on the learning algorithm and chooses the features as a pre-
processing step.
The filter method filters out the irrelevant features and redundant columns from the
model by using different metrics through ranking.
The advantage of using filter methods is that they need little computational time and
do not overfit the data.

Some common techniques of Filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio

Information Gain: Information gain measures the reduction in entropy achieved when
the dataset is split on a feature. It can be used as a feature selection technique by
calculating the information gain of each variable with respect to the target variable.
Chi-square Test: Chi-square test is a technique to determine the relationship between
the categorical variables. The chi-square value is calculated between each feature
and the target variable, and the desired number of features with the best chi-square
value is selected.
Fisher's Score:
Fisher's score is one of the popular supervised techniques of feature selection. It
ranks the variables according to Fisher's criterion in descending order, and we can
then select the variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used to evaluate each feature against a threshold
value. It is obtained by dividing the number of missing values in a column by the
total number of observations. Variables whose ratio exceeds the threshold can be
dropped.
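A minimal sketch of two filter-style checks: a chi-square test with scikit-learn's SelectKBest (on the Iris data, whose features are non-negative as chi2 requires) and the missing value ratio with pandas; the 0.4 threshold is an arbitrary choice:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Chi-square filter: keep the k features most associated with the target
X, y = load_iris(return_X_y=True)
X_best = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Missing value ratio: drop columns whose share of missing values exceeds a threshold
df = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4]})
ratio = df.isna().mean()              # fraction of missing values per column
df_kept = df.loc[:, ratio <= 0.4]     # column "a" (50% missing) is dropped here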
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features along with low computational cost. They are
fast, like filter methods, but more accurate than filter methods.

These methods are also iterative: each training iteration is evaluated, and the features
that contribute most to that iteration are identified. Some techniques of embedded
methods are:

o Regularization - Regularization adds a penalty term to the parameters of the
machine learning model to avoid overfitting. The penalty is applied to the
coefficients and can shrink some coefficients to exactly zero; the features with
zero coefficients can then be removed from the dataset. Common
regularization techniques include L1 regularization (Lasso regularization) and
Elastic Net (a combination of L1 and L2 regularization).
o Random Forest Importance - Different tree-based methods of feature
selection help us with feature importance to provide a way of selecting
features. Here, feature importance specifies which feature has more
importance in model building or has a great impact on the target variable.
Random Forest is such a tree-based method: a type of bagging algorithm that
aggregates a number of different decision trees. It automatically
ranks the nodes by their performance or decrease in the impurity (Gini
impurity) over all the trees. Nodes are arranged as per the impurity values,
and thus it allows pruning of trees below a specific node. The remaining nodes
create a subset of the most important features.
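A minimal scikit-learn sketch of both embedded ideas on synthetic regression data (the alpha value and forest size are arbitrary choices):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)

# L1 regularization: coefficients shrunk exactly to zero mark removable features
lasso = Lasso(alpha=1.0).fit(X, y)
selected_by_lasso = np.where(lasso.coef_ != 0)[0]

# Random forest importance: rank features by mean decrease in impurity
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]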

Data Merging: Combining Multiple Datasets


The most common method for merging data is through a process called “joining”.
There are several types of joins.

• Inner join: Uses a comparison operator to match rows from two tables based
on the values in common columns from each table.
• Left join/left outer join: Returns all the rows from the left table specified in
the left outer join clause, not just the rows in which the joined columns match.
• Right join/right outer join: Returns all the rows from the right table specified
in the right outer join clause, not just the rows in which the joined columns match.
• Full outer join: Returns all the rows in both the left and right tables.
• Cross join (Cartesian join): Returns all possible combinations of rows from
the two tables.
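A minimal pandas sketch of these join types with pd.merge, on made-up customer and order tables (the column names are illustrative):

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Ben", "Cara"]})
orders    = pd.DataFrame({"cust_id": [2, 3, 4], "amount": [250, 100, 75]})

inner = pd.merge(customers, orders, on="cust_id", how="inner")    # only matching cust_ids
left  = pd.merge(customers, orders, on="cust_id", how="left")     # all customers
right = pd.merge(customers, orders, on="cust_id", how="right")    # all orders
full  = pd.merge(customers, orders, on="cust_id", how="outer")    # all rows from both tables
cross = pd.merge(customers, orders, how="cross")                  # every combination of rows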

Chapter Ends…
