IML 2 - Data Preparation
UNIT # 2
The slides and code in this lecture are primarily taken from
Machine Learning with PyTorch and Scikit-Learn by Raschka et al.
The discussion and figures on CRISP-ML(Q) are taken from
https://ml-ops.org/content/crisp-ml
One of the easiest ways to deal with missing data is simply to remove the corresponding
features (columns) or training examples (rows) from the dataset entirely.
df.dropna(axis=0)        # drop rows that have at least one NaN
df.dropna(axis=1)        # drop columns that have at least one NaN in any row
df.dropna(how='all')     # only drop rows where all columns are NaN
df.dropna(thresh=4)      # drop rows that have fewer than 4 real values
df.dropna(subset=['C'])  # only drop rows where NaN appears in specific columns (here: 'C')
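A minimal sketch of these calls on a small illustrative DataFrame (the column names and values below are made up for demonstration):

import numpy as np
import pandas as pd

# illustrative DataFrame with missing entries (values are hypothetical)
df = pd.DataFrame({'A': [1.0, 5.0, 10.0],
                   'B': [2.0, 6.0, 11.0],
                   'C': [3.0, np.nan, 12.0],
                   'D': [4.0, 8.0, np.nan]})

print(df.dropna(axis=0))        # keeps only the first (complete) row
print(df.dropna(axis=1))        # keeps only columns A and B
print(df.dropna(subset=['C']))  # drops only the row with NaN in column 'C'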
IMPUTING MISSING VALUES
Removal of training examples or dropping of entire feature columns may not be feasible as
we might lose too much valuable data.
In this case, we can use different interpolation techniques to estimate the missing values
from the other training examples in our dataset.
One of the most common interpolation techniques is mean imputation, where we simply
replace the missing value with the mean value of the entire feature column.
A convenient way to achieve this is by using the SimpleImputer class from scikit-learn.
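A minimal sketch of mean imputation with SimpleImputer (the array values are made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# replace each NaN with the mean of its feature column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

X_imputed = imputer.fit_transform(X)  # the NaN becomes (2.0 + 8.0) / 2 = 5.0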
There is another popular class (KNNImputer) in scikit-learn that employs the k-nearest neighbors (kNN) method. We will discuss it after we have studied the kNN method.
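For reference, a minimal sketch of the scikit-learn API (the details of the method are deferred until we cover kNN; the array values are made up for illustration):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# each missing value is filled using the k nearest rows (here k = 2),
# with similarity measured on the non-missing features
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)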
The idea behind one-hot encoding is to create a new dummy feature for each unique value in
the nominal feature column.
Suppose the color feature has three possible values: blue, green, and red.
We would convert the color feature into three new features: blue, green, and red.
Binary values can then be used to indicate the particular color of an example; for example, a
blue example can be encoded as blue=1, green=0, red=0.
A convenient way to create those dummy features via one-hot encoding is to use the
get_dummies method implemented in pandas.
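A minimal sketch using pandas (the DataFrame is made up for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['blue', 'green', 'red'],
                   'price': [10.1, 13.5, 15.3]})

# creates one dummy column per unique color; non-categorical columns pass through
pd.get_dummies(df, columns=['color'])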
Scikit-learn also provides label encoding and ordinal encoding (via the LabelEncoder and OrdinalEncoder classes).
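A minimal sketch of both encoders (the category values are made up for illustration):

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder maps class labels to integers (intended for target labels)
le = LabelEncoder()
y = le.fit_transform(['blue', 'green', 'red', 'green'])  # -> [0, 1, 2, 1]

# OrdinalEncoder maps feature categories to integers, with an explicit order
oe = OrdinalEncoder(categories=[['S', 'M', 'L']])
X = oe.fit_transform([['M'], ['L'], ['S']])              # -> [[1.], [2.], [0.]]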
Feature scaling is a crucial step in our preprocessing pipeline that can easily be
forgotten.
Some ML algorithms (such as decision trees and random forests) are scale-invariant, so we don't need to worry about feature scaling for them.
However, many other ML algorithms behave much better if features are on the
same scale.
If we compute the similarity among records using Euclidean distance on a table with, say, Age and Salary columns, then Salary would dominate the distance simply because its values span a much larger range than Age, as the sketch below illustrates.
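A minimal numerical sketch of this effect (the Age and Salary values are made up for illustration):

import numpy as np

# two records: (Age, Salary); the numbers are hypothetical
a = np.array([25, 50_000])
b = np.array([55, 52_000])

# the Euclidean distance is dominated by the Salary difference (2000),
# even though the Age difference (30) is large relative to typical ages
dist = np.linalg.norm(a - b)  # ~2000.2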
Although normalization via min-max scaling is a commonly used technique that is useful
when we need values in a bounded interval, standardization can be more practical for
many machine learning algorithms, especially for optimization algorithms such as gradient
descent.
Using standardization, we center the feature columns at mean 0 with standard deviation 1
so that the feature columns have the same parameters as a standard normal distribution
(zero mean and unit variance), which makes it easier to learn the weights.
However, it must be emphasized that standardization does not change the shape of the
distribution, and it does not transform non-normally distributed data into normally
distributed data.
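A minimal sketch of both approaches using scikit-learn (the example array is made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25.0, 50_000.0],
              [35.0, 60_000.0],
              [55.0, 52_000.0]])

# min-max scaling: x_norm = (x - x_min) / (x_max - x_min), bounded to [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# standardization: x_std = (x - mean) / std, giving zero mean and unit variance per column
X_std = StandardScaler().fit_transform(X)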