Data Preprocessing
Missing Data Imputation
Methods to impute missing values:
Ignore the tuple
Fill in the missing value
Use a global value to fill in the missing value (e.g. Unknown)
Use a measure of central tendency for the attribute (e.g. mean or median)
Use the attribute mean or median for all samples belonging to the same
class
Use the most probable value to fill in the missing value (e.g. regression,
kNN, decision tree)
Implementation in Python
import seaborn as sns
sns.heatmap(df.isnull())  # visualize which cells of the DataFrame are missing
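The heatmap above only shows where values are missing. Below is a minimal sketch of the imputation strategies listed earlier, assuming pandas is available and using a small synthetic DataFrame (all column names here are hypothetical):

import numpy as np
import pandas as pd

# Synthetic data with missing entries (hypothetical columns)
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
    "city":   ["Pune", None, "Delhi", "Pune", "Delhi"],
})

# 1. Ignore the tuple: drop any row that has a missing value
df_dropped = df.dropna()

# 2. Fill with a global constant for a categorical attribute
df["city"] = df["city"].fillna("Unknown")

# 3. Fill with a measure of central tendency (mean / median)
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# 4. Class-conditional mean (needs a hypothetical class column "label"):
# df["age"] = df.groupby("label")["age"].transform(lambda s: s.fillna(s.mean()))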
Feature Selection
Feature subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
Why feature selection mechanisms are needed:
A dataset generally contains noisy data, irrelevant data, and only a portion of genuinely useful data.
A large number of features also slows down model training, and with noisy and irrelevant features the model may not predict or perform well.
Feature Selection
Best Subset Selection
To perform best subset selection, we fit a separate least squares regression for
each possible combination of the p predictors.
In total, there are 2^p possible combinations of the p predictors.
An approach to find the best combination is to find the combination of predictors
giving the least RSS (Residual Sum of Squares) and the highest R^2.
Limitation: The number of possible models that must be considered grows rapidly
as p increases. In general, there are 2^p models that involve subsets of the p predictors.
◦ So, if p = 10, then there are approximately 1,000 possible models to be considered.
RSS = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij} \right)^2, \qquad R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
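A minimal sketch of best subset selection on synthetic data, assuming scikit-learn is available: every non-empty subset of the p predictors is fit by least squares and, within each size, the subset with the lowest RSS is kept (the final choice across sizes would then use adjusted R^2, AIC/BIC, or cross-validation).

from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # p = 4 predictors
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=100)

best = {}                                     # best subset found for each size k
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        model = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - model.predict(X[:, cols])) ** 2)
        if k not in best or rss < best[k][1]:
            best[k] = (subset, rss)

for k, (subset, rss) in best.items():
    print(f"size {k}: predictors {subset}, RSS = {rss:.1f}")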
Feature Selection
Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors, and
then adds predictors to the model, one-at-a-time, until all of the predictors are in
the model.
In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model.
With a large number of predictors, forward stepwise selection is computationally much cheaper than best subset selection.
Total search space (models explored) = 1 + p(p + 1)/2 models.
Feature Selection
Forward Stepwise Selection
Though forward stepwise tends to do well in practice, it is not guaranteed to find the
best possible model out of all models containing subsets of the p predictors.
For instance, suppose that in a given data set with p = 3 predictors, the best possible
one-variable model contains X1, and the best possible two-variable model instead
contains X2 and X3. Then forward stepwise selection will fail to select the best possible
two-variable model, because M1 will contain X1, so M2 must also contain X1 together
with one additional variable.
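A minimal sketch of forward stepwise selection, assuming scikit-learn 0.24+ and its SequentialFeatureSelector; the synthetic data and the stopping size of two predictors are arbitrary choices for illustration.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, 1] + 4 * X[:, 4] + rng.normal(size=100)

# Greedily add, at each step, the predictor that most improves the
# cross-validated fit, until two predictors are in the model.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print("selected predictors:", np.flatnonzero(sfs.get_support()))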
Feature Selection
Backward Stepwise Selection
Unlike forward stepwise selection, it begins with the full least squares model containing
all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Total search space (models explored) = 1 + p(p + 1)/2 models.
Backward selection requires that the number of samples n is larger than the number of
variables p (so that the full model can be fit). In contrast, forward stepwise can be used
even when n < p, and so is the only viable subset method when p is very large.
The best subset, forward stepwise, and backward stepwise selection
approaches generally give similar but not identical models.
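Continuing the forward sketch above (same imports, X, and y), backward stepwise selection is the same call with direction="backward"; it starts from the full model and drops the least useful predictor at each step, which is why it needs n > p.

# Backward stepwise: start from all p predictors, remove one at a time
sfs_backward = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
)
sfs_backward.fit(X, y)
print("selected predictors:", np.flatnonzero(sfs_backward.get_support()))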
Feature Selection
Hybrid Stepwise Selection
In hybrid versions of forward and backward stepwise selection, variables are
added to the model sequentially, in analogy to forward selection. However, after
adding each new variable, the method may also remove any variables that no
longer provide an improvement in the model fit.
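One way to sketch the hybrid idea is the "floating" search in the third-party mlxtend package (an assumption here, not part of the original slides): with forward=True and floating=True, each forward addition is followed by an attempt to drop previously selected features that no longer improve the fit.

import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X[:, 1] + 4 * X[:, 4] + rng.normal(size=100)

sffs = SFS(
    LinearRegression(),
    k_features=2,
    forward=True,
    floating=True,        # allow removals after each forward addition
    scoring="r2",
    cv=5,
)
sffs.fit(X, y)
print("selected predictors:", sffs.k_feature_idx_)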
Some Other Feature Selection Techniques
Chi Square
Information gain
LASSO etc.
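Minimal sketches of the listed techniques using scikit-learn on the Iris data (illustrative only; treating the class label as a numeric target for LASSO is a simplification):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import Lasso

X, y = load_iris(return_X_y=True)

# Chi-square: keep the k features most dependent on the class labels
# (chi2 requires non-negative feature values).
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Information gain: mutual information score for each feature
print("mutual information:", mutual_info_classif(X, y))

# LASSO: the L1 penalty drives some coefficients exactly to zero;
# the surviving (non-zero) coefficients indicate the selected features.
lasso = Lasso(alpha=0.1).fit(X, y)
print("kept by LASSO:", np.flatnonzero(lasso.coef_))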
Principal Component Analysis (PCA)
Principal components analysis (PCA) searches for k n-
dimensional orthogonal vectors that can best be used to
represent the data, where k ≤ n.
The original data are thus projected onto a much smaller
space, resulting in dimensionality reduction.
Unlike attribute subset selection, which reduces the attribute
set size by retaining a subset of the initial set of attributes,
PCA “combines” the essence of attributes by creating an
alternative, smaller set of variables. The initial data can then
be projected onto this smaller set.
Principal Component Analysis (PCA)
Steps:
The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components.
The principal components are sorted in order of decreasing “significance” or strength.
The principal components essentially serve as a new set of axes for the data, providing
important information about variance.
Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
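A minimal sketch of these steps with scikit-learn on the Iris data: normalize the attributes, compute the principal components, and keep the strongest k = 2.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)   # step 1: normalize the attributes
pca = PCA(n_components=2)                      # keep k = 2 components
X_reduced = pca.fit_transform(X_scaled)        # project onto the new axes

# Components are already sorted by decreasing explained variance
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_reduced.shape)       # (150, 2)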
Principal Component Analysis (PCA)
The first principal component of a set of features X_1, X_2, \ldots, X_p is the normalized linear combination of the features
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p
that has the largest variance, where "normalized" means \sum_{j=1}^{p} \phi_{j1}^2 = 1.