
Data Pre-processing
Missing Data Imputation
Methods to impute missing values:
• Ignore the tuple.
• Fill in the missing value:
◦ Use a global value to fill in the missing value (e.g., "Unknown").
◦ Use a measure of central tendency for the attribute (e.g., the mean or median).
◦ Use the attribute mean or median for all samples belonging to the same class.
◦ Use the most probable value to fill in the missing value (e.g., via regression, kNN, or a decision tree).
Implementation in Python
sns.heatmap(df.isnull())
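A slightly fuller sketch (assuming df is a pandas DataFrame that has already been loaded, and using a hypothetical numeric column "age") combines the missingness heatmap above with two common imputation options:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Visualize which cells are missing (the boolean mask is rendered as a heatmap)
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Option 1: fill a single column with its mean (hypothetical column "age")
df["age"] = df["age"].fillna(df["age"].mean())

# Option 2: impute all numeric columns at once with the median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])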
Feature Selection
Feature subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
Why feature selection is needed:
Real-world datasets typically mix useful attributes with noisy and irrelevant ones.
A large number of attributes also slows down model training, and noisy or irrelevant attributes can degrade predictive performance.
Feature Selection
Best Subset Selection
To perform best subset selection, we fit a separate least squares regression for
each possible combination of the p predictors.
In total, there are 2^p possible combinations of the p predictors.
The best combination can be chosen as the one giving the lowest RSS (Residual Sum of Squares) and the highest R².
Limitation: The number of possible models that must be considered grows rapidly as p increases. In general, there are 2^p models that involve subsets of p predictors.
◦ So, if p = 10, then there are approximately 1,000 (2^10 = 1,024) possible models to be considered.
RSS and R²:
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where \hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij}
R^2 = 1 - \frac{RSS}{TSS}, where TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2
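As a minimal sketch of the procedure (assuming a pandas DataFrame X of predictors and a target vector y, both hypothetical), the code below finds, for each subset size k, the combination of predictors with the lowest RSS:

from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset_per_size(X, y):
    # For each subset size k, keep the predictor combination with the lowest RSS
    best = {}
    for k in range(1, X.shape[1] + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(X.columns, k):
            model = LinearRegression().fit(X[list(cols)], y)
            rss = np.sum((y - model.predict(X[list(cols)])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)
    return best

Because RSS and R² always improve as predictors are added, the per-size winners are then compared using cross-validated error, adjusted R², or a criterion such as Cp or BIC rather than raw RSS.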
Feature Selection
Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors, and
then adds predictors to the model, one-at-a-time, until all of the predictors are in
the model.
In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model.
Forward stepwise selection is computationally much cheaper than best subset selection when the number of predictors is large.
Total search space (models explored) = 1 + p(p + 1)/2 models.
Feature Selection
Forward Stepwise Selection
Though forward stepwise tends to do well in practice, it is not guaranteed to find the
best possible model out of all models containing subsets of the p predictors.
For instance, suppose that in a given data set with p = 3 predictors, the best possible
one-variable model contains X1, and the best possible two-variable model instead
contains X2 and X3. Then forward stepwise selection will fail to select the best possible
two-variable model, because M1 will contain X1, so M2 must also contain X1 together
with one additional variable.
Feature Selection
Backward Stepwise Selection
Unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.
Total search space (models explored) = 1 + p(p + 1)/2 models.
Backward selection requires that the number of samples n is larger than the number of
variables p (so that the full model can be fit). In contrast, forward stepwise can be used
even when n < p, and so is the only viable subset method when p is very large.
The best subset, forward stepwise, and backward stepwise selection
approaches generally give similar but not identical models.
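Forward and backward stepwise selection can both be sketched with scikit-learn's SequentialFeatureSelector, which performs the same greedy search but scores candidate features by cross-validation; X, y, and the target of five features are placeholders:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Forward: start with no predictors and greedily add the most useful one
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward: start with all predictors and greedily drop the least useful one
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print(forward.get_support())    # boolean mask of the selected predictors
print(backward.get_support())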
Feature Selection
Hybrid Stepwise Selection
In hybrid versions of forward and backward stepwise selection, variables are
added to the model sequentially, in analogy to forward selection. However, after
adding each new variable, the method may also remove any variables that no
longer provide an improvement in the model fit.
Some other Feature Selection Techniques
Chi Square
Information gain
LASSO etc.
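A brief sketch of these three techniques with scikit-learn (X and y are the usual feature matrix and class labels; the chi-square test additionally assumes non-negative feature values, and k = 10 and alpha = 0.1 are arbitrary choices):

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import Lasso

# Chi-square: keep the k features most strongly associated with the class label
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X, y)

# Information gain: mutual information between each feature and the class
X_info = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# LASSO: L1 regularization drives some coefficients exactly to zero,
# so the features with non-zero coefficients are the ones "selected"
lasso = Lasso(alpha=0.1).fit(X, y)
selected = [col for col, coef in zip(X.columns, lasso.coef_) if coef != 0]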
Principal Component Analysis (PCA)
Principal components analysis (PCA) searches for k n-
dimensional orthogonal vectors that can best be used to
represent the data, where k ≤ n.
The original data are thus projected onto a much smaller
space, resulting in dimensionality reduction.
Unlike attribute subset selection, which reduces the attribute
set size by retaining a subset of the initial set of attributes,
PCA “combines” the essence of attributes by creating an
alternative, smaller set of variables. The initial data can then
be projected onto this smaller set.
Principal Component Analysis (PCA)
Steps:
The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components.
The principal components are sorted in order of decreasing “significance” or strength.
The principal components essentially serve as a new set of axes for the data, providing
important information about variance.
Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
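These steps correspond directly to a standard scikit-learn pipeline; a minimal sketch, assuming a numeric feature matrix X and an arbitrary choice of two retained components:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: normalize the attributes so that no attribute dominates due to scale
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: compute orthonormal components sorted by explained variance
# and keep only the strongest ones (here, the first two)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)        # data projected onto the new axes

print(pca.components_)                   # loadings of each original attribute
print(pca.explained_variance_ratio_)     # "significance" of each component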
Principal Component Analysis (PCA)
The first principal component of a set of features X_1, X_2, \ldots, X_p is the normalized linear combination of the features
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p
that has the largest variance.
The coefficients \phi_{11}, \ldots, \phi_{p1} are the loadings: the weights of each variable in defining the direction of the first principal component vector in the data space.
The loadings are constrained so that their sum of squares is equal to one: \sum_{j=1}^{p} \phi_{j1}^2 = 1.
Example of PCA
Here p = 4 (Urban Population (UrbanPop), Murder, Rape, Assault) and n = 50 (the 50 US states).
PCA was performed after standardizing each variable to have mean zero and standard deviation one.
For instance, the first principal component is approximately
PC1 = 0.4 (UrbanPop) + 0.59 (Assault) + 0.5 (Rape) + 0.5 (Murder)
Example of PCA
Overall, the crime-related variables (Murder, Assault, and Rape) are located close to each other in the loading plot, while the UrbanPop variable is far from the other three.
States with large positive scores on
the first component, such as
California, Nevada and Florida, have
high crime rates.
California also has a high score on the
second component, indicating a high
level of urbanization, while the
opposite is true for states like
Mississippi.
Class Imbalance Problem
• Majority Class: the class (or classes) in an imbalanced classification predictive modeling problem that has many examples.
• Minority Class: the class in an imbalanced classification predictive modeling problem that has few examples.
Given two-class data, the data are class-imbalanced if the main class of interest (the positive class) is represented by only a few tuples, while the majority of tuples represent the negative class, or vice versa.
For multiclass-imbalanced data, the data distribution of each class differs substantially, where, again, the main class or classes of interest are represented by only a few tuples.
Class Imbalance Problem
Both oversampling and under-sampling change the training
data distribution so that the rare (positive) class is well
represented.
Oversampling works by resampling the positive tuples so
that the resulting training set contains an equal number of
positive and negative tuples.
Under-sampling works by decreasing the number of
negative tuples. It randomly eliminates tuples from the
majority (negative) class until there are an equal number of
positive and negative tuples.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# separate input features and target
y = df.Class
X = df.drop('Class', axis=1)

# set up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

# concatenate the training data back together
train = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
not_fraud = train[train.Class == 0]
fraud = train[train.Class == 1]

# upsample the minority class
fraud_upsampled = resample(fraud,
                           replace=True,              # sample with replacement
                           n_samples=len(not_fraud),  # match number in majority class
                           random_state=27)           # reproducible results

# combine majority class and upsampled minority class
upsampled = pd.concat([not_fraud, fraud_upsampled])
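Under-sampling is the mirror image; a short sketch continuing from the same training DataFrame and resample helper as above:

# downsample the majority class to the size of the minority class
not_fraud_downsampled = resample(not_fraud,
                                 replace=False,            # sample without replacement
                                 n_samples=len(fraud),     # match number in minority class
                                 random_state=27)          # reproducible results

# combine the downsampled majority class with the minority class
downsampled = pd.concat([not_fraud_downsampled, fraud])
print(downsampled.Class.value_counts())  # both classes now have the same count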
Bias and Variance
Bias refers to the systematic gap between the predicted values and the actual values.
With high bias, the predictions are consistently skewed in a particular direction away from the actual values.
Variance describes how scattered the predicted values are, i.e. how much the model's predictions change when it is trained on different samples of the data.
Bias and Variance
Mismanaging the bias-variance trade-off can lead to poor results: the model becomes either overly simple and inflexible (underfitting) or overly complex and flexible (overfitting).
Underfitting corresponds to low variance and high bias; overfitting corresponds to high variance and low bias.
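A small synthetic sketch of the trade-off (the data and the polynomial degrees are purely illustrative): a degree-1 fit underfits a non-linear target, while a high-degree fit tends to overfit, with a lower training error but a higher test error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)          # noisy non-linear target
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for degree in (1, 15):                                    # underfit vs. overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(x_tr)),  # training error
          mean_squared_error(y_te, model.predict(x_te)))  # test error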
