
Data Pre-processing
Missing Data Imputation
Methods to impute missing values:
• Ignore the tuple.
• Fill in the missing value:
◦ Use a global value to fill in the missing value (e.g., "Unknown").
◦ Use a measure of central tendency for the attribute (e.g., the mean or median).
◦ Use the attribute mean or median for all samples belonging to the same class.
◦ Use the most probable value to fill in the missing value (e.g., via regression, kNN, or a decision tree).
Implementation in Python
sns.heatmap(df.isnull())
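A slightly fuller sketch (assuming df is a pandas DataFrame that has already been loaded, and using a hypothetical numeric column "age") combines the missingness heatmap above with two common imputation options:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Visualize which cells are missing (the boolean mask is rendered as a heatmap)
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Option 1: fill a single column with its mean (hypothetical column "age")
df["age"] = df["age"].fillna(df["age"].mean())

# Option 2: impute all numeric columns at once with the median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])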
Feature Selection
Feature subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
Why feature selection is needed:
Real-world datasets typically mix useful attributes with noisy and irrelevant ones.
A large number of attributes also slows down model training, and noisy or irrelevant attributes can degrade predictive performance.
Feature Selection
Best Subset Selection
To perform best subset selection, we fit a separate least squares regression for
each possible combination of the p predictors.
In total, there are 2^p possible combinations of the p predictors.
The best combination can be chosen as the one giving the lowest RSS (Residual Sum of Squares) and the highest R².
Limitation: The number of possible models that must be considered grows rapidly as p increases. In general, there are 2^p models that involve subsets of p predictors.
◦ So, if p = 10, then there are approximately 1,000 (2^10 = 1,024) possible models to be considered.
RSS and R²:
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where \hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij}
R^2 = 1 - \frac{RSS}{TSS}, where TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2
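As a minimal sketch of the procedure (assuming a pandas DataFrame X of predictors and a target vector y, both hypothetical), the code below finds, for each subset size k, the combination of predictors with the lowest RSS:

from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset_per_size(X, y):
    # For each subset size k, keep the predictor combination with the lowest RSS
    best = {}
    for k in range(1, X.shape[1] + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(X.columns, k):
            model = LinearRegression().fit(X[list(cols)], y)
            rss = np.sum((y - model.predict(X[list(cols)])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)
    return best

Because RSS and R² always improve as predictors are added, the per-size winners are then compared using cross-validated error, adjusted R², or a criterion such as Cp or BIC rather than raw RSS.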
Feature Selection
Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors, and
then adds predictors to the model, one-at-a-time, until all of the predictors are in
the model.
In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model.
Forward stepwise selection is computationally much cheaper than best subset selection when the number of predictors is large.
Total search space (models explored) = 1 + p(p + 1)/2 models.
Feature Selection
Forward Stepwise Selection
Though forward stepwise tends to do well in practice, it is not guaranteed to find the
best possible model out of all models containing subsets of the p predictors.
For instance, suppose that in a given data set with p = 3 predictors, the best possible
one-variable model contains X1, and the best possible two-variable model instead
contains X2 and X3. Then forward stepwise selection will fail to select the best possible
two-variable model, because M1 will contain X1, so M2 must also contain X1 together
with one additional variable.
Feature Selection
Backward Stepwise Selection
Unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.
Total search space (models explored) = 1 + p(p + 1)/2 models.
Backward selection requires that the number of samples n is larger than the number of
variables p (so that the full model can be fit). In contrast, forward stepwise can be used
even when n < p, and so is the only viable subset method when p is very large.
The best subset, forward stepwise, and backward stepwise selection
approaches generally give similar but not identical models.
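Forward and backward stepwise selection can both be sketched with scikit-learn's SequentialFeatureSelector, which performs the same greedy search but scores candidate features by cross-validation; X, y, and the target of five features are placeholders:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Forward: start with no predictors and greedily add the most useful one
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward: start with all predictors and greedily drop the least useful one
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print(forward.get_support())    # boolean mask of the selected predictors
print(backward.get_support())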
Feature Selection
Hybrid Stepwise Selection
In hybrid versions of forward and backward stepwise selection, variables are
added to the model sequentially, in analogy to forward selection. However, after
adding each new variable, the method may also remove any variables that no
longer provide an improvement in the model fit.
Some other Feature Selection Techniques
Chi Square
Information gain
LASSO etc.
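A brief sketch of these three techniques with scikit-learn (X and y are the usual feature matrix and class labels; the chi-square test additionally assumes non-negative feature values, and k = 10 and alpha = 0.1 are arbitrary choices):

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import Lasso

# Chi-square: keep the k features most strongly associated with the class label
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X, y)

# Information gain: mutual information between each feature and the class
X_info = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# LASSO: L1 regularization drives some coefficients exactly to zero,
# so the features with non-zero coefficients are the ones "selected"
lasso = Lasso(alpha=0.1).fit(X, y)
selected = [col for col, coef in zip(X.columns, lasso.coef_) if coef != 0]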
Principal Component Analysis (PCA)
Principal components analysis (PCA) searches for k n-
dimensional orthogonal vectors that can best be used to
represent the data, where k ≤ n.
The original data are thus projected onto a much smaller
space, resulting in dimensionality reduction.
Unlike attribute subset selection, which reduces the attribute
set size by retaining a subset of the initial set of attributes,
PCA “combines” the essence of attributes by creating an
alternative, smaller set of variables. The initial data can then
be projected onto this smaller set.
Principal Component Analysis (PCA)
Steps:
The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components.
The principal components are sorted in order of decreasing “significance” or strength.
The principal components essentially serve as a new set of axes for the data, providing
important information about variance.
Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
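These steps correspond directly to a standard scikit-learn pipeline; a minimal sketch, assuming a numeric feature matrix X and an arbitrary choice of two retained components:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: normalize the attributes so that no attribute dominates due to scale
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: compute orthonormal components sorted by explained variance
# and keep only the strongest ones (here, the first two)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)        # data projected onto the new axes

print(pca.components_)                   # loadings of each original attribute
print(pca.explained_variance_ratio_)     # "significance" of each component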
Principal Component Analysis (PCA)
The first principal component of a set of features X_1, X_2, \ldots, X_p is the normalized linear combination of the features
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p
that has the largest variance.
The coefficients \phi_{11}, \ldots, \phi_{p1} are the loadings: the weights of each variable in defining the direction of the first principal component vector in the data space.
The loadings are constrained so that their sum of squares is equal to one: \sum_{j=1}^{p} \phi_{j1}^2 = 1.
Example of PCA
Here p = 4 (Urban Population (UrbanPop), Murder, Rape, Assault) and n = 50 (the 50 US states).
PCA was performed after standardizing each variable to have mean zero and standard deviation one.
For instance, the first principal component is approximately
PC1 = 0.4 (UrbanPop) + 0.59 (Assault) + 0.5 (Rape) + 0.5 (Murder)
Example of PCA
Overall, the crime-related variables (Murder, Assault, and Rape) are located close to each other in the loading plot, while the UrbanPop variable is far from the other three.
States with large positive scores on
the first component, such as
California, Nevada and Florida, have
high crime rates.
California also has a high score on the
second component, indicating a high
level of urbanization, while the
opposite is true for states like
Mississippi.
Class Imbalance Problem
• Majority Class: the class (or classes) in an imbalanced classification predictive modeling problem that has many examples.
• Minority Class: the class in an imbalanced classification predictive modeling problem that has few examples.
Given two-class data, the data are class-imbalanced if the main class of interest (the positive class) is represented by only a few tuples, while the majority of tuples represent the negative class, or vice versa.
For multiclass-imbalanced data, the data distribution of each class differs substantially, where, again, the main class or classes of interest are represented by only a few tuples.
Class Imbalance Problem
Both oversampling and under-sampling change the training
data distribution so that the rare (positive) class is well
represented.
Oversampling works by resampling the positive tuples so
that the resulting training set contains an equal number of
positive and negative tuples.
Under-sampling works by decreasing the number of
negative tuples. It randomly eliminates tuples from the
majority (negative) class until there are an equal number of
positive and negative tuples.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# separate input features and target
y = df.Class
X = df.drop('Class', axis=1)

# set up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

# concatenate the training data back together
train = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
not_fraud = train[train.Class == 0]
fraud = train[train.Class == 1]

# upsample the minority class
fraud_upsampled = resample(fraud,
                           replace=True,              # sample with replacement
                           n_samples=len(not_fraud),  # match number in majority class
                           random_state=27)           # reproducible results

# combine majority class and upsampled minority class
upsampled = pd.concat([not_fraud, fraud_upsampled])
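Under-sampling is the mirror image; a short sketch continuing from the same training DataFrame and resample helper as above:

# downsample the majority class to the size of the minority class
not_fraud_downsampled = resample(not_fraud,
                                 replace=False,            # sample without replacement
                                 n_samples=len(fraud),     # match number in minority class
                                 random_state=27)          # reproducible results

# combine the downsampled majority class with the minority class
downsampled = pd.concat([not_fraud_downsampled, fraud])
print(downsampled.Class.value_counts())  # both classes now have the same count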
Bias and Variance
Bias refers to the systematic gap between the predicted values and the actual values.
With high bias, the predictions are consistently skewed in a particular direction away from the actual values.
Variance describes how scattered the predicted values are, i.e. how much the model's predictions change when it is trained on different samples of the data.
Bias and Variance
Mismanaging the bias-variance trade-off can lead to poor results: the model becomes either overly simple and inflexible (underfitting) or overly complex and flexible (overfitting).
Underfitting corresponds to low variance and high bias; overfitting corresponds to high variance and low bias.
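A small synthetic sketch of the trade-off (the data and the polynomial degrees are purely illustrative): a degree-1 fit underfits a non-linear target, while a high-degree fit tends to overfit, with a lower training error but a higher test error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)          # noisy non-linear target
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for degree in (1, 15):                                    # underfit vs. overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(x_tr)),  # training error
          mean_squared_error(y_te, model.predict(x_te)))  # test error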
