ET 610 - Data Preprocessing
IIT Bombay
Data Processing
Why Data Processing?
Data → Information → Knowledge
What are the general types?
○ Pre-processing
○ Post-processing
Not Always -
Problem Type | Verifiable in P time | Solvable in P time
Algorithm
Solution
● Methods/architectures/frameworks
● Errors
Handling things in the algorithm: both the input data and the system specification
● Human Error
Slip
Mistake
Violations
● System Error
Summary
● What is data processing at an abstract level
● Complexity of an Algorithm
● What the terms Efficiency and Effectiveness mean
● What are the different types of error
Why is Data Preprocessing important?
Preprocessing of data is mainly to check the data quality. The quality can be
checked by the following
1 Male No 1 Male No
4 Female No
5 Male Yes
Data Editing
1 Male No 1 Male No
4 Female No 4 Female No
Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data Transformation and Data Integration
Normalization: the method of scaling the data so that it can be represented in a
smaller range, for example from -1.0 to 1.0.
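The scaling described above can be sketched with min-max normalization. This is a minimal, dependency-free illustration, assuming linear rescaling is the intended method. The function name `min_max_scale` and the sample `ages` values are hypothetical and do not come from the slides.

```python
def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to the midpoint.
        return [(new_min + new_max) / 2.0 for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical feature values, for illustration only.
ages = [18, 25, 40, 62]
scaled = min_max_scale(ages)
print(scaled)
```

The smallest input maps to -1.0 and the largest to 1.0. Passing `new_min=0.0` gives the other common target range, [0, 1].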
● Many machine learning algorithms fail if the dataset contains missing values.
However, some algorithms, such as K-nearest neighbours and Naive Bayes, can be made to
work with data that has missing values.
● You may end up building a biased machine learning model which will lead to
incorrect results if the missing values are not handled properly.
● Missing data can lead to a lack of precision in the statistical analysis.
How to Handle?
1. Deleting the Missing values
2. Imputing the Missing Values
● To impute: to assign (a value) to something by inference.
Types of Missingness
Jingjing Chen, Sharon Hunter, Krisztina Kisfalvi, Richard A. Lirio, A hybrid approach of handling missing data under different missing data mechanisms: VISIBLE 1 and VARSITY trials for ulcerative colitis, Contemporary Clinical Trials, Volume 100, 2021, 106226, ISSN 1551-7144, https://doi.org/10.1016/j.cct.2020.106226.
https://towardsdatascience.com/missing-data-cfd9dbfd11b7
Deleting the Missing Value
Deleting the entire row
Note: If every row has some (column) value missing then you might end up deleting
the whole data.
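Row deletion, and the caveat in the note above, can be sketched as follows. This is a minimal illustration that assumes missing entries are represented as `None`. The record layout (`id`, `gender`, `response`) and the helper name `drop_incomplete_rows` are hypothetical.

```python
def drop_incomplete_rows(rows):
    """Keep only rows in which no column value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "gender": "Male", "response": "No"},
    {"id": 2, "gender": None, "response": "Yes"},
    {"id": 3, "gender": "Female", "response": None},
    {"id": 4, "gender": "Female", "response": "No"},
]
complete = drop_incomplete_rows(records)
print([row["id"] for row in complete])

# If every row has some value missing, nothing survives the deletion.
all_gappy = [{"id": 1, "gender": None}, {"id": 2, "gender": None}]
print(drop_incomplete_rows(all_gappy))
```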
Imputing the Missing Value
Replacing With Arbitrary Value
Imputing with the previous value, instead of the mean, mode or median,
is more appropriate in some cases. This is called forward fill.
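Forward fill can be sketched in a few lines. This is a minimal illustration assuming missing entries are `None` and the values are in a meaningful order (for example, time order). The name `forward_fill` and the sample `readings` are hypothetical.

```python
def forward_fill(values):
    """Replace each missing value (None) with the most recent observed value."""
    filled = []
    last_seen = None  # stays None until the first observed value appears
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled


# Hypothetical ordered readings; None marks a missing value.
readings = [None, 20.5, None, None, 21.0, None]
filled = forward_fill(readings)
print(filled)
```

A leading missing value has no previous value to copy, so it stays missing and needs another strategy.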
Univariate Approach
Multivariate Approach
Handling Missing Values: Deleting
Pros:
Complete removal of data with missing values can result in a robust and highly
accurate model
Deleting a particular row or a column with no specific information is better,
since it does not have a high weightage
Cons:
Loss of information and data
Works poorly if the percentage of missing values is high (say 30%)
compared to the whole dataset
Replacing With Mean/Median/Mode
CS37300: Data Mining & Machine Learning, cs.purdue.edu
Pros:
It prevents the data loss that results from removing rows and columns
Cons:
Gangadharan, Nishanthi & Turner, Richard & Field, Ray & Oliver, Stephen & Slater, Nigel & Dikicioglu, Duygu. (2019). Metaheuristic approaches in
biopharmaceutical process development data analysis. Bioprocess and Biosystems Engineering. 42. 10.1007/s00449-019-02147-0.
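Replacing missing values with the mean, median or mode, as discussed above, can be sketched with Python's `statistics` module. This is a minimal illustration assuming missing entries are `None` and the statistic is computed from the observed values only. The helper name `impute` and the sample `ages` are hypothetical.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Fill missing values (None) with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column; None marks a missing value.
ages = [20, None, 30, 30, None, 60]

print(impute(ages, mean))    # fills with the mean of 20, 30, 30, 60
print(impute(ages, median))  # fills with the median
print(impute(ages, mode))    # fills with the most frequent value
```

The mean and median apply to numeric columns. The mode also works for categorical columns.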
Summary
● What is missing data
● Why missing data must be handled
● What are the types of missingness
● What are the various ways to handle missing data
Outliers
Outliers are of three types, namely –
https://www.geeksforgeeks.org/
Finding Outliers - Box Plot
IQR = Q3 - Q1 = 4.5 - 1.5 = 3
Upper limit = Q3 + 1.5 × IQR = 4.5 + 1.5 × 3 = 9
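The box-plot arithmetic above (1.5 times the interquartile range added to the third quartile) can be sketched as follows. This is a minimal illustration, assuming the usual symmetric rule that also subtracts 1.5 times the interquartile range from the first quartile. The function names and the sample `data` are hypothetical. Quartiles are taken from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions.

```python
from statistics import quantiles


def fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the box-plot limits."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    lower, upper = fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# The quartiles from the worked example: Q1 = 1.5, Q3 = 4.5.
print(fences(1.5, 4.5))

# Hypothetical sample with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 30]
print(iqr_outliers(data))
```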
Encoding Categorical Data
What is encoding categorical data?
Categorical encoding is a process where we transform categorical data into numerical data.
Why is encoding important?
The performance of a machine learning model depends not only on the model and
the hyperparameters but also on how we process and feed different types of variables
to the model. Since most machine learning models only accept numeric variables,
preprocessing the categorical variables becomes a necessary step.
Label Encoding or Ordinal Encoding
Label Encoding is mostly applicable only to ordinal data, that is, categorical data with a
meaningful order.
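Encoding ordered categories as integers can be sketched as below. This is a minimal illustration in which the caller supplies the category order explicitly, so that the integers respect it. The helper name `ordinal_encode` and the `sizes` example are hypothetical.

```python
def ordinal_encode(values, order):
    """Map each category to its position in the given low-to-high order."""
    rank = {category: position for position, category in enumerate(order)}
    return [rank[v] for v in values]


# Hypothetical ordinal feature with a meaningful order.
sizes = ["small", "large", "medium", "small"]
encoded = ordinal_encode(sizes, order=["small", "medium", "large"])
print(encoded)
```

Because small < medium < large is preserved as 0 < 1 < 2, a model may treat the codes as ordered. For categories with no natural order, the same codes would imply an ordering that does not exist.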
No. | Normalization | Standardization
3 | Scales values between [0, 1] or [-1, 1]. | It is not bounded to a certain range.
7 | It is useful when we don't know about the distribution. | It is useful when the feature distribution is Normal or Gaussian.
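The rows above contrast scaling into a fixed range with scaling that is not bounded. Assuming the unbounded method is z-score standardization (subtract the mean, divide by the standard deviation), a minimal sketch looks like this. The function name `standardize` and the sample `scores` are hypothetical, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardize(scores)
print(z)
```

Unlike min-max scaling, the result is not confined to [0, 1] or [-1, 1]: the largest value in this sample lies two standard deviations above the mean.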