ET 610 - Data Preprocessing

Learning Analytics Tools

Data Preprocessing: Dr Ashwin T S

IIT Bombay
Data Processing

Why Data Processing?

Data → Information → Knowledge

What are the general types?

○ Pre-processing

○ Post-processing

What is Data Preprocessing?

"Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format"

OR

"Data preprocessing can refer to the manipulation or dropping of data before it is used, in order to ensure or enhance performance; it is an important step in the data mining process"
Why Data Preprocessing - Abstract Level

● Algorithm

Can we write an algorithm for any given problem?

Not always. It depends on the problem type: whether a solution is verifiable in polynomial time and whether the problem is solvable in polynomial time, i.e., whether an algorithm, and hence a solution, exists.

● Methods/architectures/frameworks

Can we get an optimum solution for any given problem?


Why Data Preprocessing - Abstract Level

● Algorithm Complexity

Time and Space

Common time complexities / growth rates: constant, logarithmic, linear, quadratic, exponential, factorial

● Efficiency

What does efficiency mean?

● Effectiveness

How effective should your algorithm be?
Why Data Preprocessing - Abstract Level

● Errors

The algorithm must handle errors in both the input data and the system specification

● Human Error

Slips

Mistakes

Violations

● System Error
Summary

● What data processing is at an abstract level
● Complexity of an algorithm
● What the terms Efficiency and Effectiveness mean
● The different types of error
Why is Data Preprocessing Important?

Preprocessing of data mainly checks data quality. Quality can be assessed along the following dimensions:

● Accuracy: whether the data entered is correct.
● Completeness: whether the data is available, or not recorded.
● Consistency: whether copies of the same data kept in different places match.
● Timeliness: whether the data is updated in time.
● Believability: whether the data is trustworthy.
● Interpretability: how understandable the data is.
Major Tasks in Data Preprocessing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data Cleaning

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Before (Adult dataset):

Sr. No | Gender | Pregnant
1      | Male   | No
2      | Female | Yes
3      | Male   | Yes
4      | Female | No
5      | Male   | Yes

After cleaning (inconsistent rows removed):

Sr. No | Gender | Pregnant
1      | Male   | No
2      | Female | Yes
4      | Female | No
Data Editing

Before (Adult dataset):

Sr. No | Gender | Pregnant
1      | Male   | No
2      | Female | Yes
3      | Male   | Yes
4      | Female | No
5      | Male   | Yes

After editing (Gender corrected for the inconsistent rows):

Sr. No | Gender | Pregnant
1      | Male   | No
2      | Female | Yes
3      | Female | Yes
4      | Female | No
5      | Female | Yes
Data Reduction

Before (Adult dataset):

Sr. No | Gender | Pregnant
1      | Male   | No
2      | Female | Yes
3      | Male   | Yes
4      | Female | No
5      | Male   | Yes

After (rows grouped by Gender):

Sr. No | Gender | Pregnant
2      | Female | Yes
4      | Female | No
1      | Male   | No
3      | Male   | Yes
5      | Male   | Yes

Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data Transformation and Data Integration

Normalization: the method of scaling data so that it can be represented in a smaller range, for example from -1.0 to 1.0.

What are the other possibilities?

Entity identification problem: identifying entities across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.

What are the other possibilities?

Summary
● Why Data Pre-processing
● What are the Major Tasks in data preprocessing
○ Data cleaning
○ Data integration
○ Data reduction
○ Data transformation
Missing Data

What is Missing Data?

Missing data is defined as values that are not stored (or not present) for some variable(s) in the given dataset.

How to find it?

Manually

In code: isnull() and notnull(), checking for NaN

Is this sufficient?
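As a minimal sketch of the code-based check (the column names and values below are illustrative, not from the course dataset):

```python
# Detecting missing values with pandas isnull()/notnull().
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "score": [85.0, np.nan, 72.0, np.nan],  # two scores were never recorded
})

# isnull() marks missing entries (NaN/None) as True; notnull() is its complement.
missing_mask = df["score"].isnull()
n_missing = missing_mask.sum()          # 2 missing scores
n_present = df["score"].notnull().sum() # 2 recorded scores
```

Note that isnull() only catches NaN/None: sentinel values such as -1 or the string "N/A" must first be converted (e.g. with replace), which is why a manual inspection is still useful and why "is this sufficient?" is worth asking.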
Why Is Data Missing From The Dataset?

Some of the reasons are listed below:

● Past data might get corrupted due to improper maintenance.
● Observations are not recorded for certain fields; there might be a failure in recording the values due to human error.
● The user has not provided the values intentionally.
Why Handle Missing Data?

● Many machine learning algorithms fail if the dataset contains missing values. However, algorithms like k-nearest neighbours and Naive Bayes can support data with missing values.
● You may end up building a biased machine learning model, leading to incorrect results, if the missing values are not handled properly.
● Missing data can lead to a lack of precision in the statistical analysis.
How to Handle?

1. Deleting the missing values
2. Imputing the missing values
● to impute: to assign (a value) to something by inference
Types of Missingness

Missing data is either non-ignorable (MNAR: Missing Not At Random) or ignorable (MCAR: Missing Completely At Random, and MAR: Missing At Random).

Jingjing Chen, Sharon Hunter, Krisztina Kisfalvi, Richard A. Lirio, "A hybrid approach of handling missing data under different missing data mechanisms: VISIBLE 1 and VARSITY trials for ulcerative colitis," Contemporary Clinical Trials, Volume 100, 2021, 106226, ISSN 1551-7144, https://doi.org/10.1016/j.cct.2020.106226; https://towardsdatascience.com/missing-data-cfd9dbfd11b7
Deleting the Missing Value

Deleting the entire row

Deleting the entire column

Note: if every row has some column value missing, you might end up deleting the whole dataset.
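Both deletion strategies can be sketched with pandas dropna (the toy frame below is made up; it also illustrates the warning above, since every row has some value missing):

```python
# Deleting rows vs. columns that contain missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [23, np.nan, 31, 27],
    "score": [88, 92, np.nan, 75],
    "notes": [np.nan, np.nan, np.nan, np.nan],  # a column that was never filled in
})

# Drop every row that has any missing value.
rows_kept = df.dropna()

# Drop only the columns that are entirely missing.
cols_kept = df.dropna(axis=1, how="all")
```

Here rows_kept ends up empty, because the all-NaN "notes" column makes every row incomplete: exactly the failure mode the note warns about. Dropping the useless column first (cols_kept) preserves most of the data.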
Imputing the Missing Value

Replacing with an arbitrary value

e.g., the corporate/work experience of an undergrad student can be set to zero

Replacing with the mean

the most common method of imputing missing values in numeric columns

Replacing with the mode

used in the case of categorical features
Imputing the Missing Value

Replacing with the median

used when there are outliers

Replacing with the previous value - forward fill

imputing values with the previous value instead of the mean, mode, or median is more appropriate in some cases; this is called forward fill, and is typically used for time series data

Replacing with the next value - backward fill
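The numeric imputation strategies above can be sketched on a toy series (the values are illustrative):

```python
# Mean, median, forward-fill, and backward-fill imputation with pandas.
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled   = s.fillna(s.mean())    # mean of the observed values (30.0)
median_filled = s.fillna(s.median())  # median: more robust when outliers are present
ffilled       = s.ffill()             # forward fill: carry the previous value ahead
bfilled       = s.bfill()             # backward fill: pull the next value back
```

For time series, ffill/bfill preserve the local ordering of the data, whereas mean/median imputation ignores it; that is why forward fill is the usual choice there.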


Imputing the Missing Value

Imputing missing values for categorical features:

● Impute the most frequent value

● Impute the value "missing", which treats it as a separate category

(one extra column in one-hot encoding)

Univariate approach

Multivariate approach
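A quick sketch of the two categorical strategies (the grade values are made up):

```python
# Categorical imputation: most frequent value vs. a dedicated "missing" category.
import numpy as np
import pandas as pd

grade = pd.Series(["A", "B", np.nan, "A", np.nan])

# Strategy 1: impute the most frequent value (the mode, ignoring NaN).
most_frequent = grade.fillna(grade.mode()[0])

# Strategy 2: treat missingness as its own category.
as_category = grade.fillna("missing")
```

Strategy 2 is what later produces the extra column under one-hot encoding: "missing" becomes one more level of the feature.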
Handling Missing Values: Deleting

Pros:
Complete removal of data with missing values results in a robust and highly accurate model
Deleting a particular row or column with no specific information is better, since it does not have a high weightage
Cons:
Loss of information and data
Works poorly if the percentage of missing values is high (say 30%) compared to the whole dataset
Replacing With Mean/Median/Mode

(Source: CS37300: Data Mining & Machine Learning, cs.purdue.edu)

Pros:

This is a better approach when the data size is small

It can prevent the data loss that results from removing rows and columns

Cons:

Imputing approximations adds variance and bias

Works poorly compared to multiple-imputation methods

Assigning A Unique Category

Pros:
Fewer possibilities with one extra category, resulting in low variance after one-hot encoding (since it is categorical)
Negates the loss of data by adding a unique category
Cons:
Adds less variance
Adds another feature to the model while encoding, which may result in poor performance
Predicting The Missing Values

Pros:
Imputing the missing variable is an improvement as long as the resulting bias is smaller than the omitted-variable bias
Yields unbiased estimates of the model parameters
Cons:
Bias also arises when an incomplete conditioning set is used for a categorical variable
The predictions are only a proxy for the true values
Using Algorithms That Support Missing Values

Pros:
Does not require creating a predictive model for each attribute with missing data in the dataset
Cons:
Correlation in the data is neglected
It is a very time-consuming process, which can be critical in data mining where large databases are being extracted
The choice of distance function (Euclidean, Manhattan, etc.) may not yield a robust result
Missing Data Handling

Gangadharan, Nishanthi; Turner, Richard; Field, Ray; Oliver, Stephen; Slater, Nigel; Dikicioglu, Duygu (2019). "Metaheuristic approaches in biopharmaceutical process development data analysis." Bioprocess and Biosystems Engineering, 42. doi:10.1007/s00449-019-02147-0.
Summary
● What is missing data
● Why to handle missing data
● What are the types of missingness
● What are the various ways to handle missing data
Outliers

Outliers are of three types, namely:

1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers

(Source: https://www.geeksforgeeks.org/)
Finding Outliers - Box Plot

Worked example (source: https://www.geeksforgeeks.org/):

data = [0, 1, 2, 3, 4, 5, 9]: Q1 = 1.5 and Q3 = 4.5, so IQR = Q3 - Q1 = 3. The upper fence is Q3 + 1.5 * IQR = 4.5 + 4.5 = 9, so the value 9 sits exactly on the fence and is not flagged as an outlier.

data = [0, 1, 2, 3, 4, 5, 10]: the quartiles are the same, and 10 > 9, so 10 is an outlier.

A further example to try: data = [0, 1, 2, 3, 6, 6, 6]
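The box-plot rule from the slide can be sketched as a small function (numpy's default linear interpolation reproduces the quartiles Q1 = 1.5 and Q3 = 4.5 used above):

```python
# Tukey's IQR rule for flagging outliers, as used by box plots.
import numpy as np

def iqr_outliers(data):
    """Return the values lying outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

no_outliers = iqr_outliers([0, 1, 2, 3, 4, 5, 9])    # 9 sits exactly on the fence
one_outlier = iqr_outliers([0, 1, 2, 3, 4, 5, 10])   # 10 exceeds the fence of 9
```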
Encoding Categorical Data

What is encoding categorical data?

Categorical encoding is a process where we transform categorical data into numerical data.

Why is encoding important?

The performance of a machine learning model depends not only on the model and the hyperparameters but also on how we process and feed different types of variables to the model. Since most machine learning models only accept numeric variables, preprocessing the categorical variables becomes a necessary step.
Label Encoding or Ordinal Encoding

Label encoding is mostly applicable only to ordinal data, i.e., categorical data with a meaningful order.

In label encoding, each label is converted into an integer value.

The process is easy and informative.

Why is label encoding not applicable to non-ordinal (nominal) data?

Because it causes a prioritization issue: the model reads an order into the integers that the categories do not actually have.

One-Hot Encoding

In one-hot encoding, we create a new variable for each level of a categorical feature. Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 represents the presence of that category.

One-hot encoding can introduce sparsity in the dataset: it creates multiple dummy features without adding much information, and the redundant column gives rise to the Dummy Variable Trap.
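Both schemes can be sketched in pandas (the feature name and its levels are illustrative, loosely echoing the Difficulty_Level example):

```python
# Ordinal (label) encoding vs. one-hot encoding.
import pandas as pd

df = pd.DataFrame({"difficulty": ["low", "high", "medium", "low"]})

# Ordinal encoding: only sensible here because the levels have a meaningful order.
order = {"low": 0, "medium": 1, "high": 2}
df["difficulty_ord"] = df["difficulty"].map(order)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["difficulty"], prefix="difficulty")

# drop_first=True removes one redundant column, avoiding the dummy variable trap.
one_hot_no_trap = pd.get_dummies(df["difficulty"], prefix="difficulty", drop_first=True)
```

With three levels, one_hot has three columns while one_hot_no_trap has two: the dropped column is fully determined by the others, which is exactly the redundancy behind the trap.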
Binary Encoding

In this encoding scheme, the categorical feature is first converted into numbers using an ordinal encoder. Then each number is transformed into its binary representation. After that, the binary value is split into different columns.
Scaling: Why Feature Scaling?

Dataset: Data Preprocessing course, with a dependent variable (Result) and 3 independent variables (Time_Spent, Age, and Difficulty_Level).

We can easily notice that the variables are not on the same scale: the range of Age is from 27 to 50, while Time_Spent goes from 48K seconds to 83K seconds.

The range of Time_Spent is much wider than the range of Age. This will cause issues in our models, since many machine learning models, such as k-means clustering and nearest-neighbour classification, are based on the Euclidean distance.
Scaling - Standardisation and Normalisation

Feature scaling is one of the most important data preprocessing steps in machine learning.

Normalization, or min-max scaling, is used to transform features to be on a similar scale. The new value is calculated as:

X_new = (X - X_min) / (X_max - X_min)

Standardization, or z-score normalization, is the transformation of features by subtracting the mean and dividing by the standard deviation. The result is often called a z-score:

X_new = (X - mean) / std
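Both formulas can be sketched directly with numpy (the values below are made up, loosely echoing the Time_Spent example in thousands of seconds):

```python
# Min-max normalization and z-score standardization of a feature.
import numpy as np

x = np.array([48.0, 60.0, 83.0])

# X_new = (X - X_min) / (X_max - X_min): result lies in [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# X_new = (X - mean) / std: result has zero mean and unit standard deviation.
x_std = (x - x.mean()) / x.std()
```

After normalization the smallest value maps to 0 and the largest to 1; after standardization the feature has mean 0 and standard deviation 1, but individual values are not bounded to any fixed range.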
Normalization vs Standardization:

● Normalization uses the minimum and maximum values of a feature for scaling; standardization uses the mean and standard deviation.
● Normalization is used when features are on different scales; standardization is used when we want zero mean and unit standard deviation.
● Normalization scales values to [0, 1] or [-1, 1]; standardization is not bounded to a certain range.
● Normalization is strongly affected by outliers; standardization is much less affected by outliers.
● Normalization is useful when we don't know the distribution; standardization is useful when the feature distribution is Normal (Gaussian).
● Normalization is often called min-max scaling; standardization is often called z-score normalization.

Open Ended Questions

1. Should we always scale our features?
2. Is there a single best scaling technique?
3. How do different scaling techniques affect different classifiers?
4. Should we consider the scaling technique an important hyperparameter of our model?

Major Sources of Data Preprocessing used in these lecture slides

● Data Mining: Concepts and Techniques, 3rd Edition. Jiawei Han, Micheline Kamber, Jian Pei
● https://www.geeksforgeeks.org/
● https://www.analyticsvidhya.com
● https://towardsdatascience.com
Summary
● Types of Outliers
○ Global
○ Collective
○ Contextual
● Categorical data Encoding
○ Label or Ordinal
○ One-Hot
○ Binary
● Data Transformation: Scaling
○ Normalization
○ Standardization
Overall Summary
● Why Data Pre-processing
● What are the Major Tasks in data preprocessing
○ Data cleaning
○ Data integration
○ Data reduction
○ Data transformation
● Missing Values
○ Types of missingness: MNAR, MAR, MCAR
● Outliers
○ Types of Outliers: Global, Collective, Contextual
● Categorical Data Encoding
○ Types of Categorical Data Encoding: Ordinal, One-hot and Binary
● Scaling
○ Types of Scaling: Normalization and Standardization
Thank You
