Lecture 4 New Data Pre Processing
Data Pre-Processing
Learning Outcome
• Data pre-processing is the part of the data analytics and machine learning
process on which data scientists spend most of their time.
• Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviours or trends and is likely to contain many errors.
• Data pre-processing is used for resolving such issues.
Data Pre-Processing Techniques
• There are many techniques or steps for data pre-processing. Some of them are
as follows:
– Handling missing values
– Duplicate Data Points removal
– Encoding
– Discretization
• Note: Different types of data pre-processing techniques are used for different
types of data.
• In this lecture, we will focus on general data pre-processing techniques.
• In later lectures, we will discuss further data pre-processing techniques for
‘text’ data.
Missing Data
• For example, Sally and Jim have missing values for the ‘Quality of Work’
attribute
Why do missing values cause problems in data analysis?
• Missing values cause problems in data analysis:
– Misleading results: statistics computed from incomplete data (for example, an
average) can be distorted or cannot be computed at all.
https://www.bauer.uh.edu/jhess/documents/2.pdf
• Example: how do we compute the average of Age when some Age values are missing?
https://www.bauer.uh.edu/jhess/documents/2.pdf
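The Age question can be tried out directly. The sketch below is only an illustration: the `ages` values are made up (the original table is not reproduced here), and it assumes pandas and NumPy are installed. It shows that a plain NumPy mean returns NaN as soon as one value is missing, while pandas silently skips the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with one missing entry (NaN); not the slide's data.
ages = pd.Series([25.0, 30.0, np.nan, 45.0])

# NumPy's plain mean propagates NaN, so the result is not a usable number.
numpy_mean = np.mean(ages.to_numpy())

# pandas skips NaN by default (skipna=True): mean of the 3 observed values.
pandas_mean = ages.mean()

print(numpy_mean)   # nan
print(pandas_mean)  # 33.33... = (25 + 30 + 45) / 3
```

Neither answer is automatically "right": skipping the missing value is only safe if the missing ages are not systematically different from the observed ones.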
How to handle Missing data?
• Generally, the procedure for dealing with missing data is as follows:
– Identify the missing data and its cause. We can then take one of the
following approaches:
– A: Remove the rows containing the missing data
• Also called the naïve approach.
• Make sure missing data isn’t biased!
– B: Remove a particular column if it has more than 75% of missing values.
– C: Replace missing values with alternative values, also known as imputing the missing
values.
• Mean substitution – replacing the missing values with the mean of all observed values of the same variable
• Hot deck imputation – replacing missing values with values from a “similar” responding unit
• There are several other approaches to imputation as well
• Deciding between A, B, and C depends on which outcome you think will produce the
most reliable and accurate results.
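Approach C with mean substitution can be sketched as follows. This is a minimal example, assuming pandas is available. The names Sally and Jim come from the earlier slide, but the column name `quality_of_work`, the extra rows and all numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Sally and Jim have no 'Quality of Work' value.
df = pd.DataFrame({
    "name": ["Sally", "Jim", "Ann", "Bob"],
    "quality_of_work": [np.nan, np.nan, 4.0, 2.0],
})

# Mean of the observed values only (NaN is skipped): (4 + 2) / 2 = 3.0
mean_value = df["quality_of_work"].mean()

# Mean substitution: fill every missing entry with that mean.
imputed = df.copy()
imputed["quality_of_work"] = imputed["quality_of_work"].fillna(mean_value)

print(imputed)
```

Mean substitution keeps every row, but it reduces the spread of the column, which is one reason to compare it against approaches A and B before choosing.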
Removing Missing Values
– Make sure that deleting the data does not introduce bias, for example when the
rows with missing values differ systematically from the rest.
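Using Python to process missing/null values
Checking for ‘null’ values using Python
# True for every row where 'column_name' is missing
print(data['column_name'].isnull())
# Drop every row that contains at least one missing value
data.dropna(inplace=True)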
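The two calls above can be run end to end on a small DataFrame. The sketch below assumes pandas and NumPy; the column names and values are hypothetical. It uses `dropna()` without `inplace=True` so that the original data stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
data = pd.DataFrame({
    "name": ["Sally", "Jim", "Ann"],
    "age": [31.0, np.nan, 27.0],
})

# Boolean mask: True where 'age' is missing.
missing_mask = data["age"].isnull()

# Number of missing values in each column.
missing_per_column = data.isnull().sum()

# Approach A: keep only the rows with no missing value at all.
cleaned = data.dropna()

print(missing_mask.tolist())  # [False, True, False]
print(missing_per_column)
print(cleaned)
```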
Dropping a column that consists of more than 75%
‘missing values’
• Dropping a column
• You can find what percentage of a column consists of missing values. If more than 75%
of the data is missing, you can drop that column.
print(data)
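One possible way to apply the 75% rule is sketched below, assuming pandas and NumPy. The DataFrame, its column names and the variable names `missing_pct` and `cols_to_drop` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'notes' is missing in 4 of 5 rows (80%).
data = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "score": [1.0, np.nan, 3.0, 4.0, 5.0],            # 20% missing
    "notes": [np.nan, np.nan, np.nan, np.nan, "ok"],  # 80% missing
})

# isnull() gives True/False; the mean of booleans is the fraction missing.
missing_pct = data.isnull().mean() * 100

# Columns whose missing percentage exceeds the 75% threshold.
cols_to_drop = missing_pct[missing_pct > 75].index

reduced = data.drop(columns=cols_to_drop)
print(missing_pct)
print(reduced)
```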
Duplicate Data Points
Introduction to Duplicate Data Points
• Suppose you want to call all the customers to tell them about a new product launch.
• If rows are compared only on name and credit card number, ‘Sally’ appears 3 times, so you may call her 3 times.
• pandas.DataFrame.drop_duplicates
• Return DataFrame with duplicate rows removed.
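A short sketch of `drop_duplicates` follows, assuming pandas. The `customers` table is invented to mirror the ‘Sally’ example: the column names, card numbers and cities are hypothetical. The `subset` argument restricts the comparison to name and credit card number.

```python
import pandas as pd

# Hypothetical customer table: 'Sally' appears three times.
customers = pd.DataFrame({
    "name": ["Sally", "Sally", "Jim", "Sally"],
    "credit_card": ["1111", "1111", "2222", "1111"],
    "city": ["Leeds", "York", "Leeds", "Leeds"],
})

# Rows that are identical in every column are removed (row 3 repeats row 0).
exact_unique = customers.drop_duplicates()

# Compare only on name and credit card number: one row per customer remains.
by_key = customers.drop_duplicates(subset=["name", "credit_card"])

# duplicated() marks the repeats instead of removing them.
dup_mask = customers.duplicated(subset=["name", "credit_card"])

print(exact_unique)
print(by_key)
print(dup_mask.tolist())  # [False, True, False, True]
```

By default the first occurrence of each duplicate group is kept; the `keep` parameter changes that.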
Encoding categorical features
Why do we need to ‘encode’ features?
• Often, features are not given as continuous values but as categorical ones.
• For example, a person could have features ["male", "female"], ["from Europe",
"from the US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses
Internet Explorer"].
• Many machine learning algorithms cannot work with categorical data; they
need numbers as input. Hence, we need to apply encoding in such cases.
– For example, ["male", "from US", "uses Internet Explorer"] could be expressed as [0,
1, 3],
– while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1]. Any integers can be
used, as long as each category always maps to the same integer.
OrdinalEncoder
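from sklearn import preprocessing

enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox'], ['male', 'from US', 'uses Safari'],
['not specified', 'from Europe', 'uses Firefox']]
enc.fit(X)  # learn the categories of each column
print(enc.transform(X))  # replace each category with its integer code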
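To see which integer each category receives, the fitted encoder can be inspected. The sketch below assumes scikit-learn is installed and reuses the same `X`. The variable names `codes` and `original` are ours.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox'],
     ['male', 'from US', 'uses Safari'],
     ['not specified', 'from Europe', 'uses Firefox']]

enc = OrdinalEncoder()
codes = enc.fit_transform(X)  # fit and transform in one step

# One array of categories per column; a category's position is its code.
# The categories of each column are sorted, so 'female' -> 0, 'male' -> 1, ...
print(enc.categories_)

print(codes)
# [[1. 1. 1.]
#  [0. 0. 0.]
#  [1. 1. 1.]
#  [2. 0. 0.]]

# The mapping is reversible.
original = enc.inverse_transform(codes)
print(original)
```

The codes imply an order (0 < 1 < 2) that these features do not really have, which is worth keeping in mind when choosing an encoder.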
Discretization
• Example
• Suppose we have an attribute of Age with the given values
• After Discretization:
https://www.javatpoint.com/discretization-in-data-mining
• Some machine learning algorithms cannot work with continuous data, so
you may have to apply discretization.
Discretization using Python (contd)
• Discretization in Python
# Discretization
import numpy as np
value = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])
np.digitize(value, bins=[100])  # array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
• 100 is a threshold: a value less than 100 is given the value 0; otherwise it
is given the value 1.
• Discretization in Python
# Discretization with a threshold of 83
value = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])
np.digitize(value, bins=[83])  # array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
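`np.digitize` also accepts several thresholds at once, which gives more than two groups. The sketch below reuses the `value` array from the slides. The bin edges `[90, 120, 140]` and the label names are arbitrary choices made for illustration.

```python
import numpy as np

value = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])

# Three thresholds produce four bins:
#   0: value < 90, 1: 90 <= value < 120, 2: 120 <= value < 140, 3: value >= 140
bins = [90, 120, 140]
bin_index = np.digitize(value, bins=bins)
print(bin_index)  # [0 0 1 1 2 2 2 2 3 3]

# Optional: map each bin index to a readable (hypothetical) label.
labels = np.array(["low", "medium", "high", "very high"])
value_labels = labels[bin_index]
print(value_labels)
```

The bin edges must be sorted; with the default `right=False`, a value equal to an edge falls into the higher bin.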
• Some portions of these slides are taken from the following sources:
– Missing Data slides:
http://n8prp.org.uk/wp-content/uploads/2018/02/Session-3_Missing_Data.pptx
– Code for finding duplicates:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
– Code for removing duplicates:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
– Ordinal encoder: https://scikit-learn.org/stable/modules/preprocessing.html
– OneHotEncoder: https://scikit-learn.org/stable/modules/preprocessing.html
References
• Data pre-processing:
https://hackernoon.com/what-steps-should-one-take-while-doing-data-preprocessing-502c993e1caa
• Missing values:
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b