
Visualisation for Data Analytics:

Data Pre-Processing
Learning Outcome

• To learn about data pre-processing and its benefits

• To learn some popular techniques used for data pre-processing


Outline

• Introduction to Data Pre-processing


• Missing Data
• Duplicate Data
• Encoding
• Discretization
Introduction to Data Pre-Processing

• Data pre-processing is the part of the data analytics and machine learning
process that data scientists spend most of their time on.
• Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviours or trends, and is likely to contain many errors.
• Data pre-processing is used to resolve such issues.
Techniques for Data Pre-Processing

• There are many techniques or steps for data pre-processing. Some of them are
as follows:
– Handling missing values
– Duplicate Data Points removal
– Encoding
– Discretization
• Note: Different types of data pre-processing techniques are used for different
types of data.
• In this lecture, we will focus on general data pre-processing techniques.
• In later lectures, we will discuss some more data pre-processing techniques for
‘text’ data.
Missing Data
Missing data

• Missing data is a common problem and challenge for analysts.


• There are many reasons why data could be missing, including:

– Respondents forgot to answer questions.
– Respondents refused to answer certain questions.
– Respondents failed to complete the survey.
– A sensor failed.
– Someone purposefully turned off recording equipment.
– The method of data capture was changed.
– There was a power cut.
– An internet connection was lost.
– A network went down.
– A hard drive became corrupt.
– A data transfer was cut short.
http://n8prp.org.uk/wp-content/uploads/2018/02/Session-3_Missing_Data.pptx
Missing Data: Example

• For example, Sally and Jim have missing values for the ‘Quality of Work’
attribute
Why do missing values cause problems in data analysis?
• Missing values cause problems in data analysis:
– Misleading results: statistics computed from the observed values alone can
misrepresent the full data set.

Task: Compute the average age of the people.

https://www.bauer.uh.edu/jhess/documents/2.pdf
Why do missing values cause problems in data analysis?
• Computing the average of the Age

• For example, suppose you surveyed a group of customers, but many people
refused to answer the question about their age. If you calculate the average
age based on the data you have, you would conclude that the average age of
your customers is 39 (Figure 2).
• Whereas the average age would have been 29 if all the people had responded.

https://www.bauer.uh.edu/jhess/documents/2.pdf
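The 39-versus-29 effect can be reproduced in pandas with hypothetical ages (chosen here to mirror the slide's numbers, not taken from Figure 2): mean() silently skips NaN, so the observed mean overstates the true one.

```python
import pandas as pd

# Hypothetical ages mirroring the slide's 39-vs-29 contrast;
# two respondents refused to answer, recorded as None
observed = pd.Series([25, 30, 62, None, None])
# What a complete survey might have looked like
complete = pd.Series([25, 30, 62, 16, 12])

# pandas skips NaN by default, so the observed mean is inflated
print(observed.mean())   # 39.0
print(complete.mean())   # 29.0
```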
How to handle Missing data?
• Generally, the procedure for dealing with missing data is as follows:
– Identify the missing data and identify the cause of the missing data. We can then take one of the
following approaches:
– A: Remove the rows containing the missing data
• Also called the naïve approach.
• Make sure missing data isn’t biased!
– B: Remove a particular column if more than 75% of its values are missing.
– C: Replace missing values with alternative values, also known as imputing the missing
values.
• Mean substitution – replacing the missing values with the mean of all observed values for the same variable
• Hot deck imputation – replacing missing values with values from a “similar” responding unit
• There are several other approaches to imputation as well

• Deciding between A, B, and C depends on which outcome you think will produce the
most reliable and accurate results.
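As a sketch of approach C, hot deck imputation can be approximated in pandas by borrowing an observed value from a row in the same group; the ‘dept’ and ‘salary’ columns here are hypothetical, not from the lecture's data.

```python
import pandas as pd

# Hypothetical survey data with two missing salaries
df = pd.DataFrame({
    'dept':   ['sales', 'sales', 'hr', 'hr'],
    'salary': [30000.0, None,    25000.0, None],
})

# Simple hot-deck variant: fill a missing salary with an observed
# salary from the same department (the "similar" responding unit)
df['salary'] = df.groupby('dept')['salary'].transform(
    lambda s: s.fillna(s.dropna().iloc[0])
)
print(df)
```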
Removing Missing Values

• Be cautious when removing missing values:

– This method is advised only when there are enough samples in the data set.

– One has to make sure that deleting the data does not introduce bias.
Using Python to process missing/null values
Checking for ‘null’ value Using Python

• Checking for a null value

• It can differ for numeric and text data types
– First, read the data using the ‘pandas’ library
– If the data is numeric, we can use the ‘isnull()’ function available in pandas
– isnull() returns True if a value is missing and False if the value is present

print(data['column_name'].isnull())

‘True’ indicates that the first value is missing
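A self-contained sketch of the check above, using a small hand-built DataFrame rather than a CSV file; the ‘Age’ column is illustrative.

```python
import pandas as pd
import numpy as np

# Toy data; in the slides this would come from pd.read_csv()
data = pd.DataFrame({'Age': [np.nan, 29, 43]})

# isnull() is True where a value is missing, False where it is present
print(data['Age'].isnull())

# Summing the boolean mask counts the missing values
print(data['Age'].isnull().sum())
```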
Checking for ‘null’ value Using Python
contd
• For ‘text’ data, isnull() does not catch everything: missing entries are often
recorded as sentinel strings, for example na, NaN, n/a, ?, -- and many more,
which pandas stores as ordinary text.
• In such cases it becomes difficult to identify all the null values.
• Instead, we create a list of missing-value markers and supply it at the time of
reading the data.

missing_values = ["n/a", "na", "--", " ", "?"]
data = pd.read_csv('breast-cancer.csv', na_values=missing_values)
print(data['column_name'].isnull())
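The same idea can be tried without the breast-cancer.csv file by feeding an inline CSV through io.StringIO; the column names here are made up.

```python
import io
import pandas as pd

# Inline CSV with two sentinel strings standing in for missing values
csv_text = "id,size\n1,5\n2,n/a\n3,?\n4,7\n"

missing_values = ["n/a", "na", "--", " ", "?"]
data = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

# Both 'n/a' and '?' are now recognised as missing
print(data['size'].isnull())
```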
Replacing null values with ‘mean or average ’ value

• Replacing ‘null or missing’ values with the average value using the ‘fillna()’
function

mean = data['column_name'].mean()
print(mean)
print('Before:\n', data['column_name'])
data['column_name'] = data['column_name'].fillna(mean)
print('After:\n', data['column_name'])

Notice that NaN values are replaced by the mean value
Replacing null values with ‘specific ’ values

• You can pass any value to the ‘fillna()’ function

data['column_name'] = data['column_name'].fillna(0)

All the ‘null or missing‘ values will be replaced by 0
Dropping ‘rows’ that contain ‘missing values’

• You can use the ‘dropna()’ function to drop the rows that contain missing
values

data.dropna(inplace=True)
Dropping a column that consists of more than 75%
‘missing values’
• Dropping a column
• You can find what percentage of a column is missing. If more than 75% of
the data is missing, you can drop that column. (Note: use the total number of
rows as the denominator; the non-null count() would inflate the percentage.)

missing_val_count = data['column_name'].isnull().sum()
print('Number of missing values =', missing_val_count)
rows = len(data)
print('count =', rows)
percentage_missing = (missing_val_count * 100) / rows
print('Percentage missing =', percentage_missing)
if percentage_missing >= 75.0:
    print('Delete this column')
    data.drop('column_name', axis=1, inplace=True)

print(data)
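The steps above can be run end-to-end on a toy frame; the column names are made up, with ‘notes’ 75% missing and therefore dropped.

```python
import pandas as pd
import numpy as np

# Toy frame: 'notes' is mostly missing, 'age' is complete
data = pd.DataFrame({
    'age':   [21, 34, 45, 52],
    'notes': [np.nan, np.nan, np.nan, 'ok'],
})

missing = data['notes'].isnull().sum()            # 3 missing values
total = len(data)                                 # 4 rows in total
percentage_missing = missing * 100 / total        # 75.0
print('Percentage missing =', percentage_missing)

# Drop the column once the 75% threshold is reached
if percentage_missing >= 75.0:
    data = data.drop('notes', axis=1)
print(data.columns.tolist())
```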
Duplicate Data Points
Introduction to Duplicate Data Points
• Suppose you want to call all the customers to give them information about a new product launch.
• If you treat each (name, credit card number) pair as a distinct customer, you may call ‘Sally’ 3 times.

Name Zip-Code Credit card number

Sally 1003 12345

Sally 1003 32456

Sally 1003 24546


Finding duplicate rows in Python

• Finding duplicate rows using the function ‘duplicated()’

• This function will return ‘True’ for the rows which are duplicates of earlier rows
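A minimal sketch of duplicated() on made-up customer rows, where the third row repeats the first exactly:

```python
import pandas as pd

# Toy customer data; row 2 is an exact copy of row 0
df = pd.DataFrame({
    'Name': ['Sally', 'Jim', 'Sally'],
    'Zip':  [1003, 2041, 1003],
})

# duplicated() marks later occurrences of an identical row as True
print(df.duplicated())
```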
Dropping ‘duplicate’ rows

• pandas.DataFrame.drop_duplicates
• Returns a DataFrame with duplicate rows removed.
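Applied to the Sally table from earlier, drop_duplicates() with its subset parameter lets you choose which columns define a duplicate; the column names here are illustrative.

```python
import pandas as pd

# The Sally example: same person, three different credit cards
df = pd.DataFrame({
    'Name':    ['Sally', 'Sally', 'Sally'],
    'ZipCode': [1003, 1003, 1003],
    'CardNo':  [12345, 32456, 24546],
})

# Full-row comparison finds no duplicates, since the card numbers differ
print(len(df.drop_duplicates()))

# Comparing on Name and ZipCode only keeps a single Sally
deduped = df.drop_duplicates(subset=['Name', 'ZipCode'])
print(len(deduped))
```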
Encoding categorical features
Why do we need to ‘encode’ features?

• Often, features are not given as continuous values but as categorical ones.
• For example, a person could have features ["male", "female"], ["from Europe",
"from the US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses
Internet Explorer"].
• Many machine learning algorithms cannot work with categorical data. They
need numbers as input. Hence, we need to apply encoding in such cases
– For example, ["male", "from US", "uses Internet Explorer"] could be expressed as [0,
1, 3]
– while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1]. We could take any
integer values
OrdinalEncoder

• In ordinal encoding, each unique category value is assigned an integer value.

• For example, “red” is 1, “green” is 2, and “blue” is 3.

• This is called an ordinal or integer encoding and is easily reversible.

• Often, integer values starting at zero are used.
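Because the encoder stores the sorted categories, the mapping is easily reversed; a minimal sketch using scikit-learn's inverse_transform:

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

data = asarray([['red'], ['green'], ['blue']])
encoder = OrdinalEncoder()
# Alphabetical order gives blue=0, green=1, red=2
encoded = encoder.fit_transform(data)
print(encoded)

# inverse_transform recovers the original labels from the integers
print(encoder.inverse_transform(encoded))
```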


OrdinalEncoder Examples
• sklearn.preprocessing.OrdinalEncoder: Encode categorical features as an integer array.
• We can demonstrate the usage of this class by converting colour categories “red”, “green” and “blue” into
integers.
• First, the categories are sorted, then numbers are assigned. For strings, this means the labels are
sorted alphabetically, so that blue=0, green=1 and red=2.

# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

Output:
[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]
OrdinalEncoder Examples
• sklearn.preprocessing.OrdinalEncoder: Encode categorical features as an
integer array.

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox'],
     ['male', 'from US', 'uses Safari'],
     ['not specified', 'from Europe', 'uses Firefox']]
enc.fit(X)
print(enc.transform(X))
Discretization
Discretization

• Discretization (otherwise known as quantisation or binning) provides a way to


partition continuous features into discrete values or finite sets of intervals
with minimum data loss
Discretization

• Example
• Suppose we have an attribute of Age with the given values

• After Discretization:

https://www.javatpoint.com/discretization-in-data-mining
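A sketch of the same idea with pandas.cut, using hypothetical ages and interval labels; the exact values in the referenced example may differ.

```python
import pandas as pd

# Hypothetical ages to be discretised into labelled intervals
ages = pd.Series([5, 13, 17, 30, 45, 70])

# Bins (0, 12], (12, 19], (19, 120] with one label each
groups = pd.cut(ages, bins=[0, 12, 19, 120],
                labels=['Child', 'Teen', 'Adult'])
print(groups.tolist())
```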
Discretization

• Certain datasets with continuous features may benefit from discretisation


because discretisation can transform the dataset of continuous attributes to
one with only nominal attributes.

• There are some machine learning algorithms which cannot work with
continuous data, and hence, you may have to apply discretisation
Discretization using Python
contd

• Discretization in Python

# Discretization
value = np.array([ 42, 82, 91, 108, 121, 123, 131, 134, 148, 151])
np.digitize(value, bins=[100] )

100 is a threshold. If a
value is less than 100 it will
be given the value 0, otherwise
it will be given the value 1

array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)


Discretization using Python
contd

• Discretization in Python

# Discretization
value = np.array([ 42, 82, 91, 108, 121, 123, 131, 134, 148, 151])
np.digitize(value, bins=[83] )

With the threshold changed to 83, only the first two values fall below it:

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
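np.digitize also accepts several thresholds; each value then receives the index of the interval it falls into.

```python
import numpy as np

value = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])

# Two thresholds give three bins: <100 -> 0, 100..129 -> 1, >=130 -> 2
print(np.digitize(value, bins=[100, 130]))
```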


Summary

• Introduction to Data Pre-processing


• Popular data pre-processing techniques
– Missing Data
– Duplicate Data
– Encoding
– Discretization
References

• Some portions of these slides are taken from the following places:
– Missing Data slides:
http://n8prp.org.uk/wp-content/uploads/2018/02/Session-3_Missing_Data.pptx
– Code for duplicate finding:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
– Code for duplicate removal:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
– Ordinal encoder: https://scikit-learn.org/stable/modules/preprocessing.html
– OneHotEncoder: https://scikit-learn.org/stable/modules/preprocessing.html
References

• Data pre-processing:
https://hackernoon.com/what-steps-should-one-take-while-doing-data-preprocessing-502c993e1caa

• Scikit-learn data pre-processing:
https://scikit-learn.org/stable/modules/preprocessing.html

• Missing values:
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
