Lecture 4 New Data Pre Processing

Uploaded by

sjf65309

Visualisation for Data Analytics:

Data Pre-Processing
Learning Outcome

• To learn about data pre-processing and its benefit

• To learn some popular techniques used for data pre-processing

Outline

• Introduction to Data Pre-processing

• Missing Data
• Duplicate Data
• Encoding
• Discretization
Introduction to Data Pre-Processing

• Data pre-processing is the part of the data analytics and machine learning
process on which data scientists spend most of their time.
• Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviours or trends and is likely to contain many errors.
• Data pre-processing is used for resolving such issues.
Techniques for Data Pre-Processing

• There are many techniques or steps for data pre-processing. Some of them are
as follows:
– Handling missing values
– Duplicate Data Points removal
– Encoding
– Discretization
• Note: Different types of data pre-processing techniques are used for different
types of data.
• In this lecture, we will focus on general data pre-processing techniques.
• In later lectures, we will discuss some more data pre-processing techniques for
‘text’ data
Missing Data
Missing data

• Missing data is a common problem and challenge for analysts.

• There are many reasons why data could be missing, including:

– Respondents forgot to answer questions.
– Respondents refused to answer certain questions.
– Respondents failed to complete the survey.
– A sensor failed.
– Someone purposefully turned off recording equipment.
– There was a power cut.
– The method of data capture was changed.
– An internet connection was lost.
– A network went down.
– A hard drive became corrupt.
– A data transfer was cut short.
http://n8prp.org.uk/wp-content/uploads/2018/02/Session-3_Missing_Data.pptx
Missing Data: Example

• For example, Sally and Jim have missing values for the ‘Quality of Work’
attribute
Why do missing values cause problems in data analysis?
• Missing values cause problems in data analysis:
– Misleading results: statistics computed only from the observed values
can differ from the true values.

Task (m): Compute the average age of people.

https://www.bauer.uh.edu/jhess/documents/2.pdf
Why do missing values cause problems in data analysis?
• Task: compute the average age.

• For example, suppose you surveyed a group of customers, but many people
refused to answer the question about their age. If you calculate the average age
from the data you have, you would conclude that the average age of your
customers is 39 (Figure 2).
• The average age would have been 29 if everyone had responded.

https://www.bauer.uh.edu/jhess/documents/2.pdf
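The effect can be reproduced on a small scale. The sketch below uses made-up ages (they are not the survey data from the figure) and assumes that the youngest respondents are the ones who did not answer:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN marks respondents who refused to answer.
true_ages = pd.Series([22, 25, 27, 30, 41, 45])
observed = pd.Series([np.nan, np.nan, np.nan, 30, 41, 45])

print("Mean if everyone had answered:", true_ages.mean())
# mean() skips NaN, so only the three observed ages are averaged.
print("Mean of the observed values:", observed.mean())
```

Because the missing answers are not spread evenly across ages, the mean of the observed values is higher than the mean of the full group.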
How to handle missing data?
• Generally, the procedure for dealing with missing data is as follows:
– Identify the missing data and the cause of the missing data. We can then take one of the
following approaches:
– A: Remove the rows containing the missing data
• Also called the naïve approach.
• Make sure the missing data isn’t biased!
– B: Remove a particular column if more than 75% of its values are missing.
– C: Replace the missing values with alternative values, also known as imputing the missing
values.
• Mean substitution – replacing the missing values with the mean of all observed values of the same variable
• Hot deck imputation – replacing missing values with values from a “similar” responding unit
• There are several other approaches to imputation as well

• Deciding between A, B, and C depends on which outcome you think will produce the
most reliable and accurate results.
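Hot deck imputation can be sketched in a few lines. The example below is only one possible variant: it assumes that “similar” means “closest in age”, and the column names, the values and the helper `nearest_donor_income` are all hypothetical.

```python
import pandas as pd

# Hypothetical survey data: 'income' has gaps, 'age' is fully observed.
df = pd.DataFrame({
    "age": [23, 25, 41, 43, 60],
    "income": [21000.0, None, 48000.0, None, 39000.0],
})

# Donors are the rows in which 'income' was observed.
donors = df.dropna(subset=["income"])

def nearest_donor_income(age):
    # Pick the donor whose age is closest to the recipient's age.
    idx = (donors["age"] - age).abs().idxmin()
    return donors.loc[idx, "income"]

imputed = df.copy()
missing = imputed["income"].isnull()
imputed.loc[missing, "income"] = imputed.loc[missing, "age"].apply(nearest_donor_income)
print(imputed)
```

Unlike mean substitution, every imputed value here is a value that was actually observed for some other respondent.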
Removing Missing Values

• Be cautious when removing missing values

– This method is advised only when there are enough samples in the data set.

– Make sure that deleting the data does not introduce bias.
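One simple way to look for such bias is to compare the make-up of the data before and after the rows are dropped. The sketch below uses a made-up data set in which the missing values are concentrated in one group; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'score' is missing mostly for group 'B'.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [7.0, 8.0, 6.0, np.nan, np.nan, 5.0],
})

# Share of each group before and after dropping incomplete rows.
before = df["group"].value_counts(normalize=True)
after = df.dropna()["group"].value_counts(normalize=True)

print("Share of each group before dropping:\n", before)
print("Share of each group after dropping:\n", after)
```

If the shares change noticeably, as they do here, the remaining rows no longer represent the original sample.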
Using Python to process missing/null values
Checking for ‘null’ value Using Python

• Checking for null values

• The check can differ for numeric and text data types
– First, read the data using the ‘pandas’ library
– If the data is numeric, we can use the ‘isnull()’ function available in pandas
– isnull() returns True if the value is missing and False if the value is
present

print(data['column_name'].isnull())

A ‘True’ in the first row of the output indicates that the first value is missing
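A minimal, self-contained illustration of isnull(), using a small made-up DataFrame in place of a file read from disk:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with one missing value per column.
data = pd.DataFrame({
    "age": [np.nan, 31.0, 45.0],
    "city": ["Leeds", None, "York"],
})

mask = data["age"].isnull()   # True where the value is missing
print(mask)
print(data.isnull().sum())    # number of missing values in each column
```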
Checking for ‘null’ value Using Python
contd
• For ‘text data’, isnull() on its own may not find every missing value, because
placeholders such as ‘?’ are read as ordinary strings.
• There are often several different representations of a null value, for example na, NaN,
n/a, ?, -- and many more
• In such cases it becomes difficult to identify all the null values.
• We therefore create a list of missing-value markers and supply it at the time of
reading the data.

import pandas as pd

# Strings that should be treated as missing when the file is read
missing_values = ["n/a", "na", "--", ' ', '?']

data = pd.read_csv('breast-cancer.csv', na_values=missing_values)
print(data['column name'].isnull())
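The effect of na_values can be checked without a file on disk. In the sketch below the CSV text and its column names are made up; only `pd.read_csv` and its `na_values` parameter come from pandas.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "age,tumour_size\n40,?\nna,15\n--,20\n35,n/a\n"
missing_values = ["n/a", "na", "--", " ", "?"]

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

print(raw.isnull().sum())    # '?', 'na' and '--' are read as ordinary text
print(clean.isnull().sum())  # the listed markers are now counted as missing
```

Without the list, pandas recognises only its built-in markers (such as ‘n/a’), so the ‘age’ column appears to have no missing values at all.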
Replacing null values with the ‘mean or average’ value

• Replacing ‘null or missing’ values with the average value using the ‘fillna()’
function

mean = data['column name'].mean()  # mean of the observed (non-missing) values
print(mean)
print('Before:\n', data['column name'])
# Assign the result back: inplace=True on a selected column may not update 'data' in recent pandas versions
data['column name'] = data['column name'].fillna(mean)
print('After:\n', data['column name'])

Notice that the NaN values are replaced by the mean value
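A small worked example with made-up numbers shows what mean substitution does to a column: the mean stays the same, but the spread of the values shrinks because every gap is filled with the same number.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 30.0, np.nan, 40.0])  # hypothetical values
filled = ages.fillna(ages.mean())

print(filled.tolist())
print("mean before/after:", ages.mean(), filled.mean())
print("std before/after:", ages.std(), filled.std())
```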
Replacing null values with ‘specific’ values

• You can pass any value to the ‘fillna()’ function

data['column name'] = data['column name'].fillna(0)

All the ‘null or missing’ values will be replaced by 0
Dropping ‘rows’ that contain ‘missing values’

• You can use the ‘dropna()’ function to drop the rows that contain missing
values

data.dropna(inplace=True)
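By default dropna() removes a row if any of its values is missing. The sketch below, on a made-up DataFrame, contrasts this with the `subset` parameter, which restricts the check to the named columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Bob has no age, Cat has no city.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "age": [28.0, np.nan, 35.0],
    "city": ["Leeds", "York", None],
})

all_complete = df.dropna()             # keep rows with no missing value at all
age_known = df.dropna(subset=["age"])  # only require 'age' to be present

print(all_complete)
print(age_known)
```

Neither call uses inplace=True, so the original `df` is left unchanged and the results can be compared.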
Dropping a column in which more than 75% of the values are missing
• Dropping a column
• You can find what percentage of a column consists of missing values. If more than 75%
of the data is missing, you can drop that column

missing_val_count = data['column name'].isnull().sum()

print('Number of missing values = ', missing_val_count)
# Total number of rows; count() would skip the missing values
rows = len(data)
print('count =', rows)
percentage_missing = (missing_val_count * 100) / rows
print('Percentage missing = ', percentage_missing)
if percentage_missing > 75.0:
    print('Delete this column')
    data.drop('column name', axis=1, inplace=True)

print(data)
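The same check can be applied to every column at once. This is a sketch on made-up data; the 75% threshold follows the rule of thumb above, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'notes' is 80% missing, 'age' is 20% missing.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "notes": [np.nan, np.nan, np.nan, np.nan, "ok"],
    "age": [30.0, np.nan, 41.0, 52.0, 38.0],
})

# isnull() gives True/False per cell; the column mean is the missing fraction.
pct_missing = df.isnull().mean() * 100
to_drop = pct_missing[pct_missing > 75.0].index
reduced = df.drop(columns=to_drop)

print(pct_missing)
print(reduced.columns.tolist())
```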
Duplicate Data Points
Introduction to Duplicate Data Points
• You want to call all the customers to give them information about a new product launch
• If you identify customers by name and credit card number, the three rows below look like
different customers, and you may call ‘Sally’ 3 times.

Name    Zip-Code   Credit card number
Sally   1003       12345
Sally   1003       32456
Sally   1003       24546

Finding duplicate rows in Python

• Duplicate rows can be found using the function ‘duplicated()’

• This function returns True for each row that is a duplicate of an earlier row
Finding Duplicates rows in Python contd
Finding Duplicates rows in Python contd
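A short sketch of duplicated() on the ‘Sally’ table from the introduction. The DataFrame name `customers` is an assumption; `subset` is the pandas parameter that limits the comparison to the named columns.

```python
import pandas as pd

customers = pd.DataFrame({
    "Name": ["Sally", "Sally", "Sally"],
    "Zip-Code": [1003, 1003, 1003],
    "Credit card number": [12345, 32456, 24546],
})

# Comparing whole rows: the card numbers differ, so nothing is a duplicate.
whole_row = customers.duplicated()
# Comparing only name and zip code: the second and third rows repeat the first.
by_person = customers.duplicated(subset=["Name", "Zip-Code"])

print(whole_row.tolist())
print(by_person.tolist())
```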
Drop ‘duplicate’ rows

• pandas.DataFrame.drop_duplicates
• Return DataFrame with duplicate rows removed.
Drop ‘duplicate’ rows contd
Drop ‘duplicate’ rows contd
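A matching sketch for drop_duplicates(), again on the ‘Sally’ table. The choice of `subset` columns is an assumption about what identifies a customer.

```python
import pandas as pd

customers = pd.DataFrame({
    "Name": ["Sally", "Sally", "Sally"],
    "Zip-Code": [1003, 1003, 1003],
    "Credit card number": [12345, 32456, 24546],
})

# Keep the first row for each (Name, Zip-Code) pair and drop the repeats.
unique_customers = customers.drop_duplicates(subset=["Name", "Zip-Code"])
print(unique_customers)
```

With this result ‘Sally’ would be called only once.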
Encoding categorical features
Why do we need to ‘encode’ features?

• Often, features are not given as continuous values but as categorical ones.
• For example, a person could have features ["male", "female"], ["from Europe",
"from the US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses
Internet Explorer"].
• Many machine learning algorithms cannot work with categorical data. They
need numbers as input. Hence, we need to apply encoding in such cases
– For example, ["male", "from US", "uses Internet Explorer"] could be expressed as [0,
1, 3]
– while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1]. We could take any
integer values
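The mapping in the example above can be written out by hand. The sketch below assumes the codes follow the order in which the categories are listed; the column names and the `codes` dictionary are illustrative only.

```python
import pandas as pd

people = pd.DataFrame({
    "gender": ["male", "female"],
    "region": ["from US", "from Asia"],
    "browser": ["uses Internet Explorer", "uses Chrome"],
})

# One dictionary per feature: category -> integer code.
codes = {
    "gender": {"male": 0, "female": 1},
    "region": {"from Europe": 0, "from US": 1, "from Asia": 2},
    "browser": {"uses Firefox": 0, "uses Chrome": 1,
                "uses Safari": 2, "uses Internet Explorer": 3},
}

encoded = people.copy()
for column, mapping in codes.items():
    encoded[column] = people[column].map(mapping)

print(encoded.values.tolist())
```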
OrdinalEncoder

• In ordinal encoding, each unique category value is assigned an integer value.

• For example, “red” is 1, “green” is 2, and “blue” is 3.

• This is called an ordinal or integer encoding and is easily reversible.

• Often, integer values starting at zero are used.

OrdinalEncoder Examples
• sklearn.preprocessing.OrdinalEncoder: Encode categorical features as an integer array.
• We can demonstrate the usage of this class by converting colour categories “red”, “green” and “blue” into
integers.
• First, the categories are sorted and then numbers are assigned. For strings, this means the labels are
sorted alphabetically, so blue=0, green=1 and red=2.

# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

Output:
[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]
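The encoding is reversible because the fitted encoder remembers its categories. A brief sketch using the `categories_` attribute and the `inverse_transform` method of scikit-learn's OrdinalEncoder:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

data = np.array([["red"], ["green"], ["blue"]])
encoder = OrdinalEncoder()
codes = encoder.fit_transform(data)

print(encoder.categories_)                  # learned categories, in sorted order
decoded = encoder.inverse_transform(codes)  # map the codes back to the labels
print(decoded)
```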
OrdinalEncoder Examples
• sklearn.preprocessing.OrdinalEncoder: Encode categorical features as an
integer array.
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox'],
     ['male', 'from US', 'uses Safari'], ['not specified', 'from Europe', 'uses Firefox']]
# learn the categories of each column
enc.fit(X)
# encode the same rows as integer codes
print(enc.transform(X))
Discretization
Discretization

• Discretization (otherwise known as quantisation or binning) provides a way to
partition continuous features into discrete values or finite sets of intervals
with minimum data loss
Discretization

• Example
• Suppose we have an attribute of Age with the given values

• After Discretization:

https://www.javatpoint.com/discretization-in-data-mining
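An ‘Age’ attribute can be discretized with pandas' `cut` function. The ages, interval edges and labels below are made up for illustration and are not the values from the original figure.

```python
import pandas as pd

ages = pd.Series([3, 11, 19, 34, 47, 68, 80])   # hypothetical ages
bins = [0, 12, 19, 64, 120]                     # hypothetical interval edges
labels = ["child", "teen", "adult", "senior"]   # hypothetical labels

# Each interval includes its right edge, e.g. (12, 19] is 'teen'.
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```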
Discretization

• Certain datasets with continuous features may benefit from discretisation
because discretisation can transform a dataset of continuous attributes into
one with only nominal attributes.

• Some machine learning algorithms cannot work with
continuous data, and hence, you may have to apply discretisation.
Discretization using Python
contd

• Discretization in Python

# Discretization
import numpy as np

value = np.array([ 42, 82, 91, 108, 121, 123, 131, 134, 148, 151])
np.digitize(value, bins=[100])

100 is a threshold. A value less than 100 is given the value 0; otherwise
it is given the value 1.

array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

Discretization using Python
contd

• Discretization in Python

# Discretization
import numpy as np

value = np.array([ 42, 82, 91, 108, 121, 123, 131, 134, 148, 151])
np.digitize(value, bins=[83])

The threshold is changed to 83, so 91 now falls on the other side and is given the value 1.

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
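np.digitize also accepts several thresholds at once, which yields more than two bins. The thresholds in this sketch are arbitrary choices for illustration:

```python
import numpy as np

value = np.array([42, 82, 91, 108, 121, 123, 131, 134, 148, 151])

# Three increasing thresholds give four bins, numbered 0 to 3.
codes = np.digitize(value, bins=[83, 100, 130])
print(codes)
```

Values below 83 fall into bin 0, values from 83 up to (but not including) 100 into bin 1, values from 100 up to 130 into bin 2, and values of 130 or more into bin 3.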

Summary

• Introduction to Data Pre-processing

• Popular data pre-processing techniques
– Missing Data
– Duplicate Data
– Encoding
– Discretization
References

• Some portions of these slides are taken from the following sources:
– Missing Data slides:
http://n8prp.org.uk/wp-content/uploads/2018/02/Session-3_Missing_Data.pptx
– Code for finding duplicates:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
– Code for removing duplicates:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
– Ordinal encoder: https://scikit-learn.org/stable/modules/preprocessing.html
– OneHotEncoder: https://scikit-learn.org/stable/modules/preprocessing.html
References

• Data pre-processing:
https://hackernoon.com/what-steps-should-one-take-while-doing-data-preprocessing-502c993e1caa

• Scikit-learn data pre-processing:
https://scikit-learn.org/stable/modules/preprocessing.html

• Missing values:
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
