DAI101 4 Data Preparation
What?
The standard definition of data preparation is: the process of cleaning, transforming, and organizing raw data so that it is suitable for analysis or modeling.
Why?
• Good data is essential for producing effective models of any type.
• Data should be formatted to match the requirements of the software tool being used.
• Data needs to be made suitable for the given method.
• Data in the real world is dirty.
Data Cleaning
What?
Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying and correcting or removing inaccuracies,
inconsistencies, and irrelevant data in a dataset.
Common Challenges: Missing Values
In many real-world datasets, some values may be missing or incomplete.
This can occur for various reasons, such as incorrect data entry, non-response, or technical problems during data collection.
Common Challenges: Outliers
Outliers are values that are significantly different from other values
in the dataset.
Outliers can greatly impact the results of data analysis and modeling,
so it's important to detect and handle them appropriately.
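As an illustration, here is a minimal pandas sketch of one common detection approach, the interquartile range (IQR) rule; the fare column and the conventional 1.5 × IQR cutoff are illustrative assumptions, not taken from the slides.

import pandas as pd

# Small illustrative sample; the last fare is an obvious outlier.
df = pd.DataFrame({"fare": [7.25, 8.05, 13.0, 26.55, 512.33]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["fare"] < lower) | (df["fare"] > upper)]
print(outliers)  # rows flagged for further inspection or removal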
Common Challenges: Inconsistent Data Formats
In a dataset, different columns or fields may have different data formats.
Data cleaning can involve converting data into a consistent format, such as converting date strings into date objects or converting string values into numerical values.
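A minimal pandas sketch of such conversions; the column names and values are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "signup_date": ["2022-01-05", "2022-02-05", "2022-03-10"],
    "amount": ["19.99", "5", "12.50"],
})

# Convert date strings into proper datetime objects.
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Convert numeric strings into floats; unparseable entries become NaN.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df.dtypes)  # datetime64[ns] and float64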
Common Challenges: Duplicates
Duplicate records can occur in a dataset for various reasons, such as
repeated data entry or merging of datasets.
Duplicates can greatly impact the results of data analysis and modeling,
so it's important to identify and remove them.
Common Challenges: Invalid Data
Some records in a dataset may contain invalid or irrelevant data that
does not belong in the dataset.
Invalid data must be identified and removed during the data cleaning
process.
Demo: Titanic Passenger Dataset
Common Challenges: How to Handle?
1. Remove rows with problematic values: This method is suitable for datasets with a small amount of missing data, as removing too many rows can greatly reduce the size of the dataset and affect the results of the analysis.
2. Imputation: This method replaces missing values with a substitute value, such as the mean, median, or mode of the corresponding column.
3. Interpolation: This method involves estimating missing values based on the values of other observations in the dataset. For example, linear interpolation can be used to estimate missing values based on a linear relationship between two known values.
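A minimal pandas sketch of the three techniques; the age column (echoing the Titanic demo) and its values are illustrative assumptions.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 24.0, np.nan, 30.0]})

# 1. Remove rows with missing values.
dropped = df.dropna(subset=["age"])

# 2. Impute missing values with a summary statistic (here the median).
imputed = df.fillna({"age": df["age"].median()})

# 3. Linearly interpolate between the neighbouring known values.
interpolated = df.interpolate(method="linear")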
Demo: Titanic Passenger Dataset
Data Transformation
What?
• Data transformation is the process of converting data from one format or structure to another to make it usable for analysis or modeling.
Normalization
• A data preprocessing technique used to transform the values in a
dataset into a common scale.
Normalization Techniques
1. Min-Max normalization: This technique scales the values in the dataset to a range between 0 and 1.
2. Z-Score normalization: This technique standardizes the values in the dataset to have a mean of zero and a standard deviation of one.
3. Decimal scaling normalization: This technique scales the values in the dataset by dividing each value by a power of 10 chosen so that the maximum absolute value is less than 1.
4. L2 normalization: This technique scales each data point (vector) in the dataset to have a Euclidean norm of 1.
L2 normalization example with two data points:
For x1 = (3, 4), the Euclidean norm is ||x1|| = √(3² + 4²) = √25 = 5, so the normalized vector is x1 / 5 = (0.6, 0.8).
For x2 = (1, 1), the norm is ||x2|| = √2 ≈ 1.414, so the normalized vector is approximately (0.707, 0.707).
Both normalized vectors have a Euclidean norm of 1.
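A minimal NumPy sketch of the four techniques above; the sample values are made up for illustration.

import numpy as np

x = np.array([3.0, 4.0, 10.0, 250.0])

# 1. Min-Max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# 2. Z-Score normalization: mean 0, standard deviation 1.
z_score = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by the power of 10 that brings all |values| below 1.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / 10**j

# 4. L2 normalization: scale the vector to unit Euclidean norm.
l2 = x / np.linalg.norm(x)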
Encoding
• The process of converting categorical data into a numerical representation that can be processed by machine learning algorithms.
Encoding Techniques
1. Label encoding: This technique assigns a numerical value to each unique category in the data, usually starting from 0.
2. One-hot encoding: This technique creates a separate binary (0/1) indicator column for each unique category, avoiding any artificial ordering between categories.
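A minimal pandas sketch of both techniques; the species column (echoing the Iris demo later) is an illustrative assumption.

import pandas as pd

df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# 1. Label encoding: map each unique category to an integer code starting at 0.
df["species_label"] = df["species"].astype("category").cat.codes

# 2. One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df["species"], prefix="species")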
Aggregation
• A technique in data analysis that summarizes data by combining
multiple values into a single one.
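A minimal pandas sketch; the species and sepal_length columns are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
})

# Combine the many rows per species into single summary values.
summary = df.groupby("species")["sepal_length"].agg(["mean", "min", "max"])
print(summary)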
Transformation Techniques
• Log transformation: The logarithmic transformation is used to reduce skewness in data by transforming it into a more normally distributed form. The log transformation is applied to data that is heavily skewed to the right.
• It is particularly useful for data that spans several orders of magnitude, where some values are significantly larger than others.
Transformation Techniques
• Box-Cox transformation: The Box-Cox transformation is a family of power transformations used to transform skewed data into a more normally distributed form; note that it requires strictly positive input values. The Box-Cox transformation can be applied to both left- and right-skewed data.
• It can handle a broader range of data shapes than the log transformation and offers a family of power transformations, allowing more flexibility in transforming data.
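A minimal sketch of both transformations using NumPy and SciPy; the sample values, chosen to span several orders of magnitude, are made up for illustration.

import numpy as np
from scipy import stats

# Right-skewed positive data spanning several orders of magnitude.
x = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 5000.0])

# Log transformation: compresses large values, reducing right skew.
log_x = np.log(x)

# Box-Cox transformation: SciPy also estimates the power parameter
# lambda that makes the result as close to normal as possible.
boxcox_x, lam = stats.boxcox(x)
print(lam)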
Scaling Techniques
1. Min-Max normalization: This technique scales the values in the dataset to a range between 0 and 1.
2. Standardization: This technique scales the data so that it has a mean of 0 and a standard deviation of 1. It is done by subtracting the mean of the feature from each data point and then dividing the result by the standard deviation of the feature (z = (x − μ) / σ, the Z-score).
Demo: Iris Dataset
This dataset can be used to illustrate data transformation by
transforming the variables in the dataset.
For example,
o normalizing the sepal and petal length and width to bring them to the same
scale,
o encoding the species names into numerical values,
o aggregating the data by species to get summary statistics,
o transforming variables like sepal length and width to meet the assumptions of
statistical models, and
o scaling variables to prepare them for machine learning algorithms.
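A minimal sketch of some of these steps, assuming the Iris dataset as bundled with scikit-learn (load_iris); it is an illustration, not the course demo itself.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus a label-encoded 'target'
features = iris.feature_names

# Normalize all measurements to the same 0-1 scale.
df_norm = df.copy()
df_norm[features] = MinMaxScaler().fit_transform(df[features])

# The species is already label-encoded (0, 1, 2) in 'target'; map the
# codes back to names to make the encoding explicit.
df_norm["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Aggregate by species to get summary statistics.
print(df_norm.groupby("species")[features].mean())

# Standardize (Z-score) the features for algorithms that expect mean 0, std 1.
df_std = StandardScaler().fit_transform(df[features])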
Data Integration
What?
• The process of combining data from multiple sources into a single, unified data set.
# Customer information dataset
customer_id  name        address       phone_number
1            John Doe    123 Main St   555-555-5555
2            Jane Doe    456 Oak Ave   555-555-5556
3            John Smith  789 Birch Rd  555-555-5557

# Merged dataset
customer_id  name        address       phone_number  purchase_id  purchase_date  product
1            John Doe    123 Main St   555-555-5555  1            2022-01-01     T-Shirt
1            John Doe    123 Main St   555-555-5555  3            2022-03-01     Shoes
2            Jane Doe    456 Oak Ave   555-555-5556  2            2022-02-01     Hat
3            John Smith  789 Birch Rd  555-555-5557  4            2022-04-01     Pants
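A minimal pandas sketch reproducing the example above in miniature; only a subset of the columns is kept for brevity.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["John Doe", "Jane Doe", "John Smith"],
})
purchases = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "purchase_id": [1, 2, 3, 4],
    "product": ["T-Shirt", "Hat", "Shoes", "Pants"],
})

# Join the two sources on the shared key to build the unified dataset.
merged = pd.merge(customers, purchases, on="customer_id", how="inner")
print(merged)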
Data Reduction
What?
• A process in data preparation that involves reducing the size or
complexity of a dataset without losing significant information.
Dimensionality Reduction
• Involves reducing the number of variables or features in a dataset,
while retaining as much information as possible.
PCA
• A linear transformation technique that transforms a set of correlated variables into a set of uncorrelated variables called principal components.
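A minimal scikit-learn sketch, using the bundled Iris data and an assumed choice of two components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 correlated features

# Project onto the 2 principal components that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component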
Feature Selection
• Feature selection is the process of identifying a subset of features (or
columns) from a larger set of features in a dataset that are most
relevant and contribute the most to the target variable or the task at
hand.
Feature Selection Techniques
Feature selection can be performed based on various methods such as
• univariate statistical tests,
• recursive feature elimination, or
• by using machine learning algorithms to estimate feature importance.
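A minimal scikit-learn sketch of two of these methods; the choice of k=2 and of a random forest as the importance estimator are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y = data.data, data.target

# Univariate statistical test: keep the 2 features with the highest F-scores.
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Model-based importance: a random forest estimates each feature's contribution.
forest = RandomForestClassifier(random_state=0).fit(X, y)
print(dict(zip(data.feature_names, forest.feature_importances_)))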
Data Compression
• Data compression refers to techniques for reducing the size of a data
set.
Data Compression: Techniques
• This is often achieved through the use of algorithms that identify and remove redundant information from the data, as well as through the use of lossless or lossy compression methods (see below).
Lossless vs Lossy
• Lossless compression methods preserve the original data exactly
without any loss of information.
• For example, ZIP is a common lossless compression method. Lossless
compression methods are mainly used in scenarios where preserving the
original data is important, like in medical images, scientific data, and text files.
• Lossy compression methods, on the other hand, discard some of the
information in the original data to achieve a higher level of
compression.
• For example, JPEG is a common lossy compression method used in image and
video files. Lossy compression methods are mainly used in scenarios where
data quality is not as important, like in photos, music, and video files.
Run-length encoding (RLE)
• A lossless data compression method that is used to compress
repeating patterns of data.
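A minimal Python sketch of RLE over a string, representing runs as (character, count) pairs; real implementations use more compact binary layouts.

def rle_encode(data):
    # Collapse runs of repeated characters into (character, count) pairs.
    encoded = []
    for ch in data:
        if encoded and encoded[-1][0] == ch:
            encoded[-1] = (ch, encoded[-1][1] + 1)
        else:
            encoded.append((ch, 1))
    return encoded

def rle_decode(encoded):
    # Expand (character, count) pairs back into the original string.
    return "".join(ch * count for ch, count in encoded)

pairs = rle_encode("AAAABBBCCD")
print(pairs)                              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(pairs) == "AAAABBBCCD"  # lossless round trip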
Data Sampling
• Data sampling is the process of selecting a representative subset of a
larger dataset for analysis, modeling, or visualization purposes.
• This technique is used when dealing with large datasets as it can save
time and computing resources by only processing a portion of the
data.
Data Sampling Techniques
• Simple random sampling: This method involves selecting a random sample from the entire population of data. The sample size is determined prior to the selection process, and each data point has an equal chance of being selected.
• Stratified sampling: In this method, the population is divided into strata (homogeneous subgroups) and a random sample is taken from each stratum. This is done to ensure that each stratum is represented in the sample in proportion to its size in the population.
• Cluster sampling: In this method, the population is divided into groups (clusters) and a random sample of clusters is selected. Then, all the data points within the selected clusters are included in the sample.
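A minimal pandas sketch of all three methods; the species column, sample sizes, and sampling fractions are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "species": ["setosa"] * 6 + ["virginica"] * 3,
    "value": range(9),
})

# Simple random sampling: every row has an equal chance of selection.
simple = df.sample(n=4, random_state=0)

# Stratified sampling: draw the same fraction from each subgroup so that
# every stratum stays proportionally represented.
stratified = df.groupby("species", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=0)
)

# Cluster sampling: pick whole groups at random and keep all of their rows.
chosen = df["species"].drop_duplicates().sample(n=1, random_state=0)
clusters = df[df["species"].isin(chosen)]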