DAI101 4 Data Preparation
What?
The standard definition of data preparation is: the process of cleaning, transforming, and organizing raw data so that it is suitable for analysis or modeling.
Why?
• Good data is essential for producing effective models of any type.
• Data should be formatted to match the requirements of the software tool being used.
• Data needs to be made suitable for the given method.
• Data in the real world is dirty.
Data Cleaning
What?
Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying and correcting or removing inaccuracies,
inconsistencies, and irrelevant data in a dataset.
Common Challenges: Missing Values
In many real-world datasets, some values may be missing or incomplete.
This can occur for various reasons, such as incorrect data entry, non-response, or technical problems during data collection.
Common Challenges: Outliers
Outliers are values that are significantly different from other values
in the dataset.
Outliers can greatly impact the results of data analysis and modeling,
so it's important to detect and handle them appropriately.
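As an illustration, here is a minimal pandas sketch of one common detection approach, the interquartile range (IQR) rule; the fare column and the conventional 1.5 × IQR cutoff are illustrative assumptions, not taken from the slides.

import pandas as pd

# Small illustrative sample; the last fare is an obvious outlier.
df = pd.DataFrame({"fare": [7.25, 8.05, 13.0, 26.55, 512.33]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["fare"] < lower) | (df["fare"] > upper)]
print(outliers)  # rows flagged for further inspection or removal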
Common Challenges: Inconsistent Data Formats
In a dataset, different columns or fields may have different data formats.
Data cleaning can involve converting data into a consistent format, such as converting date strings into date objects or converting string values into numerical values.
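A minimal pandas sketch of such conversions; the column names and values are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "signup_date": ["2022-01-05", "2022-02-05", "2022-03-10"],
    "amount": ["19.99", "5", "12.50"],
})

# Convert date strings into proper datetime objects.
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Convert numeric strings into floats; unparseable entries become NaN.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df.dtypes)  # datetime64[ns] and float64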
Common Challenges: Duplicates
Duplicate records can occur in a dataset for various reasons, such as
repeated data entry or merging of datasets.
Duplicates can greatly impact the results of data analysis and modeling,
so it's important to identify and remove them.
Common Challenges: Invalid Data
Some records in a dataset may contain invalid or irrelevant data that
does not belong in the dataset.
Invalid data must be identified and removed during the data cleaning
process.
Demo: Titanic Passenger Dataset
Common Challenges: How to Handle?
1. Remove rows with problematic values: This method is suitable for datasets with a small amount of missing data, as removing too many rows can greatly reduce the size of the dataset and affect the results of the analysis.
2. Imputation: This method replaces missing values with a substitute value, such as the mean, median, or mode of the corresponding column.
3. Interpolation: This method involves estimating missing values based on the values of other observations in the dataset. For example, linear interpolation can be used to estimate missing values based on a linear relationship between two known values.
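A minimal pandas sketch of the three techniques; the age column (echoing the Titanic demo) and its values are illustrative assumptions.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 24.0, np.nan, 30.0]})

# 1. Remove rows with missing values.
dropped = df.dropna(subset=["age"])

# 2. Impute missing values with a summary statistic (here the median).
imputed = df.fillna({"age": df["age"].median()})

# 3. Linearly interpolate between the neighbouring known values.
interpolated = df.interpolate(method="linear")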
Demo: Titanic Passenger Dataset
Data Transformation
What?
• Data transformation is the process of converting data from one format or structure to another to make it usable for analysis or modeling.
Normalization
• A data preprocessing technique used to transform the values in a
dataset into a common scale.
Normalization Techniques
1. Min-Max normalization: This technique scales the values in the dataset to a range between 0 and 1.
2. Z-Score normalization: This technique standardizes the values in the dataset to have a mean of zero and a standard deviation of one.
3. Decimal scaling normalization: This technique scales the values in the dataset by dividing each value by a power of 10 chosen so that the maximum absolute value is less than 1.
4. L2 normalization: This technique scales each data point (vector) in the dataset to have a Euclidean norm of 1.
L2 normalization example with two data points:
For x1 = (3, 4), the Euclidean norm is ||x1|| = √(3² + 4²) = √25 = 5, so the normalized vector is x1 / 5 = (0.6, 0.8).
For x2 = (1, 1), the norm is ||x2|| = √2 ≈ 1.414, so the normalized vector is approximately (0.707, 0.707).
Both normalized vectors have a Euclidean norm of 1.
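A minimal NumPy sketch of the four techniques above; the sample values are made up for illustration.

import numpy as np

x = np.array([3.0, 4.0, 10.0, 250.0])

# 1. Min-Max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# 2. Z-Score normalization: mean 0, standard deviation 1.
z_score = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by the power of 10 that brings all |values| below 1.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / 10**j

# 4. L2 normalization: scale the vector to unit Euclidean norm.
l2 = x / np.linalg.norm(x)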
Encoding
• The process of converting categorical data into a numerical representation that can be processed by machine learning algorithms.
Encoding Techniques
1. Label encoding: This technique assigns a numerical value to each unique category in the data, usually starting from 0.
2. One-hot encoding: This technique creates a separate binary (0/1) indicator column for each unique category, avoiding any artificial ordering between categories.
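A minimal pandas sketch of both techniques; the species column (echoing the Iris demo later) is an illustrative assumption.

import pandas as pd

df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# 1. Label encoding: map each unique category to an integer code starting at 0.
df["species_label"] = df["species"].astype("category").cat.codes

# 2. One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df["species"], prefix="species")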
Aggregation
• A technique in data analysis that summarizes data by combining
multiple values into a single one.
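A minimal pandas sketch; the species and sepal_length columns are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
})

# Combine the many rows per species into single summary values.
summary = df.groupby("species")["sepal_length"].agg(["mean", "min", "max"])
print(summary)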
Transformation Techniques
• Log transformation: The logarithmic transformation is used to reduce skewness in data by transforming it into a more normally distributed form. The log transformation is applied to data that is heavily skewed to the right.
• It is particularly useful for data that spans several orders of magnitude, where some values are significantly larger than others.
Transformation Techniques
• Box-Cox transformation: The Box-Cox transformation is a family of power transformations used to transform skewed data into a more normally distributed form; note that it requires strictly positive input values. The Box-Cox transformation can be applied to both left- and right-skewed data.
• It can handle a broader range of data shapes than the log transformation and offers a family of power transformations, allowing more flexibility in transforming data.
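A minimal sketch of both transformations using NumPy and SciPy; the sample values, chosen to span several orders of magnitude, are made up for illustration.

import numpy as np
from scipy import stats

# Right-skewed positive data spanning several orders of magnitude.
x = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 5000.0])

# Log transformation: compresses large values, reducing right skew.
log_x = np.log(x)

# Box-Cox transformation: SciPy also estimates the power parameter
# lambda that makes the result as close to normal as possible.
boxcox_x, lam = stats.boxcox(x)
print(lam)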
Scaling Techniques
1. Min-Max normalization: This technique scales the values in the dataset to a range between 0 and 1.
2. Standardization: This technique scales the data so that it has a mean of 0 and a standard deviation of 1. It is done by subtracting the mean of the feature from each data point and then dividing the result by the standard deviation of the feature (z = (x − μ) / σ, the Z-score).
Demo: Iris Dataset
This dataset can be used to illustrate data transformation by
transforming the variables in the dataset.
For example,
o normalizing the sepal and petal length and width to bring them to the same
scale,
o encoding the species names into numerical values,
o aggregating the data by species to get summary statistics,
o transforming variables like sepal length and width to meet the assumptions of
statistical models, and
o scaling variables to prepare them for machine learning algorithms.
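A minimal sketch of some of these steps, assuming the Iris dataset as bundled with scikit-learn (load_iris); it is an illustration, not the course demo itself.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus a label-encoded 'target'
features = iris.feature_names

# Normalize all measurements to the same 0-1 scale.
df_norm = df.copy()
df_norm[features] = MinMaxScaler().fit_transform(df[features])

# The species is already label-encoded (0, 1, 2) in 'target'; map the
# codes back to names to make the encoding explicit.
df_norm["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Aggregate by species to get summary statistics.
print(df_norm.groupby("species")[features].mean())

# Standardize (Z-score) the features for algorithms that expect mean 0, std 1.
df_std = StandardScaler().fit_transform(df[features])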
Data Integration
What?
• The process of combining data from multiple sources into a single, unified data set.
# Customer information dataset
customer_id  name        address       phone_number
1            John Doe    123 Main St   555-555-5555
2            Jane Doe    456 Oak Ave   555-555-5556
3            John Smith  789 Birch Rd  555-555-5557

# Merged dataset
customer_id  name        address       phone_number  purchase_id  purchase_date  product
1            John Doe    123 Main St   555-555-5555  1            2022-01-01     T-Shirt
1            John Doe    123 Main St   555-555-5555  3            2022-03-01     Shoes
2            Jane Doe    456 Oak Ave   555-555-5556  2            2022-02-01     Hat
3            John Smith  789 Birch Rd  555-555-5557  4            2022-04-01     Pants
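A minimal pandas sketch reproducing the example above in miniature; only a subset of the columns is kept for brevity.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["John Doe", "Jane Doe", "John Smith"],
})
purchases = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "purchase_id": [1, 2, 3, 4],
    "product": ["T-Shirt", "Hat", "Shoes", "Pants"],
})

# Join the two sources on the shared key to build the unified dataset.
merged = pd.merge(customers, purchases, on="customer_id", how="inner")
print(merged)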
Data Reduction
What?
• A process in data preparation that involves reducing the size or
complexity of a dataset without losing significant information.
Dimensionality Reduction
• Involves reducing the number of variables or features in a dataset,
while retaining as much information as possible.
PCA
• A linear transformation technique that transforms a set of correlated variables into a set of uncorrelated variables called principal components.
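A minimal scikit-learn sketch, using the bundled Iris data and an assumed choice of two components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 correlated features

# Project onto the 2 principal components that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component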
Feature Selection
• Feature selection is the process of identifying a subset of features (or
columns) from a larger set of features in a dataset that are most
relevant and contribute the most to the target variable or the task at
hand.
Feature Selection Techniques
Feature selection can be performed based on various methods such as
• univariate statistical tests,
• recursive feature elimination, or
• by using machine learning algorithms to estimate feature importance.
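A minimal scikit-learn sketch of two of these methods; the choice of k=2 and of a random forest as the importance estimator are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y = data.data, data.target

# Univariate statistical test: keep the 2 features with the highest F-scores.
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Model-based importance: a random forest estimates each feature's contribution.
forest = RandomForestClassifier(random_state=0).fit(X, y)
print(dict(zip(data.feature_names, forest.feature_importances_)))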
Data Compression
• Data compression refers to techniques for reducing the size of a data
set.
Data Compression: Techniques
• This is often achieved through the use of algorithms that identify and remove redundant information from the data, as well as through the use of lossless or lossy compression methods (see below).
Lossless vs Lossy
• Lossless compression methods preserve the original data exactly
without any loss of information.
• For example, ZIP is a common lossless compression method. Lossless
compression methods are mainly used in scenarios where preserving the
original data is important, like in medical images, scientific data, and text files.
• Lossy compression methods, on the other hand, discard some of the
information in the original data to achieve a higher level of
compression.
• For example, JPEG is a common lossy compression method used in image and
video files. Lossy compression methods are mainly used in scenarios where
data quality is not as important, like in photos, music, and video files.
Run-length encoding (RLE)
• A lossless data compression method that is used to compress
repeating patterns of data.
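A minimal Python sketch of RLE over a string, representing runs as (character, count) pairs; real implementations use more compact binary layouts.

def rle_encode(data):
    # Collapse runs of repeated characters into (character, count) pairs.
    encoded = []
    for ch in data:
        if encoded and encoded[-1][0] == ch:
            encoded[-1] = (ch, encoded[-1][1] + 1)
        else:
            encoded.append((ch, 1))
    return encoded

def rle_decode(encoded):
    # Expand (character, count) pairs back into the original string.
    return "".join(ch * count for ch, count in encoded)

pairs = rle_encode("AAAABBBCCD")
print(pairs)                              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(pairs) == "AAAABBBCCD"  # lossless round trip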
Data Sampling
• Data sampling is the process of selecting a representative subset of a
larger dataset for analysis, modeling, or visualization purposes.
• This technique is used when dealing with large datasets as it can save
time and computing resources by only processing a portion of the
data.
Data Sampling Techniques
• Simple random sampling: This method involves selecting a random sample from the entire population of data. The sample size is determined prior to the selection process, and each data point has an equal chance of being selected.
• Stratified sampling: In this method, the population is divided into strata (homogeneous subgroups) and a random sample is taken from each stratum. This is done to ensure that each stratum is represented in the sample in proportion to its size in the population.
• Cluster sampling: In this method, the population is divided into groups (clusters) and a random sample of clusters is selected. Then, all the data points within the selected clusters are included in the sample.
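A minimal pandas sketch of all three methods; the species column, sample sizes, and sampling fractions are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "species": ["setosa"] * 6 + ["virginica"] * 3,
    "value": range(9),
})

# Simple random sampling: every row has an equal chance of selection.
simple = df.sample(n=4, random_state=0)

# Stratified sampling: draw the same fraction from each subgroup so that
# every stratum stays proportionally represented.
stratified = df.groupby("species", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=0)
)

# Cluster sampling: pick whole groups at random and keep all of their rows.
chosen = df["species"].drop_duplicates().sample(n=1, random_state=0)
clusters = df[df["species"].isin(chosen)]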