Data Preprocessing
[Figure: course overview with the topics Biology Recap, Data Sources, Data Formats, Preprocessing and Feature Engineering, ML, Oncology, Refine, Evaluate]
[Figure: machine learning workflow. Stages: Requirements Analysis, Data Acquisition, Data Preparation, Predictive Modeling, Evaluation, Deployment. Data Preparation steps: Exploration, Cleansing, Labeling, Imputation, Feature engineering. Other labels: Quality requirements, Raw data, Training data, Test data, Model assessment]
Preprocessing and
Feature Engineering
Roles: Data Scientist, Domain Expert, (Data) Engineer
Data Management for
Digital Health, Winter
2019
Icons made by Smashicons from www.flaticon.com
What Is Data Preparation
Data preparation can make or break the predictive ability of your model
According to Kuhn and Johnson, data preparation is the process of adding, deleting or transforming training set data
Sometimes, preprocessing of data can lead to unexpected improvements in model accuracy
Data preparation is an important step: experiment with the preprocessing steps that are appropriate for your data to see whether they give you the desired boost in model accuracy
Data Preparation Importance
Motivation
https://elitedatascience.com/feature-engineering
meaningful features are included
Improving the performance of machine learning models
Why Data Preparation Is so Important in Digital Health
https://www.researchgate.net/publication/332436103_Impact_of_Preprocessing_Methods_on_Healthcare_Predictions
Data Preparation Steps
Data Preparation Process
https://statistik-dresden.de/archives/1128
Select Data
Step 1: Select Data
Step 2: Preprocess Data
Step 3: Transform Data
There is always a strong temptation to include all available data, on the assumption that the maxim “more is better” will hold. This may or may not be true
Consider what data you actually need to address the question or problem you are working on
Questions to help you think:
• What is the extent of the data you have available?
• What data is not available that you wish you had available?
• What data don’t you need to address the problem?
http://uniquerecall.com/
Preprocess Data
Better Data > Fancier Algorithms
• Larger computational and memory requirements
• Take a smaller representative sample before considering the whole dataset
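One way to take a smaller sample that stays representative is to sample each class separately, so that class proportions are preserved. The following is only a sketch under assumptions: the function name `stratified_sample`, the labels "healthy" and "sick", and the 10% fraction are hypothetical and not from the slides.

```python
import numpy as np

def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices of a sample that keeps class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        # Keep at least one row per class so rare classes do not vanish.
        n_take = max(1, round(fraction * len(idx)))
        chosen.extend(rng.choice(idx, size=n_take, replace=False))
    return np.sort(np.array(chosen))

# Hypothetical imbalanced dataset: 900 healthy and 100 sick patients.
labels = np.array(["healthy"] * 900 + ["sick"] * 100)
sample_idx = stratified_sample(labels, fraction=0.1)
```

The sample contains one tenth of each class, so experiments on it see the same class balance as the whole dataset.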
Dummy Variables
ordering Small < Medium < Large. To indicate the ordering, use more 1s for higher categories
https://de.mathworks.com/help/stats/dummy-indicator-variables.html
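The difference between plain dummy variables and the ordered variant (more 1s for higher categories) can be sketched as follows. The function names `one_hot` and `ordinal_dummies` are illustrative assumptions, not taken from the slides or from any library.

```python
import numpy as np

SIZES = ["Small", "Medium", "Large"]  # ordered from lowest to highest

def one_hot(values, categories):
    """One column per category, exactly one 1 per row (no ordering implied)."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        out[row, index[v]] = 1
    return out

def ordinal_dummies(values, categories):
    """k ordered levels become k-1 columns; higher levels get more 1s."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories) - 1), dtype=int)
    for row, v in enumerate(values):
        out[row, :index[v]] = 1
    return out

plain = one_hot(["Small", "Medium", "Large"], SIZES)
ordered = ordinal_dummies(["Small", "Medium", "Large"], SIZES)
```

With the ordered coding, Small maps to (0, 0), Medium to (1, 0) and Large to (1, 1), so each extra 1 marks one step up in the ordering.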
Transformed Attributes
https://www.davidzeleny.net/anadat-r/doku.php/en:data_preparation
Transformed Attributes
Box-Cox
https://www.davidzeleny.net/anadat-r/doku.php/en:data_preparation
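For reference, the Box-Cox transform with parameter λ maps a positive value x to (x^λ - 1) / λ, and to log(x) when λ = 0. A minimal sketch follows; the function name `box_cox` is an assumption, and λ is passed in by hand here, whereas in practice it is usually estimated from the data.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox power transform for strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)  # limiting case of (x**lam - 1) / lam
    return (x ** lam - 1.0) / lam

skewed = np.array([1.0, 4.0, 9.0, 100.0])
sqrt_like = box_cox(skewed, 0.5)  # lam = 0.5 behaves like a square root
log_like = box_cox(skewed, 0.0)   # lam = 0 is the natural logarithm
```

Both settings compress the long right tail of the example values, which is the usual reason for applying the transform.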
How to Handle Missing Data
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
Data Imputation
(Mean/Median) Values
Pros: Works well with categorical features
Cons: It doesn’t factor in the correlations between features; it can introduce bias in the data
Zero or Constant imputation replaces the missing values with either zero or any constant value you specify
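The simple single-column strategies can be sketched in a few lines. This is an illustration under assumptions: the function name `impute` and the `heart_rate` column are hypothetical, and missing values are assumed to be encoded as NaN.

```python
import numpy as np

def impute(column, strategy="mean", fill_value=0.0):
    """Replace NaNs in one numeric column by its mean, its median or a constant."""
    column = np.asarray(column, dtype=float)
    if strategy == "mean":
        value = np.nanmean(column)    # mean of the observed values only
    elif strategy == "median":
        value = np.nanmedian(column)  # median of the observed values only
    elif strategy == "constant":
        value = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.where(np.isnan(column), value, column)

heart_rate = [60.0, np.nan, 70.0, 110.0]  # hypothetical measurements
by_mean = impute(heart_rate, "mean")
by_median = impute(heart_rate, "median")
by_zero = impute(heart_rate, "constant", fill_value=0.0)
```

Each column is filled on its own, which is why these strategies ignore correlations between features.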
Data Imputation
k-NN
Pros: Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset)
Cons: Computationally expensive. KNN works by storing the whole training dataset in memory
Cons: K-NN is quite sensitive to outliers in the data (unlike SVM)
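A simplified k-NN imputation might look like the sketch below. It assumes that at least k rows are fully observed and uses only those rows as donors; the function name `knn_impute` is hypothetical, and library implementations handle more cases (for example partially observed donors).

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest fully observed rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]  # donor rows without missing values
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Euclidean distance computed on the observed columns of this row only.
        diffs = complete[:, ~missing] - row[~missing]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        filled[i, missing] = complete[nearest][:, missing].mean(axis=0)
    return filled

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, np.nan]]
X_filled = knn_impute(X, k=2)
```

The whole donor set is held in memory and scanned for every incomplete row, which illustrates the computational cost listed under the cons.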
Data Imputation
Multivariate Imputation
https://www.youtube.com/watch?v=zX-pacwVyvU
Data Reduction
Projection
Principal Component Analysis (PCA)
http://setosa.io/ev/principal-component-analysis/
Pros: Improves Algorithm Performance; Reduces Overfitting; Improves Visualization
Cons: Data standardization is a must before PCA; Information Loss
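PCA with the standardization step it needs can be sketched with a singular value decomposition. The function name `pca` and the toy data are assumptions for illustration; the sketch also assumes that no feature is constant, since a zero standard deviation would cause a division by zero.

```python
import numpy as np

def pca(X, n_components):
    """Project standardized data onto its first principal components."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature first
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]             # principal directions
    explained = (S ** 2) / (S ** 2).sum()      # share of variance per component
    return Z @ components.T, explained[:n_components]

# Two perfectly correlated features: one component carries all the variance.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
scores, explained = pca(X, n_components=1)
```

The explained variance ratio shows how much information is kept; whatever the dropped components carried is the information loss listed under the cons.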
Fourier Transformation
Fourier showed that any periodic signal s(t) can be written as a sum of sine waves with various amplitudes, frequencies and phases
http://mriquestions.com/fourier-transform-ft.html
Fast Fourier Transform
https://giphy.com/gifs/fourier-transform-Km4XeiMqFNCDK
Discrete Fourier Transform
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}$$
http://mriquestions.com/fourier-transform-ft.html
https://de.wikipedia.org/wiki/Joseph_Fourier
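The discrete Fourier transform can be written directly from its definition and checked against NumPy's fast Fourier transform, which computes the same values more efficiently. The function name `dft` and the test signal (two sine waves with 4 and 10 cycles over 64 samples) are assumptions chosen for illustration.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)  # N x N matrix of complex exponentials
    return W @ x

t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 64) + 0.5 * np.sin(2 * np.pi * 10 * t / 64)
spectrum = dft(signal)
fast = np.fft.fft(signal)  # same result, computed in O(N log N)
```

The magnitude of the spectrum peaks at k = 4 and k = 10, the two frequencies that were mixed into the signal.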
Filter
https://www.adinstruments.com/tips/data-quality
Correlation
http://www.sthda.com/english/wiki/correlation-analyses-in-r
attributes in your dataset
Using correlation, you can get some insights such as:
• One or multiple attributes depend on another
• One or multiple attributes are associated with other attributes
Can help in predicting one attribute from another (great way to impute missing values)
Can (sometimes) indicate the presence of a causal relationship
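Both ideas, measuring correlation and using a correlated attribute to impute missing values, can be sketched as follows. The function names `pearson` and `impute_from` are hypothetical, and a straight-line fit is only one possible way to predict one attribute from another.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def impute_from(x, y):
    """Fill NaNs in y using a least-squares line fitted where y is observed."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    seen = ~np.isnan(y)
    slope, intercept = np.polyfit(x[seen], y[seen], deg=1)
    return np.where(seen, y, slope * x + intercept)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, np.nan, 8.0]
r = pearson([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])  # observed pairs only
y_filled = impute_from(x, y)
```

A coefficient close to 1 or -1 suggests the fitted line is a reasonable predictor; it does not by itself show a causal relationship.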
Autocorrelation
https://en.wikipedia.org/wiki/Autocorrelation
https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
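Autocorrelation is the correlation of a series with a lagged copy of itself. A minimal sketch of the sample autocorrelation function follows; the function name `autocorrelation` and the sine wave with a period of 20 samples are assumptions for illustration.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    return np.array([
        (xc[:len(x) - k] * xc[k:]).sum() / denom  # overlap of series and its shift
        for k in range(max_lag + 1)
    ])

t = np.arange(200)
periodic = np.sin(2 * np.pi * t / 20)  # repeats every 20 samples
acf = autocorrelation(periodic, max_lag=20)
```

The value is 1 at lag 0, strongly negative at half a period (lag 10) and strongly positive again at a full period (lag 20), which is how periodic structure shows up in an autocorrelation plot.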
Transform Data
$$\tilde{x} = \frac{x - \operatorname{mean}(x)}{\sqrt{\operatorname{var}(x)}}$$
The scaled feature has a mean of 0 and a variance of 1
If the original feature has a Gaussian distribution, then the scaled feature does too
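The scaling formula translates directly into code. The function name `standardize` and the example values are assumptions; the sketch uses the population variance and assumes the feature is not constant.

```python
import numpy as np

def standardize(x):
    """Subtract the mean and divide by the square root of the variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / np.sqrt(x.var())

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, variance 4
```

The result has mean 0 and variance 1 while the shape of the distribution stays the same.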
Min-Max Scaling
$$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
Here min(x) and max(x) are taken over the entire dataset
Min-max scaling squeezes (or stretches) all feature values to be within the range of [0, 1]
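A minimal sketch of min-max scaling follows. The function name `min_max_scale` and the example values are assumptions, and the sketch assumes the feature takes at least two distinct values, since otherwise the denominator is zero.

```python
import numpy as np

def min_max_scale(x):
    """Map the smallest value to 0 and the largest to 1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

scaled = min_max_scale([10.0, 20.0, 30.0])
```

Because the minimum and maximum define the range, a single extreme outlier compresses all other values into a narrow band.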
Why Scaling?
https://blog.dellemc.com/en-us/digital-transformation-just-got-easier-with-analytic-insights/
Feature Engineering
everything else the result. No algorithm alone, to my
Alice Zheng and Amanda Casari, O’Reilly, 2018
Feature Engineering
Example: Coordinate Transformation
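A common illustration of a coordinate transformation is the change from Cartesian to polar coordinates: two classes lying on concentric circles cannot be separated by a straight line in (x, y), but a single threshold on the radius separates them. The sketch below assumes this scenario; the function name `to_polar` and the radii 1 and 3 are hypothetical.

```python
import numpy as np

def to_polar(xy):
    """Convert rows of (x, y) into rows of (radius, angle)."""
    xy = np.asarray(xy, dtype=float)
    r = np.hypot(xy[:, 0], xy[:, 1])
    theta = np.arctan2(xy[:, 1], xy[:, 0])
    return np.column_stack([r, theta])

angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])  # class A, radius 1
outer = 3 * inner                                          # class B, radius 3
polar_inner, polar_outer = to_polar(inner), to_polar(outer)
```

After the transformation, the rule "radius below 2" assigns every point to the correct class, so a very simple model suffices.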
Iterative Process of Feature Engineering
Brainstorm features: Really get into the problem, look at a lot of data, study
feature engineering on other problems and see what you can steal
Devise features: Depends on your problem, but you may use automatic feature
extraction, manual feature construction and mixtures of the two
Select features: Use different feature importance scorings and feature selection
methods to prepare one or more “views” for your models to operate upon
Evaluate models: Estimate model accuracy on unseen data using the chosen
features
Aspects of Feature Engineering
Feature Selection: Most useful and relevant features are selected from the available data
Feature Extraction: Existing features are combined to develop more useful ones
Feature Addition: New features are created by gathering new data
Feature Filtering: Filter out irrelevant features to make the modeling step easy
Feature Selection
https://towardsdatascience.com/feature-selection-techniques-1bfab5fe0784
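One simple family of feature selection techniques are filter methods, which score each feature independently of any model. The sketch below ranks features by their absolute Pearson correlation with the target; it is only one illustrative choice, and the function name `rank_features` and the synthetic data are assumptions.

```python
import numpy as np

def rank_features(X, y):
    """Return feature indices sorted by |correlation with y|, best first."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    scores = np.array([
        abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])
    ])
    order = np.argsort(scores)[::-1]
    return order, scores

# Synthetic data: column 1 drives the target, column 0 is pure noise.
rng = np.random.default_rng(0)
informative = rng.normal(size=200)
noise = rng.normal(size=200)
X = np.column_stack([noise, informative])
y = 2.0 * informative + 0.1 * rng.normal(size=200)
order, scores = rank_features(X, y)
```

The informative column is ranked first. A correlation filter only detects linear, one-feature-at-a-time relationships, so it can miss features that matter only in combination.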
Feature Extraction
To Know More
“What are you doing there?”
“I’m comparing the curves and trying to find similarities or abnormalities.”
“Let me show you how to do it.”
https://en.wikipedia.org/wiki/Dr._Nick
https://en.wikipedia.org/wiki/Professor_Frink
https://www.cvphysiology.com/Arrhythmias/A009.htm
Euclidean Distance Metric
Comparing Two Time Series
About 80% of published work in data mining uses Euclidean distance
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
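For two time series of equal length, the Euclidean distance is the square root of the summed squared differences at each time step. A minimal sketch, with the function name `euclidean_distance` as an assumption:

```python
import numpy as np

def euclidean_distance(q, c):
    """Point-by-point Euclidean distance between two equally long series."""
    q, c = np.asarray(q, dtype=float), np.asarray(c, dtype=float)
    if q.shape != c.shape:
        raise ValueError("both series must have the same length")
    return float(np.sqrt(((q - c) ** 2).sum()))

d = euclidean_distance([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

The comparison is strictly step by step, so the two series must be aligned in time and sampled at the same rate.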
Data Preparation Time Series
Euclidean distance is very sensitive to some “distortions” in the data. For most problems these distortions are not meaningful and should be removed
4 most common distortions
• Offset Translation
https://en.wikipedia.org/wiki/Dr._Nick
https://en.wikipedia.org/wiki/Professor_Frink
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
Preprocessing the Data
Amplitude Scaling
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
Preprocessing the Data
Offset Translation
The intuition behind removing
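Offset translation and amplitude scaling can be sketched together: subtracting the mean removes the offset, and dividing by the standard deviation removes the amplitude difference. The function names and the example series below are assumptions for illustration.

```python
import numpy as np

def remove_offset(x):
    """Offset translation: shift the series so that its mean is 0."""
    x = np.asarray(x, dtype=float)
    return x - x.mean()

def remove_offset_and_amplitude(x):
    """Offset translation followed by amplitude scaling (unit standard deviation)."""
    centered = remove_offset(x)
    return centered / centered.std()

t = np.linspace(0, 2 * np.pi, 100)
q = np.sin(t)
c = 5.0 * np.sin(t) + 10.0  # same shape as q, but shifted and stretched

raw_distance = np.linalg.norm(q - c)
clean_distance = np.linalg.norm(
    remove_offset_and_amplitude(q) - remove_offset_and_amplitude(c)
)
```

The raw distance is large although both series have the same shape; after both distortions are removed, the distance is practically zero.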
Date Time Features: These are components of the time step itself for each
observation
Lag Features: These are values at prior time steps
Window Features: These are a summary of values over a fixed window of prior time
steps
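The three feature types can be sketched on a small hourly series. The helper names `lag_feature` and `window_mean`, the timestamps and the readings are hypothetical; the sketch assumes a lag of at least 1 and a window that excludes the current time step.

```python
from datetime import datetime, timedelta

import numpy as np

def lag_feature(values, lag):
    """Value observed `lag` steps earlier (NaN where no history exists)."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    out[lag:] = values[:-lag]
    return out

def window_mean(values, width):
    """Mean of the `width` values before each time step."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(width, len(values)):
        out[t] = values[t - width:t].mean()
    return out

start = datetime(2019, 11, 1, 8, 0)
timestamps = [start + timedelta(hours=h) for h in range(5)]
readings = [10.0, 12.0, 14.0, 16.0, 18.0]

hour_of_day = [ts.hour for ts in timestamps]   # date time feature
weekday = [ts.weekday() for ts in timestamps]  # date time feature, Monday = 0
previous = lag_feature(readings, lag=1)        # lag feature
recent_mean = window_mean(readings, width=2)   # window feature
```

The first entries of the lag and window features are NaN because no earlier values exist; such rows are typically dropped or imputed before modeling.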
https://tsfresh.readthedocs.io/en/latest/text/introduction.html
Automated Feature Engineering
Why Do It?
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
http://amid.fish/anomaly-detection-with-k-means-clustering
What to Take Home?
Data preparation simplifies data to make it ready for machine learning and involves data selection, preprocessing, and transformation
Step 1: Data Selection. Consider what data is available, what data is missing and what data can be removed
Step 2: Data Preprocessing. Organize your selected data by formatting, cleaning and sampling from it
Step 3: Data Transformation. Make the preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation