Data Preprocessing

Uploaded by

Vinay Kumar

Preprocessing and Feature Engineering

Bergner, Borchert, da Cruz, Konak, Dr. Schapranow

Data Management for Digital Health
Winter 2019
Agenda

[Figure: course agenda. Tracks: Medical Use Cases, Technology Foundation, Machine Learning. Topics shown: Oncology; Nephrology and Intensive Care; Biology Recap; Data Sources; Data Formats; Processing and Analysis; Software Architectures; Data; ML; Refine; Evaluate; Prediction + Probability.]

Data Preparation

[Figure: analytics pipeline. Stages: Requirements Analysis, Data Acquisition, Data Preparation, Predictive Modeling, Evaluation, Deployment. Artifacts: model requirements, raw data, training data, test data. Data Preparation tasks: exploration, quality assessment, cleansing, labeling, imputation, feature engineering. Roles: Data Scientist, Domain Expert, (Data) Engineer.]
Icons made by Smashicons from www.flaticon.com

What Is Data Preparation

Data preparation can make or break the predictive ability of your model
According to Kuhn and Johnson, data preparation is the process of adding, deleting, or transforming training set data
Preprocessing can sometimes lead to unexpected improvements in model accuracy
Data preparation is an important step: experiment with the preprocessing steps that are appropriate for your data to see whether they give a boost in model accuracy

Data Preparation Importance
Motivation

Data in healthcare → sparse and incomplete
Preparing a proper input dataset that is compatible with the requirements of the machine learning algorithm
Integral step in machine learning
Directly affects the ability of our model to learn
Make sure that the data is in a useful scale and format, and that meaningful features are included
Improving the performance of machine learning models
Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
https://elitedatascience.com/feature-engineering
Why Data Preparation Is so Important in Digital Health

https://www.researchgate.net/publication/332436103_Impact_of_Preprocessing_Methods_on_Healthcare_Predictions
Data Preparation Steps

How do I clean up the data? → Data Cleaning
How do I provide accurate data? → Data Transformation
How do I incorporate and adjust data? → Data Integration
How do I unify and scale data? → Data Normalization
How do I handle missing data? → Missing Data Imputation
How do I detect and manage noise? → Noise Identification
Data Preparation Process

The process for getting data ready for a machine learning algorithm can be summarized in three steps:
• Step 1: Select Data
• Step 2: Preprocess Data
• Step 3: Transform Data
Follow this process in a linear manner
https://statistik-dresden.de/archives/1128

Select Data

There is always a strong desire to include all the data that is available, on the assumption that the maxim “more is better” will hold. This may or may not be true
Consider what data you actually need to address the question or problem you are working on
Questions to help you think:
• What is the extent of the data you have available?
• What data is not available that you wish you had available?
• What data don’t you need to address the problem?
http://uniquerecall.com/

Preprocess Data
Better Data > Fancier Algorithms

Formatting: Selected data may not be in a suitable format
Cleaning: Removal or fixing of missing data
• Some data instances are incomplete and do not carry the data needed to address the problem
• Sensitive information → anonymized or removed
• Identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data
Sampling: More selected data may be available than needed
• Longer running times for algorithms
• Larger computational and memory requirements
• Take a smaller representative sample before considering the whole dataset
https://www.flickr.com/photos/marc_smith/1473557291/sizes/l/

Dummy Variables
Also known as One-Hot Encoding!

Transforming a categorical attribute into numerical attributes
Each dummy attribute takes the value 0 or 1
Full Dummy Variables: Represent n categories using n dummy variables, one variable for each level
Dummy Variables with Reference Group: Represent a categorical variable with n categories using n-1 dummy variables
Dummy Variables for Ordered Categorical Variable with Reference Group: Assume a mathematical ordering, e.g., Small < Medium < Large. To indicate the ordering, use more 1s for higher categories
https://de.mathworks.com/help/stats/dummy-indicator-variables.html
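
The three encodings can be sketched in a few lines of plain Python. This is only an illustration: the function names and the Small/Medium/Large sample are made up for this example, and in practice a data-frame or ML library would usually do the encoding.

```python
def dummy_encode(values, levels, drop_reference=False):
    """Encode each value as a list of 0/1 indicators, one per level.

    With drop_reference=True the first level becomes the reference group
    and is represented by all zeros (n-1 dummy variables).
    """
    used = levels[1:] if drop_reference else levels
    return [[1 if v == level else 0 for level in used] for v in values]


def ordinal_dummy_encode(values, ordered_levels):
    """Encode ordered categories with more 1s for higher categories.

    The lowest level is the reference group (all zeros).
    """
    rank = {level: i for i, level in enumerate(ordered_levels)}
    width = len(ordered_levels) - 1
    return [[1 if rank[v] > j else 0 for j in range(width)] for v in values]


# Hypothetical sample data.
sizes = ["Small", "Large", "Medium"]
levels = ["Small", "Medium", "Large"]

full = dummy_encode(sizes, levels)                           # n dummies
reference = dummy_encode(sizes, levels, drop_reference=True)  # n-1 dummies
ordered = ordinal_dummy_encode(sizes, levels)                 # ordered coding
```

Here "Small" acts as the reference group in the last two encodings, so it is represented by zeros only.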

Transformed Attributes

Data transformation changes relative differences among individual values
Types of transformation:
• Linear: adding a constant or multiplying by a constant
• Non-linear: log transformation, square-root transformation, etc.
https://www.davidzeleny.net/anadat-r/doku.php/en:data_preparation

Transformed Attributes
Box-Cox

Log transformation is suitable for strongly right-skewed data; square-root transformation is suitable for slightly right-skewed data
https://www.davidzeleny.net/anadat-r/doku.php/en:data_preparation
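
A small sketch of how the two transformations reduce right skew. The measurements are invented for illustration, and `skewness` is a helper defined here (the mean cubed z-score), not something taken from the slides.

```python
import math


def skewness(values):
    """Population skewness: the mean cubed z-score (positive = right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Hypothetical right-skewed measurements (all positive, so log is defined).
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 20.0, 100.0]
sqrt_values = [math.sqrt(v) for v in raw]
log_values = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_sqrt = skewness(sqrt_values)
skew_log = skewness(log_values)
```

On this sample the square root reduces the skew moderately and the logarithm reduces it further, which matches the rule of thumb above.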

How to Handle Missing Data

There is NO good way to deal with missing data!
Different solutions for data imputation exist depending on the kind of problem: time series analysis, ML, regression, etc.
No general solution
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

Data Imputation
(Mean/Median) Values

Calculating the mean/median of the non-missing values in a column and using it to replace the missing values in that column

Pros:
• Easy and fast
• Works well with small numerical datasets
Cons:
• Doesn’t factor in the correlations between features; it only works on the column level
• Will give poor results on encoded categorical features (do NOT use it on categorical features)
• Not very accurate
• Doesn’t account for the uncertainty in the imputations
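
A minimal sketch of column-wise mean/median imputation, assuming missing values are represented as `None`. The function name and the heart-rate column are hypothetical.

```python
import statistics


def impute_column(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


# Hypothetical heart-rate column with two missing readings and one outlier.
heart_rate = [60, None, 70, 80, None, 190]
mean_imputed = impute_column(heart_rate, "mean")
median_imputed = impute_column(heart_rate, "median")
```

The outlier (190) pulls the mean up to 100 while the median stays at 75, which is one reason to compare both strategies on real data.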

Data Imputation
(Most Frequent) or (Zero/Constant) Values

Most Frequent is another statistical strategy to impute missing values
It replaces missing data with the most frequent value within each column

Pros:
• Works well with categorical features
Cons:
• It also doesn’t factor in the correlations between features
• It can introduce bias in the data

Zero or Constant imputation replaces the missing values with either zero or any constant value you specify
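
Both strategies can be sketched as follows, again assuming `None` marks a missing value. The function names and the sample columns are made up for illustration.

```python
from collections import Counter


def impute_most_frequent(column):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [most_common_value if v is None else v for v in column]


def impute_constant(column, fill_value=0):
    """Replace None entries with a fixed constant."""
    return [fill_value if v is None else v for v in column]


# Hypothetical categorical and numerical columns.
blood_type = ["A", "O", None, "A", "B", None]
dose_mg = [5, None, 10, None]

blood_type_filled = impute_most_frequent(blood_type)
dose_filled = impute_constant(dose_mg, fill_value=0)
```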

Data Imputation
k-NN

k nearest neighbours is an algorithm that is used for simple classification
The algorithm uses ‘feature similarity’ to predict the values of any new data points
A new point is assigned a value based on how closely it resembles the points in the training set

Pros:
• Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset)
Cons:
• Computationally expensive: k-NN works by storing the whole training dataset in memory
• k-NN is quite sensitive to outliers in the data (unlike SVM)
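
A simplified sketch of k-NN imputation. It makes several assumptions that the slide does not state: neighbours are drawn only from complete rows, distance is Euclidean over the features the incomplete row actually has, and the imputed value is the neighbours' mean. The function name and data are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries using the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        known = [i for i, v in enumerate(row) if v is not None]

        def distance(other):
            # Compare only the features that are present in the incomplete row.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in known))

        neighbours = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return imputed


# Hypothetical data: the last row is missing its second feature.
rows = [[1.0, 10.0], [2.0, 20.0], [10.0, 100.0], [1.5, None]]
filled = knn_impute(rows, k=2)
```

The missing value is filled with 15.0, the mean of the two most similar rows, rather than the column mean of about 43.3.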

Data Imputation
Multivariate Imputation

Filling in the missing data multiple times
Multiple imputations (MIs) are much better than a single imputation because they capture the uncertainty of the missing values in a better way
The chained equations approach is also very flexible and can handle variables of different data types
https://www.youtube.com/watch?v=zX-pacwVyvU

Data Reduction

How do I reduce the dimensionality of data? → Feature Selection (FS)
How do I remove redundant and/or conflictive examples? → Instance Selection (IS)
How do I simplify the domain of an attribute? → Discretization
How do I fill in gaps in data? → Feature Extraction and/or Instance Generation

Projection
Principal Component Analysis (PCA)

As the amount of data in the world grows, the size of datasets available for ML development also grows
Dimensionality reduction transforms data to new dimensions in a way that facilitates discarding some dimensions without losing key information
Large-scale problems bring about many dimensions, which can become very difficult to visualize
Some of these dimensions can easily be dropped for a better visualization
https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial

Applications of PCA

Pros:
• Removes Correlated Features
• Improves Algorithm Performance
• Reduces Overfitting
• Improves Visualization
Cons:
• Independent variables become less interpretable
• Data standardization is a must before PCA
• Information Loss
http://setosa.io/ev/principal-component-analysis/
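
To make the idea concrete, here is a sketch of PCA restricted to two features, where the eigen-decomposition of the covariance matrix has a closed form. It assumes the two features are already on comparable scales (standardize first otherwise); the function name and sample points are hypothetical, and a linear-algebra library would normally be used for more than two dimensions.

```python
import math


def pca_2d(points):
    """Return (direction, explained_ratio, scores) of the first principal
    component for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    top = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector; fall back to an axis if the features are uncorrelated.
    if b != 0:
        vx, vy = b, top - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # Project the centered points onto the first principal component.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, top / (a + c), scores


# Hypothetical, perfectly correlated features: y = 2x.
direction, explained_ratio, scores = pca_2d([(1, 2), (2, 4), (3, 6), (4, 8)])
```

Because the two features are perfectly correlated here, one component explains all of the variance and the second dimension can be dropped without loss.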

Fourier Transformation

Fourier showed that any periodic signal s(t) can be written as a sum of sine waves with various amplitudes, frequencies and phases
For example, a square wave can be written as such a Fourier expansion
[Figure: Fourier expansion of a square wave]
http://mriquestions.com/fourier-transform-ft.html
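
For reference, the standard textbook form of this expansion is given below. It assumes a square wave of amplitude 1 with fundamental frequency $f$ that is odd-symmetric about $t = 0$; the formula on the original slide may use a different normalization.

```latex
$$ s(t) = \frac{4}{\pi} \sum_{k = 1, 3, 5, \dots} \frac{1}{k} \sin(2 \pi k f t)
        = \frac{4}{\pi} \left( \sin(2 \pi f t) + \frac{1}{3} \sin(6 \pi f t) + \frac{1}{5} \sin(10 \pi f t) + \dots \right) $$
```

Only odd harmonics appear, and their amplitudes fall off as $1/k$.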

Fast Fourier Transform

https://giphy.com/gifs/fourier-transform-Km4XeiMqFNCDK

Discrete Fourier Transform

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2 \pi k n / N}$$

Fourier series in 1822
http://mriquestions.com/fourier-transform-ft.html
https://de.wikipedia.org/wiki/Joseph_Fourier
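
The definition translates directly into a naive O(N²) implementation. This is a sketch for illustration only; the fast Fourier transform computes the same result in O(N log N). The eight-sample cosine is a made-up test signal.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform: X_k = sum_n x_n * exp(-i*2*pi*k*n/N)."""
    n_samples = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, x in enumerate(signal))
        for k in range(n_samples)
    ]


# Hypothetical signal: exactly one cosine cycle sampled at 8 points.
signal = [math.cos(2 * math.pi * n / 8) for n in range(8)]
spectrum = dft(signal)
```

All of the energy lands in bins 1 and 7 (the positive and negative frequency of the single cosine); every other bin is numerically zero.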

Filter

[Figure panels: Low Pass Filter, High Pass Filter]
https://www.adinstruments.com/tips/data-quality

Fourier Transformation

Important signal processing tool
Used to decompose a signal into its sine and cosine components
Output of the transformation represents the signal in the Fourier or frequency domain
Mathematical operations in the frequency domain can eliminate certain frequency ranges very easily
Apply the inverse Fourier transform to recover the time signal
https://slideplayer.com/slide/4173668/
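
The transform, filter, inverse-transform workflow can be sketched as a simple low-pass filter. The code uses a naive DFT to stay self-contained; the function names, the cutoff, and the test signal are assumptions for this example, and a hard cutoff like this is the crudest possible filter design.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive (inverse) discrete Fourier transform."""
    n_samples = len(signal)
    sign = 1 if inverse else -1
    scale = 1 / n_samples if inverse else 1
    return [
        scale * sum(x * cmath.exp(sign * 2j * cmath.pi * k * n / n_samples)
                    for n, x in enumerate(signal))
        for k in range(n_samples)
    ]


def low_pass(signal, cutoff):
    """Zero every frequency bin above `cutoff` cycles per window."""
    n_samples = len(signal)
    spectrum = dft(signal)
    for k in range(n_samples):
        frequency = min(k, n_samples - k)  # bins k and N-k share a frequency
        if frequency > cutoff:
            spectrum[k] = 0
    return [value.real for value in dft(spectrum, inverse=True)]


# Hypothetical signal: a slow wave (1 cycle) plus a fast disturbance (8 cycles).
n_samples = 32
slow = [math.sin(2 * math.pi * 1 * n / n_samples) for n in range(n_samples)]
fast = [0.5 * math.sin(2 * math.pi * 8 * n / n_samples) for n in range(n_samples)]
noisy = [s + f for s, f in zip(slow, fast)]
filtered = low_pass(noisy, cutoff=3)
```

After filtering, only the slow component remains. A high-pass filter would zero the bins at or below the cutoff instead.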

Correlation

Way to understand the relationship between multiple variables and attributes in your dataset
Using correlation, you can get insights such as:
• One or multiple attributes depend on another
• One or multiple attributes are associated with other attributes
Can help in predicting one attribute from another (great way to impute missing values)
Can (sometimes) indicate the presence of a causal relationship
http://www.sthda.com/english/wiki/correlation-analyses-in-r
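
The slide does not name a specific coefficient; the sketch below assumes the Pearson correlation coefficient, which measures linear association. The sample values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes.
r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear, increasing
r_negative = pearson([1, 2, 3], [3, 2, 1])         # perfectly linear, decreasing
```

Values close to +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear association.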

Autocorrelation

Heavily used in time series analysis and forecasting
Measure of the correlation between a time series and lagged values of itself
Uncover hidden patterns in data
Identify seasonality and trend in time series data
https://en.wikipedia.org/wiki/Autocorrelation
https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
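
A sketch of the sample autocorrelation at a given lag, using one common estimator (normalizing by the total sum of squared deviations). The function name and the seasonal series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical series with a seasonal period of 4 time steps.
series = [1, 2, 3, 2] * 6
acf_lag_2 = autocorrelation(series, 2)
acf_lag_4 = autocorrelation(series, 4)
```

The strong positive value at lag 4 and the strong negative value at lag 2 reveal the period-4 seasonality.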

Transform Data

Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities. Many machine learning methods prefer data attributes to have the same scale
Decomposition: There may be features that represent a complex concept and that are more useful to a machine learning method when split into their constituent parts
• Example → Date
Aggregation: There may be features that can be aggregated into a single feature
https://blog.dellemc.com/en-us/digital-transformation-just-got-easier-with-analytic-insights/

Standardization (Variance Scaling)

$$\tilde{x} = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{var}(x)}}$$

It subtracts the mean of the feature (over all data points) and divides by the standard deviation, i.e., the square root of the variance
It can also be called variance scaling
The resulting scaled feature has a mean of 0 and a variance of 1
If the original feature has a Gaussian distribution, then the scaled feature does too
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018
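
A minimal sketch of the formula, assuming the population variance (dividing by the number of data points). The function name and the sample values are made up.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Hypothetical feature with mean 5 and standard deviation 2.
scaled = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```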

Min-Max Scaling

$$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Let x be an individual feature value (i.e., a value of the feature in some data point)
Let min(x) and max(x), respectively, be the minimum and maximum values of this feature over the entire dataset
Min-max scaling squeezes (or stretches) all feature values to be within the range of [0, 1]
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018
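
The formula as a short sketch; the function name and values are hypothetical, and the code assumes the feature is not constant (otherwise the denominator is zero).

```python
def min_max_scale(values):
    """Map values linearly so that the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values.
scaled = min_max_scale([10, 20, 15, 30])
```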

Why Scaling?

https://blog.dellemc.com/en-us/digital-transformation-just-got-easier-with-analytic-insights/

Feature Engineering

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” ~ Andrew Ng
“The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.” ~ Luca Massaron
“Good data preparation and feature engineering is integral to better prediction” ~ Marios Michailidis (KazAnova), Kaggle GrandMaster, Kaggle #3, former #1
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018

Feature Engineering

Data may be hard to understand and process
Conduct feature engineering to make the data easier for our machine learning models to read
Feature engineering is the process of transforming the given data into a form that is easier to interpret
In general: Features can also be generated so that data visualizations prepared for people without a data-related background are more digestible
Different models often require different approaches for different kinds of data
Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, Alice Zheng and Amanda Casari, O’Reilly, 2018

Feature Engineering
Example: Coordinate Transformation

Not possible to separate using a linear classifier
What if you use polar coordinates instead?
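
The figures for this example did not survive extraction. The sketch below assumes a typical setup: two classes lying on concentric circles, which no straight line in (x, y) can separate. All data and names are hypothetical.

```python
import math


def to_polar(x, y):
    """Convert Cartesian coordinates to (radius, angle)."""
    return math.hypot(x, y), math.atan2(y, x)


# Hypothetical data: class 0 on a circle of radius 1, class 1 on radius 3.
angles = [i * math.pi / 4 for i in range(8)]
inner = [(math.cos(a), math.sin(a)) for a in angles]
outer = [(3 * math.cos(a), 3 * math.sin(a)) for a in angles]

# In polar coordinates a single threshold on the radius separates the classes.
threshold = 2.0
inner_radii = [to_polar(x, y)[0] for x, y in inner]
outer_radii = [to_polar(x, y)[0] for x, y in outer]
```

With the radius as an engineered feature, the decision rule "radius < 2" is a linear boundary in the new feature space.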

Iterative Process of Feature Engineering

Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal
Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two
Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon
Evaluate models: Estimate model accuracy on unseen data using the chosen features

Aspects of Feature Engineering

Feature Engineering:
• Feature Selection: Most useful and relevant features are selected from the available data
• Feature Extraction: Existing features are combined to develop more useful ones
• Feature Addition: New features are created by gathering new data
• Feature Filtering: Filter out irrelevant features to make the modeling step easy

Feature Selection

Process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data

Feature Selection

Three benefits of performing feature selection before modeling your data are:
• Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise
• Improves Accuracy: Less misleading data means modeling accuracy improves
• Reduces Training Time: Less data means that algorithms train faster
https://quantdare.com/what-is-the-difference-between-feature-extraction-and-feature-selection/
https://towardsdatascience.com/feature-selection-techniques-1bfab5fe0784
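
The slides do not prescribe a selection method. As one simple possibility, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top k (a so-called filter method). Feature names and values are invented, and this scoring only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_top_features(features, target, k):
    """Keep the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical dataset: two informative features and one noisy one.
features = {
    "dose": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "age": [5, 4, 3, 2, 2],
}
target = [2, 4, 6, 8, 10]
selected = select_top_features(features, target, k=2)
```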

Feature Extraction

Aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features)
The new, reduced set of features should then be able to summarize most of the information contained in the original set
Creating an interaction (e.g., multiply or divide) between each pair of variables → lengthy process
Deep feature synthesis (DFS) is an algorithm which enables you to quickly create new variables with varying depth
https://matlab1.com/feature-extraction-image-processing/

To Know More

Here are some generally relevant papers:
• JMLR Special Issue on Variable and Feature Selection
Here are some generally relevant and interesting slides:
• Feature Engineering (PDF), Knowledge Discovery and Data Mining 1, by Roman Kern, Knowledge Technologies Institute
• Feature Engineering and Selection (PDF), CS 294: Practical Machine Learning, Berkeley
• Feature Engineering Studio, Course Lecture Slides and Materials, Columbia
• Feature Engineering (PDF), Leon Bottou, Princeton
And a video for some good practical tips:
• Feature Engineering

Time Series
Let’s Compare ECG Signals

“What are you doing there?”
“I’m comparing the curves, trying to find similarities and abnormalities.”
“Let me show you how to do it.”
https://en.wikipedia.org/wiki/Dr._Nick
https://en.wikipedia.org/wiki/Professor_Frink
https://www.cvphysiology.com/Arrhythmias/A009.htm

Euclidean Distance Metric
Comparing Two Time Series

Let’s assume we want to compare two time series
About 80% of published work in data mining uses Euclidean distance
https://en.wikipedia.org/wiki/Professor_Frink
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
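
For two time series of equal length, the Euclidean distance is the square root of the summed squared point-wise differences. A minimal sketch with made-up series:

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long time series."""
    if len(a) != len(b):
        raise ValueError("time series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical series: identical shape, but the second is shifted up by 10.
shifted_distance = euclidean_distance([1, 2, 3], [11, 12, 13])
```

Even though the two series have the same shape, the constant shift alone produces a large distance.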

Data Preparation Time Series

If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results
Euclidean distance is very sensitive to some “distortions” in the data. For most problems these distortions are not meaningful → should remove them
4 most common distortions:
• Offset Translation
• Amplitude Scaling
• Linear Trends
• Noise
https://en.wikipedia.org/wiki/Dr._Nick
https://en.wikipedia.org/wiki/Professor_Frink
Preprocessing the Data
Offset Translation

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf

Preprocessing the Data
Amplitude Scaling

Zero-mean
Unit-variance
Widely used for normalization in many machine learning algorithms
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
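
Subtracting the mean removes the offset, and dividing by the standard deviation removes the amplitude scaling. The sketch below shows the effect on the Euclidean distance; the helper names and the two series are hypothetical.

```python
import math


def z_normalize(series):
    """Rescale a time series to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / std for v in series]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical series: `distorted` has the same shape as `base`,
# but with an offset of 10 and an amplitude scaled by 5.
base = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
distorted = [10 + 5 * v for v in base]

raw_distance = euclidean_distance(base, distorted)
normalized_distance = euclidean_distance(z_normalize(base), z_normalize(distorted))
```

After normalization the two series coincide, so their distance drops to (numerically) zero.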

Preprocessing the Data
Linear Trend

Removing linear trend:
• Fit the best-fitting straight line to the time series, then
• subtract that line from the time series
[Figure labels: Removed offset translation; Removed amplitude scaling; Removed linear trend]
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
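
A sketch of the two steps, assuming "best fitting" means an ordinary least-squares line over the time steps 0, 1, ..., n-1. The function name and the test series are made up.

```python
def remove_linear_trend(series):
    """Fit a least-squares line over time steps 0..n-1 and subtract it."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    slope = (
        sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
        / sum((t - mean_t) ** 2 for t in range(n))
    )
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


# Hypothetical series that is a pure linear trend: y = 2t + 3.
detrended = remove_linear_trend([2 * t + 3 for t in range(10)])
```

For a series that is nothing but a trend, the residual is zero everywhere; for real data, what remains is the variation around the fitted line.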

Preprocessing the Data
Noise

The intuition behind removing noise is …
Average each data point’s value with its neighbors
http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
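
This idea corresponds to a centered moving average. The sketch below shortens the window at the edges of the series, which is one of several possible boundary choices; the function name, `radius` parameter, and sample values are assumptions.

```python
def moving_average(series, radius=1):
    """Average each point with up to `radius` neighbours on each side."""
    smoothed = []
    for i in range(len(series)):
        window = series[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical noisy series that alternates between 0 and 3.
smoothed = moving_average([0, 3, 0, 3, 0], radius=1)
```

A larger radius smooths more strongly but also blurs genuine short-lived features of the signal.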

Feature Engineering for Time Series

Date Time Features: These are components of the time step itself for each observation
Lag Features: These are values at prior time steps
Window Features: These are a summary of values over a fixed window of prior time steps
https://tsfresh.readthedocs.io/en/latest/text/introduction.html
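
The three feature types can be sketched with the standard library alone. The function, the feature names, and the hourly measurements are hypothetical; libraries such as tsfresh automate much richer versions of this.

```python
from datetime import datetime, timedelta


def time_series_features(timestamps, values, lag=1, window=3):
    """Build one feature dict per time step that has enough history."""
    rows = []
    for t in range(max(lag, window), len(values)):
        history = values[t - window:t]  # strictly prior values only
        rows.append({
            "hour": timestamps[t].hour,                    # date-time feature
            "weekday": timestamps[t].weekday(),            # date-time feature
            f"lag_{lag}": values[t - lag],                 # lag feature
            f"mean_prev_{window}": sum(history) / window,  # window feature
            "target": values[t],
        })
    return rows


# Hypothetical hourly measurements starting Monday, 2 December 2019, 08:00.
start = datetime(2019, 12, 2, 8, 0)
timestamps = [start + timedelta(hours=h) for h in range(6)]
values = [70, 72, 75, 71, 69, 74]
rows = time_series_features(timestamps, values)
```

Using only strictly prior values for the lag and window features avoids leaking the value being predicted into its own features.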

Automated Feature Engineering
Why Do It?

We’re interested in features: we want to know which are relevant. If we fit a model, it should be interpretable
• What causes lung cancer?
  – Features are aspects of a patient’s medical history
  – Binary response variable: did the patient develop lung cancer?
  – Which features best predict whether lung cancer will develop? Might want to legislate against these features.
https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063
What Next?

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_2017.pdf
http://amid.fish/anomaly-detection-with-k-means-clustering
What to Take Home?

Data preparation simplifies data to make it ready for machine learning and involves data selection, preprocessing, and transformation
Step 1 (Data Selection): Consider what data is available, what data is missing and what data can be removed
Step 2 (Data Preprocessing): Organize your selected data by formatting, cleaning and sampling from it
Step 3 (Data Transformation): Transform the preprocessed data so that it is ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation

Data Science
39 pages
Kmeans Matlab Code Feed Own Data Source - QuestionInBox
No ratings yet
Kmeans Matlab Code Feed Own Data Source - QuestionInBox
5 pages
Data Mining - An Overview
No ratings yet
Data Mining - An Overview
40 pages
Data Mining: Concepts and Techniques: - Slides For Textbook - Chapter 6
No ratings yet
Data Mining: Concepts and Techniques: - Slides For Textbook - Chapter 6
82 pages
Supervised Learning (Classification and Regression)
No ratings yet
Supervised Learning (Classification and Regression)
14 pages
Data Preprocessing
No ratings yet
Data Preprocessing
38 pages
0802 Python Tutorial
100% (1)
0802 Python Tutorial
155 pages
Machine Learning Introduction
No ratings yet
Machine Learning Introduction
20 pages
Python Data Science
No ratings yet
Python Data Science
25 pages
Answers To Problems For Data Mining and Predictive Analytics (2nd Edition) by Larose
No ratings yet
Answers To Problems For Data Mining and Predictive Analytics (2nd Edition) by Larose
12 pages
Data Modelling and Visualization
No ratings yet
Data Modelling and Visualization
31 pages
Machine Learning C
No ratings yet
Machine Learning C
24 pages
Nptel Swayam DWDM Slides
No ratings yet
Nptel Swayam DWDM Slides
406 pages
Data Preprocessing Removed
No ratings yet
Data Preprocessing Removed
56 pages
2 - Clinical Data Lecture
No ratings yet
2 - Clinical Data Lecture
24 pages
ET 610 - Data Preprocessing
No ratings yet
ET 610 - Data Preprocessing
41 pages
Mcqs in Dbms213
No ratings yet
Mcqs in Dbms213
9 pages
Machine Learning 8hmrvc (1)
No ratings yet
Machine Learning 8hmrvc (1)
52 pages