
Session 2: Data Pre-Processing - Data Cleaning, Data Standardization & Overfitting


Why Do We Pre-process Data?
• Raw data is often incomplete and noisy. It may contain redundant fields, missing
values, outliers, erroneous values, and data in a form not suitable for data mining models.
• Data often comes from legacy databases where values have expired or are no longer
relevant.
• For data mining purposes, database values must undergo data cleaning and
data transformation.

• Minimize GIGO (Garbage In, Garbage Out): if garbage in the input is minimized, then
garbage in the results is minimized.

• Data preparation accounts for about 60% of the effort in the data mining process.

Data Cleaning

Customer Database

Cust. ID | Postal Code | Gender | Monthly Income (Rs.) | Age (Years) | Marital Status | Transaction Amount (Rs.)
1001     | 380001      | M      | 75000                | G           | M              | 5000
1002     | K345        | F      | -35000               | 40          | W              | 4000
1003     | 380053      | F      | 100,00,000           | 32          | S              | 7000
1004     | 380051      |        | 50000                | 45          | S              | 1000
1005     | 6           | M      | 99999                | 0           | D              | 3000
Data Cleaning (cont'd)
• Postal Code
– Not all countries use the same postal code format; K345 may be a foreign code.
– No valid postal code consists of a single digit (6).
• Gender: missing value for customer 1004.

• Income field contains Rs. 100,00,000 and -Rs. 35,000?

– Rs. 100,00,000 is an extreme data value (outlier).
– Income less than Rs. 0?
– Caused by a data entry error?

• Discuss the anomaly with the database administrator.

• Some statistical and data mining methods are highly influenced by outliers.

Data Cleaning (cont'd)
• Age field contains "G" and 0?
– Other records have numeric values for this field.
– Was this record categorized into a group labelled "G"?
– Is a zero value used to indicate a missing/unknown value?
– Did the customer refuse to provide their age?

• Marital Status field contains "S"?

– What does this symbol mean?
– Does "S" imply single or separated?
– Discuss the anomaly with the database administrator.
• A simple screening for such anomalies is sketched below.
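
The following is a minimal, hedged sketch of how such screening could be automated with pandas; the DataFrame and column names below are hypothetical reconstructions of the customer table, not part of the original material.

import pandas as pd

# Hypothetical recreation of the customer table shown earlier; column names are assumptions.
customers = pd.DataFrame({
    "cust_id": [1001, 1002, 1003, 1004, 1005],
    "postal_code": ["380001", "K345", "380053", "380051", "6"],
    "gender": ["M", "F", "F", None, "M"],
    "monthly_income": [75000, -35000, 10000000, 50000, 99999],
    "age": ["G", "40", "32", "45", "0"],
    "marital_status": ["M", "W", "S", "S", "D"],
    "txn_amount": [5000, 4000, 7000, 1000, 3000],
})

# Postal codes that do not match the expected 6-digit format (e.g. "K345", "6").
bad_postal = ~customers["postal_code"].str.fullmatch(r"\d{6}")

# Coerce age to numeric; non-numeric entries such as "G" become NaN, and 0 is suspicious.
age_num = pd.to_numeric(customers["age"], errors="coerce")
bad_age = age_num.isna() | (age_num <= 0)

# Negative incomes and missing genders are also flagged for review.
bad_income = customers["monthly_income"] < 0
missing_gender = customers["gender"].isna()

# Rows with at least one anomaly would be discussed with the database administrator.
print(customers[bad_postal | bad_age | bad_income | missing_gender])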

Data Standardization: Numerical Variable
• Variables tend to have ranges that differ from each other.
• For example, two fields (predictors) may have the ranges Age: [20, 60] and Income: [0, 500000].

• Some data mining algorithms are adversely affected by differences in variable ranges.

• Variables with greater ranges tend to have a larger influence on the model's results.
• Therefore, numeric field values should be normalized.
• Z-score standardization works by taking the difference between the field value (X) and
the field mean, and scaling this difference by the standard deviation of the field values:

  Z-score = (X - mean(X)) / SD(X)

• Min-max transformation:

  X* = (X - min(X)) / range(X), where range(X) = max(X) - min(X)

  This transforms the values into [0, 1].
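
As a minimal sketch (the pandas Series and its values below are hypothetical), both rescalings can be computed directly:

import pandas as pd

# Hypothetical income values.
income = pd.Series([75000, 35000, 100000, 50000, 99999])

# Z-score standardization: centre on the mean, scale by the standard deviation.
z_score = (income - income.mean()) / income.std()

# Min-max transformation: map the values onto [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

print(z_score.round(2))
print(min_max.round(2))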

Data Transformation: Categorical Variables
• Some methods, like regression, require predictors to be numeric.
• Nominal categorical variables often cannot be used as they are.
• We need to construct indicator (dummy) variables taking the values 0 or 1.
• For a categorical variable with k categories, only (k-1) dummy variables are required.
The unassigned category is treated as the reference category.
• For example, consider the categorical variable Region with k = 4 categories:
East, South, West and North. One could define 3 dummy variables as:
R1 = 1 if region is north, 0 otherwise
R2 = 1 if region is east, 0 otherwise
R3 = 1 if region is south, 0 otherwise
• Region = west is already uniquely identified by zero values for each of the three
dummy variables, and is hence treated as the reference category (see the sketch below).
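
A minimal sketch of this encoding with pandas (the DataFrame and its values are assumptions); dropping the West column makes it the reference category:

import pandas as pd

# Hypothetical Region values.
df = pd.DataFrame({"Region": ["East", "South", "West", "North", "West"]})

# Create one indicator per category, then drop "West" so it serves as the reference.
dummies = pd.get_dummies(df["Region"]).drop(columns="West").astype(int)
print(dummies)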
Overfitting
• In supervised learning, a key question is: how well will our prediction or
classification model perform when we apply it to new data?
• The performance of various models is compared in order to choose the model that
generalizes best beyond the data set at hand.
• Adding more variables to the model may increase its apparent performance, but it also
increases the risk of overfitting.
• The model built should represent the relationship between the variables, but it should
also do a good job of predicting future outcome values.
• Consider the following hypothetical data:

Ad. Exp. | Sales Revenue
239      | 514
364      | 789
602      | 550
644      | 1386
770      | 1394
789      | 1440
911      | 1354
Overfitting Cont'd
• A 5th-degree polynomial model fitted to these data (see Figure 1) passes through every
point, leaving no room for error.
• Such a model is unlikely to be accurate, or even useful, in predicting future sales
revenue. For instance, it is hard to believe that increasing Ad. Exp. from $400 to $500
would actually decrease the revenue.
• A lower-degree polynomial will probably serve the purpose better (see Figure 2); a
sketch comparing the two fits follows below.
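
A minimal sketch of the comparison, fitting both a degree-5 and a degree-1 polynomial to the hypothetical data above with numpy (variable names are assumptions):

import numpy as np

ad_exp = np.array([239, 364, 602, 644, 770, 789, 911])
sales = np.array([514, 789, 550, 1386, 1394, 1440, 1354])

# Degree-5 fit passes (almost) exactly through all seven points; degree-1 is a straight line.
# np.polyfit may even warn that the degree-5 fit is poorly conditioned, itself a hint of overfitting.
overfit_coeffs = np.polyfit(ad_exp, sales, deg=5)
linear_coeffs = np.polyfit(ad_exp, sales, deg=1)

# Compare predicted revenue as advertising expenditure rises from 400 to 500.
grid = np.array([400.0, 450.0, 500.0])
print(np.polyval(overfit_coeffs, grid))  # can dip even as expenditure increases
print(np.polyval(linear_coeffs, grid))   # increases smoothly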
How to overcome Overfitting?
• When we use the same data both to develop the model and to assess its performance, we
end up with an "optimism" bias.
• Partition the data, develop the model on one partition, and then try it out on another
partition to see how it performs.
• Typically, two or three partitions are used: training, validation and test.

• Training partition (80%): the largest partition, containing the data used to build the
candidate models.
• Validation partition (20%): used to assess the predictive performance of each model
and choose the best one.
• Test partition: used to assess the performance of the chosen model on new data;
predictions are made ignoring the response variable and are then compared with the
actual values. A partitioning sketch follows below.
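
A minimal sketch of a three-way split using scikit-learn (the DataFrame, column names and the 60/20/20 proportions are illustrative assumptions, not the deck's exact 80/20 figures):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two predictors and one response.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

# First hold out a test set, then split the remainder into training and validation.
train_val, test = train_test_split(df, test_size=0.2, random_state=1)
train, val = train_test_split(train_val, test_size=0.25, random_state=1)  # 0.25 of 80% = 20%

print(len(train), len(val), len(test))  # 60 / 20 / 20 rows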
Bias-Variance Trade-off
• A low-complexity model has high bias (in terms of error rate) but low variance, while a
high-complexity model has low bias but high variance. This is known as the
bias-variance trade-off.
• As model complexity increases, the bias on the training set decreases but the
variance increases.
• The goal is to construct a model in which neither the bias nor the variance is
too high.
• A common measure is the Mean Squared Error (MSE); the lower the MSE, the better the
model.
• It is a function of estimation error and model complexity:

MSE = Variance + (Bias)²
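
A minimal simulation sketch of this decomposition (the data-generating process, the degrees and all names below are assumptions): fit a simple and a more complex polynomial to repeated noisy samples from a known curve, then estimate the bias² and variance of their predictions at one point.

import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed true relationship, used only for this simulation.
    return np.sin(x)

x = np.linspace(0, 3, 15)
x0 = 1.5  # point at which the predictions are examined

def predictions(degree, n_sims=500):
    preds = []
    for _ in range(n_sims):
        y = true_f(x) + rng.normal(scale=0.3, size=x.size)  # fresh noisy sample
        coeffs = np.polyfit(x, y, deg=degree)
        preds.append(np.polyval(coeffs, x0))
    return np.array(preds)

# Degree 1 (low complexity) tends to show higher bias and lower variance than degree 6.
for degree in (1, 6):
    p = predictions(degree)
    bias_sq = (p.mean() - true_f(x0)) ** 2
    variance = p.var()
    print(degree, round(bias_sq, 4), round(variance, 4), round(bias_sq + variance, 4))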
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.
