
Session 2: Data Pre-Processing - Data Cleaning, Data Standardization & Overfitting


Why Do We Pre-process Data?
• Raw data is often incomplete and noisy. It may contain redundant fields, missing
values, outliers, erroneous values, and data in a form not suitable for data mining models.
• Data often comes from legacy databases where values have expired or are no longer
relevant.
• For data mining purposes, database values must undergo data cleaning and
data transformation.

• Minimize GIGO (Garbage In, Garbage Out): if garbage in the input is minimized, then
garbage in the results is minimized.

• Data preparation accounts for about 60% of the effort in the data mining process.

Data Cleaning

Customer Database

Cust. ID | Postal Code | Gender | Monthly Income (Rs.) | Age (Years) | Marital Status | Transaction Amount (Rs.)
1001     | 380001      | M      | 75000                | G           | M              | 5000
1002     | K345        | F      | -35000               | 40          | W              | 4000
1003     | 380053      | F      | 100,00,000           | 32          | S              | 7000
1004     | 380051      |        | 50000                | 45          | S              | 1000
1005     | 6           | M      | 99999                | 0           | D              | 3000
Data Cleaning (cont'd)
• Postal Code
– Not all countries use the same postal code format; K345 may be a foreign code.
– No valid postal code consists of a single digit (6).
• Gender: missing value for customer 1004.

• Income field contains Rs. 100,00,000 and -Rs. 35,000?

– Rs. 100,00,000 is an extreme data value (outlier).
– Income less than Rs. 0?
– Caused by a data entry error?

• Discuss the anomaly with the database administrator.

• Some statistical and data mining methods are highly influenced by outliers.

Data Cleaning (cont'd)
• Age field contains "G" and 0?
– Other records have numeric values for this field.
– Was this record categorized into a group labelled "G"?
– Is a zero value used to indicate a missing/unknown value?
– Did the customer refuse to provide their age?

• Marital Status field contains "S"?

– What does this symbol mean?
– Does "S" imply single or separated?
– Discuss the anomaly with the database administrator.
• A simple screening for such anomalies is sketched below.
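
The following is a minimal, hedged sketch of how such screening could be automated with pandas; the DataFrame and column names below are hypothetical reconstructions of the customer table, not part of the original material.

import pandas as pd

# Hypothetical recreation of the customer table shown earlier; column names are assumptions.
customers = pd.DataFrame({
    "cust_id": [1001, 1002, 1003, 1004, 1005],
    "postal_code": ["380001", "K345", "380053", "380051", "6"],
    "gender": ["M", "F", "F", None, "M"],
    "monthly_income": [75000, -35000, 10000000, 50000, 99999],
    "age": ["G", "40", "32", "45", "0"],
    "marital_status": ["M", "W", "S", "S", "D"],
    "txn_amount": [5000, 4000, 7000, 1000, 3000],
})

# Postal codes that do not match the expected 6-digit format (e.g. "K345", "6").
bad_postal = ~customers["postal_code"].str.fullmatch(r"\d{6}")

# Coerce age to numeric; non-numeric entries such as "G" become NaN, and 0 is suspicious.
age_num = pd.to_numeric(customers["age"], errors="coerce")
bad_age = age_num.isna() | (age_num <= 0)

# Negative incomes and missing genders are also flagged for review.
bad_income = customers["monthly_income"] < 0
missing_gender = customers["gender"].isna()

# Rows with at least one anomaly would be discussed with the database administrator.
print(customers[bad_postal | bad_age | bad_income | missing_gender])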

Data Standardization: Numerical Variable
• Variables tend to have ranges that differ from each other.
• For example, two fields (predictors) may have the ranges Age: [20, 60] and Income: [0, 500000].

• Some data mining algorithms are adversely affected by differences in variable ranges.

• Variables with greater ranges tend to have a larger influence on the model's results.
• Therefore, numeric field values should be normalized.
• Z-score standardization works by taking the difference between the field value (X) and
the field mean, and scaling this difference by the standard deviation of the field values:

  Z-score = (X - mean(X)) / SD(X)

• Min-max transformation:

  X* = (X - min(X)) / range(X), where range(X) = max(X) - min(X)

  This transforms the values into [0, 1].
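
As a minimal sketch (the pandas Series and its values below are hypothetical), both rescalings can be computed directly:

import pandas as pd

# Hypothetical income values.
income = pd.Series([75000, 35000, 100000, 50000, 99999])

# Z-score standardization: centre on the mean, scale by the standard deviation.
z_score = (income - income.mean()) / income.std()

# Min-max transformation: map the values onto [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

print(z_score.round(2))
print(min_max.round(2))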

Data Transformation: Categorical Variables
• Some methods, like regression, require predictors to be numeric.
• Nominal categorical variables often cannot be used as they are.
• We need to construct indicator (dummy) variables taking the values 0 or 1.
• For a categorical variable with k categories, only (k-1) dummy variables are required.
The unassigned category is treated as the reference category.
• For example, consider the categorical variable Region with k = 4 categories:
East, South, West and North. One could define 3 dummy variables as:
R1 = 1 if region is north, 0 otherwise
R2 = 1 if region is east, 0 otherwise
R3 = 1 if region is south, 0 otherwise
• Region = west is already uniquely identified by zero values for each of the three
dummy variables, and is hence treated as the reference category (see the sketch below).
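
A minimal sketch of this encoding with pandas (the DataFrame and its values are assumptions); dropping the West column makes it the reference category:

import pandas as pd

# Hypothetical Region values.
df = pd.DataFrame({"Region": ["East", "South", "West", "North", "West"]})

# Create one indicator per category, then drop "West" so it serves as the reference.
dummies = pd.get_dummies(df["Region"]).drop(columns="West").astype(int)
print(dummies)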
Overfitting
• In supervised learning, a key question is: how well will our prediction or
classification model perform when we apply it to new data?
• The performance of various models is compared in order to choose the model that
generalizes best beyond the data set at hand.
• Adding more variables to the model may increase its apparent performance, but it also
increases the risk of overfitting.
• The model built should represent the relationship between the variables, but it should
also do a good job of predicting future outcome values.
• Consider the following hypothetical data:

Ad. Exp. | Sales Revenue
239      | 514
364      | 789
602      | 550
644      | 1386
770      | 1394
789      | 1440
911      | 1354
Overfitting Cont'd
• A 5th-degree polynomial model fitted to these data (see Figure 1) passes through every
point, leaving no room for error.
• Such a model is unlikely to be accurate, or even useful, in predicting future sales
revenue. For instance, it is hard to believe that increasing Ad. Exp. from $400 to $500
would actually decrease the revenue.
• A lower-degree polynomial will probably serve the purpose better (see Figure 2); a
sketch comparing the two fits follows below.
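
A minimal sketch of the comparison, fitting both a degree-5 and a degree-1 polynomial to the hypothetical data above with numpy (variable names are assumptions):

import numpy as np

ad_exp = np.array([239, 364, 602, 644, 770, 789, 911])
sales = np.array([514, 789, 550, 1386, 1394, 1440, 1354])

# Degree-5 fit passes (almost) exactly through all seven points; degree-1 is a straight line.
# np.polyfit may even warn that the degree-5 fit is poorly conditioned, itself a hint of overfitting.
overfit_coeffs = np.polyfit(ad_exp, sales, deg=5)
linear_coeffs = np.polyfit(ad_exp, sales, deg=1)

# Compare predicted revenue as advertising expenditure rises from 400 to 500.
grid = np.array([400.0, 450.0, 500.0])
print(np.polyval(overfit_coeffs, grid))  # can dip even as expenditure increases
print(np.polyval(linear_coeffs, grid))   # increases smoothly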
How to overcome Overfitting?
• When we use the same data both to develop the model and to assess its performance, we
end up with an "optimism" bias.
• Partition the data, develop the model on one partition, and then try it out on another
partition to see how it performs.
• Typically, two or three partitions are used: training, validation and test.

• Training partition (80%): the largest partition, containing the data used to build the
candidate models.
• Validation partition (20%): used to assess the predictive performance of each model
and choose the best one.
• Test partition: used to assess the performance of the chosen model on new data;
predictions are made ignoring the response variable and are then compared with the
actual values. A partitioning sketch follows below.
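
A minimal sketch of a three-way split using scikit-learn (the DataFrame, column names and the 60/20/20 proportions are illustrative assumptions, not the deck's exact 80/20 figures):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two predictors and one response.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

# First hold out a test set, then split the remainder into training and validation.
train_val, test = train_test_split(df, test_size=0.2, random_state=1)
train, val = train_test_split(train_val, test_size=0.25, random_state=1)  # 0.25 of 80% = 20%

print(len(train), len(val), len(test))  # 60 / 20 / 20 rows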
Bias-Variance Trade-off
• A low-complexity model has high bias (in terms of error rate) but low variance, while a
high-complexity model has low bias but high variance. This is known as the
bias-variance trade-off.
• As model complexity increases, the bias on the training set decreases but the
variance increases.
• The goal is to construct a model in which neither the bias nor the variance is
too high.
• A common measure is the Mean Squared Error (MSE); the lower the MSE, the better the
model.
• It is a function of estimation error and model complexity:

MSE = Variance + (Bias)²
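
A minimal simulation sketch of this decomposition (the data-generating process, the degrees and all names below are assumptions): fit a simple and a more complex polynomial to repeated noisy samples from a known curve, then estimate the bias² and variance of their predictions at one point.

import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed true relationship, used only for this simulation.
    return np.sin(x)

x = np.linspace(0, 3, 15)
x0 = 1.5  # point at which the predictions are examined

def predictions(degree, n_sims=500):
    preds = []
    for _ in range(n_sims):
        y = true_f(x) + rng.normal(scale=0.3, size=x.size)  # fresh noisy sample
        coeffs = np.polyfit(x, y, deg=degree)
        preds.append(np.polyval(coeffs, x0))
    return np.array(preds)

# Degree 1 (low complexity) tends to show higher bias and lower variance than degree 6.
for degree in (1, 6):
    p = predictions(degree)
    bias_sq = (p.mean() - true_f(x0)) ** 2
    variance = p.var()
    print(degree, round(bias_sq, 4), round(variance, 4), round(bias_sq + variance, 4))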
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.
