DADM S2 Data Preprocessing: Data Cleaning and Transformation
Data Cleaning
Customer Database
Cust. ID | Postal Code | Gender | Monthly Income (Rs.) | Age (Years) | Marital Status | Transaction Amount (Rs.)
1001     | 380001      | M      | 75000                | G           | M              | 5000
1002     | K345        | F      | -35000               | 40          | W              | 4000
1003     | 380053      | F      | 100,00,000           | 32          | S              | 7000
1004     | 380051      |        | 50000                | 45          | S              | 1000
1005     | 6           | M      | 99999                | 0           | D              | 3000
Data Cleaning (cont’d)
• Postal Code
– Not all countries use the same postal code format; K345 does not fit the domestic format and may be a foreign code
– No valid postal code consists of a single digit (customer 1005)
• Gender: missing value (customer 1004)
Data Cleaning (cont’d)
• Age field contains "G" and 0?
– Other records have numeric values for this field
– Was this record categorized into a group labeled "G"?
– Is the zero value used to indicate a missing/unknown value?
– Did the customer refuse to provide their age?
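Checks like these can be automated. Below is a minimal Python sketch with pandas (the column names and the six-digit Indian PIN rule are assumptions for illustration, not part of the slides) that flags the problem records discussed above:

import pandas as pd

# Toy frame mirroring the customer table above (column names assumed)
df = pd.DataFrame({
    "cust_id": [1001, 1002, 1003, 1004, 1005],
    "postal_code": ["380001", "K345", "380053", "380051", "6"],
    "gender": ["M", "F", "F", None, "M"],
    "age": ["G", "40", "32", "45", "0"],
})

# Postal codes that do not match the six-digit PIN format (assumed rule)
bad_pin = ~df["postal_code"].str.fullmatch(r"\d{6}")

# Missing gender values
missing_gender = df["gender"].isna()

# Coerce age to numeric: "G" becomes NaN; treat 0 as missing/unknown
age = pd.to_numeric(df["age"], errors="coerce")
bad_age = age.isna() | (age == 0)

# Flags records 1001, 1002, 1004 and 1005 for review
print(df[bad_pin | missing_gender | bad_age])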
Data Standardization: Numerical Variables
• Numerical variables tend to have ranges that differ greatly from each other.
• For example, two fields (predictors) may have the ranges Age: [20, 60] and Income: [0, 500000].
• Min-Max transformation: X* = (X - min(X)) / range(X), where range(X) = max(X) - min(X). This transforms the values to lie in [0, 1].
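As a quick illustration, a minimal Python sketch applying the transformation to the Age example (the four sample values are made up):

import pandas as pd

# Made-up ages spanning the range [20, 60]
age = pd.Series([20, 35, 48, 60])

# Min-Max transformation: X* = (X - min(X)) / range(X)
age_scaled = (age - age.min()) / (age.max() - age.min())
print(age_scaled.tolist())  # [0.0, 0.375, 0.7, 1.0]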
Data Transformation: Categorical Variables
• Some methods, such as regression, require predictors to be numeric.
• Nominal categorical variables often cannot be used as they are.
• We therefore need to construct indicator/dummy variables taking the values 0 or 1.
• For a categorical variable with k categories, only (k-1) dummy variables are required. The unassigned category is treated as the reference category.
• For example, consider the categorical variable Region with k = 4 categories: East, South, West and North. One could define 3 dummy variables as:
R1 = 1 if region is North, 0 otherwise
R2 = 1 if region is East, 0 otherwise
R3 = 1 if region is South, 0 otherwise
• Region = West is already uniquely identified by zero values for all three dummy variables, and is hence treated as the reference category.
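A minimal sketch of building these dummies in Python with pandas.get_dummies; selecting the North, East and South columns is one way to make West the reference category, as on the slide (the variable names are assumptions):

import pandas as pd

# Toy Region variable with k = 4 categories
region = pd.Series(["East", "South", "West", "North"], name="region")

# Keep k - 1 = 3 dummies; West becomes the all-zero reference row
dummies = pd.get_dummies(region, dtype=int)[["North", "East", "South"]]
print(dummies)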
Overfitting
• In supervised learning, a key question is: how well will our prediction or classification model perform when we apply it to new data?
• The performance of various models is compared and the best model chosen, so that it generalizes beyond the data set at hand.
• Adding more variables to the model may improve its fit to the data at hand, but the more variables we add, the greater the risk of overfitting.
• The model built should not only represent the relationship between the variables but also do a good job of predicting future outcome values.
• Consider the following hypothetical data:

Ad. Exp. | Sales Revenue
239      | 514
364      | 789
602      | 550
644      | 1386
770      | 1394
789      | 1440
911      | 1354
Overfitting (cont'd)
• A 5th-degree polynomial model fitted to these data (see Figure 1) leaves no room for error: with six coefficients for seven points, it passes through or very near every observation.
• Such a model is unlikely to be accurate, or even useful, in predicting future sales revenue. For instance, it is hard to believe that increasing Ad. Exp. from $400 to $500 would actually decrease the revenue.
• A lower-degree polynomial will probably serve the purpose (see Figure 2).
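The effect can be reproduced numerically. A short numpy sketch on the seven points above (the exact fitted values depend on the least-squares fit, so treat the printed numbers as indicative):

import numpy as np

ad_exp = np.array([239, 364, 602, 644, 770, 789, 911])
revenue = np.array([514, 789, 550, 1386, 1394, 1440, 1354])

# Degree 5: six coefficients for seven points, so the curve nearly
# interpolates the data (overfits); degree 1 is a plain straight line
p5 = np.polynomial.Polynomial.fit(ad_exp, revenue, deg=5)
p1 = np.polynomial.Polynomial.fit(ad_exp, revenue, deg=1)

# The wiggly degree-5 curve can predict lower revenue at 500 than
# at 400, while the straight line increases monotonically
for x in (400, 500):
    print(x, round(p5(x)), round(p1(x)))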
How to overcome Overfitting?
• When we use the same data both to develop the model and to assess its performance, we end up with an "optimism" bias.
• Instead, partition the data: develop the model on one partition, then try it out on another partition to see how it performs.
• Typically, two or three partitions are used: training, validation and test.
• Training partition (80%): the largest partition, containing the data used to build several candidate models.
• Validation partition (20%): used to assess the predictive performance of each model and to choose the best one.
• Test partition: used to assess the performance of the chosen model on new data. This is done by withholding the response variable while predicting and then comparing the predictions with the actual values.
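A minimal sketch of such a partition using scikit-learn's train_test_split (the library choice and the 20% test share are assumptions; the 80/20 training/validation split follows the slide):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows of predictors X and a response y
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Hold out a test partition first, then split the remainder
# 80/20 into training and validation partitions
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 64 16 20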
Bias-Variance Trade-off
• A low-complexity model has a high bias in terms of error rate but a low variance, while a high-complexity model has a low bias but a high variance. This is known as the bias-variance trade-off.
• As model complexity increases, the bias on the training set decreases but the variance increases.
• The goal is to construct a model in which neither the bias nor the variance is too high.
• A common measure is the Mean Squared Error (MSE): the lower the MSE, the better the model.
• MSE is a function of estimation error and model complexity:
MSE = Variance + (Bias)²
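This identity follows by adding and subtracting E[f̂] inside the square; the cross term vanishes because E[f̂ - E[f̂]] = 0. In LaTeX, for an estimator \hat{f} of the true value f (a standard derivation, ignoring irreducible noise):

\mathrm{MSE} = E\big[(\hat{f} - f)^2\big]
             = \underbrace{E\big[(\hat{f} - E[\hat{f}])^2\big]}_{\text{Variance}}
             + \underbrace{\big(E[\hat{f}] - f\big)^2}_{(\text{Bias})^2}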