11-Data Pre-Processing, Exploratory Data Analysis.-23-03-2023
2. DATA CLEANING
To illustrate the need for cleaning up the data:
Let us discuss, attribute by attribute, some of the problems that have found their way into the data set in Table 2.1. The customer
ID variable seems to be fine. What about zip?
Let us assume that we are expecting all of the customers in the database to have
the usual five-numeral American zip code. Now, Customer 1002 has this unusual (to American eyes) zip code of J2S7K7.
Table 2.1 columns: Customer ID, Zip, Gender, Income, Age, Marital Status, Transaction Amount.
Actually, this is the zip code of St. Hyacinthe, Quebec, Canada, and so probably represents real data from a real customer.
What has evidently occurred is that a French-Canadian customer has made a purchase, and put their home zip code down
in the required field. In the era of globalization, we must be ready to expect unusual values in fields such as zip codes,
which vary from country to country.
The next field, gender, contains a missing value for customer 1003.
The income field has three potentially anomalous values.
First, Customer 1003 is shown as having an income of $10,000,000 per year. While entirely possible, especially when
considering the customer’s zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme
data value.
Unlike Customer 1003’s income, Customer 1002’s reported income of −$40,000 lies beyond the field
bounds for income, and therefore must be an error.
So what is wrong with Customer 1005’s income of $99,999? Perhaps nothing;
it may in fact be valid. But, if all the other incomes are rounded to the nearest $5000, why the precision
with Customer 1005? Often, in legacy databases, certain specified values are meant to be codes for
anomalous entries, such as missing values. Perhaps 99999 was coded in an old database to mean missing.
Again, we cannot be sure and should again refer to the database administrator.
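A minimal Python sketch of such field-level sanity checks using pandas; the column names and data values below are hypothetical stand-ins for Table 2.1, and the rules (five-digit zip, non-negative income, 99999 as a suspected missing-value code) simply encode the discussion above.

```python
import pandas as pd

# Hypothetical slice of Table 2.1: column names and values are illustrative.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1004, 1005],
    "zip":    ["10048", "J2S7K7", "90210", "06269", "55101"],
    "income": [75000, -40000, 10000000, 50000, 99999],
})

# Zip codes that are not the usual five-digit American format.
bad_zip = ~df["zip"].str.fullmatch(r"\d{5}")

# Incomes outside the field bounds (negative) or equal to the suspected
# missing-value code 99999.
bad_income = (df["income"] < 0) | (df["income"] == 99999)

# Rows to refer back to the database administrator.
print(df[bad_zip | bad_income])
```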
2.3 HANDLING MISSING DATA
The data set consists of information about 261 automobiles manufactured in the 1970s and 1980s, including
gas mileage, number of cylinders, cubic inches, horsepower, and so on.
Suppose, however, that some of the field values were missing for certain records.
Figure 2.1 provides a peek at the first 10 records in the data set, with two of the field values missing.
A common method of “handling” missing values is simply to omit the records or fields with missing
values from the analysis. However, this may be dangerous, since the pattern of missing values may in fact
be systematic, and simply deleting the records with missing values would lead to a biased subset of the
data.
Further, it seems like a waste to omit the information in all the other fields, just because one
field value is missing. Therefore, data analysts have turned to methods that would replace the
missing value with a value substituted according to various criteria.
Some common criteria for choosing replacement values for missing data are as
follows:
1. Replace the missing value with some constant, specified by the analyst.
2. Replace the missing value with the field mean (for numeric variables) or the mode (for categorical variables).
3. Replace the missing values with a value generated at random from the observed distribution of the variable.
4. Replace the missing values with imputed values based on the other characteristics of the record.
Figure 2.2 shows the result of replacing the missing values
with the constant 0 for the numerical variable cubicinches and the
label missing for the categorical variable brand.
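The first three replacement criteria can be sketched in a few lines of pandas; the miniature cars DataFrame below is made up for illustration, with the same two fields (cubicinches and brand) shown missing in Figure 2.1. Criterion 4 requires a predictive model (e.g., regression or a decision tree) and is discussed later.

```python
import numpy as np
import pandas as pd

# Toy version of the cars data with two missing field values (cf. Figure 2.1);
# the actual numbers here are made up.
cars = pd.DataFrame({
    "cubicinches": [350.0, 89.0, np.nan, 302.0, 98.0],
    "brand":       ["US", "Japan", "US", None, "Europe"],
})

# 1. Constant specified by the analyst (what Figure 2.2 shows).
const_filled = cars.fillna({"cubicinches": 0, "brand": "missing"})

# 2. Field mean (numeric variable) or mode (categorical variable).
mean_mode_filled = cars.fillna({
    "cubicinches": cars["cubicinches"].mean(),
    "brand": cars["brand"].mode()[0],
})

# 3. Value generated at random from the observed distribution of the variable.
rng = np.random.default_rng(0)
random_filled = cars.copy()
for col in cars.columns:
    observed = cars[col].dropna().to_numpy()
    mask = cars[col].isna()
    random_filled.loc[mask, col] = rng.choice(observed, size=mask.sum())

print(const_filled, mean_mode_filled, random_filled, sep="\n\n")
```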
IDENTIFYING MISCLASSIFICATIONS
Let us look at an example of checking the classification labels on the categorical variables, to make sure
that they are all valid and consistent.
Suppose that a frequency distribution of the variable brand was as shown in the table below.

Brand    Frequency
USA      1
France   1
US       156
Europe   46
Japan    51

Notice anything strange about this frequency distribution?
The frequency distribution shows five classes: USA, France, US, Europe, and Japan.
However, two of the classes, USA and France, have a count of only one automobile each.
What is clearly happening here is that two of the records have been inconsistently classified with respect
to the origin of manufacture.
To maintain consistency with the remainder of the data set, the record with origin USA should have been
labeled US, and the record with origin France should have been labeled Europe.
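A quick way to surface such one-off classes, and to recode them consistently, is a frequency count followed by a replace; the brand Series below is a hypothetical reconstruction matching the counts in the table above.

```python
import pandas as pd

# Hypothetical brand column matching the counts in the table above.
brand = pd.Series(["US"] * 156 + ["Europe"] * 46 + ["Japan"] * 51 + ["USA", "France"])

print(brand.value_counts())        # reveals the two one-off classes USA and France

# Recode the two inconsistent labels to keep the classes consistent.
brand_clean = brand.replace({"USA": "US", "France": "Europe"})
print(brand_clean.value_counts())  # only US, Europe, and Japan remain
```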
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Outliers are extreme values that go against the trend of the remaining data.
Identifying outliers is important because they may represent errors in data entry.
Also, even if an outlier is a valid data point and not an error, certain statistical
methods are sensitive to the presence of outliers and may deliver unreliable results.
One graphical method for identifying outliers for numeric variables is to examine a
histogram of the variable.
Figure 2.5 shows a histogram of the vehicle weights from the (slightly amended)
cars data set. (Note: This slightly amended data set is available as cars2 from the
series website.)
Figure 2.5 Histogram of vehicle weights (count versus weight in pounds): can you find the outlier?
✓ There appears to be one lonely vehicle in the extreme left tail of the distribution, with a vehicle weight in the hundreds of
pounds rather than in the thousands. Further investigation (not shown) tells us that the minimum weight is 192.5 pounds,
which is undoubtedly our little outlier in the lower tail. As 192.5 pounds is rather light for an automobile, we would tend to
doubt the validity of this information.
✓ We can surmise that perhaps the weight was originally 1925 pounds, with the decimal inserted somewhere along the line.
✓ Figure 2.6, a scatter plot of mpg against weightlbs, seems to have netted two outliers.
✓ Most of the data points cluster together along the horizontal axis, except for two outliers. The one on the left is the same
vehicle we identified in Figure 2.5, weighing only 192.5 pounds.
✓ The outlier near the top is something new: a car that gets over 500 miles per gallon! Clearly, unless this vehicle runs on
dilithium crystals, we are looking at a data entry error.
✓ Note that the 192.5 pound vehicle is an outlier with respect to weight but not with respect to mileage.
✓ Similarly, the 500-mpg car is an outlier with respect to mileage but not with respect to weight. Thus, a record may be an
outlier in a particular dimension but not in another.
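Both plots can be reproduced along the following lines with matplotlib; the small weightlbs/mpg DataFrame is invented for illustration, standing in for the cars2 data set, and merely contains the two outliers discussed (a 192.5-pound vehicle and a 500+ mpg entry).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in for cars2: any DataFrame with numeric columns
# 'weightlbs' and 'mpg' works the same way.
cars2 = pd.DataFrame({
    "weightlbs": [192.5, 2835, 3010, 4340, 3609, 2130, 4425],
    "mpg":       [33.5, 29.0, 523.0, 14.0, 15.0, 31.0, 10.0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of vehicle weights (cf. Figure 2.5): the isolated bar far to the
# left is the suspiciously light vehicle.
ax1.hist(cars2["weightlbs"], bins=20)
ax1.set_xlabel("weight")
ax1.set_ylabel("count")

# Scatter plot of mpg against weight (cf. Figure 2.6): the point near the
# top is the mileage data entry error.
ax2.scatter(cars2["weightlbs"], cars2["mpg"])
ax2.set_xlabel("weightlbs")
ax2.set_ylabel("mpg")

plt.tight_layout()
plt.show()
```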
✓ We shall examine numerical methods for identifying outliers below.
The standard deviation can be interpreted as the “typical” distance between a field
value and the mean, and most field values lie within two standard deviations of the
mean.
From Figure 2.7 we can state that the number of customer service calls made by
most customers lies within 2(1.315) = 2.63 of the mean of 1.563 calls.
In other words, for most customers the number of customer service calls lies within the interval
(−1.067, 4.193), that is, effectively (0, 4).
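The same two-standard-deviation screen is easy to compute directly; the call counts below are made up, while the statistics quoted in the text (mean 1.563, standard deviation 1.315) come from Figure 2.7.

```python
import pandas as pd

# Made-up customer service call counts for illustration.
calls = pd.Series([1, 0, 2, 3, 1, 0, 4, 1, 2, 9])

mean, std = calls.mean(), calls.std()
lower, upper = mean - 2 * std, mean + 2 * std
print(f"most values expected in ({lower:.3f}, {upper:.3f})")

# Values lying more than two standard deviations from the mean.
print(calls[(calls < lower) | (calls > upper)])
```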
DATA TRANSFORMATION
✓ Min-max normalization
✓ Z-score standardization
✓ Decimal scaling
Data Transformation: Normalization
Min-max normalization:
v' = (v − min) / (max − min) × (new_max − new_min) + new_min
Z-score standardization:
v' = (v − mean) / stand_dev
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
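The three normalization formulas translate directly into code. A minimal sketch, with a made-up weight vector as input:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Rescale v into [new_min, new_max] using min-max normalization."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_standardize(v):
    """Center v at 0 and scale it by its standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scale(v):
    """Divide v by 10**j, where j is the smallest integer with max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    max_abs = np.abs(v).max()
    j = int(np.floor(np.log10(max_abs))) + 1 if max_abs > 0 else 0
    return v / 10 ** j

weights = [1613, 1835, 2130, 2835, 3010, 4340, 4425]   # made-up weights
print(min_max_normalize(weights))
print(z_score_standardize(weights))
print(decimal_scale(weights))      # divides by 10**4, since max |weight| = 4425
```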
TRANSFORMATIONS TO ACHIEVE NORMALITY
✓ Some data mining algorithms and statistical methods require that the variables be
normally distributed.
✓ The normal distribution is a continuous probability distribution commonly known as
the bell curve, which is symmetric.
It is centered at mean 𝜇 (“mu”) and has its spread determined by standard deviation 𝜎
(sigma).
✓ Figure 2.9 shows the normal distribution that has mean 𝜇 = 0 and standard deviation
𝜎 = 1, known as the standard normal distribution Z.
We use the following statistic to measure the skewness of a distribution:
Skewness = 3 × (mean − median) / standard deviation
For right-skewed data, the mean is greater than the median, and thus the skewness
will be positive (Figure 2.12), while for left-skewed data, the mean is smaller than
the median, generating negative values for skewness (Figure 2.13).
For perfectly symmetric data (such as in Figure 2.9) of course, the mean, median, and
mode are all equal, and so the skewness equals zero.
Compare the histogram of the original weight data in Figure 2.10 with the Z-standardized
data in Figure 2.11.
Both histograms are right-skewed; in particular, Figure 2.10 is not symmetric, and so cannot be
normally distributed.
We use the statistics for weight and weight_Z shown in Figure 2.14 to calculate the skewness for
these variables.
For weight we have
Skewness = 3 × (mean − median) / standard deviation = 3 × (3005.490 − 2835) / 852.646 ≈ 0.6
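The same calculation in code; the helper pearson_skewness and the small sample are illustrative only, while the first printed value reproduces the arithmetic shown above.

```python
import pandas as pd

def pearson_skewness(x):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    x = pd.Series(x, dtype=float)
    return 3 * (x.mean() - x.median()) / x.std()

# Reproduces the arithmetic above for weight.
print(3 * (3005.490 - 2835) / 852.646)   # approximately 0.6

# Right-skewed toy sample: mean > median, so the skewness is positive.
print(pearson_skewness([2, 3, 3, 4, 5, 9, 15]))
```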
✓ The mean and standard deviation are themselves sensitive to outliers, so outlier-detection rules built on them can be
distorted by the very values they are meant to flag.
✓ Therefore, data analysts have developed more robust statistical methods for outlier detection, which are less sensitive
to the presence of the outliers themselves. One elementary robust method is to use the interquartile range (IQR).
✓ The quartiles of a data set divide the data set into four parts, each containing 25% of the data.
. The first quartile (Q1) is the 25th percentile.
. The second quartile (Q2) is the 50th percentile, that is, the median.
. The third quartile (Q3) is the 75th percentile.
Then, the interquartile range (IQR) is a measure of variability, much more robust than the
standard deviation.
The IQR is calculated as IQR = Q3 – Q1, and may be interpreted to represent the spread of
the middle 50% of the data.
A robust measure of outlier detection is therefore defined as follows. A data value is an
outlier if
a. It is located 1.5(IQR) or more below Q1, or
b. It is located 1.5(IQR) or more above Q3.
✓ For example, suppose for a set of test scores, the 25th percentile was Q1 = 70 and the 75th
percentile was Q3 = 80, so that half of all the test scores fell between 70 and 80.
Then the interquartile range, or the difference between these quartiles was IQR = 80 − 70 = 10.
A test score would be robustly identified as an outlier if
a. It is lower than Q1 – 1.5(IQR) = 70 – 1.5(10) = 55 or
b. It is higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95.
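A minimal sketch of the IQR rule; the score list is invented so that Q1 and Q3 land near 70 and 80, echoing the worked example above.

```python
import numpy as np

def iqr_outliers(x):
    """Flag values located 1.5 * IQR or more below Q1 or above Q3."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x <= lower) | (x >= upper)]

# Test scores with Q1 and Q3 near 70 and 80, as in the example above.
scores = [54, 68, 70, 72, 74, 75, 76, 78, 80, 82, 97]
print(iqr_outliers(scores))   # [54. 97.]
```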
Flag variables
✓ Some analytical methods, such as regression, require predictors to be numeric. Thus, analysts wishing to
use categorical predictors in regression need to recode the categorical variable into one or more flag
variables.
✓ A flag variable (or dummy variable or indicator variable) is a categorical variable taking only two
values, 0 and 1.
For example, the categorical predictor sex, taking values female and male, could be recoded into the flag
variable sex_flag as follows: if sex = female then sex_flag = 0; if sex = male then sex_flag = 1.
✓ When a categorical predictor takes k ≥ 3 possible values, then define k – 1 dummy variables, and use the
unassigned category as the reference category.
✓ For example, if a categorical predictor region has k = 4 possible categories, {north, east, south, west}, then
the analyst could define the following k - 1 = 3 flag variables.
north_flag: if region = north then north_flag = 1; otherwise north_flag = 0.
east_flag: if region = east then east_flag = 1; otherwise east_flag = 0.
south_flag: if region = south then south_flag = 1; otherwise south_flag = 0.
A flag variable for west is not needed, since region = west is already uniquely identified by zero
values for each of the three existing flag variables.
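With pandas, the k − 1 flag variables can be built either by hand (choosing west as the reference category, as above) or with get_dummies; the region Series below is hypothetical.

```python
import pandas as pd

# Hypothetical region column with k = 4 categories.
region = pd.Series(["north", "east", "south", "west", "east", "north"], name="region")

# k - 1 = 3 flag variables built by hand, with region = west as the
# reference category (all three flags equal 0 for western records).
flags = pd.DataFrame({
    f"{level}_flag": (region == level).astype(int)
    for level in ["north", "east", "south"]
})
print(flags)

# pandas can also build the dummies automatically; drop_first=True keeps
# k - 1 of them, using the first (alphabetical) category as the reference.
print(pd.get_dummies(region, prefix="region", drop_first=True))
```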
Transforming Categorical Variables into Numerical Variables
Would it not be easier to simply transform the categorical variable region into a single numerical variable,
rather than using several different flag variables?
For example, suppose we defined the quantitative variable region_num as follows: north = 1, east = 2, south = 3, west = 4.
Unfortunately, this is a common and hazardous error. The algorithm now erroneously
thinks the following:
. The four regions are ordered,
. West > South > East > North,
. West is three times closer to South than it is to North, and so on.
So, in most instances, the data analyst should avoid transforming categorical variables to
numerical variables.
The exception is for categorical variables that are clearly ordered, such as the variable
survey_response, taking values always, usually, sometimes, never.
In this case, one could assign numerical values to the responses, for example always = 4, usually = 3,
sometimes = 2, never = 1, though one may bicker over the actual values assigned:
Should never be “0” rather than “1”? Is always closer to usually than usually is to sometimes? Careful assignment of the
numerical values is important.
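One way to encode such a clearly ordered variable, using the (debatable) assignment discussed above; the responses themselves are made up:

```python
import pandas as pd

survey_response = pd.Series(["always", "usually", "never", "sometimes", "usually"])

# One possible (debatable) assignment of numerical values to the ordered
# categories; see the caveats discussed above.
order = {"never": 1, "sometimes": 2, "usually": 3, "always": 4}
print(survey_response.map(order).tolist())   # [4, 3, 1, 2, 3]
```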
Module-2: Removing Variables that are not Useful
✓ A unary variable takes on only a single value, so it is constant across the entire data set.
✓ For example, data collection on a sample of students at an all-girls private school would find that the sex
variable would be unary, since every subject would be female.
✓ Since sex is constant across all observations, it cannot have any effect on any data mining algorithm or
statistical tool. The variable should be removed.
✓ Sometimes a variable can be very nearly unary. For example, suppose that 99.95% of the players in a
field hockey league are female, with the remaining 0.05% male. The variable sex is therefore very
nearly, but not quite, unary. While it may be useful to investigate the male players, some algorithms will
tend to treat the variable as essentially unary.
✓ For example, a classification algorithm can be better than 99.9% confident that a given player is female.
So, the data analyst needs to weigh how close to unary a given variable is, and whether such a variable
should be retained or removed.
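A simple check for unary or nearly unary variables is the share of records taking the most common value; the 99.9% cut-off below is an arbitrary illustration of the analyst's judgment call, and the sex Series is constructed to mirror the 99.95% example.

```python
import pandas as pd

def dominant_share(s):
    """Proportion of records taking the single most common value."""
    return s.value_counts(normalize=True).iloc[0]

# Mirrors the field hockey example: 99.95% female, 0.05% male.
sex = pd.Series(["female"] * 1999 + ["male"])

share = dominant_share(sex)
print(f"{share:.2%} of records share one value")   # 99.95%
if share >= 0.999:                                 # cut-off is the analyst's call
    print("variable is (very nearly) unary: consider removing it")
```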
Variables that should probably not be removed
✓ It is (unfortunately) a common though questionable practice to remove from analysis variables that contain a large
proportion (say, 90%) of missing values.
✓ Such variables present a challenge to any strategy for imputation of missing data. For example, are the
remaining 10% of the cases truly representative of the missing data, or are the missing values occurring due to some systematic
but unobserved phenomenon?
✓ For example, suppose we have a field called donation_dollars in a self-reported survey database. Conceivably, those who
donate a lot would be inclined to report their donations, while those who do not donate much may be inclined to skip this
survey question.
✓ Thus, the 10% who report are not representative of the whole. In this case, it may be preferable to construct a flag variable,
donation_flag, since there is a pattern in the missingness which may turn out to have predictive power.
✓ However, if the data analyst has reason to believe that the 10% are representative, then he or she may
choose to proceed with the imputation of the missing 90%. In that case, it is strongly recommended that the
imputation be based on regression or decision tree methods.
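Constructing such a missingness flag is a one-liner in pandas; donation_dollars below is a made-up, self-reported field with most values missing.

```python
import numpy as np
import pandas as pd

# Made-up self-reported field with most values missing.
donation_dollars = pd.Series([np.nan, 250.0, np.nan, np.nan, 1000.0, np.nan])

# Flag variable recording whether a donation was reported at all; the
# pattern of missingness itself may carry predictive power.
donation_flag = donation_dollars.notna().astype(int)
print(donation_flag.tolist())   # [0, 1, 0, 0, 1, 0]
```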