Week2-2
LECTURE 4
Chapter 2-Data Preprocessing
• Data Preprocessing
• Data cleaning
• Data Integration
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Obtains reduced representation in volume but produces
the same or similar analytical results
• Data transformation
• Normalization and aggregation
DATA PREPROCESSING
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data
warehousing”—DCI survey
• Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing
values per attribute varies considerably.
• Binning
• first sort data and partition into (equal-frequency)
bins
• then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Outlier analysis
• Clustering may be used to detect and remove outliers.
• Combined computer and human inspection
• detect suspicious values and check by human (e.g.,
deal with possible outliers)
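The binning step above (sort, partition into equal-frequency bins, then smooth) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the price-like values are illustrative and not taken from the lecture, and the sketch assumes there are at least as many values as bins.

```python
from statistics import mean, median


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size.

    Assumes len(values) >= n_bins so that no bin is empty.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative values only (an assumption, not data from the lecture).
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(values, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by boundaries keeps the extreme values of each bin intact, while smoothing by means or medians pulls every value in the bin to a single representative.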
Simple Discretization Methods:
Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well
■ Data: [5, 7, 10, 15, 18, 21, 22, 25, 27, 30]
■ We want to divide it into 3 bins (n = 3).
■ Step 1: Find min and max
a=5 (minimum value), b=30 (maximum value)
■ Step 2: Calculate bin width
Bin width = (b − a)/n = (30 − 5)/3 = 25/3 ≈ 8.33
■ Step 3: Assign values to bins
Bin 1 (5 – 13.33): 5, 7, 10
Bin 2 (13.33 – 21.67): 15, 18, 21
Bin 3 (21.67 – 30): 22, 25, 27, 30
Now these bins can be smoothed by mean, median or boundary.
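The worked example can be reproduced with a short sketch of equal-width partitioning. The function name `equal_width_bins` is hypothetical, and the rule that the maximum value goes into the last bin is an assumption made here so that every value lands in exactly one bin.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            idx = 0  # all values are identical, so everything goes in the first bin
        else:
            # The maximum would index one past the end, so clamp it into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


# Data from the example above.
data = [5, 7, 10, 15, 18, 21, 22, 25, 27, 30]
width, bins = equal_width_bins(data, 3)

# Smoothing by bin means: one representative value per bin.
bin_means = [sum(b) / len(b) for b in bins]
```

Because the interval width depends only on the minimum and maximum, a single extreme value stretches every interval, which is why outliers can dominate this kind of partitioning.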
Regression
• Data is fitted to a function
• Linear regression is the line that best fits 2 attributes
• One is used to predict the other
• Multiple linear regression is where more than 2 attributes are involved
[Figure: regression line y = x + 1 on x-y axes, with observed value Y1 and fitted value Y1’ at X1]
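As a rough illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces each observed y with the value the line predicts. The function names and the sample points (noisy values scattered around y = x + 1) are assumptions made for this example, not data from the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero: xs cannot all be equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_with_line(xs, ys)
```

The fitted line can also be used the other way the slide describes: given a new value of one attribute, it predicts the other.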
Analysis
• Data is clustered
• Values that fall outside the clusters are considered noise
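One simple way to make this concrete for a single numeric attribute is sketched below. It is not the lecture's method: the gap-based grouping rule, the `max_gap` and `min_size` parameters, the function names, and the sample values are all assumptions chosen to keep the example small. Values are grouped into a cluster while neighbouring sorted values stay within `max_gap` of each other, and clusters with fewer than `min_size` members are treated as noise.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; start a new cluster when the gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def split_noise(values, max_gap, min_size):
    """Return (clusters kept, values considered noise)."""
    clusters = cluster_1d(values, max_gap)
    kept = [c for c in clusters if len(c) >= min_size]
    noise = [v for c in clusters if len(c) < min_size for v in c]
    return kept, noise


# Hypothetical values: two dense groups and one isolated value.
values = [10, 11, 12, 13, 50, 51, 52, 95, 53]
kept, noise = split_noise(values, max_gap=3, min_size=2)
```

Both thresholds are judgment calls, which is one reason the slides pair automated detection with human inspection of suspicious values.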
Data Cleaning and Data
Reduction
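$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$
where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the means of A and B, and $\sigma_A$ and $\sigma_B$ are their standard deviations.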
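The correlation coefficient r(A, B) can be computed in either of its two equivalent forms: the centered sum of products, or the shortcut sum of AB minus n times the product of the means, each divided by (n − 1) times the two standard deviations. The sketch below evaluates both so they can be compared. The function names and sample values are hypothetical, and it assumes the standard deviations use the n − 1 divisor, which is what keeps r in the range −1 to 1 with this denominator.

```python
from math import sqrt


def sample_std(xs):
    """Standard deviation with the n - 1 divisor (an assumption, see the lead-in)."""
    n = len(xs)
    m = sum(xs) / n
    return sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))


def correlation(a, b):
    """Return r(A, B) computed two ways: (centered form, shortcut form)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    denom = (n - 1) * sample_std(a) * sample_std(b)  # zero if either attribute is constant
    centered = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / denom
    shortcut = (sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b) / denom
    return centered, shortcut


# Hypothetical attribute values.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
r_centered, r_shortcut = correlation(a, b)
r_neg, _ = correlation(a, list(reversed(b)))
```

A value near 1 indicates the attributes rise together, a value near −1 that one falls as the other rises, and a value near 0 that there is no linear relationship.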