Final - Unit 3 Data Preprocessing - Phases
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Summary
Data quality
Data quality is a major concern in Data Mining
and Knowledge Discovery tasks.
Why: Almost all Data Mining algorithms
induce knowledge strictly from data.
The quality of knowledge extracted highly
depends on the quality of data.
Why Data Preprocessing?
Data in the real world is dirty
◦ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=“ ”
Data integration
◦ Integration of multiple databases, data cubes, or files
Data transformation
◦ Normalization and aggregation
Data reduction
◦ Obtains a representation reduced in volume that produces the same
or similar analytical results
Data discretization
◦ Data reduction, especially for numerical data
◦ Data discretization converts continuous attribute values into a
finite set of intervals with minimal loss of information.
Forms of Data Preprocessing
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Summary
Data Cleaning
Importance
◦ “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
◦ “Data cleaning is the number one problem in data warehousing”—
DCI survey
Missing data may be due to:
◦ equipment malfunction
◦ data inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data not considered important at the time of entry
◦ history or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Data?
1. Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute varies
considerably.
2. Fill in the missing value manually:
time-consuming, and infeasible for large data sets.
3. Fill it in automatically with
◦ a global constant: e.g., “unknown”, a new class?
(if so, the mining program may mistakenly think the missing values form an
interesting concept, since they all share the common value “unknown”; this
method is simple but not foolproof)
◦ the attribute mean or median
◦ the attribute mean for all samples belonging to the same class: smarter
(e.g., if classifying customers according to credit risk, we may replace the missing
value with the mean income of customers in the same credit-risk category as that
of the given tuple)
4. Fill in the most probable value: inference-based methods such as a Bayesian
formula or a decision tree
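The automatic fill-in strategies above (a global mean, or the class-conditional mean from the credit-risk example) can be sketched as follows. This is a minimal illustration; the `income`/`risk` columns and the tiny data set are hypothetical, not from the slides.

```python
# Hedged sketch of two imputation strategies from the list above.
# The sample data (income with missing entries, a credit-risk class label)
# is a hypothetical illustration.

def impute_global_mean(values):
    """Replace None entries with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_class_mean(values, labels):
    """Replace None entries with the mean of observed values
    that share the same class label (the 'smarter' strategy)."""
    per_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            per_class.setdefault(c, []).append(v)
    class_means = {c: sum(vs) / len(vs) for c, vs in per_class.items()}
    return [class_means[c] if v is None else v
            for v, c in zip(values, labels)]

income = [40_000, None, 60_000, 50_000, None]
risk   = ["low",  "low", "high", "high", "high"]

print(impute_global_mean(income))        # fills both gaps with 50,000
print(impute_class_mean(income, risk))   # low-risk mean 40,000; high-risk mean 55,000
```

The class-conditional version is usually preferable when the attribute (income) correlates with the class (credit risk), exactly as the example in point 3 suggests.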
Handling missing values
Example - K-Nearest Neighbor (k-NN) approach
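The slide's worked k-NN example is not reproduced in the text, but the idea, filling a missing attribute with the average of that attribute over the k most similar complete records, can be sketched as below. The records and the choice k=2 are hypothetical.

```python
# Hedged sketch of k-NN imputation: impute a missing attribute from the
# k nearest records (by Euclidean distance on the attributes the target
# record does have). Data and k are illustrative assumptions.

def knn_impute(records, target_idx, missing_col, k=2):
    """Return an imputed value for records[target_idx][missing_col]."""
    target = records[target_idx]
    known_cols = [j for j, v in enumerate(target) if v is not None]
    candidates = []
    for i, r in enumerate(records):
        if i == target_idx or r[missing_col] is None:
            continue
        dist = sum((r[j] - target[j]) ** 2 for j in known_cols) ** 0.5
        candidates.append((dist, r[missing_col]))
    candidates.sort(key=lambda t: t[0])
    neighbors = [v for _, v in candidates[:k]]
    return sum(neighbors) / len(neighbors)

data = [
    [1.0,  2.0,  10.0],
    [1.1,  2.1,  12.0],
    [9.0,  9.0,  50.0],
    [1.05, 2.05, None],   # third attribute missing
]
print(knn_impute(data, 3, 2, k=2))  # mean of the two nearest rows: 11.0
```

The distant record (row 3) is ignored, so the imputed value reflects only records that actually resemble the incomplete one.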
Bin boundaries
Find the minimum and maximum values among the data.
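Starting from the minimum and maximum, equal-width bin boundaries can be computed and the data smoothed by bin means, as in this minimal sketch (the price list and bin count are hypothetical illustration values):

```python
# Hedged sketch: equal-width bin boundaries from min/max, then
# smoothing by bin means. The sample prices and n_bins=3 are assumptions.

def equal_width_boundaries(data, n_bins):
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def bin_index(x, boundaries):
    """Index of the bin containing x (max value goes in the last bin)."""
    n_bins = len(boundaries) - 1
    for i in range(n_bins):
        if x < boundaries[i + 1]:
            return i
    return n_bins - 1

def smooth_by_bin_means(data, n_bins):
    """Replace each value by the mean of its bin."""
    b = equal_width_boundaries(data, n_bins)
    bins = {}
    for x in data:
        bins.setdefault(bin_index(x, b), []).append(x)
    means = {i: sum(v) / len(v) for i, v in bins.items()}
    return [means[bin_index(x, b)] for x in data]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_boundaries(prices, 3))  # [4.0, 14.0, 24.0, 34.0]
print(smooth_by_bin_means(prices, 3))
```

Each bin here is 10 units wide; smoothing by bin means is one of the binning-based noise-reduction techniques mentioned under Data Transformation below.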
Issues to be considered
Schema integration: e.g., “cust-id” & “cust-no”
◦ Integrate metadata from different sources
◦ Entity identification problem:
Identify real-world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
◦ Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales,
e.g., metric vs. British units
Data Transformation
Smoothing: remove noise from data using smoothing techniques
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
Attribute/feature construction:
◦ New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: a linear transformation to [new_min_A, new_max_A]

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is
mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μ_A: mean, σ_A: standard deviation):

v' = (v − μ_A) / σ_A

Ex. Let μ_A (mean) = 54,000 and σ_A (std. dev) = 16,000. Then
v' = (73,600 − 54,000) / 16,000 = 1.225
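The three normalization methods listed above can be sketched together; the income figures reproduce the slides' worked examples, while the decimal-scaling input (986 with j = 3) is a hypothetical illustration.

```python
# Hedged sketch of the three normalization formulas. The min-max and
# z-score calls reproduce the slides' income examples; the decimal-scaling
# values are assumed for illustration.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear map from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    """Decimal scaling: v' = v / 10^j, with j the smallest integer
    such that max(|v'|) < 1."""
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716, as on the slide
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225, as on the slide
print(decimal_scaling(986, 3))                     # 0.986
```

Min-max preserves the original relationships among values but needs the range known in advance; z-score is preferable when the actual minimum and maximum are unknown or when outliers dominate.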