Data Mining: Concepts and Techniques
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Quality: Why Preprocess the Data?
Measures of data quality (a multidimensional view): accuracy, completeness, consistency, timeliness, believability, interpretability
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Aggregation
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., from faulty instruments, human or computer error, or transmission errors
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Missing data can arise for many reasons, e.g., technology limitations
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then smooth by bin means, bin medians, or bin boundaries
Clustering
detect and remove outliers
Binning Methods for Data Smoothing
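A minimal Python sketch of equal-frequency binning with smoothing by bin means; the function name and the sample price data are illustrative, not taken from this chapter:

def smooth_by_bin_means(values, n_bins):
    """Sort values, partition into equal-frequency bins,
    and replace each value by its bin's mean."""
    sorted_vals = sorted(values)
    bin_size = len(sorted_vals) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * bin_size
        # The last bin absorbs any remainder.
        end = (i + 1) * bin_size if i < n_bins - 1 else len(sorted_vals)
        bin_vals = sorted_vals[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

Smoothing by bin medians or bin boundaries uses the same partitioning step, only the replacement value changes.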
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Exercise
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton, Cust-id = Cust-#
Data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
Redundant attributes may be detectable by correlation analysis
χ² (chi-square) test for nominal data:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}, \qquad \text{Expected} = \frac{\text{count}(A = a_i) \times \text{count}(B = b_j)}{n}$$

The χ² statistic tests the hypothesis that A and B are independent, i.e., there is no correlation between them
The test is based on a significance level, with (r−1)(c−1) degrees of freedom
If the hypothesis can be rejected, then we say that A and B are statistically correlated
The larger the χ² value, the more likely the variables are related
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
For this 2×2 table, the degrees of freedom are (2−1)(2−1) = 1. For 1 degree of freedom, the χ² value needed to reject the independence hypothesis at the 0.001 significance level is 10.828 (from a χ² distribution table)
Since the computed value is above this threshold, we can reject the hypothesis that gender and preferred reading are independent
We conclude that the two attributes are strongly correlated for the given group of people
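A hedged worked computation in Python for the example above. The χ² routine follows the formula from the previous slide; the 2×2 gender vs. preferred-reading counts are assumed values matching the usual form of this example, not taken from the slide itself:

observed = {
    ("male", "fiction"): 250,
    ("male", "non-fiction"): 50,
    ("female", "fiction"): 200,
    ("female", "non-fiction"): 1000,
}

n = sum(observed.values())          # 1500 people in total

row_totals, col_totals = {}, {}     # count(A = a_i), count(B = b_j)
for (a, b), count in observed.items():
    row_totals[a] = row_totals.get(a, 0) + count
    col_totals[b] = col_totals.get(b, 0) + count

# chi^2 = sum over cells of (Observed - Expected)^2 / Expected,
# with Expected = count(A = a_i) * count(B = b_j) / n
chi2 = 0.0
for (a, b), obs in observed.items():
    expected = row_totals[a] * col_totals[b] / n
    chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))               # about 507.94, far above 10.828

With these assumed counts the script prints roughly 507.94, well above the 10.828 cutoff, which is why the independence hypothesis is rejected.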
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B
If $r_{A,B} > 0$: positively correlated; $r_{A,B} = 0$: uncorrelated; $r_{A,B} < 0$: negatively correlated
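A minimal Python sketch of the coefficient above; the function name and test data are illustrative:

import math

def pearson_r(a, b):
    """Pearson correlation r_{A,B} using sample standard deviations (n - 1)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cov_sum = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov_sum / ((n - 1) * sd_a * sd_b)

print(pearson_r([1, 2, 3], [2, 4, 6]))   # 1.0 (perfect positive correlation)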
Visually Evaluating Correlation
Scatter plots showing correlation coefficients ranging from −1 to 1
Covariance (Numeric Data)
Covariance is similar to correlation:

$$\mathrm{Cov}(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n} = E(A \cdot B) - \bar{A}\,\bar{B}$$

Correlation coefficient: $r_{A,B} = \dfrac{\mathrm{Cov}(A,B)}{\sigma_A\,\sigma_B}$
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
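A worked check of the question, sketched in Python using the shortcut form of the covariance above (divisor n); variable names are illustrative:

a = [2, 3, 5, 4, 6]         # stock A prices
b = [5, 8, 10, 11, 14]      # stock B prices
n = len(a)
mean_a = sum(a) / n         # 4.0
mean_b = sum(b) / n         # 9.6
# Cov(A, B) = E(A * B) - mean_A * mean_B
cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
print(round(cov, 2))        # 4.0

Since Cov(A, B) = 4 > 0, the two stocks tend to rise and fall together.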
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Numerosity reduction
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Attribute Subset Selection
Reduces the data size by removing:
Redundant attributes
Duplicate information contained in one or more other
attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
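Attribute subset selection is typically carried out with heuristic (greedy) search; the sketch below shows one common variant, stepwise forward selection, with an abstract score() callback. The function, the callback, and the attribute names are illustrative, not prescribed by the text:

def forward_select(attributes, score, max_attrs):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    while len(selected) < max_attrs:
        base = score(selected)
        best_attr, best_gain = None, 0.0
        for attr in attributes:
            if attr in selected:
                continue
            gain = score(selected + [attr]) - base
            if gain > best_gain:
                best_attr, best_gain = attr, gain
        if best_attr is None:        # no remaining attribute improves the score
            break
        selected.append(best_attr)
    return selected

# Illustrative scoring: pretend we already know which attributes help.
useful = {"age", "income"}
score = lambda subset: len(set(subset) & useful)
print(forward_select(["student_id", "age", "income", "zip"], score, 2))
# ['age', 'income']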
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Histogram Analysis
Divide data into buckets
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: example histogram with equal-width buckets spanning 10,000 to 100,000]
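A Python sketch of the two partitioning rules; function names and the sample data are illustrative:

def equal_width_buckets(values, k):
    """Each bucket spans the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)   # clamp the max value into the last bucket
        buckets[i].append(v)
    return buckets

def equal_frequency_buckets(values, k):
    """Each bucket holds (approximately) the same number of values."""
    s = sorted(values)
    size = len(s) // k
    return [s[i * size:(i + 1) * size if i < k - 1 else len(s)]
            for i in range(k)]

data = [7, 12, 14, 21, 28, 35]
print(equal_width_buckets(data, 3))       # [[7, 12, 14], [21], [28, 35]]
print(equal_frequency_buckets(data, 3))   # [[7, 12], [14, 21], [28, 35]]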
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Hierarchical clustering is also possible; clusters can be stored in multi-dimensional index tree structures
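A minimal sketch of keeping only a cluster representation, assuming numeric points and taking the diameter as the maximum pairwise distance (one common convention, not necessarily the text's definition):

import math

def cluster_summary(points):
    """Replace a cluster of points by (centroid, diameter)."""
    n = len(points)
    centroid = tuple(sum(coord) / n for coord in zip(*points))
    diameter = max(math.dist(p, q) for p in points for q in points)
    return centroid, diameter

print(cluster_summary([(0, 0), (2, 0), (1, 2)]))
# ((1.0, 0.666...), 2.236...)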
Sampling
Types of Sampling
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
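A minimal sketch of stratified sampling in Python; the record layout, field names, and sampling fraction are illustrative:

import random

def stratified_sample(records, key, fraction, seed=42):
    """Draw the same fraction from each stratum (records: list of dicts)."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(r[key], []).append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))   # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

customers = [{"id": i, "segment": "big" if i % 10 == 0 else "small"}
             for i in range(100)]
print(len(stratified_sample(customers, "segment", 0.2)))   # 20 (2 big + 18 small)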
Sampling: With or without Replacement
Data Cube Aggregation
Data Reduction 3: Data Compression
String compression
There are extensive theories and well-tuned algorithms
Data Compression
[Figure: original data vs. its approximated reconstruction after compression]
Exercise
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data Transformation
Mapping the entire set of values of a given attribute to a new set of replacement
values so that each old value can be identified with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: raw values of numeric attributes (e.g., age) replaced by interval
labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior)
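For the discretization bullet above, a tiny illustration mapping raw ages to conceptual labels; the age cutoffs are assumptions, not values from the text:

def age_label(age):
    """Replace a raw numeric age by a conceptual label."""
    if age <= 20:
        return "youth"
    elif age <= 59:
        return "adult"
    return "senior"

print([age_label(a) for a in [13, 25, 70]])   # ['youth', 'adult', 'senior']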
Normalization
Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v′|) < 1
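A sketch of the three methods in Python, reproducing the worked examples above:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # v' = (v - mu_A) / sigma_A
    return (v - mu) / sigma

def decimal_scaling(v, j):
    # j: smallest integer such that max(|v'|) < 1 over the attribute
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(z_score(73_600, 54_000, 16_000))             # 1.225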
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Concept Hierarchy Generation
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit
data grouping
{Urbana, Champaign, Chicago} < Illinois
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at
the lowest level of the hierarchy
Exceptions, e.g., weekday, month, year
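A sketch of this heuristic in Python; the distinct-value counts are illustrative:

distinct_counts = {
    "country": 15,
    "state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Most distinct values -> lowest level of the hierarchy
hierarchy = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(hierarchy))   # street < city < state < country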
Exercise
z-score normalization
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g., missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression