w2-Data_Preparation
Outline
◘ Data Integration
◘ Data Selection and Reduction
◘ Data Preprocessing and Data Cleaning
– Filling in Missing Values
– Removing Noisy Data
– Identification of Outliers
– Correcting Inconsistent Data
◘ Data Transformation Techniques
– Normalization
– Discretization
Data Preparation
[Pipeline: Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation]
Data Integration
◘ Data integration
– Integration of multiple databases, data cubes, or files
– Obtain data from various sources
Data Selection & Reduction
◘ Data Selection
– Selecting a target data set
– Removing duplicates
◘ Data Reduction
– Obtains a reduced representation of the data set (smaller in volume, yet producing the same, or almost the same, analytical results)
◘ Duplicates can be removed with a grouping query that keeps one row of tablename per distinct combination of column1, column2, and column3 (the query itself is not reproduced in this extract; a sketch of the idea follows below).
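As a stand-in for the missing query, here is a minimal pandas sketch of the same idea; pandas, the sample rows, and drop_duplicates are assumptions, while the column names follow the slide's placeholders:

import pandas as pd

# Hypothetical rows standing in for tablename.
df = pd.DataFrame({
    "column1": [1, 1, 2],
    "column2": ["a", "a", "b"],
    "column3": [10, 10, 20],
})

# Keep one row per distinct (column1, column2, column3) combination.
deduplicated = df.drop_duplicates(subset=["column1", "column2", "column3"])
print(deduplicated)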
◘ Example (Histograms):
– A popular data reduction technique
– Divide data into buckets and store average (or sum) for each bucket
Data Aggregation Example
[Figure: histogram of the price attribute — x-axis buckets from 10,000 to 90,000, y-axis counts from 0 to 40]
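A minimal sketch of the bucket-average idea behind the figure; numpy, the sample prices, and the exact bucket edges are illustrative assumptions:

import numpy as np

prices = np.array([12000, 18000, 31000, 35000, 52000, 58000, 71000, 88000])
edges = np.array([10000, 30000, 50000, 70000, 90000])  # bucket boundaries, as on the x-axis

# Reduced representation: one (bucket, average, count) triple per bucket
# instead of every raw value.
bucket_ids = np.digitize(prices, edges[1:])
for b in range(len(edges) - 1):
    in_bucket = prices[bucket_ids == b]
    if in_bucket.size:
        print(f"[{edges[b]}, {edges[b + 1]}): avg={in_bucket.mean():.0f}, count={in_bucket.size}")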
2- Dimensionality Reduction
[Figure: attribute subset selection shown as a decision tree over candidate attributes A1, A4, A6]
◘ Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter only)
◘ There are many choices of clustering definitions and clustering
algorithms.
[Figure: data points grouped into clusters C1–C4]

ID    Gender  Age  Marital Status  Score  Cluster
1021  F       41   NeverM          55     C1
1022  M       27   Married         35     C1
1023  M       20   NeverM          480    C2
1024  F       34   Married         950    C3
1025  M       74   Married         500    C2
1026  M       32   Married         500    C2
1027  M       18   NeverM          890    C3
1028  M       54   Married         68     C1
…     …       …    …               …      …
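A sketch of clustering-based reduction on the Score column above, storing only each cluster's centroid and diameter; the slides name no algorithm, so scikit-learn's KMeans is an assumption:

import numpy as np
from sklearn.cluster import KMeans

# Score values from the table; a real run would cluster on several attributes.
scores = np.array([[55], [35], [480], [950], [500], [500], [890], [68]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
for c in range(3):
    members = scores[km.labels_ == c]
    # Reduced representation: centroid and diameter only, not the member rows.
    print(f"cluster {c}: centroid={km.cluster_centers_[c][0]:.0f}, "
          f"diameter={members.max() - members.min()}, size={len(members)}")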
6- Concept Hierarchy Generation
The slide contrasts the detailed sales table (bottom) with its reduced form (top). Horizontal data reduction reduces the number of rows (e.g., aggregating the items of each sale into one row, or filtering out rows with Amount ≤ 5 TL); vertical data reduction drops columns (here the free-text Description column); concept hierarchy generation replaces district-level locations (Buca, Bornova, Mecidiyeköy, Kadıköy) with their cities (İzmir, İstanbul).

Reduced table:
SaleID  Products              Date      TotalAmount  Location
1       Tomato, Cheese, Cola  1.1.2008  45           İzmir
2       Pasta, Tea            3.1.2008  55           İstanbul
3       Hair Care             5.1.2008  5            İstanbul
4       Cigarettes, Beer      8.1.2008  25           İzmir

Original (detailed) table:
SaleID  Product     Date      Amount  Location               Description
1       Tomato      1.1.2008  20      İzmir, Buca            …
1       Cheese      1.1.2008  10      İzmir, Buca            …
1       Cola        1.1.2008  15      İzmir, Buca            …
2       Pasta       3.1.2008  25      İstanbul, Mecidiyeköy  …
2       Tea         3.1.2008  30      İstanbul, Mecidiyeköy  …
3       Hair Care   5.1.2008  5       İstanbul, Kadıköy      …
4       Cigarettes  8.1.2008  15      İzmir, Bornova         …
4       Beer        8.1.2008  10      İzmir, Bornova         …
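A pandas sketch of the three reductions on the detailed table; the column names are the English translations used above, and the 5 TL threshold comes from the slide's annotation:

import pandas as pd

sales = pd.DataFrame({
    "SaleID":   [1, 1, 1, 2, 2, 3, 4, 4],
    "Product":  ["Tomato", "Cheese", "Cola", "Pasta", "Tea",
                 "Hair Care", "Cigarettes", "Beer"],
    "Amount":   [20, 10, 15, 25, 30, 5, 15, 10],
    "District": ["Buca", "Buca", "Buca", "Mecidiyeköy", "Mecidiyeköy",
                 "Kadıköy", "Bornova", "Bornova"],
    "Description": ["....."] * 8,
})

# Horizontal reduction: drop rows with Amount <= 5 TL.
reduced = sales[sales["Amount"] > 5]

# Vertical reduction: drop the free-text Description column.
reduced = reduced.drop(columns=["Description"])

# Concept hierarchy generation: roll districts up to their cities.
city_of = {"Buca": "İzmir", "Bornova": "İzmir",
           "Mecidiyeköy": "İstanbul", "Kadıköy": "İstanbul"}
reduced = reduced.assign(City=reduced["District"].map(city_of))
print(reduced)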
Why Is Data Preprocessing Important?
1- Filling in Missing Values
Solutions:
1- Ignore the tuple
– Usually done when class label is missing (in classification)
– Not effective when the percentage of missing values per attribute varies
considerably
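A minimal pandas sketch of solution 1; the DataFrame and column names are illustrative:

import pandas as pd

df = pd.DataFrame({"age": [25, 40, None, 31],
                   "class_label": ["yes", None, "no", "yes"]})

# Ignore the tuple: drop every row whose class label is missing.
df_clean = df.dropna(subset=["class_label"])
print(df_clean)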
2- Removing Noisy Data
Solutions:
A. Binning
– First sort the data and partition it into (equal-width or equal-depth) bins
– Then smooth by one of:
• (a) Equal depth and smoothing by bin boundaries
• (b) Equal depth and smoothing by bin means
• (c) Equal width and smoothing by bin boundaries
• (d) Equal width and smoothing by bin means
B. Regression
– Smooth by fitting the data to a regression function
A. Binning
◘ Equal-width partitioning
– Divides the range into N intervals of equal size
– If A and B are the lowest and highest values of the attribute, the interval width is W = (B − A)/N
◘ Equal-depth partitioning
– Divides the range into N intervals, each containing approximately the same number of samples
Price in €:                                4  6  14 16 18 19 21 22 23 25 27 33
Equal width (N = 3, W = (33 − 4)/3 ≈ 10):  B1 B1 B2 B2 B2 B2 B2 B2 B2 B3 B3 B3
Equal depth (N = 3, 4 values per bin):     B1 B1 B1 B1 B2 B2 B2 B2 B3 B3 B3 B3
Equal depth, smoothing by bin means:       10 10 10 10 20 20 20 20 27 27 27 27
Equal depth, smoothing by bin boundaries:  4  4  16 16 18 18 22 22 23 23 23 33
Equal width, smoothing by bin means:       5  5  19 19 19 19 19 19 19 28 28 28
Equal width, smoothing by bin boundaries:  4  6  14 14 14 23 23 23 23 25 25 33
A. Binning (exercise)
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Equal-Depth          Equal-Width
Bin 1:               Bin 1:
Bin 2:               Bin 2:
Bin 3:               Bin 3:

Smoothing by means
Bin 1:               Bin 1:
Bin 2:               Bin 2:
Bin 3:               Bin 3:

Smoothing by boundaries
Bin 1:               Bin 1:
Bin 2:               Bin 2:
Bin 3:               Bin 3:
Binning Example
For example: 3, 8, 10, 11, 15, 19, 23, 29, 35
Equal-width intervals: W = (35 − 3)/3 ≈ 10, giving [3–13], [14–24], [25–35]

Equal-Depth             Equal-Width
Bin 1: 3, 8, 10         Bin 1: 3, 8, 10, 11
Bin 2: 11, 15, 19       Bin 2: 15, 19, 23
Bin 3: 23, 29, 35       Bin 3: 29, 35

Smoothing by means
Bin 1: 7, 7, 7          Bin 1: 8, 8, 8, 8
Bin 2: 15, 15, 15       Bin 2: 19, 19, 19
Bin 3: 29, 29, 29       Bin 3: 32, 32

Smoothing by boundaries
Bin 1: 3, 10, 10        Bin 1: 3, 11, 11, 11
Bin 2: 11, 11, 19       Bin 2: 15, 15, 23
Bin 3: 23, 23, 35       Bin 3: 29, 35
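A plain-Python sketch reproducing the worked example; the tie-breaking rule (ties go to the lower boundary) is inferred from the slide's numbers:

def equal_depth_bins(values, n):
    # n groups with (approximately) equal counts; the last bin takes any remainder.
    size = len(values) // n
    return [values[i * size: (i + 1) * size] if i < n - 1 else values[i * size:]
            for i in range(n)]

def equal_width_bins(values, n):
    # n intervals of width W = (max - min) / n over sorted values.
    lo, width = values[0], (values[-1] - values[0]) / n
    bins = [[] for _ in range(n)]
    for v in values:
        bins[min(int((v - lo) // width), n - 1)].append(v)
    return bins

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer bin boundary; ties go to the lower one.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

data = [3, 8, 10, 11, 15, 19, 23, 29, 35]
for name, bins in (("equal depth", equal_depth_bins(data, 3)),
                   ("equal width", equal_width_bins(data, 3))):
    print(name, bins, smooth_by_means(bins), smooth_by_boundaries(bins))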
B. Regression
[Figure: scatter plot with the fitted line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1′ on the line]
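A sketch of regression smoothing with numpy.polyfit; the data points are illustrative, chosen so the fit roughly recovers the figure's line y = x + 1:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.3, 4.9, 6.1])  # noisy observations near y = x + 1

slope, intercept = np.polyfit(x, y, deg=1)  # fit the line y = slope * x + intercept
y_smoothed = slope * x + intercept          # replace each y by its fitted value
print(round(slope, 2), round(intercept, 2), y_smoothed)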
3- Removing Outliers
◘ Outliers may be detected by clustering: values that fall outside of all clusters are candidate outliers
[Figure: clustered points with a few outliers lying outside the clusters]
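A sketch of the clustering idea for outlier detection; treating very small clusters as outliers is a heuristic assumption, since the slides do not prescribe a rule:

import numpy as np
from sklearn.cluster import KMeans

values = np.array([[12], [14], [15], [51], [52], [55], [300]])  # 300 is a planted outlier

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
sizes = np.bincount(km.labels_, minlength=3)

# Points in singleton clusters fall "outside" every real cluster: flag them.
outliers = values[sizes[km.labels_] == 1]
print(outliers)  # expected: [[300]]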
4- Resolve inconsistencies
Data Transformation
◘ Different source applications encode the same information differently; transformation standardizes it before loading into the Data Warehouse:
– Gender encodings: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
– Pipeline length units: appl A - cm; appl B - in; appl C - feet; appl D - yds
– Attribute names for the account balance: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
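A pandas sketch of this standardization step; the mapping tables are assumptions (in particular, which of x and y means male is not stated on the slide):

import pandas as pd

# One record per source application, in that application's own conventions.
records = pd.DataFrame({
    "source":   ["A", "B", "C", "D"],
    "gender":   ["m", 1, "x", "male"],
    "pipeline": [250.0, 98.0, 8.0, 2.7],  # cm, in, feet, yds respectively
})

gender_map = {"m": "M", "f": "F", 1: "M", 0: "F",
              "x": "M", "y": "F", "male": "M", "female": "F"}
to_cm = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # per-source length factors

records["gender"] = records["gender"].map(gender_map)
records["pipeline_cm"] = records["pipeline"] * records["source"].map(to_cm)
print(records)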
Encoding Errors
◘ Education Field
– C: college
– U: university
– H: high school
– D: doctorate
– M: master
– S : secondary school
– P: primary school
– I : illiterate
◘ Discretization
– Fixed k-Interval Discretization
– Cluster-Based Discretization
– Entropy-Based Discretization
Normalization
◘ Min-max normalization: maps to [new_min_A, new_max_A]

v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

– Ex. Let income range $12,000 to $98,000 be normalized to [0, 1]. Then $73,600 is mapped to

((73,600 − 12,000) / (98,000 − 12,000)) × (1 − 0) + 0 = 0.716
◘ Z-score normalization

v′ = (v − μ_A) / σ_A   (μ_A: mean, σ_A: standard deviation of attribute A)

– Ex. Let μ = 54,000, σ = 16,000. Then 73,600 is mapped to

(73,600 − 54,000) / 16,000 = 1.225
Standard Deviation
◘ Mean (average): x̄ = (x₁ + … + xₙ) / n
◘ Standard deviation: σ = √( Σᵢ (xᵢ − x̄)² / n )
Data Transformation Example
Price in € 4 6 14 16 18 19 21 22 23 24 27 34
Z-score -1.8 -1.6 -0.6 -0.3 -0.1 0 0.2 0.4 0.5 0.6 1 1.8
Decimal Scaling .04 .06 .14 .16 .18 .19 .21 .22 .23 .24 .27 .34
Min-max: v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Z-score: v′ = (v − μ_A) / σ_A
Decimal scaling: v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
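A numpy sketch of all three transformations on the price data above; using the sample standard deviation (ddof=1) is an assumption that reproduces the slide's z-scores to within rounding:

import numpy as np

prices = np.array([4, 6, 14, 16, 18, 19, 21, 22, 23, 24, 27, 34], dtype=float)

# Min-max normalization to [0, 1].
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Z-score normalization (ddof=1: sample standard deviation).
zscore = (prices - prices.mean()) / prices.std(ddof=1)

# Decimal scaling: smallest j with max(|v'|) < 1; here j = 2.
j = int(np.ceil(np.log10(np.abs(prices).max())))
decimal = prices / 10 ** j

print(np.round(minmax, 2))
print(np.round(zscore, 1))  # close to the table's Z-score row (differs only by rounding)
print(decimal)              # matches the Decimal Scaling row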
Discretization
◘ Discretization:
– Divide the range of a continuous attribute into intervals.
– Some classification algorithms only accept categorical attributes.
– Reduce data size by discretization, especially for numerical data
◘ Discretization Methods
– Fixed k-Interval Discretization
– Cluster-Based Discretization
– Entropy-Based Discretization
#   Age   Age (discretized)   Buys Computer
1   10    (0–17]              No
2   14    (0–17]              No
…   …     …                   …
6   48    (17–55]             No
…   …     …                   …
8   70    (55–100]            No
9   76    (55–100]            No
Fixed k-Interval Discretization
With min = 10, max = 82 and k = 4: W = (82 − 10)/4 = 72/4 = 18
Intervals: [10–28], (28–46], (46–64], (64–82]
(Rows 8, 9 and 10 of the example fall into (64–82].)
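A sketch with pandas.cut; the ages are illustrative, while k = 4 over the 10–82 range matches the slide's arithmetic:

import pandas as pd

ages = pd.Series([10, 14, 48, 70, 76, 82])

# Fixed k-interval: k equal-width intervals, W = (82 - 10) / 4 = 18.
binned = pd.cut(ages, bins=4, include_lowest=True)
print(binned.value_counts(sort=False))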