[email protected]
jalali.mshdiau.ac.ir
Machine Learning
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data in the real world is dirty:
incomplete: lacking attribute values, e.g., occupation = ""
noisy: containing errors or outliers, e.g., Salary = -10
inconsistent: containing discrepancies in codes or names
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Preprocessing
Forms of data preprocessing (figure): data cleaning, data integration, data transformation (e.g., -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48), and data reduction.
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Clustering
detect and remove outliers
Binning
first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Regression
Fit data to a function. Linear regression finds the best line to fit two variables; multiple regression can handle multiple variables. The values given by the function are used instead of the original values.
(Figure: data points (X1, Y1) fitted by the line y = x + 1.)
Cluster Analysis
Similar values are organized into groups (clusters). Values falling outside of clusters may be considered outliers and may be candidates for elimination.
Binning (equal-frequency partitioning)
Divides the range into N intervals, each containing approximately the same number of samples.
Binning
Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-depth bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin boundaries: the min and max values in each bin are identified (the boundaries), and each value in a bin is replaced with the closest boundary value.
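The boundary-smoothing procedure above can be sketched in a few lines of Python; the bin size of 3 is taken from the price example.

```python
# Equal-depth binning with smoothing by bin boundaries (sketch of the
# price example; bin size 3 is assumed from the slides).
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_ = prices[i:i + bin_size]
    lo, hi = bin_[0], bin_[-1]  # bin boundaries (min and max)
    # replace each value with the closest boundary
    smoothed.extend(lo if v - lo <= hi - v else hi for v in bin_)
print(smoothed)  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```

Note that within each bin, only the minimum and maximum survive; every interior value snaps to whichever boundary is nearer.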
Example
Original weather data:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Sorted by Temperature and partitioned into five equal-depth bins:

ID           7   6   5  | 9   4   10 | 8   12  11 | 14  2   13 | 3   1
Temperature  58  65  68 | 69  70  71 | 72  73  75 | 75  80  81 | 83  85
             Bin1         Bin2         Bin3         Bin4         Bin5
Example
Sorted temperatures and their bins:

ID           7   6   5  | 9   4   10 | 8   12  11 | 14  2   13 | 3   1
Temperature  58  65  68 | 69  70  71 | 72  73  75 | 75  80  81 | 83  85
             Bin1         Bin2         Bin3         Bin4         Bin5

After smoothing by bin means (each value is replaced by its bin's mean):

ID           7   6   5  | 9   4   10 | 8   12  11 | 14  2   13 | 3   1
Temperature  64  64  64 | 70  70  70 | 73  73  73 | 79  79  79 | 84  84
             Bin1         Bin2         Bin3         Bin4         Bin5
The final table with the new values for the Temperature attribute:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE
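The smoothing-by-bin-means pass over the Temperature example can be sketched as follows; equal-depth bins of 3 values are assumed from the slides, and Python's stable sort keeps tied values (the two 75s) in their original ID order, matching the table.

```python
# Smoothing by bin means for the Temperature example (equal-depth
# bins of 3, assumed from the slides; the last bin holds 2 values).
temps = [85, 80, 83, 70, 68, 65, 58, 72, 69, 71, 75, 73, 81, 75]  # IDs 1..14
order = sorted(range(len(temps)), key=lambda i: temps[i])  # stable sort by value
new = temps[:]
for start in range(0, len(order), 3):
    idx = order[start:start + 3]                        # one bin of indices
    mean = round(sum(temps[i] for i in idx) / len(idx)) # rounded bin mean
    for i in idx:
        new[i] = mean
print(new)  # [84, 79, 84, 70, 64, 64, 64, 73, 70, 70, 73, 73, 79, 79]
```

The printed list matches the Temperature column of the final table above.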
Data Integration
Data integration: combines data from multiple sources into a coherent store.
Detect and resolve data value conflicts, e.g., attributes recorded in different units (metric vs. British units).
Use an ontology (e.g., WordNet) to identify the same entities across different databases.
Redundant attributes may be detected by correlation analysis. Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation coefficient (Pearson's product-moment coefficient):

r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum(AB) is the sum of the AB cross-product.
If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation. r_{A,B} = 0: A and B are independent; r_{A,B} < 0: negatively correlated.
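A minimal sketch of the correlation coefficient defined above, using the sample standard deviation; the two input lists are made-up illustrative data.

```python
import math

def pearson(a, b):
    """Pearson correlation r_{A,B} as defined on the slide."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cross / ((n - 1) * sd_a * sd_b)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0 (positively correlated)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0 (negatively correlated)
```

An attribute pair with |r| close to 1 is a candidate for redundancy removal during integration.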
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values to fall within a small, specified range
  min-max normalization
  z-score normalization
  normalization by decimal scaling
Attribute/feature construction: new attributes constructed from the given ones
Min-max normalization:

v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716.

Z-score normalization:

v' = \frac{v - \bar{A}}{\sigma_A}

Normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
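The min-max formula and the income example can be checked with a short sketch:

```python
# Min-max normalization, reproducing the income example above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```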
Example: z-score normalization of the Humidity attribute (mean ≈ 80.29, sample standard deviation ≈ 9.84):

Humidity:    85    90    78     96    80     70     65     95    70     80     70     90    75     80
Normalized:  0.48  0.99  -0.23  1.60  -0.03  -1.05  -1.55  1.49  -1.05  -0.03  -1.05  0.99  -0.54  -0.03
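The z-score column above can be reproduced with a few lines of Python; the sample standard deviation (dividing by n − 1) is assumed, since it matches the slide's numbers.

```python
import math

# Z-score normalization of the Humidity values from the example.
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
n = len(humidity)
mean = sum(humidity) / n                                   # ≈ 80.29
sd = math.sqrt(sum((v - mean) ** 2 for v in humidity) / (n - 1))  # ≈ 9.84
z = [round((v - mean) / sd, 2) for v in humidity]
print(z[:4])  # [0.48, 0.99, -0.23, 1.6]
```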
Example: transforming the categorical attribute Gender into a binary attribute (F → 1, M → 0):

ID      1   2   3   4   5
Gender  F   M   M   F   M
Age     27  51  52  33  45

becomes

ID      1  2  3  4  5
Gender  1  0  0  1  0
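This kind of attribute construction is a one-line mapping in code; the F → 1, M → 0 assignment is taken from the table above.

```python
# Encode the categorical Gender attribute as a binary attribute.
genders = ["F", "M", "M", "F", "M"]
encoded = [1 if g == "F" else 0 for g in genders]
print(encoded)  # [1, 0, 0, 1, 0]
```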
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Discretization
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretization
Prepare for further analysis
Discretization - Example
Example: discretizing the Humidity attribute using 3 bins.
Humidity:     85    90    78      96    80    70      65   95    70      80    70      90    75      80
Discretized:  High  High  Normal  High  High  Normal  Low  High  Normal  High  Normal  High  Normal  High
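One way to express this discretization in code; the cut points (< 70 → Low, 70–79 → Normal, ≥ 80 → High) are an assumption that happens to reproduce the slide's table, as the slides do not state them explicitly.

```python
# Discretize Humidity into three labels (cut points assumed, see above).
def discretize(v):
    if v < 70:
        return "Low"
    if v < 80:
        return "Normal"
    return "High"

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
labels = [discretize(v) for v in humidity]
print(labels[:3])  # ['High', 'High', 'Normal']
```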
Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values per attribute.
E.g., for the attribute set {street, city, state, country}:
country: 15 distinct values
state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
The attribute with the fewest distinct values is placed at the top of the hierarchy.
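Ordering attributes by their distinct-value counts, as described above, is a simple sort; the counts are taken from the example.

```python
# Build a concept hierarchy by sorting attributes on distinct-value
# counts (fewer distinct values = higher level in the hierarchy).
counts = {"street": 674_339, "city": 3_567, "state": 365, "country": 15}
hierarchy = sorted(counts, key=counts.get)
print(" < ".join(hierarchy))  # country < state < city < street
```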