DATA MINING Chapter 1 and 2 Lect Slide
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
[Figure: the KDD process — data cleaning and data integration performed on the source databases]
Steps of the KDD process:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Components of a typical data mining system architecture:
- Database, data warehouse, World Wide Web, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
- Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
- Knowledge base: the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
- Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
- Pattern evaluation module: typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
Data Preprocessing
Why preprocess the data?
Major tasks in data preprocessing:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance, especially for numerical data
Data Cleaning
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data (and thus deletion)
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
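As a minimal sketch of the common remedies (the DataFrame and its column names are hypothetical), missing values can be ignored, filled with a global constant, or filled with the attribute mean using pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with missing customer income values
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income": [52000.0, np.nan, 61000.0, np.nan],
})

dropped = df.dropna(subset=["income"])                  # ignore incomplete tuples
constant = df.fillna({"income": 0.0})                   # fill with a global constant
mean_fill = df.fillna({"income": df["income"].mean()})  # fill with the attribute mean
print(mean_fill)
```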
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
- Binning: first sort the data and partition it into (equal-frequency) bins; then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc. (see the sketch below)
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have them checked by a human
- Regression: smooth by fitting the data to regression functions
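For instance, a minimal sketch of smoothing by bin means over equal-frequency bins (the price values are made up):

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted

# Partition into 3 equal-frequency (equal-depth) bins
bins = np.array_split(prices, 3)

# Smoothing by bin means: replace every value by the mean of its bin
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```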
Cluster Analysis
Unlike classification, the class labels are not present in the training data, simply because they are not known to begin with. Clustering can be used to generate such labels. Objects are clustered or grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity. Typical algorithms: k-means, k-medoids (a minimal k-means sketch follows).
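A minimal k-means sketch, assuming simple 2-D numeric data (the points are synthetic):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k initial centers
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Update step: each center moves to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5.0])
labels, centers = kmeans(X, k=2)
print(centers)  # two centers, near (0, 0) and (5, 5)
```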
[Figure: cluster analysis — objects grouped into clusters]
[Figure: regression — data smoothed by fitting the line y = x + 1 (axes X1 and Y1)]
Different commercial tools can aid in the discrepancy detection step.
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools.
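To illustrate the kind of fuzzy matching such tools rely on, a small sketch with Python's standard difflib (the city list and misspellings are made up):

```python
import difflib

known_cities = ["Chicago", "Houston", "Phoenix", "Philadelphia"]  # domain knowledge

for entry in ["Chcago", "Houston", "Pheonix"]:
    if entry not in known_cities:
        # Suggest the closest valid value as a correction
        match = difflib.get_close_matches(entry, known_cities, n=1, cutoff=0.6)
        print(f"discrepancy: {entry!r} -> suggested correction: {match}")
```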
Data Integration
Data integration: combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ; possible reasons include different representations and different scales, e.g., metric vs. British units
One attribute may be a derived (and hence redundant) attribute in another table, e.g., annual revenue.
Redundant data may be detected by correlation analysis (see the sketch below). Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
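As a minimal sketch of detecting a redundant numeric attribute via the Pearson correlation coefficient (the revenue figures are made up):

```python
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue = monthly_revenue * 12  # derived, hence redundant

# Pearson correlation: values near +1 or -1 suggest redundancy
r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
if abs(r) > 0.95:  # the threshold is an arbitrary choice for this sketch
    print(f"r = {r:.2f}: attributes look redundant; consider dropping one")
```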
Data Transformation
Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0.0 to 1.0.
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
Min-max normalization: $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
z-score normalization: $v' = \frac{v - \bar{A}}{\sigma_A}$
Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
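A minimal sketch implementing the three normalizations above (the attribute values and the [0, 1] target range are assumptions):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [new_min, new_max] = [0, 1]
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# z-score normalization
zscore = (v - v.mean()) / v.std()

# Decimal scaling: smallest j such that max(|v'|) < 1
j = 0
while np.abs(v / 10 ** j).max() >= 1:
    j += 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```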
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms (a sampling sketch follows the figure below).
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
[Figure: attribute subset selection by decision tree induction — internal nodes test attributes (A1?, A6?); leaves assign Class 1 or Class 2]
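As a sketch of numerosity reduction by simple random sampling without replacement (strategy 4 above; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=10.0, size=100_000)  # synthetic attribute values

# Keep a 1% simple random sample without replacement (SRSWOR)
sample = rng.choice(data, size=len(data) // 100, replace=False)

# The reduced representation yields similar analytical results
print(f"full mean = {data.mean():.2f}, sample mean = {sample.mean():.2f}")
```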
Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (which are not audio)
  - Typically short, and vary slowly with time
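A quick demonstration of lossless string compression using Python's standard zlib:

```python
import zlib

text = b"data mining " * 100  # repetitive strings compress very well
compressed = zlib.compress(text)

print(len(text), "->", len(compressed))     # large reduction in size
assert zlib.decompress(compressed) == text  # lossless: exact reconstruction
```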
[Figure: data compression — original data is transformed into compressed data; lossless compression allows exact reconstruction of the original]
Linear regression: $Y = \alpha + \beta X$. The two parameters $\alpha$ and $\beta$ specify the line and are estimated from the data at hand, using the least squares criterion on the known values $Y_1, Y_2, \ldots$ and $X_1, X_2, \ldots$ (see the sketch below). Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$. Many nonlinear functions can be transformed into this form. Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., $p(a, b, c, d) \approx \alpha_{ab}\,\beta_{ac}\,\gamma_{ad}\,\delta_{bcd}$.
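A sketch of estimating $\alpha$ and $\beta$ by least squares with numpy (the data points are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least squares estimates for Y = alpha + beta * X
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()

# Storing only (alpha, beta) instead of the raw data is numerosity reduction
print(f"Y = {alpha:.2f} + {beta:.2f} X")
```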
Histograms
A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket. Histograms can be constructed optimally in one dimension using dynamic programming, and are related to quantization problems (a sketch follows).
[Figure: histogram with buckets over values in the range ~50,000–90,000]
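A sketch of an equal-width histogram as a reduced representation (the prices are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(50_000, 90_000, size=10_000)

# Store only bucket edges plus per-bucket counts and averages
counts, edges = np.histogram(prices, bins=4)
idx = np.clip(np.digitize(prices, edges) - 1, 0, len(counts) - 1)
averages = [prices[idx == b].mean() for b in range(len(counts))]

for b in range(len(counts)):
    print(f"[{edges[b]:,.0f}, {edges[b + 1]:,.0f}): "
          f"count={counts[b]}, avg={averages[b]:,.0f}")
```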
Discretization
Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers

Discretization divides the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes
- Reduces data size
- Prepares for further analysis
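A sketch contrasting equal-width and equal-frequency discretization with pandas (the ages are made up):

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 38, 42, 45, 51, 63, 70])

equal_width = pd.cut(ages, bins=3)  # intervals of equal length
equal_freq = pd.qcut(ages, q=3)     # intervals holding equal counts

print(pd.concat({"width": equal_width, "freq": equal_freq}, axis=1))
```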
[Figure: automatic concept hierarchy generation based on the number of distinct values per attribute, e.g., country (15 distinct values) < ... < city < street]
THANKS.