Data Warehousing - CH3
Mr. Fisha M.
Mekelle Institute of Technology
Data Preprocessing
Today’s real-world databases are:
Highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size and their likely origin from multiple,
heterogeneous sources.
Why Preprocess Data:
To improve data quality, which in turn helps improve the mining
process and its results.
Low-quality data will lead to low-quality mining results.
Factors comprising data quality:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Major Tasks in Data Preprocessing
[Figure: overview of the major preprocessing tasks: data cleaning, data integration, data reduction, and data transformation.]
Cont’d...
Data Preprocessing involves the following tasks:
1. Data Cleaning
Attempts to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Missing Values
Methods for filling in missing values:
1. Ignore the tuple:
- This is usually done when the class label is missing.
- This method is not very effective unless the tuple contains
several attributes with missing values.
- It is especially poor when the percentage of missing values per
attribute varies considerably.
- By ignoring the tuple, we do not make use of the values of the
remaining attributes in the tuple.
Cont’d...
2. Fill in the missing value manually:
- Time-consuming and may not be feasible given a large data set
with many missing values.
3. Use a global constant to fill in the missing value:
- Replace all missing attribute values by the same constant, such
as a label like “Unknown”.
4. Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value.
5. Use the attribute mean or median for all samples belonging to
the same class as the given tuple.
6. Use the most probable value to fill in the missing value:
- This may be determined with regression, inference-based tools
(e.g., a Bayesian formalism), or decision tree induction.
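A minimal Python sketch of several of these strategies, assuming pandas is available; the table, column names, and values below are illustrative, not from the slides:

import pandas as pd

# Illustrative data set with missing values (NaN); "class_label" plays the
# role of the class attribute, "income" has two missing entries.
df = pd.DataFrame({
    "class_label": ["A", "A", "B", "B"],
    "income": [3000.0, None, 4500.0, None],
})

# Method 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Method 3: fill with a global constant such as a sentinel value.
constant_filled = df.fillna({"income": -1.0})

# Method 4: fill with a measure of central tendency (here, the mean).
mean_filled = df.fillna({"income": df["income"].mean()})

# Method 5: fill with the mean of all samples in the same class.
by_class = df.copy()
by_class["income"] = df.groupby("class_label")["income"].transform(
    lambda s: s.fillna(s.mean())
)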
Cont’d...
2. Noisy Data
Noise is a random error or variance in a measured variable.
Data smoothing techniques:
Cont’d...
1. Binning:
- Smooths a sorted data value by consulting its “neighborhood,”
that is, the values around it.
- The sorted values are distributed into a number of “buckets,”
or bins.
- Because binning methods consult the neighborhood of values,
they perform local smoothing.
- Smoothing by bin means: each value in a bin is replaced by
the mean value of the bin.
- Smoothing by bin medians: each bin value is replaced by the
bin median.
- Smoothing by bin boundaries: the minimum and maximum
values in a given bin are identified as the bin boundaries. Each
bin value is then replaced by the closest boundary value.
Eg.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
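The example above can be reproduced with a short Python sketch; this is a minimal implementation of equal-frequency binning with the three smoothing variants (numpy assumed available):

import numpy as np

def smooth_by_bins(values, n_bins, method="means"):
    """Sort values, split into equal-frequency bins, smooth locally."""
    data = np.sort(np.asarray(values, dtype=float))
    smoothed = []
    for bin_vals in np.array_split(data, n_bins):
        if method == "means":
            smoothed.extend([bin_vals.mean()] * len(bin_vals))
        elif method == "medians":
            smoothed.extend([np.median(bin_vals)] * len(bin_vals))
        elif method == "boundaries":
            lo, hi = bin_vals.min(), bin_vals.max()
            # replace each value by the closest bin boundary
            smoothed.extend(lo if v - lo <= hi - v else hi for v in bin_vals)
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bins(prices, 3, "means"))       # 9, 9, 9, 22, 22, 22, 29, 29, 29
print(smooth_by_bins(prices, 3, "boundaries"))  # 4, 4, 15, 21, 21, 24, 25, 25, 34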
Cont’d...
2. Regression
- A technique that conforms data values to a function.
- Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
- Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface.
- (A sketch of regression-based smoothing, together with
clustering-based outlier detection, follows this list.)
3. Outlier Analysis
- Outliers may be detected by clustering, for example, where
similar values are organized into groups, or “clusters.”
- Intuitively, values that fall outside of the set of clusters may be
considered outliers.
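A minimal sketch of both techniques, assuming numpy is available; the data are synthetic and the outlier threshold is an illustrative choice, not from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Regression-based smoothing: fit the "best" line y = a*x + b between two
# attributes and use the fitted values as the smoothed attribute.
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(scale=4.0, size=50)  # noisy linear data
a, b = np.polyfit(x, y, deg=1)                      # least-squares line
y_smoothed = a * x + b

# Clustering-based outlier detection: group values around k centers with a
# tiny 1-D k-means loop, then flag values far from their nearest center.
values = np.concatenate([rng.normal(10, 1, 30), rng.normal(50, 1, 30), [200.0]])
centers = np.array([10.0, 50.0])                    # assumed initial centers
for _ in range(10):                                 # Lloyd iterations
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    centers = np.array([values[labels == k].mean() for k in range(len(centers))])
labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
dist = np.abs(values - centers[labels])
outliers = values[dist > 3 * dist.std()]            # crude distance threshold
print(outliers)                                     # flags the value 200.0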
Cont’d...
2. Data Integration
Data mining often requires data integration — the merging of
data from multiple data stores.
Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set. This can help improve
the accuracy and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great
challenges in data integration.
It involves:
- The entity identification problem
- Redundancy and correlation analysis (a code sketch follows this list)
- Tuple duplication
- Data value conflict detection and resolution
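For numeric attributes, redundancy can be checked with the correlation coefficient (for nominal attributes a chi-square test would be used instead). A minimal numpy sketch, with illustrative attribute names (a near-duplicate height attribute merged from a second source):

import numpy as np

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, 100)
height_in = height_cm / 2.54 + rng.normal(0, 0.1, 100)  # near-duplicate attribute

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(height_cm, height_in)[0, 1]
print(f"correlation = {r:.3f}")

# |r| close to 1 suggests one attribute is derivable from the other and is
# therefore redundant in the integrated data set.
if abs(r) > 0.9:
    print("strongly correlated -> likely redundant")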
Cont’d...
3. Data Reduction
is a technique that can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
That is, mining on the reduced data set should be more efficient
yet produce the same (or almost the same) analytical results.
Data reduction strategies include:
1 Dimensionality reduction
2 Numerosity reduction
3 Data compression
Cont’d...
i Dimensionality reduction
is the process of reducing the number of random variables or
attributes under consideration. It includes:
Wavelet transforms and principal components analysis, which
transform or project the original data onto a smaller space
(a code sketch follows after item iii below).
Attribute subset selection: a method of dimensionality
reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed.
ii Numerosity reduction
A technique that replaces the original data volume by
alternative, smaller forms of data representation.
This technique may be parametric or non-parametric.
In parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored
instead of the actual data (e.g., regression and log-linear
models).
Nonparametric methods for storing reduced representations of
the data include histograms, sampling, and data cube aggregation.
iii Data compression
Transformations are applied so as to obtain a reduced or
“compressed” representation of the original data.
Lossless compression: the original data can be reconstructed
from the compressed data without any information loss.
Lossy compression: only an approximation of the original data
can be reconstructed from the compressed data.
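A minimal sketch illustrating all three strategies on synthetic data, assuming numpy is available; PCA stands in for dimensionality reduction, random sampling for numerosity reduction, and zlib for lossless compression:

import zlib
import numpy as np

def pca_project(X, k):
    """Project data onto its first k principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)  # rows of Vt are PCs
    return X_centered @ Vt[:k].T  # reduced, k-dimensional representation

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))          # 200 tuples, 10 attributes

# Dimensionality reduction: 10 attributes -> 3 derived attributes.
X_reduced = pca_project(X, k=3)         # shape (200, 3)

# Numerosity reduction: simple random sampling without replacement.
sample = X[rng.choice(len(X), size=20, replace=False)]  # shape (20, 10)

# Lossless compression: random data compresses poorly, but the original
# bytes are recovered exactly.
compressed = zlib.compress(X.tobytes())
restored = np.frombuffer(zlib.decompress(compressed), dtype=X.dtype).reshape(X.shape)
assert np.array_equal(X, restored)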
4. Data Transformation
In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may be more
efficient, and the patterns found may be easier to understand.
Includes:
1 Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
2 Attribute construction (or feature construction) where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3 Aggregation where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube
for data analysis at multiple abstraction levels.
4 Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0
(see the sketch at the end of this list).
5 Discretization where the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior).
6 Concept hierarchy generation for nominal data where attributes
such as street can be generalized to higher-level concepts, like
city or country. Many hierarchies for nominal attributes are
implicit within the database schema and can be automatically
defined at the schema definition level.
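A minimal sketch of normalization and discretization, assuming numpy; the age values, interval boundaries, and concept cut-offs are illustrative assumptions:

import numpy as np

ages = np.array([13, 15, 16, 19, 25, 35, 52, 70])

# Min-max normalization to [0.0, 1.0]:
#   v' = (v - min) / (max - min) * (new_max - new_min) + new_min
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization into interval labels (0-10, 11-20, ...):
bins = [0, 10, 20, 30, 40, 50, 60, 70]
interval_index = np.digitize(ages, bins)  # interval each age falls into

# Discretization into conceptual labels (youth / adult / senior);
# the cut-offs below are assumptions for illustration.
concept = np.where(ages < 20, "youth", np.where(ages < 60, "adult", "senior"))
print(ages_norm, interval_index, concept)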