02 Data Warehouse
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
Common steps in data preprocessing include:
❖ Data Cleaning
❖ Data Integration
❖ Data Transformation
❖ Data Reduction
❖ Data Discretization
❖ Data Normalization
Data Cleaning
Raw data often contains irrelevant, missing, or noisy parts. Data cleaning handles these problems, in
particular missing data and noisy data. Noisy data can be smoothed with techniques such as the following:
• Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be
linear (having one independent variable) or multiple (having multiple independent variables).
• Clustering:
This approach groups similar data values into clusters. Outliers either fall outside the clusters,
where they can be detected, or go undetected.
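Smoothing by regression can be sketched as follows. This is a minimal illustration, assuming simple linear regression fitted by ordinary least squares; the function names `fit_line` and `smooth_by_regression` and the sample values are hypothetical and not part of the original notes.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
smoothed = smooth_by_regression(xs, ys)
print(smoothed)
```

The smoothed values all lie on the fitted line, so random fluctuations around the trend are removed. Multiple regression would follow the same idea with more than one independent variable.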
Binning Method:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing.
For example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
In this example, the data for price are first sorted and then partitioned
into equal-frequency bins of size 3 (i.e., each bin contains three values):
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in
Bin 1 is 9. Therefore, each original value in this bin is replaced by the
value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width,
the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of
values in each bin is constant.
Binning is also used as a discretization technique.
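The three smoothing variants can be sketched on the price data from the example. This is a minimal sketch, not a reference implementation: the function names are hypothetical, and the rule that a value exactly halfway between the two boundaries goes to the lower boundary is an assumption made here.

```python
from statistics import mean, median


def equal_frequency_bins(sorted_values, bin_size):
    """Partition already-sorted values into consecutive bins of bin_size values."""
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

For this data the bin means are 9, 22, and 29, the bin medians are 8, 21, and 28, and smoothing by bin boundaries turns Bin 1 into 4, 4, 15.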
Data Transformation
This step transforms the data into forms suitable for the mining process. It involves the
following strategies:
1. Normalization:
It scales the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Construction:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This replaces the raw values of a numeric attribute with interval labels or conceptual labels.
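Normalization into a target range can be sketched as follows. The notes do not name a particular method, so this sketch assumes min-max normalization, one common choice. The function name `min_max_normalize` and the sample incomes are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min(values) maps to new_min
    and max(values) maps to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower end of the range.
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical income values scaled to [0.0, 1.0] and to [-1.0, 1.0].
incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))
print(min_max_normalize(incomes, new_min=-1.0, new_max=1.0))
```

Min-max normalization preserves the relative spacing of the original values. It is sensitive to outliers, because a single extreme value stretches the old range.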
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller
in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected
and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as
parametric models (which need to store only the model parameters instead of the actual data) or nonparametric methods
such as clustering, sampling, and the use of histograms.
• Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher
conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation
of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they
allow the mining of data at multiple levels of abstraction.
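As an illustration of the first strategy, aggregation can replace detailed records with a smaller summary. The sketch below assumes hypothetical quarterly sales figures and rolls them up to one total per year; the data and the function name `aggregate_by_year` are made up for this example.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount in dollars).
quarterly_sales = [
    ("2022", "Q1", 224.0), ("2022", "Q2", 408.0),
    ("2022", "Q3", 350.0), ("2022", "Q4", 586.0),
    ("2023", "Q1", 300.0), ("2023", "Q2", 410.0),
    ("2023", "Q3", 390.0), ("2023", "Q4", 620.0),
]


def aggregate_by_year(rows):
    """Sum the quarterly amounts so that each year is one aggregated value."""
    totals = defaultdict(float)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Eight detailed rows are reduced to two aggregated values. An analysis that only needs annual totals gives the same result on the smaller data set.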
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores. Careful integration can
help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy
and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.
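A small sketch can make the redundancy and inconsistency problem concrete. It assumes two hypothetical data stores that describe the same customers under different attribute names. The field names, the `key_map` parameter, and the sample records are illustrative assumptions, not part of the original notes.

```python
def integrate(store_a, store_b, key_map):
    """Merge customer records from two sources.

    key_map renames store_b fields to the names used in store_a.
    Returns the merged records and a list of conflicting values.
    """
    merged = {rec["cust_id"]: dict(rec) for rec in store_a}
    conflicts = []
    for rec in store_b:
        renamed = {key_map.get(k, k): v for k, v in rec.items()}
        cid = renamed["cust_id"]
        target = merged.setdefault(cid, {})
        for field, value in renamed.items():
            if field in target and target[field] != value:
                # Same customer and attribute, different values: an inconsistency.
                conflicts.append((cid, field, target[field], value))
            else:
                target[field] = value
    return merged, conflicts


# Hypothetical sources with different attribute names for the same concepts.
store_a = [{"cust_id": 1, "city": "Toronto"}, {"cust_id": 2, "city": "Chicago"}]
store_b = [{"customer_number": 1, "town": "Toronto", "phone": "555-0101"},
           {"customer_number": 2, "town": "Urbana"}]
key_map = {"customer_number": "cust_id", "town": "city"}

merged, conflicts = integrate(store_a, store_b, key_map)
print(merged)
print(conflicts)
```

In this sketch the first source wins when values disagree, and each disagreement is reported rather than silently overwritten. A real integration step would also need a rule for resolving such conflicts.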
Data transformation also includes the following two strategies:
• Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10,
11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into
higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be
defined for the same attribute to accommodate the needs of various users.
• Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can
be automatically defined at the schema definition level.
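Discretization of the age attribute can be sketched as follows. The interval labels follow the 0–10, 11–20 pattern mentioned above. The cut-off ages for youth, adult, and senior are assumptions chosen for illustration, and the sketch assumes non-negative integer ages.

```python
def interval_label(age, width=10):
    """Map an integer age to an interval label: 0-10, 11-20, 21-30, ..."""
    if age <= width:
        return f"0-{width}"
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def conceptual_label(age, youth_max=20, adult_max=60):
    """Map an age to a conceptual label.

    The cut-offs youth_max and adult_max are hypothetical; a real
    application would choose them to suit its users.
    """
    if age <= youth_max:
        return "youth"
    if age <= adult_max:
        return "adult"
    return "senior"


ages = [7, 11, 20, 35, 64]
print([interval_label(a) for a in ages])
print([conceptual_label(a) for a in ages])
```

Changing `youth_max` and `adult_max` gives a different concept hierarchy for the same attribute, which is one way to accommodate different users.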
Concept Hierarchy Generation for Nominal Data
Nominal attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
Examples include geographic location, job category, and item type.
Manual definition of concept hierarchies can be a tedious and time-consuming task for a user or a domain expert.
Fortunately, many hierarchies are implicit within the database schema and can be automatically defined at the schema
definition level. The concept hierarchies can be used to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be found relating to specific regions or countries, in addition to
individual branch locations.
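The sales example can be sketched as a roll-up along a location hierarchy. The branch names, locations, and sales amounts below are hypothetical, and `roll_up` is an illustrative helper rather than an established API.

```python
from collections import defaultdict

# Hypothetical location hierarchy: each branch belongs to a city and a country.
branch_location = {
    "branch_1": {"city": "Vancouver", "country": "Canada"},
    "branch_2": {"city": "Vancouver", "country": "Canada"},
    "branch_3": {"city": "Toronto", "country": "Canada"},
    "branch_4": {"city": "Chicago", "country": "USA"},
}

# Hypothetical sales per branch (in dollars).
branch_sales = [("branch_1", 100.0), ("branch_2", 250.0),
                ("branch_3", 400.0), ("branch_4", 300.0)]


def roll_up(sales, locations, level):
    """Total the branch sales at a higher level of the hierarchy."""
    totals = defaultdict(float)
    for branch, amount in sales:
        totals[locations[branch][level]] += amount
    return dict(totals)


print(roll_up(branch_sales, branch_location, "city"))
print(roll_up(branch_sales, branch_location, "country"))
```

The same sales data can be viewed per branch, per city, or per country, so patterns can be mined at whichever level of granularity is most informative.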
We study four methods for the generation of concept hierarchies for nominal data, as follows.
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept
hierarchies for nominal attributes or dimensions typically involve a group of attributes. A user or expert can
easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
For example, suppose that a relational database contains the following group of attributes:
street, city, province or state, and country.
Similarly, a data warehouse location dimension may contain the same attributes. A hierarchy can be defined by
specifying the total ordering among these attributes at the schema level such as
street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a
portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit
value enumeration. On the contrary, we can easily specify explicit groupings for a small portion of intermediate-level
data.
For example, after specifying that province and country form a hierarchy at the schema level, a user could define
some intermediate levels manually,
3. Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming
a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically
generate the attribute ordering so as to construct a meaningful concept hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless when defining a hierarchy, or
have only a vague idea about what should be included in a hierarchy. Consequently, the user may have included
only a small subset of the relevant attributes in the hierarchy specification.
For example, instead of including all of the hierarchically relevant attributes for location, the user may have
specified only street and city. To handle such partially specified hierarchies, it is important to embed data
semantics in the database schema so that attributes with tight semantic connections can be pinned together. In
this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to
be “dragged in” to form a complete hierarchy. Users, however, should have the option to override this feature, as
necessary.
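For method 3, the notes do not say how the system generates the ordering. One possible heuristic, assumed here for illustration, is to count the distinct values of each attribute and place the attribute with the most distinct values at the lowest level of the hierarchy. The sample rows and the function name `infer_hierarchy` are hypothetical.

```python
def infer_hierarchy(rows, attributes):
    """Order attributes from the lowest to the highest hierarchy level.

    Assumed heuristic: more distinct values means a lower level.
    """
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a], reverse=True)


# Hypothetical location records.
rows = [
    {"street": "1 Main St", "city": "Vancouver", "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "2 Oak Ave", "city": "Vancouver", "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "3 Elm Rd", "city": "Victoria", "province_or_state": "British Columbia", "country": "Canada"},
    {"street": "4 Pine St", "city": "Toronto", "province_or_state": "Ontario", "country": "Canada"},
    {"street": "5 Lake Dr", "city": "Chicago", "province_or_state": "Illinois", "country": "USA"},
    {"street": "6 Hill Rd", "city": "Urbana", "province_or_state": "Illinois", "country": "USA"},
]

# The attributes are given in arbitrary order, without any stated ordering.
order = infer_hierarchy(rows, ["country", "street", "province_or_state", "city"])
print(" < ".join(order))
```

On this sample the heuristic recovers the ordering from the schema-level example. It is not foolproof: an attribute with few distinct values is not always a higher-level concept, so users may need to adjust the generated ordering.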