
02 Data Warehouse

PRAVEEN KUMAR SRIVASTAVA


Data Preprocessing

Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task. Some common steps in data preprocessing include:
❖ Data Cleaning
❖ Data Integration
❖ Data Transformation
❖ Data Reduction
❖ Data Discretization
❖ Data Normalization
Data Cleaning

The data can have many irrelevant and missing parts. Data cleaning is done to handle this. It involves handling missing data, noisy data, and so on.

(a). Missing Data:


This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.

• Fill the Missing values:


There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
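As an illustration, the following minimal sketch uses pandas to apply both options to a made-up dataset; the column names and values are hypothetical.

    import pandas as pd
    import numpy as np

    # Hypothetical dataset with a missing price value
    df = pd.DataFrame({"item": ["A", "B", "C", "D"],
                       "price": [4.0, np.nan, 15.0, 21.0]})

    # Option 1: ignore (drop) tuples that contain missing values
    dropped = df.dropna()

    # Option 2: fill missing values with the attribute mean
    df["price"] = df["price"].fillna(df["price"].mean())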
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments of equal size, and each segment is then handled separately. For example, all values in a segment can be replaced by the segment mean, or the boundary values can be used to smooth the segment.

• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). A short sketch of regression smoothing follows this list.

• Clustering:
This approach groups similar data values into clusters. Values that fall outside the clusters may be treated as outliers.
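As mentioned under Regression above, a simple way to smooth noisy values is to fit a linear model and replace each observation with its fitted value. The sketch below uses NumPy; the data points are made up for the example.

    import numpy as np

    # Noisy (x, y) observations (hypothetical values)
    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Fit a linear regression (one independent variable)
    slope, intercept = np.polyfit(x, y, deg=1)

    # Replace each observed y with its fitted value to smooth the noise
    y_smoothed = slope * x + intercept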
Binning Method:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing.
For Example:

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

In this example, the data for price are first sorted and then partitioned
into equal-frequency bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin
1 is 9. Therefore, each original value in this bin is replaced by the value
9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width,
the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of
values in each bin is constant.
Binning is also used as a discretization technique.
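The bin-mean and bin-boundary smoothing described above can be sketched in a few lines of Python. The code below uses NumPy and the nine sorted prices from the example, assuming three equal-frequency bins of size 3.

    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
    bins = prices.reshape(3, 3)                  # equal-frequency bins of size 3

    # Smoothing by bin means: replace every value with its bin mean
    by_means = np.repeat(bins.mean(axis=1), 3)   # -> 9, 9, 9, 22, 22, 22, 29, 29, 29

    # Smoothing by bin boundaries: replace each value with the closer
    # of the bin's minimum and maximum
    lo = bins.min(axis=1, keepdims=True)
    hi = bins.max(axis=1, keepdims=True)
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()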
Data Transformation

This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); a short sketch follows this list.

2. Attribute Construction (Attribute Selection):
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
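As a minimal sketch of the normalization step, the code below applies min-max scaling (to the range 0.0 to 1.0) and z-score scaling to a hypothetical numeric attribute; the values are made up.

    import numpy as np

    values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute

    # Min-max normalization: rescale values into the range [0.0, 1.0]
    min_max = (values - values.min()) / (values.max() - values.min())

    # Z-score normalization: mean 0, standard deviation 1
    z_score = (values - values.mean()) / values.std()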
Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller
in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected
and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms; a short sketch of sampling and histogram-based reduction follows this list.
Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
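The following sketch illustrates two simple numerosity reduction techniques, random sampling and a histogram summary, on synthetic data generated just for this example.

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.normal(loc=50, scale=10, size=10_000)   # hypothetical attribute values

    # Numerosity reduction by sampling: keep a 1% random sample without replacement
    sample = rng.choice(data, size=100, replace=False)

    # Numerosity reduction by a histogram: store only bin counts and bin edges
    counts, edges = np.histogram(data, bins=20)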
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores. Careful integration can
help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy
and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.

Entity Identification Problem


There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. This is referred to as the entity identification problem.
When matching attributes from one database to another during integration, special attention must be paid to the
structure of the data. This is to ensure that any attribute functional dependencies and referential constraints in the
source system match those in the target system.
Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or dimension
naming can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how
strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another.
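A minimal sketch of both checks is shown below, using SciPy's chi-square test of independence for a pair of nominal attributes and NumPy's correlation and covariance for a pair of numeric attributes; the contingency table and numeric values are made up for the example.

    import numpy as np
    from scipy import stats

    # Nominal attributes: chi-square test on a hypothetical contingency table
    observed = np.array([[250, 200],
                         [50, 1000]])
    chi2, p_value, dof, expected = stats.chi2_contingency(observed)

    # Numeric attributes: correlation coefficient and covariance
    a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
    b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
    r = np.corrcoef(a, b)[0, 1]    # Pearson correlation coefficient
    cov = np.cov(a, b)[0, 1]       # sample covariance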
Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g.,
where there are two or more identical tuples for a given unique data entry case). The use of denormalized tables
(often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often
arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.
For example, if a purchase order database contains attributes for the purchaser’s name and address instead of a key
to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s name appearing
with different addresses within the purchase order database.
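Detecting exact duplicates at the tuple level is straightforward with pandas; the sketch below uses a hypothetical purchase-order table.

    import pandas as pd

    orders = pd.DataFrame({
        "purchaser": ["J. Smith", "J. Smith", "A. Jones"],
        "address":   ["12 Oak St", "12 Oak St", "3 Elm Rd"],
        "amount":    [100, 100, 250],
    })   # hypothetical purchase-order tuples

    duplicates = orders[orders.duplicated()]   # flag exact duplicate tuples
    cleaned = orders.drop_duplicates()         # keep a single copy of each tuple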
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For example, for the same real-
world entity, attribute values from different sources may differ. This may be due to differences in representation,
scaling, or encoding.
For instance, a weight attribute may be stored in metric units in one system and British imperial units in another. For
a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services
(e.g., free breakfast) and taxes.
When exchanging information between schools, for example, each school may have its own curriculum and grading
scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from
A+ to F, whereas another may adopt a semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult to work out precise course-to-grade transformation rules between the two universities, making information exchange difficult.
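One common resolution for representation conflicts like the weight example above is to convert all sources to a single unit before merging. The sketch below does this with pandas; the table names, column names, and values are hypothetical.

    import pandas as pd

    # Weights reported by two source systems in different units (hypothetical)
    src_a = pd.DataFrame({"item": ["X", "Y"], "weight_kg": [2.0, 5.5]})
    src_b = pd.DataFrame({"item": ["Z"], "weight_lb": [11.0]})

    # Resolve the conflict by converting everything to kilograms before merging
    src_b["weight_kg"] = src_b["weight_lb"] * 0.45359237
    merged = pd.concat([src_a[["item", "weight_kg"]],
                        src_b[["item", "weight_kg"]]], ignore_index=True)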
Data Transformation and Data Discretization
In this pre-processing step, the data are transformed or consolidated so that the resulting mining process
may be more efficient, and the patterns found may be easier to understand.

Data Transformation Strategies Overview


In data transformation, the data are transformed or consolidated into forms appropriate for mining. Strategies for
data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the given
set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data
may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a
data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users. A short sketch of discretization and aggregation follows this list.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can
be automatically defined at the schema definition level.
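The sketch below illustrates strategies 3 and 5 with pandas: daily sales aggregated to monthly totals, and raw ages replaced by conceptual labels. The cut points, labels, and figures are assumptions made for the example.

    import pandas as pd

    # Aggregation: hypothetical daily sales rolled up to monthly totals
    sales = pd.DataFrame({"date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
                          "amount": [120.0, 80.0, 200.0]})
    monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

    # Discretization: replace raw ages with conceptual labels (hypothetical cut points)
    ages = pd.Series([5, 17, 25, 42, 70])
    labels = pd.cut(ages, bins=[0, 14, 24, 64, 120],
                    labels=["child", "youth", "adult", "senior"])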
Concept Hierarchy Generation for Nominal Data

Nominal attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
Examples include geographic location, job category, and item type.
Manual definition of concept hierarchies can be a tedious and time-consuming task for a user or a domain expert.
Fortunately, many hierarchies are implicit within the database schema and can be automatically defined at the schema
definition level. The concept hierarchies can be used to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be found relating to specific regions or countries, in addition to
individual branch locations.
We study four methods for the generation of concept hierarchies for nominal data, as follows.
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept
hierarchies for nominal attributes or dimensions typically involve a group of attributes. A user or expert can
easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
For example, suppose that a relational database contains the following group of attributes:
street, city, province or state, and country.
Similarly, a data warehouse location dimension may contain the same attributes. A hierarchy can be defined by
specifying the total ordering among these attributes at the schema level such as
street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a
portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit
value enumeration. On the contrary, we can easily specify explicit groupings for a small portion of intermediate-level
data.
For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually.
3. Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy; a small sketch of one such heuristic follows this list.

4. Specification of only a partial set of attributes: Sometimes a user can be careless when defining a hierarchy, or
have only a vague idea about what should be included in a hierarchy. Consequently, the user may have included
only a small subset of the relevant attributes in the hierarchy specification.
For example, instead of including all of the hierarchically relevant attributes for location, the user may have
specified only street and city. To handle such partially specified hierarchies, it is important to embed data
semantics in the database schema so that attributes with tight semantic connections can be pinned together. In
this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to
be “dragged in” to form a complete hierarchy. Users, however, should have the option to override this feature, as
necessary.
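For method 3, a common heuristic (an assumption here, not specified in the text above) is to order the attributes by their number of distinct values: the attribute with the most distinct values, such as street, is placed at the lowest level of the hierarchy. A minimal sketch with pandas and hypothetical location data:

    import pandas as pd

    # Hypothetical location data with an unordered set of attributes
    loc = pd.DataFrame({
        "street":  ["1 Main St", "2 Main St", "5 High St", "9 Oak Ave"],
        "city":    ["Delhi", "Delhi", "Mumbai", "Pune"],
        "country": ["India", "India", "India", "India"],
    })

    # Heuristic: more distinct values => lower level in the concept hierarchy
    order = loc.nunique().sort_values(ascending=False).index.tolist()
    hierarchy = " < ".join(order)   # -> "street < city < country"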
