Data Mining-2-1
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, to mine a description of customer purchasing habits −
mine characteristics as customerPurchasing
analyze count%
Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
For example, a user may define big spenders as customers who purchase items that cost
$100 or more on average, and budget spenders as customers who purchase items that cost
less than $100 on average. The mining of discriminant descriptions for customers from
each of these categories can be specified in DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
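As a rough illustration of what this query asks for, the Python sketch below partitions customers into the two classes by average item price and counts each class. The purchase records and the helper name split_spenders are hypothetical; a DMQL engine would evaluate these conditions inside the database.

```python
from collections import defaultdict

def split_spenders(purchases, threshold=100.0):
    """Partition customers by average item price and count each class.

    purchases: iterable of (customer_id, item_price) pairs (made-up data).
    Returns a dict mapping class name to the number of customers in it.
    """
    prices = defaultdict(list)
    for customer, price in purchases:
        prices[customer].append(price)

    counts = {"bigSpenders": 0, "budgetSpenders": 0}
    for values in prices.values():
        avg_price = sum(values) / len(values)
        # avg(I.price) >= $100 is the target class; everything else is the contrasting class
        label = "bigSpenders" if avg_price >= threshold else "budgetSpenders"
        counts[label] += 1
    return counts

purchases = [("c1", 250.0), ("c1", 50.0), ("c2", 20.0), ("c2", 30.0), ("c3", 100.0)]
result = split_spenders(purchases)
print(result)
```

Here c1 averages $150 and c3 exactly $100, so both fall into bigSpenders, while c2 averages $25 and falls into budgetSpenders.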
Association
The syntax for Association is−
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For example −
mine associations as buyingHabits
matching P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)
where X is the key of the customer relation; P and Q are predicate variables; and W, Y, and Z
are object variables.
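To make the metapattern more concrete, the Python sketch below evaluates a single instantiation of it, with P and Q bound to two customer attributes and Z bound to one item. The records, the attribute names age and income, and the function rule_support_confidence are all hypothetical; a real system would search over many candidate instantiations.

```python
def rule_support_confidence(customers, antecedent, item):
    """Evaluate one instantiation of P(X, W) ^ Q(X, Y) => buys(X, Z).

    antecedent: dict of attribute -> required value (the bound P and Q).
    item: the bound value of Z.
    Returns (support, confidence) as fractions.
    """
    matching = [c for c in customers
                if all(c.get(attr) == value for attr, value in antecedent.items())]
    both = [c for c in matching if item in c["buys"]]
    support = len(both) / len(customers)
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence

customers = [
    {"age": "30..39", "income": "40K..49K", "buys": {"TV"}},
    {"age": "30..39", "income": "40K..49K", "buys": {"PC"}},
    {"age": "20..29", "income": "40K..49K", "buys": {"TV"}},
    {"age": "30..39", "income": "20K..29K", "buys": {"TV"}},
]
support, confidence = rule_support_confidence(
    customers, {"age": "30..39", "income": "40K..49K"}, "TV"
)
print(support, confidence)
```

Two of the four customers satisfy the antecedent and one of those two buys a TV, so this instantiation has support 0.25 and confidence 0.5.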
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns that classify customer credit rating, where the classes are
determined by the attribute credit_rating and the pattern is named
classifyCustomerCreditRating −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
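The definition above asks the system to derive five age categories by clustering. The Python sketch below shows one way that could work, assuming equal-width binning; the source does not say which algorithm the default clustering uses, and the function name cluster_equal_width and the sample ages are made up for illustration.

```python
def cluster_equal_width(values, k=5):
    """Assign each value to one of k equal-width categories, numbered 1..k.

    Equal-width binning is an assumption standing in for cluster(default, age, 5).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all values are equal
    categories = {}
    for v in values:
        index = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        categories[v] = f"age_category({index + 1})"
    return categories

ages = [18, 25, 33, 41, 58, 68]
hierarchy = cluster_equal_width(ages, k=5)
print(hierarchy)
```

Every category then sits one level below all(age), which is what the "<" in the definition expresses.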
-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
Descriptive mining:
Predictive mining:
Based on data and analysis, constructs models for the database, and predicts the trend and
properties of unknown data
Concept description:
Comparison:
Concept description:
-Can handle complex data types of the attributes and their aggregations.
OLAP:
– User-controlled process.
Data generalization:
A process which abstracts a large set of task-relevant data in a database from a low
conceptual level to higher ones.
Approaches:
CONCEPTUAL LEVELS
Characterization:
2) Strength:
● An efficient implementation of data generalization
● Computation of various kinds of measures e.g., count( ), sum( ), average( ), max( )
● Generalization and specialization can be performed on a data cube by roll-up and
drill-down
Limitations:
– Handles only dimensions of simple nonnumeric data and measures of simple aggregate
numeric values.
– Lacks intelligent analysis: it cannot tell which dimensions should be used or what level
the generalization should reach.
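The roll-up operation mentioned above can be sketched in a few lines of Python. The city-to-country mapping and the cell counts are hypothetical, and a dictionary of cells stands in for a real data cube.

```python
from collections import Counter

# Hypothetical concept hierarchy for the location dimension: city -> country
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up(cell_counts, mapping):
    """Aggregate count() from a lower level of location to the next higher level."""
    rolled = Counter()
    for (location, item), count in cell_counts.items():
        rolled[(mapping[location], item)] += count
    return dict(rolled)

city_level = {
    ("Vancouver", "TV"): 3,
    ("Toronto", "TV"): 2,
    ("Chicago", "TV"): 4,
    ("Chicago", "PC"): 1,
}
country_level = roll_up(city_level, CITY_TO_COUNTRY)
print(country_level)
```

Drill-down is the reverse step and needs the finer-grained cells to be kept, since the city-level counts cannot be recovered from the country-level totals alone.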
Attribute-Oriented Induction
• How is it done?
● Collect the task-relevant data (the initial relation) using a relational database query.
● Perform generalization by attribute removal or attribute generalization.
● Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts.
● Present the results to users interactively.
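The steps above can be sketched in Python as follows. This is a simplified, single-pass version under stated assumptions: the initial relation is a list of dictionaries, each concept hierarchy is a flat value-to-concept mapping, and one threshold applies to every attribute. The function name and the student records are hypothetical.

```python
from collections import Counter

def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize the initial relation, then merge identical tuples with counts.

    rows: list of dicts (the initial relation).
    hierarchies: attribute -> {value: higher-level concept} (hypothetical).
    threshold: maximum number of distinct values an attribute may keep as-is.
    """
    kept = []  # (attribute, mapping or None) pairs that survive
    for attr in rows[0]:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            kept.append((attr, None))               # already general enough
        elif attr in hierarchies:
            kept.append((attr, hierarchies[attr]))  # attribute generalization
        # otherwise: attribute removal (many values and no hierarchy)

    merged = Counter()
    for row in rows:
        key = tuple(row[a] if m is None else m[row[a]] for a, m in kept)
        merged[key] += 1                            # accumulate counts
    return [a for a, _ in kept], dict(merged)

students = [
    {"name": "Ann", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Bob", "major": "CS", "birth_place": "Toronto"},
    {"name": "Cho", "major": "Physics", "birth_place": "Seattle"},
]
hierarchies = {"birth_place": {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}}
attrs, relation = attribute_oriented_induction(students, hierarchies, threshold=2)
print(attrs, relation)
```

In this run name is removed, birth_place is generalized to the country level, and the two CS students from Canada merge into one tuple with count 2. A fuller version would keep climbing the hierarchy until each attribute meets its threshold.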
Analytical Characterization : Analysis of Attribute Relevance
Introduction
“What if I am not sure which attributes to include for class characterization and class comparison? I
may end up specifying too many attributes, which could slow down the system considerably.”
Measures of attribute relevance analysis can be used to help identify irrelevant or weakly relevant
attributes that can be excluded from the concept description process. The incorporation of this
pre-processing step into class characterization or comparison is referred to as analytical
characterization or analytical comparison, respectively. This section describes a general method of
attribute relevance analysis and its integration with attribute-oriented induction.
The first limitation of class characterization for multidimensional data analysis in data warehouses
and OLAP tools is the handling of complex objects. The second limitation is the lack of an
automated generalization process: the user must explicitly tell the system which dimensions should be
included in the class characterization and to how high a level each dimension should be generalized.
In effect, the user must specify each step of generalization or specialization on any dimension.
Usually, it is not difficult for a user to instruct a data mining system regarding how high a level each
dimension should be generalized to. For example, users can set attribute generalization thresholds for
this, or specify which level a given dimension should reach, such as with the command “generalize
dimension location to the country level”. Even without explicit user instruction, a default value such
as 2 to 8 can be set by the data mining system, which would allow each dimension to be generalized
to a level that contains only 2 to 8 distinct values. If the user is not satisfied with the current level of
generalization, she can specify dimensions on which drill-down or roll-up operations should be
applied.
It is nontrivial, however, for users to determine which dimensions should be included in the analysis
of class characteristics. Data relations often contain 50 to 100 attributes, and a user may have little
knowledge regarding which attributes or dimensions should be selected for effective data mining. A
user may include too few attributes in the analysis, causing the resulting mined descriptions to be
incomplete. On the other hand, a user may introduce too many attributes for analysis (e.g., by
indicating “in relevance to *”, which includes all the attributes in the specified relations).
Methods should be introduced to perform attribute (or dimension) relevance analysis in order to
filter out statistically irrelevant or weakly relevant attributes, and retain or even rank the most relevant
attributes for the descriptive mining task at hand. Class characterization that includes the analysis of
attribute/dimension relevance is called analytical characterization. Class comparison that includes
such analysis is called analytical comparison.
Intuitively, an attribute or dimension is considered highly relevant with respect to a given class if it is
likely that the values of the attribute or dimension may be used to distinguish the class from others.
For example, it is unlikely that the color of an automobile can be used to distinguish expensive from
cheap cars, but the model, make, style, and number of cylinders are likely to be more relevant
attributes. Moreover, even within the same dimension, different levels of concepts may have
dramatically different powers for distinguishing a class from others.
For example, in the birth_date dimension, birth_day and birth_month are unlikely to be relevant to the
salary of employees. However, the birth_decade (i.e., age interval) may be highly relevant to the
salary of employees. This implies that the analysis of dimension relevance should be performed at
multiple levels of abstraction, and only the most relevant levels of a dimension should be included in the
analysis. Above we said that attribute/dimension relevance is evaluated based on the ability of the
attribute/dimension to distinguish objects of a class from others. When mining a class comparison (or
discrimination), the target class and the contrasting classes are explicitly given in the mining query.
The relevance analysis should be performed by comparison of these classes, as we shall see below.
However, when mining class characteristics, there is only one class to be characterized. That is, no
contrasting class is specified. It is therefore not obvious what the contrasting class should be for use in
relevance analysis. In this case, the contrasting class is typically taken to be the set of
comparable data in the database that excludes the set of data to be characterized. For example, to
characterize graduate students, the contrasting class can be composed of the set of undergraduate
students.
There have been many studies in machine learning, statistics, fuzzy and rough set theories, and so on,
on attribute relevance analysis. The general idea behind attribute relevance analysis is to compute
some measure that is used to quantify the relevance of an attribute with respect to a given class or
concept. Such measures include information gain, the Gini index, uncertainty, and correlation
coefficients. Here we introduce a method that integrates an information gain analysis technique with a
dimension-based data analysis method. The resulting method removes the less informative attributes,
collecting the more informative ones for use in concept description analysis.
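Information gain, one of the measures named above, can be computed as in the following Python sketch. The car records are invented to mirror the earlier automobile example, and the function names entropy and information_gain are illustrative rather than part of any particular system.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, class_attribute):
    """Entropy of the class labels minus the expected entropy after splitting on attribute."""
    labels = [row[class_attribute] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row[class_attribute])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

cars = [
    {"color": "red", "cylinders": 8, "price_class": "expensive"},
    {"color": "blue", "cylinders": 8, "price_class": "expensive"},
    {"color": "red", "cylinders": 4, "price_class": "cheap"},
    {"color": "blue", "cylinders": 4, "price_class": "cheap"},
]
gains = {a: information_gain(cars, a, "price_class") for a in ("color", "cylinders")}
ranked = sorted(gains, key=gains.get, reverse=True)
print(gains, ranked)
```

In this toy data color has zero gain and cylinders has a gain of 1 bit, so a relevance threshold anywhere between those values would drop color and keep cylinders.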
Data Collection:
Collect data for both the target class and the contrasting class by query processing. For class
comparison, the user provides both the target class and the contrasting class in the data mining query.
For class characterization, the target class is the class to be characterized, whereas the contrasting
class is the set of comparable data that are not in the target class.
This step identifies a set of dimensions and attributes on which the selected relevance measure is to
be applied. Since different levels of a dimension may have dramatically different relevance with
respect to a given class, each attribute defining the conceptual levels of the dimension should be
included in the relevance analysis in principle. Attribute-oriented induction (AOI) can be used to
perform some preliminary relevance analysis on the data by removing or generalizing attributes
having a very large number of distinct values (such as name and phone#). Such attributes are unlikely
to be found useful for concept description. To be conservative, the AOI performed here should
employ attribute generalization thresholds that are set reasonably large so as to allow more (but not
all) attributes to be considered in further relevance analysis by the selected measure (Step 3 below).
The relation obtained by such an application of AOI is called the candidate relation of the mining task.
Introduction: In many applications, users may not be interested in having a single class (or
concept) described or characterized, but rather would prefer to mine a description that
compares or distinguishes one class (or concept) from other comparable classes (or
concepts). Class discrimination or comparison (hereafter referred to as class comparison)
mines descriptions that distinguish a target class from its contrasting classes. Notice that the
target and contrasting classes must be comparable in the sense that they share similar
dimensions and attributes. For example, the three classes, person, address, and item, are not
comparable.
However, the sales in the last three years are comparable classes, and so are computer science
students versus physics students. Our discussions on class characterization in the previous
sections handle multilevel data summarization and characterization in a single class. The
techniques developed can be extended to handle class comparison across several comparable
classes. For example, the attribute generalization process described for class characterization
can be modified so that the generalization is performed synchronously among all the classes
compared. This allows the attributes in all of the classes to be generalized to the same levels
of abstraction. Suppose, for instance, that we are given the All Electronics data for sales in
2003 and sales in 2004 and would like to compare these two classes. Consider the dimension
location with abstractions at the city, province or state, and country levels. Each class of data
should be generalized to the same location level. That is, they are synchronously all
generalized to either the city level, or the province or state level, or the country level. Ideally,
this is more useful than comparing, say, the sales in Vancouver in 2003 with the sales in the
United States in 2004 (i.e., where each set of sales data is generalized to a different level).
The users, however, should have the option to override such an automated, synchronous
comparison.
1. Data collection: The set of relevant data in the database is collected by query processing
and is partitioned respectively into a target class and one or a set of contrasting class(es).
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance
analysis should be performed on these classes to select only the highly relevant dimensions
for further analysis. Correlation or entropy-based measures can be used for this step (Chapter
2).
3. Synchronous generalization: Generalization is performed on the target class to the level
controlled by a user- or expert-specified dimension threshold, which results in a prime target
class relation. The concepts in the contrasting class(es) are generalized to the same level as
those in the prime target class relation, forming the prime contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class comparison description can
be visualized in the form of tables, graphs, and rules. This presentation usually includes a
“contrasting” measure such as count% (percentage count) that reflects the comparison
between the target and contrasting classes. The user can adjust the comparison description by
applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes,
as desired.
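Steps 3 and 4 can be illustrated with a short Python sketch that generalizes a target class and a contrasting class through the same mapping and then reports count% for each. The sales records, the city-to-country mapping, and the function name compare_classes are hypothetical, and count% is taken here to mean the share of tuples within each class.

```python
from collections import Counter

# Hypothetical concept hierarchy for the location dimension: city -> country
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def compare_classes(target, contrasting, mapping):
    """Generalize both classes to the same level and return count% per concept.

    Returns {concept: (target count%, contrasting count%)}.
    """
    target_counts = Counter(mapping[v] for v in target)
    contrast_counts = Counter(mapping[v] for v in contrasting)
    comparison = {}
    for concept in sorted(set(target_counts) | set(contrast_counts)):
        comparison[concept] = (
            100.0 * target_counts[concept] / len(target),
            100.0 * contrast_counts[concept] / len(contrasting),
        )
    return comparison

sales_2003 = ["Vancouver", "Toronto", "Chicago", "Chicago"]
sales_2004 = ["Chicago", "Chicago", "Chicago", "Toronto"]
comparison = compare_classes(sales_2003, sales_2004, CITY_TO_COUNTRY)
print(comparison)
```

Because both classes pass through the same mapping, the two percentages in each pair refer to the same level of abstraction, which is the point of synchronous generalization.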
The above discussion outlines a general algorithm for mining comparisons in databases. In
comparison with characterization, the above algorithm involves synchronous generalization
of the target class with the contrasting classes, so that classes are simultaneously
compared at the same levels of abstraction.
Example
Task: compare graduate and undergraduate students using a discriminant rule.
use University_Database
mine comparison as "graduate_students_vs_undergraduate_students"
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student