05 DM BI Concept Description
Concept Description
(Continuation of Data Cube & OLAP)
Module 3
Created/Adopted/Modified for
Data Mining and Business Intelligence – MCA II Semester
Vidya Vikas Institute of Engineering & Technology
Mysore
2023-24
GPD
Concept Description
Concept description refers to the process of representing complex or
large datasets in a more concise and understandable manner, while
preserving their meaningful aspects.
The goal of concept description is to distill the essential
characteristics, patterns, and trends from the data, making it easier
for humans to interpret and make decisions based on the
summarized information.
The focus is on providing a high-level overview of the data's key
features, rather than presenting all the detailed data points.
This is particularly important when dealing with large datasets
that might be overwhelming to analyze in their raw form.
Concept Description
Data Generalization summarizes data by replacing relatively low-
level values (e.g., numeric values for an attribute age) with higher-
level concepts (e.g., young, middle-aged, and senior), or by reducing
the number of dimensions to summarize data in concept space
involving fewer dimensions (e.g., removing birth date and telephone
number when summarizing the behavior of a group of students).
Allowing data sets to be generalized at multiple levels of
abstraction facilitates users in examining the general behavior of
the data.
Concept description is a form of data generalization.
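The two forms of generalization described above can be sketched in a few lines of Python. This is only an illustration: the student records, the field names, and the age cut-offs (30 and 60) are assumptions made for the example, not values from the course material.

```python
from collections import Counter

# Hypothetical student records; every value here is invented sample data.
students = [
    {"age": 19, "birth_date": "2005-02-11", "phone": "555-0101", "major": "physics"},
    {"age": 22, "birth_date": "2002-07-30", "phone": "555-0102", "major": "physics"},
    {"age": 41, "birth_date": "1983-01-05", "phone": "555-0103", "major": "history"},
    {"age": 67, "birth_date": "1957-09-19", "phone": "555-0104", "major": "history"},
]


def age_group(age):
    """Map a numeric age to a higher-level concept (cut-offs are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


def generalize(record):
    """Keep only the generalized age and the major; drop birth_date and phone."""
    return (age_group(record["age"]), record["major"])


# Count how many students fall into each generalized description.
summary = Counter(generalize(r) for r in students)
for (group, major), count in summary.items():
    print(group, major, count)
```

Replacing age by an age group raises the level of abstraction, while dropping birth_date and phone reduces the number of dimensions.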
Concept Description
Concept Description generates descriptions for data characterization
and comparison.
It is sometimes called class description when the concept to be
described refers to a class of objects.
Data Characterization provides a concise and clear summarization
of the given data collection.
Concept Comparison or Class Comparison (also known as
discrimination) provides descriptions comparing two or more data
collections.
Concept Description
We have studied data cube (or OLAP) approaches to concept
description using multidimensional, multilevel data generalization in
data warehouses. But, the question is,
“Is data cube technology sufficient to accomplish all kinds of
concept description tasks for large data sets?”
There are limitations...
Concept Description
“Is data cube technology sufficient to accomplish all kinds of concept
description tasks for large data sets?”
1. Complex data types and aggregation: many current OLAP systems
restrict dimensions to non-numeric data and measures to numeric data.
In reality, the database can include attributes of various data types,
including numeric, non-numeric, spatial, text, or image, which
ideally should be included in the concept description.
Concept Description
“Is data cube technology sufficient to accomplish all kinds of concept
description tasks for large data sets?”
2. The selection of dimensions and the application of OLAP
operations (e.g., drill-down, roll-up, slicing, and dicing) are primarily
directed and controlled by users.
This means users need a good understanding of the role of each
dimension.
There's a need for more automated approaches that assist users in
selecting dimensions and determining the appropriate level of data
generalization for meaningful summarization.
So, we will study an alternate method for Concept Description.
Attribute-Oriented Induction
Attribute-Oriented Induction is an alternative method for concept
description, which works for complex data types and relies on a
data-driven generalization process.
The data cube approach is based on materialized views of the data,
which typically have been precomputed in a data warehouse.
In general, it performs offline aggregation before an OLAP or data
mining query is submitted for processing.
On the other hand, the attribute-oriented induction approach is
basically a query-oriented, generalization-based, online data
analysis technique.
Attribute-Oriented Induction – the Idea
First collect the task-relevant data using a database query and then
perform generalization based on the examination of the number of
each attribute’s distinct values in the relevant data set.
The generalization is performed by either attribute removal or
attribute generalization.
Aggregation is performed by merging identical generalized tuples
and accumulating their respective counts.
This reduces the size of the generalized data set.
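The aggregation step can be sketched as follows. The tuples, their attribute order, and the helper names distinct_counts and merge_identical are hypothetical, chosen only to illustrate the idea of examining distinct values and then merging identical generalized tuples.

```python
from collections import Counter

# Hypothetical task-relevant tuples after generalization:
# (major group, birth country, age range, gpa label). Invented sample data.
generalized = [
    ("science", "Canada", "20-24", "very good"),
    ("science", "Canada", "20-24", "very good"),
    ("engineering", "Canada", "20-24", "excellent"),
    ("science", "Foreign", "25-29", "excellent"),
    ("science", "Canada", "20-24", "very good"),
]


def distinct_counts(rows):
    """Number of distinct values found in each attribute position."""
    return [len(set(column)) for column in zip(*rows)]


def merge_identical(rows):
    """Merge identical generalized tuples and append the accumulated count."""
    counts = Counter(rows)
    return [row + (count,) for row, count in counts.items()]


merged = merge_identical(generalized)
for row in merged:
    print(row)
```

Five input tuples collapse into three generalized tuples, each carrying a count, which is the size reduction the slide refers to.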
[Example: the BigUniversity database]
Attribute-Oriented Induction
“Now that the data are ready for attribute-oriented induction, how is
attribute-oriented induction performed?”
The essential operation of attribute-oriented induction is data
generalization, which can be performed in either of two ways on
the initial working relation:
attribute removal and
attribute generalization.
Attribute-Oriented Induction
Attribute removal is based on the following rule:
If there is a large set of distinct values for an attribute of the initial
working relation, but either (case 1) there is no generalization
operator on the attribute (e.g., there is no concept hierarchy
defined for the attribute), or (case 2) its higher-level concepts are
expressed in terms of other attributes, then the attribute should be
removed from the working relation.
name, phone#: Since there are a large number of distinct values for
name and phone#, and no generalization operation is defined on them,
these attributes are removed. (Case 1)
street (if any) is also removed, since its higher-level concepts are
expressed in terms of other attributes (city, state, etc.). (Case 2)
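The removal rule can be written as a small predicate. This is a sketch under stated assumptions: the function should_remove, the threshold of 2 distinct values, the set of attributes that have a hierarchy, and the sample rows are all hypothetical.

```python
def should_remove(attr, rows, hierarchies, expressed_elsewhere, threshold):
    """Apply the attribute removal rule to one attribute.

    hierarchies: attributes that have a generalization operator.
    expressed_elsewhere: attributes whose higher-level concepts are
    already carried by other attributes.
    threshold: assumed cut-off for "a large set of distinct values".
    """
    distinct = len({row[attr] for row in rows})
    if distinct <= threshold:
        return False  # not a large set of distinct values
    no_operator = attr not in hierarchies  # case 1
    covered = attr in expressed_elsewhere  # case 2
    return no_operator or covered


# Invented sample data for illustration only.
rows = [
    {"name": "Ann", "phone": "555-0101", "street": "1 Oak St", "city": "Mysore", "major": "physics"},
    {"name": "Bob", "phone": "555-0102", "street": "2 Elm St", "city": "Mysore", "major": "history"},
    {"name": "Cy", "phone": "555-0103", "street": "3 Ash St", "city": "Pune", "major": "CS"},
    {"name": "Di", "phone": "555-0104", "street": "4 Fir St", "city": "Delhi", "major": "physics"},
]
hierarchies = {"street", "city", "major"}
expressed_elsewhere = {"street"}  # city already carries the higher level

removed = [
    attr for attr in rows[0]
    if should_remove(attr, rows, hierarchies, expressed_elsewhere, threshold=2)
]
print(removed)
```

Here name and phone fall under case 1, street falls under case 2, and city and major are kept for attribute generalization.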
Attribute-Oriented Induction
Attribute generalization is based on the following rule:
If there is a large set of distinct values for an attribute in the initial
working relation, and there exists a set of generalization operators
on the attribute, then a generalization operator should be
selected and applied to the attribute.
This rule is based on the following reasoning.
Use of a generalization operator to generalize an attribute value
within a tuple, or rule, in the working relation will make the rule
cover more of the original data tuples, thus generalizing the
concept it represents.
Attribute-Oriented Induction
Attribute Generalisation
major: Suppose that a concept hierarchy has been defined that allows the
attribute major to be generalized to the values {arts&sciences, engineering,
business}.
birth place: This attribute has a large number of distinct values; therefore,
we would like to generalize it based on the concept hierarchy “city <
province or state < country.”
birth date: Generalized to age, and age in turn to an age range.
gpa: Can be generalised based on the concept hierarchy that groups values
for grade point average into numeric intervals like {3.75–4.0, 3.5–3.75, . . . },
which in turn are grouped into descriptive values such as {“excellent”,
“very good”, . . . }.
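A minimal sketch of climbing these concept hierarchies for a single tuple follows. The city-to-country table, the five-year age bands, the pairing of each gpa interval with a label, and the catch-all label "other" are assumptions made for illustration; the slide only names the hierarchies.

```python
from datetime import date

# Hypothetical fragment of the "city < province or state < country" hierarchy.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}


def age_on(birth, today):
    """Age in whole years on the given reference date."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def age_range(age, width=5):
    """Generalize an age to a band such as '20-24' (band width is assumed)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def gpa_label(gpa):
    """Map a gpa to a descriptive value; a gpa of exactly 3.75 is
    assigned to the upper interval (an assumption)."""
    if gpa >= 3.75:
        return "excellent"
    if gpa >= 3.5:
        return "very good"
    return "other"  # placeholder for the lower levels not listed on the slide


def generalize(row, today):
    """Apply one generalization operator to each attribute of a tuple."""
    return {
        "birth_place": CITY_TO_COUNTRY[row["birth_place"]],
        "age_range": age_range(age_on(row["birth_date"], today)),
        "gpa": gpa_label(row["gpa"]),
    }


row = {"birth_place": "Vancouver", "birth_date": date(2000, 6, 15), "gpa": 3.6}
result = generalize(row, today=date(2024, 1, 1))
print(result)
```

Passing the reference date explicitly keeps the example deterministic; a real query would typically use the current date.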
Attribute-Oriented Induction
Class Comparison
In many applications, users may not be interested in having a single
class (or concept) described or characterized, but prefer to mine a
description that compares or distinguishes one class (or concept)
from other comparable classes (or concepts).
Class discrimination or comparison (hereafter referred to as class
comparison) mines descriptions that distinguish a target class
from its contrasting classes.
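The target and contrasting classes must be comparable, that is, they
must share similar dimensions and attributes.
For example, the three classes person, address, and item are not
comparable.
However, sales in the last three years are comparable classes, and so
are, for example, computer science students versus physics students.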
Class Comparison – General Procedure
1. Data collection: The set of relevant data in the database is
collected by query processing and is partitioned respectively into a
target class and one or a set of contrasting classes.
2. Dimension relevance analysis: If there are many dimensions, then
dimension relevance analysis should be performed on these classes
to select only the highly relevant dimensions for further analysis.
Correlation or entropy-based measures can be used for this step.
Class Comparison – General Procedure
3. Synchronous generalization: Generalization is performed on the target class
to the level controlled by a user- or expert-specified dimension threshold,
which results in a prime target class relation. The concepts in the contrasting
class(es) are generalized to the same level as those in the prime target class
relation, forming the prime contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class comparison
description can be visualized in the form of tables, graphs, and rules. This
presentation usually includes a “contrasting” measure such as count%
(percentage count) that reflects the comparison between the target and
contrasting classes. The user can adjust the comparison description by
applying drill-down, roll-up, and other OLAP operations to the target and
contrasting classes, as desired.
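Steps 1, 3, and 4 of the procedure can be sketched as below, using computer science students as the target class and physics students as the contrasting class. The records, the band boundaries, and the helper names generalize and count_percent are hypothetical, and dimension relevance analysis (step 2) is skipped because only two dimensions are used.

```python
from collections import Counter


def generalize(student):
    """Generalize both classes to the same level (synchronous generalization)."""
    age_band = "under 25" if student["age"] < 25 else "25 and over"
    if student["gpa"] >= 3.75:
        gpa_band = "excellent"
    elif student["gpa"] >= 3.5:
        gpa_band = "very good"
    else:
        gpa_band = "other"
    return (age_band, gpa_band)


def count_percent(rows):
    """count%: share of a class's tuples covered by each generalized tuple."""
    counts = Counter(generalize(r) for r in rows)
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}


# Step 1: data already partitioned into the two classes (invented sample data).
target = [  # computer science students
    {"age": 22, "gpa": 3.8},
    {"age": 23, "gpa": 3.9},
    {"age": 27, "gpa": 3.6},
    {"age": 21, "gpa": 3.2},
]
contrasting = [  # physics students
    {"age": 24, "gpa": 3.8},
    {"age": 26, "gpa": 3.6},
]

# Steps 3 and 4: generalize both classes, then tabulate count% side by side.
target_pct = count_percent(target)
contrast_pct = count_percent(contrasting)
comparison = [
    (key, target_pct.get(key, 0.0), contrast_pct.get(key, 0.0))
    for key in sorted(set(target_pct) | set(contrast_pct))
]
for key, t, c in comparison:
    print(key, t, c)
```

Each row of comparison pairs one generalized tuple with its count% in the target class and in the contrasting class, which is the kind of contrasting measure step 4 describes.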
Summary – Concept Description
Data generalization is a process that abstracts a large set of task-
relevant data in a database from a relatively low conceptual level to
higher conceptual levels.
Data generalization approaches include data cube-based data
aggregation and attribute-oriented induction.
Concept description is the most basic form of descriptive data
mining.
It describes a given set of task-relevant data in a concise,
summarized manner, presenting interesting general properties of
the data.
Summary – Concept Description
Concept (or class) description consists of characterization and
comparison (or discrimination).
Concept Characterization
Summarizes and describes a data collection, called the target class
Concept Comparison (or discrimination)
Summarizes and distinguishes one data collection, called the
target class, from other data collection(s), collectively called the
contrasting class(es).
Summary – Concept Description
Concept characterization can be implemented using
data cube (OLAP-based) approaches and
the attribute-oriented induction approach.