Data Mining-2-1

The document discusses data mining primitives and query languages. It describes the components of a data mining query, including the data to mine, type of knowledge sought, background knowledge, and interestingness measures. It then provides the syntax for the Data Mining Query Language (DMQL) to specify these components. Finally, it covers data generalization, summarization-based characterization, and using data cubes and attribute-oriented induction for abstraction and characterization.

Uploaded by

SOORAJ CHANDRAN


Unit II

DATA MINING PRIMITIVES - QUERY LANGUAGE ARCHITECTURE OF DATA MINING SYSTEM - DATA GENERALIZATION AND SUMMARIZATION BASED CHARACTERIZATION - ANALYTICAL CHARACTERIZATION - MINING CLASS COMPARISONS

Data Mining Task Primitives:

Each user will have a data mining task in mind, that is, some form of data analysis that he or she
would like to have performed. A data mining task can be specified in the form of a data mining
query, which is input to the data mining system. A data mining query is defined in terms of data
mining task primitives. These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process or examine the findings from
different angles or depths. The data mining primitives specify the following, as illustrated in
Figure 1.13.
The set of task-relevant data to be mined:
This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (referred to as the
relevant attributes or dimensions).
The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.
The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and for evaluating the patterns found. Concept hierarchies are a popular form of
background knowledge, which allow data to be mined at multiple levels of abstraction. An example
of a concept hierarchy for the attribute (or dimension) age is shown in Figure 1.14. User beliefs
regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation:
They may be used to guide the mining process or, after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different interestingness measures. For example,
interestingness measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.
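The five primitives above can be pictured as the fields of a single query object. The following is only a sketch: the class `MiningQuery`, its field names, and the `problems` check are hypothetical names chosen for illustration and are not part of any data mining system's API.

```python
from dataclasses import dataclass, field

# Kinds of knowledge named in the text above (hypothetical spelling of the labels).
KNOWN_KINDS = {"characterization", "discrimination", "association", "classification",
               "prediction", "clustering", "outlier analysis", "evolution analysis"}

@dataclass
class MiningQuery:
    """Hypothetical container holding the five task primitives."""
    relevant_attributes: list                                   # task-relevant data
    knowledge_kind: str                                         # kind of knowledge to be mined
    background_knowledge: dict = field(default_factory=dict)    # e.g. concept hierarchies
    thresholds: dict = field(default_factory=dict)              # interestingness measures
    display_as: str = "table"                                   # expected representation

    def problems(self):
        """Return a list of human-readable problems; empty if the query looks usable."""
        issues = []
        if not self.relevant_attributes:
            issues.append("no task-relevant attributes")
        if self.knowledge_kind not in KNOWN_KINDS:
            issues.append("unknown kind of knowledge: " + self.knowledge_kind)
        for name, value in self.thresholds.items():
            if not 0.0 <= value <= 1.0:
                issues.append("threshold out of range: " + name)
        return issues

query = MiningQuery(
    relevant_attributes=["age", "item_type"],
    knowledge_kind="association",
    background_knowledge={"age": "age_hierarchy"},
    thresholds={"support": 0.05, "confidence": 0.7},
)
print(query.problems())
```

Keeping the primitives together in one object mirrors the way a single query states all of them at once.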
Data Mining Query Language
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the
DBMiner data mining system. It is based on the Structured Query Language (SQL) and is
designed to support ad hoc and interactive data mining. DMQL provides commands for
specifying each of the task primitives, and it can work with both databases and data
warehouses. DMQL can be used to define data mining tasks as well as data warehouses
and data marts.

Syntax for Task-Relevant Data Specification

Here is the syntax of DMQL for specifying task-relevant data −
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Syntax for Specifying the Kind of Knowledge
Here we will discuss the syntax for Characterization, Discrimination, Association,
Classification, and Prediction.

Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, a description of customer purchasing habits can be mined with −
mine characteristics as customerPurchasing
analyze count%

Discrimination
The syntax for Discrimination is −
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i
where contrast_condition_i}
analyze {measure(s)}
For example, a user may define big spenders as customers who purchase items that cost
$100 or more on average, and budget spenders as customers who purchase items at
less than $100 on average. The mining of discriminant descriptions for customers from
each of these categories can be specified in DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
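To make the two class conditions concrete, here is a minimal Python sketch of how customers might be split into the target and contrasting classes by average item price. The function name `split_spenders` and the `(customer, price)` input layout are assumptions made for illustration; DMQL itself leaves this work to the mining system.

```python
from collections import defaultdict

def split_spenders(purchases, cutoff=100.0):
    """Partition customers into (big, budget) sets by average item price.

    purchases: iterable of (customer, price) pairs (hypothetical layout).
    """
    totals = defaultdict(lambda: [0.0, 0])   # customer -> [sum of prices, item count]
    for customer, price in purchases:
        totals[customer][0] += price
        totals[customer][1] += 1

    big, budget = set(), set()
    for customer, (total, n) in totals.items():
        # avg(I.price) >= $100 defines the target class; everything else is the contrast.
        (big if total / n >= cutoff else budget).add(customer)
    return big, budget

purchases = [("ann", 150), ("ann", 90), ("bob", 40), ("bob", 120), ("cy", 100)]
big, budget = split_spenders(purchases)
print(sorted(big), sorted(budget))
```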

Association
The syntax for Association is −
mine associations [as pattern_name]
{matching metapattern}
For example −
mine associations as buyingHabits
matching P(X: customer, W) ^ Q(X, Y) ⇒ buys(X, Z)
where X is the key of the customer relation; P and Q are predicate variables; and W, Y, and Z
are object variables.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns that classify customer credit rating, where the classes are
determined by the attribute credit_rating and the pattern is named
classifyCustomerCreditRating −
mine classification as classifyCustomerCreditRating
analyze credit_rating

Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}

Syntax for Concept Hierarchy Specification

To specify concept hierarchies, use the following syntax −
use hierarchy <hierarchy> for <attribute_or_dimension>
Different syntaxes are used to define different types of hierarchies −
- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]

- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

- operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)

- rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
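As a rough illustration of how the set-grouping and rule-based hierarchies map raw values to higher-level concepts, the sketch below expresses them as plain Python functions. The function names are hypothetical. Two choices are assumptions: margins above $250 are treated as high, and a margin of exactly $50 is assigned to medium because the rules as written leave that value uncovered.

```python
def age_group(age):
    """Set-grouping hierarchy: map an age to its level-1 concept, or None if uncovered."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None   # ages outside 20..89 are not covered by the hierarchy

def profit_margin_level(price, cost):
    """Rule-based hierarchy: map (price - cost) to a profit margin concept."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:            # assumption: exactly $50 falls here
        return "medium_profit_margin"
    return "high_profit_margin"  # assumption: anything above $250

print(age_group(25), age_group(59), age_group(95))
print(profit_margin_level(100, 60), profit_margin_level(400, 100))
```

Climbing a hierarchy in this way is what lets the same data be mined at several levels of abstraction.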
Syntax for Interestingness Measures Specification
Interestingness measures and thresholds can be specified by the user with the statement −
with <interest_measure_name> threshold = threshold_value
For Example −
with support threshold = 0.05
with confidence threshold = 0.7
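The two thresholds above can be checked with a few lines of Python. This is a minimal sketch using the standard definitions: support is the fraction of transactions containing both sides of the rule, and confidence is that count divided by the number of transactions containing the antecedent. The function names and the sample transactions are made up for illustration.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    ante = sum(1 for t in transactions if antecedent <= set(t))
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

def is_interesting(support, confidence, min_support=0.05, min_confidence=0.7):
    """A rule is kept only if it meets both user-specified thresholds."""
    return support >= min_support and confidence >= min_confidence

transactions = [{"pc", "printer"}, {"pc"}, {"pc", "printer", "ink"}, {"ink"}]
s, c = support_confidence(transactions, {"pc"}, {"printer"})
print(s, c, is_interesting(s, c))
```

In this sample the rule pc => printer has ample support but its confidence of about 0.67 falls just under the 0.7 threshold, so it would be discarded.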
Syntax for Pattern Presentation and Visualization Specification
The following syntax allows users to specify the display of discovered patterns in one
or more forms −
display as <result_form>
For Example −
display as table
Full Specification of DMQL
As a marketing manager of a company, you would like to characterize the buying habits of
customers who can purchase items priced at no less than $100, with respect to the
customer's age, the type of item purchased, and the place where the item was purchased. You
would like to know the percentage of customers having that characteristic. In particular, you
are only interested in purchases made in Canada and paid for with an American Express
credit card. You would like to view the resulting descriptions in the form of a table.
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age,I.type,I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table

Data Generalization In Data Mining – Summarization Based Characterization

What is Concept Description?

• Descriptive vs. predictive data mining

Descriptive mining: describes concepts or task-relevant data sets in concise, summarative,
informative, and discriminative forms.

Predictive mining: based on data and analysis, constructs models for the database and predicts
the trend and properties of unknown data.

• Concept description

Characterization: provides a concise and succinct summarization of the given collection of data.

Comparison: provides descriptions comparing two or more collections of data.

Concept Description vs. OLAP

Concept description:

- Can handle complex data types of the attributes and their aggregations.

- A more automated process.

OLAP:

- Restricted to a small number of dimension and measure types.

- User-controlled process.

Data Generalization and Summarization-based Characterization

Data generalization:

A process that abstracts a large set of task-relevant data in a database from a low
conceptual level to higher ones.

Approaches:

• Data cube approach (OLAP approach)

• Attribute-oriented induction approach

Characterization: Data Cube Approach

Perform computations and store results in data cubes.

Strengths:
● An efficient implementation of data generalization
● Computation of various kinds of measures, e.g., count( ), sum( ), average( ), max( )
● Generalization and specialization can be performed on a data cube by roll-up and
drill-down

Limitations:

– Handles only dimensions of simple nonnumeric data and measures of simple aggregated
numeric values.

– Lacks intelligent analysis: it cannot tell which dimensions should be used or what level
the generalization should reach.
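A roll-up on a data cube can be sketched as re-aggregating stored cell counts at a higher level of a dimension. The Python below is only an illustration under assumed names: the `CITY_TO_COUNTRY` mapping, the `(city, item_type)` cell layout, and the sample counts are all hypothetical.

```python
from collections import Counter

# Hypothetical one-step concept hierarchy for the location dimension.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up(cells, mapping):
    """Roll the first dimension of each cell up one level, summing the count measure."""
    rolled = Counter()
    for (city, item_type), count in cells.items():
        rolled[(mapping[city], item_type)] += count
    return dict(rolled)

cells = {
    ("Vancouver", "TV"): 3,
    ("Toronto", "TV"): 2,
    ("Chicago", "TV"): 4,
    ("Toronto", "phone"): 1,
}
print(roll_up(cells, CITY_TO_COUNTRY))
```

Drill-down is the reverse direction and needs the finer-grained cells to still be available, which is why cubes store results at several levels.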

Attribute-Oriented Induction

• Proposed in 1989 (KDD '89 workshop)

• Not confined to categorical data or particular measures.

• How is it done?

● Collect the task-relevant data (initial relation) using a relational database query.
● Perform generalization by attribute removal or attribute generalization.
● Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts.
● Present the results interactively to users.
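The steps listed above can be sketched in Python as follows. This is a simplified illustration, not the published algorithm: it climbs each hierarchy only one level, uses a single threshold for all attributes, and the function name, the row layout, and the sample hierarchy are assumptions.

```python
from collections import Counter

def attribute_oriented_induction(rows, hierarchies, threshold):
    """One-pass AOI sketch over a list of dict rows.

    hierarchies: attribute -> function mapping a value to its parent concept.
    threshold:   maximum number of distinct values an attribute may keep as-is.
    """
    if not rows:
        return Counter()
    plan = []                                   # (attribute, generalizer or None)
    for attr in rows[0]:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan.append((attr, None))           # few values: keep unchanged
        elif attr in hierarchies:
            plan.append((attr, hierarchies[attr]))   # attribute generalization
        # otherwise: attribute removal (many values, no hierarchy)

    counts = Counter()
    for row in rows:
        key = tuple((attr, gen(row[attr]) if gen else row[attr]) for attr, gen in plan)
        counts[key] += 1                        # merge identical generalized tuples
    return counts

rows = [
    {"name": "Ann", "major": "CS", "age": 23},
    {"name": "Bob", "major": "CS", "age": 27},
    {"name": "Cy", "major": "Physics", "age": 24},
    {"name": "Di", "major": "Physics", "age": 45},
]
hierarchies = {"age": lambda a: "20-39" if a < 40 else "40+"}
summary = attribute_oriented_induction(rows, hierarchies, threshold=2)
for generalized_tuple, count in summary.items():
    print(dict(generalized_tuple), count)
```

Here name is removed (four distinct values, no hierarchy), major is kept, and age is generalized, so four rows collapse into three generalized tuples with counts.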
Analytical Characterization : Analysis of Attribute Relevance

Introduction

“What if I am not sure which attributes to include for class characterization and class comparison? I
may end up specifying too many attributes, which could slow down the system considerably.”
Measures of attribute relevance analysis can be used to help identify irrelevant or weakly relevant
attributes that can be excluded from the concept description process. The incorporation of this
pre-processing step into class characterization or comparison is referred to as analytical
characterization or analytical comparison, respectively. This section describes a general method of
attribute relevance analysis and its integration with attribute-oriented induction.

The first limitation of class characterization for multidimensional data analysis in data warehouses
and OLAP tools is the handling of complex objects. The second limitation is the lack of an
automated generalization process: the user must explicitly tell the system which dimensions should be
included in the class characterization and to how high a level each dimension should be generalized.
In fact, the user must specify each step of generalization or specialization on any dimension.

Usually, it is not difficult for a user to instruct a data mining system regarding how high a level each
dimension should be generalized to. For example, users can set attribute generalization thresholds for
this, or specify which level a given dimension should reach, such as with the command “generalize
dimension location to the country level”. Even without explicit user instruction, a default value such
as 2 to 8 can be set by the data mining system, which would allow each dimension to be generalized
to a level that contains only 2 to 8 distinct values. If the user is not satisfied with the current level of
generalization, she can specify dimensions on which drill-down or roll-up operations should be
applied.

It is nontrivial, however, for users to determine which dimensions should be included in the analysis
of class characteristics. Data relations often contain 50 to 100 attributes, and a user may have little
knowledge regarding which attributes or dimensions should be selected for effective data mining. A
user may include too few attributes in the analysis, causing the resulting mined descriptions to be
incomplete. On the other hand, a user may introduce too many attributes for analysis (e.g., by
indicating “in relevance to *”, which includes all the attributes in the specified relations).

Methods should be introduced to perform attribute (or dimension) relevance analysis in order to
filter out statistically irrelevant or weakly relevant attributes, and retain or even rank the most relevant
attributes for the descriptive mining task at hand. Class characterization that includes the analysis of
attribute/dimension relevance is called analytical characterization. Class comparison that includes
such analysis is called analytical comparison.

Intuitively, an attribute or dimension is considered highly relevant with respect to a given class if it is
likely that the values of the attribute or dimension may be used to distinguish the class from others.
For example, it is unlikely that the color of an automobile can be used to distinguish expensive from
cheap cars, but the model, make, style, and number of cylinders are likely to be more relevant
attributes. Moreover, even within the same dimension, different levels of concepts may have
dramatically different powers for distinguishing a class from others.

For example, in the birth_date dimension, birth_day and birth_month are unlikely to be relevant to the
salary of employees. However, the birth_decade (i.e., age interval) may be highly relevant to the
salary of employees. This implies that the analysis of dimension relevance should be performed at
multiple levels of abstraction, and only the most relevant levels of a dimension should be included in
the analysis.

Above we said that attribute/dimension relevance is evaluated based on the ability of the
attribute/dimension to distinguish objects of a class from others. When mining a class comparison (or
discrimination), the target class and the contrasting classes are explicitly given in the mining query.
The relevance analysis should be performed by comparison of these classes, as we shall see below.
However, when mining class characteristics, there is only one class to be characterized. That is, no
contrasting class is specified. It is therefore not obvious what the contrasting class should be for use in
relevance analysis. In this case, the contrasting class is typically taken to be the set of
comparable data in the database that excludes the set of data to be characterized. For example, to
characterize graduate students, the contrasting class can be composed of the set of undergraduate
students.

Methods of Attribute Relevance Analysis:

There have been many studies in machine learning, statistics, fuzzy and rough set theories, and so
on, on attribute relevance analysis. The general idea behind attribute relevance analysis is to compute
some measure that is used to quantify the relevance of an attribute with respect to a given class or
concept. Such measures include information gain, the Gini index, uncertainty, and correlation
coefficients. Here we introduce a method that integrates an information gain analysis technique with a
dimension-based data analysis method. The resulting method removes the less informative attributes,
collecting the more informative ones for use in concept description analysis.
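Information gain, one of the measures named above, can be sketched in a few lines of Python using its standard entropy-based definition: the entropy of the class labels minus the weighted entropy after partitioning by the attribute. The function names, the row layout, and the sample data are illustrative assumptions, and `rank_attributes` is just one plausible way to apply a relevance threshold.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, class_attr):
    """Entropy of the class minus the expected entropy after splitting on attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[class_attr])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[class_attr] for row in rows]) - remainder

def rank_attributes(rows, attributes, class_attr, threshold):
    """Sort attributes by gain and drop those below the relevance threshold."""
    gains = {a: information_gain(rows, a, class_attr) for a in attributes}
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    return [(a, g) for a, g in ranked if g >= threshold]

rows = [
    {"gender": "M", "gpa": "high", "status": "graduate"},
    {"gender": "F", "gpa": "high", "status": "graduate"},
    {"gender": "M", "gpa": "low", "status": "undergraduate"},
    {"gender": "F", "gpa": "low", "status": "undergraduate"},
]
print(rank_attributes(rows, ["gender", "gpa"], "status", threshold=0.1))
```

In this toy data gpa separates the two classes perfectly (gain of 1 bit) while gender carries no information (gain of 0), so only gpa survives the threshold.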

1. Data collection:

Collect data for both the target class and the contrasting class by query processing. For class
comparison, the user provides both the target class and the contrasting class in the data mining query.
For class characterization, the target class is the class to be characterized, whereas the contrasting
class is the set of comparable data that are not in the target class.

2. Preliminary relevance analysis using conservative AOI:

This step identifies a set of dimensions and attributes on which the selected relevance measure is to
be applied. Since different levels of a dimension may have dramatically different relevance with
respect to a given class, each attribute defining the conceptual levels of the dimension should, in
principle, be included in the relevance analysis. Attribute-oriented induction (AOI) can be used to
perform some preliminary relevance analysis on the data by removing or generalizing attributes
having a very large number of distinct values (such as name and phone#). Such attributes are unlikely
to be found useful for concept description. To be conservative, the AOI performed here should
employ attribute generalization thresholds that are set reasonably large so as to allow more (but not
all) attributes to be considered in further relevance analysis by the selected measure (Step 3 below).
The relation obtained by such an application of AOI is called the candidate relation of the mining task.

3. Remove irrelevant and weakly relevant attributes using the selected relevance analysis measure:

Evaluate each attribute in the candidate relation using the selected relevance analysis measure. The
relevance measure used in this step may be built into the data mining system or provided by the user.
For example, the information gain measure mentioned above may be used. The attributes are then
sorted (i.e., ranked) according to their computed relevance to the data mining task. Attributes that are
not relevant or are weakly relevant to the task are then removed. A threshold may be set to define
“weakly relevant.” This step results in an initial target class working relation and an initial
contrasting class working relation.

4. Generate the concept description using AOI:

Perform AOI using a less conservative set of attribute generalization thresholds. If the descriptive
mining task is class characterization, only the initial target class working relation is included here. If
the descriptive mining task is class comparison, both the initial target class working relation and the
initial contrasting class working relation are included. Note that in this procedure the induction
process is performed twice, that is, in preliminary relevance analysis (Step 2) and on the
initial working relation (Step 4). The statistics used in attribute relevance analysis with the selected
measure (Step 3) may be collected during the scanning of the database in Step 2.

Mining Class Comparisons: Discriminating Between Different Classes

Introduction: In many applications, users may not be interested in having a single class (or
concept) described or characterized, but rather would prefer to mine a description that
compares or distinguishes one class (or concept) from other comparable classes (or
concepts). Class discrimination or comparison (hereafter referred to as class comparison)
mines descriptions that distinguish a target class from its contrasting classes. Notice that the
target and contrasting classes must be comparable in the sense that they share similar
dimensions and attributes. For example, the three classes person, address, and item are not
comparable. However, the sales in the last three years are comparable classes, and so are
computer science students versus physics students.

Our discussions on class characterization in the previous
sections handle multilevel data summarization and characterization in a single class. The
techniques developed can be extended to handle class comparison across several comparable
classes. For example, the attribute generalization process described for class characterization
can be modified so that the generalization is performed synchronously among all the classes
compared. This allows the attributes in all of the classes to be generalized to the same levels
of abstraction. Suppose, for instance, that we are given the AllElectronics data for sales in
2003 and sales in 2004 and would like to compare these two classes. Consider the dimension
location with abstractions at the city, province or state, and country levels. Each class of data
should be generalized to the same location level. That is, they are all synchronously
generalized to either the city level, or the province or state level, or the country level. Ideally,
this is more useful than comparing, say, the sales in Vancouver in 2003 with the sales in the
United States in 2004 (i.e., where each set of sales data is generalized to a different level).
The users, however, should have the option to override such an automated, synchronous
comparison with their own choices, when preferred.

“How is class comparison performed?” In general, the procedure is as follows:

1. Data collection: The set of relevant data in the database is collected by query processing
and is partitioned respectively into a target class and one or a set of contrasting class(es).

2. Dimension relevance analysis: If there are many dimensions, then dimension relevance
analysis should be performed on these classes to select only the highly relevant dimensions
for further analysis. Correlation or entropy-based measures can be used for this step (Chapter 2).

3. Synchronous generalization: Generalization is performed on the target class to the level
controlled by a user- or expert-specified dimension threshold, which results in a prime target
class relation. The concepts in the contrasting class(es) are generalized to the same level as
those in the prime target class relation, forming the prime contrasting class(es) relation.

4. Presentation of the derived comparison: The resulting class comparison description can
be visualized in the form of tables, graphs, and rules. This presentation usually includes a
“contrasting” measure such as count% (percentage count) that reflects the comparison
between the target and contrasting classes. The user can adjust the comparison description by
applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes,
as desired.

The above discussion outlines a general algorithm for mining comparisons in databases. In
comparison with characterization, the above algorithm involves synchronous generalization
of the target class with the contrasting classes, so that classes are simultaneously
compared at the same levels of abstraction.
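Synchronous generalization and the count% contrasting measure can be sketched together in Python: both classes are passed through the same generalization function, and the percentage of each generalized value is reported side by side. The function names, the city-to-country mapping, and the sample sales rows are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical location hierarchy shared by both classes.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def count_percent(rows, generalize):
    """Generalize every row, merge identical results, and return count% per value."""
    counts = Counter(generalize(row) for row in rows)
    total = sum(counts.values())
    return {value: 100.0 * c / total for value, c in counts.items()} if total else {}

def compare_classes(target_rows, contrast_rows, generalize):
    """Apply the SAME generalization to both classes and pair up their count%."""
    target = count_percent(target_rows, generalize)
    contrast = count_percent(contrast_rows, generalize)
    return {value: (target.get(value, 0.0), contrast.get(value, 0.0))
            for value in sorted(set(target) | set(contrast))}

sales_2003 = [{"city": c} for c in ("Vancouver", "Toronto", "Chicago", "Chicago")]
sales_2004 = [{"city": c} for c in ("Vancouver", "Chicago", "Chicago", "Chicago")]
to_country = lambda row: CITY_TO_COUNTRY[row["city"]]
print(compare_classes(sales_2003, sales_2004, to_country))
```

Because one function generalizes both classes, the comparison can never pair a city-level figure with a country-level one.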

Example
Task - Compare graduate and undergraduate students using a discriminant rule.

For this, the DMQL query would be:

use University_Database
mine comparison as "graduate_students_vs_undergraduate_students"
in relevance to name, gender, program, birth_place, birth_date, residence, phone_no, GPA
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student

From this query, we can identify the following:

● attributes = name, gender, program, birth_place, birth_date, residence, phone_no, and GPA.
● Gen(a_i) = concept hierarchies on attributes a_i.
● U_i = attribute analytical thresholds for attributes a_i.
● T_i = attribute generalization thresholds for attributes a_i.
● R = attribute relevance threshold.

1. Data collection - Understanding the target and contrasting classes.

2. Attribute relevance analysis - Used to remove the attributes name, gender, program, and
phone_no.

3. Synchronous generalization - Controlled by user-specified dimension thresholds; yields the
prime target and contrasting class(es) relations/cuboids.

4. Drill-down, roll-up, and other OLAP operations on the target and contrasting classes to adjust
the levels of abstraction of the resulting description.

